Comparing and Improving Inter-Rater Reliability for Job References

Info: 8617 words (34 pages) Dissertation
Published: 9th Dec 2019

Introduction

The job reference is one of the most popular tools for job selection (Chamorro-Premuzic & Furnham, 2010; Muchinsky, 1979), second only to the interview (Furnham, 2017). It is estimated that approximately 96% of organisations use job references in their hiring decisions (Burke, 2005). Despite their popularity, job references are criticised for their susceptibility to myriad forms of bias (Aamodt, 2006; Furnham, 2017) and their weak predictive validity (Aamodt & Williams, 2005; Hunter & Hunter, 1984; Furnham, 2017; Reilly & Chao, 1982). While some studies have established the impact of bias on job references, and even more have examined their predictive validity, few have empirically investigated the reliability of job references, particularly that between raters. In the research available, the inter-rater reliability of letters of reference (LORs) tends to be low as a result of various biases including leniency, lack of information, or memory retrieval. As such, the goal of this study is to develop alternate job reference formats and compare their inter-rater reliability. 4 job reference formats will be compared: 1) the letter of reference (LOR), 2) the multi-item, 3) the relative percentile method (RPM), and 4) the global KSAO ranking (GKR) formats.

The Letter of Reference

The LOR is defined as a letter expressing an opinion regarding a candidate’s ability, previous performance, work habits, character, or potential for future success (Aamodt, 2006). The content of LORs may vary, but generally includes the referee’s assessment of the candidate’s knowledge, skills, abilities, and other characteristics (KSAOs) and personality. In general, the LOR is used to confirm details from other sources of information, to check for disciplinary issues and other ‘red flags’, and to gain insight into the applicant’s character from a variety of sources that presumably had substantial opportunity to interact with the applicant in different relevant life domains (Aamodt, Nagy, & Thomson, 1998). Although LORs have the potential to contribute unique insight into an applicant, it is widely known that they play host to myriad sources of bias that often exaggerate the job applicant’s KSAOs, character, and/or experience (Aamodt, 2006). As an evaluative process, job references are subject to the various cognitive biases of raters assessing the subject (Aamodt et al, 1998; Judge & Higgins, 1998; Wilson, 1948), for example memory retrieval (Wherry & Bartlett, 1982), racial and gender stereotyping (Cotton, O’Neill & Griffin, 2008), halo effects (Aamodt et al, 1998), or affective disposition (Feldman, 1981; Judge & Higgins, 1998). In particular, leniency bias (Aamodt, 1999; Judge & Higgins, 1998) and self-selecting referees (Aamodt, Bryan, & Whitcomb, 1993) are highlighted as the biggest culprits. Despite the popularity of the LOR as a selection tool, its reliability and effectiveness have been criticised for decades (Furnham, 2017; Judge & Higgins, 1998; Muchinsky, 1979), and its ability to predict a candidate’s performance is largely regarded as poor (Reilly & Chao, 1982; Furnham, 2017; Hunter & Hunter, 1984).

Biases, Inter-Rater Reliability, & Predictive Validity

Although the limitations of LORs are widely known, there are few empirical studies on their reliability and validity despite repeated calls for more research (e.g. Judge & Higgins, 1998; Muchinsky, 1979; Smith & George, 1992). One potential factor contributing to the LOR’s low predictive validity is the subpar inter-rater reliability between referees (Furnham, 2017). Baxter et al (1981) found that referees often provide highly personal information that is not known to other referees. Mosel & Goheen (1959) found an r_wg value of .40 in their study on the similarity of LORs. Although this may seem acceptable, the inter-rater reliability is likely higher than normal, as the writers were all previous supervisors of the candidates and the authors did not control for the detail, favourability, or length of the references. It may be the case that short references are rated very similarly simply because they are vague and do not provide the nuanced details that would differentiate them. When soliciting references for job selection, some discriminant reliability may be preferable as it provides evidence from a variety of perspectives; however, the degree of freedom writers have in writing references means low inter-rater reliability may be “more directly a function of letter writers’ idiosyncrasies than of the subjects’ qualities” (Baxter et al, 1981, pg 300). In other words, the low inter-rater reliability may be due more to the various biases originating in the writer than to the variety of perspectives gained by choosing references from different domains of the subject’s life. For example, Aamodt & Williams (2005) found an r_wg of only .22 between LORs for the same candidate. In fact, they found more agreement between references written by the same person for different candidates than between references written for the same candidate.

Baxter et al’s suggestion raises a second strongly established point in the job reference literature: the vast majority of LORs are extremely favourable to the applicant (Browning, 1968; Judge & Higgins, 1998; Knouse, 1983). Less than 1% of references rate candidates as below average (Aamodt & Williams, 2005). This creates a range restriction effect whereby all candidates, regardless of real competencies, are rated nearly the same in the traditional LOR. This leniency bias is cited as one of the leading reasons for the generally low reliability and validity of the traditional job reference (Aamodt et al, 1998; Aamodt, 2005; Judge & Higgins, 1998; Muchinsky, 1979). For some referees, including any negative or qualifying comments in a reference is seen as sabotaging the candidate’s chances of finding a new job (Judge & Higgins, 1998). Other referees may have a hidden objective or motivation for providing the reference they did, for example inflating their reference in order to push a problem employee onto someone else (Furnham, 2017). Still others are concerned about legal ramifications if the reference is interpreted as libel (Ryan & Laser, 1991), and some reference writers may use contextual characteristics of the letter, such as tone or structure, to subtly communicate information to the reader that is not made explicit in the content (Hazer et al, 1997).

These biases are not only complicit in the low reliability of job references; they are also a major factor in their low predictive validity (Aamodt et al, 1998; Furnham, 2017). The maximum correlation of the job reference with any criterion variable has an upper limit determined by its internal reliability (Howitt & Cramer, 2005). In other words, the degree of error from various biases in job references limits how much they can possibly correlate with any outcome variable such as job performance or turnover intentions. As a result, the predictive validity for job references has a strong ceiling effect and a restricted range, which attenuates validity estimates. There have been a few studies on the predictive validity of the LOR (e.g. Aamodt et al, 1993; Aamodt & Williams, 2005; Browning, 1968; Carroll & Nash, 1972; Mosel & Goheen, 1959; Myers & Errett, 1959). Beyond these single studies, 4 meta-analyses have been conducted on the predictive validity of the LOR, with overall R² values ranging from .13 to .26 and an average of .16 (Aamodt & Williams, 2005; Hunter & Hunter, 1984; Reilly & Chao, 1982), meaning it is one of the weakest predictors of job performance.
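
The reliability ceiling described above is commonly written, in classical test theory, as the inequality below. The symbols are illustrative rather than taken from the studies cited: r_xy is the observed correlation between the reference score and the criterion, r_xx is the reliability of the reference score, and r_yy is the reliability of the criterion.

```latex
r_{xy} \le \sqrt{r_{xx}\, r_{yy}}
```

As a purely illustrative calculation, a reference score with a reliability of .40 and a perfectly reliable criterion (r_yy = 1) could correlate at most \sqrt{.40} \approx .63 with that criterion, and the ceiling falls further once the criterion itself is measured with error.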

Alternatives to the Letter of Reference

These meta-analytic findings raise the question of what can be done better. Various studies have generated different methods of increasing the predictive validity of the LOR. Knouse (1994) suggests developing more detailed and behaviourally anchored LORs, and in addition suggests including a job description for the job being sought as well as a prompt to keep the LOR job-relevant. Surprisingly, there is also evidence that including some negative comments not only increases the predictive validity of LORs but may also improve the perceived employability of the candidate, as the reference is seen as more accurate (Knouse, 1983). In short, more detailed, comprehensive, structured, and neutral LORs are likely to have better predictive and face validity (Aamodt et al, 1998; Aamodt, 2006).

Despite the various ways of improving the LOR, Aamodt et al (2006) also found that although the value added by high-quality LORs for predictive validity is significant, the impact is limited. Instead, adopting an item-response format may have a larger effect. Supporting this, making the switch to structured reference formats appears to increase the predictive validity of the job reference (Aamodt et al, 1993; Aamodt, 2006; McCarthy & Goffin, 2001). McCarthy & Goffin (2001) were the first to test the predictive validity of 2 alternative forms of job references: the Relative Percentile Method and Global Trait Rankings formats. They found that the RPM format had substantially higher predictive validity (R² = .18) than the LOR; however, the global trait rankings and multi-item response formats were not found to be significant predictors of job performance.

This Study

While there is support that alternative forms of the job reference may have stronger predictive validity, to date no study has compared the reliability of different job reference formats. In fact, only 2 studies have empirically evaluated the inter-rater reliability of LORs (Aamodt & Williams, 2005; Mosel & Goheen, 1959). The purpose of this study was to develop 3 different standardised reference formats and compare their inter-rater reliability to traditional LORs. A standardised format may be beneficial as it structures the reference and provides a template all reference writers must conform to. Having structure may limit inconsistencies in level of detail, length, comprehensiveness, or leniency. In addition, having structure may also mitigate the effects that individual differences of the writer (i.e. affability, writing quality, honesty-humility) have on the evaluation of the reference.

4 different reference formats were examined. The first was the traditional LOR, which prompted the referees to provide their opinion on the applicant. The second was a structured, multi-item Likert variant of the LOR, which presented a series of items representing the KSAOs required for the position. The third was the RPM format, which included the same KSAO item statements but, instead of the Likert scale, asked the referee to place the applicant on a scale from 0 to 100 compared to their peer group. Finally, the GKR format asked referees to rank order KSAOs from most to least characteristic of the candidate within each performance domain (i.e. job knowledge, skills, personality). Items for the multi-item, RPM, and GKR formats were developed to reflect the KSAOs required for the job position based on the job description provided by O*NET. McCarthy & Goffin (2001) argue that the use of an RPM format may increase accuracy by providing meaningful behavioural reference points to anchor scores, as well as presumably widening the variance by expanding the range. While the McCarthy & Goffin (2001) study is the first and only one to investigate the use of RPM in job references, its advantages have been established in performance appraisal (Goffin, Gellatly, Paunonen, Jackson, & Meyer, 1996; Jelley & Goffin, 2001). In terms of the GKR format, the use of rank-order response formats may forcibly widen the variance, as referees cannot give the applicant highly favourable scores on each dimension. Further, the GKR format may also increase accuracy by prompting the referee to reflect more deeply on which statements are most similar to the applicant, rather than relying on immediate, surface-level schemas of the applicant.

Based on the literature available, 3 hypotheses were developed:

H₁) The GKR condition will have the highest inter-rater reliability for both samples, followed by the RPM, Multi-Item, and lastly the LOR conditions.

H_2a) ICC scores will be lower in the MTURK sample than the expert assessors for all conditions.

H_2b) The difference in ICC scores between the MTURK sample and the expert raters will be lowest in the GKR condition, followed by the RPM, Multi-Item, and LOR.

Methods

Participants:

The sample consisted of 600 participants recruited through Amazon Mechanical Turk (MTURK). The majority of participants were female (N = 376, 62%), and the average age was 34. Participants were limited to Canada and the USA to control for potential language proficiency or cultural effects. A second comparison sample of 5 expert assessors from a professional job assessment centre was also gathered. 3 of the assessors were female, and the average years of experience was 11.5.

Procedure:

First, the MTURK participants were randomly assigned to a reference format*vignette condition. Following this, the participants were given time to review the O*NET job description for an office manager. Once familiar with the job description, the participants were presented with a video vignette of 1 of 3 job candidates going through a job simulation exercise. The 3 video vignettes were selected in partnership with a local job assessment centre. The videos were of 3 different job candidates for the position of office manager who reflected strong, acceptable, and weak competency for the position. The candidates performed a series of job-relevant tasks and interacted with confederate clients, coworkers, subordinates, and supervisors. Since these videos were previously used for real-life job selection purposes, they sought to comprehensively cover the range of day-to-day tasks associated with an office manager position, and we were given access to the candidates’ performance evaluations from the assessment centre. Second, 5 professional assessors from the assessment centre were asked to complete the LOR, Multi-Item, RPM, and GKR format reference for each candidate. In total, each expert assessor developed 12 references, 1 of each reference format for each of the candidate conditions. The relative strength of the candidates was differentiated by their overall objective performance score provided by the assessment centre.

Once completed, ICC (1, k) analyses were computed to assess inter-rater reliability for each reference*vignette condition. To develop a quantitative performance score for the LOR reference condition, a panel of 12 expert raters reviewed the job description for office manager, then rated the performance potential of each candidate on a 3-item, 10-point Likert scale using the LORs written by the participants. No time limit was placed on writing the references; however, word count was measured to control for length. Once completed, references were kept for a future study. To simulate the high stakes of ‘real-world’ reference writing, the participants were asked to imagine the candidate was a friend and recent co-worker of theirs who needed the job.

Measures:

Letter of Reference:

Participants in the LOR condition were given a fillable text box and asked to respond to the following prompt: “Please tell us whether or not you believe X is a good fit for this position and why”. There were no time limits on how long participants could take to respond, and no prompts for further information or detail were provided.

Multi-Item, Relative Percentile, and Global KSAO Ranking formats:

The set of items in the Multi-Item, RPM, and GKR format conditions were developed on the basis of the various KSAOs listed as relevant in the O*NET job description for office manager. Each format consisted of 6 dimensions, with 4 to 5 items each. While the items were the same, each condition had a different response format. Each format assessed the candidates on their KSAOs.

For the Multi-Item condition, the response format was a 7-point Likert scale from 1 (not at all descriptive of the candidate) to 7 (extremely descriptive of the candidate). For the RPM condition, participants were required to rate the candidates in comparison to their peer group. The reliability and validity of the RPM approach to performance rating have been established in previous studies (Goffin, Gellatly, et al, 1996). Lastly, in the GKR condition, participants were asked to rank order how similar to the candidate the item statements were for each performance domain. For each domain, 2 bogus statements representing a KSAO characteristic not highly relevant to the job were included. The bogus items appeared to be as desirable and favourable as the items drawn from the job description; however, they were less job relevant. The bogus items were included to avoid the issue of multicollinearity associated with rank-order response formats.
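
One way to see why the bogus items matter is sketched below in Python. This is an illustration under assumptions, not the scoring procedure used in the study: the item names, the helper functions, and the choice to score a domain by summing the ranks of job-relevant items are all hypothetical. When every item in a domain is ranked, the ranks always sum to the same constant, so the item scores are linearly dependent. Once distractor items share the ranking, the total for the job-relevant items is free to vary between candidates.

```python
def rank_sum(n_items):
    """Sum of the ranks 1..n_items; identical for every complete ranking."""
    return n_items * (n_items + 1) // 2


def relevant_rank_total(ranking, bogus):
    """Sum the ranks given to job-relevant items, skipping bogus distractors.

    ranking: item names ordered from most (rank 1) to least characteristic.
    bogus:   set of item names treated as distractors (hypothetical here).
    """
    return sum(pos for pos, item in enumerate(ranking, start=1)
               if item not in bogus)


# Hypothetical domain: 4 job-relevant items plus 2 bogus items.
bogus_items = {"bogus_1", "bogus_2"}

# Referee A places the job-relevant items first (lower total = more characteristic).
referee_a = ["plans", "delegates", "budgets", "schedules", "bogus_1", "bogus_2"]
# Referee B places the distractors first.
referee_b = ["bogus_1", "bogus_2", "plans", "delegates", "budgets", "schedules"]

print(rank_sum(len(referee_a)))                     # same constant for any ranking of 6 items
print(relevant_rank_total(referee_a, bogus_items))  # total for job-relevant items only
print(relevant_rank_total(referee_b, bogus_items))  # differs from referee A's total
```

Without the distractors, both referees would necessarily produce the same domain total, which is the linear dependence the bogus items are meant to break.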

Word Count:

The number of words for the LOR was measured to control for the effect of reference length on ICC (1, k) scores. As references get longer, it is likely that there is more room for variance in content, structure, and tone that may affect the convergence of ratings.

Strength of Candidate:

The strength of the candidates in the LOR condition, based on the references generated, was assessed by a panel of 12 expert raters using a 3-item, 10-point Likert scale: 1.) How effective would this candidate be as an office manager? 2.) How qualified would this candidate be as an office manager? 3.) Overall, how strong is the candidate for an office manager position?

Results

Preliminary results:

Prior to assessing inter-rater reliability, a series of preliminary analyses were conducted. Of the 600 participants, all had completed the reference, and all had usable data. For the Multi-Item, RPM, and GKR conditions, the distribution of the data for each candidate was not normal. For the strong and moderate candidates, the data was positively skewed, while the data for the weak candidate was negatively skewed. This was expected as the candidates were deliberately chosen for their differing level of competency in the job position, and the performance ratings reflected this.

To assess the psychometric properties of the reference formats, item means, standard deviations, internal consistency reliability coefficients, and item-total correlations were examined. All items for the LOR, Multi-Item, and RPM conditions showed acceptable statistical properties and were retained for further analyses. Internal consistency coefficients across the format conditions ranged from .71 to .92 and were comparable across formats. Inter-correlations between the different KSAO dimensions were all low to moderate, with none approaching 50% or more shared variance. This demonstrates that the participants were able to differentiate the different KSAO dimensions. In terms of the GKR format, low inter-scale correlations (.11 to .36) were found; however, this was expected, as rank-ordering precludes uniformly high or low rankings across the KSAOs. Correlations between the different reference conditions were moderate to high (see table 1). This was expected, as only the response format changed; the construct being measured and the item statements remained largely the same throughout each condition.

Table 1: Correlation Table for Reference Conditions
	LOR	Multi-Item	RPM
Weak Candidate
LOR
Multi-Item	.45**
RPM	.42**	.57**
GKR	.39**	.39**	.45**
Moderate Candidate
LOR
Multi-Item	.41**
RPM	.37**	.45**
GKR	.49**	.32**	.40**
Strong Candidate
LOR
Multi-Item	.51**
RPM	.47**	.54**
GKR	.47**	.36**	.44**

**significant at p < .001

Inter-rater reliability results

All analyses for inter-rater reliability results were run in R 3.5.2. Since raters were randomly assigned and the level of agreement between raters was the target variable, absolute-agreement, one-way random-effects ICC (1, k) analyses (Shrout & Fleiss, 1979) were performed. A two-way F-test for equality of SD was performed to test for significant differences in variance between the conditions. A significantly lower SD value indicates lower variance in the scores provided by the raters and higher agreement among raters. The M and SD for the RPM condition were divided by 10 to make them equivalent to the other conditions for comparative analyses.
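
For readers unfamiliar with the one-way random-effects ICC, the Python sketch below shows how ICC (1, 1) and ICC (1, k) follow from the between-target and within-target mean squares of a one-way ANOVA (Shrout & Fleiss, 1979). It is not the R code used in this study: the function name and the small ratings table are made up for illustration, and it assumes every target has the same number of ratings.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects ICCs for a targets-by-ratings table.

    ratings: list of rows, one per rated target, each holding k ratings.
    Returns (icc_single, icc_average), i.e. ICC(1,1) and ICC(1,k).
    """
    n = len(ratings)       # number of rated targets
    k = len(ratings[0])    # ratings per target
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 targets, each with the same k >= 2 ratings")

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]

    # Mean squares from a one-way ANOVA with targets as the grouping factor.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means)
                    for x in row) / (n * (k - 1))
    if ms_between == 0:
        raise ValueError("targets do not differ; the ICC is undefined")

    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average


# Made-up example: 3 targets, each rated by 3 raters.
example = [[9, 8, 9], [6, 7, 6], [3, 2, 4]]
single, average = icc_oneway(example)
print(round(single, 3), round(average, 3))
```

The average-measures value is always at least as large as the single-measures value when agreement is positive, which is one reason the two forms should not be compared directly across studies.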

H1: The GKR condition will have the highest inter-rater agreement, followed by the RPM, Multi-item, and LOR conditions.

Across the strong, moderate, and weak video vignette conditions, a similar pattern of ICC (1, k) scores appeared in both the MTURK and expert assessor samples. For a summary of ICC difference, SD difference, and critical F tests for difference of variance scores, see tables 2, 3, 4 (vertical ICC_diff, SD_diff, and F_crit). ICC (1, k) scores for the Multi-Item format ranged from .06 to .08 higher than the LOR condition for the MTURK sample, and .03 to .06 higher for the expert assessors. ICC (1, k) scores for the RPM condition ranged from .10 to .11 higher for the MTURK sample, and .04 to .07 higher for the expert assessors, than the Multi-Item format. Finally, the GKR condition ranged from .03 lower to .02 higher than the RPM condition. While both the RPM and GKR conditions had higher ICC (1, k) scores and significantly lower SD scores than the Multi-Item or LOR conditions, they had similar ICC (1, k) and SD scores to each other across all three candidate conditions. However, the RPM had significantly lower SD scores in the expert assessor sample and in the strong candidate MTURK sample. H₁ was partially supported.
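
The SD comparisons in this section rest on a ratio of two sample variances. The Python sketch below illustrates that ratio and the rescaling of 0 to 100 RPM statistics described in the analysis plan. The function names and the SD values are hypothetical, and the sketch stops at the F statistic: judging significance additionally requires the degrees of freedom of each sample and the F distribution, which are not shown.

```python
def rescale(value, divisor=10.0):
    """Divide a 0-100 RPM mean or SD by 10 so it is comparable with the other formats."""
    return value / divisor


def variance_ratio(sd_a, sd_b):
    """F statistic for equality of variances: larger variance over smaller."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    larger, smaller = max(var_a, var_b), min(var_a, var_b)
    if smaller == 0:
        raise ValueError("a variance of zero makes the ratio undefined")
    return larger / smaller


# Hypothetical SDs: one Likert-type condition and one RPM condition.
sd_likert = 1.6
sd_rpm = rescale(8.0)  # RPM SD brought onto the common scale first

print(variance_ratio(sd_likert, sd_rpm))
```

Because the ratio depends on the measurement scale, comparing an unrescaled RPM SD with a Likert SD would produce a meaningless F value, which is why the rescaling step comes first.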

Table 2: Inter-rater reliability results for the weak candidate
	MTURK Participants				Expert Assessors
	ICC (95% CIs)	M	SD	F	ICC (95% CIs)	M	SD	F	ICC_diff	SD_diff	F_crit
Letter of Reference	.19 (.11 to .27)	6.23	3.27	2.829*	0.44 (.40 to .48)	4.7	1.8	4.153*	0.25	1.47	3.24**
ICC_diff/Sd_diff	0.06		1.31		0.03		1.15
F_crit			3.814**				7.669**
Multi-Item	.25 (.204 to .312)	5.98	1.96	3.119*	0.47 (.427 to .494)	4.5	0.65	5.640*	0.22	1.31	9.09*
ICC_diff/Sd_diff	0.1		1.14		0.04		0.08
Fcrit			2.390**				1.256*
RPM	.35 (.305 to .376)	52(5.2)	8.23(.82)	4.407*	.51 (.482 to .553)	42(4.2)	5.76(.58)	5.93*	0.16	2.47(.25)	1.999*
ICC_diff/Sd_diff	0.03		0.45		0.01		0.23
Fcrit			0.137				2.475**
GKR	.32 (.315 to .360)	4.8	0.37	4.875*	.52 (.490 to .550)	4.1	0.35	6.021*	0.2	0.02	1.117

Gender
Male	78 (34%)
Female	112 (56%)
Other	0 (0%)

*significant at p < .05, **significant at p < .01

H_2a: ICC scores will be lower in the MTURK sample than the expert raters for all conditions.

Across all 12 format*candidate conditions, the ICC (1, k) scores for the MTURK sample were lower than those of the expert assessors (LOR: .23 to .25; Multi-Item: .21 to .23; RPM: .02 to .16; GKR: .14 to .21). While the differences were significant for the LOR and Multi-Item conditions across all three candidates, none of the differences in the RPM or GKR conditions were significant, with the exception of the weak candidate*RPM condition. For a summary of ICC (1, k) difference, SD difference, and critical F test for difference of variance scores, see tables 2, 3, 4 (horizontal ICC_diff, SD_diff, and F_crit). H_2a was supported.

Table 3: Inter-rater reliability results for the moderate candidate
	MTURK Participants				Expert Assessors
	ICC (95% CIs)	M	SD	F	ICC (95% CIs)	M	SD	F	ICC_diff	SD_diff	F_crit
Letter of Reference	.15 (.051 to .248)	7.21	1.81	2.627*	0.38 (.27 to .47)	6.43	1.06	4.153*	0.23	0.75	2.916**
ICC_diff/Sd_diff	0.08		0.71		0.06		0.36
F_crit			2.708**				2.293**
Multi-Item	.23 (.193 to .286)	7.11	1.1	3.014*	0.44 (.407 to .494)	6.15	0.7	5.640*	0.21	0.9	2.469**
ICC_diff/Sd_diff	0.11		0.29		0.04		0.02
Fcrit			1.844*				1.06
RPM	.34 (.296 to .408)	67(6.7)	8.11(.81)	4.361*	.48 (.446 to .509)	67(6.7)	7.24(.72)	5.93*	0.14	8.7(.87)	1.267
ICC_diff/Sd_diff	0.01		0.22		0.01		0.26
Fcrit			1.532				1.653*
GKR	.33 (.315 to .360)	6.4	0.69	4.983*	.47 (.437 to .495)	6.2	0.56	6.021*	0.14	0.13	1.518

Gender
Male	45 (24%)
Female	153 (75%)
Other	2(1%)

*significant at p < .05, **significant at p < .01

H_2b: The difference in ICC scores between the MTURK sample and the expert raters will be highest in the LOR condition, followed by the Multi-Item, RPM, and GKR conditions.

Across all three candidate conditions, the differences in ICC (1, k) and SD scores between the MTURK and expert assessor samples decreased as the analysis moved from the LOR to the Multi-Item, RPM, and GKR conditions. The ICC_diff and SD_diff scores were largest in the LOR condition. The differences in ICC (1, k) and SD scores were also significant in the Multi-Item condition; however, the difference scores were slightly smaller. For the RPM and GKR conditions, the only significant difference was in the weak candidate*RPM condition, meaning the difference in rating scores between the MTURK and expert assessors was no longer significant in the RPM and GKR conditions. H_2b was supported.

Table 4: Inter-rater reliability results for the strong candidate
	MTURK Participants				Expert Assessors
	ICC (95% CIs)	M	SD	F	ICC (95% CIs)	M	SD	F	ICC_diff	SD_diff	F_crit
Letter of Reference	.18 (.065 to .271)	9.01	2.35	2.627*	0.41 (.30 to .50)	8.65	1.02	4.153*	0.23	1.23	5.308**
ICC_diff/Sd_diff	0.06		0.23		0.06		0.06
F_crit			1.35*				2.293**
Multi-Item	.24 (.199 to .292)	8.54	2.02	3.014*	0.47 (.417 to .523)	8.09	0.94	5.640*	0.23	1.08	4.618**
ICC_diff/Sd_diff	0.10		1.56		0.07		0.02
Fcrit			12.129**				1.06
RPM	.34 (.296 to .408)	81(8.7)	5.76(.58)	4.361*	.54 (.517 to .563)	77(7.7)	5.24(.52)	5.93*	0.2	0.06	1.244
ICC_diff/Sd_diff	0.02		0.05		0.03		0.03
Fcrit			1.80*				1.653*
GKR	.36 (.315 to .360)	8.00	0.63	4.983*	.57 (.537 to .595)	7.2	0.55	6.021*	0.21	0.08	1.312

Gender
Male	96(48%)
Female	104(52%)
Other	0(0%)

*significant at p < .05, **significant at p < .01

Discussion

This study compared the inter-rater reliability of 4 different job reference formats: the LOR, Multi-Item, RPM, and GKR. This is one of the first studies to empirically investigate inter-rater reliability for job references, and the first for the RPM and GKR formats. In addition, this study compared the ICC (1, k) scores and SDs of the 4 reference formats, and the results suggested that standardised reference formats have significantly and meaningfully higher inter-rater reliability than the LOR. In particular, the RPM and GKR formats showed much higher ICC (1, k) scores and lower SD scores for both the MTURK and expert assessor samples.

The scores were also more similar between the MTURK and expert assessors in the RPM and GKR conditions, which suggests that the increase in reliability that standardised reference formats bring is greatest for non-expert reference writers. The practical implication is that standardised reference formats offer the greatest value in situations where reference writers are not experts or are unable to develop high-quality references. Interestingly, in the RPM and GKR conditions, the SD scores for the MTURK sample dropped to the point where there was no significant difference between them and the expert assessors. This may indicate that by using standardised reference formats, organisations can obtain higher quality and more accurate insights from inexperienced writers. There are a variety of potential reasons for this finding. 1.) It may be that, to answer standardised reference formats, writers engage in more reflection on the candidate. 2.) Standardised formats are likely more comprehensive, as they require the writer to rate the candidate on pre-specified performance domains. 3.) Simply being quantitative rather than qualitative may attenuate the possible variance that can be found. LORs are by nature more open to writers offering idiosyncratic information; however, writers may also provide short, vague references. 4.) Lastly, it may be that standardised reference formats, particularly the GKR, are less affected by writer bias. Although not included as a hypothesis in this study, the mean performance scores provided by the writers decreased as standardisation increased. The scores also came closer to converging with those of the expert assessors and with the overall performance scores of the video candidates provided by the assessment centre. This may indicate that with increased standardisation, writers are less able or less willing to be lenient in their ratings.

As noted earlier, leniency bias attenuates the range of predictive validity for job references and is a serious limitation on their usefulness. That the reduced mean score effect was largest for the MTURK sample, and that more standardised formats showed greater convergence between amateur and expert reference writers, has promising implications for reducing leniency bias and increasing predictive validity for job references.

Strengths and Limitations:

The design of this study had several main strengths. Firstly, the use of detailed video vignettes of actual office manager job candidates engaging in a comprehensive battery of tasks in a professional assessment centre allowed us to standardise the candidates across the reference conditions. This approach 1) reduced variability in rater biases, 2) provided objective performance ratings for the candidates, 3) ensured the participants had a comprehensive view of the candidates’ performance in a variety of performance domains, and 4) allowed us to choose candidates of pre-specified differing levels of performance to ensure a strong manipulation. That said, this design may have also reduced the effect of several important biases that would be present in a more naturalistic setting, including constrained opportunity for observation, recall biases, and social biases. Future studies should seek to replicate these findings in a naturalistic setting. Additional strengths include performance rating items based directly on the job description provided by O*NET, the use of professional assessors as comparators, and a degree of ecological validity from having a wide variety of raters in the MTURK sample.

Despite the strengths of this study, there are several limitations that should be taken into account. First, the expert assessor ICCs may have been higher in part due to a lower N. Only 5 professional assessors were recruited for this study as a comparator group for the MTURK sample. Due to the low N and potential sample similarities between the expert assessors, the ICC scores may have been somewhat inflated in comparison to the MTURK sample. Second, the number of individual items (and thus decisions) comprising a score will directly influence the likelihood of absolute agreement, such that fewer items generally lead to higher inter-rater agreement. While the Multi-Item, RPM, and GKR formats all had the same number of items, the LOR had only 3 items to develop an overall score. That said, the ICC (1, k) scores found for the LOR condition were substantially lower, and the SD scores substantially higher, despite the LOR having fewer items. If anything, including more items may actually lower the ICC (1, k) scores found for the LOR conditions. Lastly, there is some evidence that while rank-ordered response formats do decrease the room for favourability bias (the self-report equivalent of leniency bias), they may not do so in a systematic way. Vasilopoulos et al (2006) found that cognitive ability moderates the reduction of favourability bias, such that high cognitive performers are more able to recognise which traits are most job relevant in situ and are better able to ‘game’ self-report evaluations. A similar effect may be involved in writing job references. Future studies should take into account this potential new bias introduced by forced-choice response formats.

Future studies:

Focusing solely on comparing inter-rater reliability between job reference formats and amateur/professional reference writers, this study was unable to account for other important questions. For example, how standardisation increases the inter-rater reliability was not addressed in this study. That said, the increased convergence between the MTURK, expert assessors, and objective performance scores as standardisation went up may indicate that increasing standardisation leaves less room for leniency bias. This may be one important way in which standardised job references increase reliability. Future studies may follow up on this finding by including individual differences such as honesty-humility or affability (Judge & Higgins, 1998) as control variables.

Second, while increasing inter-rater reliability will most likely also increase the criterion validity of job references, a highly reliable but non-predictive measure is of little use. Other than McCarthy and Goffin’s (2001) paper, no study has empirically investigated the predictive validity of RPM and GKR format references. While their results were promising and showed increased validity over the LOR, future studies should be conducted to replicate the results in a variety of situations. Lastly, McCarthy and Goffin (2001) also note that the GKR format was unpopular with raters in their sample, and only a minority had confidence that it was a good predictor of future performance for military officers. Because these formats are meant to be practical selection tools, their face validity and the reactions of referees, job candidates, and hiring managers to them are important questions that future studies should explore further.

Conclusion

Overall, the results of this study suggest that despite the many limitations and concerns associated with the job reference (Furnham, 2017; Judge & Higgins, 1998), there are potential ways to increase its reliability and usefulness. This study demonstrated RPM and GKR format inter-rater reliabilities of .32 to .36 for inexperienced reference writers. Although by research conventions the ICCs for the RPM and GKR conditions were still low (Koo, 2016), the ICCs found here were on par with those of the structured interview, the job selection tool most similar to the reference and one of the more reliable and valid (Furnham, 2017). This difference is substantial: not only are these reliabilities nearly double that of the traditional LOR, but the RPM and GKR formats are also likely more practical, more comprehensive, less susceptible to bias, and likely to reduce the variability in the quality of references from inexperienced writers (Aamodt, 1993; Aamodt & Williams, 2005).

References

Aamodt, M.G., Bryan, D.A., & Whitcomb, A.J. (1993). Predicting Performance with Letters of Recommendation, Public Personnel Management, 22(1), 81-96.

Aamodt, M.G., Nagy, M.S., & Thomson, N. (1998). Employment References: Who Are We Talking About? Paper presented at the annual meeting of the International Personnel Management Association Assessment Council, Chicago, IL.

Aamodt, M. G., & Williams, F. (2005). Reliability, validity, and adverse impact of references and letters of recommendation. Paper presented at the 20th annual meeting of the Society for Industrial and Organizational Psychology, Los Angeles, CA.

Aamodt, M.G. (2006). Validity of Recommendations and References, Assessment Council News, 9, 4-8.

Browning, R. C. (1968). Validity of reference ratings from previous employers. Personnel Psychology, 21, 389–393.

Burke, M. E. (2005). 2004 reference and background checking survey report. Alexandria, VA: Society for Human Resource Management.

Carroll, S. J., & Nash, A. N. (1972). Effectiveness of a forced-choice reference check. Personnel Administration, 35, 42–46.

Chamorro-Premuzic, T., & Furnham, A. (2010). The Psychology of Personnel Selection. Cambridge, UK: Cambridge University Press.

Cotton, J.L., O’Neill, B.S., & Griffin, A. (2008). The “name game”: affective and hiring reactions to first names, Journal of Managerial Psychology, 23(1), 18-39.

Furnham, A. (2017). The Contribution of Others’ Methods in Recruitment & Selection, In H.W. Goldstein, E.D. Pulakos, J. Passmore, & C. Semedo (Eds.), The Wiley Handbook of the Psychology of Recruitment, Selection, and Employee Retention. Chichester, UK: John Wiley & Sons, Ltd.

Goffin, R. D., Gellatly, I. R., Paunonen, S. V., Jackson, D. N., & Meyer, J. P. (1996). Criterion validation of two approaches to performance appraisal: The behavioral observation scale and the relative percentile method. Journal of Business and Psychology, 11, 23–33.

Goffin, R. D., & Gellatly, I. R. (2001). A multi-rater assessment of organizational commitment: Are self-report measures biased? Journal of Organizational Behavior, 22, 437–451.

Howitt, D., & Cramer, D. (2005). Reliability and validity: Evaluating the value of tests and measures. In D. Howitt & D. Cramer (Eds.), Introduction to research methods in psychology (pp. 218-231). Harlow, Essex: Pearson.

Judge, T. A., & Higgins, C.A. (1998). Affective Disposition and the Letter of Reference, Organizational Behavior and Human Decision Processes, 75(3), 211-226.

Loher, B.T., Hazer, J.T., Tsai, A., Tilton, K., & James, J. (1997). Letters of reference: A process approach, Journal of Business and Psychology, 11(3), 339-354.

Knouse, S.B. (1983). The Letter of Recommendation: Specificity and Favourability, Personnel Psychology, 36(2), 321-331.

McCarthy, J. M., & Goffin, R. D. (2001). Improving the Validity of Letters of Recommendation: An Investigation of Three Standardized Reference Forms, Military Psychology, 13(4), 199–222.

Mosel, J. N., & Goheen, H. W. (1959). The Employment Recommendation Questionnaire: 3. Validity of different types of references. Personnel Psychology, 12, 469–477.

Muchinsky, P.M. (1979). The Use of Reference Reports in Personnel Selection: A Review and Evaluation, Journal of Occupational & Organizational Psychology, 38(2), 327-350.

Reilly, R. R., & Chao, G.T. (1982). Validity and Fairness of Some Alternative Employee Selection Methods, Personnel Psychology, 35(1), 1-62. DOI: 10.1111/j.1744-6570.1982.tb02184.x

Ryan, A. M., & Laser, M. (1991). Negligent Hiring and Defamation: Areas of Liability Related to Pre-Employment Questions, Personnel Psychology, 44(2), 293-319.

Vasilopoulis, N.L., Cucina, J.M., Dyomina, N.V., Morewitz, C.L., & Reilly, R.R. (2006). Forced-Choice Personality Tests: A Measure of Personality and Cognitive Ability? Human Performance, 19(3), 175-199.

Wherry, R.J., & Bartlett, C.J. (1982). The Control of Bias in Ratings: A Theory of Rating, Personnel Psychology, 35(3), 521-551.