Abstract

This paper reports on a mixed-methods approach to evaluating rater performance on a local oral English proficiency test. Three types of reliability estimates were reported to examine rater performance from different perspectives. Quantitative results were also triangulated with qualitative rater comments to arrive at a more representative picture of rater performance and to inform rater training. Specifically, both quantitative data (6338 valid rating scores) and qualitative data (506 sets of rater comments) were analyzed with respect to rater consistency, rater consensus, rater severity, rater interaction, and raters' use of the rating scale. While raters achieved satisfactory overall inter-rater reliability (r = .73), they differed in severity and achieved relatively low exact score agreement. Disagreement in rating scores was largely explained by two significant main effects: (1) examinees' oral English proficiency level, that is, raters tended to agree more at higher score levels than at lower score levels; and (2) raters' differential severity, arising from raters' varied perceptions of speech intelligibility toward Indian and low-proficiency Chinese examinees. However, the effect sizes of raters' differential severity on overall rater agreement were rather small, suggesting that varied perceptions of second language (L2) intelligibility among trained raters, though possible, are not likely to have a large impact on the overall evaluation of oral English proficiency. In contrast, at the lower score levels, examinees' varied language proficiency profiles made rater alignment difficult. Rater disagreement at these levels accounted for most of the overall rater disagreement and thus should be a focus of rater training.
One implication of this study is that the interpretation of rater performance should not focus only on identifying interactions between raters' and examinees' linguistic backgrounds but should also examine the impact of rater interactions across examinees' language proficiency levels. The findings also indicate the effectiveness of triangulating different sources of data on rater performance using a mixed-methods approach, especially in local testing contexts.
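
The contrast between satisfactory consistency (r = .73) and low exact agreement can be illustrated with a minimal sketch. The function names, the two raters, and the scores below are hypothetical and are not taken from the study's data or its actual estimation procedure; they only show how a consistency estimate (Pearson correlation), a consensus estimate (exact agreement), and a simple severity gap can diverge for the same pair of raters.

```python
from math import sqrt


def pearson_r(x, y):
    """Consistency estimate: Pearson correlation between two raters' scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def exact_agreement(x, y):
    """Consensus estimate: proportion of examinees given identical scores."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def mean_severity_gap(x, y):
    """Mean score difference (x minus y); negative means rater x is harsher."""
    return sum(a - b for a, b in zip(x, y)) / len(x)


# Hypothetical scores on an assumed scale; rater B scores every examinee
# one step higher than rater A, so rank order is preserved but no score matches.
rater_a = [35, 40, 45, 50, 55, 60]
rater_b = [40, 45, 50, 55, 60, 65]

print("consistency (r):", round(pearson_r(rater_a, rater_b), 2))
print("exact agreement:", exact_agreement(rater_a, rater_b))
print("severity gap (A - B):", mean_severity_gap(rater_a, rater_b))
```

In this constructed case the two raters order examinees identically yet never assign the same score, which is why consistency, consensus, and severity are best reported side by side.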

  • Publication date: 2014-10