eScholarship
Open Access Publications from the University of California

UCLA Electronic Theses and Dissertations

Weighting Patterns and Rater Variability in an English as a Foreign Language Speaking Test

Abstract

This study measures the weighting patterns of raters in a large-scale English as a Foreign Language (EFL) speaking test, classifies the raters according to those patterns, characterizes the resulting rater types in the rating process, and associates the types with different patterns of rater variability. The context was the Test for English Majors - Band 4, Oral Test (TEM4-Oral), a high-stakes certification test administered to college EFL majors in China toward the end of their sophomore year. To quantify the weighting patterns, 126 nonnative-speaking college teachers of English who served as TEM4-Oral raters in 2010 were asked to judge the EFL oral proficiency of 120 hypothetical test-takers with computer-generated score profiles featuring strengths and weaknesses across various criteria. Their relative weights on the criteria were derived from regression analyses and then fed into cluster analyses to classify the raters into three types. To characterize the rater types, a sample of 21 raters took part in verbal protocols, rating the performance of five real test-takers and justifying their ratings. To associate the rater types with patterns of rater variability, the real ratings of 33 raters spanning all three types were analyzed with Many-Facet Rasch Measurement, Hierarchical Linear Modeling, Generalizability Theory, and Confirmatory Factor Analysis. The cluster analyses yielded three rater types according to whether the largest weights went to form-related criteria, to content-related criteria, or were balanced between the two; the types were accordingly named form-oriented, content-oriented, and balanced.
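The two-step procedure described above, deriving each rater's criterion weights by regression on holistic ratings of score profiles and then grouping raters by where the weight mass falls, can be sketched as follows. This is a minimal illustration under assumptions, not the study's actual procedure: the four criterion names, the form/content split, the simulated rater, and the simple threshold rule (standing in here for the study's cluster analysis) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical criteria: two form-related, two content-related (assumed labels)
criteria = ["pronunciation", "grammar", "content", "organization"]
FORM, CONTENT = [0, 1], [2, 3]

def rater_weights(profiles, holistic):
    """Standardized least-squares weights of criterion scores on holistic ratings."""
    X = (profiles - profiles.mean(axis=0)) / profiles.std(axis=0)
    y = (holistic - holistic.mean()) / holistic.std()
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def classify(w, margin=0.15):
    """Label a rater by relative weight mass (threshold rule, not cluster analysis)."""
    diff = w[FORM].sum() - w[CONTENT].sum()
    if diff > margin:
        return "form-oriented"
    if diff < -margin:
        return "content-oriented"
    return "balanced"

# Simulate one rater who privileges form-related criteria:
# 120 hypothetical score profiles, holistic rating = weighted sum + noise
profiles = rng.uniform(1, 5, size=(120, 4))
true_w = np.array([0.5, 0.3, 0.1, 0.1])
holistic = profiles @ true_w + rng.normal(0, 0.1, size=120)

w = rater_weights(profiles, holistic)
print(classify(w))  # form-oriented
```

In the study itself the weights for all 126 raters would be estimated this way and then submitted to cluster analysis; the threshold rule above merely makes the form/content/balanced distinction concrete.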
In the verbal protocols, the form-oriented raters were found to be the most severely subject to the anchoring and masking effects of pronunciation and intonation, whereas the content-oriented raters mitigated those effects most strongly. The balanced raters fell in between but were more similar to the content-oriented raters. With respect to rater variability, the form-oriented raters were the most severe of the three types and the content-oriented raters the most lenient. On specific TEM4-Oral subscales, the form-oriented raters were unexpectedly severe on pronunciation and intonation but unexpectedly lenient on grammar and vocabulary, whereas the content-oriented raters were unexpectedly lenient on the content-related subscale of discussion but unexpectedly severe on the subscale of grammar and vocabulary. However, no clear-cut relationship was found for reliability or restriction of range, and results concerning halo effect were mixed.
