Open Access Publications from the University of California

Cross-classified Random Effects Modeling for Moderated Item Calibration: An Application to English Language Proficiency Assessment for Special Population

  • Author(s): Chung, Seungwon
  • Advisor(s): Cai, Li
  • et al.

Test forms are often modified to accommodate special populations or situations in which administering the original test forms is infeasible. While this practice aims to promote fairness for all, that goal can be met only if the modifications are coupled with a systematic method for obtaining comparable scores across test forms. Numerous psychometric barriers stand in the way. For example, the small number of students taking modified test forms can preclude the use of standard calibration and linking procedures. These issues are especially pronounced in English language proficiency (ELP) assessment for students who are blind or have low vision and who consequently take modified Braille test forms developed under the English Language Proficiency Assessment for the 21st Century (ELPA21).

To address these bottlenecks, this study proposes a method for moderated item calibration. It develops a unified cross-classified random effects model that jointly uses item response data from the original test form and judgmental data provided by expert raters to revise item parameters for scoring modified test forms. The model is estimated with the Metropolis-Hastings Robbins-Monro algorithm (MH-RM; Cai, 2008, 2010a, 2010b). The method is programmed in R (R Core Team, 2018), and its implementation strategies are discussed in detail. The proposed model is applied to Braille test forms in ELPA21, and its performance is compared with that of common practice. A simulation study validates the modeling framework and provides guidance for future data collection and study design. More generally, this study is significant because of its broad adaptability to 1) any special population for whom direct item calibration or standard linking is not feasible, and 2) any operational or research setting where field testing cannot be conducted because of resource or sample size constraints.
