This study employed Bachman and Palmer's (2010) Assessment Use Argument framework to investigate the extent to which the use of a second language oral test as an exit test at a Hong Kong university can be justified. It also aimed to help the developers of this oral test identify the areas of the current test design most in need of improvement. Candidates' oral responses to the test's five integrated speaking tasks were rated on five dimensions: Task Fulfillment and Relevance (TFR), Clarity of Presentation (CoP), Grammar and Vocabulary (GV), Pronunciation (Pron), and Confidence and Fluency (CoFlu).
To provide backing for the meaningfulness of score interpretations, confirmatory factor analysis (CFA) and item response theory (IRT) analyses were applied to 999 candidates' scores, and raters' verbal reports were analyzed to complement the quantitative results. Several CFA models were first tested and compared in terms of their statistical fit and substantive interpretability, and a graded response model was then fitted to the test data. The CFA results showed that the superior fit of the Higher-order trait-Uncorrelated Method model validated the test design, confirmed the multicomponential view of language ability in the current literature, and provided the most parsimonious explanation of the relationships among the five dimensions and overall speaking proficiency. The analytic scores had much larger loadings on the trait factors than on the method factors, evidence that the component scores can be meaningfully interpreted as indicators of the five dimensions. The presence of a higher-order speaking ability factor governing the five trait factors also supported the practice of reporting one composite score. However, the TFR score on Task 4 (TFR4) had the highest method factor loading (.60) and the lowest trait factor loading (.36), suggesting that TFR4 might be too task specific and a weak measure of students' ability to fulfill a speaking task in a relevant way. The trace lines of the graded response model corroborated this finding. Because the raters' verbal reports showed that most raters had little difficulty differentiating across the performance levels, the problem with TFR4 was attributed to the nature of the task itself and its low discrimination. Both the CFA and IRT results indicated that task type strongly affected test takers' performance, especially on TFR, and that this component of language ability might be too task specific.
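To make the model comparison concrete, the sketch below shows how a simplified version of the higher-order model could be specified in Python with semopy. It includes only the five first-order trait factors and the higher-order speaking factor; the orthogonal task (method) factors of the full Higher-order trait-Uncorrelated Method model are omitted for brevity, and the column and file names are hypothetical, not those of the actual study.

```python
# Sketch: simplified higher-order CFA for the five analytic rating dimensions.
# Assumes a DataFrame with hypothetical columns TFR1..CoFlu5 (each dimension
# rated on each of the five tasks) for the 999 candidates.
import pandas as pd
import semopy

MODEL_DESC = """
# First-order trait factors, one per rated dimension (five tasks each)
TFR   =~ TFR1 + TFR2 + TFR3 + TFR4 + TFR5
CoP   =~ CoP1 + CoP2 + CoP3 + CoP4 + CoP5
GV    =~ GV1 + GV2 + GV3 + GV4 + GV5
Pron  =~ Pron1 + Pron2 + Pron3 + Pron4 + Pron5
CoFlu =~ CoFlu1 + CoFlu2 + CoFlu3 + CoFlu4 + CoFlu5
# Higher-order speaking ability factor governing the five traits
Speaking =~ TFR + CoP + GV + Pron + CoFlu
"""

scores = pd.read_csv("oral_test_scores.csv")  # hypothetical file name
model = semopy.Model(MODEL_DESC)
model.fit(scores)
print(semopy.calc_stats(model))  # fit indices (chi-square, CFI, RMSEA, ...)
print(model.inspect())           # parameter estimates, incl. factor loadings
```

Competing specifications (e.g., correlated traits only, or trait plus method factors) would be written as alternative model descriptions and compared on the same fit indices. The trace lines referred to above are the category response curves of the graded response model. For item i with discrimination a_i and ordered category boundaries b_ik, the probability of a response in category k is the difference between adjacent cumulative (boundary) probabilities:

```latex
P^{*}_{ik}(\theta) = \frac{1}{1 + \exp\!\left[-a_i\left(\theta - b_{ik}\right)\right]},
\qquad
P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta),
```

where \theta is the latent speaking ability, P*_{i1} = 1, and P*_{i,m+1} = 0 for an item with m categories. A small a_i, as found for TFR4, flattens all of the item's trace lines, so adjacent score categories discriminate poorly among candidates.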
To investigate the impartiality of interpretations, multi-group CFA and differential item functioning (DIF) analyses were conducted to examine the extent to which the oral test exhibited test bias and item bias across (1) gender and (2) discipline. The multi-group CFA results indicated that the factor structure differed significantly between males and females; however, a comparison of the factor loadings showed that only one item's loading differed significantly between the male and female groups at the .05 level. The DIF results likewise suggested that the majority of the items displayed no DIF, and the DIF that did appear may be attributable to the groups' mean difference on the latent trait and to real differences in certain aspects of the language ability measured by this test. These results provided backing for the impartiality of score interpretations, indicating that the rating-based interpretations of the GSLPA SLT are impartial to a large extent across subgroups of test takers (males vs. females; business vs. non-business).
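As an illustration of one common DIF screen suited to ordinal rating scales, the sketch below applies an ordinal logistic-regression likelihood-ratio test to a single rated item. This is an assumed setup rather than the study's actual procedure, and the DataFrame columns (TFR4, total, male) and file name are hypothetical.

```python
# Sketch: ordinal logistic-regression DIF screen for one rated item.
import pandas as pd
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("oral_test_scores.csv")     # hypothetical file name
df["rest"] = df["total"] - df["TFR4"]        # matching criterion (rest score)
df["rest_x_male"] = df["rest"] * df["male"]  # interaction for non-uniform DIF

# Baseline model: the matching criterion alone predicts the item score.
m0 = OrderedModel(df["TFR4"], df[["rest"]], distr="logit").fit(
    method="bfgs", disp=False)
# Augmented model: adds group membership and its interaction with the criterion.
m1 = OrderedModel(df["TFR4"], df[["rest", "male", "rest_x_male"]],
                  distr="logit").fit(method="bfgs", disp=False)

# Likelihood-ratio test with 2 df; a significant result flags the item for DIF.
lr = 2 * (m1.llf - m0.llf)
print("LR =", round(lr, 2), "p =", round(stats.chi2.sf(lr, df=2), 4))
```

Conditioning on the rest score rather than the raw total keeps the studied item out of the matching criterion, which matters precisely when the groups differ in their latent-trait means, as noted above.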
To examine the consistency of test scores, generalizability theory (G theory) analyses were performed to investigate whether the test was dependable and whether the five dimensions were separable. The G theory results showed that the phi coefficients for the whole test fell between .76 and .85, and that Grammar and Vocabulary and Pronunciation were the most dependable dimensions. The G theory and CFA results both confirmed that the five speaking dimensions were highly correlated with one another; the possible reasons for these findings were discussed further with reference to the raters' verbal reports.
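For readers unfamiliar with the dependability index, the sketch below estimates the phi coefficient for a one-facet, fully crossed persons-by-tasks G study from the ANOVA variance components. The study's actual design almost certainly involved additional facets (e.g., raters), so this is a deliberate simplification, and the data generated below are synthetic.

```python
# Sketch: one-facet (persons x tasks) G study and the phi coefficient.
import numpy as np

def phi_coefficient(X: np.ndarray) -> float:
    """Phi for an n_p x n_i fully crossed score matrix X (no replication)."""
    n_p, n_i = X.shape
    grand = X.mean()
    # Mean squares from the two-way ANOVA decomposition
    ms_p = n_i * np.sum((X.mean(axis=1) - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((X.mean(axis=0) - grand) ** 2) / (n_i - 1)
    resid = (X - X.mean(axis=1, keepdims=True)
               - X.mean(axis=0, keepdims=True) + grand)
    ms_pi = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))
    # Expected-mean-square solutions for the variance components
    var_pi = ms_pi                              # person-x-task + error
    var_p = max((ms_p - ms_pi) / n_i, 0.0)      # persons (universe scores)
    var_i = max((ms_i - ms_pi) / n_p, 0.0)      # tasks
    # Phi = person variance / (person variance + absolute error variance)
    return var_p / (var_p + (var_i + var_pi) / n_i)

rng = np.random.default_rng(0)
X = rng.normal(3.0, 0.5, size=(999, 5))  # synthetic 999-candidate x 5-task data
print(round(phi_coefficient(X), 3))
```

Unlike the generalizability coefficient for relative decisions, phi counts the task (main-effect) variance as error, which is appropriate for an exit test where absolute score levels carry the decision.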
Based on these results, it can be concluded that the meaningfulness and impartiality of the score interpretations and the consistency of the test scores can be justified to a large extent. Several critical areas for improvement in the test design and administration were identified, theoretical and practical implications were addressed, and methodological limitations were discussed. Overall, this study highlights the usefulness of Bachman and Palmer's (2010) Assessment Use Argument for justifying the use of an existing language assessment.