This study applies the Standards for Educational and Psychological Testing to appraise validity evidence for the interpretation of test scores used in experiential education, adventure education, and out-of-school time programs. Specifically, Rasch item response modeling is applied as the primary tool for examining validity evidence for the Life Effectiveness Questionnaire (LEQ). Hypotheses and assumptions about test content and test structure are evaluated using technically calibrated item responses as empirical evidence. This evidence, presented in the form of Wright maps, allows for a combined focus on the qualitative, substantive meaning of the test scores and on their technical, statistical properties.
The partial credit model fit better than the rating scale model for the composite unidimensional and multidimensional models (p>0.05) of the LEQ, and for seven of the eight consecutive unidimensional models of LEQ subscales (p>0.05). Differential item functioning (DIF) analyses found no meaningful statistical evidence of construct-irrelevant variance by gender, age, or voluntary status. Analyses of person-to-item distributions found a mismatch between the intended meaning of item content and empirical item difficulties, so the test developers' initial hypothesis about item content was not supported. Evidence from Wright maps indicates no ordering or sequencing of LEQ test content and shows that items oversample one easy level of the Life Effectiveness continuum. These findings suggest that item content does not adequately represent the Life Effectiveness construct.
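For readers less familiar with the two models compared above, the following is a sketch in standard textbook notation. The symbols are assumptions introduced here for illustration and do not come from the study itself: \theta_n is the ability of respondent n, \delta_{ik} is the k-th step difficulty of item i, m_i is the highest score category of item i, and \tau_k is a step parameter shared across items. The rating scale model is the special case of the partial credit model in which every item shares the same step structure, so the two models are nested and are commonly compared by a likelihood ratio statistic. The abstract does not state which fit statistic was used, so the last line should be read as one common option only.

```latex
% Partial credit model: probability that respondent n scores x on item i
% (the empty sum for h = 0 is defined to be 0)
P(X_{ni} = x) =
  \frac{\exp\left( \sum_{k=1}^{x} (\theta_n - \delta_{ik}) \right)}
       {\sum_{h=0}^{m_i} \exp\left( \sum_{k=1}^{h} (\theta_n - \delta_{ik}) \right)},
  \qquad x = 0, 1, \ldots, m_i

% Rating scale model: the same form with a constrained step structure
\delta_{ik} = \delta_i + \tau_k

% One common comparison of the nested models (likelihood ratio statistic)
G^2 = -2 \left( \ln L_{\mathrm{RSM}} - \ln L_{\mathrm{PCM}} \right)
```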
Evidence from Wright maps indicates that respondents did not interpret and differentiate between response categories equivalently within items or across subdomains, so the assumptions of interval-level Likert scaling that underlie the averaged LEQ scores are not supported. Wright maps provided evidence of good respondent-to-item threshold distribution for the composite LEQ. However, item thresholds are skewed toward the lower levels of life effectiveness, so standard errors are higher for respondents with higher ability estimates. Despite reasonable reliability coefficients, the sizable standard errors and wide confidence intervals, particularly for the LEQ subscales and for high-ability respondents on the composite LEQ, caution against overconfidence in using student ability estimates to make claims about change.
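The link between skewed item thresholds and larger standard errors for high-ability respondents can be illustrated with a minimal sketch. It assumes the partial credit model and uses the standard Rasch-family result that an item's information at a given ability equals the variance of its score there, so the standard error of an ability estimate is approximately one over the square root of the summed item information. The function names and the threshold values in ITEMS are hypothetical and chosen only to mimic a test whose thresholds sit low on the scale; they are not LEQ estimates.

```python
import math


def pcm_probs(theta, deltas):
    """Category probabilities for one item under the partial credit model.

    deltas holds the step difficulties delta_1..delta_m; the result has
    one probability for each score 0..m.
    """
    cumulative = [0.0]  # the exponent for score 0 is defined as 0
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    largest = max(cumulative)  # subtract the max for numerical stability
    weights = [math.exp(c - largest) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, deltas):
    """Item information at theta: the variance of the item score."""
    probs = pcm_probs(theta, deltas)
    mean = sum(score * p for score, p in enumerate(probs))
    return sum((score - mean) ** 2 * p for score, p in enumerate(probs))


def standard_error(theta, items):
    """Approximate standard error of the ability estimate at theta."""
    information = sum(item_information(theta, deltas) for deltas in items)
    return 1.0 / math.sqrt(information)


# Hypothetical step difficulties (in logits), all located low on the scale.
ITEMS = [
    [-3.0, -2.0, -1.0],
    [-2.5, -1.5, -0.5],
    [-2.0, -1.0, 0.0],
]

se_low = standard_error(-1.5, ITEMS)   # ability near the thresholds
se_high = standard_error(3.0, ITEMS)   # ability well above the thresholds

print(f"SE at theta = -1.5: {se_low:.2f}")
print(f"SE at theta =  3.0: {se_high:.2f}")
```

Because almost every hypothetical threshold lies below an ability of 3.0, respondents at that level nearly always choose the top category, the score variance collapses, and the standard error grows. This is the same pattern the paragraph above describes for high-ability respondents.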