Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Triangulating Evidence to Investigate the Validity of Measures: Evidence from Discussion during Instruction, Cognitive Interviews, and Written Assessments


Classrooms are a primary site of evidence about learning. Yet classroom proceedings often occur behind closed doors, and hence evidence of student learning is observable only to the classroom teacher. The informal and undocumented nature of this information means that it is rarely included in statistical models or quantitative analyses. This research investigated whether whole-class discussions, focused on a single articulated learning trajectory, contained information that could be used as quantifiable evidence of student knowledge to better understand learning processes. Specifically, the research considered what evidence of student knowledge was observable from three modes: whole-class discussions during instruction, cognitive interviews, and written quiz assessments.

The theoretical framework of this study relied on the concept of learning progressions to integrate the strengths of what are usually thought to be divergent approaches to educational research. Specifically, this framework included key aspects of formative and summative assessments, used qualitative and quantitative methods, and capitalized on both teachers' and educational measurement researchers' notions of validity to investigate the consistency of multiple forms of evidence of student learning. The methods used in this study systematically connected instruction, learning, and assessment through a shared model of learning. This study focused on a model of learning in the domain of statistics in middle school mathematics. A trajectory of student learning was described in the Data Modeling and Statistical Reasoning learning progression, and the focus of this study was one dimension of that progression, the Conceptions of Statistics construct.

Items, item archetypes, item sets, and parallel item sets were designed to elicit particular ranges of performances described in the construct map. In order to study the degree to which consistency between student responses could be found, administration of the three instruments was tightly controlled to limit learning gains and losses over time. Decision rules were created to systematically establish units of analysis from discussion and interview responses that were comparable with quiz responses. Parallel interpretation guides were designed so that raters could score response units consistently across the assessment modes. Raters' scores were examined to investigate the reliability of response interpretations, and multimodal scoring was used to investigate the relationship between the three sources of evidence.

Results suggest that the evidence of knowledge observed in the three modes, while distinct in level of detail, consistently validated one another and the learning construct itself. Interrater score differences, kappa coefficients, correlations, and percent agreements were moderate to strong at every level of analysis, indicating that, when aligned in this way, interpretations of responses resulted in reliable scoring. Rater reflections identified six factors that may influence the quality of the evidence observed: level of verbosity, interpretability, responsiveness to topic, uniqueness of contribution, frequency of evidence, and density of evidence. These factors affected the degree to which raters were confident in their determination of scores.
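As an illustration of two of the agreement statistics named above, the following minimal sketch computes percent agreement and an unweighted Cohen's kappa for two raters who scored the same response units. The function names and the example scores are hypothetical and are not data from the study; the abstract does not state which variant of kappa was used, so the unweighted form is an assumption.

```python
from collections import Counter


def percent_agreement(scores_a, scores_b):
    """Proportion of response units on which two raters gave the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def cohens_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(scores_a)
    p_observed = percent_agreement(scores_a, scores_b)
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    # Chance agreement: probability that both raters assign the same level
    # if each rater scored independently according to their own marginals.
    levels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in levels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical level throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical construct-map levels assigned by two raters to eight response units.
rater_1 = [1, 2, 2, 3, 3, 4, 1, 2]
rater_2 = [1, 2, 3, 3, 3, 4, 1, 1]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Because construct-map levels are ordered, a weighted kappa that gives partial credit for adjacent levels may be a reasonable alternative; the unweighted form is shown here only because it is the simplest.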

Analyses comparing interview and quiz scores, discussion and quiz scores, and all three sets of scores together found that mean score differences were low and percent agreement was moderate to strong. Scores from the discussions and quizzes were more consistent with each other than either was with the interviews. Similarly, correlations computed across all students and all items were statistically significant between discussions and quizzes and between discussions and interviews, but not between interviews and quizzes. Those correlations were modest, suggesting that although the relationships were positive, the modes were not so highly correlated as to be considered equivalent. Overall, the study found that student responses in discussions, interviews, and quizzes could be consistently scored to determine students' levels of understanding.
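The cross-mode comparison described above can be sketched as a mean score difference and a correlation between paired scores from two modes. The example below is only an illustration: the score lists are hypothetical, and the use of Pearson's r is an assumption, since the abstract does not say which correlation coefficient was computed.

```python
from math import sqrt


def mean_difference(scores_x, scores_y):
    """Average signed difference between paired scores from two modes."""
    if len(scores_x) != len(scores_y) or not scores_x:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x - y for x, y in zip(scores_x, scores_y)) / len(scores_x)


def pearson_r(scores_x, scores_y):
    """Pearson correlation between paired scores from two modes."""
    n = len(scores_x)
    mean_x = sum(scores_x) / n
    mean_y = sum(scores_y) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores_x, scores_y))
    sxx = sum((x - mean_x) ** 2 for x in scores_x)
    syy = sum((y - mean_y) ** 2 for y in scores_y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one mode has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical construct-map levels for five students on the same item,
# one score from a class discussion and one from the written quiz.
discussion = [1, 2, 3, 4, 2]
quiz = [1, 3, 3, 4, 1]

diff = mean_difference(discussion, quiz)
r = pearson_r(discussion, quiz)
print(f"mean difference = {diff:.2f}, r = {r:.2f}")
```

A low mean difference alone does not imply agreement, because positive and negative differences can cancel; this is one reason to report it alongside percent agreement and a correlation, as the analyses above do.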

This work is a significant contribution to the field for a number of reasons. This study is the first of its kind to apply the construct map as a tool for coordinating classroom discussions, assessment design, evidence interpretation, and analytical decisions. Further, this study relied on the use of assessments in classroom discussions as one of the three sources of evidence of student knowledge, and established a systematic process for collecting and analyzing both text and numerical data for quantitative analysis. Specifically, this study took a systematic approach to the interpretation and scoring of evidence so that nuanced scoring of evidence from interviews and discussions could be summarized into units of analysis comparable with quiz scores. Finally, this work supports a shift in the means by which teachers and researchers look for and identify changes in student knowledge.
