Generating Knowledge of Academic Language Among Urban Middle School Students

Abstract A quasi-experimental study of a novel, cross-content area vocabulary intervention program called Word Generation showed significantly greater growth among 6th- to 8th-grade students in schools implementing the program than in comparison schools, on a curriculum-specific test. Furthermore, the language-minority students in the treatment, but not the comparison, schools showed greater growth than the English-only students. Improvement on the curriculum-specific test predicted performance on the Massachusetts Comprehensive Assessment System (MCAS) English language arts assessment, again only for students in the treatment schools. Although a more rigorous experimental study of this program is needed, we conclude that participation in the intervention, with its focus on deep reading, comprehension of current-events topics, productive classroom discussion, developing arguments, and producing persuasive essays, was a plausible contributor to student performance on the MCAS.

The reading comprehension of post-primary grade students, in particular those attending urban schools, is a matter of recurrent concern. Performance of eighth-graders on the National Assessment of Educational Progress reading assessment, for example, shows that 74% of all students perform at or above the basic (grade-appropriate) level, whereas only 60% of students in the large central urban districts perform at that level. Ethnic and language minority students in these districts perform even less well than White and native English-speaking students. Although the specific reading challenges faced by students performing below basic level are no doubt heterogeneous, a ubiquitous issue mentioned by their teachers and confirmed by assessment is their limited vocabularies. It is not surprising that the many language minority students in urban districts show gaps in English vocabulary, but even native English speakers may fail to develop rich vocabularies if they have a history of low reading ability, limited comprehension, and low investment of time in reading, because much sophisticated vocabulary is acquired through reading (Anderson, Wilson, & Fielding, 1988; Stanovich, 1986).
In response to administrators' and teachers' worries about the vocabulary skills of students in the Boston Public Schools (BPS), an intervention called Word Generation was designed to promote the learning of all-purpose academic vocabulary among sixth- to eighth-grade students. Word Generation was developed under the auspices of the Strategic Education Research Partnership (SERP), as part of the BPS-SERP field site activities (www.serpinstitute.org). Word Generation adheres to research-based principles of vocabulary learning and incorporates opportunities for students to use the five words taught each week in classroom discussion, debate, and writing by embedding those words in brief texts that present controversies of high interest to (pre)adolescents. Thus, in addition to teaching vocabulary, the program provides opportunities for students to develop and practice oral academic language skills, argumentation strategies, and writing skills while learning about issues of current public interest (e.g., steroid use among athletes, legalization of euthanasia, censorship of libraries and of popular music, funding for stem cell research).
In this article we report on an analysis of the impact of Word Generation on students' learning of words taught and the relationship of their performance on curriculum-linked assessments of Word Generation vocabulary to their performance on the English Language Arts assessment administered as part of the state accountability system. In addition, in light of the greater risk for poor literacy performance of students from non-English-speaking homes, we analyze the differential effect of the program on language minority (LM) and English-only (EO) students.

PRINCIPLES UNDERLYING WORD GENERATION
A first principle of effective vocabulary instruction is to teach words students need to know: those that are crucial to their comprehension of their assigned texts, for example. Middle school students are expected to read content-area texts that contain many technical, discipline-specific words as well as many "all-purpose academic words." This latter category of words is less likely to be explicitly taught, in particular by math, science, and social studies teachers who concentrate their instructional time and effort on the words of their respective disciplines. Yet students who do not know the all-purpose academic words struggle with comprehension of texts containing them (Anderson & Freebody, 1981), and thus struggle as well to learn the disciplinary concepts and words that are being taught. For these reasons, a basic Word Generation design principle was to focus on all-purpose academic words but to enhance content area teachers' willingness to teach them.
Additional Word Generation features were designed to implement what we know about effective vocabulary teaching. Fortunately, the field of vocabulary instruction has been well researched. Dozens of small-scale experimental studies provide evidence about instructional factors that promote successful vocabulary learning (Beck, McKeown, & Kucan, 2002; Beck, Perfetti, & McKeown, 1982; Graves, 2006; McKeown, Beck, Omanson, & Perfetti, 1983; McKeown, Beck, Omanson, & Pople, 1985; National Institute of Child Health and Human Development, 2000; Stahl & Fairbanks, 1986; Stahl & Nagy, 2006). Those factors include the following:
• Encountering the target word in semantically rich contexts within motivating texts, rather than in a list of words.
• Recurrent exposure to the word, in varied contexts.
• Opportunities to use the word orally and in writing.
• Explicit instruction in word meaning.
• Explicit instruction in word learning strategies, including morphological analysis, cognate use, and polysemy.
Word Generation was designed to provide instruction that embodied these factors. Thus, the target words were presented in brief but engaging texts, and teachers were given guidance about how to read those texts with their students so as to ensure comprehension. Then explicit instruction in the word meanings was provided. The words were also embedded in activities to be carried out by math, social studies, and science teachers, to ensure both recurrent exposure and opportunities to learn about subject-specific meanings (polysemy) and contexts of use. The daily activities were designed to promote opportunity for oral discussion and debate on 4 days of the week, and writing on the 5th.

DESIGN OF THE INTERVENTION
Word Generation is a 24-week-long sequence of topics of current interest, each associated with five all-purpose academic words, and prescribed activities related to math, science, and social studies (see www.wordgeneration.org for sample materials). The basic sequence of Word Generation activities was the following: On Monday a brief text in which the five target words were embedded was read by the students and teacher together, then discussed using guiding comprehension questions; this text presented arguments on both sides of some difficult controversy or dilemma. Then the five target words were highlighted and provided with student-friendly context-related definitions. This activity typically occurred in the English Language Arts classroom. On Tuesday, Wednesday, and Thursday, in an order determined by each school, the math, social studies, and science teachers respectively implemented activities provided for them, each of which embedded the same five target words. The math teacher assigned one or two problems related in content to the dilemma of the week; the format of these problems was modeled on the state math assessment. Math teachers then discussed the content as well as the math procedures. The science teacher presented a new text that focused on science content related to the dilemma of the week; students filled in target words left blank in the text before the class discussed the text. The social studies teacher organized a debate about the dilemma of the week in one of several possible formats (fishbowl, pairs, whole class, four corners, etc.). On Friday, the students were asked to write a "taking a stand" essay about the dilemma.
Various aspects of the Word Generation design respond to the local conditions in the district for which it was originally developed. Most middle-grade students in the BPS attend separate middle schools serving sixth to eighth graders, though about 25% are in K-8 schools. In the middle schools, content area instruction is departmentalized, and teacher planning time built into the school schedule typically occurs within departments, limiting the opportunities for teachers to share information about student progress or curricular emphases across those departmental boundaries. Transcending the boundaries to recruit participation by all the teachers in vocabulary teaching was one goal of the Word Generation design.
In addition, the extensively articulated state and district curriculum standards, as well as district pacing guides for math, science, and social studies, limited the classroom time available for focusing on vocabulary or on topics not explicitly included in the standards. Thus, to secure collaboration from the District leadership and the teachers, we agreed to design activities that could be completed in 15 min per day (thus taking only 15 min per week from math, science, or social studies). Furthermore, each school implementing the program had considerable leeway to decide on scheduling (which group of teachers was responsible for which day[s] of the week) and on extent of use. For example, one school excluded Structured English Immersion students during the 1st year of implementation but included them subsequently. The five schools reported on here used the program with all students in Grades 6 to 8.

RESEARCH QUESTIONS
The larger effort surrounding Word Generation is an attempt to influence student learning, teacher learning, and school coherence simultaneously. Clearly, the effect of a program that lasts only 75 min a week cannot be expected to be enormous. Thus, our goal was that Word Generation would ultimately influence teachers by providing a scaffolded context for learning and practicing new pedagogical strategies useful in teaching vocabulary and fomenting productive classroom discussion. Nonetheless, in this article we report only immediate effects on students participating in the program, acknowledging that the basis for strong causal inference about program impact is limited in this quasi-experimental study. We posed three research questions:
1. Did students in the schools implementing Word Generation learn the words taught?
2. Was there a differential effect of Word Generation on LM and EO students?
3. Did the learning promoted by Word Generation relate to performance on external achievement measures?

METHODS
This is a quasi-experimental study in which academic word-learning by students in five schools implementing the Word Generation program was compared to academic word-learning by students in three schools within the same system that did not choose to implement the program. Because the implementing schools were those that volunteered for the program, selection effects must be taken into account in interpreting the findings. The data reported in Table 1 suggest that the comparison schools were performing better than the treatment schools at the start of the current study, and that impression was confirmed by disparities in performance on the curriculum-specific pretest (see Figure 1). The design of the intervention precluded assignment to treatment at the individual or classroom level.
Demographics of the Word Generation and comparison schools reflect BPS more broadly, with a high incidence of poverty (ranging from a low of 79% to a high of 91% of students receiving free or reduced-price lunch; see Table 1). BPS is characterized in general by rather high levels of special education designation, and all the schools shared this feature (between 16% and 33% of students with Individualized Education Programs). A very high proportion of students at these schools come from second-language homes, with percentages ranging from 32% to 70% across the schools. Four of the treatment schools offered Sheltered English Immersion services to their limited English proficient students; all students enrolled in these sheltered classrooms (who represented between 6% and 26% of their school populations) received the Word Generation curriculum, albeit with modifications such as extended time and translation of key concepts.
The comparison schools looked somewhat less disadvantaged as a group than the intervention schools (see Table 1), and their average scores on the state accountability assessment at the start of this study were higher (mean of 45% failing in the comparison schools, compared to 56% in the treatment schools). This is not surprising; the schools volunteered to participate in the intervention, and those with lower scores were more likely to show an interest.

Professional Development (PD) and Implementation
In addition to varying on demographic profiles, the treatment schools varied in the degree to which they utilized the PD offered to support implementation of Word Generation, and in the eagerness with which the teachers across the building embraced the program. The full PD package available to the schools involved school leader participation in an implementation planning session, a schoolwide training for all teachers prior to starting the program, two to three sessions of implementation support throughout the year, and regularly scheduled school-based sessions of collaborative planning and preparation. Specifically,
• Key instructional leaders in newly implementing schools were invited to a Word Generation summer institute, where they could organize testing and implementation schedules with the Word Generation research team.
• Prior to launching the intervention, a 4-hr PD session is recommended. These hours are devoted to introducing the yearlong intervention, discussing relevant research on vocabulary teaching and learning, viewing videotapes of exemplary implementation of the program by other BPS practitioners, and hands-on practice with program materials.
• Ongoing PD (two to three more school-based sessions) is also recommended, and most schools establish several dates throughout the academic year when feedback and further PD can be delivered.
• Effective implementation of Word Generation is contingent upon a set of shared understandings and commitments at the school level. Regular cluster, grade, and/or content level meetings for team building, updating implementation schedules and calendars, and other organizational features are vital for ensuring cross-content communication and maintaining effective program implementation.
As can be seen from Table 1, the five implementing schools participated in these various activities to varying degrees because of difficulties scheduling and organizing the required meetings. For example, the Mercer received only one brief PD session, whereas the Occidental participated in the summer institute, received 8 hr of PD prior to launch, and engaged in biweekly cross-grade school-site sessions devoted to feedback on and previewing of the materials and activities, with support from the Word Generation team at several of those sessions.

Student Participants
Both pre- and posttest data were available on 697 sixth-, seventh-, and eighth-grade students in five treatment schools and 319 in three comparison schools. All students in the treatment schools received the intervention; those represented in this data set had completed usable test forms at both pre- and posttest. There were 349 girls and 348 boys in the treatment schools, and 162 girls and 157 boys in comparison schools. Of these, 438 were classified as LM (parents reported preferring to receive materials in a language other than English), 287 in treatment schools and 151 in comparison schools. As can be seen from Table 1, the vast majority of students in both treatment and comparison schools were low income.

Measures
The efficacy of the intervention was assessed using a 48-item multiple-choice test that randomly sampled two of the five words taught each week. The vocabulary assessment was not completed by all students in the time available.
Because items at the end of the assessment had particularly low rates of completion, we dropped the last 4 items from our analysis of both pre- and posttest. The reliability of the test with the 40 items that remained was acceptable (Cronbach's α = .876).
This instrument was administered to students in all the treatment schools in October 2007, before the introduction of Word Generation materials. Because of difficulty recruiting the comparison schools, the pretest was not administered there until January. The posttest (identical to the pretest except for the order of items) was administered in all the schools in late May. Because of the unfortunate disparity in interval between pre- and posttesting in the two groups of schools, we present analyses in terms of words learned per month as well as total words learned.
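For readers unfamiliar with the reliability statistic cited above, the following is a minimal sketch of how Cronbach's α can be computed from a students-by-items matrix of right/wrong scores. The function name and the tiny response matrix are hypothetical illustrations; they are not the study's data or the authors' scoring code.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student item-score lists.

    Each inner list holds one student's scores (e.g., 1 = correct,
    0 = incorrect) on the same k items, in the same order.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across students.
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of students' total scores.
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from five students on four items.
hypothetical_responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(hypothetical_responses), 3))
```

Population variances are used for both the items and the totals; using sample variances throughout would give the same α, because the correction factor cancels in the ratio.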
In addition to this curriculum-based assessment, we had access for most of the students to spring 2008 scores on the Massachusetts Comprehensive Assessment System (MCAS) English Language Arts (ELA) assessment. We also had Group Reading and Diagnostic Evaluation (GRADE; Williams, 2000) scores for both spring and fall for a selection of students in all comparison (n = 133) and treatment (n = 256) schools (Table 2). These scores were provided by the district for all the students for whom data were available. The decision to administer the assessment was made at the school and classroom level. Thus, although these data are far from complete, we have no reason to think that there was a particular sampling bias across the schools.

Analysis
Our general approach to the analysis was to use regression models to predict posttest scores, controlling for pretest scores as well as gender and subsequently introducing variables of interest (treatment condition, language status) as well as their interactions. We recognize that this analytic strategy ignores the nesting of students within schools; we therefore planned follow-up analyses, both comparing treatment and comparison schools and examining one large treatment school on its own, to explore the initial findings further.
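The model structure described above can be sketched as an ordinary least squares fit of posttest on pretest, gender, treatment, language status, and the treatment-by-language interaction. Everything in the sketch is an assumption made for illustration: the 0/1 codings (the article does not state how gender or language status were coded), the function names, and the synthetic, noise-free data. It returns unstandardized coefficients, whereas the article reports standardized βs, and like the analysis described it ignores the nesting of students within schools.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def design_row(pretest, female, treatment, lm):
    """Intercept, pretest, gender, treatment, LM status, treatment x LM.

    The 0/1 codings (female, treatment, lm) are assumed for this sketch.
    """
    return [1.0, float(pretest), float(female), float(treatment),
            float(lm), float(treatment * lm)]


# Synthetic students covering every combination of the three indicators.
rows = [design_row(pre, f, t, l)
        for pre in (10, 15, 20, 25)
        for f in (0, 1) for t in (0, 1) for l in (0, 1)]
# Made-up coefficients used only to generate illustrative posttest scores.
made_up_coefficients = [2.0, 0.9, -0.5, 3.0, 0.2, 1.0]
posttest = [sum(b * x for b, x in zip(made_up_coefficients, r)) for r in rows]

names = ["intercept", "pretest", "female", "treatment", "lm", "treatment_x_lm"]
for name, value in zip(names, fit_ols(rows, posttest)):
    print(f"{name:>15}: {value:.3f}")
```

In this layout the coefficient on the interaction term is the quantity that corresponds to a differential effect of the program for LM students; a multilevel model would be the usual way to respect the school-level nesting that this sketch ignores.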

RESULTS
Descriptive statistics suggested that boys did better than girls on the pre- and posttest, that the pretest scores of students in the comparison schools were higher than those in treatment schools, and that LM students performed less well than EO students across all the schools (Table 3). These results show that students in the Word Generation program learned approximately the number of words that differentiated eighth from sixth graders on the pretest; in other words, participation in 20 to 22 weeks of the curriculum was equivalent to 2 years of incidental learning. Unfortunately, the relative improvements in the Word Generation schools will be exaggerated by the differences in timing of the pretest. Table 4 presents both the total pre- to posttest improvement and the improvement divided by the number of months between pre- and posttests (8 months for treatment schools, 5 months for comparison schools). The results shown in Table 4 demonstrate that Word Generation schools outperformed the comparison schools when the amount of time between tests is taken into consideration (see Figure 1).
Regression analysis was used to determine if participation in Word Generation predicted improved vocabulary outcomes, controlling for the pretest. Gender was a significant predictor of word learning (β = -0.052, p < .007), as was treatment (β = 0.166, p < .001; see Table 5). Language status (LM vs. EO) was not a significant predictor, but the interaction of treatment and language status was at the margin of significance (p = .055), and including the interaction improved the overall model. Of interest, student pretest vocabulary did not interact with treatment in predicting posttest scores. We split the data set to investigate the home language variable more closely (Table 6). The first set of regressions used pretests and gender to predict posttest scores in the comparison schools (r² = .62) and Word Generation schools (r² = .64). In Word Generation schools LM status predicted improved vocabulary (β = -0.053, p = .022), but it was not a significant predictor in comparison schools. The growth for LM and EO students in comparison versus Word Generation schools is plotted in Figure 2.
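The per-month adjustment described above is simple arithmetic: the pre- to posttest gain divided by the number of months between the two tests. The sketch below uses the 8- and 5-month intervals given in the text; the mean scores are hypothetical placeholders, not the values reported in Table 4.

```python
def gain_per_month(pre_mean, post_mean, months_between):
    """Average number of words gained per month between pre- and posttest."""
    if months_between <= 0:
        raise ValueError("months_between must be positive")
    return (post_mean - pre_mean) / months_between


# Intervals between pre- and posttest, as stated in the text.
INTERVAL_MONTHS = {"treatment": 8, "comparison": 5}

# Hypothetical (pre, post) mean scores, for illustration only.
hypothetical_means = {"treatment": (20.0, 24.0), "comparison": (24.0, 26.0)}

for group, (pre, post) in hypothetical_means.items():
    rate = gain_per_month(pre, post, INTERVAL_MONTHS[group])
    print(f"{group}: total gain {post - pre:.1f}, per month {rate:.2f}")
```

Dividing by the interval puts the two groups on a common time scale, so a larger raw gain in the treatment schools is not credited merely to their longer interval between tests.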

Predicting Standardized Test Scores
To determine whether participation in Word Generation had any relationship to performance on the MCAS, a regression model was fit with MCAS scores in April 2008 as the outcome, using gender, treatment status, pretest, and posttest scores as predictors (Table 7). We added an interaction term to see if posttest scores interacted with treatment in predicting MCAS scores (controlling for pretest scores). The interaction term was significant (β = .21, p = .01) and its inclusion improved the model. We further explored the interaction between treatment and vocabulary improvement by splitting the data and refitting the models to data from the treatment and comparison schools separately. The fitted model for comparison school data did not predict MCAS achievement (R² = .41) as well as the fitted model for the treatment school data (R² = .49). In the Word Generation schools student posttest scores (β = 0.527, p < .001) were much stronger predictors of MCAS achievement than pretest scores were (β = 0.201, p < .001), perhaps because the posttest scores captured not only target vocabulary knowledge at the end of the year but also level of student participation in the Word Generation program (Table 8).
Unfortunately, these analyses do not control for baseline reading achievement data, which were available only for a subset of students in our sample (n = 389). For that subgroup, we used fall standardized reading comprehension scores (on the GRADE) as a covariate to determine if the relation between improved vocabulary and MCAS persisted even when controlling for overall reading levels. Results demonstrate both that the GRADE is a strong predictor of spring MCAS scores (β = 0.750, p < .001) and that the interaction between treatment and improvement persists in the model controlling for GRADE. Split file analysis demonstrated the familiar pattern, with vocabulary improvement predicting MCAS scores for students in the treatment schools (β = 0.151, p < .001) but not for students in the comparison schools (see Tables 9 and 10). GRADE scores were also used to determine if better readers learned words more efficiently than less able readers. Results demonstrate that GRADE baseline scores did not predict word learning and that there was no significant interaction between treatment and baseline reading achievement as measured on the GRADE (Table 11).

School-Level Confound
One of the challenges in interpreting these data is that there are significant differences both in the size of the program impact in different Word Generation schools (ranging from 3.7 to 5.1 points improvement on average) and in the language demographics (percentage of LM students) of those schools, making it difficult to disentangle student language status from school treatment effects. To address this issue, we explored language status as a predictor of word learning within the largest school in our sample, a school that also has sizeable numbers of both LM (n = 110) and EO (n = 179) students. Within this school, speaking a language other than English at home predicted better vocabulary posttest scores controlling for pretests (β = 0.073, p = .034). Thus, though school treatment effects complicate our understanding of how language status interacts with participation in Word Generation, there is evidence from at least one large school that the LM students are indeed benefitting differentially from Word Generation.

DISCUSSION
The results of this initial trial of a novel approach to teaching academic language and vocabulary are promising. Students in schools implementing the program learned more of the targeted words than students in comparison schools, even though the latter group performed at a higher level at the start. Language minority students benefited more strongly than EO students, and improvement on the curriculum-specific assessment predicted performance on the state ELA assessment. Although the design of this study precludes making strong causal inferences, these preliminary results are encouraging. In particular, though the significant differences in the language demographics of different Word Generation schools make it difficult to disentangle the effects of student language status from school treatment effects, the LM-EO differences in word learning were replicated within one school. This analysis suggests that confounding effects of school-level effectiveness do not explain the faster word learning of LM students. Instead, we may need to contemplate the possibility that these students were benefiting from effective, engaging, vocabulary-focused pedagogy.
It is of interest to compare the effect size obtained with the Word Generation curriculum to that obtained in other vocabulary interventions. A similarly structured intervention, the Vocabulary Improvement Program (Carlo et al., 2004), obtained an effect size of .50. The Stahl and Fairbanks (1986) meta-analysis of vocabulary curricula reviewed studies with effect sizes ranging as high as 2 under short-term laboratory-teaching conditions and as low as 0 under more authentic educational conditions. Thus, although Word Generation is not just a vocabulary intervention, and by design did not try to teach large numbers of words, its impact on students compares well with that of other successful programs.
It is particularly encouraging that posttest scores on the Word Generation assessments strongly related to performance on the state accountability assessment. One might assume this reflects the coincidence that the words taught also occurred on the state test. However, this simple explanation is undermined by the absence of a similarly strong relationship in the comparison schools. Furthermore, although improvement in the Word Generation schools was significant, it was still modest: about 4 words out of 40 tested. That translates into only about 12 words out of the 120 taught, which can hardly by itself explain a lot of variance on a long and challenging ELA assessment. Rather, we think it likely that improvement on our curriculum-based assessment represents an index of exposure to the Word Generation curriculum, a curriculum that taught new content, deep reading and comprehension skills, discussion, argumentation, and writing. Because the Massachusetts test is a relatively challenging one (arguably the best aligned with the National Assessment of Educational Progress of all the state assessments; McBeath, Reyes, & Ehrlander, 2007), performance on the MCAS is more likely to be related to those complex skills than to specific word knowledge.
A major design feature of Word Generation is the effort to organize instruction around engaging topics and issues. Of course, some of those topics were more successfully engaging than others, a factor that might be related to the success of the students in actually learning the words associated with them. Looking at the words that showed particularly strong gains (percentage of students knowing them at posttest minus percentage knowing them at pretest), and in particular strong relative improvement (ratio of improvement in Word Generation schools to improvement in comparison schools), may provide some hints about the topics that were most salient and engaging. For example, 2 of the 10 best learned words (amnesty, exclude) came from the topic "undocumented immigrants," one likely to be of particular interest to the LM students. The other topic that generated two high-ranking words was "paying students to do well in school" (incentive, enable). Words from topics directly related to adolescence (grades as a criterion for sports participation, children's health/obesity, academic tracking, funding for girls in sports, preventing bullying) were among the better learned, whereas the topic of compulsory voting was associated with the two least well-learned words (enforce, apathy). Of course, it is difficult to disentangle features of the topics from quality of the activities associated with them, the teacher interest in them, and the learnability of the words themselves. In future research we will collect data from students themselves about their interest in the array of topics presented, as a mechanism to inform us about what makes some topics engaging and to test to what degree topic explains word learning.
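The extrapolation from "about 4 words out of 40 tested" to "about 12 words out of the 120 taught" can be checked with a short calculation. It assumes, as a simplification made for this sketch, that the tested words are representative of all the words taught, so that the proportion gained on the test carries over to the full word list; the function name is hypothetical.

```python
WEEKS = 24            # length of the Word Generation sequence
WORDS_PER_WEEK = 5    # target words taught each week
WORDS_TAUGHT = WEEKS * WORDS_PER_WEEK  # 120 words in total


def estimated_words_learned(gain_on_test, items_tested, words_taught):
    """Scale a gain on the sampled test items up to the full word list.

    Assumes the tested words are representative of all words taught.
    """
    if items_tested <= 0:
        raise ValueError("items_tested must be positive")
    return gain_on_test * words_taught / items_tested


# About 4 of the 40 analyzed items gained, scaled to the 120 words taught.
print(estimated_words_learned(4, 40, WORDS_TAUGHT))
```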

Limitations
This study has many limitations. Our attempt to carry out a well-designed quasi-experiment was somewhat compromised by the fact that one comparison school in particular was performing much better than any intervention school, reflecting that school's emphasis on vocabulary across the curriculum. The design was also threatened by the fact that we received access to the comparison schools only partway through the school year, so pretest schedules were disrupted. The implementation schools varied as well in their capacity to access and benefit from the professional development opportunities offered and in their capacity to collect pre- and posttest data from large numbers of students. Though we had GRADE data available as a covariate, it was missing for a large percentage of students in both intervention and comparison schools. We were constrained by the small number of schools and the varying numbers of students across schools, and thus could not fully address the nesting of students within schools in our data analysis.
We are addressing these limitations in our current and imminent work on Word Generation. We have extended the quasi-experimental study to another year, with more schools and tighter control over the data collection procedures. We will be mounting a proper experimental study of Word Generation's impact and will explore how to support its effective implementation at larger scale. We have collected videotapes both of professional development sessions and of teachers implementing various Word Generation lessons to serve as a support for professional development (http://www.wordgeneration.org) or a self-teaching tool. Considerable work also needs to be done to document quality of implementation, to improve available materials, and to extend the curricular topics and activities. Nonetheless, we are greatly encouraged by these preliminary data suggesting the effectiveness of Word Generation in general, and in particular its value for LM learners.

ACKNOWLEDGMENTS
The SERP-BPS field site and thus the original planning for Word Generation were supported by grants to the Strategic Education Research Partnership (SERP) from the Spencer Foundation and the William and Flora Hewlett Foundation; further development and evaluation of Word Generation were supported by a Senior Urban Education Fellowship awarded to Catherine Snow by the Council of Great City Schools. Joshua Lawrence was supported by funds awarded to Catherine Snow by the Spencer Foundation and the Carnegie Corporation of New York. We also acknowledge the funding to SERP from the Lowenstein Foundation, to develop professional development opportunities through www.wordgeneration.org. This research was supported in part by Grant Number R305A050056, National Research and Development Center for English Language Learners, from the Institute of Education Sciences (IES), US Department of Education (USED). The contents do not necessarily represent the positions or policies of IES or USED and readers should not assume endorsement by the federal government for any of the positions or statements expressed herein.