<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Carnegie Knowledge Network &#187; Value-Added</title>
	<atom:link href="http://www.carnegieknowledgenetwork.org/briefs/value-added/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.carnegieknowledgenetwork.org</link>
	<description></description>
	<lastBuildDate>Fri, 18 Sep 2015 16:49:31 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.7.1</generator>
	<item>
		<title>Carnegie Knowledge Network Concluding Recommendations</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/concluding-recommendations/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/concluding-recommendations/#comments</comments>
		<pubDate>Fri, 30 Jan 2015 19:17:28 +0000</pubDate>
		<dc:creator><![CDATA[Joanna Huang]]></dc:creator>
				<category><![CDATA[Knowledge Briefs]]></category>
		<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=2216</guid>
		<description><![CDATA[<span style="color: #999;">CONCLUDING RECOMMENDATIONS</span><br />by <a href="/technical-panel">Dan Goldhaber, Douglas N. Harris, Susanna Loeb, Daniel F. McCaffrey, and Stephen W. Raudenbush</a>
<br />
A synthesis of the key takeaways from the Carnegie Knowledge Network and recommendations from the Carnegie Panelists.]]></description>
				<content:encoded><![CDATA[<div align="right"><a href="/wp-content/uploads/2015/01/CKN-Concluding-Recommendations-Formatted_01-29-15.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png" /></a></div>
<p><a href="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2015/01/Road_Photo.jpg.jpeg"><img class="alignnone size-medium wp-image-2075" alt="RoadMap_image" src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2015/01/Road_Photo.jpg.jpeg" width="560" height="350" /></a></p>
<h3>Dan Goldhaber<br />
Douglas N. Harris<br />
Susanna Loeb<br />
Daniel F. McCaffrey<br />
Stephen W. Raudenbush</h3>
<p>It is common knowledge that teacher quality is a key in-school factor affecting student achievement. While the quality of teaching clearly matters for how much students learn, it is challenging to measure. One common method has been to evaluate teachers on the level of their students’ end-of-year test scores, but this approach favors teachers whose students begin the year at a high academic level. Value-added methodology is an alternative use of annual test score data. Rather than judging a teacher’s effectiveness by the level of student achievement on a test at the end of the year, value-added methodology seeks to isolate the teacher’s contribution to student growth from other factors that contribute to student achievement, such as prior achievement or socio-economic status. Methodologists generally agree that value-added estimates can serve as a gauge of some dimensions of teacher effectiveness.</p>
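<p>To make the methodology concrete, the sketch below shows one minimal way a value-added estimate could be computed: regress each student’s current score on prior achievement and a background measure, then average the residuals by teacher. This is an illustration only, not the model used by any particular state or district; the column names are hypothetical, and operational systems use richer models with many more controls and statistical shrinkage.</p>
<pre><code># A minimal value-added sketch. Column names ("score", "prior_score",
# "frpl", "teacher_id") are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

def naive_value_added(df: pd.DataFrame) -> pd.Series:
    # Adjust for prior achievement and a poverty indicator
    # (free/reduced-price lunch), the kinds of factors the text mentions.
    model = smf.ols("score ~ prior_score + frpl", data=df).fit()
    df = df.assign(residual=model.resid)
    # A teacher's raw value-added is the mean residual of her students.
    return df.groupby("teacher_id")["residual"].mean()
</code></pre>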
<p>Yet the experts also agree that value-added measures have significant limitations. Most clearly, they do not capture a teacher’s entire contribution to student learning because they are based solely on test performance, itself an imprecise signal of teacher effectiveness. There is disagreement among experts on the extent to which value added is biased. For instance, there is room for disagreement about whether or how much statistical models can or should account for factors such as poverty, or how teachers can be compared across different classroom contexts. The comparability of value added is greater among teachers working in similar contexts, as is true of all forms of performance evaluation. Moreover, value-added measures are likely to vary according to the tests used to compute them. Because tests cover different content, they reflect different areas of student knowledge. Value added measures what the tests measure, and because these tests capture only a slice of what students are learning, value added reflects only that slice. Teachers could be effective or ineffective on outcomes not captured by students’ scores on these tests.</p>
<p>As important as it is to judge value-added measures on their validity and reliability, it is even more important to know how much they can benefit schools and students when used in practice. Largely because value-added policies are so new, we know very little about how they work in practice. Thus there remains considerable debate about how value-added measures should be used to inform personnel policies, if they are to be used at all. As the real impacts of reforms using value added emerge and as researchers assess these effects, we will learn more about the consequences of different uses of value added. Until then, the CKN briefs have given us a rich picture of current research on the use of value-added measures for teacher evaluation. Reviewing the briefs and their implications for practitioners, we arrived at the following recommendations:</p>
<p><em>1. When using value added, allow educational leaders to make judgments in interpreting the value-added results in light of other available measures of teacher quality and the principals’ own assessments.</em></p>
<p>Studies provide evidence that value-added measures meaningfully distinguish between teachers whose future students will consistently perform well and teachers whose students will not.<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a> However, the studies also highlight the shortcomings of value added, among them instability in measures across time;<a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a> systematically lower values for teachers of some types of students;<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a> and difficulty in comparing teachers across schools.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a> These issues are potential problems for any form of evaluation, including teacher observation.<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a> The principal and other school leaders are often well situated to evaluate the many circumstances in which quantitative measures may give an incomplete picture. For instance, school leaders can see a teacher’s indirect contributions to learning through her overall work in the school, or her contribution to student outcomes not measured by achievement tests. Educational leaders likewise can make allowances for unexpected events in a teacher’s environment—a sudden change in assignment, for instance—that can’t be captured by value-added models. Human judgment also may reduce teachers’ concerns about being assessed only by test scores or being subjected to statistical procedures they don’t trust or understand.</p>
<p><em>2. Use value added with other measures that are valid and have variation.</em></p>
<p>Because education and teacher evaluations have multiple goals,<a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a> assessing progress toward these goals will require multiple measures. Moreover, evaluation programs that combine value added with other measures, at least in some instances, have led to instructional improvement.<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a> While ultimately all measures should be evaluated on the same set of standards, each measure will have strengths and weaknesses. Using the measures together helps reduce errors and provides a sense of balance that may make them more acceptable to those being evaluated.<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a> When multiple measures are included in an evaluation system, all of the measures must vary across teachers in order to matter at all for the evaluation. A measure that rates all teachers at the same level will carry no weight when it is combined with other measures to classify teacher performance. For example, suppose that a school system uses both observation and value-added scores to determine teacher effectiveness. Then suppose that all of the teachers in the school system are rated proficient according to observations by their principals. The only way a teacher can be deemed “in need of improvement” is if she has a low value-added score. The only way she can be classified as “exemplary” is if she has a high value-added score. Even though the system uses two measures, only the value-added measure differentiates teachers in the summative rating.</p>
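<p>A small numerical sketch (with made-up ratings) illustrates the point: when one measure does not vary, a weighted composite ranks teachers exactly as the other measure would alone.</p>
<pre><code># Hypothetical composite rating: 50% observation, 50% value-added.
# Every observation rating is identical ("proficient" = 3), so the
# composite ordering is driven entirely by value-added.
observation = {"A": 3, "B": 3, "C": 3}          # no variation
value_added = {"A": 0.9, "B": -0.4, "C": 0.1}   # varies across teachers

composite = {t: 0.5 * observation[t] + 0.5 * value_added[t]
             for t in observation}

ranked = sorted(composite, key=composite.get, reverse=True)
print(ranked)  # ['A', 'C', 'B'] -- the value-added ordering exactly
</code></pre>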
<p>Multiple measures don’t necessarily have to be combined into a single rating scale; for instance, the value-added estimate could be used mainly to identify candidates for more intensive observation.<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a> Schools would then measure the performance of the identified teachers with more intensive measures that can provide a more accurate assessment of teacher performance but are too costly to collect for all teachers. Multiple measures may also take on many forms. Along with observations, scores on student learning objectives, and survey responses,<a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a> they might also include the value that teachers bring to student outcomes other than achievement test scores. These outcomes might include progressing through courses, graduating from high school, even attending or completing college.<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a> If carefully developed and thoroughly evaluated, such long-term measures could help give schools and districts a richer picture of teacher effectiveness.</p>
<p>The decision to use value-added measures should not depend on whether they are perfect—all measures are flawed. Rather, it should depend on what other measures of effectiveness are available or feasible and on the quality of those measures. If principals are skilled, have goals aligned with district or state goals, and have adequate time for evaluation, then they may be a better source of information on teacher effectiveness than value-added measures, because they can observe more than test scores and see teachers throughout the course of the year. However, not all schools have principals who can do this effectively.</p>
<p><em>3. Choose a test that measures knowledge that is valued.</em></p>
<p>Value added is sensitive to whatever test students take; different tests lead to different rankings of teacher effectiveness.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a> If we want value-added scores to yield information about teaching beyond what is currently measured by standardized tests, we must use a test that measures knowledge we value. Schools today focus largely on the content measured by standardized accountability tests. Tests used to determine value added should give teachers incentive to teach knowledge that is valued.</p>
<p><em>4. Consider differences in teaching contexts when using value added to compare teachers.</em></p>
<p>Standard methods to statistically adjust for students’ prior achievement can produce unbiased (or nearly unbiased) estimates of teacher value added. However, the research supporting this finding is based on studies in which teachers were randomly assigned to students within the same school, mainly within elementary schools, while most value-added systems compare teachers in different schools.</p>
<p>The distinction between within-school and between-school comparisons is an important one, because teachers within the same school share the same organizational conditions (leadership and resources); are subject to similar contextual factors (neighborhood safety, parental support, norms that favor academic achievement); and, particularly in elementary school, tend to teach students with similar levels of prior achievement. In a heterogeneous district, schools can vary widely on all these factors. If well-managed schools foster better teaching and learning than poorly managed ones, then teachers in well-managed schools will outperform equally able teachers in poorly managed schools. If some contextual conditions promote more student learning than others, teachers in schools with more favorable conditions will outperform equally able teachers in schools with less favorable conditions. Our confidence in value-added estimates of teachers will increase with the degree of similarity between the classrooms of the teachers being compared. Moreover, most tests are not designed to compare the learning gains of students who start at very different levels of achievement.</p>
<p>For these reasons, when education leaders are making consequential personnel decisions informed by value added, they would be wise to take into account the different contexts in which teachers work. They should also look at value-added scores across the district to detect general patterns in the distribution of teacher effectiveness. This practice serves as a check on the equitability of hiring and placement policies.</p>
<p><em>5. Take specific steps to ensure the overall credibility of the teacher evaluation system.</em></p>
<p>Many of the CKN briefs identified threats to the validity of value added as a measure of teacher effectiveness for some teachers.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a> Inaccurate value-added estimates can lead to decisions harmful to both teachers and students.<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a> For value-added measures to be useful, they must be subject to the following minimal conditions: the tests on which they are based should reliably capture valued student outcomes; data should accurately identify which students are in which teachers’ classes; and teachers should work with enough students to produce value-added estimates that are relatively free of noise.</p>
<p>School systems must therefore look for patterns that suggest errors and use other data to confirm that teachers who consistently receive high or low value-added scores are truly high- or low-performers. For example, if all the special education teachers receive low value-added scores, but those scores are belied by classroom observations and student surveys, educational leaders should examine whether there might be problems with one or the other measure.<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a> System leaders should clearly document the statistical methods used for calculating value added and take steps to secure data accuracy. They should provide evidence of that accuracy, of the checks used to ensure the appropriateness of the evaluation models, of the checks used to confirm the assumptions made by the models, and of the tests of the robustness of the results. When communicating individual teacher value-added scores, reports should present the scores as <em>estimates</em> and be careful not to overstate the degree of their precision.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a></p>
<p>Evaluation measures that capture meaningful differences in teaching quality are important components of a system that ensures quality learning opportunities for all students. There is great risk, however, of teacher evaluation systems not performing as planned. The human resources literature is rife with data and theories on why performance measurement systems fail to yield the rich information and informed decisions they were intended to provide.<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a> Likewise, the literature on performance measurement for public employees offers numerous examples of unintended consequences, including employees gaming the system or trying to improve their performance ratings without improving their actual performance. W. Edwards Deming, the quality improvement expert whose ideas revolutionized industry, cautioned organizations against using performance measurement for individuals, believing that doing so would lead to capricious behavior, as well as fear that stifles creativity.<a class="fn-ref-mark" href="#footnote-18" id="refmark-18"><sup>[18]</sup></a> This warning should be taken seriously. If school leaders are effectively evaluating and motivating teachers using measures other than value-added measures, then value-added measures may provide little benefit for students. If local evaluations are not effective, then value-added measures may be a beneficial tool, at least until schools implement better systems.</p>
<p>Clearly, it is challenging to use evaluation and performance monitoring to improve outcomes. States and school districts will need to take deliberate steps to increase the chances that their systems will work as intended. Because current evaluations by managers tend to be uniformly high, school systems will need to monitor evaluation data for variability in performance ratings and promote and reward accurate evaluations. Systems must make sure that teachers are not narrowing the curriculum or manipulating student assignments to improve their value-added scores. For instance, schools might assess students periodically on tests that are different from the state test but cover the same standards. Schools might also monitor student and teacher assignments to identify unusual changes over time.</p>
<p>School systems should identify and monitor the factors that should not change when value added is used to assess performance by watching for evidence that administrators or teachers are taking unwanted steps to influence their value-added scores. For instance, the proportion of students classified with certain learning disabilities, which is susceptible to manipulation, should not change.<a class="fn-ref-mark" href="#footnote-19" id="refmark-19"><sup>[19]</sup></a> School systems should also list the factors that should change with a new evaluation system, such as the number of classroom observations performed or the alignment between the type of professional development chosen and teachers’ needs. They should then monitor these factors for evidence that changes are occurring, and plan for how to respond if the system fails to behave as expected.</p>
<p>Data from millions of students and thousands of teachers as well as careful thought and analysis have taught us much about the statistical properties of value added. Yet these measures are only just beginning to be put into widespread use for teacher evaluation.<a class="fn-ref-mark" href="#footnote-20" id="refmark-20"><sup>[20]</sup></a> As these new systems roll out, states and districts should be learning and experimenting, determining what works and what doesn’t. They should use this time to collect data and monitor their evaluation systems, using what they learn to make revisions. They should share their experiences with other school systems and learn from them in turn. The research community must also work with states and districts to identify the best practices for teacher evaluation, assessing whether the new systems really do yield better teaching and learning. Finally, the decisions school systems reach about value added—including whether and how to use the measure for evaluation—should not be seen as one-time events. School systems are now gathering a wealth of data from which we can learn how to make educational organizations more effective at conducting evaluations to improve teaching. Their evaluation systems must make use of that data, and evolve based on what that data shows.</p>
<h2 class="refbtn">References +</h2>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">Daniel F. McCaffrey, “Do Value-Added Methods Level the Playing Field for Teachers?” (Carnegie Knowledge Network Brief No. 2, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, October 2012), http://www.carnegieknowledgenetwork.org/briefs/value-added/level-playing-field/; and Stephen W. Raudenbush,“What Do We Know About the Long-Term Impacts of Teacher Value Added?” (Carnegie Knowledge Network Brief No. 15, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, March 2014), http://www.carnegieknowledgenetwork.org/briefs/long-term-impacts/.<a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">Susanna Loeb and Christopher A. Candelaria, “How Stable Are Value-Added Estimates across Years, Subjects, and Student Groups?” (Carnegie Knowledge Network Brief No. 3, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, October 2012), http://www.carnegieknowledgenetwork.org/briefs/value-added/value-added-stability/.<a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">Douglas N. Harris and Andrew Anderson, “Does Value Added Work Better in Elementary Than in Secondary Grades?” (Carnegie Knowledge Network Brief No. 7, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, May 2013), http://www.carnegieknowledgenetwork.org/briefs/value-added/grades/; and McCaffrey, “Value-Added Methods.”<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text">Stephen W. Raudenbush, “What Do We Know About Using Value-Added to Compare Teachers Who Work in Different Schools?” (Carnegie Knowledge Network Brief No. 10, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, August 2013), http://www.carnegieknowledgenetwork.org/briefs/comparing-teaching/.<a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">Douglas N. Harris, “How Do Value-Added Indicators Compare to Other Measures of Teacher Effectiveness?” (Carnegie Knowledge Network Brief No. 5, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, October 2012),   http://www.carnegieknowledgenetwork.org/briefs/value-added/value-added-other-measures/.<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">Douglas N. Harris, “How Might We Use Multiple Measures for Teacher Accountability?” (Carnegie Knowledge Network Brief No. 11, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, October 2013), http://www.carnegieknowledgenetwork.org/briefs/multiple_measures/.<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">Susanna Loeb, “How Can Value-Added Measures Be Used for Teacher Improvement?” (Carnegie Knowledge Network Brief No. 13, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, December 2013), http://www.carnegieknowledgenetwork.org/briefs/teacher_improvement/.<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">Harris, “Value-Added Indicators.”<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text">Harris, “Multiple Measures.”<a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">Harris, “Value-Added Indicators.”<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">Raudenbush, “Long-Term Impacts.”<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">Douglas F. McCaffrey, “Will Teacher Value-Added Scores Change when Accountability Tests Change?” (Carnegie Knowledge Network Brief No. 8, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, June 2013), http://www.carnegieknowledgenetwork.org/briefs/value-added/accountability-tests/.<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">Dan Goldhaber and Roddy Theobald, “Do Different Value-Added <span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-1014">Models</span> Tell Us the Same Things?” (Carnegie Knowledge Network Brief No. 4, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, October 2012), http://www.carnegieknowledgenetwork.org/briefs/value-added/different-growth-models/.<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">Dan Goldhaber and Susanna Loeb, “What Do We Know About the Tradeoffs Associated with Teacher Misclassification in High Stakes Personnel Decisions?” (Carnegie Knowledge Network Brief No. 6, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, April 2013), http://www.carnegieknowledgenetwork.org/briefs/value-added/teacher-misclassifications/.<a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">Douglas N. Harris, “How Might We Use Multiple Measures for Teacher Accountability?” (Carnegie Knowledge Network Brief No. X, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, May 2013), http://www.carnegieknowledgenetwork.org/briefs/multiple_measures/.<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text">Stephen W. Raudenbush and Marshall Jean, “How Should Educators Interpret Value Added Scores?” (Carnegie Knowledge Network Brief No. 1, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, October 2012), http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/.<a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">For a discussion of challenges to using performance measures for incentive programs, see Loeb, “Teacher Improvement.”<a href="#refmark-17"></a></li>
<li id="footnote-18" class="fn-text">Sharon L. Lohr, “The Value Deming&#8217;s Ideas Can Add to Educational Evaluation,” Statistics, Politics, and Policy 3, issue 2 (2012).<a href="#refmark-18"></a></li>
<li id="footnote-19" class="fn-text">Daniel F. McCaffrey and Heather Buzick, “Is Value Added Accurate for Teachers of Students with Disabilities?” (Carnegie Knowledge Network Brief No. 14, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, January 2014), http://www.carnegieknowledgenetwork.org/briefs/teacher_disabilities/.<a href="#refmark-19"></a></li>
<li id="footnote-20" class="fn-text">Dan Goldhaber and Susana Loeb, “What Do We Know About the Tradeoffs Associated with Teacher Misclassification in High Stakes Personnel Decisions?” (Carnegie Knowledge Network Brief No. X, Carnegie Foundation for the Advancement of Teaching, Stanford, CA, May 2013), http://www.carnegieknowledgenetwork.org/briefs/value-added/teacher-misclassifications/; and McCaffrey, “Do Value-Added Methods Level?”<a href="#refmark-20"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/concluding-recommendations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What Do We Know About the Long-term Impacts of Teacher Value-Added?</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/long-term-impacts/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/long-term-impacts/#comments</comments>
		<pubDate>Thu, 27 Mar 2014 21:30:31 +0000</pubDate>
		<dc:creator><![CDATA[gotake]]></dc:creator>
				<category><![CDATA[Knowledge Briefs]]></category>
		<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=2117</guid>
		<description><![CDATA[<span style="color: #999;">KNOWLEDGE BRIEF 15</span><br />by <a href="/technical-panel/#raudenbush">Stephen Raudenbush</a> 
<br />
Student scores on standardized tests are used as measures for teacher accountability, but, arguably, helping children score well on an achievement test is of little value in itself. The question is whether a test score gain in a given year of schooling represents growth in skills that matter over the long term. In this brief, Raudenbush reviews the research on the relevance of a teacher's value-added to lasting cognitive and non-cognitive skills that help prepare students for success later in life.]]></description>
				<content:encoded><![CDATA[<div align="right"><a href="/wp-content/uploads/2014/03/CKN_Raudenbush_Long-Term-Impacts_v2.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png" /></a></div>
<p><img class="alignnone size-full wp-image-1994" alt="gb_0351" src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2014/03/long-term-effects.jpg" width="580" height="240" /></p>
<div class="biobox"><img alt="Raudenbush" src="/wp-content/uploads/2012/03/Raudenbush.jpg" /><br />
<a href="/technical-panel/#raudenbush">Stephen Raudenbush</a><br />
<strong>Chair</strong><br />
Committee on Education<br />
University of Chicago</div>
<h2>Stephen W. Raudenbush</h2>
<h3>Highlights:</h3>
<ul>
<li>Two recent studies provide evidence that attending the class of a high-value-added teacher predicts higher-than-expected educational attainment, earnings, and other adult outcomes.</li>
<li>In one study, part of the impact of attending an effective classroom may have been attributable to small class size; in the other, part of the effect may have been attributable to the effectiveness of the school.</li>
<li>Teacher value-added scores “fade out” over time: knowing that a student had a teacher with a high value-added score one year provides little information about how well that student will fare on achievement tests several years later.</li>
<li>The studies provide important new evidence on the significance of early classroom experience to later success.</li>
</ul>
<h3>Introduction</h3>
<p>Proposals to evaluate teachers based on their “value-added” to student test scores generate intense debate. Underlying the debate are concerns about three factors: <em>bias</em>, <em>precision</em>, and <em>relevance</em>. Previous Carnegie Foundation briefs have detailed the reasons why the first two are significant concerns.<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a><a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a> But even if value-added scores were unbiased and reasonably precise, their usefulness for evaluating teaching would still depend on the third factor—their relevance to the aims of schooling. After all, helping children do well on an achievement test is of little value in itself. The question is whether a test score gain in a given year of schooling represents growth in skills that matter over the long term.</p>
<p>My aim in the current brief is to consider a key aspect of the relevance of value-added scores: their <em>predictive validity</em>—whether teachers who produce high value-added on achievement tests also engender lasting cognitive and non-cognitive skills that help prepare their students for success in later life.</p>
<p>One measure of the predictive validity of value-added is the extent to which it persists or “fades out” over time. At issue is whether elevated value-added scores displayed during the initial year persist in subsequent years. I review 10 studies of the persistence of value-added scores. In each case, researchers compute value-added scores in an “initial year” and then in subsequent years. All studies show that value-added scores tend to fade out over time. Five years after the initial year, it appears that 75 percent to 100 percent of the initial impact has disappeared.</p>
<p>One might infer from these results that attending the class of a teacher with high value-added in a given year has little consequence beyond that year. However, two careful, large-scale studies, reviewed in detail below, suggest that despite the lack of persistence of value-added on future test scores, one year of experience with a high-value-added teacher predicts higher rates of college attendance and adult earnings, as well as other important outcomes. While the effects are not large for individual students, they become substantial when they are aggregated over the students a teacher encounters. Moreover, the cumulative effects of a sequence of effective teachers may be substantial for an individual student.</p>
<p>The seeming contradiction between the lack of persistence of value-added on achievement test scores and the significant impacts on adult outcomes frames an important puzzle for future research. One possible explanation is that teachers who produce comparatively high gains on test scores are also effective in producing gains in other skills—deemed “non-cognitive” skills—that matter in the labor market and in other aspects of adult life. And one study suggests that high-value-added teachers are indeed comparatively effective in promoting high levels of effort, initiative, and classroom participation among their students. Another possibility is that teachers who produce high gains on test scores also produce high gains on deeper cognitive skills, such as reasoning and problem-solving, that current tests may not fully capture but that pay off in the labor market.<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a></p>
<p>Skeptics may argue that value-added scores in the initial year and estimates of later impacts share a bias. For example, it might be that high-value-added teachers work in particularly effective schools, and that students who attend these schools for sustained periods see not only high initial test scores but also favorable long-term effects. Despite efforts by researchers to identify and remove such biases, they cannot be entirely discounted. More specifically, Chetty et al. (2013) attribute students’ labor market gains to the value-added of individual teachers, even though some of these gains may stem from attending an effective school.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a> And teacher effects estimated by Chetty et al. (2011) appear to include the impact of reduced class size in addition to the impact of individual teacher skill.</p>
<p>The subsequent sections of this brief consider in more detail (a) the size of the initial value-added effects; (b) the persistence of initial value-added; (c) reported impacts on adult outcomes; (d) potential explanations for these findings and suggestions for further research; and (e) implications for school practice.</p>
<h3>Magnitude of Initial Value-Added Effects</h3>
<p>To calibrate the importance of teacher value-added for students’ future outcomes, we need to think a bit about how much teachers vary in value-added for a single year. If these initial scores vary a great deal, they might also strongly predict later outcomes. But if they vary only a little, one would expect them to be of little use in predicting future outcomes. The consensus seems to be that attending the class of a teacher who is one standard deviation above her peers in value-added is associated with a gain of 10 to 15 percent of a standard deviation in student achievement.</p>
<div class="pull-left">The “better” teacher’s impact may seem small, but it would be quite significant when aggregated over all the students in a class.</div>
<p>To make this meaningful, consider a comparatively “good” teacher—one who is in the 70th percentile of the teacher distribution in value-added. A teacher one standard deviation below such a teacher is at about the 30th percentile. Studies so far tell us that we can expect the “better” teacher’s students to score about 6 percentile points higher, on average, on a standardized achievement test than the students of the “worse” teacher. This is a difference between being at the 53rd percentile and the 47th percentile. While this impact may seem small, it would be quite significant when aggregated over all the students in a class, if it persisted and laid the basis for lasting differences in socially valued outcomes. Educational interventions that can produce an effect this large in one year are often regarded as quite successful.</p>
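<p>For readers who want to check the percentile arithmetic, the short sketch below reproduces it under the implicit assumption that teacher effects and student scores are roughly normally distributed:</p>
<pre><code># Reproduce the brief's percentile arithmetic under normality.
from scipy.stats import norm

# A teacher at the 70th percentile sits about 0.52 sd above the mean;
# a teacher one full sd lower lands near the 30th percentile.
z_good = norm.ppf(0.70)            # ~0.52
print(norm.cdf(z_good - 1.0))      # ~0.32, i.e., about the 30th percentile

# A 0.15-sd difference in student achievement, centered on the median,
# separates roughly the 53rd from the 47th percentile.
print(norm.cdf(0.075))             # ~0.53
print(norm.cdf(-0.075))            # ~0.47
</code></pre>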
<h3>Persistence of Value-added</h3>
<div class="pull-right">There is a substantial decay over time in value-added to future achievement test scores.</div>
<p>I found 10 studies of the persistence of initial value-added over subsequent years, and these are listed in Table 1. These studies are based on computation of a teacher’s value-added to the scores on tests taken one, two, or more years after a student has encountered that teacher. Looking at the table, consider again two teachers who differ by one standard deviation on value-added in the initial year. How much of this initial difference remains one year later? The lowest estimates suggest that only about 18 to 25 percent of the initial difference persists. The most optimistic estimates suggest that 50 percent does. The median is 24 percent, meaning that about a quarter of the initial value-added remains after one year. Fewer studies follow students for more than one year, but those that do suggest that initial differences continue to fade, albeit perhaps at a slower rate. Two studies follow students for more than three years after the initial year, and these suggest that 25 percent or less of the initial difference persists. The consensus seems to be that there is a substantial decay over time in value-added to future achievement test scores.</p>
<div>
<hr align="left" size="1" width="33%" />
<p><b>Table 1: Persistence of Value-Added After Initial Year as Fraction of Value-Added During the Initial Year</b></p>
</div>
<table width="505" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td class="table_txt" valign="top" width="100"><b>Study</b></td>
<td valign="top" width="239"><b>Sample</b></td>
<td valign="top" width="72"><b>Yr. 1 </b></td>
<td valign="top" width="32"><b>Yr. 2 </b></td>
<td valign="top" width="32"><b>Yr. 3 </b></td>
<td valign="top" width="32"><b>Yr. &gt; 3 </b></td>
</tr>
<tr>
<td valign="top" width="100">Kinsler (2012)<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a></td>
<td valign="top" width="239">N=689,641 students, grades 3-5, 1998-2005, in North Carolina</td>
<td valign="top" width="72">.24 (math)<br />
.14 (reading)</td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
</tr>
<tr>
<td valign="top" width="100">Master, Loeb, and Wycoff, 2014</td>
<td valign="top" width="239">N=700,000 students, grades 3-8, 2005-2226 in New York City</td>
<td valign="top" width="72">.19 (math)<br />
.21 (language arts)</td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
</tr>
<tr>
<td valign="top" width="100">McCaffrey et al. (2004)<a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a></td>
<td valign="top" width="239">N=678, grades 3-5, large suburban district</td>
<td valign="top" width="72">.25</td>
<td valign="top" width="32">.15</td>
<td valign="top" width="32">&#8211;</td>
<td valign="top" width="32">&#8211;</td>
</tr>
<tr>
<td valign="top" width="100">Lockwood et al.<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a></td>
<td valign="top" width="239">N=10,000, grades 1-5, large urban district</td>
<td valign="top" width="72">.18</td>
<td valign="top" width="32">.15</td>
<td valign="top" width="32">.14</td>
<td valign="top" width="32">.12</td>
</tr>
<tr>
<td valign="top" width="100">Kane and Staiger (2008)<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a></td>
<td valign="top" width="239">97 pairs of teachers, grades 2-5, randomization to students to teachers within pairs</td>
<td valign="top" width="72">.50</td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
</tr>
<tr>
<td valign="top" width="100">Jacob, Lefgren, and Sims (2010)<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a></td>
<td valign="top" width="239">n=18,240, grades 4-15, mid-size Western District</td>
<td valign="top" width="72">.20</td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
</tr>
<tr>
<td valign="top" width="100">Rothstein (2010)<a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a></td>
<td valign="top" width="239">n=99,071, grades 3-5, North Carolina statewide</td>
<td valign="top" width="72">.27 (math)<br />
.33 (reading)</td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
</tr>
<tr>
<td valign="top" width="100">Measurement of Effective Teaching (2012)</td>
<td valign="top" width="239">1811 teachers randomized within schools to student rosters, grades 4-8 in 6 school districts</td>
<td valign="top" width="72">.45</td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
</tr>
<tr>
<td valign="top" width="100">Chetty et al. (2012)</td>
<td valign="top" width="239">10,992 students randomized to classes within 79 schools in Tennessee</td>
<td valign="top" width="72"></td>
<td valign="top" width="32"></td>
<td valign="top" width="32"></td>
<td valign="top" width="32">0</td>
</tr>
<tr>
<td valign="top" width="100">Chetty et al. (2013)</td>
<td valign="top" width="239">2.5 million children grades 3-8 in New York</td>
<td valign="top" width="72">.50</td>
<td valign="top" width="32">.40</td>
<td valign="top" width="32">.20</td>
<td valign="top" width="32">.20</td>
</tr>
</tbody>
</table>
<hr align="left" size="1" width="33%" />
<p>&nbsp;</p>
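<p>As a rough check on the “about a quarter” summary, one can take the median of the Year-1 entries in Table 1, treating math and reading estimates as separate values where a study reports both:</p>
<pre><code># Year-1 persistence fractions transcribed from Table 1.
from statistics import median

year1 = [0.24, 0.14,  # Kinsler (math, reading)
         0.19, 0.21,  # Master, Loeb, and Wyckoff
         0.25,        # McCaffrey et al.
         0.18,        # Lockwood et al.
         0.50,        # Kane and Staiger
         0.20,        # Jacob, Lefgren, and Sims
         0.27, 0.33,  # Rothstein (math, reading)
         0.45,        # Measures of Effective Teaching
         0.50]        # Chetty et al. (2013)
print(median(year1))  # 0.245 -- roughly a quarter, as the text notes
</code></pre>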
<p>The researchers cited in Table 1 have offered several explanations for the apparent fade-out of value-added scores. One is that high value-added scores in the initial year reflect teacher efforts to “teach to the test” rather than to produce meaningful skills. While plausible in light of the results in Table 1, this explanation implies that exposure to a teacher who produces high value-added would not improve a range of favorable long-term outcomes, since it seems implausible that the teachers who are best at teaching to the test are also best at fostering more general skills. Testing this explanation requires us to ask whether exposure to a high-value-added teacher has long-term benefits, and that is the next topic of this brief.</p>
<p>Second, it may be that teaching in the initial year produces lasting gains on the content the initial tests measure but that tests in subsequent years measure different skills. For example, the initial test might measure a child’s ability to add double-digit numbers, while the later test might assess the child’s ability to multiply and divide fractions. Knowing how to add may only modestly predict knowing how to manipulate fractions. This explanation would seem, however, to imply that we would see less persistence in math scores than in reading scores, because the skills required to master a series of mathematical skills are believed to change more rapidly over time than the skills required for reading. Yet we see low persistence of value-added scores in reading as well as math.<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a></p>
<p>A third explanation is that subsequent teachers may be constrained in their ability to capitalize on what earlier teachers have taught. This explanation would be plausible if a teacher’s class were composed of a small subset of students who gained considerable skill during the prior year and a larger subset of students whose previous gains were modest. The teacher of such a class might be inclined to pitch the instructional level to the larger set of lower-achieving students, preventing those who had benefited from prior good teaching from advancing quickly. This explanation implies that high value-added in the initial year would not predict favorable long-term outcomes.</p>
<h3>Long-Term Impact of Value-added</h3>
<p>Two of the 10 studies listed in Table 1 followed students from their elementary school classrooms into adulthood, obtaining data on long-term outcomes, including college attendance and the quality of the college attended around age 20, earnings at age 28, the quality of the neighborhood of residence during adulthood, and teen parenthood. The two studies differ in their design, but taken together, they suggest that early classroom experience influences long-term outcomes and that teacher skill is a key source of these impacts.</p>
<p>The first of these studies is based on a pioneering experiment in Tennessee that tested the impact of class size on student learning.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a> In this study, which covered a period in the 1980s, students in 80 schools were assigned at random to kindergarten and first- and second-grade classrooms that were designated as large or small. Teachers were also assigned to classrooms at random. Remarkably, a team of researchers was able to obtain extensive administrative data on the long-term outcomes of this experiment.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a> The random assignment of students to classrooms (within schools) provided a solid basis for assessing the long-term impacts of classroom membership. These are reported in Column 1 of Table 2 below.</p>
<p>Importantly for our purposes, these impacts are not necessarily the impact of teacher effectiveness alone. A classroom might be effective because of its small class size or because of random variation in peer composition. However ill-defined the differences, the beauty of this study is that researchers were able to characterize the distribution of classroom impacts. Column 1 indicates that two classrooms that differed by one standard deviation in value-added produced a student learning difference of nearly one-third of a standard deviation. In practical terms, this is a difference of about 9 percentile points. More remarkably, the results imply that students attending the more effective classroom will have earned, on average, about $1,520 per year more as young adults than students who attended the less effective classroom. The authors were able to establish that only a part of this effect can be explained by class size or peer composition. Moreover, because random assignment of students to teachers occurred within schools, this effect cannot be attributed to the overall effectiveness of the school. It is instead attributable to the impact of the classroom assignment.</p>
<div class="pull-left">Attending a classroom that increased test scores was reported to increase college attendance, the quality of the college attended and earnings.</div>
<p>The results in Column 1 suggest that the classroom to which students were assigned had an important influence on later outcomes. But they do not suggest that early gains on test scores help us explain those classroom effects. The next question the authors asked was whether the classes that specifically boosted test scores were also the ones that improved long-term outcomes. The answer was yes. Attending a classroom that increased test scores by one standard deviation (about 8.8 percentile points) during the initial year of the experiment was reported to increase college attendance, the quality of the college attended (as indicated by the mean adult earnings of its graduates), and earnings. The impact on college attendance was small (just over a quarter of one percentage point in a sample in which 45.5 percent attended college), as was the impact on college quality. However, the earnings impact of $1,619 per year is large when aggregated over all of the students in the classroom.</p>
<p>Three features of this study are notable. First, because it is based on random assignment of students, it is free of key sources of bias (see Note 1). Second, it does not establish the predictive validity of teacher value-added scores per se, but rather, the more global effect of attending a classroom that worked well for a variety of possible reasons, including its size. But the study does suggest that classrooms that produce test score gains also produce valuable long-term outcomes. Third, it compares classrooms within the same school. In contrast, value-added scores are typically computed for teachers who work in different schools—comparisons that pose special challenges to the validity of conclusions drawn.<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a></p>
<p>The second study of long-term outcomes directly addressed the question of teacher value-added by comparing teachers who work in different schools.<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a> It used a sample of 2.5 million children attending New York City schools in grades 3-8, and it used administrative records to follow them into adulthood. This study was not based on random assignment of children to classrooms. However, the authors took care to identify and control for the sources of bias that might arise when it is not possible to conduct a randomized experiment.</p>
<div>
<hr align="left" size="1" width="33%" />
<p><b>Table 2: Impacts of Value-Added on Adult Outcomes</b></p>
</div>
<table width="522" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="108"></td>
<td valign="top" width="126"><b>Impact of classroom quality overall (Chetty et al. 2011)</b></td>
<td valign="top" width="144"><b>Impact of classroom value-added (Chetty et al. 2011)</b></td>
<td valign="top" width="144"><b>Impact of teacher value-added (Chetty et al. 2013)</b></td>
</tr>
<tr>
<td valign="top" width="108"><b>Initial test scores</b></td>
<td valign="top" width="126">8.8 percentiles (.32 sd)</td>
<td valign="top" width="144"></td>
<td valign="top" width="144"></td>
</tr>
<tr>
<td valign="top" width="108"><b>College Attendance</b></td>
<td valign="top" width="126"></td>
<td valign="top" width="144">0.28% above mean of 45.5%</td>
<td valign="top" width="144">0.82% above mean of 37.22%</td>
</tr>
<tr>
<td valign="top" width="108"><b>College Quality index</b></td>
<td valign="top" width="126"></td>
<td valign="top" width="144">0.06 sd</td>
<td valign="top" width="144">0.02 sd</td>
</tr>
<tr>
<td valign="top" width="108"><b>Earnings</b></td>
<td valign="top" width="126">$1520 (8.8% above mean)</td>
<td valign="top" width="144">$1619 (11.1% above mean)</td>
<td valign="top" width="144">$350 (1.65% above mean)</td>
</tr>
<tr>
<td valign="top" width="108"><b>Teen parenthood</b></td>
<td valign="top" width="126"></td>
<td valign="top" width="144"></td>
<td valign="top" width="144">0.61% below mean of 14.3%</td>
</tr>
<tr>
<td valign="top" width="108"><b>Other outcomes</b></td>
<td valign="top" width="126"></td>
<td valign="top" width="144"></td>
<td valign="top" width="144">Increases in neighborhood quality, saving with 401K</td>
</tr>
</tbody>
</table>
<hr align="left" size="1" width="33%" />
<p>&nbsp;</p>
<p>The results of this study (Column 3 of Table 2) in some ways mirror those of the Tennessee experiment (Columns 1 and 2). We see small but statistically significant effects of teacher value-added on college attendance and college quality. We see a statistically significant but much smaller impact on earnings (having a teacher with one standard deviation higher value-added predicted earning $350 per year more than expected at age 28). The authors also reported a reduction in teenage parenthood and increases in neighborhood socioeconomic status (as measured by the fraction of neighbors with a college education) and savings (as indicated by having a 401k retirement account). These, too, were small but statistically significant effects.</p>
<p>The authors devised an ingenious test of the validity of these findings. They asked whether a cohort of children in a particular school in a particular grade achieved less the year after a high-value-added teacher left than did the previous cohort of students, and, conversely, whether children gained more the year after a high-value-added teacher joined the staff. Their findings essentially replicated the results shown in Table 2 (Column 3) for college attendance and college quality; however, findings regarding earnings were too imprecise to test for impacts. A key assumption in this analysis is that the movement in and out of schools by effective (or ineffective) teachers is a cause of subsequent student achievement more than a result of past school effectiveness. This research strategy is ingenious, yet we cannot rule out the possibility that at least part of the impact that the authors ascribe to teachers is actually generated by an effective school. I say this because the authors did not control for the school a child attended when they assessed the association between teacher value-added and long-term outcomes. (See Note 11.)</p>
<div class="pull-right">Classroom effectiveness in the early grades and teacher effectiveness as indicated by value-added have some influence on important life outcomes.</div>
<p>This remarkable study suggests that teacher value-added has long-term consequences for children. Although the results are non-experimental, they are consistent with the experimental findings concerning the impacts of effective classrooms (Columns 1 and 2 of Table 2). Taking these studies together, I conclude that classroom effectiveness in the early grades and, more specifically, teacher effectiveness as indicated by value-added have some influence on important life outcomes.</p>
<h3>Questions for Future Research</h3>
<p>The research reviewed here suggests that teacher value-added explains a modest, but not negligible, fraction of variation in student test scores during the initial year. However, the effects of a teacher on later test scores are much smaller, and most of the initial effect on test scores has faded out after three years. Despite the failure of value-added impacts to persist with respect to test scores, the research reviewed here has established that indicators of classroom effectiveness predict a range of adult outcomes. How can we reconcile the lack of persistence of impacts on test scores with the later emergence of impacts on important life outcomes?</p>
<p>Several explanations for the fade-out of test score effects fail to account for the emergence of later outcomes. Teaching to the test might account for ephemeral effects on test scores but can hardly account for long-term benefits. The same is true of explanations that emphasize the inability of later teachers to capitalize on the gains produced by effective early teachers. The possibility that later tests measure skills not captured by earlier tests remains somewhat plausible.</p>
<p>The explanation posed by the authors of the long-term studies reviewed here is that teachers who are effective at producing initial gains in test scores are also effective in producing gains in non-cognitive or “soft” skills. This is the same explanation that researchers have drawn regarding the long-term effects of several experimental early childhood interventions.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a> These interventions showed early impacts on test scores that faded completely over the next several years. Nevertheless, they produced favorable outcomes over the life course. Evidence suggests that the effects of early and sustained interventions on non-academic skills account at least in part for these long-term benefits.</p>
<p>Chetty et al. (2013) tested the impact of teacher value-added on an index of non-academic skills, including measures of initiative, effort, and collaboration. They found that teachers who produced high initial value-added on test scores also produced favorable non-academic skills and that these were correlated with the adult outcomes of interest. These findings are consistent with the notion that teacher impacts on non-academic skills may help us understand the puzzle of fading test score effects and the emergence of long-term impacts.</p>
<div class="pull-left">At best, achievement tests capture some important aspects of cognitive skills needed in the labor market.</div>
<p>I would urge caution, however, in inferring that the skill gains not measured by later achievement tests are all “non-cognitive.” At best, achievement tests capture some important aspects of cognitive skills needed in the labor market. It is quite plausible that teachers who are effective at producing gains on a given test are also good at producing gains in deeper cognitive skills not captured by standardized tests.</p>
<p>Considerably more research is needed on how specific aspects of teaching contribute to the range of skills that pay off in adult life. The research reviewed here has shown that it is possible to trace long-term effects of classroom experience, so we can anticipate more studies of this type. I would encourage researchers to think about the connections between academic learning in various subjects and the development of non-academic skills such as effort, initiative, persistence, and collaboration. In school settings, much of what we ask of students concerns academic learning. Reasoning and problem-solving skills appear ever more important, and substantial effort, initiative, and collaboration are likely essential for developing these skills. At the same time, it is likely that success in developing academic skills reinforces determination, effort, and initiative. In sum, it seems likely that academic and non-academic skills that matter in the labor market develop together and are mutually reinforcing.</p>
<div class="pull-right">We need a theory to explain how effective teaching fosters a range of skills and dispositions that, together, shape prospects for future success.</div>
<p>The researchers cited above note that achievement tests may poorly reflect these non-academic skills. I would emphasize that these tests may also fail to capture key cognitive skills, and, in particular, reasoning and problem-solving. In sum, it seems that we need a theory to explain how effective teaching fosters a range of skills and dispositions that, together, shape prospects for future success. It will take more long-term studies of the impact of teaching to test such theories. Moreover, the evidence that teachers vary in the extent to which initial value-added persists<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a> suggests the need to assess the impact of teachers and schools on a wide range of outcomes.</p>
<div class="summery">
<h3>Implications for Policy and Practice</h3>
<p>Teacher value-added scores, computed with care, should be taken seriously because these scores serve as meaningful signals of long-term benefit to students. The caveat “computed with care” is important. The researchers cited here took great care to identify and control for potential sources of bias. At the same time, teacher value-added scores are not precise. As a result, even those who advocate using value-added in teacher evaluation emphasize the importance of combining value-added with data from other measures of classroom effectiveness.</p>
<p>The research suggests another way that we can and should enrich data on effective teaching: examining the value that teachers add to outcomes other than standardized test scores. The evidence seems to suggest that teacher effectiveness contributes to long-term outcomes in ways that are imperfectly captured by test scores. Effective teachers likely assist their students by producing a range of skills that support later success. Many school districts already have data that can help them assess teacher contributions to achievement in later grades, course-taking, high school graduation, and even college attendance and completion. We can thus see potential for policymakers, practitioners, and researchers to collaborate in constructing a richer set of effectiveness indicators so we can better appreciate the impact of teaching.</p>
</div>
<h2 class="refbtn">References +</h2>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">For a detailed discussion of bias, see McCaffrey, D.F. (October 2012). “Do Value-Added <span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-1014">Models</span> Level the Playing Field for Teachers?” Carnegie Knowledge Network. <a href="http://www.carnegieknowledgenetwork.org/briefs/value-added/level-playing-field/">http://www.carnegieknowledgenetwork.org/briefs/value-added/level-playing-field/</a><a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">For a detailed discussion of precision, see Raudenbush, S., &amp; Jean, M. (October 2012). “How Should Educators Interpret Value-Added Scores?” Carnegie Knowledge Network. <a href="http://carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added">http://carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added</a> and Loeb, S., and Candelaria, C. (October 2012). “How Stable Are Value-Added Indicators Across Years, Subjects, and Student Groups?” Carnegie Knowledge Network. <a href="http://www.carnegieknowledgenetwork.org/briefs/value-added/value-added-stability/">http://www.carnegieknowledgenetwork.org/briefs/value-added/value-added-stability/</a><a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">Master, Loeb, and Wycoff (2014) investigate teacher impact on cognitive skills not captured in the achievement test used to compute value-added. <i>In Teachers’ Effects on Students’ Long-Term Knowledge</i>. Unpublished manuscript, Stanford University School of Education.<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text">See Raudenbush, S.W. (August 2013). “What Do We Know About Using Value-Added to Compare Teachers Who Work in Different Schools?” Carnegie Knowledge Network. <a href="http://www.carnegieknowledgenetwork.org/briefs/comparing-teaching/">http://www.carnegieknowledgenetwork.org/briefs/comparing-teaching/</a><a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">Kinsler, J. (2012). Beyond Levels and Growth Estimating Teacher Value-Added and its Persistence. <em>Journal of Human Resources</em>, 47(3): 722-753.<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., and Hamilton, L. (2004). Models for Value-Added Modeling of Teacher Effects. <em>Journal of Educational and Behavioral</em>, 29( 1): 67-101, Value-Added Assessment Special Issue.<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">Lockwood, J. R., McCaffrey, D. F., Mariano, L. T., and Setodji, C. (2007). Bayesian Methods for Scalable Multivariate Value-Added Assessment. <em>Journal of Educational and Behavioral Statistics</em> 32: 125-150.<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">Kane, T. J., and Staiger, D. O. (2008). Are Teacher-Level Value-Added Estimates Biased? An Experimental Validation of Non-Experimental Estimates. Occasional Paper. Harvard University.<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text">Jacob, B.A., Lefgren, L., and Sims, D.P. (2010). The persistence of teacher-induced learning gains. <em>Journal of Human Resources</em>, 45(4): 915-943.<a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">Rothstein report the standard deviation of value-added for 4th grade teachers on 4th grade test scores to be .180 and .150 for reading and math, respectively. He reported the standard deviation of value-added for 4th grade teachers on fifth grade scores to be .108 and .118, respectively (see Table 7). These represent persistence of .59 and .79, respective. I computed the average of these for the table. Rothstein, J. (2010). Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement. <em>Quarterly Journal of Economics</em>, 125(1): 175-214.<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">Mariano, McCaffrey, and Lockwood (2010). demonstrate that such content shifting does occur. Mariano, L. T., McCaffrey, D. F., &amp; Lockwood, J. R. (2010). A model for teacher effects from longitudinal data without assuming vertical scaling. <em>Journal of educational and behavioral statistics</em>, 35(3): 253-279.<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">The results of this experiment, which definitively established the fact that reducing class size can boost achievement, was reported by Finn and Achilles, 1990.<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">See Chetty, R. Friedman, J.N., Saezm E., Schanzenbach, D.W., Yagin, D. (2011). How does your kindertarten classroom affect your earnings: Evidence from Project Star. <em>Quarterly Journal of Economics,</em> 76(4): 1593-1660.<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">See Raudenbush, S.W., 2013, ibid.<a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">See Chetty, R,. Friedman, J.N., Rockoff, J.E. (2013). Measuring the impacts of teachers II: Teacher value-added and student outcomes. <em>American Economic Review</em>, forthcoming.<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text">See Heckman, J.J., Pinto, R. Savalyev, P.A. (2013) Understanding the mechanisms through which an influential early childhood program boosted adult outcomes. American Economic Review, 103(6): 2052-2086.<a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">Master, Loeb, and Wycoff, 2014, ibid.<a href="#refmark-17"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/long-term-impacts/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Is Value-Added Accurate for Teachers of Students with Disabilities?</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/teacher_disabilities/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/teacher_disabilities/#comments</comments>
		<pubDate>Tue, 14 Jan 2014 19:43:00 +0000</pubDate>
		<dc:creator><![CDATA[Joanna Huang]]></dc:creator>
				<category><![CDATA[Knowledge Briefs]]></category>
		<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=2064</guid>
		<description><![CDATA[<span style="color: #999;">KNOWLEDGE BRIEF 14</span><br />by <a href="/technical-panel/#mccaffrey">Daniel F. McCaffrey</a> 
<br />
Students with disabilities account for about 14 percent of all students in the U.S. These students contribute to the value-added of both special education teachers and many general education teachers, but we know little about how accommodations in instruction and assessment for them affect their teachers’ value-added scores. This brief discusses the factors that may influence value-added for teachers of students with disabilities, and explores the implications of controlling for student disability status when estimating value-added. ]]></description>
				<content:encoded><![CDATA[<div class="instaemail etp-alignleft"><a class="instaemail" id="instaemail-button" href="#" rel="nofollow" title="Email this page" onclick="pfEmail.init()" data-button-img=email-button>Email this page</a></div><div align="right"><a href="/wp-content/uploads/2014/01/CKN_McCaffrey_Disabilities_Fourth_formatted.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png" /></a></div>
<p><a href="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2014/01/McCaffrey_Disabilities_image.jpg"><img class="alignnone size-medium wp-image-2075" alt="McCaffrey_Disabilities_image" src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2014/01/McCaffrey_Disabilities_image-300x199.jpg" width="560" height="350" /></a></p>
<div class="biobox"><img alt="Daniel McCaffrey" src="/wp-content/uploads/2012/03/McCaffrey.jpg" /><br />
<a href="/technical-panel/#mccaffrey">Daniel McCaffrey</a><br />
<strong>Principal Research<br />
Scientist</strong><br />
Educational Testing<br />
Service</div>
<h2>Daniel F. McCaffrey</h2>
<h3>and Heather Buzick</h3>
<h3>Highlights</h3>
<ul>
<li>Standardized test scores for students with disabilities pose challenges for calculating value-added for special education teachers and for some general education teachers who teach many students with disabilities.</li>
<li>Scores for students with disabilities can be very low, so that misspecification of the value-added models might attribute their low scores to their teachers. Low scores can also increase the random errors in value-added.</li>
<li>Inconsistent use of testing accommodations across years for students with disabilities can create variability in their growth that can be incorrectly attributed to teachers.</li>
<li>Teachers’ value-added might not represent their contributions to the learning of students with disabilities because these students are less likely than others to have test scores that are used in calculating it.</li>
<li>Including disability status and other factors, such as accommodation use, in the models may reduce systematic errors in the value-added for teachers with large proportions of students with disabilities, but doing so could create incentives for improper placement of students into special education.</li>
<li>Value-added for many general education teachers changes very little when scores from students with disabilities are excluded from calculations.</li>
<li>States and districts may find it useful to monitor teachers’ value-added for connections to the proportion of students with disabilities in their classes and other evidence of systematic error. They should be prepared to revise their accountability systems if needed.</li>
</ul>
<h3>Introduction</h3>
<p>Under laws providing for equal access to quality education, most American students with disabilities are taught largely in general education classrooms.<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a> And along with all other students, they are included in standards-based reforms—policies that hold their schools and teachers accountable for what they do or do not learn. In more and more cases, policies require that a teacher be judged partly by his value-added—a measure of what he contributes to student learning as determined by student scores on standardized tests.<a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a> States and districts, of course, want to make sure they have accurate measures of these contributions. But students with disabilities pose several challenges for calculating value-added. They tend to score low—often very low—on regular state assessments,<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a> and most receive accommodations, such as extra time.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a> The result can be scores that are unreliable or not comparable to those of other students or across years. In other cases, students with disabilities have test scores that cannot be used to calculate value-added because they take alternative assessments. At the same time, many students with disabilities are taught by multiple teachers; they may be taught by two teachers in the same classroom or by different teachers in separate general and special education classrooms. Students with disabilities also often receive help from aides and special services. Disentangling the contribution of each teacher from these other factors may be difficult.</p>
<div class="pull-left">Students with disabilities account for about 14 percent of all students in the U.S., and even more in urban districts.</div>
<p>In this brief, we discuss the challenges of using value-added to evaluate teachers of students with disabilities. We consider the limited empirical research on the potential for systematic errors in value-added for these teachers, either because the models do not adequately account for the likely achievement growth of their students, or because they do not account for teachers being more or less effective for students with disabilities than they are for other students. We also consider the comparability of value-added for special education teachers and the value-added for other teachers.</p>
<h3>What is known about value-added for teachers of students with disabilities?</h3>
<p>Students with disabilities contribute not only to the value-added of special education teachers, but also to that of many general education teachers. These students account for about 14 percent of all students in the U.S., and even more in urban districts; 18 percent of students in Detroit Public Schools, for instance, receive special education services.<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a> These 5.7 million students <a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a> are a diverse group. The 13 disabilities listed under the federal Individuals with Disabilities Education Act (IDEA) are autism, deaf-blindness, deafness, emotional disturbance, hearing impairment, intellectual disability, multiple disabilities, orthopedic impairment, other health impairment, specific learning disability, speech or language impairment, traumatic brain injury, and visual impairment.<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a> Because the majority of students with disabilities spend most of their instructional time in a general education classroom, at least 80 percent of teachers will have one or more students with a disability in their classrooms.<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a></p>
<p>Nearly all students with disabilities are tested in some way. Nationally, 79 percent take their state’s regular standardized assessment with or without accommodations and 21 percent take an alternative assessment.<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a> Historically, the alternative assessments for students with the most significant cognitive disabilities have not been designed to produce scores that can be directly compared with scores on the regular assessment. Sometimes the ranges of scores did not even overlap; when they did, equal scores did not indicate equal levels of achievement. Another type of alternative assessment was designed for students who demonstrate persistent academic difficulties. These assessments cover the same grade-level achievement standards as the regular assessments, and they use a familiar format of multiple-choice and constructed-response items. What sets them apart from the regular assessments is fewer items, simplified language, and fewer options on multiple-choice questions. Unlike some other alternative tests, scores from these assessments can be linked to scores from the regular assessment and included in value-added modeling. But this type of test is being phased out within the year.<a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a></p>
<p>Of students with disabilities who take the regular assessments, around 60 to 65 percent receive testing accommodations.<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a> Accommodations take many forms. In addition to extra time allotted to complete the test, they include changes in the presentation of the test, different response types, and changes in scheduling and setting.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a> The accommodations and the type of test can vary from year to year, depending on a student’s disability classification, changes in test rules, and degree of adherence to the rules.</p>
<p>Scores on regular assessments are typically used to calculate a teacher’s value-added, and accommodations are rarely taken into account. This is the case even though scores can be very sensitive to accommodations and even though inconsistent use of accommodations can result in predictable patterns in achievement growth. One study, for instance, found that average growth in math for students with disabilities was less than the average growth for all students, but for students who received accommodations in the current year and not in the previous year, growth in math was larger than it was for all students. For students who received an accommodation in the previous year but not in the current year, average growth in math was much less than it was for all students. Overall, students with disabilities had greater than average growth in reading, but the patterns with respect to inconsistent use of accommodations were the same as they were for math. That is, year-to-year changes in the use of accommodations predicted achievement growth.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a> This predictable variation in growth is unrelated to teacher effectiveness, and if value-added modeling does not account for it, errors can result.</p>
<div class="pull-right">Scores can be very sensitive to accommodations for students with disabilities.</div>
<p>For many general education teachers, the impact of the test scores of students with disabilities on their value-added scores is relatively limited. Because they have few of these students in their classes, their value-added scores do not change much when students with disabilities are included or not.<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a> The situation is different for teachers with substantial numbers of students with disabilities. The value-added for these teachers can be more sensitive to these students’ test scores and to the way the model accounts for disability status. It is also less accurate, because of the very low test scores earned by many students with disabilities. Most standard tests are designed to provide accurate scores for students near proficiency; because there are not enough items measuring performance at the lower end of the scale, the tests may provide very limited information about students who score substantially below average. Thus, scores for students with disabilities tend to have larger measurement errors; they deviate more from the students’ true level of achievement than do the scores of other students.</p>
<p>As a result, the value-added of teachers with a greater percentage of students with disabilities will have more random error than the value-added of teachers with similar-size classes but fewer students with disabilities. Random errors are differences between the value-added and a teacher’s true contribution to student achievement gains. The errors stem from many chance factors, such as the students who happen to be assigned to the teacher, students being sick at test time, or error in the test. Random errors are not systematic, in that they will not be similar across years or across teachers of similar students. They contribute to year-to-year instability in value-added, so value-added for teachers with more students with disabilities may have greater annual fluctuations than that of other teachers.<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a></p>
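<p>A minimal simulation may make this concrete; every quantity below is invented for illustration. Two teachers have identical true effects, but one teaches students whose scores carry roughly twice the measurement error, and her value-added estimates fluctuate correspondingly more from year to year.</p>
<pre><code>import numpy as np

# Illustrative simulation: identical true effects, different score noise.
rng = np.random.default_rng(0)
true_effect, n_students, n_years = 0.10, 20, 1000

def simulate_value_added(score_sd):
    # Crude value-added: the class-mean gain in each simulated year.
    gains = true_effect + rng.normal(0.0, score_sd, (n_years, n_students))
    return gains.mean(axis=1)

va_typical = simulate_value_added(score_sd=0.5)  # typical measurement error
va_noisy = simulate_value_added(score_sd=1.0)    # error ~2x larger, as for very low scores

print(va_typical.std(), va_noisy.std())  # noisy estimates fluctuate ~2x more year to year
</code></pre>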
<div class="pull-left">A teacher may not be equally effective with students with disabilities as other students.</div>
<p>If the value-added model does not appropriately control for disability status, it can overestimate the expected growth of students with disabilities. This failure may contribute to systematic errors in which teachers are disadvantaged by having more students with disabilities in their classrooms. One study, for example, found that under value-added models with very limited controls for student background variables, the two percent of teachers who had half or more students with disabilities received value-added scores that, on average, were low; they ranked in the 20th to 25th percentile in math and the 25th to 33rd percentile in reading. Under a model that controlled for students’ special education status and for inconsistent use of accommodations, these same teachers ranked in the 38th to 49th percentile in math and the 40th to 54th percentile in reading. By contrast, the rankings of teachers whose classes were less than 20 percent students with disabilities changed very little under the two models. Similarly, adding controls to the model resulted in a modest change, from about the 48th to the 52nd percentile, for teachers whose classes were 20 to 50 percent students with disabilities.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a></p>
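<p>The mechanism is easy to see in a stylized example. Suppose students with disabilities gain, on average, 0.3 points less per year for reasons unrelated to their teachers (an invented number), and a teacher’s class is half students with disabilities. A model that omits disability status attributes the gap to the teacher; subtracting a district-wide estimate of the gap removes it. This sketch illustrates the general logic only, not the models used in the study cited above.</p>
<pre><code>import numpy as np

# Stylized example; all numbers are invented for illustration.
rng = np.random.default_rng(1)
disability = np.array([1]*15 + [0]*15)          # a class that is 50% SWD
# The teacher's true effect is 0; SWD gain 0.3 less for unrelated reasons.
gains = -0.3*disability + rng.normal(0.0, 0.4, 30)

va_no_control = gains.mean()                    # the -0.3 gap depresses the estimate
district_gap = -0.3                             # gap estimated from district-wide data
va_with_control = (gains - district_gap*disability).mean()
print(round(va_no_control, 2), round(va_with_control, 2))
</code></pre>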
<p>Having small numbers of students with usable test scores can also add to random errors. Because scores on alternative assessments may not be comparable to scores on the regular assessment, students with disabilities are more likely than other students not to have test scores that can be used for value-added calculations. In particular, special education teachers, who often teach small classes of students with the most severe disabilities, may have very small numbers of students with which to calculate value-added; they may even have fewer than the minimum required by the state.<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a> Thus, their scores may be very imprecise, as well as unstable across years.</p>
<p>A teacher may not be equally effective with students with disabilities as with other students. Research has shown that direct, explicit, and systematic instruction, which includes practicing with students until they understand a concept, is particularly beneficial for students with disabilities.<a class="fn-ref-mark" href="#footnote-18" id="refmark-18"><sup>[18]</sup></a> But this kind of instruction might not be used for general education students. Widely used observation protocols do not rate teachers on explicit, systematic instruction, and instead emphasize “student-centered instruction” that gives students more control over their learning.<a class="fn-ref-mark" href="#footnote-19" id="refmark-19"><sup>[19]</sup></a> Teachers whose classes include students both with and without disabilities will need to choose one or the other type of instruction, and, depending on the choice, they might be less effective with one or the other group of students. Research has also found that students with disabilities have fewer opportunities to learn in general education classrooms than do students without disabilities.<a class="fn-ref-mark" href="#footnote-20" id="refmark-20"><sup>[20]</sup></a></p>
<div class="pull-right">Students with disabilities commonly receive instruction from multiple education professionals.</div>
<p>A possible implication of this research is that, to truly assess a teacher’s impact on students, his value-added needs to take account of the gains made by his students with disabilities. Excluding these students because they lack test data may distort the value-added measures. Calculating a separate value-added for students with disabilities and general students might seem desirable on the basis of this research, but because many teachers teach very few students with disabilities, the random errors from this subgroup would be too large for the measures to be useful.</p>
<p>Students with disabilities commonly receive instruction, as well as other services, from multiple education professionals, and this also poses challenges for calculating value-added. With co-teaching, for instance, a general education and special education teacher work together in the same classroom.<a class="fn-ref-mark" href="#footnote-21" id="refmark-21"><sup>[21]</sup></a> In this case, two teachers contribute to a student’s learning. Dual contributions are also made when students with disabilities are taught in both general and special education classrooms. Students with disabilities may also benefit from aides and other staff. These situations complicate the computation of value-added because it is hard to separate one contribution from another.</p>
<div class="pull-left">We also know little about how certain services contribute to students’ achievement growth and how best to account for these services.</div>
<p>These challenges are not unique to students with disabilities. In some schools, both general education students and students with disabilities are taught by multiple teachers and, although co-teaching is typically aimed at special education students, all students might be affected by multiple teachers. But these challenges may be most pronounced for special education teachers. Detailed data on teacher assignments and teachers’ shares of instruction are necessary to allow value-added models to account for students who are taught by multiple teachers. Information from the teacher himself is essential; in many states and districts, teachers can <a class="fn-ref-mark" href="#footnote-22" id="refmark-22"><sup>[22]</sup></a> review enrollment rosters to certify that the lists include the students they actually taught. The teachers then report the share of instruction they provided to each student. Confirmation of rosters, with safeguards to ensure accuracy, gives us richer data on student-teacher assignments than that which now exists in most data warehouses. Value-added models that make adjustments for co-teaching have been proposed,<a class="fn-ref-mark" href="#footnote-23" id="refmark-23"><sup>[23]</sup></a> but they cannot fully distinguish the contributions of each teacher to student scores.</p>
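<p>As a sketch of how roster-confirmed instructional shares might enter such a calculation, the snippet below apportions each student’s gain across teachers in proportion to the share of instruction each teacher reported. It is a simple share-weighted average, not the specific estimator cited in footnote 23, and the teachers and numbers are hypothetical.</p>
<pre><code>from collections import defaultdict

# Each entry: (student_gain, {teacher: reported share of instruction}).
rosters = [
    (0.20, {"gen_ed_A": 0.6, "sped_B": 0.4}),   # co-taught student
    (0.05, {"gen_ed_A": 1.0}),
    (0.35, {"sped_B": 1.0}),
]

weighted_gain = defaultdict(float)
total_share = defaultdict(float)
for gain, shares in rosters:
    for teacher, share in shares.items():
        weighted_gain[teacher] += share * gain
        total_share[teacher] += share

# Share-weighted average gain attributed to each teacher.
for teacher in weighted_gain:
    print(teacher, round(weighted_gain[teacher] / total_share[teacher], 3))
</code></pre>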
<h3>What more needs to be known on this issue?</h3>
<p>Much more needs to be known about the achievement growth of students with disabilities and their teachers’ value-added. We know little, for instance, about how services change over the years because a student’s needs change or because a student changes schools. We also know little about how certain services contribute to students’ achievement growth and how best to account for these services. We need information on the difference in the achievement growth between students with disabilities who do or do not contribute to value-added. We also know little about how a student’s use of accommodations varies across years, how students receiving different accommodations compare with each other, and how students who receive consistent accommodations compare (in achievement growth) with those who receive inconsistent accommodations.<a class="fn-ref-mark" href="#footnote-24" id="refmark-24"><sup>[24]</sup></a></p>
<p>We also need more information about how instruction in both general and special education classes correlates with the achievement growth of students with disabilities. This information would help us assess whether methods to instruct students with disabilities are distinct from those used to teach other students. The data could help states and districts determine whether to calculate separate value-added for students with disabilities. We would also benefit from knowing how teachers’ contributions to the achievement gains of their students with disabilities compare with their contributions to gains made by their other students. If the contributions are very similar, there would be little need to calculate separate values. Such a study might require multiple years of value-added measures to obtain accurate values using only students with disabilities because many classes have few such students.</p>
<p>Along these lines, we need to know how the material taught to students with disabilities differs from what is taught to other students, and how the material taught to each group aligns with the content covered on assessments. If it does not align, the teacher’s value-added might prove inaccurate.<a class="fn-ref-mark" href="#footnote-25" id="refmark-25"><sup>[25]</sup></a> These inaccuracies may be more pronounced for teachers who teach more students with disabilities.</p>
<div class="pull-right">Including disability status in value-added models might create an incentive to increase special education referrals.</div>
<p>We know that we can remove apparent systematic errors in value-added when we account for detailed information on disability status, especially for teachers whose classes have majorities of students with disabilities. Doing so, however, might affect the classification of students with disabilities or the placement of these students in classes. If disability status is included as a control variable, classifying more low-achieving students as students with disabilities might increase teachers&#8217; value-added. Including disability status in value-added models might therefore create an incentive to increase special education referrals among low-achieving students, many of whom are minorities from low-income families and are already at risk for such referrals, along with their sometimes negative consequences.<a class="fn-ref-mark" href="#footnote-26" id="refmark-26"><sup>[26]</sup></a> There is some evidence that educators have manipulated special education referrals in response to accountability pressures.<a class="fn-ref-mark" href="#footnote-27" id="refmark-27"><sup>[27]</sup></a> However, not controlling for disability status might reduce the value-added scores of teachers of students with disabilities, possibly discouraging teachers from teaching classes with many of these students. To understand the potential for such consequences, we need studies on changes to the classification of disabilities and on how students with disabilities are placed after a new teacher evaluation system has been introduced.</p>
<div class="pull-left">Not controlling for disability status might reduce the value-added scores of teachers of students with disabilities.</div>
<p> We do not know how students with disabilities will perform on the tests now being developed to align with the Common Core State Standards. The new tests aim to improve upon existing assessments by providing more reliable measures of student achievement for all students—especially very low- or high-achieving students—and they may provide better information about students with disabilities. But the new tests will be more difficult. They will also have new procedures for accommodations, which may make the use of accommodations more consistent. Further, because several states are dropping one type of alternative assessment, a greater number of students with disabilities are expected to complete the regular assessments.<a class="fn-ref-mark" href="#footnote-28" id="refmark-28"><sup>[28]</sup></a> These changes could increase the number of students with disabilities who can contribute to value-added, thus improving the reliability of value-added for this group.</p>
<h3>What can&#8217;t be resolved by empirical evidence on this issue?</h3>
<p>As discussed, there is risk both in accounting for disability status and in not accounting for it. Accounting for disability status in value-added modeling might harm students by encouraging unnecessary referrals to special education. Not accounting for it might harm teachers with large numbers of such students in their classrooms by unfairly depressing their value-added scores, and it could harm students with disabilities if teachers do not want them in their classes as a result. Ideally, schools or districts would prevent unnecessary referrals to special education, but if they can’t, the choice is which students or teachers to put at risk. Empirical data on the cost of either choice would be very difficult to obtain, and it would not help us make the choice in any case.</p>
<div class="summery">
<h3>To what extent, and under what circumstances, does this issue impact decisions and actions that districts and states can make on teacher evaluation?</h3>
<p>If we are to include students with disabilities in value-added measures, these students must take assessments that are comparable to the tests taken by other students. States may want to choose tests that can provide accurate data for low-achieving students, a group that includes many students with disabilities. They might also ask test developers for tools with which to compare scores on alternative assessments with those on regular assessments. States and districts might promote consistent use of accommodations. Greater consistency will reduce the errors in measurement of achievement gains and, thus, in value-added.</p>
<p>Value-added calculations will be most accurate if models use detailed information on disability status, as well as services received and accommodations used in the current and previous years. Many states and districts do not maintain this sort of data, at least not in a readily accessible format. They may need to update their data systems to include it and to make it more useful. Value-added for all teachers, but especially special education teachers, will also be more accurate if the models account for co-teaching and other forms of shared instruction. To obtain these data, districts may need to survey teachers.</p>
<p>States will need to decide how to account for disability status in their value-added models. <span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-1014">Models</span> that do not account for disability status may underestimate the achievement growth of students with disabilities and the contributions of their teachers to that growth. Such models include <span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-25">Student Growth Percentiles</span>, which many states are using or are considering using for teacher evaluations.<a class="fn-ref-mark" href="#footnote-29" id="refmark-29"><sup>[29]</sup></a> However, as noted, controlling for disability status could have negative consequences for students. So states that decide to control for disability status might want to monitor the number of special education referrals and how closely districts follow the guidelines for making them. Increases could indicate referrals motivated more by teacher accountability than by student need. States might want to re-evaluate students if these trends occur.</p>
<p>The new Common Core tests hold promise as more accurate assessments of lower-achieving students and students with disabilities. Thus, they should lead to more accurate value-added for almost all teachers. But better test data will not be enough to improve precision for special education teachers who have very few students or students who are tested by alternative means. For these teachers, other measures of teaching effectiveness, such as observations, may be particularly valuable. Special education classrooms may also demand a different observation protocol, since the exemplars of good practice used for general education may or may not apply.</p>
<p>There is still much work to be done on methods for assessing the teachers of students with disabilities. Student achievement growth may have a role to play, but using test scores of these students to determine that growth presents unique challenges to value-added modeling. These challenges may be particularly damaging to the accuracy of value-added for special education teachers and less of a problem for most general education teachers. However, decisions about how best to use the achievement growth of students with disabilities for evaluating their teachers also must consider the accuracy of other measures of teaching and the costs of relying on these rather than on value-added. States and districts may find it useful to monitor measures of teacher performance for relationships between these measures and the proportion of students with disabilities they teach. They might watch for systematic errors as well, and be prepared to revise their accountability systems if needed.
</p></div>
<h2 class="refbtn">References +</h2>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">In 2011, among students with disabilities in the U.S., ages 6-21, 61 percent spent 80 percent or more of the day in the regular class and 20 percent spent 40 percent to 79 percent of the day in the regular class. Technical Assistance and Dissemination Network. (2013). Historical state-level IDEA data files. 2011, part B, educational environments. http://tadnet.public.tadnet.org/pages/712<a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">For more on value-added models, see Raudenbush, S., &amp; Jean, M. (October 2012). “How Should Educators Interpret Value-Added Scores?” Carnegie Knowledge Network. http://carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added<a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">Technical Assistance and Dissemination Network. (2013). Historical state-level IDEA data files. 2010-2011, part B, assessment. http://tadnet.public.tadnet.org/pages/712. See also, Thurlow, M. L., Bremer, C., and Albus, D. (2011). “2008-09 Publicly Reported Assessment Results for Students with Disabilities and ELLs with Disabilities” (Technical Report 59). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. http://www.cehd.umn.edu/NCEO/onlinepubs/Tech59/TechnicalReport59.pdf<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text">Across the 50 states and the District of Columbia, the average percentage of students with disabilities in grades 3-8 taking the regular assessment who scored proficient or above was 35 percent in math and 37 percent in reading during the 2010-2011 school year. Between 62 percent and 68 percent of students with disabilities in grades 3-8 taking the regular assessment in reading or math used testing accommodations. Technical Assistance and Dissemination Network, 2013, ibid.<a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">Dawsey, C. P. “As Detroit Public Schools Rolls Fall, Proportion of Special-needs Students on Rise.” Detroit Free Press. December 24, 2012. http://www.freep.com/article/20121224/NEWS01/312240091/As-Detroit-Public-Schools-rolls-fall-proportion-of-special-needs-students-on-rise<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">Technical Assistance and Dissemination Network. (2013). Historical state-level IDEA data files. 2011, part B, child count. http://tadnet.public.tadnet.org/pages/712<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">U.S. Department of Education. (2013). “The Individuals with Disabilities Act of 2004.” http://idea.ed.gov/<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">Using data from the 1999-2000, Schools and Staffing Survey (U.S. Department of Education. National Center for Education Statistics, http://nces.ed.gov/surveys/sass/) we estimated that 82 percent of all K-12, public school teachers in the US had one or more students with an Individual Education Plan in their classrooms. Estimates from more recent Schools and Staffing Survey were not available. However, using data from the Measures of Effective Teaching (MET) Project (see The Bill and Melinda Gates Foundation. (2012). “Gathering Feedback for Teaching: Combining High-Quality Observations with Student Surveys and Achievement Gains.” http://www.metproject.org/downloads/MET_Gathering_Feedback_Research_Paper.pdf, for details on the MET Project), we find that in the participating districts and grades, 85 percent of teachers had at least one student classified as a special education student among their tested students.<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text">Technical Assistance and Dissemination Network. (2013). Historical state-level IDEA data files. 2010-2011, part B, assessment. http://tadnet.public.tadnet.org/pages/712<a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">Lazarus, S. S., Thurlow, M. L., &amp; Edwards, L. M. (2013). “States’ Flexibility Plans for Phasing out the Alternate Assessment Based on Modified Academic Achievement Standards (AA-MAS) by 2014-15” (Synthesis Report 89). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. http://www.cehd.umn.edu/NCEO/onlinepubs/Synthesis89/SynthesisReport89.pdf<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">Technical Assistance and Dissemination Network, 2013, ibid.<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">Elliott, J., Thurlow, M., Ysseldyke, J., &amp; Erickson, R. (1997). “Providing Assessment Accommodations for Students with Disabilities in State and District Assessments” (Policy Directions No. 7). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. http://www.cehd.umn.edu/NCEO/onlinepubs/archive/Policy/Policy7.html<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">Buzick, H. M., &amp; Jones, N. D. (2013). “Using Test Scores from Students with disabilities in Teacher Evaluation.” Manuscript submitted for publication.<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">Buzick and Jones (2013) report that the correlation between value-added for teachers calculated including students with disabilities and value-added calculated without these students was greater was between .97 and .99 depending on the grade and the value-added model used in the calculation. The correlation is a measure of the strength of relationship between two measures and a correlation of 1 means the measures are identical.<a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">Many states and districts are using Student Growth Percentiles in their teacher evaluations. The issues around the use of test scores from students with disabilities apply to both value-added models and Student Growth Percentiles. Issues around controlling for disability status would apply to Student Growth Percentiles if those models controlled for disability status; however, the common implementation of those models only control for prior achievement. For more details on the comparison of value-added and Student Growth Percentiles see Goldhaber, D., &amp; Theobald, R. (November 2013). “Do Different Value-Added Models Tell Us the Same Things?” Carnegie Knowledge Network. http://www.carnegieknowledgenetwork.org/briefs/value-added/different-growth-models<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text">Buzick, H. M., &amp; Jones, N. D., 2013, ibid.<a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">Many states require a minimum number of students (e.g., 10) to be used for a teacher’s value-added to be reported to the teacher and his supervisors and for it to be used in the teacher’s evaluation. Value-added calculated with very small numbers of students can be very unstable and potentially unrepresentative of the teacher’s contribution to student learning.<a href="#refmark-17"></a></li>
<li id="footnote-18" class="fn-text">For a summary of this research see Swanson, H. L. (2001). “Searching for the Best <span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-14">Model</span> for Instructing Students with Learning Disabilities.” Focus on Exceptional Children 34 (2):1-16. http://nichcy.org/research/summaries/abstract35<a href="#refmark-18"></a></li>
<li id="footnote-19" class="fn-text">Noell, G. W., Brownwell, M.T., Buzick, H. M., &amp; Jones, N. D. (2013). “Using Measures of Educator Effectiveness to Strengthen Educator Preparation: Improving Outcomes for Students with and without Disabilities.” Collaboration for Effective Educator Development, Accountability and Reform.<a href="#refmark-19"></a></li>
<li id="footnote-20" class="fn-text">Elliott, S., N. (2013). “Opportunity-to-learn: The key access and validity issue for all academic assessments.” http://www.ncaase.com/docs/Elliott_OTL_Abstract_DC_Feb2010_FINAL-1.pdf<a href="#refmark-20"></a></li>
<li id="footnote-21" class="fn-text">Cook, L., &amp; Friend, Marilyn F. (1995). &#8220;Co-teaching: Guidelines for creating effective practices.&#8221; Focus on Exceptional Children 28(3):1-16.<a href="#refmark-21"></a></li>
<li id="footnote-22" class="fn-text">Hock, H., &amp; Isenberg, E. (2010). “Teacher and Student Misclassification in Value-Added Models.” Manuscript. Washington, DC: Mathematica Policy Research.<a href="#refmark-22"></a></li>
<li id="footnote-23" class="fn-text">Hock, H., &amp; Isenberg, E. (2010). “Methods for Accounting for Co-Teaching in Value-Added Models.” Working Paper. Mathematic Policy Research: Princeton, NJ. http://mathematica-mpr.com/publications/PDFs/education/acctco-teaching_wp.pdf<a href="#refmark-23"></a></li>
<li id="footnote-24" class="fn-text">Buzick, H. M., &amp; Laitusis, C. C. (2010). “Using Growth for Accountability: Measurement challenges for students with disabilities and recommendations for research.” Educational Researcher 39: 537-544.<a href="#refmark-24"></a></li>
<li id="footnote-25" class="fn-text">A similar issue arises when considering value-added for secondary teachers where the material covered for students in different tracks may differ and align differently with the content of the state assessments. See Harris, D. &amp; Anderson, A. (May 2013). “Does Value-Added Work Better in Elementary Than in Secondary Grades?” Carnegie Knowledge Network. http://www.carnegieknowledgenetwork.org/briefs/value-added/grades/<a href="#refmark-25"></a></li>
<li id="footnote-26" class="fn-text">See Hosp &amp; Reschly for a recent meta-analyses of studies on racial differences in special education referrals and an introduction to the long history of studies on the over representation of African American students in special education. Hosp, J.L. &amp; Reschly, D. (2003). “Referral Rates for Intervention or Assessment: A Meta-Analysis of Racial Differences.” The Journal of Special Education 37(2): 67-80. http://files.eric.ed.gov/fulltext/EJ673050.pdf<a href="#refmark-26"></a></li>
<li id="footnote-27" class="fn-text">Booher-Jennings reports on teachers in one school referring low scoring students to special education classes. See Booher-Jennings, J. (2005) “Below the Bubble: &#8221;Educational Triage&#8221; and the Texas Accountability.” American Education Research Journal 42: 231-268. McGill-Frazen and Allington report similar activities from schools participating in their research studies. See McGill-Frazen, A. and Allington, R. L. (1993). “Flunk’em or Get Them Classified: The Contamination of Primary Grade Data.” Educational Researcher 22: 19-22. Both studies use data collected when special education students’ scores were not used for accountability.<a href="#refmark-27"></a></li>
<li id="footnote-28" class="fn-text">Lazarus, S. S., Thurlow, M. L., &amp; Edwards, L. M., 2013, ibid.<a href="#refmark-28"></a></li>
<li id="footnote-29" class="fn-text">Student Growth Percentiles could control for student disability status, use of accommodations, and related factors but most states implement the models without such controls.<a href="#refmark-29"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/teacher_disabilities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Can Value-Added Measures Be Used for Teacher Improvement?</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/teacher_improvement/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/teacher_improvement/#comments</comments>
		<pubDate>Thu, 19 Dec 2013 19:16:31 +0000</pubDate>
		<dc:creator><![CDATA[Joanna Huang]]></dc:creator>
				<category><![CDATA[Knowledge Briefs]]></category>
		<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=1991</guid>
		<description><![CDATA[<span style="color: #999;">KNOWLEDGE BRIEF 13</span><br />by <a href="/technical-panel/#loeb">Susanna Loeb</a> 
<br />
In an effort to improve educational outcomes, states and districts across the country are collecting value-added measures to assess the quality of their teachers.  The usefulness of value-added measures can only be assessed by how they are used in practice. This brief explores the research on how these measures can best be used to enhance teacher improvement.  ]]></description>
				<content:encoded><![CDATA[<div class="instaemail etp-alignleft"><a class="instaemail" id="instaemail-button" href="#" rel="nofollow" title="Email this page" onclick="pfEmail.init()" data-button-img=email-button>Email this page</a></div><div align="right"><a href="/wp-content/uploads/2013/12/CKN-Loeb_Teacher-Improvement.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png" /></a></div>
<p><a href="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/12/illustration.jpg"><img src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/12/illustration.jpg" alt="gb_0351" width="580" height="260" class="alignnone size-full wp-image-1994" /></a></p>
<div class="biobox"><img alt="Loeb" src="/wp-content/uploads/2012/03/Loeb.jpg" /><br />
<a href="/technical-panel/#loeb">Susanna Loeb</a><br />
<strong>Professor of<br />
Education</strong><br />
Stanford University<br />
<strong>Faculty Director</strong><br />
Center for Education<br />
Policy Analysis<br />
<strong>Co-Director</strong><br />
PACE</div>
<h2>Susanna Loeb</h2>
<h3>Highlights:</h3>
<ul>
<li>Value-added measures are not a good source of information for helping teachers improve because they provide little information on effective and actionable practices.</li>
<li>School, district, and state leaders may be able to improve teaching by using value-added to shape decisions about programs, human resources, and incentives.</li>
<li>Value-added measures of improvement are more precise for groups of teachers than for individual teachers; thus, they may provide useful information on the improvement associated with practices, programs, or schools.</li>
<li>Many incentive programs that tie staff rewards to student performance have not shown benefits. Research points to the difficulty of designing these programs well and sustaining them politically.</li>
<li>Value-added measures for selecting and improving programs, for informing human resource decisions, and for incentives are likely to be more useful when they are combined with other measures.</li>
<li>We still have only a limited understanding of how best to use value-added measures in combination with other measures as tools for improvement.</li>
</ul>
<h3>Introduction</h3>
<p>The question for this brief is whether education leaders can use value-added measures as tools for improving schooling and, if so, how to do this. Districts, states, and schools can, at least in theory, generate gains in educational outcomes for students using value-added measures in three ways: creating information on effective programs, making better decisions about human resources, and establishing incentives for higher performance from teachers. This brief reviews the evidence on each of these mechanisms and describes the drawbacks and benefits of using value-added measures in these and other contexts.</p>
<h3>What is the current state of knowledge on this issue?</h3>
<p>On their own, value-added measures do not provide teachers with information on effective practices. But this does not mean that value-added measures cannot be useful for educators and leaders to improve instruction through other means, such as identifying practices that lead to higher academic achievement or targeting professional development toward teachers who need it most.<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a></p>
<p>As discussed in other Carnegie Knowledge Network briefs, value-added measures have shortcomings as measures of teaching performance relative to other available options, although they also have strengths.<a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a> To begin with, the use of student test scores to measure teaching practice is both an advantage and a disadvantage. The clear benefit is that, however imperfect, test scores are direct measures of student learning, which is an outcome that we care about. Students who learn more in school tend to complete more schooling, have greater earnings potential, and lead healthier lives. Measures such as those based on classroom observations and principals’ assessments lack that direct link to valued outcomes; evidence connects these other measures to student outcomes only weakly. Basing assessments of teachers on the learning of students also recognizes the complexity of the teaching process; many different teaching styles can benefit students. Conversely, value-added measures are dependent on specific student tests, each of which is an incomplete measure of all the outcomes we want to see. The choice of test can affect a teacher’s value-added estimate, so it is important to choose tests that truly measure desired outcomes and to recognize that tests are imperfect and incomplete.<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a></p>
<div class="pull-left">On their own, value-added measures do not provide teachers with information on effective practices.</div>
<p>Because value-added measures adjust for the characteristics of students in a given classroom, they are less biased measures of teacher performance than are unadjusted test score measures, and they may be less biased even than some observational measures. As described in an earlier brief, some research provides evidence that value-added measures—at least those that compare teachers within the same school and adjust well for students’ prior achievement—do not favor teachers who teach certain types of students.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a> The adequacy of the adjustments across schools in value-added measures, which would allow the comparison of teachers in different schools, is less clear because schools can contribute to student learning in ways apart from the contribution of individual teachers and because students sort into schools in ways that value-added measures may not adjust for well.<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a></p>
<p>While a fair amount of evidence suggests that value-added measures adequately adjust for differences in the background characteristics of students in each teacher’s classroom—much better than do most other measures—value-added measures are imprecise. An individual teacher’s score is not an accurate measure of his performance. Multiple years of value-added scores provide better information on teacher effectiveness than does just one year, but even multiple-year measures are not precise.</p>
<p>Given the imprecision of value-added measures and their inability to provide information about specific teaching practices, we might logically conclude that they cannot be useful tools for school improvement. But this conclusion would be premature. For state, district, and school leaders, value-added measures may aid in school improvement in at least three ways: improving programs, making decisions about human resources, and developing incentives for better performance.</p>
<h4>Program improvement</h4>
<div class="pull-right">Assessing programs with value-added measures is easier than it is with test scores alone.</div>
<p>When deciding how to invest resources or which programs to continue or expand, education leaders need information on whether programs are achieving their goals. Ideally, they would have access to randomized controlled trials of a particular program in its own particular context, but this type of information is impractical in most situations. If a program aims to improve teacher effectiveness, value-added measures can provide useful evidence. For example, a district might want to know whether its professional development offerings are improving teacher performance in a particular subject or grade. If the program serves enough teachers, value-added measures can distinguish the improvement of teachers who participated in the program from that of teachers who did not. Value-added measures are likely to be at least as informative as, and probably more informative than, measures such as teachers’ own assessments of their experience or student test scores that are not adjusted for prior student characteristics. Databases with student test scores and background characteristics provide similar information, but they require greater expertise to analyze than do already-adjusted value-added scores. That is, assessing programs with value-added measures is easier than doing so with test scores alone because value-added measures account for differences in the students that teachers teach.</p>
<p>One might worry that value-added measures are too imprecise to measure teacher improvement. Indeed, imprecision is even more of an issue for value-added measures of teacher improvement than it is for measures of static effectiveness. If we think of improvement as the difference between a teacher’s effectiveness at the beginning of a period and her effectiveness at the end, the change over time is subject to error in both the starting and the ending values. Thus, value-added measures of improvement—particularly those based only on data from the beginning and end of a single year—will be very imprecise, and we would need many years of data both before and after an intervention to know whether an individual teacher has improved. However, because multiple teachers participate in programs, program evaluators can combine measures across teachers, reducing the problem of imprecision.</p>
<p>Imprecision is less of a problem when we look at groups of teachers. As with years of data, the more teachers there are, the more precise the measure of the average effect is. With data on enough teachers, researchers can estimate average improvements of teachers over time, and they can identify the effects of professional development or other programs and practices on teacher improvement. For example, studies that use student test performance to measure teachers’ effectiveness—adjusted for prior achievement and background characteristics—demonstrate that, on average, teachers add more to their students’ learning during their second year of teaching than they do in their first year, and more in their third year than in their second. Similarly, a large body of literature finds that professional development programs in general have mixed effects, but that some interventions have large effects. The more successful professional development programs last several days, focus on subject-specific instruction, and are aligned with instructional goals and curriculum.<a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a> The day-to-day context of a teacher’s work can also influence improvement. Teachers gain more skills in schools that are more effective overall at improving student learning and that expose teachers to more effective peers.<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a> They also appear to learn more in schools that have more supportive professional environments. Value-added measures allow us to estimate average improvements over time and link those to the experiences of teachers.</p>
<div class="pull-left">Value-added measures allow us to estimate average improvements over time and link those to the experiences of teachers.</div>
<p>Value-added measures of static effectiveness, in contrast to value-added measures of improvement, can also lead to teaching improvement through a school or district’s adoption of better programs or practices. For example, when districts recruit and hire teachers, knowing which pre-service teacher preparation programs have provided particularly effective teachers in the past can help in selecting candidates who are most likely to thrive. Value-added measures can provide information on which programs certify teachers who are successful at contributing to student learning. Similarly, value-added measures can help identify measurable candidate criteria—such as performance on assessments—that are associated with better teaching performance. If teachers with a given characteristic have higher value-added, it might be beneficial for districts to consider that characteristic in the hiring process.<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a></p>
<h4>Human resource decisions</h4>
<p>The second means through which value-added measures may be used for improvement is by providing information for making better human resource decisions about individual teachers. While value-added measures may be imprecise for individual teachers, on average teachers with lower value-added will be less effective in the long run, and those with higher value-added will be more effective. Education leaders can learn from these measures. As an example, instead of offering professional development to all teachers equally, decision-makers might target it at teachers who need it most. Or they might use the measures as a basis for teacher promotion or dismissal. While these decisions may benefit from combining value-added data with other sources of data, value-added can provide useful information in a multiple-measures approach.</p>
<div class="pull-right">Value-added can provide useful information in a multiple-measures approach.</div>
<p>Research is only now emerging on the usefulness of value-added measures for making policies about human resources. One urban district gave principals value-added information on novice teachers, but continued to let principals use their own judgment on teacher renewal. Principals changed their behavior as a result, providing evidence that value-added gave them either new information or leverage for acting on the information they already had.<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a> In another set of districts, the Talent Transfer Initiative identified high-performing teachers and offered them incentives for moving to and staying in low-performing schools for at least two years. The teachers who moved were found to have positive effects on their students in their new schools, especially in elementary schools.<a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a> There are many possible uses of value-added, some of which may lead to school improvement and a more equitable distribution of resources.</p>
<p>There is less evidence on the uses of value-added measures at the school level. An elementary school principal might consider creating separate departments for teaching math and reading if data showed differences among the fifth-grade teachers in value-added for math and reading.<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a> Similarly, a principal could use the data to help make teaching assignments: if a student has had math teachers with low value-added for two years in a row, the principal might make sure to assign the student a teacher with high value-added the next year.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a> The imprecision of value-added measures alone suggests that combining them with other measures would be useful for these kinds of decisions. There is currently little research on these school-level practices and their consequences.</p>
<h4>Incentives</h4>
<p>A third potential way to use value-added measures to improve schools is to provide teachers with incentives for better performance. Conceptually, linking salaries to student outcomes seems like a logical way to improve teachers’ efforts or to attract those teachers most likely to produce student test score gains. Performance-based pay creates incentives for teachers to focus on student learning, or for high-performing teachers to enter and remain in the profession.</p>
<p>Performance-based pay carries possible drawbacks, though. Value-added measures do not capture all aspects of student learning that matter, and by giving teachers incentives to focus on the outcomes that are measured, performance-based pay may shortchange students on important outcomes that are not. Further, it is difficult to create performance-based pay systems that encourage teachers to treat their students equitably. For example, formulas often reward teachers for concentrating on students more likely to make gains on the test. Performance pay policies may also discourage teachers from cooperating with each other.</p>
<div class="pull-left">The failure of performance pay programs in the U.S. provides evidence that they are difficult to implement.</div>
<p>While there is some evidence that performance pay can work, most evidence from the U.S. suggests that it does not increase student performance. The largest study of performance incentives based on value-added measures comes from a Nashville, Tennessee study that randomly assigned middle-school math teachers (who volunteered for the study) to be eligible for performance-based pay or not. Those who were eligible could earn annual bonuses of up to $15,000. The study showed very little effect of these incentives on student performance.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a> Similarly, a school-based program in New York City that offered staff members per-person bonuses for good test performance overall had little effect on student scores, as did a similar program in Round Rock, Texas.<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a> These results do not mean that value-added measures can’t be used productively as incentives; evidence from Kenya and India, at least, suggests that they can.<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a> But the failure of the programs in the U.S. provides evidence that they are difficult to implement.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a></p>
<p>The null results from most studies contrast with some recent evidence suggesting that programs that combine value-added with other measures of performance, or which use only other measures, can lead to improvement. The observation-based evaluations in Cincinnati, for example, have led to improvements in teacher effectiveness,<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a> as has the IMPACT evaluation system in Washington, D.C.<a class="fn-ref-mark" href="#footnote-18" id="refmark-18"><sup>[18]</sup></a> Both of these programs provide feedback to teachers on their instructional practices. It is not clear how much of the improvement comes from this feedback in combination with the incentives, and how much comes from the incentives on their own.</p>
<h4>Summary</h4>
<p>While value-added measures may be used as instruments of improvement, it is worth asking whether other measures might be better tools. As noted, value-added measures do not tell us which teacher practices are effective, whereas other measures of performance, such as observations, can. But these other measures can be more costly to collect, and we do not yet know as much about their validity and bias.<a class="fn-ref-mark" href="#footnote-19" id="refmark-19"><sup>[19]</sup></a> Value-added may allow us to target the collection of these measures, which in turn can help school leaders determine which teachers could most benefit from assistance.</p>
<h3>What more needs to be known on this issue?</h3>
<p class="pull-right"><span style="font-size: 13px;">While there is substantial information on the technical properties of value-added measures, far less is known about how these measures are put into actual use. It may, for example, be helpful to assign more students to the most effective teachers, to give less effective teachers a chance to improve, to equalize access to high-quality instruction, and to pay highly effective teachers more. But we have no evidence of the effectiveness of these approaches. Some districts are using value-added measures to make sure that their lowest performing students have access to effective teachers.<a class="fn-ref-mark" href="#footnote-20" id="refmark-20"><sup>[20]</sup></a> But, again, we know little about the outcomes of this practice. There is also little evidence that value-added can be justified as a reason for more intensive teacher evaluation or can give struggling teachers more useful feedback.<a class="fn-ref-mark" href="#footnote-21" id="refmark-21"><sup>[21]</sup></a> Even formal observations (generally three to five per year) provide limited information to teachers. With triage based on value-added measures alone, teachers might receive substantially more consistent and intensive feedback than what they now get. These approaches are promising, and research could illuminate their benefits and costs.</span></p>
<div class="pull-right">We know little about the consequences of using value-added measures to encourage educators to improve.</div>
<p>We could also benefit from more information on the use of value-added as a performance incentive. For example, while we have ample evidence of unintended consequences of test-based accountability—as well as evidence of some potential benefits—we know less about the consequences of using value-added measures to encourage educators to improve.<a class="fn-ref-mark" href="#footnote-22" id="refmark-22"><sup>[22]</sup></a> Given that performance incentives have little effect on teacher performance in the short run, it may not be worth pursuing them as a means for improvement at all. However, incentives may affect who enters and stays in teaching, as well as teacher satisfaction and performance, and we know little about these potential effects. Moreover, the research suggests that value-added scores may be more effective as incentive tools when they are combined with observational and other measures that can give teachers information on practices.<a class="fn-ref-mark" href="#footnote-23" id="refmark-23"><sup>[23]</sup></a> It would be useful to have more evidence on combining measures for this and other purposes.</p>
<h3>What cannot be resolved by empirical evidence on this issue?</h3>
<p>While research can inform the use of value-added measures, most decisions about how to use these measures require personal judgment, as well as a greater understanding of school and district factors than research can provide. The political feasibility and repercussions of the choices will vary by jurisdiction. Some districts and schools are more accepting than others of value-added as a measure of performance. And value-added provides more relevant information in some districts than it does in others. When school leaders already have a good understanding of teachers’ skills, value-added measures may not add much, but they may be helpful for leaders who are new to the job or less skilled at evaluating teachers through other means. Moreover, in some places, the tests used to calculate value-added are better measures of desired outcomes than they are in others. Tests are an interim measure of student learning, one indicator of how prepared students are to succeed in later life. Communities will no doubt come to different conclusions about how relevant these tests are as measures of the preparation they value.</p>
<div class="pull-left">Most decisions about how to use these measures require personal judgment.</div>
<p>The capacity of systems to calculate value-added and to collect other information will also affect the usefulness and cost-effectiveness of the measures. Some critical pieces must be in place if value-added is to provide useful information about programs. Data on crucial student outcomes, as measured by well-designed tests, must be collected and linked between students and teachers over time. Without this capacity, collecting and calculating value-added measures will be costly and time-consuming. With it, the measures can be calculated relatively easily.</p>
<p>Value-added measures will better assess the effectiveness of programs serving teachers, and that of the teachers themselves, than will other measures—such as test score comparisons—that do not adjust for the characteristics of students. For estimating the effects of programs or practices for the purpose of school improvement, value-added measures are not superior to randomized controlled trials, because such trials do a better job of adjusting for differences in teaching contexts. However, experimental approaches are often impractical. By contrast, value-added approaches allow for analyses of all teachers for whom test scores are available.</p>
<div class="summery">
<h3>Conclusion</h3>
<p>Value-added measures clearly do not provide useful information for teachers about practices they need to improve. They simply gauge student test score gains relative to what we would expect. Value-added scores also have drawbacks for assessing individual teacher effectiveness because they are imprecise. While on average they do not appear to be biased against teachers teaching different types of students within the same school, they are subject to measurement error. Nonetheless, they can still be useful tools for improving practice, especially when they are used in combination with other measures.</p>
<p>In particular, value-added measures can support the evaluation of programs and practices, can contribute to human resource decisions, and can be used as incentives for improved performance. Value-added measures allow education leaders to assess the effectiveness of professional development. They can help identify schools with particularly good learning environments for teachers. They can help school leaders with teacher recruitment and selection by providing information on preparation programs that have delivered effective teachers in the past. They can help leaders make decisions about teacher assignment and promotion. And they may even be used to create incentives for better teacher performance.</p>
<p>Value-added is sometimes better than other measures of effectiveness, just as other measures are sometimes superior to value-added. Value-added data can help school leaders determine the benefits of a particular program of professional development. New observational protocols are likely to be more useful than value-added measures for guiding practice because they provide teachers with information on specific teaching practices. But these observational tools have other potential drawbacks, and they have not been assessed nearly as carefully as value-added measures have for validity and reliability. In addition to being costly to collect, observational measures that favor particular teaching practices may encourage a type of teaching that is not always best. Observational measures also may not adjust well for different teaching contexts, and even when educators use these protocols, additional, less formal feedback is necessary to support teachers’ development.</p>
<p>Value-added measures are imperfect, but they are one among many imperfect measures of teacher performance that can inform decisions by teachers, schools, districts, and states.</p>
</div>
<h2 class="refbtn">References +</h2>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">Harris, D. N. (2011). Value-Added Measures in Education What Every Educator Needs to Know. Cambridge, MA: Harvard Education Press.<a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">Harris, D. N. (May 2013). How Do Value-Added Indicators Compare to Other Measures of Teacher Effectiveness? Carnegie Knowledge Network. http://www.carnegieknowledgenetwork.org/wp-content/uploads/2012/10/CKN_2012-10_Harris.pdf<a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">McCaffrey, D. (October 2012). “Do Value-Added Methods Level the Playing Field for Teachers?” Carnegie Knowledge Network. http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/06/CKN_2012-10_McCaffrey.pdf<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text">Ibid.<a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">Raudenbush, S. (August 2013). “What Do We Know About Using Value-Added to Compare Teachers Who Work in Different Schools?” Carnegie Knowledge Network. http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/08/CKN_Raudenbush-Comparing-Teachers_FINAL_08-19-13.pdf<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">Goldhaber, D. (November 2013). “What Do Value-Added Measures of Teacher Preparation Programs Tell Us?” Carnegie Knowledge Network. http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/11/CKN-Goldhaber-TeacherPrep_Final_11.7.pdf<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">Rockoff, J.E., Staiger, D., Kane, T., and Taylor, E. (2012). “Information and Employee Evaluation: Evidence from a Randomized Intervention in Public Schools.” American Economic Review. http://www.nber.org/papers/w16240.pdf?new_window=1<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">Goldhaber, D., 2013, ibid<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text">Rockoff, et al., 2012, ibid.<a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">Glazerman, S., Protik, A., The, B., Bruch, J., and Max, J. (2013). “Transfer Incentives for High-Performing Teachers: Final Results from a Multisite Randomized Experiment.” Princeton: Mathematica Policy Research. http://ies.ed.gov/ncee/pubs/20144003/pdf/20144003.pdf<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">Fox, L. (2013). “Using Multiple Dimensions of Teacher Value-added to Improve Student-Teacher Assignments.” Working Paper.<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">Kalogrides, D., &amp; Loeb, S. (Forthcoming). “Different teachers, different peers: The magnitude of student sorting within schools.” Educational Researcher.<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">Springer, M., &amp; Winters, M.A. (2009). “New York City&#8217;s School-Wide Bonus Pay Program: Early Evidence from a Randomized Trial.” TN: National Center on Performance Incentives at Vanderbilt University. https://my.vanderbilt.edu/performanceincentives/files/2012/10/200902_SpringerWinters_BonusPayProgram15.pdf<br />
A similar study in New York City also found no effects of financial incentives on student performance, see:<br />
Fryer, R. (Forthcoming). “Teacher Incentives and Student Achievement: Evidence from New York City Public Schools.” Journal of Labor Economics.<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">Springer, 2009, ibid.<br />
Springer, M., J. Pane, V. Le, D. McCaffrey, S. Burns, L. Hamilton, and B. Stecher. (2012). “Team Pay for Performance: Experimental Evidence From the Round Rock Pilot Project on Team Incentives.” Educational Evaluation and Policy Analysis, 34(4), 367–390.<br />
Yuan et al. (2012). “Incentive Pay Programs Do Not Affect Teacher Motivation or Reported Practices: Results from Three Randomized Studies.” Educational Evaluation and Policy Analysis, 35(1), 3-22.<a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">Muralidharan, K. &amp; Sundararaman, V. (2011). &#8220;Teacher Performance Pay: Experimental Evidence from India.&#8221; Journal of Political Economy, University of Chicago Press, 119(1), 39-77. http://www.nber.org/papers/w15323.pdf?new_window=1<br />
Glewwe, P., Ilias, N, and Kremer, M. (2010). &#8220;Teacher Incentives.&#8221; American Economic Journal: Applied Economics, 2(3), 205-27. http://www.povertyactionlab.org/publication/teacher-incentives<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text">A recent study finds that incentives are more effective if they are incentives to avoid harm—such as dismissal or loss of salary—instead of incentives for promotion or higher pay. See:<br />
Fryer, R., S. Levitt, J. List, and S. Sadoff. (2012). “Enhancing the Efficacy of Teacher Incentives through Loss Aversion.” NBER Working Paper, No 18237. http://www.nber.org/papers/w18237<a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">Taylor, E.S. and Tyler, J.H. (2012). “The Effect of Evaluation on Teacher Performance.” American Economic Review, 102(7), 3628-3651.<a href="#refmark-17"></a></li>
<li id="footnote-18" class="fn-text">Wyckoff, J. and Dee, T.S. (2013). “Incentives, Selection, and Teacher Performance: Evidence from IMPACT.” Working Paper.<a href="#refmark-18"></a></li>
<li id="footnote-19" class="fn-text">Harris, D. N., 2013, ibid.<a href="#refmark-19"></a></li>
<li id="footnote-20" class="fn-text">Tennessee policy.<a href="#refmark-20"></a></li>
<li id="footnote-21" class="fn-text">Harris, D. N. (November 2013). “How Might We Use Multiple Measures for Teacher Accountability?” Carnegie Knowledge Brief. http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/11/CKN_2013_10_Harris.pdf<a href="#refmark-21"></a></li>
<li id="footnote-22" class="fn-text">Figlio, D. and Loeb, S. (2011). “School Accountability.” In E. A. Hanushek, S. Machin, and Woessmann, L. (Eds.), Handbook of the Economics of Education, Vol. 3, San Diego, CA: North Holland, 383-423.<a href="#refmark-22"></a></li>
<li id="footnote-23" class="fn-text">Wyckoff, J., 2013, ibid.<a href="#refmark-23"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/teacher_improvement/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What Do Value-Added Measures of Teacher Preparation Programs Tell Us?</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/teacher_prep/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/teacher_prep/#comments</comments>
		<pubDate>Thu, 07 Nov 2013 21:18:07 +0000</pubDate>
		<dc:creator><![CDATA[Joanna Huang]]></dc:creator>
				<category><![CDATA[Knowledge Briefs]]></category>
		<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=1918</guid>
		<description><![CDATA[<span style="color: #999;">KNOWLEDGE BRIEF 12</span><br />by <a href="/technical-panel/#goldhaber">Dan Goldhaber</a> 
<br />
Teacher training programs are increasingly being held under the microscope. Policymakers are beginning to adopt the use of student growth measures to measure the performance of teacher preparation programs.  In this brief, Goldhaber discusses what research can tell us about the usefulness of value-added for assessing the quality of teacher preparation programs.]]></description>
				<content:encoded><![CDATA[<div class="instaemail etp-alignleft"><a class="instaemail" id="instaemail-button" href="#" rel="nofollow" title="Email this page" onclick="pfEmail.init()" data-button-img=email-button>Email this page</a></div><p style="text-align: right;"> <a href="/wp-content/uploads/2013/11/CKN-Goldhaber-TeacherPrep_Final_11.7.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png"></a></p>
<p><a href="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/11/Goldhaber-photo-apple-books.jpg"><img src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/11/Goldhaber-photo-apple-books-1024x699.jpg" alt="Education" width="550" height="350" class="alignnone size-large wp-image-1949" /></a></p>
<div class="biobox"><img class="alignnone size-full wp-image-418" alt="Goldhaber" src="/wp-content/uploads/2012/03/Goldhaber.jpg" /><br />
<a href="/technical-panel/#goldhaber">Dan Goldhaber</a><br />
<strong>Director</strong><br />
Center for Education<br />
Data &amp; Research<br />
<strong>Professor</strong><br />
University of<br />
Washington-Bothell</div>
<h2>Dan Goldhaber</h2>
<h3>Highlights</h3>
<p>• Policymakers are increasingly adopting the use of student growth measures to measure the performance of teacher preparation programs.<br />
• Value-added measures of teacher preparation programs may be able to tell us something about the effectiveness of a program’s graduates, but they cannot readily distinguish the pre-training talents of those who enter a program from the value of the training they receive.<br />
• Research findings vary on the extent to which prep programs explain meaningful variation in teacher effectiveness. The differences may be explained by differences in methodologies or by differences in the programs.<br />
• Research is only just beginning to assess the extent to which different features of teacher training, such as student selection and clinical experience, influence teacher effectiveness and career paths.<br />
• We know almost nothing about how teacher preparation programs will respond to new accountability pressures.<br />
• Value-added based assessments of teacher preparation programs may encourage deeper discussions about additional ways to rigorously assess these programs.</p>
<h3>Introduction</h3>
<p>Teacher training programs are increasingly being held under the microscope. Perhaps the most notable of recent calls to reform was the 2009 declaration by U.S. Education Secretary Arne Duncan that “by almost any standard, many if not most of the nation&#8217;s 1,450 schools, colleges, and departments of education are doing a mediocre job of preparing teachers for the realities of the 21st century classroom.”<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a> Duncan’s indictment comes despite the fact that these programs require state approval and, often, professional accreditation. The problem is that the scrutiny generally takes the form of input measures, such as minimal requirements for the length of student teaching and assessments of a program’s curriculum. Now the clear shift is toward measuring outcomes.</p>
<p>The federal Race to the Top initiative, for instance, considered whether states had a plan for linking student growth to teachers (and principals) and for linking this information to the schools in that state that trained them. The new Council for the Accreditation of Educator Preparation (CAEP) just endorsed the use of student outcome measures to judge preparation programs.<a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a> And there is a new push to use changes in student test scores—a teacher’s value-added—to assess teacher preparation providers (TPPs). Several states are already doing so or plan to soon.<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a></p>
<div class="pull-right">We cannot disentangle the value of a candidate’s selection to a program from her experience there.</div>
<p>Much of the existing research on teacher preparation has focused on comparing teachers who enter the profession through different routes—traditional versus alternative certification—but more recently, researchers have turned their attention to assessing individual TPPs.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a> And in doing so, researchers face many of the same statistical challenges that arise when value-added is used to evaluate individual teachers. The measures of effectiveness might be sensitive to the student test that is used, for instance, and statistical models might not fully distinguish a teacher’s contributions to student learning from other factors.<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a> Other issues are distinct, such as how to account for the possibility that the effects of teacher training fade the longer a teacher is in the workforce.</p>
<p>Before detailing what we know about value-added assessments of TPPs, it is important to be explicit about what value-added methods can and cannot say about the quality of TPPs. Value-added methods may be able to tell us something about the effectiveness of a program’s graduates, but this information is a function both of graduates’ experiences in a program and of who they were when they entered. The point is worth emphasizing because policy discussions often treat TPP value-added as a reflection of training alone. Yet the students admitted to one program may be very different from those admitted to another. Stanford University’s program, for instance, enrolls students who generally have stronger academic backgrounds than those of Fresno State University. So if a teacher who graduated from Stanford turns out to be more effective than a teacher from Fresno State, we should not necessarily assume that the training at Stanford is better.</p>
<p>Because we cannot disentangle the value of a candidate’s selection to a program from her experience there, we need to think carefully about what we hope to learn from comparing TPPs. Some stakeholders may care about the value of the training itself, while others may care about the combined effects of selection and training. Many are also likely to be interested in outcomes other than those measured by value-added, such as the number or types of teachers who graduate from different institutions. They may also want to know how likely it is that a graduate from a certain institution will actually find a job or stay in teaching. I elaborate on these issues below.</p>
<h3>What is Known About Value-Added Measures of Teacher Preparation Programs?</h3>
<p>There are only a handful of studies that link the effectiveness of teachers, based on value-added, to the program in which they were trained.<a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a> This body of research assesses the degree to which training programs explain the variation in teacher effectiveness, whether there are specific features of training that appear to be related to effectiveness, and what important statistical issues arise when we try to tie these measures of effectiveness back to a training program.</p>
<p>Empirical research reaches somewhat divergent conclusions about the extent to which training programs explain meaningful variation in teacher effectiveness. A study of teachers in New York City, for instance, concludes that the difference between teachers from programs that graduate teachers of average effectiveness and those whose teachers are the most effective is roughly comparable to the (regression-adjusted) achievement difference between students who are and are not eligible for subsidized lunch.<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a> Research on TPPs in Missouri, by contrast, finds only very small—and statistically insignificant—differences among the graduates of different programs.<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a> There is also similar work on training programs in other states: Florida,<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a> Louisiana,<a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a> North Carolina,<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a> and Washington.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a> The findings from these studies fall somewhere between those of New York City and Missouri in terms of the extent to which the colleges from which teachers graduate provide meaningful information about how effective their program graduates are as teachers.</p>
<p>One explanation for the different findings among states is the extent to which graduates from TPPs—specifically those who actually end up teaching—differ from each other.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a> As noted, regulation of training programs is a state function, and states have different admission requirements. The more that institutions within a state differ in their policies for candidate selection and training, the more likely we are to see differences in the effectiveness of their graduates. There is relatively little quantitative research on the features of TPPs that are associated with student achievement, but what does exist offers suggestive evidence that some features may matter.<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a> There is some evidence, for instance, that licensure tests, which are sometimes used to determine admission, are associated with teacher effectiveness.<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a></p>
<div class="pull-left">The more that institutions differ in their policies, the more likely we are able to see differences in the effectiveness of their graduates.</div>
<p>More recently, some education programs have begun to administer an exit assessment (the “edTPA”) designed to ensure that TPP graduates meet minimum standards. But because the edTPA is new, there is not yet any large-scale quantitative research that links performance on it with eventual teacher effectiveness. There is also some evidence that certain aspects of training are associated with a teacher’s later success (as measured by value-added) in the classroom. For example, the New York City research shows that teachers tend to be more effective when their student teaching has been well-supervised and aligned with methods coursework, and when the training program required a capstone project that related their clinical experience to training.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a> Other work finds that trainees who student-teach in higher-functioning schools (as measured by low attrition) turn out to be more effective once they enter their own classrooms.<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a> These studies are intriguing in that they suggest ways in which teacher training may be improved, but, as discussed below, it is difficult to tell definitively whether it is the training experiences themselves that influence effectiveness.</p>
<p>The differences in findings across states may also relate to the methodologies used to estimate teacher-training effects. A number of statistical issues arise when we try to estimate these effects based on student achievement. One issue is that we are less sure, in a statistical sense, about the impacts of graduates from smaller programs than we are about those of graduates from larger ones. This is because the smaller sample of graduates yields less precise estimates of their effects; it is virtually guaranteed that findings for smaller programs will not be statistically significant.<a class="fn-ref-mark" href="#footnote-18" id="refmark-18"><sup>[18]</sup></a> Another issue is the extent to which programs should be judged in a particular year based on graduates from prior years. The effectiveness of teachers in service may not correspond with their effectiveness at the time they were trained, and it is certainly plausible that the impact of training a year after graduation would be different than it is five or 10 years later, when a teacher has acculturated to his particular school or district. Most of the studies cited above handle this issue by including only novice teachers—those with three or fewer years of experience—in their samples (exacerbating the issue of small sample sizes).<a class="fn-ref-mark" href="#footnote-19" id="refmark-19"><sup>[19]</sup></a> Another reason states may limit their analysis to recent cohorts is that it is politically impractical to hold programs accountable for the effectiveness of teachers who graduated in the distant past, regardless of the statistical implications.</p>
<p>Limiting the sample of teachers used to draw inferences about training programs to novices is not the only option, however. Goldhaber et al., for instance, estimate statistical models that allow training program effects to diminish with the amount of workforce experience that teachers have.<a class="fn-ref-mark" href="#footnote-20" id="refmark-20"><sup>[20]</sup></a> These models allow more experienced teachers to contribute toward training program estimates, but they also allow for the possibility that the effect of pre-service training is not constant over a teacher’s career. As illustrated by Figure 1, regardless of whether the statistical model accounts for the selectivity of college (dashed line) or not (solid line), this research finds that the effects of training programs do decay. They estimate that their “half-life”—the point at which half the effect of training can no longer be detected—is about 13–16 years, depending on whether teachers are judged on students’ math or reading tests and on model specification.</p>
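<p>As a rough arithmetic illustration of what such a “half-life” implies—a hypothetical calculation, not the model Goldhaber et al. estimate—suppose a program effect decays exponentially at a constant rate per year. The half-life is then ln(2) divided by that rate, so half-lives of 13 to 16 years correspond to losing roughly 4 to 5 percent of the remaining effect each year.</p>
<pre><code># Hypothetical illustration: exponential decay, effect(t) = e0 * exp(-rate * t).
# Half of the initial effect remains when exp(-rate * t) = 0.5,
# i.e., at t = ln(2) / rate.
import math

def half_life(rate_per_year):
    """Years until half the initial program effect remains."""
    return math.log(2) / rate_per_year

# Decay rates chosen so the implied half-lives bracket the 13-16 year
# range reported in the text (the rates themselves are assumptions).
for rate in (0.053, 0.043):
    print(f"decay rate {rate:.3f}/yr -> half-life {half_life(rate):.1f} years")
</code></pre>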
<div class="pull-right">The impact of training a year after graduation would be different than it is five or 10 years later.</div>
<p>A final statistical issue is how a model accounts for factors of school context that might be related both to student achievement and to the districts and schools in which TPP graduates are employed. This is a challenging problem; one would not, for instance, want to misattribute district-led induction programs or principal-influenced school environments to TPPs.<a class="fn-ref-mark" href="#footnote-21" id="refmark-21"><sup>[21]</sup></a> This issue of context is not fundamentally different from the one that arises when we try to evaluate the effectiveness of individual teachers. We worry that between-school comparisons of teachers may conflate the impact of teachers with that of, for instance, principals,<a class="fn-ref-mark" href="#footnote-22" id="refmark-22"><sup>[22]</sup></a> but the issue is particularly problematic in the case of TPPs because there is little mixing of TPP graduates within a school or school district. This situation may arise when TPPs serve a particular geographic area or espouse a mission aligned to the needs of a particular school district.</p>
<h3>What More Needs to be Known on this Issue?</h3>
<p>As noted, there are a few studies that connect the features of teacher training to the effectiveness of teachers in the field, but this research is in its infancy. I would argue that we need to learn more about (1) the effectiveness of graduates who typically go on to teach at the high school level; (2) what features of TPPs seem to contribute to the differences we see between in-service graduates of different programs; (3) how TPPs respond to accountability pressures; and (4) how TPP graduates compare in outcomes other than value-added, such as the length of time that they stay in the profession or their willingness to teach in more challenging settings.</p>
<p>First, most of the evidence from the studies cited above is based on teaching at the elementary and middle school levels; we know very little about how graduates from different preparation programs compare at the high school level. And while much of the research and policy discussion treats each TPP as a single institution, the reality is much more complex. Programs produce teachers at both the undergraduate and graduate levels, and they train students to teach different subjects, grades, and specialties. It is conceivable that training effects in one area—such as undergraduate training for elementary teachers—correspond with those in other areas, such as graduate education for secondary teachers. But aside from research that shows a correlation between value-added and training effects across subjects, we do not know how much estimates of training effects from programs within an institution correspond with one another.<a class="fn-ref-mark" href="#footnote-23" id="refmark-23"><sup>[23]</sup></a></p>
<p>Second, we know even less about what goes on inside training programs, the criteria for recruitment and selection of candidates, and the features of training itself. The absence of this research is significant, given the argument that radical improvements in the teacher workforce are likely to be achieved only through a better understanding of the impacts of different kinds of training. There is certainly evidence that these programs differ from one another in their requirements for admission, in the timing and nature of student teaching, and in the courses in pedagogy and academic subjects that they require. But while program features may be assessed during the accreditation and program approval process, there is little data that can readily link these features to the outcomes of graduates who actually become teachers. Teacher prep programs are being judged on some of these criteria (e.g., NCTQ, 2013),<a class="fn-ref-mark" href="#footnote-24" id="refmark-24"><sup>[24]</sup></a> but we clearly need more research on the aspects of training that may matter. As I suggest above, it is challenging to disentangle the effects of program selection from training effects; there are certainly experimental designs that could separate the two (e.g., random assignment of prospective teachers to different types of training), but these are likely to be difficult to implement given political or institutional constraints.</p>
<div class="pull-left">And while much of the research and policy discussion treats each TPP as a single institution, the reality is much more complex.</div>
<p>Third, we will want to assess how TPPs respond to increased pressure for greater accountability.<a class="fn-ref-mark" href="#footnote-25" id="refmark-25"><sup>[25]</sup></a> Post-secondary institutions are notoriously resistant to change from within, but there is evidence that they do respond to outside pressure in the form of public rankings.<a class="fn-ref-mark" href="#footnote-26" id="refmark-26"><sup>[26]</sup></a> Rankings for training programs have just been published by the National Council on Teacher Quality (NCTQ) and U.S. News &amp; World Report. Do programs change their candidate selection and training processes because of new accountability pressures? If so, can we see any impact on the teachers who graduate from them? These are questions that future research could address.<a class="fn-ref-mark" href="#footnote-27" id="refmark-27"><sup>[27]</sup></a></p>
<p>Lastly, I would be remiss in not emphasizing the need to understand more about ways, aside from value-added estimates, that training programs influence teachers. Research, for instance, has just begun to assess the degree to which training programs or their particular features relate to outcomes as fundamental as the probability of a graduate’s getting a teaching job <a class="fn-ref-mark" href="#footnote-28" id="refmark-28"><sup>[28]</sup></a> and of staying in the profession.<a class="fn-ref-mark" href="#footnote-29" id="refmark-29"><sup>[29]</sup></a> This line of research is important, given that policymakers care not only about the effectiveness of teachers but of their paths in and out of teaching careers.</p>
<h3>What Can&#8217;t be Resolved by Empirical Evidence on this Issue?</h3>
<p>Value-added based evaluation of TPPs can tell us a great deal about the degree to which differences in measurable effects on student test scores may be associated with the program from which teachers graduated. But there are a number of policy issues that empirical evidence will not resolve because they require us to make value judgments. First and foremost is whether or to what extent value-added should be used at all to evaluate TPPs.<a class="fn-ref-mark" href="#footnote-30" id="refmark-30"><sup>[30]</sup></a> This is both an empirical and philosophical issue, but it boils down to whether value-added is a true reflection of teacher performance,<a class="fn-ref-mark" href="#footnote-31" id="refmark-31"><sup>[31]</sup></a> and whether student test scores should be used as a measure of teacher performance.<a class="fn-ref-mark" href="#footnote-32" id="refmark-32"><sup>[32]</sup></a></p>
<p>Even policymakers who believe that value-added should be used for making inferences about TPPs face several issues that require judgment. The first is what statistical approach to use to estimate TPP effects. For instance, as outlined above, estimating the statistical models requires us to make decisions about whether and how to separate the impact of TPP graduates from school and district environments. How policymakers handle these statistical issues will influence how confidently we can distinguish one program’s value-added estimate from another’s (i.e., the standard errors of the estimates and related confidence intervals). And the statistical judgments have potentially important implications for accountability. For instance, we see the 95 percent confidence level used in academic publications as a measure of statistical significance. But, as noted above, this standard likely results in small programs being judged as statistically indistinguishable from the average, an outcome that can create unintended consequences. Take, for instance, the case in which graduates from small TPP “A” are estimated to be far less effective than graduates from large TPP “B,” who are themselves judged to be somewhat less effective than graduates from the average TPP in a state. Graduates from program “A” may not be statistically different from the mean level of teacher effectiveness at the 95 percent confidence level, but graduates from program “B” might be. If, as a consequence of meeting this 95 percent confidence threshold, accountability systems single out “B” but not “A,” a dual message is sent. One message might be that a program (“B”) needs improvement. The second message, to program “A,” is “don’t get larger unless you can improve,” and to program “B,” is “unless you can improve, you might want to graduate fewer prospective teachers.” In other words, program “B” can rightly point out that it is being penalized when program “A” is not, precisely because it is a large producer of teachers.</p>
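<p>A small numeric sketch may make the “A” versus “B” scenario concrete. The numbers below are invented for illustration: with the same spread in teacher-level estimates, a small program that falls well below the state average can remain statistically indistinguishable from it, while a large program with a much smaller shortfall clears the 95 percent threshold.</p>
<pre><code># Invented numbers illustrating the small-versus-large program problem.
import math

def z_score(mean_effect, teacher_sd, n_graduates):
    """z-statistic of a program's mean effect against the state average,
    assuming independent teacher-level estimates with spread teacher_sd."""
    return mean_effect / (teacher_sd / math.sqrt(n_graduates))

SD = 0.20  # hypothetical spread of teacher-level value-added estimates

z_a = z_score(-0.10, SD, 10)    # program "A": 10 graduates, far below average
z_b = z_score(-0.04, SD, 400)   # program "B": 400 graduates, slightly below

print(f'Program "A": z = {z_a:.2f}')  # about -1.58: not significant at 95%
print(f'Program "B": z = {z_b:.2f}')  # about -4.00: highly significant
</code></pre>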
<div class="pull-right">Ultimately, whether and how disparate measures of TPPs are combined is inherently a value judgment.</div>
<p>Is this the right set of incentives? Research can show how the statistical approach affects the likelihood that programs are identified for some kind of action, which creates tradeoffs: some programs will be rightly identified while others will not. (For more on the implications of such tradeoffs, see the discussion of individual teacher value-added.<a class="fn-ref-mark" href="#footnote-33" id="refmark-33"><sup>[33]</sup></a>) Research, however, cannot determine what confidence levels ought to be used for taking action, or what actions ought to be taken. Likewise, research can reveal more about whether TPPs with multiple programs graduate teachers of similar effectiveness, but it cannot speak to how, or whether, estimated effects of graduates from different programs within a single TPP should be aggregated to provide a summative measure of TPP performance. Combining different measures may be useful, but the combination itself could mask important information about institutions. Ultimately, whether and how disparate measures of TPPs are combined is inherently a value judgment.</p>
<p>Lastly, it is conceivable that changes to teacher preparation programs—changes that affect the effectiveness of their graduates—could conflict with other objectives. A case in point is the diversity of the teacher workforce, a topic that clearly concerns policymakers and teacher prep programs themselves.<a class="fn-ref-mark" href="#footnote-34" id="refmark-34"><sup>[34]</sup></a> It is conceivable that changes to TPP admissions or graduation policies could affect the diversity of students who enter or graduate from TPPs, ultimately affecting the diversity of the teacher workforce.<a class="fn-ref-mark" href="#footnote-35" id="refmark-35"><sup>[35]</sup></a> Research could reveal potential tradeoffs inherent with different policy objectives, but not the right balance between them.</p>
<div class="summery">
<h3>Evaluating TPPs: Practical Implications and Conclusions</h3>
<p>It is surprising how little we know about the impact of TPPs on student outcomes, given the important role these programs could play in determining who is selected into them and the nature of the training they receive. That is largely because only now, thanks to new longitudinal data systems in some states, do we have the capacity to systematically connect teacher trainees from their training programs to the workforce, and then to student outcomes. But we clearly need better data on admissions and training policies if we are to better understand these links. That said, the existing empirical research does provide some specific guidance for thinking about value-added based TPP accountability. In particular, while it cannot definitively identify the right model to determine TPP effects, it does show how different models change the estimates of these effects and the estimated confidence in them.</p>
<p>To a large extent, how we use estimates of the effectiveness of graduates from TPPs depends on the role played by those who are using them. Individuals considering a training program likely want to know about the quality of the training itself.<a class="fn-ref-mark" href="#footnote-36" id="refmark-36"><sup>[36]</sup></a> Administrators and professors probably also want information about the value of training to make programmatic improvements. State regulators might want to know more about the connections between TPP features and student outcomes to help shape policy decisions about training programs, such as whether to require a minimum GPA or test scores for admission or whether to mandate the hours of student teaching. But, as stressed above, researchers need more information than is typically available to disentangle the effects of candidate selection from those of training itself. Given the inherent difficulty of disentangling the two, one should be cautious about inferring too much about training from value-added alone.</p>
<p>For other policy or practical purposes, however, it may be sufficient to know only the combined impact of selection and training. Principals or other district officials, for instance, likely wonder whether estimated TPP effects should be considered in hiring, but they may not care much about precisely what features of the training programs lead to differences in teacher effectiveness. Likewise, state regulators are charged with ensuring that prep program graduates meet a minimal quality standard; it is not their job to determine whether that standard is achieved by how the program selects its students or how it teaches them.</p>
<p>The bottom line is that while our knowledge about teacher training is now quite limited, there is a widespread belief that it could be much better. As Arthur Levine, former president of Teachers College, Columbia University, says, “Under the existing system of quality control, too many weak programs have achieved state approval and been granted accreditation.”<a class="fn-ref-mark" href="#footnote-37" id="refmark-37"><sup>[37]</sup></a> The hope is that new value-added based assessments of teacher preparation programs will not only provide information about the value of different training approaches, but also will encourage a much-needed and far deeper policy discussion about how to rigorously assess programs for training teachers.</p>
</div>
<p>Figure 1: Decay of program effect estimates in math and reading<a class="fn-ref-mark" href="#footnote-38" id="refmark-38"><sup>[38]</sup></a><br />
<a href="http://www.carnegieknowledgenetwork.org/briefs/teacher_prep/attachment/goldhaberfigure/" rel="attachment wp-att-1920"><img class="alignnone size-full wp-image-1920" alt="goldhaberfigure" src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/11/goldhaberfigure.png" width="575" height="341" /></a></p>
<h2 class="refbtn">References +</h2>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">Duncan is hardly alone in criticizing education programs. A number of researchers and stakeholders have explicitly questioned the quality control of teacher training institutions, and, in some cases, the value of teacher training.<br />
See, for instance: Ballou, D., &amp; Podgursky, M. (2000). Reforming teacher preparation and licensing: What is the evidence? The Teachers College Record, 102(1), 5-27.<br />
Cochran-Smith, M., &amp; Zeichner, K. M. (Eds.). (2005). Studying Teacher Education: The Report of the AERA Panel on Research and Teacher Education. Washington, D.C.: The American Educational Research Association.<br />
Crowe, E. (2010). Measuring what matters: A stronger accountability model for teacher education. Washington, DC: Center for American Progress.<br />
Levine, A. (2006). Educating School Teachers. Washington, D.C.: The Education Schools Project.<br />
Vergari, S., &amp; Hess, F. M. (2002). The Accreditation Game. Education Next, 2(3), 48-57.<br />
Walsh, K., &amp; Podgursky, M. (2001). Teacher certification reconsidered: Stumbling for quality. A rejoinder. The Abell Foundation.<a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">CAEP was formed by the merger of Teacher Education Accreditation Council (TEAC) and the National Council for the Accreditation of Teacher Education (NCATE) and is the professional accrediting body for over 900 educator preparation providers or TPPs. The CAEP June 11, 2013 update of standards says, “surmounting all others, insist that preparation be judged by outcomes and impact on P‐12 student learning and development.”<a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">Put another way, this value-added metric ties the performance of students back to the programs at which their teachers were trained. States using value-added include Colorado, Delaware, Georgia, Louisiana, Massachusetts, North Carolina, Ohio, Tennessee, and Texas. For more detail on the methodologies states are using, or plan to use.<br />
See: Henry, G. T., Thompson, C. L., Bastian, K. C., Fortner, C. K., Kershaw, D. C., Marcus, J. V., &amp; Zulli, R. A. (2011). UNC teacher preparation program effectiveness report.<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text">This includes comparisons between and among traditional and alternative providers. Despite the growth in alternative providers, about 75 percent of teachers are still being prepared in colleges and universities (National Research Council, 2010). These numbers imply that much of what we stand to learn about the value of a particular kind of training will likely come from studying differences in traditional providers, and that substantial improvements in the skills of new teachers will likely be achieved only through improvement to these traditional programs.<a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">For more on the context of assessing individual teachers, see:<br />
Raudenbush, S., &amp; Jean, M. (October 2012). How Should Educators Interpret Value-Added Scores? Carnegie Knowledge Network. http://carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added<br />
McCaffrey, D. (June 2013). Do Value‐Added Methods Level the Playing Field for Teachers? Carnegie Knowledge Network. http://carnegieknowledgenetwork.org/briefs/value‐added/different‐schools<br />
Loeb, S., &amp; Candelaria, C. (July 2013). How Stable are Value‐Added Estimates across Years, Subjects, and Student Groups? Carnegie Knowledge Network. http://carnegieknowledgenetwork.org/briefs/value‐added/value‐added‐stability<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">Far more studies assess the relative effectiveness of teachers trained in traditional college and university programs versus those who entered the profession through alternative routes, often having received some level of training though usually far less than what is typical from traditional teacher providers.<br />
See: Constantine, J., Player, D., Silva, T., Hallgren, K., Grider, M., &amp; Deke, J. (2009). An evaluation of teachers trained through different routes to certification. Final Report for National Center for Education and Regional Assistance, 142.<br />
Glazerman, S., Mayer, D., &amp; Decker, P. (2006). Alternative routes to teaching: The impacts of Teach for America on student achievement and other outcomes. Journal of Policy Analysis and Management, 25(1), 75-96.<br />
Goldhaber, D. D., &amp; Brewer, D. J. (2000). Does teacher certification matter? High school teacher certification status and student achievement. Educational evaluation and policy analysis, 22(2), 129-145.<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">This is also roughly equivalent to the typical difference in effectiveness between first- and second-year teachers. See: Boyd, D. J., Grossman, P. L., Lankford, H., Loeb, S., &amp; Wyckoff, J. (2009). Teacher preparation and student achievement. Educational Evaluation and Policy Analysis, 31(4), 416-440.<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">Koedel, C., Parsons, E., Podgursky, M., &amp; Ehlert, M. (2012). Teacher Preparation Programs and Teacher Quality: Are There Real Differences Across Programs? University of Missouri Department of Economics Working Paper Series. http://economics.missouri.edu/working-papers/2012/WP1204_koedel_et_al.pdf<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text">Mihaly, K., McCaffrey, D., Sass, T., &amp; Lockwood, J.R. (Forthcoming). Where You Come From or Where You Go? Distinguishing Between School Quality and the Effectiveness of Teacher Preparation Program Graduates. Forthcoming in Education Finance and Policy, 8:4 (Fall 2013).<a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">Gansle, Kristin A., Noell, George H., R. Maria Knox, Michael J. Schafer. (2010). Value Added Assessment of Teacher Preparation in Lousiana: 2005-2006 to 2008-2009. Unpublished report.<br />
Noell, George H., Bethany A. Porter, R. Maria Patt, Amanda Dahir. 2008. Value Added Assessment of Teacher Preparation in Lousiana: 2004-2005 to 2006-2007. Unpublished report.<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">Henry et al., 2011, ibid.<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">Goldhaber, D., &amp;Liddle, S. (2011). The Gateay to the Profession: Assessing Teacher Preparation Programs Based on Student Achievement. CEDR. University of Washington, Seattle, WA. http://www.cedr.us/papers/working/CEDR%20WP%202011-2%20Teacher%20Training%20(9-26).pdf<br />
Goldhaber, D., Liddle, S., &amp; Theobald, R. (2013a). The gateway to the profession: Assessing teacher preparation programs based on student achievement. Economics of Education Review.<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">Note that all the comparisons between TPP graduates are based on within-state differences in the measured effectiveness of in-service teachers. A significant share of TPP graduates never end up teaching, or they move from the state in which they graduate to another state.<br />
See, for instance: Goldhaber, D., Krieg, J., and Theobald, R. (2013b). Knocking on the Door to the Teaching Profession? Modeling the entry of Prospective Teachers into the Workforce. CEDR Working Paper 2013-2. University of Washington, Seattle, WA.<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">Goldhaber et al. (2013a) find there are significant differences between teacher trainees at different institutions in terms of measures of academic proficiency such as college entrance exam scores. Boyd et al. (2009) find that training experiences differ, at least to some degree, across institutions.<a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">Goldhaber, D. (2007). Everyone’s Doing It, but What Does Teacher Testing Tell Us about Teacher Effectiveness? Journal of Human Resources, 42(4), 765-794.<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text">Boyd et al., 2009, ibid.<a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">Ronfeldt, M. (2012). Where should student teachers learn to teach? Effects of field placement school characteristics on teacher retention and effectiveness. Educational Evaluation and Policy Analysis, 34(1), 3-26.<a href="#refmark-17"></a></li>
<li id="footnote-18" class="fn-text">Related to this, some researchers (Koedel et al., 2012) argue that earlier studies (e.g. Gansle et al., 2010) of the variation in effectiveness of teachers from different training programs overstate what we know about the true differences between graduates of different programs. This is because the studies fail to properly account for the “clustering” of students within teachers. To put this another way, when estimating the effectiveness of TPP graduates it is important to make sure that the statistical model recognizes that students with a particular teacher, for instance one who graduated from program “A” and who has 25 students, are not treated as if they each have a different teacher from program “A” since these students likely have shared experiences with the other students in the classroom.<a href="#refmark-18"></a></li>
<li id="footnote-19" class="fn-text">In part because institutions themselves may change over time.<a href="#refmark-19"></a></li>
<li id="footnote-20" class="fn-text">Goldhaber et al., 2013a, ibid.<a href="#refmark-20"></a></li>
<li id="footnote-21" class="fn-text">For more detail on the data and labor market conditions necessary to support models designed to disentangle school and district factors from the effectiveness of TPP graduates, see Mihaly et al. (forthcoming) and Goldhaber et al. (2013).<a href="#refmark-21"></a></li>
<li id="footnote-22" class="fn-text">Raudenbush S. (April 2013). What Do We Know About Using Value‐Added to Compare Teachers Who Work in Different Schools? Carnegie Knowledge Network. http://carnegieknowledgenetwork.org/briefs/value‐added/different‐schools<a href="#refmark-22"></a></li>
<li id="footnote-23" class="fn-text">See Boyd et al. (2009) and Goldhaber and Cowan. (2013). Both find the correlation of the effectiveness of TPP graduates in math and reading/English language arts to be in the range 0.50-0.60.<br />
Goldhaber, D., and Cowan J. (2013). Excavating the Teacher Pipeline: Teacher Training Programs and Teacher Attrition. CEDR Working Paper 2013-3. University of Washington, Seattle, WA.<a href="#refmark-23"></a></li>
<li id="footnote-24" class="fn-text">Moreover, this kind of research could go beyond training generally to explore questions of fit, such as: Is the value of different training contingent on the type of school in which teachers find themselves employed?<a href="#refmark-24"></a></li>
<li id="footnote-25" class="fn-text">For instance, the edTPA, which was developed jointly by Stanford University and the American Association of Colleges for Teacher Education, is a new multi-measure assessment of teacher trainee skills that is designed to be aligned to state and national standards and used to assess teacher trainees before graduation. In theory, the edTPA may guarantee a minimum level of compentency, and it may be able to distinguish good teacher preparation programs from bad ones (Overview: edTPA, 2013). It remains unclear whether it will serve any of these purposes (Greenberg and Walsh, 2012). That’s partly because it was only in 2012 that selected programs and states used it to link trainee performance on the assessment to measures of effectiveness of in-service teachers.<br />
Overview: edTPA. (2013). Retrieved July 15, 2013, from http://edtpa.aacte.org/about-edtpa.<br />
Greenberg, J., &amp; Walsh, K. (2012). What Teacher Preparation Programs Teach about K-12 Assessment: A Review. National Council on Teacher Quality.<a href="#refmark-25"></a></li>
<li id="footnote-26" class="fn-text">For more on the challenges of changing postsecondary institutions, see:<br />
McPherson, M. S., &amp; Schapiro, M. O. (1999). Tenure issues in higher education. The Journal of Economic Perspectives, 13(1), 85-98.<br />
McCormick, R. E., &amp; Meiners, R. E. (1988). University governance: A property rights perspective. Journal of Law and Economics, 31, 423.<br />
For evidence that institutions respond to public rankings, see:<br />
Griffith, A., &amp; Rask, K. (2007). The influence of the US News and World Report collegiate rankings on the matriculation decision of high-ability students: 1995–2004. Economics of Education Review, 26(2), 244-255.<br />
Hossler, D., &amp; Foley, E. M. (1995). Reducing the noise in the college choice process: The use of college guidebooks and ratings. New Directions for Institutional Research, 1995(88), 21-30.<br />
Meredith, M. (2004). Why do universities compete in the ratings game? An empirical analysis of the effects of the US News and World Report college rankings. Research in Higher Education, 45(5), 443-461.<a href="#refmark-26"></a></li>
<li id="footnote-27" class="fn-text">These pressures could lead programs to distort their programs (often referred to as “Campbell’s Law”), but if the ranking criteria are aligned with good practices, they may well result in program improvements.<a href="#refmark-27"></a></li>
<li id="footnote-28" class="fn-text">Constantine et al., 2000, ibid.<br />
Goldhaber, et al., 2013b, ibid.<a href="#refmark-28"></a></li>
<li id="footnote-29" class="fn-text">Moreover, some of this work (Goldhaber and Cowan, 2013) suggests there may be tradeoffs between the effectiveness of graduates from different programs and the length of time they spend in the teacher workforce.<br />
See: Ronfeldt, 2012, ibid.<br />
Ronfeldt, M., Reininger, M., &amp; Kwok, A. (2013). Recruitment or Preparation? Investigating the Effects of Teacher Characteristics and Student Teaching. Journal of Teacher Education.<a href="#refmark-29"></a></li>
<li id="footnote-30" class="fn-text">For more discussion on this see: Plecki, M. L., Elfers, A. M., &amp; Nakamura, Y. (2012). Using Evidence for Teacher Education Program Improvement and Accountability An Illustrative Case of the Role of Value-Added Measures. Journal of Teacher Education, 63(5), 318-334.<a href="#refmark-30"></a></li>
<li id="footnote-31" class="fn-text">McCaffrey, Daniel, June 2013, ibid.<a href="#refmark-31"></a></li>
<li id="footnote-32" class="fn-text">Harris, D. N. (May 2013) How Do Value‐Added Indicators Compare to Other Measures of Teacher Effectiveness? Carnegie Knowledge Network. http://carnegieknowledgenetwork.org/briefs/value‐added/value‐added‐other‐measures<a href="#refmark-32"></a></li>
<li id="footnote-33" class="fn-text">Goldhaber, D., &amp; Loeb, S. (April 2013) What are the Tradeoffs Associated with Teacher Misclassification in High Stakes Personnel Decisions? Carnegie Knowledge Network. http://carnegieknowledgenetwork.org/briefs/value-added/teacher-misclassifications<a href="#refmark-33"></a></li>
<li id="footnote-34" class="fn-text">See, for instance: Council for the Accreditation of Educator Preparation (CAEP). (2013). Commission on Standards and Performance Reporting. CAEP Accreditation Standards and Evidence: Aspirations for Education Preparation. June 11, 2013.<a href="#refmark-34"></a></li>
<li id="footnote-35" class="fn-text">There is, for instance, significant empirical evidence that minority teachers perform less well on licensure tests, some of which are used for admission into TPPs.<br />
See: Eubanks, S. C., &amp; Weaver, R. (1999). Excellence through diversity: Connecting the teacher quality and teacher diversity agendas. Journal of Negro Education, 451-459.<br />
Goldhaber, 2007, ibid.<br />
Goldhaber, D. &amp; Hansen, M. (2010). Race, Gender, and Teacher Testing: How Objective a Tool is Teacher Licensure Testing? American Educational Research Journal, 47(1), 218-251.<br />
Gitomer, D. H., &amp; Latham, A. S. (2000). Generalizations in teacher education: Seductive and misleading. Journal of Teacher Education, 51(3), 215-220.<a href="#refmark-35"></a></li>
<li id="footnote-36" class="fn-text">They also likely would want to know about a non-value-added outcome: whether graduating from a particular TPP might change the likelihood that they find a job.<a href="#refmark-36"></a></li>
<li id="footnote-37" class="fn-text">Levine, 2006, ibid.<a href="#refmark-37"></a></li>
<li id="footnote-38" class="fn-text">For more information on the distinction between the “decay” and “selectivity decay” estimates, see Goldhaber et al. (2013a).<a href="#refmark-38"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/teacher_prep/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Might We Use Multiple Measures for Teacher Accountability?</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/multiple_measures/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/multiple_measures/#comments</comments>
		<pubDate>Fri, 18 Oct 2013 22:50:18 +0000</pubDate>
		<dc:creator><![CDATA[Joanna Huang]]></dc:creator>
				<category><![CDATA[Knowledge Briefs]]></category>
		<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=1804</guid>
		<description><![CDATA[<span style="color: #999;">KNOWLEDGE BRIEF 11</span><br />by <a href="/technical-panel/#harris">Douglas N. Harris</a> 
<br />
States and districts across the country are using multiple measures to make personnel decisions about teachers based on a weighted average of the separate measures. This method has strengths and weaknesses. Most discussions of teacher performance measures focus on validity and reliability, but fairness, simplicity, and cost should also be considered.]]></description>
				<content:encoded><![CDATA[<div class="instaemail etp-alignleft"><a class="instaemail" id="instaemail-button" href="#" rel="nofollow" title="Email this page" onclick="pfEmail.init()" data-button-img=email-button>Email this page</a></div><div align="right"><a href="/wp-content/uploads/2013/11/CKN_2013_10_Harris.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png" /></a></div>
<p align="center"><a href="http://www.carnegieknowledgenetwork.org/briefs/multiple_measures/attachment/instruments-for-drawing-in-school-ruler-and-compass-close-up/" rel="attachment wp-att-1815"><img class="alignnone size-large wp-image-1815" alt="Instruments for drawing in school. Ruler and compass. Close-up." src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/10/Harris-photo-rule_protractor_colorful-1024x682.jpg" width="550" height="350" /></a></p>
<div class="biobox"><img alt="Harris" src="/wp-content/uploads/2012/03/Harris.jpg" width="" height="100" /><br />
<a href="/technical-panel/#harris">Douglas N. Harris</a><br />
<strong>Associate Professor</strong><br />
Economics<br />
<strong>Chair</strong><br />
Public Education<br />
Tulane University</div>
<h2>Douglas N. Harris</h2>
<h3>Highlights</h3>
<ul>
<li>The most common way to use multiple measures in teacher accountability is through weighted averages of value-added with other gauges of teacher performance. This method has strengths and weaknesses.</li>
<li>Policymakers should consider a wider range of options for using multiple measures.</li>
<li>Because the main objective is to accurately classify teacher performance, most discussions of measures of teacher performance focus on validity and reliability. But fairness, simplicity, and cost should also be considered.</li>
<li>The “matrix” and “screening” methods are somewhat more complex than weighted averages, but they may be more accurate.</li>
<li>The “screening” method is the least costly and fairest of the three options because it uses value-added measures to improve and streamline other forms of data collection, and it allows final decisions to be made based on the same criteria for all teachers.</li>
<li>Ultimately, we should assess the method of using multiple measures based on how the options affect student learning, but the evidence does not yet exist to do that.</li>
</ul>
<h3>Introduction</h3>
<p>The idea that multiple measures should be used when evaluating teachers is widely accepted. Multiple measures are important not only because education has multiple goals, but because each measure is an imperfect indicator of any given goal.</p>
<p>For a variety of reasons, states and districts use multiple measures in one particular way: to make personnel decisions about teachers based on a weighted average of the separate measures. Also known as a “composite” or “index,” the weighted average provides one bottom-line metric with which teachers can be placed into performance categories. The federal Race to the Top (RTTT) initiative is one reason why states and districts use the weighted average. This competitive grants program required states to hold teachers accountable in a way that made student test scores a “significant factor” in personnel decisions. The meaning of this term is never explained, and the most likely way to meet the vague requirement was to assign large or significant weight &#8212; 50 percent in some cases &#8212; to measures of student achievement growth, such as value-added.</p>
<div class="pull-right">Multiple measures are important because no single measure could capture everything we want.</div>
<p>The weighted average approach is also intuitive because people use it in daily life. The Dow Jones Industrial Average combines various stock prices to provide information about the health of the overall stock market; the Weather Channel reports “heat indexes” that combine temperature and humidity to indicate how hot it feels; Consumer Reports measures product quality by combining measures across multiple dimensions; college football rankings are based on an index that combines wins, losses, the quality of opponents, and other factors. These weighted averages allow for simple rankings and comparisons.</p>
<p>Such comparisons are especially useful for some personnel decisions that require an up-or-down vote—we either renew the teacher’s contract or not, give tenure or not, promote him to leadership or not. Decisions like these would seem to require a single measure: to be objective and fair, we would place teachers into performance categories based on a combination of performance information, then provide support and professional development to those with the lowest scores and reward those with the highest scores.</p>
<p>While weighted averages are a common and intuitive approach for using multiple measures, there are other options that have their own advantages. In this brief, I also consider the “matrix” and “screening” approaches, which do not involve combining multiple measures.</p>
<p>Unfortunately, unlike in the other briefs in the Carnegie Knowledge Network (CKN) series, there is little evidence about the efficacy of these various approaches. Yet it is still important to show the full range of options and provide a way of thinking about their advantages and disadvantages. This requires breaking out of the narrow measurement perspective. Validity and reliability are worthy priorities, but when we consider methods other than weighted averages, it quickly becomes clear that other criteria—simplicity, fairness, and cost—are also important.</p>
<p>After describing and comparing the weighting, matrix, and screening methods below, I discuss their strengths and weaknesses according to all the above criteria. More than anything else, what this brief contributes are some new and concrete ways of thinking about how we use value-added and other measures in accountability systems.</p>
<h3>What do we know about how to create and use multiple measures?</h3>
<h4>Weighting approach</h4>
<p>The simplest possible weighting for any group of measures is the average of those measures. With an average, each measure is given equal weight: 50-50 for each of two measures, 33-33-33 for each of three measures, and so on. This approach has the advantage of simplicity.<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a></p>
<p>Suppose we have two goals, or elements of effectiveness, for teachers: increasing academic achievement for their own students and contributing to the larger school community so that they help all students. If three-quarters of the definition of effective teaching involves the first goal, then the measure of teacher contributions to academic achievement for their own students should receive a weight of 0.75. The remainder (0.25) would go to the measure of broader contributions. I will call these the “value weights” because they reflect what we value about education and teachers’ work, ignoring measurement issues.</p>
<div class="pull-left">Weighted averages allow for simple rankings and comparisons.</div>
<p>Life gets more complicated, though, when we consider that the measures of these two elements of effectiveness vary in their validity and reliability. Suppose we use value-added techniques to measure contributions to student achievement and principal evaluations for contributions to the school community. Also, suppose that principal evaluations do not closely correspond to teachers’ actual contributions to the school community, perhaps because the principal judges this based in part on how well the principal gets along with the teacher rather than, as intended, how much the teacher helps other colleagues. In this case, even though contributions to the school community are important, the measure might deserve a relatively small weight because of the questionable validity of the measure.</p>
<p>Value-added measures are also prone to error. In particular, there is growing agreement that random error is the biggest problem—this is mostly what makes value-added measures bounce around so much from year to year.<a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a> So, even though instructional quality is important, we might reduce the weight on value-added because of this reliability problem.<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a></p>
<p>As a general rule, as I explained in another CKN brief, it is often a waste of resources to collect multiple measures of the same performance construct, except to the extent that additional measures improve validity and reliability when used in combination with other measures or that additional measures are used in part for formative teacher evaluation.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a> This is why a good case can still be made for using both value-added and structured classroom observations. The classroom observations provide more nuanced information about the specific ways in which instruction can be improved (classroom management, quality of feedback to students, etc.), something not possible with value-added. A weighted average of value-added and classroom observations also appears to improve validity and reliability compared with using either measure alone.<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a></p>
<p>Ultimately, if the goal is to accurately assess teacher performance, then we should choose a weight for each measure that minimizes the chances that a teacher will be misclassified, such as a truly low-performing teacher being placed in a middle- or high-performance category. These “optimal weights,” as I will call them, differ from the value weights because each measure varies in its validity and reliability.</p>
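<p>To make the mechanics concrete, here is a minimal sketch, in Python, of a weighted composite. The 0.75/0.25 weights echo the example above, but the score scale and the category cutoffs are hypothetical illustrations, not values proposed in this brief:</p>
<pre><code># A minimal sketch of the weighting approach. Score scale and
# category cutoffs are hypothetical illustrations.

def weighted_composite(scores, weights):
    """Combine standardized performance measures into one index."""
    assert abs(sum(weights.values()) - 1.0) &lt; 1e-9, "weights must sum to 1"
    return sum(weights[m] * scores[m] for m in weights)

# "Value weights": 0.75 on own-student achievement (value-added),
# 0.25 on contributions to the school community.
value_weights = {"value_added": 0.75, "community": 0.25}

# One teacher, each measure already standardized to a 0-1 scale.
teacher = {"value_added": 0.40, "community": 0.90}

composite = weighted_composite(teacher, value_weights)
category = ("low" if composite &lt; 0.33
            else "high" if composite > 0.67
            else "middle")
print(composite, category)  # 0.525 middle
</code></pre>
<p>These are value weights; optimal weights would adjust them further to reflect each measure's validity and reliability.</p>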
<h4>Matrix approach</h4>
<p>Rather than combining them, multiple measures can also be placed in a “matrix” where personnel decisions depend on the particular combination of measures. In the simplest case, with two measures and two performance categories each, there are four possibilities, illustrated in Figure 1.</p>
<p>Figure 1: Illustration of Matrix Approach</p>
<table>
<tbody>
<tr>
<td></td>
<th colspan="2"><em>Performance Measure A</em></th>
</tr>
<tr>
<th rowspan="2"><em>Performance Measure B</em></th>
<td>Low A &#8211; Low B</td>
<td>High A &#8211; Low B</td>
</tr>
<tr>
<td>Low A &#8211; High B</td>
<td>High A &#8211; High B</td>
</tr>
</tbody>
</table>
<p>For the Low-Low and High-High cells, the performance result will be the same as it was for the weighted average. When you take a weighted average of two low numbers, you have to get another low number. This approach could easily be extended to situations with more than two measures, though this yields more combinations than can be depicted in a simple figure. There could also be other performance categories besides “low” and “high”; Figure 1 is intended just for illustration purposes.</p>
<p>The interesting cases are where we see inconsistencies: the Low-High and High-Low cells. With the weighted average, teachers in both of these categories would be considered average—the lows and highs cancel out.<a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a> However, as noted earlier, if one of the measures is both more valid and reliable than the other, this makes little sense. Also, it seems highly unlikely that a teacher with high value-added and apparently weak classroom practice is really as effective as one with low value-added and strong classroom practice.</p>
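<p>The difference from averaging can be sketched in a few lines of code. The cutoff and the cell decisions below are hypothetical illustrations; the point is only that each cell of the matrix can carry its own decision, so the inconsistent cells need not collapse to "average":</p>
<pre><code># A sketch of the matrix approach with two dichotomized measures.
# Cutoff and cell decisions are hypothetical illustrations.

def level(score, cutoff=0.5):
    return "high" if score >= cutoff else "low"

matrix = {
    ("low",  "low"):  "intensive support",
    ("low",  "high"): "collect more evidence",  # inconsistent cell
    ("high", "low"):  "collect more evidence",  # inconsistent cell
    ("high", "high"): "reward",
}

value_added, observation = 0.8, 0.3
print(matrix[(level(value_added), level(observation))])
# -> collect more evidence
</code></pre>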
<h4>Screening approach</h4>
<p>For a third approach, it is worth looking to the medical profession. It is common for doctors to &#8220;screen&#8221;<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a> for major diseases, using procedures that can identify the vast majority of people who could possibly have the disease. Some patients who test positive will have the disease and some will not—that is, some will be misclassified as “false positives.” Those who test positive on the screening test are given another, gold standard test that is more expensive than the initial test but much more accurate. Doctors do not combine the screening test and the gold standard test into a weighted average. Instead, the two pieces are considered in sequence.<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a></p>
<p>Ineffective teachers could be identified the same way. Value-added measures, like medical screening tests, are relatively inexpensive,<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a> but, some would argue, not very accurate. So a low value-added score should lead us to collect additional information (e.g., more classroom observations, student surveys, portfolios) to identify truly low-performing teachers and to provide feedback to help those teachers improve.</p>
<div class="pull-right">&#8220;Value weights&#8221; reflect what we value about education and teachers&#8217; work</div>
<p>The most obvious problem with this approach is that value-added measures are not designed to capture all potential low-performers.<a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a> They are statistically “noisy,” for example, so many low-performers will get high scores by chance; no additional data would be collected and the low performance would go undetected. Some teachers would slip through the cracks. The false positive rate could also be very large: with value-added as the only screener, the vast majority of teachers would eventually be flagged, so that additional information would have to be collected at some point. Using multiple years of prior data in the screening process would help, but if teacher performance varies over time, then prior years might not be as relevant to assessing current performance. A real trade-off exists here in how to use multiple years of data. For this reason, it would be inadvisable to make value-added the sole screener. Instead, additional measures, such as past performance on other measures, could be used as screeners in conjunction with value-added. If teachers failed on either measure, it would trigger collection of additional information, as the sketch below illustrates.</p>
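<p>Here is a rough sketch of that two-stage logic. The thresholds, the second screener, and the final rule are hypothetical; the essential features are that failing either screener triggers further data collection, and that no performance category is assigned until the second stage:</p>
<pre><code># A sketch of the screening approach. All thresholds are
# hypothetical illustrations.

def needs_follow_up(value_added, prior_rating,
                    va_cutoff=-0.5, rating_cutoff=2.0):
    """Stage one: flag a teacher if EITHER screener looks low."""
    return value_added &lt; va_cutoff or prior_rating &lt; rating_cutoff

def final_category(flagged, follow_up_observation):
    """Stage two: only flagged teachers receive the costlier,
    more accurate measure (e.g., extra classroom observations)."""
    if not flagged:
        return "no action"
    return "low-performing" if follow_up_observation &lt; 2.0 else "cleared"

flagged = needs_follow_up(value_added=-0.8, prior_rating=3.1)
print(final_category(flagged, follow_up_observation=2.6))  # cleared
</code></pre>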
<p>There is a second way in which value-added could be used as a screener – not of teachers, but of the classroom observers who rate teacher practice. As with value-added, observations also suffer from validity and reliability issues. Two observers can look at the same classroom and see different things, meaning that “inter-rater reliability” is low. That problem is more likely when the observers vary in how they are prepared for observing teachers or in how they define teacher effectiveness.</p>
<p>The example given earlier is a case in point. The classroom observer might be aware of the teacher’s prior performance, and this may color her observations. In general, under traditional evaluation systems, principals give high scores to the vast majority of teachers.<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a> Consciously or not, they might think, “I know and like this teacher so I will give her a high observation score.” Or, as the leaders of the schools, principals may worry that low scores reflect poorly on their own performance.</p>
<p>While there is no way to eliminate these types of problems, value-added measures could be used to reduce them. To see how, note that researchers have found consistent, positive correlations between value-added and classroom observation scores. They are far from perfect correlations (mainly because of statistical noise), but they provide a benchmark against which we can compare or validate the scores across individual observers. Inaccurate classroom observation scores would likely show up as being weakly correlated with value-added measures of the same teachers. In particular, if observers fell into the common problem of giving high ratings to almost all teachers, then the comparison with value-added might make this problem evident. Conversely, if observers based their scores on what they already know about teachers’ value-added, the ratings would be distorted and the correlations might be very high, which might also be a red flag.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a> This approach will work less well when part-time observers are used because they will have fewer observations. A smaller sample size means less confidence in the correlation estimates.</p>
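<p>A minimal sketch of this observer check follows. The scores are invented, and the flagging bands (below 0.1 or above 0.6) are hypothetical; a real system would also account for the reliability of the correlation itself, as noted above:</p>
<pre><code># A sketch of screening observers against value-added. Data and
# flagging bands are hypothetical illustrations.
from statistics import correlation  # requires Python 3.10+

def check_observer(obs_scores, va_scores, low=0.1, high=0.6):
    """Flag ratings that track value-added too weakly (possible
    leniency or noise) or too strongly (possible anchoring on
    teachers' known value-added)."""
    r = correlation(obs_scores, va_scores)
    if r &lt; low:
        return r, "flag: weak correlation"
    if r > high:
        return r, "flag: suspiciously strong correlation"
    return r, "ok"

# A lenient observer: nearly flat ratings, unrelated to value-added.
r, verdict = check_observer([3.9, 4.0, 3.9, 4.0, 3.9],
                            [0.6, -0.5, -0.2, 0.1, 0.4])
print(round(r, 2), verdict)  # -0.58 flag: weak correlation
</code></pre>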
<div class="pull-left">Each measure varies in its validity and reliability.</div>
<p>When flags are raised, an additional observer might be used to make sure the information is accurate.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a> In other words, value-added, along with other measures,<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a> can help screen the performance of not only teachers, but observers as well. Used in these ways, value-added would be a key part of the process—possibly a “significant” part of the decision according to RTTT—without being the determining factor in personnel decisions.</p>
<h4>Evaluating the alternative approaches</h4>
<p>The first criterion for evaluating any method of using multiple measures is accuracy—whether, for any given definition of teacher performance, teachers are placed in the correct performance categories. This is why so many of the CKN briefs have focused on concerns about validity and reliability; these are what determine the accuracy of performance classifications.</p>
<p>It is also important to recognize that validity is not fixed. When high stakes are attached to measures, Campbell’s Law says the measures will be corrupted (e.g., by changing the way teachers are assigned to students, increased teaching to the test, etc.). It is often hard to foresee how teachers will react, and in the absence of direct evidence it is hard to assess how corruptibility might undermine validity.</p>
<p>Accuracy is only a starting point, however, for understanding how well the use of multiple measures will work in practice. Simplicity, while it might encourage manipulation of the measures, is also desirable so that teachers and leaders understand the measures and respond in ways that increase performance. Teachers are also more likely to respond in the hoped-for ways when they believe the accountability system is fair.</p>
<p>Educational leaders have also come to learn about the cost of teacher evaluations. When expert teachers or principals have to observe teachers, it takes time, a treasured resource in any school. For principals, this might mean less time creating professional development plans or less time with parents. For expert teachers, this might mean less time teaching their own students and, if these really are the best teachers, this is no small sacrifice.</p>
<p>These criteria are somewhat connected. People are less likely to attempt to manipulate the performance system when they see it as fair. On the other hand, it is probably easier to subvert systems that are simplistic and do not incur the costs necessary to minimize corruptibility.</p>
<p>The three approaches for using multiple measures stack up differently on these criteria. The weighted average approach places teachers with different combinations of performance metrics into a single category, while the matrix method has the advantage of being able to handle the inconsistent cases differently.<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a> Some of the RTTT winners proposed giving teachers tenure if they are above the bar on either value-added or the classroom observation.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a> Alternatively, the rules could preclude placing teachers in the low-performing category if they had high value-added scores, and preclude labeling them high-performing if they had low value-added scores. The matrix method therefore allows greater nuance, but sacrifices simplicity.<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a></p>
<p>As with the simple matrix approach, the screening idea has the disadvantage of added complexity, but this is offset by lower costs; schools could achieve similar levels of validity while devoting less time and fewer resources to data collection. The effects on reliability are less clear. Combining measures in a weighted index can increase reliability depending on whether and how the random errors of the various measures are correlated with one another. However, we are really concerned with the validity and reliability of the personnel decisions, not the measures themselves, and the screening approach is designed to focus attention on those teachers who are near the margins of each performance category. So, even though screening involves less data collection, it may not sacrifice reliability at all, or may even increase it.<a class="fn-ref-mark" href="#footnote-18" id="refmark-18"><sup>[18]</sup></a></p>
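<p>The dependence on error correlation can be made concrete with a small calculation, assuming, purely for illustration, classical measurement error on each measure:</p>
<pre><code># A sketch of how error correlation affects the noise in a
# two-measure composite. All numbers are hypothetical illustrations.
import math

def composite_error_sd(w, sd1, sd2, rho):
    """Error SD of w*m1 + (1 - w)*m2, where the two measures'
    errors have SDs sd1 and sd2 and correlation rho."""
    var = (w ** 2) * sd1 ** 2 + ((1 - w) ** 2) * sd2 ** 2 \
          + 2 * w * (1 - w) * rho * sd1 * sd2
    return math.sqrt(var)

# Two equally weighted, equally noisy measures.
for rho in (-0.5, 0.0, 0.5):
    print(rho, round(composite_error_sd(0.5, 1.0, 1.0, rho), 2))
# Output: -0.5 0.5 / 0.0 0.71 / 0.5 0.87. Negatively correlated
# errors shrink the composite's error the most; positively
# correlated errors erode the benefit of averaging.
</code></pre>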
<div class="pull-right">It is important to recognize that validity is not fixed.</div>
<p>The screening approach also has an advantage over both the others in its fairness. First, there seems to be an increasing sense among teachers that value-added measures are unfair, so anything that reduces emphasis on them might be seen as more fair. Fairness is also rooted partly in whether the process is applied equally to all teachers. Since most teachers are not in tested grades and subjects, the weighted average and matrix approaches cannot be used for them. But the screening approach can be used in roughly the same way for all teachers, even those in non-tested grades. For teachers who have been in the classroom for more than one year, all the information, including classroom observations, from the prior year could be used as part of the first stage. Even though value-added has some advantages as a screening device, it does not have to be the sole basis for that first stage of the process. And if the second stage is based solely on the classroom observation, for example, then the final performance classification is made based on the same criteria for everyone. In this case, state governments would also be able to worry less about trying to extend standardized testing to grades and subjects for which it might not be appropriate.</p>
<p>Value-added would still play an important role in the screening process, albeit a different one, and probably a smaller one, than it plays now. By moving away from numeric weights, it would be more difficult to show whether value-added is a “significant factor,” but what is more important is whether the evaluation process leads to decisions that are valid, reliable, fair, simple, and inexpensive.</p>
<h4>What if each measure captures a different element of teacher effectiveness?</h4>
<p>So far, I have written about multiple ways to use multiple measures—based on multiple criteria, no less—but this still over-simplifies matters. I have been implicitly assuming to this point that each measure captures substantially the same elements of effectiveness.</p>
<p>Suppose we defined teacher effectiveness so that it includes exactly two elements: Element A and Element B, each of which has a corresponding Measure A and Measure B. In the earlier discussion, I was assuming that both measures captured the same share of Elements A and B. This overlap might be more or less reasonable when comparing the classroom observation and value-added measures because both are mostly capturing classroom instruction, broadly defined.<a class="fn-ref-mark" href="#footnote-19" id="refmark-19"><sup>[19]</sup></a> In the extreme case, where two measures capture exactly the same construct(s), everything I said earlier about the three ways to use multiple measures continues to hold. Making this assumption was a useful starting point, in part because most districts are using measures focused on instruction.</p>
<p>But now take the other extreme. Suppose that Measure A only captures Element A and Measure B only captures Element B. This might be reasonable when value-added is used to measure instruction, while a less structured principal evaluation might capture contributions to the school community that are unrelated to classroom instruction. Student Learning Objectives (SLOs) might also fit this situation if they are intended to capture higher-order learning and if the standardized tests (on which value-added measures are based) capture more basic skills. In these examples, we might say the measures are more “one-dimensional.”</p>
<p>To the extent that the measures capture completely different elements of effectiveness, the weighted average and matrix approaches will make more sense than the screening approach. It would not be sensible to use a measure like value-added in the first stage of a screening approach if the second stage focused on measuring contributions to a completely different effectiveness element such as contributions to the school community. To use the analogy of Consumer Reports, this would be like identifying cars that get good gas mileage in the first stage and collecting information about road handling only if the car got good mileage. In contrast, both the weighting and matrix approaches could be used in these situations to combine separate measures of the separate performance elements.</p>
<div class="pull-left">The choice of the measures and the method of using them are intertwined.</div>
<p>The way in which we use the matrix approach is also affected by how one-dimensional the measures are. I mentioned earlier that a teacher with high value-added and low classroom observation scores (High-Low) is unlikely to be as effective as one with low value-added and high classroom observation scores (Low-High). That’s true, but again only to the degree that the measures capture different elements of effectiveness. If the two measures captured each element of effectiveness equally (and if we set aside differences in validity and reliability), then the Low-High and High-Low cases might really be equally effective, and treating those cases the same way might be reasonable.</p>
<p>Whether the measures are overlapping or isolated also affects the quantity of information we have to collect and, therefore, the cost of the system. When the measures overlap, there is less reason to collect multiple measures; they become redundant. But when we need different measures to capture different elements of effectiveness, additional measures become more central, and this drives up costs.</p>
<p>A larger point here is that the choice of the measures and the method of using them are intertwined. The screening approach makes more sense when the measures are overlapping, so the decision about which measures to use cannot be completely separated from the decision about how to use the measures for performance appraisal.</p>
<h3>What more needs to be known on this issue?</h3>
<p>Some progress has been made in recent years in understanding weighted average measures, especially in the well-known Measures of Effective Teaching (MET) project funded by the Gates Foundation.<a class="fn-ref-mark" href="#footnote-20" id="refmark-20"><sup>[20]</sup></a> But generally we know very little about how the use of the weighting approach really affects teaching and learning, and we know even less about the other methods. To what degree do the constructs of effectiveness captured by value-added and classroom observations really overlap? Would the complexity of the screening approach be too confusing? These are the types of questions that can only be addressed in states willing to alter their rules and in districts that believe alternatives might be more viable.</p>
<h3>What can’t be resolved by empirical evidence on this issue?</h3>
<p>The first step in establishing optimal weights—either explicitly in the case of the weighting approach or implicitly in the other methods—is to create the value weights. By definition, value weights are based on what we value rather than the data; therefore, this cannot be resolved with empirical evidence.</p>
<h3>How, and under what circumstances, does this issue impact the decisions and actions that districts can make on teacher evaluation?</h3>
<p>The RTTT competition has nudged states and districts into weighted average measures, but policymakers and practitioners may have more, and perhaps better, options than they realize. Theoretically, each of the three approaches laid out here—weighted average, matrix, and screening—has advantages and disadvantages. Empirically, we know relatively little about how well these alternative systems work.</p>
<div class="pull-right">We know very little about how the use of the weighting approach really affects teaching and learning.</div>
<p>In some sense, these alternatives are not as starkly different as they might seem. They all require multiple measures and require at least an implicit weighting scheme. Also, while I have described the screening approach as more process-oriented, even districts relying on weighted averages have some type of due process procedures in place. There are probably no states or districts where there is not at least some sort of multi-stage appeals process for teacher performance determinations. But the two-stage approach laid out here is still quite different from the appeals processes now in place. A typical appeals process does not use various stages to strategically gather information, and teachers are placed into performance categories in the first stage, whereas the screening approach would not assign teachers to performance categories until the second stage.</p>
<p>The weighting, matrix, and screening approaches can also be used in tandem. In the description above, I said that “if teachers failed on either measure” then more information would be collected in the screening method. This is essentially a matrix approach for the first stage of the screening process. Also, at the second stage of the screening process, we might still decide to combine two or more measures as a weighted average in making the final personnel decision.</p>
<p>Statistical evidence about validity and reliability is unlikely to get us very far in choosing among these approaches. What we really care about is how educators respond to the measures and that requires a different kind of evidence. To learn how to use multiple measures, we need states and districts to break out of the current narrow weighting mindset and try alternatives that RTTT has so far discouraged. Combined with rigorous evaluations like MET, these studies of the effects of actual implementation on teaching and learning can provide the evidence we really need.</p>
<div>
<h2 class="refbtn">References +</h2>
</div>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">Mihaly et al. also find that equal weighting may be optimal if the goal is to accurately predict future value-added. However, the fact that districts are looking to multiple measures itself suggests that predicting future value-added is not the objective. Kata Mihaly, Daniel F. McCaffrey, Douglas O. Staiger, and J.R. Lockwood (2013). A <span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-30">Weighted Average</span> Measure of Effective Teaching. Bill and Melinda Gates Foundation.<a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">Daniel F. McCaffrey, Tim R. Sass, J. R. Lockwood, Kata Mihaly (2009). The Intertemporal Variability of Teacher Effect Estimates, Education Finance and Policy, Vol. 4, No. 4, Pages 572-606.<a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">Of course, if these are the only two measures, we cannot reduce both as the weights have to add up to 1.0. In more complex examples, the correlation among the errors also becomes important. However, with only two measures, the correlation affects the overall validity and reliability of the weighted average, but not the optimal weights. A negative correlation among the errors will improve reliability of the weighted average no matter what the weights are. The situation is much more complex when there are more than two measures.<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text">Douglas Harris (2013), How Do Value-Added Indicators Compare with Other Measures of Teacher Effectiveness. Carnegie Foundation for the Advancement of Teaching.<a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">If both measures capture somewhat different elements of teacher effectiveness, and both elements are considered important, then the weighted average will improve validity relative to using only one measure. The effect on reliability depends on the correlation in the random errors. See Mihaly et al. (2013) Ibid.<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">This assumes the weights are relatively equal. If almost all the weight is given to Measure A then the teacher’s performance on that measure would dominate.<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">I have written about the screening approach elsewhere: Douglas N. Harris (2011). Value-Added Measures in Education. Cambridge, MA: Harvard Education Press. Douglas N. Harris (2012). Creating A Valid Process For Using Teacher Value-Added Measures. Shanker Institute. November 12, 2012. http://shankerblog.org/?p=7242. Douglas N. Harris (2012). Value-Added As A Screening Device: Part II. http://shankerblog.org/?p=7529.<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">Since I am using a medical analogy, some might want to call this a “triage” approach. This term fits in some ways but not in others. In both cases, the focus is on allocating resources in cost-effective ways. The higher-performing teachers get less attention just as healthier patients do. On the other hand, there is a difference between this approach and medical triage, as the latter entails devoting few resources to those who are least likely to make it. Instead, part of this point is to collect more information on these struggling teachers so that personnel decisions can be made with confidence and in keeping with legal requirements.<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text">When I say inexpensive, I mean on a per-student or per-teacher basis and for districts and states that take advantage of the larger economies of scale involved in these calculations. To the vendors who provide value-added measures, it costs a similar amount to do it for 100 teachers as it does for 50,000 teachers, whereas with almost any other teacher performance measure, the cost grows in proportion to the number of teachers. (The main cost of value-added measures that is more proportional is having teachers check their student rosters.) This also sets aside the costs of the standard testing regime, which was originally intended for school-level accountability that preceded the movement toward teacher accountability. It also ignores the cost of expanding the testing regime to cover all teachers, partly because I view such an expansion as unwise.<a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">Perhaps the simplest way to identify a screening device is to increase the Type I error rate, so that more teachers are automatically identified as potentially ineffective. But taking that approach to the extreme defeats the purpose of the screening approach: identifying potentially ineffective teachers efficiently. In addition to increasing the Type I error rate, the usefulness of a screener can be improved by using more prior information about teacher performance, which is also inexpensive for the fact that it has already been collected for prior evaluations.<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">Daniel Weisberg, Susan Sexton, Jennifer Mulhern, and David Keeling (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. New York: The New Teacher Project.<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">Setting aside the reliability of the correlations, this approach will capture most validity problems. If an observer is an easy or harsh rater (showing up in very low or high average scores) and the observer scores hit the floor or ceiling of the observer scale, then it would show up as a low correlation and raise a red flag. If only one of these conditions holds, then the approach might not work, but these are likely to be the less extreme cases (e.g., an easy rater will only not hit the ceiling when their biases are small). One circumstance in which the correlations might be of little value is when the observer makes different types of errors with different teachers that cancel out and yield a correlation in the expected range.<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">The comparison between observations is most useful when the two observers are scoring the same instance of classroom instruction, rather than following up on a different day or a different class or subject on the same day. However, having two observers at the same time might pose coordination problems; another way to address this would be through video-taped instruction.<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">The use of value-added in this way is not meant to preclude or replace other methods for improving the validity of classroom observations (or other measures such as student learning objectives). For example, it is also advisable, albeit costly, to have multiple observers and do calibrations by having observers see the same instances of classroom instruction.<a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">Another dimension sometimes used in these matrices is the level of student test scores. One reason for doing this is that ceiling effects of tests mean that teacher value-added may be less valid for teachers whose students are already at high levels. In that case, the matrix approach allows teachers with high score levels and medium or low value-added to be treated differently. Louisiana is one state that is adopting this approach.<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text">Mihaly et al. (2013) call this the “disjunctive approach.” Alternatively, when high performance is required on both measures, they call it the “conjunctive approach.”<a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">It would seem that one additional advantage of the matrix approach is avoiding the difficult task of creating weights, but this is a bit of an illusion. If teachers in the Low-High and High-Low categories are treated equally, then, implicitly, the two measures are equally weighted (50-50). The matrix approach is also just as costly as the weighted average method because the information still has to be collected on all teachers in order to place each of them into the correct cell.<a href="#refmark-17"></a></li>
<li id="footnote-18" class="fn-text"><span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-18">Reliability</span> could increase if value-added were used successfully to reduce the error in the classroom observations.<a href="#refmark-18"></a></li>
<li id="footnote-19" class="fn-text">When I say classroom instruction, I mean essentially everything happening in the classroom from literal instruction to classroom management.<a href="#refmark-19"></a></li>
<li id="footnote-20" class="fn-text">Mihaly et al. (2013) Ibid.<a href="#refmark-20"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/multiple_measures/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What Do We Know About Using Value-Added to Compare Teachers Who Work in Different Schools?</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/comparing-teaching/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/comparing-teaching/#comments</comments>
		<pubDate>Mon, 19 Aug 2013 22:31:39 +0000</pubDate>
		<dc:creator><![CDATA[Joanna Huang]]></dc:creator>
				<category><![CDATA[Knowledge Briefs]]></category>
		<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=1590</guid>
		<description><![CDATA[<span style="color: #999;">KNOWLEDGE BRIEF 10</span><br />by <a href="/technical-panel/#raudenbush">Stephen Raudenbush</a> 
<br />
Statistically, comparisons of teachers across different schools make less sense in districts with high levels of between-school segregation.
Much of the research on the validity of value-added for teacher evaluation has been based on studies of its use to rank teachers within the same school. However, most districts are using value-added to rank teachers across the district, and in districts that reflect neighborhood residential segregation, value-added rankings will compare teachers who teach very different types of students. Trying to correct for this by statistically controlling for school-context variables will produce bias so long as teacher effectiveness is associated with the attributes that make schools effective.

]]></description>
				<content:encoded><![CDATA[<div class="instaemail etp-alignleft"><a class="instaemail" id="instaemail-button" href="#" rel="nofollow" title="Email this page" onclick="pfEmail.init()" data-button-img=email-button>Email this page</a></div><div align="right"><a href="/wp-content/uploads/2013/08/CKN_Raudenbush-Comparing-Teachers_FINAL_08-19-13.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png"</a></div>
<p align="center"><a href="http://www.carnegieknowledgenetwork.org/briefs/comparing-teaching/attachment/raudenbushphoto-comparingteachers_aug2013-e1376067672744-1024x680/" rel="attachment wp-att-1657"><img class="alignnone  wp-image-1657" alt="RaudenbushPhoto-ComparingTeachers_Aug2013-e1376067672744-1024x680" src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/08/RaudenbushPhoto-ComparingTeachers_Aug2013-e1376067672744-1024x680.jpg" width="575" height="281" /></a></p>
<div class="biobox"><img alt="Raudenbush" src="/wp-content/uploads/2012/03/Raudenbush.jpg" width="" height="100" /><br />
<a href="/technical-panel/#raudenbush">Stephen Raudenbush</a><br />
<strong>Chair</strong><br />
Committee on Education<br />
University of Chicago</div>
<h2>Stephen W. Raudenbush</h2>
<h3>Highlights</h3>
<ul>
<li><span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-0">Bias</span> may arise when comparing the value-added scores of teachers who work in different schools.</li>
<li>Some schools are more effective than others by virtue of their favorable resources, leadership, or organization; we can expect that teachers of similar skill will perform better in these more effective schools.</li>
<li>Some schools have better contextual conditions than others, providing students with more positive influences – peers who benefit from safe neighborhoods and strong community support. These conditions may facilitate instruction and thus tend to increase a teacher’s value-added score.</li>
<li>Value-added models control statistically for student background and previously demonstrated student ability. But these controls tend to be ineffective, and possibly even misleading, when we compare teachers whose classrooms differ greatly on these factors.</li>
<li>There are several methods for checking the sensitivity of value-added scores to school variation in contextual conditions and student backgrounds.</li>
<li>If value-added scores are sensitive to these factors, we can revise the analysis to ensure that the classrooms being compared are similar on measures of student background and school composition, thus reducing the risk of bias.</li>
</ul>
<h3>Introduction</h3>
<p>This brief considers the problem of using value-added scores to compare teachers who work in different schools. My focus is on whether such comparisons can be regarded as fair, or, in statistical language, “unbiased.” An unbiased measure does not systematically favor teachers because of the backgrounds of the students they are assigned to teach, nor does it favor teachers working in resource-rich classrooms or schools. A key caveat: a measure that is <i>unbiased</i> is not necessarily <i>accurate</i>. An unbiased measure could be imprecise – thus inaccurate – if, for example, it is based on a small sample of students or on a test with too few items. I will not consider the issue of statistical precision here, having considered it in a previous brief.<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a> This brief focuses strictly on the bias that may arise when comparing the value-added scores of teachers who work in different schools.</p>
<div>
<h3>Challenges That Arise in Comparing Teachers Who Work in Different Schools</h3>
<p>In a previous brief, Goldhaber and Theobald showed that how teachers rank on value-added can depend strongly on whether those teachers are compared to colleagues working in the same school or to teachers working in different schools.<a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a> This discrepancy by itself does not mean that between-school comparisons are biased. However, previous literature identifies three distinct challenges that arise in comparing teachers who work in different schools, and each brings a risk of bias.</p>
<p>First, some schools are more effective than others by virtue of their favorable resources, leadership, or organization. We can expect that teachers of similar skill will perform better in these more effective schools. Second, some schools have more favorable contextual conditions than others, providing a student with more favorable peers – those who benefit from strong community support and neighborhood safety. These contextual conditions may facilitate instruction, thus tending to increase a teacher’s value-added score. Third, value-added models use statistical controls to make allowances for students’ backgrounds and abilities. These controls tend to be ineffective, and possibly misleading, when we compare teachers whose classrooms vary greatly in the prior ability or other characteristics of their students. This problem can be particularly acute when we compare teachers in different schools serving very different populations of students. It can also arise when we compare teachers who work in the same school but who serve very different sub-populations, as with teachers in high schools in which students are tracked by ability.<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a></p>
<p>After describing each of these challenges, I consider ways to check the sensitivity of value-added scores to variations in schools’ contextual conditions and students’ backgrounds. If value-added scores are sensitive to these factors, we can revise the analysis to ensure that the classrooms being compared are similar on measures of student background and school composition, thus reducing the risk of bias. In this revised analysis, the aim is to compare teachers who work with similar students in similar schools. Policymakers may debate the utility of such comparisons, but they are better supported by the available data than are comparisons between teachers who work with very different subsets of students under different conditions, and they are therefore less vulnerable to bias.</p>
<h4>1. Variation in school effectiveness</h4>
<div>
<p>Emerging evidence suggests that some schools are more effective than others in managing resources, creating cultures for learning, and providing instructional support. An effectively organized school may provide benefits to all teachers working in that school. If so, teachers assigned to such schools will look better on value-added measures than will teachers who work in less effective schools.</p>
<div class="pull-right">Recent randomized experiments provide compelling evidence that schools have highly varied effects.</div>
<p>Social scientists have for years debated whether schools vary substantially in their effectiveness, and if so, why. In his landmark 1966 report “Equality of Educational Opportunity,” sociologist James S. Coleman suggested that socio-economic segregation of schools contributed to variation in learning but that factors such as facilities and spending mattered little.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a> The Coleman Report cast a long shadow over the proposition that giving schools more resources would improve education. However, Coleman’s and other early studies were based on cross-sectional data. Early value-added modeling of schools based on longitudinal data suggested that students with similar backgrounds experience very different growth rates depending on the schools they attend.<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a> These and more recent longitudinal studies raised the question of whether the internal life of schools, and in particular differences in leadership and collegial support, are more important than student composition in promoting learning.<a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a></p>
<p>Recent randomized experiments provide compelling evidence that schools have highly varied effects. These studies capitalize on the fact that new charter schools are often oversubscribed: more students apply than can be admitted. By law, applicants to these charter schools are offered admission on the basis of a randomized lottery. Researchers are now following the outcomes of winners and losers of these lotteries. A study of randomized lotteries in 36 charter schools found that being admitted to a charter school made little difference in outcomes, on average. However, the <i>variation</i> in the impact of being so assigned was substantial. This result gives strong causal evidence not that charter schools <i>per se</i> are particularly effective, but that some schools are substantially more effective than others.<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a> Other researchers developed a model to predict this variation and found that five policies, including “frequent feedback to teachers, the use of data to guide instruction, high-dosage tutoring, increased instructional time, and high expectations”, explain approximately 50 percent of the variation in school effectiveness.<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a> Leaders in effective charter schools take pains to ensure that teachers follow school-wide procedures and norms. These randomized studies corroborate work showing that effective school leadership, professional work communities, and even school safety during a base year predict changes in the value that a school adds to learning.<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a></p>
<div class="pull-left">Comparisons of teachers who work in different schools confound teacher skill and school effectiveness.</div>
<p>Separating the contribution of school leadership and resources from the average contribution of teacher skill is challenging, however. A critic of the evidence reviewed above might reasonably argue that what makes a school effective is nothing more than the average skill level of its teachers. However, several recent studies provide evidence against this criticism. In two of these studies, experimenters randomly assigned whole schools to innovative school-wide instructional curricula, revealing substantial positive effects on student learning. In these cases, the teaching force remained stable, yet the introduction of a new school-wide curriculum created added value that cannot be attributed simply to the aggregate quality of the teaching force.<a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a> Another recent study followed the value-added scores of teachers as they moved from one school to another. This study provided evidence that a teacher’s value-added score will tend to improve when that teacher moves to a school in which other teachers have high value-added scores. This evidence suggests that teachers learn from high-skill peers.<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a> Teacher collaboration and peer learning may also augment the impact of school-level factors such as the coherence of the curriculum, the availability of instructional materials, and the length of the school day or year.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a> In sum, comparisons between the value-added scores of teachers who work in different schools confound teacher skill and school effectiveness. Current value-added technology provides no means by which to separate these influences, a fact that poses a challenge for future research. Thus we have good reason to suspect that school effectiveness biases comparisons of the value-added scores of teachers working in different schools.</p>
<h4>2. Variation in peers</h4>
<div>
<p>Within a district, schools tend to serve quite different sets of students. High-achieving students tend to be clustered in schools in which peers are highly motivated, parents are committed to the success of the school, and the surrounding neighborhood is safe.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a> These are presumably favorable conditions for instruction.</p>
<div class="pull-right">Value-added models cannot isolate the impact of school organization or teacher expertise if those are correlated with local conditions.</div>
<p>Relatedly, sociological research has established that teachers tend to calibrate the content and pacing of their instruction according to the average prior achievement of their students,<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a> implying that students will learn faster in classes with high-ability peers. Moreover, there is evidence that teachers themselves believe they are more effective when teaching higher-ability students than when teaching lower-ability students.<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a> We have good reason, then, to think that favorable peer composition facilitates school and teaching effectiveness. The problem at hand is that, in principle, standard value-added models cannot isolate the impact of school organization or teacher expertise if those factors are correlated with peer motivation, parent commitment, neighborhood safety, and other local conditions.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a> The reason is that although value-added models may include measures of peer composition, such as average prior ability or average family socioeconomic status, the value-added of the teacher or school is unobserved. If peer composition and value-added are correlated – as the research suggests – we have no way of isolating value-added.<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a> Indeed, an attempt to control for peer composition when estimating value-added may introduce extra bias into value-added scores.<a class="fn-ref-mark" href="#footnote-18" id="refmark-18"><sup>[18]</sup></a></p>
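<p>To make this mechanism concrete, here is a minimal simulation, with every quantity invented for illustration. When teacher effects are correlated with classroom peer composition, a regression "control" for peer composition absorbs part of the teacher effect, and the residual that would be reported as value-added understates the effectiveness of teachers assigned stronger peer groups.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
J, n = 500, 25                                  # classrooms, students per class

peer = rng.normal(0, 1, J)                      # classroom-mean prior achievement
teacher = 0.5 * peer + rng.normal(0, 0.5, J)    # teacher effect correlated with peers

# classroom-mean gain: a true peer effect of 0.2 plus the teacher's contribution,
# plus sampling noise in a class mean of n students
mean_gain = teacher + 0.2 * peer + rng.normal(0, 1 / np.sqrt(n), J)

# "control for peer composition" by regressing mean gains on the peer measure
b = np.polyfit(peer, mean_gain, 1)[0]           # estimated peer coefficient
va_hat = mean_gain - b * peer                   # residual, treated as value-added

print(f"true peer coefficient 0.20, estimated {b:.2f}")   # inflated by teacher effect
print(f"corr with true teacher effect: {np.corrcoef(va_hat, teacher)[0, 1]:.2f}")
</code></pre>
<p>In this fabricated example the peer coefficient is estimated near 0.7 rather than its true 0.2, because the unobserved teacher effect loads onto the peer measure; the residualized scores then recover only the part of teacher effectiveness that is unrelated to peer composition.</p>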
<p>The connection between peer composition and instructional effectiveness likely plays out very differently at the elementary and secondary levels. Elementary schools draw students from local neighborhoods that tend to be quite segregated with respect to family income and race/ethnicity. As a result, elementary schools tend to be comparatively internally homogeneous with respect to student background. In contrast, large, comprehensive public secondary schools draw students from multiple elementary schools and thus tend to be internally more heterogeneous than are elementary schools. In response, high schools typically assign students to classrooms based on their perceived ability. This process of “tracking” can generate large differences among classrooms.<a class="fn-ref-mark" href="#footnote-19" id="refmark-19"><sup>[19]</sup></a> Harris considers the special problems that arise in studying teacher value-added within secondary schools that use tracking,<a class="fn-ref-mark" href="#footnote-20" id="refmark-20"><sup>[20]</sup></a> and reasons that problems of peer composition are more pronounced within high schools than they are within elementary schools. These effects are likely to be particularly important when we compare teachers who work in different schools, even elementary schools.</p>
<h4>3. “Common support” and statistical adjustment for student background</h4>
<div>
<p>The problem of statistically adjusting for student background is distinct from that of isolating peer effects or differences in school effectiveness. Even if peer effects were negligible and all schools were equally effective, classroom composition could bias value-added. For example, two classrooms taught by equally skilled teachers might display different learning rates simply because one classroom had more able students. Random assignment of students to teachers would solve this problem, but random assignment doesn’t happen in practice, so statisticians have invented adjustments to control for student background factors that predict future achievement. The problem is that, in general, we cannot rely on standard methods of statistical adjustment to work well when the backgrounds of children attending different classrooms vary substantially. Statisticians call this the failure of “common support”, and it is more likely to occur when we compare teachers in different elementary schools than when we compare teachers in the same elementary school. This problem is also likely to arise in comparisons of teachers within high schools that use tracking.</p>
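<p>A simple diagnostic, sketched below with invented data, asks what share of one classroom's students have prior scores inside the bulk of the other classroom's prior-score distribution. Shares near one indicate good common support; shares near zero signal that any comparison would rest on extrapolation.</p>
<pre><code>import numpy as np

def support_shares(prior_a, prior_b, lo=5, hi=95):
    """Share of each class whose prior scores fall inside the middle
    90 percent of the other class's prior-score distribution."""
    a_lo, a_hi = np.percentile(prior_a, [lo, hi])
    b_lo, b_hi = np.percentile(prior_b, [lo, hi])
    # x lies in [lo, hi] exactly when (x - lo) * (hi - x) is nonnegative
    share_a = np.mean((prior_a - b_lo) * (b_hi - prior_a) >= 0)
    share_b = np.mean((prior_b - a_lo) * (a_hi - prior_b) >= 0)
    return share_a, share_b

rng = np.random.default_rng(1)
# two classes in the same school draw similar students; two classes in
# segregated schools draw from opposite ends of the prior distribution
same_school = support_shares(rng.normal(0.0, 1.0, 30), rng.normal(0.2, 1.0, 30))
diff_school = support_shares(rng.normal(-1.0, 0.8, 30), rng.normal(1.0, 0.8, 30))
print(same_school)   # both shares near 1: good common support
print(diff_school)   # much lower shares: weak common support
</code></pre>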
<div class="pull-left">We cannot rely on statistical adjustment to work well when the background of the children attending different classrooms varies substantially.</div>
<p>To understand how statistical adjustments work, consider a comparison of two teachers, A and B, whose students differ in average prior achievement. Statistical adjustment is based on a statistical model that predicts how teacher A’s students would have done if they had been assigned to teacher B and how teacher B’s students would have done if they had been assigned to teacher A. If the statistical model is based on good background information, such as prior test scores that strongly predict future test scores, this may work very well. In particular, if the two groups of students overlap considerably in their background, the data will have good predictive information about how each set of students would do in either classroom. This is a case in which the two groups have good “common support” for the model.</p>
<div class="pull-right">It is essential to compare teachers serving similar children to draw any valid causal conclusions.</div>
<p>However, if these two distributions do not overlap, we have a problem – a failure of common support. Suppose, in the worst case, that all of teacher A’s students have higher prior achievement than any of teacher B’s students. In this case the data have no information about how A’s students would do in B’s class or how well B’s students would do in A’s class; there is simply no valid comparison group for either teacher. In this case, the value-added score will simply be an extrapolation – a guess based on the analyst’s belief about whether the relationship between prior background and future test score is linear or, in some known way, non-linear.<a class="fn-ref-mark" href="#footnote-21" id="refmark-21"><sup>[21]</sup></a> In essence, the comparison of value-added scores is not based on the data but is entirely based on the analyst’s assumptions about the model. This extreme case – no overlap in the two distributions – is unlikely to arise in practice. The key point is that the smaller the overlap in the two distributions of prior background, the less information the data can provide about the comparative effectiveness of the two teachers. This problem is especially acute if there is any reason to suspect that teachers are differentially effective for students of different backgrounds. In that case, it is essential to compare teachers serving similar children to draw any valid causal conclusions.<a class="fn-ref-mark" href="#footnote-22" id="refmark-22"><sup>[22]</sup></a> Central to our discussion is the fact that, at the elementary school level, a lack of common support is more likely to occur in comparisons between teachers working in different schools than in comparisons between teachers working in the same school. A failure of common support is also likely to arise in comparisons among high school teachers working in the same school if that school tracks students on the basis of ability. It is possible and useful to check comparability in any value-added analysis, a topic to which I return in the concluding section.</p>
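<p>The dependence on functional form is easy to demonstrate. In the invented example below, a model is fit only to one teacher's high-prior students and then asked to predict outcomes for another teacher's low-prior students. A linear and a quadratic specification both fit the observed data well, yet they give very different answers once they leave the common support.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(2)

# teacher A's students all have high prior scores; the true prior-to-post
# relation is mildly curved (all coefficients invented for illustration)
prior_a = rng.uniform(1, 2, 40)
post_a = 0.9 * prior_a - 0.1 * prior_a**2 + rng.normal(0, 0.1, 40)

# teacher B's students have low prior scores, with no overlap with A's
prior_b = rng.uniform(-2, -1, 40)

linear = np.polyval(np.polyfit(prior_a, post_a, 1), prior_b).mean()
quadratic = np.polyval(np.polyfit(prior_a, post_a, 2), prior_b).mean()
print(f"predicted mean for B's students, linear fit:    {linear:.2f}")
print(f"predicted mean for B's students, quadratic fit: {quadratic:.2f}")
# both specifications fit A's data well, but their extrapolations diverge,
# so the implied comparison of A and B rests on the model, not the data
</code></pre>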
<p>Exacerbating any failure of common support is the inherent uncertainty in achievement test scores. Suppose one school draws students from the upper end of the achievement distribution while another school draws students from the lower end. Suppose that students in these schools make, on average, a five-point gain in achievement. To say that these gains represent an equivalent amount of learning requires unwarranted assumptions about the achievement test. We can speak confidently of our measurements of things like time and distance, but measuring gains in cognitive skill is much more difficult. On some tests, low-achieving students can easily make comparatively large score gains simply because the test has a large number of easy items. In this case, teachers working in low-scoring schools would produce inflated value-added scores. On other tests, it will be comparatively easy for high-achieving students to make large gains, biasing value-added in favor of teachers working in those schools. Simply put, our current technology for constructing tests does not allow us to make strong claims about the relative gains of students who start from very different places in the achievement distribution. Our testing technology works much better when we compare gains made by students of similar background and prior skill.</p>
<div class="pull-right">Efforts to compare teachers working in schools that serve children from widely varied backgrounds are vulnerable to bias.</div>
<p>In sum, efforts to compare teachers working in schools that serve children from widely varied backgrounds are vulnerable to bias. In a typical school district, about 15-20 percent of the total variation in students’ average incoming achievement lies between schools.<a class="fn-ref-mark" href="#footnote-23" id="refmark-23"><sup>[23]</sup></a> A between-school share of that size puts the standard deviation of school means at roughly 0.4 student-level standard deviations, which means that students attending a high-achieving school will tend to score around 1.5 standard deviations higher, on average, than students attending a low-achieving school. Comparing teachers across such a wide range of schools risks a failure of common support and introduces substantial peer effects, increasing the risk of bias when we compare the value-added scores of teachers working in different schools.</p>
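<p>The between-school share itself is easy to compute from district data. The sketch below uses simulated scores rather than any real district's data; it decomposes the variance of incoming achievement and shows how a share in the 15-20 percent range translates into school means spread out by roughly 0.4 student-level standard deviations.</p>
<pre><code>import numpy as np

def between_school_share(scores, school_ids):
    """Fraction of total score variance that lies between school means."""
    labels = np.unique(school_ids)
    means = np.array([scores[school_ids == s].mean() for s in labels])
    sizes = np.array([np.sum(school_ids == s) for s in labels])
    between = np.sum(sizes * (means - scores.mean()) ** 2) / scores.size
    return between / scores.var()

rng = np.random.default_rng(3)
school_ids = np.repeat(np.arange(100), 50)        # 100 schools, 50 students each
scores = np.repeat(rng.normal(0, 0.45, 100), 50) + rng.normal(0, 1, 5000)

share = between_school_share(scores, school_ids)
print(f"between-school share: {share:.2f}")       # should land near 15-20 percent
# a share near 0.17 puts the SD of school means near 0.4 student-level SDs,
# so schools far apart in the school-mean distribution differ by well over
# a full student-level SD in average incoming achievement
print(f"SD of school means (student SD units): {np.sqrt(share):.2f}")
</code></pre>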
<div>
<h3>Empirical Evidence of Bias</h3>
<p>As we have seen, researchers have found that teachers rank quite differently when they are compared to colleagues in the same school than when they are compared to teachers in other schools. This finding leads us to predict that comparing teachers in different schools produces more bias than does comparing teachers in the same school. However, we need to see whether empirical evidence supports such predictions. What do we know about the magnitude of bias that arises in each case?</p>
<h4>Comparing teachers within schools</h4>
<p>Several randomized experiments lend support to the idea that value-added scores are approximately unbiased. In one large-scale study, students were randomly assigned to teachers. No statistical adjustments were needed to correct for bias, so a comparison between classrooms within the same school can be regarded as a comparison of “true value-added”. The analysts found that the variation in such value-added scores was quite large, indicating that teachers do, in fact, vary substantially in their effectiveness. Moreover, the variation in value-added in this experiment was similar in magnitude to the variation in value-added typically found in conventional non-randomized value-added analysis.<a class="fn-ref-mark" href="#footnote-24" id="refmark-24"><sup>[24]</sup></a> Two more recent experiments assessed the bias of value-added scores more directly. Value-added scores were computed conventionally for one year using standard methods of statistical adjustment. During the next year, pairs of teachers within schools were randomly assigned to student rosters, enabling the researchers to compute unbiased value-added scores with no statistical adjustment. Comparisons between the conventional and experimental value-added scores provided some evidence that the conventional scores were approximately unbiased.<a class="fn-ref-mark" href="#footnote-25" id="refmark-25"><sup>[25]</sup></a></p>
<p>All of these encouraging studies used experimental evidence to investigate the bias of using conventional value-added scores to compare teachers working in the same school. A key question is whether such encouraging results can be found for comparing teachers in different schools. Here the research base is sparser. Perhaps the most important study of this type followed 2.5 million children in grades 3-8 into adulthood. The researchers found that students assigned to high value-added teachers had higher educational attainment, earnings, and wealth as adults.<a class="fn-ref-mark" href="#footnote-26" id="refmark-26"><sup>[26]</sup></a> The researchers tested the potential bias of the value-added scores in two ways. First, using parental tax data, they compared the family socioeconomic characteristics of students assigned to high value-added teachers with those of students assigned to low value-added teachers. They found no association between teacher value-added and these characteristics, providing evidence against the claim that unmeasured family characteristics had biased the value-added scores. Second, they compared a school’s average achievement before and after a “high value-added” teacher had left the school. They found that when a school loses a teacher with a high value-added score, the school’s achievement tends to decrease. This finding is important because it supports the claim that the value-added score has causal content, and it supports the finding that within-school differences in teacher value-added reflect real differences in effectiveness. Nevertheless, the teacher value-added scores computed in this study, despite reflecting differences in teacher effectiveness, are vulnerable to bias. At least part of the variation in teacher value-added may have reflected differences in school organizational effectiveness or differences in community and peer effects. The researchers did not consider the possibility that high value-added teachers work in schools that are effectively managed, and that attending such an effective school is key to students’ future success. Nor did they test whether omitted school-level variables were associated with teacher value-added. Instead, the authors implicitly assumed that any differences between schools must reflect differences in the individual skill of teachers working in those schools. A re-analysis of these data could estimate the contribution of school value-added to adult success to assess whether an alternative explanation based on school differences is plausible.</p>
<div class="summery">
<h2>Conclusions and Recommendations</h2>
<h3></h3>
<p>This brief has considered sources of potential bias when we use value-added scores to compare teachers working in different schools. A growing body of evidence suggests that schools can vary substantially in their effectiveness, potentially inflating the value-added scores of teachers assigned to effective schools. Schools also vary in contextual conditions such as parental expectations, neighborhood safety, and peer influences that may directly support learning or that may contribute to school and teacher effectiveness. Moreover, schools vary substantially in the backgrounds of the students they serve, and conventional statistical methods tend to break down when we compare teachers serving very different subsets of students.</p>
<p>Although we know that there is great potential for bias when we compute value-added scores for teachers working in different schools, we do not yet know the extent to which this potential is actually realized. Nor do we have conclusive evidence regarding the extent to which teachers are particularly effective or ineffective for particular kinds of students. This uncertainty poses a challenge to those who wish to interpret teacher value-added scores. One way to address this challenge is to check the sensitivity of value-added results to variations across schools in effectiveness and student composition. Several approaches come to mind.</p>
<p>First, following Goldhaber and Theobald,<a class="fn-ref-mark" href="#footnote-27" id="refmark-27"><sup>[27]</sup></a> one can compute value-added scores two ways: by comparing teachers within schools and by comparing teachers without regard to their school assignment. If the rankings are consistent, we have little reason to favor the within-school comparisons. My colleagues and I did this, not with value-added scores but with student perceptions of teaching quality, using seven indicators of teacher effectiveness based on the Tripod Survey Assessments of Ronald Ferguson of Harvard University.<a class="fn-ref-mark" href="#footnote-28" id="refmark-28"><sup>[28]</sup></a> Our results, taken from data on a large urban district, were highly convergent. Correlations between indicators computed in these two different ways ranged from .91 to .96 across the seven dimensions, with a mean of .94. This was not surprising, because school differences accounted for little of the variation in Tripod: Only 2-7 percent of the variation in these indicators lay between schools. Given these convergent results, there is little reason to believe that school differences were adding extra bias to these indicators based on student perceptions. (As a further check, I recommend computing the percentile rank of teachers under the two procedures; that is, comparing teachers who work in the same school and comparing teachers without reference to the school in which they work.)<a class="fn-ref-mark" href="#footnote-29" id="refmark-29"><sup>[29]</sup></a> The convergence makes sense: Students are likely to base their perceptions of teaching quality on experiences with teachers in the same or similar schools, which likely explains why the fraction of variation between schools in student perceptions is comparatively small. In contrast, as mentioned, value-added scores tend to vary considerably between schools.</p>
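<p>This convergence check is inexpensive to run. The simulation below is a stylized sketch, with both variance components invented: when little of the variation in a measure lies between schools, district-wide and within-school rankings agree closely, and the agreement weakens as the between-school component grows, which is the pattern to watch for with value-added.</p>
<pre><code>import numpy as np

def rank_agreement(x, y):
    """Spearman rank correlation via double argsort."""
    rx, ry = x.argsort().argsort(), y.argsort().argsort()
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(4)
n_schools, per_school = 40, 10
school_of = np.repeat(np.arange(n_schools), per_school)
skill = rng.normal(0, 1, n_schools * per_school)    # teacher-level component

for school_sd in (0.3, 1.0):                        # small vs. large school component
    district = skill + rng.normal(0, school_sd, n_schools)[school_of]
    school_means = np.array([district[school_of == s].mean()
                             for s in range(n_schools)])
    within = district - school_means[school_of]     # within-school comparison
    print(f"school SD {school_sd}: agreement {rank_agreement(district, within):.2f}")
</code></pre>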
<p>What if results are not convergent? A sensible strategy is to divide schools into subsets that serve rather similar students. One might then use value-added scores (or other indicators) to compare teachers who work in the same <i>subset</i> of schools. To check the sensitivity of those measures, one might again check convergence between two sets of estimates: those that compare teachers within schools and those that compare teachers working in different schools but in the same subset of schools. If these are convergent, we can assume that the decision to compare teachers working in different schools (within the same <i>subset</i> of schools) has not contributed bias to the value-added scores.<a class="fn-ref-mark" href="#footnote-30" id="refmark-30"><sup>[30]</sup></a></p>
<p>Shavelson and Wiley suggest a refined version of this approach: for each school, select a subset of schools that match that school in terms of student characteristics. Call this the reference set for a particular school. Ideally, each school is located in the middle of its reference set with respect to the distribution of expected achievement gains.<a class="fn-ref-mark" href="#footnote-31" id="refmark-31"><sup>[31]</sup></a></p>
<p>An additional check is to compute the “contextual effect” of student composition, as is common in research in educational sociology. This indicator, along with a measure of variation between schools in school-mean prior achievement, is diagnostic of the bias that plausibly arises from school heterogeneity.<a class="fn-ref-mark" href="#footnote-32" id="refmark-32"><sup>[32]</sup></a> The same procedure can be used to assess whether classrooms within a school are too heterogeneous in student background to support unbiased value-added. I would recommend using this procedure (see appendix), particularly in the case of secondary schools that track students to classrooms on the basis of ability.</p>
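<p>The following is a minimal sketch of the contextual-effect computation, with a fabricated data-generating process. One regresses student outcomes on individual prior achievement and on the school mean of prior achievement; the coefficient on the school mean is the contextual effect, and, following the bound described in note 32, its square times the variance of school means caps the plausible bias variance.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(5)
n_schools, per_school = 80, 40
sid = np.repeat(np.arange(n_schools), per_school)

# fabricated process: a genuine compositional ("contextual") effect of 0.15
school_mean = rng.normal(0, 0.5, n_schools)
prior = school_mean[sid] + rng.normal(0, 1, n_schools * per_school)
post = 0.7 * prior + 0.15 * school_mean[sid] + rng.normal(0, 0.6, n_schools * per_school)

# observed school means of prior achievement, as a district would compute them
obs_mean = np.array([prior[sid == s].mean() for s in range(n_schools)])[sid]

X = np.column_stack([np.ones(prior.size), prior, obs_mean])
beta = np.linalg.lstsq(X, post, rcond=None)[0]
print(f"contextual coefficient: {beta[2]:.2f}")     # close to the true 0.15
# per the bound sketched in note 32, the squared contextual coefficient times
# the variance of the school means caps the bias variance from composition
print(f"bias-variance bound: {beta[2]**2 * obs_mean.var():.3f}")
</code></pre>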
<p>These sensitivity checks, and the possible stratification of teachers into sub-groups serving similar students, complicate value-added analysis and may not be congruent with policymakers’ wish to compare all teachers in a district. However, these steps may win the approval of teachers who want to be sure that comparisons between themselves and other teachers are free of bias. Moreover, I recommend this modified approach as scientifically responsible, for it limits us to answering questions that our data can actually answer.</p>
</div>
<h2><a href="/wp-content/uploads/2013/08/APPENDIX.pdf">Appendix</a></h2>
<h2 class="refbtn">References +</h2>
</div>
</div>
</div>
</div>
</div>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">For a detailed discussion of precision, see Raudenbush, S.W. and M. Jean. Carnegie Knowledge Network, “<a href="http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/" target="_blank">How Should Educator Interpret Value-Added Scores,</a>” (2012).<a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">Goldhaber, D. and R. Theobold. Carnegie Knowledge Network, “<a href="http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/" target="_blank">Do Different Value-Added Models Tell Us the Same Things?</a>” Last updated April 2013. (2012).<a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">See the discussion by Harris (2013).<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text">Coleman, et al., “Equality of educational opportunity: Summary report. Vol. 2.” US Department of Health, Education, and Welfare, Office of Education, 1966.<a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">Bryk, Anthony S. and Stephen W. Raudenbush, &#8220;Toward a more appropriate conceptualization of research on school effects: A three-level hierarchical linear model,&#8221; <i>American Journal of Education</i> (1988): 65-108.<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">Lauen, Douglas Lee and S. Michael Gaddis, &#8220;<a href="http://www.stevenmichaelgaddis.com/Lauen%20and%20Gaddis%202012%20-%20Classroom%20Poverty%20AJS%20Forthcoming.pdf" target="_blank">Exposure to classroom poverty and test score achievement: Contextual effects of selection,</a>&#8220; <i>American Journal of Sociology</i> 118, no. 4, (2013): 943-979.<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">Gleason et al., &#8220;<a href="http://www.mathematica-mpr.com/publications/pdfs/education/charter_school_impacts.pdf" target="_blank">The Evaluation of Charter School Impacts: Final Report. NCEE 2010-4029.</a>&#8221; National Center for Education Evaluation and Regional Assistance (2010).<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">Dobbie, Will and Roland G. Fryer, Jr., “<a href="http://www.nber.org/papers/w17632" target="_blank"> Getting beneath the veil of effective schools: Evidence from New York City.</a>&#8220;National Bureau of Economic Research, (Working paper #w17632, 2011).<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text"> Bryk et al., <i>Organizing schools for improvement: Lessons from Chicago</i>, University of Chicago Press, 2010.<a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">Borman et al., &#8220;Final reading outcomes of the national randomized field trial of Success for All.&#8221; <i>American Educational Research Journal</i> 44, no. 3 (2007): 701-731.; Borman, G.D., M.M. Dowling, and C. Schneck, &#8220;A multisite cluster randomized field trial of Open Court Reading<i>.&#8221; Educational Evaluation and Policy Analysis</i> 30, no. 4 (2008): 389-407.<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">Jackson, Clement Kirabo and Elias Bruegmann, &#8220;<a href="http://digitalcommons.ilr.cornell.edu/workingpapers/77/" target="_blank">Teaching students and teaching each other: The importance of peer learning for teachers.</a>&#8220; <i>American Economic Journal: Applied Economics</i> 1, no. 4 (2009): 85-108.<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">Raudenbush, Stephen W. “Can School Improvement Reduce Racial Inequality?&#8221; Research on Schools, Neighborhoods, and Communities: Toward Civic Responsibility (2012): 233.<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">Harding (2010) provides a vivid portrayal of these differences. Harding, David J., <i>Living the drama: Community, conflict, and culture among inner-city boys,</i> University of Chicago Press, 2010.<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">Dreeben, Robert and Rebecca Bar, “Educational Policy and the Working of Schools” (1983); Gamoran, Adam, &#8220;Tracking and inequality: New directions for research and practice,&#8221; The Routledge international handbook of the sociology of education (2010): 213-228.; Raudenbush, Stephen W., Brian Rowan, and Yuk Fai Cheong, &#8220;Higher order instructional goals in secondary schools: Class, teacher, and school influences,&#8221; <i>American Educational Research Journal</i> 30, no. 3 (1993): 523-553.<a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">Raudenbush, Rowan, and Cheong (1992) compared the self-efficacy of a teacher when teaching high-ability high-school classes to the self-efficacy score of the same teacher when teaching low-ability high-school class. They found large differences, particularly in mathematics. Raudenbush, Stephen W., Brian Rowan, and Yuk Fai Cheong, &#8220;Contextual effects on the self-perceived efficacy of high school teachers,&#8221; <i>Sociology of Education</i> (1992): 150-167.<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text"> Raudenbush, Stephen W. and J. Douglas Willms &#8220;The estimation of school effects,&#8221; <i>Journal of educational and behavioral statistics</i> 20, no. 4 (1995): 307-335.<a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">One might reason that a statistical model should estimate and remove the effect of peer composition, thus isolating value added. However, if the peers and value added are correlated, the failure to measure and control value added will cause bias in the estimation of peer composition. As a result, the estimate of value added based on removal of this biased peer composition effect will also be biased.<a href="#refmark-17"></a></li>
<li id="footnote-18" class="fn-text">McCaffrey, D.F. Carnegie Knowledge Network, “<a href="http://www.carnegieknowledgenetwork.org/briefs/value-added/level-playing-field/&quot;" target="_blank">Do Value-added models level the playing field for teachers?</a>” Last updated June 2013 (2012).<a href="#refmark-18"></a></li>
<li id="footnote-19" class="fn-text">See Gamoran’s (2010) review.<a href="#refmark-19"></a></li>
<li id="footnote-20" class="fn-text">Harris, D. Carnegie Knowledge Network, “<a href="http://www.carnegieknowledgenetwork.org/briefs/value-added/grades/" target="_blank">Does Value Added Work Better in Elementary than in Secondary Schools?</a>” (2013).<a href="#refmark-20"></a></li>
<li id="footnote-21" class="fn-text">If the distribution of prior test scores in one classroom does not overlap with the distribution of prior test scores in the other classroom, a comparison between the two classrooms depends entirely on the functional form of the statistical model and not at all on the data. In this case different functional forms – linear, quadratic, logarithmic, or exponential, for example – for the relationship between prior achievement and post achievement will yield different value-added scores; the data will provide no information about which functional form is preferable.<a href="#refmark-21"></a></li>
<li id="footnote-22" class="fn-text">Reardon and Raudenbush (2009) show that to compare the effectiveness of teachers who teach very different students requires another strong assumption, that teachers who are comparatively effective for one type of student are also comparatively effective for other types of students. This assumption is required unless value-added models include, for each teacher, separate estimates of value-added for subsets of students who vary in background. Most value-added models do not, and cannot, do so because we can assess a teacher’s value-added only for the students she is assigned to teach. Also the amount of data needed multiplies rapidly. Teachers who work in affluent schools do not give us information about how effective they might be might teach low-income schools, for example. More research is needed on the extent to which teacher effectiveness varies for students of highly varied background. However, this problem – that some teachers may be more effective at working with some kinds of students than others – will not cause bias in value-added analysis that compare individuals teaching very similar students. Reardon, Sean F. and Stephen W. Raudenbush, &#8220;Assumptions of value-added models for estimating school effects.&#8221; <i>Education</i> 4, no. 4 (2009): 492-519.<a href="#refmark-22"></a></li>
<li id="footnote-23" class="fn-text">Hedges, Larry V. and E.C. Hedberg &#8220;Intraclass correlations for planning group randomized experiments in rural education<i>.&#8221; Journal of Research in Rural Education</i> 22, no. 10 (2007): 1-15.<a href="#refmark-23"></a></li>
<li id="footnote-24" class="fn-text">Nye, Barbara, Spyros Konstantopolous, and Larry V. Hedges &#8220;<a href="http://steinhardt.nyu.edu/scmsAdmin/uploads/002/834/127%20-%20Nye%20B%20%20Hedges%20L%20%20V%20%20%20Konstantopoulos%20S%20%20(2004).pdf" target="_blank">How large are teacher effects?</a>&#8221; Educational <i>evaluation and policy analysis 26</i>, no. 3 (2004): 237-257.<a href="#refmark-24"></a></li>
<li id="footnote-25" class="fn-text">Kane, Thomas J. and Douglas O. Staiger, “<a href="http://www.nber.org/papers/w14607" target="_blank">Estimating teacher impacts on student achievement: An experimental evaluation,</a>” National Bureau of Economic Research, (Working paper No. w14607, 2008).; Kane, Thomas J., Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger &#8220;<a href="http://www.rand.org/pubs/external_publications/EP50156.html" target="_blank">Have We Identified Effective Teachers? Validating Measures of Effective Teaching Using Random Assignment. Research Paper. MET Project,</a>&#8221; Bill &amp; Melinda Gates Foundation (2013).<a href="#refmark-25"></a></li>
<li id="footnote-26" class="fn-text">Chetty, Raj, John N. Friedman, and Johan E. Rockoff, “<a href="http://www.nber.org/papers/w17699" target="_blank">The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood,</a>” National Bureau of Economic Research (Working paper No. w17699, 2011). This study is consistent with an earlier study showing that random assignment to effective kindergarten teachers produced favorable effects on educational attainment and adult economic status (Chetty et al., &#8220;How does your kindergarten classroom affect your earnings? Evidence from Project STAR.&#8221; <i>The Quarterly Journal of Economics</i> 126, no. 4 (2011): 1593-1660).<a href="#refmark-26"></a></li>
<li id="footnote-27" class="fn-text">See note 2.<a href="#refmark-27"></a></li>
<li id="footnote-28" class="fn-text">Raudenbush, Stephen and Marshall Jean. Carnegie Knowledge Network, &#8220;<a href="http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/&quot;" target="&quot;_blank">How Should Educators Interpret Value-Added Scores?</a>&#8221; (2012).<a href="#refmark-28"></a></li>
<li id="footnote-29" class="fn-text">To remove school differences, one adds school fixed effects to the conventional regression model. This is equivalent to a random effects analysis that centers all covariates around the school mean (Raudenbush, 2009). The latter analysis can readily be elaborated to study whether some teachers are better than others at teaching particular kinds of students.<a href="#refmark-29"></a></li>
<li id="footnote-30" class="fn-text">This strategy is not perfect, because it might be that such schools in the same subset vary in organizational effectiveness.<a href="#refmark-30"></a></li>
<li id="footnote-31" class="fn-text">Personal communication with Richard Shavelson and David Wiley.<a href="#refmark-31"></a></li>
<li id="footnote-32" class="fn-text">The “contextual effect” is not really a causal effect but rather the partial association between school-average prior achievement and student achievement controlling for all student background characteristics as in value added. Based on Raudenbush and Willms (1995), we can show that the squared contextual coefficient multiplied by the school mean prior achievement sets an upper bound for the variance of the biases of the value-added score. The key assumption is that the <i>within-school value added score</i> is unbiased.<a href="#refmark-32"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/comparing-teaching/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What Do We Know About the Use of Value-Added Measures for Principal Evaluation?</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/value-added/principal-evaluation/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/value-added/principal-evaluation/#comments</comments>
		<pubDate>Tue, 09 Jul 2013 20:30:01 +0000</pubDate>
		<dc:creator><![CDATA[Susanna Loeb]]></dc:creator>
				<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=1504</guid>
		<description><![CDATA[<span style="color: #999;">KNOWLEDGE BRIEF 9</span><br />by <a href="/technical-panel/#loeb">Susanna Loeb</a> and Jason A. Grissom

As policymakers begin to consider how to use test scores to assess principals, they should be guided by an understanding of how principals influence student outcomes. Value-added measures provide information about how schools are doing, but they may not be convincing measures of the causal effect of the principal on student learning.  ]]></description>
				<content:encoded><![CDATA[<div class="instaemail etp-alignleft"><a class="instaemail" id="instaemail-button" href="#" rel="nofollow" title="Email this page" onclick="pfEmail.init()" data-button-img=email-button>Email this page</a></div><div align="right"><a href="/wp-content/uploads/2013/07/CKN_2013_07_Loeb.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png" /></a></div>
<p><img alt="the principal" src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/07/loeb_principal.jpg" /></p>
<div class="biobox"><img alt="Loeb" src="/wp-content/uploads/2012/03/Loeb.jpg" /><br />
<a href="/technical-panel/#loeb">Susanna Loeb</a><br />
<strong>Professor of<br />
Education</strong><br />
Stanford University<br />
<strong>Faculty Director</strong><br />
Center for Education<br />
Policy Analysis<br />
<strong>Co-Director</strong><br />
PACE</div>
<h2>Susanna Loeb</h2>
<h3>and Jason A. Grissom</h3>
<h3>Highlights</h3>
<ul>
<li>Value-added measures for principals have many of the same problems that value-added measures for teachers do, such as imprecision and questions about whether important outcomes are captured by the test on which the measures are based.</li>
<li>While most measures of teachers&#8217; value-added and schools&#8217; value-added are based on a shared conception of the effects that teachers and schools have on their students, value-added measures for principals can vary in their underlying logic.</li>
<li>The underlying logic on which the value-added measure is based matters a lot in practice.</li>
<li>Evaluation models based on school <em>effectiveness</em>, which measure student test-score gains, tend not to be correlated at all with models based on school <em>improvement</em>, which measure <em>changes</em> in student test-score gains.</li>
<li>The choice of model also changes the magnitude of the impact that principals appear to have on student outcomes.</li>
<li>Estimates of principal effectiveness that are based on school effectiveness can be calculated for most principals. But estimates that are based on school effectiveness relative to the effectiveness of <em>other</em> principals who have served at the same school or estimates that are based on school <em>improvement</em> have stricter data requirements and, as a result, cover fewer principals.</li>
<li><span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-1014">Models</span> that assume that most of school effectiveness is attributable to the principal are more consistent with other measures of principal effectiveness, such as evaluations by the district.  However, it is not clear whether these other measures are themselves accurate assessments.</li>
<li>There is little empirical evidence on the advantages or disadvantages of using value-added measures to evaluate principals.</li>
</ul>
<h3>Introduction</h3>
<p>Principals play a central role in how well a school performs.<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a> They are responsible for establishing school goals and developing strategies for meeting them.  They lead their schools&#8217; instructional programs, recruit and retain teachers, maintain the school climate, and allocate resources. How well they execute these and other leadership functions is a key determinant of school outcomes.<a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a></p>
<div class="pull-right">Estimating value-added for principals turns out to be even more complex than estimating value-added for teachers.</div>
<p>Recognizing this link between principals and school success, policymakers have developed new accountability policies aimed at boosting principal performance. In particular, policymakers increasingly are interested in evaluating school administrators based in part on student performance on standardized tests. Florida, for example, passed a bill in 2011 requiring that at least 50 percent of every school administrator&#8217;s evaluation be based on student achievement growth as measured by state assessments and that these evaluations factor into principal compensation.</p>
<p>Partly as a result of these laws, many districts are trying to create value-added measures for principals much like those they use for teachers. The idea is compelling, but the situations are not necessarily analogous. Estimating value-added for principals turns out to be even more complex than estimating value-added for teachers.</p>
<p>Three methods have been suggested for assessing a principal&#8217;s value-added. One method attributes all aspects of school effectiveness (how well students perform relative to students at other schools with similar background characteristics and similar peers) to the principal; a second attributes to the principal only the difference between the effectiveness of the school under that principal and its effectiveness under other principals; and a third attributes school improvement (gains in school effectiveness) to the principal. Each method has distinct strengths, and each has significant drawbacks. There is as yet little empirical evidence to validate any of these methods as a way to accurately evaluate principals.</p>
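<p>To see how the three methods can diverge, consider the stylized sketch below. The adjusted school-mean gains are invented: the first principal inherits a high-performing school whose gains drift downward, while her successor starts from a lower level but improves steadily.</p>
<pre><code>import numpy as np

# invented adjusted school-mean gains for one school over eight years;
# principal 1 serves years 1-4, principal 2 serves years 5-8
gains = np.array([0.30, 0.29, 0.27, 0.26, 0.10, 0.14, 0.19, 0.24])
principal = np.array([1, 1, 1, 1, 2, 2, 2, 2])

for p in (1, 2):
    own = gains[principal == p]
    others = gains[principal != p]
    effectiveness = own.mean()                     # method 1: school effectiveness
    relative = own.mean() - others.mean()          # method 2: vs. other principals
    improvement = np.polyfit(np.arange(own.size), own, 1)[0]   # method 3: trend
    print(f"principal {p}: effectiveness {effectiveness:.2f}, "
          f"relative {relative:+.2f}, improvement {improvement:+.3f}")
</code></pre>
<p>Here the first principal looks better under the effectiveness and relative methods, while the second looks better under the improvement method, illustrating why the choice of underlying logic matters so much in practice.</p>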
<p>While substantial work has shaped our understanding of the many ways to use test scores to measure teacher effectiveness, far less research has focused on how to use similar measures to judge school administrators. The current state of our knowledge is detailed below.</p>
<h4>Using test scores</h4>
<p>When we use test scores to evaluate principals, three issues are particularly salient: the mechanisms by which principals affect student learning, potential bias in the estimated effects, and the reliability of those estimates. The importance of <em>mechanisms</em> stems from the uncertainty about how principals affect student learning and, thus, how student test scores should be used to measure it. <em>Potential bias</em> comes from misattributing factors outside of the principal&#8217;s control to value-added measures. <em><span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-18">Reliability</span></em>, or lack thereof, comes from imprecision in performance measures that results from random variations in test performance and idiosyncratic factors outside a principal&#8217;s control.</p>
<p>How best to create measures of a principal&#8217;s influence on learning depends crucially on the relationship between a principal&#8217;s performance and student performance. Two issues are particularly germane here. The first is the time span over which a principal&#8217;s decisions affect students. For instance, one might reasonably question how much of an impact principals have in their first year in a school, given the likelihood that most of the staff were there before the principal arrived and are accustomed to existing processes.</p>
<p>Consider a principal who is hired to lead a low-performing school. Suppose this principal excels from the start. How quickly would you expect that excellent performance to be reflected in student outcomes? The answer depends on the ways in which the principal has impact. If the effects are realized through better teacher assignments or incentives to students and teachers to exert more effort, they might be reflected in student performance immediately. If, on the other hand, a principal makes her mark through longer-term changes, such as hiring better teachers or creating environments that encourage effective teachers to stay, it may take years for her influence to be reflected in student outcomes. In practice, principals likely have both immediate and longer-term effects. The timing of principals&#8217; effects is important for how we should measure principal value-added, and it also points to the importance of the length of a principal&#8217;s tenure when using value-added measurements to assess principals.</p>
<p>The second consideration is distinguishing the principal effect from characteristics of the school that lie outside of the principal&#8217;s control. It may be that the vast majority of a school&#8217;s effects on learning, aside from those associated with the characteristics of the students, is attributable to the principal&#8217;s performance. In this case, identifying the overall school effect (adjusted for characteristics of the students when they entered the school) is enough to identify the principal effect. That is, the principal effect is equal to the school effect.<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a></p>
<p>Alternatively, school factors outside of the principal&#8217;s control may be important for school effectiveness. For example, what happens when principals have little control over faculty selection—when the district&#8217;s central office does the hiring, or when hiring is tightly governed by collective bargaining agreements? One means for improving a school—hiring good people—will be largely outside a principal&#8217;s control, though a principal could still influence the development of teachers in the school as well as the retention of good teachers. As another example, some schools may have a core of teachers who work to help other teachers be effective, and these core teachers may have already been at the school before the principal arrived. Other schools may benefit from an unusually supportive and generous community leader, someone who helps the school even without the principal&#8217;s efforts. In all of these cases, if the goal is to identify principal effectiveness, it will be important to net out the effects of factors that affect school effectiveness but are outside of the principal&#8217;s control.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a>,<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a></p>
<p>How one thinks about these two theoretical issues—the timing of the principal effect and the extent of a principal&#8217;s influence over schools—has direct implications for how we estimate the value that a principal adds to student performance. Three possible approaches for estimating value-added make different assumptions about these issues.</p>
<h4>Principal value-added as <em>school effectiveness</em></h4>
<p>First, consider the simplest case, in which principals immediately affect schools and have control over all aspects of the school that affect learning except those associated with student characteristics. That is, school effectiveness is completely attributable to the principal. If this assumption holds, an appropriate approach to measuring the contribution of that principal would be to measure school effectiveness while the principal is working there, or how well students perform relative to students with similar background characteristics and peers. This approach is essentially the same as the one used for teachers; we assume that teachers have immediate effects on students during the year they have them, so we take students&#8217; growth during that year—controlling for various factors—as a measure of that teacher&#8217;s impact. For principals, any growth in student learning that differs from the growth predicted for a similar student in a similar context is attributed to the principal.</p>
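<p>To make this approach concrete, the sketch below shows one minimal way it might be implemented. It is illustrative only: the input file, the column names (<em>score</em>, <em>prior_score</em>, <em>low_income</em>, <em>principal</em>), and the simple linear adjustment are our own assumptions, not features of any study discussed in this brief.</p>
<pre><code># A minimal sketch of the "school effectiveness" approach.
# The data file and all column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")  # assumed: one row per student per year

# Adjust each student's end-of-year score for prior achievement and
# background; the unexplained residual is the adjusted school effect.
fit = smf.ols("score ~ prior_score + low_income", data=df).fit()
df["adjusted_growth"] = fit.resid

# The approach attributes the school's entire adjusted effect during a
# principal's tenure to that principal.
principal_va = df.groupby("principal")["adjusted_growth"].mean()
print(principal_va.sort_values(ascending=False).head())
</code></pre>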
<div class="pull-right">The effectiveness of a school may be due to factors that were in place before the principal took over.</div>
<p>This approach has some validity for teachers. Because teachers have direct and individual influences on their students, it makes sense to take the adjusted average learning gains of students during a year as a measure of that teacher&#8217;s effect. The face validity of this kind of approach for principals, however, is not as strong. While the effectiveness of a school may be due in part to its principal, it may also result in part from factors that were in place before the principal took over. Many teachers, for example, may have been hired previously; the parent association may be especially helpful or especially distracting.  Particularly in the short run, it would not make sense to attribute all of the contributions of those teachers to that principal. An excellent new principal who inherits a school filled with poor teachers—or a poor principal hired into a school with excellent teachers—might incorrectly be blamed or credited with results he had little to do with.</p>
<h4>Principal value-added as <em>relative school effectiveness</em></h4>
<p>The misattribution of school effects outside of a principal&#8217;s control can create bias in the estimates of principal effectiveness. One alternative is to compare the effectiveness of a school during one principal&#8217;s tenure to the effectiveness of the school at other times. The principal would then be judged by how much students learn (as measured by test scores) while that principal is in charge, compared to how much students learned in that same school when someone else was in charge. Conceptually, this approach is appealing if we believe that the effectiveness of the school that a principal inherits affects the effectiveness of that school during the principal&#8217;s tenure. And it most likely does.</p>
<p>One drawback of this “within-school over-time” comparison is that schools change as neighborhoods change and teachers turn over. That is, there are possible confounding variables for which adjustments might be needed. While this need is no different from that of the first approach described above, the within-school over-time approach has some further drawbacks. In particular, given the small number of principals that schools often have over the period of available data, the comparison sets can be tiny and, as a result, idiosyncratic. If, in the available data, only one principal serves in a school, there is no other principal to whom to compare her. If there are only one or two other principals, the comparison set is very small, leading to imprecision in the estimates. The within-school over-time approach holds more appeal when data cover a period long enough for a school to have had several principals. However, if there is little principal turnover, if the data stream is short, or if there are substantial changes in schools that are unrelated to school leadership, this approach may not be feasible or advisable.</p>
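<p>Continuing the same hypothetical sketch, the within-school over-time comparison can be approximated by adding school indicators to the model, so that each principal&#8217;s estimate reflects only how the school performed under her leadership relative to its other principals. As before, the data and names are assumptions, not a published specification.</p>
<pre><code># A minimal sketch of the "relative school effectiveness" approach.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")  # same hypothetical table as above

# School fixed effects, C(school), absorb everything stable about the
# school; the principal indicators then capture only within-school
# differences across the principals who led it.
fit = smf.ols(
    "score ~ prior_score + low_income + C(school) + C(principal)",
    data=df,
).fit()

# Effects are identified only for schools observed under more than one
# principal; with a single principal, school and principal are confounded.
principal_effects = fit.params.filter(like="C(principal)")
print(principal_effects.sort_values(ascending=False).head())
</code></pre>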
<h4>Principal value-added as <em>school improvement</em></h4>
<p>So far we have considered models built on the assumption that principal performance is reflected immediately in student outcomes and that this reflection is constant over time. Perhaps more realistic is an expectation that new principals take time to make their marks, and that their impact builds the longer they lead the school. School improvement that comes from building a more productive work environment (from skillful hiring, for instance, or better professional development or creating stronger relationships) may take a principal years to achieve. If it does, we may wish to employ a model that accounts explicitly for this dimension of time.</p>
<p>One such measure would capture the <em>improvement</em> in school effectiveness during the principal&#8217;s tenure. The school may have been relatively ineffective in the year before the principal started, or even during the principal&#8217;s first year, but if the school improved during the principal&#8217;s overall tenure, that would suggest the principal was effective. If the school&#8217;s performance declined, it would point to the reverse.</p>
<p>The appeal of such an approach is its clear face validity. However, it has disadvantages. In particular, the data requirements are substantial. There is error in any measure of student learning gains, and calculating the difference in these imperfectly measured gains to create a principal effectiveness measure increases the error.<a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a> Indeed, this measure of principal effectiveness may be so imprecise as to provide little evidence of actual effectiveness.<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a> In addition, as with the second approach, if the school were already improving because of work done by former administrators, we may overestimate the performance of principals who simply maintain this improvement.</p>
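<p>The imprecision problem is simple arithmetic: an improvement estimate is the difference of two noisy yearly measures, so its error variance is the sum of their error variances. The toy simulation below, with made-up noise levels, shows the standard error growing by a factor of about 1.4 (the square root of two).</p>
<pre><code># A toy illustration (made-up numbers) of why differencing noisy
# measures increases error: the two years' error variances add.
import numpy as np

rng = np.random.default_rng(0)
noise_sd = 1.0  # assumed measurement error in each year's estimate
year1 = rng.normal(0.0, noise_sd, 100_000)
year2 = rng.normal(0.0, noise_sd, 100_000)

improvement = year2 - year1  # "improvement" here is pure noise
print(improvement.std())     # about 1.41, i.e. sqrt(2) * noise_sd
</code></pre>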
<p>We have outlined three general approaches to measuring principal value-added. The <strong><em>school effectiveness</em></strong> approach attributes all of the learning benefits of attending a given school while the principal is leading it to that principal.  The <strong><em>relative school effectiveness</em></strong> approach attributes the learning benefits of attending a school while the principal is leading it relative to the benefits of the same school under other principals. The <strong><em>school improvement</em></strong> approach attributes the changes in school effectiveness during a principal&#8217;s tenure to that principal. These three approaches are each based on a conceptually different model of principals&#8217; effects, and each will lead to different concerns about validity (or bias) and precision (or reliability).</p>
<h3>What is the Current State of Knowledge on this Issue?</h3>
<p>Value-added measures of teacher effectiveness and school effectiveness are the subject of a large and growing research literature summarized in part by this series.<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a> In contrast, the research on value-added measures of principal effectiveness—as distinct from school effectiveness—is much less extensive. Moreover, most measures of teachers&#8217; value-added and schools&#8217; value-added are based on a shared conception of the effect that teachers and schools have on their students. By contrast, value-added measures of principals can vary both by their statistical approach and their underlying logic.</p>
<div class="pull-left">Even within conceptual approaches, model choices can make significant differences.</div>
<p>One set of findings from Miami-Dade County Public Schools compares value-added models based on the three conceptions of principal effects described above: school effectiveness, relative school effectiveness, and school improvement. A number of results emerge from these analyses. First, the model matters a lot.  In particular, models based on school improvement (essentially <em>changes</em> in student test score gains across years) tend not to be correlated at all with models based on school effectiveness or relative school effectiveness (which are measures of student test score gains over a single year).<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a> That is, a principal who ranks high in models of school improvement is no more or less likely to be ranked high in models of school effectiveness than are other principals. Models based on school effectiveness and those based on relative school effectiveness are more highly correlated, but still some principals will have quite different ratings on one than on the other.  Even within conceptual approaches, model choices can make significant differences.</p>
<p><span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-14">Model</span> choice affects not only whether one principal appears more or less effective than another but also how important principals appear to be for student outcomes. The variation in principal value-added is greater in models based on school effectiveness than in models based on improvement, at least in part because the models based on improvement have substantial imprecision in estimates.<a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a>,<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a> Between models of school effectiveness and models of relative school effectiveness (comparing principals to other principals who have taught in the same school), the models of school effectiveness show greater variation across principals.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a> For example, in one study of North Carolina schools, the estimated variation in principal effectiveness was more than four times greater in the model that attributes school effects to the principal than in the model that compares principals within schools.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a> This finding is not surprising given that the models of relative school effectiveness have taken out much of the variation that exists across schools, looking only within schools over time or with a group of schools that share principals.</p>
<p>The Miami-Dade research also provides insights into some practical problems with the measures introduced above. First, consider the model that compares principals to other principals who serve in the same school. This approach requires each school to have had multiple principals. Yet in the Miami-Dade study, even with an average annual school-level principal turnover rate of 22 percent over the course of eight school years, 38 percent of schools had only one principal.<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a>,<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a> Even when schools have had multiple principals over time, the number in the comparison group is almost always small. The within-school relative effectiveness approach, in essence, compares principals to the few other principals who have led the schools in which they have worked, then assumes that each group of principals (each set of principals who are compared against each other) is, on average, equal. In reality, the groups may be quite different. In the Miami-Dade study, the average principal was compared with fewer than two other principals in value-added models based on within-school relative effectiveness. The other two approaches (school effectiveness and school improvement) used far larger comparison groups.</p>
<div class="pull-left">There is more error in measuring changes in student learning than in measuring levels of student learning.</div>
<p>Measures of principal value-added based on school improvement also require multiple years of data. There is no improvement measure for a single year, and even two or three years of data are often insufficient for calculating a stable trend. Requiring principals to lead a school for three years in order to calculate value-added measures reduced the number of principals by two-thirds in the Miami-Dade study.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a> A second concern with using school improvement is imprecision. As described above, there is more error in measuring <em>changes</em> in student learning than in measuring <em>levels</em> of student learning. Measures based on school improvement may simply contain too little information to be useful as measures of value-added.</p>
<p>While there are clear drawbacks to using value-added measures based on school improvement, the approach also has substantial conceptual merit. In many cases, good principals do, in fact, improve schools. The means by which they do so can take time to reveal themselves.<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a> Moreover, one study of high schools in British Columbia points to meaningful variation across principals in school improvement.<a class="fn-ref-mark" href="#footnote-18" id="refmark-18"><sup>[18]</sup></a></p>
<p>To better understand the differences in value-added measures based on different approaches, the Miami-Dade study compared the measures to schools&#8217; accountability grades;<a class="fn-ref-mark" href="#footnote-19" id="refmark-19"><sup>[19]</sup></a> the district&#8217;s ratings of principal effectiveness; students&#8217;, parents&#8217;, and staff&#8217;s assessments of the school climate; and principals&#8217; and assistant principals&#8217; assessments of the principal&#8217;s effectiveness at certain tasks. These comparisons show that the first approach—attributing school effectiveness to the principal—is more predictive of all the non-test measures than are the other two approaches, although the second approach is positively related to many of the other measures as well. The third approach, measuring value-added by school improvement, is not positively correlated with any of these other measures. The absence of a relationship between measures of school improvement and these other measures could be the result of imprecision, or it could be because school improvement is based on a different underlying theory of how principals affect schools.</p>
<p>The implications of these results may not be as clear as they first seem. The non-test measures appear to validate the value-added measure that attributes all school effectiveness to the principal. Alternatively, the positive relationships may represent a shortcoming in the non-test measures. District officials, for example, likely take into account the effectiveness of the school itself when rating the performance of the principal. When asked to assess a principal&#8217;s leadership skills, assistant principals and the principals themselves may base their ratings partly on how well the school is performing instead of solely on how the principal is performing. In other words, differentiating the effect of the principal from that of other school factors may be a difficulty encountered by both test-based and subjective estimates of principal performance.</p>
<div class="pull-right">These models attempt to put numbers on phenomena when we may simply lack enough data to do so.</div>
<p>In sum, there are important tradeoffs among the different modeling approaches. The simplest approach—attributing all school effectiveness to the principal—seems to give the principal too much credit or blame, but it produces estimates that correlate relatively highly across math and reading, across different schools in which the principal works, and with other measures of non-test outcomes that we care about. On the other hand, the relative school effectiveness approach and the school improvement approach come closer to using a reasonable conception of the relationship between principal performance and student outcomes, but the data requirements are stringent and may be prohibitive. These models attempt to put numbers on phenomena when we may simply lack enough data to do so.</p>
<p>Other research on principal value-added goes beyond comparing measurement approaches to using specific measures to gain insights into principal effectiveness. One such study, which used a measure of principal value-added based on school effectiveness, found greater variation in principal effectiveness in high-poverty schools than in other schools. This study provides some evidence that principals are particularly important for student learning in these schools, and it underscores how model choice can shape findings.<a class="fn-ref-mark" href="#footnote-20" id="refmark-20"><sup>[20]</sup></a> A number of studies have used value-added measures to quantify the importance of principals for student learning. The results are somewhat inconsistent, with some finding substantially larger effects than others. One study of high school principals in British Columbia that used the within-schools approach finds a standard deviation of principal value-added even greater than what is typical for teachers. Most studies, however, find much smaller differences, especially when estimates are based on within-school models.<a class="fn-ref-mark" href="#footnote-21" id="refmark-21"><sup>[21]</sup></a></p>
<h3>What More Needs to be Known on This Issue?</h3>
<p>Using student test scores to measure principal performance faces many of the same difficulties as using them to measure teacher performance. As an example, the test metric itself is likely to matter.<a class="fn-ref-mark" href="#footnote-22" id="refmark-22"><sup>[22]</sup></a> We need to know the extent to which principals who score well on measures based on one outcome (e.g., math performance) also score well on measures based on another (e.g., student engagement). If value-added based on different outcomes is inconsistent, it will be particularly important to choose outcome measures that are truly valued.</p>
<p>Nonetheless, there are challenges to using test scores to measure principal effectiveness that differ from those associated with using such measures for teachers. These, too, could benefit from additional research. In particular, a better understanding of how principals affect schools would be helpful. For example, to what extent do principals affect students through their influence on veteran teachers, providing supports for improvement as well as ongoing management? Do they affect students primarily through the composition of their staffs, or can they affect students, regardless of the staff, with new curricular programs or better assignment of teachers? To what extent do principals affect students through cultural changes? How long does it take for these changes to have an impact? Clearer answers to these questions could point to the most appropriate ways of creating value-added measures.</p>
<div class="pull-right">There is little empirical evidence to warrant the use of value-added data to evaluate principals.</div>
<p>No matter how much we learn about the many ways in which principals affect students, value-added measures for these educators are going to be imperfect; they probably will be both biased and imprecise. Given these imperfections, can value-added measures be used productively? If so, under what circumstances? Like many managers, principals perform much of their work away from their employers&#8217; direct observation. As a result, their employers need measures of performance other than observation. Research can clarify where the use of value-added improves outcomes, and whether other measures, in combination with or instead of value-added, lead to better results. There is now little empirical evidence to warrant the use of value-added data to evaluate principals, just as there is little clear evidence against it.</p>
<h3>What Can&#8217;t be Resolved by Empirical Evidence on This Issue?</h3>
<p>The problems with outcome-based measures of performance are not unique to schooling.  Managers are often evaluated and compensated based on performance measures that they can only partially control.<a class="fn-ref-mark" href="#footnote-23" id="refmark-23"><sup>[23]</sup></a> Imperfect measures can have benefits if they result in organizational improvement. For example, using student test scores to measure productivity may encourage principals to improve those scores even if the value-added measures are flawed. However, whether such measures actually do lead to improvement will depend on the organizational context and the individuals in question.<a class="fn-ref-mark" href="#footnote-24" id="refmark-24"><sup>[24]</sup></a></p>
<p>This brief has highlighted many of the potential flaws of principal value-added measures, pointing to the potential benefit of additional or alternative measures. One set of measures could capture other student outcomes, such as attendance or engagement. As with test scores, highlighting these factors creates incentives for a principal to improve them, even though these measures likely would share with test-based value-added the same uncertainty about what to attribute to the principal. Another set of measures might more directly gauge principals&#8217; actions and the results of those actions, even though such measures are likely to be costlier to devise than test-score measures. These measures might come from feedback from teachers, parents, or students, or from a combination of observations and discussions between district leaders and principals.</p>
<p>Research can say very little about how to balance these different types of measures. Would the principals (and their schools) benefit from the incentives created by evaluations based on student outcomes? Does the district office have the capacity to implement more nuanced evaluation systems? Would the dollars spent on such a system be worth the tradeoff with other potentially valuable expenditures? These are management decisions that research is unlikely to directly inform.</p>
<div class="summery">
<h2>Conclusions</h2>
<p>The inconsistencies and drawbacks of principal value-added measures lead to questions about whether they should be used at all. These questions are not specific to principal value-added.  They apply, at least in part, to value-added measures for teachers and to other measures of principal effectiveness that do not rely on student test performance. There are no perfect measures, yet district leaders need information on which to make personnel decisions. Theoretically, if student test performance is an outcome that a school system values, the system should use test scores in some way to assess schools and hold personnel accountable. Unfortunately, we have no good evidence about how to do this well.</p>
<p>The warning that comes from the research so far is to think carefully about what value-added measures reveal about the contribution of the principal and to use the measures for what they are. What they are <em>not</em> is a clear indicator of a principal&#8217;s contributions to student test-score growth; rather, they are an indicator of student learning in that principal&#8217;s school compared with learning that might be expected in a similar context. At least part of this learning is likely to be due to the principal, and additional measures can provide further information about the principal&#8217;s role. To the extent that districts define what principals are supposed to be doing—whether that is improving teachers&#8217; instructional practice, student attendance, or the retention of effective teachers—measures that directly capture these outcomes can help form an array of useful but  imperfect ways to evaluate principals&#8217; work.</p>
</div>
<h2 class="refbtn">References +</h2>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">The Introduction is drawn from: Grissom, Jason A., Demetra Kalogrides, and Susanna Loeb, &#8220;<a href="http://www.nber.org/papers/w18568" target="_blank">Using student test scores to measure principal performance</a>,&#8221; (National Bureau of Economic Research, No. w18568, 2012).<a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">The research base documenting the large array of roles principals play and how execution of those roles impacts the school is extensive. For a review, see Hallinger, Philip, and Ronald H. Heck, &#8220;<a href="http://www.tandfonline.com/doi/abs/10.1080/0924345980090203#preview" target="_blank">Exploring the principal&#8217;s contribution to school effectiveness</a>: 1980-1995,&#8221; <em>School Effectiveness and School Improvement</em> 9 (2) (1998): 157-191.
<p>On instructional leadership and student outcomes, see Robinson, Viviane MJ, Claire A. Lloyd, and Kenneth J. Rowe, &#8220;The impact of leadership on student outcomes: An analysis of the differential effects of leadership types,&#8221; <em>Educational Administration Quarterly</em> 44 (5) (2008): 635-674. For a more recent study linking principals&#8217; management skills to student learning growth, see Grissom, Jason A., and Susanna Loeb, &#8220;<a href="http://cepa.stanford.edu/content/triangulating-principal-effectiveness-how-perspectives-parents-teachers-and-assistant" target="_blank">Triangulating principal effectiveness: How perspectives of parents, teachers, and assistant principals identify the central importance of managerial skills</a>,&#8221; <em>American Educational Research Journal</em> 48(5) (2011): 1091-1123.<a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">For a more detailed discussion of school effects see Raudenbush, Stephen W., and J. Douglas Willms, &#8220;The estimation of school effects,&#8221; <em>Journal of Educational and Behavioral Statistics</em> 20 (4) (1995): 307-335. The authors point out the difficulty of separating peer and community characteristics from effective school practices.<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text">This same issue arises with teachers and has received considerable attention from researchers.  For an example, see Rothstein, Jesse, &#8220;<a href="http://www.nber.org/papers/w14666" target="_blank">Student sorting and bias in value-added estimation: Selection on observables and unobservables</a>,&#8221; <em>Education Finance and Policy</em> 4(4) (2009): 537-571.<a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">Teacher value-added measures are not immune to similar concerns.  For example, a teacher might have a class of students who are unusually supportive of each other.  However, the issues for estimating value-added are somewhat less problematic for teachers because the students they work with vary from class to class and year to year whereas most of the teachers and students that principals work with remain in the same school from year to year.<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">Kane, Thomas J., and Douglas O. Staiger. &#8220;<a href="http://www.aeaweb.org/articles.php?doi=10.1257/089533002320950993" target="_blank">The promise and pitfalls of using imprecise school accountability measures</a>,&#8221; <em>The Journal of Economic Perspectives</em> 16 (4) (2002): 91-114. Boyd, Donald, Hamilton Lankford, Susanna Loeb, and James Wyckoff, &#8220;<a href="http://www.nber.org/papers/w18010" target="_blank">Measuring test measurement error: A general approach</a>,&#8221; (National Bureau of Economic Research, No. w18010, 2012).<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">Kane, et al, 2002, ibid; Boyd, et al, 2012, ibid.<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">See for example: Aitkin, Murray, and Nicholas Longford, &#8220;Statistical modeling issues in school effectiveness studies,&#8221; <em>Journal of the Royal Statistical Society</em>, Series A (General) (1986): 1-43.<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text">Grissom, et al, 2012, ibid.<a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">ibid.<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">Estimates of improvement have greater measurement error because there is imprecision in both the starting effectiveness of the school and the ending point effectiveness.  The improvement is the difference between these two imprecise measures and, as a result, it has two sources of error.<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">Grissom, et al, 2012, ibid.<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">Dhuey, Elizabeth and Justin Smith, “<a href="http://homes.chass.utoronto.ca/~edhuey/index.php/home/page/research" target="_blank">How school principals influence student learning?</a>” (Working Paper, 2013).<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">Authors&#8217; calculations.<a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">Beteille, Tara., Demetra Kalogrides, and Susanna Loeb, &#8220;Stepping stones: Principal career paths and school outcomes,&#8221; <em>Social Science Research</em> 41(4) (2012): 904–919.<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text">Grissom, Jason A., Demetra Kalogrides, and Susanna Loeb, &#8220;<a href="http://www.nber.org/papers/w18568" target="_blank">Using student test scores to measure principal performance</a>,&#8221; (National Bureau of Economic Research, No. w18568, 2012).<br />
Note: The principals for whom school improvement models are feasible – those who remain in the same school for a longer period of time – are those for whom the relative effectiveness measures are likely to be less feasible, because fewer principals have likely led their school.<a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">See for example, <a href="http://www.amazon.com/s/ref=ntt_athr_dp_sr_1?_encoding=UTF8&amp;field-author=Anthony%20S.%20Bryk&amp;ie=UTF8&amp;search-alias=books&amp;sort=relevancerank" target="_new">Anthony S. Bryk</a>, <a href="http://www.amazon.com/s/ref=ntt_athr_dp_sr_2?_encoding=UTF8&amp;field-author=Penny%20Bender%20Sebring&amp;ie=UTF8&amp;search-alias=books&amp;sort=relevancerank" target="_blank">Penny Bender Sebring</a> , <a href="http://www.amazon.com/s/ref=ntt_athr_dp_sr_3?_encoding=UTF8&amp;field-author=Elaine%20Allensworth&amp;ie=UTF8&amp;search-alias=books&amp;sort=relevancerank" target="_blank">Elaine Allensworth</a> , <a href="http://www.amazon.com/s/ref=ntt_athr_dp_sr_4?_encoding=UTF8&amp;field-author=Stuart%20Luppescu&amp;ie=UTF8&amp;search-alias=books&amp;sort=relevancerank" target="_blank">Stuart Luppescu</a> , and <a href="http://www.amazon.com/s/ref=ntt_athr_dp_sr_5?_encoding=UTF8&amp;field-author=John%20Q.%20Easton&amp;ie=UTF8&amp;search-alias=books&amp;sort=relevancerank" target="_blank">John Q. Easton</a>, <em>Organizing Schools for Improvement: Lessons from Chicago</em>, (Chicago: University of Chicago Press, 2010).<a href="#refmark-17"></a></li>
<li id="footnote-18" class="fn-text">Coelli, Michael, and David A. Green, &#8220;Leadership effects: School principals and student outcomes,&#8221; <em>Economics of Education Review</em> 31(1) (2012): 92-109.<a href="#refmark-18"></a></li>
<li id="footnote-19" class="fn-text">Florida grades each school on a 5-point scale (A, B, C, D, F) that is meant to succinctly capture performance. Grades are based on a scoring system that assigns points to schools for their percentages of students achieving the highest levels in reading, math, science, and writing on Florida&#8217;s standardized tests in grades 3 through 10, or who make achievement gains. Grades also factor in the percentage of eligible students who are tested and the test gains of the lowest-performing students.<a href="#refmark-19"></a></li>
<li id="footnote-20" class="fn-text">Branch, Gregory F., Eric A. Hanushek, and Steven G. Rivkin, &#8220;<a href="http://www.nber.org/papers/w17803" target="_blank">Estimating the effect of leaders on public sector productivity: The case of school principals</a>,&#8221; (National Bureau of Economic Research, No. w17803, 2012).<a href="#refmark-20"></a></li>
<li id="footnote-21" class="fn-text">Dhuey, Elizabeth, and Justin Smith, &#8220;<a href="http://homes.chass.utoronto.ca/~edhuey/index.php/home/page/research" target="_blank">How important are school principals in the production of student achievement?</a>&#8221; (Working Paper, 2013).<a href="#refmark-21"></a></li>
<li id="footnote-22" class="fn-text">Kane, Thomas, and Steven Cantrell. &#8220;<a href="http://www.metproject.org/reports.php" target="_blank">Learning about teaching: Initial findings from the measures of effective teaching project</a>,&#8221; (Bill &amp; Melinda Gates Foundation, MET Project Research Paper, 2010). Lockwood, J. R., Daniel F. McCaffrey, Laura S. Hamilton, Brian Stecher, Vi-Nhuan Le, and José Felipe Martinez, &#8220;The sensitivity of value-added teacher effect estimates to different mathematics achievement measures,&#8221; <em>Journal of Educational Measurement</em> 44(1) (2007): 47-67; Kane, Thomas J., Daniel F. McCaffrey, Trey Miller, and Douglas Staiger, &#8220;<a href="http://www.metproject.org/reports.php" target="_blank">Have we identified effective teachers? Validating measures of effective teaching using random assignment,</a>&#8221; (Bill &amp; Melinda Gates Foundation, MET Project Research Paper, 2013).<a href="#refmark-22"></a></li>
<li id="footnote-23" class="fn-text">Graham, John R., Si Li, and Jiaping Qiu, &#8220;<a href="http://www.nber.org/papers/w17368" target="_blank">Managerial attributes and executive compensation,</a>&#8221; (National Bureau of Economic Research, No. w17368, 2011).<a href="#refmark-23"></a></li>
<li id="footnote-24" class="fn-text">Heinrich, Carolyn J., and Gerald Marschke, &#8220;Incentives and their dynamics in public sector performance management systems,&#8221; <em>Journal of Policy Analysis and Management</em> 29(1) (2010): 183-208.<a href="#refmark-24"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/value-added/principal-evaluation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Will Teacher Value-Added Scores Change when Accountability Tests Change?</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/value-added/accountability-tests/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/value-added/accountability-tests/#comments</comments>
		<pubDate>Mon, 17 Jun 2013 17:42:13 +0000</pubDate>
		<dc:creator><![CDATA[Daniel McCaffrey]]></dc:creator>
				<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=1419</guid>
		<description><![CDATA[<span style="color: #999;">KNOWLEDGE BRIEF 8</span><br />by <a href="/technical-panel/#mccaffrey">Daniel McCaffrey</a>

A teacher’s value-added can depend on the test used to assess his or her students’ achievement. Questions about the effect of a test on a teacher’s value-added are particularly salient today because most states will soon be adopting new tests aligned with the Common Core State Standards. Research and past experience suggest that teachers should be prepared for greater year-to-year variability in their value-added rankings for a few years after the change, when districts use both old and new tests for value-added calculations.]]></description>
<content:encoded><![CDATA[<div align="right"><a href="/wp-content/uploads/2013/06/CKN_2013-06_McCaffrey.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png" /></a></div>
<p><img class="alignnone size-full wp-image-1440" alt="orange vs apple" src="http://www.carnegieknowledgenetwork.org/wp-content/uploads/2013/06/orange_apple_crop.jpg" /></p>
<div class="biobox"><img alt="Daniel McCaffrey" src="/wp-content/uploads/2012/03/McCaffrey.jpg" /><br />
<a href="/technical-panel/#mccaffrey">Daniel McCaffrey</a><br />
<strong>Principal Research<br />
Scientist</strong><br />
Educational Testing<br />
Service</div>
<h2>Daniel F. McCaffrey</h2>
<h3>Highlights</h3>
<ul>
<li>There is only a moderate, and often weak, correlation between value-added calculations for the same teacher based on different tests.</li>
<li>The content, timing, and structure of tests all contribute to differences in value-added calculations based on different tests.</li>
<li>The stakes attached to a test affect the correlations between value-added estimates based on different tests.</li>
<li>Conclusions drawn from one test might not serve as an accurate picture of a teacher&#8217;s true effectiveness.</li>
<li>More studies are needed to better assess the potential that &#8220;teaching to the test&#8221; has for distorting value-added estimates.</li>
<li>Composite measures may mitigate some of the distortions caused by using different tests, and they may be better predictors of student learning gains.</li>
<li>States should expect large changes in value-added calculations in 2014-15 when they switch to tests aligned with the Common Core State Standards.</li>
</ul>
<h3>Introduction</h3>
<p>Value-added evaluations use student test scores to assess teacher effectiveness. How we judge student achievement can depend on which test we use to measure it.<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a> Thus it is reasonable to ask whether a teacher&#8217;s value-added score depends on which test is used to calculate it. Would it change if we used a different test? Specifically, might a teacher admonished for poor performance or recognized for good performance have been treated differently if a different test had been used? It&#8217;s an important question, particularly because most states will soon be adopting new tests aligned with the Common Core State Standards.<a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a>,<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a></p>
<p>In this article we discuss what is known about how sensitive value-added scores are to the choice of test and what more needs to be known. We also discuss issues about the choice of test that might not be resolved through empirical investigation, as well as the implications of these findings for states and school districts.</p>
<h3>What is Known About Teacher Value-Added Estimates from Different Tests?</h3>
<p>In this section, we discuss the research studies that compared value-added calculated with one test to value-added calculated for the same teachers using a different test. These studies found the correspondence between the two sets of value-added estimates to be moderate at best. (In the next section, we discuss possible reasons for these differences.)</p>
<p>The Measures of Effective Teaching (MET) Project calculated teacher value-added scores using state accountability tests and, separately, using project-administered tests in grades four through eight in six school districts.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a>  The study used the correlation coefficient to describe the level of agreement between the two measures for each teacher.  A value of 1 represents perfect correspondence, and a value of 0 means no correspondence.  The MET Project found that the correlation between value-added using the two different tests administered to the same class of students was 0.38 for math and 0.21 for reading.<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a>  Associations in this range typically are considered weak:  teachers with value-added in the top quartile on the state reading test would have about a 40 percent chance that their value-added on the alternative test would be below the 50th percentile.</p>
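<p>As a back-of-the-envelope check (our illustration, not a MET calculation), treating the two value-added measures as standard bivariate normal variables with a correlation of 0.21 reproduces that probability:</p>
<pre><code># Simulate value-added on two tests as bivariate normal with r = 0.21
# (the MET reading figure) and check the quoted probability.
import numpy as np

rng = np.random.default_rng(0)
n, rho = 1_000_000, 0.21
state = rng.normal(size=n)
alt = rho * state + np.sqrt(1 - rho**2) * rng.normal(size=n)

top_quartile = state > np.quantile(state, 0.75)
share_below_median = np.mean(np.median(alt) > alt[top_quartile])
print(round(share_below_median, 2))  # about 0.39
</code></pre>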
<div class="pull-right">A teacher rated at one value-added level calculated from one test has a strong likelihood of earning a different level based on value-added calculated from a different test.</div>
<p>Researchers also have taken advantage of the multiple tests administered by some states and school districts to investigate how much value-added changes when it is calculated with different tests. The studies used data from Hillsborough County, Florida; Houston, Texas; and a large urban district in the Northeast.  In Hillsborough County, students completed two tests administered by the state: the Sunshine State Standards Test, a criterion-referenced test that assessed student mastery of Florida standards and served as the primary test for school accountability, and a norm-referenced test used to compare Florida students with those in other states.<a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a> In Houston, students completed the state accountability test and a standardized norm-referenced test administered by the district.<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a> In the Northeast urban district, students completed the state test and two tests administered by the district: a standardized norm-referenced test in reading and math and a separate reading test.<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a>  Each of the studies presented the correlation between teachers&#8217; value-added based on the state accountability test and the alternative test.  The studies found that correlations ranged from .20 to .59.<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a> The highest correlation coefficients were for math and reading teachers in Houston – .59 for math and .50 for reading – where value-added was calculated by pooling up to eight years of data for a teacher.  The smallest values, of around .20, were for reading teachers in the Northeastern city. That case compared value-added on the state test, which was administered in the spring, to value-added on a test administered by the district the following fall. It used only one year of data from both tests.</p>
<p>Taken together, this evidence suggests that a teacher who taught the same curriculum to the same students, and who is rated at a given level based on value-added calculated from one test has a strong likelihood of earning a different level based on value-added calculated from a different test.</p>
<h4>Why is value-added sensitive to the test?</h4>
<p>No standardized achievement test covers all the content and skills that a student might learn in a year.  Consequently, value-added from one test might not fully reflect a teacher&#8217;s effectiveness at promoting learning. It will not include content or skills not covered by the test.  However, a teacher&#8217;s effectiveness at teaching the content on one test may be very similar to her effectiveness at teaching the content on other tests. That is, a &#8220;good teacher&#8221; is a &#8220;good teacher&#8221; regardless of the content, skills or the test. Alternatively, some teachers may be more effective at teaching some content and skills and less effective at others.  A teacher may be good at teaching mathematical computations but less good at teaching problem-solving.  If value-added scores from different tests lead to different conclusions about a teacher, then we may worry that value-added from any single test provides an incomplete picture of a teacher&#8217;s effectiveness, and that using it to make decisions about teachers may be inefficient or, for some teachers, unfair.</p>
<div class="pull-right">Six possible reasons for the differences in value-added between tests: timing, statistical imprecision, test content, cognitive demands, test format, and the consequences of the test.</div>
<p>From the results of the studies mentioned above, we might at first conclude that value-added on one test is a poor measure of a teacher&#8217;s effectiveness at teaching the content and skill measured by other tests.  However, we consider six possible reasons for the weak correspondence between value-added calculated with two different tests: 1) the timing of the tests; 2) statistical imprecision; 3) test content; 4) the cognitive demands of the tests; 5) test format; and 6) the consequences of the test for students, teachers, or schools.  These other possible reasons have different implications for what value-added from one test might tell us about a teacher. We will discuss each in turn:</p>
<ol>
<li>Test timing. We have seen that the lowest correlation between value-added calculated with two different tests was for tests administered at two different times in the school year, one in the fall and the other in the spring. Thus, value-added scores are sensitive to the timing of the tests, and any comparisons of value-added among teachers or for the same teachers across years should use tests given at the same time of the school year.  Should states use fall-to-fall or spring-to-spring testing? There is no research on whether testing in either period leads to value-added scores that better reflect teachers&#8217; true effectiveness; however, spring tests do allow for calculating value-added closer to when the students were in their teachers&#8217; classes.</li>
<li>Statistical imprecision. All the studies except the Houston study calculated value-added for teachers using a single year of data and two different tests administered to the same group of students. Each measure is imprecise because it is calculated with a small number of students responding to a particular test form <a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a> on a particular day. The imprecision in each test contributes to disagreements in the value-added estimates based on them. However, imprecision does not mean that value-added on one test is a poor measure of a teacher&#8217;s effectiveness at teaching other content. If we remove the statistical noise that creates the imprecision, the agreement should be stronger. The Houston study used multiple years of data to calculate a teacher&#8217;s value-added on each test. Each year the state test used a different test form, and each year the teacher had different students. Thus, by combining data across multiple years, the Houston study reduced the imprecision. As a result, the correlation between tests in that study was higher than it was in the other studies. The MET Project adjusted its correlations to estimate what the correlation would be between the average of a teacher&#8217;s value-added calculated over several years using the state test and the average of her value-added calculated over several years using the alternative tests. The adjusted estimates were .54 for math and .37 for reading, again higher than the correlations for value-added from a single year (a sketch of this kind of adjustment appears after this list). These correlations are still not strong. Value-added scores from one test might not provide the full picture of a teacher&#8217;s effectiveness.</li>
<li>Test content. As discussed above, disagreement between value-added calculated with different tests might be due to differences in what is tested: A teacher may be differently effective at promoting achievement depending on the content measured. Two studies directly compared value-added based on different content. In both, researchers calculated teacher value-added on the problem-solving subtest scores from a math assessment and, separately, on the procedures subtest scores from the same assessment, using the same students, tested under the same conditions, on the same day. In both studies, the agreement between the value-added using the two different subtests was modest, suggesting that a teacher who effectively promotes growth in problem-solving might not be equally effective at promoting growth in procedures.<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a> These studies suggest that content does matter and that teachers who are effective at teaching the content of one test might not be equally effective with other content. This has implications for what is tested and for the efficiency of decisions made using value-added.</li>
<li>Cognitive demands. Assessments vary in the cognitive demands made by the material on the test. A teacher may be differently effective at promoting growth in different types of skills, and this could result in differences in value-added from different tests. The research is unclear about how these differences in a test&#8217;s cognitive demands contribute to differences in value-added.  The alternative tests chosen by the MET Project had different cognitive demands than did the state tests, and agreement between the value-added from these two tests was moderate at best. However, the tests also had some differences in content, so it is unclear how much the difference in cognitive demand contributed to the low correlation in value-added.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a></li>
<li>Formats. Tests may have different formats. For instance, some tests may use multiple-choice items, and others may use constructed-response items. The different formats might contribute to differences in value-added. Some teachers may spend more time promoting the skills students need to succeed with constructed responses; other teachers may spend more time promoting the skills students need for multiple-choice tests. The supplemental tests used in the MET Project were open-ended tests with constructed responses and no multiple-choice items, but the state tests contained mostly multiple-choice items. Such differences could have weakened the agreement between the value-added calculated from the different tests. Variation in value-added due to test format does provide meaningful information about teachers. Tests used for value-added should include multiple formats, and any comparisons of teachers&#8217; value-added should be restricted to tests using similar formats.</li>
<li>Consequences. Tests can differ in the consequences that are attached to their outcomes for students, teachers, and schools. Tests with consequences are called &#8220;high-stakes&#8221; tests; tests with limited or no consequences are called &#8220;low-stakes&#8221; tests. The Sunshine State Standards test in Florida and the state tests in the other studies were used to hold schools and districts accountable for performance; schools and districts faced penalties if their students scored poorly. The stakes of the other tests used in the studies were not as high. Teachers report focusing on test preparation for high-stakes tests, and there is concern that this focus can inflate scores.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a>,<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a> Teachers who focus narrowly on the high-stakes state test may not do as well promoting growth on the lower-stakes test. The current research provides some evidence that value-added on high-stakes tests may be distorted somewhat by a narrow focus on the tested material. One study calculated a teacher&#8217;s value-added using her students&#8217; test scores at the end of the year she taught them and, separately, their scores at the end of each of the next several school years. The study showed that value-added fades over time as students progress through school. For example, a fourth grade teacher&#8217;s value-added based on her students&#8217; fourth grade test scores is larger than her value-added based on those students&#8217; fifth or sixth or higher grade scores. The study also calculated value-added for the current and future years using both a high-stakes and a low-stakes test. It found that teachers&#8217; contributions to learning appeared to fade more quickly on a high-stakes test than on a low-stakes test.<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a> If some teachers narrowly focus on features specific to a test, such as its format for math problems or its limited set of vocabulary, then we might not expect their students&#8217; achievement growth to carry over to future tests.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a></li>
</ol>
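<p>The carry-over idea mentioned above can be made concrete with a minimal sketch. It is an illustration, not the method used in the studies cited here: it assumes we already have each teacher&#8217;s value-added on the current-year test and her students&#8217; average residualized gains on a later test, and it estimates the carry-over fraction as a simple regression slope.</p>
<pre><code>import numpy as np

def carryover_fraction(va_current, future_gains):
    """OLS slope of later-year residual gains on current-year value-added.

    A slope near 1 means a teacher's measured contribution persists fully;
    the study cited above found smaller slopes (faster fade-out) for
    high-stakes tests than for low-stakes tests.
    """
    va = np.asarray(va_current, dtype=float)
    fg = np.asarray(future_gains, dtype=float)
    va = va - va.mean()
    fg = fg - fg.mean()
    return float(np.dot(va, fg) / np.dot(va, va))
</code></pre>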
<h4>What are the consequences of sensitivity to the test?</h4>
<p>The research suggests that conclusions about a teacher&#8217;s effectiveness drawn from one test might be specific to that test, and might not provide an accurate picture of that teacher&#8217;s effectiveness overall.  For instance, the study in Hillsborough found that only 43 percent of the teachers who ranked in the top 20 percent according to value-added from the norm-referenced test also ranked in the top 20 percent according to value-added estimates from the Sunshine State Standards test.<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a> Similarly, the study of the Northeastern city found that if the district had been using a pay-for-performance system,<a class="fn-ref-mark" href="#footnote-18" id="refmark-18"><sup>[18]</sup></a> changing tests would have changed the bonuses for nearly 50 percent of teachers, for an average salary difference of ,000.<a class="fn-ref-mark" href="#footnote-19" id="refmark-19"><sup>[19]</sup></a> Some of the variation is due to the imprecision in calculating value-added, but even without these errors, the correspondence among conclusions would be less than perfect.  The Houston study, which substantially reduced imprecision by pooling multiple years of data, still found that only about 46 percent of reading teachers classified in the top 20 percent on the state test were also among the top performers on the district test.<a class="fn-ref-mark" href="#footnote-20" id="refmark-20"><sup>[20]</sup></a></p>
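<p>For readers who want to replicate this kind of comparison with their own data, a minimal sketch follows. It assumes two arrays of value-added estimates for the same teachers, one per test; the function name and the 80th-percentile cutoff are illustrative choices, not taken from the studies above.</p>
<pre><code>import numpy as np

def top_group_agreement(va_test_a, va_test_b, cutoff_pct=80):
    """Share of teachers in the top 20% on test A who are also in the
    top 20% on test B (cf. the 43 and 46 percent figures cited above)."""
    a = np.asarray(va_test_a, dtype=float)
    b = np.asarray(va_test_b, dtype=float)
    top_a = a >= np.percentile(a, cutoff_pct)
    top_b = b >= np.percentile(b, cutoff_pct)
    return float(top_b[top_a].mean())
</code></pre>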
<div class="pull-left">Research suggests that conclusions about a teacher’s effectiveness drawn from one test might be specific to that test.</div>
<p>Although conclusions about individual teachers can be highly sensitive to the test used for calculating value-added, value-added on state tests can be used to identify groups of teachers who, on average, have students who show different growth on alternative tests. The MET Project found that, on average, if two teachers&#8217; value-added calculated using the state test in year one of the study differed by 10 points, then, in year two, their students&#8217; achievement on the low-stakes alternative test would differ by about 7 points.<a class="fn-ref-mark" href="#footnote-21" id="refmark-21"><sup>[21]</sup></a>  If we use the state test to identify teachers who have low value-added, the selected teachers&#8217; students would also have lower growth on the alternative tests than would students of other teachers.  The differences between the two groups of teachers would be about 70 percent as large on the alternative test as on the state test.  Some of the individual teachers identified as low-performing might actually have students with substantial growth on the alternative test, but as a group, their students would tend to have lower growth on both tests. Thus, the sensitivity of value-added to a test does not mean it cannot be used to support decisions about teachers that could potentially help student outcomes.</p>
<h3>What More Needs to be Known on this Issue?</h3>
<p>Multiple sources can contribute to differences in value-added on different tests. Determining the contributions of each would be valuable for designing effective evaluation systems.</p>
<p>Differences in value-added from high- and low-stakes tests might be due to some teachers focusing more than others on superficial aspects of the tests and on practices that improve student test scores but not student achievement.  These practices result in what is sometimes referred to as &#8220;score inflation.&#8221;<a class="fn-ref-mark" href="#footnote-22" id="refmark-22"><sup>[22]</sup></a>  An egregious example is that of teachers changing student test scores in the highly publicized cheating scandals in Atlanta and New York.<a class="fn-ref-mark" href="#footnote-23" id="refmark-23"><sup>[23]</sup></a> Practices that lead to test-score inflation can also be less overt. For instance, teachers may have students practice test-taking skills.<a class="fn-ref-mark" href="#footnote-24" id="refmark-24"><sup>[24]</sup></a> But differences in value-added from high- and low-stakes tests might not be due to score inflation.  They may also be due to differences in the content on the tests.  A teacher&#8217;s effectiveness at helping students learn the state standards might be only weakly related to his effectiveness at teaching other material, especially if he focuses on the state standards and his instructional materials are related to those standards.</p>
<p>The two scenarios have different implications for the utility of value-added for improving student outcomes and for the consequences of switching to the rigorous Common Core State Standards and their associated tests. If some teachers have high value-added on the high-stakes test because of score inflation, then value-added might not be useful for identifying effective teachers. And using it in evaluations may have negative consequences for students. If teachers are focused on the content of the state standards and if value-added reflects their effectiveness at teaching that material, then setting rigorous standards should have positive effects on student learning.  The research discussed above offers limited evidence in support of both scenarios. States and districts would benefit from knowing how much each scenario contributes to value-added on high-stakes tests.</p>
<div class="pull-left">As value-added estimates become more common in consequential evaluations, the motivation for teachers to inflate their value-added will increase.</div>
<p>As value-added estimates become more common in consequential teacher evaluations, the motivation for teachers to take steps that might inflate their value-added will increase. There are many examples in education, and in other fields, in which the use of performance indicators leads to their distortion.<a class="fn-ref-mark" href="#footnote-25" id="refmark-25"><sup>[25]</sup></a> So understanding this risk, and designing systems to prevent it, will be important for maintaining the integrity of teacher evaluation. One particular concern is peer competition: the extent to which teachers feel they need to take steps to inflate their scores because they think their colleagues are doing so.</p>
<p>If differences in value-added with different tests are due to differences in the skills measured by the tests, the choice of skills to be measured is a critical one. We have some evidence that teacher effectiveness differs according to the two math areas of problem-solving and procedures, but we need to further understand the contributions of measurement error to those findings and to extend them to other subjects and skills. We also need to better understand which skills lead to better long-term outcomes and to determine how best to measure those skills so that value-added can provide the most meaningful measure of a teacher&#8217;s effectiveness.</p>
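<p>One standard tool for the measurement-error question raised above is the classical disattenuation formula, which asks what the correlation between two value-added measures would be if both were free of random error. The sketch below is a textbook correction, not a procedure taken from the studies cited here, and the reliability values in the example are invented.</p>
<pre><code>import math

def disattenuated_correlation(r_observed, reliability_a, reliability_b):
    """Correct an observed correlation for measurement error:
    r_true = r_observed / sqrt(reliability_a * reliability_b)."""
    return r_observed / math.sqrt(reliability_a * reliability_b)

# Illustrative: an observed cross-test correlation of 0.50, with each
# value-added measure having reliability 0.70, implies a corrected
# correlation of about 0.71.
print(disattenuated_correlation(0.50, 0.70, 0.70))
</code></pre>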
<p>It would be helpful to have more systematic evaluations of how value-added changes when states introduce new tests. In recent years, some states have made significant changes to their tests, and these might hint at what states can expect when they change to the Common Core tests. However, there is no research that documents how value-added changed with the change in tests. New studies might explore the correlation in value-added with the new and old tests, using these states as examples: What types of teachers have value-added that is different with the new test than with the old test? The studies should also consider the differences in the tests themselves and explore how these contribute to differences in value-added.</p>
<h3>What Can&#8217;t be Resolved by Empirical Evidence on this Issue?</h3>
<p>We cannot accurately test students on all the skills they need to be ready for college and careers and to lead happy and productive lives.  Even the best test will measure only a limited set of skills, and value-added that is based on that test will evaluate teachers only on those skills. When considering skills to be tested, ideally we would select those that are required for students to achieve the long-term outcomes that society values. Empirical analyses might someday help us determine how skills relate to long-term outcomes, but they cannot tell us what those outcomes should be.</p>
<div class="summery">
<h3>How, And Under What Circumstances, Does This Issue Impact The Decisions And Actions That Districts Make On Teacher Evaluations?</h3>
<p>The research presented above raises two concerns for states: 1) Is the value-added for teachers from one test providing an adequate measure of the teacher&#8217;s contributions to learning, and how can the data be used to provide the most accurate measure of teacher effectiveness? 2) How should states prepare for the transition to new tests aligned with the Common Core State Standards and for the impact this transition will have on value-added?</p>
<h4>Implications for using value-added from one test</h4>
<p>As discussed above, value-added from one test might not agree with that from another because 1) value-added depends on factors that are specific to each test – test format, for instance – but unrelated to teacher effectiveness, or 2) value-added does not measure aspects of teacher effectiveness related to content not covered by the test. States should choose tests that ensure that student scores are determined by content knowledge and not by extraneous factors.  They also should select tests with a very broad range of content so that meaningful aspects of teacher effectiveness are not missed. The tests that two consortia are developing to measure the Common Core State Standards are designed to meet both these goals.</p>
<p>However, states and districts might still worry that even with the new tests, value-added might not reflect all aspects of a teacher&#8217;s contributions to student outcomes. States might combine value-added with other measures to reduce this risk. The MET Project found that composite measures of teacher effectiveness that combined, with roughly equal weights, value-added calculated with state tests, classroom observations, and student responses to surveys were somewhat better predictors of a teacher&#8217;s future value-added calculated from an alternative test than was value-added calculated from the state test alone.<a class="fn-ref-mark" href="#footnote-26" id="refmark-26"><sup>[26]</sup></a></p>
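<p>As a concrete illustration of combining measures, the sketch below forms an equal-weights composite of three standardized measures. It is a simplified stand-in for the MET composite described above: the inputs, the z-scoring step, and the weights are all illustrative assumptions, not MET&#8217;s exact estimates.</p>
<pre><code>import numpy as np

def composite_measure(value_added, observation_score, survey_score,
                      weights=(1/3, 1/3, 1/3)):
    """Weighted composite of three teacher-effectiveness measures.

    Each measure is z-scored first so that the weights are comparable
    across measures with different scales.
    """
    measures = [np.asarray(m, dtype=float)
                for m in (value_added, observation_score, survey_score)]
    z_scores = [(m - m.mean()) / m.std() for m in measures]
    return sum(w * z for w, z in zip(weights, z_scores))
</code></pre>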
<h4>Implications for the transition to the new test</h4>
<p>The transition to the new Common Core tests will introduce many conditions that could lead to changes in teachers&#8217; value-added. The new tests will measure content aligned with the new standards, not current standards. They will also use a different test format. Administered on computer, they will have fewer multiple choice items and more performance tasks and constructed response items.  The new tests also are to be more cognitively demanding than the current tests and could be administered at different times than some tests are now. Consequently, student scores on these new tests will likely differ from what scores would have been on the current state tests, and teachers&#8217; value-added is likely to be different, as well. Evidence already shows that there is often a precipitous drop in student achievement when districts change from one test to another.<a class="fn-ref-mark" href="#footnote-27" id="refmark-27"><sup>[27]</sup></a> If a similar drop follows the switch to new Common Core tests, it could further contribute to instability in value-added estimates.</p>
<div class="pull-right">States should prepare for larger year-to-year changes in value-added in the 2014-15 school year when they switch to tests aligned with the Common Core.</div>
<p>With these conditions in mind, states should prepare for larger year-to-year changes in value-added in the 2014-15 school year when they switch to tests aligned with the Common Core. They should also be prepared for greater year-to-year variability in value-added for a few years after the change when they will be using both old and new tests.<a class="fn-ref-mark" href="#footnote-28" id="refmark-28"><sup>[28]</sup></a> Large year-to-year variability makes value-added hard to interpret. A teacher doing well one year may appear to be doing poorly the next.  It can also weaken the credibility of value-added among teachers.<a class="fn-ref-mark" href="#footnote-29" id="refmark-29"><sup>[29]</sup></a></p>
<p>There is no formal research on what states should do to ease this transition. The following paragraphs provide insights from an assistant superintendent from a large urban school district that uses value-added for performance-based pay and from analysts who supply value-added to states to address questions states might have about the upcoming changes.<a class="fn-ref-mark" href="#footnote-30" id="refmark-30"><sup>[30]</sup></a></p>
<p>One question states have is how they should modify their value-added models when they switch tests. Most of the analysts contacted for this brief said that when states changed tests, they simply applied the same methods they did in other years, except the prior year scores were from the old test and the current year scores were from the new test.<a class="fn-ref-mark" href="#footnote-31" id="refmark-31"><sup>[31]</sup></a> All of the analysts noted the importance of making sure that scores on the old tests are predictive of students&#8217; performance on the new tests before calculating value-added using data from both tests. If students&#8217; scores on the old tests are weak predictors of their performance on the new tests, it could lead to error in value-added. One analyst recommended that to improve the predictive power of the prior tests, states could use prior achievement scores in math, reading, and other available subjects in calculating value-added for either math or reading. He also advised that states account for measurement error in the prior scores when creating value-added.</p>
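<p>Footnote 31 describes the residual growth models these analysts use. The sketch below is a bare-bones version under stated assumptions: hypothetical column names, prior scores in two subjects from the old test predicting current scores on the new test, and no correction for measurement error in the prior scores (the correction one analyst recommends above would be layered on top of this).</p>
<pre><code>import numpy as np
import pandas as pd

def residual_growth_value_added(students: pd.DataFrame) -> pd.Series:
    """Residual growth sketch: predict current-year (new test) scores from
    prior-year (old test) scores in math and reading, then average each
    teacher's student residuals. Columns 'prior_math', 'prior_read',
    'current_score', and 'teacher_id' are hypothetical names."""
    X = np.column_stack([
        np.ones(len(students)),
        students["prior_math"].to_numpy(dtype=float),
        students["prior_read"].to_numpy(dtype=float),
    ])
    y = students["current_score"].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return students.assign(residual=residuals).groupby("teacher_id")["residual"].mean()
</code></pre>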
<p>Many value-added methods evaluate teachers&#8217; contributions to student learning by comparing their students&#8217; growth in achievement with the average growth of all students in the district or state during the school year. As a result, value-added cannot be used to determine if the average teacher is improving across time. To avoid this problem, some value-added methods, including those used by Tennessee and Ohio, compare student achievement growth to the average growth of students from a prior school year, called a base year. The base year remains constant over time so that, across years, the average value-added will increase with any improvements in teaching. The analysts recommended that the base year be changed when states switch tests, noting that scores from the new test cannot be compared with base-year scores from the old test.<a class="fn-ref-mark" href="#footnote-32" id="refmark-32"><sup>[32]</sup></a> The analysts further suggested waiting a few years before establishing a new base year because, in their experience, large year-to-year changes in the student score distributions can occur in the first few years of a new testing program.</p>
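<p>The base-year convention can be made concrete with a small sketch. Under the simplifying assumption that value-added is just centered student growth, the only change is which year&#8217;s average the growth is centered on:</p>
<pre><code>import numpy as np

def center_on_base_year(gains_by_year, base_year):
    """Center each year's student gains on the base year's average gain
    rather than on the same year's average. With a fixed base year,
    improvements in average teaching show up as rising value-added
    instead of being normalized away. Input: {year: array of gains}."""
    base_mean = float(np.mean(gains_by_year[base_year]))
    return {year: np.asarray(gains, dtype=float) - base_mean
            for year, gains in gains_by_year.items()}
</code></pre>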
<p>The assistant superintendent cited above noted that when the state made a significant change to its annual test, teachers whose classes had high prior achievement saw their value-added go up relative to other teachers, and teachers whose classes had low prior achievement saw their value-added go down. The effects were detrimental to the credibility of value-added in the district overall. They fed existing suspicions that value-added was too low for some teachers because of the students they taught.<a class="fn-ref-mark" href="#footnote-33" id="refmark-33"><sup>[33]</sup></a> District leaders were not prepared to address the problems created by the new test.</p>
<p>Because changing the test is very likely to lead to changes in value-added for some teachers, states may want to prepare: they may want to thoroughly investigate how value-added changes from the years before the Common Core test to the first years after it. If teachers with certain types of students have systematically higher or lower value-added on the new test, the state might want to take steps to reduce negative repercussions. The state might not release value-added for some teachers for a few years, instead waiting for multiple prior years of data on the new test to better adjust for differences among classes. The state might follow the recommendations of analysts and use tests from multiple subjects and control for measurement error in its value-added calculations. The state might use a weighted average of value-added calculated using the old and the new tests to smooth out the transition. The state might follow the MET Project and use a composite estimate with less weight on value-added. Or, if the effects of the new test are concentrated on the value-added for a subset of teachers, the state might give those teachers&#8217; value-added less weight or allow districts greater flexibility in how they use value-added for performance evaluations.</p>
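<p>The weighted-average option mentioned above is simple to state precisely. The sketch below assumes both sets of value-added estimates are on a comparable (standardized) scale; the starting weight and any phase-in schedule are policy choices, not research findings.</p>
<pre><code>def blended_value_added(va_old_test, va_new_test, weight_new=0.5):
    """Weighted average of value-added from the old and new tests.

    A state might start weight_new low in the first year of the new test
    and raise it toward 1.0 as data on the new test accumulate.
    """
    return weight_new * va_new_test + (1.0 - weight_new) * va_old_test
</code></pre>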
</div>
<h2 class="refbtn">References +</h2>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">The report by Koretz and colleagues provide several examples of differences in scores for students on high- and low-stakes tests. See: Koretz, Daniel, Robert L. Linn, Stephen B. Dunbar, and Lorrie A. Shepard.<a href="http://www.colorado.edu/UCB/AcademicAffairs/education/faculty/lorrieshepard/testing.html" target="_blank"> The Effect of High-stakes Testing on Achievement: Preliminary Findings about Generalization Across Tests</a>. US Department of Education, Office of Educational Research and Improvement, Educational Resources Information Center, 1991.<br />
Retrieved March 4, 2013<a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">Two consortia are developing the tests aligned with the Common Core State Standards. The Partnership for Assessment in Readiness in College and Careers includes <a href="http://www.achieve.org/parcc-states" target="_blank">21 states and the District of Columbia</a> and the SmarterBalanced Consortium includes <a href="http://www.smarterbalanced.org/about/member-states" target="_blank">23 states</a>. Two states, North Dakota and Pennsylvania are participating in both efforts.<a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">Issues about how the change to Common Standards and the associated tests will affect school and teacher accountability are already gathering considerable attention. Examples include blogs in the <em><a href="http://eyeoned.org/content/its-the-curriculum-stupid_394/#more-394" target="_blank">Hechinger Report</a></em>,  and a response to it in a blog at <a href="http://blogs.edweek.org/teachers/teaching_now/2013/02/will_the_common_core_skew_value-added_scores.html?cmp=ENL-TU-VIEWS2" target="_blank"><em>Education Week Teacher</em></a>.<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text"> See: Bill and Melinda Gates Foundation, <em><a href="http://www.metproject.org/downloads/Preliminary_Findings-Research_Paper.pdf" target="_blank">Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project</a></em>, 2010. Retrieved March 4, 2013<a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">These correlation coefficients were not adjusted for measurement error in value-added.  As discussed in the section, <em>Why Is Value-added Sensitive to the Test, </em>the MET report also presents correlation coefficients that are corrected for the measurement error and the corrected values are larger.<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">Tim R. Sass, &#8220;<a href="http://www.caldercenter.org/about/Tim-Sass.cfm" target="_blank">The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy</a>,&#8221; (The National Center for Analysis of Longitudinal Data in Education Research, Brief 4, 2008). Retrieved March 4, 2013<br />
Sean P. Corcoran, Jennifer L. Jennings, and Andrew A. Beveridge, &#8220;<a href="https://files.nyu.edu/sc129/public/research.htm" target="_blank">Teacher Effectiveness on High- and Low-Stakes Tests</a>,&#8221; (In paper originally delivered to the SREE Conference, Washington, DC, 2011). Retrieved March 4, 2013<br />
Papay, John P., &#8220;Different Tests, Different Answers: The Stability of Teacher Value-Added Estimates Across Outcome Measures,&#8221; <em><a href="http://aer.sagepub.com/content/early/2010/03/31/0002831210362589.full.pdf" target="_blank">American Educational Research Journal</a></em> 48 (1) (2011): 163-193. Retrieved March 4, 2013<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">The district administered the Stanford Achievement Test Series, Tenth Edition. For additional details, see: Corcoran, et al, 2008, ibid.<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">The district administered the Stanford Achievement Test of math and reading and the Scholastic Reading Inventory. For details, see: Papay, 2011, ibid.<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text">The correlation between value-added calculated using the SSS and the NRT for math teachers in Hillsborough County was 0.48.  For teachers in Houston the study considered multiple model specification and the correlation between value-added calculated using the state test and the district administered norm-referenced test ranged from .45 to .57 with a value of .50 for the baseline specification for reading teachers and the correlation ranged from .58 to .62 with a value of .59 for the baseline specification for math teachers. The study of the Northeastern city calculated value-added for reading teachers using three tests and used multiple model specification for the calculations.  The correlation between value-added calculated using the district administered norm-referenced test and either of the other tests (the state test or the reading test) ranged from .16 to .28 across different model specifications. The correlation between value-added calculated using the reading test and the states test ranged from .44 to .51 depending on the model specification.<a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">Tests often have different test forms.  For example, the test booklet used in one year has different specific items than the booklets used in other years.  Each year&#8217;s test is a different test form. Test forms are constructed using common design guidelines to measure the same content and they use very similar item types and the same format each year. They are equated so that scores from one form can be used interchangeably with scores from other forms.  However, because the forms use different items, a student&#8217;s score on one form will differ from the score she or he would receive if tested on a different form.  This variability in scores due to the test form is known as test measurement error.  It creates instability in value-added that contributes to statistical error. We distinguish between sensitivity to test form and sensitivity to tests because test forms are designed to measure the same content and be exchangeable. The variability in due to test forms is included in the statistical errors.  Different tests are designed to measure related but somewhat different content and statistical errors do not account for the contributions of different tests to variation in value-added.<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">In the paper, &#8220;Different Tests, Different Answers: The Stability of Teacher Value-Added Estimates Across Outcome Measures,&#8221; John Papay finds the correlation ranges from .52 to .59 depending on the specification of the model.  He uses the Spearman correlation of ranks but this is likely to have limited impact on the estimate of the correlation.  J.R. Lockwood and colleagues find that the correlation ranges from .01 to .46 depending on the value-added model specification.
<p>See: Lockwood, J. R., Daniel F. McCaffrey, Laura S. Hamilton, Brian Stecher, Vi-Nhuan Le, and José Felipe Martinez, &#8220;The Sensitivity of Value-Added Teacher Effect Estimates to Different Mathematics Achievement Measures,&#8221; <em><a href="http://www.copymail.wceruw.org/news/events/VAM%20Conference%20Final%20Papers/StudentTeacherInteractions_LockwoodMcCaffrey.pdf" target="_blank">Journal of Educational Measurement</a></em> 44(1) (2007): 47-67.</p>
<p>Neither study adjusts the reported correlations for the test measurement error that arises from using a small number of items to measure the subtest content. However, Lockwood and colleagues report high reliability for each subscale, and the correlation between individual students&#8217; scores on the two subscales is much higher than the correlation between the two sets of value-added measures.<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">The MET study conducted a formal evaluation of the alignment of the content covered by its supplemental test and the content covered by the fourth and eighth grade state tests administered in the two participating districts.  There was overlap in the content but the emphasis of specific areas differed between the state and supplemental tests. Hence, the MET results cannot clearly distinguish between the effects of difference in content and difference in cognitive demands of the state and supplemental tests on value-added.<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">Daniel M. Koretz, &#8220;<a href="http://cse.ucla.edu/products/reports/r655.pdf" target="_blank">Alignment, High Stakes, and the Inflation of Test Scores</a>,&#8221; (National Center for Research on Evaluation, Standards, and Student Testing (CRESST) Center for the Study of Evaluation (CSE), Report 655, 2005). Retrieved March 4, 2013<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">The MET Project surveyed students about test preparation in their classrooms.  In first year of the study, Elementary school students were asked to evaluate the statements &#8220;We spend a lot of time practicing for the state test&#8221; and &#8220;Getting ready for the state test takes a lot of time in our class&#8221; on a 1 to 5 scale from &#8220;No, Never&#8221; to &#8220;Always&#8221;. The average scores for classrooms for these items were 4.3 and 3.8, indicating that elementary students perceived they spent a considerable amount of time practicing for the test. For secondary classes the means were 3.6 and 3.4 on the 1 to 5 scale from &#8220;Totally Untrue&#8221; to &#8220;Totally True&#8221; suggesting that secondary students perceived less time spent on test preparation.  Secondary students were also asked evaluate the statement &#8220;I have learned a lot this year about the state test,&#8221; which had an average score of 3.7 across classes.<a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">The study by Corcoran et al. (2008) finds that about 40% of a fourth grade teacher&#8217;s value-added as measured on high stakes tests carried over to his/her students fifth grade scores but about 60% of the teacher&#8217;s value-added as measured by low stakes tests carried over to fifth grade scores.<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text">Koretz, Daniel M, &#8220;Limitations in the use of achievement tests as measures of educators&#8217; productivity.&#8221;<a href="http://standardizedtests.procon.org/view.resource.php?resourceID=004348" target="_blank"> <em>Journal of Human Resources</em></a> 37(4) (2002): 752-777. <a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">Sass, Tim R, &#8220;The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy,&#8221; (The National Center for Analysis of Longitudinal Data in Education Research, Brief 4, 2008).<a href="#refmark-17"></a></li>
<li id="footnote-18" class="fn-text">The study assumed a system like ASPIRE used in Houston, Texas.<a href="#refmark-18"></a></li>
<li id="footnote-19" class="fn-text">Papay, 2011, ibid.<a href="#refmark-19"></a></li>
<li id="footnote-20" class="fn-text">Corcoran, et al, 2008, ibid.<a href="#refmark-20"></a></li>
<li id="footnote-21" class="fn-text">Kane, Thomas J., Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger. &#8220;<a href="http://www.metproject.org/downloads/MET_Validating_Using_Random_Assignment_Research_Paper.pdf" target="_blank">Have We Identified Effective Teachers? Validating Measures of Effective Teaching Using Random Assignment</a>,&#8221; (Bill &amp; Melinda Gates Foundation, MET Project Research Paper, 2013). Retrieved March 4, 2013. Students were randomly assigned to teachers&#8217; classrooms within schools in year two of the MET study, so we can conclude that being assigned a teacher with higher value-added on the state test has a positive effect on student achievement on both the state test and other tests.<a href="#refmark-21"></a></li>
<li id="footnote-22" class="fn-text">Koretz, 2002, ibid.<a href="#refmark-22"></a></li>
<li id="footnote-23" class="fn-text">An article by Lois Beckett gives a brief summary of these and other high profile cheating scandals. See: Beckett, Lois, &#8220;<a href="http://www.propublica.org/article/americas-most-outrageous-teacher-cheating-scandals" target="_blank">America&#8217;s most outrageous teacher scandals</a>,&#8221; <em>Propublica April 2013</em>.<a href="#refmark-23"></a></li>
<li id="footnote-24" class="fn-text">See Koretz, 2002, ibid, for examples of other forms of score inflation.<a href="#refmark-24"></a></li>
<li id="footnote-25" class="fn-text">The pressure for people to corrupt performance indicators is so common that back in 1976 Donald T. Campbell summarized it in what is now known as Campell&#8217;s law: &#8220;The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.&#8221; See: Campbell, Donald T, &#8220;Assessing the impact of planned social change,&#8221;<em> <a href="https://www.globalhivmeinfo.org/CapacityBuilding/Occasional%20Papers/08%20Assessing%20the%20Impact%20of%20Planned%20Social%20Change.pdf" target="_blank">Evaluation and Program Planning</a></em> 2(1) (1979): 67-90. Retrieved May 24, 2013.
<p>Sheila Bird and her colleagues noted that when journey times were used to evaluate the performance of ambulance services in Great Britain, the services made perverse changes to the recorded start times to improve their performance on the indicators. See: Bird, Sheila M., David Cox, Vern T. Farewell, Harvey Goldstein, Tim Holt, and Peter C. Smith, &#8220;Performance indicators: good, bad, and ugly,&#8221; <em>Journal of the Royal Statistical Society: Series A (Statistics in Society)</em> 168(1) (2005): 1-27.<a href="#refmark-25"></a></li>
<li id="footnote-26" class="fn-text">Kata Mihaly, Daniel F. McCaffrey, Douglas O. Staiger, and J.R. Lockwood, &#8220;<a href="http://www.metproject.org/downloads/MET_Composite_Estimator_of_Effective_Teaching_Research_Paper.pdf" target="_blank">A Composite Estimator of Effective Teaching</a>,&#8221;  (MET Project Research Paper, 2013).  Retrieved March 4, 2013<a href="#refmark-26"></a></li>
<li id="footnote-27" class="fn-text">Koretz, 2002, ibid.<a href="#refmark-27"></a></li>
<li id="footnote-28" class="fn-text">Some states and districts are using growth measures, such as median growth percentiles, as an alternative to value-added.  Those measures also will rely on new and old tests together when states transition tests and will be susceptible to some of the same issues as value-added scores.<a href="#refmark-28"></a></li>
<li id="footnote-29" class="fn-text">In their paper, &#8220;Evaluating Teacher Evaluations,&#8221; Linda Darling-Hammond, et al. provide examples of teachers&#8217; responses to year-to-year variability in their value-added scores and the negative consequences of these unstable scores.  See: Linda Darling-Hammond, Audrey Amrein-Beardsley, Edward Haertel, and Jesse Rothstein, &#8220;Evaluating Teacher Evaluations&#8221;, <em>Phi Delta Kappan</em> 93(6) (2012): 8-15.<a href="#refmark-29"></a></li>
<li id="footnote-30" class="fn-text">We received informal comments on this issue from Damian Betebenner, at t<em>he National Center for the Improvement</em><em> <em>of Educational Assessment, Inc.</em></em>, Robert Meyers at the Value-Added Research Center at the University of Wisconsin, Mary Peters at Battelle for Kids, Carla Stevens at the Houston Independent School District, and John White at SAS.<a href="#refmark-30"></a></li>
<li id="footnote-31" class="fn-text">These analysts use residual growth models to calculate value-added. In a residual growth model, prior year test scores are used to predict students&#8217; current year scores and the difference between the predicted score and the actual score determines a student&#8217;s growth relative to other students.  Value-added typically uses variables in addition to prior achievement when predicting current scores.  A teacher&#8217;s value-added equals the average of her students&#8217; residual growth. For additional details on alternative formulations of growth models, see: Castellano, K. E., Andrew D. Ho, &#8220;<a href="http://scholar.harvard.edu/files/andrewho/files/a_pracitioners_guide_to_growth_models.pdf" target="_blank">A Practitioner&#8217;s Guide to Growth Models</a>,&#8221; (Council of Chief State School Officers, 2013). Retrieved May 23, 2013<a href="#refmark-31"></a></li>
<li id="footnote-32" class="fn-text">The analysts noted that there are methods for linking the scores on the old and new tests and retaining the base year. However, they advised against using this approach because it can yield unstable results.<a href="#refmark-32"></a></li>
<li id="footnote-33" class="fn-text">The educator explained that the old state test had a &#8220;ceiling effect&#8221; for high-achieving students; many of them got all the questions right.  Teachers in the district questioned the accuracy of value-added from a test with a ceiling effect.  The new test was more challenging and did not have a ceiling effect. Some people in the district interpreted the changes following the introduction of the new test as confirmation of bias in the value-added that was calculated with the old test and evidence that it should not be used in teacher evaluations.<a href="#refmark-33"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/value-added/accountability-tests/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Does Value-Added Work Better in Elementary Than in Secondary Grades?</title>
		<link>http://www.carnegieknowledgenetwork.org/briefs/value-added/grades/</link>
		<comments>http://www.carnegieknowledgenetwork.org/briefs/value-added/grades/#comments</comments>
		<pubDate>Tue, 14 May 2013 21:45:42 +0000</pubDate>
		<dc:creator><![CDATA[Douglas N. Harris]]></dc:creator>
				<category><![CDATA[Value-Added]]></category>

		<guid isPermaLink="false">http://www.carnegieknowledgenetwork.org/?p=1388</guid>
		<description><![CDATA[<span style="color: #999;">KNOWLEDGE BRIEF 7</span><br />by <a href="/technical-panel/#harris">Douglas N. Harris</a>

Value-added methodology is being applied to the evaluation of teachers in tested grades and subjects, but the vast majority of the research on value-added measures focuses on elementary schools only.  Secondary grades differ from elementary grades in ways that are meaningful for the validity and reliability of value-added measures for secondary teachers.  Middle and high school teachers have more students, which increases the reliability of value-added scores, but this advantage is offset by tracking, which reduces reliability at those grade levels.]]></description>
				<content:encoded><![CDATA[<div class="instaemail etp-alignleft"><a class="instaemail" id="instaemail-button" href="#" rel="nofollow" title="Email this page" onclick="pfEmail.init()" data-button-img=email-button>Email this page</a></div><div align="right"><a href="/wp-content/uploads/2013/05/CKN_2013_05_Harris-Anderson.pdf" target="_blank"><img alt="pdf download" src="/wp-content/uploads/2012/10/PDF-download-icon.png" /></a></div>
<p><img alt="track" src="/wp-content/uploads/2013/05/track_crop.jpg" width="590" /></p>
<div class="biobox"><img alt="Harris" src="/wp-content/uploads/2012/03/Harris.jpg" width="" height="100" /><br />
<a href="/technical-panel/#harris">Douglas N. Harris</a><br />
<strong>Associate Professor<br />
</strong>Economics<br />
<strong>Chair</strong><br />
Public Education<br />
Tulane University</div>
<h2>Douglas N. Harris</h2>
<h3>and Andrew Anderson</h3>
<h3>Highlights</h3>
<ul>
<li>The vast majority of research on value-added measures focuses on elementary schools; value-added measures for middle and high school teachers pose particular challenges.</li>
<li>Middle and high schools often &#8220;track&#8221; students in ways that affect the validity of value-added.</li>
<li>Student tracking in middle and high schools calls into question the validity of methods typically used to create value-added measures.</li>
<li>The validity of secondary-level value-added measures can be improved by directly accounting for tracks and specific courses, although this may not completely solve the problem.</li>
<li>Middle and high school teachers have more students, and this factor increases reliability, but it is offset by other factors that reduce reliability at those grade levels.</li>
<li>End-of-course exams, which are becoming more common in high school, have both advantages and disadvantages for estimating value-added.</li>
</ul>
<h3>Introduction</h3>
<p>There is a growing body of research on the validity and reliability of value-added measures, but most of this research has focused on elementary grades. This is because, in some respects, elementary grades represent the &#8220;best-case&#8221; scenario for using value-added. Value-added measures require annual testing and, in most states, students are tested every year in elementary and middle school (grades 3-8), but in only one year in high school. Also, a large share of elementary students spend almost all their instructional time with one teacher, so it is easier to attribute learning in math and reading to that teacher.<a class="fn-ref-mark" href="#footnote-1" id="refmark-1"><sup>[1]</sup></a></p>
<p>Driven by several federal initiatives such as Race to the Top, the Teacher Incentive Fund, and ESEA waivers, however, many states have incorporated value-added measures into the evaluations not only of elementary teachers but of middle and high school teachers as well. Almost all states have committed to one of the two Common Core assessments that will test annually in high school, and there is little doubt that value-added will be expanded to the grades in which the new assessments are introduced.<a class="fn-ref-mark" href="#footnote-2" id="refmark-2"><sup>[2]</sup></a> In order to assess the validity and reliability of value-added measures, it is important to consider the significant differences across grades in the ways teachers’ work and students’ time are organized.</p>
<p>As we describe below, the evidence shows that there are differences in the validity of value-added measures across grades for two primary reasons.  First, middle and high schools &#8220;track&#8221; students; that is, students are assigned to courses based on prior academic performance or other student characteristics. Tracking changes not only our ability to account for differences in the students whom teachers educate, but also the degree to which the curriculum aligns with the tests. Second, the structure of schooling and testing varies considerably by grade level and affects reliability in sometimes unexpected ways. The problems are partly correctable, but, as we show, more research is necessary to understand how problematic existing measures are and how they might be improved.</p>
<h3>What Do We Know About How Teacher Value-Added Measures Work In Different Grades And Subjects?</h3>
<p>We begin by discussing differences in <em>validity</em> across grades and follow with somewhat briefer discussions of <em>reliability</em> across grades. Validity refers to the degree to which something measures what it claims to measure, at least on average. Reliability refers to the degree to which the measure is consistent when repeated. A measure could be valid on average, but inconsistent when repeated, meaning it isn’t very reliable. Conversely, a measure could be highly reliable but invalid—that is, it could consistently provide the same invalid information.</p>
<h4>Validity of value-added measures across grade levels</h4>
<div class="pull-right">In elementary schools, it is common for principals to create similar classrooms (e.g., with similar numbers of low-performing and special needs students).</div>
<p>Students and teachers are assigned to classrooms differently in elementary schools than in middle and high schools. In elementary schools, it is common for principals to create similar classrooms (e.g., with similar numbers of low-performing and special needs students).<a class="fn-ref-mark" href="#footnote-3" id="refmark-3"><sup>[3]</sup></a> Other elementary principals identify student needs and try to match them to teachers who have the skills to meet those needs. Principals may also take into account parental requests, so that students with more academically demanding parents get assigned to teachers with the best reputations. Either of these last two forms of assignment—those based on student needs and those based on parental requests—has the potential to reduce the validity of value-added measures.<a class="fn-ref-mark" href="#footnote-4" id="refmark-4"><sup>[4]</sup></a> Sometimes called selection bias, the problem is that student needs and parental resources are never directly accounted for in value-added measures, even though they might affect student learning and therefore reduce validity of teacher value-added estimates.</p>
<p>Based on a series of experiments,<a class="fn-ref-mark" href="#footnote-5" id="refmark-5"><sup>[5]</sup></a> simulation studies,<a class="fn-ref-mark" href="#footnote-6" id="refmark-6"><sup>[6]</sup></a> and statistical tests,<a class="fn-ref-mark" href="#footnote-7" id="refmark-7"><sup>[7]</sup></a> elementary school value-added models do seem to address the selection bias problem well, on average. <em>This last caveat is important</em>. It is extremely difficult to provide strong evidence of validity for each teacher’s value-added. Instead, prior studies are really examining whether selection bias averages out for whole groups of teachers.<a class="fn-ref-mark" href="#footnote-8" id="refmark-8"><sup>[8]</sup></a></p>
<div class="pull-left">In middle and high schools, students with low test scores and grades and certain other characteristics are generally tracked into remedial courses, and those with stronger academic backgrounds are tracked into advanced courses.</div>
<p>Students in middle and high schools, on the other hand, are not assigned to or &#8220;selected&#8221; for classes in the same way they usually are in elementary schools. Rather, students with low test scores and grades and certain other characteristics are generally tracked into remedial courses, and those with stronger academic backgrounds are tracked into advanced courses. Minority and low-income students are also more likely to end up in lower tracks. These decisions might not be driven by strict rules or requirements, but they reflect strong patterns. In our analyses of Florida data, 37 percent of the variation in students’ middle school course tracks can be explained by a combination of their prior test scores, race/ethnicity, and family income.<a class="fn-ref-mark" href="#footnote-9" id="refmark-9"><sup>[9]</sup></a></p>
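<p>For readers who want to see the kind of calculation behind that figure, the sketch below regresses a numeric track level on prior scores and demographics and reports the R-squared. The column names are hypothetical, and this illustrates the general approach rather than our exact specification.</p>
<pre><code>import statsmodels.formula.api as smf

def track_predictability(students):
    """R-squared from predicting students' course track with prior test
    scores, race/ethnicity, and family income. 'students' is a DataFrame
    with hypothetical columns 'track_level' (numeric), 'prior_score',
    'race_ethnicity', and 'low_income'."""
    model = smf.ols(
        "track_level ~ prior_score + C(race_ethnicity) + low_income",
        data=students,
    )
    return model.fit().rsquared
</code></pre>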
<p>Tracking creates two potential problems for value-added. First, the academic content of the courses differs. This means that the material covered in each course aligns to the test in different ways. Tests are designed to align with state proficiency standards,<a class="fn-ref-mark" href="#footnote-10" id="refmark-10"><sup>[10]</sup></a> which in many states require a fairly low level of academic skill.<a class="fn-ref-mark" href="#footnote-11" id="refmark-11"><sup>[11]</sup></a> For this reason, we would expect the test to align better with lower or middle tracks, implying that teachers in these tracks have an easier time showing achievement gains—and therefore higher value-added. This prediction is reinforced by evidence of &#8220;ceiling effects&#8221; in standardized tests; students in the upper tracks, as described above, are likely to have higher scores and to hit the ceiling with little growth.<a class="fn-ref-mark" href="#footnote-12" id="refmark-12"><sup>[12]</sup></a> The direction and magnitude of these influences depend, of course, on the test and no doubt vary by state. Those states with low proficiency bars are probably more likely to have tests that align better with the remedial courses.</p>
<p>This disadvantage to teaching in the upper track, however, is apparently offset by a larger advantage: upper-track students seem to have unobserved traits that make them likely to achieve <em>larger</em> achievement gains. This is what we would predict based on which parents tend to push hardest to get their children into upper-track courses. Parents who press for more challenging academic courses probably also press their children to work harder, do their homework, and so on—generating higher achievement. We cannot observe these parental activities, so they could get falsely attributed to teachers in upper tracks.</p>
<p>The net effect is unclear. The curriculum-test misalignment places upper-track teachers at a <em>disadvantage</em>, but this might be offset by the <em>advantage</em> of having students who are likely to make achievement gains for reasons having nothing to do with the teacher. Below, we report results of data analyses that shed more light on the issues that tracking creates for value-added measures.</p>
<h4>Analyses of Florida secondary schools</h4>
<p>We estimated teacher value-added ignoring students’ tracks and courses, as is typically done, and then we re-estimated with track/course effects.<a class="fn-ref-mark" href="#footnote-13" id="refmark-13"><sup>[13]</sup></a> In middle schools, our estimates suggest that for a teacher with all lower track courses, ignoring tracks would reduce measured value-added from the 50th to the 30th percentile. Only about 25-50 percent of teachers remain in the same performance quartile when we add information about the tracks.</p>
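<p>A stylized version of this with-and-without-tracks comparison appears below. Treating teacher and track as categorical fixed effects in a student-level regression is one simple way to &#8220;account for tracks&#8221;; our actual estimation is richer, and the column names here are hypothetical.</p>
<pre><code>import statsmodels.formula.api as smf

def teacher_effects(students, with_tracks=False):
    """Estimate teacher effects on current scores, controlling for prior
    achievement, with or without track indicators. Returns the fitted
    teacher coefficients (relative to an omitted reference teacher)."""
    formula = "score ~ prior_score + C(teacher_id)"
    if with_tracks:
        formula += " + C(track)"
    fit = smf.ols(formula, data=students).fit()
    return fit.params[fit.params.index.str.startswith("C(teacher_id)")]
</code></pre>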
<div class="pull-right">Teachers had higher value-added when they taught the upper-track classes, compared with the same teachers teaching lower tracks.</div>
<p>One might wonder whether these effects exist because more effective teachers end up in upper-track courses. We addressed this possibility by analyzing teachers who taught both lower- <em>and</em> upper-track courses and comparing value-added in each course type for the same teacher.<a class="fn-ref-mark" href="#footnote-14" id="refmark-14"><sup>[14]</sup></a> Teachers had higher value-added when they taught the upper-track classes, compared with the same teachers teaching lower tracks. These results could actually understate the role of tracks because the information available about tracks might not always be accurate.<a class="fn-ref-mark" href="#footnote-15" id="refmark-15"><sup>[15]</sup></a> For this report, we extended the analysis to Florida high schools, where a similar share of teachers, 33 to 45 percent, would be placed in the wrong performance group if tracks were ignored.</p>
<h4>Analyses from North Carolina and end-of-course exams</h4>
<p>If tracking is a problem in estimating value-added, we would expect the variation in high school teacher value-added to drop when we account for tracks. That is, when we ignore tracks and courses, some teachers end up with value-added that is too high because they teach many upper-track courses, and vice versa in the lower track. So accounting for tracks pulls these teachers back to the middle and reduces variation.</p>
<p>A recent report using data from North Carolina confirms this.<a class="fn-ref-mark" href="#footnote-16" id="refmark-16"><sup>[16]</sup></a> The variation in teacher value-added in high school is 33 percent lower when adding track coefficients. That is, some teachers have extremely low value-added simply because they teach more lower-track courses, and other teachers have high value-added because they teach upper-track courses. This does not prove that tracking is the problem, but the evidence is consistent with that interpretation.</p>
<p>The study goes further and considers how well current value-added predicts future value-added across grade levels. Several researchers have argued that this predictive validity is an important sign of the measure’s validity for making long-term employment decisions. We would hope, for example, that the measures used to make teacher tenure decisions are good predictors of how teachers will perform in the years after they receive tenure. The North Carolina study finds that the course-adjusted value-added measure is still a worse predictor of future value-added in high schools than in elementary schools.<a class="fn-ref-mark" href="#footnote-17" id="refmark-17"><sup>[17]</sup></a> This suggests that adjusting value-added measures in this way does not eliminate the concern that tracking reduces validity, and/or that there might be other problems in estimating high school value-added.<a class="fn-ref-mark" href="#footnote-18" id="refmark-18"><sup>[18]</sup></a></p>
<p>The differences in state testing regimes are also noteworthy. In theory, high school value-added measures should have higher validity in North Carolina because that state uses end-of-course exams, which should be better aligned to the curriculum than the single generic subject test in Florida. However, recall that the advantage for higher-track teachers from omitting tracks may offset the disadvantage to those same teachers from test misalignment (e.g., test ceilings). Paradoxically, this means better test alignment could actually make the validity or selection bias problem worse because one no longer offsets the other. This would give the upper-track teachers an even greater advantage. While it is in some ways helpful that the two problems cancel out, it places us in the awkward position of having to rely on one mistake to fix the other. As an analogy, it is like a golfer who accidentally aims too far to the left but still hits the ball in the fairway because of a slice to the right—the problems cancel out. In this case, fixing only the slice and not the aim would put the ball to the right of the fairway, making matters worse.</p>
<p>The use of end-of-course exams also raises the question of how well prior test scores capture students’ relevant prior achievement. The purpose of accounting for prior scores is that they can tell us where students started at the beginning of the school year, but the content of prior courses is so different in high school that it’s unclear how informative the prior score really is. For example, few students have learned anything about physics before they take a physics course. However, accounting for prior math, science, and other scores is still important because those scores adjust for general cognitive and study skills that also influence subsequent scores.<a class="fn-ref-mark" href="#footnote-19" id="refmark-19"><sup>[19]</sup></a></p>
<p>The ability of prior courses to account for sorting across grade levels is therefore unclear, but there are good reasons to think that having good alignment between this year’s test and this year’s content is more important than having good alignment between this year’s test and last year’s test. As further evidence of this, we estimated value-added to math scores in middle schools controlling only for prior <em>reading</em> scores—prior math scores were ignored.  We then compared these new value-added estimates with the more typical ones where prior math is accounted for. The correlation between the two is high at 0.84.<a class="fn-ref-mark" href="#footnote-20" id="refmark-20"><sup>[20]</sup></a></p>
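<p>The comparison in the preceding paragraph reduces to a simple correlation between two vectors of teacher estimates. The one-liner below shows the calculation; the inputs are assumed to be value-added estimates for the same teachers under the two specifications.</p>
<pre><code>import numpy as np

def specification_correlation(va_spec_a, va_spec_b):
    """Pearson correlation between teacher value-added from two model
    specifications, e.g., controlling for prior reading only versus prior
    math and reading (0.84 in the Florida analysis described above)."""
    return float(np.corrcoef(va_spec_a, va_spec_b)[0, 1])
</code></pre>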
<h4>Other evidence and summary about validity issues</h4>
<p>The well-known Measures of Effective Teaching (MET) project funded by the Gates Foundation reports results from experiments that also address the validity of value-added at the middle and high school levels.  The researchers randomly assigned teachers to classrooms in middle school as well as in 9th grade. However, there were apparently no data about the tracks teachers taught or whether random assignment occurred only within tracks. Given the directions provided to principals, it seems likely that most assignments were within a track, but we cannot know for sure because tracking data were generally not available in MET. As a result, this study is not informative about the role of tracks in value-added estimation.<a class="fn-ref-mark" href="#footnote-21" id="refmark-21"><sup>[21]</sup></a></p>
<div class="pull-left">Ignoring tracks will reduce validity substantially in middle and high school, and even accounting for tracks may not solve the problem.</div>
<p>Overall, the evidence from the above studies suggests that ignoring tracks will reduce validity substantially in middle and high school, and even accounting for tracks may not solve the problem. This also reinforces the general problem of comparing teacher performance in different instructional contexts.<a class="fn-ref-mark" href="#footnote-22" id="refmark-22"><sup>[22]</sup></a></p>
<h4>Reliability of value-added measures across grade levels</h4>
<p>There may be trade-offs between validity and reliability in evaluating value-added measures.<a class="fn-ref-mark" href="#footnote-23" id="refmark-23"><sup>[23]</sup></a> Below, we consider the reliability of value-added by grade and then illustrate those trade-offs.</p>
<p>There are many sources of random error in value-added estimates: standardized tests have measurement error, some students are sick at test-taking time, and the students assigned to teachers in any given year vary in essentially random ways. This helps to explain why teacher value-added measures are somewhat unstable over time. It also explains why researchers and value-added vendors typically report confidence intervals for value-added measures that help quantify the role of random error and the uncertainty this creates about teachers’ &#8220;true&#8221; value-added.<a class="fn-ref-mark" href="#footnote-24" id="refmark-24"><sup>[24]</sup></a></p>
<p>One of the key factors affecting confidence intervals is the sample size—the larger the number of students assigned to each teacher, the smaller the confidence interval. The fact that each elementary student is assigned to only one teacher means that we can probably attribute that student’s learning to that teacher, but the trade-off is that elementary teachers have fewer students assigned to them, and this will tend to reduce reliability.</p>
<p>The larger number of students per teacher at the secondary level does not necessarily mean, however, that reliability is better. This is because reliability depends on error variance <em>relative</em> to the variance in the value-added estimates.<a class="fn-ref-mark" href="#footnote-25" id="refmark-25"><sup>[25]</sup></a> To take a sports analogy, suppose that we had very precise estimates of the performance of ten baseball players, but that every player was almost equally effective and therefore had almost identical batting averages. In this situation, the variance in true performance is very small, so even very precise estimates of batting averages (after lots of games) will make it hard to distinguish the best from the worst players—the estimates will be unreliable even after each player has hundreds of at-bats. Conversely, if half the players had high batting averages and the other half had no hits at all, then we could reliably identify the low-performers after a week’s worth of games. The confidence intervals would be wide in that case, but it wouldn’t matter because the differences in true performance are so large.</p>
<p>In this case, having more students reduces random error for middle school teachers, but, as the baseball analogy suggests, this does not necessarily increase reliability. Estimates by Daniel McCaffrey using MET project data show that there is almost no relationship between grade level and reliability; if anything, reliability may be worse in the higher grades.</p>
<p>Why might that be? Is there greater variance in teacher effectiveness at the elementary level? Is random error lower at the elementary level? Or both? One plausible explanation is that middle school teachers teach a wider range of students in a given year than elementary teachers do. This is plausible because our calculations in Florida suggest that most teachers work in multiple tracks. If value-added estimates do not fully account for unobservable differences in students, those differences will inflate the apparent variance in teacher effects more where each teacher&#8217;s student mix is narrower, so we would expect to see this pattern: greater variance in teacher value-added at the elementary level, perhaps because of biased estimates.<a class="fn-ref-mark" href="#footnote-26" id="refmark-26"><sup>[26]</sup></a> Differences in random error could also explain lower reliability in middle schools: if the reliability of the tests is lower there, that would offset the advantage of having more students.</p>
<p>The calculations by McCaffrey provide some support for both interpretations. Compared with the elementary schools in his sample, the variance in teacher value-added is lower in middle schools and random error is higher. So, the advantage of having more students per teacher is offset by other factors that reduce reliability in middle school.</p>
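<p>A small worked example, using hypothetical variance figures, makes these offsetting forces concrete: a fivefold increase in students per teacher can leave reliability no better if the variance in true teacher effects is smaller and the per-student error is larger.</p>
<pre><code># Hypothetical illustration: more students per teacher need not raise
# reliability if true variance falls and per-student error rises.

def reliability(true_var, per_student_error_var, n_students):
    error_var = per_student_error_var / n_students  # error shrinks with n
    return true_var / (true_var + error_var)

# Elementary: fewer students, larger true variance, less noisy tests (assumed).
print(round(reliability(0.040, 0.25, 22), 2))   # 0.78
# Middle school: five times the students, smaller true variance, noisier tests.
print(round(reliability(0.015, 0.50, 110), 2))  # 0.77
</code></pre>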
<h3>What More Needs To Be Known On This Issue?</h3>
<div class="pull-right">The advantage of having more students per teacher is offset by other factors that reduce reliability in middle school.</div>
<p>The above evidence strongly suggests that accounting for course tracks is important for obtaining valid value-added estimates in middle and high school, but we do not yet know how well this solves the problem. We estimate value-added by accounting for prior achievement, but a key implication of our tracking argument is that prior achievement is affected by prior tracks. This creates a complex role for tracks that might not be easily captured by simply adding track variables to the value-added model.<a class="fn-ref-mark" href="#footnote-27" id="refmark-27"><sup>[27]</sup></a> We therefore cannot presume that accounting for the tracks is sufficient, and the North Carolina study reinforces this conclusion.<a class="fn-ref-mark" href="#footnote-28" id="refmark-28"><sup>[28]</sup></a></p>
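<p>To make the modeling issue concrete, here is a minimal sketch, with simulated data and hypothetical variable names, of the simple approach described above: adding track indicator variables alongside prior achievement in a value-added regression. As just noted, this simple fix may not be sufficient, because prior achievement is itself shaped by prior tracks.</p>
<pre><code># A sketch of a value-added regression with track indicators (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
n = 500
prior_score = rng.normal(size=n)        # prior achievement
track = rng.integers(0, 3, size=n)      # 0 = lower, 1 = middle, 2 = upper track
teacher = rng.integers(0, 20, size=n)   # 20 hypothetical teachers

# Simulate current scores from prior achievement, a track effect, and teacher effects.
true_teacher_effect = rng.normal(scale=0.2, size=20)
score = (0.8 * prior_score + 0.3 * track
         + true_teacher_effect[teacher] + rng.normal(scale=0.5, size=n))

# Design matrix: intercept, prior score, track dummies, teacher dummies.
columns = [np.ones(n), prior_score,
           (track == 1).astype(float), (track == 2).astype(float)]
columns += [(teacher == j).astype(float) for j in range(1, 20)]
X = np.column_stack(columns)

beta, *_ = np.linalg.lstsq(X, score, rcond=None)
value_added = beta[4:]  # estimated teacher effects, relative to teacher 0
</code></pre>
<p>Note that the teacher effects here are identified relative to an arbitrary reference teacher, and the track effects are identified only because some teachers&#8217; students span multiple tracks, which echoes the point in the footnotes about teachers who work in more than one track.</p>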
<p>It would also be useful to know how this issue affects teachers in schools that use non-standard courses. We focused on algebra and geometry and excluded courses like &#8220;Liberal Arts Mathematics&#8221; and &#8220;Applied Mathematics&#8221; that also showed up in the data. These courses are likely to align even less well with the tested content because, by definition, they are courses that are outside the norm. The teachers of these courses could be at a significant disadvantage in their performance ratings.</p>
<p>In addition, while we have learned a great deal about elementary teacher value-added from experiments and simulation evidence, we need to apply those same methods to address the particular threats to validity in middle and high schools. The MET project provides some experimental evidence in middle school, although simulation and other tests have been limited to elementary grades.</p>
<h3>What Can&#8217;t Be Resolved By Empirical Evidence On This Issue?</h3>
<p>While the evidence described here provides a sense of the empirical problems that arise across grades and subjects, there is a larger question about how well the tests capture what we want students to learn and be able to do—and how this varies across grades. For example, creativity might be a skill that could be developed more easily in early grades, but creativity is hard to measure. So, in that case, the validity problems in early grades would be even worse than in later grades. The statistical issues are therefore intertwined with the philosophical ones about what we want students to learn.</p>
<div class="summery">
<h3>How, And Under What Circumstances, Does This Issue Impact The Decisions And Actions That Districts Make On Teacher Evaluations?</h3>
<p>This evidence also informs the use of value-added measures in (potentially) high-stakes decisions. For the sake of simplicity and perceived fairness, it is desirable to have a common standard that applies to all grades, or really to all teachers. However, if validity and reliability vary, not to mention how well test scores align with the desired goals, then treating teachers <em>equitably</em> may require using value-added <em>unequally</em> across grades. As I have written elsewhere, the stakes attached to any measure should be proportional to the measure’s validity and reliability.<a class="fn-ref-mark" href="#footnote-29" id="refmark-29"><sup>[29]</sup></a> It appears that we may not be able to follow that rule and simultaneously use value-added the same way for all teachers, especially across elementary and secondary grades.</p>
<p>Given that the properties of value-added measures differ across grades and subjects, policymakers should consider using different methods for calculating and using value-added in different grades and subjects. In particular, in middle and high school, it is essential to account for the tracks and courses that teachers are assigned to when calculating value-added.</p>
<p>Since value-added seems to work differently across grades, a natural question arises: how do we handle teachers who teach multiple grades? Fundamentally, the issues raised here do not change the answer. Comparisons across grades have always been complicated by the fact that the tests differ across grades, and the various approaches to combining them involve some sort of weighted average, or composite, that takes into account differences in the test scale across grades.<a class="fn-ref-mark" href="#footnote-30" id="refmark-30"><sup>[30]</sup></a> That basic solution is also reasonable for handling the additional complication of tracking. The key is first to get the estimation right at each grade level, perhaps by accounting for tracks; that is, we have to get the estimates right for each track and grade level before creating composite value-added measures for each teacher.<a class="fn-ref-mark" href="#footnote-31" id="refmark-31"><sup>[31]</sup></a></p>
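<p>As an illustration of such a composite, the sketch below uses precision weighting, one reasonable weighting scheme and a close cousin of weighting by the number of students taught in each grade; the estimates and standard errors are hypothetical.</p>
<pre><code># A precision-weighted composite of per-grade value-added estimates.
# Estimates and standard errors are hypothetical.

def composite(estimates, standard_errors):
    weights = [1.0 / se**2 for se in standard_errors]  # precision weights
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, estimates)) / total

# Grade 7 estimate: 0.10 (se 0.05); grade 8 estimate: 0.02 (se 0.10).
print(round(composite([0.10, 0.02], [0.05, 0.10]), 3))  # 0.084, pulled toward the precise estimate
</code></pre>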
<p>It might also be tempting to reduce tracking, or to assign each teacher an equal mix of low- and high-track courses, in order to accommodate value-added measures more easily. But this is the proverbial &#8220;tail wagging the dog&#8221; problem: changes in school organization and instruction should be made with caution and attention to effective instructional practice, not so that we can have better value-added measures.<a class="fn-ref-mark" href="#footnote-32" id="refmark-32"><sup>[32]</sup></a></p>
<p>The implications of tracking are missed in the vast majority of value-added estimates now being used. This means that, even setting aside other issues with the measures, current standard value-added measures for teachers who concentrate their work in particular tracks in middle and high schools will suffer from validity concerns. As with many of the problems with value-added, this one can be addressed with better data collection efforts and careful attention to how the measures are created. Accounting for tracks would almost certainly improve the measures, but future research will be required to determine how well this solution works in practice.</p>
</div>
<h2 class="refbtn">References +</h2>
<div id="footnote-list" style="display:none;"><span id=fn-heading></span>
<ol>
<li id="footnote-1" class="fn-text">A third factor is that a somewhat narrower range of academic content is covered in elementary schools, so there may be better alignment between the curriculum and the test.<a href="#refmark-1"></a></li>
<li id="footnote-2" class="fn-text">I am referring to the new assessments from the <a href="http://www.smarterbalanced.org/smarter-balanced-assessments" target="_blank">Smarter Balanced Assessment Consortium</a> and the <a href="http://www.parcconline.org/about-parcc" target="_blank">Partnership for Assessment of Readiness for College and Careers</a> (PARCC).  All but three states have committed to adopting one of these two tests. Both systems plan to test annually through high school.
<p>See: Educational Testing Service, <em><a href="http://www.k12center.org/rsc/pdf/Assessments_for_the_Common_Core_Standards.pdf" target="_blank">Coming Together to Raise Achievement: New Assessments for the Common Core State Standards</a></em>, 2012. Retrieved March 13, 2013<a href="#refmark-2"></a></li>
<li id="footnote-3" class="fn-text">There are two types of evidence on this. In a study that is now somewhat dated, David Monk (1987) asked principals in a sample of schools about how they create classrooms. Alternatively, the assignment process can be studied indirectly using statistical tests of measureable student characteristics. This evidence is discussed in the next paragraph.
<p>See: David H. Monk, &#8220;Secondary school size and curriculum comprehensiveness,&#8221; <em><a href="http://www.sciencedirect.com/science/article/pii/0272775787900471" target="_blank">Economics of Education Review</a></em> 6(2) (1987): 137-150.<a href="#refmark-3"></a></li>
<li id="footnote-4" class="fn-text">Yet another possibility is that principals alternate students so that those with a less effective teacher in one year might be assigned to a more effective teacher the next year.<a href="#refmark-4"></a></li>
<li id="footnote-5" class="fn-text">Two studies have randomly assigned students to classrooms to test the validity of value-added. The MET recent study and a similar earlier experiment (Kane &amp; Staiger, 2008) suggest that accounting for prior achievement is sufficient to establish internal validity. This is a very valuable study, though it does have several limitations. First, the study can only account for selection bias within schools. Comparisons across school contexts are considered more problematic (Reardon &amp; Raudenbush, 2009). Second, the results apply only to a fairly narrow sample of schools that complied with the rules of the experiments. The same concerns arise with the earlier version of this experiment in Los Angeles.
<p>See: Thomas J. Kane, and Douglas O. Staiger, &#8220;<a href="http://www.dartmouth.edu/~dstaiger/wp.html" target="_blank">Estimating teacher impacts on student achievement: An experimental evaluation</a>,&#8221; (National Bureau of Economic Research, No. w14607, 2008).</p>
<p>Sean F. Reardon and Stephen W. Raudenbush, &#8220;<a href="http://cepa.stanford.edu/sean-reardon/publications" target="_blank">Assumptions of value-added models for estimating school effects</a>,&#8221; <em>Education Finance and Policy</em> 4 (4) (2009): 492-519.<a href="#refmark-5"></a></li>
<li id="footnote-6" class="fn-text">Cassandra M. Guarino, Mark D. Reckase, and Jeffrey M. Wooldridge, &#8220;<a href="http://education.msu.edu/epc/publications/" target="_blank">Can value-added measures of teacher performance be trusted?</a>&#8221; (Paper presented at the annual meeting of the Association for Education Finance and Policy, Seattle, WA, March 24-26, 2011).<a href="#refmark-6"></a></li>
<li id="footnote-7" class="fn-text">While Rothstein’s (2010) initial tests suggested that students are not approximately randomly assigned, subsequent studies have suggested that they generally are (Koedel &amp; Betts, 2009; Chetty, Friedman, &amp; Rockoff, 2012; Chaplin &amp; Goldhaber, 2012).
<p>See: Jesse Rothstein, &#8220;Teacher quality in educational production: Tracking, decay, and student achievement,&#8221; <em>The Quarterly Journal of Economics</em> 125(1) (2010): 175-214.</p>
<p>Dan Goldhaber and Duncan Chaplin, &#8220;<a href="http://www.cedr.us/publications.html" target="_blank">Assessing the Rothstein Falsification Test: Does It Really Show Teacher Value-Added Models Are Biased,</a>&#8221; (CEDR Working Paper 2011-5, 2011).</p>
<p>Raj Chetty, John N. Friedman, and Jonah E. Rockoff, &#8220;<a href="http://www.nber.org/papers/w17699" target="_blank">The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood</a>,&#8221; (National Bureau of Economic Research, No. w17699., 2011).</p>
<p>Cory Koedel and Julian R. Betts, &#8220;Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique,&#8221; <em>Education Finance and Policy</em> 6(1) (2011): 18-42.<a href="#refmark-7"></a></li>
<li id="footnote-8" class="fn-text">This is a fairly complex issue, and there is not yet agreement among researchers about how well this type of experiment tests for bias. However, from a technical standpoint, note that researchers are typically trying to determine whether an estimate of a single parameter is valid, but in this case we are trying to test whether thousands of them (teacher value-added estimates) are valid.<a href="#refmark-8"></a></li>
<li id="footnote-9" class="fn-text">This is actually an under-estimate because the number of tracks is limited and there is considerable variation in scores within tracks. Also, I do not have access to course grades and this, too, influences tracks. It is difficult to make the same calculations in elementary schools because the tracks (if any) are not explicit. <a href="#refmark-9"></a></li>
<li id="footnote-10" class="fn-text">One of the nation’s leading test publishers, Pearson, describes this alignment between the proficiency standards and the test as one of the &#8220;Fundamentals of Standardized Testing.&#8221;
<p>See: Sasha Zucker,  &#8220;<a href="http://www.pearsonassessments.com/hai/SearchResults.aspx?cx=011382766870466991018%3Ageowfzwxvog&amp;cof=FORID%3A11&amp;ie=UTF-8&amp;q=sasha%20&amp;as_sitesearch=www.PearsonAssessments.com" target="_blank">Fundamentals of standardized testing</a>,&#8221; (Pearson, Inc., 2003).<a href="#refmark-10"></a></li>
<li id="footnote-11" class="fn-text">Peterson &amp; Hess (2008) show that state proficiency bars are set low relative to the National Assessment of Educational Progress (NAEP) though this varies considerably by state.
<p>See: Paul E. Peterson and Frederick M. Hess, &#8220;Few states set world-class standards.&#8221; <em>Education Next</em> 8(3) (2008): 70-73.<a href="#refmark-11"></a></li>
<li id="footnote-12" class="fn-text">Cory Koedel and Julian Betts, &#8220;Value added to what? How a ceiling in the testing instrument influences value-added estimation,&#8221; <em>Education Finance and Policy</em> 5(1) (2010): 54-81.<a href="#refmark-12"></a></li>
<li id="footnote-13" class="fn-text">Douglas N. Harris and Andrew Anderson &#8220;The Difficulty of Worker Monitoring in the Service Sector: A Formal Sorting <span class="explanatory-dictionary-highlight" data-definition="explanatory-dictionary-definition-14">Model</span> and Empirical Evidence of the Value-Added of Middle School Students and Teachers,&#8221; (Paper presented at the Association for Public Policy and Management, 2012).<a href="#refmark-13"></a></li>
<li id="footnote-14" class="fn-text">The fact that teachers in upper-track courses are at an overall advantage suggests that the student selection bias is larger than the influence of curriculum-test misalignment. We confirmed this by taking additional steps in the value-added estimation to account for student sorting and therefore isolate the curriculum-test misalignment. When we did this, the advantage of teaching upper-track courses turned into a disadvantage, just as we predicted. <a href="#refmark-14"></a></li>
<li id="footnote-15" class="fn-text">This is an example of the standard &#8220;attenuation bias&#8221; problem. Whenever an independent variable in a regression is measured with error, the role of that variable in explaining the dependent variable (in this case, student test scores) is under-estimated.<a href="#refmark-15"></a></li>
<li id="footnote-16" class="fn-text">C. Kirabo Jackson. Teacher Quality at the High-School Level: The Importance of
<p>Accounting for Tracks. Working Paper (2012).<a href="#refmark-16"></a></li>
<li id="footnote-17" class="fn-text">The Jackson (2012) study also focuses on within-teacher estimation to avoid conflating the results with teacher sorting.<a href="#refmark-17"></a></li>
<li id="footnote-18" class="fn-text">One additional potential problem is that courses are often split up during the school year, so the same student might have two different math teachers. This not only reduces reliability but calls into question validity since the pairings of teachers are unlikely to be even approximately random.<a href="#refmark-18"></a></li>
<li id="footnote-19" class="fn-text">This may also be true in elementary school. Even though we think of the content in one year building on the prior year, the learning process is not really so linear.<a href="#refmark-19"></a></li>
<li id="footnote-20" class="fn-text">This is a Spearman rank order correlation.<a href="#refmark-20"></a></li>
<li id="footnote-21" class="fn-text">We thank Dan McCaffrey for a useful conversation about the potential role of tracking at the secondary level in the MET study.<a href="#refmark-21"></a></li>
<li id="footnote-22" class="fn-text">Raudenbush and Reardon, 2009, ibid.<a href="#refmark-22"></a></li>
<li id="footnote-23" class="fn-text">See: Douglas N. Harris, <em>Value-Added Measures in Education: What Every Educator Needs to Know.</em> (Cambridge, MA, Harvard Education Press, 2011).<a href="#refmark-23"></a></li>
<li id="footnote-24" class="fn-text">See Douglas N. Harris, 2011, ibid. More formally, the confidence intervals that some value-added reports are based on the notion that the students assigned to teachers are a sample from the population of students who could have potentially been in those classes. Measurement error in the tests themselves though this is often not accounted for in the confidence intervals.<a href="#refmark-24"></a></li>
<li id="footnote-25" class="fn-text">Specifically, reliability is the ratio of the standard deviation in the estimates to the standard error of the estimate, which is the basis for the confidence interval.<a href="#refmark-25"></a></li>
<li id="footnote-26" class="fn-text">It is also possible that the true variance in teacher performance is greater at the elementary school level, but there is no reason to expect this and we find it unlikely.<a href="#refmark-26"></a></li>
<li id="footnote-27" class="fn-text">Harris &amp; Anderson, 2012, ibid.<a href="#refmark-27"></a></li>
<li id="footnote-28" class="fn-text">It is also worth noting, however, that some aspects of the curriculum-test alignment is partly within the control of teachers and therefore perhaps something that should be attributed to teachers. Control over the curriculum varies across schools and districts, however.<a href="#refmark-28"></a></li>
<li id="footnote-29" class="fn-text">Douglas N. Harris, 2011, ibid.<a href="#refmark-29"></a></li>
<li id="footnote-30" class="fn-text">One approach is to estimate the model separately by grade level and then create a weighted average based on the number of teachers taught in different grades. Another similar approach is to estimate teacher value-added across grades but in a single model that takes grade effects into account.<a href="#refmark-30"></a></li>
<li id="footnote-31" class="fn-text">Whatever approach is used, it is important to identify the track effects from variation in performance by teachers who teach in multiple tracks.  This would not necessarily arise in typical value-added measures. To address this, value-added estimation could be carried out in multiple stages where the effects of tracks would be identified first using only those teachers who teach in multiple tracks then that information would be used in a second stage to estimate value-added for other teachers.<a href="#refmark-31"></a></li>
<li id="footnote-32" class="fn-text">Douglas N. Harris, 2011, ibid.<a href="#refmark-32"></a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.carnegieknowledgenetwork.org/briefs/value-added/grades/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
