Carnegie Knowledge Network Concluding Recommendations
Douglas N. Harris
Daniel F. McCaffrey
Stephen W. Raudenbush
It is common knowledge that teacher quality is a key in-school factor affecting student achievement. While the quality of teaching clearly matters for how much students learn, this quality is challenging to measure. Evaluating teacher quality based on the level of their students’ end-of-year test scores has been one method of assessing teacher quality, but this approach favors those teachers of students who begin the year already at a high academic level. Value-added methodology is an alternative use of annual test score data to assess teacher quality. As opposed to using the level of student achievement on a test at the end of the year to assess a teacher’s effectiveness, value-added methodology seeks to isolate a teacher’s contribution to student growth from other factors that contribute to student achievement, such as prior achievement or socio-economic status. Methodologists generally agree that value-added estimates can serve as a gauge of some dimensions of teacher effectiveness.
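The contrast between judging teachers by score levels and by value added can be sketched in a few lines. The example below is a minimal illustration, not any state's or vendor's model: it assumes a single prior-year score as the only adjustment, fits one least-squares line across all students, and treats a teacher's value added as the average amount by which her students beat or miss their predicted scores. The function names and the data are invented for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def end_of_year_level(records):
    """Average end-of-year score per teacher (the 'level' approach)."""
    scores = {}
    for teacher, _prior, end in records:
        scores.setdefault(teacher, []).append(end)
    return {t: mean(v) for t, v in scores.items()}

def value_added(records):
    """Average residual per teacher after adjusting for prior score.

    records: list of (teacher, prior_score, end_score) tuples.
    Assumption: prior score is the only adjustment variable.
    """
    intercept, slope = fit_line([r[1] for r in records],
                                [r[2] for r in records])
    residuals = {}
    for teacher, prior, end in records:
        expected = intercept + slope * prior
        residuals.setdefault(teacher, []).append(end - expected)
    return {t: mean(v) for t, v in residuals.items()}

# Invented data: teacher A's students start high, B's start low.
records = [
    ("A", 70, 78), ("A", 80, 88), ("A", 90, 98),
    ("B", 40, 52), ("B", 50, 62), ("B", 60, 72),
    ("C", 55, 58), ("C", 65, 68), ("C", 75, 78),
]
levels = end_of_year_level(records)
va = value_added(records)
```

In this invented data, teacher A has the highest end-of-year level while teacher B has the highest adjusted estimate, which is the distinction the paragraph above draws. Operational models adjust for more factors and handle measurement error, so this sketch shows the idea only.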
Yet the experts also agree that value-added measures have significant limitations. Most clearly, they do not capture a teacher’s entire contribution to student learning because they are based solely on test performance, itself an imprecise signal of teacher effectiveness. There is disagreement among experts on the extent to which value added is biased. For instance, there is room for disagreement about whether or how much statistical models can or should account for factors such as poverty, or how teachers can be compared across different classroom contexts. The comparability of value added is greater among teachers working in similar contexts, which is true of all forms of performance evaluation. Moreover, value-added measures are likely to vary according to the tests used to compute them. Because tests cover different content, they reflect different areas of student knowledge. Value added measures what the tests measure, and because these tests capture only a slice of what students are learning, value added reflects only that slice. Teachers could be effective or ineffective on outcomes not captured by students’ scores on these tests.
As important as it is to judge value-added measures on their validity and reliability, it is more relevant to know to what extent they can benefit schools and students if used in practice. Largely because value-added policies are so new, we know very little about how they work in practice. Thus there remains considerable debate about how value-added measures should be used to inform personnel policies, if they are to be used at all. As the real impacts of reforms using value added emerge and as researchers assess these effects, we will learn more about the consequences of different uses of value added. Until then, the CKN briefs have given us a rich picture of current research on the use of value-added measures for teacher evaluation. Reviewing the briefs and their implications for practitioners, we arrived at the following recommendations:
1. When using value added, allow educational leaders to make judgments in interpreting the value-added results in light of other available measures of teacher quality and the principals’ own assessments.
Studies provide evidence that value-added measures meaningfully distinguish between teachers whose future students will consistently perform well and teachers whose students will not. However, the studies also highlight the shortcomings of value added, among them instability in measures across time; systematically lower values for teachers of some types of students; and difficulty in comparing teachers across schools. These issues are potential problems for any form of evaluation, including teacher observation. The principal and other school leaders are often well situated to evaluate the many circumstances in which quantitative measures may give an incomplete picture. For instance, school leaders can see a teacher’s indirect contributions to learning through her overall work in the school, or her contribution to student outcomes not measured by achievement tests. Educational leaders likewise can make allowances for unexpected events in a teacher’s environment, such as a sudden change in assignment, that cannot be captured by value-added models. Human judgment also may reduce teachers’ concerns about being assessed only by test scores or being subjected to statistical procedures they do not trust or understand.
2. Use value added with other measures that are valid and have variation.
Because education and teacher evaluations have multiple goals, assessing progress toward these goals will require multiple measures. Moreover, evaluation programs that combine value added with other measures have, at least in some instances, led to instructional improvement. While ultimately all measures should be evaluated on the same set of standards, each measure will have strengths and weaknesses. Using the measures together helps reduce errors and provides a sense of balance that may make them more acceptable to those being evaluated. When multiple measures are included in an evaluation system, all of the measures must vary across teachers in order to matter at all for the evaluation. A measure that rates all teachers at the same level will carry no weight when it is combined with other measures to classify teacher performance. For example, suppose that a school system uses both observation and value-added scores to determine teacher effectiveness. Then suppose that all of the teachers in the school system are rated proficient according to observations by their principals. The only way a teacher can be deemed “in need of improvement” is if she has a low value-added score. The only way she can be classified as “exemplary” is if she has a high value-added score. Even though the system uses two measures, only the value-added measure differentiates teachers on the summative rating.
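The observation example above can be checked with a small sketch. It assumes a hypothetical composite that is a weighted sum of an observation rating and a value-added score; the weights, teacher labels, and numbers are invented. When every teacher receives the same observation rating, the composite ranking is identical to the value-added ranking, whatever weight the observation carries.

```python
from statistics import pvariance

def composite(scores, weights):
    """Weighted sum of a teacher's measures (hypothetical scoring rule)."""
    return sum(weights[m] * scores[m] for m in weights)

def rank(teachers, key):
    """Teacher names ordered from highest to lowest on the given key."""
    return sorted(teachers, key=key, reverse=True)

# Invented data: every teacher is rated 3.0 ("proficient") on observation.
teachers = {
    "T1": {"observation": 3.0, "value_added": 1.2},
    "T2": {"observation": 3.0, "value_added": -0.8},
    "T3": {"observation": 3.0, "value_added": 0.1},
}
weights = {"observation": 0.5, "value_added": 0.5}

# Measures with zero variance cannot change anyone's relative standing.
flat_measures = [
    m for m in weights
    if pvariance([t[m] for t in teachers.values()]) == 0
]

by_composite = rank(teachers, lambda n: composite(teachers[n], weights))
by_value_added = rank(teachers, lambda n: teachers[n]["value_added"])
```

Here `flat_measures` contains only the observation rating, and the two rankings match. A district could run a similar variance check on each measure before relying on a combined rating.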
Multiple measures don’t necessarily have to be combined into a single rating scale; for instance, the value-added estimate could be used mainly to identify candidates for more intensive observation. Schools would then measure the performance of the identified teachers with more intensive measures that can provide a more accurate assessment of teacher performance but are too costly to collect for all teachers. Multiple measures may also take on many forms. Along with observations, scores on student learning objectives, and survey responses, they might also include the value that teachers bring to student outcomes other than achievement test scores. These outcomes might include progressing through courses, graduating from high school, even attending or completing college. If carefully developed and thoroughly evaluated, such long-term measures could help give schools and districts a richer picture of teacher effectiveness.
The decision to use value-added measures depends not on whether they are perfect—all measures are flawed. Rather, it depends on what other measures of effectiveness are available or feasible and the quality of those measures. If principals are skilled, have goals aligned with district or state goals, and have adequate time for evaluation, then they may be a better source of information on teacher effectiveness than value-added measures because they can observe more than test scores and see teachers throughout the course of the year. However, not all schools have principals who can do this effectively.
3. Choose a test that measures knowledge that is valued.
Value added is sensitive to whatever test students take; different tests lead to different rankings of teacher effectiveness. If we want value-added scores to yield information about teaching beyond what is currently measured by standardized tests, we must use a test that measures knowledge we value. Schools today focus largely on the content measured by standardized accountability tests. Tests used to determine value added should give teachers incentive to teach knowledge that is valued.
4. Consider differences in teaching contexts when using value added to compare teachers.
Standard methods to statistically adjust for students’ prior achievement can produce unbiased (or nearly unbiased) estimates of teacher value added. However, the research supporting this finding is based on studies in which teachers were randomly assigned to students within the same school, mainly within elementary schools, while most value-added systems compare teachers in different schools.
The distinction between within-school and between-school comparisons is an important one, because teachers within the same school share the same organizational conditions (leadership and resources); are subject to similar contextual factors (neighborhood safety, parental support, norms that favor academic achievement); and, particularly in elementary school, they tend to teach students with similar levels of prior achievement. In a heterogeneous district, schools can vary widely by all these factors. If well-managed schools foster better teaching and learning than do poorly-managed schools, then teachers in well-managed schools will outperform equally able teachers in poorly-managed schools. If some contextual conditions promote more student learning than do other conditions, teachers in schools with more favorable conditions will outperform equally able teachers in schools with less favorable conditions. Our confidence in value-added estimations of teachers will increase with the degree of similarity between the classrooms of the teachers being compared. Moreover, most tests are not designed to compare the learning gains of students who start at very different levels of achievement.
For these reasons, when education leaders are making consequential personnel decisions informed by value added, they would be wise to take into account the different contexts in which teachers work. They should also look at value-added scores across the district to detect general patterns in the distribution of teacher effectiveness. This practice serves as a check on the equitability of hiring and placement policies.
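One simple way to look at both questions, sketched here under strong assumptions, is to compute each school's average value-added score (a district-wide pattern) and each teacher's deviation from her own school's average (a within-school comparison). The data and names are invented. Note the limitation: comparing teachers only to school colleagues assumes away any real difference in average teacher quality between schools, so this is an aid to judgment rather than a correction.

```python
from statistics import mean

def school_summary(scores):
    """Return (school means, teacher deviations from own school mean).

    scores: list of (school, teacher, value_added) tuples.
    """
    by_school = {}
    for school, _teacher, va in scores:
        by_school.setdefault(school, []).append(va)
    school_means = {s: mean(v) for s, v in by_school.items()}
    within_school = {
        teacher: va - school_means[school]
        for school, teacher, va in scores
    }
    return school_means, within_school

# Invented scores for two schools in one district.
scores = [
    ("North", "T1", 1.0), ("North", "T2", 0.4),
    ("South", "T3", -0.2), ("South", "T4", -0.8),
]
school_means, within_school = school_summary(scores)
```

In this invented data, North's average is well above South's, yet T1 and T3 stand in the same position relative to their own colleagues. Whether the gap between schools reflects teachers, context, or both is exactly what the estimates alone cannot settle.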
5. Take specific steps to ensure the overall credibility of the teacher evaluation system.
Many of the CKN briefs identified threats to the validity of value added as a measure of teacher effectiveness for some teachers. Inaccurate value-added estimates can lead to decisions harmful to both teachers and students. For value-added measures to be useful, they must be subject to the following minimal conditions: the tests on which they are based should reliably capture valued student outcomes; data should accurately identify which students are in which teachers’ classes; and teachers should work with enough students to produce value-added estimates that are relatively free of noise.
School systems must therefore look for patterns that suggest errors and use other data to confirm that teachers who consistently receive high or low value-added scores are truly high or low performers. For example, if all the special education teachers receive low value-added scores, but those scores are belied by classroom observations and student surveys, educational leaders should examine whether there might be problems with one or the other measure. System leaders should clearly document the statistical methods used for calculating value added and take steps to ensure data accuracy. They should provide evidence of that accuracy and of the checks used to ensure the appropriateness of the evaluation models, the checks used to confirm the assumptions made by the models, and the tests of the robustness of the results. When communicating individual teachers’ value-added scores, reports should make clear that the scores are estimates and should not overstate their precision.
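A report can convey that a score is an estimate by attaching an interval to it. The sketch below is one simple way to do so, under stated assumptions: it treats a teacher's student-level residuals as independent, uses the standard error of their mean, and applies a normal approximation (the multiplier 1.96). Operational models compute uncertainty differently, and the data here are invented.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(residuals, z=1.96):
    """Point estimate with an approximate 95% interval.

    residuals: per-student differences between actual and predicted scores.
    Assumption: students are independent and a normal approximation holds.
    """
    n = len(residuals)
    estimate = mean(residuals)
    std_error = stdev(residuals) / sqrt(n)
    return {
        "n": n,
        "estimate": estimate,
        "low": estimate - z * std_error,
        "high": estimate + z * std_error,
    }

# Invented residuals with the same spread but different class sizes.
small_class = [2.0, -2.0, 2.0, -2.0]
large_class = small_class * 4

small = summarize(small_class)
large = summarize(large_class)
small_width = small["high"] - small["low"]
large_width = large["high"] - large["low"]
```

Both teachers have the same point estimate, but the interval for the teacher with four students is more than twice as wide as the interval for the teacher with sixteen. Showing the interval alongside the score makes that difference visible to the reader of the report.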
Evaluation measures that capture meaningful differences in teaching quality are important components of a system that ensures quality learning opportunities for all students. There is great risk of teacher evaluation systems not performing as planned, however. The human resources literature is rife with data and theories on why these performance measurement systems fail to yield the rich information and informed decisions they were intended to provide. Likewise, the literature on performance measurement for public employees offers numerous examples of unintended consequences, including employees gaming the system or trying to improve their performance ratings without improving their actual performance. W. Edwards Deming, the quality improvement expert whose ideas have revolutionized industry, cautioned organizations against using performance measurement for individuals, believing that doing so would lead to capricious behavior, as well as fear that stifles creativity. This warning should be taken seriously. If school leaders are effectively evaluating and motivating teachers using measures other than value-added measures, then value-added measures may provide little benefit for students. If local evaluations are not effective, then value-added measures may be a beneficial tool, at least until schools implement better systems.
Clearly, it is challenging to use evaluation and performance monitoring to improve outcomes. States and school districts will need to take deliberate steps to increase the chances that their systems will work as intended. Because current evaluations by managers tend to be uniformly high, school systems will need to monitor evaluation data for variability in performance ratings and promote and reward accurate evaluations. Systems must make sure that teachers are not narrowing the curriculum or manipulating student assignments to improve their value-added scores. For instance, schools might assess students periodically on tests that are different from the state test but cover the same standards. Schools might also monitor student and teacher assignments to identify unusual changes over time.
School systems should identify and monitor the factors that should not change when value added is used to assess performance by watching for evidence that administrators or teachers are taking unwanted steps to influence their value-added scores. For instance, the proportion of students classified with certain learning disabilities, which is susceptible to manipulation, should not change. School systems should also list the factors that should change with a new evaluation system, such as the number of classroom observations performed or the alignment between the type of professional development chosen and teachers’ needs. They should then monitor these factors for evidence that changes are occurring, and plan for how to respond if the system fails to behave as expected.
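Monitoring a factor that should not change can be as simple as comparing its value before and after the new evaluation system takes effect. The sketch below flags schools where the proportion of students classified with a learning disability moved by more than a chosen tolerance. The tolerance of 0.05, the school names, and the proportions are all invented, and a flag is only a prompt for review, not evidence of manipulation.

```python
def flag_shifts(before, after, tolerance=0.05):
    """Return schools whose proportion changed by more than `tolerance`.

    before, after: dicts mapping school name to a proportion in [0, 1].
    The tolerance is a hypothetical threshold a district would set itself.
    """
    flagged = {}
    for school, old in before.items():
        if school not in after:
            continue  # no follow-up data for this school
        change = after[school] - old
        if abs(change) > tolerance:
            flagged[school] = round(change, 3)
    return flagged

# Invented proportions of students classified, by school.
before = {"North": 0.10, "South": 0.12, "East": 0.11}
after = {"North": 0.11, "South": 0.21, "East": 0.10}

flagged = flag_shifts(before, after)
```

Only South is flagged in this invented data. The same comparison could be applied in the other direction to factors that are expected to change, such as the number of classroom observations performed, by flagging schools where the change is too small.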
Data from millions of students and thousands of teachers as well as careful thought and analysis have taught us much about the statistical properties of value added. Yet these measures are only just beginning to be put into widespread use for teacher evaluation. As these new systems roll out, states and districts should be learning and experimenting, determining what works and what doesn’t. They should use this time to collect data and monitor their evaluation systems, using what they learn to make revisions. They should share their experiences with other school systems and learn from them in turn. The research community must also work with states and districts to identify the best practices for teacher evaluation, assessing whether the new systems really do yield better teaching and learning. Finally, the decisions school systems reach about value added—including whether and how to use the measure for evaluation—should not be seen as one-time events. School systems are now gathering a wealth of data from which we can learn how to make educational organizations more effective at conducting evaluations to improve teaching. Their evaluation systems must make use of that data, and evolve based on what that data shows.