Will Teacher Value-Added Scores Change when Accountability Tests Change?
Daniel F. McCaffrey
- There is only a moderate, and often weak, correlation between value-added calculations for the same teacher based on different tests.
- The content, timing, and structure of tests all contribute to differences in value-added calculations based on different tests.
- The stakes attached to a test affect the correlations between value-added estimates based on different tests.
- Conclusions drawn from one test might not serve as an accurate picture of a teacher’s true effectiveness.
- More studies are needed to better assess the potential that “teaching to the test” has for distorting value-added estimates.
- Composite measures may mitigate some of the distortions caused by using different tests, and they may be better predictors of student learning gains.
- States should expect large changes in value-added calculations in 2014-15 when they switch to tests aligned with the Common Core State Standards.
Value-added evaluations use student test scores to assess teacher effectiveness. How we judge student achievement can depend on which test we use to measure it. Thus it is reasonable to ask whether a teacher’s value-added score depends on which test is used to calculate it. Would it change if we used a different test? Specifically, might a teacher admonished for poor performance or recognized for good performance have been treated differently if a different test had been used? It’s an important question, particularly because most states will soon be adopting new tests aligned with the Common Core State Standards.
In this article we discuss what is known about how sensitive value-added scores are to the choice of test and what more needs to be known. We also discuss issues about the choice of test that might not be resolved through empirical investigation, as well as the implications of these findings for states and school districts.
What is Known About Teacher Value-Added Estimates from Different Tests?
In this section, we discuss the research studies that compared value-added calculated with one test to value-added calculated for the same teachers using a different test. These studies found the correspondence between the two sets of value-added estimates to be moderate at best. (In the next section, we discuss possible reasons for these differences.)
The Measures of Effective Teaching (MET) Project calculated teacher value-added scores using state accountability tests and, separately, using project-administered tests in grades four through eight in six school districts. The study used the correlation coefficient to describe the level of agreement between the two measures for each teacher. A value of 1 represents perfect correspondence, and a value of 0 means no correspondence. The MET Project found that the correlation between value-added using the two different tests administered to the same class of students was 0.38 for math and 0.21 for reading. Associations in this range typically are considered weak: teachers with value-added in the top quartile on the state reading test would have about a 40 percent chance that their value-added on the alternative test would be below the 50th percentile.
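The 40 percent figure can be roughly reproduced with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that a teacher's two value-added scores follow a bivariate normal distribution with the reported correlation. The function name and the integration settings are hypothetical and are not taken from the MET Project's analysis.

```python
from statistics import NormalDist
import math

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def prob_below_median_given_top_quartile(rho, steps=20000, upper=8.0):
    """P(Y below its median | X in its top quartile) for bivariate-normal (X, Y).

    Assumes standardized X and Y with correlation rho, where -1 < rho < 1.
    """
    cutoff = STD_NORMAL.inv_cdf(0.75)   # 75th percentile of X
    scale = math.sqrt(1.0 - rho * rho)  # standard deviation of Y given X
    width = (upper - cutoff) / steps
    total = 0.0
    for i in range(steps):              # midpoint rule over the top quartile of X
        x = cutoff + (i + 0.5) * width
        # Given X = x, Y is normal with mean rho * x and standard deviation `scale`.
        total += STD_NORMAL.pdf(x) * STD_NORMAL.cdf(-rho * x / scale) * width
    return total / 0.25                 # divide by P(X in its top quartile)


if __name__ == "__main__":
    for rho in (0.21, 0.38):
        print(rho, round(prob_below_median_given_top_quartile(rho), 2))
```

Under these assumptions, a correlation of 0.21 yields a probability close to 0.4, in line with the reading figure above, and a correlation of zero yields exactly one half, which is what chance alone would produce.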
Researchers also have taken advantage of the multiple tests administered by some states and school districts to investigate how much value-added changes when it is calculated with different tests. The studies used data from Hillsborough County, Florida; Houston, Texas; and a large urban district in the Northeast. In Hillsborough County, students completed two tests administered by the state: the Sunshine State Standards Test, a criterion-referenced test that assessed student mastery of Florida standards and served as the primary test for school accountability, and a norm-referenced test used to compare Florida students with those in other states. In Houston, students completed the state accountability test and a standardized norm-referenced test administered by the district. In the Northeast urban district, students completed the state test and two tests administered by the district: a standardized norm-referenced test in reading and math and a separate reading test. Each of the studies presented the correlation between teachers’ value-added based on the state accountability test and the alternative test. The studies found that correlations ranged from .20 to .59. The highest correlation coefficients were for math and reading teachers in Houston – .59 for math and .50 for reading – where value-added was calculated by pooling up to eight years of data for a teacher. The smallest values, of around .20, were for reading teachers in the Northeastern city. That case compared value-added on the state test, which was administered in the spring, to value-added on a test administered by the district the following fall. It used only one year of data from both tests.
Taken together, this evidence suggests that a teacher who taught the same curriculum to the same students, and who is rated at a given level based on value-added calculated from one test, has a strong likelihood of earning a different level based on value-added calculated from a different test.
Why is value-added sensitive to the test?
No standardized achievement test covers all the content and skills that a student might learn in a year. Consequently, value-added from one test might not fully reflect a teacher’s effectiveness at promoting learning. It will not include content or skills not covered by the test. However, a teacher’s effectiveness at teaching the content on one test may be very similar to her effectiveness at teaching the content on other tests. That is, a “good teacher” is a “good teacher” regardless of the content, skills or the test. Alternatively, some teachers may be more effective at teaching some content and skills and less effective at others. A teacher may be good at teaching mathematical computations but less good at teaching problem-solving. If value-added scores from different tests lead to different conclusions about a teacher, then we may worry that value-added from any single test provides an incomplete picture of a teacher’s effectiveness, and that using it to make decisions about teachers may be inefficient or, for some teachers, unfair.
From the results of the studies mentioned above, we might at first conclude that value-added on one test is a poor measure of a teacher’s effectiveness at teaching the content and skills measured by other tests. However, we consider six possible reasons for the weak correspondence between value-added calculated with two different tests: 1) the timing of the tests; 2) statistical imprecision; 3) test content; 4) the cognitive demands of the tests; 5) test format; and 6) the consequences of the test for students, teachers, or schools. These possible reasons have different implications for what value-added from one test might tell us about a teacher. We will discuss each in turn:
- Test timing. We have seen that the lowest correlation between value-added calculated with two different tests was for tests administered at two different times in the school year, one in the fall and the other in the spring. Thus, value-added scores are sensitive to the timing of the tests, and any comparisons of value-added among teachers or for the same teachers across years should use tests given at the same time of the school year. Should states use fall-to-fall or spring-to-spring testing? There is no research on whether testing in either period leads to value-added scores that better reflect teachers’ true effectiveness; however, spring tests do allow for calculating value-added closer to when the students were in their teachers’ classes.
- Statistical imprecision. All the studies except the Houston study calculated value-added for teachers using a single year of data and two different tests administered to the same group of students. Each measure is imprecise because it is calculated with a small number of students responding to a particular test form on a particular day. The imprecision in each test contributes to disagreements in the value-added that is based on them. However, imprecision does not mean that value-added on one test is a poor measure of a teacher’s effectiveness at teaching other content. If we remove the statistical noise that creates the imprecision, the agreement should be stronger. The Houston study used multiple years of data to calculate a teacher’s value-added on each test. Each year the state test used a different test form, and each year the teacher had different students. Thus, by combining data across multiple years, the value-added in the Houston study reduced the imprecision. As a result, the correlation between tests in that study was higher than it was in other studies. The MET Project made adjustments to the correlation between tests to estimate what the correlation would be between the average of a teacher’s value-added calculated for several years using the state test and the average of her value-added calculated for several years using alternative tests. The estimates were .54 for math and .37 for reading, again higher than the correlation for value-added from a single year. These correlations are still not strong. Value-added scores from one test might not provide the full picture of a teacher.
- Test content. As discussed above, disagreement between value-added calculated with different tests might be due to differences in what is tested: A teacher may be differently effective at promoting achievement depending on the content measured. Two studies directly compared value-added based on different content. In both of the studies, researchers calculated teacher value-added on the problem-solving subtest scores from a math assessment and, separately, on the procedures subtest scores on the same assessment, using the same students, tested under the same conditions, on the same day. In both studies, the agreement between the value-added using the two different subtests was modest, suggesting that a teacher who effectively promotes growth on problem-solving might not be equally effective at promoting growth on procedures. These studies suggest that content does matter and that teachers who are effective at teaching the content of one test might not be equally effective with other content. This has implications for what is tested and the efficiency of decisions made using value-added.
- Cognitive demands. Assessments vary in the cognitive demands made by the material on the test. A teacher may be differently effective at promoting growth in different types of skills, and this could result in differences in value-added from different tests. The research is unclear about how these differences in a test’s cognitive demands contribute to differences in value-added. The alternative tests chosen by the MET Project had different cognitive demands than did the state tests, and agreement between the value-added from these two tests was moderate at best. However, the tests also had some differences in content, so it is unclear how much the difference in cognitive demand contributed to the low correlation in value-added.
- Formats. Tests may have different formats. For instance, some tests may use multiple choice items, and others may use constructed response items. The different formats used by tests might contribute to differences in value-added. Some teachers may spend more time promoting the skills students need to succeed with constructed responses; other teachers may spend more time promoting the skills students need for multiple choice tests. The supplemental tests used in the MET Project were open-ended tests with constructed responses and no multiple-choice items, but the state tests contained mostly multiple choice items. Such differences could have weakened the agreement between the value-added calculated from different tests. Variation in value-added due to the test format does not provide meaningful information about teachers. Tests used for value-added should include multiple formats, and any comparisons of teachers’ value-added should be restricted to tests using similar formats.
- Consequences. Tests can differ according to the consequences for students, teachers, and schools that are attached to their outcomes. Tests with consequences are called “high-stakes” tests; tests with limited or no consequences are called “low-stakes” tests. The Sunshine State Standards test in Florida and the state tests in the other studies were used to hold schools and districts accountable for performance; they faced penalties if their students scored poorly. The stakes of the other tests used in the studies were not as high. Teachers report focusing on test preparation for high-stakes tests, and there is concern that this focus can inflate scores. Teachers who focus narrowly on the high-stakes state test may not do as well promoting growth on the lower-stakes test. The current research provides some evidence that value-added on high-stakes tests may be distorted somewhat by a narrow focus on the tested material. One study calculated a teacher’s value-added to her students’ test scores at the end of the year and scores produced at the end of the next several school years. The study showed that value-added fades over time as students progress through school. For example, a fourth grade teacher’s value-added based on her students’ fourth grade test scores is larger than her value-added based on those students’ fifth or sixth or higher grade scores. The study also calculated value-added for the current and future years using both a high-stakes and a low-stakes test. It found that teachers’ contributions to learning appeared to fade more quickly on a high-stakes test than on a low-stakes test. If some teachers narrowly focus on features specific to a test, such as its format for math problems or its limited set of vocabulary, then we might not expect their students’ achievement growth to carry over to future tests.
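The role of statistical imprecision in the list above can be illustrated with two textbook formulas from classical test theory: the Spearman-Brown formula for the reliability of an average, and the attenuation formula for a correlation between two noisy measures. This is a simplified sketch, not the adjustment the MET Project made. It assumes that a teacher's true effect is stable across years and that estimation errors are independent across years and across tests; the function names and the input values are hypothetical.

```python
import math


def reliability_of_average(single_year_reliability, years):
    """Spearman-Brown: reliability of the mean of `years` parallel estimates."""
    r = single_year_reliability
    return years * r / (1.0 + (years - 1) * r)


def expected_observed_correlation(true_corr, rel_a, rel_b, years=1):
    """Attenuation: the observed correlation shrinks as reliability falls."""
    rel_a_avg = reliability_of_average(rel_a, years)
    rel_b_avg = reliability_of_average(rel_b, years)
    return true_corr * math.sqrt(rel_a_avg * rel_b_avg)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration: a true correlation
    # of 0.6 and a single-year reliability of 0.4 for each test.
    for years in (1, 3, 8):
        print(years, round(expected_observed_correlation(0.6, 0.4, 0.4, years), 2))
```

With these made-up inputs, the expected observed correlation rises as more years are pooled but never exceeds the assumed true correlation, which matches the pattern described above: pooling reduces noise, yet agreement between tests can remain well short of perfect.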
What are the consequences of sensitivity to the test?
The research suggests that conclusions about a teacher’s effectiveness drawn from one test might be specific to that test, and might not provide an accurate picture of that teacher’s effectiveness overall. For instance, the study in Hillsborough found that only 43 percent of teachers who ranked in the top 20 percent of teachers according to value-added from the norm-referenced test also ranked in the top 20 percent according to value-added estimates from the Sunshine State Standards test. Similarly, the data from the Northeastern city found that if the district had been using a pay-for-performance system, changing tests would have changed the bonuses for nearly 50 percent of teachers, for an average salary difference of ,000. Some of the variation is due to the imprecision in calculating value-added, but even without these errors, the correspondence among conclusions would be less than perfect. The Houston study, which substantially reduced imprecision by pooling multiple years of data, still found that only about 46 percent of reading teachers classified in the top 20 percent on the state test were also among the top performers on the district test.
Although conclusions about individual teachers can be highly sensitive to the test used for calculating value-added, value-added on state tests can be used to identify groups of teachers who, on average, have students who show different growth on alternative tests. The MET Project found that, on average, if two teachers’ value-added calculated using the state test in year one of the study differed by 10 points, then, in year two, their students’ achievement on the low-stakes alternative test would differ by about 7 points. If we use the state test to identify teachers who have low value-added, the selected teachers’ students would also have lower growth on the alternative tests than would students of other teachers. The differences between the two groups of teachers would be about 70 percent as large on the alternative test as on the state test. Some of the individual teachers identified as low-performing might actually have students with substantial growth on the alternative test, but as a group, their students would tend to have lower growth on both tests. Thus, the sensitivity of value-added to a test does not mean it cannot be used to support decisions about teachers that could potentially help student outcomes.
What More Needs to be Known on this Issue?
Multiple sources can contribute to differences in value-added on different tests. Determining the contributions of each would be valuable for designing effective evaluation systems.
Differences in value-added from high- and low-stakes tests might be due to some teachers focusing more than others on superficial aspects of the tests and on practices that improve student test scores but not student achievement. These practices result in what is sometimes referred to as “score inflation.” An egregious example is teachers changing student test scores, as in the highly publicized cheating scandals in Atlanta and New York. Practices that lead to test-score inflation can also be less overt. For instance, teachers may have students practice test-taking skills. But differences in value-added from high- and low-stakes tests might not be due to score inflation. They may also be due to differences in the content on the tests. A teacher’s effectiveness at helping students learn the state standards might be only weakly related to his effectiveness at teaching other material, especially if he focuses on the state standards and his instructional materials are related to those standards.
The two scenarios have different implications for the utility of value-added for improving student outcomes and the consequences of switching to the rigorous Common Core State Standards and their associated tests. If some teachers have high value-added on the high-stakes test because of score inflation, then value-added might not be useful for identifying effective teachers. And using it in evaluations may have negative consequences for students. If teachers are focused on the content of the state standards and if value-added reflects their effectiveness at teaching that material, then setting rigorous standards should have positive effects on student learning. The research discussed above offers limited evidence in support of both scenarios. States and districts would benefit from knowing how much each scenario contributes to value-added on high stakes tests.
As value-added estimates become more common in consequential teacher evaluations, the motivation for teachers to take steps that might inflate their value-added will increase. There are many examples in education, and other fields, in which the use of performance indicators leads to their distortion. So, understanding this risk, and designing systems to prevent it, will be important for maintaining the integrity of teacher evaluation. One particular concern is peer competition: the extent to which teachers feel they need to take steps to inflate their scores because they think their colleagues are.
If differences in value-added with different tests are due to differences in the skills measured by the tests, the choice of skills to be measured is a critical one. We have some evidence that teacher effectiveness differs according to the two math areas of problem-solving and procedures, but we need to further understand the contributions of measurement error to those findings and to extend them to other subjects and skills. We also need to better understand which skills lead to better long-term outcomes and to determine how best to measure those skills so that value-added can provide the most meaningful measure of a teacher’s effectiveness.
It would be helpful to have more systematic evaluations of how value-added changes when states introduce new tests. In recent years, some states have made significant changes to their tests, and these might hint at what states can expect when they change to the Common Core tests. However, there is no research that documents how value-added changed with the change in tests. New studies might explore the correlation in value-added with the new and old tests, using these states as examples: What types of teachers have value-added that is different with the new test than with the old test? The studies should also consider the differences in the tests themselves and explore how these contribute to differences in value-added.
What Can’t be Resolved by Empirical Evidence on this Issue?
We cannot accurately test students on all the skills they need to be ready for college and careers and to lead happy and productive lives. Even the best test will measure only a limited set of skills, and value-added that is based on that test will evaluate teachers only on those skills. When considering skills to be tested, ideally we would select those that are required for students to achieve the long-term outcomes that society values. Empirical analyses might someday help us determine how skills relate to long-term outcomes, but they cannot tell us what those outcomes should be.
How, and Under What Circumstances, Does This Issue Impact the Decisions and Actions that Districts Make on Teacher Evaluations?
The research presented above raises two concerns for states: 1) Is the value-added for teachers from one test providing an adequate measure of the teacher’s contributions to learning, and how can the data be used to provide the most accurate measure of teacher effectiveness? 2) How should states prepare for the transition to new tests aligned with the Common Core State Standards and the impact this will have on value-added?
Implications for using value-added from one test
As discussed above, value-added from one test might not agree with that from another because 1) value-added depends on factors that are specific to each test – test format, for instance – but unrelated to teacher effectiveness, or 2) value-added does not measure aspects of teacher effectiveness related to content not covered by the test. States should choose tests that ensure that student scores are determined by content knowledge and not by extraneous factors. They also should select tests with a very broad range of content so that meaningful aspects of teacher effectiveness are not missed. The tests that two consortia are developing to measure the Common Core State Standards are designed to meet both these goals.
However, states and districts might still worry that even with the new tests, value-added might not reflect all aspects of a teacher’s contributions to student outcomes. States might combine value-added with other measures to reduce this risk. The MET Project found that composite measures of teacher effectiveness that combined, with roughly equal weights, value-added calculated with state tests, classroom observations, and student responses to surveys were somewhat better predictors of a teacher’s future value-added calculated from an alternative test than was value-added calculated from the state test alone.
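A composite of this kind can be sketched as a weighted average of standardized measures. The code below is an illustration only: the function names, the z-score standardization, the sample values, and the exactly equal weights are assumptions, not the MET Project's procedure.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw scores to z-scores so measures on different scales can be combined."""
    center, spread = mean(values), pstdev(values)
    if spread == 0:
        raise ValueError("cannot standardize a measure with no variation")
    return [(v - center) / spread for v in values]


def composite_scores(measures, weights):
    """Weighted average of standardized measures, one composite per teacher.

    `measures` maps a measure name to a list of scores (same teacher order in
    every list); `weights` maps the same names to non-negative weights.
    """
    total_weight = sum(weights[name] for name in measures)
    z = {name: standardize(scores) for name, scores in measures.items()}
    n_teachers = len(next(iter(measures.values())))
    return [
        sum(weights[name] * z[name][i] for name in measures) / total_weight
        for i in range(n_teachers)
    ]


if __name__ == "__main__":
    # Hypothetical scores for four teachers.
    measures = {
        "state_test_value_added": [0.10, -0.05, 0.02, -0.07],
        "classroom_observation": [3.1, 2.4, 2.9, 2.6],
        "student_survey": [4.0, 3.2, 3.9, 3.5],
    }
    weights = {name: 1.0 for name in measures}  # roughly equal weights
    print([round(c, 2) for c in composite_scores(measures, weights)])
```

Standardizing first keeps a measure with a wide raw scale, such as a survey score, from dominating the composite simply because of its units. A state that wanted less weight on value-added could lower that one entry in `weights`.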
Implications for the transition to the new test
The transition to the new Common Core tests will introduce many conditions that could lead to changes in teachers’ value-added. The new tests will measure content aligned with the new standards, not current standards. They will also use a different test format. Administered on computer, they will have fewer multiple choice items and more performance tasks and constructed response items. The new tests also are to be more cognitively demanding than the current tests and could be administered at different times than some tests are now. Consequently, student scores on these new tests will likely differ from what scores would have been on the current state tests, and teachers’ value-added is likely to be different, as well. Already, evidence shows that there is often a precipitous drop in student achievement when districts change from one test to another. If a similar drop follows the switch to new Common Core tests, it could further contribute to instability in value-added estimates.
With these conditions in mind, states should prepare for larger year-to-year changes in value-added in the 2014-15 school year when they switch to tests aligned with the Common Core. They should also be prepared for greater year-to-year variability in value-added for a few years after the change when they will be using both old and new tests. Large year-to-year variability makes value-added hard to interpret. A teacher doing well one year may appear to be doing poorly the next. It can also weaken the credibility of value-added among teachers.
There is no formal research on what states should do to ease this transition. The following paragraphs provide insights from an assistant superintendent from a large urban school district that uses value-added for performance-based pay and from analysts who supply value-added to states to address questions states might have about the upcoming changes.
One question states have is how they should modify their value-added models when they switch tests. Most of the analysts contacted for this brief said that when states changed tests, they simply applied the same methods they used in other years, except that the prior year scores were from the old test and the current year scores were from the new test. All of the analysts noted the importance of making sure that scores on the old tests are predictive of students’ performance on the new tests before calculating value-added using data from both tests. If students’ scores on the old tests are weak predictors of their performance on the new tests, it could lead to error in value-added. One analyst recommended that to improve the predictive power of the prior tests, states could use prior achievement scores in math, reading, and other available subjects in calculating value-added for either math or reading. He also advised that states account for measurement error in the prior scores when creating value-added.
Many value-added methods evaluate teachers’ contributions to student learning by comparing their students’ growth in achievement with the average growth of all students in the district or state during the school year. As a result, value-added cannot be used to determine if the average teacher is improving across time. To avoid this problem, some value-added methods, including those used by Tennessee and Ohio, compare student achievement growth to the average growth of students from a prior school year called a base year. The base year remains constant over time so that across years the average value-added will increase with any improvements in teaching. The analysts recommended that the base year be changed when the states switch tests, noting that scores from the new test cannot be compared with base-year scores from the old test. The analysts further suggested waiting a few years before establishing a new base year because, in their experience, large year-to-year changes in the student score distributions can occur in the first few years of a new testing program.
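The difference between the two reference points can be shown with a deliberately simplified calculation in which a teacher's value-added is her students' mean gain minus a reference mean gain. Real value-added models, including those used in Tennessee and Ohio, are considerably more elaborate; the function name and all numbers below are hypothetical.

```python
from statistics import mean


def simple_value_added(gains_by_teacher, reference_gain):
    """Mean student gain for each teacher minus a reference mean gain."""
    return {t: mean(g) - reference_gain for t, g in gains_by_teacher.items()}


if __name__ == "__main__":
    # Hypothetical student gains (current score minus prior score).
    gains = {"teacher_a": [6.0, 8.0, 7.0], "teacher_b": [3.0, 5.0, 4.0]}

    # Reference 1: average gain of all students in the current year.
    current_avg = mean(g for gs in gains.values() for g in gs)
    print(simple_value_added(gains, current_avg))

    # Reference 2: average gain from a fixed base year (assumed to be 4.5).
    print(simple_value_added(gains, 4.5))
```

With the current-year reference, the teachers' values average to zero by construction (class sizes are equal in this example), so system-wide improvement is invisible. With the fixed base-year reference, the average is positive whenever current gains exceed the base-year average, which is why the base year only makes sense when old and new scores are on a comparable scale.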
The assistant superintendent cited above noted that when the state made a significant change to its annual test, teachers whose classes had high prior achievement saw their value-added go up relative to other teachers, and teachers whose classes had low prior achievement saw their value-added go down. The effects were detrimental to the credibility of value-added in the district overall. They fed existing suspicions that value-added was too low for some teachers because of the students they taught. District leaders were not prepared to address the problems created by the new test.
Because changing the test is very likely to lead to changes in value-added for some teachers, states may want to prepare: they may want to thoroughly investigate how value-added changes from the years before the Common Core test to the first years after it. If teachers with certain types of students have systematically higher or lower value-added on the new test, the state might want to take steps to reduce negative repercussions. The state might not release value-added for some teachers for a few years, instead waiting for multiple prior years of data on the new test to better adjust for differences among classes. The state might follow the recommendations of analysts and use tests from multiple subjects and control for measurement error in their value-added calculations. The state might use a weighted average of value-added calculated using the old and the new tests to smooth out the transition in tests. The state might follow the MET Project and use a composite estimate with less weight on value-added, or if the effects of the new test are concentrated on the value-added for a subset of teachers, the state might give these teachers’ value-added less weight or allow districts greater flexibility in how they use value-added for performance evaluations.