Abstract
I exploit the random assignment of class rosters in the MET Project to estimate teacher effects on students’ performance on complex open-ended tasks in math and reading, as well as their growth mindset, grit, and effort in class. I find large teacher effects across this expanded set of outcomes, but weak relationships between these effects and performance measures used in current teacher evaluation systems including value-added to state standardized tests. These findings suggest teacher effectiveness is multidimensional, and high-stakes evaluation decisions are only weakly informed by the degree to which teachers are developing students’ complex cognitive skills and social-emotional competencies.
I. Introduction
It is well established that teachers have large effects on students’ achievement on state standardized tests (Rockoff 2004; Hanushek and Rivkin 2010; Chetty, Friedman, and Rockoff 2014a). However, state tests have typically been narrow measures of student learning, assessing basic literacy and numeracy skills using multiple-choice questions. A review of standardized tests used in 17 states judged as having the most rigorous state assessments found that 98 percent of items on math tests and 78 percent of items on reading tests only required students to recall information and demonstrate basic skills and concepts (Yuan and Le 2012). Many of the ways in which teachers affect students’ long-term outcomes such as earnings (Chetty, Friedman, and Rockoff 2014b) may be through their influence on skills and competencies not captured on state standardized tests (Bowles, Gintis, and Osborne 2001). Chamberlain (2013) found that only one-fifth of teachers’ effects on whether students went to college were explained by their impacts on standardized tests. Similarly, Jackson (2018) found that teachers’ effects on test scores accounted for less than one-third of their effects on high school completion and indicators of college matriculation.
This paper provides new evidence on the degree to which teachers affect a broad set of complex cognitive skills and social-emotional competencies using data across six large school districts collected by the Measures of Effective Teaching (MET) Project.1 Existing research linking teacher effects to outcomes other than traditional standardized assessments has examined three general outcome types: (i) observable behavioral and schooling outcomes, such as absences, suspensions, grades, grade retention, and high-school graduation (Jackson 2018; Gershenson 2016; Koedel 2008; Ladd and Sorensen 2017); (ii) student self-reported attitudes and behaviors, including motivation and self-efficacy in math, happiness and behavior in class, and time spent reading and doing homework outside of school (Blazar and Kraft 2017; Ladd and Sorensen 2017; Ruzek et al. 2015); and (iii) teacher assessments of students’ social and behavioral skills (Chetty et al. 2011; Jennings and DiPrete 2010). These studies almost uniformly find teacher effects on non-test-score outcomes, often of comparable or even larger magnitude than effects on achievement.
The MET Project data allow me to make several important contributions to this literature. First, I estimate teacher effects on a much broader set of student skills and competencies than has been previously examined. In addition to collecting student performance on state standardized tests, MET researchers administered two supplemental achievement tests comprising open-ended tasks designed to be more direct measures of students’ critical thinking and problem-solving skills. In the second year of the study, students also completed a questionnaire that included scales for measuring their grit (Duckworth and Quinn 2009) and growth mindset (Dweck 2006), two widely publicized social-emotional competencies that have received considerable attention from policymakers and educators in recent years.2 The survey also included a class-specific measure of effort, which allows for a direct comparison between teacher effects on global and domain-specific measures of perseverance. I present the first estimates of teacher effects on students’ grit, growth mindset, and effort in class. I also provide the first direct evidence of the relationship between teacher effects on state tests, complex open-ended assessments, and social-emotional competencies.
A second key advantage of using the MET data to address these questions is that a subset of teachers participated in an experiment where researchers randomly assigned class rosters among sets of volunteer teachers in the same grades and schools. This design provides the opportunity to identify teacher effects without the strong conditional independence assumption required when using observational data. The extent to which covariate adjustment adequately accounts for nonrandom student sorting when estimating teacher effects on test scores is still a topic of ongoing debate.3 Even less is known about the validity of this approach for estimating teacher effects on outcomes other than standardized state tests.
Third, the MET data allow me to examine the relationship among teacher effects on an expanded set of student outcomes and performance measures used in most teacher evaluation systems. In recent years, states have implemented sweeping reforms to teacher evaluation by adopting more rigorous systems based on multiple measures of teacher effectiveness (Steinberg and Donaldson 2016). I provide among the first evidence on whether the measures used in these high-stakes evaluation systems, including value-added to state tests, classroom observations, student surveys, and principal ratings, reflect teacher effects on complex cognitive skills and social-emotional competencies.
Leveraging the classroom roster randomization, I find teacher effects on standardized achievement in math and English Language Arts (ELA) that are similar in magnitude to prior analyses of the MET data (Kane and Cantrell 2010) and the broader value-added literature (Hanushek and Rivkin 2010). I also find teacher effects of comparable magnitude on students’ ability to perform complex tasks in math and ELA, as measured by cognitively demanding open-ended tests. While teachers who add the most value to students’ performance on state tests in math also appear to strengthen their analytic and problem-solving skills (r = 0.57), teacher effects on state ELA tests are only moderately correlated with teacher effects on open-ended response items in reading (r = 0.24). Successfully teaching more basic reading comprehension skills does not indicate that teachers are also developing students’ ability to interpret and respond to texts.
Teacher effects on students’ social-emotional competencies differ in magnitude, with the largest effects on class-specific effort, the global perseverance subscale of grit, and growth mindset. Comparing the effects of individual teachers across outcomes reveals that correlations between teacher effects on standardized tests and those on social-emotional competencies are never larger than 0.21. Consequently, more than one out of every four teachers who is in the top 25 percent of state test value-added is in the bottom 25 percent of social-emotional value-added. Together, these findings suggest that teacher effectiveness is multidimensional and that individual teachers’ abilities differ across skillsets.
Turning to teacher evaluation policies, I also find little evidence that performance measures commonly incorporated into high-stakes teacher evaluation systems capture teacher effects on complex cognitive skills or social-emotional competencies. Neither value-added to state standardized tests, scores on classroom observation rubrics, student survey assessments, nor principals’ overall assessments of professional practice serve as proxy measures for teacher effects on this broader set of outcomes, either individually or jointly. Correlations between a composite of these teacher performance measures (using commonly applied weights) and teacher effects on social-emotional skills are weak, between 0.03 and 0.19. I conclude by discussing the implications of these findings for research, policy, and practice.
II. Schooling, Skills, and Competencies
A. Complex Cognitive Skills
A growing number of national and international organizations have identified complex cognitive abilities as essential skills for the workplace in the modern economy (National Research Council 2013; OECD 2013). Psychologists and learning scientists define complex cognitive skills as a set of highly interrelated constituent skills that support cognitively demanding processes (van Merriënboer 1997). These skills allow individuals to classify new problems into cognitive schema and then to transfer content and procedural knowledge from familiar schema to new challenges. Examples include writing computer programs, directing air traffic, engineering dynamic systems, and diagnosing sick patients.
Researchers and policy organizations have referred to these abilities using a variety of different terms including 21st century skills, deeper learning, critical-thinking, and higher-order thinking. State standardized achievement tests in math and reading rarely include items designed to assess these abilities (Yuan and Le 2012). Among state tests that do include open-ended ELA questions, these items are often substantially more cognitively demanding than multiple-choice questions. However, open-ended items on state math tests typically require students to move beyond recall but rarely require students to solve extended unstructured problems.
To date, empirical evidence linking teacher and school effects to the development of students’ complex cognitive skills remains very limited. Researchers at RAND found that students who had more exposure to teaching practices characterized by group work, inquiry, extended investigations, and an emphasis on problem solving performed better on open-ended math and science tests designed to assess students’ decision-making abilities, problem-solving skills, and conceptual understanding (Le et al. 2006). Using a matched-pair design, researchers at the American Institutes for Research found that students attending schools that were part of a “deeper learning” network outperformed students at comparison schools by more than one-tenth of a standard deviation in math and reading on the PISA-Based Test for Schools (PBTS), a test that assesses core content knowledge and complex problem-solving skills (Zeiser et al. 2014).
B. Social-Emotional Competencies
“Social-emotional competencies” (or social and emotional learning) is a broad umbrella term used to encompass an interrelated set of cognitive, affective, and behavioral abilities that are not commonly captured by standardized tests. Although sometimes referred to as noncognitive skills, personality traits, or character skills, these competencies explicitly require cognition, are not fixed traits, and are not intended to suggest a moral or religious valence. They are skills, attitudes, and mindsets that can be developed and shaped over time (Duckworth and Yeager 2015). Regardless of the term used, mounting evidence documents the strong predictive power of competencies other than performance on cognitive tests for educational, employment, health, and civic outcomes (Almlund et al. 2011; Borghans et al. 2008; Moffitt et al. 2011).
Two seminal experiments in education, the HighScope Perry Preschool Program and Tennessee Project STAR, documented the puzzling phenomenon of how the large effects of high-quality early-childhood and kindergarten classrooms on students’ academic achievement faded out over time, but then reappeared when examining adult outcomes such as employment and earnings as well as criminal behavior. Recent reanalyses of these experiments suggest that the long-term benefits of high-quality pre-K and kindergarten education were likely mediated through increases in students’ social-emotional competencies (Heckman, Pinto, and Savelyev 2013; Chetty et al. 2011).
III. Research Design
The MET Project was designed to evaluate the reliability and validity of a wide range of performance measures used to assess teachers’ effectiveness. The study tracked approximately 3,000 teachers from across six large public school districts over the 2009–2010 and 2010–2011 school years.4 These districts included the Charlotte-Mecklenburg Schools, Dallas Independent Schools, Denver Public Schools, Hillsborough County Public Schools, Memphis Public Schools, and New York City Schools. Substantial variation exists in the racial composition of students across districts, such that African-American, Hispanic, and white students each represent the largest racial/ethnic group in at least one district.
A. The Classroom Roster Randomization Experiment
In the second year of the study, MET researchers recruited schools and teachers to participate in a classroom roster randomized experiment. Of those fourth- and fifth-grade general education teachers who participated in the first year and remained in the study in the second year, 85 percent volunteered for the randomization study and were eligible to participate. Participating principals were asked to create classroom rosters that were “as alike as possible in terms of student composition” in the summer of 2010 (Bill & Melinda Gates Foundation 2013, p. 22). They then provided these rosters to MET researchers to randomize among volunteer teachers in the same schools, subjects, and grade levels.5 The purpose of this randomization was to eliminate potential bias in teacher effect estimates caused by any systematic sorting of teachers and students to specific classes within schools. I focus my empirical analyses on the effect of general education elementary classrooms to minimize potential confounding when students are taught by multiple teachers and outcomes are not class-specific.
B. Limitations of the MET Data
While the MET Project has several advantages, the data also have some important limitations. Almost 8,000 elementary school students (n = 7,999) were included on class rosters created for general elementary school teachers by principals. Like Kane et al. (2013), I find substantial attrition among the fourth- and fifth-grade students who were included in the roster randomization process; 38.6 percent of students on these rosters were not taught by teachers who participated in the MET Project data collection in 2010–2011 and thus are censored from the MET dataset. Much of this attrition is due to the randomization design, which required principals to form class rosters before schools could know which students and teachers would remain at the school. Following random assignment, some students left the district, transferred to nonparticipating schools, or were taught by teachers who did not participate in the MET study. Some participating teachers left the profession, transferred schools, or ended up teaching different classes within their schools than originally anticipated. I present several analyses examining randomization balance in the analytic sample in Section IV.A and find that this attrition does not meaningfully compromise the internal validity of the analyses.
The single year of experimental data combined with my focus on general education elementary classrooms also limits my ability to isolate teacher effects from peer effects and transitory shocks (Chetty et al. 2011). Blazar and Kraft (2017) compared teacher effects on students’ attitudes and behaviors with and without allowing for class-specific effects and found that estimates that do not remove class-specific peer effects and shocks are inflated by approximately 15 percent. I present estimates both with and without peer-level controls to provide approximate bounds for teacher effects. Throughout the paper, I refer to my estimates as teacher effects while recognizing that the data do not allow me to definitively separate the joint effect of teachers, peers, and shocks.
I am also unable to test the predictive validity of estimated teacher effects on complex cognitive skills and social-emotional competencies using longer-term outcomes, following Jackson (2018). Such analyses using the MET data are not possible because the MET Project focused on teachers and, thus, did not collect panel data on students. I instead leverage the nationally representative Educational Longitudinal Study to illustrate the predictive validity of self-report scales that are close proxies for measures of grit and growth mindset on a range of educational, economic, personal, and civic outcomes, and I review the causal evidence on interventions targeting these competencies.
C. Sample
I construct the analytic sample to include only students in fourth and fifth grades who (i) were included in the roster randomization process, (ii) were taught by general education teachers who participated in the randomization study, (iii) had valid lagged achievement data on state standardized tests in both math and ELA, and (iv) were taught by a teacher who is linked with at least five students. These restrictions result in an analytic sample of 4,092 students and 236 general education teachers. Further restricting the analytic sample by requiring that students have valid data for all outcomes would reduce the sample to 2,907 students. In analyses available upon request, I confirm that the primary results are unchanged when using this smaller balanced sample.
I present descriptive statistics on the students and teachers in the analytic sample in Table 1. The sample closely resembles the national population of students attending public schools in cities across the United States but with a slightly larger percentage of African-American students and smaller percentage of white and Hispanic students: 36 percent are African-American, 29 percent are Hispanic, 24 percent are white, and 8 percent are Asian. More than 60 percent of students qualify for free or reduced-price lunch (FRPL) across the sample. The fourth- and fifth-grade general education elementary school teachers who participated in the MET Project randomization design are overwhelmingly female and substantially more likely to be African-American compared to the national labor market of public school teachers. Teacher experience varies widely across the sample, and one-half of the teachers hold a graduate degree.
Student and Teacher Characteristics
D. Standardized State Tests
The MET dataset includes end-of-year achievement scores on state standardized tests in math and ELA, as well as scores from the previous year. State math and ELA tests for the fourth and fifth grades administered in the six districts in 2011 primarily consisted of multiple-choice items. State test technical manuals suggest that the vast majority of items on these exams assessed students’ content knowledge, fundamental reading comprehension, and basic problem-solving skills.6 Reported reliabilities for these fourth- and fifth-grade tests in 2011 ranged between 0.85 and 0.95. In order to make scaled scores comparable across districts, the MET Project converted scores into rank-based Z-scores.
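The MET documentation describes this transformation only as a rank-based Z-score; a minimal sketch of one common implementation (a van der Waerden-style mapping of within-district ranks to normal quantiles) is shown below. The column names (district, math_scaled) are hypothetical, and the MET Project’s exact procedure may differ.

```python
import pandas as pd
from scipy.stats import norm

def rank_based_z(scores: pd.Series) -> pd.Series:
    """Map raw scores to rank-based Z-scores: rank the scores (ties averaged),
    convert ranks to quantiles strictly inside (0, 1), then apply the inverse
    normal CDF."""
    ranks = scores.rank(method="average")
    quantiles = ranks / (scores.notna().sum() + 1)
    return pd.Series(norm.ppf(quantiles), index=scores.index)

# Hypothetical usage: make scaled scores comparable across districts.
# df["math_z"] = df.groupby("district")["math_scaled"].transform(rank_based_z)
```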
E. Achievement Tests Consisting of Open-Ended Tasks
MET researchers administered two supplemental achievement tests to examine the extent to which teachers promote high-level reasoning and problem-solving skills. The cognitively demanding tests, the Balanced Assessment in Mathematics (BAM) and the Stanford Achievement Test 9 Open-Ended Reading Assessment (SAT9-OE), consist exclusively of constructed-response items. The BAM was developed by researchers at the Harvard Graduate School of Education and comprises four to five tasks that require students to complete a series of open-ended questions about a complex mathematical problem and justify their thinking. The SAT9-OE was developed by Pearson Education and consists of nine open-ended questions about one extended reading passage that test students’ abilities to reason about the text, draw inferences, explain their thinking, and justify their answers. I estimate internal consistency reliabilities of students’ scores across individual items on the BAM and SAT9-OE of 0.72 and 0.85, respectively. Similar to state standardized tests, the MET Project converted raw scores on the BAM and SAT9-OE into rank-based Z-scores.
Little direct evidence exists about the predictive validity of the BAM and SAT9-OE assessments, in part because these tests were never commercialized at scale. These assessments were chosen by MET Project researchers on the basis of the primary criterion that they “provide good measures of the extent to which teachers promote high-level reasoning and problem solving skills” (MET Project 2009). Although format alone does not determine the cognitive demand of test items, a review of six major national and international assessments using Norman Webb’s Depth-of-Knowledge framework found that 100 percent of writing, 52 percent of reading, and 24 percent of math open-response items assessed strategic or extended thinking compared to only 32 percent of reading and 0 percent of math multiple-choice items (Yuan and Le 2014). Demand and wages for jobs that require these complex cognitive skills to perform nonroutine tasks, often in combination with strong interpersonal skills, have grown steadily in recent decades (Autor, Levy, and Murnane 2003; Deming 2017; Weinberger 2014).
F. Social-Emotional Measures
Students completed short self-report questionnaires to measure their grit and growth mindset in the second year of the study. The scale used to measure grit was developed by Angela Duckworth to capture students’ tendency to sustain interest in, and effort toward, long-term goals. Students responded to a collection of eight items (for example, “I finish whatever I begin”) using a five-category Likert scale, where 1 = not like me at all and 5 = very much like me. I estimate student scores separately for the two subscales that constitute the overall grit measure as presented in the original validation study (Duckworth and Quinn 2009): (i) consistency of interest and (ii) perseverance of effort (hereafter consistency and perseverance). This approach provides an important opportunity to contrast a global measure of perseverance with a class-specific measure of effort described below and distinguishes between conceptually distinct constructs that have an unadjusted correlation of 0.22 and a disattenuated correlation of 0.33 in the analytic sample.
The growth mindset scale developed by Carol Dweck measures the degree to which students’ views about intelligence align with an incremental theory that intelligence is malleable as opposed to an entity theory, which frames intelligence as a fixed attribute (Dweck 2006). Students were asked to rate their agreement with three statements (for example, “You have a certain amount of intelligence, and you really can’t do much to change it”) on a six-category Likert scale, where 1 = strongly disagree and 6 = strongly agree. I complement these global social-emotional measures with a class-specific measure of effort, constructed from responses to survey items developed by the Tripod Project for School Improvement. Students are asked to respond to six descriptive statements about themselves (for example, “In this class I stop trying when the work gets hard”) using a five-category Likert scale, where 1 = totally untrue and 5 = totally true.
Reliability estimates of the internal consistency for growth mindset, consistency, perseverance, and effort in class are 0.78, 0.66, 0.69, and 0.56, respectively. I construct scores on each of the measures following Duckworth and Quinn (2009) and Blackwell, Trzesniewski, and Dweck (2007) by assigning point values to the Likert-scale responses and averaging across the items in each scale. I then standardize all of the social-emotional measures in the full MET Project sample within grade level in order to account for differences in response scales and remove any trends due to students’ age that might otherwise be confounded with teacher effects across grade levels. See Online Appendix A for the complete list of items included in each scale.7
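As a concrete illustration of this scoring procedure, the sketch below averages Likert point values across a scale’s items and then standardizes the result within grade level. The item names, the set of reverse-coded items, and the column names are hypothetical placeholders rather than the actual MET variable names.

```python
import pandas as pd

PERSEVERANCE_ITEMS = ["grit_p1", "grit_p2", "grit_p3", "grit_p4"]  # hypothetical item names
REVERSE_CODED = {"grit_p2"}  # assumption: negatively worded items are reverse-coded
LIKERT_MAX = 5               # five-category response scale

def score_scale(df: pd.DataFrame, items, reverse=frozenset(), likert_max=5) -> pd.Series:
    """Assign point values to Likert responses (reverse-coding where needed)
    and average across the items in the scale."""
    vals = df[items].copy()
    for item in reverse:
        vals[item] = (likert_max + 1) - vals[item]
    return vals.mean(axis=1)

# df["perseverance_raw"] = score_scale(df, PERSEVERANCE_ITEMS, REVERSE_CODED, LIKERT_MAX)
# Standardize within grade level to remove age-related trends:
# df["perseverance_z"] = df.groupby("grade")["perseverance_raw"].transform(
#     lambda s: (s - s.mean()) / s.std()
# )
```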
While a large body of evidence documents the predictive validity of social-emotional measures such as the Big Five, locus of control, and self-esteem (Almlund et al. 2011; Borghans et al. 2008; Moffitt et al. 2011), evidence for grit and growth mindset is more limited. Grit has been shown to be predictive of GPA at an Ivy League school, retention at West Point, and performance in the Scripps National Spelling Bee, conditional on IQ (Duckworth et al. 2007; Duckworth and Quinn 2009). Grittier soldiers were more likely to complete an Army Special Operations Forces selection course, grittier sales employees were more likely to keep their jobs, and grittier students were more likely to graduate from high school, conditional on a range of covariates (Eskreis-Winkler et al. 2014). Middle school students who report having a high growth mindset have been found to have higher rates of math test score growth than students who view intelligence as fixed (Blackwell et al. 2007).
Given the lack of medium- or long-term outcomes in the MET data, I examine the predictive validity of social-emotional measures, conditional on standardized test scores, for students’ educational attainment, labor market, personal, and civic outcomes ten years later using the Educational Longitudinal Study (ELS). As predictors, I use proxy measures of grit and growth mindset constructed from tenth-grade students’ self-reported answers to survey items that map closely onto the perseverance of effort subscale of grit and a domain-specific measure of students’ growth mindset in math. I create a composite measure of students’ academic ability in math and reading based on their scores on a multiple-choice achievement test administered by the National Center for Education Statistics (see Online Appendix B for details).
In Table 2, I report results from a simple set of OLS regression models where standardized measures of academic achievement, grit (perseverance), and growth mindset are included simultaneously with controls for students’ race, gender, level of parental education, and household income. Grit and growth mindset are generally weaker predictors of outcomes in adulthood compared to measures of academic achievement, but do contain information that is independent from academic ability. For example, one standard deviation increases in grit and growth mindset (0.61 and 0.73 scale points on a four-point scale, respectively) are associated with $1,632 and $848 increases in annual employment income, respectively, as well as 5.8 and 1.1 percentage point increases in the probability a student has earned a bachelor’s degree by age 26. Both grit and growth mindset are negatively associated with teen pregnancy and positively associated with civic participation. These conditional associations are likely conservative estimates of the predictive power of grit and growth mindset as they are not disattenuated for the lower reliability of survey-based measures, and the measure of growth mindset is math-specific rather than the global measure used in the MET Project.
The Predictive Validity of Self-Reported Social-Emotional Measures on Education, Employment, Personal, and Civic Outcomes from the Educational Longitudinal Study.
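A minimal sketch of the kind of specification behind Table 2 is shown below, assuming hypothetical ELS variable names (earnings, achievement_z, grit_z, growth_mindset_z, and categorical demographic controls); the three standardized predictors enter simultaneously.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_els_model(els: pd.DataFrame, outcome: str):
    """OLS of an adult outcome on standardized achievement, grit, and growth
    mindset entered simultaneously, with demographic controls."""
    formula = (
        f"{outcome} ~ achievement_z + grit_z + growth_mindset_z"
        " + C(race) + C(gender) + C(parent_educ) + C(hh_income)"
    )
    return smf.ols(formula, data=els).fit(cov_type="HC1")  # robust standard errors

# fit_els_model(els, "earnings").params[["achievement_z", "grit_z", "growth_mindset_z"]]
```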
These analyses do not establish an underlying causal relationship or confirm that fourth and fifth graders’ self-reported grit and growth mindset have the same predictive power. However, we do know that grit and growth mindset are negatively correlated with absences and suspensions and positively correlated with GPA among upper elementary and middle school students (West 2016; West et al. 2016). A growing number of randomized controlled trials evaluating the effect of growth mindset interventions across various grade levels have documented causal effects on short- to medium-term academic and behavioral outcomes (Yeager et al. 2014, 2016; Miu and Yeager 2015; Paunesku et al. 2015). These studies demonstrate that growth mindset interventions increased math and science GPA over several months (Yeager et al. 2014), satisfactory performance in high-school courses (Paunesku et al. 2015), and classroom motivation (Blackwell et al. 2007), as well as decreased self-reported depressive symptoms (Miu and Yeager 2015) and aggressive desires and hostile intent attributions (Yeager et al. 2013a).
The causal evidence on the effect of grit is more limited. Several small-scale field experiments document the short-term positive academic effects of mental contrasting strategies where students learn how to plan for and overcome obstacles for achieving their goals (Duckworth et al. 2011, 2013). A recent study found that teaching fourth-grade students in Turkey about the plasticity of the human brain, the importance of effort, learning from failures, and goal-setting improved performance and persistence on objective tasks and grades (Alan et al. 2016). Together, these studies suggest that grit and growth mindset are both malleable and likely causal determinants of important intermediary student outcomes for success in later life.
G. Achievement Tests, Performance on Complex Tasks, and Social-Emotional Competencies
In Table 3, I present Pearson correlations across the eight outcome measures along with correlations disattenuated for measurement error (see Online Appendix C for technical details). The clustered patterns of covariance evident in this table illustrate the lack of independence of each of these measures. Instead, these outcomes likely capture a more limited set of latent constructs. The strongest relationships among the disattenuated correlations are between students’ performance on state standardized tests across subjects (0.81) and students’ math performance on the state tests and the open-ended test (0.81). This suggests that students who perform well on more-basic multiple-choice math questions also tend to perform well on more demanding open-ended math tasks. Student performance on state ELA tests and the SAT9-OE are correlated at 0.56, suggesting that state ELA tests are imperfect measures of students’ more complex reasoning and writing skills. Correlations between social-emotional measures and state tests as well as open-ended tests are positive but of more moderate magnitude, ranging between 0.21 and 0.41. The pattern of correlations among the social-emotional measures themselves suggests that these scales may capture two distinct competencies: self-regulation and academic mindsets. Grit subscales (especially the perseverance subscale) and effort in class are moderately to strongly correlated and can both be characterized as measures of students’ ability to self-regulate their behavior and attention.
Student-Level Correlations among State Tests, Complex Tasks and Social-Emotional Measures
H. Estimating the Variance of Teacher Effects
I begin by specifying an education production function to estimate teacher effects on student outcomes. A large body of literature has examined the consequences of different value-added model specifications (Todd and Wolpin 2003; Kane and Staiger 2008; Koedel and Betts 2011; Guarino, Reckase, and Wooldridge 2015; Chetty et al. 2014a). Typically, researchers exploit panel data with repeated measures of student achievement to mitigate bias from student sorting by controlling for prior achievement. The core assumption of this approach is that a prior measure of achievement is a sufficient summary statistic for all the individual, family, neighborhood, and school inputs into a student’s achievement up to that time. Models also commonly include a vector of student characteristics, averages of these characteristics and prior achievement at the classroom level, and school fixed effects (see Hanushek and Rivkin 2010).
Researchers often obtain the magnitude of teacher effects from these models by quantifying the variance of teacher fixed effects, $\hat{\sigma}^2_{\hat{\tau}}$, or of “shrunken” Empirical Bayes (EB) estimates, $\hat{\sigma}^2_{\tau^{EB}}$. The EB estimates are a weighted sum of a teacher’s estimated effect, $\hat{\tau}_j$, and the average teacher effect, $\bar{\tau}$, where the weights are determined by the reliability of each estimate.8 However, variance estimates using fixed effects are biased upward because they conflate true variation with variation due to estimation error, while variance estimates using EB teacher effects are biased downward in proportion to the size of the measurement error in the unshrunken estimates (see Jacob and Lefgren 2005, Online Appendix C). The true variance of teacher effects, $\sigma^2_{\tau}$, is therefore bounded between the fixed effect and EB estimators (Raudenbush and Bryk 2002):

(1)  $\hat{\sigma}^2_{\tau^{EB}} \leq \sigma^2_{\tau} \leq \hat{\sigma}^2_{\hat{\tau}}$
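For reference, a standard textbook form of the shrinkage underlying the EB estimates (the precise estimator in the studies cited above may differ slightly) is

$\hat{\tau}_j^{EB} = \lambda_j \hat{\tau}_j + (1 - \lambda_j)\bar{\tau}, \qquad \lambda_j = \frac{\hat{\sigma}^2_{\tau}}{\hat{\sigma}^2_{\tau} + \hat{\sigma}^2_{\epsilon}/n_j},$

where $n_j$ is the number of students linked to teacher $j$ and $\lambda_j$ is the estimated reliability of teacher $j$’s effect, so noisier estimates are shrunk more heavily toward the average.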
Following Nye et al. (2004) and Chetty et al. (2011), I estimate the magnitude of the variance of teacher effects using a direct, model-based approach derived via restricted maximum likelihood estimation. I assume a Gaussian data-generating process, which appears well justified in the data for the state and open-ended tests and serves as a reasonable approximation for the social-emotional measures. This approach is robust to differences in reliabilities across student outcomes—assuming classical measurement error—because it simultaneously models systematic unexplained variance across teachers as well as idiosyncratic student-level variance, yielding a consistent and efficient estimator of the true variance of teacher effects.
To arrive at this model-based estimate, I specify a multilevel covariate-adjustment model as follows:
(2)  $Y_{ij} = f_{dg}(A_{i,t-1}) + X_i\beta + \pi_{sg} + \varepsilon_{ij}$, where $\varepsilon_{ij} = \tau_j + \epsilon_i$.
Here $Y_{ij}$ is a given outcome of interest for student i, in district d, in grade g, with teacher j, in school s, in year t. Across all model specifications, I include a cubic function of students’ prior year achievement on state standardized tests ($A_{i,t-1}$), in both mathematics and ELA, which I allow to vary across districts and grades by interacting all polynomial terms with district-by-grade fixed effects. I also include a vector of controls for observable student characteristics ($X_i$). Student characteristics include indicators for a student’s gender, age, race, FRPL status, English language proficiency status, special education status, and participation in a gifted and talented program.9
I supplement these administrative data with additional student-level controls constructed from survey data collected by the MET Project. These include controls for students’ self-reported prior grades, the number of books in their homes, the degree to which English is spoken at home, and the number of computers in their homes.10 Both theory and prior empirical evidence have shown that grades reflect students’ cognitive skills as well as social-emotional competencies such as grit and effort (Bowen, Chingos, and McPherson 2009). I find that this measure of grades is positively correlated with social-emotional measures even when controlling for prior achievement in math and ELA. Partial correlations in the analytic sample range from 0.04 with growth mindset to 0.22 with perseverance. I include randomization block fixed effects ($\pi_{sg}$) to account for the block randomized design.
In additional models, I attempt to remove peer effects by controlling for a rich set of average classroom covariates.11 These covariates include the average prior achievement in a student’s class in both subjects ($\bar{A}_{j,t-1}$) as well as average student characteristics (using both administrative and survey data) in a student’s class ($\bar{X}_j$). I present models both with and without peer effects to provide informal upper and lower bounds on the true magnitude of teacher effects. Estimates of the magnitude of teacher effects in a single cross-section where teachers are observed with only one class are likely to be biased upward when peer-level controls are omitted and biased downward when they are included (Kane et al. 2013; Thompson, Guarino, and Wooldridge 2015).12
I allow for a two-level error structure for $\varepsilon_{ij}$ where $\tau_j$ represents a teacher-level random effect and $\epsilon_i$ is an idiosyncratic student-level error term. I obtain an estimate of the true variance parameter, $\sigma^2_{\tau}$, directly from the model through restricted maximum likelihood estimation. I specify $\tau_j$ in two different ways—as students’ actual teachers and as their randomly assigned teachers. Modeling the effects of students’ actual teachers may lead to potentially biased estimates due to noncompliance with random assignment. Among those students in the analytic sample, 28.1 percent are observed with nonrandomly assigned teachers. For this reason, I include a rich set of administrative and survey-based controls. I further address the potential threat of noncompliance by exchanging the precision of actual-teacher estimates for the increased robustness of specifying $\tau_j$ as students’ randomly assigned teachers. Estimates from this approach are analogous to Intent-to-Treat effects (ITT).
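A minimal sketch of this estimation step, using the mixed-effects routines in statsmodels, is shown below. Column names are hypothetical, the survey-based controls and the district-by-grade interactions of the prior-achievement cubic are omitted for brevity, and the peer-level controls described above would enter the formula in the same way.

```python
import pandas as pd
import statsmodels.formula.api as smf

def estimate_teacher_sd(df: pd.DataFrame, outcome: str,
                        teacher_col: str = "teacher_id") -> float:
    """REML estimate of the SD of teacher random effects from a simplified
    version of Equation (2), with a teacher-level random intercept."""
    formula = (
        f"{outcome} ~ prior_math + I(prior_math**2) + I(prior_math**3)"
        " + prior_ela + I(prior_ela**2) + I(prior_ela**3)"
        " + C(female) + C(race) + C(frpl) + C(ell) + C(sped) + C(gifted)"
        " + C(randomization_block)"
    )
    fit = smf.mixedlm(formula, data=df, groups=df[teacher_col]).fit(reml=True)
    return float(fit.cov_re.iloc[0, 0]) ** 0.5  # sqrt of the teacher variance component

# estimate_teacher_sd(df, "state_math_z")                         # actual teachers
# estimate_teacher_sd(df, "state_math_z", "assigned_teacher_id")  # ITT analogue
```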
IV. Findings
A. Postattrition Balance Tests
I conduct two tests to assess the degree to which student attrition from the original randomized classroom rosters poses a threat to the randomization design. I begin by testing for balance in students’ average characteristics and prior achievement across classrooms in the analytic sample. I do this by fitting a series of models where I regress a given student characteristic or measure of prior achievement, demeaned within randomization blocks, on a set of indicators for students’ randomly assigned teachers. In Table 4, I report F-statistics of the significance of the full set of randomly assigned teacher fixed effects. I find that, postattrition, students’ characteristics and prior achievement remain largely balanced within randomization blocks. For ten of the twelve measures, I cannot reject the null hypothesis that there are no differences in average student characteristics across randomly assigned teachers. However, I do find evidence of imbalance for students who participated in a gifted program and who were English language learners (ELL). This differential attrition likely occurred because gifted and ELL students were placed into separate classes with performance requirements or teachers who had specialized certifications. To further examine this threat, I replicate my primary analyses in samples that exclude gifted and ELL students and report the results in Online Appendix D. Results are consistent with those reported below with even slightly larger magnitudes of teacher effects.
Tests for Postattrition Randomization Balance in Student Demographic Characteristics and Prior Achievement across Teachers in the Same Randomization Block
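The sketch below illustrates one such balance test under hypothetical column names: a baseline characteristic is demeaned within randomization blocks, and the randomly assigned teacher indicators are tested jointly.

```python
import pandas as pd
import statsmodels.formula.api as smf

def balance_f_test(df: pd.DataFrame, characteristic: str):
    """Joint F-test of randomly assigned teacher indicators for one baseline
    characteristic demeaned within randomization blocks."""
    demeaned = df.groupby("randomization_block")[characteristic].transform(
        lambda s: s - s.mean()
    )
    data = df.assign(x_demeaned=demeaned)
    unrestricted = smf.ols("x_demeaned ~ C(assigned_teacher_id)", data=data).fit()
    restricted = smf.ols("x_demeaned ~ 1", data=data).fit()
    f_stat, p_value, _ = unrestricted.compare_f_test(restricted)
    return f_stat, p_value

# Repeat for each of the twelve measures reported in Table 4:
# balance_f_test(df, "prior_math_z")
```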
I next examine whether there appears to be any systematic relationship between students’ characteristics in the analytic sample and the effectiveness of the teachers to whom they were randomly assigned. In Table 5, I present results from a series of regression models in which I regress prior-year value-added scores of students’ randomly assigned teachers on individual student characteristics and prior achievement. I do this for value-added estimates derived from both math and ELA state tests, as well as the BAM and SAT9-OE exams in the prior academic year.13 Among the 48 different relationships I test, I find that only one is statistically significant at the 5 percent level. This is consistent with random sampling variation given the number of relationships I test.14 Together, these tests of postattrition randomization balance across teachers suggest that the classroom roster randomization process did largely eliminate the systematic sorting of students to teachers commonly present in observational data (Kalogrides and Loeb 2013; Rothstein 2010).
The Relationship between Student Characteristics and Randomly Assigned Teacher Characteristics Postattrition
B. Teacher Effects—Maximum Likelihood Estimates
In Table 6, I present estimates of the standard deviation of teacher effects from a range of models. Column 1 corresponds to the predominant school fixed effect specification in the teacher effects literature reviewed by Hanushek and Rivkin (2010). Consistent with prior studies, maximum likelihood estimates of the magnitude of teacher effects on state test scores are 0.18 SD in math and 0.14 SD in ELA. Using this baseline model, I also find teacher effects on the BAM and SAT9-OE tests of 0.14 SD and 0.17 SD, respectively. Finally, I find suggestive evidence of teacher effects on social-emotional measures ranging from 0.08 SD for consistency of interest (not statistically significant) to 0.20 SD for growth mindset.
Model-Based Restricted Maximum Likelihood Estimates of Teacher Effects on State Tests, Complex Tasks, and Social-Emotional Measures
In my preferred models with randomization-block fixed effects, I find strong evidence of teacher effects on students’ complex task performance and social-emotional competencies, although the magnitudes of these effects differ across measures. Columns 2 and 3 report results from models where I estimate teacher effects using students’ actual teachers. In Columns 4 and 5, I exchange students’ actual teachers with their randomly assigned teachers. Comparing results across models with and without peer effects (Columns 2 vs. 3 and 4 vs. 5) illustrates how the inclusion of peer-level controls somewhat attenuates my estimates by absorbing peer effects that were otherwise attributed to teachers. Focusing on Figure 1, which presents estimates from models with students’ actual teachers that condition on peer controls, I find statistically significant effects of broadly similar magnitude (0.14–0.18 SD) across all outcomes except for consistency of interest, which is both smaller in magnitude and not statistically significant.
The Magnitude of Teacher Effect Estimates on State Tests, Complex Open-Ended Assessments, and Social-Emotional Competencies.
Notes: *p < 0.05, **p < 0.01, ***p < 0.001. Estimates are model-based restricted maximum likelihood estimates of teacher effects using students’ actual teachers and controlling for classroom peer characteristics (Column 3 of Table 6) with samples ranging from 3,435 to 4,075.
Results from models using students’ randomly assigned teachers are slightly attenuated given noncompliance, but remain consistent with estimates reported above. Estimates of teacher effects on academic outcomes from models that include peer controls (Column 5) range from 0.11 SD on the BAM to 0.17 SD for the SAT9-OE. Teacher effects on consistency of interest do not achieve statistical significance, while effects on students’ growth mindset (0.15 SD), perseverance (0.14 SD), and effort in class (0.15 SD) are of similar and even slightly larger magnitude than effects on achievement. Together, these results present strong evidence of meaningful teacher effects on students’ ability to perform complex tasks and social-emotional competencies.
C. Comparing Teacher Effects across Outcomes
I investigate the nature of teacher skills by examining the relationships between individual teachers’ effects across the eight outcomes of interest. In Table 7, I present Pearson correlations of the Best Linear Unbiased Estimators (BLUE) of teacher random effects from the ML model that uses students’ actual teachers and includes peer controls (Column 3 of Table 6).15 Correlations among teacher effects from models using randomly assigned teachers produce a consistent pattern of results but are somewhat attenuated due to noncompliance. I present these results in Online Appendix Table E1.
Correlations of Teacher Effects on State Tests, Complex Tasks, and Social-Emotional Measures
Consistent with past research, I find that the correlation between general education elementary teachers’ value-added on state math and ELA tests is large at 0.58 (Corcoran, Jennings, and Beveridge 2012; Goldhaber, Cowan, and Walch 2013; Loeb, Kalogrides, and Béteille 2012). Elementary teacher effects on state math tests are also strongly related to their effects on the BAM (0.57). Elementary teachers who are effective at teaching more basic computation and numeracy skills appear to be developing their students’ ability to perform complex open-ended tasks in math. This relationship is similar to prior estimates of the correlation between teacher effects on two math exams with more similar content coverage, formats, and levels of cognitive demand (0.64 in Blazar and Kraft 2017; 0.56–0.62 in Corcoran et al. 2012).
In contrast, teacher effects on state ELA exams are a poor proxy for teacher effects on more cognitively demanding open-ended ELA tests. Teacher effects on their students’ performance on state standardized exams assessing reading comprehension with multiple-choice items explain less than 6 percent of the variation in teacher effects on the SAT9-OE, an assessment designed to capture students’ ability to reason about and respond to an extended passage. The correlation, 0.24, is also notably weaker than prior estimates of the correlation between teacher effects on two different reading exams. Papay (2011) found correlations ranging from 0.44 to 0.58 between a state test in reading and the Scholastic Reading Inventory (SRI).16 Corcoran et al. (2012) found nearly identical correlations (0.44–0.58) between teacher effects on the Texas state tests and the Stanford Achievement Test (SAT) in reading.17 In fact, teachers’ value-added to student achievement on the more cognitively demanding open-ended SAT9-OE reading exam is more strongly related to their effects on the similarly demanding open-ended BAM math test (0.46) than with their value-added to state ELA tests.
I find that teacher effects on social-emotional measures are only weakly correlated with effects on both state standardized exams and exams testing students’ performance on open-ended tasks. Among the four social-emotional measures, growth mindset has the strongest and most consistent relationship with teacher effects on state tests and complex task performance, with correlations ranging between 0.10 and 0.21. Teachers’ ability to motivate their students’ perseverance and effort is consistently a stronger predictor of teacher effects on students’ complex task performance than on standardized test scores. Finally, teacher effects across different social-emotional measures are far less correlated than teacher effects on student achievement across subjects. Effects on growth mindset are positively correlated with effects on students’ consistency of interest (0.22), but unrelated to a teacher’s ability to motivate students’ perseverance and effort. Perseverance and effort in class are the only two social-emotional measures for which teacher effects appear to capture the same underlying ability, with a correlation of 0.61. This suggests that teacher effects on students’ willingness to devote effort to their classwork may extend to other contexts as well.
I illustrate the substantial degree of variation in individual teacher effects across measures by providing a scatterplot of teacher effects on state math tests and growth mindset in Figure 2. This relationship captures the strongest correlation I observe between teacher effects on social-emotional competencies and state tests (0.21). A total of 42 percent of teachers in the sample have above average effects on one outcome but below average effects on the other (21 percent in quadrant II and 21 percent in quadrant IV). Only 28 percent of teachers have effects that are above average for both state math tests and growth mindset (quadrant I). The proportion of teachers who have above average effects on both state math tests and other social-emotional measures is even lower. These findings illustrate how teachers are not simply “effective” or “ineffective” but instead have abilities that may differ across multiple dimensions of effectiveness.
Scatterplot of Teacher Effects on State Math Tests and Growth Mindset from Empirical Bayes Estimates (n = 228). The Scatterplot Represents a Correlation of 0.21.
Notes: Empirical Bayes estimates are the Best Linear Unbiased Estimators of teacher random effects from maximum likelihood models that use students’ actual teachers and include peer controls (Column 3 of Table 6). The scales of both teacher effect estimates are measured in student-level standard deviation units of the outcomes.
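The quadrant shares reported above can be tabulated directly from the EB (BLUP) teacher effect estimates; a minimal sketch, assuming a teacher-level data frame with one column per estimated effect (column names hypothetical), is:

```python
import pandas as pd

def quadrant_shares(effects: pd.DataFrame, x: str, y: str) -> pd.Series:
    """Share of teachers in each quadrant defined by above/below-average
    effects on outcomes x and y (EB estimates are centered near zero)."""
    above_x = effects[x] > effects[x].mean()
    above_y = effects[y] > effects[y].mean()
    labels = pd.Series("IV", index=effects.index)  # above x, below y
    labels[above_x & above_y] = "I"
    labels[~above_x & above_y] = "II"
    labels[~above_x & ~above_y] = "III"
    return labels.value_counts(normalize=True).sort_index()

# quadrant_shares(effects, "va_state_math", "va_growth_mindset")
```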
D. Assessing Potential Bias in Teacher Effect Correlations
The pairwise correlations presented in Table 7 are imperfect estimates of the true relationships between teacher effects, although the net direction of potential biases is not obvious. Noise in teacher effect estimates due to the imperfect reliability of student outcome measures will bias estimates downward.18 At the same time, class-specific shocks and unobserved student traits correlated with multiple outcomes can induce an upward bias. I explore the magnitude of potential biases by estimating upper and lower bounds for these correlations.
I first estimate upper bounds for the correlations in Table 7 by disattenuating the estimates for measurement error using an approach analogous to the Spearman (1904) correction described in Online Appendix C. I provide technical details for this procedure and report the results in Online Appendix G. The low estimated reliabilities of the teacher effect estimates (0.48–0.59) nearly double the magnitude of the unadjusted correlations, with some disattenuated correlations exceeding one, outside the possible range of correlation coefficients. This occurs because the Spearman adjustment assumes that errors in the two measures are uncorrelated with each other, an assumption likely violated in this setting given that teacher effects are estimated using the same classroom of students across outcomes. Even these extreme upper-bound estimates show that correlations between teacher effects on state tests and social-emotional competencies are never larger than 0.42 (state math and growth mindset).
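For reference, the correction applied here divides each observed correlation by the geometric mean of the two reliabilities, $r_{xy}^{adj} = r_{xy}/\sqrt{rel_x \, rel_y}$. With estimated reliabilities near 0.5, this roughly doubles each correlation; for example, the observed correlation of 0.21 between effects on state math tests and growth mindset rises to roughly 0.4, consistent with the 0.42 upper bound noted above.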
I next estimate lower bounds for these correlations by examining correlations among teacher effects from different years for a subset of outcomes available in both years.19 This approach purges correlations of the upward bias introduced by correlated errors from a common estimation sample. In both years, I use teacher effect estimates calculated by the MET Project using a standard covariate-adjustment model (Kane and Cantrell 2010) to hold the modeling approach constant. Online Appendix Table G3 compares correlation coefficients calculated among the analytic sample of general elementary school teachers based on estimates from the same year (Panel A) and estimates from different years (Panel B). Consistent with prior studies, I find that teacher effect correlations estimated from the same class are inflated upward, sometimes substantially, relative to correlations of teacher effects estimated across years (Goldhaber, Cowan, and Walch 2013; Kane and Cantrell 2010). The largest degree of upward bias occurs for estimates between outcomes that are more highly correlated, such as state tests and the supplemental open-ended assessments administered by the MET Project. Smaller correlations between teacher effects on achievement measures and students’ self-reported effort in class are biased upward to a slightly lesser degree. Still, the patterns in these lower-bound estimates remain the same—correlations between teacher effects on state tests of different subjects are the largest (0.26), followed by correlations between effects on state tests and open-ended tests (0.06–0.17), and finally correlations between social-emotional competencies and test scores (0.00–0.12).
While it is difficult to know how these biases interact, I interpret these findings to suggest that attenuation bias due to noise in teacher effects is largely if not completely offset by the upward bias due to correlated errors caused by a common classroom sample. I expect the results reported in Table 7 might slightly underestimate the true magnitude of these correlations but support general inferences about the relative magnitude of these correlations across outcomes.
E. Do Teacher Evaluation Systems Capture Teacher Effects on Complex Cognitive Skills and Social-Emotional Competencies?
Under the Obama administration, the Race to the Top grant competition and state waivers for regulations in the No Child Left Behind Act incentivized states to make sweeping changes to their teacher evaluation systems. Today, most states have implemented new systems that incorporate multiple measures, including estimates of contributions to student learning, classroom observation scores, student surveys, and assessments of professional conduct (Steinberg and Donaldson 2016). Teachers’ evaluation ratings are typically constructed from a weighted combination of these measures. Classroom observations nearly always account for the largest percentage of the overall score, although the weights assigned to measures vary meaningfully across districts and states (Steinberg and Kraft 2017).
The MET Project provides a unique opportunity to further explore the relationship between the metrics used in new teacher evaluation systems and teacher effects on students’ complex cognitive skills and social-emotional competencies. In Table 8, I present correlations between the teacher effects I estimate above and a range of evaluation measures from both the same year and the prior year. Estimating these relationships using evaluation measures from the prior year serves to eliminate potential upward bias due to correlated errors from a common student sample, as described above. At the same time, the relationships between performance measures and true teacher effects are likely somewhat stronger than the estimates reported in Table 8, which rely on imprecise measures from a single year (Kane and Staiger 2012). I compare my teacher effect estimates with the most common metrics used in teacher evaluation systems: value-added in math and ELA20; ratings on two widely used classroom observation instruments, the Classroom Assessment Scoring System (CLASS) and the Framework for Teaching (FFT); students’ opinions of their teachers’ instruction captured on the Tripod student survey (Kane and Cantrell 2010); and principals’ overall ratings of teachers’ performance using a six-point Likert scale ranging from “very poor” to “excellent.”
Correlations of Teacher Performance Measures with Teacher Effects on State Tests, Complex Tasks, and Social-Emotional Measures
I find that neither value-added scores, classroom observation scores, student surveys, nor principal ratings serve as close proxies for teacher effects on complex cognitive skills or social-emotional competencies. Principal ratings have the strongest relationship with teacher effects on growth mindset with a correlation of 0.17. In aggregate, classroom observation scores do not appear to reflect teacher effects on this broader set of outcomes despite the wide range of domains covered by these rubrics. In supplemental analyses, I find that the strongest correlation across all eight teacher effects and the 12 CLASS domains is 0.16 (p = 0.02) between teacher effects on effort in class and the “Productivity” domain. The strongest correlation with the eight FFT domains is 0.17 (p = 0.01) between teacher effects on growth mindset and the “Establishing a Culture for Learning” domain. Student surveys have the strongest relationship with teacher effects on students’ perseverance and effort in class, although these relationships appear to be largely an artifact of correlated errors as they converge to zero when using estimates based on student ratings from the prior year.
I illustrate how summative teacher ratings from high-stakes evaluation systems compare to the teacher effects I estimate by constructing proxy summative scores for teachers from the performance measures described above. I calculate scores using a weighted linear sum of value-added, observation, student, and principal ratings, with weights that reflect a prototypical evaluation system for teachers in tested grades and subjects.21 As shown in Table 8, teachers’ summative ratings are only weakly related to their ability to develop students’ complex cognitive skills and social-emotional competencies. The two strongest relationships are with teacher effects on open-ended tasks in math and growth mindset, with correlations of 0.19. Among teachers ranked in the bottom fourth of the evaluation ratings, I estimate that 27 percent are actually in the top quartile of teacher effects on complex math tasks and 21 percent are in the top quartile of effects on growth mindset. These findings suggest that high-stakes decisions based on teacher performance measures commonly used in new evaluation systems largely fail to capture the degree to which teachers are developing students’ complex cognitive skills and social-emotional competencies.
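A minimal sketch of this composite construction is shown below. The weights are purely illustrative placeholders rather than the prototypical weights referenced in footnote 21, and the column names are hypothetical.

```python
import pandas as pd

ILLUSTRATIVE_WEIGHTS = {      # placeholder weights for illustration only
    "value_added": 0.40,
    "observation": 0.40,
    "student_survey": 0.10,
    "principal_rating": 0.10,
}

def summative_score(measures: pd.DataFrame, weights=ILLUSTRATIVE_WEIGHTS) -> pd.Series:
    """Weighted linear sum of standardized teacher performance measures."""
    z = (measures - measures.mean()) / measures.std()
    return sum(w * z[col] for col, w in weights.items())

# ratings["composite"] = summative_score(ratings[list(ILLUSTRATIVE_WEIGHTS)])
```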
V. Robustness Tests
A. Falsification Tests and Differential Reliability across Measures
At their core, my teacher effect estimates are driven by the magnitude of differences in classroom means across a range of different outcomes. Given the small number of students taught by each teacher—an average of just over 17 in the analytic sample—it is possible that these estimates are the result of sampling error across classrooms. I conduct several falsification tests and find no compelling evidence that the results are driven by sampling error. First, I generate a random variable from the standard normal distribution so that it shares the same mean and variance as the standardized outcomes. I then reestimate my taxonomy of models using these random values as outcomes and repeat this process 100 times. I report the average of these simulated results in Panel A of Table 9. The estimates across models are quite small, between 0.03 and 0.04 standard deviations.
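A minimal sketch of this simulation is below, under stated assumptions: the student-level data frame and its "teacher" column are hypothetical, the covariates in the full taxonomy of models are omitted for brevity, and a simple random-intercept specification stands in for the ML models described earlier.

```python
# A minimal sketch, assuming a hypothetical student-level DataFrame `df`
# with a "teacher" column; covariates from the full models are omitted.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def placebo_teacher_sd(df: pd.DataFrame, n_sims: int = 100, seed: int = 0) -> float:
    """Average estimated teacher SD when outcomes are pure standard-normal noise."""
    rng = np.random.default_rng(seed)
    sds = []
    for _ in range(n_sims):
        sim = df.copy()
        sim["y_placebo"] = rng.standard_normal(len(sim))  # outcome unrelated to teachers
        fit = smf.mixedlm("y_placebo ~ 1", sim, groups=sim["teacher"]).fit(reml=True)
        sds.append(np.sqrt(fit.cov_re.iloc[0, 0]))  # SD of the teacher random intercept
    return float(np.mean(sds))
```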
Falsification Tests of Teacher Effects
I next test for teacher effects on a range of student characteristics that should be unaffected by teachers. These characteristics include gender, age, free or reduced-price lunch eligibility, and race/ethnicity. I drop a given measure from the set of covariates when I use it as an outcome in these falsification tests. As shown in Table 9 Panel B, I find no evidence of teacher effects on any of these measures except age in models using students’ actual teachers.
In Table 9 Panel C, I further demonstrate that ML estimates are not driven by unexplained variance due to the lower reliability of open-ended tests or survey scales. I test this by ex post randomly reassigning students to teachers in the analytic sample in a way that exactly replicates the observed number of students with each teacher. This allows me to examine the variance in teacher effects across outcomes when, by design, teacher effects should be zero. Averaging estimates across 100 repeated random draws, I find that the majority of estimates converge to precise zeros. Only estimates for consistency are of meaningful magnitude (0.08), but this is of less concern given that I fail to find any significant effects on this outcome across the primary analyses. Together, these falsification tests lend strong support to the validity of the teacher effect estimates.
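The reassignment placebo can be sketched as follows: permuting the observed teacher labels across students preserves each teacher's class size by construction while severing any true student-teacher link, after which the same estimation routine is applied to the actual outcomes. Variable names are hypothetical.

```python
# A minimal sketch, assuming a hypothetical student-level DataFrame with a
# "teacher" column; shuffling labels preserves observed class sizes exactly.
import numpy as np
import pandas as pd


def reassign_teachers(df: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    """Return a copy of df with teacher labels randomly permuted across students."""
    placebo = df.copy()
    placebo["teacher"] = rng.permutation(placebo["teacher"].to_numpy())
    return placebo


# Usage: average the estimated teacher SD over 100 such reassignments, reusing
# the same random-intercept routine sketched for the random-outcome test above.
```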
B. Potential Reference Bias in Social-Emotional Measures
Previous research has raised concerns about potential reference bias in scales measuring social-emotional skills based on student self-reporting (Duckworth and Yeager 2015).22 In this context, the MET Project’s experimental design restricts the identifying variation to within school–grade cells, limiting the potential for reference bias at the school level and grade level within a school. Additional empirical tests provide further evidence against reference bias as a primary driver of the main results. Following West et al. (2016), I examine how the direction and magnitude of the relationship between these social-emotional measures and student achievement gains on state standardized tests change when collapsed from the student level to the class and school levels.23
As shown in Table 10, simple Pearson correlation coefficients between the four social-emotional measures and student gains on state math and ELA tests are all small, positive, and statistically significant at the student level. Collapsing the data at the classroom or school level does not reverse the sign of any of the student-level correlations, and, if anything, increases the positive relationships between self-reported social-emotional competencies and student gains. Although I cannot rule out the potential of reference bias in the measures, it does not appear as though teachers or schools where students are making larger achievement gains are also systematically changing students’ perceptions of what constitutes gritty behavior and high levels of effort.
Student-, Class-, and School-Level Correlations between Social-Emotional Measures and Gain Scores on State Tests
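The sketch below shows one way to implement this check under stated assumptions: correlations between the self-reported scales and gain scores are computed at the student level and then recomputed after collapsing to class and school means; the column names (grit, growth_mindset, effort, math_gain, class_id, school_id) are hypothetical placeholders.

```python
# A minimal sketch, assuming a hypothetical student-level DataFrame with
# social-emotional scales, a gain-score column, and class/school identifiers.
import pandas as pd


def multilevel_correlations(df: pd.DataFrame, se_cols: list, gain_col: str = "math_gain") -> pd.DataFrame:
    """Correlations with gain scores at the student, class, and school levels."""
    out = {"student": df[se_cols + [gain_col]].corr()[gain_col].drop(gain_col)}
    for level in ["class_id", "school_id"]:
        collapsed = df.groupby(level)[se_cols + [gain_col]].mean()
        out[level] = collapsed.corr()[gain_col].drop(gain_col)
    return pd.DataFrame(out)


# Usage: multilevel_correlations(df, ["grit", "growth_mindset", "effort"])
# A sign reversal when moving from the student to the school level would be the
# signature of reference bias highlighted by West et al. (2016).
```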
C. Removing Prior Test Scores
Across all models, I include prior achievement scores from state tests along with additional controls for student (and peer) characteristics that serve to increase the precision of my estimates and to guard against any nonrandom attrition and sorting across classrooms that may have occurred. The availability of prior state test scores but not prior scores on open-ended tests or social-emotional competencies creates an asymmetry in that only models with state test scores as outcomes include corresponding controls for prior outcome measures. However, unlike prior approaches that rely primarily on lagged test scores, my identification strategy leverages the random assignment of class rosters to address student sorting. I examine the sensitivity of the ML variance estimates from Table 6 and correlations across teacher effects from Table 7 by comparing them to estimates from models that exclude controls for prior test scores as well as peer average test scores.
Teacher effect estimates from models that omit prior scores, presented in Online Appendix H, are slightly larger, likely due to between-classroom variance within a randomization block that was previously accounted for by conditioning on individual and peer-average prior achievement. Estimates from models that include peer controls increase the most, by between 0 and 35 percent, suggesting that average peer achievement in the prior year plays an important role in capturing peer effects. Correlations among teacher effects are meaningfully larger when models do not include lagged test scores, but their relative magnitude across outcomes remains largely the same. The inflated magnitude of these correlations is likely due to an increase in correlated errors among teacher effects that prior test scores helped to reduce. Overall, these results suggest the primary findings are not driven by the asymmetric set of lagged outcome measures.
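As an illustration of this sensitivity check, the sketch below re-fits a teacher random-intercept model with and without lagged individual and peer-average achievement and reports the percent change in the estimated teacher-level standard deviation. The formulas and column names are hypothetical stand-ins for the full specifications.

```python
# A minimal sketch, assuming hypothetical column names for prior achievement,
# peer-average prior achievement, and randomization-block indicators.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def teacher_sd(formula: str, df: pd.DataFrame) -> float:
    """Estimated SD of the teacher random intercept for a given mean specification."""
    fit = smf.mixedlm(formula, df, groups=df["teacher"]).fit(reml=True)
    return float(np.sqrt(fit.cov_re.iloc[0, 0]))


def pct_change_without_priors(df: pd.DataFrame, outcome: str = "open_ended_math") -> float:
    """Percent increase in the teacher SD when prior-score and peer controls are dropped."""
    full = teacher_sd(f"{outcome} ~ prior_math + prior_ela + peer_prior_math + C(block)", df)
    reduced = teacher_sd(f"{outcome} ~ C(block)", df)
    return 100 * (reduced - full) / full
```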
D. Teacher Effects—Upper- and Lower-Bound Average Residual Estimates
As a robustness check for my preferred model-based ML estimation approach, I also estimate upper and lower bounds for the variance of teacher effects using a two-step estimation approach following Kane et al. (2013). This allows me to relax the random effects normality assumption necessary for Equation 2. Given that teacher fixed effects are perfectly collinear with classroom-level controls in the analytic sample, I first fit the covariate-adjustment model described in Equation 2, omitting teacher random effects. In a second step, I average student residuals at the teacher level to estimate teacher effects. The variance of these average classroom residuals produces the upper-bound estimates reported in Panel A of Online Appendix Table I1. I then shrink the average classroom residuals as described in Footnote 8.24 The variance of these shrunken EB teacher effects provides the lower-bound estimates reported in Panel B of Table I1.
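The two-step bounding exercise can be summarized in the short sketch below, assuming a vector of first-step student residuals and the corresponding teacher identifiers (hypothetical names): the variance of the unshrunken classroom means gives the upper bound, and the variance of the λj-shrunken means gives the lower bound, with λj built from the sample analogs described in Footnote 24.

```python
# A minimal sketch, assuming `resid` holds student residuals from the first-step
# covariate-adjustment model and `teacher` holds matching teacher identifiers.
import numpy as np
import pandas as pd


def residual_bounds(resid: pd.Series, teacher: pd.Series) -> tuple:
    """Upper- and lower-bound SDs of teacher effects from average classroom residuals."""
    grouped = pd.DataFrame({"resid": resid, "teacher": teacher}).groupby("teacher")["resid"]
    mean_j = grouped.mean()                        # average classroom residuals
    se2_j = grouped.var(ddof=1) / grouped.count()  # squared SEs of the classroom means
    upper_var = mean_j.var(ddof=1)                 # unshrunken (upper-bound) variance
    sigma2_tau = max(upper_var - se2_j.mean(), 0.0)  # Footnote 24 sample analog
    lambda_j = sigma2_tau / (sigma2_tau + se2_j)     # shrinkage factor for each teacher
    shrunken = lambda_j * mean_j                     # EB-style shrunken effects
    return float(np.sqrt(upper_var)), float(np.sqrt(shrunken.var(ddof=1)))
```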
Estimated bounds conform to the ex ante predictions described in Section III.H and almost uniformly contain my preferred estimates in Table 6. As expected, unshrunken average residuals overstate the effects of teachers while shrunken average residuals understate the magnitude of these effects. Unshrunken teacher effects on open-ended tasks and social-emotional measures are all larger than those on state tests, whereas before they were of similar magnitude. Unlike the ML estimates, average residuals are biased differentially because outcomes with lower reliability and more measurement error have more unexplained variance across classrooms. Those measures with the highest reliabilities are closest to the preferred ML estimates. Shrunken average residual estimates produce lower bounds that in some cases converge to zero. These estimates are quite conservative given that the small student sample sizes in a single elementary school classroom result in low reliabilities for individual estimates, which are then shrunken substantially towards the grand mean of zero. Overall, these results confirm that my qualitative findings are not a product of the identifying assumptions of the model-based ML estimation process.
VI. Conclusion
The hallmark education policy reforms of the early 21st century—school accountability and teacher evaluation—created strong incentives for educators to improve student performance on state standardized tests. Authentic improvements in students’ underlying content knowledge and basic skills assessed on these tests are important for success in school and later in life. As I show using the ELS dataset, standardized test scores are strong predictors of a range of adult outcomes. However, these tests provide a narrow measure of the full set of student abilities and competencies that predict positive adult outcomes. Questions remain about whether teachers and schools that are judged as effective by state standardized tests are also developing students’ more complex cognitive skills and social-emotional competencies. This study suggests that this is often not the case.
The large differences in teachers’ ability to raise student performance on achievement tests (Chetty et al. 2014a; Hanushek and Rivkin 2010) and the inequitable distribution of those teachers who are most successful at raising achievement (Clotfelter, Ladd, and Vigdor 2006; Goldhaber, Lavery, and Theobald 2015; Lankford, Loeb, and Wyckoff 2002) have become major foci of academic research and education policy. The substantial variation I find in teacher effects on students’ complex task performance and social-emotional competencies further reinforces the importance of teacher quality but complicates its definition. Measures of teachers’ contributions to students’ performance on state tests in math are strong proxies for their effects on students’ abilities to solve complex math problems. However, teacher effects on state ELA tests contain more limited information about how well a teacher is developing students’ abilities to reason about and draw inferences from texts. Teacher effects on state tests are even weaker indicators of the degree to which they are developing students’ social-emotional competencies.
Teaching core academic skills along with social-emotional competencies and the ability to perform unstructured tasks need not be competing priorities in a zero-sum game. I find that the relationships between teacher effects across this expanded set of student outcomes are consistently positive although often weak. As these analyses demonstrate, there are teachers who teach core academic subjects in ways that also develop students’ complex problem-solving skills and social-emotional competencies. I find that about 8 percent of teachers are in the top quartile of value-added to both complex cognitive skills and social-emotional competencies. Roughly 3 percent of teachers are in the top quartile of value-added to state tests, complex cognitive skills, and social-emotional competencies. Going forward, we need to know more about the types of curriculum, instruction, organizational practices, and school climates that allow teachers to develop a wider range of students’ skills and competencies than are commonly assessed on state achievement tests.
Current accountability and evaluation systems in education provide limited incentives for teachers to focus on helping students develop complex problem-solving skills and social-emotional competencies. Findings from this paper suggest that value-added to state tests, observation scores, student surveys, and principal ratings all fail to serve as close proxies for teacher effects on important skills and competencies not captured by state tests. Between one in six and one in four teachers rated among the top 10 percent based on a weighted composite of commonly used performance measures have below-average effects on complex problem-solving skills and social-emotional competencies. In recent years, dozens of states have adopted new assessments aligned with the Common Core State Standards that move in the direction of assessing more complex cognitive skills (Doorey and Polikoff 2016). While these assessments may better align incentives for teachers, they face several challenges, including the traditionally lower reliability and higher cost of scoring constructed response items, increasing political opposition, and public pushback to higher standards that result in fewer students scoring at proficient or advanced levels. The long-term success of these reforms may ultimately be determined by the degree to which teachers receive the support they need to adapt their teaching to help students meet the demands of these higher standards.
Developing practical and reliable measures of students’ social-emotional competencies that could be used in school accountability or teacher evaluation systems poses an even greater challenge. Psychologists have argued that the social-emotional measures used in this study are not sufficiently robust to be used in high-stakes settings to compare teachers across schools (Duckworth and Yeager 2015). Student self-reports or teacher assessments of social-emotional measures are easy to game, and we know little about their properties when stakes are attached. While there exists potential to improve the reliability and robustness of these measures, it may be that observable student outcomes such as GPA, grade retention, attendance, and disciplinary incidents are ultimately more tractable measures for policy purposes (Whitehurst 2016). Persistent measurement challenges and the susceptibility of even these observable measures to manipulation may mean that it is more productive to focus on formative assessment approaches that help promote a dialogue among teachers, parents, and students about the importance of social-emotional development. As the adage often attributed to Albert Einstein goes, “Not everything that counts can be counted.” What is clear is that our current conception of teacher effectiveness needs to be expanded to encompass the multiple ways in which teachers affect students’ success in school and life.
Footnotes
* Supplementary materials are freely available online at: http://uwpress.wisc.edu/journals/journals/jhr-supplementary.html
↵1. Past MET Project reports have primarily focused on developing a composite measure of teacher effectiveness for forecasting effects on student achievement (Kane and Staiger 2012) and validating this measure using random assignment (Kane et al. 2013). Included in these reports are estimates of teacher effects on open-ended cognitively demanding tests in a covariate adjusted value-added framework (Kane and Cantrell 2010; Tables 4 and 5) and estimates of the causal relationship between a composite measure of teacher effectiveness and students’ social-emotional competencies (Kane et al. 2013; Table 14).
↵2. Paul Tough’s (2012) best-selling book How Children Succeed helped to propel grit into the national dialogue about what schools should be teaching. The White House has convened meetings on the importance of “Academic Mindsets” (Yeager et al. 2013b) and the Department of Education has commissioned a paper on “Promoting Grit, Tenacity, and Perseverance” (Shechtman et al. 2013).
↵3. For an overview of the teacher value-added literature see Koedel, Mihaly, and Rockoff (2015). For an extensive discussion on the validity of teacher value-added models see Rothstein (2010), Chetty et al. (2014a), and Rothstein’s (2017) response to Chetty and his colleagues.
↵4. Detailed descriptions of the MET data are available at www.metproject.org.
↵5. Detailed descriptions of the randomization design and process can be found in Kane et al. (2013) and the “Measures of Effective Teaching User Guide” (Bill & Melinda Gates Foundation 2013).
↵6. Out of the six state ELA exams, four consisted of purely multiple-choice items (FL, NC, TN, and TX), while two also included open-response questions (CO and NY). Among the math exams, two included multiple-choice questions only (TN and TX), three contained gridded response items that require students to complete a computation and input their answer (CO, FL, and NC), and one included several short and extended response questions (NY).
↵7. All Online Appendixes are available online at http://jhr.uwpress.org/
↵8. Formally, the shrunken (empirical Bayes) estimate for teacher j is $\hat{\tau}_j^{EB} = \hat{\lambda}_j \hat{\tau}_j$, where $\hat{\lambda}_j = \hat{\sigma}^2_{\tau} / (\hat{\sigma}^2_{\tau} + \hat{\sigma}^2_{\varepsilon_j})$. Here λj is the ratio of true teacher variation to total observed teacher variance.
↵9. Data on FRPL were not provided by one district. I account for this by including a set of district-specific indicators for FRPL and imputing all missing data as zero.
↵10. I impute values of zero for students with missing survey data and include an indicator for missingness.
↵11. I calculate peer characteristics based on all students who were observed in a teacher’s classroom, regardless of whether they were included in the classroom roster randomization process or not.
↵12. In this context where teacher and classroom peer effects are collinear, models that omit peer effects will conflate variation in teacher effect estimates with variation in peer effects across classrooms. The direction and magnitude of bias depend on the correlation between teacher and peer effects. Given the random assignment of class rosters in the MET data, we would expect estimates of the standard deviation of teacher effects from ML models without peer controls to be inflated. By this same logic, we would expect estimates of teacher effects from ML models with peer controls to overattribute variation in outcomes across classrooms to observed peer characteristics. This is because the ML models solve for the coefficients associated with the structural model, which include peer measures as the only classroom-level covariates and partition the remaining variance to estimate the magnitude of teacher effects. In application, the direction of bias is not always uniform given noncompliance and the nonrandom assignment of new students not included in the roster randomization process.
↵13. I use value-added estimates calculated by the MET Project because the district-wide data necessary to replicate these estimates are not publicly available. For more information about the value-added model specification see Bill & Melinda Gates Foundation (2013).
↵14. Postattrition, students from low-income families are paired with randomly assigned teachers who have math value-added scores that are, on average, 0.017 standard deviations (SD) higher on the state math exam in the prior year. This relationship is in the opposite direction from the type of sorting researchers are typically worried about, where more advantaged students are sorted to higher performing teachers. Even with the limited power for these tests, the magnitudes of these estimates, which are consistently less than 0.015 SD and never larger than 0.035 SD, are small relative to a standard deviation in the distribution of teacher effects in the nonexperimental 2010 MET data (Math 0.226 SD, ELA 0.170 SD, BAM 0.211 SD, SAT9-OE 0.255 SD).
↵15. These are analogues to empirical Bayes estimates.
↵16. Papay (2011) finds much lower correlations between the state test and the SAT in reading (0.15–0.36) and between the SRI and the SAT in reading (0.23–0.40). However, the SAT was administered in the fall, likely confounding teacher effect estimates in time t with both differential summer learning and, to a lesser degree, a student’s teacher in time t + 1. The correlations I report in the text are based on exams that were both given in the spring.
↵17. Corcoran et al. (2012) report that the state exams and the SAT in reading were administered “at roughly the same time of year” (p. 4).
↵18. I also examine the degree to which sampling error may attenuate these correlation coefficients by estimating the sensitivity of my estimates to class size. I present the results in Online Appendix F. These findings suggest the post hoc predicted BLUP random effect estimates I use when correlating teacher effects sufficiently correct for sampling error due to small class sizes.
↵19. This approach eliminates the direct correlation between the errors of individual students across outcomes, but is still susceptible to differential sorting patterns among teachers that are stable across classes or years. It also implicitly assumes an individual teacher’s effect does not change over time or differ based on class or school characteristics.
↵20. This value-added performance measure differs from my teacher effect estimates in several ways. It is the average of teacher effect estimates in math and reading calculated by the MET Project using a standard covariate adjustment model and including all students in teachers’ classes regardless of whether students were part of the roster-randomization study (see Bill & Melinda Gates Foundation 2013). Similar to Table 7, teacher effect estimates are post hoc predicted BLUP random effect estimates derived from a model using students’ actual teachers and controlling for classroom peer characteristics. The estimation sample is limited to students who were included in the roster randomization process as described in Section III.C.
↵21. I draw upon evidence from Steinberg and Donaldson (2016) to select metrics and weights. I standardize all four performance measures to be mean zero with a variance of one and then add them using the following weights: Score = 0.50 * CLASS + 0.35 * ValueAdded + 0.05 * Survey + 0.10 * Principal Rating. Using FFT in place of CLASS, as well as alternative weights, produces similar results.
↵22. For example, studies have found that oversubscribed urban charter schools with explicit school-wide cultures aimed at strengthening students’ social-emotional competencies appear to negatively affect students’ self-reported grit, but have large positive effects on achievement and persistence in school (West et al. 2016; Dobbie and Fryer 2015).
↵23. West et al. (2016) find suggestive evidence of reference bias in self-reported measures of grit, conscientiousness, and self-control in a sample of students attending traditional, charter, and exam schools in Boston. They find that correlations between social-emotional measures and overall student gains become negative when collapsed to the school level. This is analogous to the classic example of reference bias in cross-cultural surveys where, despite a widely acknowledged cultural emphasis on conscientious behavior, individuals in East Asian countries rate themselves lower in conscientiousness than do individuals in other regions (Schmitt et al. 2007). Notably, they find little evidence of reference bias on the growth mindset scale, possibly because it asks students about beliefs that are not easily observed and, thus, less likely to be judged in reference to others.
↵24. Following Jacob and Lefgren (2008), I estimate λj using sample analogs, where $\hat{\sigma}^2_{\tau}$ is approximated by subtracting the average of the squared standard errors of the average classroom residuals from the variance of these average classroom residuals ($\hat{\sigma}^2_{\tau} = \mathrm{Var}(\bar{\varepsilon}_j) - \overline{SE(\bar{\varepsilon}_j)^2}$), and $\hat{\sigma}^2_{\varepsilon_j}$ is the squared standard error of teacher j’s average classroom residuals ($SE(\bar{\varepsilon}_j)^2$). I calculate standard errors by dividing the standard deviation of student residuals in a teacher’s classroom by the square root of the number of students in the teacher’s class.
- Received September 2016.
- Accepted July 2017.