Abstract
This paper investigates the effects of being evaluated under a novel subjective assessment system where independent inspectors visit schools at short notice, disclose their findings, and sanction schools rated fail. I demonstrate that a fail inspection rating leads to test score gains for primary school students. I find no evidence to suggest that fail schools are able to inflate test score performance by gaming the system. Relative to purely test-based accountability systems, this finding is striking and suggests that oversight by evaluators who are charged with investigating what goes on inside the classroom may play an important role in mitigating such strategic behavior. There appear to be no effects on test scores following an inspection for schools rated highly by the inspectors. This suggests that any effects from the process of evaluation and feedback are negligible for nonfailing schools, at least in the short term.
I. Introduction
In an effort to make public organizations more efficient, governments around the world make use of “hard” performance targets. Examples include student test scores for the schooling sectors in the United States, England, and Chile (Figlio and Loeb 2011) and patient waiting times in the English public healthcare system (Propper et al. 2010). Accountability based on hard or objective performance measures has the benefit of being transparent, but a potential drawback is that such schemes may lead to gaming behavior in a setting where incentives focus on just one dimension of a multifaceted outcome.1
Complementing objective performance measurement with subjective evaluation may help ameliorate such dysfunctional responses.2 In England, the setting for this paper, public (state) schools are subjected to inspections as well as test-based accountability. Under this regime, independent evaluators (inspectors) visit schools with a maximum of two days’ notice, assess the school’s performance, and disclose their findings on the internet. Inspectors combine hard metrics, such as test scores, with softer ones, including observations of classroom teaching, in order to arrive at a judgment of school quality. Furthermore, schools rated “fail” may be subject to sanctions, such as more frequent and intensive inspections.
The first question addressed in this paper is whether student test scores improve in response to a fail inspection rating. If a school fails its inspection, the stakes—certainly for the school principal, or headteacher—are high.3 Second, I investigate whether any estimated positive effect of a fail inspection on test scores can be explained by strategic or dysfunctional responses by teachers. As has been documented in the literature on test-based accountability, when a school’s incentives are closely tied to test scores, teachers will often adopt strategies that artificially boost the school’s measured performance. Strategies may involve excluding low-ability students from the test-taking pool and targeting students on the margin of passing proficiency thresholds (Figlio 2006; Jacob 2005; Neal and Schanzenbach 2010). The empirical strategy employed in this study allows for tests of gaming behavior on a number of key margins. Third, I assess whether the act of inspection (for nonfailing) schools yields any short-term test score gains. One hypothesis is that inspectors may provide valuable feedback that helps raise school productivity.4
However, empirically identifying the effect of a rating on test scores is plagued by the kind of mean reversion problems encountered in evaluations of active labor market programs (Ashenfelter 1978). As explained below, assignment to treatment—for example, a fail grade—is at least partly based on past realizations of the outcome, test scores. Figure 1 demonstrates the relevance of this problem in the current setting. This figure makes it clear that any credible strategy must overcome the concern that poor performance prior to a fail inspection is simply due to bad luck and that test scores would have risen even in the absence of the inspection.
In order to assess the causal effect of a fail inspection, this study exploits a design feature of the English testing system. As explained below, tests for age-11 students in England are administered in the second week of May in each year. These tests are marked externally, and results are released to schools and parents in mid-July. The short window between May and July allows me to address the issue of mean reversion: Schools failed in June are failed after the test in May but before the inspectors know the outcome of the tests.5 By comparing schools failed early in the academic year—September, say—with schools failed in June, I can isolate mean reversion from the effect of the fail inspection.6
Results using student-level data from a panel of schools show that students at early failed schools (the treatment group) gain around 0.1 of a student-level standard deviation on age-11 national standardized mathematics and English tests relative to students enrolled in late fail schools (the control group). This overall finding masks substantial heterogeneity in treatment effects. In particular, the largest gains are for students scoring low on the prior (age-7) test; these gains cannot be explained by ceiling effects for the higher-ability students.7 For mathematics, students in the bottom quartile of the age-7 test score distribution gain 0.2 of a standard deviation on the age-11 test, while for students in the top quartile the gain is 0.05 of a standard deviation.
Tests for gaming behavior show that teachers do not exclude low-ability students from the test-taking pool. Next, the evidence tends to reject the hypothesis that teachers target students on the margin of attaining the official proficiency level at the expense of students far above or below this threshold. Finally, there is evidence to suggest that for some students gains last into the medium term, even after they have left the fail school. These findings are consistent with the notion that teachers inculcate real learning and not just test-taking skills in response to the fail rating.
Having ruled out important types of gaming behavior, I provide evidence on what might be driving the main results. First, I differentiate between the two subcategories of the fail rating—termed “moderate” and “severe” fail. As explained below, the former category involves increased oversight by the inspectors but does not entail other dramatic changes in inputs or school principal and teacher turnover. The results show that even for this category of moderate fail schools there are substantial gains in test scores following a fail inspection. Second, employing a survey of teachers, I provide evidence that a fail inspection does not lead to higher teacher turnover relative to a set of control schools. However, teachers at fail schools do appear to respond by improving classroom discipline. Piecing this evidence together suggests that greater effort by the current stock of classroom teachers at fail schools is an important mechanism behind the test score gains reported above.
Finally, turning to possible effects arising from the act of inspection, prior evidence suggests that structured classroom observations may provide valuable feedback to teachers and can help raise teacher productivity (Taylor and Tyler 2011).8 As discussed below, one aspect of inspections is to provide feedback to the school.9 This mechanism may be isolated from the threat or sanctions element by focusing on those schools that receive the top two ratings, Outstanding and Good. If the effects of receiving feedback from inspectors are important in the short term, then one would expect to uncover a positive treatment effect.10 If, on the other hand, schools relax in the immediate aftermath of an inspection, then one may expect to find a negative effect on test performance.
Employing the same strategy as that for evaluating the effects of a fail inspection, the empirical results reveal that the short-term effects on test scores for schools receiving a top rating are close to zero, statistically insignificant, and relatively precisely estimated. These results suggest that, at least in the short term, there is no evidence of a strong positive effect from receiving feedback from the inspectors, nor is there any evidence of negative effects arising from slack or teachers taking their “foot off the pedal” postinspection.11
This paper contributes to the literature on the effects of school accountability and provides new evidence on the effects of inspections.12 The findings conform with results from previous studies that broadly show that subjecting underperforming schools to pressure leads to test score gains (Figlio and Rouse 2006; Chakrabarti 2007; Reback 2008; Chiang 2009; Neal and Schanzenbach 2010; Rouse et al. 2013; Rockoff and Turner 2010; Reback, Rockoff, and Schwartz 2011). In addition, this study sheds light on the possibility of mitigating distortionary behavior under test-based accountability systems by complementing them with inspections. Eliminating some of the welfare-eroding strategic behavior documented in purely test-based regimes may be a key aim of an inspection system. Finally, this study demonstrates the usefulness of inspections in improving outcomes for lower-ability or poorer students, a group under keen focus in policy discussions on education and inequality (for example, Heckman 2000; Cullen et al. 2013). The finding that inspections may be especially helpful in raising test scores for students from poorer households conforms with emerging evidence suggesting that such families may face especially severe information constraints.13
The remainder of this paper is laid out as follows. Section II describes the context for this study and discusses the prior literature on the effects of inspections. Section III lays out the empirical strategy adopted to evaluate the effect of a fail inspection on student test scores. This section also describes the empirical methods employed to test for strategic behavior by teachers in response to the fail rating. Section IV reports the results of this analysis. Section V discusses the results of receiving a nonfail inspection rating, and Section VI concludes.
II. Background
The English public schooling system combines centralized testing with school inspections. Tests take place at ages 7, 11, 14, and 16; these are known as the Key Stage 1 to Key Stage 4 tests, respectively.14 Successive governments have used these tests, especially Key Stages 2 and 4, as pivotal measures of performance in holding schools to account. For further details see, for example, Machin and Vignoles (2005).15
Since the early 1990s all English public schools have been inspected by a government agency called the Office for Standards in Education, or Ofsted. As noted by Johnson (2004), Ofsted has three primary functions: (i) to offer feedback to the school principal and teachers; (ii) to provide information to parents to aid their decision-making process;16 and (iii) to identify schools that suffer from “serious weakness.”17 Although Ofsted employs its own in-house team of inspectors, the body contracts out the majority of inspections to a handful of private sector and not-for-profit organizations via a competitive tendering process.18 Responsibility for setting overall strategic goals and objectives, for the inspection framework that guides the process of inspection, and for the quality of inspections remains with Ofsted.
Over the period relevant to this study, schools were generally inspected once during an inspection cycle.19 An inspection involves an assessment of a school’s performance on academic and other measured outcomes, followed by an on-site visit to the school, typically lasting between one and two days for primary schools.20 Inspectors arrive at the school at very short notice (maximum of two to three days), which should limit disruptive “window dressing” in preparation for the inspections.21 Importantly for the empirical strategy employed in the current study, inspections take place throughout the academic year, September to July.
During the on-site visit, inspectors collect qualitative evidence on performance and practices at the school. A key element of this is classroom observation. As noted in Ofsted (2011b): “The most important source of evidence is the classroom observation of teaching and the impact it is having on learning. Observations provide direct evidence for [inspector] judgements” (p. 18). In addition, inspectors hold in-depth interviews with the school leadership, examine students’ work, and have discussions with pupils and parents. The evidence gathered by the inspectors during their visit as well as the test performance data form the evidence base for each school’s inspection report. The school is given an explicit headline grade, ranging between 1 (Outstanding) and 4 (Unsatisfactory, also known as a fail rating). The report is made available to students and parents and is posted on the internet.22
There are two categories of fail, a moderate fail (known as Notice to Improve) and a more severe fail category (Special Measures). For the moderate fail category, schools are subject to additional inspections, with an implicit threat of downgrade to the severe fail category if inspectors judge improvements to be inadequate. Schools receiving the severe fail rating, however, may experience more dramatic intervention: These can include changes in the school leadership team and the school’s governing board, increased resources, as well as increased oversight from the inspectors.23
Over the relevant period for this study (September 2006 to July 2009), 13 percent of schools received the best rating, Outstanding (grade 1); 48 percent received a Good (grade 2) rating; 33 percent received a Satisfactory (grade 3) rating; and 6 percent received a Fail (grade 4) rating. The Fail rating can be decomposed into 4.5 percent of schools receiving the moderate fail and 1.5 percent of schools receiving the severe fail rating.
Inspectors clearly place substantial weight on test scores: This is borne out by analysis of the data as well as official policy statements.24 Regression analysis (see below) demonstrates that there exists a strong association between the likelihood of being failed and test scores. Nevertheless, as the above discussion indicates, test scores are not the only signal used by inspectors to rate schools. This is demonstrated by the fact that around 25 percent of schools in the bottom quarter of the test score distribution were rated either Outstanding or Good during the 2006–2009 period.
A. Prior Literature
There is a large descriptive literature investigating the role of school inspections (see, for example, Matthews and Sammons 2004) but very few studies that estimate the causal effects of inspections on student achievement. Exceptions include Rosenthal (2004), Luginbuhl et al. (2009), and Allen and Burgess (2012). Rosenthal (2004) and Luginbuhl et al. (2009) both exploit the variation in timing of inspections across years to evaluate the effect of being inspected. For England, Rosenthal (2004) finds small negative effects in the year of inspection, which decline and are statistically insignificant in subsequent years. Luginbuhl et al. (2009) finds that inspections in the Netherlands have no significant effect on test scores using their preferred identification strategy. In both these studies, the identified effect of inspection is difficult to interpret. The estimates conflate effects of inspection for the better performing schools (where feedback from inspectors may be an important mechanism) with those for outright fail and borderline pass schools (where incentives generated by possible sanctions are likely to be a powerful force). The identification strategy employed in this study enables me to separately identify the effects of inspections for fail and nonfail schools.25
Allen and Burgess (2012) employs a regression discontinuity design (RDD) to assess the effects of a fail inspection on student test scores. The key assumption in this RDD framework is that schools fall into the pass and fail categories by chance around some threshold. In practice, however, schools do not receive a continuous score with a cutoff that determines whether they pass or fail, and, as the discussion above suggests, inspectors retain substantial discretion in deciding whether a school passes or fails, which makes the quasi-random assignment assumption difficult to sustain.
III. Empirical Strategy
The primary question addressed here is: What is the effect of a fail inspection on students’ subsequent test scores? As described earlier, selection into the fail treatment is based at least partly on past test performance. Therefore, a simple school fixed-effect analysis using pre- and postfail test score data for a panel of schools quite possibly confounds any effect of a fail rating with mean reverting behavior of test scores. For example, if inspectors are not fully able to account for idiosyncratic negative shocks unrelated to actual school quality, then arguably any test score rise following a fail inspection would have occurred even in the absence of treatment.
This study exploits a design feature of the English testing system to address such concerns. The age-11 “Key Stage 2” tests—administered at the national level and a central plank in student and school assessment—take place over five days in the second week of May each year. The results of these tests are then released in mid-July. The short window between May and July allows me to address the issue of mean reversion: Schools failed in June are failed after the test in May but before the inspectors know the outcome of the tests. Thus the May test outcome for these schools is not affected by the subsequent fail but neither do inspectors select them for failure on the basis of this outcome. See Figure 2 for an example timeline for the year 2005–2006.
This insight enables me to identify credible causal estimates of the short-term effects of a fail inspection. Taking the year 2005–2006 as an example, the question addressed is: For schools failed in September 2005, what is the effect of the fail inspection on May 2006 test scores?
The evaluation is undertaken by comparing outcomes for schools inspected early in the academic year, September—the treatment group—with schools inspected in June, the control group.26 Schools failed in September have had almost a whole academic year to respond to the fail treatment. The identification problem, that the counterfactual outcome for schools failed in September is not observed, is solved via comparisons with June failed schools. The details of these comparisons are described below (potential remaining threats to the identification strategy are addressed in the robustness section below).
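To make the sample construction concrete, the following minimal sketch flags treatment and control schools from a hypothetical inspections file. It is illustrative only (the author's analysis was carried out in Stata), and the column names inspection_date, rating, and school_id are placeholders rather than actual field names in the administrative data.

    import pandas as pd

    # Illustrative sketch: flag early-failed (treatment) and late-failed (control)
    # schools within an inspection year. All column names are hypothetical.
    insp = pd.read_csv("inspections.csv", parse_dates=["inspection_date"])
    fails = insp[insp["rating"] == "fail"].copy()

    month = fails["inspection_date"].dt.month
    day = fails["inspection_date"].dt.day

    # Treatment: failed September-November.
    fails["early_fail"] = month.isin([9, 10, 11]).astype(int)
    # Control: failed between mid-May (after the Key Stage 2 test) and mid-July
    # (before the test results are released).
    late = ((month == 5) & (day >= 15)) | (month == 6) | ((month == 7) & (day <= 15))
    fails["late_fail"] = late.astype(int)

    sample = fails[(fails["early_fail"] == 1) | (fails["late_fail"] == 1)]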
A. Descriptive Statistics
A key question is why some schools are inspected earlier in the year than others. The descriptive analysis in Table 1 helps shed light on this question. This table shows mean characteristics for primary schools inspected and failed in England in the four years 2005–2006 to 2008–2009.27 For each year, the first two columns show means for schools failed early in the academic year (September to November28) and those failed late in the year (from mid-May, after the Key Stage 2 test, to mid-July, before the release of test score results). Schools in the former category form the “treatment” group and those in the latter the “control” group. The first row simply shows the mean of the month of inspection. Given the selection rules for the analysis, these are simply June (between 6.1 and 6.2) and October (between 10.1 and 10.2) for the control and treatment groups, respectively.
The second row, which shows the year of the previous inspection, offers an explanation why some schools are inspected early in the year and others later on. For example, for schools failed in 2005–2006 the first two columns show that the mean year of inspection for late inspected schools is 2000.6; for early inspected schools it is 2000.1.29 This suggests that schools inspected slightly earlier in the previous inspection round are also inspected slightly earlier in 2005–2006. Table 1 shows that this pattern is typical across the different years. Thus Table 1 demonstrates that over the period relevant to this study, the timing of inspections within a given year is related to the timing of the previous inspection and is unrelated to characteristics of the school such as recent test score performance or the socioeconomic makeup of the student body.30
The third, fourth, and fifth rows report the proportion of students receiving a free school meal (lunch), the proportion of students who are white British, and the school’s inspection rating from the previous inspection round. The table demonstrates that for each of the four inspection years the differences in means between the treatment and control schools are small and are statistically insignificant. (The only exception is the previous inspection rating for the year 2008–2009.)
Finally, national standardized test scores for the cohort of 11-year-olds in the year prior to the inspection are reported in Rows 6 and 7. Once again, these show no evidence of statistically significant differences between the two groups. It is noteworthy that fail schools perform between 0.4 and 0.5 of one standard deviation below the national mean. This is in line with the idea that inspectors select schools for the fail treatment at least in part on the basis of past performance.31
In sum, the evidence in Table 1 demonstrates that there is little difference between control and treatment schools on observable characteristics.32 This, combined with the fact that timing is determined by a mechanical rule, suggests that unobservable differences are also unlikely to exist between the control and treatment groups.
B. OLS and Difference-in-Differences Models
For ease of exposition, I will consider the case of the schools failed in 2005–2006 in the months of September 2005 (the treatment group) and June 2006 (the control group). The analysis extends to schools failed in the early part of the year (September to November) versus those failed late in the year (mid-May to mid-July) in each of the four inspection years.
OLS models of the following form are estimated:
(1) $y_{is} = \alpha + \delta D_{s} + X_{is}'\beta + W_{is}'\theta + u_{is}$,
where yis is the May 2006 test score outcome on the age-11 (Key Stage 2) test for student i attending school s. The treatment dummy is defined as follows: Ds = 1 if school s is failed in September 2005 and Ds = 0 if the school is failed in June 2006. Xis is a vector of student demographic controls and Wis is a vector of pretreatment school characteristics. Given the evidence on assignment to a September inspection versus a June inspection presented in the previous subsection, it can be credibly argued that treatment status Ds is uncorrelated with the error term, uis.33
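As a minimal sketch of how Equation 1 might be estimated (the author's actual analysis was run in Stata; the column names below, such as ks2_math, early_fail, ks1_score, and school_id, are hypothetical placeholders):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Student-level OLS of Equation 1 with standard errors clustered at the school level.
    df = pd.read_csv("fail_sample_students.csv").dropna()

    ols = smf.ols(
        "ks2_math ~ early_fail + female + fsm + sen + english_first_lang"
        " + ks1_score + school_prior_score",
        data=df,
    ).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})

    print(ols.params["early_fail"], ols.bse["early_fail"])  # estimate of delta and its SE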
Results are also presented using difference-in-differences (DID) models. Continuing with the example of schools failed in 2005–2006, data are taken from 2004–2005 (the “pre” year) and 2005–2006 (the “post” year). The following DID model is estimated:
(2) $y_{ist} = \alpha + \delta D_{st} + \lambda \, \mathrm{post06}_{t} + X_{ist}'\beta + \mu_{s} + u_{ist}$,
where t = 2005 or 2006, corresponding to the academic years 2004–2005 and 2005–2006, respectively. μs is a school fixed effect and post06 is a dummy indicator, switched on when t = 2006. Dst is a time-varying treatment dummy, switched on in the post period (t = 2006) for schools inspected early in the academic year 2005–2006.34
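A corresponding sketch of the DID specification in Equation 2, again illustrative only and using hypothetical column names:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Pooled pre/post student data with school fixed effects; the early_fail main
    # effect is absorbed by the school dummies.
    panel = pd.read_csv("fail_sample_pre_post.csv").dropna()
    panel["post"] = (panel["year"] == 2006).astype(int)
    panel["post_x_early_fail"] = panel["post"] * panel["early_fail"]

    did = smf.ols(
        "ks2_math ~ post_x_early_fail + post + fsm + ks1_score + C(school_id)",
        data=panel,
    ).fit(cov_type="cluster", cov_kwds={"groups": panel["school_id"]})

    print(did.params["post_x_early_fail"])  # DID estimate of the fail effect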
C. Testing for Strategic Behavior
As highlighted above, a growing body of evidence has demonstrated that when schools face strong incentives to perform on test scores they game the system. These strategies include the removal of low-ability students from the testing pool, teaching to the test, and targeting students close to the mandated proficiency threshold.35
In the analysis below, I test for the presence of these types of strategic responses. First, I examine to what extent gains in test scores following the fail rating are accounted for by selectively removing low-ability students.36 This involves checking whether the estimated effect of treatment in the OLS and DID regressions (δ in Equations 1 and 2 above) changes with the inclusion of student characteristics such as prior test scores, special education needs status, free lunch status, and ethnic background. For example, suppose that in order to raise test performance, fail schools respond by removing low-ability students from the test pool. This would potentially yield large raw improvements in test scores for treated schools relative to control schools. However, conditioning on prior test scores would then reveal that these gains are much smaller or nonexistent. This test enables me to directly gauge the effect of gaming behavior on test scores.
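The sketch below illustrates this first, coefficient-stability check under the same hypothetical column names used earlier; it simply re-estimates the treatment effect while progressively adding student demographics and prior test scores.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("fail_sample_students.csv").dropna()
    specs = {
        "school controls only": "ks2_math ~ early_fail + school_prior_score",
        "+ demographics": "ks2_math ~ early_fail + school_prior_score + female + fsm + sen",
        "+ age-7 scores": "ks2_math ~ early_fail + school_prior_score + female + fsm + sen + ks1_score",
    }
    for label, formula in specs.items():
        fit = smf.ols(formula, data=df).fit(
            cov_type="cluster", cov_kwds={"groups": df["school_id"]}
        )
        # A sharp drop in the early_fail coefficient once prior scores are added
        # would point toward selective exclusion of low-ability students.
        print(label, round(fit.params["early_fail"], 3))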
Second, I test for whether any gains in test scores in the year of the fail inspection are sustained in the medium term. This provides an indirect test of the extent of teaching to the test. More precisely, students are enrolled in primary school at the time of the fail inspection. The issue is whether any gains in test scores observed in that year can still be detected when the students are tested again at age 14, three years after the students have left the fail primary school. Note that this is a fairly stringent test of gaming behavior since fadeout of test score gains is typically observed in settings even when there are no strong incentives to artificially boost test scores (see, for example, Currie and Thomas 1995).
Third, I analyze the distributional consequences of a fail inspection. In particular, I investigate whether there is any evidence that teachers target students on the margin of achieving the key government target for Year 6 (age 11) students.37 The key headline measure of performance used by the government and commonly employed to rank schools is the percentage of students attaining “Level 4” proficiency on the age-11 Key Stage 2 test. Following a fail inspection the incentives to maximize students passing over the threshold are more intense than prior to the fail rating. If schools are able to game the system (for example, if inspectors are unable to detect such strategic behavior), then teachers may target resources toward students on the margin of attaining this threshold to the detriment of students far below and far above this critical level.
A number of strategies are adopted to explore this issue. In the first approach, I examine whether gains in student test scores vary by prior ability. Prior ability predicts the likelihood of a student attaining the performance threshold. Previous evidence has shown that teachers neglect students at the bottom of the prior ability distribution in response to the introduction of performance thresholds (see Neal and Schanzenbach 2010).
The online Appendix Table 3 shows the distribution of Year 6 students achieving the target for mathematics and English at fail schools, in the year before the fail, by quartile of prior ability.38 Prior ability is measured by age-7 test scores. As expected, ability at age seven is a strong predictor of whether a student attains the official target: The proportion doing so rises from between a quarter and a third for the bottom quartile to almost 100 percent at the top quartile of prior ability. One implication of this evidence is that students in the lowest ability quartile are the least likely to attain the official threshold, and so at fail schools teachers may substitute effort away from them toward students in, for example, the second quartile. The analysis below tests this prediction.
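A short sketch of the underlying calculation (hypothetical column names; attained_level4 is assumed to be an indicator for reaching the Level 4 threshold):

    import pandas as pd

    # Share of Year 6 students at fail schools reaching Level 4 in the pre-fail year,
    # by quartile of the age-7 (Key Stage 1) score.
    pre = pd.read_csv("fail_schools_pre_year.csv")
    # Rank first so that ties in the age-7 score do not break the quartile split.
    pre["prior_quartile"] = pd.qcut(pre["ks1_score"].rank(method="first"), 4, labels=False) + 1
    print(pre.groupby("prior_quartile")["attained_level4"].mean())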
A second approach to analyzing the distributional effects of a fail rating is to employ quantile regression analysis. This is discussed in online Appendix C.
IV. Results
A. Basic Results
Table 2 shows results for the effects of a fail inspection on mathematics and English test scores for schools failed in one of the four academic years, 2006–2009.39 The top panel reports results from the OLS model and the bottom panel reports results from the difference-in-differences model.
I pool the four inspection years together. Pooling over the four years is justified because over this period schools were inspected and rated in a consistent manner.40 The evidence presented in Table 1 shows that schools are indeed comparable on observable characteristics across the different years. As a robustness check, results from regression analysis conducted for each year separately are also reported (in the online Appendix Tables 1 and 2). As indicated below, these show that results for the pooled sample and for individual years produce a consistent picture of the effects of a fail inspection.
In Table 2, as well as the following tables, the comparison is between students enrolled in schools failed in the early part of the academic year, September to November (the treatment group), and those attending schools failed late in the academic year, mid-May to mid-July (the control group).41
Turning first to mathematics test scores, the row “early fail” in Panel A of Table 2 corresponds to the estimate of the treatment effect δ in Equation 1. Column 1 reports estimates from the simplest model with only school-level controls.42 The result in Column 1 suggests that the effect of a fail rating is to raise test scores by 0.12 of a standard deviation. This effect is highly statistically significant at conventional levels (standard errors are clustered at the school level).
As explained in Section IIIC above, the estimated effect in Column 1 may in part reflect distortionary behavior by teachers. If schools respond to a fail inspection strategically, for example, by excluding low-ability students from tests via suspensions, then we should see the relatively large gains in Column 1 diminish once prior ability controls are introduced in the regression analysis. In order to address such concerns, Columns 2 and 3 introduce student-level controls. Regression results reported in Column 2 include the following student characteristics: gender, eligibility for free lunch, special education needs, month of birth, whether first language is English, ethnic background, and census information on the home neighborhood deprivation index. The model in Column 3 also includes the student’s age-7 (Key Stage 1) test scores.
The rise in the R-squared statistics as we move from Columns 1 to 2 and then to 3 clearly indicates that student background characteristics and early test scores are powerful predictors of students’ test outcomes. However, the addition of these controls has little effect on the estimated effects of the fail rating. Overall, the evidence in Panel A for mathematics suggests that (1) the effect of a fail inspection is to raise test scores; and (2) this rise does not appear to be driven by schools selectively excluding students from the tests.
Turning to the difference-in-differences estimates for mathematics reported in Panel B, a useful feature of this approach is that it provides direct evidence on the importance of mean reversion. For the DID analysis, the “pre” year corresponds to test scores prior to the year of inspection whilst the “post” year corresponds to test scores from the year of inspection. The estimate of mean reversion is provided by the gain in test scores between the pre-inspection year and the year of inspection for schools failed late in the academic year (the control group). This estimate is indicated in the row labeled “post.” The DID estimate of the effect of a fail inspection is provided in the first row of Panel B, labeled “post x early fail,” which corresponds to the treatment dummy Dst in Equation 2. The DID results are in line with the OLS results: Column 3 of Panel B shows that students at early failed schools gain 0.12 of a standard deviation relative to students enrolled in late fail schools. In addition, comparing results with and without student-level controls—Column 1 versus Columns 2 and 3—shows that there is little change in the estimated effect. These results support the earlier contention that a fail inspection raises student test scores and that these gains are unlikely to be accounted for by the kind of strategic behavior outlined above.
As for evidence on mean reversion, the results in the second row of Panel B show that there is only mild mean reversion for mathematics. With the full set of controls, the coefficient on the “post” dummy is 0.03 of a standard deviation and is not statistically significant at conventional levels. This suggests that in the absence of a fail rating from the inspectors, we should expect very small gains in test scores from the low levels in the base year reported in the descriptive statistics in Table 2.
Columns 4–6 report results for English test scores. The OLS results in Column 6, Panel A show that the effect of a fail inspection is to raise standardized test scores by 0.09 of a standard deviation. The DID estimates in Panel B point to gains of around 0.07 of a standard deviation. These estimates are statistically significant. As before, the results for English provide no evidence of gaming behavior: There is little change in the estimates when we move from Column 4, with no controls, to Column 6, with the full set of controls.
Finally, the evidence on mean reversion of English test scores presented in the second row of Panel B shows that there is stronger evidence of a rebound in test scores from the low level in the base year. The coefficient on the “post” dummy is now 0.08 of a standard deviation, indicating a substantial rebound in test scores even in the absence of a fail inspection. As seen below, this rebound in fact corresponds to a “preprogram” dip observed in the year before inspection.43
B. Further Robustness Checks
This section presents results from a falsification exercise, provides evidence on the “preprogram dip” and addresses potential threats to identification.
1. A falsification test and the “preprogram dip”
Table 3 presents analysis from a falsification exercise. This makes use of the fact that data are available one and two years before treatment in order to conduct a placebo study. The question addressed is whether a treatment effect can be detected in the year before treatment, when in fact there was none.
As before, Table 3 pools the data over the four inspection years. The OLS estimates in Panel A compare test score outcomes in the year before inspection for students at early and late failed schools. Focusing on Columns 3 and 6 with the full set of controls, these show that the estimated effect of the placebo treatment is close to 0 and statistically insignificant for mathematics and English. The DID estimates in Panel B, which compare the change in test scores one and two years before inspection for early and late failed schools, also show no evidence of placebo effects, supporting the common trends assumption underlying the DID strategy.44
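A sketch of the placebo regression, with the outcome shifted to the year before inspection (hypothetical column names, as before):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Outcome is the Key Stage 2 score of the cohort tested the year *before* the
    # inspection; the early_fail coefficient should be close to zero.
    pre = pd.read_csv("fail_sample_pre_year_students.csv").dropna()
    placebo = smf.ols(
        "ks2_math ~ early_fail + female + fsm + sen + ks1_score",
        data=pre,
    ).fit(cov_type="cluster", cov_kwds={"groups": pre["school_id"]})
    print(placebo.params["early_fail"], placebo.pvalues["early_fail"])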
Table 3 also provides evidence on the preprogram dip in test scores, presented in the row labeled “post” in Panel B. The results show that for English (but not mathematics) there is a large, statistically significant decline in test scores in the year prior to the fail rating. This decline cannot be explained by student characteristics or their prior test scores. This sheds some light on the selection rule employed by inspectors: For English at least, this evidence suggests that inspectors are more likely to fail schools that have had a recent dip in test score performance.
2. Robustness to potential threats to identification
One potential threat to the identification strategy is that schools may be failed because they experience temporary dips in quality around the time of inspection. If quality recovers in subsequent months even in the absence of a fail rating, then a comparison of September-failed schools with those failed in June, say, will yield biased estimates of the treatment effect.
I undertake a robustness check to probe this issue. On the assumption that inspectors attach some weight to past test score performance and some to the inspection evidence, early inspected schools assigned the fail rating largely on the basis of poor past test scores are less likely to be subject to mean reversion (in the component of quality observed by inspectors but not the econometrician) between the early part of the year and the time of the test in May.45 Online Appendix B3 and the accompanying table reports results separately for fail schools with above and below median probability of being assigned a fail rating on the basis of past test scores as well as other observable characteristics. These results suggest that within-year mean reversion in the unobserved component of quality is unlikely to be driving the main results.
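A rough sketch of this split, assuming a hypothetical school-level file with past test scores and observable characteristics (all variable names are illustrative):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Predict the probability of a fail rating from past scores and observables,
    # then split fail schools at the median predicted probability. The main OLS/DID
    # models are re-estimated within each subsample.
    schools = pd.read_csv("inspected_schools.csv").dropna()
    logit = smf.logit(
        "failed ~ past_ks2_score + pct_free_lunch + pct_white_british",
        data=schools,
    ).fit()
    schools["p_fail"] = logit.predict(schools)

    fail_schools = schools[schools["failed"] == 1]
    cutoff = fail_schools["p_fail"].median()
    high_p = fail_schools[fail_schools["p_fail"] >= cutoff]
    low_p = fail_schools[fail_schools["p_fail"] < cutoff]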
Finally, results in online Appendix B4 also present evidence on the effects of the fail treatment in the year after inspection. These show that early failed schools continue to improve relative to the later failed schools and that late inspected schools also show improvements, supporting the hypothesis that more time since failure allows schools to implement strategies that lead to test score gains.
C. Heterogeneous Treatment Effects
In this section, I explore the distributional consequences of a fail inspection. The analysis below first assesses whether the treatment effect varies by prior ability followed by some further subgroup analysis. Online Appendix C discusses quantile treatment effects.
1. Effects by prior ability
As discussed in Section IIIC above, variation in the treatment effect by prior ability may provide evidence of distortionary teacher behavior. To test the prediction that low-ability students are adversely affected when incentives to attain the performance threshold are strengthened, I examine whether the effect of treatment varies with prior ability.46 The following model, incorporating the interaction between the treatment dummy and prior ability, is estimated:
(3) $y_{is} = \alpha + \delta D_{s} + \gamma (D_{s} \times \mathrm{Percentile}_{is}) + \lambda \, \mathrm{Percentile}_{is} + X_{is}'\beta + u_{is}$,
where the treatment dummy Ds is turned on for schools inspected early in the academic year. Percentileis is student i’s percentile, within the selected sample of fail schools, in the prior test score distribution (the age-7 Key Stage 1 test). Thus, the coefficient on the interaction between the treatment dummy and the test percentile, γ, estimates how the effect of treatment varies by prior ability.
The effect may in fact vary nonlinearly by prior ability. This will be the case if, for example, teachers target students in the middle of the prior test score distribution and neglect students at the top and bottom. In order to allow for such nonlinear interactions the following regression is also estimated:
(4) $y_{is} = \alpha + \delta D_{s} + \sum_{k=2}^{4} \gamma_{k} (D_{s} \times Q_{is}^{k}) + \sum_{k=2}^{4} \lambda_{k} Q_{is}^{k} + X_{is}'\beta + u_{is}$,
where the dummy variable Qisk is switched on for student i if her percentile on the prior test score lies in quartile k. Thus, γk estimates the effect of treatment for students lying in quartile k in the prior ability distribution, relative to the omitted category, the bottom quartile.
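A sketch of both interaction specifications (Equations 3 and 4), using the statsmodels formula interface and the same hypothetical column names as in the earlier sketches:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("fail_sample_students.csv").dropna()
    # Percentile of the age-7 score within the fail-school sample, and quartile indicators.
    df["prior_pct"] = df["ks1_score"].rank(pct=True) * 100
    df["prior_q"] = pd.qcut(df["ks1_score"].rank(method="first"), 4, labels=False) + 1

    linear = smf.ols(
        "ks2_math ~ early_fail * prior_pct + female + fsm + sen",
        data=df,
    ).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})

    by_quartile = smf.ols(
        "ks2_math ~ early_fail * C(prior_q) + female + fsm + sen",
        data=df,
    ).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})

    print(linear.params["early_fail:prior_pct"])  # gamma in Equation 3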
Table 4, Columns 1 and 3, presents estimates of the main (δ) and interaction (γ) effects for mathematics and English, respectively, for the linear interaction model, Equation 3. The row “early fail” corresponds to the estimate of δ and “early fail x prior ability percentile” corresponds to the estimate of γ. The results for both mathematics and English in Columns 1 and 3 show that there is a strong inverse relationship between prior ability and the gains from treatment. Students from the lowest end of the prior ability distribution gain 0.19 and 0.15 of a standard deviation for mathematics and English, respectively.
The estimates for the nonlinear interactions model, Equation 4, are reported in Columns 2 and 4.47 Allowing for nonlinearities leaves the above conclusion unchanged: The biggest gains are posted for students from the bottom quartile (the omitted category); students in the middle of the prior ability distribution also experience substantial gains, though not as large as the ones for low-ability students. At 0.05 and 0.03 of a standard deviation for mathematics and English, respectively, gains for students in the top quartile appear to be positive, though substantially smaller than for those at lower ability levels.
One explanation that may account for the relatively small gains observed for high-ability students is that their test scores are at or close to the ceiling of 100 percent attainment. However, it should be noted that even for students in the highest ability quartile at fail schools, the mean test scores in the year before treatment are some way below the 100 percent mark (76 percent and 68 percent for mathematics and English, respectively). The hypothesis that ceiling effects bite is explored further (and rejected) in the quantile treatment effect analysis reported in the online appendix.
In summary, the results presented in Table 4 show that low-ability students reap relatively large test score gains from a fail inspection. This is in contrast to findings from some strands of the test-based accountability literature that show that low-ability students may suffer under such regimes.48 One explanation for the findings reported here may lie in the role played by inspectors. I discuss this at greater length below.
2. Further subgroup analysis
Table 5 reports results from separate regressions for subgroups determined by free lunch status and whether English is the first language spoken at home. The results by free lunch status suggest modestly higher gains in mathematics for free lunch students but smaller gains for this group relative to no-free lunch students in English. However, there are large differences in gains for students according to whether or not their first language is English. For mathematics, students whose first language is not English record gains of 0.19 of a standard deviation, compared to 0.12 of a standard deviation for those whose first language is English. Similarly, gains on the English test are 0.12 of a standard deviation (though only marginally significant) for the first group of students and 0.08 of a standard deviation for the latter group.49
3. Discussion: explaining the gains for low-ability students
The analysis above points to strong gains on the age-11 (Key Stage 2) test for students classed as low ability on the prior (age-7) test. On the basis of the evidence presented above, two potential explanations for this finding can be rejected. First, these gains for low-ability students do not appear to be a result of teachers strategically allocating effort among students. Second, it also seems unlikely that ceiling effects for high-ability students account for this result. So what then explains the gains for low-ability students reported in Table 4 (as well as the quantile treatment effects reported in Appendix C)?
A model consistent with these facts is one where there is a great deal of heterogeneity within the same school or classroom in the degree to which parents are able to hold teachers to account. Parents of children scoring low on the age-7 test are likely poorer than average and less able to assess their child’s progress and the quality of instruction provided by the school. Teachers may therefore exert lower levels of effort for students whose parents are less vocal about quality of instruction. Following a fail inspection and the subsequent increased oversight of schools, teachers have to raise productivity. The optimal strategy for teachers now may be to increase effort precisely where there was the greatest slack. Thus lower-ability students, whose parents face the highest costs in terms of assessing teaching quality, may gain the most from a fail inspection. This would then help explain the strong rise for low-ability students, as reported in Table 4.50
Furthermore, if students in the low prior ability group do indeed receive greater attention from teachers following a fail inspection, the expectation may be that within this group students with higher innate ability benefit the most. This would accord with the usual assumption that investment and student ability are complementary in the test score production function. This is exactly in line with the quantile treatment effect results reported in Appendix C, which show rising treatment effects across quantiles for students in the lowest prior ability quartile.51
D. Medium-Term Effects
The results reported above show that a fail inspection leads to test score gains for age-11 (Year 6) students, who are in the last year of primary school. One question is whether these gains are sustained following the move to secondary school. This would provide indirect evidence of whether the initial test score gains at the primary school are due to “teaching to the test” rather than a result of greater mastery or deeper understanding of the material being examined. In the former case, any gains would be expected to dissipate quickly.52
Table 6 reports results for the Key Stage 3 test score outcome for students age 14 (Year 9)—that is, three years after leaving the fail primary school. This exercise is limited by the fact that these tests are teacher assessments (and not externally marked, as is the case for Key Stage 2 tests used in the analysis above). In order to reduce noise, mathematics and English test scores are combined into a single measure by taking the mean for the two tests for each student.
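A brief sketch of how this combined outcome might be constructed and used (hypothetical column names):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Average the age-14 mathematics and English scores to reduce noise from the
    # teacher-assessed Key Stage 3 marks, then estimate the medium-term effect.
    ks3 = pd.read_csv("fail_sample_ks3.csv").dropna()
    ks3["ks3_combined"] = ks3[["ks3_math", "ks3_english"]].mean(axis=1)

    fit = smf.ols(
        "ks3_combined ~ early_fail + female + fsm + ks1_score",
        data=ks3,
    ).fit(cov_type="cluster", cov_kwds={"groups": ks3["primary_school_id"]})
    print(fit.params["early_fail"])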
The results in Column 1 of Table 6 suggest that the mean effect of treatment three years after leaving the fail primary school is a gain in test score of 0.05 of a standard deviation (statistically significant at the 10 percent level). Analysis of heterogeneity in treatment impact suggests that the medium-term gains are largest for lower-ability students (Columns 2 and 3), in line with earlier results showing large gains for these groups in the year of inspection.
Overall, the analysis of test scores three years after the treatment shows that the positive effects are not as large as the immediate impacts, suggesting fadeout is an important factor. Nevertheless, the evidence shows that some of the gains do persist into the medium term.
E. Mechanisms
Having ruled out certain types of gaming behavior, this section provides tentative evidence on what might be driving test score improvements at failed schools. First, I investigate whether moderate and severe fail ratings—each of which entails different degrees of intervention following the fail inspection—yield markedly different outcomes. Second, using teacher survey data, I examine whether fail schools experience changes in teacher tenure, curriculum, and classroom discipline.
1. Effects by severity of fail
As discussed in Section II, the overall fail rating can be subdivided into a moderate fail and a severe fail: the “Notice to Improve” and “Special Measures” subcategories, respectively. It was noted above that the moderate fail rating leads to increased oversight by the inspectors but does not entail other dramatic changes in inputs or school principal and teacher turnover. Schools subject to the severe fail category, on the other hand, may well experience higher resources as well as changes in the school leadership team and the school’s governing board.
Table 7 shows the effects on test scores separately for schools receiving a moderate fail (Columns 1 and 2) and severe fail (Columns 3 and 4). For moderate fail schools, the OLS (difference-in-differences) estimates suggest gains of 0.16 (0.11) and 0.07 (0.03) of a standard deviation for mathematics and English, respectively. For the severe fail treatment, the OLS (difference-in-differences) estimates show gains of 0.10 (0.11) and 0.13 (0.15) of a standard deviation for mathematics and English, respectively.
The finding that there are test score gains at both moderate and severe fail schools is noteworthy. Given that large injections of additional resources and personnel changes are less likely at moderate fail schools than at severe fail schools, the findings in Table 7 point to greater effort and increased efficiency as the key mechanism behind gains for the former group.53
2. Classroom discipline, school curriculum and teacher tenure
In this section, I use a survey of teachers to investigate whether a fail rating leads to changes in the following outcomes: (1) classroom discipline, (2) hours of the curriculum allocated to high- and low-stakes subjects, and (3) teacher tenure. In one set of regression results, schools rated “Satisfactory,” the rating just above the fail category, are used to construct the control group. A second control group is constructed using schools that are “yet-to-be-failed.”
The teacher survey data are part of a major longitudinal study, the Millennium Cohort Study, which follows children born in the United Kingdom in or just after the year 2000. In the fourth wave of the study, the primary school teacher of the study child was contacted and surveyed in one of the academic years 2007–2008 or 2008–2009. A small set of questions in the survey relate to the teacher’s tenure at the school, years of experience, school curriculum, and classroom discipline.
Table 8 reports results for the classroom discipline outcome.54 For each of the Columns 1–4, the “Fail 2004–2007” dummy is switched on if the teacher is at a school that was failed in one of the academic years 2003–2004 to 2006–2007 (that is, before the survey date).
In Columns 1 and 2, the control group consists of teachers at schools that were rated “Satisfactory” between 2003–2004 and 2006–2007. In Columns 3 and 4, a different control group is constructed using teachers at schools that are failed after the interview, namely in 2009–10 or 2010–11. These are the “yet-to-be-failed” schools.55
Column 1 shows that the raw incidence of teacher-reported discipline problems is lower at schools that experienced a fail rating in the recent past than at schools that received a Satisfactory rating. This gap is substantial (5.5 percentage points) but not statistically significant. When school controls are included in Column 2, this gap widens to 8.0 percentage points and is significant at the 10 percent level.56 This is a large effect, representing a 20 percent decline in this measure of indiscipline. One interpretation of this evidence is that teachers in treated schools place greater emphasis on classroom discipline and ensure that fewer students behave in a way that impedes other children from learning.57 The second set of results, reported in Columns 3 and 4, are consistent with the previous results, although the standard errors are now substantially larger.58
The results for teacher tenure and experience as well as number of hours devoted each week to English, mathematics, and physical education are reported in the online Appendix Table 7. These show that the differences in teacher tenure and experience between treatment and control groups are very small, suggesting that these factors are unlikely to be the mechanisms behind the positive test score gains at fail schools. Similarly, the estimates for the curriculum outcomes suggest small effects from the treatment. In part, this latter finding may be a consequence of the fact that the curriculum in England is set nationally.
Overall, the analysis presented here further boosts the hypothesis that greater effort on the part of the current stock of teachers at the fail school is at least part of the explanation for the test score gains reported in this study. However, without more detailed survey information on school practices, hiring policies, and teacher turnover, it is not possible to go beyond the tentative evidence provided here.
V. The Short-Term Effect of Being Inspected: The Case of Nonfail Schools
This section investigates the short-term effect of being inspected. For schools receiving one of the top two ratings, Outstanding and Good, there are no sanctions but there may be benefits for such schools from the feedback they receive from inspectors, one of the stated goals of the inspectorate (Johnson 2004). I employ the same research design as before, comparing schools inspected and receiving a given rating (Outstanding, say) in the early part of the academic year (that is, September, October, or November) with schools receiving the same rating just after the May test is taken (and before the test results are disclosed). For the top two ratings this exercise should be informative about the act of being inspected. One hypothesis is that receiving feedback from inspectors is valuable in raising school productivity and hence one would expect to uncover a positive treatment effect (see, for example, Taylor and Tyler 2013).59 An alternative hypothesis is that there is a negative effect on test scores if teachers provide lower levels of effort immediately after an inspection.
Table 9 reports the results of this analysis. Results are also reported for the Satisfactory rating.60 For Outstanding rated schools, OLS estimates in Columns 1 and 2, Panel A suggest small gains for mathematics but no significant effects for English. However, the results for mathematics do not appear to be robust: The difference-in-differences estimates in Columns 1 and 2, Panel B suggest that effects are small and statistically insignificant at conventional levels. A similar pattern of results is found for the Satisfactory rating (Columns 5 and 6, Panels A and B), while the effects for the Good rating (middle two columns) are in all cases close to 0 and relatively precisely estimated. Overall, the results in Table 9 suggest that in the short term there are no positive or negative effects of being inspected.61
VI. Conclusion
How best to design incentives for public organizations such as schools is a fundamental public policy issue. One solution, performance evaluation on the basis of test scores, is prevalent in many countries. A school inspection regime, which arguably is more likely to capture the multifaceted nature of education production, may complement (or even substitute for) a test-based accountability system.62
This paper focuses on those schools inspectors judge to be the worst performing. The main result is that a fail inspection leads to test score improvements. This is consistent with findings from the test-based accountability literature that demonstrates that sanctions for underperforming schools can lead to gains in student performance (Figlio and Loeb 2011). For schools receiving higher ratings, there are no significant effects following an inspection.
Furthermore, for fail schools I find little evidence to suggest that these schools are able to artificially inflate test performance by gaming the system. Given the prior evidence on strategic behavior in high-stakes contexts, the fact that I find little evidence of dysfunctional responses is revealing. If inspectors are able to successfully evaluate actual practices and quality of instruction in place at the school before and after a fail inspection, inspections may well have a mitigating effect on such distortionary responses.
Examining treatment heterogeneity reveals that gains are especially large for students scoring low on the prior (age-7) test.63 These results are consistent with the view that children of low-income parents—arguably, the least vocal in holding teachers to account—benefit the most from inspections. In such cases, inspectors may fulfill an especially vital role by substituting for parents in holding teachers to account. A fuller examination of this hypothesis is left for future research.
Footnotes
Iftikhar Hussain is a lecturer (assistant professor) in economics, University of Sussex. Email: iftikhar.hussain{at}sussex.ac.uk. For helpful discussions and comments the author would like to thank Orazio Attanasio, Oriana Bandiera, Tim Besley, Steve Bond, Martin Browning, Rajashri Chakrabarti, Ian Crawford, Avinash Dixit, David Figlio, Sergio Firpo, Rachel Griffith, Caroline Hoxby, Andrea Ichino, Ian Jewitt, Kevin Lang, Valentino Larcinese, Clare Leaver, Susanna Loeb, Steve Machin, Bhash Mazumder, Meg Meyer, Andy Newell, Marianne Page, Imran Rasul, Randall Reback, Jonah Rockoff, Jeff Smith, David Ulph, John Van Reenen, and Burt Weisbrod, as well as seminar participants at the Chicago Fed, Northwestern (IPR), LSE, Manchester, Nottingham, Oxford, Sussex, the NBER Summer Institute, the Association for Education Finance and Policy annual conference, the Royal Economic Society annual conference, the Society of Labor Economists annual meeting, the Educational Governance and Finance Workshop, Oslo, and the Second Workshop on the Economics of Education, Barcelona. The data belong to the U.K. Department for Education and researchers can apply directly to the DfE in order to gain access to the administrative data. The author is happy to provide researchers with the Stata ‘do’ files used to create the tables in the paper.
↵1. See Holmstrom and Milgrom (1991) for a formal statement of the multitasking model. Dixit (2002) discusses incentive issues in the public sector. For empirical examples of dysfunctional behavior in the education sector see the references below.
↵2. However, a system where the evaluator is allowed to exercise his or her own judgment, rather than following a formal decision rule, raises a new set of concerns. Results from the theoretical literature emphasize influence activities and favoritism (Milgrom and Roberts 1988; Prendergast and Topel 1996) that make the subjective measure “corruptible” (Dixit 2002). See Prendergast (1999) and Lazear and Oyer (2012) on the limited empirical evidence on the effectiveness of subjective evaluation. In many settings, good objective measures may not be immediately available. For example, in its analysis of active labor market programs, Heckman et al. (2011, p. 10) notes that: “the short-term measures that continue to be used in… performance standards systems are only weakly related to the true long-run impacts of the program.” Whether complementing objective performance evaluation with subjective assessment is an effective strategy in such settings remains an open question.
↵3. Hussain (2009) analyzes the effects of inspections on the teacher labor market. The evidence shows that when schools receive a “severe fail” inspection rating, there is a significant rise in the probability that the school principal exits the teaching labor market. The next section provides further details.
↵4. Note that an important question not addressed in this study is whether introducing an inspection regime raises performance of the schooling sector as a whole.
↵5. So that the May test outcome for these schools is not affected by the subsequent fail but neither do inspectors select them for failure on the basis of this outcome.
↵6. The descriptive analysis demonstrates that there is little difference in observable characteristics between schools failed in June (the control group) and schools failed in the early part of the academic year (the treatment group). This, combined with the fact that timing is determined by a mechanical rule, suggests that there are unlikely to be unobservable differences between control and treatment schools. The claim then is that when comparing early and late fail schools within a year, treatment (early inspection) is as good as randomly assigned. Potential threats to this empirical strategy are addressed below.
↵7. Quantile regression analysis reveals substantial gains across all quantiles of the test score distribution.
↵8. On the usefulness of in-class teacher evaluations, see also Kane et al. (2010) and Rockoff and Speroni (2010).
↵9. In Section V below I discuss a number of aspects of the inspection process that set it apart from the teacher evaluation system described by Taylor and Tyler (2013).
↵10. Taylor and Tyler (2013) finds that teacher performance increases in the year of the evaluation, although the effects in subsequent years are larger.
↵11. Of course, the possibility that both these effects cancel each other out cannot be ruled out.
↵12. For evidence on the efficacy of test-based accountability systems, such as the U.S. federal No Child Left Behind Act of 2001, see the survey by Figlio and Loeb (2011). There is a large descriptive literature on the role of school inspections (see, for example, the surveys by Faubert 2009 and OECD 2013) but very little hard evidence exists on the effects of inspections. Two exceptions from the English setting are Allen and Burgess (2012) and Rosenthal (2004), discussed below.
↵13. Hastings and Weinstein (2008) provides evidence on the importance of information constraints for poor households when choosing among schools.
↵14. Note that the age-14 (Key Stage 3) tests were abolished in 2008.
↵15. Online Appendix A outlines the relevant theoretical background.
↵16. Hussain (2013) demonstrates that parents are responsive to inspection ratings when making school choice decisions.
↵17. In its own words, the inspectorate reports the following as the primary purpose of a school inspection: “The inspection of a school provides an independent external evaluation of its effectiveness and a diagnosis of what it should do to improve, based upon a range of evidence including that from first-hand observation. Ofsted’s school inspection reports present a written commentary on the outcomes achieved and the quality of a school’s provision (especially the quality of teaching and its impact on learning), the effectiveness of leadership and management and the school’s capacity to improve.” (Ofsted 2011, p. 4).
↵18. As of 2011, Ofsted tendered school inspections to three organizations: two are private sector firms and the third is not-for-profit.
↵19. Inspection cycles typically lasted between three and six years.
↵20. English primary schools cater for students between the ages of 5 and 11.
↵21. This short notice period has been in place since September 2005. Prior to this, schools had many weeks, sometimes many months, of notice of the exact date of the inspection. Anecdotal evidence suggests that these long notice periods resulted in disruptive preparations for the inspections.
↵22. These can be obtained from http://www.ofsted.gov.uk/.
↵23. For evidence of the negligible effects of a moderate fail on principal turnover and the contrasting large effects for severe fail schools, see Hussain (2009).
↵24. The government views test scores as an “anchor” in the English accountability system. See, for example, the statement by the secretary of state for education in oral evidence to the House of Commons education select committee (Uncorrected Transcript of Oral Evidence, 31 January 2012, http://www.publications.parliament.uk/pa/cm201012/cmselect/cmeduc/uc1786-i/uc178601.htm, accessed February 10, 2012).
↵25. Note that exogenous timing of inspections, exploited by Rosenthal (2004) and Luginbuhl et al. (2009), cannot by itself identify the effect of inspection for schools receiving a specific rating. This is a consequence of the fact that ratings are determined by past realizations of the outcome variable, test scores. For example, comparing test scores in the year after inspection for failed schools with test scores for yet-to-be-failed schools does not yield unbiased estimates. See Section V below.
↵26. For an evaluation of the effects on test scores it is important to note that the latest available test score information at the time of inspection is the same for both control and treated schools: that is, from the year before inspection.
↵27. I focus on these four years because 2005–2006 is the first year when the inspection system moved from one where schools were given many weeks notice to one where inspectors arrived in schools with a maximum of two days notice. In order to analyze the effect of a first fail (a “fresh fail”), schools that may have failed in 2004–2005 or earlier are dropped from the analysis. This results in a loss of 10 percent of schools.
↵28. The early inspection category is expanded to three months in order to increase the sample of treated schools.
↵29. Note that an inspection in the academic year 1999–2000 is recorded as “2000”; an inspection in 2000–2001 is recorded as “2001,” and so on.
↵30. Further analysis shows that in a regression predicting assignment status (early versus late inspection) for the set of fail schools, the year of previous inspection is statistically significant while past test scores, prior inspection rating, and the proportion of students eligible for free lunch are all statistically insignificant at conventional levels.
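As a hedged illustration of the check described in footnote 30 (this is not the author's code; the file and variable names fail_schools.csv, early, prev_inspection_year, prior_test_score, prior_rating, and pct_free_lunch are hypothetical placeholders), the assignment-status regression might be set up as a linear probability model:

```python
# Illustrative sketch of the assignment-status regression in footnote 30:
# a linear probability model predicting early (treatment) versus late (control)
# inspection among fail schools. All names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

fail_schools = pd.read_csv("fail_schools.csv")  # one row per failed school

model = smf.ols(
    "early ~ C(prev_inspection_year) + prior_test_score"
    " + C(prior_rating) + pct_free_lunch",
    data=fail_schools,
).fit(cov_type="HC1")  # heteroskedasticity-robust standard errors

print(model.summary())
# If timing is driven by the mechanical inspection-cycle rule, the year of the
# previous inspection should predict assignment, while prior test scores, the
# prior rating, and free-lunch eligibility should be statistically insignificant.
```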
↵31. One remaining concern is that the worst schools may be closed down after a fail inspection. If this happens immediately after an inspection then any test score gains posted by the early fail schools may be due to these selection effects. In fact, three pieces of evidence suggest that this is not the case. First, although such a “weeding out” or “culling” process may be important in the medium term, the evidence in Table 1 demonstrating the comparability of the early and late inspected groups of schools suggests that such a process does not take place immediately, that is, in the year of inspection.
Second, an examination of the data shows that the probability of a school being dropped from the estimation sample because test data from the May tests in the year of inspection are missing (perhaps because the school has closed down) despite test data being available in previous years is similar for treated and control schools. In total, for the years 2005–2006 to 2008–2009, 4 percent of schools (6 schools) from the control group and 5 percent (14 schools) from the treatment group are dropped because of a lack of test score data in the year of inspection. These dropped schools appear to be comparable to the treatment and control schools on characteristics such as student attainment in the year before inspection. For example, the proportion of students attaining the mandated attainment level for age-11 students in the year before inspection is 62 percent for the 14 treated (early inspected) schools dropped from the estimation sample; the corresponding mean is 63 percent for the 258 treated schools included in the estimation sample.
Finally, a comparison of results for the moderate fail schools versus the severe fail ones also sheds light on this issue. Note that a moderate fail is unlikely to lead to changes in school leadership and school closure. If test gains following a fail are observed for moderate fail schools, then selection effects arising from school closures are unlikely to be driving the results. I report on these findings below.
↵32. In the online Appendix Table 8, I show that pooling all four years together yields the same conclusion: Treatment and control schools appear to be balanced on observable characteristics.
↵33. In a heterogeneous treatment effect setting where δi is the student-specific gain from treatment, the key assumption is that Ds is uncorrelated with both uis and δi. The evidence presented above suggests that this assumption is satisfied. In this case a comparison of means for the treatment and control outcomes yields the Average Effect of Treatment on the Treated. This is the effect of a fail inspection rating for schools inspectors judge to be failing. Another parameter of policy interest—not estimated—is the Marginal Treatment Effect; that is, the test score gain for students in schools on the margin of being failed.
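To make explicit how this assumption delivers the estimand, the following display is a minimal sketch in the footnote's notation; the additive outcome form and the intercept α are assumptions added here for illustration and are not taken from the paper.

```latex
% Minimal sketch, assuming y_{is} = \alpha + \delta_i D_s + u_{is} for student i in school s.
\begin{align*}
E[y_{is}\mid D_s=1] - E[y_{is}\mid D_s=0]
  &= E[\delta_i \mid D_s=1]
   + \bigl(E[u_{is}\mid D_s=1] - E[u_{is}\mid D_s=0]\bigr)\\
  &= E[\delta_i \mid D_s=1],
\end{align*}
% where the second line uses the assumption that the binary D_s is uncorrelated with
% u_{is} (which equalizes the two conditional means of u_{is}); the remaining term is
% the average effect of treatment on the treated. If D_s is also uncorrelated with
% \delta_i, this equals the mean gain E[\delta_i] within the population of fail schools.
```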
↵34. The key DID assumption, which embodies the assumption of common trends across treatment and control groups, is that conditional on the school fixed effect (μs) and year (post06) the treatment dummy Dst is uncorrelated with the error—that is, E(uist | post06, Dst, μs) = 0.
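For concreteness, the following is a minimal sketch of the difference-in-differences specification this assumption refers to; the coefficient labels λ and γ and the group indicator treat_s are introduced here for illustration and are not the paper's notation.

```latex
% Sketch only: student-level DID with school fixed effects.
\begin{align*}
y_{ist} &= \mu_s + \lambda\,\mathit{post06}_t + \gamma\,D_{st} + u_{ist},
  \qquad D_{st} = \mathit{treat}_s \times \mathit{post06}_t,\\
\hat{\gamma}_{DID} &=
  \bigl(\bar{y}_{\mathrm{treat},\,\mathrm{post}} - \bar{y}_{\mathrm{treat},\,\mathrm{pre}}\bigr)
  - \bigl(\bar{y}_{\mathrm{control},\,\mathrm{post}} - \bar{y}_{\mathrm{control},\,\mathrm{pre}}\bigr),
\end{align*}
% which recovers \gamma when E(u_{ist}\mid \mathit{post06}_t, D_{st}, \mu_s) = 0,
% i.e., when treated and control schools would have followed a common trend
% in the absence of the fail rating.
```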
↵35. There is no grade retention in the English system and reduced scope for excluding students from the testing pool. Nevertheless, anecdotal evidence suggests that schools may selectively bar or exclude students and popular discourse suggests that schools target students on the margin of performance thresholds (see, for example, “Schools focusing attention on middle-ability pupils to boost results,” The Guardian, 21 September 2010).
↵36. It should be noted that potentially distortionary incentives may well exist prior to the fail rating. However, these incentives become even more powerful once a school is failed. Thus the tests for gaming behavior outlined here shed light on the effects of any extra incentives to game the system following a fail inspection. As noted previously, the incentives to improve test score performance following a fail inspection are indeed very strong.
↵37. In primary schools, national tests are administered to students in England at ages seven and 11.
↵38. The online appendix can be found at http://jhr.uwpress.org/.
↵39. Note that “2006” refers to the academic year 2005–2006 and so on for the other years.
↵40. As described earlier, changes to the inspection process were introduced in September 2005. Arguably, the biggest change was a move to very short (two days) notice for inspections, down from a notice period of many months. This regime has remained in place since September 2005.
↵41. Treatment effects estimated separately for schools failed in each month from September through April, reported in online Appendix B, suggest a steadily declining effect as the month of failure moves closer to May.
↵42. The following school-level controls are included in all the regressions reported in Panel A of Table 2: pre-inspection mathematics and English attainment, percent of students eligible for free lunch, and percent of students who are nonwhite. Dropping these from the regressions makes very little difference to the estimates. For example, without any controls at all, the estimated effect for mathematics is 0.10 of a standard deviation.
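As a hedged sketch of the kind of specification footnote 42 describes (this is not the paper's code; the data file and column names are hypothetical placeholders), the Panel A regression with school-level controls could look as follows:

```python
# Illustrative sketch of a Panel A specification: standardized age-11 mathematics
# score regressed on the early-fail treatment dummy plus the school-level controls
# listed in footnote 42, with standard errors clustered by school.
import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("students_fail_schools.csv")  # hypothetical student-level file

model = smf.ols(
    "ks2_maths_std ~ early_fail + pre_maths + pre_english"
    " + pct_free_lunch + pct_nonwhite",
    data=students,
).fit(cov_type="cluster", cov_kwds={"groups": students["school_id"]})

# Treatment coefficient; footnote 42 notes the mathematics estimate is about
# 0.10 of a standard deviation even with no controls at all.
print(model.params["early_fail"])
```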
↵43. Note that the above results from data pooled over the four inspection years are in line with results from regression analysis conducted for each year separately, reported in online Appendix Tables 1 and 2.
↵44. Results for individual inspection years confirm the finding that the placebo treatment produces no discernible effect.
↵45. In order to see this, suppose that for each school i inspectors assess overall quality in year t and month m, Qitm, to be some weighted average of past test scores (wt–1) and inspection evidence of school quality, which is composed of fixed (qi) and time-varying (eitm) components:

(*) Qitm = λwt–1 + (1 – λ)(qi + eitm),

where eitm is an idiosyncratic component and it is assumed that eitm follows a covariance stationary AR(1) process. In addition, suppose that inspectors fail schools when quality falls below some threshold level, Q0:

(**) school i is failed in month m of year t if Qitm < Q0.

Thus, for schools failed in September, say, the temporary component of quality (observed by the inspectors, but not the analyst) is such that ei,t,Sep < (Q0 – λwt–1)/(1 – λ) – qi. Then, because of the selection rule (**) and the AR(1) nature of eitm, it follows that by the time of the May test, unobserved quality for September-failed schools has improved in expectation, that is, E(ei,t,May | fail in September) > E(ei,t,Sep | fail in September). The idea behind the robustness test described above is that schools which failed largely because of low wt–1 are likely to be less affected by within-year (for example, September to May) mean reversion in eitm. Thus for these schools the estimated treatment effect of a fail rating is less likely to be driven by differences between ei,t,May and ei,t,Sep.

Note that further regression results, not reported here for brevity, using all schools to explain the probability of failure by month of inspection, test scores, and the interaction of these two variables, show that inspectors attach the same weight to test scores in the early months of the year as in later months such as June. This supports the use of a time-invariant coefficient on wt–1 in (*) above.
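To make the selection logic concrete, the following is an illustrative simulation, not part of the paper's analysis, under arbitrary assumed parameter values: schools failed on a low September draw of the transitory component mechanically improve by May even when there is no treatment effect at all.

```python
# Illustrative simulation of the mean-reversion mechanism in footnote 45:
# quality mixes prior test scores with fixed quality and an AR(1) transitory
# shock; schools are "failed" when September quality falls below a threshold.
# All parameter values are arbitrary assumptions chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, rho, lam, q0 = 100_000, 0.7, 0.5, -1.0  # schools, AR(1) persistence, weight, threshold

w = rng.normal(size=n)        # prior-year test scores, w_{t-1}
q = rng.normal(size=n)        # fixed school quality, q_i
e_sep = rng.normal(size=n)    # transitory component in September

# Eight months of covariance-stationary AR(1) evolution from September to May
e_may = e_sep.copy()
for _ in range(8):
    e_may = rho * e_may + np.sqrt(1 - rho**2) * rng.normal(size=n)

quality_sep = lam * w + (1 - lam) * (q + e_sep)
failed = quality_sep < q0     # selection rule (**)

print(e_sep[failed].mean(), e_may[failed].mean())
# The September shock is strongly negative for failed schools, but by May it has
# reverted toward zero, so their May outcomes improve with no treatment effect.
```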
↵46. The online Appendix Table 3 highlights the relationship between prior ability—measured by the age-7 Key Stage 1 test score—and the probability of attaining the target level on the age-11 Key Stage 2 test. This shows that only around a quarter to one-third of students in the lowest prior ability quartile attain the stipulated Level 4.
↵47. Note that running four separate regressions by prior ability quartile subgroup leads to results virtually identical to those reported in Columns 2 and 4 of Table 4.
↵48. For example, Neal and Schanzenbach (2010) finds that test scores improve for students in the middle of the prior ability distribution while low-ability students experience zero or even negative effects on test scores.
↵49. Dustmann et al. (2010) shows that even though students from most minority groups lag behind white British students upon entry to compulsory schooling, they catch up strongly in the subsequent years. The relatively large gains for this subgroup reported in Table 5 suggest that one mechanism driving the results reported in Dustmann et al. may be the school inspection system.
↵50. This interpretation of the results is also supported by the subgroup analysis of Table 5, which shows that children from poorer, minority groups tend to gain relatively more from the fail inspection. Children from families where English is not the first language at home most likely have parents who are less able to interrogate teachers and hold them accountable. The results in Table 5 therefore reinforce the conclusion that it is children from these sorts of families who are helped most by the fail inspection.
↵51. Note however that in the absence of data on allocation of resources within the classroom, I cannot definitively rule out the possibility that teachers raise effort equally across all students and a differential marginal return to this rise in effort leads to the observed heterogeneous treatment effects.
↵52. Note that such fadeout of initial gains is in fact common in settings even where educators are not under pressure to artificially distort measured student performance (see, for example, Currie and Thomas 1995). Thus, the fading of test score gains does not necessarily indicate distortionary response on the part of teachers. On the other hand, if some of the initial test score gains persist into the medium term then this would suggest that the initial gains from the fail treatment are “real.” Prior evidence supports the notion that pressure on schools via test-based accountability leads to test score gains for students even after they have left the sanctioned schools (Rouse et al. 2013; Chiang 2007).
↵53. One other feature of the results in Table 7 worth highlighting is the contrasting nature of mean reversion in the moderate and severe fail categories. The extent of reversion to the mean for the control groups is depicted in the “post” row of Table 7. For the moderate fail schools, there appears to be substantial mean reversion: There is bounceback, compared to the prior year, in mathematics and English test scores of 0.06 and 0.14 of a standard deviation, respectively. For the severe fail schools, however, there is no bounceback. One interpretation of this evidence is that inspectors are able to distinguish between poorly performing schools and the very worst performing schools, with the latter category exhibiting no improvement in the absence of treatment.
↵54. The precise question in the survey is: “Are there any children in the study child’s class whose behavior in class prevents other children from learning?” (check Yes or No).
↵55. Any schools that were also failed in one of the previous inspection cycles between 2003–2004 and 2008–2009 are dropped from this control group.
↵56. School controls included in the regression: percentage of students eligible for a free lunch, the school’s national test score percentile, and type of school.
↵57. This result is unlikely to be driven by student selection into schools. Hussain (2009) shows that enrollment declines following a fail inspection, and because more motivated parents are more likely to switch schools, any such selection should, if anything, cause discipline to deteriorate following a fail inspection.
↵58. As a robustness check, using different sample selection rules, for example, 2005–2006 and 2006–2007 fail and satisfactory schools, yields qualitatively similar results, though the reduction in sample size results in larger standard errors.
↵59. As Taylor and Tyler (2013) shows, there is a positive effect on student test scores in response to structured classroom teacher observations even in the year of the evaluation, although the effects in subsequent years are larger. However, there are a number of aspects of the inspection process that set it apart from the teacher evaluation system described by Taylor and Tyler (2013). First, the period covered by this analysis typically corresponds to the third or fourth inspection a school has received since the 1990s; the Taylor-Tyler study evaluates the effects of the first rigorous evaluation for teachers after 8–17 years in the job. If marginal gains from evaluation are declining then later evaluations and inspections may garner negligible returns. Second, whereas inspectors typically observe each teacher for no more than an hour, in the system described by Taylor and Tyler (2013) teachers receive detailed feedback from four classroom observations over the course of a whole year. Finally, Taylor and Tyler (2013) finds that gains from evaluation increase over time, at least for mathematics. The results in Table 9 relate to effects in the year of inspection only and hence longer-term gains cannot be ruled out. Unfortunately, given the nature of the mean reversion problem with respect to test scores, within school analysis exploiting the panel dimension of the data cannot be employed to uncover longer term effects of inspections.
↵60. Effects of a Satisfactory rating may incorporate effects of feedback as well as potential threats or sanctions, especially if the school is close to slipping into the fail category. Taylor and Tyler (2013) finds that the biggest effects from providing feedback are for teachers who receive relatively low scores. If this also holds for the case of inspection feedback at the school level, then one may expect to find larger effects for satisfactory schools than for higher-rated schools.
↵61. I also estimate the effect of an inspection on test scores for all schools (unconditional on the grade the school receives) by exploiting the variation in timing of inspections across years, as in Rosenthal (2004). Comparing test scores in 2007 for all schools inspected in 2006 (the treatment group) with schools inspected in 2008 (the control group), a difference-in-differences model using student-level data reveals no significant differences between early and late inspected schools (the estimate is –0.8 percent of a standard deviation; p-value 0.24; regressions include school fixed effects and age-7 test scores).
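A hedged sketch of this student-level difference-in-differences comparison follows (not the paper's code; the file and variable names are hypothetical, and the data are assumed to contain a pre-inspection test year in addition to 2007):

```python
# Illustrative sketch of the timing-based comparison in footnote 61: student-level
# DID with school fixed effects and age-7 (Key Stage 1) scores, comparing schools
# inspected in 2006 with schools inspected in 2008. Names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students_all_schools.csv")                # hypothetical panel
df["treat"] = (df["inspection_year"] == 2006).astype(int)   # inspected 2006 vs. 2008
df["post"] = (df["test_year"] == 2007).astype(int)          # outcome year 2007

# The treat main effect is absorbed by the school fixed effects.
model = smf.ols(
    "ks2_score_std ~ treat:post + post + ks1_score_std + C(school_id)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})

print(model.params["treat:post"], model.pvalues["treat:post"])
```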
↵62. A key concern under a subjective assessment regime is that it is open to manipulation by the bureaucrats charged with oversight. Results reported in Hussain (2012) on the validity of inspection ratings suggest that inspector ratings are correlated with underlying measures of school quality—constructed using survey measures from the school’s current stock of students and parents—even after conditioning on standard observed school characteristics.
↵63. The gains are large when compared to other possible policy interventions, such as the effects of attending a school with higher attainment levels (Hastings et al. 2009) or enrolling in a charter school (Abdulkadiroglu et al. 2011).
- Received December 2013.
- Accepted March 2014.