Abstract
We experimentally compare two modes of in-service professional development for South African public primary school teachers. In both modes teachers received the same learning materials and daily lesson plans, aligned to the official literacy curriculum. Students exposed to two years of the program improved their reading proficiency by 0.12 standard deviations if their teachers received centralized training, compared to 0.24 if their teachers received in-class coaching. Classroom observations reveal that treated teachers were more likely to split students into smaller reading groups, enabling individualized attention and more opportunities to practice reading. Results vary by class size and baseline student reading proficiency.
I. Introduction
In most of the developing world, children are attending school without adequately learning to read. In South Africa, for example, after four years of schooling a striking 78 percent of students still cannot read with understanding (Mullis et al. 2017).1 Such low levels of reading proficiency have also been documented in South Asia and elsewhere in sub-Saharan Africa (Banerji, Bhattacharjea, and Wadhwa 2013; Bold et al. 2017). Since reading is a gateway to future learning, addressing these shortcomings should be a policy priority.
There is great potential to accelerate learning by improving the quality of teaching, but changing ingrained teaching practices presents a significant challenge. Numerous studies have found that teachers play a critical role in shaping a child’s learning trajectory (Das et al. 2007; Clotfelter, Ladd, and Vigdor 2010; Rivkin, Hanushek, and Kain 2005; Staiger and Rockoff 2010) and that good teaching practices correlate with faster learning (Allen et al. 2013; Araujo et al. 2016). Yet, teacher quality is highly variable, both within and between countries. In recognition of this, governments and donors invest billions of dollars annually in in-service teacher professional development,2 but with disappointing results. For example, many studies in the United States have found no impact of professional development programs on student learning, especially when conducted by government at scale,3 and a recent meta-analysis of evaluations of in-service teacher training programs in developing countries concluded that “teacher training programs vary enormously, both in their form and their effectiveness” (Popova, Evans, and Arancibia 2016). One possible reason for the failure is that many programs focus only on imparting knowledge, yet teaching is a skill that needs to be developed through ongoing practice (Kennedy 2016).
Broadly defined, there are two common approaches to in-service teacher professional development: training at a centralized venue, or classroom visits by coaches who observe teaching, provide feedback, and demonstrate effective teaching techniques. The first approach provides more time for deeper understanding to develop before actually implementing new techniques, but it might not be sufficient to change teachers' behavior. The second approach could facilitate a change in behavior by encouraging practice—hence, learning by doing. Moreover, targeted feedback could ensure correct application of techniques. Promising evidence suggests this approach can succeed at shifting teaching practice and improving student learning (Kraft, Blazar, and Hogan 2018), but it is generally considered more expensive (Knight 2012). Recent evidence has also shown that low-cost adaptations of the coaching approach, such as using online technology, are less effective (Oreopoulos and Petronijevic 2018). A possible cost-effective way to encourage adoption of new techniques is the use of scripted lesson plans (Jackson and Makarin 2018). Lesson plans can reduce the cost to teachers of switching to a new technique and provide daily prompts and reminders to encourage practice. But some are concerned that they could reduce teacher autonomy and thus hinder a good teacher's ability to tailor their teaching to the needs of the child (Dresser 2012).
Is a short centralized training program, combined with daily lesson plans, sufficient to ensure application of new teaching practices? How important is ongoing individualized observation and feedback, provided by an expert coach, for ensuring that new techniques are implemented and implemented well? How much depends on the characteristics of the student, the teacher or the class size? Ultimately, which approach, training or coaching, is more cost-effective at improving student learning?
To answer these questions, we conducted a randomized evaluation in 180 public primary schools in South Africa, comparing two different approaches to improving the teaching of home language reading in the early grades. The first approach (which we refer to as training) followed the traditional model employed by many governments: short, intensive training held at a central venue.4 In the second approach (which we refer to as coaching), specialist reading coaches visited teachers on a monthly basis to observe teaching practice and provide feedback. The average duration of exposure to the programs over the course of the year was roughly equivalent.5 Both interventions also provided teachers with daily lesson plans and educational materials, such as graded reading booklets, flash cards, and posters. The lesson plans were based on the official government curriculum and mirrored exactly the pedagogical techniques prescribed by the government, but at a higher level of specificity. Moreover, the same individuals delivered both the training and the coaching, so any differences we observe cannot be due to differences in the quality of implementation. Coaching costs roughly 43 USD per student annually, compared to 31 USD for training.
Our analysis draws from multiple data sources (Cilliers 2019). We assessed the reading ability of a random sample of 20 students in each school at three points: once as they entered Grade 1 prior to the rollout of the interventions (February 2015) and again at the end of their first and second academic years (November 2015 and 2016, respectively). During these school visits, we also surveyed teachers and the school principal. Furthermore, in October 2016 we conducted detailed lesson observations in a stratified random sample of 60 schools. The lesson observation instrument was explicitly designed to capture the teaching practices prescribed by the government and thus targeted by the program.
We find that, after two years of exposure to the program, students’ reading proficiency increased by 0.12 and 0.24 standard deviations if their teachers received training or coaching, respectively. The impacts are larger still—0.18 and 0.29 standard deviations, respectively—when we exclude the small sample of multigrade classrooms, a setting where the program was never intended to work. We conclude that coaching is more cost-effective than training, with an estimated 0.57 standard deviation increase in reading proficiency per 100 USD spent per student annually, compared to 0.39 in the case of training.
Next, our classroom observations allow us to unpack mechanisms by measuring how teaching practice changed in the classrooms. We find that even though there is no change in the frequency with which students practice reading in the classroom, there is a big change in how they practice reading: teachers in both treatment arms were more likely to implement a technically challenging teaching technique called group guided reading, in which students read aloud in smaller groups. As a result, students were more likely to receive individual attention from the teacher when they read, and more students were also using the graded reading booklets. The largest improvement is consistently observed in classrooms where the teachers received coaching. Notably, we see no change in other activities that are also required to take place on a daily basis but are easier to teach.6 We also perform mediation analysis, following Acharya, Blackwell, and Sen (2016), and conclude that more than half of the impact of coaching can be explained by the improvements in group guided reading.
Taken together, our results show that a combination of training and lesson plans can shift teaching practice and improve learning, but the shift is far larger when teachers receive ongoing observation and feedback from a coach, especially for the more difficult techniques.
Our study contributes to growing evidence from developing countries demonstrating that a bundled intervention of training, lesson plans, and coaching can dramatically improve students’ proficiency in early-grade reading (Piper, Zuilkowski, and Mugenda 2014; Piper et al. 2018; Lucas et al. 2014; Kerwin and Thornton 2017). These findings are also consistent with the conclusion from a recent review that structured pedagogic programs—a combination of highly specified curricula, training on instructional methods, and additional learning materials—have great potential to improve learning (Snilstveit et al. 2016). Our study makes a unique contribution in two important ways. First, we experimentally vary two common forms of teacher professional development: training versus coaching. This allows us to unpack which components are responsible for the learning gains and test for the importance of observation and feedback in developing teaching skills. This is important because one-off training is the most common form of government teacher professional development, but most research looks at a more resource-intensive coaching model. Second, the detailed classroom observations, which were explicitly developed to measure the teaching practices emphasized by the program, shed light on the underlying mechanisms.
Results of our study also contribute to debates around teacher autonomy. There is often pushback against prescribed curricula and set pedagogical standards because of the fear that they will undermine teacher autonomy and limit a teacher’s ability to tailor their teaching to the level of the child. Our study demonstrates the benefits of a structured pedagogical program. Teacher satisfaction with the program was high, underscoring the fact that teachers value the structure provided by standardized lesson plans. There was no detectable negative impact on any segment of the student population, so the reduced teacher autonomy does not come at a cost of lower learning for some students. However, the fact that the program did not improve the reading performance of the weakest students is of concern.
The paper proceeds as follows: Section II describes the interventions and the motivating theoretical channels, Section III describes the evaluation design and empirical strategy, Section IV reports results, and Section V concludes.
II. Program Description and Theoretical Framework
A. Program
Working with the South African government, we designed two related interventions aimed at improving early-grade reading in students' home language.7 Both interventions provide teachers with lesson plans, which describe in detail the content that should be covered and the pedagogical techniques that should be applied on each instructional day.8 In addition, teachers receive supporting materials, such as graded reading booklets, flash cards, and posters. The graded reading booklets provide a key resource for the teacher to use in group guided reading (discussed in more detail below), facilitating reading practice at an appropriate pace and sequence of progression. The program was led and managed by the government, which appointed a service provider, Class Act, to implement the interventions.
The two interventions differed in their approach to improving teacher pedagogical practice. The first intervention combined centralized training with the provision of lesson plans and associated educational materials. The professional development component of this intervention involved two days of large-group training, occurring twice yearly (at the beginning of the first and second semesters, respectively). During these training sessions, roughly a quarter of the training time was meant to be spent on teachers practicing the techniques. There was roughly one facilitator for every seven teachers during the training.9 The trainers also performed follow-up visits to most of the schools, in order to encourage them to continue with the program. We refer to this intervention as training.
The second intervention, which we refer to as coaching, provided exactly the same set of instructional materials. However, instead of central training sessions, specialist reading coaches visited the teachers on a monthly basis over the duration of the academic year in order to improve teacher content knowledge, pedagogical techniques, and professional confidence. During these visits, the coaches observed teaching, provided feedback on how to improve, and demonstrated correct teaching techniques. The coaches also held information sessions with all the teachers at the start of each term to hand out new materials and occasionally held afternoon workshops (one to three a year) with a small cluster of nearby schools that were part of this intervention. There were three coaches, each serving 16–17 schools. The coaches were educated—all three had at least a Bachelor’s degree—and had past experience as both teachers and coaches. They received additional training from Class Act at the start of every term.10 The coaches also conducted the training, so the differences between the programs cannot be attributed to the expertise of those administering the programs.
The program was implemented over a period of two years: in the first year all the Grade 1 teachers in the treatment schools received training or coaching. In the second year all Grade 2 teachers in treatment schools received it. Thus, the same cohort of students benefited from the program, but a different set of teachers participated each year. Figure A.1 in the Online Appendix provides a schematic breakdown of the timeline.
Figure 1 shows the distribution of teacher exposure to the coaching program in 2016. The median number of visits that a teacher received was ten, but some teachers received far fewer visits. There was also high variation in the number of afternoon workshops that teachers attended. Putting all this together, we calculate that the average exposure to the program was 36.7 hours.11 According to administrative data, teacher attendance at training was high—98 and 93 percent for the two sessions held in 2016—and there were a total of 157 follow-up visits. The organization held follow-up training for the teachers who missed the initial training. The average number of hours of exposure for this group was roughly 34.12
Figure 1: Teacher exposure to the coaching program in 2016. Source: Class Act monitoring data for the 89 teachers from 49 schools in the coaching arm.
Both treatments followed the same curriculum as in the control. The lesson plans were fully aligned with the official government curriculum, both in terms of the topics covered and instructional techniques prescribed. The lesson plans were also integrated with the government-provided workbooks, which detailed daily exercises to be completed by students. Any difference we observe is therefore due to the modality of support the teachers received, not the pedagogical content.
B. Theoretical Framework
1. How (not) to teach reading?
Despite debates around specific methodologies of teaching literacy, there is general consensus on how students learn to read.13 Acquiring reading proficiency requires systematic practice of all the different components of reading, in the appropriate sequence and at an appropriate pace: starting with the development of vocabulary, moving to recognizing sounds and letters (decoding), and progressing towards recognizing words and eventually reading extended texts. The ultimate goal, reading with comprehension, can only be reached once someone can read fluently—that is, when reading becomes automatic and requires no conscious effort. This requires continual practice, as well as individual feedback to correct a student who is reading incorrectly.
In this regard, the South African literacy curriculum is well aligned with international best practice. It prescribes in detail the frequency with which different teaching activities should take place. For example, group guided reading—where smaller groups of students read the same text under the direction of the teacher—is supposed to take place on a daily basis (Department of Basic Education 2011).14 This activity is an important ingredient to learning, since it provides opportunities for students to practice reading and receive individual feedback from their teacher, but it is difficult to implement.
However, in South Africa there is a significant gap between existing practice and what is prescribed in the curriculum (Hoadley 2012). The dominant norms of practice in South Africa involve an overreliance on teacher-directed strategies and whole-class activities, such as "chorusing," where the teachers and students all read together, or repeat after a teacher. With these activities, there is a risk that students will not attempt to read and will merely mimic what the teacher is reading. In the worst possible equilibrium, the students pretend to be reading, and the teachers pretend to be teaching. There is also documented evidence of highly incomplete curriculum coverage and ineffective curriculum sequencing and pacing by teachers (Taylor 2011).15
2. How to change teaching practice?
Both interventions of this study are built around the assumption that, just like learning to read, teaching is a skill that needs to be developed through regular practice. Teachers might need additional guidance and support to ensure consistent and correct application of the new techniques. Skill acquisition could lead to a sustained change in behavior, either by increasing the marginal product of effort for intrinsically motivated teachers who now see the fruits of their labor, or by reducing the marginal cost of effort, since once-difficult tasks now become easy to implement.
The lesson plans provide several mechanisms for ensuring that the methods are actually implemented, and implemented well. First, the provision of fully scripted lesson plans can reduce the effort cost of transitioning to a new set of practices, since teachers do not need to develop daily plans themselves. Second, even before a teacher has a deep understanding of the methods or curriculum topics, the lesson plans prompt enactment, thus creating the possibility for learning by doing. In this way, the regular routines embedded in the lesson plans foster an iterative relationship between knowing and doing, through which the teacher's own instructional repertoire is expanded. Lesson plans also provide a way to ensure that new reading materials are used and integrated into a lesson in a coherent way. Finally, lesson plans anchor the entire intervention, guiding not only the use of time and materials but also providing a point of focus for all training or coaching interactions. In these ways, lesson plans can be viewed as a set of mechanisms to encourage correct implementation of the curriculum, as well as implementation of what is taught at training sessions.
A significant initial dose of training might be important if a thorough conceptual understanding of new topics and methods is necessary before effective implementation is possible. However, there may be other practical and emotional constraints to introducing a new set of routines and activities into an existing classroom space. Coaching can facilitate the adoption of new practices through: (i) encouraging teachers to actually attempt new techniques in the classroom (somebody is there to observe, thus playing a monitoring role), (ii) providing targeted feedback on how to improve upon these techniques, and (iii) demonstrating the correct application of techniques.
III. Evaluation Design
A. Sampling and Random Assignment
The study is set in two districts in South Africa's North West Province, in which the main home language is Setswana. This province is relatively homogeneous linguistically and is one of the country's poorer provinces. Our sample is restricted to nonfee public schools that use Setswana as the main language of instruction and were identified as unlikely to practice multigrade teaching.16 We randomly drew a sample of 230 schools from this population and created ten strata of 23 similar schools based on school size, socioeconomic status, and previous performance in the national standardized exam, the Annual National Assessments (ANA). Within each stratum we randomly assigned five schools to each treatment group and eight to the control group.17 All treatment schools, with the exception of one in the coaching arm, agreed to participate in the program. The results of this paper should therefore be interpreted as intent-to-treat effects.
We chose to exclude schools that practice multigrade teaching, since the interventions are grade-specific and unlikely to work in multigrade settings, but we were unable to exclude all such schools ex ante. Roughly 6 percent of Grade 2 teachers in each treatment arm reported teaching students from multiple grades in the same classroom. For the sake of transparency, we report results for both the full sample and the restricted sample—that is, the sample that excludes students who were taught in a multigrade setting.
B. Data Collection
We visited each school in our evaluation sample three times: once prior to the start of the interventions (February 2015), again after the first year of implementation (November 2015), and finally at the end of the second year (November 2016). During these school visits, we administered four different survey instruments: a student test of reading proficiency and aptitude conducted on a random sample of 20 students who entered Grade 1 at the start of the study, a school principal questionnaire, a teacher questionnaire, and a parent/guardian questionnaire. We assessed the same students in every round of data collection, but surveyed a different set of teachers between midline and endline because students generally have different teachers in different grades. Finally, we also conducted lesson observations for a stratified random subset of 60 teachers in October 2016. The data-collection and data-capture organizations were independent of the implementing organization and research team and were blind to the treatment assignment. We registered a preanalysis plan at the AEA RCT registry in October 2016, before we had access to the endline data.
The student test was designed in the spirit of the Early Grade Reading Assessment (EGRA) and was administered orally by a fieldworker to one child at a time. The letter recognition fluency, word recognition fluency, and sentence reading components of the test were based on the Setswana EGRA instrument, which had already been developed and validated in South Africa. To this, we added a phonological awareness component in every round of assessment. The baseline instrument did not include all the same subtasks as the midline/endline instruments, because of different levels of reading proficiency expected over a two-year period. For the baseline, we also included a picture comprehension (or expressive vocabulary) test since this was expected to be an easier preliteracy skill testing vocabulary, and thus useful for avoiding a floor effect at the start of Grade 1 when many children are not expected to read at all. Similarly, we included a digit span memory test.18 The logic of including this test of working memory is that it is known to be a strong predictor of learning to read and thus serves as a good baseline control to improve statistical power. For the midline and endline, we added a writing and a paragraph reading subtask. For the endline, we further added a comprehension test.
Of the 3,539 students surveyed at baseline, we were able to resurvey 2,951 at endline, yielding an attrition rate of 16.6 percent. The attriters had either moved school (90 percent of attriters) or were absent on the day of assessment. Moreover, an additional 13 percent of our original sample were repeating Grade 1. Online Appendix Figure A.2 shows the breakdown of attrition and repetition rates by treatment arm. Column 1 in Table A.1 of the Online Appendix regresses attrition status on treatment assignment, after controlling for stratification. It shows there is no statistically significant difference in attrition rates across treatment arms. Columns 2–4 regress different student characteristics—student age, gender, and baseline reading proficiency—on treatment status, attrition, and an interaction between attrition and treatment status. Attriters in the control are slightly older and less likely to be female. However, the coefficients on the interaction terms show that there are no differences across evaluation arms, with the exception that attriters in the training arm are slightly more likely to be female relative to the control. We control for student gender in all our student-level analysis.
The teacher survey contained questions on basic demographics (education, gender, age, and home language), teaching experience, curriculum knowledge, and teaching practice. For curriculum knowledge, we asked about the frequency with which the teacher performed the following activities: group guided reading, spelling tests, phonics, shared reading, and creative writing. The prescribed frequency of these activities is stipulated in the government curriculum and also reflected in the lesson plans. Performing these activities at the appropriate frequency is thus a measure of knowledge and mastery of the curriculum, as well as of fidelity to the lesson plans. The questions on teaching practice covered important student–teacher interactions that flow from group guided reading: whether teachers ask students to read out loud, provide one-on-one assessment, and sort reading groups by ability. Finally, the teacher survey also included a voluntary comprehension test, which was completed by 75, 89, and 98 percent of teachers who completed the teacher survey at baseline, midline, and endline, respectively.
In the endline, we have teacher survey data for 275 teachers in 175 schools. As a result, for 81 percent of the 2,951 students assessed at endline, we also have data on their teacher.19 In Column 5 in Table A.1 in the Online Appendix we regress an indicator for whether a student's teacher completed the teacher survey on the treatment assignment dummies. We see that teacher nonresponse was random across treatment arms.
For the surveyed teachers, we also conducted classroom and document inspections. Fieldworkers counted the number of days on which writing exercises were completed in the exercise book and the number of pages completed in the government workbook.20 To minimize the risk of bias due to strategic selection of exercise books and workbooks, the teacher was asked to provide the books of one of the most proficient students in the class. Furthermore, fieldworkers indicated whether the teacher had a list of student assignments to reading groups and rated, on a four-point Likert scale, the sufficiency and quality of the following print materials: a reading corner (box library), graded reading booklets, Setswana posters, and flash cards. The school principal survey included basic demographic questions, questions on school policies, school location, and school access to resources, and a rough estimate of parent characteristics—the language spoken most commonly in the community and the highest education level attained by the majority of parents.
To gain a better understanding of how teaching practice changed in the classroom, we also conducted detailed lesson observations in October 2016 in a stratified random subset of 60 schools—20 schools per treatment arm. We observed the lesson of one teacher per school. To ensure representation across the distribution of school performance, we stratified the sample by school-average student reading proficiency when drawing it. We also oversampled urban schools, where the impacts of the programs were largest at midline.21 An expert on early-grade reading developed the classroom observation instrument, in close consultation with Class Act and the evaluation team. Since it was a detailed and comprehensive instrument, we limited ourselves to six qualified fieldworkers, all of whom were proficient in Setswana and had at least a bachelor's degree in education. To further ensure consistency across fieldworkers, the project manager visited at least one school with each of the fieldworkers at the start of the data collection, and data quality checks were conducted on all data collected in the first two days.
The instrument covers teaching and classroom activities that we expect to be influenced by the program. For example, the fieldworkers were required to record: the number of students who read or handle books, the number of students who practice the different types of reading activities (including activities such as vocabulary development, phonics, word/letter recognition, and reading sentences or extended texts), how reading is practiced in the classroom (for example, read individually or in a group, read silently or aloud), and the frequency and types of writing activities taking place. The instrument also captures student–teacher interactions related to group guided reading: whether reading groups are grouped by ability, how frequently students receive individual feedback from the teacher, and how frequently students are individually assessed. This final set of indicators mirrors the questions that were asked in the teacher survey. The instrument is very detailed but, unlike some lesson observation instruments, did not require the fieldworkers to record the time devoted to different activities. Rather, questions related to the frequency of different activities were generally coded on a Likert scale.22 After the completion of the lesson observations, the fieldworkers also asked the teachers about the type of teaching support they had received over the past year. These were open-ended questions, which allowed us to record whether a teacher mentioned receiving training or coaching from Class Act or was using the program's graded reading booklets or lesson plans.
To add precision to our estimates, we further complemented these survey measures with 2011 census data and results from a standardized primary school exam conducted in 2014. From the 2011 census, we constructed a community wealth index derived from several questions about household possessions. We also calculated the proportion of 13- to 18-year-olds in the community who were attending an educational institution. We also have data on each school's quintile in terms of socioeconomic status, as coded by the government.
In order to minimize the risk of over-rejection of the null hypotheses due to multiple different indicators, we aggregated the data in the following ways. First, for our main outcome measure of success—reading proficiency—we combined all the subtasks into one aggregate score using principal components. We did this separately for each round of assessment. For the midline and endline scores, we used the factor loadings of the control group to construct the index. This score was then standardized into a z-score: subtracting the control group mean and dividing by the standard deviation in the control. The treatment impact on the aggregate score can thus be interpreted in terms of standard deviations. Furthermore, we grouped the potential mediating factors of changed teaching practice and classroom environment into five broad categories that are theoretically distinct inputs into learning to read: (i) curriculum coverage, (ii) fidelity to routines specified in the curriculum, (iii) teacher–student interactions related to group guided reading, (iv) frequency of practicing different reading activities, and (v) students' use of reading materials. For each category we created a mean index, using the method proposed by Kling, Liebman, and Katz (2007), which is an average of the z-scores of all the constituent indicators.
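For concreteness, the following minimal sketch in Python illustrates the two aggregation steps described above; the DataFrame layout, the column names, and the boolean control-group flag are hypothetical, and our actual implementation may differ in detail.

```python
import numpy as np
import pandas as pd

def aggregate_score(df: pd.DataFrame, subtasks: list) -> pd.Series:
    """First principal component of the subtask scores, with loadings taken
    from the control group only, standardized to a control-group z-score."""
    X = df[subtasks].to_numpy(dtype=float)
    Xc = df.loc[df["control"], subtasks].to_numpy(dtype=float)
    mu, sd = Xc.mean(axis=0), Xc.std(axis=0)
    # Leading eigenvector of the control group's correlation matrix.
    _, eigvecs = np.linalg.eigh(np.corrcoef((Xc - mu) / sd, rowvar=False))
    loadings = eigvecs[:, -1]  # sign is arbitrary; flip so higher = better
    score = ((X - mu) / sd) @ loadings
    sc = score[df["control"].to_numpy()]
    # Subtract the control-group mean and divide by the control-group SD.
    return pd.Series((score - sc.mean()) / sc.std(), index=df.index)

def mean_index(df: pd.DataFrame, indicators: list) -> pd.Series:
    """Kling-Liebman-Katz mean index: the average of the control-normed
    z-scores of all constituent indicators."""
    z = [(df[c] - df.loc[df["control"], c].mean())
         / df.loc[df["control"], c].std() for c in indicators]
    return pd.concat(z, axis=1).mean(axis=1)
```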
C. Balance and Descriptive Statistics
Table 1 shows the balance and basic descriptive statistics of our evaluation sample. Each row represents a separate regression of the baseline variable on treatment assignments and strata dummies, clustering standard errors at the school level. The first column indicates the mean in the control. Columns 2 and 4 indicate the coefficients on the treatment dummies. Column 6 reports the number of observations, and Column 7 reports the p-values for the test of equality between training and coaching.
Table 1: Descriptive and Balance Statistics
Our sample of schools comes predominantly from poor communities: 46.3 percent of schools are in the bottom quintile in terms of socioeconomic status, and 85 percent are in rural areas. In only 44 percent of schools do the majority of parents have a high school certificate or higher qualification. In almost all schools, the main language spoken in the community is Setswana. A small fraction of classrooms ended up being multigrade classrooms (6.2 percent of Grade 2 classes); we were thus unable to perfectly identify and exclude ex ante all schools that practiced multigrade teaching. The teachers are mostly female and educated: 85 and 95 percent of the Grade 1 and 2 teachers, respectively, have a degree or diploma. Nonetheless, teachers' reading comprehension levels are low: the average score on the simple comprehension test is 66 percent. The median number of Grade 2 teachers per school is one (57 percent of schools), and only one school has as many as four teachers. We observe slight imbalance in baseline student reading proficiency and in the school community's socioeconomic status for the training treatment arm. We control for all these variables in the main regression specification.
Online Appendix Figure A.3, Panels a–e, shows the distribution of student scores by treatment status for each subtask administered at baseline. Panel f shows the aggregate score. There are clear floor effects for many of the subtasks, although there is a better spread for the aggregate score. Floor effects for baseline measures will not bias results but could reduce statistical power. Online Appendix Figure A.4, Panels a and b, shows the distribution of the aggregate reading score at midline and endline. Our endline measure is normally distributed and shows no ceiling or floor effects.
Online Appendix Table A.2 compares the sample where we conducted the lesson observations with the full evaluation sample. In each column we regress a different variable on a dummy indicating whether the student or school is in the lesson observation sample. In Columns 1–4 the data are at the individual level; in Column 5 the data are at the school level. In Column 1 the dependent variable is midline reading proficiency, including the full set of controls used in the main analysis (see Equation 1 below). A significant coefficient could thus be interpreted as "value-added," over and above the average learning trajectory of a student. Columns 1–4 in Online Appendix Table A.2 show that there are no statistically significant differences between schools where we conducted the lesson observations and the rest of our evaluation sample, in terms of student reading proficiency at baseline, midline, and endline, as well as a value-added measure between baseline and endline. As expected given our sampling strategy, a far higher proportion of schools where we conducted lesson observations are urban: 36.7 percent, compared to 20 percent in our overall sample. Online Appendix Figure A.5 further shows that the distributions of baseline and endline student reading proficiency are very similar when comparing the lesson observation sample with the rest of the evaluation sample. Kolmogorov–Smirnov tests for the baseline and endline measures of reading proficiency cannot reject the null hypothesis that the distributions are the same. In addition, Online Appendix Table A.3 shows that the reduced sample where we conducted our lesson observations is balanced between treatment groups.
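The equality-of-distributions check can be sketched as follows; the score arrays here are synthetic placeholders standing in for the aggregate reading scores of the lesson observation sample and the rest of the evaluation sample.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Placeholder data, sized loosely on the design (60 observation schools,
# 120 remaining evaluation schools); not the study's actual scores.
obs_scores = rng.normal(0.0, 1.0, 60)
rest_scores = rng.normal(0.0, 1.0, 120)

stat, pvalue = ks_2samp(obs_scores, rest_scores)
# A large p-value means equality of the two distributions cannot be rejected.
print(f"KS statistic = {stat:.3f}, p = {pvalue:.3f}")
```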
D. Empirical Strategy
Our main estimating equation is:
$$y_{icsb1} = \beta_1(\text{training})_s + \beta_2(\text{coaching})_s + \rho_b + X_{icsb0}'\delta + \epsilon_{icsb1} \qquad (1)$$

where $y_{icsb1}$ is the endline (end of second year) aggregate score of reading proficiency for student $i$ who was taught by a teacher in class $c$, school $s$, and stratum $b$; $(\text{training})_s$ and $(\text{coaching})_s$ are the relevant treatment dummies; $\rho_b$ refers to strata fixed effects; $X_{icsb0}$ is a vector of baseline controls; and $\epsilon_{icsb1}$ is the error term, clustered at the school level.
In order to increase statistical power, we control separately for each domain of reading proficiency collected at baseline: vocabulary, letter recognition, working memory, phonological awareness, word recognition, words read, and sentence comprehension. To further increase statistical power and account for any incidental differences that may exist between treatment groups, we control for individual- and community-level characteristics that are highly correlated with $y_{icsb1}$ or were imbalanced at baseline.23 Where data are missing for some observations of the control variables, we imputed missing values and added a dummy indicating missingness as a control.24
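To illustrate the specification, the following minimal sketch estimates Equation 1 on synthetic data; all variable names are hypothetical and the paper's own estimation may differ in software and detail, but the structure (strata fixed effects, baseline controls, school-clustered standard errors) mirrors the equation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data, sized loosely on the study design (180 schools,
# ten strata, roughly 20 sampled students per school).
rng = np.random.default_rng(1)
n = 3600
df = pd.DataFrame({
    "school": rng.integers(0, 180, n),
    "baseline": rng.normal(0, 1, n),
    "female": rng.integers(0, 2, n),
})
df["stratum"] = df["school"] % 10
# School-level assignment: 5 of every 18 schools training, 5 coaching, 8 control.
arm = df["school"] % 18
df["training"] = (arm < 5).astype(int)
df["coaching"] = ((arm >= 5) & (arm < 10)).astype(int)
df["endline_score"] = (0.12 * df["training"] + 0.24 * df["coaching"]
                       + 0.5 * df["baseline"] + rng.normal(0, 1, n))

# Equation 1: strata fixed effects, baseline controls, and standard errors
# clustered at the school level.
m1 = smf.ols(
    "endline_score ~ training + coaching + C(stratum) + baseline + female",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school"]})
print(m1.params[["training", "coaching"]])
```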
When we examine dynamic impacts, we stack the midline and endline data and estimate:

$$y_{icsbt} = \beta_1(\text{training})_s + \beta_2(\text{coaching})_s + \beta_3(\text{training})_s \times P_t + \beta_4(\text{coaching})_s \times P_t + \gamma P_t + \rho_b + X_{icsb0}'\delta + \epsilon_{icsbt} \qquad (2)$$

where $t \in \{1, 2\}$ indicates the round of data collection, and $P_t$ is a dummy variable set to one for endline data. The estimated coefficients $\beta_1$ and $\beta_2$ now show the respective treatment impacts at midline, and $\beta_3$ and $\beta_4$ show the improvements over time.
When investigating treatment impacts on teacher behavior, we estimate:
$$M_{cs} = \beta_1(\text{training})_s + \beta_2(\text{coaching})_s + \rho_b + \epsilon_{cs} \qquad (3)$$

where $M_{cs}$ is the mediating variable of interest for a teacher in class $c$ and school $s$. Standard errors are clustered at the school level for teacher survey data.25 With classroom observation data we also include fieldworker fixed effects and day fixed effects to account for the fact that not all teaching activities observed were supposed to take place on a daily basis.26 Results are robust to the exclusion of fieldworker and day fixed effects.
Finally, when testing heterogeneous treatment impacts, we estimate:
$$y_{icsb1} = \beta_1(\text{training})_s + \beta_2(\text{coaching})_s + \beta_3(\text{training})_s \times \sigma_m + \beta_4(\text{coaching})_s \times \sigma_m + \rho_b + X_{icsb0}'\delta + \epsilon_{icsb1} \qquad (4)$$

where $\sigma_m$ is the moderating variable of interest, which can be at either the individual or the class level, $m \in \{c, i\}$. The moderating variable is also included in the vector of baseline controls. When the moderating variable of interest is at a teacher or class level, we further reweighted the observations so that each teacher or class receives equal weight.27
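Equation 4 with a class-level moderator can be sketched as follows, reusing the synthetic data from the sketch above; the class-size moderator is illustrative, and the weights give each class equal total weight.

```python
import statsmodels.formula.api as smf

# With one class per school in the synthetic data, weight each student by
# the inverse of their class size so every class carries the same weight.
df["class_size"] = df.groupby("school")["school"].transform("size")
df["w"] = 1.0 / df["class_size"]

m4 = smf.wls(
    "endline_score ~ (training + coaching) * class_size + C(stratum) + baseline",
    data=df, weights=df["w"],
).fit(cov_type="cluster", cov_kwds={"groups": df["school"]})
print(m4.params.filter(like=":"))  # the treatment-by-moderator interactions
```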
IV. Results
A. Quality of Implementation
As a first step in our analysis, we examine the quality of implementation. Rows 1–4 in Table 2 show results from the teacher questionnaire administered to all teachers in the evaluation sample. Rows 5–8 in Table 2 show results from the in-depth teacher survey conducted in a subset of 60 schools. The program was well implemented: 97 and 94 percent of teachers in the training and coaching arms, respectively, reported that they received in-service training on teaching Setswana as a home language during that year. The support was also generally well received: 45 and 66 percent in the training and coaching arms, respectively, reported they received very good support in teaching Setswana, relative to 17 percent in the control.28 Teacher satisfaction also increased in the coaching arm. Teachers who received coaching were 28.4 percentage points more likely to strongly agree with the statement: "I feel supported and recognized for my work." Moreover, results from the sample of teachers interviewed during the lesson observations reveal that exposure to the program was high: 79 percent and 90 percent of the regular Grade 2 teachers in the training and coaching arms, respectively, stated that they used the program's lesson plans; 95 and 90 percent, respectively, claimed to have received some training or support from Class Act; 95 percent in both treatment arms used the program's graded reading booklets; and 84 percent of teachers in the coaching arm reported that they were visited by the program's reading coach that year.29 The fact that compliance was not always 100 percent could be due to treated teachers transferring to another school or being assigned to another grade in the same school.30
Control teachers also received a high level of support from government. For example, more than 79 percent of teachers in the control received in-service training on teaching Setswana as a home language during the past year, and 96 percent of teachers had at least some graded reading booklets in the classroom. The results of this program should therefore be interpreted as impacts relative to the status quo of government involvement.
Table 2: Implementation
B. Impacts on Learning
Next we turn to the mean impacts on student reading proficiency at endline. Table 3 shows the regression results on different indicators of reading proficiency, estimated using Equation 1. As recommended by Athey and Imbens (2017), the p-values are constructed using randomization-based inference. We see from Column 1 that training and coaching improved aggregate learning by 0.12 and 0.24 standard deviations, respectively (p = 0.175 and p = 0.001). Column 2 shows that for both treatment arms the impacts are larger when we exclude students in multigrade classrooms: 0.18 and 0.29 standard deviations, respectively (p = 0.041 and p < 0.001). The program was never expected to be effective in such settings. Column 3 shows that the impacts are larger still when we exclude repeaters. These are students who had shorter exposure to the program, because they were not taught by the treated teachers in the second year.
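The randomization-based inference can be sketched as follows: holding outcomes fixed, re-draw the school-level assignment within strata many times and locate the observed coefficient in the resulting permutation distribution. The sketch reuses the synthetic data from Section III.D; the exact procedure used in the paper may differ.

```python
import numpy as np
import statsmodels.formula.api as smf

def coaching_coef(d):
    """Point estimate of the coaching coefficient under Equation 1."""
    return smf.ols(
        "endline_score ~ training + coaching + C(stratum) + baseline",
        data=d,
    ).fit().params["coaching"]

observed = coaching_coef(df)

# School-level assignment labels: 0 = control, 1 = training, 2 = coaching.
sch = (df.assign(arm=df["training"] + 2 * df["coaching"])
         .groupby("school")[["stratum", "arm"]].first().reset_index())

rng = np.random.default_rng(2)
null = []
for _ in range(200):  # use many more draws in practice
    perm = sch.copy()
    # Permute arm labels across schools within each stratum.
    perm["arm"] = perm.groupby("stratum")["arm"].transform(
        lambda s: rng.permutation(s.to_numpy()))
    d = df.drop(columns=["training", "coaching"]).merge(
        perm[["school", "arm"]], on="school")
    d["training"] = (d["arm"] == 1).astype(int)
    d["coaching"] = (d["arm"] == 2).astype(int)
    null.append(coaching_coef(d))

p_ri = np.mean(np.abs(null) >= abs(observed))
print(f"randomization-inference p-value for coaching: {p_ri:.3f}")
```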
Table 3: Main Results
Columns 4–10 further unpack the results, looking separately at each domain of reading proficiency that constitutes the aggregate score. It is encouraging to note that coaching had a statistically significant impact on learning across all the domains of reading proficiency at endline. The impact of training, in contrast, was more muted: only the impact on phonological awareness is statistically significant. The starkest difference between training and coaching is in comprehension (p = 0.086). This is arguably the most important indicator, since the ultimate goal of literacy is reading with comprehension.
Since there was imbalance in baseline learning in the training arm (students in the training arm were underperforming relative to the control), we test whether the impact of training changes materially if we exclude the worst-performing students from the training arm. Online Appendix Table A.4 shows that there is only a very small change in the magnitude of the impact of training as we consecutively trim a larger proportion of the bottom of the distribution in the training arm: the 5th, 10th, and 15th percentiles, respectively, in terms of baseline student performance. For comparison, Columns 5–8 show the balance tests with the same trimmed sample. The difference between the training arm and control converges to zero as we trim a larger proportion of the sample and is no longer statistically significant after trimming the fifth percentile. Therefore, it does not seem that imbalance is driving the smaller impact of the training arm.
Table 4 reports results on dynamic impacts, estimated using Equation 2. The estimated coefficients in the first and third rows indicate the treatment impacts at midline, whereas the coefficients in the second and fourth rows show the improvements from midline to endline. Table A.5 in the Online Appendix reports the same results, but in terms of standard deviations. Students in the coaching and training arms experienced different trends over the two years of the program. The impacts are very similar in magnitude at midline—0.13 and 0.141 standard deviations in the training and coaching arms, respectively (p = 0.107 and p = 0.081). However, over the course of the second year, students in the coaching arm continued their faster pace of learning relative to students in the control (p = 0.131), whereas students in the training arm stagnated or even slipped back slightly toward the control (p = 0.842). The difference in second-year treatment impacts between training and coaching is statistically significant (p = 0.096).
Table 4: Dynamic Impacts
Moreover, Online Appendix Table A.5, Columns 2–7, shows that the dynamic impacts also vary by domain of reading proficiency. At midline, the largest impact for coaching was on phonological awareness (0.22 standard deviations, p = 0.003), and there were no statistically significant impacts on the number of letters and words read, nor on paragraph reading. In the second year, the impacts on phonological awareness and writing actually decrease, but the impacts on reading words, nonwords, and paragraphs accelerate in the coaching arm. Coaching is thus more effective (relative to training) at developing the higher-order levels of learning, such as reading with comprehension. This could explain why the difference between the two treatment arms expanded in the second year: the teaching activities in Grade 2 focused more on reading text than on recognizing sounds and letters, and these teaching activities are more difficult to implement.
In order to interpret the magnitude of the effect sizes, we benchmark the results of this study against those of similar programs and against the learning that took place in the control. A recent meta-analysis of 44 evaluations of coaching programs in the United States found a pooled effect size of 0.11 standard deviations on academic achievement for large-scale effectiveness studies with 100 teachers or more (Kraft, Blazar, and Hogan 2018). Conn (2017) found that the average impact of pedagogical interventions in sub-Saharan Africa was 0.228 standard deviations. A systematic review by McEwan (2015) found a mean effect of teacher professional development programs of 0.12 standard deviations. And a systematic review by Snilstveit et al. (2016) found that structured pedagogical programs have an average impact of 0.23 standard deviations on learning. Taken together, our estimated effect size of 0.232 standard deviations for coaching is in line with, and perhaps slightly larger than, those of similar interventions implemented in developing countries.
When we benchmark the treatment impacts against learning that took place in the control, we focus on the two domains of paragraph reading and comprehension. The coefficient on "Endline" in Table 4 shows the growth that took place in the control over the second year of the evaluation. We estimate that the second-year impact of coaching is equivalent to 26 percent (4.34/16.62) of the improvement in paragraph reading in the control. Moreover, since comprehension was not assessed at midline, we can place an upper bound on learning in the control by assuming that everyone in the control would have scored zero on the test at baseline. With this approach, we estimate from Table 3 that the impact of coaching is equivalent to at least 24 percent (0.3/1.234) of the learning that took place over the two years in the control.
C. Cost-Effectiveness
Since we found that the more costly program, coaching, is also more effective, it is important to determine which intervention was more cost-effective. For this purpose we calculate the ratio of gains to costs for two different outcomes: aggregate reading proficiency and performance on the comprehension test.31 For cost estimates we use the program budget for the second year of implementation, since implementation was likely more streamlined than in the first year. We also exclude the fixed costs of material development (lesson plans, training material, reading booklets), since their contribution to the average per student cost would be nominal if the program were scaled up.32 Based on these estimates, the per student costs of the training and coaching programs are 31 USD and 43 USD per year, respectively.33 Table A.6 in the Online Appendix provides a breakdown of costs by category. The big cost driver for training is the cost of the venue and paying for teachers' transport, food, and accommodations. This cost is almost as high as the overall annual salary cost for the three coaches. Training also had many facilitators, with a teacher-to-facilitator ratio of roughly 7:1.34
Given these estimates we conclude that coaching is more cost-effective. It improved reading proficiency by 0.57 standard deviations per 100 USD spent per student per year, compared to a 0.39 standard deviation increase in the case of training. Coaching is also substantially more cost-effective at improving reading comprehension, with a 17 percentage point improvement in the comprehension test per 100 USD spent per student per year, compared to a six percentage point improvement in the training arm.
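These ratios follow directly from the point estimates and the per student costs; small discrepancies reflect rounding of the underlying estimates:

$$\frac{0.24\ \text{SD}}{43\ \text{USD}} \times 100 \approx 0.56\ \text{SD per 100 USD (coaching)}, \qquad \frac{0.12\ \text{SD}}{31\ \text{USD}} \times 100 \approx 0.39\ \text{SD per 100 USD (training)}.$$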
It is perhaps surprising that coaching is not substantially more expensive than training. Clearly there are ways to reduce the cost of training (for example, by holding a series of smaller workshops in clusters of nearby schools, reducing the number of facilitators or training sessions, or not inviting the head teachers), but we do not know whether the impacts would remain the same. Moreover, given the large difference in effect sizes, the cost of training would need to fall dramatically before training becomes more cost-effective.
D. Changing Teaching Practice
In this section we investigate underlying mechanisms by measuring how the learning environment, teaching practice, and classroom activities changed as a result of the program. For this purpose we draw from three different data sources: the teacher survey and the document inspection administered for the full evaluation sample of teachers, as well as lesson observations conducted in a stratified random subset of 60 schools. As discussed in Section III, we group the potential mediating factors into five broad categories: (i) curriculum coverage, (ii) adherence to the teaching routine as prescribed in the curriculum, (iii) teacher–student interactions related to group guided reading, (iv) frequency of practicing reading, and (v) students' use of reading material. The regression results, estimated using Equation 3, are reported later in Tables 5–7.35
Table 5: Curriculum Coverage and Routine
Table 6: Types of Reading Activity
Table 7: Frequency of Reading Activity and Use of Reading Material
Table 8: Mediation Analysis
Rows 1–5 in Table 5 show treatment impacts on curriculum coverage, as captured during document inspection. Overall, we see a statistically significant increase in curriculum coverage of similar magnitude in both the training and coaching arms. Table 5 also shows results on the (self-reported) frequency of implementing different teaching activities: group guided reading, spelling tests, phonics, shared reading, and creative writing.36 The frequencies of these activities are clearly stipulated in the government curriculum, so, in principle, the teachers in the control should be performing them at the same frequency. We find that teachers in training and coaching schools are more likely to perform each activity at the appropriate frequency, especially teachers who received coaching. Moreover, the difference between coaching and training is statistically significant (p = 0.02). Note that the treated teachers do not state that they are more likely to perform all activities. Rather, they are more likely to perform the activities that should take place on a daily basis—group guided reading and phonics—but less likely to perform the activity that should take place only once a week—correcting spelling. At the very least, these results show that the treated teachers have better knowledge of the appropriate routine they should follow.
Next we unpack the teaching activities related to group guided reading, an activity that teachers in both the training and coaching arms report performing more frequently. There are three important (and practically measurable) components of group guided reading: individual attention from teachers, individual assessment, and sorting reading groups by ability. We asked about each of these indicators separately in the teacher questionnaire and also measured these activities during the lesson observations. Rows 1–5 in Table 6 show results from the teacher survey. For both treatment arms, there is an overall increase in the activities that relate to group guided reading, with consistently larger impacts for coaching relative to training. First, as confirmation of the self-reported increase in conducting group guided reading, we find that program teachers were 16.8 and 34.4 percentage points more likely than control teachers to provide a list of reading groups, in the training and coaching arms, respectively (p = 0.091 and p < 0.001), and this impact is significantly larger for teachers who received coaching (p = 0.0748). We further find that, compared to training and control teachers, teachers who received coaching were more likely to listen to students read out loud and to perform one-on-one reading assessments.37 Teachers in both the training and coaching arms were more likely to state that they stream groups by ability.
Rows 6–11 in Table 6 show that the results from the teacher survey on group guided reading are broadly supported by the lesson observations. There is a large increase in the mean index of 0.58 and 0.635 standard deviations in the training and coaching arms, respectively (p = 0.031 and p = 0.009). When examining the different components of group guided reading, we see a large increase in the coaching arm in the probability that students read aloud in groups (a 37.8 percentage point increase, p = 0.022) and that students read individually to the teacher (a 39.7 percentage point increase, p = 0.059).38 The impact on these two indicators is smaller in the training arm and not always statistically significant. And we find no strong evidence of any improvement in the probability of providing individual assessment or of grouping by ability.39
Note that not all types of reading activities are more likely to take place. For the sake of comparison, Rows 12–14 show that teachers are no more likely to perform whole-class reading, where the whole class reads aloud with the teacher. Teachers are also no more or less likely to read aloud with the students following silently. Whole-class reading is an easy activity to perform in the classroom, and almost all teachers in the control were already doing it. Results from Rows 1–10 in Table 7 show that students are no more likely to practice reading in the classroom because of the programs, nor is there any evidence that teachers are more likely to teach phonics.40 Although the mean index for reading frequency is not significant, we see in Columns 8 and 9 that students in both the training and coaching arms are more likely to read extended texts (3–5 sentences).
Finally, Rows 11–13 in Table 7 report results on the use of books and reading material. We see a substantial increase in use of reading material, especially in the number of children who have opportunities to read. The average number of students who read the booklets increased by 1.6 and 4.6 in the training and coaching arms, respectively (p = 0.057 and p = 0.002). The difference between training and coaching is large and statistically significant at the 1 percent level, this despite the fact that teachers in both treatment arms received the same number and type of reading booklets. Note that the graded reading booklets are meant to be used during group guided reading.
To summarize, for both treatment arms we find improvements in curriculum coverage and teaching practice. Moreover, coaching had a larger impact than training on activities related to group guided reading: more students received individual attention from a teacher and opportunities to practice reading aloud, and more students were reading the graded reading booklets. This result is consistent with the observation that students in the coaching arm progressed at a faster pace in higher-order domains of reading proficiency, such as paragraph reading and reading comprehension, relative to students whose teachers received training.41 But can the learning gains be uniquely attributed to these improvements in teaching practice? We turn to this question next.
E. Mediation Analysis
What proportion of the treatment-induced learning gains can be explained by improvements in teaching practice? To answer this question, we conduct mediation analysis, employing both the linear structural equation model (see, for example, Imai, Keele, and Tingley 2010) and the sequential g estimation as proposed by Acharya, Blackwell, and Sen (2016). Both approaches make strong identifying assumptions, so our own results should be treated as merely suggestive. Section B in the Online Appendix describes the methods in more detail.
Panel A in Table 8 reports regression outputs for the linear structural equation model. For comparison, Row 1 shows the regression results from Equation 1, restricted to students for whom we also have teacher data. The regressions in Rows 2–17 successively include a different mediating variable as one of the independent variables in our main estimating equation. We consider all the intermediate outcomes collected from the teacher survey and document inspection. We do not report any results for data collected during lesson observations because limited sample size means that we do not have sufficient statistical power to draw any definitive conclusions.42 The row headings indicate the mediator of interest.
Two patterns are worth highlighting. First, Column 5 in Rows 13–17 shows that there is a statistically significant positive relationship between learning and almost all variables related to group guided reading, even after controlling for treatment assignment. For example, Row 15 shows that students taught by a teacher who could produce a list of reading groups scored on average 0.141 standard deviations higher than students taught by teachers who could not produce such a list. These results suggest that at least part of the treatment impacts are driven by an increase in the probability that teachers enact group guided reading in the classroom. In contrast, there is no positive relationship between curriculum coverage and learning, and the positive relationship between routine and learning is driven, in part, by an increased propensity to conduct group guided reading. Second, comparing the regression results in Row 1 with the subsequent regressions, we see that the treatment impact of coaching is reduced by 25 percent (from 0.281 to 0.212 standard deviations) after accounting for the contribution of group guided reading to learning.
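Concretely, the implied share of the coaching impact operating through group guided reading in this specification is
\[
\frac{0.281 - 0.212}{0.281} \approx 0.25.
\]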
Panel B reports regression outputs for the final step of the sequential g-estimation. This approach is considered an improvement on Imai, Keele, and Tingley (2010), since it allows one to control for additional post-treatment confounders.43 As possible confounders, we include the mean indexes for curriculum coverage and routine, as well as an index of the print richness of the classroom. The coefficient estimates can be interpreted as what the treatment impacts would have been had treatment not affected group guided reading. The reduction in treatment impacts from Row 1 to Row 18 thus captures the indirect effect: the share of the treatment impact that is explained by treatment-induced changes in the mediator. These results suggest that as much as 68 percent of the treatment impact of coaching is mediated by changes in group guided reading. We therefore have suggestive evidence that improvements in group guided reading are at least partly responsible for the gains in reading proficiency in the coaching arm.
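A minimal sketch of the sequential g-estimation procedure is below. Again, the file and column names are hypothetical; the paper's implementation also includes the Equation 1 controls and bootstraps the standard errors of the final stage.

```python
# Minimal sketch of sequential g-estimation (Acharya, Blackwell, and
# Sen 2016). File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students_with_teacher_data.csv")  # hypothetical file

# First stage: outcome on treatment, the mediator, and post-treatment
# confounders (curriculum coverage, routine, and print richness indexes).
first = smf.ols(
    "endline_score ~ coaching + training + guided_reading "
    "+ curriculum_index + routine_index + print_richness_index",
    data=df,
).fit()

# Demediation: purge the outcome of the mediator's estimated contribution.
df["y_demediated"] = (df["endline_score"]
                      - first.params["guided_reading"] * df["guided_reading"])

# Final stage: the treatment coefficients now estimate the controlled
# direct effect, i.e., what the impacts would have been had treatment
# not changed group guided reading.
final = smf.ols("y_demediated ~ coaching + training", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(final.params[["coaching", "training"]])
```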
F. Heterogeneous Treatment Impacts
To what extent do the impacts of the interventions depend on the characteristics of the student, the teacher, and the class? To summarize, we find that impacts do not vary by observable teacher characteristics. But for both programs there is a strong nonlinear relationship between class size and learning, with the largest impacts observed for medium-sized classes. Moreover, the worst-performing students at baseline do not benefit from the program.
Table 9 displays the regression results on heterogeneous treatment impacts, estimated using Equation 4.44 Columns 1–4 show that effect sizes do not depend on observable teacher characteristics, such as teacher qualifications, age, experience, and the number of books that the teacher has read in a year. Columns 5 and 6 show that, although there is no linear relationship between the number of students in a classroom and effect size, there is a strong nonlinear (positive concave) relationship.
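A stylized version of such a heterogeneity regression is sketched below. This is in the spirit of Equation 4 rather than a reproduction of it, and the file and column names are hypothetical; interacting the treatment dummies with class size and its square allows for the concave effect-size profile described above.

```python
# Stylized heterogeneity regression in the spirit of Equation 4 (the
# exact specification is in the paper); names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")  # hypothetical file
df["class_size_sq"] = df["class_size"] ** 2

# Treatment dummies interacted with class size and its square; the
# interaction coefficients test whether effect sizes vary with class
# size, and the squared term captures the concavity.
het = smf.ols(
    "endline_score ~ (coaching + training) * (class_size + class_size_sq) "
    "+ baseline_score + C(stratum)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(het.params.filter(like=":"))  # the interaction coefficients
```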
Teacher, Class, and Pupil-Level Interaction Effects
To further unpack this nonlinear relationship, Panels A and B in Figure 2 show local polynomial regression estimates of the relationship between effect size and class size percentile rank. We observe that for both interventions the effect sizes are largest for intermediate-sized classes, peaking at roughly the 35th percentile (38 students per class). The treatment impacts are statistically indistinguishable from zero in the very large and very small classes.45 For comparison, Panel C shows the nonparametric relationship between improvements in student learning and class size in the control schools. Control students in very small classes (up to roughly the 15th percentile) learn at a faster pace than the rest of control students. Taken together, it seems that both treated and control teachers perform equally well in the smallest classes, but perform equally badly in the largest classes.
[Figure 2: Treatment impacts on reading proficiency by percentile rank of class size. Panels A and B: training and coaching arms; Panel C: learning gains in control schools.]
Notes: The treatment impacts in Panels A and B are constructed in four steps. First, we construct a value-added measure of reading proficiency by subtracting the predicted score from the actual score, given the set of additional controls in Equation 1: $\tilde{y}_{icsb1} = y_{icsb1} - \mathbf{X}_{icsb}'\hat{\boldsymbol{\beta}}$. Second, we estimate a local polynomial regression of $\tilde{y}_{icsb1}$ on the percentile rank of class size, separately for each treatment arm and the control. Third, we calculate the treatment impact by subtracting the fitted values of the control from the fitted values of each treatment, at each percentile of class size. Fourth, we construct pointwise 95 percent confidence intervals from a percentile bootstrap with 500 iterations, clustering at the school level and stratifying by randomization strata. Panel C shows the relationship between value-added learning and the percentile rank of class size in the control, estimated in Steps 1 and 2.
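The following sketch makes the four steps in the notes concrete. The file and column names are hypothetical, and lowess stands in for the local polynomial smoother; the prediction regression is fit on the control group here purely for illustration.

```python
# Schematic of the four-step construction described in the figure notes.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pd.read_csv("students.csv")  # hypothetical file
grid = np.arange(1, 100)          # percentile ranks of class size

# Step 1: value added = actual minus predicted endline score, where the
# prediction uses the controls from Equation 1.
pred = smf.ols("endline_score ~ baseline_score + C(stratum)",
               data=df[df["arm"] == "control"]).fit()
df["value_added"] = df["endline_score"] - pred.predict(df)

# Step 2: smooth value added on the class-size percentile rank, by arm.
def curve(arm):
    sub = df[df["arm"] == arm]
    return lowess(sub["value_added"], sub["class_size_pctile"],
                  frac=0.5, xvals=grid)

# Step 3: treatment impact = treated curve minus control curve,
# evaluated at each percentile of class size.
impact = {arm: curve(arm) - curve("control")
          for arm in ("training", "coaching")}

# Step 4: pointwise 95 percent confidence intervals come from repeating
# Steps 1-3 on 500 bootstrap samples drawn by school within strata.
```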
One possible interpretation for this nonlinear relationship is that the new teaching techniques and learning materials allow teachers to overcome some constraints to student learning that are present in larger classes, but teachers are either unable to implement these techniques in the largest classes, or these techniques are less effective in the largest classes. For example, since control teachers mostly perform whole-class teaching (that is, the whole class reads aloud with the teacher), it is plausible that students in larger classes are less likely to receive individual feedback from a teacher and have fewer opportunities to practice reading, compared to students in smaller classes. In contrast, group guided reading activities can provide students with these opportunities. However, teachers might find it impossible to implement group guided reading in extremely large classes, or group guided reading becomes less effective on average since a lower proportion of students get opportunities to read in front of the teacher on any given day.
Turning to student-level interactions, Panels A and B in Figure 3 show local polynomial regression estimates of the relationship between effect size and a student's percentile rank in terms of baseline academic performance. In the coaching arm, students who performed worst at baseline benefit least from the program. In fact, Panel B in Figure A.6 in the Online Appendix shows that there is no statistically significant impact for the bottom fifth of students. Panels A and B in Figure 4 show that the impact does not vary by a student's relative rank within her class. This suggests that the student's absolute level of reading proficiency, rather than relative position in the class, is the constraint to learning.
[Figure 3: Treatment impacts on reading proficiency by student's percentile rank of baseline performance. Panels A and B: training and coaching arms.]
Notes: The treatment impacts in Panels A and B are constructed in four steps. First, we construct a value-added measure of reading proficiency by subtracting the predicted score from the actual score, given the vector of controls in Equation 1: $\tilde{y}_{icsb1} = y_{icsb1} - \mathbf{X}_{icsb}'\hat{\boldsymbol{\beta}}$. Second, we estimate a local polynomial regression of $\tilde{y}_{icsb1}$ on the student's percentile rank of baseline performance, separately for each treatment arm and the control. Third, we calculate the treatment impact by subtracting the fitted values of the control from the fitted values of each treatment, at each percentile. Fourth, we construct pointwise 95 percent confidence intervals from a percentile bootstrap with 500 iterations, clustering at the school level and stratifying by randomization strata.
[Figure 4: Treatment impacts on reading proficiency by student's percentile rank within her class. Panels A and B: training and coaching arms.]
Notes: The treatment impacts in Panels A and B are constructed in four steps. First, we construct a value-added measure of reading proficiency by subtracting the predicted score from the actual score, given the vector of controls in Equation 1: $\tilde{y}_{icsb1} = y_{icsb1} - \mathbf{X}_{icsb}'\hat{\boldsymbol{\beta}}$. Second, we estimate a local polynomial regression of $\tilde{y}_{icsb1}$ on the student's percentile rank within her class, separately for each treatment arm and the control. Third, we calculate the treatment impact by subtracting the fitted values of the control from the fitted values of each treatment, at each percentile. Fourth, we construct pointwise 95 percent confidence intervals from a percentile bootstrap with 500 iterations, clustering at the school level and stratifying by randomization strata.
There are many possible explanations for this finding, none of which we can conclusively rule out. It could be that the worst-performing students do not have a strong enough foundation to benefit from the new teaching techniques. The teachers might be covering the curriculum too quickly, or the curriculum itself might be too ambitious. Or it might be that these students lack other complementary inputs to reading acquisition, such as literate and involved parents, or that weaker students are more likely to attend worse-quality schools that are less responsive to the treatments.
V. Conclusion
We report the results of a randomized evaluation of two different approaches to improving the instructional practices of early-grade reading teachers in public primary schools in South Africa. The first approach (training) follows the traditional model of a one-off training conducted at a central venue. In the other approach (coaching), teachers are visited on a monthly basis by a specialist reading coach who monitors their teaching, provides feedback, and demonstrates correct teaching practices. We find that coaching has a large and statistically significant impact on student reading proficiency, more than twice the size of the training arm. Coaching is also more cost-effective.
Detailed classroom observations and document inspection give insight into which teaching practices changed. We find that teachers in both treatments are more likely to practice a difficult teaching technique called group guided reading. Because group guided reading occurs in smaller groups, students are more likely to read aloud and receive individual attention from their teacher while they are reading. In contrast, teachers in the control most typically conduct whole-class teaching, where the whole class reads aloud with the teacher. These impacts are consistently larger for teachers who received coaching than for those who received training. Furthermore, mediation analysis shows that improvements in group guided reading explain a large proportion of the learning gains in the coaching arm.
These results suggest that coaches play an important role in the adoption of more technically challenging teaching techniques. Group guided reading is particularly difficult to implement: teachers need to reorganize the classroom and keep the rest of the class busy while they provide targeted feedback to the smaller reading group. Indeed, teachers complained that group guided reading is difficult, especially in larger classes, and that the training was too short for them to fully understand it.
Our findings on the use of reading material also reveal important complementarities in the education production function between access to resources, teaching practice, and use of resources. Although teachers in both treatment arms received the same learning material—that is, the graded reading booklets—the impact in the training arm was more muted, and students were not using the resources adequately. In contrast, coaching helped teachers apply a difficult teaching technique—group guided reading—which in turn enabled students to use the learning materials.
It is important to note that both programs are bundled interventions, so we cannot attribute the learning gains exclusively to the coaching or training components. For example, the lesson plans might have had the same impact in the training arm even in the absence of training, and the training and coaching might have had no impact if not combined with learning aids and lesson plans. This is an inevitable limitation of evaluating bundled interventions. But it is also a strength of the program, since it was designed on the premise that the different components complement each other.
Seen in the context of other evaluations of similar programs, it is likely that our results are generalizable, at least for improving early-grade reading in sub-Saharan Africa. Other studies in sub-Saharan Africa have found that the combination of reading coaches and supporting learning materials can improve students’ proficiency in early-grade reading (Piper, Zuilkowski, and Mugenda 2014; Piper et al. 2018; Lucas et al. 2014; Kerwin and Thornton 2017). Moreover, a previous quasi-experimental evaluation of a very similar coaching program in a different province in South Africa also found positive impacts on learning (Fleisch et al. 2016), even though schools in this study were predominantly urban and multilingual.
Looking forward, a key question is whether and how the coaching program can be scaled up. Capacity and resource constraints make us hesitant to conclude that the government should scale the program as currently designed. The government could rely on existing staff to do the coaching, but it is unclear whether they would have the right capacities and incentives to provide the appropriate support. The program in our study relied on only three coaches, and it was implemented by a nongovernmental organization with strong incentives to demonstrate positive impacts. We do not know whether coaches would have the same impact if they were less qualified, visited less often, or connected remotely rather than in person. Nonetheless, the quasi-experimental study by Fleisch et al. (2016) provides encouraging evidence that this could be a scalable model, since their program was implemented in more than 1,000 schools. Moreover, the fact that teacher behavior changed in the training arm, without any visits by a reading coach, suggests the possibility of positive impact even with less-qualified coaches. In sum, the program has the potential to be implemented at larger scale, but many questions remain unanswered: Does the coaching model require highly qualified coaches, or could the mere act of monitoring teaching encourage practice and thus facilitate the adoption of new teaching techniques? Can virtual coaching have the same impact as in-person coaching? Can a year of coaching lead to a sustained change in teaching practice? These questions will be a focus of future research.
Footnotes
Results reported on in this study form part of the Early Grade Reading Study (EGRS), which was led by the Department of Basic Education to find ways to improve early grade reading. We are grateful for useful feedback from Servaas van der Berg, David Evans, Clare Leaver, Rob Garlick, James Habyarimana, and a pair of anonymous reviewers. Carol Nuga-Deliwe, Janeli Kotze, and Mpumi Mohohlwane provided excellent research assistance and management support. The International Initiative for Impact Evaluation provided funding for the evaluation. Many partners contributed to funding the program: UNICEF, Zenex Foundation, North West Department of Education, Anglo American Chairman's Fund, and the Department of Planning, Monitoring and Evaluation. The University of the Witwatersrand and the Human Sciences Research Council provided additional administrative and research support. All errors and omissions are our own. The data used in this article are available online: Cilliers, Jacobus. The South African Early Grade Reading Study Wave 4. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. https://doi.org/10.3886/E115229V2.
Supplementary materials are freely available online at: http://uwpress.wisc.edu/journals/journals/jhr-supplementary.html
↵1. This is the percentage of children scoring less than the low international benchmark score, as defined by the Progress in International Reading Literacy Study (PIRLS).
↵2. By some estimates the United States spends 18 billion dollars annually on teacher professional development (Fryer 2017). According to a nationally representative survey conducted in 38 developed countries, 91 percent of teachers received professional development in the previous 12 months (Strizek, Tourkin, and Erberber 2014). Popova, Evans, and Arancibia (2016) calculate that nearly two-thirds of World Bank–funded education programs include a professional development component.
↵3. Jacob and Lefgren (2004), Harris and Sass (2011), Garet et al. (2008, 2011), Randel et al. (2011).
↵4. In our case, teachers received two training sessions, once at the beginning and once in the middle of the year, each lasting two days.
↵5. We estimate that the average exposure to the programs was 32 and 37 hours for the training and coaching arms, respectively—roughly four to five days in total.
↵6. Phonics and letter recognition are also required to be taught daily and are typically taught through whole-class reading, where all the children in the classroom follow or read with the teacher. This is a far easier form of teaching.
↵7. In South Africa, most children are taught in their home language in Grades 1–3 and then transition to English as the language of instruction in Grade 4.
↵8. Teachers were strongly encouraged to use the lesson plans, but this was not enforced.
↵9. Roughly 140 teachers and head teachers participated in the training. Given the large number of teachers participating, two training sessions were conducted per semester, with roughly 70 teachers per group. Ten facilitators participated in each of the training sessions.
↵10. The training focused on coaching and mentoring, school curriculum, and teaching skills.
↵11. Assuming that each information session lasted five hours, each coaching visit lasted one and one-half hours (one hour for observation and 30 minutes for feedback), and each afternoon session lasted two hours.
↵12. We assume that teachers spent on average 20 minutes talking to the trainer when visited by the trainers. The trainers were only supposed to talk to the school principals, but inevitably they also talked to the teachers.
↵13. See, for example, Langenberg et al. (2000).
↵14. These groups should ideally be sorted by ability. The teacher is expected to assess reading ability by observing each student as they read a text.
↵15. This is possibly because the curriculum has been revised several times in recent decades. But most teachers were not properly trained to implement new methods and did not have all the necessary reading materials.
↵16. Approximately 65 percent of South African children attend nonfee schools. Schools serving communities with higher socioeconomic status are allowed to charge fees, but they receive smaller government subsidies as a consequence.
↵17. The full evaluation also included a third treatment arm focused on parental involvement rather than teacher training, the results of which we will discuss in a separate paper.
↵18. This involved repeating from memory first two numbers, then three, and so forth up to six numbers (five items), and the same five items for sequences of words.
↵19. We cannot tell what proportion of teachers did not respond: because children were randomly drawn at the school level, we do not know how many teachers would have been matched to the students with missing teacher data.
↵20. To reduce data capture error, we asked the fieldworker to count only pages completed for three specific days. We chose three days that should have been covered by teachers by the end of the year, regardless of their choice of sequencing.
↵21. In particular, we randomly drew schools from each treatment group in the following manner: (i) six urban schools, (ii) five schools in the top tercile and five schools in the bottom tercile in terms of average performance across both baseline and midline, and (iii) four schools in the top tercile in terms of improvement between baseline and midline.
↵22. For example, the fieldworkers recorded different types of reading activities as: never, sometimes, mostly, always.
↵23. The additional controls include: students’ gender, students’ parents’ education, district dummy (schools were randomly spread across two districts), performance in the most recent standardized Annual National Assessments (ANA), a community-level wealth index, and average secondary school attendance rates in the communities surrounding the school.
↵24. For categorical variables, we assigned missing values to zero; for continuous variables we assigned missing observations to equal the sample mean.
↵25. We only observed one teacher per school in the classroom observations, so there is no need to cluster our standard errors at the school level. But we surveyed all the Grade 2 teachers in each school, often more than one teacher per school.
↵26. According to the lesson plans, creative writing is supposed to take place on Fridays, which provides fewer opportunities to practice reading.
↵27. We have the same number of students per school, but due to random sampling of students, we do not have the same number of students per teacher or class.
↵28. Interestingly, teachers in the coaching arm were more likely to state that they received too much support.
↵29. Four of the sampled teachers in the classroom observations were not the regular Grade 2 teachers.
↵30. However, we believe the latter is unlikely, since teachers typically teach the same class for the duration of the year.
↵31. We consider the latter indicator because reading with comprehension is arguably the ultimate goal of literacy development. We divide the score by four so the outcome is the proportion of questions answered correctly.
↵32. A further challenge in allocating costs is that one organization jointly implemented both interventions, so some costs (such as program management, administration, and quality assurance) were shared across the programs. We asked Class Act to provide their best estimate of how time was allocated across the different interventions, and we allocated costs accordingly.
↵33. The cost of implementing the program in 50 schools is 114,210 USD and 160,221 USD in the training and coaching arms, respectively. Given an average of 74.6 students per school at the start of the program, this amounts to a cost of 31 USD and 43 USD per student, respectively. If we exclude overhead costs for coaching and only consider the key variable costs—materials, salaries, and transport—then the cost is 29 USD per student.
↵34. Note that the salary costs in the training arm do not include the time that the coaches dedicated to training. If the programs were implemented separately, the overall training salary costs would therefore be higher.
↵35. Many of the indicators are ordinal variables, but for ease of interpretation we report results for adapted binary variables. Results on statistical significance remain the same when running an ordered logit model on the ordinal variables, and the mean index is constructed using the ordinal variable, thus preserving all the information captured by fieldworkers.
↵36. Options were: less than once a week, once a week, two to four times a week, every day, twice a day.
↵37. Original variables are ordinal ranging from one for “never” to five for “nearly every day.”
↵38. These indicators were first recorded as ordinal variables ranked from one to four. For ease of interpretation, we converted them into binary indicators of whether any such activity took place.
↵39. There is a small increase in the probability of providing individual assessment, which is statistically significant only in the training arm.
↵40. The fieldworkers were asked to record how many students in the classroom were involved with reading letters, words, sentences, or extended texts. The answers were recorded on a five-point Likert scale, ranging from none to all the students. Fieldworkers also recorded the extent to which teachers covered phonics on a four-point Likert scale. As before we construct binary variables for ease of interpretation—equal to one if at least some students are reading and equal to one if the teacher teaches phonics at least some of the time.
↵41. In contrast, the programs had a similar impact on phonological awareness. Phonics is typically taught using whole-class teaching activities, which is easy to do and already widely implemented.
↵42. After matching lesson observation with students’ learning data, we are left with a sample of 53 teachers, compared to 275 teachers from the survey data.
↵43. Although it still makes the strong assumption that we have controlled for all post-treatment confounders that are correlated with both the mediator and the outcome.
↵44. In all subsequent analysis we drop the small sample of multigrade classes, as we do not want any trends we observe to be driven by these schools. Results are robust to including these schools.
↵45. Panel A in Online Appendix Figure A.6 shows the treatment impact by quartile of class size. For both treatments, the difference in effect sizes between the middle two quartiles and the extreme quartiles of class size is statistically significant (p < 0.001). As a reference point, the 25th and 75th percentiles have class sizes of 35 and 46 students per class.
- Received June 2018.
- Accepted October 2018.