## Abstract

In a recent paper in the *Journal of Human Resources,* Dynarski (2008) used data from the 1 percent 2000 Census Public Use Microdata Sample (PUMS) files to demonstrate that merit scholarship programs in Georgia and Arkansas increased the stock of college-educated individuals in those states. This paper replicates the results in Dynarski (2008) but also finds important differences in the results between the 1 percent and 5 percent PUMS, especially for women. We also demonstrate that the author’s use of clustered standard errors, given the small number of clusters and only two policy changes, severely understates the width of the confidence intervals.

## I. Introduction

Beginning in the early 1990s, several states introduced merit-based financial aid programs for students pursuing higher education within their state of residence. These programs usually have three related goals. First, they aim to increase access to higher education and incline some high-achieving high school students to go to college who might not have been able to afford to do so otherwise. Second, merit programs aim to encourage more students to go to college in-state. Third, these merit programs aspire to increase the completion rate. Several studies have examined the various effects of these merit aid programs, with much of the research focusing on Georgia’s HOPE Scholarship program. Dynarski (2000; 2004) finds that the HOPE Scholarship increased the probability of enrollment for young people in Georgia.^{1} In a frequently cited paper in the *JHR*, Dynarski (2008) examines microdata from the 2000 Census and concludes that merit aid programs in Georgia and Arkansas have increased the share of young people who have obtained a college degree (either an associate’s or bachelor’s) by three percentage points.

This paper replicates Dynarski (2008) and explores the sensitivity of her results to using a different sample and different estimation procedures. Several interesting results emerge. First, coefficient estimates differ between the 2000 Census 1 percent Public Use Microdata Sample (PUMS) and the 5 percent PUMS. Using the 1 percent PUMS with Dynarski’s estimation procedures (which are explained in detail in her article), we are able to replicate her results exactly. However, when Dynarski’s estimation procedure is applied to the 5 percent PUMS, the coefficient estimates are considerably smaller. The estimated effect of the state merit aid programs on degree completion is 0.0298 using the 1 percent PUMS, but falls to 0.0091 when the 5 percent PUMS is used. Further analysis reveals that the differences across the samples are mostly concentrated among women. Given that the 1 percent and 5 percent PUMS are drawn from the same underlying population, differences across the two samples are largely unexpected.

Our second main result is that the statistical significance levels in Dynarski (2008) are greatly overstated because of her use of clustered standard errors with only two policy changes and with a small number of clusters. Clustered standard errors are often an improvement over conventional Ordinary Least Squares (OLS) standard errors, which do not account for intracluster correlation and can be downwardly biased. With clustered standard errors there is also the issue of the level at which the data should be clustered, for example, at the state level or the state-age level. If there is correlation within states across ages, then clustering at the state level might be preferable for some applications. However, clustered standard errors can also be substantially downwardly biased when the number of clusters is small (MacKinnon and White 1985; Bertrand, Duflo, and Mullainathan 2004; Donald and Lang 2007) or the number of treated groups (policy changes) is small (Bell and McCaffrey 2002; Abadie, Diamond, and Hainmueller 2010; Conley and Taber 2011). This small sample bias for clustered standard errors is likely to be especially severe in difference-in-differences models with only a few policy changes (Conley and Taber 2011). To obtain valid inferences, we follow two separate approaches for examining significance levels. First, we address the issue of the small number of policy changes using the approach suggested by Conley and Taber (2011). We then address the small number of clusters using the approach suggested by Cameron, Gelbach, and Miller (2008). Using each of these methods, we find a statistically insignificant effect of merit programs on degree completion for both the 1 percent and 5 percent PUMS. In other words, while the coefficient estimates are positive, we cannot be reasonably confident that the true effects differ from zero.

The paper proceeds as follows. In the next section we briefly summarize Dynarski (2008). In Section III we discuss our results using her procedures for the 1 percent and 5 percent PUMS. In Sections IV and V we employ two alternative procedures for inferences suggested by Conley and Taber (2011) and by Cameron, Gelbach, and Miller (2008), respectively. Section VI further investigates differences between the census samples, and a final section concludes.

## II. Summary of Dynarski (2008)

Arkansas and Georgia introduced large, broad-based, merit-based student aid programs in 1991 and 1993, respectively.^{2} Dynarski examines the effects of these two broad-based merit aid programs on degree completion using a treatment-comparison research design. She treats the adoption of merit aid programs in Arkansas and in Georgia as natural experiments. Students who finished high school in Arkansas and Georgia after the programs were adopted are considered the treatment group, while the comparison group consists of students in states that did not adopt merit programs during the period under study and students in Arkansas and Georgia who finished high school before the merit programs were implemented.

As a practical matter, the Census microdata do not report when or in what state a student completes high school. Since she does not know who received student aid, Dynarski uses a variable, denoted *merit*, which measures whether the student would have been exposed to a merit-based aid program while in high school. This variable is determined by place of birth, not place of residence at the time of the Census, because a change in the percentage of the population with a college degree in a state could be due to migration of college graduates. Given that most students graduate high school at age 18, she assumes that high school graduation occurs at age 18, and thus defines the treatment group as persons who were either (1) born in Arkansas and age 27 or younger at the time of the Census or (2) born in Georgia and age 25 or younger at the time of the Census. Dynarski notes that this assignment of the treatment status will cause measurement error and result in downwardly biased estimates of the effects of merit programs on degree completion. The sample is then restricted to persons between the ages of 22 and 34 at the time of the 2000 Census, who were born in the United States and have nonimputed information for age, state of birth and education. The sample also excludes persons born in Mississippi because Mississippi adopted a merit program in 1996 and is therefore not a legitimate control group.
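The assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the stated age and birthplace cutoffs; the function and constant names are hypothetical, and the further restrictions to U.S.-born persons with nonimputed age, state of birth, and education are left out for brevity.

```python
# Oldest age at the 2000 Census still counted as exposed to a merit program,
# by state of birth (cutoffs taken from the text; names are illustrative).
MERIT_MAX_AGE = {"Arkansas": 27, "Georgia": 25}

# Mississippi adopted a merit program in 1996, so it is dropped from the sample.
EXCLUDED_BIRTH_STATES = {"Mississippi"}

MIN_AGE, MAX_AGE = 22, 34


def in_sample(age: int, birth_state: str) -> bool:
    """True if a person falls in the analysis sample by age and state of birth."""
    return MIN_AGE <= age <= MAX_AGE and birth_state not in EXCLUDED_BIRTH_STATES


def merit(age: int, birth_state: str) -> int:
    """Return 1 if the person is assigned to the treatment group, else 0."""
    cutoff = MERIT_MAX_AGE.get(birth_state)
    return int(cutoff is not None and age <= cutoff)
```

Because treatment is keyed to state of birth rather than state of residence, a person born in Georgia who moved away before high school is still coded as treated, which is the source of the measurement error Dynarski notes.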

Dynarski’s baseline empirical model is represented as follows:

$$y_{ab} = \beta \, \mathit{merit}_{ab} + \delta_{a} + \delta_{b} + \varepsilon_{ab} \qquad (1)$$

where $y_{ab}$ is the share of persons of age *a* born in state *b* who have completed a college degree (either an associate’s or bachelor’s), $\mathit{merit}_{ab}$ is an indicator variable equal to one for the treatment group and zero otherwise, $\delta_{a}$ and $\delta_{b}$ are age and state of birth fixed effects, and $\varepsilon_{ab}$ is an idiosyncratic error term. If the model is correctly specified, then $\beta$ measures the effect of merit programs on degree completion. Dynarski (2008) estimates Equation 1 using Weighted Least Squares where age-state observations are weighted by the number of persons in the age by state of birth cells and standard errors are clustered by state of birth.^{3} To estimate Equation 1, she uses the 1 percent PUMS file from the 2000 Census of Population.
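As a rough illustration of how Equation 1 might be estimated on age by state of birth cells, the following NumPy sketch computes Weighted Least Squares coefficients with a cluster-robust sandwich variance. It is not Dynarski’s code: the function names are hypothetical, and the finite-sample adjustment shown is one common choice that we assume here rather than take from her article.

```python
import numpy as np


def fe_design(merit, age, state):
    """Design matrix: merit, intercept, age dummies, state dummies.

    The first level of each fixed effect is dropped to avoid collinearity.
    """
    merit, age, state = map(np.asarray, (merit, age, state))
    cols = [merit.astype(float), np.ones(len(merit))]
    cols += [(age == a).astype(float) for a in np.unique(age)[1:]]
    cols += [(state == s).astype(float) for s in np.unique(state)[1:]]
    return np.column_stack(cols)


def wls_cluster(y, X, w, cluster):
    """WLS coefficients and cluster-robust standard errors."""
    y, w, cluster = map(np.asarray, (y, w, cluster))
    bread = np.linalg.inv(X.T @ (w[:, None] * X))
    beta = bread @ (X.T @ (w * y))
    resid = y - X @ beta

    # Sum the weighted scores within each cluster, then accumulate outer products.
    meat = np.zeros_like(bread)
    groups = np.unique(cluster)
    for g in groups:
        idx = cluster == g
        score = X[idx].T @ (w[idx] * resid[idx])
        meat += np.outer(score, score)

    # Assumed finite-sample adjustment: G/(G-1) * (N-1)/(N-K).
    n, k = X.shape
    n_groups = len(groups)
    adj = (n_groups / (n_groups - 1)) * ((n - 1) / (n - k))
    variance = adj * bread @ meat @ bread
    se = np.sqrt(np.clip(np.diag(variance), 0.0, None))
    return beta, se
```

Here `y` would hold the cell-level degree shares, `w` the cell counts, and `cluster` the state of birth; the first element of `beta` is the estimate of β.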

In her baseline results Dynarski obtains a coefficient on the treatment dummy of 0.0298 with a standard error of 0.0040, implying that the merit-based aid programs had a positive and statistically significant effect on college degree attainment. She runs several robustness checks, including the use of different sets of control variables, and obtains similar results: a coefficient of about 0.03 and a standard error of about 0.004.

## III. Replication of Dynarski’s Results

We first replicate Dynarski (2008) using the 1 percent PUMS, and then replicate her procedure using the 5 percent PUMS and five 1 percent subsamples created from the 5 percent PUMS. These data were extracted from the IPUMS (Ruggles et al. 2008). For the dependent variable, Dynarski focuses on the completion of an associate’s degree or higher, and we do so as well. We also explore using bachelor’s degrees or higher as the dependent variable (not shown) and find qualitatively similar results. Table 1 presents our replication results for the 1 percent PUMS, the 5 percent PUMS, and five 1 percent subsamples created from the 5 percent PUMS. The first column presents results for the total population, while the second and third columns present separate results for females and males. The results for the 1 percent PUMS are presented first. We obtain a coefficient estimate for the total population of 0.0298, which is exactly the same as in Dynarski’s baseline specification in Column 1 of her Table 3. Estimating standard errors by clustering by state of birth yields a standard error of 0.0040, the same as in Dynarski (2008). This results in a 95 percent confidence interval between 0.0223 and 0.0374 and implies a statistically significant effect of merit aid programs on degree completion.

When we replicate Dynarski’s procedure, but use the 5 percent PUMS, the coefficient estimate for the total population decreases considerably to 0.0091. Clustering by state of birth produces a standard error of 0.0034, which gives a 95 percent confidence interval between 0.0026 and 0.0157; the effect of merit programs on degree completion is still statistically different from zero using clustered standard errors. However, the difference in the results between the two samples is a bit puzzling. In fact, the difference in the coefficients is sufficiently large and the clustered standard errors are sufficiently small that we would reject the null hypothesis that the two samples produce the same merit program coefficient with a *p*-value less than 0.01. Similarly, the 95 percent confidence intervals for the two estimates do not overlap when based on clustered standard errors.
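The arithmetic behind these comparisons can be checked with a short Python sketch. It uses the rounded coefficients and standard errors reported above, a normal critical value of 1.96, and, as a simplifying assumption of ours, treats the two estimates as independent when testing their difference. The function names are illustrative.

```python
import math


def normal_ci(coef, se, z=1.96):
    """Approximate 95 percent confidence interval from a coefficient and its SE."""
    return coef - z * se, coef + z * se


def diff_z(coef1, se1, coef2, se2):
    """z statistic for coef1 - coef2, assuming the two estimates are independent."""
    return (coef1 - coef2) / math.sqrt(se1 ** 2 + se2 ** 2)


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


ci_1pct = normal_ci(0.0298, 0.0040)   # 1 percent PUMS
ci_5pct = normal_ci(0.0091, 0.0034)   # 5 percent PUMS
z = diff_z(0.0298, 0.0040, 0.0091, 0.0034)
p = two_sided_p(z)
```

With these inputs the z statistic is roughly 3.9, so the implied *p*-value is well below 0.01, and the lower bound for the 1 percent PUMS lies above the upper bound for the 5 percent PUMS. The bounds computed this way differ from the reported ones by a few ten-thousandths, presumably because the published coefficients and standard errors are rounded.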

Given that the Census samples are drawn from the same underlying population, this difference in coefficients is unexpected. However, there is likely a problem with using clustered standard errors to make inferences in this setting. Inferences using clustered standard errors rest on the assumption of a large number of treatment groups, and clustered standard errors are considerably downwardly biased in difference-in-differences models that are based on a small number of policy changes (Conley and Taber 2011). Thus, it is likely that the clustered standard errors are underestimated and lead to invalid inferences. Furthermore, larger standard errors could help explain the differences in coefficient estimates for the different samples. Larger standard errors could mean that the differences across the samples are not statistically significant, and we would be less surprised to find moderately different coefficient estimates from different samples.

As further evidence that results might differ across samples, we explored dividing the 5 percent PUMS into five 1 percent subsamples using the PUMS subsample variable. Both the 1 percent and the 5 percent PUMS divide the population into 100 random subsamples numbered from 0 to 99 (U.S. Bureau of the Census 2003). The 5 percent PUMS can, therefore, be divided into five 1 percent subsamples using this variable. We follow Census recommendations and construct five 1 percent subsamples by grouping subsamples ending in 1 and 6, 2 and 7, 3 and 8, 4 and 9, and 5 and 0 (p. 95). We then estimate the effect of merit-aid programs on degree completion using the same procedure as Dynarski (2008) for each sample. These results are also presented in Table 1. For the first three constructed 1 percent subsamples, the estimated coefficients for the *merit* variable are very small and not statistically significant using clustered standard errors. For the fourth and fifth constructed 1 percent subsamples, however, the coefficient estimates are 0.0222 and 0.0215, respectively, and are significant at the 5 percent level using clustered standard errors. Again, these constructed samples are drawn from the same underlying population and should not give statistically significantly different results. However, because clustered standard errors are likely underestimated, we should not use them for making inferences in this instance; we return to this in the next section.^{4}
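The grouping of the 100 PUMS subsample numbers into five 1 percent subsamples can be expressed compactly. In the sketch below, the function name is hypothetical, and numbering the groups 1 through 5 in the order listed above is our assumption.

```python
def one_percent_group(subsample: int) -> int:
    """Map a PUMS subsample number (0-99) to one of five constructed groups.

    Group 1: last digit 1 or 6; group 2: 2 or 7; group 3: 3 or 8;
    group 4: 4 or 9; group 5: 5 or 0.
    """
    if not 0 <= subsample <= 99:
        raise ValueError("subsample number must be between 0 and 99")
    remainder = (subsample % 10) % 5
    return remainder if remainder else 5
```

Each group collects 20 of the 100 subsample numbers, which is what makes each constructed subsample one-fifth of the 5 percent PUMS.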

Dynarski (2008) also suggests that there are important differences by gender in the effects of merit programs on degree completion. She finds that the effect for women is roughly twice as large as the effect for men. We next estimate separate effects by gender for the various samples that we use. An interesting result emerges in that most of the difference between the 1 percent and 5 percent PUMS is due to females. The *merit* coefficient for females is 0.0377 in the 1 percent PUMS, but only 0.0022 in the 5 percent PUMS. This is an even bigger difference than for the total population. For males, though, the *merit* coefficient estimates are only slightly different across the two samples at 0.0201 and 0.0157 for the 1 percent and 5 percent PUMS, respectively. However, coefficient estimates differ across the five constructed 1 percent subsamples for both females and males. The merit coefficients range from −0.0071 to 0.0167 for females and from −0.0050 to 0.0342 for males. Importantly though, conventional clustered standard errors should not be used to determine if the differences across the samples are statistically significant. In the next two sections we follow two approaches intended to provide correct inferences about the effects of merit programs on degree completion. These approaches will also help us discern whether the differences across samples are significant.

## IV. Inferences Based on the Conley and Taber Procedure

As an alternative to using clustered standard errors to make inferences, we first implement a procedure suggested by Conley and Taber (2011) using code available from Conley’s website. The Conley-Taber procedure is especially useful in applications where there are a large number of control groups but only a small number of policy changes. Their procedure can be used to estimate confidence intervals in difference-in-differences models based on the distribution of residuals across the control groups. Monte Carlo analysis confirms that their approach outperforms conventional clustering methods when the number of treatment groups is small and does no worse in more general settings. They also illustrate their procedure using the effect of merit aid programs on college enrollment along the lines of Dynarski (2000; 2004) and show that inferences based on their method differ from those based on conventional clustered standard errors. We refer the reader to their paper for further details.

We apply the Conley-Taber (CT) procedure to construct 95 percent confidence intervals for the effect of merit programs on degree completion. These confidence intervals for each *merit* coefficient estimate are also reported in Table 1. In clear contrast to the standard cluster confidence intervals, the CT confidence intervals include zero for all of the samples considered. In other words, the CT procedure suggests that the effect of the merit programs on degree completion is not significant at the 5 percent level. The effects are also not significant at the 10 percent level (not shown). Furthermore, the CT confidence intervals for the 1 percent and 5 percent PUMS have considerable overlap, suggesting that the differences between the coefficients are not statistically significant. The same is true for the five 1 percent subsamples from the 5 percent PUMS. These results hold for the total population as well as females and males separately. This supports the earlier hypothesis that the use of typical clustered standard errors causes significance levels to be considerably overstated and results in invalid inferences.

## V. Inferences Using the Cameron, Gelbach, and Miller Wild Cluster Bootstrap

To address the issue of making inferences based on clustered standard errors when the number of clusters is small, we next implement the wild cluster bootstrap-*t* procedure suggested by Cameron, Gelbach, and Miller (2008). Bootstrap methods compute significance levels by creating many pseudo-samples, estimating the model parameters for each pseudo-sample, and then examining the distribution of the parameters across the various pseudo-samples. The wild cluster bootstrap-*t* constructs pseudo-samples by holding the regressors constant while resampling with replacement group-specific residuals to form new dependent variables. The procedure also uses Rademacher weights of +1 and −1, each with a probability of 0.5. This creates pseudo-samples with dependent variables created using randomly drawn residuals half the time and the negative of the randomly drawn residuals the other half of the time. For each pseudo-sample, the dependent variable is then regressed on the explanatory variables. Significance levels are computed based on the number of times the pseudo-sample coefficients differ from the null hypothesis. Cameron, Gelbach, and Miller (2008) show using Monte Carlo simulations that tests based on the wild cluster bootstrap-*t* procedure have the appropriate size and provide valid inferences. See their paper for further details.
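To make the mechanics concrete, the following NumPy sketch implements one common form of the wild cluster bootstrap-*t* for testing that a single coefficient is zero. It is a simplified illustration, not the code used for Table 1: it uses unweighted OLS, imposes the null hypothesis when forming residuals, keeps each cluster’s own residuals and multiplies them by a single Rademacher draw of +1 or −1, and applies no finite-sample adjustment to the cluster-robust variance. All function names are hypothetical.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta


def cluster_se(X, resid, cluster, j):
    """Cluster-robust standard error of coefficient j (no small-sample adjustment)."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros_like(bread)
    for g in np.unique(cluster):
        idx = cluster == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)
    return np.sqrt((bread @ meat @ bread)[j, j])


def wild_cluster_boot_p(y, X, cluster, j=0, reps=999, seed=0):
    """Two-sided bootstrap p-value for H0: beta_j = 0."""
    rng = np.random.default_rng(seed)

    # t statistic from the original sample.
    beta, resid = ols(y, X)
    t_obs = beta[j] / cluster_se(X, resid, cluster, j)

    # Restricted fit: drop regressor j so that the null holds in the pseudo-samples.
    X0 = np.delete(X, j, axis=1)
    beta0, resid0 = ols(y, X0)
    fitted0 = X0 @ beta0

    groups, inverse = np.unique(cluster, return_inverse=True)
    t_star = np.empty(reps)
    for r in range(reps):
        # One Rademacher weight per cluster, applied to all of its residuals.
        signs = rng.choice([-1.0, 1.0], size=len(groups))
        y_star = fitted0 + signs[inverse] * resid0
        b, e = ols(y_star, X)
        t_star[r] = b[j] / cluster_se(X, e, cluster, j)

    # Share of pseudo-sample t statistics at least as extreme as the observed one.
    return float(np.mean(np.abs(t_star) >= abs(t_obs)))
```

Comparing *t* statistics rather than raw coefficients is what distinguishes the bootstrap-*t* from a simple bootstrap of standard errors.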

Table 1 also reports *p*-values using the Cameron, Gelbach, and Miller (CGM) wild cluster bootstrap-*t* procedure. For all of the samples considered, the *p*-values are greater than 0.10, suggesting that the effect of the merit programs on degree completion is not statistically significant at the 10 percent level. In other words, we cannot be reasonably confident that merit-aid programs have an effect on completion of at least an associate’s degree. In results not shown, we also used the wild cluster bootstrap-*t* procedure to examine whether the differences in coefficients between the 1 percent and 5 percent PUMS are statistically significant. The differences are insignificant at the 10 percent level for the total population and for females and males separately. The CGM wild cluster bootstrap-*t* procedure thus provides inferences similar to the Conley-Taber procedure, but very different from using clustered standard errors.

## VI. Sample Means and Differences across Samples

While the differences in the *merit* coefficients across samples are not statistically significant using the CT and CGM methods, the differences are still large in magnitude, especially for females, and seem to warrant further exploration of the data. Table 2 presents sample means and standard errors for females ages 22–34 for several variables by state of birth for the 1 percent and 5 percent samples.^{5} The upper panel (A) reports means and standard errors constructed without using person weights, while the lower panel (B) uses person weights. There are often important differences between the weighted and unweighted means. Although there is no consensus on this point, many applied econometricians argue that researchers should use the person weights when possible. Dynarski (2008) does not use person weights and neither do our results in Table 1. We also reestimated our main results using the person weights to ensure that this is not the cause of the differences in the coefficients (Table 3). The estimates change only slightly and the qualitative results are the same (smaller *merit* coefficients using the 5 percent PUMS than using the 1 percent PUMS, with most of the difference driven by females, but the differences are not statistically significant).

The differences in means (both unweighted and weighted) between the 1 percent and 5 percent PUMS in Table 2 are often moderately large in magnitude for Arkansas, but are generally smaller for Georgia and for the rest of the United States. However, differences across samples are not statistically significant at conventional levels using a two-sample *t*-test except for the share of females that are nonwhite or Hispanic for the rest of the United States, which is significant at the 5 percent level.^{6} The significance here is an unexpected result and not easily explained. It could be due to sampling error (if we examine enough variables, we might expect roughly 5 percent of them to have differences significant at the 5 percent level) or perhaps nonsampling error. Nonsampling error might arise because the Census Bureau performs confidentiality scrubs in which it alters individual records in the public use data in order to prevent individuals from being identifiable (U.S. Bureau of the Census 2003). For example, Alexander, Davern, and Stevenson (2010) show that nonsampling error in the 2000 PUMS results in very inaccurate age-specific gender ratios for persons age 65 and older.
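For reference, the two-sample *t*-test on a difference in means can be sketched as follows. The standard error of each mean is the standard deviation divided by the square root of the sample size, and the sketch assumes, as a simplification of ours, that the two samples are independent. The function names and any numbers passed to them are purely illustrative and are not taken from Table 2.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of a sample mean."""
    return sd / math.sqrt(n)


def two_sample_t(mean1: float, se1: float, mean2: float, se2: float) -> float:
    """t statistic for mean1 - mean2, treating the two samples as independent."""
    return (mean1 - mean2) / math.sqrt(se1 ** 2 + se2 ** 2)


# With five times as many observations and the same standard deviation,
# the standard error shrinks by a factor of 1 / sqrt(5), or about 0.447.
se_ratio = standard_error(1.0, 500) / standard_error(1.0, 100)
```

The ratio shows why standard errors from the 5 percent PUMS are less than half of those from the 1 percent PUMS when the underlying dispersion is similar.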

We also calculate chi-square test statistics of whether the differences in means across the 1 percent and 5 percent PUMS are jointly zero for all five variables for Arkansas and Georgia and all but the merit variable for the rest of the United States. None of the differences is statistically significant at the 5 percent level, but the differences for the rest of the United States are statistically significant at the 10 percent level.

Table 4 presents sample means for females by state of birth separately for persons ages 22–27 and 28–34 in Arkansas and the rest of the United States and ages 22–25 and 26–34 in Georgia and the rest of the United States. The younger groups in Arkansas and Georgia are the ones exposed to the merit programs and the older groups and the rest of the United States are the controls. Again there are some differences between the unweighted and weighted means, and some differences between the 1 percent and 5 percent PUMS. For brevity, we focus on the differences in weighted means between the 1 percent and 5 percent PUMS. The difference in the share of nonwhite or Hispanic is significant for the rest of the United States for ages 22–27 and ages 22–25, though we are again unsure why. More importantly for our purposes, the differences in the shares with an associate’s degree or higher are significant for Arkansans ages 28–34 and for Georgians ages 22–25. These differences are driving the differences in the *merit* coefficient between the 1 percent and 5 percent PUMS. However, it is not clear which sample is “correct.”

Table 4 also reports the chi-square test statistic for the age groups by state of birth. The test indicates that the differences in means across the 1 percent and 5 percent PUMS for Georgians ages 22–25 are jointly significant at the 5 percent level. Differences across the samples for all other groups in Table 4 are jointly insignificant except for the weighted means for the rest of the United States ages 22–27, which are significant at the 10 percent level. Finally, Table 5 reports means for the five constructed 1 percent subsamples. The chi-square test statistics indicate that the differences across the five 1 percent subsamples are not jointly statistically significant.

## VII. Conclusion

States spend a substantial amount of money on aid programs, both need-based and merit-based. For example, in FY 2009, Georgia made 223,389 HOPE Scholarship awards and spent nearly $400 million. Yet we know very little about the effects of aid programs, particularly their effects on college completion. Dynarski (2008) finds that the merit aid programs in Arkansas and Georgia increased college completion by about three percentage points.

In this paper we revisit Dynarski’s results using a different data sample (the 5 percent PUMS file rather than the 1 percent file) and find much smaller effects. Dynarski’s clustered standard errors are downwardly biased and lead to invalid inferences. We use two alternative approaches for computing significance levels due to Conley and Taber (2011) and Cameron, Gelbach, and Miller (2008). Both procedures suggest statistically insignificant effects of merit programs on degree completion. However, we do find some important differences between the 1 percent and 5 percent PUMS that could be due to either sampling or nonsampling error.

## Footnotes

David L. Sjoquist is professor of economics and Dan E. Sweat Distinguished Chair in Educational and Community Policy in the Andrew Young School of Policy Studies at Georgia State University. John V. Winters is an assistant professor of economics at the University of Cincinnati. The authors wish to thank Barry Hirsch and participants in the Public Economics Brownbag Series at Georgia State University for helpful comments; the usual caveat applies. The data used in this article can be obtained beginning August 2012 through July 2015 from John Winters at Department of Economics, University of Cincinnati, P.O. Box 210371, Cincinnati, OH 45221–0371, john.winters{at}uc.edu.

1. Cornwell, Mustard, and Sridhar (2006) find that HOPE increased enrollment in Georgia postsecondary institutions by 5.9 percent, but that two-thirds of that effect is due to fewer college students leaving the state. Their results suggest that HOPE had at best a small effect on young people attending college at all. Henry, Rubenstein, and Bugler (2004) compare “borderline” HOPE and non-HOPE recipients and find that the probability that HOPE recipients graduated within four years was 72 percent higher than non-HOPE recipients attending four-year schools.

2. Dynarski (2008) provides a discussion of the two aid programs and the relevant literature.

3. Dynarski, however, does not use person weights to construct the age by state of birth means.

4. We also explored estimating simple bootstrap standard errors for a 1 percent sample by drawing 1000 random samples with replacement from the combined 1 percent and 5 percent PUMS, estimating the merit coefficient for each, and computing the standard error of the pseudo-sample coefficients. The resulting standard error for females was 0.0079, but this approach does not properly account for the small number of policy changes and may also lead to invalid inferences.

5. Standard errors equal the standard deviation divided by the square root of the sample size. Standard errors for the 5 percent PUMS are thus less than half those for the 1 percent PUMS.

6. Most of this difference is attributable to differences in the share of females who are Black.

- Received April 2010.
- Accepted March 2011.