## Abstract

There is a large theoretical literature on methods for estimating causal effects under unconfoundedness, exogeneity, or selection-on-observables type assumptions using matching or propensity score methods. Much of this literature is highly technical and has not made inroads into empirical practice where many researchers continue to use simple methods such as ordinary least squares regression even in settings where those methods do not have attractive properties. In this paper, I discuss some of the lessons for practice from the theoretical literature and provide detailed recommendations on what to do. I illustrate the recommendations with three detailed applications.

## I. Introduction

There is a large literature on methods for estimating average treatment effects under the assumption of unconfoundedness (also referred to as selection-on-observables, exogeneity, ignorability, or simply the conditional independence assumption). Under this assumption, the comparison of units with different treatments but identical pretreatment variables can be given a causal interpretation. Much of the econometric literature has focused on establishing asymptotic properties for a variety of estimators without a firm conclusion on the relative merits of these estimators. As a result, the theoretical literature leaves the empirical researcher with a bewildering choice of methods with limited, and often conflicting, guidance on what to use in practice. Possibly in response to that, and possibly also because of lack of compelling evidence that simple methods (for example, ordinary least squares regression) are not adequate, few of the estimators proposed in the recent literature have been widely adopted. Instead, researchers continue to use ordinary linear regression methods with an indicator for the treatment and covariates entering additively and linearly. However, such simple methods do not always work well. If the covariate distributions differ substantially by treatment status, conventional regression methods can be sensitive to minor changes in the specification because of their heavy reliance on extrapolation.

In this paper, I provide some advice for researchers estimating average treatment effects in practice, based on my personal view of this literature. The issues raised in this paper draw from my earlier reviews (Imbens 2004; Imbens and Wooldridge 2009) and in particular my forthcoming book with Imbens and Rubin (2015) and research papers (in particular, Abadie and Imbens 2006, 2008; Hirano, Imbens, and Ridder 2003). I focus on four specific issues. First, I discuss when and why the simple ordinary least squares estimator, which ultimately relies on the same fundamental unconfoundedness assumption as matching and propensity score estimators, but which combines that assumption with strong functional form restrictions, is likely to be inadequate for estimating average treatment effects. Second, I discuss in detail two specific, fully specified, estimators that in my view are attractive alternatives to least squares estimators because of their robustness properties with respect to a variety of data configurations. Although I focus on these two methods, one based on matching and one based on subclassification on the propensity score, in some detail, the message emphatically is *not* that these are the only two estimators that are reasonable. Rather, the point is that the results should not be sensitive to the choice of reasonable estimators, and that unless a particular estimator is robust to modest changes in the implementation, any results should be viewed with suspicion. Third, I discuss trimming and other preprocessing methods for creating more balanced samples as an important first step in the analysis to ensure that the final results are more robust with any estimators, including least squares, weighting, or matching. Fourth, I discuss some supplementary analyses for assessing the plausibility of the unconfoundedness assumption. Although unconfoundedness is not testable, one should assess, whenever possible, whether unconfoundedness is plausible in specific applications, and in many cases one can in fact do so.

I illustrate the methods discussed in this paper using three different data sets. The first of these is a subset of the lottery data collected by Imbens, Rubin, and Sacerdote (2001). The second data set is the experimental part of the Dehejia and Wahba (1999) version of the Lalonde (1986) data set containing information on participants in the Nationally Supported Work (NSW) program and members of the randomized control group. The third data set contains information on participants in the NSW program and one of Lalonde’s nonexperimental comparison groups, again using the Dehejia-Wahba version of the data. Both subsets of the Dehejia-Wahba version of the Lalonde data are available on Dehejia’s website, http://users.nber.org/rdehejia/. Software for implementing these methods will be available on my website.

## II. Setup and Notation

The setup used in the current paper is by now the standard one in this literature, using the potential outcome framework for causal inference that builds on Fisher (1925) and Neyman (1923) and was extended to observational studies by Rubin (1974). See Imbens and Wooldridge (2009) for a recent survey and historical context, and Imbens and Rubin (2014) for the specific notation used here. Following Holland (1986), I will refer to this setup as the Rubin Causal Model, RCM for short.^{1}

The analysis is based on a sample of *N* units, indexed by *i* = 1, …, *N*, viewed as a random sample from an infinitely large population.^{2} For unit *i*, there are two potential outcomes, denoted by *Y _{i}*(0) and

*Y*(1), the potential outcomes without and given the treatment respectively. Each unit in the sample is observed to receive or not receive a binary treatment, with the treatment indicator denoted by

_{i}*W*. If unit

_{i}*i*receives the treatment, then

*W*= 1, otherwise

_{i}*W*= 0. There are control units and treated units, so that

_{i}*N*=

*N*+

_{c}*N*is the total number of units. The realized (and observed) outcome for unit

_{t}*i*is

I use the superscript “obs” here to distinguish between the potential outcomes that are not always observed and the observed outcome. In addition, there is for each unit *i* a *K*-component covariate *X _{i}*, with

*X*∈ 𝕏 ⊂ ℝ

_{i}*. The key characteristic of these covariates is that they are known not to be affected by the treatment. Observe that, for all units in the sample, the triple (,*

^{K}*W*,

_{i}*X*). We use

_{i}*,*

**Y***, and*

**W***to denote the vectors with tyical elements*

**X***Y*and

^{obsi}*W*, and the matrix with

_{i}*i*-th row equal to .

Define the population average treatment effect conditional on the covariates,

and the average effect in the population, and in the subpopulation of treated units,

(1)

There is a somewhat subtle issue that can also be focused on the sample average treatment effects, conditional on the covariates in the sample,

(2)

This matters for inference, although not for estimation. The asymptotic variances for the same estimators, viewed as estimators of τ_{sample} and τ_{treat,sample}, are generally smaller than when we view τ and τ_{treat} as the target, unless there is no variation in τ(*x*) over the population distribution of *X _{i}*. See Imbens and Wooldridge (2009) for more details on this distinction.

The first key assumption made is *unconfoundedness* (Rubin 1990),

(3)

where, using the Dawid (1979) conditional independence notation, *A* ∧ *B*|*C* denotes that *A* and *B* are independent conditional on *C*. The second key assumption is *overlap*,

(4)

for all *x* in the support of the covariates, where

(5)

is the *propensity score* (Rosenbaum and Rubin 1983). The combination of these two assumptions is referred to as *strong ignorability* by Rosenbaum and Rubin (1983). The combination implies that the average effects can be estimated by adjusting for difference in covariates between treated and control units. For example, given unconfoundedness, the population average treatment effect τ can be expressed in terms of the joint distribution of (,*W _{i}*,

*X*) as

_{i}(6)

To see that Equation 6 holds, note that by definition,

By unconfoundedness, the conditioning on *W _{i}* in these two terms can be dropped, and

thus proving Equation 6.

To implement estimators based on this equality, it is also required that the conditional expectations 𝔼[*Y _{i}*(0)|

*X*=

_{i}*x*], 𝔼[

*Y*(1)|

_{i}*X*=

_{i}*x*] and

*e*(

*x*) are sufficiently smooth—that is, sufficiently many times differentiable. See the technical papers in this literature, referenced in Imbens and Wooldridge (2009), for details on the smoothness requirements.

The statistical challenge is now how to estimate objects such as

(7)

One would like to estimate Equation 7 without relying on strong functional form assumptions on the conditional distributions or conditional expectations. One would also like the estimators to be robust to minor changes in the implementation of the estimator.

Whether these two assumptions are reasonable is often controversial. Violations of the overlap assumptions are often easily detected. They motivate some of the design issues and often imply that the researcher should change the estimand from the overall average treatment effect to an average for some subpopulation. See Section VF for more discussion. Potential violations of the unconfoundedness assumption are more difficult to address. First, there are some methods for assessing the plausibility of the assumption, see the discussion in Section VG. To make the assumption plausible, it is important to have detailed information about the units on characteristics that are associated both with the potential outcomes and the treatment indicator. It is often particularly helpful to have lagged measures of the outcome, for example, detailed labor market histories in evaluations of labor market programs. At the same time, having detailed information on the background of the units makes the statistical challenge of adjusting for all these differences more challenging.

This setup has been studied extensively in the statistics and econometrics literatures. For reviews and general discussions in the econometrics literature, see Angrist and Krueger (2000); Angrist and Pischke (2009); Caliendo (2006); Heckman and Vytlacil (2007); Imbens (2004); Imbens and Wooldridge (2009); Caliendo and Kopeinig (2011); Frölich (2004); Wooldridge (2002); and Lee (2005). For reviews in other areas in applied statistics, see Rosenbaum (1995, 2010); Morgan and Winship (2007); Guo and Fraser (2010); and Murnane and Willett (2010). Much of the econometrics literature is technical in nature and has primarily focused on establishing first-order large-sample properties of point and interval estimators using matching (Abadie and Imbens 2006), nonparametric regression methods (Frolich 2000; Chen, Hong, and Tarozzi 2008; Hahn 1998; Hahn and Ridder 2013), or weighting (Hirano, Imbens, and Ridder 2003; Frölich 2002). These first-order asymptotic properties are identical for a number of proposed methods, limiting the usefulness of these results for choosing between these methods. In addition, comparisons between the various methods based on Monte Carlo evidence (for example, Frölich 2004; Busso, DiNardo, and McCrary 2008; Lechner 2012) are hampered by the dependence of many of the proposed methods on tuning parameters (for example, the bandwidth choices in kernel regression, or on the number of terms or basis functions in sieve or series methods) for which rarely specific, data-dependent values are recommended, as well as by the lack of agreement on realistic designs in simulation studies.

In this paper, I will discuss my recommendations for empirical practice based on my reading of this literature. These recommendations are somewhat personal. In places they are also somewhat vague. I will emphatically not argue that one particular estimator should always be used. Part of the message is that there are no, and will not be, general results implying that in general some estimators are superior to all others. The recommendations are more specific in other places—for example suggesting supplementary analyses that indicate for some data sets that particular estimators will be sensitive to minor changes in implementation—thus making them unattractive in those settings. The hope is that these comments will be useful for empirical researchers. The recommendations will consist of a combination of particular estimators, supplementary analyses, and, perhaps most controversially, changes in estimands when Equation 7 is particularly difficult to estimate.

## III. Least Squares Estimation: When and Why Does It Not Work?

The most widely used method for estimating causal effects remains ordinary least squares (OLS) estimation. Ordinary least squares estimation relies on unconfoundedness in combination with functional form assumptions. Typically, researchers make these functional form assumptions for convenience, viewing them at best as approximations to the underlying functions. In this section, I discuss merits and concerns with this method for estimating causal effects as opposed to using it for prediction. The main point is that the least squares functional form assumptions matter, and sometimes matter substantially. Moreover, one can assess from the data whether this will be the case.

To illustrate these ideas, I will use a subset of the data collected by Lalonde. The treatment is participation in a labor market training program the National Supported Work (NSW) program. Later, I will look at other parts of the data set, but here I focus on the nonexperimental comparison between the trainees and a nonexperimental comparison group from the Panel Study of Income Dynamics (PSID). The outcome of interest is an earnings variable, *earn’78*. I focus on a subset of the data with a single covariate, *earn’75*. There are *N _{t}* = 185 men in the treatment group and

*N*= 2,490 men in the control group. From the experimental data, it can be inferred that the average causal effect is approximately $2,000. Figures 1 and 2 present histograms of

_{c}*earn’75*in the control and treatment group respectively. The average values of this variable in the two groups are 19.1 and 1.5, with standard deviations 13.6 and 3.2 respectively. Clearly, the two distributions of

*earn’75*in the trainee and comparison groups are far apart.

### A. Ordinary Least Squares Estimation

Initially, I focus on the average effect for the treated, τ_{treat}, defined in Equation 1, although the same issues arise for the average effect τ. Focusing on τ_{treat} simplifies some of the analytic calculations. Under unconfoundedness, τ_{treat} can be written as a functional of the joint distribution of (,*W _{i}*,

*X*)—that is,

_{i}(8)

Define

The first term in Equation 8, , is directly estimable from the data as . It is the second term, , that is more difficult to estimate. Suppose we specify a linear model for *Y _{i}*(0) given

*X*:

_{i}

The OLS estimator for *β** _{c}* is equal to

For the Lalonde nonexperimental data, where the outcome is earnings in thousands of dollars, and the scalar covariate here lagged earnings, we have

Then, we can write the OLS estimator of the average of the potential control outcomes for the treated, 𝔼[*Y _{i}*(0)|

*W*= 1], as

_{i}(9)

which for the Lalonde data leads to

The question is how credible this is as an estimate of 𝔼[*Y _{i}*(0)|

*W*= 1]. I focus on two aspects of this credibility.

_{i}### B. Extrapolation and Misspecification

First, some additional notation is needed. Denote the true conditional expectation and variance of *Y _{i}*(

*w*) given

*X*=

_{i}*x*by

Given the true conditional expectation, the pseudotrue values for β* _{c}*, the probability limit of the OLS estimator, is

and the estimator for the predicted value for 𝔼[*Y _{i}*(0)|

*W*= 1] will converge to

_{i}

The difference between the and the true average 𝔼[*Y _{i}*(0)|

*W*= 1] depends on the difference between the average value of the regression function for the treated and the controls and the approximation of this difference of the average conditional expectations by the difference in the average values of the best linear predictors,

_{i}

If the conditional expectation μ* _{t}*(

*x*) is truly linear, the difference is zero. In addition, if the average covariate values in the two subpopulations have the same distribution

*f*(

_{x}*x*|

*W*= 0) =

_{i}*f*(

_{x}*x*|

*W*= 1) for all

_{i}*x*, the difference is zero. However, if both the conditional expectations are nonlinear and the covariate distributions differ, there will in general be a bias. Establishing whether the conditional expectations are nonlinear can be difficult to do. However, it is straightforward to assess whether the second necessary condition for bias is satisfied, namely whether the distributions of the covariates in the two treatment arms are similar or not.

Now look at the sensitivity to different specifications. I consider two specifications for the regression function. First, a simple linear regression:

Second, a linear regression after transforming the regressor by adding one unit ($1,000) followed by taking logarithms:

The predictions for 𝔼[*Y _{i}*(0)|

*W*= 1] based on the two models are quite different, given that the average causal effect is on the order of $2,000:

_{i}

The reason for the big difference (4.07) between the two predicted values can be seen in Figure 3 where I plot the two estimated regression functions. The dashed vertical lines present the 0.25 and 0.75 quartile of the distribution of *earn’75* for the PSID sample (9.84 and 24.50 respectively). This indicates the range of values where the regression function is most likely to approximate the conditional expectation accurately. The solid vertical lines present the 0.25 and 0.75 quantiles for the distribution of *earn’75* for the trainees (0 and 1.82 respectively). One can see in this figure that for the range where the trainees are and where we therefore need to predict *Y _{i}*(0), with

*earn’75*between 0 and 1.82, the predictions of the linear and the log-linear model are quite different.

The estimates here were based on regression outcomes on earnings in 1975 for the control units only. In practice, an even more common approach is to simply estimate the regression

or

on the full sample, including both treated and control units. The reason for focusing on the regression in the control sample here is mainly expositional: It simplifies the expressions for the estimated average of the potential outcomes. In practice, it makes relatively little difference if we do the regressions with both treated and control units included. For these data, with many more control units (2,490) than treated units (1,985), the estimates for the average of the potential outcome *Y _{i}*(0) for the treated are, under the linear and log linear model, largely driving by the values for the control units. Estimating the regression function as a single regression on all units leads to:

with the difference in the average predicted value for *Y _{i}*(0) for the treated units between the linear and the log linear specification much bigger than between the regression on the control sample versus the regression on the full sample.

### C. Weights

Here, I present a different way of looking at the sensitivity of the least squares estimator in settings with substantial differences in the covariates distributions. One can write the OLS estimator of 𝔼[*Y _{i}*(0)|

*W*= 1] in a different way, as

_{i}(10)

where, for control unit *i* the weight ω* _{i}* is

(11)

It is interesting to explore the properties of these weights ω* _{i}*. First, the weights average to one over the control sample, irrespective of the data. Thus, is a weighted average of the outcomes for the control units, with the weights adding up to 1. Second, the normalized weights ω

*are all equal to one in the special case where the average of the covariates in the two groups are indentical. In that case, . In a randomized experiment, is equal to in expectation, so the weights should be close to one in that case. The variation in the weights, and the possibility of relatively extreme weights, increases with the difference between and . Finally, note that given a particular data set, one can inspect these weights directly and assess if some units have excessively large weights.*

_{i}Let us first look at the normalized weights ω* _{i}* for the Lalonde data. For these data, the weights take the form

The average of the weights is, by construction, equal to 1. The standard deviation of the weights is 1.29. The largest positive weight is 2.08, for control units with *earn’75* equal to zero. It makes sense for control individuals with *earn’75* equal to zero to have substantial weights, as there are many individuals in the treatment group with identical or similar values for *earn’75*. The most negative weight is for the control unit with *earn’75* equal to 156.7, with a weight of –12.05. This is where the problem with ordinary least squares regression shows itself most clearly. The range of values of *earn’75* for trainees is [0,25.14]. I would argue that whether an individual with *earn’75* equal to $156,700 makes $100,000 or $300,000 in 1978 should make no difference for a prediction for the average value of *Y _{i}*(0) for the trainees in the program, 𝔼[

*Y*(0)|

_{i}*W*= 1], given that the maximum value for

_{i}*earn’75*for trainees is $25,140. Conditional on the specification of the linear model, however, that difference between $100,000 and $300,000 is very important for the prediction of earnings in 1978 for trainees. Given the weight of –12.05 for the control individual with

*earn’75*equal to 156.7, the prediction for 𝔼[

*Y*(0)|

_{i}*W*= 1] would be 6.829 if

_{i}*earn*’78 = 100 for this individual, and 5.861 if

*earn*’78 = 300 for this individual, with the difference equal to $968, a substantial amount. The problem is that the least squares estimates take the linearity in the linear regression model very seriously, and, given linearity, observations on units with extreme values for

*earn*’75 are in fact very informative. Note that one can carry out these calculations without even using the outcome data. Because the weights are so large, the value of

*earn*’78 for such individuals matters substantially for our prediction for 𝔼[

*Y*(0)|

_{i}*W*= 1], where arguably the earnings for such individuals should have close to 0 weight.

_{i}### D. How to Think About Regression Adjustment

Of course, the example in the previous subsection is fairly extreme. The two distributions are very different with, for example, the fraction of the PSID control sample with *earn’75* exceeding the maximum value found in the trainee sample (which is 25.14) equal to 0.27. It would appear obvious that units with *earn’75* as large as 156.7 should be given little or no weight in the analyses. An obvious remedy would be to discard all PSID units with *earn’75* exceeding 25.14, and this would go a considerable way toward making the estimators more robust. However, it is partly because there is only a single covariate in this overly simplistic example that there are such simple remedies.^{3} With multiple covariates, it is difficult to see what trimming would need to be done. Moreover, a simple trimming rule such as the one described above is very sensitive to outliers. If there were a single trainee with *earn’75* equal to 150, that would change the estimates substantially. The issue is that regression methods are fundamentally not robust to the substantial differences between treatment and control groups. I will be discussing alternatives to the simple regression methods used here that do systematically, and automatically, down weight the influence of such outliers, both in the case with scalar covariates and in the case with multivariate covariates. If, for example, instead of using regression methods, one matched all the treated units, the control units with *earn’75* exceeding 25.07 would receive 0 weight in the estimation, and in fact few control units with *earn’75* between 18 and 25 would receive positive weights. Matching is therefore by design robust to the presence of such units.

In practice, it may be useful to compare the least squares estimates to estimates based on more sophisticated methods, both as a check that calculations were carried out correctly and to ensure that one understands what is driving any difference between the estimates.

## IV. The Strategy

In this section, I lay out the overall strategy for flexibly and robustly estimating the average effect of the treatment. The strategy will consist of two and sometimes three distinct stages, as outlined in Imbens and Rubin (2015). In the first stage, which, following Rubin (2005), I will refer to as the *design* stage, the full sample will be trimmed by discarding some units to improve overlap in covariate distributions. In the second stage, the *supplementary analysis* stage, the unconfoundedness assumption is assessed. In the third stage, the *analysis* stage, the estimator for the average effect will be applied to the trimmed data set. Let me briefly discuss the three stages in more detail.

### A. Stage I: Design

In this stage, we do not use the outcome data and focus solely on the treatment indicators and covariates, (* X*,

*). The first step is to assess overlap in covariate distributions. If this is found to be lacking, a subsample is created with more overlap by discarding some units from the original sample. The assessment of the degree of overlap is based on an inspection of some summary statistics, what I refer to as normalized differences in covariates, denoted by Δ*

**W***. These are differences in average covariate values by treatment status, scaled by a measure of the standard deviation of the covariates. These normalized differences contrast with*

_{X}*t*-statistics for testing the null of no differences in means between treated and controls. The normalized differences provide a scale and sample size free way of assessing overlap.

If the covariates are far apart in terms of this normalized difference metric, one may wish to change the target. Instead of focusing on the overall average treatment effect, one may wish to drop units with values of the covariates such that they have no counterparts in the other treatment group. The reason is that, in general, no estimation procedure will be given robust estimates of the average treatment effect in that case. To be specific, following Crump, Hotz, Imbens, and Mitnik (2008a, CHIM from hereon), I propose a rule for dropping units with extreme (that is, close to 0 or 1) values for the estimated propensity score. This step will require estimating the propensity score . One can capture the result of this first stage in terms of a function * I* =

*(*

**I***,*

**W***) that determines which units are dropped as a function of the vector of treatment indicators*

**X***and the matrix of pretreatment variables*

**W***. Here,*

**X***denotes an*

**I***N*vector of binary indicators, with

*i*-th element

*I*equal to one if unit

_{i}*i*is included in the analysis, and

*I*equal to zero if unit

_{i}*i*is dropped. Dropping the units with

*I*= 0 leads to a trimmed sample (

_{i}*Y*,

^{T}*W*,

^{T}*X*) with units.

^{T}### B. Stage II: Supplementary Analysis: Assessing Unconfoundedness

In the second stage, analyses are carried out to assess the plausibility of unconfoundedness. Again it should be stressed that these analyses can be carried out without access to the outcome data * Y*, using solely (

*,*

**W***). Here the set of pretreatment variables or covariates is partitioned into two parts—the first a vector of pseudo-outcomes,*

**X***X*, and the second a matrix containing the remaining covariates,

_{p}*X*. We then take the pseudosample (

_{r}*,*

**X**_{p}*,*

**W***) and estimate the average effect of the treatment on this pseudo-outcome, adjusting only for differences in the remaining covariates*

**X**_{r}*. This will involve first trimming this pseudosample using the same procedures to construct a trimmed sample () and then estimating a “pseudo” average treatment effect on the pseudo-outcome for this trimmed sample.*

**X**_{r}Specifically, I calculate

for specific estimators τ(·) described below, where the adjustment is only for the subset of covariates included in . This estimates the “pseudo” causal effect, the causal effect of the treatment on the pretreatment variable , which is a priori known to be zero. If we find the estimate is substantially and statistically close to zero after adjusting for , this will be interpreted as evidence supportive of the assumption of unconfoundedness. If the estimate differs from zero, either substantially or statistically, this will be viewed as casting doubt on the (ultimately untestable) unconfoundedness assumption, with the degree of evidence dependent on the magnitude of the deviations. The result of this stage will therefore be an assessment of the credibility of any estimates of the average effect of the treatment on the outcome of interest, without using the actual outcome data.

### C. Stage III: Analysis

In the third stage, the outcome data * Y* are used and estimates of the average treatment effect of interest are calculated. Using one of the recommended estimators, blocking (subclassification) or matching, applied to the trimmed sample, we obtain a point estimate:

This step is more delicate, with concerns about pretesting bias, and I therefore propose no specification searches at this stage. The first estimator I discuss in detail is based on blocking or subclassification on the estimated propensity score in combination with regression adjustments within the blocks. The second estimator is based on direct one-to-one (or one-to-*k*) covariate matching with replacement in combination with regression adjustments within the matched pairs. Both are, in my view, attractive estimators because of their robustness.

## V. Tools

In this section, I will describe the calculation of the normalized differences Δ_{X,k} and the four functions—the propensity score, *e*(*x*;* W*,

*), the trimming function,*

**X***I*(

*,*

**W***), the blocking estimator, τ*

**X**_{block}(

*,*

**Y***,*

**W***), and the matching estimator, τ*

**X**_{match}(

*,*

**Y***,*

**W***)—and the choices that go into these functions.*

**X**I should note that, in specific situations, one may augment the implicit choices made regarding the specification of the regression function and the propensity score in the proposed algorithm with substantive knowledge. Such knowledge is likely to improve the performance of the methods. Such knowledge notwithstanding, there is often a part of the specification about which the researcher is agnostic. For example, the researcher may not have strong a priori views on whether a pretreatment variable such as age should enter linearly or quadratically in the propensity score in the context of the evaluation of a labor market program. The methods described here are intended to assist the researcher in such decisions by providing a benchmark estimator for the average effect of interest.

### A. Assessing Overlap: Normalized Differences in Average Covariates

For element *X _{i}*,

*of the covariate vector*

_{k}*X*, the normalized difference is defined as

_{i}(12)

where

A key point is that the normalized difference is more useful for assessing the magnitude of the statistical challenge in adjusting for covariate differences between treated and control units than the *t*-statistic for testing the null hypothesis that the two differences are zero, or

(13)

The *t*-statistic *t _{X,k}* may be large in absolute value simply because the sample is large and, as a result, small differences between the two sample means are statistically significant even if they are substantively small. Large values for the normalized differences, in contrast, indicate that the average covariate values in the two groups are substantially different. Such differences imply that simple methods such as regression analysis will be sensitive to specification choices and outliers. In that case, there may in fact be no estimation method that leads to robust estimates of the average treatment effect.

### B. Estimating the Propensity Score

In this section, I discuss an estimator for the propensity score *e*(*x*) proposed in Imbens and Rubin (2015). The estimator is based on a logistic regression model, estimated by maximum likelihood. Given a vector of functions *h* : 𝕏 ↦ ℝ* ^{M}*, the propensity score is specified as

with an *M*-dimensional parameter vector γ. There is no particular reason for the choice of a logistic model as opposed to, say, a probit model, where *e*(*x*) = Φ(*h*(*x*)′γ). This is one of the places where it is important that the eventual estimates be robust to this choice. In particular, the trimming of the sample will improve the robustness to this choice, removing units with estimated propensity score values close to zero or one where the choice between logit and probit models may matter.

Given a choice for the function *h*(*x*), the unknown parameter γ is estimated by maximum likelihood:

The estimated propensity score is then

The key issue is the choice of the vector of functions *h*(·). First note that the propensity score plays a mechanical role in balancing the covariates. It is not given a structural or causal interpretation in this analysis. In choosing a specification, there is therefore little role for theoretical substantive arguments: We are mainly looking for a specification that leads to an accurate approximation to the conditional expectation. At most, theory may help in judging which of the covariates are likely to be important here.

A second point is that there is little harm in specification searches at this stage. The inference for the estimators for the average treatment effect we consider is conditional on covariates and is not affected by specification searches at this stage that do not involve the data on the outcome variables.

A third point is that, although the penalty for including irrelevant terms in the propensity score is generally small, eventually the precision will go down if too many terms are included in the specification.

A simple, and the most common, choice for the specification of the propensity score is to take the basic covariates themselves, *h*(*x*) = *x*. This need not be adequate for two reasons. The main reason is that this specification may not be sufficiently flexible. In many cases, there are important interactions between the covariates that need to be adjusted for in order to get an adequate approximation to the propensity score. Second, it may be that the linear specification includes pretreatment variables that have very little association with the treatment indicator and therefore need not be included in the specification. Especially in settings with many pretreatment variables, this may be wasteful and lead to low precision in the final estimates. I therefore discuss a more flexible specification for the propensity score based on stepwise regression, proposed in Imbens and Rubin (2015). This provides a data driven way to select a subset of all linear and second-order terms. Details are provided in the appendix. There are many, arguably more sophisticated, alternatives to choosing a specification for the propensity score that, however, have not been tried out much in practice. Other discussions include Millimet and Tchernis (2009). Recently, lasso methods (Tibshirani 1996) have been used for similar problems (Belloni, Chernozhukov, and Hansen, 2012). However, the point is again not to find a single method for estimating the propensity score that will outperform all others. Rather, the goal is to find a reasonable method for estimating the propensity score that will, in combination with the subsequent adjustment methods, lead to estimates for the treatment effects of interest that are similar to those based on other reasonable strategies for estimating the propensity score. For the estimators here, a finding that the final results for, say, the average treatment effect are sensitive to using the stepwise selection method compared to, say, the lasso or other shrinkage methods would raise concerns that none of the estimates are credible.

This vector of functions always includes a constant, *h*_{1}(*x*) = 1. Consider the case where *x* is a *K*-component vector of covariates (not counting the intercept). I restrict the remaining elements of *h*(*x*) to be either equal to a component of *x* or to be equal to the product of two components of *x*. In this sense, the estimator is not fully nonparametric: Although one can generally approximate any function by a polynomial, here I limit the approximating functions to second-order polynomials. In practice, however, for the purposes for which I will use the estimated propensity score, this need not be a severe limitation.

The problem now is how to choose among the (*K* + 1) × (*K* + 2) / 2 – 1 first- and second-order terms. I select a subset of these terms in a tepwise fashion. Three choices are to be made by the researcher. First, there may be a subset of the covariates that will be included in the linear part of the specification, irrespective of their association with the outcome and the treatment indicator. Note that biases from omitting variables comes from both the correlation between the omitted variables and the treatment indicator, and the correlation between the omitted variable and the outcome, just as in the conventional omitted variable formula for regression. In applications to job training programs, these might include lagged outcomes or other covariates that are a priori expected to be substantially correlated with the outcomes of interest. Let us denote this subvector by *X _{B}*, with dimension 1≤

*K*≤

_{B}*K*+ 1 (

*X*always includes the intercept). This vector need not include any covariates beyond the intercept if the researcher has no strong views regarding the relative importance of any of the covariates. Second, a threshold value for inclusion of linear terms has to be specified. The decision to include an additional linear term is based on a likelihood ratio test statistic for the null hypothesis that the coefficient on the additional covariate is equal to zero. The threshold value will be denoted by

_{B}*C*

_{lin}, with the value used in the applications being

*C*

_{lin}= 1. Finally, a threshold value for inclusion of second-order terms has to be specified. Again, the decision is based on the likelihood ratio statistic for the test of the null hypothesis that the additional second-order term has a coefficient equal to zero. The threshold value will be denoted by

*C*

_{qua}, and the value used in the applications in the current paper is

*C*

_{qua}= 2.71. The values for

*C*

_{lin}and

*C*

_{qua}have been tried out in simulations, but there are no formal results demonstrating that these are superior to other values. More research regarding these choices would be useful.

### C. Blocking with Regression

The first estimator relies on an initial estimate of the propensity score and uses subclassification (Rosenbaum and Rubin 1983, 1984) combined with regression within the subclasses. See Imbens and Rubin (2014) for more details. Conceptually, there are a couple of advantages of this estimator over simple weighting estimators. First, the subclassification approximately averages the propensity score within the subclasses, and so smoothes over the extreme values of the propensity score. Second, the regression within the subclasses adds a lot of flexibility compared to a single weighted regression.

I use the estimator for the propensity score discussed in the previous section. The estimator then requires partitioning of the range of the propensity score, the interval [0,1] into *J* intervals of the form [*b _{j}*

_{–1},

*b*), for

_{j}*j*= 1, …,

*J*, where

*b*

_{0}= 0 and

*b*= 1. Let

_{J}*B*(

_{i}*j*) ∈{0,1} be a binary indicator for the event that the estimated propensity score for unit , satisfies . Within each block, the average treatment effect is estimated using linear regression with some of the covariates, including an indicator for the treatment

The covariates included beyond the intercept may consist of all or a subset of the covariates viewed as most important. Because the covariates are approximately balanced within the blocks, the regression does not rely heavily on extrapolation as it might do if applied to the full sample. This leads to *J* estimates , one for each stratum or block. These *J* within-block estimates are then averaged over the *J* blocks, using either the proportion of units in each block, (*N _{cj}* +

*N*)/

_{tj}*N*, or the proportion of treated units in each block,

*N*/

_{tj}*N*, as the weights:

_{t}(14)

The only decision the researcher has to make in order to implement this estimator is the number of blocks, *J*, and boundary values for the blocks, *b _{j}*, for

*j*= 1, …,

*J*– 1. Following analytic result by Cochran (1968) for the case with a single normally distributed covariate, researchers have often used five blocks, with an equal number of units in each block. Here I use a data-dependent procedure for selecting both the number of blocks and their boundaries that leads to a number of blocks that increases with the sample size, developed in Imbens and Rubin (2015).

The algorithm proposed in Imbens and Rubin (2015) relies on comparing average values of the log odds ratios by treatment status, where the estimated log odds ratio is

Start with a single block, *J* = 1. Check whether the current stratification or blocking is adequate. This check is based on the comparison of three statistics, one regarding the correlation of the log odds ratio with the treatment indicator within the current blocks, and two concerning the block sizes if one were to split them further. Calculate the *t*statistic for the test of the null hypothesis that the average value for the estimated propensity score for the treated units is the same as the average value for the estimated propensity score for the control units in the block. Specifically, if one looks at block *j*, with *B _{ij}* = 1 if unit

*i*is in block

*j*(or ), the

*t*-statistic is

where

and

If the block were to be split, it would be split at the median of the values of the estimated propensity score within the block (or at the median value of the estimated propensity score among the treated if the focus is on the average value for the treated). For block, denote this median by *b*_{j–1, j}, and define

The current block will be viewed as adequate if either the *t*-statistic is sufficiently small (less than ), or if splitting the block would lead to too small a number of units in one of the treatment arms or in one of the new blocks. Formally, the current block will be viewed as adequate if either

or

or

where *K* is the number of covariates. If all the current blocks are deemed adequate, the blocking algorithm is finished. If at least one of the blocks is viewed as not adequate, it is split by the median (or at the median value of the estimated propensity score among the treated if the focus is on the average value for the treated). For the new set of blocks, the adequacy is calculated for each block, and this procedure continues until all blocks are viewed as adequate.

### D. Matching with Replacement and Bias-Adjustment

In this section, I discuss the second general estimation procedure: matching. For general discussions of matching methods, see Rosenbaum (1989); Rubin (1973a, 1979); Rubin and Thomas (1992a,b); Sekhon (2009); and Hainmueller (2012). Here, I discuss a specific procedure developed by Abadie and Imbens (2006).^{4} This estimator consists of two steps. First, all units are matched, both treated and controls. The matching is with replacement, so the order in which the units are matched does not matter. After matching all units, or all treated units if the focus is on the average effect for the treated, some of the remaining bias is removed through regression on a subset of the covariates, with the subvector denoted by *Z _{i}*.

For ease of exposition, we focus on the case with a single match. The methods generalize in a straightforward manner to the case with multiple matches. See Abadie and Imbens (2006) for details. Let the distance between two values of the covariate vector *x* and *x*′ be based on the Mahalanobis metric , where is the sample covariance matrix of the covariates. Now for each *i*, for *i* = 1, …, *N*, let *m*(*i*) be the index for the closest match, defined as

where we ignore the possibility of ties. Given the ℓ(*i*), define

Define also the matched values for the covariates

This leads to a matched sample with *N* pairs. Note that the same units may be used as a match more than once because we match with replacement. The simple matching estimator discussed in Abadie and Imbens (2006) is . Abadie and Imbens (2006, 2010) suggest improving the bias properties of this simple matching estimator by using linear regression to remove biases associated with differences between and . See also Quade (1982) and Rubin (1973b). First, run the two least squares regressions

in both cases on *N* units, to get the least squares estimates and . Now, adjust the imputed potential outcomes as

and

Now the bias-adjusted matching estimator is

and the bias-adjusted matching estimator for the average effect for the treated is

In practice, the linear regression bias-adjustment eliminates a large part of the bias that remains after the simple matching. Note that the linear regression used here is very different from linear regression in the full sample. Because the matching ensures that the covariates are well-balanced in the matched sample, linear regression does not rely much on extrapolation the way it may in the full sample if the covariate distributions are substantially different.

### E. A General Variance Estimator

In this section, I will discuss an estimator for the variance of the two estimators for average treatment effects. Note that the bootstrap is not valid in general because matching estimators are not asymptotically linear. See Abadie and Imbens (2008) for detailed discussions.

#### 1. The weighted average outcome representation of estimators and asymptotic linearity

The first key insight is that most estimators for average treatment effects share a common structure. This common structure is useful for understanding some of the commonalities of and differences between the estimators. These estimators, including the blocking and matching estimators discussed in Sections III and IV, can be written as a weighted average of observed outcomes,

with

Moreover, the weights ω* _{i}* do not depend on the outcomes

*, only on the covariates*

**Y**^{obs}*and the treatment indicators*

**X***. The specific functional form of the dependence of the weights ω*

**W***on the covariates and treatment indicators depends on the particular estimator, whether linear regression, matching, weighting, blocking, or some combination thereof. Given the choice of the estimator, and given values for*

_{i}*and*

**W***, the weights can be calculated. See Appendix B for the results for some common estimators.*

**X**#### 2. The conditional variance

Here, I focus on estimation of the variance of estimators for average treatment effects, conditional on the covariates * X* and the treatment indicators

*. I exploit the weighted linear average characterization of the estimators in Equation 17. Hence, the conditional variance is*

**W**

The only unknown components of this variance are . Rather than estimating these conditional variances through nonparametric regression following Abadie and Imbens (2006), I suggest using matching. Suppose unit *i* is a treated unit. Then, find the closest match within the set of all other treated units in terms of the covariates. Ignoring ties, let *h*(*i*) be the index of the unit with the same treatment indicator as *i* closest to *X _{i}*:

Because *X _{i}* ≈

*X*

_{h(i)}, and thus μ

_{1}(

*X*) ≈ μ

_{i}_{1}(

*X*

_{h(i)}), it follows that one can approximate the difference

*Y*–

_{i}*Y*

_{h}_{(}

_{i}_{)}by

(18)

The righthand side of Equation 18 has expectation zero and variance equal to . This motivates estimating by

Note that this estimator is not a consistent estimator for . However, this is not important because one is interested not in the variances at specific points in the covariates distribution but, rather, in the variance of the average treatment effect. Following the procedure introduce above, this variance is estimated as:

In principle, one can generalize this variance estimator using the nearest *L* matches rather than just using a single match. In practice, there is little evidence that this would make much of a difference. Hanson and Sunderam (2012) discusses extensions to clustered sampling.

### F. Design: Ensuring Overlap

In this section, I discuss two methods for constructing a subsample of the original data set that is more balanced in the covariates. Both take as input the vector of assignments * W* and the matrix of covariates

*, and select a set of units, a subset of the set of indices {1, 2, …,*

**X***N*}, with

*N*elements, leading to a trimmed sample with assignment vector

^{T}*W*, covariates

^{T}*X*, and outcomes

^{T}*Y*. The units corresponding to these indices will then be used to apply the estimators for average treatment effects discussed in Sections III and IV.

^{T}The first method is aimed at settings with a large number of controls relative to the number of treated units, and the focus is on the average effect of the treated. This method constructs a matched sample where each treated unit is matched to a distinct control unit. This creates a sample of size 2 · *N _{t}* distinct units, half of them treated and half of them control units. This sample can then be used in the analyses of Sections III and IV.

The second method for creating a sample with overlap drops units with extreme values of the propensity score. For such units, it is difficult to find comparable units with the opposite treatment. Their presence makes analyses sensitive to minor changes in the specification to the presence of outliers in terms of outcome values. In addition, their presence increases the variance of many of the estimators. The threshold at which units are dropped is based on a variance criterion, following CHIM.

#### 1. Matching without replacement on the propensity score to create a balanced sample

The first method creates balance by matching on the propensity score. Here we match without replacement. Starting with the full sample with *N* units, *N _{t}* treated and

*N*>

_{c}*N*controls, the first step is to estimate the propensity score using the methods from Section II. I then transform this to the log odds ratio,

_{t}

To simplify notation, I will drop the dependence of the estimated log odds ratio on * X* and

*and simply write . Given the estimated log odds ratio, the*

**W***N*treated observations are ordered, with the treated unit with the highest value of the estimated log odds ratio scored first. Then, the first treated unit (the one with the highest value of the estimated log odds ratio) is matched with the control unit with the closest value of the estimated log odds ratio.

_{t}Formally, if the treated units are indexed by *i* = 1, …, *N _{t}*, with , the index of the matched control

*j*(1) satisfies

*W*

_{j}_{(1)}= 0, and

Next, the second treated unit is matched to unit *j*(2), where *W _{j}*

_{(2)}= 0, and

Continuing this for all *N _{t}* treated units leads to a sample of 2 ·

*N*distinct units, half of them treated and half of them controls. Although before the matching the average value of the propensity score among the treated units must be at least as large as the average value of the propensity score among the control units, this need not be the case for the matched samples.

_{t}I do not recommend simply estimating the average treatment effect for the treated by differencing average outcomes in the two treatment groups in this sample. Rather, this sample is used as a trimmed sample, with possibly still a fair amount of bias remaining but one that is more balanced in the covariates than the original full sample. As a result, it is more likely to lead to credible and robust estimates. One may augment this procedure by dropping units for which the match quality is particularly poor, say dropping units where the absolute value of the difference in the log odds ratio between the unit and its match is larger than some threshold.

#### 2. Dropping observations with extreme values of the propensity score

The second method for addressing lack of overlap discussed is based on the work by CHIM. Its starting point is the definition of average treatment effects for subsets of the covariate space. Let 𝕏 be the covariate space and 𝔸 ⊂ 𝕏 be some subset of the covariate space. Then define τ(𝔸) = 𝔼[*Y _{i}*(1) –

*Y*(0)|

_{i}*X*∈ 𝔸]. The idea in CHIM is to choose a set 𝔸 such that there is substantial overlap between the covariate distributions for treated and control units within this subset of the covariate space. That is, one wishes to exclude from the set 𝔸, values for the covariates for which there are few treated units compared the number of control units or vice versa. The question is how to operationalize this. CHIM suggests looking at the asymptotic efficiency bound for the efficient estimator for τ(𝔸). The motivation for that criterion is as follows. If there is a value for the covariate such that there are few treated units relative to the number of control units, then for this value of the covariate, the variance for an estimator for the average treatment effect will be large. Excluding units with such covariate values should therefore improve the asymptotic variance of the efficient estimator. It turns out that this is a feasible criterion that leads to a relatively simple rule for trimming the sample. The trimming also improves the robustness properties of the estimators. The trimmed units tend to be units with high leverage whose presence makes estimators sensitive to outliers in terms of outcome values.

_{i}CHIM calculates the efficiency bound for τ(𝔸), building on the work by Hahn (1998), assuming homoskedasticity so that for all *x* and a constant treatment effect, as

where *q*(𝔸) = Pr(*X* ∈ 𝔸). It derives the characterization for the set 𝔸 that minimizes the asymptotic variance and shows that it has the simple form

(19)

where α satisfies

Crump et al. (2008a) then suggests dropping units with *X _{i}* ∉ 𝔸

^{*}. Note that this subsample is selected solely on the basis of the joint distribution of the treatment indicators and the covariates and therefore does not introduce biases associated with selection based on the outcomes.

Implementing the CHIM suggestion is straightforward once one has estimated the propensity score. Define the function *g*: 𝕏 ↦ ℝ with

and the function *h* : 𝕏 ↦ ℝ with

and let *ŷ* = arg min_{λ} *h*(*λ*). The value of α in (5.19) that defines 𝔸^{*} is . Finding ŷ is straightforward. The function *h*(*λ*) is a step function, so one can simply evaluate the function at *λ* = *g*(*X _{i}*) for

*i*= 1, …,

*N*and select the one that minimizes

*h*(

*λ*).

The simulations reported in CHIM suggest that, in many settings, in practice the choice α = 0.1 leading to

provides a good approximation to the optimal 𝔸^{*}.

### G. Assessing Unconfoundedness

Although the unconfoundedness assumption is not testable, the researcher can often do calculations to assess the plausibility of this critical assumption. These calculations focus on estimating the causal effect of the treatment on a pseudo-outcome, a variable known to be unaffected by it, typically because its value is determined prior to the treatment itself. Such a variable can be time-invariant, but the most interesting case is in considering the treatment effect on a lagged outcome, commonly observed in evaluations of labor market programs. If the estimated effect differs from zero, it is less plausible that the unconfoundedness assumption holds, and if the treatment effect on the pseudo-outcome is estimated to be close to zero, it is more plausible that the unconfoundedness assumption holds. Of course, this does not directly test this assumption; in this setting, being able to reject the null of no effect does not directly reflect on the hypothesis of interest, unconfoundedness. Nevertheless, if the variables used in this proxy test are closely related to the outcome of interest, the test arguably has more power. For these tests, it is clearly helpful to have a number of lagged outcomes.

To formalize this, suppose the covariates consist of a number of lagged outcomes *Y*_{i,–1}, …, *Y*_{i,–T} as well as time-invariant individual characteristics *Z _{i}*, so that the full set of covariates is

*X*= (

_{i}*Y*

_{i,–1}, …,

*Y*

_{i,–T},

*Z*). By construction, only units in the treatment group after period –1 receive the treatment; all other observed outcomes are control outcomes. Also, suppose that the two potential outcomes

_{i}*Y*(0) and

_{i}*Y*(1) correspond to outcomes in period zero. Now consider the following two assumptions. The first is unconfoundedness given only

_{i}*T*– 1 lags of the outcome:

and the second assumes stationarity and exchangeability:

, does not depend on *i* and *s*.

Then it follows that

which is testable. This hypothesis is what the procedure described above assesses by analyzing the data with *X _{p}* =

*Y*

_{–1}and

*X*= (

_{r}*Y*

_{–2}, …,

*Y*

_{–}

*,*

_{T}*Z*). Whether this test has much bearing on unconfoundedness depends on the link between the two assumptions and the original unconfoundedness assumption. With a sufficient number of lags, unconfoundedness given all lags except the very last one would appear to be a plausible assumption if unconfoundedness given all lags holds. Therefore, the relevance of the test depends largely on the plausibility of the second assumption, stationarity and exchangeability.

## VI. Three Applications

In this section, I will apply the methods discussed in the previous sections to three data sets. The first data set, the “lottery data,” collected by Imbens, Rubin, and Sacerdote (2001), contains information on individuals who won large prizes in the Massachusetts lottery as well as on individuals who won small one-time prizes. They study the effect of lottery winnings on labor market outcomes. Here, I look at the effect of winning a large prize on average difference in labor earnings for the first six years after winning the lottery. The second illustration is based a widely used data set, the “experimental Lalonde data,” originally collected and analyzed by Lalonde (1986). Labor market programs are a major area of application for the evaluation methods discussed in this article, with important applications in Ashenfelter and Card (1985); Lalonde (1986); and Card and Sullivan (1988). Here, I use the version of the data put together by Dehejia and Wahba (1999), which is available from Dehejia’s website.^{5} The data set contains information on men participating in an experimental job training program. The focus is on estimating the average effect of the program on subsequent earnings. The third data set, the “nonexperimental Lalonde data,” replaces the individuals in the control group of the experimental Lalonde data set with observations on men from the Current Population Survey (CPS). It is also available on Dehejia’s website. The individuals in the CPS are substantially different from those in the experiment, which creates severe challenges for comparisons between the two groups.

### A. The Imbens-Rubin-Sacerdote Lottery Data

A subset of 496 lottery players with complete information on key variables is used. Of these 496 individuals, 237 won big prizes (on average, approximately $50,000 per year for 20 years), and 259 did not. For clarity, the former group is referred to as the “winners” and the latter as the “losers,” although the latter did win small one-time prizes. There is information on them concerning their characteristics and economic circumstances at the time of playing the lottery, and social security earnings for six years before and after playing the lottery. Although obviously lottery numbers are drawn randomly to determine the winners, within the subset of individuals that responded to our survey (approximately 50 percent) there may be systematic differences between individuals who won big prizes and individuals who did not. Moreover, even if there were no nonresponses, differential ticket buying behavior implies that simple comparisons between winners and nonwinners do not necessarily have a causal interpretation.

#### 1. Summary statistics for the Imbens-Rubin-Sacerdote data

Table 1 presents summary statistics for the covariates, including the normalized difference. The normalized differences are generally modest, with only three out of 18 normalized differences larger than 0.30 in absolute value. These three are Tickets Bought (the number of tickets bought per week), Age, and Years of Schooling. The latter two may be related to the willingness to respond to the survey. Note that the overall response rate in the survey was approximately 50 percent.

#### 2. Estimating the propensity score for the Imbens-Rubin-Sacerdote data

Four covariates are selected for automatic inclusion in the propensity score: Tickets Bought, Years of Schooling, Working Then, and Earnings Year–1. The reason for including Tickets Bought is that, by the nature of the lottery, individuals buying more tickets are more likely to be in the big winner sample, and therefore it is a priori known that this variable should affect the propensity score. The other three covariates are included because on a priori grounds they are viewed as likely to be associated with the primary outcome, average yearly earnings for the six years after playing the lottery.

The algorithm discussed in detail in the appendix leads to the inclusion of four additional linear terms and ten second-order terms. The parameter estimates for the final specification of the propensity score are presented in Table 2. The covariates are listed in the order they were selected for inclusion in the specification. Note that only one of the earnings measures was included beyond the (automatically selected) earnings in the year immediately prior to playing the lottery.

To assess the sensitivity of the propensity score estimates to the selection procedure, I consider three alternative specifications. In the second specification, I do not select any covariates for automatic inclusion. In the third specification, I include all linear terms but no second order terms. In the fourth specification, I use lasso (Tibshirani 1996) to select among all first- and second-order terms for inclusion in the propensity score. In Table 3, I report the correlations between the log odds ratios (the logarithm of the ratio of the propensity score and one minus the propensity score) based on these three specifications, and the number of parameters estimated and the value of the log likelihood function. It turns out in this particular data set, the automatic inclusion of the four covariates does not affect the final specification of the propensity score. All four covariates are included by the algorithm for choosing the specification of the propensity score even if they had not been preselected. The fit of the propensity score model selected by the algorithm is substantially better than that based on the specification with all 18 covariates included linearly with no second-order terms or the lasso. Even though the linear specification has the same degrees of freedom, it has a value of the log likelihood function that is lower by 30.2. The lasso selects fewer terms, 12 in total, and also has a substantially lower value for the log likelihood function.

#### 3. Trimming the Imbens-Rubin-Sacerdote data

In the lottery application, the focus is on the overall average effect. I therefore use the CHIM procedure discussed in Section II for trimming. The threshold for the propensity score coming out of this procedure is 0.0891. This leads to dropping 86 individuals with an estimated propensity score less than 0.0891, and 87 individuals with an estimated propensity score in excess of 0.9109. Table 4 presents details on the number of units dropped by treatment status and value of the propensity score.

The trimming leaves us with a sample consisting of 323 individuals, 151 big winners and 172 small winners. Table 5 presents summary statistics for this trimmed sample. One can see that the normalized differences between the two treatment groups are substantially smaller. For example, in the full sample, the normalized difference in the number of tickets bought was 0.90, and in the trimmed sample it is 0.51. I then reestimate the propensity score on the trimmed sample. The same four variables as before were selected for automatic inclusion. Again, four additional covariates were selected by the algorithm for inclusion in the propensity score. These were the same four as in the full sample, with the exception of male, which was replaced by pos earn year –5. For the trimmed sample, only four second-order terms were selected by the algorithm. The parameter estimates for the propensity score estimated on the trimmed sample are presented in Table 6.

#### 4. Assessing unconfoundedness for the Imbens-Rubin-Sacerdote data

Next, I do some analyses to assess the plausibility of the unconfoundedness analyses. There are 18 covariates, six characteristics from the time of playing the lottery, six annual earnings measures, and six indicators for positive earnings for those six years. I use three different pseudo-outcomes. First, I analyze the data using earnings from the last prewinning year as the pseudo-outcome, and second I use average earnings for the last two prewinning years as the pseudo-outcome, and in the third case, I use average earnings for the last three prewinning years as the pseudo-outcome. In all three cases, I preselect four covariates for automatic inclusion in the propensity score, the same three characteristics as before with in each case the last prewinning year of earnings given the new outcome. I redo the entire analysis, starting with the full sample of 496 individuals, including estimating the propensity score, trimming the sample and reestimating the propensity score. Table 7 presents the results for the blocking and matching estimators. In all three cases, both the blocking and matching estimates are substantively and statistically close to zero. The estimated effects for the three pseudo-outcomes range from −$1,160 to -$390, with none of the three statistically significantly different from zero at the 10 percent level. This suggests that unconfoundedness may be a reasonable assumption in this setting.

#### 5. Estimating average treatment effects for the Imbens-Rubin-Sacerdote data

Now we turn to the primary outcome, average earnings in the six years after playing the lottery, the effect of winning a big prize. Note that up to now we had not used data on the outcome in any of the analyses. I report 18 estimates. One set of six uses no covariates in the regression part, one set uses the four preselected covariates in the regression part, and the final set of six uses all eighteen covariates in the regression part. In each set of six estimates, there are two based on the full sample (one blocking estimator with a single block, and one matching estimator), and four based on the trimmed sample: one with no blocking, one with two blocks, one with the optimal number of blocks, and one matching estimator. The results are reported in Table 8. With five blocks, the estimates range from -$5,070 to -$5,740, with the standard errors between $1,400 and $2,000. With two blocks, the estimates are similar, with a little more variation. With no blocking, the estimates vary considerably more, and even more so in the full sample.

#### 6. Sensitivity to the specification of the propensity score for the Imbens-Rubin-Sacerdote data

Finally, we assess the sensitivity of the estimates of the average treatment effects to the specification of the propensity score. We compare three specifications: First, with *C _{L}* = 1 and

*C*= 2.71, which leads to the inclusion of some linear and some second-order terms; second, corresponding to the inclusion of all linear terms and no second-order terms, which formally corresponds to

_{Q}*C*= 0 and

_{L}*C*= ∞; and third, in which the lasso is used to select among all first- and second-order terms. In Table 9, we present results for the same six estimators we considered before. The range of estimates for the preferred (

_{Q}*C*= 1,

_{L}*C*= 2.71) specification is [–5.66, –4.19], with width 1.47. The corresponding range for the linear propensity score estimator with (

_{Q}*C*= 0,

_{L}*C*= ∞) specification is [–5.21, –2.39], with width 3.82. The lasso estimates range from –3.91 to –2.97, somewhat lower than the other estimates.

_{Q}#### 7. Conclusions for the Imbens-Rubin-Sacerdote data

Overall, the analysis suggests that the blocking estimates are robust, with a preferred estimate of the effect of winning a large prize on average yearly earnings around –$5,400 and a standard error of about $1,400. Moreover, the supporting analyses suggest that this is credible as an estimate of the causal effect of winning the lottery.

### B. The Lalonde Experimental Data (Dehejia-Wahba Sample)

The second data set comes from an experimental evaluation of a labor market training program. It was originally analyzed by Lalonde (1986) and subsequently by many other researchers. Various versions of the Lalonde data have been widely used in this literature (Dehejia and Wahba 1999; Heckman and Hotz 1989; Smith and Todd 2005; Crump et al. 2008b). The data used here were analyzed by Dehejia and Wahba (1999) and are available on Dehejia’s website. The data set contains information on 185 men who were randomly selected to the training program and 260 men assigned to the control group. I focus on estimating the average effect of the training.

#### 1. Summary statistics for the experimental Lalonde data

Table 10 presents summary statistics for the experimental Lalonde data. Not surprisingly, the normalized differences are modest. The largest is 0.30 for no degree. The *t*-statistic for this covariate is 3.1. If we had all the data from the randomized experiment, this would be extremely unlikely, but in the presence of nonresponse, it is not uncommon even in a randomized experiment to find some covariates whose distributions differ in the two treatment groups.

#### 2. Estimating the propensity score for the experimental Lalonde data

Next, we estimate the propensity score. If we had data from a completely randomized experiment with no attrition, the true propensity score would be constant. Even in that case, the algorithm might select some variables for inclusion in the propensity score and so the estimated propensity score need not be constant. Here, we include the four earnings variables, earnings in 1974 and 1975, and indicators for these earnings being equal to zero for inclusion into the propensity score. The algorithm selects three additional terms—nodegree, hispanic, and education—for inclusion in the linear part of the propensity score. In addition, the algorithm selects three second-order terms—the interaction of nodegree and education, *earn ’74* and *nodegree*, and *unempl ’75* and *education*—for a total of 11 covariates. The results are in Table 11.

#### 3. Trimming the experimental Lalonde data

Based on the estimated propensity score, we discard individuals with very low and very high values of the propensity score. The procedure described in Section II leads to a threshold of 0.1299 for low values (and 0.8701 for high values). This leads to dropping four control units with low values of the propensity score, one treated unit with a low value of the propensity score (0.12), and two treated units with a high value of the propensity score. It should not come as a surprise that in this case the procedure does not lead to a large number of units being dropped. The randomization ensures that most units are comparable, and propensity score values are clustered around the average value 0.42 with a standard deviation of 0.13. The minimum and maximum values for the estimated propensity score are 0.03 and 0.91. Table 12 presents the subsample sizes.

We reestimated the propensity score on the trimmed sample, but because the sample is so similar to the full sample the algorithm selects the exact same specification and the parameter estimates are very similar. For that reason they are not reported here.

#### 4. Assessing unconfoundedness for the experimental Lalonde data

The next step is to assess the unconfoundedness assumption. I use two pseudo-outcomes. The first is *earn’75*, earnings in 1975, and the second is the average of earnings in 1974 and 1975. In the first case, I use the eight covariates (the ten original ones minus earnings in 1975 and the indicator for zero earnings in 1975), and in the second case, I use six covariates (the original ten minus the four earnings-related covariates). In both cases, I do the full analysis, including trimming the sample and estimating the propensity score. Table 13 presents the results for these analyses. I look at the subclassification and matching estimators, using the all covariates to adjust for remaining bias. Over the four cases (the two estimators and the two pseudo-outcomes), the estimated effect of the treatment on the pseudo-outcome ranges from –$80 to $220. In none of the four cases can we reject the null hypothesis of no effect of the treatment on the pseudo-outcome (all four *t*-statistics are less than or equal to one in absolute value). Thus, the data are supportive of the unconfoundedness assumption, not surprisingly given that the data came from a randomized experiment.

#### 5. Subclasses for the experimental Lalonde data

To gain further understanding in the workings of these methods, I report details on the blocking procedure. The algorithm discussed in Section III and the appendix divides the experimental Lalonde data into three blocks. Table 14 presents the block boundaries and the number of control and treated units in each block. In the trimmed sample, the average estimated propensity for the treated individuals is 0.45 and for the control individuals 0.39, with a difference of 0.06. Within each of the three blocks, however, the difference in average propensity score values for treated and controls is 0.02 or less.

#### 6. Estimating average treatment effects for the experimental Lalonde data

Finally, I turn to the outcome data, earnings in 1978. In Table 15, I present results for the blocking and matching estimators, using no covariates, the four earnings variables, or all ten covariates for adjusting beyond the blocking or matching. The results range from $1,460 to $2,300, in all cases significantly different from zero at the 5 percent level.

#### 7. Conclusions for the experimental Lalonde data

For the experimental Lalonde data, I find that the program had a positive effect on earnings, with point estimates somewhere between $1,500 and $2,300. The variation in the estimates is probably due to the small sample size and the thick-tailed earnings distribution. The data suggest that unconfoundedness is plausible.

### C. The Lalonde Nonexperimental Data (Dehejia-Wahba Sample)

The final data set is the nonexperimental Lalonde data. Here, I take the men who were assigned to the training program and compare them to a comparison group drawn from the Current Population Survey. This is a data set that has been widely used as a testing ground for estimators for treatment effects.

#### 1. Summary statistics for the nonexperimental Lalonde data

Table 16 presents summary statistics for the covariates for the nonexperimental Lalonde data, including the normalized difference. Compared to the lottery data or the experimental Lalonde data, one sees substantially larger differences between the two groups, with normalized differences as large as two (for the indicator for being African-American), and many larger than one (including variables that a priori are likely to be important, such as the earnings variables). These summary statistics suggest that, irrespective of whether one believes the unconfoundedness assumption is a plausible one, it will be a severe statistical challenge to adjust for these covariate differences. Without even having looked at the outcome data, one can be confident that simple least squares regression adjustments are likely to be sensitive to the exact implementation.

#### 2. Estimating the propensity score for the nonexperimental Lalonde data

The first step is to estimate the propensity score on the full sample. I selected the two prior earnings measures, and the two indicators for these measures being positive for automatic inclusion in the propensity score. The selection algorithm selected five additional linear terms, leaving only education as a covariate that was not selected. The algorithm then selected five second-order terms, many of them involving the lagged earnings and employment measures. Table 17 presents the parameter estimates for this specification.

Clearly, the second-order terms are important here. The value of the log likelihood function at the maximum likelihood estimates is –408.8. If instead, a simple linear specification is used with all ten covariates, the value of the log likelihood function is –474.5. The correlation between the log odds ratio (the logarithm of the ratio of the propensity score over one minus the propensity score) for the specification involving second-order terms versus only linear terms is only 0.70. It appears unlikely that the linear specification of the propensity score will lead to an accurate approximation to the true conditional expectation.

#### 3. Matching the nonexperimental Lalonde data to improve balance

In this case, one is interested in the average effect on the treated. Because there are a large number of potential controls, the matching from Section I is used to obtain a more balanced sample. Specifically, I match the 185 treated individuals to the closest controls, without replacement. The match quality is not always high for this data set. Out of the 185 matches, there are 30 matches with difference in log odds ratio larger than 1.00. These are treated units with a propensity score between 0.74 and 0.89, matched with control units with propensity scores between 0.49 and 0.69. Although these differences are substantial, it is plausible that the further adjustment procedures suggested in this paper are effective in removing them. In Table 18, I report the normalized differences for the original sample (the same as before in Table 16), the normalized differences in the matched sample, and the ratio of the two. Compared to the normalized differences prior to the matching, the normalized differences after propensity score matching are substantially smaller. Whereas before there were six (out of ten) normalized differences in excess of 1.00, now none of the normalized differences is larger than 0.28, with most smaller than 0.10. The remaining normalized differences are still substantial, and they highlight that the simple matching has not removed all covariate differences between treated and control units, as also evidenced by the poor match quality for some of the matches. Nevertheless, because the differences are so much smaller, it will now be more reasonable to expect various estimators to be able to remove most of the remaining differences.

The propensity score on the matched sample is then reestimated. This time, only two covariates are selected for inclusion in the linear propensity score beyond the four automatically selected. These two covariates are married and nodegree. In addition, two second-order terms are included. Table 19 presents the parameter estimates for this specification.

#### 4. Assessing unconfoundedness for the nonexperimental Lalonde data

The next step is to assess the plausibility of the unconfoundedness assumption by estimating treatment effects for two pseudo-outcomes, *Earn ’75*, and (*Earn ’74* + *Earn ’75*)/2. Table 20 presents the results, using both the bias-adjusted matching estimator and the blocking estimator. All four estimates are substantively large, and one can in all four cases soundly reject the null hypothesis that the effect of the treatment on the pseudo-outcome is zero. The conclusion is that one cannot be confident here that unconfoundedness holds based on these analyses.

It is interesting though to note that the estimated effect for the second pseudo-outcome, (*Earn ’74* + *Earn ’75*)/2, where I only adjust for the six nonearnings related covariates, is much larger in absolute value than the estimated effect for the first pseudo-outcome, *Earn ’75*. In the latter case, I adjust for differences in earnings in 1974 for treated and controls. This combination of results does suggest that with additional earnings measures unconfoundedness might be more plausible.

#### 5. Subclasses for the nonexperimental Lalonde data

For comparison with the experimental Lalonde data, it is useful to compare the subclasses constructed by the blocking method. Here the method leads to five blocks. Table 21 presents the block boundaries and the number of control and treated units in each block. In the matched sample, before the blocking, the average estimated propensity score for the 185 treated individuals is 0.53 and the average estimated propensity score for the 185 matched control individuals is 0.47, with a difference of 0.06. Within four of the five blocks, the difference in average propensity score values for treated and controls is 0.01 or less, with only in the first block the difference in average values for the propensity score equal to 0.05.

#### 6. Estimating average treatment effects for the nonexperimental Lalonde data

Finally, Table 22 presents the results for the actual outcome, earnings in 1978. I report estimates without regression adjustment, with regression adjustment for the four a priori selected covariates, and with covariate adjustment for all ten covariates. I report these estimates for the full sample with 15,992 controls, for the matched sample without further blocking, for the matched sample with two blocks and, finally, for the matched sample with the optimal number of blocks. The estimates are fairly robust once I use at least two blocks for the matched samples, or even in the single block case with regression adjustment for the full set of ten covariates. In addition, the estimates are fairly close to the estimates based on the experimental sample.

#### 7. Conclusions for the nonexperimental Lalonde data

The results for the nonexperimental Lalonde data are interesting and quite different from those of the other data sets. The point estimates are fairly close to the estimates based on the experimental data, and the estimates are robust to changes in the specification. However, the estimates based on the pseudo-outcomes suggest one would not have been able to ascertain that the unconfoundedness assumption was plausible based on the nonexperimental data alone. One would have needed additional lagged measures of earnings to get a better assessment of the unconfoundedness assumption.

## VII. Conclusion

In this paper, I have outlined a strategy for estimating average treatment effects in settings with unconfoundedness. I have described a specific algorithm to estimate the propensity score in a flexible manner and to implement subclassification and matching methods based on Imbens and Rubin (Forthcoming 2014). I stressed the importance of dropping units with extreme values of the propensity score to obtain a more balanced sample. I also described a method for assessing the plausibility of the unconfoundedness assumption. These methods are illustrated on three data sets and are seen to lead to robust estimates for average treatment effects in all three cases. In two of the examples, with the lottery data and the experimental Lalonde data, the data pass the test that suggests that unconfoundedness is plausible. In the third case, with the nonexperimental Lalonde data, I cannot be confident that unconfoundedness holds based on the data. Thus, I would be less sure about the findings absent the experimental estimates as a benchmark, which in a normal study the researcher does not have.

## Appendix A Estimating the Propensity Score

Given the choices *X _{B}*,

*C*

_{lin}, and

*C*

_{qua}, the algorithm selects covariates for inclusion in the specification of the propensity score using the following 11 steps.

(1) Estimate the logistic regression model by maximum likelihood with the basic covariates

*X*._{B}(2) Estimate

*K*+ 1 –*K*additional logistic regression models where each model includes a single additional element of_{B}*X*not included in*X*. In each case, calculate the likelihood ratio test statistic for the null hypothesis that the coefficient on this additional variable is equal to zero against the alternative hypothesis that the coefficient on this additional covariate differs from zero._{B}(3) If the largest of the

*K*+ 1 –*K*likelihood ratio test statistics is smaller than_{B}*C*_{lin}, go to Step 6. If the largest of the likelihood ratio test statistics is larger than or equal to*C*_{lin}, select the corresponding covariate for inclusion in the vector*h*(·) and go to Step 4.(4) At this stage,

*K*+_{B}*K*linear terms have been selected for inclusion in the propensity score. Estimate_{L}*K*+ 1 –*K*–_{B}*K*logistic regressions, each with the already selected_{L}*K*+_{B}*K*covariates, plus one of the remaining covariates at a time. For each case, calculate the likelihood ratio test statistic for the null hypothesis that the coefficient on this additional variable is equal to zero against the alternative hypothesis that it differs from zero._{L}(5) If the largest of the likelihood ratio test statistics is smaller than

*C*_{lin}, go to Step 6. If the largest of the likelihood ratio test statistics is larger than or equal to*C*_{lin}, select the corresponding covariate for inclusion in the vector*h*(·) and return to Step 4.(6) At this stage,

*K*+_{B}*K*linear terms have been selected (including the intercept) and none of the remaining covariates would improve the log likelihood more than by_{L}*C*_{lin/2}(given that the likelihood ratio statistic is twice the difference in log likelihood values). Now, I will select a subset of the second-order terms. I only consider second-order terms for covariates that have been selected for inclusion in the linear part of the specification. Excluding the intercept that leaves*K*+_{B}*K*–1 linear terms, and thus (_{L}*K*+_{B}*K*–1) (_{L}*K*+_{B}*K*) / 2 potential second-order terms. I follow essentially the same algorithm as for the linear case for deciding which of these second-order terms to include but with the threshold for the likelihood ratio test statistic equal to_{L}*C*_{qua}instead of*C*_{lin}.(7) Estimate (

*K*+_{B}*K*–1)(_{L}*K*+_{B}*K*) / 2 logistic regression models, each including the_{L}*K*+_{B}*K*linear terms and one second-order term. Calculate the likelihood ratio test statistics for the null hypothesis that the coefficient on the second-order term is equal to zero._{L}(8) If the largest of the likelihood ratio test statistics is smaller than

*C*_{qua}, go to Step 11. If the largest of the likelihood ratio test statistics is larger than or equal to*C*_{lin}, select the corresponding second-order term for inclusion in the vector*h*(·).(9) At this point, there are

*K*+_{B}*K*linear terms selected and_{L}*K*second-order terms. Estimate (_{Q}*K*+_{B}*K*–1)(_{L}*K*+_{B}*K*) / 2 –_{L}*K*logistic regression models, each including the_{Q}*K*+_{B}*K*+_{L}*K*terms already selected, and one of the remaining second-order terms. Calculate the likelihood ratio test statistic for testing the null that the additional second-order term has a zero coefficient._{Q}(10) If the largest of the likelihood ratio test statistics is smaller than

*C*_{qua}, go to Step 8. If the largest of the likelihood ratio test statistics is larger than or equal to*C*_{qua}, select the corresponding second-order term for inclusion in the vector*h*(·) and go to Step 9.(11) The vector

*h*(·) now consists of the*K*terms selected a priori, the_{B}*K*linear terms, and the_{L}*K*second-order terms. Estimate the propensity score by maximum likelihood using this specification._{Q}

This algorithm always will converge to some specification for the propensity score. It need not select all covariates that are important, and it may select some that are not important, but it can generally provide a reasonable starting point for the specification of the propensity score. It is likely that it will lead to a substantial improvement over simply including all linear terms and no second-order terms. Incidentally, this algorithm would lead to the linear specification if one fixed *C*_{lin=0} (so that all linear terms would be included) and *C*_{qua=∞} (so that no second-order terms would be included).

## Appendix B Weights for Various Estimators

Here, I briefly show what the weights are for some widely used estimators.

### I. Difference Estimator

For the difference in average outcomes for treated and control units, , we have λ* _{i}* = 1.

The weights for all units are equal to one.

### III. Regression Estimator

Consider the least squares estimator in the regression with a scalar covariate *X _{i}*,

The weights are

where is the sample variance of *X _{i}*. Here the weights are not necessarily all positive.

### III. Subclassification Estimator

For the subclassification estimator, let the number of units in subclass *j* be equal to *N _{j}*, and the number of control and treated units in this subclass be equal to

*N*and

_{c,j}*N*respectively, and let

_{t,j}*B*(

_{i}*j*) be an indicator for unit

*i*falling in subclass

*j*. Then

### IV. Matching Estimator

A matching estimator with 1 match for each treated and control unit has the form,

where

ensuring that *Ŷ _{i}*(

*w*) is a linear combination of with positive weights.

### IV. Weighting Estimator

The weighting estimator (Hirano, Imbens, and Ridder 2003) has the form

so that

Here the weights are all positive.

## Footnotes

Guido W. Imbens is a professor of economics at the Stanford Graduate School of Business. Some data used in this article are available from the author beginning November 2015 through October 2018.

↵1. There is some controversy around this terminology. In a personal communication, Holland told me that he felt that Rubin’s generalizations of the earlier work (for example, Neyman 1923) and the central role the potential outcomes play in his (Rubin’s) approach justified the label.

↵2. The assumption that the sample can be viewed as a random sample from a large population is largely for convenience. Abadie et al. (2014) discusses the implications of this assumption, as well as alternatives.

↵3. Note that Lalonde in his original paper does consider trimming the sample by dropping all individuals with earnings in 1975 over some fixed threshold. He considers different values for this threshold.

↵4. The estimator has been implemented in the official version of Stata. See also Abadie et al. (2003) and Becker and Ichino (2002).

↵5. The webpage is http://www.nber.org/~rdehejia/nswdata.html.

- Received March 2013.
- Accepted March 2014.