## Abstract

A growing literature seeks to explain why so many more women than men now attend college. A commonly cited stylized fact is that the college wage premium is, and has been, higher for women than for men. After identifying and correcting a bias in estimates of college wage premiums, I find that there has been essentially no gender difference in the college wage premium for at least a decade. A similar pattern appears in quantile wage regressions and for advanced degree wage premiums.

## I. Introduction

Today in the United States, more women than men attend and graduate from college, a dramatic change from recent decades. This development has spawned a growing literature that attempts to identify its causes. See Chiappori, Iyigun, and Weiss (2009); Goldin, Katz, and Kuziemko (2006); DiPrete and Buchmann (2006); Dougherty (2005). Perhaps the most prominent stylized fact in this literature has been that the college wage premium for women is higher than the college wage premium for men (and has been for at least 40 years).^{1}

This stylized fact has framed the discussion in this literature: a higher return to education for women explains why more women than men attend college today, but it leaves unanswered why more men attended college in the past (since the college wage premium was higher for women in the past as well). Hence, some scholars have looked to factors that prevented women in the past from exploiting the higher college wage premium for women. Goldin, Katz, and Kuziemko (2006) point to past barriers to women’s education and careers; as these barriers fell, the gender difference in the college wage premium became decisive: “According to most estimates, the college (log or percentage) wage premium is actually higher for women than men, and it has been for some time…. As the labor force participation of women has begun to resemble men’s, women have responded to the monetary returns.” Analogously, Chiappori, Iyigun, and Weiss (2009) combine an exogenous fall in the time required for housework and “the higher *labor-market return* to schooling for women” to explain the relative rise in women’s college attainment.

But what if this stylized fact, the starting point for these arguments, is wrong? I will argue that the college wage premium is *not* larger for women than for men, and that we must therefore revisit our accounts of the dramatic rise in women’s college attainment relative to men. Most recent estimates rely on Current Population Survey (CPS) wage data that are “topcoded,” or censored at a maximum value. Topcoding biases estimates of the college wage premium downward for males relative to females, and the magnitude of this bias has grown over time. Once I account for this bias, I find no gender difference in the college wage premium in recent years.

This paper proceeds as follows. Section II briefly discusses recent estimates of the college wage premiums for men and women. None of these estimates adequately accounts for the bias caused by censoring. Section III presents facts on topcoding and shows how topcoding can bias estimates of the college wage premium. Section IV describes the data set I use in my estimates. In Section V, I reestimate college wage premiums for men and women using CPS data after accounting for topcodes, and show that the college wage premium is not larger for women than for men, at least in recent years. Section VI concludes.

## II. Existing Estimates

The near-consensus in the literature has been that college wage premiums are higher for women than for men, and have been since at least the 1960s. This claim rests primarily on analysis of the March CPS data series, which, for all years since 1963, provides a continuous series of annual, cross-sectional, nationally representative data on education, earnings, and hours worked.^{2} Chiappori, Iyigun, and Weiss (2009) look at white workers age 25–54 during years 1975–2004 and conclude that “women receive a higher increase in wages than men when they acquire college or advanced degrees.” Using CPS data for 1963 to 2001, DiPrete and Buchmann (2006) report an earnings gap of about 0.1 to 0.2 log points among 30–34 year old full-time/full-year (FTFY) white workers for the entire study period. Charles and Luoh (2003) use CPS data from 1961–97. Comparing those with at least two years of college to those with no college, they find that the wage premium for women is consistently about 0.2 log points higher than for men. Card and DiNardo (2002) use CPS data for years 1975–99 and report college wage premiums for women that are greater than or equal to those of men for all years.

Some studies use data sources other than the CPS. Dougherty (2005) runs wage regressions on years of schooling using National Longitudinal Survey of Youth 1979 (NLSY79) data and finds higher wage premiums for women throughout the period 1988–2000. He also cites more than 20 other studies that use data sources other than the CPS and find higher “returns to schooling” for women than men. None of these studies, however, looks at data from 1990 or later. Peña (2007) disagrees with these findings, but offers only evidence from outside the United States to support the claim that the college wage premium is higher for men.

## III. Topcoding and Topcode Bias

Wage and salary earnings data (“wage income” or “wages”) in CPS public use files have been topcoded since 1967. From 1967–80, the topcode was 50,000; from 1981–83, it was 75,000; from 1984–94, it was 99,999; from 1995–2001, it was 150,000; and since 2002, it has been 200,000.^{3} These are nominal values; all income data, including topcodes, reported in the CPS are in current year dollars.
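The topcode schedule above can be written as a small lookup. This is only a sketch: the names `TOPCODE_SCHEDULE`, `nominal_topcode`, and `is_topcoded` are illustrative and do not come from the CPS or IPUMS files, and the sketch ignores the per-job topcodes discussed in footnote 3.

```python
# Nominal CPS wage topcodes by data year, as listed in the text above.
# Each entry is (first_year, last_year, topcode); None means "still in effect".
# All names here are illustrative, not CPS or IPUMS identifiers.
TOPCODE_SCHEDULE = [
    (1967, 1980, 50_000),
    (1981, 1983, 75_000),
    (1984, 1994, 99_999),
    (1995, 2001, 150_000),
    (2002, None, 200_000),
]


def nominal_topcode(year):
    """Return the nominal topcode for a data year, or None before 1967."""
    for first, last, topcode in TOPCODE_SCHEDULE:
        if year >= first and (last is None or year <= last):
            return topcode
    return None


def is_topcoded(nominal_wage, year):
    """True if a nominal wage sits at or above that year's topcode."""
    topcode = nominal_topcode(year)
    return topcode is not None and nominal_wage >= topcode
```

Because the topcodes are nominal, a comparison like this should be made before wages are deflated.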

Topcoding is a widely recognized issue for CPS data. See, for example, Katz and Murphy (1992); Card and DiNardo (2002); Autor, Katz, and Kearney (2005); DiPrete and Buchmann (2006); Mulligan and Rubinstein (2008); Hirsh and McPherson (2008); Larrimore et al. (2008). I have found no prior study, however, that has identified and examined the biasing effect of topcodes on the college wage premium. (Although they do not discuss bias due to topcoding, Katz and Murphy (1992) do adjust topcoded wages before calculating college wage premiums for the years 1963–87. Card and DiNardo (2002) note the presence of topcodes and suggest adjusting them by a factor of 1.4, although their results described above do not make this adjustment.) Instead, some prior work has attempted to address the presence of topcodes by recensoring wage data at maximum values that are more consistent across time. For example, DiPrete and Buchmann (2006) recensor the 1963–2001 CPS wage data at topcodes that are linearly smoothed over time (and always below 124,000); Card and DiNardo (2002) recensor all observations after 1994 at 100,000. This type of correction should avoid spurious jumps in estimated wages in years when topcodes change.

Recensoring, however, greatly increases the number of observations that are subject to censoring in recent years, and the share of censored wage observations has grown steadily over time. Figure 1 shows the dramatic rise in the share of observations in my sample (described in Section IV) subject to topcoding or recensoring at 100,000.

Crucially, for estimates of the college wage premium, the increase over time in the censoring of wage observations is not benign. First, the overwhelming majority of topcoded or recensored observations in any year are college-educated individuals. Because of this, using topcoded data without accounting for censoring will bias college wage premium estimates downward. Second, the great majority of topcoded or recensored observations are male. Thus, we would expect topcoding bias to disproportionately depress the college wage premium for males. Together, these facts suggest that topcoding and recensoring have increasingly biased estimates of the relative college wage premiums of women and men.
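The mechanism can be illustrated with a small simulation. Everything in this sketch is synthetic: the log-wage means, the standard deviation, and the sample size are made-up values chosen only to show the direction of the bias, not estimates from CPS data. Both sexes are given the same true premium (0.5 log points), but the simulated men have higher wages, so more of the college-educated men hit the censoring point.

```python
import math
import random


def premiums_before_and_after_censoring(mu_hs, mu_college, sigma, topcode, n, rng):
    """Return (uncensored premium, censored premium) in log points for one group.

    Log wages are drawn from normal distributions with hypothetical means.
    """
    log_cap = math.log(topcode)
    hs = [rng.gauss(mu_hs, sigma) for _ in range(n)]
    college = [rng.gauss(mu_college, sigma) for _ in range(n)]

    def mean(values):
        return sum(values) / len(values)

    uncensored = mean(college) - mean(hs)
    censored = (mean([min(x, log_cap) for x in college])
                - mean([min(x, log_cap) for x in hs]))
    return uncensored, censored


rng = random.Random(0)  # fixed seed so the illustration is reproducible
# Hypothetical parameters: same true premium for both sexes, higher wage level for men.
men_true, men_cens = premiums_before_and_after_censoring(10.0, 10.5, 0.6, 100_000, 20_000, rng)
women_true, women_cens = premiums_before_and_after_censoring(9.7, 10.2, 0.6, 100_000, 20_000, rng)

men_bias = men_cens - men_true
women_bias = women_cens - women_true
print(f"bias in men's premium:   {men_bias:.4f}")
print(f"bias in women's premium: {women_bias:.4f}")
```

Under these assumed parameters both premiums are biased downward, and the men's premium more so, which mechanically produces an apparent female advantage where none exists in the simulated population.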

## IV. Data and Construction of Samples

To estimate college wage premiums after accounting for topcodes, I use the IPUMS CPS data series for 1970–2008.^{4} I construct a sample that is fairly representative of the samples used in the literature: I include white, non-Hispanic, adult civilians who were age 18 to 65 at the time of survey and who worked the previous year as private or government employees for a wage or salary. I exclude observations with negative CPS sample weights.^{5} I further restrict the sample to workers with 1–40 years of potential experience, as defined below. I limit the sample to full-time, full-year (FTFY) workers, where FTFY is defined as 35 or more hours per week and 50 or more weeks per year.
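As a rough sketch, the sample restrictions above amount to a single filter applied to each record. The dictionary keys used here (`race`, `hispanic`, `worker_class`, and so on) are hypothetical field names chosen for readability; they are not IPUMS CPS variable names.

```python
def in_sample(person):
    """Apply the sample restrictions described above to one record (a dict).

    All keys are hypothetical field names, not IPUMS CPS variables.
    """
    return (
        person["race"] == "white"
        and not person["hispanic"]
        and person["civilian"]
        and 18 <= person["age"] <= 65
        and person["worker_class"] in {"private", "government"}
        and person["weight"] >= 0                     # drop negative sample weights
        and 1 <= person["potential_experience"] <= 40
        and person["hours_per_week"] >= 35            # full-time
        and person["weeks_worked"] >= 50              # full-year
    )
```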

I focus on FTFY workers for two reasons. First, the 1994 CPS redesign was intended to increase the measured labor force participation of workers (believed to be mostly women) who in previous survey formats were being recorded as not in the labor force. Given that FTFY workers are the subset of workers who are least likely to be affected by recategorizing workers on the margin of labor force participation, I reduce any potential spurious trend generated by the 1994 survey redesign. Second, studies that have attempted to independently verify the accuracy of CPS wage data find that FTFY wage data appear to be very accurately reported, while wages for part-time, part-year workers appear to be substantially underreported. See Roemer (2000, 2002).

The coding of educational variables in the CPS data changed between 1990 and 1991. For 1970–90, the education variable is the number of whole and partial years of education completed (topcoded at 18). For 1991–2008, the education variable is coded as intervalled years of schooling for observations with less than a high school degree, and as the highest degree obtained for those with at least a high school degree, with a separate category for some college with no college degree. To ensure maximum consistency across time, I generate the following educational category recodes:

I define *Dropout* as anyone with fewer than 12 years of schooling completed (1970–90) or less than a high school diploma or General Education Development (GED) certificate (1991–2008); *High School Graduate* as anyone with exactly 12 years of schooling completed (1970–90) or exactly a high school diploma or GED certificate (1991–2008); *Some College* as anyone with more than 12 and fewer than 16 years of schooling completed (1970–90) or categorized as “Associate Degree” or “Some College, No Degree” by the CPS (1991–2008); *Bachelor’s Degree* as 16 or 17 years of schooling completed (1970–90)^{6} or categorized as “Bachelor’s Degree” by the CPS (1991–2008); and *Advanced Degree* as 18 or more years of schooling completed (1970–2008) or categorized as “Masters,” “Professional,” or “Doctorate” degree holder by the CPS (1991–2008). I define *College Graduate* as any observation that I have defined as either Bachelor’s Degree or Advanced Degree above.
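The recodes can be sketched as two small mappings, one for each coding regime. The function names are illustrative. The degree labels for 1991–2008 are the ones quoted in the text; the label used here for high school graduates is a placeholder, and any label not listed is treated as less than a high school diploma.

```python
def category_from_years(years):
    """1970-90 coding: completed years of schooling (topcoded at 18)."""
    if years < 12:
        return "Dropout"
    if years == 12:
        return "High School Graduate"
    if years < 16:
        return "Some College"
    if years <= 17:
        return "Bachelor's Degree"
    return "Advanced Degree"


# 1991-2008 coding: highest degree obtained.
# "High School Diploma or GED" is a placeholder label for this sketch.
CATEGORY_FROM_DEGREE = {
    "High School Diploma or GED": "High School Graduate",
    "Some College, No Degree": "Some College",
    "Associate Degree": "Some College",
    "Bachelor's Degree": "Bachelor's Degree",
    "Masters": "Advanced Degree",
    "Professional": "Advanced Degree",
    "Doctorate": "Advanced Degree",
}


def category_from_degree(label):
    """Map a degree label to a category; unlisted labels count as Dropout."""
    return CATEGORY_FROM_DEGREE.get(label, "Dropout")


def is_college_graduate(category):
    """College Graduate pools Bachelor's Degree and Advanced Degree."""
    return category in {"Bachelor's Degree", "Advanced Degree"}
```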

I also generate a “Years of School” variable. As noted above, the CPS reports years of schooling only until 1990. For 1991–2008, I impute years of completed schooling as follows: I divide the sample into demographic cells based on sex and educational category. For each demographic cell, I compute the mean years of schooling during the period 1988–90. These values, rounded down to the nearest integer, are the years of schooling used for observations in the same demographic categories for 1991–2008. With this measure of years of school, I generate potential experience as *AGE – YEARS OF SCHOOL – 7*.
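In code, the imputation and the experience formula reduce to two short functions. This is a minimal sketch with illustrative function names; the input list stands for the years-of-schooling values observed in one sex-by-education cell during 1988–90.

```python
import math


def imputed_years_of_school(cell_years_1988_90):
    """Mean years of schooling in one cell for 1988-90, rounded down."""
    return math.floor(sum(cell_years_1988_90) / len(cell_years_1988_90))


def potential_experience(age, years_of_school):
    """Potential experience as defined in the text: AGE - YEARS OF SCHOOL - 7."""
    return age - years_of_school - 7
```

For example, a 30-year-old with 16 years of schooling has 30 − 16 − 7 = 7 years of potential experience, which falls inside the 1–40 range used for the sample.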

I deflate all wage values using the Personal Consumption Expenditures (PCE) price index to 1982 dollars, and drop all observations with annual wages less than $3,484, or one-half minimum wage in 1982. For a 52-week year, this is equivalent to the $67/week threshold used in Katz and Murphy (1992) and Mulligan and Rubinstein (2008).^{7} Finally, I also exclude observations flagged as containing “allocated” (imputed) values for education or the amount or source of wage and salary income. No results are sensitive to the inclusion or exclusion of either type of imputed data. Table 1 provides descriptive statistics for the sample.
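The trimming threshold follows from simple arithmetic: $67 per week over 52 weeks is $3,484. A brief sketch of the deflate-then-trim step follows; the price index arguments are placeholders to be supplied by the user, since no PCE index values are given in the text.

```python
WEEKLY_FLOOR_1982 = 67                       # dollars per week, in 1982 dollars
ANNUAL_FLOOR_1982 = WEEKLY_FLOOR_1982 * 52   # 67 * 52 = 3,484


def real_wage_1982(nominal_wage, pce_index_year, pce_index_1982):
    """Deflate a nominal wage to 1982 dollars using caller-supplied PCE index values."""
    return nominal_wage * pce_index_1982 / pce_index_year


def passes_wage_floor(nominal_wage, pce_index_year, pce_index_1982):
    """Keep observations whose real annual wage is at least the 1982 floor."""
    return real_wage_1982(nominal_wage, pce_index_year, pce_index_1982) >= ANNUAL_FLOOR_1982
```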

## V. Results

I calculate college wage premiums by sex using alternate specifications, each designed to account for topcoding of wage data. First, I run ordinary least squares (OLS) regressions in which topcoded wage values are adjusted to eliminate bias from censoring. Second, I deploy the Tobit model for censored regression. Third, I use median regressions, which are not sensitive to the values of upper-tail wages. Regardless of the method used, I find no female advantage in the college wage premium in recent years. I then examine separately the premiums associated with bachelor’s degrees and advanced degrees.

### A. OLS Regression Estimates

My specification is fairly standard in the literature. I run a set of yearly regressions of log wages on a dummy for female sex, a dummy for college completion, an interaction between college completion and female sex, and a set of other controls (all of which are interacted with female sex):

$$\ln(y_i) = \alpha + \beta\,\mathit{Educ}_i + \gamma\,\mathit{Female}_i + \delta\,(\mathit{Female}_i \times \mathit{Educ}_i) + X_i'\theta + \varepsilon_i \tag{1}$$

For my initial OLS regressions, $y_i$ is annual wage income and $\mathit{Educ}_i$ is a dummy for college graduate. The vector $X_i$ includes potential experience, potential experience squared, and Census region dummies, all of them interacted with the dummy for female sex. The difference between the male and female college wage premium is the coefficient $\delta$ on the female/college interaction term.

Figure 2 shows the estimates of the college wage premiums from an OLS log wage regression when wages are recensored at 100,000. This generates the familiar conclusion that the college wage premium for women has been consistently higher than the premium for men. As we see from Figure 3, the gender difference is statistically significant in nearly all years, and appears to be growing in recent years.
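To see what the interaction coefficient measures, it helps to drop the other controls. With only the female dummy, the college dummy, and their interaction, the OLS coefficients are simple differences in mean log wages across the four sex-by-education cells, and the interaction coefficient is the women's premium minus the men's. The sketch below computes that simplified, unweighted version; it is not the full specification, which also includes experience and region controls, and the function name is illustrative.

```python
import math
from collections import defaultdict


def premium_gap_without_controls(records):
    """records: iterable of (wage, is_female, is_college) tuples.

    Returns (men's premium, women's premium, women's minus men's) in log points.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for wage, is_female, is_college in records:
        key = (bool(is_female), bool(is_college))
        sums[key] += math.log(wage)
        counts[key] += 1
    mean_log_wage = {key: sums[key] / counts[key] for key in sums}

    men = mean_log_wage[(False, True)] - mean_log_wage[(False, False)]
    women = mean_log_wage[(True, True)] - mean_log_wage[(True, False)]
    return men, women, women - men
```

When topcoded wages enter this calculation at their censored values, the college cell means are pulled down, and more so for the group with more topcoded observations.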

As noted above, the problem with these estimates is that censoring wages at a topcode or censoring point $T$ will bias downward the coefficients of variables that tend to raise wages above $T$. Because the true values of topcoded wage observations are not available, my approach is to replace topcoded log wage observations with their expected value; in other words, I choose an “adjustment factor” $A$ such that I can replace topcoded wage observations (observations where $y_i = T$) with new values ($y_i = AT$) where the log of the adjusted wage is equal to the expected value of the log of the true (unobserved) wage ($\ln(AT) = E[\ln(y_i) \mid y_i > T]$).

I generate year- and sex-specific adjustment factors for each year in my sample. To do this, I employ the common assumption that upper-tail wages are Pareto distributed. Given a Pareto distribution with minimum value $m$ and the Pareto index parameter $k$, the parameter $k$ uniquely determines the adjustment factor $A$ and vice versa. If wages $y_i$ are Pareto distributed, then for any topcode value $T$,

$$\ln(A) = E[\ln(y_i) \mid y_i > T] - \ln(T) = \frac{1}{k} \tag{2}$$

Thus, my task is to estimate $k$ by year and sex. I construct the likelihood function for a censored Pareto distribution

$$L(k) = \prod_{i=1}^{n} \left(\frac{m}{T}\right)^{k} \prod_{i=n+1}^{N} \frac{k\,m^{k}}{y_i^{\,k+1}} \tag{3}$$

where there are $N$ observations, of which the first $n$ observations are topcoded at $T$.

Maximizing log likelihood yields the following estimate for $k$:

$$\hat{k} = \frac{N - n}{\sum_{i=n+1}^{N} \ln(y_i/m) + n\,\ln(T/m)} \tag{4}$$
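The maximum likelihood step can be sketched in a few lines. For a Pareto tail with minimum $m$, the estimate of $k$ is the number of uncensored observations divided by the sum of $\ln(y_i/m)$ over uncensored observations plus $n \ln(T/m)$ for the $n$ censored ones, and the adjustment factor is $A = e^{1/k}$. The function names are illustrative, and the check at the end uses synthetic Pareto draws with a known $k$ rather than wage data.

```python
import math
import random


def pareto_k_mle(wages, topcode, m):
    """Censored-Pareto maximum likelihood estimate of k.

    Only observations with wage >= m are used; wages at or above the
    topcode are treated as censored at the topcode.
    """
    tail = [w for w in wages if w >= m]
    uncensored = [w for w in tail if w < topcode]
    n_censored = len(tail) - len(uncensored)
    denominator = (sum(math.log(w / m) for w in uncensored)
                   + n_censored * math.log(topcode / m))
    return len(uncensored) / denominator


def adjustment_factor(k):
    """A such that ln(A * T) equals E[ln y | y > T] under a Pareto tail."""
    return math.exp(1.0 / k)


# Synthetic check: Pareto draws with minimum 1 and true k = 2, censored at 5.
rng = random.Random(1)
draws = [rng.paretovariate(2.0) for _ in range(50_000)]
censored_draws = [min(w, 5.0) for w in draws]
k_hat = pareto_k_mle(censored_draws, topcode=5.0, m=1.0)
print(f"estimated k: {k_hat:.3f}, adjustment factor: {adjustment_factor(k_hat):.3f}")
```

On the synthetic draws the estimate should land close to the true value of 2 even though the largest observations are censored.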

The only unquantified term in Equation 4 is $m$. Rather than impose an a priori value for $m$, I take advantage of the fact that the adjustment factor $A$ (and therefore $k$) is known for each sex in recent years. With this information, I derive an estimate of $m$, which I then use in Equation 4 to estimate $\hat{k}$ for all years. Since 1995, each topcoded wage observation in the CPS public-use data has been replaced with the mean wages of all topcoded observations in the same sex-by-race-by-FTFY-status demographic cell (in other words, for each year and sex during 1995–2008, the CPS data provide $E[y_i \mid y_i \geq T]$). For each sex, I calculate the average ratio of these expected values to the topcodes; call this (sex-specific) ratio $R$.^{8} Again using the Pareto distribution, I derive a value for $k$ directly from the ratio of topcoded values to the topcode:

$$k = \frac{R}{R - 1} \tag{5}$$

I then calibrate $m$ for each sex such that the maximum likelihood estimate $\hat{k}$ matches the $k$ derived directly from the 1995–2008 CPS data.^{9} Armed with this estimate of $m$, I then use the estimator for $k$ given in Equation 4 to generate adjustment factors for each sex and year in the sample. Finally, I recensor 1995–2008 wage observations at their topcodes and multiply all topcoded wage observations for 1967–2008 by their year- and sex-specific adjustment factors before taking logs of all wage values.^{10}
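As a worked example of the step from the ratio to the Pareto index: for a Pareto distribution with $k > 1$, the mean above $T$ is $kT/(k-1)$, so the ratio of that mean to the topcode is $k/(k-1)$, which inverts to $k = R/(R-1)$. The sketch below applies this to the sex-specific ratios reported in footnote 8 (1.835 for men, 1.884 for women) and converts each $k$ into the log-wage adjustment factor $e^{1/k}$. The function name is illustrative.

```python
import math

# Sex-specific ratios of mean topcoded wage to topcode, from footnote 8.
R_BY_SEX = {"men": 1.835, "women": 1.884}


def k_from_ratio(r):
    """Invert R = k / (k - 1); requires R > 1."""
    return r / (r - 1.0)


for sex, r in R_BY_SEX.items():
    k = k_from_ratio(r)
    a = math.exp(1.0 / k)
    print(f"{sex}: k = {k:.3f}, implied adjustment factor = {a:.3f}")
```

These ratios imply $k$ of roughly 2.20 for men and 2.13 for women. The implied adjustment factors are smaller than $R$ itself because they target the expected log wage rather than the expected wage.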

When I reestimate college wage premiums using adjusted wages, a very different picture emerges. See Figures 4 and 5.^{11} I find little or no gender difference in the college wage premium since about 1990.

### B. Tobit Regression Estimates

An alternate, and indeed simpler, way to correct for topcoding is to employ the Tobit censored regression model. This model treats $y_i$ in Equation 1, annual wage income, as a latent variable, whose observed values are censored at an upper limit equal to the topcode value for the relevant year. This specification yields results almost identical to estimates based on adjusted wages. See Figures 6 and 7. There is essentially no significant gender difference in the college wage premium since about 1990.
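For readers unfamiliar with the Tobit model, the sketch below writes out the log-likelihood it maximizes when log wages are right-censored at the log of the topcode and the latent errors are normal. Uncensored observations contribute the normal density; topcoded observations contribute the probability that the latent log wage exceeds the censoring point. This is a generic textbook form with illustrative names, not the estimation code behind the figures; in practice the function would be handed to a numerical optimizer over the regression coefficients and $\sigma$.

```python
import math


def normal_logpdf(z):
    """Log density of the standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def normal_sf(z):
    """Upper-tail probability of the standard normal, P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tobit_loglik(log_wages, fitted, sigma, log_topcode):
    """Log-likelihood of a normal model right-censored at log_topcode.

    log_wages: observed log wages (censored values equal log_topcode)
    fitted:    linear predictions for the latent log wage, one per observation
    """
    total = 0.0
    for y, mu in zip(log_wages, fitted):
        if y >= log_topcode:
            # Censored: only know the latent log wage is at least the topcode.
            total += math.log(normal_sf((log_topcode - mu) / sigma))
        else:
            total += normal_logpdf((y - mu) / sigma) - math.log(sigma)
    return total
```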

### C. Quantile Regression Estimates

I next run quantile (median) regressions, using the same set of regressors. Figures 8 and 9 present the results of the quantile regressions. Importantly, these results are insensitive to the presence of topcodes or corrections for topcoding. In these estimates, the female-male difference is larger in earlier years, but the difference narrows and then vanishes by 2000.^{12}
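The reason medians are unaffected can be seen in a toy example. The five wage values below are invented for illustration; as long as the median itself lies below the censoring point, censoring the upper tail changes the mean but leaves the median where it was.

```python
import statistics

# Hypothetical wages; the largest value exceeds the censoring point.
wages = [20_000, 35_000, 50_000, 80_000, 400_000]
topcode = 100_000
censored = [min(w, topcode) for w in wages]

median_before = statistics.median(wages)
median_after = statistics.median(censored)
mean_before = statistics.mean(wages)
mean_after = statistics.mean(censored)

print(median_before, median_after)  # 50000 50000
print(mean_before, mean_after)      # 117000 57000
```

The same logic carries over to median regression, where the conditional median plays the role of the simple median here.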

### D. The Advanced Degree Wage Premium

As usually defined, the college wage premium compares college graduates to high school graduates with no college education, making no distinction between workers whose highest degree is a bachelor’s and those with advanced degrees. Such a distinction may have seemed unnecessary, because those studies that looked separately at bachelor’s and advanced degree holders have found the same pattern for each group as for all college graduates—a persistent difference in the premium favoring women. See Chiappori, Iyigun, and Weiss (2009); Card and DiNardo (2002).

Once I account for topcoding, however, a different picture emerges. To examine advanced degrees and bachelor’s degrees separately, I use the specification in Equation 1 with one change: $\mathit{Educ}_i$ is now a vector with separate dummies for bachelor’s degree only and advanced degree. I adjust topcoded wages as described above and run OLS regressions with adjusted log wages. I find that the advanced degree wage premium for women far exceeded the advanced degree wage premium for men through the 1970s and 1980s; since then, the men’s and women’s premiums have converged. The bachelor’s degree wage premiums for men and women, however, have never differed by much. See Figures 10 and 11.^{13} Thus, it appears that the past gender difference in the “college” wage premium was largely the product of the gender difference in the *advanced degree* wage premium.

### E. Sensitivity Analysis

In addition to the alternate estimation methods presented above, I perform further robustness checks on my methodology. First, in the spirit of Katz and Murphy (1992), I estimate college wage premiums nonparametrically by dividing the universe of observations into demographic cells, and then computing the college wage premium on a cell-by-cell basis. For each cell, I compute the mean wage among high school graduates and the mean wage among college graduates; the college wage premium is the log of the ratio of these averages. I combine these cell-specific college wage premiums to estimate a yearly college wage premium for each sex. In these fixed-weight estimates, higher college wage premiums for women disappear once topcoded wage observations are adjusted.
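A fixed-weight estimate of this kind can be sketched as a weighted average of cell-level log wage ratios. The cell labels, wage values, and weights in the usage example are hypothetical, and the exact cell definitions and weighting scheme used for the reported estimates are given in the Web Appendix, not here.

```python
import math


def fixed_weight_premium(cells, weights):
    """Weighted average of cell-level college wage premiums for one sex and year.

    cells:   {cell_id: (mean_high_school_wage, mean_college_wage)}
    weights: {cell_id: weight}, held fixed across years
    """
    total_weight = sum(weights[cell_id] for cell_id in cells)
    weighted_sum = sum(
        weights[cell_id] * math.log(college / high_school)
        for cell_id, (high_school, college) in cells.items()
    )
    return weighted_sum / total_weight


# Hypothetical experience cells with made-up mean wages and equal weights.
example_cells = {"exp_1_10": (20_000.0, 30_000.0), "exp_11_20": (30_000.0, 60_000.0)}
example_weights = {"exp_1_10": 1.0, "exp_11_20": 1.0}
example_premium = fixed_weight_premium(example_cells, example_weights)
```

Holding the weights fixed across years keeps changes in the composition of the workforce from being mistaken for changes in the premium.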

Second, I execute fixed-weight estimates after redefining the college wage premium as the log ratio of the *median* college wage to the *median* high school wage. This yields a “fixed-weight median” estimate; the results are nearly identical to the median regression results presented above. Third, I replicate the results of several papers described in Section II. After correcting for recensoring and topcoding, I find that their results are consistent with the results I report above. All these results are presented in detail in a Web Appendix.^{14}

## VI. Conclusion

After identifying and correcting a bias in estimates of college wage premiums, I reestimate college wage premiums for women and men and develop a new “stylized fact”: the gender difference in the college wage premium has dwindled over time, and there has been no female advantage in the college wage premium for at least a decade.

This fact only deepens the puzzle of why more women now attend college than men. Women’s college attendance has overtaken men’s, even as the college wage premium for women has *fallen* relative to the premium for men. This suggests that other forces, such as the nonmarket benefits of college education or the (nonpecuniary) costs of attending college, are driving relative changes in college attendance of men and women. Indeed, current work already has begun to consider some of the nonmarket benefits to college, including its effects on marriage and divorce (Iyigun and Walsh 2007; Chiappori, Iyigun, and Weiss 2009; DiPrete and Buchmann 2006). That women may have lower (nonpecuniary) costs of attending college is a promising explanation as well. See Becker, Hubbard, and Murphy (2010); Goldin, Katz, and Kuziemko (2006); Buchmann and DiPrete (2006).

## Footnotes

William H.J. Hubbard is the Kauffman Legal Research Fellow at the University of Chicago Law School and a Ph.D. Candidate in the University of Chicago Economics Department. He is grateful for comments from Gary Becker, Pierre-Andre Chiappori, Jonathan Hall, Devon Haskell, Ethan Lieber, Lee Lockwood, Kevin Murphy, Derek Neal, Emily Oster, Genevieve Pham-Kanter, Mark Phillips, Jesse Shapiro, Hugo Sonnenschein, Andrew Zuppann, and participants in the Workshop in Applications of Economics at the University of Chicago. The data used in this article can be obtained beginning six months after publication through three years hence from William H. J. Hubbard, University of Chicago Law School, 1111 E. 60th Street, Chicago, IL 60637, or by email at whubbard{at}uchicago.edu.

↵1. The college wage premium is generally defined, and I define it here, as the difference in log wages between college graduates and high school graduates with no college education.

↵2. Unless noted otherwise, the years I give for CPS data reflect the year to which the data apply, not the year in which the data were collected (for example, data from the March 1964 CPS apply to 1963). The CPS is intended to be a nationally representative survey of the civilian, noninstitutionalized population.

↵3. Until 1988, all wage income was reported as a single total subject to these topcodes. Since 1988, the CPS has reported wage income from separate jobs separately, applying a separate topcode to each job; for these years, the topcodes listed are for wage income from the longest job held that year. The Integrated Public Use Microdata Series (IPUMS) CPS data series aggregates these individual-job wage income variables to consistently report total wage income (subject to the topcodes listed above) for all years. The results herein use this IPUMS CPS total wage income variable for all years. When wages are disaggregated and topcodes are adjusted separately, my results are not discernibly affected.

↵4. See King, et al. 2009. Integrated Public Use Microdata Series, Current Population Survey: Version 2.0. Minneapolis, Minn.: Minnesota Population Center.

↵5. Fewer than 230 out of nearly 2.9 million observations in the sample have negative weights.

↵6. The results are not sensitive to how observations with 17 years of schooling are categorized.

↵7. The results are not sensitive to the trimming of outliers. See Bollinger and Chandra (2005).

↵8. The values for $R$ are 1.835 for men and 1.884 for women.

↵9. Because nominal and real income levels are changing over time, I define $m$ as a multiple of the median for that year: 1.149 (men) or 0.929 (women) times the median wage income.

↵10. Hirsh and McPherson (2008) provide estimates of $R$ by year (starting in 1973) and sex based on supplemental CPS data on hourly wages. Larrimore et al. (2008) utilize internal-use CPS files to provide mean wages of all topcoded observations in the same sex-by-race-by-FTFY-status demographic cell (comparable to those available in public-use files for 1995–2008) for all years starting in 1975. Using either of these sources to adjust topcoded values yields results virtually identical to those presented here.

↵11. Note that the standard errors do not account for estimation of adjustment factors.

↵12. A comparison of coefficients and standard errors for recensored and adjusted OLS, Tobit, and median regressions appears in Table 2.

↵13. Results from the corresponding Tobit regressions, and quantile (median) regressions using unadjusted log wages, are virtually identical.

↵14. Available online at http://home.uchicago.edu/~whhubbar/.

- Received December 2009.
- Accepted July 2010.