## ABSTRACT

We examine dynamics in the gender gap in high school mathematics achievement using competition data. A clear gender gap is present by ninth grade, and the gap widens over time. Gender-related differences in dropout rates and in the mean and variance of year-to-year improvement contribute to the widening of the gender gap. The most important difference is that fewer girls make large enough gains to improve their rankings. We also document a discouragement effect: among students falling just short of qualifying for a prestigious second-stage exam, some drop out of future years, and this reaction may be stronger among girls.

## I. Introduction

The gender gap in average science and math achievement by the end of high school has narrowed significantly in recent decades and is qualitatively small today.^{1} However, girls are underrepresented among high-achieving math students in middle and high school, and this may contribute to their underrepresentation in STEM fields, both in college majors and the workforce.^{2} These gaps have been shown to vary with potentially manipulable environmental factors, such as local culture and the presence of same-gender instructors.^{3} To the extent that there is a role for policy in addressing female underrepresentation in STEM, several natural questions arise: at what point in students’ development do these gaps occur, how do they evolve over time, and why?

This work takes advantage of a new panel data set on American Mathematics Competition (AMC) participants to examine the dynamics of the gender gap over the high school years within a large population of very high-achieving U.S. math students. The AMC tests are much better than commonly studied tests at identifying and distinguishing among very high-achieving students, and many of the very best U.S. math students take these tests. The panel data set used for the first time in this paper allows us to analyze the development of math achievement by seeing how students perform on tests of similarly high difficulty in Grades 9, 10, 11, and 12. It also lets us examine dropouts from and new entry into real-world competition by high-achieving boys and girls. An important limitation of this setting is that our study population consists of students who have chosen to participate in a competition. They are presumably more interested in competition, and we only see them perform in a competitive environment.^{4} This is relevant to interpreting the finding of this paper that the high-achievement gender gap is already large by ninth grade (Niederle and Vesterlund 2010). We hope that this limitation is less relevant to the dynamic analysis that we mostly focus on in this paper. The estimates of the dynamics compare subsequent outcomes for boys and girls who have in common that they selected into competing and achieved identical scores in year *t*.

Section II presents institutional facts on the AMC contests and summary statistics on the data set. It then presents two basic observations that motivate the rest of the paper. The first observation is that there is already a substantial gender gap among high-achieving ninth-graders. The other is that the gender gap widens substantially over the high school years. The first observation motivates examining the individual-level persistence in high scores. If it is substantial—which we find—then the ninth-grade gap is a large contributor to the end-of-high-school gap and merits further study. The second observation motivates a richer examination of the dynamics of achievement among high-achieving high school students and the gender-related differences in these dynamics. Many potential explanations have been discussed to account for the single fact that boys outnumber girls among high math achievers. A fuller understanding of the dynamics can provide a much larger set of facts that proposed explanations for the end-of-high-school gender gap would need to explain.

Section III takes a step back from the focus on gender to provide some initial observations on the dynamics of high achievement in high school. We present several observations on the environment in which high-achieving high school students are investing in their math skills. One is that performance is highly persistent even when we subdivide the top percentile very finely. Another is that high-achieving students must improve their mastery of the precalculus mathematics and problem-solving skills tested by the AMC contests substantially to maintain their position year-to-year, and the probability of making substantial gains relative to one’s cohort is low.

Section IV then explores gender-related differences in the dynamics to identify factors that lead to the widening gender gap. Studies of other environments have identified several factors that could affect patterns of entry, exit, and improvement in the AMC. Preferences for competition and differences in how boys and girls allocate their effort across coursework in different subjects and extracurricular activities could be particularly relevant here.^{5}

Our analyses uncover several gender-related differences. High-achieving girls improve by less from year to year on average than do boys with similar initial performance. The variance of the girls’ improvements is lower. Girls at each performance level are more likely to drop out of participating, and girls are underrepresented among the high-scoring entrants. To clarify the relative importance of these patterns, we propose a method for decomposing the net change in the fraction female among high-scoring students into several components. The decomposition suggests that the most important gender-related dynamic difference is that fewer girls are making large enough increases from year to year to move up into the top rank groups.

Section V ventures into assessing potential explanations for the gender gap in a more causal-inference style, looking at whether a portion of the gap may be attributable to gender-related differences in reactions to disappointment. We note that high-achieving students will be quite disappointed if they fall short of a threshold score needed to move on to a second-stage exam and that this disappointment can be viewed as a treatment that is applied at a different cutoff level of performance on different tests. We use a variant of a regression-discontinuity design to examine a narrow window around the cutoff for progressing to the second-stage exam and find strong evidence that both boys and girls are more likely to drop out of participating in future years if they score just below the cutoff. We also find that the tendency to drop out after experiencing disappointment may be more common among girls.
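The comparison at the heart of this design can be illustrated with a minimal sketch. It assumes hypothetical records of (score, dropped-out-next-year) pairs and takes a simple difference in dropout rates within a narrow window on either side of the cutoff. The function name, the window width, and the data are illustrative assumptions; the estimator actually used in Section V may differ.

```python
def dropout_gap_near_cutoff(records, cutoff, window):
    """Difference in next-year dropout rates just below vs. just above a cutoff.

    records: iterable of (score, dropped_out) pairs, dropped_out being a bool.
    Students with cutoff - window <= score < cutoff count as "just below";
    students with cutoff <= score < cutoff + window count as "just above".
    (Hypothetical helper for illustration only.)
    """
    below = [dropped for score, dropped in records
             if cutoff - window <= score < cutoff]
    above = [dropped for score, dropped in records
             if cutoff <= score < cutoff + window]
    if not below or not above:
        raise ValueError("no observations on one side of the cutoff")
    return sum(below) / len(below) - sum(above) / len(above)


# Made-up records: (score, did the student drop out the following year?)
example_records = [(98, True), (99, False), (100, False), (101, False), (90, True)]
gap = dropout_gap_near_cutoff(example_records, cutoff=100, window=3)
```

A positive value indicates that students just below the cutoff drop out more often than students just above it. Computing the same quantity separately for boys and girls would give a crude version of the gender comparison.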

Finally, Section VI recaps results and presents conclusions and implications for future research.

Our investigation is related to a number of literatures. A number of papers, including our own, have noted that girls are underrepresented among the high-scorers on standard math assessments and math contests both in the United States and in many other countries.^{6} While the dynamics of the gender gap are less studied, there are several previous studies documenting an increasing gender gap.^{7} Relative to this literature, we add a number of new observations about the gender gap. This includes both our initial observations that extreme gender gaps among very high-achievers are already present by ninth grade and that the gender gap among high-achievers widens over the course of high school, and the many observations about the dynamics of achievement that we are able to make due to our unique panel data on high-achieving boys and girls.

Our work is also related to the literature on gender differences in attitudes toward competition. Using laboratory experiments, Niederle and Vesterlund (2007) and Niederle, Segal, and Vesterlund (2013) found a clear gender gap in willingness to enter contests. This may be particularly relevant to our application, as Buser, Peter, and Wolter (2017) find that the gender gap in the preference for competition is highest for high-ability students.^{8} Prior experimental and real-world evidence has also demonstrated that men and women react differently to losing contests. Gill and Prowse (2014) find that women who lose a contest score lower in subsequent contests. Buser (2016) finds that men (but not women) react to losing by seeking greater challenges. Buser and Yuan (2019) find that, even within populations who have already opted into competing, women are more likely to react to losing by ceasing to compete. Cai et al. (2019) find that women’s performance suffers more than men’s in response to negative performance shocks on earlier exams taken on the same day. Our real-world evidence on students’ reactions to disappointment is consistent with there being a similar gender gap, although our message is not entirely aligned, as we find that boys also react to disappointment by dropping out of future competition.^{9}

Our Section V analysis is very closely related to the Buser and Yuan (2019) regression discontinuity analysis of students near the threshold for advancing to the second round of the Dutch Math Olympiad. They document a small and insignificant one percentage point dropout effect for boys, and a large and marginally significant 11 percentage point dropout effect for girls. Our much larger sample allows us to use narrower windows and get more precise estimates.^{10} We find estimates of 3.4–3.7 percentage points for boys and 4.2–5.6 percentage points for girls, with standard errors of 1.2 percentage points at most. We find marginally suggestive evidence for the Buser and Yuan (2019) finding that girls are relatively more likely to react by dropping out, but we document that the effect among boys is also substantial and that the differential effect for girls relative to boys in the United States contest is not nearly as large as their point estimates for the Netherlands.

More broadly, our work is motivated in several respects by the rich literature on gender gaps in wages and career development. As summarized in Blau and Kahn (2017), gender gaps in mathematics and career-oriented college majors declined substantially between the 1960s and 1980s, but there has been less progress since.^{11} Of particular relevance is the subset of the literature that pertains to the dynamics of the gender gap in pay and workforce participation (Bertrand, Goldin, and Katz 2010; Goldin et al. 2017).

## II. The High-Achievement Gender Gap in AMC Scores

In this section, we bring out some basic facts about the gender gap among AMC high-scorers. Ellison and Swanson (2010) noted a large gender gap at high achievement levels and that the gaps are much wider at very high achievement levels above those that can be reliably measured with more commonly used standardized tests. Among the new observations here are that the high-achievement gender gap is already quite large and has the same distinctive pattern by the time students are in ninth grade and that the gap grows wider over the high school years.

### A. Background and Data

The primary subject of our analysis is a database of scores on the Mathematical Association of America’s AMC 10 and AMC 12 contests for 1999–2007. The tests are 25-question, multiple-choice tests designed to identify and distinguish among students at very high performance levels. They are administered to more than 200,000 students in about 3,000 U.S. high schools. The AMC 10 is open to students in Grade 10 and below. The AMC 12 is open to students in Grade 12 and below.

Several features make the AMCs well suited to studying the dynamics of high math achievement during the high school years. One is that the tests are reliable even for very high-achieving students.^{12} A second is that the tests are very popular among the very best U.S. math students.^{13} A third is that many high-achieving students take the tests annually over a four-year period, which lets us track their year-to-year improvement. The benefits and costs of participating in the AMC contests are myriad and vary across students. Immediate benefits and costs include the psychic benefit (enjoyment) or cost (stress) associated with the competition itself (Niederle and Vesterlund 2007); extrinsic benefits such as praise, AMC and school prizes, and credentials for college applications; and intrinsic satisfaction or disappointment from performing well or poorly. Future benefits include the knowledge gained by studying and access to more elite levels of competition.

By 2007, the AMC offered four tests per year: the AMC 10A and 12A were offered on one date in early February, and the AMC 10B and 12B were offered two weeks later. One motivation was to accommodate students whose school was on vacation or cancelled due to snow on the A date. But schools can offer both the A-date and B-date tests, and some students choose to take a test on both dates. In 2007, about 3 percent of A-date-takers also took a B-date test.^{14} The test multiplicity necessitates rescaling scores from the various year *t* tests to make them comparable to other tests from the same year. In the years 2000–2006, the way in which we do this is to think of year *t* scores as predictors of year *t* + 1 AMC 12 scores. We run separate linear regressions of year *t* + 1 AMC 12 scores on scores on each year *t* test and consider two year *t* scores to be equivalent if the predicted year *t* + 1 AMC 12 score is the same. This year-ahead prediction is not possible in the final year of our data, so in 2007 we instead normalize scores by comparing the performance of students who take both an A test and a B test in 2007.^{15}
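The year-ahead rescaling for 2000–2006 can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and all data are made up, and each test's regression is fit by ordinary least squares on students observed in both years. Two raw scores are then treated as equivalent when their predicted next-year AMC 12 scores coincide.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def adjusted_score(raw_score, coeffs):
    """Predicted year t+1 AMC 12 score implied by a year t raw score."""
    intercept, slope = coeffs
    return intercept + slope * raw_score


# Made-up data: year t raw scores paired with the same students'
# year t+1 AMC 12 scores, one regression per year t test.
amc10a_t, next_after_10a = [90.0, 100.0, 110.0, 120.0], [80.0, 90.0, 100.0, 110.0]
amc12a_t, next_after_12a = [80.0, 90.0, 100.0, 110.0], [90.0, 100.0, 110.0, 120.0]

coef_10a = fit_line(amc10a_t, next_after_10a)
coef_12a = fit_line(amc12a_t, next_after_12a)

# In this toy data, a 120 on the 10A and a 100 on the 12A predict the same
# next-year AMC 12 score, so the two would be treated as equivalent.
adj_10a = adjusted_score(120.0, coef_10a)
adj_12a = adjusted_score(100.0, coef_12a)
```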

Our normalization is not designed to put year *t* and year *t*′ scores on a common scale. Instead, we mostly avoid the difficulties inherent in comparing scores across calendar years by focusing on students’ *ranks* within the set of students who participated in a given year. In Section III.A, we present evidence that transforming scores to log ranks produces a measure in which the additive improvement in performance from year to year is similar over a wide range of (high) initial performance levels. This ability to renormalize scores in such a way is another attractive feature of the AMC environment.

Our raw data consist of separate files of student-level scores on each test in each year. The records contain a school identifier, the state in which the school is located, an anonymization of the student’s name, and the student’s gender, grade, age, and home ZIP code. We create a student-level panel data set by merging these files, assuming that two scores belong to the same student if the name and school match and the age, grade, and gender are consistent, or if the name and state are the same and the city, home ZIP code, age, grade, and gender are consistent.^{16}
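The matching rule can be sketched as a predicate on pairs of records. The field names below are hypothetical, and the definition of "consistent" is our assumption for illustration: gender is unchanged, grade advances by exactly the number of years elapsed, and age advances by the same amount with a one-year tolerance. The exact criteria used to build the panel may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical field names for one student-test record.
    name_hash: str   # anonymized name
    school_id: str
    state: str
    city: str
    home_zip: str
    gender: str
    grade: int
    age: int
    year: int


def demographics_consistent(earlier, later):
    """Assumed consistency check for gender, grade, and age across years."""
    gap = later.year - earlier.year
    return (
        gap > 0
        and earlier.gender == later.gender
        and later.grade - earlier.grade == gap          # no skipped or repeated grade
        and abs((later.age - earlier.age) - gap) <= 1   # one-year tolerance (assumption)
    )


def same_student(earlier, later):
    """Name + school match, or name + state + city + home ZIP match."""
    if earlier.name_hash != later.name_hash:
        return False
    if not demographics_consistent(earlier, later):
        return False
    same_school = earlier.school_id == later.school_id
    same_area = (
        earlier.state == later.state
        and earlier.city == later.city
        and earlier.home_zip == later.home_zip
    )
    return same_school or same_area
```

Under this rule, a student who changes schools without moving can still be matched through the second branch, while a student who skips a grade fails the consistency check and is not matched.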

In the full pre-2007 data set, we match 43 percent of Grade 9–11 student–years to a score in the subsequent year. Note that failures to match result from both students who do not participate in the following year and the limitations of our matching procedure; for example, we will miss students who report their name inconsistently, students who skip a grade, most students who move, etc. One would expect high-achieving students to be more likely to take the AMC in subsequent years. Our match rates are consistent with this. For example, among Grade 9–11 students who were among the 500 highest scoring students in their cohort, the subsequent-year match rate is 80 percent.

In our analyses of the evolution of students’ scores we define a student’s *Adjusted-Score* in year *t* to be the rescaling of the score that they received on the first test offered by their school in that year. Note that, at schools that offer both the A-date and B-date tests, students who only take the B-date test in year *t* are coded as not participating in that year. The primary reason for this decision is that we think doing otherwise would lead to miscounts of high-scoring students.^{17}

### B. Summary Statistics and the Gender Gap in AMC Participation

In this section, we present some summary statistics on AMC scores and participation rates. Gender differences in participation rates are not large, but there is some evidence of gender-related selection into the contests.

Table 1 summarizes participation and scores by grade and gender. The top panel contains information for female students. Female participation grows substantially from ninth to tenth grade, from an average of about 19,000 ninth-grade girls per year to about 28,000 tenth-grade girls per year. One reason for the growth may be that some teachers hesitate to recommend the AMC tests to ninth-graders, regarding the tests as too advanced. Awareness of the AMCs also presumably diffuses over time. Female participation remains roughly constant from 10th to 11th grade. It then drops by about 18 percent from 11th to 12th grade.^{18} One reason may be that 12th-grade scores and awards come out too late to be listed on college applications.

The bottom portion of the table reports comparable statistics for boys. Male participation is about 11 percent higher than female participation in ninth grade. Its growth from ninth grade to tenth grade is similar to that for girls. The series then diverge a bit more, as male participation continues to grow from 10th grade to 11th grade, and has an 11th-to-12th grade decline that is less than half as large as that for females. The pool of 12th-grade AMC takers is about 43 percent female. While the gender gap in AMC participation increases over the course of high school, and we will later investigate differential dropout rates in detail, a first takeaway is that participation rates among high-achieving girls and boys are not too different for the AMC 12. Most AMC takers presumably come from the high end of the SAT population, and the population of students with SAT scores of 600 or above is also 43 percent female.

The table also provides summary statistics on normalized AMC scores. The AMC tests are not a good source for insights on average performance given the highly selected populations, so we will not say much about them. Our previous papers focused on counts of students achieving scores above certain high thresholds, for which we think selection is less of an issue. Scoring 100 on the AMC 12 can be thought of as roughly similar in difficulty to scoring 780 or 800 on the math SAT. Among 12th-graders scoring at this level or higher, we find a male-to-female ratio of about 3.4:1. The male-to-female ratio among students achieving comparable scores on the SATs is about 2:1. The gender gap could be different on the AMC and SAT due to differences in what is being tested and to the fact that the SAT is a cruder instrument. But the magnitude of the difference suggests that there are some gender-related differences in participation rates, as would be expected given the literature on gender differences in attitudes toward competition.

Scoring 120 on the AMC 12 represents a much higher level of achievement—roughly in the 99.99th percentile of the full U.S. 12th-grade population. Here, we think that selection into test-taking is less important. Our primary reason for saying this is that reaching the highest levels of performance on the AMC 12 requires a great deal of natural ability and effort directed toward mastering high school mathematics, and we feel that it is unlikely that students not interested in participating in math competitions would exert the effort necessary to excel at these levels. We see this as analogous to saying that there are unlikely to be many high school students who can throw a curveball and a 90-mph fastball who are not participating in competitive baseball. Note that the male-to-female ratio is much larger among students reaching the 120 level. This is part of a larger pattern noted in Ellison and Swanson (2010).

### C. The Gender Gap in High Math Achievement over the High School Years

In this section, we illustrate how the gender gap among AMC high-scorers changes over the course of high school. Two important observations are that the gender gap is already large in ninth grade and widens substantially over the high school years.

Figure 1 reports the percentage of AMC high-scorers in each grade who are female for various definitions of high scoring. The top line in the figure uses the least restrictive definition, examining the 5,000 highest-scoring students in each grade–year. These are very high-achieving students, but not extremely unusual ones: one could think of them as on a trajectory to score 780 or 800 on the math SAT by the end of high school. At the left endpoint we see that there is a substantial gender gap in ninth grade: only 30.5 percent of the high-scoring ninth-graders are female. Looking from left to right along this line, the gender gap widens in each subsequent year. By 12th grade, only 21.8 percent of the top 5,000 high-scorers are female. The drop from ninth grade to tenth grade is the largest, but the decline is fairly steady.

The lower series present comparable estimates using more and more stringent definitions of “high-achieving,” going all the way to a definition that is two orders of magnitude more demanding and examines just the top 1 percent of our initial high-achieving pool. That each of the curves slopes downward indicates that the finding that the gender gap widens over the course of high school is quite robust to how one defines high-scoring. In proportional terms, the decline in the percent female from ninth grade to 12th is between 29 percent and 35 percent along every curve except the lowest one.^{19}
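As a quick check on the proportional decline, the rounded endpoints reported for the top series (30.5 percent female in ninth grade and 21.8 percent in 12th grade) give

$$\frac{30.5 - 21.8}{30.5} = \frac{8.7}{30.5} \approx 0.285,$$

or roughly 29 percent, at the lower end of the stated range. This calculation uses only the rounded figures in the text, so it may differ slightly from one based on the underlying data.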

Ellison and Swanson (2010) highlighted that the gender gap is much larger when one examines more extreme high-achievers. A comparison of the leftmost points of the series in Figure 1 shows clearly that this pattern is already present by ninth grade. Girls comprise 30.5 percent of the top 5,000 ninth-graders, but only 8.4 percent of the top 50 ninth-graders. One implication is that, if performance is highly persistent (which we find), then the larger gender gap observed among extreme high-achievers relative to ordinary high-achievers cannot be primarily driven by things that are happening during high school. The subsequent analyses in this paper investigate the second fact visible in the slopes in this figure: the gender gap widens over the high school years among ordinary high-achievers, extreme high-achievers, and everyone in between.

## III. Dynamics of Achievement among High-Achievers

In this section, we take a step back from gender-related issues and present some more general evidence on the dynamics of achievement among high-achieving math students. Our observations include that the distribution of mathematical achievement is sufficiently spread out so that the top ninth-graders are already very high in the overall score distribution, that high-achieving students must substantially improve from year to year to keep up with their cohort, that there is substantial performance persistence, and that it is unlikely that students will greatly improve their within-cohort rank.

### A. Growth and Variation in Absolute Performance

Although it is becoming increasingly common to take calculus in the junior year, and the AMC contests only cover precalculus topics, top students are increasing their command of the AMC material and problem-solving techniques over the course of high school.^{20} To give some sense of performance improvement, Table 2 lists the average overall rank that a student needed to have in order to be in the grade-specific top 50, top 100, top 500, etc. For example, to rank among the top 100 ninth-graders, one only needs to score in the top 1,173 overall, whereas a 12th-grader needs to score in the top 241 overall to be in the top 100 in their cohort.

One immediate observation is that some students have already reached very high achievement levels by ninth grade. For example, the 500th best ninth-grader is already well within the top 5,000 12th-graders and hence is already at the level where we would expect a nearly perfect SAT score. The 50th best ninth-grader is well within the top 500 12th-graders.

While some ninth-graders are already very good, the table also makes very clear that students must improve substantially from year to year to maintain their within-cohort position. The right panel reports the percentage reduction in the overall rank that students in various positions must make to maintain their within-grade rank. High-scoring ninth-graders will need to improve their overall rank by roughly 40–60 percent in order to achieve the same position relative to their peers as a tenth-grader. High-scoring tenth-graders will need to improve their overall rank by about 50 percent. The required improvement between 11th and 12th grades is somewhat smaller, but still notable given that most high-scoring 12th-graders will be studying calculus or something more advanced.

The similarity of the percentage change numbers within each column is striking, given that the stringency of the definition of high achievement varies by two orders of magnitude from the top to the bottom. This suggests that the log of a student’s rank is a natural cardinal measure of performance to use when analyzing high-achieving students. We see this as another feature of the AMC environment that makes it attractive to study.^{21}

One simple way to get a feel for what year-to-year improvement is typical at the individual level is to examine the distribution of log(*Rank*_{i,t+1}) − log(*Rank*_{it}) among students who take the test in both years *t* and *t* + 1. This variable has a mean of −0.28 for ninth-graders, −0.39 for tenth-graders, and −0.26 for 11th-graders. These are substantial increases in performance.

The degree to which students improve from year to year likely differs for students in different parts of the distribution. For example, the effort that students are putting into improving their knowledge and problem-solving skills will differ. AMC performance in any given year is a noisy measure of a student’s underlying achievement level, which we think of as the average score they would get if given similar tests multiple times. This standard measurement error problem implies that one cannot estimate average achievement gains as a function of initial achievement via an OLS regression. Assuming this is classical measurement error, however, we can use instrumental variable (IV) regressions to estimate this relationship when some instrument for year *t* achievement is available. Table 3 presents estimates of improvement as a function of initial performance obtained from IV regressions of log(*Rank*_{i,t+1}) − log(*Rank*_{it}) on −(log(*GradeRank*_{it}) − log(5000)), using the log of a student’s within-grade rank in year *t* − 1 as an instrument.^{22} The negative coefficient estimates on the term reflecting initial achievement levels indicate that students at higher achievement levels in the initial year are expected to make even larger improvements in log rank.
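The logic of instrumenting with a lagged score can be illustrated with a small simulation. This is a sketch on made-up data, not a replication of Table 3: the variable names, the noise levels, and the generic "performance" scale are all assumptions, and the paper's variable definitions and sign conventions are not reproduced. With classical measurement error, the noise in the year *t* score enters both the regressor and (with opposite sign) the year-to-year change, which biases OLS; a lagged score with independent noise avoids this.

```python
import random


def cov(x, y):
    """Sample covariance of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def simulate(n=50_000, true_slope=-0.1, seed=0):
    """Simulate noisy scores in years t-1, t, t+1 (all parameters made up)."""
    rng = random.Random(seed)
    obs_prev, obs_t, change = [], [], []
    for _ in range(n):
        a_prev = rng.gauss(0, 1)                      # true level in t-1
        a_t = a_prev + rng.gauss(0, 0.5)              # true level in t
        a_next = a_t - 0.3 + true_slope * a_t + rng.gauss(0, 0.5)
        s_prev = a_prev + rng.gauss(0, 0.7)           # observed = true + noise
        s_t = a_t + rng.gauss(0, 0.7)
        s_next = a_next + rng.gauss(0, 0.7)
        obs_prev.append(s_prev)
        obs_t.append(s_t)
        change.append(s_next - s_t)                   # observed year-to-year change
    return obs_prev, obs_t, change


obs_prev, obs_t, change = simulate()

# OLS slope of the change on the noisy year-t score (biased).
slope_ols = cov(obs_t, change) / cov(obs_t, obs_t)
# IV slope with one instrument: cov(z, y) / cov(z, x), z = lagged observed score.
slope_iv = cov(obs_prev, change) / cov(obs_prev, obs_t)
```

In this simulation the IV slope lands close to the true slope of −0.1, while the OLS slope is substantially more negative because the same measurement error appears on both sides of the regression.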

The constant term in these regressions can be thought of as the average improvement for a student who has the 5,000th best score in their cohort in year *t*. The estimates suggest that these improvements in log rank are −0.50 for tenth-graders and −0.33 for 11th-graders. If we convert the mean improvements in *Rank* needed to maintain a given within-grade rank in Table 2 to changes in log(*Rank*), they would be approximately −0.74 in tenth grade and −0.32 in 11th grade. Hence, a tenth-grader who ranks 5,000th in their grade must improve by substantially more than the expected amount in order to maintain their rank. Intuitively, this reflects that there are many more students ranked below the 5,000th student than above. If the 5,000th-ranked student makes the average improvement, then there will be more students jumping ahead of them due to above-average gains than falling behind due to below-average gains.
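To put these log changes on the percentage scale used in Table 2, note that a change of Δ in log(*Rank*) multiplies rank by e^Δ. A rough conversion for the tenth-grade figures, using only the rounded values quoted above:

$$e^{-0.74} \approx 0.48 \quad (\text{rank falls by about 52 percent, required}), \qquad e^{-0.50} \approx 0.61 \quad (\text{rank falls by about 39 percent, expected}).$$

On this reading, a tenth-grader ranked 5,000th who makes only the expected improvement falls well short of the roughly 50 percent reduction in overall rank needed to hold position.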

Standard deviations of the full sample increases in log rank are 0.73, 0.86, and 0.96 for 9th to 10th, 10th to 11th, and 11th to 12th grades, respectively. Note that these will reflect both the measurement error of the test as a measure of students’ underlying achievement levels in both years and also true variation in the growth in achievement from year to year. Online Appendix B presents a calculation examining changes over multiple years to estimate the relative importance of the two components. It suggests that the measurement error component is larger than the variation in achievement growth component, but that there is still substantial heterogeneity in students’ true year-to-year achievement growth.^{23}

### B. Persistence and Mobility in Relative-to-Cohort Performance

We now focus on how students move up and down *within their cohort* from year to year. Figure 2 presents a graphical view of the estimated rank-to-rank transition matrix. For example, the height of the darkest shaded portion at the bottom of the leftmost bar indicates that there is a 36 percent chance that a student who is among the top 50 in their cohort in year *t* will again rank in the top 50 in year *t* + 1, and the portion of the same bar just above this indicates that there is an additional 16 percent chance that such a student will rank from 51 to 100 in year *t* + 1.^{24}
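A transition matrix of this kind can be tabulated with a few lines of code. The sketch below assumes hypothetical input pairs of (within-grade rank in year *t*, within-grade rank in year *t* + 1), with `None` marking students who could not be matched. The bin edges follow the rank groups named in the text, with the 501–1,000 and 5,001+ bins added as our assumption.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Upper edges of the rank groups (assumed binning).
EDGES = [50, 100, 200, 500, 1000, 5000]
LABELS = ["1-50", "51-100", "101-200", "201-500", "501-1000", "1001-5000", "5001+"]


def rank_group(rank):
    """Map a within-grade rank (1 = best) to its group label."""
    if rank is None:
        return "unmatched"
    return LABELS[bisect_left(EDGES, rank)]


def transition_shares(pairs):
    """Share of each year-t rank group landing in each year t+1 group."""
    counts = defaultdict(Counter)
    for rank_t, rank_next in pairs:
        counts[rank_group(rank_t)][rank_group(rank_next)] += 1
    shares = {}
    for origin, destinations in counts.items():
        total = sum(destinations.values())
        shares[origin] = {dest: n / total for dest, n in destinations.items()}
    return shares


# Made-up example: four students who ranked in the top 50 in year t.
example_pairs = [(10, 20), (30, 75), (40, None), (45, 48)]
shares = transition_shares(example_pairs)
```

Each entry of `shares` corresponds to the height of one shaded segment in a bar of the figure, with the `"unmatched"` share playing the role of the white box at the top.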

One clear observation from the figure is that performance in year *t* is a strikingly strong predictor of performance in year *t* + 1, even when making comparisons that rely on incredibly fine distinctions in year *t* performance. Comparing students who were ranked in the top 50 in their grade in year *t* to those ranked 51–100, for example, the higher-ranked students are more than twice as likely to achieve a top 50 score in year *t* + 1 (36 percent vs. 16 percent) and less than half as likely to score outside the top 500 (10 percent vs. 25 percent). Similar patterns are visible over and over in the other bars. Students who were ranked 51–100 are more than twice as likely to achieve a top-100 score in year *t* + 1 as are students who were ranked 101–200 at *t*. Students ranked 101–200 at *t* are more than twice as likely to achieve a top-200 score at *t* + 1 as are students who ranked 201–500, and so on.

A second observation is that it is possible to move up in the distribution, but substantial improvements are quite unlikely. To help visualize this, we have outlined boxes that correspond to the diagonal of the transition matrix using dashed lines. Some substantial improvements are present. For example, 14 percent of those ranked 101–200 within their grade in year *t* move into the top 100 in year *t* + 1, including some moving into the top 50. But the chance of improving by even one rank group is never above 16 percent, and the chances of all of the three-or-more group improvements are sufficiently small as to be very hard to see in the figure.

A third observation is that dropping out of participation is relevant even among high-achieving students. The heights of the white outlined boxes at the top of each bar correspond to the percentage of students we were not able to find in the year *t* + 1 data. Among students who are ranked 1,001–5,000 in their grade in year *t*, we are unable to match 35 percent to a year *t* + 1 score. The fact that the unmatched rate is 35 percent for students with ranks from 1,001–5,000 and just 14 percent for students with ranks from 1–50 suggests that at least 20 percent of the students in the 1,001–5,000 rank group truly do not participate in year *t* + 1. Dropping out appears to be less and less likely as one moves up in the ranks. The majority of the unmatched students in the top group are probably unmatched because of the limitations of our data set rather than due to the students actually dropping out.^{25}
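The bound can be sketched as follows, under the simplifying assumption (ours, for illustration) that the probability *m* of failing to match a student who did participate is the same in both rank groups. Because the 14 percent unmatched rate in the top group includes any true dropouts, *m* ≤ 0.14. Writing *d* for the true non-participation rate in the following year among the 1,001–5,000 group, the unmatched rate satisfies 0.35 = *d* + (1 − *d*)*m*, so

$$d = \frac{0.35 - m}{1 - m} \;\ge\; \frac{0.35 - 0.14}{1 - 0.14} \approx 0.24.$$

The simpler difference 0.35 − 0.14 = 0.21 leads to the same conclusion that true non-participation in this group is at least 20 percent.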

One final comment on the figure is that we feel it bolsters the case that the AMC is an interesting measurement tool. While we always encourage readers to look up old test questions online, with the belief that many will feel that the test seems nicely designed to test problem-solving skills and students’ command of core precalculus topics, such impressions cannot tell us how noisy a test is as a measure of some student capability, nor how much we should care about the capability being measured. The level of persistence in Figure 2 makes very clear that the AMC test is a sufficiently accurate and consistent measure of some capability related to high achievement such that it is a good predictor of year-ahead performance. And our earlier results on students’ gains from year to year indicate that the capability being measured is something that builds over the high school years rather than something more stable like differential quickness or accuracy in performing calculations.

We also present here a longer horizon backward-looking transition matrix. The bars in Figure 3 show the fraction of students who achieved the rank corresponding to that bar in 12th grade who were in each rank category in ninth grade. At the very highest levels of achievement, the performance persistence we noted earlier remains striking. For example, we can see in the first bar that there are more holdovers from the ninth-grade top 50 in the 12th-grade top 50 (about 25 percent) than there are students who have moved up from the entire 201–40,000 range (about 21 percent). Only 5 percent scored outside the top 1,000 as ninth-graders. Although there is a substantial fraction, 35 percent, whom we were unable to match to a ninth-grade score, given how few students manage to move up from the 1,000+ range into the top 50, we imagine that many of these are students we failed to match rather than true entrants. Some causes of matching failures, including students who switch high schools or skip grades, will likely be more frequent here as we are matching across a three-year span.

At the still extremely high level of students who rank 201–500 among 12th-graders, there is more heterogeneity in ninth-grade origins. Students moving down from the top 200, holdovers from the ninth-grade 201–500 group, and students moving up from the 501–1,000 group each comprise about 10 percent of this group. We also see a much larger number of students who had not done as well in ninth grade, with 25 percent coming from outside the top-1,000 ranges.

At the lower (but still high) levels of 12th-grade achievement in the figure, improvement since ninth grade plays an even more prominent role. Only about 5–9 percent of the students in the 12th-grade 501–1,000 and 1,001–5,000 rank groups are students who have dropped down from a higher ninth-grade rank group. Meanwhile, 12 percent and 19 percent, respectively, are students who have moved into these groups after having scores that placed them outside the top 5,000 ninth-graders. These students have improved by enough to overcome both their initial disadvantage and the substantially higher score needed to make the within-grade top 5,000 as a 12th-grader. The fraction of students that we cannot match to a ninth-grade score is also much larger in these groups, at 53 percent and 62 percent, respectively. The fact that the failure-to-match rate is so much larger here than it was for the top-50 students suggests that a substantial number of the unmatched 12th-graders in these groups are true entrants who had not participated in ninth grade.

Early in this section, we noted that the gender gap among high-achieving math students is already large in ninth grade. Given that performance is highly persistent, it is not surprising that the girls are not able to overcome their initial disadvantage. But performance persistence makes it all the more striking that the gender gap among high-achieving math students widens substantially over the high school years. Some of the more detailed findings in this section highlight channels that could be relevant: large performance improvements are needed to maintain one’s within-cohort rank, some students are dropping out of participating (at least at all but the highest ranks), and the three-year time span between ninth and 12th grades is long enough to allow quite a number of students who were not high-performers in ninth grade to improve or enter and achieve a high rank by the end of high school. Gender-related differences in any of these dimensions could contribute to the widening gender gap.

## IV. Gender Differences in Dynamics and a Decomposition

In this section, we look at gender-related differences in the dynamics of year-to-year performance and present a decomposition that lets us quantify the relative importance of several factors to the broadening of the gender gap in achievement over the high school years.

### A. Differences in Dynamics

We first look for gender-related differences in year-to-year improvement within the population of students who participate in the AMC tests in consecutive years. Table 4 presents estimates from an OLS regression:

where the *δ*_{g} and *γ*_{t} are grade and year dummies. Note that the dependent variable is the increase in a student’s rank, so that a positive coefficient on any variable implies that an increase in that variable is associated with decreased year-to-year improvement in AMC performance. The first column of Panel A reports estimates from this regression run on the set of students who ranked in the top 5,000 within their grade in the initial year. The negative coefficient on the initial rank indicates substantial mean-reversion in within-grade rank, as one would expect given that test scores are a noisy measure of underlying ability.

The primary coefficient of interest in the regression is the coefficient on the female dummy. It is positive and highly significant, indicating that girls are improving by less from year to year than boys by about 31 log points. The second main estimate of interest is whether there are gender-related differences in the variance of year-to-year improvement. Panel B of the table reports gender-specific means of the squared residuals from the above regression. Again, we find a statistically significant gender difference: there is greater year-to-year variance in the boys’ performances. Hence, we have identified two separate features of the dynamics that would tend to contribute to a widening of the gender gap among the highest achievers: (i) the girls’ mean improvement from year to year is lower, and (ii) the variance in their year-to-year improvement is also lower.

In the above regression, there is also a moderately sized but statistically significant coefficient on the interaction between the female dummy and within-grade rank, indicating that the gender gap in mean improvement is larger for higher achievers. To examine whether this may reflect a substantial difference among the highest achievers, the second column of Table 4 estimates the same regression on the sample of even higher achievers who were ranked in the top 500 in their cohort in the initial year. We find that things are not appreciably different at this level. The gender gap in mean improvement is estimated to be 32 log points per year, and the residual variance is again lower for the girls. In unreported results, we also estimated the above regressions separately on 9th-, 10th-, and 11th-graders and did not find substantial differences in either finding across grades.
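For reference, a specification consistent with the verbal description of this regression can be sketched as follows. The coefficient labels $\beta_0,\dots,\beta_3$, the error term $\varepsilon_{it}$, and the use of log rank on both sides are illustrative assumptions inferred from the text (the dependent variable is the increase in rank, effects are quoted in log points, and grade and year dummies, written here as $\delta_g$ and $\gamma_t$, are included); the exact set of controls may differ.

```latex
\log(\mathit{Rank}_{i,t+1}) - \log(\mathit{Rank}_{it})
  = \beta_0 + \beta_1\,\mathit{Female}_i + \beta_2\,\log(\mathit{Rank}_{it})
  + \beta_3\,\mathit{Female}_i \times \log(\mathit{Rank}_{it})
  + \delta_g + \gamma_t + \varepsilon_{it}
```

Under this reading, $\beta_1 > 0$ means that girls' ranks rise (worsen) relative to those of boys with the same initial rank, $\beta_2 < 0$ captures mean reversion, and Panel B compares the gender-specific means of $\hat{\varepsilon}_{it}^{2}$.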

Differential rates of dropping out of test taking could also contribute to changes in the gender gap among high-scorers. To explore this we define an indicator *Dropout*_{it+1} for whether each year *t* high-scorer could not be found in the year *t* + 1 data, and estimate the OLS regression:^{26}

The first column of Table 5 reports estimates run on students who were in the top 5,000 in their grade in year *t*. The primary coefficient of interest is the female dummy. The estimate of 0.023 indicates that girls are 2.3 percentage points more likely to drop out of participating than boys with comparable scores. The estimate is highly statistically significant, so we have identified a third factor contributing to the widening of the gender gap over the course of high school.

The second through fourth columns of the table present similar regressions estimated separately on the students in 9th, 10th, and 11th grades. The gender gap in dropout rates is larger in the 11th to 12th grade transition than in the other years. Girls are 4.5 percentage points less likely to participate in 12th grade than boys who had comparable 11th-grade scores. Early in high school, the gender gap in dropout rates is much smaller.

All regressions include controls for the student’s within-grade rank in the initial year. The positive coefficients on these controls reflect that higher-scoring students are substantially less likely to drop out. The coefficients are quite similar across all three grades, indicating that this relationship is fairly stable over the course of high school.

The final column of Table 5 looks at more extreme high-scorers who were among the top 500 students in their grade in year *t*. The point estimate of the gender-related difference in dropout rates is much smaller in this sample, just 0.2 percentage points, but the standard error is such that we can neither reject that the gender gap is zero, nor that it is the same as in the top 5,000 sample.

We noted earlier that some high-scorers at the end of high school are students who came later to math competitions. To examine whether there are also gender-related differences in this aspect of the dynamics, Figure 4 graphs the fraction female among all Grade 9–11 students who were in each rank group in some year from 1999–2006, and the fraction female among Grade 10–12 students in the rank group in 2000–2007 who are entrants. In all but the top rank group, we find that the fraction of female students among the entrants is slightly lower than the fraction among the students who were in that group in the previous year. On average, the difference is about one percentage point.^{27} This gender difference in AMC entry is a fourth contributor to the broadening of the gender gap over the high school years.

To recap, we have identified four gender-related differences in the dynamics of student achievement that will contribute to the widening of the gender gap in high achievement on the AMC over the high school years. High-achieving girls are on average not improving by as much from year to year, there is less variance in their year-to-year improvement, they are more likely to drop out of participating (especially after 11th grade), and we see fewer girls among the high-scoring entrants whom we cannot find in the previous year’s data.

### B. A Decomposition of Changes in the Gender Gap

In this section, we define a decomposition of the change in the gender gap into portions attributable to various differences that provides a measure of their relative importance.

Our analysis focuses on changes in the fraction of students in achievement group *X* at time *t* who are female. (We will often use being in the top 50, 500, or 5,000 as the group *X*.) Here, we relate this to various aspects of differences in the boys’ and girls’ transition matrices.

**Proposition 1.** The change in the fraction female in group *X* can be written as:

See Online Appendix C for algebraic expressions of each term and proof.

The first term in the decomposition, the dropout effect, can be thought of as the change in female representation that is due to girls dropping out at a different rate (assuming that the girls who dropped out would have succeeded at the same rate as the girls who continued to participate). The second term, the continuation effect, reflects the difference in rates at which girls who continue to participate improve by enough to remain in rank group *X*. The third term, the growth effect, reflects the difference in rates at which lower-ranked girls versus boys subsequently climb into group *X*. The fourth, the entry effect, reflects any discrepancies between female representation among the high-scorers who did not participate in the previous year and what would be expected given the total number of entrants and female representation among the previous year’s high-scorers.

The final term in the decomposition, the mechanical effect, captures mechanical changes that would occur even if there were no gender-related differences in the transition process, due to asymmetries in the initial conditions. There are mechanical effects pushing in both directions. A negative effect is that the girls in each rank group *X* are disproportionately found in the lower part of the rank group, so girls in *X* would be less likely to avoid dropping into a lower group in the following year. Working in the opposite direction, there are also more girls in the rank group just below *X* than in group *X*. With gender-independent dynamics, this would result in the set of students who move up into rank group *X* in the next year being more heavily female. The sign of the net mechanical effect will depend on which of these countervailing effects is larger.
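Schematically, the decomposition has the additive form below. The labels on the right-hand side are placeholders chosen for readability, not the notation of Online Appendix C, and $f_{X,t}$ is shorthand introduced here for the fraction female in group $X$ at time $t$.

```latex
f_{X,t+1} - f_{X,t}
  = \Delta^{\text{dropout}} + \Delta^{\text{continue}} + \Delta^{\text{grow}}
  + \Delta^{\text{entry}} + \Delta^{\text{mech}}
```

As the descriptions of the terms indicate, the first four would vanish if boys and girls with the same year-$t$ rank faced identical transition probabilities and female representation among entrants matched what would be expected; only the last term would remain under gender-independent dynamics.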

As discussed in further detail in Online Appendix C, we implement this decomposition by estimating the transition probabilities both for the full population and for girls as smooth functions of the initial year rank via local linear regressions, with log(*Rank*) as the right-hand-side variable. We do this separately for students in 9th, 10th, and 11th grades, pooling the data for all six cohorts within each regression.
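To make the idea of a local linear regression on log(*Rank*) concrete, the following is a minimal sketch in plain Python. The triangular kernel, the bandwidth of 0.5 log units, the function name, and the toy transition probabilities are all assumptions made for illustration; the actual estimation choices are those described in Online Appendix C.

```python
import math


def local_linear(x0, xs, ys, bandwidth):
    """Kernel-weighted linear fit of ys on xs, evaluated at x0.

    Uses a triangular kernel (an illustrative choice): observations farther
    than `bandwidth` from x0 get zero weight. Returns the fitted intercept,
    which is the local linear estimate of E[y | x = x0].
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        z = x - x0
        w = max(0.0, 1.0 - abs(z) / bandwidth)
        s0 += w
        s1 += w * z
        s2 += w * z * z
        t0 += w * y
        t1 += w * z * y
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("too few distinct observations inside the bandwidth")
    # Solve the 2x2 weighted normal equations for the intercept.
    return (s2 * t0 - s1 * t1) / det


# Toy data (not from the paper): a transition probability that happens to be
# exactly linear in log rank, so the fit should reproduce it.
ranks = list(range(10, 5001, 10))
log_ranks = [math.log(r) for r in ranks]
probs = [0.9 - 0.08 * lr for lr in log_ranks]

fitted = local_linear(math.log(500), log_ranks, probs, bandwidth=0.5)
print(fitted)
```

Evaluating the fit on a grid of initial ranks, once for the full population and once for girls, would give the two smooth transition-probability curves that the decomposition compares.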

One version of our basic fact about the widening gender gap was that the percentage of female students in the top 5,000 drops from 30.5 in ninth grade to 21.8 in 12th grade. This is a drop of 8.7 percentage points over three years, which is about three percentage points per year. The first row of Table 6 presents a decomposition of this change.^{28} It indicates that by far the largest source of the drop (responsible for 3.6 percentage points, which is more than 100 percent of the per-year drop) is the growth effect, the term in our decomposition that reflects differences in the rates at which male and female students at ranks below 5,000 improve their performance and “grow” into the top 5,000. Note that this term is designed to control for how far below the top 5,000 cutoff male and female students were in the previous year; it is due only to differences in the probabilities that male and female students at each given rank outside the top 5,000 move up into the top 5,000. This in turn will reflect the differences we identified earlier in both average improvements from year to year and the variance of students’ improvements.^{29}
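The arithmetic behind these statements is easy to reproduce. The sketch below uses only the figures quoted in the text and assumes that the 3.6 percentage point term is measured on the same per-year basis as the average annual change; the variable names are illustrative.

```python
# Percent female in the top 5,000, as quoted in the text.
share_9th = 30.5
share_12th = 21.8

total_drop = share_9th - share_12th   # change over three grade transitions
per_year_drop = total_drop / 3        # average change per transition

growth_term = 3.6                     # growth-effect contribution (pp)
growth_share = growth_term / per_year_drop

print(round(total_drop, 1), round(per_year_drop, 1), round(growth_share, 2))
```

A share above one is possible because other terms in the decomposition, notably the mechanical effect, push in the opposite direction.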

Two other features of the dynamics are a little less than one-third as important as the growth effect: the continuation effect, which reflects the reduced rate at which highly ranked female students who take the test maintain their top-5,000 position, and the entry effect, which reflects the lower fraction of female students among “entrant” high-scorers. The difference in dropout rates is a smaller contributor on average.

The final column indicates that the total drop would be much larger were it not for a positive mechanical effect. To appreciate why this effect can be large in practice, recall that the fraction female is much higher in the population of test-takers outside the top 5,000. For example, for tenth-graders it is 0.26 for students in the top 5,000 and 0.40 for students who are ranked between 5,001 and 20,000. Although each individual 5,001–20,000 student is not very likely to move into the top 5,000 in 11th grade, together they will account for about 23 percent of the year *t* + 1 Grade 11 top 5,000. If the dynamics were gender-independent, then the fraction of girls in this moving-up group would be close to 40 percent, and this would substantially raise the average percent female in the top 5,000.

The next three rows of the table report the separate 9th to 10th, 10th to 11th, and 11th to 12th grade decompositions that went into the average discussed above. Recall that the gender gap widened most from ninth grade to tenth grade. The entry effect is relatively more important at this stage. The changes from 10th to 11th grade are very similar to the overall average. In the 11th to 12th grade transition, the growth effect is even more important, dropout plays a role, and the entry effect is unimportant.

The final two rows of the table focus on more extreme high-achievers. Recall that the fraction female in the top 500 declined from 18 percent in ninth grade to just 12 percent in 12th grade. This 35 percent decrease was larger than the 29 percent decrease at the top-5,000 level, although it is smaller in percentage point terms (about two percentage points per year). The importance of the growth process to the evolution in the gender gap comes through even more strongly here—differences in the probabilities with which boys and girls at each lower rank are able to move into the top 500 are much more important than the other differences we’ve identified. The entry and dropout effects are both just minor factors, consistent with the view that few true entrants will make it all the way to the top 500, and few students will drop out after earning such high scores.

The bottom row looks at even more extreme high-achievers, who scored in the top 50 in their grade. Here, the dropout and entry effects continue to fade to insignificance relative to the large growth effect. What remains are the large growth effect and a continuation effect, again offset in large part by the mechanical effect.^{30}

The small numbers that come up when doing top-50 calculations may make it easier to understand why the mechanical effect is so large. On average, 18.1 of the year *t* + 1 top-50 students will be repeats from the year *t* top 50. They will be joined by 19.5 students moving up from ranks 51–500. If the students moving up were randomly drawn from their rank groups, then about 16 percent of them would be female. Hence, their presence would increase the overall percent female in the top 50 by about (19.5/50) × (16 − 9) ≈ 3 percentage points. The magnitude of these mechanical effects makes the broadening gender gap even more striking: the widening of the gender gap occurs despite the fact that every year there are many more girls well positioned to move into the top 50 (or 500 or 5,000) than currently in the top 50 (or 500 or 5,000).
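The back-of-the-envelope calculation above can be written as a one-line function. In the sketch below the function name is hypothetical, and the second call, which applies the same formula to the tenth-grade top-5,000 figures quoted earlier (23 percent of the group moving up, 40 versus 26 percent female), is only an illustration and not a number reported in Table 6.

```python
def mechanical_shift(mover_share, pct_female_movers, pct_female_group):
    """Approximate change (in percentage points) in a group's percent female
    when a fraction `mover_share` of next year's group consists of movers
    whose percent female differs from the group's current percent female."""
    return mover_share * (pct_female_movers - pct_female_group)


# Top 50: 19.5 of 50 students move up from ranks 51-500; 16 vs. 9 percent female.
top50_shift = mechanical_shift(19.5 / 50, 16, 9)

# Illustrative extension to the top 5,000 using figures quoted in the text.
top5000_shift = mechanical_shift(0.23, 40, 26)

print(round(top50_shift, 2), round(top5000_shift, 2))
```

Both calls give a shift of roughly three percentage points, which is the same order of magnitude as the average annual decline being decomposed.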

## V. Potential Mechanism: Reactions to Disappointment

So far, we have tried to improve understanding of the widening of the gender gap in high math achievement over the high school years by providing descriptive evidence on the dynamics of performance that any potential explanation would have to account for. In this section, we exploit the multistage nature of the AMC series to provide evidence with a more causal flavor on the feedback mechanism. Specifically, we investigate gender differences in how students react to disappointment.

The AMC 10/12 contests are the first stage of a series. A number of awards are given out along the way, and students take pride in how far they advance. For most of the high-achieving students in our sample, the most salient potential accomplishment is qualifying for the American Invitational Mathematics Exam (AIME).^{31} Qualifying keeps their math competition season alive for another month, and they will list it on their college applications. Many who fall just short of the cutoff for AIME qualification will be disappointed. Our discussions of “reactions to disappointment” should be understood as shorthand for how students react to this disappointment relative to how they react to the positive feedback that comes with qualifying.

The “rational” response to falling just short might be to redouble one’s efforts. One consideration pushing in this direction is that not having previously been an AIME qualifier should raise the incremental benefit that qualifying provides to one’s resume; and students have learned (given how much students typically improve from year to year) that they have a good chance of qualifying in the subsequent year. There are, however, other forces that might push rational students in the opposite direction. For example, a negative signal about the returns to investing in math could incentivize a reallocation of effort toward other subjects.^{32} It also seems plausible that students might react to the disappointment by investing less in math for behavioral reasons. In light of the literature on gender differences in coursework, self-confidence and interest in competition (for example, Wang, Eccles, and Kenny 2013; Niederle and Vesterlund 2007, and Croson and Gneezy 2009), one could easily imagine that there are gender differences in the rational and behavioral responses.

The rules for advancement from the AMC 10/12 to the AIME are a bit complicated. Students qualify if they score at least 120 on the AMC 10 or 100 on the AMC 12. They also qualify if they are among the top 1 percent of U.S. test-takers on the particular (A or B) AMC 10 that they took, or among the top 5 percent on the particular AMC 12.^{33} The rules are an ex ante attempt to treat the tests roughly equally, but in practice the ex post level of correctly measured performance at which the cutoff falls varies from test to test. From our perspective, this is fortuitous in that it makes the AIME qualification “treatment” less collinear with performance.^{34} As an initial look at the data, Figure 5 provides a regression discontinuity (RD)–style plot of the probability with which students with scores in each one-point score band cannot be found in the next year’s data. Students in the zero band and all students to the right qualified for the AIME. We report the means separately for boys and girls and add separately estimated regression lines on each side of the cutoff. The figure strongly suggests that there is a discontinuous jump in the probability of dropping out of future participation when students fall just short of the AIME cutoff.
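The qualification rule stated at the start of the preceding paragraph can be sketched as a small function. The function and argument names are hypothetical, and the treatment of ties at the percentile boundary (a student counts as "among the top 1 percent" when fewer than 1 percent of test-takers scored strictly higher) is an assumption, since the text does not specify it.

```python
# Score thresholds and percentile thresholds as stated in the text.
SCORE_CUTOFF = {"AMC10": 120.0, "AMC12": 100.0}
TOP_FRACTION = {"AMC10": 0.01, "AMC12": 0.05}


def qualifies_for_aime(contest, score, fraction_scoring_higher):
    """Return True if a student qualifies for the AIME.

    contest: "AMC10" or "AMC12".
    score: the student's score on that contest.
    fraction_scoring_higher: share of U.S. test-takers on the same (A or B)
        test who scored strictly higher (tie handling is an assumption).
    """
    if contest not in SCORE_CUTOFF:
        raise ValueError("contest must be 'AMC10' or 'AMC12'")
    meets_score_cutoff = score >= SCORE_CUTOFF[contest]
    in_top_fraction = fraction_scoring_higher < TOP_FRACTION[contest]
    return meets_score_cutoff or in_top_fraction


print(qualifies_for_aime("AMC10", 114.0, 0.005))  # qualifies via the percentile rule
print(qualifies_for_aime("AMC10", 114.0, 0.03))   # falls short on both criteria
```

Because the percentile rule binds on harder tests, the effective cutoff differs across tests and years, which is the variation the analysis exploits.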

The noisiness of the female data on the right side of the figure reflects that there are a limited number of girls with scores more than ten points above the AIME cutoff.^{35} But in other cases (for example, the data points for boys exactly at and six points below the AIME cutoff), substantial departures from the regression lines occur, despite sample sizes that are quite large. We believe that this reflects the role of unobserved student characteristics that do not covary smoothly with students’ scores relative to the cutoff. To illustrate why this is plausible, note that the most common AMC 10 cutoff is 120. In 2002–2006, the unique way to score 120 was to answer all 25 questions and get 20 correct and five wrong.^{36} The unique way to get the score just below 120, 119.5, was to attempt just 18 of the 25 questions and get 17 correct and one wrong, with seven left blank. The students scoring 119.5 and 120 may therefore be different in unobserved ways. For example, the 120 students may be quicker, less accurate, and more risk-loving. Discontinuous changes in unobservables of this variety would make the standard RD estimator of the causal effect of AIME qualification inappropriate.
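The uniqueness claims in this paragraph can be verified by brute force. The sketch below assumes the 2002–2006 scoring rule of 6 points per correct answer, 2.5 points per question left blank, and 0 points per wrong answer; this rule is not spelled out in the text but reproduces every score it quotes. The function name is hypothetical.

```python
def ways_to_score(target, n_questions=25):
    """List every (correct, blank, wrong) split of the questions that yields
    exactly `target` points, assuming 6 per correct answer, 2.5 per blank,
    and 0 per wrong answer. Arithmetic is done in half-points to stay exact."""
    target_half_points = round(target * 2)
    ways = []
    for correct in range(n_questions + 1):
        for blank in range(n_questions - correct + 1):
            if 12 * correct + 5 * blank == target_half_points:
                wrong = n_questions - correct - blank
                ways.append((correct, blank, wrong))
    return ways


at_cutoff = ways_to_score(120)       # score exactly at the common AMC 10 cutoff
just_below = ways_to_score(119.5)    # highest attainable score below it
print(at_cutoff, just_below)
```

Each list contains a single split, 20 correct with no blanks for a score of 120 and 17 correct with seven blanks for a score of 119.5, so adjacent scores around the cutoff correspond to very different test-taking behavior.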

We try to estimate effects in a manner that is robust to this potential problem in two ways. First, we simply estimate a regression similar to our earlier dropout regression, with the addition of a dummy variable for failing to qualify for the AIME, and we restrict our analysis to the subsample of students who were within two correct answers of the AIME cutoff on either side. When the AIME cutoff is 120, we include students who answered 25 questions and got 18 or 19 correct (failing to qualify with a 108 or 114, respectively), as well as students who got 20 or 21 correct (qualifying with a 120 or 126, respectively). In such a sample, where the number of students answering each number of questions is roughly balanced, the qualification dummy would be mostly uncorrelated with any function of the number of questions answered, and we would hope that our quadratic in log(*GradeRank*) would capture any smooth relationship between higher achievement and dropout rates, whereas the dummy for failing to qualify would capture any discontinuous jump at the year- and test-specific cutoff. We also explicitly control for whether a student scored in each of the subsets of scores, for example, {…, 108, 114, 120, 126,…} and {113.5, 119.5, 125.5} that are possible when students attempt a given number of questions. Given the variety of scoring rules used in different years, this involves adding a total of 18 dummy variables and can be thought of as an estimator that will give us the causal effect of failing to qualify on the probability of dropping out, provided that the unobservables are smooth across the cutoff once we have controlled for the differences related to the number of questions a student attempted.

The first column of Table 7 presents estimates from this OLS regression.^{37} Our main interest in conducting these regressions is in the effect of the disappointing outcome of failing to qualify for the AIME. The main effect of this variable is substantial, 3.7 percentage points, and highly statistically significant. One way to think about the magnitude is that it is comparable to the participation gender gap for 11th-grade girls: an 11th-grade boy with a score just below the AIME cutoff will be almost as likely to drop out of participating as an 11th-grade girl who scored just above the cutoff.

The second main coefficient of interest is the differential effect that failing to qualify for the AIME has on girls. The estimate indicates that the decrease in the probability of participation is 1.9 percentage points larger for girls than for boys—that is, girls are even more likely than boys to cease participating in the AMCs when they experience a disappointing outcome (this difference is statistically significant at the 3 percent level). The effect on a girl of just missing the AIME will be the sum of the two estimated coefficients, so girls with scores just below the AIME cutoff will be 5.6 percentage points less likely to participate in the following year than girls who just barely qualify for the AIME. This is consistent with previous literature on gender differences in self-confidence and responses to competition, suggesting that those findings are relevant even to the set of highly accomplished girls we are studying, and providing us with a causal link identifying a factor that contributes to the widening gap. The gap is, however, substantially smaller than the roughly ten percentage point gap Buser and Yuan (2019) found in a similar analysis of Dutch data, and it will only account for a small portion of the observed widening of the gender gap over the high school years.^{38}
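The way the two coefficients combine can be written out explicitly. In the sketch below, $\theta_1$ denotes the main effect of failing to qualify and $\theta_2$ its interaction with the female dummy; both labels are introduced here for illustration and are not taken from Table 7.

```latex
\text{boys: } \hat{\theta}_1 = 3.7 \text{ pp}, \qquad
\text{girls: } \hat{\theta}_1 + \hat{\theta}_2 = 3.7 + 1.9 = 5.6 \text{ pp}
```

The gender difference in the reaction to just missing the cutoff is therefore the interaction coefficient alone, 1.9 percentage points.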

Second, we implement an RD estimator with local linear controls for the running variable (distance to AIME cutoff as in Figure 5), an endogenous bandwidth, and robust inference, as in Calonico, Cattaneo, and Titiunik (2014), separately on the male and female samples. In these regressions, we allow for gender-specific nonlinear effects of the running variable on each side of the cutoff and control for year and grade fixed effects, a dummy for taking the B-date test, a dummy for taking both the A-date and B-date tests, and dummies for number of questions attempted.^{39}

As reported in the second column of Table 7, the estimated effect of the failing-to-qualify “treatment” on boys is that it increases their probability of dropout by 3.4 percentage points. This is very similar to the OLS estimate and is similarly significant.^{40} The effect for girls in the third column is somewhat smaller than the OLS estimate, at 4.2 percentage points, and remains highly significant, although the standard error is larger. We can therefore conclude that our finding that both boys and girls are more likely to drop out of future competition is robust to the more flexible allowance for unobserved heterogeneity.^{41}

The 0.8 percentage point difference between the male and female dropout effects is estimated sufficiently precisely to rule out the larger difference found in Buser and Yuan (2019), and suggests that this causal channel can only account for a small portion of the observed widening of the gender gap. However, the standard errors are sufficiently large in the “optimal bandwidth” specifications that we must also say that whether the gender gap in reactions to disappointment is statistically significant is sensitive to how one controls for potential unobserved heterogeneity.

Disappointment may also affect the performance of students who continue to participate in the AMC tests by affecting the effort students put in over the course of the following year. To look for effects of this type, Table 8 reports coefficient estimates from regressions like those in Table 4 examining the change in within-grade rank between year *t* and year *t* + 1, but using RD regressions as in Table 7.

The first main coefficient of interest in this regression is again the coefficient on the dummy for missing the AIME cutoff. In the OLS regression, we get a positive, significant coefficient, which again suggests that students are not doing better after experiencing disappointment. Students with scores just below the AIME cutoff have a larger increase in their expected year *t* + 1 rank (that is, they do worse) than do students with scores just above the AIME cutoff. The magnitude is not very large in economic terms: students’ ranks are increasing by a little more than 10 percent. However, the fact that it is positive is noteworthy: we have seen that scoring just below the AIME cutoff induces some students to drop out, and the most natural guess would be that these dropouts are relatively weak students, which would result in the pool of continuing students with scores just below the AIME cutoff being positively selected.

In contrast to our earlier result on girls’ reacting worse to disappointment in terms of being more likely to drop out, girls who continue participating despite experiencing disappointment show less of a disappointment effect in their performance. This could reflect that the sample of continuing girls is more selected, but could also reflect that girls who do not drop out are less likely to reduce their effort. Regardless, it appears that differences in dropout rates are the channel through which gender differences in reactions to disappointment might contribute to a widening gender gap. The positive coefficient estimate on the female dummy indicates that (along the lines of what was reported earlier) girls just above and below the AIME cutoff are still improving by less on average than boys with comparable scores.

The remaining columns of the table provide estimates of the effect of failing to qualify for the AIME on subsequent year performance from the optimal bandwidth RD procedure. The effect of failing to qualify for the AIME on boys’ year-to-year improvement is estimated to be smaller at 0.048 and only marginally significant; the point estimate for girls’ reactions is identical at 0.048, though it is not statistically significantly different from zero.

To summarize, students appear to react to the disappointment at falling short of the AIME cutoff both by being more likely to drop out and by improving by less in the subsequent year conditional on not dropping out. The dropout effect may be larger for girls, though this result is sensitive to specification. This could be one factor contributing to the widening of the gender gap, particularly given girls’ lower performance in earlier grades and particularly if the observed disappointment effects generalize to other parts of the score distribution; however, consistent with our above decomposition results, these effects can at most account for a small portion of the observed widening.

## VI. Conclusions

We used data from the American Mathematics Competitions to document that the gender gap among high-achieving math students is already quite large by ninth grade. Girls comprise just 30 percent of the 5,000 highest-scoring ninth-graders on the AMC contests, 18 percent of the 500 highest-scoring ninth-graders, and just 8 percent of the top 50. One takeaway is that, to fully understand the gender gap in high math achievement among high school students, it will be necessary to examine pre-high school data. We hope that our paper will spur further work in this direction.

A second main finding, which is the focus of most of this paper, is that the gender gap in high math achievement widens substantially over the high school years. The largest change occurs between ninth and tenth grades, but it is a fairly steady process clearly visible in every year. The fraction female among students who are among the top 5,000 in their grade on the AMC test drops from 30 percent in ninth grade to 22 percent in 12th grade. Among students who are among the top 500 in their grade, the drop is from 18 percent in ninth grade to just 12 percent in 12th grade. These are substantial changes. They would be hard to reconcile with the simplest views of gender gaps stemming from some time-invariant biological difference, and they motivate looking more closely at the year-to-year dynamics of student performance over the high school years.

Our initial analysis of the dynamics of high math achievement brings out several new facts. Two that are particularly important are that high-achieving students must substantially improve their absolute performance from year to year to maintain their within-cohort rank, and that within-cohort ranks are nevertheless quite persistent. The persistence reinforces our earlier comment that pre-high school factors are important drivers of the gender gap in high school. The need for substantial improvement to stay in place derives from a combination of two effects. One is that the typical high-achieving math student is substantially improving their knowledge and problem-solving skills from year to year. The other is that there are many more students outside the top 500 than in the top 500. Some lower-ranked students are making far-above-average improvements, and this forces highly ranked students to make above-average improvements to maintain their place. Thus, our high-achieving students are exerting substantial effort to bolster already highly advanced math skills. There are many, many demands on elite high school students’ time that could lead to systematic differences in the opportunity costs of and interest in making such investments.

We have identified four distinct gender-related differences in the dynamics of student performance that contribute to the widening gender gap. In comparison with boys who had the same score in the previous year, high-achieving girls are more likely to drop out of participating in the AMC tests (particularly in 12th grade), and the performance gains of those who do participate again are lower on average and less variable. Girls are also underrepresented in the pool of high-scoring “entrants” whom we could not match to a score in the previous year. Our decompositions point to “growth” differences, the underrepresentation of girls in the set of students who manage to move up from lower ranks to high ranks, as the most important source of the widening gap. The other effects are more moderate in size, but in combination and cumulated over the years they also contribute substantially to the observed widening of the gender gap.

From ninth to tenth grade, the dearth of female entrants is important, and from 11th to 12th grade, dropouts become an issue. But the growth differences are consistently the largest effect both across grades and across levels of achievement. In most cases, they account for well over half of the observed widening of the gender gap. Again, this suggests a line of further inquiry: why are there so few girls who move up substantially relative to their cohort in the later high school years?

Any potential explanation for the gender gap in high math achievement will have to reckon with these facts. Several explanations suggested by the literature seem promising. Being a top performer on the AMC requires a substantial amount of both ability and effort. Girls may have lower valuations than boys for the rewards associated with top performance on the AMC contests, based on intrinsic preferences, societal conditioning, future college or career expectations, or some combination thereof (Wiswall and Zafar 2018). Girls may enjoy the AMC competition less than boys and therefore invest less effort toward it (Niederle and Vesterlund 2007). Girls may be more risk-averse than boys and thus less likely to invest all their effort in one extracurricular activity with a risky payoff (Borghans et al. 2009). Girls may have more promising alternative extracurricular activities competing for their time than boys (Wang, Eccles, and Kenny 2013). These and other factors may contribute to the gender gap in ninth grade. More importantly for our purposes, they may contribute to the widening of the gender gap as the effort required to maintain one’s rank increases, as the number of future opportunities to succeed decreases and college applications loom larger (that is, the stakes increase) (Azmat, Calsamiglia, and Iriberri 2016), and as students receive feedback on past performance.^{42}

Our final section examines one potential contributor and suggests that reactions to disappointment may be part of the answer. Both boys and girls who experience the disappointing outcome of just barely failing to qualify for the AIME are more likely to not participate in the following year. The dropout effect may be larger for girls, although the significance of this difference is not very clear. Apart from psychological effects related to disappointment, of course, one could also potentially explain such reactions in more standard “rational” ways; for example, high-achieving girls might have a greater breadth of other skills and interests that compete for their time and might therefore rationally shift more effort away from math contests toward other activities to maximize chances of college admission or long-run success. We hope to see future work on this as well.

A limitation of all of our work on the AMC contests is that the data concern performance in a competitive environment. We believe that many of the investments in problem-solving skills and mastering precalculus mathematics that the AMC contests measure will also benefit students in later-life environments. In our earlier work, we presented some data on SAT scores consistent with this view, but it would be nice to have this complemented with work on the dynamics of achievement as measured with other instruments. It would be even more complementary to be able to track AMC participants forward and examine how participation, achievement, disappointment, etc. on the AMC tests affect outcomes that are well established to be important in later life, such as choice of college major, pursuit of postgraduate education, and career choices. Agarwal and Gaule (2018) perform such an exercise on a more extreme set of high-achievers in high school math competitions, students who advance all the way to the International Mathematics Olympiad (IMO). They find that IMO scores are highly predictive of math publications and citations 20 years in the future; IMO gold medalists are *fifty* times more likely to win the Fields medal than are graduates of a top-10 math Ph.D. program who did not participate or advance quite this far in high school math contests. AMC scores and participation are surely not as strongly predictive as this of any subsequent achievement, but it would be very interesting to see how they relate to longer-run outcomes.

## Footnotes

This project would not have been possible without Professor Steve Dunbar and Marsha Conley at AMC, who provided access to the data, as well as their insight. We thank Daniel Ehrlich for excellent research assistance. Financial support was provided by the Sloan Foundation and the Toulouse Network for Information Technology. Glenn Ellison is the author of a series of math books that some students use to prepare for math competitions. All errors are those of the authors. The analysis used data obtained under a restricted-use agreement with the Mathematical Association of America (MAA), with protocols in place to prevent identification of individual students or schools, and these data cannot be publicly shared. The authors would be happy to assist researchers interested in obtaining the data for the purposes of replication (a.t.swan{at}gmail.com).

Supplementary materials are available online at https://jhr.uwpress.org.

↵1. See Xie and Shauman (2003) and Goldin, Katz, and Kuziemko (2006), among others.

↵2. See Hedges and Nowell (1995), Guiso et al. (2008), Hyde et al. (2008), and Ellison and Swanson (2010) on math test scores and Ginther and Kahn (2004) and Carrell, Page, and West (2010) on workforce issues.

↵3. See Guiso et al. (2008), Pope and Sydnor (2010), and Carrell, Page, and West (2010).

↵4. See Gneezy, Niederle, and Rustichini (2003), among others, on gender differences in performance in competitive environments.

↵5. For example, laboratory and field evidence suggests that men and boys are more likely to select into experimental and real-world competition than women and girls of equal ability (Buser, Niederle, and Oosterbeek 2014; Niederle and Vesterlund 2007); gender differentials in standardized test performance of high school students depend on the competitive stakes of the tests (Azmat, Calsamiglia, and Iriberri 2016); and, in a TV game show testing general knowledge, women earn 40 percent less than men and exit the game prematurely at a faster rate (Hogarth, Karelaia, and Trujillo 2012). The large literature showing that girls earn higher grades in all subjects, with particularly large differences in language courses, suggests that girls may be spending more time on nonmath coursework. See Voyer and Voyer (2014) for a metastudy. Chachra et al. (2009) provide evidence on the extracurricular activities of engineering students.

↵6. See Hedges and Nowell (1995), Guiso et al. (2008), and Ellison and Swanson (2010).

↵7. See, for example, Fryer and Levitt (2010) regarding U.S. students, Bharadwaj et al. (2016) regarding Chilean students, and Contini, Di Tommaso, and Mendolia (2017) regarding Italian students.

↵8. Sutter and Glätzle-Rützler (2015) report that such differences are robust across a broad age range and visible as early as age three, so they may help account for our ninth-grade results.

↵9. Our finding on the magnitude of the ninth-grade gender gap can also be seen as suggestive that differences in interest in competition are producing part of the real-world effect of girls being underrepresented among the highest scorers on the contests.

↵10. Our sample has approximately 100 times as many student–years.

↵11. Focusing on STEM fields specifically, Ceci et al. (2014) present evidence on lower female propensities to major in math-intensive subjects in college and higher female propensities to major in non-math-intensive sciences. They then examine career development in STEM fields and find greater evidence of pipeline leakage in fields such as psychology, life science, and social science, rather than in math-intensive fields in which women are more underrepresented.

↵12. Ellison and Swanson (2010) note that AMC scores are a stronger predictor of how students will do when retaking the math SAT than the previous math SAT score, and the tests remain a calibrated predictor of future test scores at upper tail percentiles that are an order of magnitude higher than can be measured with the SAT.

↵13. While the 3,000 AMC-offering schools are a small fraction of the total number of high schools in the United States, Ellison and Swanson (2016) note that at least 80 percent of the highest-performing students on several other math contests and mathematical research contests took the AMCs. At less rarefied achievement levels, a back-of-the-envelope calculation suggests that about 20 percent of the students at participating schools with 800s on the SAT math take the AMC contests.

↵14. The structure of the AMC contests changed twice in the period we study. In 1999, all students took a common test similar to the AMC 12. In 2000, the AMC introduced the AMC 10 and offered younger students the option of taking either test. The AMC 10 and 12 are similar—14 of the 25 questions were common to both tests in the first year—but to be less intimidating to younger students and less affected by knowledge of above grade-level material, the AMC 10 avoids logarithms and trigonometry and rarely has questions as difficult as the five most difficult on the AMC 12.

↵15. Online Appendix A provides more details on the methodology and the resulting normalizations. An AMC 10 score of *x* turns out to be roughly equivalent to a score of (7/8)*x* on the AMC 12, but there are idiosyncratic differences from test to test of about five to ten points on the AMC 12’s 150-point scale. There is more topcoding of AMC 10 scores than AMC 12 scores, but an order of magnitude less than on the SAT. A perfect 150 on the AMC 10 is usually equivalent to about a 130 on the AMC 12. A few hundred students per year score at least 130 on the AMC 12 versus about 15,000 who get perfect scores on the math SAT.

↵16. Only unique matches are kept in the data set for analysis. Students’ demographic variables are missing for 3–6 percent of observations. We consider two values of a variable to be “consistent” if they match or if one or more values is missing. Grade is considered a match between a year *t* observation and a year *t*_{0} observation if *grade*_{t} – *grade*_{t_{0}} = *t* – *t*_{0}.

↵17. Miscounting is a concern because most schools offer only the A-date tests, and some of the most serious students will take a B-date test at another area school that offers it if their school does not. Our procedure avoids double-counting these students if the alternate location they find is a school offering the test on both dates, which we think is by far the most common situation in which this occurs. Alternatively, we could have used all of the B-date scores with some set of matching rules to filter out potential out-of-school students. Any such procedure could at most increase the sample size by 2 percent.

↵18. We have constructed the sample to include 9th-, 10th-, 11th-, and 12th-graders from all years, so the drop in female participation noted here should not be contaminated by the time trend in AMC participation.

↵19. The 19 percent estimate for the top group is noisy given the small sample sizes: the top 50 is only 7–8 percent female, which means that there are typically just three or four girls in the top 50 of each grade–year.

↵20. In 2015, more than 120,000 AP Calculus exams were taken by students in 11th grade and below. It was less common for the cohorts we study, but there were already more than 30,000 students in 11th grade or below taking AP Calculus when our first cohort was in 11th grade (2001).

↵21. When student performance can only be measured as a within-year *z*-score, the dynamics of the year-to-year changes in relative-to-cohort performance are more difficult to analyze for high-scoring students because changes are highly asymmetric: high-achieving students can only improve their performance very slightly from year to year, but can easily do much worse.

↵22. These regressions are run on the subsample for which previous year scores are available. The identifying assumption is that rank in year *t* − 1 is another (noisy) measure of the student’s position in the latent expected achievement distribution in year *t*, and that measurement error in year *t* − 1 is uncorrelated with measurement error in year *t*.

↵23. Further evidence on the role of measurement error can be observed in Online Appendix Table A3, which shows that the sign of the estimated relationship between −[log(*GradeRank*_{it}) − log(5000)] and log(*Rank*_{i,t+1}) − log(*Rank*_{it}) flips when we instrument for log(*GradeRank*_{it}), as in Table 3. That is, measurement error in log(*GradeRank*_{it}) leads to substantial mean reversion, which obscures the relationship between initial achievement and year-to-year improvement in the OLS regression.

↵24. Due to the discreteness of AMC scores, there will typically be a number of students tied for positions that cross each boundary. For example, in 2006, 14 11th-graders had scores of 124, which left them tied for positions 196–209. In this situation, we would include the experience of each of these students with weight 0.64 in our calculation of what happened to students with ranks of 201 to 500 in year *t*. And we similarly record each student’s outcome as their probability of being in each rank group as though ties are broken at random.

↵25. To investigate this issue, we looked manually through published lists of 2006 and 2007 high-scorers. Among the top 50 students in each grade in 2006, we failed to find 2007 matches for 2.6 percent of ninth-graders, 4.3 percent of tenth-graders, and 12.1 percent of 11th-graders. These figures should be compared to the sum of the dropout rate and the probability of finishing outside the 2,000 in our algorithmic match, which is about 18 percent on average across grades. Several factors are involved in the superiority of this manual match over our algorithmic match: manually, we were able to identify students who switched schools, students who took the test at a testing center in one year and in their high school in another year, and students who appear to have listed their first name differently in different years. It is worth noting that matching failures are likely more prevalent at the highest score levels due to high-performing students taking the exams at testing centers in lieu of or in addition to their own high schools.
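The tie-weighting rule in footnote 24 can be made concrete with a short calculation. The Python sketch below computes the share of a block of tied positions that falls inside a given rank band, which is the weight each tied student would receive if ties were broken at random. The function name and its arguments are hypothetical and introduced only for illustration.

```python
from fractions import Fraction


def tie_weight(first_pos, last_pos, band_lo, band_hi):
    """Share of the tied positions first_pos..last_pos that lie in band_lo..band_hi."""
    n_tied = last_pos - first_pos + 1
    overlap = max(0, min(last_pos, band_hi) - max(first_pos, band_lo) + 1)
    return Fraction(overlap, n_tied)


# The example from footnote 24: 14 students tied for positions 196-209,
# evaluated against the band of ranks 201-500.
weight = tie_weight(196, 209, 201, 500)
print(weight, round(float(weight), 2))
```

Nine of the 14 tied positions (201 through 209) lie in the band, so each student receives weight 9/14, or about 0.64, matching the figure in the footnote.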

↵26. Recall that *Dropout*_{i,t+1} will reflect both true dropouts and students we fail to match for other reasons. The B-test dummy takes on a value of 0.02 and is statistically significant. We suspect that this reflects in part that a higher fraction of students taking B-date tests are students taking the test at a location other than their regular school, which makes us more likely to fail to match their performances across years. We hope that such matching failures are not gender-related.

↵27. It is possible that there are gender-related differences in our ability to match students. For example, one gender could be more likely to fill in their name differently in different years. However, girls are overrepresented in the pool of year *t* high-scorers whom we cannot match to a year *t* + 1 score and underrepresented among year *t* + 1 high-scorers whom we cannot match to a year *t* score. The potential gender-related matching errors suggested by these results have opposite sign.

↵28. The “average” decomposition is obtained by averaging separately estimated decompositions of the changes from 9th to 10th, 10th to 11th, and 11th to 12th grades.

↵29. The latter matters here because students outside the top 5,000 will need to improve by substantially more than the average amount to move into the top 5,000.

↵30. In order to account for noise in the decomposition exercise introduced by the local linear regressions, we performed a nonparametric bootstrap of the decomposition procedure, resampling at the student level and holding ranks fixed across 2,000 bootstrap draws. As shown in Online Appendix Table A2, all terms in Table 6 are estimated with a great deal of precision, with the exception of several factors at the top-50 level.

↵31. In the years in our sample roughly 500–750 ninth-graders, 1,000–2,000 10th-graders, 3,000–5,000 11th-graders, and 4,000–6,000 12th-graders qualify. In later stages, students who score highly enough on the AMC 10/12 and AIME are invited to participate in the USA Math Olympiad (USAMO). High USAMO scorers are invited to the Math Olympiad Summer Program (MOP). Six MOP students are selected to represent the United States at the International Math Olympiad (IMO).

↵32. This reallocation could involve reallocating effort toward other competitive endeavors (for example, biology, chemistry, physics, linguistics, or informatics olympiads or debate competitions) or to coursework or noncompetitive extracurriculars.

↵33. Hence, the cutoff can be below 100/120 but never above.

↵34. Although some students may be aware that the AIME cutoff for the AMC 10 is often 120, and the cutoff for the AMC 12 is often 100, it would nevertheless be difficult for students to “game” the cutoff and strategically score just above it. If gaming were common, we would expect to see bunching of students right at the cutoff. As shown in Online Appendix Figure A1, the decline in student counts in a neighborhood of the cutoff is smooth for both boys and girls, and there is no evidence of bunching.

↵35. See Online Appendix Figure A1 for a histogram, by gender, of the distribution of students relative to the AIME cutoff.

↵36. In 1999, the AMC had 30 questions and gave five points for a correct answer and two for a blank answer. In 2000–2001, the tests gave six for a correct answer and two for a blank answer. In 2002–2006, the score for a blank answer increased to 2.5.
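The scoring rules in footnote 36 can be written out as a small Python function. This is a sketch only: it assumes that incorrect answers earn zero points and that the tests from 2000 onward have 25 questions (the count given for the first year of the AMC 10/12 in footnote 14); neither assumption is stated in footnote 36 itself, and the function name is hypothetical.

```python
def amc_score(year, n_correct, n_blank):
    """AMC score under the rules in footnote 36; wrong answers are ASSUMED to earn zero."""
    if year == 1999:
        n_questions, per_correct, per_blank = 30, 5.0, 2.0
    elif 2000 <= year <= 2001:
        n_questions, per_correct, per_blank = 25, 6.0, 2.0  # 25 questions assumed
    elif 2002 <= year <= 2006:
        n_questions, per_correct, per_blank = 25, 6.0, 2.5  # 25 questions assumed
    else:
        raise ValueError("no scoring rule is described for this year")
    if n_correct < 0 or n_blank < 0 or n_correct + n_blank > n_questions:
        raise ValueError("answer counts are inconsistent with the test length")
    return per_correct * n_correct + per_blank * n_blank


print(amc_score(1999, 30, 0), amc_score(2001, 25, 0), amc_score(2004, 16, 4))
```

Under these assumptions a perfect paper scores 150 in every regime, in line with the 150-point scale mentioned in footnote 15, while the value of leaving a question blank changes across years.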

↵37. The regression also includes unreported year and grade fixed effects, a dummy for taking the B-date test, a dummy for taking both the A-date and B-date tests, a quadratic in log(*GradeRank*), and Female interactions with the linear and quadratic log(*GradeRank*) terms. We normalize within-grade rank separately within each grade so that the adjusted log of within-grade rank variable has mean zero within each grade for students with scores exactly at the AIME cutoff. With this normalization, for example, the coefficients on the *Female* × *Grade 9* interaction can be thought of as giving the gender difference in dropout rates for students who qualified for the AIME with the lowest possible score.

↵38. The estimate here is sufficiently precise to rule out an effect close to that found in Buser and Yuan (2019). Their data set is two orders of magnitude smaller, resulting in standard errors that are sufficiently large that they typically cannot rule out a smaller gap of the size we estimate.

↵39. In these regressions we use the default optimal bandwidth as implemented in Stata’s rdrobust package. An optimal bandwidth of 12 points is selected for males, vs. 10.6 for females.

↵40. Both rdrobust results described here are highly significant (*p*-values ≤ 0.01) according to both conventional and robust inference methods; conventional standard errors are reported in the table for brevity.

↵41. This contrasts with Buser and Yuan (2019), who are unable to find an effect for boys.

↵42. Women more often attribute past successes to luck than to inner attributes (and past failures to inner attributes), while men do the opposite (Beyer 1990; Felder et al. 1994).

- Received June 2020.
- Accepted May 2021.

This open access article is distributed under the terms of the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0) and is freely available online at: http://jhr.uwpress.org