## Friday, November 22, 2013

### Just Instrument It

(With apologies to macro-economists.)

The amount of time you spend in education predicts your earnings quite strongly, and it's generally agreed (Simon Cowell aside) that if you want to do well in life, staying in education for longer is a good idea.

But how much effect does it have?  We could look at a survey of people's incomes and group them by education level, but this doesn't give a causal effect.  It might tell us that people who have a master's degree earn more during their lifetime, on average, than those who don't.  This could be because people from wealthy backgrounds can afford the tuition for a master's degree, and also have pals in the city who can help them get a big salary afterwards.  Or perhaps people who do master's degrees work harder than the rest.  We can't easily tell the difference: this problem is called confounding.

We can't tell whether time spent in education causes earnings to increase, or whether there's a third factor which affects both.
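To make confounding concrete, here is a minimal simulation sketch.  Everything in it is invented for illustration: the `background` variable, the coefficients, and the assumed true effect of 0.08 per year of schooling are assumptions, not estimates from any real data.  Because `background` pushes up both education and earnings, the naive regression slope comes out well above the true effect.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x, i.e. cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def naive_slope(n=20000, true_effect=0.08, seed=1):
    """Simulate confounded data and return the naive education-earnings slope."""
    rng = random.Random(seed)
    education, log_earnings = [], []
    for _ in range(n):
        # Hypothetical unobserved confounder (e.g. family background).
        background = rng.gauss(0, 1)
        # Background raises both schooling and earnings.
        edu = 12 + background + rng.gauss(0, 1)
        earn = true_effect * edu + 0.2 * background + rng.gauss(0, 0.5)
        education.append(edu)
        log_earnings.append(earn)
    return slope(education, log_earnings)


TRUE_EFFECT = 0.08  # assumed causal effect per extra year, for illustration
estimate = naive_slope(true_effect=TRUE_EFFECT)
print(f"true effect: {TRUE_EFFECT}, naive slope: {estimate:.3f}")
```

In this toy setup the naive slope lands at roughly twice the true effect, and nothing in the education and earnings columns alone tells you that.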

Suppose Sarah, 16, is considering whether to drop out of school, or stay for another year.  She doesn't care whether other people who choose to stay in school turn out to be rich; what she wants to know is: if she stays, how much more will she earn than if she leaves (and will it cover the counselling?).  If the apparent benefit of education is really down to having rich friends, then Sarah staying in school won't necessarily help at all.  Knowing the best course specifically for Sarah is pretty difficult, but we might be able to answer this: if an average person stays in school an extra year, how much more will they earn than if they leave?

To answer this we could simply run a randomized controlled trial: take 2,000 people, toss a coin for each, force those who get heads to stay in school until 17 and the rest to go straight into work, and see who fares best 25 years later.  Regrettably, indentured labour (or education) of subjects is not an option open to researchers in social policy, so it is not possible to perform this experiment.

#### Instruments

So what can we do?  Well, what we really want is to find a third quantity which affects whether or not people decide to stay in school, but isn't related to later income.  This is a bit like outsourcing our coin tosses to a third party who doesn't suffer from our ethical qualms.  We call this quantity an instrument.

What sort of quantity would work?  Two researchers from the National Bureau of Economic Research, Joshua Angrist and Alan Krueger, thought they'd found one: the time of the year in which you're born.  In many US states it's a legal requirement to be educated until a certain age, say 16.  If you're born in September, this means that at the very beginning of 10th grade, you can drop out; but if your birthday is in August, you have to wait right until the end of the grade to leave, so you get an extra year's schooling.

Looking at US census data, Angrist and Krueger (1991) found a very weak ($R^2 \approx 0.0001$) but statistically significant association between the time of year individuals were born and the number of years of education they received; the later in the school year you're born, the more education you get.  But we can think of this as just a coin toss: the time of year in which you're born is not determined by your family background or other factors relevant to your earnings, it just happens.

We assume that the time of your birth is not affected by your background, but it does have a small effect on how long you stay in school for.

So if we find that people born in September earn less than the group born in August, then the difference must be because of the extra year of education (some of them) received!

There are four key assumptions we need to make this work:

1. the time of birth (the instrument) is not confounded with education (the intermediate variable) or earnings (the outcome): in other words, the instrument is, for all intents and purposes, randomized and not affected by, for example, your background;
2. the time of birth doesn't directly affect earnings, except through the intermediate variable: so being born later might mean that you get an extra year of schooling, which in turn means you earn more, but the time of birth can't affect earnings in any other way;
3. the time of birth has an effect on education: in our case, being born later increases the amount of schooling you get;
4. the effects are linear (as in linear regression).

Then because we assume that the time of birth is randomized (or at least, not confounded), any association between birth and education (say $\alpha$, measured as a regression slope) is causal (it's as if we did a randomized trial), and likewise the association between birth and earnings (say $\beta$) is causal.

Furthermore, because we believe that all the effect of birth on earnings is through education, this means that the causal effect of birth on earnings ($\beta$) should just be the combination (i.e. the product) of the causal effect of birth on education ($\alpha$), and the causal effect of education on earnings (which is what we want to know, say $\rho$).  That is: $\beta = \alpha \times \rho$, where we can measure $\beta$ and $\alpha$ from our observational experiment.  Then we just use:
$$\rho = \frac{\beta}{\alpha}$$
and hey presto! $\rho$ is the causal effect we wanted to find (and will be different in general from the ordinary correlation between education and earnings).  This is the method of instrumental variables (IV), and has been around since the 1920s.
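As a rough sketch of the ratio $\rho = \beta / \alpha$ in action, the simulation below generates data that satisfies all four assumptions and then estimates $\alpha$, $\beta$ and their ratio.  The data-generating model, the `late_birth` coin flip, and the parameter values are all assumptions made for illustration; in particular $\alpha = 1$ here is a far stronger instrument than the one in the census data.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x, i.e. cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def simulate(n=100000, rho=0.08, alpha=1.0, seed=2):
    """Simulate a linear IV setup; return the naive and IV estimates."""
    rng = random.Random(seed)
    birth, education, log_earnings = [], [], []
    for _ in range(n):
        background = rng.gauss(0, 1)    # unobserved confounder
        late_birth = rng.randint(0, 1)  # instrument: a fair coin toss
        # Assumption 3: the instrument shifts education by alpha.
        edu = 12 + alpha * late_birth + background + rng.gauss(0, 1)
        # Assumption 2: late_birth enters earnings only through edu.
        earn = rho * edu + 0.2 * background + rng.gauss(0, 0.5)
        birth.append(late_birth)
        education.append(edu)
        log_earnings.append(earn)
    alpha_hat = slope(birth, education)    # effect of birth on education
    beta_hat = slope(birth, log_earnings)  # effect of birth on earnings
    return {
        "alpha": alpha_hat,
        "beta": beta_hat,
        "iv": beta_hat / alpha_hat,  # rho = beta / alpha
        "naive": slope(education, log_earnings),
    }


result = simulate()
print(f"naive slope: {result['naive']:.3f}")
print(f"IV estimate: {result['iv']:.3f} (true rho is 0.08)")
```

With these made-up settings the ratio should land close to the true $\rho$, while the naive slope stays biased upwards by the confounder.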

Note the importance of assumption 3 here: if the instrument and the intermediate variable are not related then $\alpha = 0$, and division by zero leads to unpleasant stomach cramps and a skin rash.

#### "Dude, that is weak"

So Angrist and Krueger divided the effect of birth on earnings by the effect of birth on education, and obtained an estimate of the causal effect of an extra year's schooling on earnings.  (I'm simplifying quite a lot here.)  They find that additional compulsory years of schooling do lead to higher wages later in life, because of the additional education received.  The exact estimates vary for different cohorts, but are of the order of an 8% increase in wages for each additional year of education.  Which all seems plausible, and in fact it turns out to be similar to the observed correlation.

Unfortunately, this approach carries a fatal flaw in this context, as shown in this rather nice 1995 JASA paper by Bound, Jaeger and Baker.  The problem is related to the dividing-by-zero warning I gave earlier.  A perfect instrument will control the intermediate variable exactly, just like an actual randomized trial; in this case we don't need to use instrumental variables methods.  If the instrument doesn't affect the intermediate at all, then we can't do anything, which is why we need assumption 3.

But if the instrument only very weakly affects the intermediate (and therefore the outcome), it's almost as bad as not affecting it at all: we have to divide a tiny number ($\beta$) by another tiny number ($\alpha$), and hope we get a reasonable result.  As every computer scientist knows, this is very bad, because it means that if we get our estimate for $\alpha$ or $\beta$ wrong by even a tiny amount, then the division will give completely the wrong answer.

This means that if the model is mis-specified by even a tiny amount, then the instrumental variables estimator will be totally misleading.  This is the problem of weak instruments.
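The arithmetic is easy to check.  Suppose, purely as an illustration, that assumption 2 fails very slightly, so that the instrument has a tiny direct effect on earnings (called `direct_effect` below, a hypothetical quantity).  Then $\beta = \alpha \rho + \text{direct effect}$, and the ratio is off by the direct effect divided by $\alpha$:

```python
def iv_estimate(alpha, rho, direct_effect=0.0):
    """Return beta / alpha when the instrument also affects the outcome directly.

    beta is the total effect of the instrument on the outcome: the part that
    goes through education (alpha * rho) plus any direct part.
    """
    beta = alpha * rho + direct_effect
    return beta / alpha


TRUE_RHO = 0.08  # assumed true causal effect, for illustration

# The same tiny misspecification, with progressively weaker instruments.
for alpha in (1.0, 0.1, 0.01):
    estimate = iv_estimate(alpha, TRUE_RHO, direct_effect=0.001)
    print(f"alpha = {alpha}: IV estimate = {estimate:.3f}")
```

With a strong instrument ($\alpha = 1$) the error is negligible, but at $\alpha = 0.01$ the same direct effect of 0.001 more than doubles the estimate.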

We're assuming that the effects are linear (assumption 4: each extra month's schooling leads to the same number of extra dollars earned, for example) and at best this is only an approximation to the truth.  The world just isn't very linear in practice.  Worse, assumptions 1 and 2 are essentially untestable in the IV setup, so (without other information) it's quite possible that our model is mis-specified.

#### Still Instrumental

Although one should be very careful about drawing strong conclusions from studies using instruments, they are still very useful, and a devilishly clever idea.  Many questions in economics just can't be empirically tested by randomized trials: can you imagine asking Mark Carney to flip a coin before deciding whether or not to raise interest rates, just so we can see the causal effect over time?  Instead we must resort to instruments or other natural experiments.

Single studies using instruments are pretty unreliable; but they can still contribute to a wider 'body of proof' about something: we're pretty confident that staying in school increases your earnings, because it's been demonstrated in lots of studies in different countries and time periods.  It is a robust finding.

There is another rather promising application of instrumental variables, one which has gained quite a bit of attention recently.  Mendelian randomization uses our genes as instruments, and has been used to argue, for example, that alcohol raises your blood pressure.  But I'll save that delight for another time.

#### Reference

This nice page by Andrew Johnston was useful.