Thursday, December 19, 2013

In which I use regression to cope with my childhood insecurities

Whenever I got too cocky as a child, my mother used to threaten me with “regression to the mean”.

“You think you’re so smart,” she would say. “But statistically, smart parents tend to have children who are dumber than they are. So you’d better listen to me.”

Clearly this was scarring, because I then went to college and took a bunch of statistics classes in an effort to understand this fearsome “regression to the mean”. It turned out to be a foe as subtle as it was important: a recent article in Nature cited it as one of the 20 most critical phenomena for policy-makers to understand.  I will first explain the concept, and then two particularly seductive perversions of it [1]. It is entirely possible that my understanding is still flawed, in which case you should comment or shoot me an email.

What is regression to the mean?

My mother was right: exceptional parents tend to have less exceptional children. Sir Francis Galton was first to notice that tall parents tended to have children who were shorter than they were, and short parents tended to have children who were taller. This isn’t just an effect with parents and children. Mutual funds that do very well one year tend to do worse the next year. If we look at the 100 National Basketball Association players who scored the most points in 2012, 64% of them scored fewer points in 2013. And this is true no matter what metric we look at:

If we rank players by...
Probability a top-100 player got worse in 2013:
Defensive rebounds per game
3-pointers per game
Assists per game
Steals per game

Doing well in 2012 isn’t causing the athletes to do worse in 2013: as I’ll explain, this is merely a statistical illusion. So regression to the mean inspires a lot of really stupid sports articles: the Sports Illustrated cover jinx refers to the myth that appearing on the front cover of Sports Illustrated is bad luck for an athlete because these athletes often suffer a drop in performance, but this is only a manifestation of regression to the mean, not a causal effect of the cover story. Medical trials provide a slightly more consequential example: if you give a drug to people in the midst of severe depression and check on them two weeks later, they’ll usually be doing somewhat better, but that’s not necessarily the causal effect of your drug:  they would also probably have improved if you had done nothing.

Why does this occur? 

Scoring baskets is a combination of skill and luck. A basketball player who scores an exceptional number of baskets in 2012 is likely more skillful than the average player, but he’s also likely more lucky. His skill will persist from season to season, but his luck won’t, so if he got very lucky in 2012, he’s likely to be less lucky--and do worse--in 2013. Regression to the mean emerges whenever we have this combination of signal and randomness: the signal will persist, but the randomness will not.
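The skill-plus-luck story is easy to check with a toy simulation. What follows is only a sketch under made-up assumptions: each player's score is a persistent skill plus fresh luck every season, both drawn from a standard bell curve, and names like `simulate_seasons` are mine, not from any real dataset.

```python
import random

def simulate_seasons(n_players=1000, top_k=100, seed=0):
    """Two seasons where score = persistent skill + fresh luck (hypothetical model)."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_players)]
    # Skill carries over between seasons; luck is redrawn each time.
    season1 = [s + rng.gauss(0, 1) for s in skill]
    season2 = [s + rng.gauss(0, 1) for s in skill]
    # Pick the top_k scorers of season 1 and see how they do in season 2.
    top = sorted(range(n_players), key=lambda i: season1[i], reverse=True)[:top_k]
    got_worse = sum(season2[i] < season1[i] for i in top) / top_k
    mean1 = sum(season1[i] for i in top) / top_k
    mean2 = sum(season2[i] for i in top) / top_k
    return got_worse, mean1, mean2

got_worse, mean1, mean2 = simulate_seasons()
print(f"fraction of top scorers who got worse: {got_worse:.2f}")
print(f"their average score: season 1 = {mean1:.2f}, season 2 = {mean2:.2f}")
```

Nothing in the simulation makes a good season cause a bad one, yet most of the top scorers decline, while still staying above the league average because their skill persists.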

We can describe this combination of signal and randomness more mathematically. Regression to the mean occurs whenever there is imperfect correlation between two variables: they don’t lie along a perfectly straight line (see end for a longer explanation of correlation [2]). Here are some correlations: 

(The above data is hypothetical.) Perfect correlations, of -1 or 1, are pretty much impossible in real data, which means that almost any two variables will exhibit regression to the mean. Let x be how good-looking you are and y be how many drinks you get bought at a bar. We'll express x and y in terms of standard deviations--so if you were two standard deviations better looking than the average person (like me, on a bad day), x would be two. (We often refer to data not in terms of its actual value, but in terms of its distance from the average in standard deviations, because that gives us a sense of how rare it is: this is sometimes called z-scoring, see end for a brief explanation [3]). 

x and y are probably positively correlated here [4]--let’s say the correlation r is .5. If you want to fit a straight line to the data--we call that “linear regression” or “the only technique an economist knows”--predicting y from x is actually incredibly simple: y=r*x [5]. (This is only true because x and y are z-scored -- another nice thing about z-scoring data -- in general, the formula is a bit more complex.) So if you were two standard deviations better looking than the average person, r being .5, linear regression would predict that you get one standard deviation more drinks than the average person. The important point here is that one is less than two: the number of drinks you get is less extreme than your hotness, so you have regressed to the mean. 
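If you would rather see y=r*x than take it on faith, here is a minimal sketch. The looks-and-drinks data is simulated with arbitrary coefficients that I picked to give a correlation near .5, and the helper names are hypothetical. It z-scores both variables, fits a least-squares line, and compares the slope to r.

```python
import random
from math import sqrt

def zscore(values):
    """Express each value as its distance from the mean in standard deviations."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Pearson correlation: the average product of the z-scores."""
    return sum(a * b for a, b in zip(zscore(xs), zscore(ys))) / len(xs)

def ols_slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Simulated data; the coefficients 3, 0.8 and 1.4 are arbitrary assumptions.
rng = random.Random(1)
looks = [rng.gauss(0, 1) for _ in range(5000)]
drinks = [3 + 0.8 * x + rng.gauss(0, 1.4) for x in looks]

r = correlation(looks, drinks)
zx, zy = zscore(looks), zscore(drinks)
slope_y_on_x = ols_slope(zx, zy)  # predicting drinks from looks
slope_x_on_y = ols_slope(zy, zx)  # predicting looks from drinks
predicted_drinks_z = r * 2        # someone two standard deviations better looking

print(f"r = {r:.3f}")
print(f"slope predicting y from x = {slope_y_on_x:.3f}")
print(f"slope predicting x from y = {slope_x_on_y:.3f}")
print(f"predicted z-score of drinks at x = 2: {predicted_drinks_z:.3f}")
```

On z-scored data both slopes come out equal to r, so whichever variable you predict, the prediction is pulled toward the mean.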

Because the gap between x and r*x gets bigger as x gets more extreme, the gap between x and our predicted value of y gets bigger as x gets more extreme--more regression to the mean. (On the other hand, if x is close to its mean, the scatter in y means that y will actually tend to lie farther from the mean than x does: very average parents should have more extraordinary children, which we could maybe call the Dursley-Harry Potter effect.)

This definition makes regression to the mean both more and less powerful than is often supposed, which brings me to...

The two seductive misconceptions

1. Because the examples given of regression to the mean are often of parents and children, many people think it has something to do with genetics and biology; alternately, they hear about depressed people becoming better on second measurement and think it is just a property of repeated measurements. But regression to the mean has nothing to do with sex or psych or any of those icky things: it is a mathematical property of any two imperfectly correlated variables, and there need be no causal relationship between the two.

2. The second seductive misconception is my favorite, because it means my mom was only half right. Regression to the mean is limited in two important ways which I’ll illustrate by example. First, regression assumes our data lies along a straight line. But consider the case where x describes the performance of a royal courtier on his first etiquette exam, and y describes the performance on his second. We might expect to see something like the left plot, where a straight line fits the data well. Now imagine the exam proctor is Henry VIII, so that between the first exam and the second, every courtier who’s below average is beheaded, and every courtier who’s above average is given the answer sheet. In that case, everyone either gets a perfect score or a zero (right plot), and a straight line no longer fits the data very well. (Both plots are z-scored so the mean is zero and the standard deviation is 1.)

Even though both datasets have roughly the same correlation, and thus the same degree of “regression to the mean”, these concepts are only meaningful to the extent that a straight line actually fits the data. Intuitively, in the second plot, the y-values are more extreme than the x-values: certainly the headless courtiers would think so.

But let’s ignore this problem and assume that a straight line does fit the data well. All that means is that y is fewer standard deviations from the mean than is x. But if the standard deviation of y is much greater than that of x, the predicted values of y may still be “more extreme” in an absolute sense, even if they’re not in a z-scored one. Below I calculate the average score of state Congressional representatives on a conservative-liberal scale, according to these bros, for the 97th Congress (beginning in 1981) and the 111th Congress (beginning in 2009). Negative is more liberal.

Unsurprisingly, there’s a strong correlation, with r=.57. But this correlation is still less than 1, so we would expect this dataset to exhibit regression to the mean, and indeed it does: when we look at the 10 most extreme liberal states in 1981, 7 become less liberal (as measured by z-score) in 2009 (the exceptions are Massachusetts, Rhode Island, and New York). But our incredibly harmonious political system has become more polarized since the 1980s, and so the standard deviation in y is nearly 50% greater than the standard deviation in x. So while this dataset technically exhibits regression to the mean (in the z-scored sense), 38 of the 50 states are actually farther from the mean, in an absolute sense, in 2009 than they were in 1981: they became more extreme (red dots) not less extreme (blue dots). And since x and y are measured on the same scale, we probably care about the absolute sense. 

Mind blown! In summary: if the scatter in your data increases, regression to the mean might not mean much. To return to my argument with my mother, one might imagine that smart parents make their children play chess, and dumb parents make their children play football, and that chess and football have such profoundly opposite effects on intelligence that the children’s IQs range from 50 to 150 while their parents’ range only from 90 to 110. The z-scored predictions for the children’s IQs will be less extreme than their parents’, but the absolute predictions may be much more extreme.
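To put numbers on the chess-and-football story, here is a small worked sketch. Every figure in it is an assumption chosen for illustration: a correlation of .5, a standard deviation of 5 IQ points for parents and 25 for children, and a mean of 100 for both.

```python
def predict_child_iq(parent_iq, r, parent_mean, parent_sd, child_mean, child_sd):
    """Predict a child's IQ by regressing in z-score units, then converting back."""
    parent_z = (parent_iq - parent_mean) / parent_sd
    child_z = r * parent_z                      # the z-scored prediction shrinks toward 0
    child_iq = child_mean + child_z * child_sd  # back to IQ points
    return parent_z, child_z, child_iq

# Hypothetical numbers: the parents vary little, the children vary a lot.
parent_z, child_z, child_iq = predict_child_iq(
    parent_iq=110, r=0.5,
    parent_mean=100, parent_sd=5,
    child_mean=100, child_sd=25,
)
print(parent_z, child_z, child_iq)  # 2.0 1.0 125.0
```

The parent is two standard deviations above average and the predicted child only one, so the z-score has regressed; but in IQ points the prediction is 25 above the mean against the parent's 10.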

Which is to say that my mother is not necessarily right. But given that I’ve spent the last 17 years worrying about an argument she’s long since forgotten, you can draw your own conclusions [6].

With thanks to Shengwu Li and Nat Roth for insights.

[1] My high school physics teacher used to shield us from tempting mistakes by saying “Don't give in to this seduction! Because then you'll be pregnant with bad ideas.”
[2] Correlation is a number between -1 and 1 which measures how consistently an increase in one variable is associated with an increase in the other. Positive correlations indicate that an increase in one variable corresponds to an increase in the other; negative correlations indicate that an increase in one variable corresponds to a decrease in the other; zero correlation indicates no relationship.
[3] Standard deviation is a measure of how spread out the data is: if data follows a bell curve, which data often roughly does, 68% of the datapoints will be within 1 standard deviation of the average, 95% within 2, and 99.7% within 3, a fact referred to, creatively, as the “68-95-99.7 rule”. Standard deviation is useful because it lets us refer not to the actual value of a datapoint, but to its distance from the mean in standard deviations--this is called the “z-score”, and gives us a rough sense of how “rare” a datapoint is. IQ tests, for example, often have a mean of 100 and a standard deviation of 15, which means that someone with an IQ of 145 has a z-scored IQ of 3: this is a good way to insult them, although if they actually have an IQ of 145 they will probably figure it out pretty quickly.
[4] Except at really sad places like MIT bars, where there’s so little variation in x that the correlation coefficient becomes undefined.
[5] For those who know a little bit about regression, here’s the weird thing you might want to think about: we could also predict x from y, in which case we get x=ry, not y=rx. Which of these lines is “correct” and why is this not a contradiction?
[6] In case it is not apparent, I like my mother.

Thursday, December 12, 2013

#23andStupid vs. #nannystate

Last Thursday evening, I sat at my desk at 23andMe, a genetics company which until very recently offered its customers the chance to divine from their DNA their risks of cancer, heart disease, and many other conditions. I typed out the last lines of a computer program to monitor Twitter and biked home at breakneck speed. When I arrived home, 23andMe had just released the announcement that would set off a Twitter storm: the FDA had ordered it to stop providing its genetic health reports. I set my program running: over the next 48 hours, I recorded more than 4,300 tweets related to the news. What follows is my analysis of two questions: who cared, and what did they think? Any sharp statistician would be suspicious of my objectivity, so I also built a website which will allow you to explore the data yourself: if my conclusions seem unwarranted, please comment or shoot me an email. All analysis is based solely on public data and does not reflect the views of 23andMe.

At peak, roughly 2 hours after the announcement, there were more than 500 tweets an hour relating to 23andMe, or a tweet every 7 seconds. The tweets came from all over the world, as you can tell by tracking the timezone of the tweeter:

It’s perhaps surprising that there are more tweets from the East Coast than the West Coast, given that 23andMe is a Californian company, but on the other hand the East Coast has more than double the West Coast’s population.

Who were the Tweeters?

A short answer: biologists, geeks, and the politically active. A longer answer: we can use a technique called PCA to make this picture (download it, zoom in, and be patient) of the words Tweeters use to describe themselves in their Twitter profiles. (I include a short explanation of PCA at the end of this post) [1]. Two words appear close together in the picture if they appear frequently together in tweeters’ self-descriptions. From this we can pick out clusters of words which indicate types of Tweeters: near the top,  “cancer”, “biotech”, “research”, “genomics”, “biology”, “genetics”, etc: the biologists. Near the bottom, “apps”, “design”, “developer”, “engineer”, “mobile”: the tech nerds. To the right, a combination of health--“lifestyle”, “living”, “healthy”, “live”--and politics: “libertarian” [2], “citizen”, “america”, “environment” [3].  
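For the curious, here is roughly what projecting words into two dimensions looks like in code. This is a sketch and not the program I actually ran: the six profiles are invented, and I find the two principal components by power iteration in plain Python, where a real analysis would more likely lean on a numerical library.

```python
import random
from math import sqrt

# Hypothetical self-descriptions: each set is one made-up Twitter profile.
profiles = [
    {"genomics", "biology", "research"},
    {"genomics", "biology"},
    {"biology", "research"},
    {"apps", "developer", "mobile"},
    {"apps", "developer"},
    {"research", "developer"},
]
words = sorted(set().union(*profiles))

# One row per word, one 0/1 column per profile (the transposed document-term matrix).
rows = [[1.0 if word in p else 0.0 for p in profiles] for word in words]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pca_2d(rows, iters=200, seed=0):
    """Project rows onto their top two principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            nxt = [dot(cov[a], v) for a in range(d)]
            # Keep the new direction orthogonal to components already found.
            for c in components:
                proj = dot(nxt, c)
                nxt = [x - proj * y for x, y in zip(nxt, c)]
            norm = sqrt(dot(nxt, nxt))
            if norm < 1e-12:
                break
            v = [x / norm for x in nxt]
        components.append(v)
    return [[dot(row, c) for c in components] for row in centered]

def distance(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

coords = dict(zip(words, pca_2d(rows)))
for word, (x, y) in coords.items():
    print(f"{word:10s} {x:6.2f} {y:6.2f}")
```

In this toy example "genomics" lands nearer to "biology" than to "apps", which is the same kind of clustering visible in the picture.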

Another question we can ask is: do people who describe themselves similarly tend to tweet similarly? We answer this by projecting the tweets into two dimensions, projecting the self-descriptions into two dimensions, and seeing whether people who are close in tweet-space are also close in self-description space. At first the answer appears to be yes--the correlation in closeness is positive and highly significant. This might be due to the same people tweeting the same things over and over again, so I took the repeat tweeters out, and the correlation is still positive. The remaining correlation turns out to be due to a bunch of Tweeters that are described as news sites, who tend to tweet different things than non-news sites. When you take those out, the correlation disappears. I still suspect that, in general, people with similar profiles tweet similar things; I also suspect that Twitter, Facebook and Google are way ahead of me on this one.

What did they think?

Most people didn’t take a side at all, and just retweeted the news; 74% of the tweets were pretty much exact repetitions of earlier tweets. I was disappointed by this lack of originality, but of course repeating exactly what you’ve been told is often valuable: if you’re a dividing cell, it prevents cancer, and if you’re a soldier, it prevents court martials. Here’s a plot of the number of original tweets as a function of the total number of tweets; the changes in slope are interesting. Between tweets 300 and 2000, there are relatively few original tweets, probably because most people are just retweeting the news without really thinking about it.
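Footnote 4 mentions that the string comparison used Python's difflib with a threshold of .8. Here is a minimal sketch of how such a repeat filter might look. The details are my assumptions for illustration: the example tweets are invented, the comparison is case-insensitive, and each tweet is compared only against the originals seen so far.

```python
from difflib import SequenceMatcher

def find_originals(tweets, threshold=0.8):
    """Keep a tweet only if it is not a near-copy of an earlier original."""
    originals = []
    for tweet in tweets:
        is_repeat = any(
            SequenceMatcher(None, tweet.lower(), seen.lower()).ratio() >= threshold
            for seen in originals
        )
        if not is_repeat:
            originals.append(tweet)
    return originals

# Invented example tweets, in the order they arrived.
tweets = [
    "FDA orders 23andMe to stop selling its genetic health reports",
    "RT @someone: FDA orders 23andMe to stop selling its genetic health reports",
    "I want my money back",
    "FDA orders 23andMe to stop selling its genetic health reports!",
]
originals = find_originals(tweets)
print(len(originals), "original tweets out of", len(tweets))
```

Comparing every tweet against every earlier original is quadratic in the worst case, which is tolerable for a few thousand tweets but would need something smarter for a much larger stream.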

Most of the people  expressing strong opinions supported 23andMe. When we filter on people using profanity, 15/16 tweets blame the FDA. (The exception: “@23andMe This is BS. I only bought these kits to learn about my health, and now I can't. I want my money back!”). When we filter on people expressing negative emotions, 16/19 blame the FDA (42 people express negative emotions, but 23 of them just say that they’re “sad”, leaving blame ambiguous). I wondered if looking only at negative words biased the sample towards people who feel negatively towards the FDA, so I looked instead at words indicating positive emotion, and found that 15/20 people who took a definite side favored 23andMe. I also looked at people expressing opinions on the lawsuit against 23andMe; 32 people simply retweeted news stories about the lawsuit, but of the 7 who took a side, all said the lawsuit was frivolous. Finally, when I looked at people with backgrounds in science, medicine, or biology, 17/20 who took a definite position supported 23andMe. There are also 52 tweets from libertarians who mock the #nannystate, a tweeter who refers to 23andMe CEO Anne Wojcicki as a “gummy bear”, and a Canadian who is so upset about the whole thing that he says “#IDontWantToLiveOnThisPlanetAnymore”. Of course, Twitter users probably represent a biased population: they may be exactly the sort of young, free-spirited, tech-savvy individuals who would like a company like 23andMe.

Whatever happens, we are lucky to live in such exciting times. In the words of Tweeter @LibrariNerd from Nilbog:

I’ve been saving all the emails I’m getting from 23andMe about it. Feels potentially historical.


1. PCA is an elegant technique that helps you visualize “high-dimensional data”, which has become a buzzword in our information-rich world. High-dimensional data just means that each datapoint takes a lot of numbers to represent: a Twitter post can be represented by a long row of ones and zeros, where each one or zero refers to the presence or absence of a certain word; a genotype (what we have at 23andMe) can be represented by a row of zeros, ones, and twos, where each number describes a particular location in the genome. High-dimensional data is difficult to visualize--we don’t do well in more than 3 dimensions--but PCA allows you to project the data down into 2 dimensions in a way that retains an essential property: points that are close together in the high dimensional space will be close together in the 2 dimensional space.
2. “Libertarian” also appears right next to “single”, on which I have no comment.
3. Those familiar with PCA will note that this is a projection of the words, not the self-descriptions: the transpose of the document-term matrix. You can also project the original matrix, but it’s harder to fit the self-descriptions on one page; from what I could make out, you get a continuum of “biologist” to “general nerd”.
4. I used Python’s difflib for string comparison with a threshold of .8.
5. This dataset is somewhat incomplete for two reasons. a) I upgraded my program while it was running (so it could collect Tweeter self-descriptions and time zones as well as the raw tweets) and b) it crashed at 2 AM the first night, so there’s a period of a few hours when I’m missing data.
6. A note on the website: the website is known to have certain minor bugs which I will fix when I get the computer on which the code resides back from my boyfriend.