I work just down the block from Google, and I get the sense that it’s devouring me: every time I drive home I’m surrounded by rainbow Google bikes, giant Google buses, Google Street View cars, Google self-driving cars. But I still haven’t managed to get what I actually want: their data.
Recently, however, I found a way to combine the 23andMe and Google datasets and thus achieve absolute power (by which I mean: increase my faith in 23andMe’s dataset). We describe the work on 23andMe’s blog here. It relies on a tool called Google Correlate, which was used, among other things, to build Google Flu Trends: basically, Google Correlate lets you enter a number for every state and see which search terms correlate most strongly with that state-by-state pattern. For example, when I enter the average latitude for each state, I see that the search terms most frequent in Northern states are the ones you would expect: “how much vitamin D”, “heated seats”, “seasonal affective disorder”, etc.
You can do the same thing with 23andMe data: we know where many of our customers live, and we have their answers to thousands of survey questions, which allowed me to make maps for more than 1,500 health and behavioral traits. I put the data behind those maps into Google Correlate. I started with 23andMe customers’ answers to “How often do you eat leafy greens?”, and out of the billions of Google searches, one of the most strongly associated was “raw kale”. States with high rates of 23andMe customers with coronary artery disease also had high rates of Google searches for “statin drugs”, which lower cholesterol. Just as striking were the negative correlations.
The inverse pattern is clear (and highly statistically significant, even after multiple hypothesis correction). These connections are fun and surprising, given the billions of Google searches, the thousands of 23andMe traits, and the fact that typing in a Google query is very different from answering a medical survey. But I want to make a larger point here, which is how we can combine multiple datasets to overcome bias, the specter that haunts my data-related dreams.
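A quick aside on that “multiple hypothesis correction” caveat: when you test thousands of traits against a huge library of search terms, some correlations will look impressive by chance alone, so raw p-values have to be corrected. Here’s a minimal sketch of the Benjamini-Hochberg procedure (one standard correction; the p-values are invented):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return indices of hypotheses rejected at false-discovery rate alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears the sliding threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# Hypothetical raw p-values: two real effects hiding among eight tests
pvals = [0.0001, 0.21, 0.74, 0.0004, 0.03, 0.55, 0.09, 0.47]
print(benjamini_hochberg(pvals))  # -> [0, 3]: only the two tiny p-values survive
```

Note that the nominally “significant” p = 0.03 does not survive the correction, which is exactly the point.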
By bias I mean “something that makes the number you’re estimating different from the number you want to estimate, no matter how much data you get”. Let’s say you’re trying to figure out whether there’s a difference in physical attractiveness between chessplayers and non-chessplayers. (Yes.) So you go on a dating website with a lot of chessplayers (checkmates.com) and you download all the pictures and you write an algorithm which evaluates the attractiveness of a picture and you compare the algorithm’s output for chessplayers and non-chessplayers. You find that your algorithm says that chessplayers are on average 15% more attractive. You should immediately worry about two things. First, is that difference statistically significant? If you only looked at 4 chessplayers, and one guy was egregiously ugly, it’s probably not. But statistical significance is usually the easy problem, because it can be solved by getting more data; that’s hard if you’re trying to recruit, say, people with a rare psychiatric condition, but I work with datasets which are large enough that statistical significance is rarely an issue. Any difference large enough to care about has like a chance in a trillion of being due to chance.
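The “significance is solved by more data” point is easy to see in a simulation. Below, both groups differ by the same true amount, and only the sample size changes; all the “attractiveness” numbers are invented.

```python
import math
import random

random.seed(0)

def two_sample_z_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))

def sample(n, mean):
    """n 'attractiveness scores' drawn around the given mean."""
    return [random.gauss(mean, 1.0) for _ in range(n)]

# The same true difference in means (0.15), at two very different sample sizes:
small = two_sample_z_p(sample(4, 5.15), sample(4, 5.0))
large = two_sample_z_p(sample(100_000, 5.15), sample(100_000, 5.0))
print(small, large)  # small n: likely not significant; huge n: vanishingly small p
```

With four chessplayers the real difference is invisible; with a hundred thousand per group, the p-value is astronomically small -- which is why, at that scale, significance is rarely the interesting question.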
But bias is a much more insidious problem, because it cannot be solved by getting more data. Here are some biases we might see in the chessplayer problem:
a) Maybe chessplayers who post photos online are more attractive than the average chessplayer.
b) Maybe chessplayers who engage in online dating at all are more attractive than the average chessplayer.
c) An algorithm that measures attractiveness? Really? I’m not a computer vision expert, but I’d be immediately worried that the algorithm was being biased by differences that had nothing to do with the faces. Maybe chessplayers tend to pose with chessboards, and that creates a tiny discrepancy in how the algorithm evaluates them.
In all of these cases, bias means that what you’re trying to measure -- the difference in attractiveness between chessplayers and non-chessplayers in the general population -- is not what you’re actually measuring. More data will not fix bias, because you’ll always be measuring the wrong thing.
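You can watch this happen in a simulation of bias (a). Suppose -- hypothetically -- the two populations are equally attractive, but only above-average-looking chessplayers post photos. The estimated difference then converges to a wrong, nonzero answer no matter how much data you collect.

```python
import random

random.seed(1)

# Ground truth: both populations have mean score 5.0 (no real difference).
def biased_chessplayer_sample(n):
    """Chessplayers as observed on the site: only flattering photos get posted."""
    out = []
    while len(out) < n:
        score = random.gauss(5.0, 1.0)
        if score > 5.0:  # the bias: below-average chessplayers never appear
            out.append(score)
    return out

def fair_sample(n):
    return [random.gauss(5.0, 1.0) for _ in range(n)]

for n in (100, 10_000, 200_000):
    diff = sum(biased_chessplayer_sample(n)) / n - sum(fair_sample(n)) / n
    print(n, round(diff, 3))
# The estimated gap converges to ~0.8 as n grows -- not to the true value, 0.
```

More data shrinks the error bars around the wrong number; it never moves you toward the right one.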
We can think of these two problems -- statistical significance and bias -- in terms of romantic pursuits. A statistical significance problem is like when you ask someone out and they say, “I’m sorry, but I just don’t know you well enough yet” -- they need more data. But a bias problem is when they say, “I’m sorry, but I just don’t like you” -- no matter how much more of you they get, you’re still not going to be what they want.
I’m now being accused of being frivolous, so here are two more important examples. Why do we care about low voter turnout? It isn’t because we don’t have enough votes to detect a statistically significant difference between candidates. Even in a very close election -- 51-49, say, a margin which even in the 2000 presidential race occurred in only 6 states -- the difference will be statistically significant (p < .05) even if only 10,000 people turn out to vote. We worry about low voter turnout because it often produces bias: those who vote have higher incomes, are better educated, are less likely to be minorities, etc. (There are rare cases where the election is so close that the difference actually isn’t statistically significant -- not to start a flame war, but in Florida in 2000, Bush would’ve had to win by about 5,000 votes, not the 537 he actually won by. Even then we might’ve worried about bias due to irregularities in the election process -- but luckily we can all sleep soundly thanks to the completely non-partisan Supreme Court decision.) I think if you actually wanted to measure who the country really wants to be in charge, you would just contact a small randomized sample of people in each state and not let anyone else vote; that would better deal with the bias problem.
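Those two claims -- that 51-49 is significant with only 10,000 voters, and that Florida’s 537-vote margin wasn’t -- are easy to check with a normal approximation to the binomial (the Florida totals are the certified 2000 counts):

```python
import math

def two_sided_p(votes_a, votes_b):
    """Normal approximation to the binomial: p-value against a 50/50 electorate."""
    n = votes_a + votes_b
    z = (votes_a / n - 0.5) / math.sqrt(0.25 / n)
    return math.erfc(abs(z) / math.sqrt(2))

# 51-49 with only 10,000 voters is already significant:
p_close = two_sided_p(5100, 4900)
print(p_close)  # ~0.046, under the usual .05 threshold

# Florida 2000 (certified totals: Bush 2,912,790, Gore 2,912,253):
p_florida = two_sided_p(2_912_790, 2_912_253)
print(p_florida)  # ~0.82 -- the 537-vote margin is nowhere near significant

# Margin needed for p < .05 is about 1.96 * sqrt(n):
n = 2_912_790 + 2_912_253
print(round(1.96 * math.sqrt(n)))  # ~4,700 votes, roughly the "about 5,000" above
```

So with the turnout numbers involved in real elections, significance is almost never the binding constraint; bias is.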
Let’s take an example which is less likely to get me mean emails: developing a blood test to detect cancer while it’s still treatable. A lot of people have tried to do this, and pretty much the same thing always happens: they write a paper saying that you can detect cancer X by looking at the levels of molecule Y, everyone gets excited, other people try to replicate the results and they never can. The most plausible reason for this is that the original results were due to bias. What you want is a blood test that can tell the difference between apparently healthy people who secretly have cancer and apparently healthy people without cancer. But apparently healthy people who secretly have cancer are hard to find, because most cancers are rare, so you would have to take blood from a lot of healthy people -- hundreds of thousands, in some cases. So most scientists use a shortcut: rather than taking blood from apparently healthy people with cancer, they just take blood from people with cancer. Those people are easy to find: you just go to a hospital. Unfortunately, this means you haven’t designed a blood test that can detect cancer in apparently healthy people -- you’ve designed a blood test that can detect cancer in people we already know have cancer at your particular hospital. Which is fine, if these two groups have the same blood -- but often they don’t. For example, cancer treatment itself messes with your blood -- blood samples may be collected while the patient is under anesthesia, or undergoing chemotherapy, which both alter your blood. So you haven’t created a cancer detector; you’ve created a chemotherapy detector. Or maybe your cancer and healthy populations are different for reasons you don’t care about -- one attempt to develop a screen for prostate cancer compared cancer patients (who were all men) to healthy controls (who were all women). So then you’ve created a sex detector.
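The “sex detector” failure mode can be simulated in a few lines. Suppose -- purely hypothetically -- the measured molecule tracks sex rather than cancer, and the flawed study compares male cases against female controls:

```python
import random

random.seed(2)

# Hypothetical marker "molecule Y": it depends on sex, not on cancer at all.
def y_level(is_male):
    return random.gauss(3.0 if is_male else 1.0, 0.5)

cases    = [y_level(is_male=True)  for _ in range(500)]  # prostate-cancer patients
controls = [y_level(is_male=False) for _ in range(500)]  # healthy women

threshold = 2.0
sensitivity = sum(y > threshold for y in cases) / len(cases)
specificity = sum(y <= threshold for y in controls) / len(controls)
print(sensitivity, specificity)  # both near 1.0: looks like a fantastic test...

# ...until you evaluate it on the population you actually care about:
# apparently healthy men, none of whom have cancer in this simulation.
healthy_men = [y_level(is_male=True) for _ in range(500)]
false_positive_rate = sum(y > threshold for y in healthy_men) / len(healthy_men)
print(false_positive_rate)  # near 1.0: nearly every healthy man "has cancer"
```

On the biased case-control comparison the test looks nearly perfect; deployed on the intended screening population, it flags almost everyone. No amount of additional hospital data would have revealed this.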
Summary: bias undermines democracy and kills people. What do we do about it? There are standard practices for reducing bias -- controlling for obvious things like sex, or running double-blind studies so your results aren’t influenced by what you want to see. But I still tend to be very nervous about bias, in part because I’m a nervous person and in part because everything is intercorrelated, so tiny discrepancies in variables you don’t care about can produce discrepancies in variables you do. Another powerful way to determine whether a pattern is due to bias is to check whether it appears in a very different dataset. While both datasets likely have biases, they’re unlikely to have the same biases: so if you see the same pattern in both, it’s more likely to be due to something real. To return to 23andMe and Google: 23andMe survey results are probably biased by the fact that 23andMe customers aren’t a representative sample of the general population. Google suffers less from this, since its product is more pervasive, but it may be biased by the fact that Google searches map only weakly onto what someone is actually thinking. (For example, if more people Google “I want to have sex with sheep” than “I want to have sex with my girlfriend”, that might not be because more people want to have sex with sheep than with their girlfriend; it might be because they’re freaked out about the sheep, and turn to Google.) Both datasets are biased by the fact that you have to be able to use a computer to use Google or to answer 23andMe surveys, but that’s a considerably less scary bias than the ones each analysis suffers on its own.
Obviously, the 23andMe/Google analysis is more of a fun proof-of-concept than a rigorous statistical analysis, but the compare-multiple-datasets idea is an important one. To return to biomedical research: often a preliminary analysis will be done on a sample from one hospital, and the results will then be checked on a sample from another hospital; this controls for hospital-specific biases. Or you might do the analysis on a dataset of a completely different nature: here’s a lovely analysis that starts with gene expression data over time, draws some conclusions, then tests those conclusions by actually knocking out some of the genes and analyzing that dataset as well, and finally compares the results to a third dataset generated with yet another method.
I’m not saying that every study that uses a single dataset is biased or unpublishable -- clearly that isn’t true. But attempting to replicate results in a different dataset is almost always worthwhile, because anticipating every bias is impossible. This is particularly true if you’re generically curious, like I am, and you’re often poking your nose into fields in which you have no business or background knowledge.
(In practice, I suspect many people who say “I just don’t know you well enough yet” actually mean “I just don’t like you”, but that distinction is not applicable to our analogy.)