Thursday, December 12, 2013

#23andStupid vs. #nannystate

Last Thursday evening, I sat at my desk at 23andMe, a genetics company which until very recently offered its customers the chance to divine from their DNA their risks of cancer, heart disease, and many other conditions. I typed out the last lines of a computer program to monitor Twitter and biked home at breakneck speed. When I arrived home, 23andMe had just released the announcement that would set off a Twitter storm: the FDA had ordered it to stop providing its genetic health reports. I set my program running: over the next 48 hours, I recorded more than 4,300 tweets related to the news. What follows is my analysis of two questions: who cared, and what did they think? Any sharp statistician would be suspicious of my objectivity, so I also built a website which will allow you to explore the data yourself: if my conclusions seem unwarranted, please comment or shoot me an email. All analysis is based solely on public data and does not reflect the views of 23andMe.

At peak, roughly 2 hours after the announcement, there were more than 500 tweets an hour relating to 23andMe, or a tweet every 7 seconds. The tweets came from all over the world, as you can tell by tracking the timezone of the tweeter:

It’s perhaps surprising that there are more tweets from the East Coast than the West Coast, given that 23andMe is a Californian company, but on the other hand the East Coast has more than double the West Coast’s population.

Who were the Tweeters?

A short answer: biologists, geeks, and the politically active. A longer answer: we can use a technique called PCA to make this picture (download it, zoom in, and be patient) of the words Tweeters use to describe themselves in their Twitter profiles. (I include a short explanation of PCA at the end of this post [1].) Two words appear close together in the picture if they frequently appear together in Tweeters’ self-descriptions. From this we can pick out clusters of words which indicate types of Tweeters. Near the top, “cancer”, “biotech”, “research”, “genomics”, “biology”, “genetics”, etc.: the biologists. Near the bottom, “apps”, “design”, “developer”, “engineer”, “mobile”: the tech nerds. To the right, a combination of health--“lifestyle”, “living”, “healthy”, “live”--and politics: “libertarian” [2], “citizen”, “america”, “environment” [3].
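For the curious, here is a minimal sketch of how such a word picture could be computed. It is not the code behind the picture above. The function names (`word_rows`, `principal_axes`, `project_words`), the `min_count` filter, and the toy profiles are all made up for illustration, and power iteration is used only to keep the example free of external libraries; a library PCA or SVD routine would be the usual choice. Each word becomes a row of ones and zeros (one column per profile), the rows are centered, and each word is projected onto the top two principal axes.

```python
import math
import random
from collections import Counter


def word_rows(profiles, min_count=2):
    """Return (vocab, rows): one 0/1 row per word, one column per profile."""
    token_sets = [set(p.lower().split()) for p in profiles]
    counts = Counter(w for tokens in token_sets for w in tokens)
    # Assumption: drop words used in fewer than min_count profiles.
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    rows = [[1.0 if w in tokens else 0.0 for tokens in token_sets] for w in vocab]
    return vocab, rows


def center(rows):
    """Subtract each column's mean so the cloud of points is centered at 0."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def principal_axes(rows, k=2, iters=100, seed=0):
    """Top-k principal axes of centered rows, by power iteration with deflation."""
    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in rows[0]]
        for _ in range(iters):
            for axis in axes:  # stay orthogonal to the axes already found
                d = dot(v, axis)
                v = [a - d * b for a, b in zip(v, axis)]
            scores = [dot(row, v) for row in rows]
            v = [dot(scores, col) for col in zip(*rows)]  # v <- X^T X v
            norm = math.sqrt(dot(v, v))
            if norm < 1e-12:  # no variance left in this direction
                v = [0.0] * len(v)
                break
            v = [a / norm for a in v]
        axes.append(v)
    return axes


def project_words(profiles, k=2):
    """Map each word to its k-dimensional PCA coordinates."""
    vocab, rows = word_rows(profiles)
    rows = center(rows)
    axes = principal_axes(rows, k)
    return {w: [dot(row, axis) for axis in axes] for w, row in zip(vocab, rows)}


# Toy self-descriptions, invented for this example.
toy_profiles = [
    "genomics biology research",
    "genomics biology cancer",
    "apps developer mobile",
    "apps developer design",
    "libertarian citizen america",
    "libertarian citizen",
]
for word, xy in project_words(toy_profiles).items():
    print(word, [round(c, 2) for c in xy])
```

In this toy example, words that always appear together ("genomics" and "biology") land on the same point, while words from different groups land far apart, which is the effect the picture relies on.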

Another question we can ask is: do people who describe themselves similarly tend to tweet similarly? We answer this by projecting the tweets into two dimensions, projecting the self-descriptions into two dimensions, and seeing whether people who are close in tweet-space are also close in self-description space. The answer turns out to be yes--the correlation in closeness is positive and highly significant. This might be due to the same people tweeting the same things over and over again, so I removed the repeats, and the correlation was still positive. That remaining correlation turns out to be due to a group of Tweeters whose profiles describe them as news sites, and who tend to tweet different things than non-news accounts. When you take those out as well, the correlation disappears. I suspect that, in general, people with similar profiles tweet similar things; I also suspect that Twitter, Facebook and Google are way ahead of me on this one.
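One way to make "correlation in closeness" concrete is sketched below. The post does not spell out the exact measure, so Euclidean distance and Pearson correlation are assumptions here, and the function names and toy coordinates are invented. The idea: for every pair of Tweeters, compute their distance in tweet-space and their distance in self-description space, then correlate the two lists of distances.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (both must vary)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def closeness_correlation(tweet_xy, profile_xy):
    """Correlate pairwise distances in tweet-space with those in profile-space.

    tweet_xy[i] and profile_xy[i] must describe the same Tweeter.
    """
    return pearson(pairwise_distances(tweet_xy), pairwise_distances(profile_xy))


# Toy 2-D coordinates for four Tweeters, invented for this example.
toy_tweet_xy = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
toy_profile_xy = [(1.0, 1.0), (1.5, 1.0), (9.0, 8.0), (9.0, 9.0)]
print(closeness_correlation(toy_tweet_xy, toy_profile_xy))
```

A caveat on significance: pairwise distances are not independent of one another (each Tweeter appears in many pairs), so a textbook p-value for this correlation tends to look more impressive than it should. A permutation test that shuffles which profile goes with which Tweeter is one common alternative.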

What did they think?

Most people didn’t take a side at all, and just retweeted the news; 74% of the tweets were pretty much exact repetitions of earlier tweets [4]. I was disappointed by this lack of originality, but of course repeating exactly what you’ve been told is often valuable: if you’re a dividing cell, it prevents cancer, and if you’re a soldier, it prevents courts-martial. Here’s a plot of the number of original tweets as a function of the total number of tweets; the changes in slope are interesting. Between tweets 300 and 2000, there are relatively few original tweets, probably because most people are just retweeting the news without really thinking about it.
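The footnotes mention that the string comparison used Python’s difflib with a threshold of .8. A minimal sketch of that kind of check follows; the use of `SequenceMatcher.ratio()`, the `>=` comparison, the function names, and the toy tweets are my assumptions for illustration, not the original program.

```python
from difflib import SequenceMatcher
from itertools import accumulate


def flag_repeats(tweets, threshold=0.8):
    """Return one bool per tweet: True if it closely matches any earlier tweet."""
    flags = []
    for i, tweet in enumerate(tweets):
        flags.append(any(
            SequenceMatcher(None, tweet, earlier).ratio() >= threshold
            for earlier in tweets[:i]
        ))
    return flags


def cumulative_originals(flags):
    """Running count of original (non-repeat) tweets, in arrival order."""
    return list(accumulate(0 if repeat else 1 for repeat in flags))


# Toy tweets in arrival order, invented for this example.
toy_tweets = [
    "FDA orders 23andMe to stop selling its DNA test kits",
    "RT FDA orders 23andMe to stop selling its DNA test kits",
    "I want my money back!",
]
toy_flags = flag_repeats(toy_tweets)
print(toy_flags)
print(cumulative_originals(toy_flags))
```

The running count is what a plot of original tweets against total tweets would show. Note that comparing every tweet with every earlier tweet takes time proportional to the square of the number of tweets, which is tolerable for a few thousand tweets but not for millions.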

Most of the people expressing strong opinions supported 23andMe. When we filter on people using profanity, 15/16 tweets blame the FDA. (The exception: “@23andMe This is BS. I only bought these kits to learn about my health, and now I can't. I want my money back!”). When we filter on people expressing negative emotions, 16/19 blame the FDA (42 people express negative emotions, but 23 of them just say that they’re “sad”, leaving blame ambiguous). I wondered if looking only at negative words biased the sample towards people who feel negatively towards the FDA, so I looked instead at words indicating positive emotion, and found that 15/20 people who took a definite side favored 23andMe. I also looked at people expressing opinions on the lawsuit against 23andMe; 32 people simply retweeted news stories about the lawsuit, but of the 7 who took a side, all said the lawsuit was frivolous. Finally, when I looked at people with backgrounds in science, medicine, or biology, 17/20 who took a definite position supported 23andMe. There are also 52 tweets from libertarians who mock the #nannystate, a tweeter who refers to 23andMe CEO Anne Wojcicki as a “gummy bear”, and a Canadian who is so upset about the whole thing that he says “#IDontWantToLiveOnThisPlanetAnymore”. Of course, Twitter users probably represent a biased population: they may be exactly the sort of young, free-spirited, tech-savvy individuals who would like a company like 23andMe.

Whatever happens, we are lucky to live in such exciting times. In the words of Tweeter @LibrariNerd from Nilbog:

I’ve been saving all the emails I’m getting from 23andMe about it. Feels potentially historical.


1. PCA is an elegant technique that helps you visualize “high-dimensional data”, which has become a buzzword in our information-rich world. High-dimensional data just means that each datapoint takes a lot of numbers to represent: a Twitter post can be represented by a long row of ones and zeros, where each one or zero refers to the presence or absence of a certain word; a genotype (what we have at 23andMe) can be represented by a row of zeros, ones, and twos, where each number describes a particular location in the genome. High-dimensional data is difficult to visualize--we don’t do well in more than 3 dimensions--but PCA allows you to project the data down into 2 dimensions in a way that retains an essential property: points that are close together in the high-dimensional space will be close together in the 2-dimensional space.
2. “Libertarian” also appears right next to “single”, on which I have no comment.
3. Those familiar with PCA will note that this is a projection of the words, not the self-descriptions: the transpose of the document-term matrix. You can also project the original matrix, but it’s harder to fit the self-descriptions on one page; from what I could make out, you get a continuum of “biologist” to “general nerd”.
4. I used Python’s difflib for string comparison with a threshold of .8.
5. This dataset is somewhat incomplete for two reasons. a) I upgraded my program while it was running (so it could collect Tweeter self-descriptions and time zones as well as the raw tweets) and b) it crashed at 2 AM the first night, so there’s a period of a few hours when I’m missing data.
6. A note on the website: the website is known to have certain minor bugs which I will fix when I get the computer on which the code resides back from my boyfriend.

