Tuesday, August 19, 2014

How to Study the Rage of Millions of People

Dear Twitter: in this post, I offer to give away your data. I do this completely in good faith and for no monetary gain because I am a researcher, I think your data is fascinating, and I hope to help people make sense of it. I have reviewed your Terms of Service, read a number of research papers written on your data, and contacted employees at Twitter, and to my knowledge I am not in violation of any of your rules. But if I have misunderstood please contact me at emmap1 at alumni dot stanford dot edu and I am more than happy to comply with your requests.
I am very excited about Twitter because it combines two qualities.

1. People actually use it. Famous people -- it’s become standard for celebrities to say “Follow me on Twitter!” -- and more importantly, lots of people.

2. It makes massive amounts of data available in a way you can process with a computer. 500,000,000 tweets are sent every day and Twitter will give you up to 1% of those. And if I know what 1% I want -- for example, only Tweets containing the word “Spock” -- it will give me all of them, which means I can actually hear everything that’s being said on a topic by millions of people worldwide. And not just what’s being said, but who’s saying it -- how they describe themselves, where they live, who their friends are, and the last few thousand things they said [1].

[Pause so we can all process how incredibly cool this is.]

If you still don’t think this is incredibly cool, you’re either not paying attention or dead on the inside. Twitter is enabling new research on everything from the Mexican drug war to the Israel-Palestine conflict to earthquakes to the stock market. Just this week, it easily provided enough data for research papers on three topics I can think of off the top of my head: societal reactions to suicide using Robin-Williams-related tweets, altruistic behavior using #icebucketchallenge, and protests against racism using Ferguson-related tweets. I’ll come back to the last one in a second.

I want to study Twitter with you. Consequently, I am making three things available. (If you like working with data, you should read about the first two; if you just like reading about data, skip to the third.) The first is a tool that makes it easy to collect all the Tweets (and all the data for the Tweeters) that contain sets of words or phrases. Important caveat: this program only collects Tweets live -- it cannot search for Tweets in the past, because Twitter makes those very hard to get -- so you need to be quick on the draw. The tool may be slightly buggy -- let me know if you find weird things! -- but probably not seriously so: I have been using it more or less without incident for the last few months. You can turn it on, forget about it, and come back later to get your data.
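If you're curious what the tool does under the hood: stripped of the Twitter plumbing, it's just keyword matching on a live stream, with each matching Tweet appended to a file as a line of JSON. Here's a minimal sketch -- the function names and toy tweets are mine, not the tool's actual code, and a plain list stands in for the live stream:

```python
import json

def matches_track(tweet_text, track_terms):
    """True if the tweet contains any tracked term (case-insensitive),
    roughly mimicking the streaming API's keyword matching."""
    text = tweet_text.lower()
    return any(term.lower() in text for term in track_terms)

def collect(tweet_stream, track_terms, out_path):
    """Append matching tweets to a file, one JSON object per line, so a
    long-running collector can be left alone and harvested later."""
    kept = 0
    with open(out_path, "a") as f:
        for tweet in tweet_stream:
            if matches_track(tweet["text"], track_terms):
                f.write(json.dumps(tweet) + "\n")
                kept += 1
    return kept

# Toy stream standing in for the live feed:
stream = [{"text": "Spock is my favorite Vulcan"},
          {"text": "completely unrelated tweet"}]
print(collect(stream, ["spock"], "spock_tweets.json"))  # 1 tweet kept
```

The real tool talks to Twitter's streaming API (where the `track` parameter does this matching server-side); the append-one-line-per-tweet format is what makes "turn it on and come back later" work.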

So the first tool gives you raw data. The second is a tool that infers cool things from this raw data and returns them in a table that is easy to analyze. For example, you can sometimes use a Tweeter’s name to get their gender and race, as I describe here. You can use a technique called sentiment analysis to analyze the emotions in the Tweets, and watch how levels of sadness, anger, profanity, and so on change over time or by group. You can often figure out the Tweeter’s location from their timezone, and you can also get the local time, which is important if the phenomenon you’re studying has daily cycles. The documentation for this tool is here. Unfortunately, I cannot make the code publicly available, because it relies on a proprietary sentiment analysis library, although I may cut it down and release a less complicated version when I have time. But if you have a dataset you would like analyzed with it, shoot me an email!
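To give a flavor of the sentiment analysis step -- with the caveat that the actual tool uses a proprietary library, and these tiny word lists are placeholders I made up -- the basic word-list approach looks like this:

```python
# Tiny illustrative lexicons; real sentiment libraries use much larger,
# validated word lists.
ANGER_WORDS = {"angry", "furious", "hate", "outraged"}
SADNESS_WORDS = {"sad", "crying", "heartbroken", "mourning"}

def emotion_fractions(tweets):
    """For each emotion, the fraction of tweets containing at least one
    word from its list -- crude, but enough to compare groups or days."""
    def frac(words):
        hits = sum(any(w in t.lower().split() for w in words) for t in tweets)
        return hits / len(tweets)
    return {"anger": frac(ANGER_WORDS), "sadness": frac(SADNESS_WORDS)}

tweets = ["I am so angry about this",
          "heartbroken tonight",
          "just watched the game"]
print(emotion_fractions(tweets))  # anger and sadness each 1 tweet in 3
```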

The third thing, to illustrate the utility of the first two things, is an actual dataset of Tweets relating to the Ferguson shooting. I’ve been monitoring Twitter for about a week for hashtags like ferguson, iftheygunnedmedown, and handsupdontshoot, and I initially was collecting so many Tweets that I ended up keeping only a tenth; even so, it’s a few hundred thousand Tweets. It’s a very rich dataset, and I’ll probably do some more analysis on it myself after events play out, but email me if you’re interested in looking at it and we can discuss possibilities for collaboration. (Twitter’s Terms of Service prohibit me from just making the dataset publicly available.)

I’ve barely glanced at the data, but one thing I did do was take the most common hashtags and connect the ones which tended to appear together [2]. At first it’s a little hard to see what’s going on, but when we look closer we can see evidence of a rich and complex conversation:

1. There’s a purple cluster talking about the many other unjust police shootings, often in connection with the lastwords hashtag.

2. There’s a red cluster of Anonymous users -- a group of online activists who conducted cyberattacks against the Ferguson police department.

3. There’s a yellow cluster of Tea Party members and gun rights activists, who I’m sure have been made much less paranoid about abuses of government power because of this whole episode. Close to them is a more liberal group that includes hashtags like “p2” (Progressives 2.0), “libcrib”, “stoprush”, “ows” (Occupy Wall Street) and “civilrights”. Oddly, some military hashtags (“military” and “vets”) appear to be more connected to the liberals than the conservatives.

4. There’s a red-purple group of people who are advocating peaceful protests with hashtags like “love”, “unity”, “equality”, and “MLK”.

5. There’s a purple group of people drawing connections to Gaza, and close to them there’s another group of people drawing connections to other international events (“egypt”, “syria”, “ukraine”, “iraq”, “isis”).

If you want to explore further, zoom in and click on the circles. Clearly there’s a complicated and interesting conversation going on here, and even if there’s a lot of dirt in the data, there’s a lot of gold there as well; let me know if you’re interested in digging deeper! Here’s one question that occurred to me: there’s been a daily pattern with peaceful protests by day and more violence and anger at night. Can we see evidence of that cycle on Twitter? And, if we can, is it because a) the same people get angrier at night or b) different groups of people tweet by day and by night?
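For what it's worth, the day/night question only needs the local-time field described earlier. Here's a sketch of the kind of analysis I mean, with invented example tweets and a deliberately crude anger filter:

```python
from collections import defaultdict

def anger_by_period(tweets, is_angry):
    """Fraction of angry tweets by local day (6am-6pm) vs. night."""
    counts = defaultdict(lambda: [0, 0])  # period -> [angry, total]
    for hour, text in tweets:
        period = "day" if 6 <= hour < 18 else "night"
        counts[period][0] += is_angry(text)
        counts[period][1] += 1
    return {p: a / t for p, (a, t) in counts.items()}

# Invented (local_hour, text) pairs:
tweets = [(14, "peaceful march downtown"), (23, "furious tonight"),
          (10, "standing together"), (2, "so angry right now")]
angry = lambda t: any(w in t for w in ("angry", "furious"))
print(anger_by_period(tweets, angry))  # {'day': 0.0, 'night': 1.0}
```

To separate explanation (a) from (b), you'd run the same split per Tweeter rather than per tweet and see whether the day and night populations overlap.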

Do me a favor and if you do end up using any of this, or if you have thoughts for new or improved tools, please:

a) Shoot me an email and let me know what you find / ideas you have! I’d be happy to publish cool analyses here or collaborate to find other audiences for them as well. And if you’re not a computer person but you think of some cool societal trend, or you notice something important happening, let me know quickly and maybe we can track it!
b) Feel free to point people to the tools or this blog!


[1] Contrast this to other big data companies -- Google would never make individual level data like this available, and even when they make grouped data available (say, how many people are searching a certain term) they make it very hard to use a computer to get it quickly. (Google Trends is cool, but I could never use it to get, say, the volumes of 10,000 different searches over time.) Facebook requires me to get something called “user consent” (what?) to get most interesting data. I’m not criticizing Google and Facebook for keeping their data hidden, by the way; their users expect privacy. But the whole point of Twitter is that you’re a Twit in public, and users have no expectation of privacy. Twitter does conceal some information, like the user’s location, if the user chooses this in their privacy settings.
[2] This was created, incidentally, using NetworkX + Gephi, because I got excited about Gilad Lotan’s excellent talk on the combination.
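In case you want to replicate the graph: the co-occurrence counting is a few lines of Python (the sample tweets below are invented), and the resulting pair counts drop straight into NetworkX or Gephi as a weighted edge list.

```python
from collections import Counter
from itertools import combinations

def hashtag_cooccurrence(tweets_hashtags):
    """Count how often each pair of hashtags appears in the same tweet.
    Each key is a sorted (tag_a, tag_b) tuple; the count becomes the
    edge weight when you build the NetworkX graph for Gephi."""
    pairs = Counter()
    for tags in tweets_hashtags:
        pairs.update(combinations(sorted(set(tags)), 2))
    return pairs

# Invented sample; the real input is the hashtag list of each tweet.
tweets = [["ferguson", "handsupdontshoot"],
          ["ferguson", "mlk", "unity"],
          ["ferguson", "handsupdontshoot"]]
pairs = hashtag_cooccurrence(tweets)
print(pairs[("ferguson", "handsupdontshoot")])  # 2
```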

Monday, July 14, 2014

A Tale of Two Cities: The Twitter Reaction to the Return of Lebron James

At 9:31 AM Pacific time on July 11, LeBron James announced that he was returning to Cleveland, and Twitter exploded. (If you don’t know who LeBron James is, see [1] for backstory.) The frenzy was such that the New York Times ran a front-page story purely about the tweets. I collected more than 2 million of them, and learned some things about forgiveness, race, and fangirls.

One obvious question: did people on balance approve of James’ decision? The NYT did not attempt to figure this out -- come on, NYT! -- probably because they didn’t have the data and it’s hard to measure approval. One standard way to do it is to count words with positive and negative associations using a word list, but this is a bit dicey in this data; words like “fan” are usually positive, but here you have tweets like “LeBron fans suck”. Instead, I came up with customized phrases. For example, I recorded 1,392 tweets containing the phrase “I love LeBron” and 1,549 tweets containing the phrase “I hate LeBron”. But the latter group contained tweets like “Do I hate LeBron still? Nope” -- some people might loathe James for his prior mistakes, but admire this decision. Indeed, the data supports this idea:  
Phrase Pair                                Number of Tweets
“good decision” vs. “bad decision”         547 to 94
“good move” vs. “bad move”                 954 to 123
“smart move” vs. “stupid move”             390 to 20
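Counting phrase pairs like these takes almost no code once you have the Tweets in hand (the example tweets here are made up):

```python
def count_phrase(tweets, phrase):
    """Number of tweets containing the phrase, case-insensitively."""
    p = phrase.lower()
    return sum(p in t.lower() for t in tweets)

tweets = ["GOOD move by LeBron", "bad move honestly",
          "good move!!", "no comment"]
good, bad = count_phrase(tweets, "good move"), count_phrase(tweets, "bad move")
print(good, bad)  # 2 1
```

The hard part, as the “Do I hate LeBron still? Nope” example shows, is choosing phrases whose surrounding context doesn't flip their meaning.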

Overall, the Twitter data indicates that while James is still polarizing, this decision was popular. Obviously, however, not everyone was thrilled. I recorded 25k tweets from Tweeters who listed “Miami” in their location and compared those to the 21k tweets from Tweeters in Cleveland. This was a little sad. Miami fans used twice as many words expressing negative emotion, three times as many words expressing anger, and twice as much profanity. Interestingly, though, they were about four times as likely to express respect: they were sad and angry but also reluctantly impressed. We can also look at the hashtags which were particularly common in each city: some of them were obvious (191/191 “northeastohio” hashtags were from Cleveland) but some were interesting:

Miami Hashtags                     Cleveland Hashtags
smh (“shake my head”)              “lift the ban”
tfm (“total frat move”)            “I’m sorry”
(“Lift the ban” and “I’m sorry” turn out to relate to this crazy Cleveland fan who got himself banned from Cleveland games for a year for running onto the basketball court while James was playing and begging him to return. James patted him on the head as he was dragged away by security.)
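Finding city-distinctive hashtags is also simple: for each hashtag, compute the fraction of its uses coming from one city, ignoring hashtags too rare to trust. A sketch with hypothetical counts (only the 191/191 “northeastohio” figure is from the real data):

```python
def distinctive(counts_a, counts_b, min_total=5):
    """For each hashtag, the fraction of its uses coming from group A,
    skipping hashtags too rare to be meaningful."""
    fractions = {}
    for tag in set(counts_a) | set(counts_b):
        total = counts_a.get(tag, 0) + counts_b.get(tag, 0)
        if total >= min_total:
            fractions[tag] = counts_a.get(tag, 0) / total
    return fractions

# Hypothetical hashtag counts, except the 191 figure:
miami = {"smh": 80, "tfm": 30, "northeastohio": 0}
cleveland = {"smh": 20, "tfm": 5, "northeastohio": 191}
fracs = distinctive(miami, cleveland)
print(fracs["northeastohio"])  # 0.0 -- all 191 uses were from Cleveland
```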

Let’s talk about race. Twitter doesn’t provide race data, but I wanted to see if I could infer it for a few reasons:

1. Racial dynamics in professional basketball are often interesting: 76% of the players are black, as compared to 43% of coaches and 2% of owners. There have been a lot of race-related episodes: see the owner who was banned for life for racist remarks; Jesse Jackson’s allegations that LeBron James was being treated like a runaway slave; the differential popularity of LeBron James among different races; the racism against Jeremy Lin.
2. I’ve done a fair bit of work on gender dynamics, and women and racial minorities share many problems; studying race seems a natural extension.
3. It’s an interesting problem.

Obviously, race is very complicated -- at 23andMe, I’ve learned from our ancestry experts just how tangled the relationship between biological ancestry and self-identified race is -- and so any inference from Twitter data is going to be highly imperfect. Please keep this in mind before writing me blistering emails. I tried to identify Tweeters as black, white, Hispanic, or Asian, and used three methods to do so:

1. Tweeter self-description. Someone who uses the word “Asian” in their self-description is usually Asian, although obviously there are some false positives (people who use the word black but are saying they like black dresses, etc).
2. Tweeter last name. See here. This turns out to be very useful for Asian and Hispanic names, not so much for white vs black names.
3. Tweeter first name. Fryer and Levitt wrote a nice article about the consequences of having a distinctively black name; we can supplement their list of black and white names with data on baby names from NYC, which gives us Asian and Hispanic names as well.

People have been trying to get race from name for many years and it’s a lot more dicey than getting gender from name. The most basic problem is this: while someone who names their kid “Alabaster Snowflake” is probably white, they’re also probably not representative of the general white population. The people for whom you can identify race from name are going to be unusual. Similarly, someone who identifies herself as Asian on her Twitter profile may not be representative of Asians generally. So we’re not really comparing white people to Asian people, we’re comparing people with distinctively white names to people with distinctively Asian names [2]; similarly for profiles. To emphasize this distinction, I'm going to refer to tweeters not as "Asian" but as "d-Asian" -- ie, distinctively Asian.
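To make the "distinctive" restriction concrete, here's a sketch of the classification rule; the name table is entirely made up for illustration:

```python
def classify(name, name_race_freqs, threshold=0.9):
    """Return a race label only if that race accounts for at least
    `threshold` of the name's bearers; otherwise refuse to guess.
    This refusal is exactly what makes the classified sample
    'distinctive' -- and unrepresentative."""
    freqs = name_race_freqs.get(name.lower())
    if not freqs:
        return None
    race, frac = max(freqs.items(), key=lambda kv: kv[1])
    return race if frac >= threshold else None

# Hypothetical name-frequency table, for illustration only:
table = {"deshawn": {"black": 0.95, "white": 0.05},
         "jordan": {"black": 0.40, "white": 0.45, "hispanic": 0.15}}
print(classify("DeShawn", table))  # 'black' -- distinctive, so classified
print(classify("Jordan", table))   # None -- ambiguous, so skipped
```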

I was able to identify 124k tweets from d-White tweeters, 32k from d-Hispanic tweeters, 12k from d-Black tweeters, and 7k from d-Asian tweeters (in North America). I could not identify clear racial differences in whether Tweeters approved of James’ decision, but I found other interesting differences. d-Asian tweeters do, in fact, tend to tweet about Jeremy Lin; 56% of tweets containing “jlin” come from d-Asians. d-Hispanic tweeters are especially likely to use hashtags supporting teams in Los Angeles, San Antonio, and Miami -- all cities with large Hispanic populations -- and, unsurprisingly, tend to use Spanish words. d-Black tweeters also tended to use different language: “finna”, “ima”, “tryna”, and “yall” were among the words that increased in frequency most among d-Black Tweeters, as were various versions of n*****. (d-Black tweeters were about four times as likely as d-White tweeters to use n***a, with d-Asians and d-Hispanics falling in the middle.)

I also looked at gender. Only about 17% of tweets came from women, and some of the male tweeters complained about how female tweeters were just tweeting “I looooove LeBron!” But the stereotype of the sweet-spoken fangirls turns out to be wrong: the girls tweeting about James express more anger and use more profanity than the guys, and while they are indeed more likely to say they love him, they’re more likely to say they hate him, too. And forget about the welcoming female domestic stereotype: female tweeters are actually slightly (but statistically significantly) less likely to use variants of “welcome home”. These results surprised me enough that I checked whether my filters were broken (I don’t think they are); one explanation is that interest in basketball is somewhat unusual for women, and that women who tweet about LeBron James are unusual in other ways as well. (Alternately, there might be some weird correlation between gender and another variable, like location.)

This is about as much time as I'm willing to spend studying LeBron James; on the other hand, if you could infer race in a way that doesn't introduce weird biases, that would be exciting and powerful, so let me know if you have ideas about that. Also, I realize that race (like gender) is a fraught topic, so please let me know if anything I've written seems insensitive or inaccurate.

[1] LeBron James is one of the greatest and most polarizing basketball players of all time. At 18, he began his career playing for Cleveland, a sad sports city that hasn’t won a championship since 1964; then he broke their hearts and drew widespread disgust by announcing in a graceless press conference that he was leaving to join two superstars on Miami’s team.
[2] I initially thought I could get around this problem by looking at all names and simply assigning each name a score for each race depending on how frequently it was used for that race (rather than just looking at names with >90% confidence for a particular race); this would incorporate data for all Tweeters rather than just the distinctive name ones, and then you could just run a regression on the name race score. I think this runs into a similar problem, though, because you find that for black last names, for example, very few Tweeters have names which strongly indicate that they are black, which may mean that whatever signal you get is predominantly driven by these distinctive Tweeters.

Tuesday, July 8, 2014

From Kale to Cancer: Using Multiple Datasets to Kill Bias

I work just down the block from Google, and I get the sense that it’s devouring me: every time I drive home I’m surrounded by rainbow Google bikes, giant Google buses, Google Street View cars, Google self-driving cars. But I still haven’t managed to get what I actually want: their data.

Recently, however, I found a way to combine the 23andMe and Google datasets and thus achieve absolute power -- er, increase my faith in 23andMe’s dataset. We describe the work on 23andMe’s blog here. It relies on a tool called Google Correlate, which was used, among other things, to build Google Flu Trends: basically, Google Correlate lets you enter a number for every state and see which search terms show the strongest correlation with that state-by-state pattern. For example, when I enter the average latitude for each state, I see that the search terms most frequent in Northern states are those you would expect: “how much vitamin D”, “heated seats”, “seasonal affective disorder”, etc.
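Under the hood this is nothing more exotic than a Pearson correlation between two vectors with one number per state. A sketch with five states -- the latitudes are approximate, and the search index numbers are invented:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length state vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Approximate state latitudes vs. a made-up relative frequency of
# "seasonal affective disorder" searches:
latitude = [61.4, 27.8, 44.3, 31.0, 47.1]  # AK, FL, WI, TX, WA (approx.)
sad_searches = [0.9, 0.1, 0.7, 0.2, 0.8]   # hypothetical search index
print(round(pearson(latitude, sad_searches), 2))  # 0.95
```

Google Correlate does this against millions of candidate search terms at once and returns the strongest matches, which is what makes it feel like magic.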

You can do this with 23andMe data: we know where many of our customers live, and we have their answers to thousands of survey questions, and this allowed me to make maps for more than 1,500 health and behavioral traits. I put the data behind those maps into Google Correlate. I started with 23andMe customer answers to “How often do you eat leafy greens?”, and out of the billions of Google searches, one of the most strongly associated was “raw kale”. States with high rates of 23andMe customers with coronary artery disease also had high rates of Google searches for “statin drugs”, which lower cholesterol. Just as striking were the negative correlations.

The inverse pattern is clear (and highly statistically significant, even after multiple hypothesis correction). These connections are fun and surprising, given the billions of Google searches, the thousands of 23andMe traits, and the fact that typing in a Google query is very different than answering a medical survey. But I want to talk about a larger point here, which is how we can combine multiple datasets to overcome bias, the specter that haunts my data-related dreams.

By bias I mean “something that makes the number you’re estimating different from the number you want to estimate, no matter how much data you get”. Let’s say you’re trying to figure out whether there’s a difference in physical attractiveness between chessplayers and non-chessplayers. (Yes.) So you go on a dating website with a lot of chessplayers (checkmates.com) and you download all the pictures and you write an algorithm which evaluates the attractiveness of a picture and you compare the algorithm’s output for chessplayers and non-chessplayers. You find that your algorithm says that chessplayers are on average 15% more attractive. You should immediately worry about two things. First, is that difference statistically significant? If you only looked at 4 chessplayers, and one guy was egregiously ugly, it’s probably not. But statistical significance is usually the easy problem, because it can be solved by getting more data; that’s hard if you’re trying to recruit, say, people with a rare psychiatric condition, but I work with datasets which are large enough that statistical significance is rarely an issue. Any difference large enough to care about has like a chance in a trillion of being due to chance.

But bias is a much more insidious problem, because it cannot be solved by getting more data. Here are some biases we might see in the chessplayer problem:

a) Maybe chessplayers who post photos online are more attractive than the average chessplayer.
b) Maybe chessplayers who engage in online dating at all are more attractive than the average chessplayer.
c) An algorithm that measures attractiveness? Really? I’m not a computer vision expert, but I’d be immediately worried that the algorithm was being biased by differences that had nothing to do with the faces. Maybe chessplayers tend to pose with chessboards, and that creates a tiny discrepancy in how the algorithm evaluates them.

In all of these cases, bias means that what you’re trying to measure -- the difference in attractiveness between chessplayers and non-chessplayers in the general population -- is not what you’re actually measuring. More data will not fix bias, because you’ll always be measuring the wrong thing.

We can think of these two problems -- statistical significance and bias -- in terms of romantic pursuits. A statistical significance problem is like when you ask someone out and they say, “I’m sorry, but I just don’t know you well enough yet” -- they need more data. But a bias problem is when they say, “I’m sorry, but I just don’t like you” -- no matter how much more of you they get, you’re still not going to be what they want [1].

I’m now being accused of being frivolous, so here are two more important examples. Why do we care about low voter turnout? It isn’t because we don’t have enough votes to detect a statistically significant difference between candidates. Even in a very close election -- 51 - 49, say, which even in the 2000 presidential race occurred in only 6 states -- the difference will be statistically significant (p < .05) even if only 10,000 people turn out to vote. We worry about low voter turnout because it often produces bias: those who vote have higher incomes, are better educated, are less likely to be minorities, etc. (There are rare cases where the election is so close that the difference actually isn’t statistically significant -- not to start a flame war, but in Florida in 2000, Bush would’ve had to win by about 5,000 votes, not the 537 he actually won by. Even then we might’ve worried about bias due to irregularities in the election process -- but luckily we can all sleep soundly thanks to the completely non-partisan Supreme Court decision.) I think if you actually wanted to measure who the country really wants to be in charge, you would just contact a small randomized sample of people in each state and not let anyone else vote; that would better deal with the bias problem.
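If you don't believe the 10,000-voter claim, the normal approximation to the binomial checks it in a few lines:

```python
from math import erfc, sqrt

def two_sided_p(share, n):
    """Two-sided p-value for observing vote share `share` in n votes
    under the null hypothesis of a 50-50 electorate
    (normal approximation to the binomial)."""
    z = (share - 0.5) / sqrt(0.25 / n)
    return erfc(z / sqrt(2))

p = two_sided_p(0.51, 10_000)
print(round(p, 3))  # 0.046 -- significant at p < .05, as claimed
```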

Let’s take an example which is less likely to get me mean emails: developing a blood test to detect cancer while it’s still treatable. A lot of people have tried to do this, and pretty much the same thing always happens: they write a paper saying that you can detect cancer X by looking at the levels of molecule Y, everyone gets excited, other people try to replicate the results and they never can. The most plausible reason for this is that the original results were due to bias. What you want is a blood test that can tell the difference between apparently healthy people who secretly have cancer and apparently healthy people without cancer. But apparently healthy people who secretly have cancer are hard to find, because most cancers are rare, so you would have to take blood from a lot of healthy people -- hundreds of thousands, in some cases. So most scientists use a shortcut: rather than taking blood from apparently healthy people with cancer, they just take blood from people with cancer. Those people are easy to find: you just go to a hospital. Unfortunately, this means you haven’t designed a blood test that can detect cancer in apparently healthy people -- you’ve designed a blood test that can detect cancer in people we already know have cancer at your particular hospital. Which is fine, if these two groups have the same blood -- but often they don’t. For example, cancer treatment itself messes with your blood -- blood samples may be collected while the patient is under anesthesia, or undergoing chemotherapy, which both alter your blood. So you haven’t created a cancer detector; you’ve created a chemotherapy detector. Or maybe your cancer and healthy populations are different for reasons you don’t care about -- one attempt to develop a screen for prostate cancer compared cancer patients (who were all men) to healthy controls (who were all women). So then you’ve created a sex detector.

Summary: bias undermines democracy and kills people. What do we do about it? There are standard practices for reducing bias -- controlling for obvious things like sex, doing double-blind studies so your results aren’t influenced by what you want to see. But I still tend to be very nervous about bias, in part because I’m a nervous person and in part because everything is intercorrelated, so tiny discrepancies in variables you don’t care about can produce discrepancies in variables you do. Another powerful means of determining whether a pattern is due to bias is to see if you see the pattern in a very different dataset. Because while both datasets likely have biases, they’re unlikely to be the same biases: so if you see the same pattern in both, it’s more likely to be due to something real. To return to 23andMe and Google, 23andMe survey results are probably biased by the fact that 23andMe customers aren’t necessarily a representative sample of the general population. Google is going to suffer less from this, since its product is more pervasive, but might be biased by the fact that Google searches map only weakly onto what someone is actually thinking. (For example, if more people Google “I want to have sex with sheep” than “I want to have sex with my girlfriend”, that might not be because more people want to have sex with sheep than with their girlfriend; it might be because they’re freaked out about the sheep, and turn to Google.) Both datasets are going to be biased by the fact that you have to be able to use a computer to use Google or enter 23andMe survey answers, but this is a considerably less scary bias than the ones in the independent analyses.

Obviously, the 23andMe/Google analysis is more of a fun proof-of-concept than a rigorous statistical analysis, but the compare-multiple-datasets idea is an important one. To return to biomedical research: often preliminary analysis will be done on a sample from one hospital, and the results will then be checked on a sample from another hospital; this controls for hospital-specific biases. Or you might do analysis on a dataset of a completely different nature: here’s a lovely analysis that starts by looking at gene expression data over time, draws some conclusions, then investigates whether those conclusions are true by actually knocking out some of the genes and analyzing that dataset as well, and compares these results to another dataset that used another method to analyze the genes.

I’m not saying that every study that uses a single dataset is biased or unpublishable -- clearly that isn’t true. But attempting to replicate results in a different dataset is almost always worthwhile, because anticipating every bias is impossible. This is particularly true if you’re generically curious, like I am, and you’re often poking your nose into fields in which you have no business or background knowledge.


[1] In practice, I suspect many people who say the former actually mean the latter, but that is not applicable to our analogy.