Obsession with Regression: August 2014

Dear Twitter: in this post, I offer to give away your data. I do this completely in good faith and for no monetary gain because I am a researcher, I think your data is fascinating, and I hope to help people make sense of it. I have reviewed your Terms of Service, read a number of research papers written on your data, and contacted employees at Twitter, and to my knowledge I am not in violation of any of your rules. But if I have misunderstood please contact me at emmap1 at alumni dot stanford dot edu and I am more than happy to comply with your requests.

I am very excited about Twitter because it combines two qualities.

1. People actually use it. Famous people -- it’s become standard for celebrities to say “Follow me on Twitter!” -- and more importantly, lots of people.

2. It makes massive amounts of data available in a way you can process with a computer. 500,000,000 tweets are sent every day and Twitter will give you up to 1% of those. And if I know what 1% I want -- for example, only Tweets containing the word “Spock” -- it will give me all of them, which means I can actually hear everything that’s being said on a topic by millions of people worldwide. And not just what’s being said, but who’s saying it -- how they describe themselves, where they live, who their friends are, and the last few thousand things they said [1].

[Pause so we can all process how incredibly cool this is.]

If you still don’t think this is incredibly cool, you’re either not paying attention or dead on the inside. Twitter is enabling new research on everything from the Mexican drug war to the Israel-Palestine conflict to earthquakes to the stock market. Just this week, it easily provided enough data for research papers on three topics I can think of off the top of my head: societal reactions to suicide using Robin-Williams-related tweets, altruistic behavior using #icebucketchallenge, and protests against racism using Ferguson-related tweets. I’ll come back to the last one in a second.

I want to study Twitter with you. Consequently I am making three things available. (If you like working with data you should read about the first two; if you just like reading about data, you should skip to the third.) The first is a tool that makes it easy to collect all the Tweets (and all the data for the Tweeters) that contain sets of words or phrases. Important caveat: this program will only collect Tweets live -- it cannot search for Tweets in the past, because Twitter makes it very hard to get these -- so you need to be quick on the draw. This tool may be slightly buggy -- let me know if you find weird things! -- but it’s probably not seriously buggy because I have been using it more or less without incident for the last few months; you can turn it on, forget about it, and come back later to get your data.

So the first tool gives you raw data. The second thing is a tool that infers cool things from this raw data and returns it in a table which is easy to analyze. For example, you can sometimes use a Tweeter’s name to get their gender and race, as I describe here. You can use a technique called sentiment analysis to analyze the emotions in the Tweets, and watch how levels of sadness, anger, profanity, and so on change over time, or by group. You can often figure out the Tweeter’s location from their timezone, and you can also get the local time, which is important if the phenomenon you’re studying has daily cycles. The documentation for this tool is here. Unfortunately, I cannot make the code for the tool publicly available because it relies on a sentiment analysis library which is proprietary, although I may cut it down and release a less complicated version when I have time. But if you have a dataset which you would like to use it to analyze, shoot me an email!

The third thing, to illustrate the utility of the first two things, is an actual dataset of Tweets relating to the Ferguson shooting. I’ve been monitoring Twitter for about a week for hashtags like ferguson, iftheygunnedmedown, and handsupdontshoot, and I initially was collecting so many Tweets that I ended up keeping only a tenth; even so, it’s a few hundred thousand Tweets. It’s a very rich dataset, and I’ll probably do some more analysis on it myself after events play out, but email me if you’re interested in looking at it and we can discuss possibilities for collaboration. (Twitter’s Terms of Service prohibit me from just making the dataset publicly available.)

I’ve barely glanced at the data, but one thing I did do was take the most common hashtags and connect the ones which tended to appear together [2]. At first it’s a little hard to see what’s going on, but when we look closer we can see evidence of a rich and complex conversation:

1. There’s a purple cluster talking about the many other unjust police shootings, often in connection with the lastwords hashtag.

2. There’s a red cluster of Anonymous users -- a group of online activists who conducted cyberattacks against the Ferguson police department.

3. There’s a yellow cluster of Tea Party members and gun rights activists, who I’m sure have been made much less paranoid about abuses of government power because of this whole episode. Close to them is a more liberal group that includes hashtags like “p2” (Progressives 2.0), “libcrib”, “stoprush”, “ows” (Occupy Wall Street) and “civilrights”. Oddly, some military hashtags (“military” and “vets”) appear to be more connected to the liberals than the conservatives.

4. There’s a red-purple group of people who are advocating peaceful protests with hashtags like “love”, “unity”, “equality”, and “MLK”.

5. There’s a purple group of people drawing connections to Gaza, and close to them there’s another group of people drawing connections to other international events (“egypt”, “syria”, “ukraine”, “iraq”, “isis”).

If you want to explore further, zoom in and click on the circles. Clearly there’s a complicated and interesting conversation going on here, and even if there’s a lot of dirt in the data, there’s a lot of gold there as well; let me know if you’re interested in digging deeper! Here’s one question that occurred to me: there’s been a daily pattern with peaceful protests by day and more violence and anger at night. Can we see evidence of that cycle on Twitter? And, if we can, is it because a) the same people get angrier at night or b) different groups of people tweet by day and by night?

Do me a favor and if you do end up using any of this, or if you have thoughts for new or improved tools, please:

a) Shoot me an email and let me know what you find / ideas you have! I’d be happy to publish cool analyses here or collaborate to find other audiences for them as well. And if you’re not a computer person but you think of some cool societal trend, or you notice something important happening, let me know quickly and maybe we can track it!

b) Feel free to point people to the tools or this blog!

Notes:

[1] Contrast this to other big data companies -- Google would never make individual level data like this available, and even when they make grouped data available (say, how many people are searching a certain term) they make it very hard to use a computer to get it quickly. (Google Trends is cool, but I could never use it to get, say, the volumes of 10,000 different searches over time.) Facebook requires me to get something called “user consent” (what?) to get most interesting data. I’m not criticizing Google and Facebook for keeping their data hidden, by the way; their users expect privacy. But the whole point of Twitter is that you’re a Twit in public, and users have no expectation of privacy. Twitter does conceal some information, like the user’s location, if the user chooses this in their privacy settings.

[2] This was created, incidentally, using NetworkX + Gephi, because I got excited about Gilad Lotan’s excellent talk on the combination.

Obsession with Regression

Thursday, August 28, 2014

The Dark Side of Viral Rage

Tuesday, August 19, 2014

How to Study the Rage of Millions of People