Thursday, November 27, 2014

Ferguson FAQ

Recently I published an analysis of the Ferguson conflict that showed, using Twitter data, that there was a “red group” and a “blue group” who rarely talked to each other, thought very different things, came from very different backgrounds, and often were uncivil even when they did talk. Thanks to everyone who wrote to me about the analysis! Here are answers to the most common questions I’ve received.

What data did you use?

215,000 tweets containing the Ferguson hashtag collected between November 17th and 19th (prior to the announcement of the verdict).

What tools did you use to collect the data?

Python -- specifically, the tweepy library and a program I wrote, which you can find here (described at greater length here).

What tools did you use to analyze the data and make the visualization?

Python for analysis; Gephi for visualization. See Gilad Lotan’s excellent tutorial on how to use Gephi to analyze Twitter data.

How did you divide Tweeters into red and blue groups?

I used Gephi's community detection algorithm, sometimes known as the Louvain method. I ran it on the adjacency matrix M for the most frequent tweeters, where M_ij was 1 if tweeter i had mentioned tweeter j in a tweet. Essentially, this divides Tweeters into groups whose members mention each other frequently.
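To make the matrix concrete, here is a minimal sketch of how a mention matrix like this could be built. It is not the code used for the analysis: the function name `mention_matrix`, the `(author, text)` input format, the `top_n` cutoff, and the toy tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

# Matches @username mentions inside tweet text.
MENTION_RE = re.compile(r"@(\w+)")

def mention_matrix(tweets, top_n):
    """Build a 0/1 mention matrix for the top_n most frequent tweeters.

    tweets is an iterable of (author, text) pairs (a hypothetical format).
    Returns (users, M) where M[i][j] == 1 if users[i] mentioned users[j].
    """
    tweets = list(tweets)
    counts = Counter(author.lower() for author, _ in tweets)
    # Most frequent first; ties broken by name so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    users = [name for name, _ in ranked[:top_n]]
    index = {name: i for i, name in enumerate(users)}

    M = [[0] * len(users) for _ in users]
    for author, text in tweets:
        i = index.get(author.lower())
        if i is None:
            continue  # author is not among the most frequent tweeters
        for name in MENTION_RE.findall(text):
            j = index.get(name.lower())
            if j is not None and j != i:  # ignore self-mentions
                M[i][j] = 1
    return users, M

# Toy data, invented purely for illustration.
toy_tweets = [
    ("alice", "@bob agreed #Ferguson"),
    ("bob", "thanks @alice #Ferguson"),
    ("alice", "more thoughts #Ferguson"),
    ("carol", "@dave look at this #Ferguson"),
    ("dave", "@carol @alice #Ferguson"),
]
users, M = mention_matrix(toy_tweets, top_n=4)
```

The resulting matrix is directed (i mentioning j is not the same as j mentioning i). It could be exported as an edge list and loaded into Gephi, where the community detection itself runs.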

Is this grouping valid? As I note in the piece, there are many ways to group data, and I think this is worth exploring further. One problem we always face is deciding how many groups there are (see here and here). A clustering algorithm will split the data into groups even when there is no real separation between them, so you can always make it look like people hate each other -- this is something to be wary of when looking at analyses like this one.
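One common numeric check on whether a split reflects real separation is modularity, the quantity the Louvain method tries to maximize. The sketch below computes the standard undirected (Newman) modularity for a given partition. It assumes a symmetric 0/1 adjacency matrix, which is a simplification of the directed mention matrix, and the toy graph (two triangles joined by a single edge) is invented for illustration.

```python
def modularity(A, labels):
    """Newman modularity of a partition of an undirected graph.

    A is a symmetric 0/1 adjacency matrix (list of lists).
    labels[i] is the group assigned to node i.
    """
    n = len(A)
    degree = [sum(row) for row in A]
    two_m = sum(degree)  # twice the number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Observed edge minus the edge weight expected by chance.
                q += A[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

def symmetric_matrix(n, edges):
    """Helper (hypothetical) that builds a symmetric matrix from an edge list."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1
    return A

# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = symmetric_matrix(6, edges)

q_natural = modularity(A, [0, 0, 0, 1, 1, 1])    # split along the bridge
q_arbitrary = modularity(A, [0, 1, 0, 1, 0, 1])  # split that ignores structure
```

Here the split along the bridge scores 5/14 (about 0.36), while the arbitrary split scores below zero. A score near or below zero suggests the groups are no more tightly connected than chance would predict, though a high score alone does not prove the grouping is meaningful.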

But I think several pieces of evidence (in addition to Gephi's striking visual) point to the validity of the red / blue division. The fact that the two groups are associated with the tweeters’ self-descriptions (like race and political affiliation) is revealing; the fact that the two groups are associated with tweeting different things is also revealing (and by no means something I expected to see -- for example, if you divide Twitter datasets by gender, you will frequently find that men and women tweet essentially similar things). This evidence is powerful because it is external -- it was not used to come up with the grouping, but it supports it.
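A check of this kind can be sketched as a simple comparison of how often a term appears in the self-descriptions of each group. The code below is an illustration only, not the analysis behind the piece: the function name `keyword_share_by_group`, the placeholder terms, and the toy profiles are all assumptions.

```python
from collections import defaultdict

def keyword_share_by_group(groups, descriptions, keyword):
    """Fraction of users in each group whose description contains keyword.

    groups[i] is the cluster label of user i; descriptions[i] is that
    user's self-description. Neither was used to form the clusters.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, text in zip(groups, descriptions):
        totals[group] += 1
        if keyword.lower() in text.lower():
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Toy data with placeholder terms, invented purely for illustration.
groups = ["red", "red", "blue", "blue", "blue"]
descriptions = [
    "term_a fan",
    "term_a and term_b",
    "term_b only",
    "nothing relevant",
    "term_b again",
]
shares = keyword_share_by_group(groups, descriptions, "term_a")
```

If the shares differ sharply between groups, as they do in this toy data, the attribute is associated with the grouping. On real data the difference would still need a significance test before drawing conclusions.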

In general, we often bring in such external evidence to argue that a grouping is valid. For example, in a biological analysis we might cluster genes into groups that show similar expression patterns (group A highly expressed in the liver and not in the lungs; group B highly expressed in the lungs and not in the liver). We would be more confident that the groups we had found were “real” if there were external evidence, like a transcription factor known to turn on all the genes in group A, or a biological function common to all the genes in group A.

You said the blue cluster is much larger than the red cluster. What happens if you break down the blue cluster further?

I don’t know! Someone should figure this out.

Can I see your data or code?

Yes. I cannot make the data publicly available because of Twitter’s terms of service, but if you are a researcher with a project, shoot me an email. In addition to the two days of data used in this analysis, I also have several million tweets both from several months ago, when Ferguson initially made the news, and from after the verdict was announced.

As always, if you work at Twitter and have any objection to any of this, please email me -- I am acting in good faith and more than happy to comply with your requests.
