Obsession with Regression: A Tale of Two Cities: The Twitter Reaction to the Return of Lebron James

At 9:31 AM PST on July 11, LeBron James announced that he was returning to Cleveland, and Twitter exploded. (If you don’t know who LeBron James is, see [1] for backstory.) The frenzy was such that the New York Times ran a front page story purely about the tweets. I collected more than 2 million of them, and learned some things about forgiveness, race, and fangirls.

One obvious question: did people on balance approve of James’ decision? The NYT did not attempt to figure this out -- come on, NYT! -- probably because they didn’t have the data and it’s hard to measure approval. One standard way to do it is to count words with positive and negative associations using a word list, but this is a bit dicey in this data; words like “fan” are usually positive, but here you have tweets like “LeBron fans suck”. Instead, I came up with customized phrases. For example, I recorded 1,392 tweets containing the phrase “I love LeBron” and 1,549 tweets containing the phrase “I hate LeBron”. But the latter group contained tweets like “Do I hate LeBron still? Nope” -- some people might loathe James for his prior mistakes, but admire this decision. Indeed, the data supports this idea:

Phrase Pair	Number of Tweets
“good decision” compared to “bad decision”	547 to 94
“good move” compared to “bad move”	954 to 123
“smart move” compared to “stupid move”	390 to 20

Overall, the Twitter data indicates that while James is still polarizing, this decision was popular. Obviously, however, not everyone was thrilled. I recorded 25k tweets from Tweeters who listed “Miami” in their location and compared those to the 21k tweets from Tweeters in Cleveland. This was a little sad. Miami fans used twice as many words expressing negative emotion, three times as many words expressing anger, twice as much profanity. Interestingly, though, they were about four times as likely to express respect: they were sad and angry but also reluctantly impressed. We can also look at the hashtags which were particularly common in each city: some of them were obvious (191/191 “northeastohio” hashtags were from Cleveland) but some were interesting:

Miami Hashtags	Fraction From Miami	Cleveland Hashtags	Fraction from Cleveland
goodluck	15/15	forgiveness	14/14
neverforget	11/11	happydaysforthecityistillcallhome	48/48
lebronliquidation	28/28	cosmic	17/17
thankyoulebron	98/102	cavsgoodkarma	87/87
notcool	179/193	welcomehomelebron	219/224
smh (“shake my head”)	12/12	unfinishedbusiness	20/21
tfm (“total frat move”)	13/19	lifttheban	56/61
respect	49/60	imsorry	53/58

(“Lift the ban” and “I’m sorry” turn out to relate to this crazy Cleveland fan who got himself banned from Cleveland games for a year for running onto the basketball court while James was playing and begging him to return. James patted him on the head as he was dragged away by security.)

Let’s talk about race. Twitter doesn’t provide race data, but I wanted to see if I could infer it for a few reasons:

1. Racial dynamics in professional basketball are often interesting: 76% of the players are black, as compared to 43% of coaches and 2% of owners. There have been a lot of race-related episodes: see the owner who was banned for life for racist remarks; Jesse Jackson’s allegations that Lebron James was being treated like a runaway slave; the differential popularity of Lebron James among different races; the racism against Jeremy Lin.

2. I’ve done a fair bit of work on gender dynamics, and women and racial minorities share many problems; studying race seems a natural extension.

3. It’s an interesting problem.

Obviously, race is very complicated -- at 23andMe, I’ve learned from our ancestry experts just how tangled the relationship between biological ancestry and self-identified race is -- and so any inference from Twitter data is going to be highly imperfect. Please keep this in mind before writing me blistering emails. I tried to identify Tweeters as black, white, Hispanic, or Asian, and used three methods to do so:

1. Tweeter self-description. Someone who uses the word “Asian” in their self-description is usually Asian, although obviously there are some false positives (people who use the word black but are saying they like black dresses, etc).

2. Tweeter last name. See here. This turns out to be very useful for Asian and Hispanic names, not so much for white vs black names.

3. Tweeter first name. Freyer and Levitt wrote a nice article about the consequences of having a distinctively black name; we can supplement their list of black and white names with data on baby names from NYC, which gives us Asian and Hispanic names as well.

People have been trying to get race from name for many years and it’s a lot more dicey than getting gender from name. The most basic problem is this: while someone who names their kid “Alabaster Snowflake” is probably white, they’re also probably not representative of the general white population. The people for whom you can identify race from name are going to be unusual. Similarly, someone who identifies herself as Asian on her Twitter profile may not be representative of Asians generally. So we’re not really comparing white people to Asian people, we’re comparing people with distinctively white names to people with distinctively Asian names [2]; similarly for profiles. To emphasize this distinction, I'm going to refer to tweeters not as "Asian" but as "d-Asian" -- ie, distinctively Asian.

I was able to identify 124k tweets from d-White tweeters, 32k from d-Hispanic tweeters, 12k from d-black tweeters, and 7k from d-Asian tweeters (in North America). I could not identify clear racial differences in whether Tweeters approved of James’ decision, but I found other interesting differences. d-Asian tweeters do, in fact, tend to tweet about Jeremy Lin; 56% of tweets containing “jlin” come from d-Asians. d-Hispanic tweeters are especially likely to use hashtags supporting teams in Los Angeles, San Antonio, and Miami -- all cities with large Hispanic populations -- and, unsurprisingly, tend to use Spanish words. d-black tweeters also tended to use different language: “finna”, “ima”, “tryna”, and “yall” were among the words that increased in frequency most among d-black Tweeters, as were various versions of n*****. (d-Black tweeters were about four times as likely as d-white tweeters to use n***a, with d-Asians and d-Hispanics falling in the middle.)

I also looked at gender. Only about 17% of tweets came from women, and some of the male tweeters complained about how female tweeters were just tweeting “I looooove LeBron!” But the stereotype of the sweet-spoken fangirls turns out to be wrong: the girls tweeting about James express more anger and use more profanity than the guys, and while they are indeed more likely to say they love him, they’re more likely to say they hate him, too. And forget about the welcoming female domestic stereotype: female tweeters are actually slightly (but statistically significantly) less likely to use variants of “welcome home”. These results surprised me enough that I checked whether my filters were broken (I don’t think they are); one explanation is that interest in basketball is somewhat unusual for women, and that women who tweet about LeBron James are unusual in other ways as well. (Alternately, there might be some weird correlation between gender and another variable, like location.)

This is about as much time as I'm willing to spend studying LeBron James; on the other hand, if you could infer race in a way that doesn't introduce weird biases, that would be exciting and powerful, so let me know if you have ideas about that. Also, I realize that race (like gender) is a fraught topic, so please let me know if anything I've written seems insensitive or inaccurate.

Notes:

[1] LeBron James is one of the greatest and most polarizing basketball players of all time. At 18, he began his career playing for Cleveland, a sad sports city that hasn’t won a championship since 1964; then he broke their hearts and drew widespread disgust by announcing in a graceless press conference that he was leaving to join two superstars on Miami’s team.

[2] I initially thought I could get around this problem by looking at all names and simply assigning each name a score for each race depending on how frequently it was used for that race (rather than just looking at names with >90% confidence for a particular race); this would incorporate data for all Tweeters rather than just the distinctive name ones, and then you could just run a regression on the name race score. I think this runs into a similar problem, though, because you find that for black last names, for example, very few Tweeters have names which strongly indicate that they are black, which may mean that whatever signal you get is predominantly driven by these distinctive Tweeters.

Obsession with Regression

Monday, July 14, 2014

A Tale of Two Cities: The Twitter Reaction to the Return of Lebron James

2 comments: