Monday, October 19, 2015
A few months ago I became single for the first time in four years. I went from studying online dating to being a datapoint myself. This has made me think more urgently about questions I once considered only abstractly. Today I write about the connection between testing statistical hypotheses and testing romantic attraction.
Statisticians love to develop multiple ways of testing the same thing. If I want to decide whether two groups of people have significantly different IQs, I can run a t-test or a rank sum test or a bootstrap or a regression. You can argue about which of these is most appropriate, but I basically think that if the effect is really statistically significant and large enough to matter, it should emerge regardless of which test you use, as long as the test is reasonable and your sample isn’t tiny. An effect that appears when you use a parametric test but not a nonparametric test is probably not worth writing home about.
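For the curious, here’s a toy sketch of that claim (the “IQ” numbers are simulated for illustration, not real data): a large true difference survives the choice of test.

```python
# Sketch: a real, large group difference emerges under both a
# parametric and a nonparametric test. The "IQ" data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(100, 15, 200)  # simulated IQs, mean 100
group_b = rng.normal(110, 15, 200)  # simulated IQs, mean 110

t_p = stats.ttest_ind(group_a, group_b).pvalue     # parametric t-test
u_p = stats.mannwhitneyu(group_a, group_b).pvalue  # rank sum test

print(t_p < 0.05, u_p < 0.05)  # both tests agree the effect is real
```

Shrink the effect or the sample and the two p-values start to disagree, which is exactly the situation I’d distrust.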
A similar lesson applies, I think, to first dates. When you’re attracted to someone, you overanalyze everything you say, spend extra time trying to look attractive, etc. But if your mutual attraction is really statistically significant and large enough to matter, it should emerge regardless of the exact circumstances of a single evening. If the shirt you wear can fundamentally alter whether someone is attracted to you, you probably shouldn’t be life partners.
You can argue against this by pointing out cases where a tiny detail does matter because it prevents you from having any future interactions: for example, you foolishly wear your XL Chess Team sweatshirt to the bar and your would-be Lothario never bothers to approach you and thereby discover that you look much better with it off.
This is a risk. In statistical terms, a glance across a bar doesn’t give you a lot of data and increases the probability you’ll make an incorrect decision. As a statistician, I prefer not to work with small datasets, and similarly, I’ve never liked romantic environments that give me very little data about a person. (Don’t get me started on Tinder. The only thing I can think when I see some stranger staring at me out of a phone is, “My errorbars are huge!” which makes it very hard to assess attraction.)
Even on a longer date, there’s some risk that a disaster at the beginning will ruin your subsequent interactions. If you start by asking “how’s your relationship with your mother?”, you’ve torpedoed your chance to have a truly intimate conversation about how she ran off to train monkeys.
Still, I’m sticking to the principle that if your romance-to-be is statistically robust, whether you wear makeup or the moon is full should make no more difference than whether you compute the Spearman or Pearson correlation. (And if your date asks you if you want to bootstrap, the answer is always, of course, yes.)
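(If you want to check the Spearman/Pearson half of that claim, here’s a quick sketch on simulated data: when the relationship is strong, the two correlations tell the same story.)

```python
# Sketch: for a strong, roughly linear relationship, Pearson and
# Spearman correlations agree. Data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = x + rng.normal(scale=0.5, size=300)  # strongly related, plus noise

pearson = stats.pearsonr(x, y)[0]
spearman = stats.spearmanr(x, y)[0]
print(round(pearson, 2), round(spearman, 2))  # both large and positive
```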
I think there’s even an argument for being deliberately unattractive to your date, on the grounds that if they still like you, they must really like you. Imagine a cliched rom-com disaster: you vomit on your date. This isn’t sexy. On the other hand, someone who finds you attractive after that is much more likely to still find you attractive when you’re puking during pregnancy or chemotherapy. This is somewhat analogous to using a statistical test that makes very weak assumptions (here's one example): if the test yields positive results, you can have high confidence they're real.
Please don’t send me angry emails when you take this post too seriously and the love of your life spurns you because you didn’t shower for a week before your date. But I’d welcome your thoughts in the comments or via email. (Also hit me up if you have ideas for statistical projects that I can only conduct while single.)
 I recently received an email from a Stanford professor in a similar situation: his marriage broke up after 20 years, and he responded by writing a book about the connections between economics and dating.
 An economics friend points out a corollary to this principle: be suspicious of analyses that use really convoluted tests when it seems like simple ones should do, because that might indicate that the simple ones didn’t produce the results they’re reporting.
 10 Things I Hate About You, 50 Shades of Grey, Mean Girls. What’s with this trope, and why are the pukers always female?
 You’re calling me crazy and I’m kind of kidding, but I’d also argue that the idea of testing one’s partner is a socially accepted one. (My scholarly attempt to do a lit review on this question -- I Googled “make them work for it” -- yielded this text. You’re welcome.) There are many bad reasons people are told to defer sleeping with someone, but a not-so-bad one, from a probabilistic standpoint, is that someone who will wait might be more likely to really like you.
Friday, October 16, 2015
Shengwu Li and I argue in the Washington Post that universities that conduct sexual assault surveys often misunderstand the basic statistical goal: not to get as many students as possible to answer the survey, but to get an unbiased sample of reasonable size. We propose a method for doing this.
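To see why sample size isn’t the point, here’s a toy simulation (all numbers invented, not from the paper): a small random sample beats a huge self-selected one.

```python
# Toy simulation (invented numbers): estimating a true rate of 10%
# from (a) a small random sample vs (b) a large volunteer sample in
# which affected students are twice as likely to respond.
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.10
population = rng.random(100_000) < true_rate  # True = affected

# (a) a small simple random sample
random_sample = rng.choice(population, size=500, replace=False)

# (b) a large volunteer sample: affected students respond at 2x the rate
respond_prob = np.where(population, 0.6, 0.3)
volunteers = population[rng.random(population.size) < respond_prob]

print(random_sample.mean())  # close to 0.10
print(volunteers.mean())     # biased upward, despite tens of thousands of responses
```

More responses make the biased estimate more precise, not more correct.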
Saturday, October 10, 2015
Brian Clifton, Gilad Lotan and I published an analysis of the fierce online debate about abortion. We visualize the spread of the hashtags #ShoutYourAbortion and #ShoutYourAdoption, develop a method for classifying tweeters as pro-choice or pro-life, and show that we can often predict someone's stance with high accuracy -- without ever reading their profile or tweets. Here is Brian's beautiful visualization of the spread of the hashtag.
Monday, September 21, 2015
How do we map the density of a set of events? For example, we might want to map locations of tweets supporting Bernie Sanders as opposed to Hillary Clinton or locations of housing evictions or locations of police shootings. I confronted this problem recently while writing a post for Quartz (which they split into two) about where people tweet more about wine and where they tweet more about beer. Here’s a finished map (you can see more in the Quartz posts); color shows the fraction of #beer or #wine tweets which are about #wine, with red denoting pro-#wine areas.
I built a tool which lets you make maps like this, and because I think this problem is often useful to solve and often solved badly, I provide some thoughts on how to make maps below. If you’re not interested in details, you should probably just look at the maps in the Quartz posts, but if you keep reading you’ll at least get to see me make a lot of bad maps.
One simple thing to do is just to make a state-by-state map where each state’s color corresponds to the density of events. This has a few problems. It requires us to map all the latitude, longitude pairs to their states, and if we want to look at a country besides America, we need to adapt our method; more fundamentally, it’s not very high-resolution, and there are often interesting patterns at the sub-state level.
To get better resolution, people often just plot the exact latitude, longitude location of the tweets, so you get visualizations that look like this:
Which is pretty but not very useful (like your momma!) because it basically just shows us where the cities are. (I have lost track of the number of data analyses I have read that can be summarized as, “when you have more people, more things happen”).
What I think you usually want to do is plot the density of events relative to some background. For example, I don’t care about the absolute density of wine tweets, which will be heavily correlated with population density; I care about the fraction of beer/wine tweets which are wine tweets.
So one thing we can do is estimate the density of wine tweets, estimate the density of beer tweets and then plot the difference: density_wine − density_beer. (We can estimate density using a method called, appropriately, kernel density estimation.) Here’s what happens when we do that; red denotes areas with more wine, blue with more beer.
The problem is that the reddest areas aren’t necessarily the areas where 90% of people are tweeting about wine; they might also just be the areas with a ton of tweeters (which will also have larger differences between density_wine and density_beer). So maybe we really want something like density_wine / (density_beer + density_wine), which we can interpret as the fraction of tweets which are about wine. Here’s what that looks like.
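For the curious, here’s a minimal sketch of both quantities using scipy’s gaussian_kde, on two fake clusters rather than actual tweets:

```python
# Sketch: the density difference and the density ratio, on fake points.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
wine = rng.normal([0, 0], 0.5, size=(300, 2))  # fake wine-tweet locations
beer = rng.normal([2, 0], 0.5, size=(400, 2))  # fake beer-tweet locations

kde_wine = gaussian_kde(wine.T)  # kernel density estimates
kde_beer = gaussian_kde(beer.T)

grid = np.array([[0.0, 0.0], [2.0, 0.0]]).T  # two points to evaluate
d_wine, d_beer = kde_wine(grid), kde_beer(grid)

difference = d_wine - d_beer
fraction = d_wine / (d_wine + d_beer)  # unstable where both densities are ~0
print(fraction)  # high near the wine cluster, low near the beer cluster
```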
The problem, basically, is that the ratio of two things is unstable when the denominator gets small, which often happens. I tried various ways of getting around this but they were finicky.
So here’s an alternate solution: for every point which you want to color, look at the closest 10 beer/wine tweets and see how many of them are about beer. It seems like this will take a really long time if we have, say, 10,000 points and 50,000 tweets. Luckily, computer scientists have devised an efficient way of doing this which takes about two lines of code and a second to run (#MyFieldIsCoolerThanYourField).
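One standard way to do it (a sketch on fake data; the real tool is on GitHub) is a k-d tree, via scipy’s cKDTree, which really does come down to about two lines:

```python
# Sketch of the nearest-neighbor trick with a k-d tree, on fake data.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
tweets = rng.uniform(0, 10, size=(50_000, 2))  # fake tweet locations
is_beer = tweets[:, 0] < 5                     # fake labels: beer in the west

grid = rng.uniform(0, 10, size=(10_000, 2))    # map points to color

# The two lines that do the work: build the tree, query 10 neighbors.
tree = cKDTree(tweets)
_, idx = tree.query(grid, k=10)

beer_fraction = is_beer[idx].mean(axis=1)  # fraction of 10 neighbors about beer
print(beer_fraction.shape)  # one value per map point
```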
This doesn’t look so good because California appears to be hemorrhaging, but once we mask off the oceans we get a nice map:
I’ve posted the code to make maps like these on GitHub so you can make maps of your own (and let me know if you find anything cool)! Keep in mind that the map will be less reliable in areas with little data. You can use it on any data (not just from Twitter) that has latitude and longitude. It requires knowledge of Python, so shoot me an email if you get stuck. If I get enough complaints from people who want to make maps but can’t use Python, I’ll just build a web tool.
Also, I am not a mapmaking expert, so feel free to tell me how I could’ve used CartoDB or whatever to do all this! (My problem with CartoDB is that the free version won’t keep data private and limits the size of your datasets to 50 MB. I’m not really a 50 MB kind of girl.)
 We just train a k-nearest neighbors classifier and plot its classification surface. There’s also the question of how you choose the number of nearest neighbors to look at. If you choose too small a number, the map gets very splotchy:
And if you choose too large a number, you lose real details. I’m not exactly sure how best to choose, so hit me up if you have thoughts. Do not tell me to use cross-validation. Seriously, we’re mapping drunk tweets here.
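If you want to play with the trade-off yourself, here’s a sketch using scikit-learn’s KNeighborsClassifier on fake data (just an illustration; the real tool and data are on GitHub):

```python
# Sketch: the choice of k trades splotchiness against oversmoothing.
# Fake data stands in for the tweets.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(2_000, 2))  # fake tweet locations
labels = (points[:, 0] < 5).astype(int)       # clean split at x = 5
flip = rng.random(2_000) < 0.2                # flip 20% of labels: noise
labels = np.where(flip, 1 - labels, labels)

# grid of map points to classify
xs, ys = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])

surfaces = {}
for k in (1, 25, 500):
    clf = KNeighborsClassifier(n_neighbors=k).fit(points, labels)
    surfaces[k] = clf.predict(grid)  # k=1: splotchy; k=500: oversmoothed
```

With k=1 the surface faithfully copies every noisy label; with k=500 the map is smooth but washes out anything smaller than a few hundred tweets.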
A final note: this tool obviously relies on you having latitude/longitude data. Many datasets are not in this form (e.g., they might include addresses instead) and I have not found a great way to rapidly convert between addresses and latitudes/longitudes, because many APIs are rate limited. Let me know if you have a good solution to this problem.
Saturday, August 29, 2015
My analysis of half a million drunk tweets, and more than two million tweets about alcohol, was just published by Quartz; you can read it here (and the coverage by Slate here, if you speak French).
Friday, August 7, 2015
This week tens of thousands of female engineers tweeted pictures of themselves and explanations of the work they do under the hashtag #ILookLikeAnEngineer. The movement made the front page of the New York Times, and so I decided to see what I could do with the tweets. Because this is a post about female engineers, I will describe the engineering steps in a little more detail:
- I used a program to scrape roughly 100,000 #ILookLikeAnEngineer tweets.
- From each tweet, I extracted any links to images, filtering out retweets and duplicate links.
- Using what Mark Zuckerberg in The Social Network would call “a little wget magic”, I downloaded all the pictures from the links. This gave me roughly 10,000 pictures (1.2 GB). Then I created a site so that drunken fraternity men could rate the attractiveness of the women...no, never mind.
- I programmatically cropped all 10,000 images into squares of uniform size.
- I wanted to see if I could create mosaics: compose a large image from tiled smaller images. There are websites which do this but I was pretty sure they would choke on 1.2 GB of pictures and not give me the freedom to experiment with parameters. The better solution was to write code to do it, and the lazier solution was to see if someone else had already done that, which they had. This allowed me to create mosaics using only one line of code, but I didn’t like the initial output so I added a bunch of my code to their code until it did what I wanted.
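As a concrete example, the cropping step might look like this (my reconstruction with Pillow, not the original code):

```python
# Sketch of the cropping step: center-crop each image to a square of
# uniform size. Pillow reconstruction; not the original code.
from PIL import Image

def crop_to_square(img, size=100):
    """Center-crop an image to a square, then resize to size x size."""
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size))

# usage (hypothetical filename):
# tile = crop_to_square(Image.open("engineer_selfie.jpg"))
```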
Here are the final results. From far away, the mosaics look like Seurat: for example, here is the woman who started it all.
Zoom in and you get lost in the individual pictures.
Here are some high resolution versions (warning: the files are large; zoom in). Please feel free to use them, with attribution, to persuade women to become engineers or do other socially useful things. (Do me a favor and let me know about it!)
I was working on this while watching the Republican debate, and at some point I got so tired of hearing overconfident men pretend to know more than they did. (The line that did me in was Huckabee’s assertion that scientists agree personhood begins at conception because of a fetus’s “DNA schedule”. What the hell is a DNA schedule?) So I went home and wrote code. I’m often comforted by the fact that, however loud and annoying the person lecturing me may be, they cannot get inside my skull: the silent sanctity of those few inches of space, the infinite freedom to reflect and create, remain my own.
And yet. It’s naive to think that freedom of thought is enough. My work requires a computer, which I need economic freedom to buy. And Huckabee’s proposed restrictions on contraception and abortion will reduce women’s economic freedom. My work is funded by government science agencies which Huckabee wants to cut. So even code is cold comfort at the moment.
Apologies for the slightly bleak ending. If you have a custom image you’d like female-engineerified or higher resolution versions of these images, I’m happy to do that. If you are one of the women portrayed here and are uncomfortable having your face composed of many smaller women, let me know and I will take your picture down. And if you have ideas for cool things to do with this dataset or ways to improve the mosaics, please let me know! (Some pretty obvious improvements one could make are a) filtering out non-faces and b) filtering out duplicate images.)