Thursday, May 19, 2016

Five things I learned from counting 900 engineers at Google I/O

I wrote this in a few hours to release it during the conference it discusses, so treat its conclusions with the appropriate grain of salt and let me know if you find mistakes.

Google gave me a ticket to their annual developer conference, Google I/O. I am a computer scientist but not really a developer (edit: for the benefit of the person on Hacker News alleging "reverse discrimination" -- I won the ticket in a coding competition. Maybe be a little careful throwing around statements like that. I describe myself as “not a developer” because I’m better at other aspects of computer science, not because I can’t code), so I decided that, rather than bugging people about full-stack development or cross-platform coding, I could make myself more useful by analyzing diversity data. By this I mean I spent 6 hours wandering into various conference venues and tapping “F” on my phone when I saw a woman and “M” when I saw a man; in total I counted 916 people [1]. Shoutout to my extremely tolerant housemate Andrew Suciu, who received all of these messages.
I focused on gender because I didn’t think I could guess people’s race with high accuracy. (I realize, of course, that inferring gender from appearance also has serious caveats, but the data is useful to collect, and I wasn’t going to interrogate random strangers about their gender identity.) Here are five things I learned.

  1. Women were unexpectedly well-represented. 29% of the people I counted were women. (I tweeted at Google to get the official numbers and will update if they reply.) That means women were better-represented in my data than they are, for example, as software engineers at 80% of tech companies, or among software developers overall (21%) in Labor Department statistics, or among Google engineers (17%). I am puzzled by this and welcome your explanations. From what I saw at the conference, Google made pretty good efforts on the gender diversity front: a) posting a very large sign with anti-harassment guidelines, b) giving women free tickets to the conference (that’s how I ended up there), c) featuring the professional-women emoji in the keynote, and d) having three women speak in the keynote.
  2. There was surprisingly little variation in gender ratio between conference events. I computed gender ratios at about a dozen conference events, and they were considerably more stable than I expected -- almost always within 10 percentage points of the overall average of 29%. (Whether the variation is even statistically significant depends on how exactly you define the categories -- perils of categorical F-tests and p-values!) Full data at the end of the piece, along with a sketch of one way to check this. This is considerably less dramatic than, say, the gender variation across developer subfields in Stack Overflow’s developer survey. One explanation might be that at a conference, people wander randomly into lots of events, which homogenizes gender ratios by adding noise.
  3. Women cluster together. We don’t just have the total counts of women: we also have the groupings of women, because I tapped “M” and “F” in the order that I saw people, and for lines that order is meaningful. (In cases where people are just sitting around, the order is more arbitrary, so I exclude that data from this analysis.) If, for example, I tapped “MMMMMFFFFFFFF”, that would be highly grouped data -- all the men are together and so are all the women. So we can ask whether the women group together more than we would expect if the line were in random order, and it turns out they do (statistical details in note [2]). At some events I could see this clearly without statistics -- at the machine learning office hours, for example, one table had only 1 woman out of 18, and the other table had 7 women out of 13 (fine, statistics: Fisher’s exact test p = .004, t-test p = .002). I think a large driver of the clustering is probably that women arrive in groups because they work at a company together, not that they preferentially connect at the conference, but the latter could play a role as well. (Anecdotally, three of the four people who spoke to me during the conference were women.)
  4. Live-blogging is perilous. When I arrived at the conference at 8:30 AM, about an hour and a half before it started, a quick headcount implied that 90 - 95% of attendees were men, and I posted this online. But as the conference progressed and I got more data, it became clear the early figure was too skewed. I regret not waiting to get more data before posting. While I was clear about the lack of data, there was no advantage to posting so quickly. I often think about this when people rapidly tweet their reactions to complex events. It doesn’t matter how smart you are; you’ll still write a better piece if you reflect. And I realize, of course, that sometimes you have to work very quickly because an event demands it -- I wrote this post in a few hours so I could publish it during the conference, so take my statements with a grain of salt -- but I still wish we took more time to think.  
  5. Machine learning should not just be used for takeout delivery.

Stand back, I’m going to use a metaphor. Imagine King Arthur came back to the castle to find his son cutting pizza with Excalibur.

Arthur: Son, that is literally the magical blade of destiny I pulled from the stone to become king of England.
Son: Yeah, dad, but it cuts pizza really well.
Arthur: Sure, but can’t you think of anything more exciting to do with it?

This is how I feel about a lot of applications of machine learning, which I’m using here to mean “statistical methods that computers use to do cool things like learn to do complex tasks and understand images / text / speech”. Machine learning is revolutionary technology. You can use it to build apps that will understand “I want curry” and “play Viva la Vida”, but are those really the examples you want to highlight, as the conference keynote did? Let’s talk instead about how we can use machine learning to pick out police encounters before they become violent and stop searching and jailing so many innocent people; let’s talk about catching cancer using a phone’s snapshot of a mole or Parkinson’s using an accelerometer’s tremor or heart disease using a phone-based heart monitor. Those are the technologies that deserve the label “disruptive”, the you’re-a-goddamn-wizard-Harry applications that make your heart freeze. A few moments of the two-hour keynote emphasized big ideas -- the last ten minutes, an app to help Syrians relocate -- but most of the use cases were fixes to first-world problems [3].
Part of this, of course, is that it’s probably more profitable to drone-deliver San Franciscans Perrier than to bring people in Flint any water at all. But part of it is that people’s backgrounds influence what they choose to create. A woman would’ve been less likely to create this app which lets you stalk a random stranger, and someone who’d suffered from racism or classism would’ve been less likely to create this app which lets you identify “sketchy” areas. Machine learning is revolutionary technology, but if you want to use it to create revolutionary products, you need people who want revolution -- people who regularly suffer from deeper shortcomings in the status quo than waiting 15 minutes for curry [4].

Notes:
[1] I’m not that worried about double-counting because there were 7,000 people at the conference.
[2] Call each MMMFMFMFMF... vector corresponding to a line at an event Li. Then I use the following permutation procedure: compute a clustering statistic (described below), randomly permute each Li separately, recompute the clustering statistic, repeat a few hundred times, and compare the true clustering statistic to the random clustering statistics. (I permute each Li separately because if you mix them all together, you’ll be confounded by gender differences between events.) I tried two clustering statistics: first, the number of F’s that were immediately followed by an F, and second, the bucket statistic bucket(Li, n, k), defined as the number of bins of size n within Li that contain at least k women, with k chosen to be at least half of n. I added the second statistic because I worried that, when the order of a line was ambiguous (people standing side by side), I might automatically group women together, which would bias the first, very local statistic. For both statistics, the true clustering statistic was higher than the average random clustering statistic (regardless of the values of n and k I chose for the second statistic). For the first statistic this was true for essentially every random iterate; for the second statistic, the percentage of random iterates for which it was true varied depending on n and k, averaging about 90%.
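
For concreteness, here is a minimal sketch of that permutation procedure using the first statistic (the number of F’s immediately followed by an F). The input lines below are illustrative stand-ins, not the observed data.

```python
import random

def adjacent_ff(line):
    """Clustering statistic: the number of F's immediately followed by another F."""
    return sum(1 for a, b in zip(line, line[1:]) if a == "F" and b == "F")

def permutation_test(lines, n_iter=500, seed=0):
    """Permute each line separately and compare the observed statistic to its
    null distribution; a small p-value means more clustering than chance."""
    rng = random.Random(seed)
    observed = sum(adjacent_ff(line) for line in lines)
    null_stats = []
    for _ in range(n_iter):
        total = 0
        for line in lines:
            people = list(line)
            rng.shuffle(people)          # permute within the line only, so that
            total += adjacent_ff(people) # per-event gender ratios are preserved
        null_stats.append(total)
    p_value = sum(s >= observed for s in null_stats) / n_iter
    return observed, sum(null_stats) / n_iter, p_value

# Illustrative lines, not the real observations.
lines = ["MMMMMFFFMMFFM", "MFMMMMMFFMMM", "MMMFFMMMMMFFM"]
print(permutation_test(lines))
```
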
[3] This is based only on the keynote, not the rest of the conference. I think this is reasonable because the keynote was watched by millions of people, it was run by the CEO of Google, and a speech like that should reflect what you want to emphasize.
[4] To be clear, I’m not really blaming Google either for the trivial-app problem (although I think their examples could’ve been better chosen, as a company they do a lot of amazing things) or the lack-of-diversity problem; they’re industry-wide symptoms. But that isn’t really reassuring.

Full data:

People entering keynote speech, 9:30 AM: 65 / 236 are women, 28%.
People exiting keynote speech, 12:00 PM: 25 / 88, 28%.
High performance web user interfaces line: 21 / 58, 36%
Accessibility office hours: 10 / 40, 25%
Machine learning office hours: 8 / 31, 26%
People sitting around on grass: 33 / 87, 38%
Access and empathy tent: 13 / 38, 34%
Android Studio / Google Play: 6 / 46, 13%
Making music station: 4 / 18, 22%
Project Loon (giant balloon): 5 / 11, 45%
Android experiments (random cute stuff): 15 / 46, 33%
Audience arranged in ring around robotic arm spewing paint to music: 17 / 48, 35%
Android pay everywhere line: 6 / 25, 24%
Engineering cinematic experiences in VR line: 15 / 62, 24%
Devtools on rails line: 24 / 82, 29%
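
For what it’s worth, here is a rough sketch of the significance checks alluded to in points 2 and 3, using scipy on the counts above: a chi-squared test for whether the proportion of women varies across events, and Fisher’s exact test for the two machine-learning tables. This is my own sketch, not the analysis behind the p-values quoted in the piece, and how you group the events changes the answer.

```python
from scipy.stats import chi2_contingency, fisher_exact

# (women, total) for each event in the table above
events = [(65, 236), (25, 88), (21, 58), (10, 40), (8, 31), (33, 87),
          (13, 38), (6, 46), (4, 18), (5, 11), (15, 46), (17, 48),
          (6, 25), (15, 62), (24, 82)]

# 2 x 15 contingency table: women and men counted at each event
table = [[w for w, total in events],
         [total - w for w, total in events]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-squared = {chi2:.1f}, dof = {dof}, p = {p:.3f}")

# The two tables at the machine learning office hours: 1/18 vs. 7/13 women
odds_ratio, p_fisher = fisher_exact([[1, 17], [7, 6]])
print(f"Fisher's exact p = {p_fisher:.3f}")
```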

Saturday, May 14, 2016

Sunsets, Fraternities, and Deep Learning

Some projects I’ve been working on recently:

Sunset Nerd: a beautiful-sunset detector. It’s a robot that monitors the number of posts tagged #sunset on Instagram (a picture-sharing website) and sends you a message when there is a beautiful sunset happening in your area, so you don’t miss it. The app currently works for San Francisco, NYC, Philadelphia, Los Angeles, Palo Alto, Washington DC, Chicago, Houston, San Diego, Miami, Indianapolis, Jacksonville, Boston, and Seattle; if you live in one of those cities, click here to try it out on Facebook -- just send a message! (It will only message you once a week on average, and you can unsubscribe at any time.) For example, Boston on February 5 had a gorgeous pink sunset as a snowstorm cleared, and the detector picked it up; the red line shows the number of posts on that day, and the grey lines show the normal number of posts.
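
The bot’s code isn’t included here, but the core idea -- message people when this evening’s #sunset post count for their city is far above that city’s usual count -- is easy to sketch. A minimal version, where the counts, the three-standard-deviation threshold, and the function name are my own assumptions rather than the app’s actual implementation:

```python
import statistics

def sunset_is_beautiful(todays_count, past_counts, n_sds=3.0):
    """Flag an evening when the #sunset post count is unusually high for a city.

    todays_count -- number of #sunset posts so far this evening
    past_counts  -- counts at the same hour on previous evenings
    n_sds        -- standard deviations above the mean to count as "beautiful"
                    (an assumed threshold; the real bot's rule isn't described)
    """
    baseline = statistics.mean(past_counts)
    spread = statistics.stdev(past_counts)
    return todays_count > baseline + n_sds * spread

# Illustrative counts only -- the kind of spike Boston saw on February 5.
print(sunset_is_beautiful(412, [118, 95, 130, 102, 141, 99, 110]))
```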

Fraternities: My New York Times piece, a shorter version of the piece originally posted on this blog; Total Frat Move’s response.

Computational biology: I’ve gotten to work on two papers, one with Pang Wei Koh and Anshul Kundaje on using convolutional neural networks to denoise epigenetic data, to be presented as a spotlight talk at ICML’s Computational Biology Workshop, and one with Bo Wang, Junjie Zhu, and Serafim Batzoglou on a method for clustering single-cell RNA-seq data, which has been presented at Cold Spring Harbor.

Relationship abuse dataset: if this is a research interest of yours, I have a dataset of more than 100,000 Twitter posts under the hashtags #MaybeHeDoesntHitYou and #MaybeSheDoesntHitYou. Filtering on tweets with form "#MaybeHeDoesntHitYou but" yields a large dataset of people's experiences with non-physical relationship abuse. Shoot me an email (emmap1 “at” cs.stanford.edu) if you have project ideas -- no time to pursue this project on my own, but happy to discuss collaboration.
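
As a starting point, pulling out the experience tweets is a simple text filter. A sketch, where the file name and the one-tweet-per-line format are assumptions about how the dump is stored:

```python
import re

# Match tweets of the form "#MaybeHeDoesntHitYou but ..." (either hashtag variant).
pattern = re.compile(r"#Maybe(He|She)DoesntHitYou\s+but\b", re.IGNORECASE)

# "tweets.txt" is a hypothetical dump with one tweet per line.
with open("tweets.txt", encoding="utf-8") as f:
    experiences = [line.strip() for line in f if pattern.search(line)]

print(f"{len(experiences)} tweets describe non-physical relationship abuse")
```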

Thursday, April 21, 2016

Why Has The Number of NYT Headlines About Trump Fallen By a Factor of Two?

Over the past few weeks I’ve gotten the sense that, thank merciful heaven, the New York Times has stopped writing so many articles about Donald Trump. First I downloaded their data [1] to see whether this was all in my head. It wasn’t. Whether we look at headlines per week mentioning Trump (top graph) or articles per week mentioning Trump (bottom graph), we see that the NYT has been giving him much less coverage.
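
The data and code are linked in the notes; the computation itself is just a weekly count of headlines matching a name. A minimal sketch with pandas, assuming a CSV of articles with `date` and `headline` columns (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical input: one row per article, with its publication date and headline.
articles = pd.read_csv("nyt_articles.csv", parse_dates=["date"])

# Headlines per week that mention a candidate by name.
mentions = articles[articles["headline"].str.contains("Trump", case=False, na=False)]
per_week = mentions.set_index("date").resample("W").size()
print(per_week.tail())
```
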
Why? Here are a couple ideas.

  1. Maybe the NYT, after repeatedly complaining about Trump’s free media coverage, decided to stop giving him so much free media coverage. Here’s their analysis of his free coverage on March 15 -- right around the time his coverage begins to decrease. (Nick Kristof, a NYT columnist, writes a similar piece on March 26.)
  2. Maybe this isn’t specific to Trump -- maybe the NYT’s getting bored of the primaries in general. So I looked at headlines about the other candidates. The Democratic candidates, if anything, have been getting more coverage in the last few weeks; top is Sanders, bottom is Clinton.
On the other hand, both Cruz and Kasich have also seen drops in coverage. Here are Cruz, Kasich, and Trump at the bottom for comparison.

So maybe the NYT is ignoring all the Republicans, not just Trump. (It’s worth noting that coverage of candidates may be correlated -- if you write about Trump fighting with Cruz, you’re covering both candidates.) Susan Athey, an economist who studies the internet, points out that a lot of the difference between liberal and conservative media sources is not that they cover the same topic differently, but that they cover different topics. (So if you read Fox News you’ll hear about Benghazi and if you read the NYT you’ll hear about climate change.) Given this, I’d be curious to know if you’d see a similar pattern in Fox News coverage of the Republicans.

  3. Maybe the NYT is covering Trump less because he’s no longer as strong a candidate. Here are Trump’s odds of winning the Republican nomination as provided by PredictWise; Trump’s the red line. Note the fall in late March - early April.

(Or maybe causality goes in the other direction? Trump’s odds fall because he no longer gets as much free NYT coverage? Or maybe there’s a third variable -- Trump’s odds fall and he gets less NYT coverage because it becomes increasingly apparent that he isn’t worth talking about? Time series are fun.)

I don’t know which, if any, of these hypotheses is correct, so if you quote this post and say I’m making causal claims I will hunt you down. I’m just asking questions. Here’s a final one. No single reporter could account for the drop in Trump’s coverage -- it’s a difference of hundreds of articles. So something has to be coordinating the behavior of lots of reporters. How does that work? Is some editor at the NYT saying to the newsroom, “We SHALL NO LONGER write so many articles about the quasi-Nazi with little hands”? Or do the reporters look at what their peers are focusing on and choose to focus on that as well? Or are they all just responding to the same external factors?

Discriminating between these hypotheses is going to be hard using only data. But there are people who can help answer this question. So if you work at the NYT, feel free to explain what’s going on :)

Notes:

[1] Data; my code.

Monday, March 28, 2016

Protecting Yourself Against Statistical Crimes

I have a dream that one day every child will take a class which will teach them to recognize statistical crimes. It would replace another high school math or science class, like calculus, trigonometry, geometry, or Newtonian physics, because these are totally useless for 90% of the population. (I was a physics major. I’m allowed to say these things.) Statistics is not like that. Send a child into the world unable to recognize statistical crimes and you are preparing them to be perpetually lied to -- by politicians pushing agendas, journalists facing tight deadlines, and scientists trying to get published.


This class would not be a math class. I don’t care if kids understand how to do a chi-squared test. I just want to make them very paranoid. It would be like that scene in Harry Potter where the students are taught “constant vigilance” against the Dark Arts.


“[The teacher] gave a harsh laugh, and then clapped his gnarled hands together. ‘The sooner you know what you’re up against, the better. How are you supposed to defend yourself against something you’ve never seen?’ ”
And then instead of torturing a spider (seriously, who hired that guy? Don’t wizards have any teaching standards?), you could enumerate a bunch of statistical crimes. Which, to reinforce the fact that this class is necessary, I’m now going to do. I spent a month annotating every single article I read that discussed data for a popular audience (sample titles: “White Female Republicans are the Angriest Republicans”, “Study: More Useless Liberal Arts Majors Could Destroy ISIS”, “The Reproductive Rights Rollback of 2015”). In total I annotated 49 articles; you can see my annotations here and a note on my methodology here [1].


These are my overall impressions. They are not statistical; they’re a qualitative summary. Throughout I use “article” to refer to the general-interest publication and “study” to refer to the original scientific work it describes.


  1. Sites which specialized in statistical writing, like the NYT’s Upshot and FiveThirtyEight, wrote about data more reliably.
  2. Almost all the articles had something I could push back on. Most frequently, I had questions the original article didn’t answer or caveats it didn’t mention. This isn’t necessarily the journalist’s fault: most general-interest articles are shorter than the studies they describe, and so details get lost. But I also found a third of the articles were substantially misleading. (I’m not labeling those articles in the spreadsheet since I don’t want to be mean and the cutoff is somewhat arbitrary: maybe you could argue I’m an overly anal statistician and the actual fraction is a fourth or a fifth.) So if you want to know what a study says, reading a general-interest article about the study is not a reliable way to figure it out unless you really trust the journalist or outlet -- you have to at least glance through the study. General-interest articles often misdescribe studies, presenting correlational studies as causal, or presenting theoretical models as though they actually analyzed data. You don’t always have time to skim the original study, but I think you should before you repost it on Facebook or Twitter.
  3. Article titles are particularly likely to mislead. Outlets have incentives to use clickbait titles, the title is often not written by the author of the article, and it’s hard to summarize a complex topic in a dozen words. Please do not repost something after only reading the title.
  4. Be particularly suspicious of results which are politically charged or published in politically biased outlets (Jezebel, Breitbart), especially if the article substantiates the outlet’s worldview. (Also be suspicious of results which substantiate your worldview -- if you’re like me, you’re less inclined to question them.)
  5. Here are some questions to ask. If an article says, “A new study shows that X” your first question should be: how? Was it an experiment? A survey? A meta-analysis? A theoretical model? Sometimes this will be pretty obvious. If an article says, “Study shows that ⅔ of Americans prefer chocolate to vanilla”, the scientists probably ran a survey. But if an article says, “Study shows that increasing the minimum wage increases unemployment” -- it makes a huge difference whether the authors found a new natural experiment or did a meta-analysis of the past literature or are a bunch of undergrads who wrote up a theoretical model after passing Econ 101.
Once you understand how the study was conducted, push back on the study itself. If they claim to have “controlled for other factors” -- controlling for other factors is really hard. If they ran a survey -- was the population actually representative? Could non-response bias explain their results? In general: are the effects large enough to actually matter? Are their results statistically significant? Did they look at a hundred different things and only report the one which they liked? Are the numbers they are reporting the ones we care about, and are they properly contextualized? Try to think of other explanations for their data besides the one they favor. Be creative and obnoxious. You can find examples of how I think about articles in my annotations.


I close on a gentler note. The fact that you can make statistical arguments against an article does not mean that the author is incompetent or ill-intentioned or that the article is bad. All work has caveats -- certainly you can argue with all my blog posts -- and that’s fine as long as they’re clear. But some caveats are subtle and not clearly acknowledged (or deliberately hidden), which is why we need to teach our children to defend themselves. Avada Kedavra!


While I was working on this project, the New York Times and the Wall Street Journal both published op-eds arguing we should teach statistics. I dream of a world where statistical literacy is so common that statistical errors, like spelling errors, make it impossible to be taken seriously; where publications that use only anecdotes get demands for data. It would be a world where we paid attention to gun violence not because of mass shootings, but because of the far larger numbers of people who are shot and go unnoticed every day; where terrorists could no longer sow fear by killing a far smaller fraction of the population than die annually from heart disease; where we donated to charities that saved lives as opposed to making us feel good; where we conducted randomized controlled trials to test which government programs worked best. I truly believe that millions of people would lead better lives if everyone understood and applied basic statistical reasoning. That’s just not true of trigonometry. Let’s teach statistics instead.


Notes:
[1] When I first wrote this piece, it was 6:10 AM and I just couldn’t take it anymore and I ranted about three articles which I thought were bad. After I calmed down I decided that was both mean and unpersuasive, so I did a more systematic annotation. My reading material skews towards the New York Times, so to get a more representative sample I annotated not just the articles I would read naturally: I also went back and read statistical articles in other widely read publications like Gawker, Buzzfeed, and Breitbart (I Googled “new study” + publication name).  (I was doing this quickly, so if you think I’ve been unfair or misunderstood an article, my apologies -- let me know and I’ll fix the spreadsheet.)