Thursday, May 19, 2016

Five things I learned from counting 900 engineers at Google I/O

I wrote this in a few hours to release it during the conference it discusses, so treat its conclusions with the appropriate grain of salt and let me know if you find mistakes.

Google gave me a ticket to their annual developer conference, Google I/O. I am a computer scientist but not really a developer (edit: for the benefit of the person on Hacker News alleging "reverse discrimination" -- I won the ticket in a coding competition. Maybe be a little careful throwing around statements like that. I describe myself as “not a developer” because I’m better at other aspects of computer science, not because I can’t code), so I decided that, rather than bugging people about full-stack development or cross-platform coding, I could make myself more useful by analyzing diversity data. By this I mean that I spent six hours wandering into various conference venues and tapping “F” on my phone when I saw a woman and “M” when I saw a man; in total I counted 916 people [1]. Shoutout to my extremely tolerant housemate Andrew Suciu, who received all these messages.
I focused on gender because I didn’t think I could guess people’s race with high accuracy. (I realize, of course, that inferring gender from appearance also has serious caveats, but the data is useful to collect, and I wasn’t going to interrogate random strangers about their gender identity.) Here are five things I learned.

  1. Women were unexpectedly well-represented. 29% of the people I counted were women. (I tweeted at Google to get the official numbers and will update if they reply.) That means women were better represented in my data than they are, for example, as software engineers at 80% of tech companies, among software developers overall (21%) in Labor Department statistics, or among Google engineers (17%). I am puzzled by this and welcome your explanations. From what I saw at the conference, Google made pretty good efforts on the gender-diversity front: a) posting a very large sign with anti-harassment guidelines, b) giving women free tickets to the conference (that’s how I ended up there), c) featuring professional women emoji in the keynote, and d) having three women speak in the keynote.
  2. There was surprisingly little variation in gender ratio between conference events. I computed gender ratios at about a dozen conference events, and they were considerably more stable than I expected -- almost always within 10 percentage points of the overall average of 29%. (Whether the variation is even statistically significant depends on exactly how you define the categories -- perils of categorical f-tests and p-values!) Full data at the end of the piece. This is considerably less dramatic than, say, the gender variation across developer subfields in Stack Overflow’s developer survey. One explanation might be that at a conference people wander somewhat randomly into lots of events, which homogenizes gender ratios by adding noise.
  3. Women cluster together. We don’t just have the total counts of women: we also have the groupings, because I tapped “M” and “F” in the order that I saw people, and for lines that order is meaningful. (Where people were just sitting around, the order is more arbitrary, so I exclude that data from this analysis.) If, for example, I tapped “MMMMMFFFFFFFF”, that would be highly grouped data -- all the men are together and so are all the women. So we can ask whether the women group together more than we would expect if each line were in random order, and it turns out they do (statistical details in [2]). At some events I could see this clearly without statistics -- at the machine learning office hours, for example, one table had only 1 woman out of 18, and the other table had 7 women out of 13 (fine, statistics: Fisher’s exact test p = .004, t-test p = .002). I think a large driver of the clustering is probably that women arrive in groups because they work at a company together, not that they preferentially connect at the conference, but the latter could play a role as well. (Anecdotally, three of the four people who spoke to me during the conference were women.)
  4. Live-blogging is perilous. When I arrived at the conference at 8:30 AM, about an hour and a half before it started, a quick headcount implied that 90-95% of attendees were men, and I posted this online. But as the conference progressed and I got more data, it became clear that the early figure was too skewed. I regret not waiting for more data before posting: while I was clear about the lack of data, there was no advantage to posting so quickly. I often think about this when people rapidly tweet their reactions to complex events. It doesn’t matter how smart you are; you’ll still write a better piece if you reflect. And I realize, of course, that sometimes you have to work very quickly because an event demands it -- I wrote this post in a few hours so I could publish it during the conference, so take my statements with a grain of salt -- but I still wish we took more time to think.
  5. Machine learning should not just be used for takeout delivery.

Stand back, I’m going to use a metaphor. Imagine King Arthur came back to the castle to find his son cutting pizza with Excalibur.

Arthur: Son, that is literally the magical blade of destiny I pulled from the stone to become king of England.
Son: Yeah, dad, but it cuts pizza really well.
Arthur: Sure, but can’t you think of anything more exciting to do with it?

This is how I feel about a lot of applications of machine learning, which I’m using here to mean “statistical methods that computers use to do cool things like learn to do complex tasks and understand images / text / speech”. Machine learning is revolutionary technology. You can use it to build apps that will understand “I want curry” and “play Viva la Vida”, but are those really the examples you want to highlight, as the conference keynote did? Let’s talk instead about how we can use machine learning to pick out police encounters before they become violent and stop searching and jailing so many innocent people; let’s talk about catching cancer using a phone’s snapshot of a mole or Parkinson’s using an accelerometer’s tremor or heart disease using a phone-based heart monitor. Those are the technologies that deserve the label “disruptive”, the you’re-a-goddamn-wizard-Harry applications that make your heart freeze. A few moments of the two-hour keynote emphasized big ideas -- the last ten minutes, an app to help Syrians relocate -- but most of the use cases were fixes to first-world problems [3].
Part of this, of course, is that it’s probably more profitable to drone-deliver San Franciscans Perrier than to bring people in Flint any water at all. But part of it is that people’s backgrounds influence what they choose to create. A woman would’ve been less likely to create this app which lets you stalk a random stranger, and someone who’d suffered from racism or classism would’ve been less likely to create this app which lets you identify “sketchy” areas. Machine learning is revolutionary technology, but if you want to use it to create revolutionary products, you need people who want revolution -- people who regularly suffer from deeper shortcomings in the status quo than waiting 15 minutes for curry [4].

Notes:
[1] I’m not that worried about double-counting because there were 7,000 people at the conference.
[2] Call each MMMFMFMFMF... vector corresponding to a line at an event Li. Then I use the following permutation procedure: compute a clustering statistic (described below), randomly permute each Li separately, recompute the clustering statistic, repeat a few hundred times, and compare the true clustering statistic to the permuted ones. (I permute each Li separately because if you mix them all together, you’ll be confounded by gender differences between events.) I tried two clustering statistics: first, the number of F’s immediately followed by another F, and second, the bucket statistic bucket(Li, n, k): the number of bins of size n within Li that contain at least k women, with k chosen to be at least half of n. I used the second statistic because I was worried I might automatically group women together when the order of a line was ambiguous (people standing side by side), which would bias the first, very local statistic. For both statistics, the true clustering statistic was higher than the average permuted clustering statistic (regardless of the values of n and k I chose for the second statistic). For the first statistic, this held against essentially every permuted iterate; for the second, the percentage of iterates it held against varied with n and k, averaging about 90%.
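The procedure in note [2] is only a few lines of code. Here is a minimal sketch of the first clustering statistic and the per-line permutation test, with made-up example lines (the raw M/F sequences aren’t reproduced in this post):

```python
import random

def clustering_stat(line):
    """First clustering statistic: number of F's immediately followed by an F."""
    return sum(1 for a, b in zip(line, line[1:]) if a == "F" and b == "F")

def permutation_test(lines, iters=1000, seed=0):
    """Permute each line separately (mixing lines together would confound
    the test with gender differences between events) and return the
    fraction of permutations whose total clustering statistic is at
    least the observed one -- a one-sided p-value."""
    rng = random.Random(seed)
    observed = sum(clustering_stat(line) for line in lines)
    at_least_as_clustered = 0
    for _ in range(iters):
        total = 0
        for line in lines:
            shuffled = list(line)
            rng.shuffle(shuffled)
            total += clustering_stat(shuffled)
        if total >= observed:
            at_least_as_clustered += 1
    return at_least_as_clustered / iters

# Made-up, strongly grouped example lines (not real conference data).
lines = [list("MMMMMFFFFF"), list("MMMFFFFMMM")]
p = permutation_test(lines)
```

On grouped data like this, permuted lines essentially never reach the observed statistic, so the estimated p-value comes out near zero; swapping in bucket(Li, n, k) for clustering_stat gives the second test.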
[3] This is based only on the keynote, not the rest of the conference. I think this is reasonable because the keynote was watched by millions of people, it was run by the CEO of Google, and a speech like that should reflect what you want to emphasize.
[4] To be clear, I’m not really blaming Google either for the trivial-app problem (although I think their examples could’ve been better chosen, as a company they do a lot of amazing things) or the lack-of-diversity problem; they’re industry-wide symptoms. But that isn’t really reassuring.
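For the office-hours split in point 3 (1 woman out of 18 at one table, 7 out of 13 at the other), the quoted Fisher p-value can be reproduced with nothing but the standard library. This is a from-scratch sketch of the test; any stats package’s fisher_exact will give the same answer:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is at most as probable as the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def p_table(x):  # probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # The tiny slack guards against float round-off when comparing ties.
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-12))

# Machine learning office hours: one table with 1 woman and 17 men,
# the other with 7 women and 6 men.
p = fisher_exact_two_sided(1, 17, 7, 6)  # ~0.004
```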

Full data:

People entering keynote speech, 9:30 AM: 65 / 236 women, 28%
People exiting keynote speech, 12:00 PM: 25 / 88, 28%
High performance web user interfaces line: 21 / 58, 36%
Accessibility office hours: 10 / 40, 25%
Machine learning office hours: 8 / 31, 26%
People sitting around on grass: 33 / 87, 38%
Access and empathy tent: 13 / 38, 34%
Android Studio / Google Play: 6 / 46, 13%
Making music station: 4 / 18, 22%
Project Loon (giant balloon): 5 / 11, 45%
Android experiments (random cute stuff): 15 / 46, 33%
Audience arranged in ring around robotic arm spewing paint to music: 17 / 48, 35%
Android Pay everywhere line: 6 / 25, 24%
Engineering cinematic experiences in VR line: 15 / 62, 24%
Devtools on rails line: 24 / 82, 29%
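Point 2’s “surprisingly little variation” can be quantified with a chi-square test of homogeneity over these counts. A standard-library sketch (23.68 is the 5% critical value for chi-square with 14 degrees of freedom):

```python
# (women, total) for each of the 15 events listed above.
counts = [(65, 236), (25, 88), (21, 58), (10, 40), (8, 31),
          (33, 87), (13, 38), (6, 46), (4, 18), (5, 11),
          (15, 46), (17, 48), (6, 25), (15, 62), (24, 82)]

women = sum(w for w, _ in counts)
total = sum(n for _, n in counts)
rate = women / total  # overall fraction of women, about 0.29

# Pearson chi-square for a 2 x 15 table: compare each event's observed
# number of women to the number expected under one shared rate.
chi2 = sum((w - n * rate) ** 2 / (n * rate * (1 - rate)) for w, n in counts)

# 5% critical value for 15 - 1 = 14 degrees of freedom.
significant = chi2 > 23.68
```

In this sketch the statistic comes out around 16, below the critical value -- consistent with the caveat in point 2 that the variation may not even be statistically significant (though note some expected counts here are on the small side for a chi-square approximation).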

4 comments:

  1. For the stats, I might reach for a runs test: https://en.wikipedia.org/wiki/Wald–Wolfowitz_runs_test or a chi-square test (observed data is the number of ff, fm, mf, mm pairs, expected data is p(f)*p(f)*n, p(f)*p(m)*n, etcetc)

  2. I consistently love your blog, and this is no exception. Thanks for towing the social justice line on this machine learning stuff. One of my projects involves using public health injury data and police stop data to suggest new policing strategies that are less explicitly or implicitly racist and more life-saving. We're really going in that direction - where "You're a goddam wizard harry!" becomes a rallying cry, bringing on the fly analysis and decision making to core activities. Cool stuff - here's to hoping it's got more powerful of an impact than "my perrier drone is late."
    - Mike, a social justice / environmental epidemiologist from NC

  3. I'm guessing that women were well represented at the conference because (apparently from the way you ended up there), tickets were given out on merit, so the only bias is awareness and self-selection.

    I guess the judges of coding contests could know the genders of the authors, but I'm guessing gender-bias is dampened when someone is evaluating lots of code.
