Obsession with Regression: 2017

Monday, December 18, 2017

Disagreeing without disliking each other

...increasingly seems to be impossible. Democrats dislike Republicans; Republicans dislike Democrats; even within liberal enclaves, very liberal campus activists dislike slightly less liberal campus activists. More than one of my friends has told me that, if they could work on any problem, they would want to stitch closed these schisms in our society.

Recently I had two experiences that renewed my faith that this was possible. I started contentious discussions with two very different groups:

1. The Redditors: A bunch of Reddit fans of a popular blogger who, in my view, can be biased against feminists. He had written a piece arguing that the focus on powerful men’s sexual assaults was “a hit job” on men. I wrote a rebuttal, decided I was feeling confrontational, and posted it where all his Reddit fans would see it.

2. The AlterConfers: The attendees of AlterConf 2017, a conference that “provides safe opportunities for marginalized people and those who support them in the tech and gaming industries”. I had been invited to speak about ethical dilemmas in computer science, and decided I would begin my talk by talking about criminal justice sentencing algorithms. I was going to argue that algorithms that created large racial disparities were not necessarily unfair.

These groups, as you can imagine, followed profoundly different norms. When I arrived at AlterConf, I was asked for my preferred pronouns (she/her) and told that the bathrooms had been “liberated from the gender binary”; when I posted on Reddit, a commenter immediately assumed I was male. At AlterConf I was asked whether my talk had any trigger warnings and handed red, yellow and green cards to indicate whether I was comfortable talking to other people; on Reddit, I found a similarly careful means of communication that relied on a totally different vocabulary -- “motte and bailey”; “toxoplasma”; “infinite regress”; “Taleb’s notion of time probability”; “eudaimonia”, “metis”, and “episteme”; “Foucaultian nihilism” and “biopower”.

I doubt the two groups would get along that well. At AlterConf, multiple speakers attributed social problems to cis white men; on Reddit, multiple commenters criticized feminists instead. At AlterConf, speakers discussed how to stop racial discrimination; on Reddit, commenters focused instead on discrimination against men. (Like African-Americans, they argued, men were discriminated against by the criminal justice system.)

Nor was I arguing positions that either group was particularly sympathetic to. Reddit isn’t known for loving feminist positions; similarly, arguing that the criminal justice system isn’t as racist as it appears to be isn’t a popular stance in activist circles. I was genuinely nervous before engaging with each group.

But in fact both interactions went remarkably well: no one got mad, I learned from both groups, and they learned from me. On Reddit, commenters pointed me to a long line of papers showing discrimination against men in criminal justice, which I wrote some code to check out -- more on this some other time. They also asked me for recommendations of other blogs to read, machine learning resources, why feminists acted the way they did, and why I had worked at 23andMe, so information flow went both ways. At AlterConf, I heard ideas for making tech conferences safer and more inclusive (take note, NIPS); making code reviews more pleasant; and finding books with more diverse protagonists, among many others. Conversely, many people came up to me after my talk to ask about tradeoffs in algorithmic fairness.

Why was this communication peaceful and productive? Here are some thoughts.

1. Both communities established strong norms of respectful discourse. These norms are wildly different, of course -- on Reddit, you get a lot of rationalist jargon, and at AlterConf, you get a lot of activist jargon. But they share a common goal: to allow everyone to participate in a free discussion without getting insulted or upset. And while I don't agree with all the ways these communities achieve this goal -- sounding super-rational can sometimes just conceal silly arguments or be pretentious, for example, and I think trigger warnings, while useful in some cases, are used over-broadly -- it helps to just establish a common intention that we're all trying to get along.

Sometimes, the norms are very effective at preserving civil discourse. For example, the one Reddit commenter who was overtly disrespectful, questioning how I had managed to earn my professional bona fides when I wrote like a high schooler, was swiftly downvoted and told they were violating the rules of the forum; their comment was then deleted. One norm I particularly like is charity, a term I heard mentioned frequently on Reddit. As I understand it, charity means "assume good intent and respond to the strongest version of the opponent's argument". I love this ideal, although I don't always achieve it [1].

2. I showed willingness to learn. On Reddit, I began my rebuttal with a long paragraph listing all the things I had learned from the original blog post, and when commenters disagreed with me, I asked them for references. At AlterConf, I started by saying I was grateful to be invited to speak because I thought the wider CS community could take a lot of useful lessons from AlterConf. I also told them that I was about to give a short talk on a controversial topic to a new community, which was always risky, so I was nervous and if they disagreed they should come talk to me because I liked talking to people who disagreed with me.

This willingness to learn was not an act: it helps me to approach new communities anthropologically, with openness, curiosity, and some degree of detachment, and view things I don't agree with as interesting and well-intentioned rather than stupid and malignant. Of course, I don’t always manage to do this.

3. We came from similar tribes. On Reddit, I could credibly claim to be a rationalist math nerd, and the fact that I was in the Stanford CS program was a good thing; had I picked a fight on Breitbart, I suspect I’d have been cast as a liberal elite. Similarly, at AlterConf, I started by saying I studied police discrimination to try to establish that my heart was in the right place.

I'm not sure any of these strategies would allow you to bridge a wider schism and engage with, say, Fox News commenters. But maybe we don't need to do that yet. Even within the Democratic party, there are schisms that if bridged would help us win elections. And, more broadly, if you start by reaching out to the most distant people who will listen to you, perhaps little by little that frontier grows more distant.

Notes:

[1] In my experience, the activist community isn’t always charitable either; I dislike how people are sometimes demonized when their intent is benign, as I’ve discussed before. But at AlterConf everyone was nice to me.

Friday, December 8, 2017

No, Scott Alexander, the focus on powerful men’s sexual assaults is not “a hit job” on men

I want to rebut a recent piece in which Scott Alexander, a widely read blogger, criticizes the focus on sexual assaults committed by powerful men (as opposed to assaults committed by women). I think this post is worth responding to because Alexander's blog is incredibly widely read among computer scientists and my other analytical friends -- it may be the most widely-read blog in my social circle -- and I don't think it covers gender issues fairly, and this post is an example of that.

There are a couple important things Alexander gets right. He's right that society decides not to care about certain classes of sexual assaults; there's probably been more coverage this year of Taylor Swift getting groped than all prison rape combined. He's right that society is wrong to make fun of men who are assaulted by women, and I agree the media should seek out reports from men as well. Reading his post and some of his references increased my already-held belief that we should take men more seriously when they are harassed or assaulted by women, so credit to him for that. He makes a thought-provoking argument that men might care more about assault if they believed it could happen to them too and they'd be taken seriously if it did.

But then the post says a bunch of things that are less reasonable.

First, he repeatedly implies that the current conversation is only about men assaulting women. This is factually incorrect; plenty of men have also gotten in trouble over accusations of assaulting men -- Kevin Spacey, George Takei, James Levine, we could go on.

Then he says that the focus on male assaulters is "a hit job on the outgroup [men]. Do I think that sexual harassment is being used this way? I have no other explanation for the utter predominance of genderedness in the conversation."

Here's another explanation: it's extremely obvious that male-on-female assault, a very common and damaging kind, has some unique characteristics worth discussing -- like the fact that men are generally physically, professionally, and economically more powerful, which fundamentally changes the dynamic of the assault. This discussion is long overdue, and we're having it now. Another reason the assaults of powerful men are worth discussing specifically is that it's a very bad idea to give assaulters, who definitionally don’t have enough regard for others' suffering, access to, say, America's entire nuclear arsenal. So the fact that two of America's last four presidents have been accused of assault or harassment by multiple people is worth talking about. When there are credible allegations of assault against female presidents, senators, media moguls, etc, we should absolutely talk about those too.

(Note, incidentally, that there are many other ways in which the current conversation is biased and incomplete: other groups being largely left out of the headlines are prisoners, people of color, transgender people, and people whose abusers aren't famous, but somehow his hypothesized "hit job" is against men only. This is odd.)

Alexander also says that focusing on male-on-female assault is like talking only about black-on-white crime or Muslim-on-Christian terrorism: it implies you have an insidious agenda, like Richard Spencer. (Comparing feminists to Nazis is a very tired rhetorical tactic -- feminazis, anyone? -- but let's move on.) These comparisons are wrong for three reasons. First, as explained above, the current conversation isn't just about female victims, though it does focus on male crimes. Second, it's entirely reasonable to have a conversation specifically about the crimes committed by one group. The media has been running non-stop articles about white supremacists, and that is not "a hit job on whites" but an analysis of an important social phenomenon. I don’t feel attacked when people complain about white supremacists; similarly, criticism of male assaulters isn't criticism of all men.

The third reason his examples are bad is that, in both his examples, the group being blamed is a non-dominant group that's been discriminated against for centuries. Contrast this with the group he's comparing to, men. A negative consequence of obsessing about black-on-white crime is a system of mass incarceration that wrecks millions of lives a year. A negative consequence of obsessing about Muslim-on-Christian terrorism was a war that killed hundreds of thousands of Iraqi civilians. In contrast, a negative consequence of obsessing about assaults committed by powerful men is...I'm not sure what it is, but I'm pretty sure it doesn't involve hundreds of thousands of dead people. It would've been easy for him to flip his examples around, which would have made them more apt but less persuasive, so this choice seems like sophistry here.

These counter-arguments are obvious, and Scott Alexander is smart and thorough, so the fact that he doesn't rebut or even mention them is worrisome to me. I think when it comes to gender issues and feminism, he has biases, perhaps driven in part by what happened to Scott Aaronson, and I've had this feeling reading his blog before. And of course we all have biases, certainly I do, but I'm not the main source on gender issues for eight gajillion rationalists. (His post has been shared on several men's rights subreddits, so he's a source for other demographics as well.) So my request is -- please don't take Slate Star Codex as a definitive source on gender issues. He's smart and provocative and I read him, but please read people who disagree with him too.

Sunday, August 6, 2017

Testing for discrimination in college admissions

Recently, the Trump administration’s investigation into racial discrimination in college admissions has brought the topic back into the news. But the claim that some races need higher GPAs or SAT scores to be admitted to colleges is, of course, an old one. This post discusses the statistical subtleties involved in proving such a claim: specifically, I examine some of the arguments that Asian applicants need higher SAT scores than white applicants. To be open about my beliefs at the outset, I think that colleges probably do discriminate against Asians, as they once discriminated against Jews, but the statistical arguments made to prove discrimination are often flawed. This also describes my beliefs about discrimination more broadly: while it is pervasive, quantifying it statistically is hard.

We’re going to use a hypothetical example where only whites and Asians apply for admission, Asians tend to have higher SAT scores than whites, and the only thing that actually affects whether you get admitted is your SAT score. So in this hypothetical example, there is no discrimination; your race does not affect your chances of admission.

On the left, I show the scores for Asian applicants and white applicants. On the right, I show how your probability of admission depends on your SAT score. So someone with an SAT score of 1400 has about a 50% chance of admission, regardless of whether they’re white or Asian. Given that there’s no discrimination in our hypothetical example, if a statistical argument implies there is discrimination, that argument is flawed. So let’s take a look at some arguments.

The most common argument I’ve seen that Asians are discriminated against is that the SAT scores of admitted Asians are higher than SAT scores of admitted whites. But Kirabo Jackson, an economist at Northwestern University, points out the flaw in this argument. In our hypothetical example, where there is no discrimination, admitted Asians will have an average score of about 1460, and admitted whites will have an average score of about 1310. This happens because the Asian distribution is shifted to the right: even though a kid with a 1500 is equally likely to get in regardless of whether they’re white or Asian, there are more Asians with 1500s.
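Here is a minimal sketch of that argument in code. Everything specific in it is my assumption for illustration, not the setup behind the numbers above: scores are drawn from normal distributions (whites centered at 1300, Asians at 1400, both with a standard deviation of 100), and the race-blind admission curve is a logistic function that passes through 50% at 1400. The names `admit_prob` and `simulate` are hypothetical helpers. The exact averages depend on those choices, but the direction of the gap does not.

```python
import math
import random


def admit_prob(score, midpoint=1400.0, scale=50.0):
    """Race-blind admission probability: depends on SAT score only (assumed logistic)."""
    return 1.0 / (1.0 + math.exp(-(score - midpoint) / scale))


def simulate(mean, sd, n, rng):
    """Draw n applicants from an assumed normal score distribution; return admitted scores."""
    admitted = []
    for _ in range(n):
        # Clip to the valid SAT range.
        score = min(1600.0, max(400.0, rng.gauss(mean, sd)))
        if rng.random() < admit_prob(score):
            admitted.append(score)
    return admitted


rng = random.Random(0)
asian_admitted = simulate(mean=1400.0, sd=100.0, n=50_000, rng=rng)
white_admitted = simulate(mean=1300.0, sd=100.0, n=50_000, rng=rng)

mean_asian = sum(asian_admitted) / len(asian_admitted)
mean_white = sum(white_admitted) / len(white_admitted)
gap = mean_asian - mean_white

print(f"mean SAT of admitted Asians: {mean_asian:.0f}")
print(f"mean SAT of admitted whites: {mean_white:.0f}")
print(f"gap despite identical admission rule: {gap:.0f}")
```

Both groups pass through the same `admit_prob` function, so any gap in the printed averages comes purely from the difference in the applicant distributions.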

When I ran this argument by a friend, he said that the study which people often cite when claiming Asians are discriminated against is considerably more sophisticated. So I read the study, and it is more sophisticated; it’s worth reading. They fit a model where they simultaneously control for someone’s race and SAT score, which lets you see whether people of some races need higher scores to get in.

Here’s the subtlety. The paper doesn’t actually look at SAT scores, but SAT scores divided into bins from 1200 - 1300, 1300 - 1400, and so on. Within those bins, the paper’s model assumes all applicants should have an equal chance of admission (all else being equal). But that isn’t quite right: an applicant with a 1290 will have a higher chance of admission than an applicant with a 1210. And because Asians are right-shifted in our example, that means that Asians in the 1200 - 1300 bin will have higher scores, and a higher chance of admission, than whites in the 1200 - 1300 bin, even though the paper’s model assumes that applicants in that bin should be equal if there is no discrimination. Below is a plot which illustrates the idea. Within each score bin, Asians (red line) have a higher average SAT score (left plot), and thus a higher chance of admission (right plot), than whites in the same bin (blue line).

So what happens when we fit the paper’s model on our hypothetical data? Now we find discrimination against whites. This happens because the blue lines are below the red lines: whites in a bin have a lower chance of admission than Asians in a bin because they have lower average scores. So the paper’s model will incorrectly conclude that, controlling for SAT score, whites have about 20% lower odds of admission, a significant amount of discrimination. I should note it’s entirely possible that the authors fit other models that don’t bin SAT scores, although I couldn’t find those models mentioned in the paper [1]; please point me to anything I’ve missed.
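To make the binning problem concrete, here is a rough sketch under assumptions of my own. It is not the paper's model: instead of a logistic regression with bin indicators, it uses a Mantel-Haenszel odds ratio, a standard way to pool an odds ratio across strata, with 100-point SAT bins as the strata. The score distributions (normal, means 1300 and 1400, standard deviation 100) and the logistic admission curve are illustrative choices, and `admit_prob` and `draw_group` are hypothetical helper names. The size of the effect depends on these choices, so it need not match the 20% figure above.

```python
import math
import random
from collections import defaultdict


def admit_prob(score, midpoint=1400.0, scale=50.0):
    """Race-blind admission probability (assumed logistic in SAT score)."""
    return 1.0 / (1.0 + math.exp(-(score - midpoint) / scale))


def draw_group(mean, sd, n, rng):
    """Return (score, admitted) pairs for n applicants under the race-blind rule."""
    out = []
    for _ in range(n):
        score = min(1600.0, max(400.0, rng.gauss(mean, sd)))
        out.append((score, rng.random() < admit_prob(score)))
    return out


rng = random.Random(1)
whites = draw_group(1300.0, 100.0, 100_000, rng)
asians = draw_group(1400.0, 100.0, 100_000, rng)

# counts[bin] = [white admitted, white rejected, Asian admitted, Asian rejected]
counts = defaultdict(lambda: [0, 0, 0, 0])
for score, admitted in whites:
    counts[int(score // 100)][0 if admitted else 1] += 1
for score, admitted in asians:
    counts[int(score // 100)][2 if admitted else 3] += 1

# Mantel-Haenszel pooled odds ratio of admission, white vs. Asian,
# "controlling" for SAT score only through the 100-point bins.
num = 0.0
den = 0.0
for a, b, c, d in counts.values():
    n = a + b + c + d
    num += a * d / n
    den += b * c / n
or_white_vs_asian = num / den

print(f"pooled odds ratio (white vs. Asian), binned scores: {or_white_vs_asian:.2f}")
```

An odds ratio below 1 here looks like discrimination against whites, even though the simulated admission rule never looks at race. Shrinking the bin width should pull the ratio back toward 1, since narrower bins leave less room for scores to differ within a bin.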

Okay. So we took hypothetical data that had no discrimination. One widely repeated statistical argument shows discrimination against Asians. Another widely repeated statistical argument shows discrimination against whites. This isn’t good. The basic mathematical takeaway is that when races have different distributions over a variable (like SAT score) and you divide that variable into bins, you can get misleading results. (See the literature on infra-marginality for interesting discussions of related phenomena in tests for police discrimination.)

The broader takeaway is that testing for discrimination is really hard. Which isn’t to say you should discount all evidence that it occurs; you should just be mindful of the caveats. Also, these statistical problems are tricky and fun to think about, so you should come work with me on them.

Footnotes:

[1] One of the authors went on to write a book on the topic, the one cited in the lawsuit against Harvard; I took a look at the relevant chapter, and it seems to use a similar binning strategy for SAT scores. To be clear, just because a model has caveats worth discussing doesn’t mean the work is bad or the conclusions are wrong; indeed, the book appears to be impressively comprehensive. Also, our hypothetical example actually suggests that this model might underestimate the amount of discrimination against Asians.

Monday, April 17, 2017

Proving discrimination from personal experience

Here’s an interaction you might’ve participated in:

Member of minority group: I just had [negative interaction] with John. I don’t think he would’ve done that if I hadn’t been a minority.

Listener: That sucks. But...how do you know it was because you were a minority? Maybe he was just having a bad day or he was really busy or …

The negative interaction might be, say, that John talked down to them or didn’t include them on a project. The listener’s reaction is totally reasonable and well-intentioned (at least, I hope it is, because I’ve had it myself). Sometimes it isn’t even said out loud; the listener just thinks it. Here I argue that this reaction is not the most useful one. I explain why, both in English and in math, and then I suggest four more useful reactions.

The problem with this reaction is not that it’s false. It’s that it’s obvious. If a minority tells you about something bad that happened to them, you can almost always attribute it to factors other than their minority status. (Throughout this essay, I’ll refer to negative behavior that’s due to someone’s minority status as “discrimination”.) Worse, this uncertainty will persist even if the discrimination occurs repeatedly and is quite significant. The core reason for this is that human behavior is complicated, there are lots of things that could explain a given interaction, and in our lives we observe only a small number of interactions. Because it is so hard to rule out other factors, individual discrimination suits have notoriously low success rates.

Let’s be clear: I’m not saying you can never prove discrimination from someone’s individual experience. Obviously, there are some experiences which are so blatant that discrimination is the only explanation: if someone drops a racial slur or grabs their female coworker by the whatever, we know they’re a president bigot. But, in today’s workplaces, problematic discrimination is rarely so overt -- hence the term “second generation” discrimination. Here’s a picture:

Here’s a simple mathematical model that formalizes this idea. If you don’t like math, feel free to skip to the “What should we do instead” section. Let’s say the result of an interaction, Y, depends on a number of observable factors, X, one of which is whether someone’s a minority. Specifically, let:

Y = X * beta + noise

where beta is a set of coefficients describing how much each factor matters, and noise is due to random things we don’t observe. So, for example, Y might be your grade on a computer science assignment, X might include factors like “does your code produce the correct output” and “are you a minority” and noise might be due to stuff like how quickly the TA is grading [1].

If we want to know whether there’s discrimination, we need to figure out the value of beta_minority: this will tell us whether minorities get worse outcomes just for being minorities. We can infer this value using linear regression, and importantly, we can also infer the uncertainty on the value.

Here’s the problem. When you do linear regression on a small number of datapoints (which is all a person has, given that they don’t observe that many interactions) you’re going to have huge uncertainty in the inferred values. To illustrate this, I ran a simulation using the model above with two groups, call them A and B, each half the population. I set the parameters so there was a strong discrimination effect against B. Specifically, even though A and B are equal along other dimensions, the average person in A will be ranked higher than about two thirds of people in B, due solely to discrimination; if you look at people in the top 5%, less than a third will be B. So this is enough discrimination to produce substantial underrepresentation. But when we try to infer the value of the discrimination coefficient, we can’t be sure there’s discrimination. In the plot below, the horizontal axis is how many interactions we observe; the blue area shows the 95% confidence interval for the discrimination coefficient (with negative values showing discrimination against B); the black line shows a world with no discrimination.

The important point is that the blue shaded area overlaps 0 -- meaning no discrimination is possible -- even if you have literally dozens of interactions, which is way more than you often have. (For fewer than about 5 interactions, the errorbars just blow up and you can’t even graph it.) You can alter simulation parameters or simulate things slightly differently, but I don’t think you’ll change the basic point: you can’t infer effect sizes on sample sizes this small with any confidence.
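If you want to poke at this yourself, here is a stripped-down sketch. It is not the simulation behind the plot, and the choices in it are my assumptions: the only factor in X is group membership, noise is standard normal, and the discrimination coefficient is -0.43 noise standard deviations, which under normal noise puts the average person in A above roughly two thirds of B. With a single binary factor plus an intercept, the least-squares coefficient is just the difference in group means, so no regression library is needed. Rather than the confidence interval from one fit, the hypothetical helper `estimate_spread` reports the middle 95% of the estimated coefficient across many simulated datasets of the same size.

```python
import random


def estimate_spread(n, beta_minority=-0.43, reps=2000, seed=0):
    """Middle 95% of the estimated discrimination coefficient over `reps` datasets.

    Each dataset has n interactions, half with group A and half with group B.
    Outcome model (assumed): Y = beta_minority * is_B + noise, noise ~ N(0, 1).
    """
    rng = random.Random(seed)
    half = n // 2
    estimates = []
    for _ in range(reps):
        a = [rng.gauss(0.0, 1.0) for _ in range(half)]
        b = [beta_minority + rng.gauss(0.0, 1.0) for _ in range(half)]
        # OLS coefficient on a binary indicator equals the difference in means.
        estimates.append(sum(b) / half - sum(a) / half)
    estimates.sort()
    lo = estimates[int(0.025 * reps)]
    hi = estimates[int(0.975 * reps) - 1]
    return lo, hi


spreads = {n: estimate_spread(n) for n in (10, 20, 40, 400)}
for n, (lo, hi) in spreads.items():
    verdict = "includes 0" if lo < 0.0 < hi else "excludes 0"
    print(f"n={n:4d}: estimate ranges from {lo:+.2f} to {hi:+.2f} ({verdict})")
```

Under these assumptions the range still straddles zero at 40 interactions and only clears it once the sample gets into the hundreds, which is far more than one person usually observes.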

This model also illustrates some features which make concluding discrimination harder. For example, our errorbars will be larger if other features in X are correlated with being a minority. (“No no, I didn’t promote him because he’s a man. I promoted him because we work well together because we always go out to dinner together / play basketball together / he sounds so much more confident. Well, yes, my wife says I can’t go out to dinner with women…”) Also, your errorbars will be larger if you’re observing repeated interactions from the same person. (If you’re trying to compare your treatment to that of a single coworker, it’s even harder to be sure if it’s because you’re a minority or because of one of the innumerable other ways in which you’ll inevitably differ.) Last, you’re going to be in even more trouble if your minority is a very small fraction of the population whose interactions you observe (say, computer scientists) -- I don’t know if most computer scientists are prejudiced against African-American students because I’ve literally never seen them interact with one.

It’s worth noting that there are a lot of other subtleties in detecting discrimination which have nothing to do with small sample size and which this model doesn’t capture (see the intro to this paper for a brief, clear introduction) but I think small sample size is probably the biggest challenge in the individual-human-experience-setting, so it’s what I focused on here.

What should we do instead?

So it isn’t useful to tell someone that they can’t be sure their experience is due to discrimination, because even in cases when a large amount of discrimination is occurring, people often won’t observe the data to conclusively rule out other factors. What should we do instead?

Here’s one thing I don’t think we should do: assume that discrimination is occurring every time a minority says they think it might be. (I do think we should assume they’re telling the truth about what occurred). The solution to uncertainty and bad data is not to always rule in favor of one party, since it creates perverse incentives and people’s lives get wrecked both by discrimination and by allegations of discrimination. Instead:

1. Recognize the severity of the problem that minorities deal with. It’s not that they hallucinate discrimination everywhere or are incapable of logical thinking or rigorous standards of proof. It’s that proving discrimination from anecdotal experience is frequently an extremely difficult statistical task. Also, it’s exhausting to continually deal with the unprovable possibility of discrimination: to wonder, every time something doesn’t work out, if some subtle injustice was at play.
2. Use common sense. Statisticians call this “a prior”: i.e., you let your prior knowledge about how the world works inform how you interpret the data. So, for example, if you hear someone refer to a black student as “articulate” or a female professor as “aggressive”, you don’t need to hear one hundred more examples to suspect prejudice may be at play. Your prior knowledge about how those adjectives are used helps you conclude discrimination more quickly. (I suspect that one reason female judges are more inclined to rule in favor of discrimination suits is because they have different prior beliefs about how common discrimination is.)
3. Aggregate data. If one person’s experience doesn’t give you enough data to rule out other factors, aggregate experiences. Class-action lawsuits are an essential means of going after discriminatory employers for this reason. Climate surveys within departments are another example, as is publishing systematic salary gap data (as Britain now does). The sexual assault reporting system Callisto, which aggregates accusations of assault against the same person, is based on a related idea, as I’ve discussed.
4. Conduct workplace audit studies. This idea is kind of crazy and might get you fired, but here it is: if it’s hard to prove discrimination because there are too many other factors at play, keep the other factors constant. Here are some examples:

When a female employee says something in a meeting and people ignore it and then a male employee says the exact same thing and gets a more positive response, we’re more convinced that’s discrimination. (There are a hilarious number of Google results for that phenomenon, by the way.)
A few years ago, I spent a few weeks emailing the NYT’s technical team and getting no response; finally I asked my boyfriend to send them the exact same question, and they immediately responded.
Or take this recent case, where a male and female employee switched their email accounts and were treated dramatically differently.

All these examples feel like compelling evidence of discrimination because it’s hard to pin the different outcome on extraneous factors; everything except minority status remains the same.

So, could you do this in your workplace? More and more interactions occur online, making it easier to switch identities: for example, you could imagine switching Slack accounts for a week. Obviously there are 14 million ways this could go wrong, but drop me a line if you try it.

Footnotes:

[1] This is easily extended to binary outcomes: Y ~ Bernoulli(sigmoid(X * beta + noise))