This post discusses sexual assault. I have tried to keep descriptive details to the minimum required for the math. I am aware this is a sensitive and controversial topic and, as always, welcome your emails with comments, suggestions, or objections.
The Bill Cosby case has highlighted the threat of serial offenders, and may illustrate a larger trend. One study found that more than 90% of campus sexual assaults were committed by a very small proportion of the population (6%) who on average committed six assaults. If this study is accurate, it seems profoundly important, because it implies most assaults are not being committed by people who sincerely misinterpret their partner’s intentions once. While it can be difficult to determine guilt in a single case, my intuition is that a person independently accused of assault by multiple people is quite unlikely to be innocent.
Because I, like many people, feel strongly about sexual assault, I built a mathematical model to investigate whether this intuition was accurate. Based on the above discussion of serial offenders, I included two groups in my model: “disproportionate assaulters”, who assault people at a high rate, and everyone else, who assault people at a lower rate. I refer to these groups as “DAs” and “non-DAs”.
One question of interest is: given that someone has been accused of assault by k people, what is the probability that they are guilty in at least one case? I created a simulation you can play with to answer this question. The horizontal axis is how many people have accused someone of assault, and the vertical axis is the probability that they are completely innocent, that is, one minus the probability that they are guilty in at least one case. The model has the following parameters; I set their initial values pretty arbitrarily, and I invite you to modify them by playing with the sliders:
p0, proportion of people who are DAs
pr1, probability a DA will assault someone in a given encounter
pr2, probability a non-DA will assault someone in a given encounter
pag, probability someone who is guilty of assault in a given encounter will be accused
pai, probability someone who is innocent of assault in a given encounter will be accused
n1, number of sexual encounters had by a DA
n2, number of sexual encounters had by a non-DA
Shoot me an email if you have good ways to pin down any of these values; the values I chose yield roughly the results in the original paper on serial offenders, but there’s a lot of residual uncertainty.
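For readers who prefer code to sliders, here is a minimal sketch of how the curve could be computed in closed form. It is not the simulation's actual code. It assumes that encounters are independent, that each encounter produces at most one accusation, and that every accusation comes from a different person, so k accusers means k accused encounters. The function names and the parameter values at the bottom are placeholders I made up for illustration, not the values used in the simulation.

```python
from math import comb


def accusation_terms(n, k, pr, pag, pai):
    """For one group, return (P(k accusations), P(k accusations and zero assaults)).

    Assumes n independent encounters, each yielding at most one accusation.
    """
    # Probability that any single encounter ends in an accusation
    q = pr * pag + (1 - pr) * pai
    p_k = comb(n, k) * q**k * (1 - q) ** (n - k)
    # Every encounter is assault-free; exactly k of them are (falsely) accused
    p_k_innocent = (
        comb(n, k) * ((1 - pr) * pai) ** k * ((1 - pr) * (1 - pai)) ** (n - k)
    )
    return p_k, p_k_innocent


def prob_innocent(k, p0, pr1, pr2, pag, pai, n1, n2):
    """P(committed no assault at all | accused by exactly k people)."""
    da_k, da_innocent = accusation_terms(n1, k, pr1, pag, pai)
    non_k, non_innocent = accusation_terms(n2, k, pr2, pag, pai)
    total = p0 * da_k + (1 - p0) * non_k
    if total == 0:
        raise ValueError("k accusations are impossible under these parameters")
    return (p0 * da_innocent + (1 - p0) * non_innocent) / total


# Placeholder values for illustration only (assumptions, not the post's settings)
params = dict(p0=0.06, pr1=0.3, pr2=0.005, pag=0.2, pai=0.002, n1=20, n2=10)
curve = [prob_innocent(k, **params) for k in range(5)]
for k, p in enumerate(curve):
    print(f"accusers={k}  P(completely innocent)={p:.4f}")
```

Setting pr1 = pr2 and n1 = n2 in `params` collapses the two groups into one, which is a quick way to check how much of the effect depends on the two-group assumption.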
Obviously this model is idealized and does not capture all the complexities at play here. (Feel free to extend it yourself and write to me about it! Here’s some math and code; the last note at the end of this post has some thoughts on how you might extend it.) Still, from playing with it, we can make a few observations:
- The intuition that “if you are accused of assault by multiple people, you’re probably guilty” is often accurate. Importantly, even if you were to choose settings where someone who has been accused once is more likely than not to be innocent, the probability of innocence often drops dramatically if they have been accused twice.
- This is still mostly true even if we don’t buy the assumption that there are two different kinds of people (by setting pr1 = pr2 and n1 = n2).
- As we would expect, the probability that someone is innocent increases dramatically as we increase pai, the probability that someone who is innocent will be accused. But we know pai must be very small simply because the vast majority of people are never accused of assault. For example, if each person has 10 sexual encounters and 90% of people are never accused of sexual assault, then (1 - pai)^10 must be at least 0.9, so pai can be at most about 1% even if only innocent people are accused. (Thanks to Seth Stephens-Davidowitz for pointing this out.)
- Increasing our certainty that the guilty are guilty is not just good for accusers: it is also good for the accused, because it potentially allows us to raise the standard of evidence while still catching the same number of guilty people.
- In some cases with multiple accusations, it is essentially impossible that someone is innocent. My mother, who worked as a prosecutor in sexual assault cases, observed that because some repeat offenders use very similar methodology each time, accusers’ testimonies can share distinctive details in a way that would be impossible if no assault occurred (assuming accusations are levied independently).
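The ceiling on pai mentioned in the list above is easy to check numerically. This is a small sketch, assuming encounters are independent and that every accusation is a false one (the worst case for pai); the function name is mine.

```python
def max_false_accusation_rate(n_encounters, frac_never_accused):
    """Largest pai consistent with the share of people who are never accused.

    If every accusation were false, the chance of never being accused over
    n encounters is (1 - pai) ** n, which must be at least frac_never_accused.
    Solving for pai gives pai <= 1 - frac_never_accused ** (1 / n).
    """
    return 1 - frac_never_accused ** (1 / n_encounters)


# The example from the text: 10 encounters, 90% of people never accused
bound = max_false_accusation_rate(10, 0.90)
print(f"pai can be at most about {bound:.2%}")
```

With more encounters per person, or a larger share of people who are never accused, the ceiling gets lower still.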
But how can we combine multiple accusations of assault, given that survivors are usually unaware of each other and often reluctant to come forward? The New York Times recently reported on a tool designed to do this. It allows survivors to file accusations with a third party, who will keep the accusations confidential unless multiple accusations are levied at the same person. The thought is that this could make survivors of assault more willing to come forward and make it easier to identify serial offenders. My initial reaction to this idea was that it was so exciting I should drop out of school to go work on it. After thinking about it further and reading this paper, I concluded that this idea has at least three downsides as well:
- It may discourage survivors from reporting assault via the usual avenues -- if they file a third-party accusation and no one else does, they may conclude that they were “mistaken” and never follow up, which seems very harmful.
- You do not want to create a world where only people accused of assault multiple times are ever convicted. This has echoes of the “woman’s testimony is worth half of a man’s” standard which is applied in some countries. You also really don’t want people feeling like they can commit “one free assault”.
- If we have to wait for multiple assaults to be reported, serial assaulters have more time to commit assaults.
I cannot overemphasize that this is a complicated and painful problem to which there are no easy technological or statistical solutions. But I do think the approach of combining multiple accusations is an interesting one, so I’d love to hear your thoughts.
After writing the main post, I wanted to add a statistical note on the doubts which have emerged about the UVA sexual assault case. (If you’re not familiar with the details, several weeks ago Rolling Stone published a story about a gang rape at UVA which got a lot of attention; a few days ago, they issued a statement saying there were “discrepancies” in the accuser’s account and their trust in her was “misplaced”, and the internet exploded.) I think the UVA episode illustrates precisely why statistics are so important -- because anything can happen in a single story, making it a risky thing to hang a cause on. Regardless of what really happened at UVA, the broader trend is clear: the rate of campus sexual assaults is high (20%, says CDC, although better data should be collected); the rate of false accusations is low (this review cites 6 studies which all yield estimates between 2% and 8%, lower than the rate of false reports for car theft). This is much more important than what happened in a single UVA fraternity on a single night. Similarly, to me the compelling story behind Ferguson is not contingent on what exactly happened between Darren Wilson and Michael Brown over the course of 90 seconds -- it is the systemic racial divides in Ferguson, and the research that makes discrimination against African-Americans by the police and justice system all too overwhelmingly clear. Causes are more robust to randomness when backed by statistics in addition to anecdotes.
 This is known as a mixture model, a very useful statistical tool that assumes that your data is generated by a combination of different groups. For example, you might assume that Tweets are generated by a mixture of Democrats and Republicans, or gene expression patterns are generated by a mixture of cancer cells and healthy cells. Obviously, this model should not be taken to imply that assault is committed only by “evil people” who are immune to social and cultural factors, just that some people tend to commit assaults more than others. The fact that rates of assault are much higher in some environments implies that social and cultural factors do play a role both in how likely someone is to commit assault and how likely they are to get away with it.
 There are other questions as well: for example, given that someone has been accused of assault by k people, what is the probability they are in the serial offender group? Given that someone has been accused of assault by k people, what is the probability that a particular allegation is true?
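Both of these questions can be answered with the same ingredients. Here is a rough sketch under the same simplifying assumptions as the main model (independent encounters, at most one accusation per encounter, each accusation from a different person). The function names and parameter values are placeholders of mine, not the simulation's.

```python
from math import comb


def binom_pmf(k, n, q):
    """Probability of exactly k accusations in n independent encounters."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def prob_da(k, p0, pr1, pr2, pag, pai, n1, n2):
    """P(person is a DA | accused by exactly k people), via Bayes' rule."""
    q1 = pr1 * pag + (1 - pr1) * pai  # per-encounter accusation rate, DAs
    q2 = pr2 * pag + (1 - pr2) * pai  # per-encounter accusation rate, non-DAs
    da = p0 * binom_pmf(k, n1, q1)
    non_da = (1 - p0) * binom_pmf(k, n2, q2)
    return da / (da + non_da)


def prob_allegation_true(k, p0, pr1, pr2, pag, pai, n1, n2):
    """P(one particular allegation is true | accused by exactly k people)."""
    # Within a group, an accused encounter involved an assault with this probability
    true_given_da = pr1 * pag / (pr1 * pag + (1 - pr1) * pai)
    true_given_non = pr2 * pag / (pr2 * pag + (1 - pr2) * pai)
    w = prob_da(k, p0, pr1, pr2, pag, pai, n1, n2)
    return w * true_given_da + (1 - w) * true_given_non


# Placeholder values for illustration only (assumptions, not the post's settings)
params = dict(p0=0.06, pr1=0.3, pr2=0.005, pag=0.2, pai=0.002, n1=20, n2=10)
for k in range(1, 4):
    print(k, round(prob_da(k, **params), 4), round(prob_allegation_true(k, **params), 4))
```

Note that once the group is known, the chance that a given allegation is true does not depend on k in this model; the number of accusers matters only through what it says about which group the person belongs to.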
 Particularly pai, the probability that someone who is innocent is accused, since this is such an important parameter. For example, if only 2 - 8% of accusations are false, ought we choose a value of pai such that 92 - 98% of those accused of sexual assault by one person are guilty -- either the accusation is false or the person is guilty? Or is there some third possibility -- perhaps that the victim is telling the truth but their story does not meet a legal standard?
For example, there probably aren’t really two clearly separate populations -- there’s some continuous distribution of propensities to assault, and of the number of people you have sexual encounters with.