Remember 2020? At the end of last year, you were all invited to a virtual New Year’s Eve party, and part of that party was a quiz where we made probabilistic guesses about things that might happen in 2021. Well, the results are in, and we thought it would be fun to throw together a report summarizing how it went.
Subparforecasting is a joking reference to superforecasting, a process based on research by Philip Tetlock about how to develop a group of people who are really good at making predictions, like, better than experts in the field in question or CIA analysts.
A key part of Tetlock’s process is getting feedback: making predictions, seeing how they work out, and learning from the process. Matthew Yglesias did a nice summary of this in his piece on how to be less full of shit.
Given that the point of all of this is to get feedback, we thought we’d go ahead and score people’s results.
Scoring is a bit of a dark art, and we didn’t think too hard about it. We just picked the simplest rule we could find on Wikipedia: the logarithmic scoring rule. Log scoring is pretty simple. For this test, we convert every probabilistic guess into a number p between 0 and 1. Then we define the score as follows.
\text{score}(p) = \begin{cases} \ln(p), & \text{if the outcome was true} \\ \ln(1-p), & \text{if the outcome was false} \end{cases}
So, let’s say you guessed that a given outcome was 90% likely to happen, and it did. Then your score would be \ln(0.9) \approx -0.105. If it didn’t happen, your score would be \ln(1-0.9) = \ln(0.1) \approx -2.303.
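Here is a minimal Python sketch of that rule, reproducing the two numbers above. The function name `log_score` is ours, made up for illustration; this is not the code that produced the report.

```python
import math

def log_score(p: float, outcome: bool) -> float:
    """Log score for a guess p (strictly between 0 and 1) on one question.

    outcome is True if the thing happened, False if it did not.
    """
    return math.log(p) if outcome else math.log(1.0 - p)

# The 90% example from the text, rounded to three decimals.
print(round(log_score(0.9, True), 3))   # -0.105
print(round(log_score(0.9, False), 3))  # -2.303
```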
So, all scores are negative (or zero, in the one case where you said 100% or 0% and were right), and higher scores are better. One interesting aspect of this scoring is that if you’re absolutely confident and wrong, you get penalized infinitely! I.e., if you say something is 0% or 100% likely to happen and you’re wrong, the score is negative infinity.
In the spirit of generous interpretation, we decided to adjust every 0% prediction to a 1% prediction, and every 100% prediction to 99%, since we think people weren’t quite thinking in absolutes.
(As a side-note, of the predictions that were entirely certain, 50% of them were wrong!)
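The adjustment can be sketched in a few lines of Python. The helper names `soften` and `softened_log_score` are hypothetical, and we assume only exact 0% and 100% guesses get moved, as described above.

```python
import math

def soften(p: float) -> float:
    """Move an absolute guess off the edge: 0% becomes 1%, 100% becomes 99%."""
    if p <= 0.0:
        return 0.01
    if p >= 1.0:
        return 0.99
    return p

def softened_log_score(p: float, outcome: bool) -> float:
    """Log score computed after softening absolute guesses."""
    q = soften(p)
    return math.log(q) if outcome else math.log(1.0 - q)

# A confident-and-wrong guess now costs ln(0.01) instead of negative infinity.
print(round(softened_log_score(1.0, False), 3))  # -4.605
```

So being certain and wrong is still very expensive, just not infinitely so.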
Here they are. We sorted by the average log score.
name | score | num votes |
---|---|---|
Jeremy Dauber | -0.511 | 1 |
Michelle Fisher | -0.535 | 2 |
Ariela Migdal | -0.573 | 4 |
Yaron Minsky | -0.578 | 16 |
Ethan Tucker | -0.707 | 3 |
Dima Gorenburg | -0.709 | 15 |
Zev Minsky-Primus | -0.709 | 13 |
Lisa Primus | -0.746 | 17 |
Richard Primus | -0.777 | 11 |
Dave Tytell | -0.832 | 19 |
Martha Escobar | -0.856 | 22 |
Dan Cohen | -0.877 | 3 |
Martha Walker | -0.885 | 16 |
Yair Minsky | -1.003 | 10 |
Miri Pomerantz | -1.204 | 1 |
Romana Primus | -1.353 | 21 |
Yehuda Kurtzer | -1.386 | 1 |
Naftaly Minsky | -1.609 | 1 |
Tova Ovadia | -1.886 | 8 |
Sema Stein | -2.160 | 9 |
Jess Tytell | -2.526 | 1 |
One thing that feels a little weird about this is that the top of the ranking is dominated by folk who just made a handful of guesses. One thing we could do is restrict to just people who made 10 guesses or more:
name | score | num votes |
---|---|---|
Yaron Minsky | -0.578 | 16 |
Dima Gorenburg | -0.709 | 15 |
Zev Minsky-Primus | -0.709 | 13 |
Lisa Primus | -0.746 | 17 |
Richard Primus | -0.777 | 11 |
Dave Tytell | -0.832 | 19 |
Martha Escobar | -0.856 | 22 |
Martha Walker | -0.885 | 16 |
Yair Minsky | -1.003 | 10 |
Romana Primus | -1.353 | 21 |
But the 10-vote limit is a little arbitrary. Another idea would be to fill out the rankings by treating every question a person skipped as a 50/50 guess. This isn’t fair, exactly, but it’s interesting, in that it’s a super simple strategy that everyone could employ.
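As a rough sketch of that fill-in rule, assuming guesses are stored as a dictionary from question to probability (the data layout, question names, and function name here are all made up for illustration):

```python
import math

def average_score_with_fill(guesses: dict, outcomes: dict, fill: float = 0.5) -> float:
    """Average log score over every question, using `fill` for skipped ones.

    Assumes every guess is strictly between 0 and 1.
    """
    total = 0.0
    for question, happened in outcomes.items():
        p = guesses.get(question, fill)  # skipped question -> 50/50
        total += math.log(p) if happened else math.log(1.0 - p)
    return total / len(outcomes)

# Hypothetical outcomes, and a participant who answered nothing at all.
outcomes = {"q1": True, "q2": False, "q3": True}
print(round(average_score_with_fill({}, outcomes), 3))  # -0.693
```

Someone who skips everything scores \ln(0.5) \approx -0.693 on every question, whatever the outcome, so that number is the baseline to beat.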
Let’s see how it does:
name | score | num votes |
---|---|---|
Yaron Minsky | -0.616 | 24 |
Ariela Migdal | -0.673 | 24 |
Michelle Fisher | -0.680 | 24 |
Jeremy Dauber | -0.686 | 24 |
Ida Gorenburg | -0.693 | 24 |
Ethan Tucker | -0.695 | 24 |
Zev Minsky-Primus | -0.702 | 24 |
Dima Gorenburg | -0.703 | 24 |
Miri Pomerantz | -0.714 | 24 |
Dan Cohen | -0.716 | 24 |
Yehuda Kurtzer | -0.722 | 24 |
Lisa Primus | -0.731 | 24 |
Naftaly Minsky | -0.731 | 24 |
Richard Primus | -0.731 | 24 |
Jess Tytell | -0.770 | 24 |
Dave Tytell | -0.803 | 24 |
Martha Walker | -0.821 | 24 |
Yair Minsky | -0.822 | 24 |
Martha Escobar | -0.843 | 24 |
Tova Ovadia | -1.091 | 24 |
Sema Stein | -1.243 | 24 |
Romana Primus | -1.247 | 25 |
You’ll notice a new competitor pretty high up in the rankings: Ida Gorenburg! But, why didn’t she show up before? Well, Ida came to the party, but she didn’t answer any questions at all. But she still scored in the top quartile or so! Which is to say, a lot of us ended up doing worse than just guessing 50% for everything! Also, some folk who answered just a few questions, like Ariela, Michelle and Jeremy, are near the top of the ranking.
Another natural question that arises is, what about the wisdom of crowds? If we just average everyone’s guesses together, do we do better? To check, we’ll synthesize a fake participant, Averagey McAverageface, whose guess is the average of all votes. Let’s see how he does!
name | score | num votes |
---|---|---|
Yaron Minsky | -0.616 | 24 |
Ariela Migdal | -0.673 | 24 |
Michelle Fisher | -0.680 | 24 |
Jeremy Dauber | -0.686 | 24 |
Averagey McAverageface | -0.690 | 24 |
Ida Gorenburg | -0.693 | 24 |
I truncated the list after the first few, but as you can see, we’re maybe not that wise as a crowd. Averagey McAverageface does just a hair better than Ida, who, as you’ll remember, didn’t answer any questions at all…
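For the curious, building a synthetic averaging participant could look something like the sketch below. We assume that each question is averaged only over the people who actually answered it; the names and data are hypothetical.

```python
from statistics import mean

def crowd_guesses(all_guesses: dict) -> dict:
    """Per-question average of everyone's guesses.

    all_guesses maps a person's name to their {question: probability} dict.
    """
    by_question = {}
    for guesses in all_guesses.values():
        for question, p in guesses.items():
            by_question.setdefault(question, []).append(p)
    return {question: mean(ps) for question, ps in by_question.items()}

# Hypothetical participants; bob skipped q2, so only alice counts there.
everyone = {
    "alice": {"q1": 0.75, "q2": 0.25},
    "bob": {"q1": 0.25},
}
print(crowd_guesses(everyone))  # {'q1': 0.5, 'q2': 0.25}
```

The resulting dictionary can then be scored like any other participant’s guesses.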
The following graphic lets you visualize the data and see how the predictors did collectively and individually.
This visualization works best on an ordinary computer, not a tablet or phone. That’s because you need to hover with your mouse in order to uncover more information. To hover over something, move the tip of your mouse pointer over the element in question and leave it there for a moment!
Each barbell represents a question, and one of the two ends of the barbell is marked as green, depending on whether the answer was true (on the right) or false (on the left). Hover over the line to see what the actual question was.
The red squares correspond to individual people’s guesses. Hovering will tell you the person’s identity and their probabilistic guess.
You can click a guess square to select it, at which point it will turn blue, as will all of that person’s guesses across all the questions.
The black mark is the average probability of all the guesses for that question. The questions are sorted by that average probability.
Here’s the data: