Subpar Forecasting, 2021!

Remember 2020? At the end of last year, you were all invited to a virtual New Year’s Eve party, and part of that party was a quiz where we made some probabilistic guesses about things that might happen in 2021. Well, the results are in, and we thought it would be fun to throw together a report summarizing how it went.

What is this about?

Subparforecasting is a joking reference to superforecasting, a process based on research by Philip Tetlock about how to develop a group of people who are really good at making predictions, like, better than experts in the field in question or CIA analysts.

A key part of Tetlock’s process is getting feedback: making predictions, seeing how they work out, and learning from the process. Matthew Yglesias did a nice summary of this in his piece on how to be less full of shit.

Scoring

Given that the point of all of this is to get feedback, we thought we’d go ahead and score people’s results.

Scoring is a bit of a dark art, and we didn’t think too hard about it. We just picked the simplest rule we could find on Wikipedia: the logarithmic scoring rule. Log scoring is pretty simple. For this test, we convert every probabilistic guess into a number p between 0 and 1. Then we define the score as follows.

\text{score}(p) = \begin{cases} \ln(p), & \text{if the outcome was true} \\ \ln(1-p), & \text{if the outcome was false} \end{cases}

So, let’s say you guessed that a given outcome was 90% likely to happen, and it did. Then your score would be \ln(0.9) = -0.105. If it didn’t happen, your score would be \ln(1-0.9) = \ln(0.1) = -2.303.

So, all scores are negative, and higher scores (closer to zero) are better. One interesting aspect of this scoring is that if you’re absolutely confident and wrong, you get penalized infinitely! I.e., if you say something is 0% or 100% likely to happen and you’re wrong, the score is negative infinity.
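For anyone who wants to play along at home, here is a minimal Python sketch of the scoring rule above. The function name log_score is ours, made up for illustration, and this is not necessarily the code we used to build the tables.

```python
import math

def log_score(p: float, outcome: bool) -> float:
    """Log score for a guess p (between 0 and 1) that the outcome is true."""
    # Probability the guess assigned to what actually happened.
    q = p if outcome else 1.0 - p
    # math.log(0) raises ValueError, so return negative infinity explicitly.
    return math.log(q) if q > 0 else -math.inf

print(round(log_score(0.9, True), 3))   # -0.105
print(round(log_score(0.9, False), 3))  # -2.303
print(log_score(1.0, False))            # -inf
```

The last line is the "absolutely confident and wrong" case: a 100% guess on something that didn’t happen.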

In the spirit of generous interpretation, we decided to adjust every 0% prediction to a 1% prediction, and every 100% prediction to 99%, since we think people weren’t quite thinking in absolutes.
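A rough sketch of that adjustment in Python, assuming guesses are stored as numbers between 0 and 1. The names adjust and adjusted_log_score are hypothetical, chosen just for this example.

```python
import math

def adjust(p: float) -> float:
    """Soften absolute guesses: 0% becomes 1%, 100% becomes 99%."""
    if p <= 0.0:
        return 0.01
    if p >= 1.0:
        return 0.99
    return p  # everything in between is left alone

def adjusted_log_score(p: float, outcome: bool) -> float:
    """Log score computed after softening absolute guesses."""
    p = adjust(p)
    return math.log(p if outcome else 1.0 - p)

print(round(adjusted_log_score(1.0, True), 3))   # -0.01
print(round(adjusted_log_score(1.0, False), 3))  # -4.605
```

So with the adjustment, being certain and wrong costs about -4.6 on that question instead of negative infinity. Still painful, but survivable.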

(As a side-note, of the predictions that were entirely certain, 50% of them were wrong!)

The rankings

Here they are. We sorted by the average log score.

name	score	num votes
Jeremy Dauber	-0.511	1
Michelle Fisher	-0.535	2
Ariela Migdal	-0.573	4
Yaron Minsky	-0.578	16
Ethan Tucker	-0.707	3
Dima Gorenburg	-0.709	15
Zev Minsky-Primus	-0.709	13
Lisa Primus	-0.746	17
Richard Primus	-0.777	11
Dave Tytell	-0.832	19
Martha Escobar	-0.856	22
Dan Cohen	-0.877	3
Martha Walker	-0.885	16
Yair Minsky	-1.003	10
Miri Pomerantz	-1.204	1
Romana Primus	-1.353	21
Yehuda Kurtzer	-1.386	1
Naftaly Minsky	-1.609	1
Tova Ovadia	-1.886	8
Sema Stein	-2.160	9
Jess Tytell	-2.526	1

One thing that feels a little weird about this is that the top of the ranking is dominated by folk who just made a handful of guesses. One thing we could do is restrict to just people who made 10 guesses or more:

name	score	num votes
Yaron Minsky	-0.578	16
Dima Gorenburg	-0.709	15
Zev Minsky-Primus	-0.709	13
Lisa Primus	-0.746	17
Richard Primus	-0.777	11
Dave Tytell	-0.832	19
Martha Escobar	-0.856	22
Martha Walker	-0.885	16
Yair Minsky	-1.003	10
Romana Primus	-1.353	21

But the 10 vote limit is a little arbitrary. Another idea would be to fill out the rankings by treating everyone who didn’t vote as having voted for 50/50. This isn’t fair, exactly, but it’s interesting, in that it’s a super simple strategy that everyone could employ.

Let’s see how it does:

name	score	num votes
Yaron Minsky	-0.616	24
Ariela Migdal	-0.673	24
Michelle Fisher	-0.680	24
Jeremy Dauber	-0.686	24
Ida Gorenburg	-0.693	24
Ethan Tucker	-0.695	24
Zev Minsky-Primus	-0.702	24
Dima Gorenburg	-0.703	24
Miri Pomerantz	-0.714	24
Dan Cohen	-0.716	24
Yehuda Kurtzer	-0.722	24
Lisa Primus	-0.731	24
Naftaly Minsky	-0.731	24
Richard Primus	-0.731	24
Jess Tytell	-0.770	24
Dave Tytell	-0.803	24
Martha Walker	-0.821	24
Yair Minsky	-0.822	24
Martha Escobar	-0.843	24
Tova Ovadia	-1.091	24
Sema Stein	-1.243	24
Romana Primus	-1.247	25

You’ll notice a new competitor pretty high up in the rankings: Ida Gorenburg! But, why didn’t she show up before? Well, Ida came to the party, but she didn’t answer any questions at all. But she still scored in the top quartile or so! Which is to say, a lot of us ended up doing worse than just guessing 50% for everything! Also, some folk who answered just a few questions, like Ariela, Michelle and Jeremy, are near the top of the ranking.
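If you want to check the filled-in numbers yourself, here is a small Python sketch. It assumes 24 questions in total (the num votes value shown for nearly every row of the filled-in table), and it starts from the rounded averages in the first table, so the last digit could occasionally be off. The function name filled_average is ours.

```python
import math

TOTAL_QUESTIONS = 24  # assumption, read off the filled-in table

def filled_average(avg_score: float, num_votes: int,
                   total: int = TOTAL_QUESTIONS) -> float:
    """Average log score after treating every skipped question as a 50% guess."""
    answered = avg_score * num_votes                   # sum of the real scores
    unanswered = (total - num_votes) * math.log(0.5)   # ln(0.5) per skipped question
    return (answered + unanswered) / total

print(round(filled_average(0.0, 0), 3))      # Ida: -0.693
print(round(filled_average(-0.511, 1), 3))   # Jeremy: -0.686
print(round(filled_average(-0.573, 4), 3))   # Ariela: -0.673
print(round(filled_average(-0.578, 16), 3))  # Yaron: -0.616
```

A 50% guess scores \ln(0.5) = -0.693 whether the outcome is true or false, which is why Ida’s score lands on exactly that number.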

Another natural question that arises is, what about the wisdom of crowds? If we just average everyone’s guesses together, do we do better? To check, we’ll synthesize a fake participant, Averagey McAverageface, whose guess is the average of all votes. Let’s see how he does!

name	score	num votes
Yaron Minsky	-0.616	24
Ariela Migdal	-0.673	24
Michelle Fisher	-0.680	24
Jeremy Dauber	-0.686	24
Averagey McAverageface	-0.690	24
Ida Gorenburg	-0.693	24

We truncated the list after the first few, but as you can see, we’re maybe not that wise as a crowd. Averagey McAverageface does just a hair better than Ida, who, as you’ll remember, didn’t answer any questions at all…
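Here is a sketch of how a synthetic participant like Averagey could be built, in Python. The guesses and outcomes below are made up for illustration, not the party data, and we assume the average on each question is taken only over the people who actually answered it.

```python
import math

# Hypothetical data: one dict per question, mapping person -> guess.
guesses = [
    {"alice": 0.9, "bob": 0.6, "carol": 0.3},
    {"alice": 0.2, "bob": 0.4},  # carol skipped this one
]
outcomes = [True, False]  # what actually happened on each question

def crowd_guess(question_guesses: dict) -> float:
    """Plain average of the guesses submitted for one question."""
    return sum(question_guesses.values()) / len(question_guesses)

def log_score(p: float, outcome: bool) -> float:
    """Log score, assuming p is strictly between 0 and 1."""
    return math.log(p if outcome else 1.0 - p)

# Score the crowd's averaged guess on each question, then average the scores.
crowd_scores = [log_score(crowd_guess(g), o) for g, o in zip(guesses, outcomes)]
crowd_average = sum(crowd_scores) / len(crowd_scores)
print(round(crowd_average, 3))
```

In this toy example the crowd guesses 60% and 30% on the two questions, and the averaged guess is then scored like any other participant’s.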

Visualizing the questions

The following graphic lets you visualize the data and see how the predictors did collectively and individually.

This visualization works best on an ordinary computer, not a tablet or phone. That’s because you need to hover with your mouse in order to uncover more information. To hover over something, move the tip of your mouse pointer over the element in question and leave it there for a moment!

Each barbell represents a question, and one of the two ends of the barbell is marked as green, depending on whether the answer was true (on the right) or false (on the left). Hover over the line to see what the actual question was.
The red squares correspond to individual people’s guesses. Hovering will tell you the person’s identity and their probabilistic guess.
You can click a guess square to select it, at which point it will turn blue, as will all of that person’s guesses across all the questions.
The black mark is the average probability of all the guesses. The guesses are sorted by the average probability.

Here’s the data: