Subpar Forecasting, 2021!

Remember 2020? At the end of last year, you were all invited to a virtual New Year’s Eve party, and part of that party was a quiz where we made probabilistic guesses about things that might happen in 2021. Well, the results are in, and we thought it would be fun to throw together a report summarizing how it went.

What is this about?

Subpar forecasting is a joking reference to superforecasting, a process based on research by Philip Tetlock about how to develop a group of people who are really good at making predictions, like, better than experts in the field in question or CIA analysts.

A key part of Tetlock’s process is getting feedback: making predictions, seeing how they work out, and learning from the process. Matthew Yglesias did a nice summary of this in his piece on how to be less full of shit.

Scoring

Given that the point of all of this is to get feedback, we thought we’d go ahead and score people’s results.

Scoring is a bit of a dark art, and we didn’t think too hard about it. We just picked the simplest rule we could find on Wikipedia: the logarithmic scoring rule. Log scoring is pretty simple. For this quiz, we convert every probabilistic guess into a number p between 0 and 1. Then we define the score as follows.

\text{score}(p) = \begin{cases} \ln(p), & \text{if the outcome was true} \\ \ln(1-p), & \text{if the outcome was false} \end{cases}

So, let’s say you guessed that a given outcome was 90% likely to happen, and it did. Then your score would be \ln(0.9) = -0.105. If it didn’t happen, your score would be \ln(1-0.9) = \ln(0.1) = -2.303.

So, all scores are negative, and higher scores are better. One interesting aspect of this scoring is that if you’re absolutely confident and wrong, you get penalized infinitely! I.e., if you say something is 0% or 100% likely to happen and you’re wrong, the score is negative infinity.

In the spirit of generous interpretation, we decided to adjust every 0% prediction to a 1% prediction, and every 100% prediction to 99%, since we think people weren’t quite thinking in absolutes.

(As a side-note, of the predictions that were entirely certain, 50% of them were wrong!)
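If you want to compute this at home, the whole rule (including the 1%/99% adjustment) fits in a few lines of Python. This is just a sketch, not necessarily exactly what we ran, and the function name is made up:

import math

def log_score(p, happened):
    # Log score for a single prediction. Extreme guesses are nudged inward
    # (0% -> 1%, 100% -> 99%), per the generous-interpretation rule above.
    p = min(max(p, 0.01), 0.99)
    return math.log(p) if happened else math.log(1 - p)

log_score(0.9, True)   # -0.105, the worked example above
log_score(0.9, False)  # -2.303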

The rankings

Here they are. We sorted by the average log score.

name  avg log score  num votes
Jeremy Dauber -0.511 1
Michelle Fisher -0.535 2
Ariela Migdal -0.573 4
Yaron Minsky -0.578 16
Ethan Tucker -0.707 3
Dima Gorenburg -0.709 15
Zev Minsky-Primus -0.709 13
Lisa Primus -0.746 17
Richard Primus -0.777 11
Dave Tytell -0.832 19
Martha Escobar -0.856 22
Dan Cohen -0.877 3
Martha Walker -0.885 16
Yair Minsky -1.003 10
Miri Pomerantz -1.204 1
Romana Primus -1.353 21
Yehuda Kurtzer -1.386 1
Naftaly Minsky -1.609 1
Tova Ovadia -1.886 8
Sema Stein -2.160 9
Jess Tytell -2.526 1
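In case you’re curious, the ranking itself is just a group-by-and-average. Here’s a rough sketch building on the log_score function above; the (name, question, p) shape of the data is just an assumption about how you might store the guesses:

from collections import defaultdict

def rank(guesses, outcomes):
    # guesses: list of (name, question, p) triples.
    # outcomes: dict mapping each question to True/False.
    # Returns (name, average log score, number of guesses), best score first.
    per_person = defaultdict(list)
    for name, question, p in guesses:
        per_person[name].append(log_score(p, outcomes[question]))
    rows = [(name, sum(s) / len(s), len(s)) for name, s in per_person.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)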


One thing that feels a little weird about this is that the top of the ranking is dominated by folks who made just a handful of guesses. One thing we could do is restrict the list to people who made 10 or more guesses:

name  avg log score  num votes
Yaron Minsky -0.578 16
Dima Gorenburg -0.709 15
Zev Minsky-Primus -0.709 13
Lisa Primus -0.746 17
Richard Primus -0.777 11
Dave Tytell -0.832 19
Martha Escobar -0.856 22
Martha Walker -0.885 16
Yair Minsky -1.003 10
Romana Primus -1.353 21


But the 10-vote limit is a little arbitrary. Another idea would be to fill out the rankings by treating every question someone didn’t answer as a 50/50 guess. This isn’t fair, exactly, but it’s interesting, in that it’s a super simple strategy that everyone could employ.
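Concretely, that just means padding the data with 0.5s before re-ranking. Here’s a sketch in the same made-up representation as before:

def fill_in_coin_flips(guesses, names, outcomes):
    # For every (person, question) pair with no recorded guess, add a
    # 50/50 guess, so everyone gets scored on every question.
    answered = {(name, question) for name, question, _p in guesses}
    filled = list(guesses)
    for name in names:
        for question in outcomes:
            if (name, question) not in answered:
                filled.append((name, question, 0.5))
    return filled

# Re-rank using the filled-in guesses:
# rank(fill_in_coin_flips(guesses, names, outcomes), outcomes)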

Let’s see how it does:

name  avg log score  num votes
Yaron Minsky -0.616 24
Ariela Migdal -0.673 24
Michelle Fisher -0.680 24
Jeremy Dauber -0.686 24
Ida Gorenburg -0.693 24
Ethan Tucker -0.695 24
Zev Minsky-Primus -0.702 24
Dima Gorenburg -0.703 24
Miri Pomerantz -0.714 24
Dan Cohen -0.716 24
Yehuda Kurtzer -0.722 24
Lisa Primus -0.731 24
Naftaly Minsky -0.731 24
Richard Primus -0.731 24
Jess Tytell -0.770 24
Dave Tytell -0.803 24
Martha Walker -0.821 24
Yair Minsky -0.822 24
Martha Escobar -0.843 24
Tova Ovadia -1.091 24
Sema Stein -1.243 24
Romana Primus -1.247 25


You’ll notice a new competitor pretty high up in the rankings: Ida Gorenburg! But why didn’t she show up before? Well, Ida came to the party, but she didn’t answer any questions at all. And yet she still scored in the top quartile or so! Which is to say, a lot of us ended up doing worse than just guessing 50% for everything. Also, some folks who answered just a few questions, like Ariela, Michelle, and Jeremy, are near the top of the ranking.

Another natural question: what about the wisdom of crowds? If we just average everyone’s guesses together, do we do better? To check, we’ll synthesize a fake participant, Averagey McAverageface, whose guess on each question is the average of everyone’s votes on that question. Let’s see how he does!

name  avg log score  num votes
Yaron Minsky -0.616 24
Ariela Migdal -0.673 24
Michelle Fisher -0.680 24
Jeremy Dauber -0.686 24
Averagey McAverageface -0.690 24
Ida Gorenburg -0.693 24


We truncated the list after the first few rows, but as you can see, we’re maybe not that wise as a crowd. Averagey McAverageface does just a hair better than Ida, who, as you’ll remember, didn’t answer any questions at all…
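For completeness, here’s how the synthetic averager could be built in the same sketch as before (reusing defaultdict, rank, and fill_in_coin_flips from above). Whether the per-question averages should include the filled-in 50/50 guesses or only the real votes is a judgment call; this version averages only the recorded guesses.

def crowd_average(guesses):
    # Build "Averagey McAverageface": for each question, a guess equal to the
    # average of all the real guesses recorded for that question.
    per_question = defaultdict(list)
    for _name, question, p in guesses:
        per_question[question].append(p)
    return [("Averagey McAverageface", question, sum(ps) / len(ps))
            for question, ps in per_question.items()]

# Rank everyone (with coin-flip fill-ins) alongside the synthetic averager:
# rank(fill_in_coin_flips(guesses, names, outcomes) + crowd_average(guesses), outcomes)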

Visualizing the questions

The following graphic lets you visualize the data and see how the predictors did collectively and individually.

This visualization works best on an ordinary computer, not a tablet or phone. That’s because you need to hover with your mouse in order to uncover more information. To hover over something, move the tip of your mouse pointer over the element in question and leave it there for a moment!

Here’s the data: