Welcome to the results for year three of our subparforecasting competition!
(You can check out results from 2021 and 2022 if you want to see how the contest has evolved.)
Subparforecasting is a joking reference to superforecasting, which comes from Philip Tetlock’s research on how to find and develop a group of people who are unusually good at making predictions.
A key part of Tetlock’s process is getting feedback: making predictions, seeing how they work out, and learning from the process. Back in 2020 (oh how time flies) Matthew Yglesias did a nice summary of this in his piece on how to be less full of shit.
We’re using the same scoring approach as last year. A few things to remember:
- If you didn’t answer a question, we treat that as having guessed 50%.
- Absolutely confident guesses (0% or 100%) that turn out to be wrong are worth negative infinity. This year, there were no absolutely confident guesses, so that’s progress!
- Your total score is the average of your scores on the individual questions.
- A score of zero means you did the same as not answering any of the questions, which is to say, the same as saying everything was 50%.
- The bigger (more positive) your score is, the better!
If you scroll all the way to the bottom, there are more details on the math of the scoring. We also did some fun extra calculations to evaluate people’s confidence (and over-confidence!), so don’t miss the stuff at the end.
name | score |
---|---|
David R H Miller | 0.340 |
Lucas Kimball | 0.320 |
Eyal Minsky-Fenick | 0.316 |
Lisa Minsky-Primus | 0.293 |
Dima Gorenburg | 0.288 |
Franco Baseggio | 0.241 |
Henry Schneider | 0.235 |
Yaron Minsky | 0.227 |
Zev Minsky-Primus | 0.223 |
Yair and Naftaly | 0.211 |
Nick Salter (and family) | 0.196 |
Averagey McAverageface | 0.187 |
Sharon | 0.166 |
Martha Escobar | 0.156 |
Mara Kailin | 0.140 |
Ida Gorenburg | 0.119 |
Richard Primus | 0.119 |
Jonas C. Peters | 0.113 |
Michael Farbiarz | 0.066 |
Daniel Cohen | 0.013 |
Sally Gottesman | -0.002 |
Elana Farbiarz | -0.012 |
Ada Fenick | -0.029 |
Sam Wurzel | -0.040 |
Sanjyot Dunung | -0.045 |
Michelle Fisher | -0.091 |
Gabriel Farbiarz | -0.125 |
Dianne Newman | -0.128 |
Eyal Schachner | -0.160 |
Shula mmonsky | -0.168 |
eli cohen | -0.273 |
We had 30 teams this year — most teams were individuals, but a few were multi-person teams. Remember that a negative score means you did worse than just saying 50% for everything. Of our 30 teams, just 11 had negative scores. That’s similar to last year.
In order to represent the wisdom of the crowds, we again added a synthetic player, Averagey McAverageface. Averagey’s prediction on any particular question is just the average of everyone else’s predictions on that question. 11 teams did better than Averagey, so Averagey did alright, but plenty of teams still managed to beat the crowd.
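If you’re curious how a synthetic crowd player like that can be built, here’s a minimal sketch in Python. The names are ours for illustration (this isn’t the actual code behind the contest), and probabilities are written as numbers between 0 and 1.

```python
from statistics import mean

def averagey_predictions(predictions_by_team):
    """For each question, Averagey's prediction is the plain average of
    every real team's stated probability for that question.
    (Teams that skipped a question are simply left out of that question's
    average here; the contest's exact handling might differ.)"""
    questions = {q for preds in predictions_by_team.values() for q in preds}
    return {
        q: mean(preds[q] for preds in predictions_by_team.values() if q in preds)
        for q in questions
    }

# Illustrative data, not real contest predictions:
example = {
    "team A": {"q1": 0.9, "q2": 0.4},
    "team B": {"q1": 0.7, "q2": 0.6},
}
print(averagey_predictions(example))  # per-question averages: q1 -> 0.8, q2 -> 0.5
```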
We also generated a scatterplot, which you can use to get a sense of how people’s performance differed between last year and this year, including only those who participated in both years. You can see a definite correlation! We also plotted a line of best fit, which gives you a sense of how much improvement there was across the board.
Note that you can hover over individual points to see the specific people and scores year-over-year.
We wanted to investigate another way of thinking about people’s results, which is how over-confident or under-confident each person was. We don’t have a perfect methodology here, but we think we have a pretty reasonable way of thinking about it.
First, some examples to build intuition.
Imagine someone whose predictions are all either 40% or 60%, and all of their 40% predictions turn out false while all of their 60% predictions turn out true. That person is probably under-confident, and should express more confidence in their predictions.
Alternatively, someone who predicts lots of things at 95% or 5%, but only gets 60% of those predictions right, is overconfident.
The following table rates people from the most over-confident to the most under-confident, where the people in the middle seem to be best calibrated.
There are three scores listed here. I’ll explain roughly what they mean, and we’ll give a more mathematical account at the end.
- raw conf measures just how confident you are, and has nothing to do with the actual outcomes. So, someone who predicts things at 90% or 10% is going to have a higher “raw conf” score than someone who makes predictions at 60% or 40%.
- overconf measures how your confidence relates to your outcomes. If your guesses are more directionally right than your probabilities suggest, you’re under-confident, and you get a negative number. If your guesses are less directionally right than your probabilities suggest, you’re overconfident, and you get a positive number.
- score p, or the score probability, estimates how likely it would be for your score to have been achieved by chance if your stated probabilities were correct. This gives you a sense of how significant the overconfidence measure is. Notably, people who are well-calibrated (i.e., whose overconfidence is close to zero) tend to have higher score probabilities. Also, the overconfidence near the top of the table seems more significant than the underconfidence near the bottom.
I’m not sure exactly what to take away from this, other than the fact that people near the top of the table should consider expressing more uncertainty in their estimates, and people at the bottom should consider doing the reverse.
name | overconf | raw conf | score p |
---|---|---|---|
Eyal Schachner | 0.336 | 0.218 | 0.002 |
Dianne Newman | 0.336 | 0.223 | 0.002 |
eli cohen | 0.321 | 0.149 | 0.001 |
Gabriel Farbiarz | 0.255 | 0.131 | 0.005 |
Shula mmonsky | 0.237 | 0.096 | 0.000 |
Michelle Fisher | 0.212 | 0.193 | 0.025 |
Sam Wurzel | 0.130 | 0.103 | 0.086 |
Daniel Cohen | 0.111 | 0.129 | 0.212 |
Franco Baseggio | 0.099 | 0.340 | 0.414 |
Ada Fenick | 0.070 | 0.035 | 0.175 |
Elana Farbiarz | 0.069 | 0.052 | 0.249 |
Jonas C. Peters | 0.059 | 0.214 | 0.576 |
Sanjyot Dunung | 0.058 | 0.026 | 0.174 |
Sally Gottesman | 0.055 | 0.102 | 0.490 |
Ida Gorenburg | 0.048 | 0.141 | 0.601 |
Yaron Minsky | 0.031 | 0.250 | 0.778 |
Nick Salter (and family) | 0.025 | 0.234 | 0.810 |
Sharon | -0.005 | 0.137 | 0.954 |
Richard Primus | -0.009 | 0.086 | 0.932 |
Mara Kailin | -0.021 | 0.119 | 0.812 |
Michael Farbiarz | -0.035 | 0.032 | 0.475 |
Dima Gorenburg | -0.063 | 0.192 | 0.548 |
Martha Escobar | -0.069 | 0.124 | 0.428 |
Yair and Naftaly | -0.083 | 0.128 | 0.321 |
Henry Schneider | -0.089 | 0.162 | 0.338 |
Zev Minsky-Primus | -0.092 | 0.119 | 0.315 |
Lucas Kimball | -0.099 | 0.213 | 0.336 |
Eyal Minsky-Fenick | -0.099 | 0.228 | 0.268 |
David R H Miller | -0.132 | 0.208 | 0.195 |
Lisa Minsky-Primus | -0.144 | 0.145 | 0.092 |
The following graphic lets you visualize the data and see how the predictors did collectively and individually.
This visualization works best on an ordinary computer, not a tablet or phone. That’s because you need to hover with your mouse in order to uncover more information. To hover over something, move the tip of your mouse pointer over the element in question and leave it there for a moment!
Each barbell represents a question. Hover over the line to see the text of the question.
The ends of the barbell represent the truth-value of the result, with false on the left and true on the right. Depending on which way the given question worked out, the appropriate end is highlighted in green.
The red squares correspond to individual predictions. Hover over a red square to see who it is and what their prediction was.
You can click a red square to select that prediction, at which point it will turn blue, as will all of that person’s predictions across all the questions.
The black mark is the average of all the predictions on that question. The questions are sorted by that average probability.
Note: it takes a moment for the text to pop up when you hover, so be patient!
Here’s the visualization:
We’re doing a small variation on the logarithmic scoring rule. For log scoring, you should think of the probability p of a prediction as a number between 0 and 1, with 0 representing 0%, and 1 representing 100%.
Here’s how our score works:
\text{score}(p,\text{outcome}) = \left( \begin{cases} \ln(p), & \text{if the outcome was true} \\ \ln(1-p), & \text{if the outcome was false} \end{cases}\right) - \ln(0.5)
This is just log-scoring minus \ln(0.5). That increases the score, since \ln(0.5) is negative. As a result, your score is always zero if your prediction is 50%, and higher scores are better. This reflects the fact that we treat a 50% prediction as a kind of baseline.
So, say you predicted that a given outcome was 90% likely to happen. If it did happen, your score on that question would be \ln(0.9) - \ln(0.5) \approx 0.59; if it didn’t, it would be \ln(0.1) - \ln(0.5) \approx -1.61.
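Here’s a minimal sketch of that scoring rule in Python. The function name is ours, and this isn’t the contest’s actual code, but it follows the formula above.

```python
import math

def score(p, outcome):
    """Log score shifted so that a 50% prediction is worth exactly zero.

    p is the stated probability (between 0 and 1) that the outcome is true;
    outcome is True or False. Unanswered questions are treated as p = 0.5.
    """
    prob_of_what_happened = p if outcome else 1 - p
    # A wrong guess at 0% or 100% is worth negative infinity in the write-up;
    # math.log(0) raises an error in Python, so real code would special-case it.
    return math.log(prob_of_what_happened) - math.log(0.5)

print(score(0.9, True))   # ~0.588: confident and right pays off
print(score(0.9, False))  # ~-1.609: confident and wrong costs more
print(score(0.5, True))   # 0.0: 50% is the baseline either way
```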
The basic idea behind the confidence measures is to take seriously the probabilities people write down, and use them as a model of the probabilities of the underlying events, with the extra (incorrect!) assumption that the different questions are statistically independent.
Raw confidence is the score you should have expected to get if your stated probabilities were true. So, someone very confident expects a high score, and someone unconfident expects a lower score. The lowest you can get is zero, since a maximally unconfident person guesses 50% everywhere, and so expects a score of exactly zero.
Here’s the equation for raw confidence. n is the number of questions, and p_i is the probability you assigned to question i. \text{score} is the scoring function defined above.
\text{raw-confidence} = \frac{\sum_{i=1}^{n} \left( p_i \, \text{score}(p_i,T) + (1 - p_i)\,\text{score}(p_i,F) \right)}{n}
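Continuing the sketch from above (and reusing its score function), raw confidence is just that expected score averaged over your questions:

```python
def raw_confidence(probabilities):
    """Expected score if your stated probabilities were exactly right:
    each question comes true with probability p (scoring score(p, True))
    and false with probability 1 - p (scoring score(p, False))."""
    return sum(
        p * score(p, True) + (1 - p) * score(p, False)
        for p in probabilities
    ) / len(probabilities)

print(raw_confidence([0.5, 0.5]))  # 0.0: never leaving 50% means zero raw confidence
print(raw_confidence([0.9, 0.1]))  # ~0.368: strong opinions, higher raw confidence
```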
We measure overconfidence by looking at the difference between raw confidence and the actual score that was achieved.
\text{overconfidence} = \text{raw-confidence} - \frac{\sum_{i=1}^{n} \text{score}(p_i,\text{outcome}_i)}{n}
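In code, that’s just the gap between the expected score and the score you actually got (again building on the sketches above):

```python
def overconfidence(probabilities, outcomes):
    """Raw confidence minus the achieved score: positive means the outcomes
    were less kind to you than your stated probabilities promised."""
    actual = sum(score(p, o) for p, o in zip(probabilities, outcomes)) / len(probabilities)
    return raw_confidence(probabilities) - actual

# Someone who says 90% four times but is only right twice looks overconfident:
print(overconfidence([0.9] * 4, [True, True, False, False]))  # ~0.879
```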
The last measure asks how likely your achieved score is, given your stated confidence. We don’t have a closed-form equation for this; we just estimated it with a Monte Carlo simulation, repeatedly re-resolving the questions independently according to your stated probabilities.
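Here’s roughly what such a simulation could look like, building on the sketches above. Since the post doesn’t spell out exactly how the comparison is made, this sketch assumes a two-sided test (how often a simulated score lands at least as far from the expected score as your real score did); treat that detail as our assumption rather than the contest’s exact rule.

```python
import random

def score_probability(probabilities, outcomes, trials=100_000):
    """Monte Carlo estimate of how surprising the achieved score is, assuming
    the stated probabilities are correct and the questions resolve independently.
    Assumption: we count how often a simulated score is at least as far from the
    expected score (raw confidence) as the actual score was."""
    n = len(probabilities)
    expected = raw_confidence(probabilities)
    actual = sum(score(p, o) for p, o in zip(probabilities, outcomes)) / n
    hits = 0
    for _ in range(trials):
        simulated = sum(score(p, random.random() < p) for p in probabilities) / n
        if abs(simulated - expected) >= abs(actual - expected):
            hits += 1
    return hits / trials
```

A well-calibrated forecaster’s actual score tends to land near the expected one, so nearly every simulated run is at least as extreme and the estimate comes out near 1, which matches the middle rows of the table above.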