Subpar Forecasting, 2023!

Welcome to the results for year three of our subparforecasting competition!

(You can check out results from 2021 and 2022 if you want to see how the contest has evolved.)

What is this all about?

Subparforecasting is a joking reference to superforecasting, which comes from Philip Tetlock’s research on how to find and develop a group of people who are unusually good at making predictions.

A key part of Tetlock’s process is getting feedback: making predictions, seeing how they work out, and learning from the results. Back in 2020 (oh how time flies), Matthew Yglesias did a nice summary of this in his piece on how to be less full of shit.

Scoring

We’re using the same scoring approach as last year. A few things to remember:

If you scroll all the way to the bottom, there are more details on the math of the scoring. We also did some fun extra calculations to evaluate people’s confidence, and over-confidence! So don’t miss the stuff at the end.

The rankings

name score
David R H Miller 0.340
Lucas Kimball 0.320
Eyal Minsky-Fenick 0.316
Lisa Minsky-Primus 0.293
Dima Gorenburg 0.288
Franco Baseggio 0.241
Henry Schneider 0.235
Yaron Minsky 0.227
Zev Minsky-Primus 0.223
Yair and Naftaly 0.211
Nick Salter (and family) 0.196
Averagey McAverageface 0.187
Sharon 0.166
Martha Escobar 0.156
Mara Kailin 0.140
Ida Gorenburg 0.119
Richard Primus 0.119
Jonas C. Peters 0.113
Michael Farbiarz 0.066
Daniel Cohen 0.013
Sally Gottesman -0.002
Elana Farbiarz -0.012
Ada Fenick -0.029
Sam Wurzel -0.040
Sanjyot Dunung -0.045
Michelle Fisher -0.091
Gabriel Farbiarz -0.125
Dianne Newman -0.128
Eyal Schachner -0.160
Shula mmonsky -0.168
eli cohen -0.273


We had 30 teams this year — most teams were individuals, but there are a few multi-person teams. Remember that a negative score means you did worse than just saying 50% for everything. Of our 30 teams, just 11 had negative scores. That’s similar to last year.

In order to represent the wisdom of the crowds, we again added a synthetic player, Averagey McAverageface. Averagey’s prediction on any particular question is just the average of everyone else’s predictions. 11 teams did better than Averagey, so Averagey did alright, but a fair number of teams still managed to beat the crowd.
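To make that concrete, here’s a minimal Python sketch of how a synthetic average player can be computed. This isn’t the code we actually used, and the data layout (a dict from player name to per-question probabilities) is just an assumption for illustration.

```python
# Sketch only: derive a synthetic "average" player by taking, for each
# question, the mean of every real player's stated probability.

def average_player(predictions: dict[str, list[float]]) -> list[float]:
    """predictions maps player name -> per-question probabilities."""
    players = list(predictions.values())
    n_questions = len(players[0])
    return [
        sum(p[i] for p in players) / len(players)
        for i in range(n_questions)
    ]

# Two hypothetical players, three questions:
preds = {"alice": [0.9, 0.4, 0.5], "bob": [0.7, 0.6, 0.5]}
print(average_player(preds))  # roughly [0.8, 0.5, 0.5]
```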

Scatterplot!

We also generated a scatterplot comparing last year’s scores to this year’s, including only those who participated in both years. You can see a definite correlation! We also plotted a line of best fit, which gives you a sense of how much improvement there was across the board.

Note that you can hover over individual points to see the specific people and scores year-over-year.

Overconfidence

We also wanted to look at people’s results from another angle: how over-confident or under-confident each person was. We don’t have a perfect methodology here, but we think we have a pretty reasonable approach.

First, some examples to build intuition.

Imagine someone whose predictions are all either 40% or 60%, but whose 40% predictions all turn out false and whose 60% predictions all turn out true. That person is probably under-confident, and should probably increase the confidence of their predictions.

Alternatively, someone who predicts lots of things at 95% or 5%, but only 60% of those predictions work out is overconfident.

The following table rates people from the most over-confident to the most under-confident, where the people in the middle seem to be best calibrated.

There are three scores listed here. I’ll explain roughly what they mean, with a more mathematical account at the end.

I’m not sure exactly what to take away from this, other than the fact that people near the top of the table should consider expressing more uncertainty in their estimates, and people at the bottom should consider doing the reverse.

name    overconfidence    raw confidence    score probability
Eyal Schachner 0.336 0.218 0.002
Dianne Newman 0.336 0.223 0.002
eli cohen 0.321 0.149 0.001
Gabriel Farbiarz 0.255 0.131 0.005
Shula mmonsky 0.237 0.096 0.000
Michelle Fisher 0.212 0.193 0.025
Sam Wurzel 0.130 0.103 0.086
Daniel Cohen 0.111 0.129 0.212
Franco Baseggio 0.099 0.340 0.414
Ada Fenick 0.070 0.035 0.175
Elana Farbiarz 0.069 0.052 0.249
Jonas C. Peters 0.059 0.214 0.576
Sanjyot Dunung 0.058 0.026 0.174
Sally Gottesman 0.055 0.102 0.490
Ida Gorenburg 0.048 0.141 0.601
Yaron Minsky 0.031 0.250 0.778
Nick Salter (and family) 0.025 0.234 0.810
Sharon -0.005 0.137 0.954
Richard Primus -0.009 0.086 0.932
Mara Kailin -0.021 0.119 0.812
Michael Farbiarz -0.035 0.032 0.475
Dima Gorenburg -0.063 0.192 0.548
Martha Escobar -0.069 0.124 0.428
Yair and Naftaly -0.083 0.128 0.321
Henry Schneider -0.089 0.162 0.338
Zev Minsky-Primus -0.092 0.119 0.315
Lucas Kimball -0.099 0.213 0.336
Eyal Minsky-Fenick -0.099 0.228 0.268
David R H Miller -0.132 0.208 0.195
Lisa Minsky-Primus -0.144 0.145 0.092

Visualizing the predictions

The following graphic lets you visualize the data and see how the predictors did collectively and individually.

This visualization works best on an ordinary computer, not a tablet or phone, because you need to hover with your mouse in order to uncover more information. To hover over something, move the tip of your mouse pointer over the element in question and leave it there for a moment!

Note: it takes a moment for the text to pop up when you hover, so be patient!

Here’s the visualization:

The scoring methodology, in detail

We’re doing a small variation on the logarithmic scoring rule. For log scoring, you should think of the probability p of a prediction as a number between 0 and 1, with 0 representing 0%, and 1 representing 100%.

Here’s how our score works:

\text{score}(p,\text{outcome}) = \left( \begin{cases} \ln(p), & \text{if the outcome was true} \\ \ln(1-p), & \text{if the outcome was false} \end{cases}\right) - \ln(0.5)

This is just log-scoring minus \ln(0.5). That increases the score, since \ln(0.5) is negative. As a result, your score is always zero if your prediction is 50%, and higher scores are better. This reflects the fact that we treat a 50% prediction as a kind of a baseline.

So, say you predicted that a given outcome was 90% likely to happen. If it comes true, your score is \ln(0.9) - \ln(0.5) \approx 0.588; if it doesn’t, your score is \ln(0.1) - \ln(0.5) \approx -1.609. Confident predictions earn you a modest amount when you’re right, and cost you a lot when you’re wrong.
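Here’s a minimal Python sketch of that scoring rule (the function name and layout are ours for illustration, not the code used to compute the contest results):

```python
import math

def score(p: float, outcome: bool) -> float:
    """Log score, shifted so that a 50% prediction always scores zero.

    p is the predicted probability (between 0 and 1) that the outcome
    would be true; higher scores are better.
    """
    log_term = math.log(p) if outcome else math.log(1 - p)
    return log_term - math.log(0.5)

# A 90% prediction earns about +0.588 if it comes true...
print(score(0.9, True))   # ~0.588
# ...and loses about 1.609 if it doesn't.
print(score(0.9, False))  # ~-1.609
```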

Confidence

The basic idea is to take seriously the probabilities people write down, and use that as a model of the probabilities of the underlying events, with the extra (incorrect!) assumption that the different questions are statistically independent.

Raw confidence

Raw confidence is the score you should have expected to get if your probabilities were true. So, someone very confident expects a high score, and someone unconfident expects a less good score. The lowest you can get is zero, since a maximally unconfident person will guess 50% everywhere, and so will get a score of zero.

Here’s the equation for raw confidence. n is the number of questions, and p_i is the probability you assigned to question i. \text{score} is the scoring function defined above.

\frac{\sum_{i=1}^{n} (p_i \text{score}(p_i,T) + (1 - p_i)\text{score}(p_i,F))}{n}
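As a Python sketch (again, ours for illustration, reusing the score function from the earlier sketch):

```python
def raw_confidence(probs: list[float]) -> float:
    """Expected per-question score if your stated probabilities were
    exactly right: with probability p the outcome comes true (scoring
    score(p, True)), and with probability 1 - p it doesn't."""
    return sum(
        p * score(p, True) + (1 - p) * score(p, False)
        for p in probs
    ) / len(probs)

# Someone who always says 50% expects a score of 0; someone who
# always says 90% expects roughly 0.37 per question.
print(raw_confidence([0.5, 0.5]))  # 0.0
print(raw_confidence([0.9, 0.9]))  # ~0.368
```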

Overconfidence

We measure overconfidence by looking at the difference between raw confidence and the actual score that was achieved.

\frac{\sum_{i=1}^{n} \left(\text{raw-confidence} - \text{score}(p_i,\text{outcome}_i)\right)}{n}
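In other words, overconfidence is just raw confidence minus your average achieved score. As a sketch, building on the functions above:

```python
def overconfidence(probs: list[float], outcomes: list[bool]) -> float:
    """Raw confidence (expected per-question score) minus the average
    per-question score actually achieved. Positive means overconfident."""
    actual = sum(
        score(p, outcome) for p, outcome in zip(probs, outcomes)
    ) / len(probs)
    return raw_confidence(probs) - actual

# A forecaster whose 90% calls come true only half the time looks
# quite overconfident:
print(overconfidence([0.9, 0.9], [True, False]))  # ~0.879
```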

Score probability

The last measure is the question of how likely your achieved score was, given your confidence. We don’t have a closed-form equation for this; we just measured it by running a Monte Carlo simulation over and over, with the questions resolved independently at random according to the stated probabilities.
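Here’s a rough Python sketch of that kind of simulation, building on the earlier functions. It estimates the chance of landing at least as far from your expected score as you actually did; treat it as an illustration of the approach rather than the exact calculation we ran.

```python
import random

def score_probability(
    probs: list[float],
    outcomes: list[bool],
    trials: int = 100_000,
) -> float:
    """Monte Carlo estimate: if your stated probabilities were correct
    and the questions resolved independently, how often would your
    average per-question score land at least as far from the expected
    score (raw confidence) as it actually did?"""
    expected = raw_confidence(probs)
    achieved = sum(
        score(p, o) for p, o in zip(probs, outcomes)
    ) / len(probs)
    gap = abs(achieved - expected)

    hits = 0
    for _ in range(trials):
        # Resolve each question true with its stated probability.
        simulated = sum(
            score(p, random.random() < p) for p in probs
        ) / len(probs)
        if abs(simulated - expected) >= gap:
            hits += 1
    return hits / trials
```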