Subpar Forecasting, 2024
Welcome to the results for year four of our subparforecasting competition!
If you want a refresher or a reminder as to what this is all about, look here. And you can check out results from 2021, 2022, and 2023 if you want to see how the contest has evolved.
The rankings
The list below has all of the people who participated, plus two non-human participants:
- Averagey McAverageface. Averagey joined us last year, and is the wisdom-of-the-crowds player. His predictions are the average of everyone else’s.
- ChatGPT. We heard this AI thing is supposed to be big, so we asked ChatGPT to make predictions this year.
Without further ado, the results!
name | score |
---|---|
Zev Minsky-Primus | 0.210 |
David Tytell | 0.160 |
Jeremy White | 0.153 |
Elizabeth Frieden & family | 0.110 |
Gabriel Farbiarz | 0.100 |
Shirah Werberger | 0.091 |
Lisa Minsky-Primus | 0.085 |
Alan Promer | 0.078 |
ChatGPT | 0.077 |
Averagey McAverageface | 0.073 |
eli cohen | 0.073 |
Naftaly and Yair | 0.065 |
Yaron Minsky | 0.063 |
Sigal Minsky-Primus and Nava Litt | 0.052 |
Jonas Peters | 0.049 |
Richard Primus | 0.046 |
Monserrate Garzon-Navarro | 0.042 |
nancy | 0.040 |
Sam Wurzel | 0.030 |
Romana Primus | 0.021 |
Ty Overby | 0.015 |
Jeremy Dauber | 0.012 |
Jessica | -0.002 |
Sanjyot Dunung | -0.002 |
Aryeh | -0.011 |
Sharon Fenick | -0.022 |
Nick Salter (and family) | -0.043 |
Greg | -0.070 |
David Huang | -0.076 |
Dima | -0.077 |
Franco Baseggio | -0.081 |
Shula | -0.083 |
Katrina | -0.112 |
Sarah Williams | -0.178 |
Michelle Fisher | -0.187 |
Eyal Minsky-Fenick | -0.188 |
Lucas Kimball | -0.210 |
Sally Gottesman (and Ezra Tiven-Gottesman) | -0.264 |
Martha Ameyalli | -0.272 |
Dalit CS | -0.284 |
Yoav Schachner | -0.380 |
David Miller | -0.415 |
Elana Rosenberg | -inf |
So, how did we do this year? A few observations:
On the whole, it seems like we did a lot worse. Remember that having a negative score means your predictions were worse than just guessing 50% for everything. Last year, 35% of us had negative scores; this year, it was 50%. Our average score was lower too: last year we averaged 0.16, while this year we averaged -0.037 (ignoring our non-human participants and our one negative infinity). Speaking of which…
Three different people collectively made 9 absolutely confident predictions. We don’t think absolutely confident predictions are ever really a good idea, but…those who made them did better this year than they did back in 2021, when fully half of the absolutely confident predictions were wrong. This time, only two of them were wrong. (Still, even one wrong is worth negative infinity points.)
Some notes on the questions
You can see how everyone guessed and what the results were in our visualization. Most of the resolutions were cut and dried, but there are a few worth saying more about.
ChatGPT
Our question was:
“Excluding this question, ChatGPT 4 will outperform Averagey McAverageface on this year’s subparforecasting competition. Note that ChatGPT 4 can search the web. (We’re going to give ChatGPT4 a chance at competing this year. We’ll give it a prompt to try to get it to answer the quiz, and we’ll feed it each question individually. Averagey McAverageface is the synthetic player who chooses the average of everyone else’s probability estimate. We will ask the questions to gpt more or less like so.)”
The phrase “excluding this question” is key. It turns out, Averagey beat out ChatGPT excluding this question, but…ChatGPT correctly thought it was going to lose that question, and so, including the question, ChatGPT is actually ahead! So, that’s weird.
The Westminster dog show
Our question was:
“The winner of Westminster Best in Show will be a dog under 20 lbs”
Which doesn’t sound that complicated. But, it turns out, it was. Here’s what David Tytell told us:
Okay. I have an answer!
This will take a while to explain. Stick with me.
First: Credentials
My staff member has won best of show at national and international dog shows, is an international dog show judge, and knew Sage’s handler previously. That is past tense as, sadly, Sage’s handler passed away suddenly last week, just days after his retirement.
Dog shows are judged against a perfect model of a dog breed. Best of breed is the dog with the smallest delta from the ideal dog of that breed. Best of show is the dog with the smallest delta of all the small delta breeds.
But of all the criteria that are measured, weight is not one. Which is why we can’t find it. We would need the dog’s actual vet records! And that’s not happening given the passing.
But there are things we DO know….
According to my staffer: The long answer:
“The standard calls for miniature poodles to be between 10 and 15 inches tall at the shoulder, and my guess is that Sage is closer to the top of that range. A 15-inch sheltie would typically weigh 15–18 lbs., but a poodle would likely weigh more like 13–15. Different body type.”
So… While we don’t know Sage’s exact weight, we do know he was the smallest delta from the ideal mini poodle in the country. Which means he couldn’t have been more than a hair outside the height range, if at all, and that given the frame, our outside expert is confident that means no more than ~15 lbs.
So we know for sure that the winner was indeed a dog that weighed less than 20lbs. Q.E.D.
Snow-days in NY
The question was:
“New York City public schools will have at least one snow day after Jan 21st. We define a snow day as a day where the NYC DOE determines that students should not come into school due to snow.”
This was a close one! There was a single snow day, though, honestly, it barely snowed in the end, and was kind of a dud. But, rules are rules!
Top-grossing original narrative film
The question was:
“In one month between February and September, the top-grossing movie will be an original narrative film. So, no toy or game tie-ins (e.g., Mario/Barbie), no sequels, no comic book movies, nothing based on a novel, etc. Historical fiction is allowed, though. Adjudicated by Lisa Minsky-Primus.”
There was some debate on this one, but Lisa decided this landed as true. The argument against is that the Bob Marley movie isn’t “historical fiction”, it’s a biopic. But, it is lightly fictionalized, and it’s certainly not a documentary. So, historical fiction it is!
Demographics of incoming freshman at Harvard
The question was:
“The percentage of incoming freshmen at Harvard that do not identify as White, South Asian or Asian will be smaller for the class of 2028 than 2027, as determined by the freshman survey reported in the crimson”
Frustratingly, the Crimson, which reported survey data for the class of 2027, did not appear to release any demographic data from the survey it ran for the class of 2028.
The New York Times did report on this, though, and they said:
- African Americans went from 18% to 14%
- Asians stayed the same at 37%
- Hispanics went from 14% to 16%
But there’s no data released about white students. And, as the Times said:
Harvard did not report the share of white students in the class, consistent with past practice, and it is hard to make inferences because the percentage of students not disclosing race or ethnicity on their applications doubled to 8 percent this year from 4 percent last year.
I think this suggests that all in, if the data had been reported, the question would have been settled in the affirmative. But, again, rules are rules, so, this question is null and void.
Wait, what is this all about?
Subparforecasting is a joking reference to superforecasting, which comes from Philip Tetlock’s research about how to find and develop a group of people who are unusually good at making predictions.
A key part of Tetlock’s process is getting feedback: making predictions, seeing how they work out, and learning from the process. Back in 2020 (oh how time flies) Matthew Yglesias did a nice summary of this in his piece on how to be less full of shit.
Scoring
The details of the scoring are a little complicated, but if you’re interested, we do have the math. But here are some high-level observations about how the scores work that you don’t need any math for.
- If you didn’t answer a question, we treat that as having guessed 50%.
- Absolutely confident guesses (0% or 100%) that turn out to be wrong are worth negative infinity.
- Your overall score is the average of your scores on the individual questions.
- A score of zero means you did exactly as well as not answering anything, which is to say, the same as saying everything was 50%.
- The bigger (more positive) your score is, the better!
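To make those rules concrete, here are two worked cases (using the scoring formula spelled out in detail at the end of this post):
- A skipped question is treated as a 50% guess, which scores \ln(0.5) - \ln(0.5) = 0.
- A 100% guess on something that turns out to be false scores \ln(1-1) - \ln(0.5) = \ln(0) - \ln(0.5) = -\infty.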
Overconfidence / Underconfidence
We decided to do some analysis of over- and under-confidence. With some help from Matt Russell (Hi Matt!) we decided to try out a new way of measuring it that he proposed.
The idea is pretty simple: in order to see if you’re over-confident or under-confident, we try scaling your confidence up and down until we find the scaling factor that maximizes your score.
In other words, we made all of your guesses bolder or more cautious to different degrees, and checked to see which way gave you the highest score. If your score goes up as we make your guesses more confident, then you’re underconfident. If your score goes up when we make your guesses less confident, you’re overconfident.
A few things to note about this approach.
It’s not obvious what we mean by “scaling confidence”. Without going into the details, we do this in a way that treats a probability of 99% as roughly 10x more confident than 90%, and similarly for 0.1% versus 1%.
In the table, we’re reporting “underconfidence” numbers, which means that the higher the number is, the more you needed to increase the confidence of your predictions to optimize your score.
Notably, numbers above 1 indicate underconfidence, numbers between 0 and 1 indicate overconfidence, and numbers below zero mean your guesses were negatively correlated with reality! You would have been better off betting the other way from what you did, across the board.
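For the curious, here’s a minimal sketch of what this kind of analysis could look like in code. It assumes the confidence scaling happens in log-odds space and searches a grid of candidate scaling factors; the exact transform and search aren’t spelled out above, so treat the function names and details here as an illustration of the idea rather than the precise computation behind the table:

```python
import numpy as np

def scale_prediction(p, c):
    """Rescale a prediction's confidence by factor c in log-odds space.

    c > 1 pushes p away from 50% (bolder), 0 < c < 1 pulls it toward
    50% (more cautious), and c < 0 flips the direction of the bet.
    """
    p = np.clip(p, 1e-9, 1 - 1e-9)  # avoid infinities at 0% / 100%
    log_odds = np.log(p / (1 - p))
    return 1.0 / (1.0 + np.exp(-c * log_odds))

def score(p, outcome):
    """Log score relative to the 50% baseline (see the scoring section)."""
    return np.log(p if outcome else 1 - p) - np.log(0.5)

def best_confidence_scale(predictions, outcomes, grid=np.linspace(-1, 5, 601)):
    """Return the scaling factor that maximizes the average score.

    A best factor above 1 suggests underconfidence, between 0 and 1
    overconfidence, and below 0 predictions negatively correlated
    with reality.
    """
    def avg_score(c):
        return np.mean([score(scale_prediction(p, c), o)
                        for p, o in zip(predictions, outcomes)])
    return max(grid, key=avg_score)
```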
Here are the confidence ratings!
name | underconfidence |
---|---|
Shirah Werberger | 3.590 |
Zev Minsky-Primus | 2.450 |
Jeremy White | 2.010 |
Alan Promer | 1.520 |
Sam Wurzel | 1.240 |
David Tytell | 0.960 |
Naftaly and Yair | 0.930 |
Elizabeth Frieden & family | 0.870 |
Richard Primus | 0.850 |
Lisa Minsky-Primus | 0.720 |
eli cohen | 0.710 |
Jonas Peters | 0.650 |
Ty Overby | 0.650 |
ChatGPT | 0.620 |
Monserrate Garzon-Navarro | 0.570 |
Gabriel Farbiarz | 0.480 |
Sigal Minsky-Primus and Nava Litt | 0.460 |
Sanjyot Dunung | 0.380 |
Romana Primus | 0.340 |
Yaron Minsky | 0.340 |
Aryeh | 0.300 |
Jessica | 0.280 |
Jeremy Dauber | 0.270 |
Sharon Fenick | 0.250 |
Nick Salter (and family) | 0.170 |
Dima | 0.140 |
Franco Baseggio | 0.130 |
nancy | 0.120 |
Katrina | 0.080 |
Shula | 0.060 |
Sarah Williams | 0.040 |
David Miller | 0.030 |
Greg | 0.030 |
Lucas Kimball | -0.010 |
Sally Gottesman (and Ezra Tiven-Gottesman) | -0.010 |
Michelle Fisher | -0.020 |
Dalit CS | -0.050 |
David Huang | -0.050 |
Eyal Minsky-Fenick | -0.070 |
Martha Ameyalli | -0.140 |
Yoav Schachner | -0.460 |
Elana Rosenberg | nan |
Visualizing the predictions
The following graphic lets you visualize the data and see how the predictors did collectively and individually.
Each barbell represents a question. Tap on the gray line to see the text of the question, or alternatively, hover over it.
The ends of the barbell represent the truth-value of the result, with false on the left and true on the right. Depending on which way the given question worked out, the appropriate end is highlighted as green.
The red squares correspond to individual predictions. Hover over a red square to see who it is and what their prediction was.
You can highlight an individual person’s predictions either by using the dropdown or by tapping one of their red squares.
The black mark is the average of all the predictions. The predictions are sorted by the location of the black mark.
Note: it takes a moment for the text to pop up when you hover, so be patient!
Here’s the visualization:
The scoring methodology, in detail
We’re doing a small variation on the logarithmic scoring rule. For log scoring, you should think of the probability p of a prediction as a number between 0 and 1, with 0 representing 0%, and 1 representing 100%.
Here’s how our score works:
\text{score}(p, \text{outcome}) = \left( \begin{cases} \ln(p), & \text{if the outcome was true} \\ \ln(1-p), & \text{if the outcome was false} \end{cases} \right) - \ln(0.5)
This is just log-scoring minus \ln(0.5). That increases the score, since \ln(0.5) is negative. As a result, your score is always zero if your prediction is 50%, and higher scores are better. This reflects the fact that we treat a 50% prediction as a kind of a baseline.
So, say you predicted that a given outcome had a 90% chance of happening.
- If you were right, then your score would be \ln(0.9) - \ln(0.5) = 0.59.
- If you were wrong, your score would be \ln(1-0.9) - \ln(0.5) = -1.61.
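If you’d rather read code than formulas, here’s a minimal sketch of that scoring rule, including the unanswered-question and negative-infinity conventions from the high-level notes above. The function names are just for illustration, not the actual code behind the rankings:

```python
import math

def score(p, outcome):
    """Log score relative to the 50% baseline.

    p is the predicted probability (0 to 1) that the outcome is true.
    A 50% prediction always scores exactly 0, and an absolutely
    confident prediction that turns out wrong scores negative infinity.
    """
    q = p if outcome else 1 - p
    if q == 0:
        return float("-inf")  # absolutely confident and wrong
    return math.log(q) - math.log(0.5)

def overall_score(predictions, outcomes):
    """Average of per-question scores; an unanswered question (None) counts as 50%."""
    return sum(score(0.5 if p is None else p, o)
               for p, o in zip(predictions, outcomes)) / len(outcomes)

# The worked example from above:
print(round(score(0.9, True), 2))   # 0.59
print(round(score(0.9, False), 2))  # -1.61
```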