Subpar Forecasting, 2024

Welcome to the results for year four of our subparforecasting competition!

If you want a refresher or a reminder as to what this is all about, look here. And you can check out results from 2021, 2022, and 2023 if you want to see how the contest has evolved.

The rankings

The list below has all of the people who participated, plus two non-human participants:

Without further ado, the results!

name score
Zev Minsky-Primus 0.210
David Tytell 0.160
Jeremy White 0.153
Elizabeth Frieden & family 0.110
Gabriel Farbiarz 0.100
Shirah Werberger 0.091
Lisa Minsky-Primus 0.085
Alan Promer 0.078
ChatGPT 0.077
Averagey McAverageface 0.073
eli cohen 0.073
Naftaly and Yair 0.065
Yaron Minsky 0.063
Sigal Minsky-Primus and Nava Litt 0.052
Jonas Peters 0.049
Richard Primus 0.046
Monserrate Garzon-Navarro 0.042
nancy 0.040
Sam Wurzel 0.030
Romana Primus 0.021
Ty Overby 0.015
Jeremy Dauber 0.012
Jessica -0.002
Sanjyot Dunung -0.002
Aryeh -0.011
Sharon Fenick -0.022
Nick Salter (and family) -0.043
Greg -0.070
David Huang -0.076
Dima -0.077
Franco Baseggio -0.081
Shula -0.083
Katrina -0.112
Sarah Williams -0.178
Michelle Fisher -0.187
Eyal Minsky-Fenick -0.188
Lucas Kimball -0.210
Sally Gottesman (and Ezra Tiven-Gottesman) -0.264
Martha Ameyalli -0.272
Dalit CS -0.284
Yoav Schachner -0.380
David Miller -0.415
Elana Rosenberg -inf


So, how did we do this year? A few observations:

Some notes on the questions

You can see how everyone guessed and what the results were in our visualization. Most of the resolutions were cut and dried, but there are a few worth saying more about.

ChatGPT

Our question was:

“Excluding this question, ChatGPT 4 will outperform Averagey McAverageface on this year’s subparforecasting competition. Note that ChatGPT 4 can search the web. (We’re going to give ChatGPT 4 a chance at competing this year. We’ll give it a prompt to try to get it to answer the quiz, and we’ll feed it each question individually. Averagey McAverageface is the synthetic player who chooses the average of everyone else’s probability estimate. We will ask the questions to ChatGPT more or less like so.)”

The phrase “excluding this question” is key. It turns out that Averagey beat ChatGPT when you exclude this question, but… ChatGPT correctly predicted that it would lose it, and so, once you include the question, ChatGPT is actually ahead! So, that’s weird.

The Westminster dog show

Our question was:

“The winner of Westminster Best in Show will be a dog under 20 lbs”

Which doesn’t sound that complicated. But, it turns out, it was. Here’s what David Tytell told us:

Okay. I have an answer!

This will take a while to explain. Stick with me.

First: Credentials

My staff member has won best of show at national and international dog shows, is an international dog show judge, and knew Sage’s handler previously. That is past tense as, sadly, Sage’s handler passed away suddenly last week, just days after his retirement.

Dog shows are judged against a perfect model of a dog breed. Best of breed is the dog with the smallest delta from the ideal dog of that breed. Best of show is the dog with the smallest delta of all the small delta breeds.

But of all the criteria that are measured, weight is not one. Which is why we can’t find it. We would need the dog’s actual vet records! And that’s not happening given the passing.

But there are things we DO know….

According to my staffer, the long answer is:

“The standard calls for miniature poodles to be between 10 and 15 inches tall at the shoulder, and my guess is that Sage is closer to the top of that range. A 15-inch sheltie would typically weigh 15–18 lbs., but a poodle would likely weigh more like 13–15. Different body type.”

So… While we don’t know Sage’s exact weight, we do know he was the smallest delta from the ideal mini poodle in the country. Which means he couldn’t have been more than a hair outside the height range, if at all, and that given the frame, our outside expert is confident that means no more than ~15 lbs.

So we know for sure that the winner was indeed a dog that weighed less than 20lbs. Q.E.D.

Snow days in NYC

The question was:

“New York City public schools will have at least one snow day after Jan 21st. We define a snow day as a day where the NYC DOE determines that students should not come into school due to snow.”

This was a close one! There was a single snow day, though, honestly, it barely snowed in the end, and was kind of a dud. But, rules are rules!

Top-grossing original narrative film

The question was:

“In one month between February and September, the top-grossing movie will be an original narrative film. So, no toy or game tie-ins (e.g., Mario/Barbie), no sequels, no comic book movies, nothing based on a novel, etc. Historical fiction is allowed, though. Adjudicated by Lisa Minsky-Primus.”

There was some debate on this one, but Lisa decided this landed as true. The argument against is that the Bob Marley movie isn’t “historical fiction”, it’s a biopic. But, it is lightly fictionalized, and it’s certainly not a documentary. So, historical fiction it is!

Demographics of incoming freshmen at Harvard

The question was:

“The percentage of incoming freshmen at Harvard that do not identify as White, South Asian or Asian will be smaller for the class of 2028 than 2027, as determined by the freshman survey reported in the Crimson”

Frustratingly, the Crimson, which reported this data for the class of 2027, did not appear to release any demographic data from its survey of the class of 2028.

The New York Times did report on this, though, and they said:

But there’s no data released about white students. And, as the Times said:

Harvard did not report the share of white students in the class, consistent with past practice, and it is hard to make inferences because the percentage of students not disclosing race or ethnicity on their applications doubled to 8 percent this year from 4 percent last year.

I think this suggests that, all told, if the data had been reported, the question would have been settled in the affirmative. But, again, rules are rules, so this question is null and void.

Wait, what is this all about?

Subparforecasting is a joking reference to superforecasting, which comes from Philip Tetlock’s research on how to find and develop a group of people who are unusually good at making predictions.

A key part of Tetlock’s process is getting feedback: making predictions, seeing how they work out, and learning from the process. Back in 2020 (oh how time flies) Matthew Yglesias did a nice summary of this in his piece on how to be less full of shit.

Scoring

The details of the scoring are a little complicated, but if you’re interested, we do have the math. But here are some high-level observations about how the scores work that you don’t need any math for.

Overconfidence / Underconfidence

We decided to do some analysis of over- and under-confidence. With some help from Matt Russell (Hi Matt!), we tried out a new way of measuring it that he proposed.

The idea is pretty simple: to see whether you’re overconfident or underconfident, we try scaling your confidence up and down until we find the scaling factor that maximizes your score.

In other words, we made all of your guesses bolder or more cautious to different degrees, and checked which direction gave you the highest score. If your score goes up as we make your guesses more confident, you’re underconfident. If it goes up as we make your guesses less confident, you’re overconfident.
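
Concretely, the search looks something like the sketch below. This is a minimal sketch rather than the code we actually ran: in particular, the choice to do the scaling in log-odds space, and the names scaled_score and best_scaling_factor, are assumptions of this sketch. The per-question score it sums is the shifted log score described in the scoring section below.

```python
import numpy as np

def scaled_score(probs, outcomes, k, eps=1e-9):
    """Total score a player would have gotten if their confidence were
    scaled by a factor k (k = 1: unchanged, k > 1: bolder, k < 1: more cautious).
    The scaling is done in log-odds space, which is an assumption of this sketch."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)  # keep 0%/100% guesses finite
    outcomes = np.asarray(outcomes, dtype=bool)
    log_odds = np.log(probs / (1 - probs))        # map (0, 1) onto the whole real line
    scaled = 1.0 / (1.0 + np.exp(-k * log_odds))  # adjust boldness, map back to (0, 1)
    p_assigned = np.where(outcomes, scaled, 1 - scaled)
    return float(np.sum(np.log(p_assigned) - np.log(0.5)))

def best_scaling_factor(probs, outcomes):
    """Grid-search for the scaling factor that maximizes the player's score."""
    ks = np.linspace(0.0, 5.0, 501)
    return max(ks, key=lambda k: scaled_score(probs, outcomes, k))
```

If the best factor comes out above 1, you would have scored better by being bolder (underconfident); if it comes out below 1, by being more cautious (overconfident).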

A few things to note about this approach.

Here are the confidence ratings!

name underconfidence
Shirah Werberger 3.590
Zev Minsky-Primus 2.450
Jeremy White 2.010
Alan Promer 1.520
Sam Wurzel 1.240
David Tytell 0.960
Naftaly and Yair 0.930
Elizabeth Frieden & family 0.870
Richard Primus 0.850
Lisa Minsky-Primus 0.720
eli cohen 0.710
Jonas Peters 0.650
Ty Overby 0.650
ChatGPT 0.620
Monserrate Garzon-Navarro 0.570
Gabriel Farbiarz 0.480
Sigal Minsky-Primus and Nava Litt 0.460
Sanjyot Dunung 0.380
Romana Primus 0.340
Yaron Minsky 0.340
Aryeh 0.300
Jessica 0.280
Jeremy Dauber 0.270
Sharon Fenick 0.250
Nick Salter (and family) 0.170
Dima 0.140
Franco Baseggio 0.130
nancy 0.120
Katrina 0.080
Shula 0.060
Sarah Williams 0.040
David Miller 0.030
Greg 0.030
Lucas Kimball -0.010
Sally Gottesman (and Ezra Tiven-Gottesman) -0.010
Michelle Fisher -0.020
Dalit CS -0.050
David Huang -0.050
Eyal Minsky-Fenick -0.070
Martha Ameyalli -0.140
Yoav Schachner -0.460
Elana Rosenberg nan

Visualizing the predictions

The following graphic lets you visualize the data and see how the predictors did collectively and individually.

Note: it takes a moment for the text to pop up when you hover, so be patient!

Here’s the visualization:

The scoring methodology, in detail

We’re doing a small variation on the logarithmic scoring rule. For log scoring, you should think of the probability p of a prediction as a number between 0 and 1, with 0 representing 0%, and 1 representing 100%.

Here’s how our score works:

\text{score}(p, \text{outcome}) = \left( \begin{cases} \ln(p), & \text{if the outcome was true} \\ \ln(1-p), & \text{if the outcome was false} \end{cases} \right) - \ln(0.5)

This is just log-scoring minus \ln(0.5). That increases the score, since \ln(0.5) is negative. As a result, your score is always zero if your prediction is 50%, and higher scores are better. This reflects the fact that we treat a 50% prediction as a kind of a baseline.
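
Translated directly into code, the rule is tiny. Here’s a minimal sketch (the function name score is just for illustration):

```python
import math

def score(p, outcome):
    """Shifted log score: ln(p) - ln(0.5) if the outcome happened,
    ln(1 - p) - ln(0.5) if it didn't.  A 50% guess scores exactly 0."""
    return math.log(p if outcome else 1.0 - p) - math.log(0.5)

score(0.5, True)    # 0.0: the 50% baseline, whatever happens
score(0.75, True)   # ~0.405: a bolder guess that pans out earns points...
score(0.75, False)  # ~-0.693: ...and one that misses costs more than it would have earned
```

The log is also what makes a confident miss so expensive: a 100% (or 0%) guess that goes the wrong way scores negative infinity, which is how a score of -inf can show up in the rankings.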

So, say you predicted that a given outcome was 90% likely to happen.