Stubsack: weekly thread for sneers not worth an entire post, week ending 5th October 2025 - awful.systems

blakestacey@awful.systems · 2 days ago


BigMuffN69@awful.systems · edit-2 21 hours ago

Links to the METR tasks w/ massive error bars at 50% level lmaou.

Someone in the comments rightly points out the comparison with covid isn’t apt. With covid, an underlying mechanism caused the exponential growth in its spread.

With LLMs the exponential trend is being caused by exponentially increasing spending and a healthy dose of targeting benchmarks, which is why people are calling the top. The money literally doesn’t exist for this shit to go on just so you can create your 50% accurate mechanical turk.

Edit: idk the more I think about this the more it irks me. Like if I was allowed to pick and choose benchmarks that agree with my biases I would post something like this…

… and claim model performance is actually getting worse over time.

https://xcancel.com/sayashk/status/1966144670561612202#m

gerikson@awful.systems · 21 hours ago

The second screenshot goes to a chart where the Y axis is labelled

Task duration (for humans) where logistic regression of our data predicts the AI has a 50% chance of succeeding

So they’re just extrapolating an exponential, not actually measuring it.
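For anyone wondering what that axis label means mechanically, here is a minimal sketch, assuming the fit is a logistic curve in log2 of human task duration. The parameterization, function names, and coefficients are made up for illustration and are not METR's. The point is that the "50% horizon" is a number solved from fitted coefficients, not a task length anyone observed directly.

```python
import math

def success_prob(minutes, intercept, slope):
    """Logistic model: predicted P(success) given human task duration in minutes."""
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))

def horizon_50(intercept, slope):
    """Duration where the fitted curve crosses 50%.

    Solves intercept + slope * log2(t) = 0 for t.
    """
    return 2.0 ** (-intercept / slope)

# Hypothetical coefficients, for illustration only (not fitted to any real data).
intercept, slope = 3.0, -0.5
t50 = horizon_50(intercept, slope)
print(t50, success_prob(t50, intercept, slope))
```

Small changes in the fitted intercept or slope move `t50` multiplicatively, which is one way wide error bars at the 50% level can arise.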
