I wonder how much of it is due to the model being familiar with the game or part...

andrepd · 2025-12-20T15:32:19 1766244739

There was a well-publicised "Claude plays Pokémon" stream where Claude failed to complete Pokémon Blue in spectacular fashion, despite weeks of trying. I think only a very gullible person would assume that the makers of future LLMs won't specifically bake this into their training, as they do for popular benchmarks or for penguins riding a bike.

dwaltrip · 2025-12-20T19:40:26 1766259626

If they game the pelican benchmark, it’d be pretty obvious.

Just try other random, non-realistic things like “a giraffe walking a tightrope”, “a car sitting at a cafe eating a pizza”, etc.

If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn’t.
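A rough sketch of that comparison, assuming each image has already been given a 1-10 quality rating by a human reviewer. The function name `looks_gamed`, the `margin` threshold, and the ratings are all hypothetical, made up purely for illustration:

```python
from statistics import mean

def looks_gamed(benchmark_score, control_scores, margin=2.0):
    """Return True if the benchmark prompt scores far above the control prompts.

    `margin` is an arbitrary, assumed threshold for "dramatically different".
    """
    if not control_scores:
        raise ValueError("need at least one control prompt score")
    return benchmark_score - mean(control_scores) > margin

# Hypothetical ratings, not real measurements.
pelican_score = 9.0
control_scores = {
    "a giraffe walking a tightrope": 8.0,
    "a car sitting at a cafe eating a pizza": 8.5,
}
print(looks_gamed(pelican_score, list(control_scores.values())))
```

A handful of control prompts is a tiny sample, so this only catches large gaps, which is all the argument needs.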

criley2 · 2025-12-20T15:46:14 1766245574

While it is true that model makers are increasingly trying to game benchmarks, it's also true that benchmark-chasing is lowering model quality. GPT 5, 5.1 and 5.2 have been panned by almost every class of user, despite being benchmark monsters. In fact, the more OpenAI tries to benchmark-max, the worse their models seem to get.

astrange · 2025-12-20T16:25:08 1766247908

Hm? 5.1 Thinking is much better than 4o or o3. Just don't use the instant model.

malnourish · 2025-12-21T00:38:36 1766277516

5.2 is a solid model and I'm actually impressed with M365 Copilot when using it.

ctoth · 2025-12-20T18:04:17 1766253857

> as they do for popular benchmarks or for penguins riding a bike.

Citation?