
I wonder how much of it is due to the model being familiar with the game or parts of it, whether from training on the game itself or from reading/watching walkthroughs online.




There was a well-publicised "Claude plays Pokémon" stream where Claude failed to complete Pokémon Blue in spectacular fashion, despite weeks of trying. I think only a very gullible person would assume that future LLMs won't have this specifically baked into their training, as model makers do for popular benchmarks or for pelicans riding a bicycle.

If they gamed the pelican benchmark, it'd be pretty obvious.

Just try other random, non-realistic things like “a giraffe walking a tightrope”, “a car sitting at a cafe eating a pizza”, etc.

If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn’t.
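
A minimal sketch of that spot-check, assuming the OpenAI Python SDK and a placeholder model name (swap in whichever model you're actually testing):

    # Generate SVGs for the canonical pelican prompt plus a few random
    # variants, then eyeball whether quality drops off sharply for the
    # non-benchmark prompts.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompts = [
        "Generate an SVG of a pelican riding a bicycle",   # the well-known benchmark prompt
        "Generate an SVG of a giraffe walking a tightrope",
        "Generate an SVG of a car sitting at a cafe eating a pizza",
    ]
    for p in prompts:
        resp = client.chat.completions.create(
            model="gpt-5.1",  # placeholder; substitute the model under test
            messages=[{"role": "user", "content": p}],
        )
        print(f"--- {p} ---")
        print(resp.choices[0].message.content)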


While it is true that model makers are increasingly trying to game benchmarks, it's also true that benchmark-chasing is lowering model quality. GPT-5, 5.1 and 5.2 have been nearly universally panned by almost every class of user, despite being benchmark monsters. In fact, the more OpenAI tries to benchmark-max, the worse their models seem to get.

Hm? 5.1 Thinking is much better than 4o or o3. Just don't use the instant model.

5.2 is a solid model and I'm actually impressed with M365 copilot when using it.

> as model makers do for popular benchmarks or for pelicans riding a bicycle.

Citation?



