> I try the LLMs every now and then, and they still make the same stupid hallucinations that ChatGPT did on day 1.
One of the tests I sometimes do of LLMs is a geometry puzzle:
You're on the equator facing south. You move forward 10,000 km along the surface of the Earth. Rotate 90° clockwise. You move another 10,000 km forward along the surface of the Earth. Rotate another 90° clockwise, then move another 10,000 km forward along the surface of the Earth.
Where are you now, and what direction are you facing?
They all used to get this wrong all the time. Now the best ones sometimes don't. (That said, the only one to succeed just as I write this comment was DeepSeek; the first I saw succeed was one of ChatGPT's models, but that's now back to the usual error they all used to make).
Anecdotes are of course a bad way to study this kind of thing.
Unfortunately, so are the benchmarks, because the models have quickly saturated most of them, including traditional IQ tests (on the plus side, this has demonstrated that IQ tests are definitely a learnable skill, as LLMs lose 40-50 IQ points when going from public to private IQ tests) and stuff like the maths olympiad.
Right now, AFAICT the only open benchmarks are the METR time horizon metric, the ARC-AGI family of tests, and the "make me an SVG of ${…}" stuff inspired by Simon Willison's pelican on a bike.
Out of interest, was your intended answer "where you started, facing east"?
FWIW, Claude Opus 4.5 gets this right for me, assuming that is the intended answer. On request, it also gave me a Mathematica program which (after I fixed some trivial exceptions due to errors in units) informs me that using the ITRF00 datum the actual answer is 0.0177593 degrees north and 0.168379 west of where you started (about 11.7 miles away from the starting point) and your rotation is 89.98 degrees rather than 90.
(ChatGPT 5.1 Thinking, for me, gets the wrong answer because it correctly gets near the South Pole and then follows a line of latitude 200 times round the South Pole for the second leg, which strikes me as a flatly incorrect interpretation of the words "move forward along the surface of the earth". Was that the "usual error they all used to make"?)
> Out of interest, was your intended answer "where you started, facing east"?
Or anything close to it so long as the logic is right, yes. I care about the reasoning failure, not the small difference between the exact quarter-circumferences of these great circles and 10,000 km. (Not that it really matters, but now you've said the answer, this test becomes even less reliable than it already was.)
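For anyone who wants to check the intended logic rather than trust an LLM: a minimal sketch of my own (not the Mathematica program mentioned upthread), assuming an idealized perfectly spherical Earth with circumference exactly 40,000 km, so each 10,000 km leg is exactly a quarter of a great circle. Position is a unit vector, heading is a unit tangent vector, and a clockwise turn (seen from outside the sphere) is a −90° rotation of the heading about the position axis, i.e. -(p × h).

```python
import math

# Idealized sphere: circumference exactly 40,000 km, so a 10,000 km leg
# is exactly a quarter of a great circle.
R = 40_000 / (2 * math.pi)

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def walk(p, h, dist):
    # Move dist km along the great circle through unit position vector p
    # in the direction of unit tangent vector h.
    t = dist / R
    p2 = tuple(math.cos(t)*a + math.sin(t)*b for a, b in zip(p, h))
    h2 = tuple(-math.sin(t)*a + math.cos(t)*b for a, b in zip(p, h))
    return p2, h2

def turn_clockwise_90(p, h):
    # Clockwise as seen from outside the sphere: rotate h by -90° about p,
    # which works out to -(p x h).
    return tuple(-c for c in cross(p, h))

p, h = (1.0, 0.0, 0.0), (0.0, 0.0, -1.0)  # equator (lat 0, lon 0), facing south
p, h = walk(p, h, 10_000)                  # leg 1: arrive at the south pole
h = turn_clockwise_90(p, h)
p, h = walk(p, h, 10_000)                  # leg 2: back up to the equator, at 90°W
h = turn_clockwise_90(p, h)
p, h = walk(p, h, 10_000)                  # leg 3
print(p)  # ~ (1, 0, 0): back at the starting point
print(h)  # ~ (0, 1, 0): facing east
```

On the real ellipsoid the legs aren't exact quarter-circumferences, which is where the ~11.7-mile offset mentioned above comes from; this sketch only checks the reasoning, not the geodesy.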
> FWIW, Claude Opus 4.5 gets this right for me, assuming that is the intended answer.
Like I said, now the best ones sometimes don't [always get it wrong].
For me yesterday, Claude (albeit Sonnet 4.5, because my testing is cheap) avoided the south pole issue, but then got the third leg wrong and ended up at the north pole. A while back, ChatGPT 5 (I looked the result up) got the answer right; yesterday GPT-5-thinking-mini (auto-selected by the system) got it wrong the same way as you report on the south pole, but then also got the equator wrong and ended up near the north pole.
"Never" to "unreliable success" is still an improvement.