> We benchmark humans with these tests – why would we not do that for AIs?
Because the correlation between the thing of interest and what the tests measure may be radically different for systems whose architecture is very much unlike a human's than it is for humans.
There’s an entire field about this in testing for humans (psychometrics), and approximately zero on it for AIs. Blindly using human tests – which are proxy measures of harder-to-directly-assess figures of merit, and which require significant calibration on humans to be valid for them – for anything else without appropriate calibration is good for generating headlines, but not for measuring anything that matters. (Except, I guess, the impact of humans using AIs to cheat on the human tests, which is not insignificant, but not generally what people trumpeting these measures focus on.)
There is a lot of work on benchmarking for AI as well. This is where things like ResNet come from.
But the reason for using these tests on AI is precisely the reason we give them to humans -- we think we know what they measure. AI is not intended to be a computation engine or a number-crunching machine. It is intended to do things that historically required "human intelligence".
If there are better tests of human intelligence, I think that the AI community would be very interested in learning about them.
Check on the curve for flight speed sometime, and see what you think of that, and what you would have thought of it during the initial era of powered flight.
Maybe a different analogy will make my point better. Compare rocket technology with jet engine technology. Both continued to progress across a vaguely comparable time period, but at no point was one a substitute for the other except in some highly specialized (mostly military-related) cases. It is very clear that language models are very good at something. But are they, to use the analogy, the rocket engine or the jet engine?
I doubt that that’s a sustained exponential growth. As far as I know, there is no power law that could explain it, and from a computational complexity theory point of view it doesn’t seem possible.
See https://www.lesswrong.com/posts/J6gktpSgYoyq5q3Au/benchmarki.... The short answer is that Elo grows roughly linearly with evaluation depth, but since the game tree is exponential in depth, linear Elo growth requires exponential compute. The main algorithmic improvements are things that let you shrink the branching factor, and as long as you can keep shrinking the branching factor, you keep getting exponential improvements. SF15 has a branching factor of roughly 1.6. Sure, the exponential growth won't last forever, but it's been surprisingly resilient for at least 30 years.
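To make the depth-versus-compute arithmetic concrete, here is a rough sketch. It assumes a search that visits about b^d nodes to reach depth d with effective branching factor b, which is a simplification. The node budget and the branching factors other than 1.6 are illustrative values, not measurements of any engine.

```python
import math

def nodes_for_depth(depth: float, branching_factor: float) -> float:
    """Approximate nodes visited to search `depth` plies, assuming b**d growth."""
    return branching_factor ** depth

def reachable_depth(node_budget: float, branching_factor: float) -> float:
    """Depth d such that branching_factor ** d equals node_budget."""
    return math.log(node_budget) / math.log(branching_factor)

if __name__ == "__main__":
    budget = 1e9  # hypothetical node budget per move
    # 1.6 is the figure quoted above; the other values are only for comparison.
    for b in (35.0, 6.0, 2.0, 1.6):
        depth = reachable_depth(budget, b)
        print(f"b = {b:>4}: reachable depth is about {depth:.1f} plies")
```

Under these assumptions, each extra ply multiplies the required compute by b, so shrinking b buys depth (and, if Elo is roughly linear in depth, Elo) without any new hardware.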
It wouldn’t have been possible if there hadn’t been an exponential growth in computing resources over the past decades. That has already slowed down, and the prospects for the future are unclear. Regarding the branching factor, the improvements certainly must converge towards an asymptote.
The more general point is that you always end up with an S-curve instead of a limitless exponential growth as suggested by Kaibeezy. And with AI we simply don’t know how far off the inflection point is.
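A small sketch of why the inflection point is hard to see from the inside: a logistic (S-shaped) curve and a pure exponential with the same starting value and initial growth rate are nearly identical early on. The parameter values (`x0`, `r`, `cap`) are arbitrary assumptions chosen for illustration, not estimates of anything about AI.

```python
import math

def exponential(t: float, x0: float = 1.0, r: float = 0.5) -> float:
    """Unbounded exponential growth starting at x0 with rate r."""
    return x0 * math.exp(r * t)

def logistic(t: float, x0: float = 1.0, r: float = 0.5, cap: float = 1000.0) -> float:
    """Logistic growth with the same start and initial rate, saturating at cap."""
    return cap / (1.0 + (cap / x0 - 1.0) * math.exp(-r * t))

def inflection_time(x0: float = 1.0, r: float = 0.5, cap: float = 1000.0) -> float:
    """Time at which the logistic curve reaches cap / 2 and growth starts slowing."""
    return math.log(cap / x0 - 1.0) / r

if __name__ == "__main__":
    print(f"inflection at t = {inflection_time():.2f}")
    for t in (0, 2, 4, 8, 12, 16, 20):
        e, s = exponential(t), logistic(t)
        print(f"t = {t:>2}: exponential = {e:10.1f}, logistic = {s:7.1f}, ratio = {s / e:.3f}")
```

Well before the inflection time the ratio stays close to 1, so data from that stretch alone cannot tell the two curves apart.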
The implications for society? We’d better up our game.