
Let me give an example [0]:

> Before LLMs, we had a very crisp test for having, or not having a world model: The former could answer certain questions that the latter could not, as shown in #Bookofwhy. LLMs made it harder to test, for the latter could fake having a model by simply citing texts from authors who had world models, see https://ucla.in/3L91Yvt The question before us is: Should we care? Or, can LLMs fake having a world model so well that it wouldn't show up in performance? If not, we need a new mini-Turing test to distinguish having vs. not-having a world model.

The thing here is that LLMs are trained on most of the internet. Some people are surprised by certain results but don't seem to know what kind of content is on the internet or in the training set (to be fair, we don't know what all the training sets contain). There are lots of people who believe LLMs have world models (there have even been papers written about this!), but it's actually pretty likely that the LLMs were trained on similar examples, and this also helps explain why the "world model" is so easy to break (it isn't much of a world model if it's that brittle). We see similar things with tests like the LSAT and GRE subject tests. Well, guess what: there are whole subreddits and Stack Exchanges dedicated to these. Reddit is in most of these datasets, and if you're testing on training data your test is spoiled. (There's a lot of spoiling happening these days, but like Judea said, does it matter?)
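
To make "spoiled" concrete, here's roughly what a contamination check looks like: flag any test item whose n-grams show up verbatim in the training corpus. This is a toy sketch (the function names, the n-gram size, and the overlap threshold are all made up for illustration; real decontamination pipelines are fuzzier and fancier), but it's the core idea:

    # Toy contamination check: does a benchmark question overlap
    # heavily with the training corpus? All names and thresholds
    # here are illustrative, not from any real pipeline.
    def ngrams(text, n=8):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def is_spoiled(test_item, train_docs, n=8, threshold=0.5):
        test_ng = ngrams(test_item, n)
        if not test_ng:
            return False
        hits = set()
        for doc in train_docs:
            hits |= test_ng & ngrams(doc, n)
        # If half the question's 8-grams appear verbatim in training
        # data, performance on it tells you about memorization, not
        # about a world model.
        return len(hits) / len(test_ng) >= threshold

If a big chunk of a benchmark trips a check like this, the headline number is measuring retrieval, not reasoning.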

The problem with hyping LLMs/ML/AI up too much is that we can no longer discuss how to improve the systems. If you are convinced they have everything solved, then there's nothing left to do. But no system is ever perfectly solved. Never confuse someone criticizing a tool with someone saying the tool is useless. I'm pretty critical of LLMs and happy to talk about their limitations. That doesn't mean I'm not also wildly impressed by them and using them frequently. There's too much reacting to criticism as if people are throwing the thing in the dumpster rather than just discussing its limitations.

FWIW, I wouldn't have changed my mind if DM's test had shown the opposite result. You can check my comment history: I'd probably have dug in and commented on why DM's tests were bullshit instead. I've even made comments about how chain of thought is frequently a type of spoiling. The reason is not that I think we can't create AGI (I very much believe we can), but that I have deep experience with these models and with ML in general, and nothing in my experience or understanding leads me to believe even half the things people claim. You'll see me make many comments ranting about the difficulty of metrics, and about this absurd evaluation practice (not just in ML) of taking a proxy test set, applying some proxy metric, and declaring that performance on it is enough to claim one model is better than another. It is a ridiculous notion.
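
The pattern I'm ranting about fits in a few lines. A hypothetical sketch, where metric and test_set stand in for whatever proxy a given paper happens to pick:

    # The proxy-evaluation pattern: pick a proxy test set and a proxy
    # metric, then treat one inequality as proof of a better model.
    def claim_model_is_better(model_a, model_b, metric, test_set):
        # Two unstated assumptions hide in this comparison: that
        # test_set represents the real distribution, and that metric
        # captures what we actually care about. Neither comes for
        # free, and a spoiled test_set breaks the first one outright.
        return metric(model_a, test_set) > metric(model_b, test_set)

The inequality is real; the claim built on top of it is only as good as those two proxies.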

[0] https://twitter.com/yudapearl/status/1710038543912104050


