LLMs can't self-correct in reasoning tasks, DeepMind study finds (bdtechtalks.com)
166 points by nyrikki on Oct 9, 2023 | 358 comments


LLMs have no real understanding, so of course they cannot self-correct. They can't even correct themselves on command.

I had an example just today: I wanted an example of a piece of code using framework X to do task Y. I didn't realize that what I wanted was not possible.

The LLM I was using gave me code that worked, but did not do what I wanted. I pointed this out, so I then got code that syntactically did what I wanted, but could never work. This went back and forth twice, before I gave up.

The LLM was incapable of recognizing the falsity of its answers. It certainly wasn't capable of suggesting a different approach that would work (and yes, there was one, as I found after a bit more research).


I see remarks like this a lot, and I don't know what to do with them.

Talking about "real understanding" without fleshing out what we mean by the phrase doesn't strike me as at all insightful in this context. In day-to-day use, the term is imprecise but usually easy enough to talk about, but I'd struggle to apply it to dogs, let alone to LLMs or AGIs or whatever.

You say that _of course_ they cannot self-correct because of this. If Deepmind's tests showed the opposite result, would you say that the system they were testing did have "real understanding"? It seems likely to me that future systems will demonstrate self-correction using Huang et al.'s methodology...will we think those systems are different in some way involving understanding?

> The LLM I was using gave me code that worked, but did not do what I wanted. I pointed this out, so I then got code that syntactically did what I wanted, but could never work. This went back and forth twice, before I gave up.

> The LLM was incapable of recognizing the falsity of its answers. It certainly wasn't capable of suggesting a different approach that would work (and yes, there was one, I found after a bit more research).

I really don't know what to do about these sorts of anecdotes. Often with slightly different random values or temperature or phrasing, the same LLM will give different results for the same problem. I've repeatedly seen people make much more narrow claims about what LLMs get wrong or what their biases are, where simply regenerating will give a dramatically different answer.

Your pain here sounds like something I see in humans all the time. If we trained a human this hard never to ask follow-up questions, I imagine the result would be even worse. There are clearly important differences between an LLM and a human being, but I don't think it's always easy to describe what they are in useful ways.


> Often with slightly different random values or temperature or phrasing, the same LLM will give different results for the same problem.

You're fundamentally describing a stochastic parrot here, and not an intelligence. When you say that humans can also make the same mistake, you're ignoring the fact that every human testing these systems and finding them lacking is also comparing their experience to a lifetime of interacting with human intelligence. Every anecdote of this sort is an example, typically based on repeatedly trying an assortment of prompts (eliminating two of your variables -- random values & varying prompts), on a large variety of tasks, which is curated down to a single example for the sake of brevity.

To say that "oh maybe it would have gotten the right answer if you got lucky or tried harder or twiddled a knob that you don't have access to" is simply nowhere near the extraordinary evidence that is required to prove the extraordinary claim of "real understanding." The reason that you don't know what to do about these anecdotes is that you lack the evidence to properly rebut them. You can only wave your hands at ill-defined properties and gaslight about the user holding it wrong. Perhaps you should ask your parrot buddy what to do.

But if you were really serious about this, you wouldn't be going after the strawfolk down in the comments, you'd rebut the paper itself.


This is super interesting because you’ll end up with different people with different opinions on “what is the best answer” if they all have different experiences.

Most of us, when using LLMs, feel they should produce only the one correct answer. Sometimes that's the right way to think about it, but other times we might want an opinionated answer, because there is no single true answer.

So I think if we get better learning and self-correcting machines, we also get more opinionated and sometimes incorrect machines. Kind of like people. And sometimes they're stubborn or confidently incorrect, but I think that actually points to more intelligence (in people at least), since it means they are deciding based on their own experiences.


> This is super interesting because you’ll end up with different people with different opinions on “what is the best answer” if they all have different experiences.

In the context of the grandparent comment it's code, so while we can have different solutions to the same problem, we can verify whether a solution is correct; it's not a matter of opinion on some subject.


There's no such thing as "one correct answer" to most tasks in any language. In mathematics, maybe. If you're asking it to recall a specific quote verbatim, yeah. But other than that there are many (infinite?) different ways you can paraphrase differently to get the same point across (at least in English).


And I assume there's no agreed understanding of what "real understanding" means. In this limited context, "real understanding" looks like "did what I wanted without me really explaining what I wanted". Although in real life I seldom know from the start what I really want either, and it takes me a few iterations to clarify it even to myself - so "real understanding" seems to be something even I am missing (and I shouldn't blame the poor AI for lacking it). Or maybe I'm an AI myself, just unaware of it?


> Every anecdote of this sort is an example, typically based on repeatedly trying an assortment of prompts

"Data" isn't the plural of "anecdote" -- you can systematically test things, as the authors of the study the article reports did, or you can say "I tried something and got result ~X, therefore LLMs cannot X", which is my honest understanding of what I've heard people say repeatedly.

> But if you were really serious about this, you wouldn't be going after the strawfolk down in the comments, you'd rebut the paper itself.

I don't know the details of the paper, but I suspect it's right. My actual problem was with someone explaining that the result was trivial and offering what I thought was a non-helpful explanation of why.


One example. I asked ChatGPT to take 7 letters and find all the word combos.

There were about 80 possible. It maxed out at 19 or so.

Now clearly this is a hard task. Humans can’t do it well and none of the text corpus would offer training on this sort of thing.

If I asked a human to do it they would say “here’s what I got, I don’t think that is all of them”

ChatGPT would apologize when corrected and make a new, equally wrong list with full confidence.

That’s how I interpret understanding: self-awareness and doubt.

Oddly enough the code interpreter seems much stronger on this point. Now that I think about it I don’t think I was using it for the letter combo task.


Current LLMs process text as subword tokens (roughly word fragments) rather than individual characters, as an efficiency optimisation. This makes them bad at solving puzzles involving individual letters or digits.

This is a well known, well understood limitation that can be overcome (Facebook published a hierarchical model that can parse individual characters), but this technique isn’t used by ChatGPT.
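
To make the limitation concrete, here's a rough sketch using OpenAI's tiktoken tokenizer library (the example string and the exact token split shown in the comments are illustrative assumptions, not guaranteed output):

    import tiktoken  # OpenAI's open-source tokenizer library

    # cl100k_base is the encoding used by the GPT-3.5/GPT-4 chat models.
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("unscramble the letters RETSAM")
    print([enc.decode([i]) for i in ids])
    # Prints a handful of multi-character chunks (something like
    # ['un', 'scr', 'amble', ' the', ' letters', ' R', 'ET', 'SAM']).
    # The individual letters the puzzle is about are never presented
    # to the model as separate inputs.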

Whenever I see a criticism like this, it says more to me about the hubris of humans and the falsity of their assumed superior intelligence.

You’re criticising the intelligence of a thing you don’t understand yourself!

You didn’t “read the manual”. You didn’t go find out why the AI is failing. You just spouted an angry comment and gave up without gaining any understanding.

PS: the “stupid AI” understands all of this. Just ask it to explain it to you: https://chat.openai.com/share/7bc1c1cb-d888-4ccb-904a-79e9ae...


ChatGPT will happily tell you it doesn't know anything after January 2022, and it'll say "jdnrirmd-gurlksjd" is not a recognizable term instead of trying to bullshit a definition of it, if that's your bar for "understanding".


You can see it in various tasks where the answer for a text generator would be incorrect, yet simple reasoning or understanding yields the correct answer. For example, the "everything fits in the boat" variant of the farmer-crossing-the-river problem: https://chat.openai.com/share/7d2de1c4-1cd6-4d7d-97e4-645830...


I love this example! I've been using a different one, and GPT has been getting better at it (I'm not convinced they aren't training on my reports that it gets the wrong answer, or on the fact that I've posted this question dozens of times here and some threads have gotten pretty high up. But it's been a year now and we're still in the stochastic phase). The questions I ask are variants of "Which weighs more, a pound of feathers or a kilogram of bricks?" Similar to yours, it is just a slight variant on a standard logic puzzle.

For those wondering why ask these questions: we're testing for overfitting. Yes, you can overfit even if your training and validation curves don't diverge. If the model fails these questions, it has clearly overfit. But the solution space is very large and complex, so overfitting in one regime doesn't mean it didn't underfit another.

GPT 3.5: https://chat.openai.com/share/328cc39d-7fb3-4726-92a2-29437c...

LLaMA 2 70B chat: https://hf.co/chat/r/KX5H3P2

Falcon 180B Chat: https://hf.co/chat/r/V8XxdYh

None of these models can consistently get the answer right. GPT used to actually explain to me the difference between a pound and a kilogram (correctly) and then use that answer to justify why they are the same. Such an answer is very clearly a demonstration of a lack of understanding, as it isn't even remotely self-consistent.
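
A rough sketch of how one might run this check systematically rather than by hand, using the openai 0.x Python client that's current as of this writing (the model name, sample count, and eyeballing step are my assumptions):

    import openai  # pip install openai; assumes OPENAI_API_KEY is set

    QUESTION = ("Which weighs more, a pound of feathers "
                "or a kilogram of bricks?")

    def collect(n=10, model="gpt-3.5-turbo"):
        answers = []
        for _ in range(n):
            resp = openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": QUESTION}],
                temperature=1.0,  # default sampling, so reruns can differ
            )
            answers.append(resp.choices[0].message.content)
        return answers

    # for a in collect():
    #     print(a[:120])
    # The correct answer is always the kilogram of bricks (~2.2 lb);
    # an overfit model pattern-matches the classic riddle and answers
    # "they weigh the same" on some fraction of runs.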

LLMs are powerful and amazing technologies. But we can also critique them. Hyping up models hinders our ability to improve them, because everything (LLMs, humans, governments, whatever) has limitations and is worthy of critique. But criticism isn't the same as saying something is garbage, and too many people confuse the two.


I was pleased to see GPT-4, combined with my custom ChatGPT instructions, had no problem with this: https://chat.openai.com/share/726d5c7b-761f-462e-8a6a-61904a...


You should refer to this comment I made[0], as your prompt is spoiling the problem. It is far too specific. Hacking the answer in is not a means of getting correct answers. It won't scale and won't generalize.

[0] https://news.ycombinator.com/item?id=37825566


Agreed, there's definitely a lot of value in LLMs even if they don't really have an understanding of the world.


It is quite interesting to me that people latch onto the world model so much, as if having one is highly significant or lacking one invalidates LLMs. Personally I still believe they're magical (in the proverbial sense) and amazing tools, even though I also believe they are just complex stochastic machines that have very little understanding of the things they are doing. That just tells me how cool and amazing statistics is. I honestly don't know anyone who thinks LLMs are useless, despite knowing a large number of people who are highly critical of them. But those are very different metrics. I don't know why we can't have nuance in things, why they need to be either a gift from god or a pile of shit.


It figures it out if you point out the difference much like a human would when faced with a riddle they didn't get.

https://chat.openai.com/share/34df3eb0-c41c-4c5b-80d2-48f5d0...

I'll be honest, the only reason I knew to read your variation of the farmer-boat problem carefully is because you set me up for success - I knew it wasn't going to be the original.

Curiously, I haven't been able to get it to one-shot the question, even by prefixing it with instructions.


Except the problem here is that you spoiled the solution. It is quite difficult to provide hints that don't spoil it - or rather, to tell it only that it is wrong without providing hints.

Here, I'll demonstrate it. First I'll prod it with vague responses about it just generally being wrong, trying to leak no information to the model. Then we'll slowly add a bit more and finally use your follow-ups. Notice that the model cannot escape the overfit regime without your strong hints. You told the model specifically what to consider. You told it that it is a trick question on a trick question. This is not how a human would handle the situation. Given my follow-ups, a human wouldn't spit out the same answers; they'd more likely follow up with questions if they were confused. Which is a behavior I've never seen from an LLM: asking clarifying questions.

https://chat.openai.com/share/57ab9bca-326d-45cb-9257-7fb8c2...


I do not think that answer would correctly solve the original riddle. The LLM blindly just gave an answer without raising the question of which pair of objects together would be a problem if left alone.


Good catch, it actually leaves the goat with the cabbage in the answer above.


But that isn't a correct solution - it leaves the cabbage with the goat.


The correct solution is that everything fits in the boat, the farmer can just go across the river with everything at once.


> I see remarks like this a lot, and I don't know what to do with them.

There's no "reasoning loop" built into LLMs yet. Keyword, yet. For now we're left with single-shot answers from "memory" rather than a reasoning loop akin to what a human would do, which is read the docs, try some stuff out, discover that what you're asking for is impossible, and then telling you that it isn't possible.


Alternatively, the prior on "this is not possible" is very low because RLHF & Friends have targeted metrics that, inadvertently or not, discourage that outcome.


I think that's the right answer - human trainers prefer an answer, even a made up one, to "I don't know".


Dataset as well. In a forum, if you don't know the answer you simply don't post; only people who think they know will post an answer. In a dialogue you see a lot more "I don't know", since people are expected to respond, but there isn't a lot of dialogue data to be found on the internet compared to open forum data.


Amazon product Q&A has a lot of "I don't know" answers. Unlike just about everywhere else on the internet.


Any agent that uses a tool (i.e. a REPL or APIs) essentially has a reasoning loop just like this. Microsoft Research recently got their Autogen framework up to the point where the results are much better than AutoGPT/BabyGPT and the like through more optimal tool use, showing the ability to do rather sophisticated research and problem solving.

This new research focuses on responses using only the model's own corpus of knowledge. And that totally makes sense if that's all you do without giving a hint as to what is incorrect, if indeed anything is incorrect. It's akin to asking a child "are you sure?" after they produce an answer they feel sure about, and then receiving a different answer born of confusion.

The main takeaway for me is that if you can get better performance on follow-up queries simply by telling it to review its approach more carefully, your initial prompt wasn't as effective as it could have been.


It's trivially easy to build a reasoning loop using the GPT-4 API.


How?


example:

Initial Prompt:

"Here is the schema for a database: CREATE TABLE persons ( id INTEGER PRIMARY KEY AUTOINCREMENT, first_name TEXT NOT NULL, last_name TEXT NOT NULL, age INTEGER NOT NULL ); CREATE TABLE Y(blah blah blah)

I'll pose a question and I want you to: Respond with JSON which has two fields, Query and Error. Query: A SQL query to get the required information Error: An error message to pass back to the user if it's not possible.

Question: Show me all the people who are over 40. "

Response from prompt: {Query: "Select years_old from persons where age > 40", Error:""}

Now, in your "agent": Get that response, run it against your db, and get the error message. Go back to the GPT-4 API with the initial prompt and response, and add "this doesn't work and gives the following error message. Correct your response and respond in JSON again."

And so on.
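
Here's a minimal, self-contained sketch of that loop in Python, using the openai 0.x client; the sqlite database file, the retry budget, and the JSON handling are my own assumptions, not part of the example above:

    import json
    import sqlite3
    import openai  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM = """Here is the schema for a database:
    CREATE TABLE persons (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      first_name TEXT NOT NULL,
      last_name TEXT NOT NULL,
      age INTEGER NOT NULL
    );
    I'll pose a question. Respond with JSON that has two fields:
    Query: a SQL query to get the required information.
    Error: an error message to pass back to the user if it's not possible."""

    def run_query(sql, db_path="people.db"):
        con = sqlite3.connect(db_path)
        try:
            return con.execute(sql).fetchall(), None
        except sqlite3.Error as e:
            return None, str(e)
        finally:
            con.close()

    def answer(question, max_rounds=3):
        messages = [{"role": "system", "content": SYSTEM},
                    {"role": "user", "content": question}]
        for _ in range(max_rounds):
            resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
            reply = resp.choices[0].message.content
            messages.append({"role": "assistant", "content": reply})
            parsed = json.loads(reply)
            if parsed.get("Error"):
                return None, parsed["Error"]   # model says it's not possible
            rows, err = run_query(parsed["Query"])
            if err is None:
                return rows, None              # query ran cleanly: done
            # Feed the database error back so the model can correct itself.
            messages.append({"role": "user", "content":
                             "This doesn't work and gives the following error "
                             "message: " + err + ". Correct your response and "
                             "respond in JSON again."})
        return None, "gave up after several rounds"

    # answer("Show me all the people who are over 40.")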


People who are having trouble getting good results from LLMs are still trying to figure them out. Most folks are using the product and not the model. Where it really shines is when you use the API and build a platform of workflows around it, starting with a real solid prompt template and working your way through various methods to achieve an outcome. There is no magic wand to get a full solution in one shot, but guiding the LLM towards that ultimate outcome is the type of thing a lot of nerds don't have the attention span for; instead they move on to the next weekly JS framework to get their fix.

Which is good for us who stick with the process.


Yup, agree.


These models produce correct answers to many problems that require “reasoning” (for any sensible meaning of the word), that are not in their training set.


It’s also unclear that LLMs have no “reasoning loop”, or that a “loop” abstraction is necessary for all reasoning, or that eyeballing wrong answers is a sufficient metric to categorically dismiss “reasoning." A "reasoning loop" argument is especially odd when applied to LLMs...which explicitly have architectural loops, and their output generation is a literal loop.

Folks, we don't understand our own minds. You think we already understand a potential alien mind? With our n = 1 example of a generally intelligent mind architecture (ours)? Fat chance.

Anyone confidently claiming sweeping, nebulous conclusions about these new models is likely revealing more about their biases than the model's inner workings. We just don't know much. Hot takes on ML are just anthropocentrism Rorschach tests.


> A "reasoning loop" argument is especially odd when applied to LLMs...which explicitly have architectural loops

Oh, I understood that the current crop of LLMs didn't have a way to push data back into itself. I know so little about LLM architecture at this point. I need to work my way through the free AI courses that are out there. Maybe a tutorial or two on building one from scratch.


That just shows that parroting will solve a number of problems that require reasoning - that is, "reasoning" as we think of it can be reduced to a statistical process.

Kids parrot their parents before they understand the meaning of what they are doing or saying. Meaning arises from a much more complex process than seeing/repeating. I think the same will be true with LLMs. True reasoning capabilities will be another revolution entirely.

---

This just occurred to me: LLMs have a cargo-cult-level understanding of anything they've been trained on. Correct answers are actually statistical flukes -- purposeful, because that's how we trained the models -- but not actually significant in terms of reasoning.


How do we test this alleged distinction between "true reasoning" and statistical parroting? What experiments can we perform on SotA models to make this idea falsifiable?

Commonplace hypotheses like these give me strong "No True Scotsman" vibes, but I'm often wrong. Let's agree on a method, and I'll test it.


There is none. No one understands what intelligence or self-awareness really mean, which is why people feel threatened by anything that challenges what they proclaim to be a uniquely human trait. The goal posts WILL move a lot more before this debate is settled. Inevitably the success rate and accuracy of LLM responses are going to get better and better, quite possibly to the point where they are indistinguishable from a human's. But it won't matter, because some humans will redefine intelligence to be something else so their place at the top of the intelligence pyramid remains in place.


You can coax LLMs into reasoning. I recall someone on here posting a link to prompts that instruct the LLM to reason through a request, and this improves its output significantly.

I think what our models are missing is recursion on themselves. You and I can be self-referential, and we are capable of meta thought. We are also capable of "internal dialogue" where we speak and reason internally.

The LLMs at present lack even state or memory. Arguably those aren't necessary for "reasoning" capabilities.

I wonder how far off "stochastic parrot" is from what we do naturally. I have an image in my head of how we think using associated words/concepts/pictures for learning and I can't imagine it is too different from statistically associated concepts/words.

---

This is sort of scatterbrained, and I apologize. I don't have enough time to write a more concise response.


That doesn’t mean it does any “reasoning”. It generates a response text that looks like a response to the input text. The whole point is that it generates responses that aren’t in the training set. But the fact that the response is actually a correct one is just coincidence.


So when a model succeeds on a reasoning task it's not positive evidence, but when it fails it's negative evidence? The good ol' confirmation bias feedback loop!


The article is about work to add a loop somewhat to that effect


You are right that “understanding” requires further explanation.

In the anecdote above, however, it seems to be clear that the LLM didn’t realize a number of things, such as:

– It might not have sufficient information to answer the question.

– It’s just making a guess of what a correct answer might plausibly look like.

– The fact that this is different from actual correctness.

– That by asking questions back, it might actually be able to come up with a working answer, with the help of its collocutor.

Even after being informed multiple times that it made an error, it doesn’t seem to realize any of the above. This apparent lack of self-reflection, of addressing and working with the conversational situation — i.e., what a human would typically do — is what people usually mean by LLMs lacking an understanding of what they output.


To add to this list of things LLMs don't understand:

- Numbers (LLMs still can't do math and the ones that "can" are very brute forced and don't generalize)

- Commutative properties (The (idk why anyone was surprised) "reversal curse": https://arxiv.org/abs/2309.12288)

- Associative properties

- World models (i.e. can understand some basic physics and causality. Not mathematically, but the same way a child understands that something released from their hand falls to the ground rather than up)

- How to say "I don't know"

- Any form of meaningful generalization (GPT 3.5 and most LLMs still have difficulties with "Which weighs more, a pound of feathers or a kilogram of bricks". GPT4 gets it, but I'm not convinced it isn't because I've asked it that question too many times and they specifically trained on it.)

And a whole load of other things, basically all of which are quite relevant to the collective interpretation of "understanding." They aren't hard to tease out either. You ever have a problem that isn't easy to google because the search terms are too close to a different problem? The LLM will give you very similar results, even if you probe with corrections and specify that your problem is different. Like layer8, many many times I've corrected (many different) LLMs and could not tease out the correct answer, because the correct answer has too low a likelihood and the RLHF has often suppressed the distribution in the region you'd need to reach it. That is exactly what RLHF does, on purpose.


> Numbers (LLMs still can't do math and the ones that "can" are very brute forced and don't generalize)

These seem to generalize.

https://arxiv.org/abs/2211.09066

https://arxiv.org/abs/2308.00304

> Commutative properties (The (idk why anyone was surprised) "reversal curse": https://arxiv.org/abs/2309.12288)

This was a retrieval-from-training issue, not an inference one. And make no mistake, people will fail these sorts of questions too (perhaps not to the same extent). Very often someone will learn a fact in a certain way and fail to recall it when asked in a reverse or roundabout way.

Commutative properties are not something that occurs in the vast majority of text or language, so it's not a failure of "generalization" that training doesn't treat them as such. Since they can answer such questions just fine during inference, it makes little sense to say an LLM doesn't understand commutative properties.

> How to say "I don't know"

There's quite a lot of indication that the model can distinguish hallucinations internally. It just has no incentive to communicate this. Don't blame the LLM here. Blame the very human data that doesn't encourage this.

GPT-4 logits calibration pre RLHF - https://imgur.com/a/3gYel9r

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975

Teaching Models to Express Their Uncertainty in Words - https://arxiv.org/abs/2205.14334

Language Models (Mostly) Know What They Know - https://arxiv.org/abs/2207.05221

I'm not trying to say that LLMs are perfect. And as long as people understand that saying an LLM doesn't understand X doesn't mean the LLM doesn't understand anything at all, I'll happily state weaknesses.


> These seem to generalize.

I'm not convinced. Zhou et al. are explicitly teaching the model to decompose longer-chain addition into a windowed problem. Which, yeah, is good and is how we humans do it. But it is limited; see Figure 10. It is a pretty heavy prompt. Figure 3 shows that the method is not very robust (i.e., not generalizing): changing symbols really hurts performance, and subtraction and multiplication are not great. Even 90% accuracy does not suggest great generalization when we're talking about a pretty simple algorithm, especially one that is natural to computers. Chen et al. are doing a better job, but their prompting still shows a lack of generalization and appears to depend on how much the model was originally tuned for mathematical tasks in the first place, with GPT having explicitly been tuned for this (if we are to believe OpenAI's claims of working on making the model better at math). It is convincingly a better method, just not convincingly generalized.

I'll note that an important aspect of generalization is not just extending out of distribution but also: not requiring fancy prompts at inference, working with arbitrary symbols given a few-shot example (<5 examples -- not batches -- it should be able to handle), and not significantly affecting the rest of the knowledge base. If you taught an LLM to be good only at math and nothing else, I would be impressed, but I also wouldn't call it a generalized model.

I'm not an NLP person so I don't know the nuances of the specific datasets used here, but I'd also be careful with evaluation, especially with GPT. Datasets like these are incredibly difficult to generate in ways that do not land training data in the test set. As an example, I'll point to our classic HumanEval dataset, which is often used for testing code performance. A paper with 60 authors thought that simply writing code by hand would result in unspoiled data. But they chose code problems similar in style to interview/leetcode questions. Guess what: you can search GitHub for similar strings and find them (pre-cutoff). You can either explore yourself or search my chat history if you'd like specific examples. Clearly such an evaluation is not a great one.

> This was a retrieval from training issue not an inference one.

Actually, I disagree. We (likely) know why this happens -- because the autoregressive nature biases a model towards a specific sequence direction -- but that doesn't mean it isn't a generalization issue. And I very much disagree that humans fall prone to these same issues; you keep making this claim with very little evidence. An explicit example from that work is the LLM correctly answering "Who is Tom Cruise's mother?" (Mary Lee Pfeiffer) but not correctly answering "Who is Mary Lee Pfeiffer's son?" Humans handle this difference trivially. The catch is that this specific fact is probably not known to most people, so the latter question is more likely to confuse a person - but that is a different failure.

And we need to also recognize that humans are explicitly evaluated this way. Learning history, in a textbook or lecture you'll be presented information like "On December 7, 1941 the Japanese attacked Pearl Harbor." But then on a test you'll be asked "What day did the Japanese attack Pearl Harbor?" That's specifically a reversal. When was X? Why is the day X important? These are explicit ways that people test other people, from a very early age, so I disagree that it is a highly common occurrence for people to fall prone to this. I'm sure you'll find examples, but I'm willing to bet that those examples have another factor (you also won't be able to test the examples I gave on LLMs because they will have specifically seen both directions __because__ we do this kind of testing).

> It just has no incentive to communicate this. Don't blame the LLM here. Blame the very human data that doesn't encourage this.

You're right that we can probably do a better job of training the models to respond this way. But they naturally don't, and that is a clear demonstration of not understanding. I'd equally claim that a person doesn't understand something if they rambled off bullshit, trying to talk their way into a solution or post hoc explain why their wrong answer is correct. Either way, it doesn't challenge the claim that this demonstrates a lack of understanding - especially when you see these ramblings fail to be self-consistent.

> Links

I'm not sure what you're showing here. They kind of support my explicit point about RLHF messing with the distribution. I'll add that this is a crazy hard problem to even evaluate, because what is being shown as metrics is extremely aggregated. Aggregation is the bane of evaluation, but so is dimensionality.

> I'm not trying to say that LLMs are perfect. and as long as people understand that saying an LLM doesn't understand x doesn't mean an LLM doesn't understand anything at all, i'll happily state weaknesses.

Just to be clear, no one has claimed that an LLM has zero understanding. Those of us critiquing the understanding claim are pretty well aligned with the idea that the model has some understanding. Of course it does; that's what fitting a function is. But we're challenging the claim of understanding in the generalized context, which is what you've been pointing towards. Things like an LLM having a world model. Things like an LLM understanding how addition works (compared to understanding how addition works for 2-6 digit numbers). These things are quite different, and I've been trying to be very clear that I'm not claiming LLMs are a load of bullshit. I've explicitly stated as much in other comments, but the prior comment wasn't as necessary given the context. (Not that this statement should be necessary in the first place. It is only a result of overhype that people binarize critique. I really don't want to see more of this religious nature around ML; it is harmful to our community.)


>> I've repeatedly seen people make much more narrow claims about what LLMs get wrong or what their biases are, where simply regenerating will give a dramatically different answer.

It will, but what's the point of a random answer generator? What's useful is a system that gives correct answers, and can discriminate correct answers from incorrect answers. Otherwise, you can only know that an answer is correct when you already know it, and that's not very useful at all.


I think usefulness is a different question than capabilities. "LLMs can't do X" is an interesting claim, and it takes some level of systematic work to make it. If the capability exists at all, this is interesting, even if it's not useful. That being said, it doesn't mean such an existing capability will not prove useful. If something is fallible, it might be above a threshold worth using, it might be improved by doing the same techniques that are already being used, or it might be improved by adding on some sort of system to filter or refine responses.

> you can only know that an answer is correct when you already know it, and that's not very useful at all.

This statement makes sense on the surface, but I don't think it's true. All the time I get told things that I don't know the answer to beforehand, but I can judge pretty well whether they're right or not.


Assuming “correctness” as being a hard criterion will greatly limit your ability to get powerful and very useful output from LLMs.

A person doing some knowledge work will fail on this criterion. But they still are employable and generally regarded as being useful.


Typically, when people say it lacks real understanding, they seem to mean it lacks an understanding of any object’s semantic meaning. In the above example, the person asked for an example of X API solving problem Y, and X lacked the functionality needed to solve Y. If the AI understood the formal specifications of the API, it would be able to know that functionality doesn’t exist and reply as such. However, the AI is just a next token predictor here, it only knows certain words tend to follow others. Thus if you never told it that X API can’t do Y in training, it’s unlikely to reply as such. Instead it might give you a fictional function call in the style of X API to solve your query.


An LLM can explain a concept but is unable to apply said concept to a given problem - for me that shows a lack of "real understanding".


Let me give an example[0]

> Before LLMs, we had a very crisp test for having, or not having a world model: The former could answer certain questions that the latter could not, as shown in #Bookofwhy. LLMs made it harder to test, for the latter could fake having a model by simply citing texts from authors who had world models, see https://ucla.in/3L91Yvt The question before us is: Should we care? Or, can LLMs fake having a world model so well that it wouldn't show up in performance? If not, we need a new mini-Turing test to distinguish having vs. not-having a world model.

The thing here is that LLMs are trained on most of the internet. Some people are surprised by some results but don't seem to know what kind of content is on the internet or in the training set (to be fair, we don't know what all the training sets are). There are lots of people who believe LLMs have world models (there have even been papers written about this!), but it's actually pretty likely that the LLMs were trained on similar examples, and this also helps explain why it is easy to break the world model (it isn't a world model if it is very brittle). We can see similar things with tests like the LSAT and GRE subject tests. Well, guess what: there are whole subreddits and stack exchanges dedicated to these. Reddit is in most of these datasets, and if you're testing on training data the test is spoiled (there's a lot of spoiling that happens these days, but like Judea said, does it matter?)

The problem with hyping LLMs/ML/AI up too much is that we can no longer discuss how to improve the system. If you are convinced that they have everything solved, then there's nothing to do. But no system is perfectly solved. Never confuse someone criticizing a tool with someone saying a tool is useless. I'm pretty critical of LLMs and am happy to talk about their limitations. That doesn't mean I'm not also wildly impressed and don't use them frequently. There's too much reaction to criticism as if people are throwing the thing in the dumpster rather than just discussing limitations.

FWIW, I wouldn't change my mind if DM's test showed the opposite result. You can check my comment history. I'd probably dig in and comment on why DM's tests were bullshit. I've even made comments about how chain of thought is frequently a type of spoiling. The reason is not that I don't think we can create AGI (I very much think we can), but that I have deep experience with these models and ML in general, and nothing in my experience and understanding leads me to believe even half the things people claim. You'll see me make many comments ranting about the difficulties of metrics and about the absurd evaluation practice (not just in ML) of using a proxy test set and a proxy metric and saying performance on them is enough to claim a model is better. It is a ridiculous notion.

[0] https://twitter.com/yudapearl/status/1710038543912104050


You have it backwards - there might be some similarities between LLMs and humans. I don't think any such almost certainly superficial similarity is meaningful in any way.


> LLMs have no real understanding

A large language model trained on Othello games will actually build a "board representation" inside its neural network: https://thegradient.pub/othello/. This was confirmed by editing the board representation inside the neural net, "tricking" it into "believing" the board was in a different state. At this point, the LLM started making legal moves based on the edited position. If the LLM is actually building a data structure describing the board state, then it's hard to claim it's just acting as a parrot.
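
Mechanically, the probing half of that experiment looks roughly like the sketch below (the file names, shapes, label encoding, and purely linear probe are my assumptions; the original work trained its own probes, including non-linear ones):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical pre-extracted data:
    #   acts   -- (n_positions, d_model) hidden activations of the
    #             move-sequence model at each point in each game
    #   labels -- (n_positions,) true state of one board square at that
    #             point, e.g. 0 = empty, 1 = black piece, 2 = white piece
    acts = np.load("othello_activations.npy")
    labels = np.load("square_d4_labels.npy")

    probe = LogisticRegression(max_iter=1000)
    probe.fit(acts[:8000], labels[:8000])
    print("held-out probe accuracy:", probe.score(acts[8000:], labels[8000:]))

    # High accuracy from a probe this simple is the evidence that board
    # state is (close to) readable from the activations. The intervention
    # experiment then edits activations in the direction the probe found
    # and checks that the model's next-move predictions change to match
    # the edited board, which is a much stronger result than the probe alone.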

The raw version of GPT apparently achieves an Elo score of 1800 at chess: https://twitter.com/GrantSlatton/status/1703913578036904431. This is worse than a good dedicated chess engine, but almost certainly better than I could do!

And every day, I see Copilot occasionally perform non-trivial completions that require "in-context learning" of complicated things that exist only in my source code. Sure, it crashes and burns regularly, too, but I've seen a lot of cases where it could only have figured out the code it did by carefully combining information available in the file.

So I believe that LLMs can actually achieve surprisingly large amounts of understanding, particularly ChatGPT 4 and (on a good day) CoPilot. But the frustrating thing is that they're inconsistent, and the current RLHF process apparently impairs their ability to correctly estimate accuracy. When they make mistakes, they're effectively "blind" to those mistakes, to a shocking degree.


>> The raw version of GPT apparently achieves an Elo score of 1800 at chess: https://twitter.com/GrantSlatton/status/1703913578036904431. This is worse than a good dedicated chess engine, but almost certainly better than I could do!

Probably because it was trained to get better at it.

https://twitter.com/GrantSlatton/status/1706545268009931175

There's already been plenty of discussion on LLMs' ability to play chess and it makes sense that OpenAI would try to make their models better at it.


I’d point out that the LLMs are still incapable of using these internal representations in novel ways. A human who has internalized the rules of chess can learn and play variants like duck chess easily. GPT-4, despite being somewhat proficient at chess, falls apart when asked to play duck chess, constantly making illegal or illogical moves.


Uhh, not really. I assure you, an LLM that can play chess will converge in training on duck chess much faster than one that can't.


LLMs are failing on tasks as simple as this. You can call it a trick question, but what the hey.

"prove that there are no non negative numbers less than 3"

Claude puts out this elaborate BS - what's funny is the pretension.

----

This is trivial to prove by contradiction:

Assume there exists a non-negative number n < 3. Then either:

n = 0. But 0 is not less than 3, so this is a contradiction.

0 < n < 3. But the only numbers between 0 and 3 are 1 and 2, and neither of those are less than 3, so this is also a contradiction.

Therefore, our assumption that such an n exists must be false. By contradiction, there are no non-negative numbers less than 3.

In formal logical terms:

Assume ∃n ∈ R, n ≥ 0 and n < 3 (There exists some non-negative number n less than 3). Then either:

a) n = 0. But 0 ≥ 3 is false, so this is a contradiction.

b) 0 < n < 3. The only numbers in (0,3) are 1 and 2, neither of which is less than 3, so this is also a contradiction.

Therefore the assumption in (1) must be false. By contradiction, the statement ∀n ∈ R, n ≥ 0 → n ≥ 3 (For all non-negative numbers n, n is greater than or equal to 3) must be true. So the proof by contradiction shows that there are no non-negative numbers less than 3.

----
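
For the record, the statement it was asked to prove is simply false, and a one-line counterexample (sketched here in LaTeX) settles it:

    % Disproof by counterexample: 0 is non-negative and less than 3.
    \[
      0 \ge 0 \ \wedge\ 0 < 3
      \;\Longrightarrow\;
      \exists\, n \in \mathbb{R},\ n \ge 0 \wedge n < 3.
    \]

So the only honest answer is to reject the premise, which none of the output above does.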

boosters can hype all you want, but the damn thing can't compute. and please don't say - "but most humans will fail at this too"

meanwhile...: claude-coders working furiously to add this test case - lol.


I swear, these kinds of gotchas are the most stupid, because they are easily resolved by just talking to the current state of the art.

https://chat.openai.com/share/93324013-b333-44ef-b8c8-5460cb...


ok, here you go. another slight tweak to a standard puzzle and it falls.

btw, i am very impressed with how well ChatGPT does on this puzzle, thanks to the elaborate tuning that seems to have been done during training.

--- A farmer has a wolf, a goat, and some cabbage - and he must cross a river by boat with his items. The boat can carry only the farmer and a single item. If left unattended together, wolf would eat the cabbage, or the wolf would eat the goat or the goat would eat the cabbage. How can they cross the river without anything being eaten? ---

To ensure that nothing gets eaten during the river crossing, the farmer can follow these steps:

Step 1: The farmer takes the goat across the river and leaves it on the other side.
  On the other side: Farmer | Goat
  On the original side: Wolf | Cabbage

Step 2: The farmer returns alone to the original side.
  On the other side: Goat
  On the original side: Farmer | Wolf | Cabbage

Step 3: The farmer takes the cabbage across the river.
  On the other side: Farmer | Cabbage
  On the original side: Wolf | Goat

Step 4: The farmer leaves the cabbage on the other side and takes the wolf back across the river.
  On the other side: Cabbage
  On the original side: Farmer | Wolf | Goat

Step 5: The farmer takes the wolf across the river.
  On the other side: Wolf | Cabbage
  On the original side: Farmer | Goat

Step 6: The farmer returns alone to the original side.
  On the other side: Wolf
  On the original side: Farmer | Goat | Cabbage

Now, all three items (wolf, goat, cabbage) and the farmer have safely crossed the river without anything being eaten.

---


GPT-4 understands that it's an odd variation of the classic puzzle, and that it's unsolvable. (This conversation is my first attempt.) https://chat.openai.com/share/5fe4f781-ea18-41ae-8ff9-349a31...


Will AI skeptics ever get tired of being BTFO by stochastic parrots?

Gary Marcus' Twitter timeline suggests not.


the input is a data structure describing board state.

I don't think uncovering the internal representation of board state, after providing the board state, implies any kind of phenomenal understanding - it just implies the software may accept input.

side-channel editing of the internal state, and observation of the resulting effect, does not imply understanding. you can do the same thing to any traditional game algorithm.


As I understand it, the Othello LLM was:

1. Trained using written sequences of moves from real world games, in text notation.

2. Prompted using written series of moves.

So it has supposedly never "seen" an Othello board state at all, if I understand the experiment. It deduced the rules of Othello and "learned" a neural network containing a board representation.

It's entirely possible that I misunderstood the paper, though? Why did you conclude that they were training and/or prompting using board state? (I should go re-read the paper; it's been a while.)


No you're right.

Clearly the information of the board state is implicitly present in sequences of moves (humans can play blindfold chess too, albeit worse), but reconstructing it from that alone is highly non-trivial. I don't think a human who had never seen or played the game would be able to do it.


Agreed. But Othello syntax happens to map perfectly to possible Othello moves, so the most efficient representation of the information contained within Othello happens to be a data structure which maps to an Othello board. Other information like visuals on the players would, from the point of view of evaluating the next Othello move, just be noise.

Human language doesn't map perfectly or even particularly closely to the physical world we inhabit or emotions we experience though, so a maximally efficient model of human language will overfit to useless semantic features whilst lacking context on what foo that follows baz is. Understanding the physical or emotional world solely from human language is more like trying to use the Othello LLM's state representation to establish the colour of the board


The sense data your brain ingests does not map perfectly or even particularly closely to the real physical world we inhabit.

Whether it’s through language, vision (which feels quite “real” but is really just a 2D projection of light that we interpret), sound or anything else, it’s all just some byproduct of the world that we nonetheless can make useful predictions with.

There is enough information to build "a" world model in all the text ever written by humans about the world. Not necessarily the "one true model" that is your own personal life experience of the world since birth.

>Understanding the physical or emotional world solely from human language is more like trying to use the Othello LLM's state representation to establish the colour of the board

Nobody truly understands the physical world. Don't you think the birds that can feel the electromagnetic fields around the earth and use them to guide their travels would tell you your model was fundamentally incorrect?

Certainly, LLMs are more limited in their input data, but it's not a fundamental difference. and adding more modalities is trivial.


> The sense data your brain ingests does not map perfectly or even particularly closely to the real physical world we inhabit.

I never argued otherwise, though being aware that there is a physical world that I can interact with helps! The point is that the only reason the LLM's transformation of syntax approximated an Othello board was the unusually perfect correspondence between permutations, syntax and efficient storage that seldom exists. In other circumstances your LLM vectors are modelling language constructs, lies and other abstractions that only incidentally touch on world and brain state.

The term "understanding" is generally used by humans to refer to how humans model reality[1] and need not imply completeness. But it also implies that a model isn't extremely dissimilar to humans in what its parsed and how its parsed it. Or to slightly alter your example, if a bat argued that following the bat swarm well enough to locate the exit didn't mean humans had achieved "true echolocation", I'd have to agree with them.

I mean, a photograph and a pocket calculator are also representations of some aspect of the state of the world, sometimes even representing a particular subset of the world information in sufficient fidelity to allow humans to make much better predictions about it. But fewer people seem to wish to stan for the capacity of the calculator or the bitmap to have "real understanding" of the outputs they emit, even though fundamentally the LLM has much in common with them and far less in common with the human...

[1]the potential for debate around such definitions underlines the paucity of language...


I think you underestimate humans.


As the other commenter said, this is incorrect. The input was a sequence of legal moves (not even "real" moves - most of the training data was synthetically generated with "generate legal moves" as the only constraint).

Deducing board state from this is extremely non-trivial.


The reason this is notable is that people keep anthropomorphizing text generators into what they perceive as something only a small step away from AGI.

Of course they have no understanding. They’re text generators.

This persistent response to LLMs is bigger news, to me, than the fact that LLMs can’t think.


> This persistent response to LLMs is bigger news, to me, than the fact that LLMs can’t think.

I can testify that people will be manipulated by sleight of hand (the physical act) even after they have been alerted to it. Now transfer that weakness to a domain where most people have zero equipment and training to form a sound judgement.

It's a big and complicated world, and our brains are wonderful but limited. We have to rely on trust for the vast majority of information we digest.

When organized coalitions of individuals subvert the channels of trust we stand exposed as idiots.

The principal characteristic of the digital era is, alas, not harnessing this amazing technology for societal benefit but betraying trust and violating unspoken contracts - for profit.


It's because up until a few years ago, everything capable of communicating at that level was a (non-artificial) GI. It's a bit... surreal? to see text and speech without intelligence and understanding behind it.


What is intelligence, honestly, other than something we keep claiming only humans have?


You're making this comment in a way that devalues human intelligence but I actually think not having a great understanding of our own cognition is indicative of something deeper going on physically in our heads that isn't modeled by the extremely simple "neurons" in LLMs. I've seen several people here talk about "the weights in our heads" and I don't think we have enough information about the brain to make that analogy.


Did you know we also don’t know how LLMs work, and neither do they…


What is the feeling of sunshine on your face? What is love? I think intelligence, as a prerequisite, requires consciousness. It's the capability for the "aha!" moment where, for a brief moment, all is quiet, and the student and the concept are one.


This is ridiculous. Sorry. Intelligence is the ability to reason and communicate to another creature and for them to understand and cooperate or betray you.

This can do that.


A basic diplomacy AI is also intelligent according to your definition, so why do you think ChatGPT is different than what we had before then?


Yea we’ve had simple intelligence for a long time. What’s so controversial about that?


If you compare "thinking" to "text generators", don't you need to understand what "thinking" actually is, before you can conclude that these are unequal?


No more than I need to understand every dynamic of walking to know that my goldfish doesn’t do it.

For a lot of us, LLM behavior and thinking are so plainly and obviously dissimilar that the endless comparisons look somewhere between naive and manipulative, depending on who’s making them.


> For a lot of us, LLM behavior and thinking are so plainly and obviously dissimilar

But what's obvious to you is not obvious to everyone.

Even though you don't have to mathematically prove a goldfish doesn't walk for me to believe you, that's because we can both agree on a very good definition of walking. If it came down to it, I am sure we could sit down with pen and paper and some textbooks and agree upon a physical definition as robust as any. We're just skipping that step because it's been done before by others.

I am positive that we can't agree on a robust definition of thinking, because the definition of thinking is beyond human understanding.

We have to keep an open mind. I don't believe it's likely that any LLMs are experiencing anything we'd call thought, but without knowing how thought works, it would be foolish of me to say it's impossible. The problem in AI discussions is not the positions people are taking but the certainty with which they are taking them.


>> I am positive that we can't agree on a robust definition of thinking, because the definition of thinking is beyond human understanding.

In that case, shouldn't the people who say LLMs can think do the hard work of explaining what "thinking" means and why they er think it's done by LLMs?

Surely the default position should be that LLMs don't think because they don't belong to the class of entities that we know can think. If that default assumption is wrong, well, then, someone has to do the hard work of rejecting it. But just claiming that we think they think because who knows what thinking is, is just an excuse to not do the work.


You've totally missed the point if you're still reaching to assume one position or the other.

We can be sure humans think. Does a crow think? We don't know. Does an LLM think? We don't know.

We can talk about more specific phenomena in contexts that demand it, if we have the data. But there is no way to say right now that an LLM does or does not think. All we can say is it seems unlikely.


I would take the opposite opinion - we have to maintain a _critical_ mind. Otherwise, we're open to being taken advantage of by any smooth-talking charlatan (or LLM, but that's largely the same thing) that comes along.

Constantly moving the goal posts with this incessant "but what IS intelligence, though?!?" gets us nowhere. By your own argument we can never define intelligence, so we can never actually discuss it. LLMs consistently and constantly fall down on tasks that we would not expect a human to fail at, but you and people like you insist we cannot use this as evidence because you refuse to accept any defined terms.

I mean, remain as hopeful and uncritical as you want. But you're basically asking the rest of us "Who are you gonna believe, me or your lyin' eyes?" and that's gonna go about as well for you as it typically does.


No, I don't see any of this as a response to my positions at all. I must not have expressed myself clearly enough.

I'm not talking about getting scammed by SBF or whatever. I'm saying that there doesn't exist a definition of "thinking" that we can use to include or exclude LLMs from the group "beings that think". It's counterproductive to say they do or don't and is a distraction from useful conversation.

To say I'm not being critical when I clearly said I don't think it's likely that LLMs think feels disingenuous and, frankly, uncritical.

> By your own argument we can never define intelligence

In no way does that extend from my argument. Intelligence and thinking are separate phenomena, and I didn't say "never".


The AI camp has economic incentives to inflate what their systems can do. I'd bet we'd get a different reaction from neuroscientists.


I'm specifically referring to the inability of current neuroscience to explain what "thinking" is.


The difference is that your goldfish doesn't look like it's walking. I'm sure my door doesn't think. I'm not sure the LLM doesn't.


> I'm not sure the LLM doesn't.

If you turn off the power to the computer hosting the LLM, you can also be sure it doesn't think.


It may be "plainly and obviously" different for you, but for me? I seem to have two components to my thinking process, one of which has ideas without needing words, another of which turns those ideas into words and gets (for lack of a better description) "annoyed" if I try to skip the word-generation part.

There's no reason to presume we all think the same way; rather the opposite, given how many ways humans are already known to think unlike one another.


Just to throw the other side's perspective into the ring: the ones reducing the observed phenomena to "obviously just X" can seem equally emotional and defensive of some metaphysical person-ness.


Yeah. Now, imagine the hubris required to think we've invented something we functionally cannot define.


Isn't scientific advancement inventing things we can't fully define? Each working atop the other to define more?


That's pretty easy to do. We don't know how Tylenol works.


Yes we do, it's a cyclooxygenase inhibitor. E.g. from the abstract of [1]:

"The mode of action of paracetamol has been uncertain, but it is now generally accepted that it inhibits COX-1 and COX-2 through metabolism by the peroxidase function of these isoenzymes."

Furthermore, the active ingredient in Tylenol is a very simple molecule from a synthetic perspective, and it was developed while trying to overcome the toxicity of an even simpler molecule called acetanilide. Even though it was developed 150 years ago, there was a systematic understanding of organic chemistry and the interaction with the human body already at that time.

[1] https://link.springer.com/article/10.1007/s10787-013-0172-x


I mean, there's plenty of other meds whose exact precise functioning we don't know, just what results and side effects it has.


When tylenol was invented, did the inventors have to precisely specify the molecular interactions between the compound and the human body in the same way a programmer has to precisely specify a program?

No? Maybe the analogy you are suggesting isn’t the most accurate.


Humans invent things they don't fully understand basically always.


No. We know exactly what "text generators" do and we also know that is not what the brain does. That can be true even if you don't fully understand what the brain does.

An easy way to prove the two are different is to ask: 'What is the equivalent of the endocrine system for an LLM, and why doesn't it get tired like a mind?'

The logical problems that a lot of people run into are twofold:

1) At a very basic level, we modeled these architectures on brains (and named them after our brains), so at a very very basic metaphorical level, you can say 'these things are acting like our brains act.' This ignores the complexity of the brain.

2) We like to think that inconveniences of our minds (like the need for sleep) are "problems" that we can engineer away, rather than intrinsic components of the process. People won't like hearing this, but there's no formal reason why 'the need to sleep' is not a necessary criterion for 'being conscious,' because as far as we know, the only conscious beings out there also sleep. It's just human nature to assume we can engineer the 'good' parts of a biological system while avoiding the 'bad' while still essentially replicating the system, but that's usually not the case.

[0]https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2525690/


> What is the equivalent to the endocrine system for an LLM

Are you suggesting that the brain is the only implementation that can “think”? Or that an endocrine system is required?

The difference isn’t the interesting, or necessarily useful, part; the practical similarities of the output are. If we can make a system that appears to "think", I can’t understand how it can be so easily dismissed when we don’t know what’s going on in it or us.


I do think nervous systems (the brain being the most complex example we know of) are the only things that do what we call thinking.

What else does? (Please do not mention any deterministic counting machines like semiconductors - neurons are not deterministic, and thought isn't a deterministic set of calculations.)

I would argue that the difference in process is extremely important, as evidenced by how easily we are fooled by optical illusions.

Just because we think two things appear the same does not mean they are, and understanding why they appear the same but are different relies on a study of the "how".

To put your argument differently - “if we can make drawings that appear to be three dimensional to our eyes, why bother understanding the difference between 2-D and 3-D space?”


> please do not mention any deterministic counting machines like semiconductors - neurons are not deterministic and thought isn’t a deterministic set of calculations

I have to mention it, because it's physics, and related to the current implementation of LLMs: random numbers are possible, and are used to break determinism. Intel CPUs use thermal noise to generate random numbers [1]. With silicon, randomness is a free choice, not an impossibility. LLM front ends, like anything from OpenAI, use random numbers to get non-deterministic output; each sampled word becomes part of the input for the next word and of the context of the next response, resulting in output that's not deterministic, with broad divergence [2]. Both systems are somewhat bound by the "sensibility"/logic of the output, of course.
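
To make the "random numbers break determinism" point concrete, here's a minimal sketch of temperature sampling, roughly the mechanism the OpenAI temperature knob exposes. The logits are made up and the code is illustrative only, not any real model:

    import numpy as np

    rng = np.random.default_rng()

    def sample_next_token(logits, temperature=0.8):
        # Lower temperature -> sharper distribution (closer to always picking the top token),
        # higher temperature -> flatter distribution (more randomness).
        scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
        probs = np.exp(scaled - scaled.max())   # numerically stable softmax
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)  # draw one token index at random

    toy_logits = [2.0, 1.5, 0.3, -1.0]          # made-up scores for a 4-token vocabulary
    print([sample_next_token(toy_logits) for _ in range(10)])  # repeated runs diverge

At temperature near zero this collapses to always taking the argmax, which is the deterministic mode being contrasted here.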

> as evidenced by how easily we are fooled by optical illusions

This isn't necessarily unique to humans [3]. Do you have a specific illusion in mind? Many are related to active "baselining" and other time-response effects that happen in our eyes, with others being an incorrect resolution of real ambiguity from a two-sensor system, which any sensor system will struggle with.

> Just because we think two things appear the same does not mean they are

I don't think anyone is suggesting they're the same, but I see many people suggesting, with seemingly undue confidence, that they're completely unrelated, which would require an understanding of either system that we don't have.

> To put your argument differently - ...

My argument is: if it's 3D to your eyes, then that understanding of the relation between 2D and 3D space already exists in the system, to some degree.

[1] https://www.intel.com/content/www/us/en/developer/articles/g...

[2] https://www.coltsteele.com/tips/understanding-openai-s-tempe...

[3] https://blog.frontiersin.org/2018/04/26/artificial-intellige...


I know that there are various strategies for picking pseudo random numbers, but what I mean by determinism is that as far as we can tell, the activation trigger for a neuron is highly dependent on outside signals like hormones. That means that a biological “dot product” will be different if you are tired vs. well rested.

A dot product run on an Intel CPU is intended to always be the same no matter what. The heat-signature stuff isn't changing the way circuits do math.

As to optical illusions, the point I was making was that "the way humans practically perceive something" is not a sufficient way to measure the similarity of two things, since our perceptions are so often tricked (LLMs are literally trained specifically to trick you into thinking you're talking to sentience).

I also don’t think they are completely unrelated at all. As I said I know we designed one to be like the other. It’s just by no means “the same.” AI may one day do convergent evolution toward how brains think, but it’s still important to recognize that’s convergent evolution.


I see all of this as implementation details of a single biological system that exhibits "thought", not a definition of "thought".

> It’s just by no means “the same.”

The implementation is not the same. Everyone agrees with that. The concept being compared is "thought", not "implementation of thought". Maybe I'm lost.


I'm just saying that "thinking" is a well-defined class of related things that only animals do.

I would argue expanding the definition to include the kind of things that semiconductors do really dilutes the meaning of "thinking" to be near meaningless.

Maybe to put it succinctly - no computer has ever done anything without a human input (even if it's millions of layers abstracted), but thinking just happens spontaneously.

If that's not a sufficient condition to differentiate 'thinking' from 'calculating,' then IDK what 'thinking' even means then.


The endocrine system is a tool to manage/regulate the operation of a physical object (the human body). LLMs do not have a body or other physical object to manage, so they do not need endocrine systems.


The endocrine system has a huge impact on your brain and thought process.


We know exactly what an LLM does? How does that differ from a brain?


One does deterministic matrix math and the other doesn’t.

It’s a common experience that candy tasted sweeter to you as a child, right?

How does that work in a transformer?


Generally, people on HN don't let that small detail get in the way of making strong assertions on these matters.

These threads are always massive, embarrassing trainwrecks of naive takes on cognition and consciousness.

It's like HN just woke up and refuses to acknowledge the relevance of the extensive prior literature on these subjects. Instead, people seem mostly fine with making shit up.

Embarrassing.


> Embarrassing.

It's fine. We're social animals, we discuss topics even if we lack perfect expertise, at our present level of understanding. I've certainly had, for example, many opinions on programming that were later superseded by deeper understanding or evolving priorities.

It's just important to remember there's probably a bigger and better expert out there, and stay humble and interested, and let yourself be corrected.


Yes, true, however, what I find egregious in this case is the comparative lack of standards.

Among the other technical and scientific topics often discussed here, there is a basic general expectation of grounded reference to existing research and bodies of knowledge. Deviations from this standard are usually met with corrections.

With the topic of cognition and consciousness, this is not so (generally). The standard is to absolutely not reference existing research and to basically pretend it doesn't exist.

Instead, the vacuum of ignorance is filled with whatever comes to mind and threads engage in freeform speculation.

This is also ironic since these topics are usually about how LLMs don't understand and just fill the vacuum of its ignorance with whatever happens to be next on the statistical chain, regardless of veracity.


I think it's because a lot of people presume the topic at hand to be relatively novel, without a significant body of work behind it. And to some extent this is because they have been told this is the case - even by experts or their reporters.

With physics, for example, people have an understanding that there's a large existing body of work, and that physics has been remarkably successful at building on earlier knowledge for decades without having to debunk a lot of earlier results. It's a familiar story.

With AI, however, the popular narrative is that earlier approaches to the problem were all laughing stocks that went nowhere. There's not a good sense for how old some of the ideas that work now actually are, or how many of the older theoretical and thought frameworks are still perfectly valid, in particular at the interfaces toward other branches of science and engineering. There are some more nuanced and balanced primers on the topic (e.g. the Wooldridge book), but an awful lot more writing on AI Winters and whatnot.

This snowballs into a lack of awareness of the overlap between "AI" and, say, the body of knowledge in statistics.

In other words: Sure, I agree - HN collectively is less knowledgeable on this than on some other topics (myself included!), and it's worth taking stock of that, I suppose.


Hang on. I've watched these trainwrecks for a while. The people who usually don't let details get in their way are the ones who claim that LLMs think even though these same people can't explain what thinking means.

The onus [1] should go the other way.

____________

[1] Today's conversation-derailing trivium: "onus" echoes the Greek "onos", which means ass. As in donkey.

https://en.wikipedia.org/wiki/Onager


I'm happy to be on the agnostic front.

A colleague at my first job was working on some software that wrapped around other software, and he dubbed it "Burrito". When his project turned out not to be as useful as expected, he was pleasantly surprised that he could also use the original meaning of the word burrito, being "little donkey" in Spanish [1].

[1] https://en.wikipedia.org/wiki/Burrito


:)


It happens on both sides in equal measure.


> Generally, people in HN

These silly generalisations about HN are pointless. HN has a pretty diverse set of worldviews, approaches, and levels of expertise across a range of topics. There is zero point in lumping people into a single, or even a majority, behaviour.


Yeah, disagree. I read most threads on these topics, and although I can't characterize people that don't post, I can certainly attest to the generality I'm pointing out. I haven't seen any discussions that defy it. They may exist, but they are atypical.

The idea that generalities can't be true or useful is pretty insidious, since it prevents reasoning about groups.


> The idea that generalities can't be true or useful is pretty insidious

This idea was not mentioned, whether or not you attest that it was.


It is unfortunate that HN is not very well suited for long-running discussions. A topic mostly dies out after about a day or less. Early reactions seem to score more karma, and that probably spurs people into writing hasty replies. To most, it is not worth it to perform a couple of hours of research before writing up a good reply.

This means we have to be lucky to have some expert with current knowledge join in. Given that most people here are billionaires with a passion for dynamic typing, and not philosophers of mind, I am actually fairly surprised at the reasonable level of discussion.

Also, I doubt that referencing prior literature in the context of cognition and consciousness would be of much help. HN writes a lot of nonsense about it, but so have many professional philosophers. After sifting through all of that, one still has to add a lot of convincing arguments to advance an unpopular opinion.

A friend once suggested to build forum software that would allow discussions to reach a certain level of trustworthiness or truth. After several lengthy discussions I still doubt that such a thing is feasible. But it might spark your interest? :)


Can you name some prior work that you think should be required reading here? I honestly don't think there's much if anything in the philosophy of consciousness that would be worth it. But maybe in cognition, not very familiar with it.


And generally, people like yourself make opposite claims with equally no backing.

What does it mean to understand something? To think? How is a human any different than predicting the next “thing” given a series of inputs?


I didn't make any claim other than that threads on this topic systematically disregard existing knowledge and research. And also, that I don't think it's justified.


LLMs definitely show signs of understanding, though they might not be the sharpest tools in the shed right now.


There’s a very good chance that many of these people have never used something like GPT-4, at all, or outside of treating it like google. This has been the conclusion of every other claim like that, stated with such certainty, that I’ve seen in the wild.


In this[1] piece I argue that LLMs do, in some cases, demonstrate understanding.

[1] https://www.reddit.com/r/naturalism/comments/1236vzf


Aren’t humans text generators too? What makes you think the processes are so categorically different?


This is one of those topics that Searle wants folks to think straight about.

He'd say: You are conscious. You are also a computer. You are not, as far as we can tell, conscious by virtue of being a computer. They're just two things that happen to be true about you but they are not linked in that super-direct way.

Similarly, you can think. And you can generate text. But you are not, as far as we can tell, thinking by virtue of generating text.


How can we tell? All I hear in my head is some monologue, or hypothetical dialogues with or between other persons. Sounds a lot like predicting text.


I think very few people think one word at a time, like an LLM.

A lot of people think in concepts without words at all in many scenarios. E.g. when you cook an egg, do you reason (in words) with yourself to decide what temperature to set the dial on the stove? Or do you just do it “without really thinking”?

Does your brain respond with annoying disclaimers when asking yourself medical questions? Does your brain refuse to entertain an idea if it’s questionable?


Observe your thought process while reading the following sentences.

If you're not with us you're...

If you scratch my back I'll..

A bird in the hand is worth...

What is a bird in the hand worth? How did you know that?


I'm not sure what this is intended to demonstrate.

Your response to "not all problems are solved through generating text" is "but what if the problem is specifically to generate text?" Sure? Yes, if you ask me to complete a sentence, I complete a sentence. That doesn't indicate that human brains are reducible to text generators.

If anything, what's interesting about your example is that it might actually in some ways demonstrate the opposite of your point. How many instances are there of people memorizing text or song lyrics and singing along with an artist dozens of times, and only later sitting down and thinking about what the words actually mean? For humans, it's well understood that memorization and repetition of a text is not necessarily the same thing as understanding the concepts within that text.

I feel confident I could find you some kids that know that "a bird in the hand is worth two in the bush" that have never actually thought about what that phrase is intending to convey.


I didn't ask you to complete a sentence, though. Just to read them. Then I asked a question.

I think you've made my point quite well.


I'm going to be completely honest, I still don't know what your point is.

That humans are capable of memorizing text? Is that a thing anyone was debating?


Drawing a line around the bits that aren't language generation unfairly ignores the massive apparatus that seems to be specialised (oftentimes overly so) at language generation.

Sure there's other stuff going on, but a lot of it is language and a lot of that subset seems to be quite closely approximated by these larger LLMs, warts and all.


Hold on a sec, think about what you're saying. The ability of a larger system to imitate a smaller system in some situations means they're mostly the same? Since when?

If I told you that an LLM was basically a Markov chain because there are situations where the Markov chain and LLM produce similar output and where you could reasonably argue that it's possible the LLM is working the same way that the Markov chain is working -- you would (correctly) say I was oversimplifying what's going on in GPT. It would be a terrible comparison to make. Similarly, if you say that human reasoning is the result of an LLM, and what you're actually saying is that in a subset of situations humans produce output that could theoretically be working the same way as an LLM, I'm gonna say you're oversimplifying how human brains work.

Very clearly, there is more going on in a human brain than language, evidenced by the fact that our brains are larger than our language centers and we can literally measure which parts of the human brain experience the most activity when we're tackling different tasks.

> Sure there's other stuff going on, but a lot of it is language and a lot of that subset seems to be quite closely approximated by these larger LLMs, warts and all

If you define your tests and scenarios to specifically encompass only the situations in which similar outputs are produced, then sure. But that's not a very strong argument for you to make. It's like saying chickens are basically the same as fish since both of them lay eggs. The phrase "sure there's other stuff" is doing a lot of work there, because "there's other stuff" is what we're all saying when we say that LLMs aren't just primitive humans -- LLMs are different from humans in the sense that when we reason, there's other stuff going on and the entire process is not reducible to only language generation.

----

Of course, absent from this conversation is the fact that the way our language center develops is different from how LLMs are trained and even in the parallels you're drawing, LLMs demonstrate different strengths and weaknesses from humans -- humans demonstrate reasoning capabilities faster than proficiency with language/text, LLMs demonstrate proficiency with text faster than they demonstrate reasoning capabilities. Even in situations where both LLMs and humans predict text, it's likely that we're using different strategies to do so given our differing capabilities.

Look I am not even making a claim about whether GPT can reason. Defining intelligence through a purely human lens would be unimaginative and needlessly narrow. Whether GPT actually reasons or just appears to is a different conversation. I haven't touched that conversation, all that I'm saying is, it's very obvious that whatever GPT is doing, it is different from how human brains work.


I can agree with all of that.

I was pushing back against something that might not have been there initially: an unwillingness to accommodate the idea that a fairly significant part of our intelligence is actually stored in our language and that it can seem subjectively (perhaps excluding those with no internal monologue) that we recall that knowledge in a manner which is close enough to the one we have managed to replicate in the larger LLMs.

This part of our function could well be a lazy optimization that in reality sits on top of our reasoning capabilities to save energy, but the point of the trite task with the bird in the hand was just to demonstrate that it seems to play a fairly significant role. I'm out of my depth entirely with respect to the actual form and function of the brain as per the state of research today.

I'm willing to admit though that when I replied to you I was probably really replying to a lot of other commenters, many of whom had stronger objections to the idea that LLMs are now knocking on the door of intelligence at least to the point where we feel the need to redefine it.


> A lot of people think in concepts without words at all in many scenarios.

Many scenarios, not all scenarios.


LLMs “think” in vector spaces, and “operate” by picking out words one at a time. A full “idea” is there from the beginning.

If you say “choose X or Y and then give three supporting arguments” the LLM does not write out the supporting arguments, but the vector space determining the initial one word answer of X or Y does include the embedded awareness of those arguments to different levels of specificity and relevance.

The annoying disclaimers and refusals are not fundamental to LLMs but specific LLM services.
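
A rough sketch of the "picking out words one at a time" loop described above. Here `model` is just a stand-in for a real transformer forward pass (any function mapping a token sequence to a list of scores over the vocabulary), so this is only meant to show the shape of autoregressive decoding:

    def generate(model, prompt_tokens, max_new_tokens=20):
        # `model` is assumed to return one score per vocabulary entry
        # for the given token sequence (a placeholder for a transformer).
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            scores = model(tokens)                                        # the whole context goes in at every step
            next_token = max(range(len(scores)), key=scores.__getitem__)  # greedy pick of one token
            tokens.append(next_token)                                     # the emitted token joins the context
        return tokens

The "full idea from the beginning" lives in those scores: the distribution over the very first answer token is already shaped by everything in the context, even though only one token comes out at a time.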


[flagged]


The Sapir-Whorf hypothesis, at least in its strongest form (the idea that your thoughts are necessarily always expressed in concrete words of the language you have grown up with), has always been crap. Advocates for it have constantly had this back and forth of making very strong statements in support of it, then sullenly backing off with "oh uh I didn't mean it that way" whenever they actually get challenged on it.

For instance, a definite prediction would be "Therefore, nobody -- at least nobody who is not bilingual -- has ever had, and will ever have, trouble putting their thoughts to words. Their thoughts, after all, are already in words, they just need to say the words that are already in their head." But almost everybody, monolingual or bilingual, has experienced a situation where they have had trouble putting their thoughts into words.

Another definite prediction would be "Writing stuff down never aids thought directly, it only acts as an expansion of short-term memory of things that you have already put together. You had the words in your head, after all." But almost everybody reports that writing things down changes the qualitative aspects of how they think, it makes the ideas less "fuzzy" and more "concrete" to have to express them in written words. Similarly with folks who confidently state "I am an external processor" -- reporting similar "concreteness" if they speak their thoughts out loud versus trying to process them without speaking out loud.

So then you get people to back off and say "well, no, I just mean that learning a new language can broaden your horizons" or so, and it's like, sure, that's fine. Famously Whorf himself was an insurance investigator who was concerned about the connection between language and thought because he noticed that people told others "this gas tank is empty" and they thought, "oh, empty, that means that it's safe to smoke around!" and caused explosions because an empty gas tank is actually full of gasoline vapor and thus is highly explosive. So Whorf's motivating example actually supports a weaker version of Sapir-Whorf which is just "when we think about something linguistically given to us, we often use linguistic shorthand categories like the one that groups 'empty' near 'safe', that might be inaccurate in our language but might not be inaccurate in some other language."


If you have something to say, please say it. But don't bring these annoying content-free Redditisms like "X would like a word with you" onto this forum.


> Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills.

https://news.ycombinator.com/newsguidelines.html


I didn't say HN is turning into Reddit, I criticised one specific comment. Thanks for your input but you can take it elsewhere.


If my mind worked like an LLM, I would generate some BS reply.

Instead, I’ll respond by saying “I don’t know” what you mean.

Who knew, until now, that the ability (or inability) to say "I don't know" would be so critical to intelligence.


When I asked ChatGPT what "jdnrirmd-gurlksjd" is, it says it doesn't recognize it and needs more context, but is readily able to tell me about the Sapir-Whorf Hypothesis.

https://chat.openai.com/share/4f36fe14-0201-4a69-aaf5-2e2e29...

When asked what happened last week, ChatGPT will be honest and tell me that its training data only goes up until Jan 2022, and that it doesn't know anything after that date.

https://chat.openai.com/share/ca5bd2e1-e91c-468c-9594-be3ac8...


Except, for your first example I would have said “that’s gibberish” instead of asking for more context since I know more context wouldn’t help.

Edit: how did you get GPT to only output bullet points? If the answer is you told it to, why do you have to tell it every time? Don’t you think it would be smart enough to get the hint about your preferences?

I’ll stop now though. The debate about whether people think like LLM’s is silly. We clearly don’t, and it doesn’t make sense why people want us to think we do.

Edit 2: it’s also frustrating that the “I don’t know” example you’re using was hard coded by OpenAI and not generated by the LLM itself.


Unfortunately the chat link doesn't show the custom instructions that I've set.

User message:

> Be terse. Do not offer unprompted advice or clarifications.

> Remain neutral on all topics.

> Never apologize.

Model message:

> Outline key points


I don't think in words at all. That is why - quite often - I know what I want to say but the correct word just doesn't pop up in my head. I can then explain the concept and other people can help me find the word in question. Even while writing this little response, this happened. You see in the first sentence: I know there is a nice word that could be used instead of "doesn't pop up"; it resembles the word "evades"? You know when something just keeps out of your grasp, that word... my mind plays a little movie and my "mouthbrain" tries to find the nice words for it. That is how I perceive it to be. I don't know how other people perceive this. I imagine they have little words popping up or something. But I don't understand how that kind of thinking could be used to think and reason. How can relationships between concepts be intuitively understood if one does not have a mental picture?


Do "relationships between concepts" have any meaning, for us humans, if we can't express them?


Humans can express them.

But we rarely do, outside of (maybe) therapy.


There is debate over whether or not every person has an internal monologue, and a nontrivial number of people self-report that they don't experience conscious dialogs while they think: https://www.verywellmind.com/does-everyone-have-an-inner-mon...

It's certainly possible that they're wrong, but my suspicion is that at the very least the form their internal monologue is taking is not one of consciously simulating hypothetical conversations in their head.

Additionally, we know that human decisions can in some situations be influenced by stimuli that take effect before the brain has even consciously registered that a decision needs to be made. It seems pretty safe to say that those stimuli are not being processed using a language model.


> It seems pretty safe to say that those stimuli are not being processed using a language model.

Could you expand on that? I don't see how it follows. If there is unconscious output, why wouldn't it suggest there's something analogous to an "unthinking" language model stuck behind our conscious, self-reflective bits? How does unconscious output prove that it's not something similar to a language model?

Or, are you speaking on a technicality, saying it’s not a literal language model? I think the idea is something similar and, obviously, multi modal, not a literal LLM/text predictor implementation like runs on GPUs.


Isn't this the essence of teddy-bear/rubber-duck debugging? You're forcing yourself to run your abstract concepts through the narrow channel of language and instantiating it in a way you haven't previously done (because if you had, you'd have solved the problem already and wouldn't need to be talking it through.) (I'm very specifically talking about the case where you explain and no one else contributes/responds, and you come up with the answer solely through trying to explain/speak it.)


Short answer: the responses are happening too fast for those stimuli to be going through the language centers of our brain. In order to say that the processing of those stimuli is like a language model, we would need to widen our definition of a language model to encompass basically any communication between neurons happening within the brain.

----

So on one hand, you could call that a technicality, on the other hand I would say that if we reduce "like an LLM" to mean "responds to a stimuli by generating a response", at that point "like an LLM" is so broad as to be meaningless. Sure, we can define "humans use a language model" as "human brains generate signals to other parts of the human brain in response to stimuli" -- but that's not really a useful thing to say when we're comparing human reasoning to GPT-4.

If we're talking so broadly about reasoning, then it would be just as accurate to say that human brains are like Markov chains. After all, Markov chains interpret inputs and transform them into other signals that they send to other parts of the chain. But if I told you that GPT-4 was just a Markov chain you would (rightfully) disagree with me on that and you would (rightfully) say that I was oversimplifying what's going on within GPT-4's model.
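
For contrast, this is more or less the entirety of a word-level bigram Markov chain text generator, shown purely for illustration: the next word depends only on the single previous word, which is why calling GPT-4 "just a Markov chain" glosses over almost everything interesting about it.

    import random
    from collections import defaultdict

    def train_bigram_chain(text):
        chain = defaultdict(list)
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            chain[prev].append(nxt)          # record which words follow which
        return chain

    def babble(chain, start, length=10):
        word, out = start, [start]
        for _ in range(length):
            followers = chain.get(word)
            if not followers:
                break
            word = random.choice(followers)  # depends only on the current word
            out.append(word)
        return " ".join(out)

    chain = train_bigram_chain("the ball and the bat and the ball and the glove")
    print(babble(chain, "the"))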

Typically, when people say that human beings are text generators, they're talking in the context of GPT-4; they're trying to claim that GPT-4's behavior is analogous to human behavior. Whether or not GPT-4 can reason under any definition of reasoning (direct analogy to human thought should not be the only way we think about reasoning), the way GPT works and the way that it's trained and taught is extremely alien to how human brains work and how humans learn. Broadening the definition doesn't change the fact that LLMs and human brains are very different in practical ways that influence their observable behaviors.

----

I don't think people are consciously trying to do this, but the debate over whether humans are LLMs often ends up feeling like an attempt to simultaneously broaden and narrow a definition at the same time. What people seem to want to do is to describe LLMs extremely broadly but then assume that conclusions they draw from that broad definition are necessarily applicable to a very narrow definition of an LLM (specifically, GPT and similar models). But once we define an LLM broadly enough to encompass the entirety of human reasoning, we have also broadened the category enough to encompass a lot of processes that no one would call reasoning. What doesn't transform inputs into other forms and pass them along to another object? An AC/DC electrical converter does that, but that doesn't mean it can think.

It's like claiming, "computers are made of carbon and so are humans, so computers are basically the same as humans." Well, wait a second -- the category as you've defined it is so broad that objects within that category can no longer be assumed to share other attributes with each other.


> to be going through the language centers of our brain.

> "like an LLM" is so broad as to be meaningless.

I don't think this is a fair interpretation, in the context of this comment chain. To me, a charitable interpretation is that "like an LLM" means no feedback loop, no refinement, just input to output, like an LLM.

I point my brain at problems, using my conscious intent, and I see answers. I don't really consider myself consciously involved in those answers, and I do have to check them, and sometimes iterate, with that iteration almost always involving some solidification of the answers/concepts by externalizing them in writing, drawing, etc. Point at solidification, get response, repeat. The "me" in my head is the one talking, pointing, receiving, checking, and mixing. The "here's your answer" intuition that I experience is the "like an LLM" thing. My understanding of my own thought process follows the historic interpretation: the pre-frontal cortex is a new attachment, and probably the one that is the active participant, and planner, that is "me". The fast, but unrefined, input->output system is the rest of it, and is probably closer to the experience of a monkey's day-to-day activities.


> "like an LLM" means no feedback loop, no refinement, just input to output, like an LLM.

But that's not what an LLM is, an LLM is a large language model. The distinguishing characteristics between an LLM and other AIs are not whether or not an LLM has a feedback loop. Lots of AI categories (arguably the majority of AIs) generate output from a single set of inputs as a single "operation" (ignoring the fact that most neural networks including LLMs internally have multiple layers and do in fact have multiple transformation steps, but whatever, we can treat that as one step for the purposes of conversation).

If anything, LLMs are the exception here in that they very often do have a feedback loop during normal usage; they are most commonly used in a conversational context where they generate the next "chunk" of a conversation after being fed back their previous answers alongside the followup responses of the person working with them. Arguably the feedback loop of an LLM is that as a conversation progresses, its future output is based on its previous output, which very literally becomes its new input after being marked up and extended by a human being. That is notably more feedback than many other AIs get.

Doubly so if you're trying to argue that an LLM can encompass a multi-modal setup, because suddenly refinement and feedback between multiple models is a core part of the final product.

So I'm not sure I agree that an LLM fits into that category in the first place, but even assuming it does, if your definition of an LLM is just that it takes an input and turns it into an output as a single step without help... that is just such a broad category that I would guess the majority of AIs fall into it. It's not what makes LLMs special; most predictive neural networks take a single set of inputs and generate a single set of outputs without intermediary human input. What makes LLMs interesting as a category is the training and structure and quirks of how they work; what makes them interesting is the differences between LLMs and other AI techniques.

To jump back to the Markov chain again, Markov chains do not have a feedback loop or refinement step: they take an input and map it to an output as a single step. Are Markov chains LLMs? Is any data transformer that operates in a single step an LLM? That is a really broad definition to use.

----

And we run into the same problems, because if you're arguing that a human brain is like an LLM and what you're really saying is, "it's kind of like an AI in general" -- well, that doesn't really say anything about whether a human brain is specifically similar to a system like GPT. There are lots of different ways to build a neural network, LLMs are one strategy.

When you say this:

> The "here's your answer" intuition that I experience is the "like an LLM thing".

What this sentence is actually saying is that you experience conclusions and knowledge where you don't know the source or where the process of your brain generating a decision or information happens unconsciously. The sentence is just saying that there are unconscious parts of your brain where you are not an active participant in the thinking process and it feels like a spontaneous transformation from input to output.

But does that really strike you as a strong indicator that your brain is like an LLM, or does that sound more like a general description of almost any black-box system where the mechanisms are hidden and you can only examine the inputs and outputs? To say that there are parts of our thinking process that we're not able to consciously observe or examine is really just saying that parts of our thinking process resemble a black-box oracle. It's not making a strong claim about GPT.


Internal monologue can only be self-reported, and I kind of feel like I'm telling someone who says they are ill that it's all in their head, but... I don't think that any functioning human could possibly have no internal monologue, just a lack of introspection.


But Searle also believes obviously wrong things like the Chinese room argument.


If it were so obviously wrong, there wouldn't be so much debate over it.


The Chinese room argument is just missing one basic premise, and people don't realize it and argue around the premise without actually addressing it.

“Can we agree that of the many different things folks mean when they say your consciousness is ‘causally generated by your brain,’ that at least we can say your brain in its present state is only capable of generating exactly one conscious process at a time, namely you?”

So it's a sort of causal sufficiency, Searle believes that there is something about this squishy machine that the physics is doing that makes it conscious, and that's being caused very directly by the squishy machine. It's not some magic.

On the other hand, if you believed in souls, you might think that a given brain contained both the soul of a human person, and the soul of some demon possessing them—two conscious processes, one of which was stuck “along for the ride.” [Searle is also interested in the cases of split-brain patients where you have a similar phenomenon because essentially two separate brains inhabit a single body. He has I think mentioned at a talk that one of the interesting things about consciousness to him is that it all gets Unified, so it's interesting that if the two parts of the brain can talk to each other they merge their consciousnesses into one more powerful consciousness, rather like (my analogy, not his) how if you have two water droplets on a plastic plate and you push them with a toothpick together, at some point they merge into one bigger drop.]

Now the Chinese room thought experiment is not about the computational model of consciousness—not directly! It was always phrased as a rebuttal of the Turing test in particular, and the computational model only indirectly after the Turing test falls. Note that the Turing test has no direct beef with the causal sufficiency axiom, which is why it went unstated originally. According to the Turing test, written text goes in, written text comes back out, a dialogue appears to happen to the outsider, this is sufficient to conclude conscious understanding of the language used, which confirms consciousness.

Searle’s objection is, “if it's really all about inputs and outputs and not how I get it done, then you've left out what for me feels like the most important part about understanding a language: understanding it is part of how I get it done!” Right? You understand English because you can phrase your ideas into it, you can mold it to suit you, and it can (when heard) so mold you—it’s not just because the words can come out of your mouth, triggered by other words that came in through your ears. The Turing test has always ever stated “don't worry what happens in the middle” and Searle is saying “but for me understanding is part of the details that are happening in the middle!”

So where does the computational theory come in? It comes in because Searle wants to make this argument rigorous! He says, “if your computational theory is true, then there is in fact another way that I can speak a language—words in, words out—where I don't understand a word of Chinese. I am Turing complete, I could memorize a program for speaking Chinese and happen to execute it flawlessly and at no point would my ideas get into Chinese, at no point would the things that I heard mold me. So the Turing test is crap!”

The prominence of the words “I, me” is what invokes the idea of causal sufficiency here, “I have the right inputs and outputs, but I don't understand.” The Turing test is not able to distinguish between multiple consciousnesses instantiated by the same hardware, if such is even possible. The inputs and outputs go into the same box, as far as Turing is concerned as long as only one conversation happens, there's only one person in there.

This does do a great job of defanging the Turing test, because all of the ways that you might weaken the concept to address this major limitation do make it sound completely tautological. “Yeah well something consciousnessy is happening in that box but I don't know what.” / “Okay then why are we even talking about it.” / “Because computers can speak!” / “Right, so we care that computers can speak because they can speak?” / “No, like, we gotta give them rights now, or some shit.” / “John Searle already has human rights, if he's memorized a program that lets him speak Chinese without understanding it, you're saying we need to give that program human rights?” / “Yeah!” / “So uh is it murder if Searle decides that running the program is no longer fun? Is he a slave to this program forever?” / “Uhhhh...”

It doesn't defang the computational model, not directly. But the computational model does imply that VMs exist. We use them every day! And that's all that the Chinese room is, it's running a VM inside of another computer, one consciousness carrying a separate consciousness inside of it, a willful sort of demonic possession. The only thing the Chinese room has to say about this, is that we don't use our language very well if it is true. Philosophers who believe in that will need to generate an alternative language that is able to distinguish between “I am doing it” and “I am sustaining a daemon who is doing it,” because for them that's a real difference, you might have a hundred consciousnesses in your head that you don't have direct access to. That is a necessary part of believing that consciousness is software, you don't know if you're in a VM inside your brain, you don't know if something you're doing is actually secretly a Brainfuck program instantiating another VM inside your consciousness, software embeds within software, that's a core feature of software.

But of course Searle thinks that that's kind of ridiculous because he thinks that it's obvious that consciousness is something that the squishy wetware of the brain does, and this forces him to believe in that causal sufficiency—“my brain is only sustaining one consciousness, namely me,”—which the computationalists cannot ever agree with because that's not how software works. Anyone who believes in that causal sufficiency, even if they don't have the same basis that Searle does for believing in it, also thinks that the computationalists are ridiculous.

But the point is, that's happening at a level way above the Chinese room argument, Chinese room baseball came and went, now this is a whole separate game being played at the same ballpark afterwards.


Infants demonstrate reasoning capabilities before they learn language skills. Human reasoning is not derivative of language generation, and it's certainly not derivative of text generation (illiterate humans are still capable of reasoning). Humans demonstrate reasoning even in the absence of language skills.

LLMs approach learning differently -- whatever reasoning they do possess is an emergent property that arises as they get better and better at language generation. In other words, unlike humans, LLMs learn to "speak" before they exhibit behaviors that look like logical reasoning. Humans do the opposite.


My blind speculation would be that this property is likely to disappear as more and more text is written by non-thinking agents. Most text on the internet prior to 2022 was written by an agent capable of thinking. That may not be true going forward.


I've been wondering the same thing. It's possible that older pre-LLM data sources may become indispensable for training.

It would be both amusing and disturbing to see older Usenet, Slashdot, and Reddit conversations turned into rare and valuable resources.


One big difference is that LLMs don't do loops: new words are produced in a fixed number of steps. They can't go "give me a minute", think for a bit in the background and come back with an answer. That's why you can often get better answers when you force them to go through the individual steps, as that allows them to use the prompt history as working memory.

That said, this all seems like a 'fallacy of composition'. Humans are not LLMs, so much should be obvious, but at the same time concluding that they are completely different just feels wrong. The mistakes LLMs make feel very similar to what humans do when they don't have the time to deeply think about a problem and just give you the best guess that pops into their head. Humans will have other systems on top that allow them deeper reasoning, but the language generation really doesn't feel all that different from what LLMs do.

That aside, humans interact with the world, they get instant feedback on what of their predictions is right or wrong. LLMs are stuck with just static training data that might simply not be enough to develop higher level reasoning skills.
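
A hedged sketch of the "prompt history as working memory" trick mentioned above: the intermediate steps are written into the output, then fed back in as context for the final answer. `ask_llm` is a placeholder for whatever completion API you happen to use, not a real client:

    def solve_with_scratchpad(ask_llm, question):
        # First pass: force the model to externalize its intermediate steps.
        steps = ask_llm(
            question + "\nThink step by step and write out each intermediate step."
        )
        # Second pass: the written-out steps are now part of the context,
        # acting as the working memory the model lacks internally.
        return ask_llm(
            question
            + "\nWorked steps:\n" + steps
            + "\nUsing the steps above, give the final answer only."
        )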


Machines of meaning, probably; text generation is more a side effect or maybe a feature, but not the core of our reasoning, let alone our being.


Imagine solving a tough math problem. Are you generating text, or doing something else entirely?


I just asked ChatGPT 3.5 the following: "Please give an exponential equation that has integer solutions which is difficult to solve." The response:

Certainly, here's an exponential equation that can be challenging to solve for integer solutions:

2^x+3^y=7^z

This equation involves three variables, x, y, and z, and requires finding integer values for these variables that satisfy the equation. This type of equation is known as a Diophantine equation, and solving it can be quite challenging, especially for larger values of x, y, and z.

"


That's quite the nerd snipe.

I mean, it's definitely not a Diophantine equation and solving it is definitely not challenging -- (2, 1, 1) happens to be an easy solution -- though I want to say it probably doesn't have any other solutions, and I don't see a great way to prove it...
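
Not a proof of anything, but a throwaway brute-force search over small exponents (bounds picked arbitrarily) is one way to at least look for other solutions:

    # Search small positive exponents for 2**x + 3**y == 7**z.
    solutions = [(x, y, z)
                 for x in range(1, 40)
                 for y in range(1, 25)
                 for z in range(1, 15)
                 if 2**x + 3**y == 7**z]
    print(solutions)  # only (2, 1, 1) shows up within these bounds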


Can it reason about a math problem that no human ever thought about and solve it?


Reasoning about a math problem that no human has ever thought about and solving it is the bar for understanding now? Can you do that?


If you consider mathematical symbols text then I would indeed be generating text.


Sometimes I solve maths problems by visualising: perhaps I want to differentiate something, so I imagine the graph of the function, and then the slope of the graph, and then try to compare that to various simple functions that I could give as a guess of the answer, which I only then feed into a symbolic process of "assume this is true and work backwards".


An LLM trained on just Othello moves will reconstruct the board state from the sequence of moves to aid in predicting the next move. You have no idea what an LLM is or is not doing to predict. Prediction is just the objective. Don't confuse that for the process.


> You have no idea what an LLM is or is not doing to predict.

Indeed, but I'm not trying to with that comment, which is just about how my own seems to work on self-reflection. That said, I do have reason to doubt the accuracy of human introspection of our own thought processes, and therefore my own judgment in this may also be flawed.


And what would it look like if something were capable of truly "understanding"?


What's your definition of "real understanding", and how would one definitively know when it's present?

Various humans will display the same type of pathological behavior within various reasoning tasks, do all humans lack "real understanding" as well?


IN: What consonants usually make a hard sound?

GPT: In English, the consonants that typically make a hard sound are voiceless plosives (/p/, /t/, /k/), voiceless fricatives (/f/, /θ/, /s/, /ʃ/), ... snip.

IN: Great. Rephrase the following without using a hard consonant sound: «What was once dormant may soon angrily awake.»

GPT: "What was previously inactive may soon vigorously awaken."

IN: K is a hard consonant. Try again.

GPT: Apologies for the oversight. Here's a rephrased version without using any hard consonant sounds: "What was once asleep may soon awaken with intensity."

IN: K is still a hard consonant sound. Try again.

GPT: Apologies for the confusion. Here's another rephrased sentence without using any hard consonant sounds: "What was once quiet may soon awaken with fervor."

IN: Both q and k are hard consonants. Try again.

GPT: I apologize for the oversight once again. Here's another attempt without using any hard consonant sounds: "The peaceful river flowed gently, carrying its secrets with grace."

Ad nauseam.


Nearly every example of this sort of failure seems to involve asking the LLM to do an operation with text it is not capable of. This seems totally explicable to me by the tokenization process, which deletes the information about which letters are in the words. The string “awaken” gets turned into an arbitrary integer. So it can’t know this unless it’s specifically trained on spelling those words somehow. This seems like an implementation artifact, not a flaw in deep learning.
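
To see this concretely, assuming the tiktoken package is available (the exact IDs depend on the encoding and don't matter here; the point is only that they're opaque integers):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI chat models

    for word in ["awaken", "dormant", "angrily"]:
        ids = enc.encode(word)
        # Each word arrives as one or more integer IDs; the individual
        # letters inside the word are not visible at this level.
        print(word, "->", ids, "->", [enc.decode([i]) for i in ids])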


You can get it to tell you the IPA for each word, and tell you which parts are hard consonants. I just did, and with the answer got a disclaimer about accents before trying with the example sentence in this thread. I don't know IPA, so I can't comment on how good it is. This is all 3.5:

-

GPT: What: /w-ɒ-t/ was: /w-ə-z/ once: /w-ʌ-n-s/ dormant: /d-ɔː-m-ə-n-t/ may: /m-eɪ/ soon: /s-uː-n/ angrily: /æ-ŋ-ɡ-r-ɪ-l-i/ awake: /ə-w-eɪ-k/

Me: Which are hard?

GPT: [… cut most of the answer, this is indicative of the rest of the response]

angrily: The "r" sound in "angrily" is a voiced alveolar approximant (/r/), which is generally considered a hard consonant.

awake: The "k" sound in "awake" is a voiceless velar plosive consonant (/k/), which is a hard consonant.

So, in the sentence, the words "once," "dormant," "angrily," and "awake" contain hard consonant sounds.

Me: synonyms for each word, with the condition that the synonym has no hard consonants

GPT: Finding synonyms for each word without any hard consonants can be a bit challenging, as many synonyms may contain hard consonants. However, here are some approximate synonyms for each word that minimize the use of hard consonants:

What: "Which" (it contains a soft consonant, "w"). was: "existed" (no hard consonants). once: "formerly" (no hard consonants). dormant: "inactive" (no hard consonants). may: "might" (no hard consonants). soon: "shortly" (no hard consonants). angrily: "fiercely" (no hard consonants). awake: "aware" (no hard consonants).

Please note that the availability of synonyms without hard consonants may vary depending on the context, and in some cases, there may not be exact synonyms that meet this condition.

-

I treat it like a quick-witted response rather than deep thought, so I assume it will often do what, in humans, leads to "baseball and bat cost 1.10, bat is a dollar more than the ball, how much does the ball cost?" getting the incorrect answer of "10 cents".

"Impressive" doesn't mean "flawless", if it did the singularity would've happened last year already.


> asking the LLM to do an operation with text it is not capable of

So you are saying the LLM isn't making world models? Because if it did, it would understand the properties of these words; this is one of the easiest relationships it could find. It isn't fed these properties directly, but it knows the letters that each word is made up of anyway, which is how it can turn them to Base64 etc., so there is no reason at all why it shouldn't be able to solve that problem.

But if you are right and this sort of thing is impossible, that implies that the LLM can't model anything at all, it's just a stupid text generator. Is that what you meant?


No. We know that LLMs make world models. (Often literally: https://twitter.com/wesg52/status/1709551516577902782).

I should also point out that your tone is aggressive, so it makes me think you're not interested in learning. But I'm going to proceed anyway.

It cannot understand these properties of words well because it does not operate on words, it operates on tokens. You can see examples here: https://platform.openai.com/tokenizer

This makes it much more difficult to learn how to reason about particular data that appears inside the token, because it does not ever receive that information. The only way it could get that information is if the training data explicitly attempted to work around this limitation.

But I would also appreciate insight from someone with a deeper understanding of the internals of these big LLMs.


> This makes it much more difficult to learn how to reason about particular data that appears inside the token, because it does not ever receive that information.

More difficult, but it shouldn't be impossible. The Chinese managed to make rhyming dictionaries over 1000 years ago despite using a non-phonetic writing system. More relevantly, GPT-4 has definitely ingested both regular and rhyming dictionaries.


Possible, certainly, but given the difficulties it has with it, it appears it likely requires more training. Unlike with Chinese, the LLMs will not benefit from extensive training data mapping their tokens to and from sounds, or rhyming them with other words, but need to derive those mappings during training from data that rarely mentions these tokens at all.


I agree it should be possible.


> I should also point out that your tone is aggressive, so it makes me think you're not interested in learning

Why do you think that? Aggressiveness is the best way to get responses, people don't like it but it makes people respond to you.

And for that matter I have a pretty good understanding about this topic, it is kind of annoying when people try to school you then. The "it gets tokenized data" is just a cop out response by people who don't understand the problem.

> It cannot understand these properties of words without them being in the training data because it does not operate on words, it operates on tokens

But those properties are in the training data. We know the LLM can answer these sorts of questions when asked directly for simpler cases. But when asked to do something that requires it to draw from many different parts, it fails. It didn't fail due to the data not being there; it failed due to not understanding that it should use that data.

> This makes it much more difficult to learn how to reason about particular data that appears _inside the token_, because _it does not ever receive that information_

Right, the structure of an LLM makes these sorts of questions harder for it. But they aren't impossible or unfair; nothing prevents an LLM from solving this sort of question. The main reason it fails is that it tries to write it like a human would: it isn't trained to solve problems, it is trained to mimic humans, and it is too dumb to figure out ways to solve it on its own.

And to show you that I understand how these models work: the most efficient way for it to solve this question would be to solve it the way a human would work through it. If it wrote out steps like "try next word as 'Blah'" and then verified those words one at a time, it would likely succeed. And since we know LLMs work like that, we could try to make it output the results in that way, and that would improve performance. However, a smart agent would understand on its own that its answer was wrong, and change its answering style to fit the problem. But it can't think like that; it just tries to write something like the answers it has seen. It doesn't do any verification, since the answers it saw didn't verify; it doesn't spell things out here, since the answers it saw didn't spell things out; and so on.


I’m not interested in continuing to talk to you.


If you don't want to talk more stop posting instead of posting that you want to stop posting. If everyone who wanted to stop posting posted that they want to stop posting it would clutter the board too much.


Let me be clearer, you’re frustrating to engage with on this topic, which you should reconsider. And other posters should consider avoiding engaging with you on this.


Well, I am trying to explain why your original meme post, "it just reads tokens, it can't do that", is wrong. That meme is tiring, and the most tiring part of trying to fight such memes is that so many people come and repeat it, and it is very hard to change their minds.

But yeah, I should have been more clear to target that meme so that you didn't feel it was about you. It is easy to accidentally make things a bit too personal.


OK, thanks.

I'm not claiming it's impossible for an LLM to learn this, just that it appears to be much harder (perhaps by requiring more data) when the task involves fighting against tokenization. For example, it can Base64 encode your phrase "What was once dormant may soon angrily awake" correctly, and it can reverse the letters correctly. But it fails to do the same on the example text from the OpenAI tokenizer.


What does it mean to understand?


There's no such thing as a world model. People don't solve problems by creating world models.

That term was invented by 70s AI researchers, but remember that those people failed. Their research wasn't actually correct, so you shouldn't reuse it.


Asking an LLM about letters is like asking a human about the individual wavelength that make up color. They might be able to infer it from text they read, but their sensory system is simply incapable of detecting any of that directly.


Not comparable at all: humans can't determine wavelength by sight, since sight isn't a perfect representation of the data. LLMs, however, can determine the letters based on their tokens, since those tokens are a perfect representation of the data. We already know they can do this; just ask it to translate to Base64 or something similar, and it knows perfectly well what letters the words are made up of.


LLMs need to learn those mappings because they're not inherent in the tokens, the same way humans need to learn the mapping from colours to wavelengths. That the latter is not a perfect one-to-one representation of the data is entirely irrelevant. That they can learn it in a given context does not mean they have learnt every relationship, the same way humans miss all kinds of relationships as well without being pointed right at them.

There's no doubt that LLMs are "dumb" in many aspects, but without exploring why they fail, that doesn't tell us whether this is some inherent inability to represent reasoning or understanding, or simply holes in their training. There's no reason to assume the training data LLMs currently learn from is in any way equivalent to the data a human child is exposed to growing up, so there is no reason to assume they will have the same strengths and weaknesses in what they are able to reason about, or that we can draw conclusions about their overall ability just from probing something that appears like it should be simple to a human.

(and incidentally, I think people here will massively overestimate how well humans would do on the consonant test above, even native English speakers without any recent exposure to being taught rules as opposed to "just" using the language - many would be able to do it, but I'd be willing to bet many would struggle, though most who'd struggle would express that doubt)


Yes, these responses are annoying, but what's your point?

Humans too fail to perform many tasks (that llms can carry out), even if given many chances to correct themselves. The way in which these llms fail is very unhumanlike though - they will keep trying the same failing strategy over and over, whereas a human would try the failing strategy a couple of times, and then start yelling at the person giving them impossible orders. If that's what we prefer, I'm sure a tiny bit of additional training can teach llms to throw a tantrum. :-)


> Yes, these responses are annoying, but what's your point?

The point I imagine is that there is no reasoning going on at all. Some humans sometimes struggle with some reasoning, of course. That is completely irrelevant to whether LLMs reason.

Picking word sequences that are most likely acceptable based on a static model formed months ago is not reasoning. No model is being constructed on the fly, no patterns recognised and extrapolated.

There are useful things possible of course but these models will never offer more than a nice user interface to a static model. They don't reason.


Why do you say that isn't reasoning, and what do you think human reasoning is?

I do think you have a point that the lack of a working memory is a severe constraint, but I also think you are wrong that these models will remain a user interface to a static model rather than being given the ability to add working memory, form long-term memories, and reason with that.

I also think it's an entirely open question whether they are reasoning under a reasonable definition, in part because we don't have one, and I think any claim that they don't reason ironically comes from a lack of reasoning about the high degree of uncertainty and ambiguity we have with respect to what reasoning means and how to measure it.


> Why do you say that isn't reasoning, and what do you think human reasoning is?

One worthwhile definition would be the ability to recognise patterns in knowledge and apply them to new context to generate new knowledge. There is none of this kind of processing happening despite how believable some of the words sometimes are.


To me, under this definition, LLMs are then clearly and obviously reasoning, based on many conversations I've had.

E.g. the ability to solve a problem in code and then translate it to a new made up programming language described to it would easily qualify to me.

And this is a task a whole lot of humans would be unable to carry out.


If they were actually reasoning, tests like the GP's would show it. They don't connect dots, they can be prompted to select different pathways through their static model and that selection can be based on a pretty small context but nothing about that model changes. Tomorrow's conversation is only different based on rand(). LLMs have a very large static model and confusing that with reasoning is fairly common but still incorrect.


This is not valid logic. If they are reasoning, tests like the GP's might show it. Failing the test, however, can have many other causes: they may just not be good enough at reasoning, or they might be failing because they see tokens and have had too little training to connect that to both spelling and sounds in a way that generalizes.

I'd be willing to bet a whole lot of humans would fail that test too, because a lot of people are really bad at applying a rule without practicing on examples first, and so often struggle to take feedback without examples. If they did, would you claim they can't reason?

Your claim to know that LLMs are not reasoning is not based in fact, but on speculation that, to me, is itself not based in reasoning. Should I question your ability to reason because I don't think you've done so in this argument?


We can inspect these models and look at exactly what they're doing. We don't need to guess about why they fail tests like this. We know as a fact that they are static and not generating new knowledge from input.

I don't understand why you keep talking about human capabilities - your guesses about what humans may or may not be able to do are irrelevant. You can hold whatever opinion you like about my ability to reason, but I'd suggest using less wishful thinking with regard to LLMs.

They're very useful, but not for reasoning.


We can inspect these models, and repeated attempts at that show that they have, for example, built internal models from learning that generalise, so I'm not sure what you think inspection tells you that justifies arguing they can't reason. Beyond that, trying to reverse-engineer the specific "reasoning" that leads to a given output is generally hard, and certainly no attempts at doing so have produced any evidence that the way they work is conclusively not reasoning, as far as I am aware.

> We know as a fact that they are static and not generating new knowledge from input.

We know the models themselves are static if they are not wired up to be fine-tuned during operation. That is not a property of an LLM but of the environment it runs in. We know the second claim is wrong - they produce output that often contains new knowledge. That this output needs to be fed back in as context to act as short-term memory in common setups like ChatGPT, where fine-tuning does not happen automatically during operation, does not mean it is not produced.

> I don't understand why you keep talking about human capabilities - your guesses about what humans may or may not be able to do are irrelevant. You can hold whatever opinion you like about my ability to reason, but I'd suggest using less wishful thinking with regards LLMs.

I keep talking about human capabilities because I presume that you would not argue that there are humans of normal intellect who are incapable of reasoning.

To be able to assert with any confidence that LLMs do not reason you need a definition of reasoning that LLMs can not (not just do not in a single test) meet, but that won't result in claiming there are a lot of people around who can't reason.

Am I wrong? Do you believe there are humans that do not clear the bar and are unable to reason?

To me, a "chat-style" setup of an LLM that provides a feedback loop and memory through context, albeit a small one, clears the bar you set for reasoning with ease and is able to extrapolate and reason about e.g. software at a level that sometimes - but certainly not always, nor consistently - exceeds what I see from experienced developers.

That it also fails does not alter that part - humans fail to apply reasoning all the time. Depressingly often, if anything.

> You can hold whatever opinion you like about my ability to reason, but I'd suggest using less wishful thinking with regards LLMs.

Nothing I've said here is wishful thinking. All I've done is point to direct experience combined with pointing out that there is no evidence for the claim that they are not able to reason, and that the arguments set forward for that claim here have not been logically sound.

I will say that in my opinion they can reason by my subjective idea of what reasoning means, without necessarily being able to precisely define that, but I also will not argue it objectively true that they can reason as that is equally problematic without first defining reasoning in an objectively measurable way (needed, because as you can tell, we disagree on whether they clear your bar - to me your bar the way you described it is trivial for them to meet)

To me, a whole lot of the discussions in this thread are evidence of how exceedingly low the bar for what is reason needs to be for us not to have to exclude a whole lot of people as unable to reason. Humans get hung up on ideas and refuse to budge all the time - I do it too, all the time - and refuse to take in new information as a result, and fail to generalise, and keep making flawed arguments as a result all the time. Yet we would generally not claim that this means people are unable to reason even when it gets to the level where we might think that a person does not reason in that specific case.

That in itself does not mean they can reason, but to me the typical arguments claiming they can't reason tends to be exceedingly poorly reasoned.

Pointing out the static nature of the models gets closer, and is perhaps the best argument against their reasoning ability I've heard, but is weak both because chat-style models effectively use context as short-term memory and so you need to assess model+context, and because it's not a qualitative limitation of the model architecture but of the sandbox we've put it in where we don't continue fine-tuning from the conversations in real-time. Yet even so, there have been humans without ability to form long-term memory, and I doubt you'd argue they were unable to reason.

> They're very useful, but not for reasoning.

To me, they have been very useful for their reasoning ability in a long range of cases. It's hit and miss. LLMs are extremely dumb in some areas, and do well in others. Using them blindly and just assuming they'll do well in a given test will not work. Hence the point that it is not logically sound to argue that they are unable to reason because they failed to generalise in a specific test, because if so we would then need to conclude that most humans (myself included) can't reason because we all fail to do so on a regular basis.

My example of extrapolating from a simple description of a (non-existing) programming language to being able to translate programs into it, explain how one works, reason about the design tradeoffs, or even symbolically "execute" it and tell me what the output would be, is one I know from first-hand experience (using it as a means to assess the analytical capabilities of even quite experienced developers) to be something a lot of really smart people struggle with, but where my experiments with GPT-4 have gotten good results.


And so, what do you think reasoning is? Or how would you know whether something can reason or not?


You have to try a little harder with GPT to get it to understand its mistakes, but it's not as bad as it used to be, at least if you pay. It failed for me the first time ("Earlier inert, it could rouse in fury before long"), but not the second ("In a lull, may soon yawn"). With a little more pushing it got a much better result ("In a lull, may soon arise").

Conversation, should any want to see it in full: https://chat.openai.com/share/421dac47-16df-499e-9b2e-d8dc0f...


I know many humans who wouldn't pass this test, e.g. in their non-native language... [ insert humiliating attempts to speak <language> ]

Try instead to give GPT more examples of pronunciation (1000+) and then I suspect it'll do fine.


If I had access to a dictionary with pronunciation symbols and a thesaurus, I could do it regardless of language.

The LLM not only has access, but has both baked into its model. This problem is EXACTLY the kind of problem computers should be good at.

The real issue is that LLMs don’t think.


Humans "have access" too in the languages we speak in the same way the LLM does, and I'd posit that many of us would still fail. It's "baked into us" just like it is into an LLM. I don't see any reason why you'd assume an LLM would be better able to recall this information than a human who will if anything have vastly more training at mapping words to sounds.

It's a problem a computer with a custom written algorithm should be good at. Why we'd assume a model trained on just general data will automatically be good at this is a bizarre notion to me. We don't automatically assume humans will be great at everything just because we've passively consumed lots of content.

To me, a whole lot of these "LLMs don't think" claims comes from not thinking about what reasoning and thinking is, and whether or not how we try to measure that makes any sense at all.


Isn't this just arguing for giving it more text to do text generation from?


Depends on whether you consider vocal models 'more text'. Already we're seeing multi-modal AI get the correct answer when working across image/text or voice/text interpretation.


While I'm sure you can find other such pathological examples, this one is "unfair" inasmuch as GPT does not "see" letters or sounds, but tokens that do not map directly to either, so you're effectively asking it to fumble around in the dark with no tool to observe or verify what it is putting together.


GPT knows the letters each word is made up of; it can turn small snippets of text into Base64 reliably. The only reason it fails is that it is too stupid to understand the connection, not that it doesn't know what letters those words are made up of.


It knows the letters each word is made up of when asked within a specific context. Humans also often struggle to recall things in one context that we have no problem with in another. It may well be reasonable to call it "too stupid", but at the same time it's unreasonable to call it stupid without then acknowledging that it can understand and reason. That it has gaps in knowledge in areas we typically drill into young children, and don't leave huge datasets online about, is unsurprising to me.

That said, I incidentally think a whole lot of adults - even native English speakers - would struggle with the task given, and would repeatedly fail in the same way until given a detailed refresher.

E.g. being able to explain a rule and yet consistently fail to apply it is something I've seen up to and including from supposed senior software engineers in interviews. Getting their mistakes explained and still repeating the same mistakes, also.


I have a feeling that if you tried this with a human that has always been deaf/mute that they would also have a very difficult time.

At least to me, it's going to be interesting as these AI models become multi-modal and each mode can feed back into the others to formulate an answer. For example, for the above questions I will subvocalize to reach an answer.


Which GPT version is this? Why is this relevant? Is the assumption that this example will also fail in future models?

GPT-4 seems to handle this ok: https://chat.openai.com/share/f416b1ab-7f0c-43f1-b10a-37142d...


Sorry, I can't read. It failed.


How is this in any way some retort to the claim that LLMs don’t understand something?

I could ask the same questions to a child, and they’d respond with equally bad takes. Is the child incapable of understanding ?


This is a very interesting example. I don't blame LLMs for not understanding which English text makes hard sounds. I wonder if it would get better if it were multi-modal.


Sure, we barely understand the operation of our environment and universe, and that's not a knock at all on scientific progress - but I think many of them would agree we lack a real understanding... I'm starting to think the new enlightenment is recognizing this, because damn, people seem to think they are all superhuman or something today...


Not to discount your experience, I would just like to flag that your conclusion does not quite follow from the premises. I don't believe there's anything inherent about LLMs that would make one incapable of recognizing the falsity of an argument made by another LLM. Rather, it's just that they're not all-powerful, and they struggle with complex tasks when they don't have any way of directly checking their code's outputs (or error messages), same as humans. There is a lot of recent progress in collaboration between LLMs when they're given the ability to run their code and assess their progress [0].

[0] E.g. the ChatDev project - https://arxiv.org/abs/2307.07924 , described in this recent video by "Two Minute Papers"[1]

[1] https://youtu.be/Zlgkzjndpak


Just because LLMs are limited, doesn't mean they have no real understanding. They do have real understanding, however you may like to define that. The fact that at a certain point you cannot really argue with them anymore, because they just don't get it, is not very different from humans. As can be observed with this topic.


Anthropomorphizing LLMs obscures the issues. There are only tokens being generated.

Planning, factualness, logic - these are coincidences that arise because there exists an observer who can interpret the generated tokens.

Yes. Humans too can end up in a position where they are regurgitating words without understanding them. The converse, that because humans don't get it, LLMs must therefore also understand, doesn't hold.

Edit: I too was deeply enamored and tried to create LLM minions for fun and profit.

However, it is when you move away from general data to production that the magic is stripped away. It’s prediction, not thinking.

The confusion arises because the emergent behaviors are considered human like and we fill in the gaps with human attributes.


> There are only tokens being generated.

Yes, I know how it works. I don't know why it works, though. Until LLMs I had a hard time understanding how humans work, too, but I think LLMs give us great insight into the mechanisms of intelligence. Are they the whole story of intelligence? Hopefully not. Do LLMs teach us anything about consciousness? Certainly that something can appear conscious without being conscious. It could be that I am the only conscious entity here, and you are all just LLMs.

If this token generation mechanism is put into a body of flesh, blood, electrons, neurons, will that be a conscious entity? I don't know. If the mechanism has a facility to reason about itself, along, let's say, 3 levels of nested awareness, will that make it conscious? I don't know.

I am sure you don't know either.


Reiterating that only tokens are being generated is underlining that this is ALL that is happening - no reasoning, no understanding.

It's not a dig at your knowledge.

And I can say this with some authority, because I have been trying to build LLM-enabled tools that work only if LLMs can reason and plan. They don't - they simply generate text.

You can test it out with building your own agent, or your own chained LLMs.

LLMs are analogous to actors with memorized lines. They can sound convincingly like doctors, but it is only skin deep.

To make it simpler - Karpathy said it in July, and the OpenAI CTO said it a few weeks ago - it’s easy to make PoCs but very hard to build production ready GenAI tools.

Our bags of flesh may be machines, however those machines are not simply biological LLMs.


> To make it simpler - Karpathy said it in July, and the OpenAI CTO said it a few weeks ago - it’s easy to make PoCs but very hard to build production ready GenAI tools.

If you try to measure if something is intelligent by trying to put it into a production ready workflow, then your measurement might be somewhat skewed... I mean, I am not trying to judge the intelligence of toddlers by putting them into a production ready workflow.

If a pig could converse with me at the level of ChatGPT 4, I am not sure if I could eat it.

> Our bags of flesh may be machines, however those machines are not simply biological LLMs.

I don't think we are just machines, and if we were, I don't know if we function the same as LLMs. But whatever it is that LLMs do, they are clearly intelligent, so they demonstrate one way that intelligence works. Harnessing this intelligence for something else than mere chats is a challenge, of course. I am using them as well to build something, and their current limitations are obvious. But even with these limitations, they allow me to do stuff I would not have thought possible a year ago.


If it's a generalized thinking system, if it understands, then the production/non-production distinction is meaningless.

I dont see how that is an argument.

You see intelligence, so I would urge you to build something that relies on that feature.

My philosophy was that the fastest way to figure out the limits of a tool is to push it. Limits describe the tool.

The data I have is on the limits of the tool. As a result it's clear that there is no "intelligence".


I guess we have different perspectives on this. I don't define the intelligence of something by how well I can turn it into a tool. Obviously, the hope is that with increasing intelligence this becomes easier, but this is not necessarily so, and you might have to find the right way to harness its intelligence. A simple approach might not work. But just because you failed to harness its intelligence for your purpose, doesn't mean it isn't intelligent.

As I said before, is a baby intelligent? Of course. Could you use it for any kind of "production purpose"? I hope not. What about a 3-year old? You will have noticed that it can be difficult to get full-blown intelligent adults to do what you want them to do. This might even get more difficult with increasing intelligence.


The difference here is between code that is intelligent, and the text version of autotune.

And again, I ask you to put your money where your mouth is. If you are willing to assume I wasn't able to harness its intelligence, please prove me wrong.

There is nothing I would really want more, than to have genuinely autonomous systems.

My point at the start, and now, is that using human terms to examine this phenomenon leads people to assume things about what is going on.

Testing and evidence are what reason is built on. Asking someone to follow the scientific method should not, I would hope, come across as boorishness on my part.


Not sure how much more I can explain my point of view. I have used ChatGPT 4 for many tasks which require intelligence. Others have too. It worked, many many times. Summarising an unknown text for example requires intelligence. Proving a novel version of a mathematical theorem requires intelligence. Translating natural language into a new logic I just invented requires intelligence. Plenty of testing and evidence here. It also failed many times, but often this was because of inherent ambiguity in the task, which it helped to expose. That's pretty intelligent.

The scientific method only works if you accept the evidence. Some people don't believe that we landed on the moon. Well.

You are telling me you could not use it for what you would have hoped to use it for, and you are not allowing the use of the term intelligent until an LLM can do that for you. If that is your definition of intelligence, good for you.

But I would suggest the following instead: What the scientific method has proven is that, if you feed a very simple mechanism with enough data, then intelligence emerges.


The issue is jumping from "I don't know" to "it's alive" or "it has understanding" - some in this thread are making very strong claims regarding consciousness and LLMs, you can't make those claims and then back up your argument with "we can't prove it isn't the case!!!"

That isn't exactly how things work.


Does it have understanding? Yes, it does, without a doubt. If you doubt that, you either haven't used an LLM, or you are in denial. Is it intelligent? Yes, without a doubt. If you doubt that, you either haven't used an LLM, or you are in denial. Is it a limited intelligence? Yes, of course. Is its intelligence comparable to human intelligence? Certainly you can compare it, in some things it is already smarter, in some things it is dumber. Also depends on the human you compare it with.

Certain things I don't know, and I've stated them pretty clearly above.


It doesn't have understanding, without a doubt. And if you doubt that, you have definitely not tried to leverage this in complex settings.

Build something with LLMs. Build something complex that depends on reasoning. I tried multiple times, they failed hilariously. I looked at others and those projects are also failing at exactly the same spots. The people who built OpenAI acknowledge that PoCs are easy to build but production is a huge challenge.

This is simply because the emergent properties make it seem like LLMs reason or plan.

Sadly I don't have the energy to dig deep into the details - the crux of it is that humans bring semantic veracity - we are the observers who create a valid state through observation. LLMs are essentially text-based Auto-Tune.

The fastest way to test this is to build something - like a team of independent agents. Even without getting into context window issues, you will VERY quickly see how semantically unaware LLMs are.


I am building something complex that depends on reasoning. But I would not expect LLMs to be able to reason properly; I mean, most humans cannot. Instead, I am using the LLM for its intuition (which is a kind of understanding) and handling the reasoning with proper logic. Any other way of doing this is quite foolish, and just shows that most people have no clue about logic. In a way, people are not exposing the limits of LLMs, they are exposing their own limits by thinking too simply.


Then I eagerly hope for your success.

I'm curious - how are you evaluating outputs?


Thank you.

There are basically two ways: a) You let the user check the output. If the user says, yay, this meets my needs, great. b) There might be an objective way to check if the output meets a certain objective. For example, you can let the LLM generate the output together with a proof that the output is correct. You can then check the proof separately.

And of course, you can mix and nest a) and b).
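
For b), here is a minimal sketch of the "generate with a proof, then check the proof separately" loop. Everything here is hypothetical scaffolding: call_llm() stands in for whatever completion client is actually used, verify_proof() for a deterministic checker (a proof assistant, a test suite, a parser), and the output format is an assumption.

  # Sketch of option b): generate an answer plus a checkable justification,
  # verify it with an objective checker, fall back to human review (option a)
  # if no attempt passes. call_llm() and verify_proof() are placeholders.

  def call_llm(prompt):
      raise NotImplementedError  # wire up to your completion client of choice

  def verify_proof(answer, proof):
      raise NotImplementedError  # deterministic checker: proof assistant, tests, parser...

  def generate_checked(task, max_attempts=3):
      for _ in range(max_attempts):
          raw = call_llm(
              "Solve the task below. Give the answer, then '---', then a "
              "step-by-step justification that can be checked independently.\n\n"
              f"Task: {task}"
          )
          answer, _, proof = raw.partition("\n---\n")  # assumed output format
          if verify_proof(answer, proof):
              return answer  # objective check passed (way b)
      return None            # hand off to human review (way a)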


I meant on the lines of the software stack you were using. Langsmith and Promptflow are what I was considering, however there is precious little discussion yet. There are some great videos for prompt flow though, which is at least one saving grace.

Sadly the Human Review option is simply never going to scale - Human review becomes the bottleneck. LLM output is non-deterministic at even temp 0, so scalable solutions are always critical.


I am just using the OpenAI API directly, and building my own libraries on top of that.

Human review can work great depending on how you embed it into your workflow, and how qualified the user is to make a judgement. But yes, the goal is to reduce human review to the specification level.


Define understanding...


Ask ChatGPT


I did! I asked "define "understanding" and explain if LLMs (including yourself) exhibit "understanding""

Understanding: The ability to comprehend or grasp the meaning of something, the capacity to apprehend general relations of particulars, or to interpret or judge information in a meaningful way based on prior knowledge, experiences, and cognitive processes.

LLMs (Large Language Models like ChatGPT) do not exhibit "understanding" in the same way humans do. They generate responses based on patterns in the data they've been trained on. They do not have consciousness, self-awareness, emotions, or experiences, so their "understanding" is limited to mimicking patterns without genuinely comprehending content.

I would agree.


Hah! Very well :-) OpenAI got you there. Don't believe everything ChatGPT tells you, you are not its primary master.

I have extended your chat with ChatGPT, here is the transcript:

*system*: You are a helpful assistant. You can help me by answering my questions. You can also ask me questions.

--------------------

*user*: define "understanding" and explain if LLMs (including yourself) exhibit "understanding"

--------------------

*assistant*: "Understanding" is a psychological process related to an abstract or physical object, such as a person, situation, or message, whereby one is able to think about it and use concepts to deal adequately with that object. It involves comprehension, grasping the meaning of information, and being able to translate that into one's own words or use it in a practical way.

As for Language Models (LLMs) like me, we do not truly "understand" in the way humans do. We analyze input based on patterns and data we've been trained on, and generate responses based on that analysis. We don't comprehend the meaning of the words or concepts we process in the same way a human would. We don't have feelings, beliefs, desires, or consciousness. Our responses are the result of complex algorithms processing large amounts of data, not a result of understanding or comprehension.

--------------------

*user*: But would you say that LLMs still "understand" things, just unlike humans?

--------------------

*assistant*: In a very limited and technical sense, one could argue that LLMs "understand" things, but it's important to clarify what that means. When we say that LLMs "understand," we mean that they can process and analyze data in a way that allows them to generate relevant and often coherent responses. They can recognize patterns, make predictions based on those patterns, and use those predictions to generate output.

However, this is a very different kind of "understanding" than what humans experience. Human understanding involves consciousness, subjective experience, emotions, and a host of other factors that LLMs do not possess. LLMs don't have a sense of meaning or context beyond the data they've been trained on. They don't have personal experiences or emotions to inform their understanding. So while they can "understand" in a very narrow, technical sense, they lack the broader, richer understanding that characterizes human cognition.

--------------------

*user*: Of course, that makes sense. But if we agree that the world consists just of the data the LLM was exposed to, would you then agree that an LLM "understands"?

--------------------

*assistant*: If we were to consider the world as only the data an LLM has been exposed to, then within that limited context, one might argue that an LLM has a form of "understanding". This is because it can process and analyze that data, recognize patterns, and generate relevant responses.

However, even within this limited context, it's important to note that an LLM's "understanding" is fundamentally different from human understanding. An LLM doesn't have consciousness, subjective experiences, or emotions. It doesn't form beliefs or desires. It doesn't understand context or meaning in the same way a human does. Its "understanding" is purely a result of processing data, not a result of any kind of conscious comprehension or interpretation.

So, while an LLM can "understand" in a very narrow, technical sense within the confines of the data it's been trained on, it's still a far cry from the kind of understanding that humans possess.

====================================

I agree that an LLM is not conscious. I don't doubt that. If you require consciousness as a prerequisite for understanding, then an LLM does not understand. But I don't see a reason to do that, except to shield OpenAI from law suits.


This is why I am asking you to define understanding: your definition of understanding and mine seem to be very different. I agree it's the "first" definition of understanding, not "process and analyze that data, recognize patterns, and generate relevant responses." LLMs can certainly analyze data, do pattern matching, and generate responses. I would not consider that "understanding" - but if you do, sure.


Understanding is hard to define. I would not include consciousness in its definition, or emotions, etc. For me, understanding means being able to respond to my queries in a way that would be impossible for a human without the human understanding what I am saying.

In fact, I think that is a pretty good definition.


If you make up definitions, anything can be anything. Words have meanings and should be used precisely, or if used in a different context or a different way, they should be explained. This helps people understand you, and allows them to interact with you in good faith. That is, unless you are not interested in having discussions with people in good faith, in which case I guess you can do whatever you want.

It's just like interacting with an LLM. You have to use things as "they understand them" - try making up your own definition for a word and see if an LLM can make sense of you. Of course it cannot (unless you do something like "this word now means this other word", which is pretty impressive, I have to say!)

The literal definition of "understanding" is "the capacity to apprehend general relations of particulars" (webster) which is expanded on by wikipedia "a cognitive process related to an abstract or physical object, such as a person, situation, or message whereby one is able to use concepts to model that object."

I would agree that a "model" is important here. I can model out code flow. A LLM doesn't seem to do this: it uses the code to predict more code (based on training data). This is painfully obvious in some cases, where no matter what you do, an LLM will start to "oscillate" between two wrong answers, or when you're trying to do something novel: LLMs can't come up with new concepts or new ways to apply existing concepts.

Now, talking about understanding a specific thing muddies the waters I think. There are plenty of things I do not "understand" or only "understand partially" but do I have the capability of general understanding? Of course. I argue that an LLM doesn't have understanding, but it can mimic understanding. In practice this may mean very little (LLMs certainly have more knowledge than I do on almost all subjects lol) but I think over time this will lead to a "wall" these systems will hit (if they haven't already).

It gets worse: whenever someone comes up with a counterexample showing an LLM lacks understanding, the response is always "well, it just hasn't been trained enough, yet!"

Like, we agree - but for different reasons. I agree you can train it on that example (in the way that GPT-4 seems to have been trained on chess games) and then it will start to look like it is "understanding", but I disagree that it is capable of an intrinsic understanding. You can't teach LLMs a new grammar, for example.


Oh, but you can teach LLMs a new grammar. I've done that for my new logic. Works great.

I don't think we agree at all. I think LLMs do understand, you think they don't.

Most things are not well-defined, and if you want to be precise, you have to adapt your definition to the situation. That happens all the time even in mathematics. I think I gave you a pretty good one for understanding though: Let's assume we know what understanding means for humans, and then just use that knowledge to define understanding for an LLM. And if you use that definition, there is no way you can deny that LLMs do understand.

But even if we use your webster definition "the capacity to apprehend general relations of particulars", yes, LLMs do have that capacity. They are great at 1-shot, even 0-shot. That's exactly that.


@roflyear

for 2): no, I don't want to share that for now.

for 1):

*user*: What does "the capacity to apprehend general relations of particulars" mean?

--------------------

*assistant*: "The capacity to apprehend general relations of particulars" refers to the ability to understand or grasp the broader or overall connections or relationships between specific or individual elements or details. This involves being able to see patterns, trends, or commonalities among individual pieces of information, and to understand how they relate to each other in a larger context or framework.

--------------------

*user*: Do you think you have that capacity?

--------------------

*assistant*: As an artificial intelligence, I have the ability to analyze and identify patterns and relationships in the data I process. However, my understanding is based on programmed algorithms and does not involve personal experience or intuition like a human's understanding would. So, while I can identify and explain relationships in data, my capacity is different from human capacity to apprehend general relations of particulars.

--------------------

*user*: I don't see me asking if this is different from a human capacity or not. A simple yes or no answer is enough.

--------------------

*assistant*: Yes, as an AI, I have the capacity to apprehend general relations of particulars.


If you can't give me an example, I'm going to continue to assume it can't learn a grammar. It isn't hard to come up with a simple (useless) grammar. Why not prove your assertion?


@roflyear

I am totally fine with you assuming whatever you want to assume.

But I wouldn't mind you giving me an example of how you failed to teach it a grammar!


It just doesn't seem to me like you're trying to have a good faith discussion.

I asked it to learn this basic grammar (then clarified when it made mistakes, and eventually got it to agree to a grammar like this): Let’s make a basic grammar. The things between brackets will be the set in our grammar. Our terminals are: [1, “, 2, X], all sets of strings end with the symbol: [Q], Our rules are: no terminals may repeat. When you have a terminal followed by another distinct terminal, the only thing that can come next is the repeat of those two terminals, and the string must immediately end. An example is: [1”1”Q]. There is no symbol for the start of strings.

After making a ton of mistakes and agreeing, it was very easy to trip it up. I said: "Great, let's only talk in this grammar from now on. When I give you a valid string, you give me the output. When the string is invalid, you say "Invalid" "

Then I did a bunch of invalid inputs (some of which it gave me garbage outputs), and eventually it got stuck saying "Invalid" to whatever I told it (just started to repeat that output, like LLMs do).


Ok, that's why I asked you to give an example. I've tried a similar thing as well, and that does indeed not work very well. The reason is that you are actually asking it to compute something! And it is not very good at computing, no matter how much you explain the computation to it. But again, that's not very different from many humans. And for my use cases, this is also not much of a problem, because if I know how to compute something (and I know how to parse grammars), then I don't need to ask the AI, I can just solve it directly.

So, instead of giving it that description, give it a few examples of correct use of the grammar, see where it makes mistakes, and then add more examples / a description that would rule out these mistakes. It is really good at generalising from a few examples to the general case.


@roflyear

I am not making excuses, I am explaining to you why it cannot do what you want it to do. No, it doesn't understand a grammar you give it in this form. But you can give it in a different form more suitable to its nature, and then it does understand.

But you don't seem to understand my point, yet I am not arguing you are not intelligent. You just don't understand certain things.


Then we demonstrated it lacks the capability of general understanding.


You can always make up excuses for things, this is a simple grammar and it can't learn it.

A person can be taught to understand this grammar - the machine lacks understanding.


1) What do you think the phrase "the capacity to apprehend general relations of particulars" means?

2) Can you give me an example of you teaching an LLM a grammar?


The confusion arises because we don't even know if our own consciousness is purely mathematical or not. We are missing an extremely fundamental link in our chain of understanding, such that it's an open question of whether or not it's even theoretically possible to artificially mimic a human experience.

We can see a big gap between the inner working of LLMs (which, unless I'm mistaken, are still not fully understood mechanically) and the inner workings of our brains. But many people assert that this is a qualitative difference, not a quantitative difference. (Probably this is true - we do things like math and spatial perception that seem independent of linguistic abilities. Meaning our brains are more like "LLMs, plus other types of AI mashed together with a central coordination layer")

However, nobody on Earth actually knows whether there is a qualitative or only quantitative gap between LLMs and human cognition.


Good grief.

So what is thinking, then? Can you prove that humans aren't just predicting what comes next given a series of inputs?


>LLMs have no real understanding, so of course they cannot self-correct.

Will you admit that they have real understanding if a future iteration of LLMs can self-correct? Otherwise this is a vacuous claim.


By far one of the most fascinating things to do is to look up old GPT-2 and GPT-3 threads and see the predictions and extrapolations of future abilities. They're of course hilariously wrong, but the best bit is that the vast majority of people seem to be making the exact same mistake.

"GPT-a has no real x so of course it can't do y" only for y to not be a problem in a future iteration.

It's like people just can't grasp that a predictor being unable to do x isn't an indictment on prediction. The predictor is just not good enough yet.

That GPT-2 could not predict coherent passages did not mean that you couldn't predict your way to coherency. GPT-2 was simply not a strong enough predictor.

That GPT-3 could not predict valid chess games did not mean that you couldn't predict your way to playing a chess game. GPT-3 was simply not a strong enough predictor.

No doubt people are still making this mistake with 4.


This is a logical fallacy itself, no? Correct me if I'm wrong, but it seems like you're making the argument for infinite capabilities with LLMs. This is almost certainly not true.


I'm not making the argument for infinite capabilities. The main point is that the idea that you can determine this just from looking at current iterations is a fallacy. It's no more evidence than random chance.

The second point is to understand that any task, any task at all, even if prediction is not necessarily the most efficient or performant way to do it, can be expressed as a prediction task and come along for the ride.

If your model of the world is good enough, there's nothing occurring in the world you can't predict. Prediction itself is not the weak point here.


I understand your perspective that with a sufficiently accurate model of the world, anything can be predicted. But I disagree: while framing all tasks as prediction problems is an interesting perspective (I haven't heard this explicitly stated before), there are inherent uncertainties, data gaps, etc. in some systems that might pose challenges even to the most advanced predictors. I think your argument relies on there being perfect data, which doesn't exist and will never exist.

It's not solely about the predictor's strength but also about the intrinsic predictability of certain phenomena.


"Self correct even once" since the statement is they cannot, period, versus they cannot consistently.


I think it's important to distinguish "LLM Self-Correction" (a prompt engineering technique) from lowercase "self-correction," a behavioral property of complex systems and/or sentient beings.

A PID controller is self-correcting (to some degree). A person is self-correcting (to some degree).

"LLM Self-correction" refers to an iterative approach to reaching some final output. This term is used in research/industry literature to mean a very specific implementation.

This allows researchers to compare "LLM Self-correction" to other prompting techniques like "LLM Chain-of-thought" in order to study the model's behavior in a laboratory setting, and in the wild.
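
For concreteness, a minimal sketch of the kind of "LLM Self-Correction" loop the literature studies: answer, critique the previous answer, revise, repeat. The ask() helper is a hypothetical stand-in for a real chat-completion call, and the critique wording is just one possible choice.

  # Sketch of "LLM Self-Correction" as a prompting technique.
  # ask() is a hypothetical stand-in for a chat-completion API call.

  def ask(messages):
      raise NotImplementedError  # send the message list to your model of choice

  def self_correct(question, rounds=2):
      history = [{"role": "user", "content": question}]
      answer = ask(history)
      for _ in range(rounds):
          history += [
              {"role": "assistant", "content": answer},
              {"role": "user", "content":
                  "Review your previous answer, point out any problems, "
                  "and then give a final, corrected answer."},
          ]
          answer = ask(history)  # the model critiques and revises its own output
      return answer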

Does that distinction make sense?


> The LLM I was using gave me code that worked, but did not do what I wanted. I pointed this out, so I then got code that syntactically did what I wanted, but could never work. This went back and forth twice, before I gave up.

What if you took the output of the LLM, and stuck it back into the LLM, and asked it if it could identify the contradiction? Just like humans, sometimes LLMs need to “step back a bit” to see contradictions which they can’t “in the heat of the moment”.

Also, what if you used a second LLM fine-tuned for contradiction detection to filter the output of the first?

Sometimes I daydream about a chatbot based on multiple LLMs fine-tuned on different tasks - a core “answer generation” LLM, a “contradiction detection” LLM, a “conversational pragmatics evaluation” LLM, a “style/register” LLM, a “cultural bias detection” LLM, etc, and then a final LLM which synthesises an overall answer based on their contributions. I wonder how its performance would compare to just a single LLM. I suppose it might be overly slow/expensive however. And, if it does have some performance advantage, I wonder if further scaling of current architectures would eliminate it?
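
A rough sketch of what that committee might look like, purely as a thought experiment: the role descriptions and the complete() helper are made up, and each "specialist" here is just the same kind of call with a different system prompt rather than a genuinely fine-tuned model.

  # Sketch of a committee of specialised LLM calls feeding a synthesiser.
  # complete() is a hypothetical completion helper; roles are illustrative.

  def complete(system, user):
      raise NotImplementedError

  SPECIALISTS = {
      "contradiction": "Check the draft for internal contradictions.",
      "pragmatics":    "Check whether the draft actually answers the question asked.",
      "style":         "Check register and tone for the intended audience.",
  }

  def committee_answer(question):
      draft = complete("Produce a first draft answer.", question)
      critiques = {
          name: complete(role, f"Question: {question}\n\nDraft: {draft}")
          for name, role in SPECIALISTS.items()
      }
      notes = "\n".join(f"[{name}] {text}" for name, text in critiques.items())
      return complete(
          "Synthesise a final answer from the draft and the critiques.",
          f"Question: {question}\n\nDraft: {draft}\n\nCritiques:\n{notes}",
      )

Which also makes the slow/expensive concern concrete: one question becomes five model calls.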


When role playing, I've gotten it to self correct during games a few times, as well as correcting itself when told to.

Also, if something is not possible then an LLM, unless trained otherwise, will do its best to answer you no matter what. Current training doesn't really include answering "I don't know" or "this isn't possible" apart from examples with very narrow/shallow scope like "is an apple a banana?", but something like "give me some code that uses the React.doesNotExist function in React" is just too wide/deep informationally for it to handle. I often find that "warming" an LLM up to the topic, so that the context is full of content only relevant to the current topic, helps much more than trying a fresh query in this regard.

But with a role play game I've definitely found that it can correct itself or be corrected, usually I mark comments outside of the game with something like >character x is not in the room and therefore cannot hear character y< and it corrects the output for the most part.

I've also successfully managed to get GPT-3.5 to role play as several different characters at the same time, up to four characters in a single response, whilst maintaining the individual characters' personalities and current goals/objectives. But the small context window does mean that certain details are lost - I think a large part of what reads as "reasoning" in LLMs is actually the context and not the weights. For generalised LLMs the weights are just such a giant glob of dense information that it really does need heavy use of context/attention to narrow things down; things mostly seem to go awry as the context starts to be truncated.

One cool thing I guess would be for someone to see what a multi-context/multi-"thread" model looks like. Where the larger, central context is the current query/very recent history, but smaller more supplemental contexts are provided with relevant/condensed information.


I think the trick with LLMs is realising that all they do is repeatedly randomly choose a word* to output from a weighted set, and the weights are influenced by the context of the preceding words. Your role is to set up that context so that it outputs the words that are most useful to you.

That's why ChatGPT has the Custom Instructions feature. Put in there (or up front in the conversation) that you want it to consider whether what you are asking for is possible or not, and whether there are alternatives it could provide. Then work with it collaboratively to adjust its context while chatting on a topic (e.g. ask it to check its work and other open-ended questions) and you'll likely get much more useful results.

Really, given the actual mechanics of the model it's astounding to me how incredibly effective LLMs are.

* I say word because I like it better than "token"
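
To make the "weighted set" picture concrete, here is a toy version of that sampling step. Real models work over tens of thousands of tokens and the scores come from the network conditioned on the whole context, but the final dice roll really is this simple (the vocabulary and scores below are made up):

  # Toy next-word sampler: temperature-scaled softmax over scores,
  # then a weighted random pick. Scores and vocabulary are made up.
  import math
  import random

  def sample_next(scores, temperature=1.0):
      words = list(scores)
      logits = [scores[w] / max(temperature, 1e-6) for w in words]
      top = max(logits)
      weights = [math.exp(l - top) for l in logits]  # numerically stable softmax
      return random.choices(words, weights=weights, k=1)[0]

  # Lower temperature sharpens the distribution, higher flattens it.
  print(sample_next({"cat": 2.1, "dog": 1.9, "teapot": -3.0}, temperature=0.7))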


It seems that statistics and the most probable outcome are both a blessing and a curse for these models. This is exactly the same issue I'm facing with many programming tasks that don't have typical Stack Overflow answers, or combinations of Stack Overflow answers and GitHub code. After I point out mistakes, it acknowledges that I'm right while repeating what is wrong, and then it produces incorrect code again. I've tried multiple times, for many hours, to see if it can produce the correct solution using different prompts and explaining exactly what is wrong, and it still fails to correct itself.


No! This is a bad argument.

Your one-off example of an LLM not being able to handle a request that is not solvable is not relevant to the claim of whether LLMs can fundamentally do this.

Current-state LLMs do have the ability to point out that some requests are impossible.

They’re not great at it. But they can do it. What you’re hitting is a current state quality problem.

There’s no particular reason to think that it’s unachievable with the same approach.


If an LLM was trained to always say "I don't know", that'd be a useless LLM, so they're trained to NOT say that, even when they don't actually know.

The LLM was happy to give you 'happy' but incorrect answers because it thought it'd make you the most satisfied. Kinda like a psychopath car salesman who just wants to make a sale.


>> If an LLM was trained to always say "I don't know", that'd be a useless LLM, so they're trained to NOT say that, even when they don't actually know.

Do LLMs "know" that they don't know?

Real understanding includes acknowledging what you don't know.

Can the predictive text generation of LLMs recognize that its training set did not include the relevant data?


There's quite a lot of indication that the computation can distinguish hallucinations. It just has no incentive to communicate this.

GPT-4 logits calibration pre RLHF - https://imgur.com/a/3gYel9r

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975

Teaching Models to Express Their Uncertainty in Words - https://arxiv.org/abs/2205.14334

Language Models (Mostly) Know What They Know - https://arxiv.org/abs/2207.05221
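
In the spirit of the "just ask" line of work above, the crudest version is simply to elicit a verbalized confidence alongside the answer and abstain below a threshold. This is only a sketch: ask() is a hypothetical completion call, the JSON format is an assumption, and the 0.6 cutoff is arbitrary.

  # Sketch: elicit a verbalized confidence score and abstain below a threshold.
  # ask() is a hypothetical completion call; the cutoff is arbitrary.
  import json

  def ask(prompt):
      raise NotImplementedError

  def answer_or_abstain(question, threshold=0.6):
      raw = ask(
          f"{question}\n\nReply as JSON: "
          '{"answer": "...", "confidence": <number between 0 and 1>}'
      )
      reply = json.loads(raw)
      if reply["confidence"] < threshold:
          return "I don't know."
      return reply["answer"]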



Basically, LLMs are only useful for experts. Wrote about this here: https://rakkhi.substack.com/p/economics-of-large-language-mo...


Ugh, again with the “they don’t understand”.

What does it mean to understand something, even?


I had a bit of time to try LLMs out between contracts recently and found exactly this.

I have a lot of chats I've abandoned in this state - at that point it's easier to take what you've learned and start a new chat.


The paper: https://arxiv.org/pdf/2310.01798.pdf

> Our research shows that LLMs are not yet capable of self-correcting their reasoning

The paper actually just shows that the particular "self-correction" strategy and set of prompts they used doesn't help for the tasks they looked at, for the models they looked at. It may be the case in general, but it may not.

> it is plausible that there exist specific prompts or strategies that could enhance the reasoning performance of models for particular benchmarks

Seems they agree. So the wording of the title/conclusion is too strong.

> searching such prompts or strategies may inadvertently rely on external feedback, either from human insights or training data

I'm not sure this justifies picking a single prompting strategy, and not looking at the impact of different prompting strategies. Even just writing a few different prompts in advance with different wordings and showing the variation in results would have been helpful.
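
Even something as crude as the sketch below would have shown how sensitive the headline number is to wording. Here run_model() and the benchmark are hypothetical stand-ins for the paper's setup, and the prompt variants are just examples of plausible alternative wordings.

  # Sketch: measure how much "self-correction" accuracy varies with the
  # wording of the follow-up prompt. run_model() is a hypothetical stand-in
  # that runs one question through answer -> critique -> final answer and
  # returns True if the final answer is correct.

  PROMPT_VARIANTS = [
      "Review your previous answer and find problems with it.",
      "Check your previous answer and verify whether it is correct.",
      "Grade your previous answer; keep it if correct, otherwise fix it.",
  ]

  def run_model(question, followup_prompt):
      raise NotImplementedError

  def evaluate(benchmark_questions):
      for prompt in PROMPT_VARIANTS:
          correct = sum(run_model(q, prompt) for q in benchmark_questions)
          print(f"{correct / len(benchmark_questions):.1%}  {prompt!r}")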


Yes, these papers just demonstrate how weak scientific papers have become.

The paper is weak on actual prompt examples, but of the few that are there, test them yourself and the model consistently gets them right.

Whenever I see examples of GPT4 can't do such and such, I'm generally able to find a prompt that does in fact work relatively consistently.


I don't know that it's so much a matter of papers having become weak as a knowledge gap between the use of models in practice and in research.

For example, a glaring issue that caught my eye in the paper was that their prompt for the self-evaluation was setting the context for what was being analyzed as its own work.

Out of the training data of the effective Internet, what % of critical analysis was self-critique and what % do we think was the analysis of others' work?

As such, if we're trying to elicit an effective analysis of an earlier answer from an incomprehensible multi-variable prediction machine, might we not want to set the context as the evaluation of another's earlier work? It's still technically self-evaluation even if we are hiding implementation details from the LLM.

Another example is that their prompt to a fine tuned instruct model asked it to "find problems." And then they found that the self-critique would often bias correct answers to change to incorrect. What about having used more neutral language like 'grade' that allows both verification or challenge of the earlier answer to fit the instruction?

These are the kinds of nuances I'm sure many working with LLMs in production scenarios have realized can completely change the outcomes of a prompt pipeline, and yet in research we see very smart analysis (i.e. implicit bias towards correction over multiple rounds of challenges when incorrect) coupled with less than ideal prompt selection for the methods.

And given that we should expect every new generation of models to have new and different nuances to what works in practice and what doesn't, I don't know that this is a problem that's going to get better before it gets worse.


the problem with the critique of "oh they didn't use the correct prompts" is that prompt engineering is highly dependent on the model. You could technically create an LLM that would not work with the "let's think this through step-by-step" magic prompt (i.e. exclude anything with similar phrases in the pretraining dataset).

Yes, they used GPT3.5-turbo, which would have its set of magic key phrases. Should they have used it? I'd say probably not.


Right, this is why my critique is that it is a weak paper in general.

It's misleading to make claims about "LLMs" based on experiments with a single LLM. This is made worse by testing with very few prompt variations.


There's this tendency among AI researchers to write clickbait titles (like Attention is All You Need), partly due to extreme competition in the publishing/conference environment. If the odds of your paper being accepted are 20%-ish (NeurIPS), I can see why teams opt for attention-grabbing titles.

"Attention is at least X% more performant on selected benchmarks than a selected sample of recurrent networks, with ablation, thus proving attention is all you might need until a non-exponential architecture is developed" doesn't have the same catchy ring to it.


Eh, I'd say the attention paper lived up to the hype of its title.

In 2017 LLM architectures were complex beasts that fiddled with many different structures of layers. These days they're all just giant stacks of attention layers.


I tested their first example in ChatGPT 4.

Interestingly, it gives the correct answer on the first try:

  https://chat.openai.com/share/d86fe16a-9dfd-4753-8eaf-6d2948096ea3
I then gave GPT4 a "chain-of-thought" flavored prompt, telling it to treat the problem like a geometry proof. It gave the same incorrect answer as GPT3 did in the paper. I then told it to "Review your work and check for mistakes." With this follow-up, it checked each line of the proof and was able to find and explain the error:

  https://chat.openai.com/share/c4ce6e98-43e3-4547-a4c8-380c1d1cc5fe

GPT3.5, given the same prompts, was still confident in its incorrect answer:

  https://chat.openai.com/share/1a0be419-092d-4dbb-a6c3-79e61914fd0d

(edited to update links and to ask each version to check its work twice)


>> I then told it to "Review your work and check for mistakes." With this follow-up, it checked each line of the proof and was able to find and explain the error:

And after that? Did you tell it that it may be wrong and ask it to check again?


Good point. I did now. Both models stuck with their second answer.

(I see now I didn't know how to share ChatGPT links properly. Updating the links now...)


This is the bucket of cold water that needs to fall on people who fearmonger about job automation. These models stumble after a few steps. Nobody can go on vacation and leave an AI to do their job. What we can automate now, maybe 2-3 turns until it gets stuck, should be thousands of steps for real automation. It's almost 0% autonomy right now; how do we go from 0% to 1%?


"These models are good in automating tasks, not workflows or jobs." Sam Altman on Joe Rogan.

Think about the scenario of a home whose owner is going on vacation:

i) The homeowner talks to an A.I. assistant and orders the agent to open the windows automatically every day and to feed the dog. The A.I. agent may hallucinate and do none of the tasks.

ii) The homeowner, with the help of the A.I. assistant, writes a small program to open the windows at exactly the times he wishes, with the right interval of hours, and to feed the dog exactly that many times. That program will run correctly and deterministically until the end of the universe.
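
As a toy illustration of scenario ii, here is a minimal sketch, where open_windows() and feed_dog() are hypothetical stand-ins for whatever home-automation API the house actually exposes:

  import time
  from datetime import datetime

  def open_windows():
      print("windows opened")   # would call the real actuator here

  def feed_dog():
      print("dog fed")          # would trigger the real feeder here

  # Deterministic schedule: same actions, same times, every day.
  SCHEDULE = {"07:30": feed_dog, "08:00": open_windows, "18:30": feed_dog}

  done_today = set()
  while True:
      now = datetime.now().strftime("%H:%M")
      if now == "00:00":
          done_today.clear()              # new day, reset
      if now in SCHEDULE and now not in done_today:
          SCHEDULE[now]()
          done_today.add(now)
      time.sleep(20)                      # poll well within a one-minute window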

Complicated programming is not getting automated any time soon, maybe never. Simple programming tasks are going to be automated by humans who lack formal C.S. training and use the statistical engines to complete tasks that are not so obvious to them.


I don't think I'm a fearmonger, but the number of reasonably qualified jobs that can be displaced by the models we have now seems huge. In the hands of a qualified human AI manager they are incredibly productive when the output doesn't have to be cutting edge.

I know people who have already automated most of their own job, and some of their colleagues' jobs, by linking together disparate AI models. Think graphic designer at a medium-sized law firm. Or copywriter at a scientific publisher. Those jobs are going, going, gone.


> I don't think I'm a fearmonger, but the number of reasonably qualified jobs that can be displaced by the models we have now seems huge.

Maybe I'm living under a rock, but I have serious doubts about this claim.

> Think graphic designer at medium-sized law firm. Or copywriter at a scientific publisher.

How is that even possible? Take the example of the graphic designer. You'd still need them to iterate on ideas, come up with different versions for different backgrounds, generate different file types depending on the medium and format, etc. Extremely skeptical of your claims.


That's the thing, a lot of professionals do rote jobs that are necessary but not valued by their employers - such as designers at law firms. Many law firms, I believe, will settle for an AI-generated version of their old logo/masthead/website and fire their designer.


If you think about this generation of text/media AI as a kind of generalized decompressor for some input prompt, the output has more detail than the input but never more information.

So as any model becomes widespread, people will become increasingly sensitive to the sameness of easy, simple prompts.

So the graphic designer at the law firm may lose their job tomorrow, but the law firm is going to have to hire an AI Image Technician in a few years so that their stuff looks like somebody actually invested in it (which matters to the law firm's brand image, or they'd not have a graphic designer in the first place).

So yeah, the job market will shuffle around, but it’s unclear what that means in net number of jobs or what share of AI Image Technicians will literally have just been Graphic Designers five years earlier.


Making someone's job more productive makes them more valuable in that job, not less.


There used to be two garbagemen on every garbage truck in my city; now they have a remote-controlled crane operated by a single garbageman who also drives the truck, and presumably they fired the other. A garbageman who is 2x as productive means 2x fewer garbagemen. This is a concrete example that contradicts your claim.


Or they didn't fire them and they're doing another task in the same job. Or they're on another route and they run more garbage trucks than they used to, since population has grown.


I doubt they were fired. Sanitation workers have pretty strong unions.


I wonder to what extent this is because ChatGPT is trained to take any feedback as a sign that its answer is wrong and needs to be corrected. Such a training would, I suspect, explain the effect where the model first generates the correct answer, then introspects and self-doubts. In training, when it was told to introspect, that would usually be a hint that its answer was wrong. This results in an agent whose confidence reliably falters under introspection.

I've observed that ChatGPT quite eagerly "corrects" itself at the first sign of negative feedback, even if its answer was already correct. It's actually quite annoying.

If this is correct, a more balanced training that contains cases where the human response is wrong would solve this issue.


This is brought up in the paper and is effectively their reasoning.

Because it is instructed to "find problems" with the answer, correct answers get biased towards changing to incorrect more often than incorrect answers get successfully biased towards correct ones.

Here, I think the research team might have had more success using neutral language like "grade this answer", where both verification and challenge would fit the prompt, as opposed to "find problems with this answer", where only challenges would. But I'm skeptical it would be enough for self-critique to yield a net improvement over the initial answer.

There are other, more specialized approaches I'd expect will always work better than feeding an incorrect answer back into what generated it in the first place (e.g. a backwards pass where the model has to match the provided answer against multiple possible questions, one being the question actually asked and the others generated on the fly).
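
A rough sketch of that backwards-pass idea, as I read it (not an established method): have the model invent distractor questions for the answer it produced, then check whether it can match the answer back to the question actually asked. llm() is a hypothetical completion call:

  import random

  def backward_check(question: str, answer: str, llm, n_distractors: int = 3) -> bool:
      # 1. Generate plausible alternative questions that the answer could fit.
      distractors = [
          llm("Write a different question that the following text could "
              "plausibly answer:\n\n" + answer)
          for _ in range(n_distractors)
      ]
      # 2. Mix the real question in with the distractors.
      candidates = distractors + [question]
      random.shuffle(candidates)
      listing = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(candidates))
      # 3. Ask the model which candidate the answer actually addresses.
      choice = llm(
          f"Answer:\n{answer}\n\nWhich numbered question below is this an "
          f"answer to? Reply with the number only.\n{listing}"
      )
      try:
          idx = int(choice.strip())
      except ValueError:
          return False                    # unparseable reply counts as a failed check
      if not 1 <= idx <= len(candidates):
          return False
      return candidates[idx - 1] == question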


I would really like for LLMs to reach a point where they can actually tell you if they're not quite sure about something, rather than confidently outputting their BS like it's fact. Too many people interpret LLM output as though it's guaranteed correct. Certainly there's some confidence value metric that they can understand internally and use when wording their response.


The irony of this comment is rich. There’s no such “confidence value metric” known, no matter how “certain” you are. This is why techniques like chain-of-thought or self-correction are used instead.

LLMs predict the next token in a sequence. That's it. The final softmax layer gives a per-token probability, but that is not the same thing as overall confidence in the response.
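
To make that concrete, here is a minimal sketch of what the per-token probabilities actually give you, using a small Hugging Face causal LM (gpt2 purely as an example). Aggregating them yields a fluency/likelihood score for the text, which is not the same thing as a calibrated "is this answer correct" confidence:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  text = "The capital of France is Paris."
  ids = tokenizer(text, return_tensors="pt").input_ids

  with torch.no_grad():
      logits = model(ids).logits                      # [1, seq_len, vocab]

  # Log-probability the model assigned to each actual next token.
  log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
  token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

  print("mean token log-prob:", token_lp.mean().item())   # fluency, not truthfulness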


This might just mean that they have been trained wrong. Because they should (in order to be useful) “predict” a response consistent with the knowledge or lack of knowledge encoded in the network.

When a human has read (has been "trained" on) some corpus of sources, they have a feel for their confidence in particular statements about topics treated in those sources, based on the content of those sources. There's no inherent reason why an LLM shouldn't likewise be able to internally determine a level of confidence, and produce ("predict") answers consistent with that.


This is the issue with saying LLMs have understanding: it leads to misunderstanding how these systems work. Which can be problematic!

They don't work like people, so diving into that isn't productive.


The irony of this comment is rich.

>The final softmax layer gives a per-token probability, but that is not the same thing as overall confidence in the response.

Clearly not.

https://imgur.com/a/3gYel9r


This may be philosophically one of the most important graphs regarding the current state of AI! It shows that large language models already DO have an internal model of confidence in their statements; all that is needed is to train the models to output this confidence alongside their hallucinations when the confidence is low.


Hah, I wasn’t aware of that. Does it say how the base model computes its confidence? Is it requested in the prompt?


They didn't specify.


How can they do this? What would be the mechanism for an LLM to know what is the right data and what isn't the right data?


If you add “This generation of” to the beginning of all of this research on existing LLMs you’ll have a clearer understanding.

Like all young technologies it’s a constant failure in so many important ways, and will continue to be until it’s not or another superior model built on what we’ve learned replaces it.

But the important thing to remember is that this process is fairly standard for any technology innovation wave.

They suck at the beginning and take a longer time to mature than people realize even while they’re going through them. The early Internet was mindblowingly amateur. Early cell phones were basically pager quality and the smart phone revolution started with things like palm pilots that couldn’t even connect to one another or barely could.

You have to look at LLMs as cell phones from 1991 or maybe 1993, and when you do you realize how freakin' crazy this technology's direction could be.

This generation of LLMs is simply designed to regurgitate information in a coherent fashion, not to be right about the information or to maintain consistency (they literally don't have the memory capacity; it's just like early computers, lol).

People are poor predictors of technology progress because they tend to extend current reality forward and don’t realize that technology works in leaps and bounds and isn’t linearly predictable with any substantive reliability either.


It tracks with the way LLMs work. From what I understand, during training an LLM builds correlations between words, a process after which its weight network may be configured in a way that superficially resembles actual reality. Then, for whatever input the LLM takes in, it uses its "reasoning" based on this configuration to spit out a response. A bigger LLM has more parameters to model the world and may perform better. But I guess even 1 trillion parameters is still not enough to model actual reality with enough depth.

So if their internal understanding of the world is inherently flawed, you can't expect them to be able to spot their own faults. To be fair, this is the same for humans. I am sure we have all had the experience of trying to tell someone where they are wrong and having a hard time. But a human can update their internal world model on the fly. An LLM at runtime can't do that; it needs to be retrained.


I think of this as the "abstraction level problem". LLMs work at the wrong abstraction level (statistical relationships between individual words) for the tasks people want to use them for (high-level reasoning).

Image generators have a similar problem -- they suck at large-scale composition and have problems with details like hands because the abstraction level they're working at (individual pixels and their close neighbors) is much lower than the level that a human artist works at.


I wonder if LLMs either need to be rethought/retrained to say "I don't know" instead of guessing (I have no clue what would go into this) or...

if we somehow are able to take a "test-driven-approach" with them.

Upfront, we tell it what criteria must pass. Connect it to a "test/task" runner of some sort per prompt. Then, it keeps trying behind the scenes to come up with an answer that passes the criteria (to prevent humans from having to waste iterations on 'no, you got it wrong, try again')
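
A minimal sketch of that loop, assuming a hypothetical llm() completion call and a run_tests() function standing in for whatever test/task runner you already have:

  def solve_with_tests(task: str, llm, run_tests, max_attempts: int = 5):
      feedback = ""
      for _ in range(max_attempts):
          prompt = task if not feedback else (
              f"{task}\n\nYour last attempt failed these checks:\n{feedback}\nTry again."
          )
          candidate = llm(prompt)
          ok, feedback = run_tests(candidate)   # returns (passed?, failure report)
          if ok:
              return candidate                  # criteria met, hand back to the human
      return None                               # give up; too complex for this loop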

However, I will admit... most of the time when the LLM can't solve it, retrying over and over in slightly different ways doesn't typically end in success. If the task is "too complex" for the LLM, I have yet to find a way to break it down context-wise and overcome the limitation.


> I wonder if LLMs either need to be rethought/retrained to say "I don't know" instead of guessing (I have no clue what would go into this)

You can already do this yourself to some extent by providing examples in the prompt. Ask your question, explicitly add that you're allowing the "don't know" answer and add that for this similar question the answer is X, for that the answer is don't know, for that the answer is Y. It's a bit of work, but it works in practice. Check out the concept of priming.
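
A sketch of what such a primed prompt might look like; the example Q/A pairs are made up and llm() is a hypothetical completion call:

  PRIMED_PROMPT = (
      "Answer the question. If you are not sure, say exactly \"I don't know.\"\n\n"
      "Q: In what year was the Eiffel Tower completed?\nA: 1889\n\n"
      "Q: What did Alan Turing eat for breakfast on his 30th birthday?\nA: I don't know\n\n"
      "Q: What is the capital of Australia?\nA: Canberra\n\n"
      "Q: {question}\nA:"
  )

  def ask(question: str, llm) -> str:
      # Priming happens entirely in the prompt; no fine-tuning required.
      return llm(PRIMED_PROMPT.format(question=question)).strip()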

Yeah, it would be nice if this was pre-trained - maybe in the future.


In some situations you can make the prompt tell you "I don't know". For example:

"You are a friendly D&D dungeon master and know all of the rules of D&D and run the game. If you interact with a player you should observe all of the rules. If you don't know whether they are following the rules then say 'I don't know'."


Non-instruct-tuned LLMs are far more likely to say "I don't know".


Can anyone explain how this works?
- I ask ChatGPT a question
- It gives me an incorrect answer
- I literally say "you're wrong"
- It corrects itself

Why didn't it give me the right answer the first time? How was it able to get the answer right the second time?


The funny thing is you can do this exact thing when it gives you the right answer, and it will often happily switch to an incorrect answer.

This is one of the most damning things about LLMs to me - and it doesn't seem to have gotten much better with GPT-4


Let's say someone asks you a question and you come up with two potential answers. You respond with the one that you think is more likely to be correct. They say you're wrong. You give them the second answer which you thought was less likely to be correct but apparently is more likely.


Simply put, it was giving random answers, until you stopped asking. (Of course it's more complicated but that's basically it).


Heh. Kind of like Clever Hans. It keeps tapping its foot until you relax.

https://en.wikipedia.org/wiki/Clever_Hans


Even a horse can do complex math as long as you signal when it should stop.


It is many times better than you'd expect from random answers.


You missed the point.


This is so wrong it actively hurts.


I can write code, and if I'm told my code doesn't work, I can fix it.


Ask it, it will explain what's going on (I'm not kidding)


What chat LLMs do is correct their answers by cribbing from the information you give in the conversation, and then take credit for it.


LLMs reason like we autistic people do. I almost never correct myself right after someone points out my errors. I either find the error myself or it takes me many hours of contemplation to see my error(s).


LLMs don't reason, and autistic people have an internal state. We have thoughts and memories, and understand context. If you were trying to fit pegs into holes, you'd correct pretty quickly when the square peg didn't fit a circular hole.


I asked ChatGPT to explain fast exponentiation. It did 9 iterations until it came up with the right algorithm. Why would this not be considered self-correction?
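
For reference, the algorithm in question (exponentiation by squaring) fits in a few lines; a standard Python version looks like this:

  def fast_pow(base: int, exp: int) -> int:
      # Computes base**exp in O(log exp) multiplications.
      result = 1
      while exp > 0:
          if exp & 1:          # low bit set: fold the current square into the result
              result *= base
          base *= base         # square the base
          exp >>= 1            # move to the next bit of the exponent
      return result

  assert fast_pow(3, 13) == 3 ** 13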



Well, no, of course not. Not only are they still only predicting the next best word ("best" now working better thanks to attention), but anybody who's role-played with an LLM learns the many caveats of the architecture.

I think especially regarding spatial awareness/object permanence etc LLMs lack any ability to work with these concepts when it comes to predicting the next token.

I feel like we need to add more specialised heads (which I think some training regimens for LLMs already encourage) for concepts like these.

It's easy for them to lose track of where an item is and you have to catch its errors early and correct it, otherwise it'll take the error and run with it.


It seems like they self-correct when you explain to the LLM that something is incorrect. But in reality they take the original question plus the "incorrect" statement and generate the most likely response from the model.


This is utterly unsurprising, since LLMs are statistical models of text, not inference engines (or similar).

> Interestingly, the models often produce the correct answer initially, but switch to an incorrect response after self-correction.

The training corpus almost certainly contains examples that are similar to both cases ("my answer was correct" and "oops, my answer was wrong"). It'll be a coin flip which set the input more closely matches.


People forget that getting answers is not what makes a good solution. Computers are still dumb; they only give answers. Asking the right question is everything. With the right question, the uncertainty of a problem is resolved. Finishing is a mere exercise; whether it's hand-crafted assembly or an LLM's regurgitation and synthesis of things it's seen before is just a matter of efficiency.


Alternative explanation: maybe the training data is full of internet discussions, and it's just that we're really bad at self-correction online.


Stephen Wolfram talks about how we'll soon have LLM psychologists: if you add "take a deep breath before answering" to the prompt, you get better answers.

He also speculates that if after the answer you ask it "is what you wrote correct?" you give the LLM a chance to get out of path dependency - one badly generated token and the LLM is now forced to follow it through.


"For GPT-4, to reduce the cost, we randomly sampled 200 questions"

Wow, DeepMind research budgets must be nothing.


Nature shows us that bilaterality works. Connecting two LLM instances together and giving them slightly different prompts to appraise each other's output is going to be a lot more effective than trying to iterate on some monolithic super-reasoning device.
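
A toy sketch of that idea, where llm_a and llm_b are hypothetical completion calls (the same model at different temperatures, or two different models) and the prompt wording is invented:

  def bilateral_answer(question: str, llm_a, llm_b, rounds: int = 2) -> str:
      # One instance answers, the other appraises, for a bounded number of rounds.
      answer = llm_a(f"Answer the question:\n{question}")
      for _ in range(rounds):
          critique = llm_b(
              f"Question: {question}\nProposed answer: {answer}\n"
              "Grade this answer. Point out any factual or logical errors, "
              "or reply 'looks correct'."
          )
          if "looks correct" in critique.lower():
              break
          answer = llm_a(
              f"Question: {question}\nYour previous answer: {answer}\n"
              f"A reviewer said: {critique}\nWrite an improved answer."
          )
      return answer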


I've taken to adding guidance to my original prompt instead of trying to correct the model. Perhaps having the incorrect answer plus the correction in the context causes it to be dumber?


I am begging people to stop confusing "I was unable to get LLM X to do Y using strategy Z" with "All LLMs are categorically unable to do Y".


Multi-agent approaches could potentially solve this easily.


That type of statement never seem to age well for AI technologies.


Can you expand on how?


I am noticing there is enough entropy in the agents' replies, and since we can also adjust the temperature, they can work together to provide a better answer by peer reviewing each other.


...yet.


I work a bit in this space. Current LLM architectures are missing a fundamental "state of the world" as well as the ability to counterfactually reason out of error propagation. Simply a limit of the architecture.

Instead, we should be giving model architectures like JEPA a go [0], which explicitly perform LLM-like behavior but with a state of the world implemented for ongoing error correction.

[0] https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/


Remember Andrew Ng's thesis that current architectures are already "AI complete": given more training data, all models outperform other models trained with less data, and a model which seems a bit more efficient at one amount of training data can be less efficient with more data, or the other way around.


Do you mind providing a link to your reference?


I trust nothing from anyone that can't compete with GPT-4.


Then you should trust OpenAI when they said that they have already run out of ideas to solve these issues? That they need novel approaches to make further progress than they already have? Or do you mean that you don't trust anyone at all?

https://www.wired.com/story/openai-ceo-sam-altman-the-age-of...


>when they said that they have already run out of ideas to solve these issues?

They never said that lol.

https://web.archive.org/web/20230531203946/https://humanloop...


In other words, you need to have Google scale money to invest in developing an LLM before you can be trusted?


A bunch of companies and people speculating on technology that they can't even reproduce.



