> Our research shows that LLMs are not yet capable of self-correcting their reasoning
The paper actually just shows that the particular "self-correction" strategy and set of prompts they used don't help for the tasks they looked at, for the models they looked at. It may be the case in general, but it may not.
> it is plausible that there exist specific prompts or strategies that could enhance the reasoning performance of models for particular benchmarks
Seems they agree. So the wording of the title/conclusion is too strong.
> searching such prompts or strategies may inadvertently rely on external feedback, either from human insights or training data
I'm not sure this justifies picking a single prompting strategy, and not looking at the impact of different prompting strategies. Even just writing a few different prompts in advance with different wordings and showing the variation in results would have been helpful.
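Even a rough harness like the one below would surface that variation. (This is purely a sketch of the idea, not the paper's setup: `call_model` is a placeholder for whatever chat API you're using, and the critique wordings are ones I made up.)

```python
# Sketch: run the same self-correction loop with several critique wordings
# and report accuracy per wording, so prompt sensitivity is visible.
# call_model is a stand-in for an actual chat-completion API call.

CRITIQUE_PROMPTS = [
    "Review your previous answer and find problems with it.",
    "Review your previous answer. Is it correct? Explain why or why not.",
    "Grade your previous answer: should it stand as-is, or be revised?",
]

def call_model(messages):
    raise NotImplementedError("plug in your chat API here")

def self_correct(question, critique_prompt):
    history = [{"role": "user", "content": question}]
    first_answer = call_model(history)
    history += [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": critique_prompt},
    ]
    critique = call_model(history)
    history += [
        {"role": "assistant", "content": critique},
        {"role": "user", "content": "Based on that review, give your final answer."},
    ]
    return call_model(history)

def accuracy_by_prompt(questions, is_correct):
    # is_correct(question, final_answer) -> bool, e.g. exact match against a label.
    return {
        prompt: sum(is_correct(q, self_correct(q, prompt)) for q in questions) / len(questions)
        for prompt in CRITIQUE_PROMPTS
    }
```

If the accuracies cluster tightly, the single-prompt result is more convincing; if they spread widely, the headline claim rests on one arbitrary wording.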
I don't know that it's so much a matter of papers having become weak as there being a knowledge gap between how models are used in practice and how they're used in research.
For example, a glaring issue that caught my eye in the paper was that their prompt for the self-evaluation framed what was being analyzed as the model's own work.
Out of the training data (effectively the Internet), what % of critical analysis was self-critique, and what % do we think was analysis of others' work?
As such, if we're trying to elicit an effective analysis of an earlier answer from an incomprehensible multi-variable prediction machine, might we not want to set the context as the evaluation of another's earlier work? It's still technically self-evaluation even if we are hiding implementation details from the LLM.
Another example is that their prompt to a fine-tuned instruct model asked it to "find problems," and they then found that the self-critique would often bias correct answers toward being changed to incorrect ones. What about using more neutral language like "grade," which lets either verification or a challenge of the earlier answer satisfy the instruction?
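To make that concrete, here are two ways the same evaluation turn could be framed (the wording below is mine, not the paper's):

```python
# Two framings of the same self-evaluation step.
# The second hides authorship and uses neutral "grade" language instead of "find problems."

def first_person_critique(question, answer):
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Review your previous answer and find problems with it."},
    ]

def third_person_critique(question, answer):
    # Same answer, presented as someone else's work.
    return [
        {
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Here is an answer another assistant gave:\n{answer}\n\n"
                "Grade this answer: state whether it is correct, and why."
            ),
        },
    ]
```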
These are the kinds of nuances that, I'm sure, many people working with LLMs in production have realized can completely change the outcome of a prompt pipeline, and yet in research we see very smart analysis (e.g. of the implicit bias towards correction over multiple rounds of challenges when incorrect) coupled with less-than-ideal prompt selection in the methods.
And given that we should expect every new generation of models to have new and different nuances to what works in practice and what doesn't, I don't know that this is a problem that's going to get better before it gets worse.
The problem with the critique of "oh, they didn't use the correct prompts" is that prompt engineering is highly dependent on the model. You could technically create an LLM that would not work with the "let's think this through step-by-step" magic prompt (say, by excluding anything with similar phrases from the pretraining dataset).
Yes, they used GPT-3.5-turbo, which would have its own set of magic key phrases. Should they have used it? I'd say probably not.
There's this tendency among AI researchers to write clickbait titles (like "Attention Is All You Need"), partly due to extreme competition in the publishing/conference environment. If the odds of your paper being accepted are 20%-ish (NeurIPS), I can see why teams opt for attention-grabbing titles.
"Attention is at least X% more performant on selected benchmarks than a selected sample of recurrent networks, with ablation, thus proving attention is all you might need until a non-exponential architecture is developed" doesn't have the same catchy ring to it.
Eh, I'd say the attention paper lived up to the hype of its title.
In 2017, LLM architectures were complex beasts that fiddled with many different structures of layers. These days they're all just giant stacks of attention layers.