> Our research shows that LLMs are not yet capable of self-correcting their reasoning
The paper actually just shows that the particular "self-correction" strategy and set of prompts they used don't help for the tasks they looked at, for the models they looked at. It may be the case in general, but it may not.
> it is plausible that there exist specific prompts or strategies that could enhance the reasoning performance of models for particular benchmarks
Seems they agree. So the wording of the title/conclusion is too strong.
> searching such prompts or strategies may inadvertently rely on external feedback, either from human insights or training data
I'm not sure this justifies picking a single prompting strategy, and not looking at the impact of different prompting strategies. Even just writing a few different prompts in advance with different wordings and showing the variation in results would have been helpful.
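Even a rough harness like the one below would surface that variation. (This is purely a sketch of the idea, not the paper's setup: `call_model` is a placeholder for whatever chat API you're using, and the critique wordings are ones I made up.)

```python
# Sketch: run the same self-correction loop with several critique wordings
# and report accuracy per wording, so prompt sensitivity is visible.
# call_model is a stand-in for an actual chat-completion API call.

CRITIQUE_PROMPTS = [
    "Review your previous answer and find problems with it.",
    "Review your previous answer. Is it correct? Explain why or why not.",
    "Grade your previous answer: should it stand as-is, or be revised?",
]

def call_model(messages):
    raise NotImplementedError("plug in your chat API here")

def self_correct(question, critique_prompt):
    history = [{"role": "user", "content": question}]
    first_answer = call_model(history)
    history += [
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": critique_prompt},
    ]
    critique = call_model(history)
    history += [
        {"role": "assistant", "content": critique},
        {"role": "user", "content": "Based on that review, give your final answer."},
    ]
    return call_model(history)

def accuracy_by_prompt(questions, is_correct):
    # is_correct(question, final_answer) -> bool, e.g. exact match against a label.
    return {
        prompt: sum(is_correct(q, self_correct(q, prompt)) for q in questions) / len(questions)
        for prompt in CRITIQUE_PROMPTS
    }
```

If the accuracies cluster tightly, the single-prompt result is more convincing; if they spread widely, the headline claim rests on one arbitrary wording.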
I don't know that it's so much a matter of papers having become weak as there being a knowledge gap between how models are used in practice and how they're used in research.
For example, a glaring issue that caught my eye in the paper was that their prompt for the self-evaluation framed what was being analyzed as the model's own work.
Out of the training data (effectively the Internet), what % of critical analysis was self-critique, and what % do we think was analysis of others' work?
As such, if we're trying to elicit an effective analysis of an earlier answer from an incomprehensible multi-variable prediction machine, might we not want to set the context as the evaluation of another's earlier work? It's still technically self-evaluation even if we are hiding implementation details from the LLM.
Another example is that their prompt to a fine-tuned instruct model asked it to "find problems," and they then found that the self-critique would often bias correct answers toward being changed to incorrect ones. What about using more neutral language like "grade," which lets either verification or a challenge of the earlier answer satisfy the instruction?
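To make that concrete, here are two ways the same evaluation turn could be framed (the wording below is mine, not the paper's):

```python
# Two framings of the same self-evaluation step.
# The second hides authorship and uses neutral "grade" language instead of "find problems."

def first_person_critique(question, answer):
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Review your previous answer and find problems with it."},
    ]

def third_person_critique(question, answer):
    # Same answer, presented as someone else's work.
    return [
        {
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Here is an answer another assistant gave:\n{answer}\n\n"
                "Grade this answer: state whether it is correct, and why."
            ),
        },
    ]
```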
These are the kinds of nuances that, I'm sure, many people working with LLMs in production have realized can completely change the outcome of a prompt pipeline, and yet in research we see very smart analysis (e.g. of the implicit bias towards correction over multiple rounds of challenges when incorrect) coupled with less-than-ideal prompt selection in the methods.
And given that we should expect every new generation of models to have new and different nuances to what works in practice and what doesn't, I don't know that this is a problem that's going to get better before it gets worse.
The problem with the critique of "oh, they didn't use the correct prompts" is that prompt engineering is highly dependent on the model. You could technically create an LLM that would not work with the "let's think this through step-by-step" magic prompt (say, by excluding anything with similar phrases from the pretraining dataset).
Yes, they used GPT-3.5-turbo, which would have its own set of magic key phrases. Should they have used it? I'd say probably not.
There's this tendency among AI researchers to write clickbait titles (like "Attention Is All You Need"), partly due to extreme competition in the publishing/conference environment. If the odds of your paper being accepted are 20%-ish (NeurIPS), I can see why teams opt for attention-grabbing titles.
"Attention is at least X% more performant on selected benchmarks than a selected sample of recurrent networks, with ablation, thus proving attention is all you might need until a non-exponential architecture is developed" doesn't have the same catchy ring to it.
Eh, I'd say the attention paper lived up to the hype of its title.
In 2017, LLM architectures were complex beasts that fiddled with many different structures of layers. These days they're all just giant stacks of attention layers.