
I don't know that it's so much a matter of papers having become weak as a knowledge gap between how models are used in practice and how they're used in research.

For example, a glaring issue that caught my eye in the paper was that their self-evaluation prompt framed the content being analyzed as the model's own work.

Out of training data drawn from effectively the entire Internet, what % of critical analysis was self-critique, and what % do we think was analysis of others' work?

As such, if we're trying to elicit an effective analysis of an earlier answer from an incomprehensible multi-variable prediction machine, might we not want to set the context as the evaluation of another's earlier work? It's still technically self-evaluation even if we are hiding implementation details from the LLM.
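
To make that concrete, here's a rough sketch of the two framings (the wording and the helper function are mine, not the paper's):

    # Hypothetical sketch: the same evaluation pass framed two ways.
    def build_eval_prompt(question: str, prior_answer: str, third_person: bool) -> str:
        if third_person:
            framing = "Below is an answer another assistant gave to a question."
        else:
            framing = "Below is an answer you gave earlier to a question."
        return (
            f"{framing}\n\n"
            f"Question: {question}\n"
            f"Answer: {prior_answer}\n\n"
            "Evaluate whether the answer is correct and explain your reasoning."
        )

Only the first line of the prompt changes, but it shifts which distribution of "critical analysis" the model is drawing on.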

Another example: their prompt to a fine-tuned instruct model asked it to "find problems," and they then found that the self-critique would often bias correct answers toward being changed to incorrect ones. What about using more neutral language like "grade," which lets either verification or challenge of the earlier answer satisfy the instruction?
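
Roughly the kind of contrast I mean (again, my wording, not the paper's actual prompts):

    # Hypothetical contrast: a fault-seeking instruction vs. a neutral grading one.
    FIND_PROBLEMS = "Find problems with the answer above and revise it accordingly."
    NEUTRAL_GRADE = (
        "Grade the answer above as correct or incorrect. "
        "If correct, keep it as-is; if incorrect, revise it."
    )

The first wording presupposes there's something to fix; the second makes "the answer was fine" an equally valid way to follow the instruction.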

These are the kinds of nuances that, I'm sure, many people working with LLMs in production have realized can completely change the outcome of a prompt pipeline. And yet in research we see very smart analysis (e.g. identifying an implicit bias toward correction over multiple rounds of challenges when the answer is incorrect) coupled with less-than-ideal prompt selection for the methods.

And given that we should expect every new generation of models to have new and different nuances in what works in practice and what doesn't, I don't know that this is a problem that's going to get better before it gets worse.


