This showdown benchmark was and still is great, but an enormous grain of salt should be added to any model that was released after the showdown benchmark itself.
Maybe everyone has a different dose of skepticism. Personally I'm not even looking at results for models that were released after the benchmark, for all this tells us, they might as well be one-trick ponies that only do well in the benchmark.
It might be too much work, but one possible "correct" approach for this kind of benchmark would to periodically release new benchmarks with new tests (that are broadly in the same categories) and only include models that predate each benchmark.
Yeah that’s a classic problem, and it's why good tests are such closely guarded secrets: to keep them from becoming training fodder for the next generation of models. Regarding the "model date" vs "benchmark date" - that's an interesting point... I'll definitely look into it!
I don't have any captcha systems in place, but I wonder if it might be worth putting up at least a few nominal roadblocks (such as Anubis [1]) to at least slow down the scrapers.
A few weeks ago I actually added some new, more challenging tests to the GenAI Text-to-Image section of the site (the “angelic forge” and “overcrowded flat earth”) just to keep pace with the latest SOTA models.
In the next few weeks, I’ll be adding some new benchmarks to the Image Editing section as well~~
The Blender previz reskin task [1] could be automated! New test cases could be randomly and procedurally generated (without AI).
Generate a novel previz scene programatically in Blender or some 3D engine, then task the image model with rendering it in a style (or to style transfer to a given image, eg. something novel and unseen from Midjourney). Another test would be to replace stand in mannequins with identities of characters in reference images and make sure the poses and set blocking match.
Throw in a 250 object asset pack and some skeletal meshes that can conform to novel poses, and you've got a fairly robust test framework.
Furthermore, anything that succeeds from the previz rendering task can then be fed into another company's model and given a normal editing task, making it doubly useful for two entirely separate benchmarks. That is, successful previz generations can be reused as image edit test cases - and you a priori know the subject matter without needing to label a bunch of images or run a VLM, so you can create a large set of unseen tests.
You don't need skepticism, because even if you're acting in 100% good faith and building a new model, what's the first thing you're going to do? You're going to go look up as many benchmarks as you can find and see how it does on them. It gives you some easy feedback relative to your peers. The fact that your own model may end up being put up against these exact tests is just icing.
So I don't think there's even a question of whether or not newer models are going to be maximizing for benchmarks - they 100% are. The skepticism would be in how it's done. If something's not being run locally, then there's an endless array of ways to cheat - like dynamically loading certain LoRAs in response to certain queries, with some LoRAs trained precisely to maximize benchmark performance. Basically taking a page out of the car company playbook in response to emissions testing.
But I think maximizing the general model itself to perform well on benchmarks isn't really unethical or cheating at all. All you're really doing there is 'outsourcing' part of your quality control tests. But it simultaneously greatly devalues any benchmark, because that benchmark is now the goal.
I think training image models to pass these very specific tests correctly will be very difficult for any of these companies. How would they even do that?
Hire a professional Photoshop artist to manually create the "correct" images and then put the before and after photos into the training data. Or however they've been training these models thus far, i don't know.
And if that still doesn't get you there, hash the image inputs to detect if its one of these test photos and then run your special test-passer algo.
Maybe everyone has a different dose of skepticism. Personally I'm not even looking at results for models that were released after the benchmark, for all this tells us, they might as well be one-trick ponies that only do well in the benchmark.
It might be too much work, but one possible "correct" approach for this kind of benchmark would to periodically release new benchmarks with new tests (that are broadly in the same categories) and only include models that predate each benchmark.