Yeah that’s a classic problem, and it's why good tests are such closely guarded secrets: to keep them from becoming training fodder for the next generation of models. Regarding the "model date" vs "benchmark date" - that's an interesting point... I'll definitely look into it!
I don't have any captcha systems in place, but I wonder if it might be worth putting up at least a few nominal roadblocks (such as Anubis [1]) to at least slow down the scrapers.
A few weeks ago I actually added some new, more challenging tests to the GenAI Text-to-Image section of the site (the “angelic forge” and “overcrowded flat earth”) just to keep pace with the latest SOTA models.
In the next few weeks, I’ll be adding some new benchmarks to the Image Editing section as well~~
The Blender previz reskin task [1] could be automated! New test cases could be randomly and procedurally generated (without AI).
Generate a novel previz scene programatically in Blender or some 3D engine, then task the image model with rendering it in a style (or to style transfer to a given image, eg. something novel and unseen from Midjourney). Another test would be to replace stand in mannequins with identities of characters in reference images and make sure the poses and set blocking match.
Throw in a 250 object asset pack and some skeletal meshes that can conform to novel poses, and you've got a fairly robust test framework.
Furthermore, anything that succeeds from the previz rendering task can then be fed into another company's model and given a normal editing task, making it doubly useful for two entirely separate benchmarks. That is, successful previz generations can be reused as image edit test cases - and you a priori know the subject matter without needing to label a bunch of images or run a VLM, so you can create a large set of unseen tests.
I don't have any captcha systems in place, but I wonder if it might be worth putting up at least a few nominal roadblocks (such as Anubis [1]) to at least slow down the scrapers.
A few weeks ago I actually added some new, more challenging tests to the GenAI Text-to-Image section of the site (the “angelic forge” and “overcrowded flat earth”) just to keep pace with the latest SOTA models.
In the next few weeks, I’ll be adding some new benchmarks to the Image Editing section as well~~
[1] - https://anubis.techaro.lol