Here's a problem that no frontier model does well on (F1 < 0.2), but which I think is relatively easy for most humans:

https://dorrit.pairsys.ai/

> This benchmark evaluates the ability of multimodal language models to interpret handwritten editorial corrections in printed text. Using annotated scans from Charles Dickens' "Little Dorrit," we challenge models to accurately capture human editing intentions.