Hi! The expected errors are not standardized enough for it to make sense to enable --check-errors by default. If you look at the README, you'll see that the only thing it checks is that the _number_ of errors is correct.
run_tests.py does not appear to check the number of errors, or the errors themselves, for the tokenizer, encoding, or serializer tests from html5lib-tests - which represent the majority of the tests.
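To make the distinction concrete, here's a rough sketch of the two levels of strictness being discussed (both functions are made up for illustration; neither is taken from JustHTML's or html5lib's test runner):

    # Illustrative only - not code from either test runner.
    def errors_match_loosely(actual, expected):
        # What comparing counts amounts to: any N errors pass
        # against any N expected errors.
        return len(actual) == len(expected)

    def errors_match_exactly(actual, expected):
        # What a strict check does: compare the error codes (and
        # ideally positions) themselves, not just how many there are.
        return sorted(actual) == sorted(expected)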
There's also something off about your benchmark comparison. If one runs pytest on html5lib, which uses html5lib-tests plus its own unit tests and does check whether errors match exactly, the pass rate appears to be much higher than 86%:
These numbers are inflated because html5lib-tests/tree-construction tests are run multiple times in different configurations. Many of the expected failures appear to be script tests similar to the ones JustHTML skips.
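On the "run multiple times" part: each tree-construction case is executed once per tree builder, so a single case contributes several passes or failures to the totals. A sketch of the effect, using html5lib's real HTMLParser/getTreeBuilder API but a made-up loop rather than the suite's actual runner:

    # One html5lib-tests case, parsed once per tree builder; the
    # suite's totals count each configuration separately.
    import html5lib

    case = "<table><td>x"  # stand-in for a tree-construction input
    for name in ("etree", "dom"):
        parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder(name))
        parser.parse(case)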
I think the reason this was an evening project for Simon is that it relied on the code and the tests in conjunction. My guess is that removing either one would increase the effort at least 10x.
The biggest value I got from JustHTML here was the API design.
I think that represents the bulk of the human work that went into JustHTML - it's really nice, and lifting that directly is the thing that let me build my library almost hands-off and end up with a good result.
Without that I would have had to think a whole lot more about what I was doing here!
See also the demo app I vibe-coded against their library here: https://tools.simonwillison.net/justhtml - that's what initially convinced me that the API design was good.
My feeling is that my code depends more on the html5lib-tests work than on html5ever. While inspired by it, I think the macro-based Rust code is different enough from the source that it's new work. I'm guessing we'll never know.
Talking about "thieves" goes right back to the idea that software is the same as physical property. When talking about software we have a very simple concept to guide us: the license.
The license of html5ever is MIT, meaning the original authors are OK with people doing whatever they want with it. I've retained that license and given them acknowledgement (not required by the license) in the README. Simon has done the same: kept the license and given acknowledgement (also not required) to me.
Can a human even put the GPL on bot-written code, since the GPL relies on copyright to protect it? Is that like museums adding copyright to scans of public-domain paintings in their holdings? That was fought over in courts for years.
Probably a human could put a copyright on a prompt (that would be the "source" and the LLM would be a compiler or interpreter) and the generated code would be derivative of the prompt and any inputs.
It would probably get into whether the prompt itself is considered copyrightable. There is some threshold for that: I have heard some patches are considered too insignificant to be copyrightable.
For me (original author of JustHTML), it was enough to put the instructions on how to run the tests in AGENTS.md. The agent knows enough about coding to run the tests by itself.
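The note itself can be minimal; something along these lines (a made-up sketch, not the actual file contents):

    ## Running the tests
    Run `python run_tests.py` from the repo root.
    Fix any failures before committing.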
This is a namespacing test. The reason the tag is <svg title> is that the parser is handling the title tag as the SVG version of it. SVG has different handling rules, so unless the parser knows that, it won't work right. It would be interesting to run the tests against Chrome as well!
You are also looking at the test format of the tag; when serialized to HTML, the svg prefixes will disappear.
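You can see both halves of that with html5lib's real parse/serialize functions (shown here only because the API is easy to demo; the same applies to any spec-conforming parser):

    # The in-memory tree keeps the SVG namespace on the tag names,
    # but the HTML serialization carries no namespace prefix.
    import html5lib

    doc = html5lib.parse("<svg><title>hi</title></svg>")
    for el in doc.iter():
        print(el.tag)  # e.g. {http://www.w3.org/2000/svg}title
    print(html5lib.serialize(doc))  # no "svg" prefix in the output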
These projects are no different from other random hobby projects that pop up that weren't LLM-coded. The community will decide whether they are useful or not.
The difference here might be speed: that there are many more projects like this.
That said, the example you are pulling out does not match that either. I'll make sure to fix this bug and others like it! https://github.com/EmilStenstrom/justhtml/issues/20