Hi! The expected errors are not standardized enough for it to make sense to enable --check-errors by default. If you look at the README, you'll see that the only thing it checks is that the _number_ of errors is correct.
run_tests.py does not appear to check the number of errors, or the errors themselves, for the tokenizer, encoding, or serializer tests from html5lib-tests - which represent the majority of the tests.
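To make the distinction concrete, here's a rough sketch of the two levels of strictness being discussed (both functions are made up for illustration; neither is taken from JustHTML's or html5lib's test runner):

    # Illustrative only - not code from either test runner.
    def errors_match_loosely(actual, expected):
        # What comparing counts amounts to: any N errors pass
        # against any N expected errors.
        return len(actual) == len(expected)

    def errors_match_exactly(actual, expected):
        # What a strict check does: compare the error codes (and
        # ideally positions) themselves, not just how many there are.
        return sorted(actual) == sorted(expected)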
There's also something off about your benchmark comparison. If one runs pytest on html5lib, which uses html5lib-tests plus its own unit tests and does check whether errors match exactly, the pass rate appears to be much higher than 86%:
These numbers are inflated because html5lib-tests/tree-construction tests are run multiple times in different configurations. Many of the expected failures appear to be script tests similar to the ones JustHTML skips.
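On the "run multiple times" part: each tree-construction case is executed once per tree builder, so a single case contributes several passes or failures to the totals. A sketch of the effect, using html5lib's real HTMLParser/getTreeBuilder API but a made-up loop rather than the suite's actual runner:

    # One html5lib-tests case, parsed once per tree builder; the
    # suite's totals count each configuration separately.
    import html5lib

    case = "<table><td>x"  # stand-in for a tree-construction input
    for name in ("etree", "dom"):
        parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder(name))
        parser.parse(case)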
I think the reason this was an evening project for Simon is that it relied on the code and the tests in conjunction. My guess is that removing either one would increase the effort at least 10x.
The biggest value I got from JustHTML here was the API design.
I think that represents the bulk of the human work that went into JustHTML - it's really nice, and lifting that directly is the thing that let me build my library almost hands-off and end up with a good result.
Without that I would have had to think a whole lot more about what I was doing here!
See also the demo app I vibe-coded against their library here: https://tools.simonwillison.net/justhtml - that's what initially convinced me that the API design was good.
My feeling is that my code depends more on the html5lib-tests work than on html5ever. While inspired by it, I think the macro-based Rust code is different enough from the source that it's new work. I'm guessing we'll never know.
Talking about "thieves" goes right back to the idea that software is the same as physical property. When talking about software we have a very simple concept to guide us: the license.
The license of html5ever is MIT, meaning the original authors are OK with people doing whatever they want with it. I've retained that license and given them acknowledgement (not required by the license) in the README. Simon has done the same: kept the license and given acknowledgement (also not required) to me.
Can a human even put the GPL on bot-written code, since the GPL relies on copyright to protect it? Is that like museums adding copyright to scans of public-domain paintings in their holdings? That was fought over in courts for years.
Probably a human could put a copyright on a prompt (that would be the "source" and the LLM would be a compiler or interpreter) and the generated code would be derivative of the prompt and any inputs.
It would probably get into whether the prompt itself is considered copyrightable. There is some threshold for that: I have heard some patches are considered too insignificant to be copyrightable.
For me (original author of JustHTML), it was enough to put the instructions on how to run the tests in AGENTS.md. The agent knows enough about coding to run the tests by itself.
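The note itself can be minimal; something along these lines (a made-up sketch, not the actual file contents):

    ## Running the tests
    Run `python run_tests.py` from the repo root.
    Fix any failures before committing.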
This is a namespacing test. The reason the tag is <svg title> is that the parser is handling the title tag as the SVG version of it. SVG has different handling rules, so unless the parser knows that, it won't work right. It would be interesting to run the tests against Chrome as well!
You are also looking at the test format of the tag; when serialized to HTML, the svg prefixes will disappear.
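You can see both halves of that with html5lib's real parse/serialize functions (shown here only because the API is easy to demo; the same applies to any spec-conforming parser):

    # The in-memory tree keeps the SVG namespace on the tag names,
    # but the HTML serialization carries no namespace prefix.
    import html5lib

    doc = html5lib.parse("<svg><title>hi</title></svg>")
    for el in doc.iter():
        print(el.tag)  # e.g. {http://www.w3.org/2000/svg}title
    print(html5lib.serialize(doc))  # no "svg" prefix in the output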
These projects are no different from other random hobby projects that pop up that weren't LLM-coded. The community will decide whether they are useful or not.
The difference here might be speed: that there are many more projects like this.
That said, the example you are pulling out does not match that either. I'll make sure to fix this bug and others like it! https://github.com/EmilStenstrom/justhtml/issues/20