Going Further

Why tinyhtml5?

tinyhtml5 is a friendly fork of the unmaintained html5lib library, that’s been adapted to fulfill WeasyPrint’s needs.

What are the differences with html5lib?

tinyhtml5 is a simplified and modernized version of html5lib, with the given main differences:

  • It is only a HTML parser, providing tree corresponding to a given HTML string or file.

  • The public API is only the tinyhtml5.parse() function, allowing filename string and pathlib.Path as input in addition to HTML string and file descriptor.

  • The only output format supported is ElementTree.

  • Tree walkers, adapters and filters are not supported.

  • Tests are included.

  • Code internals are cleaned, simplified and modernized.

Why not maintaining html5lib?

html5lib is a wonderful piece of software, used by many Python projects, and we’re really grateful for the work done by its previous authors and maintainers.

But unfortunately, html5lib is unmaintained. Maintaining it would have been much more complex for us than creating tinyhtml5, for many different reasons.

First of all, there are many features of html5lib we don’t use. Maintaining code we don’t use doesn’t look like a good idea for us, because it’s hard to build a clean API when we don’t know the real needs. html5lib is a whole HTML manipulation library, with a lot of different use cases, that requires a lot of internal abstractions to provide its powerful features. tinyhtml5 is much less powerful, and much simpler too.

The goal was also to clean the code. The syntax and the architecture had to be modernized, because a lot of things have changed during the last years of the Python ecosystem. html5lib was supporting Python 2 and included a lot of details that are useless for recent versions of Python. Cleaning the whole code was complex in html5lib’s modules we knew, it would have been much harder in modules we didn’t know.

Maintaining html5lib requires a lot of work, that’s why it’s been unmaintained. Honestly, there’s no reason why we could do better than the previous maintainers, we didn’t want to give false hopes to its users. Maintaining and extending all these features, with a very large community with a lot of different needs, is a work we can’t do seriously with the limited resources we have.