Development
This page onboards contributors and records how turbohtml is built and maintained.
Getting set up
turbohtml uses tox with tox-uv; uv manages the interpreters, so you do not need to install Python versions yourself.
$ git clone https://github.com/tox-dev/turbohtml
$ cd turbohtml
$ git submodule update --init tests/html5lib-tests # conformance data for the test suite
$ uvx --with tox-uv tox r -e 3.14 # build, test, and check coverage
The tests/html5lib-tests submodule holds the conformance suite used by one of the tests. Do not initialize all
submodules indiscriminately: the tools/bench-data submodules reference multi-MiB real documents (pinned upstream
commits, nothing copied into this repository) used only by tox r -e bench; fetch them on demand with git submodule
update --init --depth 1 tools/bench-data/whatwg-html tools/bench-data/war-and-peace.
tox r -e 3.14 builds the extension, runs the test suite, and fails unless both Python and C coverage are 100%
(line and branch). Other environments: type (ty), docs (Sphinx), fix
(pre-commit), pkg_meta (wheel/sdist metadata), bench (pyperf comparison against the standard library), and regen (regenerate the entity
tables).
Project layout
src/turbohtml/
    __init__.py      # public API re-export, typed
    _html.pyi        # type stub for the C extension
    py.typed         # PEP 561 marker
    turbohtml.h      # internal header shared by the C sources
    escape.c         # html.escape implementation (SIMD / SWAR)
    unescape.c       # html.unescape implementation (entity tables)
    _htmlmodule.c    # module definition; wires escape.c + unescape.c
    html_entities.h  # generated tables (do not edit)
tools/generate_html_entities.py # regenerates html_entities.h
tests/ # pytest suite (escape + unescape)
The three C files compile into a single _html extension. They are split per feature for readability and share only
the entry-point declarations in turbohtml.h.
Architecture decisions
A C extension, built with meson-python. Escaping and unescaping are hot paths, so the core is C. meson-python is the build backend because hatchling (used by our pure-Python projects) does not compile C; meson-python is a first-class C backend with built-in coverage support.
No stable ABI (abi3). The fast paths rely on APIs outside the Limited API: the buffer macros
PyUnicode_KIND, PyUnicode_DATA, PyUnicode_READ and PyUnicode_WRITE, plus the PyUnicode_New function (see the
PyUnicode C API and PEP 393). The Limited API only exposes per-code-point calls (PyUnicode_ReadChar /
PyUnicode_WriteChar) with no access to the underlying buffer, which would remove the SWAR scan that justifies the
package. We therefore ship one wheel per interpreter and let cibuildwheel build the
matrix.
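As a rough illustration of what PyUnicode_KIND reports, the sketch below mirrors the PEP 393 storage rule in Python: a string is stored with 1, 2, or 4 bytes per code point, depending on its widest character. The helper name pep393_kind is hypothetical; it is not part of turbohtml or CPython.

```python
def pep393_kind(text: str) -> int:
    """Bytes per code point under the PEP 393 rule (1, 2, or 4).

    Hypothetical helper for illustration only; the C code would read this
    value with PyUnicode_KIND instead of recomputing it.
    """
    highest = max(map(ord, text), default=0)
    if highest < 0x100:      # fits Latin-1: one byte per code point
        return 1
    if highest < 0x10000:    # fits the BMP: two bytes per code point (UCS-2)
        return 2
    return 4                 # anything wider: four bytes per code point (UCS-4)


print(pep393_kind("plain ascii"), pep393_kind("\u20ac"), pep393_kind("\U0001f600"))
```

This is the distinction the escape paths dispatch on: one-byte strings versus UCS-2 / UCS-4 strings.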
No pure-Python fallback. PEP 399 requires a pure-Python fallback only for standard-library modules. As a third-party package distributing per-interpreter wheels, turbohtml ships only the compiled implementation.
SIMD / SWAR for escape. Most strings need no escaping, so escape is built to confirm that quickly. It classifies one-byte strings sixteen
bytes at a time: on NEON a single low-nibble table lookup plus one comparison matches all five specials at once (each
has a unique low nibble; this is the PSHUFB trick used by pulldown-cmark), on x86-64 SSE2 issues one comparison per special, and elsewhere a
SWAR word applies the bit-twiddling “has-zero” trick. The sizing pass accumulates the growth of all
matches branchlessly, and the writing pass copies clean stretches wholesale, rewriting only the positions a match
bitmask singles out. UCS-2 / UCS-4 strings (see PEP 393 for the representations) are probed for all five special
characters in one SWAR pass over a 64-bit word.
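The two classification ideas above can be modelled in Python, with a list standing in for the vector table lookup and a plain integer standing in for a 64-bit machine word. This is a sketch of the general techniques, not the code in escape.c; every name and constant below is an illustrative assumption.

```python
SPECIALS = b"&<>\"'"  # 0x26, 0x3C, 0x3E, 0x22, 0x27: five distinct low nibbles

# Low-nibble lookup: entry i holds the special whose low nibble is i.
# Filler entries are chosen so they can never equal a byte that indexes them:
# 0xFF has low nibble 0xF (not 0), and 0x00 has low nibble 0 (not 1..15).
NIBBLE_TABLE = [0xFF] + [0x00] * 15
for _special in SPECIALS:
    NIBBLE_TABLE[_special & 0x0F] = _special


def is_special(byte: int) -> bool:
    """One table lookup plus one comparison classifies a byte."""
    return NIBBLE_TABLE[byte & 0x0F] == byte


MASK64 = 0xFFFFFFFFFFFFFFFF
ONES = 0x0101010101010101   # 0x01 in every byte lane
HIGHS = 0x8080808080808080  # 0x80 in every byte lane


def has_zero_byte(word: int) -> bool:
    """Classic "has-zero" test: true if any of the eight lanes is 0x00."""
    return ((word - ONES) & ~word & HIGHS & MASK64) != 0


def word_has_special(chunk: bytes) -> bool:
    """Probe eight bytes at once for any of the five specials."""
    if len(chunk) != 8:
        raise ValueError("expected exactly eight bytes")
    word = int.from_bytes(chunk, "little")
    for special in SPECIALS:
        # XOR with the special broadcast to every lane turns matches into 0x00.
        if has_zero_byte(word ^ (ONES * special)):
            return True
    return False


print(word_has_special(b"no marks"), word_has_special(b"a<b word"))
```

A chunk that tests clean can be copied wholesale; only chunks that report a match need per-byte handling.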
Free-threading ready. The module has no mutable state (immutable str inputs, read-only tables), so it declares
Py_MOD_GIL_NOT_USED and per-interpreter GIL support on interpreters that support them. See the free-threading
extension guide.
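One way to observe the effect of the declaration from Python is to check the GIL state after importing the extension: on a free-threaded build, CPython re-enables the GIL at runtime when it imports an extension module that does not declare Py_MOD_GIL_NOT_USED. The sketch below assumes CPython 3.13 or later for sys._is_gil_enabled (a private helper, so it is looked up defensively); gil_status is a hypothetical name.

```python
import sys
import sysconfig


def gil_status() -> tuple[bool, bool]:
    """Return (free_threaded_build, gil_currently_enabled).

    Hypothetical helper. Interpreters without sys._is_gil_enabled
    (before 3.13) are reported as having the GIL enabled.
    """
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if probe is None else bool(probe())
    return free_threaded_build, gil_enabled


print(gil_status())
```

On a free-threaded interpreter, calling gil_status() after importing the extension would be expected to report (True, False), provided nothing else (another extension, or the PYTHON_GIL environment variable) has turned the GIL back on.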
Exact standard-library parity. turbohtml reproduces html.escape() and html.unescape() byte
for byte, including &#x27; for the single quote and the full HTML5 character-reference rules. The suite fuzzes the C
output against the standard library.
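A differential fuzz of this kind can be sketched as follows. This is not the project's test code: candidate_escape is a hypothetical pure-Python stand-in for the implementation under test, and the alphabet, round count, and function names are assumptions made for the example.

```python
import html
import random

_BASE = {ord("&"): "&amp;", ord("<"): "&lt;", ord(">"): "&gt;"}
_WITH_QUOTES = {**_BASE, ord('"'): "&quot;", ord("'"): "&#x27;"}


def candidate_escape(s: str, quote: bool = True) -> str:
    """Hypothetical stand-in for the implementation under test."""
    return s.translate(_WITH_QUOTES if quote else _BASE)


# Specials plus characters from each PEP 393 width (1, 2, and 4 bytes).
ALPHABET = "&<>\"'ab \u00e9\u20ac\U0001f600"


def fuzz_escape(candidate, rounds: int = 2000, seed: int = 0):
    """Return the first (text, quote) where candidate disagrees with html.escape, else None."""
    rng = random.Random(seed)
    for _ in range(rounds):
        text = "".join(rng.choices(ALPHABET, k=rng.randrange(0, 40)))
        for quote in (True, False):
            if candidate(text, quote=quote) != html.escape(text, quote=quote):
                return text, quote
    return None


print(fuzz_escape(candidate_escape))
```

Seeding the generator keeps any failure reproducible, and weighting the alphabet towards the five specials makes each round exercise the rewrite path rather than only the clean-copy path.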
Generated entity tables. html_entities.h is produced by tools/generate_html_entities.py. The named
references come from html.entities.html5 (which mirrors the WHATWG named character references); the numeric-charref correction tables are derived
directly from the WHATWG specification rather than any private
standard-library internals, so the C tables never drift from the source of truth.
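To make the generation step concrete, here is a minimal sketch of rendering html.entities.html5 as a C table. It is not tools/generate_html_entities.py: the struct layout, the NAMED identifier, and the render_named_table function are assumptions chosen for illustration, and the real header may be organized quite differently.

```python
from html.entities import html5


def render_named_table(entities=html5) -> str:
    """Render a name -> UTF-8 mapping as a C array initializer (hypothetical layout)."""
    lines = ["static const struct { const char *name; const char *utf8; } NAMED[] = {"]
    for name in sorted(entities):  # sorted so the C side could binary-search
        # Emit every byte as a hex escape so the C literal stays pure ASCII.
        utf8 = "".join(f"\\x{byte:02x}" for byte in entities[name].encode("utf-8"))
        lines.append(f'    {{"{name}", "{utf8}"}},')
    lines.append("};")
    return "\n".join(lines)


print(render_named_table({"amp;": "&", "lt;": "<"}))
```

Because the output is a pure function of html.entities.html5, rerunning the generator after a CPython update is enough to pick up any change to the named references.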
Maintainer tasks
Regenerate the entity tables (after a CPython update changes html.entities):
$ tox r -e regen
Run the full check matrix locally (per-interpreter, 3.10–3.15 plus free-threading):
$ tox r # all environments
$ tox r -e 3.13 # a single interpreter
Coverage is enforced two ways: Python via covdefaults (100%), and C via
gcovr with --fail-under-line 100 --fail-under-branch 100 on an instrumented meson coverage
build (-Db_coverage=true -Db_ndebug=true). The only excluded
branches are allocation-failure guards that a test cannot trigger; each is marked with a gcovr exclusion marker and a comment explaining why.
Adding a C feature:
1. Add src/turbohtml/<feature>.c and declare its entry point in turbohtml.h.
2. Add the source to meson.build and wire the method in _htmlmodule.c.
3. Add tests and keep coverage at 100%; mark any genuinely unreachable branch with GCOVR_EXCL_BR_LINE plus a reason.
Releasing
A release is cut by the 🚀 Release GitHub Actions workflow: it builds the sdist and the full wheel matrix with
cibuildwheel and publishes to PyPI via trusted publishing, so no API token is stored.