Explanation
Why a C core
Escaping and unescaping sit on hot paths: HTML output escaping runs on every rendered fragment, and unescaping runs on
every chunk of text an HTML parser emits. turbohtml implements both in C so they run several times faster than an
equivalent pure-Python implementation, with no change in behavior.
The figures below were measured with pyperf on CPython 3.14 (a release build, Apple M-series) against
html.escape() and html.unescape(). The multi-MiB inputs stream well past the CPU caches; the
book and spec cases are real documents (Project Gutenberg’s War and Peace, the WHATWG HTML spec source) referenced as
git submodules. Reproduce with tox -e bench:
| operation | input | turbohtml | stdlib | speedup |
|---|---|---|---|---|
| escape | tiny plain (64 B) | 0.04 µs | 0.11 µs | 2.9x |
| escape | medium markup (4 KiB) | 2.38 µs | 8.09 µs | 3.4x |
| escape | no-op prose (4 MiB) | 0.12 ms | 2.66 ms | 22.2x |
| escape | book text (3 MiB) | 0.72 ms | 2.80 ms | 3.9x |
| escape | book HTML (4 MiB) | 1.35 ms | 4.88 ms | 3.6x |
| escape | spec HTML, dense (4 MiB) | 5.27 ms | 13.3 ms | 2.5x |
| escape | UCS-2 plain (4 MiB) | 0.74 ms | 2.60 ms | 3.5x |
| escape | UCS-2 markup (4 MiB) | 3.44 ms | 11.5 ms | 3.3x |
| escape | UCS-4 plain (4 MiB) | 0.97 ms | 5.58 ms | 5.8x |
| escape | UCS-4 markup (4 MiB) | 4.08 ms | 20.3 ms | 5.0x |
| unescape | tiny plain (64 B) | 0.02 µs | 0.03 µs | 1.3x |
| unescape | medium dense refs (4 KiB) | 8.57 µs | 72.5 µs | 8.5x |
| unescape | numeric refs (4 KiB) | 5.24 µs | 81.1 µs | 15.5x |
| unescape | book HTML, real refs (4 MiB) | 2.80 ms | 8.96 ms | 3.2x |
| unescape | escaped book HTML (5 MiB) | 2.10 ms | 21.2 ms | 10.1x |
| unescape | dense refs (4 MiB) | 10.4 ms | 78.5 ms | 7.6x |
| unescape | UCS-2 refs (4 MiB) | 2.78 ms | 19.4 ms | 7.0x |
escape gains the most on text that needs little escaping (the SIMD scan classifies sixteen bytes at a time and
copies clean stretches wholesale); unescape gains the most on entity-heavy input, where the standard library pays a
Python call per match. The gap is narrowest on tiny strings, where call overhead dominates, and on special-dense markup,
where both sides spend their time writing replacements. Numbers vary with input and hardware; reproduce them with tox
-e bench.
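For a rough feel of the stdlib column without the full benchmark harness, here is a minimal sketch that uses timeit rather than pyperf. The helper name best_time_per_call and the sample input are illustrative assumptions, not part of turbohtml or its bench suite, and timeit figures are noisier than pyperf's.

```python
import html
import timeit


def best_time_per_call(func, text, number=200, repeat=5):
    """Return the best observed seconds per call of func(text)."""
    timer = timeit.Timer(lambda: func(text))
    # min() over the repeats filters out scheduling noise better than a mean.
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Hypothetical stand-in for the "medium markup (4 KiB)" case.
sample = ('<p class="x">Tom & Jerry\'s</p>\n' * 200)[:4096]

escape_seconds = best_time_per_call(html.escape, sample)
unescape_seconds = best_time_per_call(html.unescape, html.escape(sample))
print(f"html.escape:   {escape_seconds * 1e6:.2f} µs per call")
print(f"html.unescape: {unescape_seconds * 1e6:.2f} µs per call")
```

Passing the compiled function in place of html.escape would give the other column, assuming the package exposes it under a name such as turbohtml.escape.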
Unlike a standard-library accelerator, turbohtml ships only the compiled implementation. PEP 399 requires a
pure-Python fallback only for the standard library; as a third-party package distributing per-interpreter wheels,
turbohtml has no need for one, which keeps the surface small.
Block-at-a-time scanning
escape spends most of its time confirming that a string contains nothing that needs escaping. For one-byte strings
it classifies sixteen bytes at a time with SIMD where the platform has it: on arm64, NEON matches all five specials at
once with a single low-nibble table lookup plus one comparison; on x86-64, SSE2 runs one compare per special.
Elsewhere, a 64-bit SWAR word applies the has-zero bit trick. The sizing pass turns each
comparison directly into that special’s output growth and sums the block branchlessly; the writing pass converts the
comparisons into a position bitmask so clean stretches are copied wholesale and only the matched bytes are rewritten.
When nothing needs escaping the input is returned unchanged. Wider strings (UCS-2 and UCS-4; see PEP 393 for
CPython’s string representations) pack four or two code points, respectively, into a 64-bit SWAR word and probe all
five special characters in a single pass. unescape works the same way in reverse: it hops between & occurrences
(memchr on one-byte text) and bulk-copies the clean spans between references instead of inspecting every character.
This needs the PyUnicode buffer API, which is why turbohtml cannot use the Limited API.
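The SWAR fallback can be illustrated in pure Python. This is a sketch of the has-zero trick on 8-byte blocks, not the C code itself: the names word_has_byte, block_is_clean and escape_one_byte are hypothetical, and the real implementation sizes the output first and uses a position bitmask rather than a per-byte slow path.

```python
import struct

LO = 0x0101010101010101  # 0x01 in every byte lane
HI = 0x8080808080808080  # 0x80 in every byte lane
REPLACEMENTS = {
    ord("&"): b"&amp;",
    ord("<"): b"&lt;",
    ord(">"): b"&gt;",
    ord('"'): b"&quot;",
    ord("'"): b"&#x27;",
}


def word_has_byte(word: int, byte: int) -> bool:
    """True if any of the eight byte lanes of word equals byte."""
    x = word ^ (LO * byte)  # lanes equal to byte become 0x00
    # Has-zero trick: the result is nonzero exactly when some lane is 0x00.
    return ((x - LO) & ~x & HI) != 0


def block_is_clean(block: bytes) -> bool:
    """True if an 8-byte block contains none of the five specials."""
    (word,) = struct.unpack("<Q", block)
    return not any(word_has_byte(word, special) for special in REPLACEMENTS)


def escape_one_byte(data: bytes) -> bytes:
    """Escape Latin-1 bytes, copying clean 8-byte blocks wholesale."""
    out = bytearray()
    for start in range(0, len(data), 8):
        block = data[start:start + 8]
        if len(block) == 8 and block_is_clean(block):
            out += block  # fast path: nothing to rewrite in this block
            continue
        for byte in block:  # slow path: dirty block or short tail
            out += REPLACEMENTS.get(byte, bytes([byte]))
    return bytes(out)
```

For Latin-1 input, escape_one_byte(text.encode("latin-1")).decode("latin-1") should agree with html.escape(text); the point of the sketch is that clean blocks never reach the per-byte loop.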
Matching the standard library
turbohtml reproduces the behavior of html.escape() and html.unescape() exactly. escape
uses the same replacements, including &#x27; for the single quote, and unescape applies the full HTML5
character-reference rules: named references with longest-prefix matching, numeric references, the Windows-1252 remaps,
and the invalid code-point handling that maps to U+FFFD or the empty string. The test suite checks the C output
against the standard library over a large fuzzed corpus.
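The rules listed above can be pinned with a few standard-library cases, and the differential check is easy to sketch. The case list and the helpers differing_inputs and random_inputs below are illustrative assumptions, not the project's actual test suite.

```python
import html
import random

# (input, expected) pairs showing what html.unescape does for each rule.
UNESCAPE_CASES = [
    ("&amp;&lt;&gt;", "&<>"),   # ordinary named references
    ("&notit;", "¬it;"),        # longest-prefix match on "&not"
    ("&#65;&#x41;", "AA"),      # decimal and hexadecimal numeric references
    ("&#x80;", "€"),            # Windows-1252 remap of 0x80
    ("&#0;", "\ufffd"),         # NUL maps to U+FFFD
    ("&#xD800;", "\ufffd"),     # surrogate maps to U+FFFD
    ("&#x110000;", "\ufffd"),   # beyond the Unicode range maps to U+FFFD
    ("&#1;", ""),               # invalid control code point is dropped
]


def differing_inputs(candidate, reference, inputs):
    """Return every input on which candidate and reference disagree."""
    return [text for text in inputs if candidate(text) != reference(text)]


def random_inputs(count, seed=0):
    """Build short strings biased toward reference-like fragments."""
    rng = random.Random(seed)
    pieces = ["&", "#", "x", ";", "amp", "lt", "not", "1", "8", "0", "<", "'", "a", " "]
    return ["".join(rng.choices(pieces, k=rng.randint(0, 12))) for _ in range(count)]
```

Calling differing_inputs(candidate, html.unescape, random_inputs(10_000)) with the compiled function as candidate should return an empty list if the two implementations agree on that corpus.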
A spec-exact tokenizer
turbohtml.tokenize() implements the WHATWG HTML tokenization algorithm, the same state machine inside every browser,
rather than a regex approximation like html.parser.HTMLParser. The C implementation mirrors the spec
state by state so the two can be read side by side. It is validated against the shared html5lib-tests conformance
suite that browsers and parser libraries also use, once for each of the three input storage widths, because the token
stream must be invariant to how CPython happens to store the string.
Two deliberate scope decisions keep the surface honest:
- The tokenizer is not a parser. It hands you the token stream; it does not build a tree, balance tags, or apply the tree-construction rules. The one tree-construction duty it takes on is content-model switching: after a start tag for script, style, title and the other raw-text elements, the element’s contents tokenize as the spec requires (a <b> inside a script body is text, not a tag).
- Parse errors are recovered from, not reported. The spec defines a recovery transition for every error and the machine takes it, so malformed input produces the same tokens a browser would see; the error stream itself is not part of the API.
Where behavior could drift, it is pinned by more than the suite: the token stream is fuzz-compared against html5lib’s
tokenizer, and source positions use the same 1-based-line, 0-based-column convention as html.parser, so
diagnostics line up with what the standard library would report.
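The position convention being matched is the one html.parser exposes through getpos(). The sketch below uses only the standard library to show it; the PositionRecorder class is a hypothetical helper, and how turbohtml surfaces positions on its own tokens is not shown here.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Record (tag, line, column) for every start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports where the tag being handled begins.
        line, column = self.getpos()
        self.positions.append((tag, line, column))


recorder = PositionRecorder()
recorder.feed("<p>\n  <b>x</b>")
recorder.close()
print(recorder.positions)  # [('p', 1, 0), ('b', 2, 2)]
```

The first tag sits on line 1 at column 0, and the tag indented by two spaces on the second line reports line 2, column 2: lines count from one, columns from zero.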
Tokenizing at native width
CPython stores a string at one of three widths (PEP 393): one byte per character for Latin-1, two for the basic
multilingual plane, four beyond. The tokenizer keeps that representation end to end instead of widening everything to
UCS-4: the input buffer, accumulated text runs, tag names and attribute values all store code points at the narrowest
width their content needs, promoting only when a wider character actually arrives. The state-machine core is compiled
once per width — the same trick CPython’s stringlib uses — so every read is direct indexing, and the plain-text
states bulk-scan to the next special character the way html5ever does rather
than dispatching the state machine per character. For the ASCII documents that dominate real traffic, a text run travels
from input to the final str as one-byte copies.
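The three widths are observable from Python. In this sketch, storage_width computes the narrowest width a string's content needs, and measured_width confirms it on CPython by watching how sys.getsizeof grows; both names are hypothetical, and the measurement relies on CPython's PEP 393 layout rather than on anything the language guarantees.

```python
import sys


def storage_width(text: str) -> int:
    """Bytes per code point that the content of text needs: 1, 2 or 4."""
    widest = max(map(ord, text), default=0)
    if widest < 0x100:
        return 1  # Latin-1 range
    if widest < 0x10000:
        return 2  # basic multilingual plane
    return 4      # beyond the BMP


def measured_width(char: str) -> int:
    """Observe the per-character cost by growing a one-character string."""
    shorter, longer = char * 1, char * 1025
    # Both strings share a header, so the difference is pure character data.
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) // 1024
```

Appending a single wider character is what forces promotion: storage_width("abc") is 1, while storage_width("abc€") is 2.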
The figures below were measured on CPython 3.14 (a release build, via tox -e bench) against html.parser.HTMLParser
driven with no-op handlers and html5lib’s pure-Python tokenizer, over synthetic cases and html5lib’s benchmark corpus
(a slice of the WHATWG spec source plus web-platform-tests pages of varied sizes):
| input | turbohtml | html.parser | speedup | html5lib | speedup |
|---|---|---|---|---|---|
| typical markup | 30.3 µs | 449 µs | 14.8x | 840 µs | 27.7x |
| text-heavy prose | 0.55 µs | 2.92 µs | 5.3x | 149 µs | 273x |
| attribute-heavy | 24.7 µs | 330 µs | 13.3x | 837 µs | 33.8x |
| script-heavy | 13.0 µs | 162 µs | 12.5x | 526 µs | 40.5x |
| entity-heavy | 22.3 µs | 205 µs | 9.2x | 1246 µs | 55.8x |
| wpt tiny (0.6 kB) | 1.60 µs | 18.2 µs | 11.4x | 49 µs | 30.9x |
| wpt small (4 kB) | 15.0 µs | 176 µs | 11.8x | 434 µs | 29.0x |
| wpt medium (9.6 kB) | 34.9 µs | 376 µs | 10.8x | 1190 µs | 34.1x |
| wpt large (92 kB) | 348 µs | 4250 µs | 12.2x | 9311 µs | 26.7x |
| wpt CJK (124 kB) | 626 µs | 8926 µs | 14.3x | 22844 µs | 36.5x |
| whatwg spec (235 kB) | 701 µs | 7838 µs | 11.2x | 20409 µs | 29.1x |
| ecmascript spec (3 MB) | 7.08 ms | 57.9 ms | 8.2x | 192 ms | 27.1x |
| whatwg spec source (7.9 MB) | 37.0 ms | 399 ms | 10.8x | 907 ms | 24.5x |
The closest case is a document that is almost entirely a single text node, where the standard library’s regex performs
one C scan and never really tokenizes; everywhere markup actually appears, the state machine is 8-15x faster. Numbers
vary with input and hardware; reproduce them with tox -e bench.
Free-threading
The extension holds no shared mutable state: inputs are immutable str objects, the lookup tables are read-only, and
each turbohtml.Tokenizer owns its state machine outright, so tokenizers in different threads never contend. It
therefore declares free-threading support and a per-interpreter GIL on interpreters new enough to honor those slots, so
it does not force the global lock back on under a free-threaded build. As with any stateful object, feeding one
tokenizer from several threads at once needs synchronization on the caller’s side. See the free-threading extension
guide.
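The usage pattern this implies is one stateful object per thread of work. The sketch below illustrates it with the standard library's HTMLParser as a stand-in, since the exact turbohtml.Tokenizer interface is not shown here; TagCounter and count_tags are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Stateful stand-in for a tokenizer: counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def count_tags(document: str) -> int:
    # Each call builds its own parser, so no state is shared across threads.
    parser = TagCounter()
    parser.feed(document)
    parser.close()
    return parser.count


documents = ["<p><b>x</b></p>" * n for n in range(1, 9)]
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(count_tags, documents))
```

Sharing a single parser between the workers instead would require a lock around every feed call, which is the caller-side synchronization mentioned above.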