Explanation

Why a C core

Escaping and unescaping sit on hot paths: HTML output escaping runs on every rendered fragment, and unescaping runs on every chunk of text an HTML parser emits. turbohtml implements both in C so they run several times faster than an equivalent pure-Python implementation, with no change in behavior.

Measured with pyperf on CPython 3.14 (a release build, Apple M-series) against html.escape() and html.unescape(). The multi-MiB inputs stream well past the CPU caches; the book and spec cases are real documents (Project Gutenberg’s War and Peace, the WHATWG HTML spec source) referenced as git submodules. Reproduce with tox -e bench:

operation

input

turbohtml

stdlib

speedup

escape

tiny plain (64 B)

0.04 µs

0.11 µs

2.9x

escape

medium markup (4 KiB)

2.38 µs

8.09 µs

3.4x

escape

no-op prose (4 MiB)

0.12 ms

2.66 ms

22.2x

escape

book text (3 MiB)

0.72 ms

2.80 ms

3.9x

escape

book HTML (4 MiB)

1.35 ms

4.88 ms

3.6x

escape

spec HTML, dense (4 MiB)

5.27 ms

13.3 ms

2.5x

escape

UCS-2 plain (4 MiB)

0.74 ms

2.60 ms

3.5x

escape

UCS-2 markup (4 MiB)

3.44 ms

11.5 ms

3.3x

escape

UCS-4 plain (4 MiB)

0.97 ms

5.58 ms

5.8x

escape

UCS-4 markup (4 MiB)

4.08 ms

20.3 ms

5.0x

unescape

tiny plain (64 B)

0.02 µs

0.03 µs

1.3x

unescape

medium dense refs (4 KiB)

8.57 µs

72.5 µs

8.5x

unescape

numeric refs (4 KiB)

5.24 µs

81.1 µs

15.5x

unescape

book HTML, real refs (4 MiB)

2.80 ms

8.96 ms

3.2x

unescape

escaped book HTML (5 MiB)

2.10 ms

21.2 ms

10.1x

unescape

dense refs (4 MiB)

10.4 ms

78.5 ms

7.6x

unescape

UCS-2 refs (4 MiB)

2.78 ms

19.4 ms

7.0x

escape gains the most on text that needs little escaping (the SIMD scan classifies sixteen bytes at a time and copies clean stretches wholesale); unescape gains the most on entity-heavy input, where the standard library pays a Python call per match. The gap is narrowest on tiny strings, where call overhead dominates, and on special-dense markup, where both sides spend their time writing replacements. Numbers vary with input and hardware; reproduce them with tox -e bench.

Unlike a standard-library accelerator, turbohtml ships only the compiled implementation. PEP 399 requires a pure-Python fallback only for the standard library; as a third-party package distributing per-interpreter wheels, turbohtml has no need for one, which keeps the surface small.

Block-at-a-time scanning

escape spends most of its time confirming that a string contains nothing that needs escaping. For one-byte strings it classifies sixteen bytes at a time with SIMD (on arm64 NEON a single low-nibble table lookup plus one comparison matches all five specials at once; on x86-64 SSE2 compares per special; elsewhere a 64-bit SWAR word applies the has-zero bit trick). The sizing pass turns each comparison directly into that special’s output growth and sums the block branchlessly; the writing pass converts the comparisons into a position bitmask so clean stretches are copied wholesale and only the matched bytes are rewritten. When nothing needs escaping the input is returned unchanged. Wider (UCS-2 / UCS-4) strings — see PEP 393 for CPython’s string representations — pack four / two code points into a 64-bit SWAR word and probe all five special characters in a single pass. unescape works the same way in reverse: it hops between & occurrences (memchr on one-byte text) and bulk-copies the clean spans between references instead of inspecting every character. This needs the PyUnicode buffer API, which is why turbohtml cannot use the Limited API.

Matching the standard library

turbohtml reproduces the behavior of html.escape() and html.unescape() exactly. escape uses the same replacements, including ' for the single quote, and unescape applies the full HTML5 character-reference rules: named references with longest-prefix matching, numeric references, the Windows-1252 remaps, and the invalid code-point handling that maps to U+FFFD or the empty string. The test suite checks the C output against the standard library over a large fuzzed corpus.

A spec-exact tokenizer

turbohtml.tokenize() implements the WHATWG HTML tokenization algorithm — the same state machine inside every browser — rather than a regex approximation like html.parser.HTMLParser. The C implementation mirrors the spec state by state so the two can be read side by side, and it is validated against the shared html5lib-tests conformance suite that browsers and parser libraries validate against, at all three input storage widths, once per input storage width, because the token stream must be invariant to how CPython happens to store the string.

Two deliberate scope decisions keep the surface honest:

  • The tokenizer is not a parser. It hands you the token stream; it does not build a tree, balance tags, or apply the tree-construction rules. The one tree-construction duty it takes on is content-model switching: after a start tag for script, style, title and the other raw-text elements, the element’s contents tokenize as the spec requires (a <b> inside a script body is text, not a tag).

  • Parse errors are recovered from, not reported. The spec defines a recovery transition for every error and the machine takes it, so malformed input produces the same tokens a browser would see; the error stream itself is not part of the API.

Where behavior could drift, it is pinned by more than the suite: the token stream is fuzz-compared against html5lib’s tokenizer, and source positions use the same 1-based-line, 0-based-column convention as html.parser, so diagnostics line up with what the standard library would report.

Tokenizing at native width

CPython stores a string at one of three widths (PEP 393): one byte per character for Latin-1, two for the basic multilingual plane, four beyond. The tokenizer keeps that representation end to end instead of widening everything to UCS-4: the input buffer, accumulated text runs, tag names and attribute values all store code points at the narrowest width their content needs, promoting only when a wider character actually arrives. The state-machine core is compiled once per width — the same trick CPython’s stringlib uses — so every read is direct indexing, and the plain-text states bulk-scan to the next special character the way html5ever does rather than dispatching the state machine per character. For the ASCII documents that dominate real traffic, a text run travels from input to the final str as one-byte copies.

Measured on CPython 3.14 (a release build, via tox -e bench) against html.parser.HTMLParser driven with no-op handlers and html5lib’s pure-Python tokenizer, over synthetic cases and html5lib’s benchmark corpus (a slice of the WHATWG spec source plus web-platform-tests pages of varied sizes):

input

turbohtml

html.parser

speedup

html5lib

speedup

typical markup

30.3 µs

449 µs

14.8x

840 µs

27.7x

text-heavy prose

0.55 µs

2.92 µs

5.3x

149 µs

273x

attribute-heavy

24.7 µs

330 µs

13.3x

837 µs

33.8x

script-heavy

13.0 µs

162 µs

12.5x

526 µs

40.5x

entity-heavy

22.3 µs

205 µs

9.2x

1246 µs

55.8x

wpt tiny (0.6 kB)

1.60 µs

18.2 µs

11.4x

49 µs

30.9x

wpt small (4 kB)

15.0 µs

176 µs

11.8x

434 µs

29.0x

wpt medium (9.6 kB)

34.9 µs

376 µs

10.8x

1190 µs

34.1x

wpt large (92 kB)

348 µs

4250 µs

12.2x

9311 µs

26.7x

wpt CJK (124 kB)

626 µs

8926 µs

14.3x

22844 µs

36.5x

whatwg spec (235 kB)

701 µs

7838 µs

11.2x

20409 µs

29.1x

ecmascript spec (3 MB)

7.08 ms

57.9 ms

8.2x

192 ms

27.1x

whatwg spec source (7.9 MB)

37.0 ms

399 ms

10.8x

907 ms

24.5x

The closest case is a document that is almost entirely a single text node, where the standard library’s regex performs one C scan and never really tokenizes; everywhere markup actually appears, the state machine is 8-15x faster. Numbers vary with input and hardware; reproduce them with tox -e bench.

Free-threading

The extension holds no shared mutable state: inputs are immutable str objects, the lookup tables are read-only, and each turbohtml.Tokenizer owns its state machine outright, so tokenizers in different threads never contend. It therefore declares free-threading support and a per-interpreter GIL on interpreters new enough to honor those slots, so it does not force the global lock back on under a free-threaded build. As with any stateful object, feeding one tokenizer from several threads at once needs synchronization on the caller’s side. See the free-threading extension guide.