Reference

turbohtml: fast, typed HTML utilities powered by a C-accelerated core.

class turbohtml.Token

An HTML token produced by Tokenizer or tokenize(). Immutable; the meaningful attributes depend on .type.

attr(name, default=None)

Return the value of attribute name on a start or end tag. A valueless attribute yields None; a missing attribute yields default.

Return type:

str | None
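The lookup rule above separates a valueless attribute (None) from a missing one (default). The following is a minimal pure-Python sketch of that rule over a list shaped like Token.attrs. The helper lookup_attr and the MISSING sentinel are hypothetical names used for illustration only; this is not the library's implementation.

```python
from typing import Optional, Sequence, Tuple, TypeVar, Union

T = TypeVar("T")
AttrPairs = Sequence[Tuple[str, Optional[str]]]


def lookup_attr(
    attrs: Optional[AttrPairs], name: str, default: Optional[T] = None
) -> Union[str, T, None]:
    """Hypothetical stand-in that mirrors the documented attr() semantics."""
    for key, value in attrs or ():
        if key == name:
            # None here means the attribute is present but has no value.
            return value
    # The attribute does not appear at all.
    return default


# Pairs shaped like Token.attrs for the tag <a href="/docs" disabled>.
pairs = [("href", "/docs"), ("disabled", None)]
MISSING = object()  # sentinel that cannot collide with a real attribute value

href = lookup_attr(pairs, "href")
disabled = lookup_attr(pairs, "disabled", MISSING)
title = lookup_attr(pairs, "title", MISSING)
```

With the default of None, a valueless attribute and a missing one are indistinguishable, so passing a sentinel as default is one way to tell them apart.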

attrs

attribute (name, value) pairs for tags, else None

Type:

list[tuple[str, str | None]] | None

col

0-based source column where this token began

Type:

int

data

text run or comment data, else None

Type:

str | None

force_quirks

whether a DOCTYPE forces quirks mode

Type:

bool

line

1-based source line where this token began

Type:

int

name

DOCTYPE name, else None

Type:

str | None

public_id

DOCTYPE public identifier, else None

Type:

str | None

self_closing

whether a start tag carried a trailing slash

Type:

bool

system_id

DOCTYPE system identifier, else None

Type:

str | None

tag

lowercased tag name for start/end tags, else None

Type:

str | None

type

the TokenType of this token

Type:

TokenType

class turbohtml.Tokenizer

Streaming HTML tokenizer. Feed markup in chunks with feed() and iterate each returned iterator to consume the tokens completed so far. When the input ends, call close() and iterate its result for the final tokens. Alternatively, use the tokenizer as a context manager: leaving the with block signals end of input, after which iterating the tokenizer itself yields the remaining tokens. To tokenize a whole string at once, use tokenize().

close()

Signal end of input and return an iterator over the final tokens, flushing any buffered text and the token in progress.

Return type:

Iterator[Token]

feed(data)

Append a chunk of markup and return an iterator over the tokens that are now complete. Text before an unfinished tag stays buffered until more is fed or close() is called.

Parameters:

data (str)

Return type:

Iterator[Token]
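The feed() and close() contract described above can be wrapped in a small generator. This is a sketch only: stream_tokens is a hypothetical helper, and it relies on nothing beyond the two documented methods, so it accepts any object that provides them.

```python
from typing import Any, Iterable, Iterator


def stream_tokens(tokenizer: Any, chunks: Iterable[str]) -> Iterator[Any]:
    """Yield tokens as chunks arrive, then flush whatever is still buffered."""
    for chunk in chunks:
        # feed() returns an iterator over the tokens completed by this chunk.
        yield from tokenizer.feed(chunk)
    # close() signals end of input and returns the final tokens.
    yield from tokenizer.close()
```

Assuming Tokenizer can be constructed without arguments (the reference does not show its constructor), usage would look like `for token in stream_tokens(Tokenizer(), chunks): ...`, where chunks might come from a file or a network response read piece by piece.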

reset()

Discard all input and return to the initial Data state.

Return type:

None

turbohtml.__version__ = '0.2.0'

The installed package version.

turbohtml.escape(s, quote=True)

Replace the special characters "&", "<" and ">" with HTML-safe sequences.

If the optional flag quote is true (the default), the quotation mark characters, both double quote (") and single quote ('), are also translated.

Return type:

str
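A short sketch of when the quote flag matters: text placed inside a quoted attribute needs quotes escaped, while element content only needs "&", "<" and ">" handled. The render_link helper is hypothetical. The fallback import assumes that turbohtml.escape replaces the same characters as the standard library's html.escape, which has the same signature; the exact entity spelling may differ.

```python
try:
    from turbohtml import escape
except ImportError:
    # Stand-in so the sketch runs without turbohtml installed (assumption:
    # both functions escape the same set of characters).
    from html import escape


def render_link(url: str, label: str) -> str:
    """Build an <a> element from untrusted strings."""
    # Attribute value: keep quote=True so a '"' cannot end the attribute early.
    safe_url = escape(url)
    # Element content: quotes are harmless here, so quote=False is enough.
    safe_label = escape(label, quote=False)
    return f'<a href="{safe_url}">{safe_label}</a>'


link = render_link('/search?q="cats"&page=2', "<b>Cats</b> & dogs")
```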

turbohtml.tokenize(s, /)

Tokenize a whole HTML string, returning an iterator of Token objects following the WHATWG tokenization algorithm.

Parameters:

s (str)

Return type:

Iterator[Token]
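As an illustration of consuming the token stream, the sketch below collects href values from anchor start tags. It uses only the documented type, tag and attrs attributes. The iter_hrefs helper and the stand-in enum are hypothetical; the stand-in copies the documented TokenType values so the block runs without turbohtml installed, and the sample tokens are plain objects that imitate Token.

```python
from enum import IntEnum
from types import SimpleNamespace
from typing import Any, Iterable, Iterator

try:
    from turbohtml import TokenType
except ImportError:
    class TokenType(IntEnum):  # stand-in with the documented member values
        TEXT = 0
        START_TAG = 1
        END_TAG = 2
        COMMENT = 3
        DOCTYPE = 4


def iter_hrefs(tokens: Iterable[Any]) -> Iterator[str]:
    """Yield the href of every <a> start tag that has a non-empty value slot."""
    for token in tokens:
        if token.type == TokenType.START_TAG and token.tag == "a":
            for name, value in token.attrs or ():
                if name == "href" and value is not None:
                    yield value


# Imitation tokens for: <a href="/a">x</a><a name="top">
sample_tokens = [
    SimpleNamespace(type=TokenType.START_TAG, tag="a", attrs=[("href", "/a")]),
    SimpleNamespace(type=TokenType.TEXT, tag=None, attrs=None),
    SimpleNamespace(type=TokenType.END_TAG, tag="a", attrs=[]),
    SimpleNamespace(type=TokenType.START_TAG, tag="a", attrs=[("name", "top")]),
]
hrefs = list(iter_hrefs(sample_tokens))
```

With the library installed, the same helper would be called as `iter_hrefs(tokenize(markup))`.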

turbohtml.unescape(s, /)

Convert all named and numeric character references in s to the corresponding Unicode characters, following the HTML5 rules.

Parameters:

s (str)

Return type:

str
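A brief sketch of decoding a string that mixes named, decimal and hexadecimal references. The fallback import assumes turbohtml.unescape agrees with the standard library's html.unescape, which also follows the HTML5 rules, for these common references.

```python
try:
    from turbohtml import unescape
except ImportError:
    # Stand-in so the sketch runs without turbohtml installed.
    from html import unescape

# &lt; &gt; &amp; &eacute; are named references; &#65; is decimal, &#x42; is hex.
encoded = "&lt;p&gt;Caf&eacute; &amp; bar &#65;&#x42;&lt;/p&gt;"
decoded = unescape(encoded)
```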

class turbohtml.TokenType(*values)

The kind of a Token; selects which of its attributes are meaningful.

TEXT = 0
START_TAG = 1
END_TAG = 2
COMMENT = 3
DOCTYPE = 4
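Since the token type selects which attributes carry data, consumers typically branch on it before reading anything else. The sketch below shows one such dispatch. The describe helper and the stand-in enum are hypothetical; the stand-in copies the values listed above so the block runs without turbohtml installed, and the sample token is a plain object that imitates Token.

```python
from enum import IntEnum
from types import SimpleNamespace
from typing import Any

try:
    from turbohtml import TokenType
except ImportError:
    class TokenType(IntEnum):  # stand-in with the documented member values
        TEXT = 0
        START_TAG = 1
        END_TAG = 2
        COMMENT = 3
        DOCTYPE = 4


def describe(token: Any) -> str:
    """Render a one-line summary using only the fields valid for token.type."""
    if token.type == TokenType.TEXT:
        return f"text {token.data!r}"
    if token.type == TokenType.START_TAG:
        slash = "/" if token.self_closing else ""
        return f"<{token.tag}{slash}>"
    if token.type == TokenType.END_TAG:
        return f"</{token.tag}>"
    if token.type == TokenType.COMMENT:
        return f"comment {token.data!r}"
    if token.type == TokenType.DOCTYPE:
        return f"doctype {token.name}"
    raise ValueError(f"unexpected token type: {token.type!r}")


# Imitation of the token for <br/>.
sample = SimpleNamespace(type=TokenType.START_TAG, tag="br", self_closing=True)
summary = describe(sample)
```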