Reference¶

turbohtml: fast, typed HTML utilities powered by a C-accelerated core.

class turbohtml.Token¶

An HTML token produced by Tokenizer or tokenize(). Immutable; the meaningful attributes depend on .type.

attr(name, default=None)¶

Return the value of attribute name on a start or end tag. A valueless attribute yields None; a missing attribute yields default.

Parameters:

name (str)
default (str | None)

Return type:

str | None
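The lookup rule above can be modelled in plain Python over (name, value) pairs shaped like the attrs list. This is an illustrative model rather than turbohtml's implementation: lookup_attr is a hypothetical helper, and returning default when there are no pairs at all is an assumption.

```python
from typing import Optional

Attrs = Optional[list[tuple[str, Optional[str]]]]


def lookup_attr(attrs: Attrs, name: str, default: Optional[str] = None) -> Optional[str]:
    """Model of the documented rule: missing -> default, valueless -> None."""
    if attrs is None:  # assumption: a token without attributes behaves like "missing"
        return default
    for key, value in attrs:
        if key == name:
            return value  # may be None for a valueless attribute such as `disabled`
    return default


# Pairs as they might appear for <input type="text" disabled>
pairs = [("type", "text"), ("disabled", None)]

print(lookup_attr(pairs, "type"))             # text
print(lookup_attr(pairs, "disabled", "n/a"))  # None: present but valueless
print(lookup_attr(pairs, "id", "n/a"))        # n/a: missing, so the default is returned
```

Under this rule, a caller can only tell a valueless attribute from a missing one by passing a default other than None.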

attrs¶

attribute (name, value) pairs for tags, else None

Type: list[tuple[str, str | None]] | None

col¶

0-based source column where this token began

Type: int

data¶

text run or comment data, else None

Type: str | None

force_quirks¶

whether a DOCTYPE forces quirks mode

Type: bool

line¶

1-based source line where this token began

Type: int
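Because line is 1-based and col is 0-based, a message aimed at an editor that counts columns from 1 needs an offset on the column only. A small sketch, where format_position is a hypothetical helper that works on any object exposing line and col:

```python
from types import SimpleNamespace


def format_position(token, filename="<input>"):
    """Return 'file:line:col' with a 1-based column for editor-style messages."""
    # token.line is already 1-based; token.col is 0-based, so shift it by one.
    return f"{filename}:{token.line}:{token.col + 1}"


# Stand-in object with the two documented attributes, so the sketch runs on its own.
first_on_line_three = SimpleNamespace(line=3, col=0)
print(format_position(first_on_line_three, "page.html"))  # page.html:3:1
```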

name¶

DOCTYPE name, else None

Type: str | None

public_id¶

DOCTYPE public identifier, else None

Type: str | None

self_closing¶

whether a start tag carried a trailing slash

Type: bool

system_id¶

DOCTYPE system identifier, else None

Type: str | None

tag¶

lowercased tag name for start/end tags, else None

Type: str | None

type¶

the TokenType of this token

Type: TokenType
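Since the meaningful attributes depend on type, code that consumes tokens usually branches on it first. The sketch below shows one way to do that; describe is a hypothetical helper, the fallback TokenType is a stand-in that copies the documented member values so the example runs without the package, and the token in the demo is a plain stand-in object rather than a real Token.

```python
from types import SimpleNamespace

try:
    from turbohtml import TokenType
except ImportError:
    # Stand-in with the documented member values, only so the sketch runs without the package.
    from enum import IntEnum

    class TokenType(IntEnum):
        TEXT = 0
        START_TAG = 1
        END_TAG = 2
        COMMENT = 3
        DOCTYPE = 4


def describe(token):
    """Summarise a token using only the attributes meaningful for its type."""
    if token.type == TokenType.TEXT:
        return f"text {token.data!r}"
    if token.type == TokenType.START_TAG:
        suffix = " /" if token.self_closing else ""
        return f"<{token.tag}{suffix}> with {len(token.attrs or [])} attribute(s)"
    if token.type == TokenType.END_TAG:
        return f"</{token.tag}>"
    if token.type == TokenType.COMMENT:
        return f"comment {token.data!r}"
    return f"doctype {token.name!r} (force_quirks={token.force_quirks})"


# Token-like stand-in for <br/>; real tokens come from Tokenizer or tokenize().
br = SimpleNamespace(type=TokenType.START_TAG, tag="br", attrs=[], self_closing=True)
print(describe(br))  # <br /> with 0 attribute(s)
```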

class turbohtml.Tokenizer¶

Streaming HTML tokenizer. Feed markup with feed() and iterate the iterator each call returns. At the end of input, call close() and iterate its result. Alternatively, use the tokenizer as a context manager: leaving the with block signals end of input, after which iterating the tokenizer itself yields the remaining tokens. To tokenize a whole string at once, use tokenize().

close()¶

Signal end of input and return an iterator over the final tokens, flushing any buffered text and the token in progress.

Return type: Iterator[Token]

feed(data)¶

Append a chunk of markup and return an iterator over the tokens that are now complete. Text before an unfinished tag stays buffered until more is fed or close() is called.

Parameters: data (str)
Return type: Iterator[Token]

reset()¶

Discard all input and return to the initial Data state.

Return type: None
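The feed() and close() protocol can be wrapped in a small generator. This is a sketch under assumptions: stream_tokens is a hypothetical helper, and constructing Tokenizer() with no arguments is assumed, since the reference does not list constructor parameters. The demo is skipped when turbohtml is not installed.

```python
def stream_tokens(tokenizer, chunks):
    """Yield tokens as chunks arrive, then flush whatever is still buffered."""
    for chunk in chunks:
        # feed() returns an iterator over the tokens completed by this chunk.
        yield from tokenizer.feed(chunk)
    # close() signals end of input and yields the final tokens.
    yield from tokenizer.close()


try:
    from turbohtml import Tokenizer
except ImportError:
    Tokenizer = None  # the demo below is skipped when the package is not installed

if Tokenizer is not None:
    # A tag split across two chunks: "<p cla" is incomplete until the second feed.
    for token in stream_tokens(Tokenizer(), ['<p cla', 'ss="x">hi</p>']):
        print(token.type, token.tag, token.data)
```

Consuming each feed() iterator before feeding the next chunk keeps memory bounded by the chunk size plus whatever the tokenizer buffers for an unfinished token.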

turbohtml.__version__ = '0.2.0'¶

The installed package version.

turbohtml.escape(s, quote=True)¶

Replace special characters “&”, “<” and “>” with HTML-safe sequences.

If the optional flag quote is true (the default), the quotation mark characters, both double quote (") and single quote ('), are also translated.

Parameters:

s (str)
quote (bool)

Return type:

str
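A short sketch of the two modes of the quote flag. It falls back to the standard library's html.escape, whose description reads the same, only so the example runs where turbohtml is not installed; that both produce identical output is an assumption.

```python
try:
    from turbohtml import escape
except ImportError:
    # Fallback so the sketch runs without turbohtml; identical output is an assumption.
    from html import escape

comment = 'Tom & "Jerry" <3'

# quote=True (the default) also translates quotation marks, which suits attribute values.
attribute_html = f'<input value="{escape(comment)}">'

# quote=False leaves quotation marks alone, which is enough for text between tags.
body_html = f"<p>{escape(comment, quote=False)}</p>"

print(attribute_html)
print(body_html)
```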

turbohtml.tokenize(s, /)¶

Tokenize a whole HTML string, returning an iterator of Token objects following the WHATWG tokenization algorithm.

Parameters: s (str)
Return type: Iterator[Token]
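As a quick way to inspect what tokenize() produces for a document, the tokens can be tallied by type. count_token_types is a hypothetical helper that accepts any iterable of objects with a type attribute; the demo is skipped when turbohtml is not installed.

```python
from collections import Counter


def count_token_types(tokens):
    """Tally tokens by their .type; works on any iterable of Token-like objects."""
    return Counter(token.type for token in tokens)


try:
    from turbohtml import tokenize
except ImportError:
    tokenize = None  # the demo below is skipped when the package is not installed

if tokenize is not None:
    html = "<!DOCTYPE html><title>Hi</title><!-- note --><p>Hello</p>"
    # tokenize() returns an iterator, so the tally consumes it exactly once.
    for kind, count in count_token_types(tokenize(html)).items():
        print(kind, count)
```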

turbohtml.unescape(s, /)¶

Convert all named and numeric character references in s to the corresponding Unicode characters, following the HTML5 rules.

Parameters: s (str)
Return type: str
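A small sketch covering one named, one decimal, and one hexadecimal reference. It falls back to the standard library's html.unescape only so the example runs where turbohtml is not installed; that both decode identically is an assumption.

```python
try:
    from turbohtml import unescape
except ImportError:
    # Fallback so the sketch runs without turbohtml; identical output is an assumption.
    from html import unescape

encoded = "&lt;p&gt; caf&eacute; &#38; &#x263A;"
decoded = unescape(encoded)
print(decoded)  # <p> café & ☺
```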

class turbohtml.TokenType(*values)¶

The kind of a Token; selects which of its attributes are meaningful.

TEXT = 0¶

START_TAG = 1¶

END_TAG = 2¶

COMMENT = 3¶

DOCTYPE = 4¶