Tutorials¶
Getting started¶
This tutorial walks you from an empty environment to escaping and unescaping your first HTML.
Install turbohtml from PyPI:
$ pip install turbohtml
Open a Python prompt and escape some text for safe inclusion in an HTML page:
>>> import turbohtml
>>> turbohtml.escape("5 > 3 & 2 < 4")
'5 > 3 & 2 < 4'
By default the quotation marks are escaped too, which is what you want inside an attribute value:
>>> turbohtml.escape("name=\"O'Brien\"")
'name="O'Brien"'
Now go the other way and turn HTML character references back into text:
>>> turbohtml.unescape("Tom & Jerry — café")
'Tom & Jerry — café'
From here you can stay with the string helpers, or continue below to break whole documents into tokens.
Tokenizing your first document¶
This tutorial takes you from a string of HTML to a stream of tokens you can inspect.
Start with a small document and hand it to turbohtml.tokenize(), which returns an iterator of
turbohtml.Token objects:
>>> import turbohtml
>>> for token in turbohtml.tokenize('<p class="intro">Tom & Jerry</p>'):
... print(token)
Token(START_TAG, tag='p')
Token(TEXT, data='Tom & Jerry')
Token(END_TAG, tag='p')
Each token tells you what it is through type, a turbohtml.TokenType. Start and end tags carry the
lowercased tag name and the attributes, already decoded:
>>> start, text, end = turbohtml.tokenize('<p class="intro">Tom & Jerry</p>')
>>> start.type
<TokenType.START_TAG: 1>
>>> start.tag
'p'
>>> start.attrs
[('class', 'intro')]
Text arrives with character references already resolved — the & above came through as a plain & — and
content that the HTML specification treats as raw, such as a script body, arrives as one text token without further
interpretation:
>>> [token.data for token in turbohtml.tokenize("<script>if (a < b) run()</script>")
... if token.type is turbohtml.TokenType.TEXT]
['if (a < b) run()']
When the document arrives in pieces — from a network stream, for example — create a turbohtml.Tokenizer and
feed the pieces as they come. Each feed() returns the tokens that piece completed, and close() flushes whatever
remains:
>>> tokenizer = turbohtml.Tokenizer()
>>> [token.tag for token in tokenizer.feed("<div><sp")]
['div']
>>> [token.tag for token in tokenizer.feed("an>")]
['span']
>>> list(tokenizer.close())
[]
Notice the incomplete <sp stayed buffered until the rest of the tag arrived. That is the whole tokenizer API. From
here, head to the How-to guides guides for task-focused recipes — including porting an existing
html.parser.HTMLParser subclass — or the Reference for the exact signatures.