construction' part of the HTML5 spec:
http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation
Removed a few whitelisted test cases that are now passing directly.
The encoding canonicalization could also be moved to the Sanitizer. Doing this
early in token stream processing however has the advantage of providing further
transformations uniform data to work with. We could even consider to move this
even further into the tokenizer.
possible to support template / template argument expansion in image options,
and causes little trouble for wikilinks. Non-image wikilinks with multiple
text pipes are quite rare in the dumps, and concatenating description tokens
with a plain '|' is quite easy. 261 parser tests passing.
are now merged with specific registrations by rank. Not yet clear if that is a
good idea overall, need to check use cases when implementing template expansion
and other functionality.
183 parser test now passing.
tests now passing.
Link trails depend on language-dependent positive character classes in the PHP
parser. These classes all seem to disallow punctuation implicitly and list
differing plain text characters instead, so it might be possible to get away
with identifying a common class of non-trail punctuation instead. This would
help to keep the tokenizer independent of configurations, which is very
desirable for caching and simplified external parsing.
start / row / table end). The old productions are not deleted yet to make it
easy to compare the output on more complex articles. 181 tests passing after
adding two table tests with whitespace-only differences to the whitelist.
token stream. This is the first token transformation exercising the
TokenTransformer class as its dispatcher. Template expansions, wiki link
formatting, tag sanitation and extensions should be able to use the same
dispatcher by registering for specific token types.
The parser performance is very slightly improved as the token stream is only
traversed once.