Commit graph

21 commits

Author SHA1 Message Date
Gabriel Wicke aaca5eac7d More tweaks: safesubst and image options
* Ignore safesubst for now
* Remove an unneeded whitelist entry
* Make sure the caption is not lost for thumbs (fix to last commit) and remove
  debug print

Change-Id: I243584ed0838cf7c3b4110fe9cdf869272477312
2012-04-17 11:02:52 +02:00
Gabriel Wicke afa5b95bc1 Don't work around html5 library tokenizer attribute reordering
The HTML5 parser we are using to normalize expected HTML output in parserTests
reverses the order of attributes (see
https://github.com/aredridel/html5/pull/53 for the fix). Remove whitelist
entries concerned with this and use the proper order in external image
attributes.

Change-Id: If1868cae05396a150757c85a20473ab756cbcd97
2012-04-16 17:09:06 +02:00
Gabriel Wicke f662690d02 Shorten data-mw-rt to data-mw and clean up whitelist
Instead of a proliferation of data-mw-* attributes, it should be easier to
stash all private / non-semantic round-trip information in a JSON object
stored in data-mw.

Change-Id: Id200a6a8789fa152f29ea530e5a24b6ee7b4b285
2012-04-02 18:12:49 +02:00
Gabriel Wicke 227103e12c Accept empty table cell attribute sections, and consider percent-encoded %2525
valid. 270 tests passing.
2012-03-06 14:32:45 +00:00
Gabriel Wicke 2efcd3cd57 Reworked percent encoding handling for URIs to get closer to the 'url
construction' part of the HTML5 spec:
http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation

Removed a few whitelisted test cases that are now passing directly.

The encoding canonicalization could also be moved to the Sanitizer. Doing this
early in token stream processing however has the advantage of providing further
transformations uniform data to work with. We could even consider to move this
even further into the tokenizer.
2012-03-06 13:49:37 +00:00
Gabriel Wicke 19fe9726a2 Fix invalid external link representation. 268 tests passing. 2012-03-05 18:06:29 +00:00
Gabriel Wicke 7b0c807710 Change wikilink tokenization strategy to split on pipes. This makes it
possible to support template / template argument expansion in image options,
and causes little trouble for wikilinks. Non-image wikilinks with multiple
text pipes are quite rare in the dumps, and concatenating description tokens
with a plain '|' is quite easy. 261 parser tests passing.
2012-03-05 12:00:38 +00:00
au f1fb937b4a * Instead of sorting attributes, whitelist the one parserTest where it matters. 2012-02-20 22:26:24 +00:00
Gabriel Wicke 025f9cddb3 Prefix all internal data- attributes with data-mw- and adjust the whitelist
and test output normalization accordingly. 235 tests passing.
2012-02-13 13:54:07 +00:00
Gabriel Wicke 1f6db903e9 Pluck a few low-hanging fruit in external link tokenization, and add a simple
localurl parser function implementation. 230 parser tests now passing.
2012-02-07 10:28:23 +00:00
Gabriel Wicke 34025251a3 Clean up 'END' token handling a bit. 2012-01-17 20:01:21 +00:00
Gabriel Wicke 2e35171fd1 Fix quote handling and tweak the whitelist a bit. 'any' token registrations
are now merged with specific registrations by rank. Not yet clear if that is a
good idea overall, need to check use cases when implementing template expansion
and other functionality.

183 parser test now passing.
2012-01-04 14:09:05 +00:00
Gabriel Wicke 8e00a72d0a Improvements to link trail handling, and two tweaks to the whitelist. 182
tests now passing. 

Link trails depend on language-dependent positive character classes in the PHP
parser. These classes all seem to disallow punctuation implicitly and list
differing plain text characters instead, so it might be possible to get away
with identifying a common class of non-trail punctuation instead. This would
help to keep the tokenizer independent of configurations, which is very
desirable for caching and simplified external parsing.
2011-12-30 12:47:06 +00:00
Gabriel Wicke 1c7fe0eb34 Refactor table productions to support table fragments in templates (table
start / row / table end). The old productions are not deleted yet to make it
easy to compare the output on more complex articles. 181 tests passing after
adding two table tests with whitespace-only differences to the whitelist.
2011-12-22 11:43:55 +00:00
Gabriel Wicke a8fa9433c4 Convert quote handling (italic/bold) to a core extension operating on the
token stream. This is the first token transformation exercising the
TokenTransformer class as its dispatcher. Template expansions, wiki link
formatting, tag sanitation and extensions should be able to use the same
dispatcher by registering for specific token types.

The parser performance is very slightly improved as the token stream is only
traversed once.
2011-12-12 20:53:14 +00:00
Gabriel Wicke 0734fb24c5 Add a few more items to the whitelist 2011-12-07 11:44:38 +00:00
Gabriel Wicke 7e1585d360 Add empty tables to the whitelist (legal in HTML5). Also add one more
functionally identical italic/bold/link permmutation on the whitelist.
2011-12-06 22:05:43 +00:00
Gabriel Wicke 1a5ffacc5c Add slightly different but functionally identical italic/bold/link nesting to
whitelist.
2011-12-06 16:45:19 +00:00
Gabriel Wicke d00743ad79 Improve external links and definition lists, now 133 tests passing ;)
Also add printwhitelist option to test runner, provides js code copy/pastable
to whitelist.
2011-12-01 14:25:59 +00:00
Gabriel Wicke edf40c616c Make whitelist usage an option; tweak comment a bit 2011-12-01 11:47:22 +00:00
Gabriel Wicke 35efed6634 Add a parser test whitelist for manually-checked tests, and an option to print
JSON-serialized parser output for failing tests, which can then be added to
the whitelist if appropriate.
2011-12-01 10:58:12 +00:00