2011-12-01 10:58:12 +00:00
/* A map of test titles and their manually verified output. If the parser
 * output matches the expected output listed here, the test can be marked as
 * passing in parserTests.js. */
testWhiteList = {};
First attempt implementing rewriting rules on the DOM
- This is implemented as a post-processing pass.
- Might require additional checks to verify rewriteability.
- Implemented as a pair-wise tag DOM minimization strategy,
i.e. it takes tag pairs, for example (B, I), and attempts to
normalize the tree just for those tag pairs. Normalizing
across multiple tags is implemented as pairwise rewriting
across all pairs, e.g. (b,i), (b,u), (i,u) for (b,i,u).
- Copied over attributes as part of rewriting, but some of the
attributes lose their meaning on rewriting since tags are
reordered (ex: sourcePosn, sourceTagPosn). How do we handle this?
Output examples and possible issues to fix:
<i><b><u>biu</u></b></i><b><u>bu</u></b><u>u</u>
gets rewritten to:
<u><b><i>biu</i>bu</b>u</u>
But, the equivalent wikitext form:
'''''<u>biu</u>''''''''<u>bu</u>'''<u>u</u>
does not get rewritten because of parsing differences.
This wikitext gets parsed into:
<i><b><u>biu</u>'''</b></i><u>bu<b>u</b></u>
The extra ''' token in the middle thwarts DOM rewriting.
However, a slightly different version:
"'''''<u>biu</u>''<u>bu</u>'''<u>u</u>"
gets properly normalized to:
<u>'''''biu''bu'''u</u>
An alternative, but fun strategy to play with is to use the following
two normalization primitives: S(wap) and M(erge).
- S rewrites T1(T2(x)) into T2(T1(x))
(ex: <b><i>foo</i></b> ==> <i><b>foo</b></i>)
- M rewrites (T(x),T(y)) into (T(x,y)).
(ex: <b>foo</b><b>bar</b> ==> <b>foobar</b>)
The current rewriting strategy could possibly be re-implemented as S-M
rewriting. The problem to solve there would be to find an efficient
rewriting strategy that is guaranteed to lead to a normal form. I may
not play with it now, but just documenting it for later (to play with
in my spare time).
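The two primitives can be sketched on a toy tree as follows. This is only an illustration, not the code in this commit: the node shape (a text node is a string, an element is { tag, children }) and the helper names el, swap, merge and serialize are assumptions made for the example.

```javascript
// Hypothetical tree shape: a text node is a plain string, an element is
// { tag: "b", children: [...] }. None of these helpers exist in the parser.
function el(tag, children) {
	return { tag: tag, children: children };
}

// S(wap): rewrites T1(T2(x)) into T2(T1(x)). Only applies when the outer
// element has exactly one child and that child is itself an element.
function swap(node) {
	if (typeof node === "string" || node.children.length !== 1) {
		return node;
	}
	var inner = node.children[0];
	if (typeof inner === "string") {
		return node;
	}
	return el(inner.tag, [el(node.tag, inner.children)]);
}

// M(erge): rewrites adjacent siblings (T(x), T(y)) into (T(x, y)).
function merge(children) {
	var out = [];
	children.forEach(function (child) {
		var last = out[out.length - 1];
		if (last !== undefined && typeof last !== "string" &&
				typeof child !== "string" && last.tag === child.tag) {
			out[out.length - 1] = el(last.tag, last.children.concat(child.children));
		} else {
			out.push(child);
		}
	});
	return out;
}

// Serialize the toy tree back to HTML so results can be compared as strings.
function serialize(node) {
	if (typeof node === "string") {
		return node;
	}
	return "<" + node.tag + ">" + node.children.map(serialize).join("") +
		"</" + node.tag + ">";
}
```

Applying swap to <b><i>foo</i></b> and merge to <b>foo</b><b>bar</b> reproduces the two examples given for S and M. The sketch does not address the open question of an application order that is guaranteed to reach a normal form.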
This commit serves only as a record of fun/experimental code where I get to
learn details of JS, wikitext, parsing, and DOM manipulation. The next
version of this code will attempt to introduce minimal DOM restructuring
across multiple tags at once, which can be more efficient.
gwicke: Removed now passing test from whitelist, and updated another whitelist
entry which is now improved.
Change-Id: Ie97bcb164eb62c34ba61aa76ba2f4c232aa713d8
// Italic/link nesting is changed in this test, but the rendered result is the
// same. Currently the result is actually an improvement over the MediaWiki
// output.
testWhiteList["Bug 2702: Mismatched <i>, <b> and <a> tags are invalid"] = "<p><i><a href=\"http://example.com\">text</a></i><a href=\"http://example.com\"><b>text</b></a><i>Something <a href=\"http://example.com\">in italic</a></i><i>Something <a href=\"http://example.com\">mixed</a></i><a href=\"http://example.com\"><b>, even bold</b></a><i><b>Now <a href=\"http://example.com\">both</a></b></i></p>";
// The expected result for this test is really broken html.
testWhiteList["Link containing double-single-quotes '' in text embedded in italics (bug 4598 sanity check)"] = "<p><i>Some <a href=\"/wiki/Link\">pretty </a></i><a href=\"/wiki/Link\">italics<i> and stuff</i></a><i>!</i></p>";
testWhiteList["External link containing double-single-quotes in text embedded in italics (bug 4598 sanity check)"] = "<p><i>Some <a href=\"http://example.com/\">pretty </a></i><a href=\"http://example.com/\">italics<i> and stuff</i></a><i>!</i></p>";
// This is a rare edge case, and the new behavior is arguably more consistent
testWhiteList["5 quotes, code coverage +1 line"] = "<p><i><b></b></i></p>";
// The comment in the test already suggests this result as correct, but
// supplies the old result without preformatting.
testWhiteList["Bug 6200: Preformatted in <blockquote>"] = "<blockquote><pre>\nBlah</pre></blockquote>";
// Empty tables, and tables with only a caption, are legal in HTML5.
testWhiteList["A table with no data."] = "<table></table>";
testWhiteList["A table with nothing but a caption"] = "<table><caption> caption</caption></table>";
testWhiteList["Fuzz testing: Parser22"] = "<p><a href=\"http://===r:::https://b\">http://===r:::https://b</a></p><table></table>";
/**
 * Small whitespace differences that we now start to care about for
 * round-tripping
 */
// Very minor whitespace difference at end of cell (MediaWiki inserts a
// newline before the close tag even if there was no trailing space in the cell)
//testWhiteList["Table rowspan"] = "<table border=\"1\"><tbody><tr><td> Cell 1, row 1 </td><td rowspan=\"2\"> Cell 2, row 1 (and 2) </td><td> Cell 3, row 1 </td></tr><tr><td> Cell 1, row 2 </td><td> Cell 3, row 2 </td></tr></tbody></table>";
// Inter-element whitespace only
//testWhiteList["Indented table markup mixed with indented pre content (proposed in bug 6200)"] = " \n\n<table><tbody><tr><td><pre>\nText that should be rendered preformatted\n</pre></td></tr></tbody></table>";
/* Misc sanitizer / HTML5 differences */
// Single quotes are legal in HTML5 URIs. See
// http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation
testWhiteList["Link containing double-single-quotes '' (bug 4598)"] = "<p><a href=\"/wiki/Lista_d''e_paise_d''o_munno\">Lista d''e paise d''o munno</a></p>";
// Sanitizer
testWhiteList["Invalid attributes in table cell (bug 1830)"] = "<table><tbody><tr><td Cell:=\"\">broken</td></tr></tbody></table>";
testWhiteList["Table security: embedded pipes (http://lists.wikimedia.org/mailman/htdig/wikitech-l/2006-April/022293.html)"] = "<table><tbody><tr><td> |<a href=\"ftp://|x||\">[1]</a>\" onmouseover=\"alert(document.cookie)\">test</td></tr></tbody></table>";
// Sanitizer, but UTF8 in link is ok in HTML5
testWhiteList["External link containing double-single-quotes with no space separating the url from text in italics"] = "<p><a href=\"http://www.musee-picasso.fr/pages/page_id18528_u1l2.htm\" data-mw=\"{&quot;sourcePos&quot;:[0,146]}\"><i>La muerte de Casagemas</i> (1901) en el sitio de </a><a href=\"/wiki/Museo_Picasso_(París)\">Museo Picasso</a>.</p>";
testWhiteList["External links: wiki links within external link (Bug 3695)"] = "<p><a href=\"http://example.com\"></a><a href=\"/wiki/Wikilink\">wikilink</a> embedded in ext link</p>";
// This is valid, just confusing for humans. The reason for disallowing this
// might be history by now. XXX: Check this!
testWhiteList["Link containing % as a double hex sequence interpreted to hex sequence"] = "<p><a href=\"/wiki/7%2525_Solution\">7%25 Solution</a></p>";
if (typeof module == "object") {
	module.exports.testWhiteList = testWhiteList;
}
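
A test runner could consult a map like this one roughly as sketched below. This is a minimal illustration and not the actual logic in parserTests.js: the function name checkResult, the three result labels, and the sampleWhiteList stand-in are assumptions, and any normalization the real runner applies before comparing is left out.

```javascript
// Hypothetical stand-in for the whitelist map, holding one entry from this file.
var sampleWhiteList = {};
sampleWhiteList["A table with no data."] = "<table></table>";

// Hypothetical helper: classify a test result by comparing the parser output
// first against the expected output of the test case, and then against the
// manually verified whitelist entry for the same title, if there is one.
function checkResult(title, actualHtml, expectedHtml, whiteList) {
	if (actualHtml === expectedHtml) {
		return "pass";
	}
	if (Object.prototype.hasOwnProperty.call(whiteList, title) &&
			whiteList[title] === actualHtml) {
		return "whitelisted";
	}
	return "fail";
}
```

Keying on the exact test title means a whitelist entry stops matching as soon as either the title or the parser output changes, so a stale entry shows up as a failing test instead of being silently accepted.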