mediawiki-extensions-Visual.../modules/parser/test/parserTests-whitelist.js

/* A map of test titles and their manually verified output. If the parser
 * output matches the expected output listed here, the test can be marked as
 * passing in parserTests.js. */

var testWhiteList = {};

// Italic/link nesting is changed in this test, but the rendered result is the
// same. Currently the result is actually an improvement over the MediaWiki
// output.
testWhiteList["Bug 2702: Mismatched <i>, <b> and <a> tags are invalid"] = "<p><i><a href=\"http://example.com\">text</a></i><a href=\"http://example.com\"><b>text</b></a><i>Something <a href=\"http://example.com\">in italic</a></i><i>Something <a href=\"http://example.com\">mixed</a></i><a href=\"http://example.com\"><b>, even bold</b></a><i><b>Now <a href=\"http://example.com\">both</a></b></i></p>";

// The expected result for this test is itself broken HTML.
testWhiteList["Link containing double-single-quotes '' in text embedded in italics (bug 4598 sanity check)"] = "<p><i>Some <a href=\"/wiki/Link\">pretty </a></i><a href=\"/wiki/Link\">italics<i> and stuff</i></a><i>!</i></p>";

testWhiteList["External link containing double-single-quotes in text embedded in italics (bug 4598 sanity check)"] = "<p><i>Some <a href=\"http://example.com/\">pretty </a></i><a href=\"http://example.com/\">italics<i> and stuff</i></a><i>!</i></p>";

// This is a rare edge case, and the new behavior is arguably more consistent.
testWhiteList["5 quotes, code coverage +1 line"] = "<p><i><b></b></i></p>";

// The comment in the test already suggests this result as correct, but
// supplies the old result without preformatting.
testWhiteList["Bug 6200: Preformatted in <blockquote>"] = "<blockquote><pre>\nBlah</pre></blockquote>";

// Empty tables, and tables with only a caption, are legal in HTML5.
testWhiteList["A table with no data."] = "<table></table>";
testWhiteList["A table with nothing but a caption"] = "<table><caption> caption</caption></table>";
testWhiteList["Fuzz testing: Parser22"] = "<p><a href=\"http://===r:::https://b\">http://===r:::https://b</a></p><table></table>";

/**
 * Small whitespace differences that we now start to care about for
 * round-tripping
 */

// Very minor whitespace difference at end of cell (MediaWiki inserts a
// newline before the close tag even if there was no trailing space in the cell)
//testWhiteList["Table rowspan"] = "<table border=\"1\"><tbody><tr><td> Cell 1, row 1 </td><td rowspan=\"2\"> Cell 2, row 1 (and 2) </td><td> Cell 3, row 1 </td></tr><tr><td> Cell 1, row 2 </td><td> Cell 3, row 2 </td></tr></tbody></table>";

// Inter-element whitespace only
//testWhiteList["Indented table markup mixed with indented pre content (proposed in bug 6200)"] = "   \n\n<table><tbody><tr><td><pre>\nText that should be rendered preformatted\n</pre></td></tr></tbody></table>";


/* Misc sanitizer / HTML5 differences */

// Single quotes are legal in HTML5 URIs. See 
// http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation
testWhiteList["Link containing double-single-quotes '' (bug 4598)"] = "<p><a href=\"/wiki/Lista_d''e_paise_d''o_munno\">Lista d''e paise d''o munno</a></p>";


// Sanitizer
// testWhiteList["Invalid attributes in table cell (bug 1830)"] = "<table><tbody><tr><td Cell:=\"\">broken</td></tr></tbody></table>";
// testWhiteList["Table security: embedded pipes (http://lists.wikimedia.org/mailman/htdig/wikitech-l/2006-April/022293.html)"] = "<table><tbody><tr><td> |<a href=\"ftp://|x||\">[1]</a>\" onmouseover=\"alert(document.cookie)\"&gt;test</td></tr></tbody></table>";

// Sanitizer difference, but UTF-8 in a link is OK in HTML5.
testWhiteList["External link containing double-single-quotes with no space separating the url from text in italics"] = "<p><a href=\"http://www.musee-picasso.fr/pages/page_id18528_u1l2.htm\" data-rt=\"{&quot;sourcePos&quot;:[0,146]}\"><i>La muerte de Casagemas</i> (1901) en el sitio de </a><a href=\"/wiki/Museo_Picasso_(París)\">Museo Picasso</a>.</p>";

testWhiteList["External links: wiki links within external link (Bug 3695)"] = "<p><a href=\"http://example.com\"></a><a href=\"/wiki/Wikilink\">wikilink</a> embedded in ext link</p>";


// This is valid, just confusing for humans. The reason for disallowing this
// might be history by now. XXX: Check this!
testWhiteList["Link containing % as a double hex sequence interpreted to hex sequence"] = "<p><a href=\"/wiki/7%2525_Solution\">7%25 Solution</a></p>";

// Export the map when loaded as a CommonJS module (e.g. under node.js).
if (typeof module === "object" && module.exports) {
	module.exports.testWhiteList = testWhiteList;
}
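
A minimal sketch of how a test runner might consult this map. The helper names `checkResult` and `whitelistEntry` are hypothetical; the actual logic in parserTests.js is not shown here and may differ. The second helper illustrates the kind of copy/pastable line that the commit history attributes to the printwhitelist option.

```javascript
// Hypothetical helper: classify a test result as a direct pass, a pass via
// manually verified whitelist output, or a failure.
function checkResult(title, expected, actual, whiteList) {
	if (actual === expected) {
		return "pass";
	}
	// hasOwnProperty avoids false hits on inherited keys such as "toString".
	if (Object.prototype.hasOwnProperty.call(whiteList, title) &&
			whiteList[title] === actual) {
		return "whitelisted";
	}
	return "fail";
}

// Hypothetical helper: format a failing test's output as a line that could
// be pasted into the whitelist after manual review.
function whitelistEntry(title, actual) {
	return "testWhiteList[" + JSON.stringify(title) + "] = " +
		JSON.stringify(actual) + ";";
}

// Usage with made-up expected output; only the title and the whitelisted
// string come from the file above.
var sampleWhiteList = { "A table with no data.": "<table></table>" };
console.log(checkResult("A table with no data.", "<table>\n</table>",
	"<table></table>", sampleWhiteList));
console.log(whitelistEntry("A table with no data.", "<table></table>"));
```

Comparing whole strings keeps the check strict: any change in parser output, even whitespace, drops the test out of the whitelist and forces a fresh manual review.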
2011-12-01 10:58:12 +00:00
Add a parser test whitelist for manually-checked tests, and an option to print JSON-serialized parser output for failing tests, which can then be added to the whitelist if appropriate.

2012-05-25 01:10:47 +00:00
First attempt implementing rewriting rules on the DOM

- This is implemented as a post-processing pass.
- Might require additional checks to verify rewriteability.
- Implemented as a pair-wise tag DOM minimization strategy, i.e. it takes tag pairs (B, I) for ex, and attempts to normalize the tree just for those tag pairs. Normalizing across multiple tags is implemented as pairwise rewriting across all pairs: Ex: (b,i), (b,u), (i,u) for (b,i,u)
- Copied over attributes as part of rewriting, but some of the attributes lose their meaning on rewriting since tags are reordered (ex: sourcePosn, sourceTagPosn). How do we handle this?

Output examples and possible issues to fix:

<i><b><u>biu</u></b></i><b><u>bu</u></b><u>u</u>
gets rewritten to:
<u><b><i>biu</i>bu</b>u</u>

But, the equivalent wikitext form:
'''''<u>biu</u>''''''''<u>bu</u>'''<u>u</u>
does not get rewritten because of parsing differences. This wikitext gets parsed into:
<i><b><u>biu</u>'''</b></i><u>bu<b>u</b></u>
The extra ''' token in the middle thwarts DOM rewriting.

However, a slightly different version:
"'''''<u>biu</u>''<u>bu</u>'''<u>u</u>"
gets properly normalized to:
<u>'''''biu''bu'''u</u>

An alternative, but fun strategy to play with is to use the following two normalization primitives: S(wap) and M(erge).
- S rewrites T1(T2(x)) into T2(T1(x)) (ex: <b><i>foo</i></b> ==> <i><b>foo</b></i>)
- M rewrites (T(x),T(y)) into (T(x,y)). (ex: <b>foo</b><b>bar</b> ==> <b>foobar</b>)

The current rewriting strategy could possibly be re-implemented as S-M rewriting. The problem to solve there would be to find an efficient rewriting strategy that is guaranteed to lead to a normal form. I may not play with it now, but just documenting it for later (to play with in my spare time).

This commit is just as a record of fun/experimental code where I get to learn details of JS, wikitext, parsing, and DOM manipulation.

Next version of this code will attempt to introduce minimal DOM restructuring across multiple tags at once which can be more efficient.

gwicke: Removed now passing test from whitelist, and updated another whitelist entry which is now improved.

Change-Id: Ie97bcb164eb62c34ba61aa76ba2f4c232aa713d8

2012-06-07 08:04:20 +00:00
Remove a few entries we now care about from the whitelist

They are mostly about whitespace, but there is also a debatable quote test that outputs an empty bold element at the end of the line. We should perhaps strip this empty bold in the QuoteTransformer, as the preservation of an empty bold tag in round-tripping does not seem to be too useful.

Change-Id: I1d8f3ebabcd9f6249e5170de420ba52e8aea22ca

2012-04-02 16:12:49 +00:00
Shorten data-mw-rt to data-mw and clean up whitelist

Instead of a proliferation of data-mw-* attributes, it should be easier to stash all private / non-semantic round-trip information in a JSON object stored in data-mw.

Change-Id: Id200a6a8789fa152f29ea530e5a24b6ee7b4b285

2011-12-01 14:25:59 +00:00
Improve external links and definition lists, now 133 tests passing ;) Also add printwhitelist option to test runner, provides js code copy/pastable to whitelist.

2012-01-04 14:09:05 +00:00
Fix quote handling and tweak the whitelist a bit. 'any' token registrations are now merged with specific registrations by rank. Not yet clear if that is a good idea overall, need to check use cases when implementing template expansion and other functionality. 183 parser tests now passing.

2011-12-07 11:44:38 +00:00
Add a few more items to the whitelist

2011-12-12 20:53:14 +00:00
Convert quote handling (italic/bold) to a core extension operating on the token stream. This is the first token transformation exercising the TokenTransformer class as its dispatcher. Template expansions, wiki link formatting, tag sanitation and extensions should be able to use the same dispatcher by registering for specific token types. The parser performance is very slightly improved as the token stream is only traversed once.

2012-03-25 23:03:07 +00:00
List markup is created during the sync23 phase.

This makes it possible to transclude list items from a template.

Note: "5 quotes" test is broken by this patch, it appears that ListHandler newline processing is changing some state which mysteriously affects the QuoteTransformer. This is ominous, hopefully there's a simple explanation...

gwicke: fix a bug in tokenizer triggered by definition lists like this:
**; foo : bar

Change-Id: I4e3a86596fe9bffcbfc4bf22895362c3bf742bad

2012-01-17 20:01:21 +00:00
Clean up 'END' token handling a bit.

2011-12-06 22:05:43 +00:00
Add empty tables to the whitelist (legal in HTML5). Also add one more functionally identical italic/bold/link permutation on the whitelist.

2011-12-22 11:43:55 +00:00
Refactor table productions to support table fragments in templates (table start / row / table end). The old productions are not deleted yet to make it easy to compare the output on more complex articles. 181 tests passing after adding two table tests with whitespace-only differences to the whitelist.

2012-03-06 13:49:37 +00:00
Reworked percent encoding handling for URIs to get closer to the 'url construction' part of the HTML5 spec:
http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation

Removed a few whitelisted test cases that are now passing directly.

The encoding canonicalization could also be moved to the Sanitizer. Doing this early in token stream processing however has the advantage of providing further transformations uniform data to work with. We could even consider to move this even further into the tokenizer.

2012-07-26 16:34:11 +00:00
First pass porting PHP's sanitizer to Parsoid

* Ported attribute sanitization code (and related functions) from core/includes/Sanitizer.php
* Added dummy flags and set to true (use of rdfa, microdata attrs, and html5 mode).
* Removed couple whitelisted sanitizer tests.
* A few sanitizer tests now pass.
* More work to be done.

Change-Id: I19c92bbfcb57f3e97a7af1b7c5f63772e427dae4

2012-07-16 19:10:08 +00:00
Rename data-mw into data-rt

This hopefully makes it clearer that data-rt contains private round-trip info instead of semantically interesting data.

Change-Id: I03b476ed112a4b627c9871ee3677c450f943429a

2012-02-07 10:28:23 +00:00
Pluck a few low-hanging fruit in external link tokenization, and add a simple localurl parser function implementation. 230 parser tests now passing.

2012-03-06 14:32:45 +00:00
Accept empty table cell attribute sections, and consider percent-encoded %2525 valid. 270 tests passing.

2012-03-05 18:06:29 +00:00
Fix invalid external link representation. 268 tests passing.
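
The S(wap) and M(erge) primitives described in the 2012-05-25 commit message can be sketched roughly as follows. This is an illustration only, not the committed DOM-rewriting code: the tree model (a node is a text string or `{ tag, children }`) and the names `el`, `swap`, `merge` and `serialize` are assumptions made for this example, and attributes are ignored. It also does not attempt the harder problem the message raises, namely choosing an order of S and M steps that is guaranteed to reach a normal form.

```javascript
// Hypothetical tree model: a node is a text string or { tag, children }.
function el(tag, children) {
	return { tag: tag, children: children };
}

// S(wap): rewrite T1(T2(x)) into T2(T1(x)). Applies only when the outer
// element wraps exactly one element child; otherwise the node is returned as is.
function swap(node) {
	if (typeof node === "string" || node.children.length !== 1 ||
			typeof node.children[0] === "string") {
		return node;
	}
	var inner = node.children[0];
	return el(inner.tag, [el(node.tag, inner.children)]);
}

// M(erge): rewrite adjacent siblings T(x), T(y) into T(x, y).
function merge(nodes) {
	var out = [];
	nodes.forEach(function (node) {
		var prev = out[out.length - 1];
		if (prev !== undefined && typeof prev !== "string" &&
				typeof node !== "string" && prev.tag === node.tag) {
			out[out.length - 1] = el(prev.tag, prev.children.concat(node.children));
		} else {
			out.push(node);
		}
	});
	return out;
}

// Serialize a node back to HTML-like text for inspection.
function serialize(node) {
	if (typeof node === "string") {
		return node;
	}
	return "<" + node.tag + ">" + node.children.map(serialize).join("") +
		"</" + node.tag + ">";
}

// <i><b>x</b></i><b>y</b>: swapping the first sibling exposes a <b> that
// can then be merged with the second one.
var siblings = [el("i", [el("b", ["x"])]), el("b", ["y"])];
var rewritten = merge([swap(siblings[0]), siblings[1]]);
console.log(rewritten.map(serialize).join("")); // <b><i>x</i>y</b>
```

The usage lines show why the two primitives are meant to be combined: neither S nor M alone shortens `<i><b>x</b></i><b>y</b>`, but one swap followed by one merge removes a tag pair.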