mediawiki-extensions-Visual.../modules/parser
Subramanya Sastry 8174c9dafc First attempt implementing rewriting rules on the DOM
- This is implemented as a post-processing pass.
- Might require additional checks to verify rewriteability.
- Implemented as a pairwise tag DOM minimization strategy,
  i.e. it takes a tag pair, (b, i) for example, and attempts to
  normalize the tree just for that pair.  Normalizing
  across multiple tags is implemented as pairwise rewriting
  across all pairs.  Ex: (b,i), (b,u), (i,u) for (b,i,u).
- Copied over attributes as part of rewriting, but some of the
  attributes lose their meaning on rewriting since tags are
  reordered (ex: sourcePosn, sourceTagPosn). How do we handle this?
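
A minimal sketch of how the tag pairs mentioned above could be
enumerated. The helper name tagPairs is hypothetical and is not taken
from mediawiki.DOMPostProcessor.js; it only illustrates the
"all pairs" idea, not the rewriting itself.

```javascript
// Hypothetical helper (not part of the actual code): list every
// unordered pair of tag names, in the order the tags were given.
function tagPairs(tags) {
    var pairs = [];
    for (var i = 0; i < tags.length; i++) {
        for (var j = i + 1; j < tags.length; j++) {
            pairs.push([tags[i], tags[j]]);
        }
    }
    return pairs;
}

// For (b,i,u) this yields the pairs (b,i), (b,u), (i,u).
console.log(tagPairs(['b', 'i', 'u']));
```

For n tags this produces n*(n-1)/2 pairs, so the number of pairwise
passes grows quadratically with the number of tags being normalized.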

Output examples and possible issues to fix:
   <i><b><u>biu</u></b></i><b><u>bu</u></b><u>u</u>
gets rewritten to:
   <u><b><i>biu</i>bu</b>u</u>

But the equivalent wikitext form:
   '''''<u>biu</u>''''''''<u>bu</u>'''<u>u</u>
does not get rewritten because of parsing differences.
This wikitext gets parsed into:
   <i><b><u>biu</u>'''</b></i><u>bu<b>u</b></u>
The extra ''' token in the middle thwarts DOM rewriting.

However, a slightly different version:
   "'''''<u>biu</u>''<u>bu</u>'''<u>u</u>"
gets properly normalized to:
   <u>'''''biu''bu'''u</u>

An alternative (and fun) strategy to play with is to use the following
two normalization primitives: S(wap) and M(erge).
- S rewrites T1(T2(x)) into T2(T1(x))
  (ex: <b><i>foo</i></b> ==> <i><b>foo</b></i>)
- M rewrites (T(x),T(y)) into (T(x,y)).
  (ex: <b>foo</b><b>bar</b> ==> <b>foobar</b>)

The current rewriting strategy could possibly be re-implemented as S-M
rewriting.  The problem to solve there would be to find an efficient
rewriting strategy that is guaranteed to lead to a normal form.  I may
not play with it now, but just documenting it for later (to play with
in my spare time).
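
The S and M primitives described above can be sketched on a toy tree
model. Everything here is an assumption made for illustration: nodes
are plain objects of the form { tag, children } with strings as text,
attributes are ignored, and the names swap, merge and serialize are
hypothetical. This is not the DOM API and not the code in this commit.

```javascript
// Toy tree model (assumption): a node is either a string (text) or
// an object { tag: 'b', children: [...] }. Attributes are ignored.

// S(wap): T1(T2(x)) ==> T2(T1(x)).
// Applies only when T1 has exactly one child and it is an element.
function swap(node) {
    if (typeof node === 'string' || node.children.length !== 1 ||
            typeof node.children[0] === 'string') {
        return node; // pattern does not match; leave unchanged
    }
    var inner = node.children[0];
    return {
        tag: inner.tag,
        children: [{ tag: node.tag, children: inner.children }]
    };
}

// M(erge): (T(x), T(y)) ==> T(x, y) for adjacent siblings that
// share the same tag.
function merge(siblings) {
    var out = [];
    siblings.forEach(function (node) {
        var prev = out[out.length - 1];
        if (prev && typeof prev !== 'string' &&
                typeof node !== 'string' && prev.tag === node.tag) {
            out[out.length - 1] = {
                tag: prev.tag,
                children: prev.children.concat(node.children)
            };
        } else {
            out.push(node);
        }
    });
    return out;
}

// Serialize a toy node back to markup, for inspection only.
function serialize(node) {
    if (typeof node === 'string') {
        return node;
    }
    return '<' + node.tag + '>' +
        node.children.map(serialize).join('') +
        '</' + node.tag + '>';
}

var nested = { tag: 'b', children: [{ tag: 'i', children: ['foo'] }] };
console.log(serialize(swap(nested)));

var pair = [{ tag: 'b', children: ['foo'] }, { tag: 'b', children: ['bar'] }];
console.log(merge(pair).map(serialize).join(''));
```

The sketch applies each primitive once at a single position. It does
not address the open problem noted above, namely choosing an order of
S and M steps that is guaranteed to reach a normal form.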

This commit is just a record of fun/experimental code where I get to
learn details of JS, wikitext, parsing, and DOM manipulation.  The next
version of this code will attempt to introduce minimal DOM restructuring
across multiple tags at once, which can be more efficient.

gwicke: Removed a now-passing test from the whitelist, and updated another
whitelist entry which is now improved.

Change-Id: Ie97bcb164eb62c34ba61aa76ba2f4c232aa713d8
2012-05-29 08:17:57 +02:00
html5 Land big TokenTransformDispatcher and eventization refactoring. 2012-01-03 18:44:31 +00:00
ext.Cite.js Forward-port Cite extension 2012-05-03 13:22:01 +02:00
ext.cite.taghook.ref.js Moving parser stuff back into the modules folder (oops) 2011-11-02 21:45:57 +00:00
ext.core.AttributeExpander.js Big token transform framework overhaul part 2 2012-05-15 17:05:47 +02:00
ext.core.BehaviorSwitchHandler.js Big token transform framework overhaul part 2 2012-05-15 17:05:47 +02:00
ext.core.LinkHandler.js Fix an external link regression, and add server shell wrapper and setup docs 2012-05-23 16:25:42 +02:00
ext.core.ListHandler.js Don't eat newline tokens in the ListHandler 2012-05-16 23:14:21 +02:00
ext.core.NoIncludeOnly.js Big token transform framework overhaul part 2 2012-05-15 17:05:47 +02:00
ext.core.ParserFunctions.js Notes on missing parser functions, more error reporting tweaks 2012-05-24 17:31:26 +02:00
ext.core.PostExpandParagraphHandler.js Big token transform framework overhaul part 2 2012-05-15 17:05:47 +02:00
ext.core.QuoteTransformer.js Don't eat newline tokens in the ListHandler 2012-05-16 23:14:21 +02:00
ext.core.Sanitizer.js Big token transform framework overhaul part 2 2012-05-15 17:05:47 +02:00
ext.core.TemplateHandler.js Notes on missing parser functions, more error reporting tweaks 2012-05-24 17:31:26 +02:00
ext.Util.js Nominate more HTML5 sectioning and heading elements for block-level treatment 2012-04-11 12:53:49 +02:00
ext.util.TokenCollector.js Big token transform framework overhaul part 2 2012-05-15 17:05:47 +02:00
mediawiki.DOMConverter.js Replace console.log with console.warn in all debug statements 2012-02-14 20:56:14 +00:00
mediawiki.DOMPostProcessor.js First attempt implementing rewriting rules on the DOM 2012-05-29 08:17:57 +02:00
mediawiki.HTML5TreeBuilder.node.js Big token transform framework overhaul part 2 2012-05-15 17:05:47 +02:00
mediawiki.LinearModelConverter.js Add HTML DOM -> linear model converter 2012-03-29 12:47:14 -07:00
mediawiki.parser.defines.js Big token transform framework overhaul part 2 2012-05-15 17:05:47 +02:00
mediawiki.parser.environment.js Resolve subpage links, and remove hack for H: titles 2012-05-24 17:57:41 +02:00
mediawiki.parser.js Big token transform framework overhaul part 2 2012-05-15 17:05:47 +02:00
mediawiki.Title.js Add basic thumb rendering support 2012-04-09 23:04:26 +02:00
mediawiki.tokenizer.peg.js Keep going on tokenizer errors 2012-05-24 10:30:32 +02:00
mediawiki.TokenTransformManager.js A few (partly hackish) improvements 2012-05-24 16:30:26 +02:00
mediawiki.WikitextSerializer.js Basic rt support for indent pre variant 2012-05-25 18:55:38 +02:00
package.json Basic parser / serializer web service 2012-05-23 12:35:00 +02:00
parse.js Basic parser / serializer web service 2012-05-23 12:35:00 +02:00
pegTokenizer.pegjs.txt Basic rt support for indent pre variant 2012-05-25 18:55:38 +02:00
README.txt As much as I have loved writing Makefiles... I've replaced its functionality with package.json, mostly so we can avoid non-node dependencies. This is one of the recommended practices. We should consider moving tests/parser into modules/parser/tests, other node projects keep all module code in one directory. 2012-04-04 11:02:58 -07:00

A combined MediaWiki and HTML parser in JavaScript running on node.js. Please
see (https://www.mediawiki.org/wiki/Future/Parser_development) for an overview
of the current implementation and instructions on running the tests.

You might need to set the NODE_PATH environment variable:
  export NODE_PATH="node_modules"

Download the dependencies:
  npm install

Run tests:
  npm test