Commit graph

208 commits

Author SHA1 Message Date
Gabriel Wicke 0f9d939b00 Use word diff if --color is enabled
Change-Id: Ib8d3de75ac306974abfdaca22bfc7b69bc62891d
2012-06-05 16:10:13 +02:00
Gabriel Wicke e53bc93a8e Check out old results before running tests
Change-Id: Ia56aec22194a14c94620237041c30c269ac2e56a
2012-05-30 17:37:10 +02:00
Gabriel Wicke 36084c5d93 Preserve original newlines in HTML and serialization
254 round-trip tests (up from 184) are now passing.

Also:
* tweaked runtests.sh slightly (use less -R instead of -r).
* made sure the EOFTk is preserved in phase 3 transforms

Change-Id: I1de22186bdb78e52019370e43f096877005b8f5a
2012-05-29 23:29:03 +02:00
Subramanya Sastry 8174c9dafc First attempt implementing rewriting rules on the DOM
- This is implemented as a post-processing pass.
- Might require additional checks to verify rewriteability.
- Implemented as a pair-wise tag DOM minimization strategy,
  i.e. it takes tag pairs (B, I) for ex, and attempts to
  normalize the tree just for those tag pairs.  Normalizing
  across multiple tags is implemented as pairwise rewriting
  across all pairs:  Ex:(b,i), (b,u),(i,u) for (b,i,u)
- Copied over attributes as part of rewriting, but some of the
  attributes lose their meaning on rewriting since tags are
  reordered (ex: sourcePosn, sourceTagPosn). How do we handle this?

Output examples and possible issues to fix:
   <i><b><u>biu</u></b></i><b><u>bu</u></b><u>u</u>
gets rewritten to:
   <u><b><i>biu</i>bu</b>u</u>

But, the equivalent wikitext form:
   '''''<u>biu</u>''''''''<u>bu</u>'''<u>u</u>
does not get rewritten because of parsing differences.
This wikitext gets parsed into:
   <i><b><u>biu</u>'''</b></i><u>bu<b>u</b></u>
The extra ''' token in the middle thwarts DOM rewriting.

However, a slightly different version:
   "'''''<u>biu</u>''<u>bu</u>'''<u>u</u>"
gets properly normalized to:
   <u>'''''biu''bu'''u</u>

An alternative, but fun strategy to play with is to use the following
two normalization primitives: S(wap) and M(erge).
- S rewrites T1(T2(x)) into T2(T1(x))
  (ex: <b><i>foo</i></b> ==> <i><b>foo</b></i>)
- M rewrites (T(x),T(y)) into (T(x,y)).
  (ex: <b>foo</b><b>bar</b> ==> <b>foobar</b>)

The current rewriting strategy could possibly be re-implemented as S-M
rewriting.  The problem to solve there would be to find an efficient
rewriting strategy that is guaranteed to lead to a normal form.  I may
not play with it now, but just documenting it for later (to play with
in my spare time).

This commit is just as a record of fun/experimental code where I get to
learn details of JS, wikitext, parsing, and DOM manipulation.  Next
version of this code will attempt to introduce minimal DOM restructuring
across multiple tags at once which can be more efficient.

gwicke: Removed now passing test from whitelist, and updated another whitelist
entry which is now improved.

Change-Id: Ie97bcb164eb62c34ba61aa76ba2f4c232aa713d8
2012-05-29 08:17:57 +02:00
Gabriel Wicke c52c24b0cb Slightly improve formatting of web service; test commit message tweak
Change-Id: Ibac3ce3dd9aa2c4faf11eed351fea941ebf1e4b3
2012-05-25 16:36:14 +02:00
Gabriel Wicke c692bc2307 Use '/bin/sh' instead of '/bin/bash'
Bash omits the time output for some reason. We would like to keep a record of
performance too.

Change-Id: I7c435b1cf2e2f237f78a45b2819a195a240e3aa4
2012-05-25 16:19:46 +02:00
Gabriel Wicke 1ce2bc605d Add a small test runner with result archiving in git repo
Supports both -> HTML DOM and round-trip testing. Displays the diff to the
last results using less -r.

Change-Id: Ib3fbadeda3c8f4f7e3d2e6e5236a73ff7a773623
2012-05-25 14:28:53 +02:00
Gabriel Wicke b89f5071e5 Basic parser / serializer web service
* After installing Parsoid (sudo npm install -g in modules/parser), run 'node
  server.js' from the api directory and navigate to http://localhost:8000/ and
  follow the directions. You can start to navigate the English wikipedia at
  http://localhost:8000/Main_Page, or manually enter wikitext or HTML DOM to
  convert.
* Uses the express framework, could also use just connect
* Uses the cluster module to manage workers per-core and restart those on
  failure

Change-Id: I443f2996ed3df00826b038b7476a2f966ab0c425
2012-05-23 12:35:00 +02:00
Gabriel Wicke 39c6f42879 Link round-tripping and other improvements
* Changed RDFa for links according to
  http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary
* Added basic support for internal/external link serialization
* Moved numbering of external links from tokenizer to LinkHandler
* Added round-tripping for generic HTML tags
* Replaced nowiki tag with <meta typeOf="mw:tag" content="nowiki"> and <meta
  typeOf="mw:tag" content="/nowiki"> for now.
* 154 round-trip tests passing (node parserTests.js --roundtrip).

Change-Id: I16c4db21b1b543ee57c73e569c83025b64664542
2012-05-22 13:36:06 +02:00
Gabriel Wicke fb7d5418a5 Round-trip nowiki
Change-Id: I5f7e6a43f5fdc1708ee710b2a601b20db733452c
2012-05-21 18:06:09 +02:00
Subramanya Sastry 9b84e931db First pass updating parserTests to verify dom->wikitext serialization.
- Just a quick first pass updating the parserTests.js script so we can
  test DOM -> wikitext serialization (but which in effect really tests
  roundtripping).
- There is no output normalization yet which is needed for now since we
  are not yet preserving white-space.

Change-Id: Ie52058e0dc3330f852c24fa05641dced19f950e0
2012-05-18 09:51:22 +02:00
Gabriel Wicke 04fc74c76a Strip RDFa attributes in parserTests
We are adding some extra information in those, which should not make tests
fail.

Change-Id: I42cca596330252efeff5d51508f97ef1c566475b
2012-05-17 17:03:44 +02:00
Gabriel Wicke 542921b5a3 Removed html5 parser patch no longer needed with 0.3.8
Change-Id: Id8c23d34e8cca49a360f536e792144a85a8468a3
2012-05-16 12:06:42 +02:00
Adam Wight 0a7f0b7630 List markup is created during the sync23 phase.
This makes it possible to transclude list items from a template.

Note: "5 quotes" test is broken by this patch, it appears that ListHandler
newline processing is changing some state which mysteriously affects the
QuoteTransformer.  This is ominous, hopefully there's a simple explanation...

gwicke: fix a bug in tokenizer triggered by definition lists like this:
**; foo : bar

Change-Id: I4e3a86596fe9bffcbfc4bf22895362c3bf742bad
2012-05-08 11:39:36 +02:00
Gabriel Wicke 3be4992782 'Obama finally expands' ;) Misc fixes and documentation updates
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
  while it would prevously run out of RAM after ~30 minutes. Wohoooo!
  The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
  tests are passing since these tests expect the day of the week to be
  Thursday.  Won't be the case tomorrow.

Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136
2012-04-26 18:18:08 +02:00
Gabriel Wicke 8ff810659a Rename text/wiki and tokens/wiki to text/x-mediawiki and similar
Change-Id: I70113629f4633685cd6db3914303a15e4c79a50a
2012-04-25 20:19:43 +02:00
Gabriel Wicke 8368e17d6a Biggish token transform system refactoring
* All parser pipelines including tokenizer and DOM stuff are now constructed
  from a 'recipe' data structure in a ParserPipelineFactory.

* All sub-pipelines of these can now be cached

* Event registrations to a pipeline are directly forwarded to the last
  pipeline member to save relatively expensive event forwarding.

* Some APIs for on-demand expansion / format conversion of parameters from
  parser functions are added:

  param.to('tokens/expanded', cb)
  param.to('text/wiki', cb) (this does not work yet)

  All parameters are additionally wrapped into a Param object that provides
  method for positional parameter naming (.named() or conversion to a dict
  (.dict()).

* The async token transform manager is now separated from a frame object, with
  the frame holding arguments, an on-demand expansion method and loop checks.

* Only keys of template parameters are now expanded. Parser functions or
  template arguments trigger an expansion on-demand. This (unsurprisingly)
  makes a big performance difference with typical switch-heavy template
  systems.

* Return values from async transforms are no longer used in favor of plain
  callbacks. This saves the complication of having to maintain two code paths.
  A trick in transformTokens still avoids the construction of unneeded
  TokenAccumulators.

* The results of template expansions are no longer buffered.

* 301 parser tests are passing

Known issues:

* Cosmetic cleanup remains to do
* Some parser functions do not support async expansions yet, and need to be
  modified.

Change-Id: I1a7690baffbe8141cadf67270904a1b2e1df879a
2012-04-25 16:51:36 +02:00
Gabriel Wicke aaca5eac7d More tweaks: safesubst and image options
* Ignore safesubst for now
* Remove an unneeded whitelist entry
* Make sure the caption is not lost for thumbs (fix to last commit) and remove
  debug print

Change-Id: I243584ed0838cf7c3b4110fe9cdf869272477312
2012-04-17 11:02:52 +02:00
Gabriel Wicke afa5b95bc1 Don't work around html5 library tokenizer attribute reordering
The HTML5 parser we are using to normalize expected HTML output in parserTests
reverses the order of attributes (see
https://github.com/aredridel/html5/pull/53 for the fix). Remove whitelist
entries concerned with this and use the proper order in external image
attributes.

Change-Id: If1868cae05396a150757c85a20473ab756cbcd97
2012-04-16 17:09:06 +02:00
Catrope 7465b670e1 Add and update an offset map in DocumentNode
This has some TODOs still but I want to land it now anyway, and fix the
TODOs later.

* Add this.offsetMap which maps each linear model offset to a model tree node
* Refactor createNodesFromData()
** Rename it to buildSubtreeFromData()
** Have it build an offset map as well as a node subtree
** Have it set the root on the fake root node so that when the subtree
   is attached to the main tree later, we don't get a rippling root
   update all the way down
** Normalize the way the loop processes content, that way adding offsets
   for content is easier
* Add rebuildNodes() which uses buildSubtreeFromData() to rebuild stuff
* Use rebuildNodes() in DocumentSynchronizer
* Use pushRebuild() in TransactionProcessor
* Optimize setRoot() for the case where the root is already set correctly

Change-Id: I8b827d0823c969e671615ddd06e5f1bd70e9d54c
2012-04-13 16:46:02 -07:00
Gabriel Wicke 9913108b40 Fix fetch-parserTests (it is in path instead of fs)
Change-Id: I169502079ea2609a4f4af776b15767cf0c3ec8b5
2012-04-04 20:40:09 +02:00
Adam Wight b234edba88 As much as I have loved writing Makefiles... I've replaced its functionality with package.json, mostly so we can avoid non-node dependencies. This is one of the recommended practices. We should consider moving tests/parser into modules/parser/tests, other node projects keep all module code in one directory.
Explained in the README how to use npm to load the dependencies and run tests.  Too bad about NODE_PATH...

Don't try to find parserTests.txt in assorted places--if it isn't present, fetch from gerrit.  You can symlink from core if you're developing on both parsers, and the fetch script will not overwrite.

Use __dirname in parserTests.js to allow the script to run independent of current working directory.

Change-Id: I4c8b884e91f4fdeae385c7697aff768bdd199dd5
2012-04-04 11:02:58 -07:00
Gabriel Wicke f662690d02 Shorten data-mw-rt to data-mw and clean up whitelist
Instead of a proliferation of data-mw-* attributes, it should be easier to
stash all private / non-semantic round-trip information in a JSON object
stored in data-mw.

Change-Id: Id200a6a8789fa152f29ea530e5a24b6ee7b4b285
2012-04-02 18:12:49 +02:00
Gabriel Wicke 5ef3438ee5 Change path to parserTests from phase3 to core after switch to git.
Change-Id: Ie13f678eaa81447e98db5c8c394ab103caad8454
2012-04-02 17:10:06 +02:00
Audrey Tang d3602bb459 * Get parser tests from GitWeb, not Subversion.
Change-Id: I39f933b9e0320dc62736da07ce097ec1badec9aa
2012-03-28 23:39:01 +08:00
Catrope 7a726b0278 Add tree synchronization for replace
To handle replace operations that are not themselves consistent (these
are common, for instance when replacing an opening element in one place,
then replacing the closing element somewhere else), we process
subsequent replace operations inside the first one until things are
balanced again, then issue a single rebuild for the whole thing.

Change-Id: Ide4613f046fabfeeef383138c39e350b1b710033
2012-03-26 02:51:30 -07:00
Roan Kattouw 662633dfb3 Add a test for unwrapping and rewrapping 2012-03-14 21:02:33 +00:00
Roan Kattouw 2c43a34f74 Rewrite the rebuild action to take two ranges rather than a node and some data. 2012-03-14 21:02:31 +00:00
Roan Kattouw 37a59016e8 Break out pushAction() into separate functions for each action. This will allow me to change the rebuild action to take totally different parameters. 2012-03-14 21:02:29 +00:00
Antoine Musso f637756319 node modules required: request & jshashes 2012-03-13 15:14:18 +00:00
Roan Kattouw 16a2356e43 Add tests for list split tree sync 2012-03-13 00:14:38 +00:00
Trevor Parscal c977591886 Added test for ve.dm.DocumentSynchronizer that exercises multi-action synchronizations 2012-03-09 19:38:54 +00:00
Roan Kattouw d70aa70707 Add test for replacing a table with a list. This only works because
nesting validity isn't checked yet (lists inside lists are illegal
IIRC), but for now it tests the reversal of the order of the closing
tags nicely
2012-03-09 02:19:50 +00:00
Roan Kattouw b13d0a849d Add a check for the length of unwrapOuter, and add a test for each
exception
2012-03-09 01:44:31 +00:00
Roan Kattouw bc600b34be Make prepareWrap() use the data from the model rather than the unwrap
parameters. This fixes the case where rolling back a list unwrap would
restore the list items without their attributes
2012-03-09 01:14:41 +00:00
Roan Kattouw 3bc6b3d8c7 Add tests for unwrapping a list
This also excercises unwrapEach. One of the tests is still subtly broken
in that the attributes on the listItems aren't preserved, I'll fix that
next.
2012-03-09 00:38:35 +00:00
Roan Kattouw 5054ed320e Implement prepareWrap and add tests for it 2012-03-08 23:21:26 +00:00
Roan Kattouw 10a6ee73f4 Add tests for content replacements 2012-03-08 23:21:23 +00:00
Trevor Parscal 3ec0c07843 Fixed name of test suite to match actual class name 2012-03-08 19:37:13 +00:00
Trevor Parscal becb1daa39 Added more tests for ve.dm.DocumentSynchronizer and fixed some bugs along the way 2012-03-08 19:35:51 +00:00
Trevor Parscal 459c4fa271 Added some basic tests for resize and insert. Fixed some bugs in both of those code paths along the way. 2012-03-08 00:52:30 +00:00
Gabriel Wicke af03eb4f29 Improve generic attribute expansion before external link processing, and make
wgUploadPath configurable. Also change the hard-coded fall-back image sizes to
sensible defaults. This breaks three parser tests until image size retrieval
from the wiki is implemented.
2012-03-06 18:02:35 +00:00
Gabriel Wicke 227103e12c Accept empty table cell attribute sections, and consider percent-encoded %2525
valid. 270 tests passing.
2012-03-06 14:32:45 +00:00
Gabriel Wicke 2efcd3cd57 Reworked percent encoding handling for URIs to get closer to the 'url
construction' part of the HTML5 spec:
http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation

Removed a few whitelisted test cases that are now passing directly.

The encoding canonicalization could also be moved to the Sanitizer. Doing this
early in token stream processing however has the advantage of providing further
transformations uniform data to work with. We could even consider to move this
even further into the tokenizer.
2012-03-06 13:49:37 +00:00
Gabriel Wicke 19fe9726a2 Fix invalid external link representation. 268 tests passing. 2012-03-05 18:06:29 +00:00
Gabriel Wicke 7b0c807710 Change wikilink tokenization strategy to split on pipes. This makes it
possible to support template / template argument expansion in image options,
and causes little trouble for wikilinks. Non-image wikilinks with multiple
text pipes are quite rare in the dumps, and concatenating description tokens
with a plain '|' is quite easy. 261 parser tests passing.
2012-03-05 12:00:38 +00:00
Trevor Parscal 0e41da3340 Fixed tests that were broken by r112150. 2012-03-02 23:12:38 +00:00
Gabriel Wicke 009d7a4dea Namespaces to the rescue. 2012-03-02 15:49:05 +00:00
Gabriel Wicke fe681042c0 Collect some statistics while grepping. 2012-03-01 16:42:28 +00:00
Gabriel Wicke e0838db315 Capturing the regexp is no longer necessary, and speeds up the grepper. Also
tweaked the multi-line ISBN regexp slightly.
2012-02-29 13:02:46 +00:00