Commit graph

462 commits

Author SHA1 Message Date
Gabriel Wicke b3bd2ffe8d Fix definition list parsing and round-trip single vs. multi-line dt/dd
* Removed murky ' :' -> '&nbsp;:' replacement in tokenizer. This breaks four
  parser tests, and should be fixed in a token stream transformer or DOM
  postprocessor. This replacement clashes with round-tripping, and is not
  terribly important visually.
* Added stx:row annotation to single-line dt/dd pairs and use it to preserve
  single-line syntax in the serializer. There is no attempt yet to support the
  addition of nested lists in an originally single-line dd. We'd need to look
  ahead in the serializer to support this. Perhaps the editor can simply drop
  data-mw in that case.
* Switched default dt/dd serialization to multi-line. This supports all nested
  lists and multiple dds.
* Don't close dls when switching from dt to dd or back in the token stream
  ListHandler.
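
A minimal sketch of how a round-trip annotation could select between the two syntaxes described above. The node shape (`term`, `defs`, `stx`) and the function name are assumptions made for illustration; they are not taken from the serializer itself.

```javascript
// Hypothetical node shape: { term: string, defs: string[], stx?: 'row' }.
// stx === 'row' marks a dt/dd pair that was written on one line in the source.
function serializeDefinition(node) {
  if (node.stx === 'row' && node.defs.length === 1) {
    // Preserve the original single-line syntax: ";term :definition"
    return ';' + node.term + ' :' + node.defs[0];
  }
  // Default: multi-line syntax, which also copes with multiple dds.
  return [';' + node.term]
    .concat(node.defs.map(function (d) { return ':' + d; }))
    .join('\n');
}
```

With this sketch, an annotated pair stays on one line, while an unannotated term with two definitions is emitted on three lines.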

Overall 290 round-trip tests are passing now (up from 284, some due to &nbsp;,
some due to lists). The number of passing parser tests dropped slightly from
303 to 297 (or 301/295 on weekdays other than Thursday).

Change-Id: I85ff40571833713388c6523e6a4ba2e94daa3807
2012-06-21 17:34:25 +02:00
Gabriel Wicke e584e35ecb Improve nested definition list serialization
Basically only prefix all bullets if the serialization output is going to be
in start-of-line context. The test for that is currently inline, but should
perhaps be factored out to a method or state flag instead.

We could alternatively consider returning the start-of-line prefix and letting it
be used in _serializeToken in case we end up in start-of-line context.

This patch also fixes a newline issue on input like this:

:d1
::: d3

Both the list and list item handlers now set the startsNewline flag
dynamically depending on the context, so that we don't depend on the
suppression of newlines from list syntax by the singleLineMode any more.

There is still an extra newline inserted between list items in the following
example:

;t1 :d1
;;t2 ::d2

This looks like a bug in the produced DOM and not in the serializer, since the
outer definition list is closed and re-opened between d1 and t2.

Change-Id: I78e3a1ef34cf9159d5a1e86fb64c774ff111e71d
2012-06-21 15:28:43 +02:00
Gabriel Wicke ab286d6a59 Empty elements only use the start handler info
Thus move the 'endsLine' attribute to the start section.

Change-Id: I8490d866b84aa99205ca9e8e3ee137026fb18501
2012-06-21 10:30:11 +02:00
Gabriel Wicke cf32b34b0a First attempt at the definition list bug (work in progress)
The main issue is that the bullets from dd/dt were not stored on the stack. I
added a separate field for it in each stack entry, which now fixes the basic
indent case without (afaik) breaking anything else.

There are still some newline issues, and the need to handle the single-line
dd/dt vs. the multi-line variant.

Change-Id: I65939c05e2c5dde0789bf8aefd7651161a2f137c
2012-06-20 23:51:39 +02:00
Gabriel Wicke 344fac19b5 Improve preformatted text handling
* Don't escape html-syntax pre content for now; Should parse this with a new
  pre content production later (which needs to be split out of the regular pre
  production in the tokenizer)
* Protect indent-pre content from start-of-line syntax escaping
* Preserve extra leading spaces in the tokenizer
* Two more (now 284) round-trip tests are passing

Change-Id: I199b89c0ee7fae12546df10c1b5117c97caccac5
2012-06-20 19:28:34 +02:00
Gabriel Wicke 6054a4aa14 Clean up serializer newline handling a bit further
Queued newlines and new trailing newlines were not cleanly separated so far,
which caused some trailing newlines to be consumed for needed leading
newlines. This change fixes several newline bugs, taking the number of passing
round-trip tests from 276 back up to 282.

Change-Id: Idb4706e15ce71e63085033e3f3f29557915c11a8
2012-06-20 16:31:39 +02:00
Gabriel Wicke 2426901e5b Fix definition lists with multiple dds
Fixed a bug in the list handler for multiple dds in a definition list. Also
fixed a few JSHint warnings.

Change-Id: I3e883786698a9521347fc2a5e6420646318813a7
2012-06-20 15:34:20 +02:00
Gabriel Wicke c9d3db8f34 Fix a few round-tripping and list issues
At least partly fixes some bugs in
http://www.mediawiki.org/wiki/Parsoid/Bug_test_cases. 276 round-trip tests are
passing.

* Fixes
  http://www.mediawiki.org/wiki/Parsoid/Bug_test_cases#extra_newline_after_empty_dd,
  except for lost newline in 'working' example before next heading
* Fixes newlines in definition lists
  (http://www.mediawiki.org/wiki/Parsoid/Bug_test_cases#dd_indentation etc),
  but does not fix missing / incorrect bullets for those

Change-Id: I21f66e265e43e1d1a4c7da70984a9984b8e6d0dd
2012-06-20 13:53:47 +02:00
Gabriel Wicke b94cad47dc Fix single-line mode for nested lists
Known issue: breaks round-tripping of :;;;::. That test is normally disabled
anyway, so we can fix it later.

Change-Id: I7954271311bfb7e71caae59d8177e3f04a9ebbca
2012-06-20 01:48:52 +02:00
Gabriel Wicke 33dc9abb0d Clean up sHref handling a bit
* sHref is now always a string
* fixes crasher when sHref is not set

Change-Id: If5756948ac6bc26c2d7c04d970b5aba5331cb8bb
2012-06-20 00:34:57 +02:00
Gabriel Wicke e117f09362 Wikitext escaping and quite complete source range tracking
* Started to add more complete tag source range (tsr) annotations to most
  start / empty tags. These replace the old sourcePos and sourceTagPos
  annotations, and look more promising for general round-tripping than block
  source ranges (bsr). See
  http://www.mediawiki.org/wiki/User:GWicke/Parsoid_source_ranges for some
  notes on this.
* Added an escapeWikitext method in the serializer that tokenizes supposedly
  text-only content from the DOM with the tokenizer and wraps runs of returned
  non-text tokens into nowiki tags. The source corresponding to non-text
  tokens is retrieved using the tsr annotations.
* Removed old (unused) table productions to avoid confusion.
* 276 round-trip tests are passing, vs. 283 without escaping.
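
The escaping idea above can be sketched roughly as follows. The `tokenize` parameter is a hypothetical stand-in for the real tokenizer, and the token shapes (plain strings for text, objects carrying a `tsr` range otherwise) are assumptions for illustration only.

```javascript
// tokenize(text) is assumed to return strings for text tokens and objects
// with a tsr [start, end] source range for everything else.
function escapeWikitext(text, tokenize) {
  var out = '';
  var run = null; // source range covered by the current run of non-text tokens

  function flush() {
    if (run) {
      // Recover the original source of the run via its range and wrap it.
      out += '<nowiki>' + text.slice(run[0], run[1]) + '</nowiki>';
      run = null;
    }
  }

  tokenize(text).forEach(function (tok) {
    if (typeof tok === 'string') {
      flush();
      out += tok;
    } else if (run) {
      run[1] = tok.tsr[1]; // extend the current run
    } else {
      run = [tok.tsr[0], tok.tsr[1]];
    }
  });
  flush();
  return out;
}
```

Adjacent non-text tokens end up in a single nowiki wrapper, which keeps the output shorter than wrapping each token separately.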

Known issues:
* harmless for now, can be improved later: urllinks in external link captions
  are wrapped in nowiki. Example HTML:

<a rel='mw:extLink' href="http://example.com">http://example2.com</a>

* some start-of-line syntax in wiki-syntax preformatted blocks might be
  wrapped into nowiki when that would not really be needed. Example HTML DOM:

<pre>
* foo
* bar
</pre>

Change-Id: I01c34aedd5c566614d36924add47a6a960e91987
2012-06-19 23:36:44 +02:00
Gabriel Wicke 5fbc80321b Improve newline handling for comments and nowiki/noinclude tags
* Added a newlineTransparent flag to handlers that prevents changes to the
  onNewline status, so that content following it is still considered to be in
  start-of-line context. This fixes a few rt tests where a comment or nowiki
  tag is at the start of the line, and following content should end up on the
  same line.
* 283 rt parser tests are now passing.
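
A rough sketch of the newlineTransparent idea, assuming a simplified handler table and token shape (both hypothetical): a transparent token leaves the start-of-line state untouched, so whatever follows it is still treated as being at the start of a line.

```javascript
// Hypothetical handler table; only the newlineTransparent flag matters here.
var handlers = {
  comment: { newlineTransparent: true },
  text: {}
};

// Returns, for each token, whether it was seen in start-of-line context.
function trackStartOfLine(tokens) {
  var onNewline = true; // the start of input counts as start of line
  return tokens.map(function (tok) {
    var atLineStart = onNewline;
    var h = handlers[tok.type] || {};
    if (tok.type === 'newline') {
      onNewline = true;
    } else if (!h.newlineTransparent) {
      onNewline = false;
    }
    return atLineStart;
  });
}
```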

Change-Id: Ie58dcb9e5e9af9000fff61c2e1db5d8649ffc3f6
2012-06-18 22:56:41 +02:00
Gabriel Wicke 97fb2d3c0d Serializer refactoring
* tokens are not modified any more (they are supposed to be immutable)
* handler info is now split in start / end objects and potentially a 'make'
  method; added more flags to govern the newline behavior of different tags
* added a generic singleLine mode for single-line syntactical environments
* switched the web service to line-based diffs to avoid issues when diffing
  the round-trip results of [[:en:Programming language]]
* 280 round-trip tests are passing now

Change-Id: I74b8ffbf69643c5d6e5ec852ec58e680c9018901
2012-06-18 21:52:15 +02:00
Subramanya Sastry f1d03f325e Couple minor bug fixes in serializer
Change-Id: I961e2f4e7609cc6b264eaf494b39497401cdc55c
2012-06-17 22:41:14 -05:00
Gabriel Wicke 41d8212573 Emit SpaceCharacters token for HTML5 'space' chars
HTML5 defines space characters as [ \r\n\t\f] in
http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#space-character.
It treats these specially in a few contexts. As an example, the foster
parenting algorithm does not apply to space characters.

As a result, this change fixes the round-tripping of spaces between table
tags, which were previously moved before the table.
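
As an illustration only, splitting text into space and non-space runs might look like the sketch below. The token type names and shapes are assumptions, not the tokenizer's actual output format.

```javascript
// HTML5 space characters as listed above: space, CR, LF, tab, form feed.
var SPACE_RUN = /[ \r\n\t\f]+/g;

// Split text into tokens, emitting space-only runs as their own token type.
function splitSpaceCharacters(text) {
  var tokens = [];
  var last = 0;
  var m;
  SPACE_RUN.lastIndex = 0; // reset, since the regex is global and shared
  while ((m = SPACE_RUN.exec(text)) !== null) {
    if (m.index > last) {
      tokens.push({ type: 'Characters', data: text.slice(last, m.index) });
    }
    tokens.push({ type: 'SpaceCharacters', data: m[0] });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    tokens.push({ type: 'Characters', data: text.slice(last) });
  }
  return tokens;
}
```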

Change-Id: I32ab29275a9f824fc66d8286638eb42748cfc9a5
2012-06-17 16:16:07 +02:00
Subramanya Sastry a229f72833 First pass redoing serialization code to handle newline requirements
from Parsoid HTML output as well as VE HTML output.  There are still
some newline-related failures from parser tests that need fixing, but
this is getting close.  So committing for now so other eyes can make the
bugs shallow :).

Change-Id: Ia6a218ee9fb3e18fe0573c89ff3a4236779e1e64
2012-06-16 10:09:06 -05:00
Subramanya Sastry 3f92f39397 Removed newline normalization between paragraphs.
Change-Id: Ifd55db73c8fe2b3e952066a75cba2f8e13c58430
2012-06-14 18:51:56 -05:00
Subramanya Sastry 54f12d1807 Fix for href handling.
- Check if href for links has the wgScriptPath prefix before
  attempting to strip it from the href.
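
A minimal sketch of the guarded strip, with the function name and the example prefix value chosen for illustration:

```javascript
// Strip the script path only when the href actually starts with it.
// wgScriptPath is passed in; '/wiki/' in the tests is only an example value.
function stripScriptPath(href, wgScriptPath) {
  if (href.indexOf(wgScriptPath) === 0) {
    return href.slice(wgScriptPath.length);
  }
  return href; // hrefs without the prefix are left untouched
}
```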

Change-Id: I844151ef7317476668d1306b96a2aec5a56fd0f1
2012-06-14 18:35:22 -05:00
Subramanya Sastry c0fc9e9a97 Updated newline handling around lists and nested lists.
- Something like this:
    <ul><li>1</li><li>2<ul><li>2.1</li><li>2.2<ul><li>2.2.1</li><li>2.2.2</li></ul></li><li>2.3</li></ul></li><li>3</li></ul>
  now serializes properly to:

    *1
    *2
    **2.1
    **2.2
    ***2.2.1
    ***2.2.2
    **2.3
    *3

  So does this form which is what the above wikitext parses to:
    <ul><li>1
    </li><li>2
    <ul><li>2.1
    </li><li>2.2
    <ul><li>2.2.1
    </li><li>2.2.2
    </li></ul></li><li>2.3
    </li></ul></li><li>3
    </li></ul>

- Lists (and nested lists) are not entirely newline-insensitive.
  They still depend on newlines *between* lists.  The opening
  <ul> tag for non-nested lists should always start on a new line.
  So, for example,
    <ul><li>foo</li></ul><ul><li>bar</li></ul>
  will serialize to:
    *foo
    *bar
  which is incorrect.  But,
    <ul><li>foo</li></ul>
    <ul><li>bar</li></ul>
  will correctly serialize to:
    *foo

    *bar
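
The nested-list output shown at the top of this message can be reproduced with a small recursive sketch. The tree shape (`text`, `children`) is a simplification assumed here for illustration, not the DOM the serializer actually walks.

```javascript
// Hypothetical tree: each item is { text: string, children?: item[] }.
// Returns one wikitext line per list item, with one '*' per nesting level.
function serializeList(items, prefix) {
  prefix = (prefix || '') + '*';
  var lines = [];
  items.forEach(function (item) {
    lines.push(prefix + item.text);
    if (item.children && item.children.length) {
      lines = lines.concat(serializeList(item.children, prefix));
    }
  });
  return lines;
}
```

Joining the returned lines with newlines gives the wikitext for one list; the separation between adjacent lists discussed above is outside the scope of this sketch.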

Change-Id: I13a0290368574865957bcf57aebab488fbbb7026
2012-06-14 17:09:59 -05:00
Subramanya Sastry 8978e406fc Minor code refactoring
Change-Id: Ib7f70a3ac42e3d5a5985e9a9bcffa313bdac289b
2012-06-14 15:18:53 -05:00
Subramanya Sastry d7e83c4e2b Fixed/updated newline handling for <p> tags
- More pieces are now simplified and all(?) newline handling
  is now centralized in the serializeToken function.

- This commit fixes bugs in rt-ing some code snippets
    ----------
    Ex 1: foo<p>bar</p>baz
    ----------

- This commit fixes bugs serializing VE generated html
    ----------
    Ex 2: <p>foo</p><pre>bar</pre> ==> foo\n bar
    ----------

- But, this round of fixes introduces RT failures for certain
  code examples in parserTests.txt.  In all these failing cases,
  inline text/html is embedded within a generated <p> tag during
  parsing.  If these generated <p> tags can have a "gc:1" attribute
  added to them, we can properly serialize them to the original
  form.
    ----------
    Ex 3: foo<pre>bar</pre>
          Parsed HTML: <p>foo</p><pre>bar</pre>
    ----------
  Note how this parsed HTML is identical to what the VE outputs
  in Example 2 above.  So, without the gc:1 attribute, we now
  have conflicting requirements on the exact same HTML, and the serializer
  alone cannot tell the two cases apart.

Change-Id: I86beadec91c445a7f8a6d36a639b406697daa0a2
2012-06-14 14:59:18 -05:00
Subramanya Sastry 13e03ec1d7 Refix <pre> serialization.
- Effectively reverted fix from f882a65153
  and added a new fix.

Change-Id: I8b81e26525a5f1a22acaf2c7067f2dcd9b962818
2012-06-14 13:10:02 -05:00
Subramanya Sastry 51227f2a4a Improved, simplified newline handling in wikitext serializer.
- Eliminated newline handling from several places in code and
  mostly isolated it to serializeToken thus simplifying newline
  handling logic.
- Fixing some bugs in the process: # of green roundtrip tests
  went up by 5 (294 --> 299) but actually introduced failures on
  a few originally succeeding tests (additional leading/trailing
  newlines on the entire test output).
- Added bonus: made list serializing (mostly) insensitive to
  newlines between tags.  So, all the following DOMs serialize
  identically to the following wikitext:

  *foo
  *bar

  ----------
  <ul><li>foo</li><li>bar</li></ul>

  ----------
  <ul>
  <li>foo</li>
  <li>bar</li>
  </ul>
  ----------
  <ul>

  <li>
  foo

  </li>

  <li>
  bar</li>

  </ul>
  ----------

Change-Id: I76be56c4b2789039dff5f47de4659746882e45d6
2012-06-14 00:10:51 -05:00
Subramanya Sastry bf0f5d1b7e Minor code cleanup
Change-Id: Ic5d99b6c483841310b0c295c1c30246f907455b4
2012-06-13 13:47:26 -05:00
Subramanya Sastry 23ec054013 Fixed round-tripping of interwiki links.
Change-Id: If0427b9865b3e9cf8c0ad0b4efaebc9f9f7fb865
2012-06-13 13:39:18 -05:00
Subramanya Sastry 445780b4d3 Revert default tokenization result from null to ''
* As part of an earlier fix, I had changed default value of 'res'
  to null instead of ''.  But, this was potentially buggy because
  the previous check was (res !== '') which could be triggered
  by return values of handlers.  By changing the check to null,
  I was effectively changing the code paths for those handlers that
  returned ''.
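
The pitfall can be illustrated with a small sketch (function and parameter names are hypothetical): the default value and the check have to agree, and changing the sentinel changes the path taken by a handler that legitimately returns ''.

```javascript
// sentinel is the default value of res; the check compares against it.
function dispatch(handler, sentinel) {
  var res = sentinel;
  if (handler) {
    res = handler();
  }
  // Anything other than the sentinel counts as "the handler produced output".
  return res !== sentinel ? 'handled' : 'fallback';
}
```

A handler returning '' takes the fallback path with a '' sentinel but the handled path with a null sentinel, which is the behavior change described above.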

Change-Id: I2302023be7422ce4fb384ff5a50fe53fa7732855
2012-06-13 11:53:05 -05:00
Subramanya Sastry cfe94eed1f Minor code refactoring
Change-Id: Iec3cb4d83d16174371f0b1f3f23b1056aeed458e
2012-06-13 09:46:34 -05:00
Subramanya Sastry f882a65153 Fix serialization of <pre> tags
Change-Id: I7ae95e7ec06167d0c1bfdaba3d0c67d941043299
2012-06-12 13:54:35 -05:00
Subramanya Sastry 727c2119bb Refactored serializeToken method and added special-case handling of
paragraphs in lists.

* We need to look at other special-case handling requirements of
  html tags in lists (and other contexts like tables).

Change-Id: I84b8402d90a186c9075c2d45263c94377312927a
2012-06-11 17:55:41 -05:00
Gabriel Wicke 1ca586e5f1 Improve interwiki config a bit
* Moved wikipedia default prefixes to environment
* Added 'addInterwiki' method
* Adjusted link handling normalizeTitle to reflect this

Change-Id: If5b2314cc36346b6da8649ed410457a612d80a22
2012-06-07 12:30:16 +02:00
Gabriel Wicke 2fa5baabbb Make it easier to configure the default wiki, and add support for mediawiki.org
* mw:Foo now loads pages from mediawiki.org
* The default prefix still is 'en'. You can switch this to 'mw' in ParserService.js.

Change-Id: I1208667e6114bd711b7988a8b3adb32ffab70969
2012-06-07 11:50:40 +02:00
Subramanya Sastry b665a2558f Fixed bugs handling/transforming quotes
- Three bugs that were messing up quote transformations.
- Now, the following cases are handled properly:

  * ''foo'''
  * '''foo''
  * ''foo''''
  * ''''foo''

  These tests (and other quote tests) have to be added to core parser
  tests file.

- One more parser test green.

Change-Id: I4f93e8910639f546bfc9304becab17d26d5529de
2012-06-07 01:37:45 -05:00
Gabriel Wicke 350e700d8f Add core-upgrade
Change-Id: I5ad0955e8272d376f009f89461bed310978b25e4
2012-06-06 15:58:17 +02:00
Gabriel Wicke a146fcb8ad Improve the handling of newlines for round-tripping
An improvement, but there still are some extra newlines inserted after
paragraphs. Example input:

-------

Foo:
{|
|foo
|}
-------

Extra newlines are inserted after the Foo: and the foo in the table. They are
not fed as tokens or text to the tree builder, so there is likely a bug in the
html5 library or JSDom.

Change-Id: I83eb6180e3cd1c4e7f9b15b31d339e1d32bccd3f
2012-06-06 10:17:03 +02:00
Gabriel Wicke 59fc634cce Update patched html5 library to version 0.3.8
Change-Id: I321d9a58ea1af33842a606fc8706938093a8330f
2012-06-06 10:17:03 +02:00
Subramanya Sastry fe6f289486 Merge changes I5d98c704,Ib8d3de75
* changes:
  A few tweaks to link round-tripping
  Use word diff if --color is enabled
2012-06-05 16:04:23 +00:00
Subramanya Sastry b095db4303 Simpler implementation of flatten.
* Possibly more efficient under heavy GC load -- untested.
* No change in time and memory use for single file parsing.

Change-Id: Id2f3f65cc0e5f38ed968bbda60b97e46523e700e
2012-06-05 10:47:46 -05:00
Gabriel Wicke dc3168cf6d A few tweaks to link round-tripping
* Moved the tail attribute to the second attribute (a bit cleaner)
* Disallowed newlines in the tail production
* Improved the selection of round-tripped href vs. generated content vs. href
  in the serializer
* renamed state.linkTail to state.dropTail

Change-Id: I5d98c704b6ea566011e22237786f8da17548570f
2012-06-05 17:26:27 +02:00
Gabriel Wicke d16032ae9a Track html syntax in block_tag production
Change-Id: If560523644f007485809762f12216e08fb3c3ed3
2012-06-05 12:39:56 +02:00
Gabriel Wicke cc96ff4f5e Very basic interwiki support
Page titles with a wikipedia interwiki prefix now load the page from the
corresponding Wikipedia. Links in a page then stay within the given language.

Note that Parsoid currently makes no effort to recognize localized namespaces,
so it won't render media files, categories etc correctly.

Change-Id: I7bc4102e81a402772ea23231170734d580ea15b9
2012-06-05 11:19:58 +02:00
Gabriel Wicke 92f753a365 Pre and link target improvements
* Don't explicitly add the newline in the pre, as we preserve newline tokens
  now. This avoids doubling of newlines when round-tripping.
* Use the sHref attribute even if the href contains spaces.

Change-Id: I8bec8fbfd6a7836bf2e5eec20869a0edd95c93b6
2012-06-04 14:03:05 +02:00
Gabriel Wicke ee2ddbd3cb Fix list handler issues
Lists interrupted by non-empty lines would not close the list properly.
Register for any token instead of just for newlines and close the list if no
listItem follows the newline.
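
A simplified sketch of the closing rule, assuming flat tokens with a `type` field (the token shapes and the `listStart` / `listEnd` markers are hypothetical): every token is inspected, and an open list is closed as soon as something other than a list item follows a newline.

```javascript
// Hypothetical token shapes: { type: 'listItem' | 'newline' | 'text' }.
function closeLists(tokens) {
  var out = [];
  var inList = false;
  var afterNewline = false;
  tokens.forEach(function (tok) {
    if (tok.type === 'listItem') {
      if (!inList) {
        out.push({ type: 'listStart' });
        inList = true;
      }
    } else if (inList && afterNewline) {
      // A non-list token follows the newline, so the list has ended.
      out.push({ type: 'listEnd' });
      inList = false;
    }
    afterNewline = tok.type === 'newline';
    out.push(tok);
  });
  if (inList) {
    out.push({ type: 'listEnd' });
  }
  return out;
}
```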

Change-Id: I1743901e3db541bbeda78d17707db943e6ceb9b9
2012-06-04 13:38:43 +02:00
Gabriel Wicke f821eac102 Optionally round-trip sHref in data-mw
If the href would not denormalize, add a copy of the original href in data-mw
and use it to preserve non-conventional capitalization etc.

Change-Id: Ifef50eec7343b0e6b0ba66b6d19a8a3e8c9f8001
2012-06-04 12:28:05 +02:00
Gabriel Wicke e0809209ec Don't set the data-mw attribute if the object is actually empty.
Change-Id: I984f1b44bba67d7a9f1a709738d14c0ee02f69a9
2012-06-04 12:26:03 +02:00
Gabriel Wicke 2774e5aa6c Actually replace all underscores in wikilink target
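
The commit message does not say what the bug was. One common cause in JavaScript, shown here purely as an assumption, is that `String.prototype.replace` with a string pattern replaces only the first match; a regular expression with the `g` flag replaces all of them. The function name is hypothetical.

```javascript
// Convert a link target to a display title by replacing every underscore.
function targetToTitle(target) {
  // target.replace('_', ' ') would only touch the first underscore.
  return target.replace(/_/g, ' ');
}
```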
Change-Id: I633f8d6e4f639aff90fd456600376b7c6515fd50
2012-06-04 11:48:59 +02:00
Gabriel Wicke 3f2c72f920 Fix padleft / padright (mis)use as substr
Change-Id: I0645e11c8ef8b550ad35300d1904788940fc748a
2012-06-04 11:30:45 +02:00
Gabriel Wicke 4533c274ca Fix a crasher in the serializer
A tail containing regexp syntax (a ? in [[:en:Main Page]]) would crash the
serializer. Use substr instead.

Change-Id: I8519aec9c07dfe31893d676b1c936a42d2af74a0
2012-06-04 00:00:54 +02:00
Gabriel Wicke 31522d3d49 Add ApiRequest
Change-Id: I5f2a1cb65223a68f10bc63903000248efca05586
2012-06-02 16:52:51 +02:00
Gabriel Wicke 63abd57fc8 Improve newline-before-paragraph round-tripping support
Change-Id: I9176a97f9695018650d9a63b89514c07e0d6be90
2012-06-02 16:39:33 +02:00
Gabriel Wicke d3975a8d03 Very basic round-trip test mode for the API
Returns both the resulting wikitext and the diff with the original input.

Change-Id: Iad25039beb054a84e1ad51ffa9fee924db49c60b
2012-06-02 16:20:54 +02:00
Gabriel Wicke 74135b295f Some more switch fixes
Change-Id: If1a6086348c45a73a941bc8e6728ef75d002be50
2012-06-02 15:04:20 +02:00
Subramanya Sastry 8f216af2f5 Handle link tails properly.
- Added a tail json attribute for wikiLinks
- During serialization, this attribute is used to strip the tail from
  the link target and render it after the link

  [[hen]]s ==> <a ... data-mw="{gc:1, tail: 's'}" ...>hens</a>
           ==> [[hen]]s

- 2 more roundtrip tests green
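
The round trip above can be sketched as follows. The link object shape (`text`, `dataMw.tail`) is an assumption for illustration; plain string operations are used to strip the tail.

```javascript
// Hypothetical link shape: { text: 'hens', dataMw: { tail: 's' } }.
function serializeWikiLink(link) {
  var tail = (link.dataMw && link.dataMw.tail) || '';
  var target = link.text;
  if (tail && target.slice(-tail.length) === tail) {
    // Strip the tail from the target and re-emit it after the link.
    target = target.slice(0, target.length - tail.length);
  }
  return '[[' + target + ']]' + tail;
}
```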

Change-Id: I84f3dabaf0271f7a67641a00148467daa8310eb0
2012-06-01 23:41:10 -05:00
Subramanya Sastry 413fc5e043 Fixed bug serializing wikilinks with implicit link text.
* Simple fix but greens 10 more roundtrip tests.

Change-Id: I7f82d788a10bd83e0e3215568c2168081c332c50
2012-06-01 17:25:21 -05:00
Gabriel Wicke 16219ddc6d Fix up #switch a bit
* Re-establish the value-only default
* Fix value expansion

Change-Id: I32e62789b25bbe17a74c564e41e9101ad5528fb7
2012-06-01 22:15:43 +02:00
Gabriel Wicke e2301813ed Merge "Tokenizer backtracking cache bug fix and memory savings"
2012-06-01 12:06:00 +00:00
GWicke befd223476 Merge "First pass implementing a general tag minimization routine"
2012-06-01 11:15:48 +00:00
Gabriel Wicke ece2b0f810 Tokenizer backtracking cache bug fix and memory savings
* The state of syntax stops is now properly included in the cache key for the
  tokenizer-internal backtracking cache. This fixes some mis-parses when
  re-parsing a bit of text with different flags.
* Clear the backtracking cache after each toplevelblock. This drops the peak
  memory usage when expanding [[:en:Barack Obama]] from ~380M to ~110M.
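
Both points can be illustrated with a small cache sketch. The class, the key format and the method names are assumptions for illustration and do not reflect the tokenizer's internal cache.

```javascript
// Hypothetical backtracking cache keyed by rule, position and active stops.
function BacktrackCache() {
  this.entries = {};
}

// Include the active syntax stops in the key, so a result cached under one
// set of flags is never reused under another.
BacktrackCache.prototype.key = function (rule, pos, stops) {
  var active = Object.keys(stops)
    .filter(function (k) { return stops[k]; })
    .sort();
  return rule + ':' + pos + ':' + active.join(',');
};

BacktrackCache.prototype.get = function (rule, pos, stops) {
  return this.entries[this.key(rule, pos, stops)];
};

BacktrackCache.prototype.set = function (rule, pos, stops, result) {
  this.entries[this.key(rule, pos, stops)] = result;
};

// Intended to be called after each top-level block to bound memory use.
BacktrackCache.prototype.clear = function () {
  this.entries = {};
};
```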

Change-Id: Icdb879cae5907e4595903dd6acba2e686e8c2e4b
2012-06-01 12:53:49 +02:00
Subramanya Sastry 1c80e2d7f0 First pass implementing a general tag minimization routine
* This routine attempts to rewrite the DOM to maximize tag overlap
  and thus minimize tag uses.

* This takes as input a set of tags which participate in the
  minimization.

* Tested on the following example
  <b><i><u><s>BIUS</s></u></i></b><b><i><s>BIS</s></i></b><b><u><s>BUS</s></u></b><u><i>UI</i></u>
  with multiple combinations of the 2^4 possible variations of i,b,u,s
  tags: [], ['i','b','u','s'], ['i'], ['b','s'], ['i','b','u']

  - But, I am not fully sure if this implements the right behavior when
    only a subset of inline tags are provided.  Needs discussion and tweaking
    as necessary.

* Also tested on a few others:
  <b>B</b><b><i>BI</i></b><b><i><u>BIU</u></i></b><b><i><u><s>BIUS</s></u></i></b>
  <s><i><b>SIB</s></i></b><s><i><u>SIU</u></i></s><i><u>IU</u></i><i>I</i>

* The previous pairwise tag rewriting version fails on several of these
  examples, so this new version is a definite improvement.

* No change in parserTests run (203 passing before and after).

* Possible improvements that could/should be undertaken:
  - get rid of useless/idempotent add/remove of nodes that don't change
    the DOM.
  - ensure that node attributes post-restructuring are correct.

Change-Id: Ib4a8b39583fa96a2be880a77021ca81cefa06484
2012-05-31 12:10:28 -05:00
Gabriel Wicke 4ea6b8e2be Revert part of last template syntax tweak
Change-Id: I084e1210577f80c3b96020d57cfa5c68eb5d139b
2012-05-31 12:02:42 +02:00
Gabriel Wicke c5d7e01944 Another tokenizer robustness improvement
This patch fixes a tokenizer syntax error encountered on
[[:en:Template:JacksonvilleWikiProject-Member]] and [[:en:Template:Infobox
former country]] by allowing optional whitespace before start-of-line template
syntax.

Change-Id: Ic214a731de58bf766e51f23d5e24ea2ce6788f58
2012-05-30 18:38:23 +02:00
Gabriel Wicke a133768781 Don't eat '}}' in generic attributes and similar productions
This fixes some syntax errors, at least one in Template:Geobox.

Change-Id: I32338febe25d0833c1d9bc4de293cd15b4cbb7be
2012-05-30 17:37:10 +02:00
Gabriel Wicke 36084c5d93 Preserve original newlines in HTML and serialization
254 round-trip tests (up from 184) are now passing.

Also:
* tweaked runtests.sh slightly (use less -R instead of -r).
* made sure the EOFTk is preserved in phase 3 transforms

Change-Id: I1de22186bdb78e52019370e43f096877005b8f5a
2012-05-29 23:29:03 +02:00
Subramanya Sastry 8174c9dafc First attempt implementing rewriting rules on the DOM
- This is implemented as a post-processing pass.
- Might require additional checks to verify rewriteability.
- Implemented as a pair-wise tag DOM minimization strategy,
  i.e. it takes tag pairs (B, I) for ex, and attempts to
  normalize the tree just for those tag pairs.  Normalizing
  across multiple tags is implemented as pairwise rewriting
  across all pairs:  Ex:(b,i), (b,u),(i,u) for (b,i,u)
- Copied over attributes as part of rewriting, but some of the
  attributes lose their meaning on rewriting since tags are
  reordered (ex: sourcePosn, sourceTagPosn). How do we handle this?
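
For reference, enumerating the tag pairs used by a pairwise strategy is straightforward; the helper name below is hypothetical.

```javascript
// Enumerate all unordered pairs of tags, in input order.
function tagPairs(tags) {
  var pairs = [];
  for (var i = 0; i < tags.length; i++) {
    for (var j = i + 1; j < tags.length; j++) {
      pairs.push([tags[i], tags[j]]);
    }
  }
  return pairs;
}
```

For n tags this yields n(n-1)/2 pairs, which is one reason a single multi-tag pass may be more efficient.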

Output examples and possible issues to fix:
   <i><b><u>biu</u></b></i><b><u>bu</u></b><u>u</u>
gets rewritten to:
   <u><b><i>biu</i>bu</b>u</u>

But, the equivalent wikitext form:
   '''''<u>biu</u>''''''''<u>bu</u>'''<u>u</u>
does not get rewritten because of parsing differences.
This wikitext gets parsed into:
   <i><b><u>biu</u>'''</b></i><u>bu<b>u</b></u>
The extra ''' token in the middle thwarts DOM rewriting.

However, a slightly different version:
   "'''''<u>biu</u>''<u>bu</u>'''<u>u</u>"
gets properly normalized to:
   <u>'''''biu''bu'''u</u>

An alternative, but fun strategy to play with is to use the following
two normalization primitives: S(wap) and M(erge).
- S rewrites T1(T2(x)) into T2(T1(x))
  (ex: <b><i>foo</i></b> ==> <i><b>foo</b></i>)
- M rewrites (T(x),T(y)) into (T(x,y)).
  (ex: <b>foo</b><b>bar</b> ==> <b>foobar</b>)

The current rewriting strategy could possibly be re-implemented as S-M
rewriting.  The problem to solve there would be to find an efficient
rewriting strategy that is guaranteed to lead to a normal form.  I may
not play with it now, but just documenting it for later (to play with
in my spare time).
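
The two primitives can be sketched on a toy tree. The node representation (strings for text, `{ tag, children }` for elements) and the helper names are assumptions for illustration; this shows the primitives only, not a strategy that is guaranteed to reach a normal form.

```javascript
// Hypothetical tree: a node is a string (text) or { tag, children }.

// S: T1(T2(x)) => T2(T1(x)), applied when T1 has exactly one element child.
function swap(node) {
  var inner = node.children[0];
  if (node.children.length !== 1 || typeof inner === 'string') {
    return node; // not applicable, leave unchanged
  }
  return {
    tag: inner.tag,
    children: [{ tag: node.tag, children: inner.children }]
  };
}

// M: (T(x), T(y)) => T(x, y) for adjacent siblings with the same tag.
function merge(siblings) {
  var out = [];
  siblings.forEach(function (n) {
    var prev = out[out.length - 1];
    if (prev && typeof prev !== 'string' && typeof n !== 'string' &&
        prev.tag === n.tag) {
      out[out.length - 1] = {
        tag: prev.tag,
        children: prev.children.concat(n.children)
      };
    } else {
      out.push(n);
    }
  });
  return out;
}

// Helper to print a node as HTML for inspection.
function toHtml(n) {
  if (typeof n === 'string') {
    return n;
  }
  return '<' + n.tag + '>' + n.children.map(toHtml).join('') + '</' + n.tag + '>';
}
```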

This commit is just a record of fun/experimental code where I get to
learn details of JS, wikitext, parsing, and DOM manipulation.  Next
version of this code will attempt to introduce minimal DOM restructuring
across multiple tags at once which can be more efficient.

gwicke: Removed now passing test from whitelist, and updated another whitelist
entry which is now improved.

Change-Id: Ie97bcb164eb62c34ba61aa76ba2f4c232aa713d8
2012-05-29 08:17:57 +02:00
Gabriel Wicke b2adee0ae7 Basic rt support for indent pre variant
* Added a generic stx_v 'syntax variant' round-trip attribute
* For pre, use stx:'html' vs. no syntax annotation. This might not be 100%
  safe for arbitrary html input, so we might want to flip this to stx:'wiki'
  later.
* 181 round-trip tests passing

Change-Id: If6080917a3a7c069066db3db60efe59b1f6c28d8
2012-05-25 18:55:38 +02:00
Gabriel Wicke a31ccaabe4 Support definition lists with empty definition
Change-Id: I81c39a7e49f2ea7ce32cdd3600caeb5eb9f50d84
2012-05-25 15:40:32 +02:00
Gabriel Wicke 06b51b1f3f Properly round-trip dd/dt; 178 round-trip tests passing.
Need to track variable whitespace before elements to make some more tests
pass.

Change-Id: Ia86535d6f352e2ffe7965547cd506b0dbb6dfba2
2012-05-25 13:59:55 +02:00
Gabriel Wicke 6f62878c78 Resolve subpage links, and remove hack for H: titles
Change-Id: I6c9c64179274e5c1641a3b127ac3b273a3c5254e
2012-05-24 17:57:41 +02:00
Gabriel Wicke dc61f313a2 Notes on missing parser functions, more error reporting tweaks
Change-Id: Ib6ce60cf1b55671a6ff57aa47edb5787ec3aefea
2012-05-24 17:31:26 +02:00
Gabriel Wicke cc10aab54f Add self alias
Change-Id: I47682f407da6b554179611c7d0f63f882ab5a871
2012-05-24 17:16:35 +02:00
Gabriel Wicke 13ae7cda11 A few (partly hackish) improvements
* Very basic support for attribute key-value pairs emitted from templates
* Add TALKPAGENAME stub implementation
* Only show 'no revisions' message for top-level pages

Change-Id: I4b4ac0c7b2c0531ac4b39f0f49f4217302576ab9
2012-05-24 16:30:26 +02:00
Gabriel Wicke 3e0e11b1d0 Sanity check for tokens being an array
Change-Id: Ia4e4071e1469c31e3b320d854500938bb0245f82
2012-05-24 14:35:58 +02:00
Gabriel Wicke 93ce7453f0 Fake fullpagename et al a bit better
Change-Id: I85ddf9e88e5f8ac274f371bea0879600997001e4
2012-05-24 11:05:31 +02:00
Gabriel Wicke cdd1eca42d Fix non-existing revision error reporting
Change-Id: I6b8687bcde98b92d9d6217a738a177db279fd006
2012-05-24 10:50:47 +02:00
Gabriel Wicke f03fc39d15 Report missing revisions when retrieving templates
Change-Id: I9f33acafc4d3fbd062125d824e2614dafd4cd5a0
2012-05-24 10:45:01 +02:00
Gabriel Wicke caf2fa663d Keep going on tokenizer errors
Change-Id: I76fab4528f89b425845aef1685b3a54ddfeceef4
2012-05-24 10:30:32 +02:00
Gabriel Wicke e70448e53a Use text/x-mediawiki content type, and handle tokenizer errors without --debug
Change-Id: I154cd344306aa05ada7ff30f631d487f39fa9739
2012-05-24 10:19:25 +02:00
Gabriel Wicke 4cc2d25e70 Fix a debug print reference error
Change-Id: Ic26d29aced4129c3dd718c4751dadb62a0be1a27
2012-05-23 20:52:45 +02:00
Gabriel Wicke d6af3b3375 Improve the serializer and its output display in the web service
Change-Id: Id3ca96846cad42517d7d4bada8f4bb250d54247b
2012-05-23 17:50:35 +02:00
Gabriel Wicke 95496c02db Add an extra newline before headings, and ignore favicon.ico requests
Change-Id: Ibacac3453afefa5dbe803c1e0260e8c943785f12
2012-05-23 17:17:54 +02:00
Gabriel Wicke 21286a50df Make sure pageName is set in the web service, and handle empty page name in parser function
Change-Id: I5d36eefecc2f35a860d00a8960004f8e651ed17c
2012-05-23 16:43:45 +02:00
Gabriel Wicke a862718ad8 Add some checks against undefined tokens returned from async transforms
Change-Id: Ie19537083b96b1b2e12e1c4b65a7a044753c18ac
2012-05-23 16:32:21 +02:00
Gabriel Wicke a4c5d43ff7 Fix an external link regression, and add server shell wrapper and setup docs
Change-Id: I9a4f7690e98313d003a2fec35324ed70556e6461
2012-05-23 16:25:42 +02:00
Gabriel Wicke b89f5071e5 Basic parser / serializer web service
* After installing Parsoid (sudo npm install -g in modules/parser), run 'node
  server.js' from the api directory and navigate to http://localhost:8000/ and
  follow the directions. You can start to navigate the English wikipedia at
  http://localhost:8000/Main_Page, or manually enter wikitext or HTML DOM to
  convert.
* Uses the express framework, could also use just connect
* Uses the cluster module to manage workers per-core and restart those on
  failure

Change-Id: I443f2996ed3df00826b038b7476a2f966ab0c425
2012-05-23 12:35:00 +02:00
Gabriel Wicke febb912ead No end delimiter after template row attributes
Change-Id: Iba304fb797d221e2d65ae055d266bff2f6301df8
2012-05-23 09:30:07 +02:00
Gabriel Wicke 39c6f42879 Link round-tripping and other improvements
* Changed RDFa for links according to
  http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary
* Added basic support for internal/external link serialization
* Moved numbering of external links from tokenizer to LinkHandler
* Added round-tripping for generic HTML tags
* Replaced nowiki tag with <meta typeOf="mw:tag" content="nowiki"> and <meta
  typeOf="mw:tag" content="/nowiki"> for now.
* 154 round-trip tests passing (node parserTests.js --roundtrip).

Change-Id: I16c4db21b1b543ee57c73e569c83025b64664542
2012-05-22 13:36:06 +02:00
Gabriel Wicke 7e21b7380a Merge "Round-trip nowiki"
2012-05-21 17:16:56 +00:00
Gabriel Wicke fb7d5418a5 Round-trip nowiki
Change-Id: I5f7e6a43f5fdc1708ee710b2a601b20db733452c
2012-05-21 18:06:09 +02:00
Gabriel Wicke a6610e52c2 Serializer and table round-tripping improvements
* added stx: 'html' round-trip information for html tags
* added t_stx: 'row' info for row-wise table wiki syntax, and support for it
  in the serializer
* the first table row is implicit in wikitext
* renamed lastToken to prevToken in serializer
* strip first newline in an initial chunkCB

Change-Id: I014b046539d1b674d830551c5fd1b74a67f81993
2012-05-21 14:59:53 +02:00
Gabriel Wicke e069e7cb1c Merge "Support table captions and properly delimit the end of table options"
2012-05-21 12:51:58 +00:00
Gabriel Wicke 54e75b93b7 Support table captions and properly delimit the end of table options
Change-Id: I15eb8df19528cfceadfee368370501b30f0e36a0
2012-05-18 10:46:43 +02:00
Gabriel Wicke c39eb36968 Use outerHTML to serialize unhandled DOM node in serializer
Change-Id: I37350712c9450c34025740a8d6de51344739c2b7
2012-05-18 10:03:16 +02:00
Gabriel Wicke 3c6d829708 Fix first bug caught by new roundtrip mode for parserTests
Change-Id: Id152fd29606d8ee34ac300945f41e2a5f48f087f
2012-05-18 09:55:22 +02:00
Subramanya Sastry ae4810b201 Renamed items to itemCount for better code readability.
Change-Id: I53851c07a4746928fddec4b3737136f081d49178
2012-05-17 12:32:46 -05:00
Subramanya Sastry 58da03bc85 Track list prefixes in the list start handler and use them to output
serialized text in list item handlers.

Change-Id: Ic7562d531d2313bedcf3b7450b4f28f02bc2b5a3
2012-05-17 12:12:46 -05:00
Gabriel Wicke e2815b516c Start to handle links
Change-Id: I1fb975910651820fd889d77152562fd4fbcb5db8
2012-05-17 14:32:56 +02:00
Gabriel Wicke b7fd4498a9 Use single _serializeToken handler for both DOM and tokens
Change-Id: I45e1d90b53a5ddc678f7744f27274bebcfc375fe
2012-05-17 13:20:39 +02:00
Gabriel Wicke 8dbc2f573f Simplistic wikitext round-tripping with parse.js --wikitext
Lists are a bit tricky, as nested lists are not wrapped in a separate list
item. Should work now though.

Change-Id: I2e5f29f6afa6bdd2d5e5c0c5d019b70c611b73d1
2012-05-17 12:44:46 +02:00
Gabriel Wicke 3414418b1f Don't eat newline tokens in the ListHandler
This fix only affects following transforms, of which there are few right now.
Also removed a stray token mutation in QuoteTransformer.

Change-Id: Id6d4adce944b06fc1a3651cfbf63fc2670125225
2012-05-16 23:14:21 +02:00
Gabriel Wicke 542921b5a3 Removed html5 parser patch no longer needed with 0.3.8
Change-Id: Id8c23d34e8cca49a360f536e792144a85a8468a3
2012-05-16 12:06:42 +02:00
Mark Holmquist 96ee9ad45c Add a new wikitext serializer, with limited functionality.
This isn't finished at all, but Gabriel wants to take a crack at it,
so here it is!

Change-Id: I9732aa141f7c69a28c8f5978cb18180e93cb9eda
2012-05-15 10:41:28 -07:00