1348 commits

Author SHA1 Message Date
Gabriel Wicke e117f09362 Wikitext escaping and quite complete source range tracking
* Started to add more complete tag source range (tsr) annotations to most
  start / empty tags. These replace the old sourcePos and sourceTagPos
  annotations, and look more promising for general round-tripping than block
  source ranges (bsr). See
  http://www.mediawiki.org/wiki/User:GWicke/Parsoid_source_ranges for some
  notes on this.
* Added an escapeWikitext method in the serializer that tokenizes supposedly
  text-only content from the DOM with the tokenizer and wraps runs of returned
  non-text tokens into nowiki tags. The source corresponding to non-text
  tokens is retrieved using the tsr annotations.
* Removed old (unused) table productions to avoid confusion.
* 276 round-trip tests are passing, vs. 283 without escaping.

Known issues:
* harmless for now, can be improved later: urllinks in external link captions
  are wrapped in nowiki. Example HTML:

<a rel='mw:extLink' href="http://example.com">http://example2.com</a>

* some start-of-line syntax in wiki-syntax preformatted blocks might be
  wrapped into nowiki when that would not really be needed. Example HTML DOM:

<pre>
* foo
* bar
</pre>

Change-Id: I01c34aedd5c566614d36924add47a6a960e91987
2012-06-19 23:36:44 +02:00
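
The escaping approach described in e117f09362 can be sketched roughly as follows. This is only an illustration: `toyTokenize` is a hypothetical stand-in that recognises a handful of constructs (quote runs, `[[`, `]]`, `{{`, `}}`), whereas the real serializer uses the full tokenizer, and `escapeWikitextSketch` is an assumed name, not the actual method. The only idea taken from the commit message is that runs of non-text tokens are wrapped in nowiki and that their source is recovered from tsr offsets.

```javascript
// Hypothetical toy tokenizer: splits a string into 'text' and 'syntax' tokens,
// each carrying a tsr-style [start, end] source range.
function toyTokenize(src) {
  const tokens = [];
  const syntax = /'{2,}|\[\[|\]\]|\{\{|\}\}/g;
  let last = 0;
  let m;
  while ((m = syntax.exec(src)) !== null) {
    if (m.index > last) {
      tokens.push({ type: 'text', tsr: [last, m.index] });
    }
    tokens.push({ type: 'syntax', tsr: [m.index, m.index + m[0].length] });
    last = m.index + m[0].length;
  }
  if (last < src.length) {
    tokens.push({ type: 'text', tsr: [last, src.length] });
  }
  return tokens;
}

// Wrap each run of consecutive non-text tokens in <nowiki>, using the tsr
// ranges to retrieve the original source of the run.
function escapeWikitextSketch(text) {
  let out = '';
  let run = null; // [start, end] of the current run of non-text tokens
  const flush = () => {
    if (run) {
      out += '<nowiki>' + text.slice(run[0], run[1]) + '</nowiki>';
      run = null;
    }
  };
  for (const tok of toyTokenize(text)) {
    if (tok.type === 'text') {
      flush();
      out += text.slice(tok.tsr[0], tok.tsr[1]);
    } else if (run) {
      run[1] = tok.tsr[1]; // extend the current run
    } else {
      run = [tok.tsr[0], tok.tsr[1]];
    }
  }
  flush();
  return out;
}
```

Plain text passes through unchanged, while adjacent syntax tokens such as `[[{{` end up inside a single nowiki pair.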
Subramanya Sastry bb7d7c09a5 Fixed newline stripping in rtve mode.
- Only strip newlines after ">" chars (still not robust,
  but better than stripping everywhere). This prevents
  useless/incorrect diffs in rtve mode and lets us identify
  real bugs.

Change-Id: Iab7b41c4b3d6351c090f8d3a3070330325e876d4
2012-06-19 12:34:42 -05:00
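
A minimal sketch of the narrower stripping rule from bb7d7c09a5, assuming it amounts to a single regular-expression replace (the function name is hypothetical and the actual code may differ):

```javascript
// Remove newlines only when they directly follow a '>' character, so that
// newlines inside text content (for example within <pre>) are left alone.
function stripNewlinesAfterTags(html) {
  return html.replace(/>\n+/g, '>');
}
```

As the commit notes, this is still not robust: a literal ">" at the end of a line inside preformatted text would also lose its newline.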
Gabriel Wicke 9e2a47d540 Switch diff algo back to diffWords by default
Faster than diffChars, and still easier to read than diffLines.

Change-Id: Id450a2f8a098bb0a71ccf54616f82dad4f25441c
2012-06-19 00:21:34 +02:00
Gabriel Wicke 5fbc80321b Improve newline handling for comments and nowiki/noinclude tags
* Added a newlineTransparent flag to handlers that prevents changes to the
  onNewline status, so that content following it is still considered to be in
  start-of-line context. This fixes a few rt tests where a comment or nowiki
  tag is at the start of the line, and following content should end up on the
  same line.
* 283 rt parser tests are now passing.

Change-Id: Ie58dcb9e5e9af9000fff61c2e1db5d8649ffc3f6
2012-06-18 22:56:41 +02:00
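
The effect of the newlineTransparent flag from 5fbc80321b can be illustrated with a toy serializer loop. Only `newlineTransparent` and `onNewline` come from the commit message; the token fields `src` and `startOfLineOnly` and the function name are assumptions made for this sketch.

```javascript
// Toy serializer: tokens flagged startOfLineOnly get a newline forced in
// front of them unless the output is already in start-of-line context.
function serializeTokensSketch(tokens) {
  let out = '';
  let onNewline = true; // the output starts in start-of-line context
  for (const tok of tokens) {
    if (tok.startOfLineOnly && !onNewline) {
      out += '\n';
    }
    out += tok.src;
    // Newline-transparent tokens (comments, nowiki) leave onNewline untouched,
    // so whatever follows them still counts as being at the start of the line.
    if (!tok.newlineTransparent) {
      onNewline = tok.src.endsWith('\n');
    }
  }
  return out;
}
```

With the flag set on a leading comment, a following list item stays on the same line; without it, a newline is inserted between the two.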
Gabriel Wicke 97fb2d3c0d Serializer refactoring
* tokens are not modified any more (they are supposed to be immutable)
* handler info is now split in start / end objects and potentially a 'make'
  method; added more flags to govern the newline behavior of different tags
* added a generic singleLine mode for single-line syntactical environments
* switched the web service to line-based diffs to avoid issues when diffing
  the round-trip results of [[:en:Programming language]]
* 280 round-trip tests are passing now

Change-Id: I74b8ffbf69643c5d6e5ec852ec58e680c9018901
2012-06-18 21:52:15 +02:00
Subramanya Sastry f1d03f325e Couple minor bug fixes in serializer
Change-Id: I961e2f4e7609cc6b264eaf494b39497401cdc55c
2012-06-17 22:41:14 -05:00
Gabriel Wicke 910f2ed87a Experimental /_rtve/ round-trip test mode for web API
This mode strips all newlines from the html source before serializing it back
to wikitext, thus simulating newline-less DOM output from the VE. This
simplistic method also strips newlines in preformatted text, which will show
up as noise in the diff. This simple mode is still useful for the
identification of basic newline-less DOM serialization issues.

An improved version could try to approximate the VE's behavior more closely by
only stripping some newlines.

Due to its experimental nature, this mode is not linked from the index page for
now.

Change-Id: I1dfec7ec3e6c12b7de4bbb9ff6f2d8b7834e2857
2012-06-17 17:40:48 +02:00
Gabriel Wicke 41d8212573 Emit SpaceCharacters token for HTML5 'space' chars
HTML5 defines space characters as [ \r\n\t\f] in
http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#space-character.
It treats these specially in a few contexts. As an example, the foster
parenting algorithm does not apply to space characters.

As a result, this change fixes the round-tripping of spaces between table
tags, which were previously moved before the table.

Change-Id: I32ab29275a9f824fc66d8286638eb42748cfc9a5
2012-06-17 16:16:07 +02:00
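
The classification from 41d8212573 could look roughly like this, assuming character data is checked against the HTML5 space character set quoted above. The function name and the 'Characters' token type are assumptions for this sketch; only 'SpaceCharacters' is named in the commit.

```javascript
// HTML5 space characters as quoted in the commit message: [ \r\n\t\f]
const HTML5_SPACE_ONLY = /^[ \r\n\t\f]+$/;

// Emit a SpaceCharacters token for runs made up solely of HTML5 space
// characters, and an ordinary Characters token for everything else.
function classifyCharacters(text) {
  return {
    type: HTML5_SPACE_ONLY.test(text) ? 'SpaceCharacters' : 'Characters',
    data: text
  };
}
```

Note that a non-breaking space (U+00A0) is not in this set, so it would still be treated as ordinary character data.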
Subramanya Sastry a229f72833 First pass redoing serialization code to handle newline requirements
from Parsoid HTML output as well as VE HTML output.  There are still
some newline-related failures from parser tests that need fixing, but
this is getting close.  So committing for now so other eyes can make the
bugs shallow :).

Change-Id: Ia6a218ee9fb3e18fe0573c89ff3a4236779e1e64
2012-06-16 10:09:06 -05:00
Subramanya Sastry 3f92f39397 Removed newline normalization between paragraphs.
Change-Id: Ifd55db73c8fe2b3e952066a75cba2f8e13c58430
2012-06-14 18:51:56 -05:00
Subramanya Sastry 54f12d1807 Fix for href handling.
- Check if href for links has the wgScriptPath prefix before
  attempting to strip it from the href.

Change-Id: I844151ef7317476668d1306b96a2aec5a56fd0f1
2012-06-14 18:35:22 -05:00
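
A minimal sketch of the guarded prefix stripping described in 54f12d1807. The function name and the example value '/wiki/' for wgScriptPath are assumptions; the real configuration value may differ.

```javascript
// Only strip the script path when the href actually starts with it;
// external or otherwise unrelated hrefs are returned unchanged.
function stripScriptPath(href, wgScriptPath) {
  return href.startsWith(wgScriptPath)
    ? href.slice(wgScriptPath.length)
    : href;
}
```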
Subramanya Sastry c0fc9e9a97 Updated newline handling around lists and nested lists.
- Something like this:
    <ul><li>1</li><li>2<ul><li>2.1</li><li>2.2<ul><li>2.2.1</li><li>2.2.2</li></ul></li><li>2.3</li></ul></li><li>3</li></ul>
  now serializes properly to:

    *1
    *2
    **2.1
    **2.2
    ***2.2.1
    ***2.2.2
    **2.3
    *3

  So does this form which is what the above wikitext parses to:
    <ul><li>1
    </li><li>2
    <ul><li>2.1
    </li><li>2.2
    <ul><li>2.2.1
    </li><li>2.2.2
    </li></ul></li><li>2.3
    </li></ul></li><li>3
    </li></ul>

- Lists (and nested lists) are not entirely newline-insensitive.
  They still depend on newlines *between* lists.  The opening
  <ul> tag for non-nested lists should always start on a new line.
  So, for example,
    <ul><li>foo</li></ul><ul><li>bar</li></ul>
  will serialize to:
    *foo
    *bar
  which is incorrect.  But,
    <ul><li>foo</li></ul>
    <ul><li>bar</li></ul>
  will correctly serialize to:
    *foo

    *bar

Change-Id: I13a0290368574865957bcf57aebab488fbbb7026
2012-06-14 17:09:59 -05:00
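
The nested-list output shown in c0fc9e9a97 can be reproduced with a small recursive sketch. It works on a plain object tree rather than a DOM, and the `{ text, children }` shape and function name are assumptions for illustration; the actual serializer is token-based and handles newlines separately.

```javascript
// A list is an array of items; an item is { text, children? }, where
// children is a nested list. Each nesting level adds one '*' to the bullet.
function serializeListSketch(items, depth = 1) {
  const lines = [];
  for (const item of items) {
    lines.push('*'.repeat(depth) + item.text);
    if (item.children) {
      lines.push(...serializeListSketch(item.children, depth + 1));
    }
  }
  return lines;
}
```

Joining the returned lines with "\n" yields one bullet per line, with nested items following their parent item.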
Subramanya Sastry 8978e406fc Minor code refactoring
Change-Id: Ib7f70a3ac42e3d5a5985e9a9bcffa313bdac289b
2012-06-14 15:18:53 -05:00
Translation updater bot 1a92d59ebb Merge "Localisation updates from http://translatewiki.net." 2012-06-14 20:12:50 +00:00
Translation updater bot adb46c5e4f Localisation updates from http://translatewiki.net.
Change-Id: I9099f8cd45dd932a9daac0404c310aad37a14768
2012-06-14 20:08:18 +00:00
Subramanya Sastry d7e83c4e2b Fixed/updated newline handling for <p> tags
- More pieces are now simplified and all(?) newline handling
  is now centralized in the serializeToken function.

- This commit fixes bugs in rt-ing some code snippets
    ----------
    Ex 1: foo<p>bar</p>baz
    ----------

- This commit fixes bugs serializing VE generated html
    ----------
    Ex 2: <p>foo</p><pre>bar</pre> ==> foo\n bar
    ----------

- But, this round of fixes introduces RT failures for certain
  code examples in parserTests.txt.  In all these failing cases,
  inline text/html is embedded within a generated <p> tag during
  parsing.  If these generated <p> tags can have a "gc:1" attribute
  added to them, we can properly serialize them to the original
  form.
    ----------
    Ex 3: foo<pre>bar</pre>
          Parsed HTML: <p>foo</p><pre>bar</pre>
    ----------
  Note how this parsed HTML is identical to what the VE outputs
  in Example 2 above.  So, without the gc:1 attribute, we now
  have conflicting requirements on the exact same HTML.
  This increases confidence in the correctness of my commit here.

Change-Id: I86beadec91c445a7f8a6d36a639b406697daa0a2
2012-06-14 14:59:18 -05:00
Subramanya Sastry 13e03ec1d7 Refix <pre> serialization.
- Effectively reverted fix from f882a65153
  and added a new fix.

Change-Id: I8b81e26525a5f1a22acaf2c7067f2dcd9b962818
2012-06-14 13:10:02 -05:00
Subramanya Sastry 51227f2a4a Improved, simplified newline handling in wikitext serializer.
- Eliminated newline handling from several places in code and
  mostly isolated it to serializeToken thus simplifying newline
  handling logic.
- Fixed some bugs in the process: # of green roundtrip tests
  went up by 5 (294 --> 299) but actually introduced failures on
  a few originally succeeding tests (additional leading/trailing
  newlines on the entire test output).
- Added bonus: made list serializing (mostly) insensitive to
  newlines between tags.  So, all of the following DOMs serialize
  identically to this wikitext:

  *foo
  *bar

  ----------
  <ul><li>foo</li><li>bar</li></ul>

  ----------
  <ul>
  <li>foo</li>
  <li>bar</li>
  </ul>
  ----------
  <ul>

  <li>
  foo

  </li>

  <li>
  bar</li>

  </ul>
  ----------

Change-Id: I76be56c4b2789039dff5f47de4659746882e45d6
2012-06-14 00:10:51 -05:00
Subramanya Sastry 51958e4c6a Removed unused parser pipeline construction
Change-Id: Id2a7dde895b7c3fbf776a2035009686afd4301df
2012-06-13 15:08:13 -05:00
Subramanya Sastry bf0f5d1b7e Minor code cleanup
Change-Id: Ic5d99b6c483841310b0c295c1c30246f907455b4
2012-06-13 13:47:26 -05:00
Subramanya Sastry 23ec054013 Fixed round-tripping of interwiki links.
Change-Id: If0427b9865b3e9cf8c0ad0b4efaebc9f9f7fb865
2012-06-13 13:39:18 -05:00
Subramanya Sastry 445780b4d3 Revert default tokenization result from null to ''
* As part of an earlier fix, I had changed default value of 'res'
  to null instead of ''.  But, this was potentially buggy because
  the previous check was (res !== '') which could be triggered
  by return values of handlers.  By changing the check to null,
  I was effectively changing the code paths for those handlers that
  returned ''.

Change-Id: I2302023be7422ce4fb384ff5a50fe53fa7732855
2012-06-13 11:53:05 -05:00
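
The pitfall described in 445780b4d3 comes down to which sentinel value marks "no handler result". The sketch below is an assumed simplification (function and parameter names are hypothetical) showing how a handler that returns '' takes a different code path depending on the sentinel:

```javascript
// Returns true when the "a handler produced a result" branch would be taken.
// `sentinel` is the default value of res, which is also what the check
// compares against ('' in the original code, null in the earlier fix).
function takesResultPath(handlerResult, sentinel) {
  let res = sentinel;
  if (handlerResult !== undefined) {
    res = handlerResult; // a handler ran and returned something
  }
  return res !== sentinel;
}
```

With '' as the sentinel, a handler returning '' is indistinguishable from no handler at all; with null, the same handler suddenly takes the result path, which is the behavior change the commit reverts.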
Subramanya Sastry cfe94eed1f Minor code refactoring
Change-Id: Iec3cb4d83d16174371f0b1f3f23b1056aeed458e
2012-06-13 09:46:34 -05:00
Catrope 6bf79475e4 Add suggested development configuration in comments
Change-Id: I3ec8c5326faced6d4b5c878f26f37a281b03bd95
2012-06-12 19:19:48 -07:00
Subramanya Sastry e00cfdbc16 Merge "Remove a few entries we now care about from the whitelist" 2012-06-12 18:57:17 +00:00
Subramanya Sastry f882a65153 Fix serialization of <pre> tags
Change-Id: I7ae95e7ec06167d0c1bfdaba3d0c67d941043299
2012-06-12 13:54:35 -05:00
Subramanya Sastry 727c2119bb Refactored serializeToken method and added special-case handling of
paragraphs in lists.

* We need to look at other special-case handling requirements of
  html tags in lists (and other contexts like tables).

Change-Id: I84b8402d90a186c9075c2d45263c94377312927a
2012-06-11 17:55:41 -05:00
Translation updater bot e605878f46 Localisation updates from http://translatewiki.net.
Change-Id: I632b12009951716606f71530485ffc7ef377213d
2012-06-11 14:31:40 +00:00
Trevor Parscal ea73773854 Added node_modules and node error log file to git ignore
Change-Id: I9006d1e4bfd266fb470fe9143995c5dda112ac43
2012-06-10 23:56:50 -07:00
Translation updater bot 81d53403a8 Localisation updates from http://translatewiki.net.
Change-Id: I29672237acf0ab18963bdd46702b53c675e00b4c
2012-06-07 19:15:00 +00:00
Gabriel Wicke 3f61dc9821 Link talk page separately
Change-Id: Ib839f619e7e14ccf0ef698fc2e780ef4b0d65505
2012-06-07 13:42:05 +02:00
Gabriel Wicke 3549a16085 Add a 'report issue' link below round-trip results
Change-Id: I5e3a785a328af0debcf83dc2038b5e5417fa5158
2012-06-07 13:37:40 +02:00
Gabriel Wicke bec7fb2f8c Mention citations as not round-tripping
Change-Id: I57e25f6f4072bae2f5681b8611e98f899875d1e2
2012-06-07 13:18:44 +02:00
Gabriel Wicke 76cca063ba Add hint on where to report issues in web service entry page
* Explain what we are currently interested in and link to
  :mw:Talk:Parsoid/Todo.

Change-Id: I747c6ee8a021a7a73ec91b73281c1c679a00da8f
2012-06-07 13:16:05 +02:00
Gabriel Wicke 1ca586e5f1 Improve interwiki config a bit
* Moved wikipedia default prefixes to environment
* Added 'addInterwiki' method
* Adjusted link handling normalizeTitle to reflect this

Change-Id: If5b2314cc36346b6da8649ed410457a612d80a22
2012-06-07 12:30:16 +02:00
Gabriel Wicke 2fa5baabbb Make it easier to configure the default wiki, and add support for mediawiki.org
* mw:Foo now loads pages from mediawiki.org
* The default prefix still is 'en'. You can switch this to 'mw' in ParserService.js.

Change-Id: I1208667e6114bd711b7988a8b3adb32ffab70969
2012-06-07 11:50:40 +02:00
Gabriel Wicke b49102281f Remove a few entries we now care about from the whitelist
They are mostly about whitespace, but there is also a debatable quote test
that outputs an empty bold element at the end of the line. We should perhaps
strip this empty bold in the QuoteTransformer, as the preservation of an empty
bold tag in round-tripping does not seem to be too useful.

Change-Id: I1d8f3ebabcd9f6249e5170de420ba52e8aea22ca
2012-06-07 10:04:20 +02:00
Subramanya Sastry b665a2558f Fixed bugs handling/transforming quotes
- Three bugs that were messing up quote transformations.
- Now, the following cases are handled properly:

  * ''foo'''
  * '''foo''
  * ''foo''''
  * ''''foo''

  These tests (and other quote tests) have to be added to core parser
  tests file.

- One more parser test green.

Change-Id: I4f93e8910639f546bfc9304becab17d26d5529de
2012-06-07 01:37:45 -05:00
Translation updater bot 42daebe50a Localisation updates from http://translatewiki.net.
Change-Id: Ieb79571c97e1158414ecccbc8d5e984382f2cce5
2012-06-06 20:19:14 +00:00
Gabriel Wicke 413df0c471 Strip \r from form input; we normalize everything to Unix line endings
Change-Id: I5cd255e1a7ab9958f120fad408362e6f709e4b91
2012-06-06 19:26:29 +02:00
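
The normalization in 413df0c471 presumably amounts to something like the following one-liner (the function name is hypothetical):

```javascript
// Drop every carriage return so that "\r\n" from form submissions
// becomes a plain Unix "\n".
function normalizeFormInput(text) {
  return text.replace(/\r/g, '');
}
```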
Gabriel Wicke 47204c4ca0 Use diffChars instead of diffWords, as the former misses some changes
The improved merge algorithm now makes diffChars output more palatable. Things
could still be improved by collecting single-character 'neutral' changes in a
block of 'add' changes and converting them to adds / removes.

Change-Id: I8439e8acab4360c08b89d9ce8a6b8523e7a0a210
2012-06-06 18:36:28 +02:00
Subramanya Sastry f8221b128b Used a more robust heuristic for merging consecutive diffs
- Check if consecutive diffs are separated by 1 word in addition
  to max 3 chars.  This takes care of diffs introduced by template diffs
  separated by the template name and creates a clean single diff.

Change-Id: I9181d2ed9a07bee6ca5d5ebd6ddea84f7e2cecac
2012-06-06 11:01:47 -05:00
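
One way to read the merge heuristic from f8221b128b is sketched below. It assumes that an unchanged gap between two changes is bridged when it is at most 3 characters long or consists of a single word; the chunk and region shapes and the function names are hypothetical and are not taken from the web service code.

```javascript
// A gap is small if it is at most 3 characters or a single word
// (optionally surrounded by whitespace).
function isSmallGap(value) {
  return value.length <= 3 || /^\s*\S+\s*$/.test(value);
}

// chunks:  [{ type: 'same' | 'add' | 'del', value }]
// returns: [{ type: 'same', value } | { type: 'change', before, after }]
function mergeDiffsSketch(chunks) {
  const regions = [];
  for (let i = 0; i < chunks.length; i++) {
    const chunk = chunks[i];
    const prev = regions[regions.length - 1];
    const inChange = prev !== undefined && prev.type === 'change';
    if (chunk.type === 'same') {
      const next = chunks[i + 1];
      const bridges = inChange && next !== undefined &&
        next.type !== 'same' && isSmallGap(chunk.value);
      if (bridges) {
        // Fold the small unchanged gap into both sides of the change.
        prev.before += chunk.value;
        prev.after += chunk.value;
      } else {
        regions.push({ type: 'same', value: chunk.value });
      }
    } else {
      const region = inChange ? prev : { type: 'change', before: '', after: '' };
      if (!inChange) {
        regions.push(region);
      }
      if (chunk.type === 'del') {
        region.before += chunk.value;
      } else {
        region.after += chunk.value;
      }
    }
  }
  return regions;
}
```

Two template edits separated only by the template name then collapse into one change region instead of two small diffs with a sliver of unchanged text between them.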
Gabriel Wicke 2bc066b42d Up the diff merge size heuristic a bit and always use the same algorithm
Change-Id: I707c8a55ed1758cdd591d2fc95e03a360c8e76d1
2012-06-06 17:46:25 +02:00
Gabriel Wicke bc1a77a812 Make modified newlines visible by replacing empty lines with a space
Change-Id: If7b811245e0d01a7a147ab54c3801fc1754730a9
2012-06-06 17:11:29 +02:00
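
A minimal sketch of the idea in bc1a77a812, assuming a multiline regular expression is used to give empty lines some visible content before diffing (the function name is hypothetical):

```javascript
// Replace each empty line with a single space so that added or removed
// blank lines show up as highlighted content in the rendered diff.
function makeEmptyLinesVisible(text) {
  return text.replace(/^$/gm, ' ');
}
```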
Gabriel Wicke 1876d785a7 Swap ins/del in the diff
Change-Id: Id336d713d1767a4b7859b158f2c2ddf9adc11cfb
2012-06-06 16:02:54 +02:00
Gabriel Wicke 350e700d8f Add core-upgrade
Change-Id: I5ad0955e8272d376f009f89461bed310978b25e4
2012-06-06 15:58:17 +02:00
Gabriel Wicke d0a0454ada Merge "Improve the handling of newlines for round-tripping" 2012-06-06 13:54:04 +00:00
Gabriel Wicke aee35f627d Merge "Update patched html5 library to version 0.3.8" 2012-06-06 13:53:37 +00:00
Gabriel Wicke a146fcb8ad Improve the handling of newlines for round-tripping
An improvement, but there still are some extra newlines inserted after
paragraphs. Example input:

-------

Foo:
{|
|foo
|}
-------

Extra newlines are inserted after the Foo: and the foo in the table. They are
not fed as tokens or text to the tree builder, so there is likely a bug in the
html5 library or JSDom.

Change-Id: I83eb6180e3cd1c4e7f9b15b31d339e1d32bccd3f
2012-06-06 10:17:03 +02:00
Gabriel Wicke 59fc634cce Update patched html5 library to version 0.3.8
Change-Id: I321d9a58ea1af33842a606fc8706938093a8330f
2012-06-06 10:17:03 +02:00