Commit graph

42 commits

Author SHA1 Message Date
Subramanya Sastry 18541f0286 Couple minor bug fixes in serializer
Change-Id: I961e2f4e7609cc6b264eaf494b39497401cdc55c
2012-06-18 11:25:21 -07:00
Subramanya Sastry 9b5404e288 First pass redoing serialization code to handle newline requirements
from Parsoid HTML output as well as VE HTML output.  There are still
some newline-related failures from parser tests that need fixing, but
this is getting close.  So committing for now so other eyes can make the
bugs shallow :).

Change-Id: Ia6a218ee9fb3e18fe0573c89ff3a4236779e1e64
2012-06-18 11:25:21 -07:00
Subramanya Sastry 2271f19ecf Removed newline normalization between paragraphs.
Change-Id: Ifd55db73c8fe2b3e952066a75cba2f8e13c58430
2012-06-18 11:25:21 -07:00
Subramanya Sastry 8fd901850c Fix for href handling.
- Check if href for links has the wgScriptPath prefix before
  attempting to strip it from the href.

Change-Id: I844151ef7317476668d1306b96a2aec5a56fd0f1
2012-06-18 11:25:21 -07:00
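
The prefix check described in 8fd901850c can be sketched roughly as follows. This is a minimal illustration, not the actual Parsoid code: the function name `stripScriptPath` and the example prefix value `/wiki/` are assumptions.

```javascript
// Hypothetical helper: name, signature and the example prefix are assumptions.
function stripScriptPath(href, wgScriptPath) {
    // Strip only when the href really starts with the configured prefix;
    // otherwise return it untouched (e.g. external or already-relative links).
    if (wgScriptPath && href.startsWith(wgScriptPath)) {
        return href.slice(wgScriptPath.length);
    }
    return href;
}

console.log(stripScriptPath('/wiki/Foo', '/wiki/'));
console.log(stripScriptPath('http://example.org/Foo', '/wiki/'));
```

Without the check, blindly removing `wgScriptPath.length` characters would mangle any href that does not carry the prefix.
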
Subramanya Sastry 18de05ba7f Updated newline handling around lists and nested lists.
- Something like this:
    <ul><li>1</li><li>2<ul><li>2.1</li><li>2.2<ul><li>2.2.1</li><li>2.2.2</li></ul></li><li>2.3</li></ul></li><li>3</li></ul>
  now serializes properly to:

    *1
    *2
    **2.1
    **2.2
    ***2.2.1
    ***2.2.2
    **2.3
    *3

  So does this form which is what the above wikitext parses to:
    <ul><li>1
    </li><li>2
    <ul><li>2.1
    </li><li>2.2
    <ul><li>2.2.1
    </li><li>2.2.2
    </li></ul></li><li>2.3
    </li></ul></li><li>3
    </li></ul>

- Lists (and nested lists) are not entirely newline-insensitive.
  They still depend on newlines *between* lists.  The opening
  <ul> tag for non-nested lists should always start on a new line.
  So, for example,
    <ul><li>foo</li></ul><ul><li>bar</li></ul>
  will serialize to:
    *foo
    *bar
  which is incorrect.  But,
    <ul><li>foo</li></ul>
    <ul><li>bar</li></ul>
  will correctly serialize to:
    *foo

    *bar

Change-Id: I13a0290368574865957bcf57aebab488fbbb7026
2012-06-18 11:25:21 -07:00
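
The nested-list output shown in 18de05ba7f can be reproduced with a small recursive sketch. The node shape (`{ tag, children }` with plain strings as text nodes) and the names `el` and `serializeList` are assumptions made for illustration; the real serializer works on tokens and DOM nodes.

```javascript
// Hypothetical tree builder: strings are text nodes, objects are elements.
const el = (tag, ...children) => ({ tag, children });

// Returns one wikitext line per list item, depth encoded in the bullet prefix.
function serializeList(list, prefix = '') {
    const bullet = prefix + (list.tag === 'ol' ? '#' : '*');
    let lines = [];
    for (const li of list.children) {
        if (typeof li === 'string') continue; // whitespace between <li> tags
        let text = '';
        let nested = [];
        for (const child of li.children) {
            if (typeof child === 'string') {
                text += child.trim(); // drop newlines around item text
            } else {
                nested = nested.concat(serializeList(child, bullet));
            }
        }
        lines.push(bullet + text);
        lines = lines.concat(nested);
    }
    return lines;
}

const tree = el('ul',
    el('li', '1'),
    el('li', '2\n', el('ul',
        el('li', '2.1\n'),
        el('li', '2.2\n', el('ul', el('li', '2.2.1\n'), el('li', '2.2.2\n'))),
        el('li', '2.3\n'))),
    el('li', '3\n'));

console.log(serializeList(tree).join('\n'));
```

Because item text is trimmed, both HTML forms quoted in the commit message map to the same eight lines. The sketch does not cover the separate issue of newlines between adjacent top-level lists.
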
Subramanya Sastry 9e5ed592fc Minor code refactoring
Change-Id: Ib7f70a3ac42e3d5a5985e9a9bcffa313bdac289b
2012-06-18 11:25:21 -07:00
Subramanya Sastry 031602f525 Fixed/updated newline handling for <p> tags
- More pieces are now simplified and all(?) newline handling
  is now centralized in the serializeToken function.

- This commit fixes bugs in round-tripping some code snippets
    ----------
    Ex 1: foo<p>bar</p>baz
    ----------

- This commit fixes bugs serializing VE generated html
    ----------
    Ex 2: <p>foo</p><pre>bar</pre> ==> foo\n bar
    ----------

- But, this round of fixes introduces RT failures for certain
  code examples in parserTests.txt.  In all these failing cases,
  inline text/html is embedded within a generated <p> tag during
  parsing.  If these generated <p> tags can have a "gc:1" attribute
  added to them, we can properly serialize them to the original
  form.
    ----------
    Ex 3: foo<pre>bar</pre>
          Parsed HTML: <p>foo</p><pre>bar</pre>
    ----------
  Note how this parsed HTML is identical to what the VE outputs
  in Example 2 above.  So, without the gc:1 attribute, we now
  have conflicting requirements on the exact same HTML.
  That conflict is why the gc:1 marker is needed to tell the two cases apart.

Change-Id: I86beadec91c445a7f8a6d36a639b406697daa0a2
2012-06-18 11:25:21 -07:00
Subramanya Sastry 7c5a0f680f Refix <pre> serialization.
- Effectively reverted fix from f882a65153
  and added a new fix.

Change-Id: I8b81e26525a5f1a22acaf2c7067f2dcd9b962818
2012-06-18 11:25:21 -07:00
Subramanya Sastry f745633797 Improved, simplified newline handling in wikitext serializer.
- Eliminated newline handling from several places in code and
  mostly isolated it to serializeToken thus simplifying newline
  handling logic.
- Fixing some bugs in the process: # of green roundtrip tests
  went up by 5 (294 --> 299) but actually introduced failures on
  a few originally succeeding tests (additional leading/trailing
  newlines on the entire test output).
- Added bonus: made list serializing (mostly) insensitive to
  newlines between tags.  So, all of the following DOM forms
  serialize to the same wikitext:

  *foo
  *bar

  ----------
  <ul><li>foo</li><li>bar</li></ul>

  ----------
  <ul>
  <li>foo</li>
  <li>bar</li>
  </ul>
  ----------
  <ul>

  <li>
  foo

  </li>

  <li>
  bar</li>

  </ul>
  ----------

Change-Id: I76be56c4b2789039dff5f47de4659746882e45d6
2012-06-18 11:25:20 -07:00
Subramanya Sastry bf0f5d1b7e Minor code cleanup
Change-Id: Ic5d99b6c483841310b0c295c1c30246f907455b4
2012-06-13 13:47:26 -05:00
Subramanya Sastry 23ec054013 Fixed round-tripping of interwiki links.
Change-Id: If0427b9865b3e9cf8c0ad0b4efaebc9f9f7fb865
2012-06-13 13:39:18 -05:00
Subramanya Sastry 445780b4d3 Revert default tokenization result from null to ''
* As part of an earlier fix, I had changed default value of 'res'
  to null instead of ''.  But, this was potentially buggy because
  the previous check was (res !== '') which could be triggered
  by return values of handlers.  By changing the check to null,
  I was effectively changing the code paths for those handlers that
  returned ''.

Change-Id: I2302023be7422ce4fb384ff5a50fe53fa7732855
2012-06-13 11:53:05 -05:00
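
The code-path difference described in 445780b4d3 can be illustrated with two tiny functions. Both function names and the 'emit'/'skip' return values are hypothetical; only the `res !== ''` versus `res !== null` checks come from the commit message.

```javascript
// Default '' with the original check: a handler returning '' looks like "no result".
function withEmptyDefault(handler) {
    let res = '';
    if (handler) res = handler();
    return res !== '' ? 'emit' : 'skip';
}

// Default null with a null check: a handler returning '' now counts as a result.
function withNullDefault(handler) {
    let res = null;
    if (handler) res = handler();
    return res !== null ? 'emit' : 'skip';
}

const returnsEmpty = () => '';
console.log(withEmptyDefault(returnsEmpty)); // skip
console.log(withNullDefault(returnsEmpty));  // emit
```

The same handler takes a different branch under each scheme, which is the behaviour change the revert undoes.
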
Subramanya Sastry f882a65153 Fix serialization of <pre> tags
Change-Id: I7ae95e7ec06167d0c1bfdaba3d0c67d941043299
2012-06-12 13:54:35 -05:00
Subramanya Sastry 727c2119bb Refactored serializeToken method and added special-case handling of
paragraphs in lists.

* We need to look at other special-case handling requirements of
  html tags in lists (and other contexts like tables).

Change-Id: I84b8402d90a186c9075c2d45263c94377312927a
2012-06-11 17:55:41 -05:00
Gabriel Wicke dc3168cf6d A few tweaks to link round-tripping
* Moved the tail attribute to the second attribute (a bit cleaner)
* Disallowed newlines in the tail production
* Improved the selection of round-tripped href vs. generated content vs. href
  in the serializer
* renamed state.linkTail to state.dropTail

Change-Id: I5d98c704b6ea566011e22237786f8da17548570f
2012-06-05 17:26:27 +02:00
Gabriel Wicke cc96ff4f5e Very basic interwiki support
Page titles with a Wikipedia interwiki prefix now load the page from the
corresponding Wikipedia. Links in a page then stay within the given language.

Note that Parsoid currently makes no effort to recognize localized namespaces,
so it won't render media files, categories etc correctly.

Change-Id: I7bc4102e81a402772ea23231170734d580ea15b9
2012-06-05 11:19:58 +02:00
Gabriel Wicke 92f753a365 Pre and link target improvements
* Don't explicitly add the newline in the pre, as we preserve newline tokens
  now. This avoids doubling of newlines when round-tripping.
* Use the sHref attribute even if the href contains spaces.

Change-Id: I8bec8fbfd6a7836bf2e5eec20869a0edd95c93b6
2012-06-04 14:03:05 +02:00
Gabriel Wicke f821eac102 Optionally round-trip sHref in data-mw
If the href would not denormalize, add a copy of the original href in data-mw
and use it to preserve non-conventional capitalization etc.

Change-Id: Ifef50eec7343b0e6b0ba66b6d19a8a3e8c9f8001
2012-06-04 12:28:05 +02:00
Gabriel Wicke 2774e5aa6c Actually replace all underscores in wikilink target
Change-Id: I633f8d6e4f639aff90fd456600376b7c6515fd50
2012-06-04 11:48:59 +02:00
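
A common way to end up replacing only the first underscore is passing a string pattern to `String.prototype.replace`. Whether that was the exact bug fixed in 2774e5aa6c is an assumption, as are the function names and the choice of a space as the replacement; the sketch only shows the difference in behaviour.

```javascript
// A string pattern replaces the first match only.
function replaceFirstUnderscore(target) {
    return target.replace('_', ' ');
}

// A regular expression with the g flag replaces every match.
function replaceAllUnderscores(target) {
    return target.replace(/_/g, ' ');
}

const target = 'Main_Page_(disambiguation)';
console.log(replaceFirstUnderscore(target)); // Main Page_(disambiguation)
console.log(replaceAllUnderscores(target));  // Main Page (disambiguation)
```
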
Gabriel Wicke 4533c274ca Fix a crasher in the serializer
A tail containing regexp syntax (a ? in [[:en:Main Page]]) would crash the
serializer. Use substr instead.

Change-Id: I8519aec9c07dfe31893d676b1c936a42d2af74a0
2012-06-04 00:00:54 +02:00
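
The crash described in 4533c274ca is the kind that occurs when untrusted text is interpolated into a regular expression. The sketch below assumes the tail was being stripped from the end of a string; both function names are hypothetical, and `slice` stands in for the `substr` call mentioned in the commit.

```javascript
// Fragile: the tail is interpreted as regexp syntax, so '?' is a dangling quantifier.
function stripTailWithRegExp(text, tail) {
    return text.replace(new RegExp(tail + '$'), '');
}

// Robust: plain string comparison and slicing, with no pattern compilation.
function stripTailWithSlice(text, tail) {
    if (tail && text.endsWith(tail)) {
        return text.slice(0, text.length - tail.length);
    }
    return text;
}

try {
    stripTailWithRegExp('Main Page?', '?');
} catch (e) {
    console.log('regexp version threw: ' + e.name);
}
console.log(stripTailWithSlice('Main Page?', '?'));
```

Escaping the tail before building the regexp would also work, but string slicing avoids the problem entirely.
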
Gabriel Wicke 63abd57fc8 Improve newline-before-paragraph round-tripping support
Change-Id: I9176a97f9695018650d9a63b89514c07e0d6be90
2012-06-02 16:39:33 +02:00
Subramanya Sastry 8f216af2f5 Handle link tails properly.
- Added a tail json attribute for wikiLinks
- During serialization, this attribute is used to strip the tail from
  the link target and render it after the link

  [[hen]]s ==> <a ... data-mw="{gc:1, tail: 's'}" ...>hens</a>
           ==> [[hen]]s

- 2 more roundtrip tests green

Change-Id: I84f3dabaf0271f7a67641a00148467daa8310eb0
2012-06-01 23:41:10 -05:00
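
The `[[hen]]s` round trip from 8f216af2f5 can be sketched as below. The `link` object shape (`target`, `text`, `tail`) and the function name are assumptions for illustration; in the commit, the tail lives in the link's data-mw JSON.

```javascript
// Hypothetical input shape: { target, text, tail } taken from the rendered <a>.
function serializeWikiLink(link) {
    let text = link.text;
    let tail = '';
    // Move the recorded tail out of the link text so it can follow the brackets.
    if (link.tail && text.endsWith(link.tail)) {
        tail = link.tail;
        text = text.slice(0, text.length - tail.length);
    }
    // Omit the piped text when it matches the target.
    const inner = text === link.target ? link.target : link.target + '|' + text;
    return '[[' + inner + ']]' + tail;
}

console.log(serializeWikiLink({ target: 'hen', text: 'hens', tail: 's' })); // [[hen]]s
console.log(serializeWikiLink({ target: 'hen', text: 'chicken' }));         // [[hen|chicken]]
```
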
Subramanya Sastry 413fc5e043 Fixed bug serializing wikilinks with implicit link text.
* Simple fix but greens 10 more roundtrip tests.

Change-Id: I7f82d788a10bd83e0e3215568c2168081c332c50
2012-06-01 17:25:21 -05:00
Gabriel Wicke 36084c5d93 Preserve original newlines in HTML and serialization
254 round-trip tests (up from 184) are now passing.

Also:
* tweaked runtests.sh slightly (use less -R instead of -r).
* made sure the EOFTk is preserved in phase 3 transforms

Change-Id: I1de22186bdb78e52019370e43f096877005b8f5a
2012-05-29 23:29:03 +02:00
Gabriel Wicke b2adee0ae7 Basic rt support for indent pre variant
* Added a generic stx_v 'syntax variant' round-trip attribute
* For pre, use stx:'html' vs. no syntax annotation. This might not be 100%
  safe for arbitrary html input, so we might want to flip this to stx:'wiki'
  later.
* 181 round-trip tests passing

Change-Id: If6080917a3a7c069066db3db60efe59b1f6c28d8
2012-05-25 18:55:38 +02:00
Gabriel Wicke 06b51b1f3f Properly round-trip dd/dt; 178 round-trip tests passing.
Need to track variable whitespace before elements to make some more tests
pass.

Change-Id: Ia86535d6f352e2ffe7965547cd506b0dbb6dfba2
2012-05-25 13:59:55 +02:00
Gabriel Wicke d6af3b3375 Improve the serializer and its output display in the web service
Change-Id: Id3ca96846cad42517d7d4bada8f4bb250d54247b
2012-05-23 17:50:35 +02:00
Gabriel Wicke 95496c02db Add an extra newline before headings, and ignore favicon.ico requests
Change-Id: Ibacac3453afefa5dbe803c1e0260e8c943785f12
2012-05-23 17:17:54 +02:00
Gabriel Wicke febb912ead No end delimiter after template row attributes
Change-Id: Iba304fb797d221e2d65ae055d266bff2f6301df8
2012-05-23 09:30:07 +02:00
Gabriel Wicke 39c6f42879 Link round-tripping and other improvements
* Changed RDFa for links according to
  http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary
* Added basic support for internal/external link serialization
* Moved numbering of external links from tokenizer to LinkHandler
* Added round-tripping for generic HTML tags
* Replaced nowiki tag with <meta typeOf="mw:tag" content="nowiki"> and <meta
  typeOf="mw:tag" content="/nowiki"> for now.
* 154 round-trip tests passing (node parserTests.js --roundtrip).

Change-Id: I16c4db21b1b543ee57c73e569c83025b64664542
2012-05-22 13:36:06 +02:00
Gabriel Wicke 7e21b7380a Merge "Round-trip nowiki" 2012-05-21 17:16:56 +00:00
Gabriel Wicke fb7d5418a5 Round-trip nowiki
Change-Id: I5f7e6a43f5fdc1708ee710b2a601b20db733452c
2012-05-21 18:06:09 +02:00
Gabriel Wicke a6610e52c2 Serializer and table round-tripping improvements
* added stx: 'html' round-trip information for html tags
* added t_stx: 'row' info for row-wise table wiki syntax, and support for it
  in the serializer
* the first table row is implicit in wikitext
* renamed lastToken to prevToken in serializer
* strip first newline in an initial chunkCB

Change-Id: I014b046539d1b674d830551c5fd1b74a67f81993
2012-05-21 14:59:53 +02:00
Gabriel Wicke e069e7cb1c Merge "Support table captions and properly delimit the end of table options" 2012-05-21 12:51:58 +00:00
Gabriel Wicke 54e75b93b7 Support table captions and properly delimit the end of table options
Change-Id: I15eb8df19528cfceadfee368370501b30f0e36a0
2012-05-18 10:46:43 +02:00
Gabriel Wicke c39eb36968 Use outerHTML to serialize unhandled DOM node in serializer
Change-Id: I37350712c9450c34025740a8d6de51344739c2b7
2012-05-18 10:03:16 +02:00
Gabriel Wicke 3c6d829708 Fix first bug caught by new roundtrip mode for parserTests
Change-Id: Id152fd29606d8ee34ac300945f41e2a5f48f087f
2012-05-18 09:55:22 +02:00
Subramanya Sastry ae4810b201 Renamed items to itemCount for better code readability.
Change-Id: I53851c07a4746928fddec4b3737136f081d49178
2012-05-17 12:32:46 -05:00
Subramanya Sastry 58da03bc85 Track list prefixes in the list start handler and use them to output
serialized text in list item handlers.

Change-Id: Ic7562d531d2313bedcf3b7450b4f28f02bc2b5a3
2012-05-17 12:12:46 -05:00
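
The prefix tracking described in 58da03bc85 can be pictured as a stack that list start handlers push to and list item handlers read from. The token shape (`{ type, name, value }`) and all names below are assumptions for illustration, not the serializer's actual data structures.

```javascript
const BULLETS = { ul: '*', ol: '#' };

function serializeTokens(tokens) {
    const prefixes = []; // one entry per open list, innermost last
    let out = '';
    for (const tok of tokens) {
        if (tok.type === 'start' && BULLETS[tok.name]) {
            // List start: extend the enclosing prefix with this list's bullet.
            const parent = prefixes.length ? prefixes[prefixes.length - 1] : '';
            prefixes.push(parent + BULLETS[tok.name]);
        } else if (tok.type === 'end' && BULLETS[tok.name]) {
            prefixes.pop();
        } else if (tok.type === 'start' && tok.name === 'li') {
            // List item: begin a new line with the current prefix.
            if (out !== '') out += '\n';
            out += prefixes[prefixes.length - 1] || '';
        } else if (tok.type === 'text') {
            out += tok.value;
        }
    }
    return out;
}

const start = (name) => ({ type: 'start', name });
const end = (name) => ({ type: 'end', name });
const text = (value) => ({ type: 'text', value });

const tokens = [
    start('ul'),
    start('li'), text('a'),
    start('ol'), start('li'), text('b'), end('li'), end('ol'),
    end('li'),
    start('li'), text('c'), end('li'),
    end('ul'),
];
console.log(serializeTokens(tokens));
```

Keeping the full prefix on the stack means an item handler never has to walk up the list nesting to work out its bullet string.
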
Gabriel Wicke e2815b516c Start to handle links
Change-Id: I1fb975910651820fd889d77152562fd4fbcb5db8
2012-05-17 14:32:56 +02:00
Gabriel Wicke b7fd4498a9 Use single _serializeToken handler for both DOM and tokens
Change-Id: I45e1d90b53a5ddc678f7744f27274bebcfc375fe
2012-05-17 13:20:39 +02:00
Gabriel Wicke 8dbc2f573f Simplistic wikitext round-tripping with parse.js --wikitext
Lists are a bit tricky, as nested lists are not wrapped in a separate list
item. Should work now though.

Change-Id: I2e5f29f6afa6bdd2d5e5c0c5d019b70c611b73d1
2012-05-17 12:44:46 +02:00