Commit graph

81 commits

Author SHA1 Message Date
Gabriel Wicke 1e902fc050 Merge "Encapsulate token collection" 2012-07-13 21:08:52 +00:00
Gabriel Wicke e329455d55 Encapsulate token collection
* Arbitrary predicate support for the termination of collection mode
* tokens as property of the collector instead of a state-global thing

Change-Id: Ibcb342bc64a76fece9b04a760ea56c7878e67cad
2012-07-13 13:57:04 -07:00
GWicke 97bc6cd5d7 Merge "Serializer fixes" 2012-07-13 20:36:36 +00:00
Subramanya Sastry f4c6ba8545 Serializer fixes
* Fixed image serializer to deal with missing 'v' value in a k-v pair
  representing an image attribute.
* Added fix to deal with bare <li>'s (without surrounding <ul> tags)

NOTE: The second fix is required currently to deal with bugs in the parser
as it deals with complex cases.  But, in the future, we could deal with
this in one of the following ways:
(a) The serializer expects a well-formed DOM and all cleanup will be
    done as part of external tools/passes.
(b) The serializer supports a small set of exceptional cases and bare
    list items could be one of them
(c) The serializer ought to handle any DOM that is thrown at it.

Yet to be resolved.

Change-Id: Ib585e5c9f2a8a80854740ce0211bde705f9fd6f4
2012-07-13 15:33:09 -05:00
GWicke a742ec5ffc Merge "Fixed parser and serializer to deal with a 4+ length dash sequence." 2012-07-13 20:14:34 +00:00
Subramanya Sastry 49ed0d3adf Fixed parser and serializer to deal with a 4+ length dash sequence.
Change-Id: If7caaefec1ad55e7604712ef959ff0c843392adf
2012-07-13 15:12:09 -05:00
Subramanya Sastry e529ae7e0e Serializer fix for empty headings (BUG-33089)
Change-Id: Ia7b018335ac9e31938052473fc47ce38443fdeb4
2012-07-05 16:50:48 -05:00
Subramanya Sastry 166e7a75c9 Fix for Bug 37913
* Strips the first paragraph tag in a list item or table cell context
  if there are no attributes on it and stx:html is not set

Change-Id: I74988645fe505c662f86488e32d0f11d464ffe41
2012-06-29 23:47:59 -05:00
Gabriel Wicke 9ddc863d89 Up entity name length limit even further
There are some really long names in
http://www.w3.org/2003/entities/2007xml/unicode.xml

Change-Id: I0138c9610bb288cd8f29e3600b8a21f932e7bcd9
2012-06-29 23:38:10 +02:00
Gabriel Wicke cf7f437966 Match named entities with up to eight chars
The longest entries in
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references.

Change-Id: I2c9f102fe6a905e339e12520d08c1b1b0a4002d8
2012-06-29 23:15:30 +02:00
Gabriel Wicke 370fb607c8 Insert separation between adjacent pres
Change-Id: I55aa649b4e076cae32b3c970d6384ab2ed4cdd6c
2012-06-29 23:05:06 +02:00
Gabriel Wicke 6c8dfa26fa Escape ampersands in entities from plain text DOM content
Change-Id: I0826077cf48b67e38a525090be66411c38d7b65f
2012-06-29 23:02:21 +02:00
Subramanya Sastry 5874d9a5f1 More thumb roundtripping fixes.
* Looks like I misled myself in commit 88fc91 -- that wikitext
  roundtripped perfectly because it went through the 'src' route
  because it was a thumbnail with an explicit image which doesn't
  go through renderThumb -- so, the serializer simply spit out the
  original 'src' string and hence perfect rt :).
* More whitespace preserving fixes in LinkHandler.
* Also changed resource value in the img tag to use the original
  filename rather than the normalized capitalized filename.
* 2 more parsertests rt -- now upto 400.

Change-Id: I144a6486dd9d07da8a74a68700fe96c78d192826
2012-06-29 00:30:13 -05:00
Subramanya Sastry 88fc91a292 Next round of image roundtripping fixes
* Changed PrefixImageOptions so that thumb and thumbnail are
  distinct key-value pairs.  Without this fix, cannot distinguish
  between thumb=foo.jpg and thumbnail=foo.jpg
* Fixed link handler so whitespace is preserved around prefixed image
  options.
* Fixed figure handler to process the 3 different kind of image options:
  size, simple image options, and prefixed image options.
* There is a hack/fixme for "upright: aspect" prefixed image option
  which needs to be looked into.
* Still need to fix uppercasing of the image resource name.

With these fixes, the following wikitext roundtrips perfectly
(after newline breaks are removed)

[[Image:Foo.jpg|thumbnail = 'baby.jpg'|100x100px|center| alt =bbbbb|
upright=true|bottom|link='http://foo.bar'|
This is a [[Linked Caption]] in the image]]

Change-Id: I6606df56874c2b97f00f08cb6bbeaec9878167d3
2012-06-28 18:55:47 -05:00
Gabriel Wicke ff414ad825 Add generic source round-trip mode, and use it for plain images (for now)
Anything with data-gen="both" and dataAttribs.src defined serializes to
dataAttribs.src and drops its contents (if any). We can use this to round-trip
elements we don't properly parse or serialize yet. Without RDFa info, the
editor will not touch the contents after encountering data-gen="both".

Change-Id: Ia39e5fdd765c2c9b36f26313455685d29f118839
2012-06-28 17:44:26 +02:00
Gabriel Wicke 4f94492f08 Merge changes I27bdc9c5,Ic09972bb
* changes:
  Update nowiki handling to latest spec; some fixes to it
  Default to two preceding newlines for headings for better readabilty
2012-06-28 14:02:00 +00:00
Gabriel Wicke 198e55a32b Update nowiki handling to latest spec; some fixes to it
346 round-trip tests are passing now (up from 343).

Change-Id: I27bdc9c5e010a13c2b4dddc6f263cbf9d3adac36
2012-06-28 14:57:05 +02:00
Gabriel Wicke 5b4cb03ee4 Default to two preceding newlines for headings for better readabilty
This only applies to newly created headings, so headings with a single newline
preceding them will be round-tripped that way.

Change-Id: Ic09972bbd25c3934b53f6fd3b5be5a0c3185c2af
2012-06-28 12:42:19 +02:00
Subramanya Sastry f995fc025a First pass serializing image thumbs.
* Collect all figure tokens and process them as a chunk
* This effectively mimics context-sensitive DOM walking,
  but since we need serialization supported on a token stream,
  we cannot use real DOM walking.  The current technique should
  also work on a token stream.
* There is a FIXME about the image filename being capitalized.
  This needs fixing in the parser or some other way of recognizing
  original unnormalized filenam.

Amended by gwicke:
* Build option list and join it with pipe to avoid stray trailing pipe
* Satisfy JSHint's weird preference to have '&&' and '||' at the end of the line

Change-Id: I1e5f6600f297fcdf81e3227a82ca3b71d4e97fc3
2012-06-28 11:29:10 +02:00
Gabriel Wicke 424a246b00 Rename data-mw-gc to data-gen. Credit to James!
Change-Id: Iacbe20b355ddf5f12fffb71ff4dd978ac4364928
2012-06-27 19:08:14 +02:00
Gabriel Wicke c02218c736 Merge changes Idfa5d6a8,I700142a5
* changes:
  Represent nowiki as span instead of meta
  Round-trip html entities and introduce data-mw-gc attribute
2012-06-27 16:07:48 +00:00
Subramanya Sastry 4d2a46fb44 Code cleanup and more newline fixes.
* Removed dead commented out code.
* Cleaned up newline handling in serializer some more.
* Now, onNewLine and onStartOfLine reflect serializer state
  more accurately.
* No implicit new lines for explicit html tags.
* 9 more roundtrip tests now green.

Change-Id: I9f640de2ae769c7472538fa687400dc8a40c2b2d
2012-06-27 15:23:22 +02:00
Gabriel Wicke 7108ee985a Represent nowiki as span instead of meta
Change-Id: Idfa5d6a8ee7b2d17205779361ca69d075a79964d
2012-06-27 13:59:14 +02:00
Gabriel Wicke 0b9a420129 Round-trip html entities and introduce data-mw-gc attribute
297 round-trip tests are passing with this patch.

TODO:
* generalize data-mw-gc handling in the serializer for any tag
* use data-mw-gc="both" and data-mw.src: 'the wikitext' for round-tripping of
  wikitext structures, optionally with some presentational (but read-only)
  content
* use span and data-mw-gc="both" for nowiki

Change-Id: I700142a56818977c20c8c06e6a5f2e77a708d25e
2012-06-27 12:52:52 +02:00
Gabriel Wicke 08b5ed1a43 Use _inNewlineContext method instead of bare onNewline
This makes sure that we escape start-of-line syntax when needed, since
onNewline is often not yet set.

Discussion / background:
[19:18] <subbu> this will fix it, but, i think this is asking for another
minor refactoring of these flags ... because this is a subtle fix which means
it might be possible to make it clearer.  onNewline is one true in on
direction, i.e. if true, we are in a new line state, but if we are in a
newline context, onNewline is not true, which is why this new method is
needed.
[19:19] <subbu> i dont know if it is possible, but it seems like it shoudl be
possible.  but, something for later.
[19:20] <subbu> badly phraed.  "onNewline" ==> in new line context, but if in
new line context, onNewline may be false.
[19:20] <gwicke> we should perhaps update it as early as possible instead
[19:21] <subbu> i cannot today, but possible monday.  i am heading out in
about 15-30 mins.
[19:22] <gwicke> will need to check all conditions depending on it in
_serializeToken
[19:22] <subbu> oh, i misunderstood you :)
[19:22] <gwicke> and if there are cases where the onNewline / onStartOfLine
state could be reverted later
[19:23] <subbu> you were referring to the flag, i thought you meant we should
fix this sooner than later.
[19:23] <gwicke> yes, I wasn't terribly clear
[19:23] <gwicke> you wrote something about following productions swallowing
newlines, but I think we don't actually do that any more
[19:24] <gwicke> I'm quite optimistic that updating those flags much earlier
would work
[19:25] <subbu> yes, it could fix it.
[19:26] <subbu> you might be right reg. swallowing.  it was happening earlier.
but, not right now, after single-line mode and other fixes.

Change-Id: Ic1d8141c04eb54a59977d0ba87bcf06bafd421e0
2012-06-23 19:27:56 +02:00
Gabriel Wicke d4dc8d86d9 Entity-escape [<>] in text content
This should not really be needed if the tokenizer did not decode html entities
on the fly. It is still a quick way to make sure no htmlish content can be
inserted even with the current decoding.

The next step and proper fix is to make entity decoding either optional in the
tokenizer (flag-controlled), or move it to a later stage in the token
processing pipeline.

Change-Id: Ife093dcfb95113763dab5635b098c795d3550586
2012-06-23 17:06:10 +02:00
Subramanya Sastry 5f584909e1 Added documentation + minor code refactoring
* Renamed defaultOptions to initialState
* Got rid of unused state property
* Added comments explaining how state attributes
  and tag handler flags are used
* Refactored listItemHandler check into functions and
  added FIXME possible rewriting of that check.
* Protected serializeDOM in a try-catch handler to
  catch exceptions and output the exception to the console.

Change-Id: I3d351c06e4b86baeb5a55243b11dbfa9baca5bb7
2012-06-22 18:29:46 -05:00
Gabriel Wicke b3bd2ffe8d Fix definition list parsing and round-trip single vs. multi-line dt/dd
* Removed murky ' :' -> '&nbsp;:' replacement in tokenizer. This breaks four
  parser tests, and should be fixed in a token stream transformer or DOM
  postprocessor. This replacement clashes with round-tripping, and is not
  terribly important visually.
* Added stx:row annotation to single-line dt/dd pairs and use it to preserve
  single-line syntax in the serializer. There is no attempt yet to support the
  addition of nested lists in an originally single-line dd. We'd need to look
  ahead in the serializer to support this. Perhaps the editor can simply drop
  data-mw in that case.
* Switched default dt/dd serialization to multi-line. This supports all nested
  lists and multiple dds.
* Don't close dls when switching from dt to dd or back in the token stream
  ListHandler.

Overall 290 round-trip tests are passing now (up from 284, some due to &nbsp;,
some due to lists). The number of passing parser tests dropped slightly from
303 to 297 (or 301/295 on weekdays other than Thursday).

Change-Id: I85ff40571833713388c6523e6a4ba2e94daa3807
2012-06-21 17:34:25 +02:00
Gabriel Wicke e584e35ecb Improve nested definition list serialization
Basically only prefix all bullets if the serialization output is going to be
in start-of-line context. The test for that is currently inline, but should
perhaps be factored out to a method or state flag instead.

We could alternatively consider to return the start-of-line prefix and let it
be used in _serializeToken in case we end up in start-of-line context.

This patch also fixes a newline issue on input like this:

:d1
::: d3

Both the list and list item handlers now set the startsNewline flag
dynamically depending on the context, so that we don't depend on the
suppression of newlines from list syntax by the singleLineMode any more.

There is still an extra newline inserted between list items in the following
example:

;t1 :d1
;;t2 ::d2

This looks like a bug in the produced DOM and not in the serializer, since the
outer definition list is closed and re-opened between d1 and t2.

Change-Id: I78e3a1ef34cf9159d5a1e86fb64c774ff111e71d
2012-06-21 15:28:43 +02:00
Gabriel Wicke ab286d6a59 Empty elements only use the start handler info
Thus move the 'endsLine' attribute to the start section.

Change-Id: I8490d866b84aa99205ca9e8e3ee137026fb18501
2012-06-21 10:30:11 +02:00
Gabriel Wicke cf32b34b0a First attempt at the definition list bug (work in progress)
The main issue is that the bullets from dd/dt were not stored on the stack. I
added a separate field for it in each stack entry, which now fixes the basic
indent case without (afaik) breaking anything else.

There are still some newline issues, and the need to handle the single-line
dd/dt vs. the multi-line variant.

Change-Id: I65939c05e2c5dde0789bf8aefd7651161a2f137c
2012-06-20 23:51:39 +02:00
Gabriel Wicke 344fac19b5 Improve preformatted text handling
* Don't escape html-syntax pre content for now; Should parse this with a new
  pre content production later (which needs to be split out of the regular pre
  production in the tokenizer)
* Protect indent-pre content from start-of-line syntax escaping
* Preserve extra leading spaces in the tokenizer
* Two more (now 284) round-trip tests are passing

Change-Id: I199b89c0ee7fae12546df10c1b5117c97caccac5
2012-06-20 19:28:34 +02:00
Gabriel Wicke 6054a4aa14 Clean up serializer newline handling a bit further
Queued newlines and new trailing newlines were not cleanly separated so far,
which caused some trailing newlines to be consumed for needed leading
newlines. This change fixes several newline bugs, taking the number of passing
round-trip tests from 276 back up to 282.

Change-Id: Idb4706e15ce71e63085033e3f3f29557915c11a8
2012-06-20 16:31:39 +02:00
Gabriel Wicke c9d3db8f34 Fix a few round-tripping and list issues
At least partly fixes some bugs in
http://www.mediawiki.org/wiki/Parsoid/Bug_test_cases. 276 round-trip tests are
passing.

* Fixes
  http://www.mediawiki.org/wiki/Parsoid/Bug_test_cases#extra_newline_after_empty_dd,
  except for lost newline in 'working' example before next heading
* Fixes newlines in definition lists
  (http://www.mediawiki.org/wiki/Parsoid/Bug_test_cases#dd_indentation etc),
  but does not fix missing / incorrect bullets for those

Change-Id: I21f66e265e43e1d1a4c7da70984a9984b8e6d0dd
2012-06-20 13:53:47 +02:00
Gabriel Wicke b94cad47dc Fix single-line mode for nested lists
Known issue: breaks round-tripping of :;;;::. That test is normally disabled
anyway, so we can fix it later.

Change-Id: I7954271311bfb7e71caae59d8177e3f04a9ebbca
2012-06-20 01:48:52 +02:00
Gabriel Wicke 33dc9abb0d Clean up sHref handling a bit
* sHref is now always a string
* fixes crasher when sHref is not set

Change-Id: If5756948ac6bc26c2d7c04d970b5aba5331cb8bb
2012-06-20 00:34:57 +02:00
Gabriel Wicke e117f09362 Wikitext escaping and quite complete source range tracking
* Started to add more complete tag source range (tsr) annotations to most
  start / empty tags. These replace the old sourcePos and sourceTagPos
  annotations, and look more promising for general round-tripping than block
  source ranges (bsr). See
  http://www.mediawiki.org/wiki/User:GWicke/Parsoid_source_ranges for some
  notes on this.
* Added an escapeWikitext method in the serializer that tokenizes supposedly
  text-only content from the DOM with the tokenizer and wraps runs of returned
  non-text tokens into nowiki tags. The source corresponding to non-text
  tokens is retrieved using the tsr annotations.
* Removed old (unused) table productions to avoid confusion.
* 276 round-trip tests are passing, vs. 283 without escaping.

Known issues:
* harmless for now, can be improved later: urllinks in external link captions
  are wrapped in nowiki. Example HTML:

<a rel='mw:extLink' href="http://example.com">http://example2.com</a>

* some start-of-line syntax in wiki-syntax preformatted blocks might be
  wrapped into nowiki when that would not really be needed. Example HTML DOM:

<pre>
* foo
* bar
</pre>

Change-Id: I01c34aedd5c566614d36924add47a6a960e91987
2012-06-19 23:36:44 +02:00
Gabriel Wicke 5fbc80321b Improve newline handling for comments and nowiki/noinclude tags
* Added a newlineTransparent flag to handlers that prevents changes to the
  onNewline status, so that content following it is still considered to be in
  start-of-line context. This fixes a few rt tests where a comment or nowiki
  tag is at the start of the line, and following content should end up on the
  same line.
* 283 rt parser tests are now passing.

Change-Id: Ie58dcb9e5e9af9000fff61c2e1db5d8649ffc3f6
2012-06-18 22:56:41 +02:00
Gabriel Wicke 97fb2d3c0d Serializer refactoring
* tokens are not modified any more (they are supposed to be immutable)
* handler info is now split in start / end objects and potentially a 'make'
  method; added more flags to govern the newline behavior of different tags
* added a generic singleLine mode for single-line syntactical environments
* switched the web service to line-based diffs to avoid issues when diffing
  the round-trip results of [[:en:Programming language]]
* 280 round-trip tests are passing now

Change-Id: I74b8ffbf69643c5d6e5ec852ec58e680c9018901
2012-06-18 21:52:15 +02:00
Subramanya Sastry f1d03f325e Couple minor bug fixes in serializer
Change-Id: I961e2f4e7609cc6b264eaf494b39497401cdc55c
2012-06-17 22:41:14 -05:00
Subramanya Sastry a229f72833 First pass redoing serialization code to handle newline requirements
from Parsoid HTML output as well as VE HTML output.  There are still
some newline related failures from parser tests that needs fixing, but
this is getting close.  So committing for now so other eyes can make the
bugs shallow :).

Change-Id: Ia6a218ee9fb3e18fe0573c89ff3a4236779e1e64
2012-06-16 10:09:06 -05:00
Subramanya Sastry 3f92f39397 Removed newline normalization between paragraphs.
Change-Id: Ifd55db73c8fe2b3e952066a75cba2f8e13c58430
2012-06-14 18:51:56 -05:00
Subramanya Sastry 54f12d1807 Fix for href handling.
- Check if href for links has the wgScriptPath prefix before
  attempting to strip it from the href.

Change-Id: I844151ef7317476668d1306b96a2aec5a56fd0f1
2012-06-14 18:35:22 -05:00
Subramanya Sastry c0fc9e9a97 Updated newline handling around lists and nested lists.
- Something like this:
    <ul><li>1</li><li>2<ul><li>2.1</li><li>2.2<ul><li>2.2.1</li><li>2.2.2</li></ul></li><li>2.3</li></ul></li><li>3</li></ul>
  now serializes properly to:

    *1
    *2
    **2.1
    **2.2
    ***2.2.1
    ***2.2.2
    **2.3
    *3

  So does this form which is what the above wikitext parses to:
    <ul><li>1
    </li><li>2
    <ul><li>2.1
    </li><li>2.2
    <ul><li>2.2.1
    </li><li>2.2.2
    </li></ul></li><li>2.3
    </li></ul></li><li>3
    </li></ul>

- Lists (and nested lists) are not entirely newline-insensitive.
  They still depend on newlines *between* lists.  The opening
  <ul> tag for non-nested lists should always start on a new line.
  So, for example,
    <ul><li>foo</li></ul><ul><li>bar</li></ul>
  will serialize to:
    *foo
    *bar
  which is incorrect.  But,
    <ul><li>foo</li></ul>
    <ul><li>bar</li></ul>
  will correctly serialize to:
    *foo

    *bar

Change-Id: I13a0290368574865957bcf57aebab488fbbb7026
2012-06-14 17:09:59 -05:00
Subramanya Sastry 8978e406fc Minor code refactoring
Change-Id: Ib7f70a3ac42e3d5a5985e9a9bcffa313bdac289b
2012-06-14 15:18:53 -05:00
Subramanya Sastry d7e83c4e2b Fixed/updated newline handling for <p> tags
- More pieces are now simplified and all(?) newline handling
  is now centralized in the serializeToken function.

- This commit fixes bugs in rt-ing some code snippets
    ----------
    Ex 1: foo<p>bar</p>baz
    ----------

- This commit fixes bugs serializing VE generated html
    ----------
    Ex 2: <p>foo</p><pre>bar</pre> ==> foo\n bar
    ----------

- But, this round of fixes introduces RT failures for certain
  code examples in parserTests.txt.  In all these failing cases,
  inline text/html is embedded within a generated <p> tag during
  parsing.  If these generated <p> tags can have a "gc:1" attribute
  added to them, we can properly serialize them to the original
  form.
    ----------
    Ex 3: foo<pre>bar</pre>
          Parsed HTML: <p>foo</p><pre>bar</pre>
    ----------
  Note how this parsed HTML is identical to what the VE outputs
  in Example 2 above.  So, without the gc:1 attribute, we now
  have conflicting requirements on the example same HTML.
  This increases confidence in the correctness of my commit here.

Change-Id: I86beadec91c445a7f8a6d36a639b406697daa0a2
2012-06-14 14:59:18 -05:00
Subramanya Sastry 13e03ec1d7 Refix <pre> serialization.
- Effectively reverted fix from f882a65153
  and added a new fix.

Change-Id: I8b81e26525a5f1a22acaf2c7067f2dcd9b962818
2012-06-14 13:10:02 -05:00
Subramanya Sastry 51227f2a4a Improved, simplified newline handling in wikitext serializer.
- Eliminated newline handling from several places in code and
  mostly isolated it to serializeToken thus simplifying newline
  handling logic.
- Fixing some bugs in the process: # of green roundtrip tests
  went up by 5 (294 --> 299) but actually introduced failures on
  a few originally succeeding tests (additional leading/trailing
  newlines on the entire test output).
- Added bonus: made list serializing (mostly) insensitive to
  newlines between tags.  So, all the following DOM serialize
  identically to the following wikitext:

  *foo
  *bar

  ----------
  <ul><li>foo</li><li>bar</li></ul>

  ----------
  <ul>
  <li>foo</li>
  <li>bar</li>
  </ul>
  ----------
  <ul>

  <li>
  foo

  </li>

  <li>
  bar</li>

  </ul>
  ----------

Change-Id: I76be56c4b2789039dff5f47de4659746882e45d6
2012-06-14 00:10:51 -05:00
Subramanya Sastry bf0f5d1b7e Minor code cleanup
Change-Id: Ic5d99b6c483841310b0c295c1c30246f907455b4
2012-06-13 13:47:26 -05:00
Subramanya Sastry 23ec054013 Fixed round-tripping of interwiki links.
Change-Id: If0427b9865b3e9cf8c0ad0b4efaebc9f9f7fb865
2012-06-13 13:39:18 -05:00