This should not really be needed if the tokenizer did not decode html entities
on the fly. It is still a quick way to make sure no htmlish content can be
inserted even with the current decoding.
The next step and proper fix is to make entity decoding either optional in the
tokenizer (flag-controlled), or move it to a later stage in the token
processing pipeline.
Change-Id: Ife093dcfb95113763dab5635b098c795d3550586
* Renamed defaultOptions to initialState
* Got rid of unused state property
* Added comments explaining how state attributes
and tag handler flags are used
* Refactored listItemHandler check into functions and
added FIXME possible rewriting of that check.
* Protected serializeDOM in a try-catch handler to
catch exceptions and output the exception to the console.
Change-Id: I3d351c06e4b86baeb5a55243b11dbfa9baca5bb7
* Removed murky ' :' -> ' :' replacement in tokenizer. This breaks four
parser tests, and should be fixed in a token stream transformer or DOM
postprocessor. This replacement clashes with round-tripping, and is not
terribly important visually.
* Added stx:row annotation to single-line dt/dd pairs and use it to preserve
single-line syntax in the serializer. There is no attempt yet to support the
addition of nested lists in an originally single-line dd. We'd need to look
ahead in the serializer to support this. Perhaps the editor can simply drop
data-mw in that case.
* Switched default dt/dd serialization to multi-line. This supports all nested
lists and multiple dds.
* Don't close dls when switching from dt to dd or back in the token stream
ListHandler.
Overall 290 round-trip tests are passing now (up from 284, some due to ,
some due to lists). The number of passing parser tests dropped slightly from
303 to 297 (or 301/295 on weekdays other than Thursday).
Change-Id: I85ff40571833713388c6523e6a4ba2e94daa3807
Basically only prefix all bullets if the serialization output is going to be
in start-of-line context. The test for that is currently inline, but should
perhaps be factored out to a method or state flag instead.
We could alternatively consider to return the start-of-line prefix and let it
be used in _serializeToken in case we end up in start-of-line context.
This patch also fixes a newline issue on input like this:
:d1
::: d3
Both the list and list item handlers now set the startsNewline flag
dynamically depending on the context, so that we don't depend on the
suppression of newlines from list syntax by the singleLineMode any more.
There is still an extra newline inserted between list items in the following
example:
;t1 :d1
;;t2 ::d2
This looks like a bug in the produced DOM and not in the serializer, since the
outer definition list is closed and re-opened between d1 and t2.
Change-Id: I78e3a1ef34cf9159d5a1e86fb64c774ff111e71d
The main issue is that the bullets from dd/dt were not stored on the stack. I
added a separate field for it in each stack entry, which now fixes the basic
indent case without (afaik) breaking anything else.
There are still some newline issues, and the need to handle the single-line
dd/dt vs. the multi-line variant.
Change-Id: I65939c05e2c5dde0789bf8aefd7651161a2f137c
* Don't escape html-syntax pre content for now; Should parse this with a new
pre content production later (which needs to be split out of the regular pre
production in the tokenizer)
* Protect indent-pre content from start-of-line syntax escaping
* Preserve extra leading spaces in the tokenizer
* Two more (now 284) round-trip tests are passing
Change-Id: I199b89c0ee7fae12546df10c1b5117c97caccac5
Queued newlines and new trailing newlines were not cleanly separated so far,
which caused some trailing newlines to be consumed for needed leading
newlines. This change fixes several newline bugs, taking the number of passing
round-trip tests from 276 back up to 282.
Change-Id: Idb4706e15ce71e63085033e3f3f29557915c11a8
Fixed a bug in the list handler for multiple dds in a definition list. Also
fixed a few JSHint warnings.
Change-Id: I3e883786698a9521347fc2a5e6420646318813a7
Known issue: breaks round-tripping of :;;;::. That test is normally disabled
anyway, so we can fix it later.
Change-Id: I7954271311bfb7e71caae59d8177e3f04a9ebbca
* Started to add more complete tag source range (tsr) annotations to most
start / empty tags. These replace the old sourcePos and sourceTagPos
annotations, and look more promising for general round-tripping than block
source ranges (bsr). See
http://www.mediawiki.org/wiki/User:GWicke/Parsoid_source_ranges for some
notes on this.
* Added an escapeWikitext method in the serializer that tokenizes supposedly
text-only content from the DOM with the tokenizer and wraps runs of returned
non-text tokens into nowiki tags. The source corresponding to non-text
tokens is retrieved using the tsr annotations.
* Removed old (unused) table productions to avoid confusion.
* 276 round-trip tests are passing, vs. 283 without escaping.
Known issues:
* harmless for now, can be improved later: urllinks in external link captions
are wrapped in nowiki. Example HTML:
<a rel='mw:extLink' href="http://example.com">http://example2.com</a>
* some start-of-line syntax in wiki-syntax preformatted blocks might be
wrapped into nowiki when that would not really be needed. Example HTML DOM:
<pre>
* foo
* bar
</pre>
Change-Id: I01c34aedd5c566614d36924add47a6a960e91987
* Added a newlineTransparent flag to handlers that prevents changes to the
onNewline status, so that content following it is still considered to be in
start-of-line context. This fixes a few rt tests where a comment or nowiki
tag is at the start of the line, and following content should end up on the
same line.
* 283 rt parser tests are now passing.
Change-Id: Ie58dcb9e5e9af9000fff61c2e1db5d8649ffc3f6
* tokens are not modified any more (they are supposed to be immutable)
* handler info is now split in start / end objects and potentially a 'make'
method; added more flags to govern the newline behavior of different tags
* added a generic singleLine mode for single-line syntactical environments
* switched the web service to line-based diffs to avoid issues when diffing
the round-trip results of [[:en:Programming language]]
* 280 round-trip tests are passing now
Change-Id: I74b8ffbf69643c5d6e5ec852ec58e680c9018901
HTML5 defines space characters as [ \r\n\t\f] in
http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#space-character.
It treats these specially in a few contexts. As an example, the foster
parenting algorithm does not apply to space characters.
As a result, this change fixes the round-tripping of spaces between table
tags, which were previously moved before the table.
Change-Id: I32ab29275a9f824fc66d8286638eb42748cfc9a5
from Parsoid HTML output as well as VE HTML output. There are still
some newline related failures from parser tests that needs fixing, but
this is getting close. So committing for now so other eyes can make the
bugs shallow :).
Change-Id: Ia6a218ee9fb3e18fe0573c89ff3a4236779e1e64
- Check if href for links has the wgScriptPath prefix before
attempting to strip it from the href.
Change-Id: I844151ef7317476668d1306b96a2aec5a56fd0f1
- Something like this:
<ul><li>1</li><li>2<ul><li>2.1</li><li>2.2<ul><li>2.2.1</li><li>2.2.2</li></ul></li><li>2.3</li></ul></li><li>3</li></ul>
now serializes properly to:
*1
*2
**2.1
**2.2
***2.2.1
***2.2.2
**2.3
*3
So does this form which is what the above wikitext parses to:
<ul><li>1
</li><li>2
<ul><li>2.1
</li><li>2.2
<ul><li>2.2.1
</li><li>2.2.2
</li></ul></li><li>2.3
</li></ul></li><li>3
</li></ul>
- Lists (and nested lists) are not entirely newline-insensitive.
They still depend on newlines *between* lists. The opening
<ul> tag for non-nested lists should always start on a new line.
So, for example,
<ul><li>foo</li></ul><ul><li>bar</li></ul>
will serialize to:
*foo
*bar
which is incorrect. But,
<ul><li>foo</li></ul>
<ul><li>bar</li></ul>
will correctly serialize to:
*foo
*bar
Change-Id: I13a0290368574865957bcf57aebab488fbbb7026
- More pieces are now simplified and all(?) newline handling
is now centralized in the serializeToken function.
- This commit fixes bugs in rt-ing some code snippets
----------
Ex 1: foo<p>bar</p>baz
----------
- This commit fixes bugs serializing VE generated html
----------
Ex 2: <p>foo</p><pre>bar</pre> ==> foo\n bar
----------
- But, this round of fixes introduces RT failures for certain
code examples in parserTests.txt. In all these failing cases,
inline text/html is embedded within a generated <p> tag during
parsing. If these generated <p> tags can have a "gc:1" attribute
added to them, we can properly serialize them to the original
form.
----------
Ex 3: foo<pre>bar</pre>
Parsed HTML: <p>foo</p><pre>bar</pre>
----------
Note how this parsed HTML is identical to what the VE outputs
in Example 2 above. So, without the gc:1 attribute, we now
have conflicting requirements on the example same HTML.
This increases confidence in the correctness of my commit here.
Change-Id: I86beadec91c445a7f8a6d36a639b406697daa0a2
- Eliminated newline handling from several places in code and
mostly isolated it to serializeToken thus simplifying newline
handling logic.
- Fixing some bugs in the process: # of green roundtrip tests
went up by 5 (294 --> 299) but actually introduced failures on
a few originally succeeding tests (additional leading/trailing
newlines on the entire test output).
- Added bonus: made list serializing (mostly) insensitive to
newlines between tags. So, all the following DOM serialize
identically to the following wikitext:
*foo
*bar
----------
<ul><li>foo</li><li>bar</li></ul>
----------
<ul>
<li>foo</li>
<li>bar</li>
</ul>
----------
<ul>
<li>
foo
</li>
<li>
bar</li>
</ul>
----------
Change-Id: I76be56c4b2789039dff5f47de4659746882e45d6
* As part of an earlier fix, I had changed default value of 'res'
to null instead of ''. But, this was potentially buggy because
the previous check was (res !== '') which could be triggered
by return values of handlers. By changing the check to null,
I was effectively changing the code paths for those handlers that
returned ''.
Change-Id: I2302023be7422ce4fb384ff5a50fe53fa7732855
paragraphs in lists.
* We need to look at other special-case handling requirements of
html tags in lists (and other contexts like tables).
Change-Id: I84b8402d90a186c9075c2d45263c94377312927a
* Moved wikipedia default prefixes to environment
* Added 'addInterwiki' method
* Adjusted link handling normalizeTitle to reflect this
Change-Id: If5b2314cc36346b6da8649ed410457a612d80a22
* mw:Foo now loads pages from mediawiki.org
* The default prefix still is 'en'. You can switch this to 'mw' in ParserService.js.
Change-Id: I1208667e6114bd711b7988a8b3adb32ffab70969
- Three bugs that were messing up quote transformations.
- Now, the following cases are handled properly:
* ''foo'''
* '''foo''
* ''foo''''
* ''''foo''
These tests (and other quote tests) have to be added to core parser
tests file.
- One more parser test green.
Change-Id: I4f93e8910639f546bfc9304becab17d26d5529de
An improvement, but there still are some extra newlines inserted after
paragraphs. Example input:
-------
Foo:
{|
|foo
|}
-------
Extra newlines are inserted after the Foo: and the foo in the table. They are
not fed as tokens or text to the tree builder, so there is likely a bug in the
html5 library or JSDom.
Change-Id: I83eb6180e3cd1c4e7f9b15b31d339e1d32bccd3f
* Possibly more efficient under heavy GC load -- untested.
* No change in time and memory use for single file parsing.
Change-Id: Id2f3f65cc0e5f38ed968bbda60b97e46523e700e
* Moved the tail attribute to the second attribute (a bit cleaner)
* Disallowed newlines in the tail production
* Improved the selection of round-tripped href vs. generated content vs. href
in the serializer
* renamed state.linkTail to state.dropTail
Change-Id: I5d98c704b6ea566011e22237786f8da17548570f
Pages titles with a wikipedia interwiki prefix now load the page from
corresponding Wikipedia. Links in a page then stay within the given language.
Note that Parsoid currently makes no effort to recognize localized namespaces,
so it won't render media files, categories etc correctly.
Change-Id: I7bc4102e81a402772ea23231170734d580ea15b9
* Don't explicitly add the newline in the pre, as we preserve newline tokens
now. This avoids doubling of newlines when round-tripping.
* Use the sHref attribute even if the href contains spaces.
Change-Id: I8bec8fbfd6a7836bf2e5eec20869a0edd95c93b6
Lists interrupted by non-empty lines would not close the list properly.
Register for any token instead of just for newlines and close the list if no
listItem follows the newline.
Change-Id: I1743901e3db541bbeda78d17707db943e6ceb9b9
If the href would not denormalize, add a copy of the original href in data-mw
and use it to preserve non-conventional capitalization etc.
Change-Id: Ifef50eec7343b0e6b0ba66b6d19a8a3e8c9f8001
A tail containing regexp syntax (a ? in [[:en:Main Page]]) would crash the
serializer. Use substr instead.
Change-Id: I8519aec9c07dfe31893d676b1c936a42d2af74a0