The improved merge algorithm now makes diffChars output more palatable. Things
could still be improved by collecting single-character 'neutral' changes in a
block of 'add' changes and converting them to adds / removes.
Change-Id: I8439e8acab4360c08b89d9ce8a6b8523e7a0a210
- Check if consecutive diffs are separated by 1 word, in addition
to the max 3 chars check. This takes care of diffs introduced by template
changes that are separated by the template name, and creates a clean single diff.
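
A minimal sketch of such a separator check, written in Python for
illustration. The function name, the whitespace-based definition of "1 word"
and the default threshold are assumptions, not the actual implementation:

```python
def is_short_separator(text, max_chars=3):
    """Return True if `text` is short enough to fold into the surrounding diff.

    Hypothetical helper: the separator qualifies if it is at most
    `max_chars` characters long, or if it consists of exactly one word
    (for example a template name), ignoring surrounding whitespace.
    """
    # Rule 1: very short separators (up to max_chars characters).
    if len(text) <= max_chars:
        return True
    # Rule 2: a single word, here assumed to mean "no internal whitespace".
    return len(text.split()) == 1
```

With these rules, `is_short_separator("echo")` and `is_short_separator(" | ")`
hold, while a separator such as `"two words"` keeps the diffs apart.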
Change-Id: I9181d2ed9a07bee6ca5d5ebd6ddea84f7e2cecac
An improvement, but there are still some extra newlines inserted after
paragraphs. Example input:
-------
Foo:
{|
|foo
|}
-------
Extra newlines are inserted after the Foo: and the foo in the table. They are
not fed as tokens or text to the tree builder, so there is likely a bug in the
html5 library or JSDom.
Change-Id: I83eb6180e3cd1c4e7f9b15b31d339e1d32bccd3f
* Attempt to accumulate consecutive add-delete pairs
with "short text" separating the pairs. This is equivalent to
the <b><i> ... </i></b> minimization to expand the range of
<b> and <i> tags, except that there is no optimal solution;
the result is determined by heuristics ("short text": <= 2 chars).
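
The accumulation can be sketched as follows, in Python for illustration.
The `(op, text)` tuple representation, the op names and the function name
are assumptions; only the "<= 2 chars" threshold comes from the text above:

```python
def merge_short_gaps(changes, max_gap=2):
    """Fold short unchanged runs between changes into one delete/add pair.

    `changes` is a list of (op, text) tuples, op being "eq", "del" or "add"
    (a hypothetical representation of the diff output).
    """
    merged = []
    removed, added = [], []  # text collected for the current change block

    def flush():
        # Emit the collected block as at most one delete and one add.
        if removed:
            merged.append(("del", "".join(removed)))
        if added:
            merged.append(("add", "".join(added)))
        removed.clear()
        added.clear()

    for i, (op, text) in enumerate(changes):
        if op == "eq":
            in_block = bool(removed or added)
            next_is_change = i + 1 < len(changes) and changes[i + 1][0] != "eq"
            if in_block and next_is_change and len(text) <= max_gap:
                # Short separator: treat it as both removed and re-added.
                removed.append(text)
                added.append(text)
            else:
                flush()
                merged.append((op, text))
        elif op == "del":
            removed.append(text)
        else:
            added.append(text)
    flush()
    return merged
```

The short separator ends up in both the delete and the add, so the old and
new text can still be reconstructed from the merged output.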
Change-Id: I408e318c315eba18aac4051ed84d77e3e092d497
* Possibly more efficient under heavy GC load -- untested.
* No change in time and memory use for single file parsing.
Change-Id: Id2f3f65cc0e5f38ed968bbda60b97e46523e700e
* Moved the tail attribute to the second attribute (a bit cleaner)
* Disallowed newlines in the tail production
* Improved the selection of round-tripped href vs. generated content vs. href
in the serializer
* Renamed state.linkTail to state.dropTail
Change-Id: I5d98c704b6ea566011e22237786f8da17548570f
Page titles with a Wikipedia interwiki prefix now load the page from the
corresponding Wikipedia. Links in a page then stay within the given language.
Note that Parsoid currently makes no effort to recognize localized namespaces,
so it won't render media files, categories etc. correctly.
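
A rough sketch of the prefix handling, in Python for illustration. The
function name and the list of known prefixes are hypothetical; the real
prefix table is not shown in this message:

```python
def split_interwiki(title, known_prefixes=("en", "de", "fr")):
    """Split a title into (language prefix, remaining title).

    Returns (None, title) when the part before the first colon is not a
    known language prefix, e.g. for a namespace such as "Help:".
    """
    prefix, sep, rest = title.partition(":")
    if sep and prefix.lower() in known_prefixes:
        return prefix.lower(), rest
    return None, title
```

A title like `de:Hauptseite` yields `("de", "Hauptseite")`, while
`Help:Contents` is left untouched because `Help` is not a language prefix.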
Change-Id: I7bc4102e81a402772ea23231170734d580ea15b9
Functional changes (fixes):
* Make writeElement() also update parentNode and parentType for openings
* Also add to fixupStack when opening a wrapper for a text node
Non-functional changes (cleanup & docs):
* Document all variables at the beginning of the function
* Group variables according to where/how they're used
* Move expectedType into writeElement()
* Kill node, which duplicates parentNode unnecessarily
* Kill paragraphOpened, which was misnamed and unnecessary
* Rename closedElements to reopenElements
Change-Id: Ie5b4e4f30b267943048fdc170accb29139039192
* Push entire elements onto openingStack rather than type strings
* When closing an element, build a clone of the opening and push it onto
closedElements, then insert that clone when reopening the element
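
The clone-and-reopen idea can be sketched like this, in Python for
illustration. Elements are modelled as plain dicts and the function names
are assumptions; only openingStack and closedElements come from the text:

```python
import copy


def close_element(opening_stack, closed_elements):
    """Pop the innermost open element and remember a clone for reopening."""
    element = opening_stack.pop()
    # Keep a deep copy so later changes to the original do not leak into
    # the element that gets reopened.
    closed_elements.append(copy.deepcopy(element))
    return element


def reopen_elements(opening_stack, closed_elements):
    """Reopen previously closed elements, outermost first."""
    while closed_elements:
        opening_stack.append(closed_elements.pop())
```

Because whole elements rather than type strings are stored, the reopened
clone carries the same attributes as the original opening.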
Change-Id: I8b0fb44394aed6c471dc6dacaab03e44c2333733
* Don't explicitly add the newline in the pre, as we preserve newline tokens
now. This avoids doubling of newlines when round-tripping.
* Use the sHref attribute even if the href contains spaces.
Change-Id: I8bec8fbfd6a7836bf2e5eec20869a0edd95c93b6
Lists interrupted by non-empty lines would not close the list properly.
Register for any token instead of just for newlines and close the list if no
listItem follows the newline.
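
A simplified sketch of the closing rule, in Python for illustration. Tokens
are modelled as plain strings and the `<list>` / `</list>` markers are
hypothetical; the real handler works on token objects:

```python
def close_interrupted_lists(tokens):
    """Close an open list when a newline is not followed by a listItem."""
    out = []
    in_list = False
    after_newline = False  # True if the previous token was a newline in a list
    for tok in tokens:
        if tok == "listItem":
            if not in_list:
                out.append("<list>")
                in_list = True
            after_newline = False
            out.append(tok)
            continue
        if in_list and after_newline:
            # Any token other than listItem after a newline ends the list.
            out.append("</list>")
            in_list = False
        after_newline = in_list and tok == "newline"
        out.append(tok)
    if in_list:
        out.append("</list>")
    return out
```

Looking at every token, not only newlines, is what lets the handler notice
that the line after the list starts with ordinary content.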
Change-Id: I1743901e3db541bbeda78d17707db943e6ceb9b9
If the href would not denormalize, add a copy of the original href in data-mw
and use it to preserve non-conventional capitalization etc.
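
A minimal sketch of the round-trip check, in Python for illustration. The
normalization rules, the function names and the `sHref` key inside data-mw
are assumptions made for this example:

```python
def normalize_title(href):
    """Hypothetical normalization: underscores to spaces, first letter upper."""
    title = href.replace("_", " ")
    return title[:1].upper() + title[1:]


def denormalize_title(title):
    """Hypothetical inverse: spaces back to underscores."""
    return title.replace(" ", "_")


def link_attributes(original_href):
    """Build link attributes, keeping the source href if it would be lost."""
    href = denormalize_title(normalize_title(original_href))
    attrs = {"href": href}
    if href != original_href:
        # Normalization is lossy here, so keep a copy of the original.
        attrs["data-mw"] = {"sHref": original_href}
    return attrs
```

A conventional href such as `Main_Page` needs no extra data, while
`main page` is stored alongside the normalized `Main_page`.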
Change-Id: Ifef50eec7343b0e6b0ba66b6d19a8a3e8c9f8001
The char-based diff looked good on some pages, but yielded terrible results on
others. The word-based algorithm is more consistent overall.
Change-Id: I7f2d40315ad96df037c2d9a1d50739e3d21b6c81
A tail containing regexp syntax (a ? in [[:en:Main Page]]) would crash the
serializer. Use substr instead.
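
The failure mode and the fix can be illustrated in Python. Both function
names are hypothetical, and Python's `re` stands in for the regexp engine
used by the serializer:

```python
import re


def strip_tail_regexp(text, tail):
    """Fragile variant: `tail` is interpreted as a regexp pattern."""
    return re.sub(tail + "$", "", text)


def strip_tail(text, tail):
    """Robust variant: compare and cut by plain string length."""
    if tail and text.endswith(tail):
        return text[: len(text) - len(tail)]
    return text
```

With a tail of `?`, the regexp variant raises `re.error` because a bare
quantifier is not a valid pattern, while the substring variant simply
removes the trailing character.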
Change-Id: I8519aec9c07dfe31893d676b1c936a42d2af74a0
The word- and char-based algorithms do not scale well beyond 5k chars or so. We
now perform a line-based diff and then continue to diff the line differences
using the char-based algorithm. This gives a char-based diff even for bigger
inputs.
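
The two-stage approach can be sketched with Python's standard `difflib`,
which stands in here for the diff library actually used. The function name
and the output tuple format are assumptions:

```python
import difflib


def two_stage_diff(old, new):
    """Diff by lines first, then refine changed line runs by characters."""
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    result = []  # list of (op, text), op in "equal", "delete", "insert"
    lines = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in lines.get_opcodes():
        old_chunk = "".join(old_lines[i1:i2])
        new_chunk = "".join(new_lines[j1:j2])
        if tag == "equal":
            result.append(("equal", old_chunk))
        elif tag == "delete":
            result.append(("delete", old_chunk))
        elif tag == "insert":
            result.append(("insert", new_chunk))
        else:
            # Changed lines: run the expensive char diff on this span only.
            chars = difflib.SequenceMatcher(None, old_chunk, new_chunk, autojunk=False)
            for ctag, a1, a2, b1, b2 in chars.get_opcodes():
                if ctag == "equal":
                    result.append(("equal", old_chunk[a1:a2]))
                    continue
                if a2 > a1:
                    result.append(("delete", old_chunk[a1:a2]))
                if b2 > b1:
                    result.append(("insert", new_chunk[b1:b2]))
    return result
```

The char-level pass only ever sees the lines that differ, so its cost
depends on the size of the changes and not on the size of the whole input.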
Change-Id: Iec87ca56540060e4df2859ba54c992e7ff5cfe10
* Stay in round-trip mode in HTML DOM output
* Return DOM, wikitext and diff as soon as they are available
Change-Id: I7f8f44cfe8eed63a521d1318d116c22232cb6b1b