Commit graph

16529 commits

Author SHA1 Message Date
Trevor Parscal 5d20f45120 Added getHtmlElementFromDataElement and getDataElementFromHtmlElement
Change-Id: Ie5f4fd86612b5a6c34b5843d1e9a521edc626a63
2012-06-06 10:17:51 -07:00
Gabriel Wicke 47204c4ca0 Use diffChars instead of diffWords, as the former misses some changes
The improved merge algorithm now makes diffChars output more palatable. Things
could still be improved by collecting single-character 'neutral' changes in a
block of 'add' changes and converting them to adds / removes.

Change-Id: I8439e8acab4360c08b89d9ce8a6b8523e7a0a210
2012-06-06 18:36:28 +02:00
Subramanya Sastry f8221b128b Used a more robust heuristic for merging consecutive diffs
- Check if consecutive diffs are separate by 1 word in addition
  to max 3 chars.  This takes care of diffs introduced by template diffs
  separated by the template name and creates a clean single diff.

Change-Id: I9181d2ed9a07bee6ca5d5ebd6ddea84f7e2cecac
2012-06-06 11:01:47 -05:00
Gabriel Wicke 2bc066b42d Up the diff merge size heuristic a bit and always use the same algorithm
Change-Id: I707c8a55ed1758cdd591d2fc95e03a360c8e76d1
2012-06-06 17:46:25 +02:00
Gabriel Wicke bc1a77a812 Make modified newlines visible by replacing empty lines with a space
Change-Id: If7b811245e0d01a7a147ab54c3801fc1754730a9
2012-06-06 17:11:29 +02:00
Gabriel Wicke 1876d785a7 Swap ins/del in the diff
Change-Id: Id336d713d1767a4b7859b158f2c2ddf9adc11cfb
2012-06-06 16:02:54 +02:00
Gabriel Wicke 350e700d8f Add core-upgrade
Change-Id: I5ad0955e8272d376f009f89461bed310978b25e4
2012-06-06 15:58:17 +02:00
Gabriel Wicke d0a0454ada Merge "Improve the handling of newlines for round-tripping" 2012-06-06 13:54:04 +00:00
Gabriel Wicke aee35f627d Merge "Update patched html5 library to version 0.3.8" 2012-06-06 13:53:37 +00:00
Gabriel Wicke a146fcb8ad Improve the handling of newlines for round-tripping
An improvement, but there still are some extra newlines inserted after
paragraphs. Example input:

-------

Foo:
{|
|foo
|}
-------

Extra newlines are inserted after the Foo: and the foo in the table. They are
not fed as tokens or text to the tree builder, so there is likely a bug in the
html5 library or JSDom.

Change-Id: I83eb6180e3cd1c4e7f9b15b31d339e1d32bccd3f
2012-06-06 10:17:03 +02:00
Gabriel Wicke 59fc634cce Update patched html5 library to version 0.3.8
Change-Id: I321d9a58ea1af33842a606fc8706938093a8330f
2012-06-06 10:17:03 +02:00
Subramanya Sastry bff08b799e Improvement to the refineDiffs function to improve diff quality.
* Attempt to accumulate consecutive add-delete pairs
  with "short text" separating the pairs.  This is equivalent to
  the <b><i> ... </i></b> minimization to expand range of
  <b> and <i> tags, except there is no optimal solution except
  as determined by heuristics ("short text": <= 2 chars).

Change-Id: I408e318c315eba18aac4051ed84d77e3e092d497
2012-06-06 00:08:00 -05:00
Subramanya Sastry fe6f289486 Merge changes I5d98c704,Ib8d3de75
* changes:
  A few tweaks to link round-tripping
  Use word diff if --color is enabled
2012-06-05 16:04:23 +00:00
Subramanya Sastry b095db4303 Simpler implementation of flatten.
* Possibly more efficient under heavy GC load -- untested.
* No change in time and memory use for single file parsing.

Change-Id: Id2f3f65cc0e5f38ed968bbda60b97e46523e700e
2012-06-05 10:47:46 -05:00
Gabriel Wicke dc3168cf6d A few tweaks to link round-tripping
* Moved the tail attribute to the second attribute (a bit cleaner)
* Disallowed newlines in the tail production
* Improved the selection of round-tripped href vs. generated content vs. href
  in the serializer
* renamed state.linkTail to state.dropTail

Change-Id: I5d98c704b6ea566011e22237786f8da17548570f
2012-06-05 17:26:27 +02:00
Gabriel Wicke 0f9d939b00 Use word diff if --color is enabled
Change-Id: Ib8d3de75ac306974abfdaca22bfc7b69bc62891d
2012-06-05 16:10:13 +02:00
Catrope d378182bff Reapply typo fixes from c0e1991 , were undone in b0f6f64
Change-Id: If84e13b91781ed96d3bd9e94171e16020c65ea42
2012-06-05 06:54:35 -07:00
Catrope 528728558b Fix bug in selectNodes's logic for traversing back up the tree
Change-Id: I0fc5a2ad2c9a8d162e8ddbf3cc6d31684d364928
2012-06-05 06:50:09 -07:00
Gabriel Wicke d16032ae9a Track html syntax in block_tag production
Change-Id: If560523644f007485809762f12216e08fb3c3ed3
2012-06-05 12:39:56 +02:00
Gabriel Wicke c1d8270bdb Fix wgScriptPath in round-trip mode without interwiki
Change-Id: I7cc80b7be1afffc586a2ea45d21303e9ba07c0d4
2012-06-05 12:11:45 +02:00
Gabriel Wicke 3346aed86e Support interwiki links, and some cleanup
Change-Id: I205c53a03f5230e3ef9100487f4934f97bdc179a
2012-06-05 12:05:33 +02:00
Gabriel Wicke cc96ff4f5e Very basic interwiki support
Pages titles with a wikipedia interwiki prefix now load the page from
corresponding Wikipedia. Links in a page then stay within the given language.

Note that Parsoid currently makes no effort to recognize localized namespaces,
so it won't render media files, categories etc correctly.

Change-Id: I7bc4102e81a402772ea23231170734d580ea15b9
2012-06-05 11:19:58 +02:00
Trevor Parscal b0f6f64d90 Made pushRetain do nothing if you give it 0 and throw an exception if you give it a negative length
Change-Id: Ib9955660b05a04503325ddb20f9e9a525b4d6832
2012-06-04 16:27:33 -07:00
Trevor Parscal 9111e34a0b Added nodeOuterRange to selectNodes
Change-Id: I9ef0c383fbb2515c752d2d3c52e8632aac73d811
2012-06-04 16:21:29 -07:00
Trevor Parscal a0ea700090 Added tests for transactions that deal with removing alien nodes
Change-Id: Id36413c62386dbe5ebc8c8b3a1d3c5e301a8175a
2012-06-04 16:21:29 -07:00
Catrope c0e19915ef Typo fixes and missing 'var' in newFromRemoval()
Change-Id: Ibb984d862670b5386ff76fc55ef3322f695b6ae1
2012-06-04 16:07:16 -07:00
Catrope db793009be fixupInsertion() variable documentation and cleanup
Functional changes (fixes):
* Make writeElement() also update parentNode and parentType for openings
* Also add to fixupStack when opening a wrapper for a text node

Non-functional changes (cleanup&docs):
* Document all variables at the beginning of the function
* Group variables according to where/how they're used
* Move expectedType into writeElement()
* Kill node, duplicates parentNode unnecessarily
* Kill paragraphOpened, was misnamed and unnecessary
* Rename closedElements to reopenElements

Change-Id: Ie5b4e4f30b267943048fdc170accb29139039192
2012-06-04 16:07:16 -07:00
Catrope f7445b37b8 Retain attributes when reopening closed nodes
* Push entire elements onto openingStack rather than type strings
* When closing an element, build a clone of the opening and push it onto
  closedElements, then insert that clone when reopening the element

Change-Id: I8b0fb44394aed6c471dc6dacaab03e44c2333733
2012-06-04 16:07:16 -07:00
Catrope b167453f0d Add ve.dm.Node.getAttributes() to get a reference to the attributes object
Change-Id: Ic2463d4f7053a5f6defd212f04deb5ea71542843
2012-06-04 16:01:14 -07:00
Catrope e67b2fd793 Add some more tests for newFromInsertion
Change-Id: I3c403928a8176d8685e48a63759fefec4657ca96
2012-06-04 16:01:14 -07:00
Catrope 109624e8b3 fixupInsertion fixes, wrapping content works now
Change-Id: I1eee6afcffbf09955578b7f0534aa5b7234802df
2012-06-04 16:01:13 -07:00
Rob Moen fd5eb80dd7 Document annotate method in surface model.
Comment cleanup

Change-Id: Ifd3eeab9046f376529a827dfafdc28506845ac15
2012-06-04 15:06:37 -07:00
Rob Moen f958297102 Create tests for surface model annotate method.
Fix up surface model tests

Change-Id: I02b57edc5c3faeb39e0b3c1f473f03fbd49d85b4
2012-06-04 15:02:56 -07:00
Rob Moen c338304d33 Rewrite annotate as more low level method in Surface model.
TODO: follow up with annotate tests

Change-Id: If0e68bd3a09840b1e5f3e8d85fd22a8c10134b58
2012-06-04 14:29:27 -07:00
Gabriel Wicke 92f753a365 Pre and link target improvements
* Don't explicitly add the newline in the pre, as we preserve newline tokens
  now. This avoids doubling of newlines when round-tripping.
* Use the sHref attribute even if the href contains spaces.

Change-Id: I8bec8fbfd6a7836bf2e5eec20869a0edd95c93b6
2012-06-04 14:03:05 +02:00
Gabriel Wicke ee2ddbd3cb Fix list handler issues
Lists interrupted by non-empty lines would not close the list properly.
Register for any token instead of just for newlines and close the list if no
listItem follows the newline.

Change-Id: I1743901e3db541bbeda78d17707db943e6ceb9b9
2012-06-04 13:38:43 +02:00
Gabriel Wicke f821eac102 Optionally round-trip sHref in data-mw
If the href would not denormalize, add a copy of the original href in data-mw
and use it to preserve non-conventional capitalization etc.

Change-Id: Ifef50eec7343b0e6b0ba66b6d19a8a3e8c9f8001
2012-06-04 12:28:05 +02:00
Gabriel Wicke e0809209ec Don't set the data-mw attribute if the object is actually empty.
Change-Id: I984f1b44bba67d7a9f1a709738d14c0ee02f69a9
2012-06-04 12:26:03 +02:00
Gabriel Wicke 2774e5aa6c Actually replace all underscores in wikilink target
Change-Id: I633f8d6e4f639aff90fd456600376b7c6515fd50
2012-06-04 11:48:59 +02:00
Gabriel Wicke 3f2c72f920 Fix padleft / padright (mis)use as substr
Change-Id: I0645e11c8ef8b550ad35300d1904788940fc748a
2012-06-04 11:30:45 +02:00
Gabriel Wicke 0eabd2c67e Add round-trip form and split out rt diffing
Change-Id: I3bc8ad7f273937ce6c767b8d7bbccdc86cbd93b4
2012-06-04 10:49:59 +02:00
Gabriel Wicke 99c98d6c56 Diff refinement fixes
Change-Id: I11c69de0fdcd636ccd11cd0b6cb16c5acdb188b3
2012-06-04 10:16:05 +02:00
Gabriel Wicke d2602c47a6 Switch back to word-based diff
The char-based diff looked good in some pages, but yielded terrible results in
others. The word-based algo is more consistent overall.

Change-Id: I7f2d40315ad96df037c2d9a1d50739e3d21b6c81
2012-06-04 00:02:49 +02:00
Gabriel Wicke 4533c274ca Fix a crasher in the serializer
A tail containing regexp syntax (a ? in [[:en:Main Page]]) would crash the
serializer. Use substr instead.

Change-Id: I8519aec9c07dfe31893d676b1c936a42d2af74a0
2012-06-04 00:00:54 +02:00
Gabriel Wicke d01581c380 Create a 'refinement diff' algorithm
The word or char-based algorithm does not scale well beyond 5k chars or so. We
now perform a line-based diff and then continue to diff the line differences
using the char-based algorithm. This gives a char-based diff even for bigger
inputs.

Change-Id: Iec87ca56540060e4df2859ba54c992e7ff5cfe10
2012-06-03 23:46:57 +02:00
Gabriel Wicke b11b8d8a6b Revert to line diff, word diff explodes on some pages
Change-Id: Ic338498b47bb6b6c98fa6280f44464cd70a48b1b
2012-06-03 11:39:03 +02:00
Gabriel Wicke b5e067e086 Some more web service tweaks
* Stay in round-trip mode in HTML DOM output
* Return DOM, wikitext and diff as soon as they are available

Change-Id: I7f8f44cfe8eed63a521d1318d116c22232cb6b1b
2012-06-03 11:04:40 +02:00
Gabriel Wicke 7c18891504 Snazzy html word diff for roundtrip view
Also show the HTML DOM, Wikitext output and diff.

Change-Id: Ibe744fbc895239f4e48f6e0e2f2b2f345c0845bd
2012-06-03 01:36:56 +02:00
Gabriel Wicke 4cf74497b7 Update web service start page documentation
Change-Id: I38efc5a9d5b919c6168cf97d0efbae9db967e351
2012-06-02 17:17:37 +02:00
Gabriel Wicke 31522d3d49 Add ApiRequest
Change-Id: I5f2a1cb65223a68f10bc63903000248efca05586
2012-06-02 16:52:51 +02:00