Commit graph

39 commits

Author SHA1 Message Date
Trevor Parscal 8cb5cbf75d Merge branch 'refs/heads/master' into dmrewrite 2012-06-13 15:00:44 -07:00
Subramanya Sastry 51958e4c6a Removed unused parser pipeline construction
Change-Id: Id2a7dde895b7c3fbf776a2035009686afd4301df
2012-06-13 15:08:13 -05:00
Catrope 6bf79475e4 Add suggested development configuration in comments
Change-Id: I3ec8c5326faced6d4b5c878f26f37a281b03bd95
2012-06-12 19:19:48 -07:00
Trevor Parscal 0e0d9fbb50 Added localhost to available sources and made it default
We should probably add some sort of LocalSetting.php type system for environment specific settings like this and git ignore it

Change-Id: I362296ea286d0348e023d1673bba55e098229280
2012-06-08 15:21:20 -07:00
Gabriel Wicke 3f61dc9821 Link talk page separately
Change-Id: Ib839f619e7e14ccf0ef698fc2e780ef4b0d65505
2012-06-07 13:42:05 +02:00
Gabriel Wicke 3549a16085 Add a 'report issue' link below round-trip results
Change-Id: I5e3a785a328af0debcf83dc2038b5e5417fa5158
2012-06-07 13:37:40 +02:00
Gabriel Wicke bec7fb2f8c Mention citations as not round-tripping
Change-Id: I57e25f6f4072bae2f5681b8611e98f899875d1e2
2012-06-07 13:18:44 +02:00
Gabriel Wicke 76cca063ba Add hint on where to support issues in web service entry page
* Explain what we are currently interested in and link to
  :mw:Talk:Parsoid/Todo.

Change-Id: I747c6ee8a021a7a73ec91b73281c1c679a00da8f
2012-06-07 13:16:05 +02:00
Gabriel Wicke 1ca586e5f1 Improve interwiki config a bit
* Moved wikipedia default prefixes to environment
* Added 'addInterwiki' method
* Adjusted link handling normalizeTitle to reflect this

Change-Id: If5b2314cc36346b6da8649ed410457a612d80a22
2012-06-07 12:30:16 +02:00
Gabriel Wicke 2fa5baabbb Make it easier to configure the default wiki, and add support for mediawiki.org
* mw:Foo now loads pages from mediawiki.org
* The default prefix still is 'en'. You can switch this to 'mw' in ParserService.js.

Change-Id: I1208667e6114bd711b7988a8b3adb32ffab70969
2012-06-07 11:50:40 +02:00
Gabriel Wicke 413df0c471 Strip \r from form input- we normalize everything to Unix
Change-Id: I5cd255e1a7ab9958f120fad408362e6f709e4b91
2012-06-06 19:26:29 +02:00
Gabriel Wicke 47204c4ca0 Use diffChars instead of diffWords, as the former misses some changes
The improved merge algorithm now makes diffChars output more palatable. Things
could still be improved by collecting single-character 'neutral' changes in a
block of 'add' changes and converting them to adds / removes.

Change-Id: I8439e8acab4360c08b89d9ce8a6b8523e7a0a210
2012-06-06 18:36:28 +02:00
Subramanya Sastry f8221b128b Used a more robust heuristic for merging consecutive diffs
- Check if consecutive diffs are separate by 1 word in addition
  to max 3 chars.  This takes care of diffs introduced by template diffs
  separated by the template name and creates a clean single diff.

Change-Id: I9181d2ed9a07bee6ca5d5ebd6ddea84f7e2cecac
2012-06-06 11:01:47 -05:00
Gabriel Wicke 2bc066b42d Up the diff merge size heuristic a bit and always use the same algorithm
Change-Id: I707c8a55ed1758cdd591d2fc95e03a360c8e76d1
2012-06-06 17:46:25 +02:00
Gabriel Wicke bc1a77a812 Make modified newlines visible by replacing empty lines with a space
Change-Id: If7b811245e0d01a7a147ab54c3801fc1754730a9
2012-06-06 17:11:29 +02:00
Gabriel Wicke 1876d785a7 Swap ins/del in the diff
Change-Id: Id336d713d1767a4b7859b158f2c2ddf9adc11cfb
2012-06-06 16:02:54 +02:00
Subramanya Sastry bff08b799e Improvement to the refineDiffs function to improve diff quality.
* Attempt to accumulate consecutive add-delete pairs
  with "short text" separating the pairs.  This is equivalent to
  the <b><i> ... </i></b> minimization to expand range of
  <b> and <i> tags, except there is no optimal solution except
  as determined by heuristics ("short text": <= 2 chars).

Change-Id: I408e318c315eba18aac4051ed84d77e3e092d497
2012-06-06 00:08:00 -05:00
Gabriel Wicke c1d8270bdb Fix wgScriptPath in round-trip mode without interwiki
Change-Id: I7cc80b7be1afffc586a2ea45d21303e9ba07c0d4
2012-06-05 12:11:45 +02:00
Gabriel Wicke 3346aed86e Support interwiki links, and some cleanup
Change-Id: I205c53a03f5230e3ef9100487f4934f97bdc179a
2012-06-05 12:05:33 +02:00
Gabriel Wicke cc96ff4f5e Very basic interwiki support
Pages titles with a wikipedia interwiki prefix now load the page from
corresponding Wikipedia. Links in a page then stay within the given language.

Note that Parsoid currently makes no effort to recognize localized namespaces,
so it won't render media files, categories etc correctly.

Change-Id: I7bc4102e81a402772ea23231170734d580ea15b9
2012-06-05 11:19:58 +02:00
Gabriel Wicke 0eabd2c67e Add round-trip form and split out rt diffing
Change-Id: I3bc8ad7f273937ce6c767b8d7bbccdc86cbd93b4
2012-06-04 10:49:59 +02:00
Gabriel Wicke 99c98d6c56 Diff refinement fixes
Change-Id: I11c69de0fdcd636ccd11cd0b6cb16c5acdb188b3
2012-06-04 10:16:05 +02:00
Gabriel Wicke d2602c47a6 Switch back to word-based diff
The char-based diff looked good in some pages, but yielded terrible results in
others. The word-based algo is more consistent overall.

Change-Id: I7f2d40315ad96df037c2d9a1d50739e3d21b6c81
2012-06-04 00:02:49 +02:00
Gabriel Wicke d01581c380 Create a 'refinement diff' algorithm
The word or char-based algorithm does not scale well beyond 5k chars or so. We
now perform a line-based diff and then continue to diff the line differences
using the char-based algorithm. This gives a char-based diff even for bigger
inputs.

Change-Id: Iec87ca56540060e4df2859ba54c992e7ff5cfe10
2012-06-03 23:46:57 +02:00
Gabriel Wicke b11b8d8a6b Revert to line diff, word diff explodes on some pages
Change-Id: Ic338498b47bb6b6c98fa6280f44464cd70a48b1b
2012-06-03 11:39:03 +02:00
Gabriel Wicke b5e067e086 Some more web service tweaks
* Stay in round-trip mode in HTML DOM output
* Return DOM, wikitext and diff as soon as they are available

Change-Id: I7f8f44cfe8eed63a521d1318d116c22232cb6b1b
2012-06-03 11:04:40 +02:00
Gabriel Wicke 7c18891504 Snazzy html word diff for roundtrip view
Also show the HTML DOM, Wikitext output and diff.

Change-Id: Ibe744fbc895239f4e48f6e0e2f2b2f345c0845bd
2012-06-03 01:36:56 +02:00
Gabriel Wicke 4cf74497b7 Update web service start page documentation
Change-Id: I38efc5a9d5b919c6168cf97d0efbae9db967e351
2012-06-02 17:17:37 +02:00
Gabriel Wicke 7c7ddd22a7 Retrieve content from the main namespace instead of templates
Change-Id: Id917fa617d6fba1e1b290b2ed20c24aed24d39d2
2012-06-02 16:48:00 +02:00
Gabriel Wicke d3975a8d03 Very basic round-trip test mode for the API
Returns both the resulting wikitext and the diff with the original input.

Change-Id: Iad25039beb054a84e1ad51ffa9fee924db49c60b
2012-06-02 16:20:54 +02:00
Gabriel Wicke c52c24b0cb Slightly improve formatting of web service; test commit message tweak
Change-Id: Ibac3ce3dd9aa2c4faf11eed351fea941ebf1e4b3
2012-05-25 16:36:14 +02:00
Gabriel Wicke 540d14d8fe More tweaks to the intro message
Change-Id: I059ba4fc584eae45092376b5b2258df7ed52b55c
2012-05-24 15:10:03 +02:00
Gabriel Wicke 987ff8aa84 Add a longer welcome/help message and a link to the Parsoid docs
Change-Id: I39e4873908bc0619738353a55577aa81abb76287
2012-05-24 14:58:35 +02:00
Gabriel Wicke e70448e53a Use text/x-mediawiki content type, and handle tokenizer errors without --debug
Change-Id: I154cd344306aa05ada7ff30f631d487f39fa9739
2012-05-24 10:19:25 +02:00
Gabriel Wicke 6dac7de2f4 Capital T in second part of Content-Type
Change-Id: I20a9af9a8c0e05e96c77c9ac7566ea7bc1a6dabd
2012-05-23 18:12:09 +02:00
Gabriel Wicke d6af3b3375 Improve the serializer and its output display in the web service
Change-Id: Id3ca96846cad42517d7d4bada8f4bb250d54247b
2012-05-23 17:50:35 +02:00
Gabriel Wicke 95496c02db Add an extra newline before headings, and ignore favicon.ico requests
Change-Id: Ibacac3453afefa5dbe803c1e0260e8c943785f12
2012-05-23 17:17:54 +02:00
Gabriel Wicke 21286a50df Make sure pageName is set in the web service, and handle empty page name in parser function
Change-Id: I5d36eefecc2f35a860d00a8960004f8e651ed17c
2012-05-23 16:43:45 +02:00
Gabriel Wicke b89f5071e5 Basic parser / serializer web service
* After installing Parsoid (sudo npm install -g in modules/parser), run 'node
  server.js' from the api directory and navigate to http://localhost:8000/ and
  follow the directions. You can start to navigate the English wikipedia at
  http://localhost:8000/Main_Page, or manually enter wikitext or HTML DOM to
  convert.
* Uses the express framework, could also use just connect
* Uses the cluster module to manage workers per-core and restart those on
  failure

Change-Id: I443f2996ed3df00826b038b7476a2f966ab0c425
2012-05-23 12:35:00 +02:00