Commit graph

425 commits

Author SHA1 Message Date
Gabriel Wicke dc3168cf6d A few tweaks to link round-tripping
* Moved the tail attribute to the second attribute (a bit cleaner)
* Disallowed newlines in the tail production
* Improved the selection of round-tripped href vs. generated content vs. href
  in the serializer
* renamed state.linkTail to state.dropTail

Change-Id: I5d98c704b6ea566011e22237786f8da17548570f
2012-06-05 17:26:27 +02:00
Gabriel Wicke d16032ae9a Track html syntax in block_tag production
Change-Id: If560523644f007485809762f12216e08fb3c3ed3
2012-06-05 12:39:56 +02:00
Gabriel Wicke cc96ff4f5e Very basic interwiki support
Pages titles with a wikipedia interwiki prefix now load the page from
corresponding Wikipedia. Links in a page then stay within the given language.

Note that Parsoid currently makes no effort to recognize localized namespaces,
so it won't render media files, categories etc correctly.

Change-Id: I7bc4102e81a402772ea23231170734d580ea15b9
2012-06-05 11:19:58 +02:00
Gabriel Wicke 92f753a365 Pre and link target improvements
* Don't explicitly add the newline in the pre, as we preserve newline tokens
  now. This avoids doubling of newlines when round-tripping.
* Use the sHref attribute even if the href contains spaces.

Change-Id: I8bec8fbfd6a7836bf2e5eec20869a0edd95c93b6
2012-06-04 14:03:05 +02:00
Gabriel Wicke ee2ddbd3cb Fix list handler issues
Lists interrupted by non-empty lines would not close the list properly.
Register for any token instead of just for newlines and close the list if no
listItem follows the newline.

Change-Id: I1743901e3db541bbeda78d17707db943e6ceb9b9
2012-06-04 13:38:43 +02:00
Gabriel Wicke f821eac102 Optionally round-trip sHref in data-mw
If the href would not denormalize, add a copy of the original href in data-mw
and use it to preserve non-conventional capitalization etc.

Change-Id: Ifef50eec7343b0e6b0ba66b6d19a8a3e8c9f8001
2012-06-04 12:28:05 +02:00
Gabriel Wicke e0809209ec Don't set the data-mw attribute if the object is actually empty.
Change-Id: I984f1b44bba67d7a9f1a709738d14c0ee02f69a9
2012-06-04 12:26:03 +02:00
Gabriel Wicke 2774e5aa6c Actually replace all underscores in wikilink target
Change-Id: I633f8d6e4f639aff90fd456600376b7c6515fd50
2012-06-04 11:48:59 +02:00
Gabriel Wicke 3f2c72f920 Fix padleft / padright (mis)use as substr
Change-Id: I0645e11c8ef8b550ad35300d1904788940fc748a
2012-06-04 11:30:45 +02:00
Gabriel Wicke 4533c274ca Fix a crasher in the serializer
A tail containing regexp syntax (a ? in [[:en:Main Page]]) would crash the
serializer. Use substr instead.

Change-Id: I8519aec9c07dfe31893d676b1c936a42d2af74a0
2012-06-04 00:00:54 +02:00
Gabriel Wicke 31522d3d49 Add ApiRequest
Change-Id: I5f2a1cb65223a68f10bc63903000248efca05586
2012-06-02 16:52:51 +02:00
Gabriel Wicke 63abd57fc8 Improve newline-before-paragraph round-tripping support
Change-Id: I9176a97f9695018650d9a63b89514c07e0d6be90
2012-06-02 16:39:33 +02:00
Gabriel Wicke d3975a8d03 Very basic round-trip test mode for the API
Returns both the resulting wikitext and the diff with the original input.

Change-Id: Iad25039beb054a84e1ad51ffa9fee924db49c60b
2012-06-02 16:20:54 +02:00
Gabriel Wicke 74135b295f Some more switch fixes
Change-Id: If1a6086348c45a73a941bc8e6728ef75d002be50
2012-06-02 15:04:20 +02:00
Subramanya Sastry 8f216af2f5 Handle link tails properly.
- Added a tail json attribute for wikiLinks
- During serialization, this attribute is used to strip the tail from
  the link target and render it after the link

  [[hen]]s ==> <a ... data-mw="{gc:1, tail: 's'}" ...>hens</a>
           ==> [[hen]]s

- 2 more roundtrip tests green

Change-Id: I84f3dabaf0271f7a67641a00148467daa8310eb0
2012-06-01 23:41:10 -05:00
Subramanya Sastry 413fc5e043 Fixed bug serializing wikilinks with implicit link text.
* Simple fix but greens 10 more roundtrip tests.

Change-Id: I7f82d788a10bd83e0e3215568c2168081c332c50
2012-06-01 17:25:21 -05:00
Gabriel Wicke 16219ddc6d Fix up #switch a bit
* Re-establish the value-only default
* Fix value expansion

Change-Id: I32e62789b25bbe17a74c564e41e9101ad5528fb7
2012-06-01 22:15:43 +02:00
Gabriel Wicke e2301813ed Merge "Tokenizer backtracking cache bug fix and memory savings" 2012-06-01 12:06:00 +00:00
GWicke befd223476 Merge "First pass implementing a general tag minimization routine" 2012-06-01 11:15:48 +00:00
Gabriel Wicke ece2b0f810 Tokenizer backtracking cache bug fix and memory savings
* The state of syntax stops is now properly included in the cache key for the
  tokenizer-internal backtracking cache. This fixes some mis-parses when
  re-parsing a bit of text with different flags.
* Clear the backtracking cache after each toplevelblock. This drops the peak
  memory usage when expanding [[:en:Barack Obama]] from ~380M to ~110M.

Change-Id: Icdb879cae5907e4595903dd6acba2e686e8c2e4b
2012-06-01 12:53:49 +02:00
Subramanya Sastry 1c80e2d7f0 First pass implementing a general tag minimization routine
* This routine attempts to rewrite the DOM to maximize tag overlap
  and thus minimize tag uses.

* This takes as input a set of tags which participate in the
  minimization.

* Tested on the following example
  <b><i><u><s>BIUS</s></u></i></b><b><i><s>BIS</s></i></b><b><u><s>BUS</s></u></b><u><i>UI</i></u>
  with multiple combinations of the 2^4 possible variations of i,b,u,s
  tags: [], ['i','b','u','s'], ['i'], ['b','s'], ['i','b','u']

  - But, I am not fully sure if this implements the right behavior when
    only a subset of inline tags are provided.  Needs discussion and tweaking
	 as necessary.

* Also tested on few others:
  <b>B</b><b><i>BI</i></b><b><i><u>BIU</u></i></b><b><i><u><s>BIUS</s></u></i></b>
  <s><i><b>SIB</s></i></b><s><i><u>SIU</u></i></s><i><u>IU</u></i><i>I</i>

* The previous pairwise tag rewriting version fails on several of these
  examples, so this new version is a definite improvement.

* No change in parserTests run (203 passing before and after).

* Possible improvements that could/should be undertaken:
  - get rid of useless/idempotent add/remove of nodes that don't change
    the DOM.
  - ensure that node attributes post-restructuring are correct.

Change-Id: Ib4a8b39583fa96a2be880a77021ca81cefa06484
2012-05-31 12:10:28 -05:00
Gabriel Wicke 4ea6b8e2be Revert part of last template syntax tweak
Change-Id: I084e1210577f80c3b96020d57cfa5c68eb5d139b
2012-05-31 12:02:42 +02:00
Gabriel Wicke c5d7e01944 Another tokenizer robustness improvement
This patch fixes a tokenizer syntax error encountered on
[[:en:Template:JacksonvilleWikiProject-Member]] and [[:en:Template:Infobox
former country]] by allowing optional whitespace before start-of-line template
syntax.

Change-Id: Ic214a731de58bf766e51f23d5e24ea2ce6788f58
2012-05-30 18:38:23 +02:00
Gabriel Wicke a133768781 Don't eat '}}' in generic attributes and similar productions
This fixes some syntax errors, at least one in Template:Geobox.

Change-Id: I32338febe25d0833c1d9bc4de293cd15b4cbb7be
2012-05-30 17:37:10 +02:00
Gabriel Wicke 36084c5d93 Preserve original newlines in HTML and serialization
254 round-trip tests (up from 184) are now passing.

Also:
* tweaked runtests.sh slightly (use less -R instead of -r).
* made sure the EOFTk is preserved in phase 3 transforms

Change-Id: I1de22186bdb78e52019370e43f096877005b8f5a
2012-05-29 23:29:03 +02:00
Subramanya Sastry 8174c9dafc First attempt implementing rewriting rules on the DOM
- This is implemented as a post-processing pass.
- Might require additional checks to verify rewriteability.
- Implemented as a pair-wise tag DOM minimization strategy,
  i.e. it takes tag pairs (B, I) for ex, and attempts to
  normalize the tree just for those tag pairs.  Normalizing
  across multiple tags is implemented as pairwise rewriting
  across all pairs:  Ex:(b,i), (b,u),(i,u) for (b,i,u)
- Copied over attributes as part of rewriting, but some of the
  attributes lose their meaning on rewriting since tags are
  reordered (ex: sourcePosn, sourceTagPosn). How do we handle this?

Output examples and possible issues to fix:
   <i><b><u>biu</u></b></i><b><u>bu</u></b><u>u</u>
gets rewritten to:
   <u><b><i>biu</i>bu</b>u</u>

But, the equivalent wikitext form:
   '''''<u>biu</u>''''''''<u>bu</u>'''<u>u</u>
does not get rewritten because of parsing differences.
This wikitext gets parsed into:
   <i><b><u>biu</u>'''</b></i><u>bu<b>u</b></u>
The extra ''' token in the middle thwarts DOM rewriting.

However, a slightly different version:
   "'''''<u>biu</u>''<u>bu</u>'''<u>u</u>"
gets properly normalized to:
   <u>'''''biu''bu'''u</u>

An alternative, but fun strategy to play with is to use the following
two normalization primitives: S(wap) and M(erge).
- S rewrites T1(T2(x)) into T2(T1(x))
  (ex: <b><i>foo</i></b> ==> <i><b>foo</b></i>)
- M rewrites (T(x),T(y)) into (T(x,y)).
  (ex: <b>foo</b><b>bar</b> ==> <b>foobar</b>)

The current rewriting strategy could possibly be re-implemented as S-M
rewriting.  The problem to solve there would be to find an efficient
rewriting strategy that is guaranteed to lead to a normal form.  I may
not play with it now, but just documenting it for later (to play with
in my spare time).

This commit is just as a record of fun/experimental code where I get to
learn details of JS, wikitext, parsing, and DOM manipulation.  Next
version of this code will attempt to introduce minimal DOM restructuring
across multiple tags at once which can be more efficient.

gwicke: Removed now passing test from whitelist, and updated another whitelist
entry which is now improved.

Change-Id: Ie97bcb164eb62c34ba61aa76ba2f4c232aa713d8
2012-05-29 08:17:57 +02:00
Gabriel Wicke b2adee0ae7 Basic rt support for indent pre variant
* Added a generic stx_v 'syntax variant' round-trip attribute
* For pre, use stx:'html' vs. no syntax annotation. This might not be 100%
  safe for arbitrary html input, so we might want to flip this to stx:'wiki'
  later.
* 181 round-trip tests passing

Change-Id: If6080917a3a7c069066db3db60efe59b1f6c28d8
2012-05-25 18:55:38 +02:00
Gabriel Wicke a31ccaabe4 Support definition lists with empty definition
Change-Id: I81c39a7e49f2ea7ce32cdd3600caeb5eb9f50d84
2012-05-25 15:40:32 +02:00
Gabriel Wicke 06b51b1f3f Properly round-trip dd/dt; 178 round-trip tests passing.
Need to track variable whitespace before elements to make some more tests
pass.

Change-Id: Ia86535d6f352e2ffe7965547cd506b0dbb6dfba2
2012-05-25 13:59:55 +02:00
Gabriel Wicke 6f62878c78 Resolve subpage links, and remove hack for H: titles
Change-Id: I6c9c64179274e5c1641a3b127ac3b273a3c5254e
2012-05-24 17:57:41 +02:00
Gabriel Wicke dc61f313a2 Notes on missing parser functions, more error reporting tweaks
Change-Id: Ib6ce60cf1b55671a6ff57aa47edb5787ec3aefea
2012-05-24 17:31:26 +02:00
Gabriel Wicke cc10aab54f Add self alias
Change-Id: I47682f407da6b554179611c7d0f63f882ab5a871
2012-05-24 17:16:35 +02:00
Gabriel Wicke 13ae7cda11 A few (partly hackish) improvements
* Very basic support attribute key-value pairs emitted from templates
* Add TALKPAGENAME stub implementation
* Only show 'no revisions' message for top-level pages

Change-Id: I4b4ac0c7b2c0531ac4b39f0f49f4217302576ab9
2012-05-24 16:30:26 +02:00
Gabriel Wicke 3e0e11b1d0 Sanity check for tokens being an array
Change-Id: Ia4e4071e1469c31e3b320d854500938bb0245f82
2012-05-24 14:35:58 +02:00
Gabriel Wicke 93ce7453f0 Fake fullpagename et al a bit better
Change-Id: I85ddf9e88e5f8ac274f371bea0879600997001e4
2012-05-24 11:05:31 +02:00
Gabriel Wicke cdd1eca42d Fix non-existing revision error reporting
Change-Id: I6b8687bcde98b92d9d6217a738a177db279fd006
2012-05-24 10:50:47 +02:00
Gabriel Wicke f03fc39d15 Report missing revisions when retrieving templates
Change-Id: I9f33acafc4d3fbd062125d824e2614dafd4cd5a0
2012-05-24 10:45:01 +02:00
Gabriel Wicke caf2fa663d Keep going on tokenizer errors
Change-Id: I76fab4528f89b425845aef1685b3a54ddfeceef4
2012-05-24 10:30:32 +02:00
Gabriel Wicke e70448e53a Use text/x-mediawiki content type, and handle tokenizer errors without --debug
Change-Id: I154cd344306aa05ada7ff30f631d487f39fa9739
2012-05-24 10:19:25 +02:00
Gabriel Wicke 4cc2d25e70 Fix a debug print reference error
Change-Id: Ic26d29aced4129c3dd718c4751dadb62a0be1a27
2012-05-23 20:52:45 +02:00
Gabriel Wicke d6af3b3375 Improve the serializer and its output display in the web service
Change-Id: Id3ca96846cad42517d7d4bada8f4bb250d54247b
2012-05-23 17:50:35 +02:00
Gabriel Wicke 95496c02db Add an extra newline before headings, and ignore favicon.ico requests
Change-Id: Ibacac3453afefa5dbe803c1e0260e8c943785f12
2012-05-23 17:17:54 +02:00
Gabriel Wicke 21286a50df Make sure pageName is set in the web service, and handle empty page name in parser function
Change-Id: I5d36eefecc2f35a860d00a8960004f8e651ed17c
2012-05-23 16:43:45 +02:00
Gabriel Wicke a862718ad8 Add some checks against undefined tokens returned from async transforms
Change-Id: Ie19537083b96b1b2e12e1c4b65a7a044753c18ac
2012-05-23 16:32:21 +02:00
Gabriel Wicke a4c5d43ff7 Fix an external link regression, and add server shell wrapper and setup docs
Change-Id: I9a4f7690e98313d003a2fec35324ed70556e6461
2012-05-23 16:25:42 +02:00
Gabriel Wicke b89f5071e5 Basic parser / serializer web service
* After installing Parsoid (sudo npm install -g in modules/parser), run 'node
  server.js' from the api directory and navigate to http://localhost:8000/ and
  follow the directions. You can start to navigate the English wikipedia at
  http://localhost:8000/Main_Page, or manually enter wikitext or HTML DOM to
  convert.
* Uses the express framework, could also use just connect
* Uses the cluster module to manage workers per-core and restart those on
  failure

Change-Id: I443f2996ed3df00826b038b7476a2f966ab0c425
2012-05-23 12:35:00 +02:00
Gabriel Wicke febb912ead No end delimiter after template row attributes
Change-Id: Iba304fb797d221e2d65ae055d266bff2f6301df8
2012-05-23 09:30:07 +02:00
Gabriel Wicke 39c6f42879 Link round-tripping and other improvements
* Changed RDFa for links according to
  http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary
* Added basic support for internal/external link serialization
* Moved numbering of external links from tokenizer to LinkHandler
* Added round-tripping for generic HTML tags
* Replaced nowiki tag with <meta typeOf="mw:tag" content="nowiki"> and <meta
  typeOf="mw:tag" content="/nowiki"> for now.
* 154 round-trip tests passing (node parserTests.js --roundtrip).

Change-Id: I16c4db21b1b543ee57c73e569c83025b64664542
2012-05-22 13:36:06 +02:00
Gabriel Wicke 7e21b7380a Merge "Round-trip nowiki" 2012-05-21 17:16:56 +00:00
Gabriel Wicke fb7d5418a5 Round-trip nowiki
Change-Id: I5f7e6a43f5fdc1708ee710b2a601b20db733452c
2012-05-21 18:06:09 +02:00
Gabriel Wicke a6610e52c2 Serializer and table round-tripping improvements
* added stx: 'html' round-trip information for html tags
* added t_stx: 'row' info for row-wise table wiki syntax, and support for it
  in the serializer
* the first table row is implicit in wikitext
* renamed lastToken to prevToken in serializer
* strip first newline in an initial chunkCB

Change-Id: I014b046539d1b674d830551c5fd1b74a67f81993
2012-05-21 14:59:53 +02:00
Gabriel Wicke e069e7cb1c Merge "Support table captions and properly delimit the end of table options" 2012-05-21 12:51:58 +00:00
Gabriel Wicke 54e75b93b7 Support table captions and properly delimit the end of table options
Change-Id: I15eb8df19528cfceadfee368370501b30f0e36a0
2012-05-18 10:46:43 +02:00
Gabriel Wicke c39eb36968 Use outerHTML to serialize unhandled DOM node in serializer
Change-Id: I37350712c9450c34025740a8d6de51344739c2b7
2012-05-18 10:03:16 +02:00
Gabriel Wicke 3c6d829708 Fix first bug caught by new roundtrip mode for parserTests
Change-Id: Id152fd29606d8ee34ac300945f41e2a5f48f087f
2012-05-18 09:55:22 +02:00
Subramanya Sastry ae4810b201 Renamed items to itemCount for better code readability.
Change-Id: I53851c07a4746928fddec4b3737136f081d49178
2012-05-17 12:32:46 -05:00
Subramanya Sastry 58da03bc85 Track list prefixes in the list start handler and use them to output
serialized text in list item handlers.

Change-Id: Ic7562d531d2313bedcf3b7450b4f28f02bc2b5a3
2012-05-17 12:12:46 -05:00
Gabriel Wicke e2815b516c Start to handle links
Change-Id: I1fb975910651820fd889d77152562fd4fbcb5db8
2012-05-17 14:32:56 +02:00
Gabriel Wicke b7fd4498a9 Use single _serializeToken handler for both DOM and tokens
Change-Id: I45e1d90b53a5ddc678f7744f27274bebcfc375fe
2012-05-17 13:20:39 +02:00
Gabriel Wicke 8dbc2f573f Simplistic wikitext round-tripping with parse.js --wikitext
Lists are a bit tricky, as nested lists are not wrapped in a separate list
item. Should work now though.

Change-Id: I2e5f29f6afa6bdd2d5e5c0c5d019b70c611b73d1
2012-05-17 12:44:46 +02:00
Gabriel Wicke 3414418b1f Don't eat newline tokens in the ListHandler
This fix only affects following transforms, of which there are few right now.
Also removed a stray token mutation in QuoteTransformer.

Change-Id: Id6d4adce944b06fc1a3651cfbf63fc2670125225
2012-05-16 23:14:21 +02:00
Gabriel Wicke 542921b5a3 Removed html5 parser patch no longer needed with 0.3.8
Change-Id: Id8c23d34e8cca49a360f536e792144a85a8468a3
2012-05-16 12:06:42 +02:00
Mark Holmquist 96ee9ad45c Add a new wikitext serializer, with limited functionality.
This isn't finished at all, but Gabriel wants to take a crack at it,
so here it is!

Change-Id: I9732aa141f7c69a28c8f5978cb18180e93cb9eda
2012-05-15 10:41:28 -07:00
Gabriel Wicke d918fa18ac Big token transform framework overhaul part 2
* Tokens are now immutable. The progress of transformations is tracked on
  chunks instead of tokens. Tokenizer output is cached and can be directly
  returned without a need for cloning. Transforms are required to clone or
  newly create tokens they are modifying.

* Expansions per chunk are now shared between equivalent frames via a cache
  stored on the chunk itself. Equivalence of frames is not yet ideal though,
  as right now a hash tree of *unexpanded* arguments is used. This should be
  switched to a hash of the fully expanded local parameters instead.

* There is now a vastly improved maybeSyncReturn wrapper for async transforms
  that either forwards processing to the iterative transformTokens if the
  current transform is still ongoing, or manages a recursive transformation if
  needed.

* Parameters for parser functions are now wrapped in abstract Params and
  ParserValue objects, which support some handy on-demand *value* expansions.
  Keys are always expanded. Parser functions are converted to use these
  interfaces, and now properly expand their values in the correct frame.
  Making this expansion lazier is certainly possible, but would complicate
  transformTokens and other token-handling machinery. Need to investigate if
  it would really be worth it. Dead branch elimination is certainly a bigger
  win overall.

* Complex recursive asynchronous expansions should now be closer to correct
  for both the iterative (transformTokens) and recursive (maybeSyncReturn
  after transformTokens has returned) code paths.

* Performance degraded slightly. There are no micro-optimizations done yet
  and the shared expansion cache still has a low hit rate. The progress
  tracking on chunks is not yet perfect, so there are likely a lot of unneeded
  re-expansions that can be easily eliminated. There is also more debug
  tracing right now. Obama currently expands in 54 seconds on my laptop.

Change-Id: I4a603f3d3c70ca657ebda9fbb8570269f943d6b6
2012-05-15 17:05:47 +02:00
Catrope c256ea7d71 Fix fatal error in parse.js
Trying something trivial like echo 'Hello world' | node parse.js
would throw TypeError: Function.prototype.apply: Arguments list has wrong type

Change-Id: Ia0a1154b0f3edbfb1f228a1d2072fced1b147141
2012-05-10 12:04:57 -07:00
Gabriel Wicke b1bd0d73ec Don't eat end token in ListHandler, and lazier Quote handler registration
* Setting the rank on tokens is still used currently, but will be phased out
  in favor of setting it on chunks. Tokens will be immutable to allow sharing
  and caching without a need for cloning.
* Only register for newline and end tokens in QuoteTransformer when active.

Change-Id: I2c45bc7e4a105219a1404ab221eed7f242128f1e
2012-05-10 09:47:53 +02:00
Adam Wight 0a7f0b7630 List markup is created during the sync23 phase.
This makes it possible to transclude list items from a template.

Note: "5 quotes" test is broken by this patch, it appears that ListHandler
newline processing is changing some state which mysteriously affects the
QuoteTransformer.  This is ominous, hopefully there's a simple explanation...

gwicke: fix a bug in tokenizer triggered by definition lists like this:
**; foo : bar

Change-Id: I4e3a86596fe9bffcbfc4bf22895362c3bf742bad
2012-05-08 11:39:36 +02:00
Gabriel Wicke 909633ea08 Improve template / tplarg precedence in tokenizer
Change-Id: If9b24b42ea223e0f30f906a83496d73ec60c4a0d
2012-05-04 13:17:06 +02:00
Gabriel Wicke 8a30f76370 Use upright option, including the 0.75 default width
Change-Id: Iacdf6173e0ee8f58ca4385fd9b2cde77b2fdf3c4
2012-05-04 11:15:35 +02:00
Gabriel Wicke 57dfd89383 Handle upright option properly
Change-Id: I831fcccf874f9a0505e88eb76d269b1d2f68e3e0
2012-05-03 16:15:34 +02:00
Gabriel Wicke c4fc7508a7 Add basic # REDIRECT handling
Change-Id: I71f659201c1d5de4a528ddfac7f65bf20a89f97d
2012-05-03 15:54:36 +02:00
Gabriel Wicke 6ab017308b Only specify the width for thumbnails to keep the aspect ratio
Change-Id: I4e55ff719da6cb58f396ad6043e46acaed4a504d
2012-05-03 15:36:42 +02:00
Gabriel Wicke 6139398494 Reduce debugging overhead a bit, and provide default internal image size
Change-Id: I345af8c5905a5fa747f9ed342ba2ba8c1026d044
2012-05-03 14:49:55 +02:00
Gabriel Wicke 6e21f6bb27 Forward-port Cite extension
* Adapted Cite extension to use current interfaces and token formats
* Improved TokenCollector

Change-Id: I20419b19edd9bbad2c2abf17a2ff1411b99c0c04
2012-05-03 13:22:01 +02:00
Gabriel Wicke 2291fe8364 Reduce the need for token cloning slightly
Change-Id: I31c71bddca4855afdffc3fe5c8d759cfa1994d86
2012-04-27 23:12:25 +02:00
Gabriel Wicke 5fb2c46073 Clone cached tokens, and fix switch for empty needle
Change-Id: I63946e5a56f6fd7dd30d00b12d36032dd1dd0017
2012-04-27 15:59:01 +02:00
Gabriel Wicke ed8cb54831 Simplify transformToken slightly, and fix JSHint warnings
Change-Id: I95769ed063ea855a9109148f5db83ea43f423e56
2012-04-27 15:31:30 +02:00
Gabriel Wicke 2d7b4a2a59 Make .to more consistent and add optional parentCB arg
* parentCB (if set) is called with { async: true } if expansion is going to be
  asynchronous.
* Strings are handled efficiently
* all value parameter chunks can now be converted using .to().

Change-Id: Ib013e1bc3d8e7f692009038209db6a056887326e
2012-04-27 13:57:23 +02:00
Gabriel Wicke fd1a67aa16 Add .to('text/plain/expanded', cb) support and convert ifeq to use it
Change-Id: I99c78de12fed41ba36811402f7ecacb420391d70
2012-04-27 12:18:30 +02:00
Gabriel Wicke 30a83d7fd7 Accept wikilink parameters with dangling equal ('|arg=|')
Change-Id: Ib4f6d186da2a74522b17c377dac5c9a7de7e5861
2012-04-27 11:35:00 +02:00
Gabriel Wicke 1d70e7b81c Disable preformatted text from indents in template args
Change-Id: I84144d3fab6541ed264d9b092806c8bf9de6e8b2
2012-04-27 10:45:08 +02:00
Gabriel Wicke 56d6757f67 Fixes for the template fetch retry feature
Change-Id: Id36cb02c535d07f4f2cdd54ae682b6a144a2faa9
2012-04-26 20:31:23 +02:00
Gabriel Wicke 027d77e0c9 Fix --wikidom and --linearmodel parse.js options; retry on template fetch failures
Change-Id: I444397936fd87971fe085df4b467089367e9ffa6
2012-04-26 19:51:00 +02:00
Gabriel Wicke 3be4992782 'Obama finally expands' ;) Misc fixes and documentation updates
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
  while it would prevously run out of RAM after ~30 minutes. Wohoooo!
  The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
  tests are passing since these tests expect the day of the week to be
  Thursday.  Won't be the case tomorrow.

Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136
2012-04-26 18:18:08 +02:00
Gabriel Wicke 8ff810659a Rename text/wiki and tokens/wiki to text/x-mediawiki and similar
Change-Id: I70113629f4633685cd6db3914303a15e4c79a50a
2012-04-25 20:19:43 +02:00
Gabriel Wicke 814511f523 Remove dead parser pipeline code
Change-Id: I802f1798d5163c1ce82d648f739c2e79b17eda41
2012-04-25 17:12:32 +02:00
Gabriel Wicke 8368e17d6a Biggish token transform system refactoring
* All parser pipelines including tokenizer and DOM stuff are now constructed
  from a 'recipe' data structure in a ParserPipelineFactory.

* All sub-pipelines of these can now be cached

* Event registrations to a pipeline are directly forwarded to the last
  pipeline member to save relatively expensive event forwarding.

* Some APIs for on-demand expansion / format conversion of parameters from
  parser functions are added:

  param.to('tokens/expanded', cb)
  param.to('text/wiki', cb) (this does not work yet)

  All parameters are additionally wrapped into a Param object that provides
  method for positional parameter naming (.named() or conversion to a dict
  (.dict()).

* The async token transform manager is now separated from a frame object, with
  the frame holding arguments, an on-demand expansion method and loop checks.

* Only keys of template parameters are now expanded. Parser functions or
  template arguments trigger an expansion on-demand. This (unsurprisingly)
  makes a big performance difference with typical switch-heavy template
  systems.

* Return values from async transforms are no longer used in favor of plain
  callbacks. This saves the complication of having to maintain two code paths.
  A trick in transformTokens still avoids the construction of unneeded
  TokenAccumulators.

* The results of template expansions are no longer buffered.

* 301 parser tests are passing

Known issues:

* Cosmetic cleanup remains to do
* Some parser functions do not support async expansions yet, and need to be
  modified.

Change-Id: I1a7690baffbe8141cadf67270904a1b2e1df879a
2012-04-25 16:51:36 +02:00
Demon 28e44b1d0f Merge "Add --wikidom flag to parse.js" 2012-04-25 14:18:59 +00:00
Catrope 47969e20a1 Add --wikidom flag to parse.js
Also remove unused import of DOMConverter

Change-Id: I1eabe6bf9935970c1f049681b52e867a510ea77a
2012-04-23 15:01:12 -07:00
Gabriel Wicke e2ca8c24c7 Delay some token duplication until actual mutation happens
This is a bit better than cloning tokens wholesale, but not by much. There is
a lot of potential for much better per-token caching with reduced token
cloning. Need to map out all dependencies besides token attributes expanded
from template parameters or other scoped state. Even if tokens themselves
don't need transformation, they might still need to be considered for other
token transformers, so simply keeping the final rank won't quite work even if
the token itself is fully transformed. As a minimum, a shallow clone would
need to be made and the rank reset (as in env.cloneTokens).

Change-Id: I4329113bb21750bae9a635229ed1b08da75dc614
2012-04-18 17:53:04 +02:00
Gabriel Wicke bf84638bc0 Add tokenizer cache and clone token state on mutation
* Added an LRU cache (using the lru-cache node module) for tokenizer output
* Mutation of nested attributes now replaces the containers. A shallow copy of
  tokens is sufficient to isolate token transformations. Need to investigate
  if we can actually get away without isolation and re-transformation for most
  ordinary tokens.

Change-Id: I9136b1d7a1fbcc538183a319d4ecaa290d616fdf
2012-04-18 14:40:47 +02:00
Gabriel Wicke aaca5eac7d More tweaks: safesubst and image options
* Ignore safesubst for now
* Remove an unneeded whitelist entry
* Make sure the caption is not lost for thumbs (fix to last commit) and remove
  debug print

Change-Id: I243584ed0838cf7c3b4110fe9cdf869272477312
2012-04-17 11:02:52 +02:00
Gabriel Wicke 7fe5a86b60 Improve image option handling
Change-Id: If1376766f41ff1288bfe2af19beecd3299c09a01
2012-04-17 10:46:20 +02:00
Gabriel Wicke afa5b95bc1 Don't work around html5 library tokenizer attribute reordering
The HTML5 parser we are using to normalize expected HTML output in parserTests
reverses the order of attributes (see
https://github.com/aredridel/html5/pull/53 for the fix). Remove whitelist
entries concerned with this and use the proper order in external image
attributes.

Change-Id: If1868cae05396a150757c85a20473ab756cbcd97
2012-04-16 17:09:06 +02:00
Gabriel Wicke c688b039de Collected tweaks
* less verbose logging in noinclude processing and template expansion
* Give priority to the processing of templates transcluded from transclusions
  to get closer to depth-first processing. This serves to minimize memory
  usage from queued-up tokens.
* Increase the maximum outstanding requests per template retrieval. 10000
  amazingly proved too low a limit on some big pages.
* Only process a single template request callback at a time for now
* Add a debug print in the treebuilder wrapper
* Don't treat multiple comments on a single line as a single comment to match
  the PHP parser's behavior

Change-Id: I9a86b6d7bec3b9e1f17415daf1bf74170240721a
2012-04-16 15:47:03 +02:00
Gabriel Wicke 1bf8a9e5e1 Small tweak in comment about onlyinclude forcing buffered expansion
Change-Id: Ib324e24c51c97e07e6737bf23f16db07043b69ab
2012-04-16 15:42:29 +02:00
Gabriel Wicke efd4c026ea Disallow &lt; and &gt; in external link urls
Change-Id: Id865c3d46b33b182bb5b244e77e815c0afd7fa49
2012-04-16 15:36:56 +02:00
Gabriel Wicke 25523f4cf0 Implement urlencode parser function
Change-Id: I4fca3134c9c3eb9a7d6f3360be6de054fb47477c
2012-04-16 14:54:03 +02:00
Gabriel Wicke 421ef44621 Match the empty string as whitespace too
Change-Id: I1a8ed882021804f62855b9db4368270feebbfc16
2012-04-16 14:48:39 +02:00
Gabriel Wicke 08453199df Increase number of callbacks per reactor iteration to 4
In experiments this dropped the memory consumption further, and reduces the
queuing overhead in the node reactor.

Change-Id: I9409b6ca863b43b7557663bbec9572365059c078
2012-04-13 14:50:36 +02:00