50 commits

Author SHA1 Message Date
Gabriel Wicke ece2b0f810 Tokenizer backtracking cache bug fix and memory savings
* The state of syntax stops is now properly included in the cache key for the
  tokenizer-internal backtracking cache. This fixes some mis-parses when
  re-parsing a bit of text with different flags.
* Clear the backtracking cache after each toplevelblock. This drops the peak
  memory usage when expanding [[:en:Barack Obama]] from ~380M to ~110M.

Change-Id: Icdb879cae5907e4595903dd6acba2e686e8c2e4b
2012-06-01 12:53:49 +02:00
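The cache fix described in this commit can be illustrated with a minimal sketch. It assumes a memoization cache keyed on rule name, input position, and the active syntax-stop flags. The class and flag names are hypothetical and are not taken from the tokenizer source.

```javascript
// Hypothetical sketch: none of these names come from the actual tokenizer.
class BacktrackCache {
  constructor() {
    this.entries = new Map();
  }

  // The key combines the rule name, the input position, and a serialized
  // snapshot of the syntax-stop flags active at that point.
  static key(rule, pos, stops) {
    const flags = Object.keys(stops).sort().map((k) => `${k}=${stops[k]}`);
    return `${rule}@${pos}|${flags.join(',')}`;
  }

  get(rule, pos, stops) {
    return this.entries.get(BacktrackCache.key(rule, pos, stops));
  }

  set(rule, pos, stops, result) {
    this.entries.set(BacktrackCache.key(rule, pos, stops), result);
  }

  // Called after each top-level block so entries do not pile up.
  clear() {
    this.entries.clear();
  }
}

const cache = new BacktrackCache();
cache.set('inline', 10, { pipe: true }, 'result parsed with pipe stop');
// Same rule and position, different flags: treated as a cache miss.
const miss = cache.get('inline', 10, { pipe: false });
const hit = cache.get('inline', 10, { pipe: true });
cache.clear();
```

Without the flags in the key, the second lookup would have returned a result parsed under different stop conditions, which is the kind of mis-parse the commit message mentions.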
Gabriel Wicke 4ea6b8e2be Revert part of last template syntax tweak
Change-Id: I084e1210577f80c3b96020d57cfa5c68eb5d139b
2012-05-31 12:02:42 +02:00
Gabriel Wicke c5d7e01944 Another tokenizer robustness improvement
This patch fixes a tokenizer syntax error encountered on
[[:en:Template:JacksonvilleWikiProject-Member]] and [[:en:Template:Infobox
former country]] by allowing optional whitespace before start-of-line template
syntax.

Change-Id: Ic214a731de58bf766e51f23d5e24ea2ce6788f58
2012-05-30 18:38:23 +02:00
Gabriel Wicke 36084c5d93 Preserve original newlines in HTML and serialization
254 round-trip tests (up from 184) are now passing.

Also:
* tweaked runtests.sh slightly (use less -R instead of -r).
* made sure the EOFTk is preserved in phase 3 transforms

Change-Id: I1de22186bdb78e52019370e43f096877005b8f5a
2012-05-29 23:29:03 +02:00
Gabriel Wicke caf2fa663d Keep going on tokenizer errors
Change-Id: I76fab4528f89b425845aef1685b3a54ddfeceef4
2012-05-24 10:30:32 +02:00
Gabriel Wicke e70448e53a Use text/x-mediawiki content type, and handle tokenizer errors without --debug
Change-Id: I154cd344306aa05ada7ff30f631d487f39fa9739
2012-05-24 10:19:25 +02:00
Gabriel Wicke d918fa18ac Big token transform framework overhaul part 2
* Tokens are now immutable. The progress of transformations is tracked on
  chunks instead of tokens. Tokenizer output is cached and can be directly
  returned without a need for cloning. Transforms are required to clone or
  newly create tokens they are modifying.

* Expansions per chunk are now shared between equivalent frames via a cache
  stored on the chunk itself. Equivalence of frames is not yet ideal though,
  as right now a hash tree of *unexpanded* arguments is used. This should be
  switched to a hash of the fully expanded local parameters instead.

* There is now a vastly improved maybeSyncReturn wrapper for async transforms
  that either forwards processing to the iterative transformTokens if the
  current transform is still ongoing, or manages a recursive transformation if
  needed.

* Parameters for parser functions are now wrapped in abstract Params and
  ParserValue objects, which support some handy on-demand *value* expansions.
  Keys are always expanded. Parser functions are converted to use these
  interfaces, and now properly expand their values in the correct frame.
  Making this expansion lazier is certainly possible, but would complicate
  transformTokens and other token-handling machinery. Need to investigate if
  it would really be worth it. Dead branch elimination is certainly a bigger
  win overall.

* Complex recursive asynchronous expansions should now be closer to correct
  for both the iterative (transformTokens) and recursive (maybeSyncReturn
  after transformTokens has returned) code paths.

* Performance degraded slightly. There are no micro-optimizations done yet
  and the shared expansion cache still has a low hit rate. The progress
  tracking on chunks is not yet perfect, so there are likely a lot of unneeded
  re-expansions that can be easily eliminated. There is also more debug
  tracing right now. Obama currently expands in 54 seconds on my laptop.

Change-Id: I4a603f3d3c70ca657ebda9fbb8570269f943d6b6
2012-05-15 17:05:47 +02:00
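The immutable-token rule from the first bullet of this commit might look roughly like the sketch below. The token shape and both helper functions are assumptions made for illustration. Only the rule itself (cached tokens are never modified, transforms build copies) comes from the commit message.

```javascript
// Hypothetical token shape; the real token classes are not shown in the log.
function makeToken(name, attribs) {
  return Object.freeze({ name, attribs: Object.freeze({ ...attribs }) });
}

// A transform that wants to change a token builds a modified copy instead of
// writing to the (possibly cached and shared) original.
function withAttribute(token, key, value) {
  return makeToken(token.name, { ...token.attribs, [key]: value });
}

const cached = makeToken('a', { href: 'Foo' });
const changed = withAttribute(cached, 'rel', 'nofollow');
```

Because `cached` is frozen, it can be handed out from a tokenizer cache to any number of consumers without cloning, while `changed` is a separate object owned by the transform that created it.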
Gabriel Wicke 2291fe8364 Reduce the need for token cloning slightly
Change-Id: I31c71bddca4855afdffc3fe5c8d759cfa1994d86
2012-04-27 23:12:25 +02:00
Gabriel Wicke 027d77e0c9 Fix --wikidom and --linearmodel parse.js options; retry on template fetch failures
Change-Id: I444397936fd87971fe085df4b467089367e9ffa6
2012-04-26 19:51:00 +02:00
Gabriel Wicke 3be4992782 'Obama finally expands' ;) Misc fixes and documentation updates
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
  while it would previously run out of RAM after ~30 minutes. Wohoooo!
  The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
  tests are passing since these tests expect the day of the week to be
  Thursday.  Won't be the case tomorrow.

Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136
2012-04-26 18:18:08 +02:00
Gabriel Wicke e2ca8c24c7 Delay some token duplication until actual mutation happens
This is a bit better than cloning tokens wholesale, but not by much. There is
a lot of potential for much better per-token caching with reduced token
cloning. Need to map out all dependencies besides token attributes expanded
from template parameters or other scoped state. Even if tokens themselves
don't need transformation, they might still need to be considered for other
token transformers, so simply keeping the final rank won't quite work even if
the token itself is fully transformed. As a minimum, a shallow clone would
need to be made and the rank reset (as in env.cloneTokens).

Change-Id: I4329113bb21750bae9a635229ed1b08da75dc614
2012-04-18 17:53:04 +02:00
Gabriel Wicke bf84638bc0 Add tokenizer cache and clone token state on mutation
* Added an LRU cache (using the lru-cache node module) for tokenizer output
* Mutation of nested attributes now replaces the containers. A shallow copy of
  tokens is sufficient to isolate token transformations. Need to investigate
  if we can actually get away without isolation and re-transformation for most
  ordinary tokens.

Change-Id: I9136b1d7a1fbcc538183a319d4ecaa290d616fdf
2012-04-18 14:40:47 +02:00
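As a rough illustration of caching tokenizer output with least-recently-used eviction, here is a self-contained sketch. It deliberately does not use the lru-cache module mentioned in the commit. `TinyLRU`, `tokenize`, and `cachedTokenize` are hypothetical stand-ins, and the whitespace splitter only takes the place of the real tokenizer.

```javascript
// Minimal LRU stand-in built on Map insertion order. This is NOT the
// lru-cache module's API, only an illustration of the idea.
class TinyLRU {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert to mark the entry as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value);
    }
  }
}

// Hypothetical stand-in tokenizer: counts calls so cache hits are visible.
let tokenizerCalls = 0;
function tokenize(text) {
  tokenizerCalls += 1;
  return text.split(/\s+/);
}

const tokenCache = new TinyLRU(2);
function cachedTokenize(text) {
  let tokens = tokenCache.get(text);
  if (tokens === undefined) {
    tokens = tokenize(text);
    tokenCache.set(text, tokens);
  }
  return tokens;
}

cachedTokenize('a b');
cachedTokenize('a b'); // served from the cache
cachedTokenize('c d');
cachedTokenize('e f'); // evicts 'a b', the least recently used entry
```

Returning the cached array directly is only safe if consumers do not mutate it, which is why the commit pairs the cache with cloning token state on mutation.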
Gabriel Wicke df050e4481 Convert external link syntax stops to stack
Eat unbalanced external link parts within template parameters. This does not
produce the same output as the PHP parser
(try echo '{{YouTube}}' | node parse.js), but preserves a level of sanity.
Need to check how common this is for external links. If it is rare enough,
moving the ']' after the parser function manually would fix the rendering for
the YouTube case.

Change-Id: I597d808efff36baa22191e7946a0061cc31120e8
2012-04-13 11:08:42 +02:00
Gabriel Wicke bff43938f6 Support noinclude/includeonly/onlyinclude in attributes
Fun test case:
{|
|-<includeonly>
foo
</includeonly>
|Hello
|}

Change-Id: I353bb287d3967ade549fbcb4ae64511a1f1f7e36
2012-04-11 17:37:25 +02:00
Gabriel Wicke 403be4af42 Add basic thumb rendering support
* DOM based on Wikia's thumb output: HTML5, clean caption without magnify
  icon.
* basic RDFa annotations, but most options are additionally stored in the
  data-mw object; might want to move more (or all?) of those into RDFa data
  using meta tags.
* no support yet for framed or other formats, image scaling etc
* also tweaked some config options in the environment

Change-Id: Ie461fcdce060cfc2dec65cc057709ae650ef3368
2012-04-09 23:04:26 +02:00
Gabriel Wicke 5ef2074251 Enable support for block-level wiki constructs in template arguments. This
gets a bit closer to supporting table fragments passed through template
arguments. Next, we'll need a way to indicate start-of-line position to
enable sol block-levels in template parameters. 

Example:

{|
{{#if: true|{{!}}Table cell|}}
|}
2012-03-15 11:43:49 +00:00
Gabriel Wicke 7e22020398 Convert syntactical break flags for templates from counters to the stack
variant to fix the precedence for {{!}} (break on these inside table content,
but not in template options within tables).
2012-03-14 16:30:59 +00:00
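The difference between counter-based and stack-based break flags can be sketched as follows. This is an assumption-laden illustration. `SyntaxStops` and the flag name `tableCellBreak` are hypothetical, and the real tokenizer's data structure may differ.

```javascript
// Hypothetical sketch; names are illustrative, not the tokenizer's own.
class SyntaxStops {
  constructor() {
    this.stacks = {};
  }

  // Entering a construct pushes the value that should apply inside it.
  push(flag, value) {
    (this.stacks[flag] = this.stacks[flag] || []).push(value);
  }

  // Leaving the construct restores whatever the enclosing context had set.
  pop(flag) {
    return this.stacks[flag].pop();
  }

  // Only the innermost (top) value decides; with a plain counter, an inner
  // context could not switch off a flag that an outer one had raised.
  isActive(flag) {
    const stack = this.stacks[flag];
    return !!stack && stack.length > 0 && stack[stack.length - 1];
  }
}

const stops = new SyntaxStops();
stops.push('tableCellBreak', true);   // inside table content
const inTable = stops.isActive('tableCellBreak');
stops.push('tableCellBreak', false);  // inside a template option in the table
const inOption = stops.isActive('tableCellBreak');
stops.pop('tableCellBreak');          // back in table content
const afterOption = stops.isActive('tableCellBreak');
```

In this sketch the break is honoured in table content, suppressed inside the nested template option, and honoured again once the option ends, which matches the precedence the commit describes for {{!}}.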
Gabriel Wicke 77a61dd687 Improve support for {{!}}, and don't produce a pre for indented tables. 2012-03-14 10:58:11 +00:00
Gabriel Wicke 2195c31abf Move link types to data-mw-rt, and support some more template tokenization
edge cases. For example, the PHP parser treats | foo | = bar | as | foo = bar |,
believe it or not ;)
2012-03-13 12:32:31 +00:00
Gabriel Wicke 4cd8b302ac Improved template tokenization. The parser can now template-expand
[[:en:Barack Obama]] without exceeding 1.7GB of memory (which is the node
limit).
2012-03-12 17:31:45 +00:00
Gabriel Wicke ae4ab7a39c Refactor syntactic stops into an object and add a stack variant for option
values.
2012-03-12 13:08:43 +00:00
Gabriel Wicke ffc9383096 Temporary fix for template tokenization, especially needed for
[[Template:Cite core]].
2012-03-08 14:24:04 +00:00
Gabriel Wicke b1e131d568 A bit more documentation and naming cleanup in the tokenizer wrapper. 2012-03-08 09:00:45 +00:00
Gabriel Wicke 7f7202e89c A few improvements to external link and image handling. 264 tests passing. 2012-03-05 15:34:27 +00:00
Gabriel Wicke 7b0c807710 Change wikilink tokenization strategy to split on pipes. This makes it
possible to support template / template argument expansion in image options,
and causes little trouble for wikilinks. Non-image wikilinks with multiple
text pipes are quite rare in the dumps, and concatenating description tokens
with a plain '|' is quite easy. 261 parser tests passing.
2012-03-05 12:00:38 +00:00
Gabriel Wicke 167dbdb0fa Parse image options. 2012-03-02 13:36:37 +00:00
Gabriel Wicke 8b7ba9051b Add productions for image option tokenization, and prepare to call those from
the LinkHandler token stream transformer.
2012-03-01 18:07:20 +00:00
Gabriel Wicke 058c4213a4 Remove some more unused code and tidy up some more. 2012-02-21 18:26:40 +00:00
Gabriel Wicke 416126c041 Fix the bug in the inline_breaks replacement, and write another switch-based
version, which is slightly faster and shorter. Performance is improved by
about 5% for parserTests.
2012-02-21 17:57:30 +00:00
Gabriel Wicke 18a04f7581 Tidy up and comment the tokenizer a bit more. Start to move code into
mediawiki.tokenizer.js module, and pass a reference to parse(). Faster
inline_breaks production using a JS function which seems to be generally
correct, but still breaks five tests when enabled. Seems to be some weird
interaction with peg.js, possibly something to do with caching.
2012-02-21 17:21:42 +00:00
Gabriel Wicke 001194b140 Replace console.log with console.warn in all debug statements 2012-02-14 20:56:14 +00:00
Gabriel Wicke a5cc10a06b Change token format to plain strings for text tokens, and specific objects for
other tokens. This is only the first half of the conversion. The next step is
to drop the type attribute on most tokens and match on the constructor in the
token transform machinery.
2012-02-01 16:30:43 +00:00
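A minimal sketch of the token format this commit is heading towards: text as plain strings, everything else as specific objects matched on their constructor rather than on a type attribute. The two classes and the `describe` function are hypothetical names chosen for the example.

```javascript
// Hypothetical token classes for illustration only.
class TagTk {
  constructor(name) { this.name = name; }
}
class EndTagTk {
  constructor(name) { this.name = name; }
}

// Dispatch on the kind of token: strings are text, objects are matched on
// their constructor instead of a separate type attribute.
function describe(token) {
  if (typeof token === 'string') return `text(${token})`;
  switch (token.constructor) {
    case TagTk: return `open(${token.name})`;
    case EndTagTk: return `close(${token.name})`;
    default: return 'unknown';
  }
}

const described = [new TagTk('b'), 'bold', new EndTagTk('b')].map(describe);
```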
Gabriel Wicke 7cd94df47d A few minor tweaks to reduce memory usage 2012-01-27 13:32:44 +00:00
Gabriel Wicke 4e6a54560a * Emit token chunks for top-level block elements by patching the source of the
tokenizer
* Fix a bug uncovered by this
* Increase the number of outstanding listeners on a single download to 10000
2012-01-22 23:21:53 +00:00
Gabriel Wicke 34025251a3 Clean up 'END' token handling a bit. 2012-01-17 20:01:21 +00:00
Gabriel Wicke 287604c422 A bit of cleanup in ParserPipeline, with better and more consistent support
for multiple input types.
2012-01-09 19:33:49 +00:00
Gabriel Wicke e99d7a2a55 Two batteries worth of token transform manager refactoring.
* TokenTransformDispatcher is now renamed to TokenTransformManager, and is
  also turned into a base class
* SyncTokenTransformManager and AsyncTokenTransformManager subclass
  TokenTransformManager and implement synchronous (phase 1,3) and asynchronous
  (phase 2) transformation stages.
* Communication between stages uses the same chunk / end events as all the
  other token stages.
* The AsyncTokenTransformManager now supports the creation of nested
  AsyncTokenTransformManagers for template expansion.
  The AsyncTokenTransformManager object takes on the responsibilities of a
  preprocessor frame. Transforms are newly created (or potentially resurrected
  from a cache), so that transforms do not have to worry about concurrency.
* The environment is pushed through to all transform managers and the
  individual transforms.
2012-01-09 17:49:16 +00:00
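The class layout described in this commit could be sketched as below. All method names, the hand-rolled `on`/`emit` pair, and the callback signature of asynchronous transforms are assumptions. The sketch only shows a shared base class, a synchronous and an asynchronous subclass, and stages talking to each other through chunk and end events. For brevity the asynchronous manager runs only its first registered transform, and nested managers for template expansion are left out.

```javascript
// Hypothetical sketch of the class layout; method names are assumptions.
class TokenTransformManager {
  constructor(env) {
    this.env = env; // environment handed through to every stage
    this.transforms = [];
    this.listeners = { chunk: [], end: [] };
  }
  on(event, fn) { this.listeners[event].push(fn); }
  emit(event, payload) { this.listeners[event].forEach((fn) => fn(payload)); }
  addTransform(fn) { this.transforms.push(fn); }
  onEnd() { this.emit('end'); }
}

class SyncTokenTransformManager extends TokenTransformManager {
  // Applies every transform to every token right away.
  onChunk(tokens) {
    const out = tokens.map((t) => this.transforms.reduce((tok, fn) => fn(tok), t));
    this.emit('chunk', out);
  }
}

class AsyncTokenTransformManager extends TokenTransformManager {
  // Transforms answer through a callback, possibly later. The chunk is
  // emitted, in original token order, once every token has been answered.
  onChunk(tokens) {
    const out = new Array(tokens.length);
    let pending = tokens.length;
    if (pending === 0) { this.emit('chunk', out); return; }
    tokens.forEach((token, i) => {
      // Simplification: only the first registered transform is used.
      this.transforms[0](token, (result) => {
        out[i] = result;
        pending -= 1;
        if (pending === 0) this.emit('chunk', out);
      });
    });
  }
}

const env = { debug: false };
const phase1 = new SyncTokenTransformManager(env);
const phase2 = new AsyncTokenTransformManager(env);
phase1.addTransform((tok) => tok.trim());

// Simulated asynchronous work: callbacks are queued and run on demand.
const queue = [];
phase2.addTransform((tok, cb) => queue.push(() => cb(tok.toUpperCase())));

const received = [];
phase1.on('chunk', (chunk) => phase2.onChunk(chunk));
phase2.on('chunk', (chunk) => received.push(chunk));

phase1.onChunk([' foo ', 'bar ']);
const beforeFlush = received.length; // nothing emitted until callbacks fire
queue.reverse().forEach((run) => run()); // answers arrive out of order
```

Even though the callbacks fire in reverse order here, the asynchronous stage emits its chunk with the tokens in their original positions.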
Gabriel Wicke 6601c544e6 Handle default for template arg expansion, add template fetch functionality
and tweak a few minor things in the grammar and QuoteTransformer.
2012-01-06 17:19:14 +00:00
Gabriel Wicke 6cd95fea37 Fix up constructors in EventEmitter inheritance and tweak a few more comments. 2012-01-04 12:28:41 +00:00
Gabriel Wicke 29362cc53c Rename ParseThingy to ParserPipeline and fix up broken WikiDom generation and
commandline runner.
2012-01-04 08:39:45 +00:00
Gabriel Wicke bd98eb4c5a Land big TokenTransformDispatcher and eventization refactoring.
The TokenTransformDispatcher now actually implements an asynchronous, phased
token transformation framework as described in
https://www.mediawiki.org/wiki/Future/Parser_development/Token_stream_transformations.

Additionally, the parser pipeline is now mostly held together using events.
The tokenizer still emits a lame single event with all tokens, as block-level
emission failed with scoping issues specific to the PEGJS parser generator.
All stages clean up when receiving the end tokens, so that the full pipeline
can be used for repeated parsing.

The QuoteTransformer is not yet 100% fixed to work with the new interface, and
the Cite extension is disabled for now pending adaptation. Bold-italic related
tests are failing currently.
2012-01-03 18:44:31 +00:00
Neil Kandalgaonkar 20374b5911 fix substr for IE, followup r107464 2011-12-30 21:51:03 +00:00
Gabriel Wicke b3a0270d69 Remove env and load grammar in tokenizer constructor. Re-add property hack to
keep parserTests running for now. Really need a different pipeline for html
serialization or a reference to the HTML DOM.
2011-12-28 17:04:16 +00:00
Neil Kandalgaonkar 8fbf36e63e add terminal token inside tokenize method (will pull it out again for streaming interface) 2011-12-28 01:37:15 +00:00
Neil Kandalgaonkar 6103646ec8 remove need to add newline at end of input 2011-12-28 01:37:11 +00:00
Neil Kandalgaonkar 962d1262fc create tokenizer without need to modify namespace with PEG source 2011-12-28 01:36:36 +00:00
Gabriel Wicke 752b0990b2 Refactor parserTests somewhat into a class-like structure, and wire up the
TokenTransformer.
2011-12-12 14:03:54 +00:00
Gabriel Wicke d616f07a79 Don't re-build the wiki tokenizer for each test. This speeds up the full
parserTests.js run slightly from 7-8 minutes to about 14 seconds ;)

A few very minor tweaks to the grammar are also thrown into this commit.
2011-12-12 10:47:42 +00:00
Gabriel Wicke abc2254110 A bit of comment clean-up and wrapping of tree building into try/catch block
to actually count failures.
2011-12-08 11:40:59 +00:00
Gabriel Wicke 92fdf99384 Further renaming, this time from pegParser to pegTokenizer. 2011-12-08 10:59:44 +00:00
Renamed from modules/parser/mediawiki.parser.peg.js