Commit graph

56 commits

Author SHA1 Message Date
Gabriel Wicke d918fa18ac Big token transform framework overhaul part 2
* Tokens are now immutable. The progress of transformations is tracked on
  chunks instead of tokens. Tokenizer output is cached and can be directly
  returned without a need for cloning. Transforms are required to clone or
  newly create tokens they are modifying.

* Expansions per chunk are now shared between equivalent frames via a cache
  stored on the chunk itself. Equivalence of frames is not yet ideal though,
  as right now a hash tree of *unexpanded* arguments is used. This should be
  switched to a hash of the fully expanded local parameters instead.

* There is now a vastly improved maybeSyncReturn wrapper for async transforms
  that either forwards processing to the iterative transformTokens if the
  current transform is still ongoing, or manages a recursive transformation if
  needed.

* Parameters for parser functions are now wrapped in abstract Params and
  ParserValue objects, which support some handy on-demand *value* expansions.
  Keys are always expanded. Parser functions are converted to use these
  interfaces, and now properly expand their values in the correct frame.
  Making this expansion lazier is certainly possible, but would complicate
  transformTokens and other token-handling machinery. Need to investigate if
  it would really be worth it. Dead branch elimination is certainly a bigger
  win overall.

* Complex recursive asynchronous expansions should now be closer to correct
  for both the iterative (transformTokens) and recursive (maybeSyncReturn
  after transformTokens has returned) code paths.

* Performance degraded slightly. There are no micro-optimizations done yet
  and the shared expansion cache still has a low hit rate. The progress
  tracking on chunks is not yet perfect, so there are likely a lot of unneeded
  re-expansions that can be easily eliminated. There is also more debug
  tracing right now. Obama currently expands in 54 seconds on my laptop.

Change-Id: I4a603f3d3c70ca657ebda9fbb8570269f943d6b6
2012-05-15 17:05:47 +02:00
Gabriel Wicke c4fc7508a7 Add basic # REDIRECT handling
Change-Id: I71f659201c1d5de4a528ddfac7f65bf20a89f97d
2012-05-03 15:54:36 +02:00
Gabriel Wicke 56d6757f67 Fixes for the template fetch retry feature
Change-Id: Id36cb02c535d07f4f2cdd54ae682b6a144a2faa9
2012-04-26 20:31:23 +02:00
Gabriel Wicke 027d77e0c9 Fix --wikidom and --linearmodel parse.js options; retry on template fetch failures
Change-Id: I444397936fd87971fe085df4b467089367e9ffa6
2012-04-26 19:51:00 +02:00
Gabriel Wicke 3be4992782 'Obama finally expands' ;) Misc fixes and documentation updates
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
  while it would prevously run out of RAM after ~30 minutes. Wohoooo!
  The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
  tests are passing since these tests expect the day of the week to be
  Thursday.  Won't be the case tomorrow.

Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136
2012-04-26 18:18:08 +02:00
Gabriel Wicke 8ff810659a Rename text/wiki and tokens/wiki to text/x-mediawiki and similar
Change-Id: I70113629f4633685cd6db3914303a15e4c79a50a
2012-04-25 20:19:43 +02:00
Gabriel Wicke 8368e17d6a Biggish token transform system refactoring
* All parser pipelines including tokenizer and DOM stuff are now constructed
  from a 'recipe' data structure in a ParserPipelineFactory.

* All sub-pipelines of these can now be cached

* Event registrations to a pipeline are directly forwarded to the last
  pipeline member to save relatively expensive event forwarding.

* Some APIs for on-demand expansion / format conversion of parameters from
  parser functions are added:

  param.to('tokens/expanded', cb)
  param.to('text/wiki', cb) (this does not work yet)

  All parameters are additionally wrapped into a Param object that provides
  method for positional parameter naming (.named() or conversion to a dict
  (.dict()).

* The async token transform manager is now separated from a frame object, with
  the frame holding arguments, an on-demand expansion method and loop checks.

* Only keys of template parameters are now expanded. Parser functions or
  template arguments trigger an expansion on-demand. This (unsurprisingly)
  makes a big performance difference with typical switch-heavy template
  systems.

* Return values from async transforms are no longer used in favor of plain
  callbacks. This saves the complication of having to maintain two code paths.
  A trick in transformTokens still avoids the construction of unneeded
  TokenAccumulators.

* The results of template expansions are no longer buffered.

* 301 parser tests are passing

Known issues:

* Cosmetic cleanup remains to do
* Some parser functions do not support async expansions yet, and need to be
  modified.

Change-Id: I1a7690baffbe8141cadf67270904a1b2e1df879a
2012-04-25 16:51:36 +02:00
Gabriel Wicke e2ca8c24c7 Delay some token duplication until actual mutation happens
This is a bit better than cloning tokens wholesale, but not by much. There is
a lot of potential for much better per-token caching with reduced token
cloning. Need to map out all dependencies besides token attributes expanded
from template parameters or other scoped state. Even if tokens themselves
don't need transformation, they might still need to be considered for other
token transformers, so simply keeping the final rank won't quite work even if
the token itself is fully transformed. As a minimum, a shallow clone would
need to be made and the rank reset (as in env.cloneTokens).

Change-Id: I4329113bb21750bae9a635229ed1b08da75dc614
2012-04-18 17:53:04 +02:00
Gabriel Wicke bf84638bc0 Add tokenizer cache and clone token state on mutation
* Added an LRU cache (using the lru-cache node module) for tokenizer output
* Mutation of nested attributes now replaces the containers. A shallow copy of
  tokens is sufficient to isolate token transformations. Need to investigate
  if we can actually get away without isolation and re-transformation for most
  ordinary tokens.

Change-Id: I9136b1d7a1fbcc538183a319d4ecaa290d616fdf
2012-04-18 14:40:47 +02:00
Gabriel Wicke aaca5eac7d More tweaks: safesubst and image options
* Ignore safesubst for now
* Remove an unneeded whitelist entry
* Make sure the caption is not lost for thumbs (fix to last commit) and remove
  debug print

Change-Id: I243584ed0838cf7c3b4110fe9cdf869272477312
2012-04-17 11:02:52 +02:00
Gabriel Wicke c688b039de Collected tweaks
* less verbose logging in noinclude processing and template expansion
* Give priority to the processing of templates transcluded from transclusions
  to get closer to depth-first processing. This serves to minimize memory
  usage from queued-up tokens.
* Increase the maximum outstanding requests per template retrieval. 10000
  amazingly proved too low a limit on some big pages.
* Only process a single template request callback at a time for now
* Add a debug print in the treebuilder wrapper
* Don't treat multiple comments on a single line as a single comment to match
  the PHP parser's behavior

Change-Id: I9a86b6d7bec3b9e1f17415daf1bf74170240721a
2012-04-16 15:47:03 +02:00
Gabriel Wicke 08453199df Increase number of callbacks per reactor iteration to 4
In experiments this dropped the memory consumption further, and reduces the
queuing overhead in the node reactor.

Change-Id: I9409b6ca863b43b7557663bbec9572365059c078
2012-04-13 14:50:36 +02:00
Gabriel Wicke 06ae53fdfe Drastically reduce memory usage for template-heavy pages
Only call back a few callbacks per reactor iteration from the template fetch
request queue. This changes the expansion pattern from a (memory intensive)
breadth-first expansion to something quite close to depth-first expansion.
Additionally, retrieved pages are quickly added to the page cache so that a
lot of request queuing is avoided in favor of synchronous expansion from the
cache. On pages like Barack Obama that previously ran out of memory after
consuming node's 1.6G heap limit, expansion now runs in relatively constant
100-300M resident (so far, still running).

Change-Id: Ie34a1eeff00d868416de45ef8d289898258f560c
2012-04-13 14:31:03 +02:00
Gabriel Wicke 5bb2d96869 Token stream transform improvements
* add past paths for empty arguments etc
* cache attribute token transform pipelines
* fix bugs in TokenCollector and NoIncludeOnly handler, and improve its
  efficiency by only registering for 'end' tokens on demand
* Remove empty reset methods from a few handlers
* Add a simple 'ap' debug print function that makes it easy to only print some
  debug prints by temporarily changing 'dp' to 'ap'
* Improvements and bug fixes in AttributeExpander

Change-Id: Ie69729c8f62d48bba922712e44ebce484c621c50
2012-04-12 15:42:09 +02:00
Gabriel Wicke 403be4af42 Add basic thumb rendering support
* DOM based on Wikia's thumb output: HTML5, clean caption without magnify
  icon.
* basic RDFa annotations, but most options additionally in data-mw object-
  might want to move more (or all?) of those into RDFa data using meta tags.
* no support yet for framed or other formats, image scaling etc
* also tweaked some config options in the environment

Change-Id: Ie461fcdce060cfc2dec65cc057709ae650ef3368
2012-04-09 23:04:26 +02:00
Gabriel Wicke 7e22020398 Convert syntactical break flags for templates from counters to the stack
variant to fix the precedence for {{!}} (break on these inside table content,
but not in template options within tables).
2012-03-14 16:30:59 +00:00
Gabriel Wicke 4cd8b302ac Improved template tokenization. The parser can now template-expand
[[:en:Barack Obama]] without exceeding 1.7GB of memory (which is the node
limit).
2012-03-12 17:31:45 +00:00
Gabriel Wicke f02ff95aa3 Token representation clean-up. Now all tokens are differentiated using
constructors instead of type attributes.
2012-03-07 20:06:54 +00:00
Gabriel Wicke 5f618103d7 Set allTokensProcessed flag for async callbacks from the template expander. 2012-03-07 17:36:33 +00:00
Gabriel Wicke d7da324272 Basic fall-through support for #switch parser function 2012-02-22 14:57:50 +00:00
Gabriel Wicke 8dde1f77b4 Reduce debug print overhead, roughly a 10% speed-up on parserTests. 2012-02-21 18:49:43 +00:00
Gabriel Wicke 5806705733 Push transformer setup a bit further into the attribute pipeline. 2012-02-20 12:56:00 +00:00
Gabriel Wicke 001194b140 Replace console.log with console.warn in all debug statements 2012-02-14 20:56:14 +00:00
Gabriel Wicke 0b40741e1c Strip trailing newlines from included templates 2012-02-13 14:17:03 +00:00
Gabriel Wicke 6e33255503 Improve support for preprocessor functionality in attributes; Support
multi-line xmlish tags with preprocessor stuff in attributes.
2012-02-09 16:36:29 +00:00
Gabriel Wicke 6983481561 Move attribute expansion back to separate handler, as this makes it easier to
only expand used branches selected by parser functions. Template (and
-argument) expansion is simply registered before general expansion.

Additionally, a few more simple time-based magic words are added in
ParserFunctions.
2012-02-09 13:44:20 +00:00
Gabriel Wicke 3f7c1499cd Enable support for general preprocessor functionality in attribute keys and
values. This includes comments, templates and template arguments.

This also replaces the specialized expansion logic in the TemplateHandler. The
removal of link validation lets one more parser test fail for now. External
link target validation will need to be implemented in the token stream handler
for links. This is noted as TODO in
https://www.mediawiki.org/wiki/Future/Parser_development#Token_stream_transforms.
2012-02-08 15:10:30 +00:00
Gabriel Wicke b4892102a4 Clean up transform callback interface 2012-02-07 11:53:29 +00:00
Gabriel Wicke 53bf4f2bd0 Temporarily disable the sanitizer and start to support preprocessor
functionality (comments, templates, template arguments) in arbitrary
attributes. The grammar for this is still quite rough, will need to
consolidate that area.
2012-02-06 19:15:44 +00:00
Gabriel Wicke a5cc10a06b Change token format to plain strings for text tokens, and specific objects for
other tokens. This is only the first half of the conversion. The next step is
to drop the type attribute on most tokens and match on the constructor in the
token transform machinery.
2012-02-01 16:30:43 +00:00
Gabriel Wicke 14a8a13678 A few more debug helpers including a --trace mode for light debugging. Some
improvements to parser functions on the way to support the cite extensions.
Preparation for generic template and template arg in attribute support. 222
parser tests now passing.
2012-01-31 16:50:16 +00:00
Gabriel Wicke 4e6a54560a * Emit token chunks for top-level block elements by patching the source of the
tokenizer
* Fix a bug uncovered by this
* Increase the number of outstanding listeners on a single download to 10000
2012-01-22 23:21:53 +00:00
Gabriel Wicke 7ea4d7d3db A few parser function fixes and maximum template expansion in environment
config.
2012-01-22 19:32:28 +00:00
Gabriel Wicke 561cf3c237 Bug fixes and a first stab at a #time parser function. You can expand the main
page like this:

cd extensions/VisualEditor/modules/parser
echo '{{:Main Page}}' | node parse.js
echo '{{:Main Page}}' | node parse.js --html
echo '{{:Main Page}}' | node parse.js --debug

Even the date-based includes work somewhat, although they don't yet accept
passed-in dates.
2012-01-22 07:07:16 +00:00
Gabriel Wicke 60e45bb739 A bit of template expansion bug fixing and parser function documentation 2012-01-22 01:27:22 +00:00
Gabriel Wicke 785a4af76f Implement a few parser functions. 220 parser tests now passing. 2012-01-21 20:38:13 +00:00
Gabriel Wicke 1a6546fbca Support empty template arguments and default values in arg expansion 2012-01-21 03:03:33 +00:00
Gabriel Wicke 145df2655c * NoInclude and IncludeOnly improvements
* Tokenizer support for templates and template args in template arguments and titles
* Async attribute expansion fixes
2012-01-20 22:02:23 +00:00
Gabriel Wicke 348cac6cf0 Fix a bug in TokenCollector, and misc tweaks for template expansions. 2012-01-20 18:47:17 +00:00
Gabriel Wicke 7cc8e69147 Collapse all requests per template into a single outstanding request using an
event-emitting TemplateRequest object and a request queue.
2012-01-20 02:36:18 +00:00
Gabriel Wicke c15e0d4167 Minor cleanup in TemplateHandler 2012-01-20 00:49:27 +00:00
Gabriel Wicke d0ece16c86 Fix async template expansion, so we can now render simple pages with templates
directly to WikiDom from enwiki using a commandline like this:

  echo '{{User:GWicke/Test}}' | node parse.js

Wohoo!

Complex pages with templates won't render properly yet, as noinclude /
includeonly and parser functions are not yet implemented. As a result, the
parser will run out of memory or hit the currently low expansion depth limit
as it tries to expand documentation for all templates.
2012-01-19 23:43:39 +00:00
Gabriel Wicke 2233d0a488 Eventify parser tests and parse.js commandline wrapper to actuallly allow
async template fetching. Async expansion is not yet fully debugged, but at
least the preconditions for that are now there.
2012-01-18 23:46:01 +00:00
Gabriel Wicke 5b8054636e Make template fetching somewhat functional on node with Inez' help, but
disable it by default in parserTests as it tries to fetch all sorts of parser
functions and is not yet fully supported in parserTests. The next step will be
to build a list of parser functions (to avoid fetching them as templates) and
pushing the event interface into parserTests.
2012-01-18 19:38:32 +00:00
Gabriel Wicke e7381da5b8 Trim whitespace off template titles and argument names. 209 parser tests now
passing.
2012-01-17 23:18:33 +00:00
Gabriel Wicke f50fecf1e3 Fix template argument expansion. 200 parser tests now passing. 2012-01-17 22:29:26 +00:00
Gabriel Wicke 34025251a3 Clean up 'END' token handling a bit. 2012-01-17 20:01:21 +00:00
Gabriel Wicke f4081bef08 First template expansion tests start working, and a bug fix in
DOMPostProcessor paragraph wrapper. 187 parser tests now passing.
2012-01-14 00:58:20 +00:00
Gabriel Wicke 196d704e8e Template expansion now enabled and somewhat working, but template fetching
still fails all the time.
2012-01-13 18:48:25 +00:00
Gabriel Wicke 32c9bccd7c Results of early template expansion debugging. Still disabled by default, but
getting closer.
2012-01-11 19:48:49 +00:00