Commit graph

105 commits

Author SHA1 Message Date
Gabriel Wicke 3be4992782 'Obama finally expands' ;) Misc fixes and documentation updates
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
  while it would prevously run out of RAM after ~30 minutes. Wohoooo!
  The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
  tests are passing since these tests expect the day of the week to be
  Thursday.  Won't be the case tomorrow.

Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136
2012-04-26 18:18:08 +02:00
Gabriel Wicke 8368e17d6a Biggish token transform system refactoring
* All parser pipelines including tokenizer and DOM stuff are now constructed
  from a 'recipe' data structure in a ParserPipelineFactory.

* All sub-pipelines of these can now be cached

* Event registrations to a pipeline are directly forwarded to the last
  pipeline member to save relatively expensive event forwarding.

* Some APIs for on-demand expansion / format conversion of parameters from
  parser functions are added:

  param.to('tokens/expanded', cb)
  param.to('text/wiki', cb) (this does not work yet)

  All parameters are additionally wrapped into a Param object that provides
  method for positional parameter naming (.named() or conversion to a dict
  (.dict()).

* The async token transform manager is now separated from a frame object, with
  the frame holding arguments, an on-demand expansion method and loop checks.

* Only keys of template parameters are now expanded. Parser functions or
  template arguments trigger an expansion on-demand. This (unsurprisingly)
  makes a big performance difference with typical switch-heavy template
  systems.

* Return values from async transforms are no longer used in favor of plain
  callbacks. This saves the complication of having to maintain two code paths.
  A trick in transformTokens still avoids the construction of unneeded
  TokenAccumulators.

* The results of template expansions are no longer buffered.

* 301 parser tests are passing

Known issues:

* Cosmetic cleanup remains to do
* Some parser functions do not support async expansions yet, and need to be
  modified.

Change-Id: I1a7690baffbe8141cadf67270904a1b2e1df879a
2012-04-25 16:51:36 +02:00
Gabriel Wicke c688b039de Collected tweaks
* less verbose logging in noinclude processing and template expansion
* Give priority to the processing of templates transcluded from transclusions
  to get closer to depth-first processing. This serves to minimize memory
  usage from queued-up tokens.
* Increase the maximum outstanding requests per template retrieval. 10000
  amazingly proved too low a limit on some big pages.
* Only process a single template request callback at a time for now
* Add a debug print in the treebuilder wrapper
* Don't treat multiple comments on a single line as a single comment to match
  the PHP parser's behavior

Change-Id: I9a86b6d7bec3b9e1f17415daf1bf74170240721a
2012-04-16 15:47:03 +02:00
Gabriel Wicke efd4c026ea Disallow < and > in external link urls
Change-Id: Id865c3d46b33b182bb5b244e77e815c0afd7fa49
2012-04-16 15:36:56 +02:00
Gabriel Wicke df050e4481 Convert external link syntax stops to stack
Eat unbalanced external link parts within template parameters. This does not
produce the same output as the PHP parser
(try echo '{{YouTube}}' | node parse.js), but preserves a level of sanity.
Need to check how common this is for external links. If it is rare enough,
moving the ']' after the parser function manually would fix the rendering for
the YouTube case.

Change-Id: I597d808efff36baa22191e7946a0061cc31120e8
2012-04-13 11:08:42 +02:00
Gabriel Wicke bff43938f6 Support noinclude/includeonly/onlyinclude in attributes
Fun test case:
{|
|-<includeonly>
foo
</includeonly>
|Hello
|}

Change-Id: I353bb287d3967ade549fbcb4ae64511a1f1f7e36
2012-04-11 17:37:25 +02:00
Gabriel Wicke 5a33099875 Improve template tokenization in template arguments
Taxobox tables now render pretty much correctly.

Change-Id: I5a0564138ff0c688d8a5a69b7867646fd3763946
2012-04-10 16:40:49 +02:00
Gabriel Wicke dbdd320348 Improve parameter tokenization support especially for table rows
Change-Id: I961d69e228b96adc69ea9acb3733d13f5898602d
2012-04-05 16:00:26 +02:00
Gabriel Wicke 7a35e5db16 Remove behaviors var in tokenizer, now handled in token handler
Change-Id: I68eeff3f05ce29c13e347c2cd7ea6519e58b0e03
2012-04-04 21:17:29 +02:00
Adam Wight a85ed36efa "magic words" are tokenized and used to set parser.environment flags
behavior switches are converted to tokens which set parser.environment flags during the async transformation stage.

The next step would be for handlers in the sync23 stage to generate the TOC, section edit links, and so on according to these directives.

No tests written, because the switches are consumed and don't appear in rendered html.  We can test the magic word layout controls individually, once they're implemented.

Another small change was to store option flags directly in the environment object, not that it makes much difference.

Change-Id: I863fbf4be1a17d2f6c31158298dd301f19ae1137
2012-04-04 11:25:29 -07:00
Gabriel Wicke e3a745a024 Improvements for template / -argument precedence; support for empty params
Change-Id: Id0894ccbedfa47fa3658817ca65119a2af76be3e
2012-04-04 16:29:47 +02:00
Gabriel Wicke 2037215185 Disallow '[' in generic attribute names
This avoids interpreting something like

! [[foo|bar]]

as

<th [[foo=''>bar]]</th>.

Change-Id: If59708fa90eb0117a15b2b6446890d1ae19a857c
2012-04-04 14:31:11 +02:00
Gabriel Wicke f588d2a7aa Fix table headings in template parameters
Change-Id: Icdfc5655968fc845230ad7638124309d6b8c1ada
2012-04-04 12:54:34 +02:00
Gabriel Wicke b8d980a229 Don't eat newline / space in template parameters
..so that block_lines can match.

Change-Id: I4c464dc44249f40e4aa280df35fb726bfce3a745
2012-04-04 11:22:31 +02:00
Gabriel Wicke 47de122a95 Improve support for table / template interaction
Match pairs of {{!}} or | for template productions, but not a mix of the two.
Example:

{{#if:1|{{!}}-
{{!}} {{#if:1|style="color: red"{{!}}|}}
}}

Note that the style parameter ends up as the *key* of an empty-valued
attribute on the table cell currently.

Change-Id: I5f9357dd1645ef97b0af89f32e8d92ae49218c72
2012-04-03 18:48:35 +02:00
Gabriel Wicke 0fe062fbe1 JSHint cleanups and parser function argument handling improvements
Parser functions which only accept positional arguments now return both the
key and value of arguments. Complete attributes (key and value) for templates
and the like from parser functions are not yet supported though.

Change-Id: I3f81bb35acd27186222ce6d5217e820042527c01
2012-04-03 18:10:48 +02:00
Gabriel Wicke 5248fd31e8 Magic links and behavior switch tokenization by Ori Livneh
Commit first patch by Ori, lets 288 parser tests pass. Yay!

Change-Id: Iac8c3d1ad1984900350b20f7e725c40618a1e8ba
2012-04-02 17:31:34 +02:00
Gabriel Wicke 5ef2074251 Enable support for block-level wiki constructs in template arguments. This
gets a bit closer to supporting table fragments passed through template
arguments. Next, we'll need a way to indicate start-of-line position to
enable sol block-levels in template parameters. 

Example:

{|
{{#if: true|{{!}}Table cell|}}
|}
2012-03-15 11:43:49 +00:00
Gabriel Wicke 7e22020398 Convert syntactical break flags for templates from counters to the stack
variant to fix the precedence for {{!}} (break on these inside table content,
but not in template options within tables).
2012-03-14 16:30:59 +00:00
Gabriel Wicke 77a61dd687 Improve support for {{!}}, and don't produce a pre for indented tables. 2012-03-14 10:58:11 +00:00
Gabriel Wicke 835914b2de Support {{=}}. 2012-03-14 09:07:01 +00:00
Gabriel Wicke 2195c31abf Move link types to data-mw-rt, and support some more template tokenization
edge cases. For example, the PHP parser treats | foo | = bar | as | foo = bar |,
believe it or not ;)
2012-03-13 12:32:31 +00:00
Gabriel Wicke 4cd8b302ac Improved template tokenization. The parser can now template-expand
[[:en:Barack Obama]] without exceeding 1.7GB of memory (which is the node
limit).
2012-03-12 17:31:45 +00:00
Gabriel Wicke 3c5fe2523c Tolerate more newlines and spaces in templates, and support templates and
comments in urls.
2012-03-12 14:31:06 +00:00
Gabriel Wicke ae4ab7a39c Refactor syntactic stops into an object and add a stack variant for option
values.
2012-03-12 13:08:43 +00:00
Gabriel Wicke ffc9383096 Temporary fix for template tokenization, especially needed for
[[Template:Cite core]].
2012-03-08 14:24:04 +00:00
Gabriel Wicke f02ff95aa3 Token representation clean-up. Now all tokens are differentiated using
constructors instead of type attributes.
2012-03-07 20:06:54 +00:00
Gabriel Wicke 227103e12c Accept empty table cell attribute sections, and consider percent-encoded %2525
valid. 270 tests passing.
2012-03-06 14:32:45 +00:00
Gabriel Wicke 2efcd3cd57 Reworked percent encoding handling for URIs to get closer to the 'url
construction' part of the HTML5 spec:
http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation

Removed a few whitelisted test cases that are now passing directly.

The encoding canonicalization could also be moved to the Sanitizer. Doing this
early in token stream processing however has the advantage of providing further
transformations uniform data to work with. We could even consider to move this
even further into the tokenizer.
2012-03-06 13:49:37 +00:00
Gabriel Wicke a9ebc1d986 Support external images wrapped in a clickable link using bracketed external
link syntax. 265 tests passing.
2012-03-05 16:23:00 +00:00
Gabriel Wicke 7f7202e89c A few improvements to external link and image handling. 264 tests passing. 2012-03-05 15:34:27 +00:00
Gabriel Wicke 7b0c807710 Change wikilink tokenization strategy to split on pipes. This makes it
possible to support template / template argument expansion in image options,
and causes little trouble for wikilinks. Non-image wikilinks with multiple
text pipes are quite rare in the dumps, and concatenating description tokens
with a plain '|' is quite easy. 261 parser tests passing.
2012-03-05 12:00:38 +00:00
Gabriel Wicke 167dbdb0fa Parse image options. 2012-03-02 13:36:37 +00:00
Gabriel Wicke 8b7ba9051b Add productions for image option tokenization, and prepare to call those from
the LinkHandler token stream transformer.
2012-03-01 18:07:20 +00:00
Gabriel Wicke 4b9bd45b82 Start to move wikilink expansion to a separate async token transformer. 2012-02-29 13:56:29 +00:00
Gabriel Wicke b8bb503199 Actually commit onlyinclude, as already announced in r112592. 2012-02-28 13:24:35 +00:00
Gabriel Wicke 491ad5ffef Cleanup and commenting. 2012-02-22 13:13:18 +00:00
Gabriel Wicke 9b3313d923 Speed up flatten slightly by avoiding garbage for already flat arrays. Also,
use simple string concatenation instead of arrays as the strings tend to be
few and short.
2012-02-22 11:25:44 +00:00
Gabriel Wicke 8dde1f77b4 Reduce debug print overhead, roughly a 10% speed-up on parserTests. 2012-02-21 18:49:43 +00:00
Gabriel Wicke 058c4213a4 Remove some more unused code and tidy up some more. 2012-02-21 18:26:40 +00:00
Gabriel Wicke 416126c041 Fix the bug in the inline_breaks replacement, and write another switch-based
version, which is slightly faster and shorter. Performance is improved by
about 5% for parserTests.
2012-02-21 17:57:30 +00:00
Gabriel Wicke 18a04f7581 Tidy up and comment the tokenizer a bit more. Start to move code into
mediawiki.tokenizer.js module, and pass a reference to parse(). Faster
inline_breaks production using a JS function which seems to be generally
correct, but still breaks five tests when enabled. Seems to be some weird
interaction with peg.js, possibly something to do with caching.
2012-02-21 17:21:42 +00:00
Gabriel Wicke 8718bd65bc Add list of HTML5 and deprecated HTML3/4 elements in preparation for
end-of-potential-extension rules; Support indented tag-wrapped pre blocks.
2012-02-21 14:44:56 +00:00
Gabriel Wicke 059ff94bc4 Reject match for invalid urlencoded code points. 2012-02-16 13:57:56 +00:00
Gabriel Wicke dc1d30fcb5 Tweaked template parameters a bit further, and made the self-closing tag
protection a bit less trigger-happy.
2012-02-15 15:56:11 +00:00
Gabriel Wicke 089413298c Protect self-closing tags in generic attribute production. 2012-02-15 13:23:50 +00:00
Gabriel Wicke 5e94a238fc Prepare for the support of tables (and later generally block-level elements)
in template parameters. 244 tests passing.
2012-02-15 11:51:29 +00:00
Gabriel Wicke 774a3189c8 Improve support for generic attribute names coming from
templates/templateargs.
2012-02-15 10:19:39 +00:00
Gabriel Wicke 1ce6f5a3c4 Improve support for single-line attributes with preprocessor support. 243
tests passing.
2012-02-14 21:25:52 +00:00
Gabriel Wicke f02b3d91c6 Port urlencoded char support to preprocessor-supporting link target
production, and remove old link_target production.
2012-02-14 21:08:25 +00:00