Commit graph

122 commits

Author SHA1 Message Date
Gabriel Wicke 92f753a365 Pre and link target improvements
* Don't explicitly add the newline in the pre, as we preserve newline tokens
  now. This avoids doubling of newlines when round-tripping.
* Use the sHref attribute even if the href contains spaces.

Change-Id: I8bec8fbfd6a7836bf2e5eec20869a0edd95c93b6
2012-06-04 14:03:05 +02:00
Gabriel Wicke ece2b0f810 Tokenizer backtracking cache bug fix and memory savings
* The state of syntax stops is now properly included in the cache key for the
  tokenizer-internal backtracking cache. This fixes some mis-parses when
  re-parsing a bit of text with different flags.
* Clear the backtracking cache after each toplevelblock. This drops the peak
  memory usage when expanding [[:en:Barack Obama]] from ~380M to ~110M.

Change-Id: Icdb879cae5907e4595903dd6acba2e686e8c2e4b
2012-06-01 12:53:49 +02:00
Gabriel Wicke c5d7e01944 Another tokenizer robustness improvement
This patch fixes a tokenizer syntax error encountered on
[[:en:Template:JacksonvilleWikiProject-Member]] and [[:en:Template:Infobox
former country]] by allowing optional whitespace before start-of-line template
syntax.

Change-Id: Ic214a731de58bf766e51f23d5e24ea2ce6788f58
2012-05-30 18:38:23 +02:00
Gabriel Wicke a133768781 Don't eat '}}' in generic attributes and similar productions
This fixes some syntax errors, at least one in Template:Geobox.

Change-Id: I32338febe25d0833c1d9bc4de293cd15b4cbb7be
2012-05-30 17:37:10 +02:00
Gabriel Wicke 36084c5d93 Preserve original newlines in HTML and serialization
254 round-trip tests (up from 184) are now passing.

Also:
* tweaked runtests.sh slightly (use less -R instead of -r).
* made sure the EOFTk is preserved in phase 3 transforms

Change-Id: I1de22186bdb78e52019370e43f096877005b8f5a
2012-05-29 23:29:03 +02:00
Gabriel Wicke b2adee0ae7 Basic rt support for indent pre variant
* Added a generic stx_v 'syntax variant' round-trip attribute
* For pre, use stx:'html' vs. no syntax annotation. This might not be 100%
  safe for arbitrary html input, so we might want to flip this to stx:'wiki'
  later.
* 181 round-trip tests passing

Change-Id: If6080917a3a7c069066db3db60efe59b1f6c28d8
2012-05-25 18:55:38 +02:00
Gabriel Wicke a31ccaabe4 Support definition lists with empty definition
Change-Id: I81c39a7e49f2ea7ce32cdd3600caeb5eb9f50d84
2012-05-25 15:40:32 +02:00
Gabriel Wicke 39c6f42879 Link round-tripping and other improvements
* Changed RDFa for links according to
  http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary
* Added basic support for internal/external link serialization
* Moved numbering of external links from tokenizer to LinkHandler
* Added round-tripping for generic HTML tags
* Replaced nowiki tag with <meta typeOf="mw:tag" content="nowiki"> and <meta
  typeOf="mw:tag" content="/nowiki"> for now.
* 154 round-trip tests passing (node parserTests.js --roundtrip).

Change-Id: I16c4db21b1b543ee57c73e569c83025b64664542
2012-05-22 13:36:06 +02:00
Gabriel Wicke 7e21b7380a Merge "Round-trip nowiki" 2012-05-21 17:16:56 +00:00
Gabriel Wicke fb7d5418a5 Round-trip nowiki
Change-Id: I5f7e6a43f5fdc1708ee710b2a601b20db733452c
2012-05-21 18:06:09 +02:00
Gabriel Wicke a6610e52c2 Serializer and table round-tripping improvements
* added stx: 'html' round-trip information for html tags
* added t_stx: 'row' info for row-wise table wiki syntax, and support for it
  in the serializer
* the first table row is implicit in wikitext
* renamed lastToken to prevToken in serializer
* strip first newline in an initial chunkCB

Change-Id: I014b046539d1b674d830551c5fd1b74a67f81993
2012-05-21 14:59:53 +02:00
Gabriel Wicke e2815b516c Start to handle links
Change-Id: I1fb975910651820fd889d77152562fd4fbcb5db8
2012-05-17 14:32:56 +02:00
Gabriel Wicke d918fa18ac Big token transform framework overhaul part 2
* Tokens are now immutable. The progress of transformations is tracked on
  chunks instead of tokens. Tokenizer output is cached and can be directly
  returned without a need for cloning. Transforms are required to clone or
  newly create tokens they are modifying.

* Expansions per chunk are now shared between equivalent frames via a cache
  stored on the chunk itself. Equivalence of frames is not yet ideal though,
  as right now a hash tree of *unexpanded* arguments is used. This should be
  switched to a hash of the fully expanded local parameters instead.

* There is now a vastly improved maybeSyncReturn wrapper for async transforms
  that either forwards processing to the iterative transformTokens if the
  current transform is still ongoing, or manages a recursive transformation if
  needed.

* Parameters for parser functions are now wrapped in abstract Params and
  ParserValue objects, which support some handy on-demand *value* expansions.
  Keys are always expanded. Parser functions are converted to use these
  interfaces, and now properly expand their values in the correct frame.
  Making this expansion lazier is certainly possible, but would complicate
  transformTokens and other token-handling machinery. Need to investigate if
  it would really be worth it. Dead branch elimination is certainly a bigger
  win overall.

* Complex recursive asynchronous expansions should now be closer to correct
  for both the iterative (transformTokens) and recursive (maybeSyncReturn
  after transformTokens has returned) code paths.

* Performance degraded slightly. There are no micro-optimizations done yet
  and the shared expansion cache still has a low hit rate. The progress
  tracking on chunks is not yet perfect, so there are likely a lot of unneeded
  re-expansions that can be easily eliminated. There is also more debug
  tracing right now. Obama currently expands in 54 seconds on my laptop.

Change-Id: I4a603f3d3c70ca657ebda9fbb8570269f943d6b6
2012-05-15 17:05:47 +02:00
Adam Wight 0a7f0b7630 List markup is created during the sync23 phase.
This makes it possible to transclude list items from a template.

Note: "5 quotes" test is broken by this patch, it appears that ListHandler
newline processing is changing some state which mysteriously affects the
QuoteTransformer.  This is ominous, hopefully there's a simple explanation...

gwicke: fix a bug in tokenizer triggered by definition lists like this:
**; foo : bar

Change-Id: I4e3a86596fe9bffcbfc4bf22895362c3bf742bad
2012-05-08 11:39:36 +02:00
Gabriel Wicke 909633ea08 Improve template / tplarg precedence in tokenizer
Change-Id: If9b24b42ea223e0f30f906a83496d73ec60c4a0d
2012-05-04 13:17:06 +02:00
Gabriel Wicke 30a83d7fd7 Accept wikilink parameters with dangling equal ('|arg=|')
Change-Id: Ib4f6d186da2a74522b17c377dac5c9a7de7e5861
2012-04-27 11:35:00 +02:00
Gabriel Wicke 1d70e7b81c Disable preformatted text from indents in template args
Change-Id: I84144d3fab6541ed264d9b092806c8bf9de6e8b2
2012-04-27 10:45:08 +02:00
Gabriel Wicke 3be4992782 'Obama finally expands' ;) Misc fixes and documentation updates
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
  while it would prevously run out of RAM after ~30 minutes. Wohoooo!
  The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
  tests are passing since these tests expect the day of the week to be
  Thursday.  Won't be the case tomorrow.

Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136
2012-04-26 18:18:08 +02:00
Gabriel Wicke 8368e17d6a Biggish token transform system refactoring
* All parser pipelines including tokenizer and DOM stuff are now constructed
  from a 'recipe' data structure in a ParserPipelineFactory.

* All sub-pipelines of these can now be cached

* Event registrations to a pipeline are directly forwarded to the last
  pipeline member to save relatively expensive event forwarding.

* Some APIs for on-demand expansion / format conversion of parameters from
  parser functions are added:

  param.to('tokens/expanded', cb)
  param.to('text/wiki', cb) (this does not work yet)

  All parameters are additionally wrapped into a Param object that provides
  method for positional parameter naming (.named() or conversion to a dict
  (.dict()).

* The async token transform manager is now separated from a frame object, with
  the frame holding arguments, an on-demand expansion method and loop checks.

* Only keys of template parameters are now expanded. Parser functions or
  template arguments trigger an expansion on-demand. This (unsurprisingly)
  makes a big performance difference with typical switch-heavy template
  systems.

* Return values from async transforms are no longer used in favor of plain
  callbacks. This saves the complication of having to maintain two code paths.
  A trick in transformTokens still avoids the construction of unneeded
  TokenAccumulators.

* The results of template expansions are no longer buffered.

* 301 parser tests are passing

Known issues:

* Cosmetic cleanup remains to do
* Some parser functions do not support async expansions yet, and need to be
  modified.

Change-Id: I1a7690baffbe8141cadf67270904a1b2e1df879a
2012-04-25 16:51:36 +02:00
Gabriel Wicke c688b039de Collected tweaks
* less verbose logging in noinclude processing and template expansion
* Give priority to the processing of templates transcluded from transclusions
  to get closer to depth-first processing. This serves to minimize memory
  usage from queued-up tokens.
* Increase the maximum outstanding requests per template retrieval. 10000
  amazingly proved too low a limit on some big pages.
* Only process a single template request callback at a time for now
* Add a debug print in the treebuilder wrapper
* Don't treat multiple comments on a single line as a single comment to match
  the PHP parser's behavior

Change-Id: I9a86b6d7bec3b9e1f17415daf1bf74170240721a
2012-04-16 15:47:03 +02:00
Gabriel Wicke efd4c026ea Disallow &lt; and &gt; in external link urls
Change-Id: Id865c3d46b33b182bb5b244e77e815c0afd7fa49
2012-04-16 15:36:56 +02:00
Gabriel Wicke df050e4481 Convert external link syntax stops to stack
Eat unbalanced external link parts within template parameters. This does not
produce the same output as the PHP parser
(try echo '{{YouTube}}' | node parse.js), but preserves a level of sanity.
Need to check how common this is for external links. If it is rare enough,
moving the ']' after the parser function manually would fix the rendering for
the YouTube case.

Change-Id: I597d808efff36baa22191e7946a0061cc31120e8
2012-04-13 11:08:42 +02:00
Gabriel Wicke bff43938f6 Support noinclude/includeonly/onlyinclude in attributes
Fun test case:
{|
|-<includeonly>
foo
</includeonly>
|Hello
|}

Change-Id: I353bb287d3967ade549fbcb4ae64511a1f1f7e36
2012-04-11 17:37:25 +02:00
Gabriel Wicke 5a33099875 Improve template tokenization in template arguments
Taxobox tables now render pretty much correctly.

Change-Id: I5a0564138ff0c688d8a5a69b7867646fd3763946
2012-04-10 16:40:49 +02:00
Gabriel Wicke dbdd320348 Improve parameter tokenization support especially for table rows
Change-Id: I961d69e228b96adc69ea9acb3733d13f5898602d
2012-04-05 16:00:26 +02:00
Gabriel Wicke 7a35e5db16 Remove behaviors var in tokenizer, now handled in token handler
Change-Id: I68eeff3f05ce29c13e347c2cd7ea6519e58b0e03
2012-04-04 21:17:29 +02:00
Adam Wight a85ed36efa "magic words" are tokenized and used to set parser.environment flags
behavior switches are converted to tokens which set parser.environment flags during the async transformation stage.

The next step would be for handlers in the sync23 stage to generate the TOC, section edit links, and so on according to these directives.

No tests written, because the switches are consumed and don't appear in rendered html.  We can test the magic word layout controls individually, once they're implemented.

Another small change was to store option flags directly in the environment object, not that it makes much difference.

Change-Id: I863fbf4be1a17d2f6c31158298dd301f19ae1137
2012-04-04 11:25:29 -07:00
Gabriel Wicke e3a745a024 Improvements for template / -argument precedence; support for empty params
Change-Id: Id0894ccbedfa47fa3658817ca65119a2af76be3e
2012-04-04 16:29:47 +02:00
Gabriel Wicke 2037215185 Disallow '[' in generic attribute names
This avoids interpreting something like

! [[foo|bar]]

as

<th [[foo=''>bar]]</th>.

Change-Id: If59708fa90eb0117a15b2b6446890d1ae19a857c
2012-04-04 14:31:11 +02:00
Gabriel Wicke f588d2a7aa Fix table headings in template parameters
Change-Id: Icdfc5655968fc845230ad7638124309d6b8c1ada
2012-04-04 12:54:34 +02:00
Gabriel Wicke b8d980a229 Don't eat newline / space in template parameters
..so that block_lines can match.

Change-Id: I4c464dc44249f40e4aa280df35fb726bfce3a745
2012-04-04 11:22:31 +02:00
Gabriel Wicke 47de122a95 Improve support for table / template interaction
Match pairs of {{!}} or | for template productions, but not a mix of the two.
Example:

{{#if:1|{{!}}-
{{!}} {{#if:1|style="color: red"{{!}}|}}
}}

Note that the style parameter ends up as the *key* of an empty-valued
attribute on the table cell currently.

Change-Id: I5f9357dd1645ef97b0af89f32e8d92ae49218c72
2012-04-03 18:48:35 +02:00
Gabriel Wicke 0fe062fbe1 JSHint cleanups and parser function argument handling improvements
Parser functions which only accept positional arguments now return both the
key and value of arguments. Complete attributes (key and value) for templates
and the like from parser functions are not yet supported though.

Change-Id: I3f81bb35acd27186222ce6d5217e820042527c01
2012-04-03 18:10:48 +02:00
Gabriel Wicke 5248fd31e8 Magic links and behavior switch tokenization by Ori Livneh
Commit first patch by Ori, lets 288 parser tests pass. Yay!

Change-Id: Iac8c3d1ad1984900350b20f7e725c40618a1e8ba
2012-04-02 17:31:34 +02:00
Gabriel Wicke 5ef2074251 Enable support for block-level wiki constructs in template arguments. This
gets a bit closer to supporting table fragments passed through template
arguments. Next, we'll need a way to indicate start-of-line position to
enable sol block-levels in template parameters. 

Example:

{|
{{#if: true|{{!}}Table cell|}}
|}
2012-03-15 11:43:49 +00:00
Gabriel Wicke 7e22020398 Convert syntactical break flags for templates from counters to the stack
variant to fix the precedence for {{!}} (break on these inside table content,
but not in template options within tables).
2012-03-14 16:30:59 +00:00
Gabriel Wicke 77a61dd687 Improve support for {{!}}, and don't produce a pre for indented tables. 2012-03-14 10:58:11 +00:00
Gabriel Wicke 835914b2de Support {{=}}. 2012-03-14 09:07:01 +00:00
Gabriel Wicke 2195c31abf Move link types to data-mw-rt, and support some more template tokenization
edge cases. For example, the PHP parser treats | foo | = bar | as | foo = bar |,
believe it or not ;)
2012-03-13 12:32:31 +00:00
Gabriel Wicke 4cd8b302ac Improved template tokenization. The parser can now template-expand
[[:en:Barack Obama]] without exceeding 1.7GB of memory (which is the node
limit).
2012-03-12 17:31:45 +00:00
Gabriel Wicke 3c5fe2523c Tolerate more newlines and spaces in templates, and support templates and
comments in urls.
2012-03-12 14:31:06 +00:00
Gabriel Wicke ae4ab7a39c Refactor syntactic stops into an object and add a stack variant for option
values.
2012-03-12 13:08:43 +00:00
Gabriel Wicke ffc9383096 Temporary fix for template tokenization, especially needed for
[[Template:Cite core]].
2012-03-08 14:24:04 +00:00
Gabriel Wicke f02ff95aa3 Token representation clean-up. Now all tokens are differentiated using
constructors instead of type attributes.
2012-03-07 20:06:54 +00:00
Gabriel Wicke 227103e12c Accept empty table cell attribute sections, and consider percent-encoded %2525
valid. 270 tests passing.
2012-03-06 14:32:45 +00:00
Gabriel Wicke 2efcd3cd57 Reworked percent encoding handling for URIs to get closer to the 'url
construction' part of the HTML5 spec:
http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation

Removed a few whitelisted test cases that are now passing directly.

The encoding canonicalization could also be moved to the Sanitizer. Doing this
early in token stream processing however has the advantage of providing further
transformations uniform data to work with. We could even consider to move this
even further into the tokenizer.
2012-03-06 13:49:37 +00:00
Gabriel Wicke a9ebc1d986 Support external images wrapped in a clickable link using bracketed external
link syntax. 265 tests passing.
2012-03-05 16:23:00 +00:00
Gabriel Wicke 7f7202e89c A few improvements to external link and image handling. 264 tests passing. 2012-03-05 15:34:27 +00:00
Gabriel Wicke 7b0c807710 Change wikilink tokenization strategy to split on pipes. This makes it
possible to support template / template argument expansion in image options,
and causes little trouble for wikilinks. Non-image wikilinks with multiple
text pipes are quite rare in the dumps, and concatenating description tokens
with a plain '|' is quite easy. 261 parser tests passing.
2012-03-05 12:00:38 +00:00
Gabriel Wicke 167dbdb0fa Parse image options. 2012-03-02 13:36:37 +00:00