Commit graph

446 commits

Author SHA1 Message Date
Gabriel Wicke 30a83d7fd7 Accept wikilink parameters with dangling equal ('|arg=|')
Change-Id: Ib4f6d186da2a74522b17c377dac5c9a7de7e5861
2012-04-27 11:35:00 +02:00
Gabriel Wicke 1d70e7b81c Disable preformatted text from indents in template args
Change-Id: I84144d3fab6541ed264d9b092806c8bf9de6e8b2
2012-04-27 10:45:08 +02:00
Gabriel Wicke 56d6757f67 Fixes for the template fetch retry feature
Change-Id: Id36cb02c535d07f4f2cdd54ae682b6a144a2faa9
2012-04-26 20:31:23 +02:00
Gabriel Wicke 027d77e0c9 Fix --wikidom and --linearmodel parse.js options; retry on template fetch failures
Change-Id: I444397936fd87971fe085df4b467089367e9ffa6
2012-04-26 19:51:00 +02:00
Gabriel Wicke 3be4992782 'Obama finally expands' ;) Misc fixes and documentation updates
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
  while it would prevously run out of RAM after ~30 minutes. Wohoooo!
  The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
  tests are passing since these tests expect the day of the week to be
  Thursday.  Won't be the case tomorrow.

Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136
2012-04-26 18:18:08 +02:00
Gabriel Wicke 8ff810659a Rename text/wiki and tokens/wiki to text/x-mediawiki and similar
Change-Id: I70113629f4633685cd6db3914303a15e4c79a50a
2012-04-25 20:19:43 +02:00
Gabriel Wicke 814511f523 Remove dead parser pipeline code
Change-Id: I802f1798d5163c1ce82d648f739c2e79b17eda41
2012-04-25 17:12:32 +02:00
Gabriel Wicke 8368e17d6a Biggish token transform system refactoring
* All parser pipelines including tokenizer and DOM stuff are now constructed
  from a 'recipe' data structure in a ParserPipelineFactory.

* All sub-pipelines of these can now be cached

* Event registrations to a pipeline are directly forwarded to the last
  pipeline member to save relatively expensive event forwarding.

* Some APIs for on-demand expansion / format conversion of parameters from
  parser functions are added:

  param.to('tokens/expanded', cb)
  param.to('text/wiki', cb) (this does not work yet)

  All parameters are additionally wrapped into a Param object that provides
  method for positional parameter naming (.named() or conversion to a dict
  (.dict()).

* The async token transform manager is now separated from a frame object, with
  the frame holding arguments, an on-demand expansion method and loop checks.

* Only keys of template parameters are now expanded. Parser functions or
  template arguments trigger an expansion on-demand. This (unsurprisingly)
  makes a big performance difference with typical switch-heavy template
  systems.

* Return values from async transforms are no longer used in favor of plain
  callbacks. This saves the complication of having to maintain two code paths.
  A trick in transformTokens still avoids the construction of unneeded
  TokenAccumulators.

* The results of template expansions are no longer buffered.

* 301 parser tests are passing

Known issues:

* Cosmetic cleanup remains to do
* Some parser functions do not support async expansions yet, and need to be
  modified.

Change-Id: I1a7690baffbe8141cadf67270904a1b2e1df879a
2012-04-25 16:51:36 +02:00
Demon 28e44b1d0f Merge "Add --wikidom flag to parse.js" 2012-04-25 14:18:59 +00:00
Catrope 47969e20a1 Add --wikidom flag to parse.js
Also remove unused import of DOMConverter

Change-Id: I1eabe6bf9935970c1f049681b52e867a510ea77a
2012-04-23 15:01:12 -07:00
Gabriel Wicke e2ca8c24c7 Delay some token duplication until actual mutation happens
This is a bit better than cloning tokens wholesale, but not by much. There is
a lot of potential for much better per-token caching with reduced token
cloning. Need to map out all dependencies besides token attributes expanded
from template parameters or other scoped state. Even if tokens themselves
don't need transformation, they might still need to be considered for other
token transformers, so simply keeping the final rank won't quite work even if
the token itself is fully transformed. As a minimum, a shallow clone would
need to be made and the rank reset (as in env.cloneTokens).

Change-Id: I4329113bb21750bae9a635229ed1b08da75dc614
2012-04-18 17:53:04 +02:00
Gabriel Wicke bf84638bc0 Add tokenizer cache and clone token state on mutation
* Added an LRU cache (using the lru-cache node module) for tokenizer output
* Mutation of nested attributes now replaces the containers. A shallow copy of
  tokens is sufficient to isolate token transformations. Need to investigate
  if we can actually get away without isolation and re-transformation for most
  ordinary tokens.

Change-Id: I9136b1d7a1fbcc538183a319d4ecaa290d616fdf
2012-04-18 14:40:47 +02:00
Gabriel Wicke aaca5eac7d More tweaks: safesubst and image options
* Ignore safesubst for now
* Remove an unneeded whitelist entry
* Make sure the caption is not lost for thumbs (fix to last commit) and remove
  debug print

Change-Id: I243584ed0838cf7c3b4110fe9cdf869272477312
2012-04-17 11:02:52 +02:00
Gabriel Wicke 7fe5a86b60 Improve image option handling
Change-Id: If1376766f41ff1288bfe2af19beecd3299c09a01
2012-04-17 10:46:20 +02:00
Gabriel Wicke afa5b95bc1 Don't work around html5 library tokenizer attribute reordering
The HTML5 parser we are using to normalize expected HTML output in parserTests
reverses the order of attributes (see
https://github.com/aredridel/html5/pull/53 for the fix). Remove whitelist
entries concerned with this and use the proper order in external image
attributes.

Change-Id: If1868cae05396a150757c85a20473ab756cbcd97
2012-04-16 17:09:06 +02:00
Gabriel Wicke c688b039de Collected tweaks
* less verbose logging in noinclude processing and template expansion
* Give priority to the processing of templates transcluded from transclusions
  to get closer to depth-first processing. This serves to minimize memory
  usage from queued-up tokens.
* Increase the maximum outstanding requests per template retrieval. 10000
  amazingly proved too low a limit on some big pages.
* Only process a single template request callback at a time for now
* Add a debug print in the treebuilder wrapper
* Don't treat multiple comments on a single line as a single comment to match
  the PHP parser's behavior

Change-Id: I9a86b6d7bec3b9e1f17415daf1bf74170240721a
2012-04-16 15:47:03 +02:00
Gabriel Wicke 1bf8a9e5e1 Small tweak in comment about onlyinclude forcing buffered expansion
Change-Id: Ib324e24c51c97e07e6737bf23f16db07043b69ab
2012-04-16 15:42:29 +02:00
Gabriel Wicke efd4c026ea Disallow < and > in external link urls
Change-Id: Id865c3d46b33b182bb5b244e77e815c0afd7fa49
2012-04-16 15:36:56 +02:00
Gabriel Wicke 25523f4cf0 Implement urlencode parser function
Change-Id: I4fca3134c9c3eb9a7d6f3360be6de054fb47477c
2012-04-16 14:54:03 +02:00
Gabriel Wicke 421ef44621 Match the empty string as whitespace too
Change-Id: I1a8ed882021804f62855b9db4368270feebbfc16
2012-04-16 14:48:39 +02:00
Gabriel Wicke 08453199df Increase number of callbacks per reactor iteration to 4
In experiments this dropped the memory consumption further, and reduces the
queuing overhead in the node reactor.

Change-Id: I9409b6ca863b43b7557663bbec9572365059c078
2012-04-13 14:50:36 +02:00
Gabriel Wicke 06ae53fdfe Drastically reduce memory usage for template-heavy pages
Only call back a few callbacks per reactor iteration from the template fetch
request queue. This changes the expansion pattern from a (memory intensive)
breadth-first expansion to something quite close to depth-first expansion.
Additionally, retrieved pages are quickly added to the page cache so that a
lot of request queuing is avoided in favor of synchronous expansion from the
cache. On pages like Barack Obama that previously ran out of memory after
consuming node's 1.6G heap limit, expansion now runs in relatively constant
100-300M resident (so far, still running).

Change-Id: Ie34a1eeff00d868416de45ef8d289898258f560c
2012-04-13 14:31:03 +02:00
Gabriel Wicke df050e4481 Convert external link syntax stops to stack
Eat unbalanced external link parts within template parameters. This does not
produce the same output as the PHP parser
(try echo '{{YouTube}}' | node parse.js), but preserves a level of sanity.
Need to check how common this is for external links. If it is rare enough,
moving the ']' after the parser function manually would fix the rendering for
the YouTube case.

Change-Id: I597d808efff36baa22191e7946a0061cc31120e8
2012-04-13 11:08:42 +02:00
Gabriel Wicke 5bb2d96869 Token stream transform improvements
* add past paths for empty arguments etc
* cache attribute token transform pipelines
* fix bugs in TokenCollector and NoIncludeOnly handler, and improve its
  efficiency by only registering for 'end' tokens on demand
* Remove empty reset methods from a few handlers
* Add a simple 'ap' debug print function that makes it easy to only print some
  debug prints by temporarily changing 'dp' to 'ap'
* Improvements and bug fixes in AttributeExpander

Change-Id: Ie69729c8f62d48bba922712e44ebce484c621c50
2012-04-12 15:42:09 +02:00
Gabriel Wicke 3124deca2c Track inclusion status on CachedTokenPipeline
Non-include attribute pipelines are not cached for now. Adding separate
caching for non-include attribute pipelines is very likely worth it, but
deferred for now.

Change-Id: I13f949d9f0a04536f9ccfcb73a2be69c5c08be01
2012-04-12 10:21:50 +02:00
Gabriel Wicke efa41370d3 Set inclusion flag for attribute transform managers too
Change-Id: Ice15d8fde6de4a3e850a028db9917e976218fc43
2012-04-11 21:55:52 +02:00
Gabriel Wicke bff43938f6 Support noinclude/includeonly/onlyinclude in attributes
Fun test case:
{|
|-<includeonly>
foo
</includeonly>
|Hello
|}

Change-Id: I353bb287d3967ade549fbcb4ae64511a1f1f7e36
2012-04-11 17:37:25 +02:00
Gabriel Wicke 9ae572cca0 Fixes to template expansion / token transform managers, 296 tests passing.
* Convert isNoInclude logic to positive isInclude throughout and set it
  properly on attribute pipelines. Also don't cache non-include pipelines.
* Add a --pagename parameter to parse.js, which sets the page name in the
  environment. This is then returned by {{PAGENAME}}. Not the final solution,
  but useful for taxobox testing as taxons are selected based on PAGENAME.
* Add rudimentary pagenamebase parser function

Change-Id: If9c0be4c255200d0f2a30f02e5619437b4fd8f12
2012-04-11 16:34:27 +02:00
Gabriel Wicke bbae66cd69 Nominate more HTML5 sectioning and heading elements for block-level treatment
Block-level (in HTML4 lingo) elements are not wrapped into paragraphs.

Change-Id: I4a01c9721be30b526172952915d528dea79e2f30
2012-04-11 12:53:49 +02:00
Gabriel Wicke 5a33099875 Improve template tokenization in template arguments
Taxobox tables now render pretty much correctly.

Change-Id: I5a0564138ff0c688d8a5a69b7867646fd3763946
2012-04-10 16:40:49 +02:00
Gabriel Wicke 577ef1f916 Add some support for alignment of thumbs
Change-Id: I70570f48423628f7a87a35647698a66a5f413088
2012-04-10 12:11:59 +02:00
Gabriel Wicke 403be4af42 Add basic thumb rendering support
* DOM based on Wikia's thumb output: HTML5, clean caption without magnify
  icon.
* basic RDFa annotations, but most options additionally in data-mw object-
  might want to move more (or all?) of those into RDFa data using meta tags.
* no support yet for framed or other formats, image scaling etc
* also tweaked some config options in the environment

Change-Id: Ie461fcdce060cfc2dec65cc057709ae650ef3368
2012-04-09 23:04:26 +02:00
Gabriel Wicke dbdd320348 Improve parameter tokenization support especially for table rows
Change-Id: I961d69e228b96adc69ea9acb3733d13f5898602d
2012-04-05 16:00:26 +02:00
Gabriel Wicke 7a35e5db16 Remove behaviors var in tokenizer, now handled in token handler
Change-Id: I68eeff3f05ce29c13e347c2cd7ea6519e58b0e03
2012-04-04 21:17:29 +02:00
GWicke da60861be8 Merge ""magic words" are tokenized and used to set parser.environment flags" 2012-04-04 19:11:03 +00:00
Adam Wight a85ed36efa "magic words" are tokenized and used to set parser.environment flags
behavior switches are converted to tokens which set parser.environment flags during the async transformation stage.

The next step would be for handlers in the sync23 stage to generate the TOC, section edit links, and so on according to these directives.

No tests written, because the switches are consumed and don't appear in rendered html.  We can test the magic word layout controls individually, once they're implemented.

Another small change was to store option flags directly in the environment object, not that it makes much difference.

Change-Id: I863fbf4be1a17d2f6c31158298dd301f19ae1137
2012-04-04 11:25:29 -07:00
Adam Wight b234edba88 As much as I have loved writing Makefiles... I've replaced its functionality with package.json, mostly so we can avoid non-node dependencies. This is one of the recommended practices. We should consider moving tests/parser into modules/parser/tests, other node projects keep all module code in one directory.
Explained in the README how to use npm to load the dependencies and run tests.  Too bad about NODE_PATH...

Don't try to find parserTests.txt in assorted places--if it isn't present, fetch from gerrit.  You can symlink from core if you're developing on both parsers, and the fetch script will not overwrite.

Use __dirname in parserTests.js to allow the script to run independent of current working directory.

Change-Id: I4c8b884e91f4fdeae385c7697aff768bdd199dd5
2012-04-04 11:02:58 -07:00
Gabriel Wicke e3a745a024 Improvements for template / -argument precedence; support for empty params
Change-Id: Id0894ccbedfa47fa3658817ca65119a2af76be3e
2012-04-04 16:29:47 +02:00
Gabriel Wicke 2037215185 Disallow '[' in generic attribute names
This avoids interpreting something like

! [[foo|bar]]

as

<th [[foo=''>bar]]</th>.

Change-Id: If59708fa90eb0117a15b2b6446890d1ae19a857c
2012-04-04 14:31:11 +02:00
Gabriel Wicke f588d2a7aa Fix table headings in template parameters
Change-Id: Icdfc5655968fc845230ad7638124309d6b8c1ada
2012-04-04 12:54:34 +02:00
Gabriel Wicke b8d980a229 Don't eat newline / space in template parameters
..so that block_lines can match.

Change-Id: I4c464dc44249f40e4aa280df35fb726bfce3a745
2012-04-04 11:22:31 +02:00
Trevor Parscal 606d97da99 Merge "Add HTML DOM -> linear model converter" 2012-04-03 17:52:55 +00:00
Gabriel Wicke 47de122a95 Improve support for table / template interaction
Match pairs of {{!}} or | for template productions, but not a mix of the two.
Example:

{{#if:1|{{!}}-
{{!}} {{#if:1|style="color: red"{{!}}|}}
}}

Note that the style parameter ends up as the *key* of an empty-valued
attribute on the table cell currently.

Change-Id: I5f9357dd1645ef97b0af89f32e8d92ae49218c72
2012-04-03 18:48:35 +02:00
Gabriel Wicke 0fe062fbe1 JSHint cleanups and parser function argument handling improvements
Parser functions which only accept positional arguments now return both the
key and value of arguments. Complete attributes (key and value) for templates
and the like from parser functions are not yet supported though.

Change-Id: I3f81bb35acd27186222ce6d5217e820042527c01
2012-04-03 18:10:48 +02:00
GWicke b7db83e09a Merge "Magic links and behavior switch tokenization by Ori Livneh" 2012-04-02 16:43:13 +00:00
Gabriel Wicke f662690d02 Shorten data-mw-rt to data-mw and clean up whitelist
Instead of a proliferation of data-mw-* attributes, it should be easier to
stash all private / non-semantic round-trip information in a JSON object
stored in data-mw.

Change-Id: Id200a6a8789fa152f29ea530e5a24b6ee7b4b285
2012-04-02 18:12:49 +02:00
Gabriel Wicke 5248fd31e8 Magic links and behavior switch tokenization by Ori Livneh
Commit first patch by Ori, lets 288 parser tests pass. Yay!

Change-Id: Iac8c3d1ad1984900350b20f7e725c40618a1e8ba
2012-04-02 17:31:34 +02:00
Catrope 8dc994f037 Add HTML DOM -> linear model converter
Also, in ParserPipeline:
* Import the LM converter and expose it through getLinearModel()
* Fix getWikiDom() to actually work (still unused)

In parse.js:
* Add --help option that prints usage information (was unreachable)
* Add --linearmodel option to output linear model JSON instead of HTML

Change-Id: Ic534e03ff40a7c9117bb63f0c635a4213d5e3406
2012-03-29 12:47:14 -07:00
Gabriel Wicke 5ef2074251 Enable support for block-level wiki constructs in template arguments. This
gets a bit closer to supporting table fragments passed through template
arguments. Next, we'll need a way to indicate start-of-line position to
enable sol block-levels in template parameters. 

Example:

{|
{{#if: true|{{!}}Table cell|}}
|}
2012-03-15 11:43:49 +00:00
Gabriel Wicke 7e22020398 Convert syntactical break flags for templates from counters to the stack
variant to fix the precedence for {{!}} (break on these inside table content,
but not in template options within tables).
2012-03-14 16:30:59 +00:00
Gabriel Wicke 77a61dd687 Improve support for {{!}}, and don't produce a pre for indented tables. 2012-03-14 10:58:11 +00:00
Gabriel Wicke 835914b2de Support {{=}}. 2012-03-14 09:07:01 +00:00
Gabriel Wicke 2195c31abf Move link types to data-mw-rt, and support some more template tokenization
edge cases. For example, the PHP parser treats | foo | = bar | as | foo = bar |,
believe it or not ;)
2012-03-13 12:32:31 +00:00
Gabriel Wicke 4cd8b302ac Improved template tokenization. The parser can now template-expand
[[:en:Barack Obama]] without exceeding 1.7GB of memory (which is the node
limit).
2012-03-12 17:31:45 +00:00
Gabriel Wicke 3c5fe2523c Tolerate more newlines and spaces in templates, and support templates and
comments in urls.
2012-03-12 14:31:06 +00:00
Gabriel Wicke ae4ab7a39c Refactor syntactic stops into an object and add a stack variant for option
values.
2012-03-12 13:08:43 +00:00
Roan Kattouw 29f416937e Fix some usages of splice.apply in the data model to use
ve.batchedSplice(). Added FIXME comments for occurrences outside of DM
2012-03-10 00:31:28 +00:00
Gabriel Wicke ffc9383096 Temporary fix for template tokenization, especially needed for
[[Template:Cite core]].
2012-03-08 14:24:04 +00:00
Gabriel Wicke 39017dd769 Percent-encode spaces in URLs, so that they are recognized as valid URLs later
on.
2012-03-08 11:53:15 +00:00
Gabriel Wicke 7518db8197 A few fixes to parser functions and template expansion. Trim whitespace off
template arguments, let the last duplicate key win and fake pagenamee slightly
better.
2012-03-08 11:44:37 +00:00
Gabriel Wicke 51023feaa4 Improvements for image option handling. 2012-03-08 10:03:22 +00:00
Gabriel Wicke b1e131d568 A bit more documentation and naming cleanup in the tokenizer wrapper. 2012-03-08 09:00:45 +00:00
Gabriel Wicke f02ff95aa3 Token representation clean-up. Now all tokens are differentiated using
constructors instead of type attributes.
2012-03-07 20:06:54 +00:00
Gabriel Wicke f157093a41 Delegate responsibility for resetting the token rank to transforms, if full
re-processing in a phase is wanted. By default, after a token type change or
the return of multiple tokens only the remaining transforms with higher ranks
are applied.

Updated a few comments as well.
2012-03-07 19:29:53 +00:00
Gabriel Wicke 1f8c43b9e2 A few minor documentation updates. 2012-03-07 18:42:26 +00:00
Gabriel Wicke 5f618103d7 Set allTokensProcessed flag for async callbacks from the template expander. 2012-03-07 17:36:33 +00:00
Gabriel Wicke e5a1116817 Start re-transformation as soon as possible in TokenAccumulator._returnTokens
to maximize IO concurrency. Signal that all tokens are fully transformed to
callbacks called from TokenAccumulator._returnTokens. The result should be a
single re-transformation when entering the callback chain, and only if the
transform does not signal that it took care of full transformation itself.
Template expansion would set this flag, as the nested transform pipeline
processes all tokens to the end of phase async12.
2012-03-07 16:29:06 +00:00
Gabriel Wicke 656524dbbc Fixes for multi-transformer expansion in AsyncTransformManager. Added argument
to callback which lets transforms indicate if their returned tokens are fully
processed for their phase. If not, the callback re-processes them so that any
remaining transforms are applied.
2012-03-07 15:39:18 +00:00
Gabriel Wicke af03eb4f29 Improve generic attribute expansion before external link processing, and make
wgUploadPath configurable. Also change the hard-coded fall-back image sizes to
sensible defaults. This breaks three parser tests until image size retrieval
from the wiki is implemented.
2012-03-06 18:02:35 +00:00
Gabriel Wicke 227103e12c Accept empty table cell attribute sections, and consider percent-encoded %2525
valid. 270 tests passing.
2012-03-06 14:32:45 +00:00
Gabriel Wicke 2efcd3cd57 Reworked percent encoding handling for URIs to get closer to the 'url
construction' part of the HTML5 spec:
http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation

Removed a few whitelisted test cases that are now passing directly.

The encoding canonicalization could also be moved to the Sanitizer. Doing this
early in token stream processing however has the advantage of providing further
transformations uniform data to work with. We could even consider to move this
even further into the tokenizer.
2012-03-06 13:49:37 +00:00
Gabriel Wicke 19fe9726a2 Fix invalid external link representation. 268 tests passing. 2012-03-05 18:06:29 +00:00
Gabriel Wicke a9ebc1d986 Support external images wrapped in a clickable link using bracketed external
link syntax. 265 tests passing.
2012-03-05 16:23:00 +00:00
Gabriel Wicke 7f7202e89c A few improvements to external link and image handling. 264 tests passing. 2012-03-05 15:34:27 +00:00
Gabriel Wicke 7b0c807710 Change wikilink tokenization strategy to split on pipes. This makes it
possible to support template / template argument expansion in image options,
and causes little trouble for wikilinks. Non-image wikilinks with multiple
text pipes are quite rare in the dumps, and concatenating description tokens
with a plain '|' is quite easy. 261 parser tests passing.
2012-03-05 12:00:38 +00:00
Gabriel Wicke 3e6f1b6bea Use some options primitively. 2012-03-02 14:19:33 +00:00
Gabriel Wicke 167dbdb0fa Parse image options. 2012-03-02 13:36:37 +00:00
Gabriel Wicke 8b7ba9051b Add productions for image option tokenization, and prepare to call those from
the LinkHandler token stream transformer.
2012-03-01 18:07:20 +00:00
Gabriel Wicke b1a7119a46 Hack up some rudimentary image rendering. Using jshashes for the md5, and
a few hard-coded image image sizes ;) 262 tests passing.
2012-03-01 13:51:53 +00:00
Gabriel Wicke d4faf9eaf4 More work on wiki link rendering and general wiki title / namespace
functionality.
2012-03-01 12:47:05 +00:00
Gabriel Wicke 4b9bd45b82 Start to move wikilink expansion to a separate async token transformer. 2012-02-29 13:56:29 +00:00
Gabriel Wicke b8bb503199 Actually commit onlyinclude, as already announced in r112592. 2012-02-28 13:24:35 +00:00
Gabriel Wicke 3227903d48 Follow-up to r112116, accidentally committed from subdirectory. 2012-02-22 16:41:01 +00:00
Gabriel Wicke 3568dfee14 Add some support for functionhooks in test parser and parserTests.js, and
tweak a few parser functions.
2012-02-22 15:59:11 +00:00
Gabriel Wicke d7da324272 Basic fall-through support for #switch parser function 2012-02-22 14:57:50 +00:00
Gabriel Wicke 491ad5ffef Cleanup and commenting. 2012-02-22 13:13:18 +00:00
Gabriel Wicke 9b3313d923 Speed up flatten slightly by avoiding garbage for already flat arrays. Also,
use simple string concatenation instead of arrays as the strings tend to be
few and short.
2012-02-22 11:25:44 +00:00
Gabriel Wicke 8dde1f77b4 Reduce debug print overhead, roughly a 10% speed-up on parserTests. 2012-02-21 18:49:43 +00:00
Gabriel Wicke 058c4213a4 Remove some more unused code and tidy up some more. 2012-02-21 18:26:40 +00:00
Gabriel Wicke 416126c041 Fix the bug in the inline_breaks replacement, and write another switch-based
version, which is slightly faster and shorter. Performance is improved by
about 5% for parserTests.
2012-02-21 17:57:30 +00:00
Gabriel Wicke 18a04f7581 Tidy up and comment the tokenizer a bit more. Start to move code into
mediawiki.tokenizer.js module, and pass a reference to parse(). Faster
inline_breaks production using a JS function which seems to be generally
correct, but still breaks five tests when enabled. Seems to be some weird
interaction with peg.js, possibly something to do with caching.
2012-02-21 17:21:42 +00:00
Gabriel Wicke 8718bd65bc Add list of HTML5 and deprecated HTML3/4 elements in preparation for
end-of-potential-extension rules; Support indented tag-wrapped pre blocks.
2012-02-21 14:44:56 +00:00
Gabriel Wicke ffec77273a Comment and minor code tweaks. 2012-02-21 11:24:20 +00:00
au ea15bffb27 Revert "* Always sort attributes (+1 test pass)."
This reverts commit 45ca281da8eef8030bdd1986418cb914fc9a717c.
2012-02-20 22:26:12 +00:00
Gabriel Wicke 5806705733 Push transformer setup a bit further into the attribute pipeline. 2012-02-20 12:56:00 +00:00
Gabriel Wicke 8eddb4ec6b Add some comments to the Sanitizer 2012-02-20 11:14:53 +00:00
Gabriel Wicke 71e95bd54b Set up token stream transformers from a map of phases per input content type.
Not yet applied to attribute pipeline creation. 249 tests passing.
2012-02-20 11:07:21 +00:00
au 9c55f5e8b7 * Always sort attributes (+1 test pass).
The performance impact for .sort is quite small (12.079s => 12.158s)
  and Sanitizer is probably one of the more accessible places to do this.
2012-02-18 21:01:07 +00:00
au aa589d989b * Rudimentary CSS validation; +4 tests pass. (Bug 2304, 3244). 2012-02-18 20:16:23 +00:00
Gabriel Wicke 4d80b8daa8 Detail comments about next steps and divide parser functions in those that
need more information from the wiki and readily implementable items.
2012-02-17 10:23:14 +00:00
Gabriel Wicke 059ff94bc4 Reject match for invalid urlencoded code points. 2012-02-16 13:57:56 +00:00
Gabriel Wicke dc1d30fcb5 Tweaked template parameters a bit further, and made the self-closing tag
protection a bit less trigger-happy.
2012-02-15 15:56:11 +00:00
Gabriel Wicke 089413298c Protect self-closing tags in generic attribute production. 2012-02-15 13:23:50 +00:00
Gabriel Wicke 5e94a238fc Prepare for the support of tables (and later generally block-level elements)
in template parameters. 244 tests passing.
2012-02-15 11:51:29 +00:00
Gabriel Wicke 774a3189c8 Improve support for generic attribute names coming from
templates/templateargs.
2012-02-15 10:19:39 +00:00
Gabriel Wicke 1ce6f5a3c4 Improve support for single-line attributes with preprocessor support. 243
tests passing.
2012-02-14 21:25:52 +00:00
Gabriel Wicke f02b3d91c6 Port urlencoded char support to preprocessor-supporting link target
production, and remove old link_target production.
2012-02-14 21:08:25 +00:00
Gabriel Wicke 001194b140 Replace console.log with console.warn in all debug statements 2012-02-14 20:56:14 +00:00
Gabriel Wicke f42b379e52 Fix named wikilink options (image options really) in template arguments, and
speed up template parameter parsing by eliminating some backtracking. 238
tests passing (unchanged).
2012-02-14 15:45:18 +00:00
Gabriel Wicke 64f63b3714 request is automatically installed by jsdom. Follow-up to r111459. Thanks
Hashar!
2012-02-14 14:15:50 +00:00
Gabriel Wicke 466e8e54ad Tweak comment about request module 2012-02-14 14:01:13 +00:00
Gabriel Wicke 0b8d1b0387 * Add custom toString methods for tokens to aid debugging
* Convert all attributes into strings in Sanitizer
* Use strict comparison against empty string in tokenizer
* Add very simple sitename parserfunction
* 138 tests passing
2012-02-13 17:02:23 +00:00
Gabriel Wicke 9945175416 Reformat Date.replaceChars 2012-02-13 14:23:48 +00:00
Gabriel Wicke 0b40741e1c Strip trailing newlines from included templates 2012-02-13 14:17:03 +00:00
Gabriel Wicke 025f9cddb3 Prefix all internal data- attributes with data-mw- and adjust the whitelist
and test output normalization accordingly. 235 tests passing.
2012-02-13 13:54:07 +00:00
Gabriel Wicke b1617b1d71 Add some support for ideographic spaces in external links, support the
int: namespace alias and perform some normalization on the MediaWiki namespace
prefix.
2012-02-13 13:35:46 +00:00
Gabriel Wicke 55ddb4fd66 Remove WikiDom default serialization and --html argument from parse.js
wrapper. HTML ist now the only supported format. The DOMConverter is now no
longer used. Roan, feel free to remove / butcher it for direct HTML to linear
model conversion.
2012-02-11 17:59:17 +00:00
Gabriel Wicke a122e51eec Move data-* annotations into separate object on tokens, that is then
serialized into a single data-mw-rt attribute if present. Update parserTests
to ignore this attribute for comparisons with expected parser output.

A few more tweaks and notes are thrown into this commit too. 233 tests are
passing now.
2012-02-11 16:43:25 +00:00
Gabriel Wicke aff30be131 Some comments and reshuffling in the grammar, and a typo in the
AttributeExpander.
2012-02-09 22:27:45 +00:00
Gabriel Wicke 6e33255503 Improve support for preprocessor functionality in attributes; Support
multi-line xmlish tags with preprocessor stuff in attributes.
2012-02-09 16:36:29 +00:00
Gabriel Wicke 16ded7d955 Fix a bug in wikilink with trail tokenization. 2012-02-09 14:06:35 +00:00
Gabriel Wicke 6983481561 Move attribute expansion back to separate handler, as this makes it easier to
only expand used branches selected by parser functions. Template (and
-argument) expansion is simply registered before general expansion.

Additionally, a few more simple time-based magic words are added in
ParserFunctions.
2012-02-09 13:44:20 +00:00
Gabriel Wicke 3f7c1499cd Enable support for general preprocessor functionality in attribute keys and
values. This includes comments, templates and template arguments.

This also replaces the specialized expansion logic in the TemplateHandler. The
removal of link validation lets one more parser test fail for now. External
link target validation will need to be implemented in the token stream handler
for links. This is noted as TODO in
https://www.mediawiki.org/wiki/Future/Parser_development#Token_stream_transforms.
2012-02-08 15:10:30 +00:00
Gabriel Wicke 157c495a9e Normalize the title in localurl. 232 tests passing. 2012-02-07 12:26:00 +00:00
Gabriel Wicke b4892102a4 Clean up transform callback interface 2012-02-07 11:53:29 +00:00
Gabriel Wicke 1f6db903e9 Pluck a few low-hanging fruit in external link tokenization, and add a simple
localurl parser function implementation. 230 parser tests now passing.
2012-02-07 10:28:23 +00:00
Gabriel Wicke cf8b7bf45d External links don't nest. 2012-02-07 09:38:28 +00:00
Gabriel Wicke 53bf4f2bd0 Temporarily disable the sanitizer and start to support preprocessor
functionality (comments, templates, template arguments) in arbitrary
attributes. The grammar for this is still quite rough, will need to
consolidate that area.
2012-02-06 19:15:44 +00:00
Gabriel Wicke c26243989e Improve toJSON handlers to include all properties 2012-02-06 19:12:29 +00:00
Gabriel Wicke 0bea9fdfbb Fix nowiki tokenization regression introduced r110495 2012-02-03 13:10:04 +00:00
Gabriel Wicke 26f2026cff Add custom JSON serializers for tokens that include a type attribute 2012-02-03 13:09:01 +00:00
Gabriel Wicke 8c75aa1a7a Remove type attribute for tag tokens. 2012-02-01 18:37:48 +00:00
Gabriel Wicke 689f697a93 Push token format conversion a bit further along, and add defines that were
missing in last commit.
2012-02-01 17:03:08 +00:00
Gabriel Wicke a5cc10a06b Change token format to plain strings for text tokens, and specific objects for
other tokens. This is only the first half of the conversion. The next step is
to drop the type attribute on most tokens and match on the constructor in the
token transform machinery.
2012-02-01 16:30:43 +00:00
Gabriel Wicke dd3707ded5 Remove some modules normally bundled with node.js from dependencies, and
remove some older ones that are only used in currently-dead code.
2012-02-01 10:32:33 +00:00
Gabriel Wicke e65c6502c0 Add source for #time implementation in comment 2012-02-01 10:14:01 +00:00
Gabriel Wicke 14a8a13678 A few more debug helpers including a --trace mode for light debugging. Some
improvements to parser functions on the way to support the cite extensions.
Preparation for generic template and template arg in attribute support. 222
parser tests now passing.
2012-01-31 16:50:16 +00:00
Neil Kandalgaonkar 2688f823ef added dependencies to README 2012-01-31 00:56:07 +00:00
Neil Kandalgaonkar f0b934ef2e first pass at an API method that returns wikidom. Shells out to node. Some issues with XML API result formatting but works fine in JSON 2012-01-31 00:02:48 +00:00
Gabriel Wicke 7cd94df47d A few minor tweaks to reduce memory usage 2012-01-27 13:32:44 +00:00
Gabriel Wicke 4e6a54560a * Emit token chunks for top-level block elements by patching the source of the
tokenizer
* Fix a bug uncovered by this
* Increase the number of outstanding listeners on a single download to 10000
2012-01-22 23:21:53 +00:00
Gabriel Wicke 7ea4d7d3db A few parser function fixes and maximum template expansion in environment
config.
2012-01-22 19:32:28 +00:00
Gabriel Wicke 561cf3c237 Bug fixes and a first stab at a #time parser function. You can expand the main
page like this:

cd extensions/VisualEditor/modules/parser
echo '{{:Main Page}}' | node parse.js
echo '{{:Main Page}}' | node parse.js --html
echo '{{:Main Page}}' | node parse.js --debug

Even the date-based includes work somewhat, although they don't yet accept
passed-in dates.
2012-01-22 07:07:16 +00:00
Gabriel Wicke 60e45bb739 A bit of template expansion bug fixing and parser function documentation 2012-01-22 01:27:22 +00:00
Gabriel Wicke e8a7034acf Add some commandline switches to parse.js. Supports switching on/off debug
mode and a selection of html/WikiDom serialization.
2012-01-21 22:42:54 +00:00
Gabriel Wicke 785a4af76f Implement a few parser functions. 220 parser tests now passing. 2012-01-21 20:38:13 +00:00
Gabriel Wicke 1a6546fbca Support empty template arguments and default values in arg expansion 2012-01-21 03:03:33 +00:00
Gabriel Wicke fdd048b3b2 Remove a few stray debug prints and disable debugging in parse.js 2012-01-20 22:21:33 +00:00
Gabriel Wicke 145df2655c * NoInclude and IncludeOnly improvements
* Tokenizer support for templates and template args in template arguments and titles
* Async attribute expansion fixes
2012-01-20 22:02:23 +00:00
Gabriel Wicke 348cac6cf0 Fix a bug in TokenCollector, and misc tweaks for template expansions. 2012-01-20 18:47:17 +00:00
Gabriel Wicke 7cc8e69147 Collapse all requests per template into a single outstanding request using an
event-emitting TemplateRequest object and a request queue.
2012-01-20 02:36:18 +00:00
Gabriel Wicke fc2088bb21 Add some rudimentary noinclude / includeonly support and fix up
TokenCollector.
2012-01-20 01:46:16 +00:00
Gabriel Wicke c15e0d4167 Minor cleanup in TemplateHandler 2012-01-20 00:49:27 +00:00
Gabriel Wicke d0ece16c86 Fix async template expansion, so we can now render simple pages with templates
directly to WikiDom from enwiki using a commandline like this:

  echo '{{User:GWicke/Test}}' | node parse.js

Wohoo!

Complex pages with templates won't render properly yet, as noinclude /
includeonly and parser functions are not yet implemented. As a result, the
parser will run out of memory or hit the currently low expansion depth limit
as it tries to expand documentation for all templates.
2012-01-19 23:43:39 +00:00
Gabriel Wicke 2233d0a488 Eventify parser tests and parse.js commandline wrapper to actuallly allow
async template fetching. Async expansion is not yet fully debugged, but at
least the preconditions for that are now there.
2012-01-18 23:46:01 +00:00
Gabriel Wicke 5b8054636e Make template fetching somewhat functional on node with Inez' help, but
disable it by default in parserTests as it tries to fetch all sorts of parser
functions and is not yet fully supported in parserTests. The next step will be
to build a list of parser functions (to avoid fetching them as templates) and
pushing the event interface into parserTests.
2012-01-18 19:38:32 +00:00
Gabriel Wicke 4bd4307924 Fix comment to reflect the actual regexp/spec in the JS version as well. 2012-01-18 19:35:13 +00:00
Gabriel Wicke 14e6728cc4 Add the start of a minimal sanitizer stage, that only strips IDN ignored
characters from host portions of links hrefs for now. This module needs to be
filled up with pretty much everything Sanitizer.php does, including tag and
attribute whitelists and attribute value sanitation (especially for style
attributes).

We'll also need to think about round-tripping of sanitized tokens.
2012-01-18 01:42:56 +00:00
Gabriel Wicke 336be4f617 Eat '[[[' as plain text token, makes it 212 passing. 2012-01-18 00:23:17 +00:00
Gabriel Wicke 178adbc342 Accept IPv6 (and IPv4) addresses in the tokenizer, so another test passes. 2012-01-18 00:00:47 +00:00
Gabriel Wicke e7381da5b8 Trim whitespace off template titles and argument names. 209 parser tests now
passing.
2012-01-17 23:18:33 +00:00
Gabriel Wicke f50fecf1e3 Fix template argument expansion. 200 parser tests now passing. 2012-01-17 22:29:26 +00:00
Gabriel Wicke 34025251a3 Clean up 'END' token handling a bit. 2012-01-17 20:01:21 +00:00
Gabriel Wicke 7f579398c7 Use isBlockTag in DOMPostProcessor 2012-01-17 18:30:22 +00:00
Gabriel Wicke 6bd7ca1e75 Misc improvements, now 196 parser tests passing.
* Add handler for post-expand paragraph wrapping on token stream, to handle
  things like comments on its own line post-expand
* Add general Util module
* Fix self-closing tag handling in HTML5 tree builder
2012-01-17 18:22:10 +00:00
Gabriel Wicke f4081bef08 First template expansion tests start working, and a bug fix in
DOMPostProcessor paragraph wrapper. 187 parser tests now passing.
2012-01-14 00:58:20 +00:00
Gabriel Wicke 196d704e8e Template expansion now enabled and somewhat working, but template fetching
still fails all the time.
2012-01-13 18:48:25 +00:00
Gabriel Wicke 32c9bccd7c Results of early template expansion debugging. Still disabled by default, but
getting closer.
2012-01-11 19:48:49 +00:00
Gabriel Wicke 6b6ec2933d More work towards template expansion.
* Created AttributeTokenTransformManager for generic attribute conversion, and
  removed { title, template argument {key, value} } expansion from
  TemplateHandler.
* Added caching for attribute and input sub-pipelines. Especially attribute
  pipelines would otherwise be recreated for each attribute value and key.
2012-01-11 00:05:51 +00:00
Gabriel Wicke 5ec30252f1 More token transform and pipeline setup refactoring to support template
expansion better.
2012-01-10 01:09:50 +00:00
Gabriel Wicke 287604c422 A bit of cleanup in ParserPipeline, with better and more consistent support
for multiple input types.
2012-01-09 19:33:49 +00:00
Gabriel Wicke becf3cb7ea Add generic 'collect all tokens between delimiter tokens and call a transform
function on it' util for synchronous transformation phases. This can be used
to implement parser hooks (aka extension tags) besides other things.
2012-01-09 18:13:45 +00:00
Gabriel Wicke e99d7a2a55 Two batteries worth of token transform manager refactoring.
* TokenTransformDispatcher is now renamed to TokenTransformManager, and is
  also turned into a base class
* SyncTokenTransformManager and AsyncTokenTransformManager subclass
  TokenTransformManager and implement synchronous (phase 1,3) and asynchronous
  (phase 2) transformation stages.
* Communication between stages uses the same chunk / end events as all the
  other token stages.
* The AsyncTokenTransformManager now supports the creation of nested
  AsyncTokenTransformManagers for template expansion.
  The AsyncTokenTransformManager object takes on the responsibilities of a
  preprocessor frame. Transforms are newly created (or potentially resurrected
  from a cache), so that transforms do not have to worry about concurrency.
* The environment is pushed through to all transform managers and the
  individual transforms.
2012-01-09 17:49:16 +00:00
Gabriel Wicke 6601c544e6 Handle default for template arg expansion, add template fetch functionality
and tweak a few minor things in the grammar and QuoteTransformer.
2012-01-06 17:19:14 +00:00
Gabriel Wicke f0c844f28f Add template expansion handler skeleton, not yet functional. Also note
improvements needed in the tokenizer template handling.
2012-01-06 14:30:55 +00:00
Gabriel Wicke 2e35171fd1 Fix quote handling and tweak the whitelist a bit. 'any' token registrations
are now merged with specific registrations by rank. Not yet clear if that is a
good idea overall, need to check use cases when implementing template expansion
and other functionality.

183 parser test now passing.
2012-01-04 14:09:05 +00:00
Gabriel Wicke 6cd95fea37 Fix up constructors in EventEmitter inheritance and tweak a few more comments. 2012-01-04 12:28:41 +00:00
Gabriel Wicke e3ae9a702b Fix JSHint warnings (mostly about comment indentation) from r108012. 2012-01-04 11:06:24 +00:00
Gabriel Wicke 4c4a24f0a0 Hook up the DOMPostProcessor using events as well, and rename the subscription
methods to tell a story. Also document idea on how to dynamically configure
the pipeline depending on event registrations in comment.
2012-01-04 11:00:54 +00:00
Gabriel Wicke f0399d2ec5 Clean up comments in TokenTransformDispatcher and mark private methods with
underscore.
2012-01-04 09:48:24 +00:00
Gabriel Wicke ee79158e53 Add trailing newline in commandline parser wrapper 2012-01-04 08:42:53 +00:00
Gabriel Wicke 29362cc53c Rename ParseThingy to ParserPipeline and fix up broken WikiDom generation and
commandline runner.
2012-01-04 08:39:45 +00:00
Gabriel Wicke bd98eb4c5a Land big TokenTransformDispatcher and eventization refactoring.
The TokenTransformDispatcher now actually implements an asynchronous, phased
token transformation framework as described in
https://www.mediawiki.org/wiki/Future/Parser_development/Token_stream_transformations.

Additionally, the parser pipeline is now mostly held together using events.
The tokenizer still emits a lame single events with all tokens, as block-level
emission failed with scoping issues specific to the PEGJS parser generator.
All stages clean up when receiving the end tokens, so that the full pipeline
can be used for repeated parsing.

The QuoteTransformer is not yet 100% fixed to work with the new interface, and
the Cite extension is disabled for now pending adaptation. Bold-italic related
tests are failing currently.
2012-01-03 18:44:31 +00:00
Neil Kandalgaonkar 20374b5911 fix substr for IE, followup r107464 2011-12-30 21:51:03 +00:00
Gabriel Wicke 8e00a72d0a Improvements to link trail handling, and two tweaks to the whitelist. 182
tests now passing. 

Link trails depend on language-dependent positive character classes in the PHP
parser. These classes all seem to disallow punctuation implicitly and list
differing plain text characters instead, so it might be possible to get away
with identifying a common class of non-trail punctuation instead. This would
help to keep the tokenizer independent of configurations, which is very
desirable for caching and simplified external parsing.
2011-12-30 12:47:06 +00:00
Gabriel Wicke 11ece76b7b Fix suffix handling for wiki links. 2011-12-30 09:35:57 +00:00
Gabriel Wicke b3a0270d69 Remove env and load grammar in tokenizer constructor. Re-add property hack to
keep parserTests running for now. Really need a different pipeline for html
serialization or a reference to the HTML DOM.
2011-12-28 17:04:16 +00:00
Gabriel Wicke 3a63fb118e Add a few comments inline, and remove unneeded html serialization as we are
only interested in WikiDom output in this parser wrapper.
2011-12-28 13:46:52 +00:00
Neil Kandalgaonkar 8fbf36e63e put add terminal token inside tokenize method (will pull it out again for streaming interface) 2011-12-28 01:37:15 +00:00
Neil Kandalgaonkar 6103646ec8 remove need to add newline at end of input 2011-12-28 01:37:11 +00:00
Neil Kandalgaonkar 4158f82d7e refactor parser to ParseThingy in different module, can be invoked with command line utility parse.js 2011-12-28 01:37:06 +00:00
Neil Kandalgaonkar d91a67ba99 nodeName not defined 2011-12-28 01:36:54 +00:00
Neil Kandalgaonkar 962d1262fc create tokenizer without need to modify namespace with PEG source 2011-12-28 01:36:36 +00:00
Gabriel Wicke 33e60dd4d9 Update comments a bit. 2011-12-22 12:37:24 +00:00
Gabriel Wicke 9ee0e660ec Fix regression introduced by r107060 for regular table cells. Good to have a
test suite ;)
2011-12-22 12:09:25 +00:00
Gabriel Wicke a94d0ec10c Re-add support for row-only tables. 2011-12-22 11:58:32 +00:00
Gabriel Wicke 1c7fe0eb34 Refactor table productions to support table fragments in templates (table
start / row / table end). The old productions are not deleted yet to make it
easy to compare the output on more complex articles. 181 tests passing after
adding two table tests with whitespace-only differences to the whitelist.
2011-12-22 11:43:55 +00:00
Gabriel Wicke 2845ba9552 Handle noinclude and includeonly at start of line, so that syntax after it
still matches as if it actually was preceded by a newline.
2011-12-21 11:38:50 +00:00
Gabriel Wicke 3a631db6d9 Fix ranges for annotations in implicit paragraphs within branch nodes. 2011-12-16 19:36:04 +00:00
Gabriel Wicke cc06551f2e Rename table_header production to table_heading. Those non-natives strike again. 2011-12-16 19:24:59 +00:00