Only call back a few callbacks per reactor iteration from the template fetch
request queue. This changes the expansion pattern from a (memory intensive)
breadth-first expansion to something quite close to depth-first expansion.
Additionally, retrieved pages are quickly added to the page cache so that a
lot of request queuing is avoided in favor of synchronous expansion from the
cache. On pages like Barack Obama that previously ran out of memory after
consuming node's 1.6G heap limit, expansion now runs in relatively constant
100-300M resident (so far, still running).
Change-Id: Ie34a1eeff00d868416de45ef8d289898258f560c
Eat unbalanced external link parts within template parameters. This does not
produce the same output as the PHP parser
(try echo '{{YouTube}}' | node parse.js), but preserves a level of sanity.
Need to check how common this is for external links. If it is rare enough,
moving the ']' after the parser function manually would fix the rendering for
the YouTube case.
Change-Id: I597d808efff36baa22191e7946a0061cc31120e8
* add past paths for empty arguments etc
* cache attribute token transform pipelines
* fix bugs in TokenCollector and NoIncludeOnly handler, and improve its
efficiency by only registering for 'end' tokens on demand
* Remove empty reset methods from a few handlers
* Add a simple 'ap' debug print function that makes it easy to only print some
debug prints by temporarily changing 'dp' to 'ap'
* Improvements and bug fixes in AttributeExpander
Change-Id: Ie69729c8f62d48bba922712e44ebce484c621c50
Non-include attribute pipelines are not cached for now. Adding separate
caching for non-include attribute pipelines is very likely worth it, but
deferred for now.
Change-Id: I13f949d9f0a04536f9ccfcb73a2be69c5c08be01
* Convert isNoInclude logic to positive isInclude throughout and set it
properly on attribute pipelines. Also don't cache non-include pipelines.
* Add a --pagename parameter to parse.js, which sets the page name in the
environment. This is then returned by {{PAGENAME}}. Not the final solution,
but useful for taxobox testing as taxons are selected based on PAGENAME.
* Add rudimentary pagenamebase parser function
Change-Id: If9c0be4c255200d0f2a30f02e5619437b4fd8f12
* DOM based on Wikia's thumb output: HTML5, clean caption without magnify
icon.
* basic RDFa annotations, but most options additionally in data-mw object-
might want to move more (or all?) of those into RDFa data using meta tags.
* no support yet for framed or other formats, image scaling etc
* also tweaked some config options in the environment
Change-Id: Ie461fcdce060cfc2dec65cc057709ae650ef3368
behavior switches are converted to tokens which set parser.environment flags during the async transformation stage.
The next step would be for handlers in the sync23 stage to generate the TOC, section edit links, and so on according to these directives.
No tests written, because the switches are consumed and don't appear in rendered html. We can test the magic word layout controls individually, once they're implemented.
Another small change was to store option flags directly in the environment object, not that it makes much difference.
Change-Id: I863fbf4be1a17d2f6c31158298dd301f19ae1137
Explained in the README how to use npm to load the dependencies and run tests. Too bad about NODE_PATH...
Don't try to find parserTests.txt in assorted places--if it isn't present, fetch from gerrit. You can symlink from core if you're developing on both parsers, and the fetch script will not overwrite.
Use __dirname in parserTests.js to allow the script to run independent of current working directory.
Change-Id: I4c8b884e91f4fdeae385c7697aff768bdd199dd5
Match pairs of {{!}} or | for template productions, but not a mix of the two.
Example:
{{#if:1|{{!}}-
{{!}} {{#if:1|style="color: red"{{!}}|}}
}}
Note that the style parameter ends up as the *key* of an empty-valued
attribute on the table cell currently.
Change-Id: I5f9357dd1645ef97b0af89f32e8d92ae49218c72
Parser functions which only accept positional arguments now return both the
key and value of arguments. Complete attributes (key and value) for templates
and the like from parser functions are not yet supported though.
Change-Id: I3f81bb35acd27186222ce6d5217e820042527c01
Instead of a proliferation of data-mw-* attributes, it should be easier to
stash all private / non-semantic round-trip information in a JSON object
stored in data-mw.
Change-Id: Id200a6a8789fa152f29ea530e5a24b6ee7b4b285
Also, in ParserPipeline:
* Import the LM converter and expose it through getLinearModel()
* Fix getWikiDom() to actually work (still unused)
In parse.js:
* Add --help option that prints usage information (was unreachable)
* Add --linearmodel option to output linear model JSON instead of HTML
Change-Id: Ic534e03ff40a7c9117bb63f0c635a4213d5e3406
gets a bit closer to supporting table fragments passed through template
arguments. Next, we'll need a way to indicate start-of-line position to
enable sol block-levels in template parameters.
Example:
{|
{{#if: true|{{!}}Table cell|}}
|}
re-processing in a phase is wanted. By default, after a token type change or
the return of multiple tokens only the remaining transforms with higher ranks
are applied.
Updated a few comments as well.
to maximize IO concurrency. Signal that all tokens are fully transformed to
callbacks called from TokenAccumulator._returnTokens. The result should be a
single re-transformation when entering the callback chain, and only if the
transform does not signal that it took care of full transformation itself.
Template expansion would set this flag, as the nested transform pipeline
processes all tokens to the end of phase async12.
to callback which lets transforms indicate if their returned tokens are fully
processed for their phase. If not, the callback re-processes them so that any
remaining transforms are applied.
wgUploadPath configurable. Also change the hard-coded fall-back image sizes to
sensible defaults. This breaks three parser tests until image size retrieval
from the wiki is implemented.
construction' part of the HTML5 spec:
http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#url-manipulation-and-creation
Removed a few whitelisted test cases that are now passing directly.
The encoding canonicalization could also be moved to the Sanitizer. Doing this
early in token stream processing however has the advantage of providing further
transformations uniform data to work with. We could even consider to move this
even further into the tokenizer.
possible to support template / template argument expansion in image options,
and causes little trouble for wikilinks. Non-image wikilinks with multiple
text pipes are quite rare in the dumps, and concatenating description tokens
with a plain '|' is quite easy. 261 parser tests passing.
mediawiki.tokenizer.js module, and pass a reference to parse(). Faster
inline_breaks production using a JS function which seems to be generally
correct, but still breaks five tests when enabled. Seems to be some weird
interaction with peg.js, possibly something to do with caching.
* Convert all attributes into strings in Sanitizer
* Use strict comparison against empty string in tokenizer
* Add very simple sitename parserfunction
* 138 tests passing
wrapper. HTML ist now the only supported format. The DOMConverter is now no
longer used. Roan, feel free to remove / butcher it for direct HTML to linear
model conversion.
serialized into a single data-mw-rt attribute if present. Update parserTests
to ignore this attribute for comparisons with expected parser output.
A few more tweaks and notes are thrown into this commit too. 233 tests are
passing now.
only expand used branches selected by parser functions. Template (and
-argument) expansion is simply registered before general expansion.
Additionally, a few more simple time-based magic words are added in
ParserFunctions.
values. This includes comments, templates and template arguments.
This also replaces the specialized expansion logic in the TemplateHandler. The
removal of link validation lets one more parser test fail for now. External
link target validation will need to be implemented in the token stream handler
for links. This is noted as TODO in
https://www.mediawiki.org/wiki/Future/Parser_development#Token_stream_transforms.
functionality (comments, templates, template arguments) in arbitrary
attributes. The grammar for this is still quite rough, will need to
consolidate that area.
other tokens. This is only the first half of the conversion. The next step is
to drop the type attribute on most tokens and match on the constructor in the
token transform machinery.
improvements to parser functions on the way to support the cite extensions.
Preparation for generic template and template arg in attribute support. 222
parser tests now passing.
page like this:
cd extensions/VisualEditor/modules/parser
echo '{{:Main Page}}' | node parse.js
echo '{{:Main Page}}' | node parse.js --html
echo '{{:Main Page}}' | node parse.js --debug
Even the date-based includes work somewhat, although they don't yet accept
passed-in dates.
directly to WikiDom from enwiki using a commandline like this:
echo '{{User:GWicke/Test}}' | node parse.js
Wohoo!
Complex pages with templates won't render properly yet, as noinclude /
includeonly and parser functions are not yet implemented. As a result, the
parser will run out of memory or hit the currently low expansion depth limit
as it tries to expand documentation for all templates.
disable it by default in parserTests as it tries to fetch all sorts of parser
functions and is not yet fully supported in parserTests. The next step will be
to build a list of parser functions (to avoid fetching them as templates) and
pushing the event interface into parserTests.
characters from host portions of links hrefs for now. This module needs to be
filled up with pretty much everything Sanitizer.php does, including tag and
attribute whitelists and attribute value sanitation (especially for style
attributes).
We'll also need to think about round-tripping of sanitized tokens.
* Add handler for post-expand paragraph wrapping on token stream, to handle
things like comments on its own line post-expand
* Add general Util module
* Fix self-closing tag handling in HTML5 tree builder
* Created AttributeTokenTransformManager for generic attribute conversion, and
removed { title, template argument {key, value} } expansion from
TemplateHandler.
* Added caching for attribute and input sub-pipelines. Especially attribute
pipelines would otherwise be recreated for each attribute value and key.
* TokenTransformDispatcher is now renamed to TokenTransformManager, and is
also turned into a base class
* SyncTokenTransformManager and AsyncTokenTransformManager subclass
TokenTransformManager and implement synchronous (phase 1,3) and asynchronous
(phase 2) transformation stages.
* Communication between stages uses the same chunk / end events as all the
other token stages.
* The AsyncTokenTransformManager now supports the creation of nested
AsyncTokenTransformManagers for template expansion.
The AsyncTokenTransformManager object takes on the responsibilities of a
preprocessor frame. Transforms are newly created (or potentially resurrected
from a cache), so that transforms do not have to worry about concurrency.
* The environment is pushed through to all transform managers and the
individual transforms.
are now merged with specific registrations by rank. Not yet clear if that is a
good idea overall, need to check use cases when implementing template expansion
and other functionality.
183 parser test now passing.
The TokenTransformDispatcher now actually implements an asynchronous, phased
token transformation framework as described in
https://www.mediawiki.org/wiki/Future/Parser_development/Token_stream_transformations.
Additionally, the parser pipeline is now mostly held together using events.
The tokenizer still emits a lame single events with all tokens, as block-level
emission failed with scoping issues specific to the PEGJS parser generator.
All stages clean up when receiving the end tokens, so that the full pipeline
can be used for repeated parsing.
The QuoteTransformer is not yet 100% fixed to work with the new interface, and
the Cite extension is disabled for now pending adaptation. Bold-italic related
tests are failing currently.
tests now passing.
Link trails depend on language-dependent positive character classes in the PHP
parser. These classes all seem to disallow punctuation implicitly and list
differing plain text characters instead, so it might be possible to get away
with identifying a common class of non-trail punctuation instead. This would
help to keep the tokenizer independent of configurations, which is very
desirable for caching and simplified external parsing.
start / row / table end). The old productions are not deleted yet to make it
easy to compare the output on more complex articles. 181 tests passing after
adding two table tests with whitespace-only differences to the whitelist.
is in its early stages and nowhere near deployment, please Be Bold and just
commit things like this directly! IMHO it makes more sense to fully review this
once it settles down a bit.