* added stx: 'html' round-trip information for html tags
* added t_stx: 'row' info for row-wise table wiki syntax, and support for it
in the serializer
* the first table row is implicit in wikitext
* renamed lastToken to prevToken in serializer
* strip first newline in an initial chunkCB
Change-Id: I014b046539d1b674d830551c5fd1b74a67f81993
Lists are a bit tricky, as nested lists are not wrapped in a separate list
item. Should work now though.
Change-Id: I2e5f29f6afa6bdd2d5e5c0c5d019b70c611b73d1
This fix only affects following transforms, of which there are few right now.
Also removed a stray token mutation in QuoteTransformer.
Change-Id: Id6d4adce944b06fc1a3651cfbf63fc2670125225
* Tokens are now immutable. The progress of transformations is tracked on
chunks instead of tokens. Tokenizer output is cached and can be directly
returned without a need for cloning. Transforms are required to clone or
newly create tokens they are modifying.
* Expansions per chunk are now shared between equivalent frames via a cache
stored on the chunk itself. Equivalence of frames is not yet ideal though,
as right now a hash tree of *unexpanded* arguments is used. This should be
switched to a hash of the fully expanded local parameters instead.
* There is now a vastly improved maybeSyncReturn wrapper for async transforms
that either forwards processing to the iterative transformTokens if the
current transform is still ongoing, or manages a recursive transformation if
needed.
* Parameters for parser functions are now wrapped in abstract Params and
ParserValue objects, which support some handy on-demand *value* expansions.
Keys are always expanded. Parser functions are converted to use these
interfaces, and now properly expand their values in the correct frame.
Making this expansion lazier is certainly possible, but would complicate
transformTokens and other token-handling machinery. Need to investigate if
it would really be worth it. Dead branch elimination is certainly a bigger
win overall.
* Complex recursive asynchronous expansions should now be closer to correct
for both the iterative (transformTokens) and recursive (maybeSyncReturn
after transformTokens has returned) code paths.
* Performance degraded slightly. There are no micro-optimizations done yet
and the shared expansion cache still has a low hit rate. The progress
tracking on chunks is not yet perfect, so there are likely a lot of unneeded
re-expansions that can be easily eliminated. There is also more debug
tracing right now. Obama currently expands in 54 seconds on my laptop.
Change-Id: I4a603f3d3c70ca657ebda9fbb8570269f943d6b6
Trying something trivial like echo 'Hello world' | node parse.js
would throw TypeError: Function.prototype.apply: Arguments list has wrong type
Change-Id: Ia0a1154b0f3edbfb1f228a1d2072fced1b147141
* Setting the rank on tokens is still used currently, but will be phased out
in favor of setting it on chunks. Tokens will be immutable to allow sharing
and caching without a need for cloning.
* Only register for newline and end tokens in QuoteTransformer when active.
Change-Id: I2c45bc7e4a105219a1404ab221eed7f242128f1e
This makes it possible to transclude list items from a template.
Note: "5 quotes" test is broken by this patch, it appears that ListHandler
newline processing is changing some state which mysteriously affects the
QuoteTransformer. This is ominous, hopefully there's a simple explanation...
gwicke: fix a bug in tokenizer triggered by definition lists like this:
**; foo : bar
Change-Id: I4e3a86596fe9bffcbfc4bf22895362c3bf742bad
* parentCB (if set) is called with { async: true } if expansion is going to be
asynchronous.
* Strings are handled efficiently
* all value parameter chunks can now be converted using .to().
Change-Id: Ib013e1bc3d8e7f692009038209db6a056887326e
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
while it would prevously run out of RAM after ~30 minutes. Wohoooo!
The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
tests are passing since these tests expect the day of the week to be
Thursday. Won't be the case tomorrow.
Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136
* All parser pipelines including tokenizer and DOM stuff are now constructed
from a 'recipe' data structure in a ParserPipelineFactory.
* All sub-pipelines of these can now be cached
* Event registrations to a pipeline are directly forwarded to the last
pipeline member to save relatively expensive event forwarding.
* Some APIs for on-demand expansion / format conversion of parameters from
parser functions are added:
param.to('tokens/expanded', cb)
param.to('text/wiki', cb) (this does not work yet)
All parameters are additionally wrapped into a Param object that provides
method for positional parameter naming (.named() or conversion to a dict
(.dict()).
* The async token transform manager is now separated from a frame object, with
the frame holding arguments, an on-demand expansion method and loop checks.
* Only keys of template parameters are now expanded. Parser functions or
template arguments trigger an expansion on-demand. This (unsurprisingly)
makes a big performance difference with typical switch-heavy template
systems.
* Return values from async transforms are no longer used in favor of plain
callbacks. This saves the complication of having to maintain two code paths.
A trick in transformTokens still avoids the construction of unneeded
TokenAccumulators.
* The results of template expansions are no longer buffered.
* 301 parser tests are passing
Known issues:
* Cosmetic cleanup remains to do
* Some parser functions do not support async expansions yet, and need to be
modified.
Change-Id: I1a7690baffbe8141cadf67270904a1b2e1df879a
This is a bit better than cloning tokens wholesale, but not by much. There is
a lot of potential for much better per-token caching with reduced token
cloning. Need to map out all dependencies besides token attributes expanded
from template parameters or other scoped state. Even if tokens themselves
don't need transformation, they might still need to be considered for other
token transformers, so simply keeping the final rank won't quite work even if
the token itself is fully transformed. As a minimum, a shallow clone would
need to be made and the rank reset (as in env.cloneTokens).
Change-Id: I4329113bb21750bae9a635229ed1b08da75dc614
* Added an LRU cache (using the lru-cache node module) for tokenizer output
* Mutation of nested attributes now replaces the containers. A shallow copy of
tokens is sufficient to isolate token transformations. Need to investigate
if we can actually get away without isolation and re-transformation for most
ordinary tokens.
Change-Id: I9136b1d7a1fbcc538183a319d4ecaa290d616fdf
* Ignore safesubst for now
* Remove an unneeded whitelist entry
* Make sure the caption is not lost for thumbs (fix to last commit) and remove
debug print
Change-Id: I243584ed0838cf7c3b4110fe9cdf869272477312
The HTML5 parser we are using to normalize expected HTML output in parserTests
reverses the order of attributes (see
https://github.com/aredridel/html5/pull/53 for the fix). Remove whitelist
entries concerned with this and use the proper order in external image
attributes.
Change-Id: If1868cae05396a150757c85a20473ab756cbcd97
* less verbose logging in noinclude processing and template expansion
* Give priority to the processing of templates transcluded from transclusions
to get closer to depth-first processing. This serves to minimize memory
usage from queued-up tokens.
* Increase the maximum outstanding requests per template retrieval. 10000
amazingly proved too low a limit on some big pages.
* Only process a single template request callback at a time for now
* Add a debug print in the treebuilder wrapper
* Don't treat multiple comments on a single line as a single comment to match
the PHP parser's behavior
Change-Id: I9a86b6d7bec3b9e1f17415daf1bf74170240721a
In experiments this dropped the memory consumption further, and reduces the
queuing overhead in the node reactor.
Change-Id: I9409b6ca863b43b7557663bbec9572365059c078
Only call back a few callbacks per reactor iteration from the template fetch
request queue. This changes the expansion pattern from a (memory intensive)
breadth-first expansion to something quite close to depth-first expansion.
Additionally, retrieved pages are quickly added to the page cache so that a
lot of request queuing is avoided in favor of synchronous expansion from the
cache. On pages like Barack Obama that previously ran out of memory after
consuming node's 1.6G heap limit, expansion now runs in relatively constant
100-300M resident (so far, still running).
Change-Id: Ie34a1eeff00d868416de45ef8d289898258f560c
Eat unbalanced external link parts within template parameters. This does not
produce the same output as the PHP parser
(try echo '{{YouTube}}' | node parse.js), but preserves a level of sanity.
Need to check how common this is for external links. If it is rare enough,
moving the ']' after the parser function manually would fix the rendering for
the YouTube case.
Change-Id: I597d808efff36baa22191e7946a0061cc31120e8
* add past paths for empty arguments etc
* cache attribute token transform pipelines
* fix bugs in TokenCollector and NoIncludeOnly handler, and improve its
efficiency by only registering for 'end' tokens on demand
* Remove empty reset methods from a few handlers
* Add a simple 'ap' debug print function that makes it easy to only print some
debug prints by temporarily changing 'dp' to 'ap'
* Improvements and bug fixes in AttributeExpander
Change-Id: Ie69729c8f62d48bba922712e44ebce484c621c50
Non-include attribute pipelines are not cached for now. Adding separate
caching for non-include attribute pipelines is very likely worth it, but
deferred for now.
Change-Id: I13f949d9f0a04536f9ccfcb73a2be69c5c08be01
* Convert isNoInclude logic to positive isInclude throughout and set it
properly on attribute pipelines. Also don't cache non-include pipelines.
* Add a --pagename parameter to parse.js, which sets the page name in the
environment. This is then returned by {{PAGENAME}}. Not the final solution,
but useful for taxobox testing as taxons are selected based on PAGENAME.
* Add rudimentary pagenamebase parser function
Change-Id: If9c0be4c255200d0f2a30f02e5619437b4fd8f12
* DOM based on Wikia's thumb output: HTML5, clean caption without magnify
icon.
* basic RDFa annotations, but most options additionally in data-mw object-
might want to move more (or all?) of those into RDFa data using meta tags.
* no support yet for framed or other formats, image scaling etc
* also tweaked some config options in the environment
Change-Id: Ie461fcdce060cfc2dec65cc057709ae650ef3368
behavior switches are converted to tokens which set parser.environment flags during the async transformation stage.
The next step would be for handlers in the sync23 stage to generate the TOC, section edit links, and so on according to these directives.
No tests written, because the switches are consumed and don't appear in rendered html. We can test the magic word layout controls individually, once they're implemented.
Another small change was to store option flags directly in the environment object, not that it makes much difference.
Change-Id: I863fbf4be1a17d2f6c31158298dd301f19ae1137
Explained in the README how to use npm to load the dependencies and run tests. Too bad about NODE_PATH...
Don't try to find parserTests.txt in assorted places--if it isn't present, fetch from gerrit. You can symlink from core if you're developing on both parsers, and the fetch script will not overwrite.
Use __dirname in parserTests.js to allow the script to run independent of current working directory.
Change-Id: I4c8b884e91f4fdeae385c7697aff768bdd199dd5
Match pairs of {{!}} or | for template productions, but not a mix of the two.
Example:
{{#if:1|{{!}}-
{{!}} {{#if:1|style="color: red"{{!}}|}}
}}
Note that the style parameter ends up as the *key* of an empty-valued
attribute on the table cell currently.
Change-Id: I5f9357dd1645ef97b0af89f32e8d92ae49218c72
Parser functions which only accept positional arguments now return both the
key and value of arguments. Complete attributes (key and value) for templates
and the like from parser functions are not yet supported though.
Change-Id: I3f81bb35acd27186222ce6d5217e820042527c01
Instead of a proliferation of data-mw-* attributes, it should be easier to
stash all private / non-semantic round-trip information in a JSON object
stored in data-mw.
Change-Id: Id200a6a8789fa152f29ea530e5a24b6ee7b4b285
Also, in ParserPipeline:
* Import the LM converter and expose it through getLinearModel()
* Fix getWikiDom() to actually work (still unused)
In parse.js:
* Add --help option that prints usage information (was unreachable)
* Add --linearmodel option to output linear model JSON instead of HTML
Change-Id: Ic534e03ff40a7c9117bb63f0c635a4213d5e3406
gets a bit closer to supporting table fragments passed through template
arguments. Next, we'll need a way to indicate start-of-line position to
enable sol block-levels in template parameters.
Example:
{|
{{#if: true|{{!}}Table cell|}}
|}
re-processing in a phase is wanted. By default, after a token type change or
the return of multiple tokens only the remaining transforms with higher ranks
are applied.
Updated a few comments as well.
to maximize IO concurrency. Signal that all tokens are fully transformed to
callbacks called from TokenAccumulator._returnTokens. The result should be a
single re-transformation when entering the callback chain, and only if the
transform does not signal that it took care of full transformation itself.
Template expansion would set this flag, as the nested transform pipeline
processes all tokens to the end of phase async12.
to callback which lets transforms indicate if their returned tokens are fully
processed for their phase. If not, the callback re-processes them so that any
remaining transforms are applied.
wgUploadPath configurable. Also change the hard-coded fall-back image sizes to
sensible defaults. This breaks three parser tests until image size retrieval
from the wiki is implemented.