- Eliminated newline handling from several places in code and
mostly isolated it to serializeToken thus simplifying newline
handling logic.
- Fixing some bugs in the process: # of green roundtrip tests
went up by 5 (294 --> 299) but actually introduced failures on
a few originally succeeding tests (additional leading/trailing
newlines on the entire test output).
- Added bonus: made list serializing (mostly) insensitive to
newlines between tags. So, all the following DOM serialize
identically to the following wikitext:
*foo
*bar
----------
<ul><li>foo</li><li>bar</li></ul>
----------
<ul>
<li>foo</li>
<li>bar</li>
</ul>
----------
<ul>
<li>
foo
</li>
<li>
bar</li>
</ul>
----------
Change-Id: I76be56c4b2789039dff5f47de4659746882e45d6
* As part of an earlier fix, I had changed default value of 'res'
to null instead of ''. But, this was potentially buggy because
the previous check was (res !== '') which could be triggered
by return values of handlers. By changing the check to null,
I was effectively changing the code paths for those handlers that
returned ''.
Change-Id: I2302023be7422ce4fb384ff5a50fe53fa7732855
paragraphs in lists.
* We need to look at other special-case handling requirements of
html tags in lists (and other contexts like tables).
Change-Id: I84b8402d90a186c9075c2d45263c94377312927a
* Moved wikipedia default prefixes to environment
* Added 'addInterwiki' method
* Adjusted link handling normalizeTitle to reflect this
Change-Id: If5b2314cc36346b6da8649ed410457a612d80a22
* mw:Foo now loads pages from mediawiki.org
* The default prefix still is 'en'. You can switch this to 'mw' in ParserService.js.
Change-Id: I1208667e6114bd711b7988a8b3adb32ffab70969
- Three bugs that were messing up quote transformations.
- Now, the following cases are handled properly:
* ''foo'''
* '''foo''
* ''foo''''
* ''''foo''
These tests (and other quote tests) have to be added to core parser
tests file.
- One more parser test green.
Change-Id: I4f93e8910639f546bfc9304becab17d26d5529de
An improvement, but there still are some extra newlines inserted after
paragraphs. Example input:
-------
Foo:
{|
|foo
|}
-------
Extra newlines are inserted after the Foo: and the foo in the table. They are
not fed as tokens or text to the tree builder, so there is likely a bug in the
html5 library or JSDom.
Change-Id: I83eb6180e3cd1c4e7f9b15b31d339e1d32bccd3f
* Possibly more efficient under heavy GC load -- untested.
* No change in time and memory use for single file parsing.
Change-Id: Id2f3f65cc0e5f38ed968bbda60b97e46523e700e
* Moved the tail attribute to the second attribute (a bit cleaner)
* Disallowed newlines in the tail production
* Improved the selection of round-tripped href vs. generated content vs. href
in the serializer
* renamed state.linkTail to state.dropTail
Change-Id: I5d98c704b6ea566011e22237786f8da17548570f
Pages titles with a wikipedia interwiki prefix now load the page from
corresponding Wikipedia. Links in a page then stay within the given language.
Note that Parsoid currently makes no effort to recognize localized namespaces,
so it won't render media files, categories etc correctly.
Change-Id: I7bc4102e81a402772ea23231170734d580ea15b9
* Don't explicitly add the newline in the pre, as we preserve newline tokens
now. This avoids doubling of newlines when round-tripping.
* Use the sHref attribute even if the href contains spaces.
Change-Id: I8bec8fbfd6a7836bf2e5eec20869a0edd95c93b6
Lists interrupted by non-empty lines would not close the list properly.
Register for any token instead of just for newlines and close the list if no
listItem follows the newline.
Change-Id: I1743901e3db541bbeda78d17707db943e6ceb9b9
If the href would not denormalize, add a copy of the original href in data-mw
and use it to preserve non-conventional capitalization etc.
Change-Id: Ifef50eec7343b0e6b0ba66b6d19a8a3e8c9f8001
A tail containing regexp syntax (a ? in [[:en:Main Page]]) would crash the
serializer. Use substr instead.
Change-Id: I8519aec9c07dfe31893d676b1c936a42d2af74a0
- Added a tail json attribute for wikiLinks
- During serialization, this attribute is used to strip the tail from
the link target and render it after the link
[[hen]]s ==> <a ... data-mw="{gc:1, tail: 's'}" ...>hens</a>
==> [[hen]]s
- 2 more roundtrip tests green
Change-Id: I84f3dabaf0271f7a67641a00148467daa8310eb0
* The state of syntax stops is now properly included in the cache key for the
tokenizer-internal backtracking cache. This fixes some mis-parses when
re-parsing a bit of text with different flags.
* Clear the backtracking cache after each toplevelblock. This drops the peak
memory usage when expanding [[:en:Barack Obama]] from ~380M to ~110M.
Change-Id: Icdb879cae5907e4595903dd6acba2e686e8c2e4b
* This routine attempts to rewrite the DOM to maximize tag overlap
and thus minimize tag uses.
* This takes as input a set of tags which participate in the
minimization.
* Tested on the following example
<b><i><u><s>BIUS</s></u></i></b><b><i><s>BIS</s></i></b><b><u><s>BUS</s></u></b><u><i>UI</i></u>
with multiple combinations of the 2^4 possible variations of i,b,u,s
tags: [], ['i','b','u','s'], ['i'], ['b','s'], ['i','b','u']
- But, I am not fully sure if this implements the right behavior when
only a subset of inline tags are provided. Needs discussion and tweaking
as necessary.
* Also tested on few others:
<b>B</b><b><i>BI</i></b><b><i><u>BIU</u></i></b><b><i><u><s>BIUS</s></u></i></b>
<s><i><b>SIB</s></i></b><s><i><u>SIU</u></i></s><i><u>IU</u></i><i>I</i>
* The previous pairwise tag rewriting version fails on several of these
examples, so this new version is a definite improvement.
* No change in parserTests run (203 passing before and after).
* Possible improvements that could/should be undertaken:
- get rid of useless/idempotent add/remove of nodes that don't change
the DOM.
- ensure that node attributes post-restructuring are correct.
Change-Id: Ib4a8b39583fa96a2be880a77021ca81cefa06484
This patch fixes a tokenizer syntax error encountered on
[[:en:Template:JacksonvilleWikiProject-Member]] and [[:en:Template:Infobox
former country]] by allowing optional whitespace before start-of-line template
syntax.
Change-Id: Ic214a731de58bf766e51f23d5e24ea2ce6788f58
254 round-trip tests (up from 184) are now passing.
Also:
* tweaked runtests.sh slightly (use less -R instead of -r).
* made sure the EOFTk is preserved in phase 3 transforms
Change-Id: I1de22186bdb78e52019370e43f096877005b8f5a
- This is implemented as a post-processing pass.
- Might require additional checks to verify rewriteability.
- Implemented as a pair-wise tag DOM minimization strategy,
i.e. it takes tag pairs (B, I) for ex, and attempts to
normalize the tree just for those tag pairs. Normalizing
across multiple tags is implemented as pairwise rewriting
across all pairs: Ex:(b,i), (b,u),(i,u) for (b,i,u)
- Copied over attributes as part of rewriting, but some of the
attributes lose their meaning on rewriting since tags are
reordered (ex: sourcePosn, sourceTagPosn). How do we handle this?
Output examples and possible issues to fix:
<i><b><u>biu</u></b></i><b><u>bu</u></b><u>u</u>
gets rewritten to:
<u><b><i>biu</i>bu</b>u</u>
But, the equivalent wikitext form:
'''''<u>biu</u>''''''''<u>bu</u>'''<u>u</u>
does not get rewritten because of parsing differences.
This wikitext gets parsed into:
<i><b><u>biu</u>'''</b></i><u>bu<b>u</b></u>
The extra ''' token in the middle thwarts DOM rewriting.
However, a slightly different version:
"'''''<u>biu</u>''<u>bu</u>'''<u>u</u>"
gets properly normalized to:
<u>'''''biu''bu'''u</u>
An alternative, but fun strategy to play with is to use the following
two normalization primitives: S(wap) and M(erge).
- S rewrites T1(T2(x)) into T2(T1(x))
(ex: <b><i>foo</i></b> ==> <i><b>foo</b></i>)
- M rewrites (T(x),T(y)) into (T(x,y)).
(ex: <b>foo</b><b>bar</b> ==> <b>foobar</b>)
The current rewriting strategy could possibly be re-implemented as S-M
rewriting. The problem to solve there would be to find an efficient
rewriting strategy that is guaranteed to lead to a normal form. I may
not play with it now, but just documenting it for later (to play with
in my spare time).
This commit is just as a record of fun/experimental code where I get to
learn details of JS, wikitext, parsing, and DOM manipulation. Next
version of this code will attempt to introduce minimal DOM restructuring
across multiple tags at once which can be more efficient.
gwicke: Removed now passing test from whitelist, and updated another whitelist
entry which is now improved.
Change-Id: Ie97bcb164eb62c34ba61aa76ba2f4c232aa713d8
* Added a generic stx_v 'syntax variant' round-trip attribute
* For pre, use stx:'html' vs. no syntax annotation. This might not be 100%
safe for arbitrary html input, so we might want to flip this to stx:'wiki'
later.
* 181 round-trip tests passing
Change-Id: If6080917a3a7c069066db3db60efe59b1f6c28d8
* Very basic support attribute key-value pairs emitted from templates
* Add TALKPAGENAME stub implementation
* Only show 'no revisions' message for top-level pages
Change-Id: I4b4ac0c7b2c0531ac4b39f0f49f4217302576ab9
* After installing Parsoid (sudo npm install -g in modules/parser), run 'node
server.js' from the api directory and navigate to http://localhost:8000/ and
follow the directions. You can start to navigate the English wikipedia at
http://localhost:8000/Main_Page, or manually enter wikitext or HTML DOM to
convert.
* Uses the express framework, could also use just connect
* Uses the cluster module to manage workers per-core and restart those on
failure
Change-Id: I443f2996ed3df00826b038b7476a2f966ab0c425
* Changed RDFa for links according to
http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary
* Added basic support for internal/external link serialization
* Moved numbering of external links from tokenizer to LinkHandler
* Added round-tripping for generic HTML tags
* Replaced nowiki tag with <meta typeOf="mw:tag" content="nowiki"> and <meta
typeOf="mw:tag" content="/nowiki"> for now.
* 154 round-trip tests passing (node parserTests.js --roundtrip).
Change-Id: I16c4db21b1b543ee57c73e569c83025b64664542
* added stx: 'html' round-trip information for html tags
* added t_stx: 'row' info for row-wise table wiki syntax, and support for it
in the serializer
* the first table row is implicit in wikitext
* renamed lastToken to prevToken in serializer
* strip first newline in an initial chunkCB
Change-Id: I014b046539d1b674d830551c5fd1b74a67f81993
Lists are a bit tricky, as nested lists are not wrapped in a separate list
item. Should work now though.
Change-Id: I2e5f29f6afa6bdd2d5e5c0c5d019b70c611b73d1
This fix only affects following transforms, of which there are few right now.
Also removed a stray token mutation in QuoteTransformer.
Change-Id: Id6d4adce944b06fc1a3651cfbf63fc2670125225
* Tokens are now immutable. The progress of transformations is tracked on
chunks instead of tokens. Tokenizer output is cached and can be directly
returned without a need for cloning. Transforms are required to clone or
newly create tokens they are modifying.
* Expansions per chunk are now shared between equivalent frames via a cache
stored on the chunk itself. Equivalence of frames is not yet ideal though,
as right now a hash tree of *unexpanded* arguments is used. This should be
switched to a hash of the fully expanded local parameters instead.
* There is now a vastly improved maybeSyncReturn wrapper for async transforms
that either forwards processing to the iterative transformTokens if the
current transform is still ongoing, or manages a recursive transformation if
needed.
* Parameters for parser functions are now wrapped in abstract Params and
ParserValue objects, which support some handy on-demand *value* expansions.
Keys are always expanded. Parser functions are converted to use these
interfaces, and now properly expand their values in the correct frame.
Making this expansion lazier is certainly possible, but would complicate
transformTokens and other token-handling machinery. Need to investigate if
it would really be worth it. Dead branch elimination is certainly a bigger
win overall.
* Complex recursive asynchronous expansions should now be closer to correct
for both the iterative (transformTokens) and recursive (maybeSyncReturn
after transformTokens has returned) code paths.
* Performance degraded slightly. There are no micro-optimizations done yet
and the shared expansion cache still has a low hit rate. The progress
tracking on chunks is not yet perfect, so there are likely a lot of unneeded
re-expansions that can be easily eliminated. There is also more debug
tracing right now. Obama currently expands in 54 seconds on my laptop.
Change-Id: I4a603f3d3c70ca657ebda9fbb8570269f943d6b6
Trying something trivial like echo 'Hello world' | node parse.js
would throw TypeError: Function.prototype.apply: Arguments list has wrong type
Change-Id: Ia0a1154b0f3edbfb1f228a1d2072fced1b147141
* Setting the rank on tokens is still used currently, but will be phased out
in favor of setting it on chunks. Tokens will be immutable to allow sharing
and caching without a need for cloning.
* Only register for newline and end tokens in QuoteTransformer when active.
Change-Id: I2c45bc7e4a105219a1404ab221eed7f242128f1e
This makes it possible to transclude list items from a template.
Note: "5 quotes" test is broken by this patch, it appears that ListHandler
newline processing is changing some state which mysteriously affects the
QuoteTransformer. This is ominous, hopefully there's a simple explanation...
gwicke: fix a bug in tokenizer triggered by definition lists like this:
**; foo : bar
Change-Id: I4e3a86596fe9bffcbfc4bf22895362c3bf742bad
* parentCB (if set) is called with { async: true } if expansion is going to be
asynchronous.
* Strings are handled efficiently
* all value parameter chunks can now be converted using .to().
Change-Id: Ib013e1bc3d8e7f692009038209db6a056887326e
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
while it would prevously run out of RAM after ~30 minutes. Wohoooo!
The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
tests are passing since these tests expect the day of the week to be
Thursday. Won't be the case tomorrow.
Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136