254 round-trip tests (up from 184) are now passing.
Also:
* tweaked runtests.sh slightly (use less -R instead of -r).
* made sure the EOFTk is preserved in phase 3 transforms
Change-Id: I1de22186bdb78e52019370e43f096877005b8f5a
- This is implemented as a post-processing pass.
- Might require additional checks to verify rewriteability.
- Implemented as a pair-wise tag DOM minimization strategy,
i.e. it takes tag pairs (B, I) for ex, and attempts to
normalize the tree just for those tag pairs. Normalizing
across multiple tags is implemented as pairwise rewriting
across all pairs: Ex:(b,i), (b,u),(i,u) for (b,i,u)
- Copied over attributes as part of rewriting, but some of the
attributes lose their meaning on rewriting since tags are
reordered (ex: sourcePosn, sourceTagPosn). How do we handle this?
Output examples and possible issues to fix:
<i><b><u>biu</u></b></i><b><u>bu</u></b><u>u</u>
gets rewritten to:
<u><b><i>biu</i>bu</b>u</u>
But, the equivalent wikitext form:
'''''<u>biu</u>''''''''<u>bu</u>'''<u>u</u>
does not get rewritten because of parsing differences.
This wikitext gets parsed into:
<i><b><u>biu</u>'''</b></i><u>bu<b>u</b></u>
The extra ''' token in the middle thwarts DOM rewriting.
However, a slightly different version:
"'''''<u>biu</u>''<u>bu</u>'''<u>u</u>"
gets properly normalized to:
<u>'''''biu''bu'''u</u>
An alternative, but fun strategy to play with is to use the following
two normalization primitives: S(wap) and M(erge).
- S rewrites T1(T2(x)) into T2(T1(x))
(ex: <b><i>foo</i></b> ==> <i><b>foo</b></i>)
- M rewrites (T(x),T(y)) into (T(x,y)).
(ex: <b>foo</b><b>bar</b> ==> <b>foobar</b>)
The current rewriting strategy could possibly be re-implemented as S-M
rewriting. The problem to solve there would be to find an efficient
rewriting strategy that is guaranteed to lead to a normal form. I may
not play with it now, but just documenting it for later (to play with
in my spare time).
This commit is just as a record of fun/experimental code where I get to
learn details of JS, wikitext, parsing, and DOM manipulation. Next
version of this code will attempt to introduce minimal DOM restructuring
across multiple tags at once which can be more efficient.
gwicke: Removed now passing test from whitelist, and updated another whitelist
entry which is now improved.
Change-Id: Ie97bcb164eb62c34ba61aa76ba2f4c232aa713d8
This was caused by the fact that a non-structural leaf can not have children, which makes it appear incompatible as a sibling to an arbitrary structural element (like a paragraph) but since it can not contain content we can check that instead.
Change-Id: Ie3c58b4b43f2aa6921f8f82aa82511e231207854
Supports both -> HTML DOM and round-trip testing. Displays the diff to the
last results using less -r.
Change-Id: Ib3fbadeda3c8f4f7e3d2e6e5236a73ff7a773623
Using this argument will only return true if the offset is a place you can add any element to (hence the unrestricted part of it). This is good for testing if a paragraph could potentially be inserted there.
Change-Id: I6cc91da437c52493de03eb687b28966198270fea
The 'insert' and 'remove' operations weren't implemented in the
transaction processor and were a holdover from the old DM
implementation.
Also migrated the tests, especially those that asserted that consecutive
insert/remove operations were combined (this is no longer the case now
that they are replace operations)
Change-Id: I2379fe92b331c5316f70f4b695397da41581cce9
* After installing Parsoid (sudo npm install -g in modules/parser), run 'node
server.js' from the api directory and navigate to http://localhost:8000/ and
follow the directions. You can start to navigate the English wikipedia at
http://localhost:8000/Main_Page, or manually enter wikitext or HTML DOM to
convert.
* Uses the express framework, could also use just connect
* Uses the cluster module to manage workers per-core and restart those on
failure
Change-Id: I443f2996ed3df00826b038b7476a2f966ab0c425
* Changed RDFa for links according to
http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary
* Added basic support for internal/external link serialization
* Moved numbering of external links from tokenizer to LinkHandler
* Added round-tripping for generic HTML tags
* Replaced nowiki tag with <meta typeOf="mw:tag" content="nowiki"> and <meta
typeOf="mw:tag" content="/nowiki"> for now.
* 154 round-trip tests passing (node parserTests.js --roundtrip).
Change-Id: I16c4db21b1b543ee57c73e569c83025b64664542
- Just a quick first pass updating the parserTests.js script so we can
test DOM -> wikitext serialization (but which in effect really tests
roundtripping).
- There is no output normalization yet which is needed for now since we
are not yet preserving white-space.
Change-Id: Ie52058e0dc3330f852c24fa05641dced19f950e0
in the new DM. Change method name getAnnotationRange from offset to
getAnnotatedRangeFromOffset. Write tests
Change-Id: I7028803065409e271ceced73e4803954d4a956dc
getAnnotationRangeFromOffset and offsetContainsAnnotation
which deprecated getAnnotationBoundaries, and getIndexOfAnnotation
write unit tests for proof
Change-Id: I6c0d4e3ca96dd569b1909cd22fce68c3a6fe382c
By first summarizing a node tree or node selection and then using the normal deepEqual function we are able to take advantage of native QUnit functionality to make diagnosing a problem easier under failing conditions and reduce the verbosity of the output under passing conditions.
Change-Id: I4af5d69596cf5459aa32f61ee6d5b8355233c3df
* Make nodeSelectionEqual() accept a desc parameter and use it for
building descriptions
* Put the array element number in the desciption too
* Add descriptions for selectNodes tests and pass them through
Change-Id: Icd2894d11516234598cbd984cc8d88f705bfc1d6
* Don't crash if b.children has a different length than expected
* Don't crash if b.children is absent but expected
* Check if b.children is present when it's not expected
* Add assertion when neither a.children not b.children are present; this
makes the number of assertions for leaves equal to the number of
assertions for branches
* Check attributes too
* Accept a desc parameter, and add node paths (like
list/listItem/paragraph/text) to the descriptions
* Use desc parameter for all calls (some tests were already using it,
even though it didn't actually exist)
Change-Id: Ic56d27d20377e7f4fdfa038fdf4ebe00dcb3e062
This was because the while loop was never entered as end >= left was
true from the start. Convert the while loop to a do-while loop to make
sure it runs at least once
Change-Id: I9c6436a7b296e65a36b8301095b6edd00507d321
Now returning an empty array when a non annotated character is found
in the range. No longer looping through each annotation, simply
comparing to previous characters annotations and trimming differences.
Write additional test.
Change-Id: I41d2422a931a74325693edca409aed6d5da20ba8
This is for the case where we have a zero-length range in between two
siblings, and we need to know what index that corresponds to in order to
be able to insert nodes there (rebuildNodes() will use it for this
purpose)
Change-Id: I357d1cd665667a76f955a10b8d9d2810976cdbd7
Selecting a zero-length range at the start or end of a text node
(e.g. (1,1)) would return the text node's parent instead of the text
node
Change-Id: I7fe089bf66b93185dd3415eff53aa7e04e3ffdb2
This makes it possible to transclude list items from a template.
Note: "5 quotes" test is broken by this patch, it appears that ListHandler
newline processing is changing some state which mysteriously affects the
QuoteTransformer. This is ominous, hopefully there's a simple explanation...
gwicke: fix a bug in tokenizer triggered by definition lists like this:
**; foo : bar
Change-Id: I4e3a86596fe9bffcbfc4bf22895362c3bf742bad
This gets rid of the length == outerLength-2 hack in getDataFromNode()
and will make it easier to implement similar logic in selectNodes()
Change-Id: I1294350b67ca3eefde2b7fe9fea0bc6d8b90f772
* Moved implementation of getting and updating a DOM wrapper to ve.ce.BranchNode
* Updated ve.ce.BranchNode tests
* Renamed ve.NodeFactory.createNode to ve.NodeFactory.create
* Added ve.NodeFactory.lookup which gets the constructor of a type
* Added attribute pass-through to ve.dm.BranchNodeStub
Change-Id: I8f5b7d3d3ae616cc5f39828b24b655163d782ae5
* Makes it simpler in the linear model because we don't have to use style: "item" for regular list items and style: "definition" for definition lists
* Enforces correct nesting through existing node rules systems
* Updates tests accordingly
Change-Id: I64d80af938e325f1961226505bdc386bb35ccdda
* Also fixed calls to addListenerMethod
* Also routed adding children in the constructor of ve.dm.BranchNode to the splice method
* Renamed types of ve.dm stub nodes to avoid collisions (since we have to register ce nodes by the same names for them to be generated by onSplice)
Change-Id: Ia2e75cf0a62186cc0e214683feb25c619590318a
* Fixed constructor of ve.ce.BranchNode which was calling the wrong method to perform an onSplice and with the wrong arguments
* Removed/renamed events emitted from ve.ce.BranchNode.onSplice
* Reintroduced .$ to all ce nodes
* Ported over functionality for DOM node type variance used in headings, lists and list items
* Moved the old ve.ce.Content guts to ve.ce.TextNode
* Added getOffsetFromNode and getDataFromNode to ve.dm.DocumentFragment
* Added setDocument and getDocument to dm nodes
Change-Id: I185423ba2f1a858dde562cb2f5bc3852aec930db
* Add .process() and its helpers from ve1
* Fix applyAnnotations()
** Use data[i] rather than data[j]
** Don't add empty annotation objects for no reason
** Remove empty annotation objects
*** I thought this wasn't needed, but it is needed for clean rollbacks
* Remove unused second parameter in applyAnnotations() call
Change-Id: Ia338f62d2eaf2a76f8ef653eead05bc44757a122
* Also removed beforeSplice and afterSplice in favor of just plain splice which is the same as afterSplice used to be - beforeSplice was never used and it was making things more complex looking than needed
Change-Id: Icbbc57eac73a2a206ba35409ab57b3d1a49ab1a5
* Added support for asking if a given node type can have children or grandchildren and what types of nodes can be it's parent or child
* Removed canHaveChildren methods from leaf and branch nodes and converted use of them to depend on factory to read static rules from constructor lookup by type
Change-Id: I9769f95647066576416bacb791c4b68dd0285b35
These are needed to make sure the base classes behave correctly, the ones in sub classes are needed to make sure the prototypes of the base class are correctly inherited
Change-Id: I334cc3ce1c4c0ce2eed23c79e8877332a953f7c3
Neither of these nodes have elements around them, so they must override the default behavior of ve.dm.Node
Change-Id: I19c02c210bfc04b6e2ee1a37b8890e84a236eee3
* Moved node tree assertion to ve.dm.example
* Added rebuildNodes test
* Fixed some typos in rebuildNodes
Change-Id: I4853ded4b062aaa3758435093368bc23667ca3bf
Image nodes are leafs, so providing an empty array to their children/length "contents" constructor argument ends up setting the numeric length to [], which casts to an empty string in arithmetic, causing all further calculations to be concatenations instead of additions.
Change-Id: I40e1ea2295f6095318bc4c24185cadfdfb684557
* Indexes for attribute references were off (the comments that were previously wrong before were relied on and cause this to also be wrong)
* Tree should start with a document node
Change-Id: Ia8e4faa4bcb25db797ff97f6e798ba253adfe535
Within ve.dm.DocumentFragment it makes more sense to call the root node (which is always a document node) a document node, especially since there may be a different node used as a root.
This commit also adds test for getDocumentNode and getNodeFromOffset which uses the offset map.
Change-Id: Ic4609233cedc41f7e5a5f8fdb0e6178652c95554
And fixed ve.dm.DocumentFragment constructor to generate a correct offset map which creates references to branch nodes only
Change-Id: If9e515be0c63d272bfed9bf4da625a48edd36f48
* [[:en:Barack Obama]] can now be expanded in 77 seconds using 330MB RAM,
while it would prevously run out of RAM after ~30 minutes. Wohoooo!
The token transform framework rework really paid off.
* 303 parser tests are passing in the new record time of 5.5 seconds. Two more
tests are passing since these tests expect the day of the week to be
Thursday. Won't be the case tomorrow.
Change-Id: I56e850838476b546df10c6a239c8c9e29a1a3136
* All parser pipelines including tokenizer and DOM stuff are now constructed
from a 'recipe' data structure in a ParserPipelineFactory.
* All sub-pipelines of these can now be cached
* Event registrations to a pipeline are directly forwarded to the last
pipeline member to save relatively expensive event forwarding.
* Some APIs for on-demand expansion / format conversion of parameters from
parser functions are added:
param.to('tokens/expanded', cb)
param.to('text/wiki', cb) (this does not work yet)
All parameters are additionally wrapped into a Param object that provides
method for positional parameter naming (.named() or conversion to a dict
(.dict()).
* The async token transform manager is now separated from a frame object, with
the frame holding arguments, an on-demand expansion method and loop checks.
* Only keys of template parameters are now expanded. Parser functions or
template arguments trigger an expansion on-demand. This (unsurprisingly)
makes a big performance difference with typical switch-heavy template
systems.
* Return values from async transforms are no longer used in favor of plain
callbacks. This saves the complication of having to maintain two code paths.
A trick in transformTokens still avoids the construction of unneeded
TokenAccumulators.
* The results of template expansions are no longer buffered.
* 301 parser tests are passing
Known issues:
* Cosmetic cleanup remains to do
* Some parser functions do not support async expansions yet, and need to be
modified.
Change-Id: I1a7690baffbe8141cadf67270904a1b2e1df879a