264 commits

Author SHA1 Message Date
Gabriel Wicke bd98eb4c5a Land big TokenTransformDispatcher and eventization refactoring.
The TokenTransformDispatcher now actually implements an asynchronous, phased
token transformation framework as described in
https://www.mediawiki.org/wiki/Future/Parser_development/Token_stream_transformations.

Additionally, the parser pipeline is now mostly held together using events.
The tokenizer still emits a lame single event with all tokens, as block-level
emission failed with scoping issues specific to the PEGJS parser generator.
All stages clean up when receiving the end tokens, so that the full pipeline
can be used for repeated parsing.

The QuoteTransformer is not yet 100% fixed to work with the new interface, and
the Cite extension is disabled for now pending adaptation. Bold-italic related
tests are failing currently.
2012-01-03 18:44:31 +00:00
Neil Kandalgaonkar 20374b5911 fix substr for IE, followup r107464 2011-12-30 21:51:03 +00:00
Gabriel Wicke 8e00a72d0a Improvements to link trail handling, and two tweaks to the whitelist. 182
tests now passing. 

Link trails depend on language-dependent positive character classes in the PHP
parser. These classes all seem to disallow punctuation implicitly and list
differing plain text characters instead, so it might be possible to get away
with identifying a common class of non-trail punctuation instead. This would
help to keep the tokenizer independent of configurations, which is very
desirable for caching and simplified external parsing.
2011-12-30 12:47:06 +00:00
Gabriel Wicke 11ece76b7b Fix suffix handling for wiki links. 2011-12-30 09:35:57 +00:00
Gabriel Wicke b3a0270d69 Remove env and load grammar in tokenizer constructor. Re-add property hack to
keep parserTests running for now. Really need a different pipeline for html
serialization or a reference to the HTML DOM.
2011-12-28 17:04:16 +00:00
Gabriel Wicke 3a63fb118e Add a few comments inline, and remove unneeded html serialization as we are
only interested in WikiDom output in this parser wrapper.
2011-12-28 13:46:52 +00:00
Neil Kandalgaonkar 8fbf36e63e add the terminal token inside the tokenize method (will pull it out again for the streaming interface) 2011-12-28 01:37:15 +00:00
Neil Kandalgaonkar 6103646ec8 remove need to add newline at end of input 2011-12-28 01:37:11 +00:00
Neil Kandalgaonkar 4158f82d7e refactor parser to ParseThingy in different module, can be invoked with command line utility parse.js 2011-12-28 01:37:06 +00:00
Neil Kandalgaonkar d91a67ba99 nodeName not defined 2011-12-28 01:36:54 +00:00
Neil Kandalgaonkar 962d1262fc create tokenizer without need to modify namespace with PEG source 2011-12-28 01:36:36 +00:00
Gabriel Wicke 33e60dd4d9 Update comments a bit. 2011-12-22 12:37:24 +00:00
Gabriel Wicke 9ee0e660ec Fix regression introduced by r107060 for regular table cells. Good to have a
test suite ;)
2011-12-22 12:09:25 +00:00
Gabriel Wicke a94d0ec10c Re-add support for row-only tables. 2011-12-22 11:58:32 +00:00
Gabriel Wicke 1c7fe0eb34 Refactor table productions to support table fragments in templates (table
start / row / table end). The old productions are not deleted yet to make it
easy to compare the output on more complex articles. 181 tests passing after
adding two table tests with whitespace-only differences to the whitelist.
2011-12-22 11:43:55 +00:00
Gabriel Wicke 2845ba9552 Handle noinclude and includeonly at start of line, so that syntax after them
still matches as if it were actually preceded by a newline.
2011-12-21 11:38:50 +00:00
Gabriel Wicke 3a631db6d9 Fix ranges for annotations in implicit paragraphs within branch nodes. 2011-12-16 19:36:04 +00:00
Gabriel Wicke cc06551f2e Rename table_header production to table_heading. Those non-natives strike again. 2011-12-16 19:24:59 +00:00
Gabriel Wicke 605ed23fd2 Fix attributes in table headings. 2011-12-16 19:22:13 +00:00
Gabriel Wicke 08255ff3e6 Small bug fix to heading level, spotted by Mike from localwiki. Thanks! 2011-12-15 23:59:35 +00:00
Gabriel Wicke a04744b2ec Add some more attribute remapping capabilities to the DOMConverter, and clean
up some grammar formatting.
2011-12-15 17:33:07 +00:00
Gabriel Wicke e98dd9e722 Implement 1-char-minimum width for annotations, and some additional minor
cleanup.
2011-12-15 11:05:52 +00:00
Gabriel Wicke 22ba27295b Clean up the DOMConverter a bit. 2011-12-15 10:55:30 +00:00
Gabriel Wicke e72dee76e4 Follow-up to r106208 and r106207. Both good catches, thanks Yair! As this code
is in its early stages and nowhere near deployment, please Be Bold and just
commit things like this directly! IMHO it makes more sense to fully review this
once it settles down a bit.
2011-12-15 10:13:50 +00:00
Gabriel Wicke 3585bd9c8e Accept row-only tables. The parser now eats [[en:Barack Obama]] as-is. Hooray! 2011-12-15 00:39:28 +00:00
Gabriel Wicke 6df94a34a1 Less lust for urls 2011-12-15 00:26:22 +00:00
Gabriel Wicke ce2ee067f7 Minor tweak to wiki link production 2011-12-15 00:12:58 +00:00
Gabriel Wicke 377226a120 Comment out a stray console.log 2011-12-14 23:44:58 +00:00
Gabriel Wicke 574abd9774 A collection of small bug fixes to the grammar, Cite, the Token format
converter and the HTML DOM -> WikiDom converter. The tokenizer now digests all
parserTests.
2011-12-14 23:38:46 +00:00
Gabriel Wicke dc77d73ad5 Add ability to pass through JSON data to WikiDom in data-json-* attributes,
and fix parser to actually parse the Barack Obama article except for one table
with nested templates at the start-of-line.
2011-12-14 17:25:09 +00:00
Gabriel Wicke f6e4267fca Handle a few more element types, and reset offset for each leaf node. Not sure
if the latter is correct, as the documentation at
https://www.mediawiki.org/wiki/Visual_editor/Software_design#Data_Structures
and the actual sample WikiDom in the editor sandbox seem to disagree on this
point.
2011-12-14 16:22:27 +00:00
Gabriel Wicke 6676a47008 Add implicit level attribute to WikiDom headings. 2011-12-14 15:55:58 +00:00
Gabriel Wicke 3018ca690b Improve WikiDom conversion: Handle text and annotations in branch nodes as
paragraphs and treat list items as branches.
2011-12-14 15:40:40 +00:00
Gabriel Wicke a09aa4d599 Add rough HTML DOM to WikiDom conversion. You can see serialized WikiDom of
parser tests using 'node parserTests.js --wikidom'.
2011-12-14 15:15:41 +00:00
Gabriel Wicke 5f80d30428 Clean up access to document and body after building the tree. 2011-12-14 09:40:49 +00:00
Gabriel Wicke 30749b8d8d Update comments a bit and add a note on things to improve in API. 2011-12-14 09:33:25 +00:00
Gabriel Wicke 55ff272847 Comment TokenTransformDispatcher. 2011-12-13 20:13:09 +00:00
Gabriel Wicke 44deefe303 Minor tweak to comment. 2011-12-13 18:55:44 +00:00
Gabriel Wicke c61b32eaa7 Clean up and comment the Cite extension a bit. 2011-12-13 18:45:09 +00:00
Gabriel Wicke feee9ded9f Convert the Cite extension to a token stream transformer.
This required a few further additions to the TokenTransformDispatcher. In
particular, there is now an 'any' token match whose callbacks are executed
before more specific callbacks. This is used by the Cite extension to eat all
tokens between ref and /ref tags. This need is very common, so should be
broken out to an intermediate layer in the future.

In general, the requirements for the TokenTransformDispatcher API are now
clearer, and the API should likely be cleaned up / simplified.
2011-12-13 14:48:47 +00:00
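The dispatch order described in the commit above ('any' callbacks run before type-specific ones, which lets a transformer swallow everything between ref and /ref) can be illustrated with a minimal sketch. This is not the actual TokenTransformDispatcher code: `SketchDispatcher`, `register`, `transform`, `dropRefContent` and the token shapes (`type`, `name`, `value`) are all hypothetical names chosen for illustration, and the sketch is synchronous where the real framework is described as asynchronous.

```javascript
// Hypothetical sketch only; names and token shapes are assumptions,
// not the real TokenTransformDispatcher API.
function SketchDispatcher() {
    this.anyHandlers = [];                   // callbacks that see every token
    this.typeHandlers = Object.create(null); // callbacks keyed by token type
}

// Register a callback for one token type, or for 'any' to match all tokens.
SketchDispatcher.prototype.register = function (type, callback) {
    if (type === 'any') {
        this.anyHandlers.push(callback);
    } else {
        (this.typeHandlers[type] = this.typeHandlers[type] || []).push(callback);
    }
};

// Run 'any' callbacks first, then type-specific ones. A callback returns the
// (possibly modified) token, or null to drop it from the stream.
SketchDispatcher.prototype.transform = function (tokens) {
    var out = [];
    for (var i = 0; i < tokens.length; i++) {
        var token = tokens[i];
        var callbacks = this.anyHandlers.concat(this.typeHandlers[token.type] || []);
        for (var j = 0; j < callbacks.length && token !== null; j++) {
            token = callbacks[j](token);
        }
        if (token !== null) {
            out.push(token);
        }
    }
    return out;
};

// Usage: eat all tokens between a ref start tag and its end tag.
function dropRefContent(tokens) {
    var dispatcher = new SketchDispatcher();
    var inRef = false;
    dispatcher.register('any', function (token) {
        if (token.type === 'TAG' && token.name === 'ref') { inRef = true; return null; }
        if (token.type === 'ENDTAG' && token.name === 'ref') { inRef = false; return null; }
        return inRef ? null : token;
    });
    return dispatcher.transform(tokens);
}
```

Keeping the 'any' list separate from the per-type table is one way to guarantee the ordering; a real transformer would presumably collect the swallowed tokens instead of discarding them.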
Gabriel Wicke 8e55e79b67 Rename TokenTransformer to TokenTransformDispatcher. 2011-12-13 11:45:12 +00:00
Gabriel Wicke 8231511217 Replace custom object copy with $.extend. 2011-12-13 11:18:15 +00:00
Gabriel Wicke 39aedd4378 Improve comments in QuoteTransformer. 2011-12-13 10:25:18 +00:00
Gabriel Wicke 0ad08b9ae3 Add a README file pointing to the wiki documentation. 2011-12-12 22:30:11 +00:00
Gabriel Wicke a8fa9433c4 Convert quote handling (italic/bold) to a core extension operating on the
token stream. This is the first token transformation exercising the
TokenTransformer class as its dispatcher. Template expansions, wiki link
formatting, tag sanitation and extensions should be able to use the same
dispatcher by registering for specific token types.

The parser performance is very slightly improved as the token stream is only
traversed once.
2011-12-12 20:53:14 +00:00
Gabriel Wicke 752b0990b2 Refactor parserTests somewhat into a class-like structure, and wire up the
TokenTransformer.
2011-12-12 14:03:54 +00:00
Gabriel Wicke d616f07a79 Don't re-build the wiki tokenizer for each test. This speeds up the full
parserTests.js run slightly from 7-8 minutes to about 14 seconds ;)

A few very minor tweaks to the grammar are also thrown into this commit.
2011-12-12 10:47:42 +00:00
Gabriel Wicke 89c5e0cafb Follow-up to r105859: Add missing new. 2011-12-12 10:09:13 +00:00
Gabriel Wicke 9ebce5839a Further development of the TokenTransformer framework. 2011-12-12 10:01:47 +00:00
Gabriel Wicke 80d5067813 Add a TokenTransformer dispatcher class. This class provides subscriptions by
token type, and supports asynchronous token expansion (for example for async
template expansion). This code is not yet tested or used. The interface for
token insertion from transformation functions will be expanded as needed.
2011-12-08 14:37:31 +00:00
Gabriel Wicke c2b69e2486 Clean up newline handling. Emit a NEWLINE token for each
non-{comment,pre,nowiki} newline.
2011-12-08 14:34:18 +00:00
Gabriel Wicke abc2254110 A bit of comment clean-up and wrapping of tree building into try/catch block
to actually count failures.
2011-12-08 11:40:59 +00:00
Gabriel Wicke 92fdf99384 Further renaming, this time from pegParser to pegTokenizer. 2011-12-08 10:59:44 +00:00
Gabriel Wicke 76bc477038 Rename html5TokenEmitter to HTML5TreeBuilder, and the contained Tokenizer to
TreeBuilder.
2011-12-08 10:37:18 +00:00
Gabriel Wicke 19a1f0850f Tidy up the grammar a bit. 2011-12-08 10:33:23 +00:00
Gabriel Wicke 3742d70abd Add some documentation to syntax flags 2011-12-07 15:54:55 +00:00
Gabriel Wicke 545ca1809f Convert template argument production to generic inline with syntactic stop.
Fix a bug in generic inline production. Nested multi-line templates are now
parsed okayish.
2011-12-07 15:39:39 +00:00
Gabriel Wicke 902db40a1f Process template arguments into an object. 2011-12-07 14:46:07 +00:00
Gabriel Wicke 51a40e4dbc Follow-up to r105423: Fix off-by-one bug. 2011-12-07 11:56:12 +00:00
Gabriel Wicke 49c286a67b Fix a bug in doQuotes (bitten by surprising JS sort() behavior), and improve
tag-only-line handling. 180 parser tests now passing.
2011-12-07 11:51:24 +00:00
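The commit above does not say which sort() behavior caused the doQuotes bug. One common surprise, shown here purely as an illustration and not as a reconstruction of the actual fix, is that JavaScript's default sort() compares elements as strings:

```javascript
// Default sort() converts elements to strings and compares them
// lexicographically, so numbers are not ordered numerically.
var counts = [10, 9, 2, 1];
var lexicographic = counts.slice().sort();
// lexicographic is [1, 10, 2, 9]

// Passing a numeric comparator gives the order most callers expect.
var numeric = counts.slice().sort(function (a, b) { return a - b; });
// numeric is [1, 2, 9, 10]
```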
Gabriel Wicke 418a5067c6 Parse attributes in tables using generic attribute production. Some table
tests still do not pass as the MW table output reorders attributes ;)
2011-12-06 22:03:21 +00:00
Gabriel Wicke 3d06707152 Slightly speed up inline tag productions using guards and grouping; Fix list
processing function.
2011-12-06 18:35:05 +00:00
Gabriel Wicke ea8f226fd5 Remove ext and references special cases, now subsumed by generic XML tag
productions. Document issue around special tokenizer mode for other extension
tags.
2011-12-06 16:44:27 +00:00
Gabriel Wicke e7de089d5b Decode urls and html entities, 163 tests now passing. 2011-12-06 13:17:14 +00:00
Gabriel Wicke a72a9e55a3 Don't match internal links with url as target. 161 passing. 2011-12-06 12:26:57 +00:00
Gabriel Wicke 2b5cc67bf5 Further tweaks to headings. 157 tests now passing. 2011-12-06 11:59:41 +00:00
Gabriel Wicke f4d123886e Convert heading rules to single rule that figures out the level. This saves a
lot of backtracking and inline break complexity.
2011-12-06 11:06:05 +00:00
Gabriel Wicke 33e19f7275 Recognize block-level elements independent of case; Ignore toc and section
edit links in tests. 148 parser tests passing.
2011-12-05 20:03:24 +00:00
Gabriel Wicke 9ed9cb31bd Fix template argument handling somewhat. 2011-12-05 17:58:11 +00:00
Gabriel Wicke 1760210d13 Fixes to tables, headings and misc smaller stuff. Tracked down an issue caused
by improper caching of production results, which interfered with the
flag-dependent inline_break production.
2011-12-04 19:23:24 +00:00
Gabriel Wicke 63c728924b Use pegjs from npm 2011-12-01 15:23:23 +00:00
Antoine Musso 5ab379f479 fix vim modeline 2011-12-01 15:19:37 +00:00
Gabriel Wicke 0ce1e9fcf3 Add a quick html entity decoding hack, and document need for general decoder. 2011-12-01 14:39:55 +00:00
Gabriel Wicke d00743ad79 Improve external links and definition lists, now 133 tests passing ;)
Also add a printwhitelist option to the test runner, which provides JS code that
can be copy/pasted into the whitelist.
2011-12-01 14:25:59 +00:00
Gabriel Wicke 82e31ffd42 Do not allow newlines in various attributes 2011-11-30 15:12:53 +00:00
Gabriel Wicke 821162484e Allow inlines in the term part of ; term : definition 2011-11-30 14:53:28 +00:00
Gabriel Wicke f758894de7 Let another test pass by swapping the default order of italic/bold for '''''.
Minor test output cosmetics.
2011-11-30 13:54:57 +00:00
Gabriel Wicke e0fca805a6 Expand tabs in grammar. 2011-11-30 13:42:26 +00:00
Gabriel Wicke 2bb512a4de A bit of tokenizer grammar clean-up and additional expected-html
normalization. 99 parser tests now passing.
2011-11-30 13:40:17 +00:00
Gabriel Wicke 127d8c8621 Simplify DOM paragraph wrapping postprocessor 2011-11-30 12:28:45 +00:00
Gabriel Wicke f0edc5cb9a Fix a few more tests by allowing inline content inside links. 76 now passing. 2011-11-29 18:43:27 +00:00
Gabriel Wicke ae0b5f9af4 * Split paragraph handling between tokenizer and DOM postprocessor for better
html markup handling. 
* Remove global 'use strict' declarations from html5 parser. 
* Add trailing whitespace handling in dt

Overall, 55 parser tests are now passing.
2011-11-29 15:11:51 +00:00
Gabriel Wicke b16c295b98 Consider dl as a block-level element. 2011-11-28 16:54:58 +00:00
Gabriel Wicke d3f0196df7 Add primitive HTML comparison to detect passing parser tests. The expected
HTML is parsed using an HTML parser and re-serialized, and the output compared
to the serialization of the new parser's dom. Newline normalization is a
cheap hack for now, need to improve that later.
2011-11-28 11:10:39 +00:00
Gabriel Wicke 6b8c109cf0 Separate block-level tags in tokenizer to delimit inlines and avoid wrapping
block-level in paragraphs.
2011-11-25 17:41:26 +00:00
Gabriel Wicke 859379a635 Improvements to nowiki/pre interaction. Will need to distinguish block-level
tags from inline HTML tags next.
2011-11-25 15:02:44 +00:00
Gabriel Wicke dd5cd59ac6 Better HTML, pre and blocklevel handling. Hackish source formatting for easier
comparison with parserTest results.
2011-11-25 12:47:03 +00:00
Gabriel Wicke 5b3a4497aa Add generic HTML tokenization and nowiki handling. 2011-11-25 10:59:43 +00:00
Gabriel Wicke 6c36ddcbce Follow-up to r104164: Clean-up comments, remove old italic/bold productions. 2011-11-24 14:20:56 +00:00
Gabriel Wicke dee262658f Add MediaWiki-compatible quote handling including quirks and overlapped
structures like ''[[Link|Link text'']]. This is another transform on the token
stream.
2011-11-24 13:56:30 +00:00
Gabriel Wicke baf55875b9 Re-add modified wiki list handling to tokenizer. 2011-11-23 14:27:51 +00:00
Gabriel Wicke 694b998f24 Minor improvement to italic/bold, documentation on failed modularization of
static parser functions.
2011-11-22 16:51:05 +00:00
Gabriel Wicke d1b0293569 Fix comment token conversion and serialization 2011-11-21 09:22:30 +00:00
Gabriel Wicke 65afd9b610 Improve internal link handling 2011-11-18 14:48:32 +00:00
Gabriel Wicke d744e65c48 Add missing token adapter. 2011-11-18 14:00:14 +00:00
Gabriel Wicke b750ce38b8 Add node.js-compatible HTML5 parser and hook it up to the PEG tokenizer.
Builds a DOM tree (jsdom) from the tokens and then serializes that using
document.innerHTML. This is all very experimental, so don't be surprised by
rough edges.
2011-11-18 13:57:07 +00:00
Gabriel Wicke 11e487d8c0 Flatten inline token lists before merging text into text tokens. 2011-11-17 15:43:31 +00:00
Gabriel Wicke ea87e7aaee Convert PEG parser to tokenizer for back-end HTML parser. Now emits a list of
tokens, which for now is still completely built before parsing can proceed.
For each top-level block, the source start/end positions are added as
attributes to the top-most tokens. No tracking of wiki vs. html syntax yet.
2011-11-17 15:26:02 +00:00
Gabriel Wicke ef3c84bd2e Extract text from inline elements for better testing. Slightly improved
handling of comment-only lines. Change pre to leaf content model.
2011-11-08 16:08:05 +00:00
Gabriel Wicke 18ead89b37 Improved paragraph, br, comment parsing and switched headings to
generic inlineline with syntactic flags.
2011-11-07 23:09:30 +00:00
Gabriel Wicke 944d010eb2 Indentation cleanup in PEG parser and Html serializer 2011-11-07 21:05:37 +00:00
Gabriel Wicke c3a0c56e56 rename definition{term,description} to just {term,description} 2011-11-07 20:36:34 +00:00
Gabriel Wicke 71891131c3 Grammar improvements
* replaced regexp stack with a set of break rules for inline content within
  specialized parse contexts, switched more rules to generic
  inlineline/inline/block rules.
* don't consume end-of-line for proper start-of-line matching
* added some pre support
* still no conversion of inline elements to annotations
2011-11-07 14:39:12 +00:00
Gabriel Wicke 06ca9f12fe Rename definitiondata to definitiondescription, minor fixes 2011-11-04 12:25:01 +00:00
Gabriel Wicke 7e5c196732 Some more progress for tables and definition lists 2011-11-04 12:06:49 +00:00
Gabriel Wicke 83a80bad49 Fixes for definition lists 2011-11-04 11:08:11 +00:00
Gabriel Wicke 85def70a8a Add basic list serialization to HtmlSerializer
* Added 'definitionterm' and 'definitiondata' styles to support definition
  lists, and special-case handling in the serializer to wrap both in dls.
2011-11-04 10:02:59 +00:00
Gabriel Wicke 63398b5749 Update parserTests to latest serializers 2011-11-04 07:45:05 +00:00
Gabriel Wicke a8838dab18 Start by handling paragraphs, at least a bit. 2011-11-03 15:16:05 +00:00
Gabriel Wicke 0d30a5528e First combination of WikiDom serializers with existing parser in
tests/parser/parserTests.js.

* Removed var from es in es.js to allow node.js to access it as global. Only
  alternative solution appears to be a node-specific 'exports' construct:
  http://nodejs.org/docs/v0.3.1/api/modules.html
* Added es.Document.js and es.Document.Serializer.js in es/bases. Not sure if
  this is the desired location.
* Changed es.extend to es.extendClass in the serializers
* Modified the first parser test to include the WikiDom modules and call the
  new HTML serializer
2011-11-03 13:55:48 +00:00
Trevor Parscal 5bae153214 Moving parser stuff back into the modules folder (oops) 2011-11-02 21:45:57 +00:00
Trevor Parscal 2b499d5990 Reorganized modules by javascript namespace 2011-11-02 21:31:45 +00:00
Brion Vibber 213ee7d4a8 followup r101685: the peg definition 2011-11-02 21:09:19 +00:00
Brion Vibber 56a75ccca7 Copy several of the experimental JS parser bits from ParserPlayground to VisualEditor. They'll need retooling to hook up with the wikidom stuff. 2011-11-02 21:07:51 +00:00