Gabriel Wicke
bd98eb4c5a
Land big TokenTransformDispatcher and eventization refactoring.
The TokenTransformDispatcher now actually implements an asynchronous, phased
token transformation framework as described in
https://www.mediawiki.org/wiki/Future/Parser_development/Token_stream_transformations .
Additionally, the parser pipeline is now mostly held together using events.
The tokenizer still emits a lame single event with all tokens, as block-level
emission failed with scoping issues specific to the PEGJS parser generator.
All stages clean up when receiving the end tokens, so that the full pipeline
can be used for repeated parsing.
The QuoteTransformer is not yet 100% fixed to work with the new interface, and
the Cite extension is disabled for now pending adaptation. Bold-italic related
tests are failing currently.
2012-01-03 18:44:31 +00:00
Neil Kandalgaonkar
20374b5911
fix substr for IE, followup r107464
2011-12-30 21:51:03 +00:00
Gabriel Wicke
8e00a72d0a
Improvements to link trail handling, and two tweaks to the whitelist. 182
tests now passing.
Link trails depend on language-dependent positive character classes in the PHP
parser. These classes all seem to disallow punctuation implicitly and list
differing plain text characters instead, so it might be possible to get away
with identifying a common class of non-trail punctuation instead. This would
help to keep the tokenizer independent of configurations, which is very
desirable for caching and simplified external parsing.
2011-12-30 12:47:06 +00:00
Gabriel Wicke
11ece76b7b
Fix suffix handling for wiki links.
2011-12-30 09:35:57 +00:00
Gabriel Wicke
b3a0270d69
Remove env and load grammar in tokenizer constructor. Re-add property hack to
keep parserTests running for now. Really need a different pipeline for html
serialization or a reference to the HTML DOM.
2011-12-28 17:04:16 +00:00
Gabriel Wicke
3a63fb118e
Add a few comments inline, and remove unneeded html serialization as we are
only interested in WikiDom output in this parser wrapper.
2011-12-28 13:46:52 +00:00
Neil Kandalgaonkar
8fbf36e63e
move adding the terminal token into the tokenize method (will pull it out again for streaming interface)
2011-12-28 01:37:15 +00:00
Neil Kandalgaonkar
6103646ec8
remove need to add newline at end of input
2011-12-28 01:37:11 +00:00
Neil Kandalgaonkar
4158f82d7e
refactor parser to ParseThingy in different module, can be invoked with command line utility parse.js
2011-12-28 01:37:06 +00:00
Neil Kandalgaonkar
d91a67ba99
nodeName not defined
2011-12-28 01:36:54 +00:00
Neil Kandalgaonkar
962d1262fc
create tokenizer without need to modify namespace with PEG source
2011-12-28 01:36:36 +00:00
Gabriel Wicke
33e60dd4d9
Update comments a bit.
2011-12-22 12:37:24 +00:00
Gabriel Wicke
9ee0e660ec
Fix regression introduced by r107060 for regular table cells. Good to have a
test suite ;)
2011-12-22 12:09:25 +00:00
Gabriel Wicke
a94d0ec10c
Re-add support for row-only tables.
2011-12-22 11:58:32 +00:00
Gabriel Wicke
1c7fe0eb34
Refactor table productions to support table fragments in templates (table
start / row / table end). The old productions are not deleted yet to make it
easy to compare the output on more complex articles. 181 tests passing after
adding two table tests with whitespace-only differences to the whitelist.
2011-12-22 11:43:55 +00:00
Gabriel Wicke
2845ba9552
Handle noinclude and includeonly at start of line, so that syntax after it
still matches as if it actually was preceded by a newline.
2011-12-21 11:38:50 +00:00
Gabriel Wicke
3a631db6d9
Fix ranges for annotations in implicit paragraphs within branch nodes.
2011-12-16 19:36:04 +00:00
Gabriel Wicke
cc06551f2e
Rename table_header production to table_heading. Those non-natives strike again.
2011-12-16 19:24:59 +00:00
Gabriel Wicke
605ed23fd2
Fix attributes in table headings.
2011-12-16 19:22:13 +00:00
Gabriel Wicke
08255ff3e6
Small bug fix to heading level, spotted by Mike from localwiki. Thanks!
2011-12-15 23:59:35 +00:00
Gabriel Wicke
a04744b2ec
Add some more attribute remapping capabilities to the DOMConverter, and clean
up some grammar formatting.
2011-12-15 17:33:07 +00:00
Gabriel Wicke
e98dd9e722
Implement 1-char-minimum width for annotations, and some additional minor
cleanup.
2011-12-15 11:05:52 +00:00
Gabriel Wicke
22ba27295b
Clean up the DOMConverter a bit.
2011-12-15 10:55:30 +00:00
Gabriel Wicke
e72dee76e4
Follow-up to r106208 and r106207. Both good catches, thanks Yair! As this code
is in its early stages and nowhere near deployment, please Be Bold and just
commit things like this directly! IMHO it makes more sense to fully review this
once it settles down a bit.
2011-12-15 10:13:50 +00:00
Gabriel Wicke
3585bd9c8e
Accept row-only tables. The parser now eats [[en:Barack Obama]] as-is. Hooray!
2011-12-15 00:39:28 +00:00
Gabriel Wicke
6df94a34a1
Less lust for urls
2011-12-15 00:26:22 +00:00
Gabriel Wicke
ce2ee067f7
Minor tweak to wiki link production
2011-12-15 00:12:58 +00:00
Gabriel Wicke
377226a120
Comment out a stray console.log
2011-12-14 23:44:58 +00:00
Gabriel Wicke
574abd9774
A collection of small bug fixes to the grammar, Cite, the Token format
converter and the HTML DOM -> WikiDom converter. The tokenizer now digests all
parserTests.
2011-12-14 23:38:46 +00:00
Gabriel Wicke
dc77d73ad5
Add ability to pass through JSON data to WikiDom in data-json-* attributes,
and fix parser to actually parse the Barack Obama article except for one table
with nested templates at the start-of-line.
2011-12-14 17:25:09 +00:00
Gabriel Wicke
f6e4267fca
Handle a few more element types, and reset offset for each leaf node. Not sure
if the latter is correct, as the documentation at
https://www.mediawiki.org/wiki/Visual_editor/Software_design#Data_Structures
and the actual sample WikiDom in the editor sandbox seem to disagree on this
point.
2011-12-14 16:22:27 +00:00
Gabriel Wicke
6676a47008
Add implicit level attribute to WikiDom headings.
2011-12-14 15:55:58 +00:00
Gabriel Wicke
3018ca690b
Improve WikiDom conversion: Handle text and annotations in branch nodes as
paragraphs and treat list items as branches.
2011-12-14 15:40:40 +00:00
Gabriel Wicke
a09aa4d599
Add rough HTML DOM to WikiDom conversion. You can see serialized WikiDom of
parser tests using 'node parserTests.js --wikidom'.
2011-12-14 15:15:41 +00:00
Gabriel Wicke
5f80d30428
Clean up access to document and body after building the tree.
2011-12-14 09:40:49 +00:00
Gabriel Wicke
30749b8d8d
Update comments a bit and add a note on things to improve in API.
2011-12-14 09:33:25 +00:00
Gabriel Wicke
55ff272847
Comment TokenTransformDispatcher.
2011-12-13 20:13:09 +00:00
Gabriel Wicke
44deefe303
Minor tweak to comment.
2011-12-13 18:55:44 +00:00
Gabriel Wicke
c61b32eaa7
Clean up and comment the Cite extension a bit.
2011-12-13 18:45:09 +00:00
Gabriel Wicke
feee9ded9f
Convert the Cite extension to a token stream transformer.
This required a few further additions to the TokenTransformDispatcher. In
particular, there is now an 'any' token match whose callbacks are executed
before more specific callbacks. This is used by the Cite extension to eat all
tokens between ref and /ref tags. This need is very common, so should be
broken out to an intermediate layer in the future.
In general, the requirements for the TokenTransformDispatcher API are now
clearer, and the API should likely be cleaned up / simplified.
2011-12-13 14:48:47 +00:00
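The 'any' match described in commit feee9ded9f can be illustrated with a small sketch. This is not the actual TokenTransformDispatcher API: `SketchDispatcher`, `onAny`, `onType`, the token shapes and the `[1]` placeholder are all hypothetical names chosen for the illustration. It only shows the ordering rule from the commit message ('any' callbacks run before type-specific ones) and how a Cite-like handler could use it to swallow every token between ref and /ref.

```javascript
// Hypothetical sketch; names and token shapes are assumptions, not the real API.
function SketchDispatcher() {
    this.anyHandlers = [];   // run for every token, before typed handlers
    this.typeHandlers = {};  // token type -> array of handlers
}

SketchDispatcher.prototype.onAny = function (fn) {
    this.anyHandlers.push(fn);
};

SketchDispatcher.prototype.onType = function (type, fn) {
    (this.typeHandlers[type] = this.typeHandlers[type] || []).push(fn);
};

// A handler returns a token to pass it on, or null to swallow it.
SketchDispatcher.prototype.transform = function (tokens) {
    var out = [];
    for (var i = 0; i < tokens.length; i++) {
        var token = tokens[i];
        var handlers = this.anyHandlers.concat(this.typeHandlers[token.type] || []);
        for (var j = 0; j < handlers.length && token !== null; j++) {
            token = handlers[j](token);
        }
        if (token !== null) {
            out.push(token);
        }
    }
    return out;
};

// Cite-like usage: collect everything between <ref> and </ref>.
var dispatcher = new SketchDispatcher();
var refs = [];
var current = null; // tokens of the currently open ref, or null

dispatcher.onAny(function (token) {
    if (current === null) {
        return token; // not inside a ref, pass through
    }
    if (token.type === 'ENDTAG' && token.name === 'ref') {
        refs.push(current);
        current = null;
        return null; // swallow the end tag
    }
    current.push(token);
    return null; // swallow tokens inside the ref
});

dispatcher.onType('TAG', function (token) {
    if (token.name !== 'ref') {
        return token;
    }
    current = [];
    // Replace the start tag with a placeholder reference marker.
    return { type: 'TEXT', value: '[' + (refs.length + 1) + ']' };
});

var result = dispatcher.transform([
    { type: 'TEXT', value: 'a' },
    { type: 'TAG', name: 'ref' },
    { type: 'TEXT', value: 'note' },
    { type: 'ENDTAG', name: 'ref' },
    { type: 'TEXT', value: 'b' }
]);
```

Because the 'any' handler sees each token first, the type-specific handlers never run for tokens inside a ref. The state shared between the two handlers is the kind of bookkeeping the commit message suggests moving to an intermediate layer.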
Gabriel Wicke
8e55e79b67
Rename TokenTransformer to TokenTransformDispatcher.
2011-12-13 11:45:12 +00:00
Gabriel Wicke
8231511217
Replace custom object copy with $.extend.
2011-12-13 11:18:15 +00:00
Gabriel Wicke
39aedd4378
Improve comments in QuoteTransformer.
2011-12-13 10:25:18 +00:00
Gabriel Wicke
0ad08b9ae3
Add a README file pointing to the wiki documentation.
2011-12-12 22:30:11 +00:00
Gabriel Wicke
a8fa9433c4
Convert quote handling (italic/bold) to a core extension operating on the
token stream. This is the first token transformation exercising the
TokenTransformer class as its dispatcher. Template expansions, wiki link
formatting, tag sanitation and extensions should be able to use the same
dispatcher by registering for specific token types.
The parser performance is very slightly improved as the token stream is only
traversed once.
2011-12-12 20:53:14 +00:00
Gabriel Wicke
752b0990b2
Refactor parserTests somewhat into a class-like structure, and wire up the
TokenTransformer.
2011-12-12 14:03:54 +00:00
Gabriel Wicke
d616f07a79
Don't re-build the wiki tokenizer for each test. This speeds up the full
parserTests.js run slightly from 7-8 minutes to about 14 seconds ;)
A few very minor tweaks to the grammar are also thrown into this commit.
2011-12-12 10:47:42 +00:00
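The speed-up in commit d616f07a79 comes from building the tokenizer once and reusing it. A minimal sketch of that pattern follows, assuming a hypothetical `buildTokenizer` stand-in for the expensive grammar compilation; the real test runner's structure is not shown in this log and may differ.

```javascript
// Hypothetical stand-in for an expensive grammar compilation step.
var buildCount = 0;
function buildTokenizer(grammarSource) {
    buildCount++;
    return {
        // Toy tokenizer: split on whitespace. The real one is generated from a PEG grammar.
        tokenize: function (text) {
            return text.split(/\s+/).filter(function (s) { return s.length > 0; });
        }
    };
}

// Build lazily on first use, then hand out the same instance.
var cachedTokenizer = null;
function getTokenizer(grammarSource) {
    if (cachedTokenizer === null) {
        cachedTokenizer = buildTokenizer(grammarSource);
    }
    return cachedTokenizer;
}

// Three "tests" share one tokenizer instead of building three.
var inputs = ['a b', 'c  d e', 'f'];
var tokenCounts = inputs.map(function (input) {
    return getTokenizer('grammar source placeholder').tokenize(input).length;
});
```

Note that this simple cache ignores changes to `grammarSource` after the first call, which is acceptable only if the grammar is fixed for the whole run.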
Gabriel Wicke
89c5e0cafb
Follow-up to r105859: Add missing new.
2011-12-12 10:09:13 +00:00
Gabriel Wicke
9ebce5839a
Further development of the TokenTransformer framework.
2011-12-12 10:01:47 +00:00
Gabriel Wicke
80d5067813
Add a TokenTransformer dispatcher class. This class provides subscriptions by
token type, and supports asynchronous token expansion (for example for async
template expansion). This code is not yet tested or used. The interface for
token insertion from transformation functions will be expanded as needed.
2011-12-08 14:37:31 +00:00
Gabriel Wicke
c2b69e2486
Clean up newline handling. Emit a NEWLINE token for each
non-{comment,pre,nowiki} newline.
2011-12-08 14:34:18 +00:00
Gabriel Wicke
abc2254110
A bit of comment clean-up and wrapping of tree building into try/catch block
to actually count failures.
2011-12-08 11:40:59 +00:00
Gabriel Wicke
92fdf99384
Further renaming, this time from pegParser to pegTokenizer.
2011-12-08 10:59:44 +00:00
Gabriel Wicke
76bc477038
Rename html5TokenEmitter to HTML5TreeBuilder, and the contained Tokenizer to
TreeBuilder.
2011-12-08 10:37:18 +00:00
Gabriel Wicke
19a1f0850f
Tidy up the grammar a bit.
2011-12-08 10:33:23 +00:00
Gabriel Wicke
3742d70abd
Add some documentation to syntax flags
2011-12-07 15:54:55 +00:00
Gabriel Wicke
545ca1809f
Convert template argument production to generic inline with syntactic stop.
Fix a bug in generic inline production. Nested multi-line templates are now
parsed okayish.
2011-12-07 15:39:39 +00:00
Gabriel Wicke
902db40a1f
Process template arguments into an object.
2011-12-07 14:46:07 +00:00
Gabriel Wicke
51a40e4dbc
Follow-up to r105423: Fix off-by-one bug.
2011-12-07 11:56:12 +00:00
Gabriel Wicke
49c286a67b
Fix a bug in doQuotes (bitten by surprising JS sort() behavior), and improve
tag-only-line handling. 180 parser tests now passing.
2011-12-07 11:51:24 +00:00
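Commit 49c286a67b does not say which sort() behavior caused the doQuotes bug. One well-known surprise, shown here purely as an assumption about what may have happened, is that `Array.prototype.sort` without a comparator compares elements as strings, so numbers are not ordered numerically. The `lengths` array is made up for the example.

```javascript
// Made-up sample data; the actual values in doQuotes are not given in the log.
var lengths = [10, 9, 1, 100];

// Default sort converts elements to strings and compares them lexicographically.
var defaultSorted = lengths.slice().sort();

// An explicit numeric comparator gives the expected order.
var numericSorted = lengths.slice().sort(function (a, b) {
    return a - b;
});

console.log(defaultSorted); // [ 1, 10, 100, 9 ]
console.log(numericSorted); // [ 1, 9, 10, 100 ]
```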
Gabriel Wicke
418a5067c6
Parse attributes in tables using generic attribute production. Some table
tests still do not pass as the MW table output reorders attributes ;)
2011-12-06 22:03:21 +00:00
Gabriel Wicke
3d06707152
Slightly speed up inline tag productions using guards and grouping; Fix list
processing function.
2011-12-06 18:35:05 +00:00
Gabriel Wicke
ea8f226fd5
Remove ext and references special cases, now subsumed by generic XML tag
productions. Document issue around special tokenizer mode for other extension
tags.
2011-12-06 16:44:27 +00:00
Gabriel Wicke
e7de089d5b
Decode urls and html entities, 163 tests now passing.
2011-12-06 13:17:14 +00:00
Gabriel Wicke
a72a9e55a3
Don't match internal links with url as target. 161 passing.
2011-12-06 12:26:57 +00:00
Gabriel Wicke
2b5cc67bf5
Further tweaks to headings. 157 tests now passing.
2011-12-06 11:59:41 +00:00
Gabriel Wicke
f4d123886e
Convert heading rules to single rule that figures out the level. This saves a
lot of backtracking and inline break complexity.
2011-12-06 11:06:05 +00:00
Gabriel Wicke
33e19f7275
Recognize block-level elements independent of case; Ignore toc and section
edit links in tests. 148 parser tests passing.
2011-12-05 20:03:24 +00:00
Gabriel Wicke
9ed9cb31bd
Fix template argument handling somewhat.
2011-12-05 17:58:11 +00:00
Gabriel Wicke
1760210d13
Fixes to tables, headings and misc smaller stuff. Tracked down an issue caused
by improper caching of production results, which interfered with the
flag-dependent inline_break production.
2011-12-04 19:23:24 +00:00
Gabriel Wicke
63c728924b
Use pegjs from npm
2011-12-01 15:23:23 +00:00
Antoine Musso
5ab379f479
fix vim modeline
2011-12-01 15:19:37 +00:00
Gabriel Wicke
0ce1e9fcf3
Add a quick html entity decoding hack, and document need for general decoder.
2011-12-01 14:39:55 +00:00
Gabriel Wicke
d00743ad79
Improve external links and definition lists, now 133 tests passing ;)
Also add printwhitelist option to test runner, provides js code copy/pastable
to whitelist.
2011-12-01 14:25:59 +00:00
Gabriel Wicke
82e31ffd42
Do not allow newlines in various attributes
2011-11-30 15:12:53 +00:00
Gabriel Wicke
821162484e
Allow inlines in the term part of ; term : definition
2011-11-30 14:53:28 +00:00
Gabriel Wicke
f758894de7
Let another test pass by swapping the default order of italic/bold for '''''.
Minor test output cosmetics.
2011-11-30 13:54:57 +00:00
Gabriel Wicke
e0fca805a6
Expand tabs in grammar.
2011-11-30 13:42:26 +00:00
Gabriel Wicke
2bb512a4de
A bit of tokenizer grammar clean-up and additional expected-html
normalization. 99 parser tests now passing.
2011-11-30 13:40:17 +00:00
Gabriel Wicke
127d8c8621
Simplify DOM paragraph wrapping postprocessor
2011-11-30 12:28:45 +00:00
Gabriel Wicke
f0edc5cb9a
Fix a few more tests by allowing inline content inside links. 76 now passing.
2011-11-29 18:43:27 +00:00
Gabriel Wicke
ae0b5f9af4
* Split paragraph handling between tokenizer and DOM postprocessor for better
html markup handling.
* Remove global 'use strict' declarations from html5 parser.
* Add trailing whitespace handling in dt
Overall, 55 parser tests are now passing.
2011-11-29 15:11:51 +00:00
Gabriel Wicke
b16c295b98
Consider dl as a block-level element.
2011-11-28 16:54:58 +00:00
Gabriel Wicke
d3f0196df7
Add primitive HTML comparison to detect passing parser tests. The expected
HTML is parsed using a HTML parser and re-serialized, and the output compared
to the serialization of the new parser's dom. Newline normalization is a
cheap hack for now, need to improve that later.
2011-11-28 11:10:39 +00:00
Gabriel Wicke
6b8c109cf0
Separate block-level tags in tokenizer to delimit inlines and avoid wrapping
block-level in paragraphs.
2011-11-25 17:41:26 +00:00
Gabriel Wicke
859379a635
Improvements to nowiki/pre interaction. Will need to distinguish block-level
tags from inline HTML tags next.
2011-11-25 15:02:44 +00:00
Gabriel Wicke
dd5cd59ac6
Better HTML, pre and blocklevel handling. Hackish source formatting for easier
comparison with parserTest results.
2011-11-25 12:47:03 +00:00
Gabriel Wicke
5b3a4497aa
Add generic HTML tokenization and nowiki handling.
2011-11-25 10:59:43 +00:00
Gabriel Wicke
6c36ddcbce
Follow-up to r104164: Clean-up comments, remove old italic/bold productions.
2011-11-24 14:20:56 +00:00
Gabriel Wicke
dee262658f
Add MediaWiki-compatible quote handling including quirks and overlapped
structures like ''[[Link|Link text'']]. This is another transform on the token
stream.
2011-11-24 13:56:30 +00:00
Gabriel Wicke
baf55875b9
Re-add modified wiki list handling to tokenizer.
2011-11-23 14:27:51 +00:00
Gabriel Wicke
694b998f24
Minor improvement to italic/bold, documentation on failed modularization of
static parser functions.
2011-11-22 16:51:05 +00:00
Gabriel Wicke
d1b0293569
Fix comment token conversion and serialization
2011-11-21 09:22:30 +00:00
Gabriel Wicke
65afd9b610
Improve internal link handling
2011-11-18 14:48:32 +00:00
Gabriel Wicke
d744e65c48
Add missing token adapter.
2011-11-18 14:00:14 +00:00
Gabriel Wicke
b750ce38b8
Add node.js-compatible HTML5 parser and hook it up to the PEG tokenizer.
Builds a DOM tree (jsdom) from the tokens and then serializes that using
document.innerHTML. This is all very experimental, so don't be surprised by
rough edges.
2011-11-18 13:57:07 +00:00
Gabriel Wicke
11e487d8c0
Flatten inline token lists before merging text into text tokens.
2011-11-17 15:43:31 +00:00
Gabriel Wicke
ea87e7aaee
Convert PEG parser to tokenizer for back-end HTML parser. Now emits a list of
tokens, which for now is still completely built before parsing can proceed.
For each top-level block, the source start/end positions are added as
attributes to the top-most tokens. No tracking of wiki vs. html syntax yet.
2011-11-17 15:26:02 +00:00
Gabriel Wicke
ef3c84bd2e
Extract text from inline elements for better testing. Slightly improved
handling of comment-only lines. Change pre to leaf content model.
2011-11-08 16:08:05 +00:00
Gabriel Wicke
18ead89b37
Improved paragraph, br, comment parsing and switched headings to
generic inlineline with syntactic flags.
2011-11-07 23:09:30 +00:00
Gabriel Wicke
944d010eb2
Indentation cleanup in PEG parser and Html serializer
2011-11-07 21:05:37 +00:00
Gabriel Wicke
c3a0c56e56
rename definition{term,description} to just {term,description}
2011-11-07 20:36:34 +00:00
Gabriel Wicke
71891131c3
Grammar improvements
* replaced regexp stack with a set of break rules for inline content within
specialized parse contexts, switched more rules to generic
inlineline/inline/block rules.
* don't consume end-of-line for proper start-of-line matching
* added some pre support
* still no conversion of inline elements to annotations
2011-11-07 14:39:12 +00:00
Gabriel Wicke
06ca9f12fe
Rename definitiondata to definitiondescription, minor fixes
2011-11-04 12:25:01 +00:00
Gabriel Wicke
7e5c196732
Some more progress for tables and definition lists
2011-11-04 12:06:49 +00:00
Gabriel Wicke
83a80bad49
Fixes for definition lists
2011-11-04 11:08:11 +00:00
Gabriel Wicke
85def70a8a
Add basic list serialization to HtmlSerializer
* Added 'definitionterm' and 'definitiondata' styles to support definition
lists, and special-case handling in the serializer to wrap both in dls.
2011-11-04 10:02:59 +00:00
Gabriel Wicke
63398b5749
Update parserTests to latest serializers
2011-11-04 07:45:05 +00:00
Gabriel Wicke
a8838dab18
Start by handling paragraphs, at least a bit.
2011-11-03 15:16:05 +00:00
Gabriel Wicke
0d30a5528e
First combination of WikiDom serializers with existing parser in
tests/parser/parserTests.js.
* Removed var from es in es.js to allow node.js to access it as global. Only
alternative solution appears to be a node-specific 'exports' construct:
http://nodejs.org/docs/v0.3.1/api/modules.html
* Added es.Document.js and es.Document.Serializer.js in es/bases. Not sure if
this is the desired location.
* Changed es.extend to es.extendClass in the serializers
* Modified the first parser test to include the WikiDom modules and call the
new HTML serializer
2011-11-03 13:55:48 +00:00
Trevor Parscal
5bae153214
Moving parser stuff back into the modules folder (oops)
2011-11-02 21:45:57 +00:00
Trevor Parscal
2b499d5990
Reorganized modules by javascript namespace
2011-11-02 21:31:45 +00:00
Brion Vibber
213ee7d4a8
followup r101685: the peg definition
2011-11-02 21:09:19 +00:00
Brion Vibber
56a75ccca7
Copy several of the experimental JS parser bits from ParserPlayground to VisualEditor. They'll need retooling to hook up with the wikidom stuff.
2011-11-02 21:07:51 +00:00