* The RDFa types used for links are now identical to those listed in
http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary, and are supported for
serialization.
* Editors are responsible for adjusting the type when converting between link
types. Adding a caption to an mw:UrlLink for example should convert it into
an mw:ExtLink.
Update: rebased on top of trace patches
Change-Id: Ie1b882e2b3fbad08be94769e1167dccd8dfea65d
* Source-based round-tripping now uses typeof="mw:Placeholder" instead of
data-gen.
* mw:Image is supported for round-tripping, but not yet for modifications as
it is still source-based
Change-Id: Ie5cf4e54de0163168c25c2b5c09380657a15970f
* Copied over utility methods from mediawiki.parser.environment.js
to ext.Util.js.
* Moved over utility method from mediawiki.parser.defines.js to
ext.Util.js.
* Converted Util to be a singleton object rather than an allocatable
class. There is no reason to allocate a new utility class everywhere
since this utility object has no useful state.
* Fixed up use of utility methods to use Util rather than env.
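A minimal sketch of the singleton conversion described above. The helper names
(isBlockTag, lookupValue) are illustrative assumptions, not the actual contents
of ext.Util.js:

```javascript
// Hypothetical sketch: a stateless utility exposed as one shared object
// instead of a class that every caller has to instantiate.
const Util = {
    // Illustrative helper (name assumed): is this a block-level tag name?
    isBlockTag(name) {
        return ['div', 'p', 'table', 'ul', 'ol', 'li'].includes(name.toLowerCase());
    },
    // Illustrative helper (name assumed): value for key `k` in a list of
    // {k, v} pairs, or undefined if the key is absent.
    lookupValue(kvs, k) {
        const kv = kvs.find(function (pair) { return pair.k === k; });
        return kv ? kv.v : undefined;
    }
};
// The object carries no state, so freezing it is safe and documents intent.
Object.freeze(Util);
```

Callers then write Util.isBlockTag(name) directly rather than allocating a
utility instance or going through env.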
Change-Id: Ib81f96b894f6528f2ccbe36e1fd4c3d50cd1f6b7
Now that we have access to the contents we can more easily compare the content
with link targets. This is still to do; this commit only converts the link
handler to work on the collected tokens.
* Start to implement latest RDFa spec from
http://www.mediawiki.org/wiki/Parsoid/RDFa_vocabulary
* Capitalize types, add mw:Entity type for html entities
* Detect changes to entities using tokenCollector and srcContent
Change-Id: I45429f4b930858a16e166ef8377c8f6f5114c414
This is work in progress, but committed for now so I can use it for links and
tweak it while doing so.
Change-Id: I757277f6efacda6d9432ca57542a957f597a98de
This hopefully makes it clearer that data-rt contains private round-trip info
instead of semantically interesting data.
Change-Id: I03b476ed112a4b627c9871ee3677c450f943429a
* Arbitrary predicate support for the termination of collection mode
* tokens as property of the collector instead of a state-global thing
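The two points above could be sketched as follows. This is an assumption about
the general shape only; the constructor and method names are hypothetical and
not taken from the serializer source:

```javascript
// Hypothetical collector: gathers tokens until a caller-supplied predicate
// reports that collection should end.
function TokenCollector(isEnd, onComplete) {
    this.isEnd = isEnd;           // arbitrary predicate: token -> boolean
    this.onComplete = onComplete; // called with the collected tokens
    this.tokens = [];             // owned by the collector, not global state
}

TokenCollector.prototype.collect = function (token) {
    this.tokens.push(token);
    if (this.isEnd(token)) {
        const collected = this.tokens;
        this.tokens = [];
        return this.onComplete(collected);
    }
    return null; // still collecting
};
```

A figure collector, for example, would pass a predicate that matches the
closing figure tag and a callback that serializes the whole chunk.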
Change-Id: Ibcb342bc64a76fece9b04a760ea56c7878e67cad
* Fixed image serializer to deal with missing 'v' value in a k-v pair
representing an image attribute.
* Added fix to deal with bare <li>'s (without surrounding <ul> tags)
NOTE: The second fix is currently required to work around parser bugs in
complex cases. In the future, we could handle this in one of the following
ways:
(a) The serializer expects a well-formed DOM and all cleanup will be
done as part of external tools/passes.
(b) The serializer supports a small set of exceptional cases and bare
list items could be one of them
(c) The serializer ought to handle any DOM that is thrown at it.
Yet to be resolved.
Change-Id: Ib585e5c9f2a8a80854740ce0211bde705f9fd6f4
* Strips the first paragraph tag in a list item or table cell context
if there are no attributes on it and stx:html is not set
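The condition above, written out as a sketch. The token shape (attribs array,
dataAttribs.stx) and the function name are assumptions made for illustration:

```javascript
// Hypothetical check: should the serializer drop this <p> start tag?
// Assumes tokens look like { attribs: [...], dataAttribs: { stx: ... } }.
function shouldStripParagraph(parentName, pToken, isFirstChild) {
    const inContext = parentName === 'li' || parentName === 'td';
    const hasAttribs = Boolean(pToken.attribs && pToken.attribs.length > 0);
    const isHtmlSyntax = Boolean(pToken.dataAttribs && pToken.dataAttribs.stx === 'html');
    return inContext && isFirstChild && !hasAttribs && !isHtmlSyntax;
}
```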
Change-Id: I74988645fe505c662f86488e32d0f11d464ffe41
* Looks like I misled myself in commit 88fc91 -- that wikitext
roundtripped perfectly because it went through the 'src' route
because it was a thumbnail with an explicit image which doesn't
go through renderThumb -- so, the serializer simply spit out the
original 'src' string and hence perfect rt :).
* More whitespace preserving fixes in LinkHandler.
* Also changed resource value in the img tag to use the original
filename rather than the normalized capitalized filename.
* 2 more parser tests round-trip -- now up to 400.
Change-Id: I144a6486dd9d07da8a74a68700fe96c78d192826
* Changed PrefixImageOptions so that thumb and thumbnail are
distinct key-value pairs. Without this fix, cannot distinguish
between thumb=foo.jpg and thumbnail=foo.jpg
* Fixed link handler so whitespace is preserved around prefixed image
options.
* Fixed figure handler to process the 3 different kinds of image options:
size, simple image options, and prefixed image options.
* There is a hack/fixme for "upright: aspect" prefixed image option
which needs to be looked into.
* Still need to fix uppercasing of the image resource name.
With these fixes, the following wikitext roundtrips perfectly
(after newline breaks are removed)
[[Image:Foo.jpg|thumbnail = 'baby.jpg'|100x100px|center| alt =bbbbb|
upright=true|bottom|link='http://foo.bar'|
This is a [[Linked Caption]] in the image]]
Change-Id: I6606df56874c2b97f00f08cb6bbeaec9878167d3
Anything with data-gen="both" and dataAttribs.src defined serializes to
dataAttribs.src and drops its contents (if any). We can use this to round-trip
elements we don't properly parse or serialize yet. Without RDFa info, the
editor will not touch the contents after encountering data-gen="both".
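A rough sketch of this rule, using plain objects in place of real DOM nodes.
The node shape and the function name are assumptions for illustration:

```javascript
// Hypothetical source-based round-trip: if the node is marked
// data-gen="both" and carries its original wikitext in dataAttribs.src,
// emit that source and ignore the node's contents.
function serializeNode(node, serializeChildren) {
    const attribs = node.attribs || {};
    const dataAttribs = node.dataAttribs || {};
    if (attribs['data-gen'] === 'both' && dataAttribs.src !== undefined) {
        return dataAttribs.src;
    }
    return serializeChildren(node);
}
```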
Change-Id: Ia39e5fdd765c2c9b36f26313455685d29f118839
This only applies to newly created headings, so headings with a single newline
preceding them will be round-tripped that way.
Change-Id: Ic09972bbd25c3934b53f6fd3b5be5a0c3185c2af
* Collect all figure tokens and process them as a chunk
* This effectively mimics context-sensitive DOM walking,
but since we need serialization supported on a token stream,
we cannot use real DOM walking. The current technique should
also work on a token stream.
* There is a FIXME about the image filename being capitalized.
This needs fixing in the parser, or some other way of recognizing the
original unnormalized filename.
Amended by gwicke:
* Build option list and join it with pipe to avoid stray trailing pipe
* Satisfy JSHint's weird preference to have '&&' and '||' at the end of the line
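The option-list approach could look roughly like this. The function name and
parameters are hypothetical; only the build-then-join idea comes from the note
above:

```javascript
// Hypothetical sketch: collect all parts first, then join with '|', so no
// trailing pipe appears when options or the caption are missing.
function serializeImage(target, options, caption) {
    const parts = [target].concat(options);
    if (caption) {
        parts.push(caption);
    }
    return '[[' + parts.join('|') + ']]';
}
```

Appending option + '|' in a loop would instead leave a stray pipe after the
last option whenever there is no caption.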
Change-Id: I1e5f6600f297fcdf81e3227a82ca3b71d4e97fc3
* Removed dead commented out code.
* Cleaned up newline handling in serializer some more.
* Now, onNewLine and onStartOfLine reflect serializer state
more accurately.
* No implicit new lines for explicit html tags.
* 9 more roundtrip tests now green.
Change-Id: I9f640de2ae769c7472538fa687400dc8a40c2b2d
297 round-trip tests are passing with this patch.
TODO:
* generalize data-mw-gc handling in the serializer for any tag
* use data-mw-gc="both" and data-mw.src: 'the wikitext' for round-tripping of
wikitext structures, optionally with some presentational (but read-only)
content
* use span and data-mw-gc="both" for nowiki
Change-Id: I700142a56818977c20c8c06e6a5f2e77a708d25e
This makes sure that we escape start-of-line syntax when needed, since
onNewline is often not yet set.
Discussion / background:
[19:18] <subbu> this will fix it, but, i think this is asking for another
minor refactoring of these flags ... because this is a subtle fix which means
it might be possible to make it clearer. onNewline is only true in one
direction, i.e. if true, we are in a new line state, but if we are in a
newline context, onNewline is not true, which is why this new method is
needed.
[19:19] <subbu> i dont know if it is possible, but it seems like it should be
possible. but, something for later.
[19:20] <subbu> badly phrased. "onNewline" ==> in new line context, but if in
new line context, onNewline may be false.
[19:20] <gwicke> we should perhaps update it as early as possible instead
[19:21] <subbu> i cannot today, but possible monday. i am heading out in
about 15-30 mins.
[19:22] <gwicke> will need to check all conditions depending on it in
_serializeToken
[19:22] <subbu> oh, i misunderstood you :)
[19:22] <gwicke> and if there are cases where the onNewline / onStartOfLine
state could be reverted later
[19:23] <subbu> you were referring to the flag, i thought you meant we should
fix this sooner than later.
[19:23] <gwicke> yes, I wasn't terribly clear
[19:23] <gwicke> you wrote something about following productions swallowing
newlines, but I think we don't actually do that any more
[19:24] <gwicke> I'm quite optimistic that updating those flags much earlier
would work
[19:25] <subbu> yes, it could fix it.
[19:26] <subbu> you might be right reg. swallowing. it was happening earlier.
but, not right now, after single-line mode and other fixes.
Change-Id: Ic1d8141c04eb54a59977d0ba87bcf06bafd421e0
This should not really be needed if the tokenizer did not decode html entities
on the fly. It is still a quick way to make sure no htmlish content can be
inserted even with the current decoding.
The next step and proper fix is to make entity decoding either optional in the
tokenizer (flag-controlled), or move it to a later stage in the token
processing pipeline.
Change-Id: Ife093dcfb95113763dab5635b098c795d3550586
* Renamed defaultOptions to initialState
* Got rid of unused state property
* Added comments explaining how state attributes
and tag handler flags are used
* Refactored listItemHandler check into functions and
added a FIXME about possibly rewriting that check.
* Protected serializeDOM in a try-catch handler to
catch exceptions and output the exception to the console.
Change-Id: I3d351c06e4b86baeb5a55243b11dbfa9baca5bb7
* Removed murky ' :' -> ' :' replacement in tokenizer. This breaks four
parser tests, and should be fixed in a token stream transformer or DOM
postprocessor. This replacement clashes with round-tripping, and is not
terribly important visually.
* Added stx:row annotation to single-line dt/dd pairs and use it to preserve
single-line syntax in the serializer. There is no attempt yet to support the
addition of nested lists in an originally single-line dd. We'd need to look
ahead in the serializer to support this. Perhaps the editor can simply drop
data-mw in that case.
* Switched default dt/dd serialization to multi-line. This supports all nested
lists and multiple dds.
* Don't close dls when switching from dt to dd or back in the token stream
ListHandler.
Overall 290 round-trip tests are passing now (up from 284, some due to ,
some due to lists). The number of passing parser tests dropped slightly from
303 to 297 (or 301/295 on weekdays other than Thursday).
Change-Id: I85ff40571833713388c6523e6a4ba2e94daa3807
Basically only prefix all bullets if the serialization output is going to be
in start-of-line context. The test for that is currently inline, but should
perhaps be factored out to a method or state flag instead.
Alternatively, we could return the start-of-line prefix and let
_serializeToken use it in case we end up in start-of-line context.
This patch also fixes a newline issue on input like this:
:d1
::: d3
Both the list and list item handlers now set the startsNewline flag
dynamically depending on the context, so that we don't depend on the
suppression of newlines from list syntax by the singleLineMode any more.
There is still an extra newline inserted between list items in the following
example:
;t1 :d1
;;t2 ::d2
This looks like a bug in the produced DOM and not in the serializer, since the
outer definition list is closed and re-opened between d1 and t2.
Change-Id: I78e3a1ef34cf9159d5a1e86fb64c774ff111e71d
The main issue is that the bullets from dd/dt were not stored on the stack. I
added a separate field for it in each stack entry, which now fixes the basic
indent case without (afaik) breaking anything else.
There are still some newline issues, and the need to handle the single-line
dd/dt vs. the multi-line variant.
Change-Id: I65939c05e2c5dde0789bf8aefd7651161a2f137c
* Don't escape html-syntax pre content for now; Should parse this with a new
pre content production later (which needs to be split out of the regular pre
production in the tokenizer)
* Protect indent-pre content from start-of-line syntax escaping
* Preserve extra leading spaces in the tokenizer
* Two more (now 284) round-trip tests are passing
Change-Id: I199b89c0ee7fae12546df10c1b5117c97caccac5
Queued newlines and new trailing newlines were not cleanly separated so far,
which caused some trailing newlines to be consumed for needed leading
newlines. This change fixes several newline bugs, taking the number of passing
round-trip tests from 276 back up to 282.
Change-Id: Idb4706e15ce71e63085033e3f3f29557915c11a8
Known issue: breaks round-tripping of :;;;::. That test is normally disabled
anyway, so we can fix it later.
Change-Id: I7954271311bfb7e71caae59d8177e3f04a9ebbca
* Started to add more complete tag source range (tsr) annotations to most
start / empty tags. These replace the old sourcePos and sourceTagPos
annotations, and look more promising for general round-tripping than block
source ranges (bsr). See
http://www.mediawiki.org/wiki/User:GWicke/Parsoid_source_ranges for some
notes on this.
* Added an escapeWikitext method in the serializer that tokenizes supposedly
text-only content from the DOM with the tokenizer and wraps runs of returned
non-text tokens into nowiki tags. The source corresponding to non-text
tokens is retrieved using the tsr annotations.
* Removed old (unused) table productions to avoid confusion.
* 276 round-trip tests are passing, vs. 283 without escaping.
Known issues:
* harmless for now, can be improved later: urllinks in external link captions
are wrapped in nowiki. Example HTML:
<a rel='mw:extLink' href="http://example.com">http://example2.com</a>
* some start-of-line syntax in wiki-syntax preformatted blocks might be
wrapped into nowiki when that would not really be needed. Example HTML DOM:
<pre>
* foo
* bar
</pre>
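A simplified sketch of the escaping idea. It assumes text tokens are plain
strings and every non-text token carries a tsr of [start, end] offsets into
the tokenized source; the function name is hypothetical:

```javascript
// Hypothetical sketch: wrap each run of adjacent non-text tokens in
// <nowiki>, recovering their original source via the tsr offsets.
function escapeTokens(src, tokens) {
    let out = '';
    let run = null; // [start, end] source range of the current non-text run

    function flush() {
        if (run) {
            out += '<nowiki>' + src.substring(run[0], run[1]) + '</nowiki>';
            run = null;
        }
    }

    tokens.forEach(function (tok) {
        if (typeof tok === 'string') {
            flush();
            out += tok;
        } else if (run) {
            run[1] = tok.tsr[1]; // extend the current run
        } else {
            run = [tok.tsr[0], tok.tsr[1]];
        }
    });
    flush();
    return out;
}
```

Text that tokenizes to nothing but text tokens comes back unchanged, which is
why plain content needs no nowiki wrapping.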
Change-Id: I01c34aedd5c566614d36924add47a6a960e91987
* Added a newlineTransparent flag to handlers that prevents changes to the
onNewline status, so that content following it is still considered to be in
start-of-line context. This fixes a few rt tests where a comment or nowiki
tag is at the start of the line, and following content should end up on the
same line.
* 283 rt parser tests are now passing.
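The effect of the flag, as a sketch. Only the newlineTransparent name comes
from this commit; the handler table and the function are assumptions:

```javascript
// Hypothetical handler table; real handlers carry more fields.
const tagHandlers = {
    comment: { newlineTransparent: true },
    nowiki: { newlineTransparent: true },
    b: {}
};

// Returns the onNewline value to use after emitting `output` for `tagName`.
function nextOnNewline(onNewline, tagName, output) {
    const handler = tagHandlers[tagName] || {};
    if (handler.newlineTransparent) {
        return onNewline; // leave the start-of-line status untouched
    }
    // Ordinary output ends the newline state unless it ends in a newline.
    return output.length === 0 ? onNewline : output.endsWith('\n');
}
```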
Change-Id: Ie58dcb9e5e9af9000fff61c2e1db5d8649ffc3f6
* tokens are not modified any more (they are supposed to be immutable)
* handler info is now split in start / end objects and potentially a 'make'
method; added more flags to govern the newline behavior of different tags
* added a generic singleLine mode for single-line syntactical environments
* switched the web service to line-based diffs to avoid issues when diffing
the round-trip results of [[:en:Programming language]]
* 280 round-trip tests are passing now
Change-Id: I74b8ffbf69643c5d6e5ec852ec58e680c9018901
from Parsoid HTML output as well as VE HTML output. There are still
some newline-related failures from parser tests that need fixing, but
this is getting close. So committing for now so other eyes can make the
bugs shallow :).
Change-Id: Ia6a218ee9fb3e18fe0573c89ff3a4236779e1e64