Also encode cite ids properly as now they can contain arbitrary
text. Change in blacklist due to this.
TODO: Investigate if it would be better to do this directly in
the tokenizer.
Change-Id: Ic112124e90d256d73a351d0d57fe3c7546fa065f
* Although this resolves the crashes, I'm unsatisfied with it as a
proper fix to the underlying issue. There are many places throughout
the codebase where we serialize and then parse document fragments
that should be instrumented to store and unpack data-* attributes.
Bug: T76518
Change-Id: Idca1b0a37ec924a71cb51160d000c7de9717d422
The coding conventions suggest avoiding ==,
and for this condition definedness is actually more relevant
than whether the string has any text, but since
the string can also be '0', checking for !$text doesn't work.
Similar to I15b422d3345bf4522e68a17dce9682ff28484559 .
Change-Id: Ib823678b639bf4f1a92dffcd9e41c780b56ab128
The coding conventions suggest avoiding ==,
and for this condition definedness is actually more relevant
than whether the string has any text.
Change-Id: I15b422d3345bf4522e68a17dce9682ff28484559
* Currently, this mimics Cite.php behavior where "a b" and "a_b"
are considered identical ids.
* Added new parser test.
* Fixed output of another test.
* Fixed section name of a commented out test.
Change-Id: I0c51404c3e659bbddfe9a8909aa6a109d368b762
In this function $text can be both false and empty string.
It is more intuitive to use a boolean operator here than
to rely on the fact that comparing to '' using == happens
to give the correct result.
Change-Id: I08248a3fcade7744287e9b9f3bc176d29ac1ecde
* This is both faster and consistent with how we're accessing other
parsoid attributes. It's also a step towards not having this data in
the html output.
* Changes to parserTests and the blacklist are for attribute order.
* Requires upgrading domino to 1.0.18
https://github.com/fgnass/domino/pull/48
Change-Id: I1edbc260887d480adf04763b15043c374e27cceb
General changes
---------------
* Replaced the hacky 'inBlockNode' parser pipeline option with
a cleaner 'noPWrapping' option that suppresses paragraph wrapping
in sub-pipelines (ex: recursive link content, ref tags, attribute
content, etc.).
Changes to wt2html pipeline
---------------------------
* Fixed paragraph-wrapping code to ensure that there are no bare
text nodes left behind, but without removing the line-based block-tag
influences on p-wrapping. Some simplifications as well.
TODO: There are still some discrepancies around <blockquote>
p-wrapping behavior. These will be investigated and addressed
in a future patch.
* Fixed foster parenting code to ensure that fostered content is
added in p-tags where necessary rather than span-tags.
Changes to html2wt/selser pipeline
----------------------------------
* Fixed DOMDiff to tag mw:DiffMarker nodes with a is-block-node
attribute when the deleted node is a block node. This is used
during selective serialization to discard original separators
between adjacent p-nodes if either of their neighbors is a
deleted block node.
* Fixed serialization to account for changes to p-wrapping.
- Updated tag handlers for the <p> tag.
- Updated separator handling to deal with deleted block tags
and their influence on separators around adjacent p-tags.
- Updated selser output code to test whether a deleted block
tag forces nowiki escaping on unedited content from adjacent
p-tags.
Changes to parser tests / test setup
------------------------------------
* Tweaked selser test generation to ensure that text nodes are always
inserted in p-wrappers where necessary.
* Updated parser test output for several tests to introduce p-tags
instead of span-tags or missing p-tags, add html/parsoid section,
or in one case, add missing HTML output.
Parser Test Result changes
--------------------------
Newly passing
- 12 wt2html
- 1 wt2wt
- 3 html2html
- 3 html2wt
Newly failing
- 1 html2wt
"3. Leading whitespace in indent-pre suppressing contexts should not be escaped"
This is just normalization of output where multiple HTML forms
serialize to the same wikitext with a newline difference. It is not
worth the complexity to fix this.
- 1 wt2wt
""Trailing newlines in a deep dom-subtree that ends a wikitext line"
This is again normalization during serialization where an extra
unnecessary newline is introduced.
- A bunch of selser test changes.
182 +add, 188 -add => 6 fewer selser failures
- That is a lot of changes to sift through, and I didn't look at every
one of those, but a number of changes seem to be harmless, and just
a change to previously "failing" tests.
- "Media link with nasty text" test seems to have a lot of selser
changes, but the HTML generated by Parsoid seems to be "buggy" with
interesting DSR values as well. That needs investigation separately.
- "HTML nested bullet list, closed tags (bug 5497) [[3,3,4,[0,1,4],3]]"
has seen a degradation where a dirty diff got introduced.
Haven't investigated carefully why that is so.
Change-Id: Ia9c9950717120fbcd03abfe4e09168e787669ac4
* Nested <ref> tag support was broken in 69b6ec4d since that patch
effectively processed <refs> only on the final DOM rather than
extracted <ref> information in sub-pipelines. In doing so, it
broke support for <ref> in <ref> tags which are supported by {{efn}}
templates and used as follows.
{{efn|A clarification.{{sfn|Smith|2009|p=2}}}}
The {{sfn}} tpl generates a <ref> tag inside another <ref> tag
generated by the {{efn}} tpl.
* This patch fixes the breakage by processing <ref> content after it
is extracted (in case it is known to have nested <ref> tags).
* This fixes the rendering on enwiki/Otto_I%2C_Holy_Roman_Emperor
and is an improvement over when it was broken. The nested <ref>
gets id 18 (just like the Cite.php handling) whereas it used to
get id 1 before the breakage.
* TODO:
- Figure out a way to add a test for this.
Change-Id: Ib82b75f66249b2133a123adbe6fd7acbfd8ec8fb