Follow up to 6c15f6e where the same approach was taken in dom diff'ing.
Clarifies where the "id" is expected to point and the limitations of the
approach vis-a-vis embedded content.
For example,
<ref>hi ho</ref>
[[File:Test.png|<references />]]
won't roundtrip, and never did, because the references section the "id"
would point to is in embedded content.
This was really only ever about the case where the <ref> itself was
found in embedded content, like an image caption, and we wanted to find
a top level references section, like,
[[File:Test.png|<ref>hi ho</ref>]]
<references />
The one case old approach was ostensibly doing something smarter was if
both the references section and the ref were in the same embedded
content, as in,
[[File:Test.png|<ref>hi ho</ref><references />]]
However, at least for file captions, those were always serialized in a
fragment of the top level doc and suffer from same dropping as the first
example here. Maybe some other embedded content is handled differently,
in which case this is probably an acceptable regression.
Change-Id: Ia90eadcc5099a8c27f0bf3fda0ce2f0effca7bcc
It's a feature of named refs that we only know at the time of inserting
the references list whether they have content or not, and are therefore
in err. The strategy of 4438a72 was to keep pointers to all named ref
nodes so that if an error does occur, we can mark them up.
The problem with embedded content is that, at the time when we find out
about the errors, it's been serialized and stored, and so any pointers
we might have kept around are no longer live or relevant. We need to go
back and process all that embedded content again to find where the refs
with errors are hiding.
This patch slightly optimizes that by keeping a map of all the errors
for refs in embedded content so that only one pass is necessary, rather
than for each references list. Also note that, in the common case, this
pass won't run since we won't have any errors in embedded content.
Bug: T266356
Change-Id: I32e7bfa796cd4382c43b3b1d17b925dc97ce9f7f
* Add it to CE HTML, for compatibility with site styles.
* Add it to DM HTML of newly created references only. Existing
references just use whatever classes we got from Parsoid, to avoid
unnecessary DOM changes and dirty diffs.
Bug: T265930
Depends-On: I61a2132f3876e2d9567d985358f51eb51c479813
Change-Id: I9d6856f03071c09617b8ae7db938135a3e30fe8e
The following sniffs are failing and were disabled:
* MediaWiki.Commenting.PropertyDocumentation.MissingDocumentationProtected
Additional changes:
* Dropped .inc files from .phpcs.xml (T200956).
Change-Id: Ia1461485a1a0122a5b830dc2badb542ae7ec9a1c
* Generating error message in data-mw in affected refs
* Cite test validates correctness of adding error to afflicted
refs
* Autogenerated references aren't considered erroneous
(the extension to the legacy parser also generates them)
and are not suppressed when serializing because apparently
that's the behaviour Parsoid clients want. However, in
this patch we're marking up autogenerated references
*with group attributes* as errors (the legacy extension
doesn't generate them at all) and are choosing to suppress
them when serializing since we considered them an error
while parsing and don't want them to persist in the content.
Bug: T51538
Change-Id: Ia651b10449dc41c2cb439b33a361e8c8e482f502
The structure of this class changed a lot in If2fe5f5
(T239572). This was never reflected in the @covers tags. I
suggest to go with a trivial @covers for the entire class
and let PHPUnit figure it out. The alternative is to kind
of repeat many private implementation details, and this
doesn't feel right.
Change-Id: Ie414876489133ab9aca934c19a5e403cd339abf1
While, for the most part, content nested in refs will end up in the
references section and should be fine to maintain a linkback. However,
if content differs from a previously named ref (ie. !== cachedHtml), it
ends up being serialized and nested in data-mw, so is also unsafe.
Bug: T266294
Bug: T266356
Change-Id: Ia92f42e06353c411b986d0665cbe6338052555fa
In 47506af, serializing references content to be added to data-mw was
delayed until we were inserting the reference into the dom, to give us a
chance to mark up errors about not finding named ref content. After
which, the content is cleaned out of references to make room for list
items that belong there.
In 6bd0594, we noted that we can't store linkback from embeded content
since the content is released after serializing and won't be available
when it comes time to mark it up with errors.
Similarly, the linkbacks added to other groups from the references
content won't be available after inserting that reference and so we
shouldn't be holding on to them either. This means that we won't be
marking them up for the named error above but with the benefit that we
won't crash when trying to access these linkbacks from another group.
A test case is added which crashes trying to access a linkback from a
group that has already been inserted.
Note that the extension to the legacy parser considers referencing
another group from group content to be
"cite_error_references_group_mismatch". If we choose to also do that,
we may be able to reverse this change.
Change-Id: Idf0e49fa07dc3614068793c72a30ce3de1e2392c
* Note: the <ref name="1"> numeric is flagged as an error, but
<ref follow="1"> is not flagged as well as it would be duplicitous.
We are proposing that Numeric digits as a name be changed to
a warning as Parsoid Cite handles them properly without error.
Background: The core Cite extension started off not allowing numeric
names because the data structure they used made it inconvenient,
but Parsoid Cite has a separate index for named refs.
The core Cite devs thought there was potential for a conflict in
the ids if numeric names were allowed. The ids that core uses
follow the pattern:
for no name defined: cite_ref-1 cite_ref-2
for name="refname": cite_ref-refname_3-0 cite_ref-refname_3-1
so for name="1" at worst you might see ids like:
cite_ref-1 or cite_ref-1_2-0
so that does not produce conflicting IDs and isn't a concern,
and that is the pattern that Parsoid Cite uses.
* Error case of numeric in name and follow and two tests that
validate errors.
Bug: T51538
Change-Id: I95d725d0f77abadc1ddb2dd6939762b7d322e4f2
* Error case of dir= not ltr or rtl is caught and a cite tests
validates the new check.
Bug: T51538
Change-Id: I8c7e088416f0a000e638771a3fe5e8e0c58bcc23
Avoids crashers from trying to serialize the named content twice if
there's a valid follow with some other error, as expected in the FIXME.
This introduces a backwards incompatibility for invalid follows, which
will result in the contents of the ref being dropped.
A test is added to assert that selser will save us for the most part.
Change-Id: I1f572f996a7c2b3b852752f5348ebb60d8e21c47
* This patch introduces a preprocessing step on the edited DOM.
* Existing preprocessing code has been extracted into the
preprocessDOM method.
Any registered extensions preprocessors are invoked on the DOM.
So, this assumes that the htmlPreprocess extension listener is only
applicable to the edited DOM. If we want to expose the concept of
selective serialization through the API, we may want to add an
additional interface method / listener to the DOMProcessor class.
As of this patch, this is somewhat theoretical since there are no
such extension handlers registered on either DOM. Future patches
can clarify this better as specific needs arise.
* The handler also calls the serializer's custom preprocessing steps.
This step is applicable to both the original as well as edited DOM
(since DOM Diff is impacted by the results). If a need arises,
in the future, we may introduce a new extension DOM processor method
that applies to both original and edited DOMs.
* Right now, only selser strips section tags and non-selser wts
doesn't need to. So, preprocessDOM there is empty. Additional
selser-only DOM preprocessing will show up in later patches.
* Moved a stub HTML->WT preprocessor in Cite extension to RefProcessor.
Bug: T254501
Change-Id: I0c12afb2ea82617406d72ad872ac4f33678fa5f2
The description in T179082 suggests that by using one document for the
entire parse, we'd probably see some performance gains from not having
to import nodes when we get to the top level pipeline and we'd avoid the
validation errors from 19a9c3c.
However, the spec seems to suggest creating a new document when parsing
an HTML fragment,
https://html.spec.whatwg.org/#html-fragment-parsing-algorithm
And, indeed, domino implements it that way,
12a5f67136/lib/htmlelts.js (L84-L96)
So, the request in T217705 may be a little misguided.
What then is this patch good for? In T221790 the ask is that
sub-pipelines produce DocumentFragment which make for cleaner interfaces
and less confusion when migrating children.
The general outline here is that a document is created when the
environment is constructed that gives us the 1-1 correspondence.
Sub-pipelines do create their own documents for the purpose of tree
building, as in the fragment parsing algorithm, but are then immediately
imported to DocumentFragments to be used for the rest of the
post-processing passes.
Bug: T221790
Bug: T179082
Bug: T217705
Change-Id: Idf856d4e071d742ca38486c8ab402e39b3c8949f