Sanitizer is heavily used by extensions and we decided to let
extensions directly access it.
So, stop proxying those methods from ParsoidExtensionAPI.
Change-Id: I5ff285bf33733878135e2091d53ae12f7340c8fc
* Addresses a FIXME (T263052) where Parsoid Cite injects
style = "display: none;" in refs with follow instead of
having css do that triggered by having a class "mw-ref-follow"
as part of the refs html.
Bug: T263052
Depends-On: I351516b81566aba0adb4d298e39806dfb4fc7b03
Change-Id: I8bfc4ee3df162e2040e3c6f0c37fbf2a7c30d7f6
Follow up to 01cf61a
Numeric array keys are returned as integers.
echo "<references 2/>" | php bin/parse.php
Bug: T269748
Change-Id: I892753c330f95d258e0310626f109386fd020177
It's sufficient to handle this case in processRefs.
Also moves $referencesGroup to the ReferencesData instance, rather than
passing it around as a variable (inconsistently).
Change-Id: I8637e3ce644642259e353d0df3d9c0dbc3102c7b
Follow up to 6c15f6e where the same approach was taken in dom diff'ing.
Clarifies where the "id" is expected to point and the limitations of the
approach vis-a-vis embedded content.
For example,
<ref>hi ho</ref>
[[File:Test.png|<references />]]
won't roundtrip, and never did, because the references section the "id"
would point to is in embedded content.
This was really only ever about the case where the <ref> itself was
found in embedded content, like an image caption, and we wanted to find
a top level references section, like,
[[File:Test.png|<ref>hi ho</ref>]]
<references />
The one case old approach was ostensibly doing something smarter was if
both the references section and the ref were in the same embedded
content, as in,
[[File:Test.png|<ref>hi ho</ref><references />]]
However, at least for file captions, those were always serialized in a
fragment of the top level doc and suffer from same dropping as the first
example here. Maybe some other embedded content is handled differently,
in which case this is probably an acceptable regression.
Change-Id: Ia90eadcc5099a8c27f0bf3fda0ce2f0effca7bcc
It's a feature of named refs that we only know at the time of inserting
the references list whether they have content or not, and are therefore
in err. The strategy of 4438a72 was to keep pointers to all named ref
nodes so that if an error does occur, we can mark them up.
The problem with embedded content is that, at the time when we find out
about the errors, it's been serialized and stored, and so any pointers
we might have kept around are no longer live or relevant. We need to go
back and process all that embedded content again to find where the refs
with errors are hiding.
This patch slightly optimizes that by keeping a map of all the errors
for refs in embedded content so that only one pass is necessary, rather
than for each references list. Also note that, in the common case, this
pass won't run since we won't have any errors in embedded content.
Bug: T266356
Change-Id: I32e7bfa796cd4382c43b3b1d17b925dc97ce9f7f
* Generating error message in data-mw in affected refs
* Cite test validates correctness of adding error to afflicted
refs
* Autogenerated references aren't considered erroneous
(the extension to the legacy parser also generates them)
and are not suppressed when serializing because apparently
that's the behaviour Parsoid clients want. However, in
this patch we're marking up autogenerated references
*with group attributes* as errors (the legacy extension
doesn't generate them at all) and are choosing to suppress
them when serializing since we considered them an error
while parsing and don't want them to persist in the content.
Bug: T51538
Change-Id: Ia651b10449dc41c2cb439b33a361e8c8e482f502
While, for the most part, content nested in refs will end up in the
references section and should be fine to maintain a linkback. However,
if content differs from a previously named ref (ie. !== cachedHtml), it
ends up being serialized and nested in data-mw, so is also unsafe.
Bug: T266294
Bug: T266356
Change-Id: Ia92f42e06353c411b986d0665cbe6338052555fa
In 47506af, serializing references content to be added to data-mw was
delayed until we were inserting the reference into the dom, to give us a
chance to mark up errors about not finding named ref content. After
which, the content is cleaned out of references to make room for list
items that belong there.
In 6bd0594, we noted that we can't store linkback from embeded content
since the content is released after serializing and won't be available
when it comes time to mark it up with errors.
Similarly, the linkbacks added to other groups from the references
content won't be available after inserting that reference and so we
shouldn't be holding on to them either. This means that we won't be
marking them up for the named error above but with the benefit that we
won't crash when trying to access these linkbacks from another group.
A test case is added which crashes trying to access a linkback from a
group that has already been inserted.
Note that the extension to the legacy parser considers referencing
another group from group content to be
"cite_error_references_group_mismatch". If we choose to also do that,
we may be able to reverse this change.
Change-Id: Idf0e49fa07dc3614068793c72a30ce3de1e2392c
* Note: the <ref name="1"> numeric is flagged as an error, but
<ref follow="1"> is not flagged as well as it would be duplicitous.
We are proposing that Numeric digits as a name be changed to
a warning as Parsoid Cite handles them properly without error.
Background: The core Cite extension started off not allowing numeric
names because the data structure they used made it inconvenient,
but Parsoid Cite has a separate index for named refs.
The core Cite devs thought there was potential for a conflict in
the ids if numeric names were allowed. The ids that core uses
follow the pattern:
for no name defined: cite_ref-1 cite_ref-2
for name="refname": cite_ref-refname_3-0 cite_ref-refname_3-1
so for name="1" at worst you might see ids like:
cite_ref-1 or cite_ref-1_2-0
so that does not produce conflicting IDs and isn't a concern,
and that is the pattern that Parsoid Cite uses.
* Error case of numeric in name and follow and two tests that
validate errors.
Bug: T51538
Change-Id: I95d725d0f77abadc1ddb2dd6939762b7d322e4f2
* Error case of dir= not ltr or rtl is caught and a cite tests
validates the new check.
Bug: T51538
Change-Id: I8c7e088416f0a000e638771a3fe5e8e0c58bcc23
Avoids crashers from trying to serialize the named content twice if
there's a valid follow with some other error, as expected in the FIXME.
This introduces a backwards incompatibility for invalid follows, which
will result in the contents of the ref being dropped.
A test is added to assert that selser will save us for the most part.
Change-Id: I1f572f996a7c2b3b852752f5348ebb60d8e21c47
* This patch introduces a preprocessing step on the edited DOM.
* Existing preprocessing code has been extracted into the
preprocessDOM method.
Any registered extensions preprocessors are invoked on the DOM.
So, this assumes that the htmlPreprocess extension listener is only
applicable to the edited DOM. If we want to expose the concept of
selective serialization through the API, we may want to add an
additional interface method / listener to the DOMProcessor class.
As of this patch, this is somewhat theoretical since there are no
such extension handlers registered on either DOM. Future patches
can clarify this better as specific needs arise.
* The handler also calls the serializer's custom preprocessing steps.
This step is applicable to both the original as well as edited DOM
(since DOM Diff is impacted by the results). If a need arises,
in the future, we may introduce a new extension DOM processor method
that applies to both original and edited DOMs.
* Right now, only selser strips section tags and non-selser wts
doesn't need to. So, preprocessDOM there is empty. Additional
selser-only DOM preprocessing will show up in later patches.
* Moved a stub HTML->WT preprocessor in Cite extension to RefProcessor.
Bug: T254501
Change-Id: I0c12afb2ea82617406d72ad872ac4f33678fa5f2
The description in T179082 suggests that by using one document for the
entire parse, we'd probably see some performance gains from not having
to import nodes when we get to the top level pipeline and we'd avoid the
validation errors from 19a9c3c.
However, the spec seems to suggest creating a new document when parsing
an HTML fragment,
https://html.spec.whatwg.org/#html-fragment-parsing-algorithm
And, indeed, domino implements it that way,
12a5f67136/lib/htmlelts.js (L84-L96)
So, the request in T217705 may be a little misguided.
What then is this patch good for? In T221790 the ask is that
sub-pipelines produce DocumentFragment which make for cleaner interfaces
and less confusion when migrating children.
The general outline here is that a document is created when the
environment is constructed that gives us the 1-1 correspondence.
Sub-pipelines do create their own documents for the purpose of tree
building, as in the fragment parsing algorithm, but are then immediately
imported to DocumentFragments to be used for the rest of the
post-processing passes.
Bug: T221790
Bug: T179082
Bug: T217705
Change-Id: Idf856d4e071d742ca38486c8ab402e39b3c8949f
* Remove the id's from follow refs because they were
duplicating the same key value erroneously and also
did not provide useful info. Fixed all tests accordingly.
* Added FIXME which refers to a new Phab ticket about
removing the code which adds style = display-none
that will be moved to CSS at some point.
Bug: T262986
Change-Id: Ib59f5eec951aa83a02357de865df8ab3dd8d2f67
These refs get a `style="display: none;"` since they're
not intended to be user visible.
Follow refs with errors conform to the proposed spec in T251842
Bug: T51538
Change-Id: Ie4ea28e7f9afde24614874bb4b8e07c5cabafa12
* Interim state commit with experimental code.
* Updates to citeParserTests.txt to check now valid follow
functionality and newly passing tests.
* Added to follow refs, <sup style="display: none;" about=...
to suppress display of hidden sups needed for VE to use
in editing follow refs.
* Added code to implemented follow functionality and catch
invalid usage.
Bug: T51538
Change-Id: Ic3ac8237fd2c490cfaf2fe799759742f72f10686
* Bug fix for accessing undefined extsrc member variable in edge
case. See T260082 for deeper explanation of the WT that caused a
case where empty flag is not set and extsrc is also missing, but
since either case including extsrc being unset indicates no
content, this additional check is safe for now.
Bug: T259676
Change-Id: I20750c6977883668c83bdae78fbeb171f899e1ab
* The latter feels more readable and intuitive.
* The option defaults to true.
* To make for simpler code, I ensured that the option value is always
set before it is accessed.
* But, the typeof value still uses the "/sealed/" qualifier.
Alternatively, I could use "/packed/" if we want to adhere to the
config value name more closely.
* Tangentially related changes:
- made getWrapperTokens a private method since it is only used
internal to PipelineUtils.
- remove default value for $opts in encapsulateExpansionHTML since
the value is always passed in everywhere.
Change-Id: I86c4e5adf11e3151f51f2623e5ed85282a2e1298
This was done with a custom sniff in,
MediaWiki/Sniffs/Commenting/FunctionCommentSniff.php
`$singleType === 'null' && count( $explodedType ) === 2`
since there's some ambiguity with,
`what|type|null`
but also a case like the following is left out,
`string[]|null`
Change-Id: I1bd50a4486d7ef4974280b476fd03d3ee53232b3
Alternative to I6ea271a5d5c7b12a13bb12a682c39bcfd7b1f116
We can follow this up by passing the ExtensionTag to the
ExtensionTagHandler constructor.
Change-Id: I5b1b191bc85968ad617eb3ebcdd7721c55006af2
* Bug fixes to accurately match core cite use of
cite_error_ref_no_key error and adjusted citeTests to match.
Bug: T51538
Change-Id: I3ae5300a5f86decebb7e67c5ea57c0c15677fbcc