Content Translation can lead to Cite references to nodes that aren't present
in the translated document. For now, I've suppressed these log entries in
Parsoid's logstash dashboard. But, there is no reason to continue emitting
these at error level given the large volume of CX pages and hence a large
volume of these non-actionable log entries.
Change-Id: I8df7e722203d7b866d987d626215bcd53b945d60
Use @property to provide the types of undeclared variables to Phan and
PHPStorm, as in my NodeData patch. Declare $dp->tmp since it is
commonly used and does not affect the JSON serialized output since it is
always stripped.
I omitted the constructor, instead of following the suggestion in the
massageLoadedDataParsoid doc comment which proposed injesting a
JSON-like data structure in the constructor. I thought it would be more
efficient to have the initial property assignments inline in the calling
code. This means breaking up many object cast expressions into
individual assignments.
In IncludeOnly, the coalescing null operator was only handling the case
where $start->dataAttribs was unset, which seems unlikely. I made it so
that it checks whether $start->dataAttribs->tsr is unset.
I added strongly typed clone() methods, to preserve type information for
static analysis.
DataParsoid is the type of the data in both the DOM and in tokens. To
simplify the changes to the Token hierarchy, I removed the duplicate
definitions of the public properties $attribs and $dataAttribs.
Change-Id: I16172083e7e9bcb94601d1d6862d1d202a7e3660
* Most of these are remnants from the Parsoid/JS codebase.
* This change follows the pattern we've been using everywhere
since the port from JS->PHP.
* Also reduces instruction count by about 0.2%.
Change-Id: Ibf21104f6722c34299f03e303dc3401bf053a751
Review regex usage, and use an alternative where possible, to improve
performance.
* Add PHPUtils::stripPrefix() and PHPUtils::stripSuffix(). Benchmark in
doc comment.
* /foo/ -> str_contains()
* /^foo/ -> str_starts_with()
* /^f/ -> ($s[0] ?? '') === 'f'
* /foo$/ -> str_ends_with()
* /^(foo|bar)$/ -> in_array(), benchmark suggests 10x improvement
* preg_replace(/foo/) -> str_replace()
* preg_replace(/^[abc]/) -> strspn(), benchmark suggests 3x improvement.
Curiously, it is faster without a limit for short input strings,
although a limit presumably adds robustness.
* preg_replace(/[abc]+$/) -> rtrim()
* preg_match_all() -> substr_count()
* In DOMUtils::hasTypeOf(), use explode() instead of a regex. Validated
by a benchmark.
* In DOMUtils::addTypeOf(), stop normalizing adjacent spaces. This
allows us to use implode(explode()) without a filtering loop. The
patch to Ext/Cite/References.php was to remove spaces added by this
change. The parserTests.txt changes were a consequence of the
References.php change.
* In LinkHandlerUtils::getHref() I allowed a single bare slash to be
counted as a path-absolute URL since I think that was the intention of
the original code.
* In LinkHandlerUtils::getLinkRoundTripData() I captured the portion of
interest from the previoulinkHandlers regex instead of running a
second regex.
* LinkHandlerUtils::linkHandler() had the regex
/^mw:WikiLink|mw:MediaLink$/ which I think was a bug, missing
parentheses. I fixed the bug.
The margins are pretty tight for a lot of these. Using polyfills for
str_contains() etc. might change the conclusion.
Also:
* In DOMUtils::matchTypeOf(), avoid calling hasAttribute().
getAttribute() is documented as returning an empty string if the
attribute does not exist.
Change-Id: I8d7bdf1bccc869b4dc17058a5822ef34968471e6
Follow up to 47dd898
Also renames a variable to be consistent in the two places we get
contents for the ref.
Change-Id: I13e61b8911ff16549fbb0888b9c3313ed5e7701e
Follow up to 47dd898
Fixes the test case found in rt,
php bin/parse.php --domain ceb.wikipedia.org --pageName "Martin Van Buren" --offsetType ucs2 < /dev/null
The offsetType is necessary so that the ConvertOffsets pass runs. The
crasher here is because the embedded html still contains the sealed ref
fragments because we've stored the unprocessed html.
Change-Id: Ic1e1c3e54433bf6d7574420c2eade1349261de0b
Restores linkbacks for ref-in-ref.
Follow up to 568034a where it's noted that it's fine to maintain
linkbacks for ref-in-ref, as long as the ref isn't a named ref that's
trying to redefine the contents for that name, in which case we embed
the contents.
A test case for this can be,
```
<ref name="hiho">off to work</ref>
{{#tag:ref|<i>we go <ref name="ohno">ohno</ref></i>|name="hiho"}}
{{#tag:ref|<i>we go <ref name="ohno2">ohno2</ref></i>|name="test"}}
```
The linkback to #cite_ref-ohno2_3-0 is present while continuing to
suppress the dangling linkback to #cite_ref-ohno_2-0, since that's in
embedded content. On master, both linkbacks are unnecessarily
suppressed.
Bug: T289331
Change-Id: Ifcf7464e86a4408f5dd9e2fd6d3aa47a0670ca49
This will be helpful in a subsequent patch where we make use of that
data while processing refs in refs. Content differing implies that
we'll be embedding it for roundtripping, rather than putting in the dom.
Change-Id: I7bd1d4c503fc58f862960bec82ca514fc29d7eff
This moves determining if we already have a reference created for a
named ref outside of that function, which is helpful for making use of
the cached html for that ref earlier.
Change-Id: Ie416bd95b980f9f95111d7e420945f40e2ada747
The standard type for these returns is NodeList and HTMLCollection, which
are almost *but not quite* the same as an array. In two places we got a
little complacent and assumed our non-standard DOMCompat workarounds would
always return arrays. Tweaked the types of DOMCompat to report that they
return an `iterable`, which is a PHP7.1 "pseudo-type" that unifies
arrays and \Traversable types like HTMLCollection/NodeList. This
allows phan to catch places where we slip up and assume an array type
return.
It does introduce a new wrinkle, though, since there is no simple way
to turn an iterable into an array. We're using a simple
`iterable_to_array` helper function for this.
Change-Id: I35bdeb3afa30ef5182e71733a0a606aadcafb435
In PHP's DOM extension, one of the legacy bugs is that
DOMNode::getAttribute() can never return `null` (to indicate that the
attribute is missing), instead it returns an empty string in that
case. This isn't (modern) spec-compliant behavior (it's a leftover
from ancient times) and we had to watch this carefully when porting
from JS.
In the time since the port, we've written new code and embedded this
assumption that DOMNode::getAttribute() will never return null into
the new code we've written. Fix this. Always use `getAttribute(...)
?? ''` (unless we're just doing an equality test against a non-empty
string, or the code is preceded by a `hasAttribute` test) so that our
code will work whether or not getAttribute returns null for a missing
attribute.
Change-Id: If33200e1053b2dd79abb5dfb3808c05ff3a0bbba
Mostly comments along the lines of "{classname} constructor"
in the doc block for the __construct method.
Change-Id: I67ffe070985dc75a5d817b1b5ac97b529d7ab4b8
* Inlcudes test coverage for refnames with single and more than
one underbar in a row which are maintained as separate keys but
serialized without the multiple underbars
Bug: T267974
Change-Id: I9c21a6ff761f4b9a22b1185280b5676e2c160208
Back in the early days of Parsoid, we introduced rtTestMode so
we can suppress lots of noisy (but harmless) diffs in rt-testing
so we can isolate the harmful diffs that absolutely needed fixing.
This mode was critical to running large scale round-trip testing on
a large test corpus and let us get a lot of confidence in Parsoid's
ability to handle VisualEditors edits.
But, now that Parsoid is established and selective serialization is
also fairly robust, it is time to get rid of this mode altogether.
This mode was adding clutter to the codebase and was potentially confusing
in some cases. We won't lose our ability to identify regressions in
rt-testing since all we care about is semantic diff changes relative
to a baseline. We just end up with a lower-fidelity baseline.
Change-Id: I22a1b3ecf4e0224000f1df6a98cf7ea9bcb4ee4e
* Refnames such as 'a b' and 'a_b' are now kept seperate like
in Core Cite. Refnames with unicode whitespace characters
such as "a\u2028b' are handled as distinct refnames from 'a b'
and their ID's are sanitized appropriately to have underbars.
Bug: T267974
Change-Id: Ie06d1f2b8614dbdcf8572ed4647ec9093ef006d5
This is the same fix as in 5e5e360 for T259676
The root of the issue is described in T260082
Bug: T271357
Bug: T260082
Change-Id: I7ccf0b20f6b0be0f31101a2c4a88010675dc72ba
Sanitizer is heavily used by extensions and we decided to let
extensions directly access it.
So, stop proxying those methods from ParsoidExtensionAPI.
Change-Id: I5ff285bf33733878135e2091d53ae12f7340c8fc
* Addresses a FIXME (T263052) where Parsoid Cite injects
style = "display: none;" in refs with follow instead of
having css do that triggered by having a class "mw-ref-follow"
as part of the refs html.
Bug: T263052
Depends-On: I351516b81566aba0adb4d298e39806dfb4fc7b03
Change-Id: I8bfc4ee3df162e2040e3c6f0c37fbf2a7c30d7f6
Follow up to 01cf61a
Numeric array keys are returned as integers.
echo "<references 2/>" | php bin/parse.php
Bug: T269748
Change-Id: I892753c330f95d258e0310626f109386fd020177
It's sufficient to handle this case in processRefs.
Also moves $referencesGroup to the ReferencesData instance, rather than
passing it around as a variable (inconsistently).
Change-Id: I8637e3ce644642259e353d0df3d9c0dbc3102c7b
Follow up to 6c15f6e where the same approach was taken in dom diff'ing.
Clarifies where the "id" is expected to point and the limitations of the
approach vis-a-vis embedded content.
For example,
<ref>hi ho</ref>
[[File:Test.png|<references />]]
won't roundtrip, and never did, because the references section the "id"
would point to is in embedded content.
This was really only ever about the case where the <ref> itself was
found in embedded content, like an image caption, and we wanted to find
a top level references section, like,
[[File:Test.png|<ref>hi ho</ref>]]
<references />
The one case old approach was ostensibly doing something smarter was if
both the references section and the ref were in the same embedded
content, as in,
[[File:Test.png|<ref>hi ho</ref><references />]]
However, at least for file captions, those were always serialized in a
fragment of the top level doc and suffer from same dropping as the first
example here. Maybe some other embedded content is handled differently,
in which case this is probably an acceptable regression.
Change-Id: Ia90eadcc5099a8c27f0bf3fda0ce2f0effca7bcc
It's a feature of named refs that we only know at the time of inserting
the references list whether they have content or not, and are therefore
in err. The strategy of 4438a72 was to keep pointers to all named ref
nodes so that if an error does occur, we can mark them up.
The problem with embedded content is that, at the time when we find out
about the errors, it's been serialized and stored, and so any pointers
we might have kept around are no longer live or relevant. We need to go
back and process all that embedded content again to find where the refs
with errors are hiding.
This patch slightly optimizes that by keeping a map of all the errors
for refs in embedded content so that only one pass is necessary, rather
than for each references list. Also note that, in the common case, this
pass won't run since we won't have any errors in embedded content.
Bug: T266356
Change-Id: I32e7bfa796cd4382c43b3b1d17b925dc97ce9f7f