Implements mocking for the #tag parser function for ref tags.
The ref-in-ref linting cycles tests from 04aa4be can now be restored.
This shows that after I1b598bd359b900d1b89abf5d8105a5d131aea3d1, the
protection added in 04aa4be is no longer necessary, because we only lint
the content where it's defined.
Bug: T237463
Change-Id: I4059e32b9bea8cdc23d2112812c3f7e167e47399
Follow up to I1b598bd359b900d1b89abf5d8105a5d131aea3d1
This also lints the html stashed in data-mw of mw:Extension/ref, when
named references have redefinitions.
The fast fail for linting references with errors is removed since it's
no longer necessary after the work in T51538.
Bug: T214994
Change-Id: I2431b4782339a1ac41c49f7ca0ad3480c0b13bad
This change will fix the crasher from T301293, since all the necessary
information to locate the ref contents is contained in the first
encapsulation wrapper node, we therefore don't need to traverse into it
and potentially be tripped up by the node being closed early for having
content model violations.
By using the linkback id from the href, we're potentially linting the
content multiple times.
By using the id from the data-mw, we're only linting the content
specified by the specific ref (with the slight caveat that if two named
refs define exactly the same content, they share an id).
Note that if named refs have multiple definitions, and hence the content
ends up in data-mw, we aren't yet linting it, that's T214994.
However, by not using the linkback ids from the href, we'll need to
traverse the html that mw:Extension/References have stuffed in their
data-mw if we want to lint references defined in the references tags
themselves (see the commented out test). This has the benefit though of
not running into the issue described in the References::lintHandler
(ie. not having the right tplInfo while traversing the content).
Bug: T301293
Bug: T214994
Change-Id: I1b598bd359b900d1b89abf5d8105a5d131aea3d1
Content Translation can lead to Cite references to nodes that aren't present
in the translated document. For now, I've suppressed these log entries in
Parsoid's logstash dashboard. But, there is no reason to continue emitting
these at error level given the large volume of CX pages and hence a large
volume of these non-actionable log entries.
Change-Id: I8df7e722203d7b866d987d626215bcd53b945d60
Use @property to provide the types of undeclared variables to Phan and
PHPStorm, as in my NodeData patch. Declare $dp->tmp since it is
commonly used and does not affect the JSON serialized output since it is
always stripped.
I omitted the constructor, instead of following the suggestion in the
massageLoadedDataParsoid doc comment which proposed injesting a
JSON-like data structure in the constructor. I thought it would be more
efficient to have the initial property assignments inline in the calling
code. This means breaking up many object cast expressions into
individual assignments.
In IncludeOnly, the coalescing null operator was only handling the case
where $start->dataAttribs was unset, which seems unlikely. I made it so
that it checks whether $start->dataAttribs->tsr is unset.
I added strongly typed clone() methods, to preserve type information for
static analysis.
DataParsoid is the type of the data in both the DOM and in tokens. To
simplify the changes to the Token hierarchy, I removed the duplicate
definitions of the public properties $attribs and $dataAttribs.
Change-Id: I16172083e7e9bcb94601d1d6862d1d202a7e3660
* Most of these are remnants from the Parsoid/JS codebase.
* This change follows the pattern we've been using everywhere
since the port from JS->PHP.
* Also reduces instruction count by about 0.2%.
Change-Id: Ibf21104f6722c34299f03e303dc3401bf053a751
Review regex usage, and use an alternative where possible, to improve
performance.
* Add PHPUtils::stripPrefix() and PHPUtils::stripSuffix(). Benchmark in
doc comment.
* /foo/ -> str_contains()
* /^foo/ -> str_starts_with()
* /^f/ -> ($s[0] ?? '') === 'f'
* /foo$/ -> str_ends_with()
* /^(foo|bar)$/ -> in_array(), benchmark suggests 10x improvement
* preg_replace(/foo/) -> str_replace()
* preg_replace(/^[abc]/) -> strspn(), benchmark suggests 3x improvement.
Curiously, it is faster without a limit for short input strings,
although a limit presumably adds robustness.
* preg_replace(/[abc]+$/) -> rtrim()
* preg_match_all() -> substr_count()
* In DOMUtils::hasTypeOf(), use explode() instead of a regex. Validated
by a benchmark.
* In DOMUtils::addTypeOf(), stop normalizing adjacent spaces. This
allows us to use implode(explode()) without a filtering loop. The
patch to Ext/Cite/References.php was to remove spaces added by this
change. The parserTests.txt changes were a consequence of the
References.php change.
* In LinkHandlerUtils::getHref() I allowed a single bare slash to be
counted as a path-absolute URL since I think that was the intention of
the original code.
* In LinkHandlerUtils::getLinkRoundTripData() I captured the portion of
interest from the previoulinkHandlers regex instead of running a
second regex.
* LinkHandlerUtils::linkHandler() had the regex
/^mw:WikiLink|mw:MediaLink$/ which I think was a bug, missing
parentheses. I fixed the bug.
The margins are pretty tight for a lot of these. Using polyfills for
str_contains() etc. might change the conclusion.
Also:
* In DOMUtils::matchTypeOf(), avoid calling hasAttribute().
getAttribute() is documented as returning an empty string if the
attribute does not exist.
Change-Id: I8d7bdf1bccc869b4dc17058a5822ef34968471e6
Follow up to 47dd898
Also renames a variable to be consistent in the two places we get
contents for the ref.
Change-Id: I13e61b8911ff16549fbb0888b9c3313ed5e7701e
Follow up to 47dd898
Fixes the test case found in rt,
php bin/parse.php --domain ceb.wikipedia.org --pageName "Martin Van Buren" --offsetType ucs2 < /dev/null
The offsetType is necessary so that the ConvertOffsets pass runs. The
crasher here is because the embedded html still contains the sealed ref
fragments because we've stored the unprocessed html.
Change-Id: Ic1e1c3e54433bf6d7574420c2eade1349261de0b
Restores linkbacks for ref-in-ref.
Follow up to 568034a where it's noted that it's fine to maintain
linkbacks for ref-in-ref, as long as the ref isn't a named ref that's
trying to redefine the contents for that name, in which case we embed
the contents.
A test case for this can be,
```
<ref name="hiho">off to work</ref>
{{#tag:ref|<i>we go <ref name="ohno">ohno</ref></i>|name="hiho"}}
{{#tag:ref|<i>we go <ref name="ohno2">ohno2</ref></i>|name="test"}}
```
The linkback to #cite_ref-ohno2_3-0 is present while continuing to
suppress the dangling linkback to #cite_ref-ohno_2-0, since that's in
embedded content. On master, both linkbacks are unnecessarily
suppressed.
Bug: T289331
Change-Id: Ifcf7464e86a4408f5dd9e2fd6d3aa47a0670ca49
This will be helpful in a subsequent patch where we make use of that
data while processing refs in refs. Content differing implies that
we'll be embedding it for roundtripping, rather than putting in the dom.
Change-Id: I7bd1d4c503fc58f862960bec82ca514fc29d7eff
This moves determining if we already have a reference created for a
named ref outside of that function, which is helpful for making use of
the cached html for that ref earlier.
Change-Id: Ie416bd95b980f9f95111d7e420945f40e2ada747
The standard type for these returns is NodeList and HTMLCollection, which
are almost *but not quite* the same as an array. In two places we got a
little complacent and assumed our non-standard DOMCompat workarounds would
always return arrays. Tweaked the types of DOMCompat to report that they
return an `iterable`, which is a PHP7.1 "pseudo-type" that unifies
arrays and \Traversable types like HTMLCollection/NodeList. This
allows phan to catch places where we slip up and assume an array type
return.
It does introduce a new wrinkle, though, since there is no simple way
to turn an iterable into an array. We're using a simple
`iterable_to_array` helper function for this.
Change-Id: I35bdeb3afa30ef5182e71733a0a606aadcafb435
In PHP's DOM extension, one of the legacy bugs is that
DOMNode::getAttribute() can never return `null` (to indicate that the
attribute is missing), instead it returns an empty string in that
case. This isn't (modern) spec-compliant behavior (it's a leftover
from ancient times) and we had to watch this carefully when porting
from JS.
In the time since the port, we've written new code and embedded this
assumption that DOMNode::getAttribute() will never return null into
the new code we've written. Fix this. Always use `getAttribute(...)
?? ''` (unless we're just doing an equality test against a non-empty
string, or the code is preceded by a `hasAttribute` test) so that our
code will work whether or not getAttribute returns null for a missing
attribute.
Change-Id: If33200e1053b2dd79abb5dfb3808c05ff3a0bbba
Mostly comments along the lines of "{classname} constructor"
in the doc block for the __construct method.
Change-Id: I67ffe070985dc75a5d817b1b5ac97b529d7ab4b8
* Inlcudes test coverage for refnames with single and more than
one underbar in a row which are maintained as separate keys but
serialized without the multiple underbars
Bug: T267974
Change-Id: I9c21a6ff761f4b9a22b1185280b5676e2c160208
Back in the early days of Parsoid, we introduced rtTestMode so
we can suppress lots of noisy (but harmless) diffs in rt-testing
so we can isolate the harmful diffs that absolutely needed fixing.
This mode was critical to running large scale round-trip testing on
a large test corpus and let us get a lot of confidence in Parsoid's
ability to handle VisualEditors edits.
But, now that Parsoid is established and selective serialization is
also fairly robust, it is time to get rid of this mode altogether.
This mode was adding clutter to the codebase and was potentially confusing
in some cases. We won't lose our ability to identify regressions in
rt-testing since all we care about is semantic diff changes relative
to a baseline. We just end up with a lower-fidelity baseline.
Change-Id: I22a1b3ecf4e0224000f1df6a98cf7ea9bcb4ee4e
* Refnames such as 'a b' and 'a_b' are now kept seperate like
in Core Cite. Refnames with unicode whitespace characters
such as "a\u2028b' are handled as distinct refnames from 'a b'
and their ID's are sanitized appropriately to have underbars.
Bug: T267974
Change-Id: Ie06d1f2b8614dbdcf8572ed4647ec9093ef006d5
* Tweaked a few of the remaining js files to enable me to purge more
files.
* In another pass, we can consolidate most of the utils into a couple
rt-test helper classes and purge more.
* Verified that bin/roundtrip-test.js runs by using a parsoid/js service
installed via the deb package.
$ bin/roundtrip-test.js --parsoidURL http://localhost:8142 Hospet --outputContentVersion 2.1.0
* Haven't touched the package.* files
Change-Id: I79657250de3baaa09bda6be2f440ba5363b62eaf
This is the same fix as in 5e5e360 for T259676
The root of the issue is described in T260082
Bug: T271357
Bug: T260082
Change-Id: I7ccf0b20f6b0be0f31101a2c4a88010675dc72ba
Sanitizer is heavily used by extensions and we decided to let
extensions directly access it.
So, stop proxying those methods from ParsoidExtensionAPI.
Change-Id: I5ff285bf33733878135e2091d53ae12f7340c8fc
* Addresses a FIXME (T263052) where Parsoid Cite injects
style = "display: none;" in refs with follow instead of
having css do that triggered by having a class "mw-ref-follow"
as part of the refs html.
Bug: T263052
Depends-On: I351516b81566aba0adb4d298e39806dfb4fc7b03
Change-Id: I8bfc4ee3df162e2040e3c6f0c37fbf2a7c30d7f6
Follow up to 01cf61a
Numeric array keys are returned as integers.
echo "<references 2/>" | php bin/parse.php
Bug: T269748
Change-Id: I892753c330f95d258e0310626f109386fd020177
It's sufficient to handle this case in processRefs.
Also moves $referencesGroup to the ReferencesData instance, rather than
passing it around as a variable (inconsistently).
Change-Id: I8637e3ce644642259e353d0df3d9c0dbc3102c7b