153 commits

Author SHA1 Message Date
Arlo Breault e3c4e96c71 Prioritize body->html over body->id when diffing refs
Just to keep things consistent since that's the precedence we use when
serializing.

Change-Id: I1456b6a21ae050d58d15620e501a14c29a64f9e3
2022-04-28 14:51:05 -04:00
Subramanya Sastry c3f34b7360 Cite: Document the Parsoid-only responsive refs threshold
* This functionality comes from 74acc71e

Change-Id: I5d985ff75670136a27d52850ea0e41da52fa3c96
2022-04-14 12:32:18 +05:30
Subramanya Sastry 15a38973c7 html2wt: Use info level for unactionable Cite logspam caused by CX usage
Content Translation can lead to Cite references to nodes that aren't present
in the translated document. For now, I've suppressed these log entries in
Parsoid's logstash dashboard. But, there is no reason to continue emitting
these at error level given the large volume of CX pages and hence a large
volume of these non-actionable log entries.

Change-Id: I8df7e722203d7b866d987d626215bcd53b945d60
2022-03-02 20:07:35 +00:00
Subramanya Sastry 20e5117622 Minor tweak to function name in ParsoidExtensionAPI
Change-Id: Ic8fdfdc12b224460277d6a34247d20911526823a
2022-02-03 22:46:09 +00:00
Arlo Breault 2e4f69a492 Implement diffHandler for Cite extension
Bug: T214651
Change-Id: I64585cd89135887e095e3ab17d10c3c7d82af1c9
2021-11-16 20:33:24 +00:00
Tim Starling 0cc211d675 Make DataParsoid be a real class
Use @property to provide the types of undeclared variables to Phan and
PHPStorm, as in my NodeData patch. Declare $dp->tmp since it is
commonly used and does not affect the JSON serialized output since it is
always stripped.

I omitted the constructor, instead of following the suggestion in the
massageLoadedDataParsoid doc comment which proposed ingesting a
JSON-like data structure in the constructor. I thought it would be more
efficient to have the initial property assignments inline in the calling
code. This means breaking up many object cast expressions into
individual assignments.

In IncludeOnly, the null coalescing operator was only handling the case
where $start->dataAttribs was unset, which seems unlikely. I made it so
that it checks whether $start->dataAttribs->tsr is unset.

I added strongly typed clone() methods, to preserve type information for
static analysis.

DataParsoid is the type of the data in both the DOM and in tokens. To
simplify the changes to the Token hierarchy, I removed the duplicate
definitions of the public properties $attribs and $dataAttribs.

Change-Id: I16172083e7e9bcb94601d1d6862d1d202a7e3660
2021-10-13 10:20:15 +11:00
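The approach described in 0cc211d675 can be sketched roughly as follows. This is a minimal illustration, not the actual Parsoid class: the class name `DataParsoidSketch`, the array shape used for `tsr`, and the shallow `clone()` body are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical, trimmed-down stand-in for the class described above.
 *
 * @property array|null $tsr Undeclared property, documented only so that
 *   static analysis tools know its type (the array type is an assumption).
 */
#[\AllowDynamicProperties]
class DataParsoidSketch {
    /** @var object|null Declared explicitly because it is commonly used. */
    public $tmp;

    /** Strongly typed clone, so the static type survives copying (shallow here). */
    public function clone(): self {
        return clone $this;
    }
}

// Property assignments happen inline in the calling code; there is no constructor.
$dp = new DataParsoidSketch();
$dp->tsr = [0, 12];

$copy = $dp->clone();
$copy->tsr = [5, 9];

// Null coalescing on the nested property, not only on dataAttribs itself.
$start = new stdClass();
$start->dataAttribs = new DataParsoidSketch();
$tsr = $start->dataAttribs->tsr ?? null;
```

The `#[\AllowDynamicProperties]` line is only there to keep the sketch quiet on PHP 8.2 and later; on older versions it is parsed as a comment.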
Umherirrender c0ace40aaa build: Updating mediawiki/mediawiki-phan-config to 0.11.0
Change-Id: Ifb2eec4e791fd0de0a50d8ef85e0947ab9a891e7
2021-09-21 12:13:36 -05:00
Subramanya Sastry f7bc278673 DOMUtils: Get rid of isElt, isText, isComment helpers
* Most of these are remnants from the Parsoid/JS codebase.
* This change follows the pattern we've been using everywhere
  since the port from JS->PHP.
* Also reduces instruction count by about 0.2%.

Change-Id: Ibf21104f6722c34299f03e303dc3401bf053a751
2021-09-20 22:39:38 +00:00
Tim Starling 8f3369b090 Avoid using regexes
Review regex usage, and use an alternative where possible, to improve
performance.

* Add PHPUtils::stripPrefix() and PHPUtils::stripSuffix(). Benchmark in
  doc comment.
* /foo/ -> str_contains()
* /^foo/ -> str_starts_with()
* /^f/ -> ($s[0] ?? '') === 'f'
* /foo$/ -> str_ends_with()
* /^(foo|bar)$/ -> in_array(), benchmark suggests 10x improvement
* preg_replace(/foo/) -> str_replace()
* preg_replace(/^[abc]/) -> strspn(), benchmark suggests 3x improvement.
  Curiously, it is faster without a limit for short input strings,
  although a limit presumably adds robustness.
* preg_replace(/[abc]+$/) -> rtrim()
* preg_match_all() -> substr_count()
* In DOMUtils::hasTypeOf(), use explode() instead of a regex. Validated
  by a benchmark.
* In DOMUtils::addTypeOf(), stop normalizing adjacent spaces. This
  allows us to use implode(explode()) without a filtering loop. The
  patch to Ext/Cite/References.php was to remove spaces added by this
  change. The parserTests.txt changes were a consequence of the
  References.php change.
* In LinkHandlerUtils::getHref() I allowed a single bare slash to be
  counted as a path-absolute URL since I think that was the intention of
  the original code.
* In LinkHandlerUtils::getLinkRoundTripData() I captured the portion of
  interest from the previous regex instead of running a
  second regex.
* LinkHandlerUtils::linkHandler() had the regex
  /^mw:WikiLink|mw:MediaLink$/ which I think was a bug, missing
  parentheses. I fixed the bug.

The margins are pretty tight for a lot of these. Using polyfills for
str_contains() etc. might change the conclusion.

Also:

* In DOMUtils::matchTypeOf(), avoid calling hasAttribute().
  getAttribute() is documented as returning an empty string if the
  attribute does not exist.

Change-Id: I8d7bdf1bccc869b4dc17058a5822ef34968471e6
2021-09-13 23:01:45 +00:00
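The substitutions listed in 8f3369b090 can be illustrated with a short sketch. It assumes PHP 8.0 or later (or polyfills) for `str_contains()` and friends; the sample strings and the name `hasTypeOfSketch` are made up for the example, and the grouped regex is only one possible way to fix the precedence bug, not necessarily what the patch did.

```php
<?php
// Needs PHP 8.0+ (or polyfills) for str_contains(), str_starts_with(), str_ends_with().
$s = 'mw:WikiLink/Interwiki';

// /WikiLink/ -> str_contains()
$contains = str_contains($s, 'WikiLink');

// /^mw:/ -> str_starts_with()
$startsWith = str_starts_with($s, 'mw:');

// /Interwiki$/ -> str_ends_with(). In PCRE, "$" also matches just before a
// trailing newline, so the two differ on input that ends in "\n".
$endsWith = str_ends_with($s, 'Interwiki');

// /^m/ -> compare the first byte; ?? keeps this safe on the empty string
$firstIsM = ($s[0] ?? '') === 'm';

// /^(mw:WikiLink|mw:MediaLink)$/ -> in_array() with strict comparison
$isLink = in_array('mw:MediaLink', ['mw:WikiLink', 'mw:MediaLink'], true);

// preg_replace('/^[abc]/', '', $t) -> strspn() with a limit of one byte
$t = 'abcaXbc';
$withoutFirst = substr($t, strspn($t, 'abc', 0, 1));

// preg_replace('/[abc]+$/', '', ...) -> rtrim() with a character list
$withoutTail = rtrim('Xbcabc', 'abc');

// preg_match_all('/,/', ...) -> substr_count()
$commas = substr_count('a,b,,c', ',');

// Space-separated typeof check via explode(); the function name is hypothetical.
function hasTypeOfSketch(string $typeof, string $type): bool {
    return in_array($type, explode(' ', $typeof), true);
}

// Alternation binds looser than the anchors, so this pattern means
// (^mw:WikiLink) OR (mw:MediaLink$).
$buggy = '/^mw:WikiLink|mw:MediaLink$/';
// One possible fix: group the alternatives so both anchors apply to each.
$fixed = '/^(?:mw:WikiLink|mw:MediaLink)$/';
$buggyMatches = preg_match($buggy, $s) === 1;
$fixedMatches = preg_match($fixed, $s) === 1;
```

With the sample input, the ungrouped pattern accepts `mw:WikiLink/Interwiki` because the prefix alone satisfies the first alternative, while the grouped pattern rejects it.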
Arlo Breault 66355c1ddc Migrate out valid follow contents after processing refs
Follow up to 47dd898

Also renames a variable to be consistent in the two places we get
contents for the ref.

Change-Id: I13e61b8911ff16549fbb0888b9c3313ed5e7701e
2021-08-27 15:00:54 -04:00
Arlo Breault 5c7c37e0c9 Reserialize processed refs if content differs
Follow up to 47dd898

Fixes the test case found in rt,
php bin/parse.php --domain ceb.wikipedia.org --pageName "Martin Van Buren" --offsetType ucs2 < /dev/null

The offsetType is necessary so that the ConvertOffsets pass runs.  The
crasher here is because the embedded html still contains the sealed ref
fragments because we've stored the unprocessed html.

Change-Id: Ic1e1c3e54433bf6d7574420c2eade1349261de0b
2021-08-27 15:00:37 -04:00
Subramanya Sastry 0d26fd19d5 Cite: Rename functions pushing/popping embedded content flags
Change-Id: Ie8736fcc139caba467209b7ba57daaa8f53bc18a
2021-08-26 11:43:52 -05:00
Arlo Breault 47dd8989a7 Don't process ref-in-ref as embedded, unless content differs
Restores linkbacks for ref-in-ref.

Follow up to 568034a where it's noted that it's fine to maintain
linkbacks for ref-in-ref, as long as the ref isn't a named ref that's
trying to redefine the contents for that name, in which case we embed
the contents.

A test case for this can be,

```
<ref name="hiho">off to work</ref>
{{#tag:ref|<i>we go <ref name="ohno">ohno</ref></i>|name="hiho"}}
{{#tag:ref|<i>we go <ref name="ohno2">ohno2</ref></i>|name="test"}}
```

The linkback to #cite_ref-ohno2_3-0 is present while continuing to
suppress the dangling linkback to #cite_ref-ohno_2-0, since that's in
embedded content.  On master, both linkbacks are unnecessarily
suppressed.

Bug: T289331
Change-Id: Ifcf7464e86a4408f5dd9e2fd6d3aa47a0670ca49
2021-08-26 16:41:02 +00:00
Arlo Breault d0e1637d22 Move content differ check up higher
This will be helpful in a subsequent patch where we make use of that
data while processing refs in refs.  Content differing implies that
we'll be embedding it for roundtripping, rather than putting in the dom.

Change-Id: I7bd1d4c503fc58f862960bec82ca514fc29d7eff
2021-08-26 16:38:58 +00:00
Arlo Breault 50dfe518cc Only call ReferencesData::add when adding
This moves determining if we already have a reference created for a
named ref outside of that function, which is helpful for making use of
the cached html for that ref earlier.

Change-Id: Ie416bd95b980f9f95111d7e420945f40e2ada747
2021-08-26 16:37:36 +00:00
C. Scott Ananian 187de4b769 The ::querySelectorAll() and ::getElementsBy* helpers don't always return array
The standard type for these returns is NodeList and HTMLCollection, which
are almost *but not quite* the same as an array.  In two places we got a
little complacent and assumed our non-standard DOMCompat workarounds would
always return arrays.  Tweaked the types of DOMCompat to report that they
return an `iterable`, which is a PHP7.1 "pseudo-type" that unifies
arrays and \Traversable types like HTMLCollection/NodeList.  This
allows phan to catch places where we slip up and assume an array type
return.

It does introduce a new wrinkle, though, since there is no simple way
to turn an iterable into an array.  We're using a simple
`iterable_to_array` helper function for this.

Change-Id: I35bdeb3afa30ef5182e71733a0a606aadcafb435
2021-07-31 03:50:07 +00:00
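A helper of the kind mentioned in 187de4b769 might look like the sketch below. The name `iterableToArraySketch` and the choice to discard keys are assumptions for illustration; the real `iterable_to_array` helper may behave differently.

```php
<?php
/**
 * Hypothetical helper in the spirit of the one described above:
 * accept anything iterable and always hand back a plain list.
 */
function iterableToArraySketch(iterable $items): array {
    if (is_array($items)) {
        // Already an array; reindex so both branches return a list.
        return array_values($items);
    }
    // $items is a \Traversable here (for example a NodeList-like object).
    return iterator_to_array($items, false);
}

$fromArray = iterableToArraySketch(['x' => 'a', 'y' => 'b']);
$fromIterator = iterableToArraySketch(new ArrayIterator(['a', 'b']));
$fromGenerator = iterableToArraySketch((function () {
    yield 'a';
    yield 'b';
})());
```

Declaring the parameter as `iterable` is what lets a static analyzer flag callers that index into the result of a DOM query as if it were guaranteed to be an array.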
C. Scott Ananian a1d0fdd776 Allow Node::getAttribute() to return null
In PHP's DOM extension, one of the legacy bugs is that
DOMNode::getAttribute() can never return `null` (to indicate that the
attribute is missing), instead it returns an empty string in that
case.  This isn't (modern) spec-compliant behavior (it's a leftover
from ancient times) and we had to watch this carefully when porting
from JS.

In the time since the port, we've written new code and embedded this
assumption that DOMNode::getAttribute() will never return null into
the new code we've written.  Fix this.  Always use `getAttribute(...)
?? ''` (unless we're just doing an equality test against a non-empty
string, or the code is preceded by a `hasAttribute` test) so that our
code will work whether or not getAttribute returns null for a missing
attribute.

Change-Id: If33200e1053b2dd79abb5dfb3808c05ff3a0bbba
2021-07-30 20:34:47 +00:00
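The `getAttribute(...) ?? ''` pattern from a1d0fdd776 can be sketched with two stand-in element classes. Both classes and the `attr()` helper are hypothetical: `LegacyElementSketch` mimics the empty-string behavior the commit attributes to PHP's DOM extension, and `SpecElementSketch` mimics a spec-compliant implementation that returns null.

```php
<?php
// Stand-in for the legacy behavior: a missing attribute yields ''.
final class LegacyElementSketch {
    private array $attrs;
    public function __construct(array $attrs) {
        $this->attrs = $attrs;
    }
    public function getAttribute(string $name): string {
        return $this->attrs[$name] ?? '';
    }
}

// Stand-in for a spec-compliant DOM: a missing attribute yields null.
final class SpecElementSketch {
    private array $attrs;
    public function __construct(array $attrs) {
        $this->attrs = $attrs;
    }
    public function getAttribute(string $name): ?string {
        return $this->attrs[$name] ?? null;
    }
}

// Calling code that works the same against either implementation.
function attr($node, string $name): string {
    return $node->getAttribute($name) ?? '';
}

$attrs = ['typeof' => 'mw:Extension/ref'];
$legacy = new LegacyElementSketch($attrs);
$spec = new SpecElementSketch($attrs);

$legacyMissing = attr($legacy, 'about');
$specMissing = attr($spec, 'about');
$specPresent = attr($spec, 'typeof');
```

Code that must distinguish a missing attribute from an empty one still needs `hasAttribute()`; the coalescing form only guarantees a string.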
C. Scott Ananian fd3597cd39 Add class alias file to allow swapping in Dodo for DOMDocument
Change-Id: I56c10d2f4283e9e7b57bf722208fefab007cdf45
2021-07-23 12:20:06 -04:00
DannyS712 55cc7c2828 Remove documentation that repeats the code
Mostly comments along the lines of "{classname} constructor"
in the doc block for the __construct method.

Change-Id: I67ffe070985dc75a5d817b1b5ac97b529d7ab4b8
2021-06-02 09:57:36 +00:00
Aaron Piotrowski 81630c4267 Upgrade to mediawiki/mediawiki-codesniffer 36
Change-Id: I103a662d0af77cafa46cf6445e1580aabd005f31
2021-05-04 10:25:25 -05:00
Arlo Breault be829c15b0 Check for multiples doesn't apply to follows
Follow up to 7bd9f87

Bug: T276388
Change-Id: I68ab87702b967e870c432564b54d86bcbf914174
2021-03-03 18:07:17 -05:00
Arlo Breault e047ff7afc Refactor sanitization in a normalizeKey function
This matches the legacy parser extension.

Change-Id: Iecec58e793e4a7c0ecd3a139773f225484f4be8f
2021-01-12 00:04:43 +00:00
sbailey c3bc1f00b0 Contract multiple underbars in a row in refnames to a single underbar
* Includes test coverage for refnames with single and more than
   one underbar in a row which are maintained as separate keys but
   serialized without the multiple underbars

Bug: T267974
Change-Id: I9c21a6ff761f4b9a22b1185280b5676e2c160208
2021-01-11 23:14:11 +00:00
Subramanya Sastry 6ebe050750 Get rid of rtTestMode
Back in the early days of Parsoid, we introduced rtTestMode so
we can suppress lots of noisy (but harmless) diffs in rt-testing
so we can isolate the harmful diffs that absolutely needed fixing.
This mode was critical to running large-scale round-trip testing on
a large test corpus and let us get a lot of confidence in Parsoid's
ability to handle VisualEditor's edits.

But, now that Parsoid is established and selective serialization is
also fairly robust, it is time to get rid of this mode altogether.
This mode was adding clutter to the codebase and was potentially confusing
in some cases. We won't lose our ability to identify regressions in
rt-testing since all we care about is semantic diff changes relative
to a baseline. We just end up with a lower-fidelity baseline.

Change-Id: I22a1b3ecf4e0224000f1df6a98cf7ea9bcb4ee4e
2021-01-11 15:39:06 -06:00
sbailey 9679519b0a Fix for Parsoid Cite refname whitespace handling
* Refnames such as 'a b' and 'a_b' are now kept separate like
   in Core Cite. Refnames with Unicode whitespace characters
   such as 'a\u2028b' are handled as distinct refnames from 'a b'
   and their IDs are sanitized appropriately to have underbars.

Bug: T267974
Change-Id: Ie06d1f2b8614dbdcf8572ed4647ec9093ef006d5
2021-01-08 17:22:44 +00:00
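The refname behavior described in 9679519b0a and c3bc1f00b0 (distinct keys, underbar-sanitized IDs, runs of underbars contracted) can be sketched as below. The function `sanitizeRefIdSketch` and its regex are assumptions for illustration only; this is not the extension's actual `normalizeKey` code, and real IDs carry additional prefixes and counters.

```php
<?php
// Hypothetical helper: turn a refname into an ID fragment by replacing each
// run of whitespace, Unicode separators, or underbars with a single underbar.
function sanitizeRefIdSketch(string $refName): string {
    // \p{Z} covers Unicode separators such as U+2028; fall back to the
    // input if preg_replace() fails (for example on invalid UTF-8).
    return preg_replace('/[\s\p{Z}_]+/u', '_', $refName) ?? $refName;
}

// The raw refname stays the lookup key, so these remain four separate refs.
$refs = [];
foreach (['a b', 'a_b', 'a__b', "a\u{2028}b"] as $name) {
    $refs[$name] = sanitizeRefIdSketch($name);
}

$distinctKeys = count($refs);
$distinctIds = array_values(array_unique($refs));
```

In this sketch all four names map to the same fragment, which is why the key used for lookup has to be kept apart from the serialized ID.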
Arlo Breault e8d8481f60 More papering over in References.php
This is the same fix as in 5e5e360 for T259676

The root of the issue is described in T260082

Bug: T271357
Bug: T260082
Change-Id: I7ccf0b20f6b0be0f31101a2c4a88010675dc72ba
2021-01-06 18:53:55 -05:00
sbailey 394015a38b Add ref/follow name to Cite error cite_error_references_missing_key
Bug: T51538
Change-Id: Id19a4e4c37169ca6eb7aecdce66b1662546ae31a
2020-12-21 18:08:21 -08:00
sbailey 511543e3f1 Add refname parameter to cite_error_empty_references_define error
Bug: T51538
Change-Id: I2850b7f181f44465437bc486bc544c5cd58aa5e3
2020-12-21 13:31:37 -08:00
sbailey b9b10a3fe0 Add group name to Cite error cite_error_references_group_mismatch
Bug: T51538
Change-Id: Ie6e04edcdf4b9760711ec53021d65970691a3813
2020-12-18 22:16:28 +00:00
sbailey 7bd9f87157 Add parameter $refName to Cite error cite_error_ref_duplicate_key
Bug: T51538
Change-Id: If8399be12a5cad025b3a4db8e970c8de96c75ad6
2020-12-16 13:58:45 -08:00
sbailey 5fbf890f12 Add direction parameter to cite_error_ref_invalid_dir message
Bug: T51538
Change-Id: I5e964ad7341a46552d7b8eded0d844c0132816b1
2020-12-16 20:24:57 +00:00
sbailey cf4a49ba6e Add group name parameter to cite_error_group_refs_without_references
Bug: T51538
Change-Id: I8708ffa21c2ef68c124a5b055a6860cfb4ec12e1
2020-12-16 20:19:33 +00:00
Arlo Breault d95a783cc8 Stop referring to spec version numbers where unnecessary
Presumably the source should be up-to-date with the latest spec.

Change-Id: Iaea7f80e9d3bbd3520a7b499252162240deeba62
2020-12-16 13:55:27 -05:00
Subramanya Sastry 07bcfd9add Purge Sanitizer proxying from ParsoidExtensionAPI
Sanitizer is heavily used by extensions and we decided to let
extensions directly access it.

So, stop proxying those methods from ParsoidExtensionAPI.

Change-Id: I5ff285bf33733878135e2091d53ae12f7340c8fc
2020-12-10 16:54:30 +00:00
sbailey e0322afd84 Parsoid Cite add class mw-ref-follow for refs with follow
* Addresses a FIXME (T263052) where Parsoid Cite injects
   style = "display: none;" in refs with follow instead of
   having css do that triggered by having a class "mw-ref-follow"
   as part of the refs html.

Bug: T263052
Depends-On: I351516b81566aba0adb4d298e39806dfb4fc7b03
Change-Id: I8bfc4ee3df162e2040e3c6f0c37fbf2a7c30d7f6
2020-12-10 16:54:25 +00:00
Arlo Breault 8d4543954f Cast references attributes to strings
Follow up to 01cf61a

Numeric array keys are returned as integers.

echo "<references 2/>" | php bin/parse.php

Bug: T269748
Change-Id: I892753c330f95d258e0310626f109386fd020177
2020-12-09 16:05:12 +00:00
Arlo Breault 3c15454851 Refine adding module(style)?s in extapi
Bug: T269022
Change-Id: Ic2c56c554934ced2aea04317d988098ca840076f
2020-11-30 17:15:27 -05:00
Arlo Breault 6525d69200 Reconcile some ref errors cases with $hasFollow
Change-Id: I5e3a27366f177af6c221d57da6e31f28cc91bb0c
2020-11-25 13:51:37 -05:00
sbailey de5d806335 Cite error tag name defined in references not used before
Bug: T51538
Change-Id: Id89b3cc186de42e5e5c05f15d7546db9d64ec864
2020-11-25 13:50:25 -05:00
sbailey 703dc8dc05 Adding cite error ref in reference with mismatched group
Bug: T51538
Change-Id: I5492dbaebb7bca79e83be09fdcfe810eaef8c053
2020-11-24 17:22:56 +00:00
Arlo Breault 680df4379c Use inReferencesContent flag to get rid of processRefsInReferences
It's sufficient to handle this case in processRefs.

Also moves $referencesGroup to the ReferencesData instance, rather than
passing it around as a variable (inconsistently).

Change-Id: I8637e3ce644642259e353d0df3d9c0dbc3102c7b
2020-11-24 17:22:01 +00:00
Arlo Breault b88f9ca881 Fix porting bug from 005176a
Bug: T249742
Change-Id: Iabe86266c06b2cbc3c51b16b73d360a7182878f1
2020-11-24 10:54:23 -05:00
sbailey 4c7108f553 Adding cite error ref in reference no content defined
Bug: T51538
Change-Id: I4cdcf1a36f472f582812dbb5e7050c0ead614639
2020-11-23 18:32:26 -05:00
sbailey 1f0221e327 Add reporting of cite error of a ref in reference without name specified
Bug: T51538
Change-Id: I193d5583b31be32741088fb25c348878f34b5016
2020-11-23 23:30:14 +00:00
Arlo Breault e3ca32c9ff Add method to check if in references content
More specific than just embedded content, needed for adding errors in
follow up patches.

Change-Id: I4bf659cd208c3322870e3ea0126bda4a2a7037d8
2020-11-23 18:53:03 +00:00
Arlo Breault b664b64fcb Use $extApi->pushError for invalid references parameters
Follow up to 01cf61a

Change-Id: Ic4483f151d12352cc9e6f6094e4df442eabca376
2020-11-16 22:19:54 +00:00
Arlo Breault 2f1bbc1804 Only look for data-mw.body.id in the top level dom
Follow up to 6c15f6e where the same approach was taken in dom diff'ing.

Clarifies where the "id" is expected to point and the limitations of the
approach vis-a-vis embedded content.

For example,

<ref>hi ho</ref>
[[File:Test.png|<references />]]

won't roundtrip, and never did, because the references section the "id"
would point to is in embedded content.

This was really only ever about the case where the <ref> itself was
found in embedded content, like an image caption, and we wanted to find
a top level references section, like,

[[File:Test.png|<ref>hi ho</ref>]]
<references />

The one case where the old approach was ostensibly doing something
smarter was if both the references section and the ref were in the same
embedded content, as in,

[[File:Test.png|<ref>hi ho</ref><references />]]

However, at least for file captions, those were always serialized in a
fragment of the top level doc and suffer from the same dropping as the
first example here.  Maybe some other embedded content is handled
differently, in which case this is probably an acceptable regression.

Change-Id: Ia90eadcc5099a8c27f0bf3fda0ce2f0effca7bcc
2020-11-10 21:56:16 +00:00
sbailey 01cf61ad67 Adding check for illegal attributes in references tag
Bug: T51538
Change-Id: I7dbc577a61abb660d2bdb66ead0d7b71fd66cf47
2020-11-10 19:47:04 +00:00
Arlo Breault 4310b6a243 Mark up cite errors in embedded content
It's a feature of named refs that we only know at the time of inserting
the references list whether they have content or not, and are therefore
in err.  The strategy of 4438a72 was to keep pointers to all named ref
nodes so that if an error does occur, we can mark them up.

The problem with embedded content is that, at the time when we find out
about the errors, it's been serialized and stored, and so any pointers
we might have kept around are no longer live or relevant.  We need to go
back and process all that embedded content again to find where the refs
with errors are hiding.

This patch slightly optimizes that by keeping a map of all the errors
for refs in embedded content so that only one pass is necessary, rather
than for each references list.  Also note that, in the common case, this
pass won't run since we won't have any errors in embedded content.

Bug: T266356
Change-Id: I32e7bfa796cd4382c43b3b1d17b925dc97ce9f7f
2020-11-06 18:31:26 -05:00
Arlo Breault b2e2732674 Switch some uses of matchTypeOf to hasTypeOf in Cite
Change-Id: I99986c337944547ae398851676de13377f4114b1
2020-11-06 13:14:04 -05:00