Commit graph

385 commits

Author SHA1 Message Date
Arlo Breault 128b758d13 Omit wrapperTag in references call to extTagToDOM
The contents of the domfragment gets migrated to an ol element and
then a query for sups in the ol makes up the nestedRefsHTML string.

Follows-Up: I44abd4f13254bc4512caad85244d523ac19140e7
Change-Id: Ie29d9b1ea7b34121587e77c8693dba79ed166ee1
2023-07-11 15:03:25 -04:00
Subramanya Sastry 5a77aa959b ParsoidExtensionAPI: Add method to process HTML embedded in attributes
* Parsoid core cannot know anything about how extensions represent
  information in their DOM. We were missing an extension API method
  and config that lets extensions specify this information along with
  handlers to process embedded content.

* This patch fills that above gap.

* There is a FIXME in ContentUtils::shiftDSR which will be addressed
  in followup patches with this new functionality.

Change-Id: If921ca471f25f21671a60c3f796fdf145267364d
2023-06-10 21:06:54 +05:30
Subramanya Sastry 1e61376d59 ParsoidExtensionAPI: Rename Extension API method
* This method isn't used anywhere else outside (per codesearch)
  so being bold and renaming without any public notification.

Change-Id: I0c4ec7c4b2afb1d0d3cd1a932e0384bd9626799e
2023-06-08 18:55:51 +05:30
Arlo Breault c37977aad2 Assume strings for dataMw->body properties in ref handlers
The diffHandler already has this expectation.

Bug: T337036
Change-Id: I0c3a3532c655df6777407a1e0af2d9bac9495037
2023-05-19 13:28:41 -04:00
Subramanya Sastry fc240841c4 ParsoidExtensionAPI: Simplify extTagToDOM
* leadingWS is always '' in all the uses so far and it originally
  came from Cite when we first extracted the extension API.

* It is unnecessarily confusing and serves no real purpose.

Change-Id: I5f2f0f123dc6938735c09b96a42c1923b0a49085
2023-03-28 15:44:06 -05:00
Subramanya Sastry a9260875ac SiteConfig: Transition away from hardcoded Cite info
* This builds on the core patch that adds this new method
  to core's SiteConfig class.

Bug: T268777
Depends-On: Ia489cd880de3df73d7cd47d1eb01971f2655effe
Change-Id: Ifd41e4275a9344474e48d013e1116b028c7775d7
2023-02-20 17:56:02 -05:00
Arlo Breault 8958246add Fix multiple ref nestings from the tag parser function
The algorithm in the grammar is fixed up and the restriction to one
level of nesting is lifted.  It was overlooked that parser functions can
be nested.  Some of the history of ref-in-ref can be found in 03bf14b0
and d634925.

Bug: T326521
Change-Id: I8e442d41f68133c1b4f16556fedaff6106e233fb
2023-01-25 01:17:09 +00:00
Isabelle Hurbain-Palatin 57fb5617a9 Fixes data-id for "follow" refs
"follow" refs are cloned, which can lead to issues with their
data-objects conflicting with the initial node: if
additional, possibly out-of-order processing is happening, we may end up
accessing/modifying the data bag. In particular, we considered adding a
call to the DOMNormalizer when serializing the reference bodies, and
this was triggering an exception when accessing the DOMDiff of the
cloned node. We are not considering this right now anymore, but cloning
the NodeData may avoid issues of this type in the future.

This patch also introduces a utility to do the clone+clone data bag in a
single method call.

Change-Id: Iccdae82dec81d488433981d764bea539609497eb
2022-11-25 17:41:33 +00:00
C. Scott Ananian c24f5a18af Turn on phan warnings about the use of deprecated methods
MWParsoid\ParsoidServices depends on the deprecated
\MediaWiki\Parser\Parsoid\ParsoidServices in core; it is not needed
any more in MW 1.40 so it has been removed.

Change-Id: I2c4fd2e9ea2b5e2751074f1b9705e46436a32131
2022-11-04 12:36:24 -04:00
Isabelle Hurbain-Palatin 24b5204c28 Add 'outputHasCoreMwDomSpecMarkup' as option in extension registration
The ''outputHasCoreMwDomSpecMarkup' option in the extension registration
indicates whether the output of an extension complies to the MediaWiki
DOM Spec and should be processed as such.

The initial goal of this setting is to enable the DOMNormalizer pass for
Parsoid HTML generated in extensions, as needed by the implementation of
T309024.

We setup "Cite" as an example of configuration.

Bug: T309024
Change-Id: I26182c67ea17a4aa59414856d5f4614f2cd06c46
2022-09-29 13:25:53 +00:00
Isabelle Hurbain-Palatin 51189f14ac Allow multivalues in rel attributes
The 'rel' attribute can be used for multiple values in the same way that
the 'typeof' attribute can. In this patch, we allow for that by
providing methods to handle these multivalues and by refactoring the
code that would only allow for single values.

Bug: T186241
Change-Id: I625b121ae77c06c519c3b578583d41dc220e3294
2022-07-19 18:46:50 +00:00
Arlo Breault 7537d185eb Improve comment around serializing autogenerated references
Follow up to Ia651b10449dc41c2cb439b33a361e8c8e482f502

Change-Id: I0a72e7fbf90a78d04144d66103cbe013e41c5163
2022-06-13 14:26:00 -04:00
Subramanya Sastry 74aa1ea647 Cite tests: Update html/parsoid sections with output in integrated mode
These changes are a result of adding Parsoid integrated testing support
in core.

* Some tests got new html/parsoid sections.
* Some tests got their html/parsoid output upgraded to remove bogus HTML.
* Since we are not treating the test runs to be in integrated mode,
  I removed some comments that only pertained to a test run in
  standalone mode.
* There are 8 tests that are failing in integrated mode as well and
  I've tagged 5 known ones with T307741. The other 3 failing ones are
  the responsive reference tests that have a threshold value set.
  Those failures will need investigation and require additional changes
  to the core parser test runner to pass through the config to Parsoid.

Change-Id: I370d57d45cf126f71b3666fb493a887faf6b8e0d
2022-05-11 17:40:46 +00:00
Isabelle Hurbain-Palatin 28d6a05c35 Hoisting references outside of links
When a reference is inserted in a link, Parsoid generates nested links,
which would break browser rendering. In some cases where the generated
HTML is reprocessed in later stages of Parsoid (as is the case in
T301293), it can lead to internal breakages.

In this patch, we propose to hoist references markers (and their <sup>
tag) included in links outside of that link, as next sibling to the link.
It avoids the vast majority of link nesting due to that situation, with
the following side effects:
* the <sup> tagging is now maintained around the marker tag,
* if a ref tag is in the middle of a link, it gets moved to the end of
the link (in the legacy parser, it stays where it is, but the end of the
link gets de-linked).

The selser test failures are consistent with the expected behaviour - wt2wt
does not round-trip correctly, leading to selser failures.

Bug: T301293
Change-Id: Ia39483c2112b1356e14a310fbb48baed946b5caa
2022-05-11 13:45:59 +00:00
Arlo Breault 6d9afe15dd Resolve linter test port-fixme
Implements mocking for the #tag parser function for ref tags.

The ref-in-ref linting cycles tests from 04aa4be can now be restored.

This shows that after I1b598bd359b900d1b89abf5d8105a5d131aea3d1, the
protection added in 04aa4be is no longer necessary, because we only lint
the content where it's defined.

Bug: T237463
Change-Id: I4059e32b9bea8cdc23d2112812c3f7e167e47399
2022-05-10 17:07:08 -04:00
Arlo Breault 18889348c9 Lint html stashed in data-mw of mw:Extension/references
Follow up to I1b598bd359b900d1b89abf5d8105a5d131aea3d1

This also lints the html stashed in data-mw of mw:Extension/ref, when
named references have redefinitions.

The fast fail for linting references with errors is removed since it's
no longer necessary after the work in T51538.

Bug: T214994
Change-Id: I2431b4782339a1ac41c49f7ca0ad3480c0b13bad
2022-05-02 23:37:42 +00:00
Arlo Breault 9fa2aeaba3 Only lint content defined by a specific ref
This change will fix the crasher from T301293, since all the necessary
information to locate the ref contents is contained in the first
encapsulation wrapper node, we therefore don't need to traverse into it
and potentially be tripped up by the node being closed early for having
content model violations.

By using the linkback id from the href, we're potentially linting the
content multiple times.

By using the id from the data-mw, we're only linting the content
specified by the specific ref (with the slight caveat that if two named
refs define exactly the same content, they share an id).

Note that if named refs have multiple definitions, and hence the content
ends up in data-mw, we aren't yet linting it, that's T214994.

However, by not using the linkback ids from the href, we'll need to
traverse the html that mw:Extension/References have stuffed in their
data-mw if we want to lint references defined in the references tags
themselves (see the commented out test).  This has the benefit though of
not running into the issue described in the References::lintHandler
(ie. not having the right tplInfo while traversing the content).

Bug: T301293
Bug: T214994
Change-Id: I1b598bd359b900d1b89abf5d8105a5d131aea3d1
2022-05-02 23:37:39 +00:00
Arlo Breault e3c4e96c71 Prioritize body->html over body->id when diffing refs
Just to keep things consistent since that's the precedence we use when
serializing.

Change-Id: I1456b6a21ae050d58d15620e501a14c29a64f9e3
2022-04-28 14:51:05 -04:00
Subramanya Sastry c3f34b7360 Cite: Document the Parsoid-only responsive refs threshold
* This functionality comes from 74acc71e

Change-Id: I5d985ff75670136a27d52850ea0e41da52fa3c96
2022-04-14 12:32:18 +05:30
Subramanya Sastry 15a38973c7 html2wt: Use info level for unactionable Cite logspam caused by CX usage
Content Translation can lead to Cite references to nodes that aren't present
in the translated document. For now, I've suppressed these log entries in
Parsoid's logstash dashboard. But, there is no reason to continue emitting
these at error level given the large volume of CX pages and hence a large
volume of these non-actionable log entries.

Change-Id: I8df7e722203d7b866d987d626215bcd53b945d60
2022-03-02 20:07:35 +00:00
Subramanya Sastry 20e5117622 Minor tweak to function name in ParsoidExtensionAPI
Change-Id: Ic8fdfdc12b224460277d6a34247d20911526823a
2022-02-03 22:46:09 +00:00
Arlo Breault 2e4f69a492 Implement diffHandler for Cite extension
Bug: T214651
Change-Id: I64585cd89135887e095e3ab17d10c3c7d82af1c9
2021-11-16 20:33:24 +00:00
Tim Starling 0cc211d675 Make DataParsoid be a real class
Use @property to provide the types of undeclared variables to Phan and
PHPStorm, as in my NodeData patch. Declare $dp->tmp since it is
commonly used and does not affect the JSON serialized output since it is
always stripped.

I omitted the constructor, instead of following the suggestion in the
massageLoadedDataParsoid doc comment which proposed injesting a
JSON-like data structure in the constructor. I thought it would be more
efficient to have the initial property assignments inline in the calling
code. This means breaking up many object cast expressions into
individual assignments.

In IncludeOnly, the coalescing null operator was only handling the case
where $start->dataAttribs was unset, which seems unlikely. I made it so
that it checks whether $start->dataAttribs->tsr is unset.

I added strongly typed clone() methods, to preserve type information for
static analysis.

DataParsoid is the type of the data in both the DOM and in tokens. To
simplify the changes to the Token hierarchy, I removed the duplicate
definitions of the public properties $attribs and $dataAttribs.

Change-Id: I16172083e7e9bcb94601d1d6862d1d202a7e3660
2021-10-13 10:20:15 +11:00
Umherirrender c0ace40aaa build: Updating mediawiki/mediawiki-phan-config to 0.11.0
Change-Id: Ifb2eec4e791fd0de0a50d8ef85e0947ab9a891e7
2021-09-21 12:13:36 -05:00
Subramanya Sastry f7bc278673 DOMUtils: Get rid of isElt, isText, isComment helpers
* Most of these are remnants from the Parsoid/JS codebase.
* This change follows the pattern we've been using everywhere
  since the port from JS->PHP.
* Also reduces instruction count by about 0.2%.

Change-Id: Ibf21104f6722c34299f03e303dc3401bf053a751
2021-09-20 22:39:38 +00:00
Tim Starling 8f3369b090 Avoid using regexes
Review regex usage, and use an alternative where possible, to improve
performance.

* Add PHPUtils::stripPrefix() and PHPUtils::stripSuffix(). Benchmark in
  doc comment.
* /foo/ -> str_contains()
* /^foo/ -> str_starts_with()
* /^f/ -> ($s[0] ?? '') === 'f'
* /foo$/ -> str_ends_with()
* /^(foo|bar)$/ -> in_array(), benchmark suggests 10x improvement
* preg_replace(/foo/) -> str_replace()
* preg_replace(/^[abc]/) -> strspn(), benchmark suggests 3x improvement.
  Curiously, it is faster without a limit for short input strings,
  although a limit presumably adds robustness.
* preg_replace(/[abc]+$/) -> rtrim()
* preg_match_all() -> substr_count()
* In DOMUtils::hasTypeOf(), use explode() instead of a regex. Validated
  by a benchmark.
* In DOMUtils::addTypeOf(), stop normalizing adjacent spaces. This
  allows us to use implode(explode()) without a filtering loop. The
  patch to Ext/Cite/References.php was to remove spaces added by this
  change. The parserTests.txt changes were a consequence of the
  References.php change.
* In LinkHandlerUtils::getHref() I allowed a single bare slash to be
  counted as a path-absolute URL since I think that was the intention of
  the original code.
* In LinkHandlerUtils::getLinkRoundTripData() I captured the portion of
  interest from the previoulinkHandlers regex instead of running a
  second regex.
* LinkHandlerUtils::linkHandler() had the regex
  /^mw:WikiLink|mw:MediaLink$/ which I think was a bug, missing
  parentheses. I fixed the bug.

The margins are pretty tight for a lot of these. Using polyfills for
str_contains() etc. might change the conclusion.

Also:

* In DOMUtils::matchTypeOf(), avoid calling hasAttribute().
  getAttribute() is documented as returning an empty string if the
  attribute does not exist.

Change-Id: I8d7bdf1bccc869b4dc17058a5822ef34968471e6
2021-09-13 23:01:45 +00:00
Arlo Breault 66355c1ddc Migrate out valid follow contents after processing refs
Follow up to 47dd898

Also renames a variable to be consistent in the two places we get
contents for the ref.

Change-Id: I13e61b8911ff16549fbb0888b9c3313ed5e7701e
2021-08-27 15:00:54 -04:00
Arlo Breault 5c7c37e0c9 Reserialize processed refs if content differs
Follow up to 47dd898

Fixes the test case found in rt,
php bin/parse.php --domain ceb.wikipedia.org --pageName "Martin Van Buren" --offsetType ucs2 < /dev/null

The offsetType is necessary so that the ConvertOffsets pass runs.  The
crasher here is because the embedded html still contains the sealed ref
fragments because we've stored the unprocessed html.

Change-Id: Ic1e1c3e54433bf6d7574420c2eade1349261de0b
2021-08-27 15:00:37 -04:00
Subramanya Sastry 0d26fd19d5 Cite: Rename functions pushing/popping embedded content flags
Change-Id: Ie8736fcc139caba467209b7ba57daaa8f53bc18a
2021-08-26 11:43:52 -05:00
Arlo Breault 47dd8989a7 Don't process ref-in-ref as embedded, unless content differs
Restores linkbacks for ref-in-ref.

Follow up to 568034a where it's noted that it's fine to maintain
linkbacks for ref-in-ref, as long as the ref isn't a named ref that's
trying to redefine the contents for that name, in which case we embed
the contents.

A test case for this can be,

```
<ref name="hiho">off to work</ref>
{{#tag:ref|<i>we go <ref name="ohno">ohno</ref></i>|name="hiho"}}
{{#tag:ref|<i>we go <ref name="ohno2">ohno2</ref></i>|name="test"}}
```

The linkback to #cite_ref-ohno2_3-0 is present while continuing to
suppress the dangling linkback to #cite_ref-ohno_2-0, since that's in
embedded content.  On master, both linkbacks are unnecessarily
suppressed.

Bug: T289331
Change-Id: Ifcf7464e86a4408f5dd9e2fd6d3aa47a0670ca49
2021-08-26 16:41:02 +00:00
Arlo Breault d0e1637d22 Move content differ check up higher
This will be helpful in a subsequent patch where we make use of that
data while processing refs in refs.  Content differing implies that
we'll be embedding it for roundtripping, rather than putting in the dom.

Change-Id: I7bd1d4c503fc58f862960bec82ca514fc29d7eff
2021-08-26 16:38:58 +00:00
Arlo Breault 50dfe518cc Only call ReferencesData::add when adding
This moves determining if we already have a reference created for a
named ref outside of that function, which is helpful for making use of
the cached html for that ref earlier.

Change-Id: Ie416bd95b980f9f95111d7e420945f40e2ada747
2021-08-26 16:37:36 +00:00
C. Scott Ananian 187de4b769 The ::querySelectorAll() and ::getElementsBy* helpers don't always return array
The standard type for these returns is NodeList and HTMLCollection, which
are almost *but not quite* the same as an array.  In two places we got a
little complacent and assumed our non-standard DOMCompat workarounds would
always return arrays.  Tweaked the types of DOMCompat to report that they
return an `iterable`, which is a PHP7.1 "pseudo-type" that unifies
arrays and \Traversable types like HTMLCollection/NodeList.  This
allows phan to catch places where we slip up and assume an array type
return.

It does introduce a new wrinkle, though, since there is no simple way
to turn an iterable into an array.  We're using a simple
`iterable_to_array` helper function for this.

Change-Id: I35bdeb3afa30ef5182e71733a0a606aadcafb435
2021-07-31 03:50:07 +00:00
C. Scott Ananian a1d0fdd776 Allow Node::getAttribute() to return null
In PHP's DOM extension, one of the legacy bugs is that
DOMNode::getAttribute() can never return `null` (to indicate that the
attribute is missing), instead it returns an empty string in that
case.  This isn't (modern) spec-compliant behavior (it's a leftover
from ancient times) and we had to watch this carefully when porting
from JS.

In the time since the port, we've written new code and embedded this
assumption that DOMNode::getAttribute() will never return null into
the new code we've written.  Fix this.  Always use `getAttribute(...)
?? ''` (unless we're just doing an equality test against a non-empty
string, or the code is preceded by a `hasAttribute` test) so that our
code will work whether or not getAttribute returns null for a missing
attribute.

Change-Id: If33200e1053b2dd79abb5dfb3808c05ff3a0bbba
2021-07-30 20:34:47 +00:00
C. Scott Ananian fd3597cd39 Add class alias file to allow swapping in Dodo for DOMDocument
Change-Id: I56c10d2f4283e9e7b57bf722208fefab007cdf45
2021-07-23 12:20:06 -04:00
DannyS712 55cc7c2828 Remove documentation that repeats the code
Mostly comments along the lines of "{classname} constructor"
in the doc block for the __construct method.

Change-Id: I67ffe070985dc75a5d817b1b5ac97b529d7ab4b8
2021-06-02 09:57:36 +00:00
Aaron Piotrowski 81630c4267 Upgrade to mediawiki/mediawiki-codesniffer 36
Change-Id: I103a662d0af77cafa46cf6445e1580aabd005f31
2021-05-04 10:25:25 -05:00
Arlo Breault be829c15b0 Check for multiples doesn't apply to follows
Follow up to 7bd9f87

Bug: T276388
Change-Id: I68ab87702b967e870c432564b54d86bcbf914174
2021-03-03 18:07:17 -05:00
Arlo Breault e047ff7afc Refactor sanitization in a normalizeKey function
This matches the legacy parser extension.

Change-Id: Iecec58e793e4a7c0ecd3a139773f225484f4be8f
2021-01-12 00:04:43 +00:00
sbailey c3bc1f00b0 Contract multiple underbars in a row in refnames to a single underbar
* Inlcudes test coverage for refnames with single and more than
   one underbar in a row which are maintained as separate keys but
   serialized without the multiple underbars

Bug: T267974
Change-Id: I9c21a6ff761f4b9a22b1185280b5676e2c160208
2021-01-11 23:14:11 +00:00
Subramanya Sastry 6ebe050750 Get rid of rtTestMode
Back in the early days of Parsoid, we introduced rtTestMode so
we can suppress lots of noisy (but harmless) diffs in rt-testing
so we can isolate the harmful diffs that absolutely needed fixing.
This mode was critical to running large scale round-trip testing on
a large test corpus and let us get a lot of confidence in Parsoid's
ability to handle VisualEditors edits.

But, now that Parsoid is established and selective serialization is
also fairly robust, it is time to get rid of this mode altogether.
This mode was adding clutter to the codebase and was potentially confusing
in some cases. We won't lose our ability to identify regressions in
rt-testing since all we care about is semantic diff changes relative
to a baseline. We just end up with a lower-fidelity baseline.

Change-Id: I22a1b3ecf4e0224000f1df6a98cf7ea9bcb4ee4e
2021-01-11 15:39:06 -06:00
sbailey 9679519b0a Fix for Parsoid Cite refname whitespace handling
* Refnames such as 'a b' and 'a_b' are now kept seperate like
   in Core Cite. Refnames with unicode whitespace characters
   such as "a\u2028b' are handled as distinct refnames from 'a b'
   and their ID's are sanitized appropriately to have underbars.

Bug: T267974
Change-Id: Ie06d1f2b8614dbdcf8572ed4647ec9093ef006d5
2021-01-08 17:22:44 +00:00
Subramanya Sastry 97f965f9cb Parsoid/JS: Purge majority of code not used by roundtrip-test.js
* Tweaked a few of the remaining js files to enable me to purge more
  files.

* In another pass, we can consolidate most of the utils into a couple
  rt-test helper classes and purge more.

* Verified that bin/roundtrip-test.js runs by using a parsoid/js service
  installed via the deb package.

$ bin/roundtrip-test.js --parsoidURL http://localhost:8142 Hospet --outputContentVersion 2.1.0

* Haven't touched the package.* files

Change-Id: I79657250de3baaa09bda6be2f440ba5363b62eaf
2021-01-08 17:15:04 +00:00
Arlo Breault e8d8481f60 More papering over in References.php
This is the same fix as in 5e5e360 for T259676

The root of the issue is described in T260082

Bug: T271357
Bug: T260082
Change-Id: I7ccf0b20f6b0be0f31101a2c4a88010675dc72ba
2021-01-06 18:53:55 -05:00
sbailey 394015a38b Add ref/follow name to Cite error cite_error_references_missing_key
Bug: T51538
Change-Id: Id19a4e4c37169ca6eb7aecdce66b1662546ae31a
2020-12-21 18:08:21 -08:00
sbailey 511543e3f1 Add refname parameter to cite_error_empty_references_define error
Bug: T51538
Change-Id: I2850b7f181f44465437bc486bc544c5cd58aa5e3
2020-12-21 13:31:37 -08:00
sbailey b9b10a3fe0 Add group name to Cite error cite_error_references_group_mismatch
Bug: T51538
Change-Id: Ie6e04edcdf4b9760711ec53021d65970691a3813
2020-12-18 22:16:28 +00:00
sbailey 7bd9f87157 Add parameter $refName to Cite error cite_error_ref_duplicate_key
Bug T51538
Change-Id: If8399be12a5cad025b3a4db8e970c8de96c75ad6
2020-12-16 13:58:45 -08:00
sbailey 5fbf890f12 Add direction parameter to cite_error_ref_invalid_dir message
Bug: T51538
Change-Id: I5e964ad7341a46552d7b8eded0d844c0132816b1
2020-12-16 20:24:57 +00:00
sbailey cf4a49ba6e Add group name parameter to cite_error_group_refs_without_references
Bug: T51538
Change-Id: I8708ffa21c2ef68c124a5b055a6860cfb4ec12e1
2020-12-16 20:19:33 +00:00