While working on T270009, I noticed that <style> and <link> nodes
are treated differently, which seemed weird. Rewrite this again,
hopefully this is the last time.
The changed test cases also involve <area> and <input> nodes,
and the new results make more sense to me.
Bug: T264116
Change-Id: I3af90c84768a4b3dc53446927f4dba6f72175a2f
We've recently decided that we want to "extend" comments until
the end of the paragraph (e36dc8e78a, d0ae6c4e44).
However, we still had this special case that did the opposite: it
ensured that if a comment ended in the middle of a text node, the
comment would not be extended to the end of the node. Remove it.
Note the change in the test file signatures-funny-formattedreply.html,
which actually covered this case specifically.
Change-Id: Id1384bb0c6e1a5f0c70f55efcb4caa240f230f07
The end marker is skipped forward until an open or close
block tag is reached. In tree traversal terms this means
moving either to the next sibling, or the parent (to skip
over close tags).
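A minimal sketch of that traversal, assuming a simplified node shape (`block`, `parentNode`, `nextSibling`) rather than real DOM nodes. The names `makeNode` and `skipToBlockBoundary` are hypothetical and only illustrate the "next sibling, or parent" movement described above:

```javascript
// Hypothetical helper: builds a plain-object node and links its children.
function makeNode(name, block, children = []) {
  const node = { name, block, parentNode: null, nextSibling: null, childNodes: children };
  children.forEach((child, i) => {
    child.parentNode = node;
    child.nextSibling = children[i + 1] || null;
  });
  return node;
}

// Starting just after `node`, skip forward until the next tag in the document
// would be an open or close block tag. Returns the last node that was skipped past.
function skipToBlockBoundary(node) {
  while (true) {
    const next = node.nextSibling;
    if (next) {
      if (next.block) {
        return node; // An open block tag comes next: stop here.
      }
      node = next; // Skip over the inline sibling.
    } else {
      const parent = node.parentNode;
      if (!parent || parent.block) {
        return node; // A close block tag (or the end of the tree) comes next.
      }
      node = parent; // Skip over the close tag of an inline parent.
    }
  }
}

// <p><b>sig</b> <i>trailing</i></p>
const sig = makeNode('#text', false);
const bold = makeNode('b', false, [sig]);
const space = makeNode('#text', false);
const italic = makeNode('i', false, [makeNode('#text', false)]);
const paragraph = makeNode('p', true, [bold, space, italic]);
console.log(skipToBlockBoundary(sig) === italic); // true
```

In this sketch, inline elements are skipped as a whole; it does not look inside them for nested block elements.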
Bug: T256033
Change-Id: Iaa2c588698790d576ac4f9ecc126f58a082ef6b3
The general rule is that comments start after their preceding
thread item, but when that is a heading we should skip past
the entire <h[1-6]> node to avoid making section edit links
part of the first comment.
Bug: T267988
Change-Id: Ia7f1b27e0a69a9aab7c7da743bf8549479304096
As CommentFormatter no longer needs HTMLFormatter, remove
the inheritance and make addReplyLinks a static method.
Testing locally, this is marginally slower, going from 2.55s
to 2.9s for the CommentFormatterTest case.
Bug: T266317
Bug: T267973
Change-Id: If69749cae678a1647a138d782a32032189f55cec
A TreeWalker ends up walking potentially every single subsequent
node in the document looking for a target node. Instead use upwards
traversal to find a common ancestor, then sibling traversal to
compare document order.
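A rough sketch of the idea, assuming plain objects with `parentNode` and `nextSibling` links in place of real DOM nodes. `makeNode` and `precedes` are hypothetical names, not the functions in this change:

```javascript
// Hypothetical helper: builds a plain-object node and links its children.
function makeNode(name, children = []) {
  const node = { name, parentNode: null, nextSibling: null, childNodes: children };
  children.forEach((child, i) => {
    child.parentNode = node;
    child.nextSibling = children[i + 1] || null;
  });
  return node;
}

// Returns true if `a` comes before `b` in document order.
// An ancestor is considered to come before its descendants.
function precedes(a, b) {
  if (a === b) {
    return false;
  }
  // Upwards traversal: collect `a` and all of its ancestors.
  const ancestorsOfA = new Set();
  for (let n = a; n; n = n.parentNode) {
    ancestorsOfA.add(n);
  }
  if (ancestorsOfA.has(b)) {
    return false; // `b` contains `a`, so `b` comes first.
  }
  // Walk up from `b` until its parent is the lowest common ancestor.
  let childB = b;
  while (childB.parentNode && !ancestorsOfA.has(childB.parentNode)) {
    childB = childB.parentNode;
  }
  const common = childB.parentNode;
  if (!common) {
    throw new Error('Nodes are not in the same tree');
  }
  if (common === a) {
    return true; // `a` contains `b`, so `a` comes first.
  }
  // Find the child of the common ancestor that contains `a`.
  let childA = a;
  while (childA.parentNode !== common) {
    childA = childA.parentNode;
  }
  // Sibling traversal: `a` precedes `b` if childB is a later sibling of childA.
  for (let n = childA.nextSibling; n; n = n.nextSibling) {
    if (n === childB) {
      return true;
    }
  }
  return false;
}

const t1 = makeNode('t1');
const t2 = makeNode('t2');
const t3 = makeNode('t3');
const t4 = makeNode('t4');
const p = makeNode('p', [t2, makeNode('b', [t3])]);
const root = makeNode('root', [makeNode('h2', [t1]), p, t4]);
console.log(precedes(t1, t3), precedes(t3, t1)); // true false
```

The cost is bounded by the depth of the two nodes plus the number of children of their common ancestor, instead of the number of nodes between them in the document.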
This makes calling cloneContents on every comment on a 300k talk page
significantly faster, going from >30s to 500ms locally.
Change-Id: I28a2b8c11d4098d9bc44d19b98e19ccc02273098
Ideally the edit autosummary would be generated in the same
way as in the old wikitext editor: from the wikitext of the
heading. But on the JS side, we don't have access to the
wikitext, or to the PHP method that generates autosummaries.
This might seem crazy at first, but ultimately the point of
the autosummaries is to link to the section heading by its
'id' attribute, so it is perfectly reliable.
Doing it this way depends on $wgFragmentMode being set to
[ 'html5', 'legacy' ] or [ 'html5' ], otherwise the escaped IDs
are super garbled (particularly in non-Latin-alphabet languages)
and can't be unescaped reliably. Conveniently, we already
require that since 9ee0fd69f5.
Bug: T264561
Bug: T266725
Change-Id: I7d35098d672d0edb50d49e22de1686d5cc83b60e
The condition was wrong: it could return either an element child with
.mw-headline, or a non-element child.
Bug: T267284
Change-Id: I28cda22ee8c5fe4a3259621adddd647b31291703
Internal PHP errors (such as "Call to undefined method…") are not Exceptions.
Follow-up to e18a0f3dcd.
Bug: T267035
Change-Id: I3cbf2b6b0d1d8a97cdac9791ec4f7b2ec807c7e5
After recent changes allowing ThreadItems to have IDs, they can now
also have warnings about duplicate IDs.
Bug: T267035
Change-Id: If3edfe34e6e29741e29fac8946a3c88badc4ab7f
The following sniffs are failing and were disabled:
* MediaWiki.Commenting.PropertyDocumentation.MissingDocumentationPrivate
* MediaWiki.Commenting.PropertyDocumentation.MissingDocumentationProtected
* MediaWiki.Commenting.PropertyDocumentation.MissingDocumentationPublic
Additional changes:
* Dropped .inc files from .phpcs.xml (T200956).
Change-Id: I340d6b573e9ae2a99085fb19a705fcf567b03f92
Use the same logic for marking ranges in the document, and ensure
that the heading range does not include section edit links or
section numberings.
Change-Id: I782caafc34fee2a822b0a17b24dd6b9528202eca
If A follows B, then we can assume that B does not follow A.
Calling the function recursively computes that twice;
we can instead make some simple changes to "invert" the result.
Change-Id: I709aca7cb997dd2fe3980468a8c6bde6f366fb5b
It's an expensive method, and we previously called it for
every child of the common ancestor, completely unnecessarily.
These changes follow from two observations:
* If there is a $firstPartiallyContainedChild, then the
first fully contained child must follow it; similarly,
if there is a $lastPartiallyContainedChild, then the
last fully contained child must precede it.
* All nodes between the first and last fully contained
children are also fully contained.
Maybe it can be made cleverer still, but it's a lot better.
Change-Id: I4e596c62274c2c0be115f0ddec42629115b430a4
Skipping them could result in incorrect handling when RESTBase HTML is
outdated.
When a result for a given comment is not found, display an error
instead of assuming it is not transcluded.
Bug: T262065
Change-Id: I14a7a0a25d5181b5c49bd5677f0c002dce5a3cb9
To avoid old threads re-appearing on popular pages when someone
uses a vague title (e.g. dozens of threads titled "question" on
[[Wikipedia:Help desk]]: https://w.wiki/fbN), include the oldest
timestamp in the thread (i.e. date the thread was started) in the
heading ID.
Bug: T264478
Change-Id: If918bfd5e025248923d1939bc86916697ead95a0
Sequential numbers aren't great because they change when an earlier
comment is archived. Parent comment/heading IDs should change less
often.
This also makes much more sense for disambiguating subsections,
e.g. a dozen identical ===Votes=== sections for a dozen proposals.
Bug: T264478
Change-Id: I466454984fd919ebef35f2b37ddb5d86dc842996
Our threads now also contain all replies to their sub-threads.
This is similar to how sections work in MediaWiki, where the parent
section also contains the content of all the lower-level sections.
We're going to need this for notifications about replies in a thread.
Bug: T264478
Change-Id: I241fc58e2088a7555942824b0f184ed21e3a8b6f
Previously, only comments could have IDs, because we only needed IDs
for replying. But we might also use them for notifications soon.
Bug: T264478
Change-Id: I1bcad02bf17ab54bc5028a959543c10f0430836b
The output of CommentFormatter::addReplyLinks() and consequently
ThreadItem::jsonSerialize() can end up in the HTTP cache (Varnish) on
Wikimedia wikis. We need to consider that when changing that code.
Introduce a concept of legacy ID (generated by the older algorithm
after it changes), add some placeholder code that will generate them
in the future, and update some code to find comments by either normal
or legacy IDs.
Add dire comments in a bunch of places (as if that ever helps).
Bug: T264478
Change-Id: I4368f366800ab21b8b184b09378037614fdecd33
"This modifies the original objects…" – I feel like this is obvious
now, but maybe it wasn't so obvious when this code was structured
differently before a2431fe006. Also,
it refers to a variable that doesn't exist.
"FIXME this will clone the reply…" – No, actually, it will not.
It would if replies were associative arrays, but they are objects,
and have always been, ever since the PHP parser was merged in
7b7a2cd69c. Maybe they were arrays
once in Roan's mind before he pushed that for review.
Change-Id: I1348e111699fdbde99cd1f9ef45d8f465f7391b0
We can check whether a node is a child of another node directly,
without iterating over all its children.
Change-Id: I3a26df89365bf765348d96b477c983ec9c4e43fe
* Add the preference
* Only display it when the reply tool is enabled
* Use it when opening the reply tool
* Save it when the menu is toggled from the reply tool interface
Bug: T261539
Change-Id: Icb8fa6b3f1e9a3644669f21b08f34ea8c175f2f9
This is not necessary, and never has been. This variable contains an
object and it's never assigned to.
Instead, the reference creates hard-to-debug bugs (I've just spent
an hour debugging one). When the variable name is reused later in
the same function as the loop variable of another foreach() loop
(such as in If918bfd5e0), the result is overwriting of the last entry
in $this->threadItems with the last entry from the other array.
I was questioning everything I know about variables until I noticed.
Change-Id: Ibb57f915b39dd4d6d2e744903f9ecadd67b1f52d
Also, add tests covering this and the previous bug fixes in this code
(T259818, T261706).
Note that the test data added in tests/cases/ doesn't exactly match
the entire configuration of the wiki, only the parts we want to cover.
This is unlike the data in tests/data/, which was literally copied
from the relevant wikis, and which is used as input for other tests.
Bug: T265500
Change-Id: I29a59a5952f6dc9fb5910434bb6bcc9dcdaa01a9
When a timestamp directly followed a `<div>…</div>` tag (or perhaps
some other wrapper containing lots of content), we would detect the
username from the earliest links in the wrapper (furthest from the
timestamp), rather than the latest links (closest to the timestamp).
Bug: T262573
Change-Id: Id16449a86a731b13dc79846bb30ecf6554e26f1d
The wikitext parser outputs `<p><br></p>` for empty paragraphs, so we
need to ignore `<br>` tags when searching for an "interesting" node
that marks the beginning of a comment. Otherwise the empty paragraphs
mess up the detection of indentation levels.
Bug: T264116
Change-Id: I84a97ab577baa7336b78935ccdc48041ecfc231a
* Export parser data (date format, digits, timezone names, and
messages for weekday/month names) converted to language variants
* Update the parsers to try matching using every variant, in case
the page is displayed in non-default variant (and to avoid
problems with incomplete variant conversion)
Bug: T259818
Change-Id: I04d73992cd31ce06fa79f87df0c0a53d7efc3c58
Avoids using the deprecated $noSeparators parameter to Language::formatNum
in favor of Language::formatNumNoSeparators, which has been around since
MW 1.21.
Change-Id: I012434d5f6c749fec45a6c160e8d5d03686192e9
PHP was counting UTF-8 bytes, JS was counting UTF-16 code units.
Both should have been counting codepoints (although it doesn't
really matter as long as they both count the same things).
I noticed the issue after adding some tests using the Cyrillic
script, when one case had different results in PHP and JS:
Id25b537fecd789640c209ff7f30e777455a3aece.
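For illustration only, here is how the three ways of counting diverge for a short string with Cyrillic text and one non-BMP character. The sample string is made up for this sketch and is not taken from the test cases:

```javascript
// Six Cyrillic letters, a space, and one emoji outside the BMP (U+1F600).
const text = 'Привет \u{1F600}';

// UTF-8 bytes: what strlen() counts in PHP.
const utf8Bytes = new TextEncoder().encode(text).length;
// UTF-16 code units: what String.prototype.length counts in JS.
const utf16Units = text.length;
// Code points: string iteration in JS goes by code point.
const codePoints = Array.from(text).length;

console.log(utf8Bytes, utf16Units, codePoints); // 17 9 8
```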
Change-Id: Ic31240678f71ba48e6ec202126bf490cea12bb66
Move the code so that we check for "?title=" query parameter first,
because we don't handle this right in the other code path.
Use parse_url() instead of wfParseUrl() because the latter doesn't
accept relative URLs, and we don't care about the other differences.
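A sketch of the intended order of checks, written in JS for illustration (the change itself is in PHP). The function name `getTitleFromUrl`, the `/wiki/$1` article path and the `https://example.org` base are assumptions made for this example:

```javascript
// Hypothetical helper: extracts a page title from an absolute or relative link.
function getTitleFromUrl(href, articlePath = '/wiki/$1') {
  // The base is only used so that relative URLs can be parsed at all.
  const url = new URL(href, 'https://example.org');

  // Check for the "?title=" query parameter first.
  const title = url.searchParams.get('title');
  if (title !== null) {
    return title;
  }

  // Otherwise, try to match the article path.
  const prefix = articlePath.replace('$1', '');
  if (url.pathname.startsWith(prefix)) {
    return decodeURIComponent(url.pathname.slice(prefix.length));
  }
  return null;
}

console.log(getTitleFromUrl('/w/index.php?title=User:Example&action=edit')); // User:Example
console.log(getTitleFromUrl('/wiki/User:Example')); // User:Example
```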
Bug: T261711
Depends-On: I4da952876e1c3d1a41d06b51f7e26015ff5e34d7
Change-Id: I70fac2b41befd782b0a47a4f726ae748dc0f775d
The PHP code incorrectly assumed that the digits are single-byte in
UTF-8, which is only the case for the ASCII digits 0-9.
The JS code worked correctly because it uses UTF-16 strings, so the
bug would only affect non-BMP digits there. This was noted in a TODO
comment, but we overlooked it when reimplementing in PHP.
Instead of a string of 10 characters, use an array of 10
single-character strings.
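The same pitfall can be shown on the JS side with non-BMP digits. This is a sketch only: `toAsciiDigits` is a hypothetical helper, and the Osmanya digits (U+104A0 to U+104A9) are chosen purely because they lie outside the BMP:

```javascript
// An array of 10 single-character strings, one per digit.
const digits = Array.from({ length: 10 }, (_, i) => String.fromCodePoint(0x104A0 + i));

// The same digits as one string: indexing it by position no longer works,
// because each digit takes two UTF-16 code units.
const digitsAsString = digits.join('');
console.log(digitsAsString.length); // 20
console.log(digitsAsString[3] === digits[3]); // false

// Hypothetical helper: replaces local digits with ASCII digits.
function toAsciiDigits(text, localDigits) {
  // Array.from() iterates by code point, so each digit stays in one piece.
  return Array.from(text, (ch) => {
    const value = localDigits.indexOf(ch);
    return value === -1 ? ch : String(value);
  }).join('');
}

console.log(toAsciiDigits(digits[2] + digits[0] + digits[2] + digits[0], digits)); // 2020
```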
Bug: T261706
Change-Id: Ic5421382474c88f003424799c53ff473d99cce92
As we do in VE, extract the revid from the document.
Unlike in VE we don't need to throw an error if there is
a mismatch, as we will likely be able to make the edit anyway.
Just use the ID we got from the document.
Log a warning if there is ever an ID mismatch so we
can evaluate if this check is actually needed.
Change-Id: I94c37980524a9faabac49495903a5262387af562
When a comment ended before the end of a paragraph, the next
comment would begin right there in the middle of the paragraph.
This could result in the detected indentation level of that
comment being incorrect, and replies being inserted in wrong
places, as seen in the 'signatures-funny' test case.
The code moved to the parser was previously repeated twice in
addListItem() and addReplyLink(), which should have been a hint
that something isn't quite right.
Also, fix the code guarding against overlapping signatures,
now that signatures may not be at the end of a comment.
Bug: T260855
Change-Id: Ic26a87642f8a15d5de2f7073d4d8176b299c7f94
Causes page corruption, in a new way we haven't seen before.
* Revert "Move page updating logic to controller.js"
This reverts commit 54fdc6de06.
* Revert "ReplyWidget: Move clear methods from #teardown to #clear"
This reverts commit 9b811a94e0.
* Revert "ApiDiscussionToolsEdit: Do not pass 'basetimestamp'"
This reverts commit 7de5938a6f.
* Revert "Use DOMCompat::getOuterHTML instead of doc->saveHTML()"
This reverts commit 7b2448d2f0.
* Revert "CommentController: Remove remains of client-side edit conflict handling"
This reverts commit 2d038af705.
* Revert "Restore error message for when comment is deleted while replying"
This reverts commit 655c0526d6.
* Revert "Use transcluded from API to avoid ever fetching Parsoid DOM in client"
This reverts commit 9d0fc184fe.
* Revert "Create a 'transcludedfrom' API endpoint"
This reverts commit 5d8f3b9051.
* Revert "Edit API for replies"
This reverts commit 8829a1a412.
Bug: T259855
Change-Id: I6419408c6194ec0afa6b8ee604b12c1a24c6ac7b
Previously, the parser would output offsets that don't exist in their
containers, because we were pretending that entities are parts of
their neighboring text nodes.
Turns out it's much easier to do it right when going backwards.
Change-Id: I9bccca2d403f1a976ae517449989170cdd99721e
Something terrible has happened to this function… It seems that I have
brutalized it when rebasing 092cfd6075.
Change-Id: I12d75c69d15645112563a7bc345209b23b54cb3e
Only 'baserevid' should be required. That's what we used before commit
8829a1a412, since switching from
'basetimestamp' in commit 4e135c7f07 in
order to better handle edit conflicts with yourself. That fix seems to
have regressed, so let's try this and see if it helps.
Bug: T252558
Change-Id: Iff5911384f3320b6e7f97a1fa34e82ecd4b44fb3
The latter results in lots of extra HTML entity encoding.
The former is built by the Parsing team and appears to result
in no unexpected changes elsewhere in the document.
As Parsoid's selser relies on HTML fragments being byte-for-byte
equal, these changes were resulting in wikitext normalisations
in untouched parts of the document ("dirty diffs").
Bug: T259855
Change-Id: Ib3cb605911e690ec3e8c2f9df25fd1a2e2849d7e
This reverts commit 96953647c3.
* Re-apply "Edit API for replies"
This applies commit 8829a1a412.
* Re-apply "Create a 'transcludedfrom' API endpoint"
This applies commit 5d8f3b9051.
* Re-apply "Use transcluded from API to avoid ever fetching Parsoid DOM in client"
This applies commit 9d0fc184fe.
* Re-apply "Restore error message for when comment is deleted while replying"
This applies commit 655c0526d6.
Change-Id: Id20d21899f87464636022aa0683f8c03e0060117
Causes page corruption.
* Revert "Restore error message for when comment is deleted while replying"
This reverts commit 655c0526d6.
* Revert "Use transcluded from API to avoid ever fetching Parsoid DOM in client"
This reverts commit 9d0fc184fe.
* Revert "Create a 'transcludedfrom' API endpoint"
This reverts commit 5d8f3b9051.
* Revert "Edit API for replies"
This reverts commit 8829a1a412.
Bug: T259855
Change-Id: I98036f14dd900b51f20e98696e31b9b618eceee1
When adding a reply, we take a node at the end of the previous comment,
compare that comment's indentation level to the expected indentation level
of the reply, and add (or remove) that number of wrapper lists.
The existing code did not consider that comments may have lists within
them, and so the indentation of that node may not match the indentation
of the comment.
Bug: T252702
Change-Id: Icc5ff19783d2b213bff99f283cb0599a8b5c1ab4
Previously we preferred that, but used '*' (<ul><li>) when the parent
comment or the previous reply also used it.
Bug: T252708
Change-Id: I3abf606da6693905764f1be745fad999fdf57fbe
* Remove the existing approach for detecting signatures that only
worked in source mode; remove autoSignWikitext()
* Use the same approach for auto-signing in source mode as we have
already used in visual
* In both modes, detect whether the user has already typed a signature
at the end of their comment in the modifier, and if so, don't add a
signature
* Add test cases for the detection
Bug: T255738
Change-Id: I791d3035cb1ffc33ce3966d4617a25d08700c35b
* Pass rootNode to the constructor
* Rename getters to match CommentItem/HeadingItem/ThreadItem
value classes.
* Always build the thread tree so CommentItems always have
an ID and replies/parent.
Change-Id: I508be9534de59016ff806e3d84edcbb1c76cb0c6
Instead of doing a separate tree walk and finding all timestamps
separately, make it part of the getComments tree walk, and find
timestamps one at a time.
Change-Id: I47f466eaf228504faa189fd99e07493bc7f022cd
This is similar to what the JS version does.
The TreeWalker and NodeFilter classes are adapted from
https://github.com/Krinkle/dom-TreeWalker-polyfill
(MIT license).
This makes #getComments twice as fast on en-big-oldparser.html
Change-Id: I2441f33e6e7bad753ac830d277e6a2e81ee8c93d
* Move modifier#getFullyCoveredWrapper to utils
* Use that method to find the node where we start searching for
template wrappers, rather than using endContainer
Bug: T252058
Change-Id: I55de58102f3468fce01290bd413a7fdc96d322d6
When there is a wrapper element whose range matches the range of
a comment, any replies will now be added outside of that wrapper,
instead of directly after the comment (inside the wrapper).
Bug: T250126
Change-Id: I6b42c4db019ae998e91eebd324f9cbd2aa791b4f
It was useful when I was debugging those parts of the code, but now
it's usually annoying.
The warnings can still sometimes be useful for understanding how the
tool parses some discussion, though. To keep that functionality, add
displaying warnings for each comment in the debug mode.
Change-Id: I2d218a8a394f179bcc0990ff988a0567c275ccf2
Follow-up to Ic1438d516e223db462cb227f6668e856672f538c.
Minor corrections and comment improvements in PHP parser,
and "backporting" some changes to JS parser that I like.
Change-Id: I5e54121914ec6b323e556dd133bcb71b3aefbb61
This method shouldn't be required on the server. Leave comments
relating to it in addListItem so JS & PHP can be kept in sync.
Change-Id: I849fac660faf6e750272c20776f96b9250f96b1b
In JS, strings are internally encoded as UTF-16, and properties like
.length return values in UTF-16 code units.
In PHP, strings are internally encoded as UTF-8, and we have the
option of using methods that return bytes like strlen() or code
points like mb_strlen().
However, the values produced by preg_match( …, PREG_OFFSET_CAPTURE )
are in bytes, and there's nothing we can do about that. So let's use
bytes throughout; mixing the two types results in meaningless numbers.
Then in the test code, we have to calculate UTF-16 code units offsets
based on the UTF-8 byte offsets.
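One way such a conversion could look, as a sketch rather than the actual test code. `utf8OffsetToUtf16` is a hypothetical name, and the byte offset is assumed to fall on a character boundary:

```javascript
// Converts a UTF-8 byte offset into a UTF-16 code unit offset for the same text.
function utf8OffsetToUtf16(text, byteOffset) {
  const bytes = new TextEncoder().encode(text);
  // Decode only the bytes before the offset; the length of the result is
  // the number of UTF-16 code units that precede that position.
  const decoder = new TextDecoder('utf-8', { ignoreBOM: true });
  return decoder.decode(bytes.subarray(0, byteOffset)).length;
}

// 'мир' starts at byte 14 (6 two-byte letters, then ', '), which is code unit 8.
console.log(utf8OffsetToUtf16('Привет, мир', 14)); // 8
```

If the offset falls in the middle of a multi-byte character, the decoder emits a replacement character and the result is not meaningful.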
We also have to copy the entire workaround for mw:Entity nodes… Maybe
the parser should be fixed to return the real nodes for ranges' ends
in this case.
Change-Id: I05804489d7de0d60be6e9f84e6a49a885e9fb870
It appears PHP's DOM library always uses CDATA nodes for the contents of
<style> tags, even if there is no such markup in the source HTML.
Change-Id: Id04b27086c5e7a0b016a3a440b2b4895d6b13c93
Profiling reveals that >87% of the run time of our test suite is spent
in this tiny method. Apparently, DOMNodeList::item() is extremely slow
(possibly it's linear time instead of constant time?).
Profiled using XDebug and KCacheGrind:
https://phabricator.wikimedia.org/F31815264
We can calculate the child's index in its parent by counting its
preceding siblings instead, which turns out to be much faster.
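The counting approach, sketched in JS with plain objects standing in for DOM nodes (the change itself is in PHP). `linkSiblings` and `childIndexOf` are hypothetical names used only here:

```javascript
// Hypothetical helper: creates sibling nodes linked via previousSibling/nextSibling.
function linkSiblings(names) {
  const parent = { name: 'parent' };
  let previous = null;
  return names.map((name) => {
    const node = { name, parentNode: parent, previousSibling: previous, nextSibling: null };
    if (previous) {
      previous.nextSibling = node;
    }
    previous = node;
    return node;
  });
}

// The index of a child in its parent equals the number of siblings before it.
function childIndexOf(child) {
  let index = 0;
  for (let n = child.previousSibling; n; n = n.previousSibling) {
    index++;
  }
  return index;
}

const children = linkSiblings(['a', 'b', 'c', 'd']);
console.log(childIndexOf(children[2])); // 2
```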
Before:
1. 275444ms to run DiscussionToolsCommentParserTest:testGetComments with data set #2
2. 12668ms to run DiscussionToolsCommentParserTest:testGetComments with data set #3
...
After:
1. 9545ms to run DiscussionToolsCommentParserTest:testGetComments with data set #2
2. 5549ms to run DiscussionToolsCommentParserTest:testGetComments with data set #3
...
That's still kind of slow but now it's bearable to run the test suite.
Change-Id: I49155f7aa2e231a9a20bf282cf6aaa28fc902e0b
* Not to be confused with the Parsing Team's
"Great Parser JS to PHP port of 2019"
Gasp as OR hacks are changed to null coalescing operators.
Applaud as variable declarations are dropped.
Cheer as parameters and return values are type-hinted.
Shudder as DOMNodeLists have no indexOf method.
Moving discussion parsing to the server should allow
us to implement much cleaner APIs for commenting.
Bug: T252252
Co-authored-by: Ed Sanders <esanders@wikimedia.org>
Change-Id: Ic1438d516e223db462cb227f6668e856672f538c
When the user clicks a "Reply" link on a page that is affected by the
'fostered' lint error (indicating fostered content in the HTML
representation), display an error and refuse to edit it, as Parsoid's
transformations will damage the page content.
The error message includes a link to documentation about lint errors,
and a link to the editor that will highlight the error location.
Depends-On: I723ec766d1244d117f8d624440026fe5af0d3403
Bug: T246481
Change-Id: Ic60cb58f98d10dc9b113469e5d3bbfb2d2b0564f