Commit graph

159 commits

Author SHA1 Message Date
Bartosz Dziewoński 063174e71c Use instanceof for checking for text/element nodes in PHP
It is friendlier for static analysis tools like Phan, which can't
infer anything from the `->nodeType === …` checks, and we were already
using it in most places.

Fix newly revealed Phan failures (and one unneeded suppression).

Change-Id: Id789f05e16a210f7ba22ca7514587c392fac0741
2022-03-08 23:28:39 +00:00
jenkins-bot 542da89530 Merge "Don't detect comments within references" 2022-02-28 16:47:21 +00:00
Bartosz Dziewoński 8a2715bdd5 Move signatureScanLimit to a constant in JS
Change-Id: Ieb60c148fd060ab62e4a493e2d0dff6c051f945c
2022-02-21 22:42:14 +01:00
Bartosz Dziewoński 4244418e56 Don't detect comments within references
Bug: T301213
Change-Id: Ifd5198651c8ed0ce53379fb5e35938089cd54a09
2022-02-21 19:57:44 +00:00
jenkins-bot 1a48f8cd7e Merge "CommentParser: Inject a forgotten service" 2022-02-21 19:30:55 +00:00
Bartosz Dziewoński 85165543f4 CommentParser: Inject a forgotten service
Also sort alphabetically.

Change-Id: I9e77c4aa1fba930f382e3c4f17ac0504c2f06668
2022-02-21 20:15:54 +01:00
Bartosz Dziewoński aea36bab3a CommentParser: Fix a small use of global state
Also, in ThreadItem::getSinglePageTransclusionTitle(), we don't need
this terribly complicated method.

Change-Id: If02c09aaa2f4dd66b2bc253a1edec4ea107564ee
2022-02-21 18:15:31 +00:00
Bartosz Dziewoński 8e44b43df0 Split off ThreadItemSet from CommentParser
Goal:
-----
Finishing the work from Iadb7757debe000025e52770ca51ebcf24ca8ee66
by changing CommentParser::parse() to return a data object, instead of
the whole parser.

Changes:
--------
ThreadItemSet.php:
ThreadItemSet.js:
* New data class to access the results of parsing a discussion. Most
  methods and properties are moved from CommentParser with no changes.

CommentParser.php:
Parser.js:
* parse() returns a new ThreadItemSet.
* Remove methods moved to ThreadItemSet.
* Placeholder headings are generated slightly differently, as we process
  things in a different order.
* Grouping threads and computing IDs/names is no longer lazy. We always
  needed IDs/names anyway.
* computeId() explicitly uses a ThreadItemSet to check the existing IDs
  when de-duplicating.

controller.js:
* Move the code for turning some nodes annotated by CommentFormatter
  into a ThreadItemSet (previously a Parser) from controller#init to
  ThreadItemSet.static.newFromAnnotatedNodes, and rewrite it to handle
  assigning parents/replies and recalculating legacy IDs more nicely.
* mw.dt.pageThreads is now a ThreadItemSet.

Change-Id: I49bfe019aa460651447fd383f73eafa9d7180a92
2022-02-21 16:22:32 +00:00
Bartosz Dziewoński 4613ae78e7 Change CommentParser into a service
Goal:
-----
To have a method like CommentParser::parse(), which just takes a node
to parse and a title and returns plain data, so that we don't need to
keep track of the config to construct a CommentParser object (the
required config like content language is provided by services) and
we don't need to keep that object around after parsing.

Changes:
--------
CommentParser.php:
* …is now a service. Constructor only takes services as arguments.
  The node and title are passed to a new parse() method.
* parse() should return plain data, but I split this part to a separate
  patch for ease of review: I49bfe019aa460651447fd383f73eafa9d7180a92.
* CommentParser still cheats and accesses global state in a few places,
  e.g. calling Title::makeTitleSafe or CommentUtils::getTitleFromUrl,
  so we can't turn its tests into true unit tests. This work is left
  for future commits.

LanguageData.php:
* …is now a service, instead of a static class.

Parser.js:
* …is not a real service, but it's changed to behave in a similar way.
  Constructor takes only the required config as argument,
  and node and title are instead passed to a new parse() method.

CommentParserTest.php:
parser.test.js:
* Can be simplified, now that we don't need a useless node and title
  to test internal methods that don't use them.

testUtils.js:
* Can be simplified, now that we don't need to override internal
  ResourceLoader stuff just to change the parser config.

Change-Id: Iadb7757debe000025e52770ca51ebcf24ca8ee66
2022-02-19 19:51:57 +01:00
Bartosz Dziewoński f51f3a1051 CommentParser: Remove unused method getThreadItemsByName()
Follow-up to a5099739a6.

Change-Id: I53cbf6a7a2c9b95674998734689b3930dfe74149
2022-02-19 19:51:57 +01:00
Bartosz Dziewoński 99b5de8038 Split Data class into ResourceLoaderData and LanguageData
The Data class contained utilities for two unrelated purposes.
Split each half to a separate class.

Notably, this improves the signature of the getLocalData() function.

Change-Id: Icde615fb9d483fee1f352c34909b37f8ffde8081
2022-02-19 19:37:34 +01:00
Bartosz Dziewoński ae9f26a9e5 Various code quality tweaks
(suggested by PhpStorm)

composer.json:
* Document required PHP extensions

Parser.js:
* Remove incorrect param documentation
* Fix some typos in comments (missing parentheses)

CommentParser.php:
* Fix some typos in comments (missing parentheses)

ImmutableRange.php:
* Remove unused property
* Add a `throw` to indicate that code path is unreachable

SubscribedNewCommentPresentationModel.php:
* Add missing `return false`

CommentParserTest.php:
* Remove unnecessary pass-by-reference

CommentModifierTest.php:
* Remove unused variable

CommentParserTest.php:
* Don't construct Element objects directly. PHP's DOMElement allows
  it, but Parsoid/Dodo's doesn't, and we use the latter for static
  analysis. This generates all kinds of confusing warnings.

Change-Id: Ia9598ebea0e99830dd485296e94a9d96acc4b258
2022-02-19 19:36:52 +01:00
Bartosz Dziewoński 13ab1db6da Don't count leading/trailing whitespace against signature scan limit
It's an arbitrary limit, it seems harmless to relax it to support the
use case in the task, even if it's weird.

Bug: T300949
Change-Id: I7c895c7019726758bbae3183b9c3ecbd9eabcf38
2022-02-04 19:35:29 +00:00
Ed Sanders 0b42aea276 CommentParser: Cache variables in getUsernameFromLink
Change-Id: I625e6ded3badd75a7a658c8d000576d0d165a18b
2022-02-04 19:35:18 +00:00
Ed Sanders 8ad1df7dc8 CommentParser: Name parts of return value from findSignature
Change-Id: I3a5ad36df0afdedc0aa9a15e5d83c5426b03b790
2022-02-04 19:34:18 +00:00
Bartosz Dziewoński f15693eefa Use class list everywhere for adding/checking CSS classes
In PHP, use DOMCompat::getClassList(), provided by Parsoid.
In JS, use `.classList`, available in all supported browsers.

This may fix some bugs where we were incorrectly checking for exactly
one class. The change in isOurGeneratedNode() is needed for
Ib2fa40c5fa389572b0e88ef558728fa06e3621b0.

Change-Id: Ia28d31678fd3d617b69280c4b7857755300fa515
2022-01-24 18:40:00 +01:00
Ed Sanders f80ff74fc6 Handle selflinks by returning the current page's title
Bug: T287818
Change-Id: I67f10ac9976581279d1e6a477e90d55875ebab20
2022-01-12 21:18:04 +00:00
Ed Sanders 34011b7a07 Parser: Pass in title of page being parsed
Will be used to parse selflinks in the future.

Change-Id: I2bc29d1c5c69cb6309f582f162f9af7d96ce8913
2022-01-12 21:17:59 +00:00
Bartosz Dziewoński ef7274d69e Move some helpers from CommentParser to CommentUtils
Change-Id: I0e323d3b75f47459a5548a13e9684f4c6ff4ba0c
2021-12-13 17:13:41 +01:00
Ed Sanders 8e4f08182e Add missing typehints
Change-Id: Ia25c5bea1834a3fdd26f32a9d5ed097789329824
2021-12-01 14:57:09 +00:00
Ed Sanders a86d308d66 CommentItem.php: Store timestamp object instead of string
We do something similar in CommentItem.js with a moment object.
The object can be converted to a string when required.

Change-Id: Id7221e9201db0d89c3b771574634c878c9515ca0
2021-11-09 16:37:45 +00:00
Bartosz Dziewoński c1f4668806 Change CommentParser and ImmutableRange to use offsets in codepoints instead of bytes
The PHP DOM extension measures lengths and offsets in Unicode codepoints.
Our PHP code used UTF-8 bytes, causing some offsets to be slightly off.
Now it mostly uses Unicode codepoints as well (we're forced to use bytes
in a few places, because preg_match returns offsets in bytes).

In practice, this had no visible effect to the user. It caused the
markers `<span data-mw-comment-end="..."></span>` to be placed at
the end of their container instead of the correct position when the
timestamp contained multibyte characters (e.g. "ź" in Polish); but
the correct position is usually at the end of the container anyway.

In the test cases, the only difference is placing these markers before
a trailing line break inside `<p>...</p>` tags rather than before it.

The patch also accidentally fixes another bug, where element nodes
with no children (mostly <img>) were incorrectly excluded when calling
cloneContents(), because they were treated as if they were text nodes.

Change-Id: Iccdccf1078598f4b62cab96225e9c85a4c0e93ee
2021-09-27 19:04:16 +00:00
jenkins-bot 15747aba3a Merge "CommentParser: Remove outdated legacy ID algorithm" 2021-09-21 16:50:37 +00:00
Alexander Vorwerk 97a702cbc7 CommentParser: use IPUtils instead of the deprecated IP class
Bug: T291008
Change-Id: I0207940a642a32f2ca997b78387c9ff0af101599
2021-09-14 22:19:05 +02:00
Bartosz Dziewoński 9819df3288 CommentParser: Remove outdated legacy ID algorithm
Last changed in March (4a0802065c),
was only needed for about 2 weeks for compatibility with cached data.

Change-Id: I510238cb86a7b4d7ae5e8636716d1e9ca2d0e402
2021-09-07 17:41:30 +02:00
Bartosz Dziewoński a5099739a6 Improve notifications for comments posted in close succession
In case 4 and case 6, no notifications are expected. In all other
cases we now get the expected notifications.

Bug: T285528
Change-Id: I9e813bb3a053bc1232783f9eae1ad75672b4fa7e
2021-08-01 12:27:33 +02:00
C. Scott Ananian 25272e7a4a Don't refer directly to PHP dom extension classes; avoid nonstandard behavior
These changes ensure that DiscussionTools is independent of DOM
library choice, and will not break if/when Parsoid switches to an
alternate (more standards-compliant) DOM library.

We run `phan` against the Dodo standards-compliant DOM library,
so this ends up flagging uses of non-standard PHP extensions to
the DOM.  These will be suppressed for now with a "Nonstandard DOM"
comment that can be grepped for, since they will eventually
will need to be rewritten or worked around.

Most frequent issues:

* Node::nodeValue and Node::textContent and Element::getAttribute()
can return null in a spec-compliant implementation.  Add `?? ''` to
make spec-compliant results consistent w/ what PHP returns.

* DOMXPath doesn't accept anything except DOMDocument.  These uses
should be replaced with DOMCompat::querySelectorAll() or similar
(which end up using DOMXPath under the covers for DOMDocument any way,
but are implemented more efficiently in a spec-compliant
implementation).

* A couple of times we have code like:
  `while ($node->firstChild!==null) { $node = $node->firstChild; }`
and phan's analysis isn't strong enough to determine that $node is still
non-null after the while.  This same issue should appear with DOMDocument
but phan doesn't complain for some reason.

One apparently legit issue:

* Node::insertBefore() is once called in a funny way which leans on
the fact that the second option is optional in PHP.  This seems to be
a workaround for an ancient PHP bug, and can probably be safely
removed.

Bug: T287611
Bug: T217867
Change-Id: I3c4f41c3819770f85d68157c9f690d650b7266a3
2021-07-30 18:15:40 -04:00
libraryupgrader b0884b177c build: Updating dependencies
composer:
* mediawiki/mediawiki-codesniffer: 36.0.0 → 37.0.0

npm:
* postcss: 7.0.35 → 7.0.36
  * https://npmjs.com/advisories/1693 (CVE-2021-23368)
* glob-parent: 5.1.1 → 5.1.2
  * https://npmjs.com/advisories/1751 (CVE-2020-28469)
* trim-newlines: 3.0.0 → 3.0.1
  * https://npmjs.com/advisories/1753 (CVE-2021-33623)

Change-Id: I7a71e23da561599da417db3b3077b78d91173bbc
2021-07-22 16:29:04 +00:00
libraryupgrader 12fb65b9f1 build: Updating composer dependencies
* mediawiki/mediawiki-codesniffer: 35.0.0 → 36.0.0
* php-parallel-lint/php-parallel-lint: 1.2.0 → 1.3.0

Change-Id: I5c152292e83e7f3441e2c08b7d0ad23ac90f194b
2021-05-05 11:14:52 +00:00
jenkins-bot ef7073b8fd Merge "Simplify how warnings for IDs equal to legacy IDs are avoided" 2021-04-15 23:31:45 +00:00
Bartosz Dziewoński 42ce942c86 Introduce comment "names" to identify comments across revisions/pages
The existing comment IDs can't be used to find the same comment on
a different revision or page (when it's transcluded), because they
depend on the comment's parent and its position on the page.

Comment names depend only on the author and timestamp. The trade-off
is that they can't distinguish comments posted within the same minute,
or in the same edit, so we will still need the IDs sometimes.

Prefer using comment names when replying, if they're not ambiguous.
This fixes T273413 and T275821.

Heading names depend on the author and timestamp of the oldest comment.
This way we don't have to detect changes to the heading text, but we
can't distinguish headings without any comments.

Bug: T274685
Bug: T273413
Bug: T275821
Change-Id: Id85c50ba38d1e532cec106708c077b908a3fcd49
2021-03-23 16:08:42 +00:00
Bartosz Dziewoński b28290fa62 Simplify how warnings for IDs equal to legacy IDs are avoided
I don't like the extra parameter.

Follow-up to d05109b24d.

Change-Id: Ic0f403a816fd3182982002da326bb32d591ebcf7
2021-03-22 20:15:07 +00:00
Ed Sanders 4a0802065c Make IDs (to be used as URL hashes) wikitext safe
* Use hyphens instead of pipes a separators
* Use underscores for spaces in usernames

Change-Id: I6efd9739fc73e45002e50e64c43ce0de1c2f1239
2021-03-18 20:45:21 +01:00
Bartosz Dziewoński a103abb8ae Ignore warnings about legacy IDs in tests
Change-Id: I3c74b4e65aac9b84494917547cce7eb6a75995b4
2021-03-18 20:42:03 +01:00
Bartosz Dziewoński f5059e6ea6 Don't detect comments within 'cite' elements too
Follow-up to 024a978ffd.

Bug: T275881
Change-Id: I53448ad22cd0531e7fd4aa0ea5d15782879cce14
2021-03-01 21:40:43 +01:00
jenkins-bot 0eb37a87df Merge "Don't detect comments within quotes" 2021-02-28 22:56:20 +00:00
Bartosz Dziewoński 024a978ffd Don't detect comments within quotes
Bug: T275881
Change-Id: I8f7a4279837bd95ebf5b604ff350c0a3f29c2c05
2021-02-28 22:49:48 +00:00
jenkins-bot 8bb5eea999 Merge "Improve signature detection to handle formatting on the timestamp" 2021-02-27 22:54:50 +00:00
jenkins-bot 49938a88dc Merge "Improve merging multiple comments on one paragraph" 2021-02-27 22:54:43 +00:00
Daimona Eaytoy 67096cb431 Stop using deprecated Language methods
Change-Id: I7cf21365df355a4a62f9e353be61aaa03ed58b9d
2021-02-27 14:48:49 +00:00
Bartosz Dziewoński efe95494a8 Improve signature detection to handle formatting on the timestamp
Now it detect signatures generated by en.wp's {{Undated}} template,
and signatures of people who do weird stuff to the timestamps.

Bug: T275938
Change-Id: I27b07f6786ca5433a3c02a5fe68e4716d41401bb
2021-02-27 02:33:30 +01:00
Bartosz Dziewoński af082908a5 Improve merging multiple comments on one paragraph
The horrendous 11-line if() condition did not correctly handle
signatures wrapped in inline formatting markup, like <small>.

Instead, implement this logic in the code for skipping to the end
of a paragraph, which didn't exist yet when that condition was
added, but seems like a much better place to check this now.

Bug: T275934
Change-Id: I5cccff889b5e15b5f8fde0538bf4bccb22e762cf
2021-02-27 02:21:36 +01:00
Bartosz Dziewoński 35738b1f9b CommentParser: Replace getThreadStartTimestamp with getThreadStartComment
Change-Id: Ia8d878594306b5ce4039ca06d6dcec753e5dea28
2021-02-24 12:26:58 +00:00
Ed Sanders fa484e0c4a Don't allow CommentItem author to be null
Change-Id: Idb12bfa62e42bff521e872ab358b5ba9a8d24089
2021-02-22 20:55:35 +00:00
Bartosz Dziewoński 1998c983f1 computeId() can't return null
It used to return null for headings, but now it doesn't. Simplify some
code checking for that.

Change-Id: I28131c4aee89b901879b4c49953d6b15ed91b5e7
2021-02-13 00:08:15 +01:00
Ed Sanders d05109b24d Truncate user generated part of IDs to 80 characters
This ensures that IDs fit in a 255 character database field.

Bug: T273658
Change-Id: I3cfe4fce6a865b4343f0f01121cd696aa5f98b22
2021-02-03 15:04:58 +00:00
Bartosz Dziewoński c781b127c9 Handle category links at ends of comments affecting indentation
* Ignore rendering-transparent nodes between discussion comments.
* Improve isRenderingTransparentNode() so that <link> nodes
  representing TemplateStyles are not considered transparent,
  otherwise this would undo ae920b831f.
  Using a regexp from Parsoid.

Bug: T272746
Change-Id: I0b3c3251156ba6c4826abf5ba44ea93f80ebc01d
2021-01-26 04:55:03 +01:00
Bartosz Dziewoński 8f42c74985 Fix skipping to the end of paragraph, now it considers nested tags
Add yet another tree walking utility: CommentUtils::linearWalk().
Unlike TreeWalker, it allows handling the beginnings and ends of nodes
separately – kind of like parsing a XML token stream, or kind of like
VisualEditor's linear model.

(Add unit tests for this utility. The simple.html test case is copied
from [VisualEditor/VisualEditor]/demos/ve/pages/simple.html.)

Use this utility to stop skipping when we reach either a closing or
opening block node tag. Previously we'd skip over such tags inside
nested "transparent" nodes (like <a>, <del>, or apparently <font>).

Bug: T271385
Change-Id: I201a942eb3a56335e84d94e150ec2c33f8b4f4e0
2021-01-18 18:20:20 +00:00
Bartosz Dziewoński 50ad5bb2b4 Ignore outdent templates at the beginning of comments
Bug: T264116
Change-Id: Iae9dbb30a1aead897cc274f655d3ecff4b297dbd
2020-12-14 21:35:56 +01:00
Bartosz Dziewoński ae920b831f Change which nodes are ignored at the beginning of comments again
While working on T270009, I noticed that <style> and <link> nodes
are treated differently, which seemed weird. Rewrite this again,
hopefully this is the last time.

The changed test cases also involve <area> and <input> nodes,
and the new results make more sense to me.

Bug: T264116
Change-Id: I3af90c84768a4b3dc53446927f4dba6f72175a2f
2020-12-14 21:33:50 +01:00
Bartosz Dziewoński 0fc71f60cd Skip to the end of the paragraph if it's just text, too
We've recently decided that we want to "extend" comments until
the end of the paragraph (e36dc8e78a,
d0ae6c4e44).

However, we still had this special case that did the opposite: it
ensured that if a comment ended in the middle of a text node, the
comment would not be extended to the end of the node. Remove it.

Note the change in the test file signatures-funny-formattedreply.html,
which actually covered this case specifically.

Change-Id: Id1384bb0c6e1a5f0c70f55efcb4caa240f230f07
2020-11-25 00:48:53 +01:00
Ed Sanders d0ae6c4e44 Skip end marker "forward" until a block tag is reached
The end marker is skipped forward until an open or close
block tag is reached. In tree traversal terms this means
moving either to the next sibling, or the parent (to skip
over close tags).

Bug: T256033
Change-Id: Iaa2c588698790d576ac4f9ecc126f58a082ef6b3
2020-11-23 15:08:29 +00:00
Ed Sanders 44a1bbcc59 Fix start node for comments following headings
The general rule is that comments start after their preceding
thread item, but when that is a heading we should skip past
the entire <h[1-6]> node to avoid making section edit links
part of the first comment.

Bug: T267988
Change-Id: Ia7f1b27e0a69a9aab7c7da743bf8549479304096
2020-11-19 23:48:30 +00:00
Thiemo Kreuz 8ffe0d55da Remove comments that literally repeat what the code says
Change-Id: Ib928cf61dc512fbbf39a3279789376d635a82c52
2020-11-11 09:31:59 +01:00
jenkins-bot e378a9122b Merge "Don't detect comments within headings" 2020-11-05 16:56:02 +00:00
Bartosz Dziewoński bed717d329 Move getHeadlineNodeAndOffset() to utils
Needed by I7d35098d672d0edb50d49e22de1686d5cc83b60e.

Change-Id: I44bf927213de570fe9de43e485e09cfae6778eef
2020-11-05 16:11:30 +01:00
Bartosz Dziewoński 986e83ee61 Fix getHeadlineNodeAndOffset() returning text nodes
The condition was wrong, it could return either an element child with
.mw-headline, or a non-element child.

Bug: T267284
Change-Id: I28cda22ee8c5fe4a3259621adddd647b31291703
2020-11-05 16:09:35 +01:00
Bartosz Dziewoński 1626242863 Don't detect comments within headings
Bug: T267068
Change-Id: Id134f15e086fd070801c4b1d836dbfbf9bf444ad
2020-11-04 21:57:33 +01:00
libraryupgrader fb6706a606 build: Updating mediawiki/mediawiki-codesniffer to 32.0.0
The following sniffs are failing and were disabled:
* MediaWiki.Commenting.PropertyDocumentation.MissingDocumentationPrivate
* MediaWiki.Commenting.PropertyDocumentation.MissingDocumentationProtected
* MediaWiki.Commenting.PropertyDocumentation.MissingDocumentationPublic

Additional changes:
* Dropped .inc files from .phpcs.xml (T200956).

Change-Id: I340d6b573e9ae2a99085fb19a705fcf567b03f92
2020-10-29 10:53:01 +00:00
Ed Sanders 3aca622894 Treat headings like comments now they have IDs
Use the same logic for marking ranges in the document, and ensure
that the heading range does not include section edit links or
section numberings.

Change-Id: I782caafc34fee2a822b0a17b24dd6b9528202eca
2020-10-28 12:38:18 +00:00
Bartosz Dziewoński ba8434e2e0 Add legacy IDs as of wmf/1.36.0-wmf.14
Bug: T264478
Change-Id: I099e1068fedc25d671cd1245ac8b32941dca7232
2020-10-22 22:52:59 +02:00
Bartosz Dziewoński 0ddc171c8a Add oldest timestamp in the thread to heading IDs
To avoid old threads re-appearing on popular pages when someone
uses a vague title (e.g. dozens of threads titled "question" on
[[Wikipedia:Help desk]]: https://w.wiki/fbN), include the oldest
timestamp in the thread (i.e. date the thread was started) in the
heading ID.

Bug: T264478
Change-Id: If918bfd5e025248923d1939bc86916697ead95a0
2020-10-22 02:19:21 +02:00
Bartosz Dziewoński b09bbfe668 Disambiguate comments by parent ID, rather than sequential numbers
Sequential numbers aren't great because they change when an earlier
comment is archived. Parent comment/heading IDs should change less
often.

This also makes much more sense for disambiguating subsections,
e.g. a dozen identical ===Votes=== sections for a dozen proposals.

Bug: T264478
Change-Id: I466454984fd919ebef35f2b37ddb5d86dc842996
2020-10-22 02:19:21 +02:00
Bartosz Dziewoński 3137d76f40 Connect sub-threads to their parent threads
Our threads now also contain all replies to their sub-threads.
This is similar to how sections work in MediaWiki, where the parent
section also contains the content of all the lower-level sections.
We're going to need this for notifications about replies in a thread.

Bug: T264478
Change-Id: I241fc58e2088a7555942824b0f184ed21e3a8b6f
2020-10-22 02:05:02 +02:00
Bartosz Dziewoński 9ee0fd69f5 Allow headings to have IDs
Previously, only comments could have IDs, because we only needed IDs
for replying. But we might also use them for notifications soon.

Bug: T264478
Change-Id: I1bcad02bf17ab54bc5028a959543c10f0430836b
2020-10-22 02:04:28 +02:00
Bartosz Dziewoński 6719d17364 Handle cached "legacy" IDs (and other JSON-serialized data)
The output of CommentFormatter::addReplyLinks() and consequently
ThreadItem::jsonSerialize() can end up in the HTTP cache (Varnish) on
Wikimedia wikis. We need to consider that when changing that code.

Introduce a concept of legacy ID (generated by the older algorithm
after it changes), add some placeholder code that will generate them
in the future, and update some code to find comments by either normal
or legacy IDs.

Add dire comments in a bunch of places (as if that ever helps).

Bug: T264478
Change-Id: I4368f366800ab21b8b184b09378037614fdecd33
2020-10-22 00:53:06 +02:00
Bartosz Dziewoński 3b8d63467e CommentParser: Remove confused comments about references and objects
"This modifies the original objects…" – I feel like this is obvious
now, but maybe it wasn't so obvious when this code was structured
differently before a2431fe006. Also,
it refers to a variable that doesn't exist.

"FIXME this will clone the reply…" – No, actually, it will not.
It would if replies were associative arrays, but they are objects,
and have always been, ever since the PHP parser was merged in
7b7a2cd69c. Maybe they were arrays
once in Roan's mind before he pushed that for review.

Change-Id: I1348e111699fdbde99cd1f9ef45d8f465f7391b0
2020-10-21 21:01:27 +02:00
Bartosz Dziewoński 88b5be11fd CommentParser: Avoid unnecessary reference in foreach()
This is not necessary, and never has been. This variable contains an
object and it's never assigned to.

Instead, the reference creates hard-to-debug bugs (I've just spent
an hour debugging one). When the variable name is reused later in
the same function as the loop variable of another foreach() loop
(such as in If918bfd5e0), the result is overwriting of the last entry
in $this->threadItems with the last entry from the other array.
I was questioning everything I know about variables until I noticed.

Change-Id: Ibb57f915b39dd4d6d2e744903f9ecadd67b1f52d
2020-10-15 17:58:06 +02:00
Bartosz Dziewoński ed17f640b6 Ignore other empty-ish things at the beginning of comments
Follow-up to 432a959436.

Bug: T264116
Change-Id: I0685cafab70c7e9d22f504f1a1309c9a28d6f2e1
2020-09-30 23:42:47 +02:00
Bartosz Dziewoński 17b7a481a2 Fix detecting username from the wrong links sometimes
When a timestamp directly followed a `<div>…</div>` tag (or perhaps
some other wrapper containing lots of content), we would detect the
username from the earliest links in the wrapper (furthest from the
timestamp), rather than the latest links (closest to the timestamp).

Bug: T262573
Change-Id: Id16449a86a731b13dc79846bb30ecf6554e26f1d
2020-09-29 22:31:24 +02:00
Bartosz Dziewoński 432a959436 Ignore empty paragraphs at the beginning of comments
The wikitext parser outputs `<p><br></p>` for empty paragraphs, so we
need to ignore `<br>` tags when searching for an "interesting" node
that marks the beginning of a comment. Otherwise the empty paragraphs
mess up the detection of indentation levels.

Bug: T264116
Change-Id: I84a97ab577baa7336b78935ccdc48041ecfc231a
2020-09-29 22:22:35 +02:00
Bartosz Dziewoński 329df8c953 Parsing discussions converted to language variants
* Export parser data (date format, digits, timezone names, and
  messages for weekday/month names) converted to language variants
* Update the parsers to try matching using every variant, in case
  the page is displayed in non-default variant (and to avoid
  problems with incomplete variant conversion)

Bug: T259818
Change-Id: I04d73992cd31ce06fa79f87df0c0a53d7efc3c58
2020-09-16 22:07:07 +00:00
Ed Sanders 92a8ca3469 Documentation fix
Change-Id: Ic37bb713bed8af7390a3e8be7ea0203b4687ce0e
2020-09-15 01:38:54 +02:00
Bartosz Dziewoński 14fb013515 Match handling of "signature scan limit" between JS and PHP
PHP was counting UTF-8 bytes, JS was counting UTF-16 bytes.
Both should have been counting codepoints (although it doesn't
really matter as long as they both count the same things).

I noticed the issue after adding some tests using the Cyrillic
script, when one case had different results in PHP and JS:
Id25b537fecd789640c209ff7f30e777455a3aece.

Change-Id: Ic31240678f71ba48e6ec202126bf490cea12bb66
2020-09-08 03:27:01 +02:00
Bartosz Dziewoński 2d3fe47ac1 Fix parsing localised digits in PHP discussion parser
The PHP code incorrectly assumed that the digits are single-byte in
UTF-8, which is never the case (except for 0-9).

The JS code worked correctly because it uses UTF-16 strings, so the
bug would only affect non-BMP digits there. This was noted in a TODO
comment, but we overlooked it when reimplementing in PHP.

Instead of a string of 10 characters, use an array of 10
single-character strings.

Bug: T261706
Change-Id: Ic5421382474c88f003424799c53ff473d99cce92
2020-09-01 01:50:33 +02:00
Bartosz Dziewoński e36dc8e78a Skip to the end of the paragraph in the parser, not modifier
When a comment ended before the end of a paragraph, the next
comment would begin right there in the middle of the paragraph.
This could result in the detected indentation level of that
comment being incorrect, and replies being inserted in wrong
places, as seen in the 'signatures-funny' test case.

The code moved to the parser was previously repeated twice in
addListItem() and addReplyLink(), which should have been a hint
that something isn't quite right.

Also, fix the code guarding against overlapping signatures,
now that signatures may not be at the end of a comment.

Bug: T260855
Change-Id: Ic26a87642f8a15d5de2f7073d4d8176b299c7f94
2020-08-20 19:35:55 +00:00
Bartosz Dziewoński 84cb9d1dca parser: Code quality tweaks
Do things in a more intuitive order, avoid some repetition,
rename a vaguely named variable.

Change-Id: Ic1a0bb54134682eaf126231e04eb67847d6a5da6
2020-08-20 20:52:42 +02:00
Bartosz Dziewoński 375bfe028e parser: Fix comment ranges when timestamp has entities
Previously, parser would output offsets that don't exist in their
containers, because we were pretending that entities are parts of
their neighboring text nodes.

Turns out it's much easier to do it right when going backwards.

Change-Id: I9bccca2d403f1a976ae517449989170cdd99721e
2020-08-11 20:41:06 +02:00
Bartosz Dziewoński 7cd370615f Reindent CommentParser::findTimestamp()
Something terrible has happened to this function… It seems that I have
brutalized it when rebasing 092cfd6075.

Change-Id: I12d75c69d15645112563a7bc345209b23b54cb3e
2020-08-11 06:45:45 +02:00
Bartosz Dziewoński 31b26a5bec Fix indentation level when replying to comments with mixed indentation
When adding a reply, we take a node at the end of the previous comment,
compare that comment's indentation level to the expected indentation level
of the reply, and add (or remove) that number of wrapper lists.

The existing code did not consider that comments may have lists within
them, and so the indentation of that node may not match the indentation
of the comment.

Bug: T252702
Change-Id: Icc5ff19783d2b213bff99f283cb0599a8b5c1ab4
2020-08-06 01:25:33 +02:00
Ed Sanders a2431fe006 Refactor CommentParser
* Pass rootNode to the constructor
* Rename getters to match CommentItem/HeadingItem/ThreadItem
  value classes.
* Always build the thread tree so CommentItem's always have
  and ID and replies/parent.

Change-Id: I508be9534de59016ff806e3d84edcbb1c76cb0c6
2020-07-20 23:38:10 +01:00
Ed Sanders a4636d39fc Move #getTranscludedFrom from parser to ThreadItem
Also requires moving getTitleFromUrl to CommentUtils

Change-Id: I9cb83a3fdd456eba66899433b866ce7a7f00eeb5
2020-07-20 15:56:48 +01:00
Ed Sanders 7ae5bbf384 Move #getAuthors from parser to ThreadItem
Change-Id: I16e513000e5366b3044b17a99da07d8d0f47a61f
2020-07-20 15:13:59 +01:00
Ed Sanders b32f991913 Documentation fixes
Change-Id: I2c7ccecbf8a50bd4d658b0f17f4a21fe90a3c399
2020-07-20 13:34:08 +01:00
Ed Sanders 092cfd6075 Parser: Replace findTimestamps with findTimestamp
Instead of doing a separate tree walk and finding all timestamps
separately, make it part of the getComments tree walk, and find
timestamps one at a time.

Change-Id: I47f466eaf228504faa189fd99e07493bc7f022cd
2020-07-15 21:34:22 +02:00
Bartosz Dziewoński 308c2747b0 CommentParser.php: Use tree walking instead of XPath
This is similar to what the JS version does.

The TreeWalker and NodeFilter classes are adapted from
https://github.com/Krinkle/dom-TreeWalker-polyfill
(MIT license).

This makes #getComments twice as fast on en-big-oldparser.html

Change-Id: I2441f33e6e7bad753ac830d277e6a2e81ee8c93d
2020-07-15 16:40:50 +00:00
Ed Sanders ed70d49285 CommentParser.php: Fix URL parsing
Change-Id: I406fd98b308dd4d975ea974f2369737a7052b556
2020-07-01 17:06:02 +01:00
Ed Sanders d5376e28fc Improve ThreadItem documentation
Change-Id: Ia266fc22b02af0edbb32f356b4e0d92fe3a4da5f
2020-06-26 14:56:19 +02:00
James D. Forrester d6c3df31f5 Remove various phan suppressions and fix issues
Change-Id: I73b535f284566a0a8876a3198b9784b47567fac6
2020-06-12 20:35:59 +01:00
Ed Sanders 7be0cc3209 Create ThreadItem classes
Change-Id: Id2c5324d74eccb1209ccb76768c557722c6d9400
2020-06-12 20:35:59 +01:00
Umherirrender 48e860916a build: Add mediawiki/mediawiki-phan-config
Replace phan-taint-check-plugin by phan, it is now included

Change-Id: I0e682a83afd30faa8967e3c586431be4ae9a29b3
2020-06-10 22:21:07 +02:00
Ed Sanders da433037a3 Move getTranscludedFromElement to Utils
Change-Id: I8bdd757f949c013ba426150a192d71243fadf45d
2020-06-01 22:32:23 +01:00
Bartosz Dziewoński 79ae8a32c5 Support parsing when timestamp is wrapped in a link
Bug: T252059
Change-Id: Ib8952fb80503bad407e8d0fe725103a0fae12a6a
2020-05-27 22:47:17 +02:00
Bartosz Dziewoński 01b4a8f4f4 Support replying when timestamp is template-generated
* Move modifier#getFullyCoveredWrapper to utils
* Use that method to find the node where we start searching for
  template wrappers, rather than using endContainer

Bug: T252058
Change-Id: I55de58102f3468fce01290bd413a7fdc96d322d6
2020-05-27 21:16:03 +02:00
Ed Sanders b3ca37c1c5 Create ImmutableRange class in PHP
TODO: Create one in JS as well

Change-Id: I6c9dc2455afcb8d0b68674a2985c5e43dd94b6fb
2020-05-22 15:01:09 +01:00
Bartosz Dziewoński 515af82061 Reduce duplication between PHP parser and data gen for JS parser
Also, make the handling of TranslateNumerals and digitsRegexp the same
between PHP and JS.

Change-Id: I1d81343d0b59ab3ecd59ba1c2ad99a729d983ac4
2020-05-19 20:54:44 +02:00
jenkins-bot 366aca2ccd Merge "Stop printing console warnings" 2020-05-18 22:52:30 +00:00
Bartosz Dziewoński 219339551c Stop printing console warnings
It was useful when I was debugging those parts of the code, but now
it's usually annoying.

The warnings can still sometimes be useful for understanding how the
tool parses some discussion, though. To keep that functionality, add
displaying warnings for each comment in the debug mode.

Change-Id: I2d218a8a394f179bcc0990ff988a0567c275ccf2
2020-05-18 23:37:37 +02:00
Ed Sanders 607440498e Spell check pass
Change-Id: Ia20da358297126bd52a968bd77c960f81fe82b8f
2020-05-18 21:24:14 +00:00
Bartosz Dziewoński c848d8a90e Parser tweaks
Follow-up to Ic1438d516e223db462cb227f6668e856672f538c.
Minor corrections and comment improvements in PHP parser,
and "backporting" some changes to JS parser that I like.

Change-Id: I5e54121914ec6b323e556dd133bcb71b3aefbb61
2020-05-18 19:53:26 +00:00