Commit graph

86 commits

Author SHA1 Message Date
Bartosz Dziewoński 6a1f2accee Parser: Minor code cleanup
We had some unused variables and roundabout checks.

Change-Id: I454b60ffa05c1cc12288c5de88c849a25aa35447
2024-02-08 16:59:04 +01:00
Bartosz Dziewoński 9db35873a4 Parser: Fix the main loop getting stuck on some signatures
In certain cases the parser could go back rather than forward after
finding a signature, causing it to find the same signature forever
until it ran out of memory.

Test cases coming later in a separate patch.

Bug: T356884
Change-Id: I8ac72b05e5e4ed45e6119c012a69708c9d8eda29
2024-02-07 22:33:57 +01:00
Ed Sanders 8069585489 CommentParser: Ignore generated timestamp links
This will be present in parser cache output and can
sometimes be mistaken for user page links.

Bug: T356142
Change-Id: I800b23d8466f72affcadfa336aab07abf7f8d79e
2024-01-30 10:29:36 +00:00
Ed Sanders 329a268564 ThreadItem.js: Rename getNativeRange to getRange
Makes the class more similar to PHP. The non-native
ranges exist for efficiency, but users will usually
want native ranges.

Change-Id: Ifd7dd034d2e0f3b9af050ecdab3e063df73dde5e
2023-12-15 16:22:52 +00:00
Ed Sanders 4051c7faf4 Ignore signatures with invalid timestamps
Bug: T352455
Change-Id: Ie499db4594bfa23b618907383d0ac583849ff582
2023-12-05 13:23:15 +00:00
Ed Sanders f2f0ec2f65 build: Update linters and fix
Change-Id: Iec16f3330f94d38bb50492b7dcc9207786b964a4
2023-11-28 16:10:47 +00:00
jenkins-bot 048d5364e2 Merge "Replace preg_replace_callback with strtr in CommentParser" 2023-10-31 13:35:19 +00:00
thiemowmde 10dcd1f847 Replace preg_replace_callback with strtr in CommentParser
It does the same as before.

I think performance is not a concern here, and wasn't my motivation
either. But I hope this makes the code easier to read and to reason
with.

I added a pure unit test case (without involving an actual Language
object) to cover the previously uncovered digits feature.

Change-Id: I6a0fc86035817eabb42b55e58183ae094c052aa6
2023-10-31 08:55:40 +01:00
thiemowmde 1491b47b12 Improve performance of CommentParser::getUsernameFromLink
I was curious why running the CommentParserTest takes so long. I
found this is one of the bottlenecks because it's called so often,
but many link titles that are parsed as user names turn out to be
something else. This little hack speeds up the test by 15% and has
probably a similar impact in production scenarios.

Change-Id: I5a0b3a49ba5793c8a345baaa7118fed500c082b6
2023-10-30 17:59:46 +01:00
Theodore Dubois 4ca17b8c33 Support ISO 8601 timestamps in the parser
https://wikipesija.org is currently using ISO 8601 as the default date
format. The format is xnY-xnm-xnd"T"xnH:xni:xns and 'xn', 'm', and 's'
need support added.

Change-Id: I235098a578eb92ddd23ea47fa23d60df4b28f590
2023-06-17 11:36:43 -07:00
Ed Sanders 92f5cfd821 Support suppressing comment detection in pages or sections
This can be done within sections using CSS:
* mw-notalk

Or at a page level using a magic word:
* __NOTALK__

"notalk" suppresses all comment detection, treating the content as
not containing any comments even if there are signatures present.

Bug: T295553
Bug: T249293
Change-Id: Ic1d7294bafcf7071e16838e70684ecadd7bc6fd3
2023-04-03 18:36:34 +02:00
Ed Sanders 2fcc505d50 Parser: Store timestamp ranges
Change-Id: Ifcbe22011f11f4374f38b7aa346da5a96cac968c
2023-03-28 23:51:17 +00:00
Ed Sanders b82af45735 CommentParser: Output display name if different to username
The only normalisation we apply for comparison is lowercasing.

Change-Id: Id3d57c2066429fcedc7dcc091e74ed46e17060f1
2023-02-23 23:03:32 +00:00
Bartosz Dziewoński 3a9997d6ea Improve handling for comment separators
* Detect comment separators at the end of comments too
* Consider TemplateStyles associated with ignored templates

This unexpectedly improves a lot of cases other than T313097 too,
mostly where <br> or {{outdent}} was used within a paragraph:
splitting comments that were previously jumbled together, or restoring
content that was previously ignored for apps / notifications.

Bug: T313097
Change-Id: I9b2ef6b760f2ffd97141ad7000f70919aeab7803
2023-01-10 01:59:52 +00:00
Ed Sanders e24550fae9 Refactor thread summary getters
Replace getThreadSummary with individual getters that call
calculateThreadSummary once.

Change-Id: Ie8a8b4d7cb5121847b78dbc20bca2c8d48c7d857
2022-09-06 23:19:13 +02:00
Ed Sanders 664d5d041a Fix fetching of oldest comment in a thread
The implementation in Parser doesn't descend into sub-thread.
Re-use the getThreadSummary method in ThreadItem and traverse
the thread properly.

Bug: T298617
Change-Id: I318d9012eb83f37ccbe463923524ef2e9f995ced
2022-09-01 21:22:09 +00:00
Ed Sanders 0ad9b4c6b2 Move placeholder heading level (99) to a constant
Change the HeadingItem constructor to take a 'null' headingLevel
and store this internally with the constant. Change the JSON
serializer to convert this back to null.

Change-Id: I27508eed75d94b99c5189548919309f8da7deb75
2022-06-14 22:51:49 +01:00
Bartosz Dziewoński 6a59149132 Ignore LRM and RLM in more places in the timestamp
We previously ignored them before timezone indicator (e9c401e3aa),
but they can end up in other places too, e.g. after the time.

Now we ignore them after every token. This is way overkill, but it
shouldn't hurt.

Bug: T308448
Change-Id: I20f7aaa34dba23f2a2faf1be258c1aea32ab770f
2022-05-17 02:00:22 +02:00
Ed Sanders 579b8bb1d4 Implement getTimestampString on CommentItem
Change-Id: I1768e9993debe904d6a228942ad0188486d65c0b
2022-03-24 16:49:35 +00:00
Bartosz Dziewoński c7723baf72 CommentParser: Replace uses of Title with TitleValue
Another small step towards removing the reliance on global state.

Change-Id: Ifb4a5bcbef6606d02f1c7aa7385d72822cb0bad0
2022-03-18 18:24:34 +00:00
jenkins-bot 32d9ef573a Merge "CommentParser: Avoid using a dynamic undeclared property" 2022-03-10 00:22:16 +00:00
jenkins-bot 76478dda26 Merge "Move signatureScanLimit to a constant in JS" 2022-03-10 00:22:14 +00:00
Bartosz Dziewoński 4c29304484 CommentParser: Avoid using a dynamic undeclared property
Change-Id: Iefa8dea83bc0d31b9c6b3509189eeaa652dd9ea0
2022-03-08 23:30:11 +00:00
Bartosz Dziewoński eb1fe7a8fb CommentParser: Fix redundant uses of getHeadlineNodeAndOffset()
We call CommentUtils::getHeadlineNodeAndOffset() before constructing
the HeadingItem in CommentParser, so the range's startContainer
is always the headline node.

Change-Id: I2afb6ba9100e785cd91f31d82f4cea59fa8b5443
2022-03-08 23:29:34 +00:00
Bartosz Dziewoński 8a2715bdd5 Move signatureScanLimit to a constant in JS
Change-Id: Ieb60c148fd060ab62e4a493e2d0dff6c051f945c
2022-02-21 22:42:14 +01:00
Bartosz Dziewoński 4244418e56 Don't detect comments within references
Bug: T301213
Change-Id: Ifd5198651c8ed0ce53379fb5e35938089cd54a09
2022-02-21 19:57:44 +00:00
Bartosz Dziewoński 8e44b43df0 Split off ThreadItemSet from CommentParser
Goal:
-----
Finishing the work from Iadb7757debe000025e52770ca51ebcf24ca8ee66
by changing CommentParser::parse() to return a data object, instead of
the whole parser.

Changes:
--------
ThreadItemSet.php:
ThreadItemSet.js:
* New data class to access the results of parsing a discussion. Most
  methods and properties are moved from CommentParser with no changes.

CommentParser.php:
Parser.js:
* parse() returns a new ThreadItemSet.
* Remove methods moved to ThreadItemSet.
* Placeholder headings are generated slightly differently, as we process
  things in a different order.
* Grouping threads and computing IDs/names is no longer lazy. We always
  needed IDs/names anyway.
* computeId() explicitly uses a ThreadItemSet to check the existing IDs
  when de-duplicating.

controller.js:
* Move the code for turning some nodes annotated by CommentFormatter
  into a ThreadItemSet (previously a Parser) from controller#init to
  ThreadItemSet.static.newFromAnnotatedNodes, and rewrite it to handle
  assigning parents/replies and recalculating legacy IDs more nicely.
* mw.dt.pageThreads is now a ThreadItemSet.

Change-Id: I49bfe019aa460651447fd383f73eafa9d7180a92
2022-02-21 16:22:32 +00:00
Bartosz Dziewoński 4613ae78e7 Change CommentParser into a service
Goal:
-----
To have a method like CommentParser::parse(), which just takes a node
to parse and a title and returns plain data, so that we don't need to
keep track of the config to construct a CommentParser object (the
required config like content language is provided by services) and
we don't need to keep that object around after parsing.

Changes:
--------
CommentParser.php:
* …is now a service. Constructor only takes services as arguments.
  The node and title are passed to a new parse() method.
* parse() should return plain data, but I split this part to a separate
  patch for ease of review: I49bfe019aa460651447fd383f73eafa9d7180a92.
* CommentParser still cheats and accesses global state in a few places,
  e.g. calling Title::makeTitleSafe or CommentUtils::getTitleFromUrl,
  so we can't turn its tests into true unit tests. This work is left
  for future commits.

LanguageData.php:
* …is now a service, instead of a static class.

Parser.js:
* …is not a real service, but it's changed to behave in a similar way.
  Constructor takes only the required config as argument,
  and node and title are instead passed to a new parse() method.

CommentParserTest.php:
parser.test.js:
* Can be simplified, now that we don't need a useless node and title
  to test internal methods that don't use them.

testUtils.js:
* Can be simplified, now that we don't need to override internal
  ResourceLoader stuff just to change the parser config.

Change-Id: Iadb7757debe000025e52770ca51ebcf24ca8ee66
2022-02-19 19:51:57 +01:00
Bartosz Dziewoński 99b5de8038 Split Data class into ResourceLoaderData and LanguageData
The Data class contained utilities for two unrelated purposes.
Split each half to a separate class.

Notably, this improves the signature of the getLocalData() function.

Change-Id: Icde615fb9d483fee1f352c34909b37f8ffde8081
2022-02-19 19:37:34 +01:00
Bartosz Dziewoński ae9f26a9e5 Various code quality tweaks
(suggested by PhpStorm)

composer.json:
* Document required PHP extensions

Parser.js:
* Remove incorrect param documentation
* Fix some typos in comments (missing parentheses)

CommentParser.php:
* Fix some typos in comments (missing parentheses)

ImmutableRange.php:
* Remove unused property
* Add a `throw` to indicate that code path is unreachable

SubscribedNewCommentPresentationModel.php:
* Add missing `return false`

CommentParserTest.php:
* Remove unnecessary pass-by-reference

CommentModifierTest.php:
* Remove unused variable

CommentParserTest.php:
* Don't construct Element objects directly. PHP's DOMElement allows
  it, but Parsoid/Dodo's doesn't, and we use the latter for static
  analysis. This generates all kinds of confusing warnings.

Change-Id: Ia9598ebea0e99830dd485296e94a9d96acc4b258
2022-02-19 19:36:52 +01:00
Bartosz Dziewoński 13ab1db6da Don't count leading/trailing whitespace against signature scan limit
It's an arbitrary limit, it seems harmless to relax it to support the
use case in the task, even if it's weird.

Bug: T300949
Change-Id: I7c895c7019726758bbae3183b9c3ecbd9eabcf38
2022-02-04 19:35:29 +00:00
Ed Sanders 0b42aea276 CommentParser: Cache variables in getUsernameFromLink
Change-Id: I625e6ded3badd75a7a658c8d000576d0d165a18b
2022-02-04 19:35:18 +00:00
Ed Sanders 8ad1df7dc8 CommentParser: Name parts of return value from findSignature
Change-Id: I3a5ad36df0afdedc0aa9a15e5d83c5426b03b790
2022-02-04 19:34:18 +00:00
Ed Sanders f80ff74fc6 Handle selflinks by returning the current page's title
Bug: T287818
Change-Id: I67f10ac9976581279d1e6a477e90d55875ebab20
2022-01-12 21:18:04 +00:00
Ed Sanders 34011b7a07 Parser: Pass in title of page being parsed
Will be used to parse selflinks in the future.

Change-Id: I2bc29d1c5c69cb6309f582f162f9af7d96ce8913
2022-01-12 21:17:59 +00:00
Ed Sanders 2e1241289c Better document {Object} types
Change-Id: Ibfaf2ded443301c68552dbf98a1897a50bda9ef5
2021-12-20 17:25:54 +00:00
Bartosz Dziewoński ef7274d69e Move some helpers from CommentParser to CommentUtils
Change-Id: I0e323d3b75f47459a5548a13e9684f4c6ff4ba0c
2021-12-13 17:13:41 +01:00
Ed Sanders 7c3e583bec build: Update eslint-config-wikimedia to 0.21.0
Change-Id: I72de463d5a878e555eeed0e7ce2772e1d3a46f06
2021-11-08 19:03:40 +00:00
Ed Sanders f4c12e120a Define documentable types in eslintrc instead of inline
These types can be passed a parameters to any file without
creating a dependency, so it makes more sense to allow
the globally.

Change-Id: I5504465fd997b46547642e7046993b370b85586e
2021-10-17 14:38:39 +01:00
Bartosz Dziewoński f35bf487ef Take over extra links to add a new topic added by gadgets/templates
* Move getTitleFromUrl() from parser to utils. It's a generic method,
  the PHP equivalent is already in utils.

Bug: T277371
Change-Id: Id960e5f60af02bdeb0a3a68f43b7a695eb035139
2021-06-30 18:06:39 +02:00
Ed Sanders 1893405635 Code style: Move var declarations inline
Change-Id: I1686603388b050ba4ec22eff23e4806cdf262b87
2021-04-22 17:43:46 +00:00
Bartosz Dziewoński 42ce942c86 Introduce comment "names" to identify comments across revisions/pages
The existing comment IDs can't be used to find the same comment on
a different revision or page (when it's transcluded), because they
depend on the comment's parent and its position on the page.

Comment names depend only on the author and timestamp. The trade-off
is that they can't distinguish comments posted within the same minute,
or in the same edit, so we will still need the IDs sometimes.

Prefer using comment names when replying, if they're not ambiguous.
This fixes T273413 and T275821.

Heading names depend on the author and timestamp of the oldest comment.
This way we don't have to detect changes to the heading text, but we
can't distinguish headings without any comments.

Bug: T274685
Bug: T273413
Bug: T275821
Change-Id: Id85c50ba38d1e532cec106708c077b908a3fcd49
2021-03-23 16:08:42 +00:00
Ed Sanders 4a0802065c Make IDs (to be used as URL hashes) wikitext safe
* Use hyphens instead of pipes a separators
* Use underscores for spaces in usernames

Change-Id: I6efd9739fc73e45002e50e64c43ce0de1c2f1239
2021-03-18 20:45:21 +01:00
Bartosz Dziewoński f5059e6ea6 Don't detect comments within 'cite' elements too
Follow-up to 024a978ffd.

Bug: T275881
Change-Id: I53448ad22cd0531e7fd4aa0ea5d15782879cce14
2021-03-01 21:40:43 +01:00
jenkins-bot 0eb37a87df Merge "Don't detect comments within quotes" 2021-02-28 22:56:20 +00:00
Bartosz Dziewoński 024a978ffd Don't detect comments within quotes
Bug: T275881
Change-Id: I8f7a4279837bd95ebf5b604ff350c0a3f29c2c05
2021-02-28 22:49:48 +00:00
Bartosz Dziewoński efe95494a8 Improve signature detection to handle formatting on the timestamp
Now it detect signatures generated by en.wp's {{Undated}} template,
and signatures of people who do weird stuff to the timestamps.

Bug: T275938
Change-Id: I27b07f6786ca5433a3c02a5fe68e4716d41401bb
2021-02-27 02:33:30 +01:00
Bartosz Dziewoński af082908a5 Improve merging multiple comments on one paragraph
The horrendous 11-line if() condition did not correctly handle
signatures wrapped in inline formatting markup, like <small>.

Instead, implement this logic in the code for skipping to the end
of a paragraph, which didn't exist yet when that condition was
added, but seems like a much better place to check this now.

Bug: T275934
Change-Id: I5cccff889b5e15b5f8fde0538bf4bccb22e762cf
2021-02-27 02:21:36 +01:00
Bartosz Dziewoński 35738b1f9b CommentParser: Replace getThreadStartTimestamp with getThreadStartComment
Change-Id: Ia8d878594306b5ce4039ca06d6dcec753e5dea28
2021-02-24 12:26:58 +00:00
Bartosz Dziewoński 1998c983f1 computeId() can't return null
It used to return null for headings, but now it doesn't. Simplify some
code checking for that.

Change-Id: I28131c4aee89b901879b4c49953d6b15ed91b5e7
2021-02-13 00:08:15 +01:00