Commit graph

56 commits

Author SHA1 Message Date
Bartosz Dziewoński 13ab1db6da Don't count leading/trailing whitespace against signature scan limit
It's an arbitrary limit, it seems harmless to relax it to support the
use case in the task, even if it's weird.

Bug: T300949
Change-Id: I7c895c7019726758bbae3183b9c3ecbd9eabcf38
2022-02-04 19:35:29 +00:00
Ed Sanders 0b42aea276 CommentParser: Cache variables in getUsernameFromLink
Change-Id: I625e6ded3badd75a7a658c8d000576d0d165a18b
2022-02-04 19:35:18 +00:00
Ed Sanders 8ad1df7dc8 CommentParser: Name parts of return value from findSignature
Change-Id: I3a5ad36df0afdedc0aa9a15e5d83c5426b03b790
2022-02-04 19:34:18 +00:00
Ed Sanders f80ff74fc6 Handle selflinks by returning the current page's title
Bug: T287818
Change-Id: I67f10ac9976581279d1e6a477e90d55875ebab20
2022-01-12 21:18:04 +00:00
Ed Sanders 34011b7a07 Parser: Pass in title of page being parsed
Will be used to parse selflinks in the future.

Change-Id: I2bc29d1c5c69cb6309f582f162f9af7d96ce8913
2022-01-12 21:17:59 +00:00
Ed Sanders 2e1241289c Better document {Object} types
Change-Id: Ibfaf2ded443301c68552dbf98a1897a50bda9ef5
2021-12-20 17:25:54 +00:00
Bartosz Dziewoński ef7274d69e Move some helpers from CommentParser to CommentUtils
Change-Id: I0e323d3b75f47459a5548a13e9684f4c6ff4ba0c
2021-12-13 17:13:41 +01:00
Ed Sanders 7c3e583bec build: Update eslint-config-wikimedia to 0.21.0
Change-Id: I72de463d5a878e555eeed0e7ce2772e1d3a46f06
2021-11-08 19:03:40 +00:00
Ed Sanders f4c12e120a Define documentable types in eslintrc instead of inline
These types can be passed a parameters to any file without
creating a dependency, so it makes more sense to allow
the globally.

Change-Id: I5504465fd997b46547642e7046993b370b85586e
2021-10-17 14:38:39 +01:00
Bartosz Dziewoński f35bf487ef Take over extra links to add a new topic added by gadgets/templates
* Move getTitleFromUrl() from parser to utils. It's a generic method,
  the PHP equivalent is already in utils.

Bug: T277371
Change-Id: Id960e5f60af02bdeb0a3a68f43b7a695eb035139
2021-06-30 18:06:39 +02:00
Ed Sanders 1893405635 Code style: Move var declarations inline
Change-Id: I1686603388b050ba4ec22eff23e4806cdf262b87
2021-04-22 17:43:46 +00:00
Bartosz Dziewoński 42ce942c86 Introduce comment "names" to identify comments across revisions/pages
The existing comment IDs can't be used to find the same comment on
a different revision or page (when it's transcluded), because they
depend on the comment's parent and its position on the page.

Comment names depend only on the author and timestamp. The trade-off
is that they can't distinguish comments posted within the same minute,
or in the same edit, so we will still need the IDs sometimes.

Prefer using comment names when replying, if they're not ambiguous.
This fixes T273413 and T275821.

Heading names depend on the author and timestamp of the oldest comment.
This way we don't have to detect changes to the heading text, but we
can't distinguish headings without any comments.

Bug: T274685
Bug: T273413
Bug: T275821
Change-Id: Id85c50ba38d1e532cec106708c077b908a3fcd49
2021-03-23 16:08:42 +00:00
Ed Sanders 4a0802065c Make IDs (to be used as URL hashes) wikitext safe
* Use hyphens instead of pipes a separators
* Use underscores for spaces in usernames

Change-Id: I6efd9739fc73e45002e50e64c43ce0de1c2f1239
2021-03-18 20:45:21 +01:00
Bartosz Dziewoński f5059e6ea6 Don't detect comments within 'cite' elements too
Follow-up to 024a978ffd.

Bug: T275881
Change-Id: I53448ad22cd0531e7fd4aa0ea5d15782879cce14
2021-03-01 21:40:43 +01:00
jenkins-bot 0eb37a87df Merge "Don't detect comments within quotes" 2021-02-28 22:56:20 +00:00
Bartosz Dziewoński 024a978ffd Don't detect comments within quotes
Bug: T275881
Change-Id: I8f7a4279837bd95ebf5b604ff350c0a3f29c2c05
2021-02-28 22:49:48 +00:00
Bartosz Dziewoński efe95494a8 Improve signature detection to handle formatting on the timestamp
Now it detect signatures generated by en.wp's {{Undated}} template,
and signatures of people who do weird stuff to the timestamps.

Bug: T275938
Change-Id: I27b07f6786ca5433a3c02a5fe68e4716d41401bb
2021-02-27 02:33:30 +01:00
Bartosz Dziewoński af082908a5 Improve merging multiple comments on one paragraph
The horrendous 11-line if() condition did not correctly handle
signatures wrapped in inline formatting markup, like <small>.

Instead, implement this logic in the code for skipping to the end
of a paragraph, which didn't exist yet when that condition was
added, but seems like a much better place to check this now.

Bug: T275934
Change-Id: I5cccff889b5e15b5f8fde0538bf4bccb22e762cf
2021-02-27 02:21:36 +01:00
Bartosz Dziewoński 35738b1f9b CommentParser: Replace getThreadStartTimestamp with getThreadStartComment
Change-Id: Ia8d878594306b5ce4039ca06d6dcec753e5dea28
2021-02-24 12:26:58 +00:00
Bartosz Dziewoński 1998c983f1 computeId() can't return null
It used to return null for headings, but now it doesn't. Simplify some
code checking for that.

Change-Id: I28131c4aee89b901879b4c49953d6b15ed91b5e7
2021-02-13 00:08:15 +01:00
Ed Sanders d05109b24d Truncate user generated part of IDs to 80 characters
This ensures that IDs fit in a 255 character database field.

Bug: T273658
Change-Id: I3cfe4fce6a865b4343f0f01121cd696aa5f98b22
2021-02-03 15:04:58 +00:00
Ed Sanders 6c3dd3aaa9 Move Hooks to HookUtils
Now that all the real hooks have been separated out

Change-Id: Ibdb42f98614fc551068f8f8e5297dcc99251ab46
2021-02-01 22:35:11 +00:00
Bartosz Dziewoński c781b127c9 Handle category links at ends of comments affecting indentation
* Ignore rendering-transparent nodes between discussion comments.
* Improve isRenderingTransparentNode() so that <link> nodes
  representing TemplateStyles are not considered transparent,
  otherwise this would undo ae920b831f.
  Using a regexp from Parsoid.

Bug: T272746
Change-Id: I0b3c3251156ba6c4826abf5ba44ea93f80ebc01d
2021-01-26 04:55:03 +01:00
Bartosz Dziewoński 8f42c74985 Fix skipping to the end of paragraph, now it considers nested tags
Add yet another tree walking utility: CommentUtils::linearWalk().
Unlike TreeWalker, it allows handling the beginnings and ends of nodes
separately – kind of like parsing a XML token stream, or kind of like
VisualEditor's linear model.

(Add unit tests for this utility. The simple.html test case is copied
from [VisualEditor/VisualEditor]/demos/ve/pages/simple.html.)

Use this utility to stop skipping when we reach either a closing or
opening block node tag. Previously we'd skip over such tags inside
nested "transparent" nodes (like <a>, <del>, or apparently <font>).

Bug: T271385
Change-Id: I201a942eb3a56335e84d94e150ec2c33f8b4f4e0
2021-01-18 18:20:20 +00:00
Bartosz Dziewoński 50ad5bb2b4 Ignore outdent templates at the beginning of comments
Bug: T264116
Change-Id: Iae9dbb30a1aead897cc274f655d3ecff4b297dbd
2020-12-14 21:35:56 +01:00
Bartosz Dziewoński ae920b831f Change which nodes are ignored at the beginning of comments again
While working on T270009, I noticed that <style> and <link> nodes
are treated differently, which seemed weird. Rewrite this again,
hopefully this is the last time.

The changed test cases also involve <area> and <input> nodes,
and the new results make more sense to me.

Bug: T264116
Change-Id: I3af90c84768a4b3dc53446927f4dba6f72175a2f
2020-12-14 21:33:50 +01:00
Bartosz Dziewoński 113ad01d5e Consistent use of isBlockElement() in JS code
(Consistent compared to our PHP code)

Change-Id: I4aba6f22cc7da2da30ccd6346e4390b14bd347af
2020-12-05 22:48:34 +01:00
Bartosz Dziewoński 0fc71f60cd Skip to the end of the paragraph if it's just text, too
We've recently decided that we want to "extend" comments until
the end of the paragraph (e36dc8e78a,
d0ae6c4e44).

However, we still had this special case that did the opposite: it
ensured that if a comment ended in the middle of a text node, the
comment would not be extended to the end of the node. Remove it.

Note the change in the test file signatures-funny-formattedreply.html,
which actually covered this case specifically.

Change-Id: Id1384bb0c6e1a5f0c70f55efcb4caa240f230f07
2020-11-25 00:48:53 +01:00
Ed Sanders d0ae6c4e44 Skip end marker "forward" until a block tag is reached
The end marker is skipped forward until an open or close
block tag is reached. In tree traversal terms this means
moving either to the next sibling, or the parent (to skip
over close tags).

Bug: T256033
Change-Id: Iaa2c588698790d576ac4f9ecc126f58a082ef6b3
2020-11-23 15:08:29 +00:00
Ed Sanders 44a1bbcc59 Fix start node for comments following headings
The general rule is that comments start after their preceding
thread item, but when that is a heading we should skip past
the entire <h[1-6]> node to avoid making section edit links
part of the first comment.

Bug: T267988
Change-Id: Ia7f1b27e0a69a9aab7c7da743bf8549479304096
2020-11-19 23:48:30 +00:00
jenkins-bot e378a9122b Merge "Don't detect comments within headings" 2020-11-05 16:56:02 +00:00
Bartosz Dziewoński bed717d329 Move getHeadlineNodeAndOffset() to utils
Needed by I7d35098d672d0edb50d49e22de1686d5cc83b60e.

Change-Id: I44bf927213de570fe9de43e485e09cfae6778eef
2020-11-05 16:11:30 +01:00
Bartosz Dziewoński 1626242863 Don't detect comments within headings
Bug: T267068
Change-Id: Id134f15e086fd070801c4b1d836dbfbf9bf444ad
2020-11-04 21:57:33 +01:00
Ed Sanders 3aca622894 Treat headings like comments now they have IDs
Use the same logic for marking ranges in the document, and ensure
that the heading range does not include section edit links or
section numberings.

Change-Id: I782caafc34fee2a822b0a17b24dd6b9528202eca
2020-10-28 12:38:18 +00:00
Bartosz Dziewoński 044bc50fb6 Fix some TODOs about test data
We avoided fixing these because it causes changes in just about all of
the test data, which is annoying when reviewing or blaming changes.

But the previous several commits also caused changes in just about all
of the test data, so we might as well do this too.

Change-Id: I83b64d83b6f12c04dc06c0cadff7cdd89417e137
2020-10-22 00:21:04 +00:00
Bartosz Dziewoński 0ddc171c8a Add oldest timestamp in the thread to heading IDs
To avoid old threads re-appearing on popular pages when someone
uses a vague title (e.g. dozens of threads titled "question" on
[[Wikipedia:Help desk]]: https://w.wiki/fbN), include the oldest
timestamp in the thread (i.e. date the thread was started) in the
heading ID.

Bug: T264478
Change-Id: If918bfd5e025248923d1939bc86916697ead95a0
2020-10-22 02:19:21 +02:00
Bartosz Dziewoński b09bbfe668 Disambiguate comments by parent ID, rather than sequential numbers
Sequential numbers aren't great because they change when an earlier
comment is archived. Parent comment/heading IDs should change less
often.

This also makes much more sense for disambiguating subsections,
e.g. a dozen identical ===Votes=== sections for a dozen proposals.

Bug: T264478
Change-Id: I466454984fd919ebef35f2b37ddb5d86dc842996
2020-10-22 02:19:21 +02:00
Bartosz Dziewoński 3137d76f40 Connect sub-threads to their parent threads
Our threads now also contain all replies to their sub-threads.
This is similar to how sections work in MediaWiki, where the parent
section also contains the content of all the lower-level sections.
We're going to need this for notifications about replies in a thread.

Bug: T264478
Change-Id: I241fc58e2088a7555942824b0f184ed21e3a8b6f
2020-10-22 02:05:02 +02:00
Bartosz Dziewoński 9ee0fd69f5 Allow headings to have IDs
Previously, only comments could have IDs, because we only needed IDs
for replying. But we might also use them for notifications soon.

Bug: T264478
Change-Id: I1bcad02bf17ab54bc5028a959543c10f0430836b
2020-10-22 02:04:28 +02:00
Bartosz Dziewoński 6719d17364 Handle cached "legacy" IDs (and other JSON-serialized data)
The output of CommentFormatter::addReplyLinks() and consequently
ThreadItem::jsonSerialize() can end up in the HTTP cache (Varnish) on
Wikimedia wikis. We need to consider that when changing that code.

Introduce a concept of legacy ID (generated by the older algorithm
after it changes), add some placeholder code that will generate them
in the future, and update some code to find comments by either normal
or legacy IDs.

Add dire comments in a bunch of places (as if that ever helps).

Bug: T264478
Change-Id: I4368f366800ab21b8b184b09378037614fdecd33
2020-10-22 00:53:06 +02:00
Bartosz Dziewoński 3b8d63467e CommentParser: Remove confused comments about references and objects
"This modifies the original objects…" – I feel like this is obvious
now, but maybe it wasn't so obvious when this code was structured
differently before a2431fe006. Also,
it refers to a variable that doesn't exist.

"FIXME this will clone the reply…" – No, actually, it will not.
It would if replies were associative arrays, but they are objects,
and have always been, ever since the PHP parser was merged in
7b7a2cd69c. Maybe they were arrays
once in Roan's mind before he pushed that for review.

Change-Id: I1348e111699fdbde99cd1f9ef45d8f465f7391b0
2020-10-21 21:01:27 +02:00
Bartosz Dziewoński ed17f640b6 Ignore other empty-ish things at the beginning of comments
Follow-up to 432a959436.

Bug: T264116
Change-Id: I0685cafab70c7e9d22f504f1a1309c9a28d6f2e1
2020-09-30 23:42:47 +02:00
Bartosz Dziewoński 17b7a481a2 Fix detecting username from the wrong links sometimes
When a timestamp directly followed a `<div>…</div>` tag (or perhaps
some other wrapper containing lots of content), we would detect the
username from the earliest links in the wrapper (furthest from the
timestamp), rather than the latest links (closest to the timestamp).

Bug: T262573
Change-Id: Id16449a86a731b13dc79846bb30ecf6554e26f1d
2020-09-29 22:31:24 +02:00
Bartosz Dziewoński 432a959436 Ignore empty paragraphs at the beginning of comments
The wikitext parser outputs `<p><br></p>` for empty paragraphs, so we
need to ignore `<br>` tags when searching for an "interesting" node
that marks the beginning of a comment. Otherwise the empty paragraphs
mess up the detection of indentation levels.

Bug: T264116
Change-Id: I84a97ab577baa7336b78935ccdc48041ecfc231a
2020-09-29 22:22:35 +02:00
jenkins-bot 636ca06e7e Merge "Fix parsing links in Parsoid documents without short URLs" 2020-09-17 20:40:40 +00:00
Bartosz Dziewoński 329df8c953 Parsing discussions converted to language variants
* Export parser data (date format, digits, timezone names, and
  messages for weekday/month names) converted to language variants
* Update the parsers to try matching using every variant, in case
  the page is displayed in non-default variant (and to avoid
  problems with incomplete variant conversion)

Bug: T259818
Change-Id: I04d73992cd31ce06fa79f87df0c0a53d7efc3c58
2020-09-16 22:07:07 +00:00
Ed Sanders 92a8ca3469 Documentation fix
Change-Id: Ic37bb713bed8af7390a3e8be7ea0203b4687ce0e
2020-09-15 01:38:54 +02:00
Bartosz Dziewoński 14fb013515 Match handling of "signature scan limit" between JS and PHP
PHP was counting UTF-8 bytes, JS was counting UTF-16 bytes.
Both should have been counting codepoints (although it doesn't
really matter as long as they both count the same things).

I noticed the issue after adding some tests using the Cyrillic
script, when one case had different results in PHP and JS:
Id25b537fecd789640c209ff7f30e777455a3aece.

Change-Id: Ic31240678f71ba48e6ec202126bf490cea12bb66
2020-09-08 03:27:01 +02:00
Bartosz Dziewoński 10899af666 Fix parsing links in Parsoid documents without short URLs
Move the code so that we check for "?title=" query parameter first,
because we don't handle this right in the other code path.

Use parse_url() instead of wfParseUrl() because the latter doesn't
accept relative URLs, and we don't care about the other differences.

Bug: T261711
Depends-On: I4da952876e1c3d1a41d06b51f7e26015ff5e34d7
Change-Id: I70fac2b41befd782b0a47a4f726ae748dc0f775d
2020-09-02 23:42:37 +02:00
Bartosz Dziewoński 2d3fe47ac1 Fix parsing localised digits in PHP discussion parser
The PHP code incorrectly assumed that the digits are single-byte in
UTF-8, which is never the case (except for 0-9).

The JS code worked correctly because it uses UTF-16 strings, so the
bug would only affect non-BMP digits there. This was noted in a TODO
comment, but we overlooked it when reimplementing in PHP.

Instead of a string of 10 characters, use an array of 10
single-character strings.

Bug: T261706
Change-Id: Ic5421382474c88f003424799c53ff473d99cce92
2020-09-01 01:50:33 +02:00