Commit graph

37 commits

Author SHA1 Message Date
Ed Sanders 340572bc05 Create a Utils class in PHP
Also move htmlTrim to utils in JS.

Change-Id: Ia5356d713c1c5d521c396cc28bcd4ecc7ee5bbbb
2020-05-15 00:25:32 +01:00
Bartosz Dziewoński 76289cdf73 tests: Fix failures due to CDATA handling in PHP
It appears PHP's DOM library always uses CDATA nodes for the contents of
<style> tags, even if there is no such markup in the source HTML.

Change-Id: Id04b27086c5e7a0b016a3a440b2b4895d6b13c93
2020-05-14 22:37:23 +00:00
Bartosz Dziewoński 33d69e26c9 tests: Fix different whitespace trimming in PHP and JS
Notably, JS trims the no-break space, while PHP doesn't. There are
some other differences that don't come up in our tests. What we really
want is to trim the ASCII whitespace as defined in the HTML spec.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim
https://www.php.net/manual/en/function.trim.php
https://infra.spec.whatwg.org/#ascii-whitespace

Change-Id: I95b8fb38878716a2fa7ec84c9f2e8065ebe77c0d
2020-05-14 21:37:26 +00:00
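
The whitespace difference described in the commit above can be illustrated with a small sketch. The function name `htmlTrim` appears elsewhere in this log, but the body below is only an assumption about how such a helper might look, not the code from the repository. It trims exactly the five ASCII whitespace characters listed in the WHATWG Infra Standard and leaves the no-break space alone.

```javascript
// Hypothetical sketch, not the repository's implementation.
// ASCII whitespace per the WHATWG Infra Standard:
// U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, U+0020 SPACE.
function htmlTrim( str ) {
	// Strip leading, then trailing, ASCII whitespace only.
	return str.replace( /^[\t\n\f\r ]+/, '' ).replace( /[\t\n\f\r ]+$/, '' );
}

// U+00A0 (no-break space) is not ASCII whitespace, so it survives here,
// whereas String.prototype.trim() would remove it.
var sample = ' \u00A0foo\u00A0\t\n';
var trimmedForHtml = htmlTrim( sample );
var trimmedByJs = sample.trim();
```

Using an explicit character class keeps the behaviour identical wherever the same regular expression is ported, which is the point of the fix.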
Ed Sanders 076c6d828a Parser: Use Element instead of Node when appropriate
Change-Id: I67015989954e61810302bf8931d05419f007ffeb
2020-05-13 14:46:36 +01:00
Ed Sanders 77f1be3bd1 parser.js: Fix some typos and variable names
Change-Id: Ib80fd87893f8fad47eaeac8edabe03472b78b75c
2020-05-08 20:22:49 +01:00
Ed Sanders 4f22e32304 Fix typo in parser.js 'dl' => 'dd'
Doesn't actually affect the behaviour as we are
computing indentation level.

Change-Id: I8b6824460ffbee1c0acb361c26503d3b6c439025
2020-05-08 14:24:15 +01:00
jenkins-bot f0f457562d Merge "Handle &nbsp; and other entities in the timestamp/timezone"
2020-05-07 19:44:12 +00:00
Ed Sanders 44fd8ec0ef Parser: Return mw.Title from getTitleFromUrl
Change-Id: I2c3ce50c5ff7486d6c1f3a0c26cb3a9f1c252b60
2020-05-05 21:34:46 +00:00
Bartosz Dziewoński 0269626adc Handle &nbsp; and other entities in the timestamp/timezone
Bug: T251838
Change-Id: Iba8d7c71e332c63229eec4bc7c80b10627135784
2020-05-05 22:38:16 +02:00
Ed Sanders 5da26e2d0b parser: Have getAuthors return a list
Change-Id: Ieda881a26f802a9e8c73822fcbcc70dd6035bbb6
2020-05-05 12:13:12 +01:00
Bartosz Dziewoński 4ad9aa8449 parser: Don't crash on links to invalid titles
Bug: T251045
Change-Id: I1a7b8b9b34d38ec4e3edc12687bdf99bb0e61bd8
2020-04-27 19:59:26 +02:00
Bartosz Dziewoński dab37fd7b4 parser: Make #getTranscludedFrom return page title in text form
It's more convenient for display or comparing it with other things.

Depends-On: I03bc455d5484a6c51f3fa2397c64936b829fe7e3
Change-Id: I88d7aa68977210b16860075ed52983a5e99ee0f7
2020-03-24 22:29:35 +00:00
Bartosz Dziewoński e9b583d1c3 parser: Improve merging multiple comments on one line
Now also works if the "follow-up" comment is wrapped in e.g. `<small>`.

Change-Id: Ic37cb6afdb42021f109a1818f5c4299d907ed094
2020-03-14 13:34:42 +00:00
Bartosz Dziewoński 04365c0188 Merge RL modules which are only loaded by 'ext.discussionTools.init'
Bug: T240474
Change-Id: I1b83aa18666be8f1ea6a3602b299f92574d42cb7
2020-03-14 14:33:23 +01:00
jenkins-bot 2e0c299a1f Merge "Fix signatureRanges overlapping for some comments"
2020-03-09 20:31:54 +00:00
Bartosz Dziewoński e3e4ef9de4 parser: Detect comments transcluded from another page
When trying to reply to a comment that is inside a transclusion,
detect if it's transcluded from a subpage or simply wrapped in a
template, and show appropriate error messages.

References:
* VisualEditor ve.dm.Converter#getAboutGroup()
* VisualEditor ve.dm.ModelRegistry#matchElement()
* Parsoid Linter#findEnclosingTemplateName()

Bug: T245694
Change-Id: If3dd1ebbf1d02ee4379c200019bfc3a8ec02325b
2020-03-09 20:28:56 +01:00
Bartosz Dziewoński b4029c3c58 Fix signatureRanges overlapping for some comments
If two signatures for a single comment were near each other,
we would sometimes treat them as one huge signature.

Change-Id: Ied4b3aa535a9ca6bebef8a004ae48b7d5a8f2f9b
2020-03-09 13:28:22 +00:00
Bartosz Dziewoński 0ca851aa92 parser: Return signature and timestamp ranges
Currently not used for anything. May be used later for editing
comments (T245225) or reformatting timestamps (T240360).

Note that a comment may have multiple signatures+timestamps,
and we return them all so that you have to deal with that.

Fix some unrelated incorrect documentation comments.

Bug: T245220
Change-Id: I51b8bf4a3bb7968f35e32c7e44c95c2ab079d9ac
2020-03-05 14:28:17 +01:00
Bartosz Dziewoński ea26009896 Work around mw.Uri crash on fallback encoding in links
Bug: T245889
Change-Id: I182f9ffa84a3b3cf4afafd536360572eda9d2714
2020-02-29 19:08:01 +01:00
Bartosz Dziewoński e9c401e3aa Ignore LRM and RLM before timezone indicator
They are not generated by MediaWiki, but they often appear when users
sign others' unsigned comments by copy-pasting the timestamp from the
history page.

Add test config data for nlwiki, exported by running this in the
browser console:

  copy(
    JSON.stringify( { wgArticlePath, wgNamespaceIds, wgFormattedNamespaces }, null, 2 ) + '\n' +
    JSON.stringify( mw.loader.moduleRegistry['ext.discussionTools.parser'].packageExports['data.json'], null, 2 )
  );

Bug: T245784
Change-Id: Icbcdc5a028e9ce2cb09173f87769e525ec3082fc
2020-02-25 00:20:00 +00:00
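
One way to tolerate a stray LRM (U+200E) or RLM (U+200F) before the timezone indicator, as the commit above describes, is to make the mark optional in the regular expression that matches the end of a timestamp. The sketch below is an assumption for illustration only: the function name `buildTimezoneRegexp` and the way the pattern is assembled are hypothetical and not taken from the repository.

```javascript
// Hypothetical sketch: match a parenthesised timezone abbreviation at the
// end of a timestamp, optionally preceded by a directionality mark.
function buildTimezoneRegexp( abbreviations ) {
	// Escape regexp special characters in each abbreviation.
	var alternatives = abbreviations.map( function ( abbr ) {
		return abbr.replace( /[\\^$.*+?()[\]{}|]/g, '\\$&' );
	} ).join( '|' );
	// [\u200E\u200F]? allows one LRM or RLM copied along with the timestamp.
	return new RegExp( '[\\u200E\\u200F]?\\((' + alternatives + ')\\)$' );
}

var timezoneRegexp = buildTimezoneRegexp( [ 'CET', 'CEST' ] );
var plainMatch = timezoneRegexp.test( '12:00, 1 jan 2020 (CET)' );
var lrmMatch = timezoneRegexp.test( '12:00, 1 jan 2020 \u200E(CET)' );
```

Both calls to `test` succeed, so a timestamp pasted from the history page is recognised the same way as one generated by a signature.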
Bartosz Dziewoński ff0386239f Only detect comments with real signatures
Consequences of this are visible in the test cases:

* (en) Tech News posts are not detected.
  Examples: "21:22, 1 July 2019 (UTC)", "21:42, 29 July 2019 (UTC)"

* (en) Comments by users who customize the timestamp are not detected.
  Examples: "10:49, 28 June 2019 (UTC)", "21:34, 14 July 2019 (UTC)"

* (en) Comments with signatures missing a username are not detected.
  This sometimes happens if a comment is accidentally signed with
  '~~~~~' (five tildes), which only inserts the timestamp.
  Examples: "17:17, 27 July 2019 (UTC)", "10:25, 29 July 2019 (UTC)"

* (pl) A lone timestamp at the beginning of a thread is not detected.
  It's not part of a post, it was added to aid automatic archiving.
  Example: "21:03, 18 paź 2018 (CET)"

Bug: T245692
Change-Id: I0767bb239a1800f2e538917b5995fc4f0fa4d043
2020-02-21 01:30:54 +01:00
Bartosz Dziewoński 7761f62b42 Fix edit summary for comments in 0th section (no heading)
Bug: T245765
Change-Id: I9eb4726ef096b8d7459cc1409814514ec1dc89ae
2020-02-21 00:44:42 +01:00
Bartosz Dziewoński e5e6fdd3af Stop using native Range objects, they're too annoying
Native Range objects are automatically updated when the DOM elements
they refer to are affected (e.g. detached from the DOM, or their offset
changes because of siblings being added/removed).

This seemed harmless or maybe even slightly useful, but it turns out
it conflicts with VisualEditor, which has to wrap the entire page in a
new DOM node when it opens (and unwrap it when it closes), effectively
temporarily detaching it from the DOM, which destroys all our ranges.

Just use a plain object that stores the same data as a Range. And when
we need to use Range's API, we can simply construct a temporary one.

Bug: T241861
Change-Id: Iee64aa3d667877265ef8a59293c202e6478d7fb6
2020-02-05 19:42:03 +01:00
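
The approach in the commit above, a plain object that stores the same data as a Range plus a temporary native Range when the API is needed, might look roughly like this. The helper names `toPlainRange` and `toNativeRange` are hypothetical. The properties and methods used (`startContainer`, `startOffset`, `endContainer`, `endOffset`, `createRange`, `setStart`, `setEnd`) are the standard DOM Range API.

```javascript
// Hypothetical sketch, not the repository's code.
// Copy the boundary points of a Range into a plain object. Unlike a native
// Range, this snapshot is not updated when the DOM changes.
function toPlainRange( range ) {
	return {
		startContainer: range.startContainer,
		startOffset: range.startOffset,
		endContainer: range.endContainer,
		endOffset: range.endOffset
	};
}

// Build a short-lived native Range from the plain object, for callers that
// need Range methods. `doc` is the Document that owns the nodes.
function toNativeRange( plainRange, doc ) {
	var nativeRange = doc.createRange();
	nativeRange.setStart( plainRange.startContainer, plainRange.startOffset );
	nativeRange.setEnd( plainRange.endContainer, plainRange.endOffset );
	return nativeRange;
}
```

Because the plain object only holds references and offsets, wrapping or unwrapping the page in another node does not reset it the way it would reset a live Range.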
Bartosz Dziewoński e29b8173bf Handle comments before first section heading
The loop in parser.js assumed that there was always a heading before
any comments (not counting the page title, only section headings).

Bug: T243869
Change-Id: I3a0bb06716e75d4a17e25c40748673a071ee5f30
2020-01-30 00:14:46 -08:00
Bartosz Dziewoński 30fcfec1fd parser: Merge multiple comments on one line
Even when you have multiple signatures by multiple users in one
paragraph (or list item), it's still basically a single comment.
We don't want to offer multiple buttons to reply to it.

The changed parser test cases are illustrative:
* All affected comments in the "pl" example are comments with a
  "post-scriptum", which is now more intuitively treated as part of
  the main comment.
* The first comment in the "en" example would probably have been
  better if it wasn't merged, but a weird use of the outdent template
  causes us to not be able to distinguish that the two parts of the
  comment display on separate lines.
* The last comment in the "en" example (isn't that neat?) was previously
  incorrectly treated as two comments, because there's a timestamp in
  the middle of it (the user is referring to another comment).
* Remaining affected comments in the "en" example are also comments
  with a "post-scriptum" and their treatment is clearly better now.

It also accidentally fixes some problems with modifier tests (but not
all), where previously <dl> nodes would be inserted in the middle of
<p> nodes, to reply to the comments which are now merged.

Bug: T240640
Change-Id: I0f2d9238aff75d78286250affd323cd145661a11
2020-01-22 02:21:43 +01:00
Bartosz Dziewoński da668b72d5 Identify comments by username+timestamp+seq
Possible use cases:
* Matching comments between PHP and Parsoid HTML [implemented here]
* Finding the same comment in a different revision of a page
  (e.g. while resolving an edit conflict, or to allow resuming
  composition of autosaved comments) [implemented for highlighting
  user's own posted comment only]
* Permanent links to comments [future]

The reasoning for this form of ID is:
* _Timestamp_ by itself is a nearly unique identifier, so it's a good
  thing to start with
* Users may post multiple comments in one edit (or in many edits in
  one minute), so we need the _sequential number_ to distinguish them
* _Username_ is probably not required, but it may reduce the need
  for sequential numbers, and will help with human-readability if we
  add permanent links

The ID remains stable when a new comment is added anywhere by anyone
(except comments within the same minute by the same user), or when a
section is renamed.

It's not always stable when a comment is moved or when an entire
section is moved or deleted (archived), but you can't have everything.

Change-Id: Idaae6427d659d12b82e37f1791bd03833632c7c0
2019-12-09 13:45:31 +00:00
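
The username+timestamp+seq scheme from the commit above can be sketched as follows. The function name `assignCommentIds`, the input shape, and the `|` separator are assumptions made for illustration. The commit message does not specify the actual ID format.

```javascript
// Hypothetical sketch of username+timestamp+seq identifiers.
// Each comment is assumed to carry an `author` string and a `timestamp`
// string with minute precision, in page order.
function assignCommentIds( comments ) {
	var counters = new Map();
	return comments.map( function ( comment ) {
		var base = comment.author + '|' + comment.timestamp;
		// The sequential number only distinguishes comments that share
		// the same author and the same minute.
		var seq = counters.get( base ) || 0;
		counters.set( base, seq + 1 );
		return base + '|' + seq;
	} );
}

var ids = assignCommentIds( [
	{ author: 'Alice', timestamp: '2019-12-09T13:45' },
	{ author: 'Bob', timestamp: '2019-12-09T13:45' },
	{ author: 'Alice', timestamp: '2019-12-09T13:45' }
] );
```

In this sketch Bob's comment does not change the sequence numbers of Alice's comments, which matches the stability property described in the commit message: only another comment by the same user in the same minute can shift an ID.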
Bartosz Dziewoński 4021ca1642 Add unit tests for parser#getTimestampParser
Change-Id: I03cba04489194539d6ff3a32acdb9a8fe3d499e5
2019-10-30 00:13:56 +01:00
Bartosz Dziewoński fc34556b04 Fix parsing links to subpages in user signatures
Change-Id: I381087c252eeb7530e63c4d99cecc1b2ee041b0a
2019-10-30 00:13:56 +01:00
Bartosz Dziewoński c83201b10c Fix parsing non-standard case in links to user contribs
Change-Id: I2da72e2731019ad5be0ba33aa229ad914a7aaf10
2019-10-30 00:13:56 +01:00
Bartosz Dziewoński e8012b7094 Fix incorrectly detecting a section heading inside the table of contents
Change-Id: I7209b523c3322b3b379e6aa82a4b2014cc39c404
2019-10-30 00:13:56 +01:00
Bartosz Dziewoński 9efe8b1dd4 Add unit tests for parser#getTimestampRegexp
Depends-On: I6c3d186de1877f73d4a4e3fec7d6d632a5d5fa83
Change-Id: Icdb44f793a8f5e56666ec635bb8b0125041b5aab
2019-10-24 23:21:29 +02:00
Bartosz Dziewoński 97ce480767 Document methods in parser.js
Change-Id: I9272a619770f805f36686d722eebba586d2650e4
2019-10-22 14:38:53 +00:00
Bartosz Dziewoński 96af61bbc4 Minor naming and comment cleanup after re-reading the code
Change-Id: I5d0309329e56034697070ebadf551ac704323d5c
2019-10-22 14:38:46 +00:00
Bartosz Dziewoński db80e48933 Handle timestamps in daylight saving time
Add the Moment Timezone library. Add a script for managing libraries,
like in MediaWiki core.

Depends-On: I9a59a6ad01850b30327e4215f2be61b8d1c41277
Change-Id: I64bc79e7d0ccdf42b006e5a225c8aa70ea5f4e15
2019-10-22 16:33:21 +02:00
Bartosz Dziewoński 3dc5d79b20 Fix regexp for HTML heading tags
Change-Id: I91b8c2626e76d340da83b9e36a655c3f5158ac3c
2019-10-20 17:18:01 +00:00
Bartosz Dziewoński 282cf3c386 Escape regexp special characters in date formats
For example, the default date format for Japanese (ja) is
"Y年n月j日 (D) H:i", which contains parentheses.

Change-Id: I4fce11f2913959dad06b3846d03df1da1e84e435
2019-10-20 17:17:55 +00:00
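
The problem fixed in the commit above can be shown with a literal fragment of the Japanese date format. The helper name `escapeRegExp` is hypothetical, and the sketch only covers escaping literal text. Translating the format characters themselves (Y, n, j, and so on) into patterns is a separate step not shown here.

```javascript
// Hypothetical sketch: escape characters that have a special meaning in a
// regular expression so that literal text from a date format is matched
// literally.
function escapeRegExp( str ) {
	return str.replace( /[\\^$.*+?()[\]{}|]/g, '\\$&' );
}

var literal = '日 (D) H:i';
// Unescaped, "(D)" is a capturing group that matches just "D",
// so the pattern fails against text containing real parentheses.
var unescapedMatches = new RegExp( '^' + literal + '$' ).test( literal );
// Escaped, the parentheses are matched as ordinary characters.
var escapedMatches = new RegExp( '^' + escapeRegExp( literal ) + '$' ).test( literal );
```

Here `unescapedMatches` is false and `escapedMatches` is true, which is why a format containing parentheses needs escaping before it is embedded in a regular expression.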
Bartosz Dziewoński b105bf7ded Detect and parse timestamps, signatures, comments and threads
Bug: T232780
Bug: T234404
Change-Id: Ie9c80121089742cfc7cd7c04d694c2e0fe8d6a98
2019-10-18 13:59:07 +02:00