Commit graph

13 commits

Author SHA1 Message Date
alex4401 dcfd6ea6a8 Implement a RemexHtml-based provider (requires 1.38)
This patch has been in testing on ark.wiki.gg (with the platform's approval) for a few months now, and has yielded significantly better extracts than the live algorithm. It brings Description2 closer to TextExtracts' level, but without its constraints. TextExtracts does not let us override descriptions as we see fit, requires internal API calls, and uses HtmlFormatter, which has its own slew of problems.

I'm also raising the MW requirement to 1.38: this change has not been tested with older versions, and I removed uses of the old `getProperty` in favour of `getPageProperty` to clean up the code. I doubt this really matters much: support for 1.35 is about to end (if it hasn't already), and 1.38 itself is already EoL. It also appears that lately this extension has only received forward-compatibility patches.

The new provider leverages RemexHtml (used in core MediaWiki and Parsoid) to parse and process a page's HTML. For performance reasons (but also a bit of practicality - it's unlikely that description derivation needs the full page text...) any HTML after the first heading (`<h1>` to `<h6>`) is dropped and not parsed. However, this is just a precaution: on ARK we haven't noticed any performance degradation.
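
The heading cut-off described above can be illustrated outside of PHP. The sketch below is not the extension's code: it assumes a plain regular expression in Python instead of RemexHtml, and `lead_section` is a hypothetical name chosen for illustration.

```python
import re

# Matches the opening tag of any heading from <h1> to <h6>.
# Assumption: a regex is good enough for illustration; the patch itself uses RemexHtml.
_HEADING_RE = re.compile(r"<h[1-6][\s>]", re.IGNORECASE)


def lead_section(html: str) -> str:
    """Return only the HTML that precedes the first heading (hypothetical helper)."""
    match = _HEADING_RE.search(html)
    # No heading at all: the whole input is kept.
    return html if match is None else html[:match.start()]
```

Everything from the first heading onwards is discarded before any further processing, so the cost of deriving a description depends on the lead section rather than on the whole page.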

Two new configuration variables are added:
- `$wgUseSimpleDescriptionAlgorithm` (proposed default: false) determines which extract provider is used. `true` is the previous algorithm. `false` is the new Remex implementation.
- `$wgDescriptionRemoveElements` is an array of tag names and/or class names to strip out when generating extracts. This is only supported in the Remex provider, and would be hard to retrofit into the previous algorithm.

Depends-On: I04b00f99085f07f773212ee3eca8470eece34e9e
Change-Id: I8e6bf1f17443feac89f07e728d395e5a525bd4d1
2024-05-17 20:03:08 +00:00
jenkins-bot d325abc0dd Merge "Truncate descriptions to a certain length" 2024-05-17 19:59:23 +00:00
alex4401 c146909532 Truncate descriptions to a certain length
With this change, generated descriptions are cut at 300 characters, without breaking words when possible, and with an ellipsis added if the cut happens mid-sentence.
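
A minimal sketch of this kind of truncation, written in Python for illustration. It is not the `getFirstChars` code borrowed from TextExtracts; `truncate_description` and its exact rules (backing up to the last space, treating `.`, `!` and `?` as sentence ends) are assumptions.

```python
def truncate_description(text: str, max_chars: int = 300) -> str:
    """Cut text to at most max_chars, preferring word boundaries (hypothetical helper)."""
    text = text.strip()
    if len(text) <= max_chars:
        return text

    cut = text[:max_chars]
    # If the cut landed inside a word, back up to the previous space when there is one.
    if not text[max_chars].isspace():
        last_space = cut.rfind(" ")
        if last_space > 0:
            cut = cut[:last_space]
    cut = cut.rstrip()

    # Assumption: a cut that does not end on sentence punctuation is mid-sentence.
    if not cut.endswith((".", "!", "?")):
        cut = cut.rstrip(",;:") + "…"
    return cut
```

For example, `truncate_description("The quick brown fox jumps", 12)` backs up from the middle of "brown" and returns `"The quick…"`. A single word longer than the limit is still cut mid-word, which is the "when possible" caveat.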

The `Description2::getFirstChars` function was borrowed from TextExtracts with minor alterations (added a comment for their regex, and removed `>` from word healing, given it's unclear why it was included).

New configuration variable `$wgDescriptionMaxChars` (proposed default: 300, which seems like a sensible amount) controls this behaviour.

This has been in testing on ark.wiki.gg (with the platform's approval, which I'm glad for) since early September. Without this change, a few of our pages with few sections had a huge part of their body text thrown into the `description` meta tag...

Depends-On: I585f2c0046571310aad67f3ba148c4f22aaae49f
Change-Id: I04b00f99085f07f773212ee3eca8470eece34e9e
2024-02-24 08:44:53 +01:00
xtex 055927a901 Remove style tags in description
Some pages' descriptions may include CSS code rendered by [[Extension:TemplateStyles]].

Change-Id: I352ac2338eb5977305308546523ec6c55f7cb599
2024-02-14 07:16:19 +00:00
alex4401 050086dd6e Do not try to derive descriptions if one has been specified already
Skip running the generator on interface messages and if a description has already been set. This resolves some annoyances ranging from performance (relevant if the algorithm becomes more expensive to run) to multiple description meta tags being spawned.

This is part of my RemexHtml patch chain, which I've split up to avoid having a single commit alter the majority of the codebase. If it ends up being rejected, I can rebase this change to get rid of its dependencies on the rest of the chain.

Depends-On: I97fd065c9554837747021ba9fff26005e33270f4
Change-Id: I585f2c0046571310aad67f3ba148c4f22aaae49f
2024-01-12 08:28:31 +00:00
alex4401 b73fe26c29 Extract description algorithm into a new class
Separating the extract algorithm from integration code. This results in a slightly cleaner code structure (at least in my opinion) and enables adding alternate algorithms without devolving into spaghetti.

The DescriptionProvider (name of the new base interface) is exposed as a service through dependency injection to avoid factories. The implementation can be swapped at service instantiation time.

Depends-On: I73c61ce045dcf31ac1ca5888f1548de8fd8b56ff
Change-Id: I97fd065c9554837747021ba9fff26005e33270f4
2024-01-12 08:24:19 +00:00
alex4401 43eb47183e Switch to hook handlers and inject ConfigFactory
Moving hooks into a separate class, and using dependency injection for configuration. Due to hook interfaces being added in MW 1.35, this change also raises the MediaWiki requirement to >1.35.0.

This patch is a part of my RemexHtml deriver chain (split into multiple patches to avoid a single commit altering almost the entirety of the codebase), which raises the floor to 1.38 later. There's not really a point in merging this if the rest of the patch chain is declined.

Depends-On: I484feeb51beab0c2e06c9f958a1c15c40853b967
Change-Id: I73c61ce045dcf31ac1ca5888f1548de8fd8b56ff
2024-01-11 22:44:02 +00:00
C. Scott Ananian b269304773 ParserOutput::getPageProperty() now returns null when key is missing.
The return value of ParserOutput::getPageProperty() has transitioned
to returning `null` instead of `false` when the page property is missing.

Bug: T301915
Depends-On: Iaa25c390118d2db2b6578cdd558f2defd5351d15
Change-Id: I31d4115d75e080bb0177f30b2acf55ca2525a19d
2022-02-16 19:33:00 -05:00
C. Scott Ananian 84276905bc Update uses of ParserOutput::getPageProperty() to handle new return value
The return value of ParserOutput::getPageProperty() will transition to
returning `null` instead of `false` when the page property is missing.

Bug: T301915
Change-Id: Id95dbdd427310e4e1cf40330b149adfe2b68f848
2022-02-16 19:31:35 -05:00
Umherirrender 6d3b7ce782 Replace deprecated ParserOutput::getProperty
Change-Id: I9278120212bcd0c003af899ce9602c292edac947
2022-02-06 13:11:38 +01:00
C. Scott Ananian 70325b3952 Remove <metadesc>
The {{description2}} parser function is still present. The <metadesc>
hook was just added "so that Description2 can be used as a
replacement for Extension:MetaDescriptionTag", but MW 1.35 deprecated
and 1.36 removed support for Parser::setFunctionTagHook.

Bug: T236809
Change-Id: I53710f0d0a7bc8de4a3404b6df9ccbd1381cf36d
2020-08-26 18:47:07 -04:00
libraryupgrader 06229d23c4 build: Updating mediawiki/mediawiki-codesniffer to 18.0.0
The following sniffs now pass and were enabled:
* MediaWiki.Commenting.LicenseComment.InvalidLicenseTag

Change-Id: I6d9d49895a44890710017d55401618735f43337e
2018-04-14 00:16:14 +00:00
Sam Wilson 74dfa5fdd9 Add composer for CI, and fix CS errors
Also move class to its own namespace.

Bug: T183532
Change-Id: I5a89efd3f162d73e247543c86967562a949a3bf4
2017-12-22 08:01:33 +00:00