mediawiki-extensions-Descri.../includes/SimpleDescriptionProvider.php

<?php

namespace MediaWiki\Extension\Description2;

use Config;

class SimpleDescriptionProvider implements DescriptionProvider {
	/**
	 * @param Config $config
	 */
	public function __construct( Config $config ) {
		// This provider currently reads no configuration.
	}

	/**
	 * Extracts a description from the HTML representation of a page.
	 *
	 * The algorithm:
	 * 1. Removes all <style> and <table> elements, including their contents.
	 * 2. Selects all <p> elements written without attributes.
	 * 3. Strips all HTML tags from each paragraph and trims the surrounding whitespace.
	 * 4. Returns the first paragraph that is not empty after stripping.
	 *
	 * @param string $text HTML of the page
	 * @return string|null The description, or null if no non-empty paragraph was found
	 */
	public function derive( string $text ): ?string {
		$myText = $text;
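		// Remove these elements entirely. (?R) recurses into the whole pattern, so
		// nested elements of the same tag are removed together with their parent.
		$stripTags = [ 'style', 'table' ];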
		foreach ( $stripTags as $tag ) {
			$pattern = "%<$tag\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?$tag\b)<[^<]*+)*+)*+</$tag>%i";
			$myText = preg_replace( $pattern, '', $myText );
		}

		$paragraphs = [];
		if ( preg_match_all( '#<p>.*?</p>#is', $myText, $paragraphs ) ) {
			foreach ( $paragraphs[0] as $paragraph ) {
				$paragraph = trim( strip_tags( $paragraph ) );
				// Compare with '' so that a paragraph consisting of "0" is not skipped.
				if ( $paragraph === '' ) {
					continue;
				}
				return $paragraph;
			}
		}

		return null;
	}
}
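
The steps listed in the docblock of derive() can be tried outside MediaWiki with a minimal sketch. The function name deriveSimpleDescription() and the sample HTML are hypothetical and exist only for illustration; the two regular expressions are copied from the class above. The sketch checks for an empty paragraph with a strict comparison against ''.

```php
<?php
// Hypothetical standalone helper that follows the same steps as
// SimpleDescriptionProvider::derive(), without the Config dependency.
function deriveSimpleDescription( string $html ): ?string {
	// Step 1: drop <style> and <table> elements, nested ones included.
	foreach ( [ 'style', 'table' ] as $tag ) {
		$pattern = "%<$tag\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?$tag\b)<[^<]*+)*+)*+</$tag>%i";
		$html = preg_replace( $pattern, '', $html );
	}
	// Step 2: collect <p>...</p> blocks (only attribute-less <p> tags match).
	if ( preg_match_all( '#<p>.*?</p>#is', $html, $matches ) ) {
		foreach ( $matches[0] as $paragraph ) {
			// Step 3: strip the markup and trim surrounding whitespace.
			$paragraph = trim( strip_tags( $paragraph ) );
			// Step 4: return the first paragraph that still has text.
			if ( $paragraph !== '' ) {
				return $paragraph;
			}
		}
	}
	return null;
}

// Made-up sample page: a style block, an infobox table with a nested table,
// a paragraph that is empty once stripped, and two real paragraphs.
$html = <<<'HTML'
<style>.infobox { float: right; }</style>
<table class="infobox"><tr><td><p>Infobox text</p><table><tr><td>nested</td></tr></table></td></tr></table>
<p> <br> </p>
<p>The <b>Ankylosaurus</b> is a creature.</p>
<p>Second paragraph.</p>
HTML;

var_dump( deriveSimpleDescription( $html ) );
// string(31) "The Ankylosaurus is a creature."

var_dump( deriveSimpleDescription( '<p class="lead">Has attributes</p>' ) );
// NULL
```

The second call shows a limitation of the paragraph pattern: a <p> tag that carries attributes is not matched, so the function returns null even though the page contains text.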
Extract description algorithm into a new class

Separating the extract algorithm from integration code. This results in a slightly cleaner code structure (at least in my opinion) and enables adding alternate algorithms without devolving into spaghetti.

The DescriptionProvider (name of the new base interface) is exposed as a service through dependency injection to avoid factories. The implementation can be swapped at service instantiation time.

Depends-On: I73c61ce045dcf31ac1ca5888f1548de8fd8b56ff
Change-Id: I97fd065c9554837747021ba9fff26005e33270f4
2023-12-04 19:54:13 +00:00


Implement a RemexHtml-based provider (requires 1.38)

This patch has been in testing on ark.wiki.gg (with the platform's approval) for a few months now, and has yielded us significantly better extracts than the live algorithm. It brings Description2 closer to TextExtracts' level, but without its constraints. TextExtracts does not enable us to override descriptions as we see fit, requires internal API calls, and uses HtmlFormatter which has its own slew of problems.

I'm also raising the MW requirement to 1.38: this change has not been tested with older versions, and I removed uses of the old `getProperty` in favour of `getPageProperty` to clean up the code. I doubt this really matters much: support for 1.35 is about to end (if it hasn't already), and 1.38 itself is already EoL. It also appears that lately this extension has only received forward-compatibility patches.

The new provider leverages RemexHtml (used in core MediaWiki and Parsoid) to parse and process a page's HTML. For performance reasons (but also a bit of practicality: it's unlikely that deriving a description needs the full page text) any HTML after the first <h1-6> heading is dropped and not parsed. However, this is just a precaution; on ARK we haven't noticed any performance degradation.

Two new configuration variables are added:
- `$wgUseSimpleDescriptionAlgorithm` (proposed default: false) determines which extract provider is used. `true` is the previous algorithm. `false` is the new Remex implementation.
- `$wgDescriptionRemoveElements` is an array of tag names and/or class names to strip out when generating extracts. This is only supported in the Remex provider, and would be hard to retrofit into the previous algorithm.

Depends-On: I04b00f99085f07f773212ee3eca8470eece34e9e
Change-Id: I8e6bf1f17443feac89f07e728d395e5a525bd4d1
2024-01-02 17:33:05 +00:00
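
As a rough illustration of the two configuration variables named in the commit message above, a LocalSettings.php fragment might look like the sketch below. The values are assumptions chosen for the example. Only tag names are shown, because the commit message does not say how class names are written.

```php
<?php
// LocalSettings.php fragment (illustrative values, not recommended defaults).

// false selects the Remex-based provider; true selects the previous
// algorithm implemented by SimpleDescriptionProvider.
$wgUseSimpleDescriptionAlgorithm = false;

// Tag names to strip before an extract is generated. According to the
// commit message, only the Remex provider honours this setting.
$wgDescriptionRemoveElements = [ 'table', 'style' ];
```

With `$wgUseSimpleDescriptionAlgorithm` set to `true`, the second variable would have no effect, since the commit message states that element removal is not supported by the previous algorithm.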

Remove style tags in description

Some pages' description may include CSS code rendered by [[Extension:TemplateStyles]].

Change-Id: I352ac2338eb5977305308546523ec6c55f7cb599
2024-01-31 08:13:34 +00:00