mediawiki-extensions-Descri.../includes/SimpleDescriptionProvider.php

<?php

namespace MediaWiki\Extension\Description2;

class SimpleDescriptionProvider implements DescriptionProvider {

	/**
	 * Extracts description from the HTML representation of a page.
	 *
	 * The algorithm:
	 * 1. Removes all <table> elements and their contents, including nested tables.
	 * 2. Selects all <p> elements written without attributes (a literal "<p>" opening tag).
	 * 3. Iterates over those paragraphs, strips out all HTML tags and trims surrounding whitespace.
	 * 4. Picks the first non-empty paragraph as the result.
	 *
	 * @param string $text HTML representation of the page
	 * @return string|null The description, or null if no non-empty paragraph was found
	 */
	public function derive( string $text ): ?string {
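		// Recursive pattern: (?R) lets nested <table> elements be matched as a unit.
		$pattern = '%<table\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?table\b)<[^<]*+)*+)*+</table>%i';
		// preg_replace() returns null on error (e.g. backtrack limit), so fall back to the input.
		$myText = preg_replace( $pattern, '', $text ) ?? $text;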

		$paragraphs = [];
		if ( preg_match_all( '#<p>.*?</p>#is', $myText, $paragraphs ) ) {
			foreach ( $paragraphs[0] as $paragraph ) {
				$paragraph = trim( strip_tags( $paragraph ) );
				// Strict comparison, so that a paragraph consisting of "0" is not treated as empty.
				if ( $paragraph === '' ) {
					continue;
				}
				return $paragraph;
			}
		}

		return null;
	}
}
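
The four steps in the docblock can be traced on a small input. The following is a minimal sketch, not the extension's code: `deriveDescription` is a hypothetical standalone function that mirrors the steps of `derive()` so the example runs without MediaWiki, and the sample HTML is invented for illustration.

```php
<?php

// Hypothetical standalone copy of the derive() steps, for illustration only.
function deriveDescription( string $html ): ?string {
	// Step 1: drop <table> elements, including nested ones (recursive pattern).
	$pattern = '%<table\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?table\b)<[^<]*+)*+)*+</table>%i';
	$withoutTables = preg_replace( $pattern, '', $html ) ?? $html;

	// Steps 2-4: return the first <p>...</p> whose text is non-empty
	// after stripping tags and trimming whitespace.
	if ( preg_match_all( '#<p>.*?</p>#is', $withoutTables, $matches ) ) {
		foreach ( $matches[0] as $paragraph ) {
			$plain = trim( strip_tags( $paragraph ) );
			if ( $plain !== '' ) {
				return $plain;
			}
		}
	}

	return null;
}

// Invented sample: a table containing a paragraph, an empty paragraph,
// and then the paragraph we expect to be picked.
$html = '<table><tr><td><p>Infobox text</p></td></tr></table>'
	. '<p> </p>'
	. '<p>The <b>first</b> real paragraph.</p>';

echo deriveDescription( $html ), "\n"; // prints: The first real paragraph.
```

The paragraph inside the table is never considered because the table is removed before paragraphs are selected, and the whitespace-only paragraph is skipped in step 4.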
Extract description algorithm into a new class

Separating the extract algorithm from integration code. This results in a slightly cleaner code structure (at least in my opinion) and enables adding alternate algorithms without devolving into spaghetti.

The DescriptionProvider (name of the new base interface) is exposed as a service through dependency injection to avoid factories. The implementation can be swapped at service instantiation time.

Depends-On: I73c61ce045dcf31ac1ca5888f1548de8fd8b56ff
Change-Id: I97fd065c9554837747021ba9fff26005e33270f4

2023-12-04 19:54:13 +00:00
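
To illustrate what swapping the implementation at service instantiation time could look like, here is a rough sketch in plain PHP without MediaWiki's service container. Everything in it is an assumption: the `DescriptionProvider` interface shape is inferred from the `derive()` signature above, and `TruncatingDescriptionProvider` and `DescriptionConsumer` are hypothetical names that do not exist in the extension.

```php
<?php

// Assumed shape of the interface, inferred from SimpleDescriptionProvider::derive().
interface DescriptionProvider {
	public function derive( string $text ): ?string;
}

// Hypothetical alternate algorithm: first non-empty paragraph, cut to a maximum length.
class TruncatingDescriptionProvider implements DescriptionProvider {
	private int $maxLength;

	public function __construct( int $maxLength ) {
		$this->maxLength = $maxLength;
	}

	public function derive( string $text ): ?string {
		if ( !preg_match_all( '#<p>.*?</p>#is', $text, $matches ) ) {
			return null;
		}
		foreach ( $matches[0] as $paragraph ) {
			$plain = trim( strip_tags( $paragraph ) );
			if ( $plain !== '' ) {
				// substr() counts bytes; adequate for this ASCII-only sketch.
				return substr( $plain, 0, $this->maxLength );
			}
		}
		return null;
	}
}

// Hypothetical consumer: it depends only on the interface, not on a concrete class.
class DescriptionConsumer {
	private DescriptionProvider $provider;

	public function __construct( DescriptionProvider $provider ) {
		$this->provider = $provider;
	}

	public function describe( string $html ): string {
		return $this->provider->derive( $html ) ?? '';
	}
}

// The concrete implementation is chosen in exactly one place, where the object is built.
$consumer = new DescriptionConsumer( new TruncatingDescriptionProvider( 10 ) );
echo $consumer->describe( '<p>Hello, wonderful world.</p>' ), "\n"; // prints: Hello, won
```

Because the consumer only sees the interface, replacing the algorithm means changing the single line that constructs the provider, which is the kind of swap the commit message describes.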