mediawiki-extensions-Descri.../includes/SimpleDescriptionProvider.php
alex4401 b73fe26c29 Extract description algorithm into a new class
Separating the extract algorithm from integration code. This results in a slightly cleaner code structure (at least in my opinion) and enables adding alternate algorithms without devolving into spaghetti.

The DescriptionProvider (name of the new base interface) is exposed as a service through dependency injection to avoid factories. The implementation can be swapped at service instantiation time.

Depends-On: I73c61ce045dcf31ac1ca5888f1548de8fd8b56ff
Change-Id: I97fd065c9554837747021ba9fff26005e33270f4
2024-01-12 08:24:19 +00:00

37 lines
985 B
PHP

<?php
namespace MediaWiki\Extension\Description2;
class SimpleDescriptionProvider implements DescriptionProvider {
/**
* Extracts description from the HTML representation of a page.
*
* The algorithm:
* 1. Removes all <table> elements and their contents.
* 2. Selects all <p> elements.
* 3. Iterates over those paragraphs, strips out all HTML tags and trims white-space around.
* 4. Then the first non-empty paragraph is picked as the result.
*
* @param string $text
* @return string
*/
public function derive( string $text ): ?string {
$pattern = '%<table\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?table\b)<[^<]*+)*+)*+</table>%i';
$myText = preg_replace( $pattern, '', $text );
$paragraphs = [];
if ( preg_match_all( '#<p>.*?</p>#is', $myText, $paragraphs ) ) {
foreach ( $paragraphs[0] as $paragraph ) {
$paragraph = trim( strip_tags( $paragraph ) );
if ( !$paragraph ) {
continue;
}
return $paragraph;
}
}
return null;
}
}