mediawiki-extensions-Descri.../includes/Description2.php
alex4401 dcfd6ea6a8 Implement a RemexHtml-based provider (requires 1.38)
This patch has been in testing on ark.wiki.gg (with the platform's approval) for a few months now, and has yielded significantly better extracts than the live algorithm. It brings Description2 closer to TextExtracts' level, but without its constraints: TextExtracts does not let us override descriptions as we see fit, requires internal API calls, and uses HtmlFormatter, which has its own slew of problems.

I'm also raising the MW requirement to 1.38: this change has not been tested with older versions, and I removed uses of the old `getProperty` in favour of `getPageProperty` to clean up the code. I doubt this matters much: support for 1.35 is about to end (if it hasn't already), and 1.38 itself is already EoL. It also appears that this extension has lately only received forward-compatibility patches.

The new provider leverages RemexHtml (used in core MediaWiki and Parsoid) to parse and process a page's HTML. For performance reasons, and also for practicality (deriving a description is unlikely to need the full page text), any HTML after the first heading (`<h1>` to `<h6>`) is dropped and not parsed. This is only a precaution: on ARK we haven't noticed any performance degradation.
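
The truncation idea can be illustrated with a minimal sketch. This is not the provider's code: the provider works on RemexHtml's parse, whereas the sketch below swaps in a plain regular expression purely to show what "drop everything after the first heading" means. The function name `leadSectionHtml` is hypothetical.

```php
<?php
/**
 * Hypothetical helper (not part of Description2): keep only the HTML that
 * precedes the first opening <h1>..<h6> tag. Regex-based for illustration;
 * the Remex provider described above does not work this way.
 */
function leadSectionHtml( string $html ): string {
	// Split at most once, at the first opening heading tag.
	// \b ensures "<h1>" or "<h2 class=...>" match, but not e.g. "<h1x>".
	$parts = preg_split( '/<h[1-6]\b/i', $html, 2 );
	// preg_split() returns false on a regex error; fall back to the input.
	return $parts === false ? $html : $parts[0];
}

echo leadSectionHtml( '<p>Intro.</p><h2>History</h2><p>More text.</p>' ), "\n";
```

Only the lead section survives, so the cost of later processing does not grow with the length of the rest of the page.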

Two new configuration variables are added:
- `$wgUseSimpleDescriptionAlgorithm` (proposed default: false) determines which extract provider is used: `true` selects the previous algorithm, `false` selects the new Remex implementation.
- `$wgDescriptionRemoveElements` is an array of tag names and/or class names to strip out when generating extracts. This is only supported in the Remex provider, and would be hard to retrofit into the previous algorithm.
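
As a rough sketch, a wiki's `LocalSettings.php` could set the two variables as follows. The element list is an illustrative assumption, not the extension's default, and writing class names with a leading dot (`.infobox`) is likewise an assumption about the expected syntax.

```php
<?php
// LocalSettings.php (fragment)
wfLoadExtension( 'Description2' );

// false selects the new Remex-based provider; true keeps the previous algorithm.
$wgUseSimpleDescriptionAlgorithm = false;

// Tags and/or classes to strip before the extract is generated.
// Assumption: example values only, and the leading-dot form for class names
// has not been checked against the extension.
$wgDescriptionRemoveElements = [ 'table', 'script', 'style', '.infobox' ];
```

The second variable has no effect when `$wgUseSimpleDescriptionAlgorithm` is `true`, since only the Remex provider supports it.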

Depends-On: I04b00f99085f07f773212ee3eca8470eece34e9e
Change-Id: I8e6bf1f17443feac89f07e728d395e5a525bd4d1
2024-05-17 20:03:08 +00:00

<?php

namespace MediaWiki\Extension\Description2;

use Parser;
use PPFrame;

/**
 * Description2: adds a meaningful description <meta> tag to MW pages and into the parser output
 *
 * @file
 * @ingroup Extensions
 * @author Daniel Friesen (http://danf.ca/mw/)
 * @copyright Copyright 2010 Daniel Friesen
 * @license GPL-2.0-or-later
 * @link https://www.mediawiki.org/wiki/Extension:Description2 Documentation
 */
class Description2 {

	/**
	 * Stores the description as a page property, unless one has already been set.
	 *
	 * @param Parser $parser The parser.
	 * @param string $desc The description text.
	 */
	public static function setDescription( Parser $parser, $desc ) {
		$parserOutput = $parser->getOutput();
		// The first description set for a page wins; later calls are ignored.
		if ( $parserOutput->getPageProperty( 'description' ) !== null ) {
			return;
		}
		$parserOutput->setPageProperty( 'description', $desc );
	}

	/**
	 * @param Parser $parser The parser.
	 * @param PPFrame $frame The frame.
	 * @param string[] $args The arguments of the parser function call.
	 * @return string
	 */
	public static function parserFunctionCallback( Parser $parser, PPFrame $frame, $args ) {
		$desc = isset( $args[0] ) ? $frame->expand( $args[0] ) : '';
		self::setDescription( $parser, $desc );
		// The parser function itself renders nothing on the page.
		return '';
	}

	/**
	 * Returns roughly the requested number of characters, preserving words: if the
	 * cut falls inside a word, that word is completed, so the result can be slightly
	 * longer than $requestedLength.
	 *
	 * Borrowed from TextExtracts.
	 *
	 * @param string $text Source plain text to extract from. HTML tags should be removed by the description provider.
	 * @param int $requestedLength Number of characters after which to cut
	 * @return string
	 */
	public static function getFirstChars( string $text, int $requestedLength ) {
		if ( $requestedLength <= 0 ) {
			return '';
		}

		$length = mb_strlen( $text );
		if ( $length <= $requestedLength ) {
			return $text;
		}

		// Cut the text at the requested length, then re-append whatever remains of a word that
		// the cut split in two. The pattern is anchored at the start of the remainder and matches
		// a (possibly empty) run of word characters and/or forward slashes, so it stops at the
		// first whitespace or punctuation character.
		$pattern = '/^[\w\/]*/su';
		preg_match( $pattern, mb_substr( $text, $requestedLength ), $m );
		// $m stays empty if preg_match() fails (e.g. on invalid UTF-8), hence the fallback.
		$truncatedText = mb_substr( $text, 0, $requestedLength ) . ( $m[0] ?? '' );
		if ( $truncatedText === $text ) {
			return $text;
		}

		return trim( $truncatedText );
	}
}
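
To see how the word-preserving cut behaves, the logic of `getFirstChars` can be tried outside MediaWiki. The sketch below is a standalone copy for experimentation; the free function `firstChars` is a hypothetical name, and the mbstring extension is assumed to be available.

```php
<?php
// Standalone copy of the truncation logic above, for experimentation only.
// firstChars() is a hypothetical name; requires the mbstring extension.
function firstChars( string $text, int $requestedLength ): string {
	if ( $requestedLength <= 0 ) {
		return '';
	}
	if ( mb_strlen( $text ) <= $requestedLength ) {
		return $text;
	}
	// Match the rest of a word (word characters and/or slashes) right after the cut.
	preg_match( '/^[\w\/]*/su', mb_substr( $text, $requestedLength ), $m );
	$truncated = mb_substr( $text, 0, $requestedLength ) . ( $m[0] ?? '' );
	return $truncated === $text ? $text : trim( $truncated );
}

$text = 'The quick brown fox jumps';
echo firstChars( $text, 3 ), "\n";  // The
echo firstChars( $text, 7 ), "\n";  // The quick
echo firstChars( $text, 10 ), "\n"; // The quick brown
```

A cut that lands mid-word (7) or just before a word (10) pulls in the rest of that word, so the result may exceed the requested length; a cut at a word boundary (3) returns exactly the requested prefix.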