Skip <h2> in TOC when extracting first section

This piece of code is only relevant when:
- the intro section is requested (either in plaintext or HTML);
- the parse result for the full page is available in the parser cache;
- the full extract is not available in the TextExtracts WAN cache;
- the intro is also not available in the TextExtracts WAN cache.

In this case getFirstSection() is called with the parser output,
which differs from the convertText() output it is called with in
other code paths, and still contains <h*> tags. A quick regex is
used to extract the first section. This stops at any <h2>. A TOC
also contains an <h2> (which will be removed later via
$wgExtractsRemoveClasses). That heading needs to be ignored when
the TOC is placed before the first section, e.g. using the __TOC__
keyword.

The patch changes the regex so it ignores an <h2> with
id="mw-toc-heading", but keeps working in plaintext mode when <h*>
tags are not present (the code path when the intro section is
requested, and the full extract is available in the TextExtracts
WAN cache but the intro extract isn't).
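
The effect of the HTML-mode change can be checked in isolation with a
minimal sketch. The firstSection() helper below is hypothetical and
only mimics the preg_match() call in getFirstSection(); the two
patterns and the sample input are copied from this commit.

```php
<?php
// Hypothetical helper, not part of the extension: returns everything
// before the first heading matched by $regexp, or $html unchanged.
function firstSection( string $html, string $regexp ): string {
	return preg_match( $regexp, $html, $matches ) ? $matches[0] : $html;
}

// Pattern before the patch: stops at the very first <h1>..<h6>.
$old = '/^(.*?)(?=<h[1-6]\b)/s';
// Pattern after the patch: a heading tag directly followed by
// ' id="mw-toc-heading"' no longer ends the first section.
$new = '/^.*?(?=<h[1-6]\b(?! id="mw-toc-heading"))/s';

// Sample input taken from the HTML test case in this commit.
$html = '<h2 id="mw-toc-heading">Contents</h2>Intro<h2>Actual heading</h2>...';

var_dump( firstSection( $html, $old ) === '' );
var_dump( firstSection( $html, $new ) === '<h2 id="mw-toc-heading">Contents</h2>Intro' );
```

With the old pattern the extract is empty because the TOC heading is
the first thing in the text. With the new pattern the match runs up to
the first real heading.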

Bug: T269967
Change-Id: I0a495d06cf1725744e556e81f17047fb53f53521
Thiemo Kreuz 2021-05-31 15:13:30 +02:00 committed by Gergő Tisza
parent 7a3af1def7
commit 60e1c5ad83
2 changed files with 13 additions and 2 deletions

@@ -231,9 +231,10 @@ class ApiQueryExtracts extends ApiQueryBase {
 	 */
 	private function getFirstSection( $text, $plainText ) {
 		if ( $plainText ) {
-			$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';
+			$regexp = '/^.*?(?=' . ExtractFormatter::SECTION_MARKER_START .
+				'(?!.' . ExtractFormatter::SECTION_MARKER_END . '<h2 id="mw-toc-heading"))/s';
 		} else {
-			$regexp = '/^(.*?)(?=<h[1-6]\b)/s';
+			$regexp = '/^.*?(?=<h[1-6]\b(?! id="mw-toc-heading"))/s';
 		}
 		if ( preg_match( $regexp, $text, $matches ) ) {
 			$text = $matches[0];

@@ -133,6 +133,16 @@ class ApiQueryExtractsTest extends \MediaWikiTestCase {
 				false,
 				'Example <h11>...',
 			],
+			'__TOC__ before intro (HTML)' => [
+				'<h2 id="mw-toc-heading">Contents</h2>Intro<h2>Actual heading</h2>...',
+				false,
+				'<h2 id="mw-toc-heading">Contents</h2>Intro',
+			],
+			'__TOC__ before intro (plaintext)' => [
+				"\1\2_\2\1<h2 id=\"mw-toc-heading\">Contents</h2>Intro\1\2_\2\1<h2>Actual heading</h2>...",
+				true,
+				"\1\2_\2\1<h2 id=\"mw-toc-heading\">Contents</h2>Intro",
+			],
 		];
 	}
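
The plaintext pattern from the first hunk can be exercised against the
plaintext test string in a standalone sketch. The marker values below
are an assumption inferred from the "\1\2_\2\1" sequences in the test
data; the real constants are defined in ExtractFormatter and are not
shown in this diff.

```php
<?php
// Assumed values, inferred from the test strings in this commit.
const SECTION_MARKER_START = "\1\2";
const SECTION_MARKER_END = "\2\1";

// Same shape as the plaintext pattern in the patch: stop at a section
// marker unless that marker introduces the TOC heading. The single "."
// skips the one character between the start and end markers.
$regexp = '/^.*?(?=' . SECTION_MARKER_START .
	'(?!.' . SECTION_MARKER_END . '<h2 id="mw-toc-heading"))/s';

// Input from the '__TOC__ before intro (plaintext)' test case.
$withToc = "\1\2_\2\1<h2 id=\"mw-toc-heading\">Contents</h2>Intro"
	. "\1\2_\2\1<h2>Actual heading</h2>...";
// Hypothetical input without any <h*> tags, for comparison.
$withoutTags = "Intro\1\2_\2\1Heading\nBody";

$introWithToc = preg_match( $regexp, $withToc, $m ) ? $m[0] : $withToc;
$introPlain = preg_match( $regexp, $withoutTags, $m ) ? $m[0] : $withoutTags;

var_dump( $introWithToc === "\1\2_\2\1<h2 id=\"mw-toc-heading\">Contents</h2>Intro" );
var_dump( $introPlain === 'Intro' );
```

The second input illustrates the case the commit message calls out:
when no <h*> tags follow the marker, the negative lookahead never
applies and the match still ends at the first section marker.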