Skip <h2> in TOC when extracting first section

This piece of code is only relevant when:
- the intro section is requested (either in plaintext or HTML);
- the parse result for the full page is available in the parser cache;
- the full extract is not available in the TextExtracts WAN cache;
- the intro is also not available in the TextExtracts WAN cache.

In this case getFirstSection() is called with the parser output, which differs from the convertText() output it is called with in other code paths, and still contains <h*> tags.

A quick regex is used to extract the first section. This stops at any <h2>. A TOC also contains an <h2> (which will be removed later via $wgExtractsRemoveClasses). This one needs to be ignored in case the TOC is placed before the first section using e.g. the __TOC__ keyword.

The patch changes the regex so it ignores an h2 with id="mw-toc-heading", but keeps working in plaintext mode when <h*> tags are not present (the code path when the intro section is requested, and the full extract is available in the TextExtracts WAN cache but the intro extract isn't).

Bug: T269967
Change-Id: I0a495d06cf1725744e556e81f17047fb53f53521
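For illustration only, the two patterns described above can be tried outside MediaWiki with a minimal standalone sketch. The function name firstSection() and the two constants are assumptions made for this example; the marker values are taken from the control characters that appear in the test data, and in the extension the markers are defined on ExtractFormatter.

```php
<?php
// Assumed marker values, mirroring the "\1\2" / "\2\1" bytes in the test data.
const SECTION_MARKER_START = "\1\2";
const SECTION_MARKER_END = "\2\1";

/**
 * Hypothetical standalone version of the first-section extraction.
 * Returns everything before the first heading that is not the TOC heading.
 */
function firstSection( string $text, bool $plainText ): string {
	if ( $plainText ) {
		// Stop at a section marker unless it introduces the TOC heading.
		$regexp = '/^.*?(?=' . SECTION_MARKER_START .
			'(?!.' . SECTION_MARKER_END . '<h2 id="mw-toc-heading"))/s';
	} else {
		// Stop at any <h1>..<h6> tag except the TOC heading.
		$regexp = '/^.*?(?=<h[1-6]\b(?! id="mw-toc-heading"))/s';
	}
	if ( preg_match( $regexp, $text, $matches ) ) {
		return $matches[0];
	}
	// No heading found: the whole text is the first section.
	return $text;
}

$html = '<h2 id="mw-toc-heading">Contents</h2>Intro<h2>Actual heading</h2>...';
echo firstSection( $html, false ), "\n";
// prints: <h2 id="mw-toc-heading">Contents</h2>Intro
```

The negative lookahead sits inside the positive lookahead, so the lazy `.*?` keeps expanding past the TOC heading and only stops at the next real heading.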
2024-11-23 15:56:52 +00:00 · 2021-05-31 15:13:30 +02:00 · 2021-05-31 15:13:30 +02:00 · 60e1c5ad83
parent 7a3af1def7
commit 60e1c5ad83
2 changed files with 13 additions and 2 deletions
--- a/includes/ApiQueryExtracts.php
+++ b/includes/ApiQueryExtracts.php
@@ -231,9 +231,10 @@ class ApiQueryExtracts extends ApiQueryBase {
 	 */
 	private function getFirstSection( $text, $plainText ) {
 		if ( $plainText ) {
-			$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';
+			$regexp = '/^.*?(?=' . ExtractFormatter::SECTION_MARKER_START .
+				'(?!.' . ExtractFormatter::SECTION_MARKER_END . '<h2 id="mw-toc-heading"))/s';
 		} else {
-			$regexp = '/^(.*?)(?=<h[1-6]\b)/s';
+			$regexp = '/^.*?(?=<h[1-6]\b(?! id="mw-toc-heading"))/s';
 		}
 		if ( preg_match( $regexp, $text, $matches ) ) {
 			$text = $matches[0];
--- a/tests/phpunit/ApiQueryExtractsTest.php
+++ b/tests/phpunit/ApiQueryExtractsTest.php
@@ -133,6 +133,16 @@ class ApiQueryExtractsTest extends \MediaWikiTestCase {
 				false,
 				'Example <h11>...',
 			],
+			'__TOC__ before intro (HTML)' => [
+				'<h2 id="mw-toc-heading">Contents</h2>Intro<h2>Actual heading</h2>...',
+				false,
+				'<h2 id="mw-toc-heading">Contents</h2>Intro',
+			],
+			'__TOC__ before intro (plaintext)' => [
+				"\1\2_\2\1<h2 id=\"mw-toc-heading\">Contents</h2>Intro\1\2_\2\1<h2>Actual heading</h2>...",
+				true,
+				"\1\2_\2\1<h2 id=\"mw-toc-heading\">Contents</h2>Intro",
+			],
 		];
 	}