mediawiki-extensions-Discus.../tests/phpunit/CommentParserTest.php

230 lines
7.2 KiB
PHP
Raw Normal View History

<?php
namespace MediaWiki\Extension\DiscussionTools\Tests;
use DateTimeImmutable;
use Error;
use MediaWiki\Extension\DiscussionTools\CommentItem;
use MediaWiki\Extension\DiscussionTools\CommentUtils;
use MediaWiki\Extension\DiscussionTools\HeadingItem;
use MediaWiki\Extension\DiscussionTools\ImmutableRange;
use MediaWiki\Extension\DiscussionTools\ThreadItem;
Change CommentParser into a service Goal: ----- To have a method like CommentParser::parse(), which just takes a node to parse and a title and returns plain data, so that we don't need to keep track of the config to construct a CommentParser object (the required config like content language is provided by services) and we don't need to keep that object around after parsing. Changes: -------- CommentParser.php: * …is now a service. Constructor only takes services as arguments. The node and title are passed to a new parse() method. * parse() should return plain data, but I split this part to a separate patch for ease of review: I49bfe019aa460651447fd383f73eafa9d7180a92. * CommentParser still cheats and accesses global state in a few places, e.g. calling Title::makeTitleSafe or CommentUtils::getTitleFromUrl, so we can't turn its tests into true unit tests. This work is left for future commits. LanguageData.php: * …is now a service, instead of a static class. Parser.js: * …is not a real service, but it's changed to behave in a similar way. Constructor takes only the required config as argument, and node and title are instead passed to a new parse() method. CommentParserTest.php: parser.test.js: * Can be simplified, now that we don't need a useless node and title to test internal methods that don't use them. testUtils.js: * Can be simplified, now that we don't need to override internal ResourceLoader stuff just to change the parser config. Change-Id: Iadb7757debe000025e52770ca51ebcf24ca8ee66
2022-02-19 02:43:21 +00:00
use MediaWiki\MediaWikiServices;
use stdClass;
use Title;
Don't refer directly to PHP `dom` extension classes; avoid nonstandard behavior These changes ensure that DiscussionTools is independent of DOM library choice, and will not break if/when Parsoid switches to an alternate (more standards-compliant) DOM library. We run `phan` against the Dodo standards-compliant DOM library, so this ends up flagging uses of non-standard PHP extensions to the DOM. These will be suppressed for now with a "Nonstandard DOM" comment that can be grepped for, since they will eventually will need to be rewritten or worked around. Most frequent issues: * Node::nodeValue and Node::textContent and Element::getAttribute() can return null in a spec-compliant implementation. Add `?? ''` to make spec-compliant results consistent w/ what PHP returns. * DOMXPath doesn't accept anything except DOMDocument. These uses should be replaced with DOMCompat::querySelectorAll() or similar (which end up using DOMXPath under the covers for DOMDocument any way, but are implemented more efficiently in a spec-compliant implementation). * A couple of times we have code like: `while ($node->firstChild!==null) { $node = $node->firstChild; }` and phan's analysis isn't strong enough to determine that $node is still non-null after the while. This same issue should appear with DOMDocument but phan doesn't complain for some reason. One apparently legit issue: * Node::insertBefore() is once called in a funny way which leans on the fact that the second option is optional in PHP. This seems to be a workaround for an ancient PHP bug, and can probably be safely removed. Bug: T287611 Bug: T217867 Change-Id: I3c4f41c3819770f85d68157c9f690d650b7266a3
2021-07-29 02:16:15 +00:00
use Wikimedia\Parsoid\DOM\Element;
use Wikimedia\Parsoid\DOM\Node;
use Wikimedia\Parsoid\Utils\DOMCompat;
use Wikimedia\TestingAccessWrapper;
/**
* @coversDefaultClass \MediaWiki\Extension\DiscussionTools\CommentParser
*
* @group DiscussionTools
*/
class CommentParserTest extends IntegrationTestCase {
/**
* Get the offset path from ancestor to offset in descendant
*
* Convert Unicode codepoint offsets to UTF-16 code unit offsets.
*
Don't refer directly to PHP `dom` extension classes; avoid nonstandard behavior These changes ensure that DiscussionTools is independent of DOM library choice, and will not break if/when Parsoid switches to an alternate (more standards-compliant) DOM library. We run `phan` against the Dodo standards-compliant DOM library, so this ends up flagging uses of non-standard PHP extensions to the DOM. These will be suppressed for now with a "Nonstandard DOM" comment that can be grepped for, since they will eventually will need to be rewritten or worked around. Most frequent issues: * Node::nodeValue and Node::textContent and Element::getAttribute() can return null in a spec-compliant implementation. Add `?? ''` to make spec-compliant results consistent w/ what PHP returns. * DOMXPath doesn't accept anything except DOMDocument. These uses should be replaced with DOMCompat::querySelectorAll() or similar (which end up using DOMXPath under the covers for DOMDocument any way, but are implemented more efficiently in a spec-compliant implementation). * A couple of times we have code like: `while ($node->firstChild!==null) { $node = $node->firstChild; }` and phan's analysis isn't strong enough to determine that $node is still non-null after the while. This same issue should appear with DOMDocument but phan doesn't complain for some reason. One apparently legit issue: * Node::insertBefore() is once called in a funny way which leans on the fact that the second option is optional in PHP. This seems to be a workaround for an ancient PHP bug, and can probably be safely removed. Bug: T287611 Bug: T217867 Change-Id: I3c4f41c3819770f85d68157c9f690d650b7266a3
2021-07-29 02:16:15 +00:00
* @param Element $ancestor
* @param Node $node
* @param int $nodeOffset
* @return string
*/
private static function getOffsetPath(
Don't refer directly to PHP `dom` extension classes; avoid nonstandard behavior These changes ensure that DiscussionTools is independent of DOM library choice, and will not break if/when Parsoid switches to an alternate (more standards-compliant) DOM library. We run `phan` against the Dodo standards-compliant DOM library, so this ends up flagging uses of non-standard PHP extensions to the DOM. These will be suppressed for now with a "Nonstandard DOM" comment that can be grepped for, since they will eventually will need to be rewritten or worked around. Most frequent issues: * Node::nodeValue and Node::textContent and Element::getAttribute() can return null in a spec-compliant implementation. Add `?? ''` to make spec-compliant results consistent w/ what PHP returns. * DOMXPath doesn't accept anything except DOMDocument. These uses should be replaced with DOMCompat::querySelectorAll() or similar (which end up using DOMXPath under the covers for DOMDocument any way, but are implemented more efficiently in a spec-compliant implementation). * A couple of times we have code like: `while ($node->firstChild!==null) { $node = $node->firstChild; }` and phan's analysis isn't strong enough to determine that $node is still non-null after the while. This same issue should appear with DOMDocument but phan doesn't complain for some reason. One apparently legit issue: * Node::insertBefore() is once called in a funny way which leans on the fact that the second option is optional in PHP. This seems to be a workaround for an ancient PHP bug, and can probably be safely removed. Bug: T287611 Bug: T217867 Change-Id: I3c4f41c3819770f85d68157c9f690d650b7266a3
2021-07-29 02:16:15 +00:00
Element $ancestor, Node $node, int $nodeOffset
): string {
if ( $node->nodeType === XML_TEXT_NODE ) {
$str = mb_substr( $node->nodeValue, 0, $nodeOffset );
// Count characters that require two code units to encode in UTF-16
$count = preg_match_all( '/[\x{010000}-\x{10FFFF}]/u', $str );
$nodeOffset += $count;
}
$path = [ $nodeOffset ];
while ( $node !== $ancestor ) {
if ( !$node->parentNode ) {
throw new Error( 'Not a descendant' );
}
array_unshift( $path, CommentUtils::childIndexOf( $node ) );
$node = $node->parentNode;
}
return implode( '/', $path );
}
private static function serializeComments( ThreadItem $threadItem, Element $root ): stdClass {
$serialized = new stdClass();
if ( $threadItem instanceof HeadingItem ) {
$serialized->placeholderHeading = $threadItem->isPlaceholderHeading();
}
$serialized->type = $threadItem->getType();
if ( $threadItem instanceof CommentItem ) {
$serialized->timestamp = $threadItem->getTimestampString();
$serialized->author = $threadItem->getAuthor();
}
// Can't serialize the DOM nodes involved in the range,
// instead use their offsets within their parent nodes
$range = $threadItem->getRange();
$serialized->range = [
self::getOffsetPath( $root, $range->startContainer, $range->startOffset ),
self::getOffsetPath( $root, $range->endContainer, $range->endOffset )
];
if ( $threadItem instanceof CommentItem ) {
$serialized->signatureRanges = array_map( function ( ImmutableRange $range ) use ( $root ) {
return [
self::getOffsetPath( $root, $range->startContainer, $range->startOffset ),
self::getOffsetPath( $root, $range->endContainer, $range->endOffset )
];
}, $threadItem->getSignatureRanges() );
}
if ( $threadItem instanceof HeadingItem ) {
$serialized->headingLevel = $threadItem->getHeadingLevel();
}
$serialized->level = $threadItem->getLevel();
$serialized->name = $threadItem->getName();
$serialized->id = $threadItem->getId();
$serialized->warnings = $threadItem->getWarnings();
// Ignore warnings about legacy IDs (we don't have them in JS)
$serialized->warnings = array_values( array_diff( $serialized->warnings, [ 'Duplicate comment legacy ID' ] ) );
$serialized->replies = [];
foreach ( $threadItem->getReplies() as $reply ) {
$serialized->replies[] = self::serializeComments( $reply, $root );
}
return $serialized;
}
/**
* @dataProvider provideTimestampRegexps
* @covers ::getTimestampRegexp
*/
public function testGetTimestampRegexp(
string $format, string $expected, string $message
): void {
$parser = TestingAccessWrapper::newFromObject(
Change CommentParser into a service Goal: ----- To have a method like CommentParser::parse(), which just takes a node to parse and a title and returns plain data, so that we don't need to keep track of the config to construct a CommentParser object (the required config like content language is provided by services) and we don't need to keep that object around after parsing. Changes: -------- CommentParser.php: * …is now a service. Constructor only takes services as arguments. The node and title are passed to a new parse() method. * parse() should return plain data, but I split this part to a separate patch for ease of review: I49bfe019aa460651447fd383f73eafa9d7180a92. * CommentParser still cheats and accesses global state in a few places, e.g. calling Title::makeTitleSafe or CommentUtils::getTitleFromUrl, so we can't turn its tests into true unit tests. This work is left for future commits. LanguageData.php: * …is now a service, instead of a static class. Parser.js: * …is not a real service, but it's changed to behave in a similar way. Constructor takes only the required config as argument, and node and title are instead passed to a new parse() method. CommentParserTest.php: parser.test.js: * Can be simplified, now that we don't need a useless node and title to test internal methods that don't use them. testUtils.js: * Can be simplified, now that we don't need to override internal ResourceLoader stuff just to change the parser config. Change-Id: Iadb7757debe000025e52770ca51ebcf24ca8ee66
2022-02-19 02:43:21 +00:00
MediaWikiServices::getInstance()->getService( 'DiscussionTools.CommentParser' )
);
// HACK: Fix differences between JS & PHP regexes
// TODO: We may just have to have two version in the test data
$expected = preg_replace( '/\\\\u([0-9A-F]+)/', '\\\\x{$1}', $expected );
$expected = str_replace( ':', '\:', $expected );
$expected = '/' . $expected . '/u';
$result = $parser->getTimestampRegexp( 'en', $format, '\\d', [ 'UTC' => 'UTC' ] );
self::assertSame( $expected, $result, $message );
}
public function provideTimestampRegexps(): array {
return self::getJson( '../cases/timestamp-regex.json' );
}
/**
* @dataProvider provideTimestampParser
* @covers ::getTimestampParser
*/
public function testGetTimestampParser(
string $format, array $data, string $expected, string $message
): void {
$parser = TestingAccessWrapper::newFromObject(
Change CommentParser into a service Goal: ----- To have a method like CommentParser::parse(), which just takes a node to parse and a title and returns plain data, so that we don't need to keep track of the config to construct a CommentParser object (the required config like content language is provided by services) and we don't need to keep that object around after parsing. Changes: -------- CommentParser.php: * …is now a service. Constructor only takes services as arguments. The node and title are passed to a new parse() method. * parse() should return plain data, but I split this part to a separate patch for ease of review: I49bfe019aa460651447fd383f73eafa9d7180a92. * CommentParser still cheats and accesses global state in a few places, e.g. calling Title::makeTitleSafe or CommentUtils::getTitleFromUrl, so we can't turn its tests into true unit tests. This work is left for future commits. LanguageData.php: * …is now a service, instead of a static class. Parser.js: * …is not a real service, but it's changed to behave in a similar way. Constructor takes only the required config as argument, and node and title are instead passed to a new parse() method. CommentParserTest.php: parser.test.js: * Can be simplified, now that we don't need a useless node and title to test internal methods that don't use them. testUtils.js: * Can be simplified, now that we don't need to override internal ResourceLoader stuff just to change the parser config. Change-Id: Iadb7757debe000025e52770ca51ebcf24ca8ee66
2022-02-19 02:43:21 +00:00
MediaWikiServices::getInstance()->getService( 'DiscussionTools.CommentParser' )
);
$expected = new DateTimeImmutable( $expected );
$tsParser = $parser->getTimestampParser( 'en', $format, null, 'UTC', [ 'UTC' => 'UTC' ] );
self::assertEquals( $expected, $tsParser( $data ), $message );
}
public function provideTimestampParser(): array {
return self::getJson( '../cases/timestamp-parser.json' );
}
/**
* @dataProvider provideTimestampParserDST
* @covers ::getTimestampParser
*/
public function testGetTimestampParserDST(
string $sample, string $expected, string $expectedUtc, string $format,
string $timezone, array $timezoneAbbrs, string $message
): void {
$parser = TestingAccessWrapper::newFromObject(
Change CommentParser into a service Goal: ----- To have a method like CommentParser::parse(), which just takes a node to parse and a title and returns plain data, so that we don't need to keep track of the config to construct a CommentParser object (the required config like content language is provided by services) and we don't need to keep that object around after parsing. Changes: -------- CommentParser.php: * …is now a service. Constructor only takes services as arguments. The node and title are passed to a new parse() method. * parse() should return plain data, but I split this part to a separate patch for ease of review: I49bfe019aa460651447fd383f73eafa9d7180a92. * CommentParser still cheats and accesses global state in a few places, e.g. calling Title::makeTitleSafe or CommentUtils::getTitleFromUrl, so we can't turn its tests into true unit tests. This work is left for future commits. LanguageData.php: * …is now a service, instead of a static class. Parser.js: * …is not a real service, but it's changed to behave in a similar way. Constructor takes only the required config as argument, and node and title are instead passed to a new parse() method. CommentParserTest.php: parser.test.js: * Can be simplified, now that we don't need a useless node and title to test internal methods that don't use them. testUtils.js: * Can be simplified, now that we don't need to override internal ResourceLoader stuff just to change the parser config. Change-Id: Iadb7757debe000025e52770ca51ebcf24ca8ee66
2022-02-19 02:43:21 +00:00
MediaWikiServices::getInstance()->getService( 'DiscussionTools.CommentParser' )
);
$regexp = $parser->getTimestampRegexp( 'en', $format, '\\d', $timezoneAbbrs );
$tsParser = $parser->getTimestampParser( 'en', $format, null, $timezone, $timezoneAbbrs );
$expected = new DateTimeImmutable( $expected );
$expectedUtc = new DateTimeImmutable( $expectedUtc );
preg_match( $regexp, $sample, $match, PREG_OFFSET_CAPTURE );
$date = $tsParser( $match );
self::assertEquals( $expected, $date, $message );
self::assertEquals( $expectedUtc, $date, $message );
}
public function provideTimestampParserDST(): array {
return self::getJson( '../cases/timestamp-parser-dst.json' );
}
/**
* @dataProvider provideComments
* @covers ::parse
* @covers ::buildThreadItems
* @covers ::buildThreads
* @covers ::computeIdsAndNames
*/
public function testGetThreads(
string $name, string $title, string $dom, string $expected, string $config, string $data
): void {
$title = Title::newFromText( $title );
$dom = self::getHtml( $dom );
$expectedPath = $expected;
$expected = self::getJson( $expected );
$config = self::getJson( $config );
$data = self::getJson( $data );
$doc = self::createDocument( $dom );
$body = DOMCompat::getBody( $doc );
$this->setupEnv( $config, $data );
$threadItemSet = self::createParser( $data )->parse( $body, $title );
$threads = $threadItemSet->getThreads();
$processedThreads = [];
foreach ( $threads as $i => $thread ) {
$thread = self::serializeComments( $thread, $body );
$thread = json_decode( json_encode( $thread ), true );
$processedThreads[] = $thread;
}
// Optionally write updated content to the JSON files
if ( getenv( 'DISCUSSIONTOOLS_OVERWRITE_TESTS' ) ) {
self::overwriteJsonFile( $expectedPath, $processedThreads );
}
foreach ( $threads as $i => $thread ) {
self::assertEquals( $expected[$i], $processedThreads[$i], $name . ' section ' . $i );
}
}
public function provideComments(): array {
return self::getJson( '../cases/comments.json' );
}
}