mediawiki-extensions-Cite/src/Parsoid/Ref.php

162 lines
5.5 KiB
PHP

<?php
declare( strict_types = 1 );
namespace Wikimedia\Parsoid\Ext\Cite;
use DOMElement;
use DOMNode;
use Exception;
use Wikimedia\Parsoid\Config\ParsoidExtensionAPI;
use Wikimedia\Parsoid\Ext\ExtensionTag;
use Wikimedia\Parsoid\Utils\ContentUtils;
use Wikimedia\Parsoid\Utils\DOMCompat;
use Wikimedia\Parsoid\Utils\DOMDataUtils;
use Wikimedia\Parsoid\Utils\DOMUtils;
use Wikimedia\Parsoid\Utils\WTUtils;
/**
* Simple token transform version of the Ref extension tag.
*/
class Ref extends ExtensionTag {
/** @inheritDoc */
public function toDOM( ParsoidExtensionAPI $extApi, string $txt, array $extArgs ) {
// Drop nested refs entirely, unless we've explicitly allowed them
Start untangling Parsoid internals from extensions * In this patch, toDOM, fromDOM, and DOM postprocessor extension methods all get a ParsoidExtensionAPI object. These API objects are constructed at the appropriate times in the wt2html and html2wt pipelines. * Got rid of direct references to SerializerState from fromDOM methods in extensions. * Exposed generic serialization and wikitext escaping methods in ParsoidExtensionAPI for extensions to leverage. The implementation of these methods is partial and only supports current usage of extensions in Parsoid's repo. This will need to be fully fleshed out going forward. * Stopped exposing wt2html options in toto and provided more specific convenience methods. * Reduced direct access to the Env object in a few more places. * Cite has code to inspect embedded HTML in data attributes of a node. Moved this code out of Cite into ParsoidExtensionAPI which reduces knowledge that extensions need. Unlike the other cleanups, this one is more of a convenience method since this code only requires knowledge of a publicly published spec. But, nevertheless an useful cleanup since it simplifies Cite's complexity just a bit. * More followup work is needed. - before/after methods should be eliminated in favour of a config flag that implements the inline/block layout option. Once this is done, extensions will no longer need direct access to the SerializerState internal object. - Env exposure should be reduced. - Provide access to Sanitizer via ParsoidExtensionAPI instead of needing extensions to directly import it. - It should be possible to eliminate the need for extensions to know about DSR / DSR-shifting and do it automatically via some high-level conceptual flag. - It might also be possible to infer source offsets directly via args instead of passing that explicitly. - Should we provide a convenience helper class with access to all the src/Utils/* methods? Bug: T242746 Change-Id: I7ffb5aa52a84854a9d363a0e8f1ce650241f1c41
2020-01-17 13:12:15 +00:00
$parentExtTag = $extApi->parentExtTag();
if ( $parentExtTag === 'ref' && empty( $extApi->parentExtTagOpts()['allowNestedRef'] ) ) {
return null;
}
// The one supported case for nested refs is from the {{#tag:ref}} parser
// function. However, we're overly permissive here since we can't
// distinguish when that's nested in another template.
// The php preprocessor did our expansion.
$allowNestedRef = !empty( $extApi->inTemplate() ) && $parentExtTag !== 'ref';
return $extApi->parseTokenContentsToDOM(
$extArgs,
'',
$txt,
[
// NOTE: sup's content model requires it only contain phrasing
// content, not flow content. However, since we are building an
// in-memory DOM which is simply a tree data structure, we can
// nest flow content in a <sup> tag.
'wrapperTag' => 'sup',
'pipelineOpts' => [
'extTag' => 'ref',
'extTagOpts' => [ 'allowNestedRef' => $allowNestedRef ],
// FIXME: One-off PHP parser state leak.
// This needs a better solution.
'inPHPBlock' => true,
],
]
);
}
/** @inheritDoc */
public function lintHandler(
ParsoidExtensionAPI $extApi, DOMElement $ref, callable $defaultHandler
): ?DOMNode {
// Don't lint the content of ref in ref, since it can lead to cycles
// using named refs
if ( WTUtils::fromExtensionContent( $ref, 'references' ) ) {
return $ref->nextSibling;
}
$refFirstChild = $ref->firstChild;
DOMUtils::assertElt( $refFirstChild );
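// Strip everything up to and including the first '#', leaving only the
// fragment, which is the id of the element the link points at.
$linkBackId = preg_replace( '/[^#]*#/', '', $refFirstChild->getAttribute( 'href' ), 1 );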
$refNode = $ref->ownerDocument->getElementById( $linkBackId );
// The target can be missing, e.g. for buggy input wikitext without ref content
if ( $refNode ) {
$defaultHandler( $refNode->lastChild );
}
return $ref->nextSibling;
}
/** @inheritDoc */
public function fromDOM(
ParsoidExtensionAPI $extApi, DOMElement $node, bool $wrapperUnmodified
) {
$startTagSrc = $extApi->serializeExtensionStartTag( $node );
$dataMw = DOMDataUtils::getDataMw( $node );
$html = null;
if ( !isset( $dataMw->body ) ) {
return $startTagSrc; // We self-closed this already.
} else {
if ( is_string( $dataMw->body->html ?? null ) ) {
// First look for the extension's content in data-mw.body.html
$html = $dataMw->body->html;
} elseif ( is_string( $dataMw->body->id ?? null ) ) {
// If the body isn't contained in data-mw.body.html, look if
// there's an element pointed to by body.id.
$bodyElt = DOMCompat::getElementById( $node->ownerDocument, $dataMw->body->id );
$editedDoc = $extApi->getPageConfig()->editedDoc ?? null;
if ( !$bodyElt && $editedDoc ) {
// Try to get to it from the main page.
// This can happen when the <ref> is inside another
// extension, most commonly inside a <references>.
// The recursive call to serializeDOM puts us inside
// a new document.
$bodyElt = DOMCompat::getElementById( $editedDoc, $dataMw->body->id );
}
if ( $bodyElt ) {
// n.b. this is going to drop any diff markers but since
// the dom differ doesn't traverse into extension content
// none should exist anyways.
DOMDataUtils::visitAndStoreDataAttribs( $bodyElt );
$html = ContentUtils::toXML( $bodyElt, [ 'innerXML' => true ] );
DOMDataUtils::visitAndLoadDataAttribs( $bodyElt );
} else {
// Some extra debugging for VisualEditor
$extraDebug = '';
$firstA = DOMCompat::querySelector( $node, 'a[href]' );
// querySelector() may return null, so check before reading the href
$href = $firstA ? $firstA->getAttribute( 'href' ) : '';
if ( $firstA && preg_match( '/^#/', $href ) ) {
try {
$ref = DOMCompat::querySelector( $node->ownerDocument, $href );
if ( $ref ) {
$extraDebug .= ' [own doc: ' . DOMCompat::getOuterHTML( $ref ) . ']';
}
// $editedDoc may be null; only query it when it is available
$ref = $editedDoc ? DOMCompat::querySelector( $editedDoc, $href ) : null;
if ( $ref ) {
$extraDebug .= ' [main doc: ' . DOMCompat::getOuterHTML( $ref ) . ']';
}
} catch ( Exception $e ) {
// We are just providing VE with debugging info.
// So, ignore all exceptions / errors in this code.
}
if ( !$extraDebug ) {
$extraDebug = ' [reference ' . $href . ' not found]';
}
}
$extApi->log(
'error/' . $dataMw->name,
'extension src id ' . $dataMw->body->id . ' points to non-existent element for:',
DOMCompat::getOuterHTML( $node ),
'. More debug info: ',
$extraDebug
);
return ''; // Drop it!
}
} else {
$extApi->log( 'error', 'Ref body unavailable for: ' . DOMCompat::getOuterHTML( $node ) );
return ''; // Drop it!
}
}
$src = $extApi->serializeHTML(
[
'extName' => $dataMw->name,
// FIXME: One-off PHP parser state leak.
// This needs a better solution.
'inPHPBlock' => true
],
$html
);
return $startTagSrc . $src . '</' . $dataMw->name . '>';
}
}