Refine and fix "unclosed <ref> detected" regular expression

This simplifies the regular expression and fixes a series of issues
with it:

* Before, the wikitext `<REF><REF>` would not trigger the error, but
`<ref><ref>` would. Parser tags are case-insensitive, but the error
check was not.

* Before, the wikitext `<ref><ref name="<">` would not trigger the error.
`<` is a valid name, so the error check should not stop just because it
found a `<`.

* Both the old and the new code do *not* fail with the wikitext
`<ref><ref</ref>` where the inner `<ref` does not have a closing `>`. I
was thinking about changing this, but figured it might be used as a
feature.

* The old code was not able to properly understand HTML comments,
<nowiki> tags and such that contain a line break. That caused
inconsistent and confusing error reporting in some cases, but not in
others. This change *reduces* the number of errors this code produces.

* The old code was looking for "SGML tags" with names that could be
anything, not just alphanumeric characters. This allowed for strange
edge-cases like `<ref><>><ref></>></ref>` that were not reported, but
should have been. This change *increases* the number of errors. However,
relevant edge-cases should be extremely rare.

Note that the possessive quantifier `++` avoids backtracking, which
speeds up the regex.
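
The behavior changes listed above can be reproduced with a small sketch.
The two patterns are copied verbatim from the diff; the function names
`hasUnclosedRefOld` and `hasUnclosedRefNew` are hypothetical, and the
sample strings assume that `$text` is the content found inside the outer
`<ref>`, which this diff does not show.

```php
<?php
// Hypothetical wrappers around the old and new check. Only the regular
// expressions come from the change itself.

function hasUnclosedRefOld( string $text ): bool {
	// Strip balanced "tags" and single-line comments, then look for <ref ...>.
	$stripped = preg_replace( '#<([^ ]+?).*?>.*?</\\1 *>|<!--.*?-->#', '', $text );
	return (bool)preg_match( '/<ref\b[^<]*?>/', $stripped );
}

function hasUnclosedRefNew( string $text ): bool {
	// \w++ restricts tag names to word characters without backtracking;
	// the s modifier lets "." cross line breaks; the i modifier ignores case.
	$stripped = preg_replace( '#<(\w++).*?>.*?</\1\s*>|<!--.*?-->#s', '', $text );
	return (bool)preg_match( '/<ref\b.*?>/i', $stripped );
}

// Assumed inner texts, one per bullet point of the commit message.
$samples = [
	'uppercase tag'     => '<REF>',
	'name containing <' => '<ref name="<">',
	'missing >'         => '<ref',
	'multiline comment' => "<!-- <ref>\n-->",
	'non-word tag name' => '<>><ref></>>',
];

foreach ( $samples as $label => $text ) {
	printf(
		"%-18s old=%s new=%s\n",
		$label,
		var_export( hasUnclosedRefOld( $text ), true ),
		var_export( hasUnclosedRefNew( $text ), true )
	);
}
```

Under these assumptions, the first, second and last samples are only
flagged by the new check, the multiline comment is only flagged by the
old check, and the `<ref` without a closing `>` is flagged by neither.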

Change-Id: I0c61a245f4f743871b4cad886ce239650af2b37c
Thiemo Kreuz 2019-12-06 14:15:20 +01:00 committed by VolkerE
parent a3c589ac42
commit a7ee7c9586

@@ -241,8 +241,10 @@ class Cite {
 			}
 		}
-		if ( preg_match( '/<ref\b[^<]*?>/',
-			preg_replace( '#<([^ ]+?).*?>.*?</\\1 *>|<!--.*?-->#', '', $text ) ) ) {
+		if ( preg_match(
+			'/<ref\b.*?>/i',
+			preg_replace( '#<(\w++).*?>.*?</\1\s*>|<!--.*?-->#s', '', $text )
+		) ) {
 			// (bug T8199) This most likely implies that someone left off the
 			// closing </ref> tag, which will cause the entire article to be
 			// eaten up until the next <ref>. So we bail out early instead.