Tim Starling 8f3369b090 Avoid using regexes
Review regex usage, and use an alternative where possible, to improve
performance.

* Add PHPUtils::stripPrefix() and PHPUtils::stripSuffix(). Benchmark in
  doc comment.
* /foo/ -> str_contains()
* /^foo/ -> str_starts_with()
* /^f/ -> ($s[0] ?? '') === 'f'
* /foo$/ -> str_ends_with()
* /^(foo|bar)$/ -> in_array(), benchmark suggests 10x improvement
* preg_replace(/foo/) -> str_replace()
* preg_replace(/^[abc]/) -> strspn(), benchmark suggests 3x improvement.
  Curiously, it is faster without a limit for short input strings,
  although a limit presumably adds robustness.
* preg_replace(/[abc]+$/) -> rtrim()
* preg_match_all() -> substr_count()
* In DOMUtils::hasTypeOf(), use explode() instead of a regex. Validated
  by a benchmark.
* In DOMUtils::addTypeOf(), stop normalizing adjacent spaces. This
  allows us to use implode(explode()) without a filtering loop. The
  patch to Ext/Cite/References.php was to remove spaces added by this
  change. The parserTests.txt changes were a consequence of the
  References.php change.
* In LinkHandlerUtils::getHref() I allowed a single bare slash to be
  counted as a path-absolute URL since I think that was the intention of
  the original code.
* In LinkHandlerUtils::getLinkRoundTripData() I captured the portion of
  interest from the previous regex instead of running a second regex.
* LinkHandlerUtils::linkHandler() had the regex
  /^mw:WikiLink|mw:MediaLink$/ which I think was a bug, missing
  parentheses. I fixed the bug.
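
For illustration, the substitutions listed above can be sketched roughly as
follows. This is not the code from this change: every function name ending
in "Sketch" is hypothetical, the real signatures of PHPUtils::stripPrefix()
and DOMUtils::hasTypeOf() may differ, and PHP 8.0+ (or polyfills) is assumed
for str_contains() and friends. Note that an unmodified `$` in PCRE also
matches before a trailing newline, so the string functions are slightly
stricter than the regexes they replace.

```php
<?php
// Sketch only. Assumes PHP 8.0+ (or polyfills) for the str_*() functions.
// All *Sketch names are hypothetical and are not Parsoid's own helpers.

/** Remove $prefix from the start of $s if it is present. */
function stripPrefixSketch( string $s, string $prefix ): string {
	return str_starts_with( $s, $prefix ) ? substr( $s, strlen( $prefix ) ) : $s;
}

/** /^(mw:WikiLink|mw:MediaLink)$/ expressed as a list lookup. */
function isLinkRelSketch( string $rel ): bool {
	return in_array( $rel, [ 'mw:WikiLink', 'mw:MediaLink' ], true );
}

/** Strip leading a/b/c characters; a $limit of 1 mimics /^[abc]/. */
function stripLeadingSketch( string $s, ?int $limit = null ): string {
	return substr( $s, strspn( $s, 'abc', 0, $limit ) );
}

/** Space-separated token test, in the spirit of hasTypeOf(). */
function hasTokenSketch( string $attr, string $token ): bool {
	return in_array( $token, explode( ' ', $attr ), true );
}

$s = 'foobar foo';
var_dump( str_contains( $s, 'foo' ) );     // replaces /foo/
var_dump( str_starts_with( $s, 'foo' ) );  // replaces /^foo/
var_dump( ( $s[0] ?? '' ) === 'f' );       // replaces /^f/, safe on ''
var_dump( str_ends_with( $s, 'foo' ) );    // replaces /foo$/
var_dump( rtrim( 'xyzabcab', 'abc' ) );    // replaces preg_replace( '/[abc]+$/' )
var_dump( substr_count( $s, 'foo' ) );     // replaces preg_match_all( '/foo/' )

// Without parentheses each alternative is anchored on one side only, so
// the first pattern accepts any string that starts with "mw:WikiLink".
var_dump( preg_match( '/^mw:WikiLink|mw:MediaLink$/', 'mw:WikiLink/Interwiki' ) );
var_dump( preg_match( '/^(mw:WikiLink|mw:MediaLink)$/', 'mw:WikiLink/Interwiki' ) );
```

Calling stripLeadingSketch() without a limit removes the whole leading run,
which corresponds to /^[abc]+/ rather than /^[abc]/. Which of the two is
wanted depends on the call site.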

The margins are pretty tight for a lot of these. Using polyfills for
str_contains() etc. might change the conclusion.
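
A comparison of this kind can be reproduced with a small timing loop. The
harness below is a hypothetical sketch, not the benchmark used for this
change; timeItSketch() is an invented name, and the closure call overhead
is included in both measurements, which narrows the apparent gap.

```php
<?php
// Hypothetical micro-benchmark harness (PHP 8.0+); not the one used here.

/** Return the mean wall-clock time of $fn in nanoseconds per call. */
function timeItSketch( callable $fn, int $iterations = 100000 ): float {
	$start = hrtime( true );
	for ( $i = 0; $i < $iterations; $i++ ) {
		$fn();
	}
	return ( hrtime( true ) - $start ) / $iterations;
}

$input = 'mw:WikiLink';
$regexNs = timeItSketch( static fn () => preg_match( '/^mw:/', $input ) );
$nativeNs = timeItSketch( static fn () => str_starts_with( $input, 'mw:' ) );
printf( "preg_match: %.1f ns, str_starts_with: %.1f ns\n", $regexNs, $nativeNs );
```

Results vary with PHP version, JIT settings and input length, so the numbers
are only meaningful relative to each other on the same machine.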

Also:

* In DOMUtils::matchTypeOf(), avoid calling hasAttribute().
  getAttribute() is documented as returning an empty string if the
  attribute does not exist.
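
The behaviour relied on here can be checked against PHP's built-in DOM
extension. This is a minimal sketch rather than the actual matchTypeOf()
code; typeOfSketch() is a hypothetical name and the markup is made up for
the example.

```php
<?php
// Sketch using PHP's bundled DOM extension; not Parsoid's matchTypeOf().

/** Read the typeof attribute without a hasAttribute() guard. */
function typeOfSketch( DOMElement $node ): string {
	// A missing attribute already yields '', so no separate check is needed.
	return $node->getAttribute( 'typeof' );
}

$doc = new DOMDocument();
$doc->loadHTML( '<span typeof="mw:Transclusion">x</span><b>y</b>' );
$span = $doc->getElementsByTagName( 'span' )->item( 0 );
$bold = $doc->getElementsByTagName( 'b' )->item( 0 );

var_dump( typeOfSketch( $span ) ); // attribute present
var_dump( typeOfSketch( $bold ) ); // attribute absent
```

The one case this cannot distinguish is an attribute that is present but
empty, which is treated the same as a missing one.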

Change-Id: I8d7bdf1bccc869b4dc17058a5822ef34968471e6
2021-09-13 23:01:45 +00:00
src/Parsoid Avoid using regexes 2021-09-13 23:01:45 +00:00