An empty pattern isn't "safe" since it could match in between the
bytes of a UTF-8 character.
Also, it turns out there's a bug in PHP <5.6.9 preg_replace() that we
need to work around too.
Change-Id: I282e5909e4663461d60c5386693db182de2fd44c
For the PHP implementation, return the codepoints as a table instead of
multiple return values that get table-ified in Lua, to avoid hitting
too-many-values stack limits.
For the pure-Lua version, inline most of ustring.codepoint instead of
calling it to avoid what's effectively "{ unpack( stuff ) }".
Bug: T118687
Change-Id: I105f388cc23ab55d4124739700ef89d5354b7dbc
The following continue to be ignored:
* Generic.Arrays.DisallowLongArraySyntax.Found, because I'm not sure
Scribunto is ready to abandon old version support in master.
* MediaWiki.ControlStructures.AssignmentInControlStructures.AssignmentInControlStructures,
because it's overly strict for its purpose.
Squiz.Classes.ValidClassName.NotCamelCaps isn't ignored globally, we
just ignore it explicitly every place it's needed.
Change-Id: I307668da6ef7b3e23da19b1fd1e08914239b99b3
This also makes some updates to make-normalization-table.php to handle
the move of UtfNormal to a separate library.
Bug: T126427
Change-Id: Id4985c3ca441cf92f08ba1f1af85c762ba43d7d2
Lua actually treats a close-bracket at the start of a bracketed
character class as a literal, rather than using it to close the
character class. Probably unintended behavior, but it happens.
Also, have the pure-lua version throw our more informative errors on
error even when falling back to string.find and the like, and fix some
other weird edge cases that came up in testing.
Bug: T95958
Bug: T115686
Change-Id: Iab783d4a3e58b1514cc09729d4a71c2cb1242ee8
Scribunto currently supports libraries with PHP callbacks that are
loaded on startup, and pure-Lua libraries that may be loaded from the
module with require().
This change allows for libraries with PHP callbacks to also be loaded
with require().
Change-Id: Ibdc1f4ef51b1c8644c3d4c98d57755b5c06447a5
Cache patterns and the regexes they become, avoid revalidating the same
pattern multiple times, and don't bother checking if something is a string
when we just made it one.
Change-Id: I1a61dd0a36eb449c8acdc8c1be68aae793f172d3
Lua's string functions tend to auto-convert numbers to strings. We
should do the same in mw.ustring.
Bug: 67201
Change-Id: Icd3c5e93bac19dafd78d737ec9b315daba9f1729
A few edge cases were being incorrectly handled:
* mw.ustring.sub( 'abc', 1, 0 ) returned 'a', not ''.
* mw.ustring.codepoint( 'abc', 1, 0 ) returned 97, not no results.
* mw.ustring.codepoint( 'abc', 4, 4 ) returned 99, not no results.
* mw.ustring.gcodepoint had the same issues as mw.ustring.codepoint.
Change-Id: Ib8c0ef5a8073106eb7d90d0aa0513be4525dca08
Negative values for 'i' in mw.ustring.byteoffset are supposed to count
from the end of the string. But in LuaSandbox, it was actually counting
from two bytes before the end of the string due to a typo.
Fix that, and add some tests for it.
Bug: 50176
Change-Id: Iceee1022a55abd7a08df1ea7843e1277eb02798b
The "%f[set]" frontier pattern has been in Lua 5.1 since the beginning,
but was undocumented until Lua 5.2. And the code is even unchanged from
5.1.0 to 5.2.1. So there's no reason not to implement it in ustring too.
Note the changes to UstringLibrary.php are somewhat large, because it
splits the "convert a Lua bracketed charset to PCRE" code into a
separate function and it changes the handling of mw.ustring.find's and
mw.ustring.match's 'init' parameter from "substring, match from 0, then
add back on $init" to "use preg_match's $offset and use \G instead of ^
where this matters". Both of these are necessary to properly support
%f.
This also fixes a bug in the pure-Lua code (not used in Scribunto)
exposed by the unit tests for %f where %z was matching '\1' rather than
'\0' and %Z everything except '\1' instead of everything except '\0'.
Bug: 48331
Change-Id: Ie0b95ef5b734db53d6adc9de5dae4874f8944c08
The following errors are fixed:
* PHP warning and wrong return value with empty pattern and plain
* Incorrect offsets returned when init is larger than the string length
* Incorrect captured offsets returned when init is excessively negative
Bug: 47365
Change-Id: I9741418287dc727747326d6a19678370ce155a2b
* mw.ustring.sub( '', 1 ) errors in LuaStandalone
* Default value for ustring.maxStringLength and ustring.maxPatternLength
should be infinity, not nil
* mw.ustring.find() returns one value instead of two in "plain" mode.
Change-Id: I5e65c4ec3a05f0e6930ce7ab7fd4ac72bea95e7f
The Lua manual says this:
For this function, a '^' at the start of a pattern does not work as an
anchor, as this would prevent the iteration.
I had interpreted that to mean that a pattern starting with '^' would
never match in gmatch. But further testing reveals that the '^' is just
treated as a literal character: string.gmatch( "foo ^bar baz", "^%a+" )
will match "^bar".
Change-Id: Id91d6ee2db753ce1d6a4f6ae27764691d9e9fdc4
This is a reimplementation of Lua's string library with support for
UTF-8.
The entire ustring library is implemented in pure Lua. PHP callbacks are
also available for overrides: in LuaSandbox these are used for almost
all functions, while in LuaStandalone they are used only for the pattern
matching. Also, ustring.upper and ustring.lower are overridden using
mw.language's .uc and .lc if available.
It also includes a bunch of unit tests.
Note that if you download the normalization tests, they may fail under
LuaSandbox if you have PHP's intl extension installed and libicu on your
system is too old.
Change-Id: Ie76fdf8d3a85d0a3d2a41b0d3b7afe433f247af0