24 commits

Author SHA1 Message Date
Umherirrender 4abed1d7c7 Use short array syntax
Done by phpcbf over composer fix

Change-Id: I9b7419e025ef499ff68be79789d76ad4b886d256
2017-06-16 13:26:30 +00:00
Brad Jorsch fe094e7bae Update ustring data tables
normalization-data.lua is updated to Unicode 6.3.0.

upper.lua and lower.lua are updated to match HHVM 3.12.1's mb_strtoupper
and mb_strtolower. I don't know what version of Unicode that might be,
but it seems old.

Bug: T86096
Change-Id: I1a0c8be2756f86db5f36dd67319a1f79aea98b3e
2017-01-21 03:26:27 +00:00
jenkins-bot ae677fbc0d Merge "Ustring: Let gcodepoint work with moderately long strings" 2016-12-16 00:42:02 +00:00
Brad Jorsch 629f11d0dd Fix pure-Lua ustring and empty patterns
An empty pattern isn't "safe" since it could match in between the
bytes of a UTF-8 character.

Also, it turns out there's a bug in PHP <5.6.9 preg_replace() that we
need to work around too.

Change-Id: I282e5909e4663461d60c5386693db182de2fd44c
2016-10-05 14:32:27 -04:00
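
The hazard described in the commit above is not specific to Lua. The following is a minimal Python sketch, not the Scribunto code: it applies an empty pattern to the UTF-8 bytes of a string and reports which match offsets land inside a multi-byte character. The helper names `empty_match_offsets` and `is_char_boundary` are made up for this illustration.

```python
import re

def empty_match_offsets(data):
    """Offsets at which an empty byte pattern matches in data."""
    return [m.start() for m in re.finditer(b"", data)]

def is_char_boundary(data, offset):
    """True if offset does not point at a UTF-8 continuation byte (10xxxxxx)."""
    return offset == len(data) or (data[offset] & 0xC0) != 0x80

raw = "a\u00e9".encode("utf-8")   # "a" + e-acute, encoded as b'a\xc3\xa9'
offsets = empty_match_offsets(raw)
inside = [o for o in offsets if not is_char_boundary(raw, o)]
print(offsets, inside)
```

A byte-oriented matcher sees four candidate positions here, one of which splits the two-byte character, which is why an empty pattern cannot be handed to the byte-oriented string functions as-is.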
Brad Jorsch d643f40de9 Ustring: Let gcodepoint work with moderately long strings
For the PHP implementation, return the codepoints as a table instead of
multiple return values that get table-ified in Lua, to avoid hitting
too-many-values stack limits.

For the pure-Lua version, inline most of ustring.codepoint instead of
calling it to avoid what's effectively "{ unpack( stuff ) }".

Bug: T118687
Change-Id: I105f388cc23ab55d4124739700ef89d5354b7dbc
2016-07-15 19:35:58 +00:00
Brad Jorsch aa4d72e3ff Fix uncontroversial phpcs errors
The following continue to be ignored:
* Generic.Arrays.DisallowLongArraySyntax.Found, because I'm not sure
  Scribunto is ready to abandon old version support in master.
* MediaWiki.ControlStructures.AssignmentInControlStructures.AssignmentInControlStructures,
  because it's overly strict for its purpose.

Squiz.Classes.ValidClassName.NotCamelCaps isn't ignored globally, we
just ignore it explicitly every place it's needed.

Change-Id: I307668da6ef7b3e23da19b1fd1e08914239b99b3
2016-05-18 16:31:28 -04:00
Brad Jorsch b3da8a698d Add toNFKC and toNFKD to mw.ustring
This also makes some updates to make-normalization-table.php to handle
the move of UtfNormal to a separate library.

Bug: T126427
Change-Id: Id4985c3ca441cf92f08ba1f1af85c762ba43d7d2
2016-04-02 15:22:42 +00:00
Brad Jorsch 29266a9a0f Use correct variable in ustring.lua
Change-Id: Ic576b8c31c487c106593050538f9f2cc5b722b62
2016-01-02 10:49:48 -05:00
Brad Jorsch cd618c7a92 ustring: Handle "empty" charset like Lua does (part 2)
Lua actually treats a close-bracket at the start of a bracketed
character class as a literal, rather than using it to close the
character class. Probably unintended behavior, but it happens.

Also, have the pure-Lua version throw our more informative errors
even when falling back to string.find and the like, and fix some
other weird edge cases that came up in testing.

Bug: T95958
Bug: T115686
Change-Id: Iab783d4a3e58b1514cc09729d4a71c2cb1242ee8
2015-10-16 09:26:55 -04:00
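
The bracket behavior described in the commit above can be modeled with a scanner that always consumes at least one character before it starts looking for the closing bracket. This is a hedged Python sketch of that idea, based on my reading of how Lua 5.1 scans a character class; it is not the ustring code, and `charset_end` is a hypothetical name.

```python
def charset_end(pattern, start):
    """Return the index just past the ']' closing the class that opens
    at pattern[start] == '['. Raises ValueError if it is never closed."""
    p = start + 1
    if p < len(pattern) and pattern[p] == "^":
        p += 1                      # skip the negation marker
    while True:
        if p >= len(pattern):
            raise ValueError("malformed pattern (missing ']')")
        c = pattern[p]
        p += 1                      # the first character is always consumed,
                                    # so a leading ']' is taken literally
        if c == "%" and p < len(pattern):
            p += 1                  # skip the escaped character, e.g. '%]'
        if p < len(pattern) and pattern[p] == "]":
            return p + 1
```

Under this model `charset_end("[]a]", 0)` returns 4, so the class contains `]` and `a`, while `"[]"` and `"[^]"` run off the end of the pattern and raise the "missing ']'" error.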
Jan Berkel fb20934b16 Fix a problem with simple pattern detection
A string with a dot pattern is only "simple" if followed by +, - or *.
The end-of-string condition was not checked properly.

Change-Id: Ia10b9164caeabe464c76441cc82eef37a7013048
2015-10-07 10:27:45 -04:00
Jan Berkel 7c5454b36c Fix off-by-one error in gsub
Change-Id: I49c0386970e007271d23087fd112580af7b21c9c
2015-09-23 17:41:15 +01:00
Brad Jorsch 4669e43135 ustring: Handle empty charset like Lua does
Both '[]' and '[^]' give a rather odd error, but it's probably best to
follow suit.

Bug: T95958
Change-Id: I3310da55f655537c9082fc9039003f6b2d31eff4
2015-04-13 18:20:33 -04:00
Kunal Mehta 3f5f3e247f Use full <?php instead of short <? in ustring generation scripts
Change-Id: Ida6bc4ee1803763b284fdaa7c63769a146fec6ad
2015-03-17 18:16:20 -07:00
Kunal Mehta f5a8a3b0ae Update make-normalization-table for core file moves
Depends upon Ib530ad9dbe1d3a33dc53ef8b9620f61d4e1a2d62 in core.

Change-Id: Ib530ad9dbe1d3a33dc53ef8b9620f61d4e1a2d62
2015-02-04 20:04:41 +00:00
Brad Jorsch 0367e9bddd Fix deceptively-simple pattern in pure-Lua ustring
The pure-Lua ustring pattern matching functions short-circuit to the
much faster string library when the pattern would match the same against
the raw bytes.

A pattern like "[^a-z]" can match a partial UTF-8 character when applied
bytewise, and so must be detected as unsafe.

Let's also directly test the pure-Lua module, instead of me having to
comment out lines in Scribunto_LuaUstringLibrary::register() whenever I
want to test them.

Change-Id: I91ed3374aadfea379b9db2e13b4248ab20df509e
2014-08-10 01:18:18 +00:00
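
To illustrate why a pattern like "[^a-z]" is unsafe to run bytewise, here is a small Python sketch (an analogy using Python's `re` module, not the pure-Lua ustring code). It runs the same negated class once over code points and once over the raw UTF-8 bytes.

```python
import re

text = "\u00e9"                 # e-acute, a single code point
raw = text.encode("utf-8")      # two bytes: b'\xc3\xa9'

# Matching over code points returns the whole character.
char_match = re.search(r"[^a-z]", text).group()

# Matching over bytes returns only the lead byte, i.e. half a character.
byte_match = re.search(rb"[^a-z]", raw).group()

print(repr(char_match), repr(byte_match))
```

The bytewise result is a lone lead byte that is not valid UTF-8 on its own, so the shortcut to the byte-oriented library gives a different answer than the Unicode-aware matcher would.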
Brad Jorsch cb2a331565 Fix wrong variable in ustring.lua
Change-Id: Ibc8056b36d615b57d357987c59219a22e63fdfe8
2014-07-11 11:25:35 -04:00
Brad Jorsch bf39827980 mw.ustring functions should accept numbers where string functions do
Lua's string functions tend to auto-convert numbers to strings. We
should do the same in mw.ustring.

Bug: 67201
Change-Id: Icd3c5e93bac19dafd78d737ec9b315daba9f1729
2014-06-27 12:31:04 -04:00
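
The coercion rule from the commit above might be sketched as follows. This is a Python illustration under stated assumptions, not the mw.ustring implementation: `to_lua_string` and `ulen` are hypothetical names, and the "%.14g" format is assumed to mirror how Lua 5.1 converts numbers to strings.

```python
def to_lua_string(value):
    """Accept a string or a number, as Lua's string functions tend to do."""
    if isinstance(value, bool):
        # Booleans are not numbers in Lua, so they are rejected here too.
        raise TypeError("string expected, got boolean")
    if isinstance(value, str):
        return value
    if isinstance(value, (int, float)):
        return "%.14g" % value      # assumed Lua 5.1 number format
    raise TypeError("string expected, got " + type(value).__name__)

def ulen(value):
    """Length in code points of a string or a number's string form."""
    return len(to_lua_string(value))
```

With this helper, `ulen(12345)` behaves like `ulen("12345")`, while tables, booleans and other non-string values are still rejected.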
jenkins-bot 99b96d8b14 Merge "Add frontier pattern (%f[set]) to ustring" 2013-08-30 06:22:40 +00:00
Brad Jorsch d8314539da Fix mw.ustring edge cases
A few edge cases were being incorrectly handled:
* mw.ustring.sub( 'abc', 1, 0 ) returned 'a' instead of ''.
* mw.ustring.codepoint( 'abc', 1, 0 ) returned 97 instead of no results.
* mw.ustring.codepoint( 'abc', 4, 4 ) returned 99 instead of no results.
* mw.ustring.gcodepoint had the same issues as mw.ustring.codepoint.

Change-Id: Ib8c0ef5a8073106eb7d90d0aa0513be4525dca08
2013-07-03 11:49:52 -04:00
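
The expected results listed in the commit above follow from Lua's usual index rules: 1-based inclusive ranges, negative indices counting from the end, and an empty result when the start exceeds the end. A minimal Python sketch of those rules, with hypothetical names `usub` and `ucodepoint` and no claim to match the real implementation line for line:

```python
def _normalize_range(length, i, j):
    """Clamp a 1-based inclusive range the way Lua's string.sub does."""
    if i < 0:
        i = max(length + i + 1, 1)  # negative start counts from the end
    elif i == 0:
        i = 1
    if j < 0:
        j = length + j + 1          # may stay below 1, giving an empty range
    elif j > length:
        j = length
    return i, j

def usub(s, i=1, j=-1):
    """Substring by code point; empty when the range is empty."""
    i, j = _normalize_range(len(s), i, j)
    return s[i - 1:j] if i <= j else ""

def ucodepoint(s, i=1, j=None):
    """List of code points in the range; j defaults to i."""
    if j is None:
        j = i
    i, j = _normalize_range(len(s), i, j)
    return [ord(c) for c in s[i - 1:j]] if i <= j else []
```

For example, `usub("abc", 1, 0)` yields an empty string and `ucodepoint("abc", 4, 4)` yields an empty list, matching the corrected behavior in the list above.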
Brad Jorsch 82820aafc8 Add frontier pattern (%f[set]) to ustring
The "%f[set]" frontier pattern has been in Lua 5.1 since the beginning,
but was undocumented until Lua 5.2. And the code is even unchanged from
5.1.0 to 5.2.1. So there's no reason not to implement it in ustring too.

Note the changes to UstringLibrary.php are somewhat large, because it
splits the "convert a Lua bracketed charset to PCRE" code into a
separate function and it changes the handling of mw.ustring.find's and
mw.ustring.match's 'init' parameter from "substring, match from 0, then
add back on $init" to "use preg_match's $offset and use \G instead of ^
where this matters". Both of these are necessary to properly support
%f.

This also fixes a bug in the pure-Lua code (not used in Scribunto)
exposed by the unit tests for %f, where %z was matching '\1' rather than
'\0' and %Z was matching everything except '\1' instead of everything
except '\0'.

Bug: 48331
Change-Id: Ie0b95ef5b734db53d6adc9de5dae4874f8944c08
2013-05-12 10:27:36 -04:00
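
For readers unfamiliar with the frontier pattern added in the commit above: "%f[set]" matches the empty string at a position where the next character belongs to the set and the previous one does not, with both ends of the string treated as '\0'. The Python sketch below models only that definition; it stands in for neither the PCRE conversion nor the pure-Lua code, and the set is passed as a predicate under the hypothetical name `in_set`.

```python
def frontier_positions(s, in_set):
    """0-based offsets where a frontier for the given set occurs."""
    positions = []
    for i in range(len(s) + 1):
        prev = s[i - 1] if i > 0 else "\0"      # start of string acts as '\0'
        cur = s[i] if i < len(s) else "\0"      # end of string acts as '\0'
        if not in_set(prev) and in_set(cur):
            positions.append(i)
    return positions

sample = "THE (quick) fox"
word_starts = frontier_positions(sample, str.isalpha)                 # like %f[%a]
word_ends = frontier_positions(sample, lambda c: not c.isalpha())     # like %f[%A]
print(word_starts, word_ends)
```

Because the test looks at the character before the match position, matching from an offset has to keep the preceding text visible, which is consistent with the commit's move away from matching against a substring.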
Brad Jorsch d6f3633428 (bug 47365) Fix edge cases in mw.ustring.find, mw.ustring.match
The following errors are fixed:
* PHP warning and wrong return value with empty pattern and plain
* Incorrect offsets returned when init is larger than the string length
* Incorrect captured offsets returned when init is excessively negative

Bug: 47365
Change-Id: I9741418287dc727747326d6a19678370ce155a2b
2013-05-10 06:00:02 +00:00
Brad Jorsch fcef54e9d9 Fix ustring errors
* mw.ustring.sub( '', 1 ) errors in LuaStandalone
* Default value for ustring.maxStringLength and ustring.maxPatternLength
  should be infinity, not nil
* mw.ustring.find() returns one value instead of two in "plain" mode.

Change-Id: I5e65c4ec3a05f0e6930ce7ab7fd4ac72bea95e7f
2013-03-05 12:22:15 -05:00
Brad Jorsch 4dcac2fcd9 Fix mw.ustring.gmatch and patterns with '^'
The Lua manual says this:

 For this function, a '^' at the start of a pattern does not work as an
 anchor, as this would prevent the iteration.

I had interpreted that to mean that a pattern starting with '^' would
never match in gmatch. But further testing reveals that the '^' is just
treated as a literal character: string.gmatch( "foo ^bar baz", "^%a+" )
will match "^bar".

Change-Id: Id91d6ee2db753ce1d6a4f6ae27764691d9e9fdc4
2013-02-14 14:25:55 -05:00
Brad Jorsch 0a8757baba Lua ustring implementation
This is a reimplementation of Lua's string library with support for
UTF-8.

The entire ustring library is implemented in pure Lua. PHP callbacks are
also available for overrides: in LuaSandbox these are used for almost
all functions, while in LuaStandalone they are used only for the pattern
matching. Also, ustring.upper and ustring.lower are overridden using
mw.language's .uc and .lc if available.

It also includes a bunch of unit tests.

Note that if you download the normalization tests, they may fail under
LuaSandbox if you have PHP's intl extension installed and libicu on your
system is too old.

Change-Id: Ie76fdf8d3a85d0a3d2a41b0d3b7afe433f247af0
2013-02-12 14:26:29 -05:00