PHP was counting UTF-8 bytes, JS was counting UTF-16 bytes.
Both should have been counting codepoints (although it doesn't
really matter as long as they both count the same things).
I noticed the issue after adding some tests using the Cyrillic
script, when one case had different results in PHP and JS:
Id25b537fecd789640c209ff7f30e777455a3aece.
Change-Id: Ic31240678f71ba48e6ec202126bf490cea12bb66
The PHP code incorrectly assumed that the digits are single-byte in
UTF-8, which is never the case (except for 0-9).
The JS code worked correctly because it uses UTF-16 strings, so the
bug would only affect non-BMP digits there. This was noted in a TODO
comment, but we overlooked it when reimplementing in PHP.
Instead of a string of 10 characters, use an array of 10
single-character strings.
Bug: T261706
Change-Id: Ic5421382474c88f003424799c53ff473d99cce92
When a comment ended before the end of a paragraph, the next
comment would begin right there in the middle of the paragraph.
This could result in the detected indentation level of that
comment being incorrect, and replies being inserted in wrong
places, as seen in the 'signatures-funny' test case.
The code moved to the parser was previously repeated twice in
addListItem() and addReplyLink(), which should have been a hint
that something isn't quite right.
Also, fix the code guarding against overlapping signatures,
now that signatures may not be at the end of a comment.
Bug: T260855
Change-Id: Ic26a87642f8a15d5de2f7073d4d8176b299c7f94
Previously, parser would output offsets that don't exist in their
containers, because we were pretending that entities are parts of
their neighboring text nodes.
Turns out it's much easier to do it right when going backwards.
Change-Id: I9bccca2d403f1a976ae517449989170cdd99721e
Something terrible has happened to this function… It seems that I have
brutalized it when rebasing 092cfd6075.
Change-Id: I12d75c69d15645112563a7bc345209b23b54cb3e
When adding a reply, we take a node at the end of the previous comment,
compare that comment's indentation level to the expected indentation level
of the reply, and add (or remove) that number of wrapper lists.
The existing code did not consider that comments may have lists within
them, and so the indentation of that node may not match the indentation
of the comment.
Bug: T252702
Change-Id: Icc5ff19783d2b213bff99f283cb0599a8b5c1ab4
* Pass rootNode to the constructor
* Rename getters to match CommentItem/HeadingItem/ThreadItem
value classes.
* Always build the thread tree so CommentItem's always have
and ID and replies/parent.
Change-Id: I508be9534de59016ff806e3d84edcbb1c76cb0c6
Instead of doing a separate tree walk and finding all timestamps
separately, make it part of the getComments tree walk, and find
timestamps one at a time.
Change-Id: I47f466eaf228504faa189fd99e07493bc7f022cd
This is similar to what the JS version does.
The TreeWalker and NodeFilter classes are adapted from
https://github.com/Krinkle/dom-TreeWalker-polyfill
(MIT license).
This makes #getComments twice as fast on en-big-oldparser.html
Change-Id: I2441f33e6e7bad753ac830d277e6a2e81ee8c93d
* Move modifier#getFullyCoveredWrapper to utils
* Use that method to find the node where we start searching for
template wrappers, rather than using endContainer
Bug: T252058
Change-Id: I55de58102f3468fce01290bd413a7fdc96d322d6
It was useful when I was debugging those parts of the code, but now
it's usually annoying.
The warnings can still sometimes be useful for understanding how the
tool parses some discussion, though. To keep that functionality, add
displaying warnings for each comment in the debug mode.
Change-Id: I2d218a8a394f179bcc0990ff988a0567c275ccf2
Follow-up to Ic1438d516e223db462cb227f6668e856672f538c.
Minor corrections and comment improvements in PHP parser,
and "backporting" some changes to JS parser that I like.
Change-Id: I5e54121914ec6b323e556dd133bcb71b3aefbb61