[swift-dev] Combining Skin Tone Emoji Into Single Extended Grapheme Clusters

Michael Buckley michael at buckleyisms.com
Tue Dec 22 02:10:49 CST 2015


It actually appears that Swift already links against ICU. I'll see if I can
hook Swift up to ICU's grapheme separation code.

On Sun, Dec 20, 2015 at 10:41 PM, Michael Buckley <michael at buckleyisms.com>
wrote:

> After reading through the ICU sources, if I understand them correctly, ICU
> uses the Aho–Corasick algorithm to determine grapheme breaks, word breaks
> and line breaks, and then does some post-processing after matching using
> the algorithm.
>
> This allows ICU to solve the regional indicator problem by including a
> pattern that matches 3 regional indicator characters in a row and inserts a
> grapheme break after the second. This does not actually modify the string
> by adding a zero-width space or something.
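
To make the regional indicator rule above concrete, here is a minimal Swift sketch. It is not ICU's code, and the names isRegionalIndicator and regionalIndicatorBreaks are made up for illustration. It assumes that a run of regional indicators is simply split into pairs, and it only reports where the breaks fall, leaving the string itself untouched:

```swift
// Hypothetical helper: true for REGIONAL INDICATOR SYMBOL LETTER A...Z.
func isRegionalIndicator(_ scalar: Unicode.Scalar) -> Bool {
    return (0x1F1E6...0x1F1FF).contains(scalar.value)
}

// Returns the scalar offsets at which a grapheme break falls inside runs of
// regional indicators. The input string is only read, never modified.
func regionalIndicatorBreaks(in text: String) -> [Int] {
    var breaks: [Int] = []
    var runLength = 0
    for (offset, scalar) in text.unicodeScalars.enumerated() {
        if isRegionalIndicator(scalar) {
            // The third, fifth, ... indicator in a run starts a new flag.
            if runLength > 0 && runLength % 2 == 0 {
                breaks.append(offset)
            }
            runLength += 1
        } else {
            runLength = 0
        }
    }
    return breaks
}

// Three flags pasted together: US, CA, JP (six regional indicators).
let flags = "\u{1F1FA}\u{1F1F8}\u{1F1E8}\u{1F1E6}\u{1F1EF}\u{1F1F5}"
print(regionalIndicatorBreaks(in: flags)) // [2, 4]
```

Unlike a rule that only compares two adjacent scalars, this has to remember how far into the run it is.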
>
> While this approach can solve the regional indicator problem efficiently,
> it cannot solve the problem with the zero-width joiner emoji sequences as
> easily. This is because Aho-Corasick is linear in the length of the text +
> the number of patterns + the number of matches, and the emoji problem would
> require a pattern for every emoji sequence we want to support.
>
> However, after reading UTR #51 again, we may want to treat all emoji
> joined by a ZWJ as a single extended grapheme cluster, whether they form a
> known sequence or not. That's because UTR #51 leaves the exact sequences as
> implementation-defined. It includes a list of currently-known implemented
> sequences, but allows for implementers to add their own sequences.
>
> Which means that Ubuntu could, for example, support a sequence of DOG FACE
> + ZWJ + PILE OF POO, and represent it with a glyph of a dog doing its
> business. We basically have two options here. We could treat Swift as an
> Apple-platform centric language and implement only the sequences that
> appear on Apple platforms, or we could implement a rule of any emoji + ZWJ
> + any emoji has no break. As Dmitri pointed out, this would mean Swift
> would report strings of invalid sequences as a single character, which
> could be confusing. But I posit that the situation we have now, reporting
> valid strings as multiple characters, is also confusing, and much more
> likely. It's unlikely that anyone is going to stick a ZWJ between
> emoji unless they intend to make a sequence from it.
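
As a rough illustration of the "any emoji + ZWJ + any emoji has no break" option, here is a toy Swift sketch. It ignores every other grapheme rule, the names isEmojiScalar, hasBreak and clusterCount are hypothetical, and the emoji test is a deliberately crude range check standing in for a real property lookup:

```swift
let zwj: Unicode.Scalar = "\u{200D}"

// Crude placeholder (an assumption, not real Unicode emoji data).
func isEmojiScalar(_ s: Unicode.Scalar) -> Bool {
    return (0x1F300...0x1FAFF).contains(s.value)
        || (0x2600...0x27BF).contains(s.value)
}

// Decides whether a break falls immediately before scalars[index].
func hasBreak(before index: Int, in scalars: [Unicode.Scalar]) -> Bool {
    let current = scalars[index]
    // A ZWJ attaches to whatever precedes it.
    if current == zwj { return false }
    // The proposed rule: no break in "emoji ZWJ emoji", known sequence or not.
    if index >= 2, scalars[index - 1] == zwj,
       isEmojiScalar(scalars[index - 2]), isEmojiScalar(current) {
        return false
    }
    return true
}

// Counts clusters under this toy rule set only.
func clusterCount(_ text: String) -> Int {
    let scalars = Array(text.unicodeScalars)
    guard !scalars.isEmpty else { return 0 }
    var count = 1
    for index in 1..<scalars.count where hasBreak(before: index, in: scalars) {
        count += 1
    }
    return count
}

print(clusterCount("\u{1F436}\u{200D}\u{1F4A9}")) // DOG FACE + ZWJ + PILE OF POO: 1
print(clusterCount("a\u{200D}\u{1F4A9}"))         // Latin letter + ZWJ + emoji: 2
```

Because the check looks back two scalars, a Latin letter followed by ZWJ and an emoji is still split, at the cost of no longer being a purely pairwise rule.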
>
> Incidentally, this is what ICU does. You can test this yourself in
> TextEdit by typing HEAVY BLACK HEART followed by ZWJ ad infinitum, then
> press the left arrow key once and watch TextEdit treat the sequence as a
> single character, causing the cursor to jump to the beginning of the
> string. ICU, however, does hard-code the emoji that are currently used by
> Apple emoji sequences, so you can't do the same thing with PILE OF POO.
> This makes sense in an ICU context, since it's only implementing the Apple
> sequences, but if we want Swift to be more platform-agnostic, we would want
> this behavior for any emoji.
>
>
> ICU's implementation fixes the regional indicator problem, but the
> implementation is large and moderately complicated. Just throwing this out
> there, but would it be possible to add ICU as a dependency to Swift and
> just use its implementation? I'm sure this would be a nightmare to work out
> license- and logistics-wise. (It would probably necessitate that ICU
> development be opened up to the same degree that other Swift dependencies
> are). I also understand that adding any dependencies at all is less than
> ideal. But this seems like a perfect situation for some code sharing. We
> have a moderately large and complicated library that is being updated with
> new Emoji support when new Emoji are added anyway. It's fast, it's already
> well-used, and we'd have to duplicate a lot of what it does to solve the
> same problems if we didn't use it.
>
> As a bonus, we could link to the system-supplied libicu on OS X and iOS,
> so Swift apps would automatically get the latest emoji support when users
> update their OSs. We would still have to bundle it for other OSs.
>
> I know that there are a lot of downsides to making it a dependency, but I
> wanted to throw the idea out there to see if it made sense.
>
> On Fri, Dec 18, 2015 at 6:22 AM, Michael Buckley <michael at buckleyisms.com>
> wrote:
>
>> Thanks for the response, Dmitri. My comments inline below.
>>
>> On Fri, Dec 18, 2015 at 3:29 AM, Dmitri Gribenko <gribozavr at gmail.com>
>> wrote:
>>>
>>>
>>> One thing to do would be to check Apple's ICU implementation, which
>>> (I think) implements some extra handling for UTR #51 (
>>> http://opensource.apple.com/release/os-x-1011/) to see how it deals
>>> with this, whether it introduces tailoring, and if so, in what way.
>>>
>>
>> I will look into that. I had always thought that would have been part of
>> Core Text, and not open sourced. It is great to know that it is
>> open-sourced.
>>
>>
>>> My primary concern with the fix in the PR is that it seems to change the
>>> segmentation behavior for other sequences.  The grapheme cluster
>>> segmentation algorithm is local and stateless.  It only looks at two
>>> adjacent Unicode scalars.  This means that adding a rule like "ZWJ
>>> no_boundary Emoji" will affect all sequences, even those that are not a
>>> grouping as defined in UTR #51 (for example, "Latin letter, ZWJ, Emoji":
>>> the three scalars would be grouped).
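
The concern above can be reproduced with a toy pairwise segmenter in Swift. This is only a sketch under stated assumptions: it implements just the two rules shown, the names hasBoundary, isLikelyEmoji and pairwiseClusterCount are hypothetical, and the emoji check is a crude range test rather than real Unicode data:

```swift
let zeroWidthJoiner: Unicode.Scalar = "\u{200D}"

// Crude stand-in for a real emoji property lookup (an assumption).
func isLikelyEmoji(_ scalar: Unicode.Scalar) -> Bool {
    return (0x1F300...0x1FAFF).contains(scalar.value)
}

// Local, stateless rule: it sees only the two scalars around a boundary.
func hasBoundary(between left: Unicode.Scalar, and right: Unicode.Scalar) -> Bool {
    // A ZWJ attaches to whatever precedes it.
    if right == zeroWidthJoiner { return false }
    // The added rule: "ZWJ no_boundary Emoji".
    if left == zeroWidthJoiner && isLikelyEmoji(right) { return false }
    return true
}

// Counts clusters by testing every adjacent pair of scalars.
func pairwiseClusterCount(_ text: String) -> Int {
    let scalars = Array(text.unicodeScalars)
    guard !scalars.isEmpty else { return 0 }
    var count = 1
    for index in 1..<scalars.count
        where hasBoundary(between: scalars[index - 1], and: scalars[index]) {
        count += 1
    }
    return count
}

// "Latin letter, ZWJ, Emoji" collapses into one cluster.
print(pairwiseClusterCount("a\u{200D}\u{1F4A9}")) // 1
```

The rule cannot tell what came before the ZWJ, so the Latin letter is pulled into the same cluster as the emoji.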
>>>
>>
>> Apologies, I forgot to mention that disadvantage. It does change the
>> segmentation behavior for other sequences, which was one of the reasons I
>> was on the fence about whether this should go through the swift-evolution
>> process.
>>
>>
>>
>>> This is the same issue as multiple flags pasted together (which are
>>> represented as regional indicator characters).  The current algorithm just
>>> does not have enough information to split them apart; it needs to look at a
>>> wider part of the string.
>>>
>>
>> I could be reading the Unicode standard incorrectly, but it appears that
>> this might be the intended behavior for the flag characters. I definitely
>> agree that it's not ideal.
>>
>>
>>> I would be much happier with a solution that only changed the segmentation for
>>> the cases covered by the TR, but I understand it might have performance
>>> implications.  I think we should try to add such a tailoring, and benchmark
>>> it.
>>>
>>
>> Just so that I understand what you mean by tailoring, you mean switching
>> to a possibly stateful algorithm which can consider more than just two
>> adjacent characters when grouping, right?
>>
>>
>>
>>> The change that adds the first tailoring to the algorithm might be
>>> significant enough.  But I think it would be a question of whether we want
>>> any tailoring at all, not about specific tailoring.
>>>
>>
>> Thanks for the clarification. Just to be sure, if this change wasn't as
>> problematic, but still changed the behavior of Swift.String, you're saying
>> it would not be important enough for swift-evolution? As a concrete
>> example, if I was just proposing to fix the skin tone emoji, but not the
>> ZWJ sequences, would it be considered just a bug fix?
>>
>
>