[swift-dev] Combining Skin Tone Emoji Into Single Extended Grapheme Clusters
Michael Buckley
michael at buckleyisms.com
Mon Dec 21 00:41:41 CST 2015
After reading through the ICU sources, if I understand them correctly, ICU
uses the Aho–Corasick algorithm to determine grapheme breaks, word breaks
and line breaks, and then does some post-processing after matching using
the algorithm.
This allows ICU to solve the regional indicator problem by including a
pattern that matches 3 regional indicator characters in a row and inserts a
grapheme break after the second. This does not actually modify the string
by adding a zero-width space or something.
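To make the idea concrete, here is a minimal Swift sketch of the effect described above. It is not ICU's code, and the names isRegionalIndicator and flagBreakOffsets are made up for illustration. It only reports where breaks would fall inside a run of regional indicators and leaves the string untouched.

```swift
// Hypothetical helper: true for REGIONAL INDICATOR SYMBOL LETTER A through Z.
func isRegionalIndicator(_ scalar: Unicode.Scalar) -> Bool {
    return (0x1F1E6...0x1F1FF).contains(scalar.value)
}

// Hypothetical helper: returns the scalar offsets at which a grapheme break
// falls *inside* a run of regional indicators, i.e. after every second one.
// Breaks at the edges of a run are not reported, and the string itself is
// never modified.
func flagBreakOffsets(in text: String) -> [Int] {
    var breaks: [Int] = []
    var runLength = 0
    for (offset, scalar) in text.unicodeScalars.enumerated() {
        if isRegionalIndicator(scalar) {
            // A complete pair precedes this scalar, so break before it.
            if runLength > 0 && runLength % 2 == 0 {
                breaks.append(offset)
            }
            runLength += 1
        } else {
            runLength = 0
        }
    }
    return breaks
}
```

For two flags pasted together (four regional indicator scalars), this reports a single break at scalar offset 2, between the two pairs.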
While this approach can solve the regional indicator problem efficiently,
it cannot solve the problem with the zero-width joiner emoji sequences as
easily. This is because Aho–Corasick is linear in the length of the text +
the combined length of the patterns + the number of matches, and the emoji
problem would require a pattern for every emoji sequence we want to support.
However, after reading UTR #51 again, we may want to treat all emoji joined
by a ZWJ as a single extended grapheme cluster, whether they form a known
sequence or not. That's because UTR #51 leaves the exact sequences as
implementation-defined. It includes a list of currently-known implemented
sequences, but allows for implementers to add their own sequences.
Which means that Ubuntu could, for example, support a sequence of DOG FACE
+ ZWJ + PILE OF POO, and represent it with a glyph of a dog doing its
business. We basically have two options here. We could treat Swift as an
Apple-platform centric language and implement only the sequences that
appear on Apple platforms, or we could implement a rule of any emoji + ZWJ
+ any emoji has no break. As Dmitri pointed out, this would mean Swift
would report strings of invalid sequences as a single
character, which could be confusing. But I posit that the situation we have
now, reporting valid strings as multiple characters, is also confusing, and
much more likely. It's unlikely that anyone is going to stick a ZWJ between
emoji unless they intend to make a sequence from it.
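As a rough illustration of the second option, here is a Swift sketch of an "any emoji + ZWJ + any emoji has no break" rule. It is only a sketch under loud assumptions: looksLikeEmoji is a hypothetical and deliberately incomplete range check, variation selectors and skin tone modifiers are ignored, and every scalar that is not part of a joined sequence is treated as its own group, which is not how real grapheme segmentation works.

```swift
let zwj: Unicode.Scalar = "\u{200D}"

// Hypothetical, incomplete emoji test used only for this sketch.
func looksLikeEmoji(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.value {
    case 0x2600...0x27BF,    // Miscellaneous Symbols and Dingbats
         0x1F300...0x1FAFF:  // pictographs, emoticons and related blocks
        return true
    default:
        return false
    }
}

// Hypothetical helper: splits text into groups of scalars, suppressing the
// break only when the three-scalar pattern "emoji, ZWJ, emoji" is present.
func joinedEmojiGroups(_ text: String) -> [String] {
    let scalars = Array(text.unicodeScalars)
    var groups: [String] = []
    var i = 0
    while i < scalars.count {
        var group = String(scalars[i])
        var j = i
        // Keep extending while "emoji, ZWJ, emoji" continues from position j.
        while looksLikeEmoji(scalars[j]),
              j + 2 < scalars.count,
              scalars[j + 1] == zwj,
              looksLikeEmoji(scalars[j + 2]) {
            group.unicodeScalars.append(scalars[j + 1])
            group.unicodeScalars.append(scalars[j + 2])
            j += 2
        }
        groups.append(group)
        i = j + 1
    }
    return groups
}
```

With this sketch, DOG FACE + ZWJ + PILE OF POO comes back as one group, while a Latin letter + ZWJ + PILE OF POO comes back as three, because the check looks at the scalar on both sides of the ZWJ rather than at a single adjacent pair.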
Incidentally, this is what ICU does. You can test this yourself in TextEdit
by typing HEAVY BLACK HEART followed by ZWJ ad infinitum, then press the
left arrow key once and watch TextEdit treat the sequence as a single
character, causing the cursor to jump to the beginning of the string. ICU,
however, does hard-code the emoji that are currently used by Apple emoji
sequences, so you can't do the same thing with PILE OF POO. This makes
sense in an ICU context, since it's only implementing the Apple sequences,
but if we want Swift to be more platform-agnostic, we would want this
behavior for any emoji.
ICU's implementation fixes the regional indicator problem, but the
implementation is large and moderately complicated. Just throwing this out
there, but would it be possible to add ICU as a dependency to Swift and
just use its implementation? I'm sure this would be a nightmare to work out
license and logistics-wise. (It would probably necessitate that ICU
development be opened up to the same degree that other Swift dependencies
are). I also understand that adding any dependencies at all is less than
ideal. But this seems like a perfect situation for some code sharing. We
have a moderately large and complicated library that is being updated with
new Emoji support when new Emoji are added anyway. It's fast, it's already
well-used, and we'd have to duplicate a lot of what it does to solve the
same problems if we didn't use it.
As a bonus, we could link to the system-supplied libicu on OS X and iOS, so
Swift apps would automatically get the latest emoji support when users
update their OSs. We would still have to bundle it for other OSs.
I know that there are a lot of downsides to making it a dependency, but I
wanted to throw the idea out there to see if it made sense.
On Fri, Dec 18, 2015 at 6:22 AM, Michael Buckley <michael at buckleyisms.com>
wrote:
> Thanks for the response, Dmitri. My comments inline below.
>
> On Fri, Dec 18, 2015 at 3:29 AM, Dmitri Gribenko <gribozavr at gmail.com>
> wrote:
>>
>>
>> One thing to do would be to check Apple's ICU implementation, which
>> (I think) implements some extra handling for UTR #51 (
>> http://opensource.apple.com/release/os-x-1011/) to see how it deals with
>> this, whether it introduces tailoring, and if so, in what way.
>>
>
> I will look into that. I had always thought that would have been part of
> Core Text, and not open sourced. It is great to know that it is
> open-sourced.
>
>
>> My primary concern with the fix in the PR is that it seems to change the
>> segmentation behavior for other sequences. The grapheme cluster
>> segmentation algorithm is local and stateless. It only looks at two
>> adjacent Unicode scalars. This means that adding a rule like "ZWJ
>> no_boundary Emoji" will affect all sequences, even those that are not a
>> grouping as defined in UTR #51 (for example, "Latin letter, ZWJ, Emoji":
>> the three scalars would be grouped).
>>
>
> Apologies, I forgot to mention that disadvantage. It does change the
> segmentation behavior for other sequences, which was one of the reasons I
> was on the fence about whether this should go through the swift-evolution
> process.
>
>
>
>> This is the same issue as multiple flags pasted together (which are
>> represented as regional indicator characters). The current algorithm just
>> does not have enough information to split them apart; it needs to look at a
>> wider part of the string.
>>
>
> I could be reading the Unicode standard incorrectly, but it appears that
> this might be the intended behavior for the flag characters. I definitely
> agree that it's not ideal.
>
>
>> I would be much happier with a solution that only changed the segmentation
>> for the cases covered by the TR, but I understand it might have performance
>> implications. I think we should try to add such a tailoring, and benchmark
>> it.
>>
>
> Just so that I understand what you mean by tailoring, you mean switching
> to a possibly stateful algorithm which can consider more than just two
> adjacent characters when grouping, right?
>
>
>
>> The change that adds the first tailoring to the algorithm might be
>> significant enough. But I think it would be a question of whether we want
>> any tailoring at all, not about specific tailoring.
>>
>
> Thanks for the clarification. Just to be sure, if this change wasn't as
> problematic, but still changed the behavior of Swift.String, you're saying
> it would not be important enough for swift-evolution? As a concrete
> example, if I was just proposing to fix the skin tone emoji, but not the
> ZWJ sequences, would it be considered just a bug fix?
>