[swift-dev] Combining Skin Tone Emoji Into Single Extended Grapheme Clusters

Thu Dec 17 23:16:04 CST 2015

Hello,

I would like to fix rdar://20511834, which is that the new skin tone and
multi-person grouping emoji introduced with iOS 8.3 and OS X 10.10.3 are
represented as multiple extended grapheme clusters by Swift.String, and I
have a few questions.
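
For anyone who wants to reproduce the behavior, here is a minimal sketch. It
assumes current Swift syntax (String.count; older toolchains spell this
characters.count), and the variable names are mine, not from the radar. The
scalar count is fixed by the input, but the Character count depends on the
grapheme-breaking rules in whichever standard library runs it, so I have
deliberately not written an expected value for it.

```swift
// Thumbs-up (U+1F44D) followed by a medium skin tone modifier (U+1F3FD).
let thumbsUp = "\u{1F44D}\u{1F3FD}"

// The string always holds exactly two Unicode scalars.
let scalarCount = thumbsUp.unicodeScalars.count

// The number of Characters (extended grapheme clusters) depends on the
// grapheme-breaking rules of the standard library in use.
let characterCount = thumbsUp.count

print("scalars: \(scalarCount), characters: \(characterCount)")
```

If the reported character count is greater than 1, the library is splitting
what is displayed as a single emoji into several clusters.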

1. Is this something we want to fix at this time, considering these emoji
are part of UTR #51 but not part of an official Unicode standard, or do we
want to wait for these emoji groupings to be published as part of the
Unicode standard?

2. What is the best way to fix this?

3. Is this a significant enough change to warrant going through the entire
swift-evolution process, or is it a simple bug fix?

In terms of the first question, I am not too familiar with the Unicode
standards process. It appears to me that although UTRs aren't formal
standards, code that conforms to the Unicode standard is free to conform to
UTRs as well.

Currently, extended grapheme clusters are computed using
GraphemeBreakProperty.txt, which is supplied as part of Annex #44 of the
Unicode standard, and which does not yet group skin-tone emoji or emoji
sequences. UTR #51 includes an emoji-data.txt file which, while slightly
outdated compared to the UTR, does contain enough information to group these
emoji properly.

We could pull in emoji-data.txt, or some other data source, right now and
use it to group these emoji, but there is a chance that
GraphemeBreakProperty.txt will be updated to include these groupings in the
future (though probably not until the next version of Unicode), at which
point we'd have to reverse this work and reimplement it using
GraphemeBreakProperty.txt.

For the second question, I have written a simple implementation of these
emoji groupings using emoji-data.txt. A diff can be found at
https://github.com/MichaelBuckley/swift/pull/1

This implementation merely adds a few new character classes to the Unicode
Trie, pulled mainly from emoji-data.txt, though the Zero Width Joiner is
given its own hardcoded character class.
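
To make the idea concrete, here is a rough sketch of the kind of grouping
rule these classes enable. It is not the code from the pull request: the
EmojiClass enum, classify, shouldBreak, and clusters(of:) are hypothetical
names, the classification covers only a handful of hand-picked scalars
instead of a generated trie, and the three no-break rules are a simplified
assumption about how the classes would be consumed.

```swift
// Hypothetical character classes; the names are illustrative only.
enum EmojiClass {
    case other
    case modifierBase      // emoji that can take a skin tone modifier
    case modifier          // skin tone modifiers U+1F3FB...U+1F3FF
    case zeroWidthJoiner   // U+200D, hardcoded rather than read from data
}

// Classify a scalar. Only a tiny hand-picked subset is covered here; a real
// implementation would look the class up in a table built from a data file.
func classify(_ scalar: Unicode.Scalar) -> EmojiClass {
    switch scalar.value {
    case 0x200D: return .zeroWidthJoiner
    case 0x1F3FB...0x1F3FF: return .modifier
    case 0x1F44D, 0x1F466...0x1F469: return .modifierBase
    default: return .other
    }
}

// Decide whether a cluster boundary falls between two adjacent scalars.
func shouldBreak(after previous: EmojiClass, before next: EmojiClass) -> Bool {
    switch (previous, next) {
    case (.modifierBase, .modifier): return false        // base + skin tone
    case (_, .zeroWidthJoiner): return false             // joiner stays attached
    case (.zeroWidthJoiner, .modifierBase): return false // joiner + next emoji
    default: return true
    }
}

// Split a string into clusters using only the simplified rules above.
func clusters(of text: String) -> [String] {
    var result: [String] = []
    var current = String.UnicodeScalarView()
    var previous: EmojiClass? = nil
    for scalar in text.unicodeScalars {
        let next = classify(scalar)
        if let prev = previous, shouldBreak(after: prev, before: next) {
            result.append(String(current))
            current = String.UnicodeScalarView()
        }
        current.append(scalar)
        previous = next
    }
    if !current.isEmpty { result.append(String(current)) }
    return result
}

let skinTone = clusters(of: "\u{1F44D}\u{1F3FD}")
let family = clusters(of: "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}")
print(skinTone.count, family.count)
```

Under these rules both the skin-tone thumbs-up and the man/woman/girl family
sequence come out as a single cluster, while unrelated neighbors such as
plain letters still break normally.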

As I see it, there are three disadvantages to this approach: the hardcoded
character class, the reliance on a second emoji data file, and the fact
that the trie bitmap had to be extended from 16 bits to 32 bits. This last
change was probably inevitable in the future, and it only increases the
trie size by 4096 bytes (from 18961 bytes to 23057 bytes).

Still, it's possible that there is a much better way to implement this fix,
and I was hoping to get some feedback from the designer(s) of the current
Unicode trie code.

But probably the biggest reason for seeking an alternative implementation
is that the existing behavior is not always incorrect. It's incorrect on
the most recent versions of Apple operating systems when rendered in
contexts that support these emoji, but it's correct everywhere else. It
seems as though Swift perhaps needs to allow users to change grapheme
clustering behavior based on a user setting, and perhaps even allow users
to specify which Unicode version they want to use, but that's a much larger
change, which may not be worth the costs.

Finally, for question 3, I'm on the fence as to whether this is a
significant enough change to warrant going through the swift-evolution
process. On the one hand, it seems like a simple bug fix, and there's
already a radar tracking it as such. No one would require a swift-evolution
proposal to fix a compiler crash, for example. But at the same time, it
would also change the runtime behavior for anyone relying on the existing
behavior, which is problematic because, as pointed out before, the existing
behavior is not always wrong.

Anyway, I was hoping to get some guidance on where the line is, and what
side of the line this bug fix is on. Thanks!