[swift-dev] Combining Skin Tone Emoji Into Single Extended Grapheme Clusters

Dmitri Gribenko gribozavr at gmail.com
Fri Dec 18 05:29:16 CST 2015


Hi Michael,

On Thu, Dec 17, 2015 at 9:16 PM, Michael Buckley via swift-dev <
swift-dev at swift.org> wrote:

> Hello,
>
> I would like to fix rdar://20511834 , which is that the new skin tone and
> multi-person grouping emoji introduced with iOS 8.3 and OS X 10.10.3 are
> represented as multiple extended grapheme clusters by Swift.String, and I
> have a few questions.
>
> 1. Is this something we want to fix at this time, considering these emoji
> are part of UTR #51, but not part of an official Unicode standard, or do we
> want to wait for these emoji groupings to be published as part of the
> Unicode standard?
>

The issue you are describing is indeed important.  There are multiple
considerations here.  One of them is that we currently describe the
segmentation that Swift performs to be "extended grapheme clusters", which
has a precise definition in the spec.  Changing that would mean introducing
tailoring, which is allowed by the spec, but the algorithm would be custom.

One thing to do would be to check Apple's ICU implementation, which (I
think) implements some extra handling for UTR #51 (
http://opensource.apple.com/release/os-x-1011/) to see how it deals with
this, whether it introduces tailoring, and if so, in what way.

> 2. What is the best way to fix this?
>

My primary concern with the fix in the PR is that it seems to change the
segmentation behavior for other sequences.  The grapheme cluster
segmentation algorithm is local and stateless.  It only looks at two
adjacent Unicode scalars.  This means that adding a rule like "ZWJ
no_boundary Emoji" will affect all sequences, even those that are not a
grouping as defined in UTR #51 (for example, "Latin letter, ZWJ, Emoji":
the three scalars would be grouped).
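
To illustrate the concern, here is a minimal sketch of a pairwise
segmenter. It is not the standard library's implementation: the names
(ToyBreakClass, toyBreakClass, isBoundary, segmentPairwise) are
hypothetical, and the emoji check is a rough code point range standing in
for the real emoji data.

```swift
// Simplified break classes. A real implementation would take these from
// the Unicode data files instead of hard-coding ranges.
enum ToyBreakClass { case zwj, emoji, other }

func toyBreakClass(_ s: Unicode.Scalar) -> ToyBreakClass {
    if s.value == 0x200D { return .zwj }
    // Assumption: a rough block range, not the actual emoji property.
    if (0x1F300...0x1FAFF).contains(s.value) { return .emoji }
    return .other
}

// The decision sees only the two scalars on either side of the position.
func isBoundary(_ left: Unicode.Scalar, _ right: Unicode.Scalar,
                zwjEmojiRule: Bool) -> Bool {
    // No boundary before ZWJ.
    if toyBreakClass(right) == .zwj { return false }
    // The added rule: "ZWJ no_boundary Emoji".
    if zwjEmojiRule && toyBreakClass(left) == .zwj
        && toyBreakClass(right) == .emoji {
        return false
    }
    return true
}

func segmentPairwise(_ scalars: [Unicode.Scalar],
                     zwjEmojiRule: Bool) -> [[Unicode.Scalar]] {
    var clusters: [[Unicode.Scalar]] = []
    for s in scalars {
        if let last = clusters.last?.last,
           !isBoundary(last, s, zwjEmojiRule: zwjEmojiRule) {
            clusters[clusters.count - 1].append(s)
        } else {
            clusters.append([s])
        }
    }
    return clusters
}

// "Latin letter, ZWJ, Emoji": not a grouping from UTR #51.
let latinZwjEmoji: [Unicode.Scalar] = ["a", "\u{200D}", "\u{1F469}"]
let withoutRule = segmentPairwise(latinZwjEmoji, zwjEmojiRule: false)
let withRule = segmentPairwise(latinZwjEmoji, zwjEmojiRule: true)
print(withoutRule.count, withRule.count)
```

With the rule switched on, the rule has no way to notice that the scalar
before the ZWJ is a Latin letter, so all three scalars end up in one
cluster.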

This is the same issue as multiple flags pasted together (which are
represented as regional indicator characters).  The current algorithm just
does not have enough information to split them apart; it needs to look at a
wider part of the string.
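
A small sketch of the flag case, under the same caveat that this is a toy
and not the real algorithm. The function names are hypothetical. The first
function applies only a pairwise "regional indicator, no boundary, regional
indicator" rule; the second also looks at how many indicators the current
cluster already holds, which is the wider context mentioned above.

```swift
func isRegionalIndicator(_ s: Unicode.Scalar) -> Bool {
    // Regional indicator symbols occupy U+1F1E6 through U+1F1FF.
    return (0x1F1E6...0x1F1FF).contains(s.value)
}

// Pairwise rule only: never break between two regional indicators.
func segmentFlagsPairwise(_ scalars: [Unicode.Scalar]) -> [[Unicode.Scalar]] {
    var clusters: [[Unicode.Scalar]] = []
    for s in scalars {
        if let last = clusters.last?.last,
           isRegionalIndicator(last), isRegionalIndicator(s) {
            clusters[clusters.count - 1].append(s)
        } else {
            clusters.append([s])
        }
    }
    return clusters
}

// Wider context: join an indicator only when the current cluster holds
// exactly one indicator so far, so each cluster is at most one pair.
func segmentFlagsInPairs(_ scalars: [Unicode.Scalar]) -> [[Unicode.Scalar]] {
    var clusters: [[Unicode.Scalar]] = []
    for s in scalars {
        if let current = clusters.last, current.count == 1,
           isRegionalIndicator(current[0]), isRegionalIndicator(s) {
            clusters[clusters.count - 1].append(s)
        } else {
            clusters.append([s])
        }
    }
    return clusters
}

// Two flags pasted together: "US" followed by "FR".
let twoFlags: [Unicode.Scalar] =
    ["\u{1F1FA}", "\u{1F1F8}", "\u{1F1EB}", "\u{1F1F7}"]
print(segmentFlagsPairwise(twoFlags).count, segmentFlagsInPairs(twoFlags).count)
```

The pairwise version produces a single cluster of four scalars, while the
version that counts indicators produces two clusters of two.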

I would be much happier with a solution that only changed the segmentation for
the cases covered by the TR, but I understand it might have performance
implications.  I think we should try to add such a tailoring, and benchmark
it.
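
As a rough idea of what a narrower tailoring could look like, the sketch
below joins an emoji to a preceding ZWJ only when the scalar before the ZWJ
is also an emoji, so the decision looks at three scalars instead of two.
This is only an assumption about one possible shape of the tailoring: the
names are hypothetical, the emoji test is an approximate code point range,
and it has not been benchmarked.

```swift
func isZWJ(_ s: Unicode.Scalar) -> Bool {
    return s.value == 0x200D
}

// Assumption: a rough block range, not the actual emoji data from the TR.
func isEmojiApprox(_ s: Unicode.Scalar) -> Bool {
    return (0x1F300...0x1FAFF).contains(s.value)
}

func segmentTailored(_ scalars: [Unicode.Scalar]) -> [[Unicode.Scalar]] {
    var clusters: [[Unicode.Scalar]] = []
    for s in scalars {
        guard let current = clusters.last, let last = current.last else {
            clusters.append([s])
            continue
        }
        // The scalar before `last`, if the current cluster has one.
        let beforeLast: Unicode.Scalar? =
            current.count >= 2 ? current[current.count - 2] : nil
        // No boundary before ZWJ.
        let joinsAsZWJ = isZWJ(s)
        // Tailoring: join only for "Emoji, ZWJ, Emoji".
        let joinsAsEmoji = isEmojiApprox(s) && isZWJ(last)
            && (beforeLast.map(isEmojiApprox) ?? false)
        if joinsAsZWJ || joinsAsEmoji {
            clusters[clusters.count - 1].append(s)
        } else {
            clusters.append([s])
        }
    }
    return clusters
}

let emojiZwjEmoji: [Unicode.Scalar] = ["\u{1F469}", "\u{200D}", "\u{1F467}"]
let letterZwjEmoji: [Unicode.Scalar] = ["a", "\u{200D}", "\u{1F469}"]
print(segmentTailored(emojiZwjEmoji).count, segmentTailored(letterZwjEmoji).count)
```

Here the emoji sequence stays together as one cluster, and the "Latin
letter, ZWJ, Emoji" sequence is still split before the emoji. The extra
lookback is where any performance cost would come from.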

> 3. Is this a significant enough change to warrant going through the entire
> swift-evolution process, or a simple bug fix?
>

The change that adds the first tailoring to the algorithm might be
significant enough to warrant it.  But I think the question would be whether
we want any tailoring at all, not whether we want this specific tailoring.


> As I see it, there are three disadvantages to this approach: The hardcoded
> character class, the reliance on a second emoji data file, and the fact
> that the trie bitmap had to be extended from 16 bits to 32 bits. This last
> change was probably inevitable in the future, and it only increases the
> trie size by 4096 bytes (from 18961 bytes to 23057 bytes).
>

In my opinion, the biggest disadvantage is that it would change
segmentation for other sequences.


> But probably the biggest reason for seeking an alternative implementation
> is that the existing behavior is not always incorrect. It's incorrect on
> the most recent versions of Apple operating systems when rendered in
> contexts that support these emoji, but it's correct everywhere else. It
> seems as though Swift perhaps needs to allow users to change grapheme
> clustering behavior based on a user setting, and perhaps even allow users
> to specify which unicode version they want to use, but that's a much larger
> change, which may not be worth the costs.
>

I would prefer to avoid platform-specific differences here.  Unicode is
hard as it is, and adding context-sensitivity to algorithms in an unusual
way (that is, not through existing established mechanisms like locales)
just invites interoperability issues.

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/