[swift-dev] Combining Skin Tone Emoji Into Single Extended Grapheme Clusters

Michael Buckley michael at buckleyisms.com
Fri Dec 18 08:22:35 CST 2015


Thanks for the response, Dmitri. My comments are inline below.

On Fri, Dec 18, 2015 at 3:29 AM, Dmitri Gribenko <gribozavr at gmail.com>
wrote:
>
>
> One thing to do would be to check Apple's ICU implementation, which (I
> think) implements some extra handling for UTR #51 (
> http://opensource.apple.com/release/os-x-1011/) to see how it deals with
> this, whether it introduces tailoring, and if so, in what way.
>

I will look into that. I had always assumed that this handling was part of
Core Text and therefore not open source. It is great to know that the
source is available.


> My primary concern with the fix in the PR is that it seems to change the
> segmentation behavior for other sequences.  The grapheme cluster
> segmentation algorithm is local and stateless.  It only looks at two
> adjacent Unicode scalars.  This means that adding a rule like "ZWJ
> no_boundary Emoji" will affect all sequences, even those that are not a
> grouping as defined in UTR #51 (for example, "Latin letter, ZWJ, Emoji":
> the three scalars would be grouped).
>

Apologies, I forgot to mention that disadvantage. It does change the
segmentation behavior for other sequences, which was one of the reasons I
was on the fence about whether this should go through the swift-evolution
process.
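
To make that disadvantage concrete, here is a minimal sketch of a pairwise
rule table. It is a toy model only: `ToyClass`, `toyClass`, `noBoundary`, and
`toySegments` are hypothetical names, and the scalar ranges are rough
stand-ins, not the tables or code used by the standard library.

```swift
// Toy model only: classes, ranges, and rules are illustrative assumptions.
enum ToyClass { case letter, zwj, emoji, other }

func toyClass(_ s: Unicode.Scalar) -> ToyClass {
    switch s.value {
    case 0x200D: return .zwj                       // ZERO WIDTH JOINER
    case 0x1F300...0x1FAFF: return .emoji          // rough emoji range, for illustration
    case 0x41...0x5A, 0x61...0x7A: return .letter  // ASCII letters only
    default: return .other
    }
}

// Stateless rule: the decision uses only the two adjacent scalars.
func noBoundary(_ left: ToyClass, _ right: ToyClass) -> Bool {
    if right == .zwj { return true }                    // ZWJ attaches to what precedes it
    if left == .zwj && right == .emoji { return true }  // "ZWJ no_boundary Emoji"
    return false
}

func toySegments(_ text: String) -> [[Unicode.Scalar]] {
    var clusters: [[Unicode.Scalar]] = []
    var previous: ToyClass? = nil
    for scalar in text.unicodeScalars {
        let cls = toyClass(scalar)
        if let prev = previous, noBoundary(prev, cls) {
            clusters[clusters.count - 1].append(scalar)
        } else {
            clusters.append([scalar])
        }
        previous = cls
    }
    return clusters
}

// Emoji, ZWJ, Emoji: one cluster, which is the intent.
let family = toySegments("\u{1F469}\u{200D}\u{1F467}").count
// Latin letter, ZWJ, Emoji: also one cluster, which is the side effect.
let latin = toySegments("a\u{200D}\u{1F600}").count
```

In this toy model both `family` and `latin` come out as 1, because the rule
cannot see what preceded the ZWJ.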



> This is the same issue as multiple flags pasted together (which are
> represented as regional indicator characters).  The current algorithm just
> does not have enough information to split them apart, it needs to look at a
> wider part of the string.
>

I could be reading the Unicode standard incorrectly, but it appears that
this might be the intended behavior for the flag characters. I definitely
agree that it's not ideal.
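
As a rough illustration of why a pairwise rule cannot separate adjacent
flags, the sketch below compares it with a version that remembers how many
regional indicators are already in the current cluster. The function names
are hypothetical, and every scalar that is not a regional indicator is
simply counted as its own cluster to keep the example small.

```swift
// Regional indicator symbols occupy U+1F1E6...U+1F1FF.
func isRegionalIndicator(_ s: Unicode.Scalar) -> Bool {
    return (0x1F1E6...0x1F1FF).contains(s.value)
}

// Pairwise rule "RI no_boundary RI": a whole run of indicators is one cluster.
func pairwiseClusterCount(_ text: String) -> Int {
    var count = 0
    var previousWasRI = false
    for s in text.unicodeScalars {
        let ri = isRegionalIndicator(s)
        if !(ri && previousWasRI) { count += 1 }
        previousWasRI = ri
    }
    return count
}

// With extra state (indicators already in the current cluster), the run can
// be split into pairs. This is a hypothetical tailoring, not existing behavior.
func pairedClusterCount(_ text: String) -> Int {
    var count = 0
    var indicatorsInCluster = 0
    for s in text.unicodeScalars {
        if isRegionalIndicator(s) && indicatorsInCluster == 1 {
            indicatorsInCluster = 2   // completes a pair, no new cluster
        } else {
            count += 1
            indicatorsInCluster = isRegionalIndicator(s) ? 1 : 0
        }
    }
    return count
}

// "US" followed by "CA", written as four regional indicator scalars.
let twoFlags = "\u{1F1FA}\u{1F1F8}\u{1F1E8}\u{1F1E6}"
let pairwise = pairwiseClusterCount(twoFlags)  // 1 in this sketch
let paired = pairedClusterCount(twoFlags)      // 2 in this sketch
```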


> I would be much happier with a solution that only changed the segmentation
> for the cases covered by the TR, but I understand it might have performance
> implications.  I think we should try to add such a tailoring, and benchmark
> it.
>

Just so that I understand what you mean by tailoring: that would be
switching to a possibly stateful algorithm that can consider more than just
two adjacent characters when grouping, right?
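
If that reading is right, one possible shape for it is sketched below: the
"ZWJ no_boundary Emoji" rule only applies when the scalar before the ZWJ was
itself an emoji, so three scalars are examined instead of two. The helper
names and the emoji range are assumptions made for illustration, and this is
not a claim about how the tailoring would actually be implemented or how it
would perform.

```swift
// Hypothetical helpers; the emoji range is a rough stand-in for real data.
func isZWJ(_ s: Unicode.Scalar) -> Bool { return s.value == 0x200D }
func isToyEmoji(_ s: Unicode.Scalar) -> Bool {
    return (0x1F300...0x1FAFF).contains(s.value)
}

// Counts clusters using a three-scalar window around each ZWJ.
func contextAwareClusterCount(_ text: String) -> Int {
    let scalars = Array(text.unicodeScalars)
    var count = 0
    for i in scalars.indices {
        if i == 0 { count += 1; continue }
        let prev = scalars[i - 1]
        let cur = scalars[i]
        let joinsToPrevious: Bool
        if isZWJ(cur) {
            joinsToPrevious = true   // ZWJ attaches to what precedes it
        } else if isZWJ(prev) && isToyEmoji(cur) {
            // Look one scalar further back: join only for Emoji, ZWJ, Emoji.
            joinsToPrevious = i >= 2 && isToyEmoji(scalars[i - 2])
        } else {
            joinsToPrevious = false
        }
        if !joinsToPrevious { count += 1 }
    }
    return count
}

let emojiSequence = contextAwareClusterCount("\u{1F469}\u{200D}\u{1F467}")  // 1 in this sketch
let latinSequence = contextAwareClusterCount("a\u{200D}\u{1F600}")          // 2 in this sketch
```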



> The change that adds the first tailoring to the algorithm might be
> significant enough.  But I think it would be a question of whether we want
> any tailoring at all, not about specific tailoring.
>

Thanks for the clarification. Just to be sure, if this change weren't as
problematic, but still changed the behavior of Swift.String, you're saying
it would not be important enough for swift-evolution? As a concrete
example, if I were just proposing to fix the skin tone emoji, but not the
ZWJ sequences, would it be considered just a bug fix?