[swift-users] emoji in source code failed to compile in linux ubuntu

Jens Alfke jens at mooseyard.com
Fri Dec 4 13:25:11 CST 2015


> On Dec 4, 2015, at 3:28 AM, Quinn The Eskimo! <eskimo1 at apple.com> wrote:
> 
> I can explain that.  U+1F603 is encoded in UTF-16 as d83d de03.  If you encode each of these separately as UTF-8, you get ed a0 bd followed by ed b8 83.  That's not the correct way to encode U+1F603 as UTF-8, hence the failure.

By total coincidence, just this week I implemented support for the above in a JSON parser, so I’m suddenly an expert ;-)

Based on what I know, both of those encodings are correct. UTF-16 surrogate pairs can occur in decoded UTF-8 and need to be decoded in turn into Unicode codepoints; according to the Wikipedia page on UTF-8:

"In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and 983040 4-byte sequences."
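
As a quick check of the 983040 figure in that quote, here is a small arithmetic sketch in Python. It assumes the usual 4-byte layout with 21 payload bits (so a ceiling of U+1FFFFF before the restriction) and counts only code points from U+10000 upward, since lower values fit in shorter sequences. The variable names are mine, not from any specification.

```python
# Assumption: the 4-byte UTF-8 pattern carries 21 payload bits, so before
# RFC 3629 it could express values up to 0x1FFFFF. Values below 0x10000
# already fit in 1- to 3-byte sequences and are not counted here.
FIRST_4BYTE = 0x10000
OLD_LIMIT = 0x1FFFFF   # highest value the 21-bit pattern can hold
NEW_LIMIT = 0x10FFFF   # ceiling set by RFC 3629 to match UTF-16

four_byte_before = OLD_LIMIT - FIRST_4BYTE + 1
four_byte_after = NEW_LIMIT - FIRST_4BYTE + 1
removed = four_byte_before - four_byte_after

print(four_byte_before, four_byte_after, removed)
# prints: 2031616 1048576 983040
```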

In other words, during UTF-8 encoding, Unicode codepoints above U+FFFF need to be broken into surrogate pairs. Therefore a UTF-8 decoder needs to recognize surrogate pairs and reassemble them into a single codepoint. That’s what’s not happening in this case.
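
To make the byte-level arithmetic concrete, here is a minimal Python sketch of the steps discussed in this thread: splitting U+1F603 into a UTF-16 surrogate pair, encoding each half with the 3-byte UTF-8 bit pattern, and reassembling the pair into one codepoint. The helper names (to_surrogates, from_surrogates, encode_3byte) are hypothetical and written for illustration only; this is not the Swift compiler's parser. The last line prints the direct 4-byte encoding from Python's built-in encoder for comparison.

```python
def to_surrogates(cp):
    """Split a codepoint above U+FFFF into a UTF-16 (high, low) pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high, low):
    """Reassemble a (high, low) surrogate pair into a single codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def encode_3byte(unit):
    """Apply the 3-byte UTF-8 bit pattern to one 16-bit value.

    No validity check is done, so surrogate values pass through.
    """
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])

def hexdump(data):
    return " ".join(f"{b:02x}" for b in data)

cp = 0x1F603
high, low = to_surrogates(cp)
print(f"{high:04x} {low:04x}")               # prints: d83d de03
print(hexdump(encode_3byte(high)))           # prints: ed a0 bd
print(hexdump(encode_3byte(low)))            # prints: ed b8 83
print(f"{from_surrogates(high, low):X}")     # prints: 1F603
print(hexdump(chr(cp).encode("utf-8")))      # prints: f0 9f 98 83
```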

In short, I think this is a bug in the UTF-8 parser being used by the Swift compiler. But as my expertise here is only a few days old, I’m prepared to be corrected.

—Jens