[swift-users] emoji in source code failed to compile in linux ubuntu
Jens Alfke
jens at mooseyard.com
Fri Dec 4 13:25:11 CST 2015
> On Dec 4, 2015, at 3:28 AM, Quinn The Eskimo! <eskimo1 at apple.com> wrote:
>
> I can explain that. U+1F603 is encoded in UTF-16 as d83d de03. If you encode each of these separately as UTF-8, you get ed a0 bd followed by ed b8 83. That's not the correct way to encode U+1F603 as UTF-8, hence the failure.
By total coincidence, just this week I implemented support for the above in a JSON parser, so I’m suddenly an expert ;-)
Based on what I know, both of those encodings are correct. UTF-16 surrogate pairs can occur in decoded UTF-8 and need to be decoded in turn into Unicode codepoints; according to the Wikipedia page on UTF-8:
"In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and 983040 4-byte sequences."
In other words, during UTF-8 encoding, some Unicode codepoints need to be broken into surrogate pairs. Therefore a UTF-8 decoder needs to recognize surrogate pairs and reassemble them into a single codepoint. That’s what’s not happening in this case.
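To make the reassembly step concrete, here is a rough Python sketch of the arithmetic involved. It takes the two 3-byte sequences for d83d and de03 (ed a0 bd and ed b8 83), unpacks each into a 16-bit value, and combines the surrogate pair into one codepoint. The function names decode_3byte and combine_surrogates are made up for illustration; this is only the bit manipulation, not what the Swift compiler's parser or any particular library does.

```python
def decode_3byte(b: bytes) -> int:
    """Unpack one 3-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx."""
    if len(b) != 3 or b[0] >> 4 != 0b1110:
        raise ValueError("expected a 3-byte sequence")
    if b[1] >> 6 != 0b10 or b[2] >> 6 != 0b10:
        raise ValueError("bad continuation byte")
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two separately encoded surrogates from the example in this thread.
high = decode_3byte(bytes.fromhex("eda0bd"))
low = decode_3byte(bytes.fromhex("edb883"))
codepoint = combine_surrogates(high, low)

# For comparison: the same codepoint encoded directly by Python's encoder.
direct = chr(codepoint).encode("utf-8")

print(hex(high), hex(low), hex(codepoint), direct.hex(" "))
```

The last line prints the recovered surrogates, the combined codepoint, and the 4-byte form that Python's own encoder produces for it, so the 6-byte and 4-byte representations can be compared side by side.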
In short, I think this is a bug in the UTF-8 parser being used by the Swift compiler. But as my expertise here is only a few days old, I’m prepared to be corrected.
—Jens