[swift-users] emoji in source code failed to compile in linux ubuntu

Fri Dec 4 14:22:36 CST 2015

> On Dec 4, 2015, at 11:47 AM, Dmitri Gribenko <gribozavr at gmail.com> wrote:
> 
> On Fri, Dec 4, 2015 at 11:25 AM, Jens Alfke <jens at mooseyard.com> wrote:
>> In other words, during UTF-8 encoding, some Unicode codepoints need to be broken into surrogate pairs. Therefore a UTF-8 decoder needs to recognize surrogate pairs and reassemble them into a single codepoint. That’s what’s not happening in this case.
> 
> No, surrogate code points can not appear in a UTF-8 stream, they can
> only appear in UTF-16.
> 
> http://www.unicode.org/versions/Unicode8.0.0/ch02.pdf page 30, table 2-3.

Thanks for referencing the spec; that was useful to me. However, this looks like a case where spec conformance clashes with the real-world desire for compatibility. After some more reading I found that it’s actually pretty common to find UTF-8-like data containing surrogate pairs (an encoding known as CESU-8), mostly due to software that was using 16-bit Unicode before it got formalized as UTF-16. According to Wikipedia*, Java encodes UTF-8 this way, as do Oracle and MySQL databases.
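
To make the difference concrete, here is a minimal Swift sketch comparing the two byte forms for a single emoji. The helper names (threeByteForm, hex) are made up for illustration and are not taken from the spec or from any library; the sketch only assumes that the CESU-8 form is what you get by encoding each UTF-16 surrogate as if it were an ordinary 3-byte code point.

```swift
// Hypothetical helper: encode one UTF-16 code unit with the 3-byte UTF-8
// bit pattern. Only meaningful for units >= 0x0800, which includes every
// surrogate (0xD800...0xDFFF).
func threeByteForm(_ unit: UInt16) -> [UInt8] {
    return [
        0xE0 | UInt8(unit >> 12),
        0x80 | UInt8((unit >> 6) & 0x3F),
        0x80 | UInt8(unit & 0x3F)
    ]
}

let emoji = "\u{1F600}"  // written as an escape to keep this source ASCII-only

// Standard UTF-8: one 4-byte sequence for the whole code point.
let utf8Bytes = Array(emoji.utf8)

// CESU-8 style: each half of the UTF-16 surrogate pair gets its own 3 bytes.
let cesu8Bytes = emoji.utf16.flatMap(threeByteForm)

// Hypothetical helper: format bytes as space-separated uppercase hex.
func hex(_ bytes: [UInt8]) -> String {
    return bytes.map { String($0, radix: 16, uppercase: true) }.joined(separator: " ")
}

print(hex(utf8Bytes))   // F0 9F 98 80
print(hex(cesu8Bytes))  // ED A0 BD ED B8 80
```

A strict UTF-8 decoder accepts the first sequence and rejects the second, because ED A0 BD and ED B8 80 decode to surrogate code points.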

If there are enough text editors/processors that do this, it might be a good idea for Swift’s lexer to accept surrogate pairs even if they’re technically invalid.
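
As a rough illustration of what "accepting surrogate pairs" could mean, the sketch below rewrites each 6-byte surrogate pair into the equivalent 4-byte UTF-8 sequence before decoding. This is not how Swift's lexer works; repairSurrogatePairs and decodeLeniently are hypothetical names, and a real lexer would also have to decide how to report diagnostics and source locations for the rewritten bytes.

```swift
// Hypothetical helper, not part of the Swift standard library or compiler:
// rewrite each CESU-8 style surrogate pair (6 bytes) as the equivalent
// 4-byte UTF-8 sequence and copy every other byte through unchanged.
func repairSurrogatePairs(_ bytes: [UInt8]) -> [UInt8] {
    var out: [UInt8] = []
    out.reserveCapacity(bytes.count)
    var i = 0
    while i < bytes.count {
        // Pattern: ED A0..AF 80..BF (high surrogate), ED B0..BF 80..BF (low).
        if i + 5 < bytes.count,
           bytes[i] == 0xED, bytes[i + 1] & 0xF0 == 0xA0, bytes[i + 2] & 0xC0 == 0x80,
           bytes[i + 3] == 0xED, bytes[i + 4] & 0xF0 == 0xB0, bytes[i + 5] & 0xC0 == 0x80 {
            // Each surrogate carries 10 payload bits: 4 in its second byte,
            // 6 in its third.
            let highTop = UInt32(bytes[i + 1] & 0x0F) << 6
            let high = highTop | UInt32(bytes[i + 2] & 0x3F)
            let lowTop = UInt32(bytes[i + 4] & 0x0F) << 6
            let low = lowTop | UInt32(bytes[i + 5] & 0x3F)
            let scalar: UInt32 = 0x10000 + ((high << 10) | low)
            // Emit the standard 4-byte UTF-8 form of the combined code point.
            out.append(0xF0 | UInt8(scalar >> 18))
            out.append(0x80 | UInt8((scalar >> 12) & 0x3F))
            out.append(0x80 | UInt8((scalar >> 6) & 0x3F))
            out.append(0x80 | UInt8(scalar & 0x3F))
            i += 6
        } else {
            out.append(bytes[i])
            i += 1
        }
    }
    return out
}

// Anything still invalid after the repair (a lone surrogate, for example)
// is replaced with U+FFFD by String(decoding:as:).
func decodeLeniently(_ bytes: [UInt8]) -> String {
    return String(decoding: repairSurrogatePairs(bytes), as: UTF8.self)
}

// "a", U+1F600 as a CESU-8 style surrogate pair, "b"
let input: [UInt8] = [0x61, 0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80, 0x62]
let text = decodeLeniently(input)
print(text == "a\u{1F600}b")  // true
```

Doing the repair as a separate pass keeps the strict decoder untouched, so well-formed UTF-8 input goes through byte for byte.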

—Jens

* https://en.wikipedia.org/wiki/CESU-8