<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Dec 4, 2015, at 3:28 AM, Quinn The Eskimo! &lt;<a href="mailto:eskimo1@apple.com" class="">eskimo1@apple.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><span style="font-family: Alegreya-Regular; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">I can explain that. &nbsp;U+1F603 is encoded in UTF-16 as d83d de03. &nbsp;If you encode each of these separately as UTF-8, you get ed a0 bd followed by ed b8 83. &nbsp;That's not the correct way to encode U+1F603 as UTF-8, hence the failure.</span><br style="font-family: Alegreya-Regular; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""></div></blockquote></div><br class=""><div class="">By total coincidence, just this week I implemented support for the above in a JSON parser, so I’m suddenly an expert ;-)</div><div class=""><br class=""></div><div class="">Based on what I know, both of those encodings are correct. 
UTF-16 surrogate pairs can occur in decoded UTF-8 and need to be decoded in turn into Unicode codepoints; according to the Wikipedia page on UTF-8:</div><div class=""><br class=""></div><div class="">“In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and 983040 4-byte sequences.”</div><div class=""><br class=""></div><div class="">In other words, during UTF-8 encoding, some Unicode codepoints need to be broken into surrogate pairs. Therefore a UTF-8 decoder needs to recognize surrogate pairs and reassemble them into a single codepoint. That’s what’s not happening in this case.</div><div class=""><br class=""></div><div class="">In short, I think this is a bug in the UTF-8 parser being used by the Swift compiler. But as my expertise here is only a few days old, I’m prepared to be corrected.</div><div class=""><br class=""></div><div class="">—Jens</div></body></html>