[swift-users] emoji in source code failed to compile in linux ubuntu

Fri Dec 4 15:36:13 CST 2015

The Unicode spec is very clear that conforming implementations must
reject any alternative UTF-8 encoding, even if it appears to make
sense, because accepting one is a security vulnerability. More
generally, the raw UTF-8 bit pattern allows any code point whose
canonical encoding is 1-3 code units to also be written with more code
units (any number up to 4 that is larger than the canonical length).
These overlong forms are no more valid than encoding surrogate code
points in UTF-8.
Alternative encodings are considered a security vulnerability because
any code that attempts to validate or filter a UTF-8 stream may not
recognize them, so the validation/filtering can be bypassed. For
example, think of an HTML form validator that automatically converts <
into &lt;. If you passed in an alternative encoding such as C0 BC, the
validator may not recognize it, but if the browser then interpreted
this ill-formed sequence as being the same as 0x3C, you'd have a
trivial XSS attack.
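
To make the C0 BC case concrete, here is a minimal Python sketch (Python
only because its built-in decoder is strict and easy to try; it is not
what Swift does internally). The helper naive_decode_2byte is
hypothetical: it stands in for a lax decoder that trusts the bit pattern
and skips the overlong check.

```python
def naive_decode_2byte(data: bytes) -> int:
    """Hypothetical lax decoder: reads a 2-byte sequence purely by bit
    pattern (110xxxxx 10xxxxxx) and performs no overlong check."""
    lead, cont = data
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


overlong = bytes([0xC0, 0xBC])  # overlong (ill-formed) encoding of U+003C

# A byte-level filter that searches for '<' (0x3C) finds nothing to escape.
filter_sees_lt = b"<" in overlong

# The lax decoder nevertheless reconstructs U+003C, i.e. '<'.
lax_result = chr(naive_decode_2byte(overlong))

# A conforming decoder refuses the sequence instead of interpreting it.
try:
    overlong.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

print(filter_sees_lt, repr(lax_result), strict_rejects)
```

The filter and the lax decoder disagree about what the bytes mean, and
that disagreement is the whole attack. Rejecting the sequence outright
removes it.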

In addition, section 3.9 of the Unicode 8.0 standard explicitly lists
the well-formed UTF-8 byte sequences (Table 3-7), and this list does not
include any encoding for the surrogate range (U+D800..U+DFFF).
Conforming implementations must reject any ill-formed sequence.
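
As a rough illustration of the surrogate rule, the Python sketch below
compares the well-formed 4-byte encoding of an emoji with the same code
point written as two 3-byte-encoded surrogates. The choice of U+1F600
and the helper name is_well_formed_utf8 are my own assumptions for the
example; the helper just leans on Python's strict built-in decoder.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


emoji = "\U0001F600"                 # a code point outside the BMP
well_formed = emoji.encode("utf-8")  # canonical 4-byte form: F0 9F 98 80

# The same code point as a UTF-16 surrogate pair (D83D DE00), with each
# surrogate then encoded as if it were a 3-byte UTF-8 sequence.
surrogates_as_utf8 = bytes([0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80])

print(well_formed.hex(" "), is_well_formed_utf8(well_formed))
print(surrogates_as_utf8.hex(" "), is_well_formed_utf8(surrogates_as_utf8))
```

Only the 4-byte form is accepted. The 6-byte form is ill-formed UTF-8,
so a conforming decoder rejects it rather than reassembling the pair.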

-Kevin Ballard

On Fri, Dec 4, 2015, at 12:40 PM, Dmitri Gribenko wrote:
> On Fri, Dec 4, 2015 at 12:22 PM, Jens Alfke
> <jens at mooseyard.com> wrote:
> >
> > On Dec 4, 2015, at 11:47 AM, Dmitri Gribenko <gribozavr at gmail.com>
> > wrote:
> >
> > On Fri, Dec 4, 2015 at 11:25 AM, Jens Alfke <jens at mooseyard.com>
> > wrote:
> >
> > In other words, during UTF-8 encoding, some Unicode codepoints need
> > to be broken into surrogate pairs. Therefore a UTF-8 decoder needs
> > to recognize surrogate pairs and reassemble them into a single
> > codepoint. That’s what’s not happening in this case.
> >
> >
> > No, surrogate code points can not appear in a UTF-8 stream, they can
> > only appear in UTF-16.
> >
> > http://www.unicode.org/versions/Unicode8.0.0/ch02.pdf page 30,
> > table 2-3.
> >
> >
> > Thanks for referencing the spec — that was useful to me. However,
> > this looks like an issue where spec conformance clashes with real-
> > world desire for compatibility.
>
> Violating the spec in this part would cause security issues, since
> different implementations would disagree on the character data. Sorry,
> but we are not doing this.  The Unicode spec is unambiguous here.
>
> Dmitri
>
> --
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/