<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On 21 Jun 2016, at 20:15, Xiaodi Wu via swift-evolution &lt;<a href="mailto:swift-evolution@swift.org" class="">swift-evolution@swift.org</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">On Tue, Jun 21, 2016 at 1:16 PM, Joe Groff <span dir="ltr" class="">&lt;<a href="mailto:jgroff@apple.com" target="_blank" class="">jgroff@apple.com</a>&gt;</span> wrote:<div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Any discussion about this ought to start from UAX #31, the Unicode consortium's recommendations on identifiers in programming languages:<br class="">

<br class="">

<a href="http://unicode.org/reports/tr31/" rel="noreferrer" target="_blank" class="">http://unicode.org/reports/tr31/</a><br class="">

<br class="">

Section 2.3 specifically calls out the situations in which ZWJ and ZWNJ need to be allowed. The document also describes a stability policy for handling new Unicode versions, other confusability issues, and many of the other problems with adopting Unicode in a programming language's syntax.<br class=""></blockquote><div class=""><br class=""></div><div class="">That's a fantastic document--a very edifying read. Given Swift's robust support for Unicode in its core libraries, it's kind of surprising to me that identifiers aren't canonicalized at compile time. From a quick first read, faithful adoption of UAX #31 recommendations would address most if not all of the confusability and zero-width security issues raised in this conversation.</div></div></div></div></div></blockquote><div><br class=""></div></div>From what I've read of&nbsp;<a href="http://unicode.org/reports/tr31/" class="">UAX #31</a>, it does seem to address all of the invisible-character issues raised in the discussion. Given their Unicode status as&nbsp;<i class="">Default_Ignorable_Code_Point</i>&nbsp;characters, I believe the best course of action would be to canonicalise identifiers by allowing invisible characters only where appropriate and ignoring them everywhere else.<div class=""><br class=""></div><div class="">The alternative to ignoring them would be to not canonicalise identifiers and treat invisible characters as an error instead.</div><div class=""><br class=""></div><div class="">This doesn't address the issue of Unicode confusable characters, but solving that has additional problems of its own and would probably be better addressed in a different proposal entirely.</div><div class=""><br class=""></div><div class="">I'd like to start writing the proposal if there is agreement that this would be the best course of action.</div><div class=""><br class=""></div><div class="">Sincerely,</div><div class="">João Pinheiro</div></body></html>