[swift-evolution] Prohibit invisible characters in identifier names

Tue Jun 21 14:15:47 CDT 2016

On Tue, Jun 21, 2016 at 1:16 PM, Joe Groff <jgroff at apple.com> wrote:

>
> > On Jun 21, 2016, at 8:47 AM, John McCall via swift-evolution <
> swift-evolution at swift.org> wrote:
> >
> >> On Jun 20, 2016, at 7:07 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
> >> On Mon, Jun 20, 2016 at 8:58 PM, John McCall via swift-evolution <
> swift-evolution at swift.org> wrote:
> >>> On Jun 20, 2016, at 5:22 PM, Jordan Rose via swift-evolution <
> swift-evolution at swift.org> wrote:
> >>> IIRC, some languages require zero-width joiners (though not zero-width
> spaces, which are distinct) to properly encode some of their characters.
> I'd be very leery of having Swift land on a model where identifiers can be
> used with some languages and not others; that smacks of ethnocentrism.
> >>
> >> None of those languages require zero-width characters between two Latin
> letters, or between a Latin letter and an Arabic numeral, or at the end of
> a word.  Since standard / system APIs will (barring some radical shift) use
> those code points exclusively, it's justifiable to give them some special
> attention.
> >>
> >> Although the practical implementation may need to be more limited in
> scope, the general principle doesn't need to privilege Latin letters and
> Arabic numerals. If, in any context, the presence or absence of a
> zero-width glyph cannot possibly be distinguished by a human reading the
> text, then the compiler should also be indifferent to its presence or
> absence (or, alternatively, its presence should be a compile-time error).
> >
> > Sure, that's obvious.  Jordan was observing that the simplest way to
> enforce that, banning such characters from identifiers completely, would
> still interfere with some languages, and I was pointing out that just doing
> enough to protect English would get most of the practical value because it
> would protect every use of the system and standard library.  A program
> would then only become attackable in this specific way for its own
> identifiers using non-Latin characters.
> >
> > All that said, I'm not convinced that this is worthwhile; the
> identifier-similarity problem in Unicode is much broader than just
> invisible characters.  In fact, Swift still doesn't canonicalize
> identifiers, so canonically equivalent compositions of the same glyph will
> actually produce different names.  So unless we're going to fix that and
> then ban all sorts of things that are known to generally be represented
> with a confusable glyph in a typical fixed-width font (like the
> mathematical alphabets), this is just a problem that will always exist in
> some form.
>
> Any discussion about this ought to start from UAX #31, the Unicode
> consortium's recommendations on identifiers in programming languages:
>
> http://unicode.org/reports/tr31/
>
> Section 2.3 specifically calls out the situations in which ZWJ and ZWNJ
> need to be allowed. The document also describes a stability policy for
> handling new Unicode versions, other confusability issues, and many of the
> other problems with adopting Unicode in a programming language's syntax.
>

That's a fantastic document--a very edifying read. Given Swift's robust
support for Unicode in its core libraries, I'm somewhat surprised that
identifiers aren't canonicalized at compile time. From a quick first
read, faithful adoption of the UAX #31 recommendations would address most,
if not all, of the confusability and zero-width security issues raised in
this conversation.
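
To make the canonicalization point concrete, here is a rough sketch in
ordinary Swift (library code, not the compiler). It shows two canonically
equivalent spellings of the same name that differ at the scalar level, which
is the level at which a non-canonicalizing lexer would compare them. The
helpers canonicalIdentifier and containsZeroWidth are hypothetical names of
my own, NFC is just one normalization form that could be chosen, and the
zero-width check is deliberately coarser than UAX #31, which allows ZWJ and
ZWNJ in specific contexts.

```swift
import Foundation

// Two spellings of the same visible name:
// precomposed U+00E9 versus "e" followed by combining acute U+0301.
let precomposed = "caf\u{00E9}"
let decomposed = "cafe\u{0301}"

// String equality in Swift is based on canonical equivalence.
let equalAsStrings = (precomposed == decomposed)

// The scalar sequences differ, so a scalar-by-scalar comparison
// treats these as two distinct names.
let sameScalars = precomposed.unicodeScalars
    .elementsEqual(decomposed.unicodeScalars)

// Hypothetical helper: normalize a name to NFC before comparing.
// Assumption: NFC is the chosen form; this is not what the compiler does.
func canonicalIdentifier(_ name: String) -> String {
    return name.precomposedStringWithCanonicalMapping
}

let sameAfterNFC = canonicalIdentifier(precomposed).unicodeScalars
    .elementsEqual(canonicalIdentifier(decomposed).unicodeScalars)

// Hypothetical helper: flag zero-width space, ZWNJ, and ZWJ anywhere
// in a name. A UAX #31-style rule would permit ZWJ/ZWNJ in the
// contexts it describes; this sketch does not model those contexts.
func containsZeroWidth(_ name: String) -> Bool {
    let zeroWidth: Set<Unicode.Scalar> = ["\u{200B}", "\u{200C}", "\u{200D}"]
    return name.unicodeScalars.contains { zeroWidth.contains($0) }
}

print(equalAsStrings, sameScalars, sameAfterNFC)
print(containsZeroWidth("ab\u{200D}c"), containsZeroWidth("abc"))
```

The design question for the compiler would be whether to normalize such
names silently or to reject the non-normalized spelling with a diagnostic;
the sketch takes no position on that.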

>
> -Joe