[swift-evolution] Prohibit invisible characters in identifier names

Tue Jun 21 13:16:40 CDT 2016

> On Jun 21, 2016, at 8:47 AM, John McCall via swift-evolution <swift-evolution at swift.org> wrote:
> 
>> On Jun 20, 2016, at 7:07 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>> On Mon, Jun 20, 2016 at 8:58 PM, John McCall via swift-evolution <swift-evolution at swift.org> wrote:
>>> On Jun 20, 2016, at 5:22 PM, Jordan Rose via swift-evolution <swift-evolution at swift.org> wrote:
>>> IIRC, some languages require zero-width joiners (though not zero-width spaces, which are distinct) to properly encode some of their characters. I'd be very leery of having Swift land on a model where identifiers can be used with some languages and not others; that smacks of ethnocentrism.
>> 
>> None of those languages require zero-width characters between two Latin letters, or between a Latin letter and an Arabic numeral, or at the end of a word.  Since standard / system APIs will (barring some radical shift) use those code points exclusively, it's justifiable to give them some special attention.
>> 
>> Although the practical implementation may need to be more limited in scope, the general principle doesn't need to privilege Latin letters and Arabic numerals. If, in any context, the presence or absence of a zero-width glyph cannot possibly be distinguished by a human reading the text, then the compiler should also be indifferent to its presence or absence (or, alternatively, its presence should be a compile-time error).
> 
> Sure, that's obvious.  Jordan was observing that the simplest way to enforce that, banning such characters from identifiers completely, would still interfere with some languages, and I was pointing out that just doing enough to protect English would get most of the practical value because it would protect every use of the system and standard library.  A program would then only become attackable in this specific way for its own identifiers using non-Latin characters.
> 
> All that said, I'm not convinced that this is worthwhile; the identifier-similarity problem in Unicode is much broader than just invisible characters.  In fact, Swift still doesn't canonicalize identifiers, so canonically equivalent compositions of the same glyph will actually produce different names.  So unless we're going to fix that and then ban all sorts of things that are known to generally be represented with a confusable glyph in a typical fixed-width font (like the mathematical alphabets), this is just a problem that will always exist in some form.
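
To make the canonicalization point concrete, here is a minimal sketch in Python, chosen only because its standard unicodedata module makes the behavior easy to see. It says nothing about how the Swift compiler is or should be implemented, and the helper name canonical is made up for illustration. It shows two canonically equivalent spellings comparing unequal until both are normalized, and that canonical normalization alone does not fold a mathematical-alphabet letter into its plain counterpart (that takes the more aggressive compatibility form, NFKC).

```python
import unicodedata

# Two spellings of the same visible name: precomposed U+00E9 versus
# "e" followed by COMBINING ACUTE ACCENT (U+0301).
composed = "caf\u00e9"
decomposed = "cafe\u0301"

def canonical(name):
    """Hypothetical helper: map an identifier to NFC before comparing."""
    return unicodedata.normalize("NFC", name)

# Compared code point by code point, the two spellings are different names.
raw_equal = composed == decomposed

# After canonical normalization they are the same name.
normalized_equal = canonical(composed) == canonical(decomposed)

# MATHEMATICAL ITALIC SMALL A (U+1D44E) is untouched by NFC,
# but NFKC maps it to a plain "a".
math_a = "\U0001D44E"
nfc_folds_math_a = unicodedata.normalize("NFC", math_a) == "a"
nfkc_folds_math_a = unicodedata.normalize("NFKC", math_a) == "a"

print(raw_equal, normalized_equal, nfc_folds_math_a, nfkc_folds_math_a)
```

So canonicalizing identifiers would close the first gap, but the confusable-glyph problem is a separate one.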

Any discussion about this ought to start from UAX #31, the Unicode Consortium's recommendations on identifiers in programming languages:

http://unicode.org/reports/tr31/

Section 2.3 specifically calls out the situations in which ZWJ (zero-width joiner) and ZWNJ (zero-width non-joiner) need to be allowed. The document also describes a stability policy for handling new Unicode versions, other confusability issues, and many of the other problems with adopting Unicode in a programming language's syntax.
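
For a feel of what a context-sensitive check looks like, here is a rough Python sketch. To be clear, this is not the UAX #31 rule set, which is defined in terms of script-specific contexts; it is only the much cruder "protect ASCII" idea from earlier in the thread: flag a joiner at the start or end of an identifier, or sitting between two ASCII letters or digits. The function names are hypothetical.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
JOINERS = {ZWNJ, ZWJ}

def is_ascii_alnum(ch):
    """True for ASCII letters and digits only."""
    return ch.isascii() and ch.isalnum()

def suspicious_joiners(identifier):
    """Return the indices of joiners that no reader could notice.

    Crude assumption, not UAX #31: a joiner is suspicious if it is the
    first or last character, or if both of its neighbors are ASCII
    letters or digits. Runs of consecutive joiners are not handled.
    """
    hits = []
    for i, ch in enumerate(identifier):
        if ch not in JOINERS:
            continue
        prev = identifier[i - 1] if i > 0 else None
        nxt = identifier[i + 1] if i + 1 < len(identifier) else None
        if prev is None or nxt is None:
            hits.append(i)
        elif is_ascii_alnum(prev) and is_ascii_alnum(nxt):
            hits.append(i)
    return hits

# A joiner hidden between two Latin letters is flagged.
print(suspicious_joiners("foo\u200dbar"))
# A trailing joiner is flagged.
print(suspicious_joiners("foo\u200c"))
# A Persian word that legitimately uses ZWNJ between Arabic-script letters is not.
print(suspicious_joiners("\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"))
```

A real implementation would need the contextual rules from the report rather than an ASCII test, which is exactly why the report is the right starting point.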

-Joe