[swift-evolution] Prohibit invisible characters in identifier names

Thu Jun 23 12:56:47 CDT 2016

On Thu, Jun 23, 2016 at 12:41 PM, João Pinheiro <joao at joaopinheiro.org>
wrote:

> There are two different issues here, individual character normalisation
> and identifier canonicalisation. NFC handles character normalisation and it
> definitely should be part of the proposal since identifier canonicalisation
> doesn't make sense if the individual character representation isn't
> normalised first.
>

I think we're using terminology differently here. What you call "character
normalization" is what I'm calling canonicalization. NFC is described in
UAX #15 as "canonical decomposition followed by canonical composition" and
I'm just using the word "canonicalization" because it's shorter. If Swift
represents each identifier in an NFC-transformed form (what I call
canonicalized), then I understand the identifier to be canonicalized. What
is the distinction you're drawing here?
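
To make concrete what I mean by canonicalization, here is a minimal sketch.
It assumes Foundation's precomposedStringWithCanonicalMapping as the NFC
transform; nfcScalars is a hypothetical helper name of mine, and nothing
here is meant to describe how the compiler would actually implement this.
The three spellings are written as escapes so the code points are
unambiguous:

```swift
import Foundation

// Three spellings of the same abstract character.
let angstromSign = "\u{212B}"   // ANGSTROM SIGN
let precomposed  = "\u{00C5}"   // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed   = "A\u{030A}"  // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

// Hypothetical helper: the scalar values of a string after NFC
// (canonical decomposition followed by canonical composition).
func nfcScalars(_ s: String) -> [UInt32] {
    return s.precomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

// Before NFC the three strings hold different scalar sequences;
// after NFC each one is the single scalar U+00C5.
let forms = [angstromSign, precomposed, decomposed].map(nfcScalars)
for scalars in forms {
    print(scalars.map { "U+" + String($0, radix: 16, uppercase: true) })
}
```

The sketch compares scalar values rather than the strings themselves
because Swift's String == already compares by canonical equivalence, so it
would report these three as equal even without any normalization step.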

>
> Swift currently doesn't normalise unicode characters, as can be seen in
> the following code example:
>
> let Å = "Hello" // Angstrom
> let Å = "Swift" // Latin Capital Letter A With Ring Above
> let Å = "World" // Latin Capital Letter A + Combining Ring Above
>
> print(Å)
> print(Å)
> print(Å)
>
> According to the Unicode standard, all 3 of these characters should be
> normalised into the same representation.
>
> Sincerely,
> João Pinheiro
>
>
> On 23 Jun 2016, at 17:40, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>
> I think this issue is bigger than that. As UAX #31 suggests, the most
> appropriate approach is canonicalizing identifiers by NFC, with specific
> treatment of ZWJ and ZWNJ by allowing them in three contexts, which will
> require thought as to how to implement.
>
> Given that there is a specifically recommended algorithm on how to handle
> this issue, I'm also not sure anymore that this requires a proposal;
> "process Unicode correctly" is really more of a bug fix because, given the
> strict limits of what's canonicalized, there shouldn't be a user-facing
> effect if we are merely proposing to prohibit glyphs from appearing in
> certain contexts where they are never in fact encountered in real language.
>
> On Thu, Jun 23, 2016 at 11:19 AM Sean Heber <sean at fifthace.com> wrote:
>
>> I’m no unicode expert, but this sounds like the way to go to me.
>>
>> l8r
>> Sean
>>
>>
>> > On Jun 23, 2016, at 11:17 AM, João Pinheiro via swift-evolution <
>> swift-evolution at swift.org> wrote:
>> >
>> >
>> >> On 21 Jun 2016, at 20:15, Xiaodi Wu via swift-evolution <
>> swift-evolution at swift.org> wrote:
>> >>
>> >> On Tue, Jun 21, 2016 at 1:16 PM, Joe Groff <jgroff at apple.com> wrote:
>> >> Any discussion about this ought to start from UAX #31, the Unicode
>> consortium's recommendations on identifiers in programming languages:
>> >>
>> >> http://unicode.org/reports/tr31/
>> >>
>> >> Section 2.3 specifically calls out the situations in which ZWJ and
>> ZWNJ need to be allowed. The document also describes a stability policy for
>> handling new Unicode versions, other confusability issues, and many of the
>> other problems with adopting Unicode in a programming language's syntax.
>> >>
>> >> That's a fantastic document--a very edifying read. Given Swift's
>> robust support for Unicode in its core libraries, it's kind of surprising
>> to me that identifiers aren't canonicalized at compile time. From a quick
>> first read, faithful adoption of UAX #31 recommendations would address most
>> if not all of the confusability and zero-width security issues raised in
>> this conversation.
>> >
>> > From what I've read of UAX #31 it does seem to address all of the
>> invisible character issues raised in the discussion. Given their unicode
>> status as Default_Ignorable_Code_Points, I believe the best course of
>> action would be to canonicalise identifiers by allowing invisible
>> characters only where appropriate and ignoring them everywhere else.
>> >
>> > The alternative to ignoring them would be to not canonicalise
>> identifiers and treat invisible characters as an error instead.
>> >
>> > This doesn't address the issue of unicode confusable characters, but
>> solving that has additional problems of its own and would probably be
>> better addressed in a different proposal entirely.
>> >
>> > I'd like to start writing the proposal if there is agreement that this
>> would be the best course of action.
>> >
>> > Sincerely,
>> > João Pinheiro
>> > _______________________________________________
>> > swift-evolution mailing list
>> > swift-evolution at swift.org
>> > https://lists.swift.org/mailman/listinfo/swift-evolution
>>
>>
>