[swift-evolution] Prohibit invisible characters in identifier names

Xiaodi Wu xiaodi.wu at gmail.com
Thu Jun 23 13:31:05 CDT 2016


On Thu, Jun 23, 2016 at 12:56 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:

> On Thu, Jun 23, 2016 at 12:41 PM, João Pinheiro <joao at joaopinheiro.org>
> wrote:
>
>> There are two different issues here, individual character normalisation
>> and identifier canonicalisation. NFC handles character normalisation and it
>> definitely should be part of the proposal since identifier canonicalisation
>> doesn't make sense if the individual character representation isn't
>> normalised first.
>>
>
> I think we're using terminology differently here. What you call "character
> normalization" is what I'm calling canonicalization. NFC is described in
> UAX #15 as "canonical decomposition followed by canonical composition" and
> I'm just using the word "canonicalization" because it's shorter. If Swift
> represents each identifier in an NFC-transformed form (what I call
> canonicalized), then I understand the identifier to be canonicalized. What
> is the distinction you're drawing here?
>
>
>>
>> Swift currently doesn't normalise unicode characters, as can be seen in
>> the following code example:
>>
>> let Å = "Hello" // Angstrom
>> let Å = "Swift" // Latin Capital Letter A With Ring Above
>> let Å = "World" // Latin Capital Letter A + Combining Ring Above
>>
>> print(Å)
>> print(Å)
>> print(Å)
>>
>> According to the unicode standard, all 3 of these characters should be
>> normalised into the same representation.
>>
>>
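In case a mail client has already normalised the three spellings quoted
above, here is a minimal sketch that writes them with explicit scalar
escapes. The constant names and the scalarValues helper are made up for
illustration. It shows that the three are distinct scalar sequences, even
though String's == operator, which compares by canonical equivalence,
treats them as equal:

```swift
// Three spellings of what renders as the same glyph.
let angstromSign = "\u{212B}"   // ANGSTROM SIGN
let precomposed = "\u{00C5}"    // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed = "A\u{030A}"    // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

// Hypothetical helper: the raw Unicode scalar values of a string.
func scalarValues(_ text: String) -> [UInt32] {
    return text.unicodeScalars.map { $0.value }
}

// The scalar sequences differ...
print(scalarValues(angstromSign))
print(scalarValues(precomposed))
print(scalarValues(decomposed))

// ...but String equality uses canonical equivalence.
print(angstromSign == precomposed)
print(precomposed == decomposed)
```
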
Just re-read UAX #31. I see two different issues here too--do these match
up with what you're saying above?

* Disallowing certain glyphs in identifiers. To do so, we can implement the
recommendation to disallow all glyphs in UAX #31 Table 4, except ZWJ and
ZWNJ in the specific scenarios outlined in section 2.3.

* Internally, when comparing two identifiers A and B, compare NFC(A) and
NFC(B) without modifying or otherwise restricting the actual user-facing
code to contain only NFC-normalized strings. This would be the approach
recommended in section 1.3.
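
To make the second item concrete, here is a rough sketch of comparing
NFC(A) and NFC(B) while leaving the source text untouched. It assumes
Foundation is available and uses its precomposedStringWithCanonicalMapping
property to obtain the NFC form; nfcScalars and sameIdentifier are
hypothetical names, and this is not how the compiler handles identifiers
today:

```swift
import Foundation

// Hypothetical helper: scalar values of the NFC form of an identifier.
func nfcScalars(_ identifier: String) -> [UInt32] {
    return identifier.precomposedStringWithCanonicalMapping
        .unicodeScalars.map { $0.value }
}

// Two spellings name the same identifier if their NFC forms match
// scalar for scalar. The original spellings are never modified.
func sameIdentifier(_ a: String, _ b: String) -> Bool {
    return nfcScalars(a) == nfcScalars(b)
}

print(sameIdentifier("\u{212B}", "A\u{030A}"))
print(sameIdentifier("A", "\u{00C5}"))
```

The comparison is done on scalars rather than with == on String so that
the normalisation step is explicit instead of being hidden inside String's
own equality.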

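For the first item, a starting point might be simply locating ZWJ and ZWNJ
inside a candidate identifier. The sketch below only finds them; it does
not implement the contextual rules of section 2.3, nor the rest of Table
4, and joinerOffsets is a hypothetical name:

```swift
let zeroWidthNonJoiner: Unicode.Scalar = "\u{200C}"  // ZWNJ
let zeroWidthJoiner: Unicode.Scalar = "\u{200D}"     // ZWJ

// Hypothetical helper: scalar offsets at which ZWJ or ZWNJ occur.
// A real check would then decide, per offset, whether the surrounding
// scalars form one of the contexts in which the joiner is permitted.
func joinerOffsets(in identifier: String) -> [Int] {
    var offsets: [Int] = []
    for (offset, scalar) in identifier.unicodeScalars.enumerated() {
        if scalar == zeroWidthNonJoiner || scalar == zeroWidthJoiner {
            offsets.append(offset)
        }
    }
    return offsets
}

print(joinerOffsets(in: "a\u{200D}b"))
```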

>> Sincerely,
>> João Pinheiro
>>
>>
>> On 23 Jun 2016, at 17:40, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>>
>> I think this issue is bigger than that. As UAX #31 suggests, the most
>> appropriate approach is canonicalizing identifiers by NFC, with specific
>> treatment of ZWJ and ZWNJ by allowing them in three contexts, which will
>> require thought as to how to implement.
>>
>> Given that there is a specifically recommended algorithm on how to handle
>> this issue, I'm also not sure anymore that this requires a proposal;
>> "process Unicode correctly" is really more of a bug fix because, given the
>> strict limits of what's canonicalized, there shouldn't be a user-facing
>> effect if we are merely proposing to prohibit glyphs from appearing in
>> certain contexts where they are never in fact encountered in real language.
>>
>> On Thu, Jun 23, 2016 at 11:19 AM Sean Heber <sean at fifthace.com> wrote:
>>
>>> I’m no unicode expert, but this sounds like the way to go to me.
>>>
>>> l8r
>>> Sean
>>>
>>>
>>> > On Jun 23, 2016, at 11:17 AM, João Pinheiro via swift-evolution <
>>> swift-evolution at swift.org> wrote:
>>> >
>>> >
>>> >> On 21 Jun 2016, at 20:15, Xiaodi Wu via swift-evolution <
>>> swift-evolution at swift.org> wrote:
>>> >>
>>> >> On Tue, Jun 21, 2016 at 1:16 PM, Joe Groff <jgroff at apple.com> wrote:
>>> >> Any discussion about this ought to start from UAX #31, the Unicode
>>> consortium's recommendations on identifiers in programming languages:
>>> >>
>>> >> http://unicode.org/reports/tr31/
>>> >>
>>> >> Section 2.3 specifically calls out the situations in which ZWJ and
>>> ZWNJ need to be allowed. The document also describes a stability policy for
>>> handling new Unicode versions, other confusability issues, and many of the
>>> other problems with adopting Unicode in a programming language's syntax.
>>> >>
>>> >> That's a fantastic document--a very edifying read. Given Swift's
>>> robust support for Unicode in its core libraries, it's kind of surprising
>>> to me that identifiers aren't canonicalized at compile time. From a quick
>>> first read, faithful adoption of UAX #31 recommendations would address most
>>> if not all of the confusability and zero-width security issues raised in
>>> this conversation.
>>> >
>>> > From what I've read of UAX #31 it does seem to address all of the
>>> invisible character issues raised in the discussion. Given their unicode
>>> status of Default_Ignorable_Code_Points, I believe the best course of
>>> action would be to canonicalise identifiers by allowing invisible
>>> characters only where appropriate and ignoring them everywhere else.
>>> >
>>> > The alternative to ignoring them would be to not canonicalise
>>> identifiers and treat invisible characters as an error instead.
>>> >
>>> > This doesn't address the issue of unicode confusable characters, but
>>> solving that has additional problems of its own and would probably be
>>> better addressed in a different proposal entirely.
>>> >
>>> > I'd like to start writing the proposal if there is agreement that this
>>> would be the best course of action.
>>> >
>>> > Sincerely,
>>> > João Pinheiro
>>> > _______________________________________________
>>> > swift-evolution mailing list
>>> > swift-evolution at swift.org
>>> > https://lists.swift.org/mailman/listinfo/swift-evolution
>>>
>>>
>>
>