[swift-evolution] [Proposal] Normalize Unicode Identifiers

João Pinheiro joao at joaopinheiro.org
Tue Jul 26 14:22:39 CDT 2016


This proposal [gist <https://gist.github.com/JoaoPinheiro/5f226f46c67d235a7039c775a4300800>] is the result of the discussions from the thread "Prohibit invisible characters in identifier names <http://thread.gmane.org/gmane.comp.lang.swift.evolution/21022>". I hope it's still in time for inclusion in Swift 3.

Sincerely,
João Pinheiro


Normalize Unicode Identifiers

Proposal: SE-NNNN <https://gist.github.com/JoaoPinheiro/NNNN-normalize-identifiers.md>
Author: João Pinheiro <https://github.com/joaopinheiro>
Status: Awaiting review
Review manager: TBD

Introduction

This proposal introduces identifier normalization to prevent the unsafe and potentially abusive use of invisible characters, or of equivalent representations of the same Unicode character, in identifiers.

Swift-evolution thread: Discussion thread <http://thread.gmane.org/gmane.comp.lang.swift.evolution/21022>

Motivation

Swift supports Unicode in identifiers, but it does not yet normalize them. As a result, different Unicode representations of the same character are treated as distinct identifiers.

For example:

let Å = "Angstrom"
let Å = "Latin Capital Letter A With Ring Above"
let Å = "Latin Capital Letter A + Combining Ring Above"
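
To make the difference visible, here is a minimal sketch that spells the three names with explicit escapes and prints their scalar values before and after NFC. The helpers `scalars` and `nfc` are hypothetical names for this illustration, and it assumes Foundation's `precomposedStringWithCanonicalMapping` as the NFC implementation; it is not how the compiler would do it.

```swift
import Foundation

// The three spellings from the example, written as escapes.
let angstromSign = "\u{212B}"   // ANGSTROM SIGN
let precomposed  = "\u{00C5}"   // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed   = "A\u{030A}"  // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

// Hypothetical helper: the scalar values of a string, as uppercase hex.
func scalars(_ s: String) -> [String] {
    return s.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
}

// Hypothetical helper: NFC via Foundation (assumed here to stand in for
// whatever normalization the compiler would use).
func nfc(_ s: String) -> String {
    return s.precomposedStringWithCanonicalMapping
}

// Each spelling has a different scalar sequence before normalization.
for spelling in [angstromSign, precomposed, decomposed] {
    print(scalars(spelling), "->", scalars(nfc(spelling)))
}
```

All three spellings end up as the single scalar U+00C5 after NFC, which is why normalized identifiers would no longer be distinct.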

In addition, default-ignorable characters such as the Zero Width Space and Zero Width Non-Joiner (exemplified below) are currently accepted as valid parts of identifiers without any restrictions.

let ab = "ab"
let a​b = "a + Zero Width Space + b"

func xy() { print("xy") }
func x‌y() { print("x + <Zero Width Non-Joiner> + y") }

The use of default-ignorable characters in identifiers is problematic, first because the effects they represent are stylistic or otherwise out of scope for identifiers, and second because the characters themselves often have no visible display. These characters can also be misapplied to create strings that look the same but contain different characters, which can create security problems.
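
As a rough illustration of how such characters could be spotted, the sketch below lists every default-ignorable scalar in a name. The function `defaultIgnorables(in:)` is a hypothetical helper, and it assumes a toolchain where `Unicode.Scalar.Properties.isDefaultIgnorableCodePoint` is available (Swift 5 or later).

```swift
// Hypothetical helper: offset and value of every default-ignorable scalar
// in a name, using the standard library's Unicode property lookup.
func defaultIgnorables(in name: String) -> [(offset: Int, scalar: Unicode.Scalar)] {
    var found: [(offset: Int, scalar: Unicode.Scalar)] = []
    for (offset, scalar) in name.unicodeScalars.enumerated() {
        if scalar.properties.isDefaultIgnorableCodePoint {
            found.append((offset: offset, scalar: scalar))
        }
    }
    return found
}

let plain = "ab"
let hidden = "a\u{200B}b"  // a + ZERO WIDTH SPACE + b, renders like "ab"

print(defaultIgnorables(in: plain).isEmpty)
for hit in defaultIgnorables(in: hidden) {
    print("offset \(hit.offset): U+\(String(hit.scalar.value, radix: 16, uppercase: true))")
}
```

The two names look identical in most editors, yet only the second one reports a hit.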

Proposed solution

Normalize Swift identifiers to Normalization Form C (NFC), the form recommended for case-sensitive languages in Unicode Standard Annexes 15 <https://gist.github.com/JoaoPinheiro/UAX15> and 31 <https://gist.github.com/JoaoPinheiro/UAX31>, following the Normalization Charts <https://gist.github.com/JoaoPinheiro/NormalizationCharts>.

In addition to that, prohibit the use of default-ignorable characters in identifiers except in the special cases described in UAX31 <https://gist.github.com/JoaoPinheiro/UAX31>, listed below:

Allow Zero Width Non-Joiner (U+200C) when breaking a cursive connection
Allow Zero Width Non-Joiner (U+200C) in a conjunct context
Allow Zero Width Joiner (U+200D) in a conjunct context
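
The two rules could be combined roughly as follows. This is only a sketch under stated assumptions: `IdentifierCheck` and `check(identifier:)` are hypothetical names, NFC comes from Foundation's `precomposedStringWithCanonicalMapping`, and the UAX31 context rules for U+200C and U+200D are deliberately not implemented. Those two scalars are only flagged as needing a context check, since deciding whether they break a cursive connection or sit in a conjunct context requires script data this example does not have.

```swift
import Foundation

// Hypothetical result type for this sketch; not part of the proposal.
enum IdentifierCheck: Equatable {
    case accepted(normalized: String)
    case rejected(offset: Int, scalar: Unicode.Scalar)
    case needsContextCheck(offset: Int, scalar: Unicode.Scalar)
}

// Hypothetical checker: normalize to NFC, then reject default-ignorable
// scalars. ZWNJ and ZWJ are not rejected outright, because UAX31 allows
// them in specific contexts that this sketch does not evaluate.
func check(identifier: String) -> IdentifierCheck {
    let normalized = identifier.precomposedStringWithCanonicalMapping
    var pending: IdentifierCheck? = nil
    for (offset, scalar) in normalized.unicodeScalars.enumerated() {
        guard scalar.properties.isDefaultIgnorableCodePoint else { continue }
        if scalar == "\u{200C}" || scalar == "\u{200D}" {
            // Remember the first joiner, but keep scanning for hard errors.
            if pending == nil {
                pending = .needsContextCheck(offset: offset, scalar: scalar)
            }
        } else {
            return .rejected(offset: offset, scalar: scalar)
        }
    }
    return pending ?? .accepted(normalized: normalized)
}

print(check(identifier: "A\u{030A}"))   // decomposed Å
print(check(identifier: "a\u{200B}b"))  // contains ZERO WIDTH SPACE
print(check(identifier: "x\u{200C}y"))  // contains ZERO WIDTH NON-JOINER
```

Offsets are reported in scalars of the normalized form, which seemed the simplest choice for a sketch; a real diagnostic would more likely point at a source location.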

Impact on existing code

This is potentially a source-breaking change for code that uses distinct but identical-looking identifiers with different Unicode representations. The likelihood of that happening in actual code is very small, and the problem can be solved by renaming identifiers that collide under the new normalized form to new, non-colliding identifiers.
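
To gauge whether a code base is affected, one could group its identifiers by normalized form and look for groups with more than one spelling. The sketch below assumes the identifiers have already been collected into an array; `scalarValues` and `collidingGroups(in:)` are hypothetical helpers, and NFC again comes from Foundation. Scalar values are used as the grouping key because Swift's own `String` equality already treats canonically equivalent strings as equal, so comparing the strings directly would hide the different spellings.

```swift
import Foundation

// Hypothetical helper: the raw scalar values of a string.
func scalarValues(_ s: String) -> [UInt32] {
    return s.unicodeScalars.map { $0.value }
}

// Hypothetical helper: groups of identifiers that share an NFC form but
// are spelled with different scalar sequences.
func collidingGroups(in identifiers: [String]) -> [[String]] {
    let grouped = Dictionary(grouping: identifiers, by: { name in
        scalarValues(name.precomposedStringWithCanonicalMapping)
    })
    return grouped.values.filter { group in
        Set(group.map(scalarValues)).count > 1
    }
}

let names = ["\u{212B}", "\u{00C5}", "A\u{030A}", "ab"]
for group in collidingGroups(in: names) {
    // Print each spelling in the group as its scalar values.
    print(group.map(scalarValues))
}
```

Each reported group would need all but one of its members renamed.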

Alternatives considered

The option of ignoring default-ignorable characters in identifiers was also discussed, but it was considered to be more confusing and less secure than explicitly treating them as errors.

Unaddressed Issues

There was some discussion around the issue of Unicode confusable characters, but it was considered to be out of scope for this proposal. Unicode confusable characters are a complicated issue and any possible solutions also come with significant drawbacks that would require more time and consideration.