[swift-evolution] Lexical matters: identifiers and operators

Sat Oct 1 11:47:28 CDT 2016

There is a PR on swift-evolution to implement UAX#31 recommendations:

https://github.com/apple/swift-evolution/pull/531

It was discussed on this list fairly recently, so a quick scroll through
the archives should surface those threads.

Briefly, UAX#31 recommends NFC for languages with case-sensitive
identifiers and NFKC for languages with case-insensitive identifiers.
Swift identifiers are case-sensitive, so the normalization proposed in
this PR is NFC, not NFKC.
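
To make the NFC/NFKC difference concrete, here is a minimal sketch. It is
not code from the PR; it assumes Foundation is available, whose
precomposedStringWithCanonicalMapping and
precomposedStringWithCompatibilityMapping properties produce NFC and NFKC
respectively. The helper name `scalars` and the sample strings are mine,
chosen only for illustration.

```swift
import Foundation

// Two spellings of "café": precomposed U+00E9 versus "e" + combining acute U+0301.
let precomposed = "caf\u{E9}"
let decomposed = "cafe\u{301}"

// U+FB01 LATIN SMALL LIGATURE FI is only compatibility-equivalent to "fi".
let ligature = "\u{FB01}le"

// Hypothetical helper: the raw scalar values of a string. Swift's == on String
// already honors canonical equivalence, so compare scalars to see the difference.
func scalars(_ s: String) -> [UInt32] {
    return s.unicodeScalars.map { $0.value }
}

let nfcDecomposed = decomposed.precomposedStringWithCanonicalMapping   // NFC
let nfcLigature = ligature.precomposedStringWithCanonicalMapping       // NFC
let nfkcLigature = ligature.precomposedStringWithCompatibilityMapping  // NFKC

// The two spellings are different scalar sequences before normalization...
print(scalars(precomposed) == scalars(decomposed))
// ...and identical after NFC.
print(scalars(nfcDecomposed) == scalars(precomposed))
// NFC leaves the ligature untouched; NFKC folds it into plain "fi".
print(scalars(nfcLigature) == scalars(ligature))
print(scalars(nfkcLigature) == scalars("file"))
```

Under NFC, an identifier spelled with the ligature stays distinct from one
spelled with plain "fi"; under NFKC the two would collapse into a single
identifier.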

On Sat, Oct 1, 2016 at 11:02 Jonathan S. Shapiro via swift-evolution <
swift-evolution at swift.org> wrote:

> New to the list, but old hand at PL design. Was looking over the lexical
> structure of Swift 2.2 and 3.0, and I have some questions. A number of
> considerations identified in UAX31 (Unicode Identifier and Pattern Syntax)
> and UAX36 (Unicode Security Considerations) aren't obviously addressed.
>
> Here are some items that jumped out from a casual glance at the spec:
>
> 1. The specification does not appear to state any particular rules for
> compatibility or normalization in identifiers. Other Unicode-aware
> programming languages have adopted NFKC almost universally, and for good
> reason. The current identifier-head and identifier-character grammar admit
> sequences that Unicode considers malformed.
>
> 2. The specification does not appear to address any notion of Unicode
> equivalent sequences.
>
> 3. The relationship between the identifiers admitted by Swift 3 and
> identifiers admitted by UAX31 isn't clear. As a matter of cross-platform
> compatibility it would be really good if identifiers permitted by the
> default rules of UAX31 were all legal in Swift. This seems important for
> cross-language interop.
>
> Has this relationship been discussed somewhere I can catch up on?
>
> 4. Valid operators include code points that are undefined in any current
> or historical Unicode standard. That seems problematic. Future revisions to
> Unicode will eventually place *some* of those code points in the XIDS/XIDC
> categories, at which point we will have to choose between backwards
> compatibility and interop. Others will be assigned to new combining marks,
> which will want to be used in identifiers. As new languages are added to
> Unicode, compatibility concerns will exclude some groups from using
> identifiers that are natural to them.
>
>
> In order of least-to-most difficulty, I'd like to suggest some changes to
> the specification. I'm willing to implement them if agreement can be
> reached:
>
> 1. Pick a Unicode version and exclude any code point that is undefined as
> of that standard from both operators and identifiers. It's relatively easy
> and backwards compatible to move the Unicode version number forward as the
> language specification evolves.
>
> 2. Ensure that no code point in the Unicode Pattern_Syntax or
> Pattern_WhiteSpace categories is included in identifier-head or
> identifier-character.
>
> 3. Explicitly state that no code point in (XIDS u XIDC) or
> Pattern_WhiteSpace is legal in an operator. Consider ensuring that
> everything in Pattern_Syntax *is* permitted in an operator.
>
> 4. I'd personally like to see an explicit statement of the extensions to
> XIDS/XIDC that are admitted by identifier-head and identifier-character.
> UAX31 refers to such extensions as a "profile", and explicitly allows them.
> I'm not interested in changing the identifier space unless there is
> something grossly and obviously problematic. What I'm after is enabling
> developers to be cognizant of potential interop challenges.
>
> 5. Adopt NFKC for identifiers. Specify and implement a combining algorithm
> version so that forward/backward compatibility is ensured.
>
>
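
The partition described in suggestions 1 through 3 above can be sketched
roughly as follows. This is only an illustration, not the grammar Swift
uses: it assumes Swift 5 or later, where Unicode.Scalar.Properties exposes
the relevant properties (the official property names are XID_Start,
XID_Continue, Pattern_Syntax and Pattern_White_Space), and it uses whatever
Unicode version the runtime ships rather than a pinned one. The enum and
function names are hypothetical.

```swift
// Hypothetical classification of a single scalar; requires Swift 5 or later.
enum LexicalClass {
    case unassigned          // undefined in the Unicode version known to the runtime
    case whitespace          // Pattern_White_Space
    case operatorCandidate   // Pattern_Syntax
    case identifierStart     // XID_Start
    case identifierContinue  // XID_Continue but not XID_Start
    case other
}

func classify(_ scalar: Unicode.Scalar) -> LexicalClass {
    let p = scalar.properties
    // Suggestion 1: reject anything the chosen Unicode version does not define.
    if p.generalCategory == .unassigned { return .unassigned }
    // Suggestions 2 and 3: pattern white space and pattern syntax are checked
    // before the identifier properties, so they never end up in identifiers.
    if p.isPatternWhitespace { return .whitespace }
    if p.isPatternSyntax { return .operatorCandidate }
    if p.isXIDStart { return .identifierStart }
    if p.isXIDContinue { return .identifierContinue }
    return .other
}

print(classify("a"))        // a letter
print(classify("+"))        // an ASCII operator character
print(classify("\u{301}"))  // a combining mark
print(classify("\u{378}"))  // a code point with no assignment at the time of writing
```

Any profile on top of XIDS/XIDC (suggestion 4) would show up here as extra
cases that override the default property checks.
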
> The first three are pretty trivial. The fourth would take some sleuthing,
> but it is straightforward. The fifth is real work. I'd be willing to sign
> up to any or all of these, but for a starting point I want to learn where
> things stand, what decisions have already been made, and where any current
> discussion may be happening.
>
> I very much doubt that NFKC would break existing code, if only because the
> use of malformed Unicode sequences is likely to be rare. To the extent that
> they exist in the field, they are almost certainly (a) unintentional, or
> (b) security concerns. It seems like a good thing to catch both of those
> early to the extent that we can, and to do so while the language definition
> remains somewhat fluid.
>
>
> Thanks!
>
>
> Jonathan Shapiro
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>