[swift-evolution] [Proposal] Normalize Unicode Identifiers

Xiaodi Wu xiaodi.wu at gmail.com
Thu Sep 22 20:11:17 CDT 2016


On Thu, Sep 22, 2016 at 7:44 PM, Michael Gottesman <mgottesman at apple.com>
wrote:

>
> On Sep 22, 2016, at 5:09 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>
> On Thu, Sep 22, 2016 at 6:54 PM, Michael Gottesman <mgottesman at apple.com>
> wrote:
>
>>
>> On Sep 22, 2016, at 4:19 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>>
>> You mean values of type String?
>>
>>
>> I was speaking solely of constant strings.
>>
>> I would want those to be exactly what I say they are; NFC normalization
>> is available, if I recall, as part of Foundation, but by no means should my
>> String values be silently changed!
>>
>>
>> Why?
>>
>
> For one, I don't want to pay the computational cost of normalization at
> runtime unless necessary.
>
>
> This would only happen with strings that are known to be constant at
> compile time (and as such the transformation would occur at compile time).
> There would be no runtime cost.
>

Yes, if it's limited to constant strings, there would be no runtime cost.


>
> For another, I expect to be able to round-trip user input.
>
>
> String checks for canonical equivalence, IIRC.
>

Sure, but I'm not talking about using comparison operators here. I mean
that if we have `let str = "[some non-NFC string]"`, I should be able to
write that out to a file with all the non-canonical glyphs intact.
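
To make the distinction concrete, here is a minimal sketch in plain Swift (no Foundation needed). The variable names are only illustrative. It shows two canonically equivalent spellings of the same character that compare equal with `==` yet carry different UTF-8 bytes, and the bytes are what would end up in a file:

```swift
// Two spellings of "é": precomposed (already NFC) and decomposed (not NFC).
let precomposed = "\u{00E9}"     // U+00E9
let decomposed = "e\u{0301}"     // U+0065 followed by U+0301

// String's == treats canonically equivalent strings as equal...
let comparesEqual = (precomposed == decomposed)

// ...but the underlying code units differ, and those are what get written out.
let precomposedBytes = Array(precomposed.utf8)   // 0xC3 0xA9
let decomposedBytes = Array(decomposed.utf8)     // 0x65 0xCC 0x81
let bytesDiffer = (precomposedBytes != decomposedBytes)

print(comparesEqual, bytesDiffer)
```

If literals were silently normalized to NFC at compile time, the second byte sequence could no longer be produced from a literal, even though every comparison would still succeed.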

There are known issues with NFC that are acceptable for normalizing Swift
identifiers but make it unsuitable for general use. For example, the
normalized form of Greek ano teleia is middle dot, but these two glyphs are
rendered differently in many fonts, and substituting a middle dot in place
of the Greek punctuation mark is actually quite inadequate for Greek text
(ano teleia is supposed to be around x-height; middle dot is not). Even for
constant strings, it is essential that one can output ano teleia when it is
specified rather than middle dot. However, Unicode normalization algorithms
guarantee stability and will forever require swapping the former for the
latter. I understand that other such problematic characters exist.
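
The ano teleia case can be checked with a short sketch, assuming Foundation is available and using its bridged `precomposedStringWithCanonicalMapping` property as the NFC step (the constant names are mine, purely for illustration):

```swift
import Foundation

let anoTeleia = "\u{0387}"   // GREEK ANO TELEIA

// NFC via Foundation's bridged NSString property.
let normalized = anoTeleia.precomposedStringWithCanonicalMapping

// Compare scalar values rather than using ==, because String's ==
// already treats the two characters as canonically equivalent.
let before = anoTeleia.unicodeScalars.map { $0.value }
let after = normalized.unicodeScalars.map { $0.value }

// `after` is expected to hold U+00B7 (MIDDLE DOT) in place of U+0387.
print(before, after)
```

Once the string has been through NFC there is nothing left in it to say that the author originally typed ano teleia.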

> Normalization is not lossless and cannot be reversed. Finally, if I want to
> use normalization form D (NFD), your proposal would make it impossible,
> because (IIUC) serial NFC + NFD normalization can produce different output
> than NFD normalization alone.
>
>
> Why would you want to do this/care about this? I.e. what is the use case?
>

Use cases for decomposed forms include searching, where you'd find substrings
considered "compatible." For instance, the fi ligature (U+FB01) is considered
compatible with the letters f and i, but they are not canonically equal. If
you've ever successfully searched for a word like "finance" in a PDF document
that's been typeset with ligatures, you've benefited from decomposition
(strictly, the compatibility form, NFKD). Roughly speaking (IIUC), the
difference between searching NFC-normalized strings and NFD-normalized
strings is analogous to the difference between a case-sensitive and a
case-insensitive search. Therefore, given a string x, it's sometimes
important to be able to obtain NFD(x). If every string x is now
automatically NFC(x), then the best one can do is NFD(NFC(x)), which is not
guaranteed equal to NFD(x) even with canonical comparison (i.e.
NFC(NFD(NFC(x))) != NFC(NFD(x)) for all x).
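
One caveat on the ligature example, as a sketch assuming Foundation's bridged NSString normalization properties: as far as I can tell, the fi ligature is only split by the compatibility decomposition (NFKD), while the canonical decomposition (NFD) leaves it alone. The names below are illustrative only:

```swift
import Foundation

let ligature = "\u{FB01}"   // LATIN SMALL LIGATURE FI

// Canonical decomposition (NFD) vs. compatibility decomposition (NFKD).
let nfd = ligature.decomposedStringWithCanonicalMapping
let nfkd = ligature.decomposedStringWithCompatibilityMapping

let nfdScalars = nfd.unicodeScalars.map { $0.value }    // ligature expected to survive
let nfkdScalars = nfkd.unicodeScalars.map { $0.value }  // expected to become "f" + "i"

// A search-style fold: the typeset word matches the plain word only
// after compatibility decomposition.
let typeset = "\u{FB01}nance"
let foldedMatches = (typeset.decomposedStringWithCompatibilityMapping == "finance")
let canonicalMatches = (typeset.decomposedStringWithCanonicalMapping == "finance")

print(nfdScalars, nfkdScalars, foldedMatches, canonicalMatches)
```

So a search that wants "finance" to match ligature-typeset text would fold both sides with the compatibility mapping, which is a deliberate, lossy step rather than something to apply to every string by default.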


> As an aside, I am not formally proposing this. I am just discussing
> potential opportunities for optimization given that we would need (as a part
> of this proposal) to add knowledge of Unicode to the compiler, which would
> allow for compile-time transformations.
>

I'd be interested to know what performance gains you're envisioning with
such an optimization of constant strings at compile time.

>> On Thu, Sep 22, 2016 at 6:10 PM, Michael Gottesman <mgottesman at apple.com>
>> wrote:
>>
>>>
>>> > On Sep 22, 2016, at 10:50 AM, Joe Groff via swift-evolution <
>>> swift-evolution at swift.org> wrote:
>>> >
>>> >
>>> >> On Jul 26, 2016, at 12:26 PM, Xiaodi Wu via swift-evolution <
>>> swift-evolution at swift.org> wrote:
>>> >>
>>> >> +1. Even if it's too late for Swift 3, though, I'd argue that it's
>>> highly unlikely to be code-breaking in practice. Any existing code that
>>> would get tripped up by this normalization is arguably broken already.
>>> >
>>> > I'm inclined to agree. To be paranoid about perfect compatibility, we
>>> could conceivably allow existing code with differently-normalized
>>> identifiers with a warning based on Swift version, but it's probably not
>>> worth it. It'd be interesting to data-mine GitHub or the iOS Swift
>>> Playgrounds app and see if this breaks any Swift 3 code in practice.
>>>
>>> As an additional interesting point here, we could in general normalize
>>> Unicode strings. This could potentially reduce the size of Unicode
>>> characters or allow us to constant-propagate certain Unicode algorithms in
>>> the optimizer.
>>>
>>> >
>>> > -Joe
>>> > _______________________________________________
>>> > swift-evolution mailing list
>>> > swift-evolution at swift.org
>>> > https://lists.swift.org/mailman/listinfo/swift-evolution
>>>
>>>
>>
>>
>
>