[swift-evolution] [Proposal] Normalize Unicode Identifiers

Xiaodi Wu xiaodi.wu at gmail.com
Thu Sep 22 21:16:43 CDT 2016


Agreed. Taking this offlist :)


On Thu, Sep 22, 2016 at 9:01 PM, Michael Gottesman <mgottesman at apple.com>
wrote:

>
>
> On Sep 22, 2016, at 6:11 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>
> On Thu, Sep 22, 2016 at 7:44 PM, Michael Gottesman <mgottesman at apple.com>
> wrote:
>
>>
>> On Sep 22, 2016, at 5:09 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>>
>> On Thu, Sep 22, 2016 at 6:54 PM, Michael Gottesman <mgottesman at apple.com>
>>  wrote:
>>
>>>
>>> On Sep 22, 2016, at 4:19 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:
>>>
>>> You mean values of type String?
>>>
>>>
>>> I was speaking solely of constant strings.
>>>
>>> I would want those to be exactly what I say they are; NFC normalization
>>> is available, if I recall, as part of Foundation, but by no means should my
>>> String values be silently changed!
>>>
>>>
>>> Why.
>>>
>>
>> For one, I don't want to pay the computational cost of normalization at
>> runtime unless necessary.
>>
>>
>> This would only happen with strings that are known to be constant at
>> compile time (and as such the transformation would occur at compile time).
>> There would be no runtime cost.
>>
>
> Yes, for constant strings only there would be no runtime cost.
>
>
>>
>> For another, I expect to be able to round-trip user input.
>>
>>
>> String checks for canonical equivalence, IIRC.
>>
>
> Sure, but I'm not talking about using comparison operators here. I mean
> that if we have `let str = "[some non-NFC string]"`, I should be able to
> write that out to a file with all the non-canonical glyphs intact.
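
To make the round-trip point concrete, here is a minimal sketch in Python
(chosen only because its standard `unicodedata` module exposes the
normalization forms directly; the sample string and variable names are
illustrative assumptions, not part of the proposal). It shows two
canonically equivalent spellings whose UTF-8 bytes differ, so normalizing
a literal changes what would be written to a file.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: a valid but non-NFC spelling.
decomposed = "cafe\u0301"
# NFC composes that pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Canonically equivalent: both spellings have the same NFD form.
same_text = (unicodedata.normalize("NFD", decomposed)
             == unicodedata.normalize("NFD", composed))

# The bytes that would end up in a file are nevertheless different.
decomposed_bytes = decomposed.encode("utf-8")
composed_bytes = composed.encode("utf-8")

print(same_text)                                   # True
print(decomposed_bytes == composed_bytes)          # False
print(len(decomposed_bytes), len(composed_bytes))  # 6 5
```

A comparison that respects canonical equivalence would treat the two as
equal, which is why the distinction only shows up once the raw code points
or bytes are inspected.
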
>
>
> I would argue that for most people that is not an interesting distinction.
> Naturally there would be a way to escape such canonicalization to get the
> non-canonicalized String.
>
>
> There are known issues with NFC that are acceptable for normalizing Swift
> identifiers but make it unsuitable for general use. For example, the
> normalized form of Greek ano teleia is middle dot, but these two glyphs are
> rendered differently in many fonts, and substituting a middle dot in place
> of the Greek punctuation mark is actually quite inadequate for Greek text
> (ano teleia is supposed to be around x-height; middle dot is not). Even for
> constant strings, it is essential that one can output ano teleia when it is
> specified rather than middle dot. However, Unicode normalization algorithms
> guarantee stability and will forever require swapping the former for the
> latter. I understand that other such problematic characters exist.
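
The ano teleia case can be checked directly. The following is a small
sketch using Python's standard `unicodedata` module (an assumption made
for illustration; the thread names no tooling, and the constant names are
made up here). It normalizes U+0387 GREEK ANO TELEIA under each form and
prints what comes back.

```python
import unicodedata

ANO_TELEIA = "\u0387"  # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"  # MIDDLE DOT

# Normalize the single character under all four normalization forms.
normalized = {
    form: unicodedata.normalize(form, ANO_TELEIA)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, result in normalized.items():
    # Each result is a single code point, so ord() is safe here.
    print(form, "U+%04X" % ord(result), unicodedata.name(result))

# No form maps MIDDLE DOT back to the Greek character.
print(unicodedata.normalize("NFC", MIDDLE_DOT) == MIDDLE_DOT)  # True
```

Every form yields U+00B7 MIDDLE DOT, and nothing maps it back, so a
literal that is normalized at compile time could not emit the ano teleia
unless it were escaped from normalization somehow.
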
>
>
> I would argue that that is a problem with the Unicode standard and with
> the fonts. This is not a problem for Swift to solve.
>
>
>> Normalization is not lossless and cannot be reversed. Finally, if I want
>> to use normalization form D (NFD), your proposal would make it
>> impossible, because (IIUC) serial NFC + NFD normalization can produce
>> different output than NFD normalization alone.
>>
>>
>> Why would you want to do this/care about this? I.e. what is the use case?
>>
>
> Use cases for decomposition include searching, where you'd find substrings
> considered "compatible." For instance, the fi ligature is considered
> compatible with the letters f and i, but they are not equal. If you've ever
> successfully searched for a word like "finance" in a PDF document that's
> been typeset with ligatures, you've benefited from this. Strictly, that is
> compatibility decomposition (NFKD); canonical decomposition (NFD) leaves
> the ligature alone. Roughly speaking (IIUC), the difference between
> searching NFC-normalized strings and NFKD-normalized strings is analogous
> to the difference between a case-sensitive and a case-insensitive search.
> Therefore, given a string x, it's sometimes important to be able to obtain
> NFD(x) or NFKD(x). If every string x is now automatically NFC(x), then the
> best one can do is NFD(NFC(x)). That does equal NFD(x), because NFC only
> replaces x with a canonically equivalent string, but x itself (the exact
> code points that were written) can no longer be recovered.
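
The ligature search example can be tried out with a short sketch, again
assuming Python's standard `unicodedata` module as a stand-in for whatever
normalization API is at hand (the variable names are illustrative). In
that module's tables it is the compatibility forms (NFKD/NFKC) that split
U+FB01 into "f" + "i"; canonical NFD leaves the ligature unchanged. The
last lines compare NFD(NFC(x)) with NFD(x) for one sample string only.

```python
import unicodedata

LIGATURE_FI = "\ufb01"        # LATIN SMALL LIGATURE FI
word = LIGATURE_FI + "nance"  # "finance" typeset with the ligature

nfd = unicodedata.normalize("NFD", word)
nfkd = unicodedata.normalize("NFKD", word)

# A plain substring search only succeeds after compatibility decomposition.
found_in_nfd = "finance" in nfd
found_in_nfkd = "finance" in nfkd
print(found_in_nfd)   # False
print(found_in_nfkd)  # True

# Sample string: ano teleia followed by "e" + combining acute accent.
x = "\u0387e\u0301"
direct = unicodedata.normalize("NFD", x)
chained = unicodedata.normalize("NFD", unicodedata.normalize("NFC", x))
print(direct == chained)  # True
print(direct == x)        # False
```

For this sample the chained result matches the direct one, while the
original spelling of x is not what either normalization returns.
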
>
>
> There are issues here related to String design. For instance, one could
> make an argument that such searching is really only interesting for a
> "Text" use case which is different from a String use case. That being said,
> I don't want to argue about this here since we are hijacking this
> thread ;).
>
>
>
>> As an aside, I am not formally proposing this. I am just discussing
>> potential opportunities for optimization given that we would need (as a
>> part of this proposal) to add knowledge of Unicode to the compiler, which
>> would allow for compile-time transformations.
>>
>
> I'd be interested to know what performance gains you're envisioning with
> such an optimization of constant strings at compile time.
>
>
> I would have to measure such wins to say anything concrete.
> Algorithmically, one would be able to avoid normalization during common
> Unicode operations when you know you are using constant strings. Even
> though this may provide a runtime win, the major win from teaching the
> compiler about Unicode would be in terms of applying Unicode operations
> such as encoding/decoding to constant strings.
>
> That being said, this is not the proposal that is being discussed here or
> even being proposed here. [i.e. let's stop hijacking this thread ;)]
>
>
>>> On Thu, Sep 22, 2016 at 6:10 PM, Michael Gottesman <mgottesman at apple.com>
>>> wrote:
>>>
>>>>
>>>> > On Sep 22, 2016, at 10:50 AM, Joe Groff via swift-evolution <
>>>> swift-evolution at swift.org> wrote:
>>>> >
>>>> >
>>>> >> On Jul 26, 2016, at 12:26 PM, Xiaodi Wu via swift-evolution <
>>>> swift-evolution at swift.org> wrote:
>>>> >>
>>>> >> +1. Even if it's too late for Swift 3, though, I'd argue that it's
>>>> highly unlikely to be code-breaking in practice. Any existing code that
>>>> would get tripped up by this normalization is arguably broken already.
>>>> >
>>>> > I'm inclined to agree. To be paranoid about perfect compatibility, we
>>>> could conceivably allow existing code with differently-normalized
>>>> identifiers with a warning based on Swift version, but it's probably not
>>>> worth it. It'd be interesting to data-mine GitHub or the iOS Swift
>>>> Playgrounds app and see if this breaks any Swift 3 code in practice.
>>>>
>>>> As an additional interesting point here, we could in general normalize
>>>> Unicode strings. This could potentially reduce the size of Unicode
>>>> characters or allow us to constant-propagate certain Unicode algorithms in
>>>> the optimizer.
>>>>
>>>> >
>>>> > -Joe
>>>> > _______________________________________________
>>>> > swift-evolution mailing list
>>>> > swift-evolution at swift.org
>>>> > https://lists.swift.org/mailman/listinfo/swift-evolution
>>>>
>>>
>