<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Jun 23, 2016 at 2:29 PM, João Pinheiro <span dir="ltr">&lt;<a href="mailto:joao@joaopinheiro.org" target="_blank">joao@joaopinheiro.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> I think we're using terminology differently here. What you call "character normalization" is what I'm calling canonicalization. NFC is described in UAX #15 as "canonical decomposition followed by canonical composition" and I'm just using the word "canonicalization" because it's shorter. If Swift represents each identifier in an NFC-transformed form (what I call canonicalized), then I understand the identifier to be canonicalized. What is the distinction you're drawing here?<br>
<br>
</span>There is a small difference between normalisation and canonicalisation, but it's mostly splitting hairs. They both ensure something is represented properly, but canonicalisation implies establishing a single base representation for something. Web addresses are a good example. Both <a href="http://www.apple.com" rel="noreferrer" target="_blank">http://www.apple.com</a> and <a href="http://apple.com" rel="noreferrer" target="_blank">http://apple.com</a> are valid normalised addresses, but only the former is the canonical address for the Apple website.<br>
<span class=""><br>
> Just re-read UAX #31. I see two different issues here too--do these match up with what you're saying above?<br>
><br>
> * Disallowing certain glyphs in identifiers. To do so, we can implement the recommendation to disallow all glyphs in UAX #31 Table 4, except ZWJ and ZWNJ in the specific scenarios outlined in section 2.3.<br>
><br>
> * Internally, when comparing two identifiers A and B, compare NFC(A) and NFC(B) without modifying or otherwise restricting the actual user-facing code to contain only NFC-normalized strings. This would be the approach recommended in section 1.3.<br>
<br>
</span>Yes, that's correct. The proposal would be to normalise the encoding via NFC and then canonicalise the identifiers by ignoring invisible characters except in the scenarios described in UAX #31.</blockquote><div><br></div><div>That's cool, although my preferred solution would align more closely with UAX #31: explicitly disallow the characters in Table 4 (instead of ignoring them), except for ZWJ and ZWNJ in the specific scenarios identified in UAX #31, and then internally represent the identifier as its NFC-normalized string.</div><div><br></div></div></div></div>
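<div>To make the order of operations concrete, here is a rough Python sketch of "reject first, then NFC". It is only an illustration, not Swift compiler code. The names <code>is_excluded</code>, <code>joiner_allowed</code>, <code>normalize_identifier</code> and <code>same_identifier</code> are hypothetical. Two parts are loud assumptions: general category Cf is used as a stand-in for the Table 4 exclusion set, and the joiner check is a placeholder rather than the actual contexts from UAX #31 section 2.3.</div>

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def is_excluded(ch: str) -> bool:
    # Assumption: general category Cf (format characters) stands in for
    # the UAX #31 Table 4 exclusion set. It is not the real table.
    return unicodedata.category(ch) == "Cf"


def joiner_allowed(text: str, i: int) -> bool:
    # Placeholder for the UAX #31 section 2.3 contexts: accept a joiner only
    # when it sits directly between two letters. The real rules differ and
    # depend on script and joining context.
    if i == 0 or i == len(text) - 1:
        return False
    return text[i - 1].isalpha() and text[i + 1].isalpha()


def normalize_identifier(text: str) -> str:
    # Step 1: overtly reject excluded characters instead of ignoring them.
    for i, ch in enumerate(text):
        if ch in (ZWJ, ZWNJ):
            if not joiner_allowed(text, i):
                raise ValueError(f"joiner not allowed at index {i}")
        elif is_excluded(ch):
            raise ValueError(f"disallowed character U+{ord(ch):04X} at index {i}")
    # Step 2: represent the accepted identifier internally in NFC.
    return unicodedata.normalize("NFC", text)


def same_identifier(a: str, b: str) -> bool:
    # Two spellings name the same identifier when their NFC forms are equal.
    return normalize_identifier(a) == normalize_identifier(b)
```

<div>With this shape, a precomposed and a decomposed spelling of the same name compare equal, while an identifier containing a stray zero-width space is rejected up front rather than silently matching its visible twin.</div>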