[swift-evolution] [Proposal] Refining Identifier and Operator Symbology

Jonathan S. Shapiro jonathan.s.shapiro at gmail.com
Sat Oct 22 08:02:26 CDT 2016


On Fri, Oct 21, 2016 at 11:37 PM, Nevin Brackett-Rozinsky <
nevin.brackettrozinsky at gmail.com> wrote:

> Ah, I had not previously understood that. Well then, in light of the fact
> that the Unicode recommendations may be influenced by our decisions, and
> given that Swift is an opinionated language, it follows that we ought to
> make our best effort at separating out what we have been calling “operator
> characters” (and your revised proposal calls “symbol identifier”
> characters).
>
> In particular, since there does not yet exist a categorization of symbols
> which fits our needs, and since our needs may help shape such a
> categorization as it forms, it behooves us to fully undertake the endeavor
> of defining which symbols we would like to see in which roles for Swift.
>

The Unicode standard has four well-established general categories for
symbols:

Sc: Symbol, Currency
Sk: Symbol, Modifier
Sm: Symbol, Math
So: Symbol, Other


For our purposes this isn't a terribly helpful set of assignments. In some
places the choice between the S (symbol) and P (punctuation) categories was
arbitrary, and the conceptual model used wasn't necessarily appropriate for
programming purposes. It might be a good idea to start by reading Chapter 22
(Symbols <http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf>) of the
current Unicode standard.
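
To make the S-versus-P point concrete, here is a minimal sketch that prints
the general category of a few familiar operator characters. It assumes a
Swift 5 or later toolchain, where the standard library exposes
Unicode.Scalar.Properties.generalCategory; the helper name category(of:) is
made up for illustration.

```swift
// Hypothetical helper: general category of the first scalar in a string.
// Assumes the string is non-empty.
func category(of s: String) -> Unicode.GeneralCategory {
    return s.unicodeScalars.first!.properties.generalCategory
}

// '+' and '<' are Sm, '^' is Sk, '$' is Sc,
// but '-' is Pd and '*' and '/' are Po (punctuation, not symbols).
for s in ["+", "<", "^", "$", "-", "*", "/"] {
    print(s, category(of: s))
}
```

Characters that programmers think of as one family of operators end up split
across Sm, Sk, Pd and Po, which is why the general category alone is not a
usable dividing line here.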

It would be a mistake to equate symbol identifiers with operators (which
are verbs). For example: my initial thought was to exclude the
Letterlike Symbols block, but once the notions of operator and symbol
identifier are distinct it is no longer clear to me that this should be
done. The more pertinent question would seem to be "What kind of
identifiers should Letterlike Symbols be?" That seems to be driven more by
whitespace considerations than anything else. If we want to be able to
write something like

3*ℇ  // three times the Euler constant (U+2107)


without white space, then we want ℇ to fall into normal identifiers. What
we'd ideally like to have is "noun identifiers" and "verb identifiers",
because this would correspond to the best white space outcome.
Unfortunately that ship sailed well before FORTRAN was standardized, and
the best we can do now is "get it right enough" and accept the need for
white space when that can't be done adequately. The conceptual difficulty
with defining symbol identifiers as nouns or verbs is that symbols in
mathematical use are assigned to meanings arbitrarily on a case-by-case
basis for the convenience of that individual paper. There really *isn't* a
general consensus in mathematical symbols about what is a noun and what is
a verb. There *certainly* isn't a consensus for symbol identifiers
involving more than one glyph; the style of formal mathematics favors
single-letter variables in various scripts with decorative modifiers. For
this reason, the best outcome we can achieve will be "right enough" rather
than perfect, and white space will still be required in some cases.

I'm coming around to the view that a new Unicode property is actually
warranted, but that will take time. My reasoning in proposing that we start
with the Mathematical Operators block had four parts:

   1. It's enough to make forward progress
   2. It allows most existing code to survive without breakage
   3. All of the code points in that particular block are pretty clearly
   things that belong in symbol identifiers
   4. It buys time for the Unicode group to negotiate the creation of a new
   property.
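
As a rough sketch of how small that starting rule is: the Mathematical
Operators block is U+2200 through U+22FF, so membership is a single range
check. The function name below is hypothetical, and the ASCII operator
characters are assumed to be covered by a separate rule that is not shown.

```swift
// Sketch only: treats the Mathematical Operators block (U+2200...U+22FF)
// as the entire initial set of non-ASCII symbol identifier code points.
// `isInitialSymbolIdentifierScalar` is a made-up name for illustration.
func isInitialSymbolIdentifierScalar(_ scalar: Unicode.Scalar) -> Bool {
    return (0x2200...0x22FF).contains(scalar.value)
}

let forAll: Unicode.Scalar = "\u{2200}"   // FOR ALL, first code point of the block
let euler: Unicode.Scalar = "\u{2107}"    // EULER CONSTANT, in Letterlike Symbols
print(isInitialSymbolIdentifierScalar(forAll))
print(isInitialSymbolIdentifierScalar(euler))
```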

If we really want a good, fine-grained organization of code points, we need
to buy time for a property definition to happen over in Unicode-land, and
we need to avoid creating future backward-compatibility problems while
that happens. For this reason, I think we should be looking to define the
smallest set of symbol identifier code points that we think we can live with
for now given the current state of source code in the field. Everything we
add poses a risk of future backward-compatibility concerns.

What *would* be useful would be to go through each of the blocks mentioned
at the top of Chapter 22 of the standard, and characterize each one as
"mostly normal identifier" or "mostly symbol identifier". We can then go
through and identify the exceptions in each block. We should *avoid* the
Punctuation category and associated blocks at this time; category
assignments in that space were pretty arbitrary, so those will take a fair
bit of work to sort out.
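
One possible shape for recording that survey, as a sketch: each block gets a
default kind plus a set of exceptions. All type and function names here are
hypothetical, and the classifications in the table are placeholders to show
the mechanism, not decisions anyone has made. The block ranges themselves
are the ones Unicode assigns.

```swift
// Hypothetical data model for a block-by-block survey.
enum IdentifierKind { case normal, symbol }

struct BlockRule {
    let name: String
    let range: ClosedRange<UInt32>
    let defaultKind: IdentifierKind   // "mostly normal" or "mostly symbol"
    let exceptions: Set<UInt32>       // code points that take the other kind
}

// Placeholder entries only. U+2140 (DOUBLE-STRUCK N-ARY SUMMATION) is used
// as an illustrative exception inside Letterlike Symbols.
let surveyRules = [
    BlockRule(name: "Letterlike Symbols", range: 0x2100...0x214F,
              defaultKind: .normal, exceptions: [0x2140]),
    BlockRule(name: "Arrows", range: 0x2190...0x21FF,
              defaultKind: .symbol, exceptions: []),
    BlockRule(name: "Mathematical Operators", range: 0x2200...0x22FF,
              defaultKind: .symbol, exceptions: []),
]

// Returns nil for code points outside every surveyed block.
func surveyedKind(of scalar: Unicode.Scalar) -> IdentifierKind? {
    guard let rule = surveyRules.first(where: { $0.range.contains(scalar.value) }) else {
        return nil
    }
    if rule.exceptions.contains(scalar.value) {
        return rule.defaultKind == .normal ? .symbol : .normal
    }
    return rule.defaultKind
}
```

Keeping the exceptions as explicit per-block lists makes each judgment call
visible and easy to revisit once a real Unicode property exists.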

Ideally, I'd like to see all of Sc (currency symbols) end up in "symbol
identifiers" for consistency. The hold-out at the moment is '$'. As it turns
out, we could probably get away with adding decimal digits to
symbol-identifier-continue and then admit '$' in symbol identifiers rather
than conventional identifiers without breaking existing code, but I'm not
sure this clean-up is worth the consternation and worry that it will cause.
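
To illustrate why names like $0 would survive that change, here is a
simplified scanner sketch. It is not the proposal's grammar: the head class
is narrowed to category Sc only, "decimal digits" is narrowed to ASCII 0-9,
and all function names are hypothetical. It assumes Swift 5 or later for the
generalCategory property.

```swift
// Hypothetical character classes, for illustration only.
func isSymbolHead(_ s: Unicode.Scalar) -> Bool {
    // Simplification: only currency symbols (Sc), which includes '$'.
    return s.properties.generalCategory == .currencySymbol
}

func isSymbolContinue(_ s: Unicode.Scalar) -> Bool {
    // The head class plus ASCII decimal digits.
    return isSymbolHead(s) || (0x30...0x39).contains(s.value)
}

// Returns the longest symbol identifier at the start of `text`, or nil
// if `text` does not begin with a symbol-head character.
func leadingSymbolIdentifier(_ text: String) -> String? {
    var result = ""
    for s in text.unicodeScalars {
        let accepted = result.isEmpty ? isSymbolHead(s) : isSymbolContinue(s)
        guard accepted else { break }
        result.unicodeScalars.append(s)
    }
    return result.isEmpty ? nil : result
}

print(leadingSymbolIdentifier("$0.count") ?? "none")
```

Under these assumptions the closure shorthand $0 still scans as one token,
just as a symbol identifier rather than a conventional one.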


> It is worth noting that your proposed “symbol identifier” category, by its
> very name, suggests it should have broader membership than just operators.
> I am not sure if that was intentional.
>

That is very much intentional. Our attention should *not* be restricted
solely to operators. It's a general identifier space.


Jonathan