[swift-evolution] InternalString class for easy String manipulation

Kenny Leung kenny_leung at pobox.com
Wed Aug 17 14:15:12 CDT 2016


It seems to me that UTF-8 is the best choice to encode strings in English and English-like character sets for storage, but it’s not clear that it is the most useful or performant internal representation for working with strings. In my opinion, conflating the preferred storage format and the best internal representation is not the proper thing to do. Picking the right internal storage format should be evaluated based on its own criteria. Even as an experienced programmer, I assert that the most useful indexing system is glyph based.
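
To make the distinction between storage format and glyph-based indexing concrete, here is a minimal sketch in current Swift, using the flag example from Félix's message quoted below. The sample string `name` and the helper `glyph(at:in:)` are hypothetical names made up for this illustration and are not part of the standard library; the sketch assumes a Swift version in which `String.count` counts Characters (extended grapheme clusters).

```swift
// Illustrative sketch only: three views of the same string.
let flag = "🇺🇸"  // one visible glyph

let glyphCount = flag.count                  // Characters (grapheme clusters)
let scalarCount = flag.unicodeScalars.count  // Unicode scalars
let utf8Count = flag.utf8.count              // UTF-8 code units (bytes)

print(glyphCount, scalarCount, utf8Count)    // prints "1 2 8"

// Hypothetical helper: glyph-based access by integer offset.
// It walks the string from the start, so each call is linear-time,
// which is the cost that an Int subscript would hide.
func glyph(at offset: Int, in s: String) -> Character? {
    guard offset >= 0,
          let i = s.index(s.startIndex, offsetBy: offset, limitedBy: s.endIndex),
          i < s.endIndex
    else { return nil }
    return s[i]
}

let name = "Félix 🇺🇸!"  // sample string, assumed for illustration
if let g = glyph(at: 6, in: name) {
    print(g)  // the flag, returned as a single Character
}
```

Whatever the storage encoding is, the glyph-based view has to be computed on top of it, so a helper like this stays linear-time unless glyph boundaries are stored or cached separately.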

In Félix’s case, I would expect to have to ask for a mail-friendly representation of his name, just like you have to ask for a filesystem-friendly representation of a filename regardless of what the internal representation is. Just because you are using UTF-8 as the internal format, it does not mean that universal support is guaranteed.

In response to this statement: “Optimizing developer experience for beginning developers is just going to lead to software that screws…”, the current system not only trips up beginning developers, it is also different from pretty much every other programming language in my experience.

-Kenny


> On Aug 17, 2016, at 11:48 AM, Zach Waldowski via swift-evolution <swift-evolution at swift.org> wrote:
> 
> It's 2016, "the thing people would most commonly expect" is
> impossible-to-screw-up Unicode support that's performant. Optimizing
> developer experience for beginning developers is just going to lead to
> software that screws up in situations the developer doesn't anticipate,
> as Félix notes above.
> 
> Zachary
> 
> On Wed, Aug 17, 2016, at 09:40 AM, Kenny Leung via swift-evolution
> wrote:
>> I understand that the most friendly approach may not be the most
>> efficient, but that’s not what I’m pushing for. I’m pushing for "does the
>> thing people would most commonly expect". Take a first-time programmer
>> who reads any (human) language, and that is what they would expect.
>> 
>> Why couldn’t String’s internal storage format be glyph-based? If I were,
>> say, writing a text editor, it would certainly be the easiest and most
>> efficient format to work in.
>> 
>> -Kenny
>> 
>> 
>>> On Aug 15, 2016, at 9:20 PM, Félix Cloutier <felixcca at yahoo.ca> wrote:
>>> 
>>> The major problem with this approach is that visual glyphs themselves have one level of variable-length encoding, and they sit on top of another variable-length encoding used to represent the Unicode characters (Swift-native Strings are currently encoded as UTF-8). For instance, the visual glyph 🇺🇸 is the result of putting side by side the Unicode characters 🇺 and 🇸 ("REGIONAL INDICATOR SYMBOL LETTER U" and "REGIONAL INDICATOR SYMBOL LETTER S"), which are themselves encoded as UTF-8 using 4 bytes each. A design in which you can "just write" string[4544] hides the fact that indexing is a linear-time operation that needs to recompose UTF-8 characters and then recompose visual glyphs on top of that.
>>> 
>>> Generally speaking, I *think* I agree that human-geared "long strings", on which you probably won't need random access, and machine-geared smaller strings that encode a command could benefit from not being considered the same fundamental thing. However, I'm also afraid that this will end with more applications and websites that think that first names only contain 7-bit-clean characters in the A-Z range. (I live in the US and I can attest that this is still very common.)
>>> 
>>> You could make a point too that better facilities to parse strings would probably address this issue.
>>> 
>>> Félix
>>> 
>>>> Le 15 août 2016 à 10:52:02, Kenny Leung via swift-evolution <swift-evolution at swift.org> a écrit :
>>>> 
>>>> I agree with both points of view. I think we need to bring back subscripting on strings which does the thing people would most commonly expect.
>>>> 
>>>> I would say that the subscript indexes should correspond to visual glyphs. This seems reasonable to me for most character sets like Roman, Cyrillic, Chinese. There is some doubt in my mind for things like subscripted Japanese or connected (ligatured?) languages like Arabic, Hindi or Thai.
>>>> 
>>>> -Kenny
>>>> 
>>>> 
>>>>> On Aug 15, 2016, at 10:42 AM, Xiaodi Wu via swift-evolution <swift-evolution at swift.org> wrote:
>>>>> 
>>>>> On Sun, Aug 14, 2016 at 5:41 PM, Michael Savich via swift-evolution <swift-evolution at swift.org> wrote:
>>>>> Back in Swift 1.0, subscripting a String was easy; you could just use subscripting in a very Python-like way. But now, things are a bit more complicated. I recognize why we need syntax like str.startIndex.advancedBy(x), but it has its downsides. Namely, it makes things hard on beginners. If one of Swift's goals is to make it a great first language, this syntax fights that. Imagine having to explain Unicode and character size to an 8-year-old. This is doubly problematic because String manipulation is one of the first things new coders might want to do.
>>>>> 
>>>>> What about having an InternalString subclass that only supports one encoding, allowing it to be subscripted with Ints? The idea is that an InternalString is for Strings that are more or less hard coded into the app. Dictionary keys, enum raw values, that kind of stuff. This also has the added benefit of forcing the programmer to think about what the String is being used for. Is it user facing? Or is it just for internal use? And of course, it makes code dealing with String manipulation much more concise and readable.
>>>>> 
>>>>> It follows that something like this would need to be entered as a literal to make it as easy as using String. One way would be to make all String literals InternalStrings, but that sounds far too drastic. Maybe appending an exclamation point like "this"! Or even just wrapping the whole thing in exclamation marks like !"this"! Of course, we could go old school and write it like @"this"... That last one is a joke.
>>>>> 
>>>>> I'll be the first to admit I'm way in over my head here, so I'm very open to suggestions and criticism. Thanks!
>>>>> 
>>>>> I can sympathize, but this is tricky.
>>>>> 
>>>>> Fundamentally, if it's going to be a learning and teaching issue, then this "easy" string should be the default. That is to say, if I write `var a = "Hello, world!"`, then `a` should be inferred to be of type InternalString or EasyString, whatever you want to call it.
>>>>> 
>>>>> But, we also want Swift to support Unicode by default, and we want that support to do things The Right Way(TM) by default. In other words, a user should not have to reach for a special type in order to handle arbitrary strings correctly, and I should be able to reassign `a = "你好"` and have things work as expected. So, we also can't have the "easy" string type be the default...
>>>>> 
>>>>> I can't think of a way to square that circle.
>>>>> 
>>>>> 
>>>>> Sent from my iPad
>>>>> 
>>>>> _______________________________________________
>>>>> swift-evolution mailing list
>>>>> swift-evolution at swift.org
>>>>> https://lists.swift.org/mailman/listinfo/swift-evolution


