[swift-evolution] Strings in Swift 4

Ted F.A. van Gaalen tedvgiosdev at gmail.com
Mon Feb 6 11:26:14 CST 2017

Hi Dave,
Oops, yes, you’re right!
I have now read more thoroughly about Unicode
and how Unicode is handled within Swift;
I should have done that before writing, sorry.


How about this solution (assuming I am not overlooking something else this time):
-Store the string as a collection of fixed-width 32-bit UTF-32 values anyway.
-However, if the user-perceived character is a grapheme cluster (2..n Unicode scalars), then
store a pointer to a hidden child string containing the actual grapheme cluster, like so:

1: [UTF32, UTF32, UTF32, pointer, UTF32, UTF32, pointer, UTF32, UTF32]
                            |                      |
2:                [UTF32, UTF32]        [UTF32, UTF32, UTF32, ...]

whereby (1) is the string as seen by the programmer,
and (2) are the hidden child strings, each containing a grapheme cluster.
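A rough Swift sketch of this idea (purely hypothetical, not how Swift’s String is actually implemented; the names PackedElement and PackedString are made up for illustration) could look like:

```swift
// Hypothetical sketch of the proposed storage scheme: each element is
// either a single Unicode scalar or a reference into a table of
// "hidden child" grapheme clusters.
enum PackedElement {
    case scalar(Unicode.Scalar)   // a "plain" single code point
    case clusterIndex(Int)        // index of a hidden child cluster
}

struct PackedString {
    private var elements: [PackedElement] = []
    private var clusters: [[Unicode.Scalar]] = []  // the hidden child strings

    init(_ s: String) {
        for ch in s {  // iterate Characters (extended grapheme clusters)
            let scalars = Array(String(ch).unicodeScalars)
            if scalars.count == 1 {
                elements.append(.scalar(scalars[0]))
            } else {
                elements.append(.clusterIndex(clusters.count))
                clusters.append(scalars)
            }
        }
    }

    // O(1) character count -- the whole point of the scheme.
    var count: Int { return elements.count }

    // O(1) indexing by character position.
    subscript(i: Int) -> Character {
        switch elements[i] {
        case .scalar(let u):
            return Character(u)
        case .clusterIndex(let j):
            var s = ""
            s.unicodeScalars.append(contentsOf: clusters[j])
            return Character(s)
        }
    }
}
```

The trade-off is that building a PackedString still requires a full grapheme-segmentation pass over the input, and the representation roughly quadruples memory for ASCII-heavy text.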

To make the distinction between a “plain” single UTF-32 value and a grapheme cluster,
set the most significant bit of the 32-bit value to 1 and use the remaining 31 bits
as a pointer to another (hidden) String instance containing the grapheme cluster.
In this way one could also nest graphemes within graphemes,
but that is probably not desirable. Another solution is to store the grapheme clusters
in a dedicated “grapheme pool”, containing the unique (as in a Set) grapheme clusters
encountered whenever a Unicode string (in whatever format) is read in or defined at runtime.
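The pool variant could be sketched as interning: each distinct cluster is stored once and elements refer to its pool index (again hypothetical; GraphemePool is an invented name):

```swift
// Hypothetical "grapheme pool": distinct clusters are stored only once,
// and string elements refer to them by pool index.
struct GraphemePool {
    private var indexOf: [String: Int] = [:]  // cluster text -> pool index
    private(set) var clusters: [String] = []

    // Returns the index of the cluster, adding it only if it is new.
    mutating func intern(_ cluster: String) -> Int {
        if let i = indexOf[cluster] { return i }
        let i = clusters.count
        clusters.append(cluster)
        indexOf[cluster] = i
        return i
    }
}
```

Interning the same cluster twice yields the same index, so text with many repeated clusters (e.g. the same emoji throughout a document) pays the storage cost only once.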

But then again, seeing how hard it is to recognise grapheme clusters in the first place...
I don’t know. Unicode is complicated.

Kind regards 

www.tedvg.com <http://www.tedvg.com/>
www.ravelnotes.com <http://www.ravelnotes.com/>

> On 6 Feb 2017, at 05:15, Dave Abrahams <dabrahams at apple.com> wrote:
>> On Feb 5, 2017, at 2:57 PM, Ted F.A. van Gaalen <tedvgiosdev at gmail.com> wrote:
>> However, that is not the case with UTF-32, because with UTF-32 encoding
>> each character has a fixed-width and always occupies exactly 4 bytes, 32 bit. 
>> Ergo: the problem can be easily solved: The simple solution is to always 
>> and without exception use UTF-32 encoding as Swift's internal 
>> string format because it only contains fixed width Unicode characters. 
> Those are not (user-perceived) Characters; they are Unicode Scalar Values (often called "characters" by the Unicode standard). Characters as defined in Swift (a.k.a. extended grapheme clusters) have no fixed-width encoding, and Unicode scalar values are an inappropriate unit for most string processing. Please read the manifesto for details.
> Sent from my iPad
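[Dave’s point is easy to verify in Swift itself: one user-perceived character can span several Unicode scalars, so Characters have no fixed-width encoding:]

```swift
// A flag is one user-perceived character built from two regional
// indicator scalars, so the counts differ at each level of the string.
let flag = "🇳🇱"
print(flag.count)                 // 1 -- Characters (grapheme clusters)
print(flag.unicodeScalars.count)  // 2 -- Unicode scalar values
print(flag.utf8.count)            // 8 -- each scalar takes 4 bytes in UTF-8
```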
