[swift-evolution] Strings in Swift 4

Ted F.A. van Gaalen tedvgiosdev at gmail.com
Sun Feb 5 16:57:12 CST 2017


We know that:
The cumbersome complexity of current Swift String handling
and programming is caused by the fact that Unicode characters
are stored and processed as streams/arrays of variable-width
elements (1 to 4 bytes per character).

Because of that, direct subscripting of string elements, e.g. str[2..<18],
is not possible. Therefore it was, and still is, not implemented in Swift,
much to the unpleasant surprise of many new Swift programmers
coming from other PLs, like me. They miss plain direct subscripting
so much that the first thing they do before using Swift intensively is
to implement the following or similar dreadful code (at least for direct
subscripting) and bury it deep in a String extension, once written,
hopefully never to be seen again, like in this example:

extension String
{
    /// The character at integer position i, or "" if i is out of bounds.
    subscript(i: Int) -> String
    {
        guard i >= 0 && i < characters.count else { return "" }
        return String(self[index(startIndex, offsetBy: i)])
    }

    /// The substring for a half-open integer range, clamped to the string's bounds.
    subscript(range: Range<Int>) -> String
    {
        let lower = max(0, range.lowerBound)
        let upper = max(lower, range.upperBound)
        let lowerIndex = index(startIndex, offsetBy: lower, limitedBy: endIndex) ?? endIndex
        let upperIndex = index(lowerIndex, offsetBy: upper - lower, limitedBy: endIndex) ?? endIndex
        return substring(with: lowerIndex..<upperIndex)
    }

    /// The substring for a closed integer range, clamped to the string's bounds.
    subscript(range: ClosedRange<Int>) -> String
    {
        let lower = max(0, range.lowerBound)
        let upper = max(lower, range.upperBound + 1)
        let lowerIndex = index(startIndex, offsetBy: lower, limitedBy: endIndex) ?? endIndex
        let upperIndex = index(lowerIndex, offsetBy: upper - lower, limitedBy: endIndex) ?? endIndex
        return substring(with: lowerIndex..<upperIndex)
    }
}
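
For what it's worth, with that extension in place the plain subscripting
newcomers expect does work (a small illustration; the results follow from
the extension above):

let s = "Hello, Swift String!"
print(s[7])        // "S"
print(s[7..<12])   // "Swift"
print(s[7...11])   // "Swift"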
    
[splendid jolly good Earl Grey tea is now being served to help those flabbergasted to recover as quickly as possible.] 

This rather indirect and clumsy way of working with string data is because
(with the exception of UTF-32) Unicode characters come in a
variable-width encoding (1 to 4 bytes per character), which, as we know,
makes string handling for UTF-8 and UTF-16 very complex and inefficient.
For example, to isolate a substring it is necessary to traverse the
string sequentially instead of accessing it directly.
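
To make that concrete, here is what isolating a substring looks like with
the current API (just a sketch; the sequential O(n) walk is hidden inside
index(_:offsetBy:)):

let str = "The quick brown fox jumps over the lazy dog"

// There is no direct jump to character position 4: the index has to be
// computed by walking the variable-width characters from startIndex.
let lower = str.index(str.startIndex, offsetBy: 4)   // sequential walk
let upper = str.index(lower, offsetBy: 5)            // another walk
let word  = str.substring(with: lower..<upper)       // "quick"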

However, that is not the case with UTF-32, because with UTF-32 encoding
each character has a fixed width and always occupies exactly 4 bytes (32 bits).
Ergo, the problem can be easily solved: the simple solution is to always,
and without exception, use UTF-32 encoding as Swift's internal
string format, because it contains only fixed-width Unicode characters.
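
A rough sketch of what such a fixed-width string could look like, assuming
a backing array of 32-bit Unicode scalars; the type name UTF32String and
its members are hypothetical and purely illustrative:

// Hypothetical sketch: a string stored as fixed-width 32-bit Unicode
// scalars (UTF-32 code units), so integer subscripting is a plain
// array access instead of a sequential walk.
struct UTF32String {
    private var scalars: [UnicodeScalar]          // 4 bytes per element

    init(_ s: String) {
        scalars = Array(s.unicodeScalars)         // convert on the way in
    }

    private init(scalars: [UnicodeScalar]) {
        self.scalars = scalars
    }

    // O(1): no traversal needed.
    subscript(i: Int) -> UnicodeScalar {
        return scalars[i]
    }

    subscript(range: Range<Int>) -> UTF32String {
        return UTF32String(scalars: Array(scalars[range]))
    }

    // Convert back to an ordinary String, e.g. for display or output.
    var stringValue: String {
        var result = ""
        result.unicodeScalars.append(contentsOf: scalars)
        return result
    }
}

let s = UTF32String("The quick brown fox")
let word = s[4..<9].stringValue   // "quick", with no walk through the string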

Unicode strings in whatever UTF encoding, as read into the program, would
be automatically converted to the 32-bit UTF-32 format. Note that explicit
conversion, e.g. back to UTF-8, could be specified or defaulted when writing
Strings to a storage medium, URL, etc.
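
With today's standard views, the conversions at the boundaries could look
roughly like this (a sketch of the idea, not the proposed implementation):

// Incoming text, in whatever UTF encoding, ends up as a String; its
// unicodeScalars view is effectively the UTF-32 form of the contents.
let text = "Grüße, Swift"
let utf32Form: [UnicodeScalar] = Array(text.unicodeScalars)   // 4 bytes each

// Only at the edge -- e.g. just before writing to a file or URL -- is the
// fixed-width form encoded back to a variable-width encoding such as UTF-8.
var outgoing = ""
outgoing.unicodeScalars.append(contentsOf: utf32Form)
let utf8Bytes = Array(outgoing.utf8)   // variable-width bytes for storage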

Possible, but imho not recommended: the current String system could be pushed
down and kept alive (e.g. as a type StringUTF8?) as a secondary alternative to
accommodate those who need to process very large quantities of text in core.


What do y'all think?
Kind regards
TedvG
www.tedvg.com <http://www.tedvg.com/>
www.ravelnotes.com <http://www.ravelnotes.com/>