[swift-evolution] Strings in Swift 4

Dave Abrahams dabrahams at apple.com
Tue Jan 24 13:22:59 CST 2017

on Mon Jan 23 2017, Brent Royal-Gordon <swift-evolution at swift.org> wrote:

>> On Jan 21, 2017, at 3:49 AM, Brent Royal-Gordon <brent at architechies.com> wrote:
> I'm going to trim out the bits where my answer is an uninteresting "Good" or "Okay, we'll leave that
> for later" or what-have-you.
>>> The operands and sense of the comparison are kind of lost in all this garbage. You really want to
> see `foo < bar` in this code somewhere, but you don't.
>> Yeah, we thought about trying to build a DSL for that, but failed.  I think the best possible
> option would be something like:
>>   foo.comparison(case: .insensitive, locale: .current) < bar
>> The biggest problem is that you can build things like
>>     fu = foo.comparison(case: .insensitive, locale: .current)
>>     br = bar.comparison(case: .sensitive)
>>     fu < br // what does this mean?
>> We could even prevent such nonsense from compiling, but the cost in library API surface area is
>> quite large.
> Is it? I think we're talking, for each category of operation that can be localized like this:
> * One type to carry an operand and its options.
> * One method to construct this type.
> * One alternate version of each operator which accepts an
> operand+options parameter. (I'm thinking it should always be the
> right-hand side, so the long stuff ends up at the end; Larry Wall
> noted this follows an "end-weight principle" in natural languages.)
> I suspect that most solutions will at least require some sort of overload on the comparison
> operators, so this may be as parsimonious as we can get.

Tell you what: why don't you prototype it and see what you can come up
with?  Then we can think about the use-cases and see whether your
proposed API carries its weight.
>>> I'm struggling a little with the naming and syntax, but as a general approach, I think we want
>>> people to use something more like this:
>>>    if StringOptions(case: .insensitive, locale: .current).compare(foo < bar) { … }
>> Yeah, we can't do that without making 
>> 	let a = foo < bar
>> ambiguous
> Yeah, that's true. Perhaps we could introduce an attribute which can
> be used to say "disfavor this overload compared to other
> possibilities", but that seems disturbingly ad-hoc.

I think we want a feature like that some day, for other purposes anyway.

> I know you want to defer this for now, so feel free to set this part
> of the email aside, 

I think I will :-)

> but here's a quick list of solutions I've ballparked:
> 1. Your "one operand carries the options" solution.
> 2. As I mentioned, do something that effectively overloads comparison operators to return them in a
> symbolic form. You're right about the ambiguity problem, though.
> 3. Like #2, but with slightly modified operators, e.g.:
> 	if localized(fu &< br, case: .insensitive) { … }
> 4. Reintroduce something like the old `BooleanType` and have *all* comparisons construct a symbolic
> form that can be coerced to boolean. This is crazy, but actually probably useful in other places; I
> once experimented with constructing NSPredicates like this.
> 	protocol BooleanProtocol { var boolValue: Bool { get } }
> 	struct Comparison<Operand: Comparable> {
> 		var negated: Bool
> 		var sortOrder: SortOrder
> 		var left: Operand
> 		var right: Operand
> 		func evaluate(_ actualSortOrder: SortOrder) -> Bool {
> 			// There's a circularity problem here, because `==` would itself
> 			// return a `Comparison`, but I think you get the idea.
> 			return (actualSortOrder == sortOrder) != negated
> 		}
> 	}
> 	extension Comparison: BooleanProtocol {
> 		var boolValue: Bool {
> 			return evaluate(left.compared(to: right))
> 		}
> 	}
> 	func < <ComparableType: Comparable>(lhs: ComparableType, rhs: ComparableType) -> Comparison<ComparableType> {
> 		return Comparison(negated: false, sortOrder: .before, left: lhs, right: rhs)
> 	}
> 	func <= <ComparableType: Comparable>(lhs: ComparableType, rhs: ComparableType) -> Comparison<ComparableType> {
> 		return Comparison(negated: true, sortOrder: .after, left: lhs, right: rhs)
> 	}
> 	// etc.
> 	// Now for our special String comparison thing:
> 	func localized(_ expr: Comparison<String>, case: StringCaseSensitivity? = nil, …) -> Bool {
> 		return expr.evaluate(expr.left.compare(expr.right, case: case, …))
> 	}
> 5. Actually add some all-new piece of syntax that allows you to add options to an operator. Bad part
> is that this is ugly and kind of weird; good part is that this could probably be used in other
> places as well. Strawman example:
> 	// Use:
> 	if fu < br %(case: .insensitive, locale: .current) { … }
> 	// Definition:
> 	func < (lhs: String, rhs: String, case: StringCaseSensitivity? = nil, …) -> Bool { … }
> 6. Punt on this until we have macros. Once we do, have the function be a macro which alters the
> comparisons passed to it. Bad part is that this doesn't give us a solution for at least a version or
> two.
>>> However, is there a reason we're talking about using a separate
>>> `Substring` type at all, instead of using `Slice<String>`?
>> Yes: we couldn't specialize its representation to store short
>> substrings inline, at least not without introducing an undesirable
>> level of complexity.
> How important is that, though? If you're using a `Substring`, you
> expect to keep the top-level `String` around and probably continue
> sharing storage with it, so you're probably extending its lifetime
> anyway. Or are you thinking of this as a speed optimization, rather
> than a memory optimization?

It's both.  It's true that it will rarely save space, but sometimes it
will.  More importantly perhaps, it eliminates ARC traffic.

> And is it worth not being able to have a `base` property on
> `Substring` like we've added to `Slice`? I've occasionally thought it
> might be useful to allow a slice's start and end indices to be
> adjusted, essentially allowing you to "slide" the bounds of the slice
> over the underlying collection; that wouldn't be possible with a
> `Substring` design which sometimes inlined data.

I can't really picture what you have in mind, but the way I imagine
doing it isn't incompatible with the small string optimization.
>> ArraySlice is doomed :-)
> Good news!
>>> I've seen people struggle with the `Array`/`ArraySlice` issue when
>>> writing recursive algorithms, so personally, I'd like to see a more
>>> general solution that handles all `Collection`s.
>> The more general solution is "extend Unicode" or "extend Collection"
>> (and when a String parameter is needed, "make your method generic
>> over Collection/Unicode").
> I know, but I know a lot of people really don't like doing that. 

We need to fix that.  Hopefully new generics features in Swift 4 will
make it a much more pleasant experience.

> My usual practice is to use generics at almost any opportunity—when an
> algorithm can work with any of a category of types, I'd rather take a
> type parameter than hard-code the arbitrary type I happen to need
> right now—but most people don't think that way. They'd prefer to
> write:
> 	func doThing(to slice: inout ArraySlice<Int>) { … }
> 	func doThing(to array: inout Array<Int>) { doThing(to: array[0 ..< array.count]) }
> (Yes, `array.startIndex ..< array.endIndex` would be slightly more proper, but we're not talking
> about *my* style here.)

They're equally proper as long as you're dealing with concrete types.

> Rather than:
> 	func doThing<C: RandomAccessCollection>(to collection: inout C)
> 		where C: RangeReplaceableCollection
> 	{ … }
> I haven't dug into this mindset that much; I suspect it comes from a
> combination of believing that generics are difficult and scary

...which they are, a bit, right now, due to our inability to state some of
the constraints we want on protocols, and due to inadequate error messages.

> not knowing the Collection protocols well enough to know which ones to
> use, and simply not wanting to introduce additional complexity when
> they don't need it.
> In any case, though, I do understand why you would feel a` T` ->
> `T.SubSequence` implicit coercion wouldn't carry its own weight, 

It's less that it wouldn't carry weight than that we can only have the
implicit conversion in one direction.  

If we had T -> T.SubSequence, the guidance for developers would be “only
store top-level Collections long-term, but otherwise, use SubSequences.”
This basically forces everyone to be aware of the distinction between
String and Substring.  That would be OK with me personally, but others
disagree with me.  What do y'all think?
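(For concreteness, a quick sketch of what following that guidance looks like under the Substring design being discussed; the CSV-ish names are made up for illustration:)

```swift
let line = "name,42,comment"

// Slicing is cheap and shares storage with `line`:
let fields = line.split(separator: ",")        // [Substring]

// Long-term storage should hold a top-level String,
// so the copy happens explicitly at the storage boundary:
struct Record {
    let name: String
}
let record = Record(name: String(fields[0]))
```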

> and `collection[]` *would* be a definite improvement on the status quo
> for these developers.
>> That's the problem, right there, combined with the fact that we
>> don't have a terse syntax like s[] for going the other way.  I think
>> it would be a much more elegant design, personally, but I don't see
>> the tradeoffs working out.  If we can come up with a way to do it
>> that works, we should.  So far, Ben and I have failed.
> I guess what I'm saying is "keep trying; it's more valuable than you
> might have anticipated". :^)

If you want to have one String type that you can “use everywhere without
thinking about it,” it has to be this way.  As long as that is seen to
be important, I think we don't really have a choice.

>>>> A user who needs to optimize away copies altogether should use this guideline:
>>>> if for performance reasons you are tempted to add a `Range` argument to your
>>>> method as well as a `String` to avoid unnecessary copies, you should instead
>>>> use `Substring`.
>>> I do like this as a guideline, though. There's definitely room in
>>> the standard library for "a string and a range of that string to
>>> operate upon".
>> I don't know what you mean.  It's our intention that nothing but the
>> lowest level operations (e.g. replaceRange) would work on ranges
>> when they could instead be working on slices.
> No, all I'm saying is that there's definitely a lot of value in
> `Substring` or `Slice<String>`. Talking about a slice of a string is
> something quite valuable that we don't currently support very well.


>>>> **Note:** `Unicode` would make a fantastic namespace for much of
>>>> what's in this proposal if we could get the ability to nest types and
>>>> protocols in protocols.
>>> I mean, sure, but then you imagine it being used generically:
>>>    func parse<UnicodeType: Unicode>(_ source: UnicodeType) -> UnicodeType
>>>    // which concrete types can `source` be???
>> All "string" types, including String, Substring, UTF8String, StaticString, etc.
> I know that; my point is that it doesn't *read* well here.
> Imagine that you are a workaday Swift programmer. You know the syntax
> and the basic concrete types, but you have not read the standard
> library top-to-bottom, and don't have detailed knowledge of the
> protocols that it's built on. You read a source file with these three
> declarations:
> 	func factor<Integer: BinaryInteger>(_ number: Integer) -> [Integer]
> 	func decode<Encoding: UnicodeEncoding> (_ data: Data, as encoding: Encoding.Type) -> String
> 	func parse<UnicodeType: Unicode>(_ source: UnicodeType) -> UnicodeType

Well, you chose a suboptimal spelling for the signature.  I don't see a
problem with this:

  func parse<Source: Unicode>(_ source: Source) -> Source 

> I think you would be able to understand what `factor(_:)` and
> `decode(_:as:)` do, even if you had never seen the `BinaryInteger` and
> `UnicodeEncoding` protocols, because their names clearly and simply
> say what sort of type would conform to the protocol. You would guess
> that familiar types like `Int` could be used with `factor(_:)`, and
> you might not know what the concrete `UnicodeEncoding` types were
> called, but you'd guess they probably had names with terms of art like
> `UTF8` in them somewhere.
> But what about `parse(_:)`? Sure, `Unicode` suggests it has something
> to do with string handling, but it doesn't suggest *a string*. 

Then call the type parameter "Text," "String," or "Str."

> As I said, I would assume it has something to do with the Unicode
> standard—maybe a type that does Unicode table lookups, for instance. I
> get that you're using it as an adjective, but it's such a specific
> technical term that using it to describe any chunk of text data is
> misleading, even if that text *is* required to be Unicode text.
> Perhaps you could call it `StringProtocol`, or `Textual`, or
> `UnicodeString`. But I really think just `Unicode` does not do a good
> job of conveying the meaning of the type.

OK, noted.  I'm not attached to "Unicode," but I think it works well.  I
think I'd go with StringProtocol if I had to change it.

>>>> We think random-access
>>>> *code-unit storage* is a reasonable requirement to impose on all `String`
>>>> instances.
>>> Wait, you do? Doesn't that mean either using UTF-32, inventing a
>>> UTF-24 to use, or using some kind of complicated side table that
>>> adjusts for all the multi-unit characters in a UTF-16 or UTF-8
>>> string? None of these sound ideal.
>> No; I'm not sure why you would think that.
> Oh, sorry. I read that as "random-access code-point
> [i.e. UnicodeScalar] storage", which I don't think would be a
> reasonable requirement. My mistake.
>>>> Then integers can easily be translated into offsets into a `String`'s `utf16`
>>>> view for consumption by Cocoa:
>>>> ```swift
>>>> let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
>>>> let swiftIndex = s.utf16.index(offset: cocoaIndex)
>>>> ```
>>> I worry that this conversion will be too obscure.
>> I very much hope it will be rare enough that it'll be OK, but if it isn't, we can always have
>> 	let cocoaIndex = s.utf16Offset(of: i)
>> and/or take other measures to simplify it.
> I think that would still be too obscure.
> To give you an idea of what you're contending with here, take a look at a few Stack Overflow
> questions:
> http://stackoverflow.com/questions/28128554/convert-string-index-to-int-or-rangestring-index-to-nsrange
> http://stackoverflow.com/questions/27156916/convert-rangeint-to-rangestring-index
> http://stackoverflow.com/questions/34540185/how-to-convert-index-to-type-int-in-swift
> Objective-C programmers *do not know* that `NSInteger` and `NSRange`
> indices are UTF-16 indices. They don't think about what the
> "character" in `-characterAtIndex:` really means; they just take it at
> face value. That means putting "UTF-16" in the name will not help them
> identify the API as the correct one to use. It'd be like advertising a
> clinic to people with colds by saying you do "otolaryngology"—you're
> just not speaking the language of your audience.
> I see two ways to make it really, really obvious which API is the
> right one to use. The first is to explicitly refer to something like
> "objc", "cocoa", "foundation", or "ns" in the name. The second is to
> use full-width conversions, which people understand are the default
> way to convert between two things. (Actually, a lot of developers
> literally call these "casts" and assume they're extremely low cost.)
> I think that, if there's a `String.Index.init(_: Int)` and an
> `Int.init(_: String.Index)`, people will almost certainly identify
> these as the right way to convert between Foundation's `Int` indices
> and `String.Index`es. They certainly don't seem to be figuring it out
> now.

You're bringing me around to agreeing with you on this.  There are still
pitfalls, though.  The following will trap for some Strings:

  assert(Int(s.startIndex) + 1 == Int(s.index(after: s.startIndex)))
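(Concretely, the pitfall shows up with any Character that needs more than one UTF-16 code unit. A sketch using today's views, with no proposed Int initializers required:)

```swift
let s = "👍!"
let second = s.index(after: s.startIndex)

// One step at the Character level…
assert(s.distance(from: s.startIndex, to: second) == 1)

// …spans two UTF-16 code units, because 👍 is a surrogate pair, so the
// proposed Int conversions would disagree by one here.
assert(String(s[s.startIndex..<second]).utf16.count == 2)
```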

>>> 1. UTF-16 is the storage format, at least for an "ordinary" `Swift.String`.
>> It will be, in the common case, but many people seem to want plain String to be able to store
>> UTF-8, and I'm not yet prepared to rule that out.
> I suppose it doesn't matter what the actual storage format is as long
> as we get #2 (`UTF16View` indexed by `String.Index`).

If we allow String to store UTF-8, then that won't hold.  There *would*
be a full-width conversion from String.Index to String.UTF16View.Index,
but the UTF16View will need to store additional information to track the
UTF-16 code units corresponding to the underlying UTF-8.

> If we go with the facade design, I suppose it would simply be that the
> default string storage also uses its `UTF16Index` for its
> `CodeUnitIndex`. Other string storages could arrange their indices in
> other ways.


>>> 2. `String.Index` is used down to the `UTF16View`. It stores a UTF-16 offset.
>>> 3. With just the standard library imported, `String.Index` does not
>>> have any obvious way to convert to or from an `Int` offset; you use
>>> `index(_:offsetBy:)` on one of the views. `utf16`'s implementation
>>> is just faster than the others.
>> This is roughly where we are today.
> Yes, except for index interchangeability between `CharacterView`,
> `UnicodeScalarView`, and `UTF16View`. But the suggestion that we
> provide `init(_:)`s is the key part of this request.
>>>> #### String Interpolation
>> Let's go to a separate thread for this, as you suggested.
> Will do.
>>> So you might end up having to wrap it in an `init(cString:)` anyway, just for convenience. Oh
> well, it was worth exploring.
>> I think you ended up where we did.
> Unsurprising, I suppose. :^)
> Going out of order briefly:
>>> 2. I don't really understand how you envision using the "data
>>> specific to the underlying encoding" sections. Presumably you'll
>>> want to convert that data into a string eventually, right?
>> It already is in a string.  The point is that we have a way to scan
>> the string looking for ASCII patterns without transcoding it.
> So, if I understand this properly, you're imagining that
> `extendedASCII` has indices interchangeable with `codeUnits`, but
> doesn't do any sort of complicated Unicode decoding, so you can rip
> through the string with `extendedASCII` and then use the indices to
> extract actual, fully decoded Unicode data from substrings of
> `codeUnits`?


>>> 1. It doesn't sound like you anticipate there being any way to
>>> compare an element of the `extendedASCII` view to a character
>>> literal. That seems like it'd be really useful.
>> We don't have character literals :-)
> Excuse me, Unicode scalar literals. :^)
>> However, I agree that there needs to be a way to do it.  The thing
>> would be to make it easy to construct a UInt8 from a string literal.
> Honestly, I might consider having elements which are not plain
> `UInt8`s, but `ASCIIScalar?`s, where `ASCIIScalar` looks something
> like:
> 	struct ASCIIScalar {
> 		// Using a 7-bit integer means `ASCIIScalar?`'s tag bit can fit in the same byte.
> 		let _value: Builtin.Int7
> 		var value: UInt8 {
> 			return UInt8(Builtin.zext_Int7_Int8(_value))
> 		}
> 		init?(_convertingValue value: UInt8) {
> 			let result: (value: Builtin.Int7, error: Builtin.Int1) =
> 				Builtin.u_to_u_checked_trunc_Int8_Int7(value._value)
> 			guard Bool(result.error) == false else { return nil }
> 			_value = result.value
> 		}
> 		init?<Integer: BinaryInteger>(value: Integer) {
> 			guard let sizedValue = UInt8(exactly: value) else {
> 				return nil
> 			}
> 			self.init(_convertingValue: sizedValue)
> 		}
> 		init?(_ scalar: UnicodeScalar) {
> 			guard scalar.isASCII else { return nil }
> 			self.init(_convertingValue: UInt8(scalar.value))
> 		}
> 	}
> 	extension ASCIIScalar: ExpressibleByUnicodeScalarLiteral, ExpressibleByIntegerLiteral {
> 		// Notional, not necessarily actual, implementation
> 		init(unicodeScalarLiteral value: UnicodeScalar) {
> 			self.init(value)!
> 		}
> 		init(integerLiteral value: UInt8) {
> 			self.init(value: value)!
> 		}
> 	}
> Then you could write something like (if I understand what you're envisioning for the
> `ExtendedASCIIView`):
> 	for (char, i) in zip(source.extendedASCII, source.extendedASCII.indices) {
> 		switch (state, char) {
> 		// Look for a single or double quote to start the string
> 		case (.expectingValue, "'"?), (.expectingValue, "\""?):
> 			state = .readingStringLiteral(quoteIndex: i)
> 		// Scan to the end of the string
> 		case (.readingStringLiteral(let quoteIndex), _):
> 			// Is this the terminator?
> 			if char == source.extendedASCII[quoteIndex] {
> 				let range = source.extendedASCII.index(after: quoteIndex) ..< i
> 				// Note that we extract the value here with `codeUnits`
> 				let value = String(source.codeUnits[range])
> 				consumeValue(value)
> 				state = .expectingComma
> 			}
> 			else {
> 				// Do nothing; just scan past this character.
> 			}
> 		}
> 	}
> Relying on the fact that you're switching against an `ASCIIScalar`,
> rather than a `UInt8`, to allow Unicode scalar literals to be used.
> (There are other possible designs as well; a generic
> `ASCIIScalar<CodeUnit: UnsignedInteger>` which directly wrapped a code
> unit without changing its storage at all would be one interesting
> example.)

This is a good idea that we should explore.

>>> If it *is* similar to `UnicodeCodec`, one thing I will note is that
>>> the way `UnicodeCodec` works in code units is rather annoying for
>>> I/O. It may make sense to have some sort of type-erasing wrapper
>>> around `UnicodeCodec` which always uses bytes. (You then have to
>>> worry about endianness, of course...)
>> Take a look at the branch and let me know how this looks like it would work for I/O.
> I don't claim to understand everything I'm seeing, but at a quick
> glance, I really like the overall design. It's nice to see it
> encapsulating a stateless algorithm; I think that will make it more
> flexible.
> However, there's an important tweak needed for I/O: Having a truncated
> character at the end of the collection needs to be detectable as a
> condition distinct from other errors, because a buffer might contain
> (say) two bytes of a three-byte UTF-8 character, with the third byte
> expected to arrive later. For instance, you might have:
> 	public enum ParseResult<T, Index> {
> 		case valid(T, resumptionPoint: Index)
> 		case error(resumptionPoint: Index)
> 		case partial(resumptionPoint: Index)
> 		case emptyInput
> 	}
> Or:
> 	public enum ParseResult<T, Index> {
> 		case valid(T, resumptionPoint: Index)
> 		case error(resumptionPoint: Index)
> 		case nothing(resumptionPoint: Index)
> 	}
> Unlike `error`'s `resumptionPoint`, which is after the garbled character, `partial` or `nothing`'s
> would be *before* the partial character.

I thought about this use-case, but I am not convinced we need it.  An
algorithm that has to decode from buffers can always be written such
that, when it finds an error whose resumption point is at the end of the
buffer, it takes all the code units from the previous resumption point
to this one and shifts them to the beginning of the buffer.  Why isn't
that adequate?
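(A sketch of the buffer-shifting approach described above. `ParseResult` and `decodeFirst` are illustrative stand-ins for whatever the branch's stateless decoding entry point actually looks like, not its real names:)

```swift
enum ParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case emptyInput
}

// Decodes as much of `buffer` as possible, then shifts any undecoded tail
// to the front so the next read can append to it.
func drain(
    _ buffer: inout [UInt8],
    isEOF: Bool,
    decodeFirst: (ArraySlice<UInt8>) -> ParseResult<UnicodeScalar, Int>,
    consume: (UnicodeScalar) -> Void
) {
    var position = buffer.startIndex
    loop: while true {
        switch decodeFirst(buffer[position..<buffer.endIndex]) {
        case .valid(let scalar, let next):
            consume(scalar)
            position = next
        case .error(let next) where next == buffer.endIndex && !isEOF:
            // Possibly a truncated character straddling the buffer boundary:
            // don't report an error yet; keep the tail and wait for more input.
            break loop
        case .error(let next):
            consume("\u{FFFD}")   // genuine garbage: emit a replacement character
            position = next
        case .emptyInput:
            break loop
        }
    }
    buffer.removeSubrange(buffer.startIndex..<position)
}
```

The end-of-buffer error case is what stands in for Brent's distinct `.partial` result: an error whose resumption point is the buffer's end is simply deferred.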

> I had a whole bunch of stuff here earlier where I discussed replacing
> `Sequence` with a new design that had a `Collection`-like interface,
> except that the start index was returned by a `makeStartIndex()`
> method which could only be called once. By tracking the lifetimes of
> indices, the sequence could figure out when a portion of its data was
> no longer accessible and could be discarded. However, I've tweaked
> that design a lot in the last day and haven't come up with anything
> that's quite satisfactory, so I'll leave that discussion aside for
> now.

IMO we really don't want to weigh indices down with anything that heavy,
and the best way to do this is to have a deque-like data structure and
drop parts off the front of it as they are no longer needed.

> (A side note related to `UnicodeEncoding`'s all-static-member design:
> I've taken advantage of this "types as tables of stateless methods and
> associated types" pattern myself (see
> https://github.com/brentdax/SQLKit/blob/master/Sources/SQLKit/SQLClient.swift),
> and although it's very useful, it always feels like I'm fighting the
> language. For these occasions, I wonder if it might make sense to
> introduce a concept of "singleton types" or "static types" where the
> instance and type member namespaces are unified, `T.Type` is the same
> as `T`, `T.init()` is the same as `T.self`, and all stored properties
> are treated as static (and thus shared by all instances). That's
> properly the topic of a different thread, of course; it just occurred
> to me as I was writing this.)


>>> That way, if you just write `String`, you get something flexible; if
>>> you write `String<NFCNormalizedUTF16StringStorage>`, you get
>>> something fast.
>> This only works in the "facade" variant where you have a defaulted
>> generic parameter feature, but yes, that's the idea of that variant.
> Yeah, I'm speaking specifically of the defaulted case, which frankly
> is the only one I think is *really* extremely promising.
>>> What does that mean for `String.Index` unification?
>> Not much.  We never intended for indices to be interchangeable among
>> different specific string types (other than a string and its
>> SubSequence).
> I'm more asking, is it possible that different string types would have
> different interchangeability rules? For instance:
> * When using `UTF8StringStorage`, `String.Index` and `String.UTF8View.Index` are interchangeable.
> * When using `UTF16StringStorage` (or `NSString`?), `String.Index` and `String.UTF16View.Index` are
> interchangeable.
> * When using `UTF32StringStorage`, `String.Index` is *not* interchangeable with either of the
> `UTFnView` indices.

It's possible.

>>> `description` would have to change to be
>>> localizable. (Specifically, it would have to take a locale.) This
>>> is doable, of course, but it hasn't been done yet.
>> Well, it could use the current locale.  These things are supposed to
>> remain lightweight.
> I think that, if you're gonna go to the trouble of making your
> `description` localizable, there should be a way to inject a
> locale. That would make testing your localizations easier, for
> instance.

Yes; it should be possible to change the current locale, but that's
really a topic for another day.

> (There's also the small matter of `LosslessStringConvertible`. Oops?)

What in particular are you concerned with here?

>>>> ### `StaticString`
>>> One complication there is that `Unicode` presumably supports
>>> mutation, which `StaticString` doesn't.
>> No, Unicode doesn't support mutation.  A mutable Unicode will
>> usually conform to Unicode and RangeReplaceableCollection (but not
>> MutableCollection, because replacing a grapheme is not an O(1)
>> operation).
> Oh, of course, that makes a lot of sense. Hopefully we won't need
> anything special from mutable `StringStorage`s. (That is, members that
> are only needed if a type is *both* `StringStorage` *and*
> `RangeReplaceableCollection`.)

