[swift-evolution] Strings in Swift 4

Brent Royal-Gordon brent at architechies.com
Mon Jan 23 21:31:18 CST 2017


> On Jan 21, 2017, at 3:49 AM, Brent Royal-Gordon <brent at architechies.com> wrote:

I'm going to trim out the bits where my answer is an uninteresting "Good" or "Okay, we'll leave that for later" or what-have-you.

>> The operands and sense of the comparison are kind of lost in all this garbage. You really want to see `foo < bar` in this code somewhere, but you don't.
> 
> Yeah, we thought about trying to build a DSL for that, but failed.  I think the best possible option would be something like:
> 
>   foo.comparison(case: .insensitive, locale: .current) < bar
> 
> The biggest problem is that you can build things like
> 
>     fu = foo.comparison(case: .insensitive, locale: .current)
>     br = bar.comparison(case: .sensitive)
>     fu < br // what does this mean?
> 
> We could even prevent such nonsense from compiling, but the cost in library API surface area is quite large.

Is it? I think we're talking, for each category of operation that can be localized like this:

* One type to carry an operand and its options.
* One method to construct this type.
* One alternate version of each operator which accepts an operand+options parameter. (I'm thinking it should always be the right-hand side, so the long stuff ends up at the end; Larry Wall noted this follows an "end-weight principle" in natural languages.)

I suspect that most solutions will at least require some sort of overload on the comparison operators, so this may be as parsimonious as we can get. 
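For concreteness, here's a minimal sketch of that shape using today's Foundation comparison API (the type name `StringComparisonOperand` and the `comparison(options:locale:)` spelling are illustrative, not the proposed API):

```swift
import Foundation

// Illustrative sketch only: one type carrying an operand and its options,
// one method to construct it, and one operator overload that takes the
// operand+options on the right-hand side.
struct StringComparisonOperand {
	let value: String
	let options: String.CompareOptions
	let locale: Locale?
}

extension String {
	func comparison(options: String.CompareOptions = [],
	                locale: Locale? = nil) -> StringComparisonOperand {
		return StringComparisonOperand(value: self, options: options, locale: locale)
	}
}

// The alternate operator; the long stuff ends up at the end, per the
// end-weight principle mentioned above.
func < (lhs: String, rhs: StringComparisonOperand) -> Bool {
	return lhs.compare(rhs.value, options: rhs.options,
	                   range: nil, locale: rhs.locale) == .orderedAscending
}

// Usage: foo < bar.comparison(options: .caseInsensitive, locale: .current)
```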

>> I'm struggling a little with the naming and syntax, but as a general approach, I think we want people to use something more like this:
>> 
>>    if StringOptions(case: .insensitive, locale: .current).compare(foo < bar) { … }
> 
> Yeah, we can't do that without making 
> 
> 	let a = foo < bar
> 
> ambiguous

Yeah, that's true. Perhaps we could introduce an attribute which can be used to say "disfavor this overload compared to other possibilities", but that seems disturbingly ad-hoc.

I know you want to defer this for now, so feel free to set this part of the email aside, but here's a quick list of solutions I've ballparked:

1. Your "one operand carries the options" solution.

2. As I mentioned, do something that effectively overloads comparison operators to return them in a symbolic form. You're right about the ambiguity problem, though.

3. Like #2, but with slightly modified operators, e.g.:

	if localized(fu &< br, case: .insensitive) { … }

4. Reintroduce something like the old `BooleanType` and have *all* comparisons construct a symbolic form that can be coerced to boolean. This is crazy, but actually probably useful in other places; I once experimented with constructing NSPredicates like this.

	protocol BooleanProtocol { var boolValue: Bool { get } }
	
	struct Comparison<Operand: Comparable> {
		var negated: Bool
		var sortOrder: SortOrder
		var left: Operand
		var right: Operand
		
		func evaluate(_ actualSortOrder: SortOrder) -> Bool {
			// There are circularity problems here, because `==` would itself return a `Comparison`, 
			// but I think you get the idea.
			return (actualSortOrder == sortOrder) != negated
		}
	}
	extension Comparison: BooleanProtocol {
		var boolValue: Bool {
			return evaluate(left.compared(to: right))
		}
	}
	
	func < <ComparableType: Comparable>(lhs: ComparableType, rhs: ComparableType) -> Comparison<ComparableType> {
		return Comparison(negated: false, sortOrder: .before, left: lhs, right: rhs)
	}
	func <= <ComparableType: Comparable>(lhs: ComparableType, rhs: ComparableType) -> Comparison<ComparableType> {
		return Comparison(negated: true, sortOrder: .after, left: lhs, right: rhs)
	}
	// etc.
	
	// Now for our special String comparison thing:
	func localized(_ expr: Comparison<String>, case: StringCaseSensitivity? = nil, …) -> Bool {
		return expr.evaluate(expr.left.compare(expr.right, case: case, …))
	}

5. Actually add some all-new piece of syntax that allows you to add options to an operator. Bad part is that this is ugly and kind of weird; good part is that this could probably be used in other places as well. Strawman example:

	// Use:
	if fu < br %(case: .insensitive, locale: .current) { … }
	
	// Definition:
	func < (lhs: String, rhs: String, case: StringCaseSensitivity? = nil, …) -> Bool { … }

6. Punt on this until we have macros. Once we do, have the function be a macro which alters the comparisons passed to it. Bad part is that this doesn't give us a solution for at least a version or two.

>> However, is there a reason we're talking about using a separate `Substring` type at all, instead of using `Slice<String>`? 
> 
> Yes: we couldn't specialize its representation to store short substrings inline, at least not without introducing an undesirable level of complexity.

How important is that, though? If you're using a `Substring`, you expect to keep the top-level `String` around and probably continue sharing storage with it, so you're probably extending its lifetime anyway. Or are you thinking of this as a speed optimization, rather than a memory optimization?

And is it worth not being able to have a `base` property on `Substring` like we've added to `Slice`? I've occasionally thought it might be useful to allow a slice's start and end indices to be adjusted, essentially allowing you to "slide" the bounds of the slice over the underlying collection; that wouldn't be possible with a `Substring` design which sometimes inlined data.
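To illustrate what I mean by "sliding", here's a rough sketch against `Slice`, which does expose its `base` (the `slid(to:)` name is hypothetical):

```swift
// Hypothetical sketch: because Slice exposes its `base`, you can rebuild
// a slice with adjusted bounds over the same underlying collection.
extension Slice {
	func slid(to newBounds: Range<Base.Index>) -> Slice<Base> {
		return Slice(base: base, bounds: newBounds)
	}
}

let numbers = [10, 20, 30, 40, 50]
let window = Slice(base: numbers, bounds: 1 ..< 3)   // 20, 30
let slidWindow = window.slid(to: 2 ..< 4)            // 30, 40
```

A `Substring` that sometimes inlines its data couldn't offer this, because there may be no `base` to re-slice.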

> ArraySlice is doomed :-)

Good news!

>> I've seen people struggle with the `Array`/`ArraySlice` issue when writing recursive algorithms, so personally, I'd like to see a more general solution that handles all `Collection`s.
> 
> The more general solution is "extend Unicode" or "extend Collection" (and when a String parameter is needed, "make your method generic over Collection/Unicode").

I know, but I know a lot of people really don't like doing that. My usual practice is to use generics at almost any opportunity—when an algorithm can work with any of a category of types, I'd rather take a type parameter than hard-code the arbitrary type I happen to need right now—but most people don't think that way. They'd prefer to write:

	func doThing(to slice: inout ArraySlice<Int>) { … }
	func doThing(to array: inout Array<Int>) { doThing(to: &array[0 ..< array.count]) }

(Yes, `array.startIndex ..< array.endIndex` would be slightly more proper, but we're not talking about *my* style here.)

Rather than:

	func doThing<C: RandomAccessCollection>(to collection: inout C)
		where C: RangeReplaceableCollection, C.Iterator.Element == Int
	{ … }

I haven't dug into this mindset that much; I suspect it comes from a combination of believing that generics are difficult and scary, not knowing the Collection protocols well enough to know which ones to use, and simply not wanting to introduce additional complexity when they don't need it.

In any case, though, I do understand why you would feel a `T` -> `T.SubSequence` implicit coercion wouldn't carry its own weight, and `collection[]` *would* be a definite improvement on the status quo for these developers.

> That's the problem, right there, combined with the fact that we don't have a terse syntax like s[] for going the other way.  I think it would be a much more elegant design, personally, but I don't see the tradeoffs working out.  If we can come up with a way to do it that works, we should.  So far, Ben and I have failed.

I guess what I'm saying is "keep trying; it's more valuable than you might have anticipated". :^)

>>> A user who needs to optimize away copies altogether should use this guideline:
>>> if for performance reasons you are tempted to add a `Range` argument to your
>>> method as well as a `String` to avoid unnecessary copies, you should instead
>>> use `Substring`.
>> 
>> I do like this as a guideline, though. There's definitely room in the standard library for "a string and a range of that string to operate upon".
> 
> I don't know what you mean.  It's our intention that nothing but the lowest level operations (e.g. replaceRange) would work on ranges when they could instead be working on slices.

No, all I'm saying is that there's definitely a lot of value in `Substring` or `Slice<String>`. Talking about a slice of a string is something quite valuable that we don't currently support very well.

>>> **Note:** `Unicode` would make a fantastic namespace for much of
>>> what's in this proposal if we could get the ability to nest types and
>>> protocols in protocols.
>> 
>> I mean, sure, but then you imagine it being used generically:
>> 
>>    func parse<UnicodeType: Unicode>(_ source: UnicodeType) -> UnicodeType
>>    // which concrete types can `source` be???
> 
> All "string" types, including String, Substring, UTF8String, StaticString, etc.

I know that; my point is that it doesn't *read* well here.

Imagine that you are a workaday Swift programmer. You know the syntax and the basic concrete types, but you have not read the standard library top-to-bottom, and don't have detailed knowledge of the protocols that it's built on. You read a source file with these three declarations:

	func factor<Integer: BinaryInteger>(_ number: Integer) -> [Integer]
	
	func decode<Encoding: UnicodeEncoding> (_ data: Data, as encoding: Encoding.Type) -> String
	
	func parse<UnicodeType: Unicode>(_ source: UnicodeType) -> UnicodeType

I think you would be able to understand what `factor(_:)` and `decode(_:as:)` do, even if you had never seen the `BinaryInteger` and `UnicodeEncoding` protocols, because their names clearly and simply say what sort of type would conform to the protocol. You would guess that familiar types like `Int` could be used with `factor(_:)`, and you might not know what the concrete `UnicodeEncoding` types were called, but you'd guess they probably had names with terms of art like `UTF8` in them somewhere.

But what about `parse(_:)`? Sure, `Unicode` suggests it has something to do with string handling, but it doesn't suggest *a string*. As I said, I would assume it has something to do with the Unicode standard—maybe a type that does Unicode table lookups, for instance. I get that you're using it as an adjective, but it's such a specific technical term that using it to describe any chunk of text data is misleading, even if that text *is* required to be Unicode text.

Perhaps you could call it `StringProtocol`, or `Textual`, or `UnicodeString`. But I really think just `Unicode` does not do a good job of conveying the meaning of the type. 

>>> We think random-access
>>> *code-unit storage* is a reasonable requirement to impose on all `String`
>>> instances.
>> 
>> Wait, you do? Doesn't that mean either using UTF-32, inventing a UTF-24 to use, or using some kind of complicated side table that adjusts for all the multi-unit characters in a UTF-16 or UTF-8 string? None of these sound ideal.
> 
> No; I'm not sure why you would think that.

Oh, sorry. I read that as "random-access code-point [i.e. UnicodeScalar] storage", which I don't think would be a reasonable requirement. My mistake.

>>> Then integers can easily be translated into offsets into a `String`'s `utf16`
>>> view for consumption by Cocoa:
>>> 
>>> ```swift
>>> let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
>>> let swiftIndex = s.utf16.index(offset: cocoaIndex)
>>> ```
>> 
>> I worry that this conversion will be too obscure.
> 
> I very much hope it will be rare enough that it'll be OK, but if it isn't, we can always have
> 
> 	let cocoaIndex = s.utf16Offset(of: i)
> 
> and/or take other measures to simplify it.

I think that would still be too obscure.

To give you an idea of what you're contending with here, take a look at a few Stack Overflow questions:

http://stackoverflow.com/questions/28128554/convert-string-index-to-int-or-rangestring-index-to-nsrange
http://stackoverflow.com/questions/27156916/convert-rangeint-to-rangestring-index
http://stackoverflow.com/questions/34540185/how-to-convert-index-to-type-int-in-swift

Objective-C programmers *do not know* that `NSInteger` and `NSRange` indices are UTF-16 indices. They don't think about what the "character" in `-characterAtIndex:` really means; they just take it at face value. That means putting "UTF-16" in the name will not help them identify the API as the correct one to use. It'd be like advertising a clinic to people with colds by saying you do "otolaryngology"—you're just not speaking the language of your audience.

I see two ways to make it really, really obvious which API is the right one to use. The first is to explicitly refer to something like "objc", "cocoa", "foundation", or "ns" in the name. The second is to use full-width conversions, which people understand are the default way to convert between two things. (Actually, a lot of developers literally call these "casts" and assume they're extremely low cost.)

I think that, if there's a `String.Index.init(_: Int)` and an `Int.init(_: String.Index)`, people will almost certainly identify these as the right way to convert between Foundation's `Int` indices and `String.Index`es. They certainly don't seem to be figuring it out now.
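To spell out what those conversions would do, here's a sketch in terms of today's views. The initializers are hypothetical; a truly context-free `String.Index.init(_: Int)` would rely on the index storing a UTF-16 offset, so this sketch threads the string through explicitly instead:

```swift
// Hypothetical full-width conversions (not real API). A shipping version
// could drop the `within:` parameter if String.Index stores a UTF-16
// offset directly.
extension String.Index {
	init(_ cocoaIndex: Int, within string: String) {
		self = string.utf16.index(string.utf16.startIndex, offsetBy: cocoaIndex)
	}
}

extension Int {
	init(_ index: String.Index, within string: String) {
		self = string.utf16.distance(from: string.utf16.startIndex, to: index)
	}
}
```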

>> 1. UTF-16 is the storage format, at least for an "ordinary" `Swift.String`.
> 
> It will be, in the common case, but many people seem to want plain String to be able to store UTF-8, and I'm not yet prepared to rule that out.

I suppose it doesn't matter what the actual storage format is as long as we get #2 (`UTF16View` indexed by `String.Index`).

If we go with the facade design, I suppose it would simply be that the default string storage also uses its `UTF16Index` for its `CodeUnitIndex`. Other string storages could arrange their indices in other ways.

>> 2. `String.Index` is used down to the `UTF16View`. It stores a UTF-16 offset.
>> 
>> 3. With just the standard library imported, `String.Index` does not have any obvious way to convert to or from an `Int` offset; you use `index(_:offsetBy:)` on one of the views. `utf16`'s implementation is just faster than the others.
> 
> This is roughly where we are today.

Yes, except for index interchangeability between `CharacterView`, `UnicodeScalarView`, and `UTF16View`. But the suggestion that we provide `init(_:)`s is the key part of this request.

>>> #### String Interpolation
> 
> Let's go to a separate thread for this, as you suggested.

Will do.

>> So you might end up having to wrap it in an `init(cString:)` anyway, just for convenience. Oh well, it was worth exploring.
> 
> I think you ended up where we did.

Unsurprising, I suppose. :^)

Going out of order briefly:

>> 2. I don't really understand how you envision using the "data specific to the underlying encoding" sections. Presumably you'll want to convert that data into a string eventually, right?
> 
> It already is in a string.  The point is that we have a way to scan the string looking for ASCII patterns without transcoding it.

So, if I understand this properly, you're imagining that `extendedASCII` has indices interchangeable with `codeUnits`, but doesn't do any sort of complicated Unicode decoding, so you can rip through the string with `extendedASCII` and then use the indices to extract actual, fully decoded Unicode data from substrings of `codeUnits`?

>> 1. It doesn't sound like you anticipate there being any way to compare an element of the `extendedASCII` view to a character literal. That seems like it'd be really useful.
> 
> We don't have character literals :-)

Excuse me, Unicode scalar literals. :^)

> However, I agree that there needs to be a way to do it.  The thing would be to make it easy to construct a UInt8 from a string literal.

Honestly, I might consider having elements which are not plain `UInt8`s, but `ASCIIScalar?`s, where `ASCIIScalar` looks something like:

	struct ASCIIScalar {
		// Using a 7-bit integer means `ASCIIScalar?`'s tag bit can fit in the same byte.
		let _value: Builtin.UInt7
		
		var value: UInt8 {
			return UInt8(Builtin.zext_Int7_Int8(_value))
		}
		init?(_convertingValue: UInt8) {
			let result: (value: Builtin.Int7, error: Builtin.Int1) = Builtin.u_to_u_checked_trunc_Int8_Int7(_convertingValue._value)
			guard Bool(result.error) == false else { return nil }
			_value = result.value
		}
		init?<Integer: BinaryInteger>(value: Integer) {
			guard let sizedValue = UInt8(exactly: value) else {
				return nil
			}
			self.init(_convertingValue: sizedValue)
		}
		init?(_ scalar: UnicodeScalar) {
			guard scalar.isASCII else { return nil }
			_value = Builtin.UInt7(scalar.value)
		}
	}
	extension ASCIIScalar: ExpressibleByUnicodeScalarLiteral, ExpressibleByIntegerLiteral {
		// Notional, not necessarily actual, implementation
		init(unicodeScalarLiteral value: UnicodeScalar) {
			self.init(value)!
		}
		
		init(integerLiteral value: UInt8) {
			self.init(value: value)!
		}
	}

Then you could write something like (if I understand what you're envisioning for the `ExtendedASCIIView`):

	for (char, i) in zip(source.extendedASCII, source.extendedASCII.indices) {
		switch (state, char) {
		…
		// Look for a single or double quote to start the string
		case (.expectingValue, "'"?), (.expectingValue, "\""?):
			state = .readingStringLiteral(quoteIndex: i)
		
		// Scan to the end of the string
		case (.readingStringLiteral(let quoteIndex), _):
			// Is this the terminator?
			if char == source.extendedASCII[quoteIndex] {
				let range = source.extendedASCII.index(after: quoteIndex) ..< i
				// Note that we extract the value here with `codeUnits`
				let value = String(source.codeUnits[range])
				
				consumeValue(value)
				state = .expectingComma
			}
			else {
				// Do nothing; just scan past this character.
			}
		…
		}
	}

Relying on the fact that you're switching against an `ASCIIScalar`, rather than a `UInt8`, to allow Unicode scalar literals to be used.

(There are other possible designs as well; a generic `ASCIIScalar<CodeUnit: UnsignedInteger>` which directly wrapped a code unit without changing its storage at all would be one interesting example.)

>> If it *is* similar to `UnicodeCodec`, one thing I will note is that the way `UnicodeCodec` works in code units is rather annoying for I/O. It may make sense to have some sort of type-erasing wrapper around `UnicodeCodec` which always uses bytes. (You then have to worry about endianness, of course...)
> 
> Take a look at the branch and let me know how this looks like it would work for I/O.

I don't claim to understand everything I'm seeing, but at a quick glance, I really like the overall design. It's nice to see it encapsulating a stateless algorithm; I think that will make it more flexible.

However, there's an important tweak needed for I/O: Having a truncated character at the end of the collection needs to be detectable as a condition distinct from other errors, because a buffer might contain (say) two bytes of a three-byte UTF-8 character, with the third byte expected to arrive later. For instance, you might have:

	public enum ParseResult<T, Index> {
		case valid(T, resumptionPoint: Index)
		case error(resumptionPoint: Index)
		case partial(resumptionPoint: Index)
		case emptyInput
	}

Or:

	public enum ParseResult<T, Index> {
		case valid(T, resumptionPoint: Index)
		case error(resumptionPoint: Index)
		case nothing(resumptionPoint: Index)
	}

Unlike `error`'s `resumptionPoint`, which is after the garbled character, `partial` or `nothing`'s would be *before* the partial character.
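As a concrete (toy) illustration of why the distinction matters, here's a minimal UTF-8 scalar parser in the shape of the first strawman above; the names are just that strawman, and real code would also reject overlong and surrogate encodings:

```swift
enum ParseResult<T, Index> {
	case valid(T, resumptionPoint: Index)
	case error(resumptionPoint: Index)
	case partial(resumptionPoint: Index)
	case emptyInput
}

func parseScalar(from bytes: [UInt8], startingAt start: Int) -> ParseResult<UnicodeScalar, Int> {
	guard start < bytes.count else { return .emptyInput }
	let first = bytes[start]
	let continuationCount: Int
	var value: UInt32
	switch first {
	case 0x00...0x7F: continuationCount = 0; value = UInt32(first)
	case 0xC2...0xDF: continuationCount = 1; value = UInt32(first & 0x1F)
	case 0xE0...0xEF: continuationCount = 2; value = UInt32(first & 0x0F)
	case 0xF0...0xF4: continuationCount = 3; value = UInt32(first & 0x07)
	default: return .error(resumptionPoint: start + 1)
	}
	var i = start + 1
	for _ in 0 ..< continuationCount {
		guard i < bytes.count else {
			// Ran off the end mid-character: this is `partial`, not `error`,
			// and it resumes *before* the truncated character.
			return .partial(resumptionPoint: start)
		}
		guard bytes[i] & 0xC0 == 0x80 else { return .error(resumptionPoint: i) }
		value = value << 6 | UInt32(bytes[i] & 0x3F)
		i += 1
	}
	guard let scalar = UnicodeScalar(value) else { return .error(resumptionPoint: i) }
	return .valid(scalar, resumptionPoint: i)
}
```

Feeding this `[0xC3]` (the first byte of "é") yields `.partial(resumptionPoint: 0)`, so the caller knows to wait for more input and retry from the same position, rather than emitting a replacement character.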

I had a whole bunch of stuff here earlier where I discussed replacing `Sequence` with a new design that had a `Collection`-like interface, except that the start index was returned by a `makeStartIndex()` method which could only be called once. By tracking the lifetimes of indices, the sequence could figure out when a portion of its data was no longer accessible and could be discarded. However, I've tweaked that design a lot in the last day and haven't come up with anything that's quite satisfactory, so I'll leave that discussion aside for now.

(A side note related to `UnicodeEncoding`'s all-static-member design: I've taken advantage of this "types as tables of stateless methods and associated types" pattern myself (see https://github.com/brentdax/SQLKit/blob/master/Sources/SQLKit/SQLClient.swift), and although it's very useful, it always feels like I'm fighting the language. For these occasions, I wonder if it might make sense to introduce a concept of "singleton types" or "static types" where the instance and type member namespaces are unified, `T.Type` is the same as `T`, `T.init()` is the same as `T.self`, and all stored properties are treated as static (and thus shared by all instances). That's properly the topic of a different thread, of course; it just occurred to me as I was writing this.)

>> That way, if you just write `String`, you get something flexible; if you write `String<NFCNormalizedUTF16StringStorage>`, you get something fast.
> 
> This only works in the "facade" variant where you have a defaulted generic parameter feature, but yes, that's the idea of that variant.

Yeah, I'm speaking specifically of the defaulted case, which frankly is the only one I think is *really* extremely promising.

>> What does that mean for `String.Index` unification?
> 
> Not much.  We never intended for indices to be interchangeable among different specific string types (other than a string and its SubSequence).

I'm more asking, is it possible that different string types would have different interchangeability rules? For instance:

* When using `UTF8StringStorage`, `String.Index` and `String.UTF8View.Index` are interchangeable.
* When using `UTF16StringStorage` (or `NSString`?), `String.Index` and `String.UTF16View.Index` are interchangeable.
* When using `UTF32StringStorage`, `String.Index` is *not* interchangeable with either of the `UTFnView` indices.

>> `description` would have to change to be localizable. (Specifically, it would have to take a locale.) This is doable, of course, but it hasn't been done yet.
> 
> Well, it could use the current locale.  These things are supposed to remain lightweight.

I think that, if you're gonna go to the trouble of making your `description` localizable, there should be a way to inject a locale. That would make testing your localizations easier, for instance.

(There's also the small matter of `LosslessStringConvertible`. Oops?)

>>> ### `StaticString`
>> 
>> One complication there is that `Unicode` presumably supports mutation, which `StaticString` doesn't.
> 
> No, Unicode doesn't support mutation.  A mutable Unicode will usually conform to Unicode and RangeReplaceableCollection (but not MutableCollection, because replacing a grapheme is not an O(1) operation).

Oh, of course, that makes a lot of sense. Hopefully we won't need anything special from mutable `StringStorage`s. (That is, members that are only needed if a type is *both* `StringStorage` *and* `RangeReplaceableCollection`.)

-- 
Brent Royal-Gordon
Architechies


