[swift-evolution] [Proposal] Foundation Swift Archival & Serialization

Wed Mar 15 23:19:07 CDT 2017

> On Mar 15, 2017, at 3:40 PM, Itai Ferber via swift-evolution <swift-evolution at swift.org> wrote:
> 
> Hi everyone,
> 
> The following introduces a new Swift-focused archival and serialization API as part of the Foundation framework. We’re interested in improving the experience and safety of performing archival and serialization, and are happy to receive community feedback on this work.

Thanks to all of the people who've worked on this. It's a great proposal.

> Specifically:
> 
> 	• It aims to provide a solution for the archival of Swift struct and enum types

I see a lot of discussion here of structs and classes, and an example of an enum without associated values, but I don't see any discussion of enums with associated values. Can you sketch how you see people encoding such types?

For example, I assume that `Optional` is going to get some special treatment, but if it doesn't, how would you write its `encode(to:)` method?

What about a more complex enum, like the standard library's `UnicodeDecodingResult`:

	enum UnicodeDecodingResult {
		case emptyInput
		case error
		case scalarValue(UnicodeScalar)
	}

Or, say, an `Error`-conforming type from one of my projects:

	public enum SQLError: Error {
	    case connectionFailed(underlying: Error)
	    case executionFailed(underlying: Error, statement: SQLStatement)
	    case noRecordsFound(statement: SQLStatement)
	    case extraRecordsFound(statement: SQLStatement)
	    case columnInvalid(underlying: Error, key: ColumnSpecifier, statement: SQLStatement)
	    case valueInvalid(underlying: Error, key: AnySQLColumnKey, statement: SQLStatement)
	}

(You can assume that all the types in the associated values are `Codable`.)

I don't necessarily assume that the compiler should write conformances to these sorts of complicated enums for me (though that would be nice!); I'm just wondering what the designers of this feature envision people doing in cases like these.
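
To make the question concrete, here is one strawman for a hand-written conformance. It is only a sketch: it is written against `Codable` as found in current Swift toolchains, which may not match the proposal text exactly. `DecodingResult` is a hypothetical stand-in for `UnicodeDecodingResult` that stores a `UInt32` rather than a `UnicodeScalar`, and the `kind` and `scalar` key names are invented for illustration.

```swift
import Foundation

// Hypothetical stand-in for an enum with an associated value.
enum DecodingResult: Codable, Equatable {
    case emptyInput
    case error
    case scalarValue(UInt32)

    // Illustrative keys: a discriminator plus one key for the payload.
    private enum Keys: String, CodingKey {
        case kind, scalar
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: Keys.self)
        switch self {
        case .emptyInput:
            try container.encode("emptyInput", forKey: .kind)
        case .error:
            try container.encode("error", forKey: .kind)
        case .scalarValue(let value):
            try container.encode("scalarValue", forKey: .kind)
            try container.encode(value, forKey: .scalar)
        }
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: Keys.self)
        let kind = try container.decode(String.self, forKey: .kind)
        switch kind {
        case "emptyInput":
            self = .emptyInput
        case "error":
            self = .error
        case "scalarValue":
            self = .scalarValue(try container.decode(UInt32.self, forKey: .scalar))
        default:
            throw DecodingError.dataCorruptedError(
                forKey: .kind, in: container,
                debugDescription: "Unknown case \(kind)")
        }
    }
}
```

A discriminator key plus per-case payload keys is just one possible layout; whether that is the layout the designers have in mind is exactly the open question.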

> 	• protocol Codable: Adopted by types to opt into archival. Conformance may be automatically derived in cases where all properties are also Codable.

Have you given any consideration to supporting types which only need to decode? That seems likely to be common when interacting with web services.

> 	• protocol CodingKey: Adopted by types used as keys for keyed containers, replacing String keys with semantic types. Conformance may be automatically derived in most cases.
> 	• protocol Encoder: Adopted by types which can take Codable values and encode them into a native format.
> 		• class KeyedEncodingContainer<Key : CodingKey>: Subclasses of this type provide a concrete way to store encoded values by CodingKey. Types adopting Encoder should provide subclasses of KeyedEncodingContainer to vend.
> 		• protocol SingleValueEncodingContainer: Adopted by types which provide a concrete way to store a single encoded value. Types adopting Encoder should provide types conforming to SingleValueEncodingContainer to vend (but in many cases will be able to conform to it themselves).
> 	• protocol Decoder: Adopted by types which can take payloads in a native format and decode Codable values out of them.
> 		• class KeyedDecodingContainer<Key : CodingKey>: Subclasses of this type provide a concrete way to retrieve encoded values from storage by CodingKey. Types adopting Decoder should provide subclasses of KeyedDecodingContainer to vend.
> 		• protocol SingleValueDecodingContainer: Adopted by types which provide a concrete way to retrieve a single encoded value from storage. Types adopting Decoder should provide types conforming to SingleValueDecodingContainer to vend (but in many cases will be able to conform to it themselves).

I do want to note that, at this point in the proposal, I was sort of thinking you'd gone off the deep end modeling this. Having read the whole thing, I now understand what all of these things do, but this really is a very large subsystem. I think it's worth asking if some of these types can be eliminated or combined.

> Structured types (i.e. types which encode as a collection of properties) encode and decode their properties in a keyed manner. Keys may be String-convertible or Int-convertible (or both),

What does "may" mean here? That, at runtime, the encoder will test for the preferred key type and fall back to the other one? That seems a little bit problematic.

I'm also quite worried about how `Int`-convertible keys will interact with code synthesis. The obvious way to assign integers—declaration order—would mean that reordering declarations would invisibly break archiving, potentially (if the types were compatible) without breaking anything in an error-causing way even at runtime. You could sort the names, but then adding a new property would shift the integers of the properties "below" it. You could hash the names, but then there's no obvious relationship between the integers and key cases.
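
The declaration-order hazard can be shown with a toy model. This is a minimal sketch, assuming integers are assigned by declaration order; the `[Int: String]` dictionary stands in for an archive, and `KeysV1`/`KeysV2` are hypothetical names for two revisions of the same type's keys.

```swift
// Two revisions of the "same" keys, differing only in declaration order.
enum KeysV1: Int { case name, email }   // name = 0, email = 1
enum KeysV2: Int { case email, name }   // email = 0, name = 1

// A toy archive written with the first revision's integer keys.
let archive: [Int: String] = [
    KeysV1.name.rawValue: "Joe",
    KeysV1.email.rawValue: "joe@example.com",
]

// Reading it back with the second revision succeeds, but the fields are swapped.
let decodedName = archive[KeysV2.name.rawValue]!    // "joe@example.com"
let decodedEmail = archive[KeysV2.email.rawValue]!  // "Joe"
```

Both fields are strings, so nothing fails; the data is just silently wrong.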

At the same time, I also think that using arbitrary integers is a poor match for ordering. If you're making an ordered container, you don't want arbitrary integers wrapped up in an abstract type. You want adjacent integers forming indices of an eventual array. (Actually, you may not want indices at all—you may just want to feed elements in one at a time!)

So I would suggest the following changes:

* The coding key always converts to a string. That means we can eliminate the `CodingKey` protocol and instead use `RawRepresentable where RawValue == String`, leveraging existing infrastructure. That also means we can call the `CodingKeys` associated type `CodingKey` instead, which is the correct name for it—we're not talking about an `OptionSet` here.

* If, to save space on disk, you also want to let people use integers as the serialized representation of a key, we might introduce a parallel `IntegerCodingKey` protocol for that, but every `CodingKey` type should map to `String` first and foremost. Using a protocol here ensures that it can be statically determined at compile time whether a type can be encoded with integer keys, so the compiler can select an overload of `container(keyedBy:)`.

* Intrinsically ordered data is encoded as a single-value container of type `Array<Codable>`. (I considered having an `orderedContainer()` method and type, but as I thought about it, I couldn't think of an advantage it would have over `Array`.)
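
The static overload selection described in these points might look roughly like the following. This is a sketch, not proposed API: `IntegerCodingKey` is the name floated above, while `StringCodingKey`, `ToyKeyedEncoder`, the `intValue` requirement, and the string return values are all hypothetical stand-ins.

```swift
// Hypothetical protocols sketching the suggestion.
protocol StringCodingKey: RawRepresentable where RawValue == String {}
protocol IntegerCodingKey: StringCodingKey {
    var intValue: Int { get }
}

struct ToyKeyedEncoder {
    // Chosen for key types that only have a string form.
    func container<Key: StringCodingKey>(keyedBy type: Key.Type) -> String {
        return "string-keyed"
    }

    // More constrained, so it is preferred when the key type also has an integer form.
    func container<Key: IntegerCodingKey>(keyedBy type: Key.Type) -> String {
        return "integer-keyed"
    }
}

enum PlainKey: String, StringCodingKey { case name }

enum CompactKey: String, IntegerCodingKey {
    case name
    var intValue: Int { return 0 }
}

let plainChoice = ToyKeyedEncoder().container(keyedBy: PlainKey.self)
let compactChoice = ToyKeyedEncoder().container(keyedBy: CompactKey.self)
```

The choice is made at compile time from the key type alone, so no runtime test for the preferred key type is needed.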

>     /// Returns an encoding container appropriate for holding a single primitive value.
>     ///
>     /// - returns: A new empty single value container.
>     /// - precondition: May not be called after a prior `self.container(keyedBy:)` call.
>     /// - precondition: May not be called after a value has been encoded through a previous `self.singleValueContainer()` call.
>     func singleValueContainer() -> SingleValueEncodingContainer

Speaking of which, I'm not sure about single value containers. My first instinct is to say that methods should be moved from them to the `Encoder` directly, but that would probably cause code duplication. But...isn't there already duplication between the `SingleValue*Container` and the `Keyed*Container`? Why, yes, yes there is. So let's talk about that.

>     open func encode<Value : Codable>(_ value: Value?, forKey key: Key) throws
>     open func encode(_ value: Bool?,   forKey key: Key) throws
>     open func encode(_ value: Int?,    forKey key: Key) throws
>     open func encode(_ value: Int8?,   forKey key: Key) throws
>     open func encode(_ value: Int16?,  forKey key: Key) throws
>     open func encode(_ value: Int32?,  forKey key: Key) throws
>     open func encode(_ value: Int64?,  forKey key: Key) throws
>     open func encode(_ value: UInt?,   forKey key: Key) throws
>     open func encode(_ value: UInt8?,  forKey key: Key) throws
>     open func encode(_ value: UInt16?, forKey key: Key) throws
>     open func encode(_ value: UInt32?, forKey key: Key) throws
>     open func encode(_ value: UInt64?, forKey key: Key) throws
>     open func encode(_ value: Float?,  forKey key: Key) throws
>     open func encode(_ value: Double?, forKey key: Key) throws
>     open func encode(_ value: String?, forKey key: Key) throws
>     open func encode(_ value: Data?,   forKey key: Key) throws

Wait, first, a digression for another issue: I'm concerned that, if you look at the `decode` calls, there are plain `decode(…)` calls which throw if a `nil` was originally encoded and `decodeIfPresent` calls which return optional. The result is, essentially, that the encoding system eats a level of optionality for its own purposes—seemingly good, straightforward-looking code like this:

	struct MyRecord: Codable {
		var id: Int?
		…

		func encode(to encoder: Encoder) throws {
			let container = encoder.container(keyedBy: CodingKey.self)
			try container.encode(id, forKey: .id)
			…
		}

		init(from decoder: Decoder) throws {
			let container = decoder.container(keyedBy: CodingKey.self)
			id = try container.decode(Int.self, forKey: .id)
			…
		}
	}

Will crash. (At least, I assume that's what will happen.)
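
For what it's worth, the same shape can be tried against `Codable` with `JSONEncoder`/`JSONDecoder` as found in current Swift toolchains. That pairing is an assumption and may not behave exactly as the proposal text describes; `ToyRecord` and `roundTripFails()` are hypothetical names. In that setting the mismatch surfaces as a thrown decoding error rather than a trap.

```swift
import Foundation

struct ToyRecord: Codable {
    var id: Int?

    private enum Keys: String, CodingKey { case id }

    init(id: Int?) { self.id = id }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: Keys.self)
        // `id` may be `nil`; in that case there is no usable `Int` behind the key.
        try container.encode(id, forKey: .id)
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: Keys.self)
        // Asks for a non-optional `Int`, mirroring the snippet above.
        id = try container.decode(Int.self, forKey: .id)
    }
}

// Returns true when round-tripping a record with a `nil` id fails to decode.
func roundTripFails() -> Bool {
    do {
        let data = try JSONEncoder().encode(ToyRecord(id: nil))
        _ = try JSONDecoder().decode(ToyRecord.self, from: data)
        return false
    } catch {
        return true
    }
}
```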

I think we'd be better off having `encode(_:forKey:)` not take an optional; instead, we should have `Optional` conform to `Codable` and behave in some appropriate way. Exactly how to implement it might be a little tricky because of nested optionals; I suppose a `none` would have to measure how many levels of optionality there are between it and a concrete value, and then encode that information into the data. I think our `NSNull` bridging is doing something broadly similar right now.

I know that this is not the design you would use in Objective-C, but Swift uses `Optional` differently from how Objective-C uses `nil`. Swift APIs consider `nil` and absent to be different things; where they can both occur, good Swift APIs use doubled-up Optionals to be precise about the situation. I think the design needs to be a little different to accommodate that.
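
`Dictionary` is a familiar example of the doubled-up pattern: with an optional value type, its subscript distinguishes a missing key from a key whose value is `nil`. A small sketch (the `record` contents and the `describe` helper are made up for illustration):

```swift
let record: [String: Int?] = ["id": nil]

let present: Int?? = record["id"]       // key exists, value is nil
let absent: Int?? = record["missing"]   // key does not exist

// Spells out the three states a doubled optional can be in.
func describe(_ value: Int??) -> String {
    switch value {
    case .none:
        return "absent"
    case .some(.none):
        return "present but nil"
    case .some(.some(let number)):
        return "present: \(number)"
    }
}
```

Here `describe(present)` gives "present but nil" and `describe(absent)` gives "absent"; a coding API that folds the two together loses that distinction.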

Now, back to the `SingleValue*Container`/`Keyed*Container` issue. The list above is, frankly, gigantic. You specify a *lot* of primitives in `Keyed*Container`; there's a lot to implement here. And then you have to implement it all *again* in `SingleValue*Container`:

>     func encode(_ value: Bool) throws
>     func encode(_ value: Int) throws
>     func encode(_ value: Int8) throws
>     func encode(_ value: Int16) throws
>     func encode(_ value: Int32) throws
>     func encode(_ value: Int64) throws
>     func encode(_ value: UInt) throws
>     func encode(_ value: UInt8) throws
>     func encode(_ value: UInt16) throws
>     func encode(_ value: UInt32) throws
>     func encode(_ value: UInt64) throws
>     func encode(_ value: Float) throws
>     func encode(_ value: Double) throws
>     func encode(_ value: String) throws
>     func encode(_ value: Data) throws

This is madness.

Look, here's what we do. You have two types: `Keyed*Container` and `Value*Container`. `Keyed*Container` looks something like this:

	final public class KeyedEncodingContainer<EncoderType: Encoder, Key: RawRepresentable> where Key.RawValue == String {
	    public let encoder: EncoderType

	    public let codingKeyContext: [RawRepresentable where RawValue == String]
	    // Hmm, we might need a CodingKey protocol after all.
	    // Still, it could just be `protocol CodingKey: RawRepresentable where RawValue == String {}`

	    subscript (key: Key) -> ValueEncodingContainer {
	        return encoder.makeValueEncodingContainer(forKey: key)
	    }
	}

It's so simple, it doesn't even need to be specialized. You might even be able to get away with combining the encoding and decoding variants if the subscript comes from a conditional extension. `Value*Container` *does* need to be specialized; it looks like this (modulo the `Optional` issue I mentioned above):

	public protocol ValueEncodingContainer {
	    func encode<Value : Codable>(_ value: Value?) throws
	    func encode(_ value: Bool?) throws
	    func encode(_ value: Int?) throws
	    func encode(_ value: Int8?) throws
	    func encode(_ value: Int16?) throws
	    func encode(_ value: Int32?) throws
	    func encode(_ value: Int64?) throws
	    func encode(_ value: UInt?) throws
	    func encode(_ value: UInt8?) throws
	    func encode(_ value: UInt16?) throws
	    func encode(_ value: UInt32?) throws
	    func encode(_ value: UInt64?) throws
	    func encode(_ value: Float?) throws
	    func encode(_ value: Double?) throws
	    func encode(_ value: String?) throws
	    func encode(_ value: Data?) throws

	    func encodeWeak<Object : AnyObject & Codable>(_ object: Object?) throws

	    var codingKeyContext: [CodingKey] { get }
	}

And use sites would look like:

	func encode(to encoder: Encoder) throws {
		let container = encoder.container(keyedBy: CodingKey.self)
		try container[.id].encode(id)
		try container[.name].encode(name)
		try container[.birthDate].encode(birthDate)
	}
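
To check that the subscript shape hangs together, here is a self-contained toy version. It is a sketch under heavy assumptions: `ToyStorage`, `ToyValueContainer`, `ToyKeyedContainer`, and `ToyPersonKey` are hypothetical, values are flattened to strings, and error handling is left out entirely.

```swift
// Toy backing store shared by the containers.
final class ToyStorage {
    var values: [String: String] = [:]
}

// Knows how to encode primitives, and which key it was vended for.
struct ToyValueContainer {
    let storage: ToyStorage
    let key: String

    func encode(_ value: Int) { storage.values[key] = String(value) }
    func encode(_ value: String) { storage.values[key] = value }
}

// Does nothing except map a key to a value container.
struct ToyKeyedContainer<Key: RawRepresentable> where Key.RawValue == String {
    let storage: ToyStorage

    subscript(key: Key) -> ToyValueContainer {
        return ToyValueContainer(storage: storage, key: key.rawValue)
    }
}

enum ToyPersonKey: String { case id, name }

let toyStorage = ToyStorage()
let toyContainer = ToyKeyedContainer<ToyPersonKey>(storage: toyStorage)
toyContainer[.id].encode(42)
toyContainer[.name].encode("Joe")
```

All the per-primitive logic lives in the value container; the keyed container stays a thin wrapper.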

Decoding is slightly trickier. You could either make the subscript `Optional`, which would be more like `Dictionary` but would be inconsistent with `Encoder` and would give the "never force-unwrap anything" crowd conniptions, or you could add a `contains()` method to `ValueDecodingContainer` and make `decode(_:)` throw. Either one works.
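
The second option, `contains()` plus a throwing `decode(_:)`, could be sketched like this. Everything here is hypothetical: the toy containers only hold `Int` values, and `ToyMissingValue` and `readAge(from:)` are invented names.

```swift
struct ToyMissingValue: Error {}

struct ToyValueDecodingContainer {
    // `nil` models "nothing was encoded for this key".
    let stored: Int?

    func contains() -> Bool {
        return stored != nil
    }

    func decode(_ type: Int.Type) throws -> Int {
        guard let value = stored else { throw ToyMissingValue() }
        return value
    }
}

struct ToyKeyedDecodingContainer {
    let values: [String: Int]

    // Non-optional subscript, mirroring the encoding side.
    subscript(key: String) -> ToyValueDecodingContainer {
        return ToyValueDecodingContainer(stored: values[key])
    }
}

// A caller that treats the key as optional checks `contains()` first.
func readAge(from container: ToyKeyedDecodingContainer) throws -> Int? {
    let age = container["age"]
    guard age.contains() else { return nil }
    return try age.decode(Int.self)
}
```

Callers that require the value skip the check and let `decode(_:)` throw.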

Also, another issue with the many primitives: swiftc doesn't really like large overload sets very much. Could this set be reduced? I'm not sure what the logic was in choosing these particular types, but many of them share protocols in Swift—you might get away with just this:

	public protocol ValueEncodingContainer {
	    func encode<Value : Codable>(_ value: Value?) throws
	    func encode(_ value: Bool?) throws
	    func encode<Integer: SignedInteger>(_ value: Integer?) throws
	    func encode<UInteger: UnsignedInteger>(_ value: UInteger?) throws
	    func encode<Floating: FloatingPoint>(_ value: Floating?) throws
	    func encode(_ value: String?) throws
	    func encode(_ value: Data?) throws

	    func encodeWeak<Object : AnyObject & Codable>(_ object: Object?) throws

	    var codingKeyContext: [CodingKey] { get }
	}
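
A quick way to see that a handful of generic overloads can cover the concrete numeric types is a toy encoder that records which overload ran. This is a sketch, not proposed API: `ToyValueEncoder` and its `log` are hypothetical, optionals are dropped for brevity, and `BinaryFloatingPoint` is used instead of `FloatingPoint` so the value can be converted to `Double` for logging.

```swift
struct ToyValueEncoder {
    var log: [String] = []

    mutating func encode(_ value: Bool) {
        log.append("bool:\(value)")
    }

    // Covers Int, Int8, Int16, Int32, Int64.
    mutating func encode<Integer: SignedInteger>(_ value: Integer) {
        log.append("signed:\(Int64(value))")
    }

    // Covers UInt, UInt8, UInt16, UInt32, UInt64.
    mutating func encode<UInteger: UnsignedInteger>(_ value: UInteger) {
        log.append("unsigned:\(UInt64(value))")
    }

    // Covers Float and Double.
    mutating func encode<Floating: BinaryFloatingPoint>(_ value: Floating) {
        log.append("float:\(Double(value))")
    }

    mutating func encode(_ value: String) {
        log.append("string:\(value)")
    }
}

var toyEncoder = ToyValueEncoder()
toyEncoder.encode(true)
toyEncoder.encode(Int8(-3))
toyEncoder.encode(UInt16(7))
toyEncoder.encode(Float(1.5))
toyEncoder.encode("hi")
```

Five overloads stand in for fourteen of the concrete ones (everything except `Data`), at the cost of each format doing its own width handling inside the generic bodies.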

To accommodate my previous suggestion of using arrays to represent ordered encoded data, I would add one more primitive:

	    func encode(_ values: [Codable]) throws

(Also, is there any sense in adding `Date` to this set, since it needs special treatment in many of our formats?)

> Encoding Container Types
> 
> For some types, the container into which they encode has meaning. Especially when coding for a specific output format (e.g. when communicating with a JSON API), a type may wish to explicitly encode as an array or a dictionary:
> 
> // Continuing from before
> public protocol Encoder {
>     func container<Key : CodingKey>(keyedBy keyType: Key.Type, type containerType: EncodingContainerType) -> KeyedEncodingContainer<Key>
> }
> 
> /// An `EncodingContainerType` specifies the type of container an `Encoder` should use to store values.
> public enum EncodingContainerType {
>     /// The `Encoder`'s preferred container type; equivalent to either `.array` or `.dictionary` as appropriate for the encoder.
>     case `default`
>     
>     /// Explicitly requests the use of an array to store encoded values.
>     case array
> 
>     /// Explicitly requests the use of a dictionary to store encoded values.
>     case dictionary
> }

I see what you're getting at here, but I don't think this is fit for purpose, because arrays are not simply dictionaries with integer keys—their elements are adjacent and ordered. See my discussion earlier about treating inherently ordered containers as simply single-value `Array`s.

> Nesting
> 
> In practice, some types may also need to control how data is nested within their container, or potentially nest other containers within their container. Keyed containers allow this by returning nested containers of differing key types:

[snip]

> This can be common when coding against specific external data representations:
> 
> // User type for interfacing with a specific JSON API. JSON API expects encoding as {"id": ..., "properties": {"name": ..., "timestamp": ...}}. Swift type differs from encoded type, and encoding needs to match a spec:

This comes very close to addressing something else I'm concerned about, but doesn't quite get there. What's the preferred way to handle differences in serialization to different formats?

Here's what I mean: Suppose I have a BlogPost model, and I can both fetch and post BlogPosts to a cross-platform web service, and store them locally. But when I fetch and post remotely, I need to conform to the web service's formats; when I store an instance locally, I have a freer hand in designing my storage, and perhaps need to store some extra metadata. How do you imagine handling that sort of situation? Is the answer simply that I should use two different types?

> To remedy both of these points, we adopt a new convention for inheritance-based coding — encoding super as a sub-object of self:

[snip]

>         try super.encode(to: container.superEncoder())

This seems like a good idea to me. However, it brings up another point: What happens if you specify a superclass of the originally encoded class? In other words:

	let joe = Employee(…)
	let payload = try SomeEncoder().encode(joe)
	…
	let someone = try SomeDecoder().decode(Person.self, from: payload)
	print(type(of: someone))		// Person, Employee, or does `decode(_:from:)` fail?

> The encoding container types offer overloads for working with and processing the API's primitive types (String, Int, Double, etc.). However, for ease of implementation (both in this API and others), it can be helpful for these types to conform to Codable themselves. Thus, along with these overloads, we will offer Codable conformance on these types:

[snip]

> Since Swift's function overload rules prefer more specific functions over generic functions, the specific overloads are chosen where possible (e.g. encode("Hello, world!", forKey: .greeting) will choose encode(_: String, forKey: Key) over encode<T : Codable>(_: T, forKey: Key)). This maintains performance over dispatching through the Codable existential, while allowing for the flexibility of fewer overloads where applicable.

How important is this performance? If the answer is "eh, not really that much", I could imagine a setup where every "primitive" type eventually represents itself as `String` or `Data`, and each `Encoder`/`Decoder` can use dynamic type checks in `encode(_:)`/`decode(_:)` to define whatever "primitives" it wants for its own format.
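
Roughly, the setup imagined here could look like the following. It is a sketch with made-up names (`ToyEncodable`, `stringForm`, `Celsius`, `ToyBinaryEncoder`): every value can fall back to a string form, and each encoder decides with a dynamic type check which types it treats as its own primitives.

```swift
// Hypothetical: every encodable value can represent itself as a string.
protocol ToyEncodable {
    var stringForm: String { get }
}

extension Int: ToyEncodable {
    var stringForm: String { return String(self) }
}

struct Celsius: ToyEncodable {
    var degrees: Int
    var stringForm: String { return "\(degrees)C" }
}

struct ToyBinaryEncoder {
    func encode(_ value: ToyEncodable) -> String {
        // This format chooses to treat `Int` as one of its own primitives.
        if let integer = value as? Int {
            return "native-int(\(integer))"
        }
        // Everything else goes through the universal string fallback.
        return "string(\(value.stringForm))"
    }
}

let nativeResult = ToyBinaryEncoder().encode(7)
let fallbackResult = ToyBinaryEncoder().encode(Celsius(degrees: 21))
```

The dynamic check costs something on every call, which is the performance trade-off the question is asking about.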

* * *

One more thing. In Alternatives Considered, you present two designs—#2 and #3—where you generate a separate instance which represents the type in a fairly standardized way for the encoder to examine.

This design struck me as remarkably similar to the reflection system and its `Mirror` type, which is also a separate type describing an original instance. My question was: Did you look at the reflection system when you were building this design? Do you think there might be anything that can be usefully shared between them?

Thank you for your attention. I hope this was helpful!

-- 
Brent Royal-Gordon
Architechies