[swift-evolution] [Proposal] Foundation Swift Archival & Serialization

Brent Royal-Gordon brent at architechies.com
Fri Mar 17 15:23:15 CDT 2017


> On Mar 16, 2017, at 12:33 PM, Itai Ferber <iferber at apple.com> wrote:
> Optional values are accepted and vended directly through the API. The encode(_:forKey:) methods take optional values directly, and decodeIfPresent(_:forKey:) vends optional values.
> 
> Optional is special in this way — it’s a primitive part of the system. It’s actually not possible to write an encode(to:) method for Optional, since the representation of null values is up to the encoder and the format it’s working in; JSONEncoder, for instance, decides on the representation of nil (JSON null).
> 
Yes—I noticed that later but then forgot to revise the beginning. Sorry about that.
> It wouldn’t be possible to ask nil to encode itself in a reasonable way.
> 
I really think it could be done, at least for most coders. I talked about this in another email, but in summary:

- NSNull would become a primitive type; depending on the format, it would be encoded either as a null value or the absence of a value.
- Optional.some(x) would be encoded the same as x.
- Optional.none would be encoded in the following fashion:
  - If the Wrapped associated type was itself an optional type, it would be encoded as a keyed container containing a single entry. That entry's key would be some likely-unique value like "_swiftOptionalDepth"; its value would be the number of levels of optionality before reaching a non-optional type.
  - If the Wrapped associated type was non-optional, it would be encoded as an NSNull.

That sounds complicated, but the runtime already has machinery to coerce Optionals to Objective-C id: Optional.some gets bridged as the Wrapped value, while Optional.none gets bridged as either NSNull or _SwiftNull, which contains a depth. We would simply need to make _SwiftNull conform to Codable, and give it a decoding implementation which was clever enough to realize when it was being asked to decode a different type.
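
To make the depth-counting concrete, here's a rough sketch of how the nesting depth could be computed at the type level. The protocol and property names are my own invention, not anything in the proposal or the standard library:

	protocol OptionalDepthReporting {
		static var optionalDepth: Int { get }
	}
	
	extension Optional: OptionalDepthReporting {
		static var optionalDepth: Int {
			// One level for this Optional, plus however many levels the
			// Wrapped type contributes (zero for non-optional types).
			return 1 + ((Wrapped.self as? OptionalDepthReporting.Type)?.optionalDepth ?? 0)
		}
	}
	
	// Optional<Optional<Int>>.optionalDepth == 2; a nil of type Int??
	// would record a "_swiftOptionalDepth" of 1, the number of levels of
	// optionality in its Wrapped type.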
> What about a more complex enum, like the standard library's `UnicodeDecodingResult`:
> 
> enum UnicodeDecodingResult {
>     case emptyInput
>     case error
>     case scalarValue(UnicodeScalar)
> }
> 
> Or, say, an `Error`-conforming type from one of my projects:
> 
> public enum SQLError: Error {
>     case connectionFailed(underlying: Error)
>     case executionFailed(underlying: Error, statement: SQLStatement)
>     case noRecordsFound(statement: SQLStatement)
>     case extraRecordsFound(statement: SQLStatement)
>     case columnInvalid(underlying: Error, key: ColumnSpecifier, statement: SQLStatement)
>     case valueInvalid(underlying: Error, key: AnySQLColumnKey, statement: SQLStatement)
> }
> 
> (You can assume that all the types in the associated values are `Codable`.)
> 
> Sure — these cases specifically do not derive Codable conformance because the specific representation to choose is up to you. Two possible ways to write this, though there are many others (I’m simplifying these cases here a bit, but you can extrapolate this):
> 
Okay, so tl;dr is "There's nothing special to help with this; just encode some indication of the case in one key, and the associated values in separate keys". I suppose that works.
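
For the record, here's a sketch of that pattern for the UnicodeDecodingResult example above, written against the API roughly as proposed; the "kind" discriminator and key names are my own choices, not anything the proposal mandates:

	extension UnicodeDecodingResult: Codable {
		private enum CodingKeys: String, CodingKey {
			case kind          // which case this is
			case scalarValue   // present only for .scalarValue
		}
		
		public func encode(to encoder: Encoder) throws {
			var container = encoder.container(keyedBy: CodingKeys.self)
			switch self {
			case .emptyInput:
				try container.encode("emptyInput", forKey: .kind)
			case .error:
				try container.encode("error", forKey: .kind)
			case .scalarValue(let scalar):
				try container.encode("scalarValue", forKey: .kind)
				try container.encode(scalar.value, forKey: .scalarValue)   // UInt32
			}
		}
		
		public init(from decoder: Decoder) throws {
			let container = try decoder.container(keyedBy: CodingKeys.self)
			switch try container.decode(String.self, forKey: .kind) {
			case "emptyInput":
				self = .emptyInput
			case "error":
				self = .error
			case "scalarValue":
				let value = try container.decode(UInt32.self, forKey: .scalarValue)
				guard let scalar = UnicodeScalar(value) else {
					throw DecodingError.dataCorruptedError(forKey: .scalarValue, in: container,
						debugDescription: "Invalid scalar value \(value)")
				}
				self = .scalarValue(scalar)
			case let kind:
				throw DecodingError.dataCorruptedError(forKey: .kind, in: container,
					debugDescription: "Unknown case \(kind)")
			}
		}
	}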
> Have you given any consideration to supporting types which only need to decode? That seems likely to be common when interacting with web services.
> 
> We have. Ultimately, we decided that the introduction of several protocols to cover encodability, decodability, and both was too much of a cognitive overhead, considering the number of other types we’re also introducing. You can always implement encode(to:) as fatalError().
> 
I understand that impulse.
> Structured types (i.e. types which encode as a collection of properties) encode and decode their properties in a keyed manner. Keys may be String-convertible or Int-convertible (or both),
> 
> What does "may" mean here? That, at runtime, the encoder will test for the preferred key type and fall back to the other one? That seems a little bit problematic.
> 
> Yes, this is the case. A lot is left up to the Encoder because it can choose to do something for its format that your implementation of encode(to:) may not have considered.
> If you try to encode something with an Int key in a string-keyed dictionary, the encoder may choose to stringify the integer if appropriate for the format. If not, it can reject your key, ignore the call altogether, preconditionFailure(), etc. It is also perfectly legitimate to write an Encoder which supports a flat encoding format — in that case, keys are likely ignored altogether, in which case there is no error to be had. We’d like to not arbitrarily constrain an implementation unless necessary.
> 
Wait, what? If it's ignoring the keys altogether, how does it know what to decode with each call? Do you have to decode in the same order you encoded?

(Or are you saying that the encoder might use the keys to match fields to, say, predefined fields in a schema provided to the encoder, but not actually write anything about the keys to disk? That would make sense. But ignoring the keys altogether doesn't.)

In general, my biggest concern with this design is that, in a hundred different places, it is very loosely specified. We have keyed containers, but the keys can convert to either, or both, or neither of two different types. We have encode and decode halves, but you only have to support one or the other. Nils are supported, but they're interpreted as equivalent to the absence of a value. If something encounters a problem or incompatibility, it may throw an error, trip a precondition, or simply ignore the call.

I worry that this is so loosely specified that you can't really trust an arbitrary type to work with an arbitrary encoder; you'll just have to hope that your testing touches every variation on every part of the object graph.

This kind of design is commonplace in Objective-C, but Swift developers often go to great lengths to expose these kinds of requirements to the type system so the compiler can verify them. For instance, I would expect a similar Swift framework to explicitly model the raw values of keys as part of the type system; if you tried to use a type providing string keys with an encoder that required integer keys, the compiler would reject your code. Even when something can't be explicitly modeled by the type system, Swift developers usually try to document guarantees about how to use APIs safely; for instance, Swift.RangeReplaceableCollection explicitly states that its calls may make indices retrieved before the call invalid, and individual conforming types document specific rules about which indices will keep working and which won't.

But your Encoder and Decoder designs seem to document semantics very loosely; they don't formally model very important properties, like "Does this coder preserve object identities*?" and "What sorts of keys does this coder use?", even when it's easy to do so, and now it seems like they also don't specify important semantics, like whether or not the encoder is required to inspect the key to determine the value you're looking for, either. I'm very concerned by that.

The design you propose takes advantage of several Swift niceties—Optional value types, enums for keys, etc.—and I really appreciate those things. But in its relatively casual attitude towards typing, it still feels like an Objective-C design being ported to Swift. I want to encourage you to go beyond that.



* That is, if you encode a reference to the same object twice and then decode the result, do you get one instance with two references, or two instances with one reference each? JSONEncoder can't provide that behavior, but NSKeyedArchiver can. There's no way for a type which won't encode properly without this property to reject encoders which cannot guarantee it.
> For these exact reasons, integer keys are not produced by code synthesis, only string keys. If you want integer keys, you’ll have to write them yourself. :)
> 
That's another thing I realized on a later reading and forgot to correct. Sorry about that.

(On the other hand, that reminds me of another minor concern: Your statement that superContainer() instances use a key with the integer value 0. I'd suggest you document that fact in boldface in the documentation for integer keys, because I expect that every developer who uses integer keys will want to start at key 0.)
> So I would suggest the following changes:
> 
> * The coding key always converts to a string. That means we can eliminate the `CodingKey` protocol and instead use `RawRepresentable where RawValue == String`, leveraging existing infrastructure. That also means we can call the `CodingKeys` associated type `CodingKey` instead, which is the correct name for it—we're not talking about an `OptionSet` here.
> 
> * If, to save space on disk, you want to also allow people to use integers as the serialized representation of a key, we might introduce a parallel `IntegerCodingKey` protocol for that, but every `CodingKey` type should map to `String` first and foremost. Using a protocol here ensures that it can be statically determined at compile time whether a type can be encoded with integer keys, so the compiler can select an overload of `container(keyedBy:)`.
> 
> * Intrinsically ordered data is encoded as a single-value container of type `Array<Codable>`. (I considered having an `orderedContainer()` method and type, but as I thought about it, I couldn't think of an advantage it would have over `Array`.)
> 
> This is possible, but I don’t see this as necessarily advantageous over what we currently have. In 99.9% of cases, CodingKey types will have string values anyway — in many cases you won’t have to write the enum yourself to begin with, but even when you do, derived CodingKey conformance will generate string values on your behalf.
> The only time a key will not have a string value is if the CodingKey protocol is implemented manually and a value is either deliberately left out, or there was a mistake in the implementation; in either case, there wouldn’t have been a valid string value anyway.
> 
Again, I think this might come down to an Objective-C vs. Swift mindset difference. The Objective-C mindset is often "very few people will do X, so we might as well allow it". The Swift mindset is more "very few people will do X, so we might as well forbid it". :^)

In this case: Very few people will be inconvenienced by a requirement that they provide strings in their CodingKeys, so why not require it? Doing so ensures that encoders can always rely on a string key being available, and with all the magic we're providing to ensure the compiler fills in the actual strings for you, users will not find the requirement burdensome.
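
Concretely, under the shape I'm suggesting, the common case stays exactly what code synthesis already produces, an enum with string raw values:

	private enum CodingKey: String {
		case id, name, birthDate   // raw values "id", "name", "birthDate", filled in by the compiler
	}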
> /// Returns an encoding container appropriate for holding a single primitive value.
> ///
> /// - returns: A new empty single value container.
> /// - precondition: May not be called after a prior `self.container(keyedBy:)` call.
> /// - precondition: May not be called after a value has been encoded through a previous `self.singleValueContainer()` call.
> func singleValueContainer() -> SingleValueEncodingContainer
> 
> Speaking of which, I'm not sure about single value containers. My first instinct is to say that methods should be moved from them to the `Encoder` directly, but that would probably cause code duplication. But...isn't there already duplication between the `SingleValue*Container` and the `Keyed*Container`? Why, yes, yes there is. So let's talk about that.
> 
> In the Alternatives Considered section of the proposal, we detail having done just this. Originally, the requirements now on SingleValueContainer sat on Encoder and Decoder.
> Unfortunately, this made it too easy to do the wrong thing, and required extra work (in comparison) to do the right thing.
> 
> When Encoder has encode(_ value: Bool?), encode(_ value: Int?), etc. on it, it’s very intuitive to try to encode values that way:
> 
> func encode(to encoder: Encoder) throws {
>     // The very first thing I try to type is encoder.enc… and guess what pops up in autocomplete:
>     try encoder.encode(myName)
>     try encoder.encode(myEmail)
>     try encoder.encode(myAddress)
> }
> This might look right to someone expecting to be able to encode in an ordered fashion, which is not what these methods do.
> In addition, for someone expecting keyed encoding methods, this is very confusing. Where are those methods? Why don’t these "default" methods have keys?
> 
> The very first time that code block ran, it would preconditionFailure() or throw an error, since those methods intend to encode only one single value.
> 
That's true. But this is mitigated by the fact that the mistake is self-correcting—it will definitely cause a precondition to fail the first time you make it.

However, I do agree that it's not really a good idea. I'm more interested in the second suggestion I had, having the Keyed*Container return a SingleValue*Container.
> The return type of decode(Int.self, forKey: .id) is Int. I’m not convinced that it’s possible to misconstrue that as the correct thing to do here. How would that return a nil value if the value was nil to begin with?
> 
I think people will generally assume that they're going to get out the value they put in, and will be surprised that something encode(_:) accepts will cause decode(_:) to error out. I do agree that the type passed to `decode(_:forKey:)` will make it relatively obvious what happened, but I think it'd be even better to just preserve the user's types.
> I think we'd be better off having `encode(_:forKey:)` not take an optional; instead, we should have `Optional` conform to `Codable` and behave in some appropriate way. Exactly how to implement it might be a little tricky because of nested optionals; I suppose a `none` would have to measure how many levels of optionality there are between it and a concrete value, and then encode that information into the data. I think our `NSNull` bridging is doing something broadly similar right now.
> 
> Optional cannot conform to Codable for the reasons given above. It is a primitive type much like Int and String, and it’s up to the encoder and the format to represent it.
> How would Optional encode nil?
> 
I discussed this above: Treat null-ness as a primitive value with its own encode() call and do something clever for nested Optionals.
> It's so simple, it doesn't even need to be specialized. You might even be able to get away with combining the encoding and decoding variants if the subscript comes from a conditional extension. `Value*Container` *does* need to be specialized; it looks like this (modulo the `Optional` issue I mentioned above):
> 
> Sure, let’s go with this for a moment. Presumably, then, Encoder would be able to vend out both KeyedEncodingContainers and ValueEncodingContainers, correct?
> 
Yes.
> public protocol ValueEncodingContainer {
> func encode<Value : Codable>(_ value: Value?, forKey key: Key) throws
> 
> I’m assuming that the key here is a typo, correct?
> 
Yes, sorry. I removed the forKey: label from the other calls, but not this one. (I almost left it on all of the calls, which would have been really confusing!)
> Keep in mind that combining these concepts changes the semantics of how single-value encoding works. Right now SingleValueEncodingContainer only allows values of primitive types; this would allow you to encode a value in terms of a different arbitrarily-codable value.
> 
Yes. I don't really see that as a problem; if you ask `Foo` to encode itself, and it only wants to encode a `Bar`, is anything really gained by insisting that it add a level of nesting first? More concretely: If you're encoding an enum with a `rawValue`, why not just encode the `rawValue`?
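
For instance, the rawValue case could look like this (hand-written here; I'm using names like codingPath for the key context, so treat the details as illustrative):

	enum Color: String, Codable {
		case red, green, blue
		
		init(from decoder: Decoder) throws {
			let raw = try decoder.singleValueContainer().decode(String.self)
			guard let color = Color(rawValue: raw) else {
				throw DecodingError.dataCorrupted(.init(codingPath: decoder.codingPath,
					debugDescription: "Unrecognized Color \(raw)"))
			}
			self = color
		}
		
		func encode(to encoder: Encoder) throws {
			var container = encoder.singleValueContainer()
			try container.encode(rawValue)   // no extra level of nesting
		}
	}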
> var codingKeyContext: [CodingKey]
> }
> 
> And use sites would look like:
> 
> func encode(to encoder: Encoder) throws {
>     let container = encoder.container(keyedBy: CodingKey.self)
>     try container[.id].encode(id)
>     try container[.name].encode(name)
>     try container[.birthDate].encode(birthDate)
> }
> 
> For consumers, this doesn’t seem to make much of a difference. We’ve turned try container.encode(id, forKey: .id) into try container[.id].encode(id).
> 
It isn't terribly different for consumers, although the subscript is slightly less wordy. But it means that encoders/decoders only provide one set of encoding/decoding calls—not two—and it allows some small bits of cleverness, like passing a SingleValue*Container off to a piece of code that's supposed to handle it.
> These types were chosen because we want the API to make static guarantees about concrete types which all Encoders and Decoders should support. This is somewhat less relevant for JSON, but more relevant for binary formats where the difference between Int16 and Int64 is critical.
> 
> This turns the concrete type check into a runtime check that Encoder authors need to keep in mind. What’s more, any type can conform to SignedInteger or UnsignedInteger as long as it fulfills the protocol requirements. I can write an Int37 type, but no encoder could make sense of that type, and that failure is a runtime failure. If you want a concrete example, Float80 conforms to FloatingPoint; no popular binary format I’ve seen supports 80-bit floats, though — we cannot prevent that call statically…
> 
> Instead, we want to offer a static, concrete list of types that Encoders and Decoders must be aware of, and that consumers have guarantees about support for.
> 
But this way instead forces encoders to treat a whole bunch of types as "primitive" which, to those encoders, aren't primitive at all.

Maybe it's just that we have different priorities here, but in general, I want an archiving system that (within reason) handles whatever types I throw at it, if necessary by augmenting the underlying encoder format with default Foundation-provided behavior. If a format only supports 64-bit ints and I throw a 128-bit int at it, I don't want it to truncate it or throw up its hands; I want it to read the two's-complement contents of the `BinaryInteger.words` property, convert it to a `Data` in some standard endianness, and write that out. Or convert to a human-readable `String` and use that. It doesn't matter a whole lot, as long as it does something it can undo later.
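
A rough sketch of the fallback I have in mind, assuming nothing beyond `BinaryInteger.words`; the function name is my own:

	import Foundation
	
	// Serialize any BinaryInteger as the little-endian bytes of its
	// two's-complement words; a decoder can reassemble the value later.
	func fallbackData<I: BinaryInteger>(for value: I) -> Data {
		var data = Data()
		for word in value.words {
			withUnsafeBytes(of: word.littleEndian) { data.append(contentsOf: $0) }
		}
		return data
	}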

I also like that a system with very few primitives essentially makes no assumptions about what a format will need to customize. A low-level binary format cares a lot about different integer sizes, but a higher-level one probably cares more about dates, URLs, and dictionaries. For instance, I suspect (hope?) that the JSONEncoder is going to hook Array and Dictionary to make them form JSON arrays and objects, not the sort of key-based representation NSKeyedArchiver uses (if I recall correctly). If we just provide, for instance, these:

	func encode(_ value: String) throws
	func encode(_ value: NSNull) throws
	func encode(_ value: Codable) throws

Then there's exactly one path to customization—test for types in `encode(_: Codable)`—and everyone will use it. If you have some gigantic set of primitives, many coders will end up being filled with boilerplate to funnel ten integer types into one or two implementations, and nobody will be really happy with the available set.

In reality, you'll probably need a few more than just these three, particularly since BinaryInteger and FloatingPoint both have associated types, so several very important features (like their `bitWidth` and `isSigned` properties) can only be accessed through a separate primitive. But the need for a few doesn't imply that we need a big mess of them, particularly when the difference is only relevant to one particular class of encoders.
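
To illustrate the single customization path, here's a deliberately tiny toy; every name is hypothetical, and a real coder would recurse via value.encode(to: self) rather than throwing:

	import Foundation
	
	final class ToyEncoder {
		private(set) var output: [String] = []
		
		func encode(_ value: String) throws {
			output.append(value)
		}
		
		func encode(_ value: Codable) throws {
			switch value {
			case let string as String:
				try encode(string)
			case let date as Date:
				// This format chooses to specialize Date as ISO 8601 text.
				try encode(ISO8601DateFormatter().string(from: date))
			default:
				// Recursion into value.encode(to:) elided; see note above.
				throw EncodingError.invalidValue(value, .init(codingPath: [],
					debugDescription: "Unsupported type \(type(of: value))"))
			}
		}
	}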
> To accommodate my previous suggestion of using arrays to represent ordered encoded data, I would add one more primitive:
> 
> func encode(_ values: [Codable]) throws
> 
> Collection types are purposefully not primitives here:
> 
> - If Array is a primitive, but does not conform to Codable, then you cannot encode Array<Array<Codable>>.
> - If Array is a primitive, and conforms to Codable, then there may be ambiguity between encode(_ values: [Codable]) and encode(_ value: Codable).
> - Even in cases where there are not, inside of encode(_ values: [Codable]), if I call encode([[1,2],[3,4]]), you’ve lost type information about what’s contained in the array — all you see is Codable.
> - If you change it to encode<Value : Codable>(_ values: [Value]) to compensate for that, you still cannot infinitely recurse on what type Value is. Try it with encode([[[[1]]]]) and you’ll see what I mean; at some point the inner types are no longer preserved.
Hmm, I suppose you're right.

Alternative design: In addition to KeyedContainers, you also have OrderedContainers. Like my proposed behavior for KeyedContainers, these merely vend SingleValue*Containers—in this case as an Array-like Collection.

	extension MyList: Codable {
		func encode(to encoder: Encoder) throws {
			let container = encoder.orderedContainer(self.count)
			
			for (valueContainer, elem) in zip(container, self) {
				try valueContainer.encode(elem)
			}
		}
		
		init(from decoder: Decoder) throws {
			let container = decoder.orderedContainer()
			
			self.init(try container.map { try $0.decode(Element.self) })
		}
	}

This helps us draw an important distinction between keyed and ordered containers. KeyedContainers locate a value based on the key. Perhaps the way in which it's based on the key is that it extracts an integer from the key and then finds the matching location in a list of values, but then that's just how keys are matched to values in that format. OrderedContainers, on the other hand, are contiguous, variable-length, and have an intrinsic order to them. If you're handed an OrderedContainer, you are meant to be able to enumerate its contents; a KeyedContainer is more opaque than that.
> (Also, is there any sense in adding `Date` to this set, since it needs special treatment in many of our formats?)
> 
> We’ve considered adding Date to this list. However, this means that any format that is a part of this system needs to be able to make a decision about how to format dates. Many binary formats have no native representations of dates, so this is not necessarily a guarantee that all formats can make.
> 
> Looking for additional opinions on this one.
> 
I think that, if you're taking the view that you want to provide a set of pre-specified primitive methods as a list of things you want encoders to make a policy decision about, Date is a good candidate. But as I said earlier, I'd prefer to radically reduce the set of primitives, not add to it.

IIUC, two of your three proposed, Foundation-provided coders need to do something special with dates; perhaps one of the three needs to do something special with different integer sizes and types. Think of that as a message about your problem domain.
> I see what you're getting at here, but I don't think this is fit for purpose, because arrays are not simply dictionaries with integer keys—their elements are adjacent and ordered. See my discussion earlier about treating inherently ordered containers as simply single-value `Array`s.
> 
> You’re right in that arrays are not simply dictionaries with integer keys, but I don’t see where we make that assertion here.
> 
Well, because you're doing all this with a keyed container. That sort of implies that the elements are stored and looked up by key.

Suppose you want to write n elements into a KeyedEncodingContainer. You need a different key for each element, but you don't know ahead of time how many elements there are. So I guess you'll need to introduce a custom key type for no particular reason:

	struct /* wat */ IndexCodingKeys: CodingKey {
		var index: Int
		
		init(stringValue: String?, intValue: Int?) throws {
			guard let i = intValue ?? stringValue.flatMap({ Int($0) }) else {
				throw …
			}
			index = i
		}
		
		var stringValue: String? {
			return String(index)
		}
		var intValue: Int? {
			return index
		}
	}

And then you write them all into keyed slots? And on the way back in, you inspect `allKeys` (assuming it's filled in at all, since you keep saying that coders don't necessarily have to use the keys), and use that to figure out the available elements, and decode them?

I'm just not sure I understand how this is supposed to work reliably when you combine arbitrary coders and arbitrary types.
> The way these containers are handled is completely up to the Encoder. An Encoder producing an array may choose to ignore keys altogether and simply produce an array from the values given to it sequentially. (This is not recommended, but possible.)
> 
Again, as I said earlier, this idea that a keyed encoder could just ignore the keys entirely is very strange and worrying to me. It sounds like a keyed container has no dependable semantics at all.

There's preserving implementation flexibility, and then there's being so vague about behavior that nothing has any meaning and you can't reliably use anything. I'm very worried that, in some places, this design leans towards the latter. A keyed container might not write the keys anywhere in the file, but it certainly ought to use them to determine which field you're looking for. If it doesn't—if the key is just a suggestion—then all this API provides is a naming convention for methods that do vaguely similar things, potentially in totally incompatible ways.
> This comes very close to—but doesn't quite—address something else I'm concerned about. What's the preferred way to handle differences in serialization to different formats?
> 
> Here's what I mean: Suppose I have a BlogPost model, and I can both fetch and post BlogPosts to a cross-platform web service, and store them locally. But when I fetch and post remotely, I need to conform to the web service's formats; when I store an instance locally, I have a freer hand in designing my storage, and perhaps need to store some extra metadata. How do you imagine handling that sort of situation? Is the answer simply that I should use two different types?
> 
> This is a valid concern, and one that should likely be addressed.
> 
> Perhaps the solution is to offer a userInfo : [UserInfoKey : Any] (UserInfoKey being a String-RawRepresentable struct or similar) on Encoder and Decoder set at the top-level to allow passing this type of contextual information from the top level down.
> 
At a broad level, that's a good idea. But why not provide something more precise than a bag of `Any`s here? You're in pure Swift; you have that flexibility.

	protocol Codable {
		associatedtype CodingContext = ()
		
		init<Coder: Decoder>(from decoder: Coder, with context: CodingContext) throws
		func encode<Coder: Encoder>(to encoder: Coder, with context: CodingContext) throws
	}
	protocol Encoder {
		associatedtype CodingContext = ()
		
		func container<Key : CodingKey>(keyedBy type: Key.Type) -> KeyedEncodingContainer<Key, CodingContext>
		…
	}
	class KeyedEncodingContainer<Key: CodingKey, CodingContext> {
		func encode<Value: Codable>(_ value: Value?, forKey key: Key, with context: Value.CodingContext) throws { … }
		
		// Shorthand when contexts are the same:
		func encode<Value: Codable>(_ value: Value?, forKey key: Key) throws
			where Value.CodingContext == CodingContext
		{ … }
		
		…
	}
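
And a use site, continuing the hypothetical API above (BlogPost and its members are invented for illustration):

	struct BlogPost {
		var title: String
		var localDraftNotes: String
		
		struct CodingContext {
			var includeLocalMetadata: Bool
		}
		
		enum CodingKeys: String, CodingKey {
			case title, localDraftNotes
		}
		
		func encode<Coder: Encoder>(to encoder: Coder, with context: CodingContext) throws {
			let container = encoder.container(keyedBy: CodingKeys.self)
			try container.encode(title, forKey: .title, with: ())   // String's context defaults to ()
			if context.includeLocalMetadata {
				try container.encode(localDraftNotes, forKey: .localDraftNotes, with: ())
			}
		}
	}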
> We don’t support this type of polymorphic decoding. Because no type information is written into the payload (there’s no safe way to do this that is not currently brittle), there’s no way to tell what’s in there prior to decoding it (and there wouldn’t be a reasonable way to trust what’s in the payload to begin with).
> We’ve thought through this a lot, but in the end we’re willing to make this tradeoff for security primarily, and simplicity secondarily.
> 
Well, `String(reflecting: typeInstance)` will give you the fully-qualified type name, so you can certainly write it. (If you're worried about `debugDescription` on types changing, I'm sure we can provide something, either public or as SPI, that won't.) You can't read it and convert it back to a type instance, but you can read it and match it against the type provided, including by walking into superContainer()s and finding the one corresponding to the type instance the user passed. Or you could call a type method on the provided type and ask it for a subtype instance to use for initialization, forming a sort of class cluster. Or, as a safety measure, you can throw if there's a class name mismatch.

(Maybe it'd be better to write out and check the key type, rather than the instance type. Hmm.)

Obviously not every encoder will want to write out types—I wouldn't expect JSONEncoder to do it, except perhaps with some sort of off-by-default option—but I think it could be very useful if added.
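
As a small concrete example of what an encoder could record:

	struct Wrapper { struct Inner {} }
	
	print(String(reflecting: Wrapper.Inner.self))
	// prints a fully-qualified name like "MyModule.Wrapper.Inner"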
> How important is this performance? If the answer is "eh, not really that much", I could imagine a setup where every "primitive" type eventually represents itself as `String` or `Data`, and each `Encoder`/`Decoder` can use dynamic type checks in `encode(_:)`/`decode(_:)` to define whatever "primitives" it wants for its own format.
> 
> Does this imply that Int32 should decide how it’s represented as Data? What if an encoder forgets to implement that?
> 
Yes, Int32 decides how—if the encoder doesn't do anything special to represent integers—it should be represented in terms of a more immediately serializable type like Data. If an encoder forgets to provide a special representation for Int32, then it falls back to a sensible, Foundation-provided default. If the encoder author later realizes their mistake and wants to correct the encoder, they'd probably better build backwards compatibility into the decoder.
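
Here's a sketch of what such a default might look like, using Itai's hypothetical Int37 from earlier, with every detail invented for illustration:

	import Foundation
	
	struct Int37: Codable {
		var storage: Int64   // 37 significant bits, carried in 64
		
		func encode(to encoder: Encoder) throws {
			var container = encoder.singleValueContainer()
			// Fallback representation: the little-endian bytes of the storage.
			try container.encode(withUnsafeBytes(of: storage.littleEndian) { Data($0) })
		}
		
		init(from decoder: Decoder) throws {
			let data = try decoder.singleValueContainer().decode(Data.self)
			// Assumes exactly 8 bytes; real code would validate the count.
			storage = Int64(littleEndian: data.withUnsafeBytes { $0.loadUnaligned(as: Int64.self) })
		}
	}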
> Again, we want to provide a static list of types that Encoders know they must handle, and thus, consumers have guarantees that those types are supported.
> 
I do think that consumers are guaranteed these types are supported: Even if the encoder doesn't do anything special, Foundation will write them out as simpler and simpler types until, sooner or later, you get to something that is supported, like Data or String. This is arguably a stronger level of guarantee than we have when there are a bunch of primitive types, because if an encoder author feels like nobody's going to actually use UInt8 when it's a primitive, the natural thing to do is to throw or trap. If the author feels the same way about UInt8 when it's not a primitive, then the natural thing to do is to let Foundation do what it does, which is write out UInt8 in terms of some other type.

-- 
Brent Royal-Gordon
Architechies
