[swift-evolution] [Review] SE-0168: Multi-Line String Literals

Fri Apr 7 22:55:52 CDT 2017

https://github.com/apple/swift-evolution/blob/master/proposals/0168-multi-line-string-literals.md

First of all, to be clear: although my name is on this proposal, it's there because I worked on it last year when we were exploring other approaches. I haven't worked on this version of the proposal and wasn't aware that it was up for review until the announcement went out. So the below is just my opinion; it does not represent the opinion of the other authors, and I haven't even talked to them about it. So don't take this as "even the authors aren't totally happy with this proposal". :^)

> On Apr 6, 2017, at 12:35 PM, Joe Groff via swift-evolution <swift-evolution at swift.org> wrote:
> 
> 	• What is your evaluation of the proposal?

I think that, if we want to do something in the style of Python's multiline string literals, this proposal specifies a good design for them. It addresses one of the main problems with them in Python—the fact that they typically have to be flush against the left margin—and, although we might want to tighten the description of the de-indenting algorithm, I think its proposal in this area is basically the right thing to do. Others have argued that we should use a #keyword or special syntax for this, but I don't think that's the right move; we should always have de-indenting on, because you usually want it, and when you don't, you can easily "turn it off" by positioning the delimiter so it infers zero indentation.

Specifically, here are the rules I would specify for indentation:

— If the leading delimiter is at the end of its line, *and* the characters between the trailing delimiter and the newline before it consist entirely of zero or more horizontal whitespace characters, the literal uses de-indenting.
	—If one of these conditions is true but not the other, we may want to emit a warning.

— If a literal uses de-indenting:
	1. The one newline after the leading delimiter is removed.
	2. The whitespace on the last line is removed.
	3. Each line is compared to the whitespace removed from the last line, and the longest prefix of that line which exactly matches the start of that whitespace is removed.
		— If we only matched a prefix of the last line's whitespace, but not all of it, we should emit a warning—*especially* if the non-matching character was a whitespace character, but not the one we expected.

So, for instance, this code (⎵ = space, ⇥ = tab, ↵ = newline):

	"""↵
	⎵⎵⎵⎵Foo↵
	⎵⎵⇥Bar↵
	⎵⎵⎵⎵"""

Creates a literal with these contents (and probably a warning on the "Bar" line):

	Foo↵
	⇥Bar↵

However, this code (with non-whitespace before the trailing `"""`):

	"""↵
	⎵⎵⎵⎵Foo↵
	⎵⎵⇥Bar"""

Would be:

	↵
	⎵⎵⎵⎵Foo↵
	⎵⎵⇥Bar

And this code (with no newline after the leading `"""`):

	"""⎵⎵⎵⎵Foo↵
	⎵⎵⇥Bar↵
	⎵⎵⎵⎵"""

Would be:

	⎵⎵⎵⎵Foo↵
	⎵⎵⇥Bar↵
	⎵⎵⎵⎵

Finally, if you escape a newline, the newline should not be included in the string, but de-indenting should still happen. So this string:

	"""↵
	⎵⎵⎵⎵Foo\↵
	⎵⎵⇥Bar\↵
	⎵⎵⎵⎵"""

Becomes this (plus a warning for the mismatched indentation):

	Foo⇥Bar
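
The indentation rules above can be sketched in a few lines. This is only an illustration under stated assumptions, written in Python for brevity rather than taken from any compiler: `deindent` and `join_lines` are hypothetical names, the input is the raw text between the two `"""` delimiters, the only escape handled is the escaped newline, and a line that is shorter than the indentation (a blank line, say) is assumed not to deserve a warning.

```python
HORIZONTAL_WS = " \t"


def join_lines(lines):
    """Rejoin lines, dropping any newline escaped by a trailing backslash."""
    parts = []
    for index, line in enumerate(lines):
        if index == len(lines) - 1:
            parts.append(line)  # nothing follows the last line
            continue
        trailing_backslashes = len(line) - len(line.rstrip("\\"))
        if trailing_backslashes % 2 == 1:
            parts.append(line[:-1])  # escaped newline: drop backslash and newline
        else:
            parts.append(line + "\n")
    return "".join(parts)


def deindent(body):
    """Return (contents, warnings) for the raw text between the delimiters."""
    warnings = []
    lines = body.split("\n")
    # Condition 1: the leading delimiter is at the end of its line.
    opens_at_line_end = len(lines) >= 2 and lines[0] == ""
    # Condition 2: only horizontal whitespace precedes the trailing delimiter.
    closes_after_ws = len(lines) >= 2 and lines[-1].strip(HORIZONTAL_WS) == ""
    if not (opens_at_line_end and closes_after_ws):
        if opens_at_line_end != closes_after_ws:
            warnings.append("only one of the two de-indenting conditions holds")
        return join_lines(lines), warnings

    indent = lines[-1]  # the whitespace removed from the last line
    result = []
    for number, line in enumerate(lines[1:-1], start=1):
        matched = 0
        while (matched < len(indent) and matched < len(line)
               and line[matched] == indent[matched]):
            matched += 1
        # Assumption: warn only when a differing character stopped the match.
        if matched < len(indent) and matched < len(line):
            warnings.append(f"content line {number}: indentation mismatch")
        result.append(line[matched:])
    result.append("")  # the last line contributes nothing after its whitespace goes
    return join_lines(result), warnings


contents, warnings = deindent("\n    Foo\n  \tBar\n    ")
print(repr(contents))  # 'Foo\n\tBar\n'
print(warnings)        # ['content line 2: indentation mismatch']
```

Fed the four examples above, the sketch reproduces the contents shown for each, including the two cases where de-indenting is switched off.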

* * *

However, I'm not sure Python-style quoting is our best alternative. But that's very much a matter of opinion, because there's no knockout winner in this area.

Basically, every multiline string literal design I'm aware of has at least two of these three weaknesses:

	1. It doesn't help you with strings which themselves contain quote marks (and, thus, you have to backslash them).

	2. There's little redundancy in the syntax, which makes parsing difficult, particularly when code is incomplete or out of context. Basically, at any point in the code, you might be in a multiline string literal or you might be in actual code, and the only way to know which is to look at the entire file before that point. And that assumes the code is well-formed. If someone forgets—or hasn't yet typed—a closing `"""`, you can tell that one of the multiline strings in the file is unclosed, but it's not clear which one. You end up applying a heuristic: It's *probably* the one before where you start getting syntax errors from the "code" you're parsing. But that heuristic can be wrong, particularly if you have Swift code that's generating Swift code.

	3. Conventional string literal syntax has you place the full content of the string in the place where it will go in the expression. That's fine for short strings, but when you start looking at long multiline strings, all that extra content makes the expression difficult to read. It's like having a hundred-word parenthetical in the middle of a sentence: By the time it's finished, you've forgotten where the outer sentence left off.

(The naïve multi-line string literal approach—"we'll just allow newlines in a normal string literal"—suffers from all three of these problems.)

Python-style quoting avoids #1, but has to deal with #2 and #3. That's clever in that it means one syntax plays two roles, but it's really easy to address #1 with a separate feature (like by making `'` an alternate delimiter, or adding a feature like Perl's `qq` which lets you choose any character as a delimiter), whereas #2 and #3 can't be fixed by simply adding another feature.

A design that addresses #2, while leaving #1 and #3 on the table, is continuation quotes. Basically, if a normal, one-double-quote-mark-delimited string literal is not terminated before the end of a line, the compiler looks at the next line. If the next line starts with optional whitespace followed by the opening string literal delimiter, then the string continues onto that line. Because every line of the string literal affirms that the string literal is meant to continue onto this line, mistakes are diagnosed very close to where the user made them. Syntax highlighting also doesn't require full knowledge of the file; a relatively naïve highlighter, like the ones common in text editors and CMSes, can consider each line in isolation and still apply correct highlighting to code. The main problem with this approach is that it doesn't allow you to paste text in verbatim—you must modify each line to add a leading quote mark—but I think editor support can address that shortcoming. It also still requires you to escape double-quotes in the text, but as I said, that can be addressed through a separate feature.
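
To make the "each line affirms the continuation" point concrete, here is a rough sketch of how a continuation-quoted literal might be read, again in Python purely for illustration. The details are assumptions, not part of any proposal text: `join_continuation_lines` and `find_unescaped_quote` are hypothetical names, a newline is assumed to be kept between continued lines, and the literal is assumed to end at the first unescaped `"` found after the opening one.

```python
def find_unescaped_quote(text):
    """Return the index of the first double quote not preceded by a backslash, or -1."""
    i = 0
    while i < len(text):
        if text[i] == "\\":
            i += 2  # skip the escaped character
        elif text[i] == '"':
            return i
        else:
            i += 1
    return -1


def join_continuation_lines(lines):
    """Join the source lines of one continuation-quoted literal into its contents."""
    pieces = []
    for number, line in enumerate(lines, start=1):
        if number == 1:
            text = line[line.index('"') + 1:]  # everything after the opening quote
        else:
            stripped = line.lstrip(" \t")
            if not stripped.startswith('"'):
                # The error is reported on the exact line that broke the pattern.
                raise SyntaxError(f"line {number}: expected a continuation quote")
            text = stripped[1:]
        end = find_unescaped_quote(text)
        if end != -1:
            pieces.append(text[:end])  # closing quote found: the literal is complete
            return "".join(pieces)
        pieces.append(text + "\n")  # unterminated line: the string continues
    raise SyntaxError("literal is not terminated")


source = [
    'let xml = "<catalog>',
    '    "  <book/>',
    '    "</catalog>"',
]
print(repr(join_continuation_lines(source)))  # '<catalog>\n  <book/>\n</catalog>'
```

Because every continued line must start with a quote mark, a missing one is caught on the line where it was left out, without scanning the rest of the file.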

A design that addresses #3, while leaving #1 (partially) and #2 on the table, is heredocs. In a heredoc, the expression contains a token indicating the presence of a multiline string, but the contents of the string don't actually start immediately. Instead, the contents begin on the next line. The token merely primes the parser to look for a multiline string and indicates where in the expression it belongs. This means the expression is easy to read and the string is slightly out-of-band. There's also a second advantage to heredocs: Although they don't help you with *single-line* strings that contain `"`, they do help you with *multi-line* strings that do. Traditionally, the heredoc token is `<<` followed by an arbitrary delimiter, but I actually really like the `"""` token from Python, so personally I would steal that and use it as both the token and the delimiter:

	assert( xml == """ )
	    <?xml version="1.0"?>
	    <catalog>
	        <book id="bk101" empty="">
	            <author>\(author)</author>
	            <title>XML Developer's Guide</title>
	            <genre>Computer</genre>
	            <price>44.95</price>
	            <publish_date>2000-10-01</publish_date>
	            <description>An in-depth look at creating applications with XML.</description>
	        </book>
	    </catalog>
	    """

Although this idea has not been fully fleshed out, I've also considered a sort-of-heredoc-ish syntax involving functions. Basically, by annotating a function or method with an attribute like `@quoted` (or, alternately, by using `"""` instead of `{` to open the body), you would change the interpretation of its body: it would be treated as the contents of a giant, interpolatable string literal instead of as Swift code. This encourages you to explicitly name and parameterize your long string literals, allows you to move them around in your source code (possibly even to separate files), and permits you to refactor them into code if that makes sense. But it also starts to look not so much like string literals anymore.

My point is this: There are a lot of plausible designs in this space, and they all have different trade-offs. Pros of the Python-style syntax in this proposal:

1. It's easy to explain.
2. It does double duty by also quoting short strings with double-quote marks in them.
3. It doesn't require you to modify the contents of the literal.

Cons:

1. Incomplete or incorrect code is very ambiguous, so diagnosing errors or highlighting code being written is imprecise and heuristic-driven.
2. It takes more work to syntax highlight even when code is correct.
3. It breaks up expressions, making them less readable.
4. The amount of de-indenting is determined by implication (vs. continuation quotes, where the ignored whitespace is explicit).

It might look bad when I write it like that, but the reality is that all possible syntaxes have similarly mixed trade-offs.

The core team may feel that, given Swift's goals, Python-style multiline string syntax is the right choice—it kills two birds with one stone and is easy to teach, so tool designers will just have to deal with the challenges it poses, and people will need to adopt styles where they're not used in long expressions. But when the core team deliberates, I hope they will discuss and weigh these trade-offs explicitly. If they do choose to reject, I hope they communicate their priorities to us so we can write a proposal that matches them. But if they feel this type of string literal syntax meets their goals for multiline string literals, I urge them to accept this proposal.

> 	• Is the problem being addressed significant enough to warrant a change to Swift?

Yes. The current need to heavily escape string literals—and especially multiline ones—makes code that works with strings difficult to read and unpleasant to write. A new feature is sorely needed in this area.

> 	• Does this proposal fit well with the feel and direction of Swift?

There is definitely something lightweight and parsimonious about the Python design; that feels a bit Swifty.

> 	• If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

I've spent a lot of my career in dynamic languages with good string-handling facilities, including strong string literal syntax features. A brief survey of the languages in this category:

* Perl 5 has highly configurable quoting facilities permitting arbitrary delimiters, plus heredocs. It does not have de-indenting, though.
* Perl 6 has everything Perl 5 does, plus de-indenting and the ability to write your own quoting mechanisms.
* Ruby also has Perl 5-like quoting facilities, plus de-indenting.
* Python has a syntax that directly inspired this one.

To explain in a little more detail, Perl 5 and Ruby have several different interlocking quote features:

	1. A string delimited by double quotes supports backslash escapes and interpolation; a string delimited by single quotes does not.

	2. There are special tokens which are equivalent to single- and double-quoted strings. (`q` and `qq` in Perl; `%q` and `%Q` in Ruby.) After the token, the next non-whitespace, non-identifier character (sort of) becomes a delimiter; when it is repeated un-backslashed, the string literal ends. As a special exception, starting delimiters like `<`, `[`, `(` and `{` are terminated by their matching closing character, and non-backslashed inner instances of these characters are counted and balanced.

	3. They also support heredocs. In these languages, a heredoc token is a `<<` token followed by a short string in any of the other literal syntaxes; the contents of that literal are the delimiter for the heredoc, while the quoting style used on the literal controls the quoting style of the heredoc. (Ruby also uses certain prefix characters on the delimiter to describe various whitespace-stripping behaviors.)

	4. You can put any character, including newlines, in any of these string literal types.
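
The delimiter rule in item 2 can be sketched roughly as follows. This is a simplified illustration in Python, not how Perl or Ruby actually tokenize: `scan_quoted` is a hypothetical helper, it starts at the delimiter itself (skipping the `q`/`%q` token and any whitespace is left out), and it returns the raw contents without interpreting backslash escapes.

```python
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def scan_quoted(source):
    """Scan a literal that starts at its opening delimiter; return (contents, rest)."""
    opener = source[0]
    closer = PAIRS.get(opener, opener)  # non-bracket delimiters close themselves
    depth = 0
    i = 1
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            i += 2  # a backslashed character never opens or closes anything
            continue
        if closer != opener and ch == opener:
            depth += 1  # nested opening bracket: must be balanced later
        elif ch == closer:
            if depth == 0:
                return source[1:i], source[i + 1:]
            depth -= 1
        i += 1
    raise ValueError("unterminated literal")


print(scan_quoted("(a (b) c) tail"))  # ('a (b) c', ' tail')
print(scan_quoted("!it's!"))          # ("it's", '')
```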

Perl 6 refactors these features a bit by rebuilding them around a modifier system on the `q` and `qq` operators; for instance, `q:b(zyz\nyx)` uses single-quote semantics except that it interprets backslash escapes, while `qq:to(END)` creates a double-quote-style heredoc delimited by `END`.

Python, on the other hand, has a much simpler set of features:

	1. There are two single-line string delimiters: `'` and `"`. They have identical semantics; you just choose one based on style or the literal's contents. Newlines are not allowed in these strings, unless they're escaped with a backslash (in which case they're ignored).

	2. There are two multi-line string delimiters: `'''` and `"""`. Again, they have identical semantics. Newlines *are* allowed in these strings.

	3. An `r` or `R` prefix on the string literal disables processing of backslash escapes. (Well, sort of—a backslash still prevents a quote mark after it from ending the string, but the backslash itself is also left in.)

Note that these rules *are* different from the ones proposed for Swift: Python includes every character between the multi-line string delimiters in the string, whereas we are removing indentation and a leading newline.
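
For reference, the Python behaviors listed above can be checked with a short script. The variable names are arbitrary, and the snippet must sit at the left margin, since any indentation inside the literals becomes part of their contents.

```python
# Backslash-newline inside a single-line literal is dropped entirely.
single = "one \
two"

# Triple-quoted literals keep every character, including the leading
# newline and all indentation.
multi = """
    Foo
    Bar
    """

# A raw literal keeps the backslash, yet the backslash still stops the
# quote mark after it from ending the string.
raw = r"a\"b"

print(repr(single))  # 'one two'
print(repr(multi))   # '\n    Foo\n    Bar\n    '
print(repr(raw))     # 'a\\"b'
```

The second result is the one that differs from the Swift proposal, which would strip the leading newline and the four-space indentation.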

In my experience, I much prefer the Perl-style model with several different orthogonal features to the Python style with a couple of minimal features. The Python-style quotes are usually acceptable but almost never ideal, whereas in the Perl/Ruby style, one tool or another always feels like a perfect fit. But I also have *much* more experience with the Perl style, so that might just be what I'm used to. And Perl is a language specifically built for string handling, while Swift is not; it may not make sense to devote anywhere near as much syntactic attention to string literals.

> 	• How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Well, I might not have been involved with it the whole time through, but my name *is* on the proposal, right?

-- 
Brent Royal-Gordon
Architechies
