[swift-evolution] Strings in Swift 4
Ted F.A. van Gaalen
tedvgiosdev at gmail.com
Tue Feb 14 10:29:56 CST 2017
Hi Ben,
In the case of processing all fields sequentially from a record,
that are directly adjacent to each other, your method
(tested on playground) is convenient, also because in this
case all I need to specify are the fields’ lengths…
It’s like slicing bread :o)
However, if one just wants to extract fields directly
from somewhere in the middle, without having to deal
with the rest of the record's content, it gets quite clumsy.
Currently, AFAIK, no fewer than three statements are needed to do so:
var rec = record // create a mutable copy.
rec.dropPrefix(startpos) // ignore function result here
let result = rec.dropPrefix(length)
(unless you know of a better way to do this? )
This is of course undesirable, so I’ve made a String extension
to do this:
record.midstr(at: pos, length: len)
(see tested example below)
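For comparison, a sketch of how the same extraction can be written as a single expression in Swift 4 or later, where String is itself a Collection (an assumption; it does not hold for the Swift 3 playground used in this mail). It relies only on the standard library's non-mutating dropFirst(_:) and prefix(_:). The record is rebuilt here from the padded fields of the example further down:

```swift
// Assumes Swift 4 or later, where String conforms to Collection.
// The record is assembled from its padded fixed-width fields
// (widths 10, 4, 16, 30, 10, 10, 10, 1).
let record = [
    "123A.534.C",
    "MCU3",
    "Arduino Due     ",
    "Arm 32-bit Micro controller.  ",
    "0000000341",
    "0000000568",
    "0000002250",
    "$",
].joined()

// Skip 30 characters, then take the next 30: the description field.
let description = String(record.dropFirst(30).prefix(30))

// Skip 60 characters, then take the next 10: the number of items in stock.
let inStock = Int(record.dropFirst(60).prefix(10))

print(description)   // Arm 32-bit Micro controller. (plus two padding spaces)
print(inStock ?? -1) // 341
```

Both calls clamp instead of trapping when the record is shorter than expected, and both still walk the string from its start, so this is a convenience, not a constant-time lookup.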
( sidenote:
suggestion: imho dropPrefix should be three functions:
- s2 = s1.getPrefix(len) // which does not change s1 !!
- s2 = s1.getAndDropPrefix(len) // which also removes the prefix from s1.
- s1.dropPrefix(len) // which drops the prefix but does not return anything.
because at first sight, intuitively, one doesn’t expect that
s1.dropPrefix(len), as it is now, in fact also returns the dropped part of s1 (which the caller may ignore)..
In that case one could regard s2 = s1.dropPrefix(len) as a function having a side-effect,
which can be considered bad programming practice...
).
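A rough sketch of how this three-way split could look, assuming Swift 4 or later. The first and third behaviours already exist in the standard library as prefix(_:) (non-mutating) and removeFirst(_:) (mutating, returns nothing). getAndDropPrefix is only the hypothetical name suggested above, not an existing API:

```swift
// Assumes Swift 4 or later. getAndDropPrefix is a hypothetical name taken
// from the suggestion above; it is not part of the standard library.
extension String {
    /// Returns the first n characters and removes them from self.
    /// Traps if n is larger than the number of characters, like removeFirst(_:).
    mutating func getAndDropPrefix(_ n: Int) -> String {
        let head = String(prefix(n))
        removeFirst(n)
        return head
    }
}

var s1 = "123A.534.CMCU3"

let peeked = String(s1.prefix(10))   // does not change s1
let id = s1.getAndDropPrefix(10)     // returns the prefix and removes it from s1
s1.removeFirst(2)                    // drops two more characters, returns nothing

print(peeked, id, s1)                // 123A.534.C 123A.534.C U3
```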
see further in-line comments
// this particular API/implementation for demonstration only,
// not necessarily quite what will be proposed
extension Collection where SubSequence == Self
{
    /// Drop n elements from the front of `self` in-place,
    /// returning the dropped prefix.
    mutating func dropPrefix(_ n: IndexDistance) -> SubSequence {
        // nature of error handling/swallowing/trapping/optional
        // returning here TBD...
        let newStart = index(startIndex, offsetBy: n)
        defer { self = self[newStart..<endIndex] }
        return self[startIndex..<newStart]
    }
}
// soon...
// (TedvG) never depend on expectations such as "soon" :o)
extension String: Collection { }
struct Product
{
    var id, group, name, description, currency: String
    var inStock, ordered, price: Int
    var priceFormatted: String
    {
        let whole = price / 100
        let cents = price % 100
        // pad single-digit cents, so that e.g. 2205 prints as 22.05
        return currency + " \(whole)." + (cents < 10 ? "0" : "") + "\(cents)"
    }
    init(inputrecord: String)
    {
        // note, no copying will occur here, as String is
        // copy-on-write and there’s no writing happening
        var record = inputrecord
        id = record.dropPrefix(10)
        group = record.dropPrefix( 4)
        name = record.dropPrefix(16)
        description = record.dropPrefix(30)
        inStock = Int(record.dropPrefix(10))!
        ordered = Int(record.dropPrefix(10))!
        price = Int(record.dropPrefix(10))!
        currency = record.dropPrefix( 1)
    }
}
let record = "123A.534.CMCU3Arduino Due     Arm 32-bit Micro controller.  000000034100000005680000002250$"
// var rec = record
//var descr = rec.dropPrefix(10 + 4 + 16).dropPrefix(30)
// does not work:
// Attempt to extract a substring from somewhere in the middle of the source
// string by cascading in a single statement has failed,
// because the dropPrefix(n) function's result is immutable
// pre-compiler diagnostic msg:
// "Cannot use member on immutable value: function call returns immutable value."
// (TedvG) I've made this string extension to allow direct substring access which
// can be regarded as a work-around to provide for str[a..<b] as yet not implemented.
// (using midstr() instead of substr() in this playground example
// to avoid confusion with existing substring functions.)
extension String
{
    func midstr(at: Int, length: Int) -> String // error optionally throw...
    {
        var s = self // make a mutable copy of self (aString)
        _ = s.dropPrefix(at) // skip the leading part; its result is not needed
        return s.dropPrefix(length)
    }
}
func test()
{
    let product = Product(inputrecord: record)
    print("====== Product data for the item with ID: \(product.id) ================")
    print("group : \(product.group)")
    print("name : \(product.name)")
    print("description : \(product.description)")
    print("items available: \(product.inStock)")
    print("items ordered : \(product.ordered)")
    print("price per item : \(product.priceFormatted)")
    print("=====================================================================")
    // (TedvG) I've added this midstr. extension usage:
    print("=== Extracting substrings at random positions and length directly: ==")
    print("description : \( record.midstr(at: 30, length: 30) )")
    print("items available: \(Int(record.midstr(at: 60, length: 10))!).")
    print("=====end of test.====================================================")
}
test()
// test() prints this:
====== Product data for the item with ID: 123A.534.C ================
group : MCU3
name : Arduino Due
description : Arm 32-bit Micro controller.
items available: 341
items ordered : 568
price per item : $ 22.50
=====================================================================
=== Extracting substrings at random positions and length directly: ==
description : Arm 32-bit Micro controller.
items available: 341.
=====end of test.====================================================
> On 13 Feb 2017, at 02:49, Ben Cohen <ben_cohen at apple.com> wrote:
>
> Hi Ted,
>
> Dave is on vacation for the next two weeks so this is a reply on behalf of both him and me:
Dave! if you are reading this now, you are not spending your vacation as it should be!
Put your iPhone away immediately, and resume sipping your piña colada! (<- hey look Unicode! :o)
>
>> On Feb 12, 2017, at 10:17, "Ted F.A. van Gaalen" <tedvgiosdev at gmail.com> wrote:
>
>>> On 11 Feb 2017, at 18:33, Dave Abrahams <dabrahams at apple.com> wrote:
>>>
>>> All of these examples should be efficiently and expressively handled by the pattern matching API mentioned in the proposal. They definitely do not require random access or integer indexing.
>>>
>> Hi Dave,
>> then I am very interested to know how to unpack aString (e.g. read from a file record such as in the previous example:
>> 123534-09EVXD4568,991234,89ABCYELLOW12AGRAINESYTEMZ3453 )
>> without using direct subscripting like str[n1…n2] ?
>
> If you look again at the code I sent previously, it demonstrates how you can use lengths to move forward through a string without needing random access for your particular use case.
>
>> (which btw is for me the most straightforward and ideal method)
>> conditions:
>> -The source string contains fields of known position (offset) and length, concatenated together
>> without any separators (like in a CSV)
>> -the contents of each field is unpredictable.
>> which excludes the use of pattern-matching.
>
> Pattern matching isn’t just about matching known contents. Think of the regex “...”. This is a pattern that matches any 3 characters. While full regex support is out of scope for the current discussions, the intention is for the pattern matching part of the proposal to handle this kind of use case.
>
>> -the source string needs to be unpacked in independent strings.
>>
>> I made this example: (the comments also stress my point)
>>
>
> Here is another way of implementing your example in a form that doesn’t require random access.
>
> Putting aside pattern matching for now, assume that there is an API on String that lets you drop a specific-length prefix from a Substring (for now in Swift 3, that's a String). An API like this (probably taking any pattern as its argument, not just a length) is likely to be proposed to evolution soon once we move into that phase of the 4.0 String project.
>
> // this particular API/implementation for demonstration only,
> // not necessarily quite what will be proposed
> extension Collection where SubSequence == Self {
> /// Drop n elements from the front of `self` in-place,
> /// returning the dropped prefix.
> mutating func dropPrefix(_ n: IndexDistance) -> SubSequence {
> // nature of error handling/swallowing/trapping/optional
> // returning here TBD...
> let newStart = index(startIndex, offsetBy: n)
> defer { self = self[newStart..<endIndex] }
> return self[startIndex..<newStart]
> }
> }
> // soon...
> extension String: Collection { }
>
> Given this, here’s your example code written using it (compacted a little for brevity):
>
> struct Product {
> var id, group, name, description, currency: String
> var inStock, ordered, price: Int
>
> var priceFormatted: String {
> let whole = price / 100
> let cents = price % 100
> // pad single-digit cents, so that e.g. 2205 prints as 22.05
> return currency + " \(whole)." + (cents < 10 ? "0" : "") + "\(cents)"
> }
>
> init(inputrecord: String) {
> // note, no copying will occur here, as String is
> // copy-on-write and there’s no writing happening
> var record = inputrecord
>
> id = record.dropPrefix(10)
> group = record.dropPrefix(4)
> name = record.dropPrefix(16)
> description = record.dropPrefix(30)
> inStock = Int(record.dropPrefix(10))!
> ordered = Int(record.dropPrefix(10))!
> price = Int(record.dropPrefix(10))!
> currency = record.dropPrefix(1)
> }
> }
> let record = "123A.534.CMCU3Arduino Due     Arm 32-bit Micro controller.  000000034100000005680000002250$"
> let product = Product(inputrecord: record)
> print("=== Product data for the item with ID: \(product.id) ====")
> print("group : \(product.group)")
> print("name : \(product.name)")
> print("description : \(product.description)")
> print("items in stock : \(product.inStock)")
> print("items ordered : \(product.ordered)")
> print("price per item : \(product.priceFormatted)")
> print("=========================================================")
>
> Now, other use cases might not have such a straightforward solution. But for the example here, this approach ought to suffice, or be a starting point for similar cases needing error handling, skipped regions etc.
>
>> Isn’t that an elegant solution or what?
>
> Unfortunately not. Adding integer subscripting to String via an extension that uses index(_:offsetBy:) is a commonly proposed idea that we strongly caution against. Strings use an opaque index rather than integers for a reason; it’s not an oversight.
>
> The reason being: if ever your string contains more than just ASCII characters, then advancing a String's startIndex to the nth element becomes a linear-time operation, because Characters are variable length. As a result, every one of your uses of that subscript takes linear time. If you use them in a loop, then code that looks linear is actually (probably accidentally) quadratic.
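To make the variable-length point concrete, a small sketch, assuming Swift 4 or later (where String is itself a Collection of Characters). The sample strings are made up for illustration:

```swift
let ascii = "cafe"
let accented = "cafe\u{301}"  // "café", written as "e" followed by a combining acute accent

// Characters, Unicode scalars and UTF-8 bytes agree only for the ASCII string.
print(ascii.count, ascii.unicodeScalars.count, ascii.utf8.count)          // 4 4 4
print(accented.count, accented.unicodeScalars.count, accented.utf8.count) // 4 5 6

// In general, reaching the Character at offset 3 means stepping from
// startIndex one Character at a time, because Characters have no fixed width.
let fourth = accented.index(accented.startIndex, offsetBy: 3)
print(accented[fourth])  // é
```

Calling index(_:offsetBy:) inside a loop over the offsets 0..<n repeats that walk each time, which is where the quadratic behaviour comes from.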
Which could leave the impression that current string handling is badly designed/structured?
From an OOP point of view, I would expect each (Unicode) character to be (an object/instance of) a Unicode Character class,
which would imply that each element in aString would be (nothing more than) a reference to a single Character object (in Swift, a GraphemeCluster).
In that case, it would not make any difference at all whether this Character object instance itself contains a single element or more.
This method has its (perhaps optimisable) look-up performance drawbacks, but it would cancel out the performance drawbacks that are
(probably) caused by traversing aString sequentially at runtime each and every time you need a substring of it..?
Before getting into all this (aarrrghh, how do I get out :o) I did assume that str.characters
had such independent behaviour (has GraphemeClusters as directly accessible elements);
however, that is not the case, because it is (as it appears to me) a limited view upon String itself.
Still, how things should then appear graphically on an output device (e.g. print positioning)
is a matter for the displaying sub-system only; that’s what atomic GraphemeClusters are for, isn’t it?
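The idea of directly addressable character elements can be approximated by copying the string into an Array of Character values. A minimal sketch, assuming Swift 4 or later (in Swift 3 one would pass record.characters instead). It pays for one linear pass and the extra storage up front, after which integer subscripts are constant-time:

```swift
// Assumes Swift 4 or later, where Array(someString) yields [Character].
// The record is assembled from its padded fixed-width fields
// (widths 10, 4, 16, 30, 10, 10, 10, 1).
let record = [
    "123A.534.C",
    "MCU3",
    "Arduino Due     ",
    "Arm 32-bit Micro controller.  ",
    "0000000341",
    "0000000568",
    "0000002250",
    "$",
].joined()

// One linear pass; each element is a whole Character (grapheme cluster).
let characters = Array(record)

// Array subscripts take plain integer offsets.
let description = String(characters[30..<60])
let price = Int(String(characters[80..<90]))

print(characters.count)  // 91
print(price ?? -1)       // 2250
```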
>
> Now, sometimes, when the String knows it only contains ASCII, it might be able to do the advance in constant time. But we still recommend against these kinds of extensions to avoid performance pitfalls if ever you are handling strings where this isn’t the case. There are other techniques like the one shown above that achieve the same goal just as well.
In the example given it is 100% certain that all String elements are ASCII.
How then, I might ask, can processing a row of just ASCII characters lead to performance pitfalls?
Thanks,
Met vriendelijke groeten
TedvG
www.tedvg.com
www.ravelnotes.com
>
>
>> I might start a very lengthy discussion here about the threshold of where and how
>> to protect the average programmer (like me :o) from falling into language pitfalls
>> and to what extent these have an effect on working with a PL. One cannot make
>> a PL idiot-proof. Of course, I agree a lot of it makes sense, and also the “intelligence”
>> of the Swift compiler (sometimes it almost feels as if it sits next to me looking at
>> the screen and shaking its head from time to time). But hey, remember most of
>> us in our profession have a brain too.
>> (btw, if you know of a way to let Xcode respect in-between spaces when auto-formatting please let me know, thanks)
>>
>> @Ben Cohen:
>> Hi, you wrote:
>> "p.s. as someone who has worked in a bank with thousands of ancient file formats, no argument from me that COBOL rules :)"
>> Although most accounting software is still Cobol (mostly because it is too expensive
>> and risky to convert to newer technologies), I don’t think that Cobol rules, and new apps definitely should
>> not be written in Cobol. I wouldn’t be doing Swift if I thought otherwise.
>> If I were doing a Cobol project again, it would be with the same enjoyment as, say,
>> a 2017 mechanical engineer working on a steam locomotive of a touristic railroad.
>
> Indeed. It was in this nostalgic spirit that my comment was meant.
>
>> which I would do with dedication as well. However, never use this comparison
>> at the hiring interview..:o)
>>
>>
>> Kind Regards
>> TedvG
>>
>>
>>
>>
>>
>>
>>
>>> Sent from my moss-covered three-handled family gradunza
>>>
>>> On Feb 9, 2017, at 5:09 PM, Ted F.A. van Gaalen <tedvgiosdev at gmail.com> wrote:
>>>
>>>>
>>>>> On 10 Feb 2017, at 00:11, Dave Abrahams <dabrahams at apple.com> wrote:
>>>>>
>>>>>
>>>>> on Thu Feb 09 2017, "Ted F.A. van Gaalen" <tedvgiosdev-AT-gmail.com> wrote:
>>>>>
>>>>>> Hello Shawn
>>>>>> Just google with any programming language name and “string manipulation”
>>>>>> and you have enough reading for a week or so :o)
>>>>>> TedvG
>>>>>
>>>>> That truly doesn't answer the question. It's not, “why do people index
>>>>> strings with integers when that's the only tool they are given for
>>>>> decomposing strings?” It's, “what do you have to do with strings that's
>>>>> hard in Swift *because* you can't index them with integers?”
>>>>
>>>> Hi Dave,
>>>> Ok. here are just a few examples:
>>>> Parsing and validating an ISBN code? or a (freight) container ID? or EAN13 perhaps?
>>>> or many of the typical combined article codes and product IDs that many factories and shops use?
>>>>
>>>> or:
>>>>
>>>> E.g. processing legacy files from IBM mainframes:
>>>> extract fields from ancient data records read from very old sequential files,
>>>> say, a product data record like this from a file from 1978 you’d have to unpack and process:
>>>> 123534-09EVXD4568,991234,89ABCYELLOW12AGRAINESYTEMZ3453
>>>> into:
>>>> 123, 534, -09, EVXD45, 68,99, 1234,99, ABC, YELLOW, 12A, GRAIN, ESYSTEM, Z3453.
>>>> product category, pcs, discount code, product code, price Yen, price $, class code, etc…
>>>> In Cobol and PL/1, records are nearly always defined with a fixed field layout like this:
>>>> (storage was limited and very, very expensive, e.g. XML would be regarded as a
>>>> "scandalous waste" even the commas in CSV files! )
>>>>
>>>> 01 MAILING-RECORD.
>>>>    05 COMPANY-NAME            PIC X(30).
>>>>    05 CONTACTS.
>>>>       10 PRESIDENT.
>>>>          15 LAST-NAME         PIC X(15).
>>>>          15 FIRST-NAME        PIC X(8).
>>>>       10 VP-MARKETING.
>>>>          15 LAST-NAME         PIC X(15).
>>>>          15 FIRST-NAME        PIC X(8).
>>>>       10 ALTERNATE-CONTACT.
>>>>          15 TITLE             PIC X(10).
>>>>          15 LAST-NAME         PIC X(15).
>>>>          15 FIRST-NAME        PIC X(8).
>>>>    05 ADDRESS                 PIC X(15).
>>>>    05 CITY                    PIC X(15).
>>>>    05 STATE                   PIC XX.
>>>>    05 ZIP                     PIC 9(5).
>>>>
>>>> These are all character data fields here, except for the numeric ZIP field; however, in Cobol it can be treated like character data.
>>>> So here I am, having to get the data of these old Cobol production files
>>>> into a brand new Swift based accounting system of 2017, what can I do?
>>>>
>>>> How do I unpack these records and bring the data into a Swift structure or class?
>>>> (In Cobol I don’t have to because of the predefined fixed format record layout).
>>>>
>>>> AFAIK there are no similar record structures with fixed fields like this available in Swift?
>>>>
>>>> So, the only way I can think of right now is to do it like this:
>>>>
>>>> // mailingRecord is a Swift structure
>>>> struct MailingRecord
>>>> {
>>>> var companyName: String = "no Name"
>>>> var contacts: CompanyContacts
>>>> .
>>>> etc..
>>>> }
>>>>
>>>> // recordStr was read here with ASCII encoding
>>>>
>>>> // unpack data in to structure’s properties, in this case all are Strings
>>>> mailingRecord.companyName = recordStr[ 0..<30]
>>>> mailingRecord.contacts.president.lastName = recordStr[30..<45]
>>>> mailingRecord.contacts.president.firstName = recordStr[45..<53]
>>>>
>>>>
>>>> // and so on..
>>>>
>>>> Ever worked for e.g. a bank with thousands of these files, their formats unchanged for years?
>>>>
>>>> Are any alternative, more convenient and simpler methods present in Swift?
>>>>
>>>> Kind Regards
>>>> TedvG
>>>> ( example of the above Cobol record borrowed from here:
>>>> http://www.3480-3590-data-conversion.com/article-reading-cobol-layouts-1.html )
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>>> On 9 Feb 2017, at 16:48, Shawn Erickson <shawnce at gmail.com> wrote:
>>>>>>>
>>>>>>> I also wonder what folks are actually doing that requires indexing
>>>>>>> into strings. I would love to see some real world examples of what
>>>>>>> and why indexing into a string is needed. Who is the end consumer of
>>>>>>> that string, etc.
>>>>>>>
>>>>>>> Do folks have some examples?
>>>>>>>
>>>>>>> -Shawn
>>>>>>>
>>>>>>> On Thu, Feb 9, 2017 at 6:56 AM Ted F.A. van Gaalen via swift-evolution <swift-evolution at swift.org> wrote:
>>>>>>> Hello Hooman
>>>>>>> That invalidates my assumptions, thanks for evaluating
>>>>>>> it's more complex than I thought.
>>>>>>> Kind Regards
>>>>>>> Ted
>>>>>>>
>>>>>>>> On 8 Feb 2017, at 00:07, Hooman Mehr <hooman at mac.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Feb 7, 2017, at 12:19 PM, Ted F.A. van Gaalen via swift-evolution <swift-evolution at swift.org> wrote:
>>>>>>>>>
>>>>>>>>> I now assume that:
>>>>>>>>> 1. -= a “plain” Unicode character (codepoint?) can result in one glyph.=-
>>>>>>>>
>>>>>>>> What do you mean by “plain”? Characters in some Unicode scripts are
>>>>>>>> by no means “plain”. They can affect (and be affected by) the
>>>>>>>> characters around them, they can cause glyphs around them to
>>>>>>>> rearrange or combine (like ligatures) or their visual
>>>>>>>> representation (glyph) may float in the same space as an adjacent
>>>>>>>> glyph (and seem to be part of the “host” glyph), etc. So, the
>>>>>>>> general relationship of a character and its corresponding glyph (if
>>>>>>>> there is one) is complex and depends on context and surrounding
>>>>>>>> characters.
>>>>>>>>
>>>>>>>>> 2. -= a grapheme cluster always results in just a single glyph, true? =-
>>>>>>>>
>>>>>>>> False
>>>>>>>>
>>>>>>>>> 3. The only thing that I can see on screen or print are glyphs (“carvings”,visual elements that stand on their own )
>>>>>>>>
>>>>>>>> The visible effect might not be a visual shape. It may be for example, the way the surrounding shapes change or re-arrange.
>>>>>>>>
>>>>>>>>> 4. In this context, a glyph is a humanly recognisable visual form of a character,
>>>>>>>>
>>>>>>>> Not in a straightforward one to one fashion, not even in Latin / Roman script.
>>>>>>>>
>>>>>>>>> 5. On this level (the glyph, what I can see as a user) it is not relevant and also not detectable
>>>>>>>>> with how many Unicode scalars (codepoints ?), grapheme, or even on what kind
>>>>>>>>> of encoding the glyph was based upon.
>>>>>>>>
>>>>>>>> False
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> swift-evolution mailing list
>>>>>>> swift-evolution at swift.org
>>>>>>> https://lists.swift.org/mailman/listinfo/swift-evolution
>>>>>>
>>>>>
>>>>> --
>>>>> -Dave
>>>>
>>
>