[swift-users] Implementing String(contentOfUrl:usedEncoding)

Mohit Athwani mohit.athwani at gmail.com
Wed Feb 22 23:13:56 CST 2017


Hey Jens,

Thanks so much! This is really useful! I'm going to get started on this.

Cheers!
Mohit

On Wed, Feb 22, 2017 at 9:09 PM, Jens Alfke <jens at mooseyard.com> wrote:

>
> On Feb 22, 2017, at 6:05 PM, Mohit Athwani via swift-users <
> swift-users at swift.org> wrote:
>
> I don't understand why we need the usedEncoding parameter? I understand
> that it's a pointer but how do we decide what encoding to use? Do we
> default to NSUTF8StringEncoding?
>
>
> The original implementation in Foundation uses some heuristics to try to
> guess the encoding, since there are unfortunately billions of plain text
> files out there that don’t explicitly state their encoding. It’s not open
> source, so we can’t know for sure [except for the people who work at
> Apple], but I’m sure it includes things like:
>
> - Look for a Unicode BOM at the start, in which case it’s probably UTF-16
> (or maybe UTF-32? I don’t know the details.)
> - If not, see whether all bytes are 0x00-0x7F ⟶ in that case use ASCII
> - If not, does it contain any byte sequences that are illegal in UTF-8? ⟶
> If not, use UTF-8
> - Otherwise, does it contain any bytes in the range 0x80-0x9F?
> ⟶ If not, ISO-8859-1 (aka ISO-Latin-1) is a good guess
> ⟶ If so, CP-1252 (aka WinLatin1) is a good guess; it’s a nonstandard but
> very common superset of ISO-8859-1 with extra characters in that byte range
>
> There are likely other heuristics too. It used to be important to detect
> the old MacRoman encoding used in pre-OS X apps, but it’s been long enough
> that there shouldn’t be many docs like that in the wild anymore. There are
> multibyte non-Unicode encodings that used to be very common in non-Roman
> languages, like Shift-JIS, but I have no idea how to detect them or if
> they’re even still relevant.
>
> It could also be useful to check whether the start of the file looks like
> XML or HTML, and if so, parse it enough to find where it specifies its
> encoding. (Are there other text formats that include encodings? I’ve seen
> special markings at the top of source files used for emacs or vi,
> specifying tab widths and such, but I don’t know if those can specify
> encodings too.)
>
> I’m not involved in Swift development, but IMHO a basic implementation
> that just uses the rules I sketched above would be pretty useful, and then
> people with more domain knowledge could enhance that code to add more
> heuristics later on.
>
> —Jens
>
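
The rules Jens lists above could be sketched in Swift roughly as follows. This is only an illustration under stated assumptions, not Foundation's actual algorithm, which the message notes is not public. The function name guessEncoding(of:) is hypothetical. The sketch also treats the UTF-8 BOM (EF BB BF) as a hint, which the message does not mention, and it assumes the byte range that separates CP-1252 from ISO-8859-1 is 0x80-0x9F, where ISO-8859-1 has only control codes.

```swift
import Foundation

/// Hypothetical helper: guesses a text encoding from raw bytes using the
/// rules sketched in the quoted message. Not Foundation's real heuristics.
func guessEncoding(of data: Data) -> String.Encoding {
    let bytes = [UInt8](data)

    // 1. Byte-order marks. UTF-32 LE is tested before UTF-16 LE because
    //    FF FE 00 00 also begins with the UTF-16 LE mark FF FE.
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return .utf32BigEndian }
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return .utf32LittleEndian }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) { return .utf8 }
    if bytes.starts(with: [0xFE, 0xFF]) { return .utf16BigEndian }
    if bytes.starts(with: [0xFF, 0xFE]) { return .utf16LittleEndian }

    // 2. Only 7-bit bytes: plain ASCII.
    if bytes.allSatisfy({ $0 < 0x80 }) { return .ascii }

    // 3. No byte sequence that is illegal in UTF-8: assume UTF-8.
    if String(data: data, encoding: .utf8) != nil { return .utf8 }

    // 4. Bytes in 0x80-0x9F are printable in CP-1252 but only control
    //    codes in ISO-8859-1, so their presence suggests CP-1252.
    let hasHighControlRange = bytes.contains { (0x80...0x9F).contains($0) }
    return hasHighControlRange ? .windowsCP1252 : .isoLatin1
}
```

An implementation of the usedEncoding variant could read the file into a Data value, store the result of this guess in the usedEncoding parameter, and then build the string with String(data:encoding:) using that same encoding.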