[swift-users] CharacterSet vs Set<Character>
Gerriet M. Denkmann
g at mdenkmann.de
Mon Oct 3 09:27:20 CDT 2016
> On 3 Oct 2016, at 19:17, Jean-Denis Muys via swift-users <swift-users at swift.org> wrote:
>
> You are right: I don’t know much about Asian languages.
>
> How would you go about counting consonants, vowels (and tone-marks?) in the most general way?
Iterate over unicodeScalars (in the most general case); Swift Characters are probably OK for European languages.
For each unicodeScalar (a.k.a. code point) you can use the ICU function:
int8_t chrTyp = u_charType(codepoint);
This returns the general category value for the code point.
This gives you something like U_OTHER_PUNCTUATION, U_MATH_SYMBOL, U_OTHER_LETTER etc.
See enum UCharCategory in <http://icu-project.org/apiref/icu4c-latest/uchar_8h.html>
In European languages ignore U_NON_SPACING_MARKs.
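The same general-category lookup can be sketched in pure Swift, without calling ICU directly. This assumes Swift 5 or later, where the standard library exposes Unicode.Scalar.Properties; the helper name categoryCounts is made up for illustration.

```swift
// Sketch, assuming Swift 5+: tally the Unicode general category of every
// scalar in a string. `categoryCounts` is a hypothetical helper name.
func categoryCounts(of text: String) -> [Unicode.GeneralCategory: Int] {
    var counts: [Unicode.GeneralCategory: Int] = [:]
    for scalar in text.unicodeScalars {
        // generalCategory plays the role of u_charType() here,
        // e.g. .nonspacingMark, .otherLetter, .otherPunctuation.
        counts[scalar.properties.generalCategory, default: 0] += 1
    }
    return counts
}

// "e" followed by U+0301 (combining acute accent) and "!"
let sample = categoryCounts(of: "e\u{0301}!")
print(sample[.nonspacingMark] ?? 0)
```

For European text one would then skip scalars whose category is .nonspacingMark, as described above.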
There is a compare:options: method on NSString (and probably something similar for Swift String) which can be used with the options NSCaseInsensitiveSearch and NSDiacriticInsensitiveSearch to treat ‘E’, ‘e’ and è, é, Ĕ etc. as equal.
That is: for each character (or unicodeScalar) compare to a, e, i, o, u with these options.
import Foundation  // needed for NSString and its compare options

let str = "HaÁÅǺáXeëẽêèâàZ"
// Swift 3 syntax; in Swift 4 and later iterate over str directly
for char in str.characters
{
    let vowel = isVowel( char )
    print("\(char) is \(vowel ? "vowel" : "consonant")")
}

func isVowel( _ char: Character ) -> Bool
{
    let s1 = "\(char)"
    let s2 = s1 as NSString
    let opt: NSString.CompareOptions = [.diacriticInsensitive, .caseInsensitive]
    // no idea how to do this with Strings:
    if s2.compare("a", options: opt) == .orderedSame {return true}
    if s2.compare("e", options: opt) == .orderedSame {return true}
    …
    return false
}
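One possible way to do the same check on a Swift String, without bridging to NSString by hand, is Foundation's folding(options:locale:). This is only a sketch: the name isLatinVowel and the fixed vowel list a, e, i, o, u are assumptions for illustration, and it relies on Foundation being available.

```swift
import Foundation

// Hypothetical helper: a character counts as a vowel if, after removing
// diacritics and case distinctions, it is one of a, e, i, o, u.
func isLatinVowel(_ char: Character) -> Bool {
    let folded = String(char).folding(
        options: [.diacriticInsensitive, .caseInsensitive],
        locale: nil)
    return ["a", "e", "i", "o", "u"].contains(folded)
}

// Swift 4+ lets you iterate over the String itself.
for char in "Héllo" {
    print("\(char) is \(isLatinVowel(char) ? "vowel" : "consonant")")
}
```

Like the NSString version, this treats every non-vowel as a consonant, so digits and punctuation would need to be filtered out first.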
If you really want to use Thai, then do NOT ignore U_NON_SPACING_MARKs because some vowels are classified thusly.
U+0E01 … U+0E2E are consonants, U+0E30 … U+0E39 and U+0E40 … U+0E44 are vowels.
But then: ‘อ’ is sometimes a (silent) consonant (อยาก), sometimes a vowel (บอ), sometimes part of a vowel (มือ), sometimes part of a diphthong (เบื่อ).
Similar for ย: normal consonant (ยาก), part of vowel (ไทย) or diphthong (เมีย).
In the latter case only ม is a consonant, the rest is one single diphthong and ี is a U_NON_SPACING_MARK which really is a vowel.
Oh, and don't forget the ligatures ฤ, ฤๅ, ฦ, ฦๅ. These are both a consonant and a vowel. Same for ำ: not a ligature but a vowel + consonant.
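As a first approximation only, the code-point ranges given above can be sketched as a simple classifier. The names ThaiClass and thaiClass(of:) are invented for this example, and the sketch deliberately ignores all the context-dependent cases (อ, ย, the ligatures, ำ) just described.

```swift
// Naive classification of Thai scalars by code-point range only.
// `ThaiClass` and `thaiClass(of:)` are hypothetical names.
enum ThaiClass { case consonant, vowel, other }

func thaiClass(of scalar: Unicode.Scalar) -> ThaiClass {
    switch scalar.value {
    case 0x0E01...0x0E2E:
        return .consonant            // ก … ฮ
    case 0x0E30...0x0E39, 0x0E40...0x0E44:
        return .vowel                // ะ … ู and เ … ไ
    default:
        return .other                // tone marks, digits, non-Thai, …
    }
}

// Count consonants and vowels scalar by scalar.
var consonants = 0
var vowels = 0
for scalar in "เมีย".unicodeScalars {
    switch thaiClass(of: scalar) {
    case .consonant: consonants += 1
    case .vowel: vowels += 1
    case .other: break
    }
}
print(consonants, vowels)
```

For เมีย this range-based count reports ย as a second consonant, although only ม is one, which shows why ranges alone are not enough.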
But to talk about German:
What about diphthongs? “neu” has one consonant + one vowel sound (but 2 vowel characters).
What if some silly users don’t know how to type umlauts and write “ueber” (instead of the correct “über”)? The “ue” here is really one vowel (u + diaeresis).
But beware: “aktuell” is definitely not a misspelling of “aktüll”; its “ue” really is two vowels.
Gerriet.