[swift-users] CharacterSet vs Set<Character>

Gerriet M. Denkmann g at mdenkmann.de
Mon Oct 3 09:27:20 CDT 2016


> On 3 Oct 2016, at 19:17, Jean-Denis Muys via swift-users <swift-users at swift.org> wrote:
> 
> You are right: I don’t know much about asian languages.
> 
> How would you go about counting consonants, vowels (and tone-marks?) in the most general way?

Iterate over unicodeScalars (in the most general case) - Swift characters are probably ok for European languages.

For each unicodeScalar (a.k.a. code point) you can use the ICU function:
	int8_t 	chrTyp = u_charType (codepoint) 
This returns the general category value for the code point.
This gives you something like U_OTHER_PUNCTUATION, U_MATH_SYMBOL, U_OTHER_LETTER etc.
See enum UCharCategory in <http://icu-project.org/apiref/icu4c-latest/uchar_8h.html>

In European languages ignore U_NON_SPACING_MARKs.
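
A minimal sketch of the same idea without calling ICU directly. It assumes Swift 5 or later, where the standard library exposes the general category as Unicode.Scalar.Properties.generalCategory (this did not exist when the mail was written). The names ScalarKind, kind(of:) and letterCount(in:) are made up for illustration.

```swift
// Hypothetical helper type: a coarse grouping of general categories.
enum ScalarKind { case letter, mark, other }

// Map a scalar's general category to the coarse grouping above.
func kind(of scalar: Unicode.Scalar) -> ScalarKind {
    switch scalar.properties.generalCategory {
    case .uppercaseLetter, .lowercaseLetter, .titlecaseLetter,
         .modifierLetter, .otherLetter:
        return .letter
    case .nonspacingMark:          // corresponds to ICU's U_NON_SPACING_MARK
        return .mark
    default:
        return .other
    }
}

// Count letters only; non-spacing marks (and everything else) are ignored.
func letterCount(in text: String) -> Int {
    return text.unicodeScalars.filter { kind(of: $0) == .letter }.count
}

// "e" + combining acute accent, "t", precomposed "é"
let sample = "e\u{0301}t\u{00E9}"
print(letterCount(in: sample))   // prints 3
```

The combining accent U+0301 is skipped, so the decomposed and the precomposed "é" each count as one letter.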

NSString has a compare:options: method (and Swift String probably has something similar) which can take the options NSCaseInsensitiveSearch and NSDiacriticInsensitiveSearch to treat ‘E’, ‘e’ and è, é, Ĕ etc. as equal.
That is: for each character (or unicodeScalar) compare to a, e, i, o, u with these options.

import Foundation

let str = "HaÁÅǺáXeëẽêèâàZ"

for char in str.characters
{
	let vowel = isVowel( char )
	print("\(char) is \(vowel ? "vowel" : "consonant")")
}

func isVowel( _ char: Character ) -> Bool
{
	let s1 = "\(char)"
	let s2 = s1 as NSString
	let opt: NSString.CompareOptions = [.diacriticInsensitive, .caseInsensitive]

	//	no idea how to do this with Strings:
	if s2.compare("a", options: opt) == .orderedSame {return true}
	if s2.compare("e", options: opt) == .orderedSame {return true}
	…
	return false
}
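
On the "how to do this with Strings" question, a possible sketch: with Foundation imported, String itself offers compare(_:options:), so the bridge to NSString should not be needed. The function name isVowelUsingString is hypothetical, and the vowel list is just the a, e, i, o, u mentioned above.

```swift
import Foundation

// Hypothetical variant of isVowel that stays with String throughout.
func isVowelUsingString(_ char: Character) -> Bool {
    let options: String.CompareOptions = [.diacriticInsensitive, .caseInsensitive]
    let candidate = String(char)
    // Assumption: only the five basic Latin vowels are checked.
    for vowel in ["a", "e", "i", "o", "u"] {
        if candidate.compare(vowel, options: options) == .orderedSame {
            return true
        }
    }
    return false
}

for char in "Xe\u{00E9}" {
    print("\(char) is \(isVowelUsingString(char) ? "vowel" : "consonant")")
}
```

Like the NSString version, this calls every non-vowel a consonant, including digits and punctuation.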


If you really want to use Thai, then do NOT ignore U_NON_SPACING_MARKs because some vowels are classified thusly.
U+0E01 … U+0E2E are consonants, U+0E30 … U+0E39 and U+0E40 … U+0E44 are vowels.
But then: ‘อ’ is sometimes a (silent) consonant (อยาก), sometimes a vowel (บอ), sometimes part of a vowel (มือ), sometimes part of a diphthong (เบื่อ).
Similar for ย: normal consonant (ยาก), part of vowel (ไทย) or diphthong (เมีย).
In the latter case only ม is a consonant, the rest is one single diphthong and ี is a U_NON_SPACING_MARK which really is a vowel.
Oh, and don't forget the ligatures ฤ, ฤๅ, ฦ, ฦๅ. These are both a consonant and a vowel. Same for ำ: not a ligature but a vowel + consonant.
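
As a rough sketch, the code point ranges above can be turned into a naive per-scalar classifier. It deliberately ignores all the context-dependent cases just described, so it is only a first approximation. The names ThaiClass, thaiClass(of:) and thaiCounts(in:) are made up for illustration.

```swift
// Hypothetical coarse classification of Thai code points.
enum ThaiClass { case consonant, vowel, other }

// Classify purely by code point range; context is ignored.
func thaiClass(of scalar: Unicode.Scalar) -> ThaiClass {
    switch scalar.value {
    case 0x0E01...0x0E2E:
        return .consonant
    case 0x0E30...0x0E39, 0x0E40...0x0E44:
        return .vowel
    default:
        return .other
    }
}

// Count consonant and vowel code points in a string.
func thaiCounts(in text: String) -> (consonants: Int, vowels: Int) {
    var consonants = 0
    var vowels = 0
    for scalar in text.unicodeScalars {
        switch thaiClass(of: scalar) {
        case .consonant: consonants += 1
        case .vowel:     vowels += 1
        case .other:     break
        }
    }
    return (consonants, vowels)
}

let mia = "\u{0E40}\u{0E21}\u{0E35}\u{0E22}"   // เมีย
print(thaiCounts(in: mia))   // prints (consonants: 2, vowels: 2)
```

For เมีย this reports two consonants, although only ม really is one: the table counts ย as a consonant even where it is part of the diphthong. That is exactly the kind of case a range lookup cannot handle.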


But to talk about German:
What about diphthongs? “neu” has one consonant + one vowel sound (but 2 vowel characters).
What if some silly users don’t know how to type umlauts and write “ueber” (instead of the correct “über”)? Here the “ue” is really one vowel (+diaeresis).
But beware: “aktuell” is definitely not a misspelling of “aktüll”; its “ue” really is two vowels.

Gerriet.


