2005-09-09

 

strcasecmp should die

If you are in the habit of using strcasecmp() or strncasecmp() in your code without much thought, stop. They are broken by design, almost.

The GNU man page for strcasecmp() is less than useful. It boldly claims that strcasecmp() compares two strings ignoring the case of characters. No warnings or caveats. So it's not odd that the average "let them eat ASCII" I18N-ignorant programmer might think strcasecmp() is a well-defined and well-behaving function, that does what it says in the man page, right?

The man page says "conforms to BSD 4.4 and SUSv3". Well, BSD4.4 is so last century, isn't it? How much support for locales and non-ASCII charsets was there in BSD4.4 anyway? I don't know.

So that "conformance" doesn't really tell you much. But SUSv3 (trivial free registration required to read online at www.unix.org) clearly states:
"In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed. The results are unspecified in other locales."

The man page in Solaris 10 says:
"They assume the ASCII character set"

Experimentation indicates that the Linux (glibc) strcasecmp() in UTF-8 locales certainly does not compare non-ASCII (i.e. multi-byte) letters in UTF-8 strings case-insensitively. Upper and lower case Greek alpha, for instance, are considered different. This is allowed by SUSv3. I have no idea what it does in other locales, or what other strcasecmp() implementations might do.
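The experiment is easy to repeat. The following is a minimal sketch, not the exact test behind the observation above; the helper names same_ignoring_case() and print_strcasecmp_demo() are made up for illustration, and the Greek letters are written as explicit UTF-8 bytes so the source file encoding does not matter.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <strings.h>   /* POSIX declares strcasecmp() here */

/* Hypothetical helper: 1 if strcasecmp() considers a and b equal, else 0. */
int same_ignoring_case(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcasecmp(a, b) == 0;
}

/* Hypothetical demo: print what strcasecmp() does in the user's locale. */
void print_strcasecmp_demo(void)
{
    /* Adopt the locale from the environment, e.g. some UTF-8 locale. */
    if (setlocale(LC_ALL, "") == NULL)
        fputs("setlocale() failed, staying in the \"C\" locale\n", stderr);

    /* Plain ASCII letters: this is the case everybody has in mind. */
    printf("HELLO vs hello: %d\n", same_ignoring_case("HELLO", "hello"));

    /* U+0391 GREEK CAPITAL LETTER ALPHA vs U+03B1 GREEK SMALL LETTER ALPHA,
     * both spelled out as UTF-8 bytes. Outside the POSIX locale the result
     * is unspecified, so no particular value is promised here. */
    printf("Alpha vs alpha: %d\n", same_ignoring_case("\xce\x91", "\xce\xb1"));
}
```

Calling print_strcasecmp_demo() from main() under different LANG or LC_ALL settings shows what a given C library does. Only the ASCII line is something you can rely on across platforms.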

So, each time you use strcasecmp(), stop to think. In very few cases it's what you actually want.

Do you really know what character set the strings are in? If yes, and if it is the current locale's character set, you need to use mbsrtowcs() (or the simpler mbstowcs()) to convert to a wide character (wchar_t) string and then use wcscasecmp() (a GNU extension). Or perhaps something like mbscasecmp() if your C library has such a beast (Microsoft's C library has _mbsicmp()). If you know the charset, but it isn't the locale's charset (or if the strings are in different charsets), it gets really fun and you need to convert them to wide character strings using iconv() and then call wcscasecmp(). (When using iconv() you need to know the name of the charset your C implementation uses for wide characters. UTF-16? UCS-4?)
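For the simplest case, where both strings are in the current locale's charset, the conversion route might look like the sketch below. The names to_wide() and locale_casecmp() are invented for this example. It uses mbstowcs(), the non-restartable sibling of mbsrtowcs(), and it assumes a C library that declares wcscasecmp() (it was a GNU extension and was later standardized in POSIX.1-2008, hence the feature macro).

```c
#define _POSIX_C_SOURCE 200809L   /* ask for wcscasecmp() from <wchar.h> */
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string in the current locale's
 * charset to a freshly allocated wide string. Returns NULL if the input is
 * not valid in that charset or if memory runs out. Caller frees. */
static wchar_t *to_wide(const char *s)
{
    size_t n = mbstowcs(NULL, s, 0);   /* length in wide characters */
    if (n == (size_t)-1)
        return NULL;
    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w != NULL)
        mbstowcs(w, s, n + 1);         /* n characters plus the terminator */
    return w;
}

/* Hypothetical wrapper: returns <0, 0 or >0 like strcasecmp().
 * *ok is set to 0 if either string could not be converted, in which
 * case the return value means nothing. */
int locale_casecmp(const char *a, const char *b, int *ok)
{
    assert(a != NULL && b != NULL && ok != NULL);
    wchar_t *wa = to_wide(a);
    wchar_t *wb = to_wide(b);
    int result = 0;

    *ok = (wa != NULL && wb != NULL);
    if (*ok)
        result = wcscasecmp(wa, wb);
    free(wa);
    free(wb);
    return result;
}
```

The program has to call setlocale(LC_ALL, "") first, otherwise everything happens in the "C" locale and non-ASCII input may simply fail to convert. Whether non-ASCII letters then compare equal still depends on the case mappings in the locale's LC_CTYPE data.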

Solaris seems not to have wcscasecmp() but rather wscasecmp(), which presumably does the same thing?

Or do the case-insensitive wide character string comparison manually. Or convert to UTF-8 and use a casefolding UTF-8 implementation.
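Doing the wide character comparison by hand is not much code. Here is a rough sketch using the standard towlower() from <wctype.h>; the name my_wcscasecmp() is made up. It assumes that one wchar_t holds one complete character, which is not true where wchar_t is 16 bits wide, and a per-character mapping like towlower() cannot handle cases where the upper and lower case forms differ in length.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical manual replacement for wcscasecmp().
 * Returns -1, 0 or 1. Case mapping comes from the current LC_CTYPE. */
int my_wcscasecmp(const wchar_t *a, const wchar_t *b)
{
    assert(a != NULL && b != NULL);
    for (;; a++, b++) {
        wint_t ca = towlower((wint_t)*a);
        wint_t cb = towlower((wint_t)*b);
        if (ca != cb)
            return ca < cb ? -1 : 1;   /* first differing character decides */
        if (ca == L'\0')
            return 0;                  /* both strings ended together */
    }
}
```

Note that the ordering this gives for unequal strings is just code point order after lowercasing. It is good enough for equality tests, not for sorting text the way a user expects.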

If the strings are in UTF-8 from the start, or after conversion to wide characters, there is also the issue of whether to take into account the equivalence of precomposed letters and a base character followed by combining diacritical marks. As a precomposed ä is indistinguishable to the user from an a followed by a combining diaeresis, you should.

As you see, something as seemingly easy as case-insensitive string comparison is definitely not trivial.

In GLib, we have g_ascii_strcasecmp(), which folds case only for ASCII letters and compares all non-ASCII bytes exactly as they are, regardless of the locale. Use that if that is what you want. Then there are g_utf8_casefold() and g_utf8_collate() for UTF-8 strings.

Comments:
I tried to use multibyte strings in C programs. But in the end I came to the conclusion that it's not worth the effort. It was so much easier to create wchar_t * wrappers around all system commands that expect a char *, and then use wide characters internally. Quadruples the amount of memory used for storing strings, which is a bit of an issue, since the program I did this in is a commandline shell (http://roo.no-ip.org/fish/), and as such almost everything is a string. But I still think this is the right approach, since doing anything non-trivial in UTF-8 is such a pain.
 
Isn't the locale responsible for handling the logic of case comparison?

A new locale project started at http://live.gnome.org/LocaleProject
It's not GNOME-specific but rather meant to be used, as a library, by various programs.

The rules to change case for Greek are rather complicated (that's modern Greek), where accents are dropped when you capitalise and in some cases moved to different characters.

This complexity should be dealt with by some other library that does just that.
 