Cross-platform work with strings in C++

Not so long ago, I ran into the problem of cross-platform work with strings in C++ applications. Roughly speaking, the task was a case-insensitive substring search in any encoding on any platform.

The first thing I had to understand was that on Linux you should work with strings in UTF-8 using the std::string type, while on Windows strings should be in UTF-16LE (the std::wstring type). Why? Because that is how the operating systems are designed. Storing strings in std::wstring on Linux is extremely expensive, since one wchar_t takes 4 bytes there (2 bytes on Windows), and narrow std::string was what Windows needed back in the Windows 98 era. To work with strings, we define our own platform-independent type:

#ifdef _WIN32
typedef std::wstring mstring;
#else
typedef std::string mstring;
#endif // _WIN32
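
With such a typedef, string literals and character types have to follow the platform too: narrow on Linux, wide on Windows. Below is a minimal sketch of how that might look. The names `MSTR`, `mchar`, and `starts_with` are hypothetical and do not come from the original code:

```cpp
#include <string>

#ifdef _WIN32
typedef std::wstring mstring;   // UTF-16LE, 2-byte wchar_t
#define MSTR(s) L##s            // hypothetical helper: wide literal
#else
typedef std::string mstring;    // UTF-8, 1-byte char
#define MSTR(s) s               // hypothetical helper: narrow literal
#endif // _WIN32

// Character type that matches mstring (hypothetical name).
typedef mstring::value_type mchar;

// Code written against mstring compiles unchanged on both platforms.
inline bool starts_with(const mstring& text, const mstring& prefix)
{
	return text.size() >= prefix.size() &&
	       text.compare(0, prefix.size(), prefix) == 0;
}
```

A call such as `starts_with(name, MSTR("config"))` then works with either underlying type, without `#ifdef` blocks at the call site.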


The second task is converting text from any encoding to the mstring type. There are not many options. The first is to use std::locale and the related standard facilities. What immediately struck me was the need to find the matching locale for each charset (for example, the encoding "windows-1251" corresponds to the locale Russian_Russia.1251, and so on). I did not find such a table in the standard library (maybe I did not look hard enough?), and I did not feel like hunting for an add-on just to get a list of locales. In any case, working with locales in C++ is very non-obvious, in my opinion. The forums advised using the libiconv or icu libraries. libiconv looked light and simple, and it handled transcoding from any charset to mstring perfectly, but when it came to converting mstring to lower case, I got stuck. It turned out that libiconv cannot do this, and I found no easy and elegant way to lowercase a UTF-8 string on Linux. So the choice fell on icu, which solved both tasks (conversion and lowercasing) with honor. The platform-independent transcoding procedure using the icu library looks something like this:

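#include <string>
#include <vector>
#include <unicode/ucnv.h>
#include <unicode/unistr.h>

// Converts source_str from the given charset to UTF-8, optionally lowercasing it.
// Returns an empty string on failure.
std::string to_utf8(const std::string& source_str, const std::string& charset, bool lowercase)
{
	const int32_t srclen = static_cast<int32_t>(source_str.size());
	// Assumes at most one UChar per source byte; if that is not enough, ICU reports
	// a buffer overflow and an empty string is returned. The spare slot keeps the
	// buffer non-empty for empty input.
	std::vector<UChar> target(source_str.size() + 1);
	UErrorCode status = U_ZERO_ERROR;
	UConverter *conv = ucnv_open(charset.c_str(), &status);
	if (!U_SUCCESS(status))
		return std::string();
	int32_t len = ucnv_toUChars(conv, target.data(), static_cast<int32_t>(target.size()), source_str.data(), srclen, &status);
	ucnv_close(conv); // close before the error check so the converter is not leaked
	if (!U_SUCCESS(status))
		return std::string();
	icu::UnicodeString ustr(target.data(), len);
	if (lowercase)
		ustr.toLower();
	std::string retval;
	ustr.toUTF8String(retval);
	return retval;
}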
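
Going back to the original task, a case-insensitive substring search comes down to normalizing both strings the same way and then doing an ordinary byte search. The sketch below is an assumption about how the pieces might fit together, not code from the original. `contains_ci` and `ascii_lower` are hypothetical names, and `ascii_lower` is only a stand-in normalizer so that the example builds without icu. In a real build one would pass something like a wrapper around `to_utf8(s, charset, true)` instead.

```cpp
#include <cctype>
#include <string>

// Stand-in normalizer (hypothetical): lowercases ASCII letters byte by byte.
// In the default "C" locale, std::tolower leaves bytes >= 0x80 untouched,
// so non-ASCII UTF-8 text passes through unchanged.
inline std::string ascii_lower(const std::string& s)
{
	std::string out(s);
	for (std::string::size_type i = 0; i < out.size(); ++i)
		out[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(out[i])));
	return out;
}

// Hypothetical helper: normalize both strings, then search byte-wise.
// A byte search is safe for valid UTF-8, because a valid UTF-8 needle
// can only match at character boundaries.
template <typename Normalize>
bool contains_ci(const std::string& haystack, const std::string& needle, Normalize normalize)
{
	return normalize(haystack).find(normalize(needle)) != std::string::npos;
}
```

With `ascii_lower`, `contains_ci("Hello, World", "WORLD", ascii_lower)` finds a match, but an uppercase Cyrillic needle does not match the same word in lowercase. That is exactly the gap that byte-wise lowercasing leaves and that a Unicode-aware library such as icu fills.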


I will not describe working with Unicode in Windows; everything is well documented there.
