zayg21 December 12, 2017 at 09:52

Another implementation of case-insensitive search for Cyrillic characters in SQLite

From the sandbox
Tutorial

Good day, Khabrovsk!

The problem of searching for Russian characters in SQLite has long become a byword, the reasons for its appearance are described in some detail here . However, there are fairly common solutions to this problem, the most popular of which is connecting the ICU, a library with which you can implement a full-fledged search in Unicode. But I wanted a shorter solution in terms of code so that the result would be a search:

case insensitive;
according to Russian and English characters;
ignoring the ё / ё symbol when searching;
fast
used built-in NOCASE collation.

As a result, the idea came to dig deeper into the SQLite source code and fix it a little. The toolkit was as follows: the System.Data.Sqlite assembly, which included the SQLite Amalgamation source code (where all the kernel source code is stored in a single file), Visual Studio 2017 and a test database.

To compare strings in the database kernel, there is a patternCompare function. As you know, it implements nocase comparison only for the first 128 characters of UTF-8, for which there are corresponding checks in the code. The idea was to write another small block of code that checks to see if the character is Cyrillic. If you look at the table UTF-8, then for the characters familiar to the heart, the positions from 0x400 to 0x45F are highlighted. Now create a check for the Cyrillic character and the translation tables in upper and lower case. A bit of code.

#define IsCyrillic(A)(A >= 0x400 && A <= 0x45F)
static const unsigned short CyrillicUpper[128] =
{
	1024, 1045, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039,
	1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055,
	1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071,
	1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055,
	1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071,
	1104, 1045, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119,
	1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135,
	1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151
};
static const unsigned short CyrillicLower[128] =
{
	1024, 1177, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039,
	1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087,
	1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103,
	1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087,
	1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103,
	1104, 1177, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119,
	1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135,
	1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151
};
#define CyrillicToUpper(ch)(CyrillicUpper[(unsigned char)(ch&~0x400)])
#define CyrillicToLower(ch)(CyrillicLower[(unsigned char)(ch&~0x400)])

The point is small, look at all the checks that the character is less than 0x80 and add another check for cyrillicity. This happens in two places, when checking the first character and all the others. I will not give the whole function of patternCompare, only these two places (already carefully corrected).

The first block of code:

      if( c<=0x80 ){
        u32 cx;
        int bMatch;
        if( noCase ){
          cx = sqlite3Toupper(c);
          c = sqlite3Tolower(c);
        }else{
          cx = c;
        }
        while( (c2 = *(zString++))!=0 ){
          if( c2!=c && c2!=cx ) continue;
          bMatch = patternCompare(zPattern,zString,pInfo,matchOther);
          if( bMatch!=SQLITE_NOMATCH ) return bMatch;
        }
      }	  
	  else if (noCase && IsCyrillic(c))
	  {
		  u32 cx;
		  int bMatch;
		  if (noCase) {
			  cx = CyrillicToUpper(c);
			  c = CyrillicToLower(c);
		  }
		  else {
			  cx = c;
		  }
		  while ((c2 = Utf8Read(zString)) != 0) {
			  if (c2 != c && c2 != cx) continue;
			  bMatch = patternCompare(zPattern, zString, pInfo, matchOther);
			  if (bMatch != SQLITE_NOMATCH) return bMatch;
		  }
	  }

Second block of code:

    if( noCase  && sqlite3Tolower(c)==sqlite3Tolower(c2) && c<0x80 && c2<0x80 ){
      continue;
    }
	if (noCase && CyrillicToLower(c) == CyrillicToLower(c2) && IsCyrillic(c) && IsCyrillic(c2)) {
		continue;
	}

Case-insensitive search for Russian and English text now works correctly for the WHERE LIKE '% xxx%' pattern but will not work for the WHERE LIKE 'xxx%' patternif the field is indexed by the standard NOCASE function. So if you want indexed search to work too, you'll have to tinker a bit more. Take a deeper look at the library code and see that the sqlite3_strnicmp function is responsible for creating the c collation NOCASE index. In general, there are two of them, one with a parameter responsible for checking the length of the text, the other without. We need the first one. But here only the text in it, unlike patternCompare, is not decoded from UTF-8, so we implement one more check that the first byte is high for the Cyrillic character encoded in UTF-8, and we also implement decoding of the Cyrillic character from UTF- 8 with converting it to lowercase.

static unsigned char IsCyrillcByte(unsigned char b)
{
	return !((b >> 1) ^ 0x68);
}
static unsigned short ReadLowerCyrillic(const unsigned char* in)
{
	unsigned char b1, b2;
	b1 = *in;
	in++;
	b2 = *in;
	return CyrillicLower[(unsigned char)(b1 << 6) | (b2 &~ 0x80)];
}

Unfortunately, the function sqlite3_strnicmp had to be rewritten, since it was largely optimized for ASCII, now it looks like this:

SQLITE_API int sqlite3_strnicmp(const char *zLeft, const char *zRight, int N){
  register unsigned char *a, *b;
  unsigned short cLeft, cRight;
  if( zLeft==0 ){
    return zRight ? -1 : 0;
  }else if( zRight==0 ){
    return 1;
  }
  a = (unsigned char *)zLeft;
  b = (unsigned char *)zRight;
  for(;N > 0; N--)
  { 
	  if (*a <= 0x80 && *b <= 0x80)
	  {
		  if (*a != 0 && UpperToLower[*a] == UpperToLower[*b])
		  {
				a++; b++;
		  }
		  else
		  {
			  return UpperToLower[*a] - UpperToLower[*b];
		  }
	  }
	  else if (IsCyrillcByte(*a) && IsCyrillcByte(*b))
	  {
		   cLeft = ReadLowerCyrillic(a);
		   cRight = ReadLowerCyrillic(b);
		   if (cLeft == cRight)
		   {
			   a += 2; b += 2;
		   }
		   else
		   {
			   return cLeft - cRight;
		   }
	  }
	  else if (*a == *b)
	  {
		  a++; b++;
	  }
	  else
	  {
		  return *a - *b;
	  }
  }
  return 0;
}

Everything can be ~~watered with sauce and eat~~ compiled. It is also necessary at the end to execute a simple query to the REINDEX NOCASE database ; to recreate all case-insensitive indexes.

Tags:

Another implementation of case-insensitive search for Cyrillic characters in SQLite

Also popular now: