m00t November 11, 2010 at 19:31

Detecting text encoding in PHP: an overview of existing solutions, plus one more reinvented wheel

I ran into a familiar task: auto-detecting the encoding of a page, a piece of text, or anything else. The task is not new, and plenty of wheels have already been reinvented. This article gives a short overview of what I found online, plus my own solution, which I think is a decent one.

1. Why not mb_detect_encoding()?

In short, it does not work.

Let's take a look:

// Input: Russian text in CP1251 encoding
$string = iconv('UTF-8', 'Windows-1251', 'Он подошел к Анне Павловне, поцеловал ее руку, подставив ей свою надушенную и сияющую лысину, и покойно уселся на диване.');
// Let's see what mb_detect_encoding() returns. First with $strict = FALSE
var_dump(mb_detect_encoding($string, array('UTF-8')));
// UTF-8
var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251')));
// Windows-1251
var_dump(mb_detect_encoding($string, array('UTF-8', 'KOI8-R')));
// KOI8-R
var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251', 'KOI8-R')));
// FALSE
var_dump(mb_detect_encoding($string, array('UTF-8', 'ISO-8859-5')));
// ISO-8859-5
var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251', 'KOI8-R', 'ISO-8859-5')));
// ISO-8859-5
// Now with $strict = TRUE
var_dump(mb_detect_encoding($string, array('UTF-8'), TRUE));
// FALSE
var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251'), TRUE));
// FALSE
var_dump(mb_detect_encoding($string, array('UTF-8', 'KOI8-R'), TRUE));
// FALSE
var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251', 'KOI8-R'), TRUE));
// FALSE
var_dump(mb_detect_encoding($string, array('UTF-8', 'ISO-8859-5'), TRUE));
// ISO-8859-5
var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251', 'KOI8-R', 'ISO-8859-5'), TRUE));
// ISO-8859-5

As you can see, the output is a complete mess. What do we do when it is unclear why a function behaves this way? Right, we google it. I found a wonderful answer.

To dispel any remaining hope of using mb_detect_encoding(), we need to dig into the source of the mbstring extension. So, sleeves rolled up, off we go:

// ext/mbstring/mbstring.c:2629
PHP_FUNCTION(mb_detect_encoding)
{
...
// line 2703
ret = mbfl_identify_encoding_name(&string, elist, size, strict);
...

Ctrl + click:

// ext/mbstring/libmbfl/mbfl/mbfilter.c:643
const char*
mbfl_identify_encoding_name(mbfl_string *string, enum mbfl_no_encoding *elist, int elistsz, int strict)
{
	const mbfl_encoding *encoding;
	encoding = mbfl_identify_encoding(string, elist, elistsz, strict);
...

Ctrl + click:

// ext/mbstring/libmbfl/mbfl/mbfilter.c:557
/*
 * identify encoding
 */
const mbfl_encoding *
mbfl_identify_encoding(mbfl_string *string, enum mbfl_no_encoding *elist, int elistsz, int strict)
{
...

I will not post the full text of the function, so as not to clutter the article with source code; anyone interested can look it up. What matters to us is line 593, where the actual check of whether a character fits the encoding takes place:

// ext/mbstring/libmbfl/mbfl/mbfilter.c:593
(*filter->filter_function)(*p, filter);
if (filter->flag) {
	bad++;
}

Here are the main filters for single-byte Cyrillic:

Windows-1251 (original comments preserved)

// ext/mbstring/libmbfl/filters/mbfilter_cp1251.c:142
/* all of this is so ugly now! */
static int mbfl_filt_ident_cp1251(int c, mbfl_identify_filter *filter)
{
	if (c >= 0x80 && c < 0xff)
		filter->flag = 0;
	else
		filter->flag = 1; /* not it */
	return c;	
}

KOI8-R

// ext/mbstring/libmbfl/filters/mbfilter_koi8r.c:142
static int mbfl_filt_ident_koi8r(int c, mbfl_identify_filter *filter)
{
	if (c >= 0x80 && c < 0xff)
		filter->flag = 0;
	else
		filter->flag = 1; /* not it */
	return c;	
}

ISO-8859-5 (this is where it gets really fun)

// ext/mbstring/libmbfl/mbfl/mbfl_ident.c:248
int mbfl_filt_ident_true(int c, mbfl_identify_filter *filter)
{
	return c;
}

As you can see, the ISO-8859-5 filter accepts every byte: to reject one, a filter has to set filter->flag = 1, and this one never does.

Once you have seen the filters, everything falls into place. CP1251 and KOI8-R cannot be told apart at all, because their filters are identical. And ISO-8859-5, if it is in the list of encodings, will always be reported as a match.

In short, a fail. That is understandable: character codes alone cannot, in general, identify an encoding, because the same codes are used by different encodings.
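To make the overlap concrete, here is a small sketch that decodes one and the same byte string under three different assumptions. It assumes the iconv extension (already used in the examples above); the sample string is my own choice. Every reading succeeds and yields valid text, which is why a per-byte validity check cannot pick a winner.

```php
<?php
// One byte string, produced as CP1251.
$bytes = iconv('UTF-8', 'Windows-1251', 'Русский текст');

$readings = [];
foreach (['Windows-1251', 'KOI8-R', 'ISO-8859-5'] as $encoding) {
    // Decode the identical bytes as if they were in $encoding.
    $readings[$encoding] = iconv($encoding, 'UTF-8', $bytes);
}

// All three conversions succeed; only one of them is readable Russian.
foreach ($readings as $encoding => $text) {
    echo $encoding, ': ', $text, PHP_EOL;
}
```

Only a human (or a statistical model of the language) can tell which of the three lines is the intended one.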

2. What Google turns up

And Google turns up all sorts of misery. I won't even post the source code here; see for yourself if you want (remove the space after http: //, I don't know how to show a link as plain text):

http: // deer.org.ua/2009/10/06/1/
http : // php.su/forum/topic.php?forum=1&topic=1346

3. Searching Habr

1) character codes again: habrahabr.ru/blogs/php/27378/#comment_710532

2) a very interesting solution, in my opinion: habrahabr.ru/blogs/php/27378/#comment_1399654
The pros and cons are in the linked comment. Personally, I think this solution is overkill for encoding detection alone: it is too powerful, and detecting the encoding is merely a side effect of it.

4. My own solution

The idea came to me while looking at the second link in the previous section. It goes like this: take a large Russian text, measure the frequency of each letter, and use those frequencies to detect the encoding. Looking ahead, I'll say right away that uppercase and lowercase letters will cause problems. So I am posting sample letter frequencies (let's call this a “spectrum”) in two variants: case-sensitive and case-insensitive (in the second variant, each lowercase letter is joined by its uppercase counterpart with the same frequency, and the original uppercase entries are removed). In both “spectra”, the space and all letters with a frequency below 0.001 are cut out. Here is what I got after processing War and Peace:

Case-sensitive “spectrum”:

array (
  'о' => 0.095249209893009,
  'е' => 0.06836817536026,
  'а' => 0.067481298384992,
  'и' => 0.055995027400041,
  'н' => 0.052242744063325,
....
  'э' => 0.002252892226507,
  'Н' => 0.0021318391371162,
  'П' => 0.0018574762967903,
  'ф' => 0.0015961610948418,
  'В' => 0.0014044332975731,
  'О' => 0.0013188987793209,
  'А' => 0.0012623590130186,
  'К' => 0.0011804488387602,
  'М' => 0.001061932790165,
)

Case-insensitive:

array (
  'О' => 0.095249209893009,
  'о' => 0.095249209893009,
  'Е' => 0.06836817536026,
  'е' => 0.06836817536026,
  'А' => 0.067481298384992,
  'а' => 0.067481298384992,
  'И' => 0.055995027400041,
  'и' => 0.055995027400041,
....
  'Ц' => 0.0029893589260344,
  'ц' => 0.0029893589260344,
  'щ' => 0.0024649163501406,
  'Щ' => 0.0024649163501406,
  'Э' => 0.002252892226507,
  'э' => 0.002252892226507,
  'Ф' => 0.0015961610948418,
  'ф' => 0.0015961610948418,
)
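A minimal sketch of how such a “spectrum” could be computed from a UTF-8 corpus. The function and constant names are hypothetical, and two details are my assumptions rather than the author's stated method: frequencies are taken relative to all characters (spaces included), and only Cyrillic letters are kept.

```php
<?php
// Hypothetical helper constants: the Russian alphabet in both cases.
const CYR_LOWER = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя';
const CYR_UPPER = 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ';

// Split a UTF-8 string into characters without needing mbstring.
function split_chars(string $utf8): array
{
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

function build_spectrum(string $text, bool $caseSensitive = true, float $minFreq = 0.001): array
{
    $chars = split_chars($text);
    $total = count($chars);
    if ($total === 0) {
        return [];
    }
    // lowercase => uppercase map, plus a lookup of every known letter
    $toUpper = array_combine(split_chars(CYR_LOWER), split_chars(CYR_UPPER));
    $isLetter = $toUpper + array_flip(split_chars(CYR_UPPER));

    $spectrum = [];
    foreach (array_count_values($chars) as $char => $count) {
        $char = (string) $char;
        $freq = $count / $total;
        if (!isset($isLetter[$char]) || $freq < $minFreq) {
            continue; // space, punctuation, rare letters
        }
        if ($caseSensitive) {
            $spectrum[$char] = $freq;
        } elseif (isset($toUpper[$char])) {
            // Lowercase letter: its uppercase twin gets the same frequency.
            $spectrum[$char] = $freq;
            $spectrum[$toUpper[$char]] = $freq;
        }
        // In case-insensitive mode, original uppercase counts are dropped.
    }
    arsort($spectrum);
    return $spectrum;
}

// Usage on a short sample; a real spectrum needs a large corpus.
print_r(build_spectrum('Он подошел к Анне Павловне', false));
```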

Spectra in different encodings (array keys are the codes of the corresponding characters in the corresponding encoding):

Windows-1251: case-sensitive, case-insensitive
KOI8-R: case-sensitive, case-insensitive
ISO-8859-5: case-sensitive, case-insensitive

Next, we take a text in an unknown encoding. For each encoding we are testing, we look up the frequency of the current character and add it to that encoding's “rating”. The encoding with the highest rating is most likely the encoding of the text.

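$encodings = array(
	'cp1251' => require 'specter_cp1251.php',
	'koi8r' => require 'specter_koi8r.php',
	'iso88595' => require 'specter_iso88595.php'
);
// Start every rating at zero so that += never hits an undefined index
$enc_rates = array_fill_keys(array_keys($encodings), 0);
for ($i = 0; $i < strlen($str); ++$i)
{
	// Byte value of the current character (single-byte encodings only)
	$code = ord($str[$i]);
	foreach ($encodings as $encoding => $char_specter)
	{
		// Characters missing from the "spectrum" contribute nothing
		if (isset($char_specter[$code]))
		{
			$enc_rates[$encoding] += $char_specter[$code];
		}
	}
}
var_dump($enc_rates);

Treat this as a sketch rather than a finished script: it relies on the “spectrum” files and on a $str defined elsewhere, and I omitted other details so as not to clutter the article. $char_specter is exactly one of the arrays linked on pastebin.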
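For readers who want something they can run without the pastebin files, here is a self-contained sketch of the same rating idea. The function names are hypothetical, the spectrum is truncated to the five most frequent letters quoted above (rounded), and the byte-keyed spectra are derived with iconv instead of being loaded from files.

```php
<?php
// Re-key a UTF-8 "spectrum" by byte value in a single-byte encoding.
function spectrum_for_encoding(array $utf8Spectrum, string $encoding): array
{
    $byCode = [];
    foreach ($utf8Spectrum as $char => $freq) {
        $converted = @iconv('UTF-8', $encoding, (string) $char);
        if ($converted !== false && strlen($converted) === 1) {
            $byCode[ord($converted)] = $freq;
        }
    }
    return $byCode;
}

// Sum the frequencies of every byte of $str under each candidate encoding.
function rate_encodings(string $str, array $spectra): array
{
    $rates = array_fill_keys(array_keys($spectra), 0.0);
    $length = strlen($str);
    for ($i = 0; $i < $length; ++$i) {
        $code = ord($str[$i]);
        foreach ($spectra as $encoding => $spectrum) {
            $rates[$encoding] += $spectrum[$code] ?? 0.0;
        }
    }
    arsort($rates); // best candidate first
    return $rates;
}

// Truncated spectrum, for illustration only.
$utf8Spectrum = ['о' => 0.0952, 'е' => 0.0684, 'а' => 0.0675, 'и' => 0.0560, 'н' => 0.0522];

$spectra = [];
foreach (['Windows-1251', 'KOI8-R', 'ISO-8859-5'] as $encoding) {
    $spectra[$encoding] = spectrum_for_encoding($utf8Spectrum, $encoding);
}

$unknown = iconv('UTF-8', 'KOI8-R', 'она не одна');
$rates = rate_encodings($unknown, $spectra);
var_dump(array_keys($rates)[0]); // the top-rated candidate
```

With only five letters in the spectrum the margins are not representative; the full spectra are needed to reproduce figures like the ones reported in this article.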

Results

Table rows are the actual encoding of the text; columns are the contents of the $enc_rates array.

1) $str = 'Russian text'; (mostly lowercase)

Text encoding | cp1251 | koi8r | iso88595
Windows-1251  | 0.441  | 0.020 | 0.085
KOI8-R        | 0.049  | 0.441 | 0.166
ISO-8859-5    | 0.133  | 0.092 | 0.441

Everything is fine. The real encoding already has a rating several times higher than the rest, and that is on such a short text. On longer texts the ratio will be about the same.

2) $str = 'LINE CAPSOM RUSSIAN TEXT'; (all caps)

Text encoding | cp1251 | koi8r | iso88595
Windows-1251  | 0.013  | 0.705 | 0.331
KOI8-R        | 0.649  | 0.013 | 0.201
ISO-8859-5    | 0.007  | 0.392 | 0.013

Oops! A complete mess. That is because the codes of uppercase letters in CP1251 mostly coincide with the codes of lowercase letters in KOI8-R, and lowercase letters are in turn used far more often than uppercase ones. So an all-caps string in CP1251 gets detected as KOI8-R.

Let's try the case-insensitive variant (with the case-insensitive “spectra”).

1) $str = 'Russian text';

Text encoding | cp1251 | koi8r | iso88595
Windows-1251  | 0.477  | 0.342 | 0.085
KOI8-R        | 0.315  | 0.477 | 0.207
ISO-8859-5    | 0.216  | 0.321 | 0.477

2) $str = 'LINE CAPSOM RUSSIAN TEXT';

Text encoding | cp1251 | koi8r | iso88595
Windows-1251  | 1.074  | 0.705 | 0.465
KOI8-R        | 0.649  | 1.074 | 0.201
ISO-8859-5    | 0.331  | 0.392 | 1.074

As you can see, the correct encoding leads consistently both with the case-sensitive “spectra” (provided the string contains few capital letters) and with the case-insensitive ones. In the case-insensitive variant the lead is less decisive, of course, but it holds up even on short strings. You can also experiment with the letter weights, for example by making them non-linear in the frequency.
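One way to experiment with non-linear weights is to transform a “spectrum” before rating. The sketch below is only an illustration: the helper name and both weighting functions (square root, and logarithm relative to the 0.001 cut-off) are my own picks, and whether either improves detection has not been measured here.

```php
<?php
// Apply a weighting function to every frequency; keys (byte codes) are kept.
function reweight_spectrum(array $spectrum, callable $weight): array
{
    // array_map preserves keys when it is given a single array.
    return array_map($weight, $spectrum);
}

// CP1251 codes of 'о', 'е', 'ф' with the frequencies quoted above (rounded).
$spectrum = [0xEE => 0.0952, 0xE5 => 0.0684, 0xF4 => 0.0016];

// Square root: narrows the gap between common and rare letters.
$sqrtWeights = reweight_spectrum($spectrum, 'sqrt');

// Logarithm relative to the cut-off: positive for every letter that was kept.
$logWeights = reweight_spectrum($spectrum, function (float $freq): float {
    return log($freq / 0.001);
});

var_dump($sqrtWeights, $logWeights);
```

Flatter weights make rare (for example uppercase) letters count for more, which may help on all-caps input at the price of a narrower lead on ordinary text.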

5. Conclusion

This article does not cover UTF-8. There is no fundamental difference there, except that splitting a string into characters and obtaining their codes takes a bit more work.
These ideas can of course be extended beyond Cyrillic encodings; all that is needed is the “spectra” of the corresponding languages and encodings.
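As a rough sketch of the extra work UTF-8 needs: the string has to be split into multi-byte characters before looking anything up. The version below uses preg_split with the u modifier and a spectrum keyed by UTF-8 characters; the function name and the null-on-invalid-input convention are my assumptions.

```php
<?php
// Rate a string as UTF-8 against a spectrum keyed by UTF-8 characters.
// Returns null when the input is not valid UTF-8 at all.
function rate_utf8(string $str, array $utf8Spectrum): ?float
{
    // With the /u modifier, preg_split fails on malformed UTF-8.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return null;
    }
    $rate = 0.0;
    foreach ($chars as $char) {
        $rate += $utf8Spectrum[$char] ?? 0.0;
    }
    return $rate;
}

// Truncated spectrum, for illustration only (values rounded from the article).
$spectrum = ['о' => 0.0952, 'н' => 0.0522];

var_dump(rate_utf8('он', $spectrum));        // sum of the two frequencies
var_dump(rate_utf8("\xEE\xED", $spectrum));  // CP1251 bytes: not valid UTF-8
```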

P.S. If there is real need or interest, I will publish a second part with a fully working library on GitHub. That said, I believe the data in this post is enough to write such a library quickly for your own needs: the “spectrum” for Russian is published, and it can easily be converted to any encoding you need.

UPDATED
The comments brought up the “wonderful” function that I had linked to under the “misery” heading. Perhaps I got carried away with my wording, but what is published is published; I am not in the habit of editing such things. So as not to make unfounded claims, let's check whether it works 100% of the time, as its presumed author says.
1) Will there be errors during “normal” operation of this function? Assume that our content is 100% valid.
Answer: yes, there will be.
2) Will it detect anything other than UTF-8 versus non-UTF-8?
Answer: no, it will not.

Here is the code:

$str_cp1251 = iconv('UTF-8', 'Windows-1251', 'Русский текст');
var_dump(md5($str_cp1251));
var_dump(md5(iconv('Windows-1251', 'Windows-1251', $str_cp1251)));
var_dump(md5(iconv('KOI8-R', 'KOI8-R', $str_cp1251)));
var_dump(md5(iconv('ISO-8859-5', 'ISO-8859-5', $str_cp1251)));
var_dump(md5(iconv('UTF-8', 'UTF-8', $str_cp1251)));

And here is the output:

m00t@m00t:~/workspace/test$ php detect_encoding.php 
string(32) "96e14d7add82668414ffbc498fcf2a4e"
string(32) "96e14d7add82668414ffbc498fcf2a4e"
string(32) "96e14d7add82668414ffbc498fcf2a4e"
string(32) "96e14d7add82668414ffbc498fcf2a4e"
PHP Notice:  iconv(): Detected an illegal character in input string in /home/m00t/workspace/test/detect_encoding.php on line 36
PHP Stack trace:
PHP   1. {main}() /home/m00t/workspace/test/detect_encoding.php:0
PHP   2. iconv() /home/m00t/workspace/test/detect_encoding.php:36
string(32) "d41d8cd98f00b204e9800998ecf8427e"

What do we see? Single-byte Cyrillic text is left unchanged by iconv($encoding, $encoding). So this approach can only distinguish UTF-8 from non-UTF-8, and even then at the cost of a PHP notice.
IMHO, it is precisely because of code like this that PHP gets called a “language for fools” (c), as the trolls love to write in every thread about this language.
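If all that is needed is the UTF-8 versus non-UTF-8 distinction, it can be made without triggering any notice. A minimal sketch using PCRE's u modifier (the function name is hypothetical; mb_check_encoding() is an alternative where mbstring is installed):

```php
<?php
// True when $str is well-formed UTF-8; emits no notices or warnings.
function is_utf8(string $str): bool
{
    // An empty pattern always matches, unless /u rejects the subject
    // as malformed UTF-8, in which case preg_match returns false.
    return preg_match('//u', $str) === 1;
}

var_dump(is_utf8('Русский текст'));     // UTF-8 source text
var_dump(is_utf8("\xD0\xF3\xF1\xF1"));  // raw CP1251 bytes
```

Like the iconv trick, this says nothing about which single-byte encoding a non-UTF-8 string is in.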
