UTF-8 in PHP. Part 1

    Hello. With this post I would like to help bring about a brighter future in which everyone uses the "kosher" UTF-8 encoding. In particular, this concerns the environment closest to me: the web and the PHP programming language. At the end of the series we will get to the practical part and develop yet another reinvent-the-wheel library.

    1. Introduction


    To understand the rest of the text, beginners need to know a few details about encodings in general. I will try to keep the presentation as simple as possible. If you know nothing about bitwise operations, first read up on them on Wikipedia.

    Start by understanding that a computer works with numbers and stores strings (and characters, as parts of them) in numerical form as well. This is what encodings are for. In essence, they are tables that map numbers to characters. Historically the main one is ASCII, which contains only control codes and Latin characters, 128 in total (127 is the maximum number that can be stored in 7 bits).
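    As a small illustration of the "table" idea, PHP's built-in ord() and chr() convert between a single byte and its number. The sketch below covers the ASCII range only; the helper name asciiCodes is made up for this example.

```php
<?php
// Hypothetical helper: list the numeric code of every byte in a string.
function asciiCodes(string $text): array
{
    $codes = [];
    for ($i = 0, $len = strlen($text); $i < $len; $i++) {
        $codes[] = ord($text[$i]); // character -> number
    }
    return $codes;
}

$codes = asciiCodes('Hi');                        // 0x48 and 0x69
$text  = implode('', array_map('chr', $codes));   // number -> character
echo implode(' ', $codes), ' => ', $text, PHP_EOL;
```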

    To store non-ASCII text, many other encodings were created that use the 8th bit as well. They can hold up to 256 characters; the first 128 traditionally match ASCII, and into the rest everyone stuffed whatever they wanted. As a result, each operating-system vendor ended up with its own set of encodings, each meeting the needs of only a relatively narrow group of people. The lack of common standards made things worse: the encodings cannot be told apart algorithmically, and detecting them is more like guessing (more on that in the following parts).
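    To make the ambiguity concrete, here is a rough sketch that reads the same two bytes through two 8-bit Cyrillic code pages. The tables are hand-copied two-entry excerpts rather than complete code pages, and decodeWith is a name invented for this example.

```php
<?php
// Hand-copied excerpts of two 8-bit Cyrillic code pages (two entries each,
// not the full tables). Keys are byte values, values are Unicode code points.
$codePages = [
    'Windows-1251' => [0xCF => 0x041F, 0xF0 => 0x0440], // П, р
    'KOI8-R'       => [0xCF => 0x043E, 0xF0 => 0x041F], // о, П
];

// Hypothetical helper: interpret raw bytes using one of the excerpts above.
function decodeWith(array $table, string $bytes): array
{
    $codePoints = [];
    foreach (str_split($bytes) as $byte) {
        $codePoints[] = $table[ord($byte)];
    }
    return $codePoints;
}

$bytes = "\xCF\xF0"; // the same two bytes, with no hint about their encoding
foreach ($codePages as $name => $table) {
    $hex = array_map(function ($cp) {
        return sprintf('U+%04X', $cp);
    }, decodeWith($table, $bytes));
    echo $name, ': ', implode(' ', $hex), PHP_EOL;
}
```

    Nothing in the bytes themselves says which table is the right one, which is why detection ends up being guesswork.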

    As a result, a universal solution was needed: an encoding that can store all possible characters and takes into account the differences between writing systems (for example, the direction of writing). The problem was solved by creating Unicode, which can represent almost all of the world's writing systems with a single character set.

    The most popular encoding on the web is UTF-8, which has a number of significant advantages:
    • full compatibility with ASCII;
    • it can be distinguished from other encodings with high accuracy;
    • each character occupies from 1 to 4 bytes (the standard calls bytes octets; note that I use the two terms interchangeably), depending on the numerical value that needs to be stored.


    I would like to elaborate on the last point. Earlier it was enough to look a character up in a table and write the resulting number down; now there is also a defined way of storing that number, which depends on how many bits it needs. The storage scheme is shown in the table (x marks the stored data bits):
    Bits  Maximum stored value  Octet 1    Octet 2   Octet 3   Octet 4
                                (start)    (continuation octets)
    7     U+007F                0xxxxxxx
    11    U+07FF                110xxxxx   10xxxxxx
    16    U+FFFF                1110xxxx   10xxxxxx  10xxxxxx
    21    U+10FFFF *            11110xxx   10xxxxxx  10xxxxxx  10xxxxxx

    * By the standard; the bit layout itself could hold values up to U+1FFFFF.
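
    The thresholds from the table can be turned into a few comparisons. This is only a sketch; utf8OctetCount is a hypothetical helper name, not part of PHP or of the library discussed in this series.

```php
<?php
// Hypothetical helper mirroring the table: how many octets UTF-8 needs
// for a given code point. Returns 0 for values outside the Unicode range.
function utf8OctetCount(int $codePoint): int
{
    if ($codePoint < 0) {
        return 0;
    }
    if ($codePoint <= 0x7F) {
        return 1; // fits in 7 bits
    }
    if ($codePoint <= 0x7FF) {
        return 2; // fits in 11 bits
    }
    if ($codePoint <= 0xFFFF) {
        return 3; // fits in 16 bits
    }
    if ($codePoint <= 0x10FFFF) {
        return 4; // fits in 21 bits, capped by the standard
    }
    return 0;
}

foreach ([0x48, 0x041F, 0x20AC, 0x1F600] as $cp) {
    printf("U+%04X needs %d octet(s)\n", $cp, utf8OctetCount($cp));
}
```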


    It is easy to see that the high bits of the start octet always hold a counter giving the number of bytes in the sequence: it is the number of leading ones, followed by a zero. Please note: if there is only one octet, no leading one is written at all, so start octets are easy to tell apart from continuation octets, which always begin with 10.
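
    Reading that counter takes only a few shifts. A minimal sketch, assuming sequences of at most 4 octets; the function name utf8SequenceLength is invented for this example.

```php
<?php
// Hypothetical helper: read the counter in the high bits of an octet.
// Returns 1-4 for a start octet, 0 for a continuation octet (10xxxxxx)
// or for a value that cannot start a sequence of up to 4 octets.
function utf8SequenceLength(int $octet): int
{
    if ($octet >> 7 === 0b0) {
        return 1; // 0xxxxxxx
    }
    if ($octet >> 5 === 0b110) {
        return 2; // 110xxxxx
    }
    if ($octet >> 4 === 0b1110) {
        return 3; // 1110xxxx
    }
    if ($octet >> 3 === 0b11110) {
        return 4; // 11110xxx
    }
    return 0;
}

foreach ([0x48, 0xD0, 0x9F, 0xE2, 0xF0] as $octet) {
    printf("%08b -> %d\n", $octet, utf8SequenceLength($octet));
}
```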

    As an example, let's see how the string "Привет Hi" looks in UTF-8.

    Step one. Convert each character to its numeric representation (I will use hexadecimal) according to the Unicode code table.

    Привет Hi = 0x041F 0x0440 0x0438 0x0432 0x0435 0x0442 0x0020 0x0048 0x0069
    Do not forget that the space is also a character.

    Step two. Convert the numbers from hexadecimal to binary. We use the Windows 7 calculator (in programmer mode).

    0x041F = 0000 0100 0001 1111
    0x0440 = 0000 0100 0100 0000
    0x0438 = 0000 0100 0011 1000
    0x0432 = 0000 0100 0011 0010
    0x0435 = 0000 0100 0011 0101
    0x0442 = 0000 0100 0100 0010
    0x0020 = 0010 0000
    0x0048 = 0100 1000
    0x0069 = 0110 1001
    For clarity I padded the high-order bits with zeros. Please note: characters can occupy a different number of bytes.
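
    If no calculator is at hand, the same conversion can be sketched in PHP with sprintf() and its %b format. The helper toBinaryGroups is hypothetical and only sized for the one- and two-byte values of this example.

```php
<?php
// Code points of the example string.
$codePoints = [0x041F, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442, 0x0020, 0x0048, 0x0069];

// Hypothetical helper: binary digits of a number, zero-padded to whole bytes
// and split into groups of four, like the calculator output above.
function toBinaryGroups(int $value): string
{
    $bytes = $value > 0xFF ? 2 : 1; // enough for this example only
    $bits  = sprintf('%0' . ($bytes * 8) . 'b', $value);
    return implode(' ', str_split($bits, 4));
}

foreach ($codePoints as $cp) {
    printf("0x%04X = %s\n", $cp, toBinaryGroups($cp));
}
```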

    Step three. Translate the numeric representations into UTF-8 octet sequences.

    0x041F = 100 0001 1111 = 110xxxxx 10xxxxxx = [110]10000 [10]011111
    0x0440 = 100 0100 0000 = 110xxxxx 10xxxxxx = [110]10001 [10]000000
    0x0438 = 100 0011 1000 = 110xxxxx 10xxxxxx = [110]10000 [10]111000
    0x0432 = 100 0011 0010 = 110xxxxx 10xxxxxx = [110]10000 [10]110010
    0x0435 = 100 0011 0101 = 110xxxxx 10xxxxxx = [110]10000 [10]110101
    0x0442 = 100 0100 0010 = 110xxxxx 10xxxxxx = [110]10001 [10]000010
    0x0020 = 010 0000 = 0xxxxxxx = [0]0100000
    0x0048 = 100 1000 = 0xxxxxxx = [0]1001000
    0x0069 = 110 1001 = 0xxxxxxx = [0]1101001
    The counter and continuation-marker bits are shown in brackets. Please note: characters with codes below 0x0080 are stored unchanged; this is the ASCII compatibility. Also keep in mind that UTF-8 takes twice as much space for Russian text (2 bytes per letter) as Windows-1251, which uses only 1 byte per character.
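
    The same step can be sketched in code by gluing the prefix bits to the data bits. This is an illustration limited to the 1- and 2-octet forms that occur in the example string; utf8BitStrings is a made-up helper name.

```php
<?php
// Hypothetical helper: UTF-8 octets of a code point, written as bit strings.
// Only the 1- and 2-octet forms needed for this example are handled.
function utf8BitStrings(int $codePoint): array
{
    if ($codePoint < 0 || $codePoint >= 0x800) {
        throw new InvalidArgumentException('This sketch handles U+0000..U+07FF only.');
    }
    if ($codePoint < 0x80) {
        return ['0' . sprintf('%07b', $codePoint)];   // ASCII, stored unchanged
    }
    return [
        '110' . sprintf('%05b', $codePoint >> 6),     // counter + upper 5 bits
        '10'  . sprintf('%06b', $codePoint & 0x3F),   // marker + lower 6 bits
    ];
}

$codePoints = [0x041F, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442, 0x0020, 0x0048, 0x0069];
$octets = [];
foreach ($codePoints as $cp) {
    $octets = array_merge($octets, utf8BitStrings($cp));
}
echo implode(' ', $octets), PHP_EOL;
echo count($codePoints), ' characters, ', count($octets), ' octets', PHP_EOL;
```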

    Written out in a row, the result is (I hope without errors): "11010000 10011111 11010001 10000000 11010000 10111000 11010000 10110010 11010000 10110101 11010001 10000010 00100000 01001000 01101001".

    You can check the solution with the code:
    $tmp = '';
    // Split the bit string on spaces and turn every octet into a byte
    foreach (explode(' ', '11010000 10011111 11010001 10000000 11010000 10111000 11010000 10110010 11010000 10110101 11010001 10000010 00100000 01001000 01101001') as $octet) {
        $tmp .= chr(bindec($octet));
    }
    echo $tmp;


    To perform the reverse operation in code, we need to (simplified):
    1. Determine the number of octets in the character from its first octet and save this value;
    2. Strip the octet counter from the first byte and save the remainder;
    3. If the sequence has more than one octet, shift the value left by 6 bits and write the lower 6 bits of the next octet into the freed bits; do this for every continuation octet;
    4. Repeat from step 1 until satisfied :).
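
    Before the optimized version, here is a deliberately simple loop that follows the four steps literally. It is a sketch under stated assumptions: the input is well-formed UTF-8, and nothing is checked beyond the start octet and the length (no overlong forms, surrogates, or invalid continuation octets). The name decodeUtf8Simple is invented for this example.

```php
<?php
// A simplified sketch of the four steps above, assuming well-formed input.
function decodeUtf8Simple(string $bytes): array
{
    $codePoints = [];
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $octet = ord($bytes[$i++]);

        // Steps 1 and 2: read the counter, then strip it from the start octet.
        if ($octet >> 7 === 0b0) {
            $count = 1;
            $value = $octet;
        } elseif ($octet >> 5 === 0b110) {
            $count = 2;
            $value = $octet & 0x1F;
        } elseif ($octet >> 4 === 0b1110) {
            $count = 3;
            $value = $octet & 0x0F;
        } elseif ($octet >> 3 === 0b11110) {
            $count = 4;
            $value = $octet & 0x07;
        } else {
            throw new InvalidArgumentException('Not a start octet at offset ' . ($i - 1));
        }
        if ($count - 1 > $len - $i) {
            throw new InvalidArgumentException('Truncated sequence at offset ' . ($i - 1));
        }

        // Step 3: append the lower 6 bits of every continuation octet.
        for ($k = 1; $k < $count; $k++) {
            $value = ($value << 6) | (ord($bytes[$i++]) & 0x3F);
        }
        $codePoints[] = $value;
    } // Step 4: continue with the next character.
    return $codePoints;
}

foreach (decodeUtf8Simple("\xD0\x9F\xD1\x80 Hi") as $cp) {
    printf("U+%04X ", $cp);
}
echo PHP_EOL;
```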


    Optimized PHP code that converts characters to their numerical representation and back (I will publish the full version at the end of the series):
    class String_Multibyte
    {
        /**
         * Returns the decimal value of the UTF-8 character whose first octet is at position $index in the string $char.
         * Surrogate codes, private-use characters, the BOM and 0x10FFFE-0x10FFFF return FALSE.
         *
         * [...] The function has been optimized, so it contains redundant code.
         *
         * @author Andrew Dryga, {@link http://andryx.habrahabr.ru}.
         * @param  string    $char   String containing the character(s).
         * @param  int       &$index Offset of the octet at which decoding should start. After the call it holds the position of the last octet belonging to that character.
         * @return int|false Decimal value of the character, or FALSE if a character or byte that must be ignored was found.
         */
        public function getCodePoint($char, &$index = 0)
        {
            // Get the value of the first octet
            $octet1 = ord($char[$index]);
            // If it is in the ASCII range (has the form 0bbb bbbb), return it as is.
            if ($octet1 >> 7 == 0x00) {
                return $octet1;
            } elseif ($octet1 >> 6 != 0x02) {
                // Check that the next octet exists
                if (!isset($char[++$index])) {
                    return false;
                }
                // Get its value
                $octet2 = ord($char[$index]);
                // Check that it is valid (must have the form 10bb bbbb)
                if ($octet2 >> 6 != 0x02) {
                    --$index;
                    return false;
                }
                // Keep only its lower 6 bits
                $octet2 &= 0x3F;

                // Check the counter; if there should be only two octets, build the result
                if ($octet1 >> 5 == 0x06) {
                    $result = ($octet1 & 0x1F) << 6 | $octet2;
                    // The result must be in the shortest possible form
                    // (fixed: U+0080 itself is a valid two-octet character)
                    if (0x80 <= $result) {
                        return $result;
                    }
                } else {
                    if (!isset($char[++$index])) {
                        return false;
                    }

                    $octet3 = ord($char[$index]);
                    if ($octet3 >> 6 != 0x02) {
                        --$index;
                        return false;
                    }
                    $octet3 &= 0x3F;

                    if ($octet1 >> 4 == 0x0E) {
                        $result = ($octet1 & 0x0F) << 12 | $octet2 << 6 | $octet3;
                        // Check the minimum value; drop surrogates, the private-use area and the BOM
                        // (fixed: U+0800 itself is a valid three-octet character)
                        if (0x800 <= $result && !(0xD7FF < $result && $result < 0xF900) && $result != 0xFEFF) {
                            return $result;
                        }
                    } else {
                        if (!isset($char[++$index])) {
                            return false;
                        }

                        $octet4 = ord($char[$index]);
                        if ($octet4 >> 6 != 0x02) {
                            --$index;
                            return false;
                        }
                        $octet4 &= 0x3F;

                        if ($octet1 >> 3 == 0x1E) {
                            $result = ($octet1 & 0x07) << 18 | $octet2 << 12 | $octet3 << 6 | $octet4;
                            // Check the minimum value; drop the private-use planes and some other characters;
                            // make sure the value does not go beyond the Unicode limit of 10FFFF
                            // (fixed: U+10000 itself is a valid four-octet character)
                            if (0x10000 <= $result && $result < 0xF0000) {
                                return $result;
                            }
                        }
                    }
                }
            }
            // Rejected value, or a stray continuation octet (10bb bbbb) in the start position
            return false;
        }


        /**
         * Returns the UTF-8 character for the given code.
         * [...]
         * @author ur001, {@link http://ur001.habrahabr.ru}.
         * @param int $codePoint Unicode character ordinal.
         * @return string|false UTF-8 character, or FALSE on error.
         */
        public function getChar($codePoint)
        {
            if ($codePoint < 0x80) {
                return chr($codePoint);
            } elseif ($codePoint < 0x800) {
                return chr(0xC0 | $codePoint >> 6) . chr(0x80 | $codePoint & 0x3F);
            } elseif ($codePoint < 0x10000) {
                return chr(0xE0 | $codePoint >> 12) . chr(
                0x80 | $codePoint >> 6 & 0x3F) . chr(0x80 | $codePoint & 0x3F);
            } elseif ($codePoint < 0x110000) {
                return chr(0xF0 | $codePoint >> 18) . chr(
                0x80 | $codePoint >> 12 & 0x3F) . chr(0x80 | $codePoint >> 6 & 0x3F) . chr(
                0x80 | $codePoint & 0x3F);
            } else {
                return false;
            }
        }
    }

    The getChar() method is taken from the Jevix library. I had already seen this code and remembered it well, so even if I had rewritten it from memory, it would be dishonest not to credit the author.

    You can test the resulting class using the code:
    // Create an instance of the class
    $obj = new String_Multibyte();
    // Build the string in the way most convenient for testing
    $tmp = '';
    foreach (explode(' ', '11010000 10011111 11010001 10000000 11010000 10111000 11010000 10110010 11010000 10110101 11010001 10000010 00100000 01001000 01101001') as $octet) {
        $tmp .= chr(bindec($octet));
    }
    // Build the map of character codes
    $map = array();
    $len = strlen($tmp);
    for ($i = 0; $i < $len; $i++) {
        // getCodePoint() moves $i to the last octet of the character it has read;
        // anything that is not an integer code point is skipped
        if (is_int($result = $obj->getCodePoint($tmp, $i))) {
            $map[] = $result;
        }
    }
    // Clear the string and rebuild it from the map
    $tmp = '';
    $count = count($map);
    for ($i = 0; $i < $count; $i++) {
        $tmp .= $obj->getChar($map[$i]);
    }
    // Print the rebuilt string
    echo $tmp, PHP_EOL;
    // Check that it is valid (this is the simplest way)
    echo preg_match('#.{1}#u', $tmp) ? 'Valid Unicode' : 'Unknown', PHP_EOL;
    I did not try to write the most beautiful or correct test code, but with it you can easily change the characters bit by bit and see the result immediately. All invalid sequences are ignored, so the output string is always valid, but there is more to come.

    To be sure that the text contains nothing superfluous, you also need to remove unwanted characters (non-printable, non-spacing, unassigned, surrogate, and so on) and normalize it; more on that in the next part.

    PS:
    The next parts will cover normalization, security, encoding detection, and working with UTF-8 in PHP.
