The whole truth about the UTF-8 flag

A common misconception is that character strings, unlike byte strings, have the UTF-8 flag set.
Many people suspect that if the data is ASCII-7-bit, then the UTF-8 flag is simply not important.

In fact, the flag can be either set or cleared, both for character strings and for completely arbitrary binary data.

Marc Lehmann, a well-known author in the Perl community, comments on this in the documentation of the JSON::XS module:

You can have Unicode strings with that flag set, with that flag clear, and you can have binary data with that flag set and that flag clear. Other possibilities exist, too.

Consider the case where ASCII-7bit data has the UTF-8 flag set.

use utf8;
use strict;
use warnings;
my $u = "тест"; # Unicode string
my $ascii = "x"; # plain ASCII character
my ($ascii_u, undef) = split(/ /, "$ascii $u");
die unless $ascii_u eq "x"; # the same ASCII character
print "UTF-8 flag set!" if utf8::is_utf8($ascii_u); # but now it has the UTF-8 flag set

This code prints “UTF-8 flag set!”. That is, the ASCII-7bit string acquired the flag when split cut a Unicode string (with the UTF-8 flag) into parts. In other words, the programmer does not control whether ASCII data carries the UTF-8 flag; it depends on where and how the data was obtained, and on what data was next to it.

The same effect is obtained if you decode ASCII-7bit bytes into ASCII-7bit characters using Encode::decode():

use strict;
use warnings;
use Encode;
my $ascii = 'x'; # ASCII character
my $ascii_u = decode("UTF-8", encode("UTF-8", "$ascii"));
die unless $ascii_u eq "x"; # the same ASCII character
print "UTF-8 flag set!" if utf8::is_utf8($ascii_u); # but now it has the UTF-8 flag set

That is, round-trip transcoding does not change the data (as expected), but it sets the UTF-8 flag. (However, this behavior of decode() contradicts its own documentation, which in turn contradicts the idea that there should be no documentation and no guarantees regarding the UTF-8 flag on ASCII data.)

The appearance of the UTF-8 flag can be explained by efficiency considerations. It would be too expensive, after split, to scan the string to see whether it consists only of ASCII characters and the flag can therefore be cleared.

This behavior of the UTF-8 flag is similar to a virus - it infects all the data it comes into contact with.

Consider the case where non-ASCII, Unicode characters do not have a UTF-8 flag.

use strict;
use warnings;
use Digest::SHA qw/sha1_hex/;
use utf8;
my $s = "µ";
my $s1 = $s;
my $s2 = $s;
my $digest = sha1_hex($s2); # try commenting out this line
print "utf-8 bit ON (s1)\n" if utf8::is_utf8($s1);
print "utf-8 bit ON (s2)\n" if utf8::is_utf8($s2);
print "s1 and s2 are equal\n" if $s1 eq $s2;

prints:

utf-8 bit ON (s1)
s1 and s2 are equal

That is, a call to a function from a third-party module cleared the UTF-8 flag. At the same time, the strings with and without the flag compare as completely identical.
This can only happen with characters > 127 and <= 255 (i.e. Latin-1).

In fact, a utf8::downgrade operation has been applied to the string $s2.

This function is described in the documentation as changing the internal representation of the string:

Converts in-place the internal representation of the string from UTF-X to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). The logical character sequence itself is unchanged.

In principle, the Digest::SHA module documents this behavior, although it is not required to:

Be aware that the digest routines silently convert UTF-8 input into its
equivalent byte sequence in the native encoding (cf. utf8::downgrade). This
side effect influences only the way Perl stores the data internally, but
otherwise leaves the actual value of the data intact.

In the general case, any third-party function may downgrade strings without mentioning it in the documentation (or, for example, do it only occasionally).

Consider the case when absolutely arbitrary, binary data has a UTF-8 flag.

use utf8;
use strict;
use warnings;
# we need bytes::length for debugging; the '()' keeps bytes from affecting the rest of the program
use bytes ();
my $u = "тест"; # non-ASCII string
# bytes, not characters
my $bin = "\xf1\xf2\xf3";
## once again obtain an ASCII string with the UTF-8 flag
my $ascii = "x"; # plain ASCII character
my ($ascii_u, undef) = split(/ /, "$ascii $u");
die unless $ascii_u eq "x"; # the same ASCII character
die unless utf8::is_utf8($ascii_u); # but now it has the UTF-8 flag set
## //
print "original bin length:\t";
print length($bin) . "\t" . bytes::length($bin) . "\n";
my $bin_a = $bin . $ascii; # concatenate binary data with ASCII data
print "bin_a length:\t";
print length($bin_a) . "\t" . bytes::length($bin_a) . "\n";
my $bin_u = $bin . $ascii_u; # again concatenate binary data with ASCII data
print "bin_u length:\t";
print length($bin_u) . "\t" . bytes::length($bin_u) . "\n";
print "bin_a and bin_u are equal!\n" if $bin_a eq $bin_u;
open my $f, ">", "file_a.tmp";
binmode $f;
print $f $bin_a;
close $f;
open $f, ">", "file_u.tmp";
binmode $f;
print $f $bin_u;
close $f;
system("md5sum file_?.tmp"); # md5sum is a Linux command

prints:

original bin length: 3 3
bin_a length: 4 4
bin_u length: 4 7
bin_a and bin_u are equal!
33818f4b23aa74cddb8eb625845a459a file_a.tmp
33818f4b23aa74cddb8eb625845a459a file_u.tmp

As a result, the binary data, after concatenation with an ASCII string, grew in internal size in bytes (but not in characters) from 4 to 7, but only when the ASCII string happened, unintentionally, to have the UTF-8 flag set.

However, when compared with each other the two strings are identical, and when both are written to a file, even without specifying an encoding, the files are identical as well.

Thus, binary data can grow in size and acquire the UTF-8 flag without this being a bug: all built-in Perl functions process it exactly as if there were no flag (if there are exceptions, the bug is in them).

Any other Perl code should also process such data without errors (as long as it does not try to analyze the internal structure of the string, or at least does so correctly).

In fact, what happened to the binary data is analogous to the utf8::upgrade operation. The data was interpreted as Latin-1, converted to UTF-8, and given the UTF-8 flag. This operation is the opposite of utf8::downgrade described above. utf8::downgrade can only be applied to Latin-1 characters, whereas utf8::upgrade can be applied to any bytes (since any byte corresponds to a character from Latin-1).
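The relationship between the two operations can be checked directly. The following is a minimal sketch, not taken from the original article; the variable names are illustrative, and the second argument to utf8::downgrade is the documented FAIL_OK parameter, which makes the function return false instead of dying.

```perl
use strict;
use warnings;

my $bytes = "\xf1\xf2\xf3";            # three bytes, stored without the UTF-8 flag
my $up    = $bytes;
utf8::upgrade($up);                    # same characters, UTF-8 internal representation

my $flag_on     = utf8::is_utf8($up);  # the flag is now set
my $same_string = $up eq $bytes;       # the logical string did not change

utf8::downgrade($up);                  # back to one byte per character
my $flag_off = !utf8::is_utf8($up);

my $wide = "\x{442}";                  # a character above 255
my $ok   = utf8::downgrade($wide, 1);  # FAIL_OK: returns false, no exception

print "upgrade keeps the string equal\n"       if $same_string;
print "downgrade is impossible above 255\n"    unless $ok;
```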

This can be important if you keep a large amount of binary data in memory. It’s not at all great if a 400 megabyte blob suddenly turns into a 700 megabyte one, just because you appended one ASCII-7bit byte with the UTF-8 flag to it. A good way out here is unit tests or runtime assertions that check the UTF-8 flag.
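Such a runtime assertion might look like the sketch below. The helper name assert_not_upgraded is hypothetical; it reads the argument through $_[0] so that checking a large blob does not copy it.

```perl
use strict;
use warnings;

# Hypothetical helper: dies if a blob that should be plain bytes
# has acquired the UTF-8 flag. Uses $_[0] to avoid copying the data.
sub assert_not_upgraded {
    my $name = $_[1];
    die "$name unexpectedly has the UTF-8 flag set\n" if utf8::is_utf8($_[0]);
    return 1;
}

my $blob = "\xf1\xf2\xf3";
assert_not_upgraded($blob, 'blob');          # passes silently

my $infected = $blob;
utf8::upgrade($infected);                    # simulate the "viral" upgrade
my $caught = !eval { assert_not_upgraded($infected, 'infected'); 1 };
print "assertion fired: $@" if $caught;
```

For debugging code like this, the use of utf8::is_utf8 is the kind of exception the article itself allows for.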

In general, it is not possible to distinguish bytes from characters

Consider the problem: write a function that takes XML as input; if the XML is bytes, it looks at the encoding declared in the "xml" tag and decodes the bytes into characters; if the input is already characters, it does nothing.

Such a function cannot be implemented. For example, given the string “Hello, München”, the function cannot tell whether these are characters, or bytes encoded in CP1251 or in KOI8-R (when the string is downgraded, which in general the programmer does not control).

For characters > 255 the UTF-8 flag is always set (utf8::downgrade cannot be applied to them). For characters with codes <= 127 the UTF-8 bit is not important, in the sense that they can be regarded both as binary data and as characters. Latin-1 characters cannot be distinguished from bytes.
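The ambiguity can be illustrated with a short sketch (not from the original article): the same scalar is a valid Latin-1 character string and, at the same time, a valid byte string in CP1251 and in KOI8-R, each with a different meaning. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw/decode/;

my $input = "M\xFCnchen";                     # characters, or bytes? Perl cannot tell

# Three equally legitimate readings of the second element \xFC:
my $as_chars  = $input;                       # already characters: U+00FC
my $as_cp1251 = decode("cp1251", "$input");   # bytes in CP1251
my $as_koi8r  = decode("koi8-r", "$input");   # bytes in KOI8-R

printf "as characters: U+%04X\n", ord(substr($as_chars,  1, 1));
printf "as CP1251:     U+%04X\n", ord(substr($as_cp1251, 1, 1));
printf "as KOI8-R:     U+%04X\n", ord(substr($as_koi8r,  1, 1));
```

Nothing in the scalar itself says which reading the caller intended.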

Distinguishing bytes from characters in Perl is the same as distinguishing a file name from an email address or from a person’s name. Sometimes it is possible, but in the general case it is not. The programmer must remember what kind of data each variable holds.

This is in the documentation:

perldoc.perl.org/perlunifaq.html

How can I determine if a string is a text string or a binary string?

You can't. Some use the UTF8 flag for this, but that's misuse, and makes well behaved modules like Data::Dumper look bad. The flag is useless for this purpose, because it's off when an 8 bit encoding (by default ISO-8859-1) is used to store the string.

This is something you, the programmer, has to keep track of; sorry. You could consider adopting a kind of "Hungarian notation" to help with this.

If you still need to do this, you can create your own class, which will contain a string of bytes or characters, and a flag showing what it is (the same trick is suitable for email vs file name vs person name).
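A minimal sketch of such a class follows. The package name TaggedString and its methods are hypothetical, invented for illustration; only Encode::decode is a real library call.

```perl
use strict;
use warnings;

package TaggedString;    # hypothetical class name
use Encode ();

# The constructor records explicitly whether the value is bytes or text.
sub new_bytes { my ($class, $value) = @_; return bless { value => $value, is_text => 0 }, $class }
sub new_text  { my ($class, $value) = @_; return bless { value => $value, is_text => 1 }, $class }

sub is_text { return $_[0]{is_text} }

# Return characters: decode byte strings with the given encoding,
# pass text through unchanged.
sub as_text {
    my ($self, $encoding) = @_;
    return $self->{value} if $self->{is_text};
    my $octets = $self->{value};               # copy, so the stored bytes stay intact
    return Encode::decode($encoding, $octets);
}

package main;

my $from_wire = TaggedString->new_bytes("M\xFCnchen");   # bytes in CP1251
my $from_user = TaggedString->new_text("M\xFCnchen");    # already characters

my $decoded = $from_wire->as_text("cp1251");
my $kept    = $from_user->as_text("cp1251");

printf "decoded: U+%04X, kept: U+%04X\n",
    ord(substr($decoded, 1, 1)), ord(substr($kept, 1, 1));
```

The decision is made from the explicit tag, never from the UTF-8 flag, so it does not change when some other code upgrades or downgrades the underlying scalar.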

The “Wide character” warning is not issued for Latin-1 characters

The following example gives the warning “Wide character in print” only if we print $s2:

use strict;
use warnings;
use utf8;
my $s1 = "ß";
my $s2 = "тест";
my $s = $ARGV[0] ? $s1 : $s2;
print $s;

If we print $s1, Perl converts the Unicode character ß (U+00DF, UTF-8 \xC3\x9F) to the byte \xDF and outputs it.
The same behavior holds for all functions that accept bytes rather than characters (print and syswrite without a specified encoding, the SHA, MD5 and CRC32 checksums, MIME::Base64).
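For a checksum, this means that the digest of a Latin-1 character string is the digest of its single-byte form, not of its UTF-8 encoding. A small sketch, with illustrative variable names, assuming that the UTF-8 bytes are what you actually want to hash in the second case:

```perl
use strict;
use warnings;
use Digest::SHA qw/sha1_hex/;
use Encode qw/encode/;

my $char = "\xDF";                   # the character ß
utf8::upgrade($char);                # stored internally as \xC3\x9F

# The byte-oriented function sees the byte \xDF, whatever the internal form.
my $implicit = sha1_hex($char);

# To hash the UTF-8 encoding, encode explicitly first.
my $explicit = sha1_hex(encode("UTF-8", $char));

print "implicit digest is that of byte DF\n"   if $implicit eq sha1_hex("\xDF");
print "explicit digest is that of bytes C3 9F\n" if $explicit eq sha1_hex("\xC3\x9F");
```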

Viral downgrade

At the beginning of the article, the “viral” behavior of the UTF-8 bit on ASCII characters was described (a viral utf8::upgrade). Now consider the “viral” clearing of the UTF-8 bit on Latin-1 characters (utf8::downgrade).

Imagine that we are writing a function that is defined only over bytes, not over characters; hash functions, encryption, compression, MIME::Base64 and the like are good examples.

1. Since it is impossible to distinguish binary data from characters, you should treat the input as bytes.
2. Bytes may arrive in upgraded form (i.e. with the UTF-8 flag set). The result must be the same as for the downgraded form.

Therefore, you need to do utf8::downgrade and raise an error if that fails.

Algorithms, such as hash functions, are characterized by concern for performance. Making a second copy of the data in memory is not efficient, so, in most cases, the function modifies the parameter passed to it.

As many people probably know, in Perl all parameters are passed by reference, but are usually used by value.

sub mycode {
  $_[0] = "X"; # modifies the caller's actual first argument, regardless of the caller's wishes
}

sub mycode {
  my ($arg1) = @_; # the typical way to handle function arguments
  $arg1 = "X"; # the parameter is now a copy; the actual argument is not modified
}

Thus, code written exactly in accordance with the Perl specification ends up silently doing utf8::downgrade on the caller's actual arguments, regardless of the caller's wishes, and may thereby trigger a bug somewhere else, in code that processes strings incorrectly but had worked fine until then.
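When the extra memory is acceptable, a byte-oriented function can avoid this side effect by downgrading a copy. The following is a sketch under that assumption; byte_sum is a hypothetical function standing in for a real hash or encoder.

```perl
use strict;
use warnings;

# Hypothetical byte-oriented function: sums the byte values of its input.
sub byte_sum {
    my ($data) = @_;                     # copy: the caller's scalar is not touched
    utf8::downgrade($data, 1)            # 1 = FAIL_OK: return false instead of dying
        or die "Wide character in byte_sum\n";
    my $sum = 0;
    $sum += $_ for unpack("C*", $data);  # one value per byte
    return $sum;
}

my $blob = "\xF1";
utf8::upgrade($blob);                    # upgraded form of a single byte

my $sum        = byte_sum($blob);
my $still_up   = utf8::is_utf8($blob);   # the caller's representation is preserved
my $wide_fails = !eval { byte_sum("\x{442}"); 1 };

print "sum=$sum\n";
```

The trade-off is exactly the one described above: the copy costs memory, which is why performance-sensitive modules tend to modify the argument in place.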

For file names, this does not work

Functions that accept file names as arguments (open, the -X file tests), as well as those that return file names (readdir), do not obey these rules (this is noted in the documentation).

They simply interpret the file name as it is in memory.

The algorithm of their work can be described as follows:

sub open {
 my ( ... $filename) = @_;
 utf8::_utf8_off($filename); # now it is binary data
 _open($filename);
}

There are several reasons for this:

1. On many POSIX systems (Linux/*BSD), on many file systems, a file name may be an arbitrary sequence of bytes, not necessarily a sequence of characters in any encoding.
2. There is no portable way to determine the encoding of a file system.
3. There may be several file systems with different encodings on one machine.
4. You cannot rely on the assumption that the encoding of file names matches the encoding of the locale.
5. Compatibility with old code must be preserved.

As a result, the programmer must determine the encoding and communicate it to the interpreter, but an API for this does not exist yet.

Let us modify the earlier example, where we “accidentally” ended up with a downgraded character string.

use strict;
use warnings;
use Digest::SHA qw/sha1_hex/;
use utf8;
my $s = "µ";
my $s1 = $s;
my $s2 = $s;
my $digest = sha1_hex($s2); # try commenting out this line
print STDERR "s1 and s2 are equal\n" if $s1 eq $s2;
open my $f, ">", "$s1.tmp" or die "s1 failed: $!";
print $f "test";
close $f;
open $f, "<", "$s2.tmp" or die "s2 failed: $!";
print STDERR "Done\n";

Output:

s1 and s2 are equal
s2 failed: No such file or directory

That is, the strings $s1 and $s2 are equal but refer to different files; if the sha1_hex call is removed, they refer to the same file.
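One possible workaround is to encode file names to bytes explicitly before passing them to file functions, so that the internal representation no longer matters. The sketch below is not from the original article; the helper fs_name is hypothetical, and it assumes that the file system stores names as UTF-8, which the programmer has to know from somewhere else.

```perl
use strict;
use warnings;
use Encode qw/encode/;
use File::Temp qw/tempdir/;

# Hypothetical helper. ASSUMPTION: the file system stores names as UTF-8.
sub fs_name {
    my ($name) = @_;
    return encode("UTF-8", $name);        # same bytes for either internal form
}

my $dir = tempdir(CLEANUP => 1);

my $s1 = "\xB5"; utf8::upgrade($s1);      # upgraded form of the character µ
my $s2 = "\xB5"; utf8::downgrade($s2);    # downgraded form of the same character

my $path1 = "$dir/" . fs_name("$s1.tmp");
my $path2 = "$dir/" . fs_name("$s2.tmp");

open my $out, ">", $path1 or die "s1 failed: $!";
print $out "test";
close $out;

open my $in, "<", $path2 or die "s2 failed: $!";
my $content = <$in>;
close $in;

print STDERR "Done\n" if $content eq "test";
```

This is the “wrapper over all functions” approach in miniature: every file name passes through one place where the encoding decision is made.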

You can fall into the same trap when using any module that works with files (for example, File::Find).

Where else it does not work

The Encode module has a decode_utf8 function, documented as:

Equivalent to $string = decode("utf8", $octets [, CHECK])

But in fact, if $octets has the UTF-8 flag set, the function simply returns it unchanged (although it should try to do utf8::downgrade and work with the result as binary data, and, if the downgrade fails, throw a “Wide character” error).

This bug was noticed (RT #61671, RT #87267) as soon as it appeared, in 2010.

But the maintainer rejects all such bug reports. Moreover, the reports do not even ask that the function behave correctly (in the spirit of Perl), nor that there be documentation describing this behavior, but merely that the behavior not contradict the existing documentation. The maintainer believes that the functions are documented as equivalent, and that this does not mean identical (although in my opinion equivalence can be understood as both similarity and identity). Perhaps in mathematics equivalence does not even hint at identity... If someone can solve this riddle, I will be very grateful.

The unicode bug

In the downgraded form, Latin-1 characters cannot be distinguished from bytes; therefore, in this form, some metacharacters in regular expressions and the functions uc, lc and quotemeta do not work correctly.

The workaround is utf8::upgrade; in newer versions of Perl there are also directives that make this behavior consistent.

A detailed description is in the Perl documentation.
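A short sketch of both workarounds, not taken from the original article. It assumes that the directive in question is `use feature 'unicode_strings'` (available since Perl 5.12) and that no locale is in effect; the variable names are illustrative.

```perl
use strict;
use warnings;

my $s = "\xE0";                      # the character à in downgraded form

my ($buggy, $via_upgrade, $via_feature);
{
    no feature 'unicode_strings';    # make the classic behavior explicit
    $buggy = uc($s);                 # treated as a byte: left unchanged

    my $u = $s;
    utf8::upgrade($u);               # workaround 1: upgrade first
    $via_upgrade = uc($u);           # now uppercased to \xC0
}
{
    use feature 'unicode_strings';   # workaround 2: consistent semantics
    $via_feature = uc($s);           # uppercased even in downgraded form
}

printf "buggy: %02X, upgraded: %02X, feature: %02X\n",
    ord($buggy), ord($via_upgrade), ord($via_feature);
```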

What to do with all this?

1. Do not use (unless you know exactly what you are doing) the following functions: utf8::is_utf8, Encode::_utf8_on, Encode::_utf8_off, and all functions from the bytes module (the documentation for all of these functions advises against using them except for debugging).

2. Use utf8::upgrade and utf8::downgrade whenever the Perl specification requires it.

3. To convert between characters and bytes, use Encode::encode and Encode::decode.

4. If you use someone else's code that violates these rules, check it for bugs and use workarounds.

5. When working with file names, you will either have to use a wrapper over all functions or, using tests, make sure that the internal representation of file names does not change while the code runs.

There are several examples where a violation of these rules seemed to me justified.

Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) && (bytes::length($_[0]) == length($_[0]));

(Clears the UTF-8 flag for ASCII-7bit text, thereby achieving a 30% increase in regexp performance in all Perl versions except 5.19.)

defined($_[0]) && utf8::is_utf8($_[0]) && (bytes::length($_[0]) != length($_[0]))

(Returns TRUE if the string has the UTF-8 flag set and it is not ASCII-7bit. It can be used in unit tests to make sure that your 400 megabytes of binary data does not turn into 700)

There is still an option to do nothing. Honestly, it will take quite a while before you come across any bug (but by then it will be too late). This option is highly recommended for library developers.
