
A few words about UTF-8
For a long time Perl knew nothing about encodings. A string was just a sequence of bytes; everyone stored whatever they wanted in it and only occasionally had to think about which encoding the data was in. Times have changed: Unicode and UTF-8 appeared, and Perl programmers had to support them too. As usual, it was done the Perl way. I hope this article will spare some nerves for those who are not yet familiar with how UTF-8 is implemented in Perl.
Strictly speaking, Perl has had two implementations of UTF-8 support. The first appeared in Perl 5.6 but was rather crude and inconvenient. In Perl 5.8 the Unicode machinery was radically reworked, which is why modules on CPAN are full of amusing checks of the interpreter version. Everything written below refers to this second implementation.
Pros and cons
If you have never thought about encodings, have been calmly developing monolingual applications and intend to carry on the same way, you almost certainly do not need Unicode. Data in single-byte encodings is usually more compact and faster to process, and it is easy and pleasant to work with.
You will probably need UTF-8 if you do not know in advance in what form the next portion of data will reach the application, or if you are developing an international project. Indeed, even if your site is in English, a German with umlauts in his name, or even a user from China, may well register on it. The easiest way to avoid worrying about what then ends up in the database (and about how you will display a Chinese name in your favorite latin-1) is to work in an encoding that supports many languages.
Another case where you cannot avoid getting acquainted with Perl's UTF-8 support is integration with third-party components that work in this format. For example, the XML::LibXML library returns the results of parsing XML files in this format.
The perl way
The developers probably reasoned something like this: we used to store chains of bytes in variables, and now we need to learn to store characters there. The length of a character in UTF-8 is variable and may be more than one byte. If regular expressions and string functions (such as length and substr) suddenly start behaving differently, nobody will thank us. So we need two kinds of strings: one that works the old way, with bytes, and one that works the new way, with characters. How? Let's introduce a hidden flag on scalars. If the flag is set, the string is treated as consisting of logical characters (let's call this the Perl internal format); if not, as consisting of bytes.
If you take two identical Unicode variables and simply clear the flag on one of them, Perl will process them differently (for example, they will most likely have different lengths). The data itself does not change, however: you can see this, for example, by printing both variables to a file or to the screen.
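The effect of the flag can be seen in a small experiment. This is only a sketch, assuming the Encode module bundled with Perl 5.8 and later; the sample bytes and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six bytes of UTF-8: "A" (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes).
my $bytes = "A\xC3\xA9\xE2\x82\xAC";

# The same text as a character string in Perl's internal format.
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;                   # length counts bytes here
my $char_len = length $chars;                   # and characters here
my $flag_off = utf8::is_utf8($bytes) ? 1 : 0;   # the flag is not set
my $flag_on  = utf8::is_utf8($chars) ? 1 : 0;   # the flag is set

# Encoded back to UTF-8, the character string yields the original bytes.
my $same_data = encode('UTF-8', $chars) eq $bytes ? 1 : 0;

printf "bytes: length=%d, flag=%d\n", $byte_len, $flag_off;
printf "chars: length=%d, flag=%d\n", $char_len, $flag_on;
printf "same data on output: %d\n", $same_data;
```

The byte string reports a length of 6 and the character string a length of 3, although both describe the same text.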
It is worth mentioning that in Perl terminology characters that do not fit into a single byte are often called wide characters. If you come across a warning containing these words, it concerns a Unicode string.
There are several ways to work with Unicode data in Perl. The main ones are:
- specifying Unicode characters in a string explicitly, with a construct of the form \x{0100};
- recoding a string manually, using the Encode module or functions from the utf8 package;
- enabling the use utf8 pragma, which sets the flag on the constants encountered in the code;
- reading from an I/O handle with the :encoding or :utf8 IO layer specified, so that all data is automatically converted to the internal format.
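The first option in the list is the easiest to try. A minimal sketch, assuming only core Perl and the bundled Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0100 (LATIN CAPITAL LETTER A WITH MACRON) written as an escape:
# the result is a one-character string in the internal format.
my $char = "\x{0100}";

my $char_len = length $char;                  # one character
my $has_flag = utf8::is_utf8($char) ? 1 : 0;  # the flag is set

# Encoding to UTF-8 shows how many bytes the character takes externally.
my $utf8_bytes = encode('UTF-8', $char);
my $byte_len   = length $utf8_bytes;

printf "characters=%d, flag=%d, UTF-8 bytes=%d\n", $char_len, $has_flag, $byte_len;
```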
Module Encode
The module has been bundled with Perl since 5.8, so it makes sense to use it not only for Unicode but for any other encoding conversions as well. Working with it is not too complicated. The only difficulty is learning not to confuse the encode function with the decode function :-). They have the same interface, and the naming logic is not as obvious as one would like. Since strings with the Unicode flag are considered the internal format, data must be decoded into it from any external encoding (including UTF-8 without the flag); conversely, to pass data out in some external encoding, it must be encoded from the internal format into that encoding. It looks something like this:
    use Encode;
    $bytes = encode('cp1251', $string); # encode the string from the internal representation into cp1251
    $string = decode('cp1251', $bytes); # and back
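The two calls can be tried as a complete script. This is a sketch under the assumption that the bundled Encode module is available; the sample word (the Russian for "hello", written with escapes so the script does not depend on the encoding of the source file) and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Six Cyrillic characters in the internal format.
my $string = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";

# Internal format -> cp1251: one byte per character.
my $bytes = encode('cp1251', $string);

# cp1251 -> internal format again.
my $back = decode('cp1251', $bytes);

# Print the bytes as hex to avoid sending wide characters to STDOUT.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "cp1251 bytes: %s\n", $hex;
printf "round trip ok: %s\n", ($back eq $string ? 'yes' : 'no');
```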
Since not every character can be converted from one encoding to another without loss, there is also a third parameter that determines what to do when problems arise. You can read about it in the documentation for the Encode module, where a whole section is devoted to it.
If you are sure that a variable holds bytes in UTF-8, you can simply set its flag, without recoding or checking, using _utf8_on. The is_utf8 function tells you whether the flag is set on a string (and, if you wish, checks the validity of the data in it). The flag is cleared, as you might guess, with _utf8_off. The only catch is that these functions are marked as INTERNAL, and you should not count on them staying unchanged.
Starting with Perl 5.8.1, some of the Encode module's functions also became available in the utf8:: namespace: is_utf8, encode and decode. The last two differ from their namesakes in Encode in that they modify the variable passed to them instead of returning the result, and they do not take an encoding (it is assumed that you are working with UTF-8 data without the flag set). All these functions are built into the interpreter, so you do not need to write use utf8 to access them; moreover, doing so may have additional effects (more on them a little later).
use utf8;
The use utf8 pragma tells the interpreter that all constants and regular expressions written within its scope and containing non-ASCII characters should be treated as Unicode and automatically converted to the internal format (the source file itself must be saved in UTF-8). As usual, the pragma is turned off with no utf8. There is also a pragma with the opposite meaning, use bytes, within whose scope even data with the UTF-8 flag is treated as consisting of bytes.
PerlIO
Perl IO layers really deserve an article of their own. The idea is that some time ago the good old open function acquired a three-argument syntax:
    open $fh, $mode, $filename
Besides the standard values such as '>' and '<', $mode can also specify the encoding of the file. The data read is then automatically converted to Perl's internal format:
    open $fh, "<:encoding(cp1251)", $filename
If the file contains data in UTF-8, the code can be simplified slightly:
    open $fh, "<:utf8", $filename
Of course, these layers can also be used when writing files; the effect is the reverse: data is encoded from the internal format on output.
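Both directions can be checked with a scratch file. The following is a rough sketch, not a recommended file-handling pattern: it assumes the core File::Temp module for the temporary file, and the sample text and variable names are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Six Cyrillic characters in the internal format.
my $text = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";

# A scratch file, used only to keep the sketch self-contained.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing through an encoding layer: characters become UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', $filename or die "Cannot write $filename: $!";
print {$out} $text;
close $out or die "Cannot close $filename: $!";

# Reading without a decoding layer returns the raw bytes.
open my $raw, '<:raw', $filename or die "Cannot read $filename: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Reading through the layer returns characters in the internal format.
open my $in, '<:encoding(UTF-8)', $filename or die "Cannot read $filename: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "raw: %d bytes, decoded: %d characters\n", length $bytes, length $chars;
```

The same file yields 12 bytes when read raw and 6 characters when read through the layer.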
By the way, Perl can make the I/O streams Unicode once and for all with the -C command-line switch. For details see, as always, perldoc (the switch is described in perlrun).
Pitfalls
Of course there are some. In general, one sometimes gets the feeling that at every turn of its development Perl scatters a lot of rakes around itself, which programmers then carefully collect (sometimes twice, if the first rake was experimental).
Firstly, some functions by definition work with bytes rather than characters, and strings in the internal representation stick in their throats. They include the frequently used functions of the Digest::MD5 module. For example, the following code dies with the error Wide character in subroutine entry at test.pl line 3:
    use Digest::MD5 'md5_hex';
    print md5_hex("\x{400}");
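One common way around this, shown here only as a sketch, is to encode the character string into bytes explicitly before hashing. The choice of UTF-8 as the byte encoding is an assumption of this example; any encoding works as long as everyone who compares digests uses the same one.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

my $char_string = "\x{400}";   # one character above 0xFF

# Passing the character string directly dies inside md5_hex.
my $direct = eval { md5_hex($char_string) };
my $error  = $@;

# Encoding to UTF-8 first gives md5_hex a well-defined byte sequence.
my $digest = md5_hex(encode('UTF-8', $char_string));

print "direct call failed: $error" unless defined $direct;
print "digest of the UTF-8 bytes: $digest\n";
```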
Secondly, data does not always arrive in the form the program expects. It is naive to expect, for example, that an HTML form handler will always receive valid UTF-8. The consequences of trusting the source too much range from data corruption to fatal errors when the data is later transcoded into another encoding (for example, when generating an email).
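A defensive way to accept such input is to decode it strictly and treat a failure as bad data. The sketch below uses the third (CHECK) parameter of Encode's decode with the FB_CROAK value; the helper name decode_utf8_strict is hypothetical and exists only for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded character string,
# or undef if the input is not valid UTF-8.
sub decode_utf8_strict {
    my ($input) = @_;
    my $copy = $input;   # decode may modify its argument when CHECK is set
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
}

my $good = decode_utf8_strict("caf\xC3\xA9s");  # valid UTF-8
my $bad  = decode_utf8_strict("caf\xE9s");      # a Latin-1 byte, invalid as UTF-8

printf "valid input:   %s\n", defined $good ? 'accepted' : 'rejected';
printf "invalid input: %s\n", defined $bad  ? 'accepted' : 'rejected';
```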
And finally, the most common and interesting problem arises when you concatenate two strings, only one of which is stored in Perl's internal format. Suppose we have a file (saved in UTF-8):
    use Encode;
    $a = decode('utf8', "Мне нравится "); # a string in the internal format
    $b = "на Хабре"; # a sequence of 15 bytes
    $c = $a.$b;
In the last line Perl tries to bring the strings to a common denominator, the internal format. Because it treats $b as a chain of bytes, each byte of that string is converted to UTF-8 separately, as if it were a single-byte character of its own. The result is a mess like this (with the flag set, by the way):
    $c = "Мне нравится Ð½Ð° Ð¥Ð°Ð±Ñ\x{80}Ðµ"
The glitch is obvious to the naked eye thanks to the gibberish typical of Unicode mix-ups; you cannot mistake it for anything else.
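The usual remedy is to decode every byte string before it meets a character string. A minimal sketch of the difference, assuming the bundled Encode module; the phrases are shortened and written as byte escapes so the script does not depend on the encoding of the source file, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes of a 4-character prefix, decoded into the internal format.
my $left = decode('UTF-8', "\xD0\x9C\xD0\xBD\xD0\xB5 ");

# UTF-8 bytes of an 8-character phrase, left undecoded: 15 bytes.
my $right_bytes = "\xD0\xBD\xD0\xB0 \xD0\xA5\xD0\xB0\xD0\xB1\xD1\x80\xD0\xB5";

# Mixed concatenation: every byte of $right_bytes becomes its own character.
my $broken = $left . $right_bytes;

# Decoding first keeps both halves in the internal format.
my $fixed = $left . decode('UTF-8', $right_bytes);

printf "broken: %d characters\n", length $broken;
printf "fixed:  %d characters\n", length $fixed;
```

The broken string is 19 characters long (4 plus one per byte), while the fixed one has the expected 12.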
Conclusion
Many subtleties have been left out of this article. A number of useful functions of the Encode and utf8 modules remained behind the scenes. There was no room to mention the variations of the internal format that differ in how they treat characters that are invalid from the point of view of UTF-8. Questions related to regular expressions were omitted entirely. If you want to explore the topic in depth, see the manuals:
- perldoc utf8;
- perldoc Encode;
- perldoc perluniintro;
- perldoc perlunitut;
- perldoc perlunifaq;
- perldoc perlunicode.
UPD: Habr user codesign sent links to his own work on the same topic, which I recommend: