Working with encodings in Perl

    There is already a good article on Habr about using UTF-8 in Perl: habrahabr.ru/post/53578. Nevertheless,
    I would like to say a little more about encodings.

    Many questions arise from the variety of encodings and from the terminology used, and most of us have run into encoding-related problems at some point. In this article I will try to present this topic in an understandable form, starting with the question of automatically detecting the encoding of a text.

    Determining the encoding of the source file. Determining the encoding of a source document is a task that comes up quite often in practice. Take a browser as an example: besides the HTML file itself, it may also receive an HTTP response header that specifies the document encoding, and this header may be wrong, so it cannot be relied on alone. As a result, browsers support automatic detection of the encoding.

    In Perl you can use Encode::Guess for this, but a more "industrial" option is Encode::Detect::Detector. As its documentation says, it provides an interface to Mozilla's universal charset detector.
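    For illustration, here is a minimal sketch of both approaches (the candidate list passed to Encode::Guess and the way the file is read are just assumptions for this example):

    use strict;
    use warnings;
    use Encode::Guess;                 # ships with the Encode distribution
    use Encode::Detect::Detector;      # CPAN module wrapping the Mozilla detector

    # read the file as raw bytes
    my $octets = do { local $/; open my $fh, '<:raw', $ARGV[0] or die $!; <$fh> };

    # Encode::Guess needs a list of candidate encodings to choose from
    my $enc = guess_encoding($octets, qw/cp1251 koi8-r utf8/);
    print ref($enc) ? $enc->name : "guess failed: $enc", "\n";

    # Encode::Detect::Detector returns the name of the detected charset (or undef)
    my $charset = Encode::Detect::Detector::detect($octets);
    print defined $charset ? $charset : "unknown", "\n";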

    If you are going to study the source code, pay attention to the nsUniversalDetector.cpp file and the method

    nsresult nsUniversalDetector::HandleData(const char* aBuf, PRUint32 aLen)

    All the work of determining the encoding starts from this method. First it checks whether there is a BOM header; if there is, further determination of the encoding is done by a simple comparison of the initial bytes of the data (a small Perl sketch of this check follows the list):
    • EF BB BF UTF-8 encoded BOM
    • FE FF 00 00 UCS-4, unusual octet order BOM (3412)
    • FE FF UTF-16, big endian BOM
    • 00 00 FE FF UTF-32, big-endian BOM
    • 00 00 FF FE UCS-4, unusual octet order BOM (2143)
    • FF FE 00 00 UTF-32, little-endian BOM
    • FF FE UTF-16, little endian BOM
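    A minimal sketch of the same check in Perl (the mapping simply mirrors the table above; real code would also have to handle buffers shorter than four bytes):

    use strict;
    use warnings;

    sub bom_encoding {
    	my ($buf) = @_;
    	# the longer signatures must be checked before the two-byte ones
    	return 'UTF-8'        if $buf =~ /^\xEF\xBB\xBF/;
    	return 'UTF-32BE'     if $buf =~ /^\x00\x00\xFE\xFF/;
    	return 'UTF-32LE'     if $buf =~ /^\xFF\xFE\x00\x00/;
    	return 'UCS-4 (3412)' if $buf =~ /^\xFE\xFF\x00\x00/;
    	return 'UCS-4 (2143)' if $buf =~ /^\x00\x00\xFF\xFE/;
    	return 'UTF-16BE'     if $buf =~ /^\xFE\xFF/;
    	return 'UTF-16LE'     if $buf =~ /^\xFF\xFE/;
    	return undef;         # no BOM found
    }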


    Next, each byte of the data is analyzed to see whether it is a non-US-ASCII character (codes 128 to 255); if so, objects of the following classes are created:
    • nsMBCSGroupProber;
    • nsSBCSGroupProber;
    • nsLatin1Prober;


    each of which is responsible for analyzing a group of encodings (MB stands for multi-byte, SB for single-byte).

    If the data is US-ASCII, there are two options: either it is plain ASCII (pure ascii), or the file contains escape sequences and belongs to encodings such as ISO-2022-KR, etc. (see en.wikipedia.org/wiki/ISO/IEC_2022 for details). In that case the detector implemented by the nsEscCharSetProber class is used.

    nsMBCSGroupProber supports such encodings as: “UTF8”, “SJIS”, “EUCJP”, “GB18030”, “EUCKR”, “Big5”, “EUCTW”.

    nsSBCSGroupProber - such as Win1251, koi8r, ibm866 and others.

    Detection of a single-byte encoding is based on analyzing the frequency of two-character sequences in the text.
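    As a toy illustration of the idea (this is not the actual Mozilla algorithm, and the %freq table below is purely hypothetical), one could score a buffer against a per-encoding table of frequent byte pairs and pick the encoding with the best score:

    use strict;
    use warnings;

    # hypothetical bigram frequencies for one candidate encoding,
    # built offline from a large corpus of text in that encoding
    my %freq = ( "\xEF\xF0" => 0.012, "\xF0\xE8" => 0.010 );   # illustrative values only

    sub score_buffer {
    	my ($buf) = @_;
    	my $score = 0;
    	for my $i (0 .. length($buf) - 2) {
    		$score += $freq{ substr($buf, $i, 2) } // 0;
    	}
    	return $score;   # compare scores across candidate encodings
    }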

    It should be said that all these methods are probabilistic in nature. For example, if there is not enough text, no algorithm can reliably determine the encoding automatically. That is why different programming environments each handle encodings in their own way; nothing is ever determined entirely "by itself".

    Unicode and Perl. A historical view. According to www.unicode.org/glossary, Unicode has seven encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE. The term Unicode itself is defined as "... a standard for the digital representation of the characters used in writing all of the world's languages ...". In addition, there is also UTF-7, which is not part of the standard but is supported by Perl via Encode::Unicode::UTF7 (see also RFC 2152).

    UTF-7 is practically never used. Here is what Encode::Unicode::UTF7 says: "... however, if you want to use UTF-7 for documents in mail and web pages, do not use it unless you are sure that the recipients and readers (of those documents) can handle this encoding ...".
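    If you are curious what it looks like, here is a tiny sketch (assuming a standard Encode distribution, which loads Encode::Unicode::UTF7 on demand for the "UTF-7" name):

    use strict;
    use warnings;
    use Encode;

    # "Привет" written with \x{} escapes so the sketch does not depend on the source encoding
    my $octets = encode('UTF-7', "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}");
    print $octets, "\n";   # non-ASCII characters become +...- sequences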

    Perl developers, following the wide adoption of Unicode encodings in applications, implemented Unicode support in Perl as well. In addition, the Encode module supports other encodings, both single-byte and multi-byte; the list can be found in the Encode::Config package. For working with mail, the MIME encodings are supported: MIME-Header, MIME-B, MIME-Q, MIME-Header-ISO_2022_JP.
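    To see which encodings a particular Perl installation actually knows about, you can ask Encode directly (a quick sketch):

    use strict;
    use warnings;
    use Encode;

    # ":all" also includes the encodings that are normally loaded on demand
    my @all = Encode->encodings(":all");
    print scalar(@all), " encodings available\n";
    print join(", ", grep { /^MIME/ } @all), "\n";   # the MIME-* family mentioned above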

    It should be said that UTF-8 is very widespread as an encoding for web documents. UTF-16 is used in Java and Windows, while UTF-8 and UTF-32 are used by Linux and other Unix-like systems.

    Basic Unicode support first appeared in Perl 5.6.0, but Perl 5.8.0 was recommended for any serious Unicode work. Perl 5.14.0 is the first version in which Unicode support is integrated (almost) without pitfalls (the exceptions are some differences in quotemeta). Version 5.14 also fixes a number of bugs and deviations from the Unicode standard.

    Visual Studio 2012 and encodings (for comparison with Perl). When we write an application in C# in Visual Studio, we do not think about the encoding in which everything is stored and processed. When creating a document, Visual Studio will create it in UTF-8 and add the UTF-8 BOM header, the byte sequence 0xEF, 0xBB, 0xBF. When we convert the source file (already open in Visual Studio), for example from UTF-8 to CP1251, we get the error message
    Some bytes have been replaced with the Unicode substitution character while loading ... with Unicode (UTF-8) encoding. Saving the file will not preserve the original file contents.

    If you open an existing file in cp1251, ToUpper(), for example, will work correctly; but if you convert the file to KOI8-R, then open it in Visual Studio and run it, there can be no question of correct operation: the environment does not know what KOI8-R is, and how could it find out?

    "The Unicode Bug" in Perl. Something similar happens with a Perl program, but in Perl the developer can explicitly specify the encoding of the application's source code. That is why, when Perl beginners open their favorite editor on a Russian-language Windows XP and write something like the following in ANSI (i.e. cp1251)

    use strict;
    use warnings;
    my $a = "слово";
    my $b = "СЛОВО";
    my $c = "word";
    print "Words are equal" if uc($a) eq uc($b);
    

    and the script reports that the strings in the variables are not equal (nothing is printed), at first it is hard for them to understand what is going on. Similar things happen with regular expressions and string functions (although uc($c) will work correctly).

    This is the so-called "Unicode Bug" in Perl (see the documentation for details): in different single-byte encodings, characters with codes from 128 to 255 have different meanings. For example, the letter П has code 0xCF in cp1251, 0x8F in CP866, and 0xF0 in KOI8-R. How, then, can string functions such as uc(), ucfirst(), lc(), lcfirst() or \L, \U in regular expressions work correctly?
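    These byte values are easy to check with Encode (a small sketch):

    use strict;
    use warnings;
    use Encode;

    # U+041F is the Cyrillic capital letter П
    for my $enc (qw/cp1251 cp866 koi8-r/) {
    	printf "%-7s 0x%02X\n", $enc, ord( encode($enc, "\x{041F}") );
    }
    # prints 0xCF, 0x8F and 0xF0 respectively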

    It is enough to "tell" the interpreter that the encoding of the source file is cp1251, and everything will work correctly. More precisely, in the code below the variables $a and $b will store the strings in Perl's internal format.

    use strict;
    use warnings;
    use encoding 'cp1251';
    my $a = "слово";
    my $b = "СЛОВО";
    print "equal" if uc($a) eq uc($b);
    


    The internal string format in Perl. In reasonably recent versions of Perl, strings can be stored in the so-called internal format (Perl's internal form). Note that they can also be stored simply as a set of bytes. In the example above, where the encoding of the source file was not specified explicitly (with use encoding 'cp1251';), the variables $a, $b and $c simply store a set of bytes (the Perl documentation also uses the term octet: a sequence of octets).

    The internal format differs from a plain set of bytes in that the string is stored in UTF-8 and the UTF8 flag is set on the variable. Here is an example; let us change the source code of the program slightly to the following

    use strict;
    use warnings;
    use encoding 'cp1251';
    use Devel::Peek;
    my $a = "слово";
    my $b = "СЛОВО";
    print Dump ($a);
    


    Here is what we get as a result:

    SV = PV(0x199ee4) at 0x19bfb4
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x19316c "\321\201\320\273\320\276\320\262\320\276"\0 [UTF8 "\x{441}\x{43b}\x{43e}\x{432}\x{43e}"]
      CUR = 10
      LEN = 12

    Note that FLAGS = (PADMY,POK,pPOK,UTF8). If we remove use encoding 'cp1251';
    then we get

    SV = PV(0x2d9ee4) at 0x2dbfc4
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x2d316c "\321\201\320\273\320\276\320\262\320\276"\0
      CUR = 10
      LEN = 12

    When we specify that the source code of the file is in cp1251 (or any other encoding), Perl knows that string literals in the source must be converted from that encoding into the internal format (in this case from cp1251 into the internal UTF-8), and it does so.

    A similar problem of determining the encoding arises when working with data received from outside, such as files or the web. Let us consider each case.

    Suppose we have a CP866-encoded file that contains the word "Когда" (that is, "when" with a capital letter). We need to open it and check every line for the word "когда". Here is how to do it correctly (the source code itself must be in UTF-8).

    use strict;
    use warnings;
    use encoding 'utf8';
    open (my $tmp, "<:encoding(cp866)", $ARGV[0]) or die "Error open file - $!";
    while (<$tmp>)
    {
    	if (/когда/i)
    	{
    		print "OK\n";
    	}
    }
    close ($tmp);
    


    Please note that if we do not use "<:encoding(cp866)" and instead specify use encoding 'cp866', the regular expressions will still run, but only on a set of bytes, and /i will not work. The "<:encoding(cp866)" construct tells Perl that the data in the text file is encoded in CP866, so it correctly converts it from CP866 into the internal format (CP866 -> UTF8 plus setting the UTF8 flag).
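    The same layer can also be pushed onto an already open handle with binmode, which is convenient when the handle comes from elsewhere. A sketch, assuming its own source is saved in UTF-8:

    use strict;
    use warnings;
    use utf8;   # the source of this sketch is assumed to be saved in UTF-8

    open(my $tmp, "<", $ARGV[0]) or die "Error open file - $!";
    binmode($tmp, ":encoding(cp866)");   # from here on, reads are decoded from CP866
    while (my $line = <$tmp>)
    {
    	print "OK\n" if $line =~ /когда/i;
    }
    close($tmp);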

    In the following example we fetch a page using LWP::UserAgent. Here is the right way to do it.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::Entities;
    use Data::Dumper;
    use Encode;
    use Devel::Peek;
    my $ua = LWP::UserAgent->new();
    my $res = $ua->get("http://wp.local");
    my $content;
    if (!$res->is_error)
    {
    	$content = $res->content;
    }
    else
    {
    	exit(1);
    }
    # Only if the page is in UTF-8; if it were in cp1251: $content = decode('cp1251', $content);
    # decode converts UTF-8 bytes (a sequence of octets) into Perl's internal format
    $content = decode('utf8',$content);
    # now $content holds text in the internal format, which other modules such as HTML::Entities,
    # as well as string functions, regular expressions, etc., can work with
    decode_entities($content);
    


    Note the call $content = decode('utf8', $content).

    LWP::UserAgent works with bytes; it does not know (and it is not its concern) whether the page is in a single-byte encoding like cp1251 or in UTF-8, so we must state this explicitly. Unfortunately, much of the literature contains examples written in English and for older versions of Perl, so those examples say nothing about transcoding.
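    As a side note, HTTP::Response also provides a decoded_content method that tries to pick the charset from the response headers; it is convenient, but it trusts exactly the headers that, as noted below, may be wrong. Continuing the example above:

    # decoded_content decodes the body using the charset announced in the HTTP headers,
    # whereas the explicit decode() above keeps that decision under our control
    my $text = $res->decoded_content;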

    Search engine robots (and other such code), for example, must not only correctly determine the page encoding without relying on the server response headers or the contents of the HTML meta tag, which may be wrong, but also determine the page language. So do not think that only Perl programmers have to deal with all of the above.

    Using the example of receiving external data from a website, we have come to the Encode module. Here is its main API, very important in the work of any Perl programmer:

    $string = decode(ENCODING, OCTETS[, CHECK]). Converts a set of bytes (octets) from the encoding ENCODING into Perl's internal format;
    $octets = encode(ENCODING, STRING[, CHECK]). Converts from Perl's internal format into a set of bytes in the encoding ENCODING;
    [$length =] from_to($octets, FROM_ENC, TO_ENC [, CHECK]). Converts bytes from one encoding into another.
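    A short round trip shows how these three calls relate to each other (the byte values are for cp1251):

    use strict;
    use warnings;
    use Encode;

    my $octets = "\xCF\xF0\xE8\xE2\xE5\xF2";              # "Привет" as cp1251 bytes
    my $string = decode('cp1251', $octets);               # octets -> internal format (characters)
    my $back   = encode('cp1251', $string);               # characters -> cp1251 octets again
    Encode::from_to($octets, 'cp1251', 'koi8-r');         # in-place: cp1251 octets -> koi8-r octets
    print length($string), " characters, ", length($back), " bytes\n";   # 6 and 6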
    


    In the example where we opened the text file in CP866, we could omit <:encoding(cp866). Then each read operation would give us a set of bytes in CP866, which we can convert into the internal format ourselves with

    $str = decode('cp866',$str)
    


    and continue working with the variable $str.

    Someone may suggest that you can use UTF-8 for the program source code and, in addition, transcode from CP866 to UTF-8, and everything will work as it should. This is not so; consider an example (the text file again contains the word "Когда" with a capital letter).

    use strict;
    use warnings;
    use encoding 'utf8';
    use Encode;
    #open (my $tmp, "<:encoding(cp866)", $ARGV[0]) or die "Error open file - $!";
    open (my $tmp, "<", $ARGV[0]) or die "Error open file - $!";
    while (<$tmp>)
    {
    	my $str = $_;
    	Encode::from_to($str,'cp866','utf8');
    	if ($str=~/когда/i)
    	{
    		print "OK\n";
    	}
    }
    close ($tmp);
    


    After Encode::from_to($str, 'cp866', 'utf8') executes, $str contains data in UTF-8, but as a sequence of bytes (octets), so /i does not work. To make everything work, you need to add the call

    $str = decode('utf8',$str)
    


    Of course, the simpler option is one line instead of two:

    $str = decode('cp866',$str)
    


    Perl's internal string format, in more detail. We have already said that regular expressions, some modules, and the string functions work correctly with strings stored not as a set of bytes but in Perl's internal representation. It was also said that Perl uses UTF-8 as its internal string storage format. This encoding was not chosen by accident: character codes 0 to 127 in it coincide with ASCII (US-ASCII), which covers the English alphabet. That is why a uc call on a string whose characters have codes from 0 to 127 works correctly, and it works regardless of the single-byte encoding in which the source code is saved. For UTF-8 source everything also works correctly.
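    One way to peek at the flag without a full Devel::Peek dump is utf8::is_utf8 (keeping in mind that the flag describes the internal storage, not whether the text "is Unicode"):

    use strict;
    use warnings;
    use Encode;

    my $bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";    # cp1251 octets
    my $chars = decode('cp1251', $bytes);      # Perl's internal format
    print utf8::is_utf8($bytes) ? "flag on\n" : "flag off\n";   # flag off
    print utf8::is_utf8($chars) ? "flag on\n" : "flag off\n";   # flag on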

    However, this is not all you need to know.

    UTF-8 vs utf8 vs UTF8. The UTF-8 encoding has become "stricter" over time (for example, certain characters were prohibited), so Perl's original implementation no longer matches the current standard. Starting with Perl 5.8.7, "UTF-8" means the modern, stricter dialect, while "utf8" means the older, more liberal dialect. Here is a small example

    use strict;
    use warnings;
    use Encode;
    # a character that is not used in UTF-8
    my $str = "\x{FDD0}";
    $str = encode("UTF-8",$str,1); # Ошибка
    $str = encode("utf8",$str,1); # OK
    


    Thus, the hyphen between "UTF" and "8" matters; without it, Encode becomes more liberal and possibly overly permissive. If we execute

    use strict;
    use warnings;
    use Encode;
    my $str = sprintf ("%s | %s | %s | %s | %s\n",
       find_encoding("UTF-8")->name ,
       find_encoding("utf-8")->name ,
       find_encoding("utf_8")->name ,
      	find_encoding("UTF8")->name ,
    	find_encoding("utf8")->name 
    	);
    print $str;
    

    We get the following result: utf-8-strict | utf-8-strict | utf-8-strict | utf8 | utf8.

    Working with the console. Consider the console of the Windows family of operating systems. In Windows there are the notions of Unicode, ANSI and OEM encodings. The OS API itself offers two kinds of functions, working with ANSI and with Unicode (UTF-16). The ANSI encoding depends on the localization of the OS; for the Russian version it is CP1251. OEM is the encoding used for console I/O; for Russian-language Windows it is CP866, the encoding introduced in Russian-language MS-DOS that later migrated to Windows for backward compatibility with old software. That is why the following program, saved in UTF-8,

    use strict;
    use warnings;
    use Encode;
    use encoding 'utf8';
    my $str = 'Привет мир';
    print $str;
    


    will not print the expected line: we are printing UTF-8 where CP866 is needed. Here the Encode::Locale module helps. If you look at its source code, you can see that on Windows it determines the ANSI and console encodings and creates the aliases console_in, console_out, locale and locale_fs. All that remains is to modify our program slightly.

    use strict;
    use warnings;
    use Encode::Locale;
    use Encode;
    use encoding 'utf8';
    my $str = 'Привет мир';
    if (-t) 
    {
    	binmode(STDIN, ":encoding(console_in)");
    	binmode(STDOUT, ":encoding(console_out)");
    	binmode(STDERR, ":encoding(console_out)");
    }
    print $str;
    


    P.S. This article is aimed at those who are starting to work with Perl, and it may be a bit too much for some. I am ready to hear and implement suggestions for expanding the article.
