
Unicode for Dummies

I don't really like titles like "Pokémon in their own juice for dummies / pots / pans", but this seems to be exactly the case: we'll talk about basic things which, handled carelessly, quite often lead to a pile of bruises and a lot of time lost around the question "Why doesn't it work?". If you are still afraid of and/or don't understand Unicode, read on.
What for?
The newcomer's main question is why there is such an impressive number of encodings and such seemingly confusing machinery for working with them (for example, in Python 2.x). The short answer: because that's how it happened historically :)
An encoding, for those who don't know, is a way of representing numbers, letters and all other characters in computer memory (read: as zeros and ones, i.e. as numbers). For example, a space is represented as 0b100000 (binary), 32 (decimal) or 0x20 (hexadecimal).
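You can check this mapping right in the interpreter; a tiny sketch using nothing but the built-in ord, chr and hex:
>>> ord(' ')       # the space character corresponds to the number 32
32
>>> hex(ord(' '))  # the same number written in hexadecimal
'0x20'
>>> chr(0x20)      # and back again: number 32 is the space character
' '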
So, once upon a time there was very little memory, and 7 bits were enough for every computer to represent all the necessary characters (digits, lowercase/uppercase Latin letters, a bunch of punctuation marks and the so-called control characters - all 127 available values were assigned to somebody). There was a single encoding back then - ASCII. As time went by, everyone was happy, and whoever was not happy (read: whoever was missing the "©" sign or a letter of their native alphabet) used the remaining 128 values at their own discretion, that is, created new encodings. This is how ISO-8859-1 and our (i.e. Cyrillic) cp1251 and KOI8 appeared. And along with them came the problem of interpreting bytes of the form 0b1******* (that is, values from 128 to 255): for example, 0b11011111 in cp1251 is our native "Я", while in ISO-8859-1 it is "ß".
At that point some bright minds got together and proposed a new standard - Unicode. It is a standard, not an encoding: Unicode by itself does not define how characters are stored on disk or transmitted over the network. It only defines the relationship between a character and a certain number, while the format in which these numbers are turned into bytes is defined by the Unicode encodings (for example, UTF-8 or UTF-16). At the moment the Unicode standard contains a little over 100 thousand characters, while UTF-16 can represent more than a million (and UTF-8 even more).
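To make the "standard vs. encoding" distinction concrete, here is a minimal sketch: one and the same code point (U+044F, the Cyrillic "я") turns into different bytes depending on which Unicode encoding you pick:
>>> c = u'\u044f'          # a single Unicode code point
>>> c.encode('utf-8')      # two bytes in UTF-8
'\xd1\x8f'
>>> c.encode('utf-16-be')  # two different bytes in UTF-16 (big-endian, no BOM)
'\x04O'
>>> c.encode('cp1251')     # and a single byte in the legacy cp1251
'\xff'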
For more details, told in a much more entertaining way, I recommend the magnificent article by Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Get to the point!
Naturally, Python supports Unicode too. Unfortunately, only in Python 3 did all strings become unicode by default, so newcomers keep running headlong into errors like:
>>> with open('1.txt') as fh:
...     s = fh.read()
>>> print s
кощей
>>> parser_result = u'баба-яга'  # an assignment just for clarity; imagine this is the result of some parser
>>> parser_result + s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    parser_result + s
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
or like this:
>>> str(parser_result)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    str(parser_result)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Let's sort it out, step by step.
Why does anyone use Unicode?
Why does my favourite html parser return Unicode? Let it return an ordinary string, and I'll deal with it from there! Right? Not quite. Even though each of the characters existing in Unicode can (probably) be represented in some single-byte encoding (ISO-8859-1, cp1251 and the others are called single-byte because they encode any character in exactly one byte), what do we do when the string is supposed to contain characters from different encodings? Assign a separate encoding to every character? No, of course not - we have to use Unicode.
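A small sketch of exactly that situation (the byte values are only for illustration): characters coming from two different single-byte encodings end up living happily side by side in one unicode string:
>>> ru = '\xea\xee\xf2'.decode('cp1251')  # the bytes of "кот" in cp1251
>>> de = '\xdf'.decode('latin1')          # the byte of "ß" in latin1
>>> ru + u' + ' + de                      # no single single-byte encoding could hold both
u'\u043a\u043e\u0442 + \xdf'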
Why do we need a new type of "unicode"?
Now we've reached the most interesting part. What is a string in Python 2.x? It is just bytes. Just binary data that can be anything. In fact, when we write something like:
>>> x = 'abcd'
>>> x
'abcd'
the interpreter does not create a variable containing the first four letters of the Latin alphabet, but only a sequence of four bytes, and the Latin letters here are used exclusively to denote these particular byte values. That is, 'a' here is simply a synonym for '\x61', and not a bit more. For example:
>>> '\x61'
'a'
>>> import struct
>>> struct.unpack('>4b', x)  # 'x' is just four signed/unsigned chars
(97, 98, 99, 100)
>>> struct.unpack('>2h', x)  # or two shorts
(24930, 25444)
>>> struct.unpack('>l', x)  # or one long
(1633837924,)
>>> struct.unpack('>f', x)  # or one float
(2.6100787562286154e+20,)
>>> struct.unpack('>d', x * 2)  # well, or 'x' as one half of a double
(1.2926117739473244e+161,)
And that’s it!
And the answer to the question "why do we need unicode" becomes obvious: we need a type that consists of characters, not bytes.
OK, I understand what a string is. Then what is Unicode in Python?
"Type unicode" is, first of all, an abstraction that implements the idea of Unicode (a set of characters and the numbers associated with them). An object of type "unicode" is no longer a sequence of bytes but a sequence of the characters themselves, with no notion of how those characters are actually stored in the computer's memory. If you like, it is a higher level of abstraction than byte strings (which is what Python 3 calls the regular strings used in Python 2.6).
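A quick sketch of what this abstraction buys us (assuming the bytes below really are "кощей" in cp1251): a unicode object counts characters, while a byte string only counts bytes:
>>> raw = '\xea\xee\xf9\xe5\xe9'  # "кощей" as cp1251 bytes
>>> len(raw)                      # five bytes...
5
>>> text = raw.decode('cp1251')
>>> len(text)                     # ...and five characters
5
>>> len(text.encode('utf-8'))     # the same five characters take ten bytes in UTF-8
10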
How to use Unicode?
You can create a Unicode string in Python 2.6 in at least three natural ways:
- a u'' literal:
>>> u'abc'
u'abc'
- the decode method of a byte string:
>>> 'abc'.decode('ascii')
u'abc'
- the unicode() function:
>>> unicode('abc', 'ascii')
u'abc'
During decoding, roughly the following happens:
'\x61' -> ascii encoding -> lowercase Latin "a" -> u'\u0061' (the Unicode code point for this letter)
or
'\xe0' -> cp1251 encoding -> lowercase Cyrillic "а" -> u'\u0430'
How do you get an ordinary string back from a Unicode one? Encode it:
>>> u'abc'.encode('ascii')
'abc'
The encoding algorithm is, naturally, the reverse of the one above.
Remember it and don't mix it up: unicode == characters, str == bytes; bytes -> something meaningful (characters) is decoding (decode), and characters -> bytes is encoding (encode).
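The same mnemonic as a tiny round-trip sketch:
>>> raw = '\xea\xee\xf9\xe5\xe9'  # bytes ("кощей" in cp1251)
>>> text = raw.decode('cp1251')   # bytes -> characters: decode
>>> text.encode('cp1251') == raw  # characters -> bytes: encode, and we are back where we started
True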
Not encoded :(
Let's look at the examples from the beginning of the article. How does concatenation of a plain string and a unicode string work? The plain string has to be turned into a unicode string, and since the interpreter does not know its encoding, it uses the default one - ascii. If that encoding fails to decode the string, we get the ugly error. In that case we have to convert the string to unicode ourselves, using the correct encoding:
>>> print type(parser_result), parser_result
<type 'unicode'> баба-яга
>>> s = 'кощей'
>>> parser_result + s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    parser_result + s
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
>>> parser_result + s.decode('cp1251')
u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0\u043a\u043e\u0449\u0435\u0439'
>>> print parser_result + s.decode('cp1251')
баба-ягакощей
>>> print '&'.join((parser_result, s.decode('cp1251')))
баба-яга&кощей  # that's better :)
"UnicodeDecodeError" is usually evidence that you need to decode a string to Unicode using the correct encoding.
Now about using str() with unicode strings. Don't do it :) str() gives no way to specify the encoding, so the default one will always be used, and any characters > 128 will lead to an error. Use the "encode" method instead:
>>> s = s.decode('cp1251')  # turn s into a unicode string for this example
>>> print type(s), s
<type 'unicode'> кощей
>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    str(s)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
>>> s = s.encode('cp1251')
>>> print type(s), s
<type 'str'> кощей
"UnicodeEncodeError" is a sign that you need to specify the correct encoding when converting a unicode string into an ordinary one (or use the 'ignore' \ 'replace' \ 'xmlcharrefreplace' value of the second parameter of the "encode" method).
I want more!
OK, let's take the baba yaga from the example above once more:
>>> parser_result = u'баба-яга' #1
>>> parser_result
u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0' #2
>>> print parser_result
áàáà-ÿãà #3
>>> print parser_result.encode('latin1') #4
баба-яга
>>> print parser_result.encode('latin1').decode('cp1251') #5
баба-яга
>>> print unicode('баба-яга', 'cp1251') #6
баба-яга
The example is not the simplest one, but it has everything in it (well, or almost everything). What is going on here:
- What do we have at the input? Bytes that IDLE passes to the interpreter. What do we need at the output? Unicode, i.e. characters. It remains to turn the bytes into characters - but for that we need an encoding, right? Which encoding will be used? Read on.
- Here is an important point: as you can see, Python does not bother choosing an encoding - the bytes simply turn into Unicode code points:
>>> 'баба-яга'
'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
>>> u'\u00e1\u00e0\u00e1\u00e0-\u00ff\u00e3\u00e0' == u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
True
>>> ord('а')
224
>>> ord(u'а')
224
- The only problem is that character 224 in cp1251 (the encoding the interpreter is using) is not at all the same as character 224 in Unicode. That is exactly why we get mojibake when we try to print our unicode string.
- How do we help our baba yaga? It turns out that the first 256 Unicode code points coincide with the ISO-8859-1 \ latin1 encoding, so if we use it to encode the unicode string, we get back exactly the bytes that we typed in (whoever is curious - see Objects/unicodeobject.c, look for the definition of the function "unicode_encode_ucs1"):
>>> parser_result.encode('latin1')
'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
- And how do we get the baba yaga in unicode? We have to say which encoding to use:
>>> parser_result.encode('latin1').decode('cp1251')
u'\u0431\u0430\u0431\u0430-\u044f\u0433\u0430'
- The method from point #5 is certainly not great; it is much more convenient to use the built-in unicode() - or to ask the console which encoding it uses, as in the sketch below.
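A hedged sketch of that idea: instead of hard-coding 'cp1251', ask the console which encoding it uses (sys.stdin.encoding may be None, for example when input is piped, hence the fallback used here):
>>> import sys
>>> enc = sys.stdin.encoding or 'cp1251'  # whatever the console reports, with a fallback for this example
>>> print unicode('баба-яга', enc)        # the bytes typed in the console are now decoded correctly
баба-яга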
There is also a way to use "u''" literals to represent, for example, Cyrillic while specifying neither an encoding nor unreadable Unicode code points (that is, without "u'\u1234'"). The method is not the most convenient, but it is interesting: use Unicode character names (the \N{...} escape):
>>> s = u'\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER SHCHA}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER SHORT I}'
>>> print s
кощей
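The reverse lookup, from a character to its Unicode name, lives in the standard unicodedata module:
>>> import unicodedata
>>> unicodedata.name(u'\u043a')
'CYRILLIC SMALL LETTER KA'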
Well, that seems to be all. The main advice: don't confuse "encode" with "decode", and understand the difference between bytes and characters.
Python 3
There is no code here, because I have no experience with it. Witnesses claim that everything there is much simpler and more fun. Whoever volunteers to demonstrate the differences between here (Python 2.x) and there (Python 3.x) using these same examples - respect to them.
Useful
Since we are talking about encodings anyway, I'll recommend a resource that helps to sort out mojibake from time to time - http://2cyr.com/decode/?lang=en .
Once again, a link to the article by Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets .
The Unicode HOWTO is the official document on where, how and why to use Unicode in Python 2.x.
Thanks for your attention. I would be grateful for comments sent in private.
P.S. A link to a translation of Spolsky's article - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets - has been added.