About encodings and Unicode

First, a couple of terms need to be clarified. A code page is a table of a size known in advance, where each position (or code) is associated with a single character or with no character at all. For example, a code page of size 256 in which the letter “G” occupies position 71. An encoding is a rule for converting a character into a numeric representation. Any encoding is created for a specific code page. For example, the character “G” in some hypothetical “Abrwal” encoding would take the value 71. The simplest encodings do exactly this: they represent characters by their values in the code table, and ASCII is one of them.
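To make the “character ↔ code” relationship concrete, here is a minimal Python sketch of a toy 256-entry code page; it is an illustration of the idea, not anything from the original article (only the ASCII value of “G” is real):

```python
# A toy code page of 256 positions: index (code) -> character, None = empty slot.
# The lower half reuses printable ASCII, so the letter "G" sits at position 71.
code_page = [chr(i) if 32 <= i < 127 else None for i in range(256)]

print(code_page[71])          # 'G' — decoding: code -> character
print(code_page.index("G"))   # 71  — encoding: character -> code
print(ord("G"))               # 71  — agrees with the real ASCII value
```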

Previously, 7 bits per character were enough for encoding. That gives 128 different values, and everything the users of that time needed fit into them: the English alphabet, punctuation, digits and a handful of special characters. The main English-language 7-bit encoding, together with its code page, was called ASCII (American Standard Code for Information Interchange), and it laid the foundation for what came later. Later, when computers spread to non-English-speaking countries, support for national languages became necessary, and here the ASCII foundation came in handy. Computers process information at the byte level, while an ASCII code occupies only the lower 7 bits. Using the 8th bit expanded the space to 256 positions without losing compatibility, and with it support for English, which was important. Most non-English code pages and encodings are built on this fact: the lower 128 positions are the same as in ASCII, while the upper 128 are reserved for national needs and encoded with the highest bit set. However, creating a separate page and encoding for each language (sometimes for a group of similar languages) made it hard for developers of operating systems and software in general to support all of this.
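A minimal Python sketch of the consequence: the lower 128 codes decode the same way everywhere, while the same upper byte means different characters depending on the code page (cp1251 and latin-1 are used here purely as examples, they are not mentioned in the original text):

```python
print(bytes([71]).decode("ascii"))      # 'G' — the lower half matches ASCII in all these code pages
print(bytes([0xC0]).decode("latin-1"))  # 'À' — Western European code page
print(bytes([0xC0]).decode("cp1251"))   # 'А' — Cyrillic code page: same byte, different character
```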

To overcome this situation, a consortium was organized that developed and proposed the Unicode standard. It was supposed to combine the characters of all the world's languages in one large table. In addition, encodings were defined. At first it was decided that 65,536 positions should be enough for everyone, and UCS-2 was introduced, an encoding with a fixed 16-bit code length. But Asian scripts arrived with their enormous character sets, and the calculation collapsed. The code space was expanded, UCS-2 could no longer cope, and the 32-bit UCS-4 appeared. The tangible benefits of the UCS encodings were a fixed code length, a multiple of two bytes, and the simplest possible encoding algorithm; both contributed to processing speed on the computers of that time. But at the same time space was wasted unjustifiably: imagine a character that in ASCII is 00010101, in UCS-2 becomes 00000000 00010101, and in UCS-4 already 00000000 00000000 00000000 00010101. Something had to be done about this.
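The waste is easy to demonstrate in Python: fixed-width codecs pad every character to the same length (here 'utf-16-be' and 'utf-32-be' stand in for UCS-2 and UCS-4, which is an illustrative assumption valid for basic-plane characters):

```python
for name in ("ascii", "utf-16-be", "utf-32-be"):
    data = "G".encode(name)
    print(f"{name:10} {len(data)} byte(s): {data.hex()}")
# ascii      1 byte(s): 47
# utf-16-be  2 byte(s): 0047
# utf-32-be  4 byte(s): 00000047
```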

The development of Unicode turned towards encodings with variable-length codes. The representatives are UTF-8, UTF-16 and UTF-32, the last one only nominally, since at the moment it is identical to UCS-4. Each character in UTF-8 takes from 8 to 32 bits, and there is compatibility with ASCII. In UTF-16 a character takes 16 or 32 bits, and in UTF-32 always 32 bits (if the Unicode space were expanded further, then 32 or 64 bits); the latter two are not compatible with ASCII. The number of bytes occupied depends on the character's position in the Unicode table. Obviously, the most practical encoding is UTF-8: thanks to its compatibility with ASCII, its modest appetite for memory and fairly simple encoding rules, it is the most common and promising Unicode encoding. And in conclusion, the scheme for converting a character code to UTF-8:
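A rough Python sketch of that conversion, following the standard UTF-8 bit layout (1, 2, 3 or 4 bytes depending on the code point range); it is an illustration rather than the article's original diagram:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point into UTF-8 bytes by hand."""
    if code_point < 0x80:                        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Sanity check against the built-in codec for characters of different widths.
for ch in ("G", "Ж", "€", "😀"):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(ch, utf8_encode(ord(ch)).hex(" "))
# G 47
# Ж d0 96
# € e2 82 ac
# 😀 f0 9f 98 80
```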
