About encodings and code pages

This is hardly very relevant nowadays, but it may still be interesting to someone (or simply bring back memories of years past).

I'll start with a short digression into computer history. Since the computer was built to process information, it has to present that information in "human" form. A computer stores information as numbers (bytes), while a person perceives characters (letters, digits, various signs). So all that is needed is a mapping of number <-> character and the problem is solved. First, let us count how many characters we need (keeping in mind that "we" here are Americans using the Latin alphabet): 10 digits + 26 capital letters of the English alphabet + 26 lowercase letters + mathematical signs (at least + - / * = > < %) + punctuation marks (. , ! ? : ; ' ") + various brackets + service characters (_ ^ % $ @ |) + 32 non-printable control characters for working with devices (first of all, the teletype). All in all, 128 characters are just barely enough, and "we" called this standard character set ASCII, the American Standard Code for Information Interchange.
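For a modern reader, that number <-> character mapping is easy to poke at in any language; here is a minimal Python sketch (my illustration, not part of the original story):

```python
# Every ASCII character corresponds to a number in the range 0-127,
# so 7 bits are enough to store any of them.
for ch in ['A', 'a', '7', '+', '?']:
    code = ord(ch)              # character -> number
    assert code < 128           # fits into 7 bits
    print(ch, code, chr(code))  # and number -> character gets us back
```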

Great: for 128 characters, 7 bits are enough. On the other hand, there are 8 bits in a byte and communication channels are 8-bit (let us forget the "prehistoric" times when bytes and channels had fewer bits). So over an 8-bit channel we transmit 7 bits of the character code plus 1 check bit (to improve reliability and detect errors). And everything was wonderful until computers started being used in other countries, where the Latin alphabet has more than 26 letters or a non-Latin alphabet is used altogether. Instead of everyone simply learning English, the inhabitants of the USSR, France, Germany, Georgia and dozens of other countries wanted the computer to talk to them in their native language. The approaches differed depending on how severe the problem was: it is one thing to add 2-3 national characters to the 26 Latin ones (some special characters can be sacrificed), and quite another to wedge in the whole Cyrillic alphabet. Now "we" are Russians, trying to "Russify" the technology. The first solutions were based on replacing the lowercase English letters with Russian capitals. The trouble is that there are 33 Russian letters and they do not fit into 26 places. Something had to be "condensed", and the first victim was the letter Ё (it was simply replaced everywhere with Е). The other trick was to use the similar-looking English letters in place of the "Russian" А, Е, К, М, Н, О, Р, С, Т (there are even more such pairs than needed, but in some of them the uppercase letters are similar while the lowercase ones are not so much: Hh, Tt, Bb, Kk, Mm). Nevertheless the alphabet was wedged in, and as a result all output came out in CAPITAL LETTERS, which is inconvenient and ugly, but people got used to it over time. The second trick was "switching the language". The code of a Russian character coincided with the code of an English one, but the device remembered that it was now in Russian mode and displayed the Cyrillic character (and in English mode, the Latin one). The mode was switched by two control characters: Shift Out (SO, code 14) into Russian and Shift In (SI, code 15) into English. (A curious detail: typewriters once used a two-color ribbon, and SO physically lifted the ribbon so the print came out red, while SI put the ribbon back and the print went black again.) Text with both capital and small letters began to look quite decent.
All these schemes worked more or less on large computers, but after the release of the IBM PC the mass spread of personal computers around the world began, and something had to be decided centrally.
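Here is a rough sketch of the "7 bits of code plus 1 check bit" idea (my own illustration, assuming even parity, which was one common convention):

```python
def add_parity(code7: int) -> int:
    """Pack a 7-bit character code and an even-parity check bit into one byte."""
    assert 0 <= code7 < 128
    parity = bin(code7).count('1') % 2   # 1 if the 7-bit code has an odd number of 1s
    return code7 | (parity << 7)         # the check bit goes into the 8th bit

def check_and_strip(byte: int) -> int:
    """Verify even parity and return the 7-bit character code."""
    if bin(byte).count('1') % 2 != 0:
        raise ValueError('transmission error detected')
    return byte & 0x7F

sent = add_parity(ord('A'))              # 'A' = 65 = 0b1000001: two 1-bits, parity bit = 0
assert chr(check_and_strip(sent)) == 'A'
```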

The solution was IBM's code page technology. By that time the check bit in transmission had lost its relevance, and all 8 bits could be used for the character code: instead of the range 0-127, the range 0-255 became available. A code page (or encoding) is a mapping from a code in the range 0-255 to a particular graphic image (for example, the Cyrillic letter "И" or the Greek letter omega). You cannot say "the character with code 211 looks like this", but you can say "the character with code 211 in code page CP1251 looks like this: У, and in CP1253 (Greek) like this: Σ". In all (or almost all) code pages the first 128 codes match the ASCII table; only for the first 32 non-printable codes did IBM "assign" its own pictures (shown when they are output to the monitor). In the upper half IBM placed pseudo-graphics characters (for drawing various frames), additional Latin characters used in Western Europe, some mathematical symbols and individual letters of the Greek alphabet. This code page is called CP437 (IBM developed many others as well) and was used by default in video adapters. In addition, various standardization bodies (international and national) created code pages for national character sets.

Our own computer experts proposed two options: the "main" DOS encoding and the "alternative" DOS encoding. The main one was intended for use everywhere, and the alternative one for special cases where the main one was inconvenient. It turned out that such special cases were the majority, and the alternative encoding became the main one in practice (if not in name). I think this outcome was obvious from the start to most specialists (except for "pundits" detached from real life). The point is that most software in use was English, and "for beauty" it actively used pseudo-graphics to draw frames and the like. A typical example is the hugely popular Norton Commander, which at that time sat on most computers. The main encoding put Russian characters in the positions of the pseudo-graphics, so the Norton panels (like any other pseudo-graphics output) looked simply awful. The alternative encoding carefully preserved the pseudo-graphics characters and used other positions for the Russian letters. As a result it was quite possible to work both with Norton Commander and with other programs.

Andrei Chernov (a well-known figure at the time) developed the KOI8-R encoding (KOI8), which came from the "big" computers where UNIX dominated. Its peculiarity was that if the 8th bit was stripped from a Russian character, the resulting "truncated" English character would sound similar to the original Russian one. Instead of "Привет" you got "pRIVET", which is not quite right but at least readable. As a result, three different code pages (main, alternative, and KOI8) were in use on computers in the USSR. And that is not counting the various "variations" in which, for example, individual characters (or even whole rows) of the alternative encoding were changed. KOI8 also spawned offshoots: Ukrainian, Belarusian, Tajik, Caucasian and so on. Hardware (printers, video adapters) also had to be configured (or "reflashed") to work with the right encoding.
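Both effects described above are easy to reproduce today (a Python sketch; cp1251, cp1253 and koi8_r are the standard codec names for these code pages):

```python
raw = bytes([211])            # the byte 0xD3 (211) means nothing by itself...
print(raw.decode('cp1251'))   # 'У' - Cyrillic capital U in CP1251
print(raw.decode('cp1253'))   # 'Σ' - Greek capital Sigma in CP1253

# The KOI8 trick: stripping the 8th bit leaves similar-sounding Latin letters
# with the case flipped - 'pRIWET' here (KOI8 pairs в with w).
koi8 = 'Привет'.encode('koi8_r')
print(bytes(b & 0x7F for b in koi8).decode('ascii'))
```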

Nevertheless, on the whole, code pages solved the problem of displaying national characters (the device just has to support the corresponding code page), but they created the problem of multiple encodings, where a mail program sends data in one encoding and the receiving program displays it in another. As a result the user sees so-called "krakozyabry" (garbage characters): instead of "Привет" something like "ўҐаёўҐв" or "ОПХБЕР". Transcoder programs were needed to convert data from one encoding to another. Worse, letters passing through mail servers were sometimes automatically transcoded several times (or even had their 8th bit "cut off"), and one had to work out the whole chain of transformations and undo it step by step.
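Today both the "krakozyabry" effect and the job of a transcoder fit into a couple of lines (a Python sketch, using CP866 and CP1251 as the mismatched pair):

```python
text = 'Привет'
wire = text.encode('cp866')    # the sender writes the bytes in one encoding

print(wire.decode('cp1251'))   # the receiver guesses wrong -> "krakozyabry"
print(wire.decode('cp866'))    # a "transcoder" simply re-reads the bytes correctly

# But if a mail gateway also cut off the 8th bit, the damage could not be undone:
crippled = bytes(b & 0x7F for b in wire)
```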

After the mass migration to Windows, a fourth code page (Windows-1251, aka CP1251, aka ANSI) and a fifth (CP866, aka OEM or DOS) were added to the existing three. Do not be surprised: Windows by default uses CP866 for Cyrillic in the console (its Russian characters are the same as in the "alternative" encoding, only some special characters differ) and CP1251 for everything else. Why did Windows need two encodings, couldn't it manage with one? Alas, no: the DOS encoding is used in file names (a heavy legacy of DOS), and console commands such as dir and copy must display and process DOS file names correctly. On the other hand, that encoding reserves many codes for pseudo-graphics characters (various frames and so on), while Windows runs in graphics mode and it (or rather, Windows applications) has no need for pseudo-graphics characters; it does, however, need the codes they occupy, which CP1251 uses for other, more useful characters. Five Cyrillic encodings made things worse at first, but over time Windows-1251 and KOI8 became the most popular and the DOS encodings simply fell out of use. With Windows it also stopped mattering which encoding was built into the video adapter (only occasionally, before Windows boots, can you still see "krakozyabry" in diagnostic messages).
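The conflict between pseudo-graphics and Cyrillic is visible on a single byte (a small sketch; 0xC9 is just one illustrative example):

```python
b = bytes([0xC9])
print(b.decode('cp866'))    # '╔' - a box-drawing corner that console programs relied on
print(b.decode('cp1251'))   # 'Й' - CP1251 spends the same code on a Cyrillic letter
```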

The solution to the encoding problem came when the Unicode system was adopted everywhere (on both desktop OSes and servers). Unicode assigns each national character a number that is fixed once and for all (a code point in the Unicode code space, up to 21 bits wide; 16 bits are enough most of the time, since the longer codes are used for rare characters and hieroglyphs), so nothing needs to be transcoded any more (for more about Unicode, see the next log entry). Now, for any pair <byte code> + <code page>, you can determine the corresponding Unicode code point (code page definitions now list the Unicode code for each 8-bit code) and then, if necessary, render that character through any code page in which it is present.
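A minimal sketch of that <byte code> + <code page> -> Unicode lookup (Python, standard codecs only):

```python
import unicodedata

for codepage in ('cp1251', 'cp1253'):
    ch = bytes([211]).decode(codepage)   # byte 211 interpreted through a code page
    print(codepage, ch, hex(ord(ch)), unicodedata.name(ch))
# cp1251 -> 'У' 0x423 CYRILLIC CAPITAL LETTER U
# cp1253 -> 'Σ' 0x3a3 GREEK CAPITAL LETTER SIGMA
```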

Interestingly, about a year ago the encoding problem briefly resurfaced when the FAS went after the mobile operators, claiming they discriminate against Russian-speaking users by charging more for transmitting Cyrillic. This is a consequence of the technical design chosen by the developers of the SMS protocol; had Russians designed it, they would probably have given priority to the Cyrillic alphabet. In the article in question, "the head of the transport and communications control department, Dmitry Rutenberg, noted that there are eight-bit encodings for the Cyrillic alphabet that operators could use." Really: it is the 21st century, Unicode has spread around the world, and Mr. Rutenberg is pulling us back to the early 90s, when the "war of encodings" was raging and the transcoding problem was in full swing. I wonder which encoding should be used when Vasya Pupkin, on vacation in Turkey with a Finnish phone, receives an SMS sent from Kazakhstan by his wife with a Korean phone? And one from his French colleague (with a Japanese phone) who happens to be in Spain? I doubt any official can answer that question. Fortunately, this money-saving proposal came to nothing.
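For context, the technical root of the story is the per-message limit fixed by the SMS standard: about 160 characters fit into one message with the 7-bit GSM alphabet, but only 70 when the text has to go out as 16-bit UCS-2, which is what happens with Cyrillic. A back-of-the-envelope sketch (my own simplification, ignoring multipart headers and the GSM extension table):

```python
def sms_parts(text: str) -> int:
    """Rough number of SMS segments needed for a message."""
    if all(ord(c) < 128 for c in text):  # crude stand-in for "fits the GSM 7-bit alphabet"
        per_message = 160                # 140 bytes * 8 / 7
    else:
        per_message = 70                 # 140 bytes / 2 (UCS-2)
    return -(-len(text) // per_message)  # ceiling division

print(sms_parts('Hello' * 30))   # 150 Latin characters   -> 1 message
print(sms_parts('Привет' * 25))  # 150 Cyrillic characters -> 3 messages
```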

A young reader may ask: what prevented using Unicode right away, why were all these code pages invented? I think it comes down to money. Unicode needs twice as much memory, and memory costs money (both disk and RAM). Would an American pay one or two thousand more for a computer because "the new OS requires more memory but can handle Russian, European and Arabic languages without problems"? I am afraid an ordinary English-speaking buyer would not appreciate that argument (and would turn to other manufacturers).
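The "twice the memory" point is easy to check (a quick sketch comparing a one-byte code page with the 16-bit representation that Windows NT adopted):

```python
text = 'Привет, мир'
print(len(text.encode('cp1251')))     # 11 bytes in a one-byte code page
print(len(text.encode('utf-16-le')))  # 22 bytes as 16-bit Unicode
```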
