2che February 20, 2018 at 16:55

Cyrillic programming can increase productivity

From the sandbox

Hi, Habrahabr. As you know, technical English is the language of the information technology world. The main documentation and all programming standards are written in English. Among other things, the basic ASCII code page and the portable character set include the 26 Latin letters, which cause no trouble across different encodings. Historically this came about because of the international status of English and US leadership in information technology. It lets us achieve maximum technological compatibility in the era of the Internet and globalization. In this article I do not aim to change the standards; I simply want to show an alternative, Russian-language approach to IT.

Initially, purely out of sporting interest, I wanted to build a 7-bit character table similar to ASCII that would include the entire Russian alphabet, the 10 digits and all punctuation marks from the portable character set of the POSIX standard. Along the way it became increasingly clear that such a table is much more convenient than ASCII because its subsections are clearly delimited. I know Unicode exists, but here I consider the possibility of a single-byte encoding. During the work additional advantages turned up, the table was completely rewritten many times, control characters appeared, and what I present here is one of the final results. What are the advantages? Let's take it apart.

1. Obviously, the letters occupy the entire second half of the table. There was no room only for the letter Ё. That is not critical, since everyone has long been used to its special status as a "distant relative", although personally I would have removed the hard sign Ъ from the alphabet; it is not clear why the linguists kept it instead of the more useful Ё. But such a decision would be outright rebellion, so, as in ISO 8859-5 and Win-1251, Ё gets a place outside the alphabet, with its uppercase and lowercase versions in adjacent cells. Without this fine letter the Russian alphabet has 32 letters; adding a second case gives 64 characters, or 2^6. Thus a single bit determines whether a character is a letter and whether the middle case (more on it later) applies. As befits a proper run of letters, it goes in alphabetical order, with all lowercase letters first and then all uppercase ones. I'll explain later why not the other way around; the main thing for now is that, thanks to the power of two, the boundaries of both alphabets coincide with row boundaries of the hexadecimal table, and that is just wonderful. Each letter's index in the alphabet equals the number formed by the last 5 bits of its code, the case is determined by the sixth bit, and whether the symbol is a letter at all by the seventh.
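The bit rules from point 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the masks `0x40`, `0x20` and `0x1F` are inferred from the description (seventh bit marks a letter, sixth bit marks upper case, last 5 bits give the index), and the function names are made up for the example.

```python
# Assumed layout, inferred from the text rather than copied from the table:
# 0x40 marks a letter, 0x20 marks upper case, the low 5 bits index the alphabet.
ALPHABET = "абвгдежзийклмнопрстуфхцчшщъыьэюя"  # 32 letters, Ё lives elsewhere

LETTER_BIT = 0x40
UPPER_BIT = 0x20
INDEX_MASK = 0x1F


def is_letter(code: int) -> bool:
    """A single bit test tells letters from everything else."""
    return bool(code & LETTER_BIT)


def is_upper(code: int) -> bool:
    """Upper case is one more bit test, valid only for letters here."""
    return is_letter(code) and bool(code & UPPER_BIT)


def decode_letter(code: int) -> str:
    """Turn a letter code into a character using the 5-bit index."""
    if not is_letter(code):
        raise ValueError(f"0x{code:02X} is not a letter code")
    ch = ALPHABET[code & INDEX_MASK]
    return ch.upper() if is_upper(code) else ch


def encode_letter(ch: str) -> int:
    """Inverse of decode_letter; raises ValueError for Ё and non-letters."""
    code = LETTER_BIT | ALPHABET.index(ch.lower())
    if ch.isupper():
        code |= UPPER_BIT
    return code
```

With these assumptions the lowercase alphabet fills 0x40 to 0x5F and the uppercase one 0x60 to 0x7F, so both start on a row boundary of the hexadecimal table.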

2. In the first half, the non-letters are likewise divided in two, and twice over: the same addressing works here, and one more case is added, which I called "middle". The top line is occupied by control characters. They are written in Russian here, because Latin letters in a thoroughly Russian picture, contrary to all the rules, looked clumsy. Their exact purpose is not important; I am no master of assembler, let alone machine instructions. Since ASCII was created, technology has moved far ahead, and most control characters of the 70s are no longer relevant. Teletypes and many switches will not be able to work with this encoding, but 16 commands are quite enough to use a computer with a modern architecture. If more are needed, the last position is assigned to a special command, "UPR", on receipt of which the device interprets the next byte differently.

In short: START - \0, BIP - \a, NAZ - \b, TAB - \t, NOV - \n, ABZ - \v, AML - \f, VOZ - \r, TIME - space, KON - end signal, STO - stop command, VER - upper case (shift), FIC - fix case (caps lock), AL - middle case, alternative letter assignment (alt), OTM - cancel (esc), UPR - control, start of a command (ctrl). The selected commands are not the best; in particular, some needed ones are missing, such as DEL (delete) and a process response request, but UPR solves that. Besides, this article does not go into the intricacies of the processor. What matters is that the set of control characters sits entirely on the first line and complies with the POSIX standard. Cases do not apply to control commands: four zero bits at the start of the byte disable any switching of the rest.
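How a reader of such a stream might treat UPR can be sketched as follows. This is a guess at the mechanism, not the author's design: the code `0x0F` for UPR is assumed from its position at the end of the first line, and the labels `"control"`, `"extended"` and `"char"` are invented for the example.

```python
from typing import Iterable, Iterator, Tuple

UPR = 0x0F  # assumption: UPR is the last cell of the control line


def split_stream(data: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Yield (kind, byte) pairs; the byte after UPR is read as a command."""
    it = iter(data)
    for byte in it:
        if byte == UPR:
            arg = next(it, None)  # consume the byte that UPR modifies
            if arg is None:
                raise ValueError("UPR at the end of the stream")
            yield ("extended", arg)
        elif byte < 0x10:  # four leading zero bits: a plain control command
            yield ("control", byte)
        else:
            yield ("char", byte)
```

For example, `list(split_stream([0x41, 0x0F, 0x41, 0x03]))` gives one ordinary character, one extended command and one control command, so the 16 built-in commands can be extended without taking more cells in the table.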

3. The second line has two cases. The first, upper, changes the third bit from the left from 0 to 1 and moves the reading to the fourth line. The second, middle, changes the third and fourth bits and moves it to the third line. You could swap the second line with the fourth; then the "middle" case would rather be called "lower", and each switch would take a single substitution. But I arranged the lines as they are for better readability. For the same reason, by the way, the left half of the 3rd and 4th lines is occupied by arithmetic signs and the right half by punctuation. Anyone who decides to examine the table carefully will notice useful symbols that ASCII lacks: ¬ (logical NOT, a very necessary thing) and ¤ (currency sign, for financial documents), as well as our native letter Ё, tucked away without a key of its own, and two reserved character cells.
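The two shifts of the second line come down to simple bit operations. The sketch below assumes an 8-bit byte with bits counted from the left, so the second line is 0x10 to 0x1F, the third 0x20 to 0x2F and the fourth 0x30 to 0x3F; the function name and its flags are hypothetical.

```python
UPPER_BIT = 0x20    # third bit from the left
MIDDLE_MASK = 0x30  # third and fourth bits from the left


def shift_second_line(code: int, upper: bool = False, middle: bool = False) -> int:
    """Map a second-line code to the fourth (upper) or third (middle) line."""
    if not 0x10 <= code <= 0x1F:
        raise ValueError("only second-line codes can be shifted here")
    if upper and middle:
        raise ValueError("only one case can be applied at a time")
    if upper:
        return code | UPPER_BIT    # 0001xxxx -> 0011xxxx
    if middle:
        return code ^ MIDDLE_MASK  # 0001xxxx -> 0010xxxx
    return code
```

The column (the low four bits) never changes, which is what "one column, one key" relies on: a key picks the column and the held case picks the line.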

The order of characters in a row also matters: one column, one key. That is, the coding of the keyboard keys fully corresponds to the table. Yes, the mechanical layout of the keyboard is the usual one here, but the functional layout is its own. The fact is that the QWERTY standard was designed to minimize typing speed. In the 1870s there was no touch-typing method, and the Remington 1 typewriter had already achieved commercial success. Its predecessors had no common layout, and their type bars often jammed against each other. Christopher Sholes experimented constantly, trying to load the little fingers as much as possible and slow typing down, thereby preventing the jams. Frequently used letters and punctuation marks ended up hard to reach. Realities have changed since then and modern keyboards do not suffer from mechanical ailments, but the standard remains; after all, relearning to type on a new layout is like putting the piano keys in reverse order. Alternatives do exist, though, for example the Dvorak layout. The Russian layout was designed from the start around the ten-finger method, so we were luckier. But the punctuation marks were still partly inherited from QWERTY, and the comma ended up in the upper case. In the keyboard of the encoding presented in this article, the numeric keys, and with them the arithmetic and punctuation signs, run from left to right according to how often they are used.

Returning to the open questions, here is why uppercase letters follow lowercase ones: the VER command drops us two lines down for every line except the first. Since the characters of lines 3 and 4 are addressed through the cases of the second line, a case cannot be applied to them again by conventional means (the shift or alt key is either pressed or not, and pressing harder achieves nothing). How easily the rule is grasped also plays a large role: if the third bit from the left is 0, there is no case; if it is 1, upper case is on and small letters are replaced by capitals.

Thanks to these rules, the table can be reproduced from memory literally after the first reading, and the properties of each new character in the stream are known as soon as its first bits are read, which can be used to increase performance significantly. This code page is unlikely to find any application: programmers around the world have long been used to English, ordinary users have Unicode, and the fundamentals of programming and PC architecture are handled by American corporations. This article is purely informational, although a fast interpreter of a high-level programming language (and a low-level one too) in Russian could still be built. And for even greater performance, at the expense of readability, I present a second version of the table, where the middle case is determined by a single bit, the fourth from the left, and both cases are turned on by zero,
