A simple way to assess the intelligibility of a text in Russian

What follows is really my comment on the post “What is ‘Clear Russian Language’ in terms of technology? Let's take a look at the readability metrics of texts”. Since I cannot leave comments, I am writing in the Sandbox.

The criteria for evaluating text comprehensibility discussed in that post require practically zero knowledge of the language in which the texts are written: it is enough to know how a text divides into words and sentences. This approach keeps the calculations simple, but it leaves a lot of relevant data unused. It seems to me that for Russian it is obvious what else can be used, and this data is easily accessible.

In my opinion, it makes sense to divide incomprehensibility into two types:

(a) deep incomprehensibility (when it is in no way possible to make out what is written);

(b) incomprehensibility associated with complexity.

Incomprehensibility of type (a), which pervades every second official document, if not every single one, comes from the fact that people simply do not know how to express their thoughts. What seems clear in one's head, and can somehow be explained out loud, does not transfer to paper: constructions are left unclosed, anaphora gets tangled, coordination joins things that are better kept apart, and so on. In its pure form this is hard to distinguish automatically from normal text: often even people who read the text superficially think it is more or less fine, and only later does it turn out to be a morass. Moreover, it cannot be fixed automatically: first you have to sit down with the author and spend a long time prying out of him what he actually meant. Fortunately, this kind of incomprehensibility almost always entails incomprehensibility of type (b).

Incomprehensibility as complexity means that people use non-trivial linguistic devices that are hard to understand without education and/or extraordinary effort. Here we run into the indirect nature of traditional metrics. Long sentences are, of course, best avoided, but a long sentence as such is not synonymous with obscurity: a simple enumeration can make a sentence long without making it incomprehensible. Using long words does not automatically make a text incomprehensible either. After all, nobody has abolished technical language, and not every subtlety can be conveyed in simple words, not to mention that official documents cannot do without “implementation”, “bringing” and similar many-lettered things. In other words, if you don't come up with new terms all the time,

It seems to me that complexity of type (b) is primarily syntactic, or rhetorical, complexity. Officialese is usually characterized by a parse tree that quickly grows through the ceiling, and the same is true of almost any “dark” text. To make texts more understandable, we need to make them structurally simple. And this is easy to do: in the vast majority of cases, syntactic complexity is achieved by a single means, active participles. Try to write a confusing text without active participles, and you will see that it is almost impossible. Either you will end up with complete nonsense, or the sentences will necessarily become shorter and clearer. The thesis that Russian speakers do not use participles and verbal adverbs in colloquial speech is as old as the hills. It is not entirely true: I know people

I do not claim that this is the only correct way to assess the intelligibility of a text, but I am almost sure that the number of active participles will reveal a complex Russian text no worse than any other single-factor metric. For a preliminary test I took five texts: “The Captain's Daughter”; “War and Peace”; separately, the epilogue to “War and Peace”, famous for its obscurity; “Classical and Non-Classical Ideals of Rationality” by Merab Mamardashvili (a modern philosophical text by a Russian-speaking author); and the federal law “On Education in the Russian Federation”. I split the texts into sentences and, using Python 3 + pymorphy2, calculated the average number of active participles per sentence in each text. The result was predictable, but still telling:

The service offered in the post gives the following results:

It could not cope with the full text of “War and Peace” in two attempts; it would be interesting to find out why. We see that the order of the texts in the two rankings coincides, but when measured by participles, the gap between the Law on Education and “The Captain's Daughter”, as well as between the epilogue to “War and Peace” and Mamardashvili's text, is larger. I cannot vouch for the absolute values, but I suspect that Mamardashvili's text really is more complicated than Tolstoy's.
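The participle count described above can be sketched roughly as follows. This is not the script used for the test, only a minimal illustration. The sentence splitter and tokenizer are naive regular expressions chosen for brevity, the function names are made up, and treating only the first pymorphy2 parse of each word as authoritative is an assumption. The morphological check is passed in as a function so that the counting logic can be tried without pymorphy2 installed.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after . ! ? or … followed by whitespace."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> list[str]:
    """Lowercased runs of letters; digits and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def make_pymorphy2_predicate() -> Callable[[str], bool]:
    """Build a word -> bool check backed by pymorphy2 (imported lazily)."""
    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def is_active_participle(word: str) -> bool:
        parses = morph.parse(word)
        if not parses:
            return False
        tag = parses[0].tag  # assumption: trust the most probable parse
        # PRTF / PRTS are the OpenCorpora tags for full / short participles,
        # 'actv' is the active-voice grammeme.
        return tag.POS in ("PRTF", "PRTS") and tag.voice == "actv"

    return is_active_participle


def mean_active_participles(text: str, is_active_participle: Callable[[str], bool]) -> float:
    """Average number of active participles per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    total = sum(
        1 for s in sentences for w in tokenize(s) if is_active_participle(w)
    )
    return total / len(sentences)
```

With pymorphy2 available, a call would look like `mean_active_participles(text, make_pymorphy2_predicate())`. Picking the first parse ignores homonymy, so the absolute numbers should be read with the same caution as the ones in the text.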

Approaching from the other side, it turns out that Mamardashvili's text is the most complex of all. Word complexity can be judged not only by length but also by how often a word occurs in texts: rare word = difficult word. To measure rarity, I took the frequency data published on the Russian National Corpus (NKRYA) website and, for each text, built an array in which each word is mapped to the number 1 / occurrence (that is, the word's rarity). In the NKRYA table the rarest words have an occurrence of 3, so a word missing from the table received a rarity of 1/2. Then I calculated the average dictionary rarity for each text. In this ranking, “War and Peace” as a whole came out well ahead of its epilogue (which has no French), and higher still were “The Captain's Daughter” (many non-trivial spellings), the Law on Education and, by a wide margin, “Ideals”. This result is somewhat skewed, but it shows how specific Mamardashvili's vocabulary is. If we multiply the participle data by the word data, we get the following ranking, which I find very meaningful:
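For reference, the rarity score and the combined score can be sketched like this. It is only an illustration under assumptions: the frequency table is represented as a plain dict from word to occurrence, the words are looked up as they appear in the text (whether the original calculation lemmatized them first is not stated), and all names are hypothetical. The value 1/2 for missing words is the one given in the text.

```python
import re
from typing import Mapping

# From the text: the rarest listed words have occurrence 3 (rarity 1/3),
# so a word absent from the table is given rarity 1/2.
UNSEEN_RARITY = 0.5


def word_rarity(word: str, occurrences: Mapping[str, float]) -> float:
    """Rarity = 1 / occurrence, or UNSEEN_RARITY if the word is not listed."""
    count = occurrences.get(word)
    if count is None or count <= 0:
        return UNSEEN_RARITY
    return 1.0 / count


def mean_rarity(text: str, occurrences: Mapping[str, float]) -> float:
    """Average rarity over all words of the text (naive regex tokenizer)."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    return sum(word_rarity(w, occurrences) for w in words) / len(words)


def combined_score(mean_participles: float, mean_rarity_value: float) -> float:
    """Product of the two single-factor metrics, as proposed in the text."""
    return mean_participles * mean_rarity_value
```

Because unlisted words weigh far more than any listed one, texts with foreign-language passages or old spellings are pushed up sharply, which matches the skew noted above.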

