Commentary on the note “Frequency analysis of the Ukrainian language”

As a commentary on the article “Frequency Analysis of the Ukrainian Language” [1], simple observations on the frequency of letter pairs are presented. It is proposed to apply the developed technique to the analysis of texts. The main hypothesis: many geometrically connected symbol clusters carry information about authorship and other important integral data.

In particular, it seems unreasonable to me to expect the same digram spectrum from different communities of native speakers (forums, etc.).

Motivation

The author of the note on frequencies in the Ukrainian language gives no motivation for his calculation. It seems that the analysis of the frequency of letters, and of pairs and triples of letters, in a language was originally driven by the goal of cracking simple passwords and by similar cryptanalysis tasks.

About six or seven years ago, a friend and I did similar but less ambitious calculations. Our motivation was amateurish and primitive, but different. We believed these calculations could be a first step toward having a machine pick out, from a text, the information that is meaningful to a person. (It later turned out that the most interesting part of them was not original [1-3].)

It was assumed that the machine can “read”: it knows the letters and punctuation marks, can count frequencies, and so on. But it does not know how to “think”, in particular how to obtain generalized, integral information about the text as a whole: what is good and what is evil, what or who is being discussed, etc. The task would then be to study the structure of the text and to select “algorithms” that extract from the frequencies meaningful information that was not put in at the level of letters. From the human side, let us try, for example, to switch off our learned experience and to understand the meaning of groups of consecutive characters that are unfamiliar to us. Encode all the letters a, b, c, etc. as x, y, z, alpha, beta, gamma or, better, as Babylonian cuneiform wedges, and then ask what we can say about the text as a whole. It’s not clear, but, I believe,

Significant information, in a very narrow sense, can be the similarity of one text to another, authorship, the rhythm and pace of the text, etc. We did not get far even with such primitive problems. There are some observations, but many more questions. I want to believe that the task is meaningful and at least partially solvable.

We took up this primitive frequency analysis after some simple estimates on English texts. First, we looked at a plot of occurrences of the character string “White Fang” in the novel of the same name by Jack London. In the first parts the dog’s name was completely absent! On a small scale, in the main part of the book, fluctuations in the density of the words were of course swamped by noise. However, the strictly zero frequency of the words “White Fang” at the beginning of the novel obviously correlated with the plot. There was a clear impression that either the main character had not yet been born, or had not yet been given that name, or the novel consists of several parts (about the Fang and not about the Fang). Such a conclusion can hardly be called rigorous. Nevertheless, I believe the words “White Fang is not mentioned in the first chapters ...” would be a normal part of a human response about the text. The answer is incomplete and primitive, but the machine “analysis” took milliseconds and involved no algorithm at all.

Second, statistics on the appearance of character names in the last Harry Potter book also indicated that, from frequency alone, it is quite possible to trace when and with whom Harry was close (geometrically in the text, but, as it turned out, in meaning as well), when Ron and Hogwarts dropped out of Harry’s plot line, when Voldemort appeared, etc. I.e., by taking the word “Harry” and looking at the density of other characters’ names in its geometric neighborhood on the page, one could draw some very vague conclusions about the “plot line”.
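
The occurrence plot described above can be approximated with a few lines of Python. This is only a minimal sketch, not the program we used back then: the function name `occurrence_profile` and the parameter `n_bins` are made up for illustration, and the text is simply cut into equal slices by character position.

```python
def occurrence_profile(text, name, n_bins=10):
    """Count occurrences of `name` in each of `n_bins` equal slices of `text`.

    Hypothetical helper: slices are defined by character position, and the
    match is case-insensitive.
    """
    counts = [0] * n_bins
    if not text or not name:
        return counts
    lowered, needle = text.lower(), name.lower()
    pos = lowered.find(needle)
    while pos != -1:
        # Map the character position of the match to a slice index.
        counts[min(pos * n_bins // len(lowered), n_bins - 1)] += 1
        pos = lowered.find(needle, pos + len(needle))
    return counts
```

A run of zeros at the start of the returned list corresponds to the “name is absent from the first chapters” observation; the noise inside the non-zero part depends strongly on the chosen number of slices.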

Technical challenge

The first step in identifying the protagonist is developing techniques for studying letter frequencies. Our achievements, by and large, ended at this first step. Because Internet access was expensive at the time, we did not thoroughly analyze forums and the media, but simply took text files of books by a single author at a time, from Leo Tolstoy’s novels to Daria Dontsova’s detective stories. In total there were several hundred books, each 300 KB or larger. Interestingly, graphomaniac writers showed obvious correlations between their works in the frequency of individual letters. In particular, the letter spectrum was telling for the prolific author of “ironic detective stories.”

The next step was to test the spectral composition of texts by studying clusters of adjacent letters. In particular, we computed the problem in the simplest geometry: the normalized frequency of pairs of nearest significant characters, “аа”, “аб”, “ав”, ..., “яя”. From the most frequent pairs we singled out the first 30-60 and compared them across texts. We used relative values: the frequency of a pair divided by the total frequency of all pairs. For 300-400 KB of text, the total count in this problem (the “statistical sum”) turned out to be quite large. More specifically, the frequencies of the trilogy “Childhood”, “Adolescence”, “Youth” were taken as the baseline against which we looked at fluctuations in the frequencies of other works. The results showed, in particular, a significant difference in the spectra of different authors.
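
A normalized pair spectrum of this kind might be computed roughly as follows. This is a sketch under assumptions, not our original program: “significant characters” are taken here to mean letters plus a small hand-picked set of punctuation marks (the constant `PUNCTUATION` is my guess), spaces and everything else are dropped before pairing, and the name `digram_spectrum` is hypothetical.

```python
from collections import Counter

# Assumption: which punctuation marks count as "significant" is a guess.
PUNCTUATION = set(",.!?;:")

def digram_spectrum(text, top=30):
    """Return the `top` most frequent adjacent pairs with relative frequencies."""
    # Keep only letters and the chosen punctuation; pairs may span word gaps.
    chars = [c for c in text.lower() if c.isalpha() or c in PUNCTUATION]
    pairs = Counter(a + b for a, b in zip(chars, chars[1:]))
    total = sum(pairs.values())
    if total == 0:
        return []
    # Relative value: count of a pair divided by the total count of all pairs.
    return [(pair, n / total) for pair, n in pairs.most_common(top)]
```

Comparing two texts then amounts to comparing the two returned lists pair by pair; how exactly to compare them is a separate choice.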

One firmly established “law” of the Russian language that we had not known before is that, in every text, the number of commas is of the same order of magnitude as the number of occurrences of the most frequent pairs of adjacent letters. This may be elementary knowledge, but for us it was a kind of “discovery”, and its importance is, at the very least, not emphasized anywhere.
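
This observation is easy to check on any text file. The sketch below is only an illustration with a made-up function name, `commas_vs_top_pair`; it counts pairs of letters that are directly adjacent in the text and reports the single most frequent one next to the comma count.

```python
from collections import Counter

def commas_vs_top_pair(text):
    """Return (number of commas, most frequent letter pair, its count)."""
    lowered = text.lower()
    commas = lowered.count(",")
    # Only pairs where both neighbouring characters are letters are counted.
    pairs = Counter(
        a + b for a, b in zip(lowered, lowered[1:]) if a.isalpha() and b.isalpha()
    )
    top_pair, top_count = pairs.most_common(1)[0] if pairs else ("", 0)
    return commas, top_pair, top_count
```

Whether the two numbers really stay within one order of magnitude of each other has to be judged on long texts; the function only reports them.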

The remaining observations were less precise. In particular, good writers’ works from the same period (early Chekhov or late Chekhov, early Tolstoy or late Tolstoy) had spectral curves that were similar even by eye, while these authorial curves differed from the curves of other authors. As for modern writers, the correlation between the curves of their texts was stronger the more “trashy” the writer was considered. This conclusion rests on several examples and is not rigorous. For example, the curves of various graphomaniac works by newcomers from Samizdat lay almost on top of each other. The same could be said of more accomplished works, such as Akunin’s Fandorin novels, the science fiction of early and middle Lukyanenko, Marinina’s detective stories, etc. The classics, and Dostoevsky in particular, matched themselves very poorly. Tolstoy’s novels, especially those of the last period, fell outside the author’s own spectrum. To account for all authors and separate the curves into authorship classes, we had to try different definitions of the proximity of curves. Overall, however, the style-determination technique worked. In the vast majority of cases, the spectral curves of different authors (more than a hundred authors) could be separated from one another.

In our approach, including clusters of three adjacent letters did not give significant quantitative corrections to the determination of an author’s style. Differences in the frequency of punctuation marks seemed more significant. (In the works of Khmelev and his followers, trigrams were taken as the basis [2-4].) Nor was any structure observed when counting pairs “skipping one letter” or in other simple modifications.

Why clusters of two adjacent letters reveal so much about texts is, in general, a mystery. The comments on the original post about frequencies included this remark:
Robotex
By the way, if anyone knows how to convert a sequence of phonemes (which may repeat, i.e. the program recognizes the word “mama” as mmmmmmmaaaaaaammmmmmmmmaaaaaa) into words, I would be glad to read about it (I am stuck on this for now)

It is possible that I am misreading the original question. But this example clearly shows the importance of letter pairs in particular for picking out the significant part of a character sequence. Looking at pairs, we see that the long word contains the clusters “mm”, “ma”, “am”, “aa”. Discarding the low-frequency “mm” and “aa” leaves “ma” and “ma”, or “ma”, “am”, “ma” if all two-character combinations are counted. It is clear that the word “mama” has the same spectrum as the word mmmmmmmaaaaaaaammmmmmmmmaaaaaaa in terms of high-frequency two-letter packets. For decrypting a password, i.e. guessing the original, this is of course useless. For clipping noise and analyzing the original, it seems to work well: the extra “a” and “m” carry no new information.
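
The pair argument above can be checked mechanically. In this minimal sketch (both function names are invented for the example), same-letter pairs such as “mm” and “aa” are thrown away and only pairs of two different letters are kept; collapsing runs of repeated letters is shown alongside as an equivalent view of the same thing.

```python
from itertools import groupby

def mixed_digrams(word):
    """Adjacent letter pairs of `word`, skipping same-letter pairs like 'mm'."""
    return [a + b for a, b in zip(word, word[1:]) if a != b]

def collapse_runs(word):
    """Replace every run of a repeated letter with a single letter."""
    return "".join(letter for letter, _ in groupby(word))

stretched = "mmmmmmmaaaaaaammmmmmmmmaaaaaa"
print(mixed_digrams(stretched))  # same pairs as for "mama"
print(collapse_runs(stretched))
```

This says nothing about real speech recognition, where the stretched sequence would be noisy; it only illustrates that the stretched and the original word share the same set of mixed pairs.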

In terms of meaning, splitting words into pairs of letters in Russian is fairly close to splitting them into syllables. Note that there are scripts such as hiragana in which it is precisely such pairs that serve as the basic elements. Take the word “частота” (“frequency”); any other would do. In the paired-cluster approach it breaks up into “ча”, “ас”, “ст”, “то”, “от”, “та”. Consonant-consonant pairs, with rare exceptions (“нн”, etc.), drop out of the spectrum because they merge with the background. Roughly speaking, you can throw out the consonant-consonant pairs and the conclusion about authorship and style will not change. Thus only consonant + vowel pairs remain, which correspond to the sounds actually pronounced in speech. Including punctuation marks (as the time for inhaling and exhaling) probably brings the analysis of pairs even closer to the analysis of syllables and spoken language.

Another pleasant finding was that neither the old-style hard sign “ъ” nor the Ukrainian “і”, nor the choice between the Russian and Ukrainian languages, affected a given author’s spectral curve (only Bunin was taken as an example; there are no statistics across authors). We were not able to check how translation from English affects authorship and style. Analysis of English-only texts was limited to Harry Potter and a couple of works by Jack London, i.e. again no statistics were collected, but two-character correlations for these two authors were also visible.

We abandoned the problem, first, because a search even on the RuNet showed that similar frequency analysis of texts has been carried out since the beginning of the last century, including by Morozov [1], whose work interested Markov himself. There was also some Fomenko-style speculation on the subject. The very conclusion that authorship can be determined from trigrams had already been formulated around 2000 by D. Khmelev [2,3]. There were works by other authors as well; see, for example, [4]. Khmelev’s works, of course, spoke of text invariants, Markov chains, diagonalization of transition matrices, etc. In essence, they made similar statements about the importance of the most frequent letter triples for determining style. We have many questions about these works. How Dostoevsky, for example, is caught by trigrams is not clear to us. And so on.

Even without mathematical terminology, one can see that pairs of letters give quite similar spectral graphs for many authors. Quantitatively, the figures across an author’s range depended strongly on exactly how the proximity of two graphs was defined: whether the pointwise error was taken squared, cubed, etc. But these are details. That the pattern in “style” lies at the level of letter pairs and punctuation is quite obvious. Obvious misfires were observed in cases where a novel was written by ghostwriters working within a “project.” Translated books were also a problem.
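
One way to make the “proximity of two graphs” concrete is a pointwise error raised to a power `p` (2 for quadratic, 3 for cubic). The sketch below is an assumption about the general shape of such a measure, not the exact formula we used; spectra are represented as plain dictionaries mapping a pair to its relative frequency, and `spectrum_distance` is a hypothetical name.

```python
def spectrum_distance(spec_a, spec_b, p=2):
    """Distance between two pair spectra given as {pair: relative frequency}.

    Pairs missing from one spectrum are treated as having frequency 0.
    p=2 gives the usual quadratic (Euclidean) error, p=3 a cubic one.
    """
    keys = set(spec_a) | set(spec_b)
    total = sum(abs(spec_a.get(k, 0.0) - spec_b.get(k, 0.0)) ** p for k in keys)
    return total ** (1.0 / p)
```

A larger `p` makes the result depend more on the few pairs with the biggest differences, which is one reason the numbers change noticeably from one definition to another.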

Overall, our conclusion six years ago was that none of these tasks is original and that there is no point in continuing to “study” them. It is quite possible, of course, that such frequencies have been discussed here on the site. What we did notice is that in digram analysis a significant difference between authors appears when punctuation marks are taken into account. To some extent, the proportion of punctuation marks is a measure of the pace of the text. In trigram analysis, punctuation marks are most likely not included in the statistics, and this, in my opinion, is a mistake and a loss of significant information.

It is very hard to believe that a language is built only on the frequencies of pairs of symbols and that no other objective quantitative characteristics and laws exist. It is easier to believe that nobody looked, because of the complexity of such analysis in past years. However, our further search for principles of text organization, for the geometry of symbols in the Russian language, gave no obvious results. One hypothesis, for example, was that the presence of identical clusters within the same sentence, for example the pair “ma”, is an important characteristic of text and style. At school, for example, they teach that one should not say “which” twice in adjacent sentences, or repeat the protagonist’s name. We supposed that repeating a pair of letters would serve as a criterion of poor text quality, a violation of melody, etc. Take the children’s words “mama”, “papa”, “baba”: obviously the second “ma”, “pa”, “ba” is redundant in terms of new information. Many occurrences of “ma” not separated by a period should likewise be avoided. Therefore such combinations were counted with an increased or reduced contribution, by a factor of 2, 3, etc. However, this hypothesis produced no new, clearly formulated results. Analysis of the classics showed that many works contain repeated pairs between two periods. Poems, of course, complicate the geometry at this level, but that is obvious.
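
The repeated-pair hypothesis can be sketched as a weighted count. Everything specific here is an assumption made for illustration: sentences are split on “.”, “!” and “?”, pairs are taken inside words only, the first occurrence of a pair in a sentence counts as 1 and each repeat inside the same sentence counts as `repeat_weight`; the function name `weighted_pair_counts` is made up.

```python
import re
from collections import Counter

def weighted_pair_counts(text, repeat_weight=2.0):
    """Count letter pairs, weighting repeats inside one sentence differently."""
    totals = Counter()
    for sentence in re.split(r"[.!?]+", text):
        counts = Counter()
        # [^\W\d_]+ matches runs of letters (including Cyrillic) only.
        for word in re.findall(r"[^\W\d_]+", sentence.lower()):
            counts.update(a + b for a, b in zip(word, word[1:]))
        for pair, n in counts.items():
            # First occurrence counts once; each repeat counts `repeat_weight`.
            totals[pair] += 1 + repeat_weight * (n - 1)
    return totals
```

Setting `repeat_weight` below 1 gives the “reduced contribution” variant, and setting it to exactly 1 falls back to plain pair counts.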

It may be that the structure of texts needs to be studied more subtly, or that the problem should be posed in a completely different way. Unfortunately, we found no competent specialist in our circle who would take the question seriously. What science already knows, we were unable to find out. People from the Mechanics and Mathematics Faculty of Moscow State University said that “all this was done at Bauman 100 years ago” and is therefore irrelevant. But we could not find published texts or people from Bauman University, etc.

Perhaps a new challenge

Our “experiments” also stalled for lack of elementary programming knowledge and because of the high cost of Internet access. With our primitive program and hardware, analyzing a 500-page text took several minutes, the computer would hang, etc. Automatically and freely downloading gigabytes of text over the Internet, parsing HTML tags, etc. was beyond anything we could think of. I.e., the conclusions above were (and are) our technical limit.

However, from the start we had set ourselves (not seriously, of course) a broader task: studying the geometric structure of text in Russian.

Perhaps someone from the community will now be able to test the hypothesis ... It is that every online community, newspaper, etc. has its own spectrum, which would be interesting to analyze quantitatively.

For concreteness and relevance, consider the problem of how a given large site relates to some politician, company or phenomenon, call it “XYZ”. The obvious idea was that among the site’s many web pages where the letter combination “XYZ” occurs often, it will be surrounded by characteristic clusters of letters and words. Suppose, for example, that a publication or community has a negative attitude toward the XYZ brand. In theory, geometrically close to this combination of symbols on the page there should, on average, be precisely negative symbols: “collapse”, “ruin”, “crisis”, “decline”, pictures of falling planes, sunken ships, etc. In another community, good symbols may on average stand closer to “XYZ”: “confidence”, “progress”, “achievement”, etc.
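
As a rough illustration of the “environment of XYZ” idea, the sketch below counts the words that fall within a fixed number of words of every occurrence of a target word. It works on plain text only (no HTML, no pictures), the window size `radius` is an arbitrary choice, and the function name `neighbors` is hypothetical.

```python
import re
from collections import Counter

def neighbors(text, target, radius=5):
    """Count words occurring within `radius` words of each `target` occurrence."""
    words = re.findall(r"\w+", text.lower())
    target = target.lower()
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            lo = max(0, i - radius)
            # Words before and after the target, excluding the target itself.
            counts.update(words[lo:i] + words[i + 1:i + 1 + radius])
    return counts
```

Comparing `neighbors(page_text, "XYZ").most_common(20)` across two sites would give a first, very crude picture of which words each community places near the brand.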

In general, such an analysis of resources would help bring order to the Internet. This is, of course, very far from the problem of recognizing the meaning of textual information, but a machine should be able to obtain some new knowledge about communities from such geometry. What is meant is the following. By analogy with letter frequencies in a language, the simplest text-analysis task is, of course, to compute the frequency of significant words: the structure of the semantic core, or the tag cloud. It makes sense to give bold, colored and italic text extra weight. Google, Yandex and every SEO specialist have done and still do this. The next step is to take the geometry of the semantic core into account. The closer a word is to the page header, the higher its weight; the closer to the footer, the lower. Whether it is in the header or in the body also matters. This, too, is done by search engines when ranking and returning results.

But whether a metric in the Riemannian sense, i.e. the geometric distance between significant words (for example, the root-mean-square distance on a page, the distance in clicks, etc.), is taken into account when a machine evaluates texts, I still do not know.
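
For plain text, the root-mean-square distance between two significant words could be measured, for instance, in word positions. This is only one possible reading of “distance on a page”, chosen here because it is simple; the function name `rms_distance` is invented, and a real page would also need layout and link structure taken into account.

```python
import math
import re

def rms_distance(text, word_a, word_b):
    """Root-mean-square distance, in word positions, between two words.

    Every occurrence of `word_a` is paired with every occurrence of `word_b`.
    Returns None if either word does not occur in the text.
    """
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    if not pos_a or not pos_b:
        return None
    squares = [(i - j) ** 2 for i in pos_a for j in pos_b]
    return math.sqrt(sum(squares) / len(squares))
```

A small value means the two words tend to stand close together; the value is only meaningful when compared between texts of similar length.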

Even free online tools for SEO beginners let you strip everything but the significant words from the text of a web page. What remains of a large text is a kind of skeleton of meaningful words. Something in this skeleton must matter besides a single quantity, the frequency; relative distances, for example. However, I do not know whether a machine does any further computation with such a skeleton, such a basis of the text. It would seem natural for a search engine to study the laws of text organization, because most of us read “diagonally.” We need some measure for determining what the author wanted to say in the text once everything superfluous is removed. But ... search engines have billions of pages, and it is far from certain that such analysis is technically feasible yet, even for giants like Google. I.e., the scale of the task is much smaller than is standard for search engines (thousands of pages, not billions), but the analysis proposed is deeper. In addition, this direction may be strongly held back by the hypertext structure of the Internet. It is much more efficient to track who links to whom and who belongs to the trust network. But ... links are a way of harnessing the natural intelligence of webmasters, which the search engine exploits for its own purposes. Someday the machine’s own ability to draw conclusions about a text will be needed as well.

Conclusion

It is argued that the Russian language has an objective, statistically determinable internal structure based on character frequencies [1-4]. Unlike previous works [1-4], the remaining “novelty” of our claim is that the frequency of pairs of adjacent letters together with punctuation marks plays a major role in the text. When analyzing texts, philologists refer vaguely to “unique rhythm” and “readability”; it is possible that the frequency of punctuation marks, along with other factors, provides an equivalent quantitative description of rhythm.

It is proposed to continue the study of language frequencies begun in the original study of the Ukrainian language, but with the aim of extracting significant information from the texts of individual communities. That is, not to average frequencies over the entire Russian-language Internet but, on the contrary, to divide it into sectors characterized by the same invariants.

It is proposed to try studying the geometry of the text by introducing a new parameter: the relative distance between characters. For N characters there are (N-1)! connections, and each connection also has a weight. Technically, therefore, such analysis can be much more complex than a simple frequency count.

It is possible that these ideas are completely trivial and unoriginal. Clearly, the practical value here would lie not in analyzing Turgenev’s novels but in specific infographics on how popular media or communities treat a particular presidential candidate in 2012, on PR or anti-PR for a particular brand in a particular online publication, etc. But overall, it may be an interesting problem in its own right.

Such tasks require, first of all, the knowledge of Internet technologies that the author of the original post on frequencies used: the ability to download gigabytes of text, to analyze data sets for frequencies quickly, to find common features in chaos, etc. HTML tags alone contribute a great deal to the geometry of a text.

For the same technical, Internet- and computer-related reasons, professional mathematicians and philologists may have studied such problems very little. Such tasks may therefore suit IT specialists, even ones who know neither philology nor mathematics.

References

[1] professor_k, “Frequency analysis of the Ukrainian language”.

[1] N. Morozov, “Linguistic spectra: a means for distinguishing plagiarism from the true works of one or another unknown author”, 1915.

[2] D. V. Khmelev, “Recognizing the author of a text using A. A. Markov chains”, Moscow State University Bulletin, Ser. 9: Philology, 2000, No. 2, pp. 115-126.

[3] D. Khmelev, F. Tweedie, “Using Markov Chains for Identification of Writers”, Literary and Linguistic Computing, 2001, Vol. 16, No. 4, pp. 299-307.

[4] A. S. Romanov, R. V. Meshcheryakov, “Identification of the author of a text using support vector machines in the case of two possible alternatives”.
