Search for the longest word chains in Russian using the Wolfram Language (Mathematica)


    You can download the translation as a Mathematica document, which contains all the code used in the article, here (archive, ~ 5 MB).

    Introduction


    In Russian, as in many other languages, there are words that have the same length, but differ in only one letter. Such pairs of words are called metagrams .

    Suppose we have several consecutive metagrams, say:

    opinion-milling-smoldering-friction-debating-drinking-swarming-vigilance-vigil-beating —

    they form a chain of metagrams, or a chain of words.

    From here comes the game called the word ladder , which was invented by Lewis Carroll back in 1879 .

    It is clear that far from every initial word such a chain can be made up, and some words, apparently, should generate rather long chains.

    In this post we will try to analyze the chains of words that can be built in the Russian language, and also find the chains of the greatest length.

    Import a list of Russian words


    To build and analyze word chains, we need a list of words that we can meet in Russian. In this post we will not limit the part of speech that can be used, but we will only consider the form of the word corresponding to the nominative case in the word paradigm . We use, as in one of the previous posts ( Finding the best sequence for viewing the list of 250 best films using the Wolfram Language (Mathematica) ), a morphological dictionary of the Russian language created by academician Andrei Anatolyevich Zaliznyak :

    In [1]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_2.gif

    In [3] : = We

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_3.png

    process the data of the dictionary, compiling on its basis a list of words in the Russian language russianWords :

    In [4]: ​​=

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_4.png

    The function that defines the link in the word chain


    Let's create a chainLinkQ function , which will determine whether two words can be a link in a chain of words, in other words, this function answers the question whether two words are a metagram.

    In this case, we consider two options:

    • words are distinguished by one letter (in this case, it is enough to use the EditDistance function to check , which calculates the Levenshtein distance , the calculation result of which should be equal to 1):

    In [5]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_5.png

    • words are distinguished by n adjacent letters:

    In [6]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_6.png

    Search function for all metagrams of fixed-length words


    Now we find all the links in the chain of words (metagrams).

    To begin with, we will divide all the words of the Russian language into equivalence classes , i.e., in fact, we will construct a factor set of words of the Russian language by their length, in other words, we will divide the list of all words into sets of words from the 1st, 2nd, 3, 4, etc. letters.

    In [7]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_7.png

    Find the minimum and maximum word length in Russian:

    In [8]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_8.png

    Out [8] =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_9.png

    Let's look at the distribution of the words of the Russian language in length:

    In [9]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_10.png

    Out [9] =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_11.png

    Now create a function preserving its calculated values, which will select, among all possible pairs of words of a given length, those that can be links in word chains:

    In [10]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_12.png

    We calculate this function for each equivalence class of words in the Russian language, while we will now consider only words that differ by one letter:

    In [11]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_13.png

    Example of the result of calculations:

    In [12]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_14.png

    Out [12] =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_15.png

    Wordline Graphs


    After we have calculated the search function for all links of the word chains for each equivalence class, we can construct graphs corresponding to them that will allow us to visually study the structure of word chains of each type (if there is no graph, then there are no links for words of a given length).

    When you hover over each vertex of the graph, you can get a tooltip with the word that corresponds to the given vertex of the graph (you can download the document with interactive graphs and source code from the link provided at the very beginning of the post).

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_16.png

    In [13]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_17.png

    In [14]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_18.png

    Count of Russian language word chains containing 1 letter
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_19.png

    Count of Russian language word chains containing 2 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_20.png

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_21.png

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_22.png

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_23.png

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_24.png

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_25.png

    Count of chains of Russian words containing 8 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_26.png

    Count of Russian language word chains containing 9 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_27.png

    Count of 10-letter Russian word chains
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_28.png

    Count of Russian language word chains containing 11 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_29.png

    Count of chains of Russian words containing 12 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_30.png

    Count of Russian language word chains containing 13 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_31.png

    Count of chains of Russian words containing 14 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_32.png

    Count of Russian language word chains containing 15 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_33.png

    Count of chains of Russian words containing 16 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_34.png

    Count of Russian language word chains containing 17 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_35.png

    Count of 18-letter words in the Russian language
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_36.png

    Count of Russian language word chains containing 21 letters
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_37.png

    The graphs above show how much their structure varies depending on the length of the words from which it is built, while it can be observed that most graphs have more than one connected component.

    From the graph below you can see the dependence of the number of connected components of different lengths among all the graphs discussed above, from which it follows that the dependence should be of the form Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_38.png, where x is the size of the connected component, y is the number of connected components.

    In [15]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_39.gif

    Out [16] =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_40.png

    Valid:

    In [17]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_41.gif

    Out [17] =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_42.png

    In [19]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_43.png

    Out [19] =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_44.png

    Donald Knuth's assumption as applied to chains of words in Russian


    Donald Knuth was one of the first to use a computer to study English language chains. He studied a chain of words consisting of 5 letters, because he believed that words from a smaller number of letters, say 3, are too easy to learn (although Lewis Carroll composed a chain of 6 links from the word APE to the word MAN), and the words from more letters cannot lead to long chains, including because the corresponding graph has many connected components.

    Let us verify Knuth’s assumption for Russian language chains.

    The graph below shows how, with an increase in the number of letters in the words from which the chains are built, the position of the point with the coordinates changes:

    {the number of connected components of the graph, the number of edges of the graph} .

    The upper left corner corresponds to the “ideal graph” which has one connected component, while it has the maximum number of edges. The graph corresponding to the closest point of the trajectory under consideration to a given point, obviously, can be considered as the “best”. The corresponding point, found automatically, is marked with two red circles and corresponds to the graph of chains of five-letter words.

    Thus, it can be argued that the assumption of Donald Knuth, made for the English word chain, just as it is not surprising, is true for the Russian language. Perhaps this is a fairly deep-seated properties of human psychology, tied to the numbers 5 and 10, corresponding to the fingers on one and two palms, respectively.

    In [20]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_45.gif

    Out [25] =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_46.png

    Search for the longest chain of words in Russian


    So, having traveled a rather long way, we are finally ready to find the longest chain of words that exists in Russian. As follows from the previous section, it should be among the chains of words of five letters.

    In [26]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_47.png

    Out [26] =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_48.png

    So, we got the two longest chains of words in the Russian language, and, as expected, they are contained in the “best” column of chains of five-letter words.

    Search for the longest chains of n-letter words of the Russian language


    In the previous section, we searched for the longest chain of words in the Russian language, and we found two chains of 40 five-letter words.

    Let's now look at the longest possible chains of n-letter words:

    In [27]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_49.png

    Chains of Russian words containing 2 letters, with at least 5 words
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_50.png

    Цепочки слов русского языка, содержащих 3 буквы, с количеством слов не менее 5
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_51.png

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_52.png

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_53.png

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_54.png

    Цепочки слов русского языка, содержащих 7 букв, с количеством слов не менее 5
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_55.png

    Цепочки слов русского языка, содержащих 8 букв, с количеством слов не менее 5
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_56.png

    Цепочки слов русского языка, содержащих 9 букв, с количеством слов не менее 5
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_57.png

    Цепочки слов русского языка, содержащих 10 букв, с количеством слов не менее 5
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_58.png

    Цепочки слов русского языка, содержащих 11 букв, с количеством слов не менее 5
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_59.png

    Цепочки слов русского языка, содержащих 12 букв, с количеством слов не менее 5
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_60.png

    Цепочки слов русского языка, содержащих 13 букв, с количеством слов не менее 5
    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_61.png

    Заключение


    I hope that my post was able to interest you, and some of the ideas and programs presented in it will be useful to you. Of course, you can come up with a lot of interesting studies of the Russian language. Say, at the beginning of the post, the chainLinkQ function was created , the functionality of which allows a similar study for metagrams that differ in two letters - which corresponds, in fact, to a small modification of the rules of the game in word chains. From the two graphs presented below, it is clear that the graph of four-letter metagrams that differ in two letters is connected, compared with the graph of metagrams that differ in only one letter. This suggests that the results of the study of such graphs and chains will be completely different.

    In [28]: =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_62.png

    Out [28] =

    Poisk-samyh-dlinnyh-cepochek-slov-v-russkom-jazyke-s-pomoshhju-jazyka-Wolfram-Language-Mathematica_63.png

    Resources for learning Wolfram Language ( Mathematica ) in Russian: http://habrahabr.ru/post/244451

    Also popular now: