Results and prospects of a small analysis of Russian texts

    Here are the statistics I collected while building a very simple generator of Russian phrases.

    Word distribution

    First, some numbers.
    In 12.5 MB of Russian text (mostly classical literature by various authors) containing 142,114 distinct words, the most frequent word is the conjunction "and", which occurs 83,575 times (words are counted in all their word forms). That count alone is more than half the number of distinct words!
    The second most frequent word is the preposition "by", with 52,124 occurrences, and the third is the particle "not", with 36,268.
    The verb "said" (third person singular) occurs 6,566 times and ranks 28th.
    The word "yes" ranks 36th with 5,039 occurrences, while "no" occurs 2,948 times and ranks 53rd.
    Apart from the top three, the words listed here were picked more or less arbitrarily, according to my own preferences.
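    To show how counts and ranks like these can be obtained, here is a minimal sketch in Python. It is not the code used for the statistics above: the function name `word_frequencies`, the tokenization rule (lowercase runs of letters) and the sample sentence are all assumptions made for illustration, and every word form is counted as a separate word, with no lemmatization.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (word, count) pairs sorted by descending count.

    Assumption: a "word" is a lowercase run of Unicode letters;
    different forms of the same word are counted separately.
    """
    # [^\W\d_]+ matches one or more letters (word characters that are
    # neither digits nor underscores), including Cyrillic ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common()


# Hypothetical sample text, not taken from the analysed corpus.
sample = "и он сказал да и она сказала нет и не в этом дело"
ranked = word_frequencies(sample)

# Print the rank, the word and its count for the three most frequent words.
for rank, (word, count) in enumerate(ranked[:3], start=1):
    print(rank, word, count)
```

    On a real corpus the same function would be applied to the full text read from disk; the rank of a word is simply its position in the returned list.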

    Word frequencies in text corpora have been studied ever since Zipf's law was discovered for English, that is, for more than 60 years. Various dictionaries and surveys on the topic have been published, but here we will look at the Russian language a little more carefully and clearly.
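    In its idealized form, Zipf's law says that a word's frequency is roughly inversely proportional to its rank, so the product of rank and count should stay roughly constant. As a rough check, the sketch below computes that product for the six (rank, count) pairs quoted above. The function name `zipf_products` is hypothetical, and an exponent of exactly 1 is an assumption of the idealized law, not a result of this analysis.

```python
# (rank, count) pairs quoted in the text above.
observed = [
    (1, 83575),
    (2, 52124),
    (3, 36268),
    (28, 6566),
    (36, 5039),
    (53, 2948),
]


def zipf_products(pairs):
    """Return (rank, count, rank * count) for each (rank, count) pair.

    Under an idealized Zipf law with exponent 1 the third value
    would be the same for every rank.
    """
    return [(rank, count, rank * count) for rank, count in pairs]


for rank, count, product in zipf_products(observed):
    print(f"rank {rank:>2}: count {count:>6}, rank*count {product:>7}")
```

    For these six points the product ranges from 83,575 to 183,848: it stays within a factor of about two, so the idealized law holds only approximately here.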
    Detailed graphs and examples with conclusions
