Habr's Lexicon
This post continues the study by Habr user Muxto of the most common words in Habr's articles and comments. As many readers noticed, however, the top 10 and even the top 50 obtained by Muxto are not merely short on IT terms, they contain none at all: “in” (107 735), “and” (106 420), “on” (103 084), “with” (93 453), “not” (91 591), “what” (88 488), and so on.
The next obvious step was to identify the words whose frequency on Habr deviates most strongly from the average for the Russian language. Having received the go-ahead from the author of the first part of the study, and having discussed some mathematical questions with the user trept, I proceeded as follows.
I downloaded the word-form frequency database for “average, general-purpose” Russian from the site of the Russian National Corpus (NKRYA). It is based on an analysis of texts with a total volume of 192 689 044 units (words) and contains 1 054 211 unique case-sensitive word forms. Since the Habr vocabulary data provided by Muxto is case-insensitive, which in any case suits the final goal better, the first task was to convert all word forms to lowercase. That leaves 888 397 unique case-insensitive word forms in the NKRYA database (the frequencies of the merged forms were, naturally, summed).
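The case-folding step can be illustrated with a minimal Python sketch. The function name `fold_case` and the toy counts are assumptions made for illustration; they are not taken from the NKRYA database, and the original processing script is not shown in the post.

```python
from collections import Counter


def fold_case(freqs):
    """Merge case-sensitive word forms into lowercase forms, summing their counts."""
    folded = Counter()
    for form, count in freqs.items():
        folded[form.lower()] += count
    return folded


# Toy counts for illustration only (not real corpus data).
sample = {"Код": 3, "код": 10, "КОД": 1, "сеть": 5}
folded = fold_case(sample)
```

Here three case variants of one word collapse into a single entry whose count is the sum of the three, which is how the number of unique forms in the database shrinks after lowercasing.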
The second question was how to actually identify the words that stand out significantly. As it turned out, this problem was solved long ago in modern linguistics, which makes active use of statistics and computing. One measure of how much a word's frequency in one corpus differs from its frequency in the combined set of corpora, and one that philologists have taken a particular liking to, is the G-test, a special case of the likelihood ratio test. The statistic for a single word is calculated as

$$G_i = 2\,a_i \ln\frac{a_i}{E_i}$$

Here a_i is the observed frequency of the i-th word form in the corpus under study, and E_i is the expected frequency of the same word form in that corpus under the assumption that the corpora are pooled, that is,

$$E_i = \frac{c\,(a_i + b_i)}{c + d}$$

where a_i and b_i are the frequencies of the i-th word form in the two corpora (Habr and NKRYA), and c and d are the total sizes of these corpora (33 732 229 and 192 689 044 units, respectively).
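The calculation can be sketched in Python as follows. This is a minimal sketch that assumes the single-word statistic has the usual G-test form, G_i = 2·a_i·ln(a_i / E_i) with E_i = c·(a_i + b_i) / (c + d). The names `g_statistic` and `top_words` and the toy word counts are hypothetical; only the two corpus sizes come from the text.

```python
import math

HABR_TOTAL = 33_732_229     # c: total size of the Habr corpus (from the text)
NKRYA_TOTAL = 192_689_044   # d: total size of the NKRYA corpus (from the text)


def g_statistic(a, b, c=HABR_TOTAL, d=NKRYA_TOTAL):
    """Per-word G term: 2 * a * ln(a / E), where E = c * (a + b) / (c + d)."""
    if a == 0:
        return 0.0  # a * ln(a) tends to 0 as a -> 0
    expected = c * (a + b) / (c + d)
    return 2.0 * a * math.log(a / expected)


def top_words(habr, nkrya, n=30):
    """Rank word forms by G, largest first; forms missing from nkrya count as 0."""
    scores = {w: g_statistic(a, nkrya.get(w, 0)) for w, a in habr.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy counts for illustration only (not real corpus data).
habr = {"код": 500, "и": 100_000, "сервер": 300}
nkrya = {"код": 200, "и": 600_000, "сервер": 50}
ranking = top_words(habr, nkrya)
```

In this form the term is positive for words that are over-represented in the corpus under study and negative for under-represented ones, so sorting in descending order brings the characteristic vocabulary to the top and pushes common function words down.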
With all the calculations done, the words are sorted in descending order of the statistic G_i. Here is the top 30:
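405587.703 пользователь (user)
197850.057 сайт (site)
139330.707 разработчик (developer)
135705.259 файл (file)
124132.397 приложение (application)
121233.522 веб (web)
116809.907 данные (data)
113262.075 компания (company)
109463.742 код (code)
94468.080 версия (version)
92093.985 проект (project)
79257.370 com
77786.398 информация (information)
74006.346 сеть (network)
71844.136 ru
66674.626 работает (works)
64946.067 помощью (with the help of)
63195.334 сервер (server)
60807.287 можно (one can)
60433.187 google
55160.380 ссылка (link)
55147.137 интернет (internet)
53984.795 например (for example)
52609.986 windows
50998.105 позволяет (allows)
50177.316 возможность (possibility)
48421.264 http
48372.913 работы (work, inflected form)
48328.683 видео (video)
48158.301 сделать (to do, to make)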
Suspicious? Yes, I confess: after the first run I manually merged the frequencies of several forms of the same word within the top 150, keeping the base form. It was a shame to see the various inflected forms of “user” or, say, “version” each score very high yet miss the leading positions merely because Russian is rich in case and number endings.
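That manual merge could be expressed as a small lookup table applied before the scores are recomputed. The sketch below is only an illustration: the mapping `lemma_of`, the function `merge_forms`, and the counts are hypothetical, and the post does not say exactly which forms were merged or how the merged counts were scored.

```python
from collections import defaultdict


def merge_forms(freqs, lemma_of):
    """Sum the counts of inflected forms under their hand-picked base form."""
    merged = defaultdict(int)
    for form, count in freqs.items():
        # Forms absent from the mapping are kept unchanged.
        merged[lemma_of.get(form, form)] += count
    return dict(merged)


# Hypothetical hand-made mapping and toy counts, for illustration only.
lemma_of = {"версии": "версия", "версию": "версия"}
counts = {"версия": 40, "версии": 35, "версию": 10, "код": 50}
merged = merge_forms(counts, lemma_of)
```

A hand-written mapping like this only covers the words someone chose to list, which matches the limited, top-150 scope of the merge described above. A morphological analyzer would be the natural tool for lemmatizing the whole vocabulary.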
Both the top 30 and the top 150 for Habrahabr certainly deserve reflection. Personally, I was pleased with the result: in my opinion, it captures the essence of this unique IT resource very accurately. And the leader, “USER”, is the generalized goal for which we spend hours, days and years of our lives.
Given the top 30 (with frequencies proportional to the G statistic) and Habr's color palette, Wordle.net responded with a tag cloud.
All that remains is to invite you, as a philological warm-up, to come up with the longest sentence you can in the comments, built from the top-30 words, that does not sound too artificial.
I wish you an optimistic and boring Friday!