OLS November 29, 2013 at 02:25

Lexicon Habra

This post is a continuation of this study of the haxtraiser Muxto about the most common words in Habr's articles and comments. As, however, many have noticed, the top 10 and even the top 50 obtained by Muxto are not replete with IT terms proper, they are not there at all: “in” (107,735), “and” (106,420), “on” (103 084), “s” (93 453), “not” (91 591), “what” (88 488), etc.

The next obvious step was to identify the terms that most significantly deviate from the average in Russian. Having received the "go-ahead" from the author of the first part of the study and having discussed some mathematical questions with the trept user , I proceeded to the following activities.

From the site of the National Corps of the Russian Language (NKRYA)The frequency base of word forms of the “average common” Russian language was downloaded based on the analysis of texts with a total volume of 192 689 044 units (words). The database contains 1 054 211 unique case-sensitive word forms. Since the analysis of the Habra vocabulary provided by Muxto is case-insensitive, and in principle this is more consistent with the final goal, the first task turned out to bring all word forms to lower case. There are 888 397 unique case-insensitive word forms in the NKRY base (the frequency values of the combined forms, naturally, were summed up).

The second issue was the actual identification of significantly distinguished words. As it turned out, this problem has long been solved in modern linguistics, which is actively using statistics and computer technology. One of the statistics on the degree of “heterogeneity” of the frequency of occurrence of a word in one case with respect to the general set of cases, which philologists especially liked, is the G-test, which is a special case of the likelihood ratio test. The statistics for a single word are calculated as

Here a _i is the actually observed frequency of occurrence of the i -th word form in the case under study,
and E _i is the expected frequency of the same word form in the case under study, provided that the cases are combined, that is,

where a_i and b _i are the frequencies of occurrence of the i- th word form in the buildings (Habr and NKRYA),
and c and d are the total volume of these cases (33 732 229 and 192 689 044 units, respectively).

So, all the calculations are made, the words are sorted in descending order of statistics G _i , top-30:

  405587,703 пользователь
  197850,057 сайт
  139330,707 разработчик
  135705,259 файл
  124132,397 приложение
  121233,522 веб
  116809,907 данные
  113262,075 компания
  109463,742 код
   94468,080 версия
   92093,985 проект
   79257,370 com
   77786,398 информация
   74006,346 сеть
   71844,136 ru
   66674,626 работает
   64946,067 помощью
   63195,334 сервер
   60807,287 можно
   60433,187 google
   55160,380 ссылка
   55147,137 интернет
   53984,795 например
   52609,986 windows
   50998,105 позволяет
   50177,316 возможность
   48421,264 http
   48372,913 работы
   48328,683 видео
   48158,301 сделать

Suspicious? Yes, I confess, I combined the frequencies of several forms of the same word in the top 150 after the first run manually, choosing the initial word form, because it was a shame to see in the top the word forms “user / user / users” or, for example, “version / versions / version” with very high rates, but not in the leaders just because the Russian language is rich in endings and numbers.

Both the top 30 and the top 150 Habrahabr certainly deserve reflection. Personally, I was pleased with the result - in my opinion, the essence of this unique IT resource was highlighted very accurately. Well, the leader - “USER” - is that generalized goal for which we spend hours, days and years of our lives.

Wordle.netI reacted to the loaded top-30 (with frequencies proportional to the G statistics) and the Habr’s color palette with such a cloud of tags: All

I have to do is offer you, as a philological warm-up, to come up with the longest sentence in the comments with words from the top-30 that would not seem too artificial.

I wish you an optimistic and boring Friday!

Tags:

Lexicon Habra

Also popular now: