Habra dictionary. Part 1

Published on June 04, 2018

Good day, friends.


I set out to compile a dictionary of Habrahabr in order to track the emergence of new languages, frameworks, management practices, and so on. In short, new words.


The result is a list of English words in their base form ("nominative singular", so to speak).


I worked on Windows 10 x64, using Python 3 in the Spyder editor from Anaconda 5.1.0, over a wired network connection.


In this article I build a dictionary of English words from a limited sample. If the topic turns out to be interesting, I plan to build a dictionary of both English and Russian words from the full set of Habr articles. Russian is considerably harder to handle.


Parsing process


I started from an existing parser template. The code for my version of the parser is just below.


To compile the Habr dictionary, you need to crawl its articles and extract the text from each one. I did not process the articles' meta information. Each article on Habr has its own number, as in https://habr.com/post/346198/ . Article numbers can be tried from 0 to 354366, which was the last article at the time of the project.


For each number we try to fetch the HTML page and, when that succeeds, pull the title and the text of the article out of the HTML structure. The crawling code is:


import pandas as pd
import requests
from bs4 import BeautifulSoup

rows = []
for pid in range(350000, 354366):
    r = requests.get('https://habrahabr.ru/post/' + str(pid) + '/')
    soup = BeautifulSoup(r.text, 'html5lib')
    title_tag = soup.find("span", {"class": "post__title-text"})
    text_tag = soup.find("div", {"class": "post__text"})
    # Skip numbers that have no article page behind them
    if title_tag and text_tag:
        rows.append({'id': pid, 'title': title_tag.text, 'text': text_tag.text})
# Build the frame once at the end; DataFrame.append was removed in pandas 2.0
dataset = pd.DataFrame(rows, columns=['id', 'title', 'text'])

I found empirically that there are about three times fewer actual articles than numbers. I practiced on 4,366 numbers; my system downloads that many in about half an hour.
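As a rough back-of-the-envelope check, these figures can be extrapolated to the full range of article numbers. This is only a sketch: it assumes that the download rate stays constant and that the one-in-three ratio holds over the whole range, and neither was measured.

```python
# Figures taken from the text
sample_numbers = 4366      # article numbers tried in the sample
sample_minutes = 30        # approximate download time for the sample
total_numbers = 354366     # last article number at the time of the project

# Assumption: roughly one number in three is a real article
articles_in_sample = sample_numbers / 3
seconds_per_request = sample_minutes * 60 / sample_numbers

# Assumption: the same rate holds for the whole range
full_crawl_hours = total_numbers * seconds_per_request / 3600

print(f"articles in sample: about {articles_in_sample:.0f}")
print(f"seconds per request: about {seconds_per_request:.2f}")
print(f"full crawl: about {full_crawl_hours:.1f} hours")
```

Under these assumptions a single-threaded pass over every number would take roughly 40 hours.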


I did not work on speed optimization, although people say that running the download in 100 threads makes it much faster.
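For readers who want to try the multithreaded route, here is a minimal sketch using the standard library's thread pool. It is not what I ran: the helper names fetch_page and fetch_many are hypothetical, and it uses urllib instead of requests only to stay self-contained. The speedup is not measured here, and a hundred simultaneous requests may well be throttled by the site.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_page(pid):
    """Return (pid, html) for one article number, or (pid, None) on failure."""
    url = 'https://habrahabr.ru/post/' + str(pid) + '/'
    try:
        with urlopen(url, timeout=10) as response:
            return pid, response.read().decode('utf-8', errors='replace')
    except OSError:
        # HTTP errors, timeouts and connection failures all derive from
        # OSError; every one of them is treated as "no page"
        return pid, None


def fetch_many(pids, fetch=fetch_page, max_workers=100):
    """Download many pages concurrently and keep only the successful ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as pids
        results = list(pool.map(fetch, pids))
    return {pid: html for pid, html in results if html is not None}
```

A call such as fetch_many(range(350000, 354366)) would return a dictionary from article number to raw HTML, which can then be parsed with BeautifulSoup exactly as in the single-threaded loop.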


I saved the result to disk so as not to repeat the slow download from the Internet (directory is a variable holding the path to the output folder; it has to be defined beforehand):


dataset.to_excel(directory+'dataset.xlsx', sheet_name='sheet1', index=False)

The file came to about 10 megabytes.


I was interested in the English names of tools. I did not need terms in their various forms; I wanted normalized forms of the words right away. Clearly, words such as "in" and "on" occur most often, so we remove them. To normalize the dictionary, I used the English Porter stemmer from the nltk library.
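The filtering part of this step can be illustrated with the standard library alone. The sketch below is only an illustration: the function name clean_words and the tiny stop list are hypothetical (the real stop list from nltk is much longer), and stemming is left out because it needs nltk.

```python
import re

# Hypothetical, tiny stop list for illustration only
STOP_WORDS = {'in', 'on', 'the', 'a', 'of', 'and', 'to', 'is'}


def clean_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep Latin-letter words, drop stop words."""
    # Every character that is not a Latin letter becomes a space,
    # so Cyrillic text, digits and punctuation all disappear
    latin_only = re.sub('[^a-zA-Z]', ' ', text)
    return [w for w in latin_only.lower().split() if w not in stop_words]


print(clean_words('Пишем парсер на Python 3 в Spyder: pandas and requests'))
# prints ['python', 'spyder', 'pandas', 'requests']
```

Replacing everything outside a-z and A-Z with spaces is what makes the dictionary English-only: the Russian words never reach the stemmer.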


To build the word list itself I used a slightly roundabout method; see the code starting from "from sklearn.feature_extraction.text import CountVectorizer". I will need this later.


import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build the stop-word set and the stemmer once, not once per word
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
corpus = []
for text in dataset['text']:
    # Keep Latin letters only, so Russian text, digits and punctuation drop out
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    stems = [ps.stem(word) for word in words if word not in stop_words]
    corpus.append(' '.join(stems))
from sklearn.feature_extraction.text import CountVectorizer
# Count stems per article; the column names of the matrix form the dictionary
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
names = cv.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
dfnames = pd.DataFrame(names).transpose()
dfnames.to_excel(directory+'names.xlsx', sheet_name='sheet1', index=False)

The names object is the desired dictionary. We saved it to disk.


Results Review


The result was more than 30 thousand normalized words. And that is from a range of only 4,366 article numbers, counting English words only.


Some interesting observations:


  1. The authors of the articles use many strange "words", for example: aaaaaaaaaaa, aaaabbbbccccdddd or zzzhoditqxfpqbcwr


  2. From the object X we get the top 10 most frequent English words in our sample:

Word      Count
iter      4133
op        4030
return    2866
ns        2834
id        2740
name      2556
new       2410
data      2381
string    2358
http      2304
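The article does not show how the top 10 was pulled out of X, so here is one way such a ranking could be obtained. With the array X, the per-word totals are the column sums X.sum(axis=0), paired with names. The sketch below computes the same totals straight from the corpus strings using only the standard library; the function name top_words and the miniature corpus are hypothetical.

```python
from collections import Counter


def top_words(corpus, n=10):
    """Return the n most frequent words across all documents in corpus."""
    counts = Counter()
    for document in corpus:
        # Assumption: mimic CountVectorizer's default tokenizer,
        # which ignores one-character tokens
        counts.update(w for w in document.split() if len(w) > 1)
    return counts.most_common(n)


# Hypothetical miniature corpus, not the real data
sample_corpus = ['return data id', 'data string return data', 'x data']
print(top_words(sample_corpus, n=2))
# prints [('data', 4), ('return', 2)]
```

On the real data, top_words(corpus) would be called on the corpus list built during stemming.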