Analyzing market requirements for data scientists

First of all, let's formulate the task and draw up a plan.
Task:
Look through all the vacancies on the market and find out the common requirements listed in them.
Plan:
1. Collect all vacancies matching the query "Data Scientist" in a format convenient for processing;
2. Find the words and phrases that appear most often in the descriptions.
The implementation requires a little knowledge of SQL and Python.
Data collection
Source: hh.ru
At first I thought I would have to scrape the site. Fortunately, it turned out that hh.ru has an API.
To begin with, let's write a function that fetches the list of vacancy ids to analyze. It takes the search text (here we will pass 'Data Scientist') and the search area (as defined in the API documentation) as parameters, and returns a list of ids. To fetch the data we use the vacancy search endpoint of the API:
import json

import requests

def get_list_id_vacancies(area, text):
    url_list = 'https://api.hh.ru/vacancies'
    list_id = []
    params = {'text': text, 'area': area}
    r = requests.get(url_list, params=params)
    found = json.loads(r.text)['found']  # total number of vacancies found
    if found <= 500:  # the API returns at most 500 vacancies per page; if fewer were found, get them all at once
        params['per_page'] = found
        r = requests.get(url_list, params=params)
        data = json.loads(r.text)['items']
        for vac in data:
            list_id.append(vac['id'])
    else:
        i = 0
        while i <= 3:  # otherwise "flip" through pages 0 to 3 and collect the vacancies page by page;
                       # the API won't return more than 2000 vacancies, hence the hardcoded 3
            params['per_page'] = 500
            params['page'] = i
            r = requests.get(url_list, params=params)
            if 200 != r.status_code:
                break
            data = json.loads(r.text)['items']
            for vac in data:
                list_id.append(vac['id'])
            i += 1
    return list_id
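As a quick sanity check (area id 1 corresponds to Moscow, as the sample API response further below shows), a call might look like this:
ids = get_list_id_vacancies(1, 'Data Scientist')
print(len(ids), ids[:5])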
For debugging, I sent requests directly to the API. I recommend using the Postman Chrome app for this.
After that, you need to get detailed information about each vacancy:
def get_vacancy(id):
    url_vac = 'https://api.hh.ru/vacancies/%s'
    r = requests.get(url_vac % id)
    return json.loads(r.text)
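For example, fetching the vacancy that we will dissect below:
vac = get_vacancy(22285538)
print(vac['name'])  # "Математик/ Data scientist"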
Now we have a list of vacancy ids and a function that fetches detailed information about each vacancy. Next we need to decide where to store the received data. I had two options: save everything to a CSV file or create a database. Since it is easier for me to write SQL queries than to analyze things in Excel, I chose the database. First we need to create the database and the tables we will write into. To do this, we look at what the API returns and decide which fields we need.
Paste the API link, for example api.hh.ru/vacancies/22285538, into Postman, send a GET request and get the response:
{ "alternate_url": "https://hh.ru/vacancy/22285538", "code": null, "premium": false, "description": "
Мы занимаемся....", "schedule": { "id": "fullDay", "name": "Полный день" }, "suitable_resumes_url": null, "site": { "id": "hh", "name": "hh.ru" }, "billing_type": { "id": "standard_plus", "name": "Стандарт+" }, "published_at": "2017-09-05T11:43:08+0300", "test": null, "accept_handicapped": true, "experience": { "id": "noExperience", "name": "Нет опыта" }, "address": { "building": "36с7", "city": "Москва", "description": null, "metro": { "line_name": "Калининская", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": "Парк Победы", "lng": 37.514401 }, "metro_stations": [ { "line_name": "Калининская", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": "Парк Победы", "lng": 37.514401 } ], "raw": null, "street": "Кутузовский проспект", "lat": 55.739068, "lng": 37.525432 }, "key_skills": [ { "name": "Математическое моделирование" }, { "name": "Анализ рисков" } ], "allow_messages": true, "employment": { "id": "full", "name": "Полная занятость" }, "id": "22285538", "response_url": null, "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" }, "archived": false, "name": "Математик/ Data scientist", "contacts": null, "employer": { "logo_urls": { "90": "https://hhcdn.ru/employer-logo/1680554.png", "240": "https://hhcdn.ru/employer-logo/1680555.png", "original": "https://hhcdn.ru/employer-logo-original/309546.png" }, "vacancies_url": "https://api.hh.ru/vacancies?employer_id=1475513", "name": "Аналитическое агентство Скориста", "url": "https://api.hh.ru/employers/1475513", "alternate_url": "https://hh.ru/employer/1475513", "id": "1475513", "trusted": true }, "created_at": "2017-09-05T11:43:08+0300", "area": { "url": "https://api.hh.ru/areas/1", "id": "1", "name": "Москва" }, "relations": [], "accept_kids": false, "response_letter_required": false, "apply_alternate_url": "https://hh.ru/applicant/vacancy_response?vacancyId=22285538", "quick_responses_allowed": false, "negotiations_url": null, "department": null, "branded_description": null, "hidden": false, "type": { "id": "open", "name": "Открытая" }, "specializations": [ { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.91", "name": "Информатика, Информационные системы" }, { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.141", "name": "Математика" }] }
Everything we do not plan to analyze is removed from the JSON:
{ "description": "
Мы занимаемся....", "schedule": { "id": "fullDay", "name": "Полный день" }, "accept_handicapped": true, "experience": { "id": "noExperience", "name": "Нет опыта" }, "key_skills": [ { "name": "Математическое моделирование" }, { "name": "Анализ рисков" } ], "employment": { "id": "full", "name": "Полная занятость" }, "id": "22285538", "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" }, "name": "Математик/ Data scientist", "employer": { "name": "Аналитическое агентство Скориста", }, "area": { "name": "Москва" }, "specializations": [ { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.91", "name": "Информатика, Информационные системы" }, { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.141", "name": "Математика" }] }
Based on this JSON we create the database. It's not difficult, so I won't dwell on it :) A minimal sketch of what the schema might look like is below; the table and column names come from the INSERT statements further down, while the column types are assumptions:
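CREATE TABLE vacancies (
    id INT PRIMARY KEY,
    name_v VARCHAR(255),
    description TEXT,
    code_hh VARCHAR(64),
    accept_handicapped BOOLEAN,
    area_v VARCHAR(255),
    employer VARCHAR(255),
    employment VARCHAR(255),
    experience VARCHAR(255),
    salary_currency VARCHAR(3),
    salary_from INT,
    salary_gross BOOLEAN,
    salary_to INT,
    schedule_d VARCHAR(255),
    text_search VARCHAR(255)
);

CREATE TABLE key_skills (
    vacancy_id INT,
    name VARCHAR(255)
);

CREATE TABLE specializations (
    vacancy_id INT,
    name VARCHAR(255),
    profarea_name VARCHAR(255)
);

With the schema in place, we implement a module for interacting with the database. I used MySQL: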
import pymysql

def get_salary(vac):
    # The salary is not always filled in, so accessing its fields directly can raise an error.
    # This helper returns a dict of None values when the data is empty.
    if vac['salary'] is None:
        return {'currency': None, 'from': None, 'to': None, 'gross': None}
    else:
        return {'currency': vac['salary']['currency'],
                'from': vac['salary']['from'],
                'to': vac['salary']['to'],
                'gross': vac['salary']['gross']}

def get_connection():
    conn = pymysql.connect(host='localhost', port=3306, user='root', password='-', db='hh', charset="utf8")
    return conn

def close_connection(conn):
    conn.commit()
    conn.close()

def insert_vac(conn, vac, text):
    a = conn.cursor()
    salary = get_salary(vac)
    print(vac['id'])
    a.execute("INSERT INTO vacancies (id, name_v, description, code_hh, accept_handicapped, \
               area_v, employer, employment, experience, salary_currency, salary_from, salary_gross, \
               salary_to, schedule_d, text_search) \
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
              (vac['id'], vac['name'], vac['description'],
               vac['code'], vac['accept_handicapped'], vac['area']['name'],
               vac['employer']['name'],
               vac['employment']['name'], vac['experience']['name'], salary['currency'],
               salary['from'], salary['gross'],
               salary['to'], vac['schedule']['name'], text))
    for key_skill in vac['key_skills']:
        a.execute("INSERT INTO key_skills(vacancy_id, name) VALUES(%s, %s)", (vac['id'], key_skill['name']))
    for spec in vac['specializations']:
        a.execute("INSERT INTO specializations(vacancy_id, name, profarea_name) VALUES(%s, %s, %s)",
                  (vac['id'], spec['name'], spec['profarea_name']))
    a.close()
Now we put everything together by adding the main code to the file:
area = 1  # search area id: 1 is Moscow, per the API documentation
text_search = 'data scientist'
list_id_vacs = get_list_id_vacancies(area, text_search)
vacs = []
for vac_id in list_id_vacs:
    vacs.append(get_vacancy(vac_id))
conn = get_connection()
for vac in vacs:
    insert_vac(conn, vac, text_search)
close_connection(conn)
By changing the text_search and area variables we can collect different vacancies from different regions.
That completes the data collection; now we move on to the interesting part.
Text analysis
The main inspiration was an article on finding popular phrases in the TV series How I Met Your Mother.
First, we fetch the descriptions of all the vacancies from the database:
def get_vac_descriptions(conn, text_search):
    a = conn.cursor()
    a.execute("SELECT description FROM vacancies WHERE text_search = %s", (text_search,))
    descriptions = a.fetchall()  # a sequence of 1-tuples, hence description[0] below
    a.close()
    return descriptions
To work with the text we will use the nltk package. By analogy with the article mentioned above, we write a function that extracts popular phrases from the text (it also takes an optional list of stop words, which we will need a bit later):
import string
from collections import Counter

import nltk

def get_popular_phrase(text, phrase_len, count_phrases, stopwords=None):
    phrase_counter = Counter()
    words = nltk.word_tokenize(text.lower())
    for phrase in nltk.ngrams(words, phrase_len):
        if all(word not in string.punctuation for word in phrase) \
                and (stopwords is None or all(word not in stopwords for word in phrase)):
            phrase_counter[phrase] += 1
    return phrase_counter.most_common(count_phrases)
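As a quick illustration of what nltk.ngrams yields (a hypothetical three-token input; each n-gram is a tuple of words):
print(list(nltk.ngrams(['machine', 'learning', 'experience'], 2)))
# [('machine', 'learning'), ('learning', 'experience')]
And here is the function applied to the collected descriptions: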
descriptions = get_vac_descriptions(get_connection(), 'data scientist')
text = ''
for description in descriptions:
    text = text + description[0]
result = get_popular_phrase(text, 1, 20)
for r in result:
    print(" ".join(r[0]) + " - " + str(r[1]))
We combine all the methods described above in a main() method and run it:
def main():
    descriptions = get_vac_descriptions(get_connection(), 'data scientist')
    text = ''
    for description in descriptions:
        text = text + description[0]
    result = get_popular_phrase(text, 1, 20)
    for r in result:
        print(" ".join(r[0]) + " - " + str(r[1]))

main()
We execute and see:
li - 2459
/ li - 2459
and - 1297
p - 1225
/ p - 1224
in - 874
strong - 639
/ strong - 620
and - 486
ul - 457
/ ul - 457
s - 415
on - 341
data - 329
data - 313
the - 308
experience - 275
of - 269
for - 254
jobs - 233
The result includes many words that are common to all vacancies, plus HTML tags used in the descriptions. Let's remove these words from the analysis. For that we need a list of stop words, which we will build automatically by analyzing vacancies from completely different fields. I chose "cook", "cleaning lady" and "locksmith".
Let's go back to the beginning and collect the vacancies for these queries. After that, we add a function that builds the stop-word list.
def get_stopwords():
    descriptions = get_vac_descriptions(get_connection(), 'повар') \
                   + get_vac_descriptions(get_connection(), 'уборщица') \
                   + get_vac_descriptions(get_connection(), 'слесарь')
    text = ''
    for description in descriptions:
        text = text + description[0]
    stopwords = []
    top_words = get_popular_phrase(text, 1, 200)  # 200 is the size of the stop-word list
    for i in top_words:
        stopwords.append(i[0][0])
    return stopwords
We also see the English words "the" and "of" in the output. Rather than handle them separately, we take the easy route and simply drop the vacancies written in English, using detect from the langdetect package (from langdetect import detect).
Make the changes in main():
stopwords = get_stopwords()
for description in descriptions:
    if detect(description[0]) != 'en':
        text = text + description[0]
result = get_popular_phrase(text, 1, 20, stopwords)
Now the result looks like this:
data - 329
data - 180
analysis - 157
training - 134
machine - 129
models - 128
areas - 101
algorithms - 87
python - 86
tasks - 82
tasks - 82
development - 77
analysis - 73
construction - 68
methods - 66
will be - 65
statistics - 56
higher - 55
knowledge - 53
learning - 52
Single words, though, don't always tell the whole story. Let's see what two-word combinations show:
machine learning - 119
data analysis - 56
machine learning - 44
data science - 38
data scientist - 38
big data - 34
mathematical models - 34
data mining - 28
machine algorithms - 27
mathematical statistics - 23
will be a plus - 21
statistical analysis - 20
data processing - 18
English - 17
data analysis - 17
including - 17
and also - 17
machine methods - 16
areas of analysis - 15
probability theory - 14
The results of the analysis.
The two-word query is more revealing; here is what we need to know:
- Machine learning
- Mathematical models
- Data mining
- Python
- Mathematical statistics
- English
- Probability theory
Nothing new, but it was fun :)
Conclusions.
This is far from an ideal solution.
Mistakes:
1. Vacancies in English should not be excluded; they should be translated instead.
2. Not all stop words are excluded.
3. All words should be reduced to their base form ("models" -> "model", "analyses" -> "analysis", etc.); a minimal stemming sketch follows this list.
4. We need a method for deriving a better stop-word list, one that answers the questions "why 200?" and "why a cleaning lady?".
5. We need to figure out how to analyze the results automatically, i.e. to determine whether a meaningful unit is one word, two words, or more.
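On point 3, a minimal sketch of what such normalization might look like, using NLTK's Russian Snowball stemmer (the word list here is hypothetical; a proper lemmatizer such as pymorphy2 would be more accurate than stemming):

from nltk.stem.snowball import SnowballStemmer

# Reduce each word to its stem before counting, so that different case forms
# of the same Russian word fall into the same counter bucket.
stemmer = SnowballStemmer("russian")
words = ["анализ", "анализа", "модели", "моделей"]
print([stemmer.stem(w) for w in words])  # inflected forms collapse to shared stems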