Analyzing market requirements for data scientists

First of all, let's formulate the task and draw up a plan.
Task:
Look through all the vacancies on the market and find out the common requirements listed in them.
Plan:
1. Collect all vacancies matching the query "Data Scientist" in a format convenient for processing;
2. Find the words and phrases that appear most often in the descriptions.
The implementation requires a little knowledge of SQL and Python.
Data collection
Source: hh.ru
At first I thought I would have to scrape the site. Fortunately, it turned out that hh.ru has an API.
To begin with, let's write a function that fetches the list of vacancy ids to analyze. It takes the search text (here we will pass 'Data Scientist') and the search area (as defined in the API documentation) as parameters, and returns a list of ids. To fetch the data we use the vacancy search endpoint of the API:
import json

import requests

def get_list_id_vacancies(area, text):
    url_list = 'https://api.hh.ru/vacancies'
    list_id = []
    params = {'text': text, 'area': area}
    r = requests.get(url_list, params=params)
    found = json.loads(r.text)['found']  # total number of vacancies found
    if found <= 500:  # the API returns at most 500 vacancies per page; if fewer were found, get them all at once
        params['per_page'] = found
        r = requests.get(url_list, params=params)
        data = json.loads(r.text)['items']
        for vac in data:
            list_id.append(vac['id'])
    else:
        i = 0
        while i <= 3:  # otherwise "flip" through pages 0 to 3 and collect the vacancies page by page;
                       # the API won't return more than 2000 vacancies, hence the hardcoded 3
            params['per_page'] = 500
            params['page'] = i
            r = requests.get(url_list, params=params)
            if 200 != r.status_code:
                break
            data = json.loads(r.text)['items']
            for vac in data:
                list_id.append(vac['id'])
            i += 1
    return list_id
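As a quick sanity check (area id 1 corresponds to Moscow, as the sample API response further below shows), a call might look like this:
ids = get_list_id_vacancies(1, 'Data Scientist')
print(len(ids), ids[:5])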
For debugging, I sent requests directly to the API. I recommend using the Postman Chrome app for this.
After that, you need to get detailed information about each vacancy:
def get_vacancy(id):
    url_vac = 'https://api.hh.ru/vacancies/%s'
    r = requests.get(url_vac % id)
    return json.loads(r.text)
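For example, fetching the vacancy that we will dissect below:
vac = get_vacancy(22285538)
print(vac['name'])  # "Математик/ Data scientist"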
Now we have a list of vacancy ids and a function that fetches detailed information about each vacancy. Next we need to decide where to store the received data. I had two options: save everything to a CSV file or create a database. Since it is easier for me to write SQL queries than to analyze things in Excel, I chose the database. First we need to create the database and the tables we will write into. To do this, we look at what the API returns and decide which fields we need.
Paste the API link, for example api.hh.ru/vacancies/22285538, into Postman, send a GET request and get the response:
{ "alternate_url": "https://hh.ru/vacancy/22285538", "code": null, "premium": false, "description": "
Мы занимаемся....", "schedule": { "id": "fullDay", "name": "Полный день" }, "suitable_resumes_url": null, "site": { "id": "hh", "name": "hh.ru" }, "billing_type": { "id": "standard_plus", "name": "Стандарт+" }, "published_at": "2017-09-05T11:43:08+0300", "test": null, "accept_handicapped": true, "experience": { "id": "noExperience", "name": "Нет опыта" }, "address": { "building": "36с7", "city": "Москва", "description": null, "metro": { "line_name": "Калининская", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": "Парк Победы", "lng": 37.514401 }, "metro_stations": [ { "line_name": "Калининская", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": "Парк Победы", "lng": 37.514401 } ], "raw": null, "street": "Кутузовский проспект", "lat": 55.739068, "lng": 37.525432 }, "key_skills": [ { "name": "Математическое моделирование" }, { "name": "Анализ рисков" } ], "allow_messages": true, "employment": { "id": "full", "name": "Полная занятость" }, "id": "22285538", "response_url": null, "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" }, "archived": false, "name": "Математик/ Data scientist", "contacts": null, "employer": { "logo_urls": { "90": "https://hhcdn.ru/employer-logo/1680554.png", "240": "https://hhcdn.ru/employer-logo/1680555.png", "original": "https://hhcdn.ru/employer-logo-original/309546.png" }, "vacancies_url": "https://api.hh.ru/vacancies?employer_id=1475513", "name": "Аналитическое агентство Скориста", "url": "https://api.hh.ru/employers/1475513", "alternate_url": "https://hh.ru/employer/1475513", "id": "1475513", "trusted": true }, "created_at": "2017-09-05T11:43:08+0300", "area": { "url": "https://api.hh.ru/areas/1", "id": "1", "name": "Москва" }, "relations": [], "accept_kids": false, "response_letter_required": false, "apply_alternate_url": "https://hh.ru/applicant/vacancy_response?vacancyId=22285538", "quick_responses_allowed": false, "negotiations_url": null, "department": null, "branded_description": null, "hidden": false, "type": { "id": "open", "name": "Открытая" }, "specializations": [ { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.91", "name": "Информатика, Информационные системы" }, { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.141", "name": "Математика" }] }
Everything we do not plan to analyze is removed from the JSON:
{ "description": "
Мы занимаемся....", "schedule": { "id": "fullDay", "name": "Полный день" }, "accept_handicapped": true, "experience": { "id": "noExperience", "name": "Нет опыта" }, "key_skills": [ { "name": "Математическое моделирование" }, { "name": "Анализ рисков" } ], "employment": { "id": "full", "name": "Полная занятость" }, "id": "22285538", "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" }, "name": "Математик/ Data scientist", "employer": { "name": "Аналитическое агентство Скориста", }, "area": { "name": "Москва" }, "specializations": [ { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.91", "name": "Информатика, Информационные системы" }, { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.141", "name": "Математика" }] }
Based on this JSON we create the database. It's not difficult, so I won't dwell on it :) A minimal sketch of what the schema might look like is below; the table and column names come from the INSERT statements further down, while the column types are assumptions:
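CREATE TABLE vacancies (
    id INT PRIMARY KEY,
    name_v VARCHAR(255),
    description TEXT,
    code_hh VARCHAR(64),
    accept_handicapped BOOLEAN,
    area_v VARCHAR(255),
    employer VARCHAR(255),
    employment VARCHAR(255),
    experience VARCHAR(255),
    salary_currency VARCHAR(3),
    salary_from INT,
    salary_gross BOOLEAN,
    salary_to INT,
    schedule_d VARCHAR(255),
    text_search VARCHAR(255)
);

CREATE TABLE key_skills (
    vacancy_id INT,
    name VARCHAR(255)
);

CREATE TABLE specializations (
    vacancy_id INT,
    name VARCHAR(255),
    profarea_name VARCHAR(255)
);

With the schema in place, we implement a module for interacting with the database. I used MySQL: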
import pymysql

def get_salary(vac):
    # The salary is not always filled in, so accessing its fields directly can raise an error.
    # This helper returns a dict of None values when the data is empty.
    if vac['salary'] is None:
        return {'currency': None, 'from': None, 'to': None, 'gross': None}
    else:
        return {'currency': vac['salary']['currency'],
                'from': vac['salary']['from'],
                'to': vac['salary']['to'],
                'gross': vac['salary']['gross']}

def get_connection():
    conn = pymysql.connect(host='localhost', port=3306, user='root', password='-', db='hh', charset="utf8")
    return conn

def close_connection(conn):
    conn.commit()
    conn.close()

def insert_vac(conn, vac, text):
    a = conn.cursor()
    salary = get_salary(vac)
    print(vac['id'])
    a.execute("INSERT INTO vacancies (id, name_v, description, code_hh, accept_handicapped, \
               area_v, employer, employment, experience, salary_currency, salary_from, salary_gross, \
               salary_to, schedule_d, text_search) \
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
              (vac['id'], vac['name'], vac['description'],
               vac['code'], vac['accept_handicapped'], vac['area']['name'],
               vac['employer']['name'],
               vac['employment']['name'], vac['experience']['name'], salary['currency'],
               salary['from'], salary['gross'],
               salary['to'], vac['schedule']['name'], text))
    for key_skill in vac['key_skills']:
        a.execute("INSERT INTO key_skills(vacancy_id, name) VALUES(%s, %s)", (vac['id'], key_skill['name']))
    for spec in vac['specializations']:
        a.execute("INSERT INTO specializations(vacancy_id, name, profarea_name) VALUES(%s, %s, %s)",
                  (vac['id'], spec['name'], spec['profarea_name']))
    a.close()
Now we put everything together by adding the main code to the file:
area = 1  # search area id: 1 is Moscow, per the API documentation
text_search = 'data scientist'
list_id_vacs = get_list_id_vacancies(area, text_search)
vacs = []
for vac_id in list_id_vacs:
    vacs.append(get_vacancy(vac_id))
conn = get_connection()
for vac in vacs:
    insert_vac(conn, vac, text_search)
close_connection(conn)
By changing the text_search and area variables we can collect different vacancies from different regions.
That completes the data collection; now we move on to the interesting part.
Text analysis
The main inspiration was an article on finding popular phrases in the TV series How I Met Your Mother.
First, we fetch the descriptions of all the vacancies from the database:
def get_vac_descriptions(conn, text_search):
    a = conn.cursor()
    a.execute("SELECT description FROM vacancies WHERE text_search = %s", (text_search,))
    descriptions = a.fetchall()  # a sequence of 1-tuples, hence description[0] below
    a.close()
    return descriptions
To work with the text we will use the nltk package. By analogy with the article mentioned above, we write a function that extracts popular phrases from the text (it also takes an optional list of stop words, which we will need a bit later):
import string
from collections import Counter

import nltk

def get_popular_phrase(text, phrase_len, count_phrases, stopwords=None):
    phrase_counter = Counter()
    words = nltk.word_tokenize(text.lower())
    for phrase in nltk.ngrams(words, phrase_len):
        if all(word not in string.punctuation for word in phrase) \
                and (stopwords is None or all(word not in stopwords for word in phrase)):
            phrase_counter[phrase] += 1
    return phrase_counter.most_common(count_phrases)
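As a quick illustration of what nltk.ngrams yields (a hypothetical three-token input; each n-gram is a tuple of words):
print(list(nltk.ngrams(['machine', 'learning', 'experience'], 2)))
# [('machine', 'learning'), ('learning', 'experience')]
And here is the function applied to the collected descriptions: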
descriptions = get_vac_descriptions(get_connection(), 'data scientist')
text = ''
for description in descriptions:
    text = text + description[0]
result = get_popular_phrase(text, 1, 20)
for r in result:
    print(" ".join(r[0]) + " - " + str(r[1]))
We combine all the methods described above in a main() method and run it:
def main():
    descriptions = get_vac_descriptions(get_connection(), 'data scientist')
    text = ''
    for description in descriptions:
        text = text + description[0]
    result = get_popular_phrase(text, 1, 20)
    for r in result:
        print(" ".join(r[0]) + " - " + str(r[1]))

main()
We execute and see:
li - 2459
/ li - 2459
and - 1297
p - 1225
/ p - 1224
in - 874
strong - 639
/ strong - 620
and - 486
ul - 457
/ ul - 457
s - 415
on - 341
data - 329
data - 313
the - 308
experience - 275
of - 269
for - 254
jobs - 233
The result includes many words that are common to all vacancies, plus HTML tags used in the descriptions. Let's remove these words from the analysis. For that we need a list of stop words, which we will build automatically by analyzing vacancies from completely different fields. I chose "cook", "cleaning lady" and "locksmith".
Let's go back to the beginning and collect the vacancies for these queries. After that, we add a function that builds the stop-word list.
def get_stopwords():
    descriptions = get_vac_descriptions(get_connection(), 'повар') \
                   + get_vac_descriptions(get_connection(), 'уборщица') \
                   + get_vac_descriptions(get_connection(), 'слесарь')
    text = ''
    for description in descriptions:
        text = text + description[0]
    stopwords = []
    top_words = get_popular_phrase(text, 1, 200)  # 200 is the size of the stop-word list
    for i in top_words:
        stopwords.append(i[0][0])
    return stopwords
We also see the English words "the" and "of" in the output. Rather than handle them separately, we take the easy route and simply drop the vacancies written in English, using detect from the langdetect package (from langdetect import detect).
Make the changes in main():
stopwords = get_stopwords()
for description in descriptions:
    if detect(description[0]) != 'en':
        text = text + description[0]
result = get_popular_phrase(text, 1, 20, stopwords)
Now the result looks like this:
data - 329
data - 180
analysis - 157
training - 134
machine - 129
models - 128
areas - 101
algorithms - 87
python - 86
tasks - 82
tasks - 82
development - 77
analysis - 73
construction - 68
methods - 66
will be - 65
statistics - 56
higher - 55
knowledge - 53
learning - 52
Single words, though, don't always tell the whole story. Let's see what two-word combinations show:
machine learning - 119
data analysis - 56
machine learning - 44
data science - 38
data scientist - 38
big data - 34
mathematical models - 34
data mining - 28
machine algorithms - 27
mathematical statistics - 23
will be a plus - 21
statistical analysis - 20
data processing - 18
English - 17
data analysis - 17
including - 17
and also - 17
machine methods - 16
areas of analysis - 15
probability theory - 14
The results of the analysis.
The two-word query is more revealing; here is what we need to know:
- Machine learning
- Mathematical models
- Data mining
- Python
- Mathematical statistics
- English
- Probability theory
Nothing new, but it was fun :)
Conclusions.
This is far from an ideal solution.
Mistakes:
1. Vacancies in English should not be excluded; they should be translated instead.
2. Not all stop words are excluded.
3. All words should be reduced to their base form ("models" -> "model", "analyses" -> "analysis", etc.); a minimal stemming sketch follows this list.
4. We need a method for deriving a better stop-word list, one that answers the questions "why 200?" and "why a cleaning lady?".
5. We need to figure out how to analyze the results automatically, i.e. to determine whether a meaningful unit is one word, two words, or more.
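On point 3, a minimal sketch of what such normalization might look like, using NLTK's Russian Snowball stemmer (the word list here is hypothetical; a proper lemmatizer such as pymorphy2 would be more accurate than stemming):

from nltk.stem.snowball import SnowballStemmer

# Reduce each word to its stem before counting, so that different case forms
# of the same Russian word fall into the same counter bucket.
stemmer = SnowballStemmer("russian")
words = ["анализ", "анализа", "модели", "моделей"]
print([stemmer.stem(w) for w in words])  # inflected forms collapse to shared stems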