Analyzing the market's requirements for data scientists

    There is a lot of information on the Internet about what a data scientist needs to know and be able to do. But I decided I wanted to become a data scientist right away, so we will find out the requirements for specialists by analyzing the text of actual job vacancies.


    First of all, we formulate the task and develop a plan:

    Task:

    Review all the vacancies on the market and find out the general requirements indicated in them.

    Plan:

    1. Collect all vacancies matching the query "Data Scientist" in a format convenient for processing;
    2. Find the words and phrases that appear most often in their descriptions.

    The implementation requires a little knowledge of SQL and Python.

    If you don't have it yet, start here: for SQL I recommend sqlbolt.com, and for Python the SoloLearn mobile application (Google Play and App Store).

    Data collection


    Source: hh.ru
    At first I thought I would have to scrape the site. Fortunately, I found that hh.ru has an API.

    To begin with, we write a function that retrieves the list of vacancy ids for analysis. The function takes the search text (here we will pass 'Data Scientist') and the search area (see the API documentation) as parameters, and returns a list of ids. To get the data, we use the vacancy search API method:

    Here is the code
    import json
    import requests

    def get_list_id_vacancies(area, text):
        url_list = 'https://api.hh.ru/vacancies'
        list_id = []
        params = {'text': text, 'area': area}
        r = requests.get(url_list, params=params)
        found = json.loads(r.text)['found']  # total number of vacancies found
        if found <= 500:  # the API returns at most 500 vacancies per page; if fewer were found, get them all at once
            params['per_page'] = found
            r = requests.get(url_list, params=params)
            data = json.loads(r.text)['items']
            for vac in data:
                list_id.append(vac['id'])
        else:
            i = 0
            while i <= 3:  # if more than 500, "flip" through pages 0 to 3 and collect the vacancies page by page. The API won't return more than 2000 vacancies, hence the hardcoded 3.
                params['per_page'] = 500
                params['page'] = i
                r = requests.get(url_list, params=params)
                if r.status_code != 200:
                    break
                data = json.loads(r.text)['items']
                for vac in data:
                    list_id.append(vac['id'])
                i += 1
        return list_id
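
    A quick check of the function might look like this (my example, not from the article; area 1 corresponds to Moscow in the hh.ru area dictionary, as the example response further below also shows):

    # a quick smoke test (illustrative; area=1 is Moscow in the hh.ru area dictionary)
    ids = get_list_id_vacancies(1, 'Data Scientist')
    print(len(ids))   # how many vacancy ids were collected
    print(ids[:5])    # a first look at the ids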
    


    For debugging, I sent requests directly to the API (e.g. a GET request to api.hh.ru/vacancies?text=Data+Scientist&area=1). I recommend the Postman Chrome app for this.

    After that, you need to get detailed information about each vacancy:

    Here is the code
    def get_vacancy(id):
        # fetch the full JSON description of a single vacancy by its id
        url_vac = 'https://api.hh.ru/vacancies/%s'
        r = requests.get(url_vac % id)
        return json.loads(r.text)
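
    To make sure it works, we can fetch the vacancy whose JSON is shown a bit further below:

    # quick check using the vacancy id from the example response below
    vac = get_vacancy(22285538)
    print(vac['name'])  # 'Математик/ Data scientist' in the example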



    Now we have a list of vacancy ids and a function that retrieves detailed information about each vacancy. Next we need to decide where to store the received data. I had two options: save everything to a csv file or create a database. Since it is easier for me to write SQL queries than to analyze data in Excel, I chose a database. First we need to create the database and the tables we will write to. To do this, we examine what the API returns and decide which fields we need.

    Paste the API link into Postman, for example api.hh.ru/vacancies/22285538, make a GET request and receive the response:

    Full json
    {
        "alternate_url": "https://hh.ru/vacancy/22285538",
        "code": null,
        "premium": false,
        "description": "Мы занимаемся....",
        "schedule": { "id": "fullDay", "name": "Полный день" },
        "suitable_resumes_url": null,
        "site": { "id": "hh", "name": "hh.ru" },
        "billing_type": { "id": "standard_plus", "name": "Стандарт+" },
        "published_at": "2017-09-05T11:43:08+0300",
        "test": null,
        "accept_handicapped": true,
        "experience": { "id": "noExperience", "name": "Нет опыта" },
        "address": {
            "building": "36с7",
            "city": "Москва",
            "description": null,
            "metro": { "line_name": "Калининская", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": "Парк Победы", "lng": 37.514401 },
            "metro_stations": [ { "line_name": "Калининская", "station_id": "8.470", "line_id": "8", "lat": 55.736478, "station_name": "Парк Победы", "lng": 37.514401 } ],
            "raw": null,
            "street": "Кутузовский проспект",
            "lat": 55.739068,
            "lng": 37.525432
        },
        "key_skills": [ { "name": "Математическое моделирование" }, { "name": "Анализ рисков" } ],
        "allow_messages": true,
        "employment": { "id": "full", "name": "Полная занятость" },
        "id": "22285538",
        "response_url": null,
        "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" },
        "archived": false,
        "name": "Математик/ Data scientist",
        "contacts": null,
        "employer": {
            "logo_urls": {
                "90": "https://hhcdn.ru/employer-logo/1680554.png",
                "240": "https://hhcdn.ru/employer-logo/1680555.png",
                "original": "https://hhcdn.ru/employer-logo-original/309546.png"
            },
            "vacancies_url": "https://api.hh.ru/vacancies?employer_id=1475513",
            "name": "Аналитическое агентство Скориста",
            "url": "https://api.hh.ru/employers/1475513",
            "alternate_url": "https://hh.ru/employer/1475513",
            "id": "1475513",
            "trusted": true
        },
        "created_at": "2017-09-05T11:43:08+0300",
        "area": { "url": "https://api.hh.ru/areas/1", "id": "1", "name": "Москва" },
        "relations": [],
        "accept_kids": false,
        "response_letter_required": false,
        "apply_alternate_url": "https://hh.ru/applicant/vacancy_response?vacancyId=22285538",
        "quick_responses_allowed": false,
        "negotiations_url": null,
        "department": null,
        "branded_description": null,
        "hidden": false,
        "type": { "id": "open", "name": "Открытая" },
        "specializations": [
            { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.91", "name": "Информатика, Информационные системы" },
            { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.141", "name": "Математика" }
        ]
    }



    We remove from the JSON everything that we do not plan to analyze.

    JSON with just the right fields
    {
        "description": "Мы занимаемся....",
        "schedule": { "id": "fullDay", "name": "Полный день" },
        "accept_handicapped": true,
        "experience": { "id": "noExperience", "name": "Нет опыта" },
        "key_skills": [ { "name": "Математическое моделирование" }, { "name": "Анализ рисков" } ],
        "employment": { "id": "full", "name": "Полная занятость" },
        "id": "22285538",
        "salary": { "to": 90000, "gross": false, "from": 50000, "currency": "RUR" },
        "name": "Математик/ Data scientist",
        "employer": { "name": "Аналитическое агентство Скориста" },
        "area": { "name": "Москва" },
        "specializations": [
            { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.91", "name": "Информатика, Информационные системы" },
            { "profarea_id": "14", "profarea_name": "Наука, образование", "id": "14.141", "name": "Математика" }
        ]
    }



    Based on this JSON, we create the database. It's not difficult, so I'll omit the details :) (a possible schema is sketched just below).
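
    Since the article omits the schema, here is a minimal sketch of tables matching the columns used by insert_vac() below; the column types and sizes are my assumptions, not the author's:

    import pymysql

    # hypothetical schema reconstruction: one row per vacancy plus two child tables
    def create_tables():
        conn = pymysql.connect(host='localhost', port=3306, user='root',
                               password='-', db='hh', charset='utf8')  # credentials as in get_connection() below
        with conn.cursor() as cur:
            cur.execute("""
                CREATE TABLE IF NOT EXISTS vacancies (
                    id INT PRIMARY KEY,
                    name_v VARCHAR(255),
                    description TEXT,
                    code_hh VARCHAR(64),
                    accept_handicapped BOOLEAN,
                    area_v VARCHAR(128),
                    employer VARCHAR(255),
                    employment VARCHAR(128),
                    experience VARCHAR(128),
                    salary_currency VARCHAR(8),
                    salary_from INT,
                    salary_gross BOOLEAN,
                    salary_to INT,
                    schedule_d VARCHAR(128),
                    text_search VARCHAR(255))""")
            cur.execute("""
                CREATE TABLE IF NOT EXISTS key_skills (
                    vacancy_id INT,
                    name VARCHAR(255))""")
            cur.execute("""
                CREATE TABLE IF NOT EXISTS specializations (
                    vacancy_id INT,
                    name VARCHAR(255),
                    profarea_name VARCHAR(255))""")
        conn.commit()
        conn.close()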

    Next, we implement a module for interacting with the database. I used MySQL:

    Here is the code
    
    import pymysql

    def get_salary(vac):
        # salary is not always filled in, so accessing its fields directly can fail;
        # this helper returns a dict of None values when the data is empty
        if vac['salary'] is None:
            return {'currency': None, 'from': None, 'to': None, 'gross': None}
        else:
            return {'currency': vac['salary']['currency'],
                    'from': vac['salary']['from'],
                    'to': vac['salary']['to'],
                    'gross': vac['salary']['gross']}

    def get_connection():
        conn = pymysql.connect(host='localhost', port=3306, user='root', password='-', db='hh', charset="utf8")
        return conn

    def close_connection(conn):
        conn.commit()
        conn.close()

    def insert_vac(conn, vac, text):
        a = conn.cursor()
        salary = get_salary(vac)
        print(vac['id'])
        a.execute("INSERT INTO vacancies (id, name_v, description, code_hh, accept_handicapped,  \
                     area_v, employer, employment, experience, salary_currency, salary_from, salary_gross,  \
                     salary_to, schedule_d, text_search)  \
                     VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
                     (vac['id'], vac['name'], vac['description'],
                     vac['code'], vac['accept_handicapped'], vac['area']['name'],
                     vac['employer']['name'],
                     vac['employment']['name'], vac['experience']['name'], salary['currency'],
                     salary['from'], salary['gross'],
                     salary['to'], vac['schedule']['name'], text))
        for key_skill in vac['key_skills']:
            a.execute("INSERT INTO key_skills(vacancy_id, name) VALUES(%s, %s)", (vac['id'], key_skill['name']))
        for spec in vac['specializations']:
            a.execute("INSERT INTO specializations(vacancy_id, name, profarea_name) VALUES(%s, %s, %s)",
                      (vac['id'], spec['name'], spec['profarea_name']))
        a.close()
    


    Now we put everything together by adding the main code to the file:

    Data collection
    area = 1  # 1 is Moscow in the hh.ru area dictionary
    text_search = 'data scientist'
    list_id_vacs = get_list_id_vacancies(area, text_search)
    vacs = []
    for vac_id in list_id_vacs:
        vacs.append(get_vacancy(vac_id))
    conn = get_connection()
    for vac in vacs:
        insert_vac(conn, vac, text_search)
    close_connection(conn)
    


    By changing the text_search and area variables, we can collect vacancies for different queries and regions.
    That completes the data collection; now we move on to the interesting part.

    Text analysis


    The main inspiration was an article on finding popular phrases in the TV series How I Met Your Mother.

    First, we will get the descriptions of all vacancies from the database:

    Here is the code
    
    def get_vac_descriptions(conn, text_search):
        a = conn.cursor()
        a.execute("SELECT description FROM vacancies WHERE text_search = %s", (text_search,))
        descriptions = a.fetchall()
        a.close()
        return descriptions
    


    To work with the text, we will use the nltk package. By analogy with the article mentioned above, we write a function that extracts popular phrases from the text:

    Here is the code
    import string
    from collections import Counter

    import nltk  # word_tokenize needs the punkt tokenizer: nltk.download('punkt')

    # note: the parameter originally named "len" is renamed n_words to avoid
    # shadowing the built-in; an optional stopwords parameter is added here,
    # since the stop word filtering introduced further below needs it
    def get_popular_phrase(text, n_words, count_phrases, stopwords=()):
        # count all n-word phrases (n-grams) containing neither punctuation nor stop words
        phrase_counter = Counter()
        words = nltk.word_tokenize(text.lower())
        for phrase in nltk.ngrams(words, n_words):
            if all(word not in string.punctuation and word not in stopwords for word in phrase):
                phrase_counter[phrase] += 1
        return phrase_counter.most_common(count_phrases)
    


    We combine all the methods described above in the main method and run it:

    Here is the code
    
    def main():
        descriptions = get_vac_descriptions(get_connection(), 'data scientist')
        text = ''
        for description in descriptions:
            text = text + description[0]
        result = get_popular_phrase(text, 1, 20)  # top 20 single words for a start
        for r in result:
            print(" ".join(r[0]) + " - " + str(r[1]))

    main()
    


    We execute and see:

    li - 2459
    / li - 2459
    and - 1297
    p - 1225
    / p - 1224
    in - 874
    strong - 639
    / strong - 620
    and - 486
    ul - 457
    / ul - 457
    s - 415
    on - 341
    data - 329
    data - 313
    the - 308
    experience - 275
    of - 269
    for - 254
    jobs - 233

    We see that the result includes many words that are typical of any vacancy, as well as HTML tags used in the description. (Seemingly duplicated entries such as "data - 329" and "data - 313" are different grammatical forms of the same Russian word, which look identical in translation.) We remove these words from the analysis. To do this, we need a list of stop words. We will build it automatically by analyzing vacancies from other fields. I chose "cook", "cleaning lady" and "locksmith".

    Let's go back to the beginning and collect the vacancies for these queries. After that, we add a function that builds the stop word list.

    Here is the code
    def get_stopwords():
        # 'повар', 'уборщица', 'слесарь' = cook, cleaning lady, locksmith
        descriptions = get_vac_descriptions(get_connection(), 'повар') \
                       + get_vac_descriptions(get_connection(), 'уборщица') + \
                       get_vac_descriptions(get_connection(), 'слесарь')
        text = ''
        for description in descriptions:
            text = text + description[0]
        stopwords = []
        top_words = get_popular_phrase(text, 1, 200)  # 200 is the size of the stop word list
        for word in top_words:
            stopwords.append(word[0][0])
        return stopwords



    We also see the English words "the" and "of". We'll keep it simple and just drop the vacancies written in English.
    We make the changes to main():

    Here is the code
    for description in descriptions:
        if detect(description[0]) != 'en':  # language detection, e.g. detect() from the langdetect package
            text = text + description[0]
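
    The article doesn't show the fully updated main(); putting the stop word list and the language filter together, it might look like this (a sketch, assuming the optional stopwords parameter added to get_popular_phrase above):

    def main():
        descriptions = get_vac_descriptions(get_connection(), 'data scientist')
        stopwords = get_stopwords()
        text = ''
        for description in descriptions:
            if detect(description[0]) != 'en':  # skip English-language vacancies
                text = text + description[0]
        result = get_popular_phrase(text, 1, 20, stopwords)
        for r in result:
            print(" ".join(r[0]) + " - " + str(r[1]))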
    


    Now the result looks like this:

    data - 329
    data - 180
    analysis - 157
    training - 134
    machine - 129
    models - 128
    areas - 101
    algorithms - 87
    python - 86
    tasks - 82
    tasks - 82
    development - 77
    analysis - 73
    construction - 68
    methods - 66
    will be - 65
    statistics - 56
    higher - 55
    knowledge - 53
    learning - 52

    Still, single words do not always reflect the full picture. Let's see what two-word phrases show:

    machine learning - 119
    data analysis - 56
    machine learning - 44
    data science - 38
    data scientist - 38
    big data - 34
    mathematical models - 34
    data mining - 28
    machine algorithms - 27
    mathematical statistics - 23
    will be a plus - 21
    statistical analysis - 20
    data processing - 18
    English - 17
    data analysis - 17
    including - 17
    and also - 17
    machine methods - 16
    areas of analysis - 15
    probability theory - 14

    The results of the analysis.


    The two-word phrases give a clearer answer. A data scientist needs to know:

    • Machine learning
    • Mathematical models
    • Data mining
    • Python
    • Mathematical statistics
    • English
    • Probability theory

    Nothing new, but it was fun :)

    Conclusions.


    This is far from an ideal solution.

    Mistakes:

    1. Vacancies in English should not be excluded; it would be better to translate them.
    2. Not all stop words are excluded.
    3. All words should be reduced to their base form, i.e. lemmatized, so that different forms of the same word, such as the two "data" entries above, are counted together (see the sketch after this list).
    4. Come up with a method for calculating a more optimal stop word list, one that answers the questions "why 200?" and "why a cleaning lady?".
    5. Figure out how to analyze the results automatically, i.e. determine whether one-word, two-word, or longer phrases are the meaningful ones.
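
    For mistake 3, one possible approach (my sketch, not from the article) is the pymorphy2 morphological analyzer, which returns the dictionary form of a Russian word:

    import nltk
    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def lemmatize_words(text):
        # reduce every token to its normal (dictionary) form before counting,
        # so that different grammatical forms are counted as one word
        words = nltk.word_tokenize(text.lower())
        return [morph.parse(word)[0].normal_form for word in words]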
