HeadHunter Job Analysis

    I once wondered what I would find if I analyzed job vacancies and built a few rankings from them: who pays the most, who is most in demand, and more.

    I used the well-known HeadHunter as the data source and collected and processed the vacancies for May of this year. Only one month, because the API does not return anything older.

    Data collection

    The HeadHunter API has excellent documentation, which lives in its repository. Requests go to the https://api.hh.ru/ domain and must set a User-Agent header, preferably of the form application_name/application_version (contact_email). Other User-Agent formats sometimes work, but if the server does not like something, it returns an error.

    The collection logic is very simple, so I implemented it in bash using cURL and jq. However, I want to share a few nuances.


    The GET /vacancies endpoint searches for vacancies by various parameters.

    curl -A 'irenica (https://irenica.com/)' 'https://api.hh.ru/vacancies'
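
    The same request can be prepared from Python. The following is a minimal sketch using only the standard library; the application name, version, and e-mail in USER_AGENT are placeholders, and build_request is a helper name invented for this illustration. It only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://api.hh.ru"
# Placeholder identity (an assumption): use your own name, version and contact.
USER_AGENT = "my-hh-stats/1.0 (me@example.com)"


def build_request(path, **params):
    """Build (but do not send) a GET request that carries the User-Agent header."""
    url = f"{API_ROOT}{path}"
    if params:
        url += "?" + urlencode(params)
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("/vacancies", per_page=100, page=0)
print(request.full_url)
```

    Passing the result to urllib.request.urlopen would actually send it.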

    The search results are split into pages whose size is set by the per_page parameter (20 by default, 100 at most). A specific page is selected with the page parameter (numbering starts from 0).

    The pages field of the service information returned with the vacancies gives the total number of result pages.

    With this, you can easily search through all pages:

    declare -i i=0
    while true; do
      declare url="https://api.hh.ru/vacancies?per_page=100&page=$i"
      declare page="$(curl -A 'irenica (https://irenica.com/)' "$url")"
      # process $page
      declare -i totalCount=$(echo "$page" | jq '.pages')
      i+=1  # move on to the next page
      if ((i >= totalCount)); then break; fi
    done
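
    For comparison, here is a rough Python sketch of the same page walk. The names iter_pages and fetch_page are invented for this illustration: fetch_page stands for whatever function performs the actual GET /vacancies request and returns the decoded JSON, and the fake below replaces it so the example runs without network access.

```python
def iter_pages(fetch_page, per_page=100):
    """Yield every page of search results, starting from page 0.

    fetch_page(page, per_page) is supplied by the caller and is expected to
    return the decoded JSON of GET /vacancies for that page.
    """
    page = 0
    while True:
        result = fetch_page(page, per_page)
        yield result
        page += 1
        # The pages field holds the total number of result pages.
        if page >= result["pages"]:
            break


# Stand-in for the real API: three pages with two vacancies each.
def fake_fetch(page, per_page):
    ids = [str(page * 2 + n) for n in range(2)]
    return {"pages": 3, "page": page, "items": [{"id": i} for i in ids]}


all_ids = [item["id"] for result in iter_pages(fake_fetch) for item in result["items"]]
print(all_ids)
```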

    Full job details

    However, the search results contain only part of each vacancy's data. To get everything, make a separate request to the vacancy view endpoint, GET /vacancies/{vacancy_id}.

    The partial vacancy data is in the items field of the search results. First, collect the vacancy IDs from it:

    declare vacanciesIds="$(echo "$page" | jq -r '.items[].id')"

    Then we will request complete information about the relevant vacancies separately:

    for vacancyId in $vacanciesIds; do
      declare url="https://api.hh.ru/vacancies/$vacancyId"
      declare vacancy="$(curl -A 'irenica (https://irenica.com/)' "$url")"
      # process $vacancy
    done
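
    The two steps can also be sketched in Python. The helper names vacancy_ids and fetch_details are hypothetical, as is the fetch_vacancy callback, which stands for the real GET /vacancies/{vacancy_id} request; the sample page and the description field are made-up stand-ins.

```python
def vacancy_ids(search_page):
    """Collect the vacancy IDs from the items field of one search page."""
    return [item["id"] for item in search_page["items"]]


def fetch_details(search_page, fetch_vacancy):
    """Return {vacancy_id: full_vacancy} for every item on the page.

    fetch_vacancy(vacancy_id) is supplied by the caller and is expected to
    return the decoded JSON of GET /vacancies/{vacancy_id}.
    """
    return {vid: fetch_vacancy(vid) for vid in vacancy_ids(search_page)}


# Stand-in data and fetcher so the sketch runs without network access.
page = {"items": [{"id": "101"}, {"id": "102"}]}
details = fetch_details(page, lambda vid: {"id": vid, "description": f"vacancy {vid}"})
print(sorted(details))
```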

    Search limit bypass

    The HeadHunter API has one peculiarity: no matter how many vacancies a search finds, it returns at most 2000. The actual number found is still reported in the found field of the search results, so you can tell for certain whether you received everything you asked for or some data was lost.
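
    A small check along these lines makes the loss visible. This is only a sketch: missing_results is a hypothetical helper, and the 2000 cap comes from the behavior described above rather than from a documented constant.

```python
SEARCH_CAP = 2000  # maximum number of results a single search returns, per the text above


def missing_results(search_page, cap=SEARCH_CAP):
    """Return how many matching vacancies a single search cannot return."""
    return max(0, search_page["found"] - cap)


print(missing_results({"found": 2350}))
```

    A non-zero result means the query has to be narrowed before its results can be trusted to be complete.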

    To get around this limit, I came up with the following. A search can be restricted to the period in which vacancies were published, using the date_from and date_to parameters, which take dates in ISO 8601 format. You can pick a small interval and walk through all the results piece by piece: the shorter the interval, the fewer vacancies were published in it.

    Note that only vacancies published within the last month are returned, so there is no point in setting a wider range.

    To iterate over the time intervals, it is easiest to represent them as Unix time:

    declare -i startTime=$(date -d '-1 month' +%s)
    declare -i endTime=$(date -d now +%s)
    while ((startTime <= endTime)); do
      declare -i intervalEnd=$((startTime + 60*60))
      declare startTimeIso="$(date -d @$startTime +%FT%T)"
      declare intervalEndIso="$(date -d @$intervalEnd +%FT%T)"
      # ...
      declare url="https://api.hh.ru/vacancies?per_page=100&page=$i&date_from=$startTimeIso&date_to=$intervalEndIso"
      # ...
      startTime=$intervalEnd  # advance to the next one-hour interval
    done
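
    The interval generation on its own can be sketched in Python as follows. hourly_windows is a hypothetical helper and the dates are arbitrary examples. Unlike the bash loop, this version clips the last window to the end time. Adjacent windows share a boundary, so if the API treats both bounds as inclusive (an assumption I have not verified), deduplicating vacancies by ID would be prudent.

```python
from datetime import datetime, timedelta


def hourly_windows(start, end, step=timedelta(hours=1)):
    """Yield (date_from, date_to) ISO 8601 strings that cover start..end."""
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # clip the final window to the end time
        yield cursor.isoformat(timespec="seconds"), upper.isoformat(timespec="seconds")
        cursor = upper


# Arbitrary example range: two full hours plus a half-hour remainder.
windows = list(hourly_windows(datetime(2019, 5, 1, 0, 0), datetime(2019, 5, 1, 2, 30)))
for date_from, date_to in windows:
    print(date_from, date_to)
```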

    Salary processing

    To collect statistics, the vacancies had to be grouped by certain attributes. Doing that in bash was already awkward, so I used Python.

    The processing logic is nothing special either: accumulate the data in an associative array, sort it, and write it out as CSV. Again, though, there are a few nuances.
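
    As a rough illustration of that accumulate, sort, and output cycle, here is a sketch that is not the script used for this article. It assumes each vacancy has already been reduced to a (group name, salary) pair; the function names, the sample data, and the rounding of the average are all choices made for the example.

```python
import csv
import io
from collections import defaultdict


def summarize(rows):
    """Group (group_name, salary) pairs and compute per-group statistics."""
    by_group = defaultdict(list)
    for group, salary in rows:
        by_group[group].append(salary)
    summary = [
        (group, round(sum(values) / len(values)), min(values), max(values), len(values))
        for group, values in by_group.items()
    ]
    # Highest average salary first.
    summary.sort(key=lambda row: row[1], reverse=True)
    return summary


def to_csv(summary):
    """Render the summary as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name", "Salary, average", "Salary, minimum", "Salary, maximum", "Number"])
    writer.writerows(summary)
    return buffer.getvalue()


# Made-up sample data.
sample = [("IT", 50000), ("Transport", 40000), ("IT", 70000)]
print(to_csv(summarize(sample)))
```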

    Salary range

    Note that a salary is given as two numbers, a minimum and a maximum, and either of them may be missing.

    Since the analysis needed a single number, I decided to use the lower bound, and the upper bound only when the lower one is missing.

    salary = None
    if vacancy['salary']:
        # the upper bound is only a fallback; the lower bound overrides it
        if vacancy['salary']['to']:
            salary = vacancy['salary']['to']
        if vacancy['salary']['from']:
            salary = vacancy['salary']['from']

    Exchange rates

    A vacancy's salary can be given in different currencies, each with its own exchange rate. The HeadHunter API has a GET /dictionaries endpoint that contains all the necessary predefined values; the exchange rates are in its currency field. For convenience, it is best to put them into an associative array keyed by the alphabetic currency code:

    import requests

    # map each alphabetic currency code to its exchange rate
    currencies = {}
    dictionaries = requests.get('https://api.hh.ru/dictionaries').json()
    for currency in dictionaries['currency']:
        currencies[currency['code']] = currency['rate']

    Now, during processing, it will be easy to convert all salaries into one currency:

    salary /= currencies[vacancy['salary']['currency']]

    Accounting for personal income tax (NDFL)

    In some vacancies the salary is stated before personal income tax, in others after it. The gross field tells which: it is true in the first case and false in the second.

    I decided to convert all salaries to the after-tax amount:

    if vacancy['salary']['gross']:
        salary -= salary * 0.13
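
    Put together, the three steps above (picking one number, converting the currency, deducting the tax) could look like the sketch below. normalize_salary is a name invented for this illustration, and the rates in the example are made up; in practice they would come from the currencies dictionary built earlier.

```python
def normalize_salary(salary, rates, tax=0.13):
    """Return a single after-tax figure in the base currency, or None.

    salary is the salary object of a vacancy (it may be None);
    rates maps currency codes to exchange rates.
    """
    if not salary:
        return None
    # Prefer the lower bound and fall back to the upper bound.
    amount = salary.get("from") or salary.get("to")
    if amount is None:
        return None
    amount /= rates[salary["currency"]]
    if salary.get("gross"):
        amount -= amount * tax
    return amount


# Made-up rates for the example only.
rates = {"RUR": 1, "USD": 0.0125}
print(normalize_salary({"from": None, "to": 50000, "currency": "RUR", "gross": False}, rates))
print(normalize_salary({"from": 1000, "to": 2000, "currency": "USD", "gross": True}, rates))
```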

    Results analysis

    Now is the time to show the numbers.

    Remote work

    Many readers of this post would probably like to work remotely. But as we can see, working from home is not yet highly valued in our country: salaries are much lower and there are significantly fewer vacancies, which leaves applicants with less to choose from.

    This is rather strange, because in many professions and at many companies the nature of the work does not require a person to be in the office at all. But that is an eternal argument.

    Name | Salary, average | Salary, minimum | Salary, maximum | Number
    Domestic staff11253610977130000nineteen
    Information technology, Internet, telecom552251000300,0002828
    Top management476879474100,00023
    Extraction of raw materials4657920,0009089880
    Installation and Service4543911874696009
    Public service, non-profit organizations4491120,00090000nineteen
    Working staff4421894996786037
    Construction, real estate3989670110000329
    Transport, logistics376629490100,000223

    Applicants with disabilities

    There is an even smaller category of vacancies, though: those open to people with disabilities. This is completely illogical. Employers who do not want remote workers are one thing, but among those who are ready for them, why do so few think about people with disabilities? If you do not care that a person is three time zones away, what difference does it make whether they can walk, for example?

    Many of you probably know people with disabilities. I do too, and I wondered how hard it is for them to find a job and what they can count on.

    Name | Salary, average | Salary, minimum | Salary, maximum | Number
    Public service, non-profit organizations69675870090000eight
    Top management4870530,0008242515
    Information technology, Internet, telecom453214350200,0001050
    Science education45056315890000376
    Construction, real estate4214822250,000210
    Accounting, management accounting, finance companies363872610113100125


    No experience

    We all start somewhere, namely by looking for a job without any experience. I decided to assess the situation with positions open to such candidates.

    The number of vacancies is encouraging: finding a job quickly looks realistic. I do not know how realistic the maximum salary is, but the average figures are enough to get by on.

    Name | Salary, average | Salary, minimum | Salary, maximum | Number
    Construction, real estate55855209499896455
    Top management5082611310400,000111
    Extraction of raw materials381928,000100,000328
    Medicine, Pharma34475450200,00011776
    Transport, logistics336005001500008,000
    Science education3142611001245101660
    Installation and Service30360826480,000381

    Overall top

    And now the most interesting part: who pays the most? Here all the vacancies found are ranked without any filters.

    Top management, of course. Who would have doubted it.

    A curious fact: compare the average salaries across all the tables and you will see that they do not differ that much.

    Name | Salary, average | Salary, minimum | Salary, maximum | Number
    Top management787891502,000,0002408
    Extraction of raw materials616998,0001800002302
    Information technology, Internet, telecom527772668480425900
    Construction, real estate485872094998933229
    Working staff4120325200,00043079
    Car business38555208242549269
    Installation and Service38412251800002390

    Cleaners

    And here is the easiest path: why study for five years if you can just clean an office? Below is the ranking restricted to vacancies matching the query "cleaning *".

    What if you take jobs at several offices and come in for a couple of hours of cleaning each evening? You could live quite comfortably. Let's call it a life hack.

    Name | Salary, average | Salary, minimum | Salary, maximum | Number
    Top management6300040,00087000eight
    Marketing, Advertising, PR50,00050,00050,0006
    Extraction of raw materials45,00045,00045,0003
    HR management, training3324679088700058
    Accounting, management accounting, finance companies32,00030,00035,000ten
    Construction, real estate2902441380,00073
    Transport, logistics249871099045,00026
    Car business24465712445,00061

    Top by city

    Finally, I decided to check the number of open positions by city. The top places are no surprise, but further down there are some curious and even unexpected entries.

    St. Petersburg | 11745
    Nizhny Novgorod | 2876


    All the code from the article, with improvements and instructions, is available in the repository.
