Using Python to analyze related in-demand developer skills

Today the Web offers a huge amount of information about the most popular programming languages, libraries, frameworks, operating systems, and other entities; let's call them technologies. Their number is constantly growing, so anyone who wants to become a developer has to focus on the most in-demand stack built around some key technology.

This raises the first question: how can you determine how relevant a particular technology is? One possible answer: a technology is in demand when employers mention it as a requirement for applicants in a vacancy description. In other words, if technology A is mentioned 60 times across 100 vacancies viewed and technology B 20 times, A can be considered more in demand than B.

The second question: what counts as a key technology?

Given the great interest in articles analyzing the popularity of programming languages, we will treat the programming language as the key technology.

Thus, the task can be formulated as follows: from the set of vacancies, select the subset associated with a key technology, and within this subset count how often other technologies are mentioned.
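
As a rough illustration of this formulation, the sketch below filters a toy set of vacancies by a key technology and counts how often other technologies are mentioned in that subset. The vacancy records and the related_skills helper are made up for the example and do not come from the project's code.

```python
from collections import Counter

# Toy data: each vacancy is reduced to the set of technologies it mentions.
# These records are invented for illustration only.
vacancies = [
    {"python", "django", "sql"},
    {"python", "sql", "git"},
    {"java", "spring", "sql"},
    {"python", "django"},
]


def related_skills(vacancies, key_technology):
    """Return (subset size, Counter of other technologies) for the
    vacancies that mention key_technology."""
    subset = [v for v in vacancies if key_technology in v]
    counts = Counter()
    for v in subset:
        # Each technology is counted at most once per vacancy
        counts.update(v - {key_technology})
    return len(subset), counts


total, counts = related_skills(vacancies, "python")
for skill, n in counts.most_common():
    print("{}: {} of {} vacancies ({:.0%})".format(skill, n, total, n / total))
```

Counting each technology once per vacancy matches the definition given earlier: what matters is the share of vacancies that mention a technology, not how many times it is repeated in one description.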

We will use the hh.ru portal as the source of vacancies because of its popularity and the availability of the HeadHunter API. The programming language is Python 3.4.

To keep the article short, the technical side of retrieving and processing the data is not discussed in detail, but a few key points are worth covering. The project's source code is open and available on GitHub.

Getting a list of vacancies

To get a list of vacancies for Python programmers, we send the following GET request using the requests library:

import requests
import json
l = "Python"
params = {"text": "Программист " + l, "search_field": "name", "area": 2, "period": 30, "page": 0}
r = requests.get("https://api.hh.ru/vacancies", params=params)
jr = json.loads(r.text)

As a result, we obtain a dictionary with the following elements:

page: class 'int'
clusters: class 'NoneType'
per_page: class 'int'
alternate_url: class 'str'
found: class 'int'
arguments: class 'NoneType'
items: class 'list'
pages: class 'int'

We are interested in:

the pages key, whose value contains the number of pages with vacancies,
the items key pointing to the job list on the page.
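
A minimal sketch of how these two keys might be used to walk through all result pages and collect links from items. It assumes each item carries a url key with a link to the detailed description; fetch_page is a hypothetical callable standing in for the request above, and the example.com links are placeholders so that the sketch runs without network access.

```python
def collect_urls(fetch_page):
    """Collect links to detailed job descriptions from every result page.

    fetch_page(page) is a hypothetical callable that returns the parsed
    JSON dictionary for the given page number.
    """
    first = fetch_page(0)
    urls = [item["url"] for item in first["items"]]
    # "pages" holds the total number of result pages for the query
    for page in range(1, first["pages"]):
        urls.extend(item["url"] for item in fetch_page(page)["items"])
    return urls


def fake_fetch_page(page):
    """Stand-in for the real request; returns made-up data."""
    pages = [
        [{"url": "https://example.com/vacancies/1"},
         {"url": "https://example.com/vacancies/2"}],
        [{"url": "https://example.com/vacancies/3"}],
    ]
    return {"pages": len(pages), "items": pages[page]}


# With the request shown earlier, fetch_page could look like this:
# def fetch_page(page):
#     return requests.get("https://api.hh.ru/vacancies",
#                         params=dict(params, page=page)).json()

urls = collect_urls(fake_fetch_page)
print(urls)
```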

From each element of the items list, which is a dictionary, we need the url key. Its value is a link to the detailed job description. By successively changing the page parameter, you can loop through all the vacancies for the query and build the list of links needed for further analysis. To speed up loading the vacancy details, several parallel threads from the threading library are used:

from math import ceil
import requests
from threading import Thread
import json


class DownloadThread(Thread):
    """Downloads a chunk of URLs and appends the parsed JSON to a shared list."""

    def __init__(self, urls, number, res):
        Thread.__init__(self)
        self.number = number
        self.urls = urls
        self.res = res

    def run(self):
        for url in self.urls:
            resp = requests.get(url)
            if resp.status_code == requests.codes.ok:
                # Reuse the response already received instead of requesting again
                self.res.append(json.loads(resp.text))
            else:
                print("Status code: " + str(resp.status_code))
                print(url)


def start_dl_threads(urls, th_num, res):
    """Split urls into th_num chunks and download them in parallel threads."""
    threads = []
    n = ceil(len(urls) / th_num)
    for i in range(th_num):
        t = DownloadThread(urls[i * n: (i + 1) * n], i, res)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

After plotting the download speed for a test set of 274 vacancies against the number of download threads, it was decided to use 10 threads, since adding more hardly reduces the script's running time.
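
Such a measurement could be set up roughly as follows. This is only a sketch: fake_download replaces the real request with a short sleep, and the URLs and thread counts are arbitrary, so it shows the measuring procedure rather than reproducing the original results.

```python
import time
from math import ceil
from threading import Thread


def fake_download(url, res):
    """Stand-in for a network request: sleeps briefly, then stores the URL."""
    time.sleep(0.01)
    res.append(url)


def timed_run(urls, th_num):
    """Process urls in th_num parallel threads.

    Returns (elapsed seconds, list of results).
    """
    res = []
    n = ceil(len(urls) / th_num)
    chunks = [urls[i * n:(i + 1) * n] for i in range(th_num)]

    def worker(chunk):
        for url in chunk:
            fake_download(url, res)

    threads = [Thread(target=worker, args=(chunk,)) for chunk in chunks]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, res


urls = ["https://example.com/vacancies/%d" % i for i in range(40)]
timings = {}
for th_num in (1, 2, 5, 10, 20):
    elapsed, res = timed_run(urls, th_num)
    timings[th_num] = elapsed
    print("%2d threads: %.3f s, %d items" % (th_num, elapsed, len(res)))
```

With a fixed simulated delay the time keeps dropping as threads are added; with real requests the gain levels off at some point, which is the effect described above.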

Building the skills dictionary and searching for skills in job descriptions

Initially, the plan was to compile the dictionary of key skills by hand. However, after analyzing the structure of the job description, it became clear that the process can be largely automated. For this we use the key_skills list available in the description, which contains the key skills of the vacancy. Unfortunately, few vacancies fill in the key skills, and this information may differ from the main description, so processing only this field would not give a complete picture.

All unique key skills were collected for vacancies titled “Programmer” + (“Java”, “JavaScript”, “1C”, “Python”, “C”, “C++”, “C#”, “Objective-C”, “Perl”, “Ruby”, “PHP”), and the 150 most frequent ones were kept.

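import datetime
import json
import os
from collections import Counter

# get_info_from_hh is defined in the project's source code (see GitHub)

os.makedirs("data", exist_ok=True)
langs = ("Java", "JavaScript",
         "1С", "Python",
         "C", "C++",
         "C#", "Objective-C",
         "Perl", "Ruby",
         "PHP")
par = {"text": "", "search_field": "name", "area": 2, "period": 30}
o = {"skills": 1, "urls": 0, "vacs": 0}
for l in langs:
    # "Программист" is Russian for "Programmer"
    par["text"] = "Программист " + l
    fn = "data_" + par["text"] + str(datetime.date.today()) + ".json"
    # os.path.join avoids the unescaped backslash of "data\data_"
    with open(os.path.join("data", fn), "w", encoding="utf-8") as f:
        json.dump(get_info_from_hh(par, 10, o), f, indent=4, ensure_ascii=False)
# Sum the skill counters over all saved files
data = Counter()
for fn in os.listdir("data"):
    path = os.path.join("data", fn)
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as rf:
            data += Counter(json.load(rf)["skills"])
for item in data.most_common(150):
    print(item)
# Each of the 150 most frequent skills initially maps to itself
jsdict = {item[0]: item[0] for item in data.most_common(150)}
with open("kw.json", "w", encoding="utf-8") as wf:
    json.dump(jsdict, wf, indent=4, ensure_ascii=False, sort_keys=True)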

The search for keywords in job descriptions was carried out using a regular expression of the form:

pattern = r"(?i)[^а-яА-Яa-zA-Z0-9_|^]%s[^а-яА-Яa-zA-Z0-9_|$]" % kw[item]

Some technologies generalize several more specific ones. For example, the following value was used when counting mentions for the sql key:

"sql": "sql|mysql|postgresql|ms sql"

Also, to account for different names of the same technology, expressions of the following form were used (the keys are “design patterns”, “English language” and “machine learning”):

"шаблоны проектирования": "шаблон.+проектирования|паттерн.+проектирования|design patterns",
"английский язык": "английск.+?|english",
"машинное обучение": "машинн.+?обучен.+?|нейр.+?сет.+?|neural"

The final keyword file can be found in the GitHub repository.
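
The keyword search described above can be sketched as follows. This is a variation, not the project's exact code: instead of the character classes from the pattern shown earlier, it uses lookarounds, so a keyword at the very start or end of the text is also found. The kw dictionary is a small subset in the spirit of the real one (the git entry is an assumption), and find_skills is a hypothetical helper.

```python
import re

# A small illustrative subset of the keyword dictionary
kw = {
    "sql": "sql|mysql|postgresql|ms sql",
    "git": "git",  # assumed entry: a skill mapped to itself
    "английский язык": "английск.+?|english",
}

# Characters treated as part of a word (same set as in the pattern above)
WORD_CHARS = "а-яА-Яa-zA-Z0-9_"


def find_skills(description, kw):
    """Return the set of dictionary keys whose pattern occurs in description."""
    found = set()
    for skill, alternatives in kw.items():
        # The keyword must not be preceded or followed by a word character
        pattern = r"(?<![%s])(?:%s)(?![%s])" % (WORD_CHARS, alternatives, WORD_CHARS)
        if re.search(pattern, description, flags=re.IGNORECASE):
            found.add(skill)
    return found


text = "Requirements: PostgreSQL, Git, fluent English. Digital skills are a plus."
print(find_skills(text, kw))
```

The boundary checks keep "git" from matching inside "Digital", while "PostgreSQL" is still counted under the sql key through its list of alternatives.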

When selecting key technologies, a list of the top 20 languages by the number of vacancies (with the language mentioned in the title) was used. Of these, only the languages that, together with the word “Programmer” in the title, yield more than 20 vacancies in St. Petersburg were kept.

Results

The results for such an abstract notion as “programmer” reflect the overall picture of industry vacancies well. First, every second employer wants a programmer to know sql and English. Every third employer wants a programmer to know the git version control system. Because development is oriented toward the web, html and css are popular technologies; this is especially true for JavaScript and PHP. Somewhat unexpectedly, the programmer will probably have to work in a team.

Other general technologies in the top twenty include object-oriented programming, databases, algorithms and design patterns.

As for the operating system, Linux is preferred.

Looking at specific languages, you can see that the lines right after the first are taken by the most popular framework or library. For Java these are Spring and Hibernate, for C# they are .net and asp.net, a Python programmer will probably need Django, and for JavaScript you will need React.

A special case, and the least demanding one, is the Russian 1C platform. For many employers, knowledge of this platform alone is enough (I have no idea whether that is a lot or a little). Useful additional skills are teamwork, an understanding of testing processes and the ability to work with databases.

Because of how the hh search engine works, the query “Programmer C” returned many results related to 1C. Therefore, the results for this language are incorrect.

The results obtained for St. Petersburg are given below.