A Big Data course: three months of fundamentals, and why you need them
A Big Data trainee earns 70 thousand rubles a month, while a specialist with 3-4 years of experience earns 250 thousand rubles a month. These are, for example, the people who can personalize retail offers, assess a loan application from the applicant's social network profile, or use the list of visited sites to recognize that a new SIM card belongs to an old subscriber.
We decided to build a professional Big Data course without filler or Agile marketing buzzwords - only hardcore. We brought in practitioners from 7 large companies (including Sberbank and Oracle) and staged what is, in effect, a full-length hackathon. Recently we held an open day for the program, where we asked the practitioners directly what Big Data is in Russia and how companies actually use it. Their answers are below.
Ekaterina Frolovicheva, head of technology research at Sberbank of Russia, says that Big Data is good marketing: a term assembled from a number of disciplines that did not appear yesterday, or the day before, or even two or three years ago. Machine learning, data mining - taken together, they are simply used to solve problems.
Where is the line between classic analytics and Big Data? If you can fit your data into a regular table with a manageable number of rows and run aggregate queries over it, that is classic analytics. But if you take diverse sources of information and examine them along many parameters, in real time, that is Big Data.
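The contrast can be sketched in a few lines of Python. The data, field names, and the flagging rule below are all hypothetical; the point is only the difference between an after-the-fact aggregate and per-event processing as the stream arrives.

```python
from collections import defaultdict

# Classic analytics: the data fits in one table; we run an aggregate query over it.
transactions = [
    {"customer": "A", "amount": 120.0},
    {"customer": "B", "amount": 75.5},
    {"customer": "A", "amount": 30.0},
]

totals = defaultdict(float)
for t in transactions:  # equivalent of SELECT customer, SUM(amount) ... GROUP BY customer
    totals[t["customer"]] += t["amount"]

# "Big Data"-style processing: heterogeneous events arrive as a stream and are
# scored one by one, in real time, instead of being aggregated after the fact.
def score_event(event, state):
    state[event["customer"]] = state.get(event["customer"], 0) + 1
    return state[event["customer"]] > 2  # e.g. flag unusually active customers

state = {}
flags = [score_event(e, state) for e in transactions]
```

In the first style, the answer exists only after the whole table has been scanned; in the second, every incoming event immediately updates state and can trigger an action.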
For customers, the obvious applications are mass personalization and anything that helps increase repeat sales. Sberbank has 50 million active cardholders - not people who merely hold cards, but people who spend. How do you identify them by their vector of interests? By which set of parameters and features, by which ID, do you recognize that they are recorded somewhere in some system of record? That is the first hurdle to clear. And how do you make sure the user receives, in real time, an offer he will actually respond to? Those are exactly the tasks to focus on. The cases involving compliance and distressed assets I would rather not disclose, because they are not matters of public knowledge.
Pavel Lebedev, head of research at Superjob, started right away with money and statistics. At the time of the open-day talk, their database contained roughly 200 job openings directly on the topic and another 80 for Data Science / Data Mining. Six large Russian companies are constantly looking for such specialists; the rest hire occasionally. Big Data professionals are needed most in telecoms, banks, and large retail. Moreover, to land a job in those places, an intensive specialized course of 1-2 months is enough, given a general IT background (a bit of mathematics, a bit of SQL).
Typically, employers need business analysts and machine learning engineers; sometimes they look for a database architect. In general, every employer understands Big Data in its own way, and so far there are no common criteria like those for C++ developers, for example.
What does the job involve? As a rule, such a person must first rebuild the data collection process, then the analysis process: analytics, hypothesis testing, and so on. Then comes implementing the resulting solutions in the company's business processes.
The first salary range is 70-80 thousand rubles a month. This is entry level: no work experience and no deep knowledge of programming languages. As a rule, these are university graduates. It is assumed that university gave them basic SQL skills and, say, taught them to remove outliers when constructing a moving average.
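The outlier exercise mentioned above can be sketched in a few lines. This is a minimal illustration using the standard library, with an invented data series and a simple z-score cutoff; real pipelines would pick thresholds and window sizes more carefully.

```python
import statistics

def moving_average(values, window=3, z_threshold=2.0):
    """Drop points more than z_threshold standard deviations from the mean,
    then compute a simple moving average over what remains."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    clean = [v for v in values
             if stdev == 0 or abs(v - mean) / stdev <= z_threshold]
    return [
        statistics.mean(clean[i:i + window])
        for i in range(len(clean) - window + 1)
    ]

series = [10, 11, 9, 10, 500, 11, 10]  # 500 is an obvious outlier
smoothed = moving_average(series)      # the spike is removed before averaging
```

Without the filtering step, the single bad point (a sensor glitch, a mistyped amount) would dominate every window it touches.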
The next range, up to 100-120 thousand rubles a month, implies a larger set of practical skills and work with various statistical tools - most often SPSS, SAS Data Miner, Tableau, and the like. You need to be able to visualize data in order to convince other people why something specific is worth doing. Simply put, you will have to stand before a meeting of investors and explain what you found - and not in jargon.
The third range, up to about 180 thousand rubles a month, adds programming requirements: most often scripting languages such as Python, plus about two years of experience, including machine learning and Hadoop. The highest salaries - up to 250 thousand rubles a month - go to people with very high qualifications, demonstrated by shipping something specific to market, academic work, and its further development. Above that there is only the exclusive tier, where the salary is higher still, but only dozens of people - or mere units - across the country have the right qualifications.
Sberbank clarifies: the norm there is 1.5 to 3 million rubles per year. And yes, Sberbank expects to hire at least a couple of people from the upcoming course (more on that below).
The next expert, Vitaliy Saginov, is in charge of the Big Data practice at MTS.
“In the early 1990s, two mathematicians concluded that regression analysis could predict, with a fair degree of probability, how a bank client would pay his bills - on time, or with delays. They shopped the idea all over Manhattan, pitching Citibank and everyone else. They were told: 'No, guys, what are you talking about? We have professionals here who talk to the client and can tell by the color of his pupils whether he will be late in the 3rd, 9th, or 12th month.' In the end they found a small regional bank in Virginia called Signet. The quality of its loan portfolio doubled compared to where it stood before the experiments began. Over the next 10 years, the bank's retail business was spun off into a separate company, now called Capital One. That bank is among the ten largest US retail banks, with, I believe, about 20 million customers and some 17-18 billion dollars of client money. In essence, this company put data and its processing at the core of its business strategy and business model.”
Vitaly says that data is an asset. But there is no market for this asset yet, just as there was once no market for online business before the 2000s. The same is true in Europe and the USA: the market simply does not exist yet, so most real investment goes into working with data inside the company to optimize its own processes. Usually a company first establishes experimentally what exactly generates profit, and only then builds the hardware and software architecture around it. Only one company allowed itself to go the other way - British Telecom - but there Big Data was driven by a former IT director who knew exactly what was needed.
Vitaliy believes that Big Data will spawn a new Internet in 15-20 years, and we are now at its source. For now, the main obstacle to the field's development is the lack of precise legal procedures, with many approvals required and many contested issues.
Svetlana Arkhipkina, sales lead for Big Data technologies at Oracle, says the first group of cases around big data concerns customers: a personalized approach, like offering discounts on diapers before even the father knew his 15-year-old daughter was pregnant.
The second group of tasks is optimization - everything related to modeling over very large volumes of data.
The third group covers fraud. Here, various solutions are used for image and video recognition and for analyzing unstructured information. This is a very large stack of tasks, especially for banks and telecoms.
And the newest challenges are cross-industry. These most often raise questions about working with databases beyond the traditional relational ones.
Alexei Ruslyakov, Product Development Director at Acronis, said that the two main problems of Big Data are how to store the data and what to do with it.
About 5-6 years ago we launched a cloud backup service, which let users back up their laptops, workstations, and servers and store the copies in our data centers, in the cloud. The first data centers were in the USA (Boston) and France; now there are DCs in Russia as well. If we had organized cloud backup storage on NetApp or EMC appliances, the cost per gigabyte would have been very high, and the venture would most likely have been commercially unviable. And with the arrival of giants like Google and Amazon, it would have been hard for us to compete: thanks to their huge capacities, their cost per gigabyte is quite low. So our task was to develop an efficient and inexpensive storage system.
“This was about lazy data: data that is written once and then periodically read, or deleted. It is not data that needs constant access, nor data that requires high IOPS. For this 'cold information' we developed our own big data storage technology. The next question before us was how to catalog the stored data, index it, and give our users fast search over it. The task is non-trivial, given that the data is stored in a distributed fashion and with some redundancy. In parallel, you need data tiering: information that is accessed often should sit on expensive, fast media, and the rest on slow, cheap media.”
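The tiering decision described here can be illustrated with a minimal sketch. The policy (hot means touched within the last week), the tier names, and the function are all invented for illustration; this is not Acronis's actual implementation.

```python
import time

HOT_WINDOW_SECONDS = 7 * 24 * 3600  # hypothetical policy: "hot" = accessed in the last week

def choose_tier(last_access_ts, now=None):
    """Pick a storage tier for a backup object based on access recency.

    Recently touched objects go to fast, expensive media; 'lazy' data
    that has not been read for a while migrates to cheap, slow media.
    """
    now = now if now is not None else time.time()
    if now - last_access_ts <= HOT_WINDOW_SECONDS:
        return "fast-ssd"
    return "cold-hdd"

now = 1_000_000_000
hot = choose_tier(now - 3600, now)                 # read an hour ago
cold = choose_tier(now - 30 * 24 * 3600, now)      # untouched for a month
```

Real systems make this decision continuously, demoting objects as they cool and promoting them again on access, but the core predicate looks like the one above.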
“One of the most interesting tasks we are working on now is data deduplication. When we talk about Big Data, the question arises of how the nodes storing the data are distributed, and how, given that distribution, to make deduplication effective. The data must be correctly synchronized between nodes, and that is a lot of work.”
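The core idea of deduplication - store each chunk of data once and keep only a recipe of references - can be sketched in a single-node form. The chunk size and data are toy values for illustration; distributing the chunk store across nodes and keeping it synchronized is exactly the hard part the quote refers to, and is not shown here.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real systems use kilobytes to megabytes

def deduplicate(data: bytes, store: dict):
    """Split data into fixed-size chunks, write each unique chunk to the
    store exactly once, and return the recipe (list of hashes) needed
    to reassemble the original data later."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # duplicate chunks are not written again
        recipe.append(h)
    return recipe

store = {}
r1 = deduplicate(b"AAAABBBBAAAA", store)  # "AAAA" appears twice, stored once
r2 = deduplicate(b"AAAACCCC", store)      # "AAAA" is already in the store
reassembled = b"".join(store[h] for h in r1)
```

Two backups sharing identical chunks cost only one copy of each chunk plus a small recipe, which is why deduplication matters so much for backup workloads.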
Louise Iznaurova, director of new media development at CondeNast Russia, added that Big Data could change journalism quite dramatically.
As you can see, the Big Data market suffers from a severe shortage of qualified specialists. That is why these experts, along with several other representatives of large companies, have backed a professional Big Data course intended to partially close this gap.
The first intake has already happened; the second intake for this three-month course starts on April 18th. The program consists of 3 parts: three concrete cases, each taking one month, and all relentlessly practical. Case No. 1 is building a DMP system in a month. Case No. 2 is social graph analysis, using Vkontakte as the example; it also takes a whole month, at the end of which your team must have written an analyzer for this social graph on big data. Case No. 3 is recommendation systems. Again, this story is well understood and in demand by business - many of the speakers mentioned it - how to predict what a person wants.
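To give a flavor of Case No. 3: one of the simplest recommendation techniques is item-to-item co-occurrence - recommend what is most often bought together with what the user already has. The baskets and items below are invented, and the course itself may well use more sophisticated methods; this is only a plausible starting point.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories: each set is one customer's basket.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "diapers"},
]

# Count how often each pair of items appears in the same basket.
co_counts = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    """Return the k items most often bought together with `item`."""
    scores = {b: n for (a, b), n in co_counts.items() if a == item}
    return [b for b, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:k]

suggestions = recommend("milk")  # "bread" co-occurs with "milk" most often
```

At production scale the same counting runs as a distributed job over millions of baskets, but the logic - count co-occurrences, rank, recommend - stays recognizable.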
What the market demands is not theory but practice. A technical specialist in data processing and analysis must understand which business problem he is solving, and the associated technology stack depends heavily on that business problem. This means working with completely real data: not data scraped from Wikipedia, not datasets picked over academically 25 times, but data from business - and our business partners share it with us.
The schedule is brutal. Building a DMP system from scratch in a month is hard. We understand this, which means the course will be very intense and will demand a lot of concentration. It can be combined with a job, but if you have both a job and this course in your life, everything else will be gone.
- Konstantin Kruglov, founder of DCA Alliance
The format is three sessions a week: Tuesday and Thursday from 7 to 10 in the evening, and Saturday from 4 to 7.
Every week you will have to commit something specific. Miss one deadline and you are off the course. If you want theory, go to Coursera; here there will be only practice. The work is done in teams, and the teams are constantly reshuffled.
Another feature is the DCA contest, which lets you recover 25% of the cost of your training in the first month if you write a good algorithm. There is an achievement of this kind in every assignment.
Here is a link to the details and program.
It is expected that a third of the graduates will be analysts able to use all kinds of big data analysis tools, debug models, test hypotheses, and collect data (for example, for sales companies or for identifying fraud patterns). The remaining two thirds will be developers who can deploy big data tooling and build working systems with their own hands (that is, the entrants should be people at the level of architects and advanced application programmers).