How many data scientists does it take to screw in a light bulb (or which team will make data work for the business)
- How many data scientists does it take to screw in a light bulb?
- One, if the historical sample of successfully screwed-in light bulbs is large enough.
This is a joke, of course, but when a company sets out to tame big data to improve business performance, not everyone understands who exactly will do the taming. The classic view: you need a data scientist, a data analyst who can build models and understands artificial intelligence and machine learning. And this one person solves everything single-handedly.
There is also a trend: when a company forms a Big Data division, data scientists are the first people it hires.
In reality, things are more complicated. There is no working with big data without data scientists, of course, but one person alone does not make an army. Who else should fight shoulder to shoulder with the data scientist is best understood through examples.
Suppose a chain of fitness clubs wants to use big data. The data scientist's task is to predict which clients, in addition to their basic training, are inclined to buy additional personal training sessions. The specialist takes the data on who did what in the past and builds a propensity model.
The question arises: which workouts? And how are we going to suggest them to the client? The workouts will need to be clearly divided into men's and women's. They also need to be divided by business logic: if a person already trains with a premium coach, we should not offer a non-premium one.
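The business rules above can be sketched as a filter applied on top of the model's ranked suggestions. This is only an illustration: the `Client` and `Workout` classes, their fields, and the `apply_business_rules` function are hypothetical names, not part of any real system described here.

```python
from dataclasses import dataclass


@dataclass
class Client:
    client_id: int
    sex: str                 # "F" or "M" (assumed encoding)
    has_premium_coach: bool


@dataclass
class Workout:
    name: str
    audience: str            # "F", "M" or "any" (assumed encoding)
    premium: bool


def apply_business_rules(client, ranked_workouts):
    """Keep only workouts the business allows us to offer, preserving model order."""
    allowed = []
    for workout in ranked_workouts:
        # Rule 1: respect the split into men's and women's workouts.
        if workout.audience not in ("any", client.sex):
            continue
        # Rule 2: a client who already has a premium coach gets no non-premium offers.
        if client.has_premium_coach and not workout.premium:
            continue
        allowed.append(workout)
    return allowed


# Hypothetical model output, already sorted by predicted propensity.
ranked = [
    Workout("Group boxing", "M", False),
    Workout("Premium pilates", "F", True),
    Workout("Basic yoga", "any", False),
    Workout("Premium swimming", "any", True),
]
client = Client(client_id=1, sex="F", has_premium_coach=True)
offers = [w.name for w in apply_business_rules(client, ranked)]
```

The model decides the order of suggestions; the rules, which come from the business side, decide what may be offered at all.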
Or take an example from the banking sector. Banks have products that are sold on their own, and products that are usually sold along with others. A customer gets a card or takes out a loan, and insurance is sold alongside it. The story is similar in insurance companies: a customer buys auto insurance, and life insurance can be sold at the same time.
So if you do not know the business but have the task of predicting some kind of purchase, you can go wrong like this: "Look, a lot of our customers buy this training session / insurance policy." And you start building models on it to stimulate sales. But the business knows that this training session or policy is only sold together with something else. The model may even turn out well, but the product will not sell on its own.
When building a model, there is always a set of input assumptions about how the business works. If we formulate them incorrectly, the model will be useless. Therefore, in addition to the data scientist, you need a product owner: a product manager who will make math and business work together.
These two roles are definitely needed in a big data team. Importantly, if we have several business lines, each line needs its own product owner. The data scientist can be shared across lines.
You could even say that the product owner is the one with whom it all begins: the person who comes up with machine learning use cases for the company and then drives their implementation.
But as they say, that’s not all.
Imagine that a bank decides to promote a special card for customers who often travel abroad. What historical data can it rely on to form a so-called feature (attribute)? The most obvious one: at some point there was a transaction abroad on the customer's card. The feature is simple, but it needs clear requirements. How many such transactions were there per year? At which points of sale? Over what period? All of this has to be formulated and then coded from raw data so that the feature is computed correctly. For this you need a separate person: a data engineer.
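As a rough illustration, the travel attribute could be encoded along these lines. Everything specific here is an assumption made for the sketch: the transaction record layout, the home country code, the threshold of three foreign transactions, and the function names.

```python
from datetime import date

HOME_COUNTRY = "UA"   # assumption: the bank's home market
MIN_FOREIGN_TX = 3    # assumption: threshold agreed with the business


def foreign_tx_count(transactions, customer_id, start, end, home_country=HOME_COUNTRY):
    """Count a customer's card transactions made abroad within [start, end]."""
    return sum(
        1
        for tx in transactions
        if tx["customer_id"] == customer_id
        and tx["country"] != home_country
        and start <= tx["date"] <= end
    )


def is_frequent_traveler(transactions, customer_id, start, end, min_tx=MIN_FOREIGN_TX):
    """Binary attribute: did the customer reach the foreign-transaction threshold?"""
    return foreign_tx_count(transactions, customer_id, start, end) >= min_tx


# Made-up sample data, for illustration only.
transactions = [
    {"customer_id": 1, "date": date(2023, 2, 10), "country": "PL"},
    {"customer_id": 1, "date": date(2023, 6, 5), "country": "DE"},
    {"customer_id": 1, "date": date(2023, 9, 1), "country": "UA"},
    {"customer_id": 1, "date": date(2023, 11, 20), "country": "IT"},
    {"customer_id": 2, "date": date(2023, 3, 3), "country": "PL"},
    {"customer_id": 1, "date": date(2022, 12, 30), "country": "FR"},
]
year_start, year_end = date(2023, 1, 1), date(2023, 12, 31)
traveler_1 = is_frequent_traveler(transactions, 1, year_start, year_end)
traveler_2 = is_frequent_traveler(transactions, 2, year_start, year_end)
```

Each question in the paragraph (how many times, where, over what period) ends up as an explicit parameter, which is exactly what has to be agreed on before the code is written.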
The tasks of these roles really are different. The data scientist has to build a good model. Their head is busy with choosing which features, cases and algorithms to use, and how to optimize the model so that it runs quickly. A data engineer is more like a programmer or database developer, who needs to collect data from 10, 100 or 500 different tables and sources, compute it, and reconcile it, taking many conditions into account.
An important point: the engineer does not join at the first stage. The development cycle consists of an experimental stage (MVP, minimum viable product) and a production stage. While we are experimenting, it is very hard to spell out for the data engineer each time exactly which data to extract. This is creative work: hypotheses are being tested and the data is reshaped in different ways. Here even minimal back-and-forth between the scientist and the engineer delays the MVP by weeks.
More precisely, the data engineer does the first iteration of data preparation, because without data the data scientist has nothing to work with. The data scientist then iteratively builds features for the model. Once the model has proved successful and needs to be moved into production, the data engineer, following the data scientist's specification, writes production code that computes the features on a regular basis.
Hence the current trend: at the MVP stage, the data scientist prepares the data independently. Then, once the model is built and everyone has accepted it, the data scientist describes exactly how the required features are formed and hands this over to a specially trained person, who implements them so that they are computed continuously in production.
This story can also be approached from the other side: the business goal has not yet been defined, but the company has a huge body of data that it wants to use.
In this case we try, say, 100 cases, 100 MVPs, of which perhaps one will take off. If we break down the MVP process for each case, 80% of the effort goes into data preparation and 20% into the model itself. Each time, data has to be pulled from separate sources in different formats and assembled into logical, understandable features: for example, "a transaction at point N" should turn into "travels abroad so many times a year".
This work takes a lot of time. If we took some set of data, built a model, and it turned out badly, we go back and extract the data again, and so on for each of the 100 cases. There is only one way to optimize these iterations: to have a big "showcase" (a data mart) prepared in advance with all possible features, thousands or tens of thousands of them. Building such a showcase is the engineer's task, under the direction of the scientist. Experiments then speed up several times over, because input parameters for models can be selected and changed quickly.
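A minimal sketch of why the "showcase" speeds up experiments: once every attribute is precomputed per customer, trying a new hypothesis is just a matter of picking different columns. The `FEATURE_MART` contents, the attribute names, and `build_dataset` are all hypothetical; a real showcase would live in a database and hold thousands of columns.

```python
# Hypothetical precomputed showcase: one row of attributes per customer.
FEATURE_MART = {
    101: {"foreign_tx_per_year": 5, "avg_balance": 1200.0, "has_loan": 1},
    102: {"foreign_tx_per_year": 0, "avg_balance": 300.0, "has_loan": 0},
}


def build_dataset(mart, feature_names):
    """Assemble model inputs from the showcase for the chosen attributes."""
    customer_ids = sorted(mart)
    rows = [[mart[cid][name] for name in feature_names] for cid in customer_ids]
    return customer_ids, rows


# Two experiments with different inputs, with no new data extraction in between.
ids_a, rows_a = build_dataset(FEATURE_MART, ["foreign_tx_per_year", "has_loan"])
ids_b, rows_b = build_dataset(FEATURE_MART, ["avg_balance"])
```

The expensive part, computing the attributes from raw sources, is done once; each of the 100 cases then only changes the list of names passed in.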
Conductors of the Big Data Orchestra
The data has been collected, the model has been built, and it has been reconciled with the business. Is that everything?
Not quite. This big data story needs a leader. The position seems the simplest and most obvious one, but that is not quite so. A manager has to combine two qualities that usually do not go well together.
If a company is starting big data from scratch, it needs a strategist and salesperson as the head and driver of the initiative. This person will explain to the whole company why working with big data is so important. Clearly, at the start of something innovative it is very hard to ask for a precise business case, because it rests on a large number of assumptions. So the strategist will explain: we will plan big data on a top-down principle, and will set goals of varying scope, such as:
- in 5 years, income from big data projects and products should make up 10% of our revenue
- reduce default risk by 20%
- cut the number of inefficient offices by 30%
and so on.
At the same time, this strategist must be able to sell the idea inside the organization.
The problem is that even if such a person is found, tactical matters are hard for them. To implement the strategist's ideas on the ground, you need an operations manager. This person will set up business processes, organize the analysts and product managers, and run everything according to agile. It is important that all of this works quickly. So the leadership is split into two parts: the strategist is responsible for the bright future, while the operations manager reports to the strategist and implements the plans. Neither of them will cope alone.
You can also look at this problem from a completely different angle. Imagine that Big Data technologies are to be introduced in a large, traditional manufacturing company for which they are new. Who should be put in charge? An outsider with extensive experience and knowledge of applying big data in various industries, or an insider who has been with the company for a long time, holds a fairly senior position, has delivered many projects, and is known and respected by everyone?
I think it is clear that the insider, who knows how the company works, its people and its processes, will achieve more. To help this person, you then need to bring in an outsider with experience in implementing Big Data, who will point out the right directions and manage the Big Data team.
A place in the sun
The composition of the team is settled. It remains to place the big data orchestra under the right department.
It is logical to assign it to the business area that we are optimizing. This works well if the company is mature; then you can try placing big data within the sales function it targets. We need a business line to make it work. For a bank, for example, if we want to retain customers, we need a division that knows how to contact the customers chosen by the model and actually retain them. If we want to use big data to plan the location of bank offices, we need the division that deals with opening those offices. If we want to optimize bank scoring, we need the division responsible for risk. Without a business line responsible for acting on the model's results, nothing will come of it.
More broadly, without direct support from the top, the topic simply will not take off: we need that same top-down strategy. This is especially true when you need support from a business area that is already busy with its own processes and looks askance at any innovation.
If you want to learn more about implementing Big Data in companies, read our other publications on our website or come to study at the School of Data.
This post was prepared by the School of Data, based on a publication by the School's founder in the Business HUB of Kyivstar PJSC.