Developing requirements for the professional qualities of a data scientist: our experience

    Today, almost every business feels the need for data mining, and data science is no longer perceived as something new. However, it is not obvious to everyone what kind of specialist should be hired.

    This article was written not by an HR specialist but by a data scientist, so the style of presentation is rather specific. There is an advantage, though: it is an inside view of the qualities a data scientist needs so that the company can rely on such a person.


    The time came when our data science startup grew out of diapers: the number of analysis tasks was growing at an unexpected rate, and automation could no longer keep up with it. It became obvious that the team needed new brains...

    At first it seemed to me that we needed someone quite specific: just an ordinary data-something-or-other... programmer, analyst, statistician. So what could be difficult about compiling a list of requirements?

    “In engineering, if you do not know what you are doing, you should not be doing it.”
    Richard Hamming

    I approached the matter as usual. I took out two sheets of paper and titled one “Technical Skills” and the other “Professional Qualities”. After that, I felt the urge to go to some job site, find a pile of resumes, write out lists of qualities, and pick the ones I liked. But something stopped me. “This is not my way,” I told myself. “I am not good at that. I understand tasks...”

    I tried to start from the task. Our tasks are simple. You are given an unresponsive CRM of dubious content and are asked to predict sales a couple of months ahead. Quite simple. Anyone can handle it... with one caveat: if you can understand the client's business. Ideally, a working group is assembled for this, one that sets all other tasks aside and devotes itself to analyzing this particular one. The input is the client's wishes; the output is a solution that can be checked without going into details and without duplicating the work already done.

    From this I put together the first more or less formal requirement: a person should be able to take on a separate task and not bother anyone much until a first rough solution is obtained. That solution can then be improved by bringing in other specialists to help. But at the first stage, involving someone else is the same as assigning the person an overseer. And an overseer can at any moment push the newcomer aside and start doing everything for them, which makes the hiring completely pointless.

    Based on this first requirement, I quickly filled out the first sheet: know Python, be able to extract information from different sources, store information, use AWS, know probability theory and statistics, and be able to work with random processes. A little later I added basic economics. The result is a list of the skills needed to meet the first requirement.

    With the list of professional qualities, however, I did not succeed. Even after googling, I did not find any professional requirements for a data scientist that seemed appropriate.

    Either I came across general formulations such as “responsibility”, or qualities were understood as skills, which belonged on the other list.

    My own thoughts were a jumble that was hard to systematize. The global was mixed with the specific, applicable only to certain tasks. It seemed very wrong to me to lump together qualities that were too general with qualities that the candidate might never use later on.

    Somewhere around here, the idea of the Problem was born. It seemed to me a good and elegant way to avoid philosophizing over lists of requirements and, at the same time, to collect the necessary list by looking at the errors in the solutions.

    Problem statement

    An entrepreneur decided to open a shop next to badminton courts so that visitors would not have to go to the supermarket for shuttlecocks and rackets.

    Throughout the year, the entrepreneur kept all purchase receipts in order to later work out which decisions would increase profits. The information from the receipts is contained in the attached train_dataset.csv file.

    He packaged shuttlecocks and rackets and sold them exclusively in sets of three types:

    1. Racket and two shuttlecocks
    2. Racket and five shuttlecocks
    3. Ten shuttlecocks

    From time to time, the entrepreneur had to change prices with an eye to supermarket prices and tax rates.

    The shop and the courts were open every day, with no days off or holidays. The flow of customers was somewhat limited: only 4 people are allowed on a court, a court is booked in advance for a two-hour session, and there are only three courts in the stadium. Nevertheless, not a day went by without a sale, because from time to time either completely unprepared people came to play, or someone broke a racket or lost shuttlecocks.

    A year later, the entrepreneur decided to hold a sale lasting from January 1 to January 31 inclusive. He regrouped the goods into new sets and assigned them the following prices:

    1. Only one racket - 11 dollars 80 cents
    2. Five shuttlecocks - 5 dollars 90 cents
    3. One racket and one shuttlecock - 12 dollars 98 cents

    The task is to determine the entrepreneur's income in January.
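
    As a quick sanity check on the sale prices, the three new sets can be decomposed into per-item prices. The sketch below works in cents to avoid floating-point noise; the variable names are illustrative and are not part of the original task.

```python
# Sale prices from the problem statement, converted to cents.
RACKET_ONLY = 1180          # one racket
FIVE_SHUTTLECOCKS = 590     # five shuttlecocks
RACKET_PLUS_ONE = 1298      # one racket and one shuttlecock

# Implied per-item prices.
racket_price = RACKET_ONLY
shuttlecock_price = FIVE_SHUTTLECOCKS // 5

# Check whether the third set costs exactly one racket plus one shuttlecock.
bundle_is_additive = racket_price + shuttlecock_price == RACKET_PLUS_ONE

print(shuttlecock_price, bundle_is_additive)
```

    The sale prices turn out to be additive in per-item prices. This suggests, though it does not prove, that it may be worth looking at demand per racket and per shuttlecock rather than per set.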

    Probability sensitivity

    “I believe that the best predictions are based on an understanding of
    the fundamental forces involved in the process.”
    Richard Hamming

    The problem was modeled on real-life tasks, but it was artificial, and this was not hidden from the candidates. Consequently, certain formulas were used to create the dataset: seasoned with random variables, granted, but formulas nonetheless. In any case, the data scientist was expected to be able to detect these formulas and use them for forecasting.

    Of course, one cannot rule out that the dataset does not give a picture complete enough to restore the formulas with the necessary accuracy. In real life, for such a case we work out what additional information is needed and where to get it.

    In general, the desire to find the "law of the universe" is a good professional quality. So is the ability to understand what to look for and where to look. Mr. Hamming knew what he was talking about. And thanks to him, the first line appeared in my list of requirements:

    The ability to detect cause-effect relationships, describe them, formulate the conditions under which relationships can be converted into a formula useful to business.

    It is no coincidence that I used the phrase “useful to business” here. In my own practice, it often turned out that the business profited not from the answer to the problem but from a side result obtained by uncovering some internal dependencies. In some cases, this brought startups extra money and new contracts, and increased the amount of know-how and by-products.

    Therefore, when analyzing the solutions sent to me, I watched carefully how the candidate would use the knowledge that the dataset was artificial, and whether he would at some point ask for additional information or prove that the dataset was sufficient for the task.

    Self confidence

    “If an event attracts our attention, associative memory begins to search for its cause, or rather, any reason already stored in memory is activated.”
    Daniel Kahneman

    I will not say that associative memory is bad. It is the source and fuel of our imagination. Imagination allows us to generate hypotheses, put forward assumptions intuitively, and quickly find the pairs of variables between which a connection is possible.

    And it also trips us up in the form of confirmation bias.

    We are so used to our own experience and our own knowledge that we begin to extend them to new situations. In the wild, this is often useful. Say, the belief that all snakes are poisonous saves more lives than doubting whether this particular snake is poisonous. But in a safe office, with enough time available, it is better to treat any judgment as a hypothesis.

    The dataset for the problem was deliberately designed so that the time interval covered only one year of observations. It is good that candidates, when examining the graphs, put forward a hypothesis about seasonal fluctuations. It is bad that hardly anyone stated the need to verify it. And it is very bad that some insisted on the presence of seasonality without checking.

    So I entered the following in the list of qualities:

    Critical thinking, including toward one's own experience.

    I really wanted to add “and knowledge” here, but then it seemed to me that this postscript opens up a big new topic.


    “Having developed a particular theory, we again turn to observations
    to test it.”
    Gregory Mankiw

    The data science literature examines ways to automate hypothesis testing. However, I have rarely come across guidelines on how to use them. Because of this, believe it or not, I once confused two seemingly very different activities: testing statistical hypotheses and validating a model.

    At the same time, which is even more confusing, the difference between the concepts of the statistical hypothesis and the hypothesis in general is overlooked. To avoid such confusion in our article, let me use the term assumption for the general concept of a hypothesis.

    In the previous section, one such assumption was made about the dataset, namely, the presence of seasonality. Intuitively, a seasonal component can be defined as one that recurs periodically. And here you should immediately ask yourself: how many times does the component have to repeat before it can be considered seasonal? Moreover, can we, on the basis of periodic repetition, confirm the presence of a seasonal component in a dataset whose time interval is only a year?

    As already mentioned, the length of the interval was chosen deliberately. I wanted the candidates to have both the need and the opportunity to propose their own ways of checking for seasonality in this problem. And I added this quality to the list of required professional qualities as well:

    The ability to test assumptions in standard ways and come up with new ways of checking.

    “Come up with new ways” probably sounds too grand. I rarely encounter the need to invent something new. Simple reasoning that follows the question “What if?” is usually quite sufficient.

    In the beautiful article “This is correct, but false”, Alexander Chernookiy gave examples of quick and almost intuitive solutions for several probabilistic problems. A similar mechanism, it seems to me, is quite well suited for testing assumptions.

    First, let us think about what kind of seasonality we want to find. Seasonality may be an external factor that is unknown to us and that shows up as a certain paranormal repeatability in the data. Such seasonality can be described without going beyond the dataset, by writing out the seasonal component separately and showing how stable it is. But seasonality can also be hidden inside known data. For example, if seasonality affects the number of buyers, and the number of buyers affects the sales volume, then if we knew in advance which buyer would come and when, we would hardly need seasonality as a separate phenomenon. Consequently, we will look specifically for the paranormal seasonality, since it is the kind we do not know and therefore need.

    Let's now assume that such seasonality does not affect sales. Then all fluctuations in sales are either random, or you can find some relationship between them and changes in other variables. How fully does this dependence describe what is happening? Will there still be room for paranormal seasonality?

    That is, to check for seasonality, we can find all the dependencies on the known variables and then, after subtracting these dependencies from the fluctuations, look at the remainder. Moreover, if the spread of the remainder is sufficiently small, there may be no point at all in searching for paranormal seasonality.

    This gives us a simple way to check for seasonality when the data interval is not long enough.
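
    A minimal sketch of this residual check, using only the Python standard library, might look as follows. The synthetic prices and sales, the generating formula, and all names are assumptions made for illustration; they are not the actual dataset or the formulas behind it.

```python
import random
import statistics

random.seed(0)

# Hypothetical synthetic data: one year of daily prices and sales.
# The generating formula is an assumption made for illustration only.
days = range(365)
price = [10.0 + (d // 30) % 4 for d in days]   # price changes roughly monthly
sales = [40.0 - 2.0 * p + random.gauss(0.0, 0.5) for p in price]

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

intercept, slope = fit_line(price, sales)

# Subtract the explained dependence and look at what remains.
residuals = [s - (intercept + slope * p) for s, p in zip(sales, price)]
residual_share = statistics.pstdev(residuals) / statistics.pstdev(sales)

print(f"slope = {slope:.2f}")
print(f"residual spread / total spread = {residual_share:.2f}")
```

    If the residual share is small, the known variable already explains most of the fluctuations, and little room is left for an unknown seasonal component. With several known variables, the same idea applies with a multiple regression in place of the single straight line.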


    “Our mind is not prepared to understand rare events.”
    Robert Banner

    When we turn to searching for the relationship between two quantities, the first thing we try is to get a feel for how they change together. And there is perhaps no simpler and better-studied method for this than linear regression. It can help to form an opinion about the relationship even when the quantitative relationship between the quantities is unknown. And it has a number of other advantages.

    And the flaws.

    In fact, the relationship between two quantities is far from always simple enough to be identified by numerical characteristics alone. No matter how beautiful the linear approximation of the relationship may be, there is always the possibility that we are dealing with something more complex. The English statistician Francis Anscombe illustrated this phenomenon with four examples, which later became known as Anscombe's quartet.
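
    To make the point concrete, here is a small sketch that fits a straight line to two of the four data sets from Anscombe's quartet. The numerical summary is nearly identical for both, while the pattern of the residuals is not. The helper names are my own, and counting sign changes in the residuals is just one simple diagnostic chosen for illustration, not the only possible one.

```python
import statistics

# Two of the four data sets from Anscombe's quartet; they share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def residual_sign_changes(xs, ys):
    """Count how often the residuals change sign when ordered by x."""
    intercept, slope = fit_line(xs, ys)
    ordered = sorted(zip(xs, ys))
    signs = [yi - (intercept + slope * xi) > 0 for xi, yi in ordered]
    return sum(a != b for a, b in zip(signs, signs[1:]))

for name, ys in (("I", y1), ("II", y2)):
    intercept, slope = fit_line(x, ys)
    print(name, round(intercept, 2), round(slope, 2), residual_sign_changes(x, ys))
```

    Both fits come out at roughly y = 3 + 0.5x. In the second set, however, the residuals form a single arc: negative at both ends and positive in the middle. That kind of systematic pattern is a hint that the straight line is the wrong model, even though the summary numbers look the same.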

    Putting something similar to Anscombe's quartet into the problem turned out to be a good idea and very simple to implement. Even though the phenomenon is widely known, a lot of candidates took the bait.

    The phenomenon was implemented in the problem as follows. Let there be three groups of customers, each pursuing its own interest when buying. Two of the groups behave in a similar way, and their behavior is expressed as a linear relationship between demand and price. But the third group behaves differently. Once prices rise above a certain threshold, buyers from this group abruptly stop buying more than the necessary minimum.

    This phenomenon, quite common in the real world, made it possible to simulate one of Anscombe's examples and hide it among two other distributions.

    In fact, “hide” is not quite the right word for the situation. I simply placed this distribution next to others that were more familiar and understandable. The difference was obvious on the graphs, or so it seemed to me, but not everyone noticed it. Especially interesting was the attempt of one candidate to “improve” the approximation by moving to a higher-order polynomial.
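
    The behavior of the third group can be sketched with synthetic data. In the example below, the threshold, the demand levels, and the noise are invented for illustration and are not the values used in the actual problem. The sketch compares a straight line with a simple two-level model whose split point is found by scanning candidate prices.

```python
import random
import statistics

random.seed(1)

# Hypothetical third customer group: demand is flat up to a price threshold,
# then drops sharply to a bare minimum. All numbers are illustrative assumptions.
THRESHOLD = 12.0
prices = [8.0 + 0.1 * i for i in range(80)]
demand = [(20.0 if p <= THRESHOLD else 5.0) + random.gauss(0.0, 1.0) for p in prices]

def sse_about_mean(values):
    """Sum of squared deviations from the mean."""
    m = statistics.fmean(values)
    return sum((v - m) ** 2 for v in values)

def sse_linear(x, y):
    """Sum of squared errors of the least-squares straight line."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return sum((yi - (my + slope * (xi - mx))) ** 2 for xi, yi in zip(x, y))

def sse_step(x, y, split):
    """Sum of squared errors of a two-level model split at `split`."""
    low = [yi for xi, yi in zip(x, y) if xi <= split]
    high = [yi for xi, yi in zip(x, y) if xi > split]
    return sse_about_mean(low) + sse_about_mean(high)

# Scan candidate split points (skipping the edges) and keep the best one.
best_split = min(prices[5:-5], key=lambda s: sse_step(prices, demand, s))

print(f"linear SSE = {sse_linear(prices, demand):.1f}")
print(f"step SSE   = {sse_step(prices, demand, best_split):.1f} at price {best_split:.1f}")
```

    On data like this, the two-level model should fit far better than the line. A higher-order polynomial would tend to smooth the jump rather than capture it, which is why recognizing the threshold as a significant observation matters more than adding terms to the approximation.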

    So I formulated another requirement for professional qualities:

    To be able to isolate significant observations and build hypotheses regarding their significance.


    “The meter has been used extensively for five years and went through three checks.”
    Timothy Leary

    Earlier, I described a situation where the unexplained residuals become so small that their influence is indistinguishable against the background of the business benefits that the rest of the model provides.

    However, you need to understand what may be hidden behind the expression "so small."

    Usually the world is observed and measured by us using some instruments. Simple, like a ruler, or complex, like an electron microscope. Complex devices include a computer with a statistical programming environment installed on it.

    In a sense, any observation or conclusion we make can be regarded as the result of a measurement. We look at the conditions of the problem and measure income over a time interval that has not happened yet. Here I have replaced the word “predict”, mysterious and magical for many, with the word “measure”. In my everyday work I can put it that way, since at a sufficiently high level of accuracy a forecast turns into a routine calculation.

    But no measurement can be perfectly accurate. Every instrument has a measurement error caused by its imperfection. So the accuracy of a measurement has to be stated, and for this purpose a confidence interval is reported along with the result.

    Reporting the confidence interval is not even a recommendation but a necessity, and one that is often forgotten. Moreover, although my words may sound somewhat pedantic, I believe that calculating the confidence interval is an act of self-respect, and that the following is one of the necessary qualities of a data scientist:

    Accuracy in observing the formal requirements of algorithms and methods, especially when it comes to calculating confidence intervals and verification of necessary and sufficient conditions.
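
    As a minimal sketch of what reporting an interval along with a forecast can look like, the example below attaches a 95% interval to a monthly total. The residuals and the daily forecast are invented for illustration. The calculation assumes independent, normally distributed daily errors and ignores the uncertainty of the fitted model itself, so strictly speaking it is a simplified prediction interval.

```python
import math
import random
import statistics

random.seed(2)

# Hypothetical residuals of a fitted daily-income model (observed minus predicted),
# in dollars. In practice these would come from the training period.
residuals = [random.gauss(0.0, 3.0) for _ in range(365)]

# Hypothetical point forecast of daily income for each of the 31 days of January.
daily_forecast = [120.0] * 31

sigma = statistics.stdev(residuals)
z = statistics.NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval

# Assuming independent daily errors, the variance of the monthly total
# is the sum of the daily variances.
total = sum(daily_forecast)
half_width = z * sigma * math.sqrt(len(daily_forecast))

print(f"January income: {total:.0f} +/- {half_width:.0f} (95% interval)")
```

    If the daily errors were correlated, or if the model coefficients were poorly determined, the interval would be wider. Checking such conditions is exactly the kind of formal requirement the quality above refers to.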


    “This provision is not quite true, but true enough for practical application in most cases.”
    Francis Anscombe

    Until now, I have avoided discussing the most striking feature of this problem. The forecast interval is marked by a major change in the sets of goods sold. Now is the time to explain why this change appears in the problem.

    Above, I have already outlined my view on the possibility of checking various assumptions. There should always be verification. If something cannot be verified, or the method of verification is not known, then the various options should be outlined; they may serve as grounds for further research. But at the same time, one should try to describe the situation as fully as possible on the basis of the known information.

    So what do we actually know about sales? There are people who make purchases for the known and listed reasons. The whole process can be simulated almost completely, since we have found all the dependencies and established that the unexplained residual is normally distributed and has a very small variance.

    Questions begin to appear: does the purchased volume of goods cover the needs of people? What do they do when the need remains unmet? For example, what do they do if, in their opinion, the price of a product is too high? Where does the linear dependence of demand come from?

    In fact, these are questions for the business. And, of course, they should be put to the business owner as the expert in the field. After all, the initial dataset is far from always complete, and the business, even with a staff of professional analysts, does not know everything. Indeed, the business turns to data science precisely because it does not know everything. But what if...

    What if there is a verifiable and consistent model that describes the situation using only our known data? This is also worth checking out.


    Let me make a final list of the data scientist's professional qualities that I wrote out.

    1. The ability to detect cause-effect relationships, describe them, formulate the conditions under which relationships can be converted into a formula useful to business.
    2. Critical thinking, including toward one's own experience.
    3. The ability to test assumptions in standard ways and come up with new ways of checking.
    4. To be able to isolate significant observations, build hypotheses regarding their significance.
    5. Accuracy in observing the formal requirements of algorithms and methods, especially when it comes to calculating confidence intervals and checking the necessary and sufficient conditions.

    In this assembled form, the list seems pretty obvious to me, perhaps because it to some extent mirrors the list of cognitive biases. Which, incidentally, brings me to the thought that observations made in hindsight naturally seem obvious. And yet I remember the time spent meditating over the second, empty sheet of paper, and I understand that the list would not have been compiled without the work done.

    Still interesting is the idea that the importance of a fact for one person is not necessarily obvious to another. This can be clearly seen from the solutions to the problem that I received from dozens of candidates ...

    Author: Valery Kondakov, Co-founder and CTO of Uninum
    Co-author: Pavel Zhirnovsky, Co-founder and CEO of Uninum


    Vacancy statistics as of 25/06/19
    Date the vacancy was posted: 27/05/19
    Total views of the vacancy: 2727
    Total responses: 94

    • Sent a solution to the problem, but it turned out to be wrong: 20%
    • Agreed to solve the problem, but did not send an answer: 30%
    • Rejected at the resume review stage for various reasons: 45%
    • Sent a solution close to the correct one: 5%
