IDFinance April 11, 2018 at 13:14

Five myths about Data Science

My name is Ivan Serov, and I work in the Data Science department of the fintech company ID Finance. Data science is a fairly young but very popular profession, and it has become overgrown with myths. In this post, I will talk about a few misconceptions that novice data scientists (DS) run into.

A DS does not have to understand the business

A good DS should not only be able to build a good model, but also understand why the model is needed, and be ready to say that it is not needed when that is the case. For example, on one of our projects we were building a model to predict whether there would be money on the client's account, so that it could be debited by a special algorithm. While building the model, we realized it was not needed: it was easier to slightly improve the existing algorithm. Sometimes the cost of a DS's time far exceeds the revenue from the new model being developed. In that case, the DS should discuss with the project manager whether the model is needed at all, and do something more useful.

Complex algorithms are always better

XGBoost, LightGBM, Random Forest... These algorithms are named as the first choice for any task, and many beginners do not even try to start with something simpler. But problems begin when the data turns out to be sparse, with 10,000 variables and 20,000 rows, and XGBoost shows a Gini of 0.2 (AUROC 0.6). In a case like that, a simple SVM with a nonlinear kernel, which gave a Gini of 0.8, is the better choice. Simple models sometimes work better than complex ones.
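For readers more used to AUROC than to Gini: the two are linked by Gini = 2 * AUROC - 1, so a Gini of 0.2 is an AUROC of 0.6, and a Gini of 0.8 is an AUROC of 0.9. Below is a minimal sketch of that conversion in plain Python. The function names are made up for illustration, and the pairwise AUROC is only meant for small examples, not for production-sized data.

```python
def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of rows,
    so this is for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_from_auroc(auc):
    """Gini coefficient of a classifier from its AUROC."""
    return 2.0 * auc - 1.0


def auroc_from_gini(gini):
    """Inverse conversion: AUROC from the Gini coefficient."""
    return (gini + 1.0) / 2.0
```

Reporting both numbers side by side, as in the paragraph above, makes it harder to mistake a Gini of 0.2 for a usable model.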

If you want to become a great DS, go to a big company

Every day we hear from large companies about their new projects: how artificial intelligence improved one process by 10%, another by 20%, and so on. This can give the impression that things only happen in large companies, and that smaller companies have neither interesting projects nor good DS. Fortunately, this is not so. Having worked in one of the largest banks, one that positions itself as digital, I can say that startups have more interesting projects. The slow pace of projects in large companies has already become proverbial and a source of memes. For example, a project in a bank can take three months or half a year, and in that time a startup can complete several projects. Conclusion: the PR of large companies is often just PR.

Project managers get paid more than good specialists

Those who outgrow the middle level often ask where to go next. There are really two options: Lead Data Scientist (team leader) and Senior DS. A lot has already been written about the difference between the levels (for example, here is a good post from Victor Kantor). I will only say that a good specialist's salary can be much higher than a team leader's, so the choice should be driven only by what you want to do. Usually, after several years of work, burnout sets in, and all the tasks start to seem the same and annoying. At that point you need to either look for something new (fortunately, market leaders such as Nvidia, Amazon or Yandex always come up with something) or go into management (Lead DS -> Chief DS -> CDO), which is what many people choose.

A DS does not have to deploy a model or test its results

Many will disagree and say that there are now data engineers who should deploy these models. But a DS should still take care to make the data engineer's life easier, or at the very least:

Write clean code that is easy to understand
Think about how variables are encoded. For example, a LabelEncoder can easily be saved as a .pkl file, but frequency encoding can be a problem on new data
Think about how A/B tests will be run later (by the way, evaluating the model after it goes into production in most cases still falls to the person who developed it)
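One plausible reading of the frequency-encoding problem from the list above: if frequencies are recomputed on each new batch, or a category appears that was never seen in training, the model receives values it was not trained on. A minimal sketch of one way around this is to freeze the frequency table at training time and ship it next to the model. The class name, the fallback value of 0.0 for unseen categories, and the sample city codes are all assumptions made for illustration.

```python
import pickle
from collections import Counter


class FrequencyEncoder:
    """Hypothetical helper: maps each category to its share of the training rows."""

    def __init__(self, unseen_value=0.0):
        # Assumption: categories never seen in training get this fallback value.
        self.unseen_value = unseen_value
        self.freq_ = {}

    def fit(self, values):
        values = list(values)
        if not values:
            raise ValueError("cannot fit on an empty column")
        total = len(values)
        self.freq_ = {cat: n / total for cat, n in Counter(values).items()}
        return self

    def transform(self, values):
        # Frequencies come from the frozen training table, not from the new batch.
        return [self.freq_.get(v, self.unseen_value) for v in values]

    def to_bytes(self):
        # Pickle only plain Python data, so loading does not depend on
        # where this class happens to be importable from.
        return pickle.dumps({"unseen_value": self.unseen_value, "freq": self.freq_})

    @classmethod
    def from_bytes(cls, blob):
        state = pickle.loads(blob)
        encoder = cls(unseen_value=state["unseen_value"])
        encoder.freq_ = state["freq"]
        return encoder


# Fit at training time, then restore on the serving side.
encoder = FrequencyEncoder().fit(["msk", "msk", "spb", "kzn"])
restored = FrequencyEncoder.from_bytes(encoder.to_bytes())
print(restored.transform(["msk", "spb", "unknown_city"]))  # [0.5, 0.25, 0.0]
```

Whether 0.0 is the right fallback depends on the model; the point is only that the choice is made once, explicitly, and travels with the model.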

Many companies have no data engineers at all, and the DS do everything themselves. Another possible situation is that the model talks to your service through an API built by one of the IT specialists, who may not know anything about data science. In this case, the DS can write a data-processing module, export the model as a .pkl file, and build a ready-made executable that takes a JSON request as input and returns the answer as JSON. A separate point about testing: when building a model, it is important to think about future A/B tests, choose the right metric, and understand the economic effect of the model.
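The JSON-in, JSON-out executable described above could look roughly like the sketch below. Everything specific here is an assumption for illustration: the feature names, the response fields, and the idea that the pickled object exposes a scikit-learn-style predict method that accepts a list of rows.

```python
import json
import pickle
import sys

# Hypothetical feature names, in the order the model expects them.
FEATURES = ["age", "income"]


def load_model(path):
    """Load a pickled model; only unpickle files you produced yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


def handle_request(raw_request, model, features=FEATURES):
    """Turn one JSON request string into one JSON response string."""
    try:
        payload = json.loads(raw_request)
        row = [float(payload[name]) for name in features]
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing field, or a non-numeric value.
        return json.dumps({"status": "error", "message": str(exc)})
    # Assumption: model.predict takes a list of rows and returns one value per row.
    score = float(model.predict([row])[0])
    return json.dumps({"status": "ok", "score": score})


def main(argv):
    # Usage: python score.py model.pkl < request.json
    model = load_model(argv[1])
    print(handle_request(sys.stdin.read(), model))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Keeping the parsing and scoring in handle_request, separate from file and stdin handling, means the IT specialist who builds the API can call the same function directly, and the error path can be tested without a real model.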

I hope this post has shed light on some of the issues novice data scientists face and that it will help someone. In the following posts I will look at some of these myths and misconceptions in more detail.

What myths have you encountered most often?

A little about us:

The fintech holding ID Finance specializes in data science, credit scoring and non-bank lending. The company develops the brands MoneyMan, AmmoPay, Solva and Plazo in Russia, Spain, Kazakhstan, Georgia, Poland, Brazil and Mexico. ID Finance's R&D center is located in Minsk. The founders of the company, Alexander Dunaev and Boris Batin, are former top managers of Deutsche Bank and Royal Bank of Scotland. ID Finance's investors include the venture capital fund Emery Capital. The company took 36th place in the Financial Times ranking of the fastest growing companies in Europe in 2018. Since 2012, ID Finance's companies have funded loans totaling more than EUR 275 million. At the beginning of 2018, the company's total loan portfolio amounted to USD 77 million. Forbes, Business Insider, Finextra, VentureBeat, Crowdfund Insider, The Banker and the BBC write about us.
