“I find it difficult to understand the motivation of a data scientist who does not see beauty in mathematics” - Kirill Danilyuk, Data Scientist

Hi, Habr! Data science has long been an attractive field, and more and more people want to change their professional trajectory and start working with big data. Kirill Danilyuk, Data Scientist at RnD Lab, shared the story of his transition into data science and his tips for beginner and advanced data scientists. We also talked about the qualities a data scientist needs, data labeling, the difference between an ML Engineer and a data scientist, current projects, and the teams and people whose work inspires him.



- How did you come to data science? What attracted you to working with data?

- I have a rather atypical background: I came to data from the world of PMs at Yandex (project management - author's note), when I was invited to ZeptoLab, perhaps the best Russian game company. I built them a prototype of an analytics system with dashboards; it was really the first time I wrote code that someone else used. The code was terrible, but it was real practice. Formally, of course, I coordinated the work of two outsourcers, but they wrote their code based on this prototype. I didn't know then that what I was doing, albeit part-time, was data science. So the acquaintance happened quite organically.

Even then, it was clear that a whole shift in the development paradigm was under way: instead of classical imperative programming, where you rigidly specify the conditions, an era was coming in which the machine can learn by itself from data. It was incredibly cool to see this shift, and I really wanted to be among the developers of the new era.

- What difficulties did you face professionally? What were the challenges at the beginning and later on?

- Again, let me remind you that I was originally a project manager, so the career change was quite serious. There were a lot of difficulties. First of all, self-doubt. You see all these smart guys around you who are writing something and speaking a language you don't understand. You see a huge gulf between yourself and them. At the same time, the people around you do not encourage the transition either: it starts to seem to them that you are "doing nonsense and, in general, procrastinating". It is very depressing. Now, of course, a data scientist community has taken shape that will help and encourage you, but it used to be harder. So taking that first step, saying that I would become a data scientist and really moving systematically in that direction despite my previous career, was very difficult.

The turning point was when I read the book “So Good They Can't Ignore You”, which, by the way, is recommended by Andrew Ng, the creator of Google Brain, Coursera, and the famous ML course. The book is about my case: your background and history are not important. If you can show in practice that you really are so good that you simply cannot be ignored, you will be noticed. I was very impressed by this book and decided not to give up on data science. I strongly advise everyone to read it.

- What life hacks can you share with beginners in working with data, both for studying the field and for building a career?

- Everyone comes to data science from a different field, into a different part of it, and with different goals, so there is no single optimal path. But there are some tips.

Data science may seem difficult at first sight - and so it is! The surprising thing, however, is that data science is like an onion: you have to study it layer by layer. This is called the top-down approach. First you look, at a primitive level, at how algorithms work and how you can train a neural network in a couple of lines without actually knowing what happens inside: you simply specify the input data, write a couple of lines of code, and that's it. The first layer of the onion is peeled off. Then you go further. You get interested, and you want to know how it works. You go deeper and look at the code, the implementation. Then you start wondering why the code is written this way. It turns out that there is a theoretical justification. And so on. Keep the interest alive. Start from the top; it inspires. Read Richard Feynman, he wrote a lot about this approach.

Another tip: join the data science community as soon as possible, even if you don't understand much yet but have firmly decided to develop in this field. When I was studying, ODS did not exist yet, nobody encouraged you like that, and there was no organized data science community. One of the reasons I went to the Newprolab program was to find such a community. The key to development is precisely socialization. Do not stew in your own juice, otherwise you will move very slowly.

The third tip (a continuation of the second): start taking part in competitions as soon as possible. Opinions on Kaggle differ, but at the very least it gives you another reason to socialize: join a team. More experienced colleagues will be happy to give you hints and help. Plus, Kaggle gives a good boost to your portfolio, talks, and blog posts. That, by the way, is how the top data scientists became top.

- Besides completing two programs at Newprolab, where else have you studied, and where are you studying now? What programs can you recommend for beginners and for advanced learners?

- I try to learn all the time, because the tasks, especially for us, are constantly changing. I took the more or less basic online courses, such as the Yandex DS specialization on Coursera, the ML nanodegree at Udacity, and their self-driving car course. For beginners, I highly recommend the DS specialization on Coursera; it is probably the most structured course for understanding the approaches and tasks in general. I was also happy with the “Big Data Specialist” program: it is essentially where my entry into data science began, and it really helped me. Once again: at the beginning, do what seems interesting.

For the more advanced, there is a terrific Caltech course, Learning From Data: relatively short, but very practical. It really puts your thinking in order. There is also a wonderful ShAD course by Vorontsov, with the lectures and a textbook publicly available. I also strongly recommend Harvard's Stat 110 course, which covers the fundamental principles of probability theory and mathematical statistics that you definitely need to know. Plus, there is MIT's open library of courses; take a look at the algorithms course, it is very good.

- From your observations: what soft and hard skills are often lacking for both beginners and experienced data scientists to become truly high-class specialists?

- Let's start with soft skills, because those are what people lack. Although data scientist is a technical profession, it is extremely important to be able to present the results of your work properly and attractively. Roughly speaking, it is like an iPhone: not only the internals are good, but also the appearance, the packaging, the story. People need to learn how to present their results: write blog posts, give talks, share code. The best data scientists understand this very well, and they do it. Otherwise you can get stuck in your own hole and go unnoticed even with a great result.

You can talk about hard skills for a long time, but there is one thing that a great many data scientists lack: the skill of writing competent, structured, beautiful code. This is the real scourge of the profession. You have to learn to write beautiful, readable code. If you look at Kaggle, most of the code there is terrible. I understand why: people write the code once and then never use it again. This is standard practice among data scientists, especially beginners. I used to do it myself, but it is bad, because, firstly, you cannot share such code with anyone (people want to read beautiful, readable code), and secondly, you cannot reuse bad code in other projects.

Another fundamental skill is knowing the basics: linear algebra, statistics, discrete mathematics, optimization. And, frankly, you simply need to love mathematics. I find it difficult to understand the motivation of a data scientist who does not see beauty in mathematics. At the same time, it should be noted that the mathematics in data analysis is quite accessible, at the level of the first and second years of university.

- After completing the Big Data Specialist program, you left the corporate world and, together with your classmates, opened a consulting company. Why didn't you want to be an employee of a large company with a bunch of perks? After all, demand on the labor market far exceeds supply, and you are a great specialist.

- The reason is rather interesting: initially, the goal of the consulting company was to build up a set of projects that I could show to a serious company in order to get hired there. After all, if you say you are a data scientist, then show what you can do.

At first, we took on absolutely any data science project for any money, just to show that we could do it. We made a lot of mistakes and stepped on every rake there was to step on. The first year was simply a nightmare, very hard. Looking back now, it is not certain that consulting was a good way to start. Maybe I should have taken a junior position and spent that year working on some project.

We overcame everything. Projects began to come in, our self-confidence grew, and at some point we realized that it is possible to work outside a large corporation with its lengthy projects, approvals, and bureaucracy. It turns out that our projects are now much more interesting and varied than what most large companies could offer me: there are a lot of them, they change often, and you are constantly learning. Of course, now I don't really want to go to a big company.

- Let's talk a little about data labeling. You have a small team at RnD Lab, and you can hardly afford to spend a lot of time labeling data and doing everything by hand. How do you label your data?

- One could talk about data labeling for a very long time! Machine learning algorithms require data. And not just any data, but well-labeled data. And a lot of it. For example, we had a project to determine the quality of scrambled eggs from a photograph. For the algorithms to work, you need to label each photo and outline each of the ingredients (egg white, yolk, bacon) by hand. Can you imagine how much work it is to label a thousand or ten thousand such photos? And that is only to get the data ready. After that, the work only begins.

Now there are many companies that sell labeling: they hire an army of cheap annotators to trace the boundaries of objects by hand. The irony is that in the age of AI, it is low-paid, low-skilled, and unmotivated people who stand behind it.

I want to make this process more technological. For example, in our project we wrote a neural network that labels data in a semi-automatic mode. You first give it 20 manually labeled photos of scrambled eggs and 20 unlabeled ones: it learns from the first twenty and labels the second twenty, albeit not very well. You correct the errors by hand and feed these 20 corrected, auto-labeled photos back for additional training. Now the model is learning from 40 labeled photos. You submit another 20 photos for labeling, correct the errors, and retrain the model on the corrected labels. After several iterations, there are almost no errors left. By the way, I am writing a blog post on Medium about this technique right now.
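The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RnD Lab code: a toy nearest-centroid classifier on single numbers stands in for the neural network, and the hypothetical `oracle` callable stands in for the person who reviews and corrects the proposed labels. All function and parameter names here are invented for the example.

```python
def train(labeled):
    """Toy stand-in for model training: one centroid per class (1-D features)."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Propose the label whose centroid is nearest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


def semi_automatic_labeling(seed, unlabeled, oracle, batch_size=20):
    """Grow a labeled set batch by batch with a human in the loop.

    seed      -- manually labeled (x, y) pairs to start from
    unlabeled -- list of x values without labels
    oracle    -- callable x -> true label (assumption: simulates the human fix)
    Returns (labeled, corrections_per_round).
    """
    labeled = list(seed)
    corrections_per_round = []
    for start in range(0, len(unlabeled), batch_size):
        batch = unlabeled[start:start + batch_size]
        model = train(labeled)             # retrain on everything labeled so far
        corrections = 0
        for x in batch:
            proposed = predict(model, x)   # the model pre-labels the item
            truth = oracle(x)              # the human reviews the proposal
            if proposed != truth:
                corrections += 1           # the human had to fix this one
            labeled.append((x, truth))     # corrected label joins the training set
        corrections_per_round.append(corrections)
    return labeled, corrections_per_round


if __name__ == "__main__":
    # Hypothetical data: values below 0.5 are class 0, the rest are class 1.
    seed = [(0.1, 0), (0.9, 1)]
    pool = [i / 100 for i in range(100)]
    labeled, corrections = semi_automatic_labeling(seed, pool, lambda x: int(x >= 0.5))
    print(len(labeled), corrections)
```

The number of corrections per round is the quantity to watch: in a real project it is the manual effort left in each iteration, and the whole point of the loop is that it should shrink as the labeled set grows.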

There are other options: you can use a simulator, roughly speaking a 3D editor, to generate a lot of images that are already labeled automatically. You have the objects you need, you render them from different angles together with the labels, and that's it. Well, not quite: to the model, such images will still not look like real ones. To bring them closer to real images, you need a technique called domain adaptation, based on GANs. This is the real cutting edge of research right now, and such things are exciting. Just imagine: you simulate a whole world, and any dataset is generated literally out of nothing. Now imagine that the model is trained only in the simulator and then works in the real world. This is simply the future!

- Can you name the teams or individuals whose work in big data you admire and find inspiring?

- Yes, of course! What I really like is not research as such, but its application in products. I will talk about people I know personally; the top experts are easy to google, and they are well known anyway.

If we talk about teams, it is, without question, Yandex's self-driving car team. The guys are building their technology from scratch, in Russian conditions, and they started testing it in winter, something Google never dreamed of. They are doing a great job, and I follow them closely, including their publications and courses. The number of technologies they apply in practice is huge; very few people are lucky enough to use so many different things at once.

The connectome.ai team: the guys are building a computer vision system for production. It is a challenging task, and what they achieve and how they do it is great.

The guys from supervise.ly. They were originally consultants, like us at RnD Lab, but then they built a semi-automatic labeling system, and now they are developing it.

As for people, firstly, it is Erik Bernhardsson, the former head of the Spotify recommender system. He has a terrific blog about data science, and I recommend it to everyone.

Secondly, it is Volodya Iglovikov, known as ternaus on ODS. He comes from physics; his development path is very interesting and extremely motivating when it comes to getting off your backside and starting to work. He has shown by his own example how serious work and competent marketing help move a career forward.

- You were the group coordinator for the Big Data Specialist program and for the program in Luxembourg, and in the fall you will mentor mini-groups on our new online program. Tell me, why do you need all this? After all, you can't make a lot of money here :)

- No, you certainly can't. The point is different: socialization. As I have already said, socialization is the key to developing yourself, not to mention the plain business contacts that are useful for the company. Through my coordinating work, we found some profitable orders for ourselves. Secondly, I simply like passing my knowledge and experience on to people and teaching them how to work with data. In addition, while preparing, I learn many new things myself. I have studied a lot and understand perfectly well how many hours some things cost. Plus, of course, coordinating and mentoring is a challenge, a way to step out of your comfort zone and grow.



- Data Scientist and ML Engineer: what's the difference?

- There was a talk on this topic at Yandex Data & Science. The idea is that the data industry has spawned a whole set of overlapping occupations, and different companies interpret them differently. Data Scientist and MLE are just one such example.

It is believed that a data scientist may not be able to write production code, but must create or adapt theory (for example, from scientific articles) and build models. The actual code is written by ML engineers: professional programmers who are less immersed in the theoretical part and more in the engineering.

This separation works great at Google, for example. There are, of course, strong PhDs who, strictly speaking, may not program at all but are strong in theory. And there are high-class programmers who turn these PhDs' prototypes into beautiful code. But if we talk about small teams such as ours, or even the Yandex teams, there is no time for pure research from scratch; what you can do is take the results of other people's research (in the form of articles or code) and write production code based on them.

Personally, I do not believe in the practical value of a data scientist who does not write code: code is precisely the result of a data scientist's work. If you don't write code, you are most likely a data analyst. There is nothing wrong with that, but it is a different specialization. By the way, many companies sell a plain analyst position dressed up as a data scientist. Because an analyst means Excel and boredom, while a data scientist is “the sexiest profession of the 21st century.”

So I'm for ML Engineer.

- What are your thoughts and plans for the future? Where do you want to move professionally and (who knows!) geographically?

- We, RnD Lab, started out as data science consulting in the general sense. But we quickly realized that it is impossible to do everything at once effectively and that we needed to focus. Now our focus is on computer vision projects, such as our project on food quality recognition. Imagine watching a football game on your desk in 3D. Imagine that you, as the owner of a large store, can see every theft from your shelves. Imagine that your old black-and-white paper photos can be converted to color with added detail. These are exactly the kinds of projects we work on. Right now we are developing two new and incredibly interesting projects that are no less complex than projects at Yandex; we will announce them a little later. We have now made a prototype, and with very high probability we will continue this project; the scale there will be different, and we will expand the team. First of all, I will need data engineers and computer vision engineers who will pick up the prototype and turn it into a system. The customer is great, the system is interesting, and this is a great opportunity to grow as a specialist. Such a project would be a great addition to any portfolio!

So computer vision and its applications (AR/VR, GANs, image and video generation, image and video enhancement, video analytics) are what we focus on. And here we already have excellent expertise and tools.



As for geography: one of my important principles is being able to work 100% remotely from anywhere. None of the large companies will offer you this. If you want to travel all year round and you are a grown-up, organized person, why should you be tied to an office? Read the guys from Basecamp, they wrote a whole book about remote work. We want to be like them; our principles are very similar.

- And finally, the quiz:
Fried eggs or scrambled eggs?

- Omelette.

- Quick, but so-so or long, but perfect?
- Quick, but so-so.

- Business with friends or friendship in business?
- Friendship in business.

- I thought you would choose “long, but perfect.”
- “Long, but perfect” does not work, unfortunately. That was my mistake too; many perfectionists take the approach that everything must be absolutely perfect. I had this approach at ZeptoLab: I wanted to do things perfectly and spent a long time on them, longer than I should have, when that level of quality was not required of me. You must always start from the task.
Now we have a prototyping approach, where you can show a result in a week or two and get feedback. You say: “Look, everything is ready, but at a rough 5% level: the entire pipeline works, there is data, there is a process, a model is being trained, there is a web interface with buttons...” And everything is clear; nobody objects that it is thrown together quickly. And customers understand that, given another 3 months, you will improve it all. This approach works, it is effective, and we are firm supporters of it.



And on September 20, the 9th Big Data Specialist program starts at Newprolab. Come and join data science.
