Your personal Big Data course

    Hi, Habr!

    After publishing several articles on Big Data and Machine Learning, I received many letters from readers with questions. Over the past few months I have helped many people make a quick start; some of them are already solving applied problems and making progress, and some have already found jobs and are working on real problems. My goal is to be surrounded by smart people I can work with in the future, so I want to help those who genuinely want to learn to solve real problems in practice. The web is full of guides on how to become a data scientist, and at one time I went through everything available there. In practice, however, quite different knowledge is sometimes needed. In today's article I will describe exactly which skills are needed and try to answer all your questions.

    If you google "How to become a Data Scientist", you will come across plenty of pictures like this one or this one. Broadly speaking, everything written there is true. But even after studying all of it, there is no guarantee you will succeed at solving real problems in practice. You can go the route outlined in those images, that is, learn on your own and then go solve real problems; or you can get a formal education. At one time I did both: Coursera courses, the School of Data Analysis, and many other university courses, including computer vision, web graph analysis, large-scale machine learning, and so on. I was lucky to study with excellent teachers and take the best courses available. But only after I began applying that knowledge in practice did I realize that courses often pay too little attention to practical problems, or that those problems are not really absorbed until you run into them. So I will try to outline the minimum set of skills that is enough to start solving problems in practice as soon as possible.

    Become an excellent mathematician

    Yes, this is probably the most important thing: mathematical thinking, which must be developed continuously from an early age. For those who may have missed it, it is worth starting with a discrete mathematics course; this is useful for everyone working in IT, and the proofs and reasoning in later courses build on it. I recommend the course by Alexander Borisovich Dainyak, which I once attended in person; that should be enough. The important thing is to gain skill in working with discrete objects.

    Once you can work with discrete objects, it is worth getting acquainted with building efficient algorithms. For that, a short algorithms course is enough, such as the SHAD course, or a review of well-known algorithms on a site popular among ACM contestants. It is enough to understand how to implement algorithms efficiently, to know the typical data structures, and to know the cases when to use each of them.
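    To make "know the typical data structures and when to use them" concrete, here is a minimal Python sketch (my own illustration, not from any of the courses mentioned) of picking the structure that matches the operation you need:

```python
import heapq
from collections import Counter, deque

# Membership tests: a set gives O(1) average lookups vs O(n) for a list.
seen = {3, 1, 4, 1, 5}
assert 4 in seen

# A deque gives O(1) appends/pops at both ends -- handy for BFS queues.
queue = deque([1, 2, 3])
queue.appendleft(0)

# A heap retrieves the current minimum in O(log n) -- handy for Dijkstra.
heap = [5, 1, 4]
heapq.heapify(heap)
smallest = heapq.heappop(heap)

# Counter tallies frequencies in a single pass over the data.
freq = Counter("abracadabra")
```

Each choice here trades a little setup cost for the asymptotics the algorithm actually needs; that matching, not memorizing APIs, is the skill to practice.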

    After your brain has learned to operate with discrete objects and you have developed algorithmic thinking, you need to learn to think in terms of probability theory. For this I recommend (while also refreshing your knowledge of discrete mathematics) the course by my supervisor Andrei Mikhailovich Raigorodsky, who knows how to explain complex things in simple terms. The important part here is to learn to reason in the language of probability theory and to know the basic concepts of mathematical statistics.

    Strictly speaking, this is not all the mathematics one could use, but in practice it is enough to be comfortable with discrete objects and probabilistic quantities. It is also nice to have some idea of linear algebra, but machine learning courses usually include introductions to the sections they need. Add good programming skills to this and you can become a good developer.

    Learn to write code

    To become a good developer, you of course need to know programming languages and have experience writing good industrial code. For a data scientist, knowledge of scripting languages is usually sufficient; things like templates, classes, and exception handling are, as a rule, not needed, so do not go too deep into them. Instead, it is good to know at least one scripting language oriented toward scientific and statistical computing. The most popular are Python and R, and there are quite a few good online courses for both, for example this one for Python or this one for R; they provide basic knowledge sufficient for a data specialist. Above all, it is important to learn data manipulation, which is 80% of a data scientist's work.
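    As a taste of what "data manipulation" means in Python, here is a tiny sketch using pandas; the data is invented purely for illustration:

```python
import pandas as pd

# A hypothetical log of user purchases (invented for this example).
df = pd.DataFrame({
    "user": ["anna", "boris", "anna", "clara"],
    "amount": [10.0, 25.0, 5.0, 40.0],
})

# Filtering rows by a condition.
big = df[df["amount"] > 9]

# Grouping and aggregation: total spent per user.
per_user = df.groupby("user")["amount"].sum()
```

Filtering, grouping, joining, and aggregating tables like this is the day-to-day bread and butter of the work, long before any model is trained.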

    Take basic machine learning courses

    Once you have acquired a good mathematical culture and programming skills, it is time to start learning machine learning. I highly recommend starting with Andrew Ng's course, which is still the best introduction to the subject. Admittedly, important common algorithms such as decision trees are missing from it, but in practice the theoretical knowledge from this course will be enough to solve most problems. After that, I strongly recommend starting to solve problems on Kaggle as soon as possible, beginning with the tasks in the Knowledge section: they come with good tutorials that walk through the tasks and are aimed at a quick start for beginners. Then you can study the remaining areas of machine learning and take the full machine learning course by K. V. Vorontsov. The important thing here is to get a holistic view of the problems that arise in practice and the methods for solving them, and to learn to put your ideas into practice. It is also worth adding that most machine learning algorithms are already implemented in libraries such as scikit-learn for Python; I published an introduction to scikit-learn earlier.
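    For instance, the decision trees missing from Andrew Ng's course are one `fit` call away in scikit-learn. A minimal sketch on the classic built-in Iris dataset (chosen here just for brevity):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset and hold out 30% for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train a decision tree and measure accuracy on the held-out data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

The same `fit`/`predict`/`score` pattern applies to nearly every estimator in the library, which is why it is worth learning once and reusing everywhere.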

    Practice building algorithms

    Participate in machine learning competitions as much as possible: solve both simple classical problems and problems in non-classical settings, for example when there is no training set. This way you will accumulate the variety of techniques and tricks that come up in real tasks and help significantly increase the quality of the resulting algorithms. I described some practically important tricks earlier here and here.

    After that you are usually ready to build good algorithms and take part in Kaggle competitions with prize money. However, your capabilities are still limited to small data that fits in your machine's RAM. To be able to work with big data, you need to get acquainted with the MapReduce computation model and the tools used for working with big data.

    Get to know big data

    Once you have learned to build good models, you need to learn to work with big data. First, get acquainted with how big data is stored, namely the HDFS file system, which is part of the Hadoop stack, and with the MapReduce computation model. After that, move on to the other components of the Hadoop stack: how YARN works, how the Oozie scheduler works, how NoSQL databases such as Cassandra and HBase work, and how data is imported into a cluster with Apache Flume and Apache Sqoop. There are still few online courses covering these topics; the most comprehensive reference is the book Hadoop: The Definitive Guide. The important thing is to understand how all the Hadoop components interact, and how storage and computation on big data are organized.
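    The MapReduce model itself is easy to grasp on a toy example. Here is a pure-Python sketch (not Hadoop itself, just the idea) of the canonical word count: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line,
    # as a Hadoop streaming mapper would write to stdout.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: sort and group pairs by key (Hadoop does this between
    # the map and reduce phases), then sum the values in each group.
    result = {}
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        result[key] = sum(value for _, value in group)
    return result

counts = reduce_phase(map_phase(["to be or not to be"]))
```

On a cluster, mappers and reducers run in parallel on different machines and the "shuffle" moves data over the network; the logic per record, however, is exactly this simple.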

    Get to know modern tools

    After exploring the Hadoop technology stack, you need to get acquainted with frameworks built on the MapReduce paradigm and with other tools used for computing on big data. I have already described some of these tools. In particular, look at Apache Spark, which has been growing in popularity recently and which we have already examined here, here and here. It is also worth looking at alternative tools you can use even without a cluster, such as Vowpal Wabbit, which builds linear models by training them online, without loading the training set into RAM; we reviewed it earlier. It is also important to learn the simpler tools from the Hadoop stack, Hive and Pig, which are used for basic operations on data in a cluster. The key skill here is learning to implement the machine learning algorithms you need, as you did before in Python; the difference is that you are now working with big data under a different computation model.
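    The "train online, without loading the training set into RAM" idea behind Vowpal Wabbit is worth seeing in miniature. Below is a toy pure-Python sketch of online stochastic gradient descent for a linear model (my own illustration, not Vowpal Wabbit's actual implementation): each example is consumed once from a stream, updates the weights, and is discarded:

```python
def sgd_online(stream, lr=0.1, n_features=2):
    # Online SGD for linear regression: process one (features, target)
    # pair at a time, so the dataset never has to fit in memory.
    w = [0.0] * n_features
    for x, y in stream:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        for i in range(n_features):
            w[i] -= lr * err * x[i]  # gradient step on squared loss
    return w

# Toy stream generated from y = 2*x0 + 1 (the bias is encoded as a
# constant second feature x1 = 1). Several passes simulate a long stream.
data = [((x * 0.1, 1.0), 2 * (x * 0.1) + 1) for x in range(20)]
w = sgd_online(data * 30)
```

After enough examples, `w` approaches the true coefficients (2, 1). Vowpal Wabbit adds feature hashing, adaptive learning rates, and a very fast parser on top of exactly this learning scheme.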

    Explore Real-Time Big Data Tools and Architectural Issues

    Often you want to build systems that make decisions in real time. Compared to working with accumulated data, this area has its own terminology and its own computation model. It is worth getting acquainted with Apache Storm, which is built on the assumption that the unit of information to be processed is a single transaction, and with Apache Spark Streaming, which processes data in small pieces (micro-batches). A natural question then arises: what does a cluster architecture look like in which part of the incoming data is processed online while another part is accumulated for later processing, how do these two components interact, and which tools are used at each stage of storage and processing? For this I recommend getting acquainted with the so-called lambda architecture, which is described in sufficient detail on this resource. The important thing is to understand what happens to the data at each stage: how it is transformed, how it is stored, and how computations are performed on it.
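    To make the lambda architecture concrete, here is a toy pure-Python sketch (an illustration of the idea, not a real system) of its three layers: a batch layer that periodically recomputes a view from the full master dataset, a speed layer that updates incrementally as events stream in, and a serving layer that answers queries by merging the two:

```python
class LambdaSketch:
    """Toy lambda architecture: batch view + real-time speed layer."""

    def __init__(self):
        self.batch_view = {}   # precomputed from the master dataset
        self.speed_view = {}   # incremental counts for recent events

    def run_batch(self, master_dataset):
        # Batch layer: recompute the view from scratch (in a real
        # system this would be a MapReduce or Spark job).
        view = {}
        for event in master_dataset:
            view[event] = view.get(event, 0) + 1
        self.batch_view = view
        self.speed_view = {}   # recent events are now in the batch view

    def ingest(self, event):
        # Speed layer: update incrementally as each event arrives
        # (the role Storm or Spark Streaming plays).
        self.speed_view[event] = self.speed_view.get(event, 0) + 1

    def query(self, key):
        # Serving layer: merge the batch and real-time views.
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)

store = LambdaSketch()
store.run_batch(["page_a", "page_a", "page_b"])  # nightly recompute
store.ingest("page_a")                           # new event streams in
```

A query for `"page_a"` now returns 3: two hits from the batch view plus one from the speed layer. The batch layer's periodic recompute is what lets the speed layer stay small and be reset.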

    So, we have covered far from all the knowledge and skills required to work with Big Data in practice. Real problems often bring many additional difficulties: a training set may simply be absent, or some of the data may be known only approximately. When it comes to truly huge data arrays, technical difficulties begin, and it is important to know not only machine learning methods but also how to implement them efficiently. Moreover, tools that let you process data in RAM are still only emerging and developing; you often have to try hard to cache data correctly, or to deal with well-known issues such as the small-files problem in Apache Spark. All of this comes up in practice!

    Write me your questions

    To repeat: in publishing articles on Habr, I pursue the goal of preparing people to work with Big Data so that I can work with them later. Over the past few months I have helped many people make a quick start. So I really want to get to know you, answer your current questions, and help you start solving problems or make progress on existing ones. If you do not mind, I will follow your progress and help where needed. I will pick the best people and personally mentor them over the next few months, after which I may well have interesting offers for them!

    I do not know how many letters will arrive, so I will say right away that I will answer late in the evening or at night, because I work during the day). I will try to answer as many letters as I can.

    Besides educating people, I also want to show that the Big Data processing methods marketers love to talk about are not a "magic wand" that works wonders. I will try to show which problems are already solved well, which can be solved with some effort, and which are still hard. After receiving your questions, I will write a long post with detailed answers. Let's develop Data Science together, because there are really too few genuine specialists right now, and more than enough expensive courses.

    So, everyone who would like to learn how to solve problems, regardless of your level of preparation: write me an email with the subject Big Data to my address ( ), specifying:

    • Information about yourself: your name, what you do, where you work or study
    • Your experience: what you have tried to learn on your own, what worked and what did not
    • The goals you want to achieve: the most important point; I will not read the letter without it)
    • Your immediate question, if any

    I will be waiting for your letters!
