How we teach machine learning and data analysis at Beeline



    After long preparation, selection of material, and a preliminary trial run of the course, we launched on October 19: a corporate intensive, practical data analysis course taught by experts in the field. We have now completed 6 classes, half of the course, and this is a brief overview of what we do in them.

    Our primary goal was to create a course with as much practice as possible, so that students could immediately apply what they learn in their daily work.

    We often saw candidates at our interviews who, despite good theoretical knowledge, could not reproduce all the stages of solving a typical machine learning problem due to lack of experience: preparing the data, selecting and engineering features, choosing models, composing them correctly, achieving high quality, and interpreting the results.

    Therefore, practice comes first: we open IPython notebooks and get straight to work with them.

    In the first lesson, we discussed decision trees and random forests and explored feature extraction techniques using the Kaggle “Titanic: Machine Learning from Disaster” dataset as an example.
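    To give a flavor of that first exercise, here is a minimal sketch (not the course notebook; the toy data below is made up) of fitting a decision tree on a few Titanic-like features with scikit-learn:

```python
# Toy illustration: a decision tree on synthetic Titanic-like data.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sample: [passenger class, sex (0 = male, 1 = female), age]
X = [
    [1, 1, 38], [3, 1, 26], [1, 0, 54], [3, 0, 22],
    [2, 1, 27], [3, 0, 35], [1, 1, 35], [3, 0, 20],
]
y = [1, 1, 0, 0, 1, 0, 1, 0]  # survived?

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The fitted tree both classifies new passengers and exposes which
# features drove the splits (tree.feature_importances_).
print(tree.predict([[2, 1, 30]]))
```

    On real Titanic data the same few lines, after the usual Pandas preprocessing, already give a reasonable baseline.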

    We believe that participation in machine learning competitions is necessary to maintain the required level of skill and to constantly grow an analyst’s expertise. Therefore, starting from the first lesson, we launched our own Kaggle InClass competition on predicting car insurance payouts (a very popular topic in business right now). The competition will run until the end of the course.

    In the second lesson, we analyzed the Titanic data using Pandas and visualization tools. We also tried simple classification methods on the car insurance payout prediction problem.

    In the third lesson, we examined the Kaggle task “Greek Media Monitoring Multilabel Classification (WISE 2014)” and the use of blending, the core methodology in most Kaggle competitions.

    In the third lesson, we also examined the capabilities of the Scikit-learn machine learning library and then, in more detail, linear classification methods: logistic regression and linear SVM. We discussed when it is better to use linear models and when to use complex ones, talked about feature importance in a learning task, and covered classification quality metrics.
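    As a small sketch of what that comparison looks like in code (on an assumed synthetic dataset, not the course task), both linear models share the same scikit-learn interface:

```python
# Comparing logistic regression and a linear SVM on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for model in (LogisticRegression(max_iter=1000), LinearSVC(max_iter=5000)):
    model.fit(X_tr, y_tr)  # identical fit/predict API for both models
    scores[type(model).__name__] = accuracy_score(y_te, model.predict(X_te))

print(scores)
```

    The uniform API is exactly what makes it cheap to try a linear baseline before reaching for a more complex model.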

    The main topic of the fourth lesson was stacking, a meta-learning algorithm, and its application in the winning solution of the Otto Group’s Kaggle task “Classify products into correct category”.

    In the fourth lesson, we got to the central problem of machine learning: combating overfitting. We looked at how to use learning and validation curves for this, how to conduct cross-validation correctly, and which metrics can be optimized in the process to achieve the best generalization ability. We also studied methods for constructing ensembles of classifiers and regressors: bagging, random subspaces, boosting, stacking, etc.
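    A minimal illustration of the overfitting diagnosis (on synthetic data, not the course material): an unrestricted decision tree scores almost perfectly on its own training set, while its cross-validated score is noticeably lower.

```python
# Detecting overfitting: train score vs. cross-validated score.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)
deep_tree.fit(X, y)

train_score = deep_tree.score(X, y)                       # near-perfect fit
cv_score = cross_val_score(deep_tree, X, y, cv=5).mean()  # held-out folds
print(train_score, cv_score)
```

    The gap between the two numbers is the signal; tuning depth or pruning until the gap shrinks is the cure.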

    And so on - each lesson is a new practical task.

    Since Python is the main language of the course, we naturally introduce the main Python data analysis libraries: NumPy, SciPy, Matplotlib, Seaborn, Pandas, and Scikit-learn. We devote a lot of time to this, since a data scientist’s work largely consists of calling various methods of these modules.

    Understanding the theory behind the methods is also important. Therefore, we examined the basic mathematical methods implemented in SciPy: finding eigenvalues, singular value decomposition, maximum likelihood estimation, optimization methods, etc. These methods were considered not just theoretically, but in conjunction with the machine learning algorithms that use them: support vector machines, logistic regression, spectral clustering, principal component analysis, etc. This approach, first practice, then theory, and finally practice at a new level with theoretical understanding, is used by many machine learning schools abroad.
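    For instance, two of the primitives mentioned above fit in a few lines of SciPy (a toy 2×2 matrix, chosen so the answer is easy to check by hand):

```python
# Eigenvalues and singular values of a small symmetric matrix.
import numpy as np
from scipy import linalg

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals = linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix: 1 and 3
U, s, Vt = linalg.svd(A)      # for a symmetric positive-definite matrix the
                              # singular values coincide with the eigenvalues
print(eigvals, s)
```

    The same `linalg.svd` call is what PCA reduces to under the hood, which is why we teach the primitive and the algorithm together.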

    Let’s take a closer look at stacking, one of the methods most popular among Kaggle competitors, which we studied in the fourth lesson.

    In its simplest form, the idea of stacking is as follows: take M base algorithms (say, 10) and split the training set into two parts, A and B. First train all M base algorithms on part A and make predictions for part B; then, conversely, train all models on part B and make predictions for the objects in part A. This yields an n x M prediction matrix, where n is the number of objects in the original training set and M is the number of base algorithms. This matrix is fed into another model, a second-level model, which in effect learns from the outputs of the first-level models.



    More often, the scheme is made a little more complicated. To train the models on more data than half of the training sample, the sample is split into K parts. The models are trained on K-1 parts, predictions are made on the remaining part, and the results are entered into the prediction matrix; this is repeated K times, once per part. This is K-fold stacking.
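    The K-fold scheme above can be sketched compactly with scikit-learn (synthetic data and an assumed choice of two base models; `cross_val_predict` handles the fold bookkeeping):

```python
# K-fold stacking: out-of-fold predictions of the base models become
# the features of the second-level model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=400, n_features=15, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# n x M matrix of out-of-fold class-1 probabilities (K = 5 folds).
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

second_level = LogisticRegression()
second_level.fit(meta_features, y)
print(meta_features.shape)
```

    Because every prediction in the matrix comes from a fold the model did not train on, the second-level model learns from honest, out-of-sample outputs.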



    Let’s take a brief look at the winning solution of the Kaggle Otto Group Product Classification Challenge (www.kaggle.com/c/otto-group-product-classification-challenge).
    The task was to correctly classify products into one of 9 categories based on 93 features, whose meaning Otto does not disclose. Predictions were evaluated by the average F1 measure. The competition became the most popular in the history of Kaggle, perhaps because the entry threshold was low, the data was well prepared, and one could quickly take one of the models and try it.

    The winners’ solution used essentially the same K-fold stacking, except that three models were trained at the second level instead of one, and the predictions of those three models were then blended.



    At the first level, 33 models were trained: various implementations of many machine learning algorithms with different feature sets, including engineered ones.

    These 33 models were each trained 5 times on 80% of the training sample and made predictions for the remaining 20% of the data; the resulting predictions were collected into a matrix. That is, 5-fold stacking. The difference from the classical stacking scheme is that, in addition to the predictions of the 33 models, 8 engineered features were added, mostly obtained by clustering the original data (the feature being the labels of the resulting clusters) or by computing distances to the nearest representatives of each class.

    On the outputs of the 33 first-level models and the 8 engineered features, 3 second-level models were trained: XGBoost, a Lasagne neural network, and AdaBoost over ExtraTrees. The parameters of each algorithm were selected by cross-validation, as was the final formula for averaging the predictions of the second-level models.

    In the fifth lesson, we continued studying classification algorithms, namely neural networks. We also touched on unsupervised learning: clustering, dimensionality reduction, and outlier detection.

    In the sixth lesson, we plunged into Data Mining and market basket analysis. It is widely believed that machine learning and data mining are one and the same, or that the latter is part of the former. We explain that this is not so and point out the differences.

    Initially, machine learning was more focused on predicting the labels of new objects based on analysis of a training sample, and it is the classification problem that people most often recall when talking about machine learning.

    Data mining was focused on the search for “interesting” patterns in the data, such as products frequently bought together. Of course, this line is now blurring, and machine learning methods that do not work as black boxes but produce interpretable rules can also be used to find patterns.

    So, a decision tree allows us to understand not only that a given user’s behavior resembles fraud, but also why. Such rules resemble association rules, which, however, originated in data mining for sales analysis, because it is useful to know that seemingly unrelated products are bought together: having bought A, customers also buy B.
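    A toy illustration of such a rule (made-up transactions; real association rule mining uses Apriori or FP-Growth, but plain co-occurrence counts show the idea):

```python
# Confidence of the pair rule "having bought A, they also buy B".
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"}, {"bread", "beer", "chips"},
    {"milk", "beer"}, {"bread", "milk", "beer"}, {"bread", "milk"},
]

item_count = Counter()
pair_count = Counter()
for t in transactions:
    item_count.update(t)
    pair_count.update(combinations(sorted(t), 2))

def confidence(a, b):
    """Confidence of the rule a -> b, i.e. P(b | a)."""
    pair = tuple(sorted((a, b)))
    return pair_count[pair] / item_count[a]

print(confidence("bread", "milk"))  # milk appears in 3 of the 4 bread baskets
```

    A rule with high confidence and non-trivial support is exactly the kind of "interesting pattern" data mining is after.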

    We not only talk about all this, but also let students analyze real data on such purchases themselves. For example, data on contextual advertising sales can help determine which search-phrase recommendations to give to those who want to promote their online store.

    Another important data mining task is the search for frequent sequences. If the sequence <laminate, refrigerator> is frequent enough, then your store is most likely visited by people moving into new homes. Such sequences are also useful for predicting future purchases.

    A separate lesson is devoted to recommender systems. We not only teach classical collaborative filtering algorithms, but also dispel the misconception that SVD, the de facto standard in this area, is a panacea for all recommendation tasks. For example, in the task of recommending radio stations, SVD noticeably overfits, while a fairly natural hybrid approach, combining collaborative filtering with tag-based similarity of users and dynamic profiles of users and radio stations, works very well.
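    To show what SVD-based collaborative filtering boils down to, here is a minimal sketch on a tiny made-up rating matrix (zeros stand in for unrated items; production systems treat missing entries more carefully):

```python
# Low-rank SVD reconstruction of a user-item rating matrix.
import numpy as np

# Rows: users, columns: items. Two obvious taste clusters are built in:
# users 0-1 like items 0-1, users 2-3 like items 2-3.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2  # keep two latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstruction scores items by the taste cluster a user belongs to.
print(np.round(R_hat, 1))
```

    The low-rank approximation recovers the cluster structure, which is precisely the strength of SVD, and also its weakness when, as with radio stations, tastes drift faster than a static factorization can track.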

    So, six lessons have passed; what’s next? Even more. We will analyze social networks, build a recommender system from scratch, and teach text processing and comparison. And, of course, we will cover Big Data analysis with Apache Spark.

    We will also devote a separate lesson to tips and tricks on Kaggle, presented by a person currently ranked 4th in the global Kaggle rating.

    On November 16, we are launching a second course for those who could not join the first one. Details, as usual, at bigdata.beeline.digital.

    You can follow the chronicle of classes for the current course on our Facebook page.
