How to get to the top on Kaggle, or Matrixnet at home

I want to share my experience of taking part in a Kaggle contest and the machine learning algorithms that got me to 18th place out of 1604 in the Avazu competition for predicting the click-through rate (CTR) of mobile advertising. Along the way I tried to recreate the original Matrixnet algorithm, tested several variants of logistic regression and worked on the features. All of this is described below, plus I attach the full code so you can see how everything works.

I divide the story into the following sections:
1. Terms of the competition;
2. Creation of new features;
3. Logistic regression - the delights of an adaptive gradient;
4. Matrixnet - recreation of the complete algorithm;
5. Speeding up machine learning in Python.

1. Terms of the competition


Data provided:
  • 40.4 million records for training (10 days of Avazu ad impressions);
  • 4.5 million records for testing (1 day).

The data itself can be downloaded here.

The negative log-likelihood loss (NLL, i.e. logarithmic loss) was declared as the evaluation criterion:

$$NLL = -\frac{1}{N}\sum_{i=1}^{N}\Bigl(y_i \log p_i + (1 - y_i)\log(1 - p_i)\Bigr)$$

where N is the number of records, y_i is the value of the click variable for record i, and p_i is the predicted probability that the event was a click ("1").

An important property of this loss function is that if p is computed through the sigmoid function, then the partial derivative with respect to the weights (hereinafter, the gradient) is simply (p - y) times the input:

$$\frac{\partial NLL_i}{\partial w} = (p_i - y_i)\,x_i$$

2. Creating new features


First, let's look at the source data we have to work with:
  • click - "0" means there was no click, "1" means there was a click; this is the target for prediction;
  • hour - the date and hour of the ad impression;
  • banner_pos - the position of the banner (presumably first screen, second screen, etc.);
  • site_* and app_* features - information about the placement where the ad is shown;
  • device_* features - information about the device on which the ad is shown;
  • C1, C14-C21 - encoded (anonymized) features (presumably including data on the geolocation of the impression, the time zone and possibly other information).

Sample data


This is not a lot to work with, since we have no historical data on users and, most importantly, we know nothing about what kind of ad is being shown each time. And that is an important factor (you don't click on every ad in a row, do you?).

Here is what we create:
  1. Polynomial features of the 2nd degree (the 3rd degree slows down training too much);
  2. user_id. I tested several options; the best-performing one is device_id + device_ip + device_model + C14 (presumably geography at the city/region level). And yes, device_id is not the same as user_id;
  3. Frequency of contact with the ad. Users who see an ad for the 100th time in a day typically do not react to it the same way as those who see it for the 1st time, so we count the impression number for each user_id.

User id example

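Below is a minimal sketch (not the competition code) of how these two features can be built in plain Python. The column names and the YYMMDDHH format of the hour field come from the Avazu data; the helper make_user_id and the way the running count is stored are my own illustrative choices.

```python
import csv
from collections import defaultdict

def make_user_id(row):
    # user_id as described above: device_id + device_ip + device_model + C14
    return "_".join((row["device_id"], row["device_ip"],
                     row["device_model"], row["C14"]))

counts = defaultdict(int)  # (user_id, day) -> impressions seen so far

with open("train.csv") as f:
    for row in csv.DictReader(f):
        day = row["hour"][:6]            # 'hour' is formatted as YYMMDDHH
        uid = make_user_id(row)
        counts[(uid, day)] += 1
        row["user_id"] = uid
        row["user_freq"] = str(counts[(uid, day)])
        # ...the extended row would then be written out for training
```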

I tried different ideas; the ones above gave the best result. In creating them I relied mainly on my own experience in advertising and on what could be extracted from the Avazu data.

We also make small data transformations aimed, first of all, at getting rid of duplicated information:
  • hour - keep the hour, throw out the day;
  • C1 - I assume the time zone was hidden behind it, so after day 2 I merge it with the hour column;
  • C15 and C16 - we combine them, since the width and height of the banner are easily guessed behind them, and there is no point in keeping redundant features;
  • site_* and app_* - we merge them into placement_*, since the banner is shown either on a site or in an app, and the remaining values are just encoded NULLs that carry no additional information;
  • we remove all values that do not occur in the test data - this helps reduce overfitting.
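As an illustration, here is a minimal sketch of these transformations on a single record (a dict of strings, as csv.DictReader returns them). The transform helper, the merged column names and the NULL_SITE_ID placeholder are mine; the actual hashed value that encodes NULL has to be taken from the data itself.

```python
# Placeholder: substitute the hashed value that effectively means "no site"
# in the Avazu data (an assumption of this sketch, not a known constant).
NULL_SITE_ID = "<encoded NULL observed for site_id>"

def transform(row):
    # hour: keep only the hour of day, drop the date part (format YYMMDDHH),
    # then merge C1 into it (presumed time zone)
    row["hour"] = row["hour"][6:] + "_" + row.pop("C1")
    # C15 and C16 are presumably banner width/height: merge into one feature
    row["size"] = row.pop("C15") + "x" + row.pop("C16")
    # site_* and app_* describe the same thing (where the ad was shown):
    # collapse them into placement_*
    side = "app" if row["site_id"] == NULL_SITE_ID else "site"
    for suffix in ("id", "domain", "category"):
        row["placement_" + suffix] = row[side + "_" + suffix]
        del row["site_" + suffix], row["app_" + suffix]
    return row
```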

I tested all of these changes with logistic regression: each of them either improved the score or sped up the algorithm without worsening the results.

3. Logistic Regression - the delights of an adaptive gradient


Logistic regression is a popular classification algorithm. There are two main reasons for this popularity:
1. It is easy to understand and to implement;

$$p = \sigma(w^{T}x) = \frac{1}{1 + e^{-w^{T}x}}$$

2. The speed and scalability of training on big data thanks to stochastic gradient descent (SGD).

$$w_{t+1} = w_t - \alpha\,(p_t - y_t)\,x_t$$

Using the Avazu data as an example, let's look at how, because of the stochastic gradient, we do not always move exactly toward the minimum:



3.1. Adaptive gradient

However, if we decrease the algorithm's learning rate over time, we will arrive at a more precise solution, since the gradient will stop reacting so strongly to atypical data. The authors of the adaptive gradient (Adaptive Gradient, AdaGrad) suggest using the sum of all previous gradients to progressively reduce the learning rate:

$$w_{t+1,j} = w_{t,j} - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,j}^{2}}}\; g_{t,j}$$

Thus, we obtain some useful properties of the algorithm:
  • A smoother descent toward the minimum (the learning rate decreases with time);
  • The learning rate (alpha) is calculated individually for each feature, which is very important for sparse data, where most features occur very rarely;
  • The calculation of alpha takes into account how much the parameter (w) has already changed: the more it changed earlier, the less it will change in the future.
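A minimal sketch of this update rule for logistic regression, assuming the one-hot/hashed representation described in section 3.2 (each record is a list of feature indices with an implicit value of 1). The variable names, the hashed-space size D and the +1 in the denominator (to avoid division by zero) are my own choices.

```python
import math

D = 2 ** 20            # size of the hashed feature space
w = [0.0] * D          # weights
g2 = [0.0] * D         # per-feature sum of squared gradients (AdaGrad)
alpha = 0.1            # base learning rate

def predict(x):
    # x is a list of feature indices, each with an implicit value of 1
    wTx = sum(w[i] for i in x)
    return 1.0 / (1.0 + math.exp(-max(min(wTx, 35.0), -35.0)))  # clipped sigmoid

def update(x, y):
    p = predict(x)
    g = p - y                       # gradient of NLL w.r.t. w^T x (section 1)
    for i in x:
        g2[i] += g * g
        # the effective learning rate shrinks as past gradients accumulate
        w[i] -= alpha * g / (math.sqrt(g2[i]) + 1.0)
```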

The adaptive gradient starts learning the same way as regular logistic regression, but then reaches a lower minimum:



In fact, if regular stochastic gradient descent is run over the same data many times, it can come close to AdaGrad, but it takes more iterations. In my model I used a combination of these techniques: 5 iterations with the regular algorithm, and then one with AdaGrad. Here are the results:



3.2. Data Transformation for Logistic Regression

For the logistic regression algorithm to work with the data (which comes as text values), it has to be converted to numeric values. I used one-hot encoding: converting the text features into an N x M matrix of "0"s and "1"s, where N is the number of records and M is the number of unique values of that feature. The main reasons are that it preserves the maximum amount of information, and feature hashing makes it possible to work quickly with spaces of millions of features within a logistic regression.

One-hot encoding example


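A minimal sketch of the hashing-trick variant of one-hot encoding mentioned above: each "column=value" pair is hashed into a fixed-size index space, and only the indices of the "1"s are stored. The function name and the dimensionality D are mine; the result is exactly the list of indices that the logistic regression sketch in section 3.1 consumes.

```python
D = 2 ** 20   # dimensionality of the hashed one-hot space

def one_hot_hashed(row):
    # Every categorical value turns into a single index set to "1";
    # everything else in the (implicit) row of length D stays "0".
    # Note: for reproducibility a stable hash (e.g. hashlib) is preferable
    # to the built-in hash(), which is randomized per process.
    return [hash(col + "=" + val) % D
            for col, val in sorted(row.items())
            if col not in ("id", "click")]
```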

4. Matrixnet - assembly at home


After getting good results with logistic regression, I had to improve further. I was interested in understanding how Yandex's Matrixnet works; moreover, I expected good results from it (after all, one of its jobs is to predict CTR for Yandex's search advertising results). If you gather all the publicly available information about Matrixnet (see the full list of links at the end of the article), you can recreate its algorithm. I do not claim that this is exactly how the algorithm works inside Yandex (I do not know that), but in my code and in this article I used all the tricks and hints of them that I could find.

Let's go through what Matrixnet consists of, in order:
  1. The basic building block - the Classification and Regression Tree (CART);
  2. The main algorithm - the Gradient Boosting Machine (GBM);
  3. An update of the main algorithm - the Stochastic Gradient Boosting Machine (SGBM).

4.1. Classification and Regression Tree (CART)

This is one of the classical decision tree algorithms, which has already been covered on Habr (for example, here and here). So I will not go into details; I will only recall the principle of operation and the basic terms.



Thus, the decision tree has the following parameters that define the algorithm:
  • The split conditions in the nodes (x_1 ≥ 0.5);
  • The "height" of the tree (the number of levels with conditions; in the example above there are 2);
  • The prediction rule p (in the example above the mean value is used).

4.1.1. How to choose a feature for the split condition

At each level we need to choose the feature for the split condition that will divide the space in such a way that we make more accurate predictions.



So we go through all the features and all possible splits and pick the points that give a lower NLL value after the split (in the example above that is, of course, x2). Other functions are commonly used to choose the split - information gain and Gini impurity - but in our case we use NLL, since that is what the task asks us to minimize (see the task description in section 1).
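A minimal sketch of such an exhaustive search, simplified to binary (one-hot) features so that each candidate split is just "feature j equals 1 or not"; the leaf prediction here is the plain mean, as in section 4.1. The function names and data layout are mine.

```python
import math

def nll(ys, p):
    # NLL of predicting the constant probability p for a group of 0/1 targets
    p = min(max(p, 1e-15), 1.0 - 1e-15)
    return -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for y in ys)

def best_split(X, y):
    # X[i][j] is 0 or 1 after one-hot encoding, y[i] is the click label
    best_j, best_score = None, float("inf")
    for j in range(len(X[0])):
        left  = [y[i] for i in range(len(y)) if X[i][j] == 0]
        right = [y[i] for i in range(len(y)) if X[i][j] == 1]
        if not left or not right:
            continue
        score = nll(left, sum(left) / len(left)) + nll(right, sum(right) / len(right))
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score
```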

4.1.2. Regularization in CART

For CART, regularization is usually needed so that we do not make overly confident predictions where we are not actually sure. Yandex suggests adjusting the prediction formula as follows:

$$p = \frac{\sum_{i \in leaf} y_i}{N + \lambda}$$

where N is the number of values in the leaf and lambda is the regularization parameter (the Matrixnet developers recommend 100, but you should test it separately for each task). Thus, the fewer values we have in a leaf, the less confident our algorithm will be about the predicted value.
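A one-function sketch of this rule (the function name is mine); it can simply replace the plain mean used in the split-search sketch above.

```python
def leaf_value(ys, lam=100.0):
    # Regularized leaf prediction: with few observations in the leaf,
    # the value shrinks toward zero, i.e. the tree is less confident.
    return sum(ys) / (len(ys) + lam)
```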

4.1.3. Oblivious Trees

In Matrixnet, instead of the classical approach (where, after a split on x1, the next levels of the tree no longer split the data on that feature), so-called oblivious trees are used, which can split values on the same feature across several levels (as if "forgetting" that a split on it has already been made).



In my opinion, using this type of tree is justified primarily when the data has 1-2 features for which narrower splits give better results than splits on features that have not been used yet.
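A minimal sketch of the way oblivious trees are usually represented in public descriptions of Matrixnet-style boosting: one (feature, threshold) condition per level, so a tree of depth d is just d conditions and 2^d leaf values, and the same feature may appear on several levels. The class and its names are mine and are not taken from Yandex's code.

```python
class ObliviousTree:
    def __init__(self, conditions, leaf_values):
        # conditions: list of (feature_index, threshold), one per level;
        # the same feature index may repeat on several levels.
        # leaf_values: list of 2 ** len(conditions) predictions.
        self.conditions = conditions
        self.leaf_values = leaf_values

    def predict(self, x):
        # Each condition contributes one bit of the leaf index,
        # so no explicit branching structure needs to be stored.
        index = 0
        for feature, threshold in self.conditions:
            index = (index << 1) | (1 if x[feature] >= threshold else 0)
        return self.leaf_values[index]
```

For example, ObliviousTree([(1, 0.5), (1, 0.8)], [0.1, 0.2, 0.3, 0.4]) splits on feature 1 twice, which is exactly the "forgetting" behaviour described above.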

4.2. Gradient Boosting Machine

The Gradient Boosting Machine (GBM) uses a forest of short trees, where each subsequent tree does not try to guess the target value itself but tries to improve the forecast of the previous trees. Let's consider a simple example with regression and trees of height 1 (within each tree we can make only one split).



In essence, each tree helps optimize the quadratic error function:

$$L = \sum_{i=1}^{N}\bigl(y_i - F_m(x_i)\bigr)^2, \qquad F_m(x) = F_{m-1}(x) + f_m(x)$$

The main advantage of GBM over CART is a lower risk of overfitting, since we make forecasts based on leaves with a larger number of values. In Matrixnet's GBM the "height" of a tree depends on the current iteration: it starts at 1, grows by 1 every 10 iterations, but never exceeds 6. This approach speeds the algorithm up considerably without significantly degrading the result on the first iterations. I tested this option, but settled on the variant where the transition to the next level happens only after the possibilities of the previous one have been exhausted.

4.2.1. GBM for classification

When working on classification, each subsequent tree must improve the prediction of the previous ones, but this has to be done in such a way that all the trees work toward a single task: classification through the sigmoid function. In this case it is convenient to state the optimization problem the same way as in logistic regression, namely:

$$\min_{F}\; -\frac{1}{N}\sum_{i=1}^{N}\Bigl(y_i \log p_i + (1 - y_i)\log(1 - p_i)\Bigr), \qquad p_i = \sigma\bigl(F(x_i)\bigr)$$

An interesting Yandex solution is to use the logistic regression predictions as the initial forecast p0, and the product of the weights and the features (w^T x) as one of the features.
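A minimal sketch of a boosting loop of this kind, assuming a fit_tree helper that fits a regression tree (for example, the oblivious tree above) to the pseudo-residuals; the initial score is the logit of the logistic regression forecast p0, and each tree is fitted on a random 10-50% subsample, as in the stochastic variant described in the next subsection. All names and the learning rate are mine.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))

def boost(X, y, p0, fit_tree, n_trees=100, rate=0.1, subsample=0.3):
    # F holds the current score (logit) for every training record;
    # start from the logistic regression forecast p0, as described above.
    F = []
    for p in p0:
        p = min(max(p, 1e-6), 1.0 - 1e-6)   # keep the logit finite
        F.append(math.log(p / (1.0 - p)))
    trees = []
    for _ in range(n_trees):
        # stochastic GBM: each tree sees only a random fraction of the data
        idx = random.sample(range(len(y)), int(subsample * len(y)))
        # pseudo-residuals of the NLL loss: y - p (see section 1)
        residuals = [y[i] - sigmoid(F[i]) for i in idx]
        tree = fit_tree([X[i] for i in idx], residuals)
        trees.append(tree)
        for i in range(len(y)):
            F[i] += rate * tree.predict(X[i])
    return trees
```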

4.3. Stochastic GBM

Stochastic GBM differs from regular GBM in that each tree is fitted on a subsample of the data (10-50%). This kills two birds with one stone: it speeds up the algorithm (otherwise we would have to compute the result over all 40.4 million training records) and, to a large extent, removes the overfitting problem.
The final result:



As you can see, the biggest gain still comes from working with the data, not from the algorithms themselves. Although with plain logistic regression it would have been hard to climb above roughly 30th place, where every ten-thousandth of the score counts.

5. Attempts to speed up machine learning in Python


This was my first project in which I implemented the machine learning algorithms myself: the code that makes the predictions does not use ready-made third-party solutions, and all the algorithms and data manipulations happen in the open, which let me understand the essence of the task and the principles behind these tools much better. However, I did build on existing work: for logistic regression it was largely Abhishek's Kaggle code, and for Matrixnet I borrowed a small part of the CART from Stephen Marsland's code for Machine Learning: An Algorithmic Perspective.

Implementation-wise, I started experimenting with the task in R, but quickly abandoned it, since working with big data in it is practically impossible. Python was chosen for the simplicity of its syntax and the large number of machine learning libraries.

The main problem with CPython is that it is VERY slow (albeit much faster than R). However, there are many options for speeding it up; in the end I used the following:
  • PyPy - a JIT-compiling interpreter that runs standard CPython code about 20x faster; the main problem is that there are practically no libraries for numerical work (NumPyPy is still in development), so everything has to be written without them. It is perfectly suited for implementing logistic regression with stochastic gradient descent, as in our case;
  • Numba - its jit decorators let you pre-compile selected functions (the same principle as PyPy), while the rest of the code can take full advantage of CPython libraries. A big plus is that for precompiled functions you can release the GIL (Global Interpreter Lock) and use several CPUs, which is what I did to speed up Matrixnet. Numba has the same problem as PyPy - no support for external libraries - the main difference being that Numba implements some NumPy functions.
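A minimal sketch of the Numba approach described above (not the author's actual code): a purely numeric inner loop compiled in nopython mode with nogil=True, so several such calls can run in parallel threads.

```python
import numpy as np
from numba import jit

@jit(nopython=True, nogil=True)
def logloss_sum(y, p):
    # Accumulate the NLL over a chunk of predictions; the body is pure numeric
    # code, so Numba compiles it and releases the GIL while it runs.
    total = 0.0
    for i in range(y.shape[0]):
        pi = min(max(p[i], 1e-15), 1.0 - 1e-15)
        total += -(y[i] * np.log(pi) + (1.0 - y[i]) * np.log(1.0 - pi))
    return total
```

Chunks of y and p can then be processed in a thread pool, since the GIL is released inside the compiled function.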

I did not get around to Cython, since I was trying to speed things up with minimal effort, but I think next time it would be easier to go straight to GPGPU using Theano / Numba.

The full code of all the data transformations and the training algorithms themselves is available here. Disclaimer: the code is not optimized and is intended only for studying the algorithms.

If you find any inaccuracies or errors in the article or the code, please send me a private message.

References to sources used for the article and when working on algorithms:
