Using CatBoost models inside ClickHouse. Yandex lecture

    In what situations is it convenient to use pre-trained machine learning models inside ClickHouse, and why is CatBoost best suited for this task? Not long ago we held a meetup dedicated to using these two open-source technologies together. Developer Nikolai Kochetov spoke there, and we decided to share his lecture with you. Nikolai walks through the problem using a purchase-probability prediction algorithm as his example.


    - First, about how ClickHouse works. ClickHouse is an analytical distributed DBMS. It is columnar and open source. The most interesting word here is “columnar”. What does it mean?

    Here is how data is stored in conventional databases: row by row, written to the file sequentially. So if we need to read only one attribute, we are forced to read whole rows. That is due to the nature of hard drives: you cannot read a single byte, and random reads are much more expensive than sequential ones. As a result, we read a lot of extra data.

    In ClickHouse it is different. ClickHouse stores data by columns, so when we need to read only one attribute, one column, we read exactly one file. That is much faster, as the slide shows.

    Additionally, ClickHouse compresses the data, and storing it by columns gives us an advantage here too: values of the same column lie next to each other and are similar, so they compress better.

    Compression also saves us disk space and allows us to execute queries faster in many scenarios.

    How fast is ClickHouse? We ran speed tests at the company ourselves, but instead I will show you independent benchmarks done by outsiders.

    The graph compares ClickHouse, Spark, and the Amazon Redshift database. ClickHouse is many times faster on all queries, even though the systems are similar in design: they are also columnar.

    ClickHouse supports SQL syntax. It is not strictly standard SQL but a dialect. ClickHouse also has special functions: for working with arrays, with built-in data structures, or functions specific to working with URLs. Inserting into ClickHouse is fast too: we insert the data and can work with it immediately. So it is online not only for queries but also for inserts.

    Speaking of "distributed": ClickHouse can run on a laptop, but it is equally at home on hundreds of servers, as in Yandex.Metrica.

    Let's move on to the second part. We will use an auxiliary task: predicting the probability of a purchase. The task is quite simple. Say we have data from a store and want to benefit from it, to learn more about the users. We could estimate the probability of a purchase and do something with it. I will not go into detail about what exactly; here is just one example. You can divide users into two categories, good and not so good, show more advertising to the good ones and less to the others, and either save on advertising or get a higher conversion rate.

    Let's look at the workflow. It is very simple and familiar to all of you. Three stages. The first is working with features: collecting the data and building a sample.

    The second stage is training the model and evaluating its quality.

    The latter is the use of the model in production.

    Where do we get the data? We could take it from Yandex.Metrica, where the data is stored inside ClickHouse, and we will use ClickHouse as well. But we fetched the data through the Metrica Logs API: we acted as external users and downloaded the data using only what is accessible to outside people.

    What data could we get? First, information about the pages viewed: which products the user looked at and visited. Also the state of the user's cart and the purchases made from it.

    Finally, the device the user came from: browser type, desktop or mobile client.

    The fact that we store data in ClickHouse gives us an advantage: we can compute almost any feature we can think of, since ClickHouse stores the data in raw, non-aggregated form. Here is an example: computing the average session duration.

    The example has a few notable features. The first is date clipping: we used start_date and end_date, and ClickHouse handles this efficiently; it will not read what we did not request. We also used sampling: we said we would read 1/10 of the data. ClickHouse does this efficiently too, discarding the unneeded rows immediately after reading the data from disk.
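    The session-duration query from the slide is not reproduced in the text, but such a feature query might look roughly like this (the table and column names are assumptions for illustration, not the real Metrica schema):

```sql
-- Hypothetical schema; illustrates date clipping and sampling.
-- ClickHouse reads only the requested column and date range,
-- and SAMPLE 1/10 discards the unneeded rows right after reading.
SELECT avg(Duration) AS avg_session_duration
FROM visits
SAMPLE 1/10
WHERE StartDate >= '2018-01-01'
  AND StartDate <= '2018-01-31'
```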

    A few words about training the models. What classifiers did we use?

    We have a trained model and it gives good quality. How do we now put it into production?

    The question seems simple, so let's do it simply: every Monday, export the data from ClickHouse for the previous month, run Python scripts, compute the purchase probabilities, and load them back, say, into the same ClickHouse table. It all works, is simple to write and efficient enough, but there are some drawbacks.

    The first drawback is that the data has to be exported. This can be slow, and the bottleneck is most likely not ClickHouse but the Python scripts.

    The second drawback is that we prepare the answers in advance: if we computed the purchase probability for Monday through Sunday, we cannot simply ask for Wednesday through Thursday, for example, or for last year.

    One way to solve this is to look carefully at the model. The model is simple: additions, multiplications, conditional branches. Let's write the model in the form of ClickHouse queries and both shortcomings disappear at once. First, no data export; second, we can substitute any dates and everything still works.

    But the question arises: what algorithms can we transfer to the DBMS? The first thing that comes to mind is the use of linear classifiers.

    To apply one, you take the dot product of the weight vector with the feature values and compare it with some threshold. For example, I trained logistic regression on our sample and got a log loss of 0.041. Then I checked how fast this works in ClickHouse: I wrote a query and it ran in under half a second. Not a bad result.
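    As a sanity check of the arithmetic involved, here is a minimal Python sketch of applying a linear classifier: a dot product, a sigmoid, and a threshold. The weights and feature values are invented purely for illustration, not taken from the trained model.

```python
import math

def predict_proba(weights, bias, features):
    # Logistic regression: dot product of weights and features,
    # then sigmoid -- exactly the arithmetic a SQL query can express.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and features, purely for illustration.
proba = predict_proba([0.8, -1.2, 0.5], -0.3, [1.0, 0.5, 2.0])
will_buy = proba > 0.5  # compare with a threshold
```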

    If we have something more complicated than a linear model, trees for example, they can also be written as a ClickHouse query. What did I do? I took the iris dataset. This dataset is nice because you can train a small tree of depth two on it and still get an error under 5%. If you are lucky enough to have such an easy dataset, a small tree is all you need.

    You can write the query in the form of nested conditional functions, and that works well too. Gradient boosting or a forest, for example: on top of the trees you just need to compute a sum or an average. I tested this as well with a small forest, 100 trees of depth 3; the quality came out somewhat worse, and it runs in ClickHouse in 2.5 seconds.
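    The idea of a tree as conditionals can be sketched in Python: the structure mirrors nested if() expressions in a SQL query, and an ensemble is just a sum or average over such trees. The iris-style split thresholds below are invented, not the actual trained tree.

```python
def tree_predict(petal_len, petal_width):
    # A depth-2 decision tree written as nested conditionals --
    # the same shape as nested if() expressions in a ClickHouse query.
    # Thresholds are hypothetical, for illustration only.
    if petal_len < 2.5:
        return 0.0          # class 0 (setosa-like)
    else:
        if petal_width < 1.7:
            return 1.0      # class 1 (versicolor-like)
        else:
            return 2.0      # class 2 (virginica-like)

def ensemble_predict(trees, features):
    # A forest (or boosting ensemble) just averages (or sums)
    # the outputs of individual trees -- also expressible as one
    # long SQL expression, which is exactly what gets unwieldy.
    return sum(t(*features) for t in trees) / len(trees)
```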

    Not everything is bad: had I trained more trees, the quality would have been better. But why didn't I? Here is why.

    This is what one third of the query for the 100-tree forest looks like. You can see it really does resemble a forest. And if you make the query even longer, ClickHouse simply starts to slow down just parsing it.

    And we picked up additional drawbacks. Besides the query being very long, we have to generate it ourselves, and it can be slow inside the database itself. For example, to apply a tree, ClickHouse is forced to compute the value in every node of the tree, and that is inefficient: we really only need to compute the values along the path from the root to a leaf.

    There is a problem with performance, something can slow down and work inefficiently. This is not what we want.

    We would like something simpler: tell the database to handle applying the model by itself. We just say "the machine-learning library lives here; give me a modelEvaluate function and we will pass it the list of our features". This approach, first, keeps all the advantages of the previous one and, second, shifts all the work to the database. There is only one drawback: your database must support using such models.

    In ClickHouse we implemented this for CatBoost, and of course we tested it. We got the best quality of all, even though I trained ... It takes only about 4 seconds, which is reasonably good compared with the forest of 100 trees of depth 3.

    What can be seen in the end?

    If what matters to you is speed, the speed of applying the algorithm, and quality is less important, then you can use logistic regression; everything will work quickly.

    If quality matters to you, train CatBoost and use it. How to use CatBoost in ClickHouse?

    You need to take a few fairly simple steps. The first is to describe the model's configuration, only four significant lines. The model type: for now it is CatBoost; maybe someday we will add more.

    The name of the model, to be used as a parameter to the modelEvaluate() function.

    The location of the trained model on the file system.

    The update interval. Here it is 0, which means we will never re-read the model from disk. If it were 5, the model would be re-read every 5 seconds.

    Next, you need to tell ClickHouse where the configuration file is located; this is the bottom parameter on the slide. It uses an asterisk, which means that all files matching this mask will be loaded.

    You also need to tell ClickHouse where the machine-learning library itself is located, namely the wrapper module for CatBoost.
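    Put together, the configuration steps above might look roughly like this. The paths and the model name are assumptions, and the exact tag names should be checked against the current ClickHouse documentation:

```xml
<!-- A model description file matched by the *_model.xml mask -->
<models>
    <model>
        <type>catboost</type>                        <!-- model type -->
        <name>purchase_model</name>                  <!-- name passed to modelEvaluate() -->
        <path>/opt/models/purchase_model.bin</path>  <!-- trained model on disk -->
        <reload_interval>0</reload_interval>         <!-- 0: never re-read from disk -->
    </model>
</models>
```

    In the main server config there would additionally be a models_config entry pointing at the *_model.xml mask and a catboost_dynamic_library_path entry pointing at the CatBoost wrapper library.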

    That's it: once we have specified all this, we get the modelEvaluate() function, which takes the model name from the configuration file as its first argument; the remaining arguments are our features. The numerical features must come first, then the categorical ones. But don't worry if you mix something up: first, ClickHouse will tell you there is an error; second, you can measure the quality and see that things are not as you expected.
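    A call might then look roughly like this (the model name, table, and feature columns are made up for illustration):

```sql
-- Numerical features first, then categorical ones.
SELECT modelEvaluate('purchase_model',
                     PageViews, Duration,      -- numerical features
                     DeviceType) AS raw_score  -- categorical feature
FROM visits
```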

    To make testing more convenient, we also added support for reading a CatBoost data pool. Some of you have probably worked with CatBoost from the console rather than from Python or R. In that case you had to use a special format to describe your dataset. The format is quite simple, just two files. The first is a TSV describing the columns: the feature index, its type (categorical, numerical, or target), plus an optional name. The second file is the data itself, also a TSV.
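    A column-description file for a tiny pool might look like this (tab-separated; the type keywords follow the CatBoost format of that time, and the column names are optional and invented here):

```
0	Target
1	Num	page_views
2	Categ	device_type
```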

    So in ClickHouse we added the table function catBoostPool, which takes these two files as arguments and returns a special temporary table of type File, from which the data can be read.

    I created a small CatBoost pool with three columns: two features and a target. The resulting table has five columns: two features, one target, and two more additional ALIAS columns.

    Note also that the numerical and categorical features have swapped places. This is done for convenience, so that you can pass the parameters to the modelEvaluate function in the required order.

    If we look at what a SELECT * query produces instead, there are fewer values: only the two features remain, in the right order. Again, this is so that you can conveniently use the CatBoost pool, namely pass the asterisk as the remaining arguments to the modelEvaluate function.

    Let's try to apply it. First, let's compute the probability of a purchase. What do we use? We read from the CatBoost pool, call the modelEvaluate function, and pass the asterisk as its second parameter.

    The last, third line applies the sigmoid.

    Let's complicate the query and compute the quality. We write the previous query as a ClickHouse subquery. In total we added two significant lines, computing the logloss metric: we took the average and got the value over the whole table.
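    The two steps just described, sigmoid and then logloss averaging, are easy to verify in a few lines of Python. The raw model scores here are invented for illustration:

```python
import math

def sigmoid(raw):
    # Turn a raw model score into a probability.
    return 1.0 / (1.0 + math.exp(-raw))

def logloss(targets, probs):
    # Mean negative log-likelihood, the metric named in the talk.
    eps = 1e-15
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip away exact 0/1
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(targets)

probs = [sigmoid(r) for r in (2.0, -1.5, 0.3)]  # hypothetical raw scores
quality = logloss([1, 0, 1], probs)
```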

    I was curious how fast such a query runs. I ran three tests and got this chart.

    On the left is reading from a table, on the right from the pool. You can see the bars are almost the same; there is barely any difference, but reading from the ClickHouse table works a little faster, apparently because ClickHouse compresses the data and we spend less time reading from the file, while the CatBoost pool is uncompressed, so we spend time reading and parsing it. The table is a bit faster, but the CatBoost pool is still convenient to use for tests.

    To summarize: the integration means we can apply trained models, and we can use the CatBoost pool and read data from it, which is quite convenient. The plan is to add support for other kinds of models. Training inside the database is still an open question; we have not yet decided whether to do it. If there are many requests, we will seriously consider it.

    We have Russian and English groups on Telegram, where we actively respond, a GitHub, and a mailing list. If you have questions, you can send them to me by mail, in Telegram, or in the Google group. Thanks!
