# Forecasting real estate sales. A talk at Yandex

Success in machine learning projects usually depends not only on the ability to use different libraries, but also on an understanding of the domain the data comes from. An excellent illustration of this thesis is the solution proposed by the team of Alexey Kayuchenko, Sergey Belov, Alexander Drobotov, and Alexey Smirnov in the PIK Digital Day competition. They took second place, and a couple of weeks later they spoke about their participation and the models they built at Yandex's regular ML training session.

Alexey Kayuchenko:

- Good afternoon! We will talk about the PIK Digital Day competition we took part in. A little about the team: there were four of us, all with completely different backgrounds, from different fields. In fact, we only met at the final; the team was formed literally the day before it. I will talk about how the competition went and how we organized our work. Then Seryozha will talk about the data, and Sasha about the submissions, the final stretch, and how we moved up the leaderboard.

Briefly about the competition. The task was very applied: PIK organized the competition and provided its apartment sales data. The training set contained 2.5 years of sales history with attributes for Moscow and the Moscow Region. The competition had two stages: an online stage, where each participant individually tried to build their own model, and a short offline stage, just one day from morning to evening, to which the leaders of the online stage advanced.

Our placing at the end of the online stage was not even in the top 10, and not even in the top 20; we were somewhere past 50th. The final offline stage had 43 teams. Many teams consisted of a single person, although merging was allowed; about a third had more than one member. The final held two competitions. The first was for a model without restrictions: any algorithms were allowed, deep learning, classical machine learning. In parallel, there was a competition for the best linear regression solution. The organizer considered linear regression quite applied too, since the competition itself was very applied: the task was to predict apartment sales volumes given 2.5 years of historical data with attributes.

Our team won second place in the competition for the best unrestricted model and first place in the competition for the best linear regression. A double prize.

About how the final itself went, I can say it was very tense, quite stressful. For example, our winning submission was uploaded literally two minutes before the deadline; the previous one had put us, as I recall, in fourth or fifth place. In other words, we worked to the very end without relaxing. PIK organized everything very well. There were tables, and even a veranda where you could sit outside and get some fresh air; food and coffee were provided. In the picture you can see everyone sitting in their small groups, working.

Sergey will tell more about the data.

Sergey Belov:

- Thanks. PIK provided us with several data files. The two main ones were train.csv and test.csv, which contained roughly 50 features generated by PIK itself. The train set had about 10 thousand rows, the test set about 2 thousand.

What did each row contain? Sales data: the target value was sales in square metres of apartments, averaged over a specific building. There were about 10 thousand such rows. The features PIK included in the sets are shown on the slide, along with the approximate feature importances we obtained.
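The train/target split described above can be sketched with pandas; the feature and column names below are purely illustrative stand-ins, not the real PIK column names.

```python
import pandas as pd

# Minimal sketch of the train.csv layout described above; all column
# names here are hypothetical, not the real PIK feature names.
train = pd.DataFrame({
    "bulk_id": [1, 1, 2],            # building identifier (hypothetical)
    "spalen": [1, 2, 1],             # number of bedrooms (hypothetical)
    "value": [120.5, 98.0, 210.3],   # target: sq. metres sold, averaged per building
})

# Separate the ~50 features from the target, as for the real ~10,000-row set.
X = train.drop(columns=["value"])
y = train["value"]
print(X.shape, y.shape)  # → (3, 2) (3,)
```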

My experience in real estate development companies helped here. Features such as the distance from an apartment to the Kremlin or to the Transport Ring, or the number of parking spaces, do not have a very strong effect on sales. What does matter is the class of the property, the number of bedrooms, and, most importantly, the number of apartments currently on sale. PIK did not generate that feature, but they did provide three additional files: flat.csv, status.csv, and price.csv. We decided to look into flat.csv, because it contained exactly the data on the number of apartments and their status.

If you ask what made our solution successful, it was teamwork. We worked very smoothly from the very start of the competition. Within about 20 minutes we had discussed what we would do and come to a shared conclusion: the first thing is to work on the data, because any data scientist understands that the data holds a lot, and victory often comes from some feature the team engineered. After working on the data, we tried different models, looked at what results our features gave in each of them, and then focused on the unrestricted model and the linear regression model.

We started with the data. First of all, we looked at how the train and test sets relate to each other, that is, whether their feature ranges intersect. They do: in the number of apartments, in the number of bedrooms, and in the average floor.
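The overlap check can be sketched like this: for each shared numeric feature, compare the value ranges seen in train and in test. The feature names and value ranges below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sketch of the train/test range-overlap check described above.
# Feature names and distributions are illustrative, not the real data.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "flats": rng.integers(50, 500, size=100),    # apartments per building
    "spalen": rng.integers(1, 4, size=100),      # bedrooms
    "mean_floor": rng.uniform(3, 25, size=100),  # average floor
})
test = pd.DataFrame({
    "flats": rng.integers(60, 450, size=20),
    "spalen": rng.integers(1, 4, size=20),
    "mean_floor": rng.uniform(4, 22, size=20),
})

# For each column, check whether the train and test value ranges intersect.
overlaps = {}
for col in train.columns:
    lo = max(train[col].min(), test[col].min())
    hi = min(train[col].max(), test[col].max())
    overlaps[col] = lo <= hi
print(overlaps)  # here, every feature's ranges intersect
```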

Next, for linear regression, we applied some transformations: standard things like logarithms and exponentials. For the average floor, for example, we used an inverse Gaussian transformation for linearization. We also noticed that it is sometimes better to split the data into groups. Take the distance from an apartment to the metro, or its number of rooms: these are somewhat different markets, and it is better to split them and build a separate model for each group.
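One way to read the "inverse Gaussian transformation" is a quantile transform that maps a skewed feature onto a normal distribution. This is a guess at what the talk refers to, sketched with scikit-learn on a toy feature:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Toy skewed feature standing in for "average floor"; the real data and
# the exact transform the team used are not shown in the talk.
rng = np.random.default_rng(42)
skewed = rng.exponential(scale=8.0, size=(1000, 1))

# Map the feature onto a standard normal via its empirical quantiles.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
linearized = qt.fit_transform(skewed)

# After the transform the feature is roughly standard normal.
print(round(float(linearized.mean()), 2), round(float(linearized.std()), 2))
```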

From the flat.csv file we generated three features; one of them is shown here. You can see it has a fairly good linear relationship, apart from this dip. What was this feature? It corresponded to the number of apartments currently on sale, and it works very well at low values: you cannot sell more apartments than are on sale. But these files carried a certain human factor, since they are often filled in by hand. We immediately spotted points that fall outside the main region because they were entered slightly incorrectly.

An example with scikit-learn: models based on gradient boosting (GBR) and Random Forest gave an RMSE of 239 without these features and 184 with the three features added.

Sasha will tell you about the models we used.

Alexander Drobotov:

- A few words about our approach. As the guys have already said, we are all different, from different fields, with different education, and our approaches differed too. In the final stage, Lyosha mostly used XGBoost from Yandex (most likely CatBoost is meant - ed.), Seryozha the scikit-learn library, and I LightGBM and linear regression.

XGBoost, linear regression, and Prophet are the three options that produced our best scores. For the linear regression track we blended two models, and for the unrestricted track we used XGBoost with a small admixture of linear regression.

Here is how we submitted solutions and worked as a team. In the graph on the left, the X axis is the public RMSE and the Y axis is the private RMSE. We started roughly from these positions; here are the individual models of each participant. Then, after exchanging ideas and creating new features, we began to approach our best score. The scores of our individual models were roughly the same; the best individual model was XGBoost combined with Prophet. Prophet forecast cumulative sales. There was a feature like the starting square metres: we knew the total number of apartments, we knew the historical values, and the incremental value approaches the total value over time. Prophet produced forecasts for the following periods, and those values were fed into XGBoost.
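The cumulative-sales idea can be sketched even without Prophet: forecast cumulative sales that saturate at the building's total area, then recover per-period increments to feed into the downstream model. The saturating curve below is a toy stand-in for the actual Prophet forecast.

```python
import numpy as np

# Toy stand-in for the Prophet forecast: cumulative sales saturating at
# the total square metres available (the "starting square" from the talk).
total_sqm = 10_000.0
months = np.arange(1, 13)
cumulative = total_sqm * (1 - np.exp(-0.3 * months))

# Recover per-period sales from the cumulative curve; the increments
# sum back to the cumulative total by construction.
incremental = np.diff(cumulative, prepend=0.0)

print(round(float(cumulative[-1]), 1), round(float(incremental.sum()), 1))
```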

The blend of our best individual models is somewhere here, these two orange dots. But that score was still not enough to reach the top.

Looking at the ordinary correlation matrix of our best submissions, we saw the following: the tree models, logically enough, correlated close to one with each other, and the best of them was XGBoost, whose correlation with linear regression was not as high. We decided to blend these two models in an 8-to-2 ratio, and that is how we got our best final solution.
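The final blend is then just a weighted average of the two prediction vectors, 0.8 for the tree model and 0.2 for the linear regression; the prediction values below are made up.

```python
import numpy as np

# Illustrative prediction vectors for the same test rows (made-up values).
xgb_pred = np.array([100.0, 250.0, 80.0])
linreg_pred = np.array([110.0, 230.0, 95.0])

# The 8-to-2 blend described above.
blended = 0.8 * xgb_pred + 0.2 * linreg_pred
print(blended)  # → [102. 246.  83.]
```

Blending a strong model with a weakly correlated one tends to help precisely because their errors differ, which is what the correlation matrix was used to check.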

This is the leaderboard with the results. Our team ranked second among models without restrictions and first among linear models. As for the scores, all the values are pretty close; the difference is not large. The linear regression solution alone already lands around fifth place. That's all from us, thank you!
