9851754 March 30, 2016 at 14:39

Content forecasting

Introduction

The predictable but long-awaited change of seasons is happening right now. Many of my friends are looking forward to the start of the summer season and are busy updating their gear. The list of absolutely necessary purchases exceeds every imaginable budget for ten years ahead (after all, you still need to rent a freight train to deliver it all), and online classified boards come to the rescue. Hoping to save money, you pick out the things you no longer need, put them up for sale and, anticipating a good deal, start waiting for calls... and there are none. What's the matter? It turns out that a discerning buyer cares not only that “the lawn mower is in excellent condition”, but also about engine power, grass discharge direction, shaft position, running hours, and so on. Not being a specialist in garden equipment, how could you foresee all this? So you start reading other ads on the same topic, time goes by, and your country-house logistics manager has already ordered a barge and two cargo planes. Using one section of a classified board as an example, we will build a predictive model that helps find out what exactly people want to learn from the description of your offer, and also gives a very rough estimate of the number of views of your ad.

Here I have tried to describe the big picture; the details are available via the links to the code and data at the end of the post. The article makes the following assumptions:

The number of views is inversely proportional to the time it takes to sell the item
For other cities (the article covers only the capital) and other categories, the analysis can be done by analogy

Dataset description

Using the Python urllib library, I retrieved 3879 records from one popular site. Topic: dogs; city: Moscow. When selecting ads, I tried to keep only non-commercial offers of dogs looking for a good home, so the breed was not specified. Description of the fields in the sample:

description - full ad text
identificator - ad number on the site
num_counts - the number of views of the ad since it was posted
price - the asking price for the animal; volunteers usually put 100 rubles or do not specify a price at all
start_date - date when the ad was posted
title - the ad title as it appears on the first page

The first 5 entries:

Purpose of the study

To build a model that predicts the number of views per day from the ad description, and to determine the most significant words for this section.

Data preprocessing

The num_counts field contains the number of views since the publication date start_date. Since each ad has been online for a different length of time, the number of views must be divided by the number of days between publication and data collection. This gives a rough estimate of the number of views per day, which is what we will predict. The text is analyzed with the bag-of-words model. So, the plan:

Stemming, so that different forms of the same word are not treated as different features
The date field (start_date) contains the date as a string, so it must be converted to a proper date type for the analysis
The description field is used as the feature, so the text must be converted to a bag-of-words representation with tf-idf weighting. Stop words (prepositions, auxiliary particles, etc.) are removed from the text
After several unsuccessful attempts to fit a regression between the document-term matrix and the average number of views, I decided to split the target variable into intervals (quantiles) and treat the task as classification (hence tf-idf). That is, the model predicts the interval that contains the average traffic for the ad. The interval boundaries were computed on the training set only, so a function is needed that applies the same boundaries to the test set. The whole sample cannot be binned at once, because the test data would then indirectly take part in training
The price field is the asking price for the animal. A high price indicates the sale of a purebred animal, but we are interested in non-commercial ads, so we keep only the records where the price is below 500 rubles or not specified
Splitting into train/test. Training and grid search of hyperparameters with cross-validation are done on train, and the final quality is checked on test. The main metric is accuracy
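
The target-related steps of this plan can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the notebook: the function names, the scrape date and the toy rows are all assumptions made for the example.

```python
import bisect
import statistics
from datetime import date


def views_per_day(num_counts, start_date, scraped_on):
    """Total views divided by the number of days the ad has been online."""
    days_online = max((scraped_on - start_date).days, 1)  # avoid dividing by zero
    return num_counts / days_online


def is_non_commercial(price, limit=500):
    """Keep ads with no price or a price below the limit (rubles)."""
    return price is None or price < limit


def fit_bin_edges(train_values, n_bins=5):
    """Cut points that split the TRAINING values into n_bins equal-sized groups."""
    return statistics.quantiles(train_values, n=n_bins, method="inclusive")


def to_class(value, edges):
    """Index of the right-closed interval (low, high] that contains value."""
    return bisect.bisect_left(edges, value)


# Toy rows (num_counts, start_date, price); invented for illustration.
scraped_on = date(2016, 3, 30)  # assumed data collection date
rows = [
    (120, date(2016, 3, 20), 100),
    (15, date(2016, 3, 15), None),
    (900, date(2016, 3, 1), 15000),  # dropped by the price filter
    (40, date(2016, 3, 26), 0),
]
rates = [views_per_day(n, d, scraped_on) for n, d, p in rows if is_non_commercial(p)]

train_rates = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # stand-in for training targets
edges = fit_bin_edges(train_rates)  # boundaries come from train only
test_classes = [to_class(v, edges) for v in (0.5, 5, 40)]  # test reuses them
```

The point of keeping fit_bin_edges and to_class separate is that the boundaries are estimated once on the training part and then only applied to the test part, so the test data never influences them.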

After all the transformations, the output is a document-term matrix and the target variable mean_count, split into quantiles (I chose 5 intervals).
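
To make the term "document-term matrix" concrete, here is a small pure-Python sketch of bag of words with tf-idf weighting. It uses the plain variant tf * log(N / df); library implementations such as scikit-learn's apply smoothing and normalization by default, so their numbers will differ. The stop-word list and the sample texts are invented for the example, and stemming is left out.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "and", "is", "to", "on", "with"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split into word tokens, drop stop words (no stemming here)."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (vocabulary, rows) with rows[i][j] = tf * log(N / df) for term j in doc i."""
    tokenized = [tokenize(d) for d in docs]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # term frequency within this document
        rows.append([counts[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary])
    return vocabulary, rows


docs = [
    "The dog walks on a leash",
    "Young dog, gets along with cats",
    "Puppy is looking for a home",
]
vocabulary, matrix = tfidf_matrix(docs)
```

Each row of matrix corresponds to one ad and each column to one word. A word that occurs in many ads ("dog" here) gets a lower weight than a rarer one ("leash").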

Exploratory analysis

The number of views per day follows a power-law distribution; perhaps this section is simply not popular:

It is interesting to look at the scatter plot of the number of words against the number of views:

The plot shows that shorter ads get more views. I would suggest the following explanation: in long ads, the author often describes a model of how the future owner and the pet will interact, for example:

If you love peace and quiet at home, Romush will lie down quietly at your feet and gladly watch a movie with you, which you will then certainly discuss together over a cup of hot chocolate with cheesecakes. With him you will be cozy and warm on cold evenings. If you have children and your house looks like a “children's dreamland”, Romush will run ahead with a cry of “Banzai”, amusing the kids, who will simply squeal with delight at their new friend!

Since all people are different, such an ad can immediately put off those who picture life with a pet differently. I am not sure this is good: the described model is the volunteer's highly subjective view, and a reader loses interest in the ad not because the dog does not suit them, but for an unrelated reason - they tried on the wrong model. The second possible reason is descriptions of the hard life in the shelter. There is no doubt that life there is no picnic, but an average reader of such a text may experience severe stress and unconsciously try to forget it as a traumatic memory (this is my subjective hypothesis).

Baseline for model

The target variable was divided into 5 intervals (that is, classes):

(13.599, 324] 454
[0.0888, 1.184] 454
(5.334, 13.599] 453
(2.436, 5.334] 453
(1.184, 2.436] 453

That is, there are 454 records where the target variable falls in the interval (13.599, 324], and so on. If you always predict one particular interval, the share of correct answers will be approximately 0.2; we take this value as the baseline that we would like to improve on.
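
The baseline figure can be checked directly from the class counts listed above. The short calculation below only restates that arithmetic; the variable names are mine.

```python
# Class counts as listed above (interval -> number of records).
class_counts = {
    "[0.0888, 1.184]": 454,
    "(1.184, 2.436]": 453,
    "(2.436, 5.334]": 453,
    "(5.334, 13.599]": 453,
    "(13.599, 324]": 454,
}

total = sum(class_counts.values())
# Accuracy of always predicting the most frequent class.
baseline_accuracy = max(class_counts.values()) / total
print(f"{total} records, constant-prediction accuracy = {baseline_accuracy:.3f}")
# prints: 2267 records, constant-prediction accuracy = 0.200
```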

Model

After several experiments, I chose a random forest as the classifier. Its hyperparameters were tuned by grid search with 5-fold cross-validation. Training takes approximately 15-20 minutes on an Intel i7. The mean cross-validation accuracy was 0.386, almost twice that of the constant prediction. On the held-out sample, which had not been used before, accuracy = 0.384. The tables below show that the classifier distinguishes the extreme intervals ([0.0888, 1.184] and (13.599, 324]) better than adjacent ones:
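
Tables of this kind are usually confusion matrices: rows are the true classes, columns are the predicted ones. Here is a minimal sketch of how such a table, the accuracy and the per-class share of correct answers can be computed. The labels below are invented toy values, not the results of the model from this article.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] = number of samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """For each true class, the share of its samples that were predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Toy labels for 5 classes (0 = fewest views per day, 4 = most); purely illustrative.
y_true = [0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 4]
y_pred = [0, 0, 1, 2, 1, 1, 3, 3, 2, 4, 4, 3]

cm = confusion_matrix(y_true, y_pred, n_classes=5)
acc = accuracy(cm)
recall = per_class_recall(cm)
```

In this toy example the errors fall into neighbouring classes and the extreme classes have the highest recall, which is the kind of pattern described above.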

Perhaps the quality of the model could be improved by adding photos to the text. To extract features from photos, one could try convolutional neural networks, for example AlexNet.

Word importance

Let's look at the top 50 words that matter most for the classification:

The graph does not contradict intuition: people want to know how old the animal is and what sex it is, whether the dog walks on a leash, whether it suits families or single people better, and how it gets along with other pets. We can conclude that this is the minimum information that should be included in the ad.

Source code

Data set and IPython notebook

Conclusion

We have already seen that ads in the “animals as a gift” section do not get many views, and ads from shelters get even fewer than those from private individuals. Perhaps this is because people are poorly informed and hold various prejudices. Here are some facts:

Ads are posted by volunteers whose interest is to provide the best possible conditions for the animal in their care. They are not paid for the number of animals they place. If problems arise, you can return the animal. So a volunteer has no incentive to palm off a sick pet on you. If the animal requires special care, this is always discussed in advance, and you can count on all kinds of (reasonable) support from the volunteer
Shelters monitor the epidemiological situation; otherwise, given the stress and mediocre food, all the animals would have died long ago
Shelters have many animals that used to be pets but got lost, ran away from their owners during a car accident or some other incident, or simply became unwanted and inconvenient. That is, these are not wild wolves
Volunteers work with every animal you see in an ad at least once a week, and often more: they walk it on a leash and teach it commands, so it is in constant contact with people
You can also take part in this
There are cats in shelters too
Shelters also have small and medium-sized dogs

If you ever want to get a pet, be sure to check: maybe someone is already looking at you from a photo here:

Acknowledgments

This analysis was carried out as part of the final project of the “Machine Learning and Data Mining” course at the Higher School of Economics, so many thanks to our teachers for their patience and work, as well as to my supervisor.

P.S. Please report any inaccuracies and typos by private message!

UPD The user andraszsom, as part of a Kaggle competition, posted an analysis of how different outcomes in the shelter (euthanasia, adoption by a family, etc.) depend on breed, age and other features, link .
