How we won SmartMailHack 2
Last weekend (July 14-15), another SmartMailHack hackathon took place at the Mail.Ru Group office. We were asked "to come up with a feature that gives access to data from Mail and lets users interact with it more effectively."
Description of data, ideas and solutions
We received a test mailbox with more than 1,500 messages, as well as full access to it through the API. The organizers provided a large, detailed manual on its use (a 547-page booklet). Using a token and simple JSON requests, we could get all the information we needed about the mail: letters, sender names, and various attributes.
After realizing that each of us has several thousand unread newsletter messages in our mailboxes, we decided to tackle this problem. In this situation, the relevance of a letter is no longer determined by when it arrived in the mailbox. And if we assume that not all letters will be opened, then it is better to show the user only those they are likely to open. Everything else can go to hell. So we decided to build sorting for the mail.
Sorted letters were to be split into categories, with the categories placed inside tiles (hello, Trello). The top row of tiles grouped letters from different senders by meaning. These could be “Trips”, “Registrations”, “Correspondence with Vasya”, “Events”, “Finances” and so on, about 10 categories in all. The second row held tiles with the best offers from companies. We looked for the most relevant promo codes, the biggest discounts, and the most valuable offers, and showed them here, grouped by company. Below that came all the other letters, grouped by sending company, and these senders were in turn sorted into categories (“Food”, “Cosmetics”, “Electronics” and others). Categories were also ranked by the relevance of their letters, and only letters that crossed a certain relevance threshold were shown inside.
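The tile logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Letter` fields, the `RELEVANCE_THRESHOLD` value, and the rule that a tile is ranked by its most relevant letter are hypothetical choices, not details taken from the hackathon code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Letter:
    sender: str
    category: str
    relevance: float  # assumed to be a model score in [0, 1]


# Hypothetical cut-off; the real threshold is not stated in the text.
RELEVANCE_THRESHOLD = 0.5


def build_tiles(letters, threshold=RELEVANCE_THRESHOLD):
    """Group letters into category tiles, drop low-relevance letters,
    and order both tiles and letters from most to least relevant."""
    by_category = defaultdict(list)
    for letter in letters:
        if letter.relevance >= threshold:
            by_category[letter.category].append(letter)

    tiles = []
    for category, items in by_category.items():
        items.sort(key=lambda item: item.relevance, reverse=True)
        tiles.append((category, items))

    # Assumption: a tile is as relevant as its best letter.
    tiles.sort(key=lambda tile: tile[1][0].relevance, reverse=True)
    return tiles


letters = [
    Letter("shop@example.com", "Clothing", 0.9),
    Letter("shop@example.com", "Clothing", 0.2),  # below threshold, hidden
    Letter("air@example.com", "Trips", 0.7),
    Letter("bank@example.com", "Finances", 0.95),
]
tiles = build_tiles(letters)
print([category for category, _ in tiles])
```

Other aggregation rules (mean relevance, number of letters above the threshold) would fit the same structure; only the sort key for tiles changes.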
We decided to build three models:
- a classifier over more than 30 categories that we designated as the base set for all users;
- clustering to discover new categories based on user preferences;
- ranking of letters within a category, from most relevant to least.
The features should really be described separately for each task. However, we generated one common feature dataset and trained all the models on it. There was no time for careful feature selection.
A number of binary features came directly from the API. However, most features were generated from the texts:
- tf-idf on document collections;
- embeddings obtained with Word2Vec;
- behavioral features, such as:
  - the number of read messages for the last window (1, 2, 5 weeks ago);
  - the number of messages from this.
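Two of the feature families listed above can be sketched with the standard library alone. This is a minimal illustration, not the hackathon code: `tfidf` uses the plain variant (term frequency times `log(N / df)`) with whitespace tokenization, and `read_counts` assumes the windows are trailing ones counted back from a reference moment. All function names, field layouts, and sample data are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def tfidf(docs):
    """Plain tf-idf: tf = term count / doc length, idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def read_counts(events, now, windows_weeks=(1, 2, 5)):
    """Count messages read within each trailing window.

    `events` is a list of (received_at, was_read) pairs.
    """
    result = {}
    for weeks in windows_weeks:
        start = now - timedelta(weeks=weeks)
        result[f"read_last_{weeks}w"] = sum(
            1 for received_at, was_read in events
            if was_read and start <= received_at <= now
        )
    return result


docs = ["sale on shoes", "flight to Moscow", "sale on flights"]
vectors = tfidf(docs)

now = datetime(2018, 7, 15)  # arbitrary reference date for the example
events = [
    (datetime(2018, 7, 12), True),   # 3 days ago, read
    (datetime(2018, 7, 5), False),   # 10 days ago, unread
    (datetime(2018, 7, 3), True),    # 12 days ago, read
    (datetime(2018, 6, 20), True),   # 25 days ago, read
]
features = read_counts(events, now)
print(features)
```

In a real pipeline the sparse tf-idf dictionaries and the behavioral counts would be merged into one feature row per letter before training.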
We hand-labeled 1,000 letters for training. It turned out that this is not as slow and tedious a job as it may seem at first. Using sender addresses and headers speeds up the work significantly. For example, letters from Lamoda almost always belong to the “Clothing” category.
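One way to use sender addresses for faster labeling is a small rule table that proposes a category, which a human then confirms or corrects. The sketch below is hypothetical: only the Lamoda and “Clothing” pairing comes from the text, while the domain strings, the second rule, and the function name are illustrative assumptions.

```python
# Hypothetical sender-domain rules. Only the Lamoda / "Clothing" pairing
# is mentioned in the text; the domains themselves are assumed.
DOMAIN_RULES = {
    "lamoda.ru": "Clothing",
    "airline.example": "Trips",
}


def suggest_label(sender_address, rules=DOMAIN_RULES):
    """Return a suggested category for a sender, or None if no rule matches."""
    domain = sender_address.rsplit("@", 1)[-1].lower()
    for rule_domain, category in rules.items():
        # Match the domain itself and any of its subdomains.
        if domain == rule_domain or domain.endswith("." + rule_domain):
            return category
    return None


print(suggest_label("news@email.lamoda.ru"))
print(suggest_label("vasya@mail.example"))
```

Because a sender only "almost always" maps to one category, the suggestion should be treated as a default to review, not as a final label.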
Next, we trained LightGBM on the whole feature set and obtained an accuracy of 0.913 and an F1 score of 0.892, which we considered a very good result for a baseline. This shows that letters can be classified very well.
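For readers who want to see what these two numbers measure, here is a small, library-free sketch of accuracy and F1 for a multi-class problem. The text does not say how F1 was averaged across categories, so the macro average used here is an assumption, and the labels are made-up sample data.

```python
def accuracy(y_true, y_pred):
    """Share of letters whose predicted category matches the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 for one category: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lb) for lb in labels) / len(labels)


y_true = ["Clothing", "Clothing", "Trips", "Trips", "Food", "Food"]
y_pred = ["Clothing", "Trips", "Trips", "Trips", "Food", "Clothing"]
print(round(accuracy(y_true, y_pred), 3))  # 0.667
print(round(macro_f1(y_true, y_pred), 3))  # 0.656
```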
For ranking, we used a binary 0/1 flag as the target variable: whether the user read the letter. We then ranked letters by the probability predicted by the model, since it reflects how confident the model is that the person will read the message.
Here we also trained LightGBM on the whole feature set and got a ROC AUC of about 0.816.
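The ranking step and its metric can be illustrated without any ML library. In this sketch the probabilities are made-up numbers standing in for model output, and ROC AUC is computed through its pairwise interpretation: the share of (read, unread) pairs in which the read letter got the higher score, with ties counted as half.

```python
def rank_by_probability(letter_ids, probabilities):
    """Order letters from most to least likely to be read."""
    pairs = sorted(
        zip(letter_ids, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [letter_id for letter_id, _ in pairs]


def roc_auc(y_true, scores):
    """ROC AUC as the share of correctly ordered (read, unread) pairs."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs both read and unread letters")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


letter_ids = ["a", "b", "c", "d"]
probabilities = [0.2, 0.9, 0.6, 0.4]  # stand-in for predicted P(read)
was_read = [0, 1, 0, 1]               # the 0/1 target

ranked = rank_by_probability(letter_ids, probabilities)
auc = roc_auc(was_read, probabilities)
print(ranked)  # ['b', 'c', 'd', 'a']
print(auc)     # 0.75
```

The pairwise loop is quadratic and only meant for explanation; library implementations compute the same quantity from sorted scores.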
Clustering and highlighting new categories
In addition to the main categories, we have the “Other” category. New topics can be extracted from it.
We trained standard DBSCAN on letters from this group, and then selected the clusters that contained many messages (the threshold can be tuned, but we fixed it arbitrarily). On the documents in a cluster you can then run, for example, topic modeling, take the most relevant topic for the cluster, and split it off as a separate group. We did not have enough time to validate this algorithm.
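The cluster selection step can be sketched as follows. The code assumes cluster labels have already been produced by a DBSCAN run and that noise points carry the label -1, which is the convention in scikit-learn's implementation. The `MIN_CLUSTER_SIZE` value, the function names, and the sample labels are hypothetical.

```python
from collections import Counter

NOISE = -1            # label scikit-learn's DBSCAN gives to unclustered points
MIN_CLUSTER_SIZE = 3  # hypothetical threshold, fixed by hand


def large_clusters(labels, min_size=MIN_CLUSTER_SIZE):
    """Return the ids of clusters with at least `min_size` members."""
    sizes = Counter(label for label in labels if label != NOISE)
    return sorted(label for label, size in sizes.items() if size >= min_size)


def group_by_cluster(letter_ids, labels, keep):
    """Collect letter ids for the clusters we decided to keep."""
    groups = {label: [] for label in keep}
    for letter_id, label in zip(letter_ids, labels):
        if label in groups:
            groups[label].append(letter_id)
    return groups


# Made-up labels for 11 letters from the "Other" category.
labels = [0, 0, 0, 1, -1, 1, 2, 2, 2, 2, -1]
letter_ids = list(range(len(labels)))

keep = large_clusters(labels)
groups = group_by_cluster(letter_ids, labels, keep)
print(keep)    # [0, 2]
print(groups)  # {0: [0, 1, 2], 2: [6, 7, 8, 9]}
```

Each surviving group is a candidate for a new category; naming it (for example via topic modeling on the group's texts) is a separate step not shown here.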
So, incoming letters pass through the classifier; those that fall into the “Other” category are clustered in an attempt to surface new topics; and then ranking takes place. A request goes to the backend, which aggregates everything, and the frontend renders the result.
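The overall flow can be written as one function that wires the three models together. This is a structural sketch only: `classify`, `cluster_other`, and `score` are hypothetical callables standing in for the trained models, and the toy stand-ins below use keyword checks instead of real predictions.

```python
def process_inbox(letters, classify, cluster_other, score):
    """Classify letters, cluster the "Other" ones, rank each category."""
    by_category = {}
    other = []
    for letter in letters:
        category = classify(letter)
        if category == "Other":
            other.append(letter)
        else:
            by_category.setdefault(category, []).append(letter)

    # cluster_other returns {topic_name: [letters]}; letters that join
    # no new topic are expected back under the "Other" key.
    for topic, members in cluster_other(other).items():
        by_category.setdefault(topic, []).extend(members)

    # Rank letters inside every category, most relevant first.
    return {
        category: sorted(members, key=score, reverse=True)
        for category, members in by_category.items()
    }


# Toy stand-ins for the three models (assumptions for the example).
def classify(letter):
    return "Trips" if "flight" in letter["subject"] else "Other"


def cluster_other(letters):
    topics = {}
    for letter in letters:
        key = "Pizza" if "pizza" in letter["subject"] else "Other"
        topics.setdefault(key, []).append(letter)
    return topics


def score(letter):
    return letter["p_read"]


inbox = [
    {"subject": "your flight to Sochi", "p_read": 0.4},
    {"subject": "pizza discount", "p_read": 0.7},
    {"subject": "flight sale", "p_read": 0.8},
    {"subject": "weekly digest", "p_read": 0.1},
]
result = process_inbox(inbox, classify, cluster_other, score)
print({category: [m["subject"] for m in members]
       for category, members in result.items()})
```

The returned dictionary is the kind of aggregated structure a backend could serialize for the frontend to draw as tiles.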
What could be improved further:
- improved machine learning models;
- collecting data from a larger number of users to better predict the behavior of each of them;
- thorough validation of new emerging categories;
- using images as features, for example by extracting embeddings from pre-trained neural networks.