SmartMailHack. A Named Entity Recognition Winners' Story

Last weekend (April 20-22), a student machine learning hackathon was held at the Mail.ru Group office. It brought together students from different universities, different years of study and, most interestingly, different specializations: from programmers to information security students.



The Mail.ru Mail team offered three tasks:

  1. Recognition and classification of company logos. This is useful in antispam for detecting phishing emails.
  2. Determining which parts of an email's text belong to which categories: Named Entity Recognition (NER).
  3. An open-ended task: invent and prototype a new useful feature for Mail. The evaluation criteria were usefulness, quality of implementation, use of ML and the hype factor.

Each team had to pick one of the first two tasks; participation in the third one was optional. We chose the second task, because we understood that the first one would surely be won by neural networks, with which we had little experience, while in the second one we hoped that classic ML could still shine. In general, we really liked the idea of splitting the tasks, since it let us discuss solutions and ideas with non-competing teams during the hackathon.

A notable feature of the hackathon was that there was no public leaderboard: the models were evaluated at the very end on a closed dataset.

Task description


We were given emails from online stores: order confirmations and promotional mailings. In addition to the source emails, we received a parsing script and the labeled results of its work, which served as the training dataset.

Each email is parsed line by line, every line is split into tokens, and every token gets a label.

Tagged Data Example
# [](http://t.adidas-news.adidas.com/res/adidas-t/spacer.gif) Итого к оплате
39 []( OUT
39 http OUT
39 :// OUT
39 t OUT
39 . OUT
39 adidas OUT
39 - OUT
39 news OUT
39 . OUT
39 adidas OUT
39 . OUT
39 com OUT
39 / OUT
39 res OUT
39 / OUT
39 adidas OUT
39 - OUT
39 t OUT
39 / OUT
39 spacer OUT
39 . OUT
39 gif OUT
39 ) OUT
39 Итого B-PRICE
39 к PRICE
39 оплате PRICE


In the markup file, each parsed line appears first, prefixed with the "#" character, followed by the parsing result as three columns: (line number, token, class label).
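
For illustration, here is a rough sketch of how a file in this format can be read into per-line lists of (token, label) pairs. The function name is ours and we assume tokens contain no spaces; the actual parsing script was provided by the organizers.

from collections import defaultdict

def load_tagged_file(path):
    """Read the markup: '#' lines hold the original text, the rest are
    'line_number token label' triples grouped by line number."""
    lines = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw or raw.startswith("#"):
                continue  # skip empty lines and the original-text lines
            line_no, token, label = raw.split()  # assumes space-free tokens
            lines[int(line_no)].append((token, label))
    return [lines[k] for k in sorted(lines)]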

Tags are of the following types:

  • Product article number (SKU): B-ARTICUL, ARTICUL
  • Order number: B-ORDER, ORDER
  • Total order amount: B-PRICE, PRICE
  • Ordered product: B-PRODUCT, PRODUCT
  • Product type: B-PRODUCT_TYPE, PRODUCT_TYPE
  • Seller: B-SHOP, SHOP
  • All other tokens: OUT

The "B-" prefix marks the first token of an entity. Models were evaluated with the F1 metric over all labels except OUT.
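
Roughly, this token-wise metric can be reproduced with scikit-learn as follows (a sketch under our assumptions: the B- prefixed label is merged with its plain counterpart and F1 is averaged over the entity classes; the organizers' own evaluation script was the ground truth):

from sklearn.metrics import precision_recall_fscore_support

def tokenwise_scores(y_true, y_pred):
    # treat B-PRICE and PRICE as one class by stripping the prefix
    strip = lambda lab: lab[2:] if lab.startswith("B-") else lab
    y_true = [strip(l) for l in y_true]
    y_pred = [strip(l) for l in y_pred]
    classes = sorted({l for l in y_true if l != "OUT"})  # ignore OUT
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=classes, zero_division=0)
    for c, p, r, f in zip(classes, prec, rec, f1):
        print(f"{c}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
    return f1.mean()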

Logistic regression score as a baseline
Training time 157.34269189834595 s
================================================TRAIN======================================
+++++++++++++++++++++++ ARTICUL +++++++++++++++++++++++
Tokenwise precision: 0.0 Tokenwise recall: 0.0 Tokenwise f-measure: 0.0
+++++++++++++++++++++++ ORDER +++++++++++++++++++++++
Tokenwise precision: 0.7981220657276995 Tokenwise recall: 0.188470066518847 Tokenwise f-measure: 0.30493273542600896
+++++++++++++++++++++++ PRICE +++++++++++++++++++++++
Tokenwise precision: 0.9154929577464789 Tokenwise recall: 0.04992319508448541 Tokenwise f-measure: 0.09468317552804079
+++++++++++++++++++++++ PRODUCT +++++++++++++++++++++++
Tokenwise precision: 0.6538461538461539 Tokenwise recall: 0.0160075329566855 Tokenwise f-measure: 0.03125000000000001
+++++++++++++++++++++++ PRODUCT_TYPE +++++++++++++++++++++++
Tokenwise precision: 0.5172413793103449 Tokenwise recall: 0.02167630057803468 Tokenwise f-measure: 0.04160887656033287
+++++++++++++++++++++++ SHOP +++++++++++++++++++++++
Tokenwise precision: 0.0 Tokenwise recall: 0.0 Tokenwise f-measure: 0.0
+++++++++++++++++++++++ CORPUS MEAN METRIC +++++++++++++++++++++++
Tokenwise precision: 0.7852941176470588 Tokenwise recall: 0.05550935550935551 Tokenwise f-measure: 0.1036893203883495
================================================TEST=======================================
+++++++++++++++++++++++ ARTICUL +++++++++++++++++++++++
Tokenwise precision: 0.0 Tokenwise recall: 0.0 Tokenwise f-measure: 0.0
+++++++++++++++++++++++ ORDER +++++++++++++++++++++++
Tokenwise precision: 0.8064516129032258 Tokenwise recall: 0.205761316872428 Tokenwise f-measure: 0.3278688524590164
+++++++++++++++++++++++ PRICE +++++++++++++++++++++++
Tokenwise precision: 0.8666666666666667 Tokenwise recall: 0.05263157894736842 Tokenwise f-measure: 0.09923664122137404
+++++++++++++++++++++++ PRODUCT +++++++++++++++++++++++
Tokenwise precision: 0.4 Tokenwise recall: 0.0071174377224199285 Tokenwise f-measure: 0.013986013986013988
+++++++++++++++++++++++ PRODUCT_TYPE +++++++++++++++++++++++
Tokenwise precision: 0.3333333333333333 Tokenwise recall: 0.011627906976744186 Tokenwise f-measure: 0.02247191011235955
+++++++++++++++++++++++ SHOP +++++++++++++++++++++++
Tokenwise precision: 0.0 Tokenwise recall: 0.0 Tokenwise f-measure: 0.0
+++++++++++++++++++++++ CORPUS MEAN METRIC +++++++++++++++++++++++
Tokenwise precision: 0.7528089887640449 Tokenwise recall: 0.05697278911564626 Tokenwise f-measure: 0.10592885375494071


The distribution of labels in the training dataset is as follows:


The OUT class alone accounts for almost 580k labels.

A strong class imbalance is evident, and the OUT label makes no sense at all for evaluating model quality. To account for the imbalance, we added the class_weight='balanced' parameter to the forest. Although it lowered the scores on the train and test sets (from 0.27 and 0.15 down to 0.09 and 0.08), it helped reduce overfitting (the gap between the two scores shrank).
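
In code this is a one-line change (a sketch; the hyperparameters shown here are illustrative, not the exact ones we used):

from sklearn.ensemble import RandomForestClassifier

def train_balanced_forest(X, y):
    """X is the (n_tokens, n_features) feature matrix, y the string labels.
    class_weight='balanced' reweights each class inversely to its frequency,
    so the rare entity labels are not drowned out by the huge OUT class."""
    forest = RandomForestClassifier(n_estimators=50,
                                    class_weight="balanced",
                                    n_jobs=-1)
    forest.fit(X, y)
    return forest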

Models


To represent words as vectors, we used pre-trained fastText word embeddings for Russian, which turn each token into a vector of 300 values. While part of the team was trying to build a neural network, the rest of us tried standard classification algorithms: logistic regression, random forest, kNN and XGBoost. Based on those tests, a random forest was chosen as the fallback in case the neural network did not take off. The choice was largely justified by its fast training and prediction (which saved us a lot of time at the end of the competition) combined with satisfactory quality compared to the other models.
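
The embedding step itself looked roughly like this (a sketch: the pre-trained model file name and the use of the fasttext Python package are assumptions; loading the same vectors through gensim's KeyedVectors works just as well):

import numpy as np
import fasttext

# pre-trained 300-dimensional Russian fastText vectors (file name for illustration)
ft = fasttext.load_model("wiki.ru.bin")

def embed_tokens(tokens):
    """Turn a list of string tokens into an (n_tokens, 300) matrix."""
    return np.vstack([ft.get_word_vector(tok) for tok in tokens])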

Having gone through ODS's mlcourse_open, we understood that only good features can significantly boost the score, so we spent the remaining time generating them. The first thing that came to mind was to add simple features: the token's index in the line, its length, whether it is alphanumeric, whether it consists only of uppercase or lowercase letters, and so on. This raised the F1 metric to 0.21 on the test sample. Studying the dataset further, we concluded that context matters: depending on it, two identical tokens could have different class labels. To capture context we used a window, adding the previous and next tokens to the feature vector. This raised the score to 0.55 on the train set and 0.43 on the test set. On the last night of the hackathon, we tried to enlarge the window and cram even more features into the laptop's 12 gigabytes of RAM. As it turned out, they did not fit. Having abandoned those attempts, we started thinking about what other features could be added to the model. We turned to the pymorphy2 library, but did not manage to wire it in properly.
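
A simplified sketch of this feature construction (the feature choice and window handling here are approximate, not our exact hackathon code):

import numpy as np

def simple_features(token, index):
    # hand-crafted per-token features: position, length, character classes
    return np.array([index,
                     len(token),
                     token.isalnum(),
                     token.isupper(),
                     token.islower(),
                     token.isdigit()], dtype=float)

def featurize_line(tokens, embed):
    # per-token vector = embedding + simple features, then each token is
    # concatenated with its left and right neighbours (window of size 1);
    # assumes a non-empty list of tokens
    vecs = [np.concatenate([embed(t), simple_features(t, i)])
            for i, t in enumerate(tokens)]
    pad = np.zeros_like(vecs[0])
    rows = []
    for i, v in enumerate(vecs):
        left = vecs[i - 1] if i > 0 else pad
        right = vecs[i + 1] if i + 1 < len(vecs) else pad
        rows.append(np.concatenate([left, v, right]))
    return np.vstack(rows)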

Submit


A couple of hours remained before the test dataset was released and the first submission was due. After the dataset was issued, teams had an hour to make predictions and send them to the organizers (the first submission), and then another hour for a second attempt. So it was time to start preprocessing and train the forest on the entire sample; we also had not lost faith in the neural network. Preprocessing and training a 50-tree forest went surprisingly quickly: 10 minutes for preprocessing (including five minutes to load the embedding dictionary) and another 10 minutes to train the forest on a matrix of size (609101, 906). This speed pleased us, because it meant we could quickly tweak the model for the second submission and retrain it. The trained forest scored 0.59 on the entire sample; given our earlier tests on a held-out sample, we expected the score on the closed test set to be noticeably lower.

Having received a test dataset of ~300,000 tokens and already having a trained model, we produced a prediction in literally two minutes and were the first to submit, getting a score of 0.2997. While waiting for the other teams' results and thinking over ways to improve our own model, we came up with the idea of adding the freshly labeled test set to the training sample. Firstly, this did not contradict the rules, since only manual labeling was prohibited and ours came from the model, and secondly, we were curious ourselves what would come of it. Meanwhile we learned the other teams' results: they were all behind us, which was a pleasant surprise. However, the closest score was 0.28, which still gave our rivals a chance to overtake us, and we could not be sure they had no aces up their sleeves. The second hour was tense, teams kept submitting until the very end, and our idea of enlarging the training sample did not pay off.
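
The pseudo-labelling experiment itself is only a few lines (a sketch; X_train, y_train and X_test stand for the preprocessed matrices described above and forest for the model already trained on them, all names illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

pseudo_labels = forest.predict(X_test)            # labels produced by the already trained model
X_ext = np.vstack([X_train, X_test])              # extend the training matrix with the test tokens
y_ext = np.concatenate([y_train, pseudo_labels])

forest_v2 = RandomForestClassifier(n_estimators=50, class_weight="balanced", n_jobs=-1)
forest_v2.fit(X_ext, y_ext)                       # in our case this did not improve the score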

When time was up, the final leaderboard appeared: some teams had improved their results, for some they got worse, but we were still in first place. However, the answer still had to be validated: we had to show a working model and reproduce the prediction in front of the organizers, and all we had was a freshly rebooted laptop on which the trees had been trained but the model had not been saved. What to do? The organizers accommodated us and agreed to wait until the model retrained and made its predictions. There was a catch, though: the forest is random, we had not fixed the seed, and the prediction had to match down to the last digit. We hoped for the best and also launched training on a second laptop where the embedding dictionary was already loaded. Meanwhile the organizers carefully examined the data and code and asked about the model and the features. The laptop kept getting on our nerves, occasionally freezing for several minutes so that even the clock stopped updating. When training finished, we repeated the prediction and sent it to the organizers. By that time the second laptop had also finished training, and the predictions on the training sample from both laptops matched perfectly, which boosted our faith that the result would validate. And indeed, a few seconds later the organizers congratulated us on the victory and shook our hands :)

Summary


We had a great weekend at the Mail.Ru Group office and listened to interesting talks on ML and DL from the Mail team. We enjoyed endless supplies of cookies, chocolate and milk (the milk, admittedly, turned out to be a finite but regularly replenished resource), pizza, an unforgettable atmosphere and conversations with interesting people.

As for the task itself and our experience with it, we can draw the following conclusions:

  • Do not chase fashion and jump straight to DL; classic models can perform quite well.
  • Generate more useful features first, and only then optimize the model and tune its hyperparameters.
  • Fix the random seed and save your models, and keep backups :) (a small sketch of this follows right after the list)
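
A minimal sketch of what that means in practice (joblib-based persistence is our suggestion, not what we actually had set up at the hackathon; X and y stand for the preprocessed training data):

from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                random_state=42,  # fixed seed -> reproducible trees
                                n_jobs=-1)
forest.fit(X, y)
dump(forest, "ner_forest.joblib")   # persist the trained model...
forest = load("ner_forest.joblib")  # ...so a laptop reboot does not cost you the victory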


Team EpicTeam, winners of the NER challenge. From left to right:

  • Leonid Zharikov, Bauman MSTU, a student at TechnoPark.
  • Andrey Atamanyuk, Bauman MSTU, a student at TechnoPark.
  • Miliya Fathutdinova, Plekhanov Russian University of Economics.
  • Andrey Pashkov, NRNU MEPhI, a student at TechnoAtom.

