The 2019 championship season is open! SNA Hackathon, a.k.a. ML Boot Camp 8, launches
Many readers already know that we regularly run IT championships on a variety of topics. Last year alone we held more than 10 major competitions (Russian AI Cup, ML Boot Camp, Technocup and others). At least 25,000 people took part in them, and more than 150,000 since 2011.
If you are only finding out about this now, congratulations: from this moment you are part of a large community of people who take part in our competitions and share experience with each other. You can already join the community Telegram groups on artificial intelligence, competitive programming, high-load projects and administration, and machine learning and data analysis. This will help you quickly get into serious company!
So let's get down to business. Today, February 7, we are pleased to open the 2019 championship season. We start with the eighth machine learning and data analysis competition held on the specialized ML Boot Camp platform (our analogue of Kaggle): SNA Hackathon, or ML Boot Camp 8 (whichever you prefer).
The organizers of this championship are Mail.ru Group and Odnoklassniki. After this article, we recommend reading a brief history of the smart feed, in which Dmitry Bugaychenko talks about the feed ranking algorithms in Odnoklassniki; there is a lot of useful material there.
Now let us describe the championship mechanics, its schedule, the tasks and the data provided.
It's simple. Once the championship opens on ML Boot Camp, you need to:
- read the conditions of the tasks (they are already in this article);
- select a task or tasks that you are going to solve;
- download data;
- start building models and making predictions;
- upload your answers (a plain file) to the checking system.
You can upload answer files up to five times a day. The system scores submissions on only 50% of the test sample (the public part), so the results are preliminary. The final results on the rest of the sample (the private part) will be shown to participants after the competition ends.
If you have never participated in such competitions, there is nothing to worry about. Read the article, and you will succeed :)
The championship will be held in two stages:
- online - from February 7 to March 15;
- offline - from March 30 to April 1.
After March 15, interim results will be announced and the top 15 participants for each task will receive invitations to the second stage, which will be held in the Moscow office of Mail.ru Group. In addition, the three people leading the rating at the end of February 23 will also receive invitations to the final stage.
Description of tasks
For the SNA Hackathon competition, we collected logs of content impressions from open groups in users' news feeds for February-March 2018. The last week and a half of March is held out as the test set. Each log entry records what was shown and to whom, as well as how the user reacted to the content: gave it a “Class”, commented, ignored it, or hid it from the feed.
The essence of the task is to rank the candidates for each user in the test set, placing those that will receive a “Class” as high as possible.
Usually we give one task, but this time we decided to give three at once. You do not need to solve them all; one is enough. Since a user's feed combines content of different types, ranking it calls for skills from different areas: computer vision, text processing and recommender systems.
Within the online phase, we offer three sets of data, each of which contains only one type of information: an image, a text, or data about various collaborative features.
Only at the second stage, when experts from different areas come together, will a common dataset be revealed, making it possible to look for synergy between the different methods.
After the opening of the championship on the platform, you will see a description of the tasks and get the opportunity to download the necessary data for participation.
The data is provided in the Apache Parquet format, which is the primary format for the Spark framework. To work with this format from Python, we recommend the Apache Arrow library. To make things easier to understand, baselines are posted in the GitHub repository. Use them!
In the training set, the data is split by day, and within each day it is divided into 6 parts by user ID (the same user always falls into the same part). This layout lets participants avoid analyzing all the data at once and limit themselves to specific days and/or subgroups of users.
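As a rough illustration of how this layout can be used, the sketch below selects only the files for chosen days and reads a subset of columns with Apache Arrow (pyarrow). The directory naming (`date=YYYY-MM-DD/*.parquet`) and the helper names `list_part_files` and `read_parts` are assumptions made for this example, not the documented layout of the competition data.

```python
from pathlib import Path


def list_part_files(root, days=None):
    """Return sorted Parquet file paths under root, optionally limited to some days.

    Assumes a hypothetical layout: <root>/date=<YYYY-MM-DD>/<part>.parquet
    """
    wanted = None if days is None else set(days)
    files = []
    for day_dir in sorted(Path(root).glob("date=*")):
        day = day_dir.name.split("=", 1)[1]
        if wanted is not None and day not in wanted:
            continue
        files.extend(sorted(day_dir.glob("*.parquet")))
    return files


def read_parts(files, columns=None):
    """Read the given Parquet files into a single Arrow table."""
    # Imported lazily so that file selection works even without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    if not files:
        raise ValueError("no Parquet files selected")
    tables = [pq.read_table(str(path), columns=columns) for path in files]
    return pa.concat_tables(tables)
```

Under these assumptions, something like `read_parts(list_part_files("train", days=["2018-02-01"]), columns=["instanceId_userId", "feedback"])` would load a single day with just two columns, where `"train"` stands for wherever the data was unpacked.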
The training sets are divided into three non-intersecting groups: with texts, with pictures and with collaborative features. In each group, the data contains the following fields:
- instanceId_userId: user ID (anonymized);
- instanceId_objectType: object type;
- instanceId_objectId: object identifier (anonymized);
- feedback: an array with the types of user reactions (the presence of the Liked token in the array indicates that the object received a “Class” from the user);
- audit_clientType: the type of platform from which the user logged in;
- audit_timestamp: the time when the feed was built;
- metadata_ownerId: the author of the shown object (anonymized);
- metadata_createdAt: the creation date of the shown object.
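A binary training label can be derived from the feedback field as described above. The following is a minimal sketch: only the Liked token comes from the field description, while the other token names and the sample rows are made up for illustration.

```python
def is_liked(feedback):
    """Return 1 if the feedback array contains the Liked token, else 0."""
    return int(feedback is not None and "Liked" in feedback)


# Hypothetical log entries; token names other than "Liked" are invented here.
rows = [
    {"instanceId_userId": 1, "instanceId_objectId": 10, "feedback": ["Liked"]},
    {"instanceId_userId": 1, "instanceId_objectId": 11, "feedback": ["Ignored"]},
    {"instanceId_userId": 2, "instanceId_objectId": 10, "feedback": ["Commented", "Liked"]},
]
labels = [is_liked(row["feedback"]) for row in rows]
print(labels)  # [1, 0, 1]
```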
For objects from the training text set, additional texts are provided in the Apache Parquet format:
- objectId: object identifier;
- lang: text language (based on the Odnoklassniki language detector);
- text: raw text associated with the object;
- preprocessed: an array of tokens obtained after filtering punctuation and stemming.
The data for ranking by images additionally contains an array field, ImageId, with the MD5 hashes of the images associated with each object. The image bodies are split into separate tar files according to the first character of the hash.
The block with collaborative features provides a variety of additional information:
- audit_*: extended information about the context in which the feed was built;
- metadata_*: extended information about the object itself;
- userOwnerCounters_*: information about previous interactions of the user with the author of the content;
- ownerUserCounters_*: information about previous interactions of the content author with the user;
- membership_*: information about the user's membership in the group where the content is published;
- user_*: detailed information about the user;
- auditweights_*: a large number of runtime features extracted by the current system.
The structure of the test sets matches that of the training sets, but they are not split by day and do not contain a field
Participants should sort the feed so that objects with a high probability of receiving a “Class” are at the top. Sorting is done separately for each user, after which a submission file of the following form is produced (the format corresponds to an export from a Pandas data frame with columns of type
User_id_1,"[object_id_1_1, object_id_1_2]"
User_id_2,"[object_id_2_1, object_id_2_2, object_id_2_3]"
The file should contain a row for each user in the test set, and the rows should be sorted by ID. Objects for each user must be sorted by decreasing relevance.
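The formatting rules above can be sketched with the standard csv module. This is only an illustration: the function name `build_submission`, the input triples and the absence of a header row are assumptions, and the model scores are invented.

```python
import csv
import io
from collections import defaultdict


def build_submission(predictions):
    """Format (user_id, object_id, score) triples as submission text.

    Users are sorted by ID; each user's objects by decreasing score.
    """
    by_user = defaultdict(list)
    for user_id, object_id, score in predictions:
        by_user[user_id].append((score, object_id))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for user_id in sorted(by_user):
        ranked = sorted(by_user[user_id], key=lambda pair: pair[0], reverse=True)
        objects = "[" + ", ".join(str(obj) for _, obj in ranked) + "]"
        # The comma inside the list makes csv wrap the field in double quotes.
        writer.writerow([user_id, objects])
    return out.getvalue()


# Invented scores for two users.
preds = [(2, 30, 0.1), (1, 10, 0.2), (1, 11, 0.9), (2, 31, 0.7), (2, 32, 0.4)]
print(build_submission(preds), end="")
# 1,"[11, 10]"
# 2,"[31, 32, 30]"
```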
When a submission is evaluated, a personal ROC-AUC is computed for each user, after which the values are averaged over all users and multiplied by 100.
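For local validation, the metric can be approximated as below. This is a sketch, not the platform's scoring code: it treats ROC-AUC for one user as the share of (liked, not liked) object pairs that the ranking orders correctly, and it skips users whose AUC is undefined (no liked or no non-liked objects), which is an assumption about how such users are handled.

```python
def user_auc(ranked_objects, liked):
    """ROC-AUC of one ranking: share of (liked, not liked) pairs ordered correctly.

    Returns None when the user has no liked or no non-liked objects,
    because ROC-AUC is undefined in that case.
    """
    positives_seen = 0
    correct_pairs = 0
    for obj in ranked_objects:
        if obj in liked:
            positives_seen += 1
        else:
            # Every liked object already seen is ranked above this one.
            correct_pairs += positives_seen
    negatives = len(ranked_objects) - positives_seen
    if positives_seen == 0 or negatives == 0:
        return None
    return correct_pairs / (positives_seen * negatives)


def mean_user_auc(rankings, likes):
    """Average per-user ROC-AUC, multiplied by 100.

    Skipping users with an undefined AUC is an assumption of this sketch.
    """
    scores = []
    for user_id, ranked in rankings.items():
        auc = user_auc(ranked, likes.get(user_id, set()))
        if auc is not None:
            scores.append(auc)
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


rankings = {1: [11, 10], 2: [31, 32, 30]}
likes = {1: {11}, 2: {32}}
print(mean_user_auc(rankings, likes))  # 75.0
```

In the example, user 1 is ranked perfectly (AUC 1.0) and user 2 has one of two pairs in the right order (AUC 0.5), giving an average of 0.75, or 75 after scaling.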
The second stage will be held in a team format, and the winners will receive valuable gifts, sticker packs and other prizes:
- 300 000 rubles for the first-place team;
- 200 000 rubles for the second-place team;
- 100 000 rubles for the third-place team;
- 100 000 rubles for the team whose solution, in the jury's opinion, has the best prospects for deployment in production.
In addition, the prize-winners of the online stage (the top 33 participants for each task) will get cool T-shirts.
Bonus! The best and most active participant of the online stage will receive a PlayStation or an Xbox, whichever they choose. The criteria are simple: plenty of on-topic charts in the chat, interesting discussion, and so on. The winner will be chosen by public vote.
Registration and community
There is no need to register separately for the competition. Registering on the platform once is enough, and all the competitions and sandboxes of past championships become available to you immediately.
Do not waste time. The community is waiting for new heroes. Welcome!