Prediction of post likes. SNA Hackathon 2014

    What good can you get out of social networks? You can find a football team, a bass player for your band, like-minded people, a wife, or an apartment, a room, or a villa by the ocean to rent or rent out. And if you add data analysis? You can find your place in society. For example, if I listen to XXX, read YYY and drink ZZZ, then there are only 100 of us on this planet. And if I also paint my nails green, am I then definitely one of a kind?

    You can understand what people like and what can be sold to them, make predictions, and test the six handshakes theory for the hundredth time. There are many tasks in the field of Social Network Analysis, and we propose solving one of them at the online stage of SNA Hackathon 2014.


    Social Media Tasks


    Social networks today are an inexhaustible source of information about people, their hobbies and thoughts. Every day, users generate about 8 terabytes of photos, text, and video, which can become a resource for creating new software products or a powerful prediction tool.



    We decided to focus on the task of analyzing text data generated by users, and asked the hackathon participants to analyze the relationship between the content of the post and its rating.

    About the hackathon and the task of the online stage


    To qualify for the offline stage, participants have until April 10th to predict the number of likes that a post will gain within a certain time after publication. Or, in the terms of Odnoklassniki, whose data we analyze, the number of “Class!” marks that a particular topic receives.



    As of today, a leaderboard has already formed. Participants whose models turn out to be the most accurate will be invited to the offline stage, which will be held in St. Petersburg, and will get a chance to win a MacBook Pro. There, in 24 hours, they will need to analyze the real publications of about 44 million users and build a product prototype based on them. Experts from EMC, JetBrains, Data Mining Labs, and the HSE and NES universities will help and advise participants and give short presentations.

    Initial data of the first stage

    Post data is stored in two files: train_content.csv and test_content.csv with the following fields:
    group_id - Anonymous identifier of the group in which the post is posted
    post_id - Anonymous identifier of the post
    timestamp - Post publication time, as the number of milliseconds that have passed since midnight on January 1, 1970 (UTC).
    content - The content of the post. Note: this field may contain spaces, special characters, as well as http links, images and polls. The author's spelling and punctuation are preserved.
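
    As a rough illustration of working with these fields, here is a minimal Python sketch that reads rows in this layout and converts the millisecond timestamp to a UTC date. It is not the code from the hackathon repository, which is written in R. It assumes a comma-separated file with a header row carrying the four field names; the actual delimiter and quoting of train_content.csv may differ, and the sample row is made up.

```python
import csv
import io
from datetime import datetime, timezone


def ms_to_utc(ms):
    """Convert milliseconds since 1970-01-01 00:00 UTC to an aware datetime."""
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)


def read_posts(handle):
    """Yield one dict per post.

    Assumption: the file has a header row with the names
    group_id, post_id, timestamp, content and is comma-separated.
    """
    for row in csv.DictReader(handle):
        row["published"] = ms_to_utc(row["timestamp"])
        yield row


# Made-up sample in the assumed layout (not real competition data).
SAMPLE = io.StringIO(
    "group_id,post_id,timestamp,content\n"
    '17,42,1396310400000,"Hello, world! http://example.com"\n'
)
posts = list(read_posts(SAMPLE))
print(posts[0]["post_id"], posts[0]["published"].isoformat())
```

    For a real file, the same function could be called with open("train_content.csv", newline="", encoding="utf-8"), where the encoding is again an assumption.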

    Example:

    Information about the “Class!” marks for the training set is stored in the train_likes.csv file with the following fields:
    user_id - Anonymous identifier of the user who set “Class!”
    post_id - Anonymous post identifier
    timestamp - Time the “Class!” was set, as the number of milliseconds since midnight on January 1, 1970 (UTC).
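
    The quantity to predict, the number of “Class!” marks per post, can be derived from rows of this shape by counting them per post_id. The sketch below is one possible way to do that in Python, not the organizers' preprocessing. The optional window_ms parameter is hypothetical: the text says only that likes are counted "a certain time after publication", so the window length is left to the caller. The sample rows are made up.

```python
from collections import Counter


def count_likes(likes, published_ms=None, window_ms=None):
    """Count "Class!" marks per post_id.

    likes: iterable of (user_id, post_id, timestamp_ms) tuples.
    published_ms: optional dict mapping post_id to publication time in ms.
    window_ms: optional, hypothetical window length; when it is given
        together with published_ms, only marks set within window_ms
        after publication are counted.
    """
    counts = Counter()
    for _user_id, post_id, ts in likes:
        if window_ms is not None and published_ms is not None:
            if ts - published_ms[post_id] > window_ms:
                continue  # mark was set after the window closed
        counts[post_id] += 1
    return counts


# Made-up rows: (user_id, post_id, timestamp_ms).
likes = [(1, 42, 1_000), (2, 42, 5_000), (3, 43, 2_000), (1, 42, 90_000)]
published = {42: 0, 43: 0}

total = count_likes(likes)
early = count_likes(likes, published, window_ms=10_000)
print(total[42], early[42])
```

    A post without any marks has no rows to count, so it never becomes a key; a Counter returns 0 for such posts when looked up.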

    Example:

    The forecast is scored with the R² metric (multiplied by 1000 for display convenience):


    Where:
    f - the actual value of the number of "Classes!"
    p - forecast of the number of "Classes!"
    Var (x) - sample variance of the value x
    So 1000 is the maximum possible score for a forecast. To reach the second stage of the hackathon, you must beat the baseline, that is, the accuracy of the algorithm that we wrote.
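
    To experiment with the metric locally, the definitions above can be turned into a small Python function. The exact formula is an assumption here: the sketch uses the form 1000 * (1 - Var(f - p) / Var(f)), which matches the symbols f, p and Var(x) and the maximum of 1000, but the organizers' scoring code may differ in detail. The function name is made up for illustration.

```python
from statistics import variance


def r2_score_x1000(actual, predicted):
    """Score a forecast under the ASSUMED form 1000 * (1 - Var(f - p) / Var(f)).

    actual: the real numbers of "Class!" marks (f).
    predicted: the forecast numbers of "Class!" marks (p).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    residuals = [f - p for f, p in zip(actual, predicted)]
    # statistics.variance is the sample variance (divides by n - 1).
    return 1000 * (1 - variance(residuals) / variance(actual))


actual = [1, 2, 3, 4]
perfect = r2_score_x1000(actual, [1, 2, 3, 4])            # exact forecast
constant = r2_score_x1000(actual, [2.5, 2.5, 2.5, 2.5])   # always the mean
print(perfect, constant)
```

    Under this assumed form, an exact forecast scores 1000, while always predicting the same value scores 0, so the score shows how much of the variation in likes the model explains.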

    Baseline algorithm

    The R source code for preliminary data processing and building a baseline forecast can be found in our GitHub repository.

    There you will find three R scripts:
    prepare.R - Data preprocessing
    features.R - Extracting basic features (number of characters, number of words, average word length)
    baseline.R - Building a model (we use linear regression)
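
    The three basic features are simple to compute. The following Python sketch shows one way to do it; it is not a port of features.R, whose tokenization rules may differ. Splitting on whitespace and counting punctuation as part of a word are assumptions made for brevity, and the function name and sample text are made up.

```python
def basic_features(text):
    """Return the number of characters, number of words, and average word length.

    Assumption: words are whitespace-separated tokens, punctuation included.
    """
    words = text.split()
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {
        "n_chars": len(text),
        "n_words": n_words,
        "avg_word_len": avg_word_len,
    }


features = basic_features("Spring is here, class!")
print(features)
```

    Features like these can then be fed to a regression model as the predictors of the number of “Class!” marks.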

    How to start?

    Unzip the input files (test_content.csv, train_content.csv, train_likes.csv) into the ./data/src/ folder, then run the following on the command line:
    git clone https://github.com/snahackathon/sh2014.git
    cd ./sh2014
    #
    cd R
    R --vanilla < prepare.R
    R --vanilla < features.R
    R --vanilla < baseline.R
    

    The predicted number of likes for the test set is written to data/submit. Of course, this is only a baseline algorithm; it does not exceed the threshold score.

    If you are brave, dexterous, skillful ...



    Take part in the hackathon! Our goal is to bring together enthusiastic and creative people, so that the competition is interesting and produces accurate models and elegant algorithms. If you are still studying and want to try your hand, we invite you to participate: download the training and test data and squeeze everything you can out of them. If you have already finished your studies, we invite you not only to participate but also to act as an expert or judge. To do so, write to us at contact@sh2014.org.
