When there is really a lot of data: Vowpal Wabbit

    Hi, Habr!



    In the previous two posts (one, two) we examined the basic algorithms and techniques used by Kaggle competitors. Today I would like to go further and talk about the difficulties researchers run into when there is a lot of data and the model has to be trained on samples that do not fit into memory. This happens quite often, even on Kaggle itself (in this task the training sample takes several gigabytes, and a beginner may simply not know what to do with it). Below we look at machine learning algorithms and tools that cope with this problem.

    Many who are familiar with machine learning know that good quality can often be obtained with simple linear models, provided that the features were well selected and generated (which we discussed earlier). These models are also appealing in their simplicity, and often even in their clear geometric interpretation (for example, SVM, which maximizes the width of the separating margin). But linear methods have another very important advantage: during training one can arrange for the algorithm's parameters (i.e., the weights) to be updated every time a new object arrives. In the literature such methods are often called Online Machine Learning.

    Without going into details, it looks roughly like this: to select the parameters of a particular linear method (for example, the weights of a logistic regression), some initial values are chosen, and then, as each new object from the training sample arrives, the weights are updated. Linear methods admit exactly this kind of training, and it is clear that there is no longer any need to keep all objects in memory at once.
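    Here is a minimal sketch of this idea for logistic regression trained by stochastic gradient descent (an illustration only, not the actual Vowpal Wabbit implementation; the learning rate and the toy stream of objects are made up):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def online_update(weights, x, y, learning_rate=0.1):
        """One online step of logistic regression.
        x is a sparse object {feature_index: value}, y is a label in {0, 1}."""
        # prediction with the current weights
        z = sum(weights.get(i, 0.0) * v for i, v in x.items())
        p = sigmoid(z)
        # gradient step: only the features present in x are touched,
        # so the whole sample never has to be held in memory at once
        for i, v in x.items():
            weights[i] = weights.get(i, 0.0) - learning_rate * (p - y) * v
        return weights

    # objects arrive one by one (e.g. read from a file line by line)
    weights = {}
    stream = [({0: 1.0, 3: 2.0}, 1), ({1: 1.0, 3: 0.5}, 0)]
    for x, y in stream:
        weights = online_update(weights, x, y)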

    Today one of the best-known implementations of such methods is the Vowpal Wabbit package, which can be briefly described in a few points:

    • Only linear models can be trained. At the same time, as we already know, the quality of these methods can be increased by adding new features, choosing a suitable loss function, and also by using low-rank decompositions (we will talk about this in more detail in the following articles)
    • The training sample is processed by a stochastic optimizer, so it is possible to train on samples that do not fit in memory
    • A very large number of features can be handled by hashing them (the so-called hashing trick), which makes it possible to train models even when the full set of weights simply does not fit in memory (a minimal sketch of the idea is given right after this list)
    • Active learning mode is supported, in which objects of the training sample can be fed in over the network, even from several machines
    • Training can be parallelized across multiple machines

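    For those who have not met the hashing trick before, the idea is roughly as follows (a simplified sketch, not VW's internals): instead of keeping a dictionary from feature names to indices, the index is obtained by hashing the name into a table of fixed size 2^b, so the memory for the weights is bounded no matter how many distinct features occur in the data.

    def hashed_index(feature_name, b=18):
        """Map an arbitrary feature name to an index in [0, 2**b - 1]."""
        return hash(feature_name) % (2 ** b)

    # the weight vector has a fixed size, no matter how many distinct
    # feature names appear in the training sample
    b = 18
    weights = [0.0] * (2 ** b)

    idx = hashed_index("last_name_braund", b)
    weights[idx] += 0.1  # only the bucket this feature hashes to is updated

    Different features may of course collide in the same bucket; that is the price of the trick, and with a sufficiently large b it usually does not hurt the quality much.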
    So let us dwell in more detail on how to work with this tool in practice and what results it can give. As an example, consider the well-known Titanic: Machine Learning from Disaster task. It is probably not the best example, because there is not much data in this problem. However, since the article is intended primarily for beginners in machine learning, this post will be a good continuation of the official Tutorial. Besides, the code used in this post is quite easy to rewrite for the real (and, at the time of writing, current) Click-Through Rate Prediction task, in which the training sample is more than 5 GB.

    Before describing the specific steps, note that the code below was written quite a while ago (even before Vowpal Wabbit became popular), and the project has been actively updated since then, so all the sources given below are correct only up to minor details; checking them is left to the reader.

    Recall that in this problem we are asked to build a classifier that predicts, for a particular person (a passenger of the Titanic), whether he or she survived. We will not describe the problem statement and the given features in detail; those interested can find this information on the competition page.

    So, to begin with, Vowpal Wabbit accepts input in a certain format:

    label |A feature1:value1 |B feature2:value2

    On the whole this is no different from the usual object-feature matrix, except that the features can be grouped into namespaces (A and B above), so that later, during training, some of them can be "turned off". Thus, after downloading the training and test samples, the first thing to do is to convert the data into a format that Vowpal Wabbit can read.
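    For example, a converted line for one hypothetical passenger might look like this (the label, then the PassengerId used as a tag after the apostrophe, then a single namespace f with both categorical and numeric features):

    1 '42 |f passenger_class_3 sex_male title_mr age:22 siblings_onboard:1 family_members_onboard:0 embarked_S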

    Preparation of training and test samples


    To do this, you can use a simple script (or simply take the excellent phraug2 library) that reads the train.csv file line by line and converts each object of the training set into the desired format. Note that for two-class classification the label takes the values +1 or -1.

    import csv
    import re

    def clean(s):
        # keep only word characters and lowercase the result
        return " ".join(re.findall(r'\w+', s, flags=re.UNICODE)).lower()

    with open("train_titanic.csv", "r") as infile, open("train_titanic.vw", "w") as outfile:
        reader = csv.reader(infile)
        next(reader)  # skip the header line
        for line in reader:
            # label: +1 for a survivor, -1 otherwise; PassengerId becomes the tag
            if str(line[1]) == "1":
                vw_line = "1 '"
            else:
                vw_line = "-1 '"
            vw_line += str(line[0]) + " |f "
            vw_line += "passenger_class_" + str(line[2]) + " "
            vw_line += "last_name_" + clean(line[3].split(",")[0]).replace(" ", "_") + " "
            vw_line += "title_" + clean(line[3].split(",")[1]).split()[0] + " "
            vw_line += "sex_" + clean(line[4]) + " "
            if len(str(line[5])) > 0:
                vw_line += "age:" + str(line[5]) + " "
            vw_line += "siblings_onboard:" + str(line[6]) + " "
            vw_line += "family_members_onboard:" + str(line[7]) + " "
            vw_line += "embarked_" + str(line[11]) + " "
            outfile.write(vw_line[:-1] + "\n")
    

    We will do the same with the test sample:

    with open("test_titanic.csv", "r") as infile, open("test_titanic.vw", "w") as outfile:
        reader = csv.reader(infile)
        next(reader)  # skip the header line
        for line in reader:
            # the test sample has no label, so a dummy label of 1 is written;
            # the column indices shift by one because there is no Survived column
            vw_line = "1 '"
            vw_line += str(line[0]) + " |f "
            vw_line += "passenger_class_" + str(line[1]) + " "
            vw_line += "last_name_" + clean(line[2].split(",")[0]).replace(" ", "_") + " "
            vw_line += "title_" + clean(line[2].split(",")[1]).split()[0] + " "
            vw_line += "sex_" + clean(line[3]) + " "
            if len(str(line[4])) > 0:
                vw_line += "age:" + str(line[4]) + " "
            vw_line += "siblings_onboard:" + str(line[5]) + " "
            vw_line += "family_members_onboard:" + str(line[6]) + " "
            vw_line += "embarked_" + str(line[10]) + " "
            outfile.write(vw_line[:-1] + "\n")
    

    The output is two files, train_titanic.vw and test_titanic.vw. It is worth noting that sample preparation is often the most difficult and time-consuming stage. From here on we only need to run the machine learning methods on this sample a few times, and we get the result almost immediately.

    Training a linear model in Vowpal Wabbit


    All work is done from the command line by running the vw utility with the parameters passed to it. We will not dwell on a detailed description of all the parameters here; let us just run one of the examples:

    vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --adaptive --normalized --l1 0.00000001 --l2 0.0000001 -b 24

    Here we say that we want to solve a binary classification problem ( --binary ), make 20 passes over the training sample ( --passes 20 , which requires the cache file created by -c ), apply L1 and L2 regularization ( --l1 0.00000001 --l2 0.0000001 ) and normalization, and save the model itself in model.vw . The -b 24 option sets the number of bits of the hash function (as mentioned at the beginning, all features are hashed, and the hash values range from 0 to 2^b - 1). Also note the -q ff parameter, which says that we also want to add pairwise (quadratic) features built from namespace f to the model; this is a very useful feature of VW that can sometimes significantly improve the quality of the algorithms.
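    To illustrate what -q ff does, here is a rough sketch of generating pairwise interactions (not VW's actual code; inside VW the combined features are again mapped to weights through hashing):

    from itertools import combinations

    def quadratic_features(features):
        """Return the features of one object plus all their pairwise combinations."""
        pairs = ["%s^%s" % (a, b) for a, b in combinations(features, 2)]
        return features + pairs

    print(quadratic_features(["passenger_class_3", "sex_male", "title_mr"]))
    # ['passenger_class_3', 'sex_male', 'title_mr',
    #  'passenger_class_3^sex_male', 'passenger_class_3^title_mr', 'sex_male^title_mr']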

    After some time we will get a trained model. It only remains to run it on the test sample:

    vw -d test_titanic.vw -t -i model.vw -p preds_titanic.txt

    Here -t means that we only make predictions without further training, -i specifies the saved model, and -p the file where the predictions are written. It remains to convert the result for submission to kaggle.com:

    with open("preds_titanic.txt", "r") as infile, open("kaggle_preds.csv", "w") as outfile:
        outfile.write("PassengerId,Survived\n")
        for line in infile:
            # each line of the predictions file looks like "<prediction> <tag>",
            # where the tag is the PassengerId we wrote into the .vw file
            pred, passenger_id = line.split()
            # a positive prediction corresponds to class +1 (survived)
            if float(pred) > 0:
                outfile.write(passenger_id + ",1\n")
            else:
                outfile.write(passenger_id + ",0\n")
    

    This simple solution already shows pretty good quality: a score above 0.79 on the public leaderboard. We applied a rather trivial model; by tuning the parameters and "playing" with the features you can improve the result somewhat (the reader is invited to do this as an exercise). I hope this introduction helps beginners cope with large volumes of data in machine learning competitions!
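    A convenient way to run such experiments without submitting every attempt to Kaggle is to hold out part of train_titanic.vw for validation, train on the rest and measure the quality locally. A possible sketch (the file names and the 80/20 split are arbitrary, and the vw options are simply the ones used above):

    import random
    import subprocess

    # split the converted training file into a train part and a holdout part
    with open("train_titanic.vw") as f:
        lines = f.readlines()
    random.seed(0)
    random.shuffle(lines)
    split = int(0.8 * len(lines))
    with open("train_part.vw", "w") as f:
        f.writelines(lines[:split])
    with open("valid_part.vw", "w") as f:
        f.writelines(lines[split:])

    # train on 80% of the data and predict on the holdout
    subprocess.call("vw train_part.vw -f model_part.vw --binary --passes 20 -c -q ff -b 24", shell=True)
    subprocess.call("vw -d valid_part.vw -t -i model_part.vw -p valid_preds.txt", shell=True)

    # compare the sign of the predictions with the true +1/-1 labels of the holdout
    truth = [line.split()[0] for line in lines[split:]]
    preds = [line.split()[0] for line in open("valid_preds.txt")]
    correct = sum(1 for t, p in zip(truth, preds) if float(t) * float(p) > 0)
    print("holdout accuracy: %.3f" % (correct / float(len(truth))))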
