Optimization of hyperparameters in Vowpal Wabbit with the new vw-hyperopt module

    Hello, Habr! This article is about a not-so-pleasant aspect of machine learning: hyperparameter optimization. Two weeks ago the vw-hyperopt.py module was added to the well-known and very useful Vowpal Wabbit project; it can find good hyperparameter configurations for Vowpal Wabbit models in high-dimensional search spaces. The module was developed internally at DCA (Data-Centric Alliance).


    To find good configurations, vw-hyperopt uses algorithms from the Python library Hyperopt and can optimize hyperparameters adaptively with the Tree-Structured Parzen Estimators (TPE) method. This makes it possible to find better optima than plain grid search, given the same number of iterations.

    This article will be of interest to everyone who works with Vowpal Wabbit, and especially to those who are annoyed by the lack of a built-in way to tune its numerous knobs and have either tuned them by hand or coded the optimization themselves.

    Hyperparameters


    What are hyperparameters? They are all the “degrees of freedom” of an algorithm that it does not optimize directly, but on which the result depends. Sometimes the result depends on them only slightly, and then, unless it is a Kaggle competition, you can get by with the default values or pick them by hand. But sometimes an unsuccessful configuration can ruin everything: the algorithm will either overfit badly or, on the contrary, fail to use most of the information.

    In a narrow sense, hyperparameters often mean only regularization and other “obvious” settings of machine learning methods. In a broad sense, however, hyperparameters are any manipulations of the data that can affect the result: feature engineering, observation weighting, undersampling, and so on.

    Grid search


    Of course, it would be nice to have an algorithm that, in addition to optimizing parameters, would also optimize hyperparameters, and even better if we could trust that algorithm more than our intuition. Some steps have been taken in this direction. Naive methods are built into many machine learning libraries: grid search, which walks over a grid of values, and random search, which samples points from a fixed distribution (the best-known implementations are GridSearchCV and RandomizedSearchCV in sklearn; a minimal sketch of both follows the list below). The advantage of grid search is that it is easy to code yourself and easy to parallelize. However, it also has serious disadvantages:

    • It wastes evaluations on many obviously unpromising points. Suppose you already have a set of configurations with results or some other prior information. A person can tell which configurations will clearly give a poor result and will know not to check those regions again. Grid search cannot do this.

    • If there are many hyperparameters, the grid “cell” has to be made too large, and a good optimum can be missed. Thus, if you include in the search space many extra hyperparameters that do not affect the result, grid search will perform much worse for the same number of iterations. For random search this is true to a lesser extent.
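
    Here is a minimal sketch of both naive approaches in sklearn; the estimator, data and parameter ranges are invented purely for illustration:

    from scipy.stats import uniform
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    X, y = make_classification(n_samples=1000, random_state=0)
    model = LogisticRegression(max_iter=1000)

    # Grid search: every combination on a fixed grid is evaluated.
    grid = GridSearchCV(model, param_grid={'C': [0.01, 0.1, 1, 10]}, cv=3)
    grid.fit(X, y)

    # Random search: points are sampled from a fixed distribution.
    rand = RandomizedSearchCV(model, param_distributions={'C': uniform(0.01, 10)},
                              n_iter=20, cv=3, random_state=0)
    rand.fit(X, y)

    print(grid.best_params_, rand.best_params_)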


    Bayesian methods


    Adaptive Bayesian methods were invented to reduce the number of iterations needed to find a good configuration. They choose the next point to evaluate taking into account the results at the points already evaluated. The idea is to find, at each step, a compromise between (a) exploring regions near the most successful points found so far and (b) exploring regions of high uncertainty, where even better points may be hiding. This is often called the explore-exploit (or “learning vs. earning”) dilemma. So in situations where evaluating each new point is expensive (in machine learning, evaluation = training + validation), one can approach the global optimum in far fewer steps.
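
    To make the idea more concrete, here is a small sketch of a generic Bayesian optimization loop with a Gaussian-process surrogate and the expected-improvement acquisition function. This is only an illustration of the adaptive scheme described above, not the TPE algorithm that Hyperopt actually uses, and the objective function is a toy stand-in for “train + validate”:

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def expensive_objective(x):
        # stand-in for "train a model with hyperparameter x and return validation loss"
        return np.sin(3 * x) + 0.1 * x ** 2

    def expected_improvement(candidates, gp, best_y):
        mu, sigma = gp.predict(candidates, return_std=True)
        sigma = np.maximum(sigma, 1e-9)
        z = (best_y - mu) / sigma              # we are minimizing the loss
        return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    rng = np.random.RandomState(0)
    X = rng.uniform(-2, 2, size=(3, 1))        # a few random points to start with
    y = np.array([expensive_objective(x[0]) for x in X])

    for _ in range(20):
        gp = GaussianProcessRegressor().fit(X, y)           # model of the loss surface
        candidates = np.linspace(-2, 2, 200).reshape(-1, 1)
        ei = expected_improvement(candidates, gp, y.min())
        x_next = candidates[np.argmax(ei)]                  # explore/exploit trade-off
        X = np.vstack([X, x_next])
        y = np.append(y, expensive_objective(x_next[0]))

    print('best x:', X[np.argmin(y)][0], 'best loss:', y.min())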

    Similar algorithms, in different variations, are implemented in the tools MOE, Spearmint, SMAC, BayesOpt and Hyperopt. We will dwell on the last one in more detail, since vw-hyperopt is a wrapper around Hyperopt, but first a few words about Vowpal Wabbit.

    Vowpal Wabbit


    Many of you have probably used this tool or at least heard of it. In short, it is one of the fastest (if not the fastest) machine learning libraries in the world. Training a model for our CTR predictor (binary classification) on 30 million examples and tens of millions of features takes only a few gigabytes of RAM and 6 minutes on a single core. Vowpal Wabbit implements several online algorithms:

    • Stochastic gradient descent with different bells and whistles;
    • FTRL-Proximal, which you can read about here;
    • An online analogue of SVM;
    • Online boosting;
    • Factorization machines.

    In addition, it implements feed-forward neural networks, batch optimization (BFGS) and LDA. Vowpal Wabbit can also be run in daemon mode, receiving a stream of data as input and either training on it or simply returning predictions.

    FTRL and SGD can solve both regression and classification problems; this is controlled only by the loss function. These algorithms are linear in the features, but non-linearity is easy to add via polynomial feature interactions. There is also a very useful early-stopping mechanism that protects you from overfitting if you have specified too many passes (epochs).

    Vowpal Wabbit is also famous for its feature hashing, which acts as an additional regularizer when there are many features. Thanks to it, you can train on categorical features with billions of rare categories, fitting the model into RAM without sacrificing quality.
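
    Schematically, the hashing trick works like this (a toy illustration only: Vowpal Wabbit itself uses murmurhash and the -b option, 18 bits by default, rather than Python's built-in hash):

    import numpy as np

    b = 18                                # number of hash bits, as in vw -b 18
    weights = np.zeros(2 ** b)            # the whole model fits into 2**b weights

    def feature_index(namespace, feature):
        # every feature name maps to one of 2**b slots; rare categories
        # collide with each other instead of blowing up the model size
        return hash(namespace + '^' + feature) % (2 ** b)

    def linear_score(example):
        # example: list of (namespace, feature, value) triples
        return sum(weights[feature_index(ns, f)] * v for ns, f, v in example)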

    Vowpal Wabbit requires a special input data format, but it is easy to understand. It is naturally sparse and takes up little space. Only one observation (or several, for LDA) is kept in RAM at any given time. Training is easiest to run from the console.
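
    For a rough idea of what the format looks like, here are two made-up example lines (a label, then namespaces with their features; a feature value defaults to 1 if omitted):

    1 |user age:25 city_moscow |banner id_742 category_auto
    -1 |user age:34 city_spb |banner id_901 category_travel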

    Those interested can read the tutorial and other examples and articles in the repository, as well as a presentation. The internals of Vowpal Wabbit are described in detail in John Langford's publications and in his blog. There is also a relevant post on Habr. A list of arguments can be obtained via vw --help or read in the detailed description. As that description shows, there are dozens of arguments, and many of them can be treated as hyperparameters to be optimized.

    Vowpal Wabbit ships with a vw-hypersearch module that can tune a single hyperparameter using the golden-section method. However, if there are several local minima, this method will most likely find something far from the best option. Moreover, you often need to optimize many hyperparameters at once, and vw-hypersearch cannot do that. A couple of months ago I tried to write a multidimensional golden-section method, but the number of steps it needed to converge exceeded any grid search, so that option was dropped. It was decided to use Hyperopt instead.

    Hyperopt


    This Python library implements the Tree-Structured Parzen Estimators (TPE) optimization algorithm. Its advantage is that it can handle very “awkward” search spaces: one hyperparameter is continuous, another is categorical, a third is discrete but its neighbouring values are correlated, and, finally, some combinations of parameter values may simply make no sense. TPE takes a hierarchical search space with prior probabilities and at each step mixes them with Gaussians centred at the points already evaluated. Its author, James Bergstra, claims that this algorithm handles the explore-exploit problem quite well and outperforms both grid search and expert tuning, at least for deep learning tasks, where there are many hyperparameters; see here and here. You can read about the TPE algorithm itself here. Perhaps a detailed post about it will follow in the future.

    Although Hyperopt has not been embedded into the source code of well-known machine learning libraries, many people use it. For example, here is a great hyperopt + sklearn tutorial, and here is an application of hyperopt + xgboost. My entire contribution is this similar wrapper for Vowpal Wabbit, a more or less tolerable syntax for specifying the search space, and the ability to launch all of this from the command line. Since Vowpal Wabbit had no such functionality yet, Langford liked the module and merged it. In fact, anyone can try Hyperopt with their favourite machine learning tool: it is easy to do, and everything you need is in this tutorial.
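
    To give a flavour of what calling Hyperopt directly looks like, here is a toy sketch (the objective and the search space are made up; in practice the objective would train a model and return a validation loss):

    from hyperopt import fmin, hp, tpe, Trials, STATUS_OK

    def objective(params):
        # pretend this is the loss obtained after training with these hyperparameters
        loss = (params['x'] - 3) ** 2 + params['y']
        return {'loss': loss, 'status': STATUS_OK}

    space = {
        'x': hp.uniform('x', -10, 10),     # continuous hyperparameter
        'y': hp.choice('y', [0, 1]),       # categorical hyperparameter
    }

    trials = Trials()                      # keeps the history of evaluated points
    best = fmin(objective, space, algo=tpe.suggest, max_evals=100, trials=trials)
    print(best)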

    vw-hyperopt


    Let's move on to using the vw-hyperopt module. First you need to install the latest version of Vowpal Wabbit from GitHub. The module is located in the utl folder.

    Attention! The most recent changes (in particular, the new command syntax) are not yet (as of December 15) merged into the main repository. I hope this will be resolved in the coming days; for now you can use the latest version of the code from my branch. EDIT: as of December 22 the changes have been merged, and you can now use the main repository.

    Usage example:


    ./vw-hyperopt.py --train ./train_set.vw --holdout ./holdout_set.vw --max_evals 200 --outer_loss_function logistic --vw_space '--algorithms=ftrl,sgd --l2=1e-8..1e-1~LO --l1=1e-8..1e-1~LO -l=0.01..10~L --power_t=0.01..1 --ftrl_alpha=5e-5..8e-1~L --ftrl_beta=0.01..1 --passes=1..10~I --loss_function=logistic -q=SE+SZ+DR,SE~O --ignore=T~O'  --plot 

    As input, the module requires a training set and a validation (holdout) set, as well as prior distributions of the hyperparameters (quoted inside --vw_space). You can specify integer, continuous or categorical hyperparameters. For all but the categorical ones you can choose a uniform or log-uniform distribution (in the syntax above, the suffixes after ~ encode this: L for log-uniform, I for integer, and O for a parameter that may be omitted altogether). Inside vw-hyperopt the search space from the example is transformed into roughly the following object for Hyperopt (if you have gone through the Hyperopt tutorial, you will recognise it):

    from hyperopt import hp
    from math import log  # log-uniform bounds are given in log space
    prior_search_space = hp.choice('algorithm', [
        {'type': 'sgd',
         '--l1': hp.choice('sgd_l1_outer', ['empty', hp.loguniform('sgd_l1', log(1e-8), log(1e-1))]),
         '--l2': hp.choice('sgd_l2_outer', ['empty', hp.loguniform('sgd_l2', log(1e-8), log(1e-1))]),
         '-l': hp.loguniform('sgd_l', log(0.01), log(10)),
         '--power_t': hp.uniform('sgd_power_t', 0.01, 1),
         '-q': hp.choice('sgd_q_outer', ['empty', hp.choice('sgd_q', ['-q SE -q SZ -q DR', '-q SE'])]),
         '--loss_function': hp.choice('sgd_loss', ['logistic']),
         '--passes': hp.quniform('sgd_passes', 1, 10, 1),
        },
        {'type': 'ftrl',
         '--l1': hp.choice('ftrl_l1_outer', ['empty', hp.loguniform('ftrl_l1', log(1e-8), log(1e-1))]),
         '--l2': hp.choice('ftrl_l2_outer', ['empty', hp.loguniform('ftrl_l2', log(1e-8), log(1e-1))]),
         '-l': hp.loguniform('ftrl_l', log(0.01), log(10)),
         '--power_t': hp.uniform('ftrl_power_t', 0.01, 1),
         '-q': hp.choice('ftrl_q_outer', ['empty', hp.choice('ftrl_q', ['-q SE -q SZ -q DR', '-q SE'])]),
         '--loss_function': hp.choice('ftrl_loss', ['logistic']),
         '--passes': hp.quniform('ftrl_passes', 1, 10, 1),
         '--ftrl_alpha': hp.loguniform('ftrl_alpha', log(5e-5), log(8e-1)),
         '--ftrl_beta': hp.uniform('ftrl_beta', 0.01, 1.)
        }
    ])

    Optionally, you can change the loss function used on the validation set and the maximum number of iterations (--outer_loss_function, logistic by default, and --max_evals, 100 by default). It is also possible to store the results of every iteration and to plot them with --plot, provided matplotlib (and preferably seaborn) is installed.



    Documentation


    Since it is not customary to post detailed documentation on Habr, I will limit myself to a link. You can read about all the details on the Russian-language wiki in my fork, or wait for the English version in the main Vowpal Wabbit repository.

    Plans


    In the future, it is planned to add to the module:

    1. Support for regression and multiclass classification tasks.

    2. Support for a “warm start”: feeding Hyperopt points that have already been evaluated, so that optimization starts with their results taken into account.

    3. An option to evaluate the error at each step on a separate test set (without optimizing the hyperparameters on it). This is needed to better assess generalization, i.e. to check that we have not overfit.

    4. Support for binary parameters that take no values, such as --lrqdropout, --normalized, --adaptive, etc. At the moment you can, in principle, write --adaptive=\ ~O, but this is completely unintuitive. Something like --adaptive=~B or --adaptive=~BO would be better.

    I would be very happy if someone uses the module and finds it helpful. I will be glad to hear any suggestions or ideas, and to learn about any bugs you discover. You can write them here or create an issue on GitHub.

    Update 12.22.2015


    The pull request with the latest changes has been merged into the main Vowpal Wabbit repository, so you can now use it rather than my branch.
