Machine Learning for an Insurance Company: Improving a Model Through Algorithm Optimization

    We have reached the finish line. A little over two months ago, I shared an introductory article about why an insurance company needs machine learning and how we tested the feasibility of the idea itself. After that, we talked about testing algorithms. Today's article is the last in the series: you will learn how to improve the model by optimizing the algorithms and the way they interact.



    Series of articles “Machine Learning for an Insurance Company”


    1. Testing the feasibility of the idea.
    2. Studying the algorithms.
    3. Improving the model through algorithm optimization.

    Tuning model complexity and the class decision boundary


    The minimum error on the training set does not always correspond to the maximum accuracy on the test set, since an overly complex model may lose its ability to generalize.

    A machine learning error consists of three parts: Bias, Variance, and Noise. As a rule, nothing can be done about the noise: it reflects the influence of factors not taken into account in the model.

    With Bias and Variance the situation is different. Bias reflects error caused by poorly captured dependencies; it decreases as complexity grows, and large values indicate that the model is undertrained. Variance reflects the model's sensitivity to fluctuations in the input data; it grows with complexity and indicates overfitting. Hence the concept of the bias/variance tradeoff.

    The following image illustrates this best:

    The graph shows that the optimal model complexity is the one that minimizes the sum of the two terms, C* = argmin(Variance + Bias²), and this point does not coincide with the minimum of Bias.
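
    To make the tradeoff tangible, here is a minimal sketch (scikit-learn and synthetic data are our assumptions, not part of the article's Azure ML experiment) that sweeps tree depth and prints the training and test errors: the training error keeps falling, while the test error traces the U-shaped curve from the graph.

```python
# Minimal bias/variance demo: as tree depth (model complexity) grows, training
# error falls monotonically while test error bottoms out and rises again.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=500)   # signal + irreducible noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 2, 4, 8, 16):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth,
          round(mean_squared_error(y_tr, model.predict(X_tr)), 3),   # keeps falling
          round(mean_squared_error(y_te, model.predict(X_te)), 3))   # U-shaped
```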

    The complexity of a model consists of two parts. The first part is common to all models: the number of features used (the dimensionality of the input data). In our example there are few of them, so filtering is unlikely to increase accuracy, but we will demonstrate the principle itself. Besides, there is always a chance that some columns are redundant, and deleting them will not make the model worse. Since, at equal accuracy, the simpler solution should be preferred, even such a change is useful.

    Azure ML has several feature-filtering modules for screening out the less useful features. We will look at Filter Based Feature Selection.



    This module includes seven filtering methods: Pearson correlation, mutual information, Kendall correlation, Spearman correlation, chi-squared, Fisher score, and count based. We take Kendall correlation, because the data is not normally distributed and has no well-defined linear relationship. We set the desired number of features so that exactly one column is removed.
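
    Outside Azure ML, the same filtering idea can be sketched with pandas and SciPy. The dataset file and target column name below are hypothetical; only the Kendall-correlation ranking mirrors the module's behavior.

```python
# Rough analogue of Filter Based Feature Selection with the Kendall method:
# rank features by |Kendall tau| with the target and keep the top k,
# here chosen so that exactly one column is dropped.
import pandas as pd
from scipy.stats import kendalltau

df = pd.read_csv("policies.csv")            # hypothetical training dataset
target = df["target"]                       # hypothetical binary target column
features = df.drop(columns=["target"])

scores = {}
for col in features.columns:
    tau, _ = kendalltau(features[col], target)
    scores[col] = abs(tau)

k = len(features.columns) - 1               # remove exactly one column
keep = sorted(scores, key=scores.get, reverse=True)[:k]
print("dropped column(s):", set(features.columns) - set(keep))
```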

    Let's look at the similarity coefficients assigned to the input columns relative to the target.



    The ageLessThan19 column has a low correlation with the target, so it can be neglected. We will verify this by training the Random Forest model with the same settings used in the example from the previous article.



    The red curve corresponds to the old model. Removing the column led to a slight degradation, but within statistical error. Consequently, the deleted column really had no significant effect on the model and was not needed.
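
    The check itself is easy to reproduce outside the designer. The sketch below (scikit-learn in place of Azure ML, reusing the hypothetical features/target frames from the previous sketch) retrains the same kind of Random Forest with and without the suspect column and compares ROC AUC.

```python
# Compare ROC AUC with the full feature set and without the low-correlation column.
# Reuses the hypothetical `features` / `target` frames from the sketch above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=0)

def auc_dropping(columns):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train.drop(columns=columns), y_train)
    scores = clf.predict_proba(X_test.drop(columns=columns))[:, 1]
    return roc_auc_score(y_test, scores)

print("full feature set:     ", round(auc_dropping([]), 4))
print("without ageLessThan19:", round(auc_dropping(["ageLessThan19"]), 4))
```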

    The second part of the model's complexity depends on the chosen algorithm. In our case this is Random Forest, whose complexity is primarily determined by the maximum depth of the trees being built. Other parameters also matter, but to a lesser extent.

    Consider the Azure ML modules that help with these tasks. We will start with Tune Model Hyperparameters.



    Consider its settings:

    • Specify parameter sweeping mode: sets the search mode (in this case, a random grid). The search is performed within the value ranges specified when the algorithm is initialized.
    • Maximum number of runs on random grid: the number of parameter combinations that will be tried when training the model.
    • Metric for measuring performance for classification: the metric used as the target value when optimizing a classification task.

    The remaining parameters either do not matter in our case or do not need a separate description.

    For the module to work correctly, one of the algorithm initialization parameters has to be changed; we skipped its description in the previous article because of its special purpose. This is Create trainer mode, in which Parameter Range must be selected. In this mode you can provide several candidate values for the algorithm's numerical parameters. You can also switch each numerical parameter to range mode and specify the range from which potential values will be drawn. This is exactly what we need: Tune Model Hyperparameters uses these ranges to find the optimal values. In our example, to save time, we set a range only for the decision tree depth. A rough analogue is sketched below.
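
    For readers who want a non-Azure analogue, the mechanism can be sketched with scikit-learn's RandomizedSearchCV (an assumption of ours, not the module's actual API); only the tree depth is given a range, mirroring the time-saving choice above, and the hypothetical X_train / y_train from the earlier sketches are reused.

```python
# Rough analogue of Tune Model Hyperparameters in random-grid mode: give
# max_depth a range and let the search sample a limited number of combinations.
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {"max_depth": randint(2, 33)}     # the swept range
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_distributions,
    n_iter=10,              # "Maximum number of runs on random grid"
    scoring="roc_auc",      # "Metric for measuring performance for classification"
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```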



    Another module that may come in handy is Partition and Sample. In Assign to Folds mode it splits the data into a specified number of parts. When such data is fed to the hyperparameter tuning module, that module starts working in cross-validation mode. The remaining settings let you specify the number of folds and the details of the partition (for example, an even distribution of the values of one of the columns across the folds).
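
    A comparable fold-based evaluation (again a scikit-learn sketch, not the Partition and Sample module itself) looks like this: the training set is split into stratified folds and the model is scored on each of them instead of on a single validation split.

```python
# Partition the training data into stratified folds ("assign to folds") and
# evaluate across them -- the cross-validation mode described above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc_per_fold = cross_val_score(clf, X_train, y_train, cv=folds, scoring="roc_auc")
print("AUC per fold:", auc_per_fold.round(4), "mean:", auc_per_fold.mean().round(4))
```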



    With minimal effort, this allowed us to slightly improve the AUC. With proper settings and a more thorough search for optimal parameters, the improvement would be more significant.



    Now consider another tool for tuning how the model's output is processed. Since we are dealing with binary classification, class membership is determined via a threshold value. You can examine its effect in the corresponding tab of the Evaluate module. The classifier does not output a hard class label but a confidence that the example belongs to the "positive" class, which takes values from 0 to 1. In other words, by default, if the classifier outputs a value above 0.5, the prediction in our case will be positive.

    0.5 is not always the optimal threshold. If the classes are of equal importance, F1 can serve as a good criterion, although in practice this is rare. Suppose an FN is twice as expensive as an FP. Let's consider different thresholds and evaluate the total cost for each.

    In the graph below, the total cost of FP errors is shown in blue, FN in orange, and their sum in green. As you can see, the minimum total cost is reached at threshold = 0.31, with a value of ≈40.4k. For comparison, at a threshold of 0.5 the cost is 15k higher.
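
    The threshold sweep is straightforward to reproduce. In the sketch below, the unit costs (1 for an FP, 2 for an FN) and the reuse of the earlier hypothetical test split are assumptions; the article's actual figures come from its own data.

```python
# Sweep the decision threshold, pricing an FN at twice an FP, and find the cutoff
# with the minimum total error cost.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]     # confidence in the positive class

def total_cost(y_true, scores, threshold, fp_cost=1.0, fn_cost=2.0):
    predicted = (scores >= threshold).astype(int)
    fp = np.sum((predicted == 1) & (y_true == 0))
    fn = np.sum((predicted == 0) & (y_true == 1))
    return fp * fp_cost + fn * fn_cost

thresholds = np.arange(0.05, 0.96, 0.01)
costs = [total_cost(y_test.to_numpy(), proba, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print("optimal threshold:", round(best, 2), "total cost:", min(costs))
```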



    For comparison, let's calculate the minimum cost for the model in which the previously removed column was kept.



    The illustration shows that the cost is higher. The difference is small, but even this example is enough to show that more features do not always produce a better model.

    Search for statistical outliers


    Outliers can have a critical impact on the outcome of model training. If the method used is not robust, they can severely degrade the result.

    As an example, consider the following case.



    Although most of the data is concentrated near the diagonal, a single point lying far from the rest changes the slope of the fitted function (green). Such points are called high-leverage points; they are often caused by measurement or recording errors. This is only one type of statistical outlier and one way outliers can affect a model. In our case the chosen model is robust, so it will not be affected this way, but it is still worth removing anomalous data.

    To find these points, you can use the Clip Values module. It processes values that fall outside the specified bounds and can either substitute a different value for them or mark them as missing. For the purity of the experiment, we perform this operation only on the training records.



    Using the missing-value cleaning module, you can then remove the rows with missing data, obtaining a dataset filtered of anomalies.
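
    In pandas terms, the two steps roughly amount to the following. The percentile bounds and the claimAmount column are hypothetical; only the "mark as missing, then drop the rows" logic mirrors the modules, and it is applied to the training records only.

```python
# Rough analogue of Clip Values in "mark as missing" mode followed by removing
# the flagged rows (training data only).
import numpy as np

def mark_outliers_missing(frame, column, lower_pct=1, upper_pct=99):
    out = frame.copy()
    lo, hi = np.percentile(out[column].dropna(), [lower_pct, upper_pct])
    out.loc[(out[column] < lo) | (out[column] > hi), column] = np.nan   # mark as missing
    return out

train = X_train.join(y_train)                                # training records only
train_clean = mark_outliers_missing(train, "claimAmount").dropna()
print(len(train) - len(train_clean), "anomalous rows removed")
```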

    Check the accuracy on the new dataset.



    The model became slightly better on all four metrics. It is worth recalling that the algorithm we are using is robust; in certain situations, removing statistical outliers will give an even greater increase in accuracy.

    Using a committee (ensemble) of classifiers


    A committee combines several models into one to obtain better results than any of them can provide individually. Admittedly, the algorithm we use is itself a committee: it builds many different decision trees and combines their results into one. However, we do not have to stop there.

    Consider one of the simplest and, at the same time, popular and effective ways of combining models: stacked generalization. Its essence is to use the outputs of various algorithms as features for a model of a higher level in the hierarchy. Like any other committee, it does not guarantee an improvement in the result; however, models obtained this way usually turn out to be more accurate. For our example, let's take a series of binary classification algorithms available in Azure ML Studio: Averaged Perceptron, Bayes Point Machine, Boosted Decision Tree, Random Forest, Decision Jungle, and Logistic Regression. At this stage we will not go into their details; we will simply check how the committee works.
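
    As a rough sketch of the same idea outside Azure ML (scikit-learn assumed; Bayes Point Machine, Averaged Perceptron, and Decision Jungle have no direct counterpart there, so only a few analogous base learners are used), stacked generalization looks like this:

```python
# Stacked generalization: out-of-fold predictions of the base models become
# features for a higher-level (meta) model. Base learners are rough analogues
# of the Azure ML list, e.g. GradientBoosting in place of Boosted Decision Tree.
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

base_learners = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("gbt", GradientBoostingClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),   # the higher-level model
    cv=5,                                                # out-of-fold base predictions
)
stack.fit(X_train, y_train)
print("stacked AUC:", round(roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]), 4))
```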



    As a result, we get one more small improvement of the model across all metrics. Now recall that the classes in our model are not equivalent and we agreed that an FN is twice as expensive as an FP. To find the optimal point, we check different threshold values.



    The minimum total cost of errors occurs at a threshold of 0.29 and reaches 40.2k. This is only 0.2k less than the figure obtained before removing the anomalies and building the committee. Of course, much depends on the real monetary equivalent of this difference. But more often, with such a slight improvement, it makes sense to apply Occam's razor and choose a slightly less accurate but simpler model: the one we had after tuning the parameters with the Tune Model Hyperparameters module, with a minimum error cost of 40.4k.

    Summary


    In the final part of the series of articles on machine learning, we examined:

    1. Setting the model complexity and choosing the optimal threshold value.
    2. Searching for and removing statistical outliers from the training data.
    3. Building a committee of several algorithms.
    4. Choosing the final model structure.

    In this series of articles, we presented a simplified version of the cost prediction system that was implemented as part of a comprehensive solution for an insurance company. In the demonstration model we covered most of the steps needed to solve this kind of problem: building a prototype, analyzing and processing the data, selecting and tuning algorithms, and other tasks.

    We also examined one of the most important aspects: choosing the complexity of the model. The final structure we obtained gave the best results in terms of accuracy metrics, F1-score, and others. However, when assessing the losses from false positive and false negative results, it yielded very little gain. In this respect, the less complex model described at the beginning of this article looks more attractive.

    This is not everything machine learning can offer in such tasks, but we deliberately simplified the demonstration model for clarity and, in part, because of the NDA. The model features not included in the article are specific to the client's business and are not applicable to other projects, since most machine-learning-based solutions require an individual approach. The full version of the system is used in our customer's real project, and we continue to improve it.

    About the Authors


    The WaveAccess team creates technically sophisticated, highly loaded, and fault-tolerant software for companies in different countries. A comment from Alexander Azarov, head of machine learning at WaveAccess:
    Machine learning allows us to automate areas where expert judgment currently dominates. This reduces the impact of the human factor and makes the business more scalable.
