# Data preprocessing and model analysis


Hello! In a previous post I talked about some basic classification methods. Today, owing to the specifics of the latest homework, the post will be not so much about the methods themselves as about data processing and the analysis of the resulting models.


##### Task

The data was provided by the Department of Statistics at the University of Munich, where both the dataset itself and its description can be downloaded (the field names are in German). The data consists of loan applications, each described by 20 variables; each application is also labeled with whether the applicant was granted a loan. A detailed explanation of what each variable means is available there as well.

Our task was to build a model predicting the decision that would be made for a given applicant. Alas, there was no test data or evaluation system for checking our models, as there is, for example, for MNIST. So we had some room for imagination when it came to validating our models.

##### Data pre-processing

First, let's take a look at the data itself. The graph below shows histograms of the distributions of all the variables available to us. The order of the variables has been deliberately rearranged for clarity.

Looking at these graphs, we can draw several conclusions. First, most of our variables are in fact categorical, that is, they take only a handful of values. Second, there are only two (well, maybe three) conditionally continuous variables, namely **hoehe** and **alter**. Third, there appear to be no outliers.

When working with continuous variables, a lot can generally be forgiven about the shape of their distribution, for example multimodality, where the density has a bizarre hilly shape with several peaks. Something similar can be seen in the density plot of the variable **laufzeit**. Heavy tails of the distributions, however, are the main headache when building models, since they strongly affect the models' properties and form. Outliers also strongly affect the quality of the resulting models, but since we are lucky enough not to have any, I will talk about outliers some other time.

Returning to our tails: the variables **hoehe** and **alter** have one peculiarity, they are not normal. In fact they look a lot like lognormal distributions, given the heavy right tail. With all of the above in mind, we have some grounds to log-transform these variables in order to squeeze those tails.

##### What is the strength in, brother? Or who are all these ~~people~~ variables?

Often some variables in an analysis are simply unnecessary. That is, truly unnecessary: if we drop them, then even in the worst case we lose almost nothing when solving our problem. In our credit-scoring case, the loss means a slightly lower classification accuracy.

That is the worst case. In practice, careful variable selection, popularly known as **feature selection**, can even improve accuracy: insignificant variables mostly inject noise into the model while barely affecting the result, and when there are many of them you have to separate the wheat from the chaff.

This problem arises because, at the time the data is collected, the experts do not yet know which variables will turn out to be the most significant in the analysis. At the same time, nothing stops the experimenters from collecting every variable they possibly can. The attitude is: let's collect everything there is, and the analysts will sort it out somehow.
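The claim that insignificant variables mostly inject noise is easy to see on synthetic data. Here is a minimal sketch (scikit-learn assumed; the dataset and column layout are artificial, not the credit data) comparing a kNN classifier with and without extra noise columns:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 4 informative features followed by 16 pure-noise ones
# (shuffle=False keeps the informative columns first).
X, y = make_classification(n_samples=600, n_features=20, n_informative=4,
                           n_redundant=0, shuffle=False, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
acc_informative = cross_val_score(knn, X[:, :4], y, cv=10).mean()
acc_all = cross_val_score(knn, X, y, cv=10).mean()
print('informative only: %.3f, with noise features: %.3f'
      % (acc_informative, acc_all))
```

On data like this, the run with the 16 extra noise columns typically scores noticeably lower, the same effect the kNN rows show in the results table below.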

Variables need to be cut wisely. If we simply drop features by how often they occur, as in the case of the gastarb variable, we cannot guarantee in advance that we are not throwing away a highly significant feature. In some textual or biological data this problem is even more pronounced, since it is generally rare for any given variable to take a non-zero value there.

The trouble with feature selection is that the selection criterion differs for each model fit to the data and is built specifically for that model. For linear models, for example, t-statistics are used to assess the significance of the coefficients, while Random Forest uses the relative importance of the variables across its ensemble of trees. And sometimes feature selection is built directly into the model itself.
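As an illustration of a model-specific criterion, here is a minimal sketch of Random Forest's built-in importances (scikit-learn assumed; synthetic data, not the credit dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 features, only 3 of which are informative.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=1)
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Importances are averaged over the trees and normalized to sum to 1.
for i, imp in enumerate(forest.feature_importances_):
    print('feature %d: %.3f' % (i, imp))
```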

For simplicity, we consider only the significance of the variables in a linear model: we simply fit a generalized linear model, a GLM. Since our target variable is a class label, it has a (conditionally) binomial distribution. Using the glm function in R, we fit this model and look under the hood by calling summary on it. The result is the following table:

We are interested in the very last column. It gives the probability that the corresponding coefficient is zero, that is, that it plays no role in the final model. The asterisks mark the relative significance of the coefficients. From the table we see that, generally speaking, we can ruthlessly cut out almost all variables except **laufkont**, **laufzeit**, **moral** and **sparkont** (**intercept** is the bias term; we need it too). We selected them based on the statistics obtained: these are the variables whose p-value is at most 0.01.

If we turn a blind eye to model validation, assuming the linear model does not overfit our data, we can check this hypothesis. Namely, we test the accuracy of two models on all the data: one with 4 variables and one with all 20. With 20 variables the classification accuracy is 77.1%, while with 4 variables it is 76.1%. Apparently not much of a loss.

Interestingly, the variables we log-transformed do not affect the model at all: whether left raw or log-transformed, their significance never even reached the 0.1 level.

##### Analysis

We decided to build the classifiers themselves in Python, using scikit-learn. In the analysis we used all the basic classifiers that scikit-learn provides, varying their hyperparameters a bit. Here is the list of what we ran:

Links to the documentation for the classes implementing these algorithms are given at the end of the article.

Since we had no way to test on held-out data explicitly, we used cross-validation with 10 folds. We report the classification accuracy averaged over all 10 folds.

The implementation is very transparent.


```
from sklearn.externals import joblib
from sklearn import cross_validation
from sklearn import svm
from sklearn import neighbors
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

def avg(x):
    # mean score over the folds, as a percentage
    s = 0.0
    for t in x:
        s += t
    return (s / len(x)) * 100.0

dataset = joblib.load('kredit.pkl')  # the data dumped here after preprocessing
target = np.array([x[0] for x in dataset])
train = [x[1:] for x in dataset]
numcv = 10  # number of folds

glm = LogisticRegression(penalty='l1', tol=1)
scores = cross_validation.cross_val_score(glm, train, target, cv=numcv)
print("Logistic Regression with L1 penalty - avg = %2.1f" % avg(scores))

linSVM = svm.SVC(kernel='linear', C=1)
scores = cross_validation.cross_val_score(linSVM, train, target, cv=numcv)
print("SVM with linear kernel - avg = %2.1f" % avg(scores))

poly2SVM = svm.SVC(kernel='poly', degree=2, C=1)
scores = cross_validation.cross_val_score(poly2SVM, train, target, cv=numcv)
print("SVM with polynomial kernel degree 2 - avg = %2.1f" % avg(scores))

rbfSVM = svm.SVC(kernel='rbf', C=1)
scores = cross_validation.cross_val_score(rbfSVM, train, target, cv=numcv)
print("SVM with rbf kernel - avg = %2.1f" % avg(scores))

for k in (1, 5, 11):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights='uniform')
    scores = cross_validation.cross_val_score(knn, train, target, cv=numcv)
    print("kNN %d neighbors - avg = %2.1f" % (k, avg(scores)))

for n in (5000, 10000, 15000):
    gbm = GradientBoostingClassifier(learning_rate=0.001, n_estimators=n)
    scores = cross_validation.cross_val_score(gbm, train, target, cv=numcv)
    print("Gradient Boosting %d trees, shrinkage 0.001 - avg = %2.1f" % (n, avg(scores)))

# for some reason it refuses to parallelize across several cores
for n in (10, 50, 100, 200, 300, 400, 500):
    forest = RandomForestClassifier(n_estimators=n, n_jobs=1)
    scores = cross_validation.cross_val_score(forest, train, target, cv=numcv)
    print("Random Forest %d - avg = %2.1f" % (n, avg(scores)))
```
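The script above targets an old scikit-learn release: `sklearn.cross_validation` and `sklearn.externals.joblib` have since been removed (joblib is now a standalone package). A minimal sketch of the same 10-fold loop against the modern API, with synthetic data standing in for the preprocessed `kredit.pkl`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# penalty='l1' now requires an l1-capable solver such as liblinear.
glm = LogisticRegression(penalty='l1', solver='liblinear', tol=1.0)
scores = cross_val_score(glm, X, y, cv=10)  # 10 folds, as in the article
print('Logistic Regression with L1 penalty - avg = %2.1f' % (scores.mean() * 100))
```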

After we launched our script, we got the following results:

Method with parameters | Average accuracy on 4 variables, % | Average accuracy on 20 variables, % |
---|---|---|
Logistic Regression, L1 penalty | 75.5 | 75.2 |
SVM with linear kernel | 73.9 | 74.4 |
SVM with polynomial kernel | 72.6 | 74.9 |
SVM with rbf kernel | 74.3 | 74.7 |
kNN 1 neighbor | 68.8 | 61.4 |
kNN 5 neighbors | 72.1 | 65.1 |
kNN 11 neighbors | 72.3 | 68.7 |
Gradient Boosting 5000 trees, shrinkage 0.001 | 75.0 | 77.6 |
Gradient Boosting 10000 trees, shrinkage 0.001 | 73.8 | 77.2 |
Gradient Boosting 15000 trees, shrinkage 0.001 | 73.7 | 76.5 |
Random Forest 10 | 72.0 | 71.2 |
Random Forest 50 | 72.1 | 75.5 |
Random Forest 100 | 71.6 | 75.9 |
Random Forest 200 | 71.8 | 76.1 |
Random Forest 300 | 72.4 | 75.9 |
Random Forest 400 | 71.9 | 76.7 |
Random Forest 500 | 72.6 | 76.2 |

They can be visualized more clearly in the following graph:

The average accuracy across all models on 4 variables is 72.7%.

The average accuracy across all models on all 20 variables is 73.7%.

The discrepancy with the figures earlier in the article is explained by the fact that those tests were run in a different framework.

##### Conclusions

Looking at the accuracy results of our models, we can draw a couple of interesting conclusions. We built a pack of different models, linear and non-linear, and all of them show roughly the same accuracy on this data. That is, models such as Random Forest and SVM gave no significant accuracy advantage over the linear model. Most likely this is because the underlying data was generated by something close to a linear dependence.

The upshot is that there is no point chasing accuracy on this data with heavyweight methods such as Random Forest, SVM or Gradient Boosting: everything that could be caught in this data was already caught by the linear model. Had there been clear nonlinear dependencies in the data, the accuracy gap would have been more significant.

This tells us that sometimes the data is not as complex as it seems, and you can quite quickly reach the practical maximum of what can be squeezed out of it.

Moreover, heavily reducing the data through feature selection barely affected our accuracy. In other words, our solution for this data was not only simple (cheap and cheerful) but also compact.

##### Documentation

- Logistic Regression (our GLM case)
- SVM
- kNN
- Random Forest
- Gradient Boosting

In addition, here is an example of working with cross-validation.

Thanks to the Data Mining track at GameChangers, and to Alexei Natekin, for helping me write this article.