Introduction to Machine Learning with Python and Scikit-Learn

Hi, Habr!



My name is Alexander. I work on machine learning and web graph analysis (mostly theoretical), and on Big Data product development at one of the Big Three mobile operators. This is my first post, so please don't judge too harshly!)

Recently, people who want to learn how to develop effective algorithms and take part in machine learning competitions have increasingly been coming to me with the question: "Where do I start?" Some time ago I led the development of Big Data tools for media and social network analysis at one of the institutions of the Government of the Russian Federation, and I still have some material my team trained on that I can share. It assumes the reader has a solid background in mathematics and machine learning (my team consisted mainly of MIPT graduates and students of the School of Data Analysis).

Essentially, it was an introduction to Data Science. This field has become quite popular lately. Machine learning competitions are held more and more often (for example, Kaggle or TunedIT), frequently with substantial prize budgets. The purpose of this article is to give the reader a quick introduction to machine learning tools so that they can start competing as soon as possible.

The most common Data Scientist tools today are R and Python. Each has its pros and cons, but lately Python has been winning on all fronts (this is solely the author's opinion, and I use both). This happened after the appearance of the well-documented Scikit-Learn library, which implements a large number of machine learning algorithms.

Note right away that this article focuses on machine learning algorithms. Initial data analysis is usually best done with the Pandas package, which is easy enough to pick up on your own. So we will focus on implementation, assuming for definiteness that the input is an object-feature matrix stored in a file with the .csv extension.
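
For example, a first look at the data with Pandas might go like this (a minimal sketch; pima-indians-diabetes.data is a local copy of the UCI file we download below, which has no header row):

import pandas as pd
# load the CSV into a DataFrame (no header row in this file,
# so pandas assigns numeric column names)
df = pd.read_csv("pima-indians-diabetes.data", header=None)
# quick look: first rows and summary statistics
print(df.head())
print(df.describe())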

Data loading


First of all, the data must be loaded into RAM so that we can work with it. The Scikit-Learn library itself uses NumPy arrays in its implementation, so we will load the .csv file with NumPy. Let's download one of the datasets from the UCI Machine Learning Repository:

import numpy as np
from urllib.request import urlopen  # Python 3; in Python 2 use urllib.urlopen
# url with dataset (Pima Indians Diabetes: 8 features, binary target)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the features (columns 0-7) from the target attribute (column 8)
X = dataset[:, 0:8]
y = dataset[:, 8]
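
A quick sanity check that everything loaded correctly (the Pima dataset contains 768 objects with 8 features each):

# expected output: (768, 8) (768,)
print(X.shape, y.shape)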

In all the examples below we will work with this dataset, i.e. with the object-feature matrix X and the values of the target variable y.

Data normalization


It is well known that most gradient-based methods (on which almost all machine learning algorithms essentially rely) are highly sensitive to the scaling of the data. Therefore, before running the algorithms, either normalization or so-called standardization is usually performed. Normalization means rescaling each feature so that it lies in the range from 0 to 1. Standardization is a preprocessing step after which each feature has a mean of 0 and a variance of 1. Scikit-Learn already provides ready-made functions for this:

from sklearn import preprocessing
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
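
A caveat: preprocessing.normalize actually rescales each sample (each row) to unit norm. The per-feature rescaling to the range [0, 1] described above is done by MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler
# rescale every feature to the range [0, 1]
scaler = MinMaxScaler()
rescaled_X = scaler.fit_transform(X)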


Feature Selection


It is no secret that the most important part of solving a problem is often the ability to select, and even construct, features correctly. In the English-language literature this is called Feature Selection and Feature Engineering. While Feature Engineering is a rather creative process that relies mostly on intuition and expert knowledge, there are already plenty of ready-made algorithms for Feature Selection. Tree-based algorithms, for instance, allow you to compute the informativeness of features:

from sklearn.ensemble import ExtraTreesClassifier
# fit an ensemble of extremely randomized trees to the data
model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each feature
print(model.feature_importances_)
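
To make the output easier to read, the importances can be paired with feature names (the shorthand names below follow the UCI description of the Pima dataset and are introduced here purely for illustration):

# shorthand names for the 8 Pima features, in column order
names = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"]
# print features sorted by decreasing importance
for name, score in sorted(zip(names, model.feature_importances_), key=lambda p: -p[1]):
    print(name, score)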

All other methods are based, one way or another, on an efficient search over subsets of features, aiming to find the subset on which the resulting model gives the best quality. One such search algorithm is Recursive Feature Elimination, also available in the Scikit-Learn library:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select the 3 best attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
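
The selected features can then be extracted directly:

# keep only the 3 columns chosen by RFE
X_selected = rfe.transform(X)
print(X_selected.shape)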


Building algorithms


As already noted, Scikit-Learn implements all the basic machine learning algorithms. Let's consider some of them.

Logistic Regression


It is most often used for (binary) classification problems, but multiclass classification is also possible (via the so-called one-vs-all method). The advantage of this algorithm is that, for every object, the output is the probability of belonging to a class:

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
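
And since the advantage emphasized above is exactly the class probabilities, note that they can be obtained directly:

# probability of belonging to each class (columns: class 0, class 1)
probabilities = model.predict_proba(X)
print(probabilities[:5])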


Naive Bayes


This is also one of the best-known machine learning algorithms; its main task is to recover the distribution density of the training data. This method often gives good quality specifically in multiclass classification problems.

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
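
Since GaussianNB models each class with a Gaussian density, the estimated parameters can be inspected directly (a small caveat: depending on the scikit-learn version, the per-class variances live in var_ or in the older sigma_ attribute):

# per-class means of each feature, estimated from the training sample
print(model.theta_)
# per-class variances of each feature (sigma_ in older versions)
print(model.var_)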


K-nearest neighbors


The kNN (k-Nearest Neighbors) method is often used as a component of a more complex classification algorithm. For instance, its output can be used as a feature for an object. And sometimes a simple kNN on well-chosen features gives excellent quality. With a proper choice of parameters (mainly the distance metric), the algorithm often performs well in regression problems too:

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
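
For regression problems there is an analogous estimator; a minimal sketch (treating y as a numeric target purely for illustration):

from sklearn.neighbors import KNeighborsRegressor
# fit a k-nearest neighbor regressor to the data
regressor = KNeighborsRegressor(n_neighbors=5)
regressor.fit(X, y)
# each prediction is the average target over the 5 nearest neighbors
print(regressor.predict(X[:5]))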


Decision trees


Classification and Regression Trees (CART) are often used in problems where objects have categorical features, and they work for both regression and classification. Trees are very well suited to multiclass classification.

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
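
Keep in mind that an unconstrained tree fitted and evaluated on the same data, as above, will typically reproduce it almost perfectly; limiting the depth is the usual remedy (max_depth=3 here is an arbitrary illustrative value):

# a shallower tree trades training accuracy for better generalization
model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)
print(model.score(X, y))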


Support Vector Machines


SVM (Support Vector Machines) is one of the most famous machine learning algorithms used mainly for classification tasks. Like logistic regression, SVM allows multiclass classification using the one-vs-all method.

from sklearn import metrics
from sklearn.svm import SVC
# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

In addition to classification and regression algorithms, Scikit-Learn includes a huge number of more complex algorithms, including clustering, as well as implemented techniques for building compositions of algorithms, including Bagging and Boosting.
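
For example, a random forest (bagging over decision trees) and gradient boosting are used through exactly the same interface as the models above; a minimal sketch:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# bagging-style composition: an ensemble of trees on bootstrap samples
bagging_model = RandomForestClassifier(n_estimators=100)
bagging_model.fit(X, y)
# boosting-style composition: each tree corrects the errors of the previous ones
boosting_model = GradientBoostingClassifier(n_estimators=100)
boosting_model.fit(X, y)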

Optimization of algorithm parameters


One of the most difficult steps in building truly effective algorithms is choosing the right parameters. This usually gets easier with experience, but one way or another you have to resort to a brute-force search. Fortunately, Scikit-Learn already implements quite a few functions for this.

As an example, let's look at selecting the regularization parameter by trying several values in turn:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Sometimes it turns out to be many times more efficient to sample the parameter randomly from a given range, measure the quality of the algorithm for each sampled value, and pick the best one:

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search in older versions
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

We have walked through the entire process of working with the Scikit-Learn library, except for writing the results back to a file; this is left to the reader as an exercise, since one of the advantages of Python (and of the Scikit-Learn library itself) over R is its excellent documentation. In the following parts we will look at each of these sections in detail, and in particular touch on such an important topic as Feature Engineering.

I really hope this material helps novice Data Scientists start solving machine learning problems in practice as soon as possible. In conclusion, I wish success and patience to those who are just starting to take part in machine learning competitions!
