burn7ng May 20, 2018 at 14:22

Multi-output in machine learning

The task of machine learning algorithms is to learn from a provided sample in order to make predictions on new data. However, the task most textbooks discuss is predicting a single value from a set of features. What if we need the reverse? That is, to obtain a whole set of features from one or a few values.

I faced a task of this kind without in-depth knowledge of mathematical statistics or probability theory, so for me it turned into a small research project.

The first thing I looked at was imputing missing data with averages, using the Imputer class that scikit-learn provides. Quoting the documentation:

The Imputer class provides basic strategies for imputing missing values, using the mean, the median, or the most frequent value of the row or column in which the missing values are located.

Even though I realized that the result would not be useful, I still decided to try this class, and here is what happened:

import pandas as pd
# Imputer is only available in scikit-learn < 0.22
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header=None)
# Column names are in Russian: 'Класс' is the class label, 'Алкоголь' is alcohol,
# 'Проантоцианидины' is proanthocyanidins, and so on
df.columns = ['Класс', 'Алкоголь', 'Яблочная кислота', 'Зола', 'Щелочность золы', 'Магний', 'Всего фенола', 'Флавоноиды', 'Фенолы нефлаваноидные', 'Проантоцианидины', 'Интенсивность цвета', 'Оттенок', 'OD280/OD315 разбавленных вин', 'Пролин']
# Learn the mean of every column, including the class column
imp = Imputer(missing_values='NaN', strategy='mean')
imp.fit(df)
# Ask for class 3 with every other feature missing
imp.transform([[3, 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN']])

array([[3.00000000e+00, 1.30006180e+01, 2.33634831e+00, 2.36651685e+00,
        1.94949438e+01, 9.97415730e+01, 2.29511236e+00, 2.02926966e+00,
        3.61853933e-01, 1.59089888e+00, 5.05808988e+00, 9.57449438e-01,
        2.61168539e+00, 7.46893258e+02]])
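
On scikit-learn 0.22 and later, where Imputer is no longer available, a rough equivalent of this experiment might look like the sketch below. It assumes that `sklearn.impute.SimpleImputer` is an acceptable stand-in and that the bundled `load_wine` dataset matches the UCI `wine.data` file (with class labels shifted from 0..2 to 1..3); neither is part of the original experiment.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer

wine = load_wine()
# Put the class label first, as in wine.data; load_wine labels are 0..2, so shift to 1..3
table = np.column_stack([wine.target + 1, wine.data])

# Learn the mean of every column, including the class column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(table)

# Class 3 with all 13 features missing
query = np.array([[3.0] + [np.nan] * 13])
filled = imp.transform(query)
print(filled)
```

Every missing entry is replaced by the overall mean of its column, regardless of the class value in the first position.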

When I checked this row with a RandomForestClassifier, it disagreed with us: the classifier assigned these values to the first class, not the third. This is not surprising, since mean imputation fills each column with its overall mean and ignores the class value entirely.
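
The check with the classifier can be sketched roughly as follows. This is an illustration under assumptions, not the original verification code: it uses the bundled `load_wine` dataset in place of the downloaded file, and the `n_estimators` and `random_state` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
# load_wine labels are 0..2; shift to 1..3 to match the labels in wine.data
X, y = wine.data, wine.target + 1

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The imputed row is simply the vector of overall column means
mean_row = X.mean(axis=0).reshape(1, -1)
predicted = clf.predict(mean_row)[0]
print("Predicted class for the mean row:", predicted)
```

Whatever class the forest picks, the row itself is the same for any requested class, so it cannot describe a "typical" wine of class 3.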

Since this method does not suit us, we turn to the MultiOutputRegressor class. MultiOutputRegressor is designed specifically for regressors that do not support multi-target regression on their own. Let's try it with ordinary least squares (LinearRegression):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
# One input feature, ten target values per sample
X, y = make_regression(n_features=1, n_targets=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
multioutput = MultiOutputRegressor(LinearRegression()).fit(X_train, y_train)
# score() returns the coefficient of determination R^2
print("Test set score: {:.2f}".format(multioutput.score(X_test, y_test)))
print("Training set score: {:.2f}".format(multioutput.score(X_train, y_train)))

Test set score: 0.82
Training set score: 0.83

The result is quite good. The logic is very simple: a separate regressor is fitted for each output target.
That is:

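import numpy as np
from sklearn.base import clone

class MultiOutputRegressor__:
    def __init__(self, est):
        self.est = est

    def fit(self, X, y):
        # y is a 2-D NumPy array of shape (n_samples, n_targets)
        n_targets = y.shape[1]
        # Fit an independent clone of the base estimator on each target column
        self.estimators_ = [clone(self.est).fit(X, y[:, i]) for i in range(n_targets)]
        return self

    def predict(self, X):
        # Stack the per-target predictions into shape (n_samples, n_targets)
        res = [est.predict(X)[:, np.newaxis] for est in self.estimators_]
        return np.hstack(res)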
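
To see that "one regressor per target" describes the behaviour, the predictions of the library wrapper can be compared with those of regressors fitted column by column. The following is a minimal sketch; the dataset parameters (`n_samples`, `noise`, `random_state`) are arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=200, n_features=3, n_targets=4,
                       noise=10.0, random_state=0)

# Library wrapper: one fit call for all four targets
wrapped = MultiOutputRegressor(LinearRegression()).fit(X, y)

# Manual version: one independent clone per target column
base = LinearRegression()
per_column = [clone(base).fit(X, y[:, i]) for i in range(y.shape[1])]
stacked = np.column_stack([est.predict(X) for est in per_column])

same = np.allclose(wrapped.predict(X), stacked)
print(same)
```

For a deterministic base estimator such as LinearRegression, the two sets of predictions coincide.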

Now let's try the RandomForestRegressor class, which supports multi-target regression out of the box, on real data.

from sklearn.ensemble import RandomForestRegressor

# Drop the class label ('Класс'); only the chemical measurements are used
df = df.drop(['Класс'], axis=1)
# Inputs: alcohol ('Алкоголь') and proanthocyanidins ('Проантоцианидины'); targets: the other 11 columns
X, y = df[['Алкоголь', 'Проантоцианидины']], df.drop(['Алкоголь', 'Проантоцианидины'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
forest = RandomForestRegressor(n_estimators=30, random_state=13)
forest.fit(X_train, y_train)
print("Test set score: {:.2f}".format(forest.score(X_test, y_test)))
print("Training set score: {:.2f}".format(forest.score(X_train, y_train)))

Test set score: 0.65
Training set score: 0.87
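
A single score hides how well each individual target is predicted. The sketch below breaks the test score down per target using `r2_score` with `multioutput="raw_values"`. It assumes the bundled `load_wine` dataset as a stand-in for the downloaded file, where column 0 is alcohol and column 8 is proanthocyanins; the numbers it prints are not taken from the experiment above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

wine = load_wine()
names = list(wine.feature_names)
in_idx = [0, 8]  # alcohol, proanthocyanins
out_idx = [i for i in range(len(names)) if i not in in_idx]

X, y = wine.data[:, in_idx], wine.data[:, out_idx]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestRegressor(n_estimators=30, random_state=13)
forest.fit(X_train, y_train)

# One R^2 value per target column instead of a single aggregate
per_target = r2_score(y_test, forest.predict(X_test), multioutput="raw_values")
for i, score in zip(out_idx, per_target):
    print("{:<30s} {:.2f}".format(names[i], score))
```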

A note on proanthocyanidins, to avoid confusion:

Proanthocyanidins are naturally occurring chemical compounds. They are found mainly in the seeds and skins of grapes; they are also present in oak and pass into wine during aging in oak barrels. The molecular weight of proanthocyanidins changes with the age of the wine: the older the wine, the more of them there are (in very old wines, the molecular weight decreases).

They significantly affect the stability of red wines.

The result is worse than on synthetic data (where the random forest reaches a score of 0.99). However, it is expected to improve as more input features are added.
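
One way to test that expectation is to score the same set of targets with two and then four input features. This is only a sketch under assumptions: it uses the bundled `load_wine` dataset, the two extra inputs (flavanoids and color intensity) are an arbitrary choice, and `held_out_score` is a helper name introduced here for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = load_wine().data
few_inputs = [0, 8]         # alcohol, proanthocyanins
more_inputs = [0, 8, 6, 9]  # plus flavanoids and color intensity (arbitrary choice)
# Use the same targets in both runs so the scores are comparable
targets = [i for i in range(data.shape[1]) if i not in more_inputs]

def held_out_score(input_cols):
    """Fit a forest on the given input columns and return its test-set score."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[:, input_cols], data[:, targets], random_state=1)
    forest = RandomForestRegressor(n_estimators=30, random_state=13)
    return forest.fit(X_train, y_train).score(X_test, y_test)

score_few = held_out_score(few_inputs)
score_more = held_out_score(more_inputs)
print("2 inputs: {:.2f}".format(score_few))
print("4 inputs: {:.2f}".format(score_more))
```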

Using multi-output methods, you can solve many interesting problems and obtain the data you actually need.

