Welford's method and one-dimensional linear regression

    One-dimensional linear regression is one of the simplest regression methods (and, arguably, one of the simplest machine learning methods overall): it describes a linear dependence of the observed value on a single feature. In general machine learning problems one has to deal with many different features; one-dimensional linear regression then picks the one that gives the best fit to the target function.


    In the previous post of this series we discussed the accuracy of computing means and covariances and got acquainted with the Welford method, which in many cases avoids computational errors in these problems. Today we will look at a practical application of the Welford method: the problem of one-dimensional linear regression.



    Contents

    1. One-dimensional linear regression
    2. Solution for centered samples
    3. Solution in the general case
    4. Application of the Welford method
    5. Experimental comparison of the methods



    1. One-dimensional linear regression


    In the one-dimensional linear regression problem we assume that there are two sequences of real numbers: features $x$ and targets $y$. In addition, there is a vector of corresponding weights $w$. As usual, we assume that these sequences contain a potentially infinite number of elements, but at the moment only the first $n$ elements of each sequence are available.


    Our task is to recover the linear relationship between the features and the targets, that is, to build a linear decision function $f: \mathbb{R} \rightarrow \mathbb{R}$:


    $f(x_i) = \alpha \cdot x_i + \beta$


    that minimizes the root-mean-square loss functional:


    $Q(f, x, y, w) = \sqrt{\frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \cdot (f(x_i) - y_i)^2}$


    For the analysis it is easier to work with a functional free of the radical and the normalization:


    $Q_1(f, x, y, w) = \sum_{i=1}^{n} w_i \cdot (f(x_i) - y_i)^2 = \sum_{i=1}^{n} w_i \cdot (\alpha x_i + \beta - y_i)^2$


    Since the square root is monotonic and the normalizer is a positive constant, the minimum points of the functionals $Q$ and $Q_1$ coincide, so this replacement is legitimate.


    2. Solution for centered samples


    For the loss functional $Q_1$ it is easy to write out the derivatives with respect to $\alpha$ and $\beta$:


    $\frac{\partial Q_1(f, x, y, w)}{\partial \alpha} = 2 \cdot \sum_{i=1}^{n} w_i x_i (\alpha x_i + \beta - y_i)$


    $\frac{\partial Q_1(f, x, y, w)}{\partial \beta} = 2 \cdot \sum_{i=1}^{n} w_i (\alpha x_i + \beta - y_i)$


    If we equate them to zero, we get:


    $\alpha \cdot \sum_{i=1}^{n} w_i x^2_i + \beta \cdot \sum_{i=1}^{n} w_i x_i - \sum_{i=1}^{n} w_i x_i y_i = 0$


    $\alpha = \frac{\sum_{i=1}^{n} w_i x_i y_i - \beta \cdot \sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i x^2_i}$


    $\beta = \frac{\sum_{i=1}^{n} w_i y_i - \alpha \cdot \sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$


    An important digression. Setting the derivatives to zero here is legitimate because:
    1. The loss functional is convex in the optimized parameters, so any local optimum is also a global optimum.
    2. As a function of the optimized parameters, the loss functional is an upward-opening paraboloid, so the stationary point found is a minimum point (a quick check of both facts via the Hessian is given just below).
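

    Both facts can be verified at once: the Hessian of $Q_1$ with respect to $(\alpha, \beta)$ does not depend on the parameters,


    $\nabla^2 Q_1 = 2 \begin{pmatrix} \sum_{i=1}^{n} w_i x_i^2 & \sum_{i=1}^{n} w_i x_i \\ \sum_{i=1}^{n} w_i x_i & \sum_{i=1}^{n} w_i \end{pmatrix}$


    and for non-negative weights it is positive semidefinite: the diagonal entries are non-negative, and $\big(\sum_{i=1}^{n} w_i x_i\big)^2 \le \big(\sum_{i=1}^{n} w_i x_i^2\big) \cdot \big(\sum_{i=1}^{n} w_i\big)$ by the Cauchy-Schwarz inequality applied to the vectors $(\sqrt{w_i} x_i)$ and $(\sqrt{w_i})$.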


    If the optimal parameter $\beta$ were equal to zero, finding a solution would be easy. One may notice that centering, a standard preprocessing technique in machine learning, leads to exactly this effect. Indeed, consider the problem for the centered variables:


    $x'_i = x_i - \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$


    $y'_i = y_i - \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i}$


    The weighted sum of the features is now zero, as is the weighted sum of the targets:


    $\sum_{k=1}^{n} w_k x'_k = \sum_{k=1}^{n} w_k \cdot \Big(x_k - \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}\Big) = \sum_{k=1}^{n} w_k x_k - \sum_{i=1}^n w_i x_i = 0$


    $\sum_{k=1}^{n} w_k y'_k = \sum_{k=1}^{n} w_k \cdot \Big(y_k - \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i}\Big) = \sum_{k=1}^{n} w_k y_k - \sum_{i=1}^n w_i y_i = 0$


    Then the optimal value of the free parameter is zero:


    $\beta' = \frac{\sum_{i=1}^{n} w_i y'_i - \alpha' \cdot \sum_{i=1}^{n} w_i x'_i}{\sum_{i=1}^{n} w_i} = 0$


    This means that the optimal value of the parameter $\alpha'$ is easy to find:


    $\alpha' = \frac{\sum_{i=1}^{n} w_i x'_i y'_i}{\sum_{i=1}^{n} w_i x'^2_i}$


    3. Solution in the general case


    Now let us return to the general case of non-centered data. If $f'$ is the decision function for the centered case, whose values are given by the formula


    $f'(x'_k) = \alpha' \cdot x'_k$


    and it approximates the value $y'_k = y_k - \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i}$, then the following decision function approximates the quantity $y_k$:


    $f(x_k) = \alpha' \cdot \Big(x_k - \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}\Big) + \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} = \alpha' \cdot x_k + \Big(\frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} - \alpha' \cdot \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}\Big)$


    Therefore, the solution of the original one-dimensional linear regression problem can be written as follows:


    $\alpha = \frac{\sum_{i=1}^{n} w_i (x_i - m_n^{wx})(y_i - m_n^{wy})}{\sum_{i=1}^{n} w_i (x_i - m_n^{wx})(x_i - m_n^{wx})}$


    $\beta = m_n^{wy} - \alpha \cdot m_n^{wx}$


    Here we use the notation introduced in the previous article for the weighted means:


    $m_n^{wx} = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$


    $m_n^{wy} = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i}$
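

    Before moving on to the Welford method, it is useful to see how this closed-form solution can be computed directly. Below is a minimal two-pass sketch; the function name and the use of std::vector are my own choices for illustration, not code from the accompanying repository.


    #include <cstddef>
    #include <vector>

    // Two-pass weighted simple linear regression: the first pass computes the
    // weighted means m_n^{wx} and m_n^{wy}, the second pass accumulates the
    // centered sums from the closed-form solution above.
    void SolveDirectSLR(const std::vector<double>& x,
                        const std::vector<double>& y,
                        const std::vector<double>& w,
                        double& alpha,
                        double& beta)
    {
        double sumWeights = 0., meanX = 0., meanY = 0.;
        for (size_t i = 0; i < x.size(); ++i) {
            sumWeights += w[i];
            meanX += w[i] * x[i];
            meanY += w[i] * y[i];
        }
        meanX /= sumWeights;
        meanY /= sumWeights;

        double covXY = 0., varX = 0.;
        for (size_t i = 0; i < x.size(); ++i) {
            covXY += w[i] * (x[i] - meanX) * (y[i] - meanY);
            varX += w[i] * (x[i] - meanX) * (x[i] - meanX);
        }

        alpha = covXY / varX;          // slope
        beta = meanY - alpha * meanX;  // intercept
    }

    Working with centered differences is already much better numerically than expanding the sums, but it requires two passes over the data; the Welford method discussed below achieves comparable robustness in a single pass.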


    There is another way to see that this transition is correct. If the solution for the centered data is optimal, then the parameters $\alpha'$ and $\beta'$ deliver the minimum of the loss functional $Q_1$:


    $Q_1(f', x', y', w) = \sum_{i=1}^{n} w_i \cdot (\alpha' \cdot x'_i + \beta' - y'_i)^2$


    Now substitute the variables back, returning to the non-centered data:


    $Q_1(f', x', y', w) = \sum_{i=1}^{n} w_i \cdot \Big(\alpha' \cdot (x_i - m_n^{wx}) - y_i + m_n^{wy}\Big)^2 =$


    $= \sum_{i=1}^{n} w_i \cdot \Big(\alpha' \cdot x_i + (m_n^{wy} - \alpha' \cdot m_n^{wx}) - y_i\Big)^2$


    The resulting expression is the value of the loss functional $Q_1$ on the non-centered data for the values of $\alpha$ and $\beta$ obtained above. The functional reaches its minimum at this point, so the problem is indeed solved correctly.


    4. Application of the Welford method


    Now note that computing the parameter $\alpha$ involves the same covariances whose computation we dealt with in the previous article. Indeed, using the notation from that article, we can write:


    $\alpha = \frac{D_{n}^{wxy}}{D_{n}^{wxx}} = \frac{C_{n}^{wxy}}{C_{n}^{wxx}}$


    This means that to compute the regression coefficient it suffices to run the Welford covariance computation twice: once for the covariance between features and targets and once for the variance of the features. In the course of these calculations we also obtain the mean values needed to compute the intercept.


    The code that adds another element to the sample updates the means and deviations for the features and the targets, as well as the covariance between them:


    void TWelfordSLRSolver::Add(const double feature, const double goal, const double weight) {
        SumWeights += weight;
        if (!SumWeights) {
            return;
        }
        const double weightedFeatureDiff = weight * (feature - FeaturesMean);
        const double weightedGoalDiff = weight * (goal - GoalsMean);
        // Welford-style updates: shift the means first, then accumulate the
        // (non-normalized) deviations and the feature-goal covariance.
        FeaturesMean += weightedFeatureDiff / SumWeights;
        FeaturesDeviation += weightedFeatureDiff * (feature - FeaturesMean);
        GoalsMean += weightedGoalDiff / SumWeights;
        GoalsDeviation += weightedGoalDiff * (goal - GoalsMean);
        Covariation += weightedFeatureDiff * (goal - GoalsMean);
    }

    And here is the method that finds the optimal parameters of the one-dimensional linear regression once all the elements have been added:


    template <typename TFloatType>
    void TWelfordSLRSolver::Solve(TFloatType& factor, TFloatType& intercept, const double regularizationParameter = 0.1) const {
        if (!FeaturesDeviation) {
            // Degenerate case: the feature is constant, fall back to predicting the mean.
            factor = 0.;
            intercept = GoalsMean;
            return;
        }
        factor = Covariation / (FeaturesDeviation + regularizationParameter);
        intercept = GoalsMean - factor * FeaturesMean;
    }

    The value GoalsDeviation is not used here, but we will need it in future articles.
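

    To make the interface concrete, a possible usage of this class might look roughly as follows; this is a sketch that relies only on the two methods shown above, and the arrays features and goals and the count samplesCount are placeholders.


    TWelfordSLRSolver solver;
    for (size_t i = 0; i < samplesCount; ++i) {
        // Feed every example with unit weight.
        solver.Add(features[i], goals[i], 1.);
    }

    double factor = 0.;
    double intercept = 0.;
    solver.Solve(factor, intercept);  // factor plays the role of alpha, intercept of beta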


    Combining all the calculations in one class avoids some overhead. Say, if the implementation used two separate objects for storing the means and three objects for storing the covariances (features with features, targets with targets, features with targets), then the sum of weights would be updated five times for every element of the sample.


    5. Experimental comparison of the methods


    For a practical comparison, I wrote a program that implements various methods for solving the one-dimensional and multidimensional linear regression problems. We will discuss multidimensional regression in the following articles; for now we focus on the one-dimensional case.


    As usual, we will compare the "naive" methods, the methods based on Kahan summation, and the methods based on the Welford method.


    The "naive" methods directly apply the formulas for computing covariances:


    void Add(const double feature, const double goal, const double weight = 1.) {
        // Accumulate the raw sums used by the textbook covariance formulas.
        SumFeatures += feature * weight;
        SumSquaredFeatures += feature * feature * weight;
        SumGoals += goal * weight;
        SumSquaredGoals += goal * goal * weight;
        SumProducts += goal * feature * weight;
        SumWeights += weight;
    }
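
    For completeness, here is roughly how the coefficients can be recovered from these accumulated sums; this is an illustrative sketch written as if it were a member of the same class, not the exact repository code. The subtractions in the numerator and the denominator are precisely where the naive approach can lose precision when the means are large compared to the spread of the values.


    void Solve(double& factor, double& intercept) const {
        const double meanX = SumFeatures / SumWeights;
        const double meanY = SumGoals / SumWeights;
        // Non-normalized covariance and variance computed from the raw sums;
        // both differences may suffer from catastrophic cancellation.
        const double covXY = SumProducts - SumFeatures * meanY;
        const double varX = SumSquaredFeatures - SumFeatures * meanX;
        factor = covXY / varX;
        intercept = meanY - factor * meanX;
    }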

    The class is a template and has specializations with accumulators of type double and of type TKahanAccumulator.


    In addition, a class TTypedBestSLRSolver has been implemented that selects the single best feature for building a one-dimensional regression model. It works very simply: the one-dimensional linear regression problem is solved for each feature, after which the best of the resulting models is selected; a sketch of this idea is given below.
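

    Here is one way such a selector could be organized on top of TWelfordSLRSolver; this is my own illustrative loop with placeholder names, not the actual TTypedBestSLRSolver code. It builds one model per feature and keeps the one with the smallest residual sum of squares.


    #include <cstddef>
    #include <limits>
    #include <vector>

    // data[f][i] is the value of feature f on example i, goals[i] is the target.
    size_t SelectBestFeature(const std::vector<std::vector<double>>& data,
                             const std::vector<double>& goals,
                             double& bestFactor,
                             double& bestIntercept)
    {
        size_t bestFeature = 0;
        double bestError = std::numeric_limits<double>::max();
        for (size_t f = 0; f < data.size(); ++f) {
            TWelfordSLRSolver solver;
            for (size_t i = 0; i < goals.size(); ++i) {
                solver.Add(data[f][i], goals[i], 1.);
            }
            double factor = 0., intercept = 0.;
            solver.Solve(factor, intercept);

            double sse = 0.;  // residual sum of squares of the model on feature f
            for (size_t i = 0; i < goals.size(); ++i) {
                const double residual = factor * data[f][i] + intercept - goals[i];
                sse += residual * residual;
            }
            if (sse < bestError) {
                bestError = sse;
                bestFeature = f;
                bestFactor = factor;
                bestIntercept = intercept;
            }
        }
        return bestFeature;
    }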


    To test the implemented methods, we will use model data from the LIAC collection. For convenience, some of the datasets are placed in the data directory in a format that the program understands.


    The data in these tasks is "spoiled" in a simple way: the feature and target values are multiplied by some number, after which another number is added to them. This way we obtain a case that is problematic from a computational point of view: mean values that are large compared to the spread of the values.
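

    The spoiling itself is a one-line affine transformation. Here is a minimal sketch, using the injure factor and injure offset names that appear in the log below; the actual helper in the repository may be organized differently.


    // Rescale and shift a value: with injureFactor = 0.0001 and
    // injureOffset = 10000 the spread of the values shrinks while the mean
    // grows, which is exactly the numerically hard case.
    double Injure(const double value, const double injureFactor, const double injureOffset) {
        return value * injureFactor + injureOffset;
    }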


    In the research-bslr mode, the sample is spoiled several times in a row, and each time a sliding-control (cross-validation) procedure is run on it. The result of each check is the average value of the coefficient of determination over the test sets.


    For example, for the kin8nm sample, the results are as follows:


    injure factor: 1
    injure offset: 1
        fast_bslr                time: 0.001322    R^2: 0.27359
        kahan_bslr               time: 0.002999    R^2: 0.27359
        welford_bslr             time: 0.00432     R^2: 0.27359
        normalized_welford_bslr  time: 0.004288    R^2: 0.27359
    injure factor: 0.1
    injure offset: 10
        fast_bslr                time: 0.001256    R^2: 0.27359
        kahan_bslr               time: 0.002948    R^2: 0.27359
        welford_bslr             time: 0.004303    R^2: 0.27359
        normalized_welford_bslr  time: 0.004275    R^2: 0.27359
    injure factor: 0.01
    injure offset: 100
        fast_bslr                time: 0.001283    R^2: 0.27359
        kahan_bslr               time: 0.003015    R^2: 0.27359
        welford_bslr             time: 0.004304    R^2: 0.27359
        normalized_welford_bslr  time: 0.004285    R^2: 0.27359
    injure factor: 0.001
    injure offset: 1000
        fast_bslr                time: 0.001262    R^2: 0.27324
        kahan_bslr               time: 0.002977    R^2: 0.27359
        welford_bslr             time: 0.004329    R^2: 0.27359
        normalized_welford_bslr  time: 0.00428     R^2: 0.27359
    injure factor: 0.0001
    injure offset: 10000
        fast_bslr                time: 0.00128     R^2: -59.271
        kahan_bslr               time: 0.003009    R^2: -0.0005269
        welford_bslr             time: 0.004304    R^2: 0.27359
        normalized_welford_bslr  time: 0.00428     R^2: 0.27359
    full learning time:
        fast_bslr                  0.006403s
        kahan_bslr                 0.014948s
        welford_bslr               0.02156s
        normalized_welford_bslr    0.021408s

    In this case, scaling all values in the sample down by a factor of ten thousand while adding 10,000 to them makes the standard algorithm unusable, even with Kahan summation. Similar results are obtained on other samples, including ones that occur in real production settings.


    Conclusion


    So, today we discussed the one-dimensional linear regression problem, derived its analytical solution, and saw how to use the Welford method to compute that solution.


    The Welford method makes the solution substantially more robust to potential problems in the data. However, it is roughly 2-4 times slower than the standard algorithm, so in practice you have to decide what matters more at the moment: being insensitive to possible problems in the data or running as fast as possible.


    If models have to be built on varied data many times and there is no way to control the quality of every resulting model, I would recommend using the Welford method.


    In the next article, we will talk about using the Welford method to solve the multidimensional linear regression problem.


    Literature


    1. habrahabr.ru: Accurate calculation of means and covariances by the Welford method
    2. github.com: Linear regression problem solvers with different computation methods
    3. github.com: The collection of ARFF datasets of the Connectionist Artificial Intelligence Laboratory (LIAC)
    4. machinelearning.ru: One-dimensional linear regression
