Why do we need the Ho-Kashyap algorithm?

Recently, a post about the Ho-Kashyap algorithm (the Ho-Kashyap procedure, also known as the minimum squared error, MSE, algorithm) appeared on Habr. It did not seem very clear to me, so I decided to figure the topic out on my own. It turned out that the topic is covered rather poorly on the Russian-speaking internet, so I decided to write an article based on the results of my search.

Despite the boom of neural networks in machine learning, linear classification algorithms remain much simpler to use and interpret. At the same time, sometimes you do not feel like reaching for anything advanced such as support vector machines or logistic regression, and there is a temptation to throw all the data into one big linear OLS regression, especially since even MS Excel can fit one perfectly well.

The problem with this approach is that even when the input data is linearly separable, the resulting classifier may fail to separate it. For example, for the set of points X = [(6, 9), (5, 7), (5, 9), (0, 10)], y = [1, 1, -1, -1] we get the dividing line 0.15x_1 - 0.43x_2 + 3.21 = 0 (the example is borrowed from (1)):

[Figure: the four sample points and the OLS dividing line, which fails to separate the two classes]


The question arises: can this behavior somehow be avoided?

Linear classification problem


First, we formalize the subject of the article.

We are given a matrix X, each row X_i of which is the feature description of object i (including a constant term 1), and a vector of class labels y, where y_i \in \{+1, -1\} is the label of object i. We want to build a linear classifier of the form sign(Xw) = y.

>>> import numpy as np
>>> X = np.array([[6, 9, 1], [5, 7, 1], [5, 9, 1], [0, 10, 1]])
>>> y = np.array([[1], [1], [-1], [-1]])

The easiest way to do this is to fit an OLS regression to Xw = y, that is, to minimize the sum of squared deviations \min \sum_i (w X_i - y_i)^2. The optimal weights are given by w = X^{+} y, where X^{+} is the pseudoinverse of X. This is exactly how the picture at the beginning of the article was obtained.

>>> w = np.dot(np.linalg.pinv(X), y)
>>> w
array([[ 0.15328467],
       [-0.4379562 ],
       [ 3.2189781 ]])
>>> np.dot(X, w)
array([[ 0.19708029],
       [ 0.91970803],
       [ 0.04379562],
       [-1.16058394]])
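
Indeed, continuing the interactive session above, the sign of Xw disagrees with y on the third object, even though the sample itself is linearly separable:

>>> np.sign(np.dot(X, w)).ravel()
array([ 1.,  1.,  1., -1.])
>>> y.ravel()
array([ 1,  1, -1, -1])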

Linear separability


For convenience of notation, we multiply each row of the equation Xw = y element-wise by y and call the matrix obtained on the left-hand side Y = y \times X (here \times denotes row-wise multiplication). The OLS regression condition then takes the form Yw = 1, and the minimization problem becomes \min \sum_i (w Y_i - 1)^2.

>>> Y = y * X
>>> Y
array([[  6,   9,   1],
       [  5,   7,   1],
       [ -5,  -9,  -1],
       [  0, -10,  -1]])

At this point, we can recall that the condition for separating the classes is sign(Xw) = y, or equivalently sign(Yw) = 1, and since separating the classes is what we actually want, this is the problem we should be solving.
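
Continuing the interactive session (w here is still the OLS solution found above), Yw has a negative component for the third object, so sign(Yw) \neq 1 and the classes are not separated:

>>> np.dot(Y, w)
array([[ 0.19708029],
       [ 0.91970803],
       [-0.04379562],
       [ 1.16058394]])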

We introduce a vector b in which b_i is responsible for the distance from element i to the dividing line (Yw = b). Since we want all elements to be classified correctly, we impose the condition b > 0. The problem then reduces to \min \sum_i (w Y_i - b_i)^2 and is solved by w = Y^{+} b. We can pick values of b by hand so that the resulting plane separates our sample:

>>> b = np.ones([4, 1])
>>> b[3] = 10
>>> w = np.dot(np.linalg.pinv(Y), b)
>>> np.dot(Y, w)
array([[  0.8540146 ],
       [  0.98540146],
       [  0.81021898],
       [ 10.02919708]])
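
A quick sanity check that this hand-picked b indeed yields a separating plane (every component of Yw is positive, hence sign(Yw) = 1):

>>> bool(np.all(np.dot(Y, w) > 0))
True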

Ho-Kashyap algorithm


The Ho-Kashyap algorithm is designed to select b automatically. The scheme of the algorithm (k is the step number, b^{(0)} is usually taken equal to 1):

  1. Compute the OLS regression coefficients w^{(k)} = Y^{+} b^{(k)}.
  2. Compute the new margin vector b^{(k+1)}.
  3. If the solution has not converged yet (b^{(k)} \neq b^{(k+1)}), go back to step 1.

We would like to compute the margin vector as something like b^{(k+1)} = \max(Yw^{(k)}, 0), since that would minimize the loss function \min \sum (w Y_i - b_i)^2. Unfortunately, the condition b > 0 does not allow us to do this, so instead it is proposed to follow only the part of the loss-function gradient that increases b: with the error vector e^{(k)} = b^{(k)} - Yw^{(k)}, we set b^{(k+1)} = b^{(k)} - \mu (e^{(k)} - |e^{(k)}|), where \mu is the gradient descent step, which decreases in the course of solving the problem.

>>> e = -np.inf * np.ones([4, 1])
>>> b = np.ones([4, 1])
>>> while np.any(e < 0):
...     w = np.dot(np.linalg.pinv(Y), b)
...     e = b - np.dot(Y, w)
...     b = b - e * (e < 0)
... 
>>> b
array([[  1.],
       [  1.],
       [  1.],
       [ 12.]])
>>> w
array([[ 2.],
       [-1.],
       [-2.]])
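
For reference, here is the same procedure wrapped in a function with an explicit gradient step \mu (a minimal sketch: the function name, the tolerance and the iteration cap are my own additions; the interactive loop above corresponds to \mu = 0.5):

import numpy as np

def ho_kashyap(Y, mu=0.5, tol=1e-8, max_iter=1000):
    """Sketch of the Ho-Kashyap procedure for the margin matrix Y = y * X."""
    b = np.ones((Y.shape[0], 1))        # b^(0) = 1
    Y_pinv = np.linalg.pinv(Y)          # Y^+ can be computed once and reused
    for _ in range(max_iter):
        w = np.dot(Y_pinv, b)           # step 1: w^(k) = Y^+ b^(k)
        e = b - np.dot(Y, w)            # error vector e^(k) = b^(k) - Y w^(k)
        if np.all(e >= -tol):           # no negative components left: a separating plane is found
            break
        b = b - mu * (e - np.abs(e))    # step 2: increase b where Yw > b
    return w, b

The iteration cap is a pragmatic guard for the linearly inseparable case discussed below, where the procedure is not guaranteed to converge.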



In the case of a linearly separable sample the algorithm always converges, and converges to a separating plane (if all components of the gradient with respect to b are nonnegative, then they are all equal to zero).

For a linearly inseparable sample the loss function can become arbitrarily small, since it is enough to multiply b and w by a constant to change its value, and in principle the algorithm may fail to converge. My searches turned up no specific recommendations on this point.

Connection between the Ho-Kashyap algorithm and linear SVM


You may notice that if an object is classified correctly, then its error in the stated optimization problem (\min \sum_{i} (w Y_i - b_i)^2) can be driven down to zero. If an object is classified incorrectly, then the minimum error on it equals the square of its margin from the dividing line, (w Y_i)^2. The loss function can therefore be rewritten in the form:

\min_{w} \sum_{i} \max(0, 1 - w Y_i)^2 - 2 \max(0, 1 - w Y_i)

In turn, the linear SVM loss function has the form:

\min_{w} \lambda ||w||^2 + \frac{1}{N} \sum_{i} \max(0, 1 - w Y_i)

Thus, the problem solved by the Ho-Kashyap algorithm is an analogue of an SVM with a quadratic loss function (it penalizes outliers lying far from the separating plane more heavily) that ignores the width of the separating strip (i.e., it looks not for the plane lying as far as possible from the nearest correctly classified elements, but for any dividing plane).
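
To make the comparison concrete, the per-object term of the rewritten loss can be expanded by cases (this expansion is my own addition):

\max(0, 1 - w Y_i)^2 - 2 \max(0, 1 - w Y_i) = \begin{cases} 0, & w Y_i \geq 1 \\ (w Y_i)^2 - 1, & w Y_i < 1 \end{cases}

For an object lying far on the wrong side of the plane (w Y_i \ll 0) this term grows quadratically with the margin, while the SVM hinge term grows only linearly, which is exactly the sense in which the Ho-Kashyap loss penalizes distant outliers more heavily.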

Multiclass case


We may recall that OLS regression is an analogue of Fisher's two-class linear discriminant (their solutions coincide up to a constant). The Ho-Kashyap algorithm can also be applied to the case of K classes: in that case w and b become matrices W and B of sizes D \times K and N \times K respectively, where D is the dimension of the problem and N is the number of objects. The columns corresponding to the incorrect classes must then contain negative values.
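
As an illustration, here is a minimal sketch of only the least-squares part of this multiclass setup, without the iterative Ho-Kashyap update of B; the choice of ±1 target values and the function names are my own assumptions consistent with the sign condition above (labels is assumed to be an array of integer class indices from 0 to K-1):

import numpy as np

def multiclass_mse_fit(X, labels, K):
    """Least-squares step of the multiclass variant: W = X^+ B."""
    N = X.shape[0]
    B = -np.ones((N, K))                    # columns of the wrong classes get negative targets
    B[np.arange(N), labels] = 1.0           # the column of the correct class gets a positive target
    W = np.dot(np.linalg.pinv(X), B)        # D x K matrix, one weight column per class
    return W

def multiclass_predict(X, W):
    return np.argmax(np.dot(X, W), axis=1)  # pick the class with the largest score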

Acknowledgments


parpalak for a convenient editor.
rocket3 for the original article.

References


(1) http://www.csd.uwo.ca/~olga/Courses/CS434a_541a/Lecture10.pdf
(2) http://research.cs.tamu.edu/prism/lectures/pr/pr_l17.pdf
(3) http://web.khu.ac.kr/~tskim/PatternClass Lec Note 07-1.pdf
(4) A.E. Lepsky, A.G. Bronevich Mathematical methods for pattern recognition. Lecture Course
(5) Tu J., González R. Principles of Pattern Recognition
(6) R. Duda, P. Hart Pattern Recognition and Scene Analysis
