Why do we need the Ho-Kashyap algorithm?

Recently, a post about the Ho-Kashyap algorithm (the Ho-Kashyap procedure, also known as the minimum squared error, MSE, algorithm) appeared on Habr. It did not seem very clear to me, so I decided to figure the topic out on my own. It turned out that the topic is covered rather poorly on the Russian-speaking internet, so I decided to write an article based on the results of my search.

Despite the boom of neural networks in machine learning, linear classification algorithms remain much simpler to use and interpret. At the same time, sometimes you do not feel like reaching for anything advanced such as support vector machines or logistic regression, and there is a temptation to throw all the data into one big linear OLS regression, especially since even MS Excel can fit one perfectly well.

The problem with this approach is that even when the input data is linearly separable, the resulting classifier may fail to separate it. For example, for the set of points X = [(6, 9), (5, 7), (5, 9), (0, 10)], y = [1, 1, -1, -1] we get the dividing line 0.15x_1 - 0.43x_2 + 3.21 = 0 (the example is borrowed from (1)):

[Figure: the four sample points and the OLS dividing line, which fails to separate the two classes]


The question arises: can this behavior somehow be avoided?

Linear classification problem


First, we formalize the subject of the article.

We are given a matrix X, each row X_i of which is the feature description of object i (including a constant term 1), and a vector of class labels y, where y_i \in \{+1, -1\} is the label of object i. We want to build a linear classifier of the form sign(Xw) = y.

>>> import numpy as np
>>> X = np.array([[6, 9, 1], [5, 7, 1], [5, 9, 1], [0, 10, 1]])
>>> y = np.array([[1], [1], [-1], [-1]])

The easiest way to do this is to fit an OLS regression to Xw = y, that is, to minimize the sum of squared deviations \min \sum_i (w X_i - y_i)^2. The optimal weights are given by w = X^{+} y, where X^{+} is the pseudoinverse of X. This is exactly how the picture at the beginning of the article was obtained.

>>> w = np.dot(np.linalg.pinv(X), y)
>>> w
array([[ 0.15328467],
       [-0.4379562 ],
       [ 3.2189781 ]])
>>> np.dot(X, w)
array([[ 0.19708029],
       [ 0.91970803],
       [ 0.04379562],
       [-1.16058394]])
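
Indeed, continuing the interactive session above, the sign of Xw disagrees with y on the third object, even though the sample itself is linearly separable:

>>> np.sign(np.dot(X, w)).ravel()
array([ 1.,  1.,  1., -1.])
>>> y.ravel()
array([ 1,  1, -1, -1])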

Linear separability


For convenience of notation, we multiply each row of the equation Xw = y element-wise by y and call the matrix obtained on the left-hand side Y = y \times X (here \times denotes row-wise multiplication). The OLS regression condition then takes the form Yw = 1, and the minimization problem becomes \min \sum_i (w Y_i - 1)^2.

>>> Y = y * X
>>> Y
array([[  6,   9,   1],
       [  5,   7,   1],
       [ -5,  -9,  -1],
       [  0, -10,  -1]])

At this point, we can recall that the condition for separating the classes is sign(Xw) = y, or equivalently sign(Yw) = 1, and since separating the classes is what we actually want, this is the problem we should be solving.
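
Continuing the interactive session (w here is still the OLS solution found above), Yw has a negative component for the third object, so sign(Yw) \neq 1 and the classes are not separated:

>>> np.dot(Y, w)
array([[ 0.19708029],
       [ 0.91970803],
       [-0.04379562],
       [ 1.16058394]])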

We introduce a vector b in which b_i is responsible for the distance from element i to the dividing line (Yw = b). Since we want all elements to be classified correctly, we impose the condition b > 0. The problem then reduces to \min \sum_i (w Y_i - b_i)^2 and is solved by w = Y^{+} b. We can pick values of b by hand so that the resulting plane separates our sample:

>>> b = np.ones([4, 1])
>>> b[3] = 10
>>> w = np.dot(np.linalg.pinv(Y), b)
>>> np.dot(Y, w)
array([[  0.8540146 ],
       [  0.98540146],
       [  0.81021898],
       [ 10.02919708]])
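
A quick sanity check that this hand-picked b indeed yields a separating plane (every component of Yw is positive, hence sign(Yw) = 1):

>>> bool(np.all(np.dot(Y, w) > 0))
True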

Ho-Kashyap algorithm


The Ho-Kashyap algorithm is designed to select b automatically. The scheme of the algorithm (k is the step number, b^{(0)} is usually taken equal to 1):

  1. Compute the OLS regression coefficients w^{(k)} = Y^{+} b^{(k)}.
  2. Compute the new margin vector b^{(k+1)}.
  3. If the solution has not converged yet (b^{(k)} \neq b^{(k+1)}), go back to step 1.

We would like to compute the margin vector as something like b^{(k+1)} = \max(Yw^{(k)}, 0), since that would minimize the loss function \min \sum (w Y_i - b_i)^2. Unfortunately, the condition b > 0 does not allow us to do this, so instead it is proposed to follow only the part of the loss-function gradient that increases b: with the error vector e^{(k)} = b^{(k)} - Yw^{(k)}, we set b^{(k+1)} = b^{(k)} - \mu (e^{(k)} - |e^{(k)}|), where \mu is the gradient descent step, which decreases in the course of solving the problem.

>>> e = -np.inf * np.ones([4, 1])
>>> b = np.ones([4, 1])
>>> while np.any(e < 0):
...     w = np.dot(np.linalg.pinv(Y), b)
...     e = b - np.dot(Y, w)
...     b = b - e * (e < 0)
... 
>>> b
array([[  1.],
       [  1.],
       [  1.],
       [ 12.]])
>>> w
array([[ 2.],
       [-1.],
       [-2.]])
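
For reference, here is the same procedure wrapped in a function with an explicit gradient step \mu (a minimal sketch: the function name, the tolerance and the iteration cap are my own additions; the interactive loop above corresponds to \mu = 0.5):

import numpy as np

def ho_kashyap(Y, mu=0.5, tol=1e-8, max_iter=1000):
    """Sketch of the Ho-Kashyap procedure for the margin matrix Y = y * X."""
    b = np.ones((Y.shape[0], 1))        # b^(0) = 1
    Y_pinv = np.linalg.pinv(Y)          # Y^+ can be computed once and reused
    for _ in range(max_iter):
        w = np.dot(Y_pinv, b)           # step 1: w^(k) = Y^+ b^(k)
        e = b - np.dot(Y, w)            # error vector e^(k) = b^(k) - Y w^(k)
        if np.all(e >= -tol):           # no negative components left: a separating plane is found
            break
        b = b - mu * (e - np.abs(e))    # step 2: increase b where Yw > b
    return w, b

The iteration cap is a pragmatic guard for the linearly inseparable case discussed below, where the procedure is not guaranteed to converge.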



In the case of a linearly separable sample the algorithm always converges, and converges to a separating plane (if all components of the gradient with respect to b are nonnegative, then they are all equal to zero).

For a linearly inseparable sample the loss function can become arbitrarily small, since it is enough to multiply b and w by a constant to change its value, and in principle the algorithm may fail to converge. My searches turned up no specific recommendations on this point.

Connection between the Ho-Kashyap algorithm and linear SVM


You may notice that if an object is classified correctly, then its error in the stated optimization problem (\min \sum_{i} (w Y_i - b_i)^2) can be driven down to zero. If an object is classified incorrectly, then the minimum error on it equals the square of its margin from the dividing line, (w Y_i)^2. The loss function can therefore be rewritten in the form:

\min_{w} \sum_{i} \max(0, 1 - w Y_i)^2 - 2 \max(0, 1 - w Y_i)

In turn, the linear SVM loss function has the form:

\min_{w} \lambda ||w||^2 + \frac{1}{N} \sum_{i} \max(0, 1 - w Y_i)

Thus, the problem solved by the Ho-Kashyap algorithm is an analogue of an SVM with a quadratic loss function (it penalizes outliers lying far from the separating plane more heavily) that ignores the width of the separating strip (i.e., it looks not for the plane lying as far as possible from the nearest correctly classified elements, but for any dividing plane).
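
To make the comparison concrete, the per-object term of the rewritten loss can be expanded by cases (this expansion is my own addition):

\max(0, 1 - w Y_i)^2 - 2 \max(0, 1 - w Y_i) = \begin{cases} 0, & w Y_i \geq 1 \\ (w Y_i)^2 - 1, & w Y_i < 1 \end{cases}

For an object lying far on the wrong side of the plane (w Y_i \ll 0) this term grows quadratically with the margin, while the SVM hinge term grows only linearly, which is exactly the sense in which the Ho-Kashyap loss penalizes distant outliers more heavily.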

Multiclass case


We may recall that OLS regression is an analogue of Fisher's two-class linear discriminant (their solutions coincide up to a constant). The Ho-Kashyap algorithm can also be applied to the case of K classes: in that case w and b become matrices W and B of sizes D \times K and N \times K respectively, where D is the dimension of the problem and N is the number of objects. The columns corresponding to the incorrect classes must then contain negative values.
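
As an illustration, here is a minimal sketch of only the least-squares part of this multiclass setup, without the iterative Ho-Kashyap update of B; the choice of ±1 target values and the function names are my own assumptions consistent with the sign condition above (labels is assumed to be an array of integer class indices from 0 to K-1):

import numpy as np

def multiclass_mse_fit(X, labels, K):
    """Least-squares step of the multiclass variant: W = X^+ B."""
    N = X.shape[0]
    B = -np.ones((N, K))                    # columns of the wrong classes get negative targets
    B[np.arange(N), labels] = 1.0           # the column of the correct class gets a positive target
    W = np.dot(np.linalg.pinv(X), B)        # D x K matrix, one weight column per class
    return W

def multiclass_predict(X, W):
    return np.argmax(np.dot(X, W), axis=1)  # pick the class with the largest score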

Acknowledgments


parpalak for a convenient editor.
rocket3 for the original article.

References


(1) http://www.csd.uwo.ca/~olga/Courses/CS434a_541a/Lecture10.pdf
(2) http://research.cs.tamu.edu/prism/lectures/pr/pr_l17.pdf
(3) http://web.khu.ac.kr/~tskim/PatternClass Lec Note 07-1.pdf
(4) A.E. Lepsky, A.G. Bronevich Mathematical methods for pattern recognition. Lecture Course
(5) Tu J., González R. Principles of Pattern Recognition
(6) R. Duda, P. Hart Pattern Recognition and Scene Analysis
