Path Machine: the idea of a single algorithm

    Prehistory


    About 15 years ago I learned about the existence of fundamental groups of paths, groups that can distinguish topological spaces by their connectedness. This article is not about them, but they prompted the idea of a regressor and a classifier without any optimization, based simply on memorizing the sample.

    Details below.

    Ideas and conjectures


    Since fundamental groups are built from loops based at a chosen point, together with an equivalence relation on those loops, how could one find such loops in data? And is it even necessary?

    The idea came by analogy with kNN and SVM, both "analogy-based" algorithms. What if the loop is replaced with a "tube", a cylinder running from one data point to another, and whatever falls inside the tube is assigned to the same class as the two endpoints of the path (which is no longer a fundamental one)? The tube must be wide enough to classify properly... and in the limit it becomes a straight line. For a new, unknown point we then look for the path to which its distance is minimal and assign it to the same class as that path.
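
    To make the geometric rule concrete, here is a minimal sketch (illustrative points only, not data or code from the notebook): the distance from a new point to the straight line through two labeled points decides which path, and hence which class, it belongs to.

    import numpy as np

    def distance_to_path_line(x, a, b):
        # Distance from point x to the straight line through a and b --
        # the infinitely thin "tube" in the limit.
        direction = (b - a) / np.linalg.norm(b - a)
        projection = np.dot(x - a, direction) * direction
        return np.linalg.norm((x - a) - projection)

    # Two path segments with known classes; the new point gets the class of the closer one.
    segments = [((np.array([0.0, 0.0]), np.array([1.0, 0.0])), 'A'),
                ((np.array([0.0, 2.0]), np.array([1.0, 2.0])), 'B')]
    x_new = np.array([0.5, 0.4])
    label = min(segments, key=lambda s: distance_to_path_line(x_new, *s[0]))[1]
    print(label)  # 'A' -- x_new lies closer to the lower segment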



    A regressor can be constructed by projecting a point onto the straight line between two data points and interpolating their target values in the same ratio in which the projection divides the path.

    By construction, this method memorizes the entire sample and gives exact predictions on the training set.
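
    A minimal sketch of that interpolation (my own illustration with made-up points): the projection parameter t shows where the new point lands between the two data points, and the prediction mixes their targets in the same proportion.

    import numpy as np

    a, b = np.array([0.0, 0.0]), np.array([2.0, 0.0])  # two neighboring training points on a path
    y_a, y_b = 1.0, 3.0                                 # their target values
    x_new = np.array([0.5, 0.7])                        # point to predict

    line = b - a
    t = np.dot(x_new - a, line) / np.dot(line, line)    # ratio in which the projection divides the segment
    prediction = (1 - t) * y_a + t * y_b
    print(t, prediction)  # 0.25 1.5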

    Primitive implementation


    How to build a path? I took the element with the maximum norm and started looking for the point closest to it, connecting them into a path, and so on. At the end, the path is closed back to the starting point (debatable, of course, but that's how I did it).

    Estimator
    This is the very first version of the code; see the updated notebook below.

    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.metrics import f1_score, r2_score


    class PathMachine(BaseEstimator):
        def __init__(self, norm=np.linalg.norm, classify=False):
            self.norm = norm
            self.classify = classify

        def find_start(self, X):
            # Starting point of a path: the sample with the largest norm.
            index_max = None
            value_max = -np.inf
            for index, x in enumerate(X):
                value = self.norm(x)
                if value > value_max:
                    index_max = index
                    value_max = value
            return index_max

        def find_next(self, point, target, X, y):
            # Next point of the path: the one closest to the current point
            # (restricted to the same class when classifying).
            index_min = None
            value_min = np.inf
            for index, x in enumerate(X):
                if self.classify and (y[index] != target):
                    continue
                value = self.norm(x - point)
                if value < value_min:
                    index_min = index
                    value_min = value
            return index_min

        def fit(self, X, y):
            X = np.copy(X)
            y = np.copy(y).flatten()
            # Classification keeps a list of segments per class,
            # regression keeps a single list of segments.
            self.paths = {} if self.classify else []
            start_index = self.find_start(X)
            start_value = X[start_index]
            start_target = y[start_index]
            X = np.delete(X, start_index, axis=0)
            y = np.delete(y, start_index, axis=0)
            while len(X) > 0:
                next_index = self.find_next(start_value, start_target, X, y)
                if self.classify and next_index is None:
                    # No points of the current class left: restart from a new point.
                    start_index = self.find_start(X)
                    start_value = X[start_index]
                    start_target = y[start_index]
                    continue
                next_target = y[next_index]
                if self.classify:
                    if next_target not in self.paths:
                        self.paths[next_target] = []
                    self.paths[next_target].append({
                        'start': start_value,
                        'next': X[next_index]
                    })
                else:
                    self.paths.append({
                        'start': start_value,
                        'next': X[next_index],
                        'value': start_target,
                        'target': next_target
                    })
                start_value = X[next_index]
                start_target = y[next_index]
                X = np.delete(X, next_index, axis=0)
                y = np.delete(y, next_index, axis=0)
            # Close the path: connect the last point back to the first one.
            if self.classify:
                self.paths[start_target].append({
                    'start': start_value,
                    'next': self.paths[start_target][0]['start']
                })
            else:
                self.paths.append({
                    'start': start_value,
                    'next': self.paths[0]['start'],
                    'value': start_target,
                    'target': self.paths[0]['target']
                })
            return self

        def predict(self, X):
            result = []
            for x in X:
                if self.classify:
                    # Assign the class of the path segment closest to x.
                    predicted = None
                    min_distance = np.inf
                    for target in self.paths:
                        for path in self.paths[target]:
                            point = x - path['start']
                            line = path['next'] - path['start']
                            if np.allclose(self.norm(line), 0):
                                continue
                            direction = line / self.norm(line)
                            product = np.dot(point, direction)
                            projection = product * direction
                            distance = self.norm(projection - point)
                            if distance < min_distance:
                                predicted = target
                                min_distance = distance
                    result.append(predicted)
                else:
                    # Project x onto the closest segment and interpolate its targets.
                    predicted = None
                    min_distance = np.inf
                    for path in self.paths:
                        point = x - path['start']
                        line = path['next'] - path['start']
                        if np.allclose(self.norm(line), 0):
                            continue
                        direction = line / self.norm(line)
                        product = np.dot(point, direction)
                        projection = product * direction
                        parameter = (np.sign(product) * self.norm(projection)
                                     / self.norm(line))
                        distance = self.norm(projection - point)
                        if distance < min_distance:
                            predicted = ((1 - parameter) * path['value']
                                         + parameter * path['target'])
                            min_distance = distance
                    result.append(predicted)
            return np.array(result)

        def score(self, X, y):
            if self.classify:
                return f1_score(y.flatten(), self.predict(X), average='micro')
            else:
                return r2_score(y.flatten(), self.predict(X))
    


    Theoretically (and technically), it is also possible to implement predict_proba, for example as 1 minus the normalized distances. But I did not come up with a quick way to test it, so...
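
    A possible sketch of such a predict_proba as an extra method of the PathMachine class above (just a guess at the idea, untested; degenerate cases such as a zero total distance are not handled): take the minimal point-to-path distance per class and turn the distances into scores via 1 minus the normalized distance.

        def predict_proba(self, X):
            result = []
            for x in X:
                classes = sorted(self.paths)  # columns follow the sorted class labels
                distances = []
                for target in classes:
                    best = np.inf
                    for path in self.paths[target]:
                        point = x - path['start']
                        line = path['next'] - path['start']
                        if np.allclose(self.norm(line), 0):
                            continue
                        direction = line / self.norm(line)
                        projection = np.dot(point, direction) * direction
                        best = min(best, self.norm(projection - point))
                    distances.append(best)
                distances = np.array(distances)
                scores = 1 - distances / distances.sum()  # "1 - normalized distance"
                result.append(scores / scores.sum())      # renormalize into a probability vector
            return np.array(result)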

    Some tests


    I started with the classic Boston Housing and Iris datasets, and for the baselines I chose Ridge and LogisticRegression. Well, why not, maybe it would work.

    In general, I suggest looking at the Jupyter notebook.

    In short: the regressor did worse than the baseline, the classifier did better.
    Update: after applying StandardScaler, the regressor does better as well.
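
    A minimal sketch of how such a comparison can be run (my own illustration; the dataset, split, and baseline settings here are assumptions, not the exact notebook setup):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    pm = PathMachine(classify=True).fit(X_train, y_train)
    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    print('PathMachine micro-F1:', pm.score(X_test, y_test))
    print('LogisticRegression accuracy:', lr.score(X_test, y_test))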

    I also ran it on synthetic datasets. There the results were all over the place.

    Still, a few things were noticeable:

    1. The regressor works well in low-dimensional spaces;
    2. Both the regressor and the classifier become sensitive to noise beyond a certain threshold;
    3. Unlike the linear baseline, the regressor picked up on multimodality (on Boston Housing);
    4. By construction, the classifier did well (best of all... :)) on the linearly non-separable moons dataset.

    Conclusions, pros and cons and applicability


    Pros: personally, I do not really see any in the current implementation, either technically or mathematically. Perhaps it can be refined into something more sensible. Accordingly, I do not see much applicability either. Except, perhaps, working with very small data without preprocessing...

    There are plenty of cons: the whole dataset has to be kept in memory, the generalization ability relies on extrapolation, and the main and only hyperparameter, the norm, is hard to choose.

    But, in any case, it came as a surprise to me, and I decided to share it here.
