srgkvlv December 30, 2012 at 22:34

Overview of Numerical Data Space Clustering Algorithms

From the sandbox

The clustering task is a special case of the problem of teaching without a teacher, which reduces to dividing the existing set of data objects into subsets in such a way that the elements of one subset are significantly different in some set of properties from the elements of all other subsets. A data object is usually regarded as a point in a multidimensional metric space, each dimension of which corresponds to a certain property (attribute) of the object, and the metric is a function of the values of these properties. The choice of data clustering algorithm and the metric used depend on the types of measurements of this space, which can be either numerical or categorical. This choice is dictated by differences in the nature of different types of attributes.

This article provides a brief overview of clustering methods for numerical data spaces. It will be useful for those who are just starting to study Data Mining and cluster analysis and will help you navigate the variety of modern clustering algorithms and get a general idea about them. The article does not claim to be a complete presentation of the material; on the contrary, the description of the algorithms in it is maximally simplified. For a more detailed study of an algorithm, it is recommended to use the scientific work in which it was presented (see the list of references at the end of the article).

Partition Methods

The most famous representatives of this family of methods are the k-means [1] and k-medoids [2] algorithms. They take the input parameter k and divide the data space into k clusters such that between objects of the same cluster the similarity is maximum, and between objects of different clusters is minimal. Similarity is measured with respect to some center of the cluster as the distance from the object in question to the center. The main difference between these methods is the way the cluster center is determined.

In the k-means algorithm, similarity is considered with respect to the center of mass of the cluster — the average value of the coordinates of the cluster objects in the data space. First, kobjects, each of which is a prototype of a cluster and represents its center of mass. Then, for each of the remaining objects, it joins the cluster with which there is more similarity. After that, the center of mass of each cluster is re-calculated. For each obtained partition, a certain evaluation function is calculated, the values of which at each step form a convergent series. The process continues until the indicated series converges to its limit value. In other words, the movement of objects from cluster to cluster ends when the clusters remain unchanged with each iteration. Minimizing the evaluation function allows you to make the resulting clusters as compact and separate as possible. The k-means method works well, when clusters are significantly separated from each other compact "clouds". It is effective for processing large amounts of data, but is not applicable for detecting non-convex clusters or very different sizes. Moreover, the method is very sensitive to noise and isolated points of space, since even a small number of such points can significantly affect the calculation of the center of mass of the cluster.

To reduce the effect of noise and isolated points of space on the result of clustering, the k-medoids algorithm, in contrast to k-means, uses not a center of mass to represent the center of the cluster, but a representative object - one of the objects in the cluster. As in the k-means method, at first krepresentative objects. Each of the remaining objects is combined in a cluster with the nearest representative object. Then, iteratively for each representative object, it is replaced by an arbitrary non-representative object of the data space. The replacement process continues until the quality of the resulting clusters improves. The quality of clustering is determined by the sum of the deviations between each object and the representative object of the corresponding cluster, which the method seeks to minimize. That is, iterations continue until in each cluster its representative object becomes a honeyoid - the object closest to the center of the cluster. The algorithm does not scale well for processing large amounts of data, but the CLARANS algorithm [3] complements the k-medoids method.

Hierarchical methods

The general idea of the methods of this group is a sequential hierarchical decomposition of many objects. Depending on the direction of the hierarchy, divisible and agglomerative methods are distinguished. In the case of the agglomerative method (from bottom to top), the decomposition process begins with the fact that each object is an independent cluster. Then, at each iteration, pairs of nearby clusters are sequentially combined into a common cluster. Iterations continue until all objects are combined into one cluster or until some stop condition is fulfilled. The divisible method (from top to bottom), on the contrary, implies that at the initial stage all objects are combined into a single cluster. At each iteration, it is divided into smaller ones until until each object is in a separate cluster or the stop condition is met. As a stopping condition, you can use the threshold number of clusters that you need to obtain, however, the threshold value of the distance between the clusters is usually used.

The main problem of hierarchical methods is the difficulty of determining the stopping condition in such a way as to distinguish “natural” clusters and at the same time to prevent their partitioning. Another problem of hierarchical clustering methods is the choice of the point of separation or merging of clusters. This choice is critical, because after the separation or merging of clusters at each subsequent step, the method will only operate on newly formed clusters, therefore, incorrect selection of the merge or separation point at any step can lead to poor clustering. In addition, hierarchical methods cannot be applied to large data sets, because the decision to split or merge clusters requires analysis of a large number of objects and clusters, which leads to a large computational complexity of the method. Examples of algorithms

Density methods

Clusters are considered as regions of a data space with a high density of objects, which are divided by regions with a low density of objects.

The DBSCAN algorithm [7] is one of the first density clustering algorithms. This algorithm is based on several definitions:

An ε-neighborhood of an object is a neighborhood of the radius ε of some object.
A root object is an object whose ε- neighborhood contains at least some minimum number of MinPts objects.
An object p is directly tightly reachable from an object q if p is in the ε- neighborhood of q and q is the root object.
An object p is tightly reachable from q for given ε and MinPts if there exists a sequence of objects p ₁ , ..., p _n , where p ₁ = q and p _n = p such that p _i +1 is directly tightly reachable from p _i , 1 ≤ i ≤ n .
The object p is tightly connected to the object q for given ε and MinPts if there exists an object o such that p and q are tightly reachable from o .

To search for clusters, the DBSCAN algorithm checks the ε-neighborhood of each object. If the ε- neighborhood of p contains more points than MinPts , then a new cluster is created with the root object p . DBSCAN then iteratively collects the objects directly tightly reachable from the root objects, which can lead to the union of several tightly reachable clusters. The process ends when no new objects can be added to any cluster.

Although, unlike partitioning methods, DBSCAN does not require a preliminary indication of the number of clusters received, the values of the parameters ε and MinPts are requiredthat directly affect the result of clustering. The optimal values of these parameters are difficult to determine, especially for multidimensional data spaces. In addition, the distribution of data in such spaces is often asymmetric, which does not allow the use of global density parameters for their clustering. For clustering multidimensional data spaces based on DBSCAN, the SUBCLU algorithm was created [8].

Network methods

The general idea of the methods is that the space of objects is divided into a finite number of cells forming a network structure, within which all clustering operations are performed. The main advantage of the methods of this group is the short execution time, which usually does not depend on the number of data objects, but depends only on the number of cells in each space dimension.

The CLIQUE algorithm [9], adapted for clustering high-dimensional data, is one of the classical network algorithms. The method is based on the assumption that if the distribution of objects in a multidimensional data space is not uniform - density and rarefaction regions occur, then the projection of the density region into a subspace with a lower dimension will be part of the density region in this subspace. The CLIQUE algorithm clusters a multidimensional data space as follows: the data space is divided into non-intersecting cells of a fixed size, among them dense cells are identified - those whose density of data objects exceeds a predetermined threshold value. Further, from the found cells, a space is formed in which dense cells of larger dimension can exist.

This algorithm is scalable for processing a large amount of data, however, with a large number of measurements, the number of considered combinations grows nonlinearly, therefore, it is required to use heuristics to reduce the number of considered combinations. In addition, the result obtained very much depends on the choice of the cell size and the threshold value of the density of objects in the cell. This is a big problem, because the same values of these parameters are used when considering all combinations of measurements. This problem is solved by the MAFIA algorithm [10], which works on a similar principle, but uses adaptive cell size when partitioning subspaces.

Model Methods

The methods of this family suggest that there is some mathematical model of the cluster in the data space and seek to maximize the similarity of this model and the available data. Often, the apparatus of mathematical statistics is used.

The EM algorithm [11] is based on the assumption that the data set under study can be modeled using a linear combination of multidimensional normal distributions. Its purpose is to evaluate distribution parameters that maximize the likelihood function used as a measure of model quality. In other words, it is assumed that the data in each cluster obeys a certain distribution law, namely, normal distribution. Based on this assumption, it is possible to determine the optimal parameters of the distribution law - the mathematical expectation and variance, for which the likelihood function is maximum. Thus, we assume that any object belongs to all clusters, but with different probability. Then the task will be to “fit” the totality of distributions to the data, and then in determining the probabilities of an object belonging to each cluster. Obviously, the object should be assigned to the cluster for which this probability is higher.

The EM algorithm is simple and easy to implement, not sensitive to isolated objects, and converges quickly with successful initialization. However, it requires initialization to indicate the number of clusters k, which implies the presence of a priori knowledge of data. In addition, if the initialization fails, the convergence of the algorithm may turn out to be slow, or a poor-quality result may be obtained.
Obviously, such algorithms are not applicable to spaces with a high dimension, since in this case it is extremely difficult to assume a mathematical model of the distribution of data in this space.

Conceptual clustering

Unlike traditional clustering, which detects groups of similar objects based on a measure of similarity between them, conceptual clustering defines clusters as groups of objects that belong to the same class or concept - a specific set of attribute-value pairs.

The COBWEB algorithm [12] is a classical method of incremental conceptual clustering. It creates a hierarchical clustering in the form of a classification tree: each node of this tree refers to a concept and contains a probabilistic description of this concept, which includes the probability that the concept belongs to this node and conditional probabilities of the form: P (A _i = v _ij | C _k ), where A _i = v _ij- attribute-value pair, C _k - concept class.
Nodes located at a certain level of the classification tree are called a slice. To build a classification tree, the algorithm uses a heuristic measure of assessment, called category utility - an increase in the expected number of correct assumptions about attribute values with knowledge of their belonging to a certain category relative to the expected number of correct assumptions about attribute values without this knowledge. To embed a new object in the classification tree, the COBWEB algorithm iteratively traverses the entire tree in search of the “best” node to which this object belongs. The choice of a node is based on placing the object in each node and calculating the utility of the category of the resulting slice. The utility of the category is also calculated for the case when the object belongs to a newly created node. As a result, the object belongs to the node
However, COBWEB has a number of limitations. First, he suggests that the probability distributions of the values of various attributes are statistically independent of each other. However, this assumption is not always true, because there is often a correlation between attribute values. Secondly, the probabilistic representation of clusters makes it very difficult to update them, especially when the attributes have a large number of possible values. This is because the complexity of the algorithm depends not only on the number of attributes, but also on the number of possible values.

List of references

Tags: