Cluster analysis: methods and scope of application. An overview of data clustering algorithms

Greetings!

In my thesis, I reviewed and comparatively analysed data clustering algorithms, and I thought the material I had already collected and processed might be interesting and useful to someone.
Sashaeve talked about what clustering is in the article “Clustering: k-means and c-means algorithms.” I will partially repeat Alexander’s words and partially add to them. At the end of this article, those interested can read further material via the links in the bibliography.

I have also tried to turn the dry thesis style of presentation into something more journalistic.

Concept of clustering

Clustering (or cluster analysis) is the task of dividing a set of objects into groups called clusters. Within each group there should be “similar” objects, and objects from different groups should be as different as possible. The main difference between clustering and classification is that the list of groups is not clearly defined and is determined during the operation of the algorithm.

The application of cluster analysis in general comes down to the following steps:

  1. Selection of a sample of objects for clustering.
  2. Defining a set of variables by which objects in the sample will be assessed. If necessary, normalize the values of the variables.
  3. Calculation of similarity measure values between objects.
  4. Application of the cluster analysis method to create groups of similar objects (clusters).
  5. Presentation of analysis results.
After receiving and analyzing the results, it is possible to adjust the selected metric and clustering method until the optimal result is obtained.

Distance measures

So, how do we determine the “similarity” of objects? First, you need to create a vector of characteristics for each object - as a rule, this is a set of numerical values, for example, a person’s height and weight. However, there are also algorithms that work with qualitative (so-called categorical) characteristics.

Once we have determined the feature vector, normalization can be carried out so that all components contribute equally to the “distance” calculation. During normalization, all values are brought to a common range, for example [-1, 1] or [0, 1].

Finally, for each pair of objects, the “distance” between them is measured - the degree of similarity. There are many metrics; the main ones are the Euclidean distance, the squared Euclidean distance, the city-block (Manhattan) distance, the Chebyshev distance, and the power distance (they are described in more detail in the section on distance measures below).

The choice of metric rests entirely with the researcher, since clustering results can differ significantly when using different measures.
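
To make the idea concrete, here is a small sketch (Python with NumPy; my own illustration, not part of the original thesis) of how a few of the common distance measures can be computed for two feature vectors:

```python
import numpy as np

def euclidean(a, b):
    # Straight-line (geometric) distance in feature space.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # "City block" distance: sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))

def chebyshev(a, b):
    # Largest difference over any single coordinate.
    return np.max(np.abs(a - b))

# Example: height (cm) and weight (kg) of two people.
x = np.array([182.0, 78.0])
y = np.array([170.0, 65.0])
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))
```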

Classification of algorithms

For myself, I have identified two main classifications of clustering algorithms.
  1. Hierarchical and flat.
    Hierarchical algorithms (also called taxonomy algorithms) build not just one partition of the sample into disjoint clusters, but a system of nested partitions. Thus, as a result we get a tree of clusters whose root is the entire sample and whose leaves are the smallest clusters.
    Flat algorithms build one partition of objects into clusters.
  2. Clear and fuzzy.
    Clear (or non-overlapping) algorithms assign each sample object a cluster number, i.e. each object belongs to only one cluster. Fuzzy (or overlapping) algorithms assign each object a set of real values that show the degree of the object’s relationship to the clusters, i.e. each object belongs to each cluster with a certain degree of membership.

Merging clusters

In the case of using hierarchical algorithms, the question arises of how to combine clusters with each other, how to calculate the “distances” between them. There are several metrics:
  1. Single link (nearest neighbor distances)
    In this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters. The resulting clusters tend to form chains.
  2. Complete link (distance of the most distant neighbors)
    In this method, distances between clusters are determined by the largest distance between any two objects in different clusters (i.e., most distant neighbors). This method usually works very well when objects come from separate groups. If the clusters have an elongated shape or their natural type is “chain”, then this method is unsuitable.
  3. Unweighted pairwise average
    In this method, the distance between two different clusters is calculated as the average distance between all pairs of objects in them. The method is effective when objects form different groups, but it works equally well in cases of extended (“chain” type) clusters.
  4. Weighted pairwise average
    The method is identical to the unweighted pairwise average method, except that the size of the corresponding clusters (that is, the number of objects they contain) is used as a weighting factor in the calculations. Therefore, this method should be used when unequal cluster sizes are expected.
  5. Unweighted centroid method
    In this method, the distance between two clusters is defined as the distance between their centers of gravity.
  6. Weighted centroid method (median)
    This method is identical to the previous one, except that the calculation uses weights to account for differences between cluster sizes. Therefore, if there are or are suspected significant differences in cluster sizes, this method is preferable to the previous one.
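
As an illustration of these merging rules, here is a hedged sketch (Python/NumPy, not from the original text) that computes several of the listed inter-cluster distances directly from the objects of two clusters:

```python
import numpy as np

def pairwise(A, B):
    # Euclidean distance between every object of cluster A and every object of cluster B.
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):      # nearest neighbours
    return pairwise(A, B).min()

def complete_link(A, B):    # most distant neighbours
    return pairwise(A, B).max()

def average_link(A, B):     # unweighted pairwise average
    return pairwise(A, B).mean()

def centroid_link(A, B):    # unweighted centroid method
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 3.0], [5.0, 3.0]])
print(single_link(cluster_a, cluster_b), complete_link(cluster_a, cluster_b))
```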

Algorithms overview

Hierarchical clustering algorithms
Among hierarchical clustering algorithms, there are two main types: bottom-up and top-down algorithms. Top-down algorithms work on a top-down principle: at the beginning, all objects are placed in one cluster, which is then divided into smaller and smaller clusters. More common are bottom-up algorithms, which start by placing each object in a separate cluster and then combining the clusters into larger and larger ones until all the objects in the sample are contained in a single cluster. In this way, a system of nested partitions is constructed. The results of such algorithms are usually presented in the form of a tree - a dendrogram. A classic example of such a tree is the classification of animals and plants.

To calculate distances between clusters, two linkages are most often used: single link or complete link (see the overview of distance measures between clusters above).

A disadvantage of hierarchical algorithms is that they build a complete system of nested partitions, which may be more than is needed for the problem being solved.
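
For completeness, agglomerative (bottom-up) clustering and the dendrogram are available off the shelf; here is a sketch using SciPy on made-up data (the parameters and data are only for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# Toy data: two loose groups of 2-D points.
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(5, 1, size=(20, 2))])

# Bottom-up merging; 'single' and 'complete' correspond to the
# single-link and complete-link distances discussed above.
Z = linkage(X, method='complete')

# Either cut the tree into a fixed number of clusters...
labels = fcluster(Z, t=2, criterion='maxclust')
# ...or draw the whole tree of nested partitions (requires matplotlib):
# dendrogram(Z)
print(labels)
```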

Quadratic Error Algorithms
The clustering problem can be considered as constructing an optimal partition of objects into groups. In this case, optimality can be defined as the requirement to minimize the root mean square error of the partition:

e = Σ_{j=1..k} Σ_{x_i ∈ cluster j} ‖x_i − c_j‖²,

where c_j is the “center of mass” of cluster j (the point with the average values of the characteristics for that cluster).

Quadratic error algorithms are a type of flat algorithms. The most common algorithm in this category is the k-means method. This algorithm builds a given number of clusters located as far apart as possible. The work of the algorithm is divided into several stages:

  1. Randomly select k points that are the initial “centers of mass” of the clusters.
  2. Assign each object to the cluster with the nearest “center of mass”.
  3. Recalculate the “centers of mass” of the clusters according to their current composition.
  4. If the algorithm stopping criterion is not satisfied, return to step 2.
The minimum change in the mean square error is usually chosen as the criterion for stopping the algorithm. It is also possible to stop the algorithm if at step 2 there were no objects that moved from cluster to cluster.

The disadvantages of this algorithm include the need to specify the number of clusters for partitioning.
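
The four steps above can be written down in a few lines; a minimal NumPy sketch (an illustration rather than the exact code from the thesis):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random objects as the initial "centers of mass".
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every object to the cluster with the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center from its current members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```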

Fuzzy Algorithms
The most popular fuzzy clustering algorithm is the c-means algorithm. It is a modification of the k-means method: instead of a hard assignment, each object receives a degree of membership in every cluster, and the cluster centers are recomputed as membership-weighted means until the membership matrix stops changing.

This algorithm may not be suitable if the number of clusters is unknown in advance, or if it is necessary to unambiguously assign each object to one cluster.
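
Since the original step list for c-means did not survive, here is a hedged sketch of the standard fuzzy c-means update rules (membership matrix U, fuzziness exponent m); treat it as an illustration, not as the author’s exact algorithm:

```python
import numpy as np

def c_means(X, k, m=2.0, max_iter=100, tol=1e-5, seed=0):
    # U[i, j] is the degree to which object i belongs to cluster j.
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)              # memberships of each object sum to 1
    for _ in range(max_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]    # membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Standard membership update: the closer a center, the larger the membership.
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        if np.abs(U_new - U).max() < tol:          # stop when memberships barely change
            return centers, U_new
        U = U_new
    return centers, U
```
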
Algorithms based on graph theory
The essence of such algorithms is that a selection of objects is represented in the form of a graph G=(V, E), whose vertices correspond to objects, and whose edges have a weight equal to the “distance” between objects. The advantages of graph clustering algorithms are clarity, relative ease of implementation, and the ability to introduce various improvements based on geometric considerations. The main algorithms are the algorithm for identifying connected components, the algorithm for constructing a minimum spanning tree and the layer-by-layer clustering algorithm.
Algorithm for identifying connected components
In the algorithm for identifying connected components, an input parameter R is specified, and all edges of the graph whose “distances” are greater than R are deleted. Only the closest pairs of objects remain connected. The point of the algorithm is to select a value of R, lying within the range of all pairwise “distances”, at which the graph “falls apart” into several connected components. The resulting components are the clusters.

To select the parameter R, a histogram of the distribution of pairwise distances is usually constructed. In problems with a well-defined cluster structure, the histogram has two peaks: one corresponds to intra-cluster distances, the other to inter-cluster distances. R is chosen from the minimum zone between these peaks. At the same time, it is quite difficult to control the number of clusters using a distance threshold.
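
A sketch of this procedure (Python/SciPy; the data layout is my own assumption): keep only edges shorter than R and take the connected components of what is left.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def threshold_clustering(X, R):
    D = squareform(pdist(X))            # full matrix of pairwise distances
    adjacency = (D < R) & (D > 0)       # edge (i, j) survives only if the distance is below R
    n_clusters, labels = connected_components(csr_matrix(adjacency), directed=False)
    return n_clusters, labels

# In practice R would be read off a histogram of the values in D, as described above.
```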

Minimum spanning tree algorithm
The minimum spanning tree algorithm first constructs a minimum spanning tree on a graph and then sequentially removes the edges with the largest weight. The figure shows the minimum spanning tree obtained for nine objects.

By removing the link labeled CD with a length of 6 units (the edge with the maximum distance), we obtain two clusters: (A, B, C) and (D, E, F, G, H, I). The second cluster can later be divided into two more clusters by removing the edge EF, which has a length of 4.5 units.
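
A possible sketch of this algorithm with SciPy (illustrative only): build the minimum spanning tree and cut its longest edges until the desired number of components remains.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clustering(X, n_clusters):
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).toarray()     # edge weights of the spanning tree
    edges = np.argwhere(mst > 0)
    weights = mst[mst > 0]
    # Remove the n_clusters - 1 longest edges; the tree falls apart into n_clusters parts.
    for idx in np.argsort(weights)[::-1][:n_clusters - 1]:
        i, j = edges[idx]
        mst[i, j] = 0
    _, labels = connected_components(mst, directed=False)
    return labels
```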

Layer-by-layer clustering
The layer-by-layer clustering algorithm is based on identifying connected components of the graph at a certain level of distances between objects (vertices). The distance level is set by a distance threshold c. For example, if the distances between objects satisfy 0 ≤ d(x_i, x_j) ≤ 1, then 0 ≤ c ≤ 1.

The layer-by-layer clustering algorithm generates a sequence of subgraphs of the graph G that reflect the hierarchical relationships between clusters:

G^0 ⊆ G^1 ⊆ … ⊆ G^m,

where G^t = (V, E^t) is the graph at level c^t,
E^t = { e_ij ∈ E : d(x_i, x_j) ≤ c^t },
c^t is the t-th distance threshold,
m is the number of hierarchy levels,
G^0 = (V, ∅) is the graph without edges, obtained at the threshold c^0 = 0,
G^m = G is the graph of objects without restrictions on distance (edge length), since c^m = 1.

By changing the distance thresholds (c^0, …, c^m), where 0 = c^0 < c^1 < … < c^m = 1, it is possible to control the depth of the hierarchy of the resulting clusters. Thus, the layer-by-layer clustering algorithm is capable of creating both a flat and a hierarchical partition of the data.
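
A sketch of the idea (illustrative Python, assuming distances normalised to [0, 1]): compute connected components at each threshold c^t and collect the resulting partitions layer by layer.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def layered_clustering(X, thresholds):
    D = squareform(pdist(X))
    D = D / D.max()                     # normalise distances to [0, 1]
    layers = []
    for c in thresholds:                # 0 = c^0 < c^1 < ... < c^m = 1
        adjacency = csr_matrix((D <= c) & (D > 0))
        _, labels = connected_components(adjacency, directed=False)
        layers.append(labels)           # one partition (graph G^t) per threshold
    return layers

# e.g. layered_clustering(X, thresholds=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
```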

Comparison of algorithms

Computational complexity of algorithms

Algorithm comparison table
Clustering algorithm           | Cluster shape | Input data                                                     | Results
Hierarchical                   | Arbitrary     | Number of clusters or a distance threshold to cut the hierarchy | Binary tree of clusters
k-means                        | Hypersphere   | Number of clusters                                             | Cluster centers
c-means                        | Hypersphere   | Number of clusters, degree of fuzziness                        | Cluster centers, membership matrix
Selecting connected components | Arbitrary     | Distance threshold R                                           |
Minimum spanning tree          | Arbitrary     | Number of clusters or a distance threshold for removing edges  | Tree structure of clusters
Layer-by-layer clustering      | Arbitrary     | Sequence of distance thresholds                                | Tree structure of clusters with different levels of hierarchy

A little about application

In my work, I needed to select individual areas from hierarchical structures (trees); essentially, the original tree had to be cut into several smaller trees. Since a directed tree is a special case of a graph, algorithms based on graph theory are a natural fit.

Unlike a fully connected graph, in a directed tree not all vertices are connected by edges, and the total number of edges is n − 1, where n is the number of vertices. In other words, for tree nodes the work of the connected-components algorithm is simplified, since removing any number of edges “breaks” the tree into connected components (individual trees). The minimum spanning tree algorithm in this case coincides with the connected-components algorithm: by removing the longest edges, the original tree is divided into several trees, and the phase of constructing the minimum spanning tree itself is obviously skipped.

If other algorithms were used, they would have to separately take into account the presence of connections between objects, which complicates the algorithm.

Separately, I would like to say that to achieve the best result it is necessary to experiment with the choice of distance measure, and sometimes even to change the algorithm. There is no single solution.

To date, more than a hundred different clustering algorithms have been developed. Different clustering methods may produce different results: clusters of different shapes, a different number of clusters, or a different composition of clusters. This is normal and is a feature of how a particular algorithm works.

For example, “chain” type clusters are possible, when the clusters are represented by long “chains”, elongated clusters, etc., and some methods can create clusters of arbitrary shape.

Various methods may strive to create clusters of specific sizes (e.g., small or large) or assume that there are clusters of different sizes in the data set.

Some cluster analysis methods are particularly sensitive to noise or outliers, others less so.

The results obtained require further interpretation, research and study of the properties and characteristics of objects to be able to accurately describe the formed clusters.

The clustering process and its result depend on the chosen method and the method for determining the distance measure.

Cluster analysis methods can be divided into two groups:

    hierarchical;

    non-hierarchical.

Each of these groups includes many approaches and algorithms.

10.5.1 Hierarchical cluster analysis methods

The essence of hierarchical clustering is to sequentially combine smaller clusters into larger ones (agglomerative methods) or to divide large clusters into smaller ones (divisive methods).

Hierarchical agglomerative methods (Agglomerative Nesting, AGNES) are characterized by the sequential combination of initial elements and a corresponding reduction in the number of clusters. At the beginning of the algorithm, all objects are separate clusters. In the first step, the two most similar objects are combined into a cluster. In subsequent steps, the merging continues until all objects form one cluster.

Hierarchical divisive methods (DIvisive ANAlysis, DIANA) are the logical opposite of agglomerative methods. At the beginning of the algorithm, all objects belong to one cluster, which in subsequent steps is divided into smaller clusters, resulting in a sequence of splitting groups.

The essence of these methods is illustrated using a dendrogram in Fig. 10.4.

Fig. 10.4. Dendrogram of agglomerative and divisive methods

Software implementation of cluster analysis algorithms is widely represented in various Data Mining tools, which allow solving problems of a fairly large dimension. For example, agglomerative methods are implemented in the SPSS package, and divisive methods in the Statgraf package.

The advantage of hierarchical clustering methods is their clarity. However, hierarchical cluster analysis methods are used for small data sets.

Hierarchical algorithms are associated with the construction of dendrograms (from the Greek dendron - “tree”), which are the result of hierarchical cluster analysis. A dendrogram describes the proximity of individual points and clusters to each other, and represents in graphical form the sequence of unification (separation) of clusters.

Dendrogram - a tree diagram containing levels, each of which corresponds to one of the steps in the process of sequential consolidation of clusters. A dendrogram is also called a tree diagram, a tree of clusters, or a tree of hierarchical structure. A dendrogram is a nested grouping of objects that changes at different levels of the hierarchy.

There are many ways to construct dendrograms. In a dendrogram, objects can be arranged vertically or horizontally. An example of a horizontal dendrogram is shown in Fig. 10.4, vertical dendrogram - in Fig. 10.5.

Fig. 10.5. Vertical dendrogram

In Fig. 10.5, at the first step each observation represents one cluster (a vertical line); at the second step we observe the union of observations: 11 and 10; 3, 4 and 5; 8 and 9; 2 and 6. At the next step clustering continues: observations 11, 10, 3, 4, 5 and 7, 8, 9 are merged. This process continues until all observations are combined into one cluster.

The union is carried out using one of the methods discussed in clause 10.4: the nearest neighbor method, the distant neighbor method, the Ward method, the pairwise average method, the centroid method, etc.

Cluster analysis (CLA) is a set of multidimensional classification methods whose purpose is to form groups (clusters) of similar objects. Unlike the traditional groupings considered in the general theory of statistics, CLA leads to a division into groups that takes all grouping characteristics into account simultaneously.

CLA methods make it possible to solve the following problems:

Carrying out classification of objects taking into account many characteristics;

Checking the assumptions made about the presence of some structure in the studied set of objects, i.e. search for an existing structure;

Construction of new classifications for poorly studied phenomena, when it is necessary to establish the presence of connections within a population and try to introduce structure into it.

To write down formalized CLA algorithms, the following notation is used:

X = {x_1, x_2, …, x_n} is the set of observation objects;

x_i = (x_i1, x_i2, …, x_im) is the i-th observation in the m-dimensional feature space (i = 1, 2, …, n);

d(x_i, x_j) is the distance between the i-th and j-th objects;

z_ij are the normalized values of the original variables;

D = [d(x_i, x_j)] is the matrix of distances between objects.

To implement any CLA method, it is necessary to introduce the concept of “object similarity”. During classification, each cluster should include objects that are most similar to each other in terms of the observed variables.

To quantify similarity, the concept of a metric is introduced. Each object is described by m features and represented as a point in m-dimensional space. The similarity or difference between classified objects is established depending on the metric distance between them. Typically, the following measures of distance between objects are used:

Euclidean distance: d(x_i, x_j) = sqrt( Σ_k (x_ik − x_jk)² );

weighted Euclidean distance: d(x_i, x_j) = sqrt( Σ_k w_k (x_ik − x_jk)² );

city-block distance: d(x_i, x_j) = Σ_k |x_ik − x_jk|;

Mahalanobis distance: d(x_i, x_j) = sqrt( (x_i − x_j)ᵀ S⁻¹ (x_i − x_j) ),

where d(x_i, x_j) is the distance between the i-th and j-th objects;

x_ik, x_jk are the values of the k-th variable for the i-th and j-th objects, respectively;

x_i, x_j are the vectors of variable values for the i-th and j-th objects;

S is the general covariance matrix;

w_k is the weight assigned to the k-th variable.

All CLA methods can be divided into two groups: hierarchical (agglomerative and divisive) and iterative (the method of averages, i.e. k-means, and the condensation search method).

Hierarchical cluster analysis. Of all the cluster analysis methods, the most common is the agglomerative classification algorithm. Its essence is that at the first step each sample object is considered a separate cluster. The merging of clusters then proceeds sequentially: based on the distance matrix or similarity matrix, the closest objects are combined. If the distance matrix initially has dimension (n × n), then the entire merging process is completed in (n − 1) steps, after which all objects are combined into one cluster.

The sequence of association can be represented as a dendrogram, shown in Figure 3.1. The dendrogram shows that at the first step the second and third objects were combined into one cluster with a distance between them of 0.15. In the second step, the first object joined them. The distance from the first object to the cluster containing the second and third objects is 0.3, etc.

Many methods of hierarchical cluster analysis differ in their combination (similarity) algorithms, of which the most common are: the single link method, the complete link method, the average link method, and Ward’s method.

Complete link method – a new object is included in a cluster only if the similarity between the new object and all objects of the cluster is not less than a certain specified level of similarity (Figure 1.3).




Average link method – when a new object is included in an existing cluster, the average value of the similarity measure is calculated, which is then compared with a specified threshold level. If we are talking about combining two clusters, then the measure of similarity between their centers is calculated and compared with the given threshold value. Let's consider a geometric example with two clusters (Figure 1.4).

Figure 1.4. Combining two clusters using the average link method:

If the measure of similarity between the cluster centers is not less than a given level, the clusters are combined into one.

Ward's method – at the first step, each cluster consists of one object. Initially the two closest clusters are merged; for them, the average values of each characteristic are determined and the sum of squared deviations is calculated:

V_k = Σ_{j=1..p} Σ_{i=1..n_k} (x_ijk − x̄_jk)²,   (1.1)

where k is the cluster number, i is the object number, j is the feature number; p is the number of features characterizing each object; n_k is the number of objects in the k-th cluster.

Subsequently, at each step of the algorithm, those objects or clusters are merged that give the smallest increment of the value (1.1).

Ward's method results in clusters of approximately equal sizes with minimal intracluster variation.

The hierarchical cluster analysis algorithm can be represented as a sequence of procedures:

Normalization of initial values ​​of variables;

Calculation of a matrix of distances or a matrix of similarity measures;

Determining a pair of the closest objects (clusters) and combining them according to the selected algorithm;

Repeating the first three procedures until all objects are combined into one cluster.
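
The same sequence of procedures can be sketched in a few lines (Python/SciPy; illustrative, with z-score normalisation assumed):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

def hierarchical_cla(X):
    # 1. Normalise the initial values of the variables (z-scores).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Compute the matrix of pairwise Euclidean distances.
    D = squareform(pdist(Z))
    # 3-4. Merge the closest clusters step by step ("far neighbour" rule as an example)
    #      until everything is joined into a single cluster.
    merge_history = linkage(pdist(Z), method='complete')
    return Z, D, merge_history
```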

The similarity measure for combining two clusters is determined by the following methods:

“Nearest neighbor” method – the degree of similarity between clusters is assessed by the degree of similarity between the most similar (closest) objects of these clusters;

“Distant neighbor” method – the degree of similarity is assessed by the degree of similarity between the most distant (dissimilar) objects of the clusters;

Average link method - the degree of similarity is estimated as the average value of the degrees of similarity between the objects of the clusters;

Median link method - the distance between any cluster S and a new cluster formed by merging clusters R and Q is defined as the distance from the center of cluster S to the midpoint of the segment connecting the centers of clusters R and Q.

Condensation search method. One of the iterative classification methods is the condensation search algorithm. The essence of this iterative algorithm is the use of a hypersphere of a given radius, which moves through the space of classification features in search of local concentrations of objects.



The method of searching for condensations requires, first of all, the calculation of a matrix of distances (or a matrix of similarity measures) between objects and the selection of the initial center of the sphere. Typically, at the first step, the center of the sphere is the object (point) in whose immediate vicinity the largest number of neighbors are located. Based on a given sphere radius (R) a set of points falling inside this sphere is determined, and the coordinates of the center (vector of average feature values) are calculated for them.

When the next recalculation of the coordinates of the center of the sphere leads to the same result as in the previous step, the movement of the sphere stops, and the points that fall within it form a cluster and are excluded from the further clustering process. The above procedures are repeated for all remaining points. The algorithm is completed in a finite number of steps, and all points are distributed among clusters. The number of clusters formed is unknown in advance and strongly depends on the radius of the sphere.

To assess the stability of the resulting partition, it is advisable to repeat the clustering process several times with different values of the sphere radius, changing the radius by a small amount each time.

There are several ways to select the radius of the sphere. If d(x_i, x_j) is the distance between the i-th and j-th objects, the starting radius is usually chosen near the smallest pairwise distances, while the upper bound of the radius can be derived from the largest pairwise distances.

If you start the algorithm from this value and increase the radius by a small amount at each repetition, you can identify the values of the radius that lead to the formation of the same number of clusters, i.e. to a stable partition.
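
A sketch of the condensation (sphere) search as described above (illustrative Python; the stopping and tie-breaking details are my assumptions):

```python
import numpy as np

def condensation_search(X, radius, max_iter=100):
    labels = np.full(len(X), -1)
    remaining = np.arange(len(X))
    cluster_id = 0
    while len(remaining) > 0:
        pts = X[remaining]
        D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        # Start from the point with the most neighbours inside the sphere.
        center = pts[(D < radius).sum(axis=1).argmax()]
        for _ in range(max_iter):
            inside = np.linalg.norm(pts - center, axis=1) <= radius
            new_center = pts[inside].mean(axis=0)
            if np.allclose(new_center, center):    # the sphere has stopped moving
                break
            center = new_center
        labels[remaining[inside]] = cluster_id     # carve the sphere's contents out as a cluster
        remaining = remaining[~inside]
        cluster_id += 1
    return labels
```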

Example 1. Based on the data in Table 1.1, it is necessary to classify five enterprises using hierarchical agglomerative cluster analysis.

Table 1.1

Here x1 is the average annual cost of fixed production assets, billion rubles; x2 is material costs per ruble of manufactured products, kopecks; x3 is the volume of manufactured products, billion rubles.

Solution. Before calculating the distance matrix, we normalize the original data using the formula z_ij = (x_ij − x̄_j) / σ_j, where x̄_j and σ_j are the mean and standard deviation of the j-th variable.

The matrix of values ​​of normalized variables will look like

.

We will carry out the classification using the hierarchical agglomerative method. To construct the distance matrix, we will use the Euclidean distance. Then, for example, the distance between the first and second objects will be

The distance matrix characterizes the distances between objects, each of which, at the first step, represents a separate cluster

.

As can be seen from the matrix, the closest pair of objects is found first. We combine them into one cluster, assign it a number, recalculate the distances of all remaining objects (clusters) to this cluster and obtain a new distance matrix

.

In the matrix, the distances between clusters are determined using the “far neighbor” algorithm. Then the distance between the object and the cluster is

In the matrix we again find the closest clusters and combine them at this step as well, obtaining a new cluster of two objects. Let's assign it a number. Now we have three clusters: (1,3), (2,5), (4).

.

Judging by the matrix, in the next step we combine clusters and into one cluster and assign a number to it. Now we have only two clusters:

.

And finally, in the last step we will combine the clusters at a distance of 3.861.


Let's present the classification results in the form of a dendrogram (Figure 1.5). The dendrogram indicates that the cluster is more homogeneous in the composition of the incoming objects, since in it the union occurred at shorter distances than in the cluster.

Figure 1.5. Dendrogram of clustering of five objects

Example 2. Based on the data given below, classify the stores according to three criteria: x1 - sales area, m²; x2 - turnover per seller, monetary units; x3 - level of profitability, %.


To classify the stores, use the condensation search method (it is required to identify the first cluster).

Solution. 1. Calculate the distances between objects using the Euclidean metric:

d(x_i, x_j) = sqrt( Σ_{l=1..m} (z_il − z_jl)² ),

where z_il, z_jl are the standardized values of the original variables for the i-th and j-th objects, respectively; m is the number of features.

.

2. Based on the matrix Z, we calculate a square symmetric matrix of distances between the objects.

Analysis of the distance matrix helps determine the position of the initial center of the sphere and select the radius of the sphere.

In this example, most of the “small” distances are in the first line, i.e. the first object has quite a lot of “close” neighbors. Therefore, the first object can be taken as the center of the sphere.

3. Set the radius of the sphere equal to 2. Then those objects whose distance to the first object is less than 2 fall into the sphere.

For six points (objects 1, 2, 3, 6, 7, 8) we determine the coordinates of the center of gravity.

4. At the next step of the algorithm, we place the center of the sphere at this point and determine the distance from each object to the new center.

Cluster analysis is used in a wide range of disciplines: public administration, philology, anthropology, marketing, sociology, geology and others. However, this universality of application has led to the emergence of a large number of incompatible terms, methods and approaches, making it difficult to use cluster analysis unambiguously and to interpret it consistently.


    Cluster analysis performs the following main tasks:

    • Development of a typology or classification.
    • An exploration of useful conceptual schemes for grouping objects.
    • Generating hypotheses based on data exploration.
    • Hypothesis testing or research to determine whether the types (groups) identified in one way or another are actually present in the available data.

    Regardless of the subject of study, the use of cluster analysis involves the following steps:

    • Selecting a sample for clustering. The implication is that it makes sense to cluster only quantitative data.
    • Determining the set of variables by which objects in the sample will be assessed, that is, the feature space.
    • Calculation of the values ​​of a particular measure of similarity (or difference) between objects.
    • Using the cluster analysis method to create groups of similar objects.
    • Checking the reliability of the cluster solution results.

    You can find a description of two fundamental requirements for the data: homogeneity and completeness. Homogeneity requires that all clustered entities be of the same nature and described by a similar set of characteristics. If cluster analysis is preceded by factor analysis, then the sample does not need to be “repaired” - the stated requirements are fulfilled automatically by the factor modeling procedure itself (there is another advantage - z-standardization without negative consequences for the sample; if it is carried out directly for cluster analysis, it can entail a decrease in the clarity of the separation of groups). Otherwise, the sample needs to be adjusted.

    Typology of clustering problems

    Input types

    In modern science, several algorithms for processing input data are used. Analysis by comparing objects based on characteristics (most common in biological sciences) is called Q-type of analysis, and in the case of comparing features, based on objects - R-type of analysis. There are attempts to use hybrid types of analysis (for example, RQ-analysis), but this methodology has not yet been properly developed.

    Clustering goals

    • Understanding data by identifying cluster structure. Dividing the sample into groups of similar objects makes it possible to simplify further data processing and decision-making by applying a different method of analysis to each cluster (the “divide-and-conquer” strategy).
    • Data compression. If the original sample is excessively large, then you can reduce it, leaving one most typical representative from each cluster.
    • Novelty detection. Atypical objects are identified that cannot be attached to any of the clusters.

    In the first case, they try to make the number of clusters smaller. In the second case, it is more important to ensure a high degree of similarity of objects within each cluster, and there can be any number of clusters. In the third case, the most interesting are individual objects that do not fit into any of the clusters.

    In all these cases, hierarchical clustering can be used, when large clusters are divided into smaller ones, which in turn are divided into even smaller ones, etc. Such problems are called taxonomy problems. The taxonomy results in a tree-like hierarchical structure. Moreover, each object is characterized by a listing of all the clusters to which it belongs, usually from large to small.

    Clustering methods

    There is no generally accepted classification of clustering methods, but a number of groups of approaches can be distinguished (some methods can be classified into several groups at once and therefore it is proposed to consider this typification as some approximation to the real classification of clustering methods):

    1. Probabilistic approach. It is assumed that each object under consideration belongs to one of k classes. Some authors (for example, A.I. Orlov) believe that this group does not relate to clustering at all and oppose it under the name “discrimination,” that is, the choice of assigning objects to one of the known groups (training samples).
    2. Approaches based on artificial intelligence systems: a very conditional group, since there are a lot of methods and they are methodologically very different.
    3. Logical approach. The dendrogram is constructed using a decision tree.
    4. Graph-theoretic approach.
    5. Hierarchical approach. The presence of nested groups (clusters of different orders) is assumed. Algorithms, in turn, are divided into agglomerative (unifying) and divisional (dividing). Based on the number of characteristics, monothetic and polythetic methods of classification are sometimes distinguished.
      • Hierarchical divisional clustering or taxonomy. Clustering problems are addressed in a quantitative taxonomy.
    6. Other methods. Not included in previous groups.
      • Statistical clustering algorithms
      • Ensemble of clusterizers
      • KRAB family algorithms
      • Algorithm based on sifting method

    Approaches 4 and 5 are sometimes combined under the name of a structural or geometric approach, which has a more formalized concept of proximity. Despite the significant differences between the listed methods, they all rely on the original “compactness hypothesis”: in object space, all close objects must belong to the same cluster, and all different objects, accordingly, must be in different clusters.

    Formal formulation of the clustering problem

    Let X be a set of objects and Y a set of cluster numbers (names, labels). A distance function between objects ρ(x, x′) is specified. There is a finite training sample of objects X^m = {x_1, …, x_m} ⊂ X. It is required to split the sample into disjoint subsets, called clusters, so that each cluster consists of objects that are close in the metric ρ, while objects of different clusters differ significantly. Each object x_i ∈ X^m is assigned a cluster number y_i.

    A clustering algorithm is a function a: X → Y that assigns a cluster number y ∈ Y to any object x ∈ X. In some cases the set Y is known in advance, but more often the task is to determine the optimal number of clusters from the point of view of one or another clustering quality criterion.

    In general, it is worth noting that historically, measures of similarity rather than measures of difference (distance) are often used as measures of proximity in biology.

    In sociology

    When analyzing the results of sociological research, it is recommended to carry out the analysis using the methods of the hierarchical agglomerative family, namely the Ward method, in which the minimum dispersion within clusters is optimized, ultimately creating clusters of approximately equal sizes. Ward's method is most suitable for analyzing sociological data. A better measure of difference is the quadratic Euclidean distance, which helps increase the contrast of clusters. The main result of hierarchical cluster analysis is a dendrogram or “icicle diagram”. When interpreting it, researchers are faced with the same kind of problem as the interpretation of the results of factor analysis - the lack of unambiguous criteria for identifying clusters. It is recommended to use two main methods - visual analysis of the dendrogram and comparison of clustering results performed by different methods.

    Visual analysis of the dendrogram involves “trimming” the tree at the optimal level of similarity of the sample elements. It is advisable to “cut off the grape branch” (the terminology of M. S. Oldenderfer and R. K. Blashfield) at level 5 of the Rescaled Distance Cluster Combine scale; this way an 80% level of similarity is achieved. If identifying clusters using this label is difficult (several small clusters merge into one large one), then you can select another label. This technique is proposed by Oldenderfer and Blashfield.

    Now the question arises of the stability of the adopted cluster solution. In essence, checking the stability of clustering comes down to checking its reliability. There is a rule of thumb here: a stable typology is preserved when the clustering method changes. The results of hierarchical cluster analysis can be verified by iterative cluster analysis using the k-means method. If the compared classifications of groups of respondents have a coincidence rate of more than 70% (more than 2/3 of matches), then the cluster solution is accepted.
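
    One hedged way to automate such a check is sketched below (illustrative Python; the adjusted Rand index is used here in place of a raw share of matching assignments, since the raw share depends on how cluster labels are matched up):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def check_stability(X, n_clusters):
    # Hierarchical (Ward) solution...
    ward_labels = fcluster(linkage(X, method='ward'), t=n_clusters, criterion='maxclust')
    # ...compared with an iterative k-means solution on the same data.
    km_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(ward_labels, km_labels)  # closer to 1 = more stable typology
```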

    It is impossible to check the adequacy of a solution without resorting to another type of analysis. At least in theoretical terms, this problem has not been solved. Oldenderfer and Blashfield's classic paper, Cluster Analysis, discusses in detail and ultimately rejects an additional five robustness testing methods:

    1. cophenetic correlation - not recommended and limited in use;
    2. significance tests (analysis of variance) - always give a significant result;
    3. the technique of repeated (random) sampling, which, however, does not prove the validity of the decision;
    4. significance tests for external signs suitable for repeated measurements only;
    5. Monte Carlo methods are very complex and accessible only to experienced mathematicians.

    Clustering is also applied in image analysis, for example for edge detection or object recognition, and in intelligent data analysis (data mining), where clustering acquires value when it acts as one of the stages of data analysis and of building a complete analytical solution. It is often easier for an analyst to identify groups of similar objects, study their features and build a separate model for each group than to create one general model for all the data. This technique is constantly used in marketing to identify groups of clients, buyers and products and to develop a separate strategy for each of them.

    In the most diverse areas of our activity we often have to deal with a huge number of objects about which some action needs to be taken.

    And we cannot even comprehend this entire volume, let alone understand it.

    What is the way out? Well, of course, “put everything into order.” In this case, folk wisdom takes on a very definite scientific formulation.

    Cluster analysis is the study of objects by combining them into homogeneous groups with similar characteristics. Its methods are applicable in literally all areas: from medicine to Forex trading, from car insurance to archaeology. And for marketers and HR specialists it is simply irreplaceable.

    More details about this in the article.

    What is a cluster

    Cluster analysis is designed to divide a set of objects into homogeneous groups (clusters or classes). This is a multidimensional data classification problem.


    There are about 100 different clustering algorithms, however, the most commonly used are:

    1. hierarchical cluster analysis,
    2. k-means clustering.

    Where is cluster analysis used:

    • In marketing, this is the segmentation of competitors and consumers.
    • In management:
      1. dividing personnel into groups of different levels of motivation,
      2. supplier classification,
      3. identification of similar production situations in which defects occur.
    • In medicine - classification of symptoms, patients, drugs.
    • In sociology, the division of respondents into homogeneous groups.

    In fact, cluster analysis has proven itself well in all spheres of human life. The beauty of this method is that it works even when there is little data and the requirements of classical statistical methods, such as the normal distribution of random variables, are not met.

    Let us explain the essence of cluster analysis without resorting to strict terminology.

    Let's say you conducted a survey of employees and want to determine how to most effectively manage personnel. That is, you want to divide employees into groups and highlight the most effective management levers for each of them. At the same time, differences between groups should be obvious, and within the group respondents should be as similar as possible.

    To solve the problem, it is proposed to use hierarchical cluster analysis. As a result, we will get a tree, looking at which we must decide how many classes (clusters) we want to divide the staff into. Let’s assume that we decide to divide the staff into three groups, then to study the respondents who fall into each cluster we will get a table with approximately the following content:


    Let us explain how the above table is formed. The first column contains the number of the cluster - the group, the data for which is reflected in the line. For example, the first cluster is 80% men. 90% of the first cluster fall into the age category from 30 to 50 years, and 12% of respondents believe that benefits are very important. And so on.

    Let's try to create portraits of respondents in each cluster:

    1. The first group consists mainly of mature men who occupy leadership positions. They are not interested in the social package (MED, LGOTI, TIME-free time). They prefer to receive a good salary rather than help from an employer.
    2. Group two, on the contrary, gives preference to the social package. It consists mainly of “aged” people occupying low positions. Salary is certainly important to them, but there are other priorities.
    3. The third group is the “youngest”. Unlike the previous two, there is an obvious interest in learning and professional development opportunities. This category of employees has a good chance to soon join the first group.

    Thus, when planning a campaign to introduce effective personnel management methods, it is obvious that in our situation it is possible to increase the social package for the second group at the expense of, for example, wages. If we talk about which specialists should be sent for training, we can definitely recommend paying attention to the third group.

    Source: "nickart.spb.ru"

    Cluster analysis is the key to understanding the market

    A cluster is the price of an asset during a certain period of time during which transactions were made. The resulting volume of purchases and sales is indicated by a number within the cluster. The bar of any timeframe usually contains several clusters. This allows you to see in detail the volumes of purchases, sales and their balance in each individual bar, at each price level.


    Building a cluster graph

    A change in the price of one asset inevitably entails a chain of price movements in other instruments. In most cases, understanding a trend movement occurs already at the moment when it is rapidly developing, and entering the market along the trend risks ending up in a correctional wave.

    For successful transactions, you need to understand the current situation and be able to anticipate future price movements. This can be learned by analyzing the cluster graph. Using cluster analysis, you can see the activity of market participants within even the smallest price bar.

    This is the most accurate and detailed analysis, as it shows the point distribution of transaction volumes at each asset price level. There is a constant conflict between the interests of sellers and buyers in the market. And every smallest price movement (tick) is a move towards a compromise - a price level - that currently suits both parties.

    But the market is dynamic, the number of sellers and buyers is constantly changing. If at one point in time the market was dominated by sellers, then at the next moment there will most likely be buyers. The number of completed transactions at adjacent price levels is also not the same.

    And yet, first the market situation is reflected in the total volumes of transactions, and only then in the price. If you see the actions of the dominant market participants (sellers or buyers), then you can predict the price movement itself.

    To successfully use cluster analysis, you first need to understand what a cluster and delta are:

    • A cluster is a price movement that is divided into levels at which transactions with known volumes were made.
    • Delta shows the difference between the purchases and sales occurring in each cluster.


    Cluster graph

    Each cluster, or group of deltas, allows you to understand whether buyers or sellers dominate the market at a given time. It is enough just to calculate the total delta by summing up sales and purchases. If the delta is negative, then the market is oversold and there are redundant sell transactions. When the delta is positive, buyers clearly dominate the market.

    The delta itself can be moderate, normal or critical. A delta value above normal in a cluster is highlighted in red. A moderate delta characterizes a flat state of the market; at a normal delta value there is a trend movement in the market, while a critical value is always a harbinger of a price reversal.
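
    As a toy illustration of the arithmetic only (the trade format below is hypothetical, not a real data feed):

```python
from collections import defaultdict

def cluster_deltas(trades):
    """trades: iterable of (price_level, volume, side) tuples, side in {'buy', 'sell'}."""
    volume = defaultdict(float)   # total traded volume per price level
    delta = defaultdict(float)    # buys minus sells per price level
    for price, vol, side in trades:
        volume[price] += vol
        delta[price] += vol if side == 'buy' else -vol
    return dict(volume), dict(delta)

bar = [(1.1050, 12, 'buy'), (1.1050, 7, 'sell'), (1.1051, 3, 'sell')]
print(cluster_deltas(bar))   # a positive delta means buyers dominate at that level
```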

    Forex trading using CA

    To obtain maximum profit, you need to be able to determine the transition of the delta from a moderate level to a normal one. Indeed, in this case, you can notice the very beginning of the transition from flat to trend movement and be able to get the greatest profit.

    A cluster chart is more visual; on it you can see significant levels of accumulation and distribution of volumes, and plot support and resistance levels.

    This allows the trader to find the exact entry into the trade. Using the delta, you can judge the predominance of sales or purchases in the market. Cluster analysis allows you to observe transactions and track their volumes inside a bar of any TF. This is especially important when approaching significant support or resistance levels. Cluster judgments are the key to understanding the market.

    Source: "orderflowtrading.ru"

    Areas and features of application of cluster analysis

    The term cluster analysis (first coined by Tryon, 1939) actually covers a set of different classification algorithms. A general question asked by researchers in many fields is how to organize observed data into visual structures, i.e. how to develop taxonomies.

    For example, biologists set themselves the goal of dividing animals into different species in order to meaningfully describe the differences between them. According to the modern system adopted in biology, humans belong to primates, mammals, amniotes, vertebrates and animals.

    Note that in this classification, the higher the level of aggregation, the less similarity between members in the corresponding class. Humans bear more similarities to other primates (i.e., apes) than to “outlying” members of the mammalian family (i.e., dogs), etc.

    Note that the previous discussion refers to clustering algorithms, but does not mention anything about statistical significance testing. In fact, cluster analysis is not so much an ordinary statistical method as a “set” of various algorithms for “distributing objects into clusters.”

    There is a point of view that, unlike many other statistical procedures, cluster analysis methods are used in most cases when you do not have any a priori hypotheses about the classes, but are still in the descriptive stage of the study. It should be understood that cluster analysis determines the “most likely significant solution.”

    Therefore, testing for statistical significance is not really applicable here, even in cases where p-levels are known (as in the K-means method, for example).

    Clustering techniques are used in a wide variety of fields. Hartigan (1975) gave an excellent review of many published studies containing results obtained using cluster analysis methods. For example, in the medical field, clustering of diseases, treatments for diseases, or symptoms of diseases leads to widely used taxonomies.

    In the field of psychiatry, correct diagnosis of symptom clusters such as paranoia, schizophrenia, etc. is crucial for successful therapy. In archeology, using cluster analysis, researchers try to establish taxonomies of stone tools, funeral objects, etc.

    There are widespread applications of cluster analysis in marketing research. In general, whenever it is necessary to classify “mountains” of information into groups suitable for further processing, cluster analysis turns out to be very useful and effective.

    Tree Clustering

    The purpose of a union algorithm (tree clustering) is to combine objects (for example, animals) into sufficiently large clusters using some measure of similarity or distance between objects. The typical result of such clustering is a hierarchical tree.

    Consider a horizontal tree diagram. The diagram starts with each object in the class (on the left side of the diagram). Now imagine that gradually (in very small steps) you “relax” your criterion about which objects are unique and which are not. In other words, you lower the threshold related to the decision to combine two or more objects into one cluster.


    As a result, you link more and more objects together and aggregate (combine) more and more clusters consisting of increasingly different elements. Finally, in the last step, all objects are combined together.

    In these diagrams, the horizontal axes represent the join distance (in vertical tree diagrams, the vertical axes represent the join distance). So, for each node in the graph (where a new cluster is formed), you can see the distance value for which the corresponding elements are associated into a new single cluster.

    When data has a clear "structure" in terms of clusters of objects that are similar to each other, then this structure is likely to be reflected in the hierarchical tree by different branches. As a result of successful analysis using the merging method, it becomes possible to detect clusters (branches) and interpret them.

    Distance measures

    The joining or tree clustering method uses measures of dissimilarity or distance between objects when forming clusters. These distances can be defined in one-dimensional or multi-dimensional space. For example, if you were to cluster types of food in a cafe, you might take into account the number of calories they contain, the price, a subjective taste rating, etc.

    The most direct way to calculate distances between objects in multidimensional space is to calculate Euclidean distances. If you have a two- or three-dimensional space, then this measure is the actual geometric distance between objects in space (as if the distances between objects were measured with a tape measure).

    However, the joining algorithm does not “care” whether the distances provided to it are real geometric distances or some other derived measure that is more meaningful to the researcher; the researcher's task is to select the correct method for the specific application.

    1. Euclidean distance.
      This seems to be the most general type of distance. It is simply the geometric distance in multidimensional space and is calculated as follows:

      distance(x, y) = { Σ_i (x_i − y_i)² }^(1/2)

      Note that the Euclidean distance (and its square) is calculated from the original data, not the standardized data. This is a common way to calculate it, which has certain advantages (for example, the distance between two objects does not change when a new object, which may be an outlier, is introduced into the analysis).

      However, distances can be greatly influenced by differences between the axes from which the distances are calculated.

      For example, if one of the axes is measured in centimeters and you then convert it to millimeters (multiplying the values by 10), then the final Euclidean distance (or the square of the Euclidean distance) calculated from the coordinates will change greatly, and as a result the results of the cluster analysis may differ greatly from the previous ones.

    2. Squared Euclidean distance.
      Sometimes you may want to square the standard Euclidean distance to give more weight to objects that are farther apart. This distance is calculated as follows:

      distance(x, y) = Σ_i (x_i − y_i)²

    3. City-block distance (Manhattan distance).
      This distance is simply the average of the differences over the coordinates. In most cases, this distance measure produces the same results as the ordinary Euclidean distance. However, note that for this measure the influence of individual large differences (outliers) is reduced (since they are not squared). The Manhattan distance is calculated using the formula:

      distance(x, y) = Σ_i |x_i − y_i|

    4. Chebyshev distance.
      This distance can be useful when one wants to define two objects as “different” if they differ in any one coordinate (in any one dimension). The Chebyshev distance is calculated using the formula:

      distance(x, y) = max_i |x_i − y_i|

    5. Power distance.
      Sometimes one wishes to progressively increase or decrease the weight related to a dimension for which the corresponding objects are very different. This can be achieved using the power distance, which is calculated by the formula:

      distance(x, y) = ( Σ_i |x_i − y_i|^p )^(1/r),

      where r and p are user-defined parameters. A few example calculations can show how this measure “works”:

      • The parameter p is responsible for gradually weighting the differences along individual coordinates.
      • The parameter r is responsible for progressively weighting large distances between objects.
      • If both parameters r and p are equal to two, then this distance coincides with the Euclidean distance.

    6. Percentage of disagreement.
      This measure is used when the data are categorical. The distance is calculated by the formula:

      distance(x, y) = (number of coordinates i for which x_i ≠ y_i) / (total number of coordinates)

    Association or connection rules

    At the first step, when each object is a separate cluster, the distances between these objects are determined by the selected measure. However, when several objects are linked together, the question arises, how should the distances between clusters be determined?

    In other words, a union or connection rule is needed for two clusters. There are various possibilities: for example, you can link two clusters together when any two objects in the two clusters are closer to each other than the corresponding linking distance.

    In other words, you use the "nearest neighbor rule" to determine the distance between clusters; this method is called the single link method. This rule builds “fibrous” clusters, i.e. clusters “linked together” only by individual elements that happen to be closest to each other.

    As an alternative, you can use the neighbors in the two clusters that are farthest apart from each other among all pairs of objects. This method is called the full link method. There are also many other methods for combining clusters similar to those discussed.

    • Single link (nearest neighbor method).
    • As described above, in this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters.

      This rule must, in a sense, string objects together to form clusters, and the resulting clusters tend to be represented by long "chains".

    • Full link (most distant neighbors method).
    • In this method, distances between clusters are determined by the largest distance between any two objects in different clusters (i.e., "most distant neighbors").

      This method usually works very well when the objects come from actually different groups.

      If the clusters have a somewhat elongated shape or their natural type is “chain”, then this method is unsuitable.

    • Unweighted pairwise average.
    • In this method, the distance between two different clusters is calculated as the average distance between all pairs of objects in them. The method is effective when objects actually form different “groves”, but it works equally well in cases of extended (“chain” type) clusters.

      Note that in their book, Sneath and Sokal (1973) introduce the abbreviation UPGMA to refer to this method as the unweighted pair-group method using arithmetic averages.

    • Weighted pairwise average.
    • The method is identical to the unweighted pairwise average method, except that the size of the corresponding clusters (that is, the number of objects they contain) is used as a weighting factor in the calculations. Therefore, the proposed method should be used when unequal cluster sizes are expected.

      The book by Sneath and Sokal (1973) introduces the abbreviation WPGMA to refer to this method as the weighted pair-group method using arithmetic averages.

    • Unweighted centroid method.
    • In this method, the distance between two clusters is defined as the distance between their centers of gravity.

      Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as the unweighted pair-group method using the centroid average.

    • Weighted centroid method (median).
    • This method is identical to the previous one, except that the calculation uses weights to account for the difference between the sizes of the clusters (that is, the number of objects in them).

      Therefore, if there are (or are suspected) significant differences in cluster sizes, this method is preferable to the previous one.

      Sneath and Sokal (1973) used the abbreviation WPGMC to refer to it as the weighted pair-group method using the centroid average.

    • Ward's method.
    • This method is different from all other methods because it uses analysis of variance techniques to estimate the distances between clusters. The method minimizes the sum of squares (SS) for any two (hypothetical) clusters that can be formed at each step.

      Details can be found in Ward (1963). Overall, the method seems to be very effective, but it tends to create small clusters.
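    As a rough illustration (a sketch in Python on synthetic data, not part of the original text), the amalgamation rules listed above map directly onto the method names in scipy.cluster.hierarchy:

      import numpy as np
      from scipy.cluster.hierarchy import linkage, fcluster

      rng = np.random.default_rng(0)
      # two synthetic "clumps" of ten two-dimensional objects each
      X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

      # single = nearest neighbor, complete = most distant neighbors,
      # average/weighted = UPGMA/WPGMA, centroid/median = UPGMC/WPGMC
      for method in ["single", "complete", "average", "weighted", "centroid", "median", "ward"]:
          Z = linkage(X, method=method)                     # hierarchical merge tree for this rule
          labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters
          print(method, labels)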

    Two-way joining

    This method was previously discussed in terms of the "objects" that need to be clustered. In all other types of analysis, the question of interest to the researcher is usually expressed in terms of observations or variables. It turns out that clustering, both by observations and by variables, can lead to quite interesting results.

    For example, imagine that a medical researcher is collecting data on various characteristics (variables) of patients' conditions (cases) suffering from heart disease. A researcher may want to cluster observations (patients) to identify clusters of patients with similar symptoms.

    At the same time, the researcher may want to cluster variables to identify clusters of variables that are associated with similar physical conditions. After this discussion regarding whether to cluster observations or variables, one might ask, why not cluster in both directions?

    The Cluster Analysis module contains an efficient two-way joining routine that allows you to do just that. However, two-way joining is used relatively rarely, in circumstances where both observations and variables are expected to contribute simultaneously to the discovery of meaningful clusters.

    Thus, returning to the previous example, we can assume that a medical researcher needs to identify clusters of patients who are similar in relation to certain clusters of physical condition characteristics.

    The difficulty in interpreting the results obtained arises from the fact that similarities between different clusters may arise from (or be the cause of) some differences in subsets of variables. Therefore, the resulting clusters are heterogeneous in nature.

    This may seem a little hazy at first; indeed, compared with the other cluster analysis methods described, two-way joining is probably the least commonly used method. However, some researchers believe that it offers a powerful means of exploratory data analysis (for more information, see Hartigan's (1975) description of this method).
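    Statistica's two-way joining routine has no direct one-line equivalent in common libraries, but a rough analogue (an assumption, sketched here in Python) is to cluster the rows and the columns of the data matrix independently and then reorder the matrix; a color map of the reordered matrix plays the role of the two-dimensional chart:

      import numpy as np
      from scipy.cluster.hierarchy import linkage, leaves_list

      rng = np.random.default_rng(1)
      data = rng.normal(size=(20, 6))     # 20 synthetic observations x 6 variables

      row_order = leaves_list(linkage(data, method="average"))    # order of observations
      col_order = leaves_list(linkage(data.T, method="average"))  # order of variables
      reordered = data[np.ix_(row_order, col_order)]              # matrix reordered "both ways"
      # Plotting `reordered` as a heat map gives a picture similar to the
      # two-dimensional color chart produced by the two-way joining procedure.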

    K-means method

    This clustering method differs significantly from such agglomerative methods as Union (tree clustering) and Two-way union. Let's assume you already have hypotheses about the number of clusters (based on observations or variables).

    You can tell the system to form exactly three clusters so that they are as distinct as possible. This is exactly the type of problem that the K-means algorithm solves. In general, the K-means method builds exactly K different clusters located at the greatest possible distances from each other.

    In the physical condition example, a medical researcher might have a “hunch” from his clinical experience that his patients generally fall into three different categories. Next, he may want to know whether his intuition can be confirmed numerically, that is, does K-means cluster analysis actually produce three clusters of patients as expected?

    If this is the case, then the averages of the various measures of physical parameters for each cluster will provide a quantitative way of representing the researcher's hypotheses (eg, patients in cluster 1 have a high parameter 1, a low parameter 2, etc.).

    From a computational point of view, you can think of this method as an analysis of variance in reverse.

    The program starts with K randomly selected clusters and then changes the objects' membership in them so that:

    1. minimize variability within clusters,
    2. maximize variability between clusters.

    This method is similar to reverse ANOVA in that the test of significance in ANOVA compares between-group and within-group variability in testing the hypothesis that group means differ from each other.

    In K-means clustering, the program moves objects (i.e., observations) from one group (cluster) to another in order to obtain the most significant result when conducting an analysis of variance (ANOVA). Typically, once the results of a K-means cluster analysis are obtained, the means for each cluster along each dimension can be calculated to assess how different the clusters are from each other.

    Ideally, you should obtain widely varying means for most, if not all, of the measurements used in the analysis. The F-statistic values obtained for each dimension are another indicator of how well the corresponding dimension discriminates between clusters.
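    A minimal sketch of the same idea in Python (scikit-learn, with synthetic data instead of real patients) shows how the cluster assignments and the per-cluster means are obtained:

      import numpy as np
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(2)
      # three made-up groups of "patients" described by four physical parameters
      X = np.vstack([rng.normal(0, 1, (30, 4)),
                     rng.normal(4, 1, (30, 4)),
                     rng.normal(8, 1, (30, 4))])

      kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
      print(kmeans.labels_[:10])       # cluster number assigned to each observation
      print(kmeans.cluster_centers_)   # per-cluster means along each dimension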

    Source: "biometrica.tomsk.ru"

    Classification of objects according to their characteristics

    Cluster analysis is a set of multidimensional statistical methods for classifying objects according to the characteristics that characterize them, dividing a set of objects into homogeneous groups that are similar in defining criteria, and identifying objects of a certain group.

    A cluster is a group of objects identified as a result of cluster analysis on the basis of a given measure of similarity or difference between objects. An object is a specific item of research that needs to be classified. The objects of classification are, as a rule, observations: for example, consumers of products, countries or regions, products, etc.

    Cluster analysis can, however, also be carried out on variables. In multidimensional cluster analysis, objects are classified according to several criteria simultaneously. These can be both quantitative and categorical variables, depending on the method of cluster analysis. So, the main goal of cluster analysis is to find groups of similar objects in a sample.

    The set of multivariate statistical methods of cluster analysis can be divided into hierarchical methods (agglomerative and divisive) and non-hierarchical (k-means method, two-stage cluster analysis).

    However, there is no generally accepted classification of methods, and cluster analysis methods sometimes also include methods for constructing decision trees, neural networks, discriminant analysis, and logistic regression.

    The scope of use of cluster analysis, due to its versatility, is very wide. Cluster analysis is used in economics, marketing, archeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, sociology and other fields.

    Here are some examples of using cluster analysis:

    • medicine – classification of diseases, their symptoms, treatment methods, classification of patient groups;
    • marketing – tasks of optimizing the company’s product line, segmenting the market by groups of goods or consumers, identifying potential consumers;
    • sociology – dividing respondents into homogeneous groups;
    • psychiatry – correct diagnosis of groups of symptoms is decisive for successful therapy;
    • biology - classification of organisms by group;
    • economics – classification of subjects of the Russian Federation according to investment attractiveness.

    Source: "statmethods.ru"

    Understanding Cluster Analysis

    Cluster analysis includes a set of different classification algorithms. A common question asked by researchers in many fields is how to organize observed data into visual structures.

    For example, biologists aim to classify animals into different species in order to meaningfully describe the differences between them.

    The task of cluster analysis is to divide the original set of objects into groups of similar objects that are close to each other. These groups are called clusters.

    In other words, cluster analysis is one of the ways to classify objects according to their characteristics. It is desirable that the classification results have a meaningful interpretation.

    The results obtained by cluster analysis methods are used in a variety of fields:

    1. In marketing, this is the segmentation of competitors and consumers.
    2. In psychiatry, the correct diagnosis of symptoms such as paranoia, schizophrenia, etc. is decisive for successful therapy.
    3. In management, it is important to classify suppliers and identify similar production situations in which defects occur.
    4. In sociology, the division of respondents into homogeneous groups.
    5. In portfolio investing, it is important to group securities by similarity in return trends in order to create, based on information obtained about the stock market, an optimal investment portfolio that allows you to maximize investment returns at a given degree of risk.

    In fact, cluster analysis has proven itself well in all spheres of human life. In general, whenever it is necessary to classify a large amount of information of this kind and present it in a form suitable for further processing, cluster analysis turns out to be very useful and effective.

    Cluster analysis allows you to consider a fairly large amount of information and to greatly compress large arrays of socio-economic information, making them compact and visual.

    Cluster analysis is of great importance in relation to sets of time series characterizing economic development (for example, general economic and commodity conditions).

    Here you can identify periods when the values of the corresponding indicators were quite close, and also determine groups of time series whose dynamics are most similar. In problems of socio-economic forecasting, combining cluster analysis with other quantitative methods (for example, regression analysis) is very promising.

    Advantages and disadvantages

    Cluster analysis allows for an objective classification of any objects that are characterized by a number of features. There are a number of benefits that can be derived from this:

    • The resulting clusters can be interpreted, that is, they can describe what groups actually exist.
    • Individual clusters can be discarded. This is useful in cases where certain errors were made during data collection, as a result of which the values of indicators for individual objects deviate sharply. When applying cluster analysis, such objects fall into a separate cluster.
    • Only those clusters that have the characteristics of interest can be selected for further analysis.

    Like any other method, cluster analysis has certain disadvantages and limitations. In particular:

    1. the composition and number of clusters depends on the selected partition criteria,
    2. when reducing the original data array to a more compact form, certain distortions may occur,
    3. the individual features of particular objects may be lost when they are replaced by the generalized values of the cluster parameters.

    Methods

    Currently, more than a hundred different clustering algorithms are known. Their diversity is explained not only by different computational methods, but also by different concepts underlying clustering. Recommendations for choosing one or another clustering method can be given only in the most general terms, and the main selection criterion is the practical usefulness of the result.

    The Statistica package implements the following clustering methods:

    • Hierarchical algorithms – tree clustering. Hierarchical algorithms are based on the idea of sequential clustering. At the initial step, each object is considered as a separate cluster. At each subsequent step, the clusters closest to each other are combined into a single cluster.
    • K-means method. This method is used most often. It belongs to the group of so-called reference methods of cluster analysis. The number of clusters K is specified by the user.
    • Two-way joining. When using this method, clustering is carried out simultaneously both by variables (columns) and by observations (rows).

    The two-way joining procedure is used in cases where simultaneous clustering across variables and observations can be expected to produce meaningful results.

    The results of the procedure are descriptive statistics for the variables and observations, as well as a two-dimensional color chart in which the data values are color coded. From the color distribution, you can get an idea of the homogeneous groups.

    Normalization of variables

    Partitioning the initial set of objects into clusters involves calculating distances between objects and selecting objects whose distance is the smallest of all possible. The most commonly used is the Euclidean (geometric) distance that is familiar to all of us. This metric corresponds to intuitive ideas about the proximity of objects in space (as if the distances between objects were measured with a tape measure).

    But for a given metric, the distance between objects can be greatly affected by changes in scales (units of measurement). For example, if one of the features is measured in millimeters and then its value is converted to centimeters, the Euclidean distance between objects will change greatly. This will lead to the fact that the results of cluster analysis may differ significantly from previous ones.

    If variables are measured in different units of measurement, then their preliminary normalization is required, that is, a transformation of the original data that converts them into dimensionless quantities.

    Normalization greatly distorts the geometry of the original space, which can change the clustering results. In the Statistica package, normalization of any variable x is performed using the formula:

    z = (x − x̄) / σ,

    where x̄ is the mean and σ is the standard deviation of x. To do this, right-click on the variable name and select the following sequence of commands in the menu that opens: Fill/ Standardize Block/ Standardize Columns. The mean of the normalized variable becomes equal to zero, and its variance becomes equal to one.
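    Outside of Statistica, the same standardization is a one-line z-score transform; here is a small sketch in Python with made-up values:

      import numpy as np

      x = np.array([150.0, 165.0, 172.0, 180.0, 190.0])   # e.g. heights in centimeters
      z = (x - x.mean()) / x.std(ddof=1)                   # subtract the mean, divide by the standard deviation
      print(z.mean(), z.var(ddof=1))                       # close to 0 and 1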

    K-means method in Statistica program

    The K-means method divides a set of objects into a given number K of different clusters located at the greatest possible distances from each other. Typically, once the results of a K-means cluster analysis are obtained, the means for each cluster along each dimension can be calculated to assess how different the clusters are from each other.

    Ideally, you should obtain widely varying means for most of the measurements used in the analysis. The F-statistic values obtained for each dimension are another indicator of how well the corresponding dimension discriminates between clusters.

    As an example, consider the results of a survey of 17 employees of an enterprise on satisfaction with indicators of the quality of their career. The table provides answers to the survey questions on a ten-point scale (1 is the minimum score, 10 is the maximum).

    The variable names correspond to the answers to the following questions:

    1. SLC – a combination of personal goals and organizational goals;
    2. OSO – sense of fairness in remuneration;
    3. TBD - territorial proximity to home;
    4. OEB – sense of economic well-being;
    5. KR – career growth;
    6. JSR – desire to change jobs;
    7. RSD – sense of social well-being.


    Using this data, it is necessary to divide employees into groups and identify the most effective management levers for each of them. At the same time, differences between groups should be obvious, and within the group respondents should be as similar as possible.

    Today, most sociological surveys report only the percentage of votes: the number of those who answered positively, or the share of those dissatisfied, is counted, but the issue is not examined systematically. Most often, the survey does not show the trend of the situation.

    Cluster analysis procedures make it possible to identify, on the basis of survey data, relationships between characteristics that actually exist, and to build a typology on this basis. A sociologist does not need any a priori hypotheses in order to apply cluster analysis procedures.

    In Statistica, cluster analysis is performed as follows.

    1. Create a data file.
    2. Select the module Statistics/ Multivariate Exploratory Techniques/ Cluster Analysis. Click OK, and a dialog box will appear:

    3. In the window that appears, select the K-means clustering method and click OK.
    4. In the dialog box that appears, you need to set the following settings:


      • Select variables using the Variables button.
      • Select clustering objects: these can be variables – columns (Variables (columns)), or observations – rows (Cases (rows)). First, let's cluster by rows (Cases (rows)).
      • Select the number of clusters.
        This choice is made by the user based on his own assumptions about the number of groups of similar objects.

        When choosing the number of clusters, be guided by the following:

        1. The number of clusters, if possible, should not be too large.
        2. The distance at which the objects of a given cluster were combined should, if possible, be much less than the distance at which something else joins this cluster.

        When choosing the number of clusters, most often there are several correct solutions at the same time. We are interested, for example, in how the answers to the survey questions compare between ordinary employees and the management of the enterprise. Therefore we choose K = 2. For further segmentation, the number of clusters can be increased.
      • Next, you need to select the initial division of objects into clusters (Initial cluster centers). The Statistica package offers:
        1. select observations with the maximum distance between cluster centers;
        2. sort distances and select observations at regular intervals (default setting);
        3. take the first observations as centers and attach the remaining objects to them.

        The first option is suitable for our purposes.

    Many clustering algorithms often “impose” an unnatural structure on the data and disorient the researcher. Therefore, it is extremely necessary to apply several cluster analysis algorithms and draw conclusions based on an overall assessment of the results of the algorithms.

    The analysis results can be viewed in the dialog box that appears:

    If you select the Graph of means tab, a graph of the coordinates of the cluster centers will be built:


    Each broken line in this graph corresponds to one of the clusters:

    • Each division on the horizontal axis of the graph corresponds to one of the variables included in the analysis.
    • The vertical axis corresponds to the average values of the variables for the objects included in each of the clusters.

    It can be noted that there are significant differences in the attitude of the two groups of people to their careers on almost all issues. There is complete unanimity on only one issue – the sense of social well-being (RSD), or rather, the lack thereof (2.5 points out of 10).
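    The same "graph of means" can be reproduced outside Statistica; here is a sketch in Python (the survey table is simulated with random answers, and the column names are the hypothetical variable codes listed above):

      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt
      from sklearn.cluster import KMeans

      cols = ["SLC", "OSO", "TBD", "OEB", "KR", "JSR", "RSD"]
      rng = np.random.default_rng(3)
      survey = pd.DataFrame(rng.integers(1, 11, size=(17, 7)), columns=cols)  # simulated answers, 1..10

      labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(survey)
      means = survey.groupby(labels).mean()   # average value of every variable inside each cluster

      means.T.plot(marker="o")                # one broken line per cluster, variables on the horizontal axis
      plt.ylabel("cluster mean (points out of 10)")
      plt.show()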

    It can be assumed that:

    1. Cluster 1 represents the workers,
    2. cluster 2 – the management:
      • Managers are more satisfied with career growth (KR) and with the combination of personal goals and organizational goals (SLC).
      • They have a higher level of perceived economic well-being (OEB) and perceived fairness of remuneration (OSO).
      • They are less concerned about territorial proximity to home (TBD) than the workers, probably because they have fewer problems with transport.
      • Managers also have less desire to change jobs (JSR).

    Although the employees are divided into two categories, they answer most questions in roughly the same way. In other words, if something does not suit the general group of employees, it does not suit senior management either, and vice versa.

    The agreement between the two curves allows us to conclude that the well-being of one group is reflected in the well-being of the other.

    Cluster 1 is not satisfied with the territorial proximity to home. This group consists mainly of workers who come to the enterprise from different parts of the city. So it makes sense to propose that top management allocate part of the profit to building housing for the company’s employees.

    There are significant differences in the attitude of the two groups of people to their careers:

    1. Those employees who are satisfied with career growth, who have a high coincidence of personal goals and the goals of the organization, do not have the desire to change jobs and feel satisfied with the results of their work.
    2. Conversely, employees who want to change jobs and are dissatisfied with the results of their work are not satisfied with the stated indicators.

    Senior management should pay special attention to the current situation.

    The results of variance analysis for each characteristic are displayed by clicking the Analysis of variance button:

    Outputs:

    • sum of squared deviations of objects from cluster centers (SS Within),
    • sum of squared deviations between cluster centers (SS Between),
    • F-statistic values,
    • significance levels p.
    For our example, the significance levels for two of the variables are quite large, which is explained by the small number of observations. In the full version of the study, described in the original work, the hypothesis of equal means for the cluster centers is rejected at significance levels below 0.01.
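    The same per-variable check can be done with a one-way ANOVA for each characteristic; a self-contained sketch in Python (again on simulated survey answers):

      import numpy as np
      import pandas as pd
      from scipy import stats
      from sklearn.cluster import KMeans

      cols = ["SLC", "OSO", "TBD", "OEB", "KR", "JSR", "RSD"]
      rng = np.random.default_rng(3)
      survey = pd.DataFrame(rng.integers(1, 11, size=(17, 7)), columns=cols)  # simulated answers
      labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(survey)

      for col in cols:
          f, p = stats.f_oneway(survey.loc[labels == 0, col], survey.loc[labels == 1, col])
          print(f"{col}: F = {f:.2f}, p = {p:.3f}")   # large F (small p) means the variable separates the clusters well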

    The Save classifications and distances button displays the numbers of objects included in each cluster and the distances of objects to the center of each cluster.

    The composition of each cluster and the distance of objects from the center

    The table shows the observation numbers (CASE_NO), the cluster to which each observation belongs (CLUSTER), and the distance of each observation from the center of its cluster (DISTANCE).

    Information about objects belonging to clusters can be written to a file and used in further analysis. In this example, a comparison of the results obtained with the questionnaires showed that cluster 1 consists mainly of ordinary workers, and cluster 2 of managers.
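    The same CASE_NO / CLUSTER / DISTANCE table is easy to build by hand; here is a sketch in Python using the simulated survey data from the sketches above:

      import numpy as np
      import pandas as pd
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(3)
      survey = pd.DataFrame(rng.integers(1, 11, size=(17, 7)),
                            columns=["SLC", "OSO", "TBD", "OEB", "KR", "JSR", "RSD"])

      kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(survey)
      dist = kmeans.transform(survey)           # distance of every object to every cluster center
      table = pd.DataFrame({
          "CASE_NO": np.arange(len(survey)) + 1,
          "CLUSTER": kmeans.labels_ + 1,
          "DISTANCE": dist[np.arange(len(survey)), kmeans.labels_],
      })
      table.to_csv("cluster_membership.csv", index=False)   # can be reused in further analysis
      print(table.head())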

    Thus, it can be noted that when processing the results of the survey, cluster analysis turned out to be a powerful method that allows us to draw conclusions that cannot be reached by constructing a histogram of averages or calculating the percentage of people satisfied with various indicators of the quality of working life.

    Tree clustering is an example of a hierarchical algorithm, the principle of which is to sequentially combine into a cluster, first the closest, and then increasingly distant elements from each other. Most of these algorithms start from a similarity (distance) matrix, and each individual element is first considered as a separate cluster.

    After loading the cluster analysis module and selecting Joining (tree clustering), in the window for entering clustering parameters you can change the following parameters:

    1. Initial data (Input). They can be in the form of a matrix of the data under study (Raw data) or in the form of a distance matrix (Distance matrix).
    2. Clustering of observations (Cases (rows)) or variables (Variables (columns)) describing the state of an object.
    3. Distance measure. Here you can select the following measures:
      • Euclidean distances,
      • Squared Euclidean distances,
      • City-block (Manhattan) distance,
      • Chebychev distance metric,
      • Power distance,
      • Percent disagreement.
    4. Clustering method (Amalgamation (linkage) rule).
      The following options are possible here:
      • single link (nearest neighbor method) (Single Linkage),
      • complete linkage (most distant neighbors method),
      • unweighted pair-group average,
      • weighted pair-group average,
      • unweighted centroid method (Unweighted pair-group centroid),
      • weighted pair-group centroid (median) method,
      • Ward's method.

    As a result of clustering, a horizontal or vertical dendrogram is constructed - a graph on which the distances between objects and clusters are determined when they are sequentially combined.

    The tree structure of the graph allows you to define clusters depending on the selected threshold - a specified distance between clusters.
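    A short sketch in Python (on synthetic data) shows both steps: building the dendrogram and cutting it at a chosen threshold distance to read off the clusters:

      import numpy as np
      import matplotlib.pyplot as plt
      from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

      rng = np.random.default_rng(4)
      X = np.vstack([rng.normal(0, 1, (8, 2)), rng.normal(6, 1, (8, 2))])  # synthetic objects

      Z = linkage(X, method="complete")
      dendrogram(Z)                       # the vertical axis shows the merge distances
      plt.show()

      labels = fcluster(Z, t=4.0, criterion="distance")  # cut the tree at threshold distance 4
      print(labels)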

    In addition, a matrix of distances between the original objects (Distance matrix) is displayed, as well as the means and standard deviations for each original object (Descriptive statistics). For the example considered, we will perform a cluster analysis of the variables with the default settings. The resulting dendrogram is shown in the figure:


    The vertical axis of the dendrogram shows the distances between objects and between objects and clusters. Thus, the distance between the variables OEB and OSO is five. At the first step, these variables are combined into one cluster.

    Horizontal segments of the dendrogram are drawn at levels corresponding to the threshold distance values selected for a given clustering step.

    The graph shows that the question “desire to change jobs” (JSR) forms a separate cluster. Apparently, the desire to leave for somewhere else visits everyone to roughly the same degree. The next separate cluster is the question of territorial proximity to home (TBD).

    In terms of importance, it is in second place, which confirms the conclusion about the need for housing construction made based on the results of the study using the K-means method.

    The sense of economic well-being (OEB) and the sense of fairness in remuneration (OSO) are combined – this is the block of economic issues. Career growth (KR) and the combination of personal and organizational goals (SLC) are also combined.

    Other clustering methods, as well as the choice of other types of distances, do not lead to a significant change in the dendrogram.

    Results

    1. Cluster analysis is a powerful tool for exploratory data analysis and statistical research in any subject area.
    2. The Statistica program implements both hierarchical and structural methods of cluster analysis. The advantages of this statistical package stem from its graphical capabilities. Two-dimensional and three-dimensional graphical displays of the resulting clusters in the space of the variables under study are provided, as well as the results of the hierarchical procedure for grouping objects.
    3. It is necessary to apply several cluster analysis algorithms and draw conclusions based on an overall assessment of the results of the algorithms.
    4. Cluster analysis can be considered successful if it has been carried out in different ways, the results have been compared and common patterns have been found, and stable clusters have been discovered regardless of the clustering method.
    5. Cluster analysis allows you to identify problem situations and outline ways to solve them. Consequently, this method of nonparametric statistics can be considered as an integral part of system analysis.