
Hierarchical clustering

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: agglomerative ("bottom-up") approaches, in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy, and divisive ("top-down") approaches, in which all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.

The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of O(n³) and requires O(n²) memory, which makes it too slow for even medium-sized data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity O(n²)) are known: SLINK for single-linkage and CLINK for complete-linkage clustering. With a heap, the runtime of the general case can be reduced to O(n² log n) at the cost of further increasing the memory requirements. In many programming languages, the memory overheads of this approach are too large to make it practically usable. Except for the special case of single-linkage, none of the algorithms (other than exhaustive search in O(2ⁿ)) can be guaranteed to find the optimum solution. Divisive clustering with an exhaustive search is O(2ⁿ), but it is common to use faster heuristics to choose splits, such as k-means.

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by the use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. The choice of metric influences the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (1,0) and the origin (0,0) is always 1 according to the usual norms, but the distance between the point (1,1) and the origin (0,0) can be 2 under Manhattan distance, √2 under Euclidean distance, or 1 under maximum distance. Commonly used metrics for hierarchical clustering include the Euclidean, Manhattan, and maximum distances just mentioned; for text or other non-numeric data, metrics such as the Hamming distance or Levenshtein distance are often used.
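To make the agglomerative case concrete, here is a minimal sketch using SciPy's scipy.cluster.hierarchy routines; the library, the toy data, and the choice of two flat clusters are assumptions for illustration, not part of the text above. It first reproduces the (1,1)-versus-origin distance comparison, then builds a single-linkage merge tree from pairwise Euclidean distances and cuts it into two clusters.

```python
# A minimal sketch of hierarchical agglomerative clustering with SciPy
# (the library choice is an assumption; the toy data and the cluster count
# are hypothetical and purely illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# The metric example from the text: distance from (1, 1) to the origin is
# 2 under Manhattan, sqrt(2) under Euclidean, and 1 under maximum distance.
p = np.array([[0.0, 0.0], [1.0, 1.0]])
for metric in ("cityblock", "euclidean", "chebyshev"):
    print(metric, pdist(p, metric=metric)[0])   # -> 2.0, 1.414..., 1.0

# Two loose groups of 2-D observations.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# pdist applies the chosen metric to pairs of observations; linkage then
# greedily merges clusters under the given linkage criterion ('single'
# corresponds to SLINK-style single-linkage, 'complete' to CLINK-style
# complete-linkage).
dists = pdist(X, metric="euclidean")
Z = linkage(dists, method="single")              # the merge tree behind a dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 flat clusters
print(labels)                                    # e.g. [1 1 1 2 2 2]
```

Swapping the metric argument changes the pairwise distances, and swapping the method argument changes the linkage criterion, which are exactly the two choices the section describes.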

[ "Cluster analysis", "Cluster (physics)", "Single Linkage", "agglomerative hierarchical clustering", "hierarchical cluster tree", "Tissue cluster", "Ward's method" ]