Dasgupta's objective

Concept in hierarchical clustering

In the study of hierarchical clustering, Dasgupta's objective is a measure of the quality of a clustering, defined from a similarity measure on the elements to be clustered. It is named after Sanjoy Dasgupta, who formulated it in 2016. Its key property is that, when the similarity comes from an ultrametric space, the optimal clustering for this quality measure follows the underlying structure of the ultrametric space. In this sense, clustering methods that produce good clusterings for this objective can be expected to approximate the ground truth underlying the given similarity measure.

In Dasgupta's formulation, the input to a clustering problem consists of similarity scores between certain pairs of elements, represented as an undirected graph G = (V, E), with the elements as its vertices and with non-negative real weights on its edges. Large weights indicate elements that should be considered more similar to each other, while small weights or missing edges indicate pairs of elements that are not similar. A hierarchical clustering can be described as a tree (not necessarily a binary tree) whose leaves are the elements to be clustered; the clusters are then the subsets of elements descending from each tree node, and the size |C| of any cluster C is its number of elements. For each edge uv of the input graph, let w(uv) denote the weight of edge uv and let C(uv) denote the smallest cluster of a given clustering that contains both u and v. Then Dasgupta defines the cost of a clustering to be

\sum_{uv \in E} w(uv) \cdot |C(uv)|.
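A minimal sketch of evaluating this cost, assuming the similarity graph is given as a dict mapping edge tuples to weights and the hierarchy as a nested tuple whose leaves are the elements (the data representation and function names here are illustrative, not from the paper):

```python
def leaves(tree):
    """Set of elements at the leaves of a (sub)tree."""
    if not isinstance(tree, tuple):          # any non-tuple is a leaf
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def smallest_cluster_size(tree, u, v):
    """|C(uv)|: descend while some child still contains both endpoints;
    the last node containing both is the smallest cluster."""
    if isinstance(tree, tuple):
        for child in tree:
            s = leaves(child)
            if u in s and v in s:
                return smallest_cluster_size(child, u, v)
    return len(leaves(tree))

def dasgupta_cost(tree, weights):
    """Sum of w(uv) * |C(uv)| over all weighted edges uv."""
    return sum(w * smallest_cluster_size(tree, u, v)
               for (u, v), w in weights.items())

# Example: path graph a-b-c with w(ab) = 2, w(bc) = 1. The hierarchy
# ((a, b), c) pays 2*2 + 1*3 = 7, while (a, (b, c)) pays 2*3 + 1*2 = 8,
# so the objective prefers merging the more similar pair lower in the tree.
weights = {("a", "b"): 2.0, ("b", "c"): 1.0}
print(dasgupta_cost((("a", "b"), "c"), weights))   # 7.0
print(dasgupta_cost(("a", ("b", "c")), weights))   # 8.0
```

As the example illustrates, heavy edges are cheapest when their endpoints are merged in small clusters near the bottom of the tree, which is why minimizing the cost rewards hierarchies that group similar elements first.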

The optimal clustering for this objective is NP-hard to find. However, it is possible to find a clustering that approximates the minimum value of the objective in polynomial time by a divisive (top-down) clustering algorithm that repeatedly subdivides the elements using an approximation algorithm for the sparsest cut problem, the problem of finding a partition that minimizes the ratio of the total weight of cut edges to the total number of cut pairs. Equivalently, for purposes of approximation, one may minimize the ratio of the total weight of cut edges to the number of elements on the smaller side of the cut. Using the best known approximation for the sparsest cut problem, the approximation ratio of this approach is O(\sqrt{\log n}).
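A minimal sketch of this divisive scheme, assuming inputs small enough to find the sparsest cut by brute force; a real implementation would substitute a polynomial-time approximation such as the Arora–Rao–Vazirani algorithm for the exhaustive search below:

```python
from itertools import combinations

def sparsest_cut(vertices, weights):
    """Bipartition (S, V \\ S) minimizing cut weight divided by the
    size of the smaller side (the ratio used for approximation)."""
    vs = list(vertices)
    best, best_ratio = None, float("inf")
    for r in range(1, len(vs) // 2 + 1):
        for side in combinations(vs, r):
            s = set(side)
            cut = sum(w for (u, v), w in weights.items()
                      if (u in s) != (v in s))
            ratio = cut / len(s)              # s is the smaller side
            if ratio < best_ratio:
                best, best_ratio = s, ratio
    return best

def divisive_cluster(vertices, weights):
    """Recursively split by the sparsest cut, producing a nested-tuple
    hierarchy compatible with dasgupta_cost above."""
    vs = set(vertices)
    if len(vs) == 1:
        return next(iter(vs))                 # a single element is a leaf
    s = sparsest_cut(vs, weights)
    kept = {e: w for e, w in weights.items()  # drop the cut edges
            if (e[0] in s) == (e[1] in s)}
    return (divisive_cluster(s, kept), divisive_cluster(vs - s, kept))

# On the path graph from the previous example, the first split cuts the
# lighter edge bc, recovering (up to child order) the optimal tree
# ((a, b), c) with cost 7.
print(divisive_cluster({"a", "b", "c"},
                       {("a", "b"): 2.0, ("b", "c"): 1.0}))
```

The brute-force search takes exponential time and is only a stand-in: replacing it with an O(\sqrt{\log n})-approximate sparsest cut is what yields the polynomial-time O(\sqrt{\log n}) approximation for Dasgupta's objective stated above.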

References

  1. Dasgupta, Sanjoy (2016), "A cost function for similarity-based hierarchical clustering", Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2016), New York, New York: ACM, pp. 118–127, arXiv:1510.05043, doi:10.1145/2897518.2897527, MR 3536559, S2CID 2262168
  2. Cohen-Addad, Vincent; Kanade, Varun; Mallmann-Trenn, Frederik; Mathieu, Claire (2018), "Hierarchical clustering: objective functions and algorithms", Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2018), Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics, pp. 378–397, arXiv:1704.02147, doi:10.1137/1.9781611975031.26, MR 3775814
  3. Arora, Sanjeev; Rao, Satish; Vazirani, Umesh (2009), "Expander flows, geometric embeddings and graph partitioning", Journal of the ACM, 56 (2): A5:1–A5:37, doi:10.1145/1502793.1502794, MR 2535878, S2CID 263871111
