A Beyond-Worst-Case Analysis of Greedy k-means++

#1 A Beyond-Worst-Case Analysis of Greedy k-means++ [PDF] [Copy] [Kimi] [REL]

Authors: Qingyun Chen, Sungjin Im, Benjamin Moseley, Ryan Milstrey, Chenyang Xu, Ruilong Zhang

$k$-means++ and the related greedy $k$-means++ algorithm are celebrated algorithms that efficiently compute seeds for Lloyd's algorithm. Greedy $k$-means++ is a generalization of $k$-means++ where, in each iteration, a new seed is greedily chosen among multiple $\ell \geq 2$ points sampled, as opposed to a single seed being sampled in $k$-means++. While empirical studies consistently show the superior performance of greedy $k$-means++, making it a preferred method in practice, a discrepancy exists between theory and practice. No theoretical justification currently explains this improved performance. Indeed, the prevailing theory suggests that greedy $k$-means++ exhibits worse performance than $k$-means++ in worst-case scenarios. This paper presents an analysis demonstrating the outperformance of the greedy algorithm compared to $k$-means++ for a natural class of well-separated instances with exponentially decaying distributions, such as Gaussian, specifically when $\ell = \Theta(\log k)$, a common parameter setting in practical applications.

Subject: NeurIPS.2025 - Poster

2ym6uVbgMN@OpenReview

#1 A Beyond-Worst-Case Analysis of Greedy k-means++ [PDF] [Copy] [Kimi] [REL]