kmeans++ for Careful Seeding
Posted on
This note is for Arthur, D., & Vassilvitskii, S. (2006). k-means++: The Advantages of Careful Seeding. Stanford, 11.
By augmenting k-means with a simple, randomized seeding technique, they obtain an algorithm is $\Theta(\log k)$-competitive with the optimal clustering
k-means formulation: given an integer $k$ and a set of $n$ data points in $\IR^d$. The goal is to choose $k$ centers so as to minimize $\phi$, the sum of the squared distances between each point and its closest center
\[\phi = \sum_{x\in \cX} \min_{x\in \cC} \Vert x - c\Vert^2\]Llyod’s algorithm begins with $k$ arbitrary centers, typically chosen uniformly at random from the data points
this paper proposes a way of initializing k-means by choosing random starting centers with very specific probabilities
choose a point $p$ as a center with probability proportional to $p$’s contribution to the overall potential