kmeans++ for Careful Seeding

Posted on Jan 28, 2025

Tags: K-means

This note is for Arthur, D., & Vassilvitskii, S. (2006). k-means++: The Advantages of Careful Seeding. Stanford, 11.

By augmenting k-means with a simple, randomized seeding technique, they obtain an algorithm is $\Theta(\log k)$-competitive with the optimal clustering

k-means formulation: given an integer $k$ and a set of $n$ data points in $\IR^d$. The goal is to choose $k$ centers so as to minimize $\phi$, the sum of the squared distances between each point and its closest center

\[\phi = \sum_{x\in \cX} \min_{x\in \cC} \Vert x - c\Vert^2\]

Llyod’s algorithm begins with $k$ arbitrary centers, typically chosen uniformly at random from the data points

this paper proposes a way of initializing k-means by choosing random starting centers with very specific probabilities

choose a point $p$ as a center with probability proportional to $p$’s contribution to the overall potential

Published in categories

← previous next →

See all posts →

WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

kmeans++ for Careful Seeding

Posted on Jan 28, 2025