Title: Streaming k-means Approximation.
Speaker: Ragesh Jaiswal, Columbia
Abstract:
We extend results of Arthur and Vassilvitskii (k-means++: the advantages of careful seeding, SODA 2007)
to provide a pseudo-approximation algorithm for the k-means clustering problem
in the batch (non-streaming) setting. We then use this algorithm in
an extended version of the divide-and-conquer strategy by Guha et al. to
obtain a streaming algorithm for the k-means problem with good
approximation/memory tradeoffs. More specifically, we first extend
the k-means++ algorithm, due to Arthur and Vassilvitskii, which is a
randomized algorithm that gives an O(log k) approximation to the
k-means objective in the batch setting. Our new algorithm is shown
to give a constant-factor approximation to the k-means objective, and
it outputs O(k log k) centers. We then provide a variant of Guha et al.'s
algorithm for k-means in the one-pass (streaming) setting, resulting
in an efficient algorithm which outputs exactly k centers and gives a
tradeoff of O(n^a) memory for an O(c^(1/a) log k) approximation for
any constant a<1 and some fixed constant c>1. Empirical evaluations
on real and simulated data reveal the practical utility of our method.
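The D^2 ("squared-distance") seeding at the heart of k-means++ can be sketched as follows. This is an illustrative Python sketch under my own naming, not the speakers' code: it picks the first center uniformly at random, then samples each subsequent center with probability proportional to its squared distance from the nearest center chosen so far. (The talk's batch algorithm samples O(k log k) centers this way rather than exactly k; the sketch takes the number of centers as a parameter.)

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d2_sample_centers(points, num_centers, seed=0):
    """D^2 (k-means++-style) seeding: an illustrative sketch.

    Picks the first center uniformly at random; each later center is a
    data point sampled with probability proportional to its squared
    distance from the nearest center already chosen.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    for _ in range(num_centers - 1):
        total = sum(d2)
        if total == 0:  # every point coincides with some chosen center
            break
        # Sample index i with probability d2[i] / total
        r = rng.uniform(0, total)
        acc = 0.0
        for i, w in enumerate(d2):
            acc += w
            if acc >= r:
                break
        centers.append(points[i])
        # Update nearest-center distances given the new center
        d2 = [min(old, sq_dist(p, points[i])) for old, p in zip(d2, points)]
    return centers
```

On well-separated data the sampled centers tend to land in distinct clusters, which is what drives the O(log k) approximation guarantee of Arthur and Vassilvitskii in the batch setting.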
I will also talk about a quantitatively improved version of our first
result obtained independently by Aggarwal, Deshpande, and Kannan
(Adaptive Sampling for k-means Clustering, APPROX 2009).
This is joint work with Claire Monteleoni (CCLS, Columbia) and Nir
Ailon (Google). This work appeared in NIPS 2009.