The curse of dimensionality has traditionally been the bane of
nonparametric statistics (histograms, kernel density estimation, nearest
neighbor search, and so on), as reflected in running times and convergence
rates that degrade exponentially with dimension. This problem is all the
more pressing as data sets become increasingly high-dimensional. Recently
the field has been rejuvenated in several ways, perhaps the most promising
of which is the realization that much real-world data that appears
high-dimensional in fact has low "intrinsic" dimension, in the sense of
lying close to a low-dimensional manifold. In the past few years, there
has been great interest in learning such manifolds from data, and then
using the learned structure to transform the data into a lower-dimensional
space where standard statistical methods generically work better. I'll
exhibit a way to benefit from intrinsic low dimensionality without having
to learn the manifold's fine structure explicitly.
Specifically, I'll present a simple variant of the ubiquitous k-d tree (a
spatial data structure widely used in machine learning and statistics)
that is provably adaptive to low-dimensional structure. We call this a
"random projection tree" (RP tree). Along the way, I'll discuss different
notions of intrinsic dimension -- motivated by manifolds, by local
statistics, and by analysis on metric spaces -- and relate them. I'll then
prove that RP trees require resources that depend only on these intrinsic
dimensions, rather than on the dimension of the ambient space in which the
data happens to be situated.
This is work with Yoav Freund (UC San Diego).
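
To make the splitting rule concrete, here is a minimal Python sketch under
simplifying assumptions: each cell is split at the median of the points'
projections onto a random unit direction, whereas the RP tree analyzed in
the talk perturbs the split point randomly (and one variant sometimes
splits by distance from the cell's mean) to obtain its guarantees. The
names RPTree and leaf_points are illustrative, not from the paper.

    import numpy as np

    class RPTree:
        """Illustrative random projection tree (names are hypothetical)."""

        def __init__(self, points, leaf_size=10, rng=None):
            rng = np.random.default_rng() if rng is None else rng
            self.points = points
            self.direction = self.threshold = None
            self.left = self.right = None
            if len(points) <= leaf_size:
                return  # cell is small enough: leave it as a leaf
            # Unlike a k-d tree, split along a *random* unit direction
            # rather than a coordinate axis.
            v = rng.standard_normal(points.shape[1])
            v /= np.linalg.norm(v)
            proj = points @ v
            t = np.median(proj)  # simplified: the paper perturbs this split point
            left_mask = proj <= t
            if left_mask.all() or not left_mask.any():
                return  # degenerate split (e.g. many duplicates): stop here
            self.direction, self.threshold = v, t
            self.left = RPTree(points[left_mask], leaf_size, rng)
            self.right = RPTree(points[~left_mask], leaf_size, rng)

        def leaf_points(self, x):
            """Return the points stored in the leaf cell containing x."""
            if self.direction is None:
                return self.points
            child = self.left if x @ self.direction <= self.threshold else self.right
            return child.leaf_points(x)

    # Toy data lying near a 1-dimensional curve embedded in R^50.
    rng = np.random.default_rng(0)
    u = rng.uniform(0.0, 1.0, size=(2000, 1))
    data = np.hstack([np.sin(2 * np.pi * u), np.cos(2 * np.pi * u),
                      np.zeros((2000, 48))])
    data += 0.01 * rng.standard_normal(data.shape)
    tree = RPTree(data, leaf_size=20, rng=rng)
    print(len(tree.leaf_points(data[0])))  # size of the cell containing data[0]

Replacing the random direction v with a coordinate axis recovers an
ordinary k-d tree split; the random direction is what lets the tree adapt
to low-dimensional structure that is not axis-aligned.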