Computer Science Colloquium

Gaussian Processes for Active Data Mining of Spatial Aggregates

Naren Ramakrishnan
Virginia Tech

Friday, October 1, 2004 11:30 A.M.
Room 1302 Warren Weaver Hall
251 Mercer Street
New York, NY 10012-1185

Colloquium Information:
Bud Mishra, (212) 998-3464


Data mining has traditionally focused on the task of drawing inferences from large datasets. However, many scientific and engineering domains, such as aircraft design and wireless system simulation, are characterized by scarce data, due to the expense and complexity of associated experiments and simulations. In such data-scarce domains, it is advantageous to focus the data collection effort on only those regions deemed most important to support a particular data mining objective.

In this talk, I will present an active data mining mechanism based on the spatial aggregation language (SAL), a generic framework for spatial data mining, and Gaussian processes (GPs), a powerful unifying theory for approximating and reasoning about datasets. SAL uncovers successive multi-level aggregates of spatial data, and Gaussian processes provide the "glue" that enables us to perform active mining on these aggregates. In particular, they aid in (i) creation of surrogate models from data using a sparse set of samples (for cheap generation of dense approximate datasets), (ii) reasoning about the uncertainty of estimation at unsampled points, and (iii) formulation of objective criteria for active data selection. This approach enables us to design sampling strategies that bridge higher-level quality metrics of spatial structures (e.g., entropy) with lower-level considerations of data samples (e.g., locations and fidelity). Experimental results on synthetic as well as real datasets from wireless system simulations will be presented.
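As a rough illustration of points (i) through (iii), the following sketch fits a Gaussian process surrogate to a few samples and then repeatedly queries the candidate location with the largest predictive variance. It is not the SAL-based mechanism of the talk: the squared-exponential kernel, its hyperparameters, the maximum-variance selection rule, and the stand-in function `expensive_simulation` are all assumptions chosen for brevity.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2, signal_var=1.0):
    """Squared-exponential covariance between 1-D point sets a and b (assumed kernel)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at the query points."""
    k_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    # Solve (K + noise * I) alpha = y using the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = rbf_kernel(x_query, x_query).diagonal() - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off

def next_sample(x_train, y_train, candidates):
    """Pick the candidate with the largest predictive variance (one simple criterion)."""
    _, var = gp_posterior(x_train, y_train, candidates)
    return candidates[np.argmax(var)]

def expensive_simulation(x):
    """Hypothetical stand-in for a costly experiment or simulation."""
    return np.sin(6.0 * x)

# Start from a sparse, clustered set of samples and add five more actively.
x_train = np.array([0.0, 0.1, 0.2])
y_train = expensive_simulation(x_train)
candidates = np.linspace(0.0, 1.0, 101)

for _ in range(5):
    x_new = next_sample(x_train, y_train, candidates)
    x_train = np.append(x_train, x_new)
    y_train = np.append(y_train, expensive_simulation(x_new))

# Dense approximate dataset from the surrogate, with per-point uncertainty.
surrogate_mean, surrogate_var = gp_posterior(x_train, y_train, candidates)
```

Maximum variance is only the simplest possible selection rule. A criterion tied to a higher-level quality metric of the spatial structure, such as the entropy mentioned above, would replace `next_sample` while the surrogate and uncertainty steps stay the same.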

(joint work with Chris Bailey-Kellogg, Dartmouth College)
