Lecture Notes in Computer Science


Supported by

NYU-Poly
NYU-Poly

CATT
CATT

IBM Research
IBM Research


Compression, Indexing, and Retrieval for Massive String Data

Jeffrey S. Vitter
Texas A&M University


Abstract

The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.

Bio

Jeff Vitter is the incoming provost and executive vice chancellor at the University of Kansas. He is currently a professor in the Department of Computer Science and Engineering at Texas A&M University, where he served from 2008 to 2009 as provost and executive vice president for academics. From 2002 to 2008, Dr. Vitter served as the Frederick L. Hovde Dean of Science at Purdue University. From 1993 to 2002, Dr. Vitter held a distinguished professorship at Duke University, as the Gilbert, Louis, and Edward Lehrman Professor of Computer Science. He was department chair from 1993 to 2001 and co-director and a founding member of the Center for Geometric and Biological Computing from 1997 to 2002. From 1980 to 1992, he progressed through the faculty ranks at Brown University. His educational degrees include a BS with highest honors in mathematics from the University of Notre Dame in 1977, a PhD in computer science under Don Knuth from Stanford University in 1980, and an MBA from Duke in 2002. He was born and raised in New Orleans.

Dr. Vitter is a Guggenheim Fellow, an ACM Fellow, an IEEE Fellow, an AAAS Fellow, an NSF Presidential Young Investigator, and a Fulbright Scholar. He has over 280 book, journal, conference, and patent publications, primarily on how to process massive amounts of information. His Google Scholar h-index is 57. His most recent book examines external memory algorithms, a field he helped found. He is co-holder of patents in the areas of external sorting, parallel I/O, prediction, and approximate data structures. He proposed the concept and participated in the design of the Indiana Database for University Research Expertise, www.indure.org. Dr. Vitter serves on the Board of Advisors for the School of Science and Engineering at Tulane University, and from 2000 to 2009 he served on the Board of Directors of the Computing Research Association, where he continues to co-chair the Government Affairs Committee. He has served as chair of ACM SIGACT and on the executive council of the EATCS, as well as on numerous review committees. Sabbatical sites have included MSRI, INRIA (Rocquencourt, France), Ecole Normale Supérieure (Paris), Bell Labs, and INRIA (Sophia Antipolis, France).