Words, links, and patterns: novel representations for Web-scale text mining

Dragomir R. Radev
University of Michigan

Friday, December 10, 2004 11:30 A.M.
Room 1302 Warren Weaver Hall
251 Mercer Street
New York, NY 10012-1185

I. Dan Melamed melaed at cs dot nyu dot edu, (212) 998-3003


Textual data is everywhere: in email and scientific papers, in online newspapers and e-commerce sites. The Web contains more than 200 terabytes of text, not even counting the contents of dynamic textual databases. This enormous source of knowledge is seriously underexploited. Textual documents on the Web are very hard to model computationally: they are mostly unstructured, time-dependent, collectively authored, multilingual, and of uneven importance. Traditional grammar-based techniques don't scale up to address such problems. Novel representations and analytical tools are needed.

I will discuss several current projects at Michigan related to text mining from a variety of genres. Depending on the time available, I will talk about (a) lexical centrality for multidocument summarization, (b) syntax-based sentence alignment, (c) graph-based classification, (d) lexical models of Web growth, and (e) mining protein interactions from scientific papers. As it turns out, the right representations, when complemented with traditional NLP and IR techniques, turn many of these into instances of better-studied problems in areas such as social networks, statistical mechanics, sequence analysis, and computational phylogenetics.

