Computer Science Colloquium
Words, links, and patterns:
novel representations for Web-scale text mining
Dragomir R. Radev
University of Michigan
Friday, December 10, 2004 11:30 A.M.
Room 1302 Warren Weaver Hall
251 Mercer Street
New York, NY 10012-1185
Colloquium Information: http://cs.nyu.edu/csweb/Calendar/colloquium/index.html
I. Dan Melamed, melamed at cs dot nyu dot edu, (212) 998-3003
Textual data is everywhere, in email and scientific papers, in
online newspapers and e-commerce sites. The Web contains more than 200
terabytes of text, not even counting the contents of dynamic textual
databases. This enormous source of knowledge is seriously
underexploited. Textual documents on the Web are very hard to model
computationally: they are mostly unstructured, time-dependent,
collectively authored, multilingual, and of uneven importance.
Traditional grammar-based techniques don't scale up to address such
problems. Novel representations and analytical tools are needed.
I will discuss several current projects at Michigan related to text
mining from a variety of genres. Depending on the time available, I
will talk about (a) lexical centrality for multidocument
summarization, (b) syntax-based sentence alignment, (c) graph-based
classification, (d) lexical models of Web growth, and (e) mining
protein interactions from scientific papers. As it turns out, the
right representations, when complemented with traditional NLP and IR
techniques, turn many of these into instances of better-studied
problems in areas such as social networks, statistical mechanics,
sequence analysis, and computational phylogenetics.