Prof. Melamed's Translation Research
CS Prof. Dan Melamed designs algorithms and writes code that translates
text between human languages. See http://www.cs.nyu.edu/~melamed/interests.html
for more info. He's exploring an approach based on
structured models of translational equivalence. His work is
funded by DARPA.
Proposal: A System for Collecting Parallel Texts from the Web
Melamed's work uses parallel texts--copies of the same document in two
languages--to help develop and test his translation software.
This project involves designing and building a system for collecting
and processing parallel texts retrieved from the Web. The system
will work as follows:
- A spider constantly trawls the Web looking for pairs of documents
that might be parallel texts; an existing spider will be used
- The resulting documents are processed by filtering programs
(which Prof. Melamed already has) that help decide whether the pair is
indeed a parallel text worth saving; for example, one filter decides each
- A database records the results of the previous steps
- A Web site enables people to investigate the progress of the
spider and the filtering; for example, someone could specify two languages
and ask for all parallel texts in those particular languages
Note that the Web trawling and serving loop is very similar to what is
used by "shopping agents" and price-comparison websites.
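The flow described above (candidate pairs from the spider, then filtering, then storage of the survivors) could be sketched roughly as follows. This is only an illustration under stated assumptions: the names CandidatePair, different_languages, length_ratio_filter and run_pipeline are hypothetical, and the two filters shown are simple stand-ins, not Prof. Melamed's actual filtering programs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CandidatePair:
    """A pair of documents the spider suspects are parallel (hypothetical shape)."""
    url_a: str
    lang_a: str
    url_b: str
    lang_b: str
    text_a: str
    text_b: str


# A filter is any function that accepts a pair and returns True to keep it.
Filter = Callable[[CandidatePair], bool]


def different_languages(pair: CandidatePair) -> bool:
    """Stand-in filter: a parallel text needs two distinct languages."""
    return pair.lang_a != pair.lang_b


def length_ratio_filter(pair: CandidatePair, max_ratio: float = 2.0) -> bool:
    """Stand-in filter: reject pairs whose text lengths differ wildly.

    The threshold 2.0 is an arbitrary assumption for illustration.
    """
    len_a, len_b = len(pair.text_a), len(pair.text_b)
    if min(len_a, len_b) == 0:
        return False
    return max(len_a, len_b) / min(len_a, len_b) <= max_ratio


def run_pipeline(candidates: Iterable[CandidatePair],
                 filters: List[Filter]) -> List[CandidatePair]:
    """Keep only the candidate pairs that every filter accepts."""
    return [pair for pair in candidates if all(f(pair) for f in filters)]
```

In a real deployment the accepted pairs would be handed to the database-writing code rather than returned as a list; treating each filter as an independent function is one way to let the existing programs be wrapped and combined without modifying them.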
The system must be able to run on Linux.
The open design issues in this project include:
- Making all the components (spider, filters, DB writing code) work
together in a seamless and efficient manner, either on one computer or
several. The components were not designed with integration in mind.
- Choosing tools for the database and for Web site development
- The database schema
- The design of the output Web site
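To make the schema question concrete, here is one possible starting point, not a proposed answer. It assumes a single table of document pairs and uses SQLite only because it ships with Python and runs on Linux; the choice of database remains open. The table name document_pair, its columns, and the functions open_db, record_pair and pairs_for_languages are all hypothetical.

```python
import sqlite3

# Hypothetical one-table schema: one row per candidate pair, with the
# filtering verdict stored alongside the two URLs and their languages.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document_pair (
    id       INTEGER PRIMARY KEY,
    url_a    TEXT NOT NULL,
    lang_a   TEXT NOT NULL,
    url_b    TEXT NOT NULL,
    lang_b   TEXT NOT NULL,
    accepted INTEGER NOT NULL,
    UNIQUE (url_a, url_b)
);
CREATE INDEX IF NOT EXISTS idx_pair_langs ON document_pair (lang_a, lang_b);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_pair(conn, url_a, lang_a, url_b, lang_b, accepted):
    """Store the outcome of the spider and filter steps for one pair."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO document_pair "
            "(url_a, lang_a, url_b, lang_b, accepted) VALUES (?, ?, ?, ?, ?)",
            (url_a, lang_a, url_b, lang_b, int(accepted)),
        )


def pairs_for_languages(conn, lang_x, lang_y):
    """Return accepted (url_a, url_b) pairs for two languages, in either order."""
    return conn.execute(
        "SELECT url_a, url_b FROM document_pair "
        "WHERE accepted = 1 AND ((lang_a = ? AND lang_b = ?) "
        "OR (lang_a = ? AND lang_b = ?)) ORDER BY id",
        (lang_x, lang_y, lang_y, lang_x),
    ).fetchall()
```

The last function corresponds to the Web site query mentioned earlier, where someone specifies two languages and asks for all parallel texts in them. A fuller design would likely also record per-filter results and timestamps so that the spider's progress can be inspected.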
This is a technically demanding project that Prof. Melamed will
supervise. He's available for meetings on evenings and weekends.
Company Web site
Prof. I. Dan Melamed
Resources that will be made available to students
Access to any relevant data, code and information.