G22.2590 - Natural Language Processing -- Spring 2010 -- Prof.
January 28, 2010
You are to write a regular expression which recognizes expressions
appearing in such written texts as newspapers and blogs which denote a
specific quantity of (US) dollars.
To test your regular expression, prepare a very small corpus in which
every dollar expression is enclosed in brackets.
- it should denote a specific quantity, such as "2 billion dollars"
but not "a few billion dollars"
- the number may be specified by digits ("2 billion dollars") or
words ("two billion dollars")
- expressions explicitly mentioning a specific country's dollars
("US dollars", "Australian dollars") are excluded
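The three criteria above can be sketched as a single pattern. What follows is only an illustration in Python, not the course's test program: the names DOLLAR_RE and find_dollar_expressions are hypothetical, the number-word vocabulary is deliberately small, and the "$" form (as in "$3.5 million") is an assumption that goes beyond the examples given here.

```python
import re

# Assumption: a small, illustrative vocabulary of number words.
ONES = r"(?:one|two|three|four|five|six|seven|eight|nine)"
TEENS = (r"(?:ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|"
         r"seventeen|eighteen|nineteen)")
TENS = r"(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)"
SCALE = r"(?:hundred|thousand|million|billion|trillion)"
WORD_NUM = rf"(?:{TENS}(?:[-\s]{ONES})?|{TEENS}|{ONES})"
# Digits with optional thousands separators and an optional decimal part.
DIGITS = r"(?:\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)"

DOLLAR_RE = re.compile(
    rf"""
    \b(?:{DIGITS}|{WORD_NUM})                   # a specific number, digits or words
    (?:\s+(?:and\s+)?(?:{WORD_NUM}|{SCALE}))*   # further number or scale words
    \s+dollars?\b                               # directly followed by dollar or dollars
    |
    (?<![A-Za-z])\$\s?{DIGITS}                  # assumed symbol form, not US$ or A$
    (?:\s+{SCALE})?
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_dollar_expressions(text):
    """Return every substring of text that the sketch pattern accepts."""
    return [m.group(0) for m in DOLLAR_RE.finditer(text)]


print(find_dollar_expressions("Maharashtra to pay Enron a sum of 30 billion dollars."))
print(find_dollar_expressions("They lost a few billion dollars."))
print(find_dollar_expressions("It cost 2 billion US dollars."))
```

Because the number must be followed directly by "dollars", a country word in between ("US", "Australian") blocks the match, and "a few billion dollars" fails because no specific number starts the expression. Forms such as "a million dollars" or "half a million dollars" are not covered by this sketch.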
For example:
[Two million dollars] is a common salary on Wall Street.
Maharashtra to pay Enron a sum of [30 billion dollars].
Put this corpus in a file named "key". Then insert your regular
expression into our regular expression test program, and compile and
run that program. The program should report precision and recall.
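To make the two measures concrete, here is a minimal sketch in Python of how precision and recall could be computed from a bracketed key. It is not the course's test program: the function names parse_key and score are hypothetical, and it assumes that a match counts as correct only when its span coincides exactly with a bracketed span, that brackets are never nested, and that the text contains no other brackets.

```python
import re


def parse_key(key_text):
    """Strip the brackets; return the plain text and the set of gold (start, end) spans."""
    plain = []
    gold = set()
    pos = 0          # current offset in the bracket-free text
    start = None     # offset where the currently open bracket began
    for ch in key_text:
        if ch == "[":
            start = pos
        elif ch == "]":
            gold.add((start, pos))
            start = None
        else:
            plain.append(ch)
            pos += 1
    return "".join(plain), gold


def score(pattern, key_text):
    """Return (precision, recall) of a compiled pattern against a bracketed key."""
    plain, gold = parse_key(key_text)
    found = {m.span() for m in pattern.finditer(plain)}
    correct = len(found & gold)
    # precision: fraction of matches that are gold spans
    precision = correct / len(found) if found else 0.0
    # recall: fraction of gold spans that were matched
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


key = ("[Two million dollars] is a common salary on Wall Street.\n"
       "Maharashtra to pay Enron a sum of [30 billion dollars].\n")
# A deliberately weak toy pattern: digits only, so it misses "Two million dollars".
toy = re.compile(r"\d+ (?:million|billion) dollars")
print(score(toy, key))  # (1.0, 0.5)
```

The toy pattern finds one expression and that one is correct, so precision is 1.0; it finds one of the two bracketed expressions, so recall is 0.5.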
Submit the program and your test file. Send them as separate
attachments (in a single email) to firstname.lastname@example.org and
email@example.com. We will run your program both on your data and
on our data (culled from recent news stories), so you will be graded in
part on whether you handled the common expressions.
You must prepare the regular expression yourself, but you may exchange
test corpora with fellow students. If you do, mention that
in your submission email.
Due February 4.