Answer a specific type of query over some category. For instance,
"Did person X ever meet person Y?", "Does animal/plant X live in
place Y?", or "What is the best course to take at NYU if you
want to learn about subject Y?"
Some comments about these:
General comment: You need to choose a question that is not too easy to answer
reliably. (For example, as mentioned in the main page, the question,
"What movies did actor X work in?" is too easy to answer from imdb.com.)
Did X ever meet Y?
I am giving my undergraduate AI class a related assignment; the
write-up of that assignment has some additional discussion of this
There are at least three possible answers: "Definitely yes", "Definitely
no", and "Possibly".
- The obvious solution is to do a search on the strings "X met Y", "Y
met X", "X never met Y", "Y never met X". But that
misses most "Yes" answers and almost all "No" answers. Rather, in most
cases one has to infer that X met or didn't meet Y from other information.
For instance: If you can find a page that states that Marie Antoinette was
married to Louis XVI, you can infer that they met. If you can find a page
with George Washington's dates and a page with Julius Caesar's dates,
you can infer that they never met.
Obviously, you are not going to come close to finding all combinations of
text patterns that would
allow a human reader to answer this question. A reasonable project here would
be to have a system in which some number of patterns are encoded, and in which
new patterns can be fairly easily added.
For a project like this, it would be reasonable to restrict yourself to
Wikipedia pages and to people with Wikipedia entries, for example.
- Evaluation is a little tricky. If queries are taken at random from
the space of all people (or even the space of all people with Wikipedia
entries) then a program that always answers "No" will be right 99.99% of
the time. We will discuss this later in the semester.
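The pattern-based design described in the first bullet above can be sketched as a list of small inference rules, each of which either reaches a verdict or abstains. This is only an illustration under stated assumptions: the `Person` record, the rule functions, and the hand-entered dates are hypothetical stand-ins for facts that a real system would have to extract from pages such as Wikipedia entries. Years BC are written as negative numbers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Person:
    # Hypothetical record of facts already extracted from a person's page.
    name: str
    born: Optional[int] = None   # year of birth (negative for BC), if found
    died: Optional[int] = None   # year of death, if found
    spouses: frozenset = frozenset()


# A rule returns "Definitely yes", "Definitely no", or None (no conclusion).
Rule = Callable[[Person, Person], Optional[str]]


def married_rule(x: Person, y: Person) -> Optional[str]:
    """If either page lists the other as a spouse, they must have met."""
    if y.name in x.spouses or x.name in y.spouses:
        return "Definitely yes"
    return None


def lifespan_rule(x: Person, y: Person) -> Optional[str]:
    """If one person died before the other was born, they never met."""
    if None in (x.born, x.died, y.born, y.died):
        return None
    if x.died < y.born or y.died < x.born:
        return "Definitely no"
    return None


# New patterns are added by appending another function to this list.
RULES: List[Rule] = [married_rule, lifespan_rule]


def did_meet(x: Person, y: Person, rules: List[Rule] = RULES) -> str:
    """Return the verdict of the first rule that reaches a conclusion."""
    for rule in rules:
        verdict = rule(x, y)
        if verdict is not None:
            return verdict
    return "Possibly"


marie = Person("Marie Antoinette", 1755, 1793, frozenset({"Louis XVI"}))
louis = Person("Louis XVI", 1754, 1793, frozenset({"Marie Antoinette"}))
washington = Person("George Washington", 1732, 1799)
caesar = Person("Julius Caesar", -100, -44)

print(did_meet(marie, louis))        # Definitely yes
print(did_meet(washington, caesar))  # Definitely no
print(did_meet(washington, marie))   # Possibly
```

Keeping each pattern as a separate function is one way to make new patterns "fairly easily added"; the hard part of the project, extracting the facts from text, is not shown here.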
What is the best course to take at NYU if you want to learn about subject Y?
- You should return a ranked list. For example, if the subject is in the title,
that's really good; if it is in the short course description, that's
less good; if it is somewhere on a course web page, that's even less good.
You might try using WordNet to deal with synonymy.
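One possible way to turn the ranking idea above into code is to give each field a weight and sum the weights of the fields that mention the subject. Everything named here is an assumption made for illustration: the `Course` record, the weights 4/2/1 (chosen so that a title match alone outranks a description match plus a page match), the sample courses, and the hand-written `synonyms` dictionary, which merely stands in for a real WordNet lookup. Matching is by plain substring, which is crude.

```python
from dataclasses import dataclass


@dataclass
class Course:
    # Hypothetical record; a real system would fill these from course pages.
    code: str
    title: str
    description: str
    page_text: str


# Title matches count most, then the short description, then the web page.
FIELD_WEIGHTS = (("title", 4.0), ("description", 2.0), ("page_text", 1.0))


def score(course, terms):
    """Sum the weights of every field that contains at least one term."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS:
        text = getattr(course, field).lower()
        if any(term in text for term in terms):
            total += weight
    return total


def rank_courses(courses, subject, synonyms=None):
    """Return course codes ordered from best to worst match; drop non-matches."""
    terms = {subject.lower()}
    terms.update(s.lower() for s in (synonyms or {}).get(subject.lower(), ()))
    scored = [(score(c, terms), c.code) for c in courses]
    scored = [(s, code) for s, code in scored if s > 0]
    return [code for s, code in sorted(scored, key=lambda pair: -pair[0])]


courses = [
    Course("A", "Introduction to Databases", "Relational model and SQL.", ""),
    Course("B", "Web Search Engines",
           "Crawling, indexing, and machine learning for ranking.", ""),
    Course("C", "Machine Learning",
           "Supervised and unsupervised methods.", "Syllabus and homework."),
    Course("D", "Data Science", "Statistics and visualization.",
           "Week 5: machine learning basics"),
]

print(rank_courses(courses, "machine learning"))  # ['C', 'B', 'D']
```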
Given a category of entities and an attribute with various values, and
given a corpus of labelled instances, use the web pages for the
labelled instances to learn a classifier that will enable you to predict
the attribute value on a new instance.
Clearly, here, one wants to choose the categories and attribute quite
carefully so that the project is neither trivial nor impossible. (As always
in this project, it is better to err on the side of impossible than on
the side of trivial.) For instance: Classify a product into categories
such as "Food", "Clothing", "Electronics", etc.
Probably, the learning should take place off-line and the classification of
a new instance should take place at query time.
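The split between off-line learning and query-time classification can be illustrated with a small bag-of-words naive Bayes classifier. Naive Bayes is only one possible choice of learner, picked here because it is short; the function names and the tiny training set are invented for the example, and the strings stand in for text taken from the web pages of the labelled instances.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labelled_pages):
    """Off-line step: labelled_pages is a list of (page_text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labelled_pages:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": dict(word_counts),
            "label_counts": label_counts,
            "vocab": vocab}


def classify(model, text):
    """Query-time step: return the most probable label for a new page."""
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["label_counts"].items():
        counts = model["word_counts"][label]
        n_words = sum(counts.values())
        log_prob = math.log(n_docs / total_docs)
        for token in tokenize(text):
            if token in model["vocab"]:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                log_prob += math.log((counts[token] + 1) / (n_words + vocab_size))
        if log_prob > best_score:
            best_label, best_score = label, log_prob
    return best_label


# Invented training data for illustration only.
training = [
    ("fresh organic apples and bread", "Food"),
    ("chocolate cookies and fruit juice", "Food"),
    ("cotton shirt and denim jeans", "Clothing"),
    ("wool sweater and leather shoes", "Clothing"),
    ("wireless headphones and usb charger", "Electronics"),
    ("laptop with hd screen and battery", "Electronics"),
]

model = train(training)                       # done once, off-line
print(classify(model, "organic fruit juice"))  # Food
print(classify(model, "usb battery charger"))  # Electronics
```

In a real project the model returned by `train` would be saved to disk, so that only `classify` runs when a query arrives.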
Divide a collection of entities into clusters, based on the web pages
Web page classification
Given a collection of Web pages, labelled by the value of some attribute,
learn a classifier for that attribute. For example, distinguish right-wing
blogs from left-wing blogs, or divide web pages into categories such as
"Blogs", "Commercial", "Information", "Personal", etc.
Web page clustering
Given a collection of Web pages, divide them into clusters.
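As a rough sketch of what clustering web pages might involve, the code below links any two pages whose bag-of-words cosine similarity reaches a threshold and reports the connected groups (a simple form of single-link clustering). The choice of method, the threshold of 0.3, and the sample texts are all assumptions made for the example; a real project would work on text extracted from actual pages and would likely need better term weighting.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(pages, threshold=0.3):
    """Group page indices so that similar pages end up in the same cluster."""
    vectors = [vectorize(p) for p in pages]
    parent = list(range(len(pages)))  # union-find forest over page indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    groups = {}
    for i in range(len(pages)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented page texts for illustration only.
pages = [
    "python tutorial for beginners learn python programming",
    "advanced python programming tutorial",
    "chocolate cake recipe with butter",
    "easy butter cake recipe",
]

print(cluster(pages))  # [[0, 1], [2, 3]]
```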
Coordinate parallel texts
Given a collection of web pages describing the same thing, match sentences
making the same statement. For instance, a collection of news stories
about the same event, or a collection of articles about the same disease
from different medical sites.
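A very simple baseline for matching sentences across two texts is to pair each sentence with the most similar sentence in the other text, measured by word overlap. The sketch below uses Jaccard similarity on word sets with a cutoff of 0.5; both the measure and the cutoff are assumptions chosen for illustration, and word overlap is only a crude proxy for "making the same statement" (it will miss paraphrases). The two sample stories are invented.

```python
import re


def sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_sentences(text_a, text_b, threshold=0.5):
    """Pair each sentence of text_a with its best match in text_b, if any."""
    sents_b = sentences(text_b)
    matches = []
    for sa in sentences(text_a):
        best = max(sents_b,
                   key=lambda sb: jaccard(words(sa), words(sb)),
                   default=None)
        if best is not None and jaccard(words(sa), words(best)) >= threshold:
            matches.append((sa, best))
    return matches


# Invented news snippets for illustration only.
story_a = "The storm hit the coast on Monday. Thousands of homes lost power."
story_b = ("Power was lost in thousands of homes. "
           "Officials opened three shelters. "
           "On Monday the storm hit the coast.")

for a, b in match_sentences(story_a, story_b):
    print(a, "<->", b)
```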
Report features of a multimedia page, or search for pages with specified
features, combining information from the surrounding text with information
from the actual content. E.g. "Search for a line drawing of a cat"; the
identification of an image as a cat can be done from the text, whereas
the identification of an image as a line drawing can, perhaps, be done
from the JPEG file. Or "give the features of a given piece of music";
some of these can be gotten from the text, others can be gotten from the