** Problem 1 ** (20 points)

Explain briefly (2 or 3 sentences each) the difference between:

A. Depth-first search and iterative deepening search.

B. State space search and game tree evaluation.

C. Top-down parsing and bottom-up parsing.

D. A perceptron and a feed-forward, back-propagation neural network.

** Answer: **

A. Iterative deepening involves carrying out a series of depth-first
searches to successively increasing depths.

B. Both a state space and a game tree consist of states connected by operators.
However, in a state space, all operators are actions of the problem
solver, while in a game tree, operators alternate between actions of
the player and actions of the adversary. Therefore, in state space search
it suffices to find a goal state (or a path to a goal state), while in
game-tree evaluation, one must show that there is an action of the player
such that for any action of the adversary there is an action of the player ...
ending in a win for the player.

C. In top-down parsing, the parser builds the tree from the root "S" down
towards the leaves. In bottom-up parsing, the parser builds the tree
from the words (leaves) up toward the root.

D. A neural network consists of a collection of perceptrons organized
in layers, so that the output of one layer is the input to the next layer.
The perceptrons are modified so that the output is a continuous function
of the inputs, rather than being a step function. The learning algorithm
is similar in principle to the perceptron learning algorithm, but involves
a system of message passing from the output layer backwards to
the input layer.

** Problem 2 ** (15 points)

Consider the sentence "Hammers are for driving nails into surfaces."
Name two words in this sentence that are lexically ambiguous. (There are
at least four.) For each of these two words, describe a disambiguation
technique which will choose the right interpretation over at least one
of the wrong interpretations. Be specific.

** Answer: **

- A. The preposition "for" has many different meanings. Even in the phrase
"are for" it can mean:
- i. "are used for" as above.
- ii. "favors" as in "Cheney is for burning more fossil fuel and against conserving resources." This can be ruled out by selectional restrictions: this meaning requires an animate subject, unlike "hammers".
- iii. "are intended to be given to" as in "The presents are for the baby." This can be ruled out by selectional restrictions: it requires an animate object, unlike "driving".

- B. "driving" can mean:
- i. forcing an object to move against resistance, as above.
- ii. driving a car (by far the most frequent meaning).
- iii. impelling a person to undesired behaviors (as in "driving me crazy", "driving me to drink")
- quite a few other specialized meanings (iv. "driving" in golf, v. "driving cattle", etc.)

- C. "nails" can be either the tool or fingernails. However, frequency in the context of "hammer" gives a preference for the tool.
- D. "into" can mean
- i. motion into the interior of a region, as in "He drove the nail into the board".
- ii. motion through a boundary into the interior of another region, as above.
- iii. "in the direction of" as in "He looked into the sun".
- iv. "against" as in "He ran into a brick wall".

** Problem 3 ** (10 points)

In this problem and in problems 4 and 5, we consider a data set
with three Boolean predictive attributes,
A,B,C, and a Boolean classification, Z.

A. Suppose that your data is completely characterized by the following rules:

- 1. If X.A = T and X.C = F then X.Z = T.
- 2. If X.B = F and X.C = T then X.Z = T.
- 3. Otherwise, X.Z=F.

B. True or false: Given any consistent set of rules like those above, it is possible to construct a decision tree that executes that set of rules. By "consistent", I mean that there are no examples where two different rules give different answers.

** Answer: ** True. At the worst, one can use the complete decision
tree, where all tests on all attributes are made, and so each different
instance is represented by one leaf.

** Problem 4 ** (5 points)

Which of the following expresses the * independence
* assumption that is used
in deriving the formula for Naive Bayesian learning, for the classification
problem in problem 3.

- a. Prob(Z|A,B,C) = Prob(Z|A) * Prob(Z|B) * Prob(Z|C)
- b. Prob(Z|A,B,C) = Prob(A,B,C | Z) * Prob(Z) / Prob(A,B,C)
- c. Prob(A,B,C|Z) = Prob(A|Z) * Prob(B|Z) * Prob(C|Z)
- d. Prob(A,B,C) = Prob(A) * Prob(B) * Prob(C)
- e. Prob(Z|A,B,C) = Prob(Z|A) * Prob(A|B) * Prob(B|C).
- f. Prob(A,B,C|Z) = Prob(A|Z) * Prob(B|A) * Prob(C|B).

** Answer: ** c. (b) is Bayes' Law, which is used, but is not an
independence assumption. (d) and (f) are independence assumptions, but
not the ones we need. (a) and (e) are junk.

** Problem 5 ** (10 points)

Consider the following data set T. A and B are numerical attributes and
Z is a Boolean classification.

A B Z 1 2 T 2 1 F 3 2 T 1 1 FA. Let P be the perceptron with weights w

** Answer: ** Perceptron P gets the first instance wrong, with an
error of |(2*1)+(1*2)-4.5|=0.5 and the second instance wrong with an
error of |(2*2)+(1*1)-4.5|=0.5. The total error is therefore 1.0.

B. Find a set of weights and a threshhold that categorizes all this data correctly. (Hint: Sketch a graph).

** Answer: ** w_{A}=0, w_{B}=1, T=1.5 will do fine.

** Problem 6 ** (10 points)

The first choice in designing a learning algorithm is the choice of
hypothesis space. What are the advantages and disadvantages of
a large hypothesis space versus a small one?

** Answer: **
The advantage of a large hypothesis set is that it is more likely to contain
the true theory of the classification, or something close to the true
theory. The disadvantages of a large hypothesis set are, first, that it
is harder to search; second, that it is more susceptible to overfitting.

** Problem 7 ** (10 points)

You have developed a wonderful new theory of classification learning, called
"the NYU algorithm". You input data, it outputs a formula.
You decide to test your theory on the problem of
predicting in January who will
win a political election in November given attributes like office, region,
party, incumbency, success in raising funds to date, current poll numbers, etc.
You have a database with all this data for all American elections over the
last twenty years. You ask a research assistant to test the NYU algorithm
on this data. He comes back with the good news that when he ran the
NYU algorithm over the your data set, it output a formula that
gives the right answer 80% of the time on the data.

A. What further tests should you run before you publish a paper in a political science journal, claiming that you have found a formula that solves the problem of predicting political elections?

B. What further tests should you run before you publish a paper in a machine learning journal, claiming that you have promising new technique for machine learning?

** Answer: ** The test that has been run, using the same data for
training and testing, is pretty much worthless, as it can be passed at
100% by a learning algorithm that just memorizes the data. Since the
data set is very large, there should be no problems running tests that
separate out the test set from the training set.

There are also two other types of tests that should be run. First is a test to check whether the NYU formula is actually more accurate than simpler formulas that could be found by other learning algorithms, such as 1R. It could well be, for example, that the single rule "Predict that the incumbent will win, if he is running; else guess randomly" will do just as well. You would want to check this before publishing either the formula or the algorithm.

Second are tests to see how well this algorithm runs on other data sets, or on small subsets of this data set. These are of no importance to the political science journal, but would be important in the machine learning journal.

** Problem 8 ** (10 points)

Consider the following inference: Given the rules and facts,

R1: If X is a close relative of Y and Y is a close relative of Z then X is acquainted with Z.Infer that Sam is acquainted with Alice.

R2: If X is a parent of Y, then X is a close relative of Y.

R3: If X is married to Y, then X is a close relative of Y.

F1: Sam is a parent of Mike.

F2: Mike is married to Alice.

A. Express rules R1-R3 and facts F1,F2 in the Datalog representation using the following primitives:

c(X,Y) --- X is a close relative of Y.

p(X,Y) --- X is a parent of Y.

m(X,Y) --- X is married to Y.

a(X,Y) --- X is acquainted with Y.

sam, mike, alice : Constants.

**Answer:**

R1: a(X,Z) :- c(X,Y),c(Y,Z). R2: c(X,Y) :- p(X,Y). R3: c(X,Y) :- m(X,Y). F1: p(sam,mike), F2. m(mike,alice).

B. Explain how Datalog can carry out the inference. You may use either backward or forward chaining.

** Answer: ** In backward chaining you start with the goal
G0 ?- a(sam,alice), and proceed as follows:

G0 ?- a(sam,alice) matches R1 giving G1 G1 ?- c(sam,Y1), c(Y1,alice) generates first G2 G2 ?- c(sam,Y1) matches R2 giving G3 G3 ?- p(sam,Y1) matches F1 with Y1=mike. G3 succeeds. G2 succeeds G1 generates G4: G4 ?- c(mike,alice) matches R2 giving G5 G5 ?- p(mike,alice) fails. Return to G4 G4 matches R3 giving G6 G6 ?- m(mike,alice) matches F2. G6 succeeds. G4 succeeds G1 succeeds G0 succeeds.

** Problem 9 ** (10 points)

In computer vision, techniques for using stereo vision are in some ways
similar to techniques for dealing with images of a fixed scene by a moving
camera.
What are the similarities? What are the differences?

** Answer: ** Two successive images of a fixed scene, like the two
images in stereo vision, are images of the same scene taken from slightly
different vantage points. The process of using this information, in both
cases, involves first matching features between the two images and
then using the shift in the apparent position of the feature in the two
images to determine the distance to the actual object, using triangulation.

Analysis of motion has two advantages over stereo vision. First, there are many images as opposed to just two, giving more information. Second, the distance between successive images is typically substantially less than the distance between two stereo cameras, so the change in the image is less, simplifying the matching problem.

Stereo vision also has two advantages. First, the distance between two stereo cameras, which are bolted to a frame, is known more precisely than the distance between successive images of a moving camera. Second, the displacement of two stereo cameras is always a horizontal displacement, so matching always involves corresponding rows of the two images. With a moving camera, the relation between successive images, though known, can be much more complicated, involving expansions from the center point, or rotations, and do not fit the built-in structures of pixels in the same way.