## Final Exam: Introduction to Artificial Intelligence: Solutions

Problem 1 (20 points)
Explain briefly (2 or 3 sentences each) the difference between:
A. Depth-first search and iterative deepening search.
B. State space search and game tree evaluation.
C. Top-down parsing and bottom-up parsing.
D. A perceptron and a feed-forward, back-propagation neural network.

Answer:
A. Iterative deepening involves carrying out a series of depth-first searches to successively increasing depths.
B. Both a state space and a game tree consist of states connected by operators. However, in a state space, all operators are actions of the problem solver, while in a game tree, operators alternate between actions of the player and actions of the adversary. Therefore, in state space search it suffices to find a goal state (or a path to a goal state), while in game-tree evaluation, one must show that there is an action of the player such that for any action of the adversary there is an action of the player ... ending in a win for the player.
C. In top-down parsing, the parser builds the tree from the root "S" down towards the leaves. In bottom-up parsing, the parser builds the tree from the words (leaves) up toward the root.
D. A neural network consists of a collection of perceptrons organized in layers, so that the output of one layer is the input to the next layer. The perceptrons are modified so that the output is a continuous function of the inputs, rather than being a step function. The learning algorithm is similar in principle to the perceptron learning algorithm, but involves a system of message passing from the output layer backwards to the input layer.

Problem 2 (15 points)
Consider the sentence "Hammers are for driving nails into surfaces." Name two words in this sentence that are lexically ambiguous. (There are at least four.) For each of these two words, describe a disambiguation technique which will choose the right interpretation over at least one of the wrong interpretations. Be specific.

Answer:

• A. The preposition "for" has many different meanings. Even in the phrase "are for" it can mean:
• i. "are used for" as above.
• ii. "favors" as in "Cheney is for burning more fossil fuel and against conserving resources." This can be ruled out by selectional restrictions: this meaning requires an animate subject, unlike "hammers".
• iii. "are intended to be given to" as in "The presents are for the baby." This can be ruled out by selectional restrictions: it requires an animate object, unlike "driving".
• B. "driving" can mean:
• i. forcing an object to move against resistance, as above.
• ii. driving a car (by far the most frequent meaning).
• iii. impelling a person to undesired behaviors (as in "driving me crazy", "driving me to drink")
• quite a few other specialized meanings (iv. "driving" in golf, v. "driving cattle", etc.)
However, most of these can be ruled out by selectional restrictions on the object. E.g. (ii) requires a car as object; (iii) and (v) require animate objects, etc.
• C. "nails" can be either the tool or fingernails. However, frequency in the context of "hammer" gives a preference for the tool.
• D. "into" can mean
• i. motion into the interior of a region, as in "He drove the nail into the board".
• ii. motion through a boundary into the interior of another region, as above.
• iii. "in the direction of" as in "He looked into the sun".
• iv. "against" as in "He ran into a brick wall".
Here, I think, one has to rely on world knowledge that a hammer is used to drive a nail through the surface of an object into its interior.

Problem 3 (10 points)
In this problem and in problems 4 and 5, we consider a data set with three Boolean predictive attributes, A,B,C, and a Boolean classification, Z.

A. Suppose that your data is completely characterized by the following rules:

• 1. If X.A = T and X.C = F then X.Z = T.
• 2. If X.B = F and X.C = T then X.Z = T.
• 3. Otherwise, X.Z=F.
Construct a decision tree whose predictions correspond to these rules.

B. True or false: Given any consistent set of rules like those above, it is possible to construct a decision tree that executes that set of rules. By "consistent", I mean that there are no examples where two different rules give different answers.

Answer: True. At the worst, one can use the complete decision tree, where all tests on all attributes are made, and so each different instance is represented by one leaf.

Problem 4 (5 points)
Which of the following expresses the independence assumption that is used in deriving the formula for Naive Bayesian learning, for the classification problem in problem 3.

• a. Prob(Z|A,B,C) = Prob(Z|A) * Prob(Z|B) * Prob(Z|C)
• b. Prob(Z|A,B,C) = Prob(A,B,C | Z) * Prob(Z) / Prob(A,B,C)
• c. Prob(A,B,C|Z) = Prob(A|Z) * Prob(B|Z) * Prob(C|Z)
• d. Prob(A,B,C) = Prob(A) * Prob(B) * Prob(C)
• e. Prob(Z|A,B,C) = Prob(Z|A) * Prob(A|B) * Prob(B|C).
• f. Prob(A,B,C|Z) = Prob(A|Z) * Prob(B|A) * Prob(C|B).

Answer: c. (b) is Bayes' Law, which is used, but is not an independence assumption. (d) and (f) are independence assumptions, but not the ones we need. (a) and (e) are junk.

Problem 5 (10 points)
Consider the following data set T. A and B are numerical attributes and Z is a Boolean classification.

```      A   B   Z
1   2   T
2   1   F
3   2   T
1   1   F
```
A. Let P be the perceptron with weights wA = 2, wB = 1, and threshhold T=4.5. What is the value of the standard error function for this perceptron?

Answer: Perceptron P gets the first instance wrong, with an error of |(2*1)+(1*2)-4.5|=0.5 and the second instance wrong with an error of |(2*2)+(1*1)-4.5|=0.5. The total error is therefore 1.0.

B. Find a set of weights and a threshhold that categorizes all this data correctly. (Hint: Sketch a graph).

Answer: wA=0, wB=1, T=1.5 will do fine.

Problem 6 (10 points)
The first choice in designing a learning algorithm is the choice of hypothesis space. What are the advantages and disadvantages of a large hypothesis space versus a small one?

Answer: The advantage of a large hypothesis set is that it is more likely to contain the true theory of the classification, or something close to the true theory. The disadvantages of a large hypothesis set are, first, that it is harder to search; second, that it is more susceptible to overfitting.

Problem 7 (10 points)
You have developed a wonderful new theory of classification learning, called "the NYU algorithm". You input data, it outputs a formula. You decide to test your theory on the problem of predicting in January who will win a political election in November given attributes like office, region, party, incumbency, success in raising funds to date, current poll numbers, etc. You have a database with all this data for all American elections over the last twenty years. You ask a research assistant to test the NYU algorithm on this data. He comes back with the good news that when he ran the NYU algorithm over the your data set, it output a formula that gives the right answer 80% of the time on the data.

A. What further tests should you run before you publish a paper in a political science journal, claiming that you have found a formula that solves the problem of predicting political elections?

B. What further tests should you run before you publish a paper in a machine learning journal, claiming that you have promising new technique for machine learning?

Answer: The test that has been run, using the same data for training and testing, is pretty much worthless, as it can be passed at 100% by a learning algorithm that just memorizes the data. Since the data set is very large, there should be no problems running tests that separate out the test set from the training set.

There are also two other types of tests that should be run. First is a test to check whether the NYU formula is actually more accurate than simpler formulas that could be found by other learning algorithms, such as 1R. It could well be, for example, that the single rule "Predict that the incumbent will win, if he is running; else guess randomly" will do just as well. You would want to check this before publishing either the formula or the algorithm.

Second are tests to see how well this algorithm runs on other data sets, or on small subsets of this data set. These are of no importance to the political science journal, but would be important in the machine learning journal.

Problem 8 (10 points)
Consider the following inference: Given the rules and facts,

R1: If X is a close relative of Y and Y is a close relative of Z then X is acquainted with Z.
R2: If X is a parent of Y, then X is a close relative of Y.
R3: If X is married to Y, then X is a close relative of Y.
F1: Sam is a parent of Mike.
F2: Mike is married to Alice.
Infer that Sam is acquainted with Alice.

A. Express rules R1-R3 and facts F1,F2 in the Datalog representation using the following primitives:

c(X,Y) --- X is a close relative of Y.
p(X,Y) --- X is a parent of Y.
m(X,Y) --- X is married to Y.
a(X,Y) --- X is acquainted with Y.
sam, mike, alice : Constants.

Answer:

```R1: a(X,Z) :- c(X,Y),c(Y,Z).
R2: c(X,Y) :- p(X,Y).
R3: c(X,Y) :- m(X,Y).
F1: p(sam,mike),
F2. m(mike,alice).
```

B. Explain how Datalog can carry out the inference. You may use either backward or forward chaining.

Answer: In backward chaining you start with the goal G0 ?- a(sam,alice), and proceed as follows:

```G0 ?- a(sam,alice)  matches R1 giving G1
G1 ?- c(sam,Y1), c(Y1,alice) generates first G2
G2 ?- c(sam,Y1) matches R2 giving G3
G3 ?- p(sam,Y1) matches F1 with Y1=mike. G3 succeeds.
G2 succeeds
G1 generates G4:
G4 ?- c(mike,alice) matches R2 giving G5
G5 ?- p(mike,alice) fails. Return to G4
G4 matches R3 giving G6
G6 ?- m(mike,alice) matches F2. G6 succeeds.
G4 succeeds
G1 succeeds
G0 succeeds.
```

Problem 9 (10 points)
In computer vision, techniques for using stereo vision are in some ways similar to techniques for dealing with images of a fixed scene by a moving camera. What are the similarities? What are the differences?

Answer: Two successive images of a fixed scene, like the two images in stereo vision, are images of the same scene taken from slightly different vantage points. The process of using this information, in both cases, involves first matching features between the two images and then using the shift in the apparent position of the feature in the two images to determine the distance to the actual object, using triangulation.

Analysis of motion has two advantages over stereo vision. First, there are many images as opposed to just two, giving more information. Second, the distance between successive images is typically substantially less than the distance between two stereo cameras, so the change in the image is less, simplifying the matching problem.

Stereo vision also has two advantages. First, the distance between two stereo cameras, which are bolted to a frame, is known more precisely than the distance between successive images of a moving camera. Second, the displacement of two stereo cameras is always a horizontal displacement, so matching always involves corresponding rows of the two images. With a moving camera, the relation between successive images, though known, can be much more complicated, involving expansions from the center point, or rotations, and do not fit the built-in structures of pixels in the same way.