## Final Examination: Artificial Intelligence

Tuesday, December 19. Open book and open notes.

Note: As always, it is important to use good test-taking strategy: first work on the problems that you feel confident about, then tackle the ones you are not sure about. In particular, problem 3.D, though not actually difficult once you have completed 3.C, could potentially take you off on a wild goose chase if you take a wrong turn. If you are not confident about backward chaining, you should probably leave that till last. There is no need to have the answers in sequential order in the exam booklet.

### Problem 1 (5 points)

Convert the following sentence to CNF:
~(P^Q) <=> R.

Answer:
P V R.
Q V R.
~R V ~P V ~Q.
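As a check (not part of the required answer), a short truth-table comparison in Python confirms that the three clauses are equivalent to the original biconditional:

```python
from itertools import product

# Truth-table check that the three CNF clauses are equivalent to
# the original sentence ~(P^Q) <=> R.
def original(p, q, r):
    return (not (p and q)) == r

def cnf(p, q, r):
    return (p or r) and (q or r) and ((not r) or (not p) or (not q))

# The two formulas agree on all eight valuations of P, Q, R.
assert all(original(p, q, r) == cnf(p, q, r)
           for p, q, r in product([True, False], repeat=3))
```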

### Problem 2 (10 points)

In applying the random-restart hill-climbing algorithm to propositional satisfiability:
• What is the state space? That is, what is a state, and what are the neighbors to a given state?

Answer: A state is a valuation on the atoms. A neighbor of state S is a valuation that differs from S on a single atom.

• What is the error function?

Answer: The number of unsatisfied clauses.

• What is the major limitation of this algorithm as compared to Davis-Putnam?

Answer: The algorithm is only likely to give you an answer; it is not guaranteed to find one. In particular, if the algorithm does not find a satisfying valuation, you don't know whether that is because no such valuation exists, or just because the algorithm has been unlucky.
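For concreteness, here is a minimal Python sketch of random-restart hill climbing for satisfiability. It is an illustration, not something the exam requires; the clause encoding and the parameter values are invented for the example.

```python
import random

# Illustrative sketch of random-restart hill climbing for satisfiability.
# A clause is a list of (atom, polarity) pairs, a state is a dict from
# atoms to booleans, and the error function is the number of unsatisfied
# clauses, as in the answers above.
def num_unsat(clauses, state):
    return sum(1 for clause in clauses
               if not any(state[atom] == pol for atom, pol in clause))

def hill_climb_sat(clauses, atoms, restarts=20, steps=100, seed=0):
    rng = random.Random(seed)
    for _ in range(restarts):
        state = {a: rng.choice([True, False]) for a in atoms}  # random restart
        for _ in range(steps):
            err = num_unsat(clauses, state)
            if err == 0:
                return state                  # satisfying valuation found

            def err_after_flip(a):            # error of the neighbor flipping a
                state[a] = not state[a]
                e = num_unsat(clauses, state)
                state[a] = not state[a]
                return e

            best = min(atoms, key=err_after_flip)
            if err_after_flip(best) >= err:
                break                         # local minimum: give up, restart
            state[best] = not state[best]     # move to the best neighbor
    return None   # may fail even on satisfiable input -- the key limitation

# The CNF from Problem 1: (P v R) ^ (Q v R) ^ (~R v ~P v ~Q)
clauses = [[("P", True), ("R", True)],
           [("Q", True), ("R", True)],
           [("R", False), ("P", False), ("Q", False)]]
model = hill_climb_sat(clauses, ["P", "Q", "R"])
assert model is not None and num_unsat(clauses, model) == 0
```

The final `return None` is exactly the limitation noted in the answer: a failure does not distinguish "unsatisfiable" from "unlucky".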

### Problem 3 (20 points)

Let L be a first-order language over a domain of events with the following primitives:
b(E,F) -- E occurs before F. That is, E completes before F begins.
m(E,F) -- E meets F. That is, E completes just as F begins.
d(E,F) -- E occurs during F. That is, first F starts, then E starts, then E ends, then F ends.
f(E,F) -- E finishes F. That is, F starts before E but they end together.
q(E,F) -- E and F occur at the same time; that is, they start together and end together.

Constants: g (Hamlet's encounter with the ghost); a1 (act 1); a2 (act 2); p (Hamlet's encounter with the players); y (performance of the play).

• A. Represent the following statements in L:
• i. If E meets F and G occurs during F, then E occurs before G.
Answer: forall(E,F,G) m(E,F) ^ d(G,F) => b(E,G).
• ii. If E occurs before F and F occurs before G, then E occurs before G.
Answer: forall(E,F,G) b(E,F) ^ b(F,G) => b(E,G).
• iii. If E finishes F and F meets G then E meets G.
Answer: forall(E,F,G) f(E,F) ^ m(F,G) => m(E,G).
• iv. For any event E there exists an event F such that F occurs during E.
Answer: forall(E) exists(F) d(F,E).
• v. If E meets G and F meets G, then either E finishes F, or F finishes E, or E and F occur at the same time.
Answer: forall(E,F,G) [m(E,G) ^ m(F,G)] => [f(E,F) V f(F,E) V q(E,F)].

• B. Statements iv and v cannot be expressed in Datalog. Why not?

Answer: Statement iv involves an existential quantifier, and statement v involves disjunction.

• C. Given the above and the facts "Hamlet encounters the Ghost at the end of Act 1", "Act 1 meets Act 2", "Hamlet encounters the players during Act 2", and "The encounter with the players precedes the performance of the play", show how "Hamlet's encounter with the ghost precedes the performance of the play" may be inferred by Datalog forward chaining. You need only show the steps that lead to the desired conclusion.

Answer: Given i, ii, and iii above and given

vi. f(g,a1).
vii. m(a1,a2).
viii. d(p,a2).
ix. b(p,y).
Unifying vi and vii with the left side of iii under the binding E=g, F=a1, G=a2 gives (x.) m(g,a2).
Unifying x and viii with the left side of i under the binding E=g, F=a2, G=p gives (xi.) b(g,p).
Unifying xi and ix with the left side of ii under the binding E=g, F=p, G=y gives (xii) b(g,y), which is the desired result.
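The forward-chaining computation above can also be reproduced mechanically. The following Python sketch (an illustration, with an invented tuple encoding of atoms; uppercase strings are variables, lowercase strings are constants) saturates the rules to a fixed point and derives b(g,y):

```python
from itertools import product

# Rules i-iii as (body, head) pairs, and facts vi-ix as ground atoms.
rules = [
    ((("m", "E", "F"), ("d", "G", "F")), ("b", "E", "G")),   # rule i
    ((("b", "E", "F"), ("b", "F", "G")), ("b", "E", "G")),   # rule ii
    ((("f", "E", "F"), ("m", "F", "G")), ("m", "E", "G")),   # rule iii
]
facts = {("f", "g", "a1"), ("m", "a1", "a2"),                # vi, vii
         ("d", "p", "a2"), ("b", "p", "y")}                  # viii, ix

def match(pattern, fact, binding):
    # extend binding so that the body atom `pattern` equals the ground fact
    if pattern[0] != fact[0]:
        return None
    b = dict(binding)
    for var, val in zip(pattern[1:], fact[1:]):
        if b.setdefault(var, val) != val:
            return None
    return b

changed = True
while changed:                                # saturate to a fixed point
    changed = False
    for body, head in rules:
        for chosen in product(sorted(facts), repeat=len(body)):
            b = {}
            for pattern, fact in zip(body, chosen):
                b = match(pattern, fact, b)
                if b is None:
                    break
            if b is not None:
                derived = (head[0], b[head[1]], b[head[2]])
                if derived not in facts:
                    facts.add(derived)
                    changed = True

# xii: Hamlet's ghost encounter precedes the performance of the play.
assert ("b", "g", "y") in facts
```

The saturation also derives the intermediate facts x (m(g,a2)) and xi (b(g,p)) from the trace.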
• D. Show how the same inference is carried out using Datalog backward chaining. Again, you need only show the steps that lead to the desired conclusion. (Very important!)
```
Goal G0: b(g,y).  Unifying with the head of ii under the bindings E=g, G=y,
creates the two subgoals G1: b(g,F1);  G2: b(F1,y).
Goal G1: b(g,F1). Unifying with the head of (i) under the binding E=g,G=F1
creates the two subgoals G3: m(g,F2); G4: d(F1,F2).
Goal G3: m(g,F2). Unifying with the head of iii under the binding E=g, G=F2
creates the two subgoals G5: f(g,F3), G6: m(F3,F2).
Goal G5: f(g,F3).  Unifying with (vi) under the binding F3=a1 succeeds.
Goal G6: m(a1,F2). Unifying with (vii) under the binding F2=a2 succeeds.
Goal G3 succeeds.
Goal G4: d(F1,a2). Unifying with (viii) under the binding F1=p succeeds.
Goal G1 succeeds.
Goal G2: b(p,y). Unifying with (ix) succeeds.
Goal G0 succeeds.
```
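The same query can be answered by a small backward chainer. The sketch below is illustrative, not the course's required notation: it renames rule variables freshly on each use (like F1, F2 in the trace), tries facts before rules, and takes only the first solution, which keeps the left-recursive use of rule ii from looping forever on this query.

```python
import itertools

# Rules i-iii and facts vi-ix; uppercase strings are variables.
rules = [
    ((("m", "E", "F"), ("d", "G", "F")), ("b", "E", "G")),   # rule i
    ((("b", "E", "F"), ("b", "F", "G")), ("b", "E", "G")),   # rule ii
    ((("f", "E", "F"), ("m", "F", "G")), ("m", "E", "G")),   # rule iii
]
facts = [("f", "g", "a1"), ("m", "a1", "a2"),                # vi, vii
         ("d", "p", "a2"), ("b", "p", "y")]                  # viii, ix

fresh = itertools.count()

def is_var(t):
    return t[0].isupper()

def rename(rule):
    # give the rule's variables fresh names, like F1, F2 in the trace
    body, head = rule
    n = next(fresh)
    def r(atom):
        return (atom[0],) + tuple(f"{t}{n}" if is_var(t) else t
                                  for t in atom[1:])
    return tuple(r(a) for a in body), r(head)

def walk(t, s):
    while t in s:              # follow the substitution chain
        t = s[t]
    return t

def unify(a1, a2, s):
    if a1[0] != a2[0]:
        return None
    s = dict(s)
    for x, y in zip(a1[1:], a2[1:]):
        x, y = walk(x, s), walk(y, s)
        if x == y:
            continue
        if is_var(x):
            s[x] = y
        elif is_var(y):
            s[y] = x
        else:
            return None        # two distinct constants
    return s

def solve(goals, s):
    # depth-first search over the goal list, yielding substitutions
    if not goals:
        yield s
        return
    goal, rest = goals[0], goals[1:]
    goal = (goal[0],) + tuple(walk(t, s) for t in goal[1:])
    for fact in facts:                         # try ground facts first
        s2 = unify(goal, fact, s)
        if s2 is not None:
            yield from solve(rest, s2)
    for rule in rules:                         # then rule heads
        body, head = rename(rule)
        s2 = unify(goal, head, s)
        if s2 is not None:
            yield from solve(body + rest, s2)

# Goal G0 from the trace: b(g,y).
solution = next(solve((("b", "g", "y"),), {}), None)
assert solution is not None
```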

### Problem 4 (10 points)

Consider the following Bayesian network:

[Figure omitted: A and B are root nodes; each of C and D has both A and B as parents.]

Assume that all the random variables are Boolean.
• A. What absolute and conditional probabilities are recorded in the network?

Answer: Prob(A=T/F), Prob(B=T/F), Prob(C=T/F | A=T/F,B=T/F), Prob(D=T/F | A=T/F, B=T/F).

• B. For each of the following, is it true or false in the network? More than one may be true:
• i. A and B are independent absolutely.
• ii. A and B are conditionally independent given C and D.
• iii. A and C are independent absolutely.
• iv. A and C are conditionally independent given B and D.
• v. C and D are independent absolutely.
• vi. C and D are conditionally independent given A and B.

Answer: (i) and (vi) are true; the rest are false.

• C. How is Prob(C=T|A=F) computed from the values in this table? Answer:
```
Prob(C=T|A=F) = Prob(C=T,B=T|A=F) + Prob(C=T,B=F|A=F)
              = Prob(C=T|B=T,A=F) Prob(B=T|A=F) + Prob(C=T|B=F,A=F) Prob(B=F|A=F)
              = Prob(C=T|B=T,A=F) Prob(B=T) + Prob(C=T|B=F,A=F) Prob(B=F)
```
(The last step uses the fact that A and B are independent.)
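With invented numbers for the conditional probability tables (the exam supplies none), the computation looks like this:

```python
# Numeric illustration with made-up CPT entries.
P_B = {True: 0.6, False: 0.4}             # Prob(B); hypothetical values
P_C_given_B_Af = {True: 0.7, False: 0.2}  # Prob(C=T | B, A=F); hypothetical

# Prob(C=T|A=F) = sum over B of Prob(C=T|B,A=F) * Prob(B),
# using Prob(B|A=F) = Prob(B) since A and B are independent.
p = sum(P_C_given_B_Af[b] * P_B[b] for b in (True, False))
print(round(p, 10))   # 0.7*0.6 + 0.2*0.4 = 0.5
```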

### Problem 5 (10 points)

Consider the following context-free grammar for compound nouns (noun groups):
NG -> Noun
NG -> NG NG
Assume that "NYU", "computer", "science", "department", and "office" are all labelled "Noun" in the lexicon.
• A. Show all the parse trees for the NG "NYU computer science department". (You may abbreviate the leaves by their initial letters "n", "c", "s", "d".) Which is the correct one?
```
1. NG ---> NG ---> Noun ---> n
|
|-> NG ---> NG ---> Noun ---> c
        |
        |-> NG ---> NG ---> Noun ---> s
                |
                |-> NG ---> Noun ---> d

2. NG ---> NG ---> Noun ---> n
|
|-> NG ---> NG ---> NG ---> Noun ---> c
        |       |
        |       |-> NG ---> Noun ---> s
        |
        |-> NG ---> Noun ---> d

3. NG ---> NG ---> NG ---> Noun ---> n
|       |
|       |-> NG ---> Noun ---> c
|
|-> NG ---> NG ---> Noun ---> s
        |
        |-> NG ---> Noun ---> d

4. NG ---> NG ---> NG ---> Noun ---> n
|       |
|       |-> NG ---> NG ---> Noun ---> c
|               |
|               |-> NG ---> Noun ---> s
|
|-> NG ---> Noun ---> d

5. NG ---> NG ---> NG ---> NG ---> Noun ---> n
|       |       |
|       |       |-> NG ---> Noun ---> c
|       |
|       |-> NG ---> Noun ---> s
|
|-> NG ---> Noun ---> d
```
(2) is the correct tree.
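As a sanity check (not part of the answer), a binary grammar NG -> NG NG over four leaves admits Catalan(3) = 5 distinct trees, which a short enumeration confirms:

```python
# Enumerate all binary trees over a sequence of leaves; each internal
# node corresponds to one application of NG -> NG NG.
def trees(words):
    if len(words) == 1:
        return [words[0]]
    result = []
    for i in range(1, len(words)):            # split point between the two NGs
        for left in trees(words[:i]):
            for right in trees(words[i:]):
                result.append((left, right))
    return result

all_trees = trees(["n", "c", "s", "d"])
assert len(all_trees) == 5
# The reading in which "computer science" forms a unit inside
# "computer science department" is among them:
assert ("n", (("c", "s"), "d")) in all_trees
```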
• B. Suppose that the sentence "Pat works for the NYU computer science department" is input to a chart parser, which contains the above as part of its grammar. Assume that the chart parser generates the dotted edge [4,4,NG -> * NG NG]. For each of the following edges, identify which of the parse trees in your answer to (A) corresponds to the edge. (You should number the parse trees in (A) and answer here with those numbers. An edge may correspond to more than one tree. Note: 4 is the space between "for" and "NYU"; 5 is the space between "NYU" and "computer" etc.)
• i. [5,6, NG -> Noun *]. Answer: All.
• ii. [5,7, NG -> NG NG *]. Answer: Trees 2 and 4.
• iii. [5,6, NG -> NG * NG]. Answer: Trees 1, 2, and 4.

### Problem 6 (10 points)

Consider the following sentence:
Prof. Jones has determined that the coin dates from the time of Julius Caesar, when it was about the price of a loaf of bread.
• A. Identify two instances of lexical ambiguity, one of syntactic ambiguity, and one of anaphoric ambiguity. Answer: (These are not the only possible answers.)
Lexical: "determined"="figured out" / "decided" / "made inevitable".
"dates" = "was created at a specified time" / "assigns to a particular time" / "accompanies on romantic outings".
"about" = "approximately" / "concerning"
"from", "time", "of", "price" and "bread" are also lexically ambiguous.

Syntactic: "from the time" can attach to "dates" or "determined".
"when ..." can attach to "Julius Caesar", "time", "dates", or "determined".

Anaphoric: "it" can refer to "coin" or "time", or even to Prof. Jones' determination.

• B. Describe how one possible misreading can be excluded. Be specific.

Answer: (Many answers are possible.) The reading "dates = accompanies on romantic outings" can be excluded by selectional restriction: the subject of such a reading must be human, but "coin" cannot be human.

### Problem 7 (10 points)

In using the K-gram model for probabilistic tagging of text, with frequencies estimated from a training corpus, it is necessary to apply a smoothing function
Prob(t[i] | t[i-2] t[i-1]) = w3 FreqC(t[i] | t[i-2] t[i-1]) + w2 FreqC(t[i] | t[i-1]) + w1 FreqC(t[i])
where C is the training corpus and w1, w2, w3 are weights.

What would be the problem if you use the set of weights w1 = 0.98, w2 = 0.01, w3 = 0.01?

Answer: This effectively eliminates the K-gram part of the model and makes the choice of tag depend almost entirely on the overall frequency of each tag, ignoring the context of the preceding tags.

What would be the problem if you use the set of weights w1 = 0.01, w2 = 0.01, w3 = 0.98?

Answer: This is subject to the sparse data problem; if the trigram does not occur in the corpus, it will be judged as extremely unlikely.
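A sketch of the interpolated estimate; the frequencies and weights below are hypothetical, chosen only to illustrate the formula:

```python
# Smoothed trigram estimate as a weighted sum of trigram, bigram, and
# unigram frequencies from the training corpus C.
def smoothed_prob(tri_freq, bi_freq, uni_freq, w1, w2, w3):
    # FreqC(t | t-2, t-1), FreqC(t | t-1), FreqC(t) for one candidate tag
    return w3 * tri_freq + w2 * bi_freq + w1 * uni_freq

# An unseen trigram (frequency 0) is not judged impossible: the bigram
# and unigram terms rescue it.
p = smoothed_prob(0.0, 0.3, 0.2, w1=0.1, w2=0.2, w3=0.7)
print(round(p, 10))   # 0.2*0.3 + 0.1*0.2 = 0.08
```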

### Problem 8 (5 points)

Fill in the blanks: Suppose that we are using Naive Bayes to conjecture the value of classification attribute C from predictive attributes P,Q,R, for an instance X where X.P=p, X.Q=q, X.R=r. The calculation proceeds as follows: We wish to calculate the value V that maximizes Prob(X.C=V | X.P=p, X.Q=q, X.R=r).
```
By Bayes' law, Prob(X.C=V | X.P=p, X.Q=q, X.R=r) = (a)
Prob(X.P=p, X.Q=q, X.R=r | X.C=V) Prob(X.C=V) / Prob(X.P=p, X.Q=q, X.R=r)

We may ignore the factor (b) 1 / Prob(X.P=p, X.Q=q, X.R=r)

because (c) that is the same for all values of V (it is a normalizing factor).

We assume that attributes (d) P, Q, and R

are conditionally independent given (e) C,

giving us the equation
(f) Prob(X.P=p, X.Q=q, X.R=r | X.C=V) =
(g) Prob(X.P=p | X.C=V)  * Prob(X.Q=q | X.C=V)  * Prob(X.R=r | X.C=V)

Thus, we conclude that the value of V that maximizes
Prob(X.C=V | X.P=p, X.Q=q, X.R=r) is the same as the value of V that maximizes
(h) Prob(X.P=p | X.C=V)  * Prob(X.Q=q | X.C=V)  * Prob(X.R=r | X.C=V)  *
Prob(X.C=V).

and the factors in this last expression can all be estimated from the
training data.
```
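A sketch of the resulting decision rule, with hypothetical estimated probabilities for two candidate values of C:

```python
# Hypothetical estimates for each candidate value V of C.
est = {
    # V: (Prob(C=V), Prob(P=p|C=V), Prob(Q=q|C=V), Prob(R=r|C=V))
    "yes": (0.6, 0.5, 0.4, 0.9),
    "no":  (0.4, 0.3, 0.2, 0.1),
}

def score(v):
    prior, pp, pq, pr = est[v]
    return pp * pq * pr * prior   # the expression in (h)

best = max(est, key=score)
print(best)   # "yes" wins: 0.5*0.4*0.9*0.6 = 0.108 versus 0.0024
```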

### Problem 9 (10 points)

Datasets often contain instances with null values in some of the attributes. Some classification learning algorithms are able to use such instances in the training set; other algorithms must discard them.
• A. Can Naive Bayes make use of instances with null values in the training set? Explain your answer.

Answer: Yes. Suppose that P and Q are predictive attributes, C is the classification attribute, and an instance I in the training set has a value recorded for I.P but not for I.Q. Then I can be used in computing Prob(X.P|X.C) even though it must be ignored in computing Prob(X.Q|X.C).
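A sketch of the counting involved, with a made-up three-attribute training set; the row with a null Q still contributes to the estimate for P:

```python
# Per-attribute counting that skips null (None) values: an instance
# missing Q still contributes to the estimate for P.
training = [              # (P, Q, C)
    ("a", "x",  "yes"),
    ("a", None, "yes"),   # usable for P, ignored for Q
    ("b", "y",  "yes"),
    ("b", "x",  "no"),
]

def cond_prob(attr_index, value, c):
    # estimate Prob(attribute = value | C = c), ignoring null entries
    rows = [t for t in training if t[2] == c and t[attr_index] is not None]
    return sum(1 for t in rows if t[attr_index] == value) / len(rows)

print(cond_prob(0, "a", "yes"))   # 2/3: all three "yes" rows record P
print(cond_prob(1, "x", "yes"))   # 1/2: the null-Q row is skipped
```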

• B. Can K-nearest neighbors make use of instances with null values in the training set? Explain your answer.

Answer: No. The distance from a new instance X to a training instance Y is some function of the differences over all the attributes, and there is no way to compute that if one of the attribute values is missing.

### Problem 10 (5 points)

The version of the ID3 algorithm in the class handout includes a test: if AVG_ENTROPY(AS,C,T) is not substantially smaller than ENTROPY(C,T), then the algorithm constructs a leaf corresponding to the current state of T and does not recur. "Substantially smaller" here, of course, is rather vague. What is the disadvantage of changing this to "much smaller"? What is the disadvantage of eliminating the test?

Answer: If you require that AVG_ENTROPY(AS,C,T) be much smaller than ENTROPY(C,T), then you will often fail to split on an attribute that is actually informative about C. If you eliminate the test, then you may end up splitting on attributes whose relation to the classification attribute is actually completely random, and thus end up with a tree that overfits the data.
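For reference, the two quantities in the test can be computed as follows; the tiny table and the uninformative attribute split are invented for illustration:

```python
from math import log2

# ENTROPY(C,T): entropy of the classification labels in table T.
def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

# AVG_ENTROPY(AS,C,T): average entropy after splitting on attribute AS.
def avg_entropy(split):
    # split maps each value of AS to the list of classification labels
    # of the instances having that value
    n = sum(len(group) for group in split.values())
    return sum(len(group) / n * entropy(group) for group in split.values())

labels = ["T", "T", "F", "F"]
useless = {"a": ["T", "F"], "b": ["T", "F"]}  # AS carries no information on C
print(entropy(labels))        # 1.0
print(avg_entropy(useless))   # 1.0 -- no reduction, so make a leaf
```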