Final Examination: Artificial Intelligence

Tuesday, December 19. Open book and open notes.

Note: As always, it is important to use good test-taking strategy: first work on the problems that you feel confident about, then tackle the ones you are not sure about. In particular, problem 3.D, though not actually difficult once you have completed 3.C, could take you off on a wild goose chase if you take a wrong turn. If you are not confident about backward chaining, you should probably leave it till last. There is no need to have the answers in sequential order in the exam booklet.

Problem 1: (5 points)

Convert the following sentence to CNF:
~(P^Q) <=> R.

Answer:
P V R.
Q V R.
~R V ~P V ~Q.
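The conversion can be checked mechanically by comparing truth tables of the original biconditional and the three clauses (a small verification sketch, not part of the required answer):

```python
from itertools import product

def original(p, q, r):
    # ~(P ^ Q) <=> R
    return (not (p and q)) == r

def cnf(p, q, r):
    # (P v R) ^ (Q v R) ^ (~R v ~P v ~Q)
    return (p or r) and (q or r) and ((not r) or (not p) or (not q))

# The two formulas agree on all eight truth assignments.
assert all(original(p, q, r) == cnf(p, q, r)
           for p, q, r in product([False, True], repeat=3))
```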

Problem 2 (10 points)

In applying the random-restart hill-climbing algorithm to propositional satisfiability:
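For reference, random-restart hill climbing on satisfiability can be sketched as follows (a minimal GSAT-style illustration; the clause encoding, scoring function, and restart/flip budgets are assumptions, not part of the question):

```python
import random

def num_satisfied(clauses, assignment):
    # A clause is a list of ints: +v means variable v, -v means its negation.
    return sum(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in clauses)

def hill_climb_sat(clauses, n_vars, restarts=20, flips=100, seed=0):
    rng = random.Random(seed)
    for _ in range(restarts):
        # Random restart: begin from a fresh random truth assignment.
        assign = {v: rng.choice([False, True]) for v in range(1, n_vars + 1)}
        for _ in range(flips):
            if num_satisfied(clauses, assign) == len(clauses):
                return assign
            # Greedily flip the variable whose flip satisfies the most clauses.
            best = max(range(1, n_vars + 1),
                       key=lambda v: num_satisfied(
                           clauses, {**assign, v: not assign[v]}))
            assign[best] = not assign[best]
    return None  # no satisfying assignment found within the budget

# The CNF from Problem 1 is satisfiable (P=1, Q=2, R=3):
model = hill_climb_sat([[1, 3], [2, 3], [-3, -1, -2]], 3)
assert model is not None
```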

Problem 3 (20 points)

Let L be a first-order language over a domain of events with the following primitives:
b(E,F) -- E occurs before F. That is, E completes before F begins.
m(E,F) -- E meets F. That is, E completes just as F begins.
d(E,F) -- E occurs during F. That is, first F starts, then E starts, then E ends, then F ends.
f(E,F) -- E finishes F. That is, F starts before E but they end together.
q(E,F) -- E and F occur at the same time; that is, they start together and end together.

Constants: g (Hamlet's encounter with the ghost); a1 (act 1); a2 (act 2); p (Hamlet's encounter with the players); y (performance of the play).
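The five primitives can be illustrated concretely by modeling an event as a (start, end) pair of numbers (an illustration only; the exam's language is first-order, and the numeric intervals below are invented):

```python
def b(e, f):
    # E occurs before F: E ends strictly before F starts.
    return e[1] < f[0]

def m(e, f):
    # E meets F: E ends exactly when F begins.
    return e[1] == f[0]

def d(e, f):
    # E occurs during F: F starts, then E starts and ends, then F ends.
    return f[0] < e[0] and e[1] < f[1]

def f_rel(e, f):  # renamed from f to avoid clashing with the argument name
    # E finishes F: F starts before E, but they end together.
    return f[0] < e[0] and e[1] == f[1]

def q(e, f):
    # E and F occur at the same time: same start, same end.
    return e[0] == f[0] and e[1] == f[1]

# Illustrative intervals: act 1 meets act 2; the ghost scene is during act 1.
a1, a2, g = (0, 10), (10, 20), (2, 4)
assert m(a1, a2) and d(g, a1) and b(g, a2)
```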


Problem 4 (10 points)

Consider the following Bayesian network:

Assume that all the random variables are Boolean.

Problem 5 (10 points)

Consider the following CFG grammar for compound nouns (noun groups):
NG -> Noun
Assume that "NYU", "computer", "science", "department", and "office" are all labelled as nouns in the lexicon.
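Only the base rule NG -> Noun survives in this copy; the recursive rule appears to have been lost. Assuming the usual second rule NG -> NG NG (an assumption about the missing text), the number of parses of an n-noun group is the (n-1)st Catalan number, which a small CKY-style counter confirms:

```python
from functools import lru_cache

NOUNS = {"NYU", "computer", "science", "department", "office"}

@lru_cache(maxsize=None)
def count_parses(words):
    # NG -> Noun       (base case: a single noun)
    # NG -> NG NG      (assumed recursive rule, not shown in the handout)
    if len(words) == 1:
        return 1 if words[0] in NOUNS else 0
    # Sum over every way of splitting the span into two sub-groups.
    return sum(count_parses(words[:i]) * count_parses(words[i:])
               for i in range(1, len(words)))

# A four-noun group has Catalan(3) = 5 distinct parses.
assert count_parses(("NYU", "computer", "science", "department")) == 5
```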

Problem 6 (10 points)

Consider the following sentence:
Prof. Jones has determined that the coin dates from the time of Julius Caesar, when it was about the price of a loaf of bread.

Problem 7 (10 points)

In using the K-gram model for probabilistic tagging of text, it is necessary to apply a smoothing function of the form
Prob(t_i | t_{i-2}, t_{i-1}) = w3*Freq_C(t_i | t_{i-2}, t_{i-1}) + w2*Freq_C(t_i | t_{i-1}) + w1*Freq_C(t_i)
where C is the training corpus and w1, w2, w3 are weights.

What would be the problem if you use the set of weights w1 = 0.98, w2 = 0.01, w3 = 0.01?

Answer: This effectively eliminates the K-gram part of the model: with nearly all the weight on the unigram term, the choice of tag depends almost entirely on the overall frequency of the tag, ignoring the preceding context.

What would be the problem if you use the set of weights w1 = 0.01, w2 = 0.01, w3 = 0.98?

Answer: This is subject to the sparse data problem; if the trigram does not occur in the corpus, it will be judged as extremely unlikely.
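The effect of the two weightings can be seen on toy counts (the frequency values below are invented for illustration):

```python
def smoothed_prob(tri_freq, bi_freq, uni_freq, w1, w2, w3):
    # Linear interpolation of trigram, bigram, and unigram estimates.
    return w3 * tri_freq + w2 * bi_freq + w1 * uni_freq

# Suppose the trigram was never seen in the corpus (tri_freq = 0),
# but the tag is common as a bigram continuation and overall:
unseen = smoothed_prob(tri_freq=0.0, bi_freq=0.4, uni_freq=0.3,
                       w1=0.01, w2=0.01, w3=0.98)
balanced = smoothed_prob(tri_freq=0.0, bi_freq=0.4, uni_freq=0.3,
                         w1=0.2, w2=0.3, w3=0.5)
# With w3 = 0.98, the unseen trigram drives the estimate toward zero;
# a more balanced weighting keeps the estimate reasonable.
assert unseen < 0.01 < balanced
```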

Problem 8 (5 points)

Fill in the blanks: Suppose that we are using Naive Bayes to conjecture the value of classification attribute C from predictive attributes P,Q,R, for an instance X where X.P=p, X.Q=q, X.R=r. The calculation proceeds as follows: We wish to calculate the value V that maximizes Prob(X.C=V | X.P=p, X.Q=q, X.R=r).

By Bayes' law, Prob(X.C=V | X.P=p, X.Q=q, X.R=r) = (a) 
  Prob(X.P=p, X.Q=q, X.R=r | X.C=V) Prob(X.C=V) / Prob(X.P=p, X.Q=q, X.R=r)

We may ignore the factor (b) 1 / Prob(X.P=p, X.Q=q, X.R=r)

because (c) that is the same for all values of V (it is a normalizing factor).

We assume that attributes (d) P, Q, and R

are conditionally independent given (e) C,

giving us the equation 
  (f) Prob(X.P=p, X.Q=q, X.R=r | X.C=V) = 
  (g) Prob(X.P=p | X.C=V)  * Prob(X.Q=q | X.C=V)  * Prob(X.R=r | X.C=V)  

Thus, we conclude that the value of V that maximizes
Prob(X.C=V | X.P=p, X.Q=q, X.R=r) is the same as the value of V that maximizes
(h) Prob(X.P=p | X.C=V)  * Prob(X.Q=q | X.C=V)  * Prob(X.R=r | X.C=V)  * Prob(X.C=V)

and the numbers in this last expression can all be estimated from the
training data.
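The argmax computation above can be sketched over a toy training table (the attribute values and data below are invented for illustration):

```python
from collections import Counter

def naive_bayes_predict(train, p, q, r):
    # train: list of (P, Q, R, C) tuples. Returns the class V maximizing
    # Prob(P=p|C=V) * Prob(Q=q|C=V) * Prob(R=r|C=V) * Prob(C=V),
    # each factor estimated as a frequency in the training data.
    classes = Counter(row[3] for row in train)
    n = len(train)

    def score(v):
        rows = [row for row in train if row[3] == v]
        prior = classes[v] / n
        likelihood = 1.0
        for i, val in enumerate((p, q, r)):
            likelihood *= sum(row[i] == val for row in rows) / len(rows)
        return prior * likelihood

    return max(classes, key=score)

train = [(1, 0, 1, "yes"), (1, 1, 1, "yes"), (0, 0, 1, "yes"),
         (0, 1, 0, "no"), (1, 1, 0, "no")]
assert naive_bayes_predict(train, p=1, q=0, r=1) == "yes"
```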

Problem 9 (10 points)

Datasets often contain instances with null values in some of the attributes. Some classification learning algorithms are able to use such instances in the training set; other algorithms must discard them.

Problem 10 (5 points)

The version of the ID3 algorithm in the class handout includes the test "If AVG_ENTROPY(AS,C,T) is not substantially smaller than ENTROPY(C,T)"; if the test succeeds, the algorithm constructs a leaf corresponding to the current state of T and does not recur. "Substantially smaller" here, of course, is rather vague. What is the disadvantage of changing this to "much smaller"? What is the disadvantage of eliminating the test?

Answer: If you require that AVG_ENTROPY(AS,C,T) be much smaller than ENTROPY(C,T), then you will often fail to split on an attribute that is actually informative about C. If you eliminate the test, then you may end up splitting on attributes whose relation to the classification attribute is actually completely random, and thus end up with a tree that overfits the data.
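The two quantities in the test can be computed directly; a sketch (the function names follow the handout, but the data below is an illustrative assumption):

```python
import math

def entropy(labels):
    # Entropy of the classification attribute over a set of instances.
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def avg_entropy(values, labels):
    # Expected entropy of the labels after splitting on the attribute.
    n = len(labels)
    total = 0.0
    for v in set(values):
        sub = [c for x, c in zip(values, labels) if x == v]
        total += (len(sub) / n) * entropy(sub)
    return total

labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]   # perfectly predicts the label
random_attr = ["a", "b", "a", "b"]   # unrelated to the label

# A perfectly informative attribute drives the average entropy to zero;
# a random attribute leaves the entropy unchanged, so splitting on it
# gains nothing and risks overfitting.
assert avg_entropy(informative, labels) == 0.0
assert abs(avg_entropy(random_attr, labels) - entropy(labels)) < 1e-9
```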