Solutions to Sample Exam Problems

Problem 1:

Consider the following collection of sentences in the propositional logic.
1. P => (Q <=> R).
2. Q <=> ~(R^W)
3. Q => (P^W).
A. Convert this set to CNF. (You need not show the intermediate steps.)


1A. ~P V ~Q V R
1B. ~P V Q V ~R
2A. ~Q V ~R V ~W
2B. Q V R
2C. Q V W
3A P V ~Q
3B ~Q V W

B. Show how the Davis-Putnam assignment finds a satisfying assumption. (Assume that, when a branch point is reached, the algorithm chooses the first atom alphabetically and tries TRUE before FALSE.)

Set S0 as above.

Try P=TRUE.  Delete clause 3, delete ~P from 1A,  1B.

    Set S1
       1A. ~Q V R 
       1B. Q V ~R 
       2A. ~Q V ~R V ~W 
       2B. Q V R  
       2C. Q V W  
       3B  ~Q V W

    Try Q=TRUE. Delete clauses 1B, 2B, 2C, delete ~Q from 1A, 2A, 3B.

        1A.  R
        2A. ~R V ~W
        3B   W
        1A is a singleton clause. Set R=TRUE. Delete 1A,  delete ~R from 2A.

        2A. ~W
        3B. W

        2A is a singleton clause.  Set W=FALSE.  Delete 2A, delete W from 3B.
        3B is the null clause. Fail.

    Return to S1. Try Q=FALSE. Delete 1A, 2A, 3B, delete Q from 1B, 2B, 2C, 

       1B. ~R 
       2B.  R  
       2C.  W  
       1B is a singleton clause. Set R=FALSE. Delete 1B, delete R from 2B.

       2B is the null clause. Fail

Return to S0.  Try P=FALSE. Delete 1A, 1B.  Delete P from 3A. 

       2A.  ~Q V ~R V ~W 
2B. Q V R
2C. Q V W
3A ~Q
3B ~Q V W 3A is a singleton clause. Set Q=FALSE. Delete 2A, 3B, delete Q from 2B, 2C. 2B. R 2C. W 2B and 2C are singleton clauses. Set R=TRUE, W=TRUE. Succeed P=FALSE, Q=FALSE, R=TRUE, W=TRUE.

Problem 2:

What is the result of doing alpha-beta pruning in the game tree shown below?

Problem 3:

Consider a domain where the individuals are people and languages. Let L be the first-order language with the following primitives:
s(X,L) --- Person X speaks language L. 
c(X,Y) --- Persons X and Y can communicate. 
i(W,X,Y) --- Person W can serve as an interpreter between persons X and Y. 
j,p,e,f --- Constants: Joe, Pierre, English, and French respectively.
A. Express the following statements in L:


B. Show that (vi) can be proven from (i)---(v) using backward-chaining resolution. You must show the Skolemized form of each statement, and every resolution that is used in the final proof. You need not show the intermediate stages of Skolemization, or show resolutions that are not used in the final proof.

Answer: The Skolemized forms of (i)-(v) are

(sk1(L,M) is a Skolem function mapping L and M to a person who speaks both L and M.)

The Skolemized form of the negation of (vi) is

The backward chaining proof is then as follows:

Problem 4:

Let A, B, C be Boolean random variables. Assume that
Prob(A=T) = 0.8
Prob(B=T | A=T) = 0.5.
Prob(B=T | A=F) = 1.
Prob(C=T | B=T) = 0.1
Prob(C=T | B=F) = 0.5
A and C are conditionally independent given B.
Evaluate the following terms. (If you wish, you can give your answer as an arithmetic expression such as "0.8*0.5 / (0.8*1 + 0.5*0.1)")

Problem 5:

A. Give an example of a decision tree with two internal nodes (including the root), and explain how it classifies an example.

This is a deterministic decision tree for a data set with two Boolean predictive attributes, A and B, and a Boolean classification C. The tree first tests an example X on attribute A. If X.A is T, then the tree tests on the tree tests on attribute B. If X.B is T, then the tree predicts that X.C is T; otherwise it predicts that X.C is F. If, in the first test, X.A is F, then the tree predicts that X.C is F.

B. Describe the ID3 algorithm to construct decision trees from training data.

Answer: The ID3 algorithm builds the decision tree recursively from the top down. At each stage of the recursion, an internal node N is passed a subtable, with those instances that would reach N in the decision procedure, and those attributes that have not already been tested. If the table has more than one classification value, and has some predictive attributes left to test, and is more than a specified minimum splitting size, then the algorithm chooses the best attribute A to split on, namely the one that gives the minimum expected entropy. If the expected entropy of the classification C after splitting on A would be less than the entropy of C at node N, then node N is labelled as a test for A, the table is partitioned according to the value of A, and the partitions are passed down to the next nodes recursively.

C. What is the entropy of a classification C in table T? What is the expected entropy of classification C if the table is split on predictive attribute A?

Let pc be the fraction of instances X in T where X.C=c. Then the entropy of C in T is the sum over c of -pc log(pc).

Let qa be the fraction of instance X in T where X.A=a. Let Ta be the subtable of T of instances X such that X.A=a. Let ra,c be the fraction of instances X in Ta where X.C=c. Then the expected entropy of C after splitting on A is

suma qa (sumc -ra,c log(ra,c))

D. What kinds of techniques can be used to counter the problem of over-fitting in decision trees?

Answer: Prepruning techniques prune the decision tree as it is being built; e.g. by enforcing a minimum splitting size on nodes. Postpruning techniques build the entire decision tree, and then eliminate tests that do not seem to be useful.

Problem 6:

Consider the following data set with three Boolean predictive attributes, W,X,Y and Boolean classification C.
  W   X   Y   C
  T   T   T   T
  T   F   T   F
  T   F   F   F
  F   T   T   F
  F   F   F   T
We now encounter a new example: W=F, X=T, Y=F. If we apply the Naive Bayes method, what probability is assigned to the two values of C?

Answer: We wish to compute

argmaxc Prob(C=c | W=F, X=T, Y=F) = (by Bayes' Law)
argmaxc Prob(W=F, X=T, Y=F | C=c) Prob(C=c) / Prob(W=F, X=T, Y=F) =
      (dropping the normalizing factor, which does not depend on c)
argmaxc Prob(W=F, X=T, Y=F | C=c) Prob(C=c) =
      (assuming conditional independence)
argmaxc Prob(W=F|C=c) Prob(X=T|C=c) Prob(Y=F|C=c) Prob(C=c)
Using the frequencies in the tables to estimate the probabilities, for c=T, this product evaluates to (1/2) * (1/2) * (1/2) * (2/5) = 1/20. For c=F, it evaluates to (1/3) * (1/3) * (1/3) * (3/5) = 1/45. The Bayesian estimate, therefore, is that c=T is more likely than c=F by a factor of 2.25. To be exact the estimated probability that C=T is (1/20)/((1/20)+(1/45)) = 9/13; the estimated probability that C=F is 4/13.

Problem 7:

"Local minima can cause difficulties for a feed-forward, back-propagation neural network." Explain. Local minima of what function of what arguments? Why do they create difficulties?

Answer: Local minima of the error function over the corpus of labelled examples, as a function of the weights on the links. The learning algorithm is in effect doing a hill-climbing search for the value of the weights that gives the minimal error function. If the hill-climbing search finds a local mininum of this function, it will get stuck at that suboptimal state, and will not find the true solution.

Problem 8:

Which of the following describes the process of task execution (classifying input signal) in a feed-forward, back-propagation neural network? Which describe the process of learning? (One answer is correct for each.)

Answer: Task execution is carried out through process (a). Learning is carried out through process (d).

Problem 9

Explain briefly (2 or 3 sentences) the use of a training set and a test set in evaluating learning programs.

Answer: When you wish to evaluate a learning program using a fixed corpus of examples, it is common to divide the corpus into a training set and a test set. The program is first trained by running it on the training set. Then the learning module is turned off, and the quality of the current state of the execution module is tested by running it on the test set. In this way you guard against the possibility of overlearning, in which the program simply learns the examples in the corpus, rather than the underlying concept.