**Answer:** ``Choose" designates a non-deterministic choice point. All
possible choices must be considered. Different choices lead to different
solutions.

``Pick'' designates an arbitrary choice point. Any one of the possible options can be used; they all lead to the same solutions, though possibly in a different order. If the option picked does not lead to a solution, no other option will do any better.

Example: A non-deterministic algorithm for solving the map-coloring problem can be written as follows:

repeat pick an uncolored country X; choose a color C for X that is different from any of the colors for countries that border X; until all the countries have been colored.Here, the algorithm needs to consider all consistent partial colorings of the map, so ``choose'' is used for the color. If one color does not lead to a solution, you consider an alternative color. The ``pick'' among countries, on the other hand, determines the order in which countries are considered. This may effect the efficiency of search, but certainly cannot effect the existence or non-existence of a solution.

**Answer:** (i) It must be (i) zero-sum; (ii) perfect knowledge;
(iii) deterministic; (iv) two-player; (v) discrete.

s(X,L) --- Person X speaks language L. c(X,Y) --- Persons X and Y can communicate. i(W,X,Y) --- Person W can serve as an interpreter between persons X and Y. j,p,e,f --- Constants: Joe, Pierre, English, and French respectively.A. Express the following statements in L:

- i. Joe speaks English.
- ii Pierre speaks French.
- iii. If X and Y both speak L, then X and Y can communicate.
- iv. If W can communicate both with X and with Y, then W can serve as an interpreter between X and Y.
- v. For any two languages L and M, there is someone who speaks both L and M.
- vi. There is someone who can interpret between Joe and Pierre.

**Answer:**

- i. s(j,e).
- ii. s(p,f).
- iii. forall(X,Y,L) s(X,L) ^ s(Y,L) => c(X,Y)
- iv. forall(W,X,Y) c(X,W) ^ c(Y,W) => i(W,X,Y).
- v. forall(L,M) exists(W) s(W,L) ^ s(W,M)
- vi. exists(W) i(W,j,p)

B. Show that (vi) can be proven from (i)---(v) using backward-chaining resolution. You must show the Skolemized form of each statement, and every resolution that is used in the final proof. You need not show the intermediate stages of Skolemization, or show resolutions that are not used in the final proof.

** Answer: ** The Skolemized forms of (i)-(v) are

- 1. s(j,e).
- 2. s(p,f).
- 3. not s(X,L) V not s(Y,L) V c(X,Y).
- 4. not c(X,W) V not c(Y,W) V i(W,X,Y).
- 5. s(sk1(L,M),L).
- 6. s(sk1(L,M),M).

(sk1(L,M) is a Skolem function mapping L and M to a person who speaks both L and M.)

The Skolemized form of the negation of (vi) is

- 7. not i(W,j,p).

The backward chaining proof is then as follows:

- 8. not c(j,W) V not c(p,W). (7+4)
- 9. not s(j,L) V not s(W,L) V not c(p,W). (8+3)
- 10. not s(W,e) V not c(p,W). (9+1).
- 11. not s(W,e) V not s(p,L) V not s(W,L). (10+3)
- 12. not s(W,e) V not s(W,f). (11+2)
- 13. not s(sk1(e,M),f). (12+5)
- 14. null clause (13+6)

This is a deterministic decision tree for a data set with two Boolean predictive attributes, A and B, and a Boolean classification C. The tree first tests an example X on attribute A. If X.A is T, then the tree tests on the tree tests on attribute B. If X.B is T, then the tree predicts that X.C is T; otherwise it predicts that X.C is F. If, in the first test, X.A is F, then the tree predicts that X.C is F.

B. Describe the ID3 algorithm to construct decision
trees from training data.

**Answer:** The ID3 algorithm builds the decision tree recursively from
the top down. At each stage of the recursion, an internal node N is passed
a subtable, with those instances that would reach N in the decision procedure,
and those attributes that have not already been tested. If the table has
more than one classification value, and has some predictive attributes left
to test, and is more than a specified minimum splitting size, then
the algorithm chooses the best attribute A to split on, namely the one
that gives the minimum expected entropy. If the expected entropy of the
classification C after splitting on A would be less than the entropy of
C at node N, then node N is labelled as a test for A, the table is partitioned
according to the value of A, and the partitions are passed down to the
next nodes recursively.

C. What is the entropy of a classification C in table T? What is
the expected entropy of classification C if the table is split on
predictive attribute A?

Let p_{c} be the fraction of instances X in T where X.C=c.
Then the entropy of C in T is the sum over c of -p_{c} log(p_{c}).

Let q_{a} be the fraction of instance X in T where X.A=a.
Let T_{a} be the subtable of T of instances X such that X.A=a.
Let r_{a,c} be the fraction of instances X in T_{a} where
X.C=c. Then the expected entropy of C after splitting on A is

sum_{a}q_{a}(sum_{c}-r_{a,c}log(r_{a,c}))

D. What kinds of techniques can be used to counter the problem of over-fitting in decision trees?

** Answer: **
* Prepruning techniques * prune the decision tree as it is
being built; e.g. by enforcing a minimum splitting size on nodes.
* Postpruning techniques * build the entire decision tree, and then
eliminate tests that do not seem to be useful.

W X Y C ---------------- T T T T T F T F T F F F F T T F F F F TWe now encounter a new example: W=F, X=T, Y=F. If we apply the Naive Bayes method, what probability is assigned to the two values of C?

** Answer: **
We wish to compute

argmaxUsing the frequencies in the tables to estimate the probabilities, for c=T, this product evaluates to (1/2) * (1/2) * (1/2) * (2/5) = 1/20. For c=F, it evaluates to (1/3) * (1/3) * (1/3) * (3/5) = 1/45. The Bayesian estimate, therefore, is that c=T is more likely than c=F by a factor of 2.25. To be exact the estimated probability that C=T is (1/20)/((1/20)+(1/45)) = 9/13; the estimated probability that C=F is 4/13._{c}Prob(C=c | W=F, X=T, Y=F) = (by Bayes' Law) argmax_{c}Prob(W=F, X=T, Y=F | C=c) Prob(C=c) / Prob(W=F, X=T, Y=F) = (dropping the normalizing factor, which does not depend on c) argmax_{c}Prob(W=F, X=T, Y=F | C=c) Prob(C=c) = (assuming conditional independence) argmax_{c}Prob(W=F|C=c) Prob(X=T|C=c) Prob(Y=F|C=c) Prob(C=c)

**Answer:**
Local minima of the error function over the corpus of labelled examples,
as a function of the weights on the links. The learning algorithm
is in effect doing a hill-climbing search for the value of the weights
that gives the minimal error function. If the hill-climbing search finds
a local mininum of this function, it will get stuck at that suboptimal
state, and will not find the true solution.

- a. Activation levels are propagated from the inputs through the hidden layers to the outputs.
- b. Activation levels are propagated from the outputs through the hidden layers to the inputs.
- c. Weights on the links are modified based on messages propagated from input to output.
- d. Weights on the links are modified based on messages propagated from output to input.
- e. Connections in the network are modified, gradually shortening the path from input to output.
- f. Weights at the input level are compared to the weights at the output level, and modified to reduce the discrepancy.

** Answer:**
Task execution is carried out through process (a). Learning is
carried out through process (d).

** Answer:** When you wish to evaluate a learning program using
a fixed corpus of examples, it is common to divide the corpus into
a training set and a test set. The program is first
trained by running it on the training set. Then the learning module
is turned off, and the quality of the current state of the execution module
is tested by running it on the test set. In this way you guard against
the possibility of overlearning, in which the program simply learns
the examples in the corpus, rather than the underlying concept.

A. perfect classification hypotheses (i.e. classification hypotheses that always give the correct classification, given the values of the predictive attributes) for nominal classifications.

B. imperfect classification hypotheses (i.e. hypotheses that do better than chance) for nominal attributes.

C. approximate classification hypotheses for numeric classifications. (i.e. hypotheses that give answers that are nearly correct.)

** Answer:**

A. The original data can be encoded by storing the predictive attributes
for each example plus the hypothesis. If the space saved
on recording the classification is less than the space required for the
hypothesis, then the overall description length is reduced.

B. The original data can be encoded by storing (i) the predictive attributes;
(ii) the hypothesis; (iii) the classification X.C' which states either that the
hypothesis suceeds on example X or that it fails and gives the correct value
of C. If the entropy of C' is less than that of C, and if the data
set is large enough, then the space saved storing C' in an optimal coding
as opposed to storing C in an optimal coding will be enough to make up for
the cost of the space to store the hypothesis.

C. Let A,B,C be the predictive attributes; let D be the classification;
let f(A,B,C) be the hypothesis. For each example X, record X.A, X.B, X.C
and the error X.D - f(A,B,C). If f is a useful hypothesis, then the
range of values of the error will be much less than the range of values
of X.D; hence the error will require fewer bits, for a given level of
precision, than X.D. If the space thus saved is more than the space
required to store f, then the description length is reduced.