## Final Exam: Artificial Intelligence

Problem 1 (15 points)
Define each of the following briefly (2 or 3 sentences):

• A. Simulated annealing:
Answer: A variant of hill-climbing in which the search can, with a small probability, move to the neighboring state is lower. The probability of going "downhil" is governed by a parameter known as "temperature" and is reduced toward zero over iterations of the algorithm.
• B. Factoring rule (in first-order resolution):
Answer: If clause C contains literals L1 and L2 with the same sign that can be unified with most general unifier U, then:
Step 1: C' := U(C)
Step 2: Delete the repeated instance of U(L1) = U(L2) from C'. Assert the new clause.
Example: Let C be the clause "p(X,a) or p(b,Y) or q(X,Y)". Then the first and second literals can be unified under the substitution X=b,Y=a. Factoring then generates the new clause "p(b,a) or q(b,a)".
• C. Bayes' Law:
Answer: Prob(B|A) = Prob(A|B) Prob(B) / Prob(A).
• D. K-fold cross-validation:
Answer: To estimate the accuracy of a learning procedure over a small data set, carry out the following procedure:
1. Partition the data set into K parts of equal size, known as "folds".
```2. For I = 1 to K do
Use all the folds except the Ith as the training set
Evaluate the accuracy of the output hypothesis using the Ith fold as the
test set.
```
3. Take the average of the accuracies computed in (2) as an estimate of the actual accuracy.
• E. Confidence interval:
Answer: Suppose you are trying to determine the frequency of feature C in space D using a sample T, and the frequency of C in T is p. Let q be the true frequencey of C in D. An interval I containing p is a confidence interval at the L level (L is a probability, such as "95%") if, roughly, you have confidence L that the q is in I. More precisely, I is generated from p using a procedure f(p) which has the property that, over samples of size T, the probability that f(p(T)) contains q is L.

Problem 2 (15 points)
The BIN-PACKING problem is defined as follows: You have K bins, each of size M, and you have a collection of objects of varying sizes. Put the objects into the bins. Formally, find an assigment A[O] that assigns a bin to each object in such a way that, for each bin B, the sum of the sizes of the objects assigned to B is no more than M.

For example, if K=3, M=15, and the objects are A-10, B-8, C-6, D-6, E-4, F-4, G-3, H-2, then one solution is to put A,G,H in one bin; B,D in the second; and C,E,F in the third.

A. Write a non-deterministic algorithm to solve the BIN-PACKING problem. Describe the state space explored by this algorithm.

B. Describe a different state space that could be used to solve the problem.

Answer: BIN-PACKING is unusual in that there are two natural ways to do a systematic search.

The first is to loop through the objects, and assign a bin to each object

```function BIN-PACKING(in K,M, OBJS;  out A[OBJS]);
for I := 1 to K do capacity[I] := M;
for O in OBJS do
choose bin I between 1 and K such that capacity[I] >=  O.size;
capacity[I] =- O.size;
A[O] := I;
```

Here a state is an assignment of each of the first J objects to a bin. The operator is to assign the next object to a bin where it fits. The start state is the empty assignment, and a goal state is an assignment of all the objects.

The second approach is to loop through the bins, and assign objects that fit.

```function BIN-PACKING(in K,M, OBJS;  out A[OBJS])
for I := 1 to K do
C := M;
for O in OBJS do
if C >= O.size then
choose CHOICE in {TRUE,FALSE};
if CHOICE then
C =- O.size;
A[O] := I;
```
Here, a state is an assignment of some objects to bins 1 .. I. An operator is either to assign a new object to bin I or to increment I and add a new object to the new bin I. Same start state and goal state.

Other state spaces are also possible. E.g. one can define a state as an assignment, valid or invalid, of all objects to bin, and operators as moving an object from one bin to another or swapping two objects in different bins. One would use hill-climbing or some variant over a space of this structure.

Problem 3 (10 points)
A. Define the syntax of the propositional calculus.

Answer: The symbols are propositional atoms, boolean operators, and parentheses. The syntax is given by the single rule
S ::= atom | not S | (S V S) | (S ^ S) | (S => S) | (S <=> S)

B. Comparing the propositional calculus with the predicate calculus as languages for AI representation and reasoning, what are the advantages and disadvantages of each? (I'm looking for a very short answer here; one sentence on either side is plenty.)

Answer: The advantage of the predicate calculus is that it is more expressive. It provides a structure for propositions in terms of predicates and functions applied to terms, and it suppport the statement of general rules using the quantifiers.

The advantage of the propositional calculus is that it is much more tractable. Specifically inference in the propositional calculus is decidable, whereas in the predicate calculus it is only semi-decidable.

Problem 4 (20 points)
Consider the following inference:
Given the premises:

a. If everyone at a party enjoys it, then the party is a success.
b. Everyone who dances at a party enjoys it.
c. Everyone who sings at a party enjoys it.
d. Everyone at party p1 either danced or sang.
Infer that (e) party p1 was a success.

Let L be the first-order language with the following primitives:

```e(X,P) --- X enjoyed party P.
a(X,P) --- X attended party P.
d(X,P) --- X danced at party P.
s(X,P) --- X sang at party P.
u(P)   --- Party P was a success.
```

A. Express (a)-(e) in L.

Answer: First, note that the statement "Everyone at party p0 enjoyed it" is expressed "forall(X) a(X,p0) => e(X,p0)." Therefore the statement (a) is expressed;

```forall(P) [forall(X) a(X,P) => e(X,P)] => u(P)
```
No one in the class got this right on the exam, though one person came close.

The rest are straightforward.

```b. forall(X,P) d(X,P) => e(X,P).
c. forall(X,P) d(X,P) => e(X,P).
d. forall(X,P) a(X,p1) => [d(X,p1) V s(X,p1)].
e. u(p1).
```

Answer: B. Show how (e) can be proven from (a-d) using resolution. You do NOT have to show the intermediate stages of Skolemization. You DO have to show all the clauses and all the resolutions.

Answer: Skolemizing (a-d) and the negation of (e), we get the following clauses:

```1. a(sk(P),P) V u(P).  (from a.)
2. not e(sk(P),P) V u(P) (from a.)
3. not d(X,P) V e(X,P) (from b.)
4. not s(X,P) V e(X,P) (from c.)
5. not a(X,p1) V d(X,p1) V s(X,p1) (from d.)
6. not u(p1).
```
. The Skolem function sk(P) maps an unsuccessful party P into someone who was there but didn't enjoy it.

Applying resolution, we get the following:

```7. a(sk(p1),p1). (from 6 and 1).
8. not e(sk(p1),p1). (from 6 and 2).
9. not d(sk(p1),p1). (from 8 and 3).
10. not s(sk(p1),p1). (from 8 and 4).
11. not a(sk(p1),p1) V s(sk(p1),p1). (from 9 and 5).
12. not a(sk(p1),p1). (from 11 and 10).
13. null. (from 12 and 7).
```

Problem 5 (15 points)
One can view an algorithm for learning decision trees as a kind of state space search where the a state is a tree and an operator modifies one tree into another tree, using the data as guide. For example in ID3, the operator is to split a leaf on the best attribute, and the search strategy is a simple greedy algorithm.

A. Suppose that, instead, you wanted to use a more highly connected state space (same states, but more operators), and to apply a hill-climbing strategy. Propose at least one plausible new operator that might be worth trying. (You can assume, if you want, that all the predictive attributes are Boolean.)

Answer: The operator that most people suggested was just to split on some attribute other than the best.

Another possible operator would be to swap the attributes in two nodes N and M in the tree. Since all attributes are boolean, this can be done without changing the structure of the tree. It does lead to trees with redudant tests (same attribute tested twice on a single path); those can either be cleaned up with some further tree manipulation, or one can simply allow them.

B. Propose an objective function for part (A) that attempts to solve the problem of overfitting.

Answer: You want an objective function that rewards accuracy over the training set, while penalizing trees that are too refined. The latter can be done in any number of ways; using an MDL measure, or charging a penalty for every node in the tree, or for every node where the set being split is too small, etc.

Problem 6 (10 points)
A common model in statistical language learning is called the bigram model. Given a large text T, one constructs the following table: For every pair of words X and Y in the language, one finds all the occurrences of X in T, and records the fraction of occurrences of X that are followed by an occurrence of Y.

For example, given a toy language with three words, A,B,C and the text T=AABBCCABCACBA, the table would record the following information:

Freq(TN+1=A | TN=A) = 1/4
Freq(TN+1=B | TN=A) = 1/2
Freq(TN+1=C | TN=A) = 1/4
(That is, of the occurences of A in the table, ignoring the last, 1/4 are followed by another A, 1/2 by B, and 1/4 by C.)

Freq(TN+1=A | TN=B) = 1/4
Freq(TN+1=B | TN=B) = 1/4
Freq(TN+1=C | TN=B) = 1/2

Freq(TN+1=A | TN=C) = 1/2
Freq(TN+1=B | TN=C) = 1/4
Freq(TN+1=C | TN=C) = 1/4

How can the extraction of these statistics be justified in minimum description length (MDL) learning theory?

Answer: According to MDL theory, a pattern P can be learned can be learned from data D if there is some encoding D' of D such that D' can be reconstructed from P and D and such that |D'| < |D| + |P|

One way to construct such a code in this example is this: Use a code for each letter where the meaning of the code depends on the previous letter. Specifically:

The first letter in the string is encoded "00" if A; "01" if B; "10" if C.
The code for the I+1st letter is
"0" if [TI=A and TI+1=B] or [TI=B and TI+1=C] or [TI=C and TI+1=A];
"10" if [TI=A and TI+1=A] or [TI=B and TI+1=B] or [TI=C and TI+1=C];
"11" if [TI=A and TI+1=C] or [TI=B and TI+1=A] or [TI=C and TI+1=B];
Thus, the code for the string T=AABBCCABCACBA is "00100100100000111111". This uses 20 bits, whereas the natural encoding of two bits per letter would use 26 bits. For a string of length N+1, after the first letter there would be N/2 letters that would be encoded as "0", N/4 that would be encoded "10" and N/4 that would be encoded "11", for a total of 2+3N/2 bits, as contrasted with 2N bits for the simple encoding, or N*log23 = 1.585 N, for an encoding that uses only the fact that there are three letters of equal probability 1/3. If N is large enough, then this saves enough bits to make up for the cost of saving the table.

In general, for such a bigram model, you develop for each letter X you develop a minimum bit encoding for the possible letters following X.

Another solution, proposed by several students, is to encode two letter sequences. Thus the sequences "AB", "BC" and "CA" would get two bit codes, while "AA", "AC", "BB", "BA", "CB", and "CC" would get four bit codes. The bit-count efficiency is exactly the same as in the solution above.

Problem 7 (15 points)
You have developed a wonderful new theory of classification learning, called "the NYU algorithm". You input data, it outputs a formula. You decide to test your theory on the problem of predicting in January who will win a political election in November given attributes like office, region, party, incumbency, success in raising funds to date, current poll numbers, etc. You have a database with all this data for all American elections over the last twenty years. You ask a research assistant to test the NYU algorithm on this data. He comes back with the good news that when he ran the NYU algorithm over the your data set, it output a formula that gives the right answer 80% of the time on the data.

A. What further tests should you run before you publish a paper in a political science journal, claiming that you have found a formula that solves the problem of predicting political elections?

B. What further tests should you run before you publish a paper in a machine learning journal, claiming that you have a promising new technique for machine learning?

Answer: The test that has been run, using the same data for training and testing, is pretty much worthless, as it can be passed at 100% by a learning algorithm that just memorizes the data. Since the data set is very large, there should be no problems running tests that separate out the test set from the training set.

There are also two other types of tests that should be run. First is a test to check whether the NYU formula is actually more accurate than simpler formulas that could be found by other learning algorithms, such as 1R. It could well be, for example, that the single rule "Predict that the incumbent will win, if he is running; else guess randomly" will do just as well. You would want to check this before publishing either the formula or the algorithm.

Second are tests to see how well this algorithm runs on other data sets, or on small subsets of this data set. These are of no importance to the political science journal, but would be important in the machine learning journal.