## Problem Set 8

Assigned: Nov. 28
Due: Dec. 12

Let D be a data set with three predictive attributes, P, Q, and R, and one classification attributes C. Attributes P, Q, and R are Boolean. Attribute C has three values: 1, 2, and 3. There are 40 instances in total. The data is as follows

Line P Q R C Number of
instances.
1. Y Y Y 1 2
2. Y Y Y 2 6
3. Y Y Y 3 2
4. Y Y N 1 2
5. Y Y N 2 6
6. Y Y N 3 0
7. Y N Y 1 3
8. Y N Y 2 0
9. Y N Y 3 1
10. Y N N 1 0
11. Y N N 2 0
12. Y N N 3 0
13. N Y Y 1 0
14. N Y Y 2 5
15. N Y Y 3 1
16. N Y N 1 0
17. N Y N 2 0
18. N Y N 3 0
19. N N Y 1 7
20. N N Y 2 1
21. N N Y 3 0
22. N N N 1 0
23. N N N 2 1
24. N N N 3 3

### Problem 1:

Suppose that one computes the ``nearest neighbors'' prediction as follows: Given a new instance X, collect all the instances that agree with X on all three attributes and give each instance 2 votes, and collect all instances that agree with X on two attributes and give each instance 1 vote. What is the prediction for P=Y, Q=N, R=N?

Answer: Lines 4 and 7 give 5 votes for 1. Lines 5 and 23 give 7 votes for 2. Lines 9 and 24 give 4 votes for 3. Result: 1.

What is the prediction for P=Y, Q=N, R=Y?

Answer: Lines 1, 7(x2), and 19 give 15 votes for 1. Lines 2 and 20 give 7 votes for 2 Lines 3 and 9(x2) give 4 votes for 3. Result: 1.

### Problem 2:

What rule is output by the 1R algorithm? What is its prediction for a new instance with P=Y,Q=N,R=N?

Answer: The best rule based on P is

```case (P) { Y: predict C=2;
N: predict C=1 or predict C=2 (tie)
}
```
with an accuracy of 19/40.

The best rule based on Q is

```case (Q) { Y: predict C=2;
N: predict C=1
}
```
with an accuracy of 27/40.

The best rule based on R is

```case (R) { Y: predict C=2;
N: predict C=2
}
```
(i.e. always predict C=2) with an accuracy of 19/40. So the best rule is the one based on Q above. For the given instance, it predicts C=1.

### Problem 3:

What is the prediction of Naive Bayes algorithm for a new instance with P=Y,Q=Y,R=Y?

Prob(C=1|P=Y,Q=Y,R=Y) = Prob(P=Y|C=1) * Prob(Q=Y|C=1) * Prob(R=Y|C=1) * Prob(C=1) = (7/14) * (12/14) * (12/14) * (14/40) = 0.129
Prob(C=2|P=Y,Q=Y,R=Y) = Prob(P=Y|C=2) * Prob(Q=Y|C=2) * Prob(R=Y|C=2) * Prob(C=2) = (12/19) * (17/19) * (12/19) * (19/40) = 0.169
Prob(C=3|P=Y,Q=Y,R=Y) = Prob(P=Y|C=3) * Prob(Q=Y|C=3) * Prob(R=Y|C=3) * Prob(C=3) = (3/7) * (3/7) * (4/7) * (7/40) = 0.018
So the prediction is C=2.

What is its prediction for P=Y,Q=N,R=N?

Prob(C=1|P=Y,Q=N,R=N) = Prob(P=Y|C=1) * Prob(Q=N|C=1) * Prob(R=N|C=1) * Prob(C=1) = (7/14) * (2/14) * (2/14) * (14/40) = 0.0018
Prob(C=2|P=Y,Q=N,R=N) = Prob(P=Y|C=2) * Prob(Q=N|C=2) * Prob(R=N|C=2) * Prob(C=2) = (12/19) * (2/19) * (7/19) * (19/40) = 0.0116
Prob(C=3|P=Y,Q=N,R=N) = Prob(P=Y|C=3) * Prob(Q=N|C=3) * Prob(R=N|C=3) * Prob(C=3) = (3/7) * (4/7) * (3/7) * (7/40) = 0.018
So the prediction is C=3.

### Problem 4:

What is the root node in the decision tree output by the ID3 algorithm? In computing the root node, what is the value of AVG_ENTROPY(A,C,T) for each predictive attribute A? (The book calls this "Remainder(A)" (p. 660).)

```AVG_ENTROPY(P,C,T) = (22/40) ENTROPY(C,SUBTABLE(P,Y,T)) + (18/40) SUBTABLE(P,N,T) =
(22/40) -[(7/22) log (7/22) + (12/22) log (12/22) + (3/22) log(3/22)] +
(18/40) -[(7/18) log (7/18) + (7/18) log (7/18) + (4/18) log(4/18)] =
(22/40) (1.3946) + (18/40) (1.5419) = 1.4608
(Logs calculated to base 2).

AVG_ENTROPY(Q,C,T) = (24/40) ENTROPY(C,SUBTABLE(Q,Y,T)) + (16/40) SUBTABLE(Q,N,T) =
(24/40) -[(4/24) log(4/24) + (17/24) log(17/24) + (3/24) log(3/24)] +
(16/40) -[(10/16) log(10/16) + (2/16) log(2/16) + (4/16) log(4/16)] = 0.934

AVG_ENTROPY(R,C,T) = (28/40) ENTROPY(C,SUBTABLE(R,Y,T)) + (12/40) SUBTABLE(R,N,T) =
(28/40) -[(12/28) log(12/28) + (12/28) log(12/28) + (4/28) log(4/28)] +
(12/40) -[(2/12) log(2/12) + (7/12) log(7/12) + (3/12) log(3/12)] = 1.5267

```
The top node splits on Q. When the top-level call to ID3 calls itself recursively, what are the sub-tables that are passed as parameters? The table in the first recursive call (Q=Y) is lines 1-6, 13-18 with column Q deleted. The table in the second recursive call (Q=N) is lines 7-12, 19-24 with column Q deleted.`