Due: Dec. 12

Let D be a data set with three predictive attributes, P, Q, and R, and one classification attribute, C. Attributes P, Q, and R are Boolean. Attribute C has three values: 1, 2, and 3. There are 40 instances in total. The data is as follows:

Line | P | Q | R | C | Number of instances
---|---|---|---|---|---
1 | Y | Y | Y | 1 | 2
2 | Y | Y | Y | 2 | 6
3 | Y | Y | Y | 3 | 2
4 | Y | Y | N | 1 | 2
5 | Y | Y | N | 2 | 6
6 | Y | Y | N | 3 | 0
7 | Y | N | Y | 1 | 3
8 | Y | N | Y | 2 | 0
9 | Y | N | Y | 3 | 1
10 | Y | N | N | 1 | 0
11 | Y | N | N | 2 | 0
12 | Y | N | N | 3 | 0
13 | N | Y | Y | 1 | 0
14 | N | Y | Y | 2 | 5
15 | N | Y | Y | 3 | 1
16 | N | Y | N | 1 | 0
17 | N | Y | N | 2 | 0
18 | N | Y | N | 3 | 0
19 | N | N | Y | 1 | 7
20 | N | N | Y | 2 | 1
21 | N | N | Y | 3 | 0
22 | N | N | N | 1 | 0
23 | N | N | N | 2 | 1
24 | N | N | N | 3 | 3

What is the prediction for P=Y, Q=N, R=N?

**Answer:** Lines 4 and 7 give 5 votes for 1. Lines 5 and 23 give 7 votes for 2. Lines 9 and 24 give 4 votes for 3. Result: 2.

What is the prediction for P=Y, Q=N, R=Y?

**Answer:**
Lines 1, 7 (x2), and 19 give 15 votes for 1.
Lines 2 and 20 give 7 votes for 2.
Lines 3 and 9 (x2) give 4 votes for 3.
Result: 1.
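The vote tallies above can be reproduced mechanically. The sketch below is mine, not part of the assignment: `ROWS` encodes the data table, and `weighted_votes` weights each stored instance by 2 when it matches the query exactly and by 1 when it differs in a single attribute, which is the weighting the "(x2)" notation in the worked answers appears to use.

```python
from collections import Counter

# The data table from the problem: one tuple (P, Q, R, C, count) per line.
ROWS = [
    ("Y", "Y", "Y", 1, 2), ("Y", "Y", "Y", 2, 6), ("Y", "Y", "Y", 3, 2),
    ("Y", "Y", "N", 1, 2), ("Y", "Y", "N", 2, 6), ("Y", "Y", "N", 3, 0),
    ("Y", "N", "Y", 1, 3), ("Y", "N", "Y", 2, 0), ("Y", "N", "Y", 3, 1),
    ("Y", "N", "N", 1, 0), ("Y", "N", "N", 2, 0), ("Y", "N", "N", 3, 0),
    ("N", "Y", "Y", 1, 0), ("N", "Y", "Y", 2, 5), ("N", "Y", "Y", 3, 1),
    ("N", "Y", "N", 1, 0), ("N", "Y", "N", 2, 0), ("N", "Y", "N", 3, 0),
    ("N", "N", "Y", 1, 7), ("N", "N", "Y", 2, 1), ("N", "N", "Y", 3, 0),
    ("N", "N", "N", 1, 0), ("N", "N", "N", 2, 1), ("N", "N", "N", 3, 3),
]

def weighted_votes(p, q, r):
    """Tally votes per class: weight 2 for rows matching the query exactly,
    weight 1 for rows that differ in exactly one attribute."""
    votes = Counter()
    for P, Q, R, c, n in ROWS:
        distance = (P != p) + (Q != q) + (R != r)
        if distance == 0:
            votes[c] += 2 * n
        elif distance == 1:
            votes[c] += n
    return votes

print(weighted_votes("Y", "N", "Y").most_common())
# [(1, 15), (2, 7), (3, 4)] -> predict C=1
```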

What is the best rule based on a single attribute (1R), and what does it predict for the instance above?

**Answer:**
The best rule based on P is

case (P) { Y: predict C=2; N: predict C=1 or predict C=2 (tie) } with an accuracy of 19/40.

The best rule based on Q is

case (Q) { Y: predict C=2; N: predict C=1 } with an accuracy of 27/40.

The best rule based on R is

case (R) { Y: predict C=2; N: predict C=2 } (i.e., always predict C=2; for R=Y, C=1 ties with C=2) with an accuracy of 19/40. So the best rule is the one based on Q above. For the given instance, it predicts C=1.
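The three accuracies can be checked with a short script. This is a sketch under the assumption that the best 1R rule for an attribute predicts the majority class for each of the attribute's values; `one_r_correct` and the `ROWS` encoding are mine, not from the assignment.

```python
from collections import Counter

# The data table from the problem: one tuple (P, Q, R, C, count) per line.
ROWS = [
    ("Y", "Y", "Y", 1, 2), ("Y", "Y", "Y", 2, 6), ("Y", "Y", "Y", 3, 2),
    ("Y", "Y", "N", 1, 2), ("Y", "Y", "N", 2, 6), ("Y", "Y", "N", 3, 0),
    ("Y", "N", "Y", 1, 3), ("Y", "N", "Y", 2, 0), ("Y", "N", "Y", 3, 1),
    ("Y", "N", "N", 1, 0), ("Y", "N", "N", 2, 0), ("Y", "N", "N", 3, 0),
    ("N", "Y", "Y", 1, 0), ("N", "Y", "Y", 2, 5), ("N", "Y", "Y", 3, 1),
    ("N", "Y", "N", 1, 0), ("N", "Y", "N", 2, 0), ("N", "Y", "N", 3, 0),
    ("N", "N", "Y", 1, 7), ("N", "N", "Y", 2, 1), ("N", "N", "Y", 3, 0),
    ("N", "N", "N", 1, 0), ("N", "N", "N", 2, 1), ("N", "N", "N", 3, 3),
]

def one_r_correct(attr):
    """Number of correct predictions (out of 40) for the best rule that
    tests only the given attribute (0=P, 1=Q, 2=R): for each attribute
    value, predict the majority class among the matching instances."""
    correct = 0
    for value in ("Y", "N"):
        by_class = Counter()
        for row in ROWS:
            if row[attr] == value:
                by_class[row[3]] += row[4]
        correct += max(by_class.values())
    return correct

for name, attr in (("P", 0), ("Q", 1), ("R", 2)):
    print(name, one_r_correct(attr), "out of 40")  # P: 19, Q: 27, R: 19
```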

What is the prediction of Naive Bayes for P=Y, Q=Y, R=Y?

**Answer:**

Prob(C=1|P=Y,Q=Y,R=Y) = Prob(P=Y|C=1) * Prob(Q=Y|C=1) * Prob(R=Y|C=1) *
Prob(C=1) = (7/14) * (4/14) * (12/14) * (14/40) = 0.043

Prob(C=2|P=Y,Q=Y,R=Y) = Prob(P=Y|C=2) * Prob(Q=Y|C=2) * Prob(R=Y|C=2) *
Prob(C=2) = (12/19) * (17/19) * (12/19) * (19/40) = 0.169

Prob(C=3|P=Y,Q=Y,R=Y) = Prob(P=Y|C=3) * Prob(Q=Y|C=3) * Prob(R=Y|C=3) *
Prob(C=3) = (3/7) * (3/7) * (4/7) * (7/40) = 0.018

So the prediction is C=2.

What is its prediction for P=Y,Q=N,R=N?

**Answer:**

Prob(C=1|P=Y,Q=N,R=N) = Prob(P=Y|C=1) * Prob(Q=N|C=1) * Prob(R=N|C=1) *
Prob(C=1) = (7/14) * (10/14) * (2/14) * (14/40) = 0.0179

Prob(C=2|P=Y,Q=N,R=N) = Prob(P=Y|C=2) * Prob(Q=N|C=2) * Prob(R=N|C=2) *
Prob(C=2) = (12/19) * (2/19) * (7/19) * (19/40) = 0.0116

Prob(C=3|P=Y,Q=N,R=N) = Prob(P=Y|C=3) * Prob(Q=N|C=3) * Prob(R=N|C=3) *
Prob(C=3) = (3/7) * (4/7) * (3/7) * (7/40) = 0.0184

So the prediction is C=3.
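Both Naive Bayes calculations can be verified numerically. The sketch below estimates the priors and class-conditional probabilities by simple relative frequencies with no smoothing, matching the fractions used above; `naive_bayes_scores` and the `ROWS` encoding are mine, not from the assignment. The scores are unnormalized, so only their ranking matters.

```python
from collections import Counter

# The data table from the problem: one tuple (P, Q, R, C, count) per line.
ROWS = [
    ("Y", "Y", "Y", 1, 2), ("Y", "Y", "Y", 2, 6), ("Y", "Y", "Y", 3, 2),
    ("Y", "Y", "N", 1, 2), ("Y", "Y", "N", 2, 6), ("Y", "Y", "N", 3, 0),
    ("Y", "N", "Y", 1, 3), ("Y", "N", "Y", 2, 0), ("Y", "N", "Y", 3, 1),
    ("Y", "N", "N", 1, 0), ("Y", "N", "N", 2, 0), ("Y", "N", "N", 3, 0),
    ("N", "Y", "Y", 1, 0), ("N", "Y", "Y", 2, 5), ("N", "Y", "Y", 3, 1),
    ("N", "Y", "N", 1, 0), ("N", "Y", "N", 2, 0), ("N", "Y", "N", 3, 0),
    ("N", "N", "Y", 1, 7), ("N", "N", "Y", 2, 1), ("N", "N", "Y", 3, 0),
    ("N", "N", "N", 1, 0), ("N", "N", "N", 2, 1), ("N", "N", "N", 3, 3),
]

def naive_bayes_scores(p, q, r):
    """Unnormalized Naive Bayes score Prob(x|C) * Prob(C) per class."""
    class_totals = Counter()
    for *_, c, n in ROWS:
        class_totals[c] += n
    total = sum(class_totals.values())  # 40 instances

    def cond(attr, value, c):
        # Relative-frequency estimate of Prob(attribute = value | C = c).
        matching = sum(row[4] for row in ROWS
                       if row[attr] == value and row[3] == c)
        return matching / class_totals[c]

    return {c: (class_totals[c] / total)
               * cond(0, p, c) * cond(1, q, c) * cond(2, r, c)
            for c in class_totals}

for query in (("Y", "Y", "Y"), ("Y", "N", "N")):
    scores = naive_bayes_scores(*query)
    print(query, "->", max(scores, key=scores.get))
    # C=2 for (Y, Y, Y); C=3 for (Y, N, N)
```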

Trace the execution of ID3 on this data set. Which attribute is tested at the top node of the decision tree?

**Answer:**

AVG_ENTROPY(P,C,T)
= (22/40) ENTROPY(C, SUBTABLE(P,Y,T)) + (18/40) ENTROPY(C, SUBTABLE(P,N,T))
= (22/40) (-[(7/22) log(7/22) + (12/22) log(12/22) + (3/22) log(3/22)])
\+ (18/40) (-[(7/18) log(7/18) + (7/18) log(7/18) + (4/18) log(4/18)])
= (22/40)(1.3946) + (18/40)(1.5420)
= 1.4609

AVG_ENTROPY(Q,C,T)
= (24/40) ENTROPY(C, SUBTABLE(Q,Y,T)) + (16/40) ENTROPY(C, SUBTABLE(Q,N,T))
= (24/40) (-[(4/24) log(4/24) + (17/24) log(17/24) + (3/24) log(3/24)])
\+ (16/40) (-[(10/16) log(10/16) + (2/16) log(2/16) + (4/16) log(4/16)])
= (24/40)(1.1582) + (16/40)(1.2988)
= 1.2145

AVG_ENTROPY(R,C,T)
= (28/40) ENTROPY(C, SUBTABLE(R,Y,T)) + (12/40) ENTROPY(C, SUBTABLE(R,N,T))
= (28/40) (-[(12/28) log(12/28) + (12/28) log(12/28) + (4/28) log(4/28)])
\+ (12/40) (-[(2/12) log(2/12) + (7/12) log(7/12) + (3/12) log(3/12)])
= (28/40)(1.4488) + (12/40)(1.3844)
= 1.4295

(All logs are calculated to base 2.) Q has the smallest average entropy, so the top node splits on Q.

When the top-level call to ID3 calls itself recursively, what are the sub-tables that are passed as parameters?

**Answer:** The table in the first recursive call (Q=Y) is lines 1-6 and 13-18 with column Q deleted. The table in the second recursive call (Q=N) is lines 7-12 and 19-24 with column Q deleted.
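The three average entropies can be recomputed to check the arithmetic. This sketch implements the standard weighted average of class entropies over the subtables for each attribute value, with base-2 logs; `avg_entropy` and the `ROWS` encoding are mine, not from the assignment.

```python
from collections import Counter
from math import log2

# The data table from the problem: one tuple (P, Q, R, C, count) per line.
ROWS = [
    ("Y", "Y", "Y", 1, 2), ("Y", "Y", "Y", 2, 6), ("Y", "Y", "Y", 3, 2),
    ("Y", "Y", "N", 1, 2), ("Y", "Y", "N", 2, 6), ("Y", "Y", "N", 3, 0),
    ("Y", "N", "Y", 1, 3), ("Y", "N", "Y", 2, 0), ("Y", "N", "Y", 3, 1),
    ("Y", "N", "N", 1, 0), ("Y", "N", "N", 2, 0), ("Y", "N", "N", 3, 0),
    ("N", "Y", "Y", 1, 0), ("N", "Y", "Y", 2, 5), ("N", "Y", "Y", 3, 1),
    ("N", "Y", "N", 1, 0), ("N", "Y", "N", 2, 0), ("N", "Y", "N", 3, 0),
    ("N", "N", "Y", 1, 7), ("N", "N", "Y", 2, 1), ("N", "N", "Y", 3, 0),
    ("N", "N", "N", 1, 0), ("N", "N", "N", 2, 1), ("N", "N", "N", 3, 3),
]

def avg_entropy(attr):
    """Weighted average class entropy after splitting on one attribute
    (0=P, 1=Q, 2=R), using base-2 logs; zero-probability terms are skipped."""
    total = sum(row[4] for row in ROWS)  # 40 instances
    result = 0.0
    for value in ("Y", "N"):
        by_class = Counter()
        for row in ROWS:
            if row[attr] == value:
                by_class[row[3]] += row[4]
        size = sum(by_class.values())
        entropy = -sum((n / size) * log2(n / size)
                       for n in by_class.values() if n > 0)
        result += (size / total) * entropy
    return result

for name, attr in (("P", 0), ("Q", 1), ("R", 2)):
    print(f"AVG_ENTROPY({name}) = {avg_entropy(attr):.4f}")
```

Q gives the smallest value, so it is the attribute ID3 splits on first.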