Solution Set 6

Problem 1

Let A, B, C, D be Boolean random variables. Assume that A and B are (absolutely) independent, that C is independent of B given A, and that D is independent of C given A and B.

P(A=T) = 0.3
P(B=T) = 0.6
P(C=T|A=T) = 0.8
P(C=T|A=F) = 0.4
P(D=T|A=T,B=T) = 0.7
P(D=T|A=T,B=F) = 0.8
P(D=T|A=F,B=T) = 0.1
P(D=T|A=F,B=F) = 0.2

Compute the following quantities:

P(D=T)
 = P(D=T,A=T,B=T) + P(D=T,A=T,B=F) + P(D=T,A=F,B=T) + P(D=T,A=F,B=F)
 = P(D=T|A=T,B=T) P(A=T,B=T) + P(D=T|A=T,B=F) P(A=T,B=F) +
   P(D=T|A=F,B=T) P(A=F,B=T) + P(D=T|A=F,B=F) P(A=F,B=F)
 = (since A and B are absolutely independent)
   P(D=T|A=T,B=T) P(A=T) P(B=T) + P(D=T|A=T,B=F) P(A=T) P(B=F) +
   P(D=T|A=F,B=T) P(A=F) P(B=T) + P(D=T|A=F,B=F) P(A=F) P(B=F)
 = 0.7*0.3*0.6 + 0.8*0.3*0.4 + 0.1*0.7*0.6 + 0.2*0.7*0.4
 = 0.32

P(D=F,C=T)
 = P(D=F,C=T,A=T,B=T) + P(D=F,C=T,A=T,B=F) + P(D=F,C=T,A=F,B=T) + P(D=F,C=T,A=F,B=F)
 = P(D=F,C=T|A=T,B=T) P(A=T,B=T) + P(D=F,C=T|A=T,B=F) P(A=T,B=F) +
   P(D=F,C=T|A=F,B=T) P(A=F,B=T) + P(D=F,C=T|A=F,B=F) P(A=F,B=F)
 = (since C and D are independent given A and B)
   P(D=F|A=T,B=T) P(C=T|A=T,B=T) P(A=T,B=T) + P(D=F|A=T,B=F) P(C=T|A=T,B=F) P(A=T,B=F) +
   P(D=F|A=F,B=T) P(C=T|A=F,B=T) P(A=F,B=T) + P(D=F|A=F,B=F) P(C=T|A=F,B=F) P(A=F,B=F)
 = (since C is independent of B given A, and A and B are absolutely independent)
   P(D=F|A=T,B=T) P(C=T|A=T) P(A=T) P(B=T) + P(D=F|A=T,B=F) P(C=T|A=T) P(A=T) P(B=F) +
   P(D=F|A=F,B=T) P(C=T|A=F) P(A=F) P(B=T) + P(D=F|A=F,B=F) P(C=T|A=F) P(A=F) P(B=F)
 = 0.3*0.8*0.3*0.6 + 0.2*0.8*0.3*0.4 + 0.9*0.4*0.7*0.6 + 0.8*0.4*0.7*0.4
 = 0.3032

P(A=T|C=T) = P(C=T|A=T) P(A=T) / P(C=T).
Now P(C=T) = P(C=T,A=T) + P(C=T,A=F) = P(C=T|A=T) P(A=T) + P(C=T|A=F) P(A=F) = 0.8*0.3 + 0.4*0.7 = 0.52,
so P(A=T|C=T) = P(C=T|A=T) P(A=T) / P(C=T) = 0.8*0.3/0.52 = 0.46.

P(A=T|D=F) = P(D=F|A=T) P(A=T) / P(D=F).
Now P(D=F) = 1 - P(D=T) = 0.68 from the first quantity above.
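The first three quantities can be checked by brute-force enumeration over the joint distribution, which the stated independencies factor as P(A) P(B) P(C|A) P(D|A,B). The sketch below is not part of the solution; the `joint` and `prob` helper names are illustrative.

```python
# Check of P(D=T), P(D=F,C=T), and P(A=T|C=T) by summing the joint
# distribution P(A,B,C,D) = P(A) P(B) P(C|A) P(D|A,B) over all 16 worlds.
from itertools import product

pA = {True: 0.3, False: 0.7}                     # P(A)
pB = {True: 0.6, False: 0.4}                     # P(B)
pC = {True: 0.8, False: 0.4}                     # P(C=T | A), keyed by A
pD = {(True, True): 0.7, (True, False): 0.8,     # P(D=T | A, B)
      (False, True): 0.1, (False, False): 0.2}

def joint(a, b, c, d):
    """Probability of the complete assignment A=a, B=b, C=c, D=d."""
    pc = pC[a] if c else 1 - pC[a]
    pd = pD[a, b] if d else 1 - pD[a, b]
    return pA[a] * pB[b] * pc * pd

def prob(query):
    """Marginal probability of the event defined by predicate `query`."""
    return sum(joint(*w) for w in product([True, False], repeat=4)
               if query(*w))

p_d     = prob(lambda a, b, c, d: d)             # P(D=T)     = 0.32
p_df_ct = prob(lambda a, b, c, d: not d and c)   # P(D=F,C=T) = 0.3032
p_at_ct = (prob(lambda a, b, c, d: a and c)
           / prob(lambda a, b, c, d: c))         # P(A=T|C=T) = 0.24/0.52
```

The same `prob` helper handles the remaining conditional queries as ratios of marginals.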
P(D=F|A=T)
 = P(D=F,B=T|A=T) + P(D=F,B=F|A=T)
 = P(D=F|B=T,A=T) P(B=T|A=T) + P(D=F|B=F,A=T) P(B=F|A=T)
 = (since B is independent of A)
   P(D=F|B=T,A=T) P(B=T) + P(D=F|B=F,A=T) P(B=F)
 = 0.3*0.6 + 0.2*0.4 = 0.26.
So P(A=T|D=F) = P(D=F|A=T) P(A=T) / P(D=F) = 0.26 * 0.3 / 0.68 = 0.115.

P(A=T,D=T|B=F)
 = P(D=T|A=T,B=F) P(A=T|B=F)
 = (since A and B are independent) P(D=T|A=T,B=F) P(A=T)
 = 0.8*0.3 = 0.24.

Problem 2

Let D be a data set with three predictive attributes P, Q, and R, and one classification attribute C. Attributes P, Q, and C are Boolean. Attribute R has three values: 1, 2, and 3. The data is as follows:

P  Q  R  C  Number of instances
Y  Y  1  N   2
Y  Y  2  Y  10
Y  Y  3  Y  15
Y  N  1  Y  20
Y  N  2  N   3
Y  N  3  N   4
N  Y  1  N  40
N  Y  2  Y   6
N  Y  3  Y   5
N  N  1  Y   4
N  N  2  Y   2
N  N  3  Y   4

A. What rule is output by the 1R algorithm? What is its prediction for a new instance with P=Y, Q=Y, R=1?

Answer: 1R outputs the rule "Case (P) {Y : C=Y} {N : C=N}", since splitting on P gives the fewest errors on the table (9+21 = 30, versus 31 for R and 43 for Q). On this instance it predicts C=Y.

B. What is the prediction of the Naive Bayes algorithm for a new instance with P=Y, Q=Y, R=1? What probability is assigned to the prediction?

P(C=Y|P=Y,Q=Y,R=1)
 = P(C=Y) P(P=Y|C=Y) P(Q=Y|C=Y) P(R=1|C=Y) / P(P=Y,Q=Y,R=1)
 = (66/115) * (45/66) * (36/66) * (24/66) / P(P=Y,Q=Y,R=1)
 = 0.0776 / P(P=Y,Q=Y,R=1)

P(C=N|P=Y,Q=Y,R=1)
 = P(C=N) P(P=Y|C=N) P(Q=Y|C=N) P(R=1|C=N) / P(P=Y,Q=Y,R=1)
 = (49/115) * (9/49) * (42/49) * (42/49) / P(P=Y,Q=Y,R=1)
 = 0.0575 / P(P=Y,Q=Y,R=1)

So the algorithm predicts C=Y. Since the denominator is the sum of the two numerators,

P(C=Y|P=Y,Q=Y,R=1) = 0.0776 / (0.0776 + 0.0575) = 0.5743.

C. Trace the execution of the ID3 algorithm, and show the decision tree that it outputs. At each stage, you should compute the average entropy AVG_ENTROPY(A,C,T) for each attribute A. (The book calls this "Remainder(A)" (p. 660).)

Let T0 be the entire table.
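Before starting the trace, the part B arithmetic can be double-checked mechanically from the raw counts. This is a sketch, not part of the assigned solution; `count` and `score` are illustrative helper names.

```python
# Recompute the Naive Bayes scores for the instance P=Y, Q=Y, R=1
# from the data table (rows are (P, Q, R, C, number of instances)).
data = [
    ('Y','Y',1,'N', 2), ('Y','Y',2,'Y',10), ('Y','Y',3,'Y',15),
    ('Y','N',1,'Y',20), ('Y','N',2,'N', 3), ('Y','N',3,'N', 4),
    ('N','Y',1,'N',40), ('N','Y',2,'Y', 6), ('N','Y',3,'Y', 5),
    ('N','N',1,'Y', 4), ('N','N',2,'Y', 2), ('N','N',3,'Y', 4),
]
total = sum(n for *_, n in data)                  # 115 instances in all

def count(pred):
    """Number of instances whose (P, Q, R, C) values satisfy `pred`."""
    return sum(n for p, q, r, c, n in data if pred(p, q, r, c))

def score(cls, p, q, r):
    """Numerator P(C) P(P|C) P(Q|C) P(R|C) of the Naive Bayes posterior."""
    n_c = count(lambda P, Q, R, C: C == cls)
    return (n_c / total
            * count(lambda P, Q, R, C: C == cls and P == p) / n_c
            * count(lambda P, Q, R, C: C == cls and Q == q) / n_c
            * count(lambda P, Q, R, C: C == cls and R == r) / n_c)

s_y = score('Y', 'Y', 'Y', 1)      # (66/115)(45/66)(36/66)(24/66), ~0.0776
s_n = score('N', 'Y', 'Y', 1)      # (49/115)(9/49)(42/49)(42/49),  ~0.0575
posterior_y = s_y / (s_y + s_n)    # ~0.5743, so predict C=Y
```

Normalizing by the sum of the two scores gives the posterior, exactly as in the hand computation.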
ID3(T0,C):
ENTROPY(C,T0) = -((66/115) log(66/115) + (49/115) log(49/115)) = 0.9842 (logs are base 2 throughout).
Create root node N1.

Let T1 = SUBTABLE(T0,P,Y) = lines 1-6.
ENTROPY(C,T1) = -((9/54) log(9/54) + (45/54) log(45/54)) = 0.6500
Let T2 = SUBTABLE(T0,P,N) = lines 7-12.
ENTROPY(C,T2) = -((40/61) log(40/61) + (21/61) log(21/61)) = 0.9288
AVG_ENTROPY(P,C,T0) = (54/115)*0.6500 + (61/115)*0.9288 = 0.7979

Let T3 = SUBTABLE(T0,Q,Y) = lines 1-3, 7-9.
ENTROPY(C,T3) = -((42/78) log(42/78) + (36/78) log(36/78)) = 0.9957
Let T4 = SUBTABLE(T0,Q,N) = lines 4-6, 10-12.
ENTROPY(C,T4) = -((7/37) log(7/37) + (30/37) log(30/37)) = 0.6998
AVG_ENTROPY(Q,C,T0) = (78/115)*0.9957 + (37/115)*0.6998 = 0.9005

Let T5 = SUBTABLE(T0,R,1) = lines 1,4,7,10.
ENTROPY(C,T5) = -((42/66) log(42/66) + (24/66) log(24/66)) = 0.9457
Let T6 = SUBTABLE(T0,R,2) = lines 2,5,8,11.
ENTROPY(C,T6) = -((3/21) log(3/21) + (18/21) log(18/21)) = 0.5917
Let T7 = SUBTABLE(T0,R,3) = lines 3,6,9,12.
ENTROPY(C,T7) = -((4/28) log(4/28) + (24/28) log(24/28)) = 0.5917
AVG_ENTROPY(R,C,T0) = (66/115)*0.9457 + (21/115)*0.5917 + (28/115)*0.5917 = 0.7949

R has the smallest average entropy, so split on R.

ID3(T5,C):
Create new node N2. ENTROPY(C,T5) = 0.9457
Let T8 = SUBTABLE(T5,P,Y) = lines 1,4.
ENTROPY(C,T8) = -((2/22) log(2/22) + (20/22) log(20/22)) = 0.4395
Let T9 = SUBTABLE(T5,P,N) = lines 7,10.
ENTROPY(C,T9) = -((40/44) log(40/44) + (4/44) log(4/44)) = 0.4395
AVG_ENTROPY(P,C,T5) = 0.4395
Let T10 = SUBTABLE(T5,Q,Y) = lines 1,7.
ENTROPY(C,T10) = 0
Let T11 = SUBTABLE(T5,Q,N) = lines 4,10.
ENTROPY(C,T11) = 0
AVG_ENTROPY(Q,C,T5) = 0
Split on Q.
ID3(T10,C): C has the value C=N throughout T10. Create leaf node labelled C=N with prob. 1.
ID3(T11,C): C has the value C=Y throughout T11. Create leaf node labelled C=Y with prob. 1.

ID3(T6,C):
Create new node N3. ENTROPY(C,T6) = 0.5917
Let T12 = SUBTABLE(T6,P,Y) = lines 2,5.
ENTROPY(C,T12) = -((3/13) log(3/13) + (10/13) log(10/13)) = 0.7793
Let T13 = SUBTABLE(T6,P,N) = lines 8,11.
ENTROPY(C,T13) = 0
AVG_ENTROPY(P,C,T6) = (13/21)*0.7793 = 0.4825
Let T14 = SUBTABLE(T6,Q,Y) = lines 2,8.
ENTROPY(C,T14) = 0
Let T15 = SUBTABLE(T6,Q,N) = lines 5,11.
ENTROPY(C,T15) = -((3/5) log(3/5) + (2/5) log(2/5)) = 0.9710
AVG_ENTROPY(Q,C,T6) = (5/21)*0.9710 = 0.2312
Split on Q.
ID3(T14,C) is a leaf node, labelled C=Y.
ID3(T15,C) splits at node N4 on P into two leaf nodes (P=Y: C=N; P=N: C=Y) with an average entropy of 0.

ID3(T7,C):
Create new node N5. ENTROPY(C,T7) = 0.5917
Let T16 = SUBTABLE(T7,P,Y) = lines 3,6.
ENTROPY(C,T16) = -((4/19) log(4/19) + (15/19) log(15/19)) = 0.7425
Let T17 = SUBTABLE(T7,P,N) = lines 9,12.
ENTROPY(C,T17) = 0
AVG_ENTROPY(P,C,T7) = (19/28)*0.7425 = 0.5038
Let T18 = SUBTABLE(T7,Q,Y) = lines 3,9.
ENTROPY(C,T18) = 0
Let T19 = SUBTABLE(T7,Q,N) = lines 6,12.
ENTROPY(C,T19) = -((4/8) log(4/8) + (4/8) log(4/8)) = 1.0
AVG_ENTROPY(Q,C,T7) = (8/28)*1.0 = 0.2857
Split on Q.
ID3(T18,C) is a leaf node, labelled C=Y.
ID3(T19,C) splits at node N6 on P into two leaf nodes (P=Y: C=N; P=N: C=Y) with an average entropy of 0.

Final tree:

N1: R --- 1 ---> N2: Q --- Y ---> C=N
 |                |
 |                |- N ---> C=Y
 |
 |------ 2 ---> N3: Q --- Y ---> C=Y
 |                |
 |                |- N ---> N4: P --- Y ---> C=N
 |                            |
 |                            |- N ---> C=Y
 |
 |------ 3 ---> N5: Q --- Y ---> C=Y
                  |
                  |- N ---> N6: P --- Y ---> C=N
                              |
                              |- N ---> C=Y
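The entropy bookkeeping in the trace is easy to get wrong by hand, so the top-level numbers can be reproduced with a short script. This is a sketch with illustrative names (`entropy`, `avg_entropy`), using base-2 logarithms as in the trace.

```python
# Recompute ENTROPY(C,T0) and AVG_ENTROPY(A,C,T0) ("Remainder(A)") for
# each attribute A on the full table T0. Rows are (P, Q, R, C, count).
from collections import Counter
from math import log2

T0 = [
    ('Y','Y',1,'N', 2), ('Y','Y',2,'Y',10), ('Y','Y',3,'Y',15),
    ('Y','N',1,'Y',20), ('Y','N',2,'N', 3), ('Y','N',3,'N', 4),
    ('N','Y',1,'N',40), ('N','Y',2,'Y', 6), ('N','Y',3,'Y', 5),
    ('N','N',1,'Y', 4), ('N','N',2,'Y', 2), ('N','N',3,'Y', 4),
]

def entropy(rows):
    """ENTROPY(C,T): -sum_v p_v log2 p_v over the class distribution of C."""
    total = sum(row[-1] for row in rows)
    counts = Counter()
    for p, q, r, c, n in rows:
        counts[c] += n
    return -sum(n / total * log2(n / total) for n in counts.values())

def avg_entropy(attr, rows):
    """AVG_ENTROPY(A,C,T): weighted entropy left after splitting on A."""
    idx = {'P': 0, 'Q': 1, 'R': 2}[attr]
    total = sum(row[-1] for row in rows)
    out = 0.0
    for v in {row[idx] for row in rows}:
        sub = [row for row in rows if row[idx] == v]
        out += sum(row[-1] for row in sub) / total * entropy(sub)
    return out

# ENTROPY(C,T0) ~0.9842; remainders: P ~0.7979, Q ~0.9005, R ~0.7949,
# so ID3 splits the root on R, matching the trace.
```

Calling `avg_entropy` on the subtables T5, T6, T7 reproduces the lower levels of the trace in the same way.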
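As a final sanity check (not part of the assignment), the output tree can be coded directly as a function; since every distinct (P,Q,R) row of the table has a single class, the tree should reproduce the class of all 12 rows, and it does.

```python
# The final decision tree as a function, checked against the training table.
T0 = [  # (P, Q, R, C, count)
    ('Y','Y',1,'N', 2), ('Y','Y',2,'Y',10), ('Y','Y',3,'Y',15),
    ('Y','N',1,'Y',20), ('Y','N',2,'N', 3), ('Y','N',3,'N', 4),
    ('N','Y',1,'N',40), ('N','Y',2,'Y', 6), ('N','Y',3,'Y', 5),
    ('N','N',1,'Y', 4), ('N','N',2,'Y', 2), ('N','N',3,'Y', 4),
]

def classify(p, q, r):
    """Walk the tree rooted at N1 (split on R)."""
    if r == 1:                        # node N2: split on Q
        return 'N' if q == 'Y' else 'Y'
    if q == 'Y':                      # nodes N3 (R=2) and N5 (R=3): leaf C=Y
        return 'Y'
    return 'N' if p == 'Y' else 'Y'   # nodes N4 and N6: split on P

# Every row of the table is classified correctly.
assert all(classify(p, q, r) == c for p, q, r, c, n in T0)
```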