Example of the ID3 algorithm

This example shows the construction of a decision tree where P, Q, and C are the predictive attributes and R is the classification attribute. I've added line numbers for convenient reference: the line number is _not_ an attribute.

Line   P   Q   R   C   Number of instances
  1    Y   Y   1   Y      1
  2    Y   Y   2   N     10
  3    Y   Y   3   Y      3
  4    Y   N   1   Y      2
  5    Y   N   2   Y     11
  6    Y   N   3   Y      0
  7    N   Y   1   Y      2
  8    N   Y   2   N     20
  9    N   Y   3   Y      3
 10    N   N   1   Y      1
 11    N   N   2   Y     15
 12    N   N   3   Y      3

Total number of instances: 71

The above table is T. Create root node N1.

C1: ID3(T,R)
C2: AVG_ENTROPY(P,R,T)
C3: FREQUENCY(P,Y,T) = (1+10+3+2+11+0)/71 = 27/71 = 0.3803
C4: SUBTABLE(P,Y,T) = lines 1,2,3,4,5,6. Call this T1. Size(T1) = 27.
C5: ENTROPY(R,T1)
C6: FREQUENCY(R,1,T1) = (1+2)/27 = 0.1111
C7: FREQUENCY(R,2,T1) = (10+11)/27 = 0.7778
C8: FREQUENCY(R,3,T1) = (3+0)/27 = 0.1111
C5: Return -(0.1111 log(0.1111) + 0.7778 log(0.7778) + 0.1111 log(0.1111)) = 0.9864
C8.1: FREQUENCY(P,N,T) = (2+20+3+1+15+3)/71 = 44/71 = 0.6197
C9: SUBTABLE(P,N,T) = lines 7,8,9,10,11,12. Call this T2. Size(T2) = 44.
C10: ENTROPY(R,T2)
C11: FREQUENCY(R,1,T2) = (2+1)/44 = 0.0682
C12: FREQUENCY(R,2,T2) = (20+15)/44 = 0.7955
C13: FREQUENCY(R,3,T2) = (3+3)/44 = 0.1364
C10: Return -(0.0682 log(0.0682) + 0.7955 log(0.7955) + 0.1364 log(0.1364)) = 0.9188
C2: Return (27/71) * 0.9864 + (44/71) * 0.9188 = 0.9445
C14: AVG_ENTROPY(Q,R,T)
C15: FREQUENCY(Q,Y,T) = (1+10+3+2+20+3)/71 = 39/71 = 0.5493
C16: SUBTABLE(Q,Y,T) = lines 1,2,3,7,8,9. Call this T3. Size(T3) = 39.
C17: ENTROPY(R,T3)
C18: FREQUENCY(R,1,T3) = (1+2)/39 = 0.0769
C19: FREQUENCY(R,2,T3) = (10+20)/39 = 0.7692
C20: FREQUENCY(R,3,T3) = (3+3)/39 = 0.1538
C17: Return -(0.0769 log(0.0769) + 0.7692 log(0.7692) + 0.1538 log(0.1538)) = 0.9913
C21: FREQUENCY(Q,N,T) = (2+11+0+1+15+3)/71 = 32/71 = 0.4507
C21.1: SUBTABLE(Q,N,T) = lines 4,5,6,10,11,12. Call this T4. Size(T4) = 32.
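The ENTROPY and AVG_ENTROPY calls above can be checked with a short Python sketch (the function name mirrors the pseudocode's ENTROPY, but the implementation here is mine, not part of the assignment):

```python
from math import log2

def entropy(counts):
    """Base-2 Shannon entropy of a class-count distribution (zero counts skipped)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# T1 (P = Y, lines 1-6): counts of R = 1, 2, 3 are 3, 21, 3 (total 27)
# T2 (P = N, lines 7-12): counts of R = 1, 2, 3 are 3, 35, 6 (total 44)
h_t1 = entropy([3, 21, 3])                       # step C5, about 0.9864
h_t2 = entropy([3, 35, 6])                       # step C10, about 0.9188
avg_p = (27 / 71) * h_t1 + (44 / 71) * h_t2      # step C2, about 0.9445
```

Note that the average weights each subtable's entropy by its share of the 71 instances, which is exactly what the C2 return line does.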
C22: ENTROPY(R,T4)
C23: FREQUENCY(R,1,T4) = (2+1)/32 = 0.0938
C24: FREQUENCY(R,2,T4) = (11+15)/32 = 0.8125
C25: FREQUENCY(R,3,T4) = (0+3)/32 = 0.0938
C22: Return -(0.0938 log(0.0938) + 0.8125 log(0.8125) + 0.0938 log(0.0938)) = 0.8837
C14: Return (39/71) * 0.9913 + (32/71) * 0.8837 = 0.9428
From here on down, I'm abbreviating.
C26: AVG_ENTROPY(C,R,T)
C27: FREQUENCY(C,Y,T) = 41/71 = 0.5775
C28: SUBTABLE(C,Y,T) = all lines but 2 and 8. Call this T5.
C29: ENTROPY(R,T5) = -((6/41) log(6/41) + (26/41) log(26/41) + (9/41) log(9/41)) = 1.3027
C30: FREQUENCY(C,N,T) = 30/71 = 0.4225
C31: SUBTABLE(C,N,T) = lines 2 and 8. Call this T6.
C32: ENTROPY(R,T6) = -(0 log 0 + (30/30) log(30/30) + 0 log 0) = 0 (taking 0 log 0 = 0)
C26: Return (41/71) * 1.3027 + (30/71) * 0 = 0.7522
ENTROPY(R,T) = -((6/71) log(6/71) + (56/71) log(56/71) + (9/71) log(9/71)) = 0.9490
Choose AS = C, since C has the smallest average entropy (0.7522 vs. 0.9445 for P and 0.9428 for Q).
Mark N1 as split on attribute C.
C33: SUBTABLE(C,N,T) is T6 as before.
C34: ID3(T6,R)
Make a new node N2.
In all instances in T6 (lines 2 and 8), X.R = 2. Therefore, this is base case 2. Label node N2 "X.R = 2".
C34 returns N2 to C1.
In C1: Make an arc labelled "N" from N1 to N2.
************ This is as far as you should take the solution to problem 1 ***
C35: SUBTABLE(C,Y,T) is T5 as above.
C36: ID3(T5,R)
Create new node N3.
C37: AVG_ENTROPY(P,R,T5)
C38: SUBTABLE(P,Y,T5) is lines 1,3,4,5,6. Call this T7.
C39: ENTROPY(R,T7) = -((3/17) log(3/17) + (11/17) log(11/17) + (3/17) log(3/17)) = 1.2896
C40: SUBTABLE(P,N,T5) is lines 7,9,10,11,12. Call this T8.
C41: ENTROPY(R,T8) = -((3/24) log(3/24) + (15/24) log(15/24) + (6/24) log(6/24)) = 1.2988
C37: AVG_ENTROPY(P,R,T5) = (17/41) * 1.2896 + (24/41) * 1.2988 = 1.2950
C42: AVG_ENTROPY(Q,R,T5)
C43: SUBTABLE(Q,Y,T5) is lines 1,3,7,9. Call this T9.
C44: ENTROPY(R,T9) = -((3/9) log(3/9) + 0 log 0 + (6/9) log(6/9)) = 0.9183
C45: SUBTABLE(Q,N,T5) is lines 4,5,6,10,11,12. This is table T4, above
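The whole attribute-selection step at the root can be sketched in Python. This is my own sketch, not the assignment's code: `entropy` and `avg_entropy` mirror the pseudocode's ENTROPY and AVG_ENTROPY, and the table is the one from the top of the example with per-row instance counts.

```python
from math import log2

# Each row of T: (P, Q, R, C, number of instances), matching table lines 1-12.
T = [("Y","Y",1,"Y", 1), ("Y","Y",2,"N",10), ("Y","Y",3,"Y", 3),
     ("Y","N",1,"Y", 2), ("Y","N",2,"Y",11), ("Y","N",3,"Y", 0),
     ("N","Y",1,"Y", 2), ("N","Y",2,"N",20), ("N","Y",3,"Y", 3),
     ("N","N",1,"Y", 1), ("N","N",2,"Y",15), ("N","N",3,"Y", 3)]
COL = {"P": 0, "Q": 1, "R": 2, "C": 3}

def entropy(rows):
    """Base-2 entropy of the classification attribute R over weighted rows."""
    total = sum(n for *_, n in rows)
    counts = {}
    for *vals, n in rows:
        counts[vals[COL["R"]]] = counts.get(vals[COL["R"]], 0) + n
    return -sum(n / total * log2(n / total) for n in counts.values() if n > 0)

def avg_entropy(attr, rows):
    """Entropy of R averaged over the subtables for each value of attr."""
    total = sum(n for *_, n in rows)
    result = 0.0
    for v in {r[COL[attr]] for r in rows}:
        sub = [r for r in rows if r[COL[attr]] == v]
        size = sum(n for *_, n in sub)
        result += size / total * entropy(sub)
    return result

# C wins: its average entropy is the smallest of the three.
scores = {a: avg_entropy(a, T) for a in ("P", "Q", "C")}
```

Running this reproduces the three averages computed in steps C2, C14, and C26, and confirms that C is the attribute chosen at the root.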
(except that the C column has been deleted).
C46: ENTROPY(R,T4) = 0.8837 (see C22 above)
C42: AVG_ENTROPY(Q,R,T5) = (9/41) * 0.9183 + (32/41) * 0.8837 = 0.8913
So we choose AS = Q.
C47: ENTROPY(R,T5) was calculated in C29 above to be 1.3027.
Mark N3 as split on attribute Q.
C48: SUBTABLE(Q,N,T5) is T4 above: lines 4,5,6,10,11,12 (minus columns C and Q).
C49: ID3(T4,R)
Create new node N4.
C50: AVG_ENTROPY(P,R,T4)
C51: SUBTABLE(P,Y,T4) = lines 4,5,6. Call this T10.
C52: ENTROPY(R,T10) = -((2/13) log(2/13) + (11/13) log(11/13) + 0 log 0) = 0.6194
C53: SUBTABLE(P,N,T4) = lines 10,11,12. Call this T11.
C54: ENTROPY(R,T11) = -((1/19) log(1/19) + (15/19) log(15/19) + (3/19) log(3/19)) = 0.9133
C50: AVG_ENTROPY(P,R,T4) = (13/32) * 0.6194 + (19/32) * 0.9133 = 0.7939
Choose AS = P (no other choices).
C55: ENTROPY(R,T4) was calculated in C22 to be 0.8837.
C49 continuing: Mark node N4 as split on P.
C56: SUBTABLE(P,N,T4) is T11 above (lines 10,11,12).
C57: ID3(T11,R)
Make new node N5.
No predictive attributes remain.
Label N5: "Prob(X.R=1) = 1/19. Prob(X.R=2) = 15/19. Prob(X.R=3) = 3/19."
Return N5 to C49.
C49 continuing: Make an arc labelled "N" from N4 to N5.
C58: SUBTABLE(P,Y,T4) is T10 above (lines 4,5,6).
C59: ID3(T10,R)
Make new node N6.
No predictive attributes remain in T10.
Label N6: "Prob(X.R=1) = 2/13. Prob(X.R=2) = 11/13. Prob(X.R=3) = 0."
Return N6 to C49.
C49 continuing: Make an arc labelled "Y" from N4 to N6.
C49 returns N4 to C36.
C36 continuing: Make an arc labelled "N" from N3 to N4.
C60: SUBTABLE(Q,Y,T5) is T9 above (lines 1,3,7,9).
C61: ID3(T9,R)
Make a new node N7.
C62: AVG_ENTROPY(P,R,T9)
C63: SUBTABLE(P,Y,T9) is lines 1 and 3. Call this T12.
C64: ENTROPY(R,T12) = -((1/4) log(1/4) + (3/4) log(3/4)) = 0.8113
C65: SUBTABLE(P,N,T9) is lines 7 and 9.
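The C50 computation for T4 can be checked with the same kind of entropy sketch as before (my own helper, not the assignment's code; the class counts are read off table lines 4-6 and 10-12):

```python
from math import log2

def entropy(counts):
    """Base-2 entropy of a class-count distribution (zero counts skipped)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# T10 (T4 with P = Y, lines 4,5,6):   counts of R = 1, 2, 3 are 2, 11, 0
# T11 (T4 with P = N, lines 10,11,12): counts of R = 1, 2, 3 are 1, 15, 3
h_t10 = entropy([2, 11, 0])                        # step C52, about 0.6194
h_t11 = entropy([1, 15, 3])                        # step C54, about 0.9133
avg_p_t4 = (13 / 32) * h_t10 + (19 / 32) * h_t11   # step C50, about 0.7939
```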
Call this T13.
C67: ENTROPY(R,T13) = -((2/5) log(2/5) + (3/5) log(3/5)) = 0.9710
C68: AVG_ENTROPY(P,R,T9) = (4/9) * 0.8113 + (5/9) * 0.9710 = 0.9000
AS would be P (the only remaining predictive attribute).
C69: ENTROPY(R,T9) was calculated in C44 as 0.9183.
The result in C68 is not a substantial improvement over C69, particularly considering the size of the table T9.
N7 is a leaf, labelled "Prob(X.R=1) = 3/9. Prob(X.R=3) = 6/9."
C61 returns N7 to C36.
C36 continuing: Make an arc labelled "Y" from N3 to N7.
C36 returns N3 to C1.
C1 continuing: Make an arc labelled "Y" from N1 to N3.
C1 returns N1.

Final tree:

                      ______________
                     |      N1      |
                     |  Split on C  |
                      --------------
                            |
            _______N________|________Y_______
           |                                 |
    _______|______                    _______|______
   |      N2      |                  |      N3      |
   | Prob(R=2)=1  |                  |  Split on Q  |
    --------------                    --------------
                                            |
                            _______N________|________Y_______
                           |                                 |
                    _______|________                  _______|________
                   |      N4        |                |      N7        |
                   |   Split on P   |                | Prob(R=1)=3/9  |
                    ----------------                 | Prob(R=3)=6/9  |
                           |                          ----------------
           _______N________|________Y_______
          |                                 |
   _______|_________                 _______|_________
  |      N5         |               |      N6         |
  | Prob(R=1)=1/19  |               | Prob(R=1)=2/13  |
  | Prob(R=2)=15/19 |               | Prob(R=2)=11/13 |
  | Prob(R=3)=3/19  |                -----------------
   -----------------

Note that, in a deterministic tree, there would be no point in the split at N4, since both N5 and N6 predict R=2. This split would be eliminated in post-processing.
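The finished tree can be written as a nested structure and used to look up the class distribution for a new instance. This is a sketch of how the tree would be consulted, not part of the ID3 trace itself; the leaf distributions are the node labels from the trace:

```python
# Internal nodes hold the split attribute and one subtree per attribute value;
# leaves hold the probability distribution over the classification attribute R.
tree = {"split": "C", "branches": {
    "N": {"leaf": {2: 1.0}},                                   # N2
    "Y": {"split": "Q", "branches": {
        "N": {"split": "P", "branches": {
            "N": {"leaf": {1: 1/19, 2: 15/19, 3: 3/19}},       # N5
            "Y": {"leaf": {1: 2/13, 2: 11/13}},                # N6
        }},
        "Y": {"leaf": {1: 3/9, 3: 6/9}},                       # N7
    }},
}}

def classify(node, instance):
    """Walk the tree using the instance's attribute values; return the leaf's distribution over R."""
    while "leaf" not in node:
        node = node["branches"][instance[node["split"]]]
    return node["leaf"]
```

For example, an instance with C=N goes straight to N2 and is predicted R=2 with probability 1, while an instance with C=Y, Q=N, P=N reaches N5 and gets the 1/19, 15/19, 3/19 distribution.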