Example of ID3 algorithm
I'm going to use the same data set as in the problem set; however,
the example here will use R rather than C as the classification attribute.
I've added line numbers for convenient reference: this is _not_
an attribute.
Line number P Q R C Number of instances
1 Y Y 1 Y 1
2 Y Y 2 N 10
3 Y Y 3 Y 3
4 Y N 1 Y 2
5 Y N 2 Y 11
6 Y N 3 Y 0
7 N Y 1 Y 2
8 N Y 2 N 20
9 N Y 3 Y 3
10 N N 1 Y 1
11 N N 2 Y 15
12 N N 3 Y 3
Total number of instances 71
The above table is T0.
Create root node N1.
C1: ID3(T,R)
C2: AVG_ENTROPY(P,R,T)
C3: FREQUENCY(P,Y,T) = (1+10+3+2+11+0)/71 = 27/71 = 0.3803
C4: SUBTABLE(P,Y,T) = lines 1,2,3,4,5,6. Call this T1. Size(T1)= 27.
C5: ENTROPY(R,T1)
C6: FREQUENCY(R,1,T1) = (1+2)/27 = 0.1111
C7: FREQUENCY(R,2,T1) = (10+11)/27 = 0.7778
C8: FREQUENCY(R,3,T1) = (3+0)/27 = 0.1111
C5: Return (0.1111 log(0.1111) + 0.7777 log(0.7777) + 0.1111 log(0.1111))
= 0.9864
C8.1: FREQUENCY(P,Y,T) = (2+20+3+1+15+3)/71 = 44/71 = 0.6197
C9: SUBTABLE(P,N,T) = lines 7,8,9,10,11,12. Call this T2. Size(T2) = 44.
C10: ENTROPY(R,T2)
C11: FREQUENCY(R,1,T2) = (2+1)/44 = 0.0682
C12: FREQUENCY(R,2,T2) = (20+15)/44 = 0.7955
C13: FREQUENCY(R,3,T2) = (3+3)/44 = 0.1364
C10: Return (0.0682 log(0.0682) + 0.7955 log(0.7955) + 0.1364 log(0.1364))
= 0.9188
C2: Return (27/71) * 0.9864 + (44/71) * 0.9188 = 0.9445
C14: AVG_ENTROPY(Q,R,T)
C15: FREQUENCY(Q,Y,T) = (1+10+3+2+20+3)/71 = 39/71 = 0.5493
C16: SUBTABLE(Q,Y,T) = lines 1,2,3,7,8,9. Call this T3. Size(T3)= 39.
C17: ENTROPY(R,T3)
C18: FREQUENCY(R,1,T3) = (1+2)/39 = 0.0769
C19: FREQUENCY(R,2,T3) = (10+20)/39 = 0.7692
C20: FREQUENCY(R,3,T3) = (3+3)/39 = 0.1538
C17: Return (0.0769 log(0.0769) + 0.7692 log(0.7692) + 0.1538 log(0.1538))
= 0.9914
C21: FREQUENCY(Q,N,T) = (2+11+0+1+15+3)/71 = 32/71 = 0.4507
C21: SUBTABLE(Q,N,T) = lines 4,5,6,10,11,12. Call this T4. Size(T2) = 32.
C22: ENTROPY(R,T4)
C23: FREQUENCY(R,1,T4) = (2+1)/32 = 0.0938
C24: FREQUENCY(R,2,T4) = (11+15)/32 = 0.8125
C25: FREQUENCY(R,3,T4) = (0+3)/32 = 0.0938
C22: Return (0.0938 log(0.0938) + 0.8125 log(0.8125) + 0.0938 log(0.0938))
= 0.8838
C14: Return (39/71) * 0.9914 + (32/71) * 0.8836 = 0.9394
From here on down, I'm abbreviating.
C26: AVG_ENTROPY(C,R,T)
C27: FREQUENCY(C,Y,T) = 41/71 = 0.5775
C28: SUBTABLE(C,Y,T) = all lines but 2 and 8. Call this T5.
C29: ENTROPY(R,T5) = (6/41) log(6/41) + (26/41) log (26/41) +
(9/41) log(9/41) = 1.3028
C30: FREQUENCY(C,N,T) = 30/71 = 0.4225
C31: SUBTABLE(C,N,T) = lines 2 and 8. Call this T6.
C32: ENTROPY(R,T6) = 0 log 0 + (30/30) log(30/30) + 0 log 0 = 0
C26 returns (41/71) * 1.3028 = 0.7523.
ENTROPY(R,T) = ((6/71) log(6/71) + (56/71) log(56/71) + (9/71) log(9/71)) =
0.9492
Choose AS=C
Mark N1 as split on attribute C.
C33: SUBTABLE(C,N,T) is T6 as before.
C34: ID3(T6,R)
Make a new node N2
In all instances in T5 (lines 2 and 8), X.R = 2. Therefore, this
is base case 2.
Label node N2 as "X.R=2"
C34 returns N2 to C1
In C1: Make an arc labelled "N" from N1 to N2.
************ This is as far as you should take the solution to problem 1 ***
C35: SUBTABLE(C,Y,T) is T5 as above.
C36: ID3(T5,R)
Create new node N3;
C37: AVG_ENTROPY(P,R,T5)
C38: SUBTABLE(P,Y,T5) is lines 1,3,4,5,6. Call this T7.
C39: ENTROPY(R,T7) = ((3/17) log(3/17) + (11/17)log(11/17) +
(3/17) log(3/17)) = 1.2898
C40: SUBTABLE(P,N,T5) is lines 7,9,10,11,12. Call this T8.
C41: ENTROPY(R,T8) = (3/24) log(3/24) + (15/24) log(15/24) +
(6/24) log(6/24) = 1.2988
C37: AVG_ENTROPY(P,R,T5) = (17/41) 1.2898 + (24/41) 1.2988 = 1.2951
C42: AVG_ENTROPY(Q,R,T5)
C43: SUBTABLE(Q,Y,T5) is lines 1,3,7,9. Call this T9.
C44: ENTROPY(R,T9) = (3/9) log(3/9) + 0 log 0 + (6/9) log (6/9) = 0.9184
C45: SUBTABLE(Q,N,T5) is lines 4,5,6,10,11,12. This is table T4, above.
(except that the C column has been deleted)
C46: ENTROPY(R,T4) = 0.8836 (see C22 above)
C42: AVG_ENTROPY(Q,R,T5) = (9/41) 0.9184 + (32/41) 0.8836 = 0.8912
So we choose AS = Q.
C47: ENTROPY(R,T5) was calculated in C29 above to be 1.3029
Mark N3 as split on attribute Q.
C48: SUBTABLE(T5,Q,N) is T4 above: Lines 4,5,6,10,11,12
(minus columns C and Q)
C49: ID3(T4,R)
Create new node N4
C50: AVG_ENTROPY(P,R,T4)
C51: SUBTABLE(P,Y,T4) = lines 4,5,6. Call this T10.
C52: ENTROPY(R,T10) = (2/13) log(2/13) + (11/13) log(11/13) + 0 log 0 =
0.6194
C53: SUBTABLE(P,N,T4) = lines 10,11,12. Call this T11.
C54: ENTROPY(R,T11) = (1/19) log(1/19) + (15/19) log(15/19) +
(3/19) log(3/19) = 0.8264
C50: AVG_ENTROPY(P,R,T4) = (13/32) * 0.6194 + (19/32) * 0.8264 = 0.7423
Choose AS = P (no other choices)
C55: ENTROPY(R,T4) was calculated in C22 to be 0.8836.
C49 continuing: Mark node N4 as split on P.
C56: SUBTABLE(T4,P,N) is T11 above (lines 10,11,12)
C57: ID3(T11,R)
Make new node N5
No predictive attributes remain
Label N5: "Prob(X.R=1) = 1/19.
Prob(X.R=2) = 15/19
Prob(X.R=3) = 3/19"
Return N5 to C49
C49 continuing: Make an arc labelled "N" from N4 to N5.
C58: SUBTABLE(T4,P,Y) is T10 above (lines 4,5,6)
C59: ID3(T4,R)
Make new node N6
No predictive attributes remain in T10
Label N6: "Prob(X.R=1) = 2/13.
Prob(X.R=2) = 11/13
Prob(X.R=3) = 0"
Return N6 to C49
C49 continuing: Make an arc labelled "Y" from N4 to N5
C49 returns N4 to C36.
C36 continuing: Make an arc labelled "N" from N3 to N4.
C60: SUBTABLE(T5,Q,Y) is T9 above (lines 1,3,7,9)
C61: ID3(T9,R)
Make a new node N7
C62: AVG_ENTROPY(P,R,T9)
C63: SUBTABLE(P,Y,T9) is lines 1 and 3. Call this T12.
C64: ENTROPY(R,T12) = ((1/4) log(1/4) + (3/4) log(3/4)) = 0.8113
C65: SUBTABLE(P,N,T9) is lines 7 and 9. Call this T11
C67: ENTROPY(R,T11) = ((2/5) log(2/5) + (3/5) log(3/5)) = 0.9710
C68: AVG_ENTROPY(P,R,T9) = (4/9) * 0.8113 * (5/9) * 0.9710 = 0.9000
AS is P
C69: ENTROPY(R,T9) is calculated in C44 as 0.9184
The result in C50 is not a substantial improvement over C51, particularly
considering the size of the table T9.
N7 is a leaf, labelled "Prob(X.R=1) = 3/9. Prob(X.R=3) = 6/9"
C61 returns N7 to C36.
C36 continuing: Make an arc labelled Y from N3 to N7
C36 returns N3 to C1
C1 continuing: Make an arc labelled Y from N1 to N3
C1 returns N1.
Final tree:
______________
 N1 
 Split on C 


________N___________________Y__________
 
_____________ ______________
 N2   N3 
Prob(R=2)=1.  Split on Q 
 

_____N_____________Y______
 
_______________ _______________
 N4   N7 
 Split on P  Prob(R=1)=3/9 
 Prob(R=3)=6/9 
 
_______N_______________Y_______
 
________________ ________________
 N5   N6 
Prob(R=1)=1/19  Prob(R=1)=2/13 
Prob(R=2)=15/19 Prob(R=2)=11/13
Prob(R=3)=3/19  

Note that, in a deterministic tree, there would be no point in the
split at N4, since both N5 and N6 predict R=2. This split would
be eliminated in postprocessing.