Assume that all the random variables are Boolean, and that you have the following probabilities:
|P(A=T) = 0.9.||P(A=F) = 0.1.|
|P(B=T) = 0.5.||P(B=F) = 0.5|
|P(C=T|A=T,B=T) = 1.0.    ||P(C=F|A=T,B=T) = 0.0.|
|P(C=T|A=T,B=F) = 0.8.||P(C=F|A=T,B=F) = 0.2.|
|P(C=T|A=F,B=T) = 0.5.||P(C=F|A=F,B=T) = 0.5.|
|P(C=T|A=F,B=F) = 0.1.||P(C=F|A=F,B=F) = 0.9.|
|P(D=T|B=T) = 0.7.||P(D=F|B=T) = 0.3|
|P(D=T|B=F) = 0.2.||P(D=F|B=F) = 0.8|
|P(E=T|C=T) = 0.1.||P(E=F|C=T)=0.9.|
|P(E=T|C=F) = 0.8.||P(E=F|C=F)=0.2.|
Now P(A=T,B=T,C=T) = (as in (A)) P(C=T|A=T,B=T) * P(A=T) * P(B=T) = 1.0 * 0.9 * 0.5 = 0.45.
P(B=T,C=T) = P(A=T,B=T,C=T) + P(A=F,B=T,C=T) and
P(A=F,B=T,C=T) = P(C=T|A=F,B=T) * P(A=F) * P(B=T) = 0.5*0.1*0.5 = 0.025,
so P(B=T,C=T) = 0.45+0.025 = 0.475.
So P(A=T|B=T,C=T) = 0.45/0.475 = 0.947
Note that P(A=T) < P(A=T|B=T,C=T) < P(A=T|C=T). Intuitively, A and B being true each make C more likely to be true. Therefore if C is observed to be true, then the explanation may be that C is true because A is true. However, if B is also known to be true, then that gives an alternative explanation for the fact that C is true, weakening the conclusion that A is true.
There are a number of difficulties with discretization:
A. (Open ended --- there is no right or wrong answer) One problem is that one has to choose the number of discrete categories and the cutoff points and that this is generally somewhat arbitrary. Describe an algorithm or a criterion that takes as input a training data set, with a combination of numerical and discrete predictive attributes and a discrete classification attribute, and returns a discretization for each numerical attribute. Discuss the different objectives that arise in choosing a discretization, and whether and how your algorithm addresses them.
Answer: There are two main trade-offs. One is the decision of how finely to discriminate. Too coarse a subdivision loses predictive power, by lumping together values of the predictive attribute with very different significance for the classification attribute. Too fine a subdivision means that there will be too few instances with each value to allow reliable prediction, and an elevated chance of attaching significance to what is just random variation. Both of these effects are magnified if you have several numeric attributes to discretize simultaneously, and the classification attribute depends on some complex combination of these. But in that case, you are probably better off using a numerical prediction method, rather than discretizing.
The second trade-off is between looking for large gaps in the values and trying to split the space of instances evenly. The advantages of looking for large gaps is that (a) the gap may reflect some real qualitative difference, which will then be reflected in the classification attribute; (b) the classifier becomes less sensitive to the precise value of the cutoff, since few instances are near the cutoff. The disadvantage is that you may end up splitting the instances into two sets of uneven size, and having little predictive power on the large set.
One simple type of algorithm is to fix the number K of discrete categories as a function of the number of instances and to fix a minimum percentage P < i 1/K of the fraction that will go into each category, and then look for the K-1 largest gaps that will create categories each of at least size PN.
B. (Open ended) One particular choice that has to be made is whether or not the algorithm should consider the associated values of the classification attribute in deciding how to discretize the predictive attribute. What are the pros and cons of this? Discuss arguments in both directions.
The argument in favor is that there may be a cutoff value for predictive attribute A which is very powerful in predicting classification C but is not particular evident just by looking at the values of A.
The argument against is that this is susceptible to overfitting if a fine discretization is used.
C. (Open ended). A second problem is that discretization throws away a lot of information. Obviously, one is throwing away information by lumping all the values within a range into a single category; in the example above, we are treating a 0 and a 24 as the same. Slightly less obviously, one is throwing away information in that the relation between categories is lost. For instance, "High" is closer to "Medium" than it is to "Low", but once one has discretized, these are just names, and the learning algorithms don't know this.
Describe a data set in which the first issue is particularly problematic and one in which the second issue is particularly problematic. In both cases, you are trying to design a data set in which the numerical attribute is in fact usefully predictive of the classification attribute, but a large part of this predictive power is lost when the attribute is discretized due to one or the other of these issues.
Answer: On thinking about this some more, I decided that this was a bad question. The two problems occur with the same kinds of data sets; namely ones where the significance of the predictive attribute changes smoothly with its value, and where the amount of data is limited compared to the precision needed. The first problem if you discretize too coarsely, and the second occurs if you discretize too finely.
Suppose, for instance, that the classification attribute C has k different values. Suppose that for the ith value of C, the distribution of X.A is a bell curve centered at i/k with standard deviation 1/sqrt(k)., and suppose you have k2 points in total. Then if you take the average of A values for each value of C, you can probably recover most of the ordering. However, if you discretize coarsely, you will blur the values in the range together, and if you discretize finely, you will lose predictive power due to random variation in each small subcollection.
D. (Not open ended). A third problem is that the results of applying an ML algorithm to a discretized data set may depend very sensitively on the discretization that is used.
Consider a classification problem with two predictive attributes A and B and a numeric classification attribute C with values between 0 and 100. Let D1 be the discretization 0-25 => Low, 26-50 => Medium, 51-100 => High. Let D2 be the discretization 0-50 => Low, 51-75 => Medium, 76-100 => High. Let X be a test example with X.A = X.B = TRUE. Construct a training set T such that, if D1 is applied, then Naive Bayes predicts with high confidence that X.C is Low, and if D2 is applied then Naive Bayes predicts with high confidence that X.C is High.
|A||B||C||Number of instances|
Using Naive Bayes with D1:
P(C=Low | A=TRUE,B=TRUE) = q*P(A=TRUE | C=Low) * P(B=TRUE | C=Low) * P(C=Low) = q*1*1*0.05 = 0.05q.
P(C=Med | A=TRUE,B=TRUE) = q*P(A=TRUE | C=Med) * P(B=TRUE | C=Med) * P(C=Med) = 0.0
P(C=Hi | A=TRUE,B=TRUE) = q*P(A=TRUE | C=Hi) * P(B=TRUE | C=Hi) * P(C=Hi) = 0.1*0.1*0.5 = 0.005q
The normalizing factor q is the sum of these = 0.055, so the true probabilities
P(C=Low|A=TRUE,B=TRUE) = 0.05/0.055 = 0.909.
P(C=Med|A=TRUE,B=TRUE) = 0.0
P(C=Hi|A=TRUE,B=TRUE) = 0.091
By symmetry, exactly the reverse is true if D2 is used.
Note that the Naive Bayes assumption that A and B are independent given C is very far from true under either of the discretizations, though it is true, in a degenerate sense, in the original data. Thus, independence is also susceptible to grouping effects.