Assume that all the random variables are Boolean, and that you have the following probabilities:
    P(A=T) = 0.9              P(A=F) = 0.1
    P(B=T) = 0.5              P(B=F) = 0.5
    P(C=T|A=T,B=T) = 1.0      P(C=F|A=T,B=T) = 0.0
    P(C=T|A=T,B=F) = 0.8      P(C=F|A=T,B=F) = 0.2
    P(C=T|A=F,B=T) = 0.5      P(C=F|A=F,B=T) = 0.5
    P(C=T|A=F,B=F) = 0.1      P(C=F|A=F,B=F) = 0.9
    P(D=T|B=T) = 0.7          P(D=F|B=T) = 0.3
    P(D=T|B=F) = 0.2          P(D=F|B=F) = 0.8
    P(E=T|C=T) = 0.1          P(E=F|C=T) = 0.9
    P(E=T|C=F) = 0.8          P(E=F|C=F) = 0.2
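For concreteness, the tables above can be combined by straightforward marginalization. The sketch below assumes the usual network structure implied by the conditional tables (A and B are independent parents of C, B is the parent of D, and C is the parent of E); that structure is an assumption here, not stated in this excerpt.

```python
# Marginals computed from the CPTs above, assuming A and B are independent
# parents of C and C is the parent of E (assumed structure).
P_A = {True: 0.9, False: 0.1}
P_B = {True: 0.5, False: 0.5}
P_C_given_AB = {(True, True): 1.0, (True, False): 0.8,
                (False, True): 0.5, (False, False): 0.1}  # P(C=T | A, B)
P_E_given_C = {True: 0.1, False: 0.8}                     # P(E=T | C)

# P(C=T) = sum over a, b of P(a) * P(b) * P(C=T | a, b)
p_c = sum(P_A[a] * P_B[b] * P_C_given_AB[(a, b)]
          for a in (True, False) for b in (True, False))

# P(E=T) = P(E=T|C=T) * P(C=T) + P(E=T|C=F) * P(C=F)
p_e = P_E_given_C[True] * p_c + P_E_given_C[False] * (1 - p_c)

print(round(p_c, 3))  # 0.84
print(round(p_e, 3))  # 0.212
```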
There are a number of difficulties with discretization:
A. (Open ended --- there is no right or wrong answer) One problem is that one has to choose the number of discrete categories and the cutoff points, and this choice is generally somewhat arbitrary. Describe an algorithm or a criterion that takes as input a training data set, with a combination of numerical and discrete predictive attributes and a discrete classification attribute, and returns a discretization for each numerical attribute. Discuss the different objectives that arise in choosing a discretization, and whether and how your algorithm addresses them.
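To make the flavor of such a criterion concrete, here is a hedged sketch of one possible approach: pick a single cutoff for a numeric attribute that minimizes the weighted class entropy of the two resulting bins (the core idea behind supervised methods such as Fayyad and Irani's MDL-based discretization). The data and names below are invented for illustration; this is one possible answer, not the intended one.

```python
# Sketch: choose one cutoff for a numeric attribute by minimizing the
# weighted class entropy of the two resulting bins.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cutoff(values, labels):
    """Try each midpoint between adjacent sorted values; return the one
    giving the lowest weighted entropy over the two resulting bins."""
    pairs = sorted(zip(values, labels))
    best, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best, best_score = cut, score
    return best

# Toy data: low values tend to be class "no", high values "yes".
vals = [1, 2, 3, 10, 11, 12]
labs = ["no", "no", "no", "yes", "yes", "yes"]
print(best_cutoff(vals, labs))  # 6.5
```

Applying this recursively within each bin, with a stopping rule, yields a multi-interval discretization; the stopping rule is where the "number of categories" objective reappears.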
B. (Open ended) One particular choice that has to be made is whether or not the algorithm should consider the associated values of the classification attribute in deciding how to discretize the predictive attribute. What are the pros and cons of this? Discuss arguments in both directions.
C. (Open ended) A second problem is that discretization throws away a lot of information. Obviously, one is throwing away information by lumping all the values within a range into a single category; in the example above, we are treating a 0 and a 24 as the same. Slightly less obviously, one is throwing away information in that the relation between categories is lost. For instance, "High" is closer to "Medium" than it is to "Low", but once one has discretized, these are just names, and the learning algorithms don't know this.
Describe a data set in which the first issue is particularly problematic and one in which the second issue is particularly problematic. In both cases, you are trying to design a data set in which the numerical attribute is in fact usefully predictive of the classification attribute, but a large part of this predictive power is lost when the attribute is discretized due to one or the other of these issues.
D. (Not open ended). A third problem is that the results of applying an ML algorithm to a discretized data set may depend very sensitively on the discretization that is used.
Consider a classification problem with two predictive attributes A and B and a numeric classification attribute C with values between 0 and 100. Let D1 be the discretization 0-25 => Low, 26-50 => Medium, 51-100 => High. Let D2 be the discretization 0-50 => Low, 51-75 => Medium, 76-100 => High. Let X be a test example with X.A = X.B = TRUE. Construct a training set T such that, if D1 is applied, then Naive Bayes predicts with high confidence that X.C is Low, and if D2 is applied then Naive Bayes predicts with high confidence that X.C is High.
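As a scaffold for experimenting with part D (this is not the answer), here is a minimal categorical Naive Bayes over rows of Boolean A, Boolean B, and numeric C, with the discretizations D1 and D2 from the problem statement. The training set below is a placeholder; constructing a T with the required behavior is the exercise.

```python
# Minimal Naive Bayes scaffold for part D. d1 and d2 are the
# discretizations from the problem statement; T is a placeholder
# training set, NOT a solution.
from collections import Counter

def d1(c): return "Low" if c <= 25 else ("Medium" if c <= 50 else "High")
def d2(c): return "Low" if c <= 50 else ("Medium" if c <= 75 else "High")

def nb_predict(rows, disc, a, b):
    """rows: list of (A, B, C) with Boolean A, B and numeric C.
    Returns the class maximizing P(class) * P(A=a|class) * P(B=b|class)."""
    labeled = [(ra, rb, disc(rc)) for ra, rb, rc in rows]
    classes = Counter(lab for _, _, lab in labeled)
    def score(lab):
        n = classes[lab]
        pa = sum(1 for ra, _, l in labeled if l == lab and ra == a) / n
        pb = sum(1 for _, rb, l in labeled if l == lab and rb == b) / n
        return (n / len(labeled)) * pa * pb
    return max(classes, key=score)

# Placeholder data and the test example X.A = X.B = TRUE:
T = [(True, True, 10), (True, False, 60), (False, True, 90)]
print(nb_predict(T, d1, True, True))
print(nb_predict(T, d2, True, True))
```

A successful answer to part D is a T for which the first call returns "Low" and the second returns "High", both with scores far larger than the alternatives.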