## Problem Set 5

Assigned: April 6
Due: April 20

### Problem 1

Consider a Bayesian network over Boolean variables A, B, C, D, E, in which A and B are the parents of C, B is the parent of D, and C is the parent of E:

Assume that all the random variables are Boolean, and that you have the following probabilities:
P(A=T) = 0.9. P(A=F) = 0.1.
P(B=T) = 0.5. P(B=F) = 0.5.
P(C=T|A=T,B=T) = 1.0. P(C=F|A=T,B=T) = 0.0.
P(C=T|A=T,B=F) = 0.8. P(C=F|A=T,B=F) = 0.2.
P(C=T|A=F,B=T) = 0.5. P(C=F|A=F,B=T) = 0.5.
P(C=T|A=F,B=F) = 0.1. P(C=F|A=F,B=F) = 0.9.
P(D=T|B=T) = 0.7. P(D=F|B=T) = 0.3.
P(D=T|B=F) = 0.2. P(D=F|B=F) = 0.8.
P(E=T|C=T) = 0.1. P(E=F|C=T) = 0.9.
P(E=T|C=F) = 0.8. P(E=F|C=F) = 0.2.
Compute the following quantities:
• A. P(C=T) =
P(C=T,A=T,B=T) + P(C=T,A=T,B=F) + P(C=T,A=F,B=T) + P(C=T,A=F,B=F) = (Conjunctive rule)
P(C=T|A=T,B=T)*P(A=T,B=T) + P(C=T|A=T,B=F)*P(A=T,B=F) + P(C=T|A=F,B=T)*P(A=F,B=T) + P(C=T|A=F,B=F)*P(A=F,B=F) = (Independence)
P(C=T|A=T,B=T)*P(A=T)*P(B=T) + P(C=T|A=T,B=F)*P(A=T)*P(B=F) + P(C=T|A=F,B=T)*P(A=F)*P(B=T) + P(C=T|A=F,B=F)*P(A=F)*P(B=F) =
1.0*0.9*0.5 + 0.8*0.9*0.5 + 0.5*0.1*0.5 + 0.1*0.1*0.5 = 0.84.
• B. P(C=T|A=T) = P(C=T,B=T|A=T) + P(C=T,B=F|A=T) = (Conjunctive rule)
P(C=T|B=T,A=T) * P(B=T|A=T) + P(C=T|B=F,A=T) * P(B=F|A=T) = (Independence)
P(C=T|B=T,A=T) * P(B=T) + P(C=T|B=F,A=T) * P(B=F) = 1.0*0.5 + 0.8*0.5 = 0.9.
• C. P(C=T,D=T) = P(C=T,D=T,A=T,B=T) + P(C=T,D=T,A=T,B=F) + P(C=T,D=T,A=F,B=T) + P(C=T,D=T,A=F,B=F) = (Conjunctive rule)
P(C=T|D=T,A=T,B=T) * P(D=T|A=T,B=T) * P(A=T,B=T) + P(C=T|D=T,A=T,B=F) * P(D=T|A=T,B=F) * P(A=T,B=F) + P(C=T|D=T,A=F,B=T) * P(D=T|A=F,B=T) * P(A=F,B=T) + P(C=T|D=T,A=F,B=F) * P(D=T|A=F,B=F) * P(A=F,B=F) = (Independence)
P(C=T|A=T,B=T)*P(D=T|B=T)*P(A=T)*P(B=T) + P(C=T|A=T,B=F)*P(D=T|B=F)*P(A=T)*P(B=F) + P(C=T|A=F,B=T)*P(D=T|B=T)*P(A=F)*P(B=T) + P(C=T|A=F,B=F)*P(D=T|B=F)*P(A=F)*P(B=F) =
1.0*0.7*0.9*0.5 + 0.8*0.2*0.9*0.5 + 0.5*0.7*0.1*0.5 + 0.1*0.2*0.1*0.5 = 0.4055.
• D. P(A=T|C=T) = (Bayes' law)
P(C=T|A=T) P(A=T)/P(C=T) = (from (A) and (B)) 0.9*0.9/0.84 = 0.964
• E. P(A=T|B=T,C=T) = (defn. of conditional probability) P(A=T,B=T,C=T) / P(B=T,C=T).

Now P(A=T,B=T,C=T) = (as in (A)) P(C=T|A=T,B=T) * P(A=T) * P(B=T) = 1.0 * 0.9 * 0.5 = 0.45.

P(B=T,C=T) = P(A=T,B=T,C=T) + P(A=F,B=T,C=T) and
P(A=F,B=T,C=T) = P(C=T|A=F,B=T) * P(A=F) * P(B=T) = 0.5*0.1*0.5 = 0.025,
so P(B=T,C=T) = 0.45+0.025 = 0.475.
So P(A=T|B=T,C=T) = 0.45/0.475 = 0.947

Note that P(A=T) < P(A=T|B=T,C=T) < P(A=T|C=T). Intuitively, A and B being true each make C more likely to be true. Therefore if C is observed to be true, then the explanation may be that C is true because A is true. However, if B is also known to be true, then that gives an alternative explanation for the fact that C is true, weakening the conclusion that A is true.

• F. P(E=T|A=T) =
P(E=T,B=T,C=T|A=T) + P(E=T,B=T,C=F|A=T) + P(E=T,B=F,C=T|A=T) + P(E=T,B=F,C=F|A=T) = (conjunction)
P(E=T|C=T,A=T,B=T) * P(C=T|A=T,B=T) * P(B=T|A=T) + P(E=T|C=F,A=T,B=T) * P(C=F|A=T,B=T) * P(B=T|A=T) + P(E=T|C=T,A=T,B=F) * P(C=T|A=T,B=F) * P(B=F|A=T) + P(E=T|C=F,A=T,B=F) * P(C=F|A=T,B=F) * P(B=F|A=T) = (independence)
P(E=T|C=T) * P(C=T|A=T,B=T) * P(B=T) + P(E=T|C=F) * P(C=F|A=T,B=T) * P(B=T) + P(E=T|C=T) * P(C=T|A=T,B=F) * P(B=F) + P(E=T|C=F) * P(C=F|A=T,B=F) * P(B=F) =
0.1*1.0*0.5 + 0.8*0.0*0.5 + 0.1*0.8*0.5 + 0.8*0.2*0.5 = 0.17
• G. P(D=T|C=T) = P(C=T,D=T)/P(C=T) (defn. of conditional probability) = 0.4055/0.84 = 0.4827
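The hand computations in parts (A)–(G) can be checked by enumerating the full joint distribution defined by the network's CPTs. A minimal sketch (the dictionary and function names are my own, not part of the problem):

```python
from itertools import product

# CPTs from the problem statement, keyed by parent values.
P_A = {True: 0.9, False: 0.1}
P_B = {True: 0.5, False: 0.5}
P_C = {(True, True): 1.0, (True, False): 0.8,      # key: (A, B)
       (False, True): 0.5, (False, False): 0.1}
P_D = {True: 0.7, False: 0.2}                      # key: B
P_E = {True: 0.1, False: 0.8}                      # key: C

def joint(a, b, c, d, e):
    """P(A=a, B=b, C=c, D=d, E=e) via the chain rule for the network."""
    pc = P_C[(a, b)] if c else 1 - P_C[(a, b)]
    pd = P_D[b] if d else 1 - P_D[b]
    pe = P_E[c] if e else 1 - P_E[c]
    return P_A[a] * P_B[b] * pc * pd * pe

def prob(pred):
    """Probability of the event picked out by pred(a, b, c, d, e)."""
    return sum(joint(*w) for w in product((True, False), repeat=5) if pred(*w))

p_c   = prob(lambda a, b, c, d, e: c)                                        # (A) 0.84
p_ca  = prob(lambda a, b, c, d, e: c and a) / prob(lambda a, b, c, d, e: a)  # (B) 0.9
p_cd  = prob(lambda a, b, c, d, e: c and d)                                  # (C) 0.4055
p_ac  = prob(lambda a, b, c, d, e: a and c) / p_c                            # (D) ~0.964
p_abc = (prob(lambda a, b, c, d, e: a and b and c)
         / prob(lambda a, b, c, d, e: b and c))                              # (E) ~0.947
p_ea  = prob(lambda a, b, c, d, e: e and a) / prob(lambda a, b, c, d, e: a)  # (F) 0.17
p_dc  = p_cd / p_c                                                           # (G) ~0.4827
```

Enumerating all 32 assignments is exact here; for larger networks one would use variable elimination instead.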

### Problem 2

A standard method for applying an ML classification method designed for discrete data to a data set with numerical attributes is to discretize the values; that is, to divide the space of values into K discrete categories, such as "High" and "Low" for K=2 or "High", "Medium", and "Low" for K=3, by choosing K-1 cutoff points. For instance, if the attribute is an integer between 0 and 100, then one might decide that "Low" is 0-24, "Medium" is 25-50, and "High" is 51-100. Of course this can be done for predictive attributes, classification attributes, or both. In parts A, B, and C, assume that the classification attribute is discrete and that we are concerned with discretizing one or more of the predictive attributes. In D the classification attribute is numeric and there are two Boolean predictive attributes.
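The example discretization above can be written directly as a small function (a sketch; the function name is mine):

```python
def discretize_0_100(x):
    # Cutoffs from the example: 0-24 => Low, 25-50 => Medium, 51-100 => High.
    if x <= 24:
        return "Low"
    if x <= 50:
        return "Medium"
    return "High"
```

So `discretize_0_100(24)` is "Low" while `discretize_0_100(25)` is "Medium"; every value in a category is treated identically from then on.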

There are a number of difficulties with discretization:

A. (Open ended --- there is no right or wrong answer) One problem is that one has to choose the number of discrete categories and the cutoff points and that this is generally somewhat arbitrary. Describe an algorithm or a criterion that takes as input a training data set, with a combination of numerical and discrete predictive attributes and a discrete classification attribute, and returns a discretization for each numerical attribute. Discuss the different objectives that arise in choosing a discretization, and whether and how your algorithm addresses them.

Answer: There are two main trade-offs. One is the decision of how finely to discriminate. Too coarse a subdivision loses predictive power, by lumping together values of the predictive attribute with very different significance for the classification attribute. Too fine a subdivision means that there will be too few instances with each value to allow reliable prediction, and an elevated chance of attaching significance to what is just random variation. Both of these effects are magnified if you have several numeric attributes to discretize simultaneously, and the classification attribute depends on some complex combination of these. But in that case, you are probably better off using a numerical prediction method, rather than discretizing.

The second trade-off is between looking for large gaps in the values and trying to split the space of instances evenly. The advantages of looking for large gaps are that (a) the gap may reflect some real qualitative difference, which will then be reflected in the classification attribute; and (b) the classifier becomes less sensitive to the precise value of the cutoff, since few instances are near the cutoff. The disadvantage is that you may end up splitting the instances into two sets of uneven size, and having little predictive power on the large set.

One simple type of algorithm is to fix the number K of discrete categories as a function of the number N of instances, to fix a minimum fraction P < 1/K of the instances that must go into each category, and then to look for the K-1 largest gaps that create categories each of size at least PN.
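The algorithm just described might be sketched as follows (a greedy variant; all names are mine, and this is one possible reading, not a definitive implementation):

```python
def discretize(values, K, P):
    """Choose K-1 cutoffs at the widest gaps, subject to each of the K
    categories holding at least a fraction P of the N instances."""
    xs = sorted(values)
    N = len(xs)
    min_size = int(P * N)
    # Gap i is the interval between xs[i] and xs[i+1]; rank gaps by width.
    by_width = sorted(range(N - 1), key=lambda i: xs[i + 1] - xs[i], reverse=True)
    cuts = []
    for i in by_width:
        if len(cuts) == K - 1:
            break
        trial = sorted(cuts + [i])
        # Sizes of the categories the trial cutoffs would induce.
        bounds = [0] + [j + 1 for j in trial] + [N]
        sizes = [bounds[k + 1] - bounds[k] for k in range(len(bounds) - 1)]
        if min(sizes) >= min_size:
            cuts = trial
    # Report each cutoff as the midpoint of its gap.
    return [(xs[i] + xs[i + 1]) / 2 for i in cuts]
```

For example, on the values 1, 2, 3, 10, 11, 12, 50, 51, 52 with K=3 and P=0.2, the two widest gaps give cutoffs 6.5 and 31.0.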

B. (Open ended) One particular choice that has to be made is whether or not the algorithm should consider the associated values of the classification attribute in deciding how to discretize the predictive attribute. What are the pros and cons of this? Discuss arguments in both directions.

The argument in favor is that there may be a cutoff value for predictive attribute A which is very powerful in predicting classification C but is not particularly evident just by looking at the values of A.

The argument against is that this is susceptible to overfitting if a fine discretization is used.

C. (Open ended). A second problem is that discretization throws away a lot of information. Obviously, one is throwing away information by lumping all the values within a range into a single category; in the example above, we are treating a 0 and a 24 as the same. Slightly less obviously, one is throwing away information in that the relation between categories is lost. For instance, "High" is closer to "Medium" than it is to "Low", but once one has discretized, these are just names, and the learning algorithms don't know this.

Describe a data set in which the first issue is particularly problematic and one in which the second issue is particularly problematic. In both cases, you are trying to design a data set in which the numerical attribute is in fact usefully predictive of the classification attribute, but a large part of this predictive power is lost when the attribute is discretized due to one or the other of these issues.

Answer: On thinking about this some more, I decided that this was a bad question. The two problems occur with the same kinds of data sets; namely, ones where the significance of the predictive attribute changes smoothly with its value, and where the amount of data is limited compared to the precision needed. The first problem occurs if you discretize too coarsely, and the second occurs if you discretize too finely.

Suppose, for instance, that the classification attribute C has k different values. Suppose that for the ith value of C, the distribution of X.A is a bell curve centered at i/k with standard deviation 1/sqrt(k), and suppose you have k^2 points in total. Then if you take the average of the A values for each value of C, you can probably recover most of the ordering. However, if you discretize coarsely, you will blur the values in the range together, and if you discretize finely, you will lose predictive power due to random variation in each small subcollection.

D. (Not open ended). A third problem is that the results of applying an ML algorithm to a discretized data set may depend very sensitively on the discretization that is used.

Consider a classification problem with two predictive attributes A and B and a numeric classification attribute C with values between 0 and 100. Let D1 be the discretization 0-25 => Low, 26-50 => Medium, 51-100 => High. Let D2 be the discretization 0-50 => Low, 51-75 => Medium, 76-100 => High. Let X be a test example with X.A = X.B = TRUE. Construct a training set T such that, if D1 is applied, then Naive Bayes predicts with high confidence that X.C is Low, and if D2 is applied then Naive Bayes predicts with high confidence that X.C is High.

| A | B | C | Number of instances |
|-------|-------|----|---------------------|
| TRUE | TRUE | 10 | 10 |
| FALSE | FALSE | 40 | 90 |
| FALSE | FALSE | 60 | 90 |
| TRUE | TRUE | 90 | 10 |

Using Naive Bayes with D1:
P(C=Low | A=TRUE,B=TRUE) = q*P(A=TRUE | C=Low) * P(B=TRUE | C=Low) * P(C=Low) = q*1*1*0.05 = 0.05q.
P(C=Med | A=TRUE,B=TRUE) = q*P(A=TRUE | C=Med) * P(B=TRUE | C=Med) * P(C=Med) = q*0.0*0.0*0.45 = 0
P(C=Hi | A=TRUE,B=TRUE) = q*P(A=TRUE | C=Hi) * P(B=TRUE | C=Hi) * P(C=Hi) = q*0.1*0.1*0.5 = 0.005q

These sum to 0.055q, which must equal 1, so q = 1/0.055 and the normalized probabilities are:
P(C=Low|A=TRUE,B=TRUE) = 0.05/0.055 = 0.909.
P(C=Med|A=TRUE,B=TRUE) = 0.0
P(C=Hi|A=TRUE,B=TRUE) = 0.091

By symmetry, exactly the reverse is true if D2 is used.
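Both directions can be checked by running Naive Bayes on the training set under each discretization (a sketch; the helper names are mine):

```python
from collections import Counter

# Training set T from the table above: ((A, B), C, number of instances).
rows = [((True, True), 10, 10), ((False, False), 40, 90),
        ((False, False), 60, 90), ((True, True), 90, 10)]

def d1(c):
    return "Low" if c <= 25 else "Med" if c <= 50 else "Hi"

def d2(c):
    return "Low" if c <= 50 else "Med" if c <= 75 else "Hi"

def naive_bayes(a, b, disc):
    """Posterior over the discretized C for a test example with A=a, B=b."""
    class_n, a_true, b_true = Counter(), Counter(), Counter()
    for (ra, rb), c, n in rows:
        lbl = disc(c)
        class_n[lbl] += n
        if ra:
            a_true[lbl] += n
        if rb:
            b_true[lbl] += n
    total = sum(class_n.values())
    scores = {}
    for lbl, n in class_n.items():
        pa = a_true[lbl] / n if a else 1 - a_true[lbl] / n
        pb = b_true[lbl] / n if b else 1 - b_true[lbl] / n
        scores[lbl] = pa * pb * n / total   # P(A|C) * P(B|C) * P(C)
    z = sum(scores.values())                # normalizing factor
    return {lbl: s / z for lbl, s in scores.items()}
```

Here `naive_bayes(True, True, d1)` puts about 0.909 on Low, while `naive_bayes(True, True, d2)` puts about 0.909 on Hi, as derived above.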

Note that the Naive Bayes assumption that A and B are independent given C is very far from true under either of the discretizations, though it is true, in a degenerate sense, in the original data. Thus, independence is also susceptible to grouping effects.