Machine Learning: Study Questions

The final exam will be closed book. You may bring a calculator if you like, but you will not need it.

Topics covered

Please note: This is not a sample exam. The actual exam will not be nearly so long.

Sample questions

How does the candidate elimination algorithm work? What are its limitations?

What is the entropy, and what is its significance?

What is a decision tree? In the ID3 algorithm, what is the expected information gain, and how is it used? What is the gain ratio, and what is the advantage of using the gain ratio over using the expected information gain? Describe a strategy that can be used to avoid overfitting in decision trees.

How can a decision tree be converted into a rule set? Illustrate with an example. What are the advantages of the rule set representation over the decision tree representation?

Give the rules for (a) the optimal Bayesian hypothesis; (b) the maximum likelihood hypothesis. When are these the same?

Describe the Naive Bayesian method of classification. What assumptions does this method make about the attributes and the classification? Give an example where this assumption is not justified. What is the Laplacian correction, and why is it necessary?

Describe the nearest neighbors algorithm.
In most learning algorithms, the computation time required for training is large and the time required to apply the classifier is short. The reverse is true of the nearest neighbors algorithm. How can the computation time required for queries be reduced?

In linear prediction, what is the error function that is typically minimized? Under certain assumptions, this can be viewed as a form of maximum likelihood prediction: What are those assumptions? What are the disadvantages of using this error function? Name an alternative error function that avoids these disadvantages.

What is the difference between linear prediction and linear classification? What error function is typically minimized in applications of linear classification? What technique is used to find the optimal linear classifier?

Give a high-level description of the feed-forward, back-propagation algorithm: What does a ``neuron'' do, what is the architecture of the network, how is the classification task carried out, and how is learning carried out? How do you avoid overfitting in back-propagation? To what extent is this a realistic model of actual neurons in an actual brain?

What is the ``sparse data'' problem in statistically-based natural language processing? How is this addressed in Markov models?

What is the ``accuracy'' and the ``coverage'' of an association rule over a database? Describe the algorithm for finding association rules.

Describe an algorithm for clustering.

What is the difficulty with evaluating learning methods that do not address classification, such as association rules and clusters?

What is knowledge-based learning? Describe explanation-based generalization. In what sense is this learning? What is the role of the data in driving explanation-based generalization?

Suppose that you have learned a classifier for some Boolean property. You run your classifier over a randomly selected test set of 1024 instances, and find that it gives the correct answer three-quarters of the time. Give the formula for the 95% confidence interval for the true accuracy of the classifier. What is the meaning of the 95% confidence interval? (You may give the non-rigorous definition given in the textbook, rather than the rigorous definition I gave in class.)

In the above problem, if you increase the size of the test set by a factor of 16 (i.e. use a test set of 16,384 instances), what happens to the confidence interval?

In tuning the parameters of a learning algorithm to a given application, why is it necessary to use a validation set that is separate from the test set?

What is stratification? What is cross-validation? What is ``leave-one-out'' cross-validation? Why is cross-validation useful, and what are its drawbacks?

What is the theory of minimum description length learning? How is this theory used with imperfect classifiers? Explain, using an example, how it attempts to address the problem of overfitting.

A learning theory may be divided into the following parts:
(1) A hypothesis space;
(2) A representation for hypotheses
(3) A preference criterion over hypotheses, independent of the data
(4) A measure of how well a given hypothesis fits given data.
(5) A search strategy to find a good hypothesis for a given data set.
For the machine learning method of your choice, explain what each of these is. (For certain methods, some of these are degenerate.)