Programming Assignment 2: Machine Learning

Assigned: October 20.
Due: November 10.

Summary

To use Naive Bayes and nearest neighbors learning methods to predict the party affiliation of congressmen from their voting records.

Input data format

The file vote.dat contains, in tabular form, the voting record of the current House of Representatives on 4 important votes in 2003.

The first line of the file is a header, with the labels of each row. This is simply for the human reader; your program may skip this.

Each of the following lines corresponds to one congressman:

Learning procedure

It must be possible to easily set the size of the training set, NTRAIN, and the size of the test set NTEST. The training set will then be the first NTRAIN lines of the data file and the test set will be the last NTEST lines. You may assume that NTRAIN is between 8 and 100 (inclusive) and that NTEST is between 20 and 200. (Thus, the middle 134 congressmen never get looked at in any of these tests.)

Naive Bayes:

In the learning phase, you compute Freq(Party) and Freq(VoteI | Party) over the training set. In the prediction phase, you combine these using the Naive Bayes formula to predict the party affiliation from the votes. In evaluation, you compare the predicted party affiliation to the recorded affiliation, and note whether the prediction was correct or not.

As discussed in class, the Naive Bayes method elegantly handles null values, whether in the training data or in the test data.

The data has been deliberately designed so that, for the training sets you are using, the problem of zero values does not arise (that is, the above frequencies are never zero.)

Nearest neighbors

The "distance" between two voting records is just the number of bills where they differ. You should implement the nearest neighbors algorithm as follows: For a given test instance X: (Note: since there are only 16 possible voting patterns, all this can be pre-computed for any fixed training set, if you want.)

The nearest neighbors algorithm is not as tolerant of null values as Naive Bayes. Adopt the following policies:

Output

For each of two learning methods (Naive Bayes and nearest neighbors) you should output: Sample output

Programming

The only requirement on the programming is that it must be possible for the TA to easily (i.e. in not more than 4 minutes) be able to set the values of the training and test set and run an experiment. Beyond that, I don't care how you set it up. You may: Similarly, the actual data may be The two learning algorithms may be either in a single program that always does them both; or in a single program that chooses which to do on the basis of some kind of signal; or in two separate programs.

Philosophical comment on this approach to programming.

In general, computer science education and practice rightly favors the creation of tools, programs that can be applied to a wide range of problems; as opposed to ad hoc programs that work on a single problem or a very small range of problems. However, this preference is more problematic than sometimes acknowledged. On the one hand, generality can be costly to attain: it generally requires defining and communicating a large number of conventions; features that have been thought through and implemented with great effort end up not being used; interactions of two different features can be as scary as drug interactions. On the other hand, the problem at hand often has idiosyncratic features that are not worth generalizing because you will never see them again or idiosyncratic features that allow some powerful simplification. There is a place in the CS world for quick and dirty solutions to highly specific problems.

To submit

Email to the grader (mysore@cs.nyu.edu):

Comments on this learning problem

As a problem in machine learning, for pedagogical purposes, this application has its strong points and weak points.

Strong points:

Weak points:

Further comments on the data

(The following do not in the least affect the assignment.)

The data is taken from vote-smart.org. Errors (especially mis-spellings of names) may have been introduced in transcription. Since these do not affect the assignment, I have not double-checked this for accuracy. Therefore, the data in this file should not be used as a source for serious political information without further checking.

The bills at issue were as follows:

I have performed the following data cleaning:

Congressman Richard Gephardt did not vote on any of these bills --- why not, I do not know. I have deliberately placed his line in the middle of the file where it does not enter into any of these experiments.

Only 434 congressmen are recorded, rather than 435. Apparently the 19th district of Texas was vacant during this period. I don't know what the story is there.