## Programming Assignment 2: Machine Learning

Assigned: October 20.
Due: November 10.

### Summary

Use the Naive Bayes and nearest-neighbors learning methods to predict the party affiliation of congressmen from their voting records.

### Input data format

The file vote.dat contains, in tabular form, the voting record of the current House of Representatives on 4 important votes in 2003.

The first line of the file is a header, with the labels of each column. This is simply for the human reader; your program may skip it.

Each of the following lines corresponds to one congressman:

• Chars. 1-2: Abbreviation of state name.
• Chars. 7-24: Name of congressman.
• Char 26: Party affiliation: R or D (see comment below).
• Chars 31, 39, 45, 51: Votes on four issues. These are each "Y", "N", or "-" (no vote).
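Reading the fixed-width columns above might look like the following sketch (the field names and the dict representation are my own choices, not part of the spec; the spec's character positions are 1-based, so they become 0-based indices here):

```python
def parse_line(line):
    """Parse one fixed-width record from vote.dat.

    Spec columns (1-based): chars 1-2 state, 7-24 name, 26 party,
    and 31, 39, 45, 51 the four votes ('Y', 'N', or '-').
    """
    return {
        "state": line[0:2],
        "name": line[6:24].strip(),
        "party": line[25],                              # 'R' or 'D'
        "votes": [line[i] for i in (30, 38, 44, 50)],   # 'Y', 'N', or '-'
    }
```

Parsing by fixed column position (rather than splitting on whitespace) avoids trouble with multi-word names.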

### Learning procedure

It must be possible to easily set the size of the training set, NTRAIN, and the size of the test set NTEST. The training set will then be the first NTRAIN lines of the data file and the test set will be the last NTEST lines. You may assume that NTRAIN is between 8 and 100 (inclusive) and that NTEST is between 20 and 200. (Thus, the middle 134 congressmen never get looked at in any of these tests.)

#### Naive Bayes

In the learning phase, you compute Freq(Party) and Freq(VoteI | Party) over the training set. In the prediction phase, you combine these using the Naive Bayes formula to predict the party affiliation from the votes. In evaluation, you compare the predicted party affiliation to the recorded affiliation, and note whether the prediction was correct or not.

As discussed in class, the Naive Bayes method elegantly handles null values, whether in the training data or in the test data.

The data has been deliberately designed so that, for the training sets you are using, the problem of zero values does not arise (that is, the above frequencies are never zero.)
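A minimal sketch of both phases follows. The record format (dicts with `"party"` and `"votes"` keys) is an assumption; so is the null-handling policy shown, which simply drops `'-'` attributes from the product at prediction time. No smoothing is needed, since the assignment guarantees the frequencies are never zero on the training sets you will use.

```python
from collections import Counter, defaultdict

def train_nb(train):
    """Learning phase: tabulate Freq(Party) and Freq(Vote_i = v | Party)."""
    party_count = Counter(r["party"] for r in train)
    cond = defaultdict(Counter)    # (issue i, party p) -> Counter of 'Y'/'N'/'-'
    for r in train:
        for i, v in enumerate(r["votes"]):
            cond[(i, r["party"])][v] += 1
    return party_count, cond

def predict_nb(party_count, cond, votes):
    """Prediction phase: return {party: probability} for one voting record.

    Null votes ('-') are skipped, so they drop out of the product.
    """
    n = sum(party_count.values())
    score = {}
    for p, c in party_count.items():
        s = c / n                                # Freq(Party)
        for i, v in enumerate(votes):
            if v != "-":
                s *= cond[(i, p)][v] / c         # Freq(Vote_i | Party)
        score[p] = s
    z = sum(score.values())
    return {p: s / z for p, s in score.items()}  # normalize
```

The normalized scores are also what item 3 under "Output" asks you to print for the four named congressmen.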

#### Nearest neighbors

The "distance" between two voting records is just the number of bills where they differ. You should implement the nearest neighbors algorithm as follows: For a given test instance X:
• 1. Find the distance D from X to the nearest element of the training set.
• 2. Let S be the set of instances in the training set at distance D from X.
• 3. Take a "vote" on party affiliation over S. Predict that the party affiliation of X is the winner of the vote. If the vote is tied, predict that X is a Republican (the majority party).
(Note: since there are only 16 possible voting patterns, all this can be pre-computed for any fixed training set, if you want.)

The nearest neighbors algorithm is not as tolerant of null values as Naive Bayes. Adopt the following policies:

• In the training set, any congressman with any null votes should be ignored.
• If a test instance X has a null vote, then that attribute should be ignored for that instance.
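The three steps and both null policies can be sketched as follows (again assuming the dict record format; the function names are mine):

```python
def distance(test_votes, train_votes):
    """Number of issues on which the records differ. A null ('-') in the
    test record causes that issue to be ignored."""
    return sum(1 for x, y in zip(test_votes, train_votes)
               if x != "-" and x != y)

def predict_nn(train, votes):
    """Nearest-neighbors prediction for one test record; ties go to 'R'."""
    # Policy: ignore any training record containing a null vote.
    clean = [r for r in train if "-" not in r["votes"]]
    d = min(distance(votes, r["votes"]) for r in clean)         # step 1
    s = [r for r in clean if distance(votes, r["votes"]) == d]  # step 2
    r_votes = sum(1 for r in s if r["party"] == "R")            # step 3
    d_votes = len(s) - r_votes
    return "R" if r_votes >= d_votes else "D"                   # tie -> 'R'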

### Output

For each of the two learning methods (Naive Bayes and nearest neighbors) you should output:
• 1. The accuracy over the test set of the method learned over the training set.
• 2. The accuracy over the training set of the method learned over the training set.
• 3. For Naive Bayes, you should output the probability assigned to each outcome for the four congressmen Raymond Green, Denis Majette, Clifford Stearns, and Michael Castle (the last four lines of the data file.)
• 4. For nearest neighbors, you should output the "vote" on the outcome for these four congressmen.

### Programming

The only requirement on the programming is that it must be possible for the TA to easily (i.e. in not more than 4 minutes) set the values of the training and test sets and run an experiment. Beyond that, I don't care how you set it up. You may:
• Input these values from a command line.
• Input these values from a window.
• Declare these values as constants in your program, and have them changed by editing the text of the program.
• Stick these values at the front of the data file.
• etc.
Similarly, the actual data may be
• Read from standard input or a file.
• Included in a large constant declaration in your program.
• Pre-munged in any way or to any degree that you want.
The two learning algorithms may be either in a single program that always does them both; or in a single program that chooses which to do on the basis of some kind of signal; or in two separate programs.

#### Philosophical comment on this approach to programming.

In general, computer science education and practice rightly favors the creation of tools, programs that can be applied to a wide range of problems; as opposed to ad hoc programs that work on a single problem or a very small range of problems. However, this preference is more problematic than sometimes acknowledged. On the one hand, generality can be costly to attain: it generally requires defining and communicating a large number of conventions; features that have been thought through and implemented with great effort end up not being used; interactions of two different features can be as scary as drug interactions. On the other hand, the problem at hand often has idiosyncratic features that are not worth generalizing because you will never see them again or idiosyncratic features that allow some powerful simplification. There is a place in the CS world for quick and dirty solutions to highly specific problems.

### To submit

• 1. Source code for your program(s).
• 2. Any other files needed to run the code (e.g. edited versions of the data.)
• 3. The output of the following experiments: Run both learning methods, with NTEST = 200 and with NTRAIN = 8, 16, 24, and 32. (Thus 8 total experiments.) Plain text is the preferred format for this output.

### Comments on this learning problem

As a problem in machine learning, for pedagogical purposes, this application has its strong points and weak points.

Strong points:

• The data is real-world.
• There is a reasonable amount of it.
• The attributes are discrete, well-defined, and easily understood.
• The predictive attributes are undeniably related to the classification attribute.

Weak points:

• The task is pointless. We know the party affiliation of congressmen.
• The task is too easy. Party discipline (or party loyalty) is so strong at the moment, especially among the Republicans, that there are very few departures from party line, except on vote 4. For this reason, this data set could not be used as a test-bed for serious machine learning research.

### Further comments on the data

(The following do not in the least affect the assignment.)

The data is taken from vote-smart.org. Errors (especially mis-spellings of names) may have been introduced in transcription. Since these do not affect the assignment, I have not double-checked this for accuracy. Therefore, the data in this file should not be used as a source for serious political information without further checking.

The bills at issue were as follows:

• Vote 1: Constitutional amendment to prohibit flag desecration. 300-125. R 214-11. D 86-113. I 0-1.
• Vote 2: Partial-Birth Abortion Ban. 282-139. R 220-5. D 62-133. I 0-1.
• Vote 3: Energy Policy Act of 2003. 247-175. R 207-17. D 40-157. I 0-1.
• Vote 4: Permit Importation of Prescription Drugs. 243-186. R 87-141. D 155-45. I 1-0.

I have performed the following data cleaning:

• The lone independent in the House, Bernard Sanders of Vermont, has had his party affiliation changed to "Democrat".
• I have lumped all manner of not voting on a bill, including "announcing for/against it," together as not voting.

Congressman Richard Gephardt did not vote on any of these bills --- why not, I do not know. I have deliberately placed his line in the middle of the file where it does not enter into any of these experiments.

Only 434 congressmen are recorded, rather than 435. Apparently the 19th district of Texas was vacant during this period. I don't know what the story is there.