Programming Assignment 2: Machine Learning
Assigned: October 20.
Due: November 10.
Summary To use Naive Bayes and nearest neighbors learning
methods to predict the party affiliation of congressmen from their
Input data format
The file vote.dat contains, in tabular
form, the voting record of the current House of Representatives on 4 important
votes in 2003.
The first line of the file is a header, with the labels of each row.
This is simply for the human reader; your program may skip this.
Each of the following lines corresponds to one congressman:
Chars. 1-2: Abbreviation of state name.
- Chars. 7-24: Name of congresssman.
- Char 26: Party affiliation: R or D (see comment below).
- Chars 31, 39, 45, 51: Votes on four issues. These are each "Y", "N",
or "-" (no vote).
It must be possible to easily set the size of the training set, NTRAIN,
and the size of the test set NTEST. The training set will then be the
first NTRAIN lines of the data file and the test set will be the last NTEST
lines. You may assume that NTRAIN is between 8 and 100 (inclusive) and that
NTEST is between 20 and 200. (Thus, the middle 134 congressmen never
get looked at in any of these tests.)
In the learning phase, you compute Freq(Party) and Freq(VoteI | Party)
over the training set. In the prediction phase, you combine these using
the Naive Bayes formula to predict the party affiliation from the votes.
In evaluation, you compare the predicted party affiliation to the
recorded affiliation, and note whether the prediction was correct or not.
As discussed in class, the Naive Bayes method elegantly handles null
values, whether in the training data or in the test data.
The data has been deliberately designed so that, for the training sets
you are using, the problem of zero values does not arise (that is,
the above frequencies are never zero.)
The "distance" between two voting records is just the number of bills where
they differ. You should implement the nearest neighbors algorithm as follows:
For a given test instance X:
(Note: since there are only 16 possible voting patterns, all this can
be pre-computed for any fixed training set, if you want.)
- 1. Find the distance D from X to the nearest element of the training set.
- 2. Let S be the set of instances in T at distance D from X.
- 3. Take a "vote" on party affiliation over S. Predict that the
party affiliation of X is the winner of the vote. If the vote is tied,
predict that X is a Republican (the majority party).
The nearest neighbors algorithm is not as tolerant of null values as Naive
Bayes. Adopt the following policies:
- In the training set, any congressman with any null votes should be
- If a test instance X has a null vote, then that attribute should be
ignored for that instance.
For each of two learning methods (Naive Bayes and nearest neighbors)
you should output:
- 1. The accuracy over the test set of the method learned over the training
- 2. The accuracy over the training set of the method learned over the
- 3. For Naive Bayes, you should output the probability assigned
to each outcome for the four congressmen Raymond Green, Denis Majette,
Clifford Stearns, and Michael Castle (the last four lines of the data file.)
- 4. For nearest neighbors, you should output the "vote" on the outcome
for these four congressmen.
The only requirement on the programming is that it must be possible for
the TA to easily (i.e. in not more than 4 minutes) be able to set the
values of the training and test set and run an experiment. Beyond that,
I don't care how you set it up. You may:
Similarly, the actual data may be
- Input these values from a command line.
- Input these value from a window.
- Declare these values as constants in your program, and have them
changed by editing the text of the program.
- Stick these values at the front of the data file.
The two learning algorithms may be either in a single program that always
does them both; or in a single program that chooses which to do on the
basis of some kind of signal; or in two separate programs.
- Read from standard input or a file.
- Included in a large constant declaration in your program.
- Pre-munged in any way or to any degree that you want.
Philosophical comment on this approach to programming.
In general, computer science education and practice rightly favors
the creation of tools, programs that can be applied to a
wide range of problems; as opposed to ad hoc programs that
work on a single problem or a very small range of problems. However,
this preference is more problematic than sometimes acknowledged.
On the one hand, generality can be costly to attain: it generally
requires defining and communicating a large number of conventions;
features that have been thought through and implemented with great effort
end up not being used; interactions of two different
features can be as scary as drug interactions. On the other hand, the
problem at hand often has idiosyncratic features that are not worth
generalizing because you will never see them again or idiosyncratic
features that allow some powerful simplification. There is a place
in the CS world for quick and dirty solutions to highly specific problems.
Email to the grader (email@example.com):
- 1. Source code for your programs(s).
- 2. A README file explaining how to run your code.
- 3. Any other files needed to run the code (e.g. edited versions of
- 4. The output of the following experiments: Run both learning
methods, with NTEST = 200 and with NTRAIN = 8, 16, 24, and 32.
(Thus 8 total experiments.) Plain text is the preferred format for
Comments on this learning problem
As a problem in machine learning, for pedagogical purposes, this application
has its strong points and weak points.
- The data is real-world.
- There is a reasonable amount of it.
- The attributes are discrete, well-defined, and easily understood.
- The predictive attributes are undeniably related to the classification
- The task is pointless. We know the party affiliation of congressmen.
- The task is too easy. Party discipline (or party loyalty)
is so strong at the moment,
especially among the Republicans, that there are very few departures from
party line, except on vote 4. For this reason, this data set could
not be used as a test-bed for serious machine learning research.
Further comments on the data
(The following do not in the least affect the assignment.)
The data is taken from vote-smart.org.
Errors (especially mis-spellings of names) may have been introduced
in transcription. Since these do not affect the assignment, I have not
double-checked this for accuracy. Therefore, the data in this file should
not be used as a source for serious political information without further
The bills at issue were as follows:
- Vote 1: Constitutional amendment to prohibit flag desecration. 300-125.
R 214-11. D 86-113. I 0-1.
- Vote 2: Partial-Birth Abortion Ban. 282-139. R 220-5. D 62-133 I 0-1
- Vote 3: Energy Policy Act of 2003. 247-175. R 207-17. D 40-157. I 0-1
- Vote 4: Permit Importation of Prescription Drugs. 243-186.
R 87-141. D 155-45. I 1-0.
I have performed the following data cleaning:
- The lone independent
in the House, Bernard Sanders of Vermont,
has had his party affiliation changed to "Democrat".
- I have lumped all manner of not voting on a bill, including
"announcing for/against it" together as not voting.
Congressman Richard Gephardt did not vote on any of these
bills --- why not, I do not know.
I have deliberately placed his
line in the middle of the file where it does not enter into any of these
Only 434 congressmen are recorded, rather than 435. Apparently the 19th
district of Texas was vacant during this period. I don't know what the
story is there.