Assigned: Apr. 10
Due: Apr. 24
Note: Those students who took Machine Learning from me last semester did a similar but more difficult assignment as Programming Assignment 1. Therefore, they have the option of skipping this assignment and instead applying their grade from that assignment in its place. If you wish to do this, send me email.
This is actually an experimental assignment, rather than a programming assignment. The assignment is to run three machine learning algorithms --- linear regression, Naive Bayes, and a decision tree algorithm --- over a given data set and compare the accuracy of the output for different sizes of training sets.
The WEKA system is a package of machine learning resources developed by Ian Witten and his research group at the University of Waikato (New Zealand). The code is written in Java, and is available on the Web. (This link above is my local copy. The original package, if you want to get it, is here.)
There are three main types of resources available in WEKA:
There is also an online tutorial on WEKA available in PDF format.
This assignment will involve the "auto-mpg" data set from the UC Irvine Machine Learning Data Repository. The file contains parameters of various makes of cars from various years. The task is to predict the miles per gallon. (Use the version at the course web site, not the version at UC Irvine.)
The auto-mpg data set is in ARFF format. It contains the following parts:
You can find all of the WEKA material on the class Web site. If you want to run it on your own machine, then
1. Download the file weka-3-0-2.jar
2. Expand this using the command "jar xvf weka-3-0-2.jar".
3. Further expand the new file "weka.jar" using the command "jar xvf weka.jar". (The README file says to use "java -jar weka.jar", but that didn't work for me.)
Or you can just download the files you need for this assignment (a small subset of the entire directory).
4. Download the training set auto-mpg.arff and the test set auto-mpg-test.arff
Alternatively, you can run the code from an ACF5 account by changing directory
cd /home/e/ed1/weka/weka-3-0-2
and then running the code as described below.
You can run the code from an account on the Sun system by changing directory
cd /usr/httpd/htdocs_cs/courses/fall00/G22.3033-001/weka/weka-3-0-2
and there you can find directories with the various WEKA programs in Java, which can be run as described below.
If you have any trouble with this, please let me know as soon as possible. Don't wait until the day before the due date to make sure that you can find this code.
Step 1: Eliminate the nominal field "origin" using the command
java weka.filters.AttributeFilter -i data/auto-mpg.arff -o ~/auto-mpgA.arff -R 8
and similarly for the test file.
Note: Here and throughout, I've written these commands so as to create a new file "auto-mpgA.arff" in your home directory. Of course, you can name these files and put them wherever you want.
Step 2: Create subsets of the training data with 5, 10, 15, and 24 elements. Create 3 subsets of each size. A random subset of size 5 can be created using the command
java weka.filters.SplitDatasetFilter -i ~/auto-mpgA.arff -o ~/auto-mpg5a.arff -N 48 -S 1
The argument -N 48 means that the data should be divided into 48 folds. As there are 240 instances in the data, that gives a data set of size 5. Create the other training files analogously. (Note: The different training files need not be disjoint.) The argument -S 1 gives a random number seed, so as to get a random choice. Of course, you should give a different seed for each training set you create.
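As a quick sanity check on the fold arithmetic (assuming the full training set has 240 instances, as stated above), the -N value for each desired subset size is just 240 divided by that size:

```python
# Number of folds (-N) needed so that one fold has the desired subset size,
# assuming 240 instances in the full training set.
TOTAL = 240
for size in (5, 10, 15, 24):
    folds = TOTAL // size
    print(f"subset size {size:2d}  ->  -N {folds}")
# subset size  5  ->  -N 48
# subset size 10  ->  -N 24
# subset size 15  ->  -N 16
# subset size 24  ->  -N 10
```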
Step 3: Run the linear regression algorithm over the different partial training files and the complete training file, and evaluate them over the test file. The linear regression algorithm is run using the command:
java weka.classifiers.LinearRegression -t ~/auto-mpg5a.arff -T ~/auto-mpg-testA.arff -c 1 -S 1
Step 4: Create a plot where the x-axis is the size of the training set (5, 10, 15, 24) and the y-axis is the average accuracy, measured by mean absolute error, over the three sample sets of the specified size.
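For each training-set size, the y-value is just the mean of the three reported errors. A minimal sketch (the error values below are made-up placeholders, not actual WEKA output; substitute the mean absolute errors from your own runs):

```python
# Average the mean absolute error of the three runs at each training-set size.
# These error values are placeholders for illustration only.
errors = {
    5:  [4.1, 3.8, 4.5],
    10: [3.2, 3.5, 3.0],
    15: [2.9, 2.7, 3.1],
    24: [2.6, 2.5, 2.7],
}
averages = {size: sum(vals) / len(vals) for size, vals in errors.items()}
for size in sorted(averages):
    print(size, round(averages[size], 3))
```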
Step 5: Define the similarity of two hypotheses to be the sum of the absolute values of the difference of corresponding coefficients. For instance, given the two models
mpg = -1.5 * cylinders - 1 * acceleration + 0.75 * year + 1.0
and
mpg = -1.0 * cylinders - 0.8 * acceleration + 1.0 * year + 0.8
the difference between these two models is defined to be
|-1.5 - (-1.0)| + |-1 - (-0.8)| + |0.75 - 1.0| + |1.0 - 0.8| = 0.5 + 0.2 + 0.25 + 0.2 = 1.15.
The average similarity of a set of three hypotheses H1, H2, H3 is the average of similarity(H1,H2), similarity(H1,H3), and similarity(H2,H3).
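This similarity computation can be sketched as follows (the function names and the dictionary representation of a model are my own; the coefficient values are taken from the worked example above):

```python
# Similarity of two linear hypotheses: sum of absolute differences of
# corresponding coefficients, including the constant term.
def similarity(h1, h2):
    return sum(abs(h1[k] - h2[k]) for k in h1)

# Average pairwise similarity of three hypotheses.
def average_similarity(h1, h2, h3):
    return (similarity(h1, h2) + similarity(h1, h3) + similarity(h2, h3)) / 3

# The two models from the worked example above:
m1 = {"cylinders": -1.5, "acceleration": -1.0, "year": 0.75, "const": 1.0}
m2 = {"cylinders": -1.0, "acceleration": -0.8, "year": 1.0,  "const": 0.8}
print(round(similarity(m1, m2), 2))  # 1.15
```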
Create a plot where the x-axis is the size of the training set and the y-axis is the average similarity of the hypothesis generated for the training sets of that size.
Step 1: Discretize the classification attribute mpg in both the training set and the test set into 4 bins using the command
java weka.filters.DiscretizeFilter -i data/auto-mpg.arff -o ~/auto-mpg-d1.arff \
    -b -r data/auto-mpg-test.arff -s ~/auto-mpg-test-d1.arff -B 4 -R 1
Step 2: Discretize the remaining attributes in both the training set and the test set into 2 bins using the command
java weka.filters.DiscretizeFilter -i ~/auto-mpg-d1.arff -o ~/auto-mpg-d2.arff \
    -b -r ~/auto-mpg-test-d1.arff -s ~/auto-mpg-test-d2.arff -B 2 -R 2,3,4,5,6,7
Step 3: As in part 1, step 2, create subsets of the training data with 5, 10, 15, and 24 elements. Create 3 subsets of each size.
Step 4: Run the Naive Bayes algorithm on the various discretized training sets, and test them on the discretized test set. The command is
java weka.classifiers.NaiveBayes -t <training file> -T ~/auto-mpg-test-d2.arff -c 1
where <training file> is each of the discretized training sets in turn.
Step 5: Create a plot where the x-axis is the size of the training set (5, 10, 15, 24) and the y-axis is the average accuracy, measured by percentage correct, over the three sample sets of the specified size.
Step 6: Redo step 2, but discretize the remaining attributes into 4 bins rather than 2 (change the argument of -B to 4). Run Naive Bayes again. Compare the results of the two discretizations.
Run the J48 decision tree algorithm on the various discretized training sets, and test them on the discretized test set, using the command
java weka.classifiers.j48.J48 -t <training file> -T ~/auto-mpg-test-d2.arff -M 3
The flag -M 3 requires that nodes with 3 or fewer instances are not further split. Report on the accuracy as in step 5 of part II. Also, discuss the output hypotheses. How much do the trees differ, and how do they change as the training set becomes larger?