Introduction to Artificial Intelligence: Programming assignment 2

Assigned: Apr. 11
Due: Apr. 25

Overview of the assignment

This is actually an experimental assignment rather than a programming assignment. The assignment is to run three machine learning algorithms --- linear regression, Naive Bayes, and a decision tree algorithm --- over a given data set and compare the accuracy of the output for different sizes of training sets.

The WEKA system is a package of machine learning resources developed by Ian Witten and his research group at the University of Waikato (New Zealand). The code is written in Java and is available on the Web. (The copy in the class web site is my local copy; the original package, if you want to get it, is available from the WEKA project page.)

There are three main types of resources available in WEKA:

There is also an online tutorial on WEKA available in PDF format.

This assignment will involve the "auto-mpg" data set from the UC Irvine Machine Learning Data Repository. The file contains parameters of various makes of cars from various years. The task is to predict the miles per gallon. (Use the version at the course web site, not the version at UC Irvine.)

The auto-mpg data set is in ARFF format. An ARFF file contains the following parts: comment lines beginning with "%"; a "@relation" line naming the data set; one "@attribute" declaration per attribute, giving its name and type; and a "@data" section with one instance per line, values separated by commas.
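
For illustration, a minimal ARFF file has roughly this shape. (The attribute names and the origin values below are my guesses at the course file's header, based on the standard auto-mpg fields; the actual file may differ in detail.)

% lines beginning with "%" are comments
@relation auto-mpg
@attribute mpg real
@attribute cylinders real
@attribute displacement real
@attribute horsepower real
@attribute weight real
@attribute acceleration real
@attribute model-year real
@attribute origin {1,2,3}
@data
18, 8, 307, 130, 3504, 12, 70, 1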

Running the code

You can find all of the WEKA material in the weka directory on the class Web site. If you want to run it on your own machine, then
1. Download the file weka-3-0-2.jar.
2. Expand this using the command "jar xvf weka-3-0-2.jar".
3. Further expand the new file "weka.jar" using the command "jar xvf weka.jar". (The README file says to use "jar -jar weka.jar", but that didn't work for me.)
4. Download the training set auto-mpg.arff and the test set auto-mpg-test.arff.
Or, if you prefer, you can download just the files you need for this assignment (a small subset of the entire directory).

Alternatively, you can run the code from an ACF5 account by changing directory

cd /home/e/ed1/weka/weka-3-0-2
and then running the code as described below.

You can run the code from an account on the Sun system by changing directory

  cd /usr/httpd/htdocs_cs/courses/fall00/G22.3033-001/weka/weka-3-0-2
and there you can find directories with the various WEKA programs in Java, which can be run as described below.

If you have any trouble with this, please let me know as soon as possible. Don't wait until the day before the due date to make sure that you can find this code.

The Experiments

Part I: Linear regression

In this section, you will test the accuracy of linear regression as a predictor, using training sets of different sizes.

Step 1: Eliminate the nominal field "origin" using the command

java weka.filters.AttributeFilter -i data/auto-mpg.arff -o ~/auto-mpgA.arff -R 8
and similarly for the test file.
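
For example, the corresponding command for the test file, with the output name chosen to match the -T argument used in step 3 below, would be

java weka.filters.AttributeFilter -i data/auto-mpg-test.arff -o ~/auto-mpg-testA.arff -R 8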

Note: Here and throughout, I've written these commands so as to create a new file "auto-mpgA.arff" in your home directory. Of course, you can name the files and put them wherever you want.

Step 2: Create subsets of the training data with 5, 10, 15, and 24 elements. Create 3 subsets of each size. A random subset of size 5 can be created using the command

java weka.filters.SplitDatasetFilter -i ~/auto-mpgA.arff -o ~/auto-mpg5a.arff -N 48 -S 1
The argument -N 48 means that the data should be divided into 48 folds. As there are 240 instances in the data, that gives a data set of size 5. Create the other training files analogously. (Note: The different training files need not be disjoint.) The argument -S 1 gives a random number seed, so as to get a random choice. Of course, you should give a different seed for each training set you create.
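
If you like, all twelve subsets (4 sizes, 3 seeds each) can be generated with a small shell loop; the output naming convention here is just one possibility, and the seeds 1, 2, 3 are arbitrary:

# fold counts 48, 24, 16, 10 give subsets of sizes 5, 10, 15, 24
# out of the 240 instances
for n in 48 24 16 10; do
  for s in 1 2 3; do
    java weka.filters.SplitDatasetFilter -i ~/auto-mpgA.arff \
      -o ~/auto-mpg-n$n-s$s.arff -N $n -S $s
  done
done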

Step 3: Run the linear regression algorithm over the different partial training files and the complete training file, and evaluate them over the test file. The linear regression algorithm is run using the command:

java weka.classifiers.LinearRegression -t ~/auto-mpg5a.arff -T ~/auto-mpg-testA.arff -c 1 -S 1
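
Since there are thirteen runs in all (twelve subsets plus the full training set), a loop such as the following may save typing; it assumes the subset file names suggested in step 2:

for f in ~/auto-mpg-n*.arff ~/auto-mpgA.arff; do
  echo "== $f =="
  java weka.classifiers.LinearRegression -t "$f" -T ~/auto-mpg-testA.arff -c 1 -S 1
done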

Step 4: Create a plot where the x-axis is the size of the training set (5, 10, 15, 24) and the y-axis is the average accuracy, measured by mean absolute error, over the three sample sets of the specified size.
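
Any plotting tool is fine. As one possibility, if you copy the averages by hand into a two-column text file (size, then average mean absolute error), gnuplot can draw the curve; the file name lr-mae.dat is just a suggestion:

gnuplot -e "set xlabel 'training set size'; set ylabel 'mean absolute error'; plot 'lr-mae.dat' with linespoints; pause -1"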

Step 5: Define the similarity of two hypotheses to be the sum of the absolute values of the difference of corresponding coefficients. For instance, given the two models

mpg = -1.5 * cylinders - 1 * acceleration + 0.75 * year + 1.0
        and
mpg = -1.0 * cylinders - 0.8 * acceleration + 1.0 * year + 0.8
the difference between these two models is defined to be
|-1.5 - (-1.0)| + |-1 - (-0.8)| + | 0.75 - 1.0 | + |1.0 - 0.8 |  = 0.5 + 0.2 + 0.25 + 0.2 = 1.15.
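
If you would rather not do this arithmetic by hand, a short awk script can compute it; here the coefficients of the two hypotheses have been copied by hand into a file, one hypothesis per line, in the same attribute order:

cat > coeffs.txt <<'EOF'
-1.5 -1.0 0.75 1.0
-1.0 -0.8 1.00 0.8
EOF
awk 'NR==1 {for (i=1; i<=NF; i++) a[i]=$i}
     NR==2 {s=0; for (i=1; i<=NF; i++) s += ($i > a[i] ? $i-a[i] : a[i]-$i);
            print "similarity:", s}' coeffs.txt

This prints "similarity: 1.15", matching the computation above.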

The average similarity of a set of three hypotheses H1, H2, H3 is the average of similarity(H1,H2), similarity(H1,H3), and similarity(H2,H3).

Create a plot where the x-axis is the size of the training set and the y-axis is the average similarity of the hypothesis generated for the training sets of that size.

Part II: Naive Bayes

Step 1: Discretize the classification attribute mpg, in both the training set and the test set, into 4 bins using the command

java weka.filters.DiscretizeFilter -i data/auto-mpg.arff -o ~/auto-mpg-d1.arff \
-b -r data/auto-mpg-test.arff -s ~/auto-mpg-test-d1.arff -B 4 -R 1

(The -b, -r, and -s flags run the filter in batch mode, so that the test set is discretized with the same bin boundaries as the training set; -B gives the number of bins, and -R the attributes to discretize.)

Step 2: Discretize the remaining attributes, in both the training set and the test set, into 2 bins using the command

java weka.filters.DiscretizeFilter -i ~/auto-mpg-d1.arff -o ~/auto-mpg-d2.arff \
-b -r ~/auto-mpg-test-d1.arff -s ~/auto-mpg-test-d2.arff -B 2 -R 2,3,4,5,6,7

Step 3: As in part 1, step 2, create subsets of the training data with 5, 10, 15, and 24 elements. Create 3 subsets of each size.
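
For example, a 5-element subset of the discretized training data can be created with (the output name is again just a suggestion):

java weka.filters.SplitDatasetFilter -i ~/auto-mpg-d2.arff -o ~/auto-mpg-d2-5a.arff -N 48 -S 1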

Step 4: Run the Naive Bayes algorithm on the various discretized training sets, and test them on the discretized test set. The command is

java weka.classifiers.NaiveBayes -t <training file> -T ~/auto-mpg-test-d2.arff -c 1

where <training file> is one of the discretized training sets from step 3. (As in part I, -c 1 makes mpg, the first attribute, the class attribute.)
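
For example, with the subset file name suggested in step 3:

java weka.classifiers.NaiveBayes -t ~/auto-mpg-d2-5a.arff -T ~/auto-mpg-test-d2.arff -c 1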

Step 5: Create a plot where the x-axis is the size of the training set (5, 10, 15, 24) and the y-axis is the average accuracy, measured by percentage correct, over the three sample sets of the specified size.

Step 6: Redo step 2, but discretize the remaining attributes into 4 bins rather than 2 (change the argument of -B to 4). Run Naive Bayes again. Compare the results of the two discretizations.
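
That is, the second discretization step becomes something like the following, with output names chosen so as not to overwrite the 2-bin files:

java weka.filters.DiscretizeFilter -i ~/auto-mpg-d1.arff -o ~/auto-mpg-d4.arff \
-b -r ~/auto-mpg-test-d1.arff -s ~/auto-mpg-test-d4.arff -B 4 -R 2,3,4,5,6,7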

Part III: Decision trees

Using the discretized training sets and test set created in steps 1 through 3 of part II, run the C4.5 algorithm (a sophisticated decision tree algorithm) using the command

java weka.classifiers.j48.J48 -t <training file> -T ~/auto-mpg-test-d2.arff \
-M 3 -c 1

The flag -M 3 requires that nodes with 3 or fewer instances not be split further. Report on the accuracy as in step 5 of part II. Also, discuss the output hypotheses. How much do the trees differ, and how do they change as the training set becomes larger?
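
For example, with the 5-element subset file name suggested in part II, step 3:

java weka.classifiers.j48.J48 -t ~/auto-mpg-d2-5a.arff -T ~/auto-mpg-test-d2.arff -M 3 -c 1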

Extra credit (optional)

I should be interested to see the results of any further experimentation you choose to do.