## Introduction to Artificial Intelligence: Programming assignment 2

Assigned: Apr. 11
Due: Apr. 25

### Overview of the assignment

This is actually an experimental assignment, rather than a programming assignment. The assignment is to run three machine learning algorithms --- linear regression, Naive Bayes, and a decision tree algorithm --- over a given data set and compare the accuracy of the output for different sizes of training sets.

The WEKA system is a package of machine learning resources developed by Ian Witten and his research group at the University of Waikato (New Zealand). The code is written in Java and is available on the Web. (The link above is my local copy. The original package, if you want to get it, is here.)

There are three main types of resources available in WEKA:

• Code to do classification learning, using a variety of machine learning algorithms. These are called classifiers.
• Code to preprocess data sets prior to learning. These are called filters.
• Data sets. Real and artificial data sets suitable for learning experimentation.

There is also an online tutorial on WEKA available in PDF format.

This assignment will involve the "auto-mpg" data set from the UC Irvine Machine Learning Data Repository. The file contains parameters of various makes of cars from various years. The task is to predict the miles per gallon. (Use the version at the course web site, not the version at UC Irvine.)

The auto-mpg data set is in ARFF format. It contains the following parts:

• Lines beginning with % are comments.
• The line "@RELATION auto-mpg" defines this as the auto-mpg relation.
• There are then 8 lines defining the attributes: mpg, cylinders, displacement, horsepower, weight, acceleration, model year, and origin (1 for US, 2 for Europe, 3 for Japan).
• The next line @DATA introduces the data.
• The remaining lines are data. Each line is a single instance, with the values of the attributes, following the same order as the @ATTRIBUTE statements, separated by commas.
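
To make the structure concrete, here is a hypothetical fragment in the shape described above (the attribute types and the data row are illustrative only, not copied from the actual file):

```
% Comment lines begin with a percent sign.
@RELATION auto-mpg

@ATTRIBUTE mpg NUMERIC
@ATTRIBUTE cylinders NUMERIC
% ... remaining @ATTRIBUTE lines for displacement, horsepower,
% weight, acceleration, model year, and origin ...

@DATA
18.0,8,307.0,130.0,3504.0,12.0,70,1
```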

### Running the code

You can find all of the WEKA material in the weka directory in the class Web site. If you want to run it on your own machine, then
1. Download the file "weka-3-0-2.jar" from the class web site.
2. Expand this using the command "jar xvf weka-3-0-2.jar".
3. Further expand the new file "weka.jar" using the command "jar xvf weka.jar". (The README file says to use "java -jar weka.jar", but that didn't work for me.)
Or you can just download the files you need for this assignment (a small subset of the entire directory).

Alternatively, you can run the code from an ACF5 account by changing directory

```
cd /home/e/ed1/weka/weka-3-0-2
```
and then running the code as described below.

You can run the code from an account on the Sun system by changing directory

```
cd /usr/httpd/htdocs_cs/courses/fall00/G22.3033-001/weka/weka-3-0-2
```
and there you can find directories with the various WEKA programs in Java, which can be run as described below.

If you have any trouble with this, please let me know as soon as possible. Don't wait until the day before the due date to make sure that you can find this code.

### The Experiments

#### Part I: Linear regression

In this section, you will test the accuracy of linear regression as a predictor, using training sets of different sizes.

Step 1: Eliminate the nominal field "origin" using the command

`java weka.filters.AttributeFilter -i data/auto-mpg.arff -o ~/auto-mpgA.arff -R 8`
and similarly for the test file.

Note: Here and throughout, I've written these commands so as to create a new file "auto-mpgA.arff" in your home directory. Of course, you can name these files and put them wherever you want.

Step 2: Create subsets of the training data with 5, 10, 15, and 24 elements. Create 3 subsets of each size. A random subset of size 5 can be created using the command

```
java weka.filters.SplitDatasetFilter -i ~/auto-mpgA.arff -o ~/auto-mpg5a.arff -N 48 -S 1
```
The argument -N 48 means that the data should be divided into 48 folds. As there are 240 instances in the data, that gives a data set of size 5. Create the other training files analogously. (Note: The different training files need not be disjoint.) The argument -S 1 gives a random number seed, so as to get a random choice. Of course, you should give a different seed for each training set you create.
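The relationship between the `-N` argument and the resulting subset size can be checked with a quick sketch (using the figure of 240 instances stated above):

```python
# Each fold holds total_instances / num_folds instances, so to obtain a
# subset of a desired size, pass -N total_instances // desired_size.
total_instances = 240  # size of the auto-mpg training data, per the text above

for desired_size in (5, 10, 15, 24):
    num_folds = total_instances // desired_size
    print(f"subset size {desired_size}: use -N {num_folds}")
```

For example, a subset of size 10 calls for `-N 24`, and a subset of size 24 calls for `-N 10`.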

Step 3: Run the linear regression algorithm over the different partial training files and the complete training file, and evaluate them over the test file. The linear regression algorithm is run using the command:

```
java weka.classifiers.LinearRegression -t ~/auto-mpg5a.arff -T ~/auto-mpg-testA.arff -c 1 -S 1
```

Step 4: Create a plot where the x-axis is the size of the training set (5, 10, 15, 24) and the y-axis is the average accuracy, measured by mean absolute error, over the three sample sets of the specified size.
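A minimal sketch of the averaging step, assuming you have recorded the mean absolute error WEKA reports for each run (the zero values below are placeholders to be replaced with your own results):

```python
# Mean absolute error for the three runs at each training-set size.
# All values are placeholders -- fill in the numbers from your WEKA output.
mae = {
    5:  [0.0, 0.0, 0.0],
    10: [0.0, 0.0, 0.0],
    15: [0.0, 0.0, 0.0],
    24: [0.0, 0.0, 0.0],
}

for size in sorted(mae):
    avg = sum(mae[size]) / len(mae[size])
    print(f"training size {size}: average MAE = {avg:.3f}")
```

The resulting (size, average MAE) pairs are the points to plot.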

Step 5: Define the similarity of two hypotheses to be the sum of the absolute values of the difference of corresponding coefficients. For instance, given the two models

```
mpg = -1.5 * cylinders - 1 * acceleration + 0.75 * year + 1.0
```
and
```
mpg = -1.0 * cylinders - 0.8 * acceleration + 1.0 * year + 0.8
```
the difference between these two models is defined to be
```
|-1.5 - (-1.0)| + |-1 - (-0.8)| + |0.75 - 1.0| + |1.0 - 0.8| = 0.5 + 0.2 + 0.25 + 0.2 = 1.15
```

The average similarity of a set of three hypotheses H1, H2, H3 is the average of similarity(H1,H2), similarity(H1,H3), and similarity(H2,H3).

Create a plot where the x-axis is the size of the training set and the y-axis is the average similarity of the hypothesis generated for the training sets of that size.
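The definitions above can be sketched directly in code; this reproduces the worked example (the coefficient lists here are just the numbers from the two example models, in the order cylinders, acceleration, year, intercept):

```python
# Similarity of two hypotheses: sum of absolute values of the
# differences of corresponding coefficients, including the intercept.
def similarity(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Average pairwise similarity of three hypotheses.
def average_similarity(h1, h2, h3):
    return (similarity(h1, h2) + similarity(h1, h3) + similarity(h2, h3)) / 3

m1 = [-1.5, -1.0, 0.75, 1.0]  # coefficients of the first example model
m2 = [-1.0, -0.8, 1.0, 0.8]   # coefficients of the second example model
print(round(similarity(m1, m2), 2))  # 1.15, matching the worked example
```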

#### Part 2: Naive Bayes

Step 1: Discretize the classification attribute mpg in both the training set and the test set into 4 bins using the command

```
java weka.filters.DiscretizeFilter -i data/auto-mpg.arff -o ~/auto-mpg-d1.arff \
    -b -r data/auto-mpg-test.arff -s ~/auto-mpg-test-d1.arff -B 4 -R 1
```

Step 2: Discretize the remaining attributes in both the training set and the test set into 2 bins using the command

```
java weka.filters.DiscretizeFilter -i ~/auto-mpg-d1.arff -o ~/auto-mpg-d2.arff \
    -b -r ~/auto-mpg-test-d1.arff -s ~/auto-mpg-test-d2.arff -B 2 -R 2,3,4,5,6,7
```

Step 3: As in part 1, step 2, create subsets of the training data with 5, 10, 15, and 24 elements. Create 3 subsets of each size.

Step 4: Run the Naive Bayes algorithm on the various discretized training sets, and test them on the discretized test set. The command is

```
java weka.classifiers.NaiveBayes -t <training file> -T ~/auto-mpg-test-d2.arff
```
where `<training file>` is the discretized training set for that run.

Step 5: Create a plot where the x-axis is the size of the training set (5, 10, 15, 24) and the y-axis is the average accuracy, measured by percentage correct, over the three sample sets of the specified size.

Step 6: Redo step 2, but discretize the remaining attributes into 4 bins rather than 2 (change the argument of -B to 4). Run Naive Bayes again. Compare the results of the two discretizations.

#### Part III: Decision trees

Using the discretized training sets and test set created in steps 1 through 3 of part II, run the C4.5 algorithm (a sophisticated decision tree algorithm) using the command
```
java weka.classifiers.j48.J48 -t <training file> -T ~/auto-mpg-test-d2.arff \
    -M 3 -c 1
```
where `<training file>` is the discretized training set for that run.
The flag -M 3 specifies that nodes with 3 or fewer instances are not split further. Report on the accuracy as in step 5 of part II. Also, discuss the output hypotheses. How much do the trees differ, and how do they change as the training set becomes larger?

#### Extra credit (optional)

I should be interested to see the results of any further experimentation you choose to do.