Matrix Multiply Assignment

Problem

For this first assignment, you will optimize a routine to multiply square matrices. Matrix multiplication is a basic building block in many scientific codes, and since it is an O(n³) algorithm, these codes often spend a lot of their time in matrix multiplication. The most naive code to multiply matrices is short, sweet, and slow:

```
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C[i,j] = C[i,j] + A[i,k] * B[k,j]
    end
  end
end
```

We're providing implementations of the trivial unoptimized matrix multiply code in C and Fortran, and a very simple blocked implementation in C. We're also providing a wrapper for the BLAS (Basic Linear Algebra Subroutines) library. The BLAS is a standard interface for building blocks like matrix multiplication, and many vendors provide optimized versions of the BLAS for their platforms. You may want to compare your routines to codes that others have optimized; if you do, think about the relative difficulty of using someone else's library compared with writing and optimizing your own routine. Knowing when and how to make use of standard libraries is an important skill in building fast programs!

Implementation

You need to write a dgemm.c that contains a function with the following C signature:

```
void
square_dgemm(const unsigned M,
             const double *A, const double *B, double *C);
```

Note that the matrices are stored in column-major order. Also, your program will actually be doing a multiply and add operation:

```
C := C + A*B
```

Look at the code in basic_dgemm.c if you find this confusing.
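To make the column-major convention concrete: element (i, j) of an M-by-M matrix stored column-major sits at index `i + j*M`. The sketch below is an unblocked multiply-and-add consistent with the signature above; it is illustrative only, and basic_dgemm.c remains the authoritative reference (the loop order here is chosen so the innermost loop moves with unit stride).

```c
#include <stddef.h>

/* Sketch of an unblocked square_dgemm: C := C + A*B, where A, B, C
 * are M-by-M matrices in column-major order, so element (i, j) is
 * stored at index i + j*M. */
void square_dgemm(const unsigned M,
                  const double *A, const double *B, double *C)
{
    for (unsigned j = 0; j < M; ++j)           /* column of C and B */
        for (unsigned k = 0; k < M; ++k) {     /* column of A, row of B */
            const double bkj = B[k + j*M];     /* reused across the i loop */
            for (unsigned i = 0; i < M; ++i)   /* row of C and A: unit stride */
                C[i + j*M] += A[i + k*M] * bkj;
        }
}
```

The j-k-i ordering streams down columns of A and C, which are contiguous in column-major storage; the i-j-k ordering of the pseudocode above would instead stride across rows.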

The necessary files are in matmul.tar.gz. Included are the following:

• Makefile: a sample Makefile with some basic rules
• matmul.c: the driver program
• basic_dgemm.c: a very simple square_dgemm implementation
• blocked_dgemm.c: a slightly more complex square_dgemm implementation
• blas_dgemm.c: a wrapper that lets the C driver program call the dgemm routine in BLAS implementations
• timing.gnuplot: a sample gnuplot script to display the results

We will be testing on the 2.33 GHz Intel Pentium III Xeon (Katmai) machines on the Dell cluster. These are quad-core nodes, but you will be using only one core. See the page on using the class machines for more detailed information on how to use the Dell cluster. Also see the NYU HPC wiki for more information.

The y-axis in the gnuplot script's plots is labeled MFlop/s, but really it should be "MFlop/s assuming you were using the standard algorithm." If you use something like Strassen's algorithm, we will still compute the rate as 2n³/time. However, the numerical error checking may haunt you.
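That reported rate amounts to the following computation (a small illustrative helper; the name `mflops` is ours, not part of the provided driver):

```c
/* MFlop/s for an n-by-n multiply-and-add, always assuming the
 * standard algorithm's 2*n^3 flop count, regardless of the
 * algorithm actually used. */
double mflops(unsigned n, double seconds)
{
    return 2.0 * n * n * n / seconds / 1.0e6;
}
```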

You will get a lot of improvement from tiling, but you may want to play with other optimizations, too. Strassen-like algorithms, copy optimizations, recursive data layouts: you can get some pretty elaborate optimization strategies. We'll cover some of the possibilities in class, but you might want to do some independent reading, too. The only off-limits optimizations are those that weaken the floating-point system.
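As a starting point for the tiling idea, here is one way a blocked multiply can be structured. This is a sketch under our own assumptions, not the provided blocked_dgemm.c: the function name and the `BLOCK` constant are illustrative, and a good block size is something you would tune so that three tiles fit in cache.

```c
#define BLOCK 32  /* illustrative tile size; tune so three tiles fit in L1 */

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* Blocked C := C + A*B for column-major M-by-M matrices. Each
 * (jb, kb, ib) iteration updates one BLOCK-sized tile of C using one
 * tile of A and one of B, so the working set stays cache resident
 * across the three inner loops. The min_u calls handle fringe tiles
 * when M is not a multiple of BLOCK. */
void square_dgemm_blocked(const unsigned M,
                          const double *A, const double *B, double *C)
{
    for (unsigned jb = 0; jb < M; jb += BLOCK)
        for (unsigned kb = 0; kb < M; kb += BLOCK)
            for (unsigned ib = 0; ib < M; ib += BLOCK) {
                const unsigned jmax = min_u(jb + BLOCK, M);
                const unsigned kmax = min_u(kb + BLOCK, M);
                const unsigned imax = min_u(ib + BLOCK, M);
                for (unsigned j = jb; j < jmax; ++j)
                    for (unsigned k = kb; k < kmax; ++k) {
                        const double bkj = B[k + j*M];
                        for (unsigned i = ib; i < imax; ++i)
                            C[i + j*M] += A[i + k*M] * bkj;
                    }
            }
}
```

The arithmetic is identical to the unblocked version; only the traversal order changes, trading no extra flops for far fewer cache misses once M²·sizeof(double) exceeds the cache size.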

Submission

Your group should submit your dgemm.c, Makefile (for compiler options) and a write-up. Your write-up should contain:

• the names of the people in your group,
• the optimizations used or attempted,
• the results of those optimizations, and
• your explanation for any odd behavior (e.g., dips) in performance.

To show the results of your optimizations, include a graph comparing your dgemm.c with the included basic_dgemm.c. Your explanations should rely heavily on knowledge of the memory hierarchy. (Benchmark graphs help.)

Please tar up your group's dgemm.c, write-up, and associated files and mail us either a URL from which we can retrieve the tar file or an encoded tar file.

Resources

• A description of the machine is available through the NYU HPC wiki.
• Much of this assignment was shamelessly borrowed from Assignment 1 of CS 267 in Fall 2007 (which in turn shamelessly borrowed from previous years -- the introductory paragraphs, in particular, read a lot like something I remember writing). Many of the resources listed there will be useful.
• You can find out more about the processor by running Todd Allen's cpuid utility and by running cat /proc/cpuinfo.
• The optimizations used in PHiPAC and ATLAS may be interesting. Note: You cannot use PHiPAC or ATLAS to generate your matrix multiplication kernels. You can write your own code generator, however. You might want to skim the tech report (.pdf) on PHiPAC. The section on C coding guidelines will be particularly relevant.
• There is a classic paper (PostScript) on the idea of cache blocking in the context of matrix multiply by Monica Lam, et al.
• Several folks have tried to automatically get cache locality by basing their matrix storage organization around space-filling curves. See the paper by Chatterjee and the classic paper by Gustavson, Recursion leads to automatic variable blocking for dense linear algebra.