Due: Tues, Sep 16 at 5:00 pm
Problem
For this first assignment, you will optimize a routine to multiply square matrices. Matrix multiplication is a basic building block in many scientific codes, and since it is an O(n^3) algorithm, these codes often spend much of their time in matrix multiplication. The most naive code to multiply matrices is short, sweet, and slow:
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C[i,j] = C[i,j] + A[i,k] * B[k,j]
    end
  end
end
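In C, and using the column-major storage this assignment specifies, that triple loop might be sketched as follows (a hand-written illustration in the spirit of the provided basic_dgemm.c, not a verbatim copy of it):

```c
/* Naive multiply-and-add, C := C + A*B, for M-by-M matrices stored
 * in column-major order: element (i, j) lives at index i + j*M. */
void square_dgemm(const unsigned M, const double *A,
                  const double *B, double *C)
{
    for (unsigned i = 0; i < M; ++i)
        for (unsigned j = 0; j < M; ++j)
            for (unsigned k = 0; k < M; ++k)
                C[i + j*M] += A[i + k*M] * B[k + j*M];
}
```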
We're providing implementations of the trivial unoptimized matrix multiply code in C and Fortran, and a very simple blocked implementation in C. We're also providing a wrapper for the BLAS (Basic Linear Algebra Subroutines) library. The BLAS is a standard interface for building blocks like matrix multiplication, and many vendors provide optimized versions of the BLAS for their platforms. You may want to compare your routines to codes that others have optimized; if you do, think about the relative difficulty of using someone else's library versus writing and optimizing your own routine. Knowing when and how to make use of standard libraries is an important skill in building fast programs!
Implementation
You need to write a dgemm.c that contains a function with the following C signature:
void square_dgemm(const unsigned M, const double *A, const double *B, double *C)
Note that the matrices are stored in column-major order. Also, your program will actually be doing a multiply-and-add operation:
C := C + A*B
Look at the code in basic_dgemm.c if you find this confusing.
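Concretely, column-major order means element (i, j) of an M-by-M matrix sits at linear index i + j*M. The snippet below is purely illustrative (the IDX macro and get function are our own, not part of the handout):

```c
/* Column-major addressing: element (i, j) -> linear index i + j*M.
 * IDX is a hypothetical convenience macro, not from the handout. */
#define IDX(i, j, M) ((i) + (j) * (M))

double get(const double *A, unsigned M, unsigned i, unsigned j)
{
    /* Walking down a column touches consecutive doubles; walking
     * along a row strides by M doubles -- which is why loop order
     * and blocking matter for cache behavior. */
    return A[IDX(i, j, M)];
}
```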
The necessary files are in matmul.tar.gz. Included are the following:
 Makefile: a sample Makefile, with some basic rules,
 matmul.c: the driver program,
 basic_dgemm.c: a very simple square_dgemm implementation,
 blocked_dgemm.c: a slightly more complex square_dgemm implementation,
 blas_dgemm.c: a wrapper that lets the C driver program call the dgemm routine in BLAS implementations,
 timing.gnuplot: a sample gnuplot script to display the results.
We will be testing on the 2.33 GHz Intel Pentium III Xeon (Katmai) machines on the Dell cluster. These are quad-core nodes (but you will be using only one core). See the page on using the class machines for more detailed information on how to use the Dell cluster. Also see the NYU HPC wiki for more information.
The y-axis in the gnuplot script's plots is labeled MFlop/s, but really it should read "MFlop/s assuming you were using the standard algorithm." If you use something like Strassen's algorithm, we will still compute the rate as 2n^3/time. However, the numerical error checking may haunt you.
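In other words, the reported rate charges you for the standard algorithm's flop count regardless of what you actually execute. A sketch of that convention (mflops is our own illustrative helper, not a function from the handout):

```c
/* MFlop/s under the standard-algorithm convention: charge 2*n^3
 * flops (one multiply plus one add per innermost iteration), no
 * matter what the timed routine actually did. */
double mflops(unsigned n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1.0e6;
}
```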
You will get a lot of improvement from tiling, but you may want to play with other optimizations, too. Strassen-like algorithms, copy optimizations, recursive data layouts... you can build some pretty elaborate optimization strategies. We'll cover some of the possibilities in class, but you might want to do some independent reading, too. The off-limits optimizations are those that weaken the floating-point system.
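To make tiling concrete, here is a minimal blocked variant under the same column-major convention. BLOCK is an arbitrary tile size you would tune per machine, and this mirrors the spirit of the handout's blocked_dgemm.c rather than its exact code:

```c
#define BLOCK 32  /* tile edge: a tunable guess, not a recommendation */

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* Blocked C := C + A*B, column-major.  Each tile triple (bi, bj, bk)
 * works on sub-blocks small enough that all three can stay resident
 * in cache while they are reused. */
void square_dgemm_blocked(const unsigned M, const double *A,
                          const double *B, double *C)
{
    for (unsigned bj = 0; bj < M; bj += BLOCK)
        for (unsigned bk = 0; bk < M; bk += BLOCK)
            for (unsigned bi = 0; bi < M; bi += BLOCK) {
                unsigned jmax = min_u(bj + BLOCK, M);
                unsigned kmax = min_u(bk + BLOCK, M);
                unsigned imax = min_u(bi + BLOCK, M);
                for (unsigned j = bj; j < jmax; ++j)
                    for (unsigned k = bk; k < kmax; ++k)
                        for (unsigned i = bi; i < imax; ++i)
                            C[i + j*M] += A[i + k*M] * B[k + j*M];
            }
}
```

The key design point is that the innermost three loops touch only a BLOCK-by-BLOCK working set from each matrix, so a well-chosen BLOCK keeps those tiles in cache across their many reuses.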
Submission
Your group should submit your dgemm.c, Makefile (for compiler options) and a writeup. Your writeup should contain:
 the names of the people in your group,
 the optimizations used or attempted,
 the results of those optimizations, and
 your explanation for any odd behavior (e.g., dips) in performance.
To show the results of your optimizations, include a graph comparing your dgemm.c with the included basic_dgemm.c. Your explanations should rely heavily on knowledge of the memory hierarchy. (Benchmark graphs help.)
Please tar up your group's dgemm.c, writeup, and associated files, and mail us either a URL from which we can retrieve the tar file or an encoded tar file.
Resources
 A description of the machine is available through the NYU HPC wiki.
 Much of this assignment was shamelessly borrowed from Assignment 1 of CS 267 in Fall 2007 (which in turn shamelessly borrowed from previous years; the introductory paragraphs, in particular, read a lot like something I remember writing). Many of the resources listed there will be useful.
 You can find out more about the processor by running Todd Allen's cpuid utility and by running cat /proc/cpuinfo.
 The optimizations used in PHiPAC and ATLAS may be interesting. Note: You cannot use PHiPAC or ATLAS to generate your matrix multiplication kernels. You can write your own code generator, however. You might want to skim the tech report (.pdf) on PHiPAC. The section on C coding guidelines will be particularly relevant.
 There is a classic paper (PostScript) on the idea of cache blocking in the context of matrix multiply by Monica Lam, et al.
 Several folks have tried to get cache locality automatically by basing their matrix storage organization around space-filling curves. See the paper by Chatterjee and the classic paper by Gustavson, "Recursion leads to automatic variable blocking for dense linear algebra."
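To give a flavor of what a space-filling-curve layout means, here is an illustrative sketch (ours, not from either paper) of the Morton, or Z-order, index: the bits of the row and column indices are interleaved, so nearby (i, j) pairs tend to land at nearby memory addresses at every scale:

```c
#include <stdint.h>

/* Spread the low 16 bits of x into the even bit positions of a
 * 32-bit word: ...b2 b1 b0 -> ...0 b2 0 b1 0 b0. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0xFFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Morton (Z-order) index of element (i, j): i's bits occupy the
 * even positions and j's bits the odd positions, giving the linear
 * offset of (i, j) in a matrix stored along the Z-curve. */
uint32_t morton_index(uint32_t i, uint32_t j)
{
    return spread_bits(i) | (spread_bits(j) << 1);
}
```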