V22.0436 - Prof. Grishman

Lecture 19:  Multi-threading

(discuss Asgn 7 -- registers, clocking)

Using multiple threads:  speed-up

Applications differ in the degree to which they can be parallelized and in the amount of communication required between threads.
Amdahl's law (text, p. 51):  if a fraction P of a program can be parallelized and that part is sped up by a factor S, the overall speedup is
                                1 / ( (1 - P) + (P / S) )
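
For example, with P = 0.9 and S = 8 (90% of the program spread across 8 threads), the speedup is 1 / (0.1 + 0.9/8) = 1 / 0.2125, or about 4.7 -- far less than 8.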

How to make use of multi-thread hardware?

Explicit thread control:  Java

In parallel programs, we distinguish between processes, which each have their own environment, and threads, which share an environment (memory and files).

When you start a Java program, it starts a single user thread, which executes the main method.  Additional threads can be created as objects of type Thread.  To create a new thread, first define a class that implements Runnable.  A Runnable class must have a run method (with no arguments):

public class PrintMsg implements Runnable {

    String message;

    public PrintMsg (String m) {
        message = m;
    }

    public void run () {
        System.out.println ("Starting thread " + message);
        ...
        System.out.println ("Ending thread " + message);
    }
}

then create a Thread from an instance of the Runnable class and start the thread:

    Thread t = new Thread (new PrintMsg ("moo"));
    t.start();

The thread will execute the run method of PrintMsg;  when that method finishes, the thread dies.  If the main thread needs results generated by thread t, it can wait for t to die with

    t.join();

(Note:  join may throw InterruptedException if thread t was interrupted by another thread;  this exception must be caught, as shown below.)  When all user threads have died, the JVM exits.
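
A minimal sketch of catching this exception (the message printed is just for illustration):

    try {
        t.join();
    } catch (InterruptedException e) {
        System.out.println ("join interrupted");
    }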

Programs written in class: R.java and PrintMsg.java.

If the hardware provides N threads and a task in your program can be divided into N subtasks (roughly equal in length) which can be performed concurrently, you can get a large speed-up by creating N-1 new threads (in addition to the original thread) and waiting for them all to finish.  If the threads need to communicate, they can do so through shared variables.  However, there is a risk that the threads will interfere if two of them modify the same variable.  This problem can be avoided by accessing the shared variables through methods which are declared to be synchronized, as in the sketch below.
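
As a minimal sketch of this pattern (not one of the programs written in class -- the class name ParallelSum, the value of N, and the data are made up for illustration), each thread below sums one slice of an array and adds its partial sum to a shared total through a synchronized method:

public class ParallelSum {

    static final int N = 4;                // assumed number of hardware threads
    static long total = 0;                 // shared variable

    // all updates to the shared total go through this synchronized
    // method, so two threads cannot interfere while modifying it
    static synchronized void addToTotal (long partial) {
        total += partial;
    }

    public static void main (String[] args) {
        final int[] data = new int[1000000];
        java.util.Arrays.fill (data, 1);
        int chunk = data.length / N;

        // create N-1 new threads, each summing one slice of the array
        Thread[] workers = new Thread[N - 1];
        for (int t = 0; t < N - 1; t++) {
            final int lo = t * chunk;
            final int hi = lo + chunk;
            workers[t] = new Thread (new Runnable () {
                public void run () {
                    long partial = 0;
                    for (int i = lo; i < hi; i++)
                        partial += data[i];
                    addToTotal (partial);
                }
            });
            workers[t].start();
        }

        // the original thread handles the last slice
        long partial = 0;
        for (int i = (N - 1) * chunk; i < data.length; i++)
            partial += data[i];
        addToTotal (partial);

        // wait for all the new threads to die
        try {
            for (Thread w : workers)
                w.join();
        } catch (InterruptedException e) {
            System.out.println ("interrupted while waiting");
        }

        System.out.println ("total = " + total);  // prints 1000000
    }
}

Because every update goes through addToTotal, at most one thread at a time can modify total, so no partial sums are lost.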

Graphics Processing Units and CUDA

Modern graphics processing units (GPUs) rely on both multiple cores and multi-threading to give very high performance.  The number of cores and the number of threads per core are typically much higher than for general-purpose processors.  The NVIDIA GeForce 8800 presented in Appendix A of the text has 128 streaming processor cores, and each can handle up to 96 threads.  Theoretical peak performance is 576 GFLOPS.

While designed for graphics tasks, GPUs are now also used for large-scale scientific computing, where operations have to be performed on large vectors and arrays.  To support such applications, NVIDIA provides an extension of C/C++ called CUDA (Compute Unified Device Architecture), which allows parallel operations on vectors to be expressed simply;  the system automatically translates them into the necessary threads (pages A-19 and A-20) and schedules those threads.

Computing y = ax + y with a serial loop and in parallel using CUDA (text Fig. A.3.4).

void saxpy_serial (int n, float alpha, float *x, float *y)
{
    for (int i=0; i<n; ++i)
        y[i] = alpha * x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

====================================================================

__global__ void saxpy_parallel (int n, float alpha, float *x, float *y)
{
    // compute a global element index from the block and thread indices;
    // each thread handles one element of the vectors
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];
}
// Invoke parallel SAXPY kernel (256 threads per block)
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>> (n, 2.0, x, y);
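
Here (n + 255) / 256 rounds up, so there are enough threads to cover every element of the vectors;  the test if (i < n) keeps the extra threads in the last block from writing past the end of the arrays.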