CSCI-GA.3033-008
Graphics Processing Units (GPUs): Architecture and Programming

Lecture 7: Parallel Patterns and Performance Issues

Mohamed Zahran (aka Z)
mzahran@cs.nyu.edu
http://www.mzahran.com

Some slides of this lectures from:
• Vasily Volkov (UC Berkeley)
• http://nvlabs.github.io/moderngpu/
Some Insights About Performance

Throughput

Latency

Occupancy

Utilization
It is a common belief that ...

• More threads is better
  – because it needs more threads hide latency

But is it always true?
### Multiplication of two large matrices, single precision (SGEMM):

<table>
<thead>
<tr>
<th></th>
<th>CUBLAS 1.1</th>
<th>CUBLAS 2.0</th>
<th>Comparison</th>
</tr>
</thead>
<tbody>
<tr>
<td>Threads per block</td>
<td>512</td>
<td>64</td>
<td>8x smaller thread blocks</td>
</tr>
<tr>
<td>Occupancy (G80)</td>
<td>67%</td>
<td>33%</td>
<td>2x lower occupancy</td>
</tr>
<tr>
<td>Performance (G80)</td>
<td>128 Gflop/s</td>
<td>204 Gflop/s</td>
<td>1.6x higher performance</td>
</tr>
</tbody>
</table>

### Batch of 1024-point complex-to-complex FFTs, single precision:

<table>
<thead>
<tr>
<th></th>
<th>CUFFT 2.2</th>
<th>CUFFT 2.3</th>
<th>Comparison</th>
</tr>
</thead>
<tbody>
<tr>
<td>Threads per block</td>
<td>256</td>
<td>64</td>
<td>4x smaller thread blocks</td>
</tr>
<tr>
<td>Occupancy (G80)</td>
<td>33%</td>
<td>17%</td>
<td>2x lower occupancy</td>
</tr>
<tr>
<td>Performance (G80)</td>
<td>45 Gflop/s</td>
<td>93 Gflop/s</td>
<td>2x higher performance</td>
</tr>
</tbody>
</table>
Latency Vs Throughput

• Latency (how much time) is \textit{time}:
  – instruction takes 4 cycles per warp
  – memory takes 400 cycles

• Throughput (how many operations per cycle or second) is \textit{rate}:
  – Arithmetic: 1.3 Tflop/s = 480 ops/cycle (op=multiply-add)
  – Memory: 177 GB/s ≈ 32 ops/cycle (op=32-bit load)
Hide Latency is ...

• Doing other operations while waiting
• This will make the kernel runs faster
• But not at the peak performance
• What can we do?
Little's Law

Needed parallelism = Latency x Throughput
Examples from GPU

<table>
<thead>
<tr>
<th>GPU model</th>
<th>Latency (cycles)</th>
<th>Throughput (cores/SM)</th>
<th>Parallelism (operations/SM)</th>
</tr>
</thead>
<tbody>
<tr>
<td>G80-GT200</td>
<td>≈24</td>
<td>8</td>
<td>≈192</td>
</tr>
<tr>
<td>GF100</td>
<td>≈18</td>
<td>32</td>
<td>≈576</td>
</tr>
<tr>
<td>GF104</td>
<td>≈18</td>
<td>48</td>
<td>≈864</td>
</tr>
</tbody>
</table>

Average latency of a computational operation

Less operations means idle cycle
So ...

- Higher performance does not mean more threads but **higher utilization**
- Utilization is related to **parallelism**
- We can increase utilization by
  - increasing throughput
    - Instruction level parallelism
    - Thread level parallelism
  - decreasing latency

Occupancy is not utilization, but one of the contributing factors.
How About Memory?

- What we said was related to computation.
- Can we apply the same technique (i.e. Little's law) to memory?

Needed parallelism = \text{Latency} \times \text{Throughput}

<table>
<thead>
<tr>
<th>Latency</th>
<th>Throughput</th>
<th>Parallelism</th>
</tr>
</thead>
<tbody>
<tr>
<td>Arithmetic</td>
<td>\approx 18 cycles</td>
<td>32 ops/SM/cycle</td>
</tr>
<tr>
<td>Memory</td>
<td>&lt; 800 cycles (?)</td>
<td>&lt; 177 GB/s</td>
</tr>
</tbody>
</table>

This means that to hide memory latency you need to keep 100KB in flight. But less if the kernel is compute bound!
How Can You Get 100KB From Threads?

• Use more threads
• Use more instructions per thread
• Use more data per thread
Running Fast May Require Lower Occupancy

- Registers are > 6x the speed of shared memory in Fermi for example.
- To reach peak performance we may need to place many variables in registers.
Pattern: Convolution
Convolution

• An Array operation
• Output data element = weighted sum of a collection of neighboring input elements.
• The weights are defined by an input mask array.
• Usually used as filters to transform signals (or pixels or ...) into more desirable form.
Convolution

\[ N = [\begin{array}{cccccc} 1 & 2 & 3 & 4 & 5 & 6 & 7 \end{array}] \]

\[ P = [\begin{array}{cccccccc} & & & & & & 57 & & & & \end{array}] \]

Mask

\[ M = [\begin{array}{cccccc} 3 & 4 & 5 & 4 & 3 \end{array}] \]
Convolution can also be 2D.
Convolution

N

1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 5 6
5 6 7 8 5 6 7
6 7 8 9 0 1 2
7 8 9 0 1 2 3

P


M

1 2 3 2 1
2 3 4 3 2
3 4 5 4 3
2 3 4 3 2
1 2 3 2 1

1 4 9 8 5
4 9 16 15 12
9 16 25 24 21
8 15 24 21 16
5 12 21 16 5
Convolution

The 1D Version

```c
__global__ void convolution_1D_basic_kernel(float *N, float *M, float *P,
int Mask_Width, int Width) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;

    float Pvalue = 0;
    int N_start_point = i - (Mask_Width/2);
    for (int j = 0; j < Mask_Width; j++) {
        if (N_start_point + j >= 0 && N_start_point + j < Width) {
            Pvalue += N[N_start_point + j]*M[j];
        }
    }
    P[i] = Pvalue;
}
```

- Thread organized as 1D grid.
- Pvalue allows intermediate values to be accumulated in registers to save DRAM bw.
- We assume ghost values are 0.
- There will be control flow divergence (due to ghost elements).
- Ratio of floating point arithmetic calculation to global memory access is ~ 1.0 → What can we do??
Regarding Mask M

- Size of M is typically small.
- The contents of M do not change during execution.
- All threads need to access M and in the same order.

Doesn’t this make M a good candidate for constant memory?
Constant Memory

- Constant memory variables are visible to all thread blocks.
- Constant memory variables cannot be changed during kernel execution.
- The size of constant memory can vary from device to device.
Mask M and Constant Memory

• In host:
  • `#define MASK_WIDTH 10`
  • `__constant__ float M[MASK_WIDTH]`
  • Allocate and initialize a mask `h_M`
  • `cudaMemcpyToSymbol(M, h_M, MASK_WIDTH * sizeof(float));`

• Kernel functions
  – access constant memory variables as global variables → no need to pass pointers of these variables to the kernel as parameter.
Question: Isn’t the constant memory also in DRAM? Why is it assumed faster than global memory?

Answer:

• CUDA runtime knows that constant memory variables are not modified.
• It directs the hardware to aggressively cache them during kernel execution.
Pattern: Reduction Tree
What is it?

• A commonly used strategy for processing large input data sets
  – There is no required order of processing elements in a data set
  – Partition the data set into smaller chunks
  – Have each thread to process a chunk
  – Use a reduction tree to summarize the results from each chunk into the final answer

• Google and Hadoop MapReduce frameworks are examples of this pattern
What is it?

• Summarize a set of input values into one value using a “reduction operation”
  – Max
  – Min
  – Sum
  – Product

• Often with user defined reduction operation function as long as the operation
  – Is associative and commutative
  – Has a well-defined identity value (e.g., 0 for sum)
An efficient sequential reduction algorithm performs N operations in $O(N)$

- Initialize the result as an identity value for the reduction operation
  - Smallest possible value for max reduction
  - Largest possible value for min reduction
  - 0 for sum reduction
  - 1 for product reduction

- Scan through the input and perform the reduction operation between the result value and the current input value
A parallel reduction tree algorithm performs \( N-1 \) Operations in \( \log(N) \) steps.
Straightforward Implementation

- The original vector is in device global memory
- The shared memory is used to hold a partial sum vector
- Each step brings the partial sum vector closer to the sum
- The final sum will be in element 0
- Reduces global memory traffic due to partial sum values
A lot of branch divergence.
A Better Version
Be Careful!

- Although the number of “operations” is N, each “operation involves much more complex address calculation and intermediate result manipulation.
- If the parallel code is executed on a single-thread hardware, it would be significantly slower than the code based on the original sequential algorithm.
Pattern: Prefix Sum (Scan)
Scan / Parallel Prefix Sum

Given an array \( A = [a_0, a_1, \ldots, a_{n-1}] \) and a binary associative operator \( @ \) with identity \( I \)

\[ \text{scan} (A) = [I, a_0, (a_0 @ a_1), \ldots, (a_0 @ a_1 @ \ldots @ a_{n-2})] \]

This is the exclusive scan
Inclusive Scan

- Given an array $A = [a_0, a_1, \ldots, a_{n-1}]$
  and a binary associative operator $\@$ with identity $I$

  - $\text{scan } (A) = [a_0, (a_0 \@ a_1), \ldots, (a_0 \@ a_1 \@ \ldots \@ a_{n-1})]$
Why?

• Scan is used as a building block for many parallel algorithms
  – Radix sort
  – Quicksort
  – String comparison
  – Lexical analysis
  – Run-length encoding
  – Histograms
  – Etc.
A Inclusive Scan Application Example

- Assume that we have a 100-inch sausage to feed 10 people.
- We know how much each person wants in inches: [3 5 2 7 28 4 3 0 8 1].
- How do we cut the sausage quickly?
- How much will be left?

- Method 1: cut the sections sequentially: 3 inches first, 5 inches second, 2 inches third, etc.
- Method 2: calculate Prefix scan: [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] (39 inches left).
Other Examples

- Assigning camp slots
- Assigning farmer market space
- Allocating memory to parallel threads
- Allocating memory buffer for communication channels

...
void scan( float* output, float* input, int length)
{
    output[0] = 0; // since this is a prescan, not a scan
    for(int j = 1; j < length; ++j)
    {
        output[j] = input[j-1] + output[j-1];
    }
}

• $N$ additions
• Use a guide:
  – Want parallel to be work efficient
  – Does similar amount of work
A Parallel Inclusive Scan Algorithm

1. Read input from device memory to shared memory

| T0 | 3 | 1 | 7 | 0 | 4 | 1 | 6 | 3 |

Each thread reads one value from the input array in device memory into shared memory array.
A Parallel Scan Algorithm

1. (previous slide)

2. Iterate log(n) times: Threads \textit{stride} to \textit{n}: Add pairs of elements \textit{stride} elements apart. Double \textit{stride} at each iteration. (note must double buffer shared mem arrays)

- Active threads: \textit{stride} to \text{n-1} (\textit{n-stride} threads)
- Thread \textit{j} adds elements \textit{j} and \textit{j-stride} from T0 and writes result into shared memory buffer T1 (ping-pong)
A Parallel Scan Algorithm

1. Read input from device memory to shared memory.

2. Iterate log(n) times: Threads *stride* to *n*: Add pairs of elements *stride* elements apart. Double *stride* at each iteration. (note must double buffer shared mem arrays)

Iteration #2
Stride = 2
1. Read input from device memory to shared memory. Set first element to zero and shift others right by one.

2. Iterate \( \log(n) \) times: Threads \( \text{stride} \) to \( n \): Add pairs of elements \( \text{stride} \) elements apart. Double \( \text{stride} \) at each iteration. (note must double buffer shared memory arrays)

3. Write output from shared memory to device memory
Work Efficiency Considerations

- The first-attempt Scan executes $\log(n)$ iterations.
- This scan algorithm is not very work efficient:
  - Sequential scan algorithm does $n$ adds.
  - A factor of $\log(n)$ hurts: 20x for $10^6$ elements!
- A parallel algorithm can be slow when execution resources are saturated due to low work efficiency.
Improving Efficiency

• A common parallel algorithm pattern: 
  *Balanced Trees*
  – Build a balanced binary tree on the input data and sweep it to and from the root
  – Tree is not an actual data structure, but a concept to determine what each thread does at each step

• For scan:
  – Traverse down from leaves to root building partial sums at internal nodes in the tree
    • Root holds sum of all leaves
  – Traverse back up the tree building the scan from the partial sums
Parallel Scan - Reduction Step

\[ \sum_{i=0}^{1} x_i \]

\[ \sum_{i=2}^{3} x_i \]

\[ \sum_{i=4}^{5} x_i \]

\[ \sum_{i=6}^{7} x_i \]

\[ \sum_{i=0}^{3} x_i \]

In place calculation

Final value after reduce
Move (add) a critical value to a central location where it is needed
Inclusive Post Scan Step

\[ x_0 \quad \sum x_0..x_1 \quad x_2 \quad \sum x_0..x_3 \quad x_4 \quad \sum x_4..x_5 \quad x_6 \quad \sum x_0..x_7 \]

\[ \sum x_0..x_2 \quad \sum x_0..x_4 \quad \sum x_0..x_6 \]
Putting it Together
Reduction Step Kernel Code

// scan_array[BLOCK_SIZE] is in shared memory

int stride = 1;
while(stride < BLOCK_SIZE)
{
    int index = (threadIdx.x+1)*stride*2 - 1;
    if(index < BLOCK_SIZE)
        scan_array[index] += scan_array[index-stride];
    stride = stride*2;

    __syncthreads();
}

Reduction Step Kernel Code

// scan_array[BLOCK_SIZE] is in shared memory

int stride = 1;
while(stride < BLOCK_SIZE)
{
    int index = (threadIdx.x+1)*stride*2 - 1;
    if(index < BLOCK_SIZE)
        scan_array[index] += scan_array[index-stride];
    stride = stride*2;
    __syncthreads();
}
int stride = BLOCK_SIZE/2;
    while(stride > 0)
    {
        int index = (threadIdx.x+1)*stride*2 - 1;
        if(index < BLOCK_SIZE)
        {
            scan_array[index+stride] += scan_array[index];
        }
        stride = stride / 2;
        __syncthreads();
    }
Why Exclusive Scan

- To find the beginning address of allocated buffers

- Inclusive and Exclusive scans can be easily derived from each other; it is a matter of convenience

\[
\text{Exclusive} \quad [3 \ 1 \ 7 \ 0 \ 4 \ 1 \ 6 \ 3] \\
\text{Inclusive} \quad [0 \ 3 \ 4 \ 11 \ 11 \ 15 \ 16 \ 22] \\
\text{Exclusive} \quad [3 \ 4 \ 11 \ 11 \ 15 \ 16 \ 22 \ 25]
\]
An Exclusive Post Scan Step (Add-move Operation)
Exclusive Post Scan Step

$x_0$  $\sum x_{0..1}$  $x_2$  $\sum x_{0..3}$  $x_4$  $\sum x_{4..5}$  $x_6$  $0$

$0$  $\sum x_{0..1}$  $\sum x_{0..2}$  $\sum x_{0..3}$  $\sum x_{0..4}$  $\sum x_{0..5}$  $\sum x_{0..6}$

$0$  $\sum x_{0..3}$  $\sum x_{0..5}$
Exclusive Post Scan Step

if (threadIdx.x == 0) scan_array[2*blockDim.x-1] = 0;
int stride = BLOCK_SIZE;
while(stride > 0)
{
    int index = (threadIdx.x+1)*stride*2 - 1;
    if(index < 2* BLOCK_SIZE)
    {
        float temp = array[index+stride];
        scan_array[index+stride] += scan_array[index];
        scan_array[index] = temp;
    }
    stride = stride / 2;
    __syncthreads();
}
Exclusive Scan Example - Reduction Step

Assume array is already in shared memory
Reduction Step (cont.)

<table>
<thead>
<tr>
<th>T</th>
<th>3</th>
<th>1</th>
<th>7</th>
<th>0</th>
<th>4</th>
<th>1</th>
<th>6</th>
<th>3</th>
</tr>
</thead>
</table>

Stride 1

Iteration 1, $n/2$ threads

| T | 3 | 4 | 7 | 7 | 4 | 5 | 6 | 9 |

- Iterate log(n) times. Each thread adds value *stride* elements away to its own value.

- Each ⬇️ corresponds to a single thread.
Iterate $\log(n)$ times. Each thread adds value *stride* elements away to its own value

<table>
<thead>
<tr>
<th>$T$</th>
<th>3</th>
<th>1</th>
<th>7</th>
<th>0</th>
<th>4</th>
<th>1</th>
<th>6</th>
<th>3</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Stride 1

<table>
<thead>
<tr>
<th>$T$</th>
<th>3</th>
<th>4</th>
<th>7</th>
<th>7</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>9</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Stride 2

<table>
<thead>
<tr>
<th>$T$</th>
<th>3</th>
<th>4</th>
<th>7</th>
<th>11</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>14</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Iteration 2, $n/4$ threads

Each $\bigcirc$ corresponds to a single thread.
Reduction Step (cont.)

Iterate \( \log(n) \) times. Each thread adds value \( \text{stride} \) elements away to its own value.

Note that this algorithm operates in-place: no need for double buffering.
Zero the Last Element

We now have an array of partial sums. Since this is an exclusive scan, set the last element to zero. It will propagate back to the first element.
Post Scan Step from Partial Sums

| T | 3 | 4 | 7 | 11 | 4 | 5 | 6 | 0 |
Iterate $\log(n)$ times. Each thread adds value \textit{stride} elements away to its own value, and sets the value \textit{stride} elements away to its own \textit{previous} value.
Iterate \( \log(n) \) times. Each thread adds value \( stride \) elements away to its own value, and sets the value \( stride \) elements away to its own previous value.

Each \( \bigcirc \) corresponds to a single thread.
Post Scan Step From Partial Sums (cont.)

Done! We now have a completed scan that we can write out to device memory.

Total steps: $2 \times \log(n)$.
Total work: $2 \times (n-1)$ adds = $O(n)$  
Work Efficient!
Work Analysis

- The parallel Scan executes $2^* \log(n)$ parallel iterations
  - $\log(n)$ in reduction and $\log(n)$ in post scan
  - The iterations do $\frac{n}{2}, \frac{n}{4}, \ldots, 1, 1, \ldots, \frac{n}{4}$. $\frac{n}{2}$ adds
  - Total adds: $2^* (n-1) \rightarrow O(n)$ work

- The total number of adds is no more than twice of that done in the efficient sequential algorithm
  - The benefit of parallelism can easily overcome the $2X$ work when there is sufficient hardware
Conclusions

• Performance is related to how you keep the GPU and its memory busy → does not necessarily mean higher occupancy.

• We looked at some of the common parallel patterns used in many GPU kernels. These are tools that you can use in your own kernels.