CSCI-GA.3033-004
Graphics Processing Units (GPUs): Architecture and Programming
CUDA
Advanced Techniques 1
Mohamed Zahran (aka Z)
mzahran@cs.nyu.edu
http://www.mzahran.com
Some Refreshing Exercises!

- Suppose registers and shared memory capacities were not an issue. When is it still beneficial to put values fetched from memory into the shared memory?
Some Refreshing Exercises!

• Assume a kernel is launched with 1000 blocks. Each block has 512 threads.
  – If a variable is declared as local in the kernel, how many versions will be created throughout the lifetime of the kernel?
  – How about if the variable is created as shared?
Some Refreshing Exercises!

• A kernel contains 36 floating point operations and 7 32-bit word global memory accesses per thread. For each of the following device properties, indicate whether this kernel is compute- or memory-bound.
  – Peak FLOPS = 200 GFLOPS, Peak Memory Bandwidth = 100 GB/s

  – Peak FLOPS = 300 GFLOPS, Peak Memory Bandwidth = 250 GB/s
A Note About Performance
Defining Performance

- Which airplane has the best performance?
Performance Considerations

• There are many hardware constraints.
• Depending on the application, different constraints may dominate.
• We can improve performance of an application by trading one resource usage for another.
Performance Issue: Thread Diversion

• Works well when all threads in a warp follow the same control-flow
• Performance loss due to thread diversion

Instructions: A, B, C
Control Flow

Thread Serialization Behavior

Time
Warp
Inst. A
Inst. B
Inst. C
Performance Issue:
Thread Diversion

1. __shared__ float partialSum[]
2. unsigned int t = threadIdx.x;
3. for (unsigned int stride = 1;
4.     stride <= blockDim.x; stride *= 2)
5. {
6.     syncthreads();
7.     if (t % (2*stride) == 0)
8.         partialSum[t] += partialSum[t+stride];
9. }

Example: Sum Reduction Kernel
Performance Issue: Thread Diversion

1. __shared__ float partialSum[];
2. unsigned int t = threadIdx.x;
3. for (unsigned int stride = blockDim.x>>1;
4.    stride > 0; stride >>= 1)
5. }
6. __syncthreads();
7. if (t < stride)
8.    partialSum[t] += partialSum[t+stride];
9. |

Why is this version better than the previous one?

Example: Sum Reduction Kernel
Performance Issue: Global Memory

- Typical application: process massive amount of data within short period of time
  - From global memory
  - large amount + short period = huge bandwidth requirement

- Two main challenges regarding global memory:
  - Long latency
  - Relatively limited bandwidth
Dealing With Global Memory: TILING

- We have seen this before
- Make use of shared memory available in SMs to reduce trips to global memory
Dealing With Global Memory: Coalescing

• To more effectively move data from global memory to shared memory and registers
• For best results: can be used with tiling
• Global memory:
  – DRAM
  – Reading a bit is slow
  – So memory is implemented to read several bits in parallel
Dealing With Global Memory: Coalescing

• If an application can make use of data from multiple consecutive locations, the DRAM can supply the data in much higher rate.
• Kernel must arrange its data access accordingly
• When all threads in a warp execute a load instruction:
  – The hardware detects whether the addresses are consecutive
  – The hardware combines (coalesces) all accesses in a consolidated access to consecutive DRAM locations
Dealing With Global Memory: Coalescing

Not coalesced

coalesced

Same warp

Thread 1

Thread 2

M_d

N_d

WIDTH

Linearized order in increasing address
Dealing With Global Memory: Coalescing

**Favorable** access direction in Kernel code of one thread

<table>
<thead>
<tr>
<th>M₀₀</th>
<th>M₁₀</th>
<th>M₂₀</th>
<th>M₃₀</th>
</tr>
</thead>
<tbody>
<tr>
<td>M₀₁</td>
<td>M₁₁</td>
<td>M₂₁</td>
<td>M₃₁</td>
</tr>
<tr>
<td>M₀₂</td>
<td>M₁₂</td>
<td>M₂₂</td>
<td>M₃₂</td>
</tr>
<tr>
<td>M₀₃</td>
<td>M₁₃</td>
<td>M₂₃</td>
<td>M₃₃</td>
</tr>
</tbody>
</table>

Time Period 1
T₀  T₁  T₂  T₃

Time Period 2
T₀  T₁  T₂  T₃

...
Dealing With Global Memory: Coalescing

Access direction in Kernel code

Time Period 1

Time Period 2

M₀₀, M₁₀, M₂₀, M₃₀
M₀₁, M₁₁, M₂₁, M₃₁
M₀₂, M₁₂, M₂₂, M₃₂
M₀₃, M₁₃, M₂₃, M₃₃

T₁, T₂, T₃, T₄

M₀₀, M₁₀, M₂₀, M₃₀
M₀₁, M₁₁, M₂₁, M₃₁
M₀₂, M₁₂, M₂₂, M₃₂
M₀₃, M₁₃, M₂₃, M₃₃
Dealing With Global Memory: Cache

- Fermi (and later GPUs) have cache for global memory
- Caches automatically coalesce most of kernel access patterns
Dealing With Global Memory: Prefetching

Prefetch next data elements while consuming the current data elements.
This increases the number of independent instructions between memory accesses and consumers.

<table>
<thead>
<tr>
<th>A</th>
<th>Without prefetching</th>
</tr>
</thead>
<tbody>
<tr>
<td>Loop {</td>
<td></td>
</tr>
<tr>
<td>Load current tile to shared memory</td>
<td></td>
</tr>
<tr>
<td>__syncthreads()</td>
<td></td>
</tr>
<tr>
<td>Compute current tile</td>
<td></td>
</tr>
<tr>
<td>__syncthreads()</td>
<td></td>
</tr>
<tr>
<td>}</td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>B</th>
<th>With prefetching</th>
</tr>
</thead>
<tbody>
<tr>
<td>Loop {</td>
<td></td>
</tr>
<tr>
<td>Load first tile from global memory into registers</td>
<td></td>
</tr>
<tr>
<td>Loop {</td>
<td></td>
</tr>
<tr>
<td>Deposit tile from registers to shared memory</td>
<td></td>
</tr>
<tr>
<td>__syncthreads()</td>
<td></td>
</tr>
<tr>
<td>Load next tile from global memory into registers</td>
<td></td>
</tr>
<tr>
<td>Compute current tile</td>
<td></td>
</tr>
<tr>
<td>__syncthreads()</td>
<td></td>
</tr>
<tr>
<td>}</td>
<td></td>
</tr>
</tbody>
</table>
Performance Issue: SM Resources

• Execution resources in SM include:
  – registers
  – block slots
  – thread slots

• There is an interaction among the resources that you must take into account.
Performance Issue: SM Resources

Example: Assume G80 executing the matrix multiplication with 16x16 thread blocks (8 block slots, 768 thread slots, 8192 registers)

If a thread needs 10 registers then:
- A block needs $10 \times 16 \times 16 = 2560$ registers
- 3 blocks -> 7680 registers (under the 8192 limit)
- We can’t add another block (will make it 10240)
- 3 blocks x 256 threads/block = 768 (within limit)

By using 1 extra variable the program saw a 1/3 reduction in warp parallelism -> performance cliff

Assume the programmer declares one more auto var:
- $11 \times 16 \times 16 = 2816$ registers per block
- 3 blocks -> $3 \times 2816 = 8448$ (above limit)
- SM reduces #blocks by 1 -> 5632 registers required
- This reduces the number of threads in SM to $2 \times 256$ -> 512
Performance Issue: SM Resources

**Example:** Still with G80:
- An instruction takes 4 cycles
- Assume 4 independent instructions between global memory load and its use
- Global memory latency is 200 cycles

To keep execution units fully utilized:
We need to have \( \frac{200}{(4 \times 4)} = 14 \) warps

Assume an extra register allows the programmer to use a transformation to increase independent instructions from 4 to 8, then:
- Now we need \( \frac{200}{(4 \times 8)} = 7 \) warps
- Blocks reduced from 3 to 2 -> warps reduced from 24 to 16
- Still we can fully utilize execution units
Performance Issue: Instruction Mix

for (int k = 0; k < BLOCK_SIZE; ++k)
    Pvalue += Ms[ty][k] * Ns[k][tx];

Is the above code efficient?

• Extra instructions to update loop counter
• Extra instructions for conditional branch at the end of each iteration
• Using k to access matrices incurs address arithmetic instructions.
• All of the above compete with the floating-point calculations for limited instruction processing bandwidth.

2 FP arithmetic
2 address arithmetic instructions
1 loop branch instructions
1 loop increment instructions

only 1/3 instructions are FP operations
Performance Issue: Instruction Mix

for (int k = 0; k < BLOCK_SIZE; ++k)
    Pvalue += Ms[ty][k] * Ns[k][tx];

Pvalue += Ms[ty][0] * Ns[0][tx] + ...
    Ms[ty][15] * Ns[15][tx];

Loop unrolling
Performance Issue: Thread Granularity

- Algorithmic decision
- It is often advantageous to put more work into each thread and use fewer threads
  - When redundant work exists between threads
  - Example: Let a thread compute 2 tiles
  + Less redundant work
  + Potentially more independent instructions
- More resources requirements
Putting It All Together

Thread Granularity:
• normal
• merging 2 blocks
• merging 4 blocks
Until tile reaches 16x16 neither loop unrolling nor data prefetch helps. For small tile size, global memory bandwidth severely limits performance.
Granularity adjustment can reduce global memory access.
Data prefetching becomes less beneficial as thread granularity increases.
About Algorithms for GPUs

• When designing an algorithm for GPU, the two main characteristics that determine its performance are:
  1. How much parallelism is available
  2. How much data must move through the memory hierarchy

• Then when you move from algorithm to code you need to take hardware constraints into account.
A note about compilation ...
And some useful tools!
NVCC device specific switches

- **-arch**: controls the "virtual" architecture that will be used for the generation of the PTX code.
- **-code**: specifies the actual device that will be targeted by the cubin binary.

<table>
<thead>
<tr>
<th>Virtual Architecture (-arch)</th>
<th>Streaming Multipr. code (-code)</th>
<th>Feature Enabled</th>
</tr>
</thead>
<tbody>
<tr>
<td>compute_10</td>
<td>sm_10</td>
<td>Basic features</td>
</tr>
<tr>
<td>compute_11</td>
<td>sm_11</td>
<td>global memory atomic operations</td>
</tr>
<tr>
<td>compute_12</td>
<td>sm_12</td>
<td>shared memory atomic operations and vote instructions</td>
</tr>
<tr>
<td>compute_13</td>
<td>sm_13</td>
<td>double precision floating point support</td>
</tr>
<tr>
<td>compute_20</td>
<td>sm_20</td>
<td>Fermi support</td>
</tr>
<tr>
<td></td>
<td>sm_21</td>
<td>SM structure changes (e.g more cores)</td>
</tr>
<tr>
<td>compute_30</td>
<td>sm_30</td>
<td>Kepler support</td>
</tr>
<tr>
<td>compute_35</td>
<td>sm_35</td>
<td>Dynamic parallelism (enables recursion)</td>
</tr>
<tr>
<td>compute_50</td>
<td>sm_50</td>
<td>Maxwell support</td>
</tr>
</tbody>
</table>

Source: *Multicore and GPU Programming: An Integrated Approach* by G. Barlas
sm_{xy}

• x is the GPU generation number
• y is the version within that generation
• Binary compatibility of GPU applications is not guaranteed across different generations.
  – Example: a CUDA application that has been compiled for a Fermi GPU will very likely not run on a Kepler GPU (and vice versa).

• This is why nvcc relies on a two stage compilation model for ensuring application compatibility with future GPU generations.
Source: CUDA compiler driver nvcc manual (NVIDIA website)
JIT Compilation

- If you are unsure which exact GPU the code will run on.
- Use -arch without -code
- Main disadvantage: slower startup
Fatbinaries

```
nvcc x.cu -arch=compute_30 -code=compute_30,sm_30,sm_35
```

generate PTX and keep it in the binary generated, for JIT on future GPUs

Generate binaries for two versions of Kepler

At runtime, the CUDA driver will select the most appropriate translation when the device function is launched.

Till now we have single virtual architecture and several real architectures. How about several virtual architectures?
nvcc x.cu \n--generate-code arch=compute_20,code=sm_20 \n--generate-code arch=compute_20,code=sm_21 \n--generate-code arch=compute_30,code=sm_30
The Default

nvcc x.cu

is equivalent to

nvcc x.cu -arch=compute_20 -code=sm_20,compute_20
Some nvcc features: --ptxas-options=-v
  - Print the smem, register and other resource usages

Generates CUDA binary file: nvcc -cubin
  - cubin file is the cuda executable
Dealing with binary files

<table>
<thead>
<tr>
<th>Extract ptx and extract and disassemble cubin from the following input files:</th>
<th>cuobjdump</th>
<th>nvdisasm</th>
</tr>
</thead>
<tbody>
<tr>
<td>Host binaries</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>Executables</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Object files</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Static libraries</td>
<td></td>
<td></td>
</tr>
<tr>
<td>External fatbinary files</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Control flow analysis and output</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>Advanced display options</td>
<td>No</td>
<td>Yes</td>
</tr>
</tbody>
</table>
nvprof

- **CUDA profiler:** profiling data from the command line

  ```
  $ nvprof [nvprof_args] <app> [app_args]
  ```

- To profile a region of the application:
  1. `#include <cuda_profiler_api.h>`
  2. in the host function surround the region with:
     - `cudaProfilerStart()`
     - `cudaProfilerStop()`
  3. `nvcc myprog.cu`
  4. `nvprof --profile-from-start-off ./a.out`
nvprof summary mode (default)

```
$ nvprof dct8x8

Profiling result:

<table>
<thead>
<tr>
<th>Time(%)</th>
<th>Time</th>
<th>Calls</th>
<th>Avg</th>
<th>Min</th>
<th>Max</th>
<th>Name</th>
</tr>
</thead>
<tbody>
<tr>
<td>49.52</td>
<td>9.36ms</td>
<td>101</td>
<td>92.68us</td>
<td>92.31us</td>
<td>94.31us</td>
<td>CUDAkernel2DCT(float*, float*, int)</td>
</tr>
<tr>
<td>37.47</td>
<td>7.08ms</td>
<td>10</td>
<td>708.31us</td>
<td>707.99us</td>
<td>708.50us</td>
<td>CUDAkernel1IDCT(float*, int, int, int)</td>
</tr>
<tr>
<td>3.75</td>
<td>708.42us</td>
<td>1</td>
<td>708.42us</td>
<td>708.42us</td>
<td>708.42us</td>
<td>CUDAkernel1IDCT(float*, int, int, int)</td>
</tr>
<tr>
<td>1.84</td>
<td>347.99us</td>
<td>2</td>
<td>173.99us</td>
<td>173.59us</td>
<td>174.40us</td>
<td>CUDAkernelQuantizationFloat()</td>
</tr>
<tr>
<td>1.75</td>
<td>331.37us</td>
<td>2</td>
<td>165.69us</td>
<td>165.67us</td>
<td>165.70us</td>
<td>[CUDA memcpyDtoH]</td>
</tr>
<tr>
<td>1.41</td>
<td>266.70us</td>
<td>2</td>
<td>133.35us</td>
<td>89.70us</td>
<td>177.00us</td>
<td>[CUDA memcpyHtoD]</td>
</tr>
<tr>
<td>1.00</td>
<td>189.64us</td>
<td>1</td>
<td>189.64us</td>
<td>189.64us</td>
<td>189.64us</td>
<td>CUDAkernelShortDCT(short*, int)</td>
</tr>
<tr>
<td>0.94</td>
<td>176.87us</td>
<td>1</td>
<td>176.87us</td>
<td>176.87us</td>
<td>176.87us</td>
<td>[CUDA memcpyHtoA]</td>
</tr>
<tr>
<td>0.92</td>
<td>174.16us</td>
<td>1</td>
<td>174.16us</td>
<td>174.16us</td>
<td>174.16us</td>
<td>CUDAkernelShortIDCT(short*, int)</td>
</tr>
<tr>
<td>0.76</td>
<td>143.31us</td>
<td>1</td>
<td>143.31us</td>
<td>143.31us</td>
<td>143.31us</td>
<td>CUDAkernelQuantizationShort(short*)</td>
</tr>
<tr>
<td>0.52</td>
<td>97.75us</td>
<td>1</td>
<td>97.75us</td>
<td>97.75us</td>
<td>97.75us</td>
<td>CUDAkernel2IDCT(float*, float*)</td>
</tr>
<tr>
<td>0.12</td>
<td>22.59us</td>
<td>1</td>
<td>22.59us</td>
<td>22.59us</td>
<td>22.59us</td>
<td>[CUDA memcpyDtoA]</td>
</tr>
</tbody>
</table>
```
nvprof trace mode

```
$ nvprof --print-gpu-trace dct8x8

======== Profiling result:
     Start    Duration  Grid Size  Block Size  Regs  SSMem  DSMem  Size    Throughput     Name
167.82ms  176.84us   -          -         -      -      -       1.05MB  5.93GB/s [CUDA memcpy HtoA]
168.00ms  708.51us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
168.95ms  708.51us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
169.74ms  708.26us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
170.53ms  707.89us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
171.32ms  708.12us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
172.11ms  708.05us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
172.89ms  708.38us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
173.68ms  708.31us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
174.47ms  708.15us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
175.26ms  707.95us (64 64 1) (8 8 1) 28     512B  0B      -      -     -   -
176.05ms  173.87us (64 64 1) (8 8 1) 27     0B    0B      -      -     -   -
176.23ms  22.82us   -          -         -      -      -       1.05MB  45.96GB/s [CUDA memcpyDtoA]
```

GPU-trace mode provides a timeline of all activities taking place on the GPU in chronological order.
Print individual kernel invocations and sort them in chronological order

Print CUDA runtime/driver API trace

```
$ nvprof --print-gpu-trace --print-api-trace dct8x8

Profiling result:

<table>
<thead>
<tr>
<th>Start</th>
<th>Duration</th>
<th>Grid Size</th>
<th>Block Size</th>
<th>Regs</th>
<th>SSMem</th>
<th>DSMem</th>
<th>Size</th>
<th>Throughput/s</th>
</tr>
</thead>
<tbody>
<tr>
<td>167.82ms</td>
<td>176.84us</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>1.05MB</td>
<td>5.93GB/s</td>
</tr>
<tr>
<td>167.81ms</td>
<td>2.00us</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>167.81ms</td>
<td>38.00us</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>167.85ms</td>
<td>1.00ms</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>168.00ms</td>
<td>708.51us</td>
<td>(64 64 1)</td>
<td>(8 8 1)</td>
<td>28</td>
<td>512B</td>
<td>0B</td>
<td></td>
<td></td>
</tr>
<tr>
<td>168.86ms</td>
<td>2.00us</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>168.86ms</td>
<td>1.00us</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>168.86ms</td>
<td>1.00us</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>168.87ms</td>
<td>0ns</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>168.87ms</td>
<td>24.00us</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>168.89ms</td>
<td>761.00us</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>168.95ms</td>
<td>708.51us</td>
<td>(64 64 1)</td>
<td>(8 8 1)</td>
<td>28</td>
<td>512B</td>
<td>0B</td>
<td>-</td>
<td></td>
</tr>
</tbody>
</table>

[CUDA memcpy HtoA]
cudaSetupArgument
cudaLaunch
cudaDeviceSynchronize
CUDAkernel1DCT(float*, ...)
cudaConfigureCall
cudaSetupArgument
cudaSetupArgument
cudaSetupArgument
cudaSetupArgument
cudaLaunch
cudaDeviceSynchronize
CUDAkernel1DCT(float*, ...)
```
nvcc --devices \texttt{x} --events \texttt{y} ./a.out

\begin{itemize}
  \item \texttt{x}: device number in case of multi-GPU
  \item \texttt{y}: event name
    \begin{itemize}
      \item Gives very useful information, such as:
        \begin{itemize}
          \item number of global memory loads, stores, ...
          \item number of global memory coalesced
          \item branch divergences
        \end{itemize}
    \end{itemize}
\end{itemize}

$ \texttt{nvprof --devices 0 --events branch,divergent_branch dct8x8}$
Revisiting Shared Memory
Till now

• We know that we have to use __shared__
• But, you leave it to the runtime system to distribute shared memory among blocks. Can you change that?
• What if you don’t know how much shared memory we need before execution?
This is what we know

```c
__global__ void staticShared(int *d, int n)
{
    __shared__ int array[64];
    int t = threadIdx.x;
    s[t] = d[t];
}
```

What if you don’t know this size beforehand?
Dynamic Shared Memory

__global__ void DynamicShared(int *d, int n)
{
    extern __shared__ int array[];
    int t = threadIdx.x;
    s[t] = d[t];
}

main()
{
    …
    DynamicShared<<<100, 100, N*sizeof(int)>>>(int *, int);
}
What if you need more than one array in the shared memory?

- Declare a single extern unsized array (as in the previous slide).
- Use pointers into it to divide it into multiple arrays.
- Example:

```c
extern __shared__ int s[];
int *integerData = s;       // nI ints
float *floatData = (float*)integerData[nI]; // nF floats
char *charData = (char*)floatData[nF];       // nC chars
```
Potential Performance Loss: Shared Memory Bank Conflicts

- Shared memory is divided into equally sized memory modules called banks.
- Banks can be accessed simultaneously.
- Shared memory accesses that span $b$ distinct banks yield an effective bandwidth that is $b$ times as high as the bandwidth of when accesses map to the same bank.
- Exceptions are: broadcast and multicast.
Potential Performance Loss: Shared Memory Bank Conflicts

- Shared memory banks are organized such that successive 32-bit words are assigned to successive banks.
- Bandwidth is 32 bits per bank per clock cycle.
- Warp size is 32
Potential Performance Loss: Shared Memory Bank Conflicts

- Now, depending on compute capability:
  - 1.x: the number of banks is 16. A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp.
  - 2.x: the number of banks is 32. A shared memory request for a warp is not split.
  - 3.x: have configurable bank size
cudaDeviceSetSharedMemConfig($i$)

- $i$ above can be:
  - cudaSharedMemBankSizeFourByte ← default
  - cudaSharedMemBankSizeEightByte
Floating Points
Importance of Floating Points

• Many graphics operations are floating point operations
• GPU performance is measure in GFLOPS
Turing Award 1989 to William Kahan for design of the IEEE Floating Point Standards 754 (binary) and 854 (decimal)
What is Excel doing?

A1:  \[ 1.333333333333330000 \]

A2:  \[ =\frac{4}{3} \]

Excel tries to round internal binary floating point to output decimal format to look like what it thinks the user wants to see, rather than the most accurate answer (depending on parentheses).
Floating Point

• We need a way to represent
  - numbers with fractions, e.g., 3.1416
  - very small numbers, e.g., .000000001
  - very large numbers, e.g., $3.15576 \times 10^9$

• Representation:
  - sign, exponent, mantissa: $(-1)^{\text{sign}} \times \text{mantissa} \times 2^{\text{exponent}}$
  - more bits for mantissa gives more accuracy
  - more bits for exponent increases range

• IEEE 754 floating point standard:
  - single precision: 8 bit exponent, 23 bit mantissa
  - double precision: 11 bit exponent, 52 bit mantissa
IEEE 754 floating-point standard

• Leading “1” bit of significand is implicit (called hidden 1 technique, except when exp = -127)
• Exponent is “biased” to make sorting easier
  - all 0s is smallest exponent
  - all 1s is largest exponent
  - bias of 127 for single precision and 1023 for double precision
  - summary: \((-1)^{\text{sign}} \times (1+\text{significand}) \times 2^{\text{exponent} - \text{bias}}\)

• Example:
  - decimal: \(-.75 = - (\frac{1}{2} + \frac{1}{4})\)
  - binary: \(-.11 = -1.1 \times 2^{-1}\)
  - floating point: exponent = 126 = 01111110
  - IEEE single precision: 10111111010000000000000000000000
More about IEEE floating Point Standard

Single Precision:

\[( -1)^{\text{sign}} \times (1+\text{mantissa}) \times 2^{\text{exponent} - 127}\]

The variables shown in red are the numbers stored in the machine.
Floating Point Example

what is the decimal equivalent of

1 01110110 10110000...0
Based on $\exp$
we have 3 encoding schemes

- $\exp \neq 0..0$ or $11...1 \rightarrow$ normalized encoding
- $\exp = 0...000 \rightarrow$ denormalized encoding
- $\exp = 1111...1 \rightarrow$ special value encoding
  - $\frac{\text{frac}}{\text{frac}} = 000...0$
  - $\frac{\text{frac}}{\text{frac}} = \text{something else}$
1. Normalized Encoding

• **Condition:** exp ≠ 000...0 and exp ≠ 111...1
  referred to as Bias

• **Exponent is:** $E = \text{Exp} - (2^{k-1} - 1)$, $k$ is the # of exponent bits
  – Single precision: $E = \text{exp} - 127$
  – Double precision: $E = \text{exp} - 1023$

• **Significand is:** $M = 1.xxx…x_2$
  – Range($M$) = [1.0, 2.0-ε)
  – Get extra leading bit for free

Range($E$)=[-126,127]
Range($E$)=[-1022,1023]
Normalized Encoding Example

• Value: \( \text{Float } F = 15213.0; \)
  \(- 15213_{10} = 11101101101101_{2} \)
  \(= 1.1101101101101_{2} \times 2^{13} \)

• Significand
  \( M = \phantom{.}1.1101101101101_{2} \frac{\text{frac}}{110110110110100000000000_{2}} \)

• Exponent
  \( E = \exp - \text{Bias} = \exp - 127 = 13 \)
  \(\Rightarrow \exp = 140 = 10001100_{2} \)

• Result:
  \[
  \begin{array}{c}
  0 \\
  10001100 \\
  110110110110100000000000
  \end{array}
  \]
  \( s \quad \exp \quad \text{frac} \)
2. Denormalized Encoding

- **Condition**: $\text{exp} = 000...0$

- **Exponent value**: $E = 1 - \text{Bias}$ (instead of $E = 0 - \text{Bias}$)

- **Significand is**: $M = 0.xxx...x_2$ (instead of $M=1.xxx_2$)

- **Cases**
  - $\text{exp} = 000...0$, $\text{frac} = 000...0$
    - Represents zero
    - Note distinct values: +0 and -0
  - $\text{exp} = 000...0$, $\text{frac} \neq 000...0$
    - Numbers very close to 0.0
3. Special Values Encoding

- **Condition**: \( \text{exp} = 111\ldots1 \)

- **Case**: \( \text{exp} = 111\ldots1, \text{frac} = 000\ldots0 \)
  - Represents value \( \infty \) (infinity)
  - Operation that overflows
  - E.g., \( 1.0/0.0 = -1.0/-0.0 = +\infty, \ 1.0/-0.0 = -\infty \)

- **Case**: \( \text{exp} = 111\ldots1, \text{frac} \neq 000\ldots0 \)
  - Not-a-Number (NaN)
  - Represents case when no numeric value can be determined
  - E.g., \( \sqrt{-1}, \infty - \infty, \infty \times 0 \)
Visualization: Floating Point Encodings
Floating Point: IEEE 754

What is the decimal equivalent of:

10111111110100000000000000000000
1 10111111 10100000000000000000000

- 127

So:

• Real exponent = 127 -127 = 0
• There is hidden 1

1.1010....0
= 1.625

Final answer = -1.625
## In Summary About Floating Points

<table>
<thead>
<tr>
<th>exponent</th>
<th>mantissa</th>
<th>meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>11...1</td>
<td>≠ 0</td>
<td>NaN</td>
</tr>
<tr>
<td>11...1</td>
<td>=0</td>
<td>((-1)^S \times \infty)</td>
</tr>
<tr>
<td>00...0</td>
<td>≠0</td>
<td>denormalized</td>
</tr>
<tr>
<td>00...0</td>
<td>=0</td>
<td>0</td>
</tr>
</tbody>
</table>
Algorithm Considerations

• Non representable numbers are rounded
• This rounding error leads to different results depending on the order of operations
  – Non-repeatability makes debugging harder
• A common technique to maximize floating point arithmetic accuracy is to presort data before a reduction computation.
So..

When doing floating-point operations in parallel you have to decide:

• How much accuracy is good enough?
• Do you need single-precision or double precision?
• Can you tolerate presorting overhead, if you care about rounding errors?
Streams
Streams

- A sequence of operations that execute on the device in the order in which they are issued by the host code.
- Operations in different streams can be interleaved and, when possible, they can even run concurrently.
- A stream can be a sequence of kernel launches and host-device memory copies.
- Can have several open streams to the same device at once.
- Need GPUs with concurrent transfer/execution capability.
- Potential performance improvement: can overlap transfer and computation.
Streams

- By default all transfers and kernel launches are assigned to stream 0
  - This means they are executed in order
Example: Default Stream

cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1,N>>>(d_a);
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);

• In the code above, from the perspective of the device, all three operations are issued to the same (default) stream and will execute in the order that they were issued.
• From the perspective of the host:
  • data transfers are blocking or synchronous transfers
  • kernel launch is asynchronous.

cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1,N>>>(d_a);
anyCPUfunction();
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
Example: Non-Default Stream

Non-default streams in CUDA C/C++ are declared, created, and destroyed in host code as follows:

```c
cudaStream_t stream1;
cudaError_t result;
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
```

To issue data transfer to non-default stream (non-blocking):

```c
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
```

To launch a kernel to non-default stream:

```c
increment<<<1,N,0,stream1>>>(d_a);
```
Important

• All operations to non-default streams are non-blocking with respect to the host code.

• Sometimes you need to synchronize the host code with operations in a stream.

• You have several options:
  – `cudaDeviceSynchronize()` → blocks host
  – `cudaStreamSynchronize(stream)` → blocks host
  – `cudaStreamQuery(stream)` → does not block host
Streams

- The amount of overlap execution between two streams depends on:
  - Device supports overlap transfer and kernel execution (compute capability 1.1 and higher)
  - Devices supports concurrent kernel execution (compute capability 2.x and higher)
  - Device supports concurrent data transfer (compute capability 2.x and higher)
  - The order on which commands are issued to each stream
Using streams to overlap device execution with data transfer

• Conditions to be satisfied first:
  – The device must be capable of concurrent copy and execution.
  – The kernel execution and the data transfer to be overlapped must both occur in different, non-default streams.
  – The host memory involved in the data transfer must be pinned memory.
Using streams to overlap device execution with data transfer

for (int i = 0; i < nStreams; ++i) {

    int offset = i * streamSize;

    cudaMemcpyAsync(&d_a[offset], &a[offset],
                     streamBytes,
                     cudaMemcpyHostToDevice,
                     stream[i]);

    kernel<<<......>>>(d_a, offset);

    cudaMemcpyAsync(&a[offset], &d_a[offset],
                     streamBytes,
                     cudaMemcpyDeviceToHost,
                     stream[i]);
}

So..

- Streams are a good way to overlap execution and transfer, hardware permits.
- Don’t confuse kernels, threads, and streams.
Pinned Pages

- Allocate page(s) from system RAM (cudaMallocHost() or cudaHostAlloc())
  - Accessible by device (but wait till next slide)
  - Cannot be paged out
  - Enables highest memory copy performance (cudaMemcpyAsync())
  - Don’t forget cudaFreeHost();

- If too much pinned pages, overall system performance may greatly suffer.
Host page accessible by the device??

• The pointer to the host memory is not directly transferable to device, except with:
  – `cudaHostGetDevicePointer` *(void ** pDevice,
                               void * pHost,
                               unsigned int flags)*
  – flags are 0 for now

• Accessing host memory from device without explicit copy is called “zero-copy” mechanism.
Steps for Zero-Copy

1. `cudaHostAlloc` (void ** ptr, size_t size, unsigned int flags)
   - flag here: `cudaHostAllocMapped`

2. `cudaHostGetDevicePointer()`

3. Then use the pointer in your kernel on device as if it is in the GPU memory
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

#define N 32         // size of vectors

__global__ void add(int *a, int *b, int *c) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if(tid < N)  c[tid] = a[tid]+b[tid];
}

int main(int argc, char *argv[]) {
  int T = 32, B = 1;  // threads per block and blocks per grid
  int *a, *b, *c;    // host pointers
  int *dev_a, *dev_b, *dev_c; // device pointers to host memory

  cudaHostAlloc( (void**)&a, size, cudaHostAllocMapped);
  cudaHostAlloc( (void**)&b, size, cudaHostAllocMapped);
  cudaHostAlloc( (void**)&c, size, cudaHostAllocMapped);

  ...  // load arrays with some numbers

  cudaHostGetDevicePointer(&dev_a, a, 0);
  cudaHostGetDevicePointer(&dev_b, b, 0);
  cudaHostGetDevicePointer(&dev_c, c, 0);

  add<<<B,T>>>(dev_a, dev_b, dev_c);

  cudaFreeHost(a);
  cudaFreeHost(b);
  cudaFreeHost(c);

  return 0;
}
So..

- If the CPU program requires a lot of memory, then pinned pages is not a good idea.
Asynchronous Execution

• Asynchronous = returns to host right-away and does not wait for device
• This includes (but not limited to):
  – Kernel launches;
  – Memory copies between two addresses to the same device memory;
  – Memory copies from host to device of a memory block of 64 KB or less;
  – Memory copies performed by functions that are suffixed with Async;
Asynchronous Execution

• Some CUDA API calls and all kernel launches are asynchronous with respect to the host code.

• This means error-reporting is also asynchronous.

```c
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();
```
Other Sources of Concurrency

- Some devices of compute capability 2.x and higher can execute multiple kernels concurrently.
- The maximum number of kernel launches that a device can execute concurrently is 32 on devices of compute capability 3.5 and 16 on devices of lower compute capability.
- A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.
- Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
- Some devices of compute capability 2.x and higher can perform a copy from page-locked host memory to device memory concurrently with a copy from device memory to page locked host memory.
Conclusions

• As we program GPUs we need to pay attention to several performance bottlenecks:
  – Branch diversion
  – Global memory latency
  – Global memory bandwidth
  – Shared memory bank conflicts
  – Communication
  – Limited resources

• We have several techniques in our arsenal to enhance performance
  – Try to make threads in the same warp follow the same control flow
  – Tiling
  – Coalescing
  – Loop unrolling
  – Increase thread granularity
  – Trade one resource for another
  – Memory access pattern
  – Streams

• Pay attention to interaction among techniques