Graphics Processing Units (GPUs): Architecture and Programming
Welcome, students, to the Graphics Processing Units course, Fall 2017 edition.
The course examines the architecture and capabilities of modern GPUs (graphics processing units) and how to use them to get the best performance for many applications.
I will keep updating this page regularly. If you have questions related to this course, feel free to email me.
Why are GPUs more important now than ever? Many computations can be performed faster on the GPU than on a traditional CPU (e.g. many scientific applications, the training part of deep learning, ...).
This is why GPUs now exist in almost all computers (from tablets to supercomputers); many of the Top 500 supercomputers in the world are built with GPUs.
GPUs are now used for a diverse set of applications, not only traditional graphics applications.
This course introduces the concept of general-purpose GPUs, or GPGPUs.
We will cover architectural aspects of modern GPUs, learn how to program GPUs to solve different types of problems, and see how to make the best use of their hardware.
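To give a first taste of what a general-purpose GPU program looks like, here is a minimal CUDA sketch of element-wise vector addition. It is only an illustration: the array size, block size, and use of managed memory are arbitrary choices, not part of the course material.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified (managed) memory keeps the example short.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The grid/block configuration simply maps a one-dimensional problem onto the GPU's thread hierarchy; choosing good configurations for real problems is one of the themes of the course.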
- Some other suggested, but not required, books:
- Our grader for this semester:
Libin Lu email: ll1488 (at) nyu.edu
- Office hours: Wednesdays, 60 Fifth Ave, room 520
- When no reading material is provided, it means you just need to study the slides.
- Unless stated otherwise, reading material refers to the assigned textbook.
(non-programming - Submission through NYU classes)
- (15% of total grade)
(30% of total grade)
(25% of total grade)
Note: You may not find a paper about your specific project topic. The idea is to survey the literature and introduce your new idea.
The papers given below are just starting points. If you pick a project, you may need to read and dig more.
- Quantifying the relationship between occupancy and performance: Some descriptions can be found on the NVIDIA website here and here. (A small occupancy-query sketch appears after this list of topics.)
- Reverse engineering the memory hierarchy of a GPU:
Simply speaking, can you write a few CUDA kernels and, by analysing their performance, find out what the memory hierarchy of the GPU is (size of the L1 cache, presence of an L2 cache, size of the L2 cache, ...)? Here is a quick summary of what the memory architecture looks like, and here is a more detailed paper about benchmarking the memory hierarchy of GPUs. There is also a paper on dissecting GPU memory with benchmarking. (A pointer-chasing sketch of this idea appears after this list of topics.)
- Predicting the performance of a GPU given program characteristics and hardware characteristics: This tutorial (a long one) gives an idea about the performance of GPUs (you need to be on the NYU network to access it).
- Memory hierarchy design in the presence of multiple kernels:
Suppose there are several kernels running simultaneously on the GPU (there can be up to 32). What are the best L1 and L2 cache characteristics (total size, block size, replacement policy, and associativity) to get the best performance? Do we need an L3 cache? You don't need to answer all of these questions; you can pick 2-3 of them. Here are some general papers on how to find out about the GPU memory hierarchy. (A small CUDA streams sketch of concurrent kernels appears after this list of topics.)
- Power-efficient GPU programming: Here is a paper about analyzing and improving GPU power efficiency (you can read mainly the programming-level techniques, but the whole paper is worth reading). Other papers can be found here and here (you may need to be on the NYU network to download some of these papers).
- Pick an application to parallelise and compare its performance and scalability to other state-of-the-art implementations (multicore and other GPU implementations, if any).
- Examples of papers published from this project:
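For the occupancy topic above, a natural starting point is to query the theoretical occupancy of a kernel with the CUDA occupancy API. This is only a rough sketch; the dummy kernel and the block size of 256 are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel whose occupancy we want to inspect.
__global__ void work(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = x[i] * 2.0f;
}

int main() {
    int device = 0, blockSize = 256, numBlocksPerSM = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // How many thread blocks of this kernel fit on one SM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, work,
                                                  blockSize, 0);

    float occupancy = (numBlocksPerSM * blockSize) /
                      (float)prop.maxThreadsPerMultiProcessor;
    printf("Active blocks per SM: %d, theoretical occupancy: %.2f\n",
           numBlocksPerSM, occupancy);
    return 0;
}
```

A project on this topic would then compare such theoretical numbers against measured performance across kernels and block sizes.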
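For the memory-hierarchy topic, the usual trick is a pointer-chasing microbenchmark: as the working set grows past a cache level, the measured latency per access jumps. The sketch below only illustrates the idea; the array sizes, stride, and iteration count are arbitrary assumptions, and a real study would vary them systematically.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Single-threaded chain of dependent loads; timed with the device clock.
__global__ void chase(unsigned int *arr, int iters, unsigned int *out,
                      long long *cycles) {
    unsigned int j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        j = arr[j];              // each load depends on the previous one
    *cycles = clock64() - start;
    *out = j;                    // keep the loop from being optimised away
}

int main() {
    const int iters = 1 << 16;
    for (size_t kb = 4; kb <= 4096; kb *= 2) {
        size_t n = kb * 1024 / sizeof(unsigned int);
        unsigned int *arr, *out;
        long long *cycles;
        cudaMallocManaged(&arr, n * sizeof(unsigned int));
        cudaMallocManaged(&out, sizeof(unsigned int));
        cudaMallocManaged(&cycles, sizeof(long long));
        for (size_t i = 0; i < n; ++i)
            arr[i] = (i + 32) % n;   // stride of 32 elements (128 bytes)

        chase<<<1, 1>>>(arr, iters, out, cycles);
        cudaDeviceSynchronize();
        printf("%6zu KB: %.1f cycles/access\n", kb, (double)*cycles / iters);
        cudaFree(arr); cudaFree(out); cudaFree(cycles);
    }
    return 0;
}
```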
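For the multiple-kernels topic, the setting it assumes (several kernels resident on the GPU at once) can be reproduced with CUDA streams. The sketch below only shows how concurrent kernels are launched; the number of streams and the dummy kernel are illustrative.

```cuda
#include <cuda_runtime.h>

// Dummy kernel that keeps the SMs busy for a while.
__global__ void busy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            x[i] = x[i] * 1.0001f + 0.0001f;
}

int main() {
    const int numStreams = 4, n = 1 << 16;
    cudaStream_t streams[numStreams];
    float *buf[numStreams];

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void **)&buf[s], n * sizeof(float));
        // Kernels launched into different streams may overlap on the device.
        busy<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    return 0;
}
```

A study of cache design for this scenario would then look at how such co-running kernels interfere in the L1 and L2 caches.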
Links (Geeky stuff about GPUs)
GPUs in general:
High Performance Computing on GPUs
digital 3D rendered film (Thanks William Ward)
with Ed Catmull (Thanks William Ward)
GPU computing seminars
GPU accelerated machine learning (Thanks Darshan Hegde for the link)
Floating point numbers (Thanks Chris W. Quackenbush)
Nice summary of optimizations (Thanks to Darshan Hegde )
CUDA C Programming Guide
CUDA C Best Practices
Series of CUDA articles at Dr. Dobb's
2.0 reference card
Simulators and Tools:
(simulates both GPUs and multicore)
(dynamic compilation for PTX)