
Learn HPC: Chapter 4 - Compute architecture and scheduling

4.1 Architecture of a modern GPU

Architecture of a CUDA Capable GPU

While older GPUs used graphics double data rate (GDDR) synchronous DRAM, more recent GPUs, starting with NVIDIA's Pascal architecture, may use HBM (high-bandwidth memory) or HBM2, which consists of DRAM (dynamic random access memory) modules tightly integrated with the GPU in the same package.

4.2 Block Scheduling

Thread block assignment to SM

4.3 Synchronization and Transparent Scalability

Incorrect use of __syncthreads(): don't place it inside separate if-else branches. Threads that take different branches would wait at different barriers, which is undefined behavior and typically deadlocks the block.
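A minimal sketch of the incorrect pattern and its fix (kernel names are illustrative, not from the book): in the first kernel, threads taking different branches wait at different barriers; in the second, every thread reaches the same __syncthreads().

// Incorrect: __syncthreads() inside divergent branches. Not all threads of
// the block reach the same barrier, so behavior is undefined (often a hang).
__global__ void incorrectSync(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        data[i] += 1.0f;
        __syncthreads();   // barrier reached only by even threads
    } else {
        data[i] += 2.0f;
        __syncthreads();   // a different barrier reached only by odd threads
    }
}

// Correct: every thread in the block reaches the same single barrier.
__global__ void correctSync(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        data[i] += 1.0f;
    } else {
        data[i] += 2.0f;
    }
    __syncthreads();       // one barrier executed by all threads of the block
}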

Not having synchronization between blocks enables transparent scalability for CUDA programs.

The ability to execute the same application code on different hardware with different amounts of execution resources is referred to as transparent scalability, which reduces the burden on application developers and improves the usability of applications.

4.4 Warps and SIMD hardware

A warp is the unit of thread scheduling in SMs.

For blocks that consist of multiple dimensions of threads, the dimensions will be projected into a linearized row-major layout before partitioning into warps.
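A small sketch of this linearization, assuming the usual warp size of 32; linearThreadIndex and warpIdWithinBlock are illustrative helper names, not CUDA built-ins (only warpSize is a built-in variable).

// Row-major linearization: threadIdx.x varies fastest, then y, then z.
__device__ int linearThreadIndex() {
    return threadIdx.z * blockDim.y * blockDim.x
         + threadIdx.y * blockDim.x
         + threadIdx.x;
}

// The linearized block is then partitioned into consecutive groups of 32.
__device__ int warpIdWithinBlock() {
    return linearThreadIndex() / warpSize;   // warpSize is 32 on current NVIDIA GPUs
}

For example, a 16x16 block linearizes to 256 threads and is partitioned into 8 warps; warp 0 holds threads (0,0)..(15,0) followed by (0,1)..(15,1).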

The cores in an SM are grouped into processing blocks, where every 8 cores form a processing block and share an instruction fetch/dispatch unit.

The advantage of SIMD is that the cost of the control hardware, such as the instruction fetch/dispatch unit, is shared across many execution units. This design choice allows for a smaller percentage of the hardware to be dedicated to control and a larger percentage to be dedicated to increasing arithmetic throughput.

This is the essence of SIMD hardware: the control hardware is shared across multiple execution units.


Von Neumann Architecture

Von Neumann Model modified to reflect GPU design

4.5 Control Divergence

Note: a __syncthreads() may be needed before 'C' (the code that follows the divergent if-else in the example), so that all threads have completed their pass before continuing.

In Pascal and prior architectures, these passes are executed sequentially (i.e., one pass is executed to completion before the next pass starts), but from the Volta architecture onward the passes may be executed concurrently. This feature is called independent thread scheduling.
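A minimal sketch of a divergent kernel (illustrative, not from the book): within a warp, even and odd lanes take different branches, so the SIMD hardware makes one pass with the even lanes active and another with the odd lanes active.

__global__ void divergentKernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                  // this check diverges only in the warp that straddles n
        if (i % 2 == 0) {
            out[i] = 2.0f * i;    // pass 1: even lanes active, odd lanes masked off
        } else {
            out[i] = 0.5f * i;    // pass 2: odd lanes active, even lanes masked off
        }
    }
    // A condition that depends only on blockIdx (e.g. if (blockIdx.x == 0))
    // would not diverge, because all threads of a warp belong to the same block.
}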

4.6 Warp scheduling and latency tolerance

Threads, Context Switching and Zero Overhead Scheduling

Zero-overhead scheduling is possible because the execution states of all resident warps are kept in hardware registers, so switching between warps requires no saving or restoring of state. This is also why the number of warps that can be resident on an SM is limited by register usage.

4.7 Resource partitioning and Occupancy

If a grid is launched with a block size of 1024 threads, then the 2048 thread slots of an SM are partitioned and assigned to 2 blocks, so each SM will hold 2 blocks.

The most important decision when optimizing for occupancy is choosing the thread block size.

For example, the Ampere A100 allows a maximum of 65,536 registers per SM, which works out to 65,536 / 2048 = 32 registers per thread at full occupancy. So, if a kernel uses 64 registers per thread, the maximum number of threads that can be supported with 65,536 registers is 1024 threads. In this case, the kernel cannot run with full occupancy regardless of the block size; the occupancy will be at most 50%. In some cases, the compiler may perform register spilling to reduce the register requirement per thread and thus elevate the level of occupancy. However, this typically comes at the cost of increased execution time for the threads to access the spilled register values from memory, and it may cause the total execution time of the grid to increase.

Register spilling: the compiler moves (spills) some variables from registers to memory (local memory, which may be cached in L1). When these variables are needed, they have to be loaded from memory into registers and written back after modification, which adds overhead. Spilling can increase occupancy because it lowers the register requirement per thread, allowing more threads to run in parallel even when there are not enough registers for all of them.
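One way to see this trade-off in practice is the __launch_bounds__ qualifier, which tells the compiler the launch configuration to optimize for; to honor it, the compiler may cap registers per thread and spill the rest. The sketch below is illustrative only: the kernel body and the 256/8 hint values are arbitrary choices, not from the book.

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// the compiler may limit registers per thread (spilling to local memory)
// so that at least 8 blocks of 256 threads can be resident per SM.
__global__ void __launch_bounds__(256, 8) occupancyHintedKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * data[i] + 1.0f;
    }
}

Compiling with nvcc -Xptxas -v reports the registers used and any spill loads/stores per kernel, which shows whether the cap caused spilling.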

How occupancy changes with a slight change in the resource usage of a kernel:

Assume that a programmer implements a kernel that uses 31 registers per thread and configures it with 512 threads per block. In this case, the SM will have (2048 threads)/(512 threads/block) = 4 blocks running simultaneously. These threads will use a total of (2048 threads) × (31 registers/thread) = 63,488 registers, which is less than the 65,536 register limit. Now assume that the programmer declares another two automatic variables in the kernel, bumping the number of registers used by each thread to 33. The number of registers required by 2048 threads is now 67,584 registers, which exceeds the register limit. The CUDA runtime system may deal with this situation by assigning only 3 blocks to each SM instead of 4, thus reducing the number of registers required to 50,688 registers. However, this reduces the number of threads running on an SM from 2048 to 1536; that is, by using two extra automatic variables, the program saw a reduction in occupancy from 100% to 75%. This is sometimes referred to as a “performance cliff,” in which a slight increase in resource usage can result in significant reduction in parallelism and performance achieved.
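A small host-side sketch that reproduces this arithmetic (blocksPerSM is a hypothetical helper; the 2048-thread and 65,536-register limits are the ones assumed in the example, and other constraints such as the maximum blocks per SM or shared memory are ignored).

#include <stdio.h>

// Blocks that fit on one SM, limited by thread slots and by the register file.
int blocksPerSM(int threadsPerBlock, int regsPerThread,
                int maxThreadsPerSM, int maxRegsPerSM) {
    int byThreads = maxThreadsPerSM / threadsPerBlock;                // thread-slot limit
    int byRegs    = maxRegsPerSM / (threadsPerBlock * regsPerThread); // register limit
    return byThreads < byRegs ? byThreads : byRegs;
}

int main(void) {
    const int maxThreads = 2048, maxRegs = 65536, blockSize = 512;
    for (int regs = 31; regs <= 33; regs += 2) {
        int blocks = blocksPerSM(blockSize, regs, maxThreads, maxRegs);
        printf("%d regs/thread -> %d blocks/SM, occupancy %.0f%%\n",
               regs, blocks, 100.0 * blocks * blockSize / maxThreads);
    }
    return 0;   // prints 4 blocks / 100% for 31 regs, 3 blocks / 75% for 33 regs
}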

4.8 Querying device properties

The amount of resources in each CUDA device SM is specified as part of the compute capability of the device.

int devCount;
cudaGetDeviceCount(&devCount);                // number of available CUDA devices
cudaDeviceProp devProp;
for (int i = 0; i < devCount; i++) {
    cudaGetDeviceProperties(&devProp, i);     // fill devProp for device i
    // Decide if device has sufficient resources/capabilities
    // (e.g. devProp.maxThreadsPerBlock, devProp.multiProcessorCount, devProp.warpSize)
}
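As a sketch, the cudaDeviceProp fields most relevant to this chapter can be printed directly (device 0 is assumed to exist; error checking is omitted).

#include <stdio.h>
#include <cuda_runtime.h>

// Print the per-SM resources discussed in this chapter for device 0.
int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("SMs:                   %d\n",    prop.multiProcessorCount);
    printf("Warp size:             %d\n",    prop.warpSize);
    printf("Max threads per block: %d\n",    prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n",    prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:      %d\n",    prop.regsPerMultiprocessor);
    printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}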

4.9 Summary

Footnote

  1. Threads in different blocks can perform barrier synchronization through the Cooperative Groups API. However, there are several important restrictions that must be obeyed to ensure that all threads involved are indeed simultaneously executing on the SMs.
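A minimal sketch of such a grid-wide barrier (gridSyncKernel is an illustrative name, not from the book):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Grid-wide barrier via the Cooperative Groups API: grid.sync() waits for
// every thread of every block in the grid.
__global__ void gridSyncKernel(float* data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
    grid.sync();                   // all blocks reach this point before continuing
    if (i < n) data[i] += 1.0f;    // safe: every element has already been doubled
}

On the host side the kernel must be launched with cudaLaunchCooperativeKernel() rather than the <<<...>>> syntax, and the launch is only valid if all blocks of the grid can be resident on the device simultaneously, which is the restriction the footnote refers to.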