Learn HPC: Chapter 4 - Compute architecture and scheduling
4.1 Architecture of a modern GPU
- In the next three chapters we will discuss the architecture of modern GPUs, both the compute architecture and the memory architecture, and the performance optimization techniques stemming from the understanding of this architecture.
Fig. 4.1 shows a high-level, CUDA C programmer’s view of the architecture of a typical CUDA-capable GPU. It is organized into an array of highly threaded streaming multiprocessors (SMs). Each SM has several processing units called streaming processors or CUDA cores (hereinafter referred to as just cores for brevity), shown as small tiles inside the SMs in Fig. 4.1, that share control logic and memory resources.
The SMs also come with different on-chip memory structures, collectively labeled as “Memory” in Fig. 4.1. As with other kinds of processors, this on-chip memory is distinct from the much larger off-chip memory described next.
GPUs also come with gigabytes of off-chip device memory, referred to as “Global Memory”.
While older GPUs used graphics double data rate synchronous DRAM, more recent GPUs starting with NVIDIA’s Pascal architecture may use HBM (high-bandwidth memory) or HBM2, which consist of DRAM (dynamic random access memory) modules tightly integrated with the GPU in the same package.
4.2 Block Scheduling
- The threads in a launched grid are assigned to SMs on a block-by-block basis. That is, all threads in a block are simultaneously assigned to the same SM.
Multiple blocks are likely to be simultaneously assigned to the same SM. For example, in Fig. 4.2, three blocks are assigned to each SM. However, blocks need to reserve hardware resources to execute, so only a limited number of blocks can be simultaneously assigned to a given SM.
Most grids contain many more blocks than the number that can be simultaneously assigned across all the SMs of a CUDA device. To ensure that all blocks in a grid get executed, the runtime system maintains a list of blocks that need to execute and assigns new blocks to SMs as previously assigned blocks complete execution.
The assignment of threads to SMs on a block-by-block basis guarantees that threads in the same block are scheduled simultaneously on the same SM. This guarantee makes it possible for threads in the same block to interact with each other in ways that threads across different blocks cannot.¹ This cheap, block-local cooperation is a fundamental source of the GPU's performance.
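As a rough sketch of this (the kernel name and sizes below are made up for illustration), a typical launch creates far more blocks than the device can hold at once; the surplus blocks simply wait in the runtime's list until SMs free up:

```cuda
// The grid below can contain millions of blocks for a large n. The runtime keeps
// the surplus blocks in a list and assigns them to SMs as earlier blocks finish.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchWork(float *d_data, int n) {                        // n may be hundreds of millions
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    work<<<numBlocks, threadsPerBlock>>>(d_data, n);
}
```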
Synchronization and Transparent Scalability
- When a thread calls __syncthreads(), it will be held at the program location of the call until every thread in the same block reaches that location. This ensures that all threads in a block have completed a phase of their execution before any of them can move on to the next phase.
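A minimal sketch of this two-phase pattern (kernel and array names are hypothetical): every thread loads one element into shared memory, and the barrier guarantees all loads are complete before any thread reads a value written by a different thread:

```cuda
__global__ void shiftLeft(const float *in, float *out, int n) {
    __shared__ float tile[256];                  // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) tile[threadIdx.x] = in[i];        // phase 1: each thread loads its element
    __syncthreads();                             // barrier: all loads finish before any read

    // phase 2: read an element loaded by the neighboring thread in the same block
    if (i + 1 < n && threadIdx.x + 1 < blockDim.x)
        out[i] = tile[threadIdx.x + 1];
}
```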
Don't place __syncthreads() in separate if-else branches
In CUDA, if a __syncthreads() statement is present, it must be executed by all threads in a block. When a __syncthreads() statement is placed in an if statement, either all threads in a block execute the path that includes the __syncthreads() or none of them does. For an if-then-else statement, if each path has a __syncthreads() statement, either all threads in a block execute the then-path or all of them execute the else-path. The two __syncthreads() calls are different barrier synchronization points. In the figure above, since not all threads in a block are guaranteed to execute either of the barriers, the code violates the rules for using __syncthreads() and will result in undefined execution behavior.
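The sketch below (hypothetical kernels) contrasts the illegal pattern described above with a legal rearrangement in which every thread reaches the same barrier:

```cuda
__global__ void badBarrier(float *data) {
    if (threadIdx.x % 2 == 0) {
        // ... even threads do some work ...
        __syncthreads();   // WRONG: only even threads reach this barrier
    } else {
        // ... odd threads do other work ...
        __syncthreads();   // WRONG: a different barrier, reached only by odd threads
    }
}

__global__ void goodBarrier(float *data) {
    if (threadIdx.x % 2 == 0) {
        // ... even threads do some work ...
    } else {
        // ... odd threads do other work ...
    }
    __syncthreads();       // OK: every thread in the block reaches this single barrier
}
```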
The system needs to make sure that all threads involved in the barrier synchronization have access to the necessary resources to eventually arrive at the barrier. Otherwise, a thread that never arrives at the barrier synchronization point can cause a deadlock. The CUDA runtime system satisfies this constraint by assigning execution resources to all threads in a block as a unit.
Not only do all threads in a block have to be assigned to the same SM, but also they need to be assigned to that SM simultaneously. That is, a block can begin execution only when the runtime system has secured all the resources needed by all threads in the block to complete execution. This ensures the time proximity of all threads in a block and prevents an excessive or even indefinite waiting time during barrier synchronization.
By not allowing threads in different blocks to perform barrier synchronization with each other, the CUDA runtime system can execute blocks in any order relative to each other, since none of them need to wait for each other.
Not having synchronization between blocks enables transparent scalability.
In a low-cost system with only a few execution resources, one can execute a small number of blocks at the same time, portrayed as executing two blocks at a time on the left-hand side of the figure above. In a higher-end implementation with more execution resources, one can execute many blocks at the same time, portrayed as executing four blocks at a time on the right-hand side of the figure above. A high-end GPU today can execute hundreds of blocks simultaneously.
The ability to execute the same application code with a wide range of speeds allows the production of a wide range of implementations according to the cost, power, and performance requirements of different market segments. For example, a mobile processor may execute an application slowly but at extremely low power consumption, and a desktop processor may execute the same application at a higher speed while consuming more power. Both execute the same application program with no change to the code.
The ability to execute the same application code on different hardware with different amounts of execution resources is referred to as transparent scalability, which reduces the burden on application developers and improves the usability of applications.
4.4 Warps and SIMD hardware
The correctness of executing a kernel should not depend on any assumption that certain threads will execute in synchrony with each other without the use of barrier synchronizations.
Thread scheduling in CUDA GPUs is a hardware implementation concept.
In most implementations to date, once a block has been assigned to an SM, it is further divided into 32-thread units called warps. The size of warps is implementation specific and can vary in future generations of GPUs.
A warp is the unit of thread scheduling in SMs.
In general, warp n starts with thread 32n and ends with thread 32(n+1)-1.
For a block whose size is not a multiple of 32, the last warp will be padded with inactive threads to fill up the 32 thread positions.
For blocks that consist of multiple dimensions of threads, the dimensions will be projected into a linearized row-major layout before partitioning into warps.
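A small sketch of this linearization (the kernel is illustrative; warpSize is a built-in variable that is 32 on current devices):

```cuda
#include <cstdio>

__global__ void whoAmI() {
    // row-major linearization of a (possibly multi-dimensional) block
    int linearTid = threadIdx.z * blockDim.y * blockDim.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.x;
    int warpId = linearTid / warpSize;   // which warp of the block this thread belongs to
    int laneId = linearTid % warpSize;   // position of the thread within its warp
    // e.g., in a 16x16 block, threads with y = 0..1 form warp 0, y = 2..3 form warp 1, ...
    printf("linear tid %d -> warp %d, lane %d\n", linearTid, warpId, laneId);
}
```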
An SM is designed to execute all threads in a warp following the single-instruction, multiple-data (SIMD) model. That is, at any instant in time, one instruction is fetched and executed for all threads in the warp.
Threads in the same warp are assigned to the same processing block, which fetches the instruction for the warp and executes it for all threads in the warp at the same time. These processing blocks are part of the SIMD hardware. As a real example, the Ampere A100 SM, which has 64 cores, is organized into four processing blocks with 16 cores each. The number of cores in a processing block is a hardware design decision.
Figure: how the cores in an SM are grouped into processing blocks; in this example, every 8 cores form a processing block and share an instruction fetch/dispatch unit.
These threads apply the same instruction to different portions of the data. Because the SIMD hardware effectively restricts all threads in a warp to execute the same instruction at any point in time, the execution behavior of a warp is often referred to as single-instruction, multiple-thread (SIMT).
If the number of cores in a processing block is less than the warp size, the threads are executed temporally, i.e., the warp is still executed by the same processing block, but over multiple clock cycles.
The advantage of SIMD is that the cost of the control hardware, such as the instruction fetch/dispatch unit, is shared across many execution units. This design choice allows for a smaller percentage of the hardware to be dedicated to control and a larger percentage to be dedicated to increasing arithmetic throughput.
This sharing of control hardware across multiple execution units is the essence of SIMD hardware.
- The authors expect that in the foreseeable future, warp partitioning will remain a popular implementation technique. However, the size of a warp can vary from implementation to implementation. Up to this point in time, all CUDA devices have used a similar warp configuration in which each warp consists of 32 threads.
Warps and SIMD Hardware
Von Neumann Architecture
- The above architecture is also called the stored program model, which means that a user can change the behavior of a computer by storing a different program into its memory.
Von Neumann Model modified to reflect GPU design
The processor, which corresponds to a processing block in the earlier figure, has only one control unit that fetches and dispatches instructions. The same control signals (the arrows from the Control Unit to the Processing Units) go to multiple processing units, each of which corresponds to a core in the SM and executes one of the threads in a warp.
Since all processing units are controlled by the same instruction in the Instruction Register (IR) of the Control Unit, their execution differences are due to the different data operand values in the register files. This is called Single-Instruction-Multiple-Data (SIMD) in processor design. For example, although all processing units (cores) are controlled by an instruction, such as add r1, r2, r3, the contents of r2 and r3 are different in different processing units.
Control Divergence
When threads within a warp take different control flow paths, the SIMD hardware will take multiple passes through these paths, one pass for each path.
For example, in an if-else construct, if some threads in a warp follow the if-path while others follow the else path, the hardware will take two passes. One pass executes the threads that follow the if-path, and the other executes the threads that follow the else-path. During each pass, the threads that follow the other path are not allowed to take effect.
When threads in the same warp follow different execution paths, we say that these threads exhibit Control Divergence.
The multipass approach to divergent warp execution extends the SIMD hardware’s ability to implement the full semantics of CUDA threads.
While the hardware executes the same instruction for all threads in a warp, it selectively lets these threads take effect in only the pass that corresponds to the path that they took, allowing every thread to appear to take its own control flow path.
This preserves the independence of threads while taking advantage of the reduced cost of SIMD hardware. The cost of divergence, however, is the extra passes the hardware needs to take to allow different threads in a warp to make their own decisions as well as the execution resources that are consumed by the inactive threads in each pass.
Note: I think we should use __syncthreads() before 'C'.
In Pascal and prior architectures, these passes are executed sequentially (i.e., one pass is executed to completion before the next pass starts), but from the Volta architecture onwards, the passes may be executed concurrently. This feature is called independent thread scheduling.
One can determine whether a control construct can result in thread divergence by inspecting its decision condition. If the decision condition is based on threadIdx values, the control statement can potentially cause thread divergence. Similarly, a loop can cause thread divergence if its loop condition is based on thread index values.
A prevalent reason for using a control construct with thread control divergence is handling boundary conditions when mapping threads to data.
Consider the example of vector addition: the performance impact of control divergence decreases as the vector size increases. Even for a two-dimensional example such as color-to-grayscale conversion, the performance impact of control divergence decreases as the number of pixels in the horizontal direction increases.
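A sketch of this effect with made-up sizes: for n = 1000 and 256-thread blocks, the grid has 4 blocks = 1024 threads = 32 warps. Only the warp covering threads 992..1023 straddles the boundary, so only 1 of the 32 warps diverges; as n grows, that single divergent warp becomes an ever smaller fraction of the total work.

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                // divergence occurs only in the warp that straddles n
        c[i] = a[i] + b[i];   // threads with i >= n are inactive during this pass
}
```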
An important implication of control divergence is that one cannot assume that all threads in a warp have the same execution timing. Therefore if all threads in a warp must complete a phase of their execution before any of them can move on, one must use a barrier synchronization mechanism such as __syncwarp() to ensure correctness.
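A minimal sketch of using __syncwarp() for this, assuming for simplicity a block that consists of exactly one 32-thread warp:

```cuda
__global__ void warpNeighborSum(const float *in, float *out) {
    __shared__ float buf[32];
    int lane = threadIdx.x;                         // 0..31: one warp per block assumed

    buf[lane] = in[lane];                           // each lane writes its own slot
    __syncwarp();                                   // all lanes finish writing before any read

    out[lane] = buf[lane] + buf[(lane + 1) % 32];   // read a slot written by another lane
}
```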
Warp scheduling and latency tolerance
Each SM has a pool of warps assigned to it, and the SM switches between these warps to hide the latency of long operations.
In earlier GPU designs, each SM can execute only one instruction for a single warp at any given instant(Single instruction for one warp). In more recent designs, each SM can execute instructions for a small number of warps at any given point in time(Multiple instructions for multiple warps).
The mechanism of filling the latency time of operations from some threads with work from other threads is often called “latency tolerance” or “latency hiding”.
Note that warp scheduling is also used for tolerating other types of operation latencies, such as pipelined floating-point arithmetic and branch instructions.
The selection of warps that are ready for execution does not introduce any idle or wasted time into the execution timeline, which is referred to as zero-overhead thread scheduling.
This ability to tolerate long operation latencies is the main reason why GPUs do not dedicate nearly as much chip area to cache memories and branch prediction mechanisms as CPUs do. As a result, GPUs can dedicate more chip area to floating-point execution and memory access channel resources.
Threads, Context Switching and Zero Overhead Scheduling
A thread in modern computers is a program together with the state of executing that program on a von Neumann processor.
Heart of the von Neumann model: In a computer based on the von Neumann model, the code of the program is stored in the memory. The PC keeps track of the address of the instruction of the program that is being executed. The IR holds the instruction that is being executed. The register and memory hold the values of the variables and data structures.
CPUs do context switching by carefully saving and restoring the PC value and the contents of registers and memory. This saving and restoring the register contents during context switching can incur a significant overhead in terms of added execution time.
Zero-overhead scheduling refers to the GPU’s ability to put a warp that needs to wait for a long-latency instruction result to sleep and activate a warp that is ready to go without introducing any extra idle cycles in the processing units. Traditional CPUs incur such idle cycles because switching the execution from one thread to another requires saving the execution state (such as register contents of the out-going thread) to memory and loading the execution state of the incoming thread from memory.
GPU SMs achieve zero-overhead scheduling by holding all the execution states for the assigned warps in hardware registers, so there is no need to save and restore state when switching from one warp to another. Because of this, every warp that is assigned to an SM has registers allocated to it.
This is also why the number of warps that can be resident on an SM is limited by register usage.
- For latency tolerance to be effective, it is desirable for an SM to have many more threads assigned to it than can be simultaneously supported with its execution resources to maximize the chance of finding a warp that is ready to execute at any point in time. For example, in an Ampere A100 GPU, an SM has 64 cores but can have up to 2048 threads assigned to it at the same time. Thus the SM can have up to 32 times more threads assigned to it than its cores can support at any given clock cycle.
Resource partitioning and Occupancy
The ratio of the number of warps(or threads) assigned to an SM to the maximum number it supports is referred to as Occupancy.
The execution resources in an SM include registers, shared memory, thread block slots, and thread slots. Thread block slots and thread slots are tracked separately. The separation is important because:
- Resource Granularity:
- Thread blocks need their own management because they share resources (like shared memory)
- Individual threads need separate management for their private resources (like registers)
- Scheduling Flexibility:
- Having separate slots allows the SM to optimize occupancy based on both block-level and thread-level resources
- Some applications might benefit from fewer blocks with more threads, others from more blocks with fewer threads
- Hardware Implementation:
- Block management requires different hardware structures than thread management
- The separation allows for more efficient hardware implementation of the scheduling logic
These resources are dynamically partitioned across threads to support their execution. For example, the Ampere A100 GPU supports a maximum of:
- 2 to 32 blocks per SM
- 64 warps (2048 threads) per SM
- 64 to 1024 threads per block
If a grid is launched with a block size of 1024 threads, the 2048 thread slots in each SM are partitioned and assigned to 2 blocks, so each SM can hold at most 2 such blocks at a time.
The most important decision when optimizing for occupancy is the choice of thread block size.
This ability to dynamically partition thread slots among blocks makes SMs versatile. They can either execute many blocks each having few threads or execute few blocks each having many threads. This dynamic partitioning can be contrasted with a fixed partitioning method in which each block would receive a fixed amount of resources regardless of its real needs. Fixed partitioning results in wasted thread slots when a block requires fewer threads than the fixed partition supports and fails to support blocks that require more thread slots than that.
Another situation that could negatively affect occupancy occurs when the maximum number of threads per SM is not divisible by the block size. On the A100, if we pick a block size such as 768, the SM can hold at most 2 blocks (since 768 × 3 > 2048), so occupancy = (2 × 768) / 2048 = 75%.
The preceding discussion does not consider the impact of other resource constraints, such as registers and shared memory. For instance, automatic variables declared in a CUDA kernel are placed into registers. Some kernels may use many automatic variables, and others may use few of them. Therefore one should expect that some kernels require many registers per thread and some require few. By dynamically partitioning registers in an SM across threads, the SM can accommodate many blocks if they require few registers per thread and fewer blocks if they require more registers per thread.
For example, the Ampere A100 allows a maximum of 65,536 registers per SM. To run at full occupancy, each SM needs enough registers for 2048 threads, which means each thread should use no more than 65,536 / 2048 = 32 registers. So, if a kernel uses 64 registers per thread, the maximum number of threads that can be supported with 65,536 registers is 1024 threads. In this case, the kernel cannot run with full occupancy regardless of what the block size is set to be; instead, the occupancy will be at most 50%. In some cases, the compiler may perform register spilling to reduce the register requirement per thread and thus elevate the level of occupancy. However, this is typically at the cost of increased execution time for the threads to access the spilled register values from memory and may cause the total execution time of the grid to increase.
Register spilling: the compiler moves (spills) some variables from registers to memory (local memory, which is typically cached in L1). When these variables are needed, they must be loaded from memory into registers and written back after modification, which adds overhead. Spilling can increase occupancy because it allows more threads to run in parallel despite there not being enough registers for all of them.
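Two real knobs for trading registers per thread against occupancy are the __launch_bounds__() qualifier and the -maxrregcount compiler flag; the kernel and numbers below are purely illustrative:

```cuda
// __launch_bounds__ tells the compiler the maximum block size this kernel will be
// launched with (and, optionally, a desired minimum number of resident blocks per SM),
// letting it cap register usage and spill if necessary.
__global__ void __launch_bounds__(256, 8)       // <= 256 threads/block, aim for 8 blocks/SM
scaleAndShift(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Alternatively, cap registers for the whole compilation unit and print actual usage:
//   nvcc -maxrregcount=32 --ptxas-options=-v kernel.cu
```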
How occupancy changes with a slight change in the resource usage of a kernel:
Assume that a programmer implements a kernel that uses 31 registers per thread and configures it with 512 threads per block. In this case, the SM will have (2048 threads)/(512 threads/block) = 4 blocks running simultaneously. These threads will use a total of (2048 threads) × (31 registers/thread) = 63,488 registers, which is less than the 65,536 register limit. Now assume that the programmer declares another two automatic variables in the kernel, bumping the number of registers used by each thread to 33. The number of registers required by 2048 threads is now 67,584 registers, which exceeds the register limit. The CUDA runtime system may deal with this situation by assigning only 3 blocks to each SM instead of 4, thus reducing the number of registers required to 50,688 registers. However, this reduces the number of threads running on an SM from 2048 to 1536; that is, by using two extra automatic variables, the program saw a reduction in occupancy from 100% to 75%. This is sometimes referred to as a “performance cliff,” in which a slight increase in resource usage can result in a significant reduction in parallelism and performance achieved.
- Accurately determining the number of threads running in each SM can be difficult; the CUDA Occupancy Calculator can be used for this. The Excel-based Occupancy Calculator is deprecated; the Occupancy Calculator is now available in Nsight Compute. Please refer to the Nsight Compute Occupancy Calculator documentation for more details on usage.
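For a programmatic check, the CUDA runtime also exposes cudaOccupancyMaxActiveBlocksPerMultiprocessor; the sketch below (with a made-up kernel and block size) uses it together with cudaDeviceProp to estimate occupancy:

```cuda
#include <cstdio>

__global__ void occKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    int blockSize = 256;
    int maxBlocksPerSM = 0;
    // How many blocks of this size can be resident on one SM for this kernel?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, occKernel,
                                                  blockSize, 0 /* dynamic shared mem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(maxBlocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Resident blocks/SM: %d, occupancy: %.0f%%\n", maxBlocksPerSM, 100.0f * occupancy);
    return 0;
}
```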
Querying device properties
The amount of resources in each CUDA device SM is specified as part of the compute capability of the device.
In general, the higher the compute capability level, the more resources are available in each SM. The compute capability of GPUs tends to increase from generation to generation. The Ampere A100 GPU has compute capability 8.0.
The CUDA runtime system (device driver) has an API function cudaGetDeviceCount that returns the number of available CUDA devices in the system.
int devCount;
cudaGetDeviceCount(&devCount);
Many PC systems come with one or more “integrated” GPUs (iGPUs). These GPUs are the default graphics units and provide rudimentary capabilities and hardware resources to perform minimal graphics functionalities for modern window-based user interfaces.
Most CUDA applications will not perform very well on these integrated devices. This would be a reason for the host code to iterate through all the available devices, query their resources and capabilities, and choose the ones that have enough resources to execute the application with satisfactory performance.
The CUDA runtime provides an API function cudaGetDeviceProperties that returns the properties of the device whose number is given as an argument. The built-in type cudaDeviceProp is a C struct type with fields that represent the properties of a CUDA device.
cudaDeviceProp devProp;
for(unsigned int i = 0; i < devCount; i++) {
cudaGetDeviceProperties(&devProp, i);
// Decide if device has sufficient resources/capabilities
}
- Some useful fields in devProp:
    - devProp.maxThreadsPerBlock: maximum number of threads allowed in a block.
    - devProp.multiProcessorCount: number of SMs in the device.
    - devProp.clockRate: clock frequency of the device. The combination of the clock rate and the number of SMs gives a good indication of the maximum hardware execution throughput of the device.
    - devProp.maxThreadsDim[0] (for the x dimension), devProp.maxThreadsDim[1] (for the y dimension), and devProp.maxThreadsDim[2] (for the z dimension): maximum number of threads allowed along each dimension of a block. An example use of this information is for an automated tuning system to set the range of block dimensions when evaluating the best-performing block dimensions for the underlying hardware.
    - devProp.maxGridSize[0] (for the x dimension), devProp.maxGridSize[1] (for the y dimension), and devProp.maxGridSize[2] (for the z dimension): maximum number of blocks allowed along each dimension of a grid. A typical use of this information is to determine whether a grid can have enough threads to handle the entire dataset or whether some kind of iterative approach is needed.
    - devProp.regsPerBlock: number of registers that are available in each SM. This field is useful in determining whether the kernel can achieve maximum occupancy on a particular device or will be limited by its register usage. Note that the name of the field is a little misleading: for most compute capability levels, the maximum number of registers that a block can use is indeed the same as the total number of registers available in the SM, but for some compute capability levels it is less than the total available on the SM.
    - devProp.warpSize: size of a warp (in threads).
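Putting the preceding fields together, a sketch of a device-query loop (the fields printed are the ones discussed above):

```cuda
#include <cstdio>

int main() {
    int devCount = 0;
    cudaGetDeviceCount(&devCount);

    for (int i = 0; i < devCount; i++) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("Device %d: %s (compute capability %d.%d)\n", i, p.name, p.major, p.minor);
        printf("  SMs: %d, clock: %d kHz, warp size: %d\n",
               p.multiProcessorCount, p.clockRate, p.warpSize);
        printf("  Max threads/block: %d, max block dims: %d x %d x %d\n",
               p.maxThreadsPerBlock,
               p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
        printf("  Max grid dims: %d x %d x %d, registers per SM: %d\n",
               p.maxGridSize[0], p.maxGridSize[1], p.maxGridSize[2], p.regsPerBlock);
    }
    return 0;
}
```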
Summary
A GPU is organized into SMs, which consist of multiple processing blocks of cores that share control logic and memory resources. When a grid is launched, its blocks are assigned to SMs in an arbitrary order, resulting in transparent scalability of CUDA applications. The transparent scalability comes with a limitation: threads in different blocks cannot synchronize with each other.
The ratio of the number of threads assigned to the SM to the maximum number of threads it can support is referred to as occupancy. The higher the occupancy of an SM, the better it can hide long-latency operations.
Footnote
¹ Threads in different blocks can perform barrier synchronization through the Cooperative Groups API. However, there are several important restrictions that must be obeyed to ensure that all threads involved are indeed simultaneously executing on the SMs.