Learn HPC: Chapter 3 - Multidimensional Grid and Data
Exercise Code:
Link: CUDA Code for Matrix Vector Multiplication
Multidimensional grid organization
The example code covered throughout this book should familiarize the reader with reasoning about data parallelism.
Each execution configuration parameter has the type `dim3`, which is an integer vector type of three elements: x, y, and z. These three elements specify the sizes of the three dimensions. The programmer can use fewer than three dimensions by setting the size of the unused dimensions to 1. Once the grid has been launched, the grid and block dimensions will remain the same until the entire grid has finished execution.
For convenience, CUDA provides a special shortcut for calling a kernel with one-dimensional (1D) grids and blocks.
The default values of the parameters to the `dim3` constructor are 1.
- When a single value is passed where a `dim3` is expected, that value will be passed to the first parameter of the constructor, while the second and third parameters take the default value of 1. The result is a 1D grid or block in which the size of the x dimension is the value passed and the sizes of the y and z dimensions are 1.
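A minimal runnable sketch of these conventions (the kernel `whoAmI` and the launch sizes are illustrative assumptions, not from the book's code):

```cuda
#include <cstdio>

// Trivial kernel that prints its own coordinates.
__global__ void whoAmI() {
    printf("block (%d,%d,%d), thread (%d,%d,%d)\n",
           blockIdx.z, blockIdx.y, blockIdx.x,
           threadIdx.z, threadIdx.y, threadIdx.x);
}

int main() {
    dim3 dimGrid(32, 1, 1);    // same as dim3 dimGrid(32): y and z default to 1
    dim3 dimBlock(128, 1, 1);  // same as dim3 dimBlock(128)
    whoAmI<<<dimGrid, dimBlock>>>();

    whoAmI<<<32, 128>>>();     // 1D shortcut: plain integers are accepted
    cudaDeviceSynchronize();   // wait for the device printf output
    return 0;
}
```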
In CUDA C, the allowed values of `gridDim.x` range from 1 to 2³¹ − 1,¹ and those of `gridDim.y` and `gridDim.z` range from 1 to 2¹⁶ − 1 (65,535). (Compute capability essentially tells you things like how many threads can exist in one block, how large each dimension of a grid can be, etc.; in short, hardware limitations.)
- Two-dimensional (2D) blocks can be created by setting `blockDim.z` to 1.
The total size of a block in current CUDA systems is limited to 1024 threads. These threads can be distributed across the three dimensions in any way as long as the total number of threads does not exceed 1024.
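For example (these particular block shapes are illustrative, not from the linked code):

```cuda
int main() {
    // Threads may be distributed across x, y, z in any way,
    // as long as x * y * z <= 1024.
    dim3 ok1(512, 1, 1);    //  512 threads, 1D block
    dim3 ok2(16, 16, 1);    //  256 threads, 2D block (z set to 1)
    dim3 ok3(32, 16, 2);    // 1024 threads, 3D block: exactly at the limit
    // dim3 bad(32, 32, 2); // 2048 threads: a kernel launched with this
    //                      // block shape would fail to launch
    return 0;
}
```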
A grid can have higher dimensionality than its blocks and vice versa.
Each block is labeled with (`blockIdx.z`, `blockIdx.y`, `blockIdx.x`). This notation uses an ordering that is the reverse of that used in the C statements for setting configuration parameters, in which the lowest dimension comes first: execution configuration parameters are ordered (x, y, z), while index labels are ordered (z, y, x).
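A small sketch of this ordering (grid and block sizes chosen arbitrarily):

```cuda
int main() {
    dim3 dimGrid(2, 2, 1);   // configuration order (x, y, z): 2*2*1 = 4 blocks
    dim3 dimBlock(4, 2, 2);  // 4*2*2 = 16 threads per block
    // The four blocks are labeled in (blockIdx.z, blockIdx.y, blockIdx.x)
    // order: (0,0,0), (0,0,1), (0,1,0), and (0,1,1).
    return 0;
}
```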
- Typical CUDA grids contain thousands to millions of threads.
Mapping threads to multidimensional arrays
The code written in this section is present below:
Link: CUDA Code for RGB to Grayscale Conversion
The choice of 1D, 2D, or 3D thread organizations is usually based on the nature of the data.
We will refer to the dimensions of multidimensional data in descending order: the z dimension, followed by the y dimension, and so on.
Memory allocated with `malloc` is dynamically allocated memory.
Dynamically Allocated Multidimensional Arrays (the reason why CUDA C doesn't have `A[i][j]` syntax for them):
However, the ANSI C standard on the basis of which CUDA C was developed requires the number of columns in Pin to be known at compile time for Pin to be accessed as a 2D array. This is because arrays in C (as well as CUDA C) are stored in row-major fashion: the number of columns must be known at compile time so the compiler can translate a 2D index into the correct 1D offset and avoid out-of-bounds memory accesses.
As a result, programmers need to explicitly linearize, or “flatten,” a dynamically allocated 2D array into an equivalent 1D array in the current CUDA C.
In reality, all multidimensional arrays in C are linearized. This is due to the use of a “flat” memory space in modern computers.
In the case of statically allocated arrays, the compilers allow the programmers to use higher-dimensional indexing syntax, such as Pin_d[j][i], to access their elements. Under the hood, the compiler linearizes them into an equivalent 1D array and translates the multidimensional indexing syntax into a 1D offset.
In the case of dynamically allocated arrays, the current CUDA C compiler leaves the work of such translation to the programmers, owing to lack of dimensional information at compile time.
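A small sketch of that translation (the helper name `elementAt` is hypothetical; `Pin`, `width`, `j`, and `i` follow the section's example):

```cuda
// The element that Pin[j][i] would name in a height x width row-major
// array lives at the 1D offset j*width + i.
__host__ __device__ float elementAt(const float* Pin, int width, int j, int i) {
    return Pin[j * width + i];
}
```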
Source code of `colorToGrayscaleConversion` with 2D thread mapping to data
As for `Pin`, we need to multiply the gray pixel index by 3 (line 13 of the linked code), since each colored pixel is stored as three elements (r, g, b), each of which is 1 byte.
RGB images are stored in pixel-major (interleaved) format on modern CPUs, which makes sense once you start thinking about cache behavior. Basically, in this example the authors have assumed that this is how the image is stored, and the kernel has to load it accordingly.
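A condensed sketch in the spirit of the linked kernel (so line numbers won't match the listing referenced above); the 0.21/0.71/0.07 luminance weights follow the book's formula:

```cuda
__global__ void colorToGrayscaleConversion(unsigned char* Pout,
                                           unsigned char* Pin,
                                           int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int grayOffset = row * width + col;  // 1D offset of the output pixel
        int rgbOffset  = grayOffset * 3;     // 3 bytes (r, g, b) per input pixel
        unsigned char r = Pin[rgbOffset];
        unsigned char g = Pin[rgbOffset + 1];
        unsigned char b = Pin[rgbOffset + 2];
        Pout[grayOffset] = (unsigned char)(0.21f * r + 0.71f * g + 0.07f * b);
    }
}
```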
Each thread in the grid ultimately maps to a value in the memory and an element in the multidimensional data.
In the case of 3D arrays, the programmer also needs to determine the values of `blockDim.z` and `gridDim.z` when calling a kernel. In the kernel, the array index will involve another global index:
`int plane = blockIdx.z * blockDim.z + threadIdx.z;`
The linearized access to a 3D array `P` will be in the form of `P[plane*m*n + row*n + col]`. A kernel processing the 3D `P` array needs to check whether all three global indices, `plane`, `row`, and `col`, fall within the valid range of the array.
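A minimal sketch of such a bounds-checked 3D access pattern (the kernel name, the `depth` parameter, and the doubling operation are assumptions for illustration):

```cuda
// P holds a depth x m x n array linearized in row-major order.
__global__ void scale3D(float* P, int depth, int m, int n) {
    int col   = blockIdx.x * blockDim.x + threadIdx.x;
    int row   = blockIdx.y * blockDim.y + threadIdx.y;
    int plane = blockIdx.z * blockDim.z + threadIdx.z;
    // All three global indices must be checked against the array bounds.
    if (plane < depth && row < m && col < n) {
        int idx = plane * m * n + row * n + col;  // linearized 3D index
        P[idx] *= 2.0f;                           // placeholder computation
    }
}
```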
Image Blur Kernel
The code written in this section is present below:
Link: CUDA Code for Blur Kernel
In real CUDA C programs, threads often perform complex operations on their data and need to cooperate with each other.
- The computation of weighted sums belongs to the convolution pattern.
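A sketch of a simple averaging blur in the spirit of the linked code (`BLUR_SIZE` and the parameter names are assumptions):

```cuda
#define BLUR_SIZE 1  // patch radius: 1 gives a 3x3 neighborhood

__global__ void blurKernel(unsigned char* in, unsigned char* out,
                           int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0;
        int pixels = 0;
        // Weighted sum (here: a plain average) over the surrounding patch,
        // skipping neighbors that fall outside the image.
        for (int bRow = -BLUR_SIZE; bRow <= BLUR_SIZE; ++bRow) {
            for (int bCol = -BLUR_SIZE; bCol <= BLUR_SIZE; ++bCol) {
                int curRow = row + bRow;
                int curCol = col + bCol;
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    ++pixels;
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels);
    }
}
```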
Matrix Multiplication
The code written in this section is present below:
Link: CUDA Code for Matrix Matrix Multiplication
In the Basic Linear Algebra Subprograms (BLAS)², a de facto standard for publishing libraries that perform basic algebra operations, there are three levels of linear algebra functions. As the level increases, the number of operations performed by the function increases.
- Level 1 functions perform vector operations of the form y = αx + y, where x and y are vectors and α is a scalar. Our vector addition example is a special case of a level 1 function with α = 1.
- Level 2 functions perform matrix-vector operations of the form y = αAx + βy, where A is a matrix, x and y are vectors, and α and β are scalars.
- Level 3 functions perform matrix-matrix operations of the form C = αAB + βC, where A, B, and C are matrices and α and β are scalars. Our matrix-matrix multiplication example is a special case of a level 3 function with α = 1 and β = 0.
In the parallel version of matrix multiplication, the thread-to-data mapping effectively divides P (the output matrix) into tiles. Each block is responsible for calculating one of these tiles, and all tiles/blocks are processed simultaneously.
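A sketch of the square-matrix kernel this section discusses (parameter names follow the book's usual M, N, P convention; treat it as an illustration of the linked code, not a verbatim copy):

```cuda
__global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        // Each thread computes one P element: the dot product of one
        // row of M and one column of N.
        float Pvalue = 0.0f;
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[row * Width + k] * N[k * Width + col];
        }
        P[row * Width + col] = Pvalue;
    }
}
```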
Since the size of a grid is limited by the maximum number of blocks per grid and threads per block, the size of the largest output matrix P that can be handled by `matrixMulKernel` will also be limited by these constraints. When output matrices larger than this limit are to be computed, one can divide the output matrix into submatrices whose sizes can be covered by a grid and use the host code to launch a different grid for each submatrix. Alternatively, we can change the kernel code so that each thread calculates more P elements.
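A hedged sketch of the second option, where each thread computes two P elements in adjacent columns, halving the required grid width (the kernel name and the factor of 2 are assumptions):

```cuda
__global__ void matrixMulKernel2(float* M, float* N, float* P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = (blockIdx.x * blockDim.x + threadIdx.x) * 2;  // first of two columns
    if (row < Width) {
        // Each thread produces P[row][col] and P[row][col + 1],
        // skipping any column that falls outside the matrix.
        for (int c = col; c < col + 2 && c < Width; ++c) {
            float Pvalue = 0.0f;
            for (int k = 0; k < Width; ++k) {
                Pvalue += M[row * Width + k] * N[k * Width + c];
            }
            P[row * Width + c] = Pvalue;
        }
    }
}
```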
Footnotes:
1. CUDA Compute Capability is a version number that defines the features supported by the GPU hardware. Devices with a compute capability of less than 3.0 allow `blockIdx.x` to range from 1 to 2¹⁶ − 1. ↩
2. These BLAS functions are important because they are used as basic building blocks of higher-level algebraic functions, such as linear system solvers and eigenvalue analysis. As we will discuss later, the performance of different implementations of BLAS functions can vary by orders of magnitude in both sequential and parallel computers. ↩