
Learn HPC: Chapter 3 - Multidimensional Grid and Data

Exercise Code:

Link: CUDA Code for Matrix Vector Multiplication

Multidimensional grid organization

  1. When a single value is passed where a dim3 is expected, that value will be passed to the first parameter of the constructor, while the second and third parameters take the default value of 1. The result is a 1D grid or block in which the size of the x dimension is the value passed and the sizes of the y and z dimensions are 1.
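A quick sketch of this equivalence (vecAddKernel and its arguments are hypothetical placeholders, not from the book's code):

```cuda
// Hypothetical 1D kernel, used only to illustrate the launch syntax.
__global__ void vecAddKernel(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void launch(float *A, float *B, float *C, int n) {
    // Passing bare ints: dim3(32) and dim3(128) are constructed with
    // y = z = 1, so these two launches are exactly equivalent 1D setups.
    vecAddKernel<<<32, 128>>>(A, B, C, n);
    vecAddKernel<<<dim3(32, 1, 1), dim3(128, 1, 1)>>>(A, B, C, n);
}
```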

In CUDA C, the allowed values of gridDim.x range from 1 to 2³¹ - 1 (see footnote 1), and those of gridDim.y and gridDim.z range from 1 to 2¹⁶ - 1 (65,535). (Compute capability essentially describes the hardware limits: how many threads can exist in one block, how large each grid dimension can be, and so on.)
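Rather than memorizing these limits, they can be queried at run time. A minimal sketch using the CUDA runtime's cudaGetDeviceProperties, assuming device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the actual hardware limits instead of hard-coding them.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}
```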

The total size of a block in current CUDA systems is limited to 1024 threads. These threads can be distributed across the three dimensions in any way as long as the total number of threads does not exceed 1024.
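For example (a sketch; a launch using blockD would fail with an invalid-configuration error):

```cuda
dim3 blockA(1024, 1, 1);   // 1024 threads total: valid
dim3 blockB(32, 32, 1);    // 1024 threads total: valid
dim3 blockC(8, 16, 8);     // 1024 threads total: valid
dim3 blockD(32, 32, 2);    // 2048 threads total: exceeds the 1024 limit,
                           // any launch with this block would fail
```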

In the execution configuration parameters, dimensions are written in (x, y, z) order, but when indexing into multidimensional data they are consumed in (z, y, x) order (plane, then row, then column).
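A sketch of both orders side by side (scale3D and its parameters are hypothetical):

```cuda
// Hypothetical kernel over a depth x height x width array; doubles each element.
__global__ void scale3D(float *data, int width, int height, int depth) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int z = blockIdx.z * blockDim.z + threadIdx.z;  // plane
    if (x < width && y < height && z < depth) {
        // Indexing consumes the dimensions in (z, y, x) order.
        data[(z * height + y) * width + x] *= 2.0f;
    }
}

// The execution configuration, in contrast, is written in (x, y, z) order:
// dim3 dimBlock(8, 8, 4);                 // x = 8, y = 8, z = 4 (256 threads)
// dim3 dimGrid(W / 8, H / 8, D / 4);
// scale3D<<<dimGrid, dimBlock>>>(data, W, H, D);
```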

Mapping threads to multidimensional arrays

The code written in this section is present below:

Link: CUDA Code for RGB to Grayscale Conversion

Dynamically Allocated Multidimensional Arrays (the reason CUDA C doesn't have A[i][j] syntax): to translate A[i][j] into an address, the compiler must know the number of columns at compile time, but for a dynamically allocated array that value is only known at run time. The programmer therefore linearizes the array into 1D and computes the offset manually.
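A sketch of the manual row-major indexing this implies (scaleMatrix is a hypothetical kernel):

```cuda
// The 2D matrix is linearized into a 1D buffer, so the kernel computes
// row * width + col by hand instead of writing A[row][col].
__global__ void scaleMatrix(float *A, int width, int height, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        A[row * width + col] *= s;   // the equivalent of A[row][col]
    }
}
```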


Source code of colorToGrayscaleConversion with 2D thread mapping to data

As for Pin, we need to multiply the grayscale pixel index by 3 (line 13 of the linked code), since each color pixel is stored as three consecutive 1-byte elements (r, g, b).

RGB images are stored in pixel-major (interleaved) format on modern CPUs: the r, g, and b values of a pixel sit next to each other in memory. This is the natural choice for cache locality, since a pixel's channels are almost always accessed together. The authors assume this layout, so the kernel has to load the data accordingly.
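A sketch consistent with the linked kernel, assuming the book's luminance weights (0.21, 0.71, 0.07):

```cuda
// Each thread converts one pixel: read interleaved (r, g, b), write one gray byte.
__global__ void colorToGrayscaleConversion(unsigned char *Pout,
                                           unsigned char *Pin,
                                           int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int grayOffset = row * width + col;
        int rgbOffset  = grayOffset * 3;   // 3 interleaved channels per pixel
        unsigned char r = Pin[rgbOffset];
        unsigned char g = Pin[rgbOffset + 1];
        unsigned char b = Pin[rgbOffset + 2];
        Pout[grayOffset] = 0.21f * r + 0.71f * g + 0.07f * b;
    }
}
```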

Each thread in the grid ultimately maps to a location in memory, that is, to one element of the multidimensional data.

Image Blur Kernel

The code written in this section is present below:

Link: CUDA Code for Blur Kernel

In real CUDA C programs, threads often perform complex operations on their data and need to cooperate with each other.
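A sketch of a simple mean-blur kernel along the lines of the linked code, assuming BLUR_SIZE is the patch radius (the linked code is the authoritative version):

```cuda
#define BLUR_SIZE 1   // radius 1 -> a 3x3 patch

// Each thread averages the patch around its pixel, skipping neighbors
// that fall outside the image boundary.
__global__ void blurKernel(unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0;
        int pixels = 0;
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr) {
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) {
                int curRow = row + dr;
                int curCol = col + dc;
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    ++pixels;   // count only in-bounds neighbors
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels);
    }
}
```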

Matrix Multiplication

The code written in this section is present below:

Link: CUDA Code for Matrix Matrix Multiplication

The output matrix is partitioned into tiles, and each block is responsible for calculating one of these tiles. All tiles (and therefore all blocks) are processed in parallel.
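A sketch of the basic (untiled) kernel for square Width x Width matrices, in the spirit of the linked code: each thread computes one element of P = M * N, so each block covers one BLOCK_WIDTH x BLOCK_WIDTH tile of P.

```cuda
#define BLOCK_WIDTH 16   // assumed tile edge; 16 x 16 = 256 threads per block

__global__ void matrixMulKernel(float *M, float *N, float *P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float Pvalue = 0.0f;
        // Dot product of row of M and column of N (row-major storage).
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[row * Width + k] * N[k * Width + col];
        }
        P[row * Width + col] = Pvalue;
    }
}

// Host-side launch: one block per tile of P.
// dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
// dim3 dimGrid((Width + BLOCK_WIDTH - 1) / BLOCK_WIDTH,
//              (Width + BLOCK_WIDTH - 1) / BLOCK_WIDTH);
// matrixMulKernel<<<dimGrid, dimBlock>>>(M, N, P, Width);
```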

Footnote:

  1. CUDA Compute Capability is a version number that defines the features supported by the GPU hardware. On devices with a compute capability below 3.0, gridDim.x is limited to the range 1 to 2¹⁶ - 1 (65,535).

  2. These BLAS functions are important because they are used as basic building blocks of higher-level algebraic functions, such as linear system solvers and eigenvalue analysis. As we will discuss later, the performance of different implementations of BLAS functions can vary by orders of magnitude on both sequential and parallel computers.