Chapter 2 - Heterogeneous data parallel computing
Link: CUDA Code for Vector Addition
Data Parallelism
When modern software applications run slowly, the problem is often that there is too much data to process. Data parallelism refers to the fact that the computation on different pieces of the data can be performed independently of, and therefore concurrently with, one another.
Besides data parallelism, there is also task parallelism, in which an application consisting of multiple independent tasks is parallelized across those tasks. In general, though, data parallelism is the main source of scalability for parallel programs.
CUDA Program Structure
All the threads that are launched by a kernel call are collectively called a Grid.
Many heterogeneous computing apps manage overlapped CPU and GPU execution to take advantage of both CPUs and GPUs.
IMPORTANT POINT: Although individual CPU threads are faster than GPU threads, GPU threads take very few clock cycles to generate and schedule, owing to efficient hardware support. Traditional CPU threads, by contrast, typically take thousands of clock cycles to generate and schedule.
A thread consists of the code of the program, the point in the code that is being executed and the values of its variables and data structures.
One can use a source-level debugger to monitor the progress of a thread by executing one statement at a time, looking at the statement that will be executed next and checking the values of the variables and data structures as the execution progresses.
In CUDA, the execution of each thread is sequential as well.
Important point about pointers in C:
```c
// Suppose A is an array of floats
float *P;     // This is a pointer
P = &(A[0]);
```
The last statement makes P point to the 0th element of A. P[i] becomes a synonym for A[i].
There are normally at least two types of functions in parallel code:
- The stub, which sets up and calls the kernels.
- The kernel function, which uses `threadIdx`, `blockIdx`, and `blockDim`.
The stub ships input data to the device, activates the calculations on the device, and collects the results from the device. **It does this in such a way that the main program does not even need to know that the calculations are happening on the device.**
In practice, such a 'transparent' outsourcing model can be very inefficient because of all the copying of data back and forth. It's better to keep large and important data structures on the device and simply invoke device functions on them from the host code. A good parallel code won't move a lot of data between the device and the host.
Device global memory and data transfer
Devices are often hardware cards that come with their own DRAM, called device global memory. So the global memory is simply the RAM on the GPU. The GPU contains other types of memory as well.
The CUDA runtime system (typically running on the host) provides API functions for memory management on behalf of the programmer.¹
The `cudaMalloc` function allocates a piece of device global memory for an object. It expects a generic pointer as an argument because it is a generic function, not restricted to any particular type of object.² `cudaFree` doesn't need the address of `A_d` because it uses only the value of `A_d` (and not its address) to free the memory. Dereferencing a device global memory pointer in host code can cause exceptions or other types of runtime errors.
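As a sketch of how these calls fit together in a host stub for vector addition (variable names such as `A_h` and `A_d` are illustrative, and error checking is omitted for brevity):

```c
// Sketch of a host stub: allocate device global memory, ship the inputs to the
// device, and collect the result back before freeing the device memory.
void vecAdd(float *A_h, float *B_h, float *C_h, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // Two-parameter format: cudaMalloc writes the device address into the
    // pointer whose address is passed in, and returns an error flag.
    cudaMalloc((void **) &A_d, size);
    cudaMalloc((void **) &B_d, size);
    cudaMalloc((void **) &C_d, size);

    // Copy the input vectors from host memory to device global memory.
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // ... launch the kernel here (see the kernel and launch sketches below) ...

    // Copy the result back to the host and release the device global memory.
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}
```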
Some good practices for writing debuggable CUDA code:
- CUDA API functions return flags that indicate whether an error occurred while serving the request. **Most errors are due to inappropriate argument values used in the call.**
- In practice, we should surround the call with code that tests for the error condition and prints an error message, so that the user is aware that an error has occurred. For instance,
```c
cudaError_t err = cudaMalloc((void **) &A_d, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
```
- This way if the system is out of device memory, the user will be informed about the situation.
- You can also define a C macro to make checking code more concise in the source.
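A hedged sketch of such a macro (the name `CHECK` is arbitrary; it assumes `<stdio.h>` and `<stdlib.h>` are included):

```c
// Wraps a CUDA API call and aborts with a descriptive message on failure.
#define CHECK(call)                                                      \
    {                                                                    \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            printf("%s in %s at line %d\n",                              \
                   cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    }

// Usage:
// CHECK(cudaMalloc((void **) &A_d, size));
```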
Parallel Computing Models
SIMD (Single Instruction, Multiple Data):
- Definition: One instruction operates on multiple data elements simultaneously
- Use Case: vector operations, image processing
- Pros: Efficient for data parallel tasks
- Cons: Limited to tasks with uniform operations across data
SPMD (Single Program, Multiple Data):
- Definition: Same program runs on multiple processors, operating on different parts of data
- Use Case: Distributed computing, multi-core processing
- Pros: Flexible for various programming tasks
- Cons: May have higher overhead for task distribution
SIMT (Single Instruction, Multiple Threads):
- Definition: A hybrid model combining aspects of SIMD and SPMD.
- Use Case: GPU computing
- Characteristics:
  1. Threads are grouped into warps (usually 32 threads)
  2. Warps execute in SIMD-like fashion
  3. Different warps can be at different points in the program (SPMD-like)
- Pros: Balances efficiency and flexibility
- Cons: Performance penalties for thread divergence within warps
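A minimal sketch of divergence within a warp (the kernel name and branch condition are purely illustrative):

```c
// Threads in the same warp take different branches depending on their index,
// so the warp executes both paths one after the other (a divergence penalty).
__global__ void divergentKernel(float *data) {
    int i = threadIdx.x;
    if (i % 2 == 0) {
        data[i] = data[i] * 2.0f;   // even-numbered threads
    } else {
        data[i] = data[i] + 1.0f;   // odd-numbered threads
    }
}
```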
Comparison
- SIMD: Best for uniform data parallel operations
- SPMD: Suitable for complex distributed tasks
- SIMT: Combines benefits, ideal for GPU architectures
Use these models depending on the task at hand.
Kernel Functions and Threading
A kernel function specifies the code to be executed by all threads during a parallel phase.
By default, all functions in a CUDA program are host functions if they do not have any of the CUDA keywords in their declaration.
Note that one can use both `__host__` and `__device__` in a function declaration. This combination tells the compilation system to generate two versions of object code for the same function. One is executed on the host and can be called only from a host function. The other is executed on the device and can be called only from a device or kernel function. This supports a common use case in which the same function source code can be recompiled to generate a device version.
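A small sketch of such a dual-compiled function (the function itself is illustrative, not from the chapter):

```c
// Compiled twice: a host version callable from host code and a device
// version callable from device or kernel code.
__host__ __device__ float square(float x) {
    return x * x;
}
```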
The built-in variables `blockDim`, `threadIdx`, and `blockIdx` are the means for threads to access the hardware registers that provide their identifying coordinates. Different threads will see different values in these variables.
The variable `idx` defined using the built-in variables is called an automatic variable; each thread gets its own private copy of it.
Each thread in the grid corresponds to one iteration of the original loop. This is referred to as loop parallelism, in which the iterations of the original sequential code are executed by threads in parallel.
Remember that the bounding condition `(idx < n)` is quite important. This is because not all vector lengths can be expressed as multiples of the block size.
The smallest efficient thread block dimension is 32 threads. This is called one warp.
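Putting these pieces together, a minimal vector-addition kernel along the lines of the chapter's example (treat the exact code as an illustrative sketch):

```c
// Each thread computes one element of C; idx is an automatic (per-thread) variable.
__global__ void vecAddKernel(float *A, float *B, float *C, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {   // bounding condition: n may not be a multiple of the block size
        C[idx] = A[idx] + B[idx];
    }
}
```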
CUDA thread block size summary:
- Smallest efficient block size: 32 threads (1 warp)
- Threads in a single warp are executed in SIMD fashion. This is why, even if you use a block size smaller than 32, the block will still occupy a full warp of 32 threads.
- Common sizes: 32, 64, 128, 256 (multiples of the warp size)
- Factors to consider while deciding block size: warp utilization, shared memory, registers, GPU capability.
Best Practices:
- Use multiples of 32
- Experiment for optimal size
- Consider problem specifics(parallelism, memory access)
- Use CUDA occupancy calculator
Trade-offs:
- Smaller block size -> Less Occupancy and More Granularity
- Larger block size -> More resource usage and Less Granularity
Optimize for: Max Occupancy and Min Divergence
NOTE: Optimal block size varies by algorithm and GPU architecture.
Calling Kernel Functions
While calling the kernel, the grid and thread block dimensions are set using execution configuration parameters, which appear between `<<<` and `>>>` in the kernel call.
To ensure that we have enough threads in the grid to cover all the vector elements, we use the `ceil` function while calculating the number of blocks. Using 256.0 (instead of just 256) ensures that the division produces a floating-point value, so that the ceiling function can round it up correctly.
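A sketch of such a launch, assuming the `vecAddKernel` and the device pointers from the earlier sketches:

```c
// ceil(n / 256.0) blocks of 256 threads each guarantee at least n threads in the grid.
vecAddKernel<<<ceil(n / 256.0), 256>>>(A_d, B_d, C_d, n);
```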
A small GPU with a small amount of execution resources might execute only 1 or 2 thread blocks in parallel. A larger GPU might execute 64 or 128 blocks in parallel.
The same CUDA code, with the same execution configuration parameters, runs at lower speed on small GPUs and at higher speed on larger GPUs.
Some Practical Stuff
In practice, the overhead of allocating device memory, transferring data to and from the device, and deallocating the device memory will likely make the resulting code slower than the original sequential code.
This is because the amount of calculation done by the kernel is small relative to the amount of data processed or transferred.
Real applications typically have kernels in which much more work is needed relative to the amount of data processed, which makes the additional overhead worthwhile.
Compilation
Once the CUDA C extensions are used in a code, it is no longer acceptable to a traditional C compiler.
NVCC (NVIDIA C Compiler) is the compiler for this code. NVCC separates the CUDA C code into host code and device code, each of which is compiled by the corresponding compiler and run on the corresponding processor.
The host code is compiled with the host's standard C/C++ compiler.
NVCC compiles the device code into virtual binary files called PTX files, which are further compiled into machine code for the actual GPU.
Summary
CUDA C supports heterogeneous parallel computing through the code practices highlighted in the chapter.
Refer to the CUDA Programming Guide for more details on the kernel launch extension and other types of execution configuration parameters.
Note that the CUDA API functions `cudaMalloc`, `cudaFree`, and `cudaMemcpy` are used only in host code.
Architectures other than CUDA differ only in terms of syntax; the key parallel computing concepts remain the same.
=========================== Chapter Over ===========================
Footnotes:
1. CUDA C also has more advanced library functions for allocating space in the host memory.
2. `cudaMalloc` returning a generic object makes the use of dynamically allocated multidimensional arrays more complex. The two-parameter format of `cudaMalloc` allows it to use the return value to report any errors in the same way as other CUDA API functions.