Chapter 1 - Introduction
There is a positive (virtuous) cycle in the computer industry: once users become accustomed to the current improvements, they demand more.
GPUs are not the only processors that need special code to be written to improve overall execution speed. Multi-core CPUs need it too: applications have to be written specifically to take advantage of multi-core processing.
Conceptual definition of a thread: the sequence of instruction-execution activities resulting from the sequential, step-by-step execution of a program is called a thread of execution.
According to the authors, parallel programs are the programs that will enjoy significant performance improvements going forward. **The dramatically escalating advantage of parallel programs over sequential programs is termed the *Concurrency Revolution***.
Design Trajectories
The semiconductor industry has settled on two main trajectories for designing multiprocessors:
- Multi-core trajectory
- Many-threads trajectory
In both, the number of cores/threads increases with each generation.
Computationally intensive parts of a program are prioritized for parallelization: when there is more work to do, there is more scope to divide that work among cooperating parallel workers (threads).
Some design differences between a CPU and a GPU
A CPU has a latency-oriented design. There are fewer memory channels between DRAM and the other components like the ALUs and caches, the on-chip caches are larger than a GPU's, and the ALUs are fewer in number.
A GPU has a throughput-oriented design. This design philosophy was shaped by the gaming industry, where AAA games need a massive number of floating-point operations and memory accesses per video frame. That's why GPUs have lots of ALUs and more memory channels between DRAM and the other components, while the caches are smaller.
A GPU must be capable of moving extremely large amounts of data into and out of the graphics frame buffers in its DRAM. This allows higher FPS, which most gamers would kill for.
In contrast, a CPU must satisfy requirements from legacy operating systems, applications, and I/O devices. These present more challenges to supporting parallel memory accesses and thus make it more difficult to increase the throughput of memory accesses, i.e., memory bandwidth.
An individual thread of a CPU is faster than an individual thread of a GPU.
Compute Unified Device Architecture (CUDA) is designed to support joint CPU-GPU execution of an application.
The techniques covered in this book for GPUs also apply to programming tasks for other accelerators.
In most cases, effective management of data delivery can have a major impact on achievable speed of a parallel application.
Amdahl's Law:
The speedup achievable through parallel programming is limited by the portion of the program that cannot be parallelized: no matter how fast the parallel part runs, the sequential part still takes the same time.
Speedup = 1 / ((1 - P) + (P/S))
where P = the fraction of execution time that can be parallelized and S = the speedup of the parallelized portion.
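As a worked illustration of the formula above (the function name and the numbers are my own, purely illustrative):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is
    parallelized and that portion runs s times faster."""
    return 1.0 / ((1.0 - p) + p / s)

# Even if 95% of a program parallelizes and that part speeds up
# 100x, the remaining 5% caps the overall speedup below 20x.
print(amdahl_speedup(0.95, 100))  # ≈ 16.8
```

Note how the serial 5% dominates: pushing S from 100 to infinity only raises the speedup to 1/0.05 = 20x.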
Parallel programs are also bottlenecked by memory bandwidth saturation. A straightforward parallelization might yield only a 10x speedup. Highly sophisticated parallel programs use the different memory types in GPUs intelligently; techniques like tiling can help a lot in some cases.
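The tiling idea can be sketched in plain Python (a CPU-side illustration only, with hypothetical function names; real GPU tiling stages each tile in fast on-chip shared memory, which this sketch only mimics by working block by block):

```python
def matmul_tiled(a, b, tile=2):
    """Blocked (tiled) matrix multiply on nested lists.
    Each tile of A and B is reused across a whole tile of C,
    which is why tiling pays off once a tile fits in fast memory."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile of C rows
        for j0 in range(0, m, tile):      # tile of C columns
            for k0 in range(0, k, tile):  # tile of the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            s += a[i][kk] * b[kk][j]
                        c[i][j] += s
    return c
```

The result is identical to a naive triple loop; only the order of memory accesses changes, trading repeated DRAM traffic for reuse of data already in fast memory.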
Amid all the hype around GPUs, let's not forget the contribution of CPUs. CPUs should also get a fair share of the work, so GPU code should be written to complement CPU execution. Good parallel programs also consider how CPU cores can be used effectively to optimize the whole flow.
Earlier GPU programming approaches (GPGPU, i.e., programming through graphics APIs) could cover only a small portion of the most exciting applications that can benefit greatly from parallel execution. CUDA covers a much larger portion of that pie.
Challenges in Parallel Programming:
- Love this quote:
> Someone once said that if you do not care about performance, parallel programming is very easy.
This is quite an important thing to keep in mind about parallel programs. Many parallel algorithms perform the same amount of work as their sequential counterparts. However, some parallel algorithms do more work than their sequential counterparts.
This suggests that to increase overall throughput, you sometimes have to use algorithms that take a longer route but are more parallelizable. Parallelizing these problems often requires nonintuitive ways of thinking about the problem and may require redundant work during execution.
Whether an application is compute-bound or memory-bound depends on the number of instructions performed per byte of data accessed: compute-bound applications are limited by instruction throughput, while memory-bound applications are limited by memory access speed. Achieving high-performance parallel execution in memory-bound applications often requires methods for improving memory access speed.
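A common way to make this distinction concrete is arithmetic intensity, the ratio of operations to bytes of memory traffic (the function name and the vector-add numbers below are my own illustration):

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of memory traffic: low values suggest a
    memory-bound kernel, high values a compute-bound one."""
    return flops / bytes_moved

# Vector add c[i] = a[i] + b[i] with 4-byte floats:
# 1 FLOP per element, 12 bytes moved (read a, read b, write c),
# so roughly 0.083 FLOP/byte -> firmly memory-bound.
print(arithmetic_intensity(1, 12))
```

A kernel this far below the hardware's compute-to-bandwidth ratio can only be sped up by reducing or speeding up its memory traffic, not by adding ALUs.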
The execution speed of parallel programs is often more sensitive to the input data characteristics than is the case for their sequential counterparts.
Some applications can be parallelized while requiring little collaboration across different threads. These applications are often referred to as embarrassingly parallel.
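A minimal sketch of an embarrassingly parallel workload (function names are mine; threads are used only to keep the sketch portable and runnable; CPU-bound work in practice would use processes or a GPU, since Python threads share the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    # Each chunk is independent of every other chunk, so workers
    # never need to communicate: "embarrassingly parallel".
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # Split the input into roughly equal chunks, one task per chunk.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Only the final sum combines results; all other work is independent.
        return sum(pool.map(sum_of_squares, chunks))

print(parallel_sum_of_squares(list(range(1000))))  # 332833500
```

The only coordination point is the final reduction of per-chunk partial sums; everything before it needs no collaboration across workers.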
NCCL (NVIDIA Collective Communications Library) is used for multi-GPU programming. OpenCL is similar to CUDA but relies more on APIs and less on language extensions.
It will probably take many years to build tools and machines that will enable programmers to develop high-performance code without the knowledge in this book. REMEMBER: parallel programming won't be automated away by AI for some time.
Programmers who have worked on parallel systems in the past know that achieving initial performance is not enough. The challenge is to achieve it in such a way that you can debug the code and support users.
=========================== Chapter Over ===========================