
Learn HPC: Chapter 2 - Heterogeneous data parallel computing

The code written in this chapter is included below:

Data Parallelism

CUDA Program Structure

A GPU can ready its threads faster than a CPU can. CPU threads are individually faster than GPU threads, but they take more time to create and schedule. So a large number of slow GPU threads can be scheduled quickly enough to start work on a parallel program sooner than a small number of fast CPU threads. This is the main advantage of GPUs over CPUs.

In practice, such a 'transparent' outsourcing model can be very inefficient because of all the copying of data back and forth. It is better to keep large and important data structures on the device and simply invoke device functions on them from the host code. Good parallel code does not move a lot of data between the device and the host.
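As a minimal sketch of that idea (the kernel names and the scale/offset operations are made up for illustration), the vector below is copied to the device once, two kernels operate on it back to back, and only the final result comes back to the host:

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;                               // scale in place on the device
}

__global__ void offsetKernel(float *x, float c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += c;                               // add a constant in place
}

void process(float *h_x, int n) {
    float *d_x;
    size_t size = n * sizeof(float);
    cudaMalloc((void **)&d_x, size);
    cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice); // one copy in

    int block = 256;
    int grid = (n + block - 1) / block;
    scaleKernel<<<grid, block>>>(d_x, 2.0f, n);         // data stays resident on the device
    offsetKernel<<<grid, block>>>(d_x, 1.0f, n);        // no round trip between the kernels

    cudaMemcpy(h_x, d_x, size, cudaMemcpyDeviceToHost); // one copy out
    cudaFree(d_x);
}
```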

Device global memory and data transfer

Parallel Computing Models

  1. SIMD (Single Instruction, Multiple Data):

    • Definition: One instruction operates on multiple data elements simultaneously
    • Use Case: Vector operations, image processing
    • Pros: Efficient for data-parallel tasks
    • Cons: Limited to tasks with uniform operations across data
  2. SPMD (Single Program, Multiple Data):

    • Definition: The same program runs on multiple processors, each operating on a different part of the data
    • Use Case: Distributed computing, multi-core processing
    • Pros: Flexible for various programming tasks
    • Cons: May have higher overhead for task distribution
  3. SIMT (Single Instruction, Multiple Threads):

    • Definition: Hybrid model combining aspects of SIMD and SPMD
    • Use Case: GPU computing
    • Characteristics:
      1. Threads are grouped into warps (usually 32 threads)
      2. Warps execute in a SIMD-like fashion
      3. Different warps can be at different points in the program (SPMD-like)
    • Pros: Balances efficiency and flexibility
    • Cons: Performance penalties for thread divergence within a warp (see the sketch below)

Use these models depending on the task at hand.
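To make the divergence point concrete, here is a small illustrative sketch (both kernels are made up for this note): in divergentKernel, even and odd threads of the same warp take different branches, so the warp has to execute both paths; in uniformKernel the branch is uniform across each warp, so no divergence occurs.

```cuda
__global__ void divergentKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) x[i] *= 2.0f;        // even lanes take this path
        else            x[i] += 1.0f;        // odd lanes of the same warp take the other
    }
}

__global__ void uniformKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / 32) % 2 == 0) x[i] *= 2.0f; // all 32 threads of a warp branch the same way
        else                   x[i] += 1.0f;
    }
}
```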

Kernel Functions and Threading

Note that one can use both __host__ and __device__ in a function declaration. This combination tells the compilation system to generate two versions of object code for the same function. One is executed on the host and can be called only from a host function. The other is executed on the device and can be called only from a device or kernel function. This supports the common use case in which the same function source code can be recompiled to generate a device version.
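A minimal sketch of this (clampf and clampKernel are example names, not from the chapter):

```cuda
// Compiled twice by nvcc: one host version, one device version.
__host__ __device__ float clampf(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

__global__ void clampKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = clampf(x[i], 0.0f, 1.0f);   // device-side call from a kernel
}

void hostUse(void) {
    float y = clampf(1.7f, 0.0f, 1.0f);           // host-side call to the same source function
    (void)y;
}
```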

CUDA Thread block size summary:

NOTE: Optimal block size varies by algorithm and GPU architecture.
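For example, here is a hedged sketch of one common convention (initKernel and the value 256 are illustrative, not a recommendation): pick a block size, then derive the grid size with ceiling division so every element is covered even when n is not a multiple of the block size.

```cuda
__global__ void initKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 0.0f;                            // bounds check handles the last partial block
}

void launchInit(float *d_x, int n) {
    int blockSize = 256;                               // a common starting point; tune per kernel and GPU
    int gridSize  = (n + blockSize - 1) / blockSize;   // ceil(n / blockSize) blocks
    initKernel<<<gridSize, blockSize>>>(d_x, n);
}
```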

Calling Kernel Functions

Some Practical Stuff

Compilation

Summary

=========================== Chapter Over ===========================

Footnotes:

-> Next Chapter


  1. CUDA C has more advanced library functions for allocating space in the host memory.

  2. cudaMalloc returning a generic object makes the use of dynamically allocated multi-dimensional arrays more complex. The two-parameter format of cudaMalloc allows it to use the return value to report errors in the same way as other CUDA API functions (see the sketch below).
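A hedged sketch tying both footnotes together (the function name and the error-handling style are illustrative): cudaMallocHost is one of the richer host-side allocation calls (it allocates page-locked host memory), and cudaMalloc's two-parameter form reports its status through the returned cudaError_t, which can be checked like any other CUDA API call.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

float *allocateVectors(int n, float **h_x_out) {
    size_t size = n * sizeof(float);

    float *h_x = NULL;
    cudaError_t err = cudaMallocHost((void **)&h_x, size);   // page-locked (pinned) host memory
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }

    float *d_x = NULL;
    err = cudaMalloc((void **)&d_x, size);                   // status comes back via the return value
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        cudaFreeHost(h_x);
        return NULL;
    }

    *h_x_out = h_x;
    return d_x;
}
```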