(= ФェФ=) toji

Chapter 2 - Heterogeneous data parallel computing

Link: CUDA Code for Vector Addition

Data Parallelism

CUDA Program Structure

In practice, such a 'transparent' outsourcing model can be very inefficient because of all the copying of data back and forth. It is better to keep large and important data structures on the device and simply invoke device functions on them from the host code. Good parallel code does not move a lot of data between the device and the host.
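As a rough sketch of that pattern (names such as d_A, h_A, and n are illustrative, not from the chapter): allocate on the device once, copy over once, launch the kernels that work on the device-resident data, and copy back only when the host actually needs the result.

    // Keep the data resident in device global memory between kernel launches.
    int n = 1 << 20;
    size_t size = n * sizeof(float);

    float *h_A = (float *)malloc(size);                    // host copy
    float *d_A;
    cudaMalloc((void **)&d_A, size);                       // device allocation
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);    // copy to device once

    // ... launch one or more kernels that read and write d_A here ...

    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);    // copy back only at the end
    cudaFree(d_A);
    free(h_A);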

Device global memory and data transfer

Parallel Computing Models

  1. SIMD (Single Instruction, Multiple Data):

    • Definition: One instruction operates on multiple data elements simultaneously
    • Use Case: Vector operations, image processing
    • Pros: Efficient for data-parallel tasks
    • Cons: Limited to tasks with uniform operations across data
  2. SPMD (Single Program, Multiple Data):

    • Definition: The same program runs on multiple processors, each operating on a different part of the data
    • Use Case: Distributed computing, multi-core processing
    • Pros: Flexible for various programming tasks
    • Cons: May have higher overhead for task distribution
  3. SIMT (Single Instruction, Multiple Threads):

    • Definition: A hybrid model combining aspects of SIMD and SPMD
    • Use Case: GPU computing
    • Characteristics:
      1. Threads are grouped into warps (usually 32 threads)
      2. Warps execute in a SIMD-like fashion
      3. Different warps can be at different points in the program (SPMD-like)
    • Pros: Balances efficiency and flexibility
    • Cons: Performance penalties for thread divergence within a warp

Choose the model that fits the task at hand.
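As a rough illustration of the SIMT divergence point (a made-up kernel, not from the chapter): threads of the same warp that take different branches are executed one group after the other, so branching on the thread index within a warp costs extra passes.

    __global__ void divergenceExample(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Threads in the same warp take different branches here,
            // so the warp executes both paths one after the other.
            if (threadIdx.x % 2 == 0) {
                data[i] = data[i] * 2.0f;
            } else {
                data[i] = data[i] + 1.0f;
            }
        }
    }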

Kernel Functions and Threading

Note that one can use both __host__ and __device__ in a function declaration. This combination tells the compilation system to generate two versions of object code for the same function. One is executed on the host and can be called only from a host function. The other is executed on the device and can be called only from a device or kernel function. This supports the common use case in which the same function source code can be recompiled to generate a device version.
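A minimal sketch of such a dual-compiled helper (the function and kernel names are made up for illustration):

    // Compiled twice: one object-code version for the host, one for the device.
    __host__ __device__ float clampToUnit(float x) {
        if (x < 0.0f) return 0.0f;
        if (x > 1.0f) return 1.0f;
        return x;
    }

    __global__ void clampKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = clampToUnit(data[i]);   // calls the device version
    }

    // Ordinary host code can call clampToUnit(x) as well, using the host version.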

CUDA Thread block size summary:

NOTE: Optimal block size varies by algorithm and GPU architecture.
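A common starting point, sketched below (256 threads per block is just a typical default, and vecAddKernel, d_A, d_B, d_C, and n are assumed to exist): round the grid size up with a ceiling division so that every element gets a thread.

    int threadsPerBlock = 256;   // a typical default, not a universal optimum
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division
    vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);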

Calling Kernel Functions

Some Practical Stuff

Compilation

Summary

=========================== Chapter Over ===========================

Footnote:

-> Next Chapter

  1. CUDA C also has more advanced library functions for allocating space in the host memory.

  2. cudaMalloc returning a generic object makes the use of dynamically allocated multi-dimensional arrays more complex. The two-parameter format of cudaMalloc allows it to use its return value to report any errors in the same way as other CUDA API functions.
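A sketch of that two-parameter form with the usual error check (d_A and size are illustrative; assumes <stdio.h> and <stdlib.h> are included):

    float *d_A = NULL;
    // First argument: address of the pointer to set; second: size in bytes.
    cudaError_t err = cudaMalloc((void **)&d_A, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }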