Lecture 7 - Ending OpenMP, Starting CUDA
Last time we reviewed a lot about loop dependencies:
![[openMP Loop Dependency.pdf]]
and using OpenMP to send data to the GPU and use it:
![[openMP pragma Target.pdf]]
Note that:
- Loop dependencies can create dependency chains (see the second-to-last slides), so we sometimes parallelize along the diagonal, also called parallelizing in "waves"; see the sketch below.
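A minimal sketch of that wave idea (not the code from the slides), assuming each cell depends on its left and upper neighbours, so all cells on one anti-diagonal are independent:

```cpp
#include <omp.h>
#define N 1024
double a[N][N];

void wavefront(void) {
    // walk the anti-diagonals d = i + j; all cells on one diagonal are independent
    for (int d = 2; d <= 2 * (N - 1); d++) {
        #pragma omp parallel for
        for (int i = 1; i < N; i++) {
            int j = d - i;
            if (j >= 1 && j < N)                      // stay inside the matrix
                a[i][j] = a[i - 1][j] + a[i][j - 1];  // depends only on earlier diagonals
        }
    }
}
```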
Intro to CUDA Programming
We refer to:
![[2.1IntroToGPU-Cuda.pdf]]
But CUDA is not as easy to use as OpenMP:
- The CPU is still required: the GPU doesn't have an OS or anything, so the CPU has to tell the GPU what to do.
To compare:
- The CPU has a large cache, a small number of registers, some (small) SIMD units, and a lot of control logic: data forwarding, branch prediction, ...
- The GPU has a smaller cache, large SIMD units, many registers, threading units, no branch prediction or forwarding (it is throughput oriented), and many, many cores (and thus many more possible threads).
As a result, while ALU operations on the GPU will be very fast and vectorized, cache hits are even more important to consider there.
A CUDA Core is not equivalent to a CPU core. Why? Compare the two:
CPU Core | GPU Core |
---|---|
CPUs have few registers | Have many, many registers (255+) |
Cores are not grouped together | Cores are grouped together into groups of 32; each group executes a warp of 32 threads in lockstep. |
... | 4 groups make up a streaming multiprocessor, for a total of 128 cores. There's shared memory between all of these cores. |
The thing is that one of these streaming multiprocessors is roughly equivalent to a CPU core (even though we are comparing apples to oranges): take the GPU core count and divide by 128 to get the equivalent CPU core count.
All streaming multiprocessors share a global memory, which is the VRAM you see on a GPU. If there are 3072 cores on a GPU, that makes 24 MPs, and with a typical 12 GB of global memory we have a lot of shared data.
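To check these numbers on a particular card, the CUDA runtime can report them; a small sketch using cudaGetDeviceProperties:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Multiprocessors (MPs): %d\n", prop.multiProcessorCount);
    printf("Global memory: %.1f GB\n", prop.totalGlobalMem / 1e9);
    return 0;
}
```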
The Software
The CPU side is the host memory, and we must use cudaMemcpy() to move data to each GPU's device memory (if we have multiple devices). For now, we'll just focus on one GPU device.
CUDA code is compiled with its own compiler (nvcc), not the regular C/C++ compiler:
![[2.1IntroToGPU-Cuda.pdf#page=8]]
Here:
- The cudaMemcpy() call will copy from CPU main memory to GPU global memory.
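A minimal sketch of that flow for a single float array (the names and sizes are made up for illustration):

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;                   // ~1M floats; size is arbitrary
    const size_t bytes = n * sizeof(float);

    float *a_host = (float *)malloc(bytes);  // CPU main (host) memory
    float *a_gpu  = nullptr;
    cudaMalloc((void **)&a_gpu, bytes);      // GPU global (device) memory

    cudaMemcpy(a_gpu, a_host, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
    // ... launch kernels that work on a_gpu here ...
    cudaMemcpy(a_host, a_gpu, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU

    cudaFree(a_gpu);
    free(a_host);
    return 0;
}
```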
Some CUDA is as follows:
```cpp
// __global__ is a CUDA qualifier: this function (a kernel) runs on the GPU only
__global__ void matrix_mul(float *a_gpu, float *b_gpu, float *c_gpu)
{ // ...
}

// called from the CPU, this launches the kernel on the GPU
matrix_mul<<<dim_grid, dim_block>>>(input parameters); // C++
matrix_mul(input parameters); // C
// here the <<<...>>> tells CUDA how many threads to launch:
// dim_grid:  how many blocks
// dim_block: how many threads per block
```
Because we have so many cores, we can really go crazy with threads; this is something that differentiates CUDA from OpenMP. Threads are basically free. Even having something like 100,000 threads is worth it, and usually you need them.
So then how do I decide that number? That's the important part:
![[2.1IntroToGPU-Cuda.pdf#page=10]]
Here:
- Threads are free, but they are created in blocks. To abstract away which data goes to which MP, we specify how many blocks of threads to create (i.e., teams in OpenMP terms) and how many threads per block.
- Blocks are either 1-, 2-, or 3-dimensional, since the data usually has that dimensionality as well. The choice is arbitrary and doesn't strictly have to match the data at hand, but matching it is usually recommended; see the sketch below.
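For example, a sketch of 1D and 2D launch configurations with dim3 (the sizes and the matrix_mul call are just illustrative):

```cpp
// 1D, 2D (and 3D) launch configurations via dim3
int n = 1000000, width = 4096, height = 4096;

dim3 block_1d(256);                                   // 256 threads per block (1D)
dim3 grid_1d((n + 255) / 256);                        // enough blocks to cover n elements

dim3 block_2d(16, 16);                                // 16x16 = 256 threads per block (2D)
dim3 grid_2d((width + 15) / 16, (height + 15) / 16);  // one thread per matrix element

// e.g. matrix_mul<<<grid_2d, block_2d>>>(a_gpu, b_gpu, c_gpu);
```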
![[2.1IntroToGPU-Cuda.pdf#page=11]]
As such, each thread block gets mapped to some MP.
![[2.1IntroToGPU-Cuda.pdf#page=12]]
Here the selected thread id is 3 (0-indexed) and the block id is 2. As such, we get the global thread id via:
blockIdx * blockDim + threadIdx
so the red-selected thread has a global id of 2 * 6 + 3 = 15.
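Inside a kernel, that same calculation looks like this (example_kernel and the doubling are just for illustration):

```cpp
// each thread computes its own global id from its block id and its position in the block
__global__ void example_kernel(float *data)
{
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    data[global_id] *= 2.0f;  // e.g. block 2, thread 3, blockDim 6 -> global id 15
}
```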
The most common way to allocate work is via:
![[2.1IntroToGPU-Cuda.pdf#page=13]]
Notice the absence of a for loop. Here, for matrix addition, you might as well just do one thread per element of the array. Again, threads are free.
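A minimal end-to-end sketch of that one-thread-per-element pattern (the kernel name matrix_add and the sizes are mine, not from the slides):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// one thread per element: each thread adds exactly one pair of values, no loop
__global__ void matrix_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (i < n)                                      // the last block may have extra threads
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;                          // ~1M elements => ~1M threads
    const size_t bytes = n * sizeof(float);

    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes), *c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    float *a_gpu, *b_gpu, *c_gpu;
    cudaMalloc((void **)&a_gpu, bytes);
    cudaMalloc((void **)&b_gpu, bytes);
    cudaMalloc((void **)&c_gpu, bytes);
    cudaMemcpy(a_gpu, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_gpu, b, bytes, cudaMemcpyHostToDevice);

    int block = 256;                                // threads per block
    int grid  = (n + block - 1) / block;            // enough blocks to cover all n elements
    matrix_add<<<grid, block>>>(a_gpu, b_gpu, c_gpu, n);

    cudaMemcpy(c, c_gpu, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);                    // verify the result actually came back

    cudaFree(a_gpu); cudaFree(b_gpu); cudaFree(c_gpu);
    free(a); free(b); free(c);
    return 0;
}
```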
Make sure that the code actually ran! CUDA will not tell you if your kernel failed to execute. Make sure you copy back the output of the run and verify it actually was right.
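One way to catch silent failures is to check the runtime's error state after the launch and after synchronizing; a sketch (reusing the matrix_add launch from the sketch above):

```cpp
// kernel launches are asynchronous and do not report failures on their own
matrix_add<<<grid, block>>>(a_gpu, b_gpu, c_gpu, n);

cudaError_t err = cudaGetLastError();    // did the launch itself fail?
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();           // wait for the kernel; catches execution errors
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```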