Lecture 8 - CUDA Problems

When Pantoja teaches CUDA, there's one big question that always gets asked: how many threads should I create?

Recall from Lecture 7 - Ending OpenMP, Starting CUDA that we should make as many as we need. But how many do we need? And once we have that number, how do we group threads into teams and partition the work best?

Recall as a reference:

![[2.1IntroToGPU-Cuda.pdf#page=10]]

The CUDA programming paradigm gives you three built-in API variables (with crappy names): blockIdx, blockDim, and threadIdx.

Why do we organize threads this way?

If we do this, then we can easily organize our thread IDs such that each dimension gets a thread (or really, a set of threads).

When we make the kernel call with the <<<dim of grid, dim of block>>> part, this is really <<< # of blocks, # of threads per block >>>.

There's a CUDA dim3 type that allows you to set these dimensions at runtime, but each component has to be an (unsigned) int! And the compiler will not tell you if the launch didn't actually work.
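As a minimal sketch (the kernel name myKernel and the dimensions here are made up), you set the dimensions with dim3 and then explicitly ask CUDA whether the launch worked:

#include <cstdio>

__global__ void myKernel() { /* placeholder: does nothing */ }

int main()
{
	// A 2D configuration: a 16x16 grid of blocks, each block 8x8 threads.
	// Any dimension you don't set defaults to 1.
	dim3 numBlocks(16, 16);
	dim3 threadsPerBlock(8, 8);

	myKernel<<< numBlocks, threadsPerBlock >>>();

	// The launch won't report a bad configuration on its own; ask CUDA explicitly.
	cudaError_t err = cudaGetLastError();
	if (err != cudaSuccess)
		printf("launch failed: %s\n", cudaGetErrorString(err));
	return 0;
}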

If you take anything from this section, it's to call:

int id_x = (blockIdx.x * blockDim.x) + threadIdx.x;
int id_y = (blockIdx.y * blockDim.y) + threadIdx.y;
int id_z = (blockIdx.z * blockDim.z) + threadIdx.z;

to get your corresponding id as a 3D point.
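As a quick sanity check with made-up numbers: if blockDim.x = 256, then the thread with threadIdx.x = 5 in block blockIdx.x = 2 gets id_x = (2 * 256) + 5 = 517.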

Calling CUDA

Consider the following example:

// CPU implementation of vector addition.
void sum(int *A, int *B, int *C, int N)
{
	for(int i = 0; i < N; i++)
	{
		C[i] = A[i] + B[i];
	}
}

Notice that if N is very small, then maybe we shouldn't put this on the GPU at all due to the overhead costs. But if N is large, how many threads should we run? As many as possible? The better question is: how little work can a single thread do? In this case the smallest unit of work is one addition per thread, so the number of threads should be N.

Notice that the thread organization mirrors the memory hierarchy: threads within a block share the fast on-chip (shared) memory, while all blocks read and write the slower global memory.

So the idea is to distribute your threads as a grid of blocks, with each block handling a chunk of the data.

As such, in CUDA an implementation using N threads would be:

#include <cmath>
// CUDA implementation: called a CUDA Kernel
__global__ void vecAdd(int *A, int *B, int *C, int N)
{
	int t_id = threadIdx.x + blockDim.x * blockIdx.x;
	C[t_id] = A[t_id] + B[t_id];
}

int main()
{
	// ... Assume N is set and d_A, d_B, d_C were sent to the GPU correctly ...

	// Run a grid of ceil(N/256) blocks of 256 threads each.
	// The ceiling ensures we have enough blocks to cover all N elements
	// (and at least one block when N < 256).
	vecAdd<<< (int)std::ceil(N / 256.0), 256 >>>(d_A, d_B, d_C, N);
	// alternatively, to avoid including <cmath>, you can do:
	vecAdd<<< N/256 + 1, 256 >>>(d_A, d_B, d_C, N);
}

Notice that as an example, if N = 512 then our code is good: we make 2 blocks with 256 threads each, a total of 512 threads (perfect). But if N = 513 then our code generates 3 blocks with 256 threads each, a total of 768 threads. Any thread with t_id >= 513 will index out of bounds and crash (the GPU equivalent of a SEG_FAULT: an illegal memory access)! As such, we need an if:

// ...
if (t_id < N)
	C[t_id] = A[t_id] + B[t_id];
// ...
Note

As an alternative to std::ceil(N / 256.0), you can use N/256 + 1 instead. With the bounds check in place it's functionally equivalent; it just launches one extra (idle) block when N is a multiple of 256.
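Putting the pieces together, the kernel from above with the guard added looks like:

// CUDA kernel with bounds check: leftover threads in the last block do nothing.
__global__ void vecAdd(int *A, int *B, int *C, int N)
{
	int t_id = threadIdx.x + blockDim.x * blockIdx.x;
	if (t_id < N)
		C[t_id] = A[t_id] + B[t_id];
}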

cudaMalloc()

On the CPU alone we would just use malloc() to create the memory space for the vectors A, B, C above. In CUDA, we additionally have to allocate memory on the device and move our data from the CPU (host) to the GPU (device):

// allocate host (CPU) memory
int *h_A = ..., *h_B = ...; ...
// allocate device (GPU) memory
int *d_A, *d_B, *d_C;

cudaMalloc( (void**) &d_A, N * sizeof(int));
cudaMalloc( (void**) &d_B, N * sizeof(int));
cudaMalloc( (void**) &d_C, N * sizeof(int));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(int), cudaMemcpyHostToDevice);

// do our operation
vecAdd<<< (N + 255)/256, 256 >>>(d_A, d_B, d_C, N);

// copy result back to host memory
cudaMemcpy( h_C, d_C, N * sizeof(int), cudaMemcpyDeviceToHost);

// free device (GPU) memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

Often, this memory transfer is the biggest bottleneck: the operations themselves are so fast that moving the data between host and device is the slowest part of the process.
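If you want to see this for yourself, a rough sketch (reusing the h_A/d_A buffers and the vecAdd launch from above) is to time the copy and the kernel separately with CUDA events:

// Timing sketch with CUDA events; assumes h_A, d_A, d_B, d_C, and N from above.
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// time the host -> device copy
cudaEventRecord(start);
cudaMemcpy(d_A, h_A, N * sizeof(int), cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("H2D copy: %f ms\n", ms);

// time the kernel itself
cudaEventRecord(start);
vecAdd<<< (N + 255)/256, 256 >>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("kernel:   %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

For a tiny operation like vector addition, the copy time will usually dwarf the kernel time.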

Other Examples

This then easily extends to 2D matrix addition:

![[2.1IntroToGPU-Cuda.pdf#page=21]]
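As a sketch of what that kernel could look like (the names matAdd, width, and height are made up here, and the matrices are assumed to be flattened row-major):

// 2D matrix addition: each thread handles one (row, col) element.
__global__ void matAdd(int *A, int *B, int *C, int width, int height)
{
	int col = (blockIdx.x * blockDim.x) + threadIdx.x;
	int row = (blockIdx.y * blockDim.y) + threadIdx.y;

	if (col < width && row < height)
	{
		int idx = row * width + col;   // row-major flattened index
		C[idx] = A[idx] + B[idx];
	}
}

// launch: 16x16 threads per block, enough blocks to cover the whole matrix
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((width + 15)/16, (height + 15)/16);
matAdd<<< numBlocks, threadsPerBlock >>>(d_A, d_B, d_C, width, height);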