Lecture 8 - CUDA Problems

When Pantoja teaches CUDA, there's one big question that always gets asked: how many threads should I create?

Recall from Lecture 7 - Ending OpenMP, Starting CUDA that we should make as many as we need. But how many do we need? And once we have that number, how do we group threads into teams and partition the work best?

Recall as a reference:

![[2.1IntroToGPU-Cuda.pdf#page=10]]

The CUDA programming paradigm gives you three built-in API variables (with crappy names): blockIdx, blockDim, and threadIdx.

Why do we organize threads this way?

If we do this, then we can easily organize our thread IDs such that each dimension gets a thread (or really, a set of threads).

When we make the kernel call with the <<<dim of grid, dim of block>>> part, this is really <<< # of blocks, # of threads per block >>>.

There's a CUDA dim3 type that allows you to set these dimensions at runtime, but each component has to be an (unsigned) int! And the compiler will not tell you if the launch didn't actually work.
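As a minimal sketch (the kernel name myKernel and the dimensions here are made up), you set the dimensions with dim3 and then explicitly ask CUDA whether the launch worked:

#include <cstdio>

__global__ void myKernel() { /* placeholder: does nothing */ }

int main()
{
	// A 2D configuration: a 16x16 grid of blocks, each block 8x8 threads.
	// Any dimension you don't set defaults to 1.
	dim3 numBlocks(16, 16);
	dim3 threadsPerBlock(8, 8);

	myKernel<<< numBlocks, threadsPerBlock >>>();

	// The launch won't report a bad configuration on its own; ask CUDA explicitly.
	cudaError_t err = cudaGetLastError();
	if (err != cudaSuccess)
		printf("launch failed: %s\n", cudaGetErrorString(err));
	return 0;
}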

If you take anything from this section, it's to call:

int id_x = (blockIdx.x * blockDim.x) + threadIdx.x;
int id_y = (blockIdx.y * blockDim.y) + threadIdx.y;
int id_z = (blockIdx.z * blockDim.z) + threadIdx.z;

to get your corresponding id as a 3D point.
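As a quick sanity check with made-up numbers: if blockDim.x = 256, then the thread with threadIdx.x = 5 in block blockIdx.x = 2 gets id_x = (2 * 256) + 5 = 517.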

Calling CUDA

Consider the following example:

// CPU implementation of vector addition.
void sum(int *A, int *B, int *C, int N)
{
	for(int i = 0; i < N; i++)
	{
		C[i] = A[i] + B[i];
	}
}

Notice that if N is very small, then maybe we shouldn't put this on the GPU at all due to the overhead costs. But if N is large, how many threads should we run? As many as possible? The better question is: how little work can a single thread do? In this case the smallest unit of work is one addition per thread, so the number of threads should be N.

Notice that the thread organization mirrors the memory hierarchy: threads within a block share the fast on-chip (shared) memory, while all blocks read and write the slower global memory.

So the idea is to distribute your threads as a grid of blocks, with each block handling a chunk of the data.

As such, in CUDA an implementation using N threads would be:

#include <cmath>
// CUDA implementation: called a CUDA Kernel
__global__ void vecAdd(int *A, int *B, int *C, int N)
{
	int t_id = threadIdx.x + blockDim.x * blockIdx.x;
	C[t_id] = A[t_id] + B[t_id];
}

int main()
{
	// ... Assume N is set and d_A, d_B, d_C were sent to the GPU correctly ...

	// Run a grid of ceil(N/256) blocks of 256 threads each.
	// The ceiling ensures we have enough blocks to cover all N elements
	// (and at least one block when N < 256).
	vecAdd<<< (int)std::ceil(N / 256.0), 256 >>>(d_A, d_B, d_C, N);
	// alternatively, to avoid including <cmath>, you can do:
	vecAdd<<< N/256 + 1, 256 >>>(d_A, d_B, d_C, N);
}

Notice that as an example, if N = 512 then our code is good: we make 2 blocks with 256 threads each, a total of 512 threads (perfect). But if N = 513 then our code generates 3 blocks with 256 threads each, a total of 768 threads. Any thread with t_id >= 513 will index out of bounds and crash (the GPU equivalent of a SEG_FAULT: an illegal memory access)! As such, we need an if:

// ...
if (t_id < N)
	C[t_id] = A[t_id] + B[t_id];
// ...
Note

As an alternative to std::ceil(N / 256.0), you can use N/256 + 1 instead. With the bounds check in place it's functionally equivalent; it just launches one extra (idle) block when N is a multiple of 256.
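Putting the pieces together, the kernel from above with the guard added looks like:

// CUDA kernel with bounds check: leftover threads in the last block do nothing.
__global__ void vecAdd(int *A, int *B, int *C, int N)
{
	int t_id = threadIdx.x + blockDim.x * blockIdx.x;
	if (t_id < N)
		C[t_id] = A[t_id] + B[t_id];
}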

cudaMalloc()

On the CPU alone we would just use malloc() to create the memory space for the vectors A, B, C above. In CUDA, we additionally have to allocate memory on the device and move our data from the CPU (host) to the GPU (device):

// allocate host (CPU) memory
int *h_A = ..., *h_B = ...; ...
// allocate device (GPU) memory
int *d_A, *d_B, *d_C;

cudaMalloc( (void**) &d_A, N * sizeof(int));
cudaMalloc( (void**) &d_B, N * sizeof(int));
cudaMalloc( (void**) &d_C, N * sizeof(int));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(int), cudaMemcpyHostToDevice);

// do our operation
vecAdd<<< (N + 255)/256, 256 >>>(d_A, d_B, d_C, N);

// copy result back to host memory
cudaMemcpy( h_C, d_C, N * sizeof(int), cudaMemcpyDeviceToHost);

// free device (GPU) memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

Often, this memory transfer is the biggest bottleneck: the operations themselves are so fast that moving the data between host and device is the slowest part of the process.
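If you want to see this for yourself, a rough sketch (reusing the h_A/d_A buffers and the vecAdd launch from above) is to time the copy and the kernel separately with CUDA events:

// Timing sketch with CUDA events; assumes h_A, d_A, d_B, d_C, and N from above.
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// time the host -> device copy
cudaEventRecord(start);
cudaMemcpy(d_A, h_A, N * sizeof(int), cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("H2D copy: %f ms\n", ms);

// time the kernel itself
cudaEventRecord(start);
vecAdd<<< (N + 255)/256, 256 >>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("kernel:   %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

For a tiny operation like vector addition, the copy time will usually dwarf the kernel time.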

Other Examples

This then easily extends to 2D matrix addition:

![[2.1IntroToGPU-Cuda.pdf#page=21]]
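As a sketch of what that kernel could look like (the names matAdd, width, and height are made up here, and the matrices are assumed to be flattened row-major):

// 2D matrix addition: each thread handles one (row, col) element.
__global__ void matAdd(int *A, int *B, int *C, int width, int height)
{
	int col = (blockIdx.x * blockDim.x) + threadIdx.x;
	int row = (blockIdx.y * blockDim.y) + threadIdx.y;

	if (col < width && row < height)
	{
		int idx = row * width + col;   // row-major flattened index
		C[idx] = A[idx] + B[idx];
	}
}

// launch: 16x16 threads per block, enough blocks to cover the whole matrix
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((width + 15)/16, (height + 15)/16);
matAdd<<< numBlocks, threadsPerBlock >>>(d_A, d_B, d_C, width, height);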