Lecture 8 - CUDA Problems
When Pantoja teaches CUDA, there's a big question that gets asked: how many threads should I create?
Recall from Lecture 7 - Ending OpenMP, Starting CUDA that you should make as many as you need. But how many do you need? And once you have that number, how do you create teams and partition the work best?
Recall as a reference:
![[2.1IntroToGPU-Cuda.pdf#page=10]]
The CUDA programming paradigm gives you three API variables (with crappy names):
`blockIdx`
: the block ID (see the pictorial reference above). For instance, the block ID of the top-right block in the slide is `(2,0)`: `blockIdx.x` in this case is `2`, and `blockIdx.y` is `0`.

`blockDim`
: the number of threads per block. For instance, block `(1,1)` above has 4 by 3 threads, so `blockDim.x` is `4` in this case, and similarly for `y`. Notice that every block in the same grid is required to have the same dimensions.

`threadIdx`
: the ID of the thread within its block. A 2D thread at `(x,y)` flattens to the linear index `y * blockDim.x + x`.
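To see these three variables concretely, here's a minimal sketch of mine (not from the lecture; it uses the `<<< >>>` launch syntax and `dim3` type explained just below) that launches the 3 x 2 grid of 4 x 3 blocks from the slide and has every thread print where it sits:

```cuda
#include <cstdio>

// Each thread prints the built-in variables it sees (device-side printf).
__global__ void whoami()
{
    printf("block (%d,%d), blockDim (%d,%d), thread (%d,%d)\n",
           blockIdx.x, blockIdx.y, blockDim.x, blockDim.y,
           threadIdx.x, threadIdx.y);
}

int main()
{
    dim3 grid(3, 2);    // 3 x 2 blocks, like the slide
    dim3 block(4, 3);   // 4 x 3 threads per block, like block (1,1) in the slide
    whoami<<<grid, block>>>();
    cudaDeviceSynchronize();   // wait for the kernel (and its printf output) to finish
    return 0;
}
```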
If we do this, then we can easily organize our thread IDs in a nice way such that each dimension gets a thread (or really a set of threads).
By default, when we do the kernel call with the `<<<dim of grid, dim of block>>>` part, it's really `<<<# blocks, # threads per block>>>`:
- `<<<16, 256>>>` only uses the `x` direction.
- `<<<(16, 16, 16), (256, 256, 256)>>>` uses all `x, y, z` directions.
There's a CUDA 3D int type, `dim3`, that allows you to set these at runtime. But you have to use an `int`! The compiler will not tell you if this launch configuration didn't actually work.
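As a sketch (assumptions of mine, not from the lecture: the kernel `fill` and its arguments are made up), setting the launch configuration at runtime with `dim3` could look like:

```cuda
// Sketch: build the grid/block dimensions at runtime with dim3.
__global__ void fill(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 1.0f;
}

void launch_fill(float *d_out, int width, int height)
{
    dim3 block(16, 16);                          // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,  // ceiling division so every
              (height + block.y - 1) / block.y); //   element gets a thread
    fill<<<grid, block>>>(d_out, width, height);
}
```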
If you take anything from this section, it's to compute:

```cuda
int id_x = (blockIdx.x * blockDim.x) + threadIdx.x;
int id_y = (blockIdx.y * blockDim.y) + threadIdx.y;
int id_z = (blockIdx.z * blockDim.z) + threadIdx.z;
```

to get your corresponding `id` as a 3D point.
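For instance, here's a sketch of mine (the kernel `scale3d` and the array layout are assumptions, not from the lecture) of how those three ids typically index a flattened 3D array:

```cuda
// Sketch: scale every element of an NX x NY x NZ array stored
// contiguously ("x fastest") in a flat device buffer.
__global__ void scale3d(float *data, int NX, int NY, int NZ, float factor)
{
    int id_x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int id_y = (blockIdx.y * blockDim.y) + threadIdx.y;
    int id_z = (blockIdx.z * blockDim.z) + threadIdx.z;

    if (id_x < NX && id_y < NY && id_z < NZ)          // skip the overshoot threads
    {
        int idx = (id_z * NY + id_y) * NX + id_x;     // flatten the 3D point
        data[idx] *= factor;
    }
}
```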
Calling CUDA
Consider the following example:
```c
// CPU implementation of vector addition.
void sum(float *A, float *B, float *C, int N)
{
    for (int i = 0; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
}
```
Notice that the memory hierarchy is:
- each thread has access to local registers (very small, but still like ~2,000),
- each thread block (i.e. each team) gets block shared memory,
- each grid gets shared global memory.
So the idea for distributing your threads is:
- Use a power of 2 for the number of threads per block (i.e. `256, 512, ...`).
- The number of blocks (the number of teams) will then just be `N / threads_per_block`. But this may not be an integer! With integer division you may have accidentally assigned 0 teams, and thus not done any work. As such, you should take the ceiling: `ceil(N / threads_per_block)` (see the sketch after this list).
- This shows that you may actually make more threads than the number you actually need.
- Note that, architecturally, the maximum number of blocks is hardware dependent. And if you use `65535`, it is equivalent in size to `(16,16,16)`.
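A quick sketch of that pitfall (the numbers here are just for illustration):

```c
#include <stdio.h>

int main(void)
{
    int N = 100;                  // fewer elements than one block's worth of threads
    int threadsPerBlock = 256;

    int badBlocks  = N / threadsPerBlock;                          // 100 / 256 == 0: no blocks, no work!
    int goodBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // integer ceiling: 1 block

    printf("bad = %d, good = %d\n", badBlocks, goodBlocks);        // prints: bad = 0, good = 1
    return 0;
}
```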
As such, in CUDA an implementation would be:

```cuda
#include <cmath>   // for std::ceil

// CUDA implementation: called a CUDA kernel.
__global__ void vecAdd(float *A, float *B, float *C, int N)
{
    int t_id = threadIdx.x + blockDim.x * blockIdx.x;
    C[t_id] = A[t_id] + B[t_id];
}

int main()
{
    // ... Assume N is set and d_A, d_B, d_C were sent to the GPU correctly ...
    // Run a grid of ceil(N/256) blocks of 256 threads each.
    // Note the ceiling (of a floating-point division) to ensure at least one block.
    vecAdd<<< (int)std::ceil(N / 256.0), 256 >>>(d_A, d_B, d_C, N);
    // Alternatively, rather than using std::ceil (and importing <cmath>), you can do:
    vecAdd<<< N/256 + 1, 256 >>>(d_A, d_B, d_C, N);
}
```
Notice that, as an example, if `N = 514`, we launch 3 blocks of 256 threads (768 threads total), so any thread with `t_id >= 514` indexes past the end of the arrays and you get the GPU equivalent of a `SEG_FAULT`! As such, we need an `if`:

```cuda
// ...
if (t_id < N)
    C[t_id] = A[t_id] + B[t_id];
// ...
```
As an alternative to `std::ceil(N/256.0)`, you can use `N/256 + 1` instead. With the `if` guard in place it behaves the same (when `N` is an exact multiple of 256 it just launches one extra block whose threads all fail the guard); the `(N + 255)/256` idiom used below avoids even that.
cudaMalloc()
On the CPU alone we use `malloc()` to create memory space for the vectors `A, B` above. In CUDA, we additionally have to allocate space on the GPU and move our data from the CPU (host) to the GPU (device):
```cuda
// allocate host (CPU) memory
float *h_A = ..., *h_B = ..., *h_C = ...;
// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));
// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);
// do our operation
vecAdd<<< (N + 255)/256, 256 >>>(d_A, d_B, d_C, N);
// copy result back to host memory
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
// free device (GPU) memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
```
Often, this data movement is the biggest bottleneck: the arithmetic itself is usually so fast that copying the data between host and device is the slowest part of the process.
Other Examples
This then extends easily to 2D matrix addition:
![[2.1IntroToGPU-Cuda.pdf#page=21]]
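As a hedged sketch of what that extension might look like (the row-major layout and all names here are my assumptions, not taken from the slide):

```cuda
// Sketch: 2D matrix addition, one thread per element.
// Matrices are assumed to be rows x cols, stored row-major in flat device buffers.
__global__ void matAdd(float *A, float *B, float *C, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < rows && col < cols)        // guard against the extra threads
    {
        int idx = row * cols + col;      // flatten the 2D coordinate
        C[idx] = A[idx] + B[idx];
    }
}

// Launched with a 2D grid, e.g.:
//   dim3 block(16, 16);                                  // 256 threads per block
//   dim3 grid((cols + 15) / 16, (rows + 15) / 16);       // ceiling division in each dimension
//   matAdd<<<grid, block>>>(d_A, d_B, d_C, rows, cols);
```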