Lecture 10 - ending CUDA
Even though CUDA gives us virtually unlimited threads when programming, there are physical hardware limits. These are:
- The maximum number of blocks per grid dimension: about 65535 for sm_2.0.
- The maximum number of threads per block: 512 or 1024 in total, depending on the architecture.
- If we use the block-count maximum with, for example, 128 threads per block, the maximum number of threads in one grid dimension is:
	- 65535 * 128 = 8388480 threads
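The limits above can be folded into a small host-side helper. This is a sketch; `pickLaunch` and the 128-thread block size are assumptions for illustration, not part of the lecture code:

```cuda
#include <algorithm>

// Hypothetical helper: choose a 1D launch configuration for n elements,
// clamping the grid to the (sm_2.0-era) 65535 blocks-per-dimension limit.
dim3 pickLaunch(int n, int threadsPerBlock /* e.g. 128 */) {
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    blocks = std::min(blocks, 65535);  // cannot exceed the per-dimension limit
    return dim3(blocks);
}
```

Once the grid is clamped like this, an input larger than 65535 * threadsPerBlock elements forces each thread to handle more than one element.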
Once we reach this limit, we must increase the number of operations per thread: each thread can process 4 elements, 8 elements, and so on.
We also have to make sure that neighbouring threads access spatially and temporally local data, in order to maximize cache hits.
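To make the locality point concrete, here is a sketch contrasting a cache-friendly access pattern with a strided one (kernel names and the `stride` parameter are illustrative assumptions):

```cuda
// Consecutive threads reading consecutive elements: accesses within a warp
// fall on neighbouring addresses, so they coalesce and hit the cache.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Each thread jumps 'stride' elements: neighbouring threads touch
// far-apart addresses, giving poor spatial locality and more cache misses.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```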
![[2.1IntroToGPU-Cuda.pdf#page=23]]
Notice above that:
- The `i += ...` part just advances `i` by the total number of threads in the grid.
- Here we are running `blockDim.x` threads per block and `gridDim.x` blocks per grid dimension.
- For example, if we call `vecAdd<<<100, 256>>>`, then `blockDim.x == 256` and `gridDim.x == 100`.
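The pattern from the slide can be sketched as follows (the kernel body is reconstructed from the discussion above, so the exact code may differ from the slide):

```cuda
// Grid-stride loop: each thread starts at its global index and then
// steps by the total number of threads in the grid until n is covered.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {   // total threads = blockDim.x * gridDim.x
        c[i] = a[i] + b[i];
    }
}

// Launch: vecAdd<<<100, 256>>>(a, b, c, n);
// inside the kernel, gridDim.x == 100 and blockDim.x == 256,
// so each thread advances i by 25600 per loop iteration.
```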
Note that the code above can have worse memory accesses, due to more cache misses. The diagram below shows this difference:
Branch divergence occurs when some threads in a warp take an `if` (i.e., a branch) and others do not; it is worst when the split is close to even. This is a problem because the threads of a warp execute in lockstep, so the warp has to run both sides of the branch one after the other, with the inactive threads masked off, slowing down the operations.
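A minimal sketch of the difference, assuming 32-thread warps (the kernel names are made up for illustration):

```cuda
// Divergent: the even/odd split puts both branch outcomes inside every
// warp, so each warp executes both paths serially.
__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}

// Uniform per warp: all 32 threads of a warp take the same side of the
// branch, so no warp has to execute both paths.
__global__ void uniform(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}
```

Both kernels do the same total work; only the mapping of the condition onto warps changes, which is exactly what determines whether divergence occurs.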