Lecture 10 - ending CUDA
Even though CUDA gives us virtually unlimited threads when programming, there are physical hardware limits. These are:
- The maximum number of blocks per grid dimension: about 65535 for sm_2.0.
- The maximum number of threads per block: 512 or 1024 in total, depending on the architecture.
- If we use the block-count maximum with, for example, 128 threads per block, the maximum number of threads in one grid dimension is:
	- 65535 * 128 = 8388480 threads
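The limits above can be folded into a small host-side helper. This is a sketch; `pickLaunch` and the 128-thread block size are assumptions for illustration, not part of the lecture code:

```cuda
#include <algorithm>

// Hypothetical helper: choose a 1D launch configuration for n elements,
// clamping the grid to the (sm_2.0-era) 65535 blocks-per-dimension limit.
dim3 pickLaunch(int n, int threadsPerBlock /* e.g. 128 */) {
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    blocks = std::min(blocks, 65535);  // cannot exceed the per-dimension limit
    return dim3(blocks);
}
```

Once the grid is clamped like this, an input larger than 65535 * threadsPerBlock elements forces each thread to handle more than one element.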
Once we reach this limit, we must increase the number of operations per thread: each thread can process 4 elements, 8 elements, and so on.
We also have to make sure that neighbouring threads access spatially and temporally local data, in order to maximize cache hits.
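To make the locality point concrete, here is a sketch contrasting a cache-friendly access pattern with a strided one (kernel names and the `stride` parameter are illustrative assumptions):

```cuda
// Consecutive threads reading consecutive elements: accesses within a warp
// fall on neighbouring addresses, so they coalesce and hit the cache.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Each thread jumps 'stride' elements: neighbouring threads touch
// far-apart addresses, giving poor spatial locality and more cache misses.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```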
![[2.1IntroToGPU-Cuda.pdf#page=23]]
Notice above that:
- The `i += ...` part just advances `i` by the total number of threads in the grid.
- Here we are running `blockDim.x` threads per block and `gridDim.x` blocks per grid dimension.
- For example, if we call `vecAdd<<<100, 256>>>`, then `blockDim.x == 256` and `gridDim.x == 100`.
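The pattern from the slide can be sketched as follows (the kernel body is reconstructed from the discussion above, so the exact code may differ from the slide):

```cuda
// Grid-stride loop: each thread starts at its global index and then
// steps by the total number of threads in the grid until n is covered.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {   // total threads = blockDim.x * gridDim.x
        c[i] = a[i] + b[i];
    }
}

// Launch: vecAdd<<<100, 256>>>(a, b, c, n);
// inside the kernel, gridDim.x == 100 and blockDim.x == 256,
// so each thread advances i by 25600 per loop iteration.
```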
Note that the code above can have worse memory accesses, due to more cache misses. The diagram below shows this difference:
Branch divergence occurs when some threads in a warp take an `if` (i.e., a branch) and others do not; it is worst when the split is close to even. This is a problem because the threads of a warp execute in lockstep, so the warp has to run both sides of the branch one after the other, with the inactive threads masked off, slowing down the operations.
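A minimal sketch of the difference, assuming 32-thread warps (the kernel names are made up for illustration):

```cuda
// Divergent: the even/odd split puts both branch outcomes inside every
// warp, so each warp executes both paths serially.
__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}

// Uniform per warp: all 32 threads of a warp take the same side of the
// branch, so no warp has to execute both paths.
__global__ void uniform(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}
```

Both kernels do the same total work; only the mapping of the condition onto warps changes, which is exactly what determines whether divergence occurs.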