Lecture 10 - ending CUDA

Even though, when programming CUDA, we have virtually unlimited threads, there are physical hardware limits. These include the maximum number of threads per block (typically 1024), the maximum grid dimensions, and the number of streaming multiprocessors (SMs) on the device.
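As a sketch, these limits can be queried at runtime with `cudaGetDeviceProperties` (assuming device 0 is the GPU in use):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the physical limits of GPU 0.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Number of SMs: %d\n", prop.multiProcessorCount);
    return 0;
}
```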

Once we reach this limit, we must increase the number of operations per thread: each thread processes 4, 8, ..., n elements instead of one.
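One common way to do this is a grid-stride loop, sketched below (the kernel name and launch sizes are illustrative): a fixed number of threads covers an arbitrarily large array because each thread loops over several elements.

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles multiple elements, so a fixed
// number of threads can process an array of any size.
__global__ void scale(float *data, int n, float factor) {
    int stride = gridDim.x * blockDim.x;  // total number of threads launched
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;                // one operation per element
    }
}

// Launched with fewer threads than elements; each thread loops over its share:
// scale<<<256, 256>>>(d_data, 1 << 20, 2.0f);
```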

Cache Misses

We have to make sure that neighbouring threads access data that is spatially and temporally local, in order to maximize cache hits.
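A minimal sketch of the difference, using two hypothetical copy kernels: in the coalesced version, adjacent threads read adjacent addresses; in the strided version, adjacent threads read addresses far apart, producing many more cache misses.

```cuda
// Good: adjacent threads touch adjacent addresses, so one cache line
// serves many threads.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Bad: adjacent threads are `stride` elements apart, so each access
// may pull in a different cache line.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```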

![[2.1IntroToGPU-Cuda.pdf#page=23]]

Notice that the code above has worse memory access behaviour due to more cache misses. The diagram below shows this difference:

Branch Divergence

Branch divergence occurs when threads within the same warp take different sides of a branch (ie: some threads satisfy an `if` condition and others do not). This is a problem because threads in a warp execute in lockstep, so the hardware must run both paths one after the other, masking off the inactive threads each time, which slows down execution.
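A minimal sketch (kernel names illustrative): in the first kernel, odd and even threads of the same warp take opposite branches, so the warp executes both paths serially; in the second, the condition is uniform across each 32-thread warp, so each warp takes only one path.

```cuda
// Divergent: odd/even threads in one warp take opposite branches,
// so the warp executes both paths one after the other.
__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}

// Uniform: all 32 threads of a warp share the same (threadIdx.x / 32),
// so each warp takes exactly one path and no divergence occurs.
__global__ void uniform(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if ((threadIdx.x / 32) % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}
```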