Lecture 12 - More on CUDA Memory and Specialization
See
![[2.3Atomics.pdf]]
The strategy to overcome issues with atomic operations is as follows (a minimal sketch follows the list):
- Launch the kernel with `<<<N, M>>>`, where `N` is the number of blocks and `M` is the number of threads per block.
- Put `__shared__` on something like `temp`, so that each block gets its own copy of shared memory, visible only to that block's threads.
- Use `__syncthreads()` to make sure that all threads in a block have reached the same point before reading what other threads wrote.
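Putting those three steps together, here is a minimal sketch of the pattern. The kernel name `kernel`, the buffer name `temp`, and the block size are illustrative assumptions, not the slide's exact code:

```cuda
#define M 256   // threads per block (assumed size for this sketch)

__global__ void kernel(const float *in, int n) {
    __shared__ float temp[M];          // one private copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    temp[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    __syncthreads();  // every thread's write to temp is now visible
    // ... block-local work on temp can safely start here ...
}

// Host side: N blocks of M threads each.
// kernel<<<N, M>>>(d_in, n);
```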
An example is the dot product code on slide 8:
![[2.3Atomics.pdf#page=9]]
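For reference, a sketch in the style of the classic shared-memory dot product (assumed, not copied verbatim from the PDF): each block reduces its partial products in shared memory, then issues a single `atomicAdd` per block instead of one atomic per thread, which is exactly how this pattern reduces atomic contention:

```cuda
#define THREADS_PER_BLOCK 256   // assumed power-of-two block size

__global__ void dot(const float *a, const float *b, float *result, int n) {
    __shared__ float temp[THREADS_PER_BLOCK];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread computes one pairwise product into shared memory.
    temp[tid] = (i < n) ? a[i] * b[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block: temp[0] ends up with the block sum.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            temp[tid] += temp[tid + stride];
        __syncthreads();
    }

    // One atomic per block: far less contention than one per thread.
    if (tid == 0)
        atomicAdd(result, temp[0]);
}
```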
We should look at:
![[2.2CudaMemoryModel.pdf]]