Lecture 5 - (finish) OpenMP

Let's look at how to create a histogram using OpenMP

![[3.1OpenMP-PartII.pdf#page=14]]

The problem with the above code is that all chunks are writing to the same histogram, which lives in one piece of shared (global) memory. When two chunks try to increment the same bucket at the same time, there's a race condition. Adding #pragma omp atomic on the ++ operation helps here.
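Here's a minimal sketch of that shared-histogram version (my own illustration, not the slide's code), assuming byte-valued input so the value itself is the bucket index:

```c
#define NUM_BUCKETS 256

int histogram[NUM_BUCKETS];              /* one histogram, shared by every thread */

void histogram_atomic(const unsigned char *data, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        /* without the atomic, concurrent ++ on the same bucket is a data race */
        #pragma omp atomic
        histogram[data[i]]++;
    }
}
```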

Note that if we change the problem to counting the number of different words, this approach becomes much harder! We don't know in advance how many distinct words there are, so we can't size the buckets or split the work by word ahead of time.

![[3.1OpenMP-PartII.pdf#page=15]]

Hence, our proposed change:

![[3.1OpenMP-PartII.pdf#page=16]]

Notice the threaded version is super slow! The problem is the atomic inside the for loop: every single increment pays the cost of locking and unlocking.

Instead, we should keep local histograms, one per thread, and then add them up:

![[3.1OpenMP-PartII.pdf#page=17]]
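A minimal sketch of the per-thread-histogram idea (illustrative names again, not the slide's exact code): each thread fills its own row with no atomics, and the rows are summed at the end.

```c
#include <omp.h>

#define NUM_BUCKETS 256
#define MAX_THREADS 64

int histogram[NUM_BUCKETS];
int local_histogram[MAX_THREADS][NUM_BUCKETS];   /* one row per thread */

void histogram_local(const unsigned char *data, long n) {
    int nthreads = 1;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();

        for (int b = 0; b < NUM_BUCKETS; b++)
            local_histogram[tid][b] = 0;

        /* each thread counts its share into its own row: no sharing, no atomics */
        #pragma omp for
        for (long i = 0; i < n; i++)
            local_histogram[tid][data[i]]++;
    }

    /* serial merge of the per-thread rows into the shared histogram */
    for (int t = 0; t < nthreads; t++)
        for (int b = 0; b < NUM_BUCKETS; b++)
            histogram[b] += local_histogram[t][b];
}
```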

Notice that doing a reduction would be even better here: threads combine their histograms in pairs, then those pairs combine, and so on. That tree-style combining gives a much better speedup than having one thread sum everything.
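Here's my own sketch of that tree-style combine, not the slide's code; it assumes the per-thread rows of local_histogram (same illustrative names as above) have already been filled:

```c
#include <omp.h>

#define NUM_BUCKETS 256
#define MAX_THREADS 64

int local_histogram[MAX_THREADS][NUM_BUCKETS];   /* assumed already filled */

/* Pairwise combine: in each round, row r absorbs row r + stride, so the
 * number of partial histograms halves every round. After about log2(nrows)
 * rounds, row 0 holds the full histogram. */
void combine_tree(int nrows) {
    #pragma omp parallel
    for (int stride = 1; stride < nrows; stride *= 2) {
        #pragma omp for
        for (int r = 0; r < nrows; r += 2 * stride)
            if (r + stride < nrows)
                for (int b = 0; b < NUM_BUCKETS; b++)
                    local_histogram[r][b] += local_histogram[r + stride][b];
        /* the implicit barrier at the end of the omp for separates the rounds */
    }
}
```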

Also notice the local_histogram[111][num_buckets] declaration: the oddly sized first dimension makes sure that no two threads share the same cache line, so each thread's accesses to local_histogram stay on its own cache lines instead of false sharing with a neighbor. You can get this effect by using a "weird" size for the array: if n is the number of threads, try n+1, n+3, n+7, ... until you see a speedup. Just make it not divisible by n.

A common cache line size is 128 bytes, so a common value to try is 128 + 1 = 129. Changing the 111 to 129 would likely also work here.
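A more systematic way to get the same effect (my own illustration, not the slide's layout) is to pad each thread's row up to a whole number of cache lines; combined with 64-byte alignment of the array (the next slide's trick), no two threads' rows can share a line:

```c
#define NUM_BUCKETS   10
#define MAX_THREADS   64
#define CACHE_LINE    64                           /* assumed cache line size in bytes */
#define INTS_PER_LINE (CACHE_LINE / sizeof(int))

/* round the row length up to a whole number of cache lines */
#define PADDED_BUCKETS \
    (((NUM_BUCKETS + INTS_PER_LINE - 1) / INTS_PER_LINE) * INTS_PER_LINE)

/* with the array also aligned to 64 bytes, each row owns its own line(s) */
int local_histogram[MAX_THREADS][PADDED_BUCKETS];
```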

Doing it Again

Consider:

![[3.1OpenMP-PartII.pdf#page=18]]

Notice the only difference is the declaration __declspec(align(64)) int local_histogram[...]. The align(64) is an alignment attribute (MSVC syntax; GCC/Clang and standard C11 have equivalents) that aligns the data on a 64-byte boundary, i.e. on a cache line. With that, instead of [111] we can use [num_threads+1][num_buckets]. This forces there to be no false sharing, without having to make a bunch of empty space.
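For reference, the same declaration in a few spellings; the slide uses the MSVC form, and the GCC/Clang and C11 forms are my additions:

```c
#define NUM_BUCKETS 256
#define NUM_THREADS 8

/* MSVC, as on the slide:
 *   __declspec(align(64)) int local_histogram[NUM_THREADS + 1][NUM_BUCKETS];
 */

/* GCC/Clang attribute: */
__attribute__((aligned(64))) int local_histogram[NUM_THREADS + 1][NUM_BUCKETS];

/* C11 standard form:
 *   _Alignas(64) int local_histogram[NUM_THREADS + 1][NUM_BUCKETS];
 */
```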

Also, the new pragma synchronizes all threads before the last for loop, and then only one thread performs that final loop. The next version fixes some of these synchronization issues, but:

![[3.1OpenMP-PartII.pdf#page=19]]

There's a race condition on the bottom for loop, namely on writing to the global histogram.
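One race-free way to do that merge (a sketch of one option, not necessarily the slide's fix) is to flip the loops so the parallel loop runs over buckets: each entry of the global histogram then has exactly one writer.

```c
#define NUM_BUCKETS 256
#define MAX_THREADS 64

int histogram[NUM_BUCKETS];
int local_histogram[MAX_THREADS][NUM_BUCKETS];   /* assumed already filled */

void merge_by_bucket(int nthreads) {
    /* each iteration (bucket) is handled by exactly one thread, so the
       writes to histogram[b] never race */
    #pragma omp parallel for
    for (int b = 0; b < NUM_BUCKETS; b++)
        for (int t = 0; t < nthreads; t++)
            histogram[b] += local_histogram[t][b];
}
```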

In Summary

Atomic operations are expensive, but they are fundamental building blocks. We should only add synchronization where we actually need it, and we should get the code correct before we try to make it fast. We also have to keep hardware details such as the cache in mind.

![[3.1OpenMP-PartII.pdf#page=21]]

A Final Look at a Program

Look at:

![[3.1OpenMP-PartII.pdf#page=24]]

Notice that you can take an alternative approach where each thread computes a "tile" of the work and then syncs its output with the others.

Another thing is that loop order matters: if you order the loops so the inner loop walks along contiguous memory (consecutive columns within a row, for a row-major C array), you get more spatial locality, and therefore more cache hits than misses.
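A small sketch of that loop-order point (illustrative array, not the slide's program):

```c
#define N 1024
int a[N][N];

/* C arrays are row-major: a[i][j] and a[i][j+1] sit next to each other. */
long sum_good(void) {
    long s = 0;
    for (int i = 0; i < N; i++)        /* outer loop over rows             */
        for (int j = 0; j < N; j++)    /* inner loop walks along a row:    */
            s += a[i][j];              /* stride-1 accesses, good locality */
    return s;
}

/* Swapping the loops makes every access jump a whole row ahead, touching a
 * new cache line almost every time, so far more misses. */
long sum_bad(void) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```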