Lecture 5 - (finish) OpenMP
Let's look at how to create a histogram using OpenMP
![[3.1OpenMP-PartII.pdf#page=14]]
The problem with the above code is that all chunks write to the same histogram, which lives in one piece of shared memory (global memory). So when the chunks go to write to it, there's a race condition. Adding `#pragma omp atomic` on the `++` operation could help here.
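A minimal sketch of that atomic version, assuming a byte-valued input and a fixed bucket count (names and sizes here are not the slide's exact code):

```c
#include <omp.h>

#define NUM_BUCKETS 256

/* Atomic-based histogram: every increment of the shared histogram is
 * protected, so concurrent ++ operations on the same bucket don't race. */
void histogram_atomic(const unsigned char *data, long n, long hist[NUM_BUCKETS]) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp atomic
        hist[data[i]]++;   /* safe, but serializes on every single update */
    }
}
```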
Note that if we change the problem to counting the number of different words, this approach becomes much harder! How many distinct words are there to base our chunks on? We don't really know in advance.
![[3.1OpenMP-PartII.pdf#page=15]]
Hence, our proposed change:
![[3.1OpenMP-PartII.pdf#page=16]]
Notice the threaded version is super slow! Adding the `atomic` inside the `for` loop is the problem, because of all the thread locking/unlocking it causes.
Instead, we should have `local_histograms` and then add them up:
![[3.1OpenMP-PartII.pdf#page=17]]
We:
- First chunk our work
- Then run the chunked work in parallel.
Notice:
- The last `for` loop runs in each thread, to combine their results (see the sketch after this list).
- We can put the `parallel` call on the very outside. This is good practice, and common, for thread-related code: you get one parallel region and break the work inside it into the relevant chunks.
- The `nowait` says a thread does not have to wait for all the other threads to finish that loop; it can just keep going.
- The highest-level `{}` in this case is what "creates" the threads, and at its very end the threads are joined.
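A minimal sketch of this structure, with assumed names, a byte-valued input, and an atomic-protected combine step (the slide's exact code will differ):

```c
#include <omp.h>
#include <stdlib.h>

#define NUM_BUCKETS 256

void histogram_local(const unsigned char *data, long n, long hist[NUM_BUCKETS]) {
    int max_threads = omp_get_max_threads();
    /* One private row per thread, zero-initialized. */
    long (*local_hist)[NUM_BUCKETS] = calloc(max_threads, sizeof *local_hist);

    #pragma omp parallel            /* the outer parallel region creates the threads */
    {
        int tid = omp_get_thread_num();

        /* Each thread fills its own row: no sharing, no atomics needed here. */
        #pragma omp for nowait      /* nowait: don't wait for the other threads */
        for (long i = 0; i < n; i++)
            local_hist[tid][data[i]]++;

        /* Combine: each thread folds its row into the shared histogram.
         * The atomic makes this safe, and it runs only num_threads * NUM_BUCKETS
         * times instead of once per input element. */
        for (int b = 0; b < NUM_BUCKETS; b++) {
            #pragma omp atomic
            hist[b] += local_hist[tid][b];
        }
    }                               /* closing brace joins the threads */

    free(local_hist);
}
```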
Notice that a reduction would be better here: having threads combine their results in pairs, and then those pairs combine, gives a much better speedup (a tree of combines rather than everyone serializing on one histogram).
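As a hedged sketch, OpenMP 4.5 and newer can express this directly with an array-section reduction, and the runtime is then free to combine the private copies however it likes (names and sizes are assumptions):

```c
#include <omp.h>

#define NUM_BUCKETS 256

/* Each thread gets a private copy of hist, fills it without contention,
 * and the OpenMP runtime combines the private copies at the end.
 * Requires OpenMP 4.5+ for array sections in reduction clauses. */
void histogram_reduction(const unsigned char *data, long n, long hist[NUM_BUCKETS]) {
    #pragma omp parallel for reduction(+ : hist[0:NUM_BUCKETS])
    for (long i = 0; i < n; i++)
        hist[data[i]]++;
}
```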
Also notice the `local_histogram[111][num_buckets]` declaration: the odd-looking size is padding that makes sure no two threads share the same cache line, so each thread's accesses to `local_histogram` stay within its own cache lines. You can get this effect by using a "weird" integer size for your arrays or pieces of data. A common cache-line-related length is 128, so a common size to try is 128 + 1 = 129; if the `111` were changed to `129`, that would likely also work here.
Doing it Again
Consider:
![[3.1OpenMP-PartII.pdf#page=18]]
Notice the only difference is the `__declspec (align(64)) int local_histogram[...]`. The `align(64)` is a compiler-specific alignment attribute (Microsoft's; C11 offers `_Alignas(64)` and GCC `__attribute__((aligned(64)))`) that aligns the data to a 64-byte boundary. With that, instead of `[111]` we can use `[num_threads+1][num_buckets]`. This forces there to be no false sharing, without having to make a bunch of empty space.
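A small sketch of the same idea in portable C11 (`alignas`); the Microsoft spelling is `__declspec(align(64))`, and the names and sizes here are assumptions:

```c
#include <stdalign.h>   /* C11 alignas; MSVC spells this __declspec(align(64)) */

#define NUM_BUCKETS 256
#define MAX_THREADS 64

/* Each row is one thread's private histogram. Giving the row type 64-byte
 * alignment means every row starts on its own cache-line boundary and its
 * size is rounded up to a multiple of 64 bytes, so no two threads' rows
 * ever share a cache line: no false sharing, and no big blocks of wasted
 * padding. */
typedef struct {
    alignas(64) int buckets[NUM_BUCKETS];
} local_hist_row;

static local_hist_row local_histogram[MAX_THREADS];

/* Inside the parallel region a thread would then update
 *   local_histogram[omp_get_thread_num()].buckets[value]++;   */
```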
Also, the new pragma has all threads synchronize before the last `for`, and then only one thread executes that last `for`. The next version fixes some of these synchronization issues, but:
![[3.1OpenMP-PartII.pdf#page=19]]
There's a race condition on the bottom `for` loop, namely on writing to the global `histogram`.
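One possible fix, not necessarily the slide's, is to wrap each thread's whole combining loop in a `critical` section, so the lock is taken once per thread rather than once per element (the names here match the earlier sketches and are assumptions):

```c
#include <omp.h>

#define NUM_BUCKETS 256

/* Sketch: each thread calls this once, at the end of the parallel region,
 * to fold its private histogram into the shared one. The critical section
 * is entered once per thread, not once per input element, so the locking
 * cost stays small compared to per-increment atomics. */
static void combine_into_global(long hist[NUM_BUCKETS], const long local[NUM_BUCKETS]) {
    #pragma omp critical
    for (int b = 0; b < NUM_BUCKETS; b++)
        hist[b] += local[b];
}
```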
In Summary
Atomic operations are expensive, but they are fundamental building blocks. We should only use synchronization where we actually need it, and we should try to be correct before we try to make it faster. We also have to keep hardware primitives such as the cache in mind.
![[3.1OpenMP-PartII.pdf#page=21]]
A Final Look at a Program
Look at:
![[3.1OpenMP-PartII.pdf#page=24]]
Notice that:
- The `double`s may experience false sharing, so we could align them to the nearest 64 or 128 bytes.
- The code is actually correct. Each thread acts independently on `i`, so thread 0 has `i = 0, 8, ...` and so on (a sketch of this pattern follows the list).
- Adding `#pragma omp for` on the innermost `for` doesn't give you a speedup. You can chunk per thread, but then the extra threads (the "threads within the threads") have to synchronize their chunks, which is a likely slow-down.
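A minimal sketch of the access pattern being described, assuming 8 threads and a manual cyclic split of `i` (the slide's actual code will differ):

```c
#include <omp.h>

#define N 1024

void work(double *a) {
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        /* Thread 0 handles i = 0, 8, 16, ...; thread 1 handles i = 1, 9, 17, ...
         * Each i is touched by exactly one thread, so the code is race-free,
         * but neighbouring i's belong to different threads, which is exactly
         * the layout that invites false sharing on the doubles. */
        for (int i = tid; i < N; i += nthreads)
            a[i] += 1.0;
    }
}
```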
You can actually do an alternative approach where each thread works on a "tile" and then synchronizes its output.
Another thing: if you go columns, then rows, you get more spatial locality, so you get more cache hits than misses.
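A small sketch of the loop-order point, under the assumption of a row-major C array (names and sizes are made up): whichever order makes the innermost loop walk contiguous memory gets the spatial locality.

```c
#define ROWS 1024
#define COLS 1024

/* C stores a[r][c] row-major: a[r][0], a[r][1], ... sit next to each other
 * in memory. With the inner loop over c, every cache line brought in is
 * fully used before moving on (mostly hits). Swapping the loops makes each
 * access jump COLS * sizeof(double) bytes, so it misses far more often. */
double sum_all(double a[ROWS][COLS]) {
    double total = 0.0;
    for (int r = 0; r < ROWS; r++)        /* outer loop: rows            */
        for (int c = 0; c < COLS; c++)    /* inner loop: contiguous walk */
            total += a[r][c];
    return total;
}
```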