Lecture 5 - (finish) OpenMP
Let's look at how to create a histogram using OpenMP
![[3.1OpenMP-PartII.pdf#page=14]]
The problem with the above code is that all chunks write to the same histogram, which lives in one piece of shared memory (global memory). So when the chunks go to write to it, there's a race condition. Adding `#pragma omp atomic` on the `++` operation could help here.
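A minimal sketch of that atomic version, assuming a byte-valued input and a fixed bucket count (names and sizes here are not the slide's exact code):

```c
#include <omp.h>

#define NUM_BUCKETS 256

/* Atomic-based histogram: every increment of the shared histogram is
 * protected, so concurrent ++ operations on the same bucket don't race. */
void histogram_atomic(const unsigned char *data, long n, long hist[NUM_BUCKETS]) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp atomic
        hist[data[i]]++;   /* safe, but serializes on every single update */
    }
}
```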
Note that if we change the problem to counting the number of different words, this approach becomes much harder! How many distinct words are there to base our chunks on? We don't really know in advance.
![[3.1OpenMP-PartII.pdf#page=15]]
Hence, our proposed change:
![[3.1OpenMP-PartII.pdf#page=16]]
Notice the threaded version is super slow! Adding the `atomic` inside the `for` loop is the problem, because of all the thread locking/unlocking it causes.
Instead, we should have `local_histograms` and then add them up:
![[3.1OpenMP-PartII.pdf#page=17]]
We:
- First chunk our work
- Then run the chunked work in parallel.
Notice:
- The last `for` loop runs in each thread, to combine their results (see the sketch after this list).
- We can put the `parallel` call on the very outside. This is good practice, and common, for thread-related code: you get one parallel region and break the work inside it into the relevant chunks.
- The `nowait` says a thread does not have to wait for all the other threads to finish that loop; it can just keep going.
- The highest-level `{}` in this case is what "creates" the threads, and at its very end the threads are joined.
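A minimal sketch of this structure, with assumed names, a byte-valued input, and an atomic-protected combine step (the slide's exact code will differ):

```c
#include <omp.h>
#include <stdlib.h>

#define NUM_BUCKETS 256

void histogram_local(const unsigned char *data, long n, long hist[NUM_BUCKETS]) {
    int max_threads = omp_get_max_threads();
    /* One private row per thread, zero-initialized. */
    long (*local_hist)[NUM_BUCKETS] = calloc(max_threads, sizeof *local_hist);

    #pragma omp parallel            /* the outer parallel region creates the threads */
    {
        int tid = omp_get_thread_num();

        /* Each thread fills its own row: no sharing, no atomics needed here. */
        #pragma omp for nowait      /* nowait: don't wait for the other threads */
        for (long i = 0; i < n; i++)
            local_hist[tid][data[i]]++;

        /* Combine: each thread folds its row into the shared histogram.
         * The atomic makes this safe, and it runs only num_threads * NUM_BUCKETS
         * times instead of once per input element. */
        for (int b = 0; b < NUM_BUCKETS; b++) {
            #pragma omp atomic
            hist[b] += local_hist[tid][b];
        }
    }                               /* closing brace joins the threads */

    free(local_hist);
}
```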
Notice that a reduction would be better here: having threads combine their results in pairs, and then those pairs combine, gives a much better speedup (a tree of combines rather than everyone serializing on one histogram).
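As a hedged sketch, OpenMP 4.5 and newer can express this directly with an array-section reduction, and the runtime is then free to combine the private copies however it likes (names and sizes are assumptions):

```c
#include <omp.h>

#define NUM_BUCKETS 256

/* Each thread gets a private copy of hist, fills it without contention,
 * and the OpenMP runtime combines the private copies at the end.
 * Requires OpenMP 4.5+ for array sections in reduction clauses. */
void histogram_reduction(const unsigned char *data, long n, long hist[NUM_BUCKETS]) {
    #pragma omp parallel for reduction(+ : hist[0:NUM_BUCKETS])
    for (long i = 0; i < n; i++)
        hist[data[i]]++;
}
```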
Also notice the `local_histogram[111][num_buckets]` declaration: the odd-looking size is padding that makes sure no two threads share the same cache line, so each thread's accesses to `local_histogram` stay within its own cache lines. You can get this effect by using a "weird" integer size for your arrays or pieces of data. A common cache-line-related length is 128, so a common size to try is 128 + 1 = 129; if the `111` were changed to `129`, that would likely also work here.
Doing it Again
Consider:
![[3.1OpenMP-PartII.pdf#page=18]]
Notice the only difference is the `__declspec (align(64)) int local_histogram[...]`. The `align(64)` is a compiler-specific alignment attribute (Microsoft's; C11 offers `_Alignas(64)` and GCC `__attribute__((aligned(64)))`) that aligns the data to a 64-byte boundary. With that, instead of `[111]` we can use `[num_threads+1][num_buckets]`. This forces there to be no false sharing, without having to make a bunch of empty space.
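A small sketch of the same idea in portable C11 (`alignas`); the Microsoft spelling is `__declspec(align(64))`, and the names and sizes here are assumptions:

```c
#include <stdalign.h>   /* C11 alignas; MSVC spells this __declspec(align(64)) */

#define NUM_BUCKETS 256
#define MAX_THREADS 64

/* Each row is one thread's private histogram. Giving the row type 64-byte
 * alignment means every row starts on its own cache-line boundary and its
 * size is rounded up to a multiple of 64 bytes, so no two threads' rows
 * ever share a cache line: no false sharing, and no big blocks of wasted
 * padding. */
typedef struct {
    alignas(64) int buckets[NUM_BUCKETS];
} local_hist_row;

static local_hist_row local_histogram[MAX_THREADS];

/* Inside the parallel region a thread would then update
 *   local_histogram[omp_get_thread_num()].buckets[value]++;   */
```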
Also, the new pragma has all threads synchronize before the last `for`, and then only one thread executes that last `for`. The next version fixes some of these synchronization issues, but:
![[3.1OpenMP-PartII.pdf#page=19]]
There's a race condition on the bottom `for` loop, namely on writing to the global `histogram`.
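One possible fix, not necessarily the slide's, is to wrap each thread's whole combining loop in a `critical` section, so the lock is taken once per thread rather than once per element (the names here match the earlier sketches and are assumptions):

```c
#include <omp.h>

#define NUM_BUCKETS 256

/* Sketch: each thread calls this once, at the end of the parallel region,
 * to fold its private histogram into the shared one. The critical section
 * is entered once per thread, not once per input element, so the locking
 * cost stays small compared to per-increment atomics. */
static void combine_into_global(long hist[NUM_BUCKETS], const long local[NUM_BUCKETS]) {
    #pragma omp critical
    for (int b = 0; b < NUM_BUCKETS; b++)
        hist[b] += local[b];
}
```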
In Summary
Atomic operations are expensive, but they are fundamental building blocks. We should only use synchronization where we actually need it, and we should try to be correct before we try to make it faster. We also have to keep hardware primitives such as the cache in mind.
![[3.1OpenMP-PartII.pdf#page=21]]
A Final Look at a Program
Look at:
![[3.1OpenMP-PartII.pdf#page=24]]
Notice that:
- The `double`s may experience false sharing, so we could align them to the nearest 64 or 128 bytes.
- The code is actually correct. Each thread acts independently on `i`, so thread 0 has `i = 0, 8, ...` and so on (a sketch of this pattern follows the list).
- Adding `#pragma omp for` on the innermost `for` doesn't give you a speedup. You can chunk per thread, but then the extra threads (the "threads within the threads") have to synchronize their chunks, which is a likely slow-down.
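A minimal sketch of the access pattern being described, assuming 8 threads and a manual cyclic split of `i` (the slide's actual code will differ):

```c
#include <omp.h>

#define N 1024

void work(double *a) {
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        /* Thread 0 handles i = 0, 8, 16, ...; thread 1 handles i = 1, 9, 17, ...
         * Each i is touched by exactly one thread, so the code is race-free,
         * but neighbouring i's belong to different threads, which is exactly
         * the layout that invites false sharing on the doubles. */
        for (int i = tid; i < N; i += nthreads)
            a[i] += 1.0;
    }
}
```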
You can actually do an alternative approach where each thread works on a "tile" and then synchronizes its output.
Another thing: if you go columns, then rows, you get more spatial locality, so you get more cache hits than misses.
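A small sketch of the loop-order point, under the assumption of a row-major C array (names and sizes are made up): whichever order makes the innermost loop walk contiguous memory gets the spatial locality.

```c
#define ROWS 1024
#define COLS 1024

/* C stores a[r][c] row-major: a[r][0], a[r][1], ... sit next to each other
 * in memory. With the inner loop over c, every cache line brought in is
 * fully used before moving on (mostly hits). Swapping the loops makes each
 * access jump COLS * sizeof(double) bytes, so it misses far more often. */
double sum_all(double a[ROWS][COLS]) {
    double total = 0.0;
    for (int r = 0; r < ROWS; r++)        /* outer loop: rows            */
        for (int c = 0; c < COLS; c++)    /* inner loop: contiguous walk */
            total += a[r][c];
    return total;
}
```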