Lecture 4 - OpenMP
From last time, we looked at the starting syntax for OpenMP:
As always in parallel programming, we ask:
- are there data dependencies?
- are there ways to separate data dependencies?
- ...
But we have to consider shared memory, since we're working on multicore and/or GPU:
![[3.1OpenMP-1.pdf#page=5]]
![[3.1OpenMP-1.pdf#page=6]]
Notice above that OpenMP forks the threads at the start of the parallel region and joins them automatically at the end.
But because it's a shared-memory architecture, we have to consider where each variable lives: if it's local (private), the threads can't communicate through it, but if it's global (shared), we have to worry about race conditions on that memory. As such, we define variable scope:
![[3.1OpenMP-1.pdf#page=7]]
As an example:
```c
int m = 10;        // declared before the region: shared by default
int a, b;
#pragma omp parallel shared(m) private(a, b)
{
    // variable 'm' is shared between the threads
    // variables 'a' and 'b' are local (private) to each thread
    // shared variables will introduce overhead, so be careful here
    int i;         // declared inside the region: private by default
}
```
Variable scoping is one of the main sources of mistakes when using OpenMP, so usually we explicitly list all the variables in the `#pragma`. To force this, we add the `default(none)` clause:
```c
#pragma omp parallel default(none) shared(m) private(a, b)
```
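With `default(none)`, any outer variable used in the region that isn't listed becomes a compile error rather than a silently shared variable. A minimal sketch:

```c
#include <omp.h>

void scope_demo(void)
{
    int m = 10;
    int a, b;

    #pragma omp parallel default(none) shared(m) private(a, b)
    {
        a = m;        // fine: both 'a' and 'm' are listed in the clauses above
        b = a * 2;    // fine: 'b' is listed too
        // Referencing any outer variable that is NOT listed (and not declared
        // inside this block) would now fail to compile instead of quietly
        // being shared.
    }
}
```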
The most interesting clause is `reduction(op:var)`, which gives each thread its own copy of the variable and combines the copies at the end. Consider a shared variable that is read and modified, like `sum` below (in these sketches every thread runs over the whole array; the point is how the shared `sum` gets updated):
```c
int sum = 0;
int i;                       // must be declared out here to appear in private(i)
// 'size' and 'A' are defined elsewhere
#pragma omp parallel default(none) shared(size, A, sum) private(i)
{
    for (i = 0; i < size; i++)
    {
        lock();              // not actually what we'd write, but what conceptually happens
        sum += A[i];
        unlock();
    }
}
```
Instead, we'd modify it to keep a `local_sum` in each thread:
```c
int sum = 0;
int i, local_sum;
// 'size' and 'A' are defined elsewhere
#pragma omp parallel default(none) shared(size, A, sum) private(i, local_sum)
{
    local_sum = 0;           // each thread's private copy starts uninitialized
    for (i = 0; i < size; i++)
    {
        local_sum += A[i];
    }
    lock();                  // one lock per thread instead of one per iteration
    sum += local_sum;
    unlock();
}
```
Better still, we use `reduction`, which applies one operation across the results from the entire set of threads:
```c
int sum = 0;
int i;
// 'size' and 'A' are defined elsewhere
#pragma omp parallel default(none) shared(size, A) private(i) reduction(+: sum)
{
    for (i = 0; i < size; i++)
    {
        sum += A[i];         // each thread updates its own private copy of sum
    }
}                            // the private copies are combined into sum here
```
Only a limited set of operators can be used in the `reduction` clause:
- arithmetic: `+`, `-`, `*`
- logical and bitwise: `&&`, `||`, `&`, `|`, `^`
- `min` and `max` (covering the `<` / `>` comparisons)
In reality, `reduction` is not doing the `lock`/`unlock` inside the `for` loop; instead it uses all the cores and does the locking and combining in a very efficient manner. It works as follows:
![[3.1OpenMP-1.pdf#page=8]]
Having the lock/unlock outside the `for` loop, once per thread, still only allows one core at a time to add to the shared total. What the `reduction` clause does instead is:
- each thread first accumulates into its own local copy
- core 0's result is added to core 1's, core 2's to core 3's, ...
- then the 0-1 result is added to the 2-3 result, ...
- the branches of the tree are combined in a logarithmic pattern

This is a tree reduction: combining the partial results takes a logarithmic number of steps instead of one serial pass over every thread's result.
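To make that concrete, here is a hand-rolled sketch of the tree combine (assuming at most 64 threads; the real OpenMP runtime is more sophisticated, and normally you'd just write `reduction`):

```c
#include <omp.h>

int tree_sum(const int *A, int size)
{
    int partial[64] = {0};               // assumes at most 64 threads
    int nthreads;

    #pragma omp parallel
    {
        int rank = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();

        // Step 1: each thread accumulates its own share locally.
        int local = 0;
        #pragma omp for
        for (int i = 0; i < size; i++)
            local += A[i];
        partial[rank] = local;
        #pragma omp barrier

        // Step 2: combine pairs, then pairs of pairs, ... (log2(nthreads) steps).
        for (int stride = 1; stride < nthreads; stride *= 2) {
            if (rank % (2 * stride) == 0 && rank + stride < nthreads)
                partial[rank] += partial[rank + stride];
            #pragma omp barrier
        }
    }
    return partial[0];
}
```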
When writing a parallel program, it's highly recommended to put everything in the `#pragma`! It should be obvious which variables go where (shared or private), and if you spell it all out there, you can be confident that variable scoping isn't the source of a race condition.
There are other variable scope types, as seen in:
![[3.1OpenMP-1.pdf#page=9]]
![[3.1OpenMP-1.pdf#page=10]]
Notice:
- to get the thread id you call `omp_get_thread_num()`
- you can get the number of threads that were set for the team via `omp_get_num_threads()`
- adding the `-fopenmp` compiler flag enables the usage of the `#pragma`s, so make sure to include it
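A minimal sketch to check all of this (the file name and compile line are just an example):

```c
// Compile with:  gcc -fopenmp hello_omp.c -o hello_omp
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int rank = omp_get_thread_num();      // this thread's id
        int nthreads = omp_get_num_threads(); // total threads in the team
        printf("Hello from thread %d of %d\n", rank, nthreads);
    }
    return 0;
}
```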
Here's another example:
![[3.1OpenMP-1.pdf#page=12]]
![[3.1OpenMP-1.pdf#page=13]]
To parallelize the code, we do:
```c
#include <omp.h>

float approx = 0.0f;
float h, a, b;
int n;                               // number of trapezoids

float f(float x)
{
    // ... defined elsewhere
}

/* Input a, b, n */
h = (b - a) / n;                     // computed once, before the parallel region
approx = (f(a) + f(b)) / 2.0f;       // the two endpoint terms

#pragma omp parallel default(none) shared(h, a, n) reduction(+: approx)
{
    // define our chunking: each thread takes a contiguous block of indices
    int chunk_size = n / omp_get_num_threads();
    int rank = omp_get_thread_num();
    int start = rank * chunk_size;
    int end = start + chunk_size;
    for (int i = start; i < end && i < n; i++)
    {
        float x_i = a + i * h;
        approx += f(x_i);
    }
}
approx *= h;
```
Notice our choices of variable scope:
- `h, a, b, n` are all shareable, as we don't write to them inside the region
- `x_i`, `i` and the chunking variables are private (declared inside the region), as they don't need to be shared
- `approx` needs to add together results from every thread, hence the `reduction`
Another way to do this is to use the `critical` directive. It allows only one thread at a time to execute the block, and the compiler/runtime will spend some effort making that reasonably efficient:
![[3.1OpenMP-1.pdf#page=16]]
Note again that `critical` applies to the next line, or to the whole next `{}` block:
```c
#include <omp.h>

float approx = 0.0f;
float h, a, b;
int n;                               // number of trapezoids

float f(float x)
{
    // ... defined elsewhere
}

/* Input a, b, n */
h = (b - a) / n;
approx = (f(a) + f(b)) / 2.0f;

#pragma omp parallel default(none) shared(h, a, n, approx)
{
    float local_approx = 0.0f;
    // define our chunking
    int chunk_size = n / omp_get_num_threads();
    int rank = omp_get_thread_num();
    int start = rank * chunk_size;
    int end = start + chunk_size;
    for (int i = start; i < end && i < n; i++)
    {
        float x_i = a + i * h;
        local_approx += f(x_i);
    }
    #pragma omp critical             // only one thread at a time adds its partial sum
    approx += local_approx;
}
approx *= h;
```
The math in the previous two examples doesn't quite work out (the endpoints and any leftover iterations aren't handled carefully), but the idea is right. Be mindful: write it out on paper before actually coding it up.
A final, actual implementation of the trapezoid rule is as follows:
![[3.1OpenMP-1.pdf#page=18]]
See slides 19-21 for the rest of the implementations. For a runnable demonstration of this code, see this link.
Differences between `omp parallel for` and without the `for`
Notice in:
![[3.1OpenMP-PartII.pdf#page=2]]
The `for` directive makes the code a lot cleaner, but it's not going to resolve any dependency for you. It will handle reductions, but only when there's no loop-carried dependency:
![[3.1OpenMP-PartII.pdf#page=3]]
This is especially apparent if you need the previous iteration's value in the next iteration.
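For example, the manual chunking from the trapezoid code collapses into a few lines with `parallel for` plus `reduction` (a sketch reusing the `f`, `a`, `b`, `n` names from before), while a loop that carries a value from one iteration to the next cannot be split this way:

```c
#include <omp.h>

float f(float x);                    // the integrand from earlier in the notes

float trapezoid_for(float a, float b, int n)
{
    float h = (b - a) / n;
    float approx = (f(a) + f(b)) / 2.0f;

    // Iterations are independent, so 'for' splits them across threads and
    // 'reduction' handles the shared sum.
    #pragma omp parallel for reduction(+: approx)
    for (int i = 1; i < n; i++)
        approx += f(a + i * h);

    return approx * h;
}

// Not parallelizable with a plain 'for' directive: each iteration needs the
// value produced by the previous one (a loop-carried dependency).
void running_sum(float *x, int n)
{
    for (int i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```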
If I actually want to change how many iterations each thread gets (the `chunk_size`):
![[3.1OpenMP-PartII.pdf#page=4]]
This is done by putting the `schedule(static, chunk_size)` clause on the `#pragma omp parallel for` directive.
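A small sketch (the array and loop body are placeholders):

```c
#include <omp.h>

void process(double *x, int n)
{
    // schedule(static, 4) deals iterations out in blocks of 4, round-robin
    // over the threads; plain schedule(static) gives each thread one big
    // contiguous block instead.
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < n; i++)
        x[i] *= 2.0;   // stand-in for real per-iteration work
}
```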
The compiler may silently ignore some `#pragma`s it doesn't understand and won't tell you which ones it ignored. You have been warned!
If instead you want uneven chunk sizes, use `guided`:
![[3.1OpenMP-PartII.pdf#page=8]]
There are a lot of ways to control it; see slides 4-6 in Part II:
![[3.1OpenMP-PartII.pdf#page=5]]
Ways to set the schedule:
- setting the environment variable, e.g. `export OMP_SCHEDULE="static, 1"` (or whatever you want), which is picked up by loops declared with `schedule(runtime)`
- calling `void omp_set_schedule(omp_sched_t kind, int chunk_size)` in your code, where `kind` is one of `omp_sched_static`, `omp_sched_dynamic`, `omp_sched_guided`, or `omp_sched_auto` (all with the `omp_sched_` prefix)
- using the `schedule(...)` clause directly in the `#pragma`
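A sketch of the runtime route (the function and array names are made up): `OMP_SCHEDULE` and `omp_set_schedule()` only take effect for loops that ask for `schedule(runtime)`:

```c
#include <omp.h>

void scale(float *x, int n, float c)
{
    // Pick the schedule programmatically: kind + chunk size.
    omp_set_schedule(omp_sched_dynamic, 4);

    // schedule(runtime) defers to whatever OMP_SCHEDULE or
    // omp_set_schedule() selected.
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        x[i] *= c;
}
```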
section
The `sections` construct lets you hand out independent blocks of work, each marked with its own `#pragma omp section`, to different threads:
![[3.1OpenMP-PartII.pdf#page=8]]
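A sketch of the shape (`funcA`/`funcB`/`funcC` are placeholders for independent pieces of work):

```c
#include <omp.h>

void funcA(void);   // placeholders for independent pieces of work
void funcB(void);
void funcC(void);

void run_all(void)
{
    // Each section is handed to some thread in the team; different sections
    // may run at the same time on different threads.
    #pragma omp parallel sections
    {
        #pragma omp section
        funcA();

        #pragma omp section
        funcB();

        #pragma omp section
        funcC();
    }
}
```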
task
A `section` and a `task` in OpenMP are very similar; the difference shows up when the work can't be laid out ahead of time as a fixed set of blocks or a `for` loop. As such, `task` is useful for recursion:
![[3.1OpenMP-PartII.pdf#page=9]]
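A classic sketch of tasks driving recursion (the usual Fibonacci example; inefficient, but it shows the shape, and a real version would stop spawning tasks below some cutoff):

```c
#include <omp.h>

long fib(int n)
{
    if (n < 2)
        return n;

    long x, y;
    #pragma omp task shared(x)       // child task computes into the parent's x
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait             // wait for both child tasks to finish
    return x + y;
}

long fib_parallel(int n)
{
    long result;
    #pragma omp parallel
    {
        #pragma omp single           // one thread builds the task tree
        result = fib(n);
    }
    return result;
}
```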
flush
It makes sure that all pending memory operations are completed and propagated to main memory, so every thread sees a consistent view of the flushed variables.
![[3.1OpenMP-PartII.pdf#page=10]]
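A sketch of the classic spin-wait pattern that flush enables (in modern code you'd usually protect the flag with `atomic` as well; this just illustrates the visibility guarantee):

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel num_threads(2) shared(data, flag)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;
            #pragma omp flush(data)        // make the payload visible first
            flag = 1;
            #pragma omp flush(flag)        // then make the flag visible
        } else {
            while (!flag) {                // spin until we see the flag
                #pragma omp flush(flag)
            }
            #pragma omp flush(data)        // re-read the published payload
            printf("got %d\n", data);
        }
    }
    return 0;
}
```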
atomic
This cannot be used on a `{}` block, only on a single simple update. We use it for operations like `++`:
![[3.1OpenMP-PartII.pdf#page=12]]
The problem with a normal `y++` is that it reads `y` and then writes `y`, which is two instructions that another thread can interleave with. So it's better to put `#pragma omp atomic` in front of the operation, which combines the read and the write into a single atomic operation using your computer's hardware support.
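A sketch (the function and variable names are made up); in this particular case a `reduction(+: y)` would be even faster, the point is just the shape of an atomic update:

```c
#include <omp.h>

int count_matches(const int *A, int n, int target)
{
    int y = 0;                   // shared counter
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (A[i] == target) {
            #pragma omp atomic
            y++;                 // read + increment + write happen as one unit
        }
    }
    return y;
}
```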
Misc Operations
![[3.1OpenMP-PartII.pdf#page=13]]
Everything here will slow your program down (as with any synchronization), but it may be necessary.
An Example: A Histogram
Creating a histogram of character occurrences from a file is horrible to parallelize naively: every thread ends up fighting over the same shared bin counters.
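One common way around it (a sketch of the per-thread-histogram trick, not necessarily the lecture's implementation): each thread counts into its own local histogram and the results are merged once at the end, instead of synchronizing on every single character.

```c
#include <omp.h>

void histogram(const unsigned char *text, long n, long bins[256])
{
    for (int b = 0; b < 256; b++)
        bins[b] = 0;

    #pragma omp parallel shared(text, n, bins)
    {
        long local[256] = {0};           // private to this thread

        #pragma omp for
        for (long i = 0; i < n; i++)
            local[text[i]]++;            // no locking needed here

        #pragma omp critical             // merge once per thread, not per character
        for (int b = 0; b < 256; b++)
            bins[b] += local[b];
    }
}
```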