Lecture 4 - OpenMP

From last time, we looked at the starting syntax for OpenMP:

Intro to OpenMP

This is essentially a library on top of C that we can use. It appeared and grew really fast (and is still growing) since it removes a lot of the void* shenanigans that we had to deal with when using pthreads directly.

They also tried to change the original code as little as possible so that it can run on multiple cores. As such, there are #pragma compiler directives that tell the compiler how to parallelize the code that follows. If the compiler doesn't understand a #pragma, it'll just ignore it, but if it recognizes it, it'll apply the directive.

Some other things:

![[3.1OpenMP-1.pdf#page=1]]

But keep in mind that the problems with pthreads will carry over to OpenMP too.

Motivation

Consider:

![[3.1OpenMP-1.pdf#page=3]]

Here the red lines on the right:

  • Include the library header
  • The omp_set_num_threads(int) manually sets the thread count. Usually OpenMP just uses the core count, but if you want more or fewer threads you are free to set it.
  • The #pragma omp parallel says that whatever comes next (the following line or {} block) is executed by every thread.

As such, this program will print Hello World!\n 16 times. Notice that you don't need to create and join the threads yourself: the { after the pragma is where the threads are created, and the } is where they are all joined/destroyed.
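
A minimal sketch of what that program might look like (the exact code on the slide may differ slightly):

#include <stdio.h>
#include <omp.h>

int main(void)
{
	omp_set_num_threads(16);   // manually ask for 16 threads

	#pragma omp parallel
	{                          // threads are created at this {
		printf("Hello World!\n");
	}                          // ...and joined/destroyed at this }
	return 0;
}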

As another example:

![[3.1OpenMP-1.pdf#page=4]]

This example shows how the iterations of the for loop get divided among the threads by the parallel for directive. The way OpenMP handles it is:

  • Thread 0 will handle iteration 0
  • ...
  • Thread 7 will handle iteration 7
  • Thread 0 handles iteration 8
  • ...
    This assumes 8 threads. The beauty of OpenMP is that we just modify the #pragma itself, adding a schedule clause to change this behavior:
#include <omp.h>

void do_huge_comp(double x); // defined elsewhere

int main(void)
{
	double Res[1000];
	// schedule needs a kind, e.g. schedule(static) or schedule(static, 100)
	#pragma omp parallel for schedule(static)
	for(int i = 0; i < 1000; ++i)
	{
		do_huge_comp(Res[i]);
	}
	return 0;
}

We want to be able to control how the threads access the data, so that the caches are used well (each thread working within its own cache lines rather than sharing them).
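
For instance, a sketch (heavy_result is a placeholder for whatever per-element work we're doing) where each thread is handed contiguous chunks of 100 iterations, so it writes a contiguous stretch of the result array and stays within its own cache lines:

#include <omp.h>

double heavy_result(int i);   // hypothetical per-element computation, defined elsewhere

void chunked(double *Res, int n)
{
	// schedule(static, 100): chunks of 100 consecutive iterations are handed
	// out round-robin, so thread 0 gets 0..99, thread 1 gets 100..199, etc.,
	// instead of interleaving single iterations across threads
	#pragma omp parallel for schedule(static, 100)
	for (int i = 0; i < n; i++)
	{
		Res[i] = heavy_result(i);
	}
}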

As always in parallel programming, we look to the fork-join pattern: fork threads, split the work among them, then join.

But we have to consider shared memory, since we're working on multicore and/or GPU:

![[3.1OpenMP-1.pdf#page=5]]

![[3.1OpenMP-1.pdf#page=6]]

Notice above that at the end of the parallel region, OpenMP itself will join the threads automatically.

But because it's a shared-memory architecture, we have to think about where each variable lives: if it's private (local to a thread), other threads can't see it, so they can't communicate through it; but if it's shared (global), then we have to worry about race conditions on that memory. As such, we define variable scope:

![[3.1OpenMP-1.pdf#page=7]]

As an example:

int size = 10; // declared at global scope: shared by default

int m, a, b;

#pragma omp parallel shared(m) private(a, b)
{
	// 'm' is shared between all threads
	// 'a' and 'b' are private copies, local to each thread
	// shared variables can introduce overhead (and races), so be careful here
	int i; // declared inside the region, so it is private by default
}

Getting variable scope wrong is one of the main mistakes you can make in OpenMP. So usually we explicitly list every variable in the #pragma. To force this, we can add the default(none) clause:

#pragma omp parallel default(none) shared(m) private(a, b)

The most interesting clause, though, is reduction(op:var), which gives each thread its own private copy of the variable and combines the copies at the end of the region. Consider a shared variable that is read and modified, like sum below:

int sum = 0;

#pragma omp parallel default(none) shared(size, A, sum) private(i)
{
	for(int i = 0; i < size; i++)
	{
		lock(); // not actually what we'd write, but what happens
		sum += A[i];
		unlock();
	}
}

Instead, we'd modify it to use a local_sum:

int sum = 0;

#pragma omp parallel default(none) shared(size, A, sum) private(i)
{
	int local_sum = 0; // private by construction: declared inside the region
	for(int i = 0; i < size; i++)
	{
		local_sum += A[i];
	}
	lock(); // again, in real OpenMP this would be a critical section
	sum += local_sum;
	unlock();
}

Better still, we use reduction, which gives every thread its own copy and then applies one combining operation across the whole set of threads:

int sum = 0;

#pragma omp parallel default(none) shared(size, A) private(i) reduction(+: sum)
{
	for(int i = 0; i < size; i++)
	{
		sum+=A[i];
	}
}

There's only a limited set of operators that can be used in the reduction clause:

![[3.1OpenMP-1.pdf#page=8]]
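
As a small sketch of two of those operators in use (the names are made up; note that min/max reductions need OpenMP 3.1 or newer):

#include <omp.h>

double product_and_max(const double *A, int size, double *max_out)
{
	double prod = 1.0;
	double best = A[0];

	// each thread gets private copies of prod and best; OpenMP combines
	// them with * and max respectively when the loop finishes
	#pragma omp parallel for reduction(*: prod) reduction(max: best)
	for (int i = 0; i < size; i++)
	{
		prod *= A[i];
		if (A[i] > best)
			best = A[i];
	}

	*max_out = best;
	return prod;
}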

Having the lock/unlock outside of the for loop lets the threads compute their partial sums in parallel, but still only one core at a time can add its local sum into the shared total. The reduction clause instead combines the partial results in a tree: pairs of threads are combined first, then pairs of those results, and so on.

This is O(log(n)) in the number of cores: with 8 cores the final combine takes 3 steps instead of 8 sequential additions. That's not a huge win on a small 8-core machine, but for something heavily multicore like a Threadripper or a supercomputer it really adds up. It makes the most sense for something like a GPU, where there are potentially hundreds of cores.

Advice

When making a parallel program, it's highly recommended to put every variable in the #pragma! It should be obvious which variables go where (shared or private), and once they're all listed explicitly you can be confident that variable scoping isn't the source of your race conditions.

There are other scope clauses for variables, as seen in:

![[3.1OpenMP-1.pdf#page=9]]

![[3.1OpenMP-1.pdf#page=10]]

Notice:

Here's another example:

![[3.1OpenMP-1.pdf#page=12]]

![[3.1OpenMP-1.pdf#page=13]]

To parallelize this code we do:

#include <omp.h>

float f(float x); // the integrand, defined elsewhere

/* Inputs a, b, n are passed in */
float trap(float a, float b, int n)
{
	float h = 0.0f;
	float approx = 0.0f;

	// chunk_size, rank, start, end, x_i and i are declared inside the
	// region, so they are automatically private and don't go in the clauses
	#pragma omp parallel default(none) shared(h, a, b, n) reduction(+: approx)
	{
		h = (b - a) / n;                // every thread writes the same value
		approx = (f(a) + f(b)) / 2.0f;  // see the warning below: this term gets counted once per thread

		// define our chunking
		int chunk_size = n / omp_get_num_threads();
		int rank = omp_get_thread_num();

		int start = rank * chunk_size;
		int end = start + chunk_size;

		for (int i = start; i <= end && i < n; i++)
		{
			float x_i = a + i * h;
			approx += f(x_i);
		}
	}
	return approx * h;
}

Notice our choices of variable scope in the pragma above: the inputs (h, a, b, n) are shared, while everything computed per thread is private by virtue of being declared inside the region.

Another way to do this is to use the critical directive. It allows only one thread at a time to execute the block that follows, and the compiler/runtime will spend some effort trying to make that efficient:

![[3.1OpenMP-1.pdf#page=16]]

Note again that it applies to the next line, or the whole next {} block:

#include <omp.h>

float f(float x); // the integrand, defined elsewhere

/* Inputs a, b, n are passed in */
float trap_critical(float a, float b, int n)
{
	float h = 0.0f;
	float approx = 0.0f;

	#pragma omp parallel default(none) shared(approx, h, a, b, n)
	{
		float local_approx = 0.0f;      // private: declared inside the region
		h = (b - a) / n;                // every thread writes the same value
		approx = (f(a) + f(b)) / 2.0f;  // see the warning below

		// define our chunking
		int chunk_size = n / omp_get_num_threads();
		int rank = omp_get_thread_num();

		int start = rank * chunk_size;
		int end = start + chunk_size;

		for (int i = start; i <= end && i < n; i++)
		{
			float x_i = a + i * h;
			local_approx += f(x_i);
		}

		#pragma omp critical
		approx += local_approx;
	}
	return approx * h;
}
Warning

The math in the previous two examples doesn't really work out. However, the idea still works. Be mindful and write it out on paper before actually coding it up.

A final, actually working version of the trapezoid rule is as follows:

![[3.1OpenMP-1.pdf#page=18]]
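
As a rough sketch of the shape such a corrected version can take (using parallel for with a reduction; the slide's actual code may differ):

#include <omp.h>

double f(double x);   // the integrand, defined elsewhere

double trapezoid(double a, double b, int n)
{
	double h = (b - a) / n;
	double approx = (f(a) + f(b)) / 2.0;

	// interior points: each thread sums its share, the reduction combines them
	#pragma omp parallel for reduction(+: approx)
	for (int i = 1; i < n; i++)
	{
		approx += f(a + i * h);
	}

	return approx * h;
}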

See slides 19-21 for the rest of the implementations. For a runnable demonstration of this code, see this link.

Differences between omp parallel for and omp parallel without the for

Notice in:

![[3.1OpenMP-PartII.pdf#page=2]]

The for directive makes the code a lot cleaner, but it's not going to resolve any dependencies. It will handle reductions, but only when there's no loop-carried dependency:

![[3.1OpenMP-PartII.pdf#page=3]]

This is especially apparent if you need the previous iteration's value in the next iteration.
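
For example, a sketch of a loop that parallel for cannot safely handle:

// prefix-sum style loop: iteration i needs a[i-1], which another thread
// may not have written yet, so #pragma omp parallel for on this loop
// would give wrong results
void prefix_sum(double *a, const double *b, int n)
{
	for (int i = 1; i < n; i++)
	{
		a[i] = a[i - 1] + b[i];
	}
}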

If we actually want to change how many iterations each thread gets (the chunk_size):

![[3.1OpenMP-PartII.pdf#page=4]]

This is done with the schedule clause, for example #pragma omp parallel for schedule(static, chunk_size).

Warning

The OpenMP compiler may just ignore some #pragmas and not tell you which ones it ignored. You have been warned!

If instead you want uneven chunk sizes (large at first, shrinking as the loop progresses), use guided:

![[3.1OpenMP-PartII.pdf#page=8]]

There are a lot of ways to control it. See slides 4-6 in Part II:

![[3.1OpenMP-PartII.pdf#page=5]]

Ways to set the schedule: directly in the schedule clause, via the OMP_SCHEDULE environment variable (when the loop uses schedule(runtime)), or at runtime with omp_set_schedule().
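
A sketch combining the last two (compute(i) is a placeholder for the per-iteration work):

#include <omp.h>

void compute(int i);   // hypothetical per-iteration work, defined elsewhere

void run(int n)
{
	// option 1: pick the schedule in code
	omp_set_schedule(omp_sched_dynamic, 4);   // dynamic, chunks of 4

	// option 2: schedule(runtime) defers the choice to omp_set_schedule()
	// above, or to the OMP_SCHEDULE environment variable,
	// e.g. OMP_SCHEDULE="guided,8"
	#pragma omp parallel for schedule(runtime)
	for (int i = 0; i < n; i++)
	{
		compute(i);
	}
}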

section

It lets you hand different, independent blocks of work to different threads, with each block marked by its own #pragma:

![[3.1OpenMP-PartII.pdf#page=8]]
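
A minimal sketch of what that looks like (the task_* functions are placeholders for independent pieces of work):

#include <omp.h>

void task_a(void);   // independent pieces of work, defined elsewhere
void task_b(void);
void task_c(void);

void run_sections(void)
{
	#pragma omp parallel sections
	{
		#pragma omp section
		task_a();        // each section is handed to one thread

		#pragma omp section
		task_b();

		#pragma omp section
		task_c();        // all sections finish before we leave the block
	}
}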

task

A task in OpenMP is very similar to a section, but tasks can be created dynamically at runtime. For example, computing n! recursively is notoriously bad to parallelize. We hate recursion in parallel programming, so the usual advice is to avoid it entirely and rewrite the algorithm as a loop; when you can't, task is the tool for parallelizing recursion:

![[3.1OpenMP-PartII.pdf#page=9]]
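
As a sketch, the classic illustration uses Fibonacci rather than the factorial mentioned above, since each call spawns two independent sub-problems:

#include <omp.h>

// recursive Fibonacci: each recursive call becomes a task that any idle
// thread can pick up (in practice you'd stop spawning tasks below some cutoff)
long fib(int n)
{
	if (n < 2)
		return n;

	long x, y;

	#pragma omp task shared(x)
	x = fib(n - 1);

	#pragma omp task shared(y)
	y = fib(n - 2);

	#pragma omp taskwait   // wait for both child tasks before combining
	return x + y;
}

long fib_parallel(int n)
{
	long result = 0;
	#pragma omp parallel
	{
		#pragma omp single   // only one thread creates the root task
		result = fib(n);
	}
	return result;
}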

flush

It makes sure that all of a thread's pending memory operations are completed and propagated to main memory, so other threads see an up-to-date view of the flushed variables.

![[3.1OpenMP-PartII.pdf#page=10]]
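
A sketch of the usual producer/consumer handshake built on flush (the variable names are made up):

#include <stdio.h>
#include <omp.h>

void handshake(void)
{
	int data = 0, flag = 0;

	#pragma omp parallel num_threads(2) shared(data, flag)
	{
		if (omp_get_thread_num() == 0)
		{
			data = 42;
			#pragma omp flush(flag, data)   // make the write to data visible
			flag = 1;
			#pragma omp flush(flag)         // then publish the flag
		}
		else
		{
			int seen = 0;
			while (!seen)
			{
				#pragma omp flush(flag)     // force a re-read of flag
				seen = flag;
			}
			#pragma omp flush(flag, data)   // make sure we read up-to-date data
			printf("got %d\n", data);
		}
	}
}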

atomic

This cannot be applied to a whole {} block, only to a single statement. We use it for simple read-modify-write updates like ++:

![[3.1OpenMP-PartII.pdf#page=12]]

The problem with a normal y++ on a shared variable is that it reads y and then writes y, which is two separate operations, so another thread can slip in between them. With #pragma omp atomic in front of the operation, the hardware performs the read-modify-write as a single atomic instruction.
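
A small sketch of that in use:

#include <omp.h>

int count_even(const int *A, int n)
{
	int y = 0;

	#pragma omp parallel for
	for (int i = 0; i < n; i++)
	{
		if (A[i] % 2 == 0)
		{
			// a plain y++ here would be a race (separate read and write);
			// atomic makes the increment a single hardware operation
			#pragma omp atomic
			y++;
		}
	}
	return y;
}

For a simple counter like this, a reduction(+: y) would usually be faster; atomic is for updates that don't fit the reduction pattern.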

Misc Operations

![[3.1OpenMP-PartII.pdf#page=13]]

All things here will slow your program (as with any synchronization), but may be necessary.

An Example: A Histogram

Creating a histogram of character occurrences from a file is horrible to parallelize naively.
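
As a sketch of why, and of a common workaround (not necessarily how the lecture's version does it): the naive approach has every thread updating the same shared bins, so every increment needs synchronization; giving each thread a private histogram and merging at the end avoids most of that.

#include <string.h>
#include <omp.h>

#define NBINS 256

// count character occurrences in buf[0..len): each thread fills a private
// histogram, then the private histograms are merged under a critical section
void histogram(const unsigned char *buf, long len, long counts[NBINS])
{
	memset(counts, 0, NBINS * sizeof(long));

	#pragma omp parallel
	{
		long local[NBINS] = {0};   // per-thread bins, no contention

		#pragma omp for
		for (long i = 0; i < len; i++)
			local[buf[i]]++;

		#pragma omp critical
		for (int b = 0; b < NBINS; b++)
			counts[b] += local[b];
	}
}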