Lecture 3 - (cont. Threads)
Review
Consider:
![[1.1pthreads.pdf#page=11]]
When reading someone else's code, consider:
- how much work is done per thread
- what work is done per thread
- what data is read/written by each thread
Here the issue is that each thread handles one row of operations. This is bad: with a 1000x1000 matrix we spawn 1000 threads, and everything past the core count just adds scheduling overhead and slows us down. However, we don't get race conditions, since each thread works with separate data (both reading and writing).
Instead, we can try the following:
![[1.1pthreads.pdf#page=12]]
This is one of the most commonly used tricks in the book: keep the thread count fixed (usually around twice the number of cores) and give each thread a block of rows, roughly the problem size divided by the thread count. That way the work per thread grows as the size of the problem increases, instead of the thread count growing, as in the sketch below.
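A minimal sketch of this partitioning, assuming a hypothetical matrix operation (the names, sizes, and the doubling stand-in are illustrative, not from the slide):

```c
#include <pthread.h>

#define N        1000
#define NTHREADS 8              /* e.g. twice a 4-core machine */

double A[N][N];

/* Each thread scales a contiguous block of rows; the block size
   (N / NTHREADS) grows with the problem, not the thread count. */
void *scale_rows(void *arg)
{
    long id    = (long)arg;
    int  first = id * (N / NTHREADS);
    int  last  = (id == NTHREADS - 1) ? N : first + (N / NTHREADS);
    for (int i = first; i < last; ++i)
        for (int j = 0; j < N; ++j)
            A[i][j] *= 2.0;     /* stand-in for the real row operation */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; ++id)
        pthread_create(&t[id], NULL, scale_rows, (void *)id);
    for (int id = 0; id < NTHREADS; ++id)
        pthread_join(t[id], NULL);
    return 0;
}
```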
Another Example
Consider the following:
![[1.1pthreads.pdf#page=13]]
See the Leibniz formula for pi for context on what we're calculating:
$$\frac{\pi}{4} = \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots$$
But there's a problem with the naive way of parallelising this: written as a running sum where each iteration flips the sign of the previous one, the terms have to be added in order from $k = 0$ upwards. Going back to the program listed above, it essentially gets rid of the dependency on the sign (whether the term sits at an odd or even index) by computing it directly from the index, as sketched below.
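A hedged sketch of that idea (the function name is illustrative, not from the slide):

```c
/* Term k of the Leibniz series, computed independently of all earlier
   terms: the sign comes from k itself, not from flipping a shared
   variable every iteration, so terms can be evaluated in any order. */
double leibniz_term(long k)
{
    return ((k % 2 == 0) ? 1.0 : -1.0) / (2.0 * k + 1.0);
}
```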
Is there a race condition in the slide's code? The answer is yes: `sum` is written to by all the threads, hence the `pthread_mutex_lock`. It would be better to give each thread its own `sumN` and have the master program add them up at the end; you just have to make sure global memory exists for each thread to store its sum in. Alternatively, each thread can accumulate into a `local_sum` and lock/unlock only once, when adding it to the global sum:
```c
pthread_mutex_lock(&mutex);
sum += local_sum;
pthread_mutex_unlock(&mutex);
```
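A fuller sketch of this local-sum approach for the pi series (the thread count, term count, and names here are my own, not the slide's):

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define TERMS    10000000L

double          sum   = 0.0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *partial_pi(void *arg)
{
    long   id        = (long)arg;
    double local_sum = 0.0;

    /* Accumulate this thread's share of the terms privately... */
    for (long k = id; k < TERMS; k += NTHREADS)
        local_sum += ((k % 2 == 0) ? 1.0 : -1.0) / (2.0 * k + 1.0);

    /* ...and take the lock exactly once to publish the result. */
    pthread_mutex_lock(&mutex);
    sum += local_sum;
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; ++id)
        pthread_create(&t[id], NULL, partial_pi, (void *)id);
    for (int id = 0; id < NTHREADS; ++id)
        pthread_join(t[id], NULL);
    printf("pi ~ %.8f\n", 4.0 * sum);
    return 0;
}
```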
![[1.1pthreads.pdf#page=15]]
The difference between a mutex and a semaphore is that a semaphore allows access by up to some fixed number of threads (say, 6 threads may be inside at once; the 7th is blocked from entering until one leaves). However, there's a lot of overhead to semaphores (they have to track the count, waiting threads, write priority, and so on).
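A minimal sketch with POSIX semaphores (the limit of 6 matches the example above; the worker body is a placeholder):

```c
#include <semaphore.h>
#include <pthread.h>

sem_t slots;

void *worker(void *arg)
{
    sem_wait(&slots);   /* blocks once 6 threads are already inside */
    /* ... section shared by at most 6 threads at a time ... */
    sem_post(&slots);   /* free a slot for the next waiting thread */
    return NULL;
}

int main(void)
{
    sem_init(&slots, 0, 6);   /* 0 = shared between threads, not processes */
    /* ... create and join worker threads as usual ... */
    sem_destroy(&slots);
    return 0;
}
```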
Producer/Consumer (Reader/Writer) Problem
Say I have one very big file and I need to find the word `Maria` in it; there may be more than one instance. How do I parallelise the problem? You could chunk the file like we've done before, but if the file is guarded by an ordinary lock, only one thread can touch it at a time, so the threads effectively run one after another.
The usual approach is to let any number of readers in concurrently; once someone needs to write, new readers are held back, the current ones drain out, the writer runs alone, and then readers are let at the file once again.
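pthreads provides exactly this pattern as a reader/writer lock; a minimal sketch (the search and update logic are omitted):

```c
#include <pthread.h>

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

void search_chunk(void)              /* many threads may run this at once */
{
    pthread_rwlock_rdlock(&rwlock);  /* shared: readers don't block readers */
    /* ... scan this thread's chunk for "Maria" ... */
    pthread_rwlock_unlock(&rwlock);
}

void update_file(void)               /* runs alone */
{
    pthread_rwlock_wrlock(&rwlock);  /* exclusive: waits for readers to drain */
    /* ... modify the file ... */
    pthread_rwlock_unlock(&rwlock);
}
```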
Summary
![[1.1pthreads.pdf#page=16]]
The ideal is linear speedup: more threads, more speed. In reality the curve flattens out before we even reach a thread count equal to the core count.
One reason is that the operating system manages our threads alongside everything else's: there may be 100 or so other programs with their own threads, probably at higher priority than ours.
Also, consider the cache:
- The cache is "cold" (holds invalid data) on program start, so the first accesses are super slow.
- Memory can be `volatile`: another thread may "dirty" some data we're using, making our next access to that data even slower.
- Rough hit costs: L1 is ~1 clock, L2 ~10, L3 ~100, and so on out to main memory.
- If you have arrays of data for a matrix, it's better to fetch an entire row in one memory access, since consecutive elements share cache lines.
- False sharing: when two threads write data that lands on the same cache line (even at quite different addresses), each write invalidates the whole line for the other thread, even if that thread was recently using its data. Sometimes we index something like `sum[10 * 16 + 1]` to space the threads' entries apart, so each slot falls on a different cache line and we aren't essentially cache thrashing (see the sketch after this list).
- Thread safety: things like string operations aren't thread safe, so you have to make your own (if it's unsupported by the library).
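A minimal sketch of that padding trick, assuming 64-byte cache lines (a stride of 16 doubles is 128 bytes, so each thread's slot is guaranteed its own line; the names are illustrative):

```c
#include <pthread.h>

#define NTHREADS 4
#define STRIDE   16   /* 16 doubles = 128 bytes, at least one full cache line */

/* Padded: thread i only ever touches sum[i * STRIDE], so no two threads
   write to the same cache line and there is no false sharing. */
double sum[NTHREADS * STRIDE];

void *accumulate(void *arg)
{
    long id = (long)arg;
    for (long k = 0; k < 10000000L; ++k)
        sum[id * STRIDE] += 1.0;   /* with sum[id] instead, neighbouring
                                      threads would invalidate each other's
                                      line on every write */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; ++id)
        pthread_create(&t[id], NULL, accumulate, (void *)id);
    for (int id = 0; id < NTHREADS; ++id)
        pthread_join(t[id], NULL);
    return 0;
}
```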
Study it well!
We won't really use semaphores, but for reference see:
![[1.2pthreads.pdf]]
Intro to OpenMP
This is essentially a library (plus compiler support) on top of C. It arrived and grew really fast (and is still growing), since it removes a lot of the `void*` shenanigans we had to deal with in pthreads.
The designers also tried to change the original code as little as possible while still letting it run on multiple cores. As such, parallelism is expressed through `#pragma` directives to the compiler. If the compiler doesn't understand a `#pragma`, it just ignores it; if it does, it applies the directive.
Some other things:
![[3.1OpenMP-1.pdf#page=1]]
But keep in mind that the problems we saw with `pthread`s will carry over to OpenMP too.
Motivation
Consider:
![[3.1OpenMP-1.pdf#page=3]]
Here the red lines on the right:
- include the library header (`omp.h`);
- `omp_set_num_threads(int)` manually sets the thread count. Usually OpenMP just uses the core count, but if you want to add or remove some threads you're free to do so;
- `#pragma omp parallel` says that whatever statement or block comes next is executed by every thread.
As such, this program will print `Hello World!\n` 16 times. Notice that you don't need to create and join the threads yourself: the `{` after the pragma creates the threads, and the `}` joins/destroys them all.
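A sketch of what the slide's program likely looks like (the thread count of 16 matches the output described above):

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(16);   /* override the default (the core count) */
    #pragma omp parallel       /* the { creates the threads... */
    {
        printf("Hello World!\n");
    }                          /* ...and the } joins/destroys them */
    return 0;
}
```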
As another example:
![[3.1OpenMP-1.pdf#page=4]]
This example shows how the `for` loop gets divided among threads by the `parallel for` directive. Assuming 8 threads, OpenMP hands out the iterations like so:
- Thread 0 will handle iteration 0
- ...
- Thread 7 will handle iteration 7
- Thread 0 handles iteration 8
- ...
The beauty of OpenMP is that we just modify the `#pragma` itself. Adding the `schedule` clause changes how iterations are handed out:
```c
#include <omp.h>

void do_huge_comp(double x);   /* per-element work, defined elsewhere */

int main(void)
{
    double Res[1000];
    /* schedule needs a policy, e.g. schedule(static) or schedule(dynamic) */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 1000; ++i)
    {
        do_huge_comp(Res[i]);
    }
    return 0;
}
```
We want to be able to control how the threads are assigned iterations, and therefore how they access data, so that the cache behaves well (whole lines get reused instead of evicted).
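For example, a hedged sketch comparing two policies (`work()` is my stand-in for the real computation): `schedule(static, chunk)` splits the iterations into fixed contiguous chunks up front, which keeps each thread on consecutive cache lines, while `schedule(dynamic, chunk)` hands chunks to whichever thread is idle:

```c
#include <omp.h>

void work(double *x) { *x += 1.0; }   /* stand-in for the real computation */

int main(void)
{
    double Res[1000] = {0};

    /* Static: chunks fixed in advance; thread 0 gets 0..124, thread 1 gets
       125..249, ... contiguous, predictable, and cache-friendly. */
    #pragma omp parallel for schedule(static, 125)
    for (int i = 0; i < 1000; ++i)
        work(&Res[i]);

    /* Dynamic: chunks of 8 are grabbed on demand; better load balance when
       iterations take uneven time, at the cost of scheduling overhead. */
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < 1000; ++i)
        work(&Res[i]);

    return 0;
}
```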