Lecture 3 - (cont. Threads)

Review

Consider:

![[1.1pthreads.pdf#page=11]]

When reading someone else's code, consider how the work is split among threads.

Here the issue is that each thread does one row of operations. This is bad: for a 1000x1000 matrix we spawn 1000 threads, which slows things down once the thread count exceeds the core count. However, we don't get race conditions, since each thread works on separate data (both reading and writing).
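A rough sketch of this one-thread-per-row pattern (the matrix M and the doubling work are made up for illustration, not the slide's code):

#include <pthread.h>

#define N 1000
static double M[N][N];

static void *row_worker(void *arg)       /* hypothetical per-row work */
{
    int row = *(int *)arg;
    for (int col = 0; col < N; ++col)
        M[row][col] *= 2.0;              /* each thread touches only its own row: no races */
    return NULL;
}

int main(void)
{
    pthread_t tid[N];
    int rows[N];
    for (int r = 0; r < N; ++r) {        /* 1000 threads: far more than we have cores */
        rows[r] = r;
        pthread_create(&tid[r], NULL, row_worker, &rows[r]);
    }
    for (int r = 0; r < N; ++r)
        pthread_join(tid[r], NULL);
    return 0;
}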

Instead, we can try the following:

![[1.1pthreads.pdf#page=12]]

This is one of the most commonly used tricks in the book. Here each thread handles a chunk of the problem, where the chunk size is the size of the problem divided by twice the number of cores:

$$\text{chunk} = \frac{\text{size of problem}}{2 \times \text{number of cores}}$$

This makes the thread count roughly twice the number of cores.

Here, we instead grow the work per thread as the size of the problem increases, while the thread count stays fixed.
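A minimal sketch of the chunked version, reusing the same made-up matrix as the sketch above:

#include <pthread.h>

#define N 1000
#define NTHREADS 8                       /* roughly 2 x the core count on a 4-core machine */
static double M[N][N];

static void *chunk_worker(void *arg)
{
    int id = *(int *)arg;
    int chunk = N / NTHREADS;            /* rows per thread */
    int start = id * chunk;
    int end = (id == NTHREADS - 1) ? N : start + chunk;  /* last thread takes the remainder */
    for (int row = start; row < end; ++row)
        for (int col = 0; col < N; ++col)
            M[row][col] *= 2.0;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; ++i) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, chunk_worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}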

Another Example

Consider the following:

![[1.1pthreads.pdf#page=13]]

See the Leibniz formula for pi for context on how to calculate π in a parallel way. Essentially:

$$\frac{\pi}{4} = \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}$$

But there's a problem with the naive way of parallelizing this: we have to add the terms in order, from $k = 0$ to $\infty$. That's because this series is conditionally convergent, so adding the numbers in a different order can actually produce a different result.

Going back to the program listed above: it essentially gets rid of the dependency on the sign (whether $k$ is odd or even).
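A hedged sketch of the idea (leibniz_partial is a hypothetical helper, not the slide's code): each iteration derives its sign directly from k, so iterations are independent and any range of terms can be handed to any thread.

/* partial Leibniz sum over [start, end) */
double leibniz_partial(long start, long end)
{
    double sum = 0.0;
    for (long k = start; k < end; ++k) {
        double sign = (k % 2 == 0) ? 1.0 : -1.0;  /* sign comes from k itself, no carried state */
        sum += sign / (2.0 * k + 1.0);
    }
    return sum;   /* pi is approximately 4 * (sum of all partial sums) */
}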

Question

Is there a race condition in the code on the slide?
The answer is yes: sum is written to by all the threads, hence the pthread_mutex_lock. Instead, it'd be better to give each thread its own sumN and have the master program add them all up at the end. You have to make sure global memory exists for each thread to store its sum in.
Alternatively, each thread can keep a local local_sum and then lock and unlock around adding it to the global sum via:

pthread_mutex_lock(&mutex);
sum += local_sum;
pthread_mutex_unlock(&mutex);
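Putting it together, a sketch of such a worker (CHUNK and pi_worker are hypothetical names; leibniz_partial is the helper sketched earlier):

#include <pthread.h>

#define CHUNK 100000L                     /* terms per thread, hypothetical */

double leibniz_partial(long start, long end);  /* from the sketch above */

double sum = 0.0;                         /* global, shared by all threads */
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *pi_worker(void *arg)
{
    long id = (long)arg;
    double local_sum = leibniz_partial(id * CHUNK, (id + 1) * CHUNK);
    pthread_mutex_lock(&mutex);           /* one thread in the critical section at a time */
    sum += local_sum;                     /* a single contended update per thread */
    pthread_mutex_unlock(&mutex);
    return NULL;
}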

![[1.1pthreads.pdf#page=15]]

The difference between a mutex and a semaphore is that a semaphore allows access by up to some number of threads (say, with a count of 6, the 7th thread is locked out of writing until one leaves). However, there's a lot of overhead to semaphores (they have to consider write priority and so on).
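For reference, a minimal POSIX semaphore sketch (the count of 6 matches the example above; the resource access is a placeholder):

#include <semaphore.h>
#include <stddef.h>

sem_t slots;

void setup(void)
{
    sem_init(&slots, 0, 6);   /* up to 6 threads may hold a slot at once */
}

void *sem_worker(void *arg)
{
    sem_wait(&slots);         /* the 7th thread blocks here until a slot frees up */
    /* ... access the shared resource ... */
    sem_post(&slots);         /* release the slot */
    return NULL;
}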

Producer/Consumer (Reader/Writer) Problem

Say that I have one very big file and I need to find the word "Maria" in it; there may be more than one instance. How do I parallelize the problem? You could chunk the file like we've done before. But if only one thread may read the file at a time, then the threads can only work one at a time.

The usual approach is to let any number of readers in concurrently, but once someone needs to write, you stop admitting new readers, let the current ones drain out, do the write, and then let readers start seeing the file once again.
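POSIX exposes exactly this policy through read-write locks; a minimal sketch (the search itself is a placeholder):

#include <pthread.h>

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

void read_chunk(void)
{
    pthread_rwlock_rdlock(&rwlock);   /* any number of readers may hold this together */
    /* ... scan this thread's chunk of the file for "Maria" ... */
    pthread_rwlock_unlock(&rwlock);
}

void write_file(void)
{
    pthread_rwlock_wrlock(&rwlock);   /* waits for readers to drain, then excludes everyone */
    /* ... modify the file ... */
    pthread_rwlock_unlock(&rwlock);
}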

Summary

![[1.1pthreads.pdf#page=16]]

The ideal is more speed as you add more threads. But in reality, even running with as many threads as cores doesn't give a core-count speedup, for a couple of reasons:

One is that the operating system manages everything in our program, and there may be 100 or so other programs with their own threads, probably at higher priority.

Also, consider the cache: the way threads access memory determines how well cache lines are reused.

THIS IS THE MOST IMPORTANT SECTION OF THE COURSE

Study it well!

We won't really use semaphores, but for reference see:

![[1.2pthreads.pdf]]

Intro to OpenMP

This is essentially a library on top of C that we can use. It arrived and grew really fast (and is still growing), since it removes a lot of the void* shenanigans that we had to deal with when using pthreads.

They also tried to change the original code as little as possible so that it can run on multiple cores. As such, there are #pragma compiler directives that tell the compiler how to parallelize a region. If the compiler doesn't understand a #pragma, it'll just ignore it; if it does, it'll apply the directive.

Some other things:

![[3.1OpenMP-1.pdf#page=1]]

But keep in mind that the problems with pthreads (race conditions and the like) carry over to OpenMP too.

Motivation

Consider:

![[3.1OpenMP-1.pdf#page=3]]

Here the red lines on the right are the additions needed to parallelize the serial code.

As such, this program will print Hello World!\n 16 times. Notice that you don't need to create and join the threads yourself: the { after the pragma creates the threads, and the } joins/destroys them all.
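A sketch along the lines of the slide (the slide's exact code may differ slightly):

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_num_threads(16);   /* request 16 threads */
    #pragma omp parallel       /* the { creates the threads... */
    {
        printf("Hello World!\n");
    }                          /* ...and the } joins/destroys them */
    return 0;
}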

As another example:

![[3.1OpenMP-1.pdf#page=4]]

This example shows how the loop's iterations get split across threads by the parallel for directive. The way OpenMP handles it:

#include <omp.h>

void do_huge_comp(double x);   /* heavy per-element work, defined elsewhere */

int main()
{
	double Res[1000];
	#pragma omp parallel for schedule(static)   /* iterations divided among the threads */
	for(int i = 0; i < 1000; ++i)
	{
		do_huge_comp(Res[i]);
	}
	return 0;
}

We want to be able to modify how the threads access the data, so that the cache gets better access patterns (handling cache lines better).
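For instance, the schedule clause controls how iterations are handed to threads; a hypothetical variation of the loop above:

/* static chunks of 100 consecutive iterations per thread keep each
   thread's accesses to Res[] on neighbouring cache lines */
#pragma omp parallel for schedule(static, 100)
for(int i = 0; i < 1000; ++i)
	do_huge_comp(Res[i]);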