Lecture 2 - Continuing Multicore

This lecture largely continues from the last class.

Recall from last lecture that we moved to multicore because of the power/temperature limits that we hit, even before hitting the transistor limits.

Now, you must re-think your algorithms to work in a parallel fashion. This requires that you:

![[Multicore_Architectures.pdf#page=9]]

Notice that each core has its own L1 and L2 caches, while L3 is shared amongst all cores. The L1 and L2 memories are local to a core and cannot be accessed by other cores. Of course, this layout is very architecture-specific, but the idea of memory locality is very important for programming purposes. There is also shared memory, via the L3 in this example: if you have processes that need to communicate, shared memory can help facilitate that.

Again, we don't really care whether it's some L1, L2, L3, L17, ... cache; what we care about is whether or not the memory is shared.

Types of Parallelism

We have the following:

for (int i = 0; i < 10000000; i++)
{
  sum += A[i];
}

which we can assign to different cores:

But this creates a race condition: since every core is writing to the same sum variable, one core may read the total before another core's update has landed, so both end up working from the same stale value.

For example, suppose core 1 has a partial sum sum1 = 5, core 2 has sum2 = 7, and the running total is sumT = 8. If both cores read sumT = 8, then core 2 writes back 15 and core 1 writes back 13 afterwards, overwriting core 2's work: the total ends up as 13 instead of the correct 20.

But as parallel programmers, we don't want to have to wait! So we won't, and instead we'll try to overcome these challenges when they arise. It's on you to determine why and when this happens.
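
One common way to overcome this particular race is to give each thread its own partial sum and only combine them in one place. Below is a minimal sketch of that idea (the names NTHREADS, partial, partial_sum, etc. are made up for illustration, and it uses the Pthreads calls covered later in these notes):

#include <stdio.h>
#include <pthread.h>

#define N        10000000
#define NTHREADS 4             /* assumed: one thread per core */

int  A[N];                     /* the array being summed */
long partial[NTHREADS];        /* one slot per thread: no shared writes */

void* partial_sum(void* rank) {
    long t   = (long) rank;
    long lo  = t * (N / NTHREADS);              /* this thread's chunk */
    long hi  = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
    long sum = 0;
    for (long i = lo; i < hi; i++)
        sum += A[i];
    partial[t] = sum;          /* each thread writes a different slot */
    return NULL;
}

int main(void) {
    pthread_t handles[NTHREADS];
    for (long i = 0; i < N; i++) A[i] = 1;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&handles[t], NULL, partial_sum, (void*) t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(handles[t], NULL);

    long total = 0;            /* only the main thread touches the shared total */
    for (long t = 0; t < NTHREADS; t++)
        total += partial[t];
    printf("total = %ld\n", total);
    return 0;
}

Since no two threads ever write the same variable, the stale-read problem above can't happen; the shared total is only updated after every thread has finished.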

![[Multicore_Architectures.pdf#page=11]]

Here:

Speed on a single processor

On single-core processors there's:

The only thing we can really change here in the SW would be handling the threads, which is what we'll do.

![[Multicore_Architectures.pdf#page=13]]

The chart on the left shows states of a program:

Once a program has used up its quantum of execution time, it may tell the OS what other memory it needs before going into the waiting state, and the OS then moves a different program into the running state (the ready and waiting programs are kept in queues by the OS).

If we have multiple cores, we can have one thread on each core:

![[Multicore_Architectures.pdf#page=14]]

When you actually break a program into multiple threads, it makes sense to divide it amongst your cores. If you have 4 cores, divide the program into 4 threads. If you have n operations, do NOT create n/4 threads; create 4 threads, one per core, each handling roughly n/4 of the operations!
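
For instance, a small sketch of picking the thread count from the hardware rather than from the problem size (sysconf with _SC_NPROCESSORS_ONLN is a POSIX-style query available on Linux and macOS; other systems differ):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* ask the OS how many cores are currently online and use that
       as the thread count, instead of scaling with the problem size */
    long n_cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Launching %ld threads, one per core\n", n_cores);
    return 0;
}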

Memory Types on Multiprocessors

![[Multicore_Architectures.pdf#page=15]]

As a summary:

A GPU will have shared memory.

This gives rise to the "official" definition for multicore CPUs:

![[Multicore_Architectures.pdf#page=16]]

An Aside: Simultaneous Multithreading

Sometimes you have different ALU operations that you want to execute at the same time (e.g., int and float operations); in that case you can have different threads that run concurrently:

![[Multicore_Architectures.pdf#page=18]]

Note though that this speedup is very dependent on the HW implementation (whether it's supported at all), as well as on whether your threads are broken up correctly to take advantage of it.
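
As a rough illustration (the workloads and names here are made up, and whether the hardware actually co-schedules the two threads on one core is up to the CPU and OS, not to us), two threads that stress different functional units are the kind of pair SMT can overlap:

#include <stdio.h>
#include <pthread.h>

/* integer-heavy work: mostly uses the integer ALU */
void* int_work(void* arg) {
    long long acc = 0;
    for (long i = 0; i < 100000000L; i++) acc += i;
    printf("int result: %lld\n", acc);
    return NULL;
}

/* floating-point-heavy work: mostly uses the FP unit */
void* float_work(void* arg) {
    double acc = 0.0;
    for (long i = 0; i < 100000000L; i++) acc += 0.5;
    printf("float result: %f\n", acc);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, int_work, NULL);
    pthread_create(&t2, NULL, float_work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}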

Cache Coherence Problem

If Core A has cached the value at memory address 0x1 and Core B also has its own copy of that address from main memory, then when Core A updates that location, the copy in Core B's cache is now stale and must be invalidated.

We solve this via:

![[Multicore_Architectures.pdf#page=21]]

We don't have control over this (it's in HW), but it's a problem to be aware of.

OS "review": Process Model

See:

![[1.1pthreads.pdf#page=2]]

Here:

The OS moves a process between these states as we saw before:

![[1.1pthreads.pdf#page=3]]

The OS has a Process Control Block (PCB) that has:

![[1.1pthreads.pdf#page=4]]

![[1.1pthreads.pdf#page=5]]

The above slide shows why context switching is so expensive!
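
As a rough sketch (the exact fields depend on the OS; this struct is illustrative, not the slide's definition), the PCB bundles everything the OS must save and restore on a switch:

/* illustrative only: a simplified picture of what a PCB stores;
   real kernels keep far more state than this */
struct pcb {
    int            pid;            /* process ID */
    int            state;          /* ready, running, or waiting */
    unsigned long  pc;             /* saved program counter */
    unsigned long  registers[32];  /* saved CPU registers */
    void*          page_table;     /* memory-management info */
    int            open_files[16]; /* I/O and file descriptor state */
};

Saving and restoring all of this (and losing the warm cache contents in the process) is why a context switch costs so much.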

OS Threads

So far, each process has had a single thread of execution. But consider having multiple program counters (PCs) per process: multiple locations in the code can then execute at once.

We must then have storage for the per-thread details, i.e., multiple PCs in the PCB.

![[1.1pthreads.pdf#page=7]]

The following slide is the most important for the whole class:

![[1.1pthreads.pdf#page=8]]

The important things are that:

The code would be:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* global variable accessible to all threads */
long threads_count;

void* Hello(void* rank);

int main(int argc, char* argv[]){
	long thread;
	pthread_t* thread_handles;

	/* get the number of threads from the command line */
	threads_count = strtol(argv[1], NULL, 10);
	thread_handles = malloc(threads_count * sizeof(pthread_t));

	/* launch one thread per rank */
	for(thread = 0; thread < threads_count; thread++)
		pthread_create(&thread_handles[thread], NULL, Hello, (void*) thread);

	printf("Hello from the main thread\n");

	/* wait for every thread to finish */
	for(thread = 0; thread < threads_count; thread++)
		pthread_join(thread_handles[thread], NULL);

	free(thread_handles);
	return 0;
}

void* Hello(void* rank){
	long my_rank = (long) rank;
	printf("hello from thread %ld of %ld\n", my_rank, threads_count);
	return NULL;
}

Check out the Google Colab notebook to try running this code and see what we're talking about.
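
To run it outside Colab: compile with the pthread library linked in, e.g. `gcc -o pth_hello pth_hello.c -lpthread` (the file name pth_hello.c is just an assumption), and pass the thread count on the command line, e.g. `./pth_hello 4`. Note that the order of the "hello" lines can differ from run to run, since the threads are scheduled independently.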

Example:

Consider the following code. Is there a race condition?

void* p_matvec(void* id)
{
	/* A, x, y, and nCOL are shared globals; each thread computes one row of y = A*x */
	long threadID = (long) id;
	int j;

	y[threadID] = 0.0;
	for(j = 0; j < nCOL; ++j)
	{
		y[threadID] += A[threadID][j] * x[j];
	}
	return NULL;
}

So no, there's no race condition: the data being written (namely y[threadID]) isn't read or written by any other thread. Note that even though there is shared memory, we don't have a race condition, because we've partitioned the work so that each thread writes only its own element of y.
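
For reference, a minimal sketch of the surrounding setup this snippet assumes (the slide doesn't show the globals or the thread launch; the sizes nROW and nCOL, the initial values, and the main function here are assumptions made just to get something runnable):

#include <stdio.h>
#include <pthread.h>

#define nROW 4               /* assumed sizes: one thread per row */
#define nCOL 4

/* shared globals: every thread reads A and x,
   but each thread writes only its own element of y */
double A[nROW][nCOL], x[nCOL], y[nROW];

void* p_matvec(void* id)     /* same routine as above */
{
    long threadID = (long) id;
    y[threadID] = 0.0;
    for (int j = 0; j < nCOL; ++j)
        y[threadID] += A[threadID][j] * x[j];
    return NULL;
}

int main(void)
{
    pthread_t handles[nROW];

    /* fill A and x with something simple */
    for (int i = 0; i < nROW; i++)
        for (int j = 0; j < nCOL; j++)
            A[i][j] = 1.0;
    for (int j = 0; j < nCOL; j++)
        x[j] = 2.0;

    /* one thread per row of y = A*x */
    for (long t = 0; t < nROW; t++)
        pthread_create(&handles[t], NULL, p_matvec, (void*) t);
    for (long t = 0; t < nROW; t++)
        pthread_join(handles[t], NULL);

    for (int i = 0; i < nROW; i++)
        printf("y[%d] = %f\n", i, y[i]);
    return 0;
}

With these values, every element of y comes out as 8.0 (a row of four 1.0s dotted with a vector of 2.0s), and the result is the same no matter how the threads interleave, precisely because no two threads touch the same element of y.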