Lecture 2 - Continuing Multicore
This lecture largely continues from the last class.
Recall from last lecture that we moved to multicore because of the power/temperature limits that we hit, even before hitting the transistor limits.
Now, you must re-think your algorithms to work in a parallel fashion. This requires that you:
- Program for performance
- Balance your load amongst your hardware
- Optimize communication and synchronization
![[Multicore_Architectures.pdf#page=9]]
Notice that each core has its own L1 and L2 cache, while the L3 is shared among all cores. The L1/L2 memories are local to a core and cannot be accessed by other cores. Of course, the exact arrangement is architecture-specific, but the idea of memory locality is very important for programming purposes. Furthermore, there is shared memory via the L3 in this example: if you have processes that need to communicate, the shared level can help facilitate that.
Again, we don't really care whether it's an L1, L2, L3, L17, L-whatever cache; we care about whether or not it's shared memory.
Types of Parallelism
We have the following:
- Task-level parallelism: completely different tasks, no dependency
- ex: downloading software and running antivirus
- Instruction-Level: parallelism at the machine-instruction level (i.e., combining instructions in the HW), such as executing a load and an arithmetic operation at the same time.
- Processor can reorder, pipeline, branch predict, etc. (only done in HW)
- Data-Level: vector units (mainly for this class)
- Instruction set extensions such as SSE, mainly for multimedia. The same instruction operates on different pieces of data. For instance, a `for` loop can be easily parallelized (assuming the iterations touch different data):
```c
for (int i = 0; i < 10000000; i++)
{
    sum += A[i];
}
```
which we can assign to different cores, each summing a chunk of the array.
But this creates a race condition: if every core writes to the same `sum` variable, one core's update may be based on a stale value that another core has already changed. For example, core 1 may have a partial sum `sum1 = 5` and core 2 may have `sum2 = 7`, with the running total currently `sumT = 8`. If core 1 reads `sumT` (seeing 8) before core 2 has written its own update back, both cores add to the same stale value, and whichever writes last overwrites the other core's work: `sumT` ends up as 13 or 15 instead of 20.
But as parallel programmers, we don't want to wait! So we won't, and we'll try to overcome these challenges when they arise. It's on you to determine why and when this happens (one common fix, per-thread partial sums, is sketched just after this list).
- Thread-level: same program can run separate threads
- ex: a word app running spell check in one thread and printing in another thread
- each thread operates with its own full set of registers (a whole instruction stream), rather than just on multiple pieces of data
- threads are just lightweight processes, i.e., sub-units of execution under a process. They can be executed independently on the CPU and are usually managed by the OS.
- single-core superscalar processors cannot fully exploit TLP.
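To make the race condition above concrete, here is a minimal pthreads sketch (mine, not from the slides) of the summation where each thread accumulates into its own slot of a `partial` array and only the main thread combines the results at the end, so no two threads ever write to the same location. The thread count, array contents, and all names here are made up for illustration.
```c
#include <stdio.h>
#include <pthread.h>

#define N 10000000
#define NUM_THREADS 4            /* made-up thread count for illustration */

int A[N];
long partial[NUM_THREADS];       /* one slot per thread: no shared writes */

void* partial_sum(void* arg) {
    long t = (long) arg;
    long chunk = N / NUM_THREADS;
    long start = t * chunk;
    long end = (t == NUM_THREADS - 1) ? N : start + chunk;
    long s = 0;
    for (long i = start; i < end; i++) s += A[i];
    partial[t] = s;              /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t handles[NUM_THREADS];
    long sum = 0;

    for (long i = 0; i < N; i++) A[i] = 1;

    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&handles[t], NULL, partial_sum, (void*) t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(handles[t], NULL);

    /* only the main thread touches sum, and only after all workers finish */
    for (long t = 0; t < NUM_THREADS; t++) sum += partial[t];

    printf("sum = %ld\n", sum);  /* prints 10000000 */
    return 0;
}
```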
![[Multicore_Architectures.pdf#page=11]]
Here:
- An instruction stream roughly corresponds to a core executing its own sequence of instructions
- A data stream is the data those instructions operate on
Speed on a single processor
On single-core processors there's:
- Pipeline
- Branch prediction
- Superscalar
- Out-of-order
- Register renaming
- SIMD (a small sketch follows this list)
- Multithreading
- On a one-processor machine, the threads only give the illusion of running in parallel (context switching)
- Without context switching, we have to execute each thread one at a time, which cripples our performance. But how can one CPU do parallel work?
- We do a little bit of thread 1 for a few instructions, then do thread 2 instructions for a bit, then switch back and forth.
- SW threads are managed by the OS
- HW threads are managed by the HW
- Context switching is expensive, so we have to be careful with it.
The only thing we can really change here in the SW would be handling the threads, which is what we'll do.
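As a concrete taste of the SIMD item above (this example is mine, not from the slides): with SSE intrinsics, a single instruction adds four `float`s at a time. A minimal sketch, assuming an x86-64 machine (where SSE is always available):
```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    /* each iteration adds 4 floats with one SSE add instruction */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```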
![[Multicore_Architectures.pdf#page=13]]
The chart on the left shows states of a program:
- You create a new program
- It'll be ready to be started (given memory)
- It'll start running (using memory)
- It might wait (still has access to memory)
- It ends execution (releases memory)
Once a program has used up its quantum of execution time, it may tell the OS what other memory it needs before going into the waiting state, and then the OS moves a different program into running (the ready/waiting programs are kept in queues by the OS).
If we have multiple cores, we can have one thread on each core:
![[Multicore_Architectures.pdf#page=14]]
When you actually break a program into multiple threads, it makes sense to divide it amongst your cores. If you have 4 cores, divide the program into 4 threads; in general, matching the number of threads to the number of cores is the natural starting point.
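If you want the thread count to match the machine, on Linux and most Unix-likes you can ask the OS how many cores are online. A small sketch (mine, not from the lecture):
```c
#include <stdio.h>
#include <unistd.h>   /* sysconf */

int main(void) {
    /* number of processors currently online */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("this machine has %ld cores online\n", cores);
    return 0;
}
```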
Memory Types on Multiprocessors
![[Multicore_Architectures.pdf#page=15]]
As a summary:
- Hard drive memory is persistent, i.e., non-volatile (long R/W times)
- Shared memory (the CPU's main memory) is volatile
- Each core has its own cache, which is local to that core.
A GPU will have shared memory.
This gives rise to the "official" definition for multicore CPUs:
![[Multicore_Architectures.pdf#page=16]]
An Aside: Simultaneous Multithreading
Sometimes, if you have different ALU operations that you want to execute at the same time (ex: `int` and `float` operations), then you can have different threads that run concurrently on the same core:
![[Multicore_Architectures.pdf#page=18]]
Note though that this speedup is very dependent both on the HW implementation (whether it's supported at all) and on whether your threads are broken up in a way that can take advantage of it.
Cache Coherence Problem
If Core A has cached memory address `0x1` and Core B has its own copy of `0x1` from main memory, then when Core A updates that location, the cached copy in Core B is now out of date and must be invalidated.
We solve this via:
![[Multicore_Architectures.pdf#page=21]]
We don't have control over this (it's in HW), but it's a problem to be aware of.
OS "review": Process Model
See:
![[1.1pthreads.pdf#page=2]]
Here:
- A program becomes a process when it's currently running.
- It's given an address space in which it can run (from its relative `0x0` to `0xMAX`)
- The heap is for dynamically allocated memory; the stack is for local variables (of the executing functions)
The OS takes it between states as we saw before:
![[1.1pthreads.pdf#page=3]]
The OS has a Process Control Block (PCB) for each process, which holds (a conceptual sketch follows this list):
- Each process state
- Each process number
- Each program counter (PC)
- Registers
- Memory Limits
- List of open files
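Conceptually, the PCB is just a record of those fields. The struct below is a sketch of mine, not how any real OS actually defines it (Linux's `task_struct`, for example, is far more complex), and all sizes are made up:
```c
/* Conceptual sketch of a PCB, mirroring the fields listed above. */
typedef enum { NEW, READY, RUNNING, WAITING, TERMINATED } proc_state_t;

typedef struct {
    proc_state_t  state;               /* process state */
    int           pid;                 /* process number */
    unsigned long pc;                  /* program counter */
    unsigned long registers[32];       /* saved register contents (size made up) */
    unsigned long mem_base, mem_limit; /* memory limits */
    int           open_files[16];      /* open file descriptors (size made up) */
} pcb_t;
```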
![[1.1pthreads.pdf#page=4]]
![[1.1pthreads.pdf#page=5]]
The above slide shows why context switching is so expensive!
OS Threads
So far, the process has had a single thread of execution. But consider having multiple PCs per process: multiple locations in the program can then execute at once.
We must then have storage for the per-thread details, including multiple PCs, in the PCB.
![[1.1pthreads.pdf#page=7]]
The following slide is the most important for the whole class:
![[1.1pthreads.pdf#page=8]]
The important things are that:
- Each thread has its own registers and its own stack
- The code segment, data segment, and open files are all shared between threads.
- Is the heap local or shared? This matters: if it's shared we need to synchronize, and if not, we don't care.
- The heap is shared memory (a small sketch of this follows the list).
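A tiny sketch of that point (mine, not from the slides): a heap allocation made by the main thread is visible to every thread through a shared pointer, while each thread's local variables live on its own private stack. The names here are made up.
```c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

int* shared_counter;              /* points into the heap: visible to all threads */

void* worker(void* arg) {
    long id = (long) arg;         /* local variable: lives on this thread's stack */
    /* both threads see the same heap location through shared_counter */
    printf("thread %ld sees *shared_counter = %d\n", id, *shared_counter);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    shared_counter = malloc(sizeof(int));   /* heap memory, shared by all threads */
    *shared_counter = 42;

    pthread_create(&t1, NULL, worker, (void*) 1L);
    pthread_create(&t2, NULL, worker, (void*) 2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    free(shared_counter);
    return 0;
}
```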
The basic pthreads "hello" example code would be:
```c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* global variable accessible to all threads */
long threads_count;

void* Hello(void* rank);

int main(int argc, char* argv[]) {
    long thread;
    pthread_t* thread_handles;

    /* get number of threads from the command line */
    threads_count = strtol(argv[1], NULL, 10);
    thread_handles = malloc(threads_count * sizeof(pthread_t));

    for (thread = 0; thread < threads_count; thread++)
        pthread_create(&thread_handles[thread], NULL, Hello, (void*) thread);

    printf("Hello from the main thread\n");

    for (thread = 0; thread < threads_count; thread++)
        pthread_join(thread_handles[thread], NULL);

    free(thread_handles);
    return 0;
}

void* Hello(void* rank) {
    long my_rank = (long) rank;
    printf("hello from thread %ld of %ld\n", my_rank, threads_count);
    return NULL;
}
```
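If you want to build it outside of Colab: something like `gcc -pthread hello.c -o hello` (assuming you saved the file as `hello.c`, a made-up name) and then `./hello 4` to spawn 4 threads should work. Note that the order of the "hello from thread ..." lines is not deterministic, since the OS may schedule the threads in any order.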
Check out the Google Colab to try running this code and see what we're talking about.
Example:
Consider the following code. Is there a race condition?
```c
void* p_matvec(void* id)
{
    int threadID = (int)(long) id;   /* thread rank passed via the void* argument */
    int j;

    y[threadID] = 0.0;
    for (j = 0; j < nCOL; ++j)
    {
        y[threadID] += A[threadID][j] * x[j];
    }
    return NULL;
}
```
So no, there's no race condition: each thread writes only its own entry of `y`, and that entry is never read or written by any other thread. Note that even though memory is shared, we don't get a race condition here because we've partitioned the work so that no two threads touch the same location.
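For context, here is a minimal driver sketch (mine, not from the lecture) showing how `p_matvec` might be launched with one thread per row. The globals `A`, `x`, `y`, and `nCOL` match the names used in the snippet; the sizes and test values are made up, and the body of `p_matvec` is repeated so the sketch compiles on its own.
```c
#include <stdio.h>
#include <pthread.h>

#define nROW 4                /* made-up sizes for illustration */
#define nCOL 4

/* globals shared by all threads, matching the names in p_matvec */
double A[nROW][nCOL], x[nCOL], y[nROW];

void* p_matvec(void* id)
{
    int threadID = (int)(long) id;
    int j;

    y[threadID] = 0.0;
    for (j = 0; j < nCOL; ++j)
    {
        y[threadID] += A[threadID][j] * x[j];
    }
    return NULL;
}

int main(void)
{
    pthread_t handles[nROW];
    long t;

    /* simple test data: x is all 1s, row i of A is all (i+1) */
    for (int j = 0; j < nCOL; j++) x[j] = 1.0;
    for (int i = 0; i < nROW; i++)
        for (int j = 0; j < nCOL; j++) A[i][j] = i + 1;

    /* one thread per row: thread t writes only y[t], so no race */
    for (t = 0; t < nROW; t++)
        pthread_create(&handles[t], NULL, p_matvec, (void*) t);
    for (t = 0; t < nROW; t++)
        pthread_join(handles[t], NULL);

    for (int i = 0; i < nROW; i++)
        printf("y[%d] = %f\n", i, y[i]);
    return 0;
}
```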