Lecture 6 - Applying OpenMP to Convolution

As the big picture: if you had a really big (terabyte-size) matrix, we use distributed computing to split the work across multiple computers, and then on each computer we use parallel computing (OpenMP) to use all of its cores.
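To make that concrete, here is a minimal hybrid sketch (my own illustration, not code from the lecture; the block size and the summing work are made up): MPI splits the matrix's rows across machines, and OpenMP spreads each machine's share over its cores. Compile with something like mpicc -fopenmp.

#include <mpi.h>
#include <stdio.h>

#define ROWS_PER_RANK 1000
#define COLS 1000

// hypothetical block each machine owns; real code would scatter pieces
// of the huge matrix here (e.g. with MPI_Scatter or parallel file I/O)
static double block[ROWS_PER_RANK][COLS];

int main(int argc, char **argv)
{
	MPI_Init(&argc, &argv);
	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	double local_sum = 0.0;
	// OpenMP: spread this machine's rows across all of its cores
	#pragma omp parallel for reduction(+:local_sum)
	for (int y = 0; y < ROWS_PER_RANK; y++)
		for (int x = 0; x < COLS; x++)
			local_sum += block[y][x];

	// MPI: combine the per-machine results on rank 0
	double total = 0.0;
	MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
	if (rank == 0)
		printf("total = %f\n", total);

	MPI_Finalize();
	return 0;
}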

Convolution

We'll be referencing:

![[Convolution.pdf]]

We know how to do convolution with a matrix. However, looking at the code, notice that for the edge cells we are missing neighbor values. We can pad with zeros, replicate the nearest edge value, wrap around, or simply skip the border cells.
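As a small illustration of the replicate option (mine, not from the slides), clamping just reuses the nearest valid index, so the border repeats the edge values:

// clamp an index into [0, n-1]; out-of-range neighbors reuse the
// nearest edge cell (zero-padding would instead treat them as 0)
static inline int clamp(int i, int n)
{
	if (i < 0)     return 0;
	if (i > n - 1) return n - 1;
	return i;
}

// usage inside the stencil, e.g.:  in[clamp(y-1, N)][x]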

Some cool info on CNNs (Convolutional Neural Networks):

![[Convolution.pdf#page=7]]

We can check with a parallel implementation:

![[OpenMP-Stencil.pptx.pdf]]

How would you accelerate the sequential code?

int iter_count = NUM_TIMES_TO_APPLY_FILTER;

// no pragma here, due to the dependency of swapping our out/in arrays
for(int i = 0; i < iter_count; i++)
{
	// divide the work per row; collapse(2) also folds in the inner loop's
	// iterations. Make sure not to parallelize by column only, as you'll get
	// more cache misses.
	#pragma omp parallel for collapse(2)
	for(int y = 1; y < N - 1; y++)	// skip the border cells (missing neighbors)
	{
		// putting the pragma here instead makes each thread work per column,
		// but you get more cache misses due to moving through huge arrays.
		for(int x = 1; x < N - 1; x++)
		{
			out[y][x] = in[y][x] + (
				ctop * in[y-1][x] + 
				cbot * in[y+1][x] +
				clft * in[y][x-1] + 
				crht * in[y][x+1]
			) / SPEED;
		}
	}
	// swap the buffers so the next iteration reads what we just wrote
	// (tmp has the same pointer type as in/out)
	tmp = out;
	out = in;
	in = tmp;
}

This shows an important idea: put the #pragma on the outermost parallelizable loop, so that a whole related section of work runs inside each thread.

You can also add a simd pragma on the inner loop, so that each thread executes a single instruction on multiple pieces of data at once (vectorization):

// ...
#pragma omp parallel for
for(int y = 1; y < N - 1; y++)
{
	#pragma omp simd
	for(int x = 1; x < N - 1; x++)
	{
		// same stencil body as above
	}
}
// ...

So in the 3D case, if you collapse the loops, the iteration space is flattened into one big pool: the order in which threads touch memory is less predictable, but the work is distributed more evenly across the threads.
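A hedged sketch of what that looks like (out3 and in3 are my names for N x N x N arrays, following the 2D example above):

// collapse(3) flattens all three loops into one iteration space of
// (N-2)^3 chunks, so threads get near-equal shares even when one
// dimension alone has fewer iterations than we have threads
#pragma omp parallel for collapse(3)
for (int z = 1; z < N - 1; z++)
	for (int y = 1; y < N - 1; y++)
		for (int x = 1; x < N - 1; x++)
			out3[z][y][x] = (in3[z-1][y][x] + in3[z+1][y][x] +
			                 in3[z][y-1][x] + in3[z][y+1][x] +
			                 in3[z][y][x-1] + in3[z][y][x+1]) / 6.0;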

omp target Directive

The target directive instructs the compiler to generate a target task, i.e. to map variables to a device data environment and to execute the enclosed block of code on that device.
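A hedged sketch of offloading the earlier stencil (assuming in and out are statically sized N x N arrays here, so the whole arrays can be mapped):

// map the input to the device, run the stencil there, copy the result back
#pragma omp target map(to: in) map(from: out)
#pragma omp teams distribute parallel for collapse(2)
for (int y = 1; y < N - 1; y++)
	for (int x = 1; x < N - 1; x++)
		out[y][x] = in[y][x] + (
			ctop * in[y-1][x] + cbot * in[y+1][x] +
			clft * in[y][x-1] + crht * in[y][x+1]
		) / SPEED;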

So far we have been accelerating on the CPU's cores. But with a GPU (Intel or NVIDIA), we can send code to the accelerator with just one #pragma. The CPU has its own L1/L2 caches (with a shared L3), and the GPU has a similar hierarchy for many more cores. The process is:

  1. Send our program from main memory (CPU cache) to GPU memory (shared)