Lecture 6 - Applying OpenMP to Convolution

As the big picture: if you had a really big (terabyte-size) matrix, we use distributed computing to split the work across multiple computers, and then on each computer we use parallel computing (OpenMP) to use all of its cores.
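To make that concrete, here is a minimal hybrid sketch (my own illustration, not code from the lecture; the block size and the summing work are made up): MPI splits the matrix's rows across machines, and OpenMP spreads each machine's share over its cores. Compile with something like mpicc -fopenmp.

#include <mpi.h>
#include <stdio.h>

#define ROWS_PER_RANK 1000
#define COLS 1000

// hypothetical block each machine owns; real code would scatter pieces
// of the huge matrix here (e.g. with MPI_Scatter or parallel file I/O)
static double block[ROWS_PER_RANK][COLS];

int main(int argc, char **argv)
{
	MPI_Init(&argc, &argv);
	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	double local_sum = 0.0;
	// OpenMP: spread this machine's rows across all of its cores
	#pragma omp parallel for reduction(+:local_sum)
	for (int y = 0; y < ROWS_PER_RANK; y++)
		for (int x = 0; x < COLS; x++)
			local_sum += block[y][x];

	// MPI: combine the per-machine results on rank 0
	double total = 0.0;
	MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
	if (rank == 0)
		printf("total = %f\n", total);

	MPI_Finalize();
	return 0;
}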

Convolution

We'll be referencing:

![[Convolution.pdf]]

We know how to do convolution with a matrix. However, looking at the code, notice that for the edge cells we are missing neighbor values. We can pad with zeros, replicate the nearest edge value, wrap around, or simply skip the border cells.
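As a small illustration of the replicate option (mine, not from the slides), clamping just reuses the nearest valid index, so the border repeats the edge values:

// clamp an index into [0, n-1]; out-of-range neighbors reuse the
// nearest edge cell (zero-padding would instead treat them as 0)
static inline int clamp(int i, int n)
{
	if (i < 0)     return 0;
	if (i > n - 1) return n - 1;
	return i;
}

// usage inside the stencil, e.g.:  in[clamp(y-1, N)][x]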

Some cool info on CNNs (Convolutional Neural Networks):

![[Convolution.pdf#page=7]]

We can check with a parallel implementation:

![[OpenMP-Stencil.pptx.pdf]]

How would you accelerate the sequential code?

int iter_count = NUM_TIMES_TO_APPLY_FILTER;

// no pragma here, due to the dependency of swapping our out/in arrays
for(int i = 0; i < iter_count; i++)
{
	// divide the work per row; collapse(2) also folds in the inner loop's
	// iterations. Make sure not to parallelize by column only, as you'll get
	// more cache misses.
	#pragma omp parallel for collapse(2)
	for(int y = 1; y < N - 1; y++)	// skip the border cells (missing neighbors)
	{
		// putting the pragma here instead makes each thread work per column,
		// but you get more cache misses due to moving through huge arrays.
		for(int x = 1; x < N - 1; x++)
		{
			out[y][x] = in[y][x] + (
				ctop * in[y-1][x] + 
				cbot * in[y+1][x] +
				clft * in[y][x-1] + 
				crht * in[y][x+1]
			) / SPEED;
		}
	}
	// swap the buffers so the next iteration reads what we just wrote
	// (tmp has the same pointer type as in/out)
	tmp = out;
	out = in;
	in = tmp;
}

This shows an important idea: put the #pragma on the outermost parallelizable loop, so that a whole related section of work runs inside each thread.

You can also add a simd pragma on the inner loop, so that each thread executes a single instruction on multiple pieces of data at once (vectorization):

// ...
#pragma omp parallel for
for(int y = 1; y < N - 1; y++)
{
	#pragma omp simd
	for(int x = 1; x < N - 1; x++)
	{
		// same stencil body as above
	}
}
// ...

So in the 3D case, if you collapse the loops, the iteration space is flattened into one big pool: the order in which threads touch memory is less predictable, but the work is distributed more evenly across the threads.
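A hedged sketch of what that looks like (out3 and in3 are my names for N x N x N arrays, following the 2D example above):

// collapse(3) flattens all three loops into one iteration space of
// (N-2)^3 chunks, so threads get near-equal shares even when one
// dimension alone has fewer iterations than we have threads
#pragma omp parallel for collapse(3)
for (int z = 1; z < N - 1; z++)
	for (int y = 1; y < N - 1; y++)
		for (int x = 1; x < N - 1; x++)
			out3[z][y][x] = (in3[z-1][y][x] + in3[z+1][y][x] +
			                 in3[z][y-1][x] + in3[z][y+1][x] +
			                 in3[z][y][x-1] + in3[z][y][x+1]) / 6.0;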

omp target Directive

The target directive instructs the compiler to generate a target task, i.e. to map variables to a device data environment and to execute the enclosed block of code on that device.
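A hedged sketch of offloading the earlier stencil (assuming in and out are statically sized N x N arrays here, so the whole arrays can be mapped):

// map the input to the device, run the stencil there, copy the result back
#pragma omp target map(to: in) map(from: out)
#pragma omp teams distribute parallel for collapse(2)
for (int y = 1; y < N - 1; y++)
	for (int x = 1; x < N - 1; x++)
		out[y][x] = in[y][x] + (
			ctop * in[y-1][x] + cbot * in[y+1][x] +
			clft * in[y][x-1] + crht * in[y][x+1]
		) / SPEED;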

So far we have been accelerating on the CPU's cores. But with a GPU (Intel or NVIDIA), we can send code to the accelerator with just one #pragma. The CPU has its own L1/L2 caches (with a shared L3), and the GPU has a similar hierarchy for many more cores. The process is:

  1. Send our program from main memory (CPU cache) to GPU memory (shared)