Lecture 6 - Applying OpenMP to Convolution
The big picture: if you had a really big (terabyte-size) matrix, we'd use distributed computing to split the work across multiple computers, and then on each computer we'd use parallel computing (OpenMP) to take advantage of all the cores.
Convolution
We'll be referencing:
![[Convolution.pdf]]
We know how to do convolution with a matrix. However, looking at the code, notice that at the edges we are missing values. We can:
- Just ignore the edges and reduce the size of the filtered image by the filter radius on each side (one pixel per side for a 3×3 filter).
- Zero-pad by treating the values outside of the image as zeroes.
- Do copy padding by duplicating the edge pixels up/down/left/right.
- Do wrap-around padding, where we "wrap around" the image/array to the other side (topologically, treating it as a torus). This produces a very smooth result at the edges. (A sketch of how each strategy can be implemented follows this list.)
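A minimal sketch of the padding strategies (hypothetical helper names, with the image assumed to be a flattened n×n array): route every pixel read through a small index helper instead of special-casing the loop bounds.

```c
// Zero padding: out-of-range reads return 0.
double read_zero(const double *img, int n, int y, int x)
{
    if (y < 0 || y >= n || x < 0 || x >= n)
        return 0.0;
    return img[y * n + x];
}

// Copy padding: clamp the index to the nearest edge pixel.
double read_clamp(const double *img, int n, int y, int x)
{
    y = y < 0 ? 0 : (y >= n ? n - 1 : y);
    x = x < 0 ? 0 : (x >= n ? n - 1 : x);
    return img[y * n + x];
}

// Wrap-around padding: indices wrap to the other side (a torus).
double read_wrap(const double *img, int n, int y, int x)
{
    y = (y % n + n) % n;
    x = (x % n + n) % n;
    return img[y * n + x];
}
```

The filter loop can then run over every pixel, with the chosen helper deciding what an out-of-bounds neighbor reads as.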
Some cool info on CNNs (Convolutional Neural Networks):
- We don't try to connect every pixel to every other pixel in the NN. Rather, we apply local filters, since pixels near each other should influence each other more.
- Instead of hand-picking the filters, we train the weights of the filters we use.
![[Convolution.pdf#page=7]]
We can check with a parallel implementation:
![[OpenMP-Stencil.pptx.pdf]]
How would you accelerate the sequential code?
```c
int iter_count = NUM_TIMES_TO_APPLY_FILTER;
// No pragma on this loop: each iteration depends on the previous
// one through the swap of the out/in arrays.
for (int i = 0; i < iter_count; i++)
{
    // Divide the work per row; collapse(2) also folds in the inner
    // loop, giving the scheduler one big iteration space.
    // Make sure not to split the work by column, or you'll get
    // strided accesses and many more cache misses.
    #pragma omp parallel for collapse(2)
    for (int y = 1; y < N - 1; y++)   // interior only; see edge handling above
    {
        // Putting the pragma here instead would parallelize row by
        // row: you pay the fork/join cost once per row and get more
        // cache misses as threads move through the huge arrays.
        for (int x = 1; x < N - 1; x++)
        {
            out[y][x] = in[y][x] + (
                ctop * in[y-1][x] +
                cbot * in[y+1][x] +
                clft * in[y][x-1] +
                crht * in[y][x+1]
            ) / SPEED;
        }
    }
    tmp = out;
    out = in;
    in  = tmp;
}
```
This shows an important idea: putting the `#pragma` on the outermost loop of a parallelizable section is best.
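As a refinement (a sketch, assuming the same arrays and coefficients as above): you can hoist the fork/join out of the iteration loop by opening one `parallel` region around it and letting a `single` thread do the pointer swap; the implicit barriers at the end of `for` and `single` keep the threads in step.

```c
#pragma omp parallel
for (int i = 0; i < iter_count; i++)
{
    // Threads are forked once; every iteration they share the rows.
    #pragma omp for collapse(2)
    for (int y = 1; y < N - 1; y++)
        for (int x = 1; x < N - 1; x++)
            out[y][x] = in[y][x] + (
                ctop * in[y-1][x] + cbot * in[y+1][x] +
                clft * in[y][x-1] + crht * in[y][x+1]
            ) / SPEED;
    // Implicit barrier here: all writes to out are finished.

    // Exactly one thread swaps the shared pointers; the implicit
    // barrier at the end of 'single' releases the rest.
    #pragma omp single
    {
        tmp = out;
        out = in;
        in  = tmp;
    }
}
```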
You can also use a `simd` directive for the second `#pragma`, because it lets a single instruction operate on multiple pieces of data (vectorization):
```c
// ...
#pragma omp simd
for (int x = 1; x < N - 1; x++)
{
    // (stencil update as above, now vectorized within the row)
}
// ...
```
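Combining both levels (same assumptions as above): threads split the rows, and each thread vectorizes along its own row.

```c
#pragma omp parallel for
for (int y = 1; y < N - 1; y++)
{
    #pragma omp simd
    for (int x = 1; x < N - 1; x++)
        out[y][x] = in[y][x] + (
            ctop * in[y-1][x] + cbot * in[y+1][x] +
            clft * in[y][x-1] + crht * in[y][x+1]
        ) / SPEED;
}
```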
So in the 3D case, if you `collapse` the loop nest you may get work handed out in a scattered ("randomized") order, losing some locality, but the work will be distributed more evenly across the threads.
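For instance (a sketch with a hypothetical N×N×N volume `in3`/`out3` and a 6-neighbor update): `collapse(3)` flattens the whole nest into one iteration space, so the scheduler can hand any thread work from anywhere in the volume.

```c
#pragma omp parallel for collapse(3)
for (int z = 1; z < N - 1; z++)
    for (int y = 1; y < N - 1; y++)
        for (int x = 1; x < N - 1; x++)
            out3[z][y][x] = in3[z][y][x] + (
                in3[z-1][y][x] + in3[z+1][y][x] +
                in3[z][y-1][x] + in3[z][y+1][x] +
                in3[z][y][x-1] + in3[z][y][x+1]
            ) / SPEED;
```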
`omp target` Directive
The `target` directive instructs the compiler to generate a target task, i.e., to map variables to a device data environment and to execute the enclosed block of code on that device.
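A minimal sketch of what that can look like for our filter (assuming flattened N×N arrays; the `map` clauses build the device data environment the definition mentions):

```c
// Copy in[] to the device, run the loop nest there, copy out[] back.
#pragma omp target teams distribute parallel for collapse(2) \
        map(to: in[0:N*N]) map(from: out[0:N*N])
for (int y = 1; y < N - 1; y++)
    for (int x = 1; x < N - 1; x++)
        out[y*N + x] = in[y*N + x] + (
            ctop * in[(y-1)*N + x] + cbot * in[(y+1)*N + x] +
            clft * in[y*N + (x-1)] + crht * in[y*N + (x+1)]
        ) / SPEED;
```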
We have a CPU, and so far we have been accelerating across its cores. But with a GPU (Intel or NVIDIA), we can send code to the accelerator with just one `#pragma`. The CPU has its own L1/L2 caches (with a shared L3), and the GPU has a similar hierarchy for many more cores. The process is:
- Send our program and data from main memory (CPU side) to GPU memory (shared)