# Smoothing with SIMD

Suppose we want to take a bitmap image, and "smooth" it by replacing each pixel with the majority value among itself and its eight neighbours. One way of approaching this on a multi-core machine is to write the obvious doubly-nested `for` loop, then slice the outer loop into bands, computing each band in a separate thread that can run on its own core, and arranging that there are about as many bands as there are cores. This can be arranged either by writing in a programming language with a suitable parallel `for` construct, or by creating and synchronising multiple worker threads explicity.

Another approach exploits a different kind of parallelism, by using bitwise operations to compute several pixels of the result simultaneously. Such an approach is supported on modern commodity hardware with extensions like SSE, which has registers big enough to hold multiple integers or floating point numbers, and instructions that add corresponding elements of two such registers. These instructions must be exploited either by highly specific cleverness in a compiler, or by writing assembly language, hardware manual in hand. We will instead exploit operations on ordinary integer registers that can be expressed in ordinary C.

For compactness, and to permit the bitwise operations we shall introduce below, we will represent images by arrays of 64 integers, each 64 bits.

```typedef uint64_t word;

word image;
```

Consider the pixel that is one to the left and one down from the top right corner, which we will give the coordinates (1, 1). This is the first pixel we find in scanning the image that is not an edge or corner pixel. Its value in the smoothed image is obtained by summing it and the surrounding 8 pixels; all these pixels occur in the top three rows of the image – let's call these rows `x`, `y` and `z`.

```word x = image;
word y = image;
word z = image;
```

We can extract the bits we need by shifting these rows in the right way, and masking off all but the bottom bit.

```word x0 = x & 0x1,        y0 = y & 0x1,        z0 = z & 0x1;
word x1 = (x >> 1) & 0x1, y1 = (y >> 1) & 0x1, z1 = (z >> 1) & 0x1;
word x2 = (x >> 2) & 0x1, y2 = (y >> 2) & 0x1, z2 = (z >> 2) & 0x1;
```

Now to determine the pixel value for the result of smoothing, we must add together these 9 bits.

```word sum = x0+x1+x2+y0+y1+y2+z0+z1+z3;
```

We want the result pixel to be 1 if sum is at least 5; to avoid writing a conditional, we can get this result by adding 3, so that for a positive result the value will be at least 8, then shifting right by 3 bits and masking off all but the lowest bit for safety.

```word p = ((sum+3) >> 3) & 0x1;
```

Now we can observe that we have used only the bottom 4 bits of a machine word in carrying out this calculation: the values `x0` ... `z2` occupy only the bottom bit, and when they are added together with the fudge factor on top, the biggest value that can be reached is 12, which still fits in 4 bits; the final answer is masked so that only the leftmost bit of these 4 affects the answer. So we really don't care what happens in the other 60 bits of the word as this calculation is proceeding, and we can use these bits to compute simultaneously other bits of the smoother image.

Specifically, if we replace the mask `0x1` used in the code above with the value written in C as `MASK = 0x1111111111111111LL` (that is, the 64 bit hexadecimal constant with 16 digits all equal to 1), then the shifts, additions and masks above succeed in computing the 16 pixels (1, 5), (1, 9), (1, 13), ..., (1, 61) of the smoother image all at once. The fudge factor of 3 must be replaced by `3*MASK`, a value that is a row of 3's in hexadecimal.

By using x1, x2 and a new value x3 in place of x0, x1 and x2, etc., it's possible also to compute pixels (1, 2), (1, 6), ..., (1, 62) of the result in a similar way. For the other pixels in that row, we need to consider what happens at the boundaries of the 64 x 64 image. Let's agree that edge pixels will be 1 in the result if 3 or more from the pixel and its 5 neighbours are 1 in the original image, and at the corners we get a 1 in the result if 2 or more of 4 pixels are 1 in the original. We can respect these conventions by ensuring that pixels beyond the image boundary are treated as 0, and changing the fudge factor to 5 for edge pixels and 6 for corner pixels.

The top and bottom rows require further special treatment, in that the row above or the row below should read as all 0's, and the rows consist entirely of edge or corner pixels with their modified fudge factors.

All this reasoning leads to the following code for computing a smoothed 64 x 64 image `newimg` from an input image called `image`. It looks complicated, but the loop body is almost entirely straight-line code on which an optimising compiler can have a field day.

```typedef uint64_t word;

word image, newimg;

void smooth(void) {
for (int i = 0; i < 64; i++) {
word x = (i > 0 ? image[i-1] : 0);
word y = image[i];
word z = (i < 63 ? image[i+1] : 0);

word f0 = (i > 0 && i < 63 ? 3*MASK+2 :  5*MASK+1);
word f12 = (i > 0 && i < 63 ? 3*MASK : 5*MASK);
word f3 = (i > 0 && i < 63 ? 3*MASK+(2LL<<60) : 5*MASK+(1LL<<60));

word xm1 = (x << 1) & MASK, ym1 = (y << 1) & MASK, zm1 = (z << 1) & MASK;
word x1 = (x >> 1) & MASK,  y1 = (y >> 1) & MASK,  z1 = (z >> 1) & MASK;
word x2 = (x >> 2) & MASK,  y2 = (y >> 2) & MASK,  z2 = (z >> 2) & MASK;
word x3 = (x >> 3) & MASK,  y3 = (y >> 3) & MASK,  z3 = (z >> 3) & MASK;
word x4 = (x >> 4) & MASK,  y4 = (y >> 4) & MASK,  z4 = (z >> 4) & MASK;

word p0 = ((xm1+x0+x1+ym1+y0+y1+zm1+z0+z1+f0) >> 3) & MASK;
word p1 = ((x0+x1+x2+y0+y1+y2+z0+z1+z2+f12) >> 2) & (MASK<<1);
word p2 = ((x1+x2+x3+y1+y2+y3+z1+z2+z3+f12) >> 1) & (MASK<<2);
word p3 = (x2+x3+x4+y2+y3+y4+z2+z3+z4+f3) & (MASK<<3);
newimg[i] = p0 + p1 + p2 + p3;
}
}
```

This style of parallelism is called SIMD, because there is a Single Instuction stream operating on Multiple Data streams. Its nature is that the sequence of instructions must be the same whatever the data, so conditional branches are banned. It's possible to include conditional operations (such as conditional moves) that affect the data that need them but leave others unchanged: the processing elements that don't participate in the operation then just idle for a cycle. What we've done here is to reduce an operation that was a conditional `if (sum >= 5)` to a form where the result is computed by a uniform recipe (adding a fudge factor and masking).

On larger images, each row will consist of multiple words, and the bits at the ends of one word will depend on adjacent pixels that are parts of the words to the left or right: so computing the values `xm1`, `x4`, etc. will be a bit different. The same speedup would continue to apply, however, and it would still be possible to compute 16 pixels of the result simultaneously.

A symbolic representation of the machine code for a program.