Lecture 5 – Loops and subroutines (Digital Systems)

A better multiplication algorithm

[5.1] Let's keep the invariant a * b = x * y + z, but try halving y in each iteration: in not-quite-C,

unsigned func(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    /* Invariant: a * b = x * y + z */
    while (y != 0) {
        if (y odd) z = z + x;
        x = x*2; y = y/2;
    }

    return z;
}
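For trying the algorithm out on a desktop machine, here is a sketch in real C. The name mul_shift_add is ours, the test "y odd" is written as a mask of the bottom bit, and the invariant is checked with assert. Unsigned arithmetic in C wraps around, so the invariant holds modulo 2^N for N-bit unsigned values.

```c
#include <assert.h>

/* Shift-and-add multiplication (hypothetical name, for illustration). */
unsigned mul_shift_add(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    while (y != 0) {
        /* Invariant: a * b == x * y + z, in wrapping unsigned arithmetic */
        assert(a * b == x * y + z);
        if (y & 1u) z = z + x;   /* "y odd": test the bottom bit */
        x = x << 1;              /* x = x*2 */
        y = y >> 1;              /* y = y/2, discarding the remainder */
    }

    /* On exit y == 0, so the invariant gives z == a * b */
    assert(z == a * b);
    return z;
}
```

The loop runs once for each bit of b up to its most significant 1 bit, rather than once for each unit of b.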


We'll code this routine in assembly language, again keeping x, y and z in registers r0, r1 and r2. How are we going to implement the operations x = x*2 and y = y/2? The ARM provides shift instructions that (at least for positive inputs) can multiply or divide by powers of 2: the instruction

    lsls r0, r0, #1


shifts the contents of r0 to the left by one place, multiplying the value by 2. (Actually, almost the same effect results from adds r0, r0, r0.) Also, the instruction

    lsrs r1, r1, #1


shifts the contents of r1 one place to the right, dividing the value by 2, and throwing away any remainder. It shifts out a single bit at the right (and puts it in the C flag), and shifts in a zero at the left. These are called logical shifts because of the zero bit (or bits) that are shifted in at one end or the other; there's also an arithmetic right shift instruction that makes more sense when the bit pattern might represent a negative number.

The fact that the bit shifted out is put in the C flag allows us to implement the test y odd neatly: if we write

    lsrs r1, r1, #1
    bcc even


then that divides y by 2 and branches to the label even if y was even, executing the next instruction only if y was odd. (That's not an improvement we can expect a compiler to find.)
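C has no direct access to the carry flag, but we can model the effect. In the sketch below, lsrs1 is a hypothetical helper (not a library function) that shifts a word right by one place and returns the bit that the real lsrs would leave in the C flag; mult_carry then uses that bit in place of a separate test for oddness.

```c
#include <stdint.h>

/* Hypothetical model of "lsrs rd, rd, #1": shift *reg right by one place
   and return the bit shifted out, which the real instruction puts in C. */
unsigned lsrs1(uint32_t *reg) {
    unsigned carry = (unsigned)(*reg & 1u);
    *reg >>= 1;
    return carry;
}

/* The multiplication loop, with the shift and the odd test combined. */
uint32_t mult_carry(uint32_t a, uint32_t b) {
    uint32_t x = a, y = b, z = 0;

    while (y != 0) {
        if (lsrs1(&y))   /* carry set: y was odd, so bcc is not taken */
            z += x;      /* z = z + x */
        x <<= 1;         /* x = x*2 */
    }

    return z;
}
```

Note that y has already been halved by the time z is updated; that is harmless, because the update uses only x.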

[5.2] A final idea is to put the test y != 0 at the end of the loop, so that there is only one branch instruction in the loop itself. The unconditional branch at the start is executed only once.

func:
    movs r2, #0             @ z = 0
    b test
again:
    lsrs r1, r1, #1         @ y = y/2
    bcc even                @ if y was even, skip
    adds r2, r2, r0         @ z = z + x
even:
    lsls r0, r0, #1         @ x = x*2
test:
    cmp r1, #0              @ if y != 0
    bne again               @   repeat
    movs r0, r2             @ return z
    bx lr


There's still something not-quite-optimal about this function, because if y is non-zero and is even, then we can be sure that y>>1 is still non-zero, and yet the program still tests for zero after every step: you can investigate this in the lab. Also, we could use loop unrolling, this time with a clear idea of how much unrolling is worthwhile.
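One possible restructuring, sketched in C rather than assembly, is to strip the trailing zero bits of y in an inner loop that never tests for zero. This is only one way of exploiting the observation, not necessarily the best one, and the name mult_skip is ours.

```c
#include <stdint.h>

/* Sketch: skip the zero test after steps where y was even. */
uint32_t mult_skip(uint32_t a, uint32_t b) {
    uint32_t x = a, y = b, z = 0;

    while (y != 0) {
        /* y is non-zero, so it contains a 1 bit somewhere: shifting out
           trailing zeros cannot make it zero, and needs no zero test. */
        while ((y & 1u) == 0) {
            x <<= 1;
            y >>= 1;
        }

        /* Now y is odd: add x, then shift as before. */
        z += x;
        x <<= 1;
        y >>= 1;
    }

    return z;
}
```

The outer test y != 0 is now made only after steps where y was odd, which are the only steps that can reduce y to zero.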

Stack frames and locals

[5.3] What if ...

• we want to write a subroutine that calls another, or perhaps recursively calls itself? (We must save our return address.)
• we want to use registers r4 to r7? (We must restore their values before returning.)
• we want a subroutine that accepts more than four arguments? (They won't all fit in r0 to r3.)
• we need more local variables than will fit in registers (or maybe a local array)?

The answer is to let the subroutine use space in memory, taken from the subroutine stack, giving it a stack frame. Most subroutines will be like this, except for tiny routines that call no others. The subroutine stack is an area of memory, starting from the highest addresses in RAM and growing downwards, that is reserved for storing information that relates to each subroutine activation. If one subroutine calls others, then all the others must return before the original subroutine returns, and it is this that makes the storage behave in a stack-like manner.

Stack frame layout

We'll leave until next time the possibility of having local variables that live in the stack frame, and the load and store instructions that must be used to access them. For now, let's concentrate on the possibility that we want to use the registers r4 to r7, in addition to the registers r0 to r3 that are already available to us. Conventions obeyed by all software on the micro:bit require us to restore these registers to their original values before the subroutine returns. Also, if this subroutine calls others, then the link register lr will be overwritten by the calls, and we will need to save its value somewhere if we are to know where to return.

The subroutine will have a layout like this:

func:
    push {r4, r5, lr}     @ Save registers

    @ Use r4 and r5, call other routines

    pop {r4, r5, pc}      @ Restore and return


The push and pop instructions transfer multiple words between registers and memory, and adjust the stack pointer. The push saves whatever registers (from r4 to r7) are used in the body of the subroutine, and also moves the return address from lr into the slot in the stack frame that is marked in the diagram. It also moves the stack pointer register sp down by the size of the items saved, ensuring that when this subroutine calls another one, the other subroutine will create its stack frame below ours without overlapping. When the micro:bit comes out of reset, the stack pointer is initialised so it points to the very top of the RAM area.

The body of the subroutine can overwrite the saved registers (r4 and r5 in this case) however it likes, and also call other subroutines using a bl instruction that overwrites the lr register. Whatever values we put in these registers will be preserved in their turn by the subroutines we call, so we can use them (unlike r0 to r3) to save values across subroutine calls. Other registers (r6 and r7 in this case) that we don't use need not be mentioned by us, even if they are used by other subroutines that we call. If those subroutines use them, then they will look after saving them (with the values that we received and have not changed) and restoring them to the same values before they return, so that they will have their original values when we ourselves return.

When execution reaches the pop instruction, its action is roughly inverse to the push instruction. Provided the register lists agree, the values saved in the stack frame on entry will be restored to the same registers, and then the return address will be moved – not back into lr where it arrived, but into the program counter pc, so that execution resumes with the instruction just after the one that called this subroutine.
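The effect of the push and pop pair can be modelled in a few lines of C. This is a toy sketch under stated assumptions: the Machine structure, the 16-word stack and the function names are all invented for illustration, and sp is an array index rather than a byte address. It does reflect the real behaviour that the stack grows downwards and that the lowest-numbered register goes at the lowest address.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_WORDS = 16 };      /* toy stack size, an assumption */

typedef struct {
    uint32_t r4, r5, lr, pc;
    uint32_t mem[STACK_WORDS];  /* the stack area */
    unsigned sp;                /* index into mem; STACK_WORDS means empty */
} Machine;

/* Model of "push {r4, r5, lr}" */
void push_r4_r5_lr(Machine *m) {
    assert(m->sp >= 3);         /* room for three words */
    m->sp -= 3;                 /* stack grows downwards */
    m->mem[m->sp]     = m->r4;  /* lowest-numbered register, lowest address */
    m->mem[m->sp + 1] = m->r5;
    m->mem[m->sp + 2] = m->lr;  /* return address saved last */
}

/* Model of "pop {r4, r5, pc}" */
void pop_r4_r5_pc(Machine *m) {
    assert(m->sp + 3 <= STACK_WORDS);
    m->r4 = m->mem[m->sp];
    m->r5 = m->mem[m->sp + 1];
    m->pc = m->mem[m->sp + 2];  /* return address goes to pc, not lr */
    m->sp += 3;
}
```

If the body of the subroutine changes r4, r5 and lr between the two calls, the pop still restores r4 and r5 and loads pc with the original return address, while lr keeps whatever value the body left in it.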

The push and pop instructions represent a departure from the RISC principle that each instruction should describe a single, simple operation, but they are included to make programming easier and for performance. Although a push or pop instruction is equivalent to a sequence of load or store instructions plus an add or sub instruction that adjusts sp, in fact the instruction is more compact, and faster because it only needs to be decoded once, then (on the Cortex-M0) one register can be saved on each successive clock cycle. An ordinary load or store instruction takes 2 cycles, but a push or pop takes only n+1 cycles to save or restore n registers. Each instruction contains a little bitmap to say which of the low registers r0 to r7 to save or restore, plus an additional bit saying whether to save a return address from lr or restore it to pc. Although the assembly language syntax looks more general, in fact we aren't allowed to write things like

    pop {r4, lr}     @ Wrong!


because a pop can reload pc and never lr.
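The bitmap can be made concrete with a small encoder. The bit patterns below are the 16-bit Thumb encodings as we understand them from the ARMv6-M architecture manual, not something stated in this lecture, so treat them as an assumption to check; the function names are ours. There are eight bits for r0 to r7 and one extra bit, which means lr for a push and pc for a pop, so there is simply no way to encode pop {r4, lr}.

```c
#include <assert.h>
#include <stdint.h>

/* low_regs: bit i set means register ri (r0..r7) is in the list.
   Assumed encodings: push = 1011 010M rrrrrrrr, pop = 1011 110P rrrrrrrr. */

uint16_t encode_push(unsigned low_regs, int with_lr) {
    assert(low_regs <= 0xFFu);                   /* only r0..r7 fit */
    return (uint16_t)(0xB400u | (with_lr ? 0x100u : 0u) | low_regs);
}

uint16_t encode_pop(unsigned low_regs, int with_pc) {
    assert(low_regs <= 0xFFu);                   /* only r0..r7 fit */
    return (uint16_t)(0xBC00u | (with_pc ? 0x100u : 0u) | low_regs);
}
```

With r4 and r5 selected (bits 4 and 5, mask 0x30), push {r4, r5, lr} comes out as 0xB530 and pop {r4, r5, pc} as 0xBD30, which can be compared against a disassembly listing.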

Example: factorials without a multiply instruction

Let's illustrate these ideas by writing a subroutine that computes the factorial of its argument, using a subroutine to do the required multiplications. This is a contrived and perhaps too simple example, but making it simple helps us to concentrate on the mechanisms. In C, our subroutine could be written like this:

int fac(int n)
{
    int k = n, f = 1;

    while (k != 0) {
        f = mult(f, k);
        k = k-1;
    }

    return f;
}


There are two local variables, k and f. Because this subroutine calls others, we will keep both of them in variables that are preserved across subroutine calls: k in r4 and f in r5. Luckily, the argument n is not mentioned again after we have used it to initialise k, otherwise we would have to move it into another register also.

In assembly language, our subroutine can begin like this:

fac:
    push {r4, r5, lr}
    movs r4, r0             @ Initialise k to n
    movs r5, #1             @ Set f to 1


Then we are ready for the loop. For simplicity, let's keep the test at the top of the loop.

again:
    cmp r4, #0              @ Is k = 0?
    beq finish              @ If so, finished


Next comes the subroutine call f = mult(f, k): we must move the arguments, f and k into registers r0 and r1 respectively, then use bl to branch to the subroutine, and finally move the result from r0 back into r5 where f lives.

    movs r0, r5             @ Set f to f * k
    movs r1, r4
    bl mult
    movs r5, r0


The end of the loop body reduces k by one and branches back to the start.

    subs r4, r4, #1         @ Decrement k
    b again                 @ And repeat


When the loop is finished, we must move the result f into r0 before returning.

finish:
    movs r0, r5             @ Return f
    pop {r4, r5, pc}


In this program, as it happens, the subroutine mult uses only registers r0 to r3, so it does not need to do anything to preserve the values that fac is keeping in r4 and r5. Those registers are saved on entry to fac, then the whole calculation happens using r4 and r5 to hold fac's variables, using r0 to r3 temporarily to perform multiplications, and with no further transfers between registers and memory, until r4 and r5 are restored when fac returns.

Although the version of fac shown above is a faithful translation of the C original, it can be improved a bit if we observe that the variable f need not be preserved across the subroutine call, because the result of the call overwrites f. Because of this, there is no need to keep f in a register that is preserved across the call, and it can in fact live in r0 all the time, like this:

fac:
    push {r4, lr}
    movs r4, r0             @ Initialise k to n
    movs r0, #1             @ Set f to 1

again:
    cmp r4, #0              @ Is k = 0?
    beq finish              @ If so, finished

    movs r1, r4             @ Set f to f * k
    bl mult

    subs r4, r4, #1         @ Decrement k
    b again                 @ And repeat

finish:
    pop {r4, pc}


Optimising compilers are able to generate code like this by analysing the dataflow in the program, and assigning the same register to two quantities when one is a copy of the other. In this case, starting from the C code for fac, GCC is able to determine that the value returned by mult, the argument to the next invocation of mult, and the result that fac eventually returns can all live in r0. The code it outputs is similar to the second version of fac shown above.
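To check the register assignment of the second version on a desktop machine, we can write it back in C with variables named after the registers they occupy. This is a sketch under assumptions: mult is filled in here with a shift-and-add loop so that the example stands alone, and we assume that n is not negative and that the result fits in an int.

```c
/* Stand-in for the mult subroutine: shift-and-add on unsigned copies. */
int mult(int a, int b) {
    unsigned x = (unsigned)a, y = (unsigned)b, z = 0;

    while (y != 0) {
        if (y & 1u) z += x;
        x <<= 1;
        y >>= 1;
    }

    return (int)z;
}

/* fac, with each variable named after the register that holds it. */
int fac(int n) {
    int r4 = n;                 /* movs r4, r0: k lives in r4 */
    int r0 = 1;                 /* movs r0, #1: f lives in r0 */

    while (r4 != 0) {           /* cmp r4, #0; beq finish */
        r0 = mult(r0, r4);      /* movs r1, r4; bl mult */
        r4 = r4 - 1;            /* subs r4, r4, #1 */
    }

    return r0;                  /* result is already in r0 */
}
```

The result of each call to mult arrives in r0 and is passed straight back as the first argument of the next call, so no copy into a preserved register is needed.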

Questions

What's the difference between a subroutine, a procedure and a function?

I tend to use all three terms interchangeably. I think subroutine is the most neutral term, unlikely to be confused with either an informal list of instructions (e.g., the procedure for downloading code to the micro:bit) or a mathematical operation (e.g., the sine function). The term procedure is common in thinking about Pascal-like languages, where indeed each subroutine is introduced by the keyword procedure. The corresponding term in C (and, I suppose, in functional languages) is function.

Couldn't we use a tst instruction to implement the test whether y is even?

In the lecture, I suggested using an ands instruction to mask out the bottom bit of y. After a bit of fumbling, we collectively came up with

    movs r3, #1
    ands r3, r3, r1
    beq even


It's also possible to exploit the tst instruction, which (according to the reference card) computes the bitwise and of its two inputs, uses it to set the N and Z flags, then throws the result away. It is related to the ands instruction in the same way that the cmp instruction is related to subs. Using this instruction, we can write

    movs r3, #1
    ...
    tst r1, r3
    beq even


That's the same number of instructions, but the difference is that the value in r3 is not overwritten, so we can set r3 to 1 just once outside the loop, and leave it there throughout the execution of the subroutine. Neither of these methods is quite as good as the solution we adopted, exploiting the fact that the lsrs instruction sets the C flag to the last bit shifted out.
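The difference between the two instructions can be seen in a small C model. The function names are invented for illustration; each returns the value that the Z flag would have, which is 1 when the bitwise and is zero, the condition under which beq branches.

```c
#include <stdint.h>

/* Model of "ands rd, rd, rm": the result overwrites rd.
   Returns the Z flag (1 if the result is zero). */
int ands_model(uint32_t *rd, uint32_t rm) {
    *rd &= rm;
    return *rd == 0;
}

/* Model of "tst rn, rm": the result is thrown away.
   Returns the Z flag (1 if the result would be zero). */
int tst_model(uint32_t rn, uint32_t rm) {
    return (rn & rm) == 0;
}
```

After ands_model is applied to a mask of 1 and an even y, the mask has become 0 and must be set to 1 again before the next test; with tst_model the mask is passed by value and survives unchanged.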