Lecture 5 – Loops and subroutines (Digital Systems)
A better multiplication algorithm
[5.1] Let's keep the invariant a * b = x * y + z, but try halving y in each iteration: in not-quite-C,
unsigned func(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;
    /* Invariant: a * b = x * y + z */
    while (y != 0) {
        if (y odd) z = z + x;
        x = x*2; y = y/2;
    }
    return z;
}
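For readers who want to run the algorithm before translating it, here is a minimal sketch in real C. The name mult_shift_add is made up for illustration, the test "y odd" is written as a mask of the bottom bit, and the invariant is checked with an assertion on each iteration (it holds modulo the word size, since unsigned arithmetic wraps around).

```c
#include <assert.h>

/* Hypothetical name: shift-and-add multiplication on unsigned words. */
unsigned mult_shift_add(unsigned a, unsigned b)
{
    unsigned x = a, y = b, z = 0;

    while (y != 0) {
        /* Invariant (modulo the word size): a * b == x * y + z */
        assert(a * b == x * y + z);
        if (y & 1)      /* "y odd": test the bottom bit */
            z = z + x;
        x = x * 2;      /* same effect as x << 1 */
        y = y / 2;      /* same effect as y >> 1 for unsigned y */
    }
    return z;           /* y == 0, so the invariant gives z == a * b */
}
```

The loop runs once for each bit of b up to its most significant 1, rather than b times.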
We'll code this routine in assembly language, again keeping x, y and z in registers r0, r1 and r2. How are we going to implement the operations x = x*2 and y = y/2? The ARM provides shift instructions that (at least for positive inputs) can multiply or divide by powers of 2: the instruction
lsls r0, r0, #1
shifts the contents of r0 to the left by one place, multiplying the value by 2. (Actually, almost the same effect results from adds r0, r0, r0.) Also, the instruction
lsrs r1, r1, #1
shifts the contents of r1 one place to the right, dividing the value by 2, and throwing away any remainder. It shifts out a single bit at the right (and puts it in the C flag), and shifts in a zero at the left. These are called logical shifts because of the zero bit (or bits) that are shifted in at one end or the other; there's also an arithmetic right shift instruction that makes more sense when the bit pattern might represent a negative number.
The fact that the bit shifted out is put in the C flag allows us to implement the test y odd neatly: if we write
lsrs r1, r1, #1
bcc even
then that divides y by 2 and branches to the label even if y was even, executing the next instruction only if y was odd. (That's not an improvement we can expect a compiler to find.)
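The way the shift and the parity test share one instruction can be sketched in C. This is only a model under stated assumptions: lsrs1 and mult_step are hypothetical names, and the C flag is represented by an ordinary variable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "lsrs rd, rd, #1": returns the shifted value and
   stores the bit shifted out (the C flag) in *carry. */
uint32_t lsrs1(uint32_t value, unsigned *carry)
{
    assert(carry != 0);
    *carry = value & 1u;    /* bit 0 falls off the right-hand end */
    return value >> 1;      /* a zero is shifted in at the left */
}

/* One step of the multiplication loop, written with the model. */
void mult_step(uint32_t *x, uint32_t *y, uint32_t *z)
{
    unsigned carry;
    *y = lsrs1(*y, &carry); /* lsrs r1, r1, #1 */
    if (carry)              /* bcc even: skip the add when C is clear */
        *z += *x;           /* adds r2, r2, r0 */
    *x <<= 1;               /* lsls r0, r0, #1 */
}
```

The old bottom bit of y is exactly the carry, so no separate masking instruction is needed to decide whether y was odd.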
[5.2] A final idea is to put the test y != 0 at the end of the loop, so that only one branch instruction is needed to control the loop itself. The unconditional branch at the start is executed only once.
func:
    movs r2, #0         @ z = 0
    b test
again:
    lsrs r1, r1, #1     @ y = y/2
    bcc even            @ if y was even, skip
    adds r2, r2, r0     @ z = z + x
even:
    lsls r0, r0, #1     @ x = x*2
test:
    cmp r1, #0          @ if y != 0
    bne again           @   repeat
    movs r0, r2         @ return z
    bx lr
There's still something not-quite-optimal about this function, because if y is non-zero and is even, then we can be sure that y>>1 is still non-zero, and yet the program still tests for zero after every step: you can investigate this in the lab. Also, we could use loop unrolling, this time with a clear idea of how much unrolling is worthwhile.
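To see the branch structure of the assembly routine without an ARM to hand, it can be mirrored in C using goto, one statement per instruction. This is a sketch for tracing the control flow, not code anyone would write by choice; the name func_goto and the variable carry (standing in for the C flag) are invented here.

```c
#include <assert.h>

/* Hypothetical C mirror of the assembly routine: x plays r0, y plays r1,
   z plays r2, and carry stands in for the C flag. */
unsigned func_goto(unsigned x, unsigned y)
{
    unsigned z = 0;             /* movs r2, #0 */
    unsigned carry;
    goto test;                  /* b test: executed only once */
again:
    carry = y & 1u;             /* lsrs r1, r1, #1 sets C to the bit */
    y = y >> 1;                 /*   shifted out and halves y */
    if (!carry) goto even;      /* bcc even */
    z = z + x;                  /* adds r2, r2, r0 */
even:
    x = x << 1;                 /* lsls r0, r0, #1 */
test:
    if (y != 0) goto again;     /* cmp r1, #0; bne again */
    return z;                   /* movs r0, r2; bx lr */
}
```

Each trip round the loop passes through one loop-control branch (the bne at the end), plus the bcc that skips the addition.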
Stack frames and locals
[5.3] What if ...
- we want to write a subroutine that calls another, or perhaps recursively calls itself? (We must save our return address.)
- we want to use registers r4 to r7? (We must restore their values before returning.)
- we want a subroutine that accepts more than four arguments? (They won't all fit in r0 to r3.)
- we need more local variables than will fit in registers (or maybe a local array)?
The answer is to let the subroutine use space in memory, taken from the subroutine stack, giving it a stack frame. Most subroutines will be like this, except for tiny routines that call no others. The subroutine stack is an area of memory, starting from the highest addresses in RAM and growing downwards, that is reserved for storing information that relates to each subroutine activation. If one subroutine calls others, then all the others must return before the original subroutine returns, and it is this that makes the storage behave in a stack-like manner.
We'll leave until next time the possibility of having local variables that live in the stack frame, and the load and store instructions that must be used to access them. For now, let's concentrate on the possibility that we want to use the registers r4 to r7, in addition to the registers r0 to r3 that are already available to us. Conventions obeyed by all software on the micro:bit require us to restore these registers to their original values before the subroutine returns. Also, if this subroutine calls others, then the link register lr will be overwritten by the calls, and we will need to save its value somewhere if we are to know where to return.
The subroutine will have a layout like this:
func:
    push {r4, r5, lr}   @ Save registers
    @ Use r4 and r5, call other routines
    pop {r4, r5, pc}    @ Restore and return
The push and pop instructions transfer multiple words between registers and memory, and adjust the stack pointer. The push saves whatever registers (from r4 to r7) are used in the body of the subroutine, and also moves the return address from lr into the slot in the stack frame that is marked in the diagram. It also moves the stack pointer register sp down by the size of the items saved, ensuring that when this subroutine calls another one, the other subroutine will create its stack frame below ours without overlapping. When the micro:bit comes out of reset, the stack pointer is initialised so it points to the very top of the RAM area.
The body of the subroutine can overwrite the saved registers (r4 and r5 in this case) however it likes, and also call other subroutines using a bl instruction that overwrites the lr register. Whatever values we put in these registers will be preserved in their turn by the subroutines we call, so we can use them (unlike r0 to r3) to save values across subroutine calls. Other registers (r6 and r7 in this case) that we don't use need not be mentioned by us, even if they are used by other subroutines that we call. If those subroutines use them, then they will look after saving them (with the values that we received and have not changed) and restoring them to the same values before they return, so that they will have their original values when we ourselves return.
When execution reaches the pop instruction, its action is roughly inverse to the push instruction. Provided the register lists agree, the values saved in the stack frame on entry will be restored to the same registers, and then the return address will be moved, not back into lr where it arrived, but into the program counter pc, so that execution resumes with the instruction just after the one that called this subroutine.
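The effect of the matching push and pop on registers and memory can be sketched as a small C model. Everything here is illustrative: struct machine, the 64-word stack, and the function names are assumptions, and sp is held as a word index rather than a byte address. The model stores the lowest-numbered register at the lowest address, which is the order ARM uses.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_WORDS = 64, R4 = 4, R5 = 5, LR = 14, PC = 15 };

/* Hypothetical machine state: 16 registers and a small descending stack. */
struct machine {
    uint32_t r[16];
    uint32_t stack[STACK_WORDS];
    unsigned sp;                    /* index of the lowest word in use */
};

/* push {r4, r5, lr}: move sp down three words, then store the registers
   with the lowest-numbered register at the lowest address. */
void push_r4_r5_lr(struct machine *m)
{
    assert(m->sp >= 3);             /* room for three more words */
    m->sp -= 3;
    m->stack[m->sp + 0] = m->r[R4];
    m->stack[m->sp + 1] = m->r[R5];
    m->stack[m->sp + 2] = m->r[LR];
}

/* pop {r4, r5, pc}: reload in the same order, sending the saved return
   address to pc instead of lr, then move sp back up. */
void pop_r4_r5_pc(struct machine *m)
{
    m->r[R4] = m->stack[m->sp + 0];
    m->r[R5] = m->stack[m->sp + 1];
    m->r[PC] = m->stack[m->sp + 2];
    m->sp += 3;
}
```

Between the two calls, the body is free to change r4, r5 and lr; after the pop, r4 and r5 hold their values from entry, pc holds the return address that arrived in lr, and sp is back where it started.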
The push and pop instructions represent a departure from the RISC principle that each instruction should describe a single, simple operation, but they are included to make programming easier and for performance. Although a push or pop instruction is equivalent to a sequence of load or store instructions plus an add or sub instruction that adjusts sp, in fact the instruction is more compact, and faster because it only needs to be decoded once, then (on the Cortex-M0) one register can be saved on each successive clock cycle. An ordinary load or store instruction takes 2 cycles, but a push or pop takes only n+1 cycles to save or restore n registers. Each instruction contains a little bitmap to say which of the low registers r0 to r7 to save or restore, plus an additional bit saying whether to save a return address from lr or restore it to pc. Although the assembly language syntax looks more general, in fact we aren't allowed to write things like
pop {r4, lr} @ Wrong!
because a pop can reload pc and never lr.
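The bitmap can be made concrete with a short C sketch that builds the 16-bit instruction words. The layout used here is my understanding of the Thumb encoding (0xB400 for push and 0xBC00 for pop, with bit 8 as the extra lr/pc bit and bits 0 to 7 as the register list), so check it against the architecture manual before relying on it; the function names and the REG macro are made up.

```c
#include <assert.h>
#include <stdint.h>

#define REG(n) (1u << (n))      /* bitmap bit for low register rn */

/* Hypothetical encoder for push {low_regs[, lr]}. */
uint16_t encode_push(unsigned low_regs, int save_lr)
{
    assert((low_regs & ~0xFFu) == 0);   /* only r0 to r7 fit in the bitmap */
    return (uint16_t)(0xB400u | (save_lr ? 0x100u : 0u) | low_regs);
}

/* Hypothetical encoder for pop {low_regs[, pc]}. */
uint16_t encode_pop(unsigned low_regs, int load_pc)
{
    assert((low_regs & ~0xFFu) == 0);
    return (uint16_t)(0xBC00u | (load_pc ? 0x100u : 0u) | low_regs);
}
```

With this layout there is simply no bit pattern for a pop that reloads lr: the one extra bit means lr in a push and pc in a pop.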
Example: factorials without a multiply instruction
Let's illustrate these ideas by writing a subroutine that computes the factorial of its argument, using a subroutine to do the required multiplications. This is a contrived and perhaps too simple example, but making it simple helps us to concentrate on the mechanisms. In C, our subroutine could be written like this:
int fac(int n) {
    int k = n, f = 1;
    while (k != 0) {
        f = mult(f, k);
        k = k-1;
    }
    return f;
}
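To try this out on a desktop machine, mult needs a definition. The sketch below supplies one using the shift-and-add loop from earlier in the lecture; the exact signature of mult is an assumption, and so is the assertion that n is non-negative (for a negative n the loop would not terminate cleanly).

```c
#include <assert.h>

/* Assumed definition of mult: shift-and-add, no multiply instruction. */
unsigned mult(unsigned a, unsigned b)
{
    unsigned z = 0;
    while (b != 0) {
        if (b & 1) z += a;  /* add x when y is odd */
        a <<= 1;            /* x = x*2 */
        b >>= 1;            /* y = y/2 */
    }
    return z;
}

int fac(int n)
{
    assert(n >= 0);         /* assumption: argument is non-negative */
    int k = n, f = 1;
    while (k != 0) {
        f = (int) mult((unsigned) f, (unsigned) k);
        k = k - 1;
    }
    return f;
}
```

The results fit in a 32-bit int only up to fac(12).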
There are two local variables, k and f. Because this subroutine calls others, we will keep both of them in registers that are preserved across subroutine calls: k in r4 and f in r5. Luckily, the argument n is not mentioned again after we have used it to initialise k, otherwise we would have to move it into another register also.
In assembly language, our subroutine can begin like this:
fac:
    push {r4, r5, lr}
    movs r4, r0         @ Initialise k to n
    movs r5, #1         @ Set f to 1
Then we are ready for the loop. For simplicity, let's keep the test at the top of the loop.
again:
    cmp r4, #0          @ Is k = 0?
    beq finish          @ If so, finished
Next comes the subroutine call f = mult(f, k): we must move the arguments f and k into registers r0 and r1 respectively, then use bl to branch to the subroutine, and finally move the result from r0 back into r5 where f lives.
    movs r0, r5         @ Set f to f * k
    movs r1, r4
    bl mult
    movs r5, r0
The end of the loop body reduces k by one and branches back to the start.
    subs r4, r4, #1     @ Decrement k
    b again             @ And repeat
When the loop is finished, we must move the result f into r0 before returning.
finish:
    movs r0, r5         @ Return f
    pop {r4, r5, pc}
In this program, as it happens, the subroutine mult uses only registers r0 to r3, so it does not need to do anything to preserve the values that fac is keeping in r4 and r5. Those registers are saved on entry to fac, then the whole calculation happens using r4 and r5 to hold fac's variables, using r0 to r3 temporarily to perform multiplications, and with no further transfers between registers and memory, until r4 and r5 are restored when fac returns.
Although the version of fac shown above is a faithful translation of the C original, it can be improved a bit if we observe that the variable f need not be preserved across the subroutine call, because the result of the call overwrites f. Because of this, there is no need to keep f in a register that is preserved across the call, and it can in fact live in r0 all the time, like this:
fac:
    push {r4, lr}
    movs r4, r0         @ Initialise k to n
    movs r0, #1         @ Set f to 1
again:
    cmp r4, #0          @ Is k = 0?
    beq finish          @ If so, finished
    movs r1, r4         @ Set f to f * k
    bl mult
    subs r4, r4, #1     @ Decrement k
    b again             @ And repeat
finish:
    pop {r4, pc}
Optimising compilers are able to generate code like this by analysing the dataflow in the program, and assigning the same register to two quantities when one is a copy of the other. In this case, starting from the C code for fac, GCC is able to determine that the value returned by mult, the argument to the next invocation of mult, and the result that fac eventually returns can all live in r0. The code it outputs is similar to the second version of fac shown above.
Questions
What's the difference between a subroutine, a procedure and a function?
I tend to use all three terms interchangeably. I think subroutine is the most neutral term, unlikely to be confused with either an informal list of instructions (e.g., the procedure for downloading code to the micro:bit) or a mathematical operation (e.g., the sine function). The term procedure is common in thinking about Pascal-like languages, where indeed each subroutine is introduced by the keyword procedure. The corresponding term in C (and, I suppose, in functional languages) is function.
Couldn't we use a tst instruction to implement the test whether y is even?
In the lecture, I suggested using an ands instruction to mask out the bottom bit of y. After a bit of fumbling, we collectively came up with
movs r3, #1
ands r3, r3, r1
beq even
It's also possible to exploit the tst instruction, which (according to the reference card) computes the bitwise and of its two inputs, uses it to set the N and Z flags, then throws the result away. It is related to the ands instruction in the same way that the cmp instruction is related to subs. Using this instruction, we can write
movs r3, #1
...
tst r1, r3
beq even
That's the same number of instructions, but the difference is that the value in r3 is not overwritten, so we can set r3 to 1 just once outside the loop, and leave it there throughout the execution of the subroutine. Neither of these methods is quite as good as the solution we adopted, exploiting the fact that the lsrs instruction sets the C flag to the last bit shifted out.
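The difference between the two masking instructions can be sketched in C, with the Z flag modelled as a bool. The names ands_model, tst_model and count_even are invented for illustration; count_even is just a small loop that shows the mask being set up once and reused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of "ands rd, rd, rm": returns the new value of rd, and sets *z
   when the result is zero. */
uint32_t ands_model(uint32_t rd, uint32_t rm, bool *z)
{
    uint32_t result = rd & rm;
    *z = (result == 0);
    return result;              /* the caller stores this back into rd */
}

/* Model of "tst rn, rm": sets the same flag, but writes no register. */
void tst_model(uint32_t rn, uint32_t rm, bool *z)
{
    *z = ((rn & rm) == 0);
}

/* Hypothetical loop: count the even values, keeping the mask in r3. */
unsigned count_even(const uint32_t *values, unsigned n)
{
    uint32_t r3 = 1;            /* movs r3, #1, once outside the loop */
    unsigned count = 0;
    for (unsigned i = 0; i < n; i++) {
        bool z;
        tst_model(values[i], r3, &z);   /* tst r1, r3 */
        if (z) count++;                 /* beq even */
    }
    return count;
}
```

With ands_model, the mask in r3 is replaced by the result and must be reloaded before the next test; with tst_model it survives from one iteration to the next.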