Lecture 5 – Loops and subroutines (Digital Systems)

From Spivey's Corner

A better multiplication algorithm

Let's keep the invariant a * b = x * y + z, but try halving x in each iteration: in not-quite-C,

unsigned foo(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    /* Invariant: a * b = x * y + z */
    while (x != 0) {
        if (x odd) z = z + y;
        x = x/2; y = y*2;
    }

    return z;
}
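
The sketch becomes compilable C if we write the test x odd as x & 1 and the halving and doubling as shifts. What follows is a minimal version under those assumptions; the name shift_add_mul is made up for this illustration, and the result is the product modulo 2^32, as with any arithmetic on a 32-bit unsigned type.

```c
#include <stdint.h>

/* Hypothetical name; computes a * b modulo 2^32 by shift-and-add. */
uint32_t shift_add_mul(uint32_t a, uint32_t b) {
    uint32_t x = a, y = b, z = 0;

    /* Invariant: a * b = x * y + z (mod 2^32) */
    while (x != 0) {
        if (x & 1) z = z + y;   /* x odd: move one copy of y into z */
        x = x >> 1;             /* x = x/2, discarding the remainder */
        y = y << 1;             /* y = y*2 */
    }

    return z;                   /* x = 0, so a * b = z */
}
```

The loop body runs once for each bit of a up to its highest set bit, so at most 32 times.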

We'll code this routine in assembly language, again keeping x, y and z in registers r0, r1 and r2. How are we going to implement the operations x = x/2 and y = y*2? The ARM provides shift instructions that (at least for positive inputs) can multiply or divide by powers of 2: the instruction

lsls r1, r1, #1

shifts the contents of r1 to the left by one place, multiplying the value by 2. (Actually, almost the same effect results from adds r1, r1, r1.) Also, the instruction

lsrs r0, r0, #1

shifts the contents of r0 one place to the right, dividing the value by 2, and throwing away any remainder. It shifts out a single bit at the right (and puts it in the C flag), and shifts in a zero at the left. These are called logical shifts because of the zero bit (or bits) that are shifted in at one end or the other; there's also an arithmetic right shift instruction that makes more sense when the bit pattern might represent a negative number (see later).
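
To see why the logical right shift only divides properly for non-negative values, here is a small sketch of the three shifts as C functions on 32-bit patterns. The names lsl1, lsr1 and asr1 are invented for the illustration, and the arithmetic shift is emulated by copying the old sign bit back into bit 31 rather than by relying on how a particular compiler shifts signed values.

```c
#include <stdint.h>

/* Logical shift left by one place, like lsls r1, r1, #1. */
uint32_t lsl1(uint32_t v) { return v << 1; }

/* Logical shift right by one place, like lsrs r0, r0, #1. */
uint32_t lsr1(uint32_t v) { return v >> 1; }

/* Arithmetic shift right by one place, emulated portably by
   copying the old sign bit back into bit 31. */
uint32_t asr1(uint32_t v) { return (v >> 1) | (v & 0x80000000u); }
```

The pattern 0xFFFFFFF8 represents -8 in two's complement. A logical shift turns it into 0x7FFFFFFC, a large positive number, whereas the arithmetic shift gives 0xFFFFFFFC, the pattern for -4.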

The fact that the bit shifted out is put in the C flag allows us to implement the test x odd neatly: if we write

lsrs r0, r0, #1
bcc even

then that divides x by 2 and branches to the label even if x was even, executing the next instruction only if x was odd. (That's not an improvement we can expect a compiler to find.)
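
C has no carry flag, but we can model the idea by having the shift return the bit that falls off the end. In this sketch the helper names lsrs1_carry and mul_step are hypothetical, and the boolean result stands in for the C flag.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of lsrs r0, r0, #1: halves *x in place and
   returns the bit shifted out, which the machine puts in the C flag. */
bool lsrs1_carry(uint32_t *x) {
    bool carry = (*x & 1u) != 0;   /* bit 0 is about to be shifted out */
    *x >>= 1;                      /* a zero is shifted in at the left */
    return carry;
}

/* One loop iteration: z += y only when the carry is set (x was odd). */
void mul_step(uint32_t *x, uint32_t *y, uint32_t *z) {
    if (lsrs1_carry(x))            /* bcc even skips the add when clear */
        *z += *y;
    *y <<= 1;
}
```

Starting from x = 5, y = 3, z = 0, three calls of mul_step leave x = 0 and z = 15.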

A final idea is to put the test x != 0 at the end of the loop, so that there is only one branch instruction in the loop itself. The unconditional branch at the start is executed only once.

foo:
        movs r2, #0             @ z = 0
        b test
again:
        lsrs r0, r0, #1         @ x = x/2
        bcc even                @ if x was even, skip
        adds r2, r2, r1         @ z = z + y
even:
        lsls r1, r1, #1         @ y = y*2
test:
        cmp r0, #0              @ if x != 0
        bne again               @   repeat
        movs r0, r2             @ return z
        bx lr
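
The control flow of the assembly routine can be mimicked in C with goto, which makes it easier to check that the loop test really does sit at the bottom. This is only a sketch: the function name mul_goto and the variable carry are invented here, and the variables r0, r1 and r2 simply stand for the registers.

```c
#include <stdint.h>

/* Hypothetical C rendering of the assembly routine's control flow. */
uint32_t mul_goto(uint32_t r0, uint32_t r1) {
    uint32_t r2 = 0;               /* z = 0 */
    uint32_t carry;
    goto test;                     /* b test: executed only once */
again:
    carry = r0 & 1u;               /* bit that lsrs moves into C */
    r0 >>= 1;                      /* x = x/2 */
    if (!carry) goto even;         /* bcc even */
    r2 += r1;                      /* z = z + y */
even:
    r1 <<= 1;                      /* y = y*2 */
test:
    if (r0 != 0) goto again;       /* cmp r0, #0; bne again */
    return r2;                     /* movs r0, r2; bx lr */
}
```

Each trip round the loop passes through exactly one conditional jump back to again, matching the single bne in the assembly version.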

Stack frames and locals

What if ...

  • we want to write a subroutine that calls another, or perhaps recursively calls itself? (We must save our return address.)
  • we want to use registers r4--r7? (We must restore their values before returning.)
  • we need more local variables than will fit in registers (or maybe a local array)?

The answer is to let the subroutine use space on the stack, giving it a stack frame.

Stack frame layout

The subroutine will have a layout like this:

    push {r4-r7, lr}      @ Save registers
    sub sp, #n            @ Allocate locals
    ...                   @ Subroutine body
    add sp, #n            @ Deallocate locals
    pop {r4-r7, pc}       @ Restore and return

The push and pop instructions transfer multiple words between registers and memory, and adjust the stack pointer. They are equivalent to a sequence of load or store instructions plus an add or sub, but more compact and faster. An ordinary load or store instruction takes 2 cycles, but a push or pop takes only n+1 cycles to save or restore n registers. Each instruction contains a little bitmap to say which of the low registers r0--r7 to save or restore, and we don't need to save and restore registers that we don't use. Notice that the push instruction saves the value of the lr register, and the pop instruction restores the same value into the pc, effectively returning from the subroutine. The push instruction implicitly subtracts from sp, and the sub instruction subtracts some more; the constant n in sub sp, #n (a byte count, not the register count used above) should be a multiple of 4 and less than 512.
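
As a rough mental model of what these four instructions do to the stack, here is a sketch in C. Everything in it is an assumption made for illustration: the type stack_model, the function names, the 64-word size, and the use of a word index in place of a byte address. It does no bounds checking.

```c
#include <stdint.h>

#define STACK_WORDS 64             /* arbitrary size for the sketch */

/* Hypothetical model of a full-descending stack: sp indexes the most
   recently occupied word, and the stack grows towards index 0. */
typedef struct {
    uint32_t mem[STACK_WORDS];
    int sp;                        /* word index, not a byte address */
} stack_model;

void stack_init(stack_model *s) { s->sp = STACK_WORDS; }

/* Like push {r4, lr}: sp drops by two words, r4 at the lower address. */
void push_r4_lr(stack_model *s, uint32_t r4, uint32_t lr) {
    s->sp -= 2;
    s->mem[s->sp] = r4;
    s->mem[s->sp + 1] = lr;
}

/* Like pop {r4, pc}: reloads both words and moves sp back up. */
void pop_r4_pc(stack_model *s, uint32_t *r4, uint32_t *pc) {
    *r4 = s->mem[s->sp];
    *pc = s->mem[s->sp + 1];
    s->sp += 2;
}

/* Like sub sp, #n and add sp, #n, with n in bytes (a multiple of 4). */
void alloc_locals(stack_model *s, int n) { s->sp -= n / 4; }
void free_locals(stack_model *s, int n)  { s->sp += n / 4; }
```

After push_r4_lr followed by alloc_locals with n = 8, the model's sp is four words below where it started, and the matching free_locals and pop_r4_pc bring it back.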

In the subroutine body,

  • we are free to set and use those registers r4--r7 that have been saved. The values we put there will be preserved by any subroutines we call, and their original values will be restored when this subroutine returns.
  • local storage can be addressed as positive offsets from sp; there are special encodings of ldr/str rt, [sp, #n] that help with this.
  • addressing for local arrays is assisted by a special encoding of add rd, sp, #n.

The RISC approach is to provide simple instructions that operate uniformly on a plentiful set of registers. Traditional RISC machines provide a straightforward encoding of these instructions. With Thumb, only the most useful instructions have an encoding, and it is not uniform: for example, all the following instructions have an encoding, but different rules apply to each – different ranges of values can be used to replace the constant 4 in each case.

ldr r0, [r1, #4]        adds r0, r1, #4
ldr r0, [sp, #4]        add r0, sp, #4
ldr r0, [pc, #4]        add r0, pc, #4

This non-uniformity is a bit irritating if we are writing large amounts of assembly language by hand: but why do that when high-level languages exist? In a compiler, it is easy enough to represent the options in symbolic form and use them when possible, falling back on more long-winded code when necessary (see next year's Compilers course). In every case, we can get round the limited size of the immediate field by putting the constant into a register first.
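
To make the non-uniformity concrete, here is a sketch of the checks an assembler or compiler might apply to the constant in each of the six instructions above. The ranges are my understanding of the 16-bit Thumb encodings and should be checked against the architecture manual; the function names are hypothetical.

```c
#include <stdbool.h>

/* True if imm is a multiple of 4 in the range 0..max. */
static bool word_offset(unsigned imm, unsigned max) {
    return imm % 4 == 0 && imm <= max;
}

/* ldr r0, [r1, #imm]: 5-bit field scaled by 4. */
bool ok_ldr_reg(unsigned imm) { return word_offset(imm, 124); }

/* ldr r0, [sp, #imm] and ldr r0, [pc, #imm]: 8-bit field scaled by 4. */
bool ok_ldr_sp_pc(unsigned imm) { return word_offset(imm, 1020); }

/* adds r0, r1, #imm: 3-bit field, not scaled. */
bool ok_adds_reg(unsigned imm) { return imm <= 7; }

/* add r0, sp, #imm and add r0, pc, #imm: 8-bit field scaled by 4. */
bool ok_add_sp_pc(unsigned imm) { return word_offset(imm, 1020); }
```

So the constant 128 is acceptable in ldr r0, [sp, #128] but not in ldr r0, [r1, #128]; in the second case the fallback is to load 128 into a register and use register-plus-register addressing.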

Example – Binomial coefficients

Let's write a recursive subroutine foo(n, k) that computes the binomial coefficient (n choose k) using the recurrence

    (n choose k) = (n-1 choose k) + (n-1 choose k-1)

with (n choose 0) = (n choose n) = 1. The arguments n and k will arrive in registers r0 and r1 as usual, but we will need to re-use those registers to hold the arguments (and receive the results) of recursive calls. We have plenty of registers that are preserved across calls, so we could save n and k there, but to demonstrate the possibilities, let's save them in the stack frame. We will also need to preserve the result of the first recursive call while we do the second one, and for the sake of variety, we'll use register r4 for that.

We can sketch the plan in C, though because the saving of parameters is implicit in C, we can't really express all of it.

unsigned foo(unsigned n, unsigned k) {
    unsigned result = 1;

    if (k != 0 && k != n) {
        unsigned nn = n-1, kk = k;
        result = foo(nn, kk);
        result = result + foo(nn, kk-1);
    }

    return result;
}

To perform a subroutine call, including a recursive call to the same subroutine, we put the arguments in r0 and r1, then use a bl instruction, which both saves the return address (the address of the instruction after the bl) in the link register lr and loads the pc with the subroutine address.

foo:
    push {r4, lr}           @ Save registers
    sub sp, #8              @ Allocate space for locals
    movs r4, #1             @ Default result

    cmp r1, #0              @ Base case if k = 0
    beq done
    cmp r1, r0              @ ... or k = n
    beq done

    subs r0, r0, #1         @ Compute n-1
    str r0, [sp, #0]        @ Save n-1 in stack
    str r1, [sp, #4]        @ Save k in stack
    bl foo                  @ Call foo(n-1, k)
    movs r4, r0             @ Save result in r4
    ldr r0, [sp, #0]        @ Reload n-1
    ldr r1, [sp, #4]        @ Reload k
    subs r1, r1, #1         @ Compute k-1
    bl foo                  @ Call foo(n-1, k-1)
    adds r4, r4, r0         @ Add to previous result

done:
    movs r0, r4             @ Put result in r0
    add sp, #8              @ Reclaim space
    pop {r4, pc}            @ Restore and return

The load and store instructions here, such as ldr r1, [sp, #4], have a special form that allows us to form the address by adding together the value of the stack pointer sp and a fixed offset, 4 in this case. It's the programmer's (or the compiler's) job to keep track of the frame layout: here the variable nn is at offset 0 from the stack pointer, and the variable kk is at offset 4.
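
One way to picture the frame layout is as a C struct whose fields sit at the same offsets as the two saved values. This is only an analogy, since a C compiler is free to keep locals wherever it likes; struct frame and the variable name r4 are invented for the illustration, while nn and kk follow the C sketch above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical picture of the 8 bytes allocated by sub sp, #8. */
struct frame {
    uint32_t nn;                   /* n-1, at [sp, #0] */
    uint32_t kk;                   /* k,   at [sp, #4] */
};

/* The subroutine with its frame made explicit as a local struct. */
uint32_t foo(uint32_t n, uint32_t k) {
    struct frame f;
    uint32_t r4 = 1;               /* default result */

    if (k != 0 && k != n) {
        f.nn = n - 1;              /* str r0, [sp, #0] */
        f.kk = k;                  /* str r1, [sp, #4] */
        r4 = foo(f.nn, f.kk);      /* first call; result kept in r4 */
        r4 = r4 + foo(f.nn, f.kk - 1);
    }

    return r4;
}
```

Here offsetof(struct frame, kk) is 4, the same constant that appears in ldr r1, [sp, #4].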

The result of the first call is moved from r0 to r4 before making the second call. The recursive invocation of foo will then save the value of r4 using the push instruction as the subroutine is entered, and restore it as part of the action of the pop instruction as it exits.

(This subroutine deliberately mixes techniques in order to demonstrate some of the possibilities. It would have been simpler and better to save the values of n and k in two registers – say r5 and r6 – using register-to-register moves, rather than make space in the stack frame then store and load them explicitly. We would have started the subroutine with push {r4, r5, r6, lr}, so that these values would still have been saved in memory, but by a different mechanism.)


Generally speaking, recursive subroutines are a bad idea in embedded systems, because they are often profligate with memory, and it is difficult to determine an upper bound on the depth of the recursion. In this example, each recursive call consumes four words of stack space (two for the saved r4 and lr, two for the locals), and the depth of recursion is determined by the parameter n. Even without changing the (admittedly rather silly) algorithm, we could instead keep an explicit stack of pairs (n, k) waiting to have foo(n, k) computed and added to the grand total, consuming at most two words of memory per waiting call.
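
Here is a sketch of that explicit-stack idea in C. The name binom_iter, the bound MAX_N and the decision to return 0 when k > n or n > MAX_N are all assumptions made for the illustration. The array never holds more than n pairs at once (or one pair when n = 0), so MAX_N + 1 entries are enough.

```c
#include <stdint.h>

#define MAX_N 32                   /* assumed bound on n for this sketch */

/* Hypothetical iterative version: returns 0 if k > n or n > MAX_N. */
uint32_t binom_iter(uint32_t n, uint32_t k) {
    struct { uint32_t n, k; } stack[MAX_N + 1];
    int top = 0;                   /* number of waiting pairs */
    uint32_t total = 0;

    if (k > n || n > MAX_N) return 0;

    stack[top].n = n; stack[top].k = k; top++;
    while (top > 0) {
        top--;                     /* take the most recent pair */
        uint32_t pn = stack[top].n, pk = stack[top].k;
        if (pk == 0 || pk == pn) {
            total++;               /* base case contributes 1 */
        } else {
            /* replace (pn, pk) by the two pairs of the recurrence */
            stack[top].n = pn - 1; stack[top].k = pk;     top++;
            stack[top].n = pn - 1; stack[top].k = pk - 1; top++;
        }
    }
    return total;
}
```

The memory needed is now fixed in advance at two words per array entry, though the running time is no better than that of the recursive version.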


What's the difference between a subroutine, a procedure and a function?

I tend to use all three terms interchangeably. I think subroutine is the most neutral term, unlikely to be confused with either an informal list of instructions (e.g., the procedure for downloading code to the micro:bit) or a mathematical operation (e.g., the sine function). The term procedure is common in thinking about Pascal-like languages, where indeed each subroutine is introduced by the keyword procedure. The corresponding term in C (and, I suppose, in functional languages) is function.


Assembly language: A symbolic representation of the machine code for a program.

Stack frame: A region of storage, allocated from a stack, that contains the parameters and local variables of a procedure activation, together with administrative information (in the frame head) used when the procedure returns, or for access to variables in enclosing procedures.

Stack pointer: A register sp that holds the address of the most recently occupied word of the subroutine stack. On ARM, as on most recent processors, the subroutine stack grows downwards, so that the sp holds the lowest address of any occupied word on the stack.

Thumb code: An alternative instruction encoding for the ARM in which each instruction is encoded in 16 rather than 32 bits. The advantage is compact code, the disadvantage that only a selection of instructions can be encoded, and only the first 8 registers are easily accessible. In Cortex-M microcontrollers, the Thumb encoding is the only one provided.

Link register: On ARM processors, a register (r14) in which the program counter value is saved by the instructions bl and blx that call a subroutine. The subroutine can return by branching to this address with the instruction bx lr, or can save the value on the stack (with push {..., lr}) and later return by restoring the same value back into the program counter (with pop {..., pc}).