# Lecture 5 – Loops and subroutines (Digital Systems)

## A better multiplication algorithm

Let's keep the invariant `a * b = x * y + z`, but try halving `x` in each iteration: in not-quite-C,

```
unsigned foo(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    /* Invariant: a * b = x * y + z */
    while (x != 0) {
        if (x odd) z = z + y;
        x = x/2; y = y*2;
    }

    return z;
}
```
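For anyone who wants to run the algorithm, here is a minimal sketch in real C. The name `mult` is my own choice for illustration, the test `x odd` is written as a test of the bottom bit, and an `assert` checks the invariant on each iteration. Unsigned arithmetic in C wraps around, so the invariant is checked modulo the word size.

```c
#include <assert.h>

/* Multiply by repeated halving and doubling (hypothetical name `mult`). */
unsigned mult(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    while (x != 0) {
        /* Invariant: a * b == x * y + z, modulo the word size */
        assert(a * b == x * y + z);
        if (x & 1) z = z + y;   /* "x odd": the bottom bit of x is 1 */
        x = x >> 1;             /* x = x/2, discarding the remainder */
        y = y << 1;             /* y = y*2 */
    }

    return z;                   /* here x == 0, so a * b == z */
}
```

A call such as `mult(13, 11)` goes round the loop four times, once for each bit of 13, rather than 13 times.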

We'll code this routine in assembly language, again keeping `x`, `y` and `z` in registers `r0`, `r1` and `r2`. How are we going to implement the operations `x = x/2` and `y = y*2`? The ARM provides shift instructions that (at least for positive inputs) can multiply or divide by powers of 2: the instruction

```
    lsls r1, r1, #1
```

shifts the contents of `r1` to the left by one place, multiplying the value by 2. (Actually, almost the same effect results from `adds r1, r1, r1`.) Also, the instruction

```
    lsrs r0, r0, #1
```

shifts the contents of `r0` one place to the right, dividing the value by 2, and throwing away any remainder. It shifts out a single bit at the right (and puts it in the `C` flag), and shifts in a zero at the left. These are called logical shifts because of the zero bit (or bits) that are shifted in at one end or the other; there's also an arithmetic right shift instruction that makes more sense when the bit pattern might represent a negative number (see later).
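As a rough model of what these shifts do to a 32-bit register, here is a sketch in C. The function names `lsls1`, `lsrs1` and `asr1` are invented for illustration, and the `carry` argument stands in for the `C` flag. The arithmetic shift is written out with explicit bit operations, because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* Model of lsls rd, rm, #1: bit 31 is shifted out into the carry. */
uint32_t lsls1(uint32_t x, unsigned *carry) {
    *carry = x >> 31;
    return x << 1;
}

/* Model of lsrs rd, rm, #1: bit 0 is shifted out into the carry,
   and a zero is shifted in at the left. */
uint32_t lsrs1(uint32_t x, unsigned *carry) {
    *carry = x & 1;
    return x >> 1;
}

/* Model of an arithmetic right shift by one place: a copy of the
   sign bit is shifted in at the left instead of a zero. */
uint32_t asr1(uint32_t x) {
    return (x >> 1) | (x & 0x80000000u);
}
```

With these definitions, `lsrs1(13, &c)` gives 6 with `c` set to 1, and `asr1` applied to the bit pattern for -6 gives the bit pattern for -3.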

The fact that the bit shifted out is put in the `C` flag allows us to implement the test `x odd` neatly: if we write

```
    lsrs r0, r0, #1
    bcc even
```

then that divides `x` by 2 and branches to the label `even` if `x` was even, executing the next instruction only if `x` was odd. (That's not an improvement we can expect a compiler to find.)

A final idea is to put the test `x != 0` at the end of the loop, so that there is only one branch instruction in the loop itself. The unconditional branch at the start is executed only once.

```
foo:
    movs r2, #0             @ z = 0
    b test
again:
    lsrs r0, r0, #1         @ x = x/2
    bcc even                @ if x was even, skip
    adds r2, r2, r1         @ z = z + y
even:
    lsls r1, r1, #1         @ y = y*2
test:
    cmp r0, #0              @ if x != 0
    bne again               @   repeat
    movs r0, r2             @ return z
    bx lr
```
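To see the shape of this code without the assembly syntax, here is a sketch that transcribes it into C, label for label, using `goto`. The name `mult_rotated` and the variable `carry` are mine; this is a reading aid, not a suggestion that anyone should write C this way.

```c
/* C transcription of the assembly routine (hypothetical name).
   x, y, z play the parts of r0, r1, r2; carry models the C flag. */
unsigned mult_rotated(unsigned x, unsigned y) {
    unsigned z = 0, carry;

    goto test;              /* executed once, on entry */
again:
    carry = x & 1;          /* lsrs: bottom bit goes to the carry ... */
    x = x >> 1;             /* ... and x is halved */
    if (carry) z = z + y;   /* bcc even skips the add when carry is clear */
    y = y << 1;             /* even: y = y*2 */
test:
    if (x != 0) goto again; /* the only branch taken on each iteration */
    return z;
}
```

The loop test sits at the bottom, so each iteration ends with one conditional jump back to `again`, and no unconditional jump is needed inside the loop.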

## Stack frames and locals

What if ...

• we want to write a subroutine that calls another, or perhaps recursively calls itself? (We must save our return address.)
• we want to use registers `r4`--`r7`? (We must restore their values before returning.)
• we need more local variables than will fit in registers (or maybe a local array)?

The answer is to let the subroutine use space on the stack, giving it a stack frame.

The subroutine will have a layout like this:

```
foo:
    push {r4-r7, lr}      @ Save registers
    sub sp, #n            @ Allocate locals
    ...
    add sp, #n            @ Deallocate locals
    pop {r4-r7, pc}       @ Restore and return
```

The `push` and `pop` instructions transfer multiple words between registers and memory, and adjust the stack pointer. They are equivalent to a sequence of load or store instructions plus an `add` or `sub`, but more compact and faster. An ordinary load or store instruction takes 2 cycles, but a `push` or `pop` takes only `k+1` cycles to save or restore `k` registers. Each instruction contains a little bitmap to say which of the low registers `r0`--`r7` to save or restore, and we don't need to save and restore registers that we don't use. Notice that the `push` instruction saves the value of the `lr` register, and the `pop` instruction restores the same value into the `pc`, effectively returning from the subroutine. The `push` instruction implicitly subtracts from `sp`, and the `sub` instruction subtracts some more; the value of `n` in `sub sp, #n` should be a multiple of 4 and less than 512.
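The effect of `push` and `pop` on registers and memory can be sketched with a toy model in C. Everything here is an assumption made for illustration: the `Machine` type, a tiny word-addressed memory, and a 16-bit register mask (the real Thumb encodings have an 8-bit list for `r0`--`r7` plus one extra bit for `lr` or `pc`). The model follows the ARM convention that the lowest-numbered register goes at the lowest address.

```c
#include <stdint.h>

enum { LR = 14, PC = 15 };

/* Toy machine: 16 registers and a small word-addressed memory. */
typedef struct {
    uint32_t reg[16];
    uint32_t mem[64];
    unsigned sp;        /* index of the most recently occupied word */
} Machine;

/* push: move sp down by one word per register, then store the
   registers named in mask in ascending order of address. */
void push(Machine *m, unsigned mask) {
    unsigned count = 0;
    for (int r = 0; r < 16; r++)
        if (mask & (1u << r)) count++;
    m->sp -= count;
    unsigned addr = m->sp;
    for (int r = 0; r < 16; r++)
        if (mask & (1u << r)) m->mem[addr++] = m->reg[r];
}

/* pop: load the registers named in mask in ascending order,
   moving sp back up past each word. */
void pop(Machine *m, unsigned mask) {
    for (int r = 0; r < 16; r++)
        if (mask & (1u << r)) m->reg[r] = m->mem[m->sp++];
}
```

In this model, `push(&m, 0x40F0)` corresponds to `push {r4-r7, lr}` and `pop(&m, 0x80F0)` to `pop {r4-r7, pc}`: the word that was saved from `lr` is the last one loaded, and it lands in the `pc`.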

In the subroutine body,

• we are free to set and use those registers `r4`--`r7` that have been saved. The values we put there will be preserved by any subroutines we call, and their original values will be restored when this subroutine returns.
• local storage can be addressed as positive offsets from `sp`; there are special encodings of `ldr/str rt, [sp, #n]` that help with this.
• addressing for local arrays is assisted by a special encoding of `add rd, sp, #n`.

The RISC approach is to provide simple instructions that operate uniformly on a plentiful set of registers. Traditional RISC machines provide a straightforward encoding of these instructions. With Thumb, only the most useful instructions have an encoding, and it is not uniform: for example, all the following instructions have an encoding, but different rules apply to each – different ranges of values can be used to replace the constant 4 in each case.

```
    ldr r0, [r1, #4]        adds r0, r1, #4
    ldr r0, [sp, #4]        add r0, sp, #4
    ldr r0, [pc, #4]        add r0, pc, #4
```

This non-uniformity is a bit irritating if we are writing large amounts of assembly language by hand: but why do that when high-level languages exist? In a compiler, it is easy enough to represent the options in symbolic form and use them when possible, falling back on more long-winded code when necessary (see next year's Compilers course). In every case, we can get round the limited size of the immediate field by putting the constant into a register first.
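To give a flavour of how a compiler might represent these rules, here is a sketch in C of range checks for the six instruction forms shown earlier. The function names are invented, and the ranges are the ones I understand to apply to the 16-bit Thumb encodings of ARMv6-M (a 5-bit word offset, a 3-bit constant, and 8-bit word offsets respectively); check them against the architecture manual before relying on them.

```c
#include <stdbool.h>

/* ldr r0, [r1, #n]: 5-bit field scaled by 4, so 0, 4, ..., 124 */
bool fits_ldr_reg(unsigned n) {
    return n % 4 == 0 && n <= 124;
}

/* adds r0, r1, #n: 3-bit field, so 0 to 7 */
bool fits_adds_reg(unsigned n) {
    return n <= 7;
}

/* ldr r0, [sp, #n], add r0, sp, #n, ldr r0, [pc, #n], add r0, pc, #n:
   8-bit field scaled by 4, so 0, 4, ..., 1020 */
bool fits_sp_or_pc(unsigned n) {
    return n % 4 == 0 && n <= 1020;
}
```

When one of these checks fails, the fallback is the one already described: load the constant into a register and use a register-plus-register form of the instruction.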

## Example – Binomial coefficients

Let's write a recursive subroutine `foo(n, k)` that computes the binomial coefficient `(n choose k)` using the recurrence `(n choose k) = (n-1 choose k) + (n-1 choose k-1)` with `(n choose 0) = (n choose n) = 1`. The arguments `n` and `k` will arrive in registers `r0` and `r1` as usual, but we will need to re-use those registers to hold the arguments (and receive the results) of recursive calls. We have plenty of registers that are preserved across calls, so we could save `n` and `k` there, but to demonstrate the possibilities, let's save them in the stack frame. We will also need to preserve the result of the first recursive call while we do the second one, and for the sake of variety, we'll use register `r4` for that.

We can sketch the plan in C, though because the saving of parameters is implicit in C, we can't really express all of it.

```
unsigned foo(unsigned n, unsigned k) {
    unsigned result = 1;

    if (k != 0 && k != n) {
        unsigned nn = n-1, kk = k;
        result = foo(nn, kk);
        result = result + foo(nn, kk-1);
    }

    return result;
}
```

To perform a subroutine call, including a recursive call to the same subroutine, we put the arguments in `r0` and `r1`, then use a `bl` instruction, which both saves the return address (the address of the instruction after the `bl`) in the link register `lr` and loads the `pc` with the subroutine address.

```
foo:
    push {r4, lr}           @ Save registers
    sub sp, #8              @ Allocate space for locals
    movs r4, #1             @ Default result

    cmp r1, #0              @ Base case if k = 0
    beq done
    cmp r1, r0              @ ... or k = n
    beq done

    subs r0, r0, #1         @ Compute n-1
    str r0, [sp, #0]        @ Save n-1 in stack
    str r1, [sp, #4]        @ Save k in stack
    bl foo                  @ Call foo(n-1, k)
    movs r4, r0             @ Save result in r4
    ldr r0, [sp, #0]        @ Reload n-1
    ldr r1, [sp, #4]        @ Reload k
    subs r1, r1, #1         @ Compute k-1
    bl foo                  @ Call foo(n-1, k-1)
    adds r4, r4, r0         @ Add the two results

done:
    movs r0, r4             @ Put result in r0
    add sp, #8              @ Reclaim space
    pop {r4, pc}            @ Restore and return
```

The load and store instructions here, such as `ldr r1, [sp, #4]`, have a special form that allows us to form the address by adding together the value of the stack pointer `sp` and a fixed offset, 4 in this case. It's the programmer's (or the compiler's) job to keep track of the frame layout: here the variable `nn` is at offset 0 from the stack pointer, and the variable `kk` is at offset 4.

The result of the first call is moved from `r0` to `r4` before making the second call. The recursive invocation of `foo` will then save the value using the `push` instruction as the subroutine is entered, and restore it as part of the action of the `pop` instruction as it exits.

(This subroutine deliberately mixes techniques in order to demonstrate some of the possibilities. It would have been simpler and better to save the values of `n` and `k` in two registers – say `r5` and `r6` – using register-to-register moves, rather than make space in the stack frame then store and load them explicitly. We would have started the subroutine with `push {r4, r5, r6, lr}`, so that these values would still have been saved in memory, but by a different mechanism.)

Context

Generally speaking, recursive subroutines are a bad idea in embedded systems, because they are often profligate with memory, and it is difficult to determine an upper bound on the depth of the recursion. In this example, each recursive call consumes four words of stack space, and the depth of recursion is determined by the parameter `n`. Even without changing the (admittedly rather silly) algorithm, we could instead keep an explicit stack of pairs `(n, k)` waiting to have `foo(n, k)` computed and added to the grand total, consuming at most two words of memory per waiting call.
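Here is one way the explicit-stack idea might look in C; treat it as a sketch under stated assumptions rather than a recommended implementation. The name `binom_iter` and the bound `MAXN` are my own inventions. Each waiting pair occupies two words, and because the first components of the stacked pairs are strictly decreasing and all smaller than `n`, a stack of `MAXN + 1` entries is more than enough when `n <= MAXN`.

```c
#include <assert.h>

#define MAXN 32     /* assumed upper bound on n */

/* Binomial coefficient by the same silly algorithm, but with an
   explicit stack of pairs waiting to be evaluated. */
unsigned binom_iter(unsigned n, unsigned k) {
    struct { unsigned n, k; } stack[MAXN + 1];
    unsigned depth = 0, total = 0;

    assert(n <= MAXN && k <= n);
    stack[depth].n = n; stack[depth].k = k; depth++;

    while (depth > 0) {
        depth--;
        unsigned nn = stack[depth].n, kk = stack[depth].k;

        /* Follow the (n-1, k) branch directly, leaving each
           (n-1, k-1) sibling on the stack for later. */
        while (kk != 0 && kk != nn) {
            nn--;
            stack[depth].n = nn; stack[depth].k = kk - 1; depth++;
        }

        total++;    /* a base case contributes 1 to the grand total */
    }

    return total;
}
```

The running time is no better than before, since the total is still built up one unit at a time, but the memory used is now bounded by a fixed array whose size we can state in advance.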

## Questions

What's the difference between a subroutine, a procedure and a function?

I tend to use all three terms interchangeably. I think subroutine is the most neutral term, unlikely to be confused with either an informal list of instructions (e.g., the procedure for downloading code to the micro:bit) or a mathematical operation (e.g., the sine function). The term procedure is common in thinking about Pascal-like languages, where indeed each subroutine is introduced by the keyword `procedure`. The corresponding term in C (and, I suppose, in functional languages) is function.

Assembly language: a symbolic representation of the machine code for a program.

Stack frame: a region of storage, allocated from a stack, that contains the parameters and local variables of a procedure activation, together with administrative information (in the frame head) used when the procedure returns, or for access to variables in enclosing procedures.

Stack pointer: a register `sp` that holds the address of the most recent occupied word of the subroutine stack. On ARM, as on most recent processors, the subroutine stack grows downwards, so that the `sp` holds the lowest address of any occupied word on the stack.

Thumb code: an alternative instruction encoding for the ARM in which each instruction is encoded in 16 rather than 32 bits. The advantage is compact code, the disadvantage that only a selection of instructions can be encoded, and only the first 8 registers are easily accessible. In Cortex-M microcontrollers, the Thumb encoding is the only one provided.

Link register: on ARM processors, a register (`r14`) in which the program counter value is saved by the instructions `bl` and `blx` that call a subroutine. The subroutine can return by branching to this address with the instruction `bx lr`, or can save the value on the stack (with `push {..., lr}`) and later return by restoring the same value back into the program counter (with `pop {..., pc}`).