# Lecture 3 – Multiplying numbers (Digital Systems)

## Multiplying numbers

The nRF51822 implements the optional integer multiply instruction, but let's pretend for a little while that it doesn't, and write our own subroutine for multiplying two integers together. We could write it in C.

```c
unsigned foo(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    /* Invariant: a * b = x * y + z */
    while (x != 0) {
        x = x - 1;
        z = z + y;
    }

    return z;
}
```

Here we use the `unsigned` type that represents integers in the range [0..2^32), because the simple algorithm doesn't deal with negative numbers. It is the simplest algorithm imaginable, computing `a * b` by adding together `a` copies of `b`.
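To see why the loop is correct, it can help to check the invariant at run time. The following is a minimal sketch, not part of the lecture's code: the name `mul_checked` is made up for illustration, and the checks rely on `unsigned` arithmetic being carried out modulo 2^32 (assuming a 32-bit `unsigned`).

```c
#include <assert.h>

/* Hypothetical variant of foo that checks the loop invariant as it runs.
   Unsigned arithmetic wraps around, so the invariant holds modulo 2^32. */
unsigned mul_checked(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    while (x != 0) {
        assert(a * b == x * y + z);   /* invariant holds at the top of each iteration */
        x = x - 1;
        z = z + y;
    }

    /* On exit x == 0, so the invariant reduces to a * b == z. */
    assert(x == 0 && z == a * b);
    return z;
}
```

Each iteration moves one copy of `y` from the product `x * y` into `z`, so the sum `x * y + z` never changes; when `x` reaches zero, all of it is in `z`.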

Let's rewrite that subroutine in assembly language, partly for practice, and partly so we can study what takes the time when we run it on a machine. We will follow the convention that the two arguments `a` and `b` arrive in registers `r0` and `r1`: if we arrange to keep `x` in `r0` and `y` in `r1` during the subroutine body, then we won't have to do anything to set `x` to `a` and `y` to `b` initially. We'll keep `z` in `r2`.

Here's an implementation of the same subroutine in assembly language, obtained by translating the C code line by line.

```
foo:
@ -----------------
    movs r2, #0         @ z = 0
loop:
    cmp r0, #0          @ if x == 0
    beq done            @     jump to done
    subs r0, r0, #1     @ x = x - 1
    adds r2, r2, r1     @ z = z + y
    b loop              @ jump to loop
done:
    movs r0, r2         @ put result in r0
@ -----------------
    bx lr
```

There are many things to notice here.

• Arithmetic happens between registers, so
```subs r0, r0, #1
```
subtracts the constant 1 from register `r0`, putting the result back in `r0`. And
```adds r2, r2, r1
```
adds the number in `r1` to the number in `r2`, putting the result back in `r2`.
• Control structures like the `while` loop are implemented with conditional and unconditional branches. Thus, the two instructions
```cmp r0, #0
beq done
```
compare the number in `r0` with the constant 0; if they are equal, the second instruction "branch-if-equal" takes effect, and execution of the program continues at label `done` instead of the next instruction. The `cmp` instruction sets the four bits `NZCV` in the processor status word according to the result of the comparison, and the `beq` instruction interprets these bits to find whether the two values compared were equal; it branches if the `Z` bit is 1. At the end of the loop is an unconditional branch back to the start, written
```b loop
```

Context

This scheme of having condition codes set by arithmetic instructions or by explicit comparisons, followed by conditional branches that test the condition codes, is an almost universal feature of instruction set architectures, and the interpretation of the NZCV bits is practically standard. Only a few architectures (e.g., the MIPS) are different.
• We can set a register to a small constant (in the range [0..256)) with an instruction like
```movs r2, #0
```
or copy a value from one register to another with
```movs r0, r2
```
That's used in the subroutine to put the result (accumulated in `r2`) into the register `r0` where our caller expects to find it.
• In a simple subroutine like this, we are free to use registers `r0`, `r1`, `r2`, `r3` as we like, without worrying that they may hold values needed elsewhere. We can also use `r4` to `r7`, provided we preserve and restore their values, in a way we shall see later.
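The way `cmp` sets the condition codes and `beq` tests them can be modelled in a few lines of C. This is only a sketch for illustration: the names `struct flags`, `cmp_flags` and `beq_taken` are invented here, and the model treats a comparison as a 32-bit subtraction whose result is thrown away.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of the four condition-code bits. */
struct flags { bool n, z, c, v; };

/* Flags as set by `cmp a, b`, which behaves like the subtraction a - b. */
struct flags cmp_flags(uint32_t a, uint32_t b) {
    uint32_t r = a - b;                       /* result modulo 2^32 */
    struct flags f;
    f.n = (r >> 31) != 0;                     /* result is negative when read as signed */
    f.z = (r == 0);                           /* result is zero, i.e. a == b */
    f.c = (a >= b);                           /* carry out: set when no borrow is needed */
    f.v = (((a ^ b) & (a ^ r)) >> 31) != 0;   /* signed overflow */
    return f;
}

/* `beq` branches exactly when the Z bit is 1. */
bool beq_taken(struct flags f) {
    return f.z;
}
```

In the multiplication loop, `cmp r0, #0` corresponds to `cmp_flags(x, 0)`, and the branch to `done` happens only when `x` has reached zero.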

By disassembling the program, we can see how these new instructions are encoded.

```
$ arm-none-eabi-objdump -d mul1.o
00000000 <foo>:
   0:  2200      movs    r2, #0

00000002 <loop>:
   2:  2800      cmp     r0, #0
   4:  d002      beq.n   0xc <done>
   6:  1852      adds    r2, r2, r1
   8:  3801      subs    r0, #1
   a:  e7fa      b.n     0x2 <loop>

0000000c <done>:
   c:  0010      movs    r0, r2
   e:  4770      bx      lr
```

Note that at offset 0x4, the `beq` instruction is assembled as 0xd002: in binary,

```1101 0000 00000010
b    eq   offset 2
```

When the branch is taken, the target address is the address of the instruction, plus 4, plus twice the offset: 0x4 + 4 + 2 * 2 = 0xc. Each conditional branch contains an 8-bit signed offset relative to `pc+4` that is multiplied by 2. (The instruction is shown as `beq.n` because it is the narrow form of `beq` that fits in 16 bits; other ARM variants have a wide variant also, which the Cortex-M0 lacks.)

An unconditional branch has an 11-bit offset, so at 0xa we find 0xe7fa, or in binary,

```11100 11111111010
b     offset -6
```

The target address is 0xa + 4 - 2 * 6 = 0x2.
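The two target calculations above can be reproduced mechanically. Here is a small sketch in C, with helper names (`sign_extend`, `cond_branch_target`, `uncond_branch_target`) chosen just for this example; it assumes the instruction word has already been identified as a branch of the right kind.

```c
#include <stdint.h>

/* Interpret the low `bits` bits of `value` as a signed two's-complement number. */
static int32_t sign_extend(uint32_t value, int bits) {
    int32_t v = (int32_t)(value & ((1u << bits) - 1));
    if (v & (1 << (bits - 1)))
        v -= (1 << bits);
    return v;
}

/* Target of a conditional branch (1101 cond offset8) located at address addr. */
uint32_t cond_branch_target(uint32_t addr, uint16_t instr) {
    int32_t offset = sign_extend(instr & 0xff, 8);
    return addr + 4 + (uint32_t)(2 * offset);
}

/* Target of an unconditional branch (11100 offset11) located at address addr. */
uint32_t uncond_branch_target(uint32_t addr, uint16_t instr) {
    int32_t offset = sign_extend(instr & 0x7ff, 11);
    return addr + 4 + (uint32_t)(2 * offset);
}
```

With the values from the listing, `cond_branch_target(0x4, 0xd002)` gives 0xc and `uncond_branch_target(0xa, 0xe7fa)` gives 0x2, matching the labels `done` and `loop`.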

Glancing at the other instructions, the `subs` is encoded like this:

```00111 000 00000001
subs  r0  const 1
```

You can see that, in this form of instruction that subtracts a constant from a register and puts the result back in the register, we have three bits to specify any register in the range `r0`--`r7`, and eight bits to specify a constant in the range [0..256).

The `adds` instruction is encoded in a form where three registers can be specified, so the result could have been put in a different place from the two inputs.

```0001100 001 010 010
```

Only `adds` and `subs` exist in this form. As we shall see, other arithmetic and logical operations exist only in a form where the result register is the same as one of the inputs. This isn't because of any restrictions on what the core of the processor can do, but a matter of using the 16-bit instruction space in the most useful way.
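As a check on the two encodings shown above, we can pull the fields out of the instruction words with shifts and masks. The struct and function names below are made up for this sketch; the field positions are the ones visible in the binary layouts, with the three-register `adds` holding its second operand, first operand and destination in that order from left to right.

```c
#include <stdint.h>

/* Fields of `subs Rdn, #imm8`: 00111 Rdn(3 bits) imm8(8 bits). */
struct sub_imm8 { unsigned rdn, imm; };

struct sub_imm8 decode_sub_imm8(uint16_t instr) {
    struct sub_imm8 d;
    d.rdn = (instr >> 8) & 0x7;   /* register that is both source and destination */
    d.imm = instr & 0xff;         /* constant in the range [0..256) */
    return d;
}

/* Fields of `adds Rd, Rn, Rm`: 0001100 Rm(3 bits) Rn(3 bits) Rd(3 bits). */
struct add_reg { unsigned rd, rn, rm; };

struct add_reg decode_add_reg(uint16_t instr) {
    struct add_reg d;
    d.rd = instr & 0x7;           /* destination register */
    d.rn = (instr >> 3) & 0x7;    /* first operand */
    d.rm = (instr >> 6) & 0x7;    /* second operand */
    return d;
}
```

Decoding 0x3801 gives register 0 and constant 1, that is `subs r0, #1`; decoding 0x1852 gives destination 2, first operand 2 and second operand 1, that is `adds r2, r2, r1`.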

Context

The Cortex-M0 (like other ARM variants) with the usual calling convention allows a small subroutine that makes no access to (data) memory, receiving its arguments in registers, saving the return address in a register, and returning its result in a register. More complex subroutines do need to access memory, if they need to use more than a few registers, or if they need to call another subroutine, but simple 'leaf' routines are common enough that this gives useful savings. By way of contrast, the x86 has a calling convention where subroutine arguments and the return address are always stored on the stack, so that use of memory cannot be avoided.

## Speeding the program up

How fast (or how slow) is this routine? We can predict the timing easily on a simple machine like the Cortex-M0, because each instruction in the program takes one clock cycle to execute, except that a taken branch causes an additional two cycles to be lost before the machine executes the instruction at the branch target address. Thus the loop in the subroutine contains five instructions, but a typical iteration (any but the last) takes seven cycles: one for each instruction, including the untaken `beq`, and two extra cycles for the taken `b` instruction. And the number of iterations is equal to the argument `a`.

[Figure: Connecting an oscilloscope. The ground clip of the scope probe is connected to the micro:bit's ground, and the probe itself to a pin that is high during the timing pulse. (I've soldered a header onto the edge connector to make such connections easier.)]

Apart from branch instructions, which need 3 cycles if taken and 1 if not, instructions that access memory need an extra cycle, and provide another exception to the rule that the processor executes one instruction per clock cycle. The timings of all the instructions are given in Section 3.3 of the Cortex-M0 Technical Reference Manual linked from the documentation page. These timings reveal that a model of the processor where each successive state is computed in a combinational way from the preceding state is not really accurate. In fact, the Cortex-M0 design has a small amount of pipelining, overlapping the decoding of the next instruction and the fetching of the next-but-one with the execution of the current instruction.

The progress of instructions through the pipeline is disrupted whenever a branch is taken, and it takes a couple of cycles for the pipeline to fill again. Also, load and store instructions need to use the single, shared interface to memory both when the instruction is fetched and again when it is executed, so they also add an extra cycle. Unlike more sophisticated machines, the Cortex-M0 doesn't try to guess which way a branch instruction will go in order to avoid stalling the pipeline; for many such machines with deeper pipelines, the penalty for a mis-predicted branch is high, so effective branch prediction becomes essential for performance. More details on all of this can be found in the Second Year course on Computer Architecture.

It's quite easy to verify the timing of the loop by experiment. The main program (the same one as before) is set up so that it turns on one of the LEDs on the micro:bit before calling the subroutine `foo`, and turns it off again afterwards. By attaching the probe of an oscilloscope to one side of the LED, we can measure precisely how long the LED is illuminated; then by trying various values for `a` and `b`, we can find that the value of `b` does not affect the running time of the subroutine at all, but increasing `a` by one increases the time taken by 7 cycles or 437.5ns at 16MHz.
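The arithmetic behind the 437.5ns figure can be written out as a short calculation. This sketch covers only the cost of one extra loop iteration, not the fixed overhead of the call and of switching the LED; the names `CLOCK_HZ`, `cycles_per_iteration` and `extra_time_ps` are invented for the example, and times are kept in picoseconds so that everything stays in integers.

```c
#include <stdint.h>

enum { CLOCK_HZ = 16000000 };   /* 16MHz clock, as quoted in the text */

/* Cycles for one ordinary (not final) iteration of the loop. */
unsigned cycles_per_iteration(void) {
    unsigned one_cycle_instrs = 4;   /* cmp, untaken beq, subs, adds */
    unsigned taken_branch = 3;       /* b loop: 1 cycle plus 2 lost to the refill */
    return one_cycle_instrs + taken_branch;
}

/* Extra running time, in picoseconds, when `a` grows by extra_iterations. */
uint64_t extra_time_ps(unsigned extra_iterations) {
    uint64_t ps_per_cycle = 1000000000000ULL / CLOCK_HZ;   /* 62500 ps per cycle */
    return (uint64_t)extra_iterations * cycles_per_iteration() * ps_per_cycle;
}
```

One extra iteration costs 7 cycles of 62.5ns each, so `extra_time_ps(1)` is 437500 picoseconds, or 437.5ns.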

Here's a trace for the calculation of 5 * 123, showing a pulse length of 3.33μsec.

And here's a calculation of 6 * 123. The pulses can be measured with the oscilloscope even though they are far too short for the LED to light visibly.

The difference is 0.44μsec as predicted.

We can do better than this! For one thing, we could code the function more tightly, and reduce the number of instructions executed and the number of cycles needed by the loop (see Ex. 1.3). But better still, we could use a better algorithm.

## Questions

Will we need to memorise Thumb instructions for the exam?

No, the exam paper will contain a quick reference guide (published here in advance) to the commonly used instructions – and of course, you won't be expected to remember details of the uncommon ones. As you'll see, the ranges covered by immediate fields are spelled out, but there will be no questions that turn on the precise values. A question could refer to an instruction not on the chart, but only if the question itself describes the instruction.

For example, the paper might ask, "Explain why it is advantageous to allow a larger range of offsets for the load instruction `ldr r0, [sp, #off]` than for the form `ldr r0, [r1, #off]`." Or it might ask, "There is an instruction `svc` that directly causes an interrupt: explain why this instruction provides a better way of entering the operating system than an ordinary procedure call." There certainly won't be a question that says without further context, "Decode the instruction 0x8447." And in asking for a simple piece of assembly language, the paper won't try to trick you with snags like the fact that `add r0, r0, #10` has an encoding but `add r0, r1, #10` does not. If you write the latter instruction in place of `mov r0, r1; add r0, r0, #10` then you are unlikely to lose even a single mark.

If a subroutine expects its return address in the `lr` register, and one subroutine `A` calls another subroutine `B`, won't the call to `B` overwrite the return address of `A`, so that `A` no longer knows where to return?

Our first attempts at assembly language programming have been tiny subroutines that call no others and use only the registers `r0` to `r3`. For them (leaf routines), we can assume the return address arrives in `lr` and remains there until the subroutine returns with `bx lr`. Calling another subroutine does indeed trash the value of `lr`, and that means that non-leaf routines must carefully save the `lr` value when they are entered, so that they can use it as a return address later. As we'll see very shortly, this is neatly done by writing something like

```push {r4, r6, lr}
```

at the top of the subroutine, an instruction which at one fell swoop saves some registers and the `lr` value on the stack. Then we write a matching instruction

```pop {r4, r6, pc}
```

at the bottom to restore the registers and load the return address into the `pc`, effectively returning from the subroutine. It's good that the action of saving registers on the stack is separate from the action of calling a subroutine, because it allows us to use the simpler and faster scheme for leaf routines.

Assembly language: A symbolic representation of the machine code for a program.

Condition codes: Four bits, `N`, `Z`, `V` and `C`, in the processor status word that indicate the result of a comparison or other arithmetic operation. Briefly, `N` indicates whether the result of the operation was negative, `Z` indicates whether it was zero, `C` is the value of the carry-out bit from the ALU, and `V` indicates whether the operation overflowed, yielding a result that was different in sign from what could be predicted from the inputs to the operation. A comparison is treated like a subtraction as far as setting the condition codes is concerned. After the condition codes have been set, a subsequent conditional branch instruction can test them, and make a branch decision based on a boolean combination of their values. All ten arithmetic comparisons (equal, not-equal, and less-than, less-than-or-equal, greater-than, and greater-than-or-equal for both signed and unsigned representations) can be represented in this way. When a process is interrupted, the condition codes must be saved and restored as part of the processor state, in case the interrupt came between a comparison and a subsequent conditional branch.

Calling convention: (A near-synonym for ABI.) The convention that determines where arguments for a subroutine are to be found, and where the result is returned.

Thumb encoding: An alternative instruction encoding for the ARM in which each instruction is encoded in 16 rather than 32 bits. The advantage is compact code, the disadvantage that only a selection of instructions can be encoded, and only the first 8 registers are easily accessible. In Cortex-M microcontrollers, the Thumb encoding is the only one provided.