Lecture 3 – Multiplying numbers (Digital Systems)

Copyright © 2024 J. M. Spivey
Jump to navigation Jump to search

Multiplying numbers

[3.1] The nRF51822 implements the optional integer multiply instruction, but let's pretend for a little while that it doesn't, and write our own subroutine for multiplying two integers together. We could write it in C.

unsigned foo(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    /* Invariant: a * b = x * y + z */
    while (y != 0) {
        y = y - 1;
        z = z + x;
    }

    return z;
}

Here we use the unsigned type that represents integers in the range [math]\displaystyle{ [0\ldotp\ldotp2^{32}) }[/math], because the simple algorithm doesn't deal with negative numbers. It is the simplest algorithm imaginable, computing a * b by adding together b copies of a.

Let's rewrite that subroutine in assembly language, partly for practice, and partly so we can study what takes the time when we run it on a machine. We will follow the convention that the two arguments a and b arrive in registers r0 and r1: if we arrange to keep x in r0 and y in r1 during the subroutine body, then we won't have to do anything to set x to a and y to b initially. We'll keep z in r2.

[3.2] Here's an implementation of the same subroutine in assembly language, obtained by translating the C code line by line.

foo:
@ -----------------
        movs r2, #0         @ z = 0
loop:
        cmp r1, #0          @ if y == 0
        beq done            @     jump to done
        subs r1, r1, #1     @ y = y - 1
        adds r2, r2, r0     @ z = z + x
        b loop              @ jump to loop
done:
        movs r0, r2         @ put result in r0
@ -----------------
        bx lr

There are many things to notice here.

  • Arithmetic happens between registers, so
subs r1, r1, #1
subtracts the constant 1 from register r1, putting the result back in r1. And
adds r2, r2, r0
adds the number in r0 to the number in r2, putting the result back in r2.
  • Control structures like the while loop are implemented with conditional and unconditional branches. Thus, the two instructions
cmp r1, #0
beq done
compare the number in r1 with the constant 0; if they are equal, the second instruction "branch-if-equal" takes effect, and execution of the program continues at label done instead of the next instruction. The cmp instruction sets the four bits NZCV in the processor status word according to the result of the comparison, and the beq instruction interprets these bits to find whether the two values compared were equal; it branches if the Z bit is 1. At the end of the loop is an unconditional branch back to the start, written
b loop

Context

This scheme of having condition codes set by arithmetic instructions or by explicit comparisons, followed by conditional branches that test the condition codes, is an almost universal feature of instruction set architectures, and the interpretation of the NZCV bits is practically standard. Only a few architectures (e.g., the MIPS) are different.
  • We can set a register to a small constant (in the range [0..256) ) with an instruction like
movs r2, #0
or copy a value from one register to another with
movs r0, r2
That's used in the subroutine to put the result (accumulated in r2) into the register r0 where out caller expects to find it.
  • In a simple subroutine like this, we are free to use registers r0, r1, r2, r3 as we like, without worrying that they may hold values needed elsewhere. We can also use r4 to r7, provided we preserve and restore their values, in way we shall see later.

[3.3] By disassembling the program, we can see how these new instructions are encoded.

$ arm-none-eabi-objdump -d mul1.o
00000000 <foo>:
   0:  2200      movs    r2, #0 
00000002 <loop>:
   2:  2900      cmp     r1, #0
   4:  d002      beq.n   0xc <done>
   6:  3901      subs    r1, #1
   8:  1812      adds    r2, r2, r0
   a:  e7fa      b.n     0x2 <loop>
0000000c <done>:
   c:  0010      movs    r0, r2
   e:  4770      bx      lr

Note that at offset 0x4, the beq instruction is assembler as 0xd002: in binary,

1101 0000 00000010
b    eq   offset 2

When the branch is taken, the target address is the address of the instruction, plus 4, plus twice the offset: 0x4 + 4 + 2 * 2 = 0xc. Each conditional branch contains an 8-bit signed offset relative to pc+4 that is multiplied by 2. (The instruction is shown as beq.n because it is the narrow form of beq that fits in 16 bits; other ARM variants have a wide variant also, which the Cortex-M0 lacks.)

An unconditional branch has an 11-bit offset, so at 0xa we find 0xe7fa, or in binary,

11100 11111111010
b     offset -6

The target address is 0xa + 4 - 2 * 6 = 0x2.

Glancing at the other instructions, the subs is encoded like this:

00111 001 00000001
subs  r1  const 1

You can see that, in this form of instruction that subtracts a constant from a register and puts the result back in the register, we have three bits to specify any register in the range r0--r7, and eight bits to specify a constant in the range [0..256).

The adds instruction is encoded in a form where three registers can be specified, so the result could have been put in a different place from the two inputs.

0001100 000 010 010
adds    r0  r2  r2

Only adds and subs exist in this form. As we shall see, other arithmetic and logical operations exist only in a form where the result register is the same as one of the inputs. This isn't because of any restrictions on what the core of the processor can do, but a matter of using the 16-bit instruction space in the most useful way.

Context

The Cortex-M0 (like other ARM variants) with the usual calling convention makes special provision for a small subroutine like this that makes no access to (data) memory, receiving its arguments in registers, saving the return address in a register, and returning its result in a register. More complex subroutines do need to access memory, if they need to use more than a few registers, or if they need to call another subroutine, but simple 'leaf' routines are common enough that this gives useful savings. By way of contrast, the x86 has a calling convention where subroutine arguments and the return address are always stored on the stack, so that use of memory cannot be avoided.

Timing the program

[3.4] How fast (or how slow) is this routine? We can predict the timing easily on a simple machine like the Cortex-M0, because each instruction in the program takes one clock cycle to execute, except that a taken branch causes an additional two cycles to be lost before the machine executes the instruction at the branch target address. Thus the loop in the subroutine contains five instructions, but in a typical iteration (any but the last) takes seven cycles, one for each instruction, including the untaken beq, and two extra cycles for the taken b instruction. And the number if of iterations is equal to the argument a.

Connecting the logic anayser. The ground wire (white) is connected to the micro:bit's ground, and channel 1 (black) to Pin 3, which is high during the timing pulse. (I've used a breakout board to make the connections easier.)

Apart from branch instructions, which need 3 cycles if taken and 1 if not, instructions that access memory need an extra cycle, and provide another exception to the rule that the processor executes one instruction per clock cycle. The timings of all the instructions are given in Section 3.3 of the Cortex-M0 Technical Reference Manual linked from the documentation page. These timings reveal that a model of the processor where each successive state is computed in a combinational way from the preceding state is not really accurate. In fact, the Cortex-M0 design has a small amount of pipelining, overlapping the decoding of the next instruction and the fetching of the next-but-one with the execution of the current instruction.

The progress of instructions through the pipeline is disrupted whenever a branch is taken, and it takes a couple of cycles for the pipeline to fill again. Also, load and store instructions need to use the single, shared interface to memory both when the instruction is fetched and again when it is executed, so they also add an extra cycle. Unlike more sophisticated machines, the Cortex-M0 doesn't try to guess which way a branch instruction will go in order to avoid stalling the pipeline; for many such machines with deeper pipelines, the penalty for a mis-predicted branch is high, so effective branch prediction becomes essential for performance. More details on all of this can be found in the Second Year course on Computer Architecture.

[3.5] It's quite easy to verify the timing of the loop by experiment. Things are set up so that we can do this in three different ways.

  • First, the processor has an internal counter that increment at the same rate (16MHz) as the processor clock. As you've seen, the main program uses this counter to measure the number of clock cycles taken by the assembly-language subroutine func. After compensating for the time taken to perform the subroutine call, the time is printed together with the result.
  • Second, the main program lights one of the LEDs on the micro:bit during the time that the subroutine is running. We use electronic instruments to measure the time the LED is on. One possibility is to use an inexpensive logic analyser, but that is restricted to a sample rate of 24MHz, and that's barely enough to measure a pulse length that is a multiple of 1/16 μsec. But we can time a program that takes many clock cycles and still get a fairly accurate idea of the running time. For the multiplication program, we could try calculating 100 * 123 and 200 * 123, and the difference between the two times would be the time for 100 iterations of the loop.
  • Third, it's better to use an oscilloscope that is able to sample at a higher rate.

[3.6] Here's a trace for the calculation of 123 * 5 on the V1 micro:bit, showing a pulse length of 3.33μsec.

Calculation of 123 * 5
Calculation of 123 * 5

And here's a calculation of 123 * 6. The pulses can be measured with the oscilloscope even though they are far too short for the LED to light visibly.

Calculation of 123 * 6
Calculation of 123 * 6

The difference is 0.44μsec as predicted.

We can do better than this! For one thing, we could code the function more tightly, and reduce the number of instructions executed and the number of cycles needed by the loop (see Ex. 1.3). But better still, we could use a better algorithm.

Questions

Will we need to memorise Thumb instructions for the exam?

No, the exam paper will contain a quick reference guide (published here in advance) to the commonly used instructions – and of course, you won't be expected to remember details of the uncommon ones. As you'll see, the ranges covered by immediate fields are spelled out, but there will be no questions that turn on the precise values. A question could refer to an instruction not on the chart, but only if the question itself describes the instruction.

For example, the paper might ask, "Explain why it is advantageous to allow a larger range of offsets for the load instruction ldr r0, [sp, #off] than for the form ldr r0, [r1, #off]." Or it might ask, "There is an instruction svc that directly causes an interrupt: explain why this instruction provides a better way of entering the operating system than an ordinary procedure call." There certainly won't be a question that says without further context, "Decode the instruction 0x8447." And in asking for a simple piece of assembly language, the paper won't try to trick you with snags like the fact that adds r0, r0, #10 has an encoding but adds r0, r1, #10 does not. If you write the latter instruction in place of movs r0, r1; adds r0, r0, #10 then you are unlikely to lose even a single mark.

If a subroutine expects its return address in the lr register, and one subroutine A calls another subroutine B, won't the call to B overwrite the return address of A, so that A no longer knows where to return?

Our first attempts at assembly language programming have been tiny subroutines that call no others and use only the registers r0 to r3. For them (leaf routines), we can assume the return address arrives in lr and remains there until the subroutine returns with bx lr. Calling another subroutine does indeed trash the value of lr, and that means that non-leaf routines must carefully save the lr value when they are entered, so that they can use it as a return address later. As we'll see very shortly, this is neatly done by writing something like

push {r4, r6, lr}

at the top of the subroutine, an instruction which at one fell swoop saves some registers and the lr value on the stack. Then we write a matching instruction

pop {r4, r6, pc}

at the bottom to restore the registers and load the return address into the pc, effectively returning from the subroutine. It's good that the action of saving registers on the stack is separate from the action of calling a subroutine, because it allows us to use the simpler and faster scheme for leaf routines.

Lecture 4