Lecture 6 – Memory and addressing (Digital Systems)

Memory and addressing

In the RISC style, access to memory is done by separate instructions from those that do arithmetic and branching – a 'load/store' architecture. The Cortex-M0 provides a limited but usable set of ways to compute the address of an item in memory – addressing modes – and permits these to be used in ldr and str instructions to move values between memory and registers.

Having separate load and store instructions from arithmetic means that a RISC program may have more instructions than a CISC program for the same task, but the aim is to make each instruction simple enough that it can be executed in a single clock cycle. By way of contrast, a complex instruction for a CISC machine that both loads a word from memory and adds it to an existing value in a register will probably need several cycles, doing each part of the instruction separately.

The ARM is a RISC machine, so load and store are separate instructions from arithmetic: a = b * c + d is written as something like

    ldr r0, [#b]
    ldr r1, [#c]
    mul r0, r0, r1
    ldr r1, [#d]
    add r0, r0, r1
    str r0, [#a]

This contrasts with a CISC machine, where it's possible to combine memory access and arithmetic in the same instruction, like this:

    ldr r0, [#b]
    mul r0, [#c]
    add r0, [#d]
    str r0, [#a]

That looks shorter, but each instruction is more complex. On modern implementations of CISC instruction sets (e.g. x86), each CISC-style instruction is expanded into a sequence of RISC-style operations before execution. In contrast, the instructions in the RISC-style fragment correspond more directly to individual actions of the machine.

But (alas) what looks simple and ought to be easy isn't as simple and easy as it seems. The problem is that the addresses a, b, etc., if they refer to global variables, might be anywhere in the 32-bit address space, and there isn't room in an ARM instruction (32 bits), still less in a Thumb instruction (16 bits), to fit such a constant as well as an opcode. The solution is in two parts: first, we switch to a load or store instruction that takes the address from a register, and then we introduce a way to put an arbitrary 32-bit quantity (such as an address) into any register.

For the first part, if register r2 contains an address, then the instruction

    ldr r0, [r2]

loads the 4-byte quantity from that address and puts it in register r0.
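
The corresponding store works the other way round: if r2 still holds the address, then

    str r0, [r2]

copies the 4-byte value in r0 into memory at that address.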

For the second part, we are going to use a second load instruction, one that uses an offset relative to the program counter pc, and we will plant the address of variable a (not the variable a itself) close to the instruction that wants to use it.

    ldr r2, [pc, #n] ----
    ldr r0, [r2]       |
                       | offset n 
    ...                |
                       V
    .word a          ----

This will work provided the offset n is not too large to fit in the instruction: the Thumb pc-relative load encodes an 8-bit word offset, so the constant must lie within about 1 kB after the instruction. Since this construction is common, the assembler provides a handy abbreviation for it: we can write

    ldr r2, =a
    ldr r0, [r2]

and the assembler will find a place to put the constant a and calculate the offset for the pc-relative load instruction. The assembler will put its constant pool – containing all the constants it has gathered – after all the code in the source file, or we can put a special directive .ltorg or .pool in a place of our choosing, such as just after the return instruction of a subroutine, so as to put small groups of constants closer to where they are used. (The name ltorg and the idea of a constant pool come from the IBM System/360 assembler.)
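
For example (a sketch only, with a made-up subroutine name func), placing .pool just after the return instruction keeps the constants next to the code that uses them:

func:
    ldr r2, =a               @ address of a, taken from the literal pool
    ldr r0, [r2]             @ fetch the value of a
    ...
    bx lr                    @ return
    .pool                    @ the assembler plants the word holding a's address here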

Using this mechanism, the code for a = b * c + d still looks quite cumbersome:

    ldr r2, =b
    ldr r0, [r2]
    ldr r2, =c
    ldr r1, [r2]
    mul r0, r0, r1
    ldr r2, =d
    ldr r1, [r2]
    add r0, r0, r1
    ldr r2, =a
    str r0, [r2]

But this statement is not really typical of what happens in real programs, where references to global variables are rare, and tend to be clustered, with several references to the same variable close together. Much more typical (of code that does refer to a global variable) would be the assignment a = a+1, which can be achieved by

    ldr r0, =a
    ldr r1, [r0]
    add r1, r1, #1
    str r1, [r0]

Note here how the address of a is put in register r0, and then used twice, once in the ldr and again in the str.

It's worth noticing that a decent C compiler won't care whether the command to increment a is written a = a+1 or a++ or ++a: in each case, the compiler will convert the source into the same internal form, and it can notice for itself (by a process called common subexpression elimination) that the a's on the two sides of the assignment a = a+1 refer to the same variable.

Array indexing[edit]

Individual global or local variables are useful, but more useful still is the ability to access elements of an array by indexing. That entails being able to calculate the address of the location accessed by a load or store instruction. The address of the i'th element a[i] of an array a is base + i * elsize, where base is the starting address of the array and elsize is the size of one element.
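
For example, if an array of 4-byte words happened to start at address 0x20000000 (a made-up RAM address), then element 5 would be at 0x20000000 + 5 * 4 = 0x20000014.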

[Figure: Array indexing]

Source file: lab1-asm/catalan.s

The ARM helps with this calculation by providing a form of the ldr and str instructions (an addressing mode) where the memory address is computed as the sum of two registers, as in ldr r0, [r1, r2]. This instruction adds together the values in r1 and r2 to form an address, loads the 4-byte word from this address, and puts the value in r0. Typically, r1 might contain the base address of an array, and r2 might contain 4 times the index, usually computed by means of a left-shift instruction lsls r2, r2, #2.
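
For instance, assuming the base address of an array of words is in r1 and the index i is in r2 (an illustrative choice of registers), element i can be fetched like this:

    lsls r2, r2, #2          @ replace i by 4*i
    ldr r0, [r1, r2]         @ load the word at address r1 + r2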

Just for fun, let's write a function that computes the Catalan numbers Cn, defined by the recurrence

    C_{n+1} = C_0 C_n + C_1 C_{n-1} + ... + C_n C_0

with C0 = 1. We can do that by storing the values Ck in an array, and repeatedly using the values of Cj for 0 ≤ j ≤ k to compute Ck+1. Here is the idea written as a C function, using a statically allocated array row.

static unsigned row[256];

unsigned foo(unsigned n, unsigned dummy) {
    int j, k;
    unsigned t;
 
    k = 0;
    row[0] = 1;

    while (k < n) {
        /* Use C[0..k] to compute C[k+1] */
        j = 0; t = 0;
        while (j <= k) {
            t += row[j] * row[k-j];
            j++;
        }
        k++; row[k] = t;
    }

    return row[n];
}

I've given foo an extra parameter to fit in with our usual main program, and (for no very good reason) I've used the type int for array indices and unsigned for the computed values. (We do nothing sensible here if n is either negative or too big for the array; the values of Cn overflow the integer range long before then anyway.)

To render this in assembly language, we might begin by looking at the expression row[j] that is embedded in the inner loop. Let's assume that the base address of the array row is held in register r4 throughout the subroutine, and that j lives in register r6. In order to compute the address of row[j], we must add together the value in r4 and 4 times the value in j. We need to multiply by 4 because each byte of memory has its own address, and each 32-bit integer in the array occupies 4 bytes. We can multiply by 4 conveniently by shifting left by 2 bits. There's a form of the ldr instruction that adds the values in two registers to form the address, so we can get the value of row[j] in r2 with the instructions

    lsls r1, r6, #2          @ 4*j in r1
    ldr r2, [r4, r1]         @ row[j] in r2

Next, let's get the value of row[k-j] in r1, assuming k lives in r5.

    subs r1, r5, r6          @ k-j in r1
    lsls r1, r1, #2          @ 4*(k-j) in r1
    ldr r1, [r4, r1]         @ row[k-j] in r1

After fetching the two array elements, we can complete the assignment

t += row[j] * row[k-j]

by multiplying them, then adding the product to t, assumed to live in r3.

    muls r2, r2, r1          @ Multiply
    adds r3, r3, r2          @ Add to t
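
Putting these pieces together, the whole inner loop might be rendered something like this (only a sketch: the label names are invented, and it assumes as above that j is in r6, k in r5, t in r3 and the base of row in r4):

innerloop:
    cmp r6, r5               @ compare j with k
    bgt innerdone            @ leave the loop when j > k
    lsls r1, r6, #2          @ 4*j in r1
    ldr r2, [r4, r1]         @ row[j] in r2
    subs r1, r5, r6          @ k-j in r1
    lsls r1, r1, #2          @ 4*(k-j) in r1
    ldr r1, [r4, r1]         @ row[k-j] in r1
    muls r2, r2, r1          @ multiply
    adds r3, r3, r2          @ add to t
    adds r6, r6, #1          @ j++
    b innerloop              @ and round again
innerdone: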

The function has two nested loops that can be translated in a straightforward way. The remaining problem is how to create space for the array row and arrange for r4 to contain its address. As a statically-allocated array, not declared within any function, it becomes part of the BSS segment of the program and is assigned a fixed address in RAM. We can reserve space with these directives

    .bss                     @ Allocate space in BSS segment     
    .align 2                 @ Align on 4-byte boundary
row:
    .space 1024              @ Reserve 1024 bytes of space
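
(The 1024 bytes reserved here are exactly the 256 four-byte words declared for row in the C version, and .align 2 requests alignment on a multiple of 2^2 = 4 bytes.)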

Then to ensure that r4 points to the array throughout the function, begin it with

foo:
    push {r4-r7, lr}         @ Save registers
    ldr r4, =row             @ Set r4 to the base of the array row

And end with

    lsls r1, r0, #2          @ return row[n]
    ldr r0, [r4, r1]
    pop {r4-r7, pc}          @ Restore and return

The remaining details are in the file catalan.s.

Actually, the sequence of Catalan numbers is well known, arising in many combinatorial situations, and the Online Encyclopedia of Integer Sequences (https://oeis.org/A000108) gives a formula for the n'th term.

    C_n = (2n)! / (n! (n+1)!)

(see also this nice derivation on the website of my namesake.)

The first few values are 1, 1, 2, 5, 14, 42, .... We can use this to check that the function is returning the correct results, at least for small values of n.
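
For instance, taking n = 5 in the formula gives 10!/(5!·6!) = 3628800/86400 = 42, matching the last value listed above.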

Glossary

CISC: (Complex Instruction Set Computer) The opposite of RISC.

RISC: (Reduced Instruction Set Computer) A style of computer design with a simplified instruction set: typically such machines have a large set of uniform registers, arithmetic instructions that operate between registers, and separate load and store instructions with a limited set of addressing modes.

address space: A numbering system for memory locations. ARM-based microcontrollers (like most bigger machines) have a single address space containing both code and data. Some other microcontroller families have separate address spaces for code and data, in what is called a Harvard architecture.

Thumb code: An alternative instruction encoding for the ARM in which each instruction is encoded in 16 rather than 32 bits. The advantage is compact code, the disadvantage that only a selection of instructions can be encoded, and only the first 8 registers are easily accessible. In Cortex-M microcontrollers, the Thumb encoding is the only one provided.

program counter: A register that contains the address of the next instruction to be executed. Because of pipelining, on ARM Cortex-M machines, reading the program counter yields a value that is 4 bytes greater than the address of the current instruction.

addressing mode: In instructions that access memory, one of several rules for computing the address of the location to be accessed. For example, one addressing mode might obtain the address by adding the contents of two registers, and another might add a register and a small constant. CISC machines are characterised by more varied and more complex addressing modes than RISC machines.

assembly language: A symbolic representation of the machine code for a program.