Lecture 6 – Memory and addressing (Digital Systems)

Copyright © 2024 J. M. Spivey

The subroutines we have written so far have kept their data in registers, occasionally using the stack to save some registers on entry and restore them on exit. In more elaborate programs, we will want to access data in memory explicitly, for several reasons.

  • Sometimes there are too many local variables to fit in registers, and we will need to allocate space in the stack frame of a subroutine to store some of them.
  • High-level programs allow global variables, declared outside any subroutine, whose value persists even when the subroutines that access them have returned, and we want the same effect in an assembly-language program.
  • Programs that handle more than a fixed amount of data must use arrays allocated in memory to store it, either global arrays allocated once and for all, or arrays that are local to a subroutine.

For all these purposes, ARM processors use load and store instructions that move data between memory and registers, with the memory transfers happening in separate instructions from any arithmetic that is done on the loaded values before storing them back to memory. This contrasts with designs like the x86, where there are instructions that (for example) fetch a value from memory and add it to a register in one go; on the ARM, fetching the value and adding to the register would be two instructions. The most frequently used load and store instructions move 4-byte quantities between registers and memory, but there are also other instructions (we'll get to them later) that move single bytes and also two-byte quantities.

The entire memory of the computer (including both RAM and ROM) is treated as an array of individual bytes, each with its own address. Quantities that are larger than a single byte occupy several contiguous bytes, most commonly four bytes corresponding to a 32-bit register, with the "little-endian" convention that the least significant byte of a multibyte quantity is stored at the smallest address. These multibyte quantities must be aligned, so that (for example) the least significant byte of a four-byte quantity is stored at an address that is a multiple of four.
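As a minimal sketch of these two conventions, we can model memory in C as an array of bytes and assemble a 4-byte quantity from it by hand. The names load_word, store_word and is_aligned are our own inventions for illustration, and the assert calls stand in for the trap that the hardware would raise on an unaligned transfer.

```c
#include <assert.h>
#include <stdint.h>

/* Is addr a multiple of size?  (size is assumed to be a power of two.) */
int is_aligned(uint32_t addr, uint32_t size)
{
    return (addr & (size - 1)) == 0;
}

/* Fetch a 4-byte little-endian quantity from a byte-array model of memory. */
uint32_t load_word(const uint8_t *mem, uint32_t addr)
{
    assert(is_aligned(addr, 4));        /* the hardware would trap here */
    return (uint32_t) mem[addr]
        | (uint32_t) mem[addr+1] << 8
        | (uint32_t) mem[addr+2] << 16
        | (uint32_t) mem[addr+3] << 24;
}

/* Store a 4-byte quantity, least significant byte at the lowest address. */
void store_word(uint8_t *mem, uint32_t addr, uint32_t value)
{
    assert(is_aligned(addr, 4));
    mem[addr]   = value & 0xff;
    mem[addr+1] = (value >> 8) & 0xff;
    mem[addr+2] = (value >> 16) & 0xff;
    mem[addr+3] = (value >> 24) & 0xff;
}
```

Storing 0x12345678 at address 4 in this model puts 0x78 in byte 4 and 0x12 in byte 7.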

Context

Computers differ in the way they treat unaligned loads and stores, that is, load and store operations for multi-byte quantities where the address is not an exact multiple of the size of the object. On some architectures, such as modern ARM chips, such loads and stores are not implemented and result in a trap, and in the case of the micro:bit with our software, in the Seven Stars of Death. On such machines alignment of loads and stores is mandatory. On other machines, like the x86 family, unaligned loads and stores have traditionally been allowed, and continue to be implemented in later members of the family, though as time goes on there may be an increasing performance gap between aligned and unaligned transfers. This happens because modern hardware becomes more and more optimised for the common case, and unaligned transfers may be supported by some method such as performing two separate loads and combining the results. So on these machines too, it makes sense to align objects in memory.

Each invocation of a subroutine is associated with a block of memory on the program's stack that we call the stack frame for the invocation. The stack pointer begins at the top of memory, and as subroutine calls nest, it moves towards lower addresses, so that the stack grows downwards into an area of 2kB or more of memory that is dedicated to it. When a subroutine is entered, the initial push instruction (if any) saves some registers and decrements the stack pointer, leaving it pointing at the last value to be saved; then the subroutine may contain an explicit sub instruction that decrements the stack pointer further, reserving space for local variables whose addresses are at positive offsets from the stack pointer. The instruction set is designed to make access to these variables simple and efficient. The size of the local variables should be a multiple of 4 bytes, so that the stack pointer is always a multiple of 4. On larger ARM processors, there is a convention that the stack pointer should be a multiple of 8 whenever a subroutine is entered, but there is no need for this stronger convention on the Cortex-M0 processor, so we ignore it.

Other subroutines called by this one may push their own frames on the stack, but remove them before exiting, so that the stack pointer has a consistent value whenever this subroutine is running. At the end of a subroutine with local variables, the operations on the stack pointer are reversed: first we add back to the stack pointer the same quantity that was subtracted on entry, then there is a pop instruction that restores those registers that were saved on entry, also putting the return address back into the pc.

Example: factorial with local variables

Let's begin with an example: an implementation of factorial that calls a subroutine for multiplication, but instead of keeping values in the callee-save registers r4 to r7 across subroutine calls, keeps them in slots in its stack frame. As before, we will implement an assembly language equivalent to the following C subroutine.

int func(int x, int y)
{
    int n = x, f = 1;

    while (n != 0) {
        f = mult(f, n);
        n = n-1;
    }

    return f;
}

We'll keep the values of n and f in two slots in the stack frame for the subroutine:

     |  Stack frame   |
     |    of caller   |
     +================+
     |  Saved lr      |
     +----------------+
     |  Local n       |
     +----------------+
 sp: |  Local f       |
     +================+

The subroutine won't use registers r4 to r7 at all, so there's no need to save them on entry, but we do need to save lr, and we also need to adjust the stack pointer so as to allocate space for n and f.

func:
    push {lr}
    sub sp, sp, #8          @ Reserve space for n and f

Because each of the integer variables n and f occupies four bytes, we must subtract 8 from the stack pointer.

Next, we should set n and f to their initial values: n to the parameter x that arrived in r0, and f to the constant 1, which we put in a register and then store into f's slot in the stack frame.

    str r0, [sp, #4]        @ Save n in the frame
    movs r0, #1             @ Set f to 1
    str r0, [sp, #0]

The two str instructions each transfer a value from r0 to a slot in memory whose address is obtained by adding to the stack pointer sp a constant offset, 4 for n and 0 for f.

Now we come to the loop in the body of func. Whenever we want to use n or f, we must use a load instruction ldr to fetch it from the appropriate stack slot. For efficiency, we can fetch n once into r1, and then use it for both the test n != 0 and the function call mult(f, n) without fetching it again.

again:
    ldr r1, [sp, #4]        @ Fetch n
    cmp r1, #0              @ Is n = 0?
    beq finish              @ If so, finished

    ldr r0, [sp, #0]        @ Fetch f
    bl mult                 @ Compute f * old n
    str r0, [sp, #0]        @ Save as new value of f

    ldr r0, [sp, #4]        @ Fetch n again
    subs r0, r0, #1         @ Compute n-1
    str r0, [sp, #4]        @ Save as new n

    b again                 @ Repeat

Values in r0 to r3 may be overwritten by the mult subroutine, but the values stored in the stack frame will not be affected by the subroutine, and we can implement n = n-1 by fetching n again from its slot after the subroutine call, decrementing it, and storing the result back into the same slot.

When the loop terminates, we need to fetch the final value of f into r0 so that it becomes the result of the subroutine. Then the stack pointer is adjusted back upwards to remove n and f from the stack before popping the return address.

finish:
    ldr r0, [sp, #0]        @ Return f
    add sp, sp, #8          @ Deallocate locals
    pop {pc}

It's clear that holding local variables in the stack frame is more cumbersome than saving some registers on entry to the subroutine and then using them to hold the local variables. It's only when a subroutine needs more than about four local variables that we would want to do so, and even then we would want to keep the most used four of them in registers, mixing the two techniques.
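To check the logic of the loop, here is a sketch in C that mirrors the assembly language step by step, with a two-element array frame playing the part of the stack frame (frame[0] is the slot at offset 0, frame[1] the slot at offset 4) and variables r0 and r1 playing the part of registers. The body given for mult is an assumption, a stand-in for whatever multiplication subroutine is actually linked in, and the assert on entry is our addition: the assembly language makes no such check.

```c
#include <assert.h>

/* Stand-in for the mult subroutine; assumed simply to return a * b. */
int mult(int a, int b)
{
    return a * b;
}

int func(int x, int y)
{
    int frame[2];               /* sub sp, sp, #8: frame[0] is f, frame[1] is n */
    int r0, r1;

    (void) y;                   /* the second parameter is not used */
    assert(x >= 0);             /* our addition: the loop counts n down to 0 */

    frame[1] = x;               /* str r0, [sp, #4] */
    frame[0] = 1;               /* movs r0, #1; str r0, [sp, #0] */

    for (;;) {
        r1 = frame[1];          /* ldr r1, [sp, #4] */
        if (r1 == 0) break;     /* cmp r1, #0; beq finish */
        r0 = frame[0];          /* ldr r0, [sp, #0] */
        r0 = mult(r0, r1);      /* bl mult */
        frame[0] = r0;          /* str r0, [sp, #0] */
        r0 = frame[1];          /* ldr r0, [sp, #4] */
        r0 = r0 - 1;            /* subs r0, r0, #1 */
        frame[1] = r0;          /* str r0, [sp, #4] */
    }

    return frame[0];            /* ldr r0, [sp, #0] */
}
```

Note that n is fetched afresh from its slot after the call to mult, because nothing guarantees that r1 still holds it.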

Addressing modes

For each load or store instruction, the address of the item in memory that is transferred to or from a register is obtained by a calculation that may involve the contents of other registers and also a small constant. Each way of calculating the address is called an addressing mode, and addressing on ARM processors has two main modes that we can call reg+const and reg+reg. As an example of the reg+const addressing mode, the instruction

ldr r0, [r1, #20]

takes the value occupying register r1 and adds the constant 20 to it to form an address (which must be a multiple of four). It fetches the 4-byte quantity from this address and loads it into register r0. In a typical use of this instruction, r1 contains the address of a multi-word object in memory (such as an array), and the constant 20 is the offset from there to a particular word in the object. There is a matching instruction

str r0, [r1, #20]

where the address is calculated in the same way, but the value in r0 is stored into the memory location at that address. Also provided is the reg+reg addressing mode, with typical syntax

ldr r0, [r1, r2]

or

str r0, [r1, r2]

where the offset part of the address is obtained from a register: the two named registers are added together to obtain the address (again a multiple of 4) for loading or storing the value in r0.

In native ARM code, these two forms can be used with any registers, and some other possibilities are provided. When forming an address by adding two registers, it's possible to scale one of the registers by a power of two by shifting it left some number of bits, giving an addressing mode that we can describe as reg+reg<<scale, with the assembly language syntax

ldr r0, [r1, r2, LSL #2]     @ Native code only!

This form is particularly useful when the register shown here as r1 contains the base address of an array, and r2 contains an index into the array, because as we'll see later, the index of an array must be multiplied by the size of each element to find the address of a particular element. In Thumb code, we will have to do the scaling in a separate instruction. Native ARM code also provides other, less frequently used addressing modes that are unsupported in Thumb code and needn't be described here.
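The address calculations for the three modes described so far can be written out as C functions. This is only a sketch of the arithmetic, with function names of our own choosing; the real instructions go on to load or store at the computed address.

```c
#include <assert.h>
#include <stdint.h>

/* reg+const, as in ldr r0, [r1, #20] */
uint32_t addr_reg_const(uint32_t base, uint32_t offset)
{
    return base + offset;
}

/* reg+reg, as in ldr r0, [r1, r2] */
uint32_t addr_reg_reg(uint32_t base, uint32_t index)
{
    return base + index;
}

/* reg+reg<<scale, as in ldr r0, [r1, r2, LSL #2] (native ARM code only) */
uint32_t addr_reg_scaled(uint32_t base, uint32_t index, unsigned scale)
{
    assert(scale < 32);         /* shift amount must be less than the word size */
    return base + (index << scale);
}
```

With a base of 0x20000000, index 3 and scale 2, the scaled form yields 0x2000000c, the same address that reg+reg gives once the index has been shifted left by two bits in a separate instruction.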

In Thumb code, the options provided are the two forms shown above, where the registers shown as r0, r1 and r2 may be replaced by any of the low registers r0 to r7. In addition, three special-purpose instruction encodings are provided that use the stack pointer register sp or the program counter pc as a base register with the addition of a constant. The two instructions with syntax

ldr r0, [sp, #20]

and

str r0, [sp, #20]

provide access to local variables stored in the stack frame of the current procedure, addressed at fixed offsets from the stack pointer. Also, the form

ldr r0, [pc, #40]

loads a word that is located at a fixed offset from the current pc value: this is a mechanism for access to a table of large constants that is placed in ROM next to the subroutine that uses the constants. In the Thumb encoding, the constants allowed in the forms ldr r0, [r1, #const] and ldr r0, [sp, #const] and ldr r0, [pc, #const] are subject to different ranges that are listed in the architecture manual: suffice it to say that the ranges are large enough for most instructions we will want to write.

Global variables

As well as local variables declared inside a function, C also allows global or static variables declared outside any function. These are accessible from any function in the same file (for static variables) or in the whole program (for global variables), and importantly they continue to exist even when the functions that access them are not being invoked. Here's a tiny example:

int count = 0;

void increment(int n)
{
    count = count + n;
}

(Note that the body of increment can be abbreviated to count += n; but any decent C compiler will produce the same code whichever way the function is written.)

How can we implement the same function in assembly language? The answer is a complicated story that is best told backwards. In the program as it runs, the global variable count will have a fixed location in RAM: let's suppose that location is 0x20000234. If that address is already in register r1 and n is in r0, then we can achieve the effect of count = count + n in three instructions: fetch the current value of count, add n to it, and store the result back into count.

    ldr r2, [r1]
    adds r2, r2, r0
    str r2, [r1]

The addressing mode [r1] uses the value in r1 as the address, and is just an abbreviation for [r1, #0]. Now the problem is this: how can we get the constant address 0x20000234 into a register? We can't write

    movs r1, #0x20000234     @ Wrong!

because that huge constant won't fit in the eight-bit field provided in an immediate move instruction. In Thumb code, the solution is to place the constant somewhere in memory just after the code for the increment function (typically, that means placing it in ROM), then use a pc-relative load instruction to move it into a register.

    ldr r1, [pc, #36]
    ...                A
    ...                |
    ...              36 bytes
    ...                |
    ...                V
    .word 0x20000234

The idea is that, although the constant 0x20000234 won't fit in an instruction, the offset shown here as 36 between the ldr instruction and the place that value sits in the ROM is small enough to fit in the (generously sized) field of the ldr instruction. The effect of the C statement count = count + n can therefore be achieved in four instructions:

    ldr r1, [pc, #36]
    ldr r2, [r1]
    adds r2, r2, r0
    str r2, [r1]

plus, of course, arranging that the constant 0x20000234 appears in memory in the expected place. Though we use count twice, once to load and once to store, we only need to put its address in a register once. Although this scheme works well when the program runs, it's a bit difficult to set up when we write the program. We don't want to have to count the offset 36 between the first ldr instruction and the place the constant sits in the ROM, or to update it every time we slightly change the program; also, we don't want to have to assign addresses like 0x20000234 manually and make sure that different variables get different addresses. When writing a program, we will get the assembler and linker to help with these details.
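The four-instruction scheme has a close analogue in C, sketched here with names of our own: a constant pointer count_addr stands in for the word in ROM that holds the address of count, and the local variables r1 and r2 stand in for registers.

```c
#include <assert.h>

int count = 0;                          /* the global variable, in RAM */

/* Stands in for the .word constant placed after the code: a fixed
   cell whose contents are the address of count. */
static int *const count_addr = &count;

void increment(int n)
{
    int *r1 = count_addr;               /* ldr r1, [pc, #36] */
    int r2 = *r1;                       /* ldr r2, [r1] */
    r2 = r2 + n;                        /* adds r2, r2, r0 */
    *r1 = r2;                           /* str r2, [r1] */
}

/* Usage: the value persists between calls. */
void check_increment(void)
{
    count = 0;
    increment(3);
    increment(4);
    assert(count == 7);
}
```

As in the assembly language, the address is fetched once and then used twice, once for the load and once for the store.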

Context

If you are familiar with assembly language programming for Intel chips, it may seem disappointing that the ARM needs four instructions to increment a global variable, when the x86 can do it in just one, written incl counter using unix conventions. When comparing the two machines, we should bear in mind that in most implementations of the x86 instruction set, execution of the instruction will be broken down into multiple stages, first decoding the 6-byte instruction and extracting the 32-bit address from it, then fetching from that address, incrementing, and storing back into the same address. The difference between the two machines relates less to how the operation is executed, and more to how the sequence of actions is encoded as instructions.

First, we'll introduce a name counter for the global variable. In the assembly language program, the value counter will be the address of the variable. Instead of writing

    ldr r1, [pc, #36]

and placing the constant manually in the right place, the assembler lets us write

    ldr r1, =counter

and looks after finding a place for the constant and filling in the required offset, then translates the instruction as a pc-relative load. The assembler keeps a list of values that have been used in such ldr instructions, and whenever it can do so, outputs the list and fills in the offsets in instructions that refer to the values. It always outputs constants at the end of the assembly language file, and also in places where we give a .pool or .ltorg directive; it is safe to do so between subroutines, and also in other places that can never be reached by the program, such as immediately after an unconditional branch and before an ensuing label. With smallish fragments of assembly language, there's no need to do anything, provided the offset range of the pc-relative load instructions is large enough to reach to the end of the file.

An important detail that's dealt with for us: in Thumb code, when an instruction like ldr r1, [pc, #36] is executed, the value that is used for pc is the address of the instruction plus 4, because pipelining means that the processor will already have fetched the next instruction and incremented the PC beyond it by the time this instruction is executed. In the pc-relative ldr instruction, this pc value is then rounded down to a multiple of 4 before the constant 36 (always itself a multiple of 4) is added to it. The assembler takes care of these details for us, and the rules only become apparent if we look at the binary machine code, or if we try (as we will next term) to design an implementation of the instruction set.
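The rule just stated is easy to get wrong by hand, so here it is as a small C function. The name pc_relative_target is ours, and the function does no more than restate the three steps: add 4, round down to a multiple of 4, add the offset.

```c
#include <assert.h>
#include <stdint.h>

/* Address accessed by a Thumb instruction ldr rN, [pc, #offset]
   that is located at instr_addr. */
uint32_t pc_relative_target(uint32_t instr_addr, uint32_t offset)
{
    assert(offset % 4 == 0);            /* the offset is always a multiple of 4 */
    uint32_t pc = instr_addr + 4;       /* pc reads as instruction address + 4 */
    pc &= ~(uint32_t) 3;                /* round down to a multiple of 4 */
    return pc + offset;
}
```

Because of the rounding, an instruction at an address that is 2 more than a multiple of 4 reaches the same word as an instruction at the preceding address with the same offset.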

As well as placing the 32-bit constant appropriately, we also want to place the counter variable itself in RAM without having to assign its address for ourselves, and be sure that the storage it uses does not overlap with other variables. To do this, we must add the following lines to the end of the assembly language file.

    .bss          @ Place the following in RAM
    .balign 4     @ Align to a multiple of 4 bytes
counter:
    .word 0       @ Allocate a 4-byte word of memory

The directive written here first (.bss) directs the assembler to place the variable in the BSS segment, which the linker will assign to RAM, and the startup code will initialise with zeroes. This directive complements the .text directive that came before the increment subroutine: that code goes in the text segment and is placed by the linker in ROM, and this data is placed in RAM.

The next directive (.balign 4) pads the BSS segment with zero bytes until its length is a multiple of 4, so as to be sure that the variable is aligned on a 4-byte boundary. Such alignment directives on unix-based systems have a vexed history, because the directive .align n sometimes aligns on an n-byte boundary, and sometimes on a 2^n-byte boundary depending on the target processor and sometimes even on the format of object file that is being used. The Gnu assembler provides a pair of unambiguous directives .balign n and .p2align n that remove this doubt, and it's best to use them in all new programs.
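The difference between the two unambiguous directives can be sketched as arithmetic on the current location. The C functions below are named after the directives for convenience, but they are only a model of the rounding involved, written on the assumption that the argument of .balign is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Effect of .balign n: round addr up to the next multiple of n,
   where n is a power of two. */
uint32_t balign(uint32_t addr, uint32_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);   /* n must be a power of two */
    return (addr + n - 1) & ~(n - 1);
}

/* Effect of .p2align k: round addr up to the next multiple of 2^k. */
uint32_t p2align(uint32_t addr, unsigned k)
{
    assert(k < 32);
    return balign(addr, (uint32_t) 1 << k);
}
```

So .balign 4 and .p2align 2 have the same effect: both would move a location of 0x15a up to 0x15c, and leave 0x15c where it is.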

Now we can allocate space for the variable, by first placing a label (counter:), then assembling a 4-byte word with value zero (.word 0). Taken together, these four lines of assembly language cause the assembler to output requests to the linker to reserve a 4-byte space in RAM, ensuring it is aligned on a 4-byte boundary, and to put its address into the constant table from where it can be accessed by the code for the increment subroutine. The lines can be put either before or after the code for the subroutine, because the assembler doesn't generally mind about names (like counter here) being used before they are defined.

Disassembling the object file that results from the code shown here reveals that the linker has placed the increment subroutine at address 0x150 and the counter variable at address 0x20000004.

00000150 <increment>:
 150:   4902            ldr     r1, [pc, #8]    ; (15c <increment+0xc>)
 152:   680a            ldr     r2, [r1, #0]
 154:   1812            adds    r2, r2, r0
 156:   600a            str     r2, [r1, #0]
 158:   4770            bx      lr
 15a:   0000            .short  0x0000
 15c:   20000004        .word   0x20000004

The pc-relative load instruction at address 0x150 contains an offset of 8, which is correct because 0x150 + 4 + 8 = 0x15c, and the constant is placed at that address. At 0x15a are two bytes of padding, included so that the constant is aligned on a 4-byte boundary as it should be, followed by the constant itself, whose value is the address of the variable in the BSS segment.
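We can also pick apart the machine code 0x4902 of the first instruction. The sketch below relies on our reading of the Thumb encoding for the pc-relative load, where the top five bits are 01001, the next three give the destination register, and the bottom eight give the offset as a count of words; the function name decode_ldr_pc is ours, and the architecture manual remains the authority on the encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 16-bit Thumb pc-relative load: bits 15..11 are 01001,
   bits 10..8 give the destination register, and bits 7..0 give
   the offset in words. */
void decode_ldr_pc(uint16_t instr, unsigned *reg, unsigned *offset)
{
    assert((instr >> 11) == 0x09);      /* 01001: ldr Rt, [pc, #imm] */
    *reg = (instr >> 8) & 0x7;
    *offset = (instr & 0xff) * 4;       /* stored as a word count */
}
```

For 0x4902 this gives register 1 and offset 8, agreeing with the disassembly. Since the field holds a word count, eight bits are enough to reach about 1kB beyond the instruction.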

Arrays

[6.5] Individual global or local variables are useful, but more useful still is the ability to access elements of an array by indexing. That entails being able to calculate the address of the location accessed by a load or store instruction. The address of the i'th element a[i] of an array a is base + i * elsize, where base is the starting address of the array and elsize is the size of one element.

Array indexing

The ARM helps with this calculation by providing a form of the ldr and str instructions (an addressing mode) where the memory address is computed as the sum of two registers, as in ldr r0, [r1, r2]. This instruction adds together the values in r1 and r2 to form an address, loads the 4-byte word from this address, and puts the value in r0. Typically, r1 might contain the base address of an array, and r2 might contain 4 times the index, usually computed by means of a left-shift instruction lsls r2, r2, #2.
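The indexing calculation can be sketched in C, first in its general form and then in the shifted form used for 4-byte elements. The function names are ours, and check_indexing simply confirms that the C compiler's own array indexing follows the same rule.

```c
#include <assert.h>
#include <stdint.h>

/* General rule: the address of a[i] is base + i * elsize. */
uint32_t element_address(uint32_t base, uint32_t i, uint32_t elsize)
{
    return base + i * elsize;
}

/* Special case for 4-byte elements, as computed by lsls r2, r2, #2
   followed by a reg+reg load or store. */
uint32_t word_address(uint32_t base, uint32_t i)
{
    uint32_t offset = i << 2;           /* lsls r2, r2, #2 */
    return base + offset;               /* address used by ldr r0, [r1, r2] */
}

/* Check the rule against the C compiler's own array indexing. */
void check_indexing(void)
{
    static int a[10];
    for (int i = 0; i < 10; i++)
        assert((char *) &a[i] == (char *) a + i * sizeof(int));
}
```

The shift works only because the element size is a power of two; for other sizes a multiplication would be needed.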

We can experiment with these instructions by building a "bank account" program that maintains an array of integers, with a function func(x, y) that we could define in C like this:

int account[10];

int func(int x, int y)
{
    int z = account[x] + y;
    account[x] = z;
    return z;
}

Given an account number x and an amount y, this function increases the balance of account x by y and returns the new balance. (We could write the entire routine body as return account[x] += y, but that's more obscure, and a good C compiler will produce identical code either way.)

Let's write the same thing in assembly language. We will need to allocate space for the array, 40 bytes aligned on a 4-byte boundary and allocated in the BSS segment.

    .bss                @ Allocate space in the uninitialised data segment
    .balign 4           @ Pad to a 4-byte boundary
account:
    .space 40           @ Allocate 40 bytes

Now here's the function itself.

    .text               @ Back to the text segment
    .global func
    .thumb_func
func:
    ldr r3, =account    @ Base address of array in r3
    lsls r2, r0, #2     @ Offset of element x
    ldr r0, [r3, r2]    @ Fetch the element
    adds r0, r0, r1     @ Increment by y
    str r0, [r3, r2]    @ Save the new value
    bx lr               @ New value is also result

Some things to note:

  • The address of account is a large constant, unknown until the program is linked – just like the address of the count variable in the previous example. We can get it in a register using the ldr = method: this gives us the address of the array element account[0], not yet its value.
  • To access the element account[x], we need to compute the offset 4*x in bytes from the beginning of the array. We can multiply by four using a left shift.
  • The address of an array element is obtained by adding base and offset, using the reg+reg addressing mode. It's cheaper to do this calculation twice, once in the ldr and once in the str, than to compute it once into a register, which would take an extra instruction.
  • Both base and offset can be shared between the two accesses to the array. A compiler would spot this as part of its optimisation of the program, and would, like us, express this insight by setting r3 and r2 once and using each twice.
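These points can be made concrete by rewriting the C function so that each statement corresponds to one instruction of the assembly language. This is a sketch that assumes int occupies 4 bytes, as it does on the micro:bit; the variables r0, r2 and r3 are ordinary C variables named after the registers they imitate, and the bounds check is our addition, since the assembly language version has none.

```c
#include <assert.h>

int account[10];

int func(int x, int y)
{
    assert(x >= 0 && x < 10);           /* the assembly version does no such check */

    char *r3 = (char *) account;        /* ldr r3, =account */
    unsigned r2 = (unsigned) x << 2;    /* lsls r2, r0, #2 */
    int r0 = *(int *) (r3 + r2);        /* ldr r0, [r3, r2] */
    r0 = r0 + y;                        /* adds r0, r0, r1 */
    *(int *) (r3 + r2) = r0;            /* str r0, [r3, r2] */
    return r0;                          /* bx lr */
}
```

Depositing 100 and then 50 in account 3 returns 100 and then 150, and leaves the other accounts untouched.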

The example uses a global array that persists from one invocation of func to the next – otherwise the program would be pointless. It's also possible to allocate a local array in the stack frame of a procedure, so that it lives only as long as an invocation of the procedure. To allocate space for it, subtract a suitably large constant from the stack pointer on entry, and add it back on exit. There's a useful form of add instruction that adds the sp value and a moderately sized constant and puts the result in a register:

add r0, sp, #32

This is useful for computing the base address of the array into a register, and from there the process of indexing is the same as for a global array. It's up to us to plan the layout of the stack frame, though – something that is done by the compiler for programs written in C.

Other load and store instructions

The instructions ldr and str are the most frequently used, because they transfer a quantity as wide as a register to or from memory. Next in importance are the instructions ldrb and strb that transfer single characters. For example, the instruction

ldrb r0, [r1]

takes the single character in memory whose address is in r1 (the address need not be a multiple of 4), and transfers it to the least significant byte of r0, replacing the other three bytes with zeroes. As a complement, the instruction

strb r0, [r1]

takes the least significant byte of r0, ignoring anything held in the other three bytes, and stores it at the address found in r1. Both these instructions exist for Thumb code with the same reg+reg and reg+const addressing modes as the ldr and str instructions. There are also instructions ldrh and strh that load and store two-byte halfwords in the same way.

In addition to the ldrb instruction that loads an unsigned byte, padding it from 8 to 32 bits with zeroes, there is also an instruction ldrsb for loading a signed byte and replicating bit 7 into the upper 24 bits of the result. Similarly, there is an instruction ldrsh for loading signed halfwords. These instructions are less useful than the others, so they exist in Thumb code only with the reg+reg addressing mode; on the rare occasions that the reg+const addressing mode would be useful, the same effect can be obtained by following an unsigned load with an explicit sign extension sxtb or sxth. There is no need for 'signed store' instructions, because no matter whether the value is considered as signed or unsigned, the store instruction takes the bottom 8 or 16 bits of the value in the register and stores them into memory.
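The difference between the signed and unsigned byte loads can be sketched in C. The functions below are named after the instructions they imitate, but they are only models of the arithmetic, and check_store is our own illustration of why no signed store is needed.

```c
#include <assert.h>
#include <stdint.h>

/* ldrb: load a byte and pad with zeroes to 32 bits. */
uint32_t ldrb(const uint8_t *addr)
{
    return *addr;
}

/* sxtb: replicate bit 7 of the low byte into the upper 24 bits. */
int32_t sxtb(uint32_t value)
{
    int32_t low = (int32_t) (value & 0xff);
    return (low ^ 0x80) - 0x80;
}

/* ldrsb: load a signed byte; same as ldrb followed by sxtb. */
int32_t ldrsb(const uint8_t *addr)
{
    return sxtb(ldrb(addr));
}

/* strb: store the least significant byte, ignoring the other three. */
void strb(uint32_t value, uint8_t *addr)
{
    *addr = (uint8_t) (value & 0xff);
}

/* Signed and unsigned values with the same low byte store identically. */
void check_store(void)
{
    uint8_t a, b;
    strb((uint32_t) -1, &a);            /* -1 as a 32-bit pattern */
    strb(0xff, &b);                     /* 255 */
    assert(a == b);
}
```

A byte holding 0x80 loads as 128 with ldrb and as -128 with ldrsb.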

Questions

Does the program always have to be in the Flash ROM and variables in RAM, or can we mix them?

On the micro:bit (and other ARM-based microcontrollers) both ROM and RAM form a single address space, with the two kinds of memory occupying different address ranges: 0 to 0x3ffff for ROM and 0x2000 0000 to 0x2000 3fff for RAM. Although it's convenient to put all the code for a program in ROM, so that it survives times when the chip is powered down, it's perfectly possible for the processor to execute code that is stored in RAM – perhaps generated dynamically as the program runs, or downloaded from an external source and held in RAM temporarily. Still more usefully, the processor can access data held in ROM, such as string constants, or tables of constant data that are particularly useful in embedded programming, and it can use the same load instructions to do so as it uses to access data in the RAM, just with different addresses. What you can't do, of course, is have data that changes held in the ROM, and modify it using ordinary store instructions. Executing a store instruction with an address in ROM space either results in no change, or leads to an exception and the Seven Stars of Death – I wonder which?

It is possible for the contents of ROM to change under program control, so that (for example) embedded systems can load new versions of their software 'over the air'. For that to happen, the ROM is accessed like an I/O device via the Non Volatile Memory Controller (NVMC), and not using ordinary instructions. It's usually not possible to be running a program held in ROM at the same time as the ROM is being overwritten, so the usual practice is to copy the code that controls the writing into RAM and execute it from there. This is not quite what happens when we download a program to the micro:bit: instead, the second processor on the board takes control of the main processor via its debug interface, and by that means is able to make the NVMC load a program into the ROM.
