Frequently asked questions

From Digital Systems
Jump to: navigation, search

New questions

If you have a question, perhaps it is answered here – or maybe you can find help in the growing glossary. Feel free to add headwords to the glossary so Mike can fill in the definitions, or to add questions below.

Can you write the question within triple single quotes like this?

Then I will write the answer afterwards. Don't worry about formatting words like micro:bit properly – I will copy-edit your question when I add the answer.

Arising from lectures

These items are links to questions answered at the end of some lectures.

Lecture 1: How does the processor know how many instruction bits to use as the opcode?

Lecture 1: How does the control unit know to interpret the last nine bits of an adds instruction 0x1840 as the names of three registers?

Lecture 2: What is the file /usr/lib/gcc/arm-none-eabi/5.4.1/armv6-m/libgcc.a that is mentioned when a program is linked?

Lecture 2: If you're prepared to use that integer division subroutine from the gcc library, why not use the library version of printf too, instead of writing your own?

Lecture 3: Will we need to memorise Thumb instructions for the exam?

Lecture 3: If a subroutine expects its return address in the lr register, and one subroutine A calls another subroutine B, won't the call to B overwrite the return address of A, so that A no longer knows where to return?

Lecture 4: Is overflow possible in unsigned subtraction, and how can we test for it?

Lecture 5: What's the difference between a subroutine, a procedure and a function?

Lecture 8: What does the assignment statement GPIO_OUT = 0x4000 look like in assembly language?

Lecture 18: In the lectures, you used only p-type transistors in the upper part of a CMOS gate and the corresponding n-type transistors in the lower part. Can we mix them up in order to make a circuit that implements (a ∨ ¬b) ?

Lecture 19: What is the semantics of your C-like hardware description language?

Lecture 19: Aren't sequential circuits equivalent to finite-state automata?

Lecture 21: What's the difference between decoded, derived and dynamic signals?

Lecture 21: In the lectures, you talk about processor designs containing lots of multiplexers, but some books talk about designs containing various busses. What's the difference?

Assembly language

Why write integer constants in hexadecimal for low-level programming?

We often want to work out the binary bits that make up a constant, and hexadecimal notation makes it easy to do so. Each hexadecimal digit corresponds to four bits of the binary representation, and it's not too hard to memorise that bit pattern corresponding to each digit:

0  0000          8  1000
1  0001          9  1001
2  0010          a  1010
3  0011          b  1011
4  0100          c  1100
5  0101          d  1101
6  0110          e  1110
7  0111          f  1111

To convert a larger hexadecimal constant to binary, just concatenate the four-bit groups corresponding to each digit: for example,

0xcafe1812 = 1100 1010 1111 1110 0001 1000 0001 0010.

How can we convert between decimal and hexadecimal?

For quantities that fit in 16 bits or less, it's helpful to recall that the powers of 16 are 1, 16, 256, 4096. Then to convert, say, 1326 to hexadecimal, we note that the highest multiple of 256 that fits is 5 * 256 = 1280. We can set out the calculation as follows.

            1326
5 * 256  =  1280
            ----
              46
2 * 16   =    32
            ----
              14

The remainder, 46, can be expressed as 2 * 16 + 14, and we get 1326 = 5 * 256 + 2 * 16 + 14 * 1 = 0x52e. The final 'e' is a hexadecimal digit with a value of 14.

To convert the other way, we just calculate 0x52e = 5 * 256 + 2 * 16 + 14 = 1326.

What registers can a subroutine use?

It can freely use registers r0 to r3, destroying any values they had when the subroutine was called. The (first four) arguments of the subroutine arrive in those registers, but need not be kept there, and the result can be moved into r0 just before the subroutine returns. Note, however, that if this subroutine calls any others, they in their turn may destroy the values in all four of these registers.

In addition, a subroutine can use registers r4 to r7, provided it leaves them with the same values on exit that they had on entry. An easy way to ensure that is to push them on the stack with a push instruction at the routine start (together with the register lr), and pop them off again with a pop instruction at the end (together with the pc). Any subroutines that are called will also obey this convention, so important values can be left in these registers across calls.

What if a subroutine has more than 4 arguments?

In that case, arguments beyond the first four are passed on the stack. It's the caller's responsibility to put them there, following a layout that is fixed by the machine's calling convention, and then the subroutine itself can access them using load instructions. I don't think we'll need to write any subroutines with so many arguments (not in assembly language, anyway).

Shouldn't the 'move' instruction really be called 'copy'?

Probably so: mov r0, r1 puts in r0 the same number that is held in r1, without changing the contents of r1.

What do I do about labels if I have a big assembly language program with lots of functions and several loops? Can I re-use labels like loop and done in each function?

No, I'm afraid you can't – you can't have two labels with the same name in a single file of code. You can either prefix each label with the name of the enclosing function, as in foo_loop or baz_done, or you can make use of a system of numeric labels that do allow multiple definitions. There can be several labels written 2: in the file, and you refer to them using the notations 2b or 2f, meaning the next occurrence of label 2 going backwards (2b) or forwards (2f) in the file. That saves the effort of inventing names.

Compilers solve the problem in a different way, just inventing their own labels with names like .L392 and never reusing them. That's OK when nobody is expected to read the resulting assembly language.

When I try running my assembly language program with qemu-arm, I get this message: what does it mean?

qemu-arm: ... tb_lock: Assertion `!have_tb_lock' failed.

That's a symptom of the program running haywire, possibly because it has branched to a random location and started executing code there. A probable cause of that is returning from a subroutine with the stack in a bad state, so that the value loaded into the pc is not the correct return address.

What is this .thumb_func directive all about?

Most of the time, the assembler and linker manage to work out for themselves that assembly language code is written using the Thumb encoding, and all works sweetly. But just occasionally (e.g., when writing an interrupt handler in assembly language), it's necessary to mark a function explicitly as being written in Thumb code: in the case of an interrupt handler, so that the interrupt vector can be marked with a 1 bit in the LSB. If this is done properly, then the processor stays in Thumb mode when the interrupt vector is followed. Failing to do this results in an attempt to switch to native ARM mode, and an immediate HardFault exception: the program crashes irretrievably. This seems a bit silly in a device that doesn't support native ARM mode at all, but it is done in the name of software compatibility with bigger devices.

Given that the .thumb_func directive is occasionally needed (at times when it's difficult to determine whether it is actually necessary), and given that it is hard to debug the HardFault that occurs when a needed directive is omitted, it seems simplest just to make a habit of writing .thumb_func in front of every assembly-language function.

I wrote nop in my program, expecting to get the instruction encoded as 0xbf80, but the assembler produced 0x46c0 instead. Why?

Later versions of the ARM architecture implement a specific nop instruction with encoding 0xbf80. On earlier versions, the instruction encoded 0x46c0 is used instead. The rainbow chart decodes that as mov r8, r8, a form of move instruction that can access high registers and (crucially) does not write the condition codes. In practice, there's no significant difference between the two instructions: both last one cycle and have no effect.

What is the difference between the data segment and the bss segment?

The C compiler takes static variables (those that are declared outside any function) and assigns them to one of three segments that are differently treated by the linker. Plain variables such as the array a declared by

int a[10];

are assigned to the bss segment. When the program starts, such variables have no special initial values, or rather, their initial values are zero, false, null. Other variables may be given specific initial values: for example, the array b declared by

int b[10] = { 3, 1, 4, 1, 5, 9, 2, 6, 5, 4 };

These variables are allocated in the data segment, and unlike the bss segment, the program output by the linker contains an image of their initial values. The third segment used for static data is the rodata segment, used for 'variables' that are declared as constant, such as the array c declared by

const int c[10] = { 2, 7, 1, 8, 2, 8, 1, 8, 2, 8 };

These variables do not need space in the RAM, and can be kept in the (more spacious) ROM instead. Such constant tables are common in embedded programs: for example, a lunar lander game might contain a table of the sine of each whole number of degrees from 0 to 360, to save time computing them when the angle of a thruster changes.

Ok, fair enough: but how can I declare similar variables in assembly language?

You can use the directives .data or .bss or .segment rodata to select a segment, a .align directive to pad for alignment, then either .word or .space directives to allocate space. Like this:

    .bss
    .align 2
a:
    .space 40

    .data
    .align 2
b:
    .word 3, 1, 4, 1, 5, 9, 2, 6, 5, 4

    .segment rodata
    .align 2
c:
    .word 2, 7, 1, 8, 2, 8, 1, 8, 2, 8

The directives .data and .bss are just abbreviations for .segment data and .segment bss. If you want a scalar variable instead of an array, just reduce the amount of space from 40 to 4, or give only one word of initialisation data:

    .bss
    .align 2
x:
    .space 4

    .data
    .align 2
y:
    .word 314159

Ok, now I see how to declare global variables. How can I use them?

Let's try writing in assembly language the function that we could express in C as follows.

unsigned g = 4;

unsigned foo(unsigned x, unsigned y) {
    unsigned z = g;

    g = x;

    return y + z;
}

As you can see, this uses a global variable to remember the argument x from one invocation of foo to the next, with the remembered value being 4 initially. In the following interaction, the values printed are 20+4 = 24 and 8+10 = 18, and the memory contains 6 for the next invocation.

Gimme a number: 10
Gimme a number: 20
foo(10, 20) = 24
Gimme a number: 6
Gimme a number: 8
foo(6, 8) = 18

Without further ado, let's try to express the same thing in assembly language. First, we allocate space in the data segment for the global variable.

    .data
    .align 2
g:
    .word 4

In the function foo, we need to get access to g, and to do this, we plant the address of g somewhere in the text segment, and use pc-relative addressing to load the address into a register.

    .text
    .thumb_func
foo:
    ldr r3, [pc, #8]
    ldr r2, [r3]
    str r0, [r3]
    add r0, r1, r2
    bx lr
    .align 2
aaa:
    .word g

The instruction ldr r2, [pc, #8] loads the address g of the global variable into r2, by loading it from a convenient place aaa in the text segment. Then, with the address g in a register, we can use a second ldr instruction to load its value, which sits in the data segment in RAM. The point is that there are two load instructions, one to get the address of the global variable, and the second to get its value. The address is a constant, and can be stored in the ROM at a place where we can address it using pc-relative addressing. The value is variable, and is stored in RAM, at an address that we cannot write as part of the code stream. The address in r3 is also used to write the global variable with the first argument of the function from r0 – then we add together the second argument and the fetched value of g to give the result, putting it in r0 before returning.

We could write the instruction ldr r2, [pc, #8] as ldr r2, aaa to get the assembler to compute the offset 8 for us. Or, better still, write ldr r2, =g to get the assembler to find a 'convenient place' for us, and leave it to put the constant at the end of its output to the text segment, or put the hint .pool after the bx lr (and delete the rest of the source shown aboive from that point on. However we express the source file, the resulting object program is the same: it has a word in RAM, another word in ROM that contains the address of the first word, and a pc-relative load instruction that addresses the second word.

If we want to plant constants using .word directives, or use the .pool directive to give the assembler a hint where to put its own table of literals, where is a safe place to put them?

If we use .word to put constants (such as addresses) into the text segment to access them with pc-relative addressing, then we need to be sure that the machine will not attempt to execute these constants as if they were code. For example, we could put them immediately after the bx lr or pop {pc} that ends a function, because we can be sure the processor will never try to fetch and execute instructions from there. We can put .pool directives in similar places to give the assembler a hint it can place its tables of literals there. If we don't put in these hints, then the assembler will build one big table of literals and put it after the last function in the source file. Usually this is sufficient, because our fragments of assembly language will be small.

C programming

What's the difference between the expressions m.m_type and m->m_type?

The first of these is used when m is a structure (record) declared with message m, so that storage for an actual structure is allocated by the declaration; it names the m_type field of the structure.

The second is used when m is a pointer to a message declared elsewhere. We can declare such a pointer with message *m, and if msg is a message declared as in the first part of this answer, we can write &msg to get a pointer to it. The notation m->m_type selects the m_type field of the structure to which m points; it is equivalent to (*m).m_type, because *m is the structure to which m points, and we can use the notation from the first part to select its m_type field. The notations are equivalent in meaning and in efficiency, but m->m_type looks a bit neater.

It's common in C programming to associate pointer variables with storage that is allocated dynamically using the library function malloc(). Such storage is allocated when needed, and persists until it is freed with an explicit call to free(): in particular, it can live beyond the lifespan of the function activation that created it. We shall not need to allocate storage in this way: the closest we come is the pool of process records used in Phōs, but these are allocated as a static array of fixed size before the program begins.

Your C programs use types like int and unsigned and short. Wouldn't it be more portable to use types like int32_t and int32_t and int16_t that have a well-defined width on all platforms?

Yes, it would: it's just that I'm trying to keep down the number of obscure conventions that are used in the programs. On the ARM, it's safe to assume that int and unsigned are 32 bits.

Which is faster as a way of writing an infinite loop: while (1) ... or for (;;) ...?

This is one of many things connected with minor efficiency that really don't matter. If I were writing a compiler, I would ensure that both of these idioms resulted in exactly the same code, with the loop body followed by an unconditional branch from the bottom back to the top. If by accident gcc generates slightly less tidy code for one form, then we are unlikely to notice the difference. (Similar remarks apply to the idioms n++ and ++n when written as statements.)

According to the standard books about C, the main program should be called main, and yet in this course it is called init. Why is that?

In normal C, the main program ought to have the heading

int main(int argc, char **argv)

reflecting the fact the main receives two parameters, one being the number of command-line arguments (including the program name), and the other being an array of strings, each corresponding to one of the arguments. The return type is int, because the program may terminate by returning an integer, zero for success and some non-zero value for failure.

Our function init takes no parameters and delivers no result:

void init(void)

What's more, once we have an operating system on the chip, then unlike main, when init returns the program is not over, but concurrent execution of the processes in the program is only just beginning.

These differences might be enough to make a change of name desirable, but there are practical reasons too. One is that GCC (un)helpfully signals a warning if a function named main does not have the heading it expects; another is that it treats main specially in the output, assigning it to its own section in the object file so that a linker script can place it in the program image at a specific place. Both of these behaviours are unhelpful to us.

  • Actually, the warning is suppressed by specifying -ffreestanding, which also suppresses other warnings about the types of functions like sprintf. And the separate section is only enabled with the optimisation flag -Os or -O2, not at the -O level we will generally use, mostly to make it easier to interpret program behaviour under a debugger.

You have not explained the implementation of printf. How does it work?

The implementation of printf in any of our programs is split into two parts: the part that is independent of the output device, and renders the data as a sequence of characters, and the device-dependent part that takes each character and transmits it on the output device. The C code is made slightly complicated by two facts: printf takes a variable number of arguments, and the device-independent part is a higher-order function that takes the device-dependent part as a parameter.

Let's begin with the code that puts the two parts together:

void serial_printf(char *fmt, ...) {
    va_list va;
    va_start(va, fmt);
    do_print(serial_putc, fmt, va);
    va_end(va);
}

This function takes a variable number of arguments, so we can write a call such as

serial_printf("prime(%d) = %d\n", n, p);

where the one fixed argument – the format string – indicates that there are two additional arguments to come, both integers. The code for serial_printf gets a kind of pointer va to the variable part of the argument list, and passes this pointer, together with the device-dependent output function serial_putc and the format string fmt, to the library routine do_print. Variable argument lists are implemented in different ways on different machines. On the Cortex-M0, the compiler will store the arguments from r0--r3 into the stack frame together with the fifth and later arguments (if any), then use a pointer to the relevant portion of the stack frame for va. What happens on a particular machine is all hidden in the functions va_start and va_end.

I'll simplify the code for do_print a bit: the real version also supports the function sprintf that formats data like printf but saves the resulting string in a character array instead of printing it. Here's the function heading:

void do_print(void (*putch)(char), const char *fmt, va_list va) {
     ...

The C syntax is a bit obscure here, but it indicates that do_print takes three arguments:

  • A (pointer to) a function putch that accepts a character as argument and returns nothing.
  • A string fmt, represented as a character pointer.
  • The variable argument list va from do_print's caller.

The body of do_print uses these to do its job.

     ...
     char nbuf[NMAX];

     for (const char *p = fmt; *p != 0; p++) {
          if (*p == '%' && *(p+1) != '\0') {
               switch (*++p) {
               case 'c':
                    putch(va_arg(va, int));
                    break;
               case 'd':
                    do_string(putch, itoa(va_arg(va, int), nbuf));
                    break;
               case 's':
                    do_string(putch, va_arg(va, char *));
                    break;
               default:
                    putch(*p);
                    break;
               }
          } else {
               putch(*p);
          }
     }
}     

As you can see, do_print iterates over the format string, using a pointer p to point to the character of that string that it is currently working on. I've shown the implementation of three format codes here: %c for printing a character, %d for printing a decimal integer, and %s for printing a string. Note that you can write %% to print a literal percent sign. The formatting loop works its way through the variable argument list by using the function va_arg (more likely, a macro), and outputs each character by calling the argument function putch. A helper function do_string is used to call putch on each character of a string, such as the one returned by the function itoa (the name is traditional) that converts an integer into a string.

We carefully avoid global variables so that printf can be called from multiple processes simultaneously – it is reentrant, to use the jargon. That's the reason for the array nbuf that is passed to itoa. None of this will be on the exam.

Code in the examples that prints messages on the terminal contains "\r\n" to move to a new line, but other examples of C code that I've seen use just "\n". Why is that?

On an old-fashioned terminal, and on the simulation provided by minicom, the carriage-return character 0x0d (written '\r' in C) is needed to move the cursor or print head back to the left of the page, and the line-feed character 0x0a (written '\n') scrolls up by a line. Normally, the C library or operating system is written so as to insert a '\r' in front of each '\n' output by a C program. But we are working without a library and without an operating system, so we must do this ourselves. (The serial driver in micro:bian does insert the '\r' in its version of the serial_putc function.)

If I ask the C compiler to output assembly language (and turn off -g), then most of the file is intelligible, but these lines appear close to the start of the file. What do they mean?

.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1

The lines record the assumptions used by the compiler in generating code: things like whether there is a hardware floating point unit, whether unaligned memory access is allowed, etc. These assumptions will be recorded in a special section of the binary file output by the assembler, and the linker then checks that the assumptions are consistent between different files of code.

C provides bit-fields for access to parts of a memory word. Why don't we use them instead of using shifting and masking operations to program peripheral registers?

On the ARM, most peripheral registers must be accessed as a whole 32-bit word, and can't be used with the ldrb, strb, ldrh, strh instructions that access bytes or 16-bit halfwords. Also, it's sometimes the case that peripheral registers must be accessed exactly once in an operation, or some side-effect may be duplicated. The C compiler doesn't guarantee to respect these conventions when it generates code for bit-field access, so we avoid bit-fields and write (with the help of simplifying macros) explicit shifting and masking code instead. The resulting object code is no worse than would be compiled for bit-fields anyway.

What indentation style do you use for C?

My style is close to the one used in the classic book by Kernighan and Ritchie. I don't insist on braces around a single statement controlled by if or for or while, except that the else part of an if must, for balance, have braces if the if part does, and } else { goes all on one line. Expressions should have parentheses whenever the interpretation is anything but obvious without them. I don't accept the argument that without braces, accidental mistakes of indentation can mislead about the scope of an if or while – but I use an editor that reliably gives the correct indentation automatically.

I/O devices

What is the difference between the I/O registers UART.INTEN, UART.INTENSET and UART.INTENCLR?

The UART.INTEN register shows which events in the UART can cause an interrupt: you can write a value to the register to enable or disable individual interrupt sources, and you can read it to find out which sources are actually enabled. The Nordic chips have a scheme – mostly for convenience – that allows individual bits in this register to be set or cleared without having to read the register, modify it, and write back the result. If you write a value to the UART.INTENSET register containing one or more 1 bits, then the corresponding bits in UART.INTEN will be set to 1, and the other bits will remain unchanged. Similarly, writing a pattern of 1 bits to UART.INTENCLR clears the corresponding bits of UART.INTEN and leaves the others unchanged. The general scheme is described in Section 10.1.2 of the nRF51822 Reference Manual.

Can an interrupt handler itself be interrupted?

The interrupt controller (NVIC) provides a priority scheme, so that the handler for a low-priority interrupt can be interrupted if a request for a higher-priority interrupt comes in: these 'nested' interrupts are the N in NVIC. But we shall not use this scheme, instead giving all interrupts the same priority. In micro:bian, interrupt handlers will be very short, and just send a message to an attached driver process, so there is no need for nested interrupts. But whether nesting is allowed or not, it's clear that an interrupt handler cannot be interrupted by a second request the same interrupt.

What happens when the micro:bit boots?

The Nordic chip doesn't boot, in the sense that there is a small program that is responsible for loading a larger one. Instead, we could talk about it "coming out of reset". When the hardware Reset signal is deactivated (for example, by releasing the reset button), the processor fetches the word at address 0 and loads it into the stack pointer, and fetches the word at address 4 (the reset vector) and loads it into the program counter. That is the extent of the hardware's responsibility.

On the software side, the C compiler, the linker, the linker script nRF51822.ld and the C functions in startup.c work together to determine what happens next. The initial stack pointer is set to the top of RAM, and the reset vector points to the routine __reset in startup.c. The contents of the RAM at this point are meaningless, and may be random. Here's the code for __reset:

/* __reset -- the system starts here */
void __reset(void) {
    /* Activate the crystal clock */
    CLOCK.HFCLKSTARTED = 0;
    CLOCK.HFCLKSTART = 1;
    while (! CLOCK.HFCLKSTARTED) { }

    /* Copy data segment and zero out bss */
    int data_size = __data_end - __data_start;
    int bss_size = __bss_end - __bss_start;
    memcpy(__data_start, __etext, data_size);
    memset(__bss_start, 0, bss_size);

    /* Call the 'main program' */
    init();

    /* If it returns, enter a tight loop */
    while (1) pause();
}

When the chip comes out of reset, it is using for the clock a crude internal RC oscillator based on the time constant of a resistor and capacitor. For accurate I/O timing, it's better to use the 16MHz crystal that is installed on the board, so the first action is to switch on the crystal oscillator and wait for it to stabilise. Next, memcpy is used to copy the initial values of static variables into the bottom of RAM, and memset is used to set the remainder of the static variables to zero. The linker has made all the code for the program refer to locations in the bottom of the RAM, but the data which ought to appear there is not yet present. Instead, the linker has (folowing directions in the file NRF51822.ld) put an image of the data segment in the ROM just after the program code. It has also defined symbols like __data_start, __bss_start and __bss_end that we can use here. From this point on, the program can assume that static variables have their proper initial values.

There follows a call to the user-written subroutine init that forms the main program. As you will see, if init ever returns, the program falls into an infinite loop, forever doing nothing; interrupt handlers remain active during this loop. Later in the course, we will start to include the micro:bian operating system in our programs, and then the __reset() function will look slightly different, calling os_init() will set up its data structures, then init() that creates a number of operating system processes, and finally os_start(), the scheduler that interleaves execution of the processes, which work together to carry out the program's tasks. The scheduler never returns.

Why do wired-or busses like I2C have the convention that signal lines are active-low?

Electrically, active-low busses are better in two ways: one, N-type MOSFETs are generally better behaved than P-types, and specifically tend to have a lower on resistance. This makes them better able to pull the signal line low crisply (when it may have significant capacitance) than a P-type could pull it high. In older systems an NPN transistor in an open-collector output stage had a similar advantage over a PNP transistor, and for the same physical reason – that the majority carriers in these transistors are electrons, which have a better motility than holes. The second reason for using active-low signalling is that a bus may cross between circuits that are using different power supplies, and ground is then used as a common reference, whereas the power rails may differ a little or a lot in voltage. An active-low signal can have threshold voltages defined with respect to ground, be pulled low by an open-drain output with short fall time, and draw little or no quiescent current through its pullup resistor.

micro:bian

Why does each process in micro:bian take an integer parameter? Is this parameter the same as the process ID?

The parameter is there to make it possible to use the same function as the body of more than one process, with those processes doing the same thing with different data. For example, in the Today programme example, the two 'politician' processes run the same code, endlessly spouting their message, but a different message in each case. We do this by having a global table of messages, then giving the process a parameter that it an index into the table.

char *message[2] = {
    "Brexit means brexit",
    "No deal is better than a bad deal"
};

void politician(int i) {
    while (1) {
        spout(message[i]);
    }
}

When we start the two processes, we give different parameters to be passed to the process body.

FARAGE = start("Farage", politician, 0, STACK);
MAY = start("May", politician, 1, STACK);

Providing one integer argument is just enough to make this kind of programming possible without too much inconvenience.

No, this parameter is not the same as the process ID, which in the example is the result, FARAGE or MAY, that is returned by start(). The confusion perhaps arises because the body of a process usually doesn't have to mention its ID, though the ID is implicitly used as the m_sender field whenever the process sends a message.

What's really going on in Problem 4.4?

Thanks, Mike, for asking: it gives me the opportunity to link to another page that reports the result of an experiment with the program.

Why can't one micro:bian processes start another?

In short, because that would be a complication that would add little utility. The memory allocation in micro:bian is deliberately simple, stack space for each process being nibbled away from one end of a single chunk of free space. That works well if a fixed number of processes are started when the program begins, but much less well if processes come and (by implication) go during the life of the system. If processes exit and restart, we will need to re-use the storage space of dead processes to find room for new ones; and unless we are prepared to move processes around, we then face the problem of fragmentation of storage – all too messy to contemplate. Given that the operating system is only going to support a fixed collection of processes, we may as well have those processes created as part of initialisation, before the scheduler takes control.

When a process makes a system call such as receive(ANY, &m), how does the operating system access the arguments?

The arguments are in registers r0 and r1 before the svc instruction that enters the operating system. The handler for the SVC interrupt gets the the stack pointer for the process after the context has been saved onto the stack. It can then reach into it (in a machine-dependent way) to discover what the values of r0 and r1 were at the time of the call. It also uses the saved pc value to look at the svc instruction, which contains an integer constant that identifies which system call – receive or send or another – was made. Given the address of the message buffer m, the operating system can read or write the memory belonging to the process in order to copy a message from sender to receiver.

What happens if a process loops forever, consuming CPU time, without ever calling send or receive?

Scheduling is not time-based, so once such a process is running, it will stay running until the next interrupt, and other processes will not run until then. micro:bian is not really designed to share the processor fairly among processes that always want to run, because doing that really needs a clock interrupt to be directly connected to the scheduler in order to measure out time slices, and we'd like to make use of timers optional. The assumption is that CPU time is a plentiful resource, so that fairness in sharing it doesn't matter. In the heart example, it works well to give the process driving the display a higher priority than the one searching for primes, so that after a timer interrupt it get to run first.

If a micro:bian process calls a subroutine (like primes calling serial_printf which subsequently calls serial_putc) and the subroutine calls send to send a message, how does the operating system know which process sent the message?

The operating system always knows which processs is currently running (that is the purpose of the internal variable os_current). The stack frames for primes and serial_printf and serial_putc are all part of the stack of that process. So when the call to send is made, the operating system knows that it is made on behalf of the PRIMES process. The same subroutine serial_putc could be called at another time by another process, and then is would generate a message from that process – whichever one was current at the time.

What is the resting state of a pipeline where one process PRIMES is generating primes, passing them to another process FORMAT for formatting, then FORMAT is sending characters to SERIAL for printing? Let's assume primes are being found faster than they can be printed.

Process pipeline

In this case, the serial driver's buffer will usually be full. SERIAL will not be listening for more PUTC messages, but blocked until the next UART interrupt. In consequence, FORMAT will be blocked waiting to send it a character. And PRIMES will eventually have found another prime and be blocked waiting to send it the FORMAT. Nothing more will happen until the next interrupt, and the IDLE process is the one running, or rather it has put the processor to sleep at a wfe instruction – the only one in the system.

When a UART interrupt arrives, SERIAL will start the UART transmitting a character taken from its buffer, so a slot in the buffer is now available, and SERIAL accepts a PUTC message from FORMAT. This fills the buffer again, so the SERIAL process blocks waiting for another interrupt. FORMAT is also able to run, and it will take a short time to decide what character to print next, then block again waiting to send it to SERIAL. Just occasionally, FORMAT will have put the last newline to a message, and it will then accept another PRIME message from PRIMES: that process will then be free to begin seraching for another prime to transmit.

Is it possible to create a buffer of messages?

There is already a buffer of characters in the SERIAL process, but it's possible to introduce an independent buffer process between any pair of processes that communicate.

Buffer process

The process on the right (the producer) can sent PUT messages to the central buffer process, asking it to store data items for later delivery. Then the process on the left (the consumer) can send GET messages to the buffer, getting each time one data item in a REPLY message. micro:bian provides just enough expressiveness to deal with this kind of situation, but note that the buffer process must know the names (process IDs) of the two client processes and depends on them to communicate with a fixed protocol. In more elaborate message passing schemes, there are notions like named channels of communication, which make it much easier to plug together processes and adapters in an ad-hoc way.

If two messages are sent one after the other, can they be delivered in the opposite order?

If they are sent from the same process, then No: this is a consequence of the fact that the process does not get to call send a second time until the first call of send has returned, and when that happens, we can be sure that the recipient has also completed its call of receive and has hold of the first message.

Ordering of messages

But if the two messages are sent from different senders, then the question is more complicated, and I have four different answers for you. In order (!), they are No, Yes, Maybe and Yes.

  • No, because if two proceses B and C are both on the sending queue of a process A, then there are no guarantees on the order they will be picked. It could be that process A specifies that it would like to receive from process C first and process B afterwards. And the micro:bian designer does not promise to accept a bug report claiming in any case that a re-ordering of waiting processes is wrong.
  • Yes, because in practice (selective receive aside) the sending queue is treated like a first-come-first-served queue by micro:bian. Although there may be no guarantees about order of message delivery, nevertheless FCFS is the simplest way of ensuring fairness, which in this context means that if process B wants to send to A, and A is continually accepting messages in a way that does not exclude B, then B will not be delayed forever.
  • Maybe, because we've got to ask what we mean by 'before'. Suppose processes B and C are connected to the two push-buttons on the micro:bit, and I push the button for B at 4:16pm and the button for C at 4:17pm. There's no guarantee on a busy machine that process B will get to run before I press the second button, but then process C might get to run immediately – so it would be wrong to rule out a behaviour where the message from C got through first.
  • But Yes: suppose process B sends its message, then turns on a light; and I, seeing the light, then press the button for process C. Then there is a causal chain between one message and the other – the light comes on after the first message is delivered, the button for C is pressed after that, and the second message is only then sent. Without including the human factor, we can define a temporal ordering of events where events in the same process are strictly ordered by the history of the process, and events in different processes are ordered in a way that is determined by the synchonisation implicit in the sending and receiving of a message. It's possible for events in two different processes to be unordered if there is no causal chain connecting them; but if one message is sent before another according to this ordering of events, then they are indeed received in the same order.

Is HARDWARE really a process?

Each process is identified by a small integer. This integer is an index into a table that stores the state of each process. The special value HARDWARE = −1 is not the PID of a genuine process and does not correspond to a slot in the table, but nevertheless you can write receive(INTERRUPT, &m) and afterwards find that m.m_sender == HARDWARE. The mechanism for sending an interrupt message is not to put any process on the sending queue of the process that will receive the message, but instead to set a flag p_pending in the receiving process that indicates an interrupt is waiting. So in this sense, there is no HARDWARE process. Actually, having such a process object wouldn't help with the implementation, because we know a process can only be on one queue at a time, and there can be several interrupts pending for different drivers.

Do interrupts jump the queue of waiting processes?
When an interrupt happens, the OS looks at the connected driver process. If it's blocked at a call of receive, the job is simple: we fill in sender = HARDWARE and type = INTERRUPT in its message buffer, and queue it for execution. If the driver process is not blocked at receive, we set its p_pending flag instead. Later, when it calls receive, the OS looks to see if any interrupts are pending: if so, the process does not block, but an INTERRUPT message is immediately delivered to it and the p_pending flag is cleared. In this way, yes, the interrupt message does overtake any other messages that are waiting. That's no bad thing, because usually interrupts are quite urgent.

Why is the stack space of a micro:bian process filled with the value 3735928559 when the process starts?

This helps with finding out how much stack space the process is actually using: when micro:bian prints a dump of the process table, it shows how much of the stack space of the process has been changed from this initial value, and we can assume that this is the total stack space that has been used, unless by some slim chance the last few words of the stack have been set to this magic value by the process itself. As to the choice of value, it appears in the micro:bian source code in hexadecimal notation, which makes it seem more significant.

In micro:bian, what happens if an interrupt is triggered when the attached process is not waiting to receive a message?

The process table entry for each process has space to record one pending interrupt message, so if an interrupt arrives and the process is not waiting for it, the message can be delivered later. The default interrupt handler, which converts interrupts to messages, also disables the interrupt that invoked it, because the interrupts on the micro:bit continue to be triggered if the cause is not removed, and removing the cause (such as an idle UART) requires hardware-specific actions from the driver process. So a second interrupt cannot happen until the driver process has run to re-enable the interrupt. If this behaviour is inadequate to the situation, it's possible to replace the default handler with another one for a specific device: the timer driver does this to help with implementing a microsecond-resolution clock function.

What happens if a device driver process does not finish responding to one interrupt message before another interrupt occurs?

As noted above, the default action of disabling the interrupt means that a second interrupt can't actually be requested. But if (for example) the interrupt comes from a timer, and interrupts are being lost causing the clock to run slowly, then the right thing is to move some work from the driver process to the interrupt handler, and modify it so that the interrupt remains enabled all the time.

In the UART driver process, code for input and code for output are intertwined. Wouldn't it be neater to untangle them and have separate driver processes for the two sides of the UART?

In some ways, yes. But the vital deciding factor in favour of a single driver is that the two sides of the UART share a single interrupt request, so interrupts for both sending and receiving turn into messages to a single process. We could use a more elaborate scheme where a little router process passed on interrupt messages to one or other of two driver processes, but we would be adding as much complexity as we could hope to remove. (A minor additional point is that echoing of input characters is easy to do if input and output share a driver process.)

If message-passing makes programming so much easier, why do widely-used embedded operating systems provide semaphores instead?

Programs written using semaphores have the potential to be faster than those written using message passing and server processes, because context switches from one process to another can be made much less frequent. For example, to buffer and print characters on the serial port, our message-based design requires about four context switches per character sent, in addition to the running of the UART interrupt handler. There is a context switch to send a request message to the UART driver process and put the character in the buffer, another to return to the main program, then two further context switches into and out of the driver process when a UART interrupt prompts it to send the character. By way of contrast, a semaphore-based solution might need no context switches at all, if characters are retrieved from the buffer and sent to the UART directly in the interrupt handler. This is because a down operation on an available semaphore only needs to disable interrupts for a moment to test and lock the semaphore, and need not block the invoking process. It's only when a process needs to block on a semaphore that a context switch is needed.

On the other hand, message-based concurrency is much easier to think about, and it makes it easier to avoid problems like priority inversion, where a high-priority process is prevented from making progress because it needs to lock a semaphore currently held by a low-priority process.

Why does micro:bian not have (i) time-based scheduling, (ii) nested interrupt handlers, (iii) an interruptible kernel?

In a simple, interrupt-driven system that is not overloaded, most of the time every process is blocked waiting for something to happen, and the idle process is 'active', in the weak sense that it is the running process, but it has put the processor to sleep with a wait-for-interrupt instruction (wfe on the ARM). When an interrupt arrives, processes wake up, there is a flurry of activity, and then the system settles down to sleep again, and the time needed for that to happen is bounded. If there is no response signal due from the system as a result of an interrupt that must be produced in less than this time-bound, then there is no need for pre-emptive scheduling (which does not change the total amount of work to be done, but just affects the order in which to tasks are carried out). And if we keep interrupt times short by doing no more than sending a message in each interrupt handler, then there will be little need to allow the running of one interrupt handler to be suspended in order to tun another. Likewise, if the time spent in delivering a message is short, it won't matter much if an interrupt is delayed until the kernel has finished.

There are well-developed theories of process scheduling that aim to ensure that complex sets of real-time deadlines are met using limited resources, and in circumstances where they are needed, these theories are really valuable. In simple cases, however, process scheduling is hard to get wrong, in the sense that if every process waiting for an event gets to run after the event has happened and before the system goes to sleep again, then everything will happen when it should and in good time. micro:bian is designed with these modest goals in mind.

Digital logic

What does CMOS stand for?

CMOS logic is so named because it relies on Complementary pairs of transistors, each of them constructed of three layers: a Metal (or possibly polycrystalline silicon) gate, an insulating layer of silicon Oxide, and a Silicon channel between source and drain. Other logic families exist, for example the NMOS family where the p-types are replaced by pull-up resisitors. These chips are easier to make because they contain only n-type transistors (actually the resistor is implemented as a weak transistor too). But they consume more power, because when the output is low, current constantly flows through the resistor and transistor.

Can you really buy an integrated circuit containing a single and gate?

Yes, you can. It used to be that the cost of packaging meant that several gates came in one package, such as the classic 7400 'quad 2-input nand gate', but recent logic families include single gates in tiny five-pin packages (with two inputs, one output, plus power and ground. Here for example is the datasheet for a 74LVC1G08 single 2-input and gate.

Why does an inverter need two transistors?

Image

Both transistors are needed in an inverter to ensure that the output is always driven – either forced towards the positive rail when the output should be 1 (by the p-type), or forced towards the negative rail when it should be 0 (by the n-type).  :We use p- and n-type transistors in this way (p-type in the upper half and n-type in the lower half) because that ensures the gate-to-source voltage is high enough in each case to drive the transistors into strong conduction, producing a good output signal.

Can we use n-types in the top part of a design, or p-types in the bottom part?

In (for example) a nor gate, both p-type transistors conduct a high voltage, and both n-types conduct a low voltage, so all provide a strong drive to the output. That picture generally holds if we stick to the rules. On the other hand, n-types used when their source is not at 0V will not be turned on fully, and will provide a weak drive, making the circuit slower and more vulnerable to noise.

Do logic chips need power and ground connections?

Small chips like the 7400HC contain a handful of gates, with one pin for each input or output, plus separate pins for power and ground. A trap for young players is that CMOS logic may still work most of the time if you forget to connect power and ground, because the nature of the chip implementation allows it to be powered through parasitic diodes between the inputs and the rails. These diodes are normally a good thing, because they soak up transients in the inputs.

Image

Could we implement an or function using diodes?

An older logic family, Diode-Transistor Logic (DTL) used diodes to implement a nand gate.

How sub-optimal can a circuit get if we design it according to the recipe used in the lecture?

Blindly implementing the Boolean function or – using the construction we used to prove adequacy of {and, or, not} – results in a ridiculously complicated circuit.

Image

But we can simplify that circuit, either using algebraic reasoning, or by recognising smaller terms, each covering more of the truth table, until it becomes a single or gate. There are traditional, hand methods (like Karnaugh maps) for finding optimal implementations of Boolean functions of a few variables. More importantly, there are software tools that automate the process. The worst function to implement using sum-or-products form is the multi-input xor or parity function. Actually, with two-to-the-two-to-the-n different Boolean functions on n variables, the majority of them will have no simple implementation.

How many essentially different state assignments are there for a four-state machine?

If we succumb to temptation and design a sequential circuit by laboriously listing the states and transitions between them, we must then number the states in binary before designing the output and next-state logic. With four states, we are going to need two bits to number them, and these become two flip-flops in the eventual circuit. It seems like there ought to be 4! = 24 ways of numbering the states, with the threat that some of these assignments will make the combinational logic dramatically more complicated than others. But in reality, several considerations make pairs of assignments essentially equivalent, so that we need consider only a few of them in a systematic search for a simple implementation.

For one thing, swapping the order of the two bits isn't going to affect the size of the combinational logic needed; this halves the number of assignments we must consider. What's more, if we negate one or both of the state bits, we can keep nearly the same combinational logic simply by inserting inverters at the inputs and outputs of the flip-flops whose sense we have changed. De Morgan's laws mean we can even simplify the resulting circuit to get rid of most of the inverters. This removes another factor of four from the number of assignments, leaving us with just three to consider. We may as well choose the encoding 00 for the initial state (convenient if we choose a design of flip-flop that adopts state 0 at power-on); then the assignment is essentially settled by selecting which other state will be encoded as 11, because it makes no essential difference which way the other two states are assigned 01 and 10.

[math]\displaystyle{ }[/math]For a machine with \(2^n\) states encoded in \(n\) bits, we start with \(2^n!\) assignments, then remove a factor of \(n!\) for permutations of the state bits, and a factor of \(2^n\) for negations, leaving us with \((2^n-1)!/n!\) assignments to consider. If we want to be exhaustive, \(n\) had better be pretty small.

How can we figure out the number of states in a sequential circuit from a description of its behaviour?

We have to be prepared to be creative, and if insight does not give us any help, start to draw a state diagram that seems adequate to produce the behaviour that has been described. For the 'bathroom light switch', one we see that two states are not enough, we might come up with the four state machine shown in the diagram. This is consistent with all the behaviour shown in the timing diagram, so appears to be sufficient.

Timing diagram for bathroom light switch

Given such a state diagram, we can number its states in binary, then make tables showing (for a Moore machine) the next state in terms of the current state and inputs, and the output in terms of the current state. We can tell that a Moore machine is sufficient for the bathroom light switch because its output changes only at a clock edge. Here's the next-state function for the state diagram shown.

x y a x' y'
0 0 0 0 0
0 0 1 0 1
0 1 0 1 1
0 1 1 0 1
1 0 0 0 0
1 0 1 1 0
1 1 0 1 1
1 1 1 1 0

Because (sneakily) I've numbered the states in a way consistent with the design I already made, we can see in this table that x' = y' = y whenever a = 0; and x' = x and y' = ¬x whenever a = 1. The next-state logic can therefore be implemented as before with two muxes and an inverter. The output function is even simpler:

x y z
0 0 0
0 1 1
1 0 0
1 1 1

We just get z = y. Other ways of numbering the states will give different tables which may be easier or harder to encode in combinational logic.

What is the advantage of having a fully synchronous design?

Consider a circuit where the output of one register R1 feeds the input of another register R2 that has write-enable implemented by gating the clock.

Image

The and gate has the effect of the delaying the clock signal to R2 by a time equal to its propagation delay. After a clock edge, the output of R1 will be stable for a short time (called the contamination delay of R1) before becoming invalid, and we will have to hope that this contamination delay is long enough to compensate for the delay in the clock (the clock skew) seen by R2. Without the clock gating, this clock skew would not occur and so not be a worry.

Clock skew is inevitable in physically large circuits owing to the finite speed of light (1 foot per nanosecond in traditional units) and must be planned for; clock skew owing to gating of the clock just makes the problem a bit worse. A register with write-enable, considered as a circuit module, might implement the write-enable by gating the clock internally. In that case, the module would have a non-zero hold time (the time the input must remain stable after the clock event). The designer could trade hold time for propagation delay by introducing a delay element on the signal lines, but that would increase the overall propagation delay of the circuit and so limit the clock speed.

Architectural elements

In the diagram of a ROM, what is the meaning of the vertical lines with blobs on them?

Read-only memory 8x8

Each column determines a single bit of the ROM's output. It has an up-to-8-input or gate that is connected to those horizontal lines that correspond to locations (combinations of address bits) where the output should be a 1. So the intention is that a blob indicates a connection to the or gate, and the absence of a blob means no connection. (If there are no connections at all, then there is no need for an or gate, and the output could be wired to 0.

Image

In a CMOS implementation, we would make the decoder's outputs be active-low, then use nand gates for the outputs. Or better still, there's a design where each blob is a single transistor that pulls the vertical wire low, and the output stage is an inverter with a pull-up on its input. All we need to observe is that it's possible to build the circuit somehow.

In the twin-port register file, is it possible for all three registers rd1, rd2 and wr to be the same? What would be the use of that?

Register file

Yes, in the circuit, rd1 and rd2 independently select two registers for reading. The process continues to work perfectly if both select the same register. Also, wr independently selects a register to be written, and that can be the same as a register that is being read in the same clock cycle. The usefulness of this becomes apparent when we consider an instruction like adds r0, r0, r0, which adds together two copies of the value in r0 and writes the result back into r0. This instruction will work properly using the same mechanism (an adder external to the register file) if we arrange that the register file allows multiple, simultaneous references to the same register in the way we just discussed.

If the same register is read and written on the same clock cycle, should the value that is read be the value sitting in the register, or the new value that is provided as an input?

For the adds instruction to work, or any other arithmetic instruction whose result goes into the same register that one of its operands came from, we what the old value to be used. To see that this is in fact how it works, we need to look at the implementation of one of the registers. Here's a not-fully-synchronous design:

n-bit register

Here we see that the output is taken directly from the row of D-types, whereas the input is fed to them to become (if the clock is enabled) the value they will hold after the next clock edge. So it is the old value that is read from the register in the event that it is both read and written on the same clock cycle. Note: in some processor designs with pipelining, it is helpful to use a register file with the opposite behaviour, in order to arrange that a value being written by one instruction is available for use by the next. We won't be concerned with this.

In this circuit for a turbo-charged register file, what features of the ARM instruction set are supported by the various special elements?

ARM register file
Three registers can be read simultaneously.
An instruction like str r1, [r2, r3] stores the value from one register in a location whose address is computed as the sum of two other registers. (On a machine like the MIPS, the only addressing mode is equivalent to our str r1, [r2, #12], so only two registers are ever read by any instruction.)
The write-enable of PC is wired to 1.
The PC is updated in every instruction, either by being written with some calculated value (as in a branch instruction), or being set to the value nextpc = PC+2 that is computed by an adder external to the register file.
The +4 adder on the output of the PC.
The instruction set specifies that branches and PC-relative load instructions see the PC value as the address of the next-but-one instruction after the instruction being executed. This is probably an artefact of an early, pipelined implementation of the instruction set, but nevertheless be faithfully implemented in a non-pipelined implementation like ours, if machine language programs are to run correctly.
The multiplexer on the PC input.
Some instructions (e.g. branches) write their result to the PC; others allow the PC to be incremented by 2 as a default. This multiplexer selects which of these possible values is written to the PC.
The multiplexer on the input to LR.
A subroutine call instruction like blx r3 simultaneously set the PC to the value obtained from R3, and set LR to the address of the next instruction, obtained from nextpc. This multiplexer allows the nextpc value to be routed to LR simultaneously with din being routed to the PC, and also allows LR to be written as a normal register.
The or gate on the write-enable input of LR.
This allows LR to be written both as a normal register (when the decoder sends it a write-enable signal) and in the blx instruction just described, when cLink is active.

Processor architecture

Why are two opcodes 11010 and 11011 given for the conditional branch instruction in the notes?

My thought was that the decoding ROM would be addressed by the top five bits of the instruction, making each location correspond to half a row in table [A] of the 'rainbow' chart, which often corresponds to a single instruction. For half-rows that contain more than one instruction (or a reference to a different table), the thought is that further instruction bits would be decoded by an auxiliary ROM, and there would be a control signal from the main ROM say, "Please use the auxiliary ROM for this instruction." You can see a hint of this in the way decoding is implemented in the simulator.

Now, conditional branch instruction occupy a whole row in table [A], so that corresponds to two consecutive five-bit opcodes. If the top four bits of an instruction are 1101, then it's a conditional branch regardless of whether the next bit is 0 or 1. The next four bits (like bits 8–11 of any instruction) are fed to the conditional execution unit, controlling what Boolean function of the flag bits it computes; and bits 0–7 of the instruction determine the branch displacement.

Writing into a register: I'm still confused about the timing/clock mechanism of writing into a register after, say, an arithmetic instruction. Let's take the example of adds r0, r0, r1 or ldr r0, [r0, r1]. When the clock edge rises, what is immediately written into the register file?

For add r0, r1, r1, there's a combinational path from the outputs of r0 and r1 through various multiplexers, the shifter and the ALU to the input of r0. When the add instruction is executed, during the time between one clock edge and the next, these elements settle down to deliver the sum of the present values of r0 and r1 to the input of r0. Therefore, at the next clock edge, the value of r0 will change to this sum. For the load instruction, the combinational path involves the data memory (interface) too: we are assuming that the memory can be read within the clock cycle, as is common with SRAM. Then the ALU does the address calculation and the memory delivers the contents of the word to the register file, all within the same clock cycle: the result is latched into the target register at the next clock edge, completing the instruction.

linksel: For linksel, what determines the conditional input (ie. the arrow in between 1 and 0)? For regwrite it is the result from the Condx element, but what is it for linksel?

The bx reg and blx reg instructions differ in a single bit, which is not one of the bits decoded by the control ROM. That bit is used to drive the third input of the linksel mux. We can tell from the decoded bits that most instructions don't write nextpc to the link register, a few definitely do (e.g., bl), and bx/blx does so dependent on this additional instruction bit. This kind of thing is needed only because of the convoluted instruction encoding of Thumb.

Implementing bl: To generate a 22 bit offset, shouldn't the higher 11 bits of offset be shifted left by 11 places? Why 12?

The bl instruction takes a sign-extended 11 bit field from the first half of the instruction and an 11 bit field from the second half. Since the branch target must be a multiple of 2, like all instruction addresses, this displacement is multiplied by 2 before being used, so the top half is shifted by 12 bits and the bottom half by 1 bit. This is visible in the decoding of the bl1 and bl2 instructions in the simulator.

According to the ARM manual, PC-relative load instructions should round down the value of PC+4 to a multiple of four before adding the offset. Where is this implemented in your datapath design?

There's an ALU operation Adr (see simulator code, section 9) that adds together two inputs and truncates to a multiple of 4. That does the job when needed.

Lab exercises

Something in the lab exercises isn't working. What should I do?

First of all, please come to the sessions! The lab exercises for this course are (deliberately and inevitably) a bit less cut-and-dried than those for other courses, and there is more scope to explore directions of your own. Today in the lab we had a team of two plundering my code for the Seven Stars of Death to make the LEDs flash in their own way. We were able to help them with that, and it was far more worthwhile than following my pre-conceived list of tasks.

Second, check for updates on the labs page. Some differences are inevitable between the environment in the Software Lab and what works on my machines, and we can't hope to pin down every last one in advance. And as you'll discover, it sometimes takes a lot of thought to work out a reason for the strange behaviours we often see. There's no operating system to intervene and stop your program with a helpful message; and even the seven stars of death only tell you that something is wrong.

When an update is published, there's a procedure for merging it into your working copy using Mercurial. We will help you with that in difficult cases, such as where you've already made one fix in the lab but you are merging in others.

You may prefer Mercurial, but I'm dedicated to Git. How can I stay true to myself?

In place of

$ hg clone http://spivey.oriel.ox.ac.uk/hg/digisys

at the start of Lab 0, you can use the command

$ git clone http://spivey.oriel.ox.ac.uk/git/digisys.git

instead, and hack away to your hearts content without compromising your principles. (This feature is brought to you by the superior flexibility of Mercurial.)

I stop the echo program with the debugger, then try typing in the minicom window. Nothing happens; and even if I type cont to gdb, still nothing happens. Why?

The debugger can stop your program, but it can't stop other things from happening. In particular, if the UART (the interface connected to the serial port) receives a second character before your program has collected the first one, it signals an 'overrun error' and will do nothing more until the error is cleared. The echo program is written in the simplest way possible, assuming characters will arrive slowly enough for it to fetch each one before the next arrives, and it doesn't have provision for detecting or clearing such errors. When the program is stopped by the debugger, it hasn't a hope of keeping up with your typing. So once you've typed a second character in minicom while the program is stopped, all is lost until you reset the micro:bit. This is a perfect example to show that debuggers are of limited use in embedded systems development.

I want to see the Seven Stars of Death. How can I make them appear?

Ok, I suppose we'd better get this over with. Any unexpected interrupt or exception will lead to the seven stars, and you can provoke a HardFault exception with a bx instruction where the target address is even, not having the Thumb bit set. So the two instructions

movs r0, #0
bx r0

ought to do it. You may prefer to find more creative ways of your own. In fact, you probably will.

The detailed mechanism is as follows: the file startup.c contains code that lays out the interrupt vectors for the program, including the vector associated with the HardFault exception. Unless these vectors are redefined elsewhere, they all point to a routine spin that turns off interrupts, (re-)initialises the GPIO pins, and then uses them to display the seven stars, flashing them forever using delay loops. The code is made simpler by the happy fact that the seven LEDs needed are all in the same row of the 3 x 9 wiring matrix. The code is deliberately mashed together in one function (rather than, say, using a subroutine delay), so that it doesn't depend on anything else. Many undefined behaviours of the chip evoke a HardFault exception, and attempting to clear the T bit in the psr using a bx instruction is one of them. The machine saves some registers on the stack, fetches the HardFault vector from address 12, and starts executing the routine at that address, which immediately calls spin.

I keep getting the following message when I try to start minicom. What is happening?

minicom: cannot open /dev/ttyACM0: Device or resource busy

I found the second answer to this Stack Overflow question helpful: https://stackoverflow.com/questions/40951728, but wish they would do the decent thing and reinstate Monica.

Why are the C compiler and other tools called by names like arm-none-eabi-gcc and arm-none-eabi-objdump?

These long names are chosen to distinguish these cross-compiling tools from native tools like gcc, which on a Linux PC is a compiler that translates C into code for an Intel processor. The lengthy names make explicit that the compiler generates ARM code (arm) for a machine running no operating system (none), and uses a standard set of subroutine calling conventions called the Extended Application Binary Interface (eabi). This explicitness is needed, because a Linux machine might also have installed compilers for the Raspberry Pi (gcc-arm-linux-gnueabihf) and maybe AVR microcontrollers (gcc-avr) and the MIPS (mipsel-linux-gnu-gcc). As you can see, the naming conventions aren't quite consistent between GCC cross-compiler projects.

What configuration is needed to use the hardware random number generator?

See Chapter 21 of the hardware reference manual for the nRF51822. The header file hardware.h (in the latest revision of Lab 3) defines symbolic constants such as RNG.VALUE for the device register addresses. If you make a first attempt, we will help you to get it working in the lab sessions.

I can't get the TEMP example to work on my micro:bit. Can you help?

Here's a working example that prints the temperature twice a second: ex-temp.c. I've jammed the device driver and a client program into the same file for simplicity. You can test the program by holding a thumb on the Nordic chip or breathing on it, but don't panic if the reading seems high – it's increased by the power dissipation of the processor itself. Happy hacking!

What is the file microbian.a that is made in Lab 4?

The file microbian.a is a library archive, containing copies of various files of object code such as microbian.o and serial.o. The linker treats such libraries specially, linking in only those parts of them that are needed by the program it is linking. Thus, for example, micro:bian programs that don't need the timer will be linked without including the driver process for it. Making and using the library simplifies the Makefile while keeping the resulting programs small.

How can I restore the original demo program to my micro:bit?

Download and flash the hex file you can find on the official activities page as 'BBC micro:bit Original Code' under 'You'll need'.

What are all the numbers shown on an oscilloscope trace?

Here is a trace of a character being sent on a serial line.

RS232 waveform
RS232 waveform
  • In the top left, you can see (STOP) that acquisition of the signal has been stopped, presumably so that I can save a screenshot; at other times, the same label shows WAIT (when the scope is waiting for a trigger event) or TRIG'D (when an acquisition is in progress).
  • Next to it (100us/) we see that the horizontal scale is 100 microseconds per division. The clock period for the waveform can be seen from the trace to be a bit more than 100us, because the transitions fall a bit behind the grid of dotted divisions towards the right of the picture.
  • At the top right, we see that the oscilloscope's trigger is detecting a falling edge on channel 1 and set at a level of 1.72V, midway between the high of 3.3V and the low of 0V.
  • In the left margin, you can see the trigger level marked (T), as well as the 0V level for channel 1.
  • At the bottom left, you can see that channel 1 is DC-coupled, and has a scale of 1.00V per division; the difference between high and low is therefore seen on the trace to be about 3.3V.
  • On the black background just below the top margin, you can see at the left a T mark denoting the trigger point. I've adjusted this to appear towards the left of the display, so that we can see subsequent events, but as the outline at centre top shows, there's plenty more waveform captured in this single-shot mode before and after the trigger that we could see by scrolling.

Edit: here is a similar trace made with one of the new oscilloscopes in the lab.

RS232 waveform
RS232 waveform
  • In the top left, the legend shows that Channel 1 has a scale of 1V/div, while Channel 2 is off.
  • In the top right, the center of the display is delayed by 30.73ms from the trigger event (which is off the left of the screen). The horizontal scale is 200μs/div, and acquisition is stopped. The scope is triggered by a falling edge at a level of 1.6V.
  • In the bottom center, Channel 1 (yellow) has an offset of +1.98V from the bottom of the screen. The channel is DC coupled, and the probe attenuates by a ratio of 10:1 (see below).
  • In the right-hand column, from top to bottom:
    • The scope is acquiring samples without any special processing (Normal) at a rate of 50 million samples per second. (Consequently, the narrowest, 100μs, pulses of the displayed waveform correspond to 5000 samples, giving plenty of data for zooming in.)
    • Repeating the info from below, both channels are DC coupled with 10:1 probes.
    • Waveform math is disabled (greyed out).
    • Reference signals are disabled (greyed out).

It's normal to use oscilloscopes with probes that contain an integral voltage divider, so that the voltage seen by the scope is 1/10 of the voltage at the probe. This raises the effective input impedance of the scope, most vitally reducing by an order of magnitude the extent to which the circuit being measured is loaded by the capacitance and inductance of the probe cable. After digitising the input signal, the scope multiplies the readings by a factor of 10 to compensate, so that true voltages are shown on the display.

Trivia

Why use the ARM? Why not the x86 or x86_64 we already have in our laptops?

I wanted us to program a machine where all the code was accessible to our inspection. That's easy to achieve on a microcontroller, and as simple as it can be on an ARM-based machine. Yes, we could program x86 machines in assembly language, though we would have to face some unavoidable complexity that has accrued during the 30 or more year history of the architecture. But we wouldn't easily get to program the bare machine and understand everything that happens.

Why C? Why not C++? Or why not Python/Java/Scala/Haskell/Some other higher level language?

Quite a lot of embedded programming is done in C++ these days, and in fact the official software support for the micro:bit is in C++. This gives a number of advantages: peripheral devices can be initialised by declaring a static instance of the driver class: PinOut led(PIN3); and C++'s overloading feature subsequently lets you write led = 1; and led = 0; to turn on and off the LED, even if the actual code needed is something more complicated, possibly involving a function call. This is all very well for applications programming, but it really gets in the way of understanding what is happening at the machine level.

Also, for C++, as for other high level languages, we would need to know how the high-level constructs are implemented in order to see how a high level program corresponds to a sequence of machine instructions. For C, this is more straightforward, and we can use C as a quick, clean and portable way of writing what we could also write in assembly language. And in any case, nobody's employment prospects were ever hurt by saying that they knew how to program in C.

Why this micro:bian operating system, and not another, industry standard real-time operating system such as FreeRTOS?

FreeRTOS is OK, and provides a rag-bag of facilities for multi-tasking, including processes, semaphores of different kinds, message queues, timers, and a lot more. It is quite complicated, and everything seems to have lots of options that interact in subtle ways. For the course, I wanted to find out how far we could get with a much simpler scheme, where there are no nested interrupts, possibly no preemptive scheduling, and the system is structured into a family of processes communicating by messages. As you will discover in Concurrent Programming next year, programming with semaphores is a complicated business; and next year will be soon enough for that. And I can't resist saying that the coding style of FreeRTOS gives me the creeps.

Why bother with assembly language if we can write all our programs in C?

Several reasons: first, our aim here is to give a complete account, spanning all levels of abstraction, of the functioning of a small computer system. We're going to take for granted a bit the process of translating high level programs into machine code (more on that next year), but we do need to understand machine language as the interface between hardware and software. Second, when we come to implement an operating system, we will need to understand clearly what it means to save the state of one process and restore the state of another, and that can only be seen at the level of machine code. And third, even if writing assembly language is something you never need to do (especially for the ARM with Thumb code), it's a really useful skill to be able to read the assembly language output by a compiler to see what is happening. When you've used instrumentation to identify the speed-determining part of your big program and done your best to write it in high-level code that will run fast, it's good to be able to look at the compiler output and see if it is sensible.

Is it "twos' complement" or "two's complement" or "twos complement"?

It's "two's complement". To see why this is right, consider that the ones' complement of a number is that quantity that, when added to the number, results in a string of ones:

 00101011
 11010100
----------
 11111111

The two's complement is obtained by adding 1 to the ones complement: when this is added to the original number, the result is a single power of two:

 00101011
 11010101
----------
100000000

So the two's complement of a number is that quantity that, added to the number, results in a power of two. You could make case for the reading "twos' complement" – the complement belonging to a string of 2s. (Cf. Knuth, Vol. 2, p. 203).

You keep saying the micro:bit has 16kB (kilobytes) of RAM. That means 16,000 bytes: isn't it true that there is actually 16KiB (kibibytes) of RAM, or 16,384 bytes?

Yes, that's right. You can collect your refund on the way out. Please note as you go that it's correct to put a thin space between a number and a following unit, as in 16 KiB, and that a thin space should also be used in place of a comma to separate thousands, as in 16 000 bytes, to avoid confusion with the use of a comma as a decimal marker in some countries. Glad we sorted that out.

What fonts do you use on your website?

On webfont-enabled browsers, running text is in Roboto Sans, and titles in Roboto Slab. Code samples are set in Roboto Mono, and the occasional scary word is set in Unifraktur Cook, just for fun.

Now that Scratch has subroutines (including recursion: https://scratch.mit.edu/projects/361167805) is it finally a valid teaching tool?

Scratch does factorials

I'm not mad keen. Particularly worrying in this example is the use of global variables for everything, and the apparent lack of built-in provision for subroutines that return a result. I'm not sure of the purpose of the statement set <n> to <<n> - <1>>, but perhaps that's not really needed; certainly the statement <set <result> to <<1>*<result>> could be removed.

By the way, I think recursion is overrated for general programming, and there are many purposes for which we could use a cleaned-up Fortran – a language supporting (unnested) subroutines with parameters and local variables but no recursion. The book Software Tools by Brian Kernighan and P. J. Plauger makes the argument for this quite eloquently.

We'll write a few recursive subroutines early in the course as a demonstration that we can, and that they're supported by the stack mechanism we discuss and by the addressing possibilities of our machine. But in later parts of the course, for example in writing the micro:bian operating system and programs based on it, there will be no need for recursive subroutines. And with only 16KB of memory to play with, we'd better be sure any recursion doesn't go too deep.

What software do you use to make the slides?

I use an old version of Keynote on the Mac. The crucial advantage over PowerPoint for me is that PDF figures can be pasted into a Keynote presentation with no loss of quality; PowerPoint cannot do this without converting the PDFs to bitmaps. The mathematical formulas that appear on some of the sides are prepared with TeX, and some of the circuit diagrams shown in Trinity Term are prepared with an online editor called Draw.IO and saved as PDF. At one stage, the flexibility of Keynote was dramatically reduced in order to make a version that would run on the iPad. I think that some of the deleted functionality has been restored, but I have just kept to the old version, and kept running an older version of Mac OS to support it.

Why is the BSS segment called BSS?

One answer is that it derives from IBM usage associated with the (vacuum tube based) computers numbered 704 and 709 via the later (transistorised) 7090 and 7094, and it stands for Block Starting with Symbol, from the convention that an assembly language symbol would be associated with the start of each block allocated in the segment. Another possible answer, believed by some unixites, is that it stands for Below Stack Storage, from the fact that the segment is allocated with addresses lower than the stack area in the memory map.

But why "Block Starting with Symbol"? What sense would "Block Ending with Symbol" make? Well, on those IBM machines, one addressing mode combined an index register with a static base address to form an address in memory, and the value in the index register was subtracted from the base address instead of being added to it. So in implementing a static array, it made sense to have successive elements of the array appear at successively lower memory addresses, and so to name as the base address of the array the end of the block of storage allocated for it.