## New questions

If you have a question, perhaps it is answered here – or maybe you can find help in the growing glossary. Feel free to add headwords to the glossary so Mike can fill in the definitions, or to add questions below.

Can you write the question within triple single quotes like this?

Then I will write the answer afterwards. Don't worry about formatting words like micro:bit properly – I will copy-edit your question when I add the answer.

In the primes-intr program, is the processor ever doing more than one thing at once, or is it all done by interleaving?

It is indeed to right to say that the times when the main program is running is interleaved with the times when the interrupt handler is running, with the asymmetric relationship that the interrupt handler is called intermittently as a subroutine, running each time until it returns, wheras one invocation of the main program runs in fits and starts between interupts.

It's not right to say, however, that there is no genuine concurrent action in the program, because the UART is capable of acting independently of the processor. It is given a character to output, and then without further involvement of the CPU, it clocks out the ten bits that make up that character on the relevant device pin, only interrupting the processor when that process is over. So while the UART is not capable of running a program as such, it is capable of acting concurrently with the processor, and the point of introducing interrupt control is to make it convenient for the processor to do useful work at the same time the UART is working on outputting a character.

Your unacceptable program at the beginning of Lecture 10 (the one that keeps the UART busy by polling), is it sufficient to insert a call to poll_uart() in just one place?

It's a good exercise to complete that program so that it works, and the easiest approach is to take the primes-intr program, delete the stuff that enables interrupts, rename the function uart_handler() to poll_uart(), then insert calls to poll_uart() in a sufficient set of places.

You'll find that the program uses an internal (and spectacularly slow) routine called modulo in place of the C language % operator, which itself would be translated by the compiler into a call to a faster out-of-line routine provided in libgcc.a. I found it sufficient to add a call to poll_uart() in the loop within prime(n), as we did in the lecture, and then also in two other places:

• in the loop in serial_putc() that waits for space to appear in the buffer. Without this call, and without an interrupt routine to empty the buffer, the loop if entered would otherwise loop forever.
• at the end of the main program, where I added the loop
while (1) pool_uart()
in order to drain the buffer after the prime-finding process is complete.

Why are we using the V1 micro:bit board? Isn't that obsolete, since the V2 board has a much better processor?

In many ways, the V1 board is better for our purposes, because the instruction set and other aspects of the processor are simpler. We should have enough V1 boards on hand that everyone can get hold of one. The one thing I most want to avoid is having to run the lab sessions with a mix of V1 and V2 boards: apart from anything else, that just doubles the work to support the software.

Will there be lab sessions in Trinity Term?

No, here are just the two unofficial and four official sessions in Hilary. I have created a couple of lab exercises for the last part of the course, one to preserve some materials about arithmetic circuits and their simulation in Haskell, and another to make available a simulator for the ARM Thumb architecture we develop in lectures towards the end of Trinity Term. Both are entirely optional, but (like most things in this course) provided for your interest.

This course has 24 lectures, but only 3 exam questions. Why is that?

I have no idea, but it does seem odd, doesn't it? Looking on the bright side, it will give us plenty of time to learn some interesting things, without having to obsess about getting through topics for the exam.

Course Lectures Questions Ratio
Functional Programming 16 4 4
Algorithm Design 16 4 4
Imperative Programming 1 and 2 20 5 4
Imperative Programming 3 12 3 4
Intro to Formal Proof 10 2 5
Discrete Maths 16 3 5⅓
Continuous Maths 16 3 5⅓
Probablility 16 3 5⅓
Linear Algebra 20 3 6⅔
Digital Systems 24 3 8

## Arising from lectures

These items are links to questions answered at the end of some lectures.

## Assembly language

Why write integer constants in hexadecimal for low-level programming?

We often want to work out the binary bits that make up a constant, and hexadecimal notation makes it easy to do so. Each hexadecimal digit corresponds to four bits of the binary representation, and it's not too hard to memorise that bit pattern corresponding to each digit:

0  0000          8  1000
1  0001          9  1001
2  0010          a  1010
3  0011          b  1011
4  0100          c  1100
5  0101          d  1101
6  0110          e  1110
7  0111          f  1111


To convert a larger hexadecimal constant to binary, just concatenate the four-bit groups corresponding to each digit: for example,

0xcafe1812 = 1100 1010 1111 1110 0001 1000 0001 0010.


How can we convert between decimal and hexadecimal?

For quantities that fit in 16 bits or less, it's helpful to recall that the powers of 16 are 1, 16, 256, 4096. Then to convert, say, 1326 to hexadecimal, we note that the highest multiple of 256 that fits is 5 * 256 = 1280. We can set out the calculation as follows.

            1326
5 * 256  =  1280
----
46
2 * 16   =    32
----
14


The remainder, 46, can be expressed as 2 * 16 + 14, and we get 1326 = 5 * 256 + 2 * 16 + 14 * 1 = 0x52e. The final 'e' is a hexadecimal digit with a value of 14.

To convert the other way, we just calculate 0x52e = 5 * 256 + 2 * 16 + 14 = 1326.

What registers can a subroutine use?

It can freely use registers r0 to r3, destroying any values they had when the subroutine was called. The (first four) arguments of the subroutine arrive in those registers, but need not be kept there, and the result can be moved into r0 just before the subroutine returns. Note, however, that if this subroutine calls any others, they in their turn may destroy the values in all four of these registers.

In addition, a subroutine can use registers r4 to r7, provided it leaves them with the same values on exit that they had on entry. An easy way to ensure that is to push them on the stack with a push instruction at the routine start (together with the register lr), and pop them off again with a pop instruction at the end (together with the pc). Any subroutines that are called will also obey this convention, so important values can be left in these registers across calls.

What if a subroutine has more than 4 arguments?

In that case, arguments beyond the first four are passed on the stack. It's the caller's responsibility to put them there, following a layout that is fixed by the machine's calling convention, and then the subroutine itself can access them using load instructions. I don't think we'll need to write any subroutines with so many arguments (not in assembly language, anyway).

Shouldn't the 'move' instruction really be called 'copy'?

Probably so: movs r0, r1 puts in r0 the same number that is held in r1, without changing the contents of r1.

What do I do about labels if I have a big assembly language program with lots of functions and several loops? Can I re-use labels like loop and done in each function?

No, I'm afraid you can't – you can't have two labels with the same name in a single file of code. You can either prefix each label with the name of the enclosing function, as in foo_loop or baz_done, or you can make use of a system of numeric labels that do allow multiple definitions. There can be several labels written 2: in the file, and you refer to them using the notations 2b or 2f, meaning the next occurrence of label 2 going backwards (2b) or forwards (2f) in the file. That saves the effort of inventing names.

Compilers solve the problem in a different way, just inventing their own labels with names like .L392 and never reusing them. That's OK when nobody is expected to read the resulting assembly language.

When I try running my assembly language program with qemu-arm, I get this message: what does it mean?

qemu-arm: ... tb_lock: Assertion !have_tb_lock' failed.


That's a symptom of the program running haywire, possibly because it has branched to a random location and started executing code there. A probable cause of that is returning from a subroutine with the stack in a bad state, so that the value loaded into the pc is not the correct return address.

What is this .thumb_func directive all about?

Most of the time, the assembler and linker manage to work out for themselves that assembly language code is written using the Thumb encoding, and all works sweetly. But just occasionally (e.g., when writing an interrupt handler in assembly language), it's necessary to mark a function explicitly as being written in Thumb code: in the case of an interrupt handler, so that the interrupt vector can be marked with a 1 bit in the LSB. If this is done properly, then the processor stays in Thumb mode when the interrupt vector is followed. Failing to do this results in an attempt to switch to native ARM mode, and an immediate HardFault exception: the program crashes irretrievably. This seems a bit silly in a device that doesn't support native ARM mode at all, but it is done in the name of software compatibility with bigger devices.

Given that the .thumb_func directive is occasionally needed (at times when it's difficult to determine whether it is actually necessary), and given that it is hard to debug the HardFault that occurs when a needed directive is omitted, it seems simplest just to make a habit of writing .thumb_func in front of every assembly-language function.

I wrote nop in my program, expecting to get the instruction encoded as 0xbf80, but the assembler produced 0x46c0 instead. Why?

Later versions of the ARM architecture implement a specific nop instruction with encoding 0xbf80. On earlier versions, the instruction encoded 0x46c0 is used instead. The rainbow chart decodes that as mov r8, r8, a form of move instruction that can access high registers and (crucially) does not write the condition codes. In practice, there's no significant difference between the two instructions: both last one cycle and have no effect.

What is the difference between the data segment and the bss segment?

The C compiler takes static variables (those that are declared outside any function) and assigns them to one of three segments that are differently treated by the linker. Plain variables such as the array a declared by

int a[10];


are assigned to the bss segment. When the program starts, such variables have no special initial values, or rather, their initial values are zero, false, null. Other variables may be given specific initial values: for example, the array b declared by

int b[10] = { 3, 1, 4, 1, 5, 9, 2, 6, 5, 4 };


These variables are allocated in the data segment, and unlike the bss segment, the program output by the linker contains an image of their initial values. The third segment used for static data is the rodata segment, used for 'variables' that are declared as constant, such as the array c declared by

const int c[10] = { 2, 7, 1, 8, 2, 8, 1, 8, 2, 8 };


These variables do not need space in the RAM, and can be kept in the (more spacious) ROM instead. Such constant tables are common in embedded programs: for example, a lunar lander game might contain a table of the sine of each whole number of degrees from 0 to 360, to save time computing them when the angle of a thruster changes.

Ok, fair enough: but how can I declare similar variables in assembly language?

You can use the directives .data or .bss or .segment rodata to select a segment, a .align directive to pad for alignment, then either .word or .space directives to allocate space. Like this:

    .bss
.align 2
a:
.space 40

.data
.align 2
b:
.word 3, 1, 4, 1, 5, 9, 2, 6, 5, 4

.segment rodata
.align 2
c:
.word 2, 7, 1, 8, 2, 8, 1, 8, 2, 8


The directives .data and .bss are just abbreviations for .segment data and .segment bss. If you want a scalar variable instead of an array, just reduce the amount of space from 40 to 4, or give only one word of initialisation data:

    .bss
.align 2
x:
.space 4

.data
.align 2
y:
.word 314159


Ok, now I see how to declare global variables. How can I use them?

Let's try writing in assembly language the function that we could express in C as follows.

unsigned g = 4;

unsigned foo(unsigned x, unsigned y) {
unsigned z = g;

g = x;

return y + z;
}


As you can see, this uses a global variable to remember the argument x from one invocation of foo to the next, with the remembered value being 4 initially. In the following interaction, the values printed are 20+4 = 24 and 8+10 = 18, and the memory contains 6 for the next invocation.

a = 10
b = 20
foo(10, 20) = 24

a = 6
b = 8
foo(6, 8) = 18


Without further ado, let's try to express the same thing in assembly language. First, we allocate space in the data segment for the global variable.

    .data
.align 2
g:
.word 4


In the function foo, we need to get access to g, and to do this, we plant the address of g somewhere in the text segment, and use pc-relative addressing to load the address into a register.

    .text
.thumb_func
foo:
ldr r3, [pc, #8]
ldr r2, [r3]
str r0, [r3]
bx lr
.align 2
aaa:
.word g


The instruction ldr r2, [pc, #8] loads the address g of the global variable into r2, by loading it from a convenient place aaa in the text segment. Then, with the address g in a register, we can use a second ldr instruction to load its value, which sits in the data segment in RAM. The point is that there are two load instructions, one to get the address of the global variable, and the second to get its value. The address is a constant, and can be stored in the ROM at a place where we can address it using pc-relative addressing. The value is variable, and is stored in RAM, at an address that we cannot write as part of the code stream. The address in r3 is also used to write the global variable with the first argument of the function from r0 – then we add together the second argument and the fetched value of g to give the result, putting it in r0 before returning.

We could write the instruction ldr r2, [pc, #8] as ldr r2, aaa to get the assembler to compute the offset 8 for us. Or, better still, write ldr r2, =g to get the assembler to find a 'convenient place' for us, and leave it to put the constant at the end of its output to the text segment, or put the hint .pool after the bx lr (and delete the rest of the source shown aboive from that point on. However we express the source file, the resulting object program is the same: it has a word in RAM, another word in ROM that contains the address of the first word, and a pc-relative load instruction that addresses the second word.

If we want to plant constants using .word directives, or use the .pool directive to give the assembler a hint where to put its own table of literals, where is a safe place to put them?

If we use .word to put constants (such as addresses) into the text segment to access them with pc-relative addressing, then we need to be sure that the machine will not attempt to execute these constants as if they were code. For example, we could put them immediately after the bx lr or pop {pc} that ends a function, because we can be sure the processor will never try to fetch and execute instructions from there. We can put .pool directives in similar places to give the assembler a hint it can place its tables of literals there. If we don't put in these hints, then the assembler will build one big table of literals and put it after the last function in the source file. Usually this is sufficient, because our fragments of assembly language will be small.

How should I use the push and pop instructions in a subroutine that uses more than 4 registers?

The best thing to do is save the additional registers at the beginning of the subroutine, and restore them at the end. For example, here's a subroutine that sums an integer array specified as a parameter.

int sum(int *a, int n)
{
int total = 0;
for (int i = 0; i < n; i++)
total += a[i];
}


To translate it into assembly language, it's convenient to use five or more registers, so that we can keep the parameters a and n in the registers where they arrived, and the quantities i and total can live in registers too, and we have at least one working register to hold temporarily the values 4*i and a[i]. Here's a reasonable translation:

sum:
push {r4, lr}       @ Save caller's r4
mov r2, #0          @ Set total to 0
mov r3, #0          @ Set i to 0
L1:
cmp r3, r1          @ Is i < n?
bge L2              @ If not, end the loop

lsls r4, r3, #2     @ Compute offset of array element
ldr r4, [r0, r4]    @ Fetch the element a[i]

adds, r3, r3, #1    @ Increment i
b L1                @ And repeat
L2:
movs r0, r2         @ Put result in r0
pop {r4, pc}        @ Restore and return


Do you see how the incoming value in r4 is saved at the start and restored at the end, making r4 free for us to use throughout the subroutine body? This code saves lr at the top and restores it into pc at the end, but that can be avoided by writing push {r4} at the start and pop {r4}; bx lr at the end, for about the same cost.

An inferior approach (which is the point of this question and answer) is to save r4 just around the fragment of code where it's needed, like this:

sum:
mov r2, #0          @ Set total to 0
mov r3, #0          @ Set i to 0
L1:
cmp r3, r1          @ Is i < n?
bge L2              @ If not, end the loop

push {r4}           @ Save caller's r4
lsls r4, r3, #2     @ Compute offset of array element
ldr r4, [r0, r4]    @ Fetch the element a[i]
pop {r4}            @ Restore r4

adds, r3, r3, #1    @ Increment i
b L1                @ And repeat
L2:
movs r0, r2         @ Put result in r0
bx lr               @ Return


This approach may come naturally to those versed in assembly language programming for x86, where the stack is often used in a similar way to this. But (at least on ARM, and really on x86 too) the approach is inferior, because the work of saving and restoring r4 is repeated in each iteration of the loop, rather than being done once at the very start and end.

An added advantage of the first approach is that diagnostic tools like stack traces and debuggers on the ARM are able to decypher the code for subroutines written in this conventional style, in order to deduce how big the stack frame is, where one frame ends and another starts, and so on – information that is impossible to gain from the register state alone. Subroutines that make free use of the stack like the second approach are harder for such tools to make sense of. Naturally, both implementations could be improved in a minor way with tactics like rolling the loops around so the test comes at the bottom, keeping 4*i or a pointer to a[i] in a register, or counting down instead of up. But the same observations continue to apply to such improved versions – a subroutine that saves registers should have a push instruction at the top and a complementary pop instruction at the bottom, that's all.

The rainbow chart seems a bit vague about the instructions that occupy 32 bits. What's going on?

The Cortex-M0 processor in the micro:bit implements only a few 32-bit instructions. Most important is the bl instruction for calling a subroutine, which encodes a 22-bit offset that (after scaling) allows the calling of any subroutine within a distance of ±4 MB. But in order to make the instruction set complete, the processor also implements a handful of other rare instructions with 32-bit encodings, such as the mrs and msr instructions that move data between ordinary registers and the 'special' registers that control various aspects of the processor's behaviour. Such instructions are used very occasionally in this course to implement some features of the micro:bian operating system, and particularly to put the processor into a mode where every process can have its own stack.

Orginally, the bl instruction was encoded and executed in two halves; the first half had an encoding in the range 0xf000--0xf7ff, and put the upper 11 bits of the target address into the lr register, and the second half (in the range 0xf800--0xffff) would combine these with another 11 bits and perform the branch. On processors that had Native mode as well as Thumb mode, an alternative second half, encoded in the range 0xe800--0xefff, completed a blx instruction for calling a subroutine written in Native code. This instruction is missing on processors that do not have Native mode.

In time, the architecture was changed so that the two halves of the bl instruction were treated as a unit, and so the processor did not have to be able to decode the second half independently of the first. This restriction enabled a number of additional freedoms:

• Two additional bits became available in the second half to encode more bits of target address, allowing jumps of ±16 MB; this is largely irrelevant on a microcontroller with only 256 kB of ROM.
• It became possible to encode instructions other than bl using a 32-bit value that had a first half that looked like the first half of a bl instruction, but a different second half. More recent (Cortex-M4) machines implement a huge number of additional instructions this way, relaxing most of the restrictions (such as access to only low registers) that were imposed with 16-bit Thumb-1 code. The Cortex-M0 microcontrollers implement only a very few of these, but with an encoding that is consistent with other designs.

Thus, a bl instruction has a first half beginning 11110... and a second half beginning 11?1?..., though with a limited range of offsets, the second half will begin 11111.... The other few 32-bit instructions have a first half that begins 11110... and a second half that begins with anything but 11?1?..., and typically with 01.... Thumb-2 code packs a lot of instructions into this space, and a rainbow chart for it would need multiple subsidiary tables to cover everything. When we build an architectural simulator, we'll simplify things by taking the bl instruction in two halves and reverting to the original, simpler, encoding scheme.

## C programming

What's the difference between the expressions m.type and m->type?

The first of these is used when m is a structure (record) declared with message m, so that storage for an actual structure is allocated by the declaration; it names the m_type field of the structure.

The second is used when m is a pointer to a message declared elsewhere. We can declare such a pointer with message *m, and if msg is a message declared as in the first part of this answer, we can write &msg to get a pointer to it. The notation m->type selects the type field of the structure to which m points; it is equivalent to (*m).type, because *m is the structure that m points to, and we can use the notation from the first part to select its type field. The notations are equivalent in meaning and in efficiency, but m->type looks a bit neater.

It's common in C programming to associate pointer variables with storage that is allocated dynamically using the library function malloc(). Such storage is allocated when needed, and persists until it is freed with an explicit call to free(): in particular, it can live beyond the lifespan of the function activation that created it. We shall not need to allocate storage in this way: the closest we come is the pool of process records used in micro:bian, but a fixed number of these are allocated before the program begins.

Your C programs use types like int and unsigned and short. Wouldn't it be more portable to use types like int32_t and uint32_t and int16_t that have a well-defined width on all platforms?

Yes, it would: it's just that I'm trying to keep down the number of obscure conventions that are used in the programs. On the ARM, it's safe to assume that int and unsigned are 32 bits.

Which is faster as a way of writing an infinite loop: while (1) ... or for (;;) ...?

This is one of many things connected with minor efficiency that really don't matter. If I were writing a compiler, I would ensure that both of these idioms resulted in exactly the same code, with the loop body followed by an unconditional branch from the bottom back to the top. If by accident gcc generates slightly less tidy code for one form, then we are unlikely to notice the difference. (Similar remarks apply to the idioms n++ and ++n when written as statements.)

According to the standard books about C, the main program should be called main, and yet in this course it is called init. Why is that?

In normal C, the main program ought to have the heading

int main(int argc, char **argv)


reflecting the fact the main receives two parameters, one being the number of command-line arguments (including the program name), and the other being an array of strings, each corresponding to one of the arguments. The return type is int, because the program may terminate by returning an integer, zero for success and some non-zero value for failure.

Our function init takes no parameters and delivers no result:

void init(void)


What's more, once we have an operating system on the chip, then unlike main, when init returns the program is not over, but concurrent execution of the processes in the program is only just beginning.

These differences might be enough to make a change of name desirable, but there are practical reasons too. One is that GCC (un)helpfully signals a warning if a function named main does not have the heading it expects; another is that it treats main specially in the output, assigning it to its own section in the object file so that a linker script can place it in the program image at a specific place. Both of these behaviours are unhelpful to us.

• Actually, the warning is suppressed by specifying -ffreestanding, which also suppresses other warnings about the types of functions like sprintf. And the separate section is only enabled with the optimisation flag -Os or -O2, not at the -O level we will generally use, mostly to make it easier to interpret program behaviour under a debugger.

You have not explained the implementation of printf. How does it work?

The implementation of printf in any of our programs is split into two parts: the part that is independent of the output device, and renders the data as a sequence of characters, and the device-dependent part that takes each character and transmits it on the output device. The C code is made slightly complicated by two facts: printf takes a variable number of arguments, and the device-independent part is implemented as a higher-order function that takes the device-dependent part as a parameter.

Any program that uses the printf function from the library must provide a function print_buf with the heading

void print_buf(const char *buf, int n)


It takes an array of characters and a number as parameters, and should output the characters buf[0], ..., buf[n-1] on whatever device is chosen. In simple applications, print_buf can be implemented with a loop that calls some other subroutine to output each character, but the interface allows smarter implementations that doesn't treat each character separately – for example, in the serial driver that is part of the micro:bian operating system.

The implementation of printf in the library is itself composed of two parts: there's a generic function _do_print that takes care of the formatting, and a function f_bufferc that stores characters in a buffer and periodically calls print_buf. The implementation of printf itself looks like this:

/* printf -- print using client-supplied print_buf */
void printf(const char *fmt, ...)
{
va_list va;
struct buffer b;
b.nbuf = 0;
va_start(va, fmt);
_do_print(f_bufferc, &b, fmt, va);
va_end(va);
flush(&b);
}


This function takes a variable number of arguments, so we can write a call such as

printf("prime(%d) = %d\n", n, p);


where the one fixed argument – the format string – indicates that there are two additional arguments to come, both integers. The function captures its list of arguments as a kind of pointer va which is one of the things passed to _do_print, so that _do_print can access the arguments of printf one at a time. Variable argument lists are implemented in different ways on different machines. On the Cortex-M0, the compiler will store the arguments from r0--r3 into the stack frame together with the fifth and later arguments (if any), then use a pointer to the relevant portion of the stack frame for va. What happens on a particular machine is all hidden in the functions va_start and va_end.

The function also declares a buffer object b that it passes to _do_print, together with a function f_bufferc that operates on it. Such buffer objects can store up to 16 characters, and f_bufferc adds a character to the buffer it receives as an argument, flushing the buffer by passing it to print_buf whenever it becomes full. As you can see, printf also flushes the buffer before returning. This version of printf can't buffer the results of several calls, but it nevertheless gives a valuable speed-up in some cases.

The point of the separation between _do_print and f_bufferc is that _do_print can be called in other ways: for example, the string formatting function sprintf(s, fmt, ...) is implemented by calling _do_print with a function that stores characters in the array s instead of buffering and outputting them.

Here's the function heading for _do_print:

void do_print(void (*putc)(void *, char), void *q, const char *fmt, va_list va) {
...


The C syntax is a bit obscure here, but it indicates that do_print takes four arguments:

• A (pointer to) a function putc that accepts an untyped pointer and a character as argument and returns nothing.
• An untyped pointer q that is passed to putc whenever that function is called. When _do_print is called for printf, this will be a pointer to a buffer object.
• A string fmt, represented as a character pointer.
• The variable argument list va from _do_print's caller.

The body of _do_print uses these to do its job.

     ...
char nbuf[NMAX];

for (const char *p = fmt; *p != 0; p++) {
if (*p == '%' && *(p+1) != '\0') {
switch (*++p) {
case 'c':
putc(q, va_arg(va, int));
break;
case 'd':
do_string(putc, q, itoa(va_arg(va, int), nbuf));
break;
case 's':
do_string(putc, q, va_arg(va, char *));
break;
default:
putc(q, *p);
break;
}
} else {
putc(q, *p);
}
}
}


As you can see, _do_print iterates over the format string, using a pointer p to point to the character of that string that it is currently working on. I've shown the implementation of three format codes here: %c for printing a character, %d for printing a decimal integer, and %s for printing a string. Note that you can write %% to print a literal percent sign. The formatting loop works its way through the variable argument list by using the function va_arg (more likely, a macro), and outputs each character by calling the argument function putch. A helper function do_string is used to call putc on each character of a string, such as the one returned by the function itoa (the name is traditional) that converts an integer into a string.

We carefully avoid global variables so that printf can be called from multiple processes simultaneously – it is reentrant, to use the jargon. That's the reason for the local array nbuf that is passed to itoa. (None of this will be on the exam.)

Code in the examples that prints messages on the terminal contains "\r\n" to move to a new line, but other examples of C code that I've seen use just "\n". Why is that?

On an old-fashioned terminal, and on the simulation provided by minicom, the carriage-return character 0x0d (written '\r' in C) is needed to move the cursor or print head back to the left of the page, and the line-feed character 0x0a (written '\n') scrolls up by a line. Normally, the C library or operating system is written so as to insert a '\r' in front of each '\n' output by a C program. But we are working without a library and without an operating system, so we must do this ourselves. (The serial driver in micro:bian does insert the '\r' in its version of the serial_putc function.)

If I ask the C compiler to output assembly language (and turn off -g), then most of the file is intelligible, but these lines appear close to the start of the file. What do they mean?

.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1


The lines record the assumptions used by the compiler in generating code: things like whether there is a hardware floating point unit, whether unaligned memory access is allowed, etc. These assumptions will be recorded in a special section of the binary file output by the assembler, and the linker then checks that the assumptions are consistent between different files of code.

C provides bit-fields for access to parts of a memory word. Why don't we use them instead of using shifting and masking operations to program peripheral registers?

On the ARM, most peripheral registers must be accessed as a whole 32-bit word, and can't be used with the ldrb, strb, ldrh, strh instructions that access bytes or 16-bit halfwords. Also, it's sometimes the case that peripheral registers must be accessed exactly once in an operation, or some side-effect may be duplicated. The C compiler doesn't guarantee to respect these conventions when it generates code for bit-field access, so we avoid bit-fields and write (with the help of simplifying macros) explicit shifting and masking code instead. The resulting object code is no worse than would be compiled for bit-fields anyway.

What indentation style do you use for C?

My style is close to the one used in the classic book by Kernighan and Ritchie. I don't insist on braces around a single statement controlled by if or for or while, except that the else part of an if must, for balance, have braces if the if part does, and } else { goes all on one line. Expressions should have parentheses whenever the interpretation is anything but obvious without them. I don't accept the argument that without braces, accidental mistakes of indentation can mislead about the scope of an if or while – but I use an editor that reliably gives the correct indentation automatically.

## I/O devices

What is the difference between the I/O registers UART_INTEN, UART_INTENSET and UART_INTENCLR?

The UART_INTEN register shows which events in the UART can cause an interrupt: you can write a value to the register to enable or disable individual interrupt sources, and you can read it to find out which sources are actually enabled. The Nordic chips have a scheme – mostly for convenience – that allows individual bits in this register to be set or cleared without having to read the register, modify it, and write back the result. If you write a value to the UART_INTENSET register containing one or more 1 bits, then the corresponding bits in UART_INTEN will be set to 1, and the other bits will remain unchanged. Similarly, writing a pattern of 1 bits to UART_INTENCLR clears the corresponding bits of UART_INTEN and leaves the others unchanged. The general scheme is described in Section 10.1.2 of the nRF51822 Reference Manual.

What happens when the micro:bit boots?

It's not right to say that the Nordic chip "boots", in the sense that a small program runs initially and is responsible for loading a larger one. Instead, we could talk about it "coming out of reset". When the hardware Reset signal is deactivated (for example, by releasing the reset button), the processor fetches the word at address 0 and loads it into the stack pointer, and fetches the word at address 4 (the reset vector) and loads it into the program counter. That is the extent of the hardware's responsibility.

On the software side, the C compiler, the linker, the linker script nRF51822.ld and the C functions in startup.c work together to determine what happens next. The initial stack pointer is set to the top of RAM, and the reset vector points to the routine __reset in startup.c. The contents of the RAM at this point are meaningless, and may be random. Here's the code for __reset:

/* __reset -- the system starts here */
void __reset(void) {
/* Activate the crystal clock */
CLOCK_HFCLKSTARTED = 0;
CLOCK_HFCLKSTART = 1;
while (! CLOCK_HFCLKSTARTED) { }

/* Copy data segment and zero out bss */
int data_size = __data_end - __data_start;
int bss_size = __bss_end - __bss_start;
memcpy(__data_start, __etext, data_size);
memset(__bss_start, 0, bss_size);

/* Call the 'main program' */
init();

/* If it returns, enter a tight loop */
while (1) pause();
}


When the chip comes out of reset, it is using for the clock a crude internal RC oscillator based on the time constant of a resistor and capacitor. For accurate I/O timing, it's better to use the 16MHz crystal that is installed on the board, so the first action is to switch on the crystal oscillator and wait for it to stabilise. Next, memcpy is used to copy the initial values of static variables into the bottom of RAM, and memset is used to set the remainder of the static variables to zero. The linker has made all the code for the program refer to locations in the bottom of the RAM, but the data which ought to appear there is not yet present. Instead, the linker has (folowing directions in the file NRF51822.ld) put an image of the data segment in the ROM just after the program code. It has also defined symbols like __data_start, __bss_start and __bss_end that we can use here. From this point on, the program can assume that static variables have their proper initial values.

There follows a call to the user-written subroutine init that forms the main program. As you will see, if init ever returns, the program falls into an infinite loop, forever doing nothing; interrupt handlers remain active during this loop. Later in the course, we will start to include the micro:bian operating system in our programs, and then the __reset() function will look slightly different, calling os_init() will set up its data structures, then init() that creates a number of operating system processes, and finally os_start(), the scheduler that interleaves execution of the processes, which work together to carry out the program's tasks. The scheduler never returns.

Why do wired-or busses like I2C have the convention that signal lines are active-low?

Electrically, active-low busses are better in two ways: one, N-type MOSFETs are generally better behaved than P-types, and specifically tend to have a lower on resistance. This makes them better able to pull the signal line low crisply (when it may have significant inductance or capacitance) than a P-type could pull it high. In older systems an NPN transistor in an open-collector output stage had a similar advantage over a PNP transistor, and for the same physical reason – that the majority carriers in these transistors are electrons, which have a better mobility than holes. The second reason for using active-low signalling is that a bus may cross between circuits that are using different power supplies, and ground is then used as a common reference, whereas the power rails may differ a little or a lot in voltage. An active-low signal can have threshold voltages defined with respect to ground, be pulled low by an open-drain output with short fall time, and draw little or no quiescent current through its pullup resistor.

## micro:bian

Why does each process in micro:bian take an integer parameter? Is this parameter the same as the process ID?

The parameter is there to make it possible to use the same function as the body of more than one process, with those processes doing the same thing with different data. For example, in the Today programme example, the two 'politician' processes run the same code, endlessly spouting their message, but a different message in each case. We do this by having a global table of messages, then giving the process a parameter that it an index into the table.

char *message[2] = {
"Brexit means brexit",
"No deal is better than a bad deal"
};

void politician(int i) {
while (1) {
spout(message[i]);
}
}


When we start the two processes, we give different parameters to be passed to the process body.

FARAGE = start("Farage", politician, 0, STACK);
MAY = start("May", politician, 1, STACK);


Providing one integer argument is just enough to make this kind of programming possible without too much inconvenience.

No, this parameter is not the same as the process ID, which in the example is the result, FARAGE or MAY, that is returned by start(). The confusion perhaps arises because the body of a process usually doesn't have to mention its ID, though the ID is implicitly used as the sender field whenever the process sends a message.

What's really going on in Problem 4.4?

Thanks, Mike, for asking: it gives me the opportunity to link to another page that reports the result of an experiment with the program.

Why can't one micro:bian processes start another?

In short, because that would be a complication that would add little utility. The memory allocation in micro:bian is deliberately simple, stack space for each process being nibbled away from one end of a single chunk of free space. That works well if a fixed number of processes are started when the program begins, but much less well if processes come and (by implication) go during the life of the system. If processes exit and restart, we will need to re-use the storage space of dead processes to find room for new ones; and unless we are prepared to move processes around, we then face the problem of fragmentation of storage – all too messy to contemplate. Given that the operating system is only going to support a fixed collection of processes, we may as well have those processes created as part of initialisation, before the scheduler takes control.

When a process makes a system call such as receive(ANY, &m), how does the operating system access the arguments?

The arguments are in registers r0 and r1 before the svc instruction that enters the operating system. The handler for the SVC interrupt gets the the stack pointer for the process after the context has been saved onto the stack. It can then reach into it (in a machine-dependent way) to discover what the values of r0 and r1 were at the time of the call. It also uses the saved pc value to look at the svc instruction, which contains an integer constant that identifies which system call – receive or send or another – was made. Given the address of the message buffer m, the operating system can read or write the memory belonging to the process in order to copy a message from sender to receiver.

What happens if a process loops forever, consuming CPU time, without ever calling send or receive?

Scheduling is not time-based, so once such a process is running, it will stay running until the next interrupt, and other processes will not run until then. micro:bian is not really designed to share the processor fairly among processes that always want to run, because doing that really needs a clock interrupt to be directly connected to the scheduler in order to measure out time slices, and we'd like to make use of timers optional. The assumption is that CPU time is a plentiful resource, so that fairness in sharing it doesn't matter. In the heart example, it works well to give the process driving the display a higher priority than the one searching for primes, so that after a timer interrupt it get to run first.

If a micro:bian process calls a subroutine (like primes calling printf which subsequently calls serial_putc) and the subroutine calls send to send a message, how does the operating system know which process sent the message?

The operating system always knows which processs is currently running (that is the purpose of the internal variable os_current). The stack frames for primes and printf and serial_putc are all part of the stack of that process. So when the call to send is made, the operating system knows that it is made on behalf of the PRIMES process. The same subroutine serial_putc could be called at another time by another process, and then is would generate a message from that process – whichever one was current at the time.

What is the resting state of a pipeline where one process PRIMES is generating primes, passing them to another process FORMAT for formatting, then FORMAT is sending characters to SERIAL for printing? Let's assume primes are being found faster than they can be printed.

Process pipeline

In this case, the serial driver's buffer will usually be full. SERIAL will not be listening for more PUTC messages, but blocked until the next UART interrupt. In consequence, FORMAT will be blocked waiting to send it a character. And PRIMES will eventually have found another prime and be blocked waiting to send it the FORMAT. Nothing more will happen until the next interrupt, and the IDLE process is the one running, or rather it has put the processor to sleep at a wfe instruction – the only one in the system.

When a UART interrupt arrives, SERIAL will start the UART transmitting a character taken from its buffer, so a slot in the buffer is now available, and SERIAL accepts a PUTC message from FORMAT. This fills the buffer again, so the SERIAL process blocks waiting for another interrupt. FORMAT is also able to run, and it will take a short time to decide what character to print next, then block again waiting to send it to SERIAL. Just occasionally, FORMAT will have put the last newline to a message, and it will then accept another PRIME message from PRIMES: that process will then be free to begin seraching for another prime to transmit.

Is it possible to create a buffer of messages?

There is already a buffer of characters in the SERIAL process, but it's possible to introduce an independent buffer process between any pair of processes that communicate.

Buffer process

The process on the right (the producer) can sent PUT messages to the central buffer process, asking it to store data items for later delivery. Then the process on the left (the consumer) can send GET messages to the buffer, getting each time one data item in a REPLY message. micro:bian provides just enough expressiveness to deal with this kind of situation, but note that the buffer process must know the names (process IDs) of the two client processes and depends on them to communicate with a fixed protocol. In more elaborate message passing schemes, there are notions like named channels of communication, which make it much easier to plug together processes and adapters in an ad-hoc way.

If two messages are sent one after the other, can they be delivered in the opposite order?

If they are sent from the same process, then No: this is a consequence of the fact that the process does not get to call send a second time until the first call of send has returned, and when that happens, we can be sure that the recipient has also completed its call of receive and has hold of the first message.

Ordering of messages

But if the two messages are sent from different senders, then the question is more complicated, and I have four different answers for you. In order (!), they are No, Yes, Maybe and Yes.

• No, because if two proceses B and C are both on the sending queue of a process A, then there are no guarantees on the order they will be picked. It could be that process A specifies that it would like to receive from process C first and process B afterwards. And the micro:bian designer does not promise to accept a bug report claiming in any case that a re-ordering of waiting processes is wrong.
• Yes, because in practice (selective receive aside) the sending queue is treated like a first-come-first-served queue by micro:bian. Although there may be no guarantees about order of message delivery, nevertheless FCFS is the simplest way of ensuring fairness, which in this context means that if process B wants to send to A, and A is continually accepting messages in a way that does not exclude B, then B will not be delayed forever.
• Maybe, because we've got to ask what we mean by 'before'. Suppose processes B and C are connected to the two push-buttons on the micro:bit, and I push the button for B at 4:16pm and the button for C at 4:17pm. There's no guarantee on a busy machine that process B will get to run before I press the second button, but then process C might get to run immediately – so it would be wrong to rule out a behaviour where the message from C got through first.
• But Yes: suppose process B sends its message, then turns on a light; and I, seeing the light, then press the button for process C. Then there is a causal chain between one message and the other – the light comes on after the first message is delivered, the button for C is pressed after that, and the second message is only then sent. Without including the human factor, we can define a temporal ordering of events where events in the same process are strictly ordered by the history of the process, and events in different processes are ordered in a way that is determined by the synchonisation implicit in the sending and receiving of a message. It's possible for events in two different processes to be unordered if there is no causal chain connecting them; but if one message is sent before another according to this ordering of events, then they are indeed received in the same order.

Is HARDWARE really a process?

Each process is identified by a small integer. This integer is an index into a table that stores the state of each process. The special value HARDWARE = −1 is not the PID of a genuine process and does not correspond to a slot in the table, but nevertheless you can write receive(INTERRUPT, &m) and afterwards find that m.m_sender == HARDWARE. The mechanism for sending an interrupt message is not to put any process on the sending queue of the process that will receive the message, but instead to set a flag p_pending in the receiving process that indicates an interrupt is waiting. So in this sense, there is no HARDWARE process. Actually, having such a process object wouldn't help with the implementation, because we know a process can only be on one queue at a time, and there can be several interrupts pending for different drivers.

Do interrupts jump the queue of waiting processes?

When an interrupt happens, the OS looks at the connected driver process. If it's blocked at a call of receive, the job is simple: we fill in sender = HARDWARE and type = INTERRUPT in its message buffer, and queue it for execution. If the driver process is not blocked at receive, we set its p_pending flag instead. Later, when it calls receive, the OS looks to see if any interrupts are pending: if so, the process does not block, but an INTERRUPT message is immediately delivered to it and the p_pending flag is cleared. In this way, yes, the interrupt message does overtake any other messages that are waiting. That's no bad thing, because usually interrupts are quite urgent.

Why is the stack space of a micro:bian process filled with the value 3735928559 when the process starts?

This helps with finding out how much stack space the process is actually using: when micro:bian prints a dump of the process table, it shows how much of the stack space of the process has been changed from this initial value, and we can assume that this is the total stack space that has been used, unless by some slim chance the last few words of the stack have been set to this magic value by the process itself. As to the choice of value, it appears in the micro:bian source code in hexadecimal notation, which makes it seem more significant.

In micro:bian, what happens if an interrupt is triggered when the attached process is not waiting to receive a message?

The process table entry for each process has space to record one pending interrupt message, so if an interrupt arrives and the process is not waiting for it, the message can be delivered later. The default interrupt handler for micro:bian, which converts interrupts to messages, also disables the interrupt that invoked it, because the interrupts on the micro:bit continue to be triggered if the cause is not removed, and removing the cause (such as an idle UART) requires hardware-specific actions from the driver process. So a second interrupt cannot happen until the driver process has run to re-enable the interrupt. If this behaviour is inadequate to the situation, it's possible to replace the default handler with another one for a specific device: the timer driver does this to help with implementing a microsecond-resolution clock function.

What happens if a device driver process does not finish responding to one interrupt message before another interrupt occurs?

As noted above, the default action of disabling the interrupt means that a second interrupt can't actually be requested. But if (for example) the interrupt comes from a timer, and interrupts are being lost causing the clock to run slowly, then the right thing is to move some work from the driver process to the interrupt handler, and modify it so that the interrupt remains enabled all the time.

In the UART driver process, code for input and code for output are intertwined. Wouldn't it be neater to untangle them and have separate driver processes for the two sides of the UART?

In some ways, yes. But the vital deciding factor in favour of a single driver is that the two sides of the UART share a single interrupt request, so interrupts for both sending and receiving turn into messages to a single process. We could use a more elaborate scheme where a little router process passed on interrupt messages to one or other of two driver processes, but we would be adding as much complexity as we could hope to remove. (A minor additional point is that echoing of input characters is easy to do if input and output share a driver process.)

If message-passing makes programming so much easier, why do widely-used embedded operating systems provide semaphores instead?

Programs written using semaphores have the potential to be faster than those written using message passing and server processes, because context switches from one process to another can be made much less frequent. For example, to buffer and print characters on the serial port, our message-based design requires about four context switches per character sent, in addition to the running of the UART interrupt handler. There is a context switch to send a request message to the UART driver process and put the character in the buffer, another to return to the main program, then two further context switches into and out of the driver process when a UART interrupt prompts it to send the character. By way of contrast, a semaphore-based solution might need no context switches at all, if characters are retrieved from the buffer and sent to the UART directly in the interrupt handler. This is because a down operation on an available semaphore only needs to disable interrupts for a moment to test and lock the semaphore, and need not block the invoking process. It's only when a process needs to block on a semaphore that a context switch is needed.

On the other hand, message-based concurrency is much easier to think about, and it makes it easier to avoid problems like priority inversion, where a high-priority process is prevented from making progress because it needs to lock a semaphore currently held by a low-priority process.

Why does micro:bian not have (i) time-based scheduling, (ii) nested interrupt handlers, (iii) an interruptible kernel?

In a simple, interrupt-driven system that is not overloaded, most of the time every process is blocked waiting for something to happen, and the idle process is 'active', in the weak sense that it is the running process, but it has put the processor to sleep with a wait-for-interrupt instruction (wfe on the ARM). When an interrupt arrives, processes wake up, there is a flurry of activity, and then the system settles down to sleep again, and the time needed for that to happen is bounded. If there is no response signal due from the system as a result of an interrupt that must be produced in less than this time-bound, then there is no need for pre-emptive scheduling (which does not change the total amount of work to be done, but just affects the order in which to tasks are carried out). And if we keep interrupt times short by doing no more than sending a message in each interrupt handler, then there will be little need to allow the running of one interrupt handler to be suspended in order to tun another. Likewise, if the time spent in delivering a message is short, it won't matter much if an interrupt is delayed until the kernel has finished.

There are well-developed theories of process scheduling that aim to ensure that complex sets of real-time deadlines are met using limited resources, and in circumstances where they are needed, these theories are really valuable. In simple cases, however, process scheduling is hard to get wrong, in the sense that if every process waiting for an event gets to run after the event has happened and before the system goes to sleep again, then everything will happen when it should and in good time. micro:bian is designed with these modest goals in mind.

Instead of declaring a message buffer and laboriously filling in its fields, why not use GCC extensions and write something like this?

send(client, REPLY, &((message) { .int1 = temp }));


Why not, indeed? It certainly works, but it does look a bit intimidating. I think I'll stick to the simpler style in teaching materials. We are, by the way, using a late C enhancement (actually part of recent standards) that allows anonymous nested structures and unions, so that we can write m.int1 instead of m.m_field1.m_int or something similar. That, I think, gives an improvement in clarity.

What is the purpose of the micro:bian system call yield()?

It has no purpose, and should never be used. Well, hardly ever. A process that is checking repeatedly for some condition, such as

while (1) {
if (rat detected)
kill the rat;

yield();
}


should be avoided. It might have come about because a previous version without the yield() turned out to hog all the CPU time. But adding yield() doesn't really provide a solution. On the one hand, the program is still wasting time checking the same condition repeatedly; and on the other, there is neither a guarantee that other processes will get a sufficient share of the CPU to do their work, nor a guarantee that this process will run often enough to kill the rat before it escapes.

The process should be replaced by one that waits for a message instead:

while (1) {
kill the rat;
}


That way, CPU time is not consumed in useless polling. However, this does require a design where processes communicate with each other explicitly, and that is what micro:bian encourages. This rat-killing process can be given an elevated priority if it needs to act urgently, something that would not work if it relied on yield() to give CPU time to others.

There is one vital use for yield(): when the scheduler starts up, the idle process is the first one to run, and it invokes yield() just once to give the scheduler chance to start running other processes.

Why not implement mutual exclusion with a global turn variable and yield()?

The program you showed me had a global variable turn saying which politician should speak next, and something like the following code for each speaker.

volatile int turn = 0;

void speaker(int me)
{
int them = 1 - me;

while (1) {
while (turn != me) yield();

serial_puts(slogan[me]);
turn = them;
}
}


There are several problems with this:

• Each speaker process is always ready to run, and UART interrupts during the printing will frequently cause the operating system to invoke the scheduler, running each process frequently, even when it it futile to do so. This wastes a lot of time in needless context switches. A proper solution ought to avoid this by blocking (at a call to send() or receive()) those processes that are not ready to make progress.
• The solution forces the two participants to speak in strict alternation. This doesn't seem too bad if there are only two speakers, because they might be replying to each other (though always with the same slogan). But with three or more speakers, it seems wrong that the whole conversation could be held up by one participant who was slow to speak. A proper solution would avoid this by letting the participants speak in whatever order they wanted, while still ensuring that two participants didn't speak at once. This happens naturally in a solution where a coordinating process accepts requests to speak from the participants.
• Though it happens in my implementation of the operating system that yield() and the system of queues gives each process a chance to run frequently, it's best to think of yield() as a hint to the operating system that it might like to run other processes, rather than an instruction to do so that it must obey. It's more robust to write programs so that they work whatever scheduling decisions are taken by the operating system, provided it chooses one of the ready processes to run whenever it can do so, and eventually delivers each message that is sent.
• The solution depends on the shared variable turn. Though the updates to this variable are simple assignments turn = them that can reasonably be assumed to be atomic, shared variables are best avoided when message passing makes them unnecessary.

The upshot of considerations like these is that yield() is vary rarely used in practical programming. It is provided in micro:bian mostly as an example of a system call whose implementation is as simple as it can be.

## Digital logic

What does CMOS stand for?

CMOS logic is so named because it relies on Complementary pairs of transistors, each of them constructed of three layers: a Metal (or possibly polycrystalline silicon) gate, an insulating layer of silicon Oxide, and a Silicon channel between source and drain. Other logic families exist, for example the NMOS family where the p-types are replaced by pull-up resisitors. These chips are easier to make because they contain only n-type transistors (actually the resistor is implemented as a weak transistor too). But they consume more power, because when the output is low, current constantly flows through the resistor and transistor.

Can you really buy an integrated circuit containing a single and gate?

Yes, you can. It used to be that the cost of packaging meant that several gates came in one package, such as the classic 7400 'quad 2-input nand gate', but recent logic families include single gates in tiny five-pin packages (with two inputs, one output, plus power and ground. Here for example is the datasheet for a 74LVC1G08 single 2-input and gate.

Why does an inverter need two transistors?

Image

Both transistors are needed in an inverter to ensure that the output is always driven – either forced towards the positive rail when the output should be 1 (by the p-type), or forced towards the negative rail when it should be 0 (by the n-type). :We use p- and n-type transistors in this way (p-type in the upper half and n-type in the lower half) because that ensures the gate-to-source voltage is high enough in each case to drive the transistors into strong conduction, producing a good output signal.

Can we use n-types in the top part of a design, or p-types in the bottom part?

In (for example) a nor gate, both p-type transistors conduct a high voltage, and both n-types conduct a low voltage, so all provide a strong drive to the output. That picture generally holds if we stick to the rules. On the other hand, n-types used when their source is not at 0V will not be turned on fully, and will provide a weak drive, making the circuit slower and more vulnerable to noise.

Do logic chips need power and ground connections?

Small chips like the 7400HC contain a handful of gates, with one pin for each input or output, plus separate pins for power and ground. A trap for young players is that CMOS logic may still work most of the time if you forget to connect power and ground, because the nature of the chip implementation allows it to be powered through parasitic diodes between the inputs and the rails. These diodes are normally a good thing, because they soak up transients in the inputs.

Image

Could we implement an or function using diodes?

An older logic family, Diode-Transistor Logic (DTL) used diodes to implement a nand gate.

How sub-optimal can a circuit get if we design it according to the recipe used in the lecture?

Blindly implementing the Boolean function or – using the construction we used to prove adequacy of {and, or, not} – results in a ridiculously complicated circuit.

Image

But we can simplify that circuit, either using algebraic reasoning, or by recognising smaller terms, each covering more of the truth table, until it becomes a single or gate. There are traditional, hand methods (like Karnaugh maps) for finding optimal implementations of Boolean functions of a few variables. More importantly, there are software tools that automate the process. The worst function to implement using sum-or-products form is the multi-input xor or parity function. Actually, with two-to-the-two-to-the-n different Boolean functions on n variables, the majority of them will have no simple implementation.

How many essentially different state assignments are there for a four-state machine?

If we succumb to temptation and design a sequential circuit by laboriously listing the states and transitions between them, we must then number the states in binary before designing the output and next-state logic. With four states, we are going to need two bits to number them, and these become two flip-flops in the eventual circuit. It seems like there ought to be 4! = 24 ways of numbering the states, with the threat that some of these assignments will make the combinational logic dramatically more complicated than others. But in reality, several considerations make pairs of assignments essentially equivalent, so that we need consider only a few of them in a systematic search for a simple implementation.

For one thing, swapping the order of the two bits isn't going to affect the size of the combinational logic needed; this halves the number of assignments we must consider. What's more, if we negate one or both of the state bits, we can keep nearly the same combinational logic simply by inserting inverters at the inputs and outputs of the flip-flops whose sense we have changed. De Morgan's laws mean we can even simplify the resulting circuit to get rid of most of the inverters. This removes another factor of four from the number of assignments, leaving us with just three to consider. We may as well choose the encoding 00 for the initial state (convenient if we choose a design of flip-flop that adopts state 0 at power-on); then the assignment is essentially settled by selecting which other state will be encoded as 11, because it makes no essential difference which way the other two states are assigned 01 and 10.

$\displaystyle{ }$For a machine with $$2^n$$ states encoded in $$n$$ bits, we start with $$2^n!$$ assignments, then remove a factor of $$n!$$ for permutations of the state bits, and a factor of $$2^n$$ for negations, leaving us with $$(2^n-1)!/n!$$ assignments to consider. If we want to be exhaustive, $$n$$ had better be pretty small.

How can we figure out the number of states in a sequential circuit from a description of its behaviour?

We have to be prepared to be creative, and if insight does not give us any help, start to draw a state diagram that seems adequate to produce the behaviour that has been described. For the 'bathroom light switch', one we see that two states are not enough, we might come up with the four state machine shown in the diagram. This is consistent with all the behaviour shown in the timing diagram, so appears to be sufficient.

Timing diagram for bathroom light switch

Given such a state diagram, we can number its states in binary, then make tables showing (for a Moore machine) the next state in terms of the current state and inputs, and the output in terms of the current state. We can tell that a Moore machine is sufficient for the bathroom light switch because its output changes only at a clock edge. Here's the next-state function for the state diagram shown.

x y a x' y'
0 0 0 0 0
0 0 1 0 1
0 1 0 1 1
0 1 1 0 1
1 0 0 0 0
1 0 1 1 0
1 1 0 1 1
1 1 1 1 0

Because (sneakily) I've numbered the states in a way consistent with the design I already made, we can see in this table that x' = y' = y whenever a = 0; and x' = x and y' = ¬x whenever a = 1. The next-state logic can therefore be implemented as before with two muxes and an inverter. The output function is even simpler:

x y z
0 0 0
0 1 1
1 0 0
1 1 1

We just get z = y. Other ways of numbering the states will give different tables which may be easier or harder to encode in combinational logic.

What is the advantage of having a fully synchronous design?

Consider a circuit where the output of one register R1 feeds the input of another register R2 that has write-enable implemented by gating the clock.

Image

The and gate has the effect of the delaying the clock signal to R2 by a time equal to its propagation delay. After a clock edge, the output of R1 will be stable for a short time (called the contamination delay of R1) before becoming invalid, and we will have to hope that this contamination delay is long enough to compensate for the delay in the clock (the clock skew) seen by R2. Without the clock gating, this clock skew would not occur and so not be a worry.

Clock skew is inevitable in physically large circuits owing to the finite speed of light (1 foot per nanosecond in traditional units) and must be planned for; clock skew owing to gating of the clock just makes the problem a bit worse. A register with write-enable, considered as a circuit module, might implement the write-enable by gating the clock internally. In that case, the module would have a non-zero hold time (the time the input must remain stable after the clock event). The designer could trade hold time for propagation delay by introducing a delay element on the signal lines, but that would increase the overall propagation delay of the circuit and so limit the clock speed.

## Architectural elements

In the diagram of a ROM, what is the meaning of the vertical lines with blobs on them?

Each column determines a single bit of the ROM's output. It has an up-to-8-input or gate that is connected to those horizontal lines that correspond to locations (combinations of address bits) where the output should be a 1. So the intention is that a blob indicates a connection to the or gate, and the absence of a blob means no connection. (If there are no connections at all, then there is no need for an or gate, and the output could be wired to 0.

Image

In a CMOS implementation, we would make the decoder's outputs be active-low, then use nand gates for the outputs. Or better still, there's a design where each blob is a single transistor that pulls the vertical wire low, and the output stage is an inverter with a pull-up on its input. All we need to observe is that it's possible to build the circuit somehow.

In the twin-port register file, is it possible for all three registers rd1, rd2 and wr to be the same? What would be the use of that?

Register file

Yes, in the circuit, rd1 and rd2 independently select two registers for reading. The process continues to work perfectly if both select the same register. Also, wr independently selects a register to be written, and that can be the same as a register that is being read in the same clock cycle. The usefulness of this becomes apparent when we consider an instruction like adds r0, r0, r0, which adds together two copies of the value in r0 and writes the result back into r0. This instruction will work properly using the same mechanism (an adder external to the register file) if we arrange that the register file allows multiple, simultaneous references to the same register in the way we just discussed.

If the same register is read and written on the same clock cycle, should the value that is read be the value sitting in the register, or the new value that is provided as an input?

For the adds instruction to work, or any other arithmetic instruction whose result goes into the same register that one of its operands came from, we want the old value to be used. To see that this is in fact how it works, we need to look at the implementation of one of the registers. Here's a not-fully-synchronous design:

n-bit register

Here we see that the output is taken directly from the row of D-types, whereas the input is fed to them to become (if the clock is enabled) the value they will hold after the next clock edge. So it is the old value that is read from the register in the event that it is both read and written on the same clock cycle. Note: in some processor designs with pipelining, it is helpful to use a register file with the opposite behaviour, in order to arrange that a value being written by one instruction is available for use by the next. We won't be concerned with this.

In this circuit for a turbo-charged register file, what features of the ARM instruction set are supported by the various special elements?

ARM register file
Three registers can be read simultaneously.
An instruction like str r1, [r2, r3] stores the value from one register in a location whose address is computed as the sum of two other registers. (On a machine like the MIPS, the only addressing mode is equivalent to our str r1, [r2, #12], so only two registers are ever read by any instruction.)
The write-enable of PC is wired to 1.
The PC is updated in every instruction, either by being written with some calculated value (as in a branch instruction), or being set to the value nextpc = PC+2 that is computed by an adder external to the register file.
The +4 adder on the output of the PC.
The instruction set specifies that branches and PC-relative load instructions see the PC value as the address of the next-but-one instruction after the instruction being executed. This is probably an artefact of an early, pipelined implementation of the instruction set, but must nevertheless be faithfully implemented in a non-pipelined implementation like ours, if machine language programs are to run correctly.
The multiplexer on the PC input.
Some instructions (e.g. branches) write their result to the PC; others allow the PC to be incremented by 2 as a default. This multiplexer selects which of these possible values is written to the PC.
The multiplexer on the input to LR.
A subroutine call instruction like blx r3 simultaneously set the PC to the value obtained from R3, and set LR to the address of the next instruction, obtained from nextpc. This multiplexer allows the nextpc value to be routed to LR simultaneously with din being routed to the PC, and also allows LR to be written as a normal register.
The or gate on the write-enable input of LR.
This allows LR to be written both as a normal register (when the decoder sends it a write-enable signal) and in the blx instruction just described, when cLink is active.

## Processor architecture

Why are two opcodes 11010 and 11011 given for the conditional branch instruction in the notes?

My thought was that the decoding ROM would be addressed by the top five bits of the instruction, making each location correspond to half a row in table [A] of the 'rainbow' chart, which often corresponds to a single instruction. For half-rows that contain more than one instruction (or a reference to a different table), the thought is that further instruction bits would be decoded by an auxiliary ROM, and there would be a control signal from the main ROM say, "Please use the auxiliary ROM for this instruction." You can see a hint of this in the way decoding is implemented in the simulator. Alternatively, it's possible to design a ROM that decodes a variable number of address bits, in a way that is not so easily reflected in a software simulation.

Now, conditional branch instruction occupy a whole row in table [A], so that corresponds to two consecutive five-bit opcodes. If the top four bits of an instruction are 1101, then it's a conditional branch regardless of whether the next bit is 0 or 1. The next four bits (like bits 8–11 of any instruction) are fed to the conditional execution unit, controlling what Boolean function of the flag bits it computes; and bits 0–7 of the instruction determine the branch displacement.

Writing into a register: I'm still confused about the timing/clock mechanism of writing into a register after, say, an arithmetic instruction. Let's take the example of adds r0, r0, r1 or ldr r0, [r0, r1]. When the clock edge rises, what is immediately written into the register file?

For add r0, r1, r1, there's a combinational path from the outputs of r0 and r1 through various multiplexers, the shifter and the ALU to the input of r0. When the add instruction is executed, during the time between one clock edge and the next, these elements settle down to deliver the sum of the present values of r0 and r1 to the input of r0. Therefore, at the next clock edge, the value of r0 will change to this sum. For the load instruction, the combinational path involves the data memory (interface) too: we are assuming that the memory can be read within the clock cycle, as is common with SRAM. Then the ALU does the address calculation and the memory delivers the contents of the word to the register file, all within the same clock cycle: the result is latched into the target register at the next clock edge, completing the instruction.

linksel: For linksel, what determines the conditional input (ie. the arrow in between 1 and 0)? For regwrite it is the result from the Condx element, but what is it for linksel?

The bx reg and blx reg instructions differ in a single bit, which is not one of the bits decoded by the control ROM. That bit is used to drive the third input of the linksel mux. We can tell from the decoded bits that most instructions don't write nextpc to the link register, a few definitely do (e.g., bl), and bx/blx does so dependent on this additional instruction bit. This kind of thing is needed only because of the convoluted instruction encoding of Thumb.

Implementing bl: To generate a 22 bit offset, shouldn't the higher 11 bits of offset be shifted left by 11 places? Why 12?

The bl instruction takes a sign-extended 11 bit field from the first half of the instruction and an 11 bit field from the second half. Since the branch target must be a multiple of 2, like all instruction addresses, this displacement is multiplied by 2 before being used, so the top half is shifted by 12 bits and the bottom half by 1 bit. This is visible in the decoding of the bl1 and bl2 instructions in the simulator.

According to the ARM manual, PC-relative load instructions should round down the value of PC+4 to a multiple of four before adding the offset. Where is this implemented in your datapath design?

There's an ALU operation Adr (see simulator code, section 9) that adds together two inputs and truncates to a multiple of 4. That does the job when needed.

## Lab exercises

Something in the lab exercises isn't working. What should I do?

First of all, please come to the sessions! The lab exercises for this course are (deliberately and inevitably) a bit less cut-and-dried than those for other courses, and there is more scope to explore directions of your own. Today in the lab we had a team of two plundering my code for the Seven Stars of Death to make the LEDs flash in their own way. We were able to help them with that, and it was far more worthwhile than following my pre-conceived list of tasks.

Second, check for updates on the labs page. Some differences are inevitable between the environment in the Software Lab and what works on my machines, and we can't hope to pin down every last one in advance. And as you'll discover, it sometimes takes a lot of thought to work out a reason for the strange behaviours we often see. There's no operating system to intervene and stop your program with a helpful message; and even the seven stars of death only tell you that something is wrong.

When an update is published, there's a procedure for merging it into your working copy using Mercurial. We will help you with that in difficult cases, such as where you've already made one fix in the lab but you are merging in others.

I tried to open Geany by clicking on the file named lab1.geany, but I can't find it. What's wrong?

The file lab1.geany and others like it are not included in the software kit, but generated by running the script setup/install. If you forgot to run the script, then that explains why the file does not exist. You should go back to the installation instructions at the beginning of Lab zero and run the script, checking that it completes correctly.

Generally speaking, Geany project files are not suitable for checking in to version control, for two reasons: they contain absolute path information that depends on the root directory under which the code is stored, and they contain information (such as the names of recently used files) that is of no global significance and is changed whenever Geany exits. This means that the files are not the same for every developer, and they change frequently in insignificant ways. My solution is to distribute prototype files such as setup/lab1.geany in the setup directory, and to substitute the correct pathnames to create lab1-asm/lab1.geany etc. as part of the setup process. The prototype files are the same for every participant and do not change. The same story applies whether you are using Mercurial or Git as the version control system.

Geany project files have a plain, boring icon for me, instead of the magic lamp. How can I fix that?

There's some confusion about whether the default icon theme (Adwaita on my Debian machines) should inherit icons it doesn't know about from the Hicolor theme, where the icons for Geany and some other applications reside. The confusion can be avoided by simply choosing Hicolor itself as the theme; this can be done with the gnome-tweaks application, or if you are a command-line junkie with

$gsettings set org.gnome.desktop.interface icon-theme 'hicolor'  Whether the file has the fancy icon or not, double-clicking it will still open the project in Geany. You may prefer Mercurial, but I'm dedicated to Git. How can I stay true to myself? I get it: you have Jakub Narębski on speed-dial. But there is good news: in place of $ hg clone http://spivey.oriel.ox.ac.uk/hg/digisys


at the start of Lab 0, you can use the command

\$ git clone http://spivey.oriel.ox.ac.uk/git/digisys.git


instead, and hack away to your hearts content without compromising your principles. (This feature is brought to you by the superior flexibility of Mercurial; don't tell Jakub!)

I stop the echo program with the debugger, then try typing in the minicom window. Nothing happens; and even if I type cont to gdb, still nothing happens. Why?

The debugger can stop your program, but it can't stop other things from happening. In particular, if the UART (the interface connected to the serial port) receives a second character before your program has collected the first one, it signals an 'overrun error' and will do nothing more until the error is cleared. The echo program is written in the simplest way possible, assuming characters will arrive slowly enough for it to fetch each one before the next arrives, and it doesn't have provision for detecting or clearing such errors. When the program is stopped by the debugger, it hasn't a hope of keeping up with your typing. So once you've typed a second character in minicom while the program is stopped, all is lost until you reset the micro:bit. This is a perfect example to show that debuggers are of limited use in embedded systems development.

I want to see the Seven Stars of Death. How can I make them appear?

Ok, I suppose we'd better get this over with. Any unexpected interrupt or exception will lead to the seven stars, and you can provoke a HardFault exception with a bx instruction where the target address is even, not having the Thumb bit set. So the two instructions

movs r0, #0
bx r0


ought to do it. You may prefer to find more creative ways of your own. In fact, you probably will.

The detailed mechanism is as follows: the file startup.c contains code that lays out the interrupt vectors for the program, including the vector associated with the HardFault exception. Unless these vectors are redefined elsewhere, they all point to a routine spin that turns off interrupts, (re-)initialises the GPIO pins, and then uses them to display the seven stars, flashing them forever using delay loops. The code is made simpler by the happy fact that the seven LEDs needed are all in the same row of the 3 x 9 wiring matrix. The code is deliberately mashed together in one function (rather than, say, using a subroutine delay), so that it doesn't depend on anything else. Many undefined behaviours of the chip evoke a HardFault exception, and attempting to clear the T bit in the psr using a bx instruction is one of them. The machine saves some registers on the stack, fetches the HardFault vector from address 12, and starts executing the routine at that address, which immediately calls spin.

I keep getting the following message when I try to start minicom. What is happening?

minicom: cannot open /dev/ttyACM0: Device or resource busy


I found the second answer to this Stack Overflow question helpful: https://stackoverflow.com/questions/40951728, but wish they would do the decent thing and reinstate Monica.

Why are the C compiler and other tools called by names like arm-none-eabi-gcc and arm-none-eabi-objdump?

These long names are chosen to distinguish these cross-compiling tools from native tools like gcc, which on a Linux PC is a compiler that translates C into code for an Intel processor. The lengthy names make explicit that the compiler generates ARM code (arm) for a machine running no operating system (none), and uses a standard set of subroutine calling conventions called the Extended Application Binary Interface (eabi). This explicitness is needed, because a Linux machine might also have installed compilers for the Raspberry Pi (gcc-arm-linux-gnueabihf) and maybe AVR microcontrollers (gcc-avr) and the MIPS (mipsel-linux-gnu-gcc). As you can see, the naming conventions aren't quite consistent between GCC cross-compiler projects.

What configuration is needed to use the hardware random number generator?

See Chapter 21 of the hardware reference manual for the nRF51822. The header file hardware.h (in the latest revision of Lab 3) defines symbolic constants such as RNG_VALUE for the device register addresses. If you make a first attempt, we will help you to get it working in the lab sessions.

I can't get the TEMP example to work on my micro:bit. Can you help?

Here's a working example that prints the temperature twice a second: ex-temp.c. I've jammed the device driver and a client program into the same file for simplicity. You can test the program by holding a thumb on the Nordic chip or breathing on it, but don't panic if the reading seems high – it's increased by the power dissipation of the processor itself. Happy hacking!

What is the file microbian.a that is made in Lab 4?

The file microbian.a is a library archive, containing copies of various files of object code such as microbian.o and serial.o. The linker treats such libraries specially, linking in only those parts of them that are needed by the program it is linking. Thus, for example, micro:bian programs that don't need the timer will be linked without including the driver process for it. Making and using the library simplifies the Makefile while keeping the resulting programs small.

How can I restore the original demo program to my micro:bit?

Download and flash the hex file you can find on the official activities page as 'BBC micro:bit Original Code' under 'You'll need'.

What are all the numbers shown on an oscilloscope trace?

Here is a trace of a character being sent on a serial line.

RS232 waveform
• In the top left, you can see (STOP) that acquisition of the signal has been stopped, presumably so that I can save a screenshot; at other times, the same label shows WAIT (when the scope is waiting for a trigger event) or TRIG'D (when an acquisition is in progress).
• Next to it (100us/) we see that the horizontal scale is 100 microseconds per division. The clock period for the waveform can be seen from the trace to be a bit more than 100us, because the transitions fall a bit behind the grid of dotted divisions towards the right of the picture.
• At the top right, we see that the oscilloscope's trigger is detecting a falling edge on channel 1 and set at a level of 1.72V, midway between the high of 3.3V and the low of 0V.
• In the left margin, you can see the trigger level marked (T), as well as the 0V level for channel 1.
• At the bottom left, you can see that channel 1 is DC-coupled, and has a scale of 1.00V per division; the difference between high and low is therefore seen on the trace to be about 3.3V.
• On the black background just below the top margin, you can see at the left a T mark denoting the trigger point. I've adjusted this to appear towards the left of the display, so that we can see subsequent events, but as the outline at centre top shows, there's plenty more waveform captured in this single-shot mode before and after the trigger that we could see by scrolling.

Edit: here is a similar trace made with one of the new oscilloscopes in the lab.

RS232 waveform
• In the top left, the legend shows that Channel 1 has a scale of 1V/div, while Channel 2 is off.
• In the top right, the center of the display is delayed by 30.73ms from the trigger event (which is off the left of the screen). The horizontal scale is 200μs/div, and acquisition is stopped. The scope is triggered by a falling edge at a level of 1.6V.
• In the bottom center, Channel 1 (yellow) has an offset of +1.98V from the bottom of the screen. The channel is DC coupled, and the probe attenuates by a ratio of 10:1 (see below).
• In the right-hand column, from top to bottom:
• The scope is acquiring samples without any special processing (Normal) at a rate of 50 million samples per second. (Consequently, the narrowest, 100μs, pulses of the displayed waveform correspond to 5000 samples, giving plenty of data for zooming in.)
• Repeating the info from below, both channels are DC coupled with 10:1 probes.
• Waveform math is disabled (greyed out).
• Reference signals are disabled (greyed out).

It's normal to use oscilloscopes with probes that contain an integral voltage divider, so that the voltage seen by the scope is 1/10 of the voltage at the probe. This raises the effective input impedance of the scope, most vitally reducing by an order of magnitude the extent to which the circuit being measured is loaded by the capacitance and inductance of the probe cable. After digitising the input signal, the scope multiplies the readings by a factor of 10 to compensate, so that true voltages are shown on the display.

The driver program in Lab 1 prints the time in clock cycles taken by the func subroutine. How is this quantity measured?

Like most microcontrollers, the Nordic chip has several counter/timer circuits as part of its set of peripherals, some with 16 bit counters and some with 32 bits. In one mode, each of these can be connected to the 16 MHz clock signal to make an elapsed time counter. The counter/timers have many other features, including the possibility of generating a regular interrupt signal that we will exploit later. The driver program sets a 32-bit counter going before calling the func subroutine, and captures the value that the counter has reached when the subroutine returns, allowing cycle-accurate measurements of elapsed time up to $$2^{32}/16\,{\rm MHz}$$ or about four and a half minutes. While the subroutine is running, the counter is simultaneously being incremented independently of the processor core, so the processor runs at the same speed it would without the counter. The difference in counter readings is adjusted by the program to allow for the time taken to call the subroutine, so what is displayed is the time to run the subroutine body and return at the end, four cycles for the minimal one-instruction subroutine.

## Trivia and obiter dicta

Why use the ARM? Why not the x86 or x86_64 we already have in our laptops?

I wanted us to program a machine where all the code was accessible to our inspection. That's easy to achieve on a microcontroller, and as simple as it can be on an ARM-based machine. Yes, we could program x86 machines in assembly language, though we would have to face some unavoidable complexity that has accrued during the 30 or more year history of the architecture. But we wouldn't easily get to program the bare machine and understand everything that happens.

Why C? Why not C++? Or why not Python/Java/Scala/Haskell/Some other higher level language?

Quite a lot of embedded programming is done in C++ these days, and in fact the official software support for the micro:bit is in C++. This gives a number of advantages: peripheral devices can be initialised by declaring a static instance of the driver class: PinOut led(PIN3); and C++'s overloading feature subsequently lets you write led = 1; and led = 0; to turn on and off the LED, even if the actual code needed is something more complicated, possibly involving a function call. This is all very well for applications programming, but it really gets in the way of understanding what is happening at the machine level.

Also, for C++, as for other high level languages, we would need to know how the high-level constructs are implemented in order to see how a high level program corresponds to a sequence of machine instructions. For C, this is more straightforward, and we can use C as a quick, clean and portable way of writing what we could also write in assembly language. And in any case, nobody's employment prospects were ever hurt by saying that they knew how to program in C.

Why this micro:bian operating system, and not another, industry standard real-time operating system such as FreeRTOS?

FreeRTOS is OK, and provides a rag-bag of facilities for multi-tasking, including processes, semaphores of different kinds, message queues, timers, and a lot more. It is quite complicated, and everything seems to have lots of options that interact in subtle ways. For the course, I wanted to find out how far we could get with a much simpler scheme, where there are no nested interrupts, possibly no preemptive scheduling, and the system is structured into a family of processes communicating by messages. As you will discover in Concurrent Programming next year, programming with semaphores is a complicated business; and next year will be soon enough for that. And I can't resist saying that the coding style of FreeRTOS gives me the creeps.

Why the name micro:bian?

We ought properly to let the answers to such questions become lost in the mysts of thyme, but the truth is that the "micro:" comes from "micro:bit" and the "bian" from "Debian". Debian gets its name from the fact that its initiator, Ian Murdock, had at the time a girlfriend called Deb. Sweet. [The later part of his life and his eventual death are, however, marked by tragedy.]

Why bother with assembly language if we can write all our programs in C?

Several reasons: first, our aim here is to give a complete account, spanning all levels of abstraction, of the functioning of a small computer system. We're going to take for granted a bit the process of translating high level programs into machine code (more on that next year), but we do need to understand machine language as the interface between hardware and software. Second, when we come to implement an operating system, we will need to understand clearly what it means to save the state of one process and restore the state of another, and that can only be seen at the level of machine code. And third, even if writing assembly language is something you never need to do (especially for the ARM with Thumb code), it's a really useful skill to be able to read the assembly language output by a compiler to see what is happening. When you've used instrumentation to identify the speed-determining part of your big program and done your best to write it in high-level code that will run fast, it's good to be able to look at the compiler output and see if it is sensible.

Is it "twos' complement" or "two's complement" or "twos complement"?

It's "two's complement". To see why this is right, consider that the ones' complement of a number is that quantity that, when added to the number, results in a string of ones:

 00101011
11010100
----------
11111111


The two's complement is obtained by adding 1 to the ones complement: when this is added to the original number, the result is a single power of two:

 00101011
11010101
----------
100000000


So the two's complement of a number is that quantity that, added to the number, results in a power of two. You could make case for the reading "twos' complement" – the complement belonging to a string of 2s. (Cf. Knuth, Vol. 2, p. 203).

You keep saying the micro:bit has 16kB (kilobytes) of RAM. That means 16,000 bytes: isn't it true that there is actually 16KiB (kibibytes) of RAM, or 16,384 bytes?

Yes, that's right. You can collect your refund on the way out. Please note as you go that it's correct to put a thin space between a number and a following unit, as in 16 KiB, and that a thin space should also be used in place of a comma to separate thousands, as in 16 000 bytes, to avoid confusion with the use of a comma as a decimal marker in some countries. Glad we sorted that out.

What fonts do you use on your website?

On webfont-enabled browsers, running text is in Roboto Sans, and titles in Roboto Slab. Code samples are set in Roboto Mono, and the occasional scary word is set in Unifraktur Cook, just for fun.

Now that Scratch has subroutines (including recursion: https://scratch.mit.edu/projects/361167805) is it finally a valid teaching tool?

Scratch does factorials

I'm not mad keen. Particularly worrying in this example is the use of global variables for everything, and the apparent lack of built-in provision for subroutines that return a result. I'm not sure of the purpose of the statement set <n> to <<n> - <1>>, but perhaps that's not really needed; certainly the statement <set <result> to <<1>*<result>> could be removed.

By the way, I think recursion is overrated for general programming, and there are many purposes for which we could use a cleaned-up Fortran – a language supporting (unnested) subroutines with parameters and local variables but no recursion. The book Software Tools by Brian Kernighan and P. J. Plauger makes the argument for this quite eloquently.

We'll write a few recursive subroutines early in the course as a demonstration that we can, and that they're supported by the stack mechanism we discuss and by the addressing possibilities of our machine. But in later parts of the course, for example in writing the micro:bian operating system and programs based on it, there will be no need for recursive subroutines. And with only 16KB of memory to play with, we'd better be sure any recursion doesn't go too deep.

What software do you use to make the slides?

I use an old version of Keynote on the Mac. The crucial advantage over PowerPoint for me is that PDF figures can be pasted into a Keynote presentation with no loss of quality; PowerPoint cannot do this without converting the PDFs to bitmaps. The mathematical formulas that appear on some of the sides are prepared with TeX, and some of the circuit diagrams shown in Trinity Term are prepared with an online editor called Draw.IO and saved as PDF. At one stage, the flexibility of Keynote was dramatically reduced in order to make a version that would run on the iPad. I think that some of the deleted functionality has been restored, but I have just kept to the old version, and kept running an older version of Mac OS to support it.

Why is the BSS segment called BSS?

One answer is that it derives from IBM usage associated with the (vacuum tube based) computers numbered 704 and 709 via the later (transistorised) 7090 and 7094, and it stands for Block Starting with Symbol, from the convention that an assembly language symbol would be associated with the start of each block allocated in the segment. Another possible answer, believed by some unixites, is that it stands for Below Stack Storage, from the fact that the segment is allocated with addresses lower than the stack area in the memory map.

But why "Block Starting with Symbol"? What sense would "Block Ending with Symbol" make? Well, on those IBM machines, one addressing mode combined an index register with a static base address to form an address in memory, and the value in the index register was subtracted from the base address instead of being added to it. So in looping through a static array, it made sense to load an index register with the size of the array and decrement it until it reached zero; the elements of the array would then be found at decreasing offsets from the named end of the block of storage allocated for it.

What was the website or course someone mentioned in the lecture today?

It's Nand2Tetris.

Why do you say the yield mechanism of Python generators is broken?

Python distinguishes between ordinary subroutines and subroutines that are generators: the rule is that any subroutine that contains the keyword yield in its body is a generator, and those that don't have the yield keyword are ordinary subroutines.

Note that a generator can invoke yield only directly from its own body: if yield statements occur in subroutines that it calls, then they make those subroutines into generators themselves, and they yield values to the initial generator, not to its client. From a programming language point of view, this breaks the principle that any part of a program can be turned into a subroutine. For example (I'll use C syntax for clarity of comparison), consider the function

void search(void) {
for (int n = 2; ; n++) {
if (prime(n)) yield(n);
}
}


We can't turn the loop body into a subroutine test(n), because that subroutine would become a generator, and the main search() function would cease to be one. Related to this is the fact that in Python it's impossible to write a combinator concat(g1, g2) that invokes two generators g1 and g2 and concatenates the streams yielded by both of them, except by writing the body of concat in such a way that it collects the values yielded by each generator one at a time and yields them in turn to its caller.

From an implementation point of view, there is a reason for this: in Python, a suspended generator is stopped at a call of yield in its main body, and the suspended state is a single stack frame. Allowing yield in a subroutine would require a suspended generator to be represented by the full state of its stack, and that would be more difficult to implement. For processes in micro:bian, we need this extra complexity of implementation, because we will want to run processes like search, but with the call to yield replaced with a call of printf, with printf eventually calling a subroutine serial_putc that suspends the process by sending a message. In Python, a generator that yields successive characters in the formatted output could only be written by making each line of output into a string, and explicitly yielding the characters one at a time: something like

void search(void) {
char buf[128];

for (int n = 0; ; n++) {
if (prime(n)) {
sprintf(buf, "%d\n", n);
for (char *p = buf; *p != '\0'; p++) yield(*p);
}
}
}
`

This really doesn't respect the structure of the program we would like to write.