Lecture 16 – We're going on a bug hunt! (Digital Systems)

From Spivey's Corner
Jump to: navigation, search

The program lab4-phos/heart.c works if compiled on the lecturer's machine (running Debian Stretch with compiler A – arm-none-eabi-gcc at version 5.4.1) but not if compiled on the lab machines (running Fedora something with Compiler B – arm-none-eabi-gcc version 7.1.0). The program is supposed both to display a beating heart pattern and to print a list of primes. When compiled on the lab machines, it prints

Timer got bad message

on the serial port and shows the Seven Stars of Death. How can we fix this?

Examining the patient[edit]

The first impulse is to blame the compiler: the program works when compiled with one version of the compiler, and stops working when we switch to a different one. Clearly, the new compiler version is generating bad code for the timer process. This hypothesis is attractive because it means the bug is not our fault.

Looking at the timer process shows no obvious faults. Adding to the panic() call in the default: arm of the switch statement (Aren't we glad we included it?) shows that the values of m.m_type and m.m_sender are both nonsense, and not the values INTERRUPT = 1 and HARDWARE = -1 that we expected. What's more, a few experiments with disabling parts of it all seem to make the bug go away: if we delete the call to check_timers() from the process body, the program no longer crashes (though no timers will ever go off). Also, glancing at the object code shows that the compiler has inlined the local function check_timers() into the body of timer_task(). Telling it not to do that, by adding the annotation

 static void __attribute((noinline)) check_timers(void) { ...

to the heading of check_timer() also makes the problem disappear (I had to check the GCC manual for that). Now we want to believe that it really is the compiler's fault: there's surely a bug in the part of the compiler that does inlining. Not so fast! The compiler may be generating correct code, but naturally enough, code that only works if it finds itself in the proper environment. For example, code that accesses the stack frame will only work if the stack pointer has a sensible value.

Isolating the problem[edit]

XKCD #1722

If the compiler is at fault, we should be able to identify the problem in the machine code it outputs. We can ask the compiler to produce its output in assembly language, so that we can study it, and so that we can try assembling it separately. This is helpful, because at the moment Compiler A runs on one machine and Compiler B on another. Compiling to assembly language on Machine B, copying the resulting file timer.s to Machine A, and combining it with other part of the program compiled on Machine A results in a program that still fails. Using instead a file timer.s prepared on Machine A gives a program that does not fail. Good! The problem will be revealed by comparing the two versions of timer.s. If we cannot find the bug in B's version of timer.s, then the compiler is not at fault. Generating assembly language and then assembling it separately should reliably enable us to reproduce the problem, because in fact that is what the compiler itself does. The program called gcc (or arm-none-eabi-gcc) is just a wrapper (or "shell") that invokes a compiler-to-assembly-language and the assembler and the linker as separate programs.

One worthwhile approach at this point involves two printouts, a box of sharp pencils, and a cup of tea. Also good is a side-by-side listing of the two files generated with diff -y. Although the two compiler versions are numerically some distance apart, we are lucky to find that the object code is not drastically different in trivial ways. What differences can we see?

First, the boilerplate at the start of the file has been tidied up a bit. Since (as we know) both compilers are capable of generating code that works, so differences here should not be important.

        Compiler A                              Compiler B
        ==========                              ==========
	.syntax unified		      <
	.cpu cortex-m0				.cpu cortex-m0
	.fpu softvfp		      <
	.eabi_attribute 20, 1			.eabi_attribute 20, 1
	...					...
	.eabi_attribute 21, 1			.eabi_attribute 21, 1
	.eabi_attribute 34, 0			.eabi_attribute 34, 0
	.eabi_attribute 18, 4			.eabi_attribute 18, 4
	.thumb			      <
	.syntax unified		      <
	.file	"timer.c"			.file	"timer.c"
	.text					.text
.Ltext0:				.Ltext0:
	.cfi_sections	.debug_frame		.cfi_sections	.debug_frame
	.align	2		      |		.align	1
	.global	timer_task			.global	timer_task
				      >		.syntax unified
	.code	16				.code	16
	.thumb_func				.thumb_func
				      >		.fpu softvfp

Interestingly, the new compiler only aligns code on a 2-byte boundary, but that's not likely to be the problem.

Next, we see that Compiler B has learned a new trick for saving the high registers.

	mov	r7, r10		      |		mov	lr, r10
	mov	r6, r9		      |		mov	r7, r9
	mov	r5, r8		      |		mov	r6, r8
	push	{r5, r6, r7}	      |		push	{r6, r7, lr}

If the trick works sometimes, it very likely works all of the time. Actually, we can verify that minor things like this are not part of the problem, by slowly editing version A of the program to be more like version B, and testing incrementally whether the bug has appeared. I like to work under a version control system, so that I can always see what I've changed and rewind to the start, take checkpoints from time to time by copying files, and work in an editor with plenty of Undo depth so I can recover from minor mistakes. And indeed, we find that this way of saving registers is not the problem.

What's next? We see that the two compilers have invented different labels for the literal table that comes at the end of the function. The lines saying .loc 1 65 0, etc., are line number information for the debugger, but they help us find the source lines that are being implemented here (they are setting up the hardware registers of the timer), and they provide a useful synchronisation point for the side-by-side listing.

	.loc 1 65 0				.loc 1 65 0
	movs	r3, #1				movs	r3, #1
	ldr	r2, .L20	      |		ldr	r2, .L22
	str	r3, [r2]			str	r3, [r2]
	.loc 1 66 0				.loc 1 66 0
	movs	r2, #0				movs	r2, #0
	ldr	r1, .L20+4	      |		ldr	r1, .L22+4
	str	r2, [r1]			str	r2, [r1]
	.loc 1 67 0				.loc 1 67 0
	ldr	r1, .L20+8	      |		ldr	r1, .L22+8
	str	r2, [r1]			str	r2, [r1]
	.loc 1 68 0				.loc 1 68 0
				      >		ldr	r2, .L22+12
	movs	r1, #4				movs	r1, #4
	ldr	r2, .L20+12	      <
	str	r1, [r2]			str	r1, [r2]

You can see that the compilers have generated equivalent code, with a minor difference in the order of operations. No crucial differences yet.

The next part of the comparison seems to show code generated by Compiler B that has no counterpart in the code from Compiler A.

	.loc 1 27 0				.loc 1 27 0
	movs	r7, r3		      |		movs	r5, r3
.L2:				      |		b	.L2
				      >	.L4:
.LBE7:					.LBE7:
.LBE6:					.LBE6:
				      >		.loc 1 83 0
				      >		ldr	r3, .L22+44
				      >		ldr	r3, [r3]
				      >		cmp	r3, #0
				      >		bne	.L20
				      >	.L6:
				      >		.loc 1 88 0
				      >		movs	r0, #9
				      >		bl	clear_pending
				      >	.LVL4:
				      >		.loc 1 89 0
				      >		movs	r0, #9
				      >		bl	enable_irq
				      >	.LVL5:
				      >	.L2:
	.loc 1 79 0				.loc 1 79 0

Closer inspection reveals, however, that similar code does appear on both sides, but in a different order. Compiler B is just a bit more enthusiastic than Compiler A in re-ordering the program so that the tangle of jumps and labels is simplified a bit. What is easy to miss is that in the first couple of instructions, Compiler B is using register r7 when Compiler A has used r5. Since registers are, from the hardware point of view, identical in function, and registers r4--r7 all have the same usage rules from a software point of view, this difference ought not to matter.

Similar differences in register usage occur in the next part of the code.

	.loc 1 79 0				.loc 1 79 0
	add	r5, sp, #16	      |		add	r4, sp, #16
	movs	r1, r5		      |		movs	r1, r4
	ldr	r0, .L20+44	      |		ldr	r0, .L22+48
	bl	receive				bl	receive
.LVL4:				      |	.LVL6:
	.loc 1 81 0				.loc 1 81 0
	ldrh	r3, [r5]	      |		ldrh	r3, [r4]
	cmp	r3, #1				cmp	r3, #1
	beq	.L4				beq	.L4
	cmp	r3, #3				cmp	r3, #3
	beq	.L5				beq	.L5

What we see here is the code that calls receive(ANY, &m), then enters a switch statement based on the message type m.m_type. The arguments of receive() are the value ANY = 999 in r0 and the address of m in r1. Since m lives in the stack frame, its address is computed by adding an offset to the stack pointer, and to save repeating that calculation in order to access m.m_type later, the address is saved in a register across the call the receive() – register r5 in one program, and register r4 in the other. You can see after the call, the code retrieves a two-byte value from the message (with ldrh), then tests whether it is equal to 1 (INTERRUPT) or 3 (REGISTER).

Closing the gap[edit]

Apart from trivial points (which we can edit away if they are worrying), the main difference between the two programs is in the use of registers. What if some but not all registers were being trashed at some point, and program B saved a crucial value in a register that was trashed, whereas program A used a register that was not trashed at the same point? We can experiment with this in two ways.

First, we can change program A to use the same registers as program B. In fact, we can make a table that shows the correspondence between registers used by program A and those used by program B.

Program A     Program B
  r5            r4
  r6            r7
  r4            r6
  r7            r5

(the correspondence extends to registers r8, r9, ...) You could systematically edit the code with this substitution, but cyclic permutations of renaming are always a bore to do in a text editor. A better idea is to replace actual registers with symbolic names, so r4 becomes rr4 and r5 becomes rr5, etc., throughout all the code (except for the boilerplate at the start of the function). Then we can write some macro definitions:

#define rr4 r6
#define rr5 r4
#define rr6 r7
#define rr7 r5

and adjust the register assignment at will, using the C preprocessor to make the substitutions for us. We quickly find that this reveals the problem: permuting registers r4 to r7 to match Program B causes the bug to appear; and in fact just swapping registers r4 and r5 is enough.

Let's look again at the code surrounding receive() – I'm suspicious about that, because register values are preserved across operating system calls by our context switch code, not by the compiler's normal prologue and epilogue boilerplate. The process also contains a call to send, but I don't think that's executed before the crash; there's also a call to connect, but that seems (superficially at least) to complete successfully.

	.loc 1 79 0				.loc 1 79 0
	add	r5, sp, #16	      |		add	r4, sp, #16
	movs	r1, r5		      |		movs	r1, r4
	ldr	r0, .L20+44	      |		ldr	r0, .L22+48
	bl	receive				bl	receive
.LVL4:				      |	.LVL6:
	.loc 1 81 0				.loc 1 81 0
	ldrh	r3, [r5]	      |		ldrh	r3, [r4]
	cmp	r3, #1				cmp	r3, #1
	beq	.L4				beq	.L4
	cmp	r3, #3				cmp	r3, #3
	beq	.L5				beq	.L5

It's vital to the correctness of this code that r5 (or r4) is preserved across the call, but in fact the value preserved is easily recomputed after it. Let's try replicating the add instruction after the call (and of course trying this both using r5 and using r4).

	.loc 1 79 0
	add	r5, sp, #16
	movs	r1, r5
	ldr	r0, .L20+44
	bl	receive	
.LVL4:
	.loc 1 81 0
	add	r5, sp, #16  @@ Replicated instruction
	ldrh	r3, [r5]
	cmp	r3, #1
	beq	.L4
	cmp	r3, #3
	beq	.L5

Bingo! The problem disappears, even if we are using r4 in place of r5.

Delving into details[edit]

We haven't used a debugger yet. Why not? Because we have had no hypothesis to test! The debugger would let us look (in a high level way) at the values of variables in the timer_task() function, but we would just see that the values are indeed nonsense. Debugging at the machine level would let us see great screeds of numbers, like the green-on-black screens of The Matrix. But that really wouldn't help. But now we have a hypothesis:

Register r4 is not preserved across calls to receive().

Or, refining that a bit,

Register r4 is not preserved across calls to receive(), when the call is completed by the arrival of an interrupt.

Now is the time to get out the debugger.

  1. Edit the program so the primes and UART part is disabled.
  2. Set a breakpoint at default_handler in mpx-m0.s.
  3. Let her rip!

The breakpoint is soon reached. We can single-step instruction by instruction (stepi), or we can step over subroutine calls (nexti).

default_handler:
    push {lr}
    bl isave
    mrs r1, ipsr            @ Get exception reason
    subs r1, #16            @ Convert to range [-16 .. 32)
    bl intr_handler
    bl irestore
    pop {pc}

There's a bit of code there that fetches a view of the psr containing the interrupt number, then subtracts 16 so that internal interrupts occupy the range [-16..0) and external ones are in the range [0..32). The value we see after this is 9, and that's correct for timer 1. We can follow the code along into the operating system, which calls make_ready first on device driver process for the timer, then on the the current process. Then there's a call to choose_proc that picks the device driver to run next, and we're back in default_handler again. We can watch as irestore restores the context of timer_task, including a value for r4 that is plausible as an address in its stack frame. Then default_handler returns ... to something called wrap_default. The disassembled code looks like this:

000011f4 <wrap_default>:
    11f4:	b510      	push	{r4, lr}
    11f6:	f7ff fe9e 	bl	f36 <default_handler>
    11fa:	bd10      	pop	{r4, pc}

This is a trivial function from startup.c, included so that interrupt vectors can be redirected to a default handler configured at link time. And look, it saves register r4 on entry and restores it on exit. Normally, this is benign, and the compiler has done it so that the number of registers saved is even, because of requirements of the calling convention that are not really relevant to the Cortex-M0. But for us it is fatal, because the r4 value being restored here is the one that was present when the interrupt happened, not the one from before the call to receive() that was just painstakingly restored by irestore(). There's the bug.

The fix[edit]

Now that the problem is properly understood, it is not hard to fix. The mechanism surrounding wrap_default works well enough for programs that make ordinary use of interrupts, but it won't do for the bottom layer of our operating system, where an interrupt can save context for one process, then in returning restore the context of another one. For them, it's necessary for the interrupt vector to lead directly to the Phōs interrupt handler, without any intervening saving and restoring of registers. It's a simple matter to make a vector table where all the interrupts (or almost all) point to the same handler directly.

Some closing reflections: the temptation to suspect a compiler bug was a strong one, but unjustified. Yielding to the temptation might just have wasted some time, but it might also have led to a poor decision to disable the inlining and leave it at that. This program would have worked, but the latent bug would have remained and might re-emerge in the future. In a concurrent program, the flow of control can be unpredictable, depending on timing. In Phōs, even if a process is not usually awakened by an interrupt, that can happen on rare occasions, and a bug like the one we've just fixed could then cause a fault that would be very hard to diagnose.

Lecture 17

A register sp that holds the address of the most recent occupied word of the subroutine stack. On ARM, as on most recent processors, the subroutine stack grows downwards, so that the sp holds the lowest address of any occupied work on the stack.

A symbolic representation of the machine code for a program.

(Universal Asynchronous Receiver/Transmitter). A peripheral interface that is able to send and receive characters serially, commonly used in the past for communication between a computer and an attached terminal. It is commonly used in duplex mode, with the transmitter of one device connected to the receiver of the other with one wire, and the receiver of the one connected to transmitter of the other with a different wire. The asynchronous part of the name refers to the fact that the transmitter and receiver on each wire do not share a common clock, but rely instead on the signalling protocol and precise timing to achieve synchronisation.

(A near-synonym for ABI). The convention that determines where arguments for a subroutine are to be found, and where the result is returned.