The story of a bug report (Compilers)
On 3rd Jun 2017, my frequent correspondent Christian Kleinert posted a bug report on the Bitbucket page of my Oberon compiler. This is the story of how I dealt with the bug.
Christian wrote the following text:
The attached files compiled with obc version 3.0.5 (obc HeapProc.mod TestHeapProc.mod) on macOS 10.12.5 produce a Segmentation Fault at the commented lines in HeapProc.mod. When storing the index for writing the array in a temporary variable it works.
and provided two files of Oberon code, one called
HeapProc.m and another called
TestHeapProc.m, that are listed at the end of this page.
The bug report is a good one (as I would expect from Christian) in that it mentions the compiler version and host platform, states the symptoms clearly, and comes with a fairly small example that provokes the bug.
My first action was to download the files, compile the program, and check that I could reproduce the problem. I did this with the latest revision of OBC 3.0, which I retrieved from version control, and I did it on a Linux machine, not on a Mac like Christian was using. If I had failed to provoke the bug with the latest revision, then I could have reverted to exactly the revision that Christian was using, and I could then have looked for the place in the revision history where the bug disappeared. If I had failed to reproduce the problem on a Linux machine, then I could have gone and found a Mac – but most problems are host-independent, and the fact that the bug happened on the Mac but not on Linux would have been a vital clue. As things stood, the bug did show itself on Linux and in the latest revision of the compiler. When runtime checks are turned on (as they are by default), no Oberon program should cause a segfault unless it uses the dangerous operations that are provided in the SYSTEM module, so this is definitely a bug.
The next step was to find out whether the bug would show up if I ran the program using the bytecode interpreter
obx in place of the JIT translator
obxj that is used by default on platforms (including Linux and the Mac) that support it. The bug disappeared, making it likely that the problem lies in the JIT translator, and not in the compiler that translates Oberon into bytecode, or its associated assembler/linker for bytecode.
Next, I began to reduce and simplify the program, with the goal of having the smallest possible program that provokes the fault. I started by combining the two supplied files into one, then began deleting bits of the program, checking after each change that the bug was still there. Using an editor with Undo is very convenient for this. This process helps to eliminate possible causes of the bug, and also reduces the volume of debugging output that has to be examined in the next stage. Is the procedural parameter p to Generate important, or does the bug persist without it? Do the print statements contribute to the bug, or is it still there when they are deleted? What about the 'open array parameter' iA?
After some trial and error, I ended up with the following code.
MODULE tKleinert; IMPORT Out; VAR A: POINTER TO ARRAY OF INTEGER; PROCEDURE Generate(iA: ARRAY OF INTEGER); VAR i, tmp: INTEGER; BEGIN NEW(A, 4); i := 2; tmp := i MOD 2; A := A[i]; tmp := i MOD 2; END Generate; VAR tstArr: ARRAY 4 OF INTEGER; BEGIN Generate(tstArr); Out.String("OK"); Out.Ln END tKleinert.
This test case has a conventional name (starting with a
t) for a compiler test case, and outputs
OK if it succeeds in completing without the segfault. I've eliminated the loop and the conditional from Generate and replaced the call the LEN with the value 4. What remains seems necessary to provoke the bug: the open array parameter (which Generate will copy into its stack frame) and the use of
MOD before and after the array assignment.
Bug reporters are notoriously bad at this reduction task, because they want to submit a program that would work and do something sensible if only the compiler didn't have the bug, but for my purposes I don't care what the program does, so long as it causes the segfault. Chrisitian had done a job well above average in providing a code sample that fitted on one or two pages, but as you see I was able to cut it down a lot further. In reality, some of the cutting down was overlapped with the next stage – once an anomaly had shown itself in the debugging output, that provided a guide to cutting down the input still further.
Now that the bug is provoked by a very small program (and with the foreknowledge that the problem is in the JIT translator), it's time to run the example with debugging output switched on, and see if anything unusual appears. I used the commands
$ obc0 tKleinert.m $ obxj -d -d -d a.out
to compile the program and invoke the JIT runtime system with quite a lot (
-d -d -d) of debugging output, showing the process of translating Keiko code first into code for an idealised register machine called Thunder, and then into i386 machine code. I started by looking at the translations of the assignment
A := A[i] and the assignment
A[c[i]] := A[i] that had been part of the original program, and not at first eliminated in reducing the example. They looked more-or-less OK, but then I noticed that a value displayed in one part of the debugging output as
V2(1001) had become
V2(1000) in another, and that gave a clue to the bug.
The two lines in question show the state of the JIT's register allocator before and after the first statement
tmp := i MOD 2: each register of interest is shown with a reference count and perhaps a cached value.
Regs: I1(0) = [LOADW -4(V3)] V2(1001) ... Regs: I0(0) = [LOADW -8(V3)] V2(1000)
The first line shows that register
I1 of the Thunder machine (which actually corresponds to
ECX on the i386) is known to contain the value of a local variable (at offset -4 from the base pointer
V3), but is not otherwise allocated (0); while register
V2 has a reference count of 1001 = 1000 + 1, being both allocated (1) – so that it cannot be re-allocated to another purpose – and locked (1000) – so that it cannot be spilled to memory. This state of
V2 reflects that in a procedure that has one or more open array parameters,
V2 is used as an extra base register to cope with the fact that the stack frame has a variable size. The value of 1001 is intended to make sure that
V2 is ignored by the register allocator for the whole procedure body.
After translation of the statement, the register state has changed. Now we see that register
I0 contains a cached value and
I1 no longer does; that's fair enough. But we also see that the reference count for
V2 has changed from 1001 to 1000, and that is the bug. This change allows
V2 to be used for other purposes, and in fact translation of the statement
A := A[i] creates sufficient demand for registers that it is used to hold a temporary value, and that means
V2 has been trashed before it is used in the second statement
tmp := i MOD 2, leading to a segfault.
Closer examination of the debugging trace shows that it is when the result of
MOD is moved from the stack frame into a register that the reference count is changed. There is a line in
jitval.c) that reads
r = load(LDW, ty, breg, base + v->v_val);
This line is responsible for moving a value from the stack frame into a register, and refers to the register that is
V2 in the example via the variable
breg. The subroutine
load decrements the reference count of this register if it has one; normally this causes no harm, because in the absence of a variable-sized stack frame,
breg is a register that is never used by the register allocator and has no reference count. In this case, however, it does matter; the right thing is to compensate for the decrement by first incrementing the reference count, changing the line to
r = load(LDW, ty, reserve(breg), base + v->v_val);
A similar change is needed a few lines below in code that moves double-precision floats in a similar way. That fixes the problem.
Fixing the bug requires just two lines of the JIT translator to be changed, but it is not a good idea to just fix it and forget about it. To ensure that subsequent changes to the compiler (or a misguided merge with another thread of development) doesn't reintroduce the problem, it's wise to add the test case
tKleinert.m to the compiler's corpus of test cases, run every time the compiler is changed. To this end, we add markers to the file so that the expected output (the string
OK) and the Keiko code generated for the test case are stored in the same file, in a form that the Oberon compiler will treat as comments.
[Worth including debugging prints, even in production code]
[Difficult to be sure that all interactions of features work properly. Here: open array parameters and arithmetic operations implemented as out-of-line subroutines.]
The submitted files
MODULE HeapProc; IMPORT Out; VAR A- : POINTER TO ARRAY OF INTEGER; PROCEDURE Generate*(iA : ARRAY OF INTEGER; p : PROCEDURE); VAR c : POINTER TO ARRAY OF INTEGER; i, n, tmp, ti : INTEGER; BEGIN n := LEN(iA); NEW(c, n); NEW(A, n); FOR i := 0 TO n - 1 DO A[i] := iA[i]; c[i] := 0; END; (* output(A) *) p(); i := 0; WHILE i < n DO IF c[i] < i THEN IF i MOD 2 = 0 THEN Out.String("->1"); Out.Ln; tmp := A; A := A[i]; A[i] := tmp; (* <-- Seg.Fault: 11, Works fine with ti := i and used here *) ELSE Out.String("->2a"); Out.Ln; tmp := A[c[i]]; Out.String("->2b"); Out.Ln; Out.String("i = "); Out.Int(i, 0); Out.Ln; Out.String("c[i] = "); Out.Int(c[i],0 ); Out.Ln; Out.String("A[c[i]] = "); Out.Int(A[c[i]], 0); Out.Ln; A[c[i]] := A[i]; (* <-- Seg.Fault: 11, Works fine with ti := c[i] and used here *) Out.String("->2c"); Out.Ln; A[i] := tmp; Out.String("->2d"); Out.Ln; END; (* output(A) *) Out.String("->3"); Out.Ln; p(); INC(c[i]); i := 0; ELSE Out.String("->4"); Out.Ln; c[i] := 0; INC(i); END; END; c := NIL; A := NIL; END Generate; BEGIN END HeapProc.
MODULE TestHeapProc; IMPORT HeapProc, Out; PROCEDURE PrintArr(); VAR i : INTEGER; BEGIN FOR i := 0 TO LEN(HeapProc.A^) - 1 DO Out.Int(HeapProc.A[i], 0); Out.String(" "); END; Out.Ln; END PrintArr; PROCEDURE AProc(); BEGIN; Out.String("AProc called: "); PrintArr(); Out.Ln; END AProc; PROCEDURE Test(); VAR tstArr : ARRAY 4 OF INTEGER; (* POINTER TO ARRAY 4 OF INTEGER; *) BEGIN (* NEW(tstArr, 4); *) tstArr := 1; tstArr := 2; tstArr := 3; tstArr := 4; HeapProc.Generate(tstArr, AProc); END Test; BEGIN Test; END TestHeapProc.