Frequently asked questions (Compilers)
Some questions from previous years have moved to the FAQ archive.
What prerequisites are needed to follow the course?
There are three courses taken by our undergraduates that provide useful background for this one. The first is Functional programming, which is important because we will be writing compilers not as purely functional programs, but certainly as programs that use recursion over algebraic data types, and sometimes higher-order functions. I will give a brief introduction to the OCaml language we will be using in the course, but you will need to be familiar with the ideas in advance.
The second relevant course is Models of computation, where our students learn about regular expressions, finite state automata and context-free grammars. I will aim to make the Compilers course self-contained as regards these topics, but will not repeat material from the other course unless it is directly needed in this course.
The third course is Digital systems, where students will already have met simple algorithms for compiling arithmetic expressions to machine code, and also the general organisation of computer hardware and systems software. Though this course is largely self-contained, that material provides useful background.
Will we be writing compilers in Oberon?
No, we'll use OCaml as our implementation language.
Will we be writing compilers for Oberon?
Well, sort of. We'll work towards a compiler that implements a simple, Pascal-like language that looks a bit like Oberon. It's natural to do so, because Pascal-like languages are easy to implement with machine-level operations, and lack the irregularities that can make a nightmare of compiling, say, C. Many of the techniques we learn can be applied to implementing other languages. And rest assured, the keywords will be in lower case!
The O in OCaml stands for Objective, doesn't it? Will we be using object-oriented ideas in the course? And why won't we be using Java?
Yes, OCaml includes object-oriented features of a slightly strange kind, but we won't be using them. Generally speaking, the facilities for defining data types in object-oriented languages are heavy and clumsy compared with functional programming; for example, what can be achieved by an algebraic type definition in half a dozen lines in Haskell or OCaml might take several pages in Java. And a function that is defined in a few lines using pattern matching and recursion might become a tedious application of the 'visitor pattern' that involves several classes spread over many pages.
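To make the comparison concrete, here is a minimal sketch (the type and function are invented for illustration, not taken from the course materials) of the kind of definition that takes half a dozen lines in OCaml but pages of visitor classes in Java:

```ocaml
(* An algebraic type for little arithmetic expressions, together with
   an evaluator defined by pattern matching and recursion.  In Java,
   the type would become an interface plus one class per constructor,
   and the evaluator a visitor spread across all of them. *)
type expr =
  | Num of int
  | Add of expr * expr
  | Mul of expr * expr

let rec eval e =
  match e with
  | Num n -> n
  | Add (e1, e2) -> eval e1 + eval e2
  | Mul (e1, e2) -> eval e1 * eval e2
```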
Will we actually be generating native code?
There will be two parts to the course. At first, we will target a stack machine, Keiko, that is like the Java Virtual Machine (JVM) but with instructions at a slightly lower level. Like the JVM, the Keiko machine has instructions that mostly consist of a single byte each, so the code is called bytecode. I will supply two different implementations of the Keiko machine:
- an interpreter for the bytecode that has options to print each instruction as it is executed.
- a JIT translator that compiles the bytecode into native code for the x86 just before it is executed for the first time.
The bytecode interpreter is sufficient for debugging the compilers we write, but it's nice to have the JIT implementation too.
So what about the second part of the course?
Glad you asked. In the second part of the course, the Keiko machine is used again as an intermediate code, that is, as the interface between the front end of the compiler, where source programs are translated into low-level operations, and the back end, where we will implement these low-level operations using machine instructions. The machine will be almost exactly a sanitized subset of the x86. I say 'subset' because there are many features of the x86 that we will ignore, as most compilers do. I say 'almost' because the x86 has a number of annoying restrictions that we'll just ignore. And I say 'sanitized' because we'll use a cleaned-up syntax for the assembly language.
A good example of the cleaning up we will do is that it's possible to access bits 0 to 7 of the EAX register of the x86 as an 8-bit quantity, as in the instruction mov al, x that stores these 8 bits in the location x. It's also possible to access bits 8 to 15 of the same register using the name ah instead of al, something we won't use, though just occasionally it would be possible to generate faster code by exploiting this feature. But worse than this, it isn't possible to access the low-order 8 bits of some other registers, such as ESI, that are otherwise equivalent to EAX. We will just ignore this problem.
If you ignore restrictions like that, doesn't it make it pure luck whether your object code will work?
If we simply ignored them, then yes. But what we'll do is use a special assembler that plasters over the cracks. In the language of this assembler, you can write

stb %0, x

to store the low-order byte of register %0 (equivalent to EAX), and you can write

stb %5, x

to do the same thing taking the value from register %5 (equivalent to ESI) instead. The assembler inserts some extra code to get round the restrictions in the processor, but we won't need to be concerned with that.
Actually, the special assembler is made up of a translation pass in front of the ordinary assembler for the x86, and that makes it possible to see the actual x86 code if you really want.
Why do you use Mercurial for the labs? Why not Git? After all, Linus says that Git is better.
Mercurial and Git share the facility to make convenient copies of an entire repository and to synchronise changes between multiple copies of the same repository; that's what the word distributed means in the term distributed version control. (Actually, none of these products provides a truly distributed system, in that the people using them give explicit instructions to move data around the network; a distributed implementation would provide the abstraction of a single, unified history of development without relying on a central server.) Both products have the same basic data structure: a rooted, directed acyclic graph of revisions with operations to merge any two revisions by reference to their most recent shared ancestor.
Though terminology differs a bit between them, Mercurial provides everything that we need for the course, and the same operations can be simulated in Git, but with a few more unexplained switches passed to the various commands, and a bit more of the works showing. There seems to be a consensus that Mercurial is a bit easier to learn, so I've decided to go with it. Everything we do with Mercurial will be applicable to any other version control system, distributed or not.
I have used SCCS, RCS, CVS and Subversion in my time, and am quite happy with Subversion for my own work; ironically, as the internet has improved, the need for a distributed system has become less pressing, and I can reach my central server from most places I visit; since the central server is regularly and reliably backed up, it gives added protection from losing work.
Arising from lectures
How can we prove whether a grammar is LR(0) or SLR or LALR or LR(1)?
A grammar is LR(0) if each state of the LR(0) automaton that permits a reduction (i.e. has an item with the dot at the right) contains only that reduction – so that neither a shift nor any other reduction is possible.
A grammar is SLR if, in each state that permits one or more reductions, the decision whether to shift or reduce (or the decision which reduction to perform) can be made on the basis of FOLLOW sets. That is, if a state contains the item [S → α .] then we compare FOLLOW(S) with the set of tokens that can be shifted in the state (and also with the FOLLOW set of other possible reductions). If these sets are disjoint, then the grammar is SLR.
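The disjointness test itself is simple set arithmetic. As a sketch (the representation of tokens and the function names are invented for illustration), the check for a single state might look like this:

```ocaml
(* SLR condition for one state: the set of shiftable tokens and the
   FOLLOW set of each reducible nonterminal must be pairwise disjoint.
   Tokens are represented as strings purely for illustration. *)
module S = Set.Make (String)

let disjoint a b = S.is_empty (S.inter a b)

(* shifts: tokens with an outgoing shift transition in this state.
   follows: FOLLOW(A) for each item [A -> alpha .] in the state. *)
let slr_consistent shifts follows =
  List.for_all (fun f -> disjoint shifts f) follows
  && (let rec pairwise = function
        | [] -> true
        | f :: rest -> List.for_all (disjoint f) rest && pairwise rest
      in
      pairwise follows)
```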
A grammar is LALR if a similar condition holds, based not on FOLLOW sets but on lookahead sets, which are always subsets of the FOLLOW sets. I'm not telling you how to compute lookahead sets, and I won't be asking you to classify grammars as LALR or not, unless they are either SLR or not LR(1).
A grammar is LR(1) if it has a correct LR(1) parsing table. A grammar is not LR(1) if it is ambiguous. And a grammar also fails to be LR(1) if there is any viable prefix and lookahead token where different completions of the sentence require different immediate actions. To prove that a grammar is not LR(1), find two sentences that share a common prefix, but require different actions at the end of that prefix, despite the stack contents and the lookahead token being the same. For example, if you can say, "When you see an identifier, you can't tell whether it's a continuation of this expression or the start of a new one," then that's a sign the grammar is not LR(1).
Are conditions evaluated with the short-circuit method even if they appear in parentheses?
Yes: we can be sure of that, because parentheses are used by the parser to determine the shape of the abstract syntax tree for an expression, but don't appear explicitly in the tree itself. So adding superfluous parentheses to an expression does not affect the abstract syntax tree that is built, and thus cannot affect the code our compiler generates for it.
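A tiny sketch makes the point (the AST type here is invented for illustration, not the one from the course): the parser builds identical trees for the parenthesised and unparenthesised forms, so everything downstream, including short-circuit code generation, treats them identically.

```ocaml
(* Parentheses guide the parser but leave no trace in the tree. *)
type expr =
  | Var of string
  | And of expr * expr

(* What a parser would build for "(x and y)" ... *)
let with_parens = And (Var "x", Var "y")

(* ... and for "x and y": the very same value. *)
let without_parens = And (Var "x", Var "y")
```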
In calling a procedure, the static link is pushed on the stack before the CALL instruction. Where does the dynamic link come from?
On the Keiko machine, the instructions before the call should push on the stack the following information:
- The arguments, pushed from right to left so that the first argument is nearest the top of the stack (which grows downwards in memory) and therefore at the smallest address.
- The static link (SL).
- The address of the procedure (CP).
The CALL instruction refers to the procedure address but leaves all the above information on the stack, and adds:
- The return address (RA): this is the address of the next instruction after the CALL instruction, and execution will resume there when the procedure returns.
- The dynamic link (DL): this is the present value of the bp register, which is then reset to the value of the stack pointer.
- Local storage for the procedure. The size of the local storage is specified in the PROC directive that begins the procedure.
The frame head is made up of SL, CP, RA, DL; two of these are contributed by the caller, and two are added by the action of the CALL instruction.
When the procedure returns, bp is reset from the dynamic link, pc is reset from the return address, and the stack pointer sp is reset in such a way that the n parameters specified in the CALL instruction have disappeared from the stack.
Other languages implemented on Keiko may not use static links, or may pass them in another way. So the static link is treated by the machine as if it were just an extra parameter of the procedure, and it's just our convention to call it part of the frame head. That's reflected in the fact that the parameter count in the CALL instruction is one greater than the number of parameter words.
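The calling sequence described above can be sketched symbolically. This is not the real Keiko instruction set: the constructors and operand strings below are invented stand-ins, and only the shape of the sequence matters.

```ocaml
(* Symbolic sketch of the code a caller emits for p(a, b): arguments
   pushed right to left, then the static link, then the procedure
   address, then CALL with a count one greater than the number of
   parameter words, because the static link is passed like a parameter. *)
type instr =
  | PUSH of string   (* push a word; the operand is shown symbolically *)
  | CALL of int      (* n = number of parameter words + 1 *)

let call_sequence args =
  List.rev_map (fun a -> PUSH a) args    (* right to left *)
  @ [ PUSH "static link"; PUSH "address of p"; CALL (List.length args + 1) ]
```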
Is the procedure address used on Keiko to call a procedure actually the address of the procedure's bytecode?
No: each procedure is described by a little table that contains a pointer to the bytecode as well as the size of the local variables, a pointer map of the frame, and the values of any large constants that are used by the procedure. The procedure address is the address of this table, and the Keiko machine's cp register points to the table throughout execution of the procedure. But all that is transparent to compilers generating code for the machine. The PROC directive in the Keiko assembly language looks after all the details.
Keiko comes with an optional garbage collector that requires the compiler to describe how the storage used by the program is laid out, and specifically what locations may contain pointers; that is the purpose of the pointer map that is part of each procedure. In this course, we're not using the garbage collector, and any storage allocated by new in the program is never recycled.
The two zeroes in a directive like

PROC _Fac 8 0 0

are, one of them, a non-existent pointer map, and the other an estimate of the amount of stack space needed by the procedure. That estimate is currently not used by the virtual machine.
Type metrics consist of a size and an alignment; what is the alignment for?
Typically, byte-addressed machines have restrictions about what addresses can be used for quantities that are larger than a byte. For example, there may be a rule that a 4-byte integer must be aligned so that its address is a multiple of 4. That would mean that an array of bytes could begin at any address, but an array of integers would have to be given an address that was a multiple of 4. So the type 'array 10 of integer' has size 40 and alignment 4, but the type 'array 40 of char' has the same size but alignment 1. When the compiler lays out local storage for a subroutine or lays out a record type, it must leave gaps in order to satisfy these alignment restrictions, so that a character followed by an integer would leave a gap of 3. Sometimes space could be saved by shuffling the fields so that they appear e.g. in decreasing order of alignment, but that doesn't really matter much.
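The layout computation described here is easy to write down. Below is a small sketch (the representation of metrics as (size, alignment) pairs is an invented convention for illustration): each field's offset is rounded up to a multiple of its alignment, and the total size is rounded up to a multiple of the largest alignment.

```ocaml
(* Round n up to the next multiple of a. *)
let round_up n a = (n + a - 1) / a * a

(* fields: (size, alignment) pairs in declaration order.  Returns the
   list of field offsets together with the overall (size, alignment)
   of the record, padding the end so that arrays of the record stay
   aligned. *)
let layout fields =
  let offs, last, align =
    List.fold_left
      (fun (offs, off, al) (sz, a) ->
         let o = round_up off a in        (* gap inserted here if needed *)
         (o :: offs, o + sz, max al a))
      ([], 0, 1) fields
  in
  (List.rev offs, (round_up last align, align))
```

For instance, a 1-byte character followed by a 4-byte integer gets offsets 0 and 4, a gap of 3 bytes, and overall metrics (8, 4), matching the example in the answer above.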
Recent x86 machines support unaligned loads and stores (of, say, a 4-byte quantity at an address that is not a multiple of 4) for backwards compatibility, but it's likely that they do so inefficiently, perhaps turning a load into two loads of adjacent words followed by some shifting and bitwise or-ing. So it's desirable to obey some alignment restrictions even when the architecture doesn't insist on them.
What's the difference between a RISC and a CISC machine?
For us, the vital difference is that on a RISC, transfers to and from memory are done by separate instructions from arithmetic. This means that a statement like x := x+1, where x is a local variable in the frame (say at offset 24), is translated into the sequence

ldw r1, 24(bp)
add r1, r1, 1
stw r1, 24(bp)

with a load and store separate from the add instruction. RISC architectures tend to have arithmetic instructions that can put their result in a different register from the operands, and they tend to have a large set of uniform registers. They tend to have a single, indexed addressing mode that adds together a register and a constant, with a pure constant and simple indirection through a register as special cases. Instruction selection for a RISC machine largely consists of ensuring that address arithmetic is, where possible, done using the indexed addressing mode rather than by a separate instruction.
On a CISC machine, arithmetic and memory transfers can be combined in the same instruction, so the statement x := x+1 might be translated as

add 24(bp), 1

a complex instruction that would have to be executed in several steps. Instruction selection for such machines is more difficult, and if expressed via tree grammars typically requires non-trivial dynamic programming, since the selection of tiles for different parts of the tree can interact in complicated ways. Question 6 on problem sheet 4 invites you to consider this by investigating a machine with a memory-to-memory move instruction.
What's the difference between the PLUS and PLUSA instructions on Keiko?
The PLUS instruction pops two integers from the stack, adds them together, and pushes the result. The PLUSA instruction expects first an address and then an integer offset to be on the stack (with the integer at the top) and also adds them together. For the Keiko bytecode interpreter, there's no difference in behaviour, although the sanity checks on Keiko code that are incorporated in the lab materials do enforce the distinction. It's a distinction that does become important when we translate the Keiko code into native code for a register machine, because occurrences of PLUSA signal opportunities to use the indexed addressing mode of the target machine for address calculations. Depending on the target, we might not wish to use the addressing hardware for other additions: for example, it might not behave the same way with respect to arithmetic overflow.
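As a sketch of why the back end cares (the tree type here is invented for illustration, not the course's actual intermediate representation): an address addition with a constant operand can be folded into the indexed addressing mode at no cost, while a plain PLUS always costs an add instruction.

```ocaml
(* A toy operand tree: Offset corresponds to PLUSA (address
   arithmetic), Plus to PLUS (ordinary arithmetic). *)
type tree =
  | Reg of string
  | Const of int
  | Plus of tree * tree
  | Offset of tree * tree

(* Count the extra add instructions needed: an Offset of a register
   and a constant is absorbed into the indexed mode reg+const and
   costs nothing; every other addition needs its own instruction. *)
let rec cost = function
  | Reg _ | Const _ -> 0
  | Offset (Reg _, Const _) -> 0
  | Plus (a, b) | Offset (a, b) -> 1 + cost a + cost b
```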
My compiler gives the error message Stack overflow when it runs. What do I do now?
Notice that this message comes from your compiler when it is translating the test program, not from the object program when you run it on the Keiko machine. Your compiler has entered an infinite recursion, and you can get a stack backtrace using the following unix command:

$ env OCAMLRUNPARAM=-b ./ppc test.p

(where test.p is the test program you were trying to run). This works provided your compiler was built using the -g flag to ocamlc, introduced in the 2014 edition of the lab materials. The most likely culprit is a recursive function you have just introduced into the compiler, and the likeliest of all is an off-by-one error in the recursive loop that generates code to follow the static chain.
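For reference, the static-chain loop usually has this shape (a sketch only: the code type and the instruction stand-ins FRAME and LOADSL are invented, not the course's actual code generator). The classic off-by-one bug is recursing on n instead of n - 1, which never reaches the base case and produces exactly the Stack overflow described above.

```ocaml
(* Symbolic code: FRAME stands for "address of our own frame base",
   LOADSL for "load the static link from the frame just addressed". *)
type code = FRAME | LOADSL | SEQ of code list

(* Address of the frame n static levels out: start at our own frame
   and follow the static link once per level.  Writing [schain n]
   in the recursive call instead of [schain (n - 1)] loops forever. *)
let rec schain n =
  if n = 0 then FRAME
  else SEQ [schain (n - 1); LOADSL]
```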
My compiler gives an error message generated code failed sanity check. What do I do now?
Again, this message comes from your compiler, not from the runtime system. The function Keiko.output that's supposed to output the Keiko code first applies a crude sanity check, verifying for example that your code does not try to pop non-existent values off the evaluation stack. The code generated by your compiler fails this check, probably indicating that it would crash the Keiko machine if run. Look in the assembly language file a.k to find which instruction triggered the message.
When I run the object code from my compiler, I get the message segmentation fault. No other helpful message is printed. What do I do now?
Welcome to the wonderful world of compiler debugging, where it is more productive to read the code than to run it. The object code from your compiler is broken somehow, in a way that was not detected by the sanity checks mentioned above. You can try simplifying the test case until you find which part of it is provoking the error, but in the end the most productive thing is to read the code (found in the file a.k) and see where it fails to make sense.
In Lab 2, my compiler generates a program from case.p that prints the message illegal instruction 0. Why, why, why?
Your program probably contains a sequence like this:

CASEJUMP 3
CASEARM 1 6
CASEARM 3 6
CASEARM 5 6
CASEARM 2 7
CASEARM 6 7
CASEARM 8 8

The problem here is that the number of cases (3) in the CASEJUMP instruction does not agree with the number (6) of following CASEARM instructions, and that leads to the bytecode interpreter trying to execute part of the jump table as if it were code. I'll resist the temptation to give a blow-by-blow account of the reason why this leads to the message you saw; suffice it to say that I think you're lucky to get a message at all. (This error ought now to be caught by the sanity checks – see above.)
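The check involved is easy to state in code. Here is a sketch (the instruction type and function name are invented for illustration, not the actual sanity checker): the count carried by a CASEJUMP must equal the number of CASEARM instructions that immediately follow it.

```ocaml
type instr =
  | CASEJUMP of int
  | CASEARM of int * int   (* case value, target label *)
  | OTHER

(* Does the count on a leading CASEJUMP match the number of CASEARM
   entries that follow it? *)
let casejump_ok = function
  | CASEJUMP n :: rest ->
      let rec arms k = function
        | CASEARM _ :: rest -> arms (k + 1) rest
        | _ -> k
      in
      arms 0 rest = n
  | _ -> true
```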
In Lab 4, why does this program (or a related one) segfault?
var p;

proc compose(f, g);
  proc fg(x);
  begin
    return f(g(x))
  end;
begin
  return fg
end;

proc add2(x);
begin
  return x+2
end;

proc square(x);
begin
  return x * x
end;

begin
  p := compose(square, add2);
  print p(2); newline
end.
The program forms a closure for the function fg that contains a pointer to the stack frame for compose; this closure is then returned as the result of compose, and continues to exist after the stack frame for compose has been destroyed. A subsequent attempt to call this closure follows the static link into oblivion. There is no check in the runtime system that closures contain 'sensible' addresses, so the result is a segmentation fault.
There's a similar program compose.p in the lab materials that preserves the stack frame for compose by a trick: by calling compose from a function that itself has a big enough stack frame, it arranges that in the subsequent execution of the program, the stack pointer never climbs high enough to overwrite the frame for compose, and so the static link continues to work. This is only a dirty trick, however.
[The original question concerned an attempt to define the factorial function by iterating a functional corresponding to the function body in the usual recursive program.]
What should I do with Mercurial when I've finished a lab exercise?
It's quite a good idea – while in a subdirectory like compilers/lab2 – to use hg st . to find out what files you've changed; the dot at the end of this command restricts its effect to the current directory, ignoring any changes made elsewhere in the directory tree. The listing you see will look something like this:

$ hg st .
M kgen.ml
M Makefile
? mytest.p
? kgen.ml~

meaning that the files kgen.ml and Makefile have been modified, and mytest.p and the backup file kgen.ml~ are unknown to Mercurial.
Your next actions might be to add the file mytest.p to Mercurial's list, then to check in your changes, giving a brief log message:

$ hg add mytest.p
$ hg ci -m "Finished Lab 2" .

Include the dot again so as to check in just the modified files in the current directory.
How can I keep my copy of the lab materials up to date?
When you've finished a lab and checked in your changes, we may ask you to update your copy of the lab materials, using the following sequence of commands:

$ hg pull
$ hg merge
$ hg ci -m 'Merged upstream changes'

The first of these commands contacts the Mercurial server and copies the updates to your clone of the repository; and the second merges the changes into your working copy of the files. The third command adds the result of the merge to your copy of the repository by making a revision with two parents: one being the latest version of your work, the other the newly-added version of the lab materials.
It's always safe to pull and merge in this way, and it makes sure you have the latest version of everything. In particular, I'm about to add a file .hgignore listing files that are ignored by Mercurial, so that backup files like kgen.ml~ are no longer shown. I'll try to minimise changes in the upstream repository that may interfere with your work, but I am in the process of putting more comments in the example compilers such as ppc2 that are included in the lab materials. The revision labelled basis is automatically updated so that it points to the latest version of the lab materials, so that you can make a diff with the command

$ hg diff -r basis .

at any time.
I tried to check in my work, but I got a message saying abort: no username supplied (see "hg help config"). What should I do?
You need to create a file called .hgrc in your home directory containing the following two lines, with the name and e-mail address replaced by your own details.

[ui]
username = Fred Bloggs <email@example.com>

When you check in revisions, this name and address are attached to each check-in, and subsequently appear in Mercurial's logs.
How can I use gedit to edit the commit message when I check in my work?
It's complicated. You can specify the editor you'd like to use by setting one of the environment variables EDITOR and VISUAL, and Mercurial will invoke that editor if you don't specify a commit message on the command line with the -m flag. But the convention is that this editor should not exit until the commit message is saved, and gedit doesn't respect this convention; if (as is likely) there is already a gedit session running, then invoking gedit as a shell command opens a new tab in the existing session and exits immediately, leading to an abort with "Commit message was empty". You could try to hack around this, but take it for all in all, the game is not worth the candle.
Vim users can set either environment variable to vim; emacs users can set it to emacsclient, a special wrapper program that pops up a new buffer inside an existing emacs session and waits for you to finish editing the commit message there.
A simpler solution is just to specify the commit message on the command line: a short phrase is enough for our purposes.
Why does this program in the language of Lab 2 loop forever?
repeat x := x+1 until true
Congratulations on winning the prize for Bug of the month!
Perhaps the word true appears in magenta in your editor, as if it were a language keyword. But that's just because the editor is assuming your program, stored in a file with extension .p, is written in Pascal. In our language, true is not a keyword but an ordinary variable name. In Lab 2's language, variables don't have to be declared, and they have an initial value of 0, which is interpreted as Boolean false. Thus your program loops until the variable true has a non-zero value, which it never will.
(Incidentally, true is not a reserved word in Pascal either, but a predefined identifier. That makes it legal for a Pascal program to contain the declaration const true = false if you want.)
How can you get the Keiko runtime system to trace instructions?
It turns out that you need to give the -d flag twice. From the Lab 2 directory:

$ ../keiko/ppx -d -d ./a.out

Each -d increments the debugging level, and the execution trace appears at level 2 or above. The output from ppx shows first a disassembly of the binary program, then a trace with two lines for each cycle, one showing the contents of the evaluation stack and the other showing the instruction that's about to be executed. There are too many values printed in hexadecimal, but at least you can follow the stream of instructions.
What's all that stuff about minus signs in integer constants?
If you don't make the minus sign in -2147483648 part of a single token with the digits, then the token 2147483648 is an integer constant whose value doesn't fit in 32 bits. In practice, this need not be a problem, because the result of - 2147483648 will be correct modulo 2^32 and therefore equal to -2147483648. But if in theory your language doesn't allow constants that don't fit in 32 bits, then you will be led to make a complicated exception for this case. Such an exception appears in Section 3.10.1 of the Java Language Specification. None of this is important, except as an example of the 1001 details that need to be got right in language and compiler design in order that ordinary programmers don't need to worry about them.
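The modulo-2^32 argument can be checked directly with OCaml's Int32 module (assuming a 64-bit system, where the literal 2147483648 is a legal native int): the out-of-range constant wraps to -2147483648, and negating that wraps back to -2147483648, which is exactly the value wanted for the two-token reading of "-2147483648".

```ocaml
(* On a 64-bit system, Int32.of_int takes its argument modulo 2^32,
   so 2147483648 wraps to -2147483648l ... *)
let big = Int32.of_int 2147483648

(* ... and negation, also modulo 2^32, gives -2147483648l again,
   the value the programmer intended. *)
let negated = Int32.neg big
```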
And what was that about typedef names in C?
In a C program, usually x * y will be interpreted by the compiler as multiplying two variables x and y. An expression can appear on its own as a statement: in this case writing

x * y;

as a statement all by itself is useless, but the rule is there to allow you to write a subroutine call as a statement, for example. So it's a fact that x * y; is a syntactically valid statement. But after the type definition

typedef int x;

what was syntactically a statement becomes a declaration: x * y; now declares y as having type pointer-to-integer. The upshot is that the parser must not treat all identifiers as interchangeable, but must rely on the lexer to classify each identifier as an ordinary identifier or as a typedef name that introduces a declaration. Because the set of typedef names can change, this requires an interaction between the parser and the lexer that is not envisaged in our simple data flow. The problem can be solved, but it's a bit messy; there's not really any reason to design a language to require this kind of messy trick in the compiler, and it's a symptom of the fact that C evolved over a lengthy period.
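The messy interaction can be sketched in a few lines (all names here are invented for illustration; this is the general technique, sometimes called the "lexer hack", not code from any particular C compiler): the lexer classifies each identifier by consulting a table that the parser updates whenever it reduces a typedef declaration.

```ocaml
(* Two kinds of identifier token, distinguished by the lexer. *)
type token = IDENT of string | TYPENAME of string

(* The shared table of typedef names, mutated by the parser. *)
let typedef_names : (string, unit) Hashtbl.t = Hashtbl.create 16

(* Called by the lexer for every identifier it reads. *)
let classify id =
  if Hashtbl.mem typedef_names id then TYPENAME id else IDENT id

(* Called by the parser after it processes, say, "typedef int x;". *)
let declare_typedef id = Hashtbl.replace typedef_names id ()
```

After declare_typedef "x", the token stream for x * y; changes shape, which is precisely why the same characters parse as a declaration rather than a multiplication.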
What font do you use in the printed materials for the course?
The fonts all come from the Lucida family, which is well set up for use with TeX. The body font is Lucida Bright.