
Course outline


Syntax

1. Phases and representations. Compilers can be structured as the composition of multiple transformations, with different representations of the program passed between each transformation and the next.
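
This pipeline view can be sketched in OCaml (the language our compiler is written in): each phase is an ordinary function, and the compiler is just their composition. The phases below are trivial invented stand-ins, not the real compiler.

```ocaml
(* Each phase consumes one representation and produces the next.
   The bodies here are placeholder stubs; only the shape matters. *)
let lex (source : string) : string list =
  String.split_on_char ' ' source          (* characters -> tokens *)

let parse (tokens : string list) : string list =
  tokens                                   (* tokens -> syntax tree (stub) *)

let generate (tree : string list) : string =
  String.concat "; " tree                  (* tree -> object code (stub) *)

(* the whole compiler is the composition of its phases *)
let compile source = source |> lex |> parse |> generate
```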

[Figure Roadmap1.png: One way to build a compiler for a simple but usable (≅ Pascal) programming language.]

Why study compilers?

  • Because knowing how they work helps us to use them better.
  • Because compiler techniques are useful in other programs.
  • To study the design of a much larger (4000–5000 lines) program with internal interfaces.
  • Because compiling is a fertile arena for applying Computer Science theory.

Our compiler, built in stages:

  • A sequence of source languages, building up to something like Pascal.
  • The target is at first a stack machine (Keiko), then native ARM code.
  • Compiler written in OCaml, using power tools (lex and yacc) for lexing and parsing.

Lab stages:

  • Lab 1 – different kinds of loop.
  • Lab 2 – type checking and accessing arrays.
  • Lab 3 – higher-order, nested procedures.
  • Lab 4 – code for the ARM.

Christmas assignment:

  • Statistically likely to build on Lab 4.
  • You will want your own Linux setup, and we will help with that.

Materials:

2. Lexical analysis. Regular expressions can be used to describe the set of tokens in the input language; lexical analysis divides the input program into meaningful tokens, discarding insignificant layout and classifying the tokens according to their grammatical function.
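
As a sketch of the idea (the course's actual lexer is generated with lex; the token names below are invented), a hand-written OCaml tokenizer might look like this:

```ocaml
(* A hand-written lexer sketch: skip layout, take maximal runs of digits
   or letters, and classify single-character symbols. *)
type token = Number of int | Ident of string | Plus | Times | Lparen | Rparen

let tokenize (s : string) : token list =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev acc
    else match s.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc        (* discard insignificant layout *)
      | '+' -> go (i + 1) (Plus :: acc)
      | '*' -> go (i + 1) (Times :: acc)
      | '(' -> go (i + 1) (Lparen :: acc)
      | ')' -> go (i + 1) (Rparen :: acc)
      | '0' .. '9' ->                              (* maximal run of digits *)
        let j = ref i in
        while !j < n && (match s.[!j] with '0' .. '9' -> true | _ -> false)
        do incr j done;
        go !j (Number (int_of_string (String.sub s i (!j - i))) :: acc)
      | 'a' .. 'z' | 'A' .. 'Z' ->                 (* maximal run of letters *)
        let j = ref i in
        while !j < n && (match s.[!j] with 'a' .. 'z' | 'A' .. 'Z' -> true | _ -> false)
        do incr j done;
        go !j (Ident (String.sub s i (!j - i)) :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %c" c)
  in go 0 []
```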

3. Syntax analysis. Context free grammars describe the nested structure of the programming language. An automated tool generates a parser that can consume the stream of tokens and produce an abstract syntax tree representing the syntactic structure of the input program.
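
In miniature (the real parser is generated by yacc from a grammar file; the toy grammar and names below are invented), a hand-written recursive-descent parser building an abstract syntax tree looks like this:

```ocaml
(* Grammar:  expr ::= term { "+" term }     term ::= NUMBER | "(" expr ")" *)
type token = Number of int | Plus | Lparen | Rparen

type expr = Num of int | Add of expr * expr   (* the abstract syntax tree *)

(* each function consumes a prefix of the token list and
   returns the tree built plus the tokens left over *)
let rec parse_expr toks =
  let (lhs, rest) = parse_term toks in
  parse_rest lhs rest
and parse_rest lhs = function
  | Plus :: rest ->
    let (rhs, rest') = parse_term rest in
    parse_rest (Add (lhs, rhs)) rest'     (* "+" associates to the left *)
  | toks -> (lhs, toks)
and parse_term = function
  | Number n :: rest -> (Num n, rest)
  | Lparen :: rest ->
    (match parse_expr rest with
     | (e, Rparen :: rest') -> (e, rest')
     | _ -> failwith "expected )")
  | _ -> failwith "expected a term"
```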

Problem sheet 1

Expressions and statements

4. Postfix code for expressions. The abstract syntax tree guides the process of generating code for expressions. In code to evaluate expressions on a stack machine, code for the operands is followed by instructions that combine operand values according to the operators in the expression.
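
A sketch of that recursive translation (the mnemonics loosely echo Keiko's style but are a simplified, invented subset):

```ocaml
type expr = Num of int | Var of string | Add of expr * expr | Mul of expr * expr

type instr = CONST of int | LOADW of string | PLUS | TIMES

(* postfix rule: code for the operands, then the instruction
   that combines their values on the stack *)
let rec gen (e : expr) : instr list =
  match e with
  | Num n -> [CONST n]
  | Var x -> [LOADW x]
  | Add (a, b) -> gen a @ gen b @ [PLUS]
  | Mul (a, b) -> gen a @ gen b @ [TIMES]
```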

5. Branching code for control structures. IF and WHILE statements can be translated into code where the tests become conditional branches, again in a recursive process guided by the abstract syntax tree. Any untidiness in the resulting program can be removed by using a system of local rewrite rules in a peephole optimiser.
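
A sketch of the translation for WHILE (and IF), with invented instruction names and freshly generated labels:

```ocaml
type stmt =
  | Assign of string * string                 (* straight-line code, as text *)
  | If of string * stmt * stmt
  | While of string * stmt

type instr =
  | EXEC of string                            (* placeholder for straight-line code *)
  | JUMPF of string * string                  (* branch to label if test is false *)
  | JUMP of string
  | LABEL of string

let counter = ref 0
let fresh () = incr counter; Printf.sprintf "L%d" !counter

let rec gen_stmt = function
  | Assign (x, e) -> [EXEC (x ^ " := " ^ e)]
  | If (test, thn, els) ->
    let l1 = fresh () in let l2 = fresh () in
    JUMPF (test, l1) :: gen_stmt thn
    @ [JUMP l2; LABEL l1] @ gen_stmt els @ [LABEL l2]
  | While (test, body) ->
    let l1 = fresh () in let l2 = fresh () in
    LABEL l1 :: JUMPF (test, l2) :: gen_stmt body
    @ [JUMP l1; LABEL l2]
```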

Problem sheet 2

Lab one: REPEAT and LOOP statements

Data structures

6. Semantic analysis. Languages with multiple data types and declarations of variables require an additional phase, where the abstract syntax tree is annotated so that each use of a variable is labelled with the corresponding definition. Following this phase, code can be generated in a purely local way as before.
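
A sketch of the annotation step: each use of a variable is looked up in an environment of declarations and labelled with the definition found there (the types and names here are invented for illustration):

```ocaml
type ptype = Integer | Boolean

type def = { d_name : string; d_type : ptype }      (* what a declaration recorded *)

type expr = Num of int | Var of string
type annot_expr =
  | ANum of int
  | AVar of string * def                            (* use labelled with its definition *)

let annotate (env : (string * def) list) (e : expr) : annot_expr =
  match e with
  | Num n -> ANum n
  | Var x ->
    (match List.assoc_opt x env with
     | Some d -> AVar (x, d)                        (* link the use to its definition *)
     | None -> failwith ("undeclared variable " ^ x))
```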

7. Address arithmetic for data structure access. The classic data types of records, arrays and pointers can be translated into code that uses address arithmetic to access elements of each kind of data structure.
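
The arithmetic itself is simple; for word-sized elements it might be sketched as follows (the sizes and offsets are invented for illustration):

```ocaml
let word_size = 4

(* address of a[i] for an array of word-sized elements starting at base *)
let array_elem base i = base + word_size * i

(* address of a[i][j] for a 2-D array stored row by row, n_cols words wide *)
let array2_elem base i j n_cols = base + word_size * (i * n_cols + j)

(* address of a record field: the record's base plus the field's fixed offset *)
let record_field base offset = base + offset
```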

Problem sheet 3

Lab two: array access.

Subroutines

8. Simple subroutines. Procedures without parameters need a stack of frames, with machine support for creating and destroying frames on call and return. Local variables can be allocated in the frame and addressed relative to the frame base.
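
A sketch of the mechanism, simulating the stack as an array of words (the layout here is simplified and invented; real frames also save a return address and registers):

```ocaml
let mem = Array.make 64 0       (* the runtime stack, as an array of words *)
let sp = ref 64                 (* stack pointer: next free slot, grows downward *)
let fp = ref 64                 (* frame base of the current procedure *)

(* call: save the caller's frame base and make room for n locals *)
let push_frame n_locals =
  sp := !sp - 1; mem.(!sp) <- !fp;    (* saved fp sits at the frame base *)
  fp := !sp;
  sp := !sp - n_locals

(* return: discard the frame and restore the caller's frame base *)
let pop_frame () =
  sp := !fp + 1;
  fp := mem.(!fp)

(* a local is addressed by a fixed offset from the frame base *)
let local k = mem.(!fp - k)
let set_local k v = mem.(!fp - k) <- v
```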

9. Parameters and nested procedures. Parameters can be implemented by preparing values before the call that the linkage mechanism makes addressable in the frame. Parameters passed by reference are represented by their addresses, with an additional indirection at each access from within the procedure. Nested procedures with access to non-local variables can be implemented by linking frames into a static chain.

Extra lecture A. Introducing the ARM. Handout.

10. Higher-order functions. Procedural parameters are implemented by passing a pair of the procedure's code address and the static link that is to be supplied when the procedure is eventually called.
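
A sketch of that representation, using an OCaml function as a stand-in for the code address (the frame type and example are invented for illustration):

```ocaml
type frame = { locals : (string * int) list; static_link : frame option }

type proc_value = { code : frame -> int -> int;   (* stand-in for a code address *)
                    link : frame }                (* static link to pass on call *)

(* calling a procedural parameter: supply its own static link, not the caller's *)
let call_proc (p : proc_value) (arg : int) : int = p.code p.link arg

(* a frame of some enclosing procedure, with a local k = 10 *)
let outer_frame = { locals = [("k", 10)]; static_link = None }

(* body of a nested procedure that reaches k through its static link *)
let add_k (env : frame) (x : int) : int = List.assoc "k" env.locals + x

let p = { code = add_k; link = outer_frame }
```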

11. Compiler implementation, testing and maintenance. The techniques described so far are sufficient to implement a compiler for a complete and usable language similar to Pascal. A well-managed compiler project relies on an automated testing scheme, where a corpus of test cases is rerun after each change to the compiler. Bug reports result in the addition of new test cases that ensure the same problem does not emerge again.

Problem sheet 4

Lab three: subroutines.

Machine code

12. The back end: operator trees. Translation to machine code can be organised in multiple phases that share a common intermediate representation. We use trees with nodes that are simple Keiko instructions and edges that show the flow of values between them.
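
In OCaml the operator trees might be declared like this (a simplified, invented subset of the constructors):

```ocaml
(* nodes are simple instructions; the subtrees supply their operand values *)
type optree =
  | CONST of int
  | LOCAL of int                       (* address of a local, as a frame offset *)
  | LOADW of optree                    (* load a word from the address below *)
  | STOREW of optree * optree          (* store a value at an address *)
  | BINOP of string * optree * optree

(* the statement  x := x + 1,  with x at frame offset -4 *)
let example =
  STOREW (BINOP ("plus", LOADW (LOCAL (-4)), CONST 1), LOCAL (-4))
```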

[Figure: Roadmap2.png]

13. Instruction selection. Machine instructions can be selected by tiling each tree with patches that correspond to instructions. The system of patches can be described, surprisingly enough, by a (highly ambiguous) context free grammar specific to the target machine.
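
A greedy 'maximal munch' sketch, trying larger tiles before smaller ones and emitting ARM-flavoured instructions (the tiles, register handling and mnemonics are heavily simplified and partly invented):

```ocaml
type optree = CONST of int | LOCAL of int | LOADW of optree | PLUS of optree * optree

let reg = ref 0
let fresh_reg () = incr reg; Printf.sprintf "r%d" !reg

(* returns the register holding the value and the instructions emitted;
   match order puts the bigger tiles first *)
let rec select (t : optree) : string * string list =
  match t with
  | LOADW (LOCAL n) ->                           (* big tile: load a local in one go *)
    let r = fresh_reg () in
    (r, [Printf.sprintf "ldr %s, [fp, #%d]" r n])
  | PLUS (a, CONST n) ->                         (* big tile: add immediate *)
    let (ra, ca) = select a in
    let r = fresh_reg () in
    (r, ca @ [Printf.sprintf "add %s, %s, #%d" r ra n])
  | PLUS (a, b) ->
    let (ra, ca) = select a in
    let (rb, cb) = select b in
    let r = fresh_reg () in
    (r, ca @ cb @ [Printf.sprintf "add %s, %s, %s" r ra rb])
  | CONST n ->
    let r = fresh_reg () in
    (r, [Printf.sprintf "mov %s, #%d" r n])
  | LOCAL n ->
    let r = fresh_reg () in
    (r, [Printf.sprintf "add %s, fp, #%d" r n])
```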

14. Common sub-expressions. Code generated for modern-day RISC machines is substantially better if we can detect and eliminate common sub-expressions, at least in simple cases. A classic algorithm for this uses a hash table to build a DAG from the optrees, then flattens the DAG into a sequence of trees, assigning shared expressions to temps.
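
A sketch of the hash-table step using value numbering: identical subtrees receive the same node number, so sharing is detected (the details are invented, and the real algorithm goes on to flatten the DAG and introduce temps):

```ocaml
type expr = Var of string | Add of expr * expr | Mul of expr * expr

(* the 'shape' of a node: its operator plus the numbers of its operands *)
type key = KVar of string | KOp of string * int * int

let table : (key, int) Hashtbl.t = Hashtbl.create 16
let next = ref 0

let node (k : key) : int =
  match Hashtbl.find_opt table k with
  | Some n -> n                                 (* seen before: share the DAG node *)
  | None -> incr next; Hashtbl.add table k !next; !next

(* value number of an expression: equal numbers mean a common sub-expression *)
let rec vn = function
  | Var x -> node (KVar x)
  | Add (a, b) -> node (KOp ("add", vn a, vn b))
  | Mul (a, b) -> node (KOp ("mul", vn a, vn b))
```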

15. Register allocation and assignment. A simple scheme for allocating registers is to traverse the optrees, allocating a register to the output of each tile that becomes the input to the tile above it. This allocation scheme can be merged with the greedy approach to selecting instructions in a single traversal of each tree. More sophisticated compilers are able to keep values in registers across larger regions of code that are still smaller than a whole subroutine.
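
A sketch of the simple scheme, with a small invented register pool: each operand's register is released as soon as the tile above has claimed a register for its result:

```ocaml
let pool = ref ["r0"; "r1"; "r2"; "r3"]       (* registers currently free *)

let alloc () =
  match !pool with
  | r :: rest -> pool := rest; r
  | [] -> failwith "out of registers: a spill would be needed"

let release r = pool := r :: !pool

type tree = Leaf of string | Node of string * tree * tree

(* bottom-up traversal: operands get registers, which are freed for
   reuse once the result register for the tile above is chosen *)
let rec eval = function
  | Leaf x ->
    let r = alloc () in
    (r, [Printf.sprintf "ldr %s, %s" r x])
  | Node (op, a, b) ->
    let (ra, ca) = eval a in
    let (rb, cb) = eval b in
    release ra; release rb;
    let r = alloc () in
    (r, ca @ cb @ [Printf.sprintf "%s %s, %s, %s" op r ra rb])
```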

Problem sheet 5

Lab four: machine code.

16. Object oriented programming. Objects with methods and dynamic dispatch have (assuming single inheritance) a particularly simple and efficient implementation that adds a single pointer to each object. Studying this implementation helps to explain why method invocation has the properties it does.
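
A sketch of that representation in OCaml, with records standing in for objects and method tables (the classes and methods are invented):

```ocaml
(* every object carries one pointer to its class's method table *)
type obj = { vtable : vtable; mutable balance : int }
and vtable = { deposit : obj -> int -> unit;
               describe : obj -> string }

(* a base class ... *)
let account_vtable =
  { deposit = (fun self n -> self.balance <- self.balance + n);
    describe = (fun self -> Printf.sprintf "account with %d" self.balance) }

(* ... and a subclass that overrides just one method *)
let savings_vtable =
  { account_vtable with
    describe = (fun self -> Printf.sprintf "savings with %d" self.balance) }

(* method invocation: one load of the vtable pointer, then an indexed call *)
let describe (o : obj) = o.vtable.describe o
```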

Problem sheet 6: revision.

Christmas assignment.