Glossary (Compilers)

Copyright © 2024 J. M. Spivey
Jump to navigation Jump to search
Application Binary Interface (ABI)
A set of conventions about the layout of memory and the use of registers that is required for machine code on a specific architecture that interacts with an operating system and existing libraries. For example, library functions may expect to be called using a specific instruction (saving the PC in a known place) and with arguments in specific registers. The ABI for a platform may be determined partly by the hardware (e.g., what the subroutine call instruction does) and partly by agreement among all software suppliers (e.g., which registers are preserved across subroutine calls).
Addressing mode
Machines commonly have one or more schemes that can be used in an instruction that accesses memory in order to calculate the address of the memory location to be accessed. Examples of such schemes or addressing modes are: a fixed address, or the sum of a register and a constant, or the sum of two registers. On a RISC machine, there may be only one or two addressing modes, perhaps a register plus a small offset, or the sum of two registers. Commonly, it is not possible on such machines to fit a whole memory address into an instruction, so that access to statically allocated variables requires the address to be developed into a register first.
Basic block
A sequence of operations one entry point at the start and one exit point at the end; if one operation in a basic block is executed, then all of them are executed. An extended basic block has one entry point at the start, but may have multiple exits, for example via conditional branches in the middle.
Big switch
The collection of action routines in a bytecode interpreter (usually expressed as a switch statement in C, or something equivalent), with one action for each possible bytecode. The big switch and its enclosing fetch/execute loop make up much of the overhead of executing bytecode – typically 80% of the execution time – making it highly beneficial to choose bytecodes that perform several related actions rather than just one.
Code for a virtual machine that is represented in a compact form, with most instructions occupying a single byte.
Common sub-expression elimination
An optimisation that identifies repeated operations (typically within a basic block) and eliminates them by performing each operation once and saving the result in a temp.
Context-free grammar
A formal definition of a language, consisting of alphabets \(\Sigma\) of tokens or terminal symbols and \(N\) of variables or non-terminal symbols, a start symbol \(S\in N\), and a set of productions \(A \to \alpha\) where \(A \in N\) and \(\alpha \in (N \cup \Sigma)^*\). The language defined by a CFG is the set of strings in \(\Sigma^*\) that are generated by derivations in the grammar.
Context pointer
On Keiko, procedures are represented by a small table that contains the size of the stack frame, a pointer to the bytecode for the procedure body, and the values of constants referenced by the bytecode, including the addresses of statically allocated variables and of other procedures that it calls. The context pointer is a register that always points to the table for the currently active procedure, and is used by bytecode instructions to fetch the values of those constants.
Developing a constant into a register
RISC machines are commonly unable to fit an arbitrary word-sized constant into the operand field of an instruction. Programs that use such constants (especially for addressing) must be written to use special instructions to put the value of the constant into a register before it is used. On the ARM, there is a special form of PC-relative load instruction that is used for this purpose. Other machines (including later models of the ARM) can use a pair of instructions, each providing 16 bits of a 32-bit constant.
Dynamic allocation
Storage that is allocated as needed during execution of a program. Typically, the program will have pointer variables p, and an operation new(p) that allocates a fresh block of storage and puts its address in p. Dynamically allocated storage can be recycled either by providing an explicit operation to free blocks of storage, or by using a garbage collector.
Dynamic link
A pointer leading from the stack frame of a procedure to the stack frame of the procedure that called it. The dynamic link is used when the procedure returns to restore the frame pointer to the value it had before the call.
Floating point
A representation of real numbers with a sign, a mantissa, and an exponent that allows the number to be scaled by a power of two. Typical machines have special registers for holding floating point numbers and special instructions for loading, storing and comparing them and performing arithmetic operations on them. In this course, we ignore these machine features for simplicity because they add complexity without presenting any really new problems from the point of view of compiling.
Frame head
the portion of a stack frame that contains administrative information. On Keiko, the frame head contains the dynamic link, the return address, a context pointer for the current procedure, and the static link.
Frame pointer
A register that points to the stack frame of the currently active procedure.
Garbage collector
A software component that is able to analyse the pointers that are stored by a running program, in order to identify blocks of storage that have been dynamically allocated but are no longer accessible by any pointer. Such blocks can be recycled and used to satisfy future allocation requests. In order for a garbage collector to work safely and efficiently, it must be able to identify all pointers that are stored by the running program. Keiko has provision for pointer maps that provide this information for each stack frame, for the program's statically allocated storage, and for each kind of dynamically allocated block. In the course, for simplicity we forgo garbage collection and omit these pointer maps from the compiler output.
A program that describes the behaviour of other programs written in a fixed language, without translating those programs into another form. For example, an interpreter for bytecode will have an explicit loop that fetches instructions one at a time, and invokes one of a fixed list of actions (the big switch) according to the instruction it finds.
JIT translator
A compiler that is embedded in the execution environment for a virtual machine, and translates code for the virtual machine into native machine code just before each procedure is called. After translation, the native code remains in memory and can be used if the same procedure is called again, but the code is thrown away when the program terminates. Either each procedure (or some other fragment of the program) is translated when it is called for the first time, or an interpreter and a translator co-exist, and procedures are translated if they have been called more than a few times.
In the theory of syntax, a set of strings over a specified alphabet.
Operator tree (optree)
A tree whose nodes are labelled with elementary operations, such as forming the address of a local variable, loading from an address, or adding two integers together. In the course, these operations will be expressed as Keiko instructions. Machine code can be generated by finding groups of nodes ('tiles') in an operator tree that can be covered by single instructions of the target machine.
PC-relative addressing
An addressing mode that involves adding a small constant to the value of the program counter in order to form an address. On the ARM, PC-relative addressing is used to access tables of large constants that are located at the end of the code for each procedure, giving access to the values of these constants without having to embed large constants in instructions.
Peephole optimiser
A compiler phase that improves a program that is expressed as a sequence of instructions. The optimiser contains a collection of rules, each matching a short sequence of instructions and replacing them with a better sequence.
A compiler module that transforms the program being compiled from one representation into a possibly different representation. In the compilers we design, phases mostly run sequentially, each making a separate pass over the program; but it is possible to combine multiple phases into a single pass, for example generating intermediate code at the same time as parsing an expression, rather than constructing an explicit abstract syntax tree and then walking over it.
A rule A → α in a context-free grammar that allows a syntactic variable A to be replaced by a string α, a mixture of variables and tokens.
Program counter
A register that holds the address of the next instruction to be executed.
Regular expression
An expression that denotes a language, built up from literal characters using the operations of concatenation, alternation and Kleene closure.
Return address
The address of the next instruction after a call of a procedure. When the procedure returns, execution continues from this point.
RISC machine
A computer designed with a simplified instruction set. Typically, these machines have a large set of uniform registers, a small set of addressing modes, and load/store instructions separate from the instructions that carry out arithmetic operations.
Scratch register
A register that is set aside for temporary use, typically in a short sequence of instructions that translate a single operation. For example, on the ARM we set aside IP as a scratch register, and we can branch to an arbitrary address using the sequence set ip, #addr; br ip.
Semantic action
An expression attached to a production in a context-free grammar that shows how to build an abstract syntax tree for the construct that is generated by the production. More generally, an expression that associates a value with the construct that is computed from the values of its parts.
Small constant
A constant that fits into the 'immediate field' of an instruction. RISC machines are designed with limited sizes for the constants that appear in instructions, big enough to contain the majority of constants that occur in practical programs. A compiler for such a machine should generate instructions with embedded constants where possible, but must also have provision for dealing with constants that do not fit, perhaps by developing them into registers. On CISC machines, it is often the case that all constants are regarded as small, whatever their value.
Stack frame
A region of storage, allocated from a stack, that contains the parameters and local variables of a procedure activation, together with administrative information (in the frame head) used when the procedure returns, or for access to variables in enclosing procedures.
Stack pointer
A register that points to the last address on the stack that is in use (or alternatively, the first address that is not in use).
Static allocation
Storage reserved at a fixed address, supporting a single copy of a variable, with a lifetime covering the entire execution of the program.
Static link
A pointer leading from the stack frame of a procedure to the stack frame of the textually enclosing procedure. Static links are one mechanism for allowing the body of a procedure to access variables in enclosing procedures.
Syntax-directed translation
A scheme for generating code that is based on assembling the translation of each construct from translations of its parts. Such a scheme is naturally expressed as a recursive function on abstract syntax trees.
Virtual machine
A notional computer that is used as the target of a compiling process but is not implemented as hardware. Code for a virtual machine may be executed either by an interpreter, or by translating it into code for a real machine, perhaps just before it is executed (JIT translation).