Lecture 22 – Designing a datapath (Digital Systems)

Copyright © 2024 J. M. Spivey
Jump to navigation Jump to search

The plan now is to put together the elements we have studied into a simple datapath that can execute Thumb instructions. We'll do it in stages, adding at each stage just sufficient logic to implement a few more instructions. The design we make will be seriously unrealistic, in that all the work of executing an instruction will be performed inside a single clock cycle: this will lead to longer combinational paths than we would like, and so require a lower clock speed. A more practical design would use pipelining to overlap the execution of each instruction with preparation for the next one. You can study pipelining and the design questions it raises in next year's course on Computer Architecture. For now, we must be content with the observation that if we wanted a pipelined implementation of the Thumb architecture, then effort spent on this single-cycle implementation would not be wasted, because a pipelined design starts with a single-cycle design, drawing lines across the circuit to separate what happens for a particular instruction in this clock cycle from what happens in the next.

The design is shown in these notes by means of a sequence of circuit diagrams, each accompanied by a selection of settings for the control signals that correspond to instructions that the circuit is capable of executing. The final design is also represented by a register-level simulator written in C. The simulator contains tables that decode all the instructions it implements, and is capable of loading and executing binary images prepared using the standard Thumb tools.

  • The handout for this lecture and the next one is provided as a separate, printable document. The content is the same, but the presentation has been edited for printing.
  • Source code for the simulator is available as part of the (utterly optional) Lab five.
  • An annotated listing of the simulator is also provided as a PDF document.

Stage 1: Instruction fetch

The first stage is to arrange to fetch a stream of instructions from memory, and decode them into control signals that will drive the rest of the datapath. For this, let's install a program counter PC, a register that will, on each clock cycle, feed the address of the current instruction to a memory unit IMem so that it fetches a 16-bit instruction. There's also a simplified adder that increments the PC value by 2 and feeds it back into the PC as the address of the next instruction. Some of the 16 bits of the instruction are fed into a combinational circuit that decodes it, producing a bundle of control signals that are fed to the functional units in the datapath. Since those functional units and their connections are yet to be added, we can't say precisely what the control signals are at this stage; but let's add wiring that makes both the control signals and the remaining bits of the instruction (those that have not been accounted for in the encoding) available to each part of the datapath. We can imagine building the decoder from a ROM or (as we'll see later) several ROMs that decode different parts of the instruction.

Stage 1: Instruction fetch

This design is capable of implementing only straight-line programs with no branching, because there is no way to avoid a sequence of PC values that goes 0, 2, 4, 6, ... Also, this design doesn't reflect the fact that instructions can access the PC alongside the other 15 registers. We'll adjust the design later to correct both these infelicities.

Stage 2: ALU operations

Now let's add some datapath components: a simple register file and an ALU. The twin-port register file is capable of reading the values of two registers and writing a result to a third register, all in the same clock cycle. We can imagine for now that the three registers are selected by fixed fields in the instruction, as they are in the add reg instruction:

Adds-rrr format.png

In executing this instruction, the two registers that are read are selected by fields instr<5:3> and instr<8:6> of the instruction. The control unit must ask the ALU to add its two inputs, producing a result that is fed back to the register file. The control unit also tells the register file to write the result back into the register selected by instr<2:0>.

Stage 2: ALU operations

The same datapath could be used to implement other instructions that perform arithmetic on registers: the three-register form of sub, certainly:

Subs-rrr format.png

For this instruction, we need to tell the ALU to subtract rather than add. But we can also implement instructions like ands that specify two registers and overwrite one of their operands:

Ands-rr format.png

This instruction is a bit different, because the ALU must do a bitwise and operation, but also the three registers are selected by different fields of the instruction, with instr<2:0> used to select both one of the inputs and the output of the instruction. Let's leave aside for a while these issues of detail in decoding, and concentrate instead on what features are needed in the datapath itself.

Stage 3: Immediate operands

In addition to instructions that perform ALU operations with the operands taken from registers, there are also instructions that take the second operand from a field in the instruction. Examples of this are two forms of the add instruction:

Adds-rri format.png
Adds-ri format.png

We can also cover the immediate form of the mov instruction, if we allow an ALU operation that simply copies the second operand and ignores the first.

Movs-ri format.png

To accommodate these, we can introduce a multiplexer on the second input of the ALU, fed both from the second register value rb and from appropriate fields of the instruction. The examples above show that it must provide the option of selecting both instr<8:6> and instr<7:0>, and there are other possibilities we will discover as we proceed.

Stage 3: Immediate operands

Now we have a few control signals, we can start to make a table showing how they should be set to carry out various instructions.

Instruction cRand2 cAluOp cRegWrite
adds r RegB Add T
adds i3 Imm3 Add T
adds i8 Imm8 Add T
subs r RegB Sub T
subs i3 Imm3 Sub T
subs i8 Imm8 Sub T
ands r RegB And T
movs r RegB Mov T
movs i8 Imm8 Mov T

This table will expand as we get further, adding more rows to provide more instructions, but also more columns to control the extra hardware we will need to implement them. There will always be settings for any added control signals that make the new hardware do nothing, so that we can still get the effect of these early instructions unchanged. We've still to deal with the issue of what registers are selected to take part in the instructions, and we have also yet to provide for the fact these instructions all set the condition codes. And so far, all instructions write their result into a register: that will change too.

Stage 4: Data memory

Now let's add a second interface to memory, so that we can implement load and store instructions. We'll use what's called a modified Harvard architecture, meaning that the data memory will be treated separately from the instruction memory. They could be separate memories, as on some microcontrollers like the AVR used in the Arduino, or we could imagine having two independent caches in front of the same memory, and modelling just the things that happen when there is a cache hit all the time, not the periods of suspended animation when the processor core is waiting for a memory transaction to complete. Either way, this is different from the von Neumann architecture of ARM's Cortex-M0 implementation, where there is one memory interface, and loads and stores are executed in an extra cycle between instruction fetches.

The Thumb instruction set provides instructions like ldr r0, [r1, r2] and str r0, [r1, r2] that form an address by adding the contents of two registers r1 and r2, and either load a memory word and save it in a third register r0, or take a value from r0 and store it.

Ldr-rrr format.png
Str-rrr format.png

We should notice two requirements for the datapath: first, that we need to form addresses by adding, and second, that the str instruction here reads not two but all three of the registers named in the instruction. We can use the existing ALU to do the address calculation, and we can easily enhance the register file with an extra mux so that it outputs the current value of all three registers named in the instruction. If the third register value isn't needed (as in all instructions except a store) then it costs nothing to compute it anyway.

Stage 4: Data memory

In addition to the third register value rc, the diagram shows two further architectural elements. There's the data memory DMem, with two data inputs, one data output and two control inputs, both single bits. The two data inputs are an address, taken from the ALU output, and a value to be stored, taken from rc. The data output is a value memout that has been loaded from memory, and this together with aluout feeds a new mux that determines the result of the instruction. The two control inputs for the memory are cMemRd and cMemWr, telling it whether to conduct a read cycle, a write cycle, or (if the instruction is not a load or store) neither. Writing when we don't want to write is obviously harmful, and reading unnecessarily might also be harmful if it causes a cache miss, or an exception for an invalid address. The result mux can be controlled by the same cMemRd signal, so that the result of the instruction is memout for load instructions and aluout for everything else.

Let's enhance the decoding table to cover these two new instructions. I'll keep just a few of the existing instructions, extending the lines for them to include values for cMemRd = cMemWr = F that maintain the same function as before; the other instructions can be extended in the same way. I've added entries for the ldr and str with the reg+reg addressing mode. Note that str is the first instruction that doesn't write its result back to a register. There will be others, so while it was tempting before to suppose cRegWrite = T always, and it's tempting now to suppose cRegWrite = !cMemWr, we will see later that neither of these are true.

Instruction cRand2 cAluOp cMemRd cMemWr cRegWrite
adds r RegB Add F F T
movs i8 Imm8 Mov F F T
ldr r RegB Add T F T
str r RegB Add F T F

Stage 5: Barrel shifter

As the next step, let's add a barrel shifter to the datapath, so that we can implement shift instructions like the following.

Lsls-rri format.png
Rors-rr format.png

We could make a barrel shifter part of the ALU, so that left and right shifts were added to the list of ALU operations; or failing that, we could put the shifter 'in parallel' with the ALU, feeding its output together with those of the ALU and the data memory into the result mux. That would allow us to implement the shift instructions OK, but it would be less versatile. We can take a glance at big-ARM instructions like

ldr r0, [r1, r2, LSL #2].

This shifts left by 2 bits the value of r2, adds that to the value of r1 to form an address, loads from that address, and puts the result in r0, all in one instruction. This is really useful if r1 contains the base address of an array and r2 contains an index into the array. Sadly, Thumb code doesn't have a way to encode all that in one instruction. We can provide for such operations by adding a barrel shifter in front of the ALU, operating on the value that will become the ALU's second input, as shown in the figure. There is a control signal to set the operation – Lsl, Lsr, Asr or Ror – to be performed by the shifter. There's also a mux that lets us choose the shift amount, either a constant like 0 or 2 (Sh0 or Sh2), or an immediate field of the instruction (ShImm), or a value taken from the first register ra read by the instruction (ShReg).

Stage 5: Barrel shifter

There are two new control signals, cShiftOp and cShiftAmt. Existing instructions will continue to work if we set cShiftOp = Lsl and cShiftAmt = Sh0, representing the constant 0. We can make good use of the shifter in implementing load and store instructions with the reg+imm addressing mode, because it is specified that the offset should be multiplied by 4 in such instructions, and we can get the shifter to do the multiplication.

Ldr-rri format.png
Str-rri format.png

Here are control signals for some existing instructions, plus the two shift instructions and the reg+imm forms of ldr and str.

Instruction cRand2 cShiftOp cShiftAmt cAluOp cMemRd cMemWr cRegWrite
adds r RegB Lsl Sh0 Add F F T
movs i8 Imm8 Lsl Sh0 Mov F F T
ldr r RegB Lsl Sh0 Add T F T
str r RegB Lsl Sh0 Add F T F
lsls i5 RegB Lsl ShImm Mov F F T
rors r RegB Ror ShReg Mov F F T
ldr i5 Imm5 Lsl Sh2 Add T F T
str i5 Imm5 Lsl Sh2 Add F T F

The disadvantage of putting the barrel shifter 'in series' with the ALU is that it lengthens the combinational paths, one of which now stretches from the register file, through shifter, ALU and data memory, back to the register file. The long path will slow the maximum clock rate that can be supported by the machine.


What's the difference between decoded, derived and dynamic signals?

Decoded signals are those that appear in the decoding tables and are determined by the instruction's opcode; derived signals also depend on other parts of the instruction halfword, and dynamic signals depend on other parts of the machine state.

  • The decoded signals will be the same in every instance of an instruction – that it to say, two instructions that share the same opcode will have the same decoded signals. Note, however, that two instructions like mov r and mov i8 may share the same mnemonic in assembly language, but have different opcodes, and so be treated as different instructions by the hardware, with different decoded signals. In the simulator, these decoded signals form the members of the Control structure ctrl contained in one of the decoding tables, and have names like ctrl->cShiftAmt.
  • The derived signals will be the same whenever a particular instruction is executed, because they are determined by the bits of the instruction taken all together; for example, the instruction add r3, r1, r2 always writes its result to register r3, and so has cRegC = 3. In the simulator, these derived signals are outside the Control structure, but also have names like cRegC that start with a lower-case c followed by an upper-case letter.
  • The dynamic signals differ from one execution of the instruction to another, so that at one time that add instruction could write 7 to r3, and another time it could write 8, and the value of the result signal would be different in the two cases.

In the lectures, you talk about processor designs containing lots of multiplexers, but some books talk about designs containing various busses. What's the difference?

There isn't really a difference: a bus in a conventional design is a bundle of wires that can be driven by several sources at different times, usually by setting one source to produce an output and the others to enter a high-impedance state. If there is a difference, it is that we are neutral about the technology used to implement a multiplexer – logic gates, or the kind of three-state logic just described – whereas a bus, especially if it stretches between chips, has a specific implementation that must be followed.

Lecture 23