Lecture 4 – Number representations (Digital Systems)

Copyright © 2024 J. M. Spivey
Jump to navigation Jump to search

Signed and unsigned numbers

[math]\displaystyle{ }[/math][4.1] So far we have used only positive numbers, corresponding to the C type unsigned. We can pin down the properties of this data type by defining a function \(\mathit{bin}(a)\) that maps an \(n\)-bit vector \(a\) to an integer:

\[\mathit{bin}(a) = a_0 + 2a_1 + 4a_2 + \ldots + 2^{n-1}a_{n-1} = \sum_{0\le i\lt n} a_i.2^i\]

(where \(n = 32\)). What is the binary operation \(\oplus\) that is implemented by the add instruction? It would be nice if always

\[\mathit{bin}(a \oplus b) = \mathit{bin}(a) + \mathit{bin}(b)\]

but unfortunately that's not possible owing to the limited range \(0 \le \mathit{bin}(a) < 2^n\). All we can promise is that

\[\mathit{bin}(a \oplus b) \equiv \mathit{bin}(a) + \mathit{bin}(b) \pmod{2^n}\]

and since each bit vector maps to a different integer modulo \(2^n\), this equation exactly specifies the result of \(\oplus\) in each case.

[4.2] More commonly used than the unsigned type is the type int of signed integers. Here the interpretation of bit vectors is different: we define \(\mathit{twoc}(a)\) by

\[\mathit{twoc}(a) = \sum_{0\le i<n-1} a_i.2^i - a_{n-1}.2^{n-1}.\]

Taking \(n = 8\) for a moment:

    a        bin(a)   twoc(a)
0000 0000       0        0
0000 0001       1        1
0000 0010       2        2
0111 1111     127      127 = 27-1
1000 0000     128     -128 = -27
1000 0001     129     -127
1111 1110     254       -2
1111 1111     255       -1

Note that the leading bit is 1 for negative numbers. So we see \(-2^{n-1}\le \mathit{twoc}(a) < 2^{n-1}\), and

\[\mathit{twoc}(a) = \sum_{0\le i<n} a_i.2^i - a_{n-1}.2^n = \mathit{bin}(a) - a_{n-1}.2^n.\]

So \(\mathit{twoc}(a) \equiv \mathit{bin}(a) \pmod{2^n}\). Therefore if \(\mathit{bin}(a \oplus b) \equiv \mathit{bin}(a) + \mathit{bin}(b) \pmod{2^n}\), then also \(\mathit{twoc}(a \oplus b) \equiv \mathit{twoc}(a) + \mathit{twoc}(b)\), and the same addition circuit can be used for both signed and unsigned arithmetic – a telling advantage for this two's complement representation.

[4.3] How can we negate a signed integer? If we compute \(\bar a\) such that \(\bar a_i = 1 - a_i\), then we find

\[\begin{align}\mathit{twoc}(\bar a) &= \sum_{0\le i<n-1} (1-a_i).2^i - (1-a_{n-1}).2^{n-1}\\&= -\Bigl(\sum_{0\le i<n-1} a_i.2^i - a_{n-1}.2^{n-1}\Bigr) + \Bigl(\sum_{0\le i<n-1} 2^i - 2^{n-1}\Bigr) = -\mathit{twoc}(a)-1\end{align}\]

So to compute \(-a\), negate each bit, then add 1. This works for every number except \(-2^{n-1}\), which like 0 gets mapped to itself.

This gives us a way of implementing binary subtraction \(\ominus\) in such a way that

\[\mathit{twoc}(a\ominus b) \equiv \mathit{twoc}(a) - \mathit{twoc}(b) \pmod{2^n}.\]

We simply define \(a\ominus b = a\oplus \bar b\oplus 1\). If we have a method of addition that adds two digits and a carry in each column, producing a sum digit and a carry into the next column, then the natural thing to do is treat the \(+1\) as a carry input to the rightmost column. We will implement this electronically next term.


This two's-complement binary representation of numbers is typical of every modern computer: the vital factor is that the same addition circuit can be used for both signed and unsigned numbers. Historical machines used different number representations: for example, in business computing most programs used to do a lot of input and output and only a bit of arithmetic, and there it was better to store numbers in decimal (or binary-coded decimal) and use special circuitry for decimal addition and subtraction, rather than go to the trouble of converting all the numbers to binary on input and back again to decimal on output, something that would require many slow multiplications and divisions. Floating point arithmetic has standardised on a sign-magnitude representation because two's-complement does not simplify things in a signficant way. That makes negating a number as simple as flipping the sign bit, but it does means that there are two representations of zero – +0 and −0 – and that can be a bit confusing.

Comparisons and condition codes

[math]\displaystyle{ }[/math][4.4] To compare two signed numbers \(a\) and \(b\), we can compute \(a \ominus b\) and look at the result:

  • If \(a \ominus b = 0\), then \(a=b\).
  • If \(a \ominus b\) is negative (sign bit = 1), then it could be that \(a<b\), or maybe \(b<0<a\) and the subtraction overflowed. For example, if \(a = 100\) and \(b = -100\), then \(a-b = 200 \equiv -56 \pmod{256}\) in 8 bits, so \(a\ominus b\) appears negative when the true result is positive.
  • Similar examples show that, with \(a < 0 < b\), the result of \(a \ominus b\) can appear positive when the true result is negative. In each case, we can detect overflow by examining the signs of \(a\) and \(b\) and seeing if they are consistent with the sign of the result \(a \ominus b\).

[4.5] The cmp instruction computes \(a \ominus b\), throws away the result, and sets four condition flags:

  • N – the sign bit of the result
  • Z – whether the result is zero
  • C – the carry output of the subtraction
  • V – overflow as explained above

[4.6] A subsequent conditional branch can test these flags and branch if an appropriate condition is satisfied. A total of 14 branch tests are implemented.

equality     signed              unsigned                  miscellaneous
--------     ------              --------                  -------------
beq*  Z      blt  N!=V           blo = bcc*  !C            bmi*  N
bne*  !Z     ble  Z or N!=V      bls         Z or !C       bpl*  !N
             bgt  !Z and N=V     bhi         !Z and C      bvs*  V
             bge  N=V            bhs = bcs*  C             bvc*  !V

The conditions marked * test individual condition code bits, and the others test meaningful combinations of bits. For example, the instruction blt tests whether in the comparison \(a<b\) was true, and that is true if either the subtraction did not overflow and the N bit is 1, or the subtraction did overflow and the N bit is 0 – in other words, if N ≠ V. All the other signed comparisons use combinations of the same ideas: it's pointless to memorise all the combinations.

For comparisons of unsigned numbers, it's useful to work out what happens to the carries when we do a subtraction. For example, in 8 bits we can perform the subtraction 32 − 9 like this, adding together 32, the bitwise complement of 9, and an extra 1:

32      0010 0000
~9      1111 0110
      1 0001 0111

The result should be 2310 = 101112, but as you can see, if performed in 9 bits there is a leading 1 bit that becomes the C flag. If we subtract unsigned numbers \(a-b\) in this way, the C flag is 1 exactly if \(a\ge b\), and this give the basis for the unsigned conditional branches bhs (= Branch if Higher or Same), etc.

The cmp instruction has the sole purpose of setting the condition codes, and throws away the result of the subtraction. But the subs instruction, which saves the result of the subtraction in a register, also sets the condition codes: that's the meaning of the s suffix on the mnemonic. (The big ARMs have both subs that does set the condition codes, and sub that doesn't.) Other instructions also set the condition codes in a well-defined way. If the codes are set at all, Z always indicates if the result is zero, and N is always equal to the sign bit of the result.

  • In an adds instruction, the C flag is set to the carry-out from the addition, and the V flag indicates if there was an overflow, with a result whose sign is inconsistent with the signs of the operands.
  • In a shift instruction like lsrs, the C flag is set to the last bit shifted out. That means we can divide the number in r0 by 2 with the instruction lsrs r0, r0, #1 and test whether the original number was even with a subsequent bcc instruction. Shift instructions don't change the V flag.


The four status bits NZCV are almost universal in modern processor designs. The exception is machines like the MIPS and DEC Alpha that have no status bits, but allow the result of a comparison to be computed into a register: on the MIPS, the instruction slt r2, r3, r4 sets r2 to 1 if r3 < r4 and to zero otherwise; this can be followed by a conditional branch that is taken is r0 is non-zero. On x86, the four flags are called SZOC, with S for sign and O for overflow, and for historical reasons there are additional flags P (giving the parity of the least-significant byte of the result) and A (used to implement BCD arithmetic). Also, the sense of the carry flag after a subtract instruction is reversed, so that it is set if there is a borrow, rather than the opposite. Consequently, the conditions for unsigned comparisons are all reversed, so that the x86 equivalent of bhs is the jae (jump if above or equal) instruction, which branches if the C flag is zero. The principles are the same, but the detailed implementation is different.


Is overflow possible in unsigned subtraction, and how can we test for it?

[math]\displaystyle{ }[/math]Overflow happens when the mathematically correct result \(\mathit{bin}(a) - \mathit{bin}(b)\) is not representable as \(bin(c)\) for any bit-vector \(c\). If \(\mathit{bin}(a) \ge \mathit{bin}(b)\) then the difference \(\mathit{bin}(a) - \mathit{bin}(b)\) is non-negative and no bigger than \(\mathit{bin}(a)\), so it is representable. It's only when \(\mathit{bin}(a) < \mathit{bin}(b)\), so the difference is negative, that the mathematically correct result is not representable. We can detect this case by looking at the carry bit: C = 0 if and only if the correct result is negative.

How can the overflow flag be computed?

Perhaps this is a question better answered next term, when we will focus on computing Boolean functions using logic gates. But let me answer it here anyway, focussing on an addition, because subtraction is the same as addition with a complemented input. The \(V\) flag should be set if the sign of the result is inconsistent with the signs of the two inputs to the addition. We can make a little truth table, denoting by \(a_{31}\) and \(b_{31}\) the sign bits of the two inputs, and by \(z_{31}\) the sign bit of the result.

\(a_{31}\) \(b_{31}\) \(z_{31}\) \(V\)
0 0 0 0 All positive
0 0 1 1 Positive overflows to negative
0 1 ? 0 Overflow impossible
1 0 ? 0 Overflow impossible
1 1 0 1 Negative overflows to positive
1 1 1 0 All negative

Various formulas can be proposed for \(V\): one possibility is \(V = (a_{31} \oplus z_{31}) \wedge (b_{31} \oplus z_{31})\), where \(\oplus\) denotes the exclusive-or operation.

There is a neater way to compute \(V\) if we have access to the carry input \(c_{31}\) and output \(c_{32}\) of the last adder stage. We can extend the truth table above, filling in \(c_{32} = {\it maj}(a_{31}, b_{31}, c_{31})\) and \(z_{31} = a_{31}\oplus b_{31}\oplus c_{31}\) as functions of the inputs to the last stage.

\(a_{31}\) \(b_{31}\) \(c_{31}\) \(c_{32}\) \(z_{31}\) \(V\)
0 0 0 0 0 0
0 0 1 0 1 1
0 1 0 0 1 0
0 1 1 1 0 0
1 0 0 0 1 0
1 0 1 1 0 0
1 1 0 1 0 1
1 1 1 1 1 0

By inspecting each row, we can see that \(V = c_{31} \oplus c_{32}\). We can derive this using the remarkable equation

\({\it maj}(a,b,c)\oplus x = {\it maj}(a\oplus x, b\oplus x, c\oplus x)\),

and the property \({\it maj}(a, b, 0) = a \wedge b\). For, subsituting for \(z_{31}\) in the previous formula for \(V\) and simplifying, we obtain

\(V = (b_{31}\oplus c_{31}) \wedge (a_{31}\oplus c_{31}) = {\it maj}(a_{31} \oplus c_{31}, b_{31}\oplus c_{31}, c_{31}\oplus c_{31}) = {\it maj}(a_{31}, b_{31}, c_{31})\oplus c_{31} = c_{32} \oplus c_{31}\).

Lecture 5