Lab one (Digital Systems)

This lab is built around a program, mostly written in C, that calls an assembly language subroutine to perform an arithmetic operation. As supplied, there are two versions of the subroutine – one that uses the machine's adds instruction to add two numbers, and another that contains a simple but slow loop that performs multiplication. Your task is to add other variations, such as subtraction (easy), a faster multiplication algorithm (moderate), or a fast division algorithm (tough).

Ideally, this lab requires a micro:bit. If you can't get hold of one, but do have a Linux machine of some kind, then it's possible to experiment with assembly language programming using the emulation software QEMU. See a later section of these instructions for details.

The lab1-asm directory of the lab materials contains the following files, some of them the same as the corresponding files seen before:

`Makefile`	Build script
`fmain.c`	Main program
`func.s`	Subroutine – single add instrution
`mul1.s`	Subroutine – simple multiplication loop
`fac.s`	Factorials with `mult` as a subroutine
`bank.s`	"Bank accounts" with a static array
`hardware.h`	Header file with layout of I/O registers
`lib.c, lib.h`	Library with implementation of `printf`
`startup.c`	Startup code
`nRF51822.ld`	Linker script
`debug`	Shell script for starting debugger
`qmain.c`	Alternative main program for use with QEMU

In particular, main.c is the main program, written in C, with a loop that prompts for two unsigned integers x and y, then calls a function func(x, y) and prints the result. There's no reason why this function couldn't be written in C, but instead two versions written in assembly language are provided, as a starting point for experiments in programming at the machine level.

To build an initial program, just use the command

$ make

and usual (or use Build>Make within Geany. As a default, this includes the subroutine from func.s that uses one instruction to add its two arguments, and produces a binary file func.hex. You can load the program into the micro:bit by dragging and dropping, or by using the command

$ cp func.hex /media/mike/MICROBIT

Now start minicom, reset the micro:bit, and expect an interaction like this one:

Hello micro:world!
Gimme a number: 3
Gimme a number: 4
func(3, 4) = 7
func(0x3, 0x4) = 0x7

Note that positive and negative decimal numbers are allowed as input, and also numbers written in hexadecimal with a leading "0x". The result returned by func(x, y) is interpreted both as a signed number printed in decimal, and as an unsigned number printed in hexadecimal. You are welcome to modify the main program if you wish to change this behaviour.

A second implementation of the function func(x, y) is provided in the file mul1.s: it computes the product of x and y using a simple loop. You can build a program containing this definition of func with the command,

$ make mul1.hex

and load it into the micro:bit with

$ cp mul1.hex /media/mike/MICROBIT

or a similar command. (Within Geany, with the mul1.s source file open, choose Build>Make me to build the program, and Build>Upload me to upload it.) The subroutine works well for simple examples like 2 * 3 = 6, but you will find that one of 10000000 * 2 and 2 * 10000000 is much slower than the other. You may also notice that the main program lights one of the LEDs on the micro:bit before calling func, and switches it off again when func returns, so that you can see how much time the subroutine is taking. We can use the LED signal together with an oscilloscope to get a precise timing for the subroutine, and you can try this in the lab if you like.

Two additional implementations of a subroutine called func are provided, drawing from examples in the lectures.

In fac.s, the subroutine func(x, y) returns the factorial of x, ignoring y. The subroutine calls another subroutine mult to do the multiplications needed to calculate the factorial.
In bank.s, a static array of 10 integers is allocated, much as if it were declared in C with

static int account[10];

The subroutine func(x, y) increments account[x] by y and returns its new value; the values in the array are remembered from one invocation of func to the next, as is normal for a static array.

You can make your own implementation of the function func as follows:

Copy an existing implementation to get the structure straight:

$ cp func.s sub.s

Now edit the copy sub.s to replace the adds instruction with something appropriate to your wishes – perhaps a subs instruction.
Use the following command to assemble your code and link it into a binary file.

$ make sub.hex

Load the resulting file into the micro:bit.

$ cp sub.hex /media/mike/MICROBIT

With Geany, as before, you can choose Build>Make me and {{

In writing your own versions of func, you should note the calling conventions of the ARM chip: failing to obey them may make the program go haywire. In particular:

The arguments x and y arrive in registers r0 and r1.
When the subroutine returns, whatever is left in register r0 is taken as the result.
The subroutine may modify the contents of registers r0 and r1, and also the contents of r2 and r3.
Unless it takes special steps to preserve their values, the subroutine should not modify the contents of other registers such as r4 to r7. It may be that the main program is keeping nothing special in those registers, in which case trashing them will do no harm, but that's not something we should rely on.
The subroutine body should not mess with the stack pointer sp, or the link register lr. The link register contains the code address to which the subroutine will return via the final instruction bx lr, and overwriting this will result in chaos.

Tasks

Different participants in the course will have different amounts of experience with low-level programming, so you should choose whichever of the following tasks you find both possible and illuminating. Each requires you to produce an assembly language file containing a definition of the function func. For the first few, you can work most easily by tinkering with the supplied file func.s, but for later tasks you may like to make a new file to preserve your work.

Try replacing the adds instruction with a different instruction – such and the subs instruction that subtracts instead of adding. Explain to yourself why a subtraction such as 2 - 5 seems to give a large positive result when it is interpreted as an unsigned number.
Explore other ALU operations such as the bitwise logical instructions ands, orrs and eors, or the shifts and rotates lsls, lsrs, asrs and rors.
Use shifts and adds to write a function that multiplies one of its arguments by a small constant, such as 10.
Write a function that multiplies two numbers using the built-in multiply instruction muls, and compare it for speed with the supplied code. Note that the Nordic chip includes the optional single-cycle multiplier, but other instances of Cortex-M0 may have a slower multiplier or (I believe) none at all.
Write a faster software implementation of multiplication, using a log-time algorithm.
Write a simple implementation of unsigned integer division.
Write an implementation of unsigned division that runs in a reasonable time.

Some of these tasks we will look at in the lectures; others may be mentioned on a problem sheet. I don't think the overlap matters much, because different settings are good for focussing on different things. You can fruitfully discuss with your tutor the range of algorithms you might use for one of the tasks, but it's pointless to spend tutorial time on the details of assembly language syntax. Such details are not important for the exam, but it helps understanding if you gain some sense of mastery of the technicalities.

Factorials are often used as an example of recursion, though a bad one because they can be computed with a simple loop. Rewrite the factorial program to use recursion and see how much worse it gets.
Factorials provide one way of computing binomial coefficients since choose(n, r) = n! / r! (n-r)!; others include filling in Pascal's triangle row by row, or using a recurrence such as

choose(n, r) = n/r * choose(n-1, r-1)

or

choose(n, r) = (n-r+1)/r * choose(n, r-1)

with suitable boundary conditions. Implement one or more of these methods, perhaps using the unsigned division routine you implemented earlier. Which method is the best, in terms of both speed and freedom from overflow?

Most of the new instructions (and derestricted versions of old ones) are encoded in 32 bits instead of 16 bits, with the first 16-bit half of the instruction coming from a region of the decoding chart that makes it illegal when read as a 16-bit Thumb instruction. As well as new instructions like the signed and unsigned divide instructions sdivs and udivs, you will find that arithmetic instructions can use all 16 registers, not just registers r0 to r7, they exist in variants without the s suffix that don't set the condition codes, and restrictions on the values of immediate constants are relaxed, so that you can write the instruction adds r0, r1, #100. Note that the "Thumb-2" code with mixed 16-bit and 32-bit instructions is different from the "Native" code where all instructions are 32 bits long, and the encoding is quite a bit simpler to decipher.

From one point of view, it's not worth becoming obsessive about trying to stick with the 16-bit instructions, because the 32-bit instructions are almost as fast, and a 32-bit instruction is certainly no slower than the pair of 16-bit instructions that might replace it. If you (or a compiler) write code that favours the low registers for simple tasks, then many of the instructions you want will happen to have a 16-bit encoding, and the assembler will automatically produce good binary code for your program. From another point of view, when learning about the machine we can notice that the 16-bit subset gives a good guide to the operations that are important: for example, the existence of 16-bit instructions for loads and stores with addressing relative to the stack pointer gives a clue that these operations are important for access to local variables in a subroutine.

As far as the exam is concerned, don't waste time trying to memorise the details of what instructions will fit in 16 bits and what fit in 32 bits, or trying to learn the full repertoire of 32-bit instructions. The exam paper will have the usual summary of Thumb code, and anything additional you need to know will be stated in the question. You will not be penalised for writing sensible instructions just because they happen not to have an encoding on one processor or another.}}

An alternative way of testing

The bigger ARM processors (like the one used in the Raspberry Pi) can execute code in Thumb mode – and there the fuss about the bottom bit of a code address being 1 for Thumb mode has some point to it. Nobody who programs the Pi seems to bother with Thumb mode, however, because with 1GB of memory or more, why worry about code size?

Nevertheless, this compatibility with bigger processors gives a convenient way to test fragments of Thumb assembly language, and the Makefile for this lab exercise is set up to support it. The trick is to compile a main program into native ARM code, assemble the test function into Thumb code, then link them together so that the Thumb code can be called from the ARM code. The resulting binary will then run on the emulator qemu-arm. Try this:

$ make q-func.elf
$ qemu-arm q-func.elf 23 34
func(23, 34) = 57
func(0x17, 0x22) = 0x39

The main program test.c is compiled into native ARM code, and expects to run under Linux. It looks for two numbers as command-line arguments, then calls the subroutine func that is separately assembled into Thumb code. It's important to include the .thumb_func directive in front of every function, or the assembler will fail to assemble it into Thumb code, or it will be invoked without switching the machine to Thumb state, and things will not go well. The main advantage of this way of testing is that the command to invoke qemu-arm can be part of a shell script, and then we can set up automated testing. That's something we'd also want to set up for code running on the micro:bit in a larger project.

In order to use this method on your own machine, you will need to install packages to support compiling code for bigger ARM processors that runs under Linux, and also to install the QEMU emulator to run the result (if you don't have a Raspberry Pi). On Debian and derivatives, something like

$ sudo apt-get install qemu-user gcc-arm-linux-gnueabihf

ought to be sufficient.

Connecting an oscilloscope

The main program from fmain.c measures the time taken to call the func() subroutine using a hardware clock integrated into the micro:bit's processor chip, but you may like to use an oscilloscope or logic analyser pod to measure the same time independently. The main program lights an LED before calling the subroutine and switches the LED off when it returns. We can get external access to an I/O line that is used to control the LED, and by that means time how long it is lit – even if that time is too short for the flash to be visible. A suitable signal is accessible on pin P3 of the board.

To connect the scope or logic analyser, it's convenient to plug the microbit into an edge connector breakout board that has header pins for each contact on the edge connector.

Connect the crocodile clip of the scope probe to ground, marked "0V" on the breakout board. Several pins at that end of the board are grounded, so there's little risk of shorts. Connect the probe tip to pin 3.
To capture the single event of the LED lighting, we need to inhibit the scope from always updating the trace, which it does by default so as not to appear frozen. Use the trigger menu to change the trigger mode from Auto to Normal. This should be shown in the top line of the display. (You can write to the manufacturer to ask why Normal is not the default, but don't expect a reply.)
Now set the horizontal scale to 1 microsec/division, and the vertical scale to 1 V/div on channel 1. Set the trigger level to about 1.5 V (it's not critical).
At this point, if you run your program, the scope should succeed in capturing the timing pulse.
Enable a measurement of the pulse length by activating the Measure menu, choosing Time, scrolling down to +Width by turning the GP knob, and pressing the knob to confirm.
As you adjust the trace, you'll see a blank where the previous acquisition contains no data. Run the program again to fill in the gaps.

Alternatively, you can connect one of the cheap logic analyser pods and use the PulseView software on the host PC to capture traces.

Connect the ground wire (white) of the logic analyser to "0V" on the micro:bit, and channel 0 of the logic analyser (black wire) to pin 3.
In PulseView, enable only channel 0, and set it as to trigger on a rising edge. Set a pre-trigger interval of (say) 5% so as to capture some data before the trigger event.
Set PulseView to acquire samples at a rate of 16MHz. Since this is the same as the clock frequency of the micro:bit, we cannot hope to measure the length of each pulse with perfect accuracy, and there is always the danger that the signal will be captured as high for one cycle too few or too many. PulseView starts to show individual samples once you zoom in far enough to see them.
To capture data for a short run, it's necessary to find by experiment a procedure for resetting the micro:bit and starting the logic analyser, so that it is not triggered by noise surrounding the reset. Try pressing reset and holding it, then starting the capture, then releasing the reset button.

The poor resolution of the logic analyser is slightly disappointing, but inevitable given that the pod contains a microcontoller (what else?) built with the same technology as the micro:bit. The oscilloscope, by way of contrast, has special acquisition circuitry using multiple A-to-D converters each with a precisely defined capture interval. This is expensive but capable of acquiring up to 2 gigasamples per second. We can make better use of the logic analyser by getting it to time longer events, such as a program that does multiple calculations or a loop that runs tens or hundreds of times. A single cycle difference in each iteration will then add up to a measurable change in the runtime.

Whether you use the scope or the logic analyser, one measurement of your program gives little information, but you can repeat the same calculation multiple times to see whether the pulse length is consistent, and then try changing parameters to see how they affect the result. On the V1 board, the observed change will be a whole number of clock cycles, each lasting 62.5 nanosec.

Alternatively, you could rewrite the driver fmain.c so that it prompts for inputs once, then repeats the calculation forever, producing a repetitive sequence of timing pulses that can be shown on the scope. Reset the program to use different inputs.

Lab one (Digital Systems)

Tasks

An alternative way of testing

Connecting an oscilloscope

Navigation menu

Search