Adventures with Arduino: From 41 Instructions to 1

A couple of weeks ago, I was looking for a fun project to do with my 8 year old son. By chance, I saw a recent issue of Make Magazine which had a feature about an electronics prototyping platform called Arduino. I thought it sounded pretty cool, and could be a good introduction for my son to some basic electronics as well as some simple programming.

The Arduino is designed to be very easy to use. It comes with a set of tools based on the Wiring programming environment and libraries. The board is controlled by an Atmel 8-bit microprocessor. In comparison to what I spend my days working with, this is a very simple processor. No vector registers, no floating point registers. Even something as simple as adding two 32-bit numbers requires three assembly instructions (add word, add byte with carry, add byte with carry).

While I was waiting for my Arduino to be delivered, I downloaded the IDE and compiled the simple example code to blink an LED once every second. The status window of the IDE dutifully printed out the compilation progress, and at the end, also printed out the executable size – 1018 bytes.

Since I’ve never worked on this platform before I wasn’t sure what the executable size should be, but that seemed a little large to me. So I started to dig around a little bit…

Warning: There’s a fair amount of code and assembly in this post, but I’ll do my best to explain it all!

Start Simple

The Blink program is really simple. It just sets a specific pin on the board to output mode, then continually turns that pin on and off every second.

void setup()
{
	pinMode(13, OUTPUT);
}

void loop()
{
	digitalWrite(13, HIGH);   // set the LED on
	delay(1000);              // wait for a second
	digitalWrite(13, LOW);    // set the LED off
	delay(1000);              // wait for a second
}

I thought I may as well start at the top, so I took a look at the pinMode function. Fortunately, the Arduino codebase is open source and comes as part of the install package.

The pinMode function just sets the data direction (i.e. input or output) of a particular pin on the microcontroller. Most of the pins on the ATmega328p chip are wired into the numbered pins on the Arduino board, but the mapping isn’t one-to-one.

void pinMode(uint8_t pin, uint8_t mode)
{
	uint8_t bit = digitalPinToBitMask(pin); // * Fetch
	uint8_t port = digitalPinToPort(pin); // * Fetch

	volatile uint8_t *reg;

	if (port == NOT_A_PIN) return;

	// JWS: can I let the optimizer do this?
	reg = portModeRegister(port); // * Fetch

	if (mode == INPUT) {
		uint8_t oldSREG = SREG;
		cli();
		*reg &= ~bit;
		SREG = oldSREG;
	} else {
		uint8_t oldSREG = SREG;
		cli();
		*reg |= bit;
		SREG = oldSREG;
	}
}

What Does It Do?

Here’s a quick overview of what this code does by line number:

3, 4 & 11: Look up which bit of which particular data direction register (DDR) needs to be set or unset.
14 & 15/19 & 20: Perform the first part of the disable interrupts dance – storing off the existing status register and clearing the global interrupt bit (cli).
16: Clear the DDR bit if the mode is input
21: Set the DDR bit if the mode is output
17 & 22: The rest of the interrupt dance – restore the status register to its previous state.

Some Thoughts

This code performs three fetches (marked) from program memory to convert the Arduino pin number into the correct data direction register and bit. Each fetch from program memory costs 3 cycles.
It needs to disable interrupts because setting or clearing the DDR bit requires a read/modify/write pattern. If an interrupt happened after the read but before the write, the handler might change that DDR but the portMode function would change it back.
It stores off the existing status register before disabling interrupts. I presume this is so that it can restore global interrupts to the previous state at the end of the function rather than arbitrarily turning them back on.
It’s interesting that the author (JWS) hints that there may be a faster way to do this in his comment…

What are we trying to do again?

After looking at all of this code, it’s easy to forget what the point of this function actually is. All it really does is set or clear one bit in a specific data direction register – that’s it! All the other code is just there for support.

Digging Deeper

It’s all very well looking at the source code, but this isnâ€™t what the microcontroller sees. In order to see that, we need to look at the result of the compilation. We can do this really easily by taking the elf that is produced and running it through the disassembler (avr-objdump) included with the toolchain.

avr-objdump -D Blink.elf > Blink.asm

You can also get objdump to interleave the original source code with the disassembly. I sometimes find this more confusing than the plain assembly since it can be misleading with where it places some of the source code.

avr-objdump -S Blink.elf > Blink.asm

Disassemble!

This might look intimidating at first, but please bear with it. The assembly proceeds in the same order as the original source code, and Iâ€™ve added comments to explain whatâ€™s going on. You can really dig much further into it by looking at the assembly instruction reference in the Atmel Microcontroller Guide, but it’s not necessary to understand the rest of this post.

000002d8 :
// Fetch the port bit mask
2d8:	48 2f       	mov	r20, r24
2da:	50 e0       	ldi	r21, 0x00	; 0
2dc:	ca 01       	movw	r24, r20
2de:	86 56       	subi	r24, 0x66	; 102
2e0:	9f 4f       	sbci	r25, 0xFF	; 255
2e2:	fc 01       	movw	r30, r24
2e4:	24 91       	lpm	r18, Z+
// Fetch the processor port
2e6:	4a 57       	subi	r20, 0x7A	; 122
2e8:	5f 4f       	sbci	r21, 0xFF	; 255
2ea:	fa 01       	movw	r30, r20
2ec:	84 91       	lpm	r24, Z+
// Return if the port is invalid
2ee:	88 23       	and	r24, r24
2f0:	c1 f0       	breq	.+48     	; 0x322 
// Fetch the DDR address
2f2:	e8 2f       	mov	r30, r24
2f4:	f0 e0       	ldi	r31, 0x00	; 0
2f6:	ee 0f       	add	r30, r30
2f8:	ff 1f       	adc	r31, r31
2fa:	e8 59       	subi	r30, 0x98	; 152
2fc:	ff 4f       	sbci	r31, 0xFF	; 255
2fe:	a5 91       	lpm	r26, Z+
300:	b4 91       	lpm	r27, Z+
// Check the mode and branch if necessary
302:	66 23       	and	r22, r22
304:	41 f4       	brne	.+16     	; 0x316 
// Clear the DDR bit and return
306:	9f b7       	in	r25, 0x3f	; 63
308:	f8 94       	cli
30a:	8c 91       	ld	r24, X
30c:	20 95       	com	r18
30e:	82 23       	and	r24, r18
310:	8c 93       	st	X, r24
312:	9f bf       	out	0x3f, r25	; 63
314:	08 95       	ret
// Set the DDR bit and return
316:	9f b7       	in	r25, 0x3f	; 63
318:	f8 94       	cli
31a:	8c 91       	ld	r24, X
31c:	82 2b       	or	r24, r18
31e:	8c 93       	st	X, r24
320:	9f bf       	out	0x3f, r25	; 63
322:	08 95       	ret
000002ce :
// Copy the parameters into registers
2ce:	8d e0       	ldi	r24, 0x0D	; 13
2d0:	61 e0       	ldi	r22, 0x01	; 1
// Call pinMode
2d2:	0e 94 6c 01 	call	0x2d8	; 0x2d8 
2d6:	08 95       	ret

The pin mode function is comprised of 38 assembly instructions. The call to pinMode from Setup call requires an additional 3 instructions – 2 to put the parameters into registers, and 1 to call the function. That’s a grand total of 41 instructions. That sounds like an awful lot of work just to set one bit.

Isnâ€™t there an easier way to set or clear a bit in a register? Of course there is. If you look at the assembly instruction reference, there’s actually a single instruction (sbi) that sets a specific bit in a register. There is a corresponding instruction (cbi) to clear that bit.

Improving pinMode

Ok, so we know in theory it could be more efficient, but how do we get the compiler to do this so that we don’t have to look at assembly all of the time? Aren’t there other things we need to worry about? What about interrupts? What about the memory fetches?

I made a few different changes to the pinMode function to try and reduce the instruction count. I’m going to go over them in the order I attempted them.

Removing One Fetch

As the old adage goes, two memory fetches are better than three. Well, perhaps it’s not an adage, but it’s true nonetheless.

The original code looks up the port number using the pin index, then uses the port number to look up the DDR address. We can skip that first lookup by creating a table that goes directly from pin index to DDR address.

You’re probably thinking, “but that’ll cost more memory”, and you’d be right. We’ll address that shortly though. First, let’s see what that does to the assembly.

000002e0 :
// Fetch the port bit mask
2e0:	a8 2f       	mov	r26, r24
2e2:	b0 e0       	ldi	r27, 0x00	; 0
2e4:	cd 01       	movw	r24, r26
2e6:	8e 53       	subi	r24, 0x3E	; 62
2e8:	9f 4f       	sbci	r25, 0xFF	; 255
2ea:	fc 01       	movw	r30, r24
2ec:	24 91       	lpm	r18, Z+
// Fetch the DDR address
2ee:	a4 58       	subi	r26, 0x84	; 132
2f0:	bf 4f       	sbci	r27, 0xFF	; 255
2f2:	8c 91       	ld	r24, X
2f4:	e8 2f       	mov	r30, r24
2f6:	f0 e0       	ldi	r31, 0x00	; 0
// Check the mode and branch if necessary
2f8:	66 23       	and	r22, r22
2fa:	31 f4       	brne	.+12     	; 0x308 
// Clear the DDR bit
2fc:	9f b7       	in	r25, 0x3f	; 63
2fe:	f8 94       	cli
300:	80 81       	ld	r24, Z
302:	20 95       	com	r18
304:	28 23       	and	r18, r24
306:	04 c0       	rjmp	.+8      	; 0x310 
// Set the DDR bit
308:	9f b7       	in	r25, 0x3f	; 63
30a:	f8 94       	cli
30c:	80 81       	ld	r24, Z
30e:	28 2b       	or	r18, r24
310:	20 83       	st	Z, r18
312:	9f bf       	out	0x3f, r25	; 63
// Return
314:	08 95       	ret

It’s definitely better – we’ve reduced the function size by 11 instructions, but we’ve also added another constant table of 20 bytes (not shown). Each assembly instruction is 2 bytes, so we have a net saving of only 2 bytes. Let’s do something about that.

Inlining

Inlining a function just means that the compiler puts the code directly into the caller site rather than jumping to a separate function. No function call means that we don’t pay the cost (three instructions in this case) of calling the function. It also allows the compiler to be more aggressive about optimizations.

Often, the compiler will inline functions in the same file automatically when it can. If you want to share that function, then you can put it in a header file and explicitly mark it as inline by including ‘inline’ in front of the function declaration.

The main drawback of inlining function calls is that it can make your code larger since duplicates of the same function can be scattered around the compiled code.

In this case, the code size for calling the non-inlined function just once is 3 instructions per call, plus 27 instructions for the function itself. Assuming we call the function N times, and the inlined function size is Y, then a rough guide for when it’s better (spacewise) to inline is when N * Y < N * 3 + 27.

Wouldnâ€™t it be nice if Y were 3 or less? Then itâ€™s always better to inline.

Let’s try it out for portMode and see how it affects the assembly size. Note that we’re now looking at the Setup function, since that’s the place where pinMode was being called from, hence the place where the inlined function now resides.

00000368 :
// Fetch the port bit mask
368:	e7 ea       	ldi	r30, 0xA7	; 167
36a:	f0 e0       	ldi	r31, 0x00	; 0
36c:	e4 91       	lpm	r30, Z+
// Set the DDR bit
36e:	9f b7       	in	r25, 0x3f	; 63
370:	f8 94       	cli
372:	84 b1       	in	r24, 0x04	; 4
374:	e8 2b       	or	r30, r24
376:	e4 b9       	out	0x04, r30	; 4
378:	9f bf       	out	0x3f, r25	; 63
// Return
37a:	08 95       	ret

Wow, thatâ€™s pretty small – just 9 instructions (not including the return for Setup) now… Where did all of the code go?

The compiler now knows exactly what the function parameters are, so it can go to town on optimizing the code for those particular parameters rather than having to support the general case.

It got rid of the branch when the pin index is out of bounds since it knows what the pin index is now.
It removed the entire part of the function to deal with the input mode – not needed.
It worked out which data direction register is needed by looking at the array and indexing into it at compile time.
Itâ€™s still loading the bit mask from program memory though…

Why could it resolve the DDR address at compile time, but not the bit mask? They’re both just arrays of bytes indexed by the pin.

The answer to this one is to do with where the arrays are defined. Because I made the DDR address array myself, I put it in the same file as pinMode. The bit mask array is in an entirely different file, so the compiler couldn’t index into it during compilation.

That’s pretty easy to fix – just move the array into the same file as pinMode.

00000368 :
// Set the DDR bit
368:	8f b7       	in	r24, 0x3f	; 63
36a:	f8 94       	cli
36c:	25 9a       	sbi	0x04, 5	; 4
36e:	8f bf       	out	0x3f, r24	; 63
// Return
370:	08 95       	ret

Great! No more memory fetches! But hang on… What happened to where it sets the DDR bit? It used to load the DDR register, ‘or’ in the mask, the write the DDR register back out.

Again, the compiler is being clever. It knows that the mask only ever contains one bit, so it can use the more optimal ‘sbi’ instruction.

Interrupts

If you remember, we needed the interrupt because of the fact that we were reading, modifying then writing to a register. But now we only have on instruction to set the bit.

The Atmel guide says that an instruction will finish before an interrupt is processed – even if that instruction takes multiple cycles (sbi takes 2 cycles). This means that we donâ€™t need to disable interrupts since you canâ€™t get an interrupt at a bad time.

Et Voila!

Removing interrupts removes three instructions, so thatâ€™s it. Weâ€™re down to a single instruction to set a bit, as it should be.

00000368 :
// Set the DDR bit
 368:	25 9a       	sbi	0x04, 5	; 4
// Return
 36a:	08 95       	ret

Summary

The code went from 41 instructions for a single call to just 1. What’s more, the savings increase the more frequently we call it. Not only is it smaller, it’s also faster.

I compiled and ran the Blink application using the modified pinMode function, and the program size went down to 936 bytes. Not bad for a quick bit of experimentation.

void pinMode(uint8_t pin, uint8_t mode)
{
	const uint8_t digital_pin_to_bit_mask[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20 };
	const uint8_t digital_pin_to_ddr[] = { 0x2A, 0x2A, 0x2A, 0x2A, 0x2A, 0x2A, 0x2A, 0x2A, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x27, 0x27, 0x27, 0x27, 0x27, 0x27 };

	uint8_t bit = digital_pin_to_bit_mask[pin];
	volatile uint8_t* reg = (volatile uint8_t*)digital_pin_to_ddr[pin];

	if (mode == INPUT)
		*reg &= ~bit;
	else
		*reg |= bit;
}

I think there’s the potential for performance optimizations such as these the elsewhere in the Arduino codebase. For example, digitalWrite has many of the same issues as pinMode, so the same solutions would apply.

I suspect that even more optimizations could be applied by putting more of an onus on the programmer to do things like remember to turn off pulse width modulation on a pin when it’s not needed.

On the other hand, I understand why some of the decisions have been made as they stand – to keep things simple. The nice thing is that the source code is provided, so it’s easy for those who want to look at the original code to change it to suit their needs.

I would definitely recommend having a look at the assembly for your Arduino programs every now and then. It gives you a good idea of what’s really happening at the hardware level. It can also give you insight into how you might be able to speed things up, or reduce program size.

The biggest lesson I learned throughout this is this: 8 year olds don’t care about optimizations and disassembly. They like flashing lights, buttons, sensors, motors and other cool stuff like that.

I should have known…

3 Thoughts to “Adventures with Arduino: From 41 Instructions to 1”

Sean

February 1, 2011 at 5:55 pm

Jacob must have gone bananas when you finally got down to one instruction. Best toy ever.
Anders

February 1, 2011 at 6:15 pm

Arduino libraries tend to be quite bloated and not really that concerned with performance, which might be fine considering its intended use. Just remember that AVR is a Harvard architecture, so data memory is more precious than program memory.
ftorama

August 27, 2011 at 2:43 pm

great job but sbi and cbi don’t work on all DDR registers. See your AVR datasheet to check this. For upper DDR registers, you need to do the read-modify-write procedure.

Comments are closed.