A couple of weeks ago, I was looking for a fun project to do with my 8 year old son. By chance, I saw a recent issue of Make Magazine which had a feature about an electronics prototyping platform called Arduino. I thought it sounded pretty cool, and could be a good introduction for my son to some basic electronics as well as some simple programming.
The Arduino is designed to be very easy to use. It comes with a set of tools based on the Wiring programming environment and libraries. The board is controlled by an Atmel 8-bit microprocessor. In comparison to what I spend my days working with, this is a very simple processor. No vector registers, no floating point registers. Even something as simple as adding two 32-bit numbers requires three assembly instructions (add word, add byte with carry, add byte with carry).
While I was waiting for my Arduino to be delivered, I downloaded the IDE and compiled the simple example code to blink an LED once every second. The status window of the IDE dutifully printed out the compilation progress, and at the end, also printed out the executable size – 1018 bytes.
Since I’ve never worked on this platform before I wasn’t sure what the executable size should be, but that seemed a little large to me. So I started to dig around a little bit…
Warning: There’s a fair amount of code and assembly in this post, but I’ll do my best to explain it all!
Start Simple
The Blink program is really simple. It just sets a specific pin on the board to output mode, then continually turns that pin on and off every second.
void setup() { pinMode(13, OUTPUT); } void loop() { digitalWrite(13, HIGH); // set the LED on delay(1000); // wait for a second digitalWrite(13, LOW); // set the LED off delay(1000); // wait for a second }
I thought I may as well start at the top, so I took a look at the pinMode function. Fortunately, the Arduino codebase is open source and comes as part of the install package.
The pinMode function just sets the data direction (i.e. input or output) of a particular pin on the microcontroller. Most of the pins on the ATmega328p chip are wired into the numbered pins on the Arduino board, but the mapping isn’t one-to-one.
void pinMode(uint8_t pin, uint8_t mode) { uint8_t bit = digitalPinToBitMask(pin); // * Fetch uint8_t port = digitalPinToPort(pin); // * Fetch volatile uint8_t *reg; if (port == NOT_A_PIN) return; // JWS: can I let the optimizer do this? reg = portModeRegister(port); // * Fetch if (mode == INPUT) { uint8_t oldSREG = SREG; cli(); *reg &= ~bit; SREG = oldSREG; } else { uint8_t oldSREG = SREG; cli(); *reg |= bit; SREG = oldSREG; } }
What Does It Do?
Here’s a quick overview of what this code does by line number:
- 3, 4 & 11: Look up which bit of which particular data direction register (DDR) needs to be set or unset.
- 14 & 15/19 & 20: Perform the first part of the disable interrupts dance – storing off the existing status register and clearing the global interrupt bit (cli).
- 16: Clear the DDR bit if the mode is input
- 21: Set the DDR bit if the mode is output
- 17 & 22: The rest of the interrupt dance – restore the status register to its previous state.
Some Thoughts
- This code performs three fetches (marked) from program memory to convert the Arduino pin number into the correct data direction register and bit. Each fetch from program memory costs 3 cycles.
- It needs to disable interrupts because setting or clearing the DDR bit requires a read/modify/write pattern. If an interrupt happened after the read but before the write, the handler might change that DDR but the portMode function would change it back.
- It stores off the existing status register before disabling interrupts. I presume this is so that it can restore global interrupts to the previous state at the end of the function rather than arbitrarily turning them back on.
- It’s interesting that the author (JWS) hints that there may be a faster way to do this in his comment…
What are we trying to do again?
After looking at all of this code, it’s easy to forget what the point of this function actually is. All it really does is set or clear one bit in a specific data direction register – that’s it! All the other code is just there for support.
Digging Deeper
It’s all very well looking at the source code, but this isn’t what the microcontroller sees. In order to see that, we need to look at the result of the compilation. We can do this really easily by taking the elf that is produced and running it through the disassembler (avr-objdump) included with the toolchain.
avr-objdump -D Blink.elf > Blink.asm
You can also get objdump to interleave the original source code with the disassembly. I sometimes find this more confusing than the plain assembly since it can be misleading with where it places some of the source code.
avr-objdump -S Blink.elf > Blink.asm
Disassemble!
This might look intimidating at first, but please bear with it. The assembly proceeds in the same order as the original source code, and I’ve added comments to explain what’s going on. You can really dig much further into it by looking at the assembly instruction reference in the Atmel Microcontroller Guide, but it’s not necessary to understand the rest of this post.
000002d8 : // Fetch the port bit mask 2d8: 48 2f mov r20, r24 2da: 50 e0 ldi r21, 0x00 ; 0 2dc: ca 01 movw r24, r20 2de: 86 56 subi r24, 0x66 ; 102 2e0: 9f 4f sbci r25, 0xFF ; 255 2e2: fc 01 movw r30, r24 2e4: 24 91 lpm r18, Z+ // Fetch the processor port 2e6: 4a 57 subi r20, 0x7A ; 122 2e8: 5f 4f sbci r21, 0xFF ; 255 2ea: fa 01 movw r30, r20 2ec: 84 91 lpm r24, Z+ // Return if the port is invalid 2ee: 88 23 and r24, r24 2f0: c1 f0 breq .+48 ; 0x322 // Fetch the DDR address 2f2: e8 2f mov r30, r24 2f4: f0 e0 ldi r31, 0x00 ; 0 2f6: ee 0f add r30, r30 2f8: ff 1f adc r31, r31 2fa: e8 59 subi r30, 0x98 ; 152 2fc: ff 4f sbci r31, 0xFF ; 255 2fe: a5 91 lpm r26, Z+ 300: b4 91 lpm r27, Z+ // Check the mode and branch if necessary 302: 66 23 and r22, r22 304: 41 f4 brne .+16 ; 0x316 // Clear the DDR bit and return 306: 9f b7 in r25, 0x3f ; 63 308: f8 94 cli 30a: 8c 91 ld r24, X 30c: 20 95 com r18 30e: 82 23 and r24, r18 310: 8c 93 st X, r24 312: 9f bf out 0x3f, r25 ; 63 314: 08 95 ret // Set the DDR bit and return 316: 9f b7 in r25, 0x3f ; 63 318: f8 94 cli 31a: 8c 91 ld r24, X 31c: 82 2b or r24, r18 31e: 8c 93 st X, r24 320: 9f bf out 0x3f, r25 ; 63 322: 08 95 ret 000002ce : // Copy the parameters into registers 2ce: 8d e0 ldi r24, 0x0D ; 13 2d0: 61 e0 ldi r22, 0x01 ; 1 // Call pinMode 2d2: 0e 94 6c 01 call 0x2d8 ; 0x2d8 2d6: 08 95 ret
The pin mode function is comprised of 38 assembly instructions. The call to pinMode from Setup call requires an additional 3 instructions – 2 to put the parameters into registers, and 1 to call the function. That’s a grand total of 41 instructions. That sounds like an awful lot of work just to set one bit.
Isn’t there an easier way to set or clear a bit in a register? Of course there is. If you look at the assembly instruction reference, there’s actually a single instruction (sbi) that sets a specific bit in a register. There is a corresponding instruction (cbi) to clear that bit.
Improving pinMode
Ok, so we know in theory it could be more efficient, but how do we get the compiler to do this so that we don’t have to look at assembly all of the time? Aren’t there other things we need to worry about? What about interrupts? What about the memory fetches?
I made a few different changes to the pinMode function to try and reduce the instruction count. I’m going to go over them in the order I attempted them.
Removing One Fetch
As the old adage goes, two memory fetches are better than three. Well, perhaps it’s not an adage, but it’s true nonetheless.
The original code looks up the port number using the pin index, then uses the port number to look up the DDR address. We can skip that first lookup by creating a table that goes directly from pin index to DDR address.
You’re probably thinking, “but that’ll cost more memory”, and you’d be right. We’ll address that shortly though. First, let’s see what that does to the assembly.
000002e0 : // Fetch the port bit mask 2e0: a8 2f mov r26, r24 2e2: b0 e0 ldi r27, 0x00 ; 0 2e4: cd 01 movw r24, r26 2e6: 8e 53 subi r24, 0x3E ; 62 2e8: 9f 4f sbci r25, 0xFF ; 255 2ea: fc 01 movw r30, r24 2ec: 24 91 lpm r18, Z+ // Fetch the DDR address 2ee: a4 58 subi r26, 0x84 ; 132 2f0: bf 4f sbci r27, 0xFF ; 255 2f2: 8c 91 ld r24, X 2f4: e8 2f mov r30, r24 2f6: f0 e0 ldi r31, 0x00 ; 0 // Check the mode and branch if necessary 2f8: 66 23 and r22, r22 2fa: 31 f4 brne .+12 ; 0x308 // Clear the DDR bit 2fc: 9f b7 in r25, 0x3f ; 63 2fe: f8 94 cli 300: 80 81 ld r24, Z 302: 20 95 com r18 304: 28 23 and r18, r24 306: 04 c0 rjmp .+8 ; 0x310 // Set the DDR bit 308: 9f b7 in r25, 0x3f ; 63 30a: f8 94 cli 30c: 80 81 ld r24, Z 30e: 28 2b or r18, r24 310: 20 83 st Z, r18 312: 9f bf out 0x3f, r25 ; 63 // Return 314: 08 95 ret
It’s definitely better – we’ve reduced the function size by 11 instructions, but we’ve also added another constant table of 20 bytes (not shown). Each assembly instruction is 2 bytes, so we have a net saving of only 2 bytes. Let’s do something about that.
Inlining
Inlining a function just means that the compiler puts the code directly into the caller site rather than jumping to a separate function. No function call means that we don’t pay the cost (three instructions in this case) of calling the function. It also allows the compiler to be more aggressive about optimizations.
Often, the compiler will inline functions in the same file automatically when it can. If you want to share that function, then you can put it in a header file and explicitly mark it as inline by including ‘inline’ in front of the function declaration.
The main drawback of inlining function calls is that it can make your code larger since duplicates of the same function can be scattered around the compiled code.
In this case, the code size for calling the non-inlined function just once is 3 instructions per call, plus 27 instructions for the function itself. Assuming we call the function N times, and the inlined function size is Y, then a rough guide for when it’s better (spacewise) to inline is when N * Y < N * 3 + 27.
Wouldn’t it be nice if Y were 3 or less? Then it’s always better to inline.
Let’s try it out for portMode and see how it affects the assembly size. Note that we’re now looking at the Setup function, since that’s the place where pinMode was being called from, hence the place where the inlined function now resides.
00000368 : // Fetch the port bit mask 368: e7 ea ldi r30, 0xA7 ; 167 36a: f0 e0 ldi r31, 0x00 ; 0 36c: e4 91 lpm r30, Z+ // Set the DDR bit 36e: 9f b7 in r25, 0x3f ; 63 370: f8 94 cli 372: 84 b1 in r24, 0x04 ; 4 374: e8 2b or r30, r24 376: e4 b9 out 0x04, r30 ; 4 378: 9f bf out 0x3f, r25 ; 63 // Return 37a: 08 95 ret
Wow, that’s pretty small – just 9 instructions (not including the return for Setup) now… Where did all of the code go?
The compiler now knows exactly what the function parameters are, so it can go to town on optimizing the code for those particular parameters rather than having to support the general case.
- It got rid of the branch when the pin index is out of bounds since it knows what the pin index is now.
- It removed the entire part of the function to deal with the input mode – not needed.
- It worked out which data direction register is needed by looking at the array and indexing into it at compile time.
- It’s still loading the bit mask from program memory though…
Why could it resolve the DDR address at compile time, but not the bit mask? They’re both just arrays of bytes indexed by the pin.
The answer to this one is to do with where the arrays are defined. Because I made the DDR address array myself, I put it in the same file as pinMode. The bit mask array is in an entirely different file, so the compiler couldn’t index into it during compilation.
That’s pretty easy to fix – just move the array into the same file as pinMode.
00000368 : // Set the DDR bit 368: 8f b7 in r24, 0x3f ; 63 36a: f8 94 cli 36c: 25 9a sbi 0x04, 5 ; 4 36e: 8f bf out 0x3f, r24 ; 63 // Return 370: 08 95 ret
Great! No more memory fetches! But hang on… What happened to where it sets the DDR bit? It used to load the DDR register, ‘or’ in the mask, the write the DDR register back out.
Again, the compiler is being clever. It knows that the mask only ever contains one bit, so it can use the more optimal ‘sbi’ instruction.
Interrupts
If you remember, we needed the interrupt because of the fact that we were reading, modifying then writing to a register. But now we only have on instruction to set the bit.
The Atmel guide says that an instruction will finish before an interrupt is processed – even if that instruction takes multiple cycles (sbi takes 2 cycles). This means that we don’t need to disable interrupts since you can’t get an interrupt at a bad time.
Et Voila!
Removing interrupts removes three instructions, so that’s it. We’re down to a single instruction to set a bit, as it should be.
00000368 : // Set the DDR bit 368: 25 9a sbi 0x04, 5 ; 4 // Return 36a: 08 95 ret
Summary
The code went from 41 instructions for a single call to just 1. What’s more, the savings increase the more frequently we call it. Not only is it smaller, it’s also faster.
I compiled and ran the Blink application using the modified pinMode function, and the program size went down to 936 bytes. Not bad for a quick bit of experimentation.
void pinMode(uint8_t pin, uint8_t mode) { const uint8_t digital_pin_to_bit_mask[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20 }; const uint8_t digital_pin_to_ddr[] = { 0x2A, 0x2A, 0x2A, 0x2A, 0x2A, 0x2A, 0x2A, 0x2A, 0x24, 0x24, 0x24, 0x24, 0x24, 0x24, 0x27, 0x27, 0x27, 0x27, 0x27, 0x27 }; uint8_t bit = digital_pin_to_bit_mask[pin]; volatile uint8_t* reg = (volatile uint8_t*)digital_pin_to_ddr[pin]; if (mode == INPUT) *reg &= ~bit; else *reg |= bit; }
I think there’s the potential for performance optimizations such as these the elsewhere in the Arduino codebase. For example, digitalWrite has many of the same issues as pinMode, so the same solutions would apply.
I suspect that even more optimizations could be applied by putting more of an onus on the programmer to do things like remember to turn off pulse width modulation on a pin when it’s not needed.
On the other hand, I understand why some of the decisions have been made as they stand – to keep things simple. The nice thing is that the source code is provided, so it’s easy for those who want to look at the original code to change it to suit their needs.
I would definitely recommend having a look at the assembly for your Arduino programs every now and then. It gives you a good idea of what’s really happening at the hardware level. It can also give you insight into how you might be able to speed things up, or reduce program size.
The biggest lesson I learned throughout this is this: 8 year olds don’t care about optimizations and disassembly. They like flashing lights, buttons, sensors, motors and other cool stuff like that.
I should have known…
Jacob must have gone bananas when you finally got down to one instruction. Best toy ever.
Arduino libraries tend to be quite bloated and not really that concerned with performance, which might be fine considering its intended use. Just remember that AVR is a Harvard architecture, so data memory is more precious than program memory.
great job but sbi and cbi don’t work on all DDR registers. See your AVR datasheet to check this. For upper DDR registers, you need to do the read-modify-write procedure.