Edwin Olson (eolson@mit.edu), April 2007
I develop on several embedded platforms. My current platform of choice is an ARM7TDMI device in the form of an LPC2378. I develop primarily in C, but also write assembly as necessary. These targets run small real-time operating systems with limited capabilities.
There are two things that I wish I could do on my embedded platform that I can do on a proper Linux host:

1. Detect memory corruption (e.g., with valgrind).
2. Profile my code (e.g., with gprof).
I've got no solution for the memory-corruption problem (#1) yet, and I'd love to hear from you if you've got an approach. (Some possibilities: gcc's -D_FORTIFY_SOURCE, porting libmudflap, etc.) A good simulator would also help out.
I do have an approach to profiling now, however. Really, this is a question of "what's the least amount of work I can do to implement a simple profiler?" Because this is an embedded system in which peripheral systems trigger all processing, simulation is hard to do: many peripherals would have to be faithfully simulated, as would the real-world events! In-situ profiling presents fewer hurdles.
Because the CPU is relatively limited (48 MIPS, give or take), any significant profiling overhead (or code instrumentation) is a show-stopper. These considerations led to my strategy:

1. Configure a hardware timer to generate a periodic interrupt.
2. In the interrupt handler, record the interrupted PC into a buffer.
3. Have a normal O/S task periodically send the buffered samples to a host PC.
4. Post-process the samples on the host into a human-readable report.
The implementations of #1 and #2 are strongly platform- and embedded-O/S-dependent, but they're also fairly easy to do! Below is my FIQ, for example, which uses the LPC timer 0 to generate an interrupt every 100k cycles (or thereabouts). Not shown, of course, is the modification to the exception vectors to point to on_profile_irq, and the normal O/S task which polls profile_out and sends the data to the host.
```c
#define PROFILE_SAMPLES 256

uint32_t profile_buffer[PROFILE_SAMPLES];
volatile uint32_t profile_out;

void profile_init()
{
    PCONP |= (1<<1);           // PCTIM0: power up timer 0
    T0PR   = 1;                // prescale divider
    T0TCR  = (1<<0);           // enable the counter
    T0MR0  = 9000;             // match register
    T0MCR  = (1<<0) | (1<<1);  // when we hit MR0, throw an interrupt and reset
    T0CTCR = 0;                // counter mode, por favor

    VICIntSelect |= (1<<4);    // make MR0 match a FIQ
    VICIntEnable |= (1<<4);
}

void on_profile_irq() __attribute__ ((interrupt ("FIQ")));

void on_profile_irq()
{
    uint32_t addr;

    asm volatile("mov %0, lr \r\n" : "=r"(addr));

    // lr_fiq points 4 bytes past the interrupted instruction,
    // so subtract 4 to recover the sampled PC.
    if (profile_out < PROFILE_SAMPLES)
        profile_buffer[profile_out++] = addr - 4;

    T0IR |= (1<<0);            // ack the interrupt
}
```
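The draining task mentioned above is not shown in the original; here is a rough sketch of what its core might look like. The `profile_flush()` name is my own placeholder, and printf() stands in for whatever routine actually sends bytes to the host (a debug UART, say); in the real firmware this would be called in a loop from a normal O/S task:

```c
#include <stdint.h>
#include <stdio.h>

#define PROFILE_SAMPLES 256

uint32_t profile_buffer[PROFILE_SAMPLES];
volatile uint32_t profile_out;

/* Flush any accumulated PC samples to the host as hex text, one
   address per line, then re-arm the buffer. Returns the number of
   samples flushed. */
uint32_t profile_flush(void)
{
    uint32_t n = profile_out;   /* snapshot; the FIQ may still be appending */

    for (uint32_t i = 0; i < n; i++)
        printf("%08x\n", (unsigned)profile_buffer[i]);

    profile_out = 0;            /* samples that land between the snapshot
                                   and this store are simply dropped */
    return n;
}
```

Dropping the occasional sample in the snapshot/reset window is harmless for statistical profiling, which keeps the task free of any locking against the FIQ.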
Running this (slightly) modified firmware and logging the resulting PC samples yields a file of addresses, one per line. It looks like this:
000029bc
000038ec
00005444
00003738
00003528
000038d4
00005424
0000371c
00005458
00003764
00002a0c
00006c24
0000546c
00003918
00006bf0
00003680
000038d0
Each line represents a single sample of the PC. Because I used a FIQ rather than an IRQ, and because FIQs are enabled during IRQs, I can profile time spent in ISRs as well as user code. This is great! I can easily get a thousand samples per second (or so) with no significant load on the CPU.
My solution for processing the data is somewhat inelegant, but it was easy and produces useful and easy-to-interpret results.
I use the standard binutils program "addr2line" to translate each PC sample into a tuple of (source code file, line number, function name). Addr2line is included in all standard GNU toolchains, and will read our file of addresses and produce a new file of tuples. Suppose the file containing PC samples is called "samples.log". I run the following command:
% cat /tmp/samples.log | arm-elf-addr2line -e main.out -fi > /tmp/profile.dat
Note: it is important to compile your code with -g so that debugging information is included in the binary. It is not necessary to disable compile-time optimizations, which is good, since profiling unoptimized code would be much less meaningful!
Addr2line, with the options -fi, produces two lines for every sample: the function name, followed by filename:line number. The output of addr2line looks like:
scheduler_add
../nkern/core/scheduler.c:25
nkern_utime
../nkern/platforms/lpc2378/platform.c:61
_nkern_schedule
../nkern/core/nkern.c:393
scheduler_add
../nkern/core/scheduler.c:40
_nkern_schedule
../nkern/core/nkern.c:314
nkern_in_irq
../nkern/include/nkern.h:131
nkern_yield
../nkern/core/nkern.c:188
sleep_queue_remove_head_if_sleep_done
../nkern/core/sleep_queue.c:43
nkern_utime
../nkern/platforms/lpc2378/platform.c:63
_userspace_reschedule
../nkern/archs/arm7tdmi/arch.c:35
scheduler_pick
../nkern/core/scheduler.c:54
The next step is a short application that reads the output of addr2line (written to a file named "profile.dat") and produces summary data. The program simply tallies the number of hits on each function and on each line number, producing an annotated listing giving the amount of CPU time spent on each line of source code. You can easily write such an application yourself, or use mine, Prof.java. My application takes the samples.log file as the first argument; additional arguments are added to the search path used to locate source code.
The most important task is to produce useful output that makes it easy to see where CPU time is being spent. My program sorts functions by CPU usage and outputs the source code of each function (from most CPU time to least), with each line labelled with its CPU usage.
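As a sketch of the tallying step (this is not Prof.java; the names and structure here are invented for illustration), a minimal C version that counts per-function hits from addr2line -fi output might look like this. It consumes the alternating function / file:line pairs shown above and sorts functions by hit count:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FUNCS 512

struct tally { char name[64]; int hits; };

struct tally funcs[MAX_FUNCS];
int nfuncs;

/* Record one hit against a function name, creating an entry on
   first sight. Linear search is plenty for a few hundred symbols. */
void tally_hit(const char *name)
{
    for (int i = 0; i < nfuncs; i++) {
        if (strcmp(funcs[i].name, name) == 0) {
            funcs[i].hits++;
            return;
        }
    }
    if (nfuncs < MAX_FUNCS) {
        snprintf(funcs[nfuncs].name, sizeof funcs[nfuncs].name, "%s", name);
        funcs[nfuncs].hits = 1;
        nfuncs++;
    }
}

int by_hits_desc(const void *a, const void *b)
{
    return ((const struct tally *)b)->hits - ((const struct tally *)a)->hits;
}

/* Consume addr2line -fi output: alternating lines of function name
   and file:line. Only the function lines are tallied here; per-line
   tallies would use the file:line lines the same way. */
void tally_profile(FILE *in)
{
    char line[256];
    int lineno = 0;

    while (fgets(line, sizeof line, in)) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip newline */
        if (lineno++ % 2 == 0)
            tally_hit(line);
    }
    qsort(funcs, nfuncs, sizeof funcs[0], by_hits_desc);
}
```

The division of each function's hit count by the total sample count gives the fraction of CPU time attributed to that function.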
Here is some sample output from my program. I'd be very interested in any improvements, so please send 'em to me!
There are some serious limitations to my current implementation. Some could be easily fixed; others, not so much.
The output would be far more useful with backtrace information: i.e., the sequence of calls that led to each sample. This data is very difficult to harvest from an ARM core: the calling convention differs between Thumb mode and ARM32 mode, and it is not obvious whether the FIQ interrupted user code or an IRQ. All of these cases would lead to different code paths and would make the FIQ much more costly. Because of these complications, I have not pursued this. It would also be necessary to limit the maximum backtrace depth to, say, 10.
But if I had backtrace information, I would modify the "samples.log" file format to contain multiple addresses per line. Each line would correspond to a single sample, and successive addresses on each line would represent the caller of the previous address. This file couldn't be piped directly through addr2line, but only a short wrapper would be needed. Then, for each function, I would tally each trace that led to the current sample. When outputting the data for each function, I would print the top N traces that led to the execution of the function.
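A sketch of parsing that hypothetical extended format (the function name is mine, and the format itself is only proposed above): each line is split into addresses, with the sampled PC first and successive callers after it, ready to be fed to addr2line or tallied as a call path.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse one sample line of the proposed multi-address format into
   an array of addresses. addrs[0] is the sampled PC; addrs[1..] are
   successive callers. Returns the number of addresses parsed (up to
   max). */
int parse_trace_line(const char *line, uint32_t *addrs, int max)
{
    const char *p = line;
    char *end;
    int n = 0;

    while (n < max) {
        unsigned long v = strtoul(p, &end, 16);  /* skips leading spaces */
        if (end == p)
            break;                               /* no more hex tokens */
        addrs[n++] = (uint32_t)v;
        p = end;
    }
    return n;
}
```

Tallying identical caller chains then reduces to comparing these arrays (or simply the raw text lines) for equality and counting duplicates per leaf function.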
I'm keen on feedback. Send me a note at eolson@mit.edu.