This series of posts talks about tracking down a high system time problem when compiling the Linux kernel.
I’m using ‘perf’ to try to home in on where the time is going and why. Unfortunately, it seems that this toolchain is not that mature on Ubuntu.
The primary developers are from Red Hat, so they work mainly on Red Hat's Linux distributions, Fedora and Red Hat Enterprise Linux (RHEL).
Development environment:
If you’ve just joined this blog, here is my Linux development environment:
- Mid 2011 27″ iMac, running OS X Yosemite
- Quad-core Intel i5 processor, 64bit, with virtualization extensions
- VirtualBox hypervisor, version 4.3.20
- Guest operating system:
  - Ubuntu 14.04 LTS, 64bit
  - Guest kernel version: 3.13.0-32-generic
  - Guest OS assigned 2 virtual CPUs
  - Guest OS assigned 4 GBytes of RAM
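For reference, those guest details are easy to confirm from inside the VM with a few standard commands (nothing perf-specific here):

uname -r        # kernel version, 3.13.0-32-generic on this guest
lsb_release -d  # distribution, Ubuntu 14.04 LTS
nproc           # number of virtual CPUs visible to the guest
free -m         # memory assigned to the guest, in MBytes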
Differences between 32bit and 64bit behavior:
I recently figured out how to run 64bit guest operating systems under VirtualBox.
Since 32bit and 64bit systems are quite different (even though the user experience is essentially identical), I thought I’d check to see if the high system time problem is still in evidence.
When I launch a kernel compile with ‘make -j3’ on my dual-core VM, this is what I see in the traditional ‘top’ command:

As you can see, I'm still getting pretty high system times. In this case, almost a full CPU's worth of time is being spent in kernel space, which is an awful lot.
It would be extremely nice to get a full understanding of this, since my compiles might go quite a bit faster – assuming I don't become I/O-bound instead.
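For completeness, here's a rough sketch of how I kick the build off, assuming I'm sitting at the top of an unpacked kernel source tree:

make defconfig    # start from a default configuration
time make -j3     # three jobs, slightly oversubscribing the two virtual CPUs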
Output of running ‘perf’ under Ubuntu:
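If 'perf' isn't installed yet, on Ubuntu it lives in the kernel tools packages, so something like this pulls in a version matched to the running kernel:

sudo apt-get install linux-tools-common linux-tools-$(uname -r)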
The first tool to run is ‘perf top’, since that’s somewhat equivalent to the traditional ‘top’ command. Here’s the output:

It seems to be saying that ‘__do_page_fault’ is consuming most of the time in kernel-space.
This does actually make some sense, because compilation is very I/O intensive. It's typically lots of little jobs that access a lot of small files. Under this workload, it's plausible that the page faults come from memory-mapped file I/O paging all of those files in.
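A quick, independent way to sanity-check that hypothesis is to simply count fault events while the compile is running – a sketch using perf's software counters:

sudo perf stat -e page-faults,minor-faults,major-faults -a -- sleep 10

If the minor fault count is huge while the major fault count stays small, that points at pages being mapped and touched for the first time rather than at real waits on the disk.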
Now, one caveat is that the symbol that's taking most of the time does appear to have changed between the 32bit and 64bit versions of the same OS. For the 32bit version, the top symbol was '_raw_spin_unlock_irqrestore'.
Getting call stacks:
I have a hypothesis that the excessive system time is caused by memory mapped file I/O, at least on the 64bit OS.
In order to gain confidence in this hypothesis, it would be nice to see where this activity is coming from. The 'perf' toolchain supports a very nice feature which allows me to get the call stacks for each symbol, which should allow me to trace the sequence of calls back towards the source.
There are two steps to doing this:
- perf record: captures data as a workload runs
- perf report: analyzes and displays the captured data
So, I’ll kick off the compilation process as normal, with ‘make -j3’, then start the record process:
perf record -ga -- sleep 10
This will record for 10 seconds, capturing call stacks from all the CPUs in the system.
Once that’s done, I’ll suspend the compilation process, then run the report:
perf report -g
Note: On modern distributions, 'perf report' will launch with an interactive 'curses' style browser. If you'd prefer to get a traditional text-only output, add the '--stdio' switch.
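Putting the whole sequence together, a rough sketch of the session looks like this (the build is backgrounded so it can be suspended and resumed with normal shell job control):

make -j3 &                          # start the kernel build in the background
sudo perf record -ga -- sleep 10    # sample all CPUs, with call graphs, for 10 seconds
kill -STOP %1                       # suspend the build while analyzing
sudo perf report -g --stdio         # text-only report; omit --stdio for the curses browser
kill -CONT %1                       # let the build carry on afterwards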
Here’s what I get (Note: I use red terminals for commands that must run as ‘root’):

I can attempt to expand the ‘__do_page_fault’ symbol, but under Ubuntu 14.04 LTS, the expanded section is empty. That’s a bit suspicious: The highest activity symbol allegedly having no callers seems unlikely.
It turns out that the only symbol which has anything interesting in it is ‘page_fault’, below:

This suggests that 'memset' is taking a disproportionate amount of resources. That's conceivable if large amounts of memory are being allocated and then initialized with 'memset' (which is typical in C code).
OK, bearing in mind that I’m now a bit suspicious, I’ll expand the ‘memset’ symbol:

Hmmm. This doesn’t look good.
Firstly, what’s that zero doing there – it’s not a hex address, and even if it were, why are we vectoring through that address? We have a whole lot of 64bit hex addresses to deal with too, which suggests missing debug packages.
Finally, the math doesn’t add up – according to this, only about 7% of the time spent in ‘memset’ is actually accounted for.
Using the ‘dwarf’ unwinder:
There is another approach which I can use : The ‘dwarf’ unwinder.
With perf's default frame-pointer-based unwinding, all the code I want to trace needs to be compiled with frame pointers enabled in order to unwind cleanly. This is achieved with the gcc compile-time switch '-fno-omit-frame-pointer'.
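For a user-space component I'm building myself, that just means adding the flag to the compile, e.g. (a toy example – 'hello.c' is hypothetical):

gcc -O2 -g -fno-omit-frame-pointer -o hello hello.c

The kernel has its own equivalent switch, the CONFIG_FRAME_POINTER option.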
Now, having this switch enabled on all the code is a bit of a pain – in something like a kernel compile, there are several major toolchains at work, and it turns out to be quite hard to get debug symbols installed for some of them.
The 'dwarf' unwinder mode helps sidestep this, because it unwinds using DWARF debug information and a copy of the user stack captured at sample time, rather than relying on frame pointers.
Perf is invoked as follows:
perf record -g --call-graph dwarf -a -F 97 -o perf.data.dwarf -- sleep 10
Note: The '-F 97' forces 'perf' to sample at 97 Hz, which keeps the volume of data (each dwarf sample includes a chunk of stack) from overwhelming the CPU and I/O; an odd frequency like 97 also avoids sampling in lockstep with any periodic activity in the system.
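Since the samples went to a non-default file this time, the report step has to be pointed at it with '-i':

sudo perf report -g -i perf.data.dwarf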
Unfortunately, the results are much the same – memset still expands into zeroes and 64bit hex addresses:

Kernel versions required to support ‘dwarf’ unwinding:
I think I'm right in saying that a kernel version of at least 3.9 is needed to support 'dwarf' unwinding.
People using older kernel versions will need to continue to rely on '-fno-omit-frame-pointer', or at least on debug symbol packages that were built with that option.
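Checking is straightforward:

uname -r    # 3.13.0-32-generic on this guest, comfortably newer than 3.9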
Next steps:
Well, I seem to have drawn a blank with using Perf under Ubuntu. It’s too early to say that the tools are broken – there are differences between the distributions, and it’s possible that I’m getting tripped up by something silly.
I think the next step is to try the same thing, but use a full RedHat distribution, such as Fedora. I’m not going to use CentOS because I don’t know that everything is built 100% the same, and also the kernel versions tend to be older.