Comparing Ubuntu 12.04 LTS performance between physical and virtual instances

In my last post I presented data that showed my virtualized Fedora 21.5 instance being faster than physical hardware running Ubuntu 12.04.

This was an interesting enough result to want to look further, but it’s not really a fair comparison. To be more equal, I need to at least run the workload on the same operating system version.

In this case, that means using two copies of Ubuntu 12.04 LTS, 64 bit.

Methodology:

Now, bear in mind that I’m not working in a corporate environment where I have easy access to multiple different, but equivalently specified platforms. I’m going to have to work with the systems I have, and that means VMs with occasional cross-checks with physical hardware.

In my experience, having precisely equivalent systems hasn’t been that important for my type of work: I’m looking for anomalous behavior, which generally results in large differences in performance. The signal-to-noise ratio is usually pretty high: We’re not normally looking at hard-to-detect ~3% performance variations with this type of problem.

Ultimately, I may end up doing 100% clean installs on both systems and benchmarking them side-by-side. I’ll probably know when I reach this point because I won’t be able to explain the differences any other way.

New data:

Here are the results of doing 20 iterations of the Linux kernel compilation on this new operating system version:

Screen Shot 2015-03-01 at 10.55.37 AM

The green histogram shows the new data. This new VM is substantially slower than any of the combinations I’ve tested so far.

In fact, if we consider median times, my Fedora 21.5 instance completes the kernel compilation job 750 seconds (12.5 minutes) faster than the Ubuntu 12.04 VM – that’s over 40% faster.

Equally, the VM of Ubuntu 12.04 is ~30% slower than Ubuntu 12.04 on physical hardware.

Note: I’m using the median time, not the average, because there may be an outlier in the new data off to the far right of the graph. Medians are not as skewed by outliers as averages are.

Note 2: I have not removed the suspected outlier, because it may equally be valid data. It was recorded under the same operating conditions as the other data, so it may possibly be something interesting – I’ll have to repeat the experiment and see if I get similar results.
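 
As an aside, here is a quick shell sketch for computing the median yourself, assuming the per-iteration times sit one per line in a hypothetical file called times.txt:

# sort the times numerically, then pick the middle value (or the mean of the middle pair)
sort -n times.txt | awk '{ a[NR] = $1 } END { if (NR % 2) print a[(NR + 1) / 2]; else print (a[NR / 2] + a[NR / 2 + 1]) / 2 }'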

Interpretation:

This sort of data is what makes performance analysis so interesting: I was initially surprised by my VM being faster than physical hardware. I’m now surprised by how slow my VM is with this version of Ubuntu.

It’s tempting to say “Well, VMs are just slower than real hardware, what did you expect?”, but that’s not quite accurate.

I could imagine that there is some overhead with using a hypervisor – especially a Type 2 hypervisor. However, I’ve never come across a clear analysis of how much performance really does degrade for different VMs on different platforms, so this counts more as common sense intuition than proven fact.

More importantly, the earlier Fedora 21.5 results show clearly that virtualized performance can be surprisingly good.

Next steps:

At this point, it seems that there’s something intrinsic about Ubuntu 12.04 that makes it run slowly on my VM when compared with physical hardware. I don’t know what it might be, so rather than theorizing, I need to find out.

It’s also possible that I got something slightly wrong in my 2nd hypervisor setup and that’s causing a performance problem – I’ll have to double check and run some more tests.

Assuming this result is reproducible, I’ll start to dive into what might be going on in my next set of posts. If not, it will be just as interesting to figure out what went wrong and why it has the impact it does!

More detailed performance comparison data

In the last post I said that my VM seems to be oddly faster than a physical machine for a kernel compilation workload.

For that post I used a fairly small benchmarking sample. This is fine for getting a feel for a problem, but there’s always a chance that the results may not be stable, and that the data may have turned out that way by chance.

I decided to burn some CPU time and do a larger benchmarking run. I did 20 iterations of the compilation process (from clean each time).

The Page Cache was cleared on the 1st iteration, but not for subsequent iterations.

I have discarded the 1st measurement, so we are only looking at cases where the page cache is hot, and almost certainly full of data related to the compilation job. This should reduce the impact of the I/O subsystem, at least for the read side of the workload.
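 
For reference, the page cache clear itself is the standard pair of commands, run as root before the first iteration:

sync
echo 3 > /proc/sys/vm/drop_caches      # drops the page cache plus dentries and inodes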

Obviously data will have to be written back to disk. That’s normally done asynchronously via the page cache, which acts as a write-back cache. The pages will be flushed to disk as necessary, reducing the impact of data writes. Data required from an earlier step in the compilation process will (probably) be satisfied from the page cache, rather than from disk.

Here’s the result:

Screen Shot 2015-02-26 at 9.42.53 AM

As you can see, the data is clearly bimodal. The fastest physical time is over 2 minutes slower than the slowest virtual time. On average, the physical system seems to be about 9% slower than the VM.

Not only is the physical system slower, but it’s less predictable. The VM data has a range of 27 seconds, while the physical data has a range of 71 seconds.

This is quite interesting because the VM is running under a type 2 hypervisor. A type 2 hypervisor appears to the host OS as just another process, and therefore has to share machine resources with other processes.

I would expect this setup to be less predictable, because I don’t know what other processes could be consuming machine resources while my benchmarking run progresses.

I’m going to start trying to figure out what’s going on next.

Performance differences between physical and virtual hardware

For the first set of posts on debugging, I analyzed one cause of high system time during compilation of the Linux kernel.

For this next set of posts, I’m going to dig into a funny observation I made while doing the system time work: My physical machine appears to be slower than my VM for doing the compilation workload.

My physical machine is a brand-new machine based on a quad-core Intel i5 processor. My VM is running on a 3 1/2 year old iMac, also based on a quad-core i5 processor. The VM is allocated only 2 processors, so on the face of it, the physical machine should be faster.

Let’s dive in and see if we can understand what might be going on.

Machine setup

Physical:

  • Late 2014 Dell Optiplex 7010 mini-tower
  • Quad core Intel i5 processor, with nominal speed of 3.2GHz
    • SKU is Core i5-3470
    • Model details from Intel here.
    • L1 cache size: 256KB
    • L2 cache size: 1MB
    • L3 cache size: 6MB
  • 4GB system RAM
    • 3.8GB available after integrated graphics RAM
  • 7,200 RPM Western Digital 250GB drive, communicating over a 3Gbps SATA link
  • Ubuntu Linux 12.04 LTS

Virtual:

  • Mid 2011 27″ iMac
  • Quad-core Intel i5 processor, 2.7GHz nominal speed, single socket
    • According to this site, the precise SKU is a Sandy Bridge i5-2500S
    • You can get the definitive part information from Intel here.
    • L2 cache size (per core): 256KB
    • L3 cache size (per socket): 6MB
  • 12 GB system RAM (4 banks of DDR3 at 1333MHz)
  • 7,200 RPM Western Digital 1TB drive, communicating over a 3Gbps SATA link
    • This is the only drive in the system
  • Mac OS X Yosemite (v10.10.2)
  • Hypervisor : VirtualBox 4.3.20

The virtual operating system is:

  • RedHat Fedora 21.5, 64bit
  • 4GB of RAM
  • 2 virtual processors
  • 80GB disk image maximum size (set to expand as necessary)
  • 64MB display RAM, 3D acceleration turned on
  • Virtualization extensions enabled
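 
As a sanity check on the VM configuration, the host-side VirtualBox CLI can confirm these settings. A hedged example (the VM name here is hypothetical, and the exact field names vary a little between VirtualBox versions):

# print the memory, CPU and VT-x settings VirtualBox has applied to the guest
VBoxManage showvminfo "Fedora 21.5" | egrep 'Memory size|Number of CPUs|VT-x'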

Doing a clean comparison between a virtualized guest instance and physical hardware is not that easy – there are a lot of major differences – but on the surface, the physical Linux system is not too different from the virtualized Linux system.

Given that VirtualBox is a type 2 hypervisor (and it doesn’t have a reputation for tremendous performance), I certainly wouldn’t expect it to outperform physical hardware. Still, as the data below shows, for this particular workload, it does.

Benchmarking methodology:

One of the cardinal rules of benchmarking is to always do the same measurement multiple times. Amongst other things, we need to take into account:

  • Cache warming effects at many levels in the system
  • Other housekeeping workloads running during the lifespan of the task
  • Unknown behavior caused by data arriving over the network interface
  • Random variations of unknown origin

Of course, the more isolated and predictable the benchmarking environment, the better. That’s not terribly easy with a type 2 hypervisor, because the host operating system could be doing all sorts of things that the guest has no visibility into.

To account for these issues, I run the compile job back-to-back three times. To address some of the caching issues, my controller script has the option to clear the Linux page cache between each iteration, or to only clear it once at the beginning.
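 
For the curious, here is a minimal sketch of what such a controller script might look like. This is an illustration rather than the actual script: the variable names, log file names and the GNU time invocation are all assumptions.

#!/bin/bash
# Sketch of a benchmarking controller: back-to-back compiles, with the option to
# drop the page cache before every iteration or only before the first one.
CLEAR_CACHE_EVERY_TIME=no      # set to "yes" to clear the cache between iterations
for i in 1 2 3; do
    if [ "$CLEAR_CACHE_EVERY_TIME" = "yes" ] || [ "$i" -eq 1 ]; then
        sync
        echo 3 > /proc/sys/vm/drop_caches     # needs root
    fi
    make clean > /dev/null
    /usr/bin/time -f "%e" -o "run_$i.time" make -j3 > "run_$i.log" 2>&1
done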

Three times isn’t really enough if I’m going to be formal about it. I’d need more samples in order to ensure that the results aren’t just coincidence. However, I’m not a statistician, so I’m not going to talk about something I don’t really know about.

Unscientific though this methodology may be, I’m seeing results that appear to be stable, so I think it’s good enough to be getting on with.

Restricting number of CPU cores:

In order to make the comparison more fair, I did turn off two of the CPU cores on the physical machine.

This can be done in two ways: Through the BIOS, or through super-user commands within the operating system. I used the latter, although I suspect that it may not have precisely the same effects as disabling CPUs in the BIOS.

Here’s how to quickly disable a CPU in Linux:

echo 0 > /sys/devices/system/cpu/cpu<n>/online

So, to turn off CPU cores 3 and 4, I issued the following commands (remember, the core numbering starts at zero):

echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online

This takes effect immediately and doesn’t require a reboot, which is handy.
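 
For completeness, here is how to check which cores the kernel currently considers online, and how to bring the two cores back afterwards (same sysfs interface, just writing a 1):

cat /sys/devices/system/cpu/online
echo 1 > /sys/devices/system/cpu/cpu2/online
echo 1 > /sys/devices/system/cpu/cpu3/online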

Compilation code base:

For this benchmarking, I was compiling v3.9 of the Linux kernel on both physical and virtual machines. The code is a vanilla git checkout of the linux-stable code base, not the Fedora or Ubuntu code base.

I’m using v3.9 rather than one of the higher versions because one of the internal APIs has changed a little in more modern releases, which causes a compilation error with the VirtualBox kernel modules.
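 
For reference, the checkout and build boil down to roughly the following, assuming the standard linux-stable mirror on kernel.org and with 'make defconfig' standing in for whatever kernel configuration was actually used:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git checkout v3.9
make defconfig      # placeholder configuration step
make clean
make -j3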

Benchmarking results:

Physical: 1905 secs, 1863 secs, 1884 secs

Virtual: 1784 secs, 1727 secs, 1742 secs

So, the absolute differences are as follows: 121 secs, 136 secs, 142 secs

The percentage improvements are roughly 7%, 8% and 8% (taking the time saved as a fraction of the virtual time, e.g. 136 / 1727 ≈ 8%).

So, my virtualized system, running on 3 1/2 year old hardware, is about 8% faster than a brand new machine.

If this was a bake-off between two similar physical systems, this would be enough to make a decision. As I mentioned earlier, a 10% difference is worth having in a professional environment where you run the same jobs again and again and again.

That the faster system is a virtualized instance, running on 3 1/2 year old hardware, inside an (apparently not terribly optimal) type 2 hypervisor is really rather surprising.

The next series of posts will explore this issue and try to come to a root cause – we’ll follow where the data takes us.

System time during kernel compile under IceWM

In the last post, I explored the system calls issued by the ‘gnome-shell’ process. It seems that the interface to the VirtualBox hypervisor may be consuming rather a lot of system time: I’ve regularly seen values of over 40% of CPU time being spent in the kernel.

This is a lot: Out of a dual core system, almost one entire CPU’s worth of time was being spent doing something other than my work. To put it in perspective, in professional computing, getting a 5% performance improvement is a big deal.

Running under IceWM:

Since I haven’t been able to get a lot of information about what ‘/dev/vboxuser’ actually does, I decided to investigate the performance differences experimentally.

I installed the IceWM window manager onto my VM and re-ran the kernel compile. You can see how to do this here:

http://forums.fedoraforum.org/showthread.php?t=282433

By the way: This article isn’t that clear about how you actually start the Window Manager. Under Fedora 21.5 all you need to do is log out, then click the little gear at the bottom of the logout panel to select the Window Manager you want to use.
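 
For what it’s worth, the install itself should be a one-liner, assuming IceWM is available in the standard Fedora repositories:

sudo yum install icewm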

It turns out that running under IceWM results in substantially lower system time consumption:

icewm_kernel_compile

The system time values can be as high as ~25%, but are often around 15%, with lows at around 11%. That’s a big improvement over what I was seeing under Gnome.

Double checking with ‘perf’:

The ‘top’ results are promising, but I’ll double check with the ‘perf’ toolchain:

First of all, with ‘perf top -a’, to get an instantaneous reading of the topmost symbols:

icewm_kernel_compile_perf_top

This is much better – the top symbols are still in the kernel, but the amount of time they are consuming is pretty modest – about 14%, compared with about 38% in my earlier post.

Now we can get a longer snapshot with the record / report mechanism:

perf record --call-graph dwarf -a -F 97 -- sleep 5

perf report

Here’s the output:

icewm_kernel_compile_perf_results

This is substantially different from what I saw earlier. There’s no mention of any VirtualBox symbols or ‘ioctl’ system calls, and the majority of symbols are user-space rather than kernel space.

Root cause?

As always, we need to figure out if we really have a root cause.

The switch between window managers is a fairly large change, but it’s not that big in the context of a full-sized computing environment. The kernel, the user-space code, X11 and the hypervisor are all the same.

Obviously, there is a large range of things that could have changed, but in this case I feel fairly comfortable about having identified the root cause.

Although I’m not an expert in either system, I suspect there is an integration between ‘gnome-shell’ and VirtualBox which is responsible for poor performance under this workload.

The integration is probably quite useful for desktop users: It could possibly be something like clean support for resizing the VM’s window, which is certainly nice to have.

In the past I’ve mentioned that I like cut-down window managers because they are so simple. It seems that this may be another data point in support of that choice.

If you’re old school enough to be happy with a bare-bones computing environment and just want to work efficiently, it may be worth getting rid of the bells and whistles.

Examining ‘gnome-shell’ behavior more closely

In the last post I noticed a correlation between activity in the 'gnome-shell' process and excessive consumption of system time during a kernel compile.

In this post I’m going to dig down a bit, to see what this process is doing and why it’s polling like it is.

What does ‘gnome-shell’ do?

I’m not an expert in windowing environments, but it appears that ‘gnome-shell’ is the core of the windowing system provided by Fedora 21. There’s a high level introduction here.

Since windowing systems are heavily event based (the window manager needs to react to mouse location, clicks and other windowing events), it’s to be expected that some part of it needs to listen for incoming events.

The question here is whether it’s contributing to the high system time observed during a kernel compile.

Experiment 1: Counting system calls under load

There’s one simple experiment I can do to see if activation in ‘gnome-shell’ increases under load: Attach ‘strace’ and count the number of system calls.

System calls will consume system time by definition: If I want the operating system to do something, I ask it via a system call. I’ll cause a context switch as I enter the system call and I’ll force the CPU to spend some time inside the kernel. This will show up as system time in tools like ‘top’.

Here’s the command I’ll use:

timeout 5 strace -c -p `pgrep -d, gnome-shell`

Here’s the output of a more-or-less idle machine:

strace_counts_idle

When I run the kernel compile using ‘make -j3’, I get the following results:

strace_kernel_compile_3_cores

So, I’m seeing substantially higher numbers of system calls, but the mix of system calls has changed too. I’m seeing these changes:

  • The highest proportion of time is being consumed by ‘ioctl’
  • There’s ~10x the number of ‘poll’ calls
  • The ‘munmap’ call starts showing up, curiously not matched by ‘mmap’ calls
  • The duration of each call has dropped, even though the system is busier

Time spent routing messages?

My first hypothesis is that the increased output was causing the windowing system to work harder.

The compilation processes will be generating output on STDOUT / STDERR, which presumably has to be routed to the appropriate window. You could imagine that would take more effort to do, and perhaps that would result in increased system time?

I think the data refutes this idea. In particular, the proportion of time we’re spending polling has apparently dropped. If the windowing system were spending its time routing messages, I’d expect that proportion to increase. Also, I’m seeing system time, and I’d expect this routing work to happen in user space.

Debugging ‘ioctl’

The big thing here is that ‘ioctl’ has seen a 6x increase in the time it consumes. We need to figure out what this call is doing.

This is actually quite easy to do. Vanilla Linux ‘strace’ actually has quite a large range of powerful options, including the ability to filter out specific system calls. Combine that with the ability to resolve file descriptors into file paths, and we should get a good idea of what ‘ioctl’ is doing.

strace -y -e trace=ioctl -p `pgrep -d, gnome-shell`

Here’s the output (abbreviated for clarity):

strace_ioctl

So, these time consuming ‘ioctl’ system calls are interacting with something to do with the VirtualBox hypervisor.

This is the first time that we’ve seen concrete evidence of a hypervisor involvement. This is good news, because a difference in system time consumption between physical and virtual systems is where this whole thread started.

So, what is this ‘/dev/vboxuser’ thing, and why is the windowing system driving it harder when a compile starts?

Unfortunately, there doesn’t seem to be much (any really!) information about what this device is used for. I’ve seen one reference to it handling host-to-guest clipboards and screen resizing, but that’s about it.
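 
In the absence of documentation, one small thing worth doing is checking which processes actually hold the device open (assuming ‘lsof’ is installed):

ls -l /dev/vbox*
sudo lsof /dev/vboxuser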

I have an IM in with the VirtualBox developers to see if I can get more information. I’ll update here if I learn more.

Alternative experiments:

In the absence of more concrete information, I can do a few more experiments:

  • Change the window manager to something not based on Gnome
  • Change the hypervisor to something not based on VirtualBox

These may tell me something, albeit obliquely, about what’s consuming the system time. I’ll cover these experiments in my next post.

Gnome-shell stuck in a tight polling loop

I now have a way to generate more useful stack trace information with the ‘perf’ toolchain.

It’s still not perfect though – I’m now a bit overwhelmed with the amount of data the tool generates.

How can I boil this stuff down to start to answer the question of what’s causing the suspected high system time when I compile the Linux kernel?

Sorting perf output:

It turns out there’s a very useful ‘sort’ mode for ‘perf report’. It seems to be somewhat similar to a ‘GROUP BY / ORDER BY’ expression in SQL – it appears to both sort and consolidate the data, which is very handy.

Here’s the new command line:

perf report -i perf.data.compile -s dso

And this is the output:

fedora_perf_report_dso_sort

The output’s a bit hard to read with my color choice, but the key thing here is that the numbers are starting to make some sense.

When I was running the compile, the user CPU time was hovering at ~75%, and ~25% was spent in system time. Those values are reflected here in the ‘cc1’ and ‘kernel.kallsyms’ lines.

I suspect, but don’t know for sure, that the reason we’re seeing so much time in ‘libc’ is because just about everything in Unix (or Linux) uses the C standard library, so it’s naturally going to show up as very high activation.
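 
For reference, the sort keys can also be combined, which groups the samples hierarchically. A hedged example (I haven’t checked every key name against this exact perf version):

perf report -i perf.data.compile -s comm,dso,symbol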

Drilling down:

Having gotten some values that look more believable, let’s drill down a bit and see what’s causing our system time problem.

Since I’m interested in system time, I’ll expand out the kernel symbols, as shown below:

fedora_perf_report_dso_sort_kernel_expanded

Now, this is a bit confusing. I’m seeing lots of good detail, but I’m not sure if the symbols are ranked according to how often they were sampled.

Maybe, now that I’m more confident in what the tool is telling me, I should go back to ‘perf top’ to get a reading of which symbols are hot as the compile progresses?
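 
If I do that, one option worth trying would be to restrict the samples to kernel mode, since that’s the time I’m chasing. I believe the usual ‘:k’ event modifier is accepted by ‘perf top’ as well, though I haven’t verified it on this exact version:

perf top -e cycles:k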

The plot thickens:

Now, this is where things get a bit strange.

This afternoon, I kicked off many compilation runs and got ~25% system time. This is high, but not awful, and I was starting to wonder if I was chasing something inconsequential.

This evening, I did exactly the same set of commands, and got easily 45% system time, and saw a very different set of symbols in the ‘perf top’ output:

fedora_perf_top3

What could have changed? If I run a ‘perf record’ session now, this is what I get:

fedora_perf_report_gnome_shell_symbols

All of a sudden, I see a bunch of IO related symbols showing up, and they are within kernel space. I have no idea why they might originate in the gnome-shell – that’s either a glitch, or it’s actually pretty interesting.

gnome-shell:

It turned out that there was a process called ‘gnome-shell’ which was consuming CPU resources all the time and was visible with vanilla ‘top’.

Using traditional ‘strace’ showed that the ‘gnome-shell’ process was sitting in a fairly tight loop running ‘poll’ on a file descriptor attached to a socket, identified in the output only by its inode number.

This behavior was not present in earlier sessions. If it had been, I’d have seen the ‘gnome-shell’ related symbols show up in the perf output.
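 
As an aside, a socket that strace only identifies by inode number can sometimes be mapped back to an endpoint by hand. A rough sketch (the PID lookup is illustrative, and <inode-number> is a placeholder):

# see which file descriptors point at sockets, and note the inode in the socket:[...] link
ls -l /proc/$(pgrep -o gnome-shell)/fd | grep socket
# then look that inode up in the Unix-domain socket table
grep <inode-number> /proc/net/unix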

Root cause or not?

Now, we need to be careful not to jump to conclusions, since modern operating systems are fiendishly complex.

The presence of a polling loop seems to correlate with higher-than-expected system time during a kernel compile. This may be a coincidence, but it bears more investigation.

I didn’t notice it during other periods of high system time, so it’s possible it’s a one-off glitch, or perhaps endemic to Fedora’s window management environment. It’s possible that there are multiple causes of high system time, so we need to be prepared to “peel the onion”.

The next post will examine the ‘gnome-shell’ process a bit more. We’ll conduct some experiments to see if we can be more confident in assigning a root cause.

Running the ‘perf’ toolchain under Fedora

In the last post, I finally decided that I couldn’t get the ‘perf’ toolchain to work properly under Ubuntu 14.04 LTS.

In this post, I’m going to look at doing the same debug under Fedora 21 (64bit).

Those of you who’ve followed this series so far should be very familiar with the basic steps to install the tools, so I’m not going to describe the setup in detail.

Getting the 1st round of data:

First of all, I’ll launch the kernel compile as usual:

make -j3

Then, when the system has settled a little, I’ll launch the recording tool:

perf record --call-graph dwarf -F 97 -a -o perf.data.compile -- sleep 5

Note : The ‘-F 97’ tells ‘perf’ to sample at 97Hz. This reduces the amount of CPU load and I/O bandwidth required. The odd choice of 97Hz (not 100Hz, for example) prevents system events from synchronizing too much.

Then, let’s see what we get from ‘perf report’:

fedora_perf_report_no_sort

So, this looks promising. But what’s ‘cc1’, and why do we have so many of them?

‘cc1’ is the actual C compiler that’s invoked by the ‘gcc’ toolchain. I don’t know for sure why we have so many, but I suspect that it’s because of two things:

  • There are lots of short-lived ‘cc1’ processes, each with their own PID
  • The sampling catches them all at different stages in their lifecycle, leading to different stack traces for each sample.

If we expand a few of these ‘cc1’ instances, this is what we get:

fedora_perf_report_no_sort_expanded

This supports my conjecture above – I’m seeing many, many possible stack traces for ‘cc1’.

This is interesting, but doesn’t really help me. I want to see why the compilation process seems to cause high system time, not snapshots of the control flow of ‘cc1’.

The next post will show how to make this data more useful.

Perf call stack unwinding under Ubuntu

This series of posts talks about tracking down a high system time problem when compiling the Linux kernel.

I’m using ‘perf’ to try to home in on where the time is going and why. Unfortunately, it seems that this toolchain is not that mature on Ubuntu.

The primary developers are from Red Hat, so they work mainly on Red Hat’s Linux distributions, “Fedora” and “Red Hat Enterprise Linux (RHEL)”.

Development environment:

If you’ve just joined this blog, here is my Linux development environment:

  • Mid 2011 27″ iMac, running OS X Yosemite
    • Quad-core Intel i5 processor, 64bit, with virtualization extensions
  • VirtualBox hypervisor, version 4.3.20
  • Guest operating system
    • Ubuntu 14.04 LTS, 64bit
    • Guest kernel version : 3.13.0-32-generic
    • Guest OS assigned 2 virtual CPUs
    • Guest OS assigned 4GBytes of RAM

Differences between 32bit and 64bit behavior:

I recently figured out how to run 64bit guest operating systems under VirtualBox.

Since 32bit and 64bit systems are quite different (even though the user experience is essentially identical), I thought I’d check to see if the high system time problem is still in evidence.

When I launch a kernel compile with ‘make -j3’ on my dual-core VM, this is what I see in the traditional ‘top’ command:

high_system_time_top

As you can see, I’m still getting pretty high system times. In this case, almost one entire CPU is spent dealing with kernel-space instructions, which is an awful lot.

It would be extremely nice to get to a full understanding of this, since my compiles might go quite a bit faster – assuming I don’t become I/O bound instead.

Output of running ‘perf’ under Ubuntu:

The first tool to run is ‘perf top’, since that’s somewhat equivalent to the traditional ‘top’ command. Here’s the output:

high_system_time_perf_top

It seems to be saying that ‘__do_page_fault’ is consuming most of the time in kernel-space.

This does actually make some sense, because compilation is very I/O intensive. It’s typically lots of little jobs that access a lot of small files. Under this workload, it’s plausible that the page faulting is due to the memory mapped IO paging in a lot of files.

Now, one caveat is that the symbol that’s taking most of the time does appear to have changed between the 32bit and 64bit versions of the same OS. For the 32bit version, the top symbol was ‘_raw_spin_unlock_irqrestore’.

Getting call stacks:

I have a hypothesis that the excessive system time is caused by memory mapped file I/O, at least on the 64bit OS.

In order to gain confidence in this hypothesis, it would be nice to see where this activation is coming from. The ‘perf’ toolchain supports a very nice feature which allows me to get the call stacks for each symbol, which should allow me to trace the sequence of calls back towards the source.

There are two steps to doing this:

  • perf record : captures data as a workload runs
  • perf report : analyzes and displays the captured data

So, I’ll kick off the compilation process as normal, with ‘make -j3’, then start the record process:

perf record -ga -- sleep 10

This will record for 10 seconds, capturing call stacks from all the CPUs in the system.

Once that’s done, I’ll suspend the compilation process, then run the report:

perf report -g

Note: On modern distributions, ‘perf report’ will launch with an interactive ‘curses’ style browser. If you’d prefer to get a traditional text-only output, add the ‘--stdio’ switch.

Here’s what I get (Note: I use red terminals for commands that must run as ‘root’):

perf_report_top

I can attempt to expand the ‘__do_page_fault’ symbol, but under Ubuntu 14.04 LTS, the expanded section is empty. That’s a bit suspicious: The highest activity symbol allegedly having no callers seems unlikely.

It turns out that the only symbol which has anything interesting in it is ‘page_fault’, below:

perf_report_expanded

This suggests that ‘memset’ is taking a disproportionate amount of resources. That’s conceivable if large amounts of memory are being allocated and then initialized with ‘memset’ (which is typical in C code).

OK, bearing in mind that I’m now a bit suspicious, I’ll expand the ‘memset’ symbol:

perf_report_expanded_memset

Hmmm. This doesn’t look good.

Firstly, what’s that zero doing there – it’s not a hex address, and even if it were, why are we vectoring through that address? We have a whole lot of 64bit hex addresses to deal with too, which suggests missing debug packages.

Finally, the math doesn’t add up – according to this, only about 7% of the time spent in ‘memset’ is actually accounted for.

Using the ‘dwarf’ unwinder:

There is another approach which I can use : The ‘dwarf’ unwinder.

In order for the default unwinder to walk the stack cleanly, all the code I want to trace needs to be compiled with frame pointers enabled. This is achieved with the gcc compile-time switch ‘-fno-omit-frame-pointer’.
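 
For illustration, this is roughly what the switch looks like on an ordinary compile (the file names here are hypothetical):

# keep frame pointers so a frame-pointer-based unwinder can walk the stack
gcc -O2 -g -fno-omit-frame-pointer -o myprog myprog.c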

Now, having this switch enabled on all the code is a bit of a pain – in something like a kernel compile, there are several major toolchains at work, and it turns out to be quite hard to get debug symbols installed for some of them.

The ‘dwarf’ unwinder mode helps side step this.

Perf is invoked as follows:

perf record -g --call-graph dwarf -a -F 97 -o perf.data.dwarf -- sleep 10

Note: The ‘-F 97’ forces ‘perf’ to sample at 97 Hz, which prevents the CPU and I/O from getting overwhelmed with data.

Unfortunately, the results are much the same – memset still expands into zeroes and 64bit hex addresses:

perf_report_dwarf

Kernel versions required to support ‘dwarf’ unwinding:

I think I’m right in saying that a kernel version of at least 3.9 is needed to support ‘dwarf’ unwinding.

People using lower kernel versions will need to continue to use ‘-fno-omit-frame-pointer’, or at least use debug symbol packages that are built with that option.

Next steps:

Well, I seem to have drawn a blank with using Perf under Ubuntu. It’s too early to say that the tools are broken – there are differences between the distributions, and it’s possible that I’m getting tripped up by something silly.

I think the next step is to try the same thing, but use a full RedHat distribution, such as Fedora. I’m not going to use CentOS because I don’t know that everything is built 100% the same, and also the kernel versions tend to be older.

Installing debug symbols for the GNU toolchain

The last post showed how to find the most used symbol in a running system with ‘perf top’.

I could tell that the ‘cc1’ tool contained the most used symbol, but since I don’t have the debug symbols installed for that tool, I couldn’t find the symbol name.

This post describes how to get the debug packages installed.

Installing ‘gcc’ debug symbols:

Since I know that my compilation system uses the GNU C toolchain, it seems that I should just be able to install the debug symbols.

I can search for an appropriate debug symbol package as follows (output has been edited for clarity):

nick>> sudo apt-cache search 'gcc-' | grep dbgsym
gcc-4.7-multilib-dbgsym - debug symbols for package gcc-4.7-multilib
gcc-4.7-dbgsym - debug symbols for package gcc-4.7
libgcc-4.8-dev-dbgsym - debug symbols for package libgcc-4.8-dev
gcc-4.8-dbgsym - debug symbols for package gcc-4.8
libx32gcc-4.8-dev-dbgsym - debug symbols

So, there are several gcc versions to choose from. It’s easy to find the current ‘gcc’ version:

gcc -v
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

OK, I’ll install the ‘gcc-4.8-dbgsym’ package:

sudo apt-get install gcc-4.8-dbgsym

Easy enough, right?

Unfortunately, I get this message:

apt-get install gcc-4.8-dbgsym
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
gcc-4.8-dbgsym : Depends: lib32gomp1-dbg (= 4.8.2-19ubuntu1) but it is not installable
Depends: libn32gomp1-dbg (= 4.8.2-19ubuntu1) but it is not installable
Depends: libhfgomp1-dbg (= 4.8.2-19ubuntu1) but it is not installable

OK, maybe I can fix this by forcing the package management system to reload its index:

sudo apt-get update
sudo apt-get install gcc-4.8-dbgsym

No luck. I’ll need to look a bit further.

Checking package availability:

There appears to be a problem with package dependencies, so the first thing we want to do is check if the package really is available.

The simplest way to get information on an Ubuntu package seems to be to do a google search. I know that I’m using Ubuntu ‘trusty’, so if I search on ‘ubuntu trusty libn32gomp1-dbg’, I’ll get to this page:

http://packages.ubuntu.com/trusty/debug/lib32gomp1-dbg

The problem here seems to be that the package is only available for 64 bit installs. I can tell this because in the ‘download’ section, the only architecture available is ‘amd64’.

Note : Even though I’m running on Intel CPUs (albeit Virtualized), I still want to use the ‘amd64’ packages. This is just labelling – AMD defined the 64 bit extensions to the x86 architecture first, so that’s how 64bit x86 code is referred to.
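 
The same check can be made from the command line, which avoids the web search. A hedged example against the dependency that failed above:

# check whether the package is known to the local index, and show its metadata if it is
apt-cache policy lib32gomp1-dbg
apt-cache show lib32gomp1-dbg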

Virtualization and 64bit operating systems:

I’ve been deliberately using a 32bit guest operating system inside my VirtualBox hypervisor. I’ve had problems bringing up a 64bit guest OS, even though my machine is 64bit and the processor has hardware virtualization (VT-x) extensions.

Now that I’ve figured out how to run 64bit guest operating systems, I should be able to side-step this problem.

Debugging high system time with ‘perf top’

In the last couple of posts I described a problem I encountered when compiling the Linux kernel on my VM.

In this post, I’m going to start to dig in and see what might be going on. I’m going to use the ‘perf’ toolkit, which is part of the Linux kernel.

Measure, measure, measure:

Effective performance optimization involves discovering the ‘hot’ part of the code, rather than guessing. Accurate measurement is key, because even informed opinion often turns out to be incorrect.

Graphing the data is important too: it’s easy to miss patterns in scrolling data that only become obvious once the data is visualized.

It’s possible to profile a user-space application with tools like ‘gprof’, but what happens when the time seems to be in the operating system?

System performance and ‘perf’:

The first tool most people turn to when there’s a performance problem is ‘top’. It’s a great tool and can tell me a lot – especially if I know how to interpret its output accurately.

The problem is that it’ll tell me which processes are consuming the most CPU, but not where inside those processes the time is going. That’s something that’s really hard to get without advanced debug tools.

Installing perf tools:

Even though ‘perf’ is part of the linux kernel, user-facing tools are needed to drive it. They can be installed easily as follows:

sudo apt-get install linux-tools-common
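 
One caveat: on Ubuntu the ‘perf’ binary itself usually lives in a kernel-version-specific package, so if ‘perf’ complains that it can’t find a matching version, something like the following should cover it (the exact package split varies between releases):

sudo apt-get install linux-tools-common linux-tools-$(uname -r)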

Perf top:

Once the tools are installed, simply start the workload. In this case, it’s easy:


cd <top of kernel source code>
make clean
make -j3

Then, run ‘perf top’. This will bring up an interactive interface that looks somewhat like traditional top. It’s a bit hard to capture text off it, so here’s the first few lines:

16.42% cc1 [.] 0x00116d1a
8.38% [kernel] [k] _raw_spin_unlock_irqrestore
8.28% [kernel] [k] read_tsc
6.45% [kernel] [k] finish_task_switch
4.37% libc-2.19.so [.] __memset_sse2

This shows that a symbol inside the ‘cc1’ process is taking the largest share of CPU time. That’s no surprise, because we know that the system is compiling C code, and ‘cc1’ is part of the GNU C compiler toolchain.

Now, there’s a problem here, because I’m seeing a hex address rather than a human-readable symbol name. It is possible to resolve the address into a symbol manually, but it’s easier to install the debug symbol packages.

What’s more interesting are the kernel symbols on lines 2 through 4. Together, these three symbols account for roughly 23% of the samples, more than the top user-space entry, which strongly suggests that this may be where the system time is going.

Next I’ll install the debug symbols for the C toolchain (it’s not as straightforward as it may seem, unfortunately), and then we’ll take a look at what these kernel symbols actually mean.