More detailed performance comparison data

In the last post I said that my VM seems to be oddly faster than a physical machine for a kernel compilation workload.

For that post I used a fairly small benchmarking sample. This is fine for getting a feel for a problem, but there’s always a risk that the results aren’t stable, and that the data simply turned out that way by chance.

I decided to burn some CPU time and do a larger benchmarking run. I did 20 iterations of the compilation process (from clean each time).

The page cache was cleared before the 1st iteration, but not for subsequent iterations.

I have discarded the 1st measurement, so we are only looking at cases where the page cache is hot, and almost certainly full of data related to the compilation job. This should reduce the impact of the I/O subsystem, at least for the read side of the workload.
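For reference, clearing the page cache from a script uses the standard ‘drop_caches’ mechanism (this needs root):

# Flush dirty pages to disk first, then drop the page cache,
# plus dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches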

Obviously, data will still have to be written back to disk. That’s normally done asynchronously via the page cache, which acts as a write-back cache: the pages will be flushed to disk as necessary, reducing the impact of data writes. Data required from an earlier step in the compilation process will (probably) be satisfied from the page cache, rather than from disk.

Here’s the result:

[Chart: kernel compile times, physical vs virtual, over the 19 hot-cache iterations]

As you can see, the data is clearly bimodal. The fastest physical time is over 2 minutes slower than the slowest virtual time. On average, the physical system seems to be about 9% slower than the VM.

Not only is the physical system slower, but it’s less predictable. The VM data has a range of 27 seconds, while the physical data has a range of 71 seconds.
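For the record, those summary numbers fall out of a quick awk one-liner (assuming a file with one elapsed time, in seconds, per line; ‘times.txt’ is just a placeholder name):

# Compute the mean and range of a column of timings
awk '{ s += $1; if (NR == 1 || $1 < min) min = $1; if ($1 > max) max = $1 }
     END { printf "mean=%.1f secs, range=%.1f secs\n", s/NR, max - min }' times.txt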

This is quite interesting because the VM is running under a type 2 hypervisor. Type 2 hypervisors appear to the host OS as just another process, and therefore have to share machine resources with other processes.

I would expect this setup to be less predictable, because I don’t know what other processes could be consuming machine resources while my benchmarking run progresses.

I’m going to start trying to figure out what’s going on next.

Performance differences between physical and virtual hardware

For the first set of posts on debugging, I analyzed one cause of high system time during compilation of the Linux kernel.

For this next set of posts, I’m going to dig into a funny observation I made while doing the system time work: My physical machine appears to be slower than my VM for doing the compilation workload.

My physical machine is a brand-new machine based on a quad-core Intel i5 processor. My VM is running on a 3 1/2 year old iMac, also based on a quad-core i5 processor. The VM is allocated only 2 processors, so on the face of it, the physical machine should be faster.

Let’s dive in and see if we can understand what might be going on.

Machine setup

Physical:

  • Late 2014 Dell Optiplex 7010 mini-tower
  • Quad core Intel i5 processor, with nominal speed of 3.2GHz
    • SKU is Core i5-3470
    • Model details from Intel here.
    • L1 cache size : 256KB
    • L2 cache size : 1MB
    • L3 cache size : 6MB
  • 4GB system RAM
    • 3.8GB available after integrated graphics RAM
  • 7,200 RPM Western Digital 250GB drive, communicating over a 3Gbps SATA link
  • Ubuntu Linux 12.04 LTS

Virtual:

  • Mid 2011 27″ iMac
  • Quad-core Intel i5 processor, 2.7GHz nominal speed, single socket
    • According to this site, the precise SKU is a Sandy Bridge i5-2500S
    • You can get the definitive part information from Intel here.
    • L2 cache size (per core): 256KB
    • L3 cache size (per socket) : 6MB
  • 12 GB system RAM (4 banks of DDR3 at 1333MHz)
  • 7,200 RPM Western Digital 1TB drive, communicating over a 3Gbps SATA link
    • This is the only drive in the system
  • Mac OS X Yosemite (v10.10.2)
  • Hypervisor : VirtualBox 4.3.20

The virtual operating system is:

  • Red Hat Fedora 21.5, 64-bit
  • 4GB of RAM
  • 2 virtual processors
  • 80GB disk image maximum size (set to expand as necessary)
  • 64MB display RAM, 3D acceleration turned on
  • Virtualization extensions enabled
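As an aside, these settings can be confirmed from the host with VirtualBox’s command line tool (the VM name here is a placeholder for whatever you called yours):

VBoxManage showvminfo "Fedora21" --details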

Doing a clean comparison between a virtualized guest instance and physical hardware is not that easy (there are a lot of major differences), but on the surface, the physical Linux system is not too different from the virtualized Linux system.

Given that VirtualBox is a type 2 hypervisor (and it doesn’t have a reputation for tremendous performance), I certainly wouldn’t expect it to outperform physical hardware. Still, as the data below shows, for this particular workload, it does.

Benchmarking methodology:

One of the cardinal rules of benchmarking is to always do the same measurement multiple times. Amongst other things, we need to take into account:

  • Cache warming effects at many levels in the system
  • Other housekeeping workloads running during the lifespan of the task
  • Unknown behavior caused by data arriving over the network interface
  • Random variations of unknown origin

Of course, the more isolated and predictable the benchmarking environment, the better. That’s not terribly easy with a type 2 hypervisor, because the host operating system could be doing all sorts of things that the guest has no visibility into.

To account for these issues, I run the compile job back-to-back three times. To address some of the caching issues, my controller script has the option to clear the Linux page cache between each iteration, or to only clear it once at the beginning.
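In outline, the controller logic looks something like this (a simplified sketch rather than the script verbatim; ‘times.log’ and the ‘make -j3’ invocation are stand-ins):

#!/bin/bash
# Simplified sketch of the benchmarking controller (run as root).
# Usage: ./bench.sh <iterations> [clear-each]
ITERS=${1:-3}
CLEAR_EACH=${2:-}    # pass "clear-each" to drop the page cache every run

for i in $(seq 1 "$ITERS"); do
    if [ "$i" -eq 1 ] || [ "$CLEAR_EACH" = "clear-each" ]; then
        sync
        echo 3 > /proc/sys/vm/drop_caches
    fi
    make clean > /dev/null
    # GNU time appends the wall-clock seconds for each run to times.log
    /usr/bin/time -f "iteration $i: %e secs" -a -o times.log \
        make -j3 > /dev/null 2>&1
done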

Three times isn’t really enough if I’m going to be formal about it. I’d need more samples in order to ensure that the results aren’t just coincidence. However, I’m not a statistician, so I’m not going to talk about something I don’t really know about.

Unscientific though this methodology may be, I’m seeing results that appear to be stable, so I think it’s good enough to be getting on with.

Restricting number of CPU cores:

In order to make the comparison more fair, I did turn off two of the CPU cores on the physical machine.

This can be done in two ways: Through the BIOS, or through super-user commands within the operating system. I used the latter, although I suspect that it may not have precisely the same effects as disabling CPUs in the BIOS.

Here’s how to quickly disable a CPU in Linux:

echo 0 > /sys/devices/system/cpu/cpu<n>/online

So, to turn off CPU cores 3 and 4, I issued the following commands (remember, the core numbering starts at zero):

echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online

This takes effect immediately and doesn’t require a reboot, which is handy.
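You can confirm the change took effect, and put the cores back when you’re done:

# Shows the logical CPUs that are currently online, e.g. "0-1"
cat /sys/devices/system/cpu/online

# Re-enable the cores after benchmarking
echo 1 > /sys/devices/system/cpu/cpu2/online
echo 1 > /sys/devices/system/cpu/cpu3/online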

Compilation code base:

For this benchmarking, I was compiling v3.9 of the Linux kernel on both physical and virtual machines. The code is a vanilla git checkout of the linux-stable code base, not the Fedora or Ubuntu code base.

I’m using v3.9 rather than one of the newer releases because one of the internal APIs has changed a little in more modern kernels, which causes a compilation error when building the VirtualBox kernel modules.
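For anyone following along, getting that exact tree looks like this (‘make defconfig’ is just a placeholder for whatever kernel configuration you actually use):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git checkout v3.9
make defconfig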

Benchmarking results:

Physical: 1905 secs, 1863 secs, 1884 secs

Virtual: 1784 secs, 1727 secs, 1742 secs

So, the absolute differences are as follows: 121 secs, 136 secs, 142 secs

Relative to the virtual times, those are improvements of: ~7%, ~8%, ~8%

So, my virtualized system, running on 3 1/2 year old hardware, is about 8% faster than a brand-new machine.

If this was a bake-off between two similar physical systems, this would be enough to make a decision. As I mentioned earlier, a difference of this size is worth having in a professional environment where you run the same jobs again and again.

That the faster system is a virtualized instance, running on 3 1/2 year old hardware, inside an (apparently not terribly optimal) type 2 hypervisor, is really rather surprising.

The next series of posts will explore this issue and try to come to a root cause – we’ll follow where the data takes us.

System time during kernel compile under IceWM

In the last post, I explored the system calls issued by the ‘gnome-shell’ process. It seems that the interface to the VirtualBox hypervisor may be consuming rather a lot of system time: I’ve regularly seen values of over 40% of CPU time being spent in the kernel.

This is a lot: out of a dual-core system, almost one full CPU was being spent doing something other than my work. To put it in perspective, in professional computing, getting a 5% performance improvement is a big deal.

Running under IceWM:

Since I haven’t been able to get a lot of information about what ‘/dev/vboxuser’ actually does, I decided to investigate the performance differences experimentally.

I installed the IceWM window manager onto my VM and re-ran the kernel compile. You can see how to do this here:

http://forums.fedoraforum.org/showthread.php?t=282433
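In short, the install itself is just a package away (Fedora 21 still used yum as its default package manager):

sudo yum install icewm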

By the way: this article isn’t that clear about how you actually start the window manager. Under Fedora 21.5 all you need to do is log out, then click the little gear at the bottom of the login panel to select the window manager you want to use.

It turns out that running under IceWM results in substantially lower system time consumption:

[Screenshot: ‘top’ output during a kernel compile under IceWM]

The system time values can be as high as ~25%, but are often around 15%, with lows at around 11%. That’s a big improvement over what I was seeing under Gnome.

Double-checking with ‘perf’:

The ‘top’ results are promising, but I’ll double-check with the ‘perf’ toolchain.

First of all, with ‘perf top -a’, to get an instantaneous reading of the topmost symbols:

[Screenshot: ‘perf top -a’ output during a kernel compile under IceWM]

This is much better: the top symbols are still in the kernel, but the amount of time they are consuming is pretty modest, at about 14%, compared with about 38% in my earlier post.

Now we can get a longer snapshot with the record / report mechanism:

perf record --call-graph dwarf -a -F 97 -- sleep 5

perf report

Here’s the output:

[Screenshot: ‘perf report’ output during a kernel compile under IceWM]

This is substantially different from what I saw earlier. There’s no mention of any VirtualBox symbols or ‘ioctl’ system calls, and the majority of symbols are user-space rather than kernel space.

Root cause?

As always, we need to figure out if we really have a root cause.

The switch between window managers is a fairly large change, but it’s not that big in the context of a full-sized computing environment. The kernel, the user-space code, X11 and the hypervisor are all the same.

Obviously, there is a large range of things that could have changed, but in this case I feel fairly comfortable about having identified the root cause.

Although I’m not an expert in either system, I suspect there is an integration between ‘gnome-shell’ and VirtualBox which is responsible for poor performance under this workload.

The integration is probably quite useful for desktop users: It could possibly be something like clean support for resizing the VM’s window, which is certainly nice to have.

In the past I’ve mentioned that I like cut-down window managers because they are so simple. It seems that this may be another data point in support of that choice.

If you’re old school enough to be happy with a bare-bones computing environment and just want to work efficiently, it may be worth getting rid of the bells and whistles.

Examining ‘gnome-shell’ behavior more closely

In the last post I noticed a correlation between activity in the 'gnome-shell' process and excessive consumption of system time during a kernel compile.

In this post I’m going to dig down a bit, to see what this process is doing and why it’s polling like it is.

What does ‘gnome-shell’ do?

I’m not an expert in windowing environments, but it appears that ‘gnome-shell’ is the core of the windowing system provided by Fedora 21. There’s a high level introduction here.

Since windowing systems are heavily event based (the window manager needs to react to mouse location, clicks and other windowing events), it’s to be expected that some part of it needs to listen for incoming events.

The question here is whether it’s contributing to the high system time observed during a kernel compile.

Experiment 1: Counting system calls under load

There’s one simple experiment I can do to see if activation in ‘gnome-shell’ increases under load: Attach ‘strace’ and count the number of system calls.

System calls will consume system time by definition: If I want the operating system to do something, I ask it via a system call. I’ll cause a context switch as I enter the system call and I’ll force the CPU to spend some time inside the kernel. This will show up as system time in tools like ‘top’.

Here’s the command I’ll use (‘pgrep -d, gnome-shell’ expands to a comma-separated list of the matching PIDs, which ‘strace -p’ accepts):

timeout 5 strace -c -p `pgrep -d, gnome-shell`

Here’s the output of a more-or-less idle machine:

[Screenshot: ‘strace -c’ system call counts, idle machine]

When I run the kernel compile using ‘make -j3’, I get the following results:

[Screenshot: ‘strace -c’ system call counts during ‘make -j3’]

So, I’m seeing substantially higher numbers of system calls, but the mix of system calls has changed too. I’m seeing these changes:

  • The highest proportion of time is being consumed by ‘ioctl’
  • There’s ~10x the number of ‘poll’ calls
  • The ‘munmap’ call starts showing up, curiously not matched by ‘mmap’ calls
  • The duration of each call has dropped, even though the system is busier

Time spent routing messages?

My first hypothesis was that the increased output was causing the windowing system to work harder.

The compilation processes will be generating output on STDOUT / STDERR, which presumably has to be routed to the appropriate window. You could imagine that taking more effort, perhaps resulting in increased system time.

I think the data refutes this idea. In particular, the proportion of time we’re spending polling has apparently dropped; if the windowing system was spending its time routing messages, it would increase. Also, I’m seeing system time, and I’d expect this work to happen in user space.

Debugging ‘ioctl’

The big thing here is that ‘ioctl’ has seen a 6x increase in the time it consumes. We need to figure out what this call is doing.

This is actually quite easy to do. Vanilla Linux ‘strace’ has quite a large range of powerful options, including the ability to trace only specific system calls. Combine that with the ability to resolve file descriptors into file paths (the ‘-y’ flag), and we should get a good idea of what ‘ioctl’ is doing.

strace -y -e trace=ioctl -p `pgrep -d, gnome-shell`

Here’s the output (abbreviated for clarity):

[Screenshot: ‘strace -y’ output showing ioctl calls against /dev/vboxuser]

So, these time consuming ‘ioctl’ system calls are interacting with something to do with the VirtualBox hypervisor.
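As a side note, adding ‘-T’ to that strace invocation prints the time spent inside each call, which is a handy way of confirming that these ioctls really are the expensive ones:

strace -y -T -e trace=ioctl -p `pgrep -d, gnome-shell`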

This is the first time that we’ve seen concrete evidence of a hypervisor involvement. This is good news, because a difference in system time consumption between physical and virtual systems is where this whole thread started.

So, what is this ‘/dev/vboxuser’ thing, and why is the windowing system driving it harder when a compile starts?

Unfortunately, there doesn’t seem to be much (any really!) information about what this device is used for. I’ve seen one reference to it handling host-to-guest clipboards and screen resizing, but that’s about it.

I have an IM in with the VirtualBox developers to see if I can get more information. I’ll update here if I learn more.

Alternative experiments:

In the absence of more concrete information, I can do a few more experiments:

  • Change the window manager to something not based on Gnome
  • Change the hypervisor to something not based on VirtualBox

These may tell me something, albeit obliquely, about what’s consuming the system time. I’ll cover these experiments in my next post.

Gnome-shell stuck in a tight polling loop

I now have a way to generate more useful stack trace information with the ‘perf’ toolchain.

It’s still not perfect though – I’m now a bit overwhelmed with the amount of data the tool generates.

How can I boil this stuff down to start to answer the question of what’s causing the suspected high system time when I compile the Linux kernel?

Sorting perf output:

It turns out there’s a very useful ‘sort’ mode for ‘perf report’. It seems to be somewhat similar to a ‘GROUP BY / ORDER BY’ expression in SQL: it appears to both sort and consolidate the data, which is very handy.

Here’s the new command line:

perf report -i perf.data.compile -s dso

And this is the output:

[Screenshot: ‘perf report -s dso’ output]

The output’s a bit hard to read with my color choice, but the key thing here is that the numbers are starting to make some sense.

When I was running the compile, the user CPU time was hovering at ~75%, and ~25% was spent in system time. Those values are reflected here in the ‘cc1’ and ‘kernel.kallsyms’ lines.

I suspect, but don’t know for sure, that the reason we’re seeing so much time in ‘libc’ is because just about everything in Unix (or Linux) uses the C standard library, so it’s naturally going to show up as very high activation.
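Incidentally, the sort keys can be combined, which makes the grouping even more useful:

# Group samples by process, then by shared object, then by symbol
perf report -i perf.data.compile -s comm,dso,symbol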

Drilling down:

Having gotten some values that look more believable, let’s drill down a bit and see what’s causing our system time problem.

Since I’m interested in system time, I’ll expand out the kernel symbols, as shown below:

[Screenshot: ‘perf report -s dso’ output with the kernel symbols expanded]

Now, this is a bit confusing. I’m seeing lots of good detail, but I’m not sure if the symbols are ranked according to how often they were sampled.

Maybe, now that I’m more confident in what the tool is telling me, I should revert to ‘perf top’ to get a reading of which symbols are hot as the compile progresses?

The plot thickens:

Now, this is where things get a bit strange.

This afternoon, I kicked off many compilation runs and got ~25% system time. This is high, but not awful, and I was starting to wonder if I was chasing something inconsequential.

This evening, I did exactly the same set of commands, and got easily 45% system time, and saw a very different set of symbols in the ‘perf top’ output:

[Screenshot: ‘perf top’ output from the evening run]

What could have changed? If I run a ‘perf record’ session now, this is what I get:

[Screenshot: ‘perf report’ output showing gnome-shell symbols]

All of a sudden, I see a bunch of I/O-related symbols showing up, and they are within kernel space. I have no idea why they might originate in ‘gnome-shell’: that’s either a glitch, or it’s actually pretty interesting.

gnome-shell:

It turned out that there was a process called ‘gnome-shell’ which was consuming CPU resources all the time and was visible with vanilla ‘top’.

Using traditional ‘strace’ showed that the ‘gnome-shell’ process was sitting in a fairly tight loop running ‘poll’ on a file descriptor attached to a socket (identified only by its inode number).
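For the record, a tight poll loop like that looks something like this in the strace output (the fd number and arguments here are invented for illustration):

poll([{fd=23, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=23, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=23, events=POLLIN}], 1, 0) = 0 (Timeout)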

This behavior was not present in earlier sessions. If it had been, I’d have seen the ‘gnome-shell’ related symbols show up in the perf output.

Root cause or not?

Now, we need to be careful not to jump to conclusions, since modern operating systems are fiendishly complex.

The presence of a polling loop seems to correlate with higher-than-expected system time during a kernel compile. This may be a coincidence, but it bears more investigation.

I didn’t notice it during other periods of high system time, so it’s possible it’s a one-off glitch, or perhaps endemic to Fedora’s window management environment. It’s possible that there are multiple causes of high system time, so we need to be prepared to “peel the onion”.

The next post will examine the ‘gnome-shell’ process a bit more. We’ll conduct some experiments to see if we can be more confident in assigning a root cause.