System time during kernel compile under IceWM

In the last post, I explored the system calls issued by the ‘gnome-shell’ process. It seems that the interface to the VirtualBox hypervisor may be consuming rather a lot of system time: I regularly saw over 40% of CPU time being spent in the kernel.

This is a lot: on a dual-core system, almost one entire core was being spent doing something other than my work. To put that in perspective, in professional computing a 5% performance improvement is considered a big deal.

Running under IceWM:

Since I haven’t been able to get a lot of information about what ‘/dev/vboxuser’ actually does, I decided to investigate the performance differences experimentally.

I installed the IceWM window manager onto my VM and re-ran the kernel compile. You can see how to do this here:

http://forums.fedoraforum.org/showthread.php?t=282433
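In short, the installation itself should just be a matter of pulling IceWM from the standard repositories; on Fedora 21 (which still uses yum by default) that should be something like:

sudo yum install icewm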

By the way, this article isn’t that clear about how you actually start the window manager. Under Fedora 21, all you need to do is log out, then click the little gear on the login screen to select the window manager you want to use.

It turns out that running under IceWM results in substantially lower system time consumption:

[Screenshot: icewm_kernel_compile]

The system time values can be as high as ~25%, but are often around 15%, with lows at around 11%. That’s a big improvement over what I was seeing under Gnome.
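For anyone following along, a quick way to keep an eye on these numbers while the compile runs is a plain ‘vmstat’ alongside ‘top’ (just a sketch, assuming the standard procps tools are installed):

vmstat 5

The ‘us’, ‘sy’ and ‘id’ columns give user, system and idle CPU time as percentages, refreshed every five seconds; the same figures appear in the CPU summary line of ‘top’.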

Double checking with ‘perf’:

The ‘top’ results are promising, but I’ll double check with the ‘perf’ toolchain:

First of all, with ‘perf top -a’, to get an instantaneous reading of the topmost symbols:

[Screenshot: icewm_kernel_compile_perf_top]

This is much better – the top symbols are still in the kernel, but the amount of time they are consuming is pretty modest – about 14%, compared with about 38% in my earlier post.

Now we can get a longer snapshot with the record / report mechanism:

perf record --call-graph dwarf -a -F 97 -- sleep 5

perf report

Here’s the output:

[Screenshot: icewm_kernel_compile_perf_results]

This is substantially different from what I saw earlier. There’s no mention of any VirtualBox symbols or ‘ioctl’ system calls, and the majority of symbols are user-space rather than kernel space.
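If you want to double-check that for yourself, one quick test (a sketch: ‘perf report’ reads the default ‘perf.data’ produced by the record command above) is to search the text output for the suspect symbols:

perf report --stdio | grep -i -E 'vbox|ioctl'

If that prints nothing, neither the VirtualBox symbols nor ‘ioctl’ appear anywhere in the recorded profile.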

Root cause?

As always, we need to figure out if we really have a root cause.

Switching window managers is a fairly large change, but it’s a small one in the context of the full computing environment: the kernel, the user-space code, X11 and the hypervisor are all unchanged.

Obviously, there is a large range of things that could have changed, but in this case I feel fairly comfortable about having identified the root cause.

Although I’m not an expert in either system, I suspect there is an integration between ‘gnome-shell’ and VirtualBox which is responsible for poor performance under this workload.

The integration is probably quite useful for desktop users: it could be something like clean support for resizing the VM’s window, which is certainly nice to have.

In the past I’ve mentioned that I like cut-down window managers because they are so simple. It seems that this may be another data point in support of that choice.

If you’re old school enough to be happy with a bare-bones computing environment and just want to work efficiently, it may be worth getting rid of the bells and whistles.

Running the ‘perf’ toolchain under Fedora

In the last post, I finally decided that I couldn’t get the ‘perf’ toolchain to work properly under Ubuntu 14.04 LTS.

In this post, I’m going to look at doing the same debug under Fedora 21 (64-bit).

Those of you who’ve followed this series so far should be very familiar with the basic steps to install the tools, so I’m not going to describe the setup in detail.

Getting the first round of data:

First of all, I’ll launch the kernel compile as usual:

make -j3

Then, when the system has settled a little, I’ll launch the recording tool:

perf record --call-graph dwarf -F 97 -a -o perf.data.compile -- sleep 5

Note: the ‘-F 97’ tells ‘perf’ to sample at 97 Hz. This reduces the CPU load and I/O bandwidth required. The odd choice of 97 Hz (rather than, say, 100 Hz) keeps the sampling from synchronizing with periodic system events.

Then, let’s see what we get from ‘perf report’:
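(Since the record step wrote its output to ‘perf.data.compile’ rather than the default ‘perf.data’, the report step needs the ‘-i’ option to find it.)

perf report -i perf.data.compile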

[Screenshot: fedora_perf_report_no_sort]

So, this looks promising. But what’s ‘cc1’, and why do we have so many of them?

‘cc1’ is the actual C compiler that’s invoked by the ‘gcc’ toolchain. I don’t know for sure why we have so many, but I suspect that it’s because of two things:

  • There are lots of short-lived ‘cc1’ processes, each with its own PID.
  • The sampling catches them at different stages of their lifecycle, leading to a different stack trace for each sample.

If we expand a few of these ‘cc1’ instances, this is what we get:

[Screenshot: fedora_perf_report_no_sort_expanded]

This supports my conjecture above – I’m seeing many, many possible stack traces for ‘cc1’.
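One way to sanity-check this conjecture is to ask ‘perf report’ to aggregate by PID and count how many distinct ‘cc1’ entries appear (a rough sketch, reusing the data file recorded above):

perf report -i perf.data.compile --sort pid --stdio | grep -c cc1

Each matching line corresponds to one PID, so a large count supports the ‘many short-lived compiler processes’ theory.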

This is interesting, but doesn’t really help me. I want to see why the compilation process seems to cause high system time, not snapshots of the control flow of ‘cc1’.

The next post will show how to make this data more useful.

Processor Performance Counters

In the last few posts I looked at a simple use case for the SystemTap tool from Red Hat.

Now I’m going to look at the ‘perf’ tool that comes as part of the Linux kernel.

Perf is derived from the “Performance Counters for Linux” (PCL) project, so it’s worth taking a bit of a detour into the hardware features of modern processors.

Bringing up a CPU for the first time:

The traditional way to debug software is to run it under a debugger such as GDB. The problem is that this depends on two things:

  • Having a development environment that is mature enough to support a debugger.
  • Having a process which can accept being stopped and single-stepped.

Bringing up a new processor design to the point where it can run even a small operating system is a major chunk of work. In the meantime, the chip is sitting in the lab and a way is needed to figure out what it’s capable of doing.

Assuming the CPU is working (which it may not be), even a fairly large bring-up program will complete in microseconds. That’s obviously way too fast to observe manually, even if there were a good way to see what the machine is doing.

Logic analyzers can be attached to the buses and pins, but that only shows what’s going on at the physical interfaces to the chip. As a side note, monitoring high-speed signals on a 1,000-pin socket is not actually that easy, either!

In the CPU design world, tools are often rather low-level. Frankly, a ‘printf’ statement is pretty heady stuff, because it implies the system is running an OS with a C standard library, and I actually have something to view the output on!

So, given that I have a pretty bare-bones set of tools, how do I know what’s going on inside the core of the device?

On-chip monitoring hardware:

Part of the answer is on-chip debug systems and performance counters. For completeness, there are also Scan and JTAG interfaces, but they are well outside the scope of this article.

On-chip debug and performance monitoring hardware has visibility into the low level implementation of the processor design. It can be used to monitor the system as it runs at full speed.

Modern hardware debug modules can be pretty powerful. They have a range of programmable features and can monitor a wide range of system resources (although not all at the same time) without degrading performance. Of course, the feature set is nothing like as rich as a full interactive debugger, but it’s an awful lot better than nothing!

Performance counters are related to the on-chip debug system. As the name implies, they are designed less for identifying software errors than for quantifying software behavior as it runs.

They can still be pretty flexible though, especially with tools such as ‘perf’ and ‘OProfile’ to help configure them.

Finally, traditional debuggers emphasize functional debugging and won’t reveal anything about architectural issues such as cache utilization. Given how much the memory hierarchy affects performance, understanding what a workload is doing to the caches is critical.
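As a concrete example of the kind of question the counters can answer, ‘perf stat’ will happily count a few generic cache events across a whole workload (a sketch; the events actually available depend on the CPU and how they are mapped on each micro-architecture):

perf stat -e cycles,instructions,cache-references,cache-misses -- make -j3

When the command finishes, ‘perf stat’ prints the totals for each event along with derived figures such as instructions per cycle and the cache miss ratio.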

Documentation:

Understanding how performance counters can be programmed does require a background in reading processor architecture manuals. It’s pretty technical stuff but many of you will be interested, so here are some links:

(Warning : These docs are VERY large. You may want to download them to your machine then view them outside of your web browser.)

Comparing Linux WM load to Mac OS X (Yosemite)

In the last post I used SystemTap to look at the I/O operations generated by the desktop environment.

I thought it would be interesting to do a similar analysis on my Mac, just to have a comparison.

It turns out that it’s more complicated to analyze the Mac’s behavior than it is for Linux. Based on this specific experiment, the Mac seems to live in a middle ground between IceWM and Ubuntu’s Unity environment.

However general usage impressions and some other filesystem data suggest that disk utilization may be substantially higher.

Mac OS X tools:

The Mac uses quite a different set of monitoring tools from Linux.

Thanks to its BSD heritage, Mac OS X has an implementation of Sun’s ‘dtrace’, which arguably is the gold standard of tracing systems.

Dtrace is not fully ported to Linux currently, although work is underway. I tried using it about six months ago and found it wasn’t fully mature yet.

The tool to use on the Mac is ‘fs_usage’. I believe (although I could be wrong) that it is based on ‘dtrace’.

It’s a very handy tool and gives quite a bit of actionable info right out of the box.
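For anyone who wants to reproduce this, my basic approach would look something like the following (a sketch: ‘fs_usage’ needs root, and ‘-f filesys’ restricts the output to filesystem calls):

sudo fs_usage -w -f filesys > fs_usage.log
awk '{print $2}' fs_usage.log | sort | uniq -c | sort -rn | head

Let the first command run for about 60 seconds and stop it with Ctrl-C; the second then counts the captured samples by call name (the second column of the output: read, write, getattrlist, stat64 and so on), which makes the kind of filtering described below straightforward.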

Results:

Over a 60 second period my Mac issued ~315k I/O operations. Many of these were filesystem metadata calls like ‘getattrlist’ and ‘stat64’, which may not result in many actual disk operations.

If I filter the ‘fs_usage’ output pretty strictly, to allow a fairer comparison with the SystemTap results under Linux, I get a count of about 875 reads and writes over 60s. This is actually better than under Linux, but only if we compare these specific operations.

The majority of the load seemed to be from the Spotlight indexing system, which I’ve noticed being pretty intensive before.

Commentary:

I’m actually not that keen on Spotlight – I feel it uses a lot of resources for rather modest returns. I use its features very rarely, but I suspect that I pay a performance penalty for its indexing operations.

Even though the actual number of reads and writes doesn’t appear to be that high, you can hear the disk working: it is clearly busy, and application startup is rather sluggish.

My guess is that the metadata operations (getattr, stat, open and so on) actually do contribute to quite a lot of disk activity, especially shortly after a reboot, when the filesystem caches may not be that warm.

Moving the head of a rotating disk is time consuming: an average seek plus half a rotation costs on the order of 10 ms, so the 7,200 RPM conventional disk in my iMac can only perform about 100 random I/O operations per second. Even a fairly modest seek load can really eat into system performance.

We could use another tool to probe a lower layer of the storage stack and see how many operations are actually reaching the disk. ‘iostat -o’ is an OK tool for this, but it’s not as capable as its Linux counterpart and there isn’t a specific column for seeks per second. You can infer what’s going on, but the output is a bit sparse.
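For what it’s worth, a minimal way to watch the raw disk activity on the Mac (a sketch, assuming the boot drive shows up as ‘disk0’) is:

iostat -d -w 1 disk0

This prints KB/t, tps and MB/s for disk0 once a second; ‘tps’ (transfers per second) is the closest thing to an operations-per-second figure, and on a conventional disk it tends to level off around the ~100 mark mentioned above under random loads.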

It is possible to disable Spotlight, but I’ve found that other applications use it as a central searching facility. For example, I could not search email in Microsoft Outlook when Spotlight was turned off.

Installing kernel debug symbols in Ubuntu 14.04

In the last post, I found that SystemTap threw errors when I tried to add probe points into the operating system.

It appears that the solution is to install the debug version of the kernel, which contains a mapping from human-readable symbol names to machine-executable addresses.

There is an excellent post on the Ubuntu wiki which describes how to do this, including doing a dedicated kernel build.

This post contains simplified instructions specific to Ubuntu 14.04 LTS. Please note that I’m running the 32-bit version in a VM, since my hypervisor doesn’t support 64-bit guest operating systems without tweaking (note: I subsequently figured out how to make this work).

Finding the Ubuntu release name:

All Ubuntu releases have a marketing code name along with the more traditional version number.

It turns out that the code name is important for this task, because it is used to index into Ubuntu’s online package repositories.

I can find the appropriate code name as follows – this applies to all Ubuntu releases, not just 14.04 LTS:

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.1 LTS"

So, my code name is ‘trusty’.
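As an aside, if the ‘lsb_release’ utility is installed (it normally is on a desktop install), the same answer is available in one short command:

lsb_release -sc
trusty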

Adding repository references:

Now that I have the code name, I need to modify my repository setup to access the repository that holds the debug symbols.

The repositories are configured by the following file:

/etc/apt/sources.list

In this case, here’s what I need to add to ‘sources.list’:

deb http://ddebs.ubuntu.com/ trusty main restricted universe multiverse
deb http://ddebs.ubuntu.com/ trusty-security main restricted universe multiverse
deb http://ddebs.ubuntu.com/ trusty-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com/ trusty-proposed main restricted universe multiverse

Note: a description of what all these ‘universe’ and ‘multiverse’ tags mean can be found here.

The key thing here is that the lines contain ‘ddebs’, meaning ‘debug deb’ packages, rather than the normal repositories, which contain stripped packages.
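If you’d rather not open an editor, something like the following appends a line in one shot (repeat it for each of the four lines above; ‘tee -a’ is needed because the redirection itself has to run as root):

echo "deb http://ddebs.ubuntu.com/ trusty main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list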

Once I’ve added these lines, I need to force the package management system to re-read the file:

sudo apt-get update

When this completed, I noticed that the package system wasn’t able to validate the package signatures:

W: GPG error: http://ddebs.ubuntu.com trusty-proposed Release:
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY ECDCAD72428D7C01

This is easily fixed, as follows:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys ECDCAD72428D7C01

Note that the value passed to ‘apt-key’ is the same value as reported in the warning message. This process is the same for any repository where you get this warning – assuming you trust the repository, of course!

Once I’ve gotten the keys straightened out, I need to run the update process again:

sudo apt-get update

Installing the debug symbols:

The final stage should be easy. I need to download the kernel package with debug symbols that matches my current kernel version:

sudo apt-get install linux-image-$(uname -r)-dbgsym

It’s worth noting that the package is pretty big, at about 400 MB. Go get a coffee while it downloads!
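Once the download finishes, a quick sanity check (a sketch: the path is where Ubuntu’s ‘-dbgsym’ packages normally install the unstripped kernel image) is to confirm the debug image is present and then retry a trivial SystemTap probe:

ls -lh /usr/lib/debug/boot/vmlinux-$(uname -r)
sudo stap -e 'probe vfs.read { printf("read seen\n"); exit() }'

If the symbols are in place, the ‘stap’ one-liner should compile, print a single line and exit, rather than complaining about missing debug information.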