Debugging high system time with ‘perf top’

In the last couple of posts I described a problem I encountered when compiling the Linux kernel on my VM.

In this post, I’m going to start to dig in and see what might be going on. I’m going to use the ‘perf’ toolkit, which is part of the Linux kernel.

Measure, measure, measure:

Effective performance optimization involves discovering the ‘hot’ part of the code, rather than guessing. Accurate measurement is key, because even informed opinion often turns out to be incorrect.

Graphing the data is important too: patterns (or the absence of them) are much easier to judge from a plot than from numbers scrolling past in a terminal.

It’s possible to profile a user-space application with tools like ‘gprof’, but what happens when the time seems to be in the operating system?

System performance and ‘perf’:

The first tool most people turn to when there’s a performance problem is ‘top’. It’s a great tool and can tell me a lot – especially if I know how to interpret its output accurately.

The problem is that it’ll tell me which processes are consuming the most CPU, but not where inside those processes the time is going. That’s something that’s really hard to get without advanced debug tools.

Installing perf tools:

Even though ‘perf’ is part of the Linux kernel, user-facing tools are needed to drive it. They can be installed easily as follows:

sudo apt-get install linux-tools-common
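
One wrinkle, though: on Ubuntu, ‘linux-tools-common’ only provides a wrapper script that expects a kernel-version-specific ‘perf’ binary to be installed alongside it. If ‘perf’ complains that it can’t find a tool for the running kernel, installing the matching package should sort it out (assuming the usual Ubuntu package naming):

sudo apt-get install linux-tools-$(uname -r)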

Perf top:

Once the tools are installed, simply start the workload. In this case, it’s easy:


cd <top of kernel source code>
make clean
make -j3

Then, run ‘perf top’. This will bring up an interactive interface that looks somewhat like traditional ‘top’. It’s a bit hard to capture text off it, so here are the first few lines:

16.42%  cc1           [.] 0x00116d1a
 8.38%  [kernel]      [k] _raw_spin_unlock_irqrestore
 8.28%  [kernel]      [k] read_tsc
 6.45%  [kernel]      [k] finish_task_switch
 4.37%  libc-2.19.so  [.] __memset_sse2

This shows that a symbol inside the ‘cc1’ process is consuming the most CPU. That’s no surprise, because we know the system is compiling C code, and ‘cc1’ is part of the GNU C compiler toolchain.

Now, there’s a problem here, because I’m seeing a hex address rather than a human-readable symbol name. It is possible to resolve the address into a symbol manually, but it’s easier to install the debug symbol packages.

What’s more interesting is the kernel symbols on lines 2 through 4. Taken together, these three symbols account for more samples than the top user-space entry, which strongly suggests that this may be where the system time is going.
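
As an aside, if the interactive display is awkward to work with, the same samples can be recorded to a file and browsed offline. Something like this should do it (a sketch – the sampling period and options here are reasonable defaults, not the exact command I used):

sudo perf record -a -g -- sleep 30    # sample all CPUs for 30 seconds
sudo perf report --stdio | head -40   # text report of the hottest symbols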

Next I’ll install the debug symbols for the C toolchain (it’s not as straightforward as it may seem, unfortunately), and then we’ll take a look at what these kernel symbols actually mean.

Debugging a high system time problem

I covered some pretty basic debug in previous posts, so now it’s time to address something more complex.

Debug symbols:

As I mentioned earlier, in order to use advanced debug tools like ‘perf’ and ‘systemtap’, it’s useful to install the kernel debug symbols.

It’s not mandatory, but the tools will emit hex addresses rather than meaningful symbols unless they have a way to convert from instruction pointer addresses to symbols within the code.

Now, on another system I use, I noticed that the standard way of installing current debug symbols didn’t seem to work. I compiled a new version of the Ubuntu kernel with debugging mode turned on, and the symbols started showing up. I’ll write about how to do this in future posts.

I thought I’d do the same thing on my Virtual Machine, more out of interest than anything else, since adding debug symbols seemed to work on this instance.

High system time:

When I kicked off the compile, I noticed that the system seemed to be spending a lot of time in kernel space.

Here’s a snapshot of the ‘top’ output:

top - 14:20:11 up 39 min, 9 users, load average: 1.18, 0.62, 0.42
Tasks: 205 total, 2 running, 203 sleeping, 0 stopped, 0 zombie
%Cpu0 : 46.0 us, 44.0 sy, 0.0 ni, 0.0 id, 4.0 wa, 0.0 hi, 6.0 si, 0.0 st
%Cpu1 : 71.0 us, 29.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem:  3029552 total, 2220740 used, 808812 free, 187868 buffers
KiB Swap: 3069948 total, 0 used, 3069948 free. 1011568 cached Mem

  PID USER  PR NI   VIRT    RES   SHR S %CPU %MEM   TIME+ COMMAND
21965 nick  20  0  43464  31968  8176 R 55.7  1.1 0:00.63 cc1
 3861 nick  20  0 708676 275508 44732 S 12.9  9.1 0:20.53 firefox
 2623 nick  20  0 452592 125972 38452 S  2.0  4.2 1:17.22 compiz
    7 root  20  0      0      0     0 S  1.0  0.0 0:03.34 rcu_sched
   13 root  20  0      0      0     0 S  1.0  0.0 0:02.11 ksoftirqd/1
  977 root  20  0  27196   1412  1032 S  1.0  0.0 0:00.95 VBoxService

I don’t normally see system time anywhere near this high. Some system time is to be expected, but 44% and 29% on the two cores is far more than a user-space job like compilation should need. Since this blog is about debugging hard problems like this, I thought it would be fun to find out what’s going on.
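
Incidentally, a lightweight way to watch the user/system split while the build runs is ‘vmstat’, which is part of the standard install and prints the ‘us’ and ‘sy’ columns once per interval:

vmstat 1    # one line per second; watch the 'us' and 'sy' columns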

The next post in this series starts digging into this high system time problem using ‘perf top’.

Processor Performance Counters

In the last few posts I looked at a simple use case for the SystemTap tool from Red Hat.

Now I’m going to look at the ‘perf’ tool that comes as part of the Linux kernel.

Perf is derived from the “Performance Counters for Linux” (PCL) project, so it’s worth taking a bit of a detour into the hardware features of modern processors.

Bringing up a CPU for the first time:

The traditional way to debug software is to use an interactive debugger such as GDB. The problem is that this depends on two things:

  • Having a development environment that is mature enough to support a debugger.
  • Having a process which can accept being stopped and single-stepped.

Bringing up a new processor design to the point where it can run even a small operating system is a major chunk of work. In the meantime, the chip is sitting in the lab and a way is needed to figure out what it’s capable of doing.

Assuming the CPU is working (which it may not be), even a fairly large bring-up program will complete in microseconds. That’s obviously far too fast to observe manually, even if there were a good way to see what the machine is doing.

Logic analyzers can be attached to the buses and pins, but that only shows what’s going on at the physical interfaces to the chip. As a side note, monitoring high-speed signals on a 1,000-pin socket is not actually that easy, either!

In the CPU design world, tools are often rather low-level. Frankly, a ‘printf’ statement is pretty heady stuff, because it implies the system is running an OS with a C standard library, and I actually have something to view the output on!

So, given that I have a pretty bare-bones set of tools, how do I know what’s going on inside the core of the device?

On-chip monitoring hardware:

Part of the answer is on-chip debug systems and performance counters. For completeness, there are also Scan and JTAG interfaces, but they are well outside the scope of this article.

On-chip debug and performance monitoring hardware has visibility into the low level implementation of the processor design. It can be used to monitor the system as it runs at full speed.

Modern hardware debug modules can be pretty powerful. They have a range of programmable features and can monitor a wide range of system resources (although not all at the same time) without degrading performance. Of course, the feature set is nothing like as rich as a full interactive debugger, but it’s an awful lot better than nothing!

Performance counters are related to the on-chip debug system. As the name implies, they are designed less for identifying software errors than for quantifying software behavior as it runs.

They can still be pretty flexible though, especially with tools such as ‘perf’ and ‘OProfile’ to help configure them.

Finally, traditional debuggers emphasize functional debugging and won’t reveal anything about important architectural issues such as cache utilization. Given how important the memory hierarchy is to performance, understanding what the workload is doing to the caches is critically important.
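
As a taste of the kind of data these counters expose, ‘perf stat’ can ask the hardware for cache events directly. These are perf’s generic event names, which it maps onto whatever counters the CPU actually provides (availability varies by processor):

sudo perf stat -e cycles,instructions,cache-references,cache-misses make -j3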

Documentation:

Understanding how performance counters can be programmed does require a background in reading processor architecture manuals. It’s pretty technical stuff but many of you will be interested, so here are some links:

(Warning : These docs are VERY large. You may want to download them to your machine then view them outside of your web browser.)

Comparing Linux WM load to Mac OS X (Yosemite)

In the last post I used SystemTap to look at the I/O operations generated by the desktop environment.

I thought it would be interesting to do a similar analysis on my Mac, just to have a comparison.

It turns out that it’s more complicated to analyze the Mac’s behavior than it is for Linux. Based on this specific experiment, the Mac seems to live in a middle ground between IceWM and Ubuntu’s Unity environment.

However, general usage impressions and some other filesystem data suggest that the Mac’s disk utilization may be substantially higher.

Mac OS X tools:

The Mac uses quite a different set of monitoring tools from Linux.

Thanks to its BSD heritage, Mac OS X has an implementation of Sun’s ‘dtrace’, which arguably is the gold standard of tracing systems.

Dtrace is not fully ported to Linux currently, although work is underway. I tried using it about six months ago and found it wasn’t fully mature yet.

The tool to use on the Mac is ‘fs_usage’. I believe (although I could be wrong) that this is based on ‘dtrace’.

It’s a very handy tool and gives quite a bit of actionable info right out of the box.
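
For reference, the capture looked something like the following (a sketch – ‘fs_usage’ options vary a little between OS X versions, and OS X doesn’t ship a ‘timeout’ utility by default, so I just stopped it by hand after 60 seconds):

sudo fs_usage -w -f filesys | tee mac_trace.60s    # Ctrl-C after ~60s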

Results:

Over a 60 second period my Mac issued ~315k I/O operations. Many of these were filesystem metadata operations, like ‘getattrlist’ and ‘stat64’, which might not result in many actual disk operations.

If I filter the ‘fs_usage’ output pretty strictly, allowing a fairer comparison with the SystemTap figures under Linux, I get a count of about 875 reads and writes over 60s. That is lower than the Ubuntu Unity figure (although far higher than IceWM’s), but only if we compare these specific operations.
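
I haven’t reproduced the exact filter here, but it amounted to something along these lines (the operation names are assumptions about what ‘fs_usage’ labels reads and writes as):

grep -E -c ' (read|write|pread|pwrite) ' mac_trace.60s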

The majority of the load seemed to be from the Spotlight indexing system, which I’ve noticed being pretty intensive before.

Commentary:

I’m actually not that keen on Spotlight – I feel it uses a lot of resources for rather modest returns. I use its features very rarely, but I suspect that I pay a performance penalty for its indexing operations.

Even though the actual number of reads and writes doesn’t appear to be that high, you can hear the disks working. The disks are clearly doing a lot of work and application startup is rather sluggish.

My guess is that the metadata operations (getattr, stat, open etc etc) actually do contribute to quite a lot of disk activity, especially shortly after reboot when the filesystem caches may not be that warm.

Moving the head of a rotating disk is time consuming: with an average seek plus rotational latency of roughly 10ms per operation, the 7,200 RPM conventional disk in my iMac can only perform about 100 I/O operations per second. Even a fairly modest seek load can really eat into system performance.

We could use another tool to probe into a lower part of the file system to see how many ‘seek’ operations are actually reaching the disk. ‘iostat -o’ is an OK tool for this, but it’s not as good a tool as under Linux and there isn’t a specific column for the number of seeks / sec. You can infer what’s going on, but it’s a bit sparse.

It is possible to disable spotlight, but I’ve found that other applications use it as a central searching facility. For example, I could not search email in Microsoft Outlook when Spotlight was turned off.

Running SystemTap to monitor I/O behaviour

The objective of this mini-project was to use advanced Linux debug tools to investigate the runtime impact of Window Managers (WMs) on the machine’s storage subsystem.

It’s not a very complex, or perhaps even a very useful project, but it does provide a vehicle for bringing up the tools and answering a question in an objective manner.

Experimental environment:

I’ve finally got to the point where I can run SystemTap and (hopefully) come to some sort of understanding of how the WM drives the IO subsystem.

Here’s my operating environment:

  • Host system : 27″ iMac, OS X Yosemite, Quad core Intel I5 processor, 12GB RAM
  • Hypervisor : Oracle VirtualBox 4.3.20
  • Guest OS : Ubuntu 14.04 LTS (32 bit)
  • Window managers : Ubuntu Unity, IceWM

The system is idle, with one terminal running. The OS is in full multi-user mode, so there’s possibly a range of background tasks going on.

This obviously isn’t terribly realistic, since most interesting problems occur when there’s actually something going on. Still, it’s sufficient to answer our current training question.

Running SystemTap under IceWM:

My normal development environment uses IceWM as its window manager, so I’ll use that to start off with.

I want to monitor the system for a fixed amount of time, so that errors don’t creep in by guessing the monitoring interval.

I’m going to run SystemTap as follows:

sudo timeout 60 ./watch_all_reads_writes.stap 8 1 2>&1 | tee trace.60s

Note : The ‘timeout’ utility is extremely handy. It allows me to run a command for a specified interval then deliver a signal (which I can specify if I want to). It’s not available by default on all Linux distributions and versions, but it’s available in Ubuntu 14.04.
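
For example, to send SIGINT rather than the default SIGTERM when the 60 seconds are up:

sudo timeout -s INT 60 ./watch_all_reads_writes.stap 8 1 2>&1 | tee trace.60s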

Here’s a snippet of the output:

rs:main Q:Reg(545) vfs_write 0x800001
rs:main Q:Reg(545) vfs_write 0x800001
tee(2417) vfs_write 0x800001
tee(2417) vfs_write 0x800001
tee(2417) vfs_write 0x800001
tee(2417) vfs_write 0x800001
tee(2417) vfs_write 0x800001

So, the first interesting thing is that the presence of ‘tee’ actually causes a lot of I/O write operations, even when there’s no output from the application it’s monitoring.

I need to strip out the impact of ‘tee’ and then count the read / write operations, as follows:

grep -v tee trace.60s | wc -l
>> 5

The total number of lines is 293 over 60 seconds, so ‘tee’ clearly dominates the I/O behavior.
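
For reference, the two counts are just line counts over the trace file, along these lines:

wc -l trace.60s          # all events logged (293 here)
grep -cv tee trace.60s   # events not caused by 'tee' (5 here)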

Results with Ubuntu Unity:

Next I need to do a similar test when I’m running Ubuntu’s ‘Unity’ desktop environment.

The results are pretty different: There were 4424 read or write operations over 60s.

Stripping out the contribution from ‘tee’ gave 4112 operations, compared to 5 with IceWM.

I can extract the list of processes with this command line:

cat trace.60s.unity | gawk -F'(' '{print $1;}' | sort -u | wc -l

There were 29 separate processes causing the IO operations (I won’t list them all here).
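
To see which of those processes were the busiest, a small variation on the same pipeline ranks them by event count (a sketch):

grep -v tee trace.60s.unity | gawk -F'(' '{print $1;}' | sort | uniq -c | sort -rn | head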

If I had to guess, I’d say that it’s mainly to do with Unity doing background operations to keep packages updated, but there were some interesting ones, like an ‘http’ process and something called ‘Zeitgeist’.

There is more information about Zeitgeist on its Ubuntu man page.

Summary:

Based on the output of this script, we’ve discovered that:

  • IceWM causes minimal I/O subsystem load when the system is idle
  • Ubuntu Unity has fairly significant background activity which affects the I/O subsystem
  • The ‘tee’ application performs roughly 5 write operations / sec in this trace (288 of the 293 logged lines came from ‘tee’), even when there’s no input that needs to be written to STDOUT / STDERR and a file. Are these zero-length writes? I don’t know – but we could do some more work to find out (see the sketch after this list).
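
One way to answer the zero-length question would be to attach ‘strace’ to the running ‘tee’ and look at the sizes of its write() calls – a sketch, assuming ‘strace’ is installed and a ‘tee’ process is still running:

sudo timeout 10 strace -e trace=write -p $(pgrep -n tee)   # each line shows write(fd, buf, count)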

Given this, can I say that Ubuntu Unity would load my machine’s I/O subsystem too much? No, at least not under these operating conditions.

These I/O rates are pretty modest, even with the slow 7,200 rpm disks in iMacs and the overhead from the virtualization system.

Also, when I see calls to ‘vfs.read’ or ‘vfs.write’, that doesn’t necessarily translate to an operation that actually causes disk heads to move.

To understand why that is, I’ll have to take a look at how the Virtual Memory subsystem of modern operating systems works. This is well outside the scope of this post, but if you’re interested I recommend reading Mel Gorman’s document at this link.

https://www.kernel.org/doc/gorman/pdf/understand.pdf

There’s a LOT in this document. Even if it refers to older kernels, it’s still an excellent resource.

Caveats:

There are caveats to my mini-analysis:

  • The monitoring tool only tracks read and write operations. There are other I/O operations such as ‘stat’, ‘open’ and ‘close’ which are not logged.
  • This tool only monitors I/O operations that go through the Virtual File System (VFS). Operations that go over network sockets won’t be tracked.

Also, I’m not a real hard-core expert in these tools, so I may have made some mistakes or omissions. If you spot anything, let me know.

Now, for one last comment: These monitoring tools are very powerful, because they operate across the whole system and can reach right into the kernel – we’ve only just scratched the surface.

The side effect is that using them can raise more questions than they answer!  I think that’s a good thing, because I get to really understand what my machine’s doing.

Installing kernel debug symbols in Ubuntu 14.04

In the last post, I found that SystemTap threw errors when I tried to add probe points into the operating system.

It appears that the solution is to install the debug version of the kernel, which contains a mapping from human-readable symbol names to machine-executable addresses.

There is an excellent post on the Ubuntu wiki which describes how to do this, including doing a dedicated kernel build.

This post contains simplified instructions specific to Ubuntu 14.04 LTS. Please note that I’m running the 32bit version in a VM, since my hypervisor doesn’t support 64bit guest operating systems without tweaking (Note: I subsequently figured out how to make this work).

Finding the Ubuntu release name:

All Ubuntu releases have a marketing code name along with the more traditional version number.

It turns out that the code name is important for this task, because it is used to index into Ubuntu’s online package repositories.

I can find the appropriate code name as follows – this applies to all Ubuntu releases, not just 14.04 LTS:

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.1 LTS"

So, my code name is ‘trusty’.
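
Alternatively, ‘lsb_release’ will print just the code name (it’s installed by default on Ubuntu):

lsb_release -cs
>> trusty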

Adding repository references:

Now that I have the code name, I need to modify my repository setup to access the repository that holds the debug symbols.

The repositories are configured by the following file:

/etc/apt/sources.list

In this case, here’s what I need to add to ‘sources.list’:

deb http://ddebs.ubuntu.com/ trusty main restricted universe multiverse
deb http://ddebs.ubuntu.com/ trusty-security main restricted universe multiverse
deb http://ddebs.ubuntu.com/ trusty-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com/ trusty-proposed main restricted universe multiverse

Note: A description of what all these ‘universe’ and ‘multiverse’ tags mean can be found here.

The key thing here is that the lines contain ‘ddebs’, which means ‘debug debs’ (packages that keep their debug symbols), rather than the normal repositories, which contain stripped packages.
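
As an aside, rather than editing ‘sources.list’ itself, the same lines can go into their own file under ‘/etc/apt/sources.list.d/’, which keeps the main file tidy. The file name ‘ddebs.list’ is just my choice here:

echo "deb http://ddebs.ubuntu.com/ trusty main restricted universe multiverse
deb http://ddebs.ubuntu.com/ trusty-security main restricted universe multiverse
deb http://ddebs.ubuntu.com/ trusty-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com/ trusty-proposed main restricted universe multiverse" | sudo tee /etc/apt/sources.list.d/ddebs.list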

Once I’ve added these lines, I need to force the package management system to re-read the file:

sudo apt-get update

When this completed, I noticed that the package system wasn’t able to validate the package signatures:

W: GPG error: http://ddebs.ubuntu.com trusty-proposed Release:
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY ECDCAD72428D7C01

This is easily fixed, as follows:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys ECDCAD72428D7C01

Note that the value passed to ‘apt-key’ is the same value as reported in the warning message. This process is the same for any repository where you get this warning – assuming you trust the repository, of course!

Once I’ve gotten the keys straightened out, I need to run the update process again:

sudo apt-get update

Installing the debug symbols:

The final stage should be easy. I need to download the kernel package with debug symbols that matches our current kernel version:

sudo apt-get install linux-image-$(uname -r)-dbgsym

It’s worth noting that the package is pretty big, at about 400Mbytes. Go get a coffee while it downloads!
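
Once it’s installed, a quick sanity check is to confirm that the unstripped kernel image has landed where the debug tools expect to find it. On Ubuntu that’s under ‘/usr/lib/debug’ – the exact path below is my recollection rather than gospel:

ls -lh /usr/lib/debug/boot/vmlinux-$(uname -r)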

Installing SystemTap on Ubuntu 14.04 LTS

Having discussed what SystemTap is and how it can help, I’m going to get started with my mini-project of understanding the load my window management system puts on my VM’s I/O subsystem.

Installation and sanity check:

The first thing is to get SystemTap installed. This is pretty easy with the more modern Ubuntu installations:


sudo apt-get install systemtap

Once the tool is installed, let’s try a standard “Hello World” functionality test. Here’s the SystemTap script:


#! /usr/bin/env stap
probe begin { println("hello world") exit () }

Write this into a file called ‘hello_world.stp’, then make sure the execute permissions are set.
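
Setting the execute bit is just the usual ‘chmod’:

chmod +x hello_world.stp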

Now we can execute this. I’ll need to be ‘root’, since we’re trying to load a piece of code into the kernel. That’s definitely not something that a regular user should be able to do!


sudo bash
./hello_world.stp
>> hello world

Great – it’s working well enough to get started.

SystemTap examples:

The best place to start with SystemTap is with the examples. There’s a library of useful scripts available here, organized by their purpose.

The one I want to start with is in the I/O section. The basic script allows me to watch when a given ‘inode’ on a specific device is read or written to. It’s available at this location.

Here’s the unmodified code:


#! /usr/bin/env stap
probe vfs.write, vfs.read
{
  # dev and ino are defined by vfs.write and vfs.read
  if (dev == MKDEV($1,$2)    # major/minor device
      && ino == $3)
    printf ("%s(%d) %s 0x%x/%u\n",
            execname(), pid(), ppfunc(), dev, ino)
}

As you can see, the language is pretty similar to C. That’s nice, because most people reading this will be very familiar with C, and it’s a bit of a pain to have to learn a new language for every task.

Now, I’m not actually interested in monitoring an inode. Instead, I just want the outline of how to attach probes to the file system of a device.  I need to modify the script a bit to remove the check on the inode value:


#! /usr/bin/env stap
probe vfs.write, vfs.read
{
  # dev and ino are defined by vfs.write and vfs.read
  if (dev == MKDEV($1,$2))   # major/minor device
    printf ("%s(%d) %s 0x%x\n",
            execname(), pid(), ppfunc(), dev)
}

I’ll save the new code as ‘watch_all_reads_writes.stap’.

I’ve just removed the references to the third command line parameter (‘$3’) and simplified the ‘if’ statement a bit.

Finding the device number:

Before we run the script, we need to figure out which device to monitor.

I’m working on a pretty simple Ubuntu install. I don’t have multiple file systems mounted and I don’t have more than one storage device in my system, which makes life easy.

To find the device number just run:


df <some file on your file system>

This will give you the device which hosts that particular file, as follows:


df watch_all_reads_writes.stap
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 79413120 6299988 69056096 9% /

We’re after the 2nd line of output, which starts with ‘/dev/sda1’.

Now we can get the device number by running ‘ls -l’ on the device file:


ls -l /dev/sda1
brw-rw---- 1 root disk 8, 1 Jan 2 14:46 /dev/sda1

We’re after the ‘8, 1’. This means ‘major device’ is 8 and ‘minor device’ is ‘1’.
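
As an aside, ‘stat’ can print the same numbers directly (the ‘%t’ and ‘%T’ format specifiers give the major and minor numbers in hex for a block device):

stat -c 'major: 0x%t  minor: 0x%T' /dev/sda1
>> major: 0x8  minor: 0x1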

These values will be the parameters to our SystemTap script, which we’re now ready to run.

Running the script:

Let’s try to run it (as root, of course):


sudo ./watch_all_reads_writes.stap 8 1
semantic error: while resolving probe point: identifier 'kernel' at /usr/share/systemtap/tapset/linux/vfs.stp:836:19
source: probe vfs.write = kernel.function("vfs_write")
semantic error: missing i386 kernel/module debuginfo [man warning::debuginfo] under '/lib/modules/3.13.0-32-generic/build'
semantic error: while resolving probe point: identifier 'vfs' at ./watch_all_reads_writes.stap:3:7
source: probe vfs.write, vfs.read

Yikes! It blew up when we tried to do something that actually inserted probes into the operating system.

A bit of googling told me that this was likely the result of not having the debug symbols for the kernel installed. This means that SystemTap was not able to resolve the string ‘vfs.read’ into a hex address at which to place its probe.

Fortunately, this is usually an easy fix. I’ll show you how to get the debug symbols in the next post.