Running the ‘perf’ toolchain under Fedora

In the last post, I finally decided that I couldn’t get the ‘perf’ toolchain to work properly under Ubuntu 14.04 LTS.

In this post, I’m going to look at doing the same debug under Fedora 21 (64bit).

Those of you who’ve followed this series so far should be very familiar with the basic steps to install the tools, so I’m not going to describe the setup in detail.
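For reference, on Fedora the user-space tools live in a single ‘perf’ package, so something like the following should be all that’s needed (I believe ‘debuginfo-install’ comes from the ‘yum-utils’ package):

sudo yum install perf
sudo debuginfo-install kernel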

Getting the 1st round of data:

First of all, I’ll launch the kernel compile as usual:

make -j3

Then, when the system has settled a little, I’ll launch the recording tool:

perf record --call-graph dwarf -F 97 -a -o perf.data.compile -- sleep 5

Note : The ‘-F 97’ tells ‘perf’ to sample at 97Hz. This reduces the amount of CPU load and I/O bandwidth required. The odd choice of 97Hz (not 100Hz, for example) prevents system events from synchronizing too much.
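Since the recording went to a non-default file name, ‘perf report’ needs to be pointed at it with the ‘-i’ switch, something along these lines:

perf report -i perf.data.compile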

Then, let’s see what we get from ‘perf report’:

fedora_perf_report_no_sort

So, this looks promising. But, what’s ‘cc1’, and why do we have so many of them?

‘cc1’ is the actual C compiler that’s invoked by the ‘gcc’ toolchain. I don’t know for sure why we have so many, but I suspect that it’s because of two things:

  • There are lots of short-lived ‘cc1’ processes, each with their own PID
  • The sampling catches them all at different stages in their lifecycle, leading to different stack traces for each sample.

If we expand a few of these ‘cc1’ instances, this is what we get:

fedora_perf_report_no_sort_expanded

This supports my conjecture above – I’m seeing many, many possible stack traces for ‘cc1’.

This is interesting, but doesn’t really help me. I want to see why the compilation process seems to cause high system time, not snapshots of the control flow of ‘cc1’.

The next post will show how to make this data more useful.

Perf call stack unwinding under Ubuntu

This series of posts talks about tracking down a high system time problem when compiling the Linux kernel.

I’m using ‘perf’ to try to home in on where the time is going and why. Unfortunately, it seems that this toolchain is not that mature on Ubuntu.

The primary developers are from Red Hat, so they work mainly on Red Hat’s Linux distributions, Fedora and Red Hat Enterprise Linux (RHEL).

Development environment:

If you’ve just joined this blog, here is my Linux development environment:

  • Mid 2011 27″ iMac, running OS X Yosemite
    • Quad core Intel i5 processors, 64bit, with virtualization extensions
  • VirtualBox hypervisor, version 4.3.20
  • Guest operating system
    • Ubuntu 14.04 LTS, 64bit
    • Guest kernel version : 3.13.0-32-generic
    • Guest OS assigned 2 virtual CPUs
    • Guest OS assigned 4GBytes of RAM

Differences between 32bit and 64bit behavior:

I recently figured out how to run 64bit guest operating systems under VirtualBox.

Since 32bit and 64bit systems are quite different (even though the user experience is essentially identical), I thought I’d check to see if the high system time problem is still in evidence.

When I launch a kernel compile with ‘make -j3’ on my dual-core VM, this is what I see in the traditional ‘top’ command:

high_system_time_top

As you can see, I’m still getting pretty high system times. In this case, almost an entire CPU’s worth of time is going on kernel-space work, which is an awful lot.

It would be extremely nice to get to a full understanding of this, since my compiles might go quite a bit faster – assuming I don’t become I/O bound instead.

Output of running ‘perf’ under Ubuntu:

The first tool to run is ‘perf top’, since that’s somewhat equivalent to the traditional ‘top’ command. Here’s the output:

high_system_time_perf_top

It seems to be saying that ‘__do_page_fault’ is consuming most of the time in kernel-space.

This does actually make some sense, because compilation is very I/O intensive. It’s typically lots of little jobs that access a lot of small files. Under this workload, it’s plausible that the page faulting is due to the memory mapped IO paging in a lot of files.
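A quick way to sanity-check that idea is to count fault events system-wide with ‘perf stat’ while the compile runs. The event names below are standard perf software events, so this should work on most setups:

perf stat -e page-faults,minor-faults,major-faults -a -- sleep 10

A large minor-fault count relative to major faults would be consistent with lots of first-touch accesses to freshly mapped files that are already sitting in the page cache.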

Now, one caveat is that the symbol that’s taking most of the time does appear to have changed between the 32bit and 64bit versions of the same OS. For the 32bit version, the top symbol was ‘__raw_spin_unlock_irq_restore’.

Getting call stacks:

I have a hypothesis that the excessive system time is caused by memory mapped file I/O, at least on the 64bit OS.

In order to gain confidence in this hypothesis, it would be nice to see where this activation is coming from. The ‘perf’ toolchain supports a very nice feature which allows me to get the call stacks for each symbol, which should allow me to trace the sequence of calls back towards the source.

There are two steps to doing this:

  • perf record : captures data as a workload runs
  • perf report : analyzes and displays the captured data

So, I’ll kick off the compilation process as normal, with ‘make -j3’, then start the record process:

perf record -ga -- sleep 10

This will record for 10 seconds, capturing call stacks from all the CPUs in the system.

Once that’s done, I’ll suspend the compilation process, then run the report:

perf report -g

Note: On modern distributions, ‘perf report’ will launch with an interactive ‘curses’ style browser. If you’d prefer to get a traditional text-only output, add the ‘--stdio’ switch.
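For example, to capture the whole report into a file for later reading:

perf report -g --stdio > report.txt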

Here’s what I get (Note: I use red terminals for commands that must run as ‘root’):

perf_report_top

I can attempt to expand the ‘__do_page_fault’ symbol, but under Ubuntu 14.04 LTS, the expanded section is empty. That’s a bit suspicious: The highest activity symbol allegedly having no callers seems unlikely.

It turns out that the only symbol which has anything interesting in it is ‘page_fault’, below:

perf_report_expanded

This suggests that ‘memset’ is taking a disproportionate amount of resources. That’s conceivable if large amounts of memory are being allocated and then initialized with ‘memset’ (which is typical in C code).

OK, bearing in mind that I’m now a bit suspicious, I’ll expand the ‘memset’ symbol:

perf_report_expanded_memset

Hmmm. This doesn’t look good.

Firstly, what’s that zero doing there – it’s not a hex address, and even if it were, why are we vectoring through that address? We have a whole lot of 64bit hex addresses to deal with too, which suggests missing debug packages.

Finally, the math doesn’t add up – according to this, only about 7% of the time spent in ‘memset’ is actually accounted for.

Using the ‘dwarf’ unwinder:

There is another approach which I can use : The ‘dwarf’ unwinder.

In order to unwind cleanly, all the code I want to trace needs to be run with frame pointers enabled. This is achieved with the gcc compile-time switch ‘-fno-omit-frame-pointer’.
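For an individual program, that just means adding the flag on the compile line, for example (with a made-up source file name):

gcc -O2 -g -fno-omit-frame-pointer -o myprog myprog.c

(For the kernel itself, the equivalent is the CONFIG_FRAME_POINTER build option.)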

Now, having this switch enabled on all the code is a bit of a pain – in something like a kernel compile, there are several major toolchains at work, and it turns out to be quite hard to get debug symbols installed for some of them.

The ‘dwarf’ unwinder mode helps sidestep this.

Perf is invoked as follows:

perf record -g --call-graph dwarf -a -F 97 -o perf.data.dwarf -- sleep 10

Note: The ‘-F 97’ forces ‘perf’ to sample at 97 Hz, which prevents the CPU and I/O from getting overwhelmed with data.

Unfortunately, the results are much the same – memset still expands into zeroes and 64bit hex addresses:

perf_report_dwarf

Kernel versions required to support ‘dwarf’ unwinding:

I think I’m right in saying that a kernel version of at least 3.9 is needed to support ‘dwarf’ unwinding.
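Checking which kernel you’re actually running is quick:

uname -r

On my Ubuntu 14.04 guest that reports 3.13.0-32-generic, so the kernel itself shouldn’t be the limiting factor here.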

People using older kernel versions will need to keep relying on ‘-fno-omit-frame-pointer’, or at least on debug symbol packages that are built with that option.

Next steps:

Well, I seem to have drawn a blank with using Perf under Ubuntu. It’s too early to say that the tools are broken – there are differences between the distributions, and it’s possible that I’m getting tripped up by something silly.

I think the next step is to try the same thing on a Red Hat-family distribution, such as Fedora. I’m not going to use CentOS because I don’t know that everything is built 100% the same, and also the kernel versions tend to be older.

Installing debug symbols for the GNU toolchain

The last post showed how to find the most used symbol in a running system with ‘perf top’.

I could tell that the ‘cc1’ tool contained the most used symbol, but since I don’t have the debug symbols installed for that tool, I couldn’t find the symbol name.

This post describes how to get the debug packages installed.

Installing ‘gcc’ debug symbols:

Since I know that my compilation system uses the GNU C toolchain, it seems that I should just be able to install the debug symbols.

I can search for an appropriate debug symbol package as follows (output has been edited for clarity):

nick>> sudo apt-cache search 'gcc-' | grep dbgsym
gcc-4.7-multilib-dbgsym - debug symbols for package gcc-4.7-multilib
gcc-4.7-dbgsym - debug symbols for package gcc-4.7
libgcc-4.8-dev-dbgsym - debug symbols for package libgcc-4.8-dev
gcc-4.8-dbgsym - debug symbols for package gcc-4.8
libx32gcc-4.8-dev-dbgsym - debug symbols

So, there are several gcc versions to choose from. It’s easy to find the current ‘gcc’ version:

gcc -v
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

OK, I’ll install the ‘gcc-4.8-dbgsym’ package:

sudo apt-get install gcc-4.8-dbgsym

Easy enough, right?

Unfortunately, I get this message:

apt-get install gcc-4.8-dbgsym
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
gcc-4.8-dbgsym : Depends: lib32gomp1-dbg (= 4.8.2-19ubuntu1) but it is not installable
Depends: libn32gomp1-dbg (= 4.8.2-19ubuntu1) but it is not installable
Depends: libhfgomp1-dbg (= 4.8.2-19ubuntu1) but it is not installable

OK, maybe I can fix this by forcing the package management system to reload its index:

sudo apt-get update
sudo apt-get install gcc-4.8-dbgsym

No luck. I’ll need to look a bit further.

Checking package availability:

There appears to be a problem with package dependencies, so the first thing we want to do is check if the package really is available.

The simplest way to get information on an Ubuntu package seems to be to do a Google search. I know that I’m using Ubuntu ‘trusty’, so if I search on ‘ubuntu trusty lib32gomp1-dbg’, I’ll get to this page:

http://packages.ubuntu.com/trusty/debug/lib32gomp1-dbg

The problem here seems to be that the package is only available for 64 bit installs. I can tell this because in the ‘download’ section, the only architecture available is ‘amd64’.

Note : Even though I’m running on Intel CPUs (albeit Virtualized), I still want to use the ‘amd64’ packages. This is just labelling – AMD defined the 64 bit extensions to the x86 architecture first, so that’s how 64bit x86 code is referred to.
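If you’re not sure which flavour you’re running, the package system will tell you:

dpkg --print-architecture

A 32bit Ubuntu install reports ‘i386’ here, while a 64bit install reports ‘amd64’.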

Virtualization and 64bit operating systems:

I’ve been deliberately using a 32bit guest operating system inside my VirtualBox hypervisor. I’ve had problems bringing up a 64bit guest OS, even though my machine is 64bit and the processor has hardware virtualization extensions (Intel VT-x).

Now that I’ve figured out how to run 64bit guest operating systems, I should be able to side-step this problem.

Debugging high system time with ‘perf top’

In the last couple of posts I described a problem I encountered when compiling the Linux kernel on my VM.

In this post, I’m going to start to dig in and see what might be going on. I’m going to use the ‘perf’ toolkit, which is part of the Linux kernel.

Measure, measure, measure:

Effective performance optimization involves discovering the ‘hot’ part of the code, rather than guessing. Accurate measurement is key, because even informed opinion often turns out to be incorrect.

Graphing the data is important too. It’s possible to see patterns in visualized data that simply aren’t apparent when the raw numbers scroll past.

It’s possible to profile a user-space application with tools like ‘gprof’, but what happens when the time seems to be in the operating system?

System performance and ‘perf’:

The first tool most people turn to when there’s a performance problem is ‘top’. It’s a great tool and can tell me a lot – especially if I know how to interpret its output accurately.

The problem is that it’ll tell me which processes are consuming the most CPU, but not where inside those processes the time is going. That’s something that’s really hard to get without advanced debug tools.

Installing perf tools:

Even though ‘perf’ is part of the Linux kernel, user-facing tools are needed to drive it. They can be installed easily as follows:

sudo apt-get install linux-tools-common
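Note: On Ubuntu, I believe ‘linux-tools-common’ only provides a wrapper script; the actual ‘perf’ binary lives in a kernel-version-specific package, so you may also need:

sudo apt-get install linux-tools-$(uname -r)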

Perf top:

Once the tools are installed, simply start the workload. In this case, it’s easy:


cd <top of kernel source code>
make clean
make -j3

Then, run ‘perf top’. This will bring up an interactive interface that looks somewhat like traditional top. It’s a bit hard to capture text off it, so here’s the first few lines:

16.42%  cc1           [.] 0x00116d1a
 8.38%  [kernel]      [k] _raw_spin_unlock_irqrestore
 8.28%  [kernel]      [k] read_tsc
 6.45%  [kernel]      [k] finish_task_switch
 4.37%  libc-2.19.so  [.] __memset_sse2

This shows that a symbol inside the ‘cc1’ process is consuming the most CPU time. That’s no surprise, because we know that the system is compiling C code, and ‘cc1’ is part of the GNU C compiler toolchain.

Now, there’s a problem here, because I’m seeing a hex address rather than a human-readable symbol name. It is possible to resolve the address into a symbol manually, but it’s easier to install the debug symbol packages.

What’s more interesting is the kernel symbols on lines 2 through 4. Together, these three symbols are showing up in the samples more often than the user-space workload, which strongly suggests that this may be where the system time is going.

Next I’ll install the debug symbols for the C toolchain (it’s not as straightforward as it may seem, unfortunately), and then we’ll take a look at what these kernel symbols actually mean.

Debugging high system time in kernel compile

In the last post, I discovered that compiling the Linux kernel in my VM seemed to cause the system to spend a lot of its cycles in the operating system, rather than processing my compilation workload. This is known as ‘system time’, and is easily seen with the ‘top’ tool.

Virtual vs Physical:

I was pretty sure that seeing about 35% system time was unusual. I’d normally expect to see values like this in my torture-testing environments, where I deliberately do system calls as fast as possible.

The finger of suspicion immediately fell on my VM environment – mainly because it’s the biggest factor that was different between my environments.

VMs are amazing at making the guest operating system appear to be running on its own hardware. Of course, it’s really not the case : I’m actually running two operating systems on a single piece of hardware.

Operating systems really like to have full control of the hardware, so running two parallel OS instances is not something that comes naturally to computer systems. There’s some clever sleight of hand going on to give the guest OS the illusion that it’s in control, and the instances aren’t actually as separate as they appear.

For most purposes, this almost-separation works extremely well – VMs are great technology for development and investigative work. However, sometimes the physical reality of the computing environment leaks through, resulting in some hard-to-debug problems.

Workload characteristics:

Compilation workloads tend to start lots of fairly short-lived user-space processes which access large numbers of relatively small files. There’s the code that’s being compiled of course, but there’s also all the headers and (if the system is linking) all the object files.

One of the things that VMs do is virtualize the file system. Instead of having an actual disk that accepts commands from the OS, the VM uses a large file in the host operating system and presents that to the guest as if it were a disk.

There’s a big difference there, and VMs (especially type 2 VMs like VirtualBox, Parallels and VMware Workstation) have a reputation for poor I/O performance.

Note : I haven’t benchmarked I/O performance for myself, so take this statement with a grain of salt – I could be wrong.

A workload with lots of small random accesses would hit any file system hard, and perhaps the overhead of virtualization is what’s causing the high system time?

Cross-checking:

There’s one easy, if not terribly scientific, test that I can do to get a quick read on the problem: Repeat the same workload on a native Linux machine, to see if the same thing shows up.

Now, this really is rough and ready. Just about everything about the two environments is different.

My native Linux machine uses an Intel i5-3470 quad core processor, while my VM uses two virtual Intel Core i5-2500S CPUs out of the four in the host machine.

I have different amounts of memory available, and the guest OS kernel versions are different. The host machine is a Mac, too, so that’s another factor.

Still, running the same compilation process on the same Ubuntu kernel release showed that I was only using ~8% system time on my physical hardware. That’s much more like what I’d expect to see for a workload that doesn’t go out of its way to issue system calls.

To add another data point (and, scientifically speaking, lots more confounding factors!), I decided to do the same test with a guest OS running under Linux’s KVM system.

Again, the guest OS showed that it was spending reasonable amounts of time in system space.

This may possibly suggest that there’s something about the VirtualBox implementation which is giving rise to high system time in the guest OS.

The next post will show how I went about investigating this further.

Debugging a high system time problem

I covered some pretty basic debug in previous posts, so now it’s time to address something more complex.

Debug symbols:

As I mentioned earlier, in order to use advanced debug tools like ‘perf’ and ‘systemtap’, it’s useful to install the kernel debug symbols.

It’s not mandatory, but the tools will emit hex addresses rather than meaningful symbols unless they have a way to convert from instruction pointer addresses to symbols within the code.
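For reference, the usual way to get the kernel symbols on Ubuntu is (roughly, and assuming the ddebs repository layout hasn’t changed) to enable the debug-symbol archive and install the dbgsym package that matches the running kernel. You may also need to import the repository’s signing key:

echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | sudo tee /etc/apt/sources.list.d/ddebs.list
sudo apt-get update
sudo apt-get install linux-image-$(uname -r)-dbgsym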

Now, on another system I use, I noticed that the standard way of installing current debug symbols didn’t seem to work. I compiled a new version of the Ubuntu kernel with debugging mode turned on, and the symbols started showing up. I’ll write about how to do this in future posts.

I thought I’d do the same thing on my Virtual Machine, more out of interest than anything else, since adding debug symbols seemed to work on this instance.

High system time:

When I kicked off the compile, I noticed that the system seemed to be spending a lot of time in kernel space.

Here’s a snapshot of the ‘top’ output:

top - 14:20:11 up 39 min,  9 users,  load average: 1.18, 0.62, 0.42
Tasks: 205 total,   2 running, 203 sleeping,   0 stopped,   0 zombie
%Cpu0  : 46.0 us, 44.0 sy,  0.0 ni,  0.0 id,  4.0 wa,  0.0 hi,  6.0 si,  0.0 st
%Cpu1  : 71.0 us, 29.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   3029552 total,  2220740 used,   808812 free,   187868 buffers
KiB Swap:  3069948 total,        0 used,  3069948 free.  1011568 cached Mem

  PID USER  PR  NI   VIRT    RES   SHR S  %CPU %MEM   TIME+   COMMAND
21965 nick  20   0  43464  31968  8176 R  55.7  1.1   0:00.63 cc1
 3861 nick  20   0 708676 275508 44732 S  12.9  9.1   0:20.53 firefox
 2623 nick  20   0 452592 125972 38452 S   2.0  4.2   1:17.22 compiz
    7 root  20   0      0      0     0 S   1.0  0.0   0:03.34 rcu_sched
   13 root  20   0      0      0     0 S   1.0  0.0   0:02.11 ksoftirqd/1
  977 root  20   0  27196   1412  1032 S   1.0  0.0   0:00.95 VBoxService

I normally don’t expect to see such high values. Some system time is to be expected, but this is much higher than I’d expect for a user-space job like compilation. Since this blog is about debugging hard problems like this, I thought it would be fun to find out what’s going on.

The next post in this series describes finding a high system time problem when compiling the Linux kernel.

WordPress vs Blogspot experiment

I’ve been adding content to this blog for a few weeks now, and I’m actually quite enjoying the experience.

I’ve spent enough time with WordPress to get an idea of its strengths and weaknesses. It’s pretty good – it seems to be reliable (there’s nothing worse than having a web page eat your post!), the scheduling features work cleanly and I’ve been getting page views.

Since this blog is about experiments and gathering data, I thought I’d have a bit of fun with competing blogging platforms. I’ll be posting to both WordPress and Blogger and monitoring page views and search results.

I’ve set up a copy of the content on Google’s Blogspot service here:

http://nickjpavey.blogspot.com

I’ll be gathering my impressions of what it’s like to use both sites for hosting technical content. I’ll report back in a couple of weeks.

Running 64bit guest operating systems under VirtualBox on Mac OS X

I’ve been using VirtualBox for ages. It’s a pretty handy virtualization system for multiple platforms, especially given that it’s free.

Over the years, I’ve run into a problem where I haven’t been able to run 64bit guest operating systems. I’d always get an error about my processors not supporting hardware virtualization.

This surprised me because I knew that I was using 64bit capable processors which had the virtualization extensions. I also knew I was running a 64bit host operating system.

I’d always just accepted it and gotten on with my work with a 32bit guest OS, which actually works fine for ordinary work.

My environment:

My computing environment is Mac OS X (Yosemite), running on a mid 2011 27″ iMac. The processors in this machine are plenty late enough to be 64bit capable, and have the virtualization extensions.

I’m usually installing Ubuntu or CentOS as my guest operating systems.

I don’t have the equipment to test whether this trick works on Windows, or under native Linux. Your mileage may vary.

The ‘Version’ drop down matters:

It turns out that when you set up a new VM, the ‘Version’ drop down actually does something. I’d always assumed that it was just a label, and that all the settings that mattered were actually under the ‘System’ tab of the setup tool.

That’s not the case at all: choosing a 64bit entry in the ‘Version’ drop down is what actually enables the VM to run a fully 64bit guest OS.

Here’s the all-important drop-down:

Screen Shot 2015-01-16 at 6.19.00 PM

Make sure that the Version drop down says 64bit, and you should be all set.
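If you prefer the command line, the same setting can be changed with VBoxManage while the VM is powered off. For example, for a hypothetical VM called ‘UbuntuDev’:

VBoxManage list ostypes | grep 64
VBoxManage modifyvm "UbuntuDev" --ostype Ubuntu_64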

Who knew? Well, maybe everyone apart from me 🙂

Processor Performance Counters

In the last few posts I looked at a simple use case for the SystemTap tool from RedHat.

Now I’m going to look at the ‘perf’ tool that comes as part of the Linux kernel.

Perf is derived from the “Performance Counter for Linux” (PCL) project, so it’s worth taking a bit of a detour into the hardware features of modern processors.

Bringing up a CPU for the first time:

The traditional way to debug software is to use a debugger such as GDB. The problem is that this depends on two things:

  • Having a development environment that is mature enough to support a debugger.
  • Having a process which can accept being stopped and single-stepped.

Bringing up a new processor design to the point where it can run even a small operating system is a major chunk of work. In the meantime, the chip is sitting in the lab and a way is needed to figure out what it’s capable of doing.

Assuming the CPU is working (which it may not be), even a fairly large bring-up program will be completed in microseconds. That’s obviously way too fast to observe manually, even if there was a good way to see what the machine is doing.

Logic analyzers can be attached to the busses and pins, but that only shows what’s going on at the physical interfaces to the chip. As a side note, monitoring high speed signals on a 1,000 pin socket is not actually that easy, either!

In the CPU design world, tools are often rather low-level. Frankly, a ‘printf’ statement is pretty heady stuff, because it implies the system is running an OS with a C standard library, and I actually have something to view the output on!

So, given that I have a pretty bare-bones set of tools, how do I know what’s going on inside the core of the device?

On-chip monitoring hardware:

Part of the answer is on-chip debug systems and performance counters. For completeness, there are also Scan and JTAG interfaces, but they are well outside the scope of this article.

On-chip debug and performance monitoring hardware has visibility into the low level implementation of the processor design. It can be used to monitor the system as it runs at full speed.

Modern hardware debug modules can be pretty powerful. They have a range of programmable features and can monitor a wide range of system resources (although not all at the same time) without degrading performance. Of course, the feature set is nothing like as rich as a full interactive debugger, but it’s an awful lot better than nothing!

Performance counters are related to the on-chip debug system. As the name implies, they are designed less for identifying software errors than for quantifying software behavior as it runs.

They can still be pretty flexible though, especially with tools such as ‘perf’ and ‘Oprofile’ to help configure them.

Finally, traditional debuggers emphasize functional debugging and won’t reveal anything about important architectural issues such as cache utilization. Given how important the memory hierarchy is to performance, understanding what the workload is doing to the caches is critically important.
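As a taste of what the counters expose, ‘perf stat’ can be asked for a handful of common hardware events. The event names below are generic perf aliases (the exact mapping depends on the CPU), and ‘myprog’ is just a stand-in for whatever workload you want to measure:

perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses ./myprog

The instructions-per-cycle figure and the cache miss rate alone will often tell you whether a workload is compute bound or stalling on memory.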

Documentation:

Understanding how performance counters can be programmed does require a background in reading processor architecture manuals. It’s pretty technical stuff but many of you will be interested, so here are some links:

(Warning : These docs are VERY large. You may want to download them to your machine then view them outside of your web browser.)

Comparing Linux WM load to Mac OS X (Yosemite)

In the last post I used SystemTap to look at the I/O operations generated by the desktop environment.

I thought it would be interesting to do a similar analysis on my Mac, just to have a comparison.

It turns out that it’s more complicated to analyze the Mac’s behavior than it is for Linux. Based on this specific experiment, the Mac seems to live in a middle ground between IceWM and Ubuntu’s Unity environment.

However general usage impressions and some other filesystem data suggest that disk utilization may be substantially higher.

Mac OS X tools:

The Mac uses quite a different tool set to do monitoring from Linux.

Thanks to its BSD heritage, Mac OS X has an implementation of Sun’s ‘dtrace’, which arguably is the gold standard of tracing systems.

Dtrace is not fully ported to Linux currently, although work is underway. I tried using it about six months ago and found it wasn’t fully mature yet.

The tool to use on the Mac is ‘fs_usage’. I believe (although I could be wrong) that this is based off ‘dtrace’.

It’s a very handy tool and gives quite a bit of actionable info right out of the box.

Results:

Over a 60 second period my Mac issued ~315k I/O operations. Many of these were filesystem metadata operations like ‘getattrlist’ and ‘stat64’, which may not result in many actual disk operations.

If I filter the ‘fs_usage’ output pretty strictly, allowing a fairer comparison with the ‘systemtap’ under Linux, then I get a count of about 875 reads and writes over 60s. This is actually better than under Linux, but only if we compare these specific operations.
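Something along these lines does the job; the exact call names you see may vary between OS X releases:

sudo fs_usage -w -f filesys > fs_usage.log     # stop with Ctrl-C after ~60 seconds
grep -Ec ' (read|write|pread|pwrite) ' fs_usage.log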

The majority of the load seemed to be from the Spotlight indexing system, which I’ve noticed being pretty intensive before.

Commentary:

I’m actually not that keen on Spotlight – I feel it uses a lot of resources for rather modest returns. I use its features very rarely, but I suspect that I pay a performance penalty for its indexing operations.

Even though the actual number of reads and writes doesn’t appear to be that high, you can hear the disks working. The disks are clearly doing a lot of work and application startup is rather sluggish.

My guess is that the metadata operations (getattr, stat, open etc etc) actually do contribute to quite a lot of disk activity, especially shortly after reboot when the filesystem caches may not be that warm.

Moving the head of a rotating disk is pretty time consuming. On my iMac (7,200 RPM conventional disk), the disk can only perform about 100 I/O operations per second. Even a fairly modest seek load can really eat into system performance.

We could use another tool to probe a lower layer of the file system to see how many seek operations are actually reaching the disk. ‘iostat -o’ is an OK tool for this, but it’s not as capable as its Linux counterpart and there isn’t a specific column for the number of seeks per second. You can infer what’s going on, but it’s a bit sparse.

It is possible to disable spotlight, but I’ve found that other applications use it as a central searching facility. For example, I could not search email in Microsoft Outlook when Spotlight was turned off.