Sunday, 10 January 2016

x86 single step mode - how slow is it?

The x86 CPU architecture has a feature that generates a "trap" (a software interrupt) after executing each instruction. This feature dates all the way back to 8086, and it's activated by setting a bit within the FLAGS register. Any program can activate this at any time.

The feature is powerful. If you want to single-step through some code, watching what it is doing, you don't need an instruction decoder in order to figure out where to put the next breakpoint, or an interpreter to figure out what each instruction means. You can rely on the CPU to do the execution.

I thought about using this feature for various things. What I had in mind were some (or all) of:
  • execution tracing (i.e. produce a list of executed instructions),
  • counting the number of executed instructions,
  • monitoring memory accesses,
  • changing the behaviour of certain instructions while keeping everything else the same.
A program must register an exception handler (Windows) or a signal handler (Linux) in order to use this feature: otherwise the program just crashes when the trap signal is received and not handled ("Trace/breakpoint trap"). Having done so, a specific bit in the FLAGS register may be set in order to activate single stepping:
pushf /* save EFLAGS register on stack */
orl $0x100, (%esp) /* set TF bit */
popf /* reload EFLAGS register */
However, there are some problems with the single step feature:
  1. It's slow!
  2. On Windows, you can't easily single-step across system calls if you are running a 32-bit program on a 64-bit host.
I'll concentrate on the first problem. How slow is it?

A single-step trap event involves the following actions:
  1. One instruction is fetched, decoded and executed,
  2. CPU generates a software interrupt;
  3. OS handles the interrupt;
  4. OS notifies the program's signal handler;
  5. Program handles the signal and returns to OS;
  6. OS finishes handling the interrupt;
  7. CPU resumes normal execution (fetches the next instruction).
As this is not a normal usage scenario, there is no particular reason for either the CPU designer or the OS developer to make it work quickly. I wrote programs to measure the length of time taken to go through this sequence. (For those interested in such things, source code and binaries are in this bootable ISO image.)

The first of these programs ran on "bare metal", i.e. without an OS. I used the memtest86 program as a basis for this, as it is written in C and hence easily modifiable, but it is started from the boot loader (or CD/USB stick) and runs with its own (very minimal) device drivers. I stripped out the memory-testing functionality and replaced it with code to enter single-step mode, execute an instruction, and handle the resulting trap event. I measured the current time (using the RDTSC instruction) at steps 1, 5 and 7 in the list of actions written above.

On my PC, which has a Core2 Duo E8600 CPU, it takes 440 clock cycles to go from step 1 to step 5, assuming all code is in cache. It then takes a further 240 clock cycles to go from step 5 to step 7. These timings include a very minimal interrupt handler which does little beyond saving the register state: in this environment, that is the extent of the "OS". Though software overheads are as low as they reasonably could be, we still see a 680 clock cycle overhead! If the original instruction would take one clock cycle, it now takes 680.

On other CPUs the results will be different, but you can expect a similar large difference between single step mode and regular execution.

With an OS, the situation is worse, because there is a much greater overhead for steps 3, 4 and 6. There are four transitions between kernel mode (where the OS runs) and user mode (where the application runs).

The cost is measured by my second program, which runs within Windows 7. On the same CPU, it takes 19080 clock cycles to go from step 1 to step 5, and then 14050 clock cycles to go from step 5 to step 7. The total overhead is 33130 clock cycles.

My third program runs within Linux, where the latency is somewhat lower. Again on the same CPU, the Linux 2.6 kernel takes 3470 clock cycles for step 1 to step 5, and 2040 clock cycles for step 5 to step 7. The total overhead is 5510 clock cycles. Putting the results together:


Step 1 to 5 Step 5 to 7 Total
Bare metal 440 240 680
Windows 7 19080 14050 33130
Linux 2.6 3470 2040 5510

These results mean that it's not practical to use the single step feature to run any large program, because that program will run at a much lower speed: thousands of times slower than it normally would. Most of this cost is due to the OS, but even if the OS handled single-stepping really efficiently, there would still be a huge cost from the CPU itself.

Some of this cost can be reduced. Some Intel CPUs allow you to specify that traps should only be generated for branch instructions. This is sufficient to generate an instruction trace. But the overhead of each trap is so large that any large program will still be impractically slow. Some Intel CPUs also have a feature to log branch events in memory (the "branch trace store"). This is much faster, because the work is done without interrupting software execution. The branch trace store could also be used to reconstruct an instruction trace. However, these possibilities are limited, because they are not provided by all x86 CPUs. Anything in a "model specific register" is subject to change between CPU revisions, and though AMD CPUs have equivalents, I gather that these are not quite the same.

Though single-step is powerful, it is impractical. The Valgrind tools are able to produce instruction traces and analyse memory accesses performed by a program, but this is done by recompiling the program using a framework named libVEX, which is effectively a simulator for x86 code. Though there is some overhead here, it is much lower than single-stepping. Valgrind does not require any Intel/AMD model-specific features. If you require something that can be achieved by single-stepping, but performance is important, then this approach is probably the best way to go.