Sunday, 5 April 2020

x86determiniser version 2.0

Yesterday I finished work on version 2.0 of the "x86determiniser" program. This is a "simulator" for x86 programs which executes native Windows (.exe) and Linux (ELF) programs. In x86determiniser, system calls are passed through to the host operating system, and all execution timings are repeatable: you get the same timing measurements for every run of a program.

This is free, open-source software (MIT license) and I've been working on it occasionally for the last five years. The source code is on my x86determiniser GitHub page and binaries are on my own website.

When a program runs within x86determiniser, timings measured using the RDTSC instruction are repeatable: you get the same result every time you run the program.

For example, here's a short C program which uses RDTSC to count the CPU clock cycles required to fill an array:

    #include <stdint.h>     /* uint64_t */
    #include <x86intrin.h>  /* __rdtsc (GCC/Clang) */

    #define size 1000000
    int array[size];
    uint64_t FillArray (void)
    {
        uint64_t start_time, finish_time;

        start_time = __rdtsc();
        for (int i = 0; i < size; i++) {
            array[i] = i;
        }
        finish_time = __rdtsc();

        return finish_time - start_time;
    }

Here are some timings returned by that function (small numbers = faster):

    8139400 1728760 1375740
    1319540 1580800 1334210
    1459210 1333560 1323110
    1329070 ...

Notice that there is significant variation. Though the function always does exactly the same thing, it doesn't take the same length of time to do it. The time taken is affected by many factors: the OS scheduler, I/O activity, memory paging, the cache contents and the CPU pipeline state, in (very) approximate order of importance. The result is effectively random and very hard to predict.

You can exclude the effects of scheduling by measuring only the CPU time of a single process, but most operating systems do not provide this measurement at fine granularity, and it still does not exclude the effects of the cache.

x86determiniser removes this problem completely. Here are the results of running the same function inside x86determiniser:

    4000006 4000006 4000006
    4000006 4000006 4000006
    4000006 4000006 4000006
    4000006 ...

No matter how many times you run the function, it does exactly the same thing, in exactly the same time: at least, from the perspective of the RDTSC instruction! Of course, the actual time taken to run x86determiniser may still vary, but the RDTSC clock inside the simulator ticks at exactly 1 clock cycle per instruction. We see that storing each value within the array takes 4 clock cycles (the 4-instruction loop body), and that there is an overhead of 6 clock cycles (6 instructions) before the loop. In the following disassembly, the loop body is the block from 401555 to 401562, and the 6 setup instructions run from 401540 to 40154e. As the loop runs 1000000 times, the execution time reported by the simulated RDTSC instruction is exactly 4*1000000+6.

    0000000000401540 <FillArray>:
      401540:       rdtsc
      401542:       mov    %rax,%rcx
      401545:       shl    $0x20,%rdx
      401549:       xor    %eax,%eax
      40154b:       or     %rdx,%rcx
      40154e:       lea    0x642b(%rip),%rdx   # 407980 <array>  
      401555:       mov    %eax,(%rdx,%rax,4)
      401558:       add    $0x1,%rax
      40155c:       cmp    $0xf4240,%rax
      401562:       jne    401555 <FillArray+0x15> 
      401564:       rdtsc
      401566:       shl    $0x20,%rdx
      40156a:       or     %rdx,%rax
      40156d:       sub    %rcx,%rax
      401570:       retq

x86determiniser was created to execute test cases in a manner that was both predictable and repeatable. At the time, I was working on a timing analysis tool named RapiTime, made by Rapita Systems. This tool captures very detailed timing information about a program, almost at the statement level, by creating a timestamped execution trace. We wanted to ship an example program for RapiTime which was fairly complex, but it needed to run on the user's PC (i.e. Windows or Linux) and it needed to produce completely repeatable timing results. We needed a counter or clock that was very accurate but also very predictable, like the timing measurements that would actually be used when RapiTime was deployed on a real embedded system.

One answer was to run the example in a simulation which would remove all of these effects. But which simulation software was suitable? I considered various possibilities, such as QEMU, but none of the well-known simulators met all of the requirements:

  • Simulator runs on Windows and Linux.
  • Simulates native programs (.exe, ELF) allowing file I/O, "printf" etc. to the host system.
  • Deterministic timing measurements.

Some simulators are able to operate by "syscall emulation" which means that all system calls performed by the simulated program are sent to the host system. For example, if you have Linux on a Raspberry Pi, you can install the "qemu-x86_64" program, which allows you to run a Linux program that was compiled on a PC, even though the Pi does not have an x86 CPU. The program is dynamically translated to ARM by QEMU, and system calls are also translated as required, so that they are handled by the Linux kernel on the Pi. Similarly, you can run Raspberry Pi Linux programs on your Linux PC using "qemu-arm".

QEMU is not the only simulator to provide the syscall emulation feature: SimpleScalar did it too, and so does gem5. And though you may not think of Valgrind as a simulator, it is built on a binary translation library called VEX, and system calls are passed through to the host!

Many of these simulators are open source, and could be modified to provide deterministic timing measurements. They all work on Linux. But there's a sticking point: none of them supports syscall emulation on Windows. At best, the simulator itself runs on Windows, but syscall emulation does not. For instance, if you are running QEMU on Windows, you are always simulating a complete system, including all of its hardware and its operating system. This is not the same thing, and does not meet the requirements.

I set out to write x86determiniser as a solution to this problem. I didn't want to write a complete x86 simulator, because the instruction set is monstrously complex, so I decided instead to try to take shortcuts.

My first idea was to use the single step feature of the x86 CPU. This executes one instruction and then triggers a breakpoint, handing control to a debugger or an exception handler. x86determiniser would execute the program by single-stepping, taking a special action for the RDTSC instruction, where it would substitute an instruction count instead of a clock cycle count.

This worked, but I found that single stepping was so slow that it was completely impractical. My programs ran at about 10,000 instructions per second, which made it effectively impossible to execute any non-trivial program: typical native execution speed is almost a million times faster! Single stepping is slow because every instruction generates an interrupt, which must be handled by the operating system and then passed to the program. Many thousands of additional instructions are executed for each single step.
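The single-step approach can be sketched on Linux with ptrace (a simplified illustration, not x86determiniser's actual code; a real version would also decode each instruction so that RDTSC could be replaced with the instruction count):

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Count the instructions executed by a child process by resuming it
   with PTRACE_SINGLESTEP.  The CPU stops after every instruction and
   the OS notifies the parent: that kernel round trip, repeated for
   every instruction, is what makes the method so slow. */
long long count_child_instructions (void (*child_work) (void))
{
    pid_t pid = fork ();
    if (pid == 0) {
        ptrace (PTRACE_TRACEME, 0, NULL, NULL);
        raise (SIGSTOP);            /* wait for the parent to attach */
        child_work ();
        _exit (0);
    }
    int status;
    long long steps = 0;
    waitpid (pid, &status, 0);      /* child is stopped by SIGSTOP */
    while (WIFSTOPPED (status)) {
        ptrace (PTRACE_SINGLESTEP, pid, NULL, NULL);
        waitpid (pid, &status, 0);
        steps++;
    }
    return steps;
}

/* A trivial workload: even this executes hundreds of instructions. */
void example_work (void)
{
    for (volatile int i = 0; i < 100; i++) { }
}
```

Even the trivial workload above costs hundreds of round trips through the kernel, one per instruction, which is why this approach ran at only about 10,000 instructions per second.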

Thinking again, I decided to look at a simple form of binary rewriting. For any point in the program, x86determiniser identifies the next control flow instruction, such as "jmp", "call", "ret" or a conditional jump. It is replaced with new code which saves part of the context and then switches into x86determiniser, which is loaded into the same memory space as the program. x86determiniser interprets the control flow instruction, decides which part of the program should be executed next, and then continues execution until the next jump. This approach has some nice properties:

  • It's not necessary to decode or interpret most x86 instructions. In fact, x86determiniser does not know (or care) what most x86 instructions do: it simply causes the CPU to execute them as usual.
  • There is a context switch for each jump, but it is only within the program, so there is no interrupt handler or OS overhead.
  • System calls are treated like any other instruction, so we get "syscall emulation" for free.
  • The technique works the same way on any OS, so platform-specific code is mostly avoided.
  • The technique can be applied selectively to some parts of the program, with other parts executing normally at full speed. In x86determiniser it is applied only to code from the executable. Shared libraries execute normally.
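In outline, the main execution loop might look like this (pseudocode; the names are invented for illustration):

    loop:
        pc     = current simulated program counter
        branch = scan forwards from pc to the next control flow
                 instruction ("jmp", "call", "ret", conditional jump...)
        saved  = copy of the machine code bytes at branch
        patch the bytes at branch with code that saves the context
            and switches into x86determiniser
        restore the program's context and run natively from pc
        # ...the program runs at full speed until it reaches branch...
        save the program's context; restore saved bytes at branch
        instruction_count += instructions from pc to branch, inclusive
        pc = interpret branch to decide where execution continues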

However, there are some serious disadvantages, such as:

  • Even without interrupt handler and OS overhead, there's still a significant cost for every control flow instruction, because each requires something like a context switch in order to execute part of x86determiniser.
  • The program should be single-threaded, because only a single thread can be executed in this way. If the program is multi-threaded, the additional threads cannot share any code with the main thread. This is because the code is modified in place.

Despite these disadvantages, the x86determiniser method proved fast enough to be practical. In version 1.x, programs were linked against "libx86determiniser.dll" (or the Linux equivalent) and called a function named "startup_x86_determiniser()" from their "main()" function. When the startup function was called, it activated x86determiniser, and from that point onwards the program continued to execute in "simulation". These early versions used GNU objdump to disassemble the program and determine the addresses of control flow instructions. 64-bit programs were not supported.

These limitations meant that a program had to be specially modified for use with x86determiniser, that the x86determiniser software license applied to the programs linked with it, and that users of 64-bit Linux were forced to install 32-bit versions of GCC in order to use the software.

I was not happy with these limitations, so I set out to create a new version which would avoid them. After more than two years of on/off work, I have finally finished version 2.0 with the following new features:

  • x86determiniser is now a program, not a library. It loads a program like a debugger, and then injects the "determinising" software into the program's memory space. This means that programs don't have to be modified for use with x86determiniser.
  • 64-bit support.
  • Built-in disassembler (Zydis) so objdump is not required.
  • MIT license, so there are no issues with distributing binaries.

The new version was achieved in steps, which were (roughly):

  • Introduce the x86determiniser "loader" frontend on Windows, 32-bit.
  • Add new test cases: simple ones to test basic functionality such as parameter parsing, and a complex one to check that a correct instruction trace would be produced for a particular test program.
  • Add 64-bit Windows support.
  • Port to Linux (32-bit).
  • Add 64-bit Linux support.
  • Add further test cases to search for further problems.

I found many of these steps very challenging. Because x86determiniser now acts as a loader, using ptrace (Linux) or WaitForDebugEvent (Windows), it is not possible to attach a debugger to the process being "simulated". Without debugging tools it is hard to figure out why something went wrong. I dislike having to fall back to the "printf" approach to debugging but sometimes the only alternative is guesswork.

64-bit support was also challenging. I found that my version 1.x code relied on the ability to put absolute addresses into the machine code, for example in order to do a jump to an arbitrary address without using a register. The following will load an address from a global variable named next_code, and then jump to that address:

    jmp     *(next_code)

With 64-bit x86 code, the address of the next_code variable cannot be specified as a 64-bit absolute address, nor as a 64-bit offset: only a 32-bit offset is available. This forced changes in various places, since I had relied on the ability to refer to any location in the memory space within "jmp" and "call". The weirdest implication was that the symbol for "next_code" could not be exported globally with ".global", because if it were, it would become possible for shared libraries to override it, and that would prevent the assembler from using a 32-bit offset.
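In 64-bit code, the same indirect jump has to be written with a RIP-relative reference instead, which can only reach locations within ±2 GiB of the instruction (a sketch in AT&T syntax, matching the example above):

    jmp     *next_code(%rip)    # fetch the 64-bit target from next_code, then jump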

A further 64-bit issue was caused by the red zone, which allows a function to use some of the memory below the current stack pointer. This prevented x86determiniser from capturing the state of the CPU flags. I was forced to subtract 128 bytes from the stack pointer when switching context, in order to avoid destroying information in the red zone. The need to deal with the red zone had come up before, but this time I could not work around it with "-mno-red-zone", since I wanted x86determiniser to support programs without modification.

Because of these issues, the code modification needed to jump from "simulated" code back into x86determiniser grew from a five-byte absolute call to a nineteen-byte instruction sequence which modifies the stack pointer to avoid the red zone, and then does an absolute call, dereferencing an address stored after the instruction. I had to pick instructions which would not affect the FLAGS register, i.e. use "lea" rather than "add".
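As a rough sketch (my reconstruction of the idea, not necessarily the exact code), a sequence like the following fits in nineteen bytes and leaves FLAGS untouched:

    lea     -128(%rsp), %rsp    # 5 bytes: step past the red zone ("lea" does not alter FLAGS)
    call    *0(%rip)            # 6 bytes: absolute call via the 8 bytes stored just after
    .quad   entry_point         # 8 bytes: the address to call ("entry_point" is a placeholder)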

x86determiniser has involved quite a large effort, but I think it has been worthwhile. I have learned details about how debuggers actually work, how to do DLL injection, and I've been able to do some fun low-level programming.

However, I am not entirely happy with x86determiniser version 2.0. I think performance is poor, and the lack of multi-threading support is an issue. The "loader" part of the program is both complicated and platform-specific. For the next major version (if any) I would probably look at integrating a "real" x86 CPU simulator to interpret the instructions. Only the system call instructions would actually be executed in place, and the program itself would not be modified. I think that this would greatly improve performance in most cases.