retrowin32: redoing syscalls

September 18, 2024

This post is part of a series on retrowin32.

The "Windows emulator" half of retrowin32 is an implementation of the Windows API, which means it provides implementations of functions exposed by Windows. I've recently changed how calls from emulated code into these native implementations works and this post gives background on how and why.

Consider a function found in kernel32.dll as taken from MSDN (with some arguments omitted for brevity):

BOOL WriteFile(
  [in]  HANDLE   hFile,
  [in]  LPCVOID  lpBuffer,
  [in]  DWORD    nNumberOfBytesToWrite,
);

In retrowin32 we write a Rust implementation of it like this:

#[win32_derive::dllexport]
pub fn WriteFile(
    machine: &mut Machine,
    hFile: HFILE,
    lpBuffer: &[u8],
) -> bool {
    ...
}

Suppose we have some 32-bit x86 executable that thinks it's calling the former. retrowin32's machinery connects it through to the latter — even though the implementation might be running on 64-bit x86, a different CPU architecture, or even the web platform.

Language gap

Before we get into the lower-level x86 details, you might first notice that those function prototypes don't even take the same arguments — nNumberOfBytesToWrite disappeared.

On the caller's side, a function call like WriteFile(hFile, "hello", 5); creates a stack that looks like:

return address   ← stack pointer points here
hFile
pointer to "hello" buffer
5

The dllexport annotation in the Rust code above causes a code generator to generate an interop function that knows how to read various Rust types from the stack. For example for WriteFile the code it generates looks something like the following, which maps the pointer+length pair into a Rust slice:

mod impls {
    pub fn WriteFile(machine: &mut Machine, stack_pointer: u32) -> u32 {
        let hFile: HFILE = machine.mem.read(stack_pointer + 4);   // hFile
        let lpBuffer: u32 = machine.mem.read(stack_pointer + 8);  // pointer to "hello"
        let len: u32 = machine.mem.read(stack_pointer + 12);      // 5
        let slice = machine.mem.slice(lpBuffer, len);
        super::WriteFile(machine, hFile, slice) as u32  // the human-authored function
    }
}

There's a bit more detail around how the return value makes it into the eax register and how the stack gets cleaned up, but that is most of it. With these in place, the remaining question is how to detect when the executable makes a DLL call and route it through to this implementation.
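To sketch that remaining glue (a hypothetical outline; the register and field names here are illustrative, not retrowin32's actual ones), the code around the generated shim might do something like:

fn call_write_file_builtin(machine: &mut Machine) {
    let esp = machine.regs.esp;
    let return_address: u32 = machine.mem.read(esp); // pushed by the caller's call
    let ret = impls::WriteFile(machine, esp);
    machine.regs.eax = ret;             // the BOOL result travels back in eax
    machine.regs.esp = esp + 4 + 12;    // stdcall: pop the return address plus 12 bytes of args
    machine.regs.eip = return_address;  // resume execution at the caller
}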

DLL calls

Emulation aside, when an executable calls an internal function it knows where that function is in memory, so it can generate a call instruction targeting the direct address. When an executable wants to call a function in a DLL, it doesn't know where in memory the DLL is, so how does it know where to call?

In Windows this is handled by the "import address table" (IAT), which is an array of pointers at a known location within the executable. Each call site is compiled to call indirectly through the IAT: instead of calling WriteFile at a specific address, it calls the address found in the WriteFile entry in the IAT. At load time, each needed DLL address is written into the IAT.

(Here XXXX stands for the constant address of the WriteFile entry in the IAT, so the call site is effectively call [XXXX], while CCCC is the address written into that entry at load time, pointing at the implementation of WriteFile.)
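In loader terms, filling the IAT at load time amounts to something like the following (a minimal sketch with made-up names, not Windows' or retrowin32's actual code):

// Hypothetical sketch: resolve each import and write its address into the IAT.
fn fill_iat(iat: &mut [u32], imports: &[(String, String)], resolve: impl Fn(&str, &str) -> u32) {
    for (slot, (dll, symbol)) in iat.iter_mut().zip(imports) {
        // e.g. ("kernel32.dll", "WriteFile") resolves to the address of WriteFile
        *slot = resolve(dll, symbol);
    }
}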

Advanced readers might ask: given that we're writing addresses into the IAT at load time, why have this indirection at all instead of just writing the correct addresses into the call sites? The answer is that there's a balance between making things fast in the common case and keeping memory shareable. Windows even has an optimization that attempts to make the IATs (which are mutated after loading, and so typically trigger a copy-on-write) amenable to being shared across process instances.

retrowin32, before

In retrowin32's case we just needed some easy way to identify these calls, so retrowin32 poked easy-to-spot magic numbers into the IAT.
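For example, the loading step could fill each slot with a tagged index rather than a real address (a sketch with a made-up marker value, standing in for the 0xFIAT_xxxx mnemonic below):

// Hypothetical sketch: tag each IAT slot with a magic marker plus an index
// into the table of builtin implementations.
const MAGIC_BASE: u32 = 0xFFFF_0000; // made-up marker

fn fill_iat_with_magic(iat: &mut [u32]) {
    for (index, slot) in iat.iter_mut().enumerate() {
        *slot = MAGIC_BASE | index as u32;
    }
}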

The emulator's code then looks like:

const builtins = [..., WriteFile, ...]
fn emulate() {
    if instruction_pointer looks like 0xFIAT_xxxx {
        builtins[xxx].call()
    } else {
        ...emulate code found at instruction_pointer...
    }
}

This worked fine for a long time! But it had a few important problems that became apparent over time.

  1. In "native" x86-64 mode we don't get to intercept all calls to our magic addresses, and rather need IAT calls to point to real x86 code. This required a significantly different code path.
  2. The Windows LoadLibrary() call gives you a pointer to the real base address where a DLL was loaded. An executable I was trying to get to run was calling LoadLibrary() and was attempting to traverse the various DLL file headers found at that pointer. Because retrowin32 didn't actually load real libraries, these didn't exist.
  3. DLLs can expose data other than functions. For example, DirectDraw exports vtables, and msvcrt.dll exports plain variables like ints.
  4. Machinery like the debugger that can single-step or show memory would get confused by these magic invalid addresses. In the above arrangement, there isn't an obvious place to put a breakpoint to break on calls to WriteFile().
  5. The core emulation loop, the most performance-sensitive code, needs to repeatedly check whether we've hit one of these magic addresses. (Things are arranged such that this check runs once per basic block and not once per instruction, at least.)

retrowin32, now

The shared root cause of some of the above issues is that executables just expect real DLLs to be found in memory. (Pretty much every problem bottoms out as Hyrum's Law.) I could make retrowin32 attempt to construct all the appropriate layout at runtime, but ultimately we can match DLL semantics best by creating real DLLs for system libraries and loading them like any other DLL.

However, all these DLLs need to do is hand control off to handlers within retrowin32. For this I changed retrowin32's code generator to emit trivial snippets of x86 assembly, which are compiled into real win32 DLLs as part of the retrowin32 build. (My various detours through cross-compiling are finally paying off!) retrowin32 needs a DLL loader anyway for DLLs found out in the wild, and by using a real DLL toolchain to construct these I guarantee the system DLLs have all of the expected layout.

I can then bundle the literal bytes of these DLLs into the retrowin32 executable so they can be loaded on demand without shuffling files around; they are only a few kilobytes each.
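Embedding them is straightforward; for instance (the path and name here are illustrative):

// Hypothetical sketch: bake a prebuilt stub DLL's bytes into the retrowin32 binary.
static KERNEL32_DLL: &[u8] = include_bytes!("../dlls/kernel32.dll");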

(Coincidentally, DREAMM, a much more competent Windows emulator, recently made a similar change for pretty similar reasons, which makes me more confident this is the right track.)

Each stub function needs to trigger the retrowin32 code that handles builtin functions. On x86-64 that code is especially tricky. To make this swappable, the stub functions all call out to a single shared function, retrowin32_syscall, yet another symbol that retrowin32 can swap out for the different use cases. They can find a reference to that symbol using exactly the same DLL import machinery, via a magic ambient retrowin32.dll.
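As a rough sketch of what the generator emits per builtin (hypothetical code; the real generator's output has more detail):

// Hypothetical sketch: emit one small stub per builtin, as assembly text that
// later gets assembled into the system DLL.
fn emit_stub(name: &str, stack_arg_bytes: u32) -> String {
    format!("{name}:\n  call _retrowin32_syscall\n  ret {stack_arg_bytes}\n")
}

The ret n at the end is the stdcall stack cleanup mentioned earlier.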

Under emulation retrowin32_syscall can point at some code that invokes the x86 sysenter instruction, which is arbitrary but easy for an emulator to hook. On native x86-64 we can instead implement it with the stack-switching far-calling goop.

Finally, in this picture, how does retrowin32_syscall know which builtin was invoked? At the time the sysenter is hit, the return address is at the top of the stack, so with some bookkeeping we can derive from that address that it was WriteFile.
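That bookkeeping could look something like this (a hedged sketch; the type and field names are made up):

use std::collections::HashMap;

// Hypothetical sketch: when a system DLL is loaded, remember which stub address
// belongs to which builtin, then look it up when sysenter is hit.
struct Shims {
    builtins_by_addr: HashMap<u32, fn(&mut Machine, u32) -> u32>,
}

impl Shims {
    fn handle_syscall(&self, machine: &mut Machine) {
        // The return address pushed by the stub's call identifies the stub,
        // and therefore which builtin the executable was calling.
        let esp = machine.regs.esp;
        let return_addr: u32 = machine.mem.read(esp);
        let builtin = self.builtins_by_addr[&return_addr];
        let ret = builtin(machine, esp);
        machine.regs.eax = ret;
        // ...then return control to the stub, which does the stdcall cleanup...
    }
}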

Vtables

COM libraries like DirectDraw expose vtables to executables: when you call e.g. DirectDrawCreate you get back a pointer to (a pointer to) a table of functions within DirectDraw. For the executable to be able to read that pointer, it has to exist within the emulated memory, which means I previously had some macro soup that knew to allocate a bit of emulated memory and shove the relevant function pointers into it when the DLLs were used.
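Roughly, that old approach boiled down to something like the following (hypothetical code; the allocator and writer names are made up):

// Hypothetical sketch of the old approach: allocate a bit of emulated memory
// and write the functions' (magic) addresses into it as a vtable.
fn alloc_vtable(machine: &mut Machine, fn_addrs: &[u32]) -> u32 {
    let vtable = machine.mem.alloc(4 * fn_addrs.len() as u32);
    for (i, &addr) in fn_addrs.iter().enumerate() {
        machine.mem.write(vtable + 4 * i as u32, addr);
    }
    vtable
}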

After this change, I can just expose the vtables as static data from the DLLs, like this:

.globl _IDirectDrawClipper
_IDirectDrawClipper:
  .long _IDirectDrawClipper_QueryInterface
  .long _IDirectDrawClipper_AddRef
  .long _IDirectDrawClipper_Release
  .long _IDirectDrawClipper_GetClipList
  .long _IDirectDrawClipper_GetHWnd
  ...

Building DLLs

To generate these stub DLLs, I currently generate assembly and run it through the clang-cl toolchain, which seamlessly cross-compiles C/asm to win32. It feels especially handy to be able to personally bug one of its authors for tech support; thanks again Nico!

It feels a bit disappointing to need to pull in a whole separate compiler toolchain for this, especially given that Rust contains the same LLVM bits. I spent a disproportionate amount of time tinkering with alternatives, like perhaps using one of the other assemblers coupled with the lld-link embedded in Rust, or even using Rust itself, but I ran into a variety of small problems with that. Probably something to revisit later.

Dynamic linking

Any time I tinker in this area (and also when reading Raymond Chen's post linked above) I always find myself thinking about how much dang machinery there is to make dynamic linking work and to try to fix up its performance.

It makes me think of a mini-rant I heard Rob Pike give about dynamic linking (it was clearly a canned speech for him, surely a pet peeve) and how much Go just attempts to not do it. I know the static/dynamic linking thing is a tired argument; I understand the two sides of it, and I understand that engineering is about choosing context-specific tradeoffs, but much like with minimal version selection it is funny how clear it is to me that dynamic linking is just not the technology to choose unless forced. Introspecting, I guess this just reveals something to me about my own engineering values.

Appendix: dllimport calls

The above omitted a lot of details, sorry pedants! As penance here is one small detail I ratholed on that might be interesting for you; non-pedants, feel free to skip this section.

In the stub DLLs I first generated code like:

WriteFile:
  call _retrowin32_syscall
  ret 12

To make it all link, at link time I provided a retrowin32.lib file that lets the linker know that _retrowin32_syscall comes from retrowin32.dll.

I expected that to generate a call via the IAT, but the code it generated went through one additional indirection. That is, the resulting machine code in the DLL looked like:

WriteFile:
  call _retrowin32_syscall   ; e8 rel32
...
_retrowin32_syscall:
  jmp [IAT entry for retrowin32_syscall]  ; ff25 addr

Why did it do this? At least in this blog's explanation, when compiling the initial assembly into an object file, the assembler doesn't know whether retrowin32_syscall is a local symbol that will be resolved at link time or if it's a symbol coming from a DLL. So it assumes the former and generates a rel32 call to a local function.

Then, at link time, the linker sees that it actually needs this call to resolve to an IAT reference, and so it generates this extra bit of code containing the jmp (which Ghidra labels a "thunk").

To eliminate this indirection I instead generate assembly that looks like:

WriteFile:
  call [__imp__retrowin32_syscall]

where that __imp_-prefixed symbol is somehow understood to refer directly to the IAT entry. At the level of C code this is related to the dllimport attribute, but I am not sure whether that attribute exists at the assembly level. Peeking at the LLVM source I see it actually has some special-casing around symbols beginning with the literal string "__imp_", yikes.