Theseus, a static Windows emulator
This post is likely the end of my series on retrowin32.
I bring you: Theseus, a new Windows/x86 emulator that translates programs statically, solving a bunch of emulation problems while surely introducing new ones.
What happened to retrowin32?
I haven't been working on retrowin32, my win32 emulator, in part due to life stuff and in part because I haven't been sure where I wanted to go with it. And then someone who had contributed to it in the past posted retrotick, their own web-based Windows emulator that looks better than my years of work, and commented on HN that it took them an hour with Claude.
This is not a post about AI, both because there are too many of those already and because I'm not yet sure of my own feelings on it. But one small thing I have been thinking about is that (1) AI has been slowly but surely climbing the junior to senior engineer ladder; and (2) one of the main pieces of being a senior engineer is better understanding what you ought to be building, as distinct from how to build it.
(Is that just the Innovator's Dilemma's concept of "retreating upmarket", applied to my own utility as a human? Not even sure. I am grateful I do this work for the journey, to satisfy my own curiosity, because that means I am not existentially threatened like a business would be in this situation. As Benny Feldman says: "I cheat at the casino by secretly not having an attachment to material wealth!")
So, Mr. Senior Engineer, what ought we build? What problem are we even solving with emulators, and how do our approaches meet that? I came to a kind of unorthodox solution that I'd like to tell you about!
Emulators and JITs
The simplest CPU emulator is very similar to an interpreter. An input program, after parsing, becomes x86 instructions like:
mov eax, 3
add eax, 4
call ... ; some Windows system API
An interpreting emulator is a big loop that steps through the instructions. It looks like:
loop {
    let instr = next_instruction();
    match instr {
        // e.g. `mov eax, 3`
        Mov => { set(argument_1(), argument_2()); }
        // e.g. `add eax, 4`
        Add => { set(argument_1(), argument_1() + argument_2()); }
        ...
    }
}
Like an interpreter, this approach is slow.
At a high level interpreters are slow because they are doing a bunch of dynamic
work for each instruction. Imagine emulating a program that runs the same add
instruction in a loop; the above emulator loop has all these function calls to
repeatedly ask "what instruction am I running now?" and inspect the arguments,
only to eventually do the same add on each iteration. x86 memory references
are extra painful because they are
very flexible.
Further, on x86 the add instruction not only adds the numbers but also
computes six derived flag values, including things like the parity flag: whether
the low byte of the result contains an even number of 1 bits(!). A correct emulator needs to either
compute all of these as well, or perform some sort of side analysis of the code
to decide how to run it efficiently.
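To make that concrete, here's a sketch (in Rust, with hypothetical names; not any emulator's actual code) of the bookkeeping a faithful 32-bit `add` implies:

```rust
// The six flags x86 `add` defines. Note the parity flag looks at only the
// low byte of the result.
struct Flags {
    carry: bool,    // unsigned overflow out of bit 31
    parity: bool,   // low byte has an even number of 1 bits
    adjust: bool,   // carry out of bit 3, used by BCD instructions
    zero: bool,     // result is zero
    sign: bool,     // high bit of result
    overflow: bool, // signed overflow
}

fn add32(a: u32, b: u32) -> (u32, Flags) {
    let (result, carry) = a.overflowing_add(b);
    let flags = Flags {
        carry,
        parity: (result as u8).count_ones() % 2 == 0,
        adjust: ((a ^ b ^ result) & 0x10) != 0,
        zero: result == 0,
        sign: (result as i32) < 0,
        // signed overflow: result's sign differs from both operands' signs
        overflow: ((a ^ result) & (b ^ result) & 0x8000_0000) != 0,
    };
    (result, flags)
}

fn main() {
    let (result, flags) = add32(3, 4);
    println!("3 + 4 = {}, zero = {}", result, flags.zero);
}
```

An interpreter pays for all of that arithmetic on every single `add`, whether or not any later instruction ever reads the flags.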
There are various fun techniques to improve emulators. But if you want to go fast, what you really need is some combination of analyzing the code and generating native machine code from it — a JIT. JITs are famously hard to write! They are effectively optimizing compilers, with all the complexity of optimization and machine-code generation, plus the twist that the compilation itself runs in the critical performance path. I liked this post's discussion of why JITs are hard, which mentions there have been more than 15 attempts at a Python JIT.
Static binary translation
So suppose you want to generate efficient machine code, but you don't want to write a JIT. You know what's really good at analyzing code and generating efficient machine code from it? A compiler!
So here's the main idea. Given code like the above input x86 snippet, we can process it into source code that looks like:
regs.eax = 3;
regs.eax = add(regs.eax, 4);
windows_api(); // some native implementation of the API that was called
We then feed this code back into an optimizing compiler to get a program native to your current architecture, with x86 no longer needed.
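Mechanically, you can picture the translator as a function from decoded instructions to lines of source code. A toy sketch (hypothetical types and opcodes; a real translator decodes actual x86 with a disassembler library and handles far more than three instructions):

```rust
// A tiny decoded-instruction IR standing in for real x86 decoding.
enum Instr {
    MovEaxImm(u32),        // mov eax, imm32
    AddEaxImm(u32),        // add eax, imm32
    CallApi(&'static str), // call resolved to a native API shim
}

// Turn each decoded instruction into a line of Rust source; the host
// compiler then optimizes the result.
fn emit(instrs: &[Instr]) -> String {
    let mut out = String::new();
    for instr in instrs {
        let line = match instr {
            Instr::MovEaxImm(imm) => format!("regs.eax = {:#x};\n", imm),
            Instr::AddEaxImm(imm) => format!("regs.eax = add(regs.eax, {:#x});\n", imm),
            Instr::CallApi(name) => format!("{}(ctx);\n", name),
        };
        out.push_str(&line);
    }
    out
}

fn main() {
    let program = [
        Instr::MovEaxImm(3),
        Instr::AddEaxImm(4),
        Instr::CallApi("windows_api"),
    ];
    print!("{}", emit(&program));
}
```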
In other words, instead of handing an .exe file directly to an emulator that
might JIT code out, we instead have a sort of compiler that statically
translates the .exe (via a second compiler in the middle) directly into a
"native" executable.
(I write native in scare quotes because while the resulting executable is a
native binary, it is a binary that is carrying around a sort of inner virtual
machine representing the x86 state, like the regs struct in the above code.
More on this in a bit.)
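That inner machine can be as simple as a struct of registers that the translated functions mutate. A minimal sketch (hypothetical field set; the real state also carries flags, FPU/MMX state, and guest memory):

```rust
// The x86 register state the translated program carries around.
#[derive(Default)]
struct Regs {
    eax: u32,
    ecx: u32,
    edx: u32,
    ebx: u32,
    esp: u32,
    ebp: u32,
    esi: u32,
    edi: u32,
}

// The translated snippet from above, as a native function over that state.
fn snippet(regs: &mut Regs) {
    regs.eax = 3;
    regs.eax = regs.eax.wrapping_add(4);
}

fn main() {
    let mut regs = Regs::default();
    snippet(&mut regs);
    println!("eax = {}", regs.eax);
}
```

Given code shaped like this, an optimizing compiler is free to collapse the two statements into a single store of 7.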
I think I came up with this basic idea on my own just by thinking hard about what I was trying to achieve, but it turns out this approach is known as static binary translation and is well studied. It has some nice properties, and also some big problems.
Decompilation
I'll go into those, but first, a minor detour about how I ended up here.
Have you heard of decompilation? These madmen (madpeople?) are manually recreating the source code to old video games, one function at a time. They take the game binary, extract the machine code of one function, then use a fancy UI (click one of the entries under "Recent activity") to iteratively tinker on reproducing the higher-level code that generates the exact same machine code. It's kind of amazing.
(To do this, they even need to run the same original compiler that was used to compile the target game. Those compilers are often Windows programs, which means implementing the above fancy UI involves running old Windows binaries on their Linux servers. This is how I first learned about them — they need a Windows emulator!)
Decompilation is not just a weird and fascinating (and likely tedious?) human endeavor. It also highlighted something important for me: I don't so much care about having an emulator that can run any random program; I care about running a few very specific programs, and I'm willing to go to some manual lengths to help out.
In practice, if you look at a person building a Windows emulator, they end up as a surgeon anyway, manually reaching in to pump the heart of the target program: debugging it, working around its individual bugs. It's common for emulators to even manually curate a list of programs that are known to work or fail.
An old idea
Statically translating machine code is not a new idea. Why isn't it more popular? My impression from trying to read about it is that it is often dismissed as something that can't work, but so far it has worked well for me. Maybe there's some impossible problem I just haven't hit yet?
(When trying to look up related work for this blog post, I saw this attempt at statically translating NES that concluded it can't be done, but then also these people seem to be succeeding at it so it's hard to say.)
I think there are two main problems, a technical one and a more cultural one.
The technical part is that the simple idea has complex details. To start with, any program that generates code at runtime (e.g. itself containing a JIT) won't work, but it's easy for me to just dismiss those programs as out of scope. There are also challenges around things like how control flow works, but those are small and interesting and I might go into them in future posts.
A common topic of research is that, in the limit, it is impossible to statically find all of the code a program might execute, even when it doesn't generate code at runtime, because of dynamic control flow through vtables and jump tables. There are techniques to find most of the code, but no approach is guaranteed to work perfectly. This is where decompilation changed my view: if I'm willing to manually help out a bit on a specific program, then this problem might be fine?
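For a flavor of the problem, here is one hypothetical way a translator can cope with an indirect call: a dispatch table keyed by original x86 address over whatever functions were found statically, with a hard failure for anything that was missed, which is exactly the spot where manually helping out comes in. (This is a sketch, not necessarily Theseus's actual scheme.)

```rust
struct Regs {
    eax: u32,
}

// Two functions discovered at translation time, keyed by their original
// x86 addresses.
fn func_401000(regs: &mut Regs) { regs.eax = 1; }
fn func_401020(regs: &mut Regs) { regs.eax = 2; }

// An indirect call (e.g. `call eax`, with eax loaded from a vtable) can only
// be resolved at runtime, by dispatching on the target address.
fn call_indirect(regs: &mut Regs, target: u32) {
    match target {
        0x401000 => func_401000(regs),
        0x401020 => func_401020(regs),
        // Static analysis missed this target; time for a human to add it.
        _ => panic!("call to untranslated address {:#x}", target),
    }
}

fn main() {
    let mut regs = Regs { eax: 0 };
    call_indirect(&mut regs, 0x401020);
    println!("eax = {}", regs.eax);
}
```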
The main cultural reason I think binary translation isn't more common is that it's not as convenient as a generic emulator that handles most programs already. Users aren't likely to want to run a compiler toolchain, though I have seen projects embed the compiler (e.g. LLVM) directly to avoid this.
The other cultural problem is there are legal ramifications if you intend to distribute translated programs. Every video game emulator relies on the legal fiction of "first, copy the game data from the physical copy you already own and pass that in as an input", so they get to plausibly remain non-derivative works.
But I'm not solving for users, I'm solving for my own interest. These cultural problems don't matter to me.
Benefits
Again consider the snippet above, which is adding 3 and 4. In a static translator world we parse the instruction stream ahead of time, so the compiler gets to see that we want to put a 3 in eax and not (as an interpreter would) spend runtime considering what values we are reading and writing where.
A compiler will not only generate the correct machine code for the target architecture, it will even optimize code like the above down to just storing the resulting value 7. And a compiler is capable of eliminating unneeded code like parity computations if you frame things right. Because the Theseus code generation happens "offline", separately from the execution of the program, I can worry less than a JIT might about spending time analyzing the code.
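"Framing things right" might look like this sketch (one common framing, not necessarily what Theseus does): keep each flag as a plain field written by a small inlinable helper, so that after inlining, any flag that no later instruction reads is dead code the optimizer deletes.

```rust
#[derive(Default)]
struct Flags {
    zero: bool,
    parity: bool,
}

// A small helper the translated code calls for every `add`. Because it is
// always inlined, the flag stores become ordinary local assignments.
#[inline(always)]
fn add(dst: &mut u32, src: u32, flags: &mut Flags) {
    *dst = dst.wrapping_add(src);
    flags.zero = *dst == 0;
    flags.parity = (*dst as u8).count_ones() % 2 == 0;
}

fn main() {
    let mut eax = 3u32;
    let mut flags = Flags::default();
    add(&mut eax, 4, &mut flags);
    // A later `jz` would read flags.zero; if nothing ever reads parity
    // before the next write, the optimizer removes the popcount entirely.
    println!("eax = {}, zero = {}", eax, flags.zero);
}
```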
When I started this I had thought that performance would be the whole benefit of this approach, but it turns out to be easier to develop as well because it brings in all of the other developer tools:
- The translated instructions appear as regular source code in the output program, which means the native debugger can step through them like any other code.
- If the program crashes, the native stack trace leads back into the (translated assembly of the) original program.
- I haven't tried it yet, but CPU profiling ought to have the same benefit.
In retrowin32 I ended up building a whole debugger UI to help track down problems, but in Theseus I've just used my system debugger so far and it's been fine.
In retrowin32 I also spent a lot of time fiddling with the bridge between the emulator and native code. This boundary still exists in Theseus but it is so much smaller, because the translated code can directly call my native win32 system API implementation (with a bit of glue code to move data in and out of the inner machine's representation).
On macOS retrowin32 could run under Rosetta, but that required the entire executable to be an x86-64 binary, which in turn required a cross-compiled SDL. A Theseus binary is native code that just calls the native SDL.
All told it is just much simpler. From the start of this idea, getting the test program I've tinkered with all this while to run its first scene (including DirectX, FPU, and MMX) took only a couple of weeks.
Partial evaluation
You can think of the different approaches, from interpreter to JIT to static translation, as a spectrum of how much work you do ahead of time versus at runtime. Theseus takes the dynamic question of "what kind of mov is this" and moves it to the ahead-of-time compilation step, partially evaluating the generic instruction handler into a specific instruction with nailed-down arguments. (I'll link again to the excellent blog about meta-tracing C code. Read about Futamura projections for this idea taken to its extreme conclusion!)
For another example, a typical Windows emulator must parse and load the PE executable on startup, but Theseus does that at compile time and writes out just the data structures needed to execute it. The PE-parsing code isn't needed in the output.
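As a sketch of what that looks like (hypothetical shapes; a real translator would use a PE parsing library), the translator reads the sections once and emits only the loaded image as static data, so the parser itself never ships:

```rust
// A PE section as seen at translation time.
struct Section {
    virtual_address: u32,
    data: Vec<u8>,
}

// Emit Rust source containing just the final loaded image: the compile-time
// snapshot that replaces runtime PE parsing.
fn emit_image(image_base: u32, sections: &[Section]) -> String {
    let mut out = format!("pub const IMAGE_BASE: u32 = {:#x};\n", image_base);
    for (i, s) in sections.iter().enumerate() {
        out.push_str(&format!(
            "pub static SECTION_{}: (u32, &[u8]) = ({:#x}, &{:?});\n",
            i, s.virtual_address, s.data
        ));
    }
    out
}

fn main() {
    let sections = [Section {
        virtual_address: 0x1000,
        data: vec![0xb8, 3, 0, 0, 0], // mov eax, 3
    }];
    print!("{}", emit_image(0x400000, &sections));
}
```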
Similarly, executable startup involves linking and loading any referenced DLLs
including those from the system, but Theseus must see all the code it will run,
so it does this linking ahead of time. Here's some output near a call to a
Windows API, where at compile time it resolved an IAT reference (the ds:[...]
address) directly to the Rust implementation I wrote:
// 004012a0 push 4070A4h
push(ctx, 0x4070a4u32);
// 004012a5 push 8
push(ctx, 0x8u32);
// 004012a7 call dword ptr ds:[4060E8h]
call(ctx, 0x4012ad, Cont(user32::CreateWindowExA_stdcall));
In some sense it's as if Theseus at compile time is partially running the system binary loader and the output source code is a snapshot of the ready state. It reminds me a bit of the problem of unpacking executables.
WebAssembly
Theseus should easily extend to running on the web under WebAssembly; most of it is just compiling the generated program with wasm as the target architecture. (I initially had this working then decided I don't need the additional complexity for now, so it isn't implemented.)
Separately, the output program from Theseus is inspired by how WebAssembly is executed. In both there is an outer host program that carries within it a "machine" with its own idea of code and memory. The code within that machine can only read/write to its own memory and must call provided hooks to bridge out to the host. Like WebAssembly, the Theseus output executable code is isolated from the data, with the nice property that no amount of unintentional/malicious memory writes can create new code.
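The isolation boundary can be pictured as guest memory behind accessors (a hypothetical API sketch): the translated code never holds a raw host pointer, so no guest store can clobber host state or create new code.

```rust
// Guest memory owned by the host; all guest loads and stores go through
// bounds-checked accessors.
struct Mem {
    bytes: Vec<u8>,
}

impl Mem {
    fn new(size: usize) -> Mem {
        Mem { bytes: vec![0; size] }
    }

    // e.g. translated `mov dword ptr [addr], eax`
    fn write_u32(&mut self, addr: u32, value: u32) {
        let a = addr as usize;
        self.bytes[a..a + 4].copy_from_slice(&value.to_le_bytes());
    }

    // e.g. translated `mov eax, dword ptr [addr]`
    fn read_u32(&self, addr: u32) -> u32 {
        let a = addr as usize;
        let mut buf = [0u8; 4];
        buf.copy_from_slice(&self.bytes[a..a + 4]);
        u32::from_le_bytes(buf)
    }
}

fn main() {
    let mut mem = Mem::new(0x1000);
    mem.write_u32(0x10, 7);
    println!("{}", mem.read_u32(0x10));
}
```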
A wasm Theseus would be a turducken of machines:
- the native host machine's WebAssembly implementation (e.g. the Chrome runtime), with its notion of memory, runs a
- WebAssembly virtual machine with the Theseus wasm blob, with its own idea about memory (e.g. where my Rust implementation of the Windows API puts allocations), and within that there is
- the x86 virtual machine and Windows program's notion of memory (which e.g. might say "read from the static data table at memory offset $x").
In thinking about it, it's tempting to try to blend some layers of machines here, and make the WebAssembly program's memory 1:1 with the input Windows program's idea of memory. That is, if the input program writes to some address $x, you could translate that to exactly writing to WebAssembly memory address $x. (You'd need to adjust the middle layer to hide its data structures in places the x86 program doesn't use.) I had to do something like this to make retrowin32 work under an x86 emulator. WebAssembly would even let me lay out the memory directly from the binary. I don't think this really buys you much, it would just be kind of cute.
On the topic of WebAssembly and static binary translation, check out wastrel, which is static binary translation applied to the problem of executing WebAssembly. Reading about it surely gave me the seeds of this idea.
Theseus
I named this project Theseus, as in the ship.
Consider again the x86 assembly at the top of the post. What does it do? Depending on how you look at it, one correct answer is "adds three and four" or even just "computes 7". Or you could say it puts 3 in the eax register, adds 4 to the eax register, consumes some CPU clocks, and sets various CPU flags.
If I or my compiler replaces one of these interpretations with another, is it still the same program? Depending on which context you care about — my impression is that emulating systems like the NES requires getting the clocks exactly right — these details either matter or don't. In the case of Theseus I am explicitly throwing away the input program because I have replaced all its parts, one by one.
I have one farther-off idea, again along the lines of the ship of Theseus.
Implementing the Windows API is an endless stream of working around four decades
of Hyrum's Law. Consider
that random bug workaround again:
if you were documenting the API of DirectPlayEnumerateA would you write that
it calls the callback, or would it be more correct to say that it calls the
callback and also restores a preserved stack pointer? If you look at the code of
a Windows emulator like Wine today it is full of things like this.
One idea I've been thinking about is that for problems like these, rather than making the emulator more complicated, you could take a page from the decompilation playbook and provide an easy way to manage replacing parts of the program itself.
Once you're willing to replace pieces of a program there are more interesting possibilities. If a program has some bit of code that doesn't perform well, instead of making a JIT fancier, you could just manually replace the code with your own implementation. (It's plausible you wouldn't even need to change algorithms; it might be enough to just write the same algorithm in native code and let your modern compiler apply its autovectorization logic to it.) With enough machinery, you could even replace parts to add features, as one contributor to retrowin32 investigated here and even implemented for some Game Boy games.