retrowin32 progress report

February 20, 2023

It's been four months since I first wrote about retrowin32, my win32 emulator. Here's a progress report.

Before I get into the tech bits I have some softer life observations. Feel free to skip down to the next section if you're just here to talk computers.

The first is that I left my job in large part due to a lack of free time, but unexpectedly the rest of my life kind of expanded to consume much of that time. In some sense I better understand why I was so stressed for time before, given how little slack I had in my schedule. The root cause of this is a combination of parenthood, my wife restarting her career, and my own lack of organization, but I am trying to not be too hard on myself about not making as much progress as I might. I took this break from work with an explicit goal of trying to take it easy.

With that in mind, I have also been thinking about what my actual objective with this project is. I am unlikely to make anything better than what already exists, but as I mentioned in my Ninja retrospective, "success" can be just what you define it as. Currently I think success is about just chasing wherever my curiosity takes me — I found this thread pretty inspiring — and not any particular product goals.

(And though making anything explicitly useful isn't my goal, I found it funny while I was working at Figma how often some random technical challenge would come up, and I would say "oh I did a project in this area once, I know a bit about it"...)

One last thing I appreciate about this project is how little I know about this space in particular, which means it's easy for me to post something wrong about it. Whenever I write something on here and get corrected by the internet I view it as free education, so if you have any feedback (even the well-actually sort) I'd love to hear from you.

Targeting both Wasm and native

After some refactoring surgery, the project is now a little more layered. There's an x86 emulator that doesn't know about Windows, and then a Windows layer atop that. The Windows layer then exposes a collection of Host interfaces that provide things like "write to stdout" and "create a window".

That layering allows me to implement the Host interface twice. The first makes direct calls to the OS, which lets you build a native retrowin32 executable and run retrowin32 foo.exe from the command line to use SDL to show a window. The second implementation targets WebAssembly and forwards the Host API across the WebAssembly boundary to a TypeScript implementation that uses the HTML DOM to show a window.
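To make the layering concrete, here is a minimal sketch of what such a Host interface could look like. The trait, its two methods, and both implementing types are hypothetical names chosen for illustration; they are not retrowin32's actual definitions, and the "recording" host only stands in for the Wasm side, which would forward each call across to TypeScript.

```rust
// Hypothetical sketch: names and method signatures are illustrative only,
// not retrowin32's real Host API.

/// The narrow interface the Windows layer uses to reach the outside world.
pub trait Host {
    /// Write bytes to standard output; returns how many bytes were written.
    fn write_stdout(&mut self, buf: &[u8]) -> usize;
    /// Create a window and return an opaque id for it.
    fn create_window(&mut self, title: &str, width: u32, height: u32) -> u32;
}

/// Native-style host: makes direct calls to the OS.
#[derive(Default)]
pub struct NativeHost {
    next_window_id: u32,
}

impl Host for NativeHost {
    fn write_stdout(&mut self, buf: &[u8]) -> usize {
        use std::io::Write;
        match std::io::stdout().write_all(buf) {
            Ok(()) => buf.len(),
            Err(_) => 0,
        }
    }

    fn create_window(&mut self, _title: &str, _width: u32, _height: u32) -> u32 {
        // A real native host would ask a windowing library for a window here.
        let id = self.next_window_id;
        self.next_window_id += 1;
        id
    }
}

/// Stand-in for the second implementation: it only records the calls,
/// where a Wasm host would forward them across the Wasm boundary.
#[derive(Default)]
pub struct RecordingHost {
    pub stdout: Vec<u8>,
    pub windows: Vec<(String, u32, u32)>,
}

impl Host for RecordingHost {
    fn write_stdout(&mut self, buf: &[u8]) -> usize {
        self.stdout.extend_from_slice(buf);
        buf.len()
    }

    fn create_window(&mut self, title: &str, width: u32, height: u32) -> u32 {
        self.windows.push((title.to_string(), width, height));
        (self.windows.len() - 1) as u32
    }
}

/// Windows-layer code is written against the trait, not a concrete host.
pub fn say_hello(host: &mut dyn Host) -> usize {
    host.write_stdout(b"Hello, world!\n")
}
```

Because `say_hello` only sees `&mut dyn Host`, the same Windows-layer code runs unchanged against either implementation, which is the property that makes a second target cheap to add.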

I hadn't initially set out to target non-Wasm. It has turned out to be useful because it's faster to execute (which matters when you've written an emulator as slow as mine, ha!) and the native profiling tools are better. I feel like I've learned this particular lesson multiple times now so it's something to keep in mind for future Wasm projects.

But because I was originally only thinking about Wasm, I haven't thought much about safety: in a Wasm environment a program that goes wildly wrong can still only corrupt its own memory. It might be interesting to think more about whether I can set up the emulator such that it could "safely" run an untrusted executable, especially given that the intent is for the executable to reach the outer machine only through this fairly narrow Host interface.

Translating x86

One thing I had imagined I would do when I started this project was build some sort of translator directly from x86 to Wasm, and so far I haven't. Naively you might expect (I might have expected!) that you could even build some sort of ahead of time x86 to Wasm (or native) translator. That ends up not working out for an interesting reason.

For starters, it's not uncommon for x86 programs to generate more x86 code. The obvious example is a program that JITs. Another example: the demoscene executables that I started this project with are typically wrapped in some sort of "packer", which is to say the first step of the program you run is for it to uncompress the code of the actual program into memory and go from there. (See kkrunchy for one example.)

But even if you set aside those kinds of programs and consider just the subset of programs that don't dynamically generate code, it is still a potentially surprising challenge to statically translate from x86, because it can be difficult to identify what the code actually is! In a Windows executable there is typically a blob of bytes that is marked as executable code, but those bytes are relatively unstructured. The x86 instruction encoding is complex, dense, and full of surprises.

For a concrete example, here's a snippet of a random exe I was looking at. The leftmost column is the address of the code. The second column is the raw bytes of code — of different lengths because x86 is a variable-length encoding — followed by the remainder of the line that attempts to disassemble those bytes.

004188e2 83e203           AND      EDX,0x3
004188e5 ff2495f4884100   JMP      dword ptr [EDX*0x4 + 4188f4]
004188ec ff248d0489410090 ??
004188f4 04894100     // this is 00418904 written in little-endian
004188f8 0c894100     // this is 0041890c written in little-endian
004188fc 18894100     // this is 00418918 written in little-endian
00418900 2c894100     // this is 0041892c written in little-endian
00418904 8b44240c      MOV      EAX,dword ptr [...]

What you see here is a jump table. In the first line EDX is masked to be <= 3. The math in the second line computes an address and dereferences that address, which is to say it reads one of the constants seen in lines 4 through 7 and then jumps to that address.
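As a small sketch of that arithmetic, the following mimics the AND and the indirect JMP using the sixteen table bytes copied from the listing. The function and constant names are made up for illustration; only the byte values and addresses come from the disassembly.

```rust
/// Address of the first table entry, taken from the listing.
const TABLE_BASE: u32 = 0x004188f4;

/// The four little-endian table entries, byte for byte as they appear in the exe.
const TABLE: [u8; 16] = [
    0x04, 0x89, 0x41, 0x00, // entry 0
    0x0c, 0x89, 0x41, 0x00, // entry 1
    0x18, 0x89, 0x41, 0x00, // entry 2
    0x2c, 0x89, 0x41, 0x00, // entry 3
];

/// Mimic `AND EDX,0x3` followed by `JMP dword ptr [EDX*0x4 + 4188f4]`:
/// returns the address execution would continue at.
pub fn jump_target(edx: u32) -> u32 {
    // The AND keeps only the low two bits, so the index is 0..=3.
    let index = edx & 0x3;
    // The address the JMP dereferences.
    let entry_addr = TABLE_BASE + index * 4;
    // Read the 32-bit little-endian value stored at that address.
    let i = (entry_addr - TABLE_BASE) as usize;
    u32::from_le_bytes([TABLE[i], TABLE[i + 1], TABLE[i + 2], TABLE[i + 3]])
}
```

For instance, `jump_target(0)` yields 0x00418904, the address of the MOV on the last line of the listing, and any EDX whose low two bits are both set lands on 0x0041892c.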

In particular notice that this instruction stream also contains raw data. And my disassembler got confused about what the bytes shown on line 3 even are. It's possible some other point in the code computes some arbitrary offset within those bytes and jumps there.

This is a lengthy way of saying that given an arbitrary x86 binary, even if it doesn't itself generate code, you cannot just start at the top and disassemble your way to the bottom, because you cannot be sure where the boundaries of instructions lie; there might be data that parses as instructions and jumps that jump into what looks like data. Identifying where the instructions actually lie is in the limit something you can only determine at runtime.

In practice, all of this doesn't mean generating Wasm from x86 is impossible. Rather, it means that you also must be able to transpile x86 code at runtime, which is what existing x86 to Wasm tools like v86 and CheerpX do. (See v86's "how it works" doc.)

Generating executables

To start this project I had been working with some random .exe files I found, but for testing it is useful to create my own executables.

For example, one trick I borrowed from some other x86 emulators is to make a program that executes various x86 instructions in a controlled environment and then prints out the resulting CPU state. I can then execute this program on a real x86, capture its output, and then verify my emulator produces the same output.

So how do you create a Windows exe from a Mac? "This is just cross-compiling," I thought, but it turns out to be kind of a rabbit hole of fiddly details. In particular cross compilers tend to generate code that pulls in a lot of dependencies on either Windows libraries or a Windows toolchain; see for example these details about the current state of the Rust toolchain, which can be summarized as "it doesn't work yet".

The Zig language is very promising in this respect. My main hesitation with it is that the language is still early. I managed to crash the compiler while attempting to learn the language, and even within retrowin32 I found my older Zig hello world failed to compile with a newer compiler (some trivial flag change). But otherwise it seems pretty ideal — super fast compiler, painless cross compiling, minimal executables, and even an asm language construct.

Where I've currently ended up is just C code, built on a native Windows machine. One nice property of this approach is that it uses the standard Windows toolchain so the resulting executables are just like other Windows executables. There is one piece that is a little clever: I set up GitHub's CI so that when I modify the C code, it rebuilds the associated .exe files on a Windows VM and attaches the resulting executable to the current Git branch, so I can iterate (a bit slowly) from my Mac. (PS: there's a secret flag to make MSVC's outputs stable...)

Implementing the Windows API

retrowin32 includes its own implementation of (some of) the Windows API. The way this works is when the x86 emulator tries to jump to a Windows function it plumbs its way out to code I've written. Each of those calls needs to pop the arguments off the x86 stack and then interpret them, poking data back into the x86 memory as appropriate.

Consider the Windows function WriteFile:

BOOL WriteFile(
  [in]                HANDLE       hFile,
  [in]                LPCVOID      lpBuffer,
  [in]                DWORD        nNumberOfBytesToWrite,
  [out, optional]     LPDWORD      lpNumberOfBytesWritten,
  [in, out, optional] LPOVERLAPPED lpOverlapped
);

Concretely the signature says that it takes in some numbers and some pointers. Through a moderate amount of Rust hackery my implementation of this function instead has this signature:

pub fn WriteFile(
    machine: &mut Machine,
    hFile: HFILE,
    lpBuffer: Option<&[u8]>,
    lpNumberOfBytesWritten: Option<&mut u32>,
    lpOverlapped: u32,
) -> bool {

A dllexport attribute on the function (omitted from the snippet above) is picked up by a code generator that generates the plumbing from the x86 emulation machinery (which is also the purpose of the additional Machine parameter) and also maps the parameters to fancier Rust types. For example, note that the buffer/length pair becomes a slice that has proper bounds, and the out-param becomes an &mut. Further, the pointers become Option<...> to model whether they're null.

(In a Rust sense this is still unsafe, because the caller of this function can point lpBuffer and lpNumberOfBytesWritten at the same memory...)
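To give a feel for the kind of plumbing involved, here is a hand-written sketch of what popping WriteFile's arguments and mapping them could look like. Everything in it is an assumption for illustration: this toy Machine (flat memory plus a stack pointer), the `write_file_shim` name, the `sink` standing in for the real file, and the premise that the return address has already been popped. The real generated code surely differs.

```rust
/// Hypothetical stand-in for emulator state: flat memory and a stack pointer.
pub struct Machine {
    pub memory: Vec<u8>,
    pub esp: u32,
}

impl Machine {
    /// Read a little-endian u32 from emulated memory.
    pub fn read_u32(&self, addr: u32) -> u32 {
        let a = addr as usize;
        let m = &self.memory;
        u32::from_le_bytes([m[a], m[a + 1], m[a + 2], m[a + 3]])
    }

    /// Pop one 32-bit argument off the emulated stack.
    pub fn pop_u32(&mut self) -> u32 {
        let value = self.read_u32(self.esp);
        self.esp += 4;
        value
    }
}

/// Pop WriteFile's five arguments and map them to safer Rust shapes.
/// `sink` receives the written bytes in place of a real file.
pub fn write_file_shim(machine: &mut Machine, sink: &mut Vec<u8>) -> bool {
    // Arguments come off the stack in declaration order.
    let _h_file = machine.pop_u32();
    let lp_buffer = machine.pop_u32() as usize;
    let n_bytes = machine.pop_u32() as usize;
    let lp_written = machine.pop_u32() as usize;
    let _lp_overlapped = machine.pop_u32();

    // Pointer + length becomes a bounds-checked slice; null or
    // out-of-range pointers become None.
    let buffer: Option<&[u8]> = match lp_buffer.checked_add(n_bytes) {
        Some(end) if lp_buffer != 0 => machine.memory.get(lp_buffer..end),
        _ => None,
    };
    let written = match buffer {
        Some(bytes) => {
            sink.extend_from_slice(bytes);
            bytes.len() as u32
        }
        None => return false,
    };

    // The out-param becomes an optional mutable view; null means "skip it".
    if lp_written != 0 {
        let slot = match lp_written.checked_add(4) {
            Some(end) => machine.memory.get_mut(lp_written..end),
            None => None,
        };
        match slot {
            Some(slot) => slot.copy_from_slice(&written.to_le_bytes()),
            None => return false,
        }
    }
    true
}
```

Note that this sketch finishes with the input slice before it takes the mutable view of the out-param, so the two borrows never overlap even if the guest program points both at the same memory.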

I highlight this function also to point out one more funny instance of Hyrum's law. Per the MSDN docs the lpNumberOfBytesWritten cannot be null, and initially in my signature for the above function I hadn't wrapped an Option<> around it.

Meanwhile, I had a test "minimal Windows" program I had been using that used this function, with a call like:

WriteFile(hStdout, buf, sizeof(buf) - 1, nullptr, nullptr);

...which is to say it passed a null for the parameter that officially cannot be null. This program of course worked fine on native Windows, and crashed under my emulator until I realized what was going on.

(Since then I've changed things such that every pointer must always become an Option<>, because it's easy enough to .unwrap() on the Rust side anyway.)

DOS nostalgia

My childhood was during the DOS days. I have memories of tinkering with stuff like CONFIG.SYS and trying to figure out how to get more memory to run a video game, but I didn't really understand what was going on. It has been a real nostalgic pleasure to read through Wikipedia articles about the technology of this era now that I have the computer science education necessary to appreciate it.

For example, do you remember the term "high memory"? It is actually pretty amusing! In those days memory was addressed by (segment << 4) + offset, where both segment and offset were 16 bits. A 16-bit segment shifted left by 4 covers 20 bits, which is to say you could address one megabyte. But if you look at the math, the absolute largest address you can represent with that expression is (0xFFFF << 4) + 0xFFFF, which reaches just a little bit (65520 bytes!) beyond one megabyte. And that is what the "high memory area" was, that little extra region!
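The arithmetic is easy to check in a few lines. This is only a sketch of the address computation described above, with function and constant names of my own choosing:

```rust
/// Real-mode address translation: shift the 16-bit segment left by four
/// bits and add the 16-bit offset. Computed in 32 bits so nothing is lost.
pub fn linear_address(segment: u16, offset: u16) -> u32 {
    ((segment as u32) << 4) + offset as u32
}

/// One megabyte, the size of the 20-bit address space.
pub const ONE_MEGABYTE: u32 = 1 << 20;

/// Number of addressable bytes at or above the one-megabyte line.
pub fn high_memory_area_size() -> u32 {
    // The largest reachable address, plus one to turn it into a byte count,
    // minus the first megabyte.
    linear_address(0xFFFF, 0xFFFF) + 1 - ONE_MEGABYTE
}
```

The largest address works out to 0x10FFEF, so the region past the one-megabyte line spans 0x100000 through 0x10FFEF, which is the 65520 bytes mentioned above.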

...ok, but what does the emulator actually do?

It turns out in retrospect that I got very lucky with the first DirectDraw binary that I successfully emulated, in that that binary did not use much Windows API or machine features.

Since then I have found that even a trivial console C program pulls in an even bigger soup of Windows API calls. For example a C program expects main to be called with an argv array, but that is not at all how Windows views things, so as part of C startup it ends up calling a bunch of Windows functions to set things up.

How many? A small C++ console program I have been testing with pulls in 83 Windows functions. But the Windows C library also seems to dynamically probe for further code — I think something about initializing the C allocator pokes at TlsAlloc (thread locals) which then tries to see if the current system has the DLLs to support FlsAlloc (fiber locals).

...and this is a long way of saying that, despite hundreds of commits since my last post, and progress that is visible to me internally, there aren't any major new milestones to report. Since last time I do have the native execution bits hooked up, like this:

$ cargo run -p retrowin32 -- exe/zig_hello/hello.exe 2>/dev/null
Hello, world!

and that can similarly spawn a native window for the DirectDraw demo. But I have no new cool web-based demos to show you. Sorry, maybe next time!