Tech Notes

Cross compiling C/Rust to win32, again

2024-04-16T00:00:00Z

This post is part of a series on retrowin32.

Earlier I wrote about cross compiling Rust to win32. I ended up not following through on that approach due to missing compiler intrinsics. Instead here are two closely related dives.

Cross compiling C to win32

Clang supports cross compilation, including targeting Windows, but I kept getting it wrong. After a few false starts, I reached out to Nico, who actually knows things, and he set me straight, so thank him for anything you learn from this post!

Clang has a clang-cl binary that is intended to be a drop-in substitute for Visual Studio's cl.exe, which means not only matching the command-line interface in terms of how flags are spelled (which I don't care much about) but importantly also in terms of preconfiguring the compiler settings to properly produce a Windows output (which ends up critical). For example, if your program #includes a Windows header file, you need the compiler to be configured to understand all the minor language variations found in Windows-style source code.

This means cross compiling C++ code to Windows ends up being as simple as something like this. (I'm still not exactly clear on when flags are hyphenated or slashed, but I do know that linker args passed to cl.exe must follow the /link switch...)

$ clang-cl -fuse-ld=lld -target i686-pc-windows-msvc \
    -vctoolsdir $xwin_path/crt -winsdkdir $xwin_path/sdk \
    foo.cc \
    /link /subsystem:console

Fully worked example here.

xwin

The above invocation needs Windows headers and libraries in a particular file system layout. They are available from Microsoft but only in the form of installer .exe files.

winetricks is a script that downloads and unpacks those (and many other redistributable packages) by invoking the executables through Wine. (I briefly thought about running this through retrowin32 but the exes target modern Windows, not the old Windows that retrowin32 targets.)

But as I mentioned in the previous post, there is a tool called xwin that downloads and unpacks the .exes directly. This is a subtle process (the files can be VSIX or CAB files, which contain XML blobs indexing other files, and there's some manual shuffling of files around when unpacking happens), which doesn't give me a lot of confidence it will keep working into the future, but for now, an invocation like this:

$ xwin --accept-license --arch x86 splat --output redist --disable-symlinks

(where --disable-symlinks is only needed on a case-insensitive file system) produces a directory layout that the above clang-cl invocation accepts.

Back to Rust, and calling conventions

Looking back, many of the problems I encountered in my previous post were due to using Rust's no_std. I was avoiding the Rust standard library for a few reasons but primarily because it used SSE instructions, which retrowin32 doesn't (yet?) support. But it turns out that switching the Rust target from i686-pc-windows-msvc to i586-pc-windows-msvc was sufficient to avoid these.

I could then run a Rust "hello world" with-std app in retrowin32 with a few more function stubs. The only tricky one was that it needs memcpy from vcruntime140.dll. Implementing memcpy is pretty easy; the tricky part was realizing it uses the cdecl calling convention. To understand why this matters you need a bit of background.

First, on Windows there are two main calling conventions, called stdcall and cdecl. In both, the caller pushes its arguments onto the stack right to left. In stdcall, the callee is then responsible for popping those arguments, while in cdecl it's the caller. (There's a lot more to it than just this; I found this reference especially helpful.)

I'm not clear on why both exist. A benefit of stdcall is you only need the stack-popping code in one location (the callee) and not once per caller, and in fact the ret instruction takes an integer argument of how much to pop so it only costs two extra bytes in the binary. A benefit of cdecl is that it's arguably necessary for varargs functions, or at least that is what sources online say, but it seems to me you could make it work either way so I'm not really sure. (I guess it would be more fragile in the case where the callee doesn't consume all its varargs inputs? Maybe it'd be more complex with all the modern stack corruption mitigations?)

In any case, up to this point retrowin32 implements hundreds of Windows functions but they were all stdcall. It implements these functions in two layers.

First, I implement a given Windows API function via an annotated Rust function using Rust types, like the following:

#[win32_derive::dllexport]
pub fn SetThreadDescription(
    machine: &mut Machine,
    hThread: HTHREAD,
    lpThreadDescription: Option<&Str16>,
) -> bool { ... }

Second, a code generator collects up all the functions that were annotated as dllexport and generates code for each that translates emulator state into a call to these functions. For example, for the above it knows to pull hThread and lpThreadDescription off the emulated stack, and that the latter is an optional pointer to a NUL-terminated WTF-16 string that needs to be read from emulated memory. The bool return value becomes an integer that goes in the emulator's eax register. And finally, it knows that the stack is popped by 8 bytes because that's what the arguments used.
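As a rough sketch of what such a generated shim does (this is not retrowin32's actual generated code; the Machine layout, the helper names, and the choice to discard the return address inside the shim are all assumptions made for illustration):

```rust
/// Hypothetical, minimal stand-in for the emulator state.
struct Machine {
    mem: Vec<u8>,
    esp: u32,
    eax: u32,
}

impl Machine {
    /// Reads a little-endian u32 from emulated memory.
    fn read_u32(&self, addr: u32) -> u32 {
        let a = addr as usize;
        u32::from_le_bytes([self.mem[a], self.mem[a + 1], self.mem[a + 2], self.mem[a + 3]])
    }

    /// Reads a NUL-terminated 16-bit string; a null pointer maps to None.
    fn read_str16(&self, addr: u32) -> Option<String> {
        if addr == 0 {
            return None;
        }
        let mut units = Vec::new();
        let mut a = addr as usize;
        loop {
            let unit = u16::from_le_bytes([self.mem[a], self.mem[a + 1]]);
            if unit == 0 {
                break;
            }
            units.push(unit);
            a += 2;
        }
        Some(String::from_utf16_lossy(&units))
    }
}

/// Stand-in for the "real" implementation, written with Rust types.
fn set_thread_description(_hthread: u32, description: Option<String>) -> bool {
    description.is_some()
}

/// What a generated stdcall shim might look like.
fn shim_set_thread_description(m: &mut Machine) {
    // Assumed layout: [esp] holds the return address, arguments follow.
    let hthread = m.read_u32(m.esp + 4);
    let desc_ptr = m.read_u32(m.esp + 8);
    let desc = m.read_str16(desc_ptr);
    let ret = set_thread_description(hthread, desc);
    // The bool return value becomes an integer in eax.
    m.eax = ret as u32;
    // Discard the return address (a real emulator would jump to it),
    // then pop the 8 bytes of arguments, as stdcall requires.
    m.esp += 4 + 8;
}
```

The point of the split is that the hand-written function never touches the emulated stack; all of the pointer chasing lives in the generated layer.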

To support cdecl, I only needed to adjust the dllexport code generation to control how much stack was popped. With that in place retrowin32 can now at least run a simple Rust program with println!("hello, world");.
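To make the difference concrete, here is a small model of the stack pointer across one call. The CallConv enum and both function names are hypothetical, and it assumes 4-byte arguments and a 4-byte return address; it only tracks esp, not memory contents.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum CallConv {
    Stdcall,
    Cdecl,
}

/// Bytes the callee side (here, the emulator shim) removes on return.
fn callee_pop(conv: CallConv, arg_bytes: u32) -> u32 {
    let return_address = 4;
    match conv {
        // Like `ret N`: return address plus all arguments.
        CallConv::Stdcall => return_address + arg_bytes,
        // Like plain `ret`: return address only.
        CallConv::Cdecl => return_address,
    }
}

/// Simulates a whole call and returns the final stack pointer.
fn simulate_call(esp_start: u32, conv: CallConv, args: &[u32]) -> u32 {
    let arg_bytes = 4 * args.len() as u32;
    let mut esp = esp_start;
    esp -= arg_bytes; // caller pushes the arguments
    esp -= 4; // `call` pushes the return address
    esp += callee_pop(conv, arg_bytes); // callee returns
    if conv == CallConv::Cdecl {
        esp += arg_bytes; // caller cleans up, e.g. `add esp, N`
    }
    esp
}
```

Either way the stack ends up balanced; the only thing the shim has to know is which side is responsible for the argument bytes. If a cdecl function like memcpy is treated as stdcall, the arguments get popped twice and the caller's stack is off by 12 bytes.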

Inline assembly

The reason I have been poking at all of this is because I want to write a test suite over my implementations of x86 opcodes, and I want to run that suite on a native x86 to verify my emulator behavior matches the real thing.

Because I want to test specifically the invocations of opcodes I write inline assembly to do it. This has allowed me to contrast the way inline assembly works in C and in Rust.

Clang follows gcc syntax for inline assembly, as documented in the gcc docs (and I believe nowhere in the Clang docs?). It is syntactically surprisingly clunky. For example:

to write multiple instructions of assembly you must embed literal '\n's into the string;
it treats % as an escape metacharacter, but meanwhile (in AT&T syntax) registers are prefixed with %, which means to refer to a register you must write it doubled as %%;
to specify inputs/outputs/clobbers, you write them after : in the blocks, which means you end up with awkward empty blocks (see e.g. the /* No outputs */ comment in the example here);
the syntax within these blocks is itself a kind of printf-like format string with docs describing things like "g" means "Any register, memory or immediate integer operand is allowed, except for registers that are not general registers."

I have no context here so I imagine a lot of the above probably organically grew from how assemblers worked over gcc's history. It doesn't feel like the thing you would design if you were inventing it today.

Meanwhile, clang-cl configures clang to work like a Windows compiler, which means it also supports Microsoft's inline assembly syntax. I don't have much experience with this except that it looks a lot simpler than the gcc syntax. It seems the compiler must infer a lot more of the context that is explicitly stated in the gcc format to get things like clobbers right; that is at least what LLVM does, where you can see the ~{flags} bits on the right.

Finally, Rust has its own syntax for inline assembly that feels very sensible. (Rust supports many fewer architectures than gcc, which possibly makes the problem space a lot easier?) Since it backs onto LLVM anyway it feels semantically close to Clang, but the syntactic pain points are gone: things like options are specified in a simple syntax with keywords.

One bit of cute syntax I enjoyed: in assembly in various places you end up using a letter suffix to specify the size of a given operation; for example in AT&T assembly you write addb to specify a byte-sized add but addw for a 16-bit ("word") add. Unfortunately I keep getting these letter suffixes confused, where AT&T uses e.g. l for "long" when Intel uses d for "dword", both to specify a 32-bit operation. (The worst is q is for "quad word", aka (2-byte word) * 4 = 8 bytes.) LLVM has a zoo of single-letter codes which is surely necessary for modeling all the necessary complexity.

Meanwhile, it's not the same context so it's a little unfair, but in Rust asm templates the formatting codes just correspond to the register names — e.g. you use e (as in eax, ebx) to refer to 32 bits, or l (as in al, bl) for the low 8 bits.
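For illustration, a minimal sketch of those two modifiers using std::arch::asm. The function names are made up, and the non-x86 fallbacks exist only so the sketch builds on other hosts; they are not part of the technique.

```rust
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
use std::arch::asm;

/// 32-bit add: `{0:e}` renders the operand as eax, ebx, and so on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn add32(a: u32, b: u32) -> u32 {
    let mut out = a;
    unsafe {
        asm!(
            "add {0:e}, {1:e}",
            inout(reg) out,
            in(reg) b,
            options(pure, nomem, nostack),
        );
    }
    out
}

/// 8-bit add on the low byte only: `{0:l}` renders as al, bl, cl or dl.
/// The upper 24 bits of `a` pass through untouched.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn add8_low(a: u32, b: u32) -> u32 {
    let mut out = a;
    unsafe {
        asm!(
            "add {0:l}, {1:l}",
            inout(reg_abcd) out,
            in(reg_abcd) b,
            options(pure, nomem, nostack),
        );
    }
    out
}

// Plain-Rust fallbacks so the sketch also compiles on non-x86 hosts.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn add32(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn add8_low(a: u32, b: u32) -> u32 {
    (a & !0xff) | u32::from((a as u8).wrapping_add(b as u8))
}
```

Note the operands are in Intel order (destination first), which is Rust's default on x86, and that inputs, outputs and options are spelled out as keywords rather than positional colon-separated blocks.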

Between these options and given how the rest of retrowin32 is in Rust already I am leaning towards using Rust (if anything, my main complaint is that the Rust autoformatter skips my asm blocks, possibly because it's a macro?). The binaries are pretty large, so maybe that is what I will look into next...

Moving data from Rust to JS

2024-04-02T00:00:00Z

Suppose you have some structured data in your Rust WebAssembly program that you want to expose to JavaScript. There are different ways to do it with different tradeoffs, and in this post I dive a bit into them and sketch out a new better way.

For a concrete example, in retrowin32's debugger there's a disassembly view (click 'step' once in that UI to see) that renders some assembly using React. That is not just a glob of plain text, but rather structured data where e.g. the addresses within the instruction stream are marked and hyperlinked (try clicking an address in the disassembly), to either step the program to an instruction or view a memory dump at an address.

Recall (perhaps from my notes) that memory within the WebAssembly program is a big Uint8Array from JavaScript's perspective. At the core all approaches involve making the JS read that memory, just with different patterns.

The first option is to keep the structure in Rust and read out each field via individual JS calls. This section of the wasm-bindgen docs has an example of what it looks like. This is fairly verbose to set up (you need to write out a getter for each field), fairly chatty across the JS-Wasm boundary, and not great from a React perspective (because React prefers plain data objects — consider how a complex nested structure in particular works with this approach) but can be appropriate in contexts where it's a complex object on the Rust side.

If not that, then the other approaches are variations on copying the data to JS. Copying feels bad but makes sense in particular when the Rust code is generating data only for the JS side to use anyway; e.g. in the above disassembly, there's no need for the Rust side to hang on to any of it. And no matter what, any string that traverses the boundary from WebAssembly to JS must be copied, as JS strings own their data and cannot be views into Wasm memory.

The simplest approach for copying is to just serialize the whole thing as JSON. Concretely this means the Rust side generates a big blob of JSON, then JS copies that blob to a JS string, then JSON.parse()s it. This is less bad than it sounds in that it at least processes all the data in bulk; e.g. you only have to run one block of bytes through the Wasm→JS string decoder. And browsers have poured effort into fast JSON parsing. But it's also worse than it sounds in that it means you must generate JSON on the Rust side. If you're not doing that already, it means pulling in a bunch of serialization code. At some level it just feels wrong to serialize structured data into a string only to immediately parse that string.

Alternatively you could have the Rust side construct the JS-side object by making calls across the Wasm boundary. In some sense this is the dual of the first approach: lots of Rust→JS calls instead of lots of JS→Rust calls. Imagine a Person struct:

struct Person {
    name: String,
    age: u32,
}

Serialization for this makes a series of calls like the following, where my made-up js namespace is just to show all the places where the Rust code is calling up to JS:

let obj = js::create_object();
let name = js::create_string("name");
let name_val = js::create_string(person.name);
js::set_property(obj, name, name_val);
let age = js::create_string("age");
let age_val = js::create_number(person.age);
js::set_property(obj, age, age_val);

This is roughly the approach taken in serde-wasm-bindgen, which is the approach recommended in the Rust Wasm docs. It's conveniently all code generated via serde and in terms of the source-level modifications it only requires a few annotations on some structs.

But looking at the above you might have a few questions. One is why do you need to make a call to JS to create a number object, given that Wasm supports numbers natively? I might be misreading the relevant code here, but I think it's because of the way serialization code is structured, it effectively needs to recursively serialize each struct field to the same type, something like "a handle to a JS object".

The second is that it feels redundant to allocate strings for the names of each field (the quoted "name") above, especially if you're serializing a lot of these objects. serde-wasm-bindgen addresses this by instead interning the names of struct fields into a HashMap, which is nice but also which is one of my least-favorite code patterns, the global cache of data that never shrinks. (It's at least bounded by the total number of distinct field names you ever serialize.)
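The pattern looks roughly like the sketch below. This is not serde-wasm-bindgen's code: JsHandle and FieldNameCache are hypothetical names, and a counter stands in for the real Rust→JS call that would allocate a string.

```rust
use std::collections::HashMap;

/// Hypothetical handle to a string that lives on the JS side.
#[derive(Clone, Copy, Debug, PartialEq)]
struct JsHandle(u32);

/// A cache of field-name handles that only ever grows.
#[derive(Default)]
struct FieldNameCache {
    handles: HashMap<&'static str, JsHandle>,
    js_allocations: u32,
}

impl FieldNameCache {
    /// Returns the cached handle, "creating a JS string" only on first use.
    fn intern(&mut self, name: &'static str) -> JsHandle {
        if let Some(&handle) = self.handles.get(name) {
            return handle;
        }
        // Stand-in for a Rust->JS call that allocates the string.
        let handle = JsHandle(self.js_allocations);
        self.js_allocations += 1;
        self.handles.insert(name, handle);
        handle
    }
}
```

Serializing a thousand Person values costs two string allocations for the field names rather than two thousand, and the map never holds more entries than there are distinct field names, but nothing ever removes one either.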

Finally, all of these calls from Rust to JS are not free. I don't have a good intuition for how fast WebAssembly→JS calls are (for all I know VM engines are able to make them equivalent to within-language calls), but there is still setup goop on the Rust side and additional JS functions on the JS side just to glue these two sides together. Just getting a handle to a JS object requires bookkeeping on the Rust side.

With all this in mind, I sketched out a slightly different approach. To serialize the above Person struct, I codegen a JS function like:

function __waser_Person(name, age) { return {name, age}; }

The Rust-side generated code then looks like:

let name_val = js::create_string(person.name);
let age = person.age;
let obj = js::__waser_Person(name_val, age);  // age passed as int

The idea here is that the JS engine is probably best equipped to do all of the relevant caching of the names of the fields, and perhaps we can better hit some optimizations around object construction. It does mean we generate a JS function per type of structure serialized but it's relatively small.
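The per-type JS function is simple enough that the generator for it can be tiny. A minimal sketch, assuming the generator is handed the struct name and its field names as strings (the function name gen_js_constructor is made up, and the prototype's real generator is not shown in this post):

```rust
/// Emits the JS constructor function for one struct type.
/// `fields` are the field names in declaration order.
fn gen_js_constructor(struct_name: &str, fields: &[&str]) -> String {
    let params = fields.join(", ");
    // `{{` and `}}` are literal braces in a Rust format string.
    format!("function __waser_{struct_name}({params}) {{ return {{{params}}}; }}")
}
```

Calling gen_js_constructor("Person", &["name", "age"]) yields the same one-liner as the Person example above. Using the shorthand {name, age} object literal keeps the field names as plain JS source, which is where the engine can cache them.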

In my hacky prototype running this over some sample data was ~16% faster and also generated smaller code (which I won't quantify because the smaller code was also probably in part due to not using serde) than serde-wasm-bindgen. The benefit seems to be primarily from the function call, not the "passed as int" part, but that surely depends on the specific type of data being serialized, as mine was mostly strings. But also, the performance of this area is not really in any critical path for me. I just found it an interesting exploration, so I am unlikely to land it anywhere!

retrowin32: Minesweeper and the four month bug

2024-03-16T00:00:00Z

This post is part of a series on retrowin32.

retrowin32 now runs enough of Minesweeper to let you sort of play it in your browser:

Minesweeper; try it yourself

It is likely to crash if you explore it too far — for example, if you win it attempts to bring up a "you win" dialog that triggers an unimplemented codepath — but still! It kind of works!

Getting this working involved fleshing out a lot more of the Windows API. The demoscene executables I had been focusing on up to this point mostly brought up a window and sent pixels to it, while Minesweeper's startup pokes at the registry, ini files, and in particular has a bunch of drawing code. If you click "view in debugger" in the above UI and then "imports" you can see a list of all of the various Windows API that this pulls in and which I have (partially) implemented.

For example, the red numbers in the UI come from bottom-up bitmaps that are stored as 4 bits per pixel and which are 13 pixels wide, so each row uses 6.5 bytes. I briefly went down a rabbithole of reasoning about generically decoding these before I realized the BMP format uses padding.
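For reference, the padding rule is that each BMP row is rounded up to a multiple of 4 bytes, so a 13-pixel row at 4 bits per pixel occupies 8 bytes rather than 6.5. A small sketch of the arithmetic (the function names are my own, not retrowin32's):

```rust
/// Bytes per row in a BMP: rows are padded up to a multiple of 4 bytes.
fn bmp_row_stride(width_px: u32, bits_per_pixel: u32) -> u32 {
    (width_px * bits_per_pixel + 31) / 32 * 4
}

/// Decodes one 4bpp row into palette indices.
/// Within each byte the high nibble is the leftmost pixel.
fn decode_4bpp_row(row: &[u8], width_px: usize) -> Vec<u8> {
    (0..width_px)
        .map(|x| {
            let byte = row[x / 2];
            if x % 2 == 0 {
                byte >> 4
            } else {
                byte & 0x0f
            }
        })
        .collect()
}

/// Byte offset of visual row `y` (0 = top) in a bottom-up bitmap,
/// where the last visual row is stored first.
fn bottom_up_row_offset(y: u32, height_px: u32, stride: u32) -> u32 {
    (height_px - 1 - y) * stride
}
```

With the stride in hand, the odd half-byte at the end of each row is simply ignored: the decoder reads 13 nibbles from the start of the row and the remaining padding is never touched.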

Getting Minesweeper to render definitely makes this project feel more "real" and I think is a cool demo, but I also am not really sure I ultimately want to reimplement a bunch of old Windows APIs. For example, I looked a bit into SkiFree. It has 1bpp bitmaps and various raster ops and I am just not too sure it is interesting.

Introspecting, I think I looked at Minesweeper because I was curious to see how hard it would be, but also because I was just avoiding The Big Scary Bug.

The Big Scary Bug

Last November I posted about an emulator CPU bug: to resummarize, a demo worked when using Apple's CPU emulator but not mine, but it manifested as just the demo doing the wrong thing and not as any smoking gun crash.

Here are some of the approaches I have tried to isolate this bug over the last few months:

Tracing the executable on native Windows and my emulator and comparing execution traces; failed because native execution is too different from my emulator
Integrating a 3rd CPU emulator (Unicorn) and comparing execution traces; failed because I couldn't get Unicorn to reliably report CPU state, possibly due to either bugs in it or how I interacted with it
Figuring out enough of the LLDB API to attempt to get it to dump an execution trace of running under Rosetta; failed because I got the traces closer but they still just diverged at some point, and also the LLDB API is very frustrating — how can I print an 80-bit x86 float, still not even sure!

In any case I put it all on the back burner while I fiddled with Minesweeper for a bit.

Then today I was looking over my notes on different demos I had tried in the emulator and for one my note on why it didn't work was "uses lots of windows apis, CreateDIBSection bitmap flags, SetTimer, etc". And I thought, huh, I recently have implemented that kind of thing, I should try it again...

...and it gets much farther along, before of course encountering some other problems, including that it somehow is underflowing the FPU stack. As I glanced through some of the FPU code around its stack handling, I randomly noticed that I had typo-misimplemented the fild instruction. It is supposed to take a 64-bit integer from memory and put it on the FPU stack (converting it to a float), but I had made it take a 64-bit float from memory and put it on the stack.
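The two behaviors differ by one conversion, which is what makes the typo so easy to miss. A simplified sketch, using f64 as a stand-in for the 80-bit x87 register (so very large integers lose precision here in a way they would not on real hardware); the function names are hypothetical:

```rust
/// Correct 64-bit `fild`: read an integer, then convert its value to a float.
fn fild64(mem: &[u8; 8]) -> f64 {
    i64::from_le_bytes(*mem) as f64
}

/// The typo version: reinterprets the same bytes as a float,
/// which is what a 64-bit `fld` does.
fn fild64_buggy(mem: &[u8; 8]) -> f64 {
    f64::from_le_bytes(*mem)
}
```

Fed the bytes of the integer 3, the correct version produces 3.0 while the buggy one produces a denormal that is almost, but not quite, zero. Nothing crashes; the program just quietly computes with garbage, which matches how the original bug showed up.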

This did not fix this new demo, but apparently it did fix the four month long bug. Argh.

Test suite

I think this is the third dumb typo bug I've discovered in my FPU implementation, where all three would have been caught if I had had even the most trivial of tests (e.g. to test multiplication, "does 2 * 3 produce 6?"). I noticed if I pasted some of the relevant code into an LLM it was able to spot one, but also when I pasted the whole file into the same LLM it couldn't find those bugs nor the new one. Someday soon though, maybe?

I have been circling around writing an x86 CPU test suite because I haven't been sure exactly how I want to test my implementation. There's of course the "does it work at all" tests which would have caught these bugs, but there are also lots of cases that would only be exercised by particular inputs, like whether the overflow flag gets set based on particular combinations of inputs. I have a basic approach at this that I have written in C but updating it is pretty annoying; I even went as far as building out GitHub CI goop to compile this C via the MSVC toolchain. My most recent retrowin32 update, about cross-compiling Rust, was prompted exactly by the thought of writing this test suite in Rust.

One recent idea I had for those is that I could just exhaustively run all combinations of inputs for the implementations of operations that involve 8-bit integers (there are only 2**16 of them) and compare these against a native CPU implementation. It has the downside that it wouldn't flush out bugs in the 16 or 32-bit implementations, but for many of these I have written them as generics over the operand size (example) so maybe it would be enough.
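A sketch of what the exhaustive loop could look like for an 8-bit add. Two caveats: the comparison target here is a reference model built on wider arithmetic, swapped in for the native CPU so the example runs anywhere, and emu_add8 is a made-up stand-in rather than retrowin32's implementation.

```rust
/// Result of an 8-bit add plus the carry and overflow flags.
#[derive(Clone, Copy, Debug, PartialEq)]
struct AddResult {
    value: u8,
    carry: bool,
    overflow: bool,
}

/// Stand-in for the emulator's implementation, using bit tricks.
fn emu_add8(a: u8, b: u8) -> AddResult {
    let value = a.wrapping_add(b);
    AddResult {
        value,
        // Unsigned wraparound happened if the result got smaller.
        carry: value < a,
        // Signed overflow: both operands differ in sign from the result.
        overflow: ((a ^ value) & (b ^ value) & 0x80) != 0,
    }
}

/// Reference model: do the math in 16 bits and check the ranges.
fn ref_add8(a: u8, b: u8) -> AddResult {
    let unsigned = a as u16 + b as u16;
    let signed = a as i8 as i16 + b as i8 as i16;
    AddResult {
        value: unsigned as u8,
        carry: unsigned > 0xff,
        overflow: signed < i8::MIN as i16 || signed > i8::MAX as i16,
    }
}

/// Runs all 2**16 input pairs and returns the number of mismatches.
fn count_mismatches() -> usize {
    let mut mismatches = 0;
    for a in 0..=u8::MAX {
        for b in 0..=u8::MAX {
            if emu_add8(a, b) != ref_add8(a, b) {
                mismatches += 1;
            }
        }
    }
    mismatches
}
```

Swapping ref_add8 for a function that runs the real instruction and captures the flags would turn this into the native comparison described above, with the loop itself unchanged.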

Wine on Mac on Rosetta

By the way, if you actually wanted to run Minesweeper on your Mac, recent releases of Wine do the 32-bit x86 on 64-bit Rosetta thing (after all, I learned about it from Wine), and Wine is a much more competent implementation than mine. At this point you can just brew install wine-stable and then wine some.exe and it will work, even on Apple silicon.

With that in hand, what is even the point of my own project? Of course, the ultimate goal is just my own interest, but does Wine subsume its functionality? It turns out the one demo that got me started on this whole thing still crashes under Rosetta due to an illegal instruction. Just guessing, but it appears Rosetta doesn't implement the "nested pointers" variants of the enter instruction that chillin uses. So at least until they fix that, I still have a (totally made-up) purpose.