Cross compiling C/Rust to win32, again

April 16, 2024

This post is part of a series on retrowin32.

Earlier I wrote about cross compiling Rust to win32. I ended up not following through on that approach due to missing compiler intrinsics. Instead here are two closely related dives.

Cross compiling C to win32

Clang supports cross compilation, including targeting Windows, but I kept getting it wrong. After a few false starts, I reached out to Nico, who actually knows things, and he set me straight, so thank him for anything you learn from this post!

Clang has a clang-cl binary that is intended to be a drop-in substitute for Visual Studio's cl.exe, which means not only matching the command-line interface in terms of how flags are spelled (which I don't care much about) but importantly also in terms of preconfiguring the compiler settings to properly produce a Windows output (which ends up critical). For example, if your program #includes a Windows header file, you need the compiler to be configured to understand all the minor language variations found in Windows-style source code.

This means cross compiling C++ code to Windows ends up being as simple as something like this. (I'm still not exactly clear on when flags are hyphenated or slashed, but I do know that linker args passed to cl.exe must follow the /link switch...)

$ clang-cl -fuse-ld=lld -target i686-pc-windows-msvc \
    -vctoolsdir $xwin_path/crt -winsdkdir $xwin_path/sdk \
    /link /subsystem:console

Fully worked example here.


The above invocation needs Windows headers and libraries in a particular file system layout. They are available from Microsoft but only in the form of installer .exe files.

winetricks is a script that downloads and unpacks those (and many other redistributable packages) by invoking the executables through Wine. (I briefly thought about running this through retrowin32 but the exes target modern Windows, not the old Windows that retrowin32 targets.)

But as I mentioned in the previous post, there is a tool called xwin that downloads and unpacks the .exes directly. This is a subtle process (the files can be VSIX or CAB files which contain XML blobs indexing other files, and there's some manual shuffling of files around when unpacking happens), which doesn't give me a lot of confidence it will keep working into the future, but for now, an invocation like this:

$ xwin --accept-license --arch x86 splat --output redist --disable-symlinks

(where --disable-symlinks is only needed on a case-insensitive file system) produces a directory layout that the above clang-cl invocation accepts.

Back to Rust, and calling conventions

Looking back, many of the problems I encountered in my previous post were due to using Rust's no_std. I was avoiding the Rust standard library for a few reasons but primarily because it used SSE instructions, which retrowin32 doesn't (yet?) support. But it turns out that switching the Rust target from i686-pc-windows-msvc to i586-pc-windows-msvc was sufficient to avoid these.

I could then run a Rust "hello world" with-std app in retrowin32 with a few more function stubs. The only tricky one was that it needs memcpy from vcruntime140.dll. Implementing memcpy is pretty easy; the tricky part was realizing it uses the cdecl calling convention. To understand why this matters you need a bit of background.

First, on Windows there are two main calling conventions, called stdcall and cdecl. In both, the caller pushes its arguments onto the stack right to left. In stdcall, the callee is then responsible for popping those arguments, while in cdecl it's the caller. (There's a lot more to it than just this; I found this reference especially helpful.)

I'm not clear on why both exist. A benefit of stdcall is you only need the stack-popping code in one location (the callee) and not once per caller, and in fact the ret instruction takes an integer argument of how much to pop so it only costs two extra bytes in the binary. A benefit of cdecl is that it's arguably necessary for varargs functions, or at least that is what sources online say, but it seems to me you could make it work either way so I'm not really sure. (I guess it would be more fragile in the case where the callee doesn't consume all its varargs inputs? Maybe it'd be more complex with all the modern stack corruption mitigations?)

In any case, up to this point retrowin32 implements hundreds of Windows functions but they were all stdcall. It implements these functions in two layers.

First, I implement a given Windows API function via an annotated Rust function using Rust types, like the following:

pub fn SetThreadDescription(
    machine: &mut Machine,
    hThread: HTHREAD,
    lpThreadDescription: Option<&Str16>,
) -> bool { ... }

Second, a code generator collects up all the functions that were annotated as dllexport and generates code for each that translates emulator state into a call to these functions. For example, for the above it knows to pull hThread and lpThreadDescription off the emulated stack, and that the latter is an optional pointer to a NUL-terminated WTF-16 string that needs to be read from emulated memory. The bool return value becomes an integer that goes in the emulator's eax register. Finally, it knows that the stack is popped by 8 bytes, because that's how much the two 4-byte arguments used.

To support cdecl, I only needed to adjust the dllexport code generation to control how much stack was popped. With that in place retrowin32 can now at least run a simple Rust program with println!("hello, world");.
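To make the stdcall/cdecl difference concrete, here is a minimal sketch of what such a generated shim might do. It is not retrowin32's actual generated code: the Machine struct, the CallConv enum, add_two, add_two_shim, and call_add_two are all hypothetical names made up for illustration, and the emulated stack is just a small byte vector.

```rust
/// Hypothetical, heavily simplified emulator state (not retrowin32's real Machine).
struct Machine {
    memory: Vec<u8>,
    esp: u32,
    eax: u32,
    eip: u32,
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum CallConv {
    Stdcall,
    Cdecl,
}

impl Machine {
    fn new(size: usize) -> Machine {
        // The stack grows down from the top of memory.
        Machine { memory: vec![0; size], esp: size as u32, eax: 0, eip: 0 }
    }

    fn read_u32(&self, addr: u32) -> u32 {
        let i = addr as usize;
        let b = &self.memory[i..i + 4];
        u32::from_le_bytes([b[0], b[1], b[2], b[3]])
    }

    fn push(&mut self, value: u32) {
        self.esp -= 4;
        let i = self.esp as usize;
        self.memory[i..i + 4].copy_from_slice(&value.to_le_bytes());
    }
}

/// The "real" implementation, written against plain Rust types.
fn add_two(_machine: &mut Machine, a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

/// What a generated shim for add_two might look like (an assumption).
fn add_two_shim(machine: &mut Machine, conv: CallConv) {
    // On entry [esp] holds the return address; the arguments follow it.
    let ret_addr = machine.read_u32(machine.esp);
    let a = machine.read_u32(machine.esp + 4);
    let b = machine.read_u32(machine.esp + 8);
    machine.eax = add_two(machine, a, b);
    // Both conventions pop the return address.
    machine.esp += 4;
    // stdcall: the callee also pops the 8 bytes of arguments (like `ret 8`).
    // cdecl: the arguments stay on the stack for the caller to clean up.
    if conv == CallConv::Stdcall {
        machine.esp += 8;
    }
    machine.eip = ret_addr;
}

/// Plays the role of the caller. Returns eax and how many argument bytes are
/// still on the stack right after the shim returns.
fn call_add_two(conv: CallConv) -> (u32, u32) {
    let mut machine = Machine::new(64);
    let esp_before = machine.esp;
    machine.push(2); // arguments are pushed right to left
    machine.push(1);
    machine.push(0x1000); // return address, as pushed by `call`
    add_two_shim(&mut machine, conv);
    (machine.eax, esp_before - machine.esp)
}
```

In this sketch call_add_two(CallConv::Stdcall) leaves nothing on the stack, while call_add_two(CallConv::Cdecl) leaves the 8 bytes of arguments for the caller to pop, which is the only place the two conventions differ from the shim's point of view.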

Inline assembly

The reason I have been poking at all of this is because I want to write a test suite over my implementations of x86 opcodes, and I want to run that suite on a native x86 to verify my emulator behavior matches the real thing.

Because I specifically want to test the invocations of opcodes, I write inline assembly to do it. This has allowed me to contrast the way inline assembly works in C and in Rust.

Clang follows gcc syntax for inline assembly, as documented in the gcc docs (and I believe nowhere in the Clang docs?). It is syntactically surprisingly clunky. For example:

I have no context here so I imagine a lot of the above probably organically grew from how assemblers worked over gcc's history. It doesn't feel like the thing you would design if you were inventing it today.

Meanwhile, clang-cl configures clang to work like a Windows compiler, which means it also supports Microsoft's inline assembly syntax. I don't have much experience with this except that it looks a lot simpler than the gcc syntax. It seems the compiler must infer a lot more of the context that is explicitly stated in the gcc format to get things like clobbers right; that is at least what LLVM does, where you can see the ~{flags} bits on the right.

Finally, Rust has its own syntax for inline assembly that feels very sensible. (Rust supports far fewer architectures than gcc, which possibly makes the problem space a lot easier?) Since it backs onto LLVM anyway it feels semantically close to Clang, but syntactic details like how options are specified are handled with a simple keyword syntax.

One bit of cute syntax I enjoyed: in assembly in various places you end up using a letter suffix to specify the size of a given operation; for example in AT&T assembly you write addb to specify a byte-sized add but addw for a 16-bit ("word") add. Unfortunately I keep getting these letter suffixes confused, where AT&T uses e.g. l for "long" when Intel uses d for "dword", both to specify a 32-bit operation. (The worst is that q is for "quad word", aka (2-byte word) * 4 = 8 bytes.) LLVM has a zoo of single-letter codes which is surely necessary for modeling all the necessary complexity.

Meanwhile, it's not the same context so it's a little unfair, but in Rust asm templates the formatting codes just correspond to the register names — e.g. you use e (as in eax, ebx) to refer to 32 bits, or l (as in al, bl) for the low 8 bits.
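As a rough illustration of the keyword options and the register-name modifiers, here is a small sketch using Rust's asm! macro. The function names add32_with_carry and add_low8 are made up for this example, and the plain-Rust fallbacks exist only so the sketch still compiles on a non-x86 host; they are not part of any test suite described here.

```rust
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
use std::arch::asm;

/// 32-bit add that also reports the carry flag afterwards.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn add32_with_carry(a: u32, b: u32) -> (u32, bool) {
    let mut sum = a;
    let carry: u8;
    unsafe {
        asm!(
            // `:e` asks for the 32-bit name of whichever register is picked
            // (eax, ebx, ...).
            "add {sum:e}, {b:e}",
            // Capture the carry flag into a byte register.
            "setc {carry}",
            sum = inout(reg) sum,
            b = in(reg) b,
            carry = out(reg_byte) carry,
            // Options are plain keywords; flags count as clobbered unless
            // `preserves_flags` is given.
            options(pure, nomem, nostack),
        );
    }
    (sum, carry != 0)
}

/// Byte-sized add on the low 8 bits only; the upper bits of `a` are untouched.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn add_low8(a: u32, b: u32) -> u32 {
    let mut x = a;
    unsafe {
        asm!(
            // `:l` asks for the low-byte name (al, bl, ...); reg_abcd limits
            // the choice to registers that have one on 32-bit x86.
            "add {x:l}, {y:l}",
            x = inout(reg_abcd) x,
            y = in(reg_abcd) b,
            options(pure, nomem, nostack),
        );
    }
    x
}

// Fallbacks so the sketch builds on other architectures.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn add32_with_carry(a: u32, b: u32) -> (u32, bool) {
    a.overflowing_add(b)
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn add_low8(a: u32, b: u32) -> u32 {
    (a & !0xFF) | u32::from((a as u8).wrapping_add(b as u8))
}
```

For instance, add_low8(0x1FF, 0x01) only wraps the low byte, so the bit above it is left alone, while add32_with_carry(u32::MAX, 1) wraps to zero and reports the carry.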

Between these options and given how the rest of retrowin32 is in Rust already I am leaning towards using Rust (if anything, my main complaint is that the Rust autoformatter skips my asm blocks, possibly because it's a macro?). The binaries are pretty large, so maybe that is what I will look into next...