retrowin32: Threading in two ways

May 16, 2024

This post is part of a series on retrowin32.

Until recently, retrowin32 was wholly single-threaded: it couldn't emulate programs that used threads, and it didn't use threads in its implementation. Recently I've tinkered with both of those, but maybe not in the way you'd expect: retrowin32 can now emulate threaded programs and I made an implementation that uses threads, but I haven't used threads to emulate threads.

Emulating threads

Previously retrowin32 could only emulate single-threaded programs, which was a nice simplifying assumption for my implementation. Unfortunately the demoscene executables I'm aiming to support often end up using threads.

Side note: the winmm waveOutOpen API used for playing sound has a CALLBACK_FUNCTION mode where the docs are amusingly silent on whether that callback is called from a different thread. Like, given the other provided modes I would guess the callback is async, and the waveOutProc docs talk about avoiding deadlock, but you'd think there would be some explicit mention of this!

Since I am on the topic I would like to take this opportunity to link you to a great blog post on asynchronous callbacks.

I had imagined supporting threads would be hard, but it ended up being surprisingly simple. The core of retrowin32's emulator is a CPU type that holds the register state (including eip, the current instruction pointer) and a Mem type that holds memory. For Rust borrow checker reasons it was already the case that CPU didn't own Mem. To make threads work I just added an array of CPU and cycle through them as I run. This is not true to how the processors actually work but it's surprisingly effective.

Another side note: often when writing code there is some sort of state that I think would be a lot more convenient to manage if it were global, and often I later realize that it was beneficial to instead pass it around as a parameter, for all the usual "makes testing better" etc. reasons. Here, I had been tempted in the past to make the CPU register state global, but because I hadn't done this it was easy to support multiple CPUs. Somehow no matter how many times I repeat this lesson I am still regularly tempted by globals.

The tricky part was handling blocking. For example, a program might spawn a thread that is generating some graphics, while its main thread may block in the Windows GetMessage() API, waiting for some UI event. Meanwhile, when running on the web, events arrive via event listener callbacks. To plumb these two together, emulated threads can get into a "blocked" state where they can't be scheduled until an event prods the message queue.
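In outline, the plumbing is a message queue that parks a blocked caller and wakes it when an event listener delivers something. Here's a minimal sketch of that shape — hypothetical names, not retrowin32's code:

```typescript
// Minimal sketch of a GetMessage()-style queue: get() parks the caller
// when empty, and post() (called from an event listener) wakes it.

type Message = { kind: string };

class MessageQueue {
  private msgs: Message[] = [];
  private waiters: Array<() => void> = [];

  /** Called from a browser event listener: enqueue and wake blocked threads. */
  post(msg: Message) {
    this.msgs.push(msg);
    for (const wake of this.waiters.splice(0)) wake();
  }

  /** Emulated GetMessage(): return a queued message, or park until post() runs. */
  async get(): Promise<Message> {
    while (this.msgs.length === 0) {
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    return this.msgs.shift()!;
  }
}
```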

I had already implemented some of the machinery for this via Rust's async support, but with potentially multiple threads in different kinds of blocked states (waiting on a timer, for example), it's feeling even closer to Rust's model of how async works with wakers and so on. I suspect/fear the future might involve me getting even deeper into making my emulated threads into Rust async tasks.

Using threads

Separately you might wonder: if the emulated program uses threads, why not just emulate it using real threads? There are two main reasons I have been avoiding this.

One is that I don't really need it for the kinds of programs I'm running — they use threads for concurrency, not parallelism or performance — and threaded code is generally more difficult to reason about, with subtle consequences. The fact that Apple uses a non-standard ARM extension to implement x86 memory-ordering behaviors in their x86 emulator suggests these consequences matter, though it's likely these details aren't important to the particular programs I'm playing with.

The second reason is that retrowin32 runs on the web, where threads are not such a thing. Web workers do exist, but to have multiple workers work over the same shared memory (which is how multiple emulated threads would conceptually work), you need to use the SharedArrayBuffer browser API which was locked down due to the Spectre attacks. To use it you need to set some special HTTP headers, which makes local development a bit more annoying, and which means I wouldn't be able to host the resulting thing on GitHub pages. These are surmountable challenges but it also tickles my "you are getting too far off the beaten path" sense. And if we're worrying about memory ordering behavior on ARM, who can even imagine the behaviors in browsers.

Using exactly one thread

But maybe there is a middle path: using just a single worker.

Everything-in-one-thread retrowin32 must put effort into playing nice with the browser's event loop. The emulator wants to grind through emulated instructions as fast as possible, but it needs to yield control back to the browser to let it process events, then pass control back to the emulator as soon as possible to continue emulating.

The implementation had logic that tried to estimate how many emulated instructions it could execute before yielding control back to the browser, but it wasn't great. There's a nice doc about newer browser APIs for scheduling (thanks Mihai for the link, I had no idea!) but they're fundamentally kind of clunky (and often Chrome-only).
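The estimation approach looks roughly like this — a toy sketch of the general technique, not the actual code: guess an instructions-per-millisecond rate, derive a budget for the next slice, and correct the guess from how long the slice actually took.

```typescript
// Toy sketch of adaptive slice budgeting: estimate how many instructions
// fit in a time slice, and refine the estimate from measured timings.

class SliceEstimator {
  constructor(private perMs = 1000) {} // initial guess: instructions per ms

  /** How many instructions to attempt in a slice of `sliceMs` milliseconds. */
  budget(sliceMs: number): number {
    return Math.max(1, Math.floor(this.perMs * sliceMs));
  }

  /** Feed back how long `executed` instructions actually took. */
  observe(executed: number, elapsedMs: number) {
    if (elapsedMs <= 0) return;
    const measured = executed / elapsedMs;
    // Exponential moving average to smooth out jittery timings.
    this.perMs = 0.8 * this.perMs + 0.2 * measured;
  }
}
```

The clunky part is everything around this: picking a slice length that keeps the page responsive, and getting rescheduled promptly after yielding, which is where those scheduling APIs come in.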

I thought I could improve on this by just using a single worker thread dedicated to running the emulator, while still emulating threads myself in the manner described in the previous section. The worker gets to grind just on emulation, leaving the browser UI thread free to handle events. Because the emulator is still wholly within a single thread I can sidestep my worries above about shared memory (but more on this below).

This is "just" as easy as taking code that was aggressively single-threaded, where the browser TypeScript code and the Rust/Wasm emulator code could call directly into each other, and splitting it into two processes that communicate over a channel. Which is to say, fairly invasive.

Worker messaging

Here's a cute(/awful?) bit of TypeScript to make it more palatable. Imagine you have two classes, Emulator (with methods like run()) and Host (with methods like onStdout(), called when the emulator wants to print something). Previously they called one another directly. Now suppose you want to put the Emulator in a worker and pass messages between them. You could imagine defining a messaging protocol and writing out interfaces for the kinds of messages and so on...

Or you could do this:

/** Creates an object that makes any obj.foo() call into a postMessage onto the target. */
export function messageProxy(target: Worker | Window): object {
  return new Proxy(target, {
    get(target, prop, _receiver) {
      return (...args: {}[]) => {
        target.postMessage([prop, args]);
      }
    },
  });
}

/** Sets the onmessage handler to receive postMessage calls from a proxy and forward them to the handler. */
export function setOnMessage(target: Worker | Window, handler: object) {
  target.onmessage = function (e: MessageEvent<[string, {}[]]>) {
    const [fn, params] = e.data;
    (handler as any)[fn](...params);
  };
}

The first function takes a message-receiving target and returns an object such that calling object.foo(bar) invokes target.postMessage(['foo', [bar]]). The second function takes a message-sending source and a handler and sets its message handler such that a message like ['foo', [bar]] invokes handler.foo(bar).

Combined, you invoke them like this (with a similar mirrored invocation on the other side):

const worker: Worker = /* ...spawn worker */;
const emulator = messageProxy(worker) as Emulator;
setOnMessage(worker, this /* which implements Host */);

emulator.foo(bar);  // type-checked!

By coercing the type returned from messageProxy to the type of the actual object on the other side of the messaging channel (here Emulator), TypeScript now can still type-check all the method calls.

That as Emulator cast is of course a big lie, in that you don't have an actual Emulator object. Importantly none of the "methods" (really, message sends) can have any return values, because the calls don't wait for a response.
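(For completeness: if you did need return values, one common way to layer them on top of such a channel is to tag each call with an id and resolve a Promise when the matching reply arrives. This is a hypothetical sketch of that general technique, not retrowin32's code:)

```typescript
// Sketch of request/response over a postMessage-style channel:
// each call carries an id; the reply with the same id resolves a Promise.

interface PortLike {
  postMessage(data: unknown): void;
  onmessage?: (data: unknown) => void;
}

type Request = { id: number; fn: string; args: unknown[] };
type Response = { id: number; result: unknown };

class RpcClient {
  private nextId = 0;
  private pending = new Map<number, (result: unknown) => void>();

  constructor(private port: PortLike) {
    port.onmessage = (data) => {
      const { id, result } = data as Response;
      this.pending.get(id)?.(result);
      this.pending.delete(id);
    };
  }

  call(fn: string, ...args: unknown[]): Promise<unknown> {
    const id = this.nextId++;
    const promise = new Promise<unknown>((resolve) => this.pending.set(id, resolve));
    this.port.postMessage({ id, fn, args } as Request);
    return promise;
  }
}

/** Serve calls on the other side by invoking methods on a handler object. */
function serve(port: PortLike, handler: Record<string, (...args: unknown[]) => unknown>) {
  port.onmessage = (data) => {
    const { id, fn, args } = data as Request;
    port.postMessage({ id, result: handler[fn](...args) } as Response);
  };
}
```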

In practice this almost didn't end up mattering, because the kinds of calls I need — "start emulation", "here's a new UI event", "print this to stdout" — don't require a response, so the caller never needs to wait. Unfortunately there are two major exceptions.

Pixel buffers

The first problem is about how pixels get transferred. In this proposed worker world we now have exactly two threads: the main browser UI thread and the emulator's worker thread. Imagine the emulated code draws some pixels into some memory buffer. How does that make it to the UI thread to display?

One option is something around sharing memory with the UI thread so it can read the pixel buffers. But this resurrects the problems I was trying to avoid with shared memory.

Another option is to post these pixels in a message just like the other calls. This works, but I think it means allocating a pixel buffer in the worker and transferring it to the UI once per frame which doesn't feel great from a performance perspective.

It turns out there is a browser API that looks designed for this problem: HTMLCanvasElement.transferControlToOffscreen(), which lets you take a <canvas> and hand off responsibility for drawing its contents to a worker. (These latter two options for pixel transfer are described in this nice MDN article.)

Unfortunately to use transferControlToOffscreen() you must create and size the canvas from the UI thread, which is incompatible with how the emulator wants things to work. Windows programs can resize a window and then begin immediately drawing to it, but to use this API the emulator would need to wait after resizing for the UI thread to return the resized canvas before continuing.

There are browser synchronization APIs to allow workers to block, but as best I can tell there is no way to block while waiting for a transferred browser object like these canvases — they can only arrive in an onmessage callback. So to have emulation wait for one of these buffers to arrive, I would need to make the emulation code capable of suspending control and resuming after the message. This is something I have machinery for but it is kind of a hassle.

Is it worth it?

The second, larger problem is that retrowin32 also incorporates a debugger, and the debugger wants to poke at all sorts of emulator state to surface it in the UI. If the emulator state is now in a worker, all of this data retrieval must be made async. This is, as they say, a SMOP, but it made me reconsider whether this refactor is worth it.

I had imagined that running the emulator in a worker might let me avoid some of the goop around trying to interact nicely with the browser's event loop. But in practice the worker needs to receive messages to be controlled, and messages arrive via the browser's event loop. In other words, the worker must still attempt to interact nicely with the event loop. I committed the cardinal sin of engineering: I implemented without thinking through what the implementation would actually gain me.

So with my lessons learned, I think for now I am going to stick to the single-threaded workerless browser implementation. If you enjoyed the TypeScript bits above you can see a more refined version that doesn't rely on Proxy cuteness and instead makes channel.post('somemethod', [args]) typecheck, as well as supporting blocking calls. And regardless the emulator itself can emulate threads, a bit — or at least I can now emulate simple multithreaded programs like this.