Chromium Notes: Thread restrictions: preventing unintentional IO

November 19, 2010

I've written before about UI jank, the problem of blocking your user interface on slow operations. In Chrome we have two critical threads which should never block; in both cases, if they hang for even milliseconds the UI can feel "skippy". To remain responsive, these threads must never perform disk accesses. Whenever they need to interact with the disk, they pass the job off to a helper thread and asynchronously complete the task once the disk access completes.

Historically, from a software engineering perspective, we've kept these threads from blocking by just being careful: by trying to write the code correctly, by having reviewers verify that new changes don't introduce blocking, and finally by QA (and regular usage by developers) noticing when things get janky. This approach hasn't scaled well with the size of the project, as demonstrated by the recent regression where we accidentally started querying plugin metadata on the UI thread.

I don't blame the author of that bug; the consequences of a function call really can be hard to keep track of. Some simple operation that doesn't appear to involve the disk might end up touching it anyway; for example, computing a checksum may cause us to lazy-init our crypto library, which then attempts to initialize a SQLite database.

To help improve this state of affairs, recently I introduced a simple bit of per-thread state — a boolean — that declares whether it's OK to block the current thread. Lower-level functions that do block (by making calls like fopen or dlopen) can assert that the current thread is allowed to make the call.
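To make the idea concrete, here is a minimal C++ sketch of such a per-thread flag and the assertion that checks it. The names (g_io_allowed, SetIOAllowed, IsIOAllowed, AssertIOAllowed, OpenFile) are hypothetical and chosen for illustration; this is not Chrome's actual code.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical per-thread flag: may the current thread block on IO?
// Every thread gets its own copy, and each copy starts out as true.
thread_local bool g_io_allowed = true;

// Sets the flag for the calling thread and returns the previous value.
bool SetIOAllowed(bool allowed) {
  bool previous = g_io_allowed;
  g_io_allowed = allowed;
  return previous;
}

// Reports whether the calling thread is currently allowed to block.
bool IsIOAllowed() { return g_io_allowed; }

// Called at the top of any function that may block. assert() compiles
// to nothing when NDEBUG is defined, so release builds pay no cost.
void AssertIOAllowed() {
  assert(g_io_allowed && "blocking IO on a thread that disallows it");
}

// A low-level wrapper that checks the flag before making the blocking call.
std::FILE* OpenFile(const char* path, const char* mode) {
  AssertIOAllowed();
  return std::fopen(path, mode);
}
```

A thread that must stay responsive would call SetIOAllowed(false) once at startup; from then on, any call that reaches OpenFile on that thread trips the assertion in a debug build.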

So far this assertion has dug up over 30 other places where we're unintentionally blocking on the disk. One good example is that when loading a file:/// URL, for some reason we stat() the file on the wrong thread (despite opening and reading files on a background thread). There are also plenty of not-so-useful examples, like unit tests where we don't much care about jank; in those, I suppress the assertion.

For the remainder of this post I'm going to meander around discussing various aspects of this problem.

API surface. Above I glossed over how we can tell whether a lower-level function touches the disk. The short answer is that we can't. Especially on platforms where we don't have source code, you can never know if a hypothetical GetDisplayDimensions() call counts as a blocking one or not. In practice, much of the lower-level API that deals with files was already behind wrapper classes we had written to make the code work on multiple operating systems, so we can insert checks there. Hopefully the "you must be careful" problem of adding checks is smaller than the "you must be careful" problem it's trying to solve! But I also know this approach misses a lot; I recall one bug where dragging a downloaded file off the "shelf" to your desktop on Windows made the resulting file copy block the thread involved in the drag and drop.

Dynamic state. This approach relies on a run-time check of a thread-local variable. This means that we only discover instances of the problem when a test exercises the problematic codepath or a developer hits an assertion while looking at something else (all of this code is compiled out of the builds we release to users). On the positive side, it means that no matter how you end up blocking — virtual method call, function pointer — we still catch you. And it gives us an escape hatch so we can turn the checks on while incrementally annotating code that we know is wrong — you can just temporarily flip the "this thread is OK to make disk accesses" bit and file a bug on a problematic call.
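The escape hatch described above could be sketched as a small scoped object that flips the bit and restores it on exit. The names below (g_io_allowed, ScopedAllowIO, LegacyCallThatTouchesDisk, OnUIThreadEvent) are hypothetical, and the sketch assumes the per-thread flag is a plain thread_local bool.

```cpp
#include <cassert>

// Hypothetical per-thread flag, as assumed in the lead-in above.
thread_local bool g_io_allowed = true;

// Allows blocking IO for the lifetime of the object, then restores
// whatever value the flag had before.
class ScopedAllowIO {
 public:
  ScopedAllowIO() : previous_(g_io_allowed) { g_io_allowed = true; }
  ~ScopedAllowIO() { g_io_allowed = previous_; }
  ScopedAllowIO(const ScopedAllowIO&) = delete;
  ScopedAllowIO& operator=(const ScopedAllowIO&) = delete;

 private:
  bool previous_;
};

// Stand-in for existing code that is known to touch the disk.
bool LegacyCallThatTouchesDisk() {
  assert(g_io_allowed && "blocking IO on a thread that disallows it");
  return true;
}

// A caller on a restricted thread, annotated rather than fixed.
bool OnUIThreadEvent() {
  // Known offender: a bug would be filed to move this off the thread.
  ScopedAllowIO allow_io;
  return LegacyCallThatTouchesDisk();
}
```

Because the override is tied to a scope, it cannot leak past the one call it was meant to excuse, and searching the code for the class name gives a list of known offenders.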

Programming language support. Conceptually, though, you'd think the compiler could check this correctness property at compile time. In Haskell, you annotate a function as to whether it has effects at all (including disk accesses), so you could imagine jiggering up some sort of disk-access monad. Seen that way, the underlying problem is that, from a program's perspective, slow operations behave almost like an effect, and a lot of the craft of programming is about encapsulating an effect locally.

Mock interfaces. In another (unrelated) project I've done a consistent job of creating mock interfaces for disk accesses to make testing easier, and I've found that needing to pass in an extra DiskInterface* to a function is a great way to have the compiler help you know whether a given call may touch the disk or not. (It also means the tests are blazing fast, since they only touch memory.) But see above about API surface; my other project is primarily computational so it's easy to enumerate the interesting calls.
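As a rough illustration of that pattern, the sketch below shows a disk interface with an in-memory fake. Only the DiskInterface name comes from the text above; the methods (ReadFile, WriteFile) and the FakeDisk, CopyThroughDisk, and CountLines names are assumptions made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical interface listing every disk operation the program needs.
class DiskInterface {
 public:
  virtual ~DiskInterface() = default;
  // Returns true and fills *contents if the file exists.
  virtual bool ReadFile(const std::string& path, std::string* contents) = 0;
  // Returns true if the file was written.
  virtual bool WriteFile(const std::string& path,
                         const std::string& contents) = 0;
};

// In-memory fake for tests: "files" live in a map and never touch the disk.
class FakeDisk : public DiskInterface {
 public:
  bool ReadFile(const std::string& path, std::string* contents) override {
    auto it = files_.find(path);
    if (it == files_.end()) return false;
    *contents = it->second;
    return true;
  }
  bool WriteFile(const std::string& path,
                 const std::string& contents) override {
    files_[path] = contents;
    return true;
  }

 private:
  std::map<std::string, std::string> files_;
};

// The DiskInterface* parameter signals that this call may touch the disk.
bool CopyThroughDisk(DiskInterface* disk, const std::string& from,
                     const std::string& to) {
  assert(disk != nullptr);
  std::string contents;
  if (!disk->ReadFile(from, &contents)) return false;
  return disk->WriteFile(to, contents);
}

// No DiskInterface* parameter: this function has no way to reach the disk
// through the interface, so it is purely computational.
std::size_t CountLines(const std::string& text) {
  std::size_t lines = 0;
  for (char c : text) {
    if (c == '\n') ++lines;
  }
  return lines;
}
```

A production build would supply a second implementation backed by real file calls, while tests pass a FakeDisk. The check is only as strong as the discipline of never calling the file APIs directly outside that implementation.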

More than just the disk. When worrying about blocking, you're normally concerned with disk access, which is glacially slow when compared to most other operations a computer can do. But there are many ways an app can block. One that affected many people I know is that Firefox used to do blocking DNS resolution from the UI thread under some circumstances (which happened to be the standard setup within the Google corporate network). DNS is generally significantly slower than the disk. At the other extreme, even exclusively computational operations can consume enough time to be perceivable, though it's relatively rare.

You can't really win. In a world with virtual memory, you never know whether a given page you're about to access has been swapped out. (Well, I mean, you can query for it, but there's not a lot you can do when the upcoming page of code is swapped out.) That means sometimes you will pause.

But that doesn't mean you should give up; rather, you should improve what you can. The best tool we currently have for end-to-end analysis is our jankometer (I keep intending to write it up, but briefly, it checks whether a given iteration through the event loop takes more than an absolute amount of time; if you're running Chrome you can see its logged values by visiting about:histograms/Chrome and looking at the entries with Msg in their names). In Chrome's case, many bugs tagged "jank" remain.
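The kind of check the jankometer performs, timing each pass through the event loop against a fixed budget, might look roughly like the sketch below. The JankMeter class, its methods, and whatever budget is passed in are all assumptions for illustration; the real jankometer's thresholds and structure are not described here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical counter that flags event-loop iterations over a fixed budget.
class JankMeter {
 public:
  using Clock = std::chrono::steady_clock;

  explicit JankMeter(std::chrono::milliseconds budget) : budget_(budget) {
    assert(budget.count() > 0);
  }

  // Records the duration of one iteration; anything strictly over the
  // budget counts as janky.
  void Record(Clock::duration elapsed) {
    ++iterations_;
    if (elapsed > budget_) ++janky_;
  }

  // Runs one task as a single iteration and records how long it took.
  void RunTimed(const std::function<void()>& task) {
    Clock::time_point start = Clock::now();
    task();
    Record(Clock::now() - start);
  }

  std::size_t iterations() const { return iterations_; }
  std::size_t janky() const { return janky_; }

 private:
  std::chrono::milliseconds budget_;
  std::size_t iterations_ = 0;
  std::size_t janky_ = 0;
};
```

An event loop would wrap each dispatched message in RunTimed and periodically log the two counters. This catches a pause whatever its cause (disk, DNS, page fault, or plain computation), though it only says that an iteration was slow, not why.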