n2: revisiting Ninja

March 24, 2022

In my Ninja retrospective I observed that the joy comes from tinkering and learning, and that I could have still gone that route without the "success" of pleasing users. With that in mind, and with nearly a decade of distance, I recently spent some time rethinking some of the design choices of Ninja.

For me, it's one thing to read and think about an idea, and it's another to implement it; there's a lot of subtle detail that you only discover when you actually try to make the thing work. So here's a thing that seems to work: n2, pronounced "into". It reimplements enough of Ninja that I have been using it as my day-to-day build system for work — a CMake-based project with ~1k build steps — and it takes a different approach to some of the core problems in Ninja, which I'll go into in more detail in the rest of this post.

Manifests instead of mtime order

Ninja, following make, relies mostly on comparing mtimes. At a high level, the problem of a build system is to determine which of a series of steps it needs to re-execute, and checking whether inputs are newer than outputs is a decent first approximation. However, as mtime comparison considered harmful describes in depth, there are a lot of details that break this approach. To summarize the two main ideas in that post: (1) mtimes themselves don't reliably reflect every change to a file's contents, and (2) there are more factors to out-of-date-ness than file contents alone.

Set aside problems like clock resolution and mmap and imagine for a moment that mtimes reliably changed whenever file contents did. (I think this is still a reasonable assumption for most common build setups.) Here are some example scenarios that mtime comparisons still handle poorly:

- A file's mtime goes backwards, for example after restoring an older version from version control or a backup; the output still looks newer than its input.
- The command line changes — say, a compiler flag is added — without any file on disk changing at all.
- The build graph changes, adding a new input to a step, but the new input's mtime is older than the existing output, so the output still looks up to date.

Here's a different model that cleanly solves these problems and others. For each build step, after the build completes, construct a manifest-like document that includes the command line, the names of the inputs, their mtimes, the names of the outputs, and their mtimes. Now hash this document and record that hash as a representative for the build state. (This is similar to how git models commits.) The next time you reconsider this build step in a subsequent build, it's out of date exactly if the hash doesn't match the recorded one.

signature = hash("""
  cc -Wsome-flag foo.cc -o foo.o
  inputs: foo.cc 121312 foo.h 128755 generated.h 138913
  outputs: foo.o 127841
""")

This representation is intended to be fragile, in that the manifest changes in all the cases where it matters: if an input file changes, if the build graph changes, if the command line changes, if an output file is missing, if a file's mtime goes backwards, and so on. And the hash value itself is a tiny thing to store.
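To make the pseudocode above concrete, here's a minimal sketch in Python of how such a signature could be computed (the function name, the exact manifest layout, and SHA-256 are my choices for illustration, not necessarily what n2 does):

```python
import hashlib

def build_signature(cmdline, inputs, outputs):
    """Hash a manifest of the command line plus input/output names and mtimes.

    `inputs` and `outputs` are lists of (name, mtime) pairs; in a real build
    system the mtimes would come from stat(). Any change -- a flag, an mtime,
    a renamed file -- changes the resulting hash.
    """
    lines = [cmdline]
    lines.append("inputs: " + " ".join(f"{name} {mtime}" for name, mtime in inputs))
    lines.append("outputs: " + " ".join(f"{name} {mtime}" for name, mtime in outputs))
    manifest = "\n".join(lines)
    return hashlib.sha256(manifest.encode()).hexdigest()

# A touched input (new mtime for foo.cc) yields a different signature.
before = build_signature("cc -Wsome-flag foo.cc -o foo.o",
                         [("foo.cc", 121312), ("foo.h", 128755)],
                         [("foo.o", 127841)])
after = build_signature("cc -Wsome-flag foo.cc -o foo.o",
                        [("foo.cc", 121399), ("foo.h", 128755)],
                        [("foo.o", 127841)])
```

The next build recomputes the signature from the current state of the world and compares it to the stored one; a mismatch of any kind means the step is out of date.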

Note that this is explicitly not hashing the content of files — the manifest here is just covering "here's a bag of stuff that identifies the build state that was executed", and is an in-memory thing that is fast to compute. Nor does this involve any content addressing. But the "hash a bag of stuff" model easily extends to other techniques, including avoiding mtimes entirely by using content hashing: just replace the mtimes in the above manifest with file content hashes. And further, a manifest like this (less the outputs) is exactly the key you'd use to implement a caching system like ccache or bazel.

Or if you wanted, you could add in file sizes/inodes/etc. for the reasons mentioned in the above mtimes article, or even any relevant environment variables. Because all that matters is whether the manifest hash changes across builds, you could even make different decisions about these issues on a per-build-step basis, like if you wanted to mix content-based hashing on some build steps with mtimes or something else on others.
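One way to picture that per-step flexibility: the manifest only needs some token per file that changes when the file meaningfully changes, so each step can choose how that token is produced. A hypothetical helper (not n2's API) might look like:

```python
import hashlib
import os
import tempfile

def file_token(path, use_content_hash):
    """Return the string that stands in for `path` in a step's manifest.

    The signature scheme only requires that this token change whenever the
    file meaningfully changes, so each build step can independently pick
    an mtime or a content hash. (Illustrative helper, not n2's actual code.)
    """
    if use_content_hash:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    return str(os.stat(path).st_mtime_ns)

# The same file can be represented either way.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"int main() {}\n")
    path = f.name
content_tok = file_token(path, use_content_hash=True)
mtime_tok = file_token(path, use_content_hash=False)
```

Either token just gets pasted into the manifest in place of the mtime; the hash-comparison machinery doesn't care which was used.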

Single pass

In Build Systems à la Carte, the authors define "early cutoff" as a feature where, if a given build step doesn't modify its outputs — commonly because it determined for some other reason that the outputs are still up to date — build steps downstream of it don't need to execute again. Early cutoff turns out to be important for real build system problems (such as this one), and so Ninja 1.10 added support for it (see restat in the manual).

But Ninja's architecture made this support pretty clunky. Ninja works in two passes: it first computes a "plan" of all the build tasks that are considered out of date by examining on-disk mtimes, then it executes the plan in topological order. This design means Ninja can accurately show how many total tasks it will execute as it works, but to support early cutoff it needs to do extra graph traversal work to remove entries from the plan when it discovers they won't end up being needed.

If you instead start with early cutoff in mind, there's a different approach that makes it much simpler. Rather than two passes, do the work in a single pass: visit every build step that is an input to the desired output in topological order, stat()ing files as you go, and executing out-of-date commands as they're visited. Early cutoff falls out naturally: if by the time you examine a build step its inputs are up to date, you don't need to execute it, independent of whether any other build steps ran.
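Here's a toy sketch of that single-pass traversal. The Step class, the fake in-memory "file system" of mtimes, and the signature format are all invented for illustration (n2's real implementation differs); the point is just that early cutoff needs no special handling:

```python
files = {"foo.cc": 1}  # fake file system: name -> fake mtime
ran = []               # record of which steps executed

class Step:
    def __init__(self, name, deps, inputs, outputs, touches_output):
        self.name, self.deps = name, deps
        self.inputs, self.outputs = inputs, outputs
        self.touches_output = touches_output  # False models an early-cutoff step

    def signature(self):
        # Manifest-style signature: step identity plus input/output "mtimes".
        tokens = [f"{f}:{files.get(f)}" for f in self.inputs + self.outputs]
        return "|".join([self.name] + tokens)

    def run(self):
        ran.append(self.name)
        for f in self.outputs:
            if self.touches_output:
                files[f] = files.get(f, 0) + 1  # bump output mtime
            else:
                files.setdefault(f, 0)          # output left unchanged if present

def build(step, last_sigs, new_sigs):
    """Single pass: visit dependencies first, run a step only if its
    signature no longer matches the one recorded after the last build."""
    for dep in step.deps:
        build(dep, last_sigs, new_sigs)
    if last_sigs.get(step.name) != step.signature():
        step.run()
    new_sigs[step.name] = step.signature()

gen = Step("gen", [], ["foo.cc"], ["gen.h"], touches_output=False)
cc = Step("cc", [gen], ["gen.h"], ["foo.o"], touches_output=True)

sigs1 = {}
build(cc, {}, sigs1)     # first build: both steps run

files["foo.cc"] += 1     # "edit" foo.cc
sigs2 = {}
build(cc, sigs1, sigs2)  # gen reruns but leaves gen.h alone, so cc is skipped
```

In the second build, gen reruns because foo.cc changed, but since it doesn't modify gen.h, cc's signature still matches the recorded one and cc is skipped — early cutoff with no extra bookkeeping.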

Further, supporting early cutoff means Ninja must store additional state on disk. After an early-cutoff build, there is some output file with an mtime older than one of its inputs, but for the purposes of the next build it should be considered up to date. Ninja clunkily handles this case by recording a second mtime on the side. The manifest-based approach above handles this for free.

Ninja is silent while first constructing the plan, which includes stat()ing all the inputs; it only shows progress in the second pass, when actually executing steps. This is fine in the warm-disk case where stats are fast, but in the cold-disk case it can feel like it's hung. The single-pass approach instead interleaves those stats with other work over the course of a build, which means n2 can start executing build steps that are found to be out of date even before it's stat()ed all the inputs relevant to the overall build. It also means that the build progress bar can cover this stat work.

Consequences

In all, representing build state with these hashes and working in a single pass means there's a tiny on-disk representation, the build can start executing tasks sooner with better progress feedback, and some subtle mtime cases, including early cutoff, are handled for free. Pretty neat.

I don't think these ideas are new. The main idea here is more or less "verifying traces" from the above build paper; Go does something similar too.

In the spirit of engineering being about tradeoffs, here are some Ninja features the above loses:

- Because there's no up-front plan, the progress display can't report the exact total number of tasks the build will execute before execution starts, the way Ninja's two-pass design can.

In my experience software always accretes complexity to address unforeseen use cases, and it's a rare pleasure when a different conceptual model allows you to consolidate to something simpler again. In other words, I wrote this code mostly for the fun of it, so if you find it useful that's cool and if not that's cool too.