Tech Notes: diff --stat for binary files

August 09, 2025

I contributed a minor feature to the Jujutsu version control system, which I wrote about previously.

When you run diff --stat in Git, it shows you a summary of your change as a list of modified files and counts of added and removed lines for each modified file. For binary files, Git displays the difference in byte size. Here's an example commit where I grew a .dll file:

commit 9649ab9bf70c92a1ebe2ac39b4d2ef86b1de37b9
Author: Evan Martin <evan.martin@gmail.com>
Date:   Thu Oct 17 11:56:28 2024 -0700

    dinput: more stubs

 win32/dll/dinput.dll               | Bin 2560 -> 3584 bytes
 win32/src/winapi/dinput/builtin.rs |  48 ++++++++++++++++++++++++++++++++++++++++++++----
 win32/src/winapi/dinput/dinput.rs  |  53 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 95 insertions(+), 6 deletions(-)

Jujutsu has the same feature except it didn't handle binary files: it would just count the number of 0x0a bytes in the file, which is not very useful. So I fixed that.
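To make the "count the 0x0a bytes" problem concrete, here is a minimal sketch. The function names and the 8000-byte prefix are assumptions for illustration: the NUL-byte check mirrors the heuristic Git is known for, and is not necessarily how Jujutsu decides a file is binary.

```rust
/// Counts 0x0a bytes, which is all a naive line counter sees in a file.
fn count_newlines(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == 0x0a).count()
}

/// Hypothetical binary check: treat the data as binary if a NUL byte
/// appears within the first 8000 bytes (an assumed, Git-like heuristic).
fn looks_binary(data: &[u8]) -> bool {
    data.iter().take(8000).any(|&b| b == 0)
}
```

The first twelve bytes of any PNG (the signature followed by the IHDR chunk length) already contain two 0x0a bytes, so a line-based stat would report "lines" that mean nothing, while the NUL check flags the data as binary right away.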

This is a very minor feature, but it turned out to be more subtle than I expected for one main reason: the above output is sized to make each line fit the terminal width, which means it truncates file names that are too long and also scales the graph on the right to fit. You end up needing to measure all the relevant text carefully, and to watch out for rounding and for subtractions that underflow below zero (e.g. if the terminal is too narrow to fit the filename at all).
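Here is a rough sketch of that layout arithmetic, under stated assumptions: the Stat struct, the helper names, and the policy of giving paths at most 3/5 of the free columns are all made up for illustration and are not taken from Jujutsu or Git. It also counts characters rather than display columns, which is only right for single-width text.

```rust
/// Hypothetical per-file entry: path plus added/removed line counts.
struct Stat<'a> {
    path: &'a str,
    added: usize,
    removed: usize,
}

/// Keeps the tail of `path` within `max` characters, marking the cut with "...".
fn elide(path: &str, max: usize) -> String {
    let len = path.chars().count();
    if len <= max {
        return path.to_string();
    }
    if max <= 3 {
        return ".".repeat(max);
    }
    let tail: String = path.chars().skip(len - (max - 3)).collect();
    format!("...{}", tail)
}

/// Scales a count down to the graph width, keeping nonzero counts visible.
fn scale(count: usize, max_changes: usize, graph_width: usize) -> usize {
    if max_changes <= graph_width {
        return count; // Everything fits unscaled (also covers max_changes == 0).
    }
    let scaled = count * graph_width / max_changes; // Rounds down.
    if count > 0 && scaled == 0 && graph_width > 0 { 1 } else { scaled }
}

fn render_stat(stats: &[Stat], term_width: usize) -> Vec<String> {
    let max_changes = stats.iter().map(|s| s.added + s.removed).max().unwrap_or(0);
    let count_width = max_changes.to_string().len();
    let longest_path = stats.iter().map(|s| s.path.chars().count()).max().unwrap_or(0);

    // Text that is never shortened: " | ", the count column, and one space.
    let fixed = 3 + count_width + 1;
    // saturating_sub avoids underflow when the terminal is narrower than `fixed`.
    let available = term_width.saturating_sub(fixed);
    // Assumed policy: paths get at most 3/5 of the room, the graph gets the rest.
    let path_width = longest_path.min(available * 3 / 5);
    let graph_width = available - path_width;

    stats
        .iter()
        .map(|s| {
            let plus = scale(s.added, max_changes, graph_width);
            // Clamp so that bumping a tiny count up to 1 cannot overflow the graph.
            let minus = scale(s.removed, max_changes, graph_width)
                .min(graph_width.saturating_sub(plus));
            format!(
                "{:<pw$} | {:>cw$} {}{}",
                elide(s.path, path_width),
                s.added + s.removed,
                "+".repeat(plus),
                "-".repeat(minus),
                pw = path_width,
                cw = count_width
            )
        })
        .collect()
}
```

With a 40-column terminal, a 32-character path gets cut down to 19 characters and a 100/20 change is drawn as 11 plus signs and 2 minus signs. With a 5-column terminal the path and graph both shrink to nothing instead of panicking on a negative width; the fixed part of the line is simply left to overflow.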

Here are some minor notes.

Slightly different output: Git shows 2560 -> 3584 bytes, but after discussion in the PR about whether to show plain byte counts or pretty-print the numbers, I convinced myself that the other lines in diff --stat output are only showing the magnitude of the change and not the before/after. So my output looks like (binary) +1024 bytes. This means that you can't tell a grown file from a fully added or removed file, but that was already true for text files, and that's never bothered me in my years of using Git.
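A minimal sketch of that formatting rule follows. The function name and the use of Option for "file absent on this side" are assumptions for illustration, not the actual Jujutsu code.

```rust
/// Formats the stat column for a binary file from its old and new sizes in
/// bytes. `None` means the file does not exist on that side of the diff.
fn binary_stat(old_size: Option<u64>, new_size: Option<u64>) -> String {
    // Widen to i128 so the subtraction cannot overflow or underflow.
    let old = i128::from(old_size.unwrap_or(0));
    let new = i128::from(new_size.unwrap_or(0));
    let delta = new - old;
    if delta == 0 {
        "(binary)".to_string()
    } else {
        // `{:+}` always prints the sign, so growth shows up as "+1024".
        format!("(binary) {:+} bytes", delta)
    }
}
```

Under these assumptions, binary_stat(Some(2560), Some(3584)) and binary_stat(None, Some(1024)) both produce "(binary) +1024 bytes", which is exactly the grown-versus-added ambiguity described above.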

expect tests: I had learned about "expect tests" from this Jane Street blog post. From the post it sounded like a great feature but it was for OCaml only, so I never tried it. I was delighted to discover that Jujutsu uses them via Insta, a Rust library that provides a similar thing.

The test for my change runs jj diff --stat and asserts what the output looks like, as follows. The cool thing about Insta is that I didn't need to hand-update this text; instead it can run the test and interactively step through which outputs differ, and it automatically inserts the changes I accept back into the code.

let output = work_dir.run_jj(["diff", "--stat"]);
// Rightmost display column          ->|
insta::assert_snapshot!(output, @r"
binary_added.png    | (binary) +12 bytes
binary_modified.png | (binary)
...fied_to_text.png | (binary) -8 bytes
binary_removed.png  | (binary) -16 bytes
...y_valid_utf8.png | (binary) +3 bytes
5 files changed, 0 insertions(+), 0 deletions(-)
[EOF]
");

(The idea of expect tests is deeper than just textual command output! Read the original blog post for more.)

Colored output: When generating textual output, Jujutsu tags substrings with keywords like added or binary, which then feed into an outer system that assigns colors to these semantic categories. This is a neat mechanism to keep colors consistent across different commands while allowing for customization. In particular, if you customize the output of other commands like log, you'll interact with these.
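The general shape of such a mechanism can be sketched in a few lines. Everything here (the Span struct, render_spans, and a theme stored as a map from label to ANSI color code) is a hypothetical simplification and not Jujutsu's actual formatter API.

```rust
use std::collections::HashMap;

/// A piece of output, optionally tagged with a semantic label like "added".
struct Span<'a> {
    label: Option<&'a str>,
    text: &'a str,
}

/// Renders spans to a string, wrapping labeled text in the ANSI color that
/// the theme assigns to its label. Unlabeled or unthemed text stays plain.
fn render_spans(spans: &[Span], theme: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    for span in spans {
        match span.label.and_then(|label| theme.get(label)) {
            Some(code) => {
                out.push_str("\x1b[");
                out.push_str(code);
                out.push('m');
                out.push_str(span.text);
                out.push_str("\x1b[0m"); // Reset attributes after the span.
            }
            None => out.push_str(span.text),
        }
    }
    out
}
```

The command producing the output only ever says "this text is added"; whether that means green, bold, or nothing at all is decided in one place by the theme, which is what keeps colors consistent across commands.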

Rust build output is massive: This was my first time tinkering with Jujutsu, but over the ~two months that I worked on this, my target/ dir (containing Rust build output) grew to over 25 GB. Jeepers. I think it was maybe intermediate outputs of various libraries whose versions themselves varied over that time period?

PS: I didn't actually work on it for two months! I worked on it for a couple of hours, forgot about it, picked it up again some weeks later, and then repeated that a few times.

Double width characters: File names can be Unicode, and even in a terminal some Unicode characters (particularly Chinese) are supposed to occupy two columns. This means that to measure the width of a filename and properly elide it with ... you need not only Unicode character handling, but also data tables about which codepoints are double-width.

This code was already all implemented and I did not touch it, but I mostly note that even a pretty basic thing like "shorten a filename to make the text align on the terminal" quickly becomes a whole project if you try to do it thoroughly.
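As a rough illustration of why the data tables matter, here is a sketch of width-aware elision. The function names are hypothetical, and the width table is a deliberately tiny approximation that covers only a few East Asian ranges; a real implementation would rely on complete Unicode width data (for example from a crate such as unicode-width) rather than this hand-written match.

```rust
/// Approximate terminal width of a character: 2 columns for a few common
/// East Asian ranges, 1 otherwise. Deliberately incomplete.
fn char_width(c: char) -> usize {
    match c as u32 {
        0x1100..=0x115F      // Hangul Jamo initial consonants
        | 0x3040..=0x30FF    // Hiragana and Katakana
        | 0x4E00..=0x9FFF    // CJK Unified Ideographs
        | 0xAC00..=0xD7A3    // Hangul syllables
        | 0xFF01..=0xFF60    // Fullwidth forms
        => 2,
        _ => 1,
    }
}

fn str_width(s: &str) -> usize {
    s.chars().map(char_width).sum()
}

/// Keeps as much of the end of `name` as fits in `max` columns, prefixing
/// "..." when the name had to be shortened.
fn elide_start(name: &str, max: usize) -> String {
    if str_width(name) <= max {
        return name.to_string();
    }
    if max <= 3 {
        return ".".repeat(max);
    }
    let budget = max - 3; // Columns left after the "..." marker.
    let mut used = 0;
    let mut start = name.len();
    // Walk backwards over whole characters so no code point is ever split.
    for (idx, c) in name.char_indices().rev() {
        let w = char_width(c);
        if used + w > budget {
            break;
        }
        used += w;
        start = idx;
    }
    format!("...{}", &name[start..])
}
```

For plain ASCII this reproduces the shape seen in the test output earlier: elide_start("binary_modified_to_text.png", 19) gives "...fied_to_text.png". For "日本語のファイル.png" with a 12-column limit it gives "...イル.png", which is only 11 columns wide, because the next character would need two columns and only one is left.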

Commit access: After writing a few PRs, they granted me access to merge my own changes. Pretty cool thing to do for a first-time contributor! I expect the repo is set up to refuse force-pushes so I suppose if I mess things up they can always fix it.

Future work: When writing this blog post I looked at the output a bit more carefully and noticed yet more aligning is to be done.