Courgette -- better binary diffing by understanding x86

May 30, 2009

Courgette landed in the Chromium tree a while ago but there's been no announcement, so here's my basic understanding.

Chrome likes to rapidly push out small updates. The obvious way — in fact the Debian way* — is to force a redownload of the whole multi-megabyte package. Better is to send down diffs. For random binary blobs, the state of the art is xdelta. (Random fun fact: Josh, the xdelta author, is coincidentally one the original authors of GTK+ and a longtime Googler.)

Even better is to send down diffs that are smart about what you're diffing: bsdiff "tak[es] advantage of how executable files change". This is what Chrome currently uses for updates, but they're still a bit on the chubby side.

Why? It turns out that when you make a minor patch (say an extra NULL-check somewhere) that ends up shifting all following code. Any addresses (jumps) contained in the resulting binary will have their values slightly changed, which means you have lots of integers to adjust throughout your diff.

Imagine software (not us yet, surely) that has hundreds of millions of users; pushing out a 500k update to 100m users is 50 terabytes of transfer. mbsdiff.cc, the core of the bsdiff implementation in the Chrome tree, is a succinct 382 lines of code. Can you do better if you're willing to work harder?

Enter courgette. From a comment:

// The main idea in Courgette is to do patching *under a
// tranformation*.  The input is transformed into a new representation,
// patching occurs in the new representation, and then the transform is
// reversed to get the patched data.

Conceptually, given an executable, you disassemble the executable into a simplified stream of instruction-like bits that use labels instead of addresses. Then you can compute a patch between these abstracted programs, which doesn't need to mention most of the labels that don't change. With a source binary and a patch, you construct the target binary by disassembling the source, patching the abstract representation, then reassembling into the destination.

I believe the disassembling/assembling process isn't intended to be perfect. x86 and compilers are complicated so the patch-generated output isn't quite identical to the target executable. So you also include a bsdiff'd patch alongside to fix up the final details. Courgette incorporates this idea of multi-pass patching into its design.

It only supports Windows x86 for now, but other formats are surely not hard to implement. It seems to me that the basic idea ought to work for other file formats as well — I imagine, for example, making a delta format that knew about .po files or knew to unzip embedded .jar files and recursively delta their contents. But now you know about as much about this I do; go read the source for more.

* More discussion of the various package deltaing available on Debian and Redhat for another post sometime. I am mostly sad about the lack of it, so please mail me good news before I embarrass myself about something I don't know about.