Binary size and other tree maps

November 14, 2010

(This is kinda old at this point, but I may as well post it since I haven't touched it in a while.)

We build Chrome as one huge binary. This has a number of benefits outside the scope of this post but one negative is that we end up with a single enormous file without much insight into where the space is going. You can look at correlated factors like .o file size but you never know what the linker is going to throw away or optimize. objdump and friends can tell you the relative sizes of sections, but that still is a pretty blunt instrument.

nm can give you per-symbol sizes. That's how I first discovered that, for example, our translate feature ships a 1 MB language model data table. But I'm more interested in the aggregate cost of modules than in single large symbols.

It turns out that nm can also emit the line number each symbol came from, though it takes a long time to compute (at one point I dug into nm's source and discovered there's a comment in there about this; my memory is months old at this point but it was something about "this is slow, but we don't do it often"). With paths to files we can map back from bytes in the binary to a directory structure of the source.
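To make that mapping concrete, here is a minimal sketch of the aggregation step in Python. It is not the script I actually used. It assumes output in the shape GNU nm produces with `-S` (print sizes) and `-l` (line numbers), that is, `address size type name<TAB>file:line`, and the sample lines and the `aggregate_by_directory` name are made up for illustration.

```python
import os
import re
from collections import defaultdict

# Assumed line shape: "address size type name<TAB>path:line".
# Requiring 8+ hex digits for the size keeps one-letter symbol types
# such as "b" or "d" from being misread as a hex size.
LINE_RE = re.compile(
    r"^[0-9a-fA-F]+ ([0-9a-fA-F]{8,}) (\S) ([^\t]+)\t(.+):(\d+)$"
)


def aggregate_by_directory(nm_lines):
    """Sum symbol sizes per directory, crediting every ancestor as well."""
    totals = defaultdict(int)
    for line in nm_lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            # No size or no file:line information, so nothing to attribute.
            continue
        size = int(match.group(1), 16)
        directory = os.path.dirname(match.group(4))
        # Walk upward so parent directories include their children's bytes.
        while directory:
            totals[directory] += size
            parent = os.path.dirname(directory)
            if parent == directory:
                break  # reached the filesystem root
            directory = parent
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample lines, not real Chrome symbols.
    sample = [
        "0000000000401000 0000000000000010 T main\tsrc/app/main.cc:5",
        "0000000000401010 0000000000000020 t helper(int, int)\tsrc/app/util/helper.cc:9",
        "                 U printf",
    ]
    for directory, size in sorted(aggregate_by_directory(sample).items()):
        print(directory, size)
```

Symbols that nm cannot attribute to a source file are simply dropped here; a real tool would probably want to collect them into a catch-all bucket so the totals still add up to the binary size.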

With a directory structure in hand, I next turned to visualization. I've used treemaps before for looking at disk space (on Linux, Baobab is built into Gnome and its ring chart is quite nice), but how could I share the results with my coworkers? I turned to the web and found lots of one-offs, plus the larger JavaScript Infovis Toolkit, whose UI I found frustrating and clunky.

I said to my office: "I bet I could hack something decent up in a few minutes." Ojan responded: "I bet it'll take you a few hours to get 80% there, and then a week to have it be useful." And he was pretty much spot on.

But the end result is that I have published a web-navigable treemap of our binary size. (You can see some other discussion of it on Hacker News.) This breaks it down by directory; it's not hard to do other breakdowns, like by namespace.

I also published the treemapping widget separately. It was fun to write: a combination of intuition, reading a paper, and implementing the algorithm from it. It's pretty straightforward and works on both WebKit and Gecko (though I may have accidentally broken Gecko more recently, I haven't checked, and I also rely on WebKit transitions for gratuitous but brief visual effects). I spent an embarrassing amount of time fiddling with getting the spacing right; it turns out adjusting divs when borders are present is still pretty fiddly, even with the CSS box-sizing: border-box setting.
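For a feel of what a treemap layout does, here is a rough sketch in Python. To be clear, this is the simple "slice and dice" approach (split the rectangle among the children in proportion to size, alternating direction at each level), not the algorithm from the paper and not the widget's actual code. The node shape (`name`, `size`, `children`) and the function name are assumptions for illustration.

```python
def slice_and_dice(node, x, y, w, h, horizontal=True, out=None):
    """Return a list of (name, x, y, w, h) rectangles for node and descendants.

    node is a dict with "name", "size", and an optional "children" list.
    Each child gets a strip of the parent's rectangle proportional to its
    size; the split direction flips at every level of the tree.
    """
    if out is None:
        out = []
    out.append((node["name"], x, y, w, h))
    children = node.get("children", [])
    total = sum(child["size"] for child in children)
    if total <= 0:
        return out  # leaf, or nothing measurable underneath
    offset = 0.0  # fraction of the parent already handed out
    # Largest children first, so the big boxes land at the top/left.
    for child in sorted(children, key=lambda c: c["size"], reverse=True):
        fraction = child["size"] / total
        if horizontal:
            slice_and_dice(child, x + offset * w, y, fraction * w, h, False, out)
        else:
            slice_and_dice(child, x, y + offset * h, w, fraction * h, True, out)
        offset += fraction
    return out


if __name__ == "__main__":
    # Hypothetical directory sizes.
    tree = {"name": "src", "size": 100, "children": [
        {"name": "app", "size": 60},
        {"name": "data", "size": 40, "children": [
            {"name": "tables", "size": 30},
            {"name": "strings", "size": 10},
        ]},
    ]}
    for rect in slice_and_dice(tree, 0, 0, 100, 50):
        print(rect)
```

Slice and dice tends to produce long skinny strips for deep trees, which is a big part of why fancier layouts exist. This sketch also ignores borders and padding entirely, which is where most of my fiddling went.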

Since then I've used the same widget for looking into our test timings; you can see a snapshot of the map. With this in hand I knew which tests were most problematic and cut down test runtime by a lot. (Coincidentally, while grinding away at tests I also discovered that much of our test flakiness was caused by a single bug, so a lot of the red you see on those old charts is now fixed. But that's a story for another time.)