Adventures in heap profiling

March 06, 2012

There's an obvious metaphor for heap profiling that I've yet to find a good opportunity to use, so here goes. Imagine you're investigating a company that is spending too much money. You have all of the credit card bills, but they only implicate the person who signed for a given purchase. Ideally you'd track down who authorized too many purchases. But then if you trace the management hierarchy all the way to its root, the CEO is ultimately responsible for all spending, and that's not a useful fact either.

In code, your credit card bills are the calls to malloc() and your CEO is the main() function that invokes all the calls. Heap profiling, then, can be broken into two pieces: first, collecting the chains of authority, and second, analyzing those chains to allocate (pun intended!) blame.

Let's start with the first one. Collecting a heap profile is as simple as recording the call stack at each memory allocation point, which is to say it's not exactly simple.

First you must hook allocation. Glibc provides hooks for malloc, and if you're using a custom memory allocator like tcmalloc you're already intercepting allocation anyway, but there are also various pieces of your program, like the C++ runtime or glib, that may use their own allocation pools (for glib, for example, you can adjust G_SLICE to aid debugging).
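To make that concrete, here's a minimal single-threaded sketch using glibc's __malloc_hook (since deprecated, and not thread-safe, but fine for illustration); a real profiler would hook calloc, realloc, and free the same way, and would record a stack trace where this just logs:

```c
#include <malloc.h>
#include <stdio.h>

/* The hook we displaced, saved so we can reach the real allocator. */
static void *(*old_malloc_hook)(size_t, const void *);

static void *my_malloc_hook(size_t size, const void *caller) {
  void *result;
  /* Restore the old hook so the malloc() below doesn't recurse into us. */
  __malloc_hook = old_malloc_hook;
  result = malloc(size);
  /* A profiler would capture a full stack trace here; we just log the
     size and the immediate caller glibc hands us. */
  fprintf(stderr, "malloc(%zu) = %p (caller %p)\n", size, result, caller);
  /* Re-save and reinstall: malloc itself may have changed the hook. */
  old_malloc_hook = __malloc_hook;
  __malloc_hook = my_malloc_hook;
  return result;
}

int main(void) {
  old_malloc_hook = __malloc_hook;
  __malloc_hook = my_malloc_hook;
  void *p = malloc(32);
  free(p);
  return 0;
}
```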

Once your code is getting called at the right places, you must find a way to get a stack trace at runtime. One approach is to walk the stack looking for return addresses. It appears gcc for x86-64 Linux provides an API for such a thing, but check out the gnarly code Google uses for x86; libunwind is a more recent implementation that likely works better. Another approach is to record extra information while the code is running: tcmalloc can use gcc's -finstrument-functions flag to hook every function call with a bit of code that simply maintains the call stack as an array on the side (a "shadow stack").
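To make both approaches concrete, here are hedged sketches of each. The first walks the current stack with libunwind's local unwinding API (link with -lunwind; capture_stack and the frame limit are my own invention, not tcmalloc's code):

```c
#define UNW_LOCAL_ONLY
#include <libunwind.h>

/* Walk the current thread's stack, collecting return addresses into
   frames[]; returns the number of frames captured. */
static int capture_stack(void **frames, int max_frames) {
  unw_context_t context;
  unw_cursor_t cursor;
  int n = 0;
  unw_getcontext(&context);
  unw_init_local(&cursor, &context);
  while (n < max_frames && unw_step(&cursor) > 0) {
    unw_word_t ip;
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    frames[n++] = (void *)ip;  /* frames[0] is the innermost caller */
  }
  return n;
}
```

The shadow-stack approach instead relies on gcc calling two magic functions on every function entry and exit when you compile with -finstrument-functions; maintaining the stack is then just array bookkeeping:

```c
/* gcc emits calls to these on every function entry/exit when building
   with -finstrument-functions.  The attribute keeps the hooks
   themselves from being instrumented (and recursing forever). */
static __thread void *shadow_stack[256];  /* arbitrary depth limit */
static __thread int shadow_depth;

void __cyg_profile_func_enter(void *fn, void *callsite)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *fn, void *callsite)
    __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *fn, void *callsite) {
  (void)callsite;
  if (shadow_depth < 256)
    shadow_stack[shadow_depth] = fn;
  shadow_depth++;
}

void __cyg_profile_func_exit(void *fn, void *callsite) {
  (void)fn; (void)callsite;
  shadow_depth--;
}
```

At an allocation the profiler just copies the first shadow_depth entries out of shadow_stack, no unwinding needed.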

Now you've collected a bunch of allocation stack traces, and it's time to shuffle them together. Typically this information is presented as a "top N" list or a tree widget, but tcmalloc provides a cool tool called pprof that generates those as well as graphviz input for pictorial graphs. You can see a snippet of such a graph for GFS, the Google File System, on the tcmalloc heap profiler page.
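The shuffling itself isn't much code. Here's a toy sketch (hypothetical names; a linear scan standing in for a real hash table) that charges each sample's bytes to every caller->callee edge in its stack and prints graphviz input, which is roughly the shape of what pprof does once addresses have been symbolized to function names:

```c
#include <stdio.h>
#include <string.h>

/* One recorded allocation: a byte count plus its call stack, stored
   innermost-frame-first, the way an unwinder hands it to you. */
struct sample { long bytes; int depth; const char *frames[16]; };

struct edge { const char *from, *to; long bytes; };
static struct edge edges[1024];
static int nedges;

static void add_edge(const char *from, const char *to, long bytes) {
  for (int i = 0; i < nedges; i++) {
    if (strcmp(edges[i].from, from) == 0 && strcmp(edges[i].to, to) == 0) {
      edges[i].bytes += bytes;  /* merge stacks sharing this edge */
      return;
    }
  }
  edges[nedges].from = from;
  edges[nedges].to = to;
  edges[nedges].bytes = bytes;
  nedges++;
}

static void dump_dot(const struct sample *samples, int n) {
  /* Walk each stack root-to-leaf, charging the sample's bytes to every
     caller->callee edge along the way. */
  for (int i = 0; i < n; i++)
    for (int j = samples[i].depth - 1; j > 0; j--)
      add_edge(samples[i].frames[j], samples[i].frames[j - 1],
               samples[i].bytes);
  printf("digraph heap {\n");
  for (int i = 0; i < nedges; i++)
    printf("  \"%s\" -> \"%s\" [label=\"%ld bytes\"];\n",
           edges[i].from, edges[i].to, edges[i].bytes);
  printf("}\n");
}

int main(void) {
  struct sample samples[] = {
    { 4096, 3, { "ReadBlock", "ParseFile", "main" } },
    { 1024, 3, { "ReadHeader", "ParseFile", "main" } },
  };
  dump_dot(samples, 2);
  return 0;
}
```

Pipe the output through graphviz's dot and you get the same sort of picture as the GFS snippet above.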

However, to map the addresses seen in the recorded call stacks into human-readable function names, pprof relies upon helper tools from GNU binutils such as addr2line, which turns out to be agonizingly slow — for Chrome I clocked it as taking over 4.5 minutes to load a profile! For fun I recently dug into how this all works, and so I have written two programs to share with you.

The first is maddr, a halfway reimplementation of addr2line. To map an address to a line number in code you rely on debug information, which is in an awkward format described in the DWARF spec. The line number info is a bytecode program that, when evaluated, generates a large table mapping every offset within the binary to a source file and line number; maddr decodes that table into memory and does binary searches over it to quickly answer queries. In principle this could be extended slightly to support the same formats addr2line uses, and then plugged directly into pprof.
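The query half is the easy part. Assuming the line-number bytecode has already been evaluated into a table sorted by address, answering a lookup is a single binary search for the last row at or below the target pc (names here are hypothetical, not maddr's actual API):

```c
#include <stddef.h>

/* One row of the decoded line table: the address where a source
   position takes effect, plus that file and line. */
struct line_entry {
  unsigned long addr;
  const char *file;
  int line;
};

/* Return the row covering pc: the last entry with addr <= pc, or NULL
   if pc precedes the whole table.  The table must be sorted by addr. */
static const struct line_entry *
lookup_pc(const struct line_entry *table, size_t n, unsigned long pc) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (table[mid].addr <= pc)
      lo = mid + 1;
    else
      hi = mid;
  }
  return lo > 0 ? &table[lo - 1] : NULL;
}
```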

But I was curious about how the other pieces of all of this work, like how pprof shuffles the stacks together, and so my second program is the boringly-named hp, which loads the heap profiling output emitted by tcmalloc and generates graphs. hp is written in Go, which makes concurrency so easy that it trivially uses multiple threads to load the heap profile and the symbols from the binary in parallel. It also has a "server" mode which brings up a web server, letting you adjust parameters (well, one parameter, but you get the idea) dynamically by clicking around.

You can play with a demo of hp's output online. (Caveats: it's a static snapshot so "rerender" doesn't work in the demo; it uses WebKit flexbox so it only works in WebKit. Use the middle mouse button to pan.) You could easily extend it so that clicking on a graph node focuses on a single call stack, but I think I've exhausted my curiosity at this point (and I no longer work on Chrome), so that's all you get!