Adventures in heap profiling
There's an obvious metaphor for heap profiling that I've yet to find a good opportunity to use, so here goes. Imagine you're investigating a company that is spending too much money. You have all of the credit card bills, but they only implicate the person who signed for a given purchase. Ideally you'd track down who authorized too many purchases. But then if you trace the management hierarchy all the way to its root, the CEO is ultimately responsible for all spending, and that's not a useful fact either.
In code, your credit card bills are the calls to malloc() and your CEO is the main() function that invokes all the calls. Heap profiling, then, can be broken into two pieces: first, collecting the chains of authority, and second, analyzing those chains to allocate (pun intended!) blame.
Let's start with the first one. Collecting a heap profile is as simple as recording the call stack at each memory allocation point, which is to say it's not exactly simple.
First you must hook allocation. Glibc provides hooks for malloc, and if you're using a custom memory allocator like tcmalloc you're already doing this as well, but there are also various pieces of your program, like the C++ runtime or glib, which may use their own allocation pools (for glib, for example, you can adjust G_SLICE to aid debugging).
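To make that concrete, here's a minimal sketch of the glibc hook mechanism. (Note that __malloc_hook is deprecated in modern glibc, where you'd interpose malloc via LD_PRELOAD instead, but the shape of the idea is the same.)

    #include <stdio.h>
    #include <malloc.h>

    static void *(*old_malloc_hook)(size_t, const void *);

    static void *my_malloc_hook(size_t size, const void *caller) {
      void *result;
      __malloc_hook = old_malloc_hook;   /* uninstall to avoid recursing */
      result = malloc(size);
      /* ...this is where you'd record the allocation and its stack... */
      fprintf(stderr, "malloc(%zu) = %p (called from %p)\n",
              size, result, caller);
      old_malloc_hook = __malloc_hook;
      __malloc_hook = my_malloc_hook;    /* reinstall ourselves */
      return result;
    }

    static void install_hooks(void) {
      old_malloc_hook = __malloc_hook;
      __malloc_hook = my_malloc_hook;
    }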
Once your code is getting called at the right places, you must find a way to get a stack trace at runtime. One approach is to walk the stack looking for return addresses. It appears gcc for x86-64 Linux provides an API for such a thing, but check out the gnarly code Google uses for x86; libunwind is a more recent implementation that likely works better. Another approach is to record extra information while the code is running: tcmalloc can use gcc's -finstrument-functions flag to hook every function call with a bit of code that simply maintains the call stack as an array on the side (a "shadow stack").
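Here's roughly what that shadow-stack half looks like; -finstrument-functions arranges for these two hooks to run on every function entry and exit, and the no_instrument_function attribute keeps the hooks from instrumenting themselves. This is a sketch, not tcmalloc's actual code:

    #define SHADOW_MAX 128

    static __thread void *shadow_stack[SHADOW_MAX];
    static __thread int shadow_depth;

    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *this_fn, void *call_site) {
      if (shadow_depth < SHADOW_MAX)
        shadow_stack[shadow_depth] = this_fn;   /* push the callee */
      shadow_depth++;
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *this_fn, void *call_site) {
      shadow_depth--;                           /* pop on return */
    }

An allocation hook can then just copy out the current shadow stack instead of unwinding anything.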
Now you've collected a bunch of allocation stack traces, and it's time to shuffle them together. Typically this information is presented as a "top N" list or a tree widget, but tcmalloc provides a cool tool called pprof that generates those as well as input to graphviz to generate pictorial graphs. You can see a snippet of such a graph for GFS, the Google File System, on the tcmalloc heap profiler page.
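The "shuffling" itself is conceptually simple: coalesce identical stacks and sum their bytes. A sketch, with made-up names and a linear scan where a real tool would use a hash table:

    #include <string.h>

    #define MAX_DEPTH  32
    #define MAX_STACKS 1024

    struct entry {
      void *stack[MAX_DEPTH];
      int depth;
      long bytes;
    };

    static struct entry table[MAX_STACKS];
    static int n_entries;

    static void record(void **stack, int depth, long bytes) {
      for (int i = 0; i < n_entries; i++) {
        if (table[i].depth == depth &&
            memcmp(table[i].stack, stack, depth * sizeof(void *)) == 0) {
          table[i].bytes += bytes;   /* seen this stack before: accumulate */
          return;
        }
      }
      if (n_entries == MAX_STACKS)
        return;                      /* table full; a real tool would grow it */
      struct entry *e = &table[n_entries++];
      memcpy(e->stack, stack, depth * sizeof(void *));
      e->depth = depth;
      e->bytes = bytes;
    }

Sorting the entries by bytes gives you the "top N" list; summing along shared stack prefixes gives you the tree or graph.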
However, to map the addresses seen in the recorded call stacks into human-readable function names, pprof relies upon helper tools from GNU binutils such as addr2line, which turns out to be agonizingly slow: for Chrome I clocked it at over 4.5 minutes to load a profile! For fun I recently dug into how this all works, and so I have written two programs to share with you.
The first I wrote is maddr, a halfway reimplementation of addr2line. To map an address to a line number in code you rely on debug information, which is in an awkward format described in the DWARF spec. The line-number info is a bytecode format that generates a large table mapping every offset within the binary to a source file and line number; maddr decodes that table into memory and does binary searches over it to quickly answer queries. In principle this could be extended slightly to support the same formats addr2line uses, and then plugged directly into pprof.
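The query side of that is just a binary search for the last table entry at or below the address you're asking about. A sketch, assuming the DWARF bytecode has already been decoded into a sorted array (the names here are made up for illustration):

    #include <stddef.h>

    struct line_entry {
      unsigned long address;   /* first binary offset this entry covers */
      const char *file;
      int line;
    };

    /* Return the entry covering addr: the last one with address <= addr. */
    static const struct line_entry *lookup(const struct line_entry *table,
                                           size_t n, unsigned long addr) {
      size_t lo = 0, hi = n;
      while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].address <= addr)
          lo = mid + 1;
        else
          hi = mid;
      }
      return lo == 0 ? NULL : &table[lo - 1];
    }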
But I was curious about how the other pieces of all of this work, like how pprof shuffles the stacks together, and so my second program is the boringly-named hp, which loads the heap profiling output emitted by tcmalloc and generates graphs. hp is written in Go, which makes concurrency so easy that it trivially uses multiple threads to load a heap profile and the symbols from a binary in parallel. It then has a "server" mode which brings up a web server, letting you adjust parameters (well, one parameter, but you get the idea) dynamically by clicking around.
You can play with a demo of hp's output online. (Caveats: it's a static snapshot, so "rerender" doesn't work in the demo; it uses WebKit flexbox, so it only works in WebKit. Use the middle mouse button to pan.) You could easily extend it to let you click on a graph node and focus on a single call stack, but I think I've exhausted my curiosity at this point (and I no longer work on Chrome), so that's all you get!