Perf work

April 01, 2009

Two unrelated changes, right in a row:

I'm no expert on this, but here are some basic attempts at an explanation.

Note: this post has been significantly revised since it first went up, after Dean told me how wrong the original was. :~(

An executable can refer to data at fixed addresses because it has the process's virtual address space to itself: the linker knows where everything will land. Shared libraries are trickier: if the generated library code contains a reference to address X, it's requiring that the library be mapped over address X in every process that loads it. If two shared libraries were both compiled to use the same address, a process in principle couldn't use both.

You have two options for making the library code relocatable. One is -fPIC, which makes address references go through a table addressed relative to a dedicated register; on entry to the library's code, that register gets pointed at the library's table, wherever the library ended up loaded. This is what people do on Linux. You burn a register, which is especially painful on x86; I heard WebKit's JS engine lost 12% in benchmarks after compiling with -fPIC. (This is the reason I hope distro packagers will not package V8 as a shared library that Chromium links against. We ought to write up notes on this stuff somewhere.)
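
To make the register cost concrete, here's a minimal sketch (nothing to do with Chromium's actual code) you can compile both ways and disassemble; the file name and the gcc invocations in the comment are just for illustration.

    // counter.cc -- a toy shared-library source, just to compare code generation.
    // On 32-bit x86, try building it both ways and diffing the disassembly:
    //
    //   g++ -m32 -O2 -shared -o libplain.so counter.cc        (absolute addresses)
    //   g++ -m32 -O2 -shared -fPIC -o libpic.so counter.cc    (position independent)
    //   objdump -d libplain.so libpic.so
    //
    // In the -fPIC build, the address of `counter` is computed relative to a
    // register holding the library's global offset table (%ebx on x86), so that
    // register is effectively unavailable to the rest of the code.

    static int counter = 0;   // library-private data the code has to locate somehow

    extern "C" int bump() {
      // Non-PIC: the increment uses the absolute address of `counter`.
      // PIC:     the address is formed relative to %ebx, wherever we were loaded.
      return ++counter;
    }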

The other option is to compile the library against some particular base address, but also include a table of the offsets of every absolute address embedded in the library. At load time, if the base address you picked is already in use, the loader picks a free address and rewrites all of those spots to use the new one. This is what is done on Windows. This approach is very fast when there's no conflict, but when the loader does have to rewrite, it's not only slow, you also end up with non-sharable pages (the mmapped library is copy-on-write, so as soon as you fix up an address you've got a private copy of that page).
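
Here's a grossly simplified model of that fixup pass (the real Windows loader walks the PE .reloc section; the structs here are invented for illustration):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A stand-in for a mapped DLL image: its link-time base plus the offsets of
    // every absolute address embedded in the image (the relocation table).
    struct Image {
      uintptr_t preferred_base;
      std::vector<size_t> fixup_offsets;
      std::vector<uint8_t> bytes;
    };

    // What "rebasing" amounts to: if we couldn't load at the preferred base,
    // patch every recorded spot by the displacement. Each page we touch stops
    // being shareable, because the mapping is copy-on-write.
    void Rebase(Image* image, uintptr_t actual_base) {
      if (actual_base == image->preferred_base)
        return;  // the fast path: nothing rewritten, pages stay shared
      const intptr_t delta =
          static_cast<intptr_t>(actual_base - image->preferred_base);
      for (size_t offset : image->fixup_offsets) {
        uintptr_t* slot = reinterpret_cast<uintptr_t*>(&image->bytes[offset]);
        *slot += delta;  // this write makes the containing page private
      }
    }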

So how do we pick the DLL base address? Google's engineering lifeblood is to be data-driven: we picked an address (at random, as far as I can tell) and instrumented the browser to report back which address we actually ended up loaded at (as always, only from users who have opted in to reporting aggregate statistics). Dean's one-liner change was to switch to the best address we saw. Apparently the address we had previously chosen was already in use on our build bots, because all the graphs spiked in a good way: the startup win comes from not needing to rewrite the DLL, and the memory win comes from the code being shared across all instances of our binary (browser and renderer processes, etc.).
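
The "use the best address we saw" part might look something like this toy sketch (not the actual change; the reporting format and function name are invented for illustration):

    #include <cstdint>
    #include <map>
    #include <vector>

    // Given the DLL load addresses reported back from the field, pick the one
    // the loader most often found free and make it the new preferred base
    // (e.g. via the linker's /BASE:<address> option on Windows).
    uint32_t PickPreferredBase(const std::vector<uint32_t>& reported_bases) {
      std::map<uint32_t, int> counts;
      for (uint32_t base : reported_bases)
        ++counts[base];

      uint32_t best_base = 0;
      int best_count = 0;
      for (const auto& entry : counts) {
        if (entry.second > best_count) {
          best_base = entry.first;
          best_count = entry.second;
        }
      }
      return best_base;
    }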

(I imagine in an x86_64 world this struggle mostly goes away: you could do the Windows approach with a random base address chosen at compile time and rarely hit a conflict, since the address space is so huge. At the same time, you have extra registers, so maybe -fPIC isn't so bad either.)


The tcmalloc change is somewhat simpler: different memory allocators have different performance characteristics, so you'd expect swapping them to have an effect. I think there's a lot of room for experimenting in this area. Chromium tends not to have many threads racing within a given process, and it shuts down renderer processes frequently, so maybe an allocator that gives up some multithreaded scalability and tolerates more fragmentation in exchange for better speed or memory behavior would benefit us. But again, I don't know much about this stuff.
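
If you wanted to poke at this yourself, a toy micro-benchmark of that workload shape might look like the following (this is not how the tcmalloc change was evaluated; it's just a sketch you could link or LD_PRELOAD against different allocators, e.g. gperftools' libtcmalloc):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // A mostly single-threaded process making lots of small allocations and then
    // exiting -- roughly the renderer-ish pattern described above. Build it plain,
    // then again with -ltcmalloc (or run it under LD_PRELOAD=libtcmalloc.so) and
    // compare times and peak memory.
    int main() {
      const int kIterations = 1 << 20;
      std::vector<void*> live;
      live.reserve(kIterations);

      auto start = std::chrono::steady_clock::now();
      for (int i = 0; i < kIterations; ++i) {
        live.push_back(operator new(16 + (i % 8) * 16));  // small, mixed sizes
        if (i % 4 == 3) {              // free some right away, keep the rest live
          operator delete(live.back());
          live.pop_back();
        }
      }
      auto end = std::chrono::steady_clock::now();

      long long us = std::chrono::duration_cast<std::chrono::microseconds>(
                         end - start).count();
      std::printf("%d allocations in %lld us\n", kIterations, us);

      // Deliberately let process exit reclaim the rest: frequent process teardown
      // is part of why a "lazier" allocator might be an acceptable trade here.
      return 0;
    }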