Start with a simple question: "how much memory does my app use?" It turns out the answer is endlessly complex. Until recently I knew relatively little about what follows in this post but I learned a lot through patient explanations from colleagues; hopefully this post will save someone else who was in my position some time.
Like much of computer science, memory works as layers of abstractions.
An application interfaces with malloc and
new, those functions
manage pages of virtual memory using syscalls, and those pages are
(sometimes) backed by real memory by the operating system.
Within a process all you see is virtual memory, but outside of a
process all that matters to your system is physical memory. Because
these are different views on the same thing, you'll often see overlap
between the terms and tools used to evaluate them. For example, the
first memory column of top shows the total virtual space used for a
process. But what you normally care about is physical memory: space on the
chip, the resource that, when it runs out, makes the system start swapping. That
column of top is nearly useless for the sorts of questions you run
top to answer ("where are my resources going?").
To keep these ideas distinct let's talk about two different resources. Your operating system manages memory: the space in the chips. A process has address space: addresses that a pointer can read from and store data at. Those terms can refer to the same thing, but frequently they don't. Virtual address space can be backed by memory but also by something else, such as a file.
From a process's perspective, it runs alone in a sea of available
address space. Start with
cat /proc/self/maps, a dump of a map of
the virtual memory of the
cat process (fields elided below for
clarity; more docs are in
man 5 proc):
    00400000-0040d000 r-xp /bin/cat
    0060d000-0060e000 r--p /bin/cat
    0060e000-0060f000 rw-p /bin/cat
    0201a000-0203b000 rw-p [heap]
    7f9b80fe9000-7f9b81163000 r-xp /lib/libc-2.11.1.so
The leftmost pairs of numbers indicate spans of address space, and the
fields to the right are attributes of that span. You'll see that the
cat binary appears at a low address and libraries it uses are mapped
in at high addresses. The rightmost column shows that some of these
ranges of the address space are backed by files: when accessing the
data at 0x400000, the kernel can just get that data from
/bin/cat and not use real memory (the stuff on the chips, remember)
except as a cache.
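To make the layout of a maps line concrete, here is a minimal Python sketch that splits one line into its fields. It assumes the full format documented in man 5 proc (address range, permissions, offset, device, inode, optional path) rather than the elided form shown above; the names Mapping, parse_maps_line, and is_file_backed are mine, not part of any library.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: str  # empty for anonymous mappings


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/<pid>/maps (format per `man 5 proc`)."""
    # Fields: address perms offset dev inode [pathname]
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], int(fields[2], 16), path)


def is_file_backed(m: Mapping) -> bool:
    # Real files show up as absolute paths; names like [heap] or [stack]
    # are kernel labels for regions that are not backed by a file.
    return m.path.startswith("/")


sample = "02401000-02422000 rw-p 00000000 00:00 0 [heap]"
m = parse_maps_line(sample)
print(hex(m.start), hex(m.end), m.perms, (m.end - m.start) // 1024, "kB", m.path)
# 0x2401000 0x2422000 rw-p 132 kB [heap]
```

On a Linux machine you could feed it every line of open("/proc/self/maps") and filter with is_file_backed to see which spans the kernel can refill from disk.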
The permission flags on the ranges reflect the different parts of the
binary: r-x is code, r-- is read-only data, and rw- is read-write
data. You can dig into this further by looking at the section headers of
/bin/cat (for example, with objdump -h /bin/cat) and noting the "VMA"
addresses listed for various sections.
An aside: sometimes you'll see spans marked
---, which means completely inaccessible. I asked one of the aforementioned colleagues about it and he suggested it could be guard pages, but that some regions were too large. Curious, he dug into the source and reported back: "the [shared object] is getting loaded but it specifies non-contiguous sections in memory. Since dl doesn't want to try and get the kernel to allocate exactly (because the kernel can always ignore the hint), it allocates a single, big block anywhere and PROT_NONEs out the bits that it doesn't need."
PS: If you spend a lot of time looking at these maps files, you
should try out the pmap tool, which is part of procps:
sudo apt-get install procps.
The region marked
[heap] is where this process happens to be
allocating pointers (in the
malloc() sense) from, but it could just
as well be getting pages at other virtual addresses. On a default
kernel setup the kernel will overcommit: repeatedly getting more
pages (repeatedly calling
malloc) just causes you to map in more
pages within your address space, and you can keep requesting memory,
more than the physical memory of your system, and
malloc will more
or less keep returning pointers until you run out of address space.
From a process's perspective allocatable address space is unrelated
to available system resources.
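Here is a rough way to watch this from Python. It is a sketch under some assumptions: Python can't call malloc directly, so an anonymous mmap stands in for it (this is the same kind of mapping an allocator requests from the kernel), and RSS is read from the second field of /proc/self/statm, so the measurement only works on Linux. The function names are mine.

```python
import mmap
import os


def resident_bytes():
    """Resident set size of this process, or None where /proc is unavailable."""
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])  # second field: resident pages
    except OSError:
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def reserve_then_touch(reserve_mb=128, touch_mb=16):
    """Map anonymous memory, then write to part of it; return RSS growth in bytes."""
    before = resident_bytes()
    if before is None:
        return None
    region = mmap.mmap(-1, reserve_mb << 20)  # address space only, nothing touched yet
    after_map = resident_bytes()
    for off in range(0, touch_mb << 20, mmap.PAGESIZE):
        region[off] = 1  # the first write to each page makes the kernel back it
    after_touch = resident_bytes()
    region.close()
    return after_map - before, after_touch - after_map


growth = reserve_then_touch()
if growth is not None:
    print("RSS growth after mapping: %d kB, after touching: %d kB"
          % (growth[0] // 1024, growth[1] // 1024))
```

On the systems I'd expect this to run on, the first number stays near zero while the second is close to the amount touched; the exact values depend on the kernel and on whatever else the interpreter happens to allocate in between.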
The only time you really care about how much virtual memory you're
using is when a process is near its address limit. For example, the
Chromium 32-bit buildbots have been failing recently because the
linker couldn't find 900 MB of consecutive address space for the
output binary. (In this case, the buildbots have plenty of physical
memory; our 32-bit buildbots have the modern problem of more physical
memory than virtual address space, which means processes can "run out
of memory" when there is plenty left.)
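The failure mode is about contiguity, not totals. As a small illustration, the sketch below finds the largest unmapped run in an address space given the ranges already in use. The layout is entirely hypothetical (three 64 MB mappings scattered through a 3 GB user address space, which I'm assuming as a typical 32-bit Linux split); it is not taken from the buildbots.

```python
MB = 1 << 20


def largest_free_gap(mappings, limit):
    """Largest unmapped run in [0, limit), given (start, end) ranges in use."""
    best = 0
    cursor = 0  # end of the highest mapping seen so far
    for start, end in sorted(mappings):
        if start > cursor:
            best = max(best, start - cursor)
        cursor = max(cursor, end)
    return max(best, limit - cursor)


# Hypothetical layout: three 64 MB mappings in a 3 GB user address space.
limit = 3072 * MB
in_use = [(700 * MB, 764 * MB), (1500 * MB, 1564 * MB), (2300 * MB, 2364 * MB)]

free = limit - sum(end - start for start, end in in_use)
print("free:", free // MB, "MB; largest contiguous:",
      largest_free_gap(in_use, limit) // MB, "MB")
# free: 2880 MB; largest contiguous: 736 MB
```

Nearly all of the address space is free in this made-up layout, yet a request for 900 MB of consecutive addresses would still fail.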
Your virtual addresses (that aren't backed by files) become backed by
physical memory when your process touches the addresses. Take a look
at cat /proc/self/smaps, the same information as maps with some
extra detail:
    02401000-02422000 rw-p 00000000 00:00 0 [heap]
    Size: 132 kB
    Rss: 52 kB
    Pss: 52 kB
Size is the size of the virtual region.
Rss, resident set size,
is the subset of this that is backed by real memory. (Let's ignore
swap, yet another factor complicating these numbers.) So this is
saying that while the heap is 132 kB of virtual address space, it's
only been backed by 52 kB of real memory. And here our question about
application memory use could more or less be answered -- sum the RSS
for all mappings -- except for another twist: pages can be shared.
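Summing the fields across mappings is easy to script. The following is a minimal sketch that totals every kB-valued field in smaps-formatted text; smaps_totals is a name I made up, and the sample is the heap entry from above.

```python
import re

# Matches lines such as "Rss:   52 kB"; mapping headers and flag lines don't match.
FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def smaps_totals(text):
    """Sum every kB-valued field across all mappings in smaps-formatted text."""
    totals = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            name, kb = match.group(1), int(match.group(2))
            totals[name] = totals.get(name, 0) + kb
    return totals


sample = """\
02401000-02422000 rw-p 00000000 00:00 0 [heap]
Size:                132 kB
Rss:                  52 kB
Pss:                  52 kB
"""
print(smaps_totals(sample))  # {'Size': 132, 'Rss': 52, 'Pss': 52}
```

On Linux, smaps_totals(open("/proc/self/smaps").read())["Rss"] gives the whole-process figure for the running interpreter.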
If ten processes are all running with the same 3 MB
libc library, the
kernel only needs to load that library into memory once and just map
it into each process's address space. (That is, after all, one of the
primary purported benefits of shared libraries.) Writable pages (for
example, global variables within
libc) can be marked copy-on-write:
as each process writes to those virtual addresses, the kernel will
implicitly re-back it with its own memory private to the process. In
fact, whenever a process forks, all of its private memory is marked
copy-on-write in the subprocess as well, a trick we make use of below.
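The visible half of copy-on-write can be demonstrated with a few lines of Python: after a fork, a write in the child lands in the child's own copy and the parent never sees it. This is only a sketch of the semantics, not of the page accounting (an interpreter dirties plenty of pages on its own), and it assumes a Unix system where os.fork exists. The function name is mine.

```python
import os


def fork_and_modify(data):
    """Fork, let the child overwrite its view of `data`, return both views."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: starts out sharing the parent's pages copy-on-write
        os.close(read_fd)
        data[0] = ord("C")  # this write goes to a private copy in the child
        os.write(write_fd, bytes(data))
        os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        child_view = pipe.read()  # what the child saw after its write
    os.waitpid(pid, 0)
    return bytes(data), child_view


parent_view, child_view = fork_and_modify(bytearray(b"parent"))
print(parent_view, child_view)  # b'parent' b'Carent'
```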
Shared memory makes our application accounting question a tricky one.
Clearly it doesn't make sense to charge a process for its shared
memory in isolation -- in the above example libc is using at most 3 MB
of memory, not 30 MB. A more useful metric, "proportional set size"
(Pss seen above in the
smaps dump), divides the RSS by the
number of processes sharing that memory. Each process is charged an
equal fraction of the total amount used.
PSS is the closest we've come to a fair memory metric, but it has some counterintuitive properties: for example, if you kill one process that is sharing memory with others, its shared memory doesn't become available to the system, so other processes that were sharing that memory have their PSS go up a little. (That hypothetical scenario points at perhaps another potentially useful memory metric: "if I kill this process, how much memory is freed up?")
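The arithmetic is simple enough to work through. This sketch uses the ten-process, 3 MB libc example from above; the 1000 kB of private memory per process is a number I made up for illustration, and pss_kb is a hypothetical helper, not a kernel interface.

```python
def pss_kb(private_kb, shared_kb, sharers):
    """Proportional set size: private memory plus an equal share of shared memory."""
    return private_kb + shared_kb / sharers


LIBC_KB = 3 * 1024   # the 3 MB shared library from the example
PRIVATE_KB = 1000    # hypothetical private memory per process

before = pss_kb(PRIVATE_KB, LIBC_KB, 10)  # ten processes share libc
after = pss_kb(PRIVATE_KB, LIBC_KB, 9)    # one of them has been killed

print("RSS per process:", PRIVATE_KB + LIBC_KB, "kB")                      # 4072 kB
print("PSS per process: %.1f kB before, %.1f kB after" % (before, after))  # 1307.2, 1341.3
print("Freed by the kill:", PRIVATE_KB, "kB")                              # 1000 kB
```

Summing PSS over the ten processes gives 13072 kB, which is exactly the private memory plus one copy of libc, whereas summing RSS would count libc ten times. Killing one process frees only its private 1000 kB, and each survivor's PSS rises by about 34 kB.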
Chrome is built as a suite of cooperating processes. If you boot up your browser and display one tab containing Flash, you'll see somewhere around five processes communicating with one another. To keep memory usage low it is critical these five processes share as much memory as possible.
Here are some ways processes can end up sharing pages:
- Explicitly, by using shared memory APIs;
- Via the loader, by using the same binary or shared libraries;
- By mmaping the same file;
- By forking (all private pages are marked copy on write in the child).
In Chrome's case, we make use of all of those.
- (We use shared memory in our IPC system, but that is for performance, not memory savings.)
- All Chrome processes are from the same binary, so all code pages are shared between them.
- Chrome stores all of its resources (e.g. translatable messages, the "no such picture" icon shown when an <img> tag references a bad URL) in a single file which it mmaps. This way, despite running many processes, only one copy of those resources ever occupies memory. (Cat /proc/<pid>/maps for a Chrome process and look for .pak. Note that you might need to be root to do this for sandboxed Chrome processes.)
- For reasons that are too complicated to describe here, shortly after startup Chrome forks a "zygote" process, which waits for commands from the main Chrome process and itself forks again when it's time to spawn a new subprocess. Though the zygote exists for other reasons, in theory one consequence of this design is that any startup-time initialization that happens before the zygote forks will end up in copy-on-write pages in subprocesses. As long as that memory (for example, the parsed representation of the command line) isn't modified later it will be shared.
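The mmap case from the list above is the easiest one to poke at. In this sketch a temporary file is mapped twice, and a write through one mapping shows up in the other, which is what you'd expect if both are views onto the same pages. Chrome's resource mappings would be read-only; the write here exists only to make the sharing observable, and two_views_share_pages is a name I invented.

```python
import mmap
import os
import tempfile


def two_views_share_pages(payload=b"resources"):
    """Map one file twice and return what the second mapping sees after a write."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        writer = mmap.mmap(fd, len(payload), access=mmap.ACCESS_WRITE)
        reader = mmap.mmap(fd, len(payload), access=mmap.ACCESS_READ)
        writer[0:1] = b"R"       # modify through the first mapping
        seen = bytes(reader[:])  # the second mapping observes the change
        writer.close()
        reader.close()
        return seen
    finally:
        os.close(fd)
        os.unlink(path)


print(two_views_share_pages())  # b'Resources'
```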
In all of the above discussion I haven't even touched on what
malloc() does. In some sense for memory usage purposes it's not
important; its consequences can be seen in the above numbers. But
it's at least interesting to consider how it fits into the above
picture.
Without making this already overlong post longer, know that malloc
gets pages of virtual address space from the system and then manages
pieces within those regions. In fact, in Chrome you can visit
about:tcmalloc (and then reload the page once) to get a dump of some
of the sorts of information the tcmalloc authors thought were of
interest.
Memory allocators (like libc's malloc or tcmalloc) are complicated and interesting but the important fact for this post is that they have some bookkeeping overhead which itself costs memory. A memory profiler, a critical tool for application memory analysis, will tell you how much memory your application is requesting, but usually it doesn't include allocator overhead or how much of the memory your application requests ends up being actually used (in the RSS sense).
Ultimately memory allocators are just managing the above primitives.
A profiler can help provide more insight into which application-level
logic is causing memory allocations, but it's not the right layer to
describe where memory is going. For the best picture I recommend
smem, a tool that was written by the same person who added the
hooks to the kernel to make PSS computation possible:
sudo apt-get install smem.