<feed xmlns="http://www.w3.org/2005/Atom"><title>Tech Notes</title><id>tag:neugierig.org,2010:tech-notes</id><link href="http://neugierig.org/software/blog/" /><link href="http://neugierig.org/software/blog/atom.xml" rel="self" /><updated>2012-03-06T17:06:00Z</updated><author><name>Evan Martin</name><email>martine@danga.com</email></author><entry><id>tag:neugierig.org,2010:tech-notes/2012-03-06/heap-profiling</id><updated>2012-03-06T17:06:00Z</updated><title>Adventures in heap profiling</title><link href="http://neugierig.org/software/blog/2012/03/heap-profiling.html" /><content type="html">&lt;p&gt;There's an obvious metaphor for heap profiling that I've yet to find a
good opportunity to use, so here goes.  Imagine you're investigating a
company that is spending too much money.  You have all of the credit
card bills, but they only implicate the person who signed for a given
purchase.  Ideally you'd track down who authorized too many purchases.
But then if you trace the management hierarchy all the way to its
root, the CEO is ultimately responsible for all spending, and that's
not a useful fact either.&lt;/p&gt;
&lt;p&gt;In code, your credit card bills are the calls to &lt;code&gt;malloc()&lt;/code&gt; and your
CEO is the &lt;code&gt;main()&lt;/code&gt; function that invokes all the calls.  Heap
profiling, then, can be broken into two pieces: first, collecting the
chains of authority, and second, analyzing those chains to allocate
(pun intended!) blame.&lt;/p&gt;
&lt;p&gt;Let's start with the first one.  Collecting a heap profile is as
simple as recording the call stack at each memory allocation point,
which is to say it's not exactly simple.&lt;/p&gt;
&lt;p&gt;First you must hook allocation.  &lt;a href="http://www.gnu.org/software/libc/manual/html_node/Hooks-for-Malloc.html"&gt;Glibc provides hooks for
&lt;code&gt;malloc&lt;/code&gt;&lt;/a&gt;, and if you're using a custom memory allocator like
tcmalloc you're already doing this as well, but there are also various
pieces of your program, like C++ or glib, which may use their own
allocation pools (for glib, for example, &lt;a href="http://developer.gnome.org/glib/2.30/glib-running.html"&gt;you can adjust
&lt;code&gt;G_SLICE&lt;/code&gt;&lt;/a&gt; to aid debugging).&lt;/p&gt;
&lt;p&gt;Once your code is getting called at the right places, you must find a
way to get a stack trace at runtime.  One approach is to look at the
stack for return addresses.  It appears gcc for x86-64 Linux provides
an API for such a thing, but check out &lt;a href="http://code.google.com/p/google-glog/source/browse/trunk/src/stacktrace_x86-inl.h"&gt;the gnarly code Google uses for
x86&lt;/a&gt;; &lt;a href="http://www.nongnu.org/libunwind/"&gt;libunwind&lt;/a&gt; is a more recent implementation that likely
works better.  Another approach is to record extra information while
the code is running; tcmalloc can use gcc's &lt;code&gt;-finstrument-functions&lt;/code&gt;
flag to hook &lt;em&gt;every&lt;/em&gt; function call with a bit of code that simply
maintains the call stack as an array on the side (a "shadow stack").&lt;/p&gt;
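&lt;p&gt;To make the shadow stack idea concrete, here is a minimal sketch of the
hooks gcc calls when code is built with &lt;code&gt;-finstrument-functions&lt;/code&gt;.
This is not tcmalloc's code: the hook names &lt;code&gt;__cyg_profile_func_enter&lt;/code&gt;
and &lt;code&gt;__cyg_profile_func_exit&lt;/code&gt; are the ones gcc documents, but
&lt;code&gt;MAX_DEPTH&lt;/code&gt;, the per-thread array, and &lt;code&gt;shadow_stack_copy()&lt;/code&gt;
are hypothetical names chosen for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;stddef.h&amp;gt;

#define MAX_DEPTH 256  /* hypothetical limit on recorded frames */

/* Hypothetical shadow stack: one array of function addresses per thread. */
static __thread void *shadow_stack[MAX_DEPTH];
static __thread size_t shadow_depth;

/* gcc calls this on entry to every function compiled with
   -finstrument-functions.  The hooks themselves must not be
   instrumented, or they would recurse forever. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
  (void)call_site;
  if (shadow_depth &amp;lt; MAX_DEPTH)
    shadow_stack[shadow_depth] = this_fn;
  shadow_depth++;  /* keep counting past MAX_DEPTH so exits stay balanced */
}

/* gcc calls this just before every instrumented function returns. */
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
  (void)this_fn;
  (void)call_site;
  if (shadow_depth &amp;gt; 0)
    shadow_depth--;
}

/* Copy up to max recorded frames into out, most recent first, and return
   the count.  An allocation hook could call this instead of walking the
   real stack. */
__attribute__((no_instrument_function))
size_t shadow_stack_copy(void **out, size_t max) {
  size_t depth = shadow_depth &amp;lt; MAX_DEPTH ? shadow_depth : MAX_DEPTH;
  size_t n = depth &amp;lt; max ? depth : max;
  for (size_t i = 0; i &amp;lt; n; i++)
    out[i] = shadow_stack[depth - 1 - i];
  return n;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The trade-off is that every call pays for the bookkeeping, in exchange
for a stack trace that is just an array copy at allocation time.&lt;/p&gt;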
&lt;p&gt;Now you've collected a bunch of allocation stack traces, and it's time
to shuffle them together.  Typically this information is presented as
a "top N" list or a tree widget, but tcmalloc provides a cool tool
called &lt;code&gt;pprof&lt;/code&gt; that generates those as well as input to &lt;code&gt;graphviz&lt;/code&gt; to
generate pictorial graphs.  You can see a snippet of such a graph for
GFS, the Google File System, on &lt;a href="http://goog-perftools.sourceforge.net/doc/heap_profiler.html"&gt;the tcmalloc heap profiler
page&lt;/a&gt;.&lt;/p&gt;
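&lt;p&gt;The "shuffling together" step can be sketched as follows.  This is one
plausible way to assign blame, not necessarily what &lt;code&gt;pprof&lt;/code&gt; does
internally: each sampled stack charges its bytes to the allocating function
("self") and to every function on the stack ("total").  The type and
function names are hypothetical, and functions are identified by small
integer ids to keep the example short.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;stddef.h&amp;gt;

#define MAX_FUNCS 64  /* function ids must be in [0, MAX_FUNCS) */

/* One sampled allocation: stack[0] is the function that called malloc(),
   stack[depth-1] is main(). */
typedef struct {
  const int *stack;
  size_t depth;
  size_t bytes;
} Sample;

typedef struct {
  size_t self[MAX_FUNCS];   /* bytes allocated directly by the function */
  size_t total[MAX_FUNCS];  /* bytes allocated by it or anything it called */
} Blame;

/* Fold one sample into the running totals. */
void add_sample(Blame *blame, const Sample *s) {
  if (s-&amp;gt;depth == 0)
    return;
  blame-&amp;gt;self[s-&amp;gt;stack[0]] += s-&amp;gt;bytes;
  for (size_t i = 0; i &amp;lt; s-&amp;gt;depth; i++) {
    /* Charge a recursive function only once per sample. */
    int seen = 0;
    for (size_t j = 0; j &amp;lt; i; j++)
      if (s-&amp;gt;stack[j] == s-&amp;gt;stack[i])
        seen = 1;
    if (!seen)
      blame-&amp;gt;total[s-&amp;gt;stack[i]] += s-&amp;gt;bytes;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the terms of the metaphor, &lt;code&gt;self&lt;/code&gt; is the credit card bill
and &lt;code&gt;total&lt;/code&gt; for &lt;code&gt;main()&lt;/code&gt; is the CEO's share: always
everything.  The interesting functions tend to sit in between.&lt;/p&gt;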
&lt;p&gt;However, to map the addresses seen in the recorded call stacks into
human-readable function names, &lt;code&gt;pprof&lt;/code&gt; relies upon helper tools from
GNU binutils such as &lt;code&gt;addr2line&lt;/code&gt;, which turns out to be agonizingly
slow -- for Chrome I clocked it as taking over 4.5 minutes to load a
profile!  For fun I recently dug into how this all works, and so I
have written two programs to share with you.&lt;/p&gt;
&lt;p&gt;The first I wrote is &lt;a href="https://github.com/martine/maddr"&gt;maddr&lt;/a&gt;, a halfway reimplementation of
&lt;code&gt;addr2line&lt;/code&gt;.  To map an address to a line number in code you rely on
debug information, which is in an awkward format described in the
DWARF spec.  The line number info is a bytecode format that generates
a large table mapping every offset within the binary to a source file
and line number; &lt;code&gt;maddr&lt;/code&gt; decodes that table into memory and does
binary searches over it to quickly answer queries.  In principle this
could be extended slightly to support the same formats &lt;code&gt;addr2line&lt;/code&gt;
uses, and then plugged directly into &lt;code&gt;pprof&lt;/code&gt;.&lt;/p&gt;
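&lt;p&gt;The lookup half of that is simple enough to sketch.  The code below is
not taken from &lt;code&gt;maddr&lt;/code&gt;; it assumes the DWARF line program has
already been decoded into an array sorted by address, and the
&lt;code&gt;LineEntry&lt;/code&gt; type and &lt;code&gt;lookup_line()&lt;/code&gt; function are
hypothetical names.  It also ignores details a real decoder has to handle,
such as the markers DWARF uses to end a run of addresses.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

/* One decoded row of the line table: code starting at addr (up to the
   next row's addr) came from file:line. */
typedef struct {
  uint64_t addr;
  const char *file;
  int line;
} LineEntry;

/* Return the last row whose addr is at or below target, or NULL if target
   precedes the whole table.  rows must be sorted by ascending addr. */
const LineEntry *lookup_line(const LineEntry *rows, size_t n,
                             uint64_t target) {
  size_t lo = 0, hi = n;  /* search the half-open range [lo, hi) */
  while (lo &amp;lt; hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (rows[mid].addr &amp;lt;= target)
      lo = mid + 1;
    else
      hi = mid;
  }
  /* lo is now the index of the first row with addr above target. */
  return lo == 0 ? NULL : &amp;amp;rows[lo - 1];
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each query is then a handful of comparisons over a table that only has
to be decoded once.&lt;/p&gt;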
&lt;p&gt;But I was curious about how the other pieces of all of this work, like
how &lt;code&gt;pprof&lt;/code&gt; shuffles the stacks together, and so my second program is
the boringly-named &lt;a href="https://github.com/martine/hp"&gt;hp&lt;/a&gt;, which loads the heap profiling output as
emitted by tcmalloc and generates graphs.  &lt;code&gt;hp&lt;/code&gt; is written in Go which
makes concurrency so easy that it trivially uses multiple threads to
load a heap profile and the symbols from a binary in parallel.  It
then has a "server" mode which brings up a web server, letting you
adjust parameters (well, one parameter, but you get the idea)
dynamically by clicking around.&lt;/p&gt;
&lt;p&gt;You can play with &lt;a href="http://martine.github.com/hp/demo.html"&gt;a demo of hp's output&lt;/a&gt; online.  (Caveats:
it's a static snapshot so "rerender" doesn't work in the demo; it uses
WebKit flexbox so it only works in WebKit.  Use the middle mouse
button to pan.)  You could easily extend this so that you could click
on the graph nodes to focus in on a single call stack, but I think
I've exhausted my curiosity at this point (and I &lt;a href="http://neugierig.org/software/chromium/notes/2012/02/the-end.html"&gt;no longer work on
Chrome&lt;/a&gt;) so that's all you get!&lt;/p&gt;</content></entry><entry><id>tag:neugierig.org,2010:tech-notes/2012-01-16/sabbatical</id><updated>2012-01-16T19:44:00Z</updated><title>Sabbatical</title><link href="http://neugierig.org/software/blog/2012/01/sabbatical.html" /><content type="html">&lt;p&gt;In late September I took a three-month sabbatical from Google.  (This
post was written both during and after the sabbatical; sorry for the
mixed tense in some of the wording.)&lt;/p&gt;
&lt;p&gt;I told my managers: in the old days my officemates would make fun of
me for being excessively optimistic and happy but more recently I had
grown a reputation at work for being grumpy.  I wanted to figure out
whether it was my work or something else in my life that had caused
this change.  And thinking back, even through college I always took
classes through the summer; I think my last real break from work of
some sort was maybe my first year of high school, before I started
working in the summers.&lt;/p&gt;
&lt;p&gt;I've read about experiments where people stay in a place without the
sun or a clock to guide their sleeping behavior, and how they fall
into a 25-hour schedule of sleeping and waking.  I was curious how my
life would similarly arrange itself without constraints.&lt;/p&gt;
&lt;p&gt;So I made nearly no plans for my new free time.  My wife works and I
didn't want to leave her behind on some trip around the world.
Instead, I had the vague idea to relax, work on projects, and catch up
on video games.  (I went cold turkey on games early on in college in
an attempt to focus; in retrospect, putting Linux on my primary
computer to help enforce that was likely a valuable career decision.)&lt;/p&gt;
&lt;p&gt;Here's what I found.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Work vs play.&lt;/strong&gt; Even when I was working, I spent a lot of my free
time hacking on software.  It's something I enjoy; it's something I'd
do even if it weren't something I could be employed for, so it's great
people are willing to pay me to do it too.  Without a job structuring
my hacking time, I found that I fell into a pattern of hacking from
morning to mid-day, then I goofed off for the brain downtime that
frequently accompanies lunchtime through 3pm or so, and often returned
to hacking in the evening.&lt;/p&gt;
&lt;p&gt;When I was working I think I got my most productive time in after
lunch.  I think the reason mornings worked for me now is that there is
nothing extraneous between me and my goals.  At work there were always
side tasks like email and keeping up with daily churn to occupy my
mornings.  With my new perspective, I wonder how much of that was
self-inflicted; I'm considering strategies like "don't open mail until
after noon" when I return to work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Video games.&lt;/strong&gt; The parallels to addiction are obvious to me,
especially the modern horrible stuff like Zynga products.  I think the
way they hook is interesting to reflect on for a moment.&lt;/p&gt;
&lt;p&gt;The joy of a good video game is the feeling of achievement you get.
That is, I think, why they often hook the same personality type as a
programmer's: making a WoW character and writing code have the same
reward feedback loop.&lt;/p&gt;
&lt;p&gt;But since a game is synthetic, the achievement is also ultimately
synthetic ("I pushed the buttons at the right times so that the reward
light turned on").  The cynical craft of modern video games is
tricking the gamer into &lt;em&gt;feeling&lt;/em&gt; accomplishment without actually
making them work for it.  I read a review of one recent game (perhaps
God of War?) where the reviewer discovered you could beat the entire
game just by pressing the single attack button repeatedly -- your
character doing amazing leaps and smashes throughout -- and it's
interesting to contrast that with the reviewers trying to find the
words to describe the sense of "real" accomplishment found in a truly
hard game like Dark Souls.  (I haven't played either.)&lt;/p&gt;
&lt;p&gt;With all that in mind, I think there is room in my life for video
games just as there is room for alcohol.  I'm not sure which, though;
Skyrim is really pretty but I was let down by how every problem was
solved by entering a dungeon and killing everything.  Starcraft II is
pretty amazing, though.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Getting things done.&lt;/strong&gt; A few days into my leave I got out a piece of
paper and made a list of all the sorts of menial life tasks I've been
intending to do but putting off, stuff like "figure out what to do with
that cardboard box full of CDs".  I worked through some of that.&lt;/p&gt;
&lt;p&gt;Most stuff I own that lives in a closet is trash, almost by
definition: I haven't used it.  Much of it is in fact
difficult-to-dispose-of trash, like old hard drives where I was afraid
of leaking data.  Or those CDs: given that I've re-ripped them, what
is the ethical thing to do with them?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Living your dreams.&lt;/strong&gt; When I was planning my leave I had wild dreams
about getting fit or learning Arabic or whatever.  Once I no longer
had my job to blame for it I was confronted by what I already
subconsciously knew: my own motivation is at fault.  I was using "no
free time" as an excuse.&lt;/p&gt;
&lt;p&gt;Daisy asked me recently: "Should I take a sabbatical too?  Will it
give me a chance to finally do all those things I've wanted to?"  And
my response was, "No, if you really wanted to do those things you
would've found time for them already."&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Money is a means to an end, not a goal.&lt;/strong&gt; I have been very lucky in
life, to have been born white in this country to a middle class
family, so that I can be in a position to afford to not work.  But I'm
not some tech startup lottery millionaire; I'm paid like an engineer
and I live well within that.&lt;/p&gt;
&lt;p&gt;Do I enjoy the nicer things in life?  Yeah, sorta, but not enough to
strive for them.  (Or, to use the "Your Money or Your Life"
computation, it rarely seems worth it to spend another hour in a
cubicle to have a more tender piece of meat for dinner.)  It always
felt so &lt;em&gt;natural&lt;/em&gt; to work all day because that's just what people do,
but seeing my wife leave in the early morning and return as the sun is
setting really drives home for me how unnatural it is.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What now?&lt;/strong&gt; I'm back at Google, still a little unsure of myself.  I
get no shortage of job offers but it's not clear to me that anyone
offers as much freedom, or as much room to
work on free software, as Google does.  If you are also a Googler and
have the opportunity to take a sabbatical, I highly recommend it.&lt;/p&gt;</content></entry><entry><id>tag:neugierig.org,2010:tech-notes/2011-12-21/linux-g4-mac-mini</id><updated>2011-12-21T18:05:00Z</updated><title>Linux on a G4 Mac Mini</title><link href="http://neugierig.org/software/blog/2011/12/linux-g4-mac-mini.html" /><content type="html">&lt;p&gt;I wanted to put Linux on this old Mac Mini so I can use it as a file
server.  Some install instructions suggested burning a CD, but I don't
have a CD burner handy.  Others suggested booting off a USB stick, but
the newer Debian/Ubuntu installer docs removed that section, and it's
not clear how to partition the USB device.  What to do?&lt;/p&gt;
&lt;p&gt;It turns out it's relatively easy to completely install via the
network due to Open Firmware's tftp support.  You'll need a pretty
standard networking setup (DHCP) along with a Linux box and a network
wire for the Mac.&lt;/p&gt;
&lt;p&gt;Start by searching for [&lt;a href="http://www.debian.org/distrib/netinst"&gt;debian netboot&lt;/a&gt;], not to be confused
with &lt;em&gt;netinst&lt;/em&gt;: the former is how to net boot the installer, the
latter is minimal CD images that download the remainder of the install
from the internet.  What you want is the netboot version of the
netinst installer, and you'll find it in the "network boot" section of
the above page.&lt;/p&gt;
&lt;p&gt;For a G4 Mac Mini, follow the link to the powerpc images, then
navigate to &lt;code&gt;powerpc/netboot&lt;/code&gt; and download the following files into
a directory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;yaboot.conf&lt;/code&gt;, configuration for yaboot&lt;/li&gt;
&lt;li&gt;&lt;code&gt;yaboot&lt;/code&gt;, the boot loader&lt;/li&gt;
&lt;li&gt;&lt;code&gt;boot.msg&lt;/code&gt;, the bootup message&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vmlinux&lt;/code&gt;, the kernel&lt;/li&gt;
&lt;li&gt;&lt;code&gt;initrd.gz&lt;/code&gt;, the rest&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(How to remember this file list?  You need &lt;code&gt;yaboot.conf&lt;/code&gt; and all files
it references; it's a short text file.)&lt;/p&gt;
&lt;p&gt;Next, install tftp on your hosting machine.  Run it like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo in.tftpd -L -s path/to/files
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It runs as root due to the port it uses, &lt;code&gt;-L&lt;/code&gt; puts it in foreground
mode, and &lt;code&gt;-s&lt;/code&gt; makes paths relative to the directory you give it.&lt;/p&gt;
&lt;p&gt;To verify that tftp works, try a command like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;tftp localhost -c get yaboot.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fix your setup if that command results in an error (for example, I
learned the hard way that the Mac doesn't like pulling files from a
subdirectory).&lt;/p&gt;
&lt;p&gt;Then boot the Mac into Open Firmware by holding Windows-Alt-O-F
(that's Command-Option-O-F on a Mac keyboard) as it boots.  At the
prompt, tell it to fetch yaboot:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;boot enet:192.168.1.114,yaboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(adjust the IP as necessary).&lt;/p&gt;
&lt;p&gt;From that point, just follow the menus.  Simple.&lt;/p&gt;
&lt;p&gt;(But actually it was not so simple to figure out the above.  If you
run into yaboot errors like &lt;code&gt;can't read elf e_ident/e_type/e_machine
info&lt;/code&gt; or the installer being unable to detect your disk, I found both
of those were solved by using the latest version of Debian, and not
Debian stable or the most recent PPC Ubuntu.)&lt;/p&gt;
&lt;p&gt;PS: wow was OS X slow on this thing.  I can't believe it was
considered tolerable at the time of release.&lt;/p&gt;</content></entry><entry><id>tag:neugierig.org,2010:tech-notes/2011-12-20/nonblocking-disk-io</id><updated>2011-12-20T20:29:00Z</updated><title>Nonblocking disk IO</title><link href="http://neugierig.org/software/blog/2011/12/nonblocking-disk-io.html" /><content type="html">&lt;p&gt;It's natural, when writing an event-driven application, to want to
perform disk operations like file reads and writes in the same
non-blocking manner you use for sockets.&lt;/p&gt;
&lt;p&gt;This turns out to be hard.  In theory there's POSIX AIO, but it's
reported to not really work on Linux (my info might be out of date).
Async libraries like node.js use an internal thread pool to simulate
their desired event-driven behavior for files.&lt;/p&gt;
&lt;p&gt;Sometimes in discussions about this people point longingly at Windows,
which does have an API for performing overlapped disk operations.  See
&lt;a href="http://tinyclouds.org/iocp-links.html"&gt;Ryan Dahl's nice overview&lt;/a&gt; for a ton of links.  In fancy apps
like Chrome, subsystems like the disk cache use overlapped IO on
Windows and threads on non-Windows, with the hope that Windows async
IO involves less overhead (less bookkeeping, fewer copies, etc.).&lt;/p&gt;
&lt;p&gt;But it turns out that Windows async IO is just broken.  In any of a
number of situations, including if your disk is encrypted, Windows
will silently make your async file operations synchronous.  See &lt;a href="http://support.microsoft.com/kb/156932"&gt;this
MSDN doc&lt;/a&gt; for a list of other potential reasons.  Special
highlights include how extending a file's length is synchronous unless
you use a special API, while that API on NTFS file systems requires a
special privilege that is only available to administrators by default.&lt;/p&gt;
&lt;p&gt;I don't write this to just say "man, Windows sure sucks" -- the Linux
situation is worse, and this may well all have been fixed in Windows
7.  But rather I observe that there is just a ton of API surface in
an operating system, and any of it may block (see e.g. &lt;a href="http://brad.livejournal.com/2228488.html"&gt;Brad
discussing sendfile&lt;/a&gt;).  (In Chrome's case, we found via
instrumentation that real users were encountering browser hangs
because its supposedly async disk interface wasn't async for seeking
within a file.)  In practice for anything you want to be truly
asynchronous you probably need to use a thread.  You can still use the
AIO APIs, if you have some that work, from that thread.&lt;/p&gt;
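&lt;p&gt;The "use a thread" conclusion can be sketched like this in C with
pthreads.  It's a simplified illustration, not how Chrome or node.js do it
(a real implementation would likely use a pool rather than a thread per
request), and &lt;code&gt;ReadRequest&lt;/code&gt;, &lt;code&gt;read_worker()&lt;/code&gt; and
&lt;code&gt;start_read()&lt;/code&gt; are hypothetical names.  The worker does the
blocking &lt;code&gt;pread()&lt;/code&gt; and then writes the request's address to a
pipe, so the event loop can watch the pipe's read end alongside its
sockets.  Build with &lt;code&gt;-pthread&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#define _POSIX_C_SOURCE 200809L
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;sys/types.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

/* Hypothetical request: read up to len bytes of fd at offset into buf. */
typedef struct {
  int fd;
  void *buf;
  size_t len;
  off_t offset;
  ssize_t result;  /* filled in by the worker: bytes read, or -1 */
  int notify_fd;   /* write end of a pipe the event loop polls */
} ReadRequest;

static void *read_worker(void *arg) {
  ReadRequest *req = arg;
  /* The blocking call happens here, off the event loop's thread. */
  req-&amp;gt;result = pread(req-&amp;gt;fd, req-&amp;gt;buf, req-&amp;gt;len, req-&amp;gt;offset);
  /* Wake the event loop by sending it the request's address. */
  if (write(req-&amp;gt;notify_fd, &amp;amp;req, sizeof req) != (ssize_t)sizeof req)
    abort();
  return NULL;
}

/* Start the read and return immediately; 0 on success.  The request must
   stay alive until its address arrives on the pipe. */
int start_read(ReadRequest *req) {
  pthread_t thread;
  int err = pthread_create(&amp;amp;thread, NULL, read_worker, req);
  if (err != 0)
    return err;
  return pthread_detach(thread);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The event loop side is then just another readable file descriptor: when
the pipe has data, read a pointer from it and look at its
&lt;code&gt;result&lt;/code&gt;.&lt;/p&gt;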
&lt;p&gt;There are two ways to interpret that conclusion.  One is that trying
to make a synchronous world async in a piecemeal fashion doesn't work,
and instead it's better to make it easy to coordinate synchronous
tasks -- the Go model (see &lt;a href="http://swtch.com/~rsc/talks/threads07/"&gt;rsc's convincing slides&lt;/a&gt;).  The other
view is that the problem is the synchronous system, and any work we
can do to move away from that is for the better -- the node model, where you
replace the world.&lt;/p&gt;</content></entry><entry><id>tag:neugierig.org,2010:tech-notes/2011-12-15/return-by-value</id><updated>2011-12-15T17:58:00Z</updated><title>Return by value</title><link href="http://neugierig.org/software/blog/2011/12/return-by-value.html" /><content type="html">&lt;p&gt;So you've been writing some C after using Python or Go or Haskell or
pretty much anything other than C and you're jealous of being able to
return more than one thing from a function.  How do you do it in C?&lt;/p&gt;
&lt;p&gt;The standard way is to return one thing (perhaps the "primary" thing
you're returning) as the return value, and then have the caller
provide pointers for the rest as "output parameters".  You &lt;em&gt;could&lt;/em&gt;
return everything by value in a &lt;code&gt;struct&lt;/code&gt;, but it feels like all those
copies might be bad, maybe?&lt;/p&gt;
&lt;p&gt;Let's check.  To reduce inlining confusion, let's split it across
multiple object files.  So here's the interface:&lt;/p&gt;
&lt;pre&gt;
&lt;span class="keyword"&gt;typedef&lt;/span&gt; &lt;span class="keyword"&gt;struct&lt;/span&gt; {
  &lt;span class="type"&gt;int&lt;/span&gt; &lt;span class="variable-name"&gt;a&lt;/span&gt;;
  &lt;span class="type"&gt;int&lt;/span&gt; &lt;span class="variable-name"&gt;b&lt;/span&gt;;
} &lt;span class="variable-name"&gt;Pair&lt;/span&gt;;

&lt;span class="type"&gt;Pair&lt;/span&gt; &lt;span class="function-name"&gt;return_pair&lt;/span&gt;();
&lt;span class="type"&gt;void&lt;/span&gt; &lt;span class="function-name"&gt;fill_pair&lt;/span&gt;(&lt;span class="type"&gt;Pair&lt;/span&gt;* &lt;span class="variable-name"&gt;p&lt;/span&gt;);
&lt;span class="type"&gt;void&lt;/span&gt; &lt;span class="function-name"&gt;fill_ints&lt;/span&gt;(&lt;span class="type"&gt;int&lt;/span&gt;* &lt;span class="variable-name"&gt;a&lt;/span&gt;, &lt;span class="type"&gt;int&lt;/span&gt;* &lt;span class="variable-name"&gt;b&lt;/span&gt;);
&lt;/pre&gt;

&lt;p&gt;And the trivial implementation:&lt;/p&gt;
&lt;pre&gt;
&lt;span class="preprocessor"&gt;#include&lt;/span&gt; &lt;span class="string"&gt;"lib.h"&lt;/span&gt;

&lt;span class="type"&gt;Pair&lt;/span&gt; &lt;span class="function-name"&gt;return_pair&lt;/span&gt;() {
  &lt;span class="type"&gt;Pair&lt;/span&gt; &lt;span class="variable-name"&gt;p&lt;/span&gt; = { 3, 5 };
  &lt;span class="keyword"&gt;return&lt;/span&gt; p;
}
&lt;span class="type"&gt;void&lt;/span&gt; &lt;span class="function-name"&gt;fill_pair&lt;/span&gt;(&lt;span class="type"&gt;Pair&lt;/span&gt;* &lt;span class="variable-name"&gt;p&lt;/span&gt;) {
  p-&amp;gt;a = 3;
  p-&amp;gt;b = 5;
}
&lt;span class="type"&gt;void&lt;/span&gt; &lt;span class="function-name"&gt;fill_ints&lt;/span&gt;(&lt;span class="type"&gt;int&lt;/span&gt;* &lt;span class="variable-name"&gt;a&lt;/span&gt;, &lt;span class="type"&gt;int&lt;/span&gt;* &lt;span class="variable-name"&gt;b&lt;/span&gt;) {
  *a = 3;
  *b = 5;
}
&lt;/pre&gt;

&lt;p&gt;And finally here's &lt;code&gt;main&lt;/code&gt; to run it, including calls to the functions
so we can see what work the caller must do.&lt;/p&gt;
&lt;pre&gt;
&lt;span class="preprocessor"&gt;#include&lt;/span&gt; &lt;span class="string"&gt;"lib.h"&lt;/span&gt;

&lt;span class="type"&gt;int&lt;/span&gt; &lt;span class="function-name"&gt;main&lt;/span&gt;(&lt;span class="type"&gt;int&lt;/span&gt; &lt;span class="variable-name"&gt;argc&lt;/span&gt;, &lt;span class="type"&gt;char&lt;/span&gt;** &lt;span class="variable-name"&gt;argv&lt;/span&gt;) {
  &lt;span class="type"&gt;Pair&lt;/span&gt; &lt;span class="variable-name"&gt;p1&lt;/span&gt; = return_pair();
  &lt;span class="type"&gt;Pair&lt;/span&gt; &lt;span class="variable-name"&gt;p2&lt;/span&gt;;
  fill_pair(&amp;amp;p2);
  &lt;span class="type"&gt;int&lt;/span&gt; &lt;span class="variable-name"&gt;a&lt;/span&gt;, &lt;span class="variable-name"&gt;b&lt;/span&gt;;
  fill_ints(&amp;amp;a, &amp;amp;b);
  &lt;span class="keyword"&gt;return&lt;/span&gt; p1.a + p2.a + a;
}
&lt;/pre&gt;

&lt;p&gt;And now to a disassembler.&lt;/p&gt;
&lt;p&gt;Starting with the last function, &lt;code&gt;fill_ints()&lt;/code&gt;.  Passing in two
pointers means that two registers get addresses put into them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   0x00000000004004e7 &amp;lt;+23&amp;gt;:    lea    0x8(%rsp),%rsi
   0x00000000004004ec &amp;lt;+28&amp;gt;:    lea    0xc(%rsp),%rdi
   0x00000000004004f1 &amp;lt;+33&amp;gt;:    callq  0x400530 &amp;lt;fill_ints&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and the implementation of &lt;code&gt;fill_ints()&lt;/code&gt; fills in the pointees.  Pretty
much what you'd expect.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Dump of assembler code for function fill_ints:
   0x0000000000400530 &amp;lt;+0&amp;gt;: movl   $0x3,(%rdi)
   0x0000000000400536 &amp;lt;+6&amp;gt;: movl   $0x5,(%rsi)
   0x000000000040053c &amp;lt;+12&amp;gt;:    retq
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;fill_pair&lt;/code&gt; implementation is similar, but with just one pointer
and two offsets.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;return_pair&lt;/code&gt; is quite different:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Dump of assembler code for function return_pair:
   0x0000000000400510 &amp;lt;+0&amp;gt;: movabs $0x500000003,%rax
   0x000000000040051a &amp;lt;+10&amp;gt;:    retq
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because two ints fit in a 64-bit register, the whole function can
be implemented with one immediate load and no memory accesses!&lt;/p&gt;
&lt;p&gt;But surely, you say, that's just because your &lt;code&gt;Pair&lt;/code&gt; type is simple.
How about pointers?  If the second field were a pointer, it wouldn't
fit into a single register.&lt;/p&gt;
&lt;p&gt;Here's what a pair of an int and a pointer compiles down to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Dump of assembler code for function return_pair2:
   0x0000000000400510 &amp;lt;+0&amp;gt;: mov    $0x40062c,%edx
   0x0000000000400515 &amp;lt;+5&amp;gt;: mov    $0x3,%eax
   0x000000000040051a &amp;lt;+10&amp;gt;:    retq
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again no memory references, just registers.&lt;/p&gt;
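&lt;p&gt;The source for &lt;code&gt;return_pair2&lt;/code&gt; isn't shown above.  A definition
along these lines would be consistent with that disassembly; the type name
&lt;code&gt;Pair2&lt;/code&gt;, the field names, and the string literal are all
assumptions.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/* Assumed shape of the int-and-pointer example. */
typedef struct {
  int a;
  const char *s;
} Pair2;

Pair2 return_pair2(void) {
  /* The string literal lives in static data, so only its address is
     returned. */
  Pair2 p = { 3, "hello" };
  return p;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With a struct like this, the int comes back in &lt;code&gt;%eax&lt;/code&gt; and the
pointer in &lt;code&gt;%rdx&lt;/code&gt;, matching the two &lt;code&gt;mov&lt;/code&gt; instructions
in the dump.&lt;/p&gt;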
&lt;p&gt;Ok, how about something that can't fit in multiple registers?  Like
say a buffer.&lt;/p&gt;
&lt;pre&gt;
&lt;span class="keyword"&gt;typedef&lt;/span&gt; &lt;span class="keyword"&gt;struct&lt;/span&gt; {
  &lt;span class="type"&gt;int&lt;/span&gt; &lt;span class="variable-name"&gt;a&lt;/span&gt;;
  &lt;span class="type"&gt;int&lt;/span&gt; &lt;span class="variable-name"&gt;big&lt;/span&gt;[1024];
} &lt;span class="variable-name"&gt;Pair&lt;/span&gt;;

&lt;span class="type"&gt;Pair&lt;/span&gt; &lt;span class="function-name"&gt;return_pair3&lt;/span&gt;();
&lt;/pre&gt;

&lt;p&gt;and the associated code:&lt;/p&gt;
&lt;pre&gt;
&lt;span class="type"&gt;Pair&lt;/span&gt; &lt;span class="function-name"&gt;return_pair3&lt;/span&gt;() {
  &lt;span class="type"&gt;Pair&lt;/span&gt; &lt;span class="variable-name"&gt;p&lt;/span&gt;;
  p.a = 3;
  p.big[0] = 5;
  &lt;span class="keyword"&gt;return&lt;/span&gt; p;
}
&lt;/pre&gt;

&lt;p&gt;Here's the dump:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Dump of assembler code for function return_pair3:
   0x0000000000400510 &amp;lt;+0&amp;gt;: sub    $0xfa0,%rsp
   0x0000000000400517 &amp;lt;+7&amp;gt;: mov    %rdi,%rax
   0x000000000040051a &amp;lt;+10&amp;gt;:    movl   $0x3,(%rdi)
   0x0000000000400520 &amp;lt;+16&amp;gt;:    movl   $0x5,0x4(%rdi)
   0x0000000000400527 &amp;lt;+23&amp;gt;:    add    $0xfa0,%rsp
   0x000000000040052e &amp;lt;+30&amp;gt;:    retq
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To "return" a large structure, the caller provides stack space for it
and the function fills in the caller's copy -- sorta like &lt;a href="http://en.wikipedia.org/wiki/Return_value_optimization"&gt;the return
value optimization&lt;/a&gt;.  This code is the same as the code that
explicitly passes a pointer.  (I don't get why this function adjusts
&lt;code&gt;%rsp&lt;/code&gt;, it seems like it doesn't even use it...)&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;In each of these cases, returning by value seems to be equal to or better
than the approaches using pointers in terms of generated code.  So why
not do it?&lt;/p&gt;
&lt;p&gt;Here are some reasons.  (Note that I'm avoiding C++ here, which has
its own additional complicated rules as described in the above
wikipedia article.)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Most importantly, you need to create a new tuple type whenever you
  want to pass more than one value around.  It is &lt;em&gt;inconvenient&lt;/em&gt;,
  especially when the caller already has a variable handy for the
  value it wants to get back from the function and could just pass its
  address.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Returning structures via registers appears to be a newer ABI; gcc has
  &lt;code&gt;-fpcc-struct-return&lt;/code&gt; and &lt;code&gt;-freg-struct-return&lt;/code&gt; to select between
  them.  But my system appears to be built with return-via-registers
  on (it appears that it was introduced into gcc around the year 2000)
  and even when I manually select returning via memory it just means
  &lt;code&gt;return_pair&lt;/code&gt; and &lt;code&gt;return_pair2&lt;/code&gt; decompose into the behavior of
  &lt;code&gt;return_pair3&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If your structure contains any character buffer the function gains
  a bunch of checking code due to &lt;a href="http://en.wikipedia.org/wiki/Buffer_overflow_protection#GCC_Stack-Smashing_Protector_.28ProPolice.29"&gt;&lt;code&gt;-fstack-protector&lt;/code&gt;&lt;/a&gt;,
  removing the benefit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For larger structures, you may have to worry about stack space.
  But such things don't belong on the stack in the first place;
  you are working with pointers to them to start with so functions
  that fill in those pointers are more convenient anyway.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;(This point and the following were contributed by Jeffrey Yasskin
  after the post was first published.)  If you return several
  different variables, depending on conditions inside the function,
  NRVO doesn't kick in. This is often a missed optimization in the
  compiler, but we still have to deal with it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the return value owns some allocated space, you can often save
  allocation time by passing in a variable that already has the space
  allocated.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Insert your reason here.  What else am I missing?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</content></entry></feed>
