<p><strong>Chromium Notes</strong> (Evan Martin, evan@chromium.org)</p>
<p><strong>The end</strong> (2012-02-22T15:46:00Z)</p>
<p>No posts for six months! I have some reasons.</p>
<p>I started at Google in 2004. When I was hired they asked me to
describe in broad terms what I wanted to work on. I remember writing
back: "I've read rumors of Google working on a Linux-based operating
system, which would be a perfect match for my skills and interests.
Failing that, I studied linguistics so something involving language
would be nice." My friend Tessa (also newly hired) and I drove down
to Silicon Valley together.</p>
<p>I ended up in search ranking (with Tessa, even in the same office at
first) for a few years. It was the experience I came to Google for:
helping people smarter than me write mapreduces over terabytes to
compute tricky math over language (<a href="http://googleblog.blogspot.com/2010/01/helping-computers-understand-language.html">much of my time on this
project</a>). But when I learned Google was finally working on the
long-rumored browser I knew it was the project for me, to make a
browser for Linux that was awesome.</p>
<p>We did just that. Some years passed and now somewhere near one in
five people on the internet use Chrome. (With a rounding error's
fraction of them on Linux, but it is <em>my</em> rounding error.) I learned
so much about so many things, from software engineering to graphics
performance to binary internals.</p>
<p>But I also learned that most complaints you ever read about making
software for desktop Linux are correct. Before we'd jokingly say
"year of Linux on the desktop!" and laugh about how it would never
happen, but my smiles had become bitter. A short way to put it is
that writing high-quality software is not really a goal of the
platform; <a href="https://plus.google.com/115094562986465477143/posts/Di6RwCNKCrf">stuff that doesn't matter</a> like continuously rewriting
atop ever-changing <a href="http://blogs.adobe.com/penguinswf/2007/05/welcome_to_the_jungle.html">platforms</a> is. The scrappiness and free
software spirit are what make me love Linux as a hacker, but I
recognize now a deeper doom, that it will only ever broadly succeed by
removing that spirit (e.g. Android). Maybe another way of saying all
this is that I grew up.</p>
<p>The Chrome team grew up, too. My fondest memory of Chrome is when we
started our San Francisco office, five of us crammed into a three
person room and hacking at full speed. (At one point the first page
of <a href="http://www.ohloh.net/p/chrome/contributors">our now 95-page ohloh leaderboard</a> was our office plus the
Linux guys to the south.) But visit <a href="http://build.chromium.org">build.chromium.org</a> now and
observe the many configurations; visit <a href="https://groups.google.com/a/chromium.org/group/chromium-dev/topics">chromium-dev</a> now and
witness tens of opinions on trivial matters; try to build Chrome now
and discover you need to download literally gigabytes of source.</p>
<p>Those are challenges, sure, but I found I was taking them personally.
I was too personally invested in the project. I gradually became a
grumpy, complainy person, the sort I wouldn't like to work with. I
took a <a href="http://neugierig.org/software/blog/2012/01/sabbatical.html">three month sabbatical</a>, came back, and found I
still felt the same way. With time to reflect I wonder if the way I
want to work is simply incompatible with a large team.</p>
<p>And so I'm moving on, to a smaller team. Coincidentally, Tessa's
team. If you want to follow my future endeavors you can read
<a href="http://neugierig.org/software/blog/">my other blog</a>.</p>
<p><strong>Tracking down a mysterious Windows crash</strong> (2011-08-31T21:34:00Z)</p>
<p><em>Today's post is a guest post from Eric Roman. He wrote a slightly
more expletive-laden version of this post inside Google and I asked
him if I could post it here. It serves as a good illustration of how
deploying software on more than a hundred million different users'
computers is a nearly-biological fight.</em></p>
<p>Understanding stability of Windows applications is really hard.</p>
<p>Recently, @apatrick has concluded a heroic investigation into one of
Chrome's most mysterious top crashes for Windows. He just committed a
"fix" for it on the canary channel, which seems to be working!</p>
<p>This debugging journey is pretty epic, so I'm giving a blow-by-blow
summary of it below. If you want to jump straight to the conclusion,
see <a href="http://crrev.com/96807">r96807</a> for the spoiler (you can also read the
comments in <a href="http://crbug.com/81449">bug 81449</a> for the full novella).</p>
<p><strong>Prelude</strong></p>
<p>Our story begins a year ago, when we first began tracking crashes in a
generic <code>RunnableFunction<>::Run()</code> location. These crashes had
established themselves as a top browser crash for Windows Chrome. At
the time, Huan and I unsuccessfully looked into the issue, but
couldn't make heads or tails of it (<a href="http://crbug.com/54307">bug
54307</a>). The basic format of the crash looked
something like this:</p>
<pre><code>0377fd78 0377fdb0 0x0
0377fd7c 0254c3c3 0x377fdb0
0377fd84 021ba1ae chrome_1c30000!RunnableFunction<void (__cdecl*)(void *),Tuple1<void *> >::Run+0xc
0377fd8c 021baa21 chrome_1c30000!`anonymous namespace'::TaskClosureAdapter::Run+0xb
0377fdb0 021baaa6 chrome_1c30000!MessageLoop::RunTask+0x81
0377fdc0 021bae47 chrome_1c30000!MessageLoop::DeferOrRunPendingTask+0x28
0377fdf8 021d1b24 chrome_1c30000!MessageLoop::DoWork+0x80
0377fe24 021ba960 chrome_1c30000!base::MessagePumpDefault::Run+0xc2
0377fe30 021ba8e5 chrome_1c30000!MessageLoop::RunInternal+0x31
0377fe38 021ba7d9 chrome_1c30000!MessageLoop::RunHandler+0x17
0377fe58 021c6530 chrome_1c30000!MessageLoop::Run+0x15
0377fe5c 021c6643 chrome_1c30000!base::Thread::Run+0x9
0377ffa8 021c1a2f chrome_1c30000!base::Thread::ThreadMain+0xa1
0377ffb4 7c80b713 chrome_1c30000!base::`anonymous namespace'::ThreadFunc+0x16
0377ffec 00000000 kernel32!BaseThreadStart+0x37
</code></pre>
<p>Above we can see that the crash is due to jumping to an instruction
pointer of 0. And judging by the top frames, it looks like there was
some stack corruption at work (notice how the first frame's alleged
return address is actually a stack location).</p>
<p>The callstack itself isn't terribly helpful though, since we can't
tell what code was actually running prior to the crash (it was gobbled
up by the stack corruption).</p>
<p>Moreover, the source location of <code>RunnableFunction<>::Run()</code> doesn't
help narrow things in the slightest, since pretty much all of Chrome's
code runs through this path (Chrome relies heavily on message passing
to post asynchronous tasks to another thread's message loop).</p>
<p>Really, all we know at this point is that "some task" got posted to
"some thread", and then crashed at "some point" after running this
task.</p>
<p>There is one interesting piece of information that can be inferred
from the minidumps: based on the crashed thread's index, it is likely
the crashing task was running on Chrome's "Child process launcher"
thread. In fact, a subsequent instrumentation
(<a href="http://crrev.com/58786">r58786</a>) confirmed that all of these crashes were
happening on the child process launcher thread. Since there is very
little code that legitimately runs on this thread, we did a full
code-flow analysis of all the paths that could post tasks. But we did
not discover any problematic codepaths (I was hoping to stumble across
something bad like a use-after-free).</p>
<p>Without any extra leads (and with a temporary dip in the crash's
frequency lowering its priority), the mystery bug got pushed onto the
back burner.</p>
<p>It lay dormant for the next 9 months, waiting for a new champion to
take up arms.</p>
<p><strong>Part II</strong></p>
<p>In May, Al Patrick (an innocent bystander) is assigned the bug on
suspicion that it is a regression. (At this point the callstack has
morphed a bit due to various optimization ambiguities, but it is
essentially the same bug I had failed to solve earlier.)</p>
<p>Our new hero starts off by adding some instrumentation trying to see
if a bad function pointer is ever being directly passed to a runnable
function (<a href="http://crrev.com/85359">r85359</a>). This doesn't turn up anything
salient.</p>
<p>Next he instruments posted tasks to retain the location where they got
posted from (<a href="http://crrev.com/85991">r85991</a>). This change is absolutely
brilliant! I have benefited from it many times since it was
introduced, to help debug crash dumps.</p>
<p>The instrumentation is cheap yet effective: whenever posting a task,
the <code>FROM_HERE</code> macro (that was being used in debug builds to save
filename/line numbers) now saves the instruction pointer into the
<code>PendingTask</code>. Later when the task is de-queued by the target thread's
message loop, this same instruction pointer (i.e. the birthplace of
the task) is pushed onto the stack prior to calling the task's virtual
function. That way should a crash happen later, you can poach the
address of the birthplace off the stack during postmortem dump
analysis!</p>
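<p>To make the trick concrete, here is a minimal sketch of the idea, not the actual Chromium change. <code>TinyMessageLoop</code>, this simplified <code>PendingTask</code>, and the <code>CALL_SITE_MARKER()</code> macro are all hypothetical stand-ins; in particular the macro yields the address of a per-call-site marker as a portable substitute for the instruction pointer that the real instrumentation saved.</p>

```cpp
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for FROM_HERE: yields an address unique to each call
// site. (The change described above saved the instruction pointer instead.)
#define CALL_SITE_MARKER() \
  ([]() -> const void* { static const char marker = 0; return &marker; }())

// Simplified, illustrative version of a queued task plus its birthplace.
struct PendingTask {
  std::function<void()> task;
  const void* posted_from;  // Where the task was posted from.
};

class TinyMessageLoop {
 public:
  void PostTask(const void* posted_from, std::function<void()> task) {
    queue_.push(PendingTask{std::move(task), posted_from});
  }

  // Runs every queued task and returns how many ran.
  int RunAll() {
    int ran = 0;
    while (!queue_.empty()) {
      PendingTask pending = std::move(queue_.front());
      queue_.pop();
      RunTask(pending);
      ++ran;
    }
    return ran;
  }

  // Exposed only so the sketch can be checked without a debugger.
  const void* last_birthplace() const { return last_birthplace_; }

 private:
  void RunTask(const PendingTask& pending) {
    // The volatile local forces the birthplace to live in this stack frame
    // while the task runs, so a crash dump taken inside the task contains it.
    const void* volatile birthplace_on_stack = pending.posted_from;
    last_birthplace_ = birthplace_on_stack;
    pending.task();
  }

  std::queue<PendingTask> queue_;
  const void* last_birthplace_ = nullptr;
};
```

<p>A caller would write something like <code>loop.PostTask(CALL_SITE_MARKER(), [] { /* work */ });</code>. The cost is one extra pointer per task and one store per run, which is why it is cheap enough to leave on in release builds.</p>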
<p>This instrumentation reveals that the problem tasks were posted by
<code>ChildProcessLauncher::Context::Terminate()</code>, suggesting that the task
itself was a runnable function on
<code>ChildProcessLauncher::Context::TerminateInternal()</code>.</p>
<p>So far so good, but we still have no idea why it is crashing.</p>
<p>Next, our hero instruments the base runnable function/method tasks in
Chrome to detect use-after-frees (<a href="http://crrev.com/86447">r86447</a>), as well
as other memory molestation (by preserving the value of the function
pointer into the minidump). This instrumentation reveals that not only
was the task alive and well at the time it was run, but the function
pointer was also untouched. This is a major breakthrough in the
investigation, since it tells us conclusively that whatever craziness
has happened, it occurred while executing
<code>ChildProcessLauncher::Context::TerminateInternal()</code>.</p>
<p>This is where things start to get weird. Looking at <code>TerminateInternal</code>
(the function that is blowing up) there is hardly any code:</p>
<pre><code>chrome_1c30000!ChildProcessLauncher::Context::TerminateInternal:
025e0cea push ebp
025e0ceb mov ebp,esp
025e0ced mov eax,dword ptr [ebp+8]
025e0cf0 mov dword ptr [ebp+8],eax
025e0cf3 test eax,eax
025e0cf5 je chrome_1c30000!ChildProcessLauncher::Context::TerminateInternal+0x16 (025e0d00)
025e0cf7 push 0
025e0cf9 push eax
025e0cfa call dword ptr [chrome_1c30000!_imp__TerminateProcess (02ba56f4)]
025e0d00 push esi
025e0d01 lea esi,[ebp+8]
025e0d04 call chrome_1c30000!base::Process::Close (0289e055)
025e0d09 pop esi
025e0d0a pop ebp
025e0d0b ret
</code></pre>
<p>And the code for <code>base::Process::Close()</code> which it references is also
pretty simple:</p>
<pre><code>chrome_1c30000!base::Process::Close:
0289e055 cmp dword ptr [esi],0
0289e058 je chrome_1c30000!base::Process::Close+0x1d (0289e072)
0289e05a push edi
0289e05b mov edi,dword ptr [esi]
0289e05d call dword ptr [chrome_1c30000!_imp__GetCurrentProcess (02ba56f0)]
0289e063 cmp edi,eax
0289e065 je chrome_1c30000!base::Process::Close+0x19 (0289e06e)
0289e067 push edi
0289e068 call dword ptr [chrome_1c30000!_imp__CloseHandle (02ba561c)]
0289e06e and dword ptr [esi],0
0289e071 pop edi
0289e072 ret
</code></pre>
<p>So basically all that we are doing is killing the process, by calling
a handful of win32 API functions. Hmmm.</p>
<p>Al theorizes that someone may be hooking one of the winapi calls
(perhaps <code>kernel32!TerminateProcess</code>), and that whatever code it is
running in response to that call is responsible for the stack
corruption. For instance if the hooker used the wrong calling
convention (not <code>stdcall</code>), that could be corrupting our stack upon
return!</p>
<p>So how is this guy "hooking" the API call?</p>
<p>There are a lot of different ways you could intercept Windows API
calls, and I am definitely no expert to explain them all. You could
for instance do things like directly patch the code in the system DLL
(in user land, or even on disk). Or re-write the binary to substitute
all the target function calls with your new one. But definitely the
simplest and most intuitive way to hook is to just patch the import
address table. (You can see how that works in the code above: our
compiled code doesn't call directly into kernel32, but rather it jumps
to the address listed in the import table; that is the
address you would be patching if you wanted to intercept the call.)</p>
<p>To this end, Al adds yet more instrumentation, this time to try and
detect if <code>TerminateProcess</code> is being hooked via the import address
table (<a href="http://crrev.com/96266">r96266</a>). Unfortunately this instrumentation
doesn't reveal any smoking gun yet.</p>
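<p>For readers who haven't seen import-table hooking, here is a portable toy model of it. Nothing here touches the real Windows import address table: <code>g_import_table_slot</code>, <code>RealTerminate</code>, and <code>HookedTerminate</code> are made-up names, and the check in <code>SlotLooksHooked()</code> is only a guess at the general shape of such a detector, not what r96266 actually does.</p>

```cpp
// Hypothetical stand-in for a system function such as TerminateProcess.
int RealTerminate(int handle) { return handle + 1; }

// Stand-in for one import address table slot: a writable pointer that the
// loader would normally fill in with the system function's address.
using TerminateFn = int (*)(int);
TerminateFn g_import_table_slot = &RealTerminate;

// Application code never calls RealTerminate by name. Like the
// "call dword ptr [_imp__TerminateProcess]" in the disassembly, it calls
// whatever address the slot currently holds.
int CallTerminate(int handle) { return g_import_table_slot(handle); }

// A third party's replacement. A real hook would usually forward to the
// original; this one just returns a recognizable value.
int HookedTerminate(int /*handle*/) { return -1; }

// Hooking is nothing more than overwriting the slot.
void InstallHook() { g_import_table_slot = &HookedTerminate; }

// A simple detector: does the slot still point where we expect?
bool SlotLooksHooked() { return g_import_table_slot != &RealTerminate; }
```

<p>A detector like this only catches hooks that patch the table. The other techniques mentioned earlier (patching the system DLL's code directly, for instance) leave the table untouched, which may be why table-based instrumentation can come up empty.</p>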
<p>Ricardo makes a good observation — a hook on <code>CloseHandle()</code> is
perhaps more likely than a hook on <code>TerminateProcess</code>, based on the
position of the 0 that appears on the stack (<a href="http://crbug.com/81449">bug
81449</a>). Plus, there is probably more value
in hooking <code>CloseHandle</code> over <code>TerminateProcess</code>.</p>
<p>Finally, Al commits a changelist that bypasses the address table
altogether for both <code>CloseHandle</code> and <code>TerminateProcess</code>, and instead
calls the underlying implementation in <code>ntdll.dll</code> directly:
<a href="http://crrev.com/96807">r96807</a>. This is not quite as efficient, but
not a huge deal either due to the low frequency of these calls.</p>
<p>This workaround appears to have thwarted the bad hooker, and so far
there hasn't been a single crash of this sort in the Windows Chrome
canary!</p>
<p>It remains unclear which of the two was being hooked or why. The
workaround is a pretty ridiculous thing to have to do, but the bypass
could shave off as much as 10% of our Windows browser crashes (yes,
this crash really was that high in some releases)! It is likely the
hooker is malware — we could copy a fragment of the hooked code into
our minidump to try and learn more.</p>
<p>Ideally we want to alert the user about these sorts of problems,
rather than papering-over them with workarounds (since they indicate a
real problem with their underlying system). But we don't have a good
mechanism to do that yet (see <a href="http://crbug.com/72239">bug 72239</a> for
some proposals).</p>
<p>In summary, kudos to @apatrick for his excellent and persistent
debugging investigation over the past three months.</p>
<p>Also this shows how powerful Chrome's Canary channel is, since it
allows you to do this style of experimental debugging with very quick
turn-arounds (mere days).</p>
<p><strong>Static initializers</strong> (2011-08-29T19:42:00Z)</p>
<p>Globals and singletons are already well-known as a design antipattern,
but they have an interesting additional cost. Consider a global (I
include file-level static in this category) value that has
initialization code. That code must be run at startup (which leads to
the <a href="http://www.parashift.com/c++-faq-lite/ctors.html#faq-10.14">static initialization order fiasco</a>, though that is not
the point of this post).</p>
<p>Because this initialization code is run at startup, before even
<code>main()</code> is entered, it is in the critical path for startup. It turns
out that even simple code must be paged in off disk, which can lead to
disk seeks, and disk seeks murder your startup performance.</p>
<p>This is not hypothetical: with ChromeOS we found that
innocuous-seeming static initializers in Chrome were actually
affecting the bottom line of startup performance. (Note: that
observation comes from a coworker; I'm not sure whether he was using a
non-SSD machine at the time or if it also happens on SSDs. Just
guessing, but paging in more code, especially code that is
non-contiguous, must have some non-zero cost even on the SSDs that
ChromeOS relies upon.)</p>
<p>Because of this cost we attempt to track static initialization on our
performance bots and prevent new checkins from adding more. (Ideally
we'd remove them all but progress is slow.) I recently looked into
how this works and I thought it'd be useful to write it down before I
forget.</p>
<p><strong>How constructors are implemented</strong></p>
<p>The compiler creates, for each object file, a function that contains
the constructors for the file. Pointers to these functions are
collected in a table at link time. At startup,
<code>__do_global_ctors_aux</code> iterates through the table and calls each
function. (<a href="http://vxheavens.com/lib/viz00.html">Here's a nice page that walks through the
disassembly</a>.) Conceptually, to judge the cost of all static
constructors you might want to do something like sum the size of
all of these functions, but for our purposes we care about disk seeks;
even doing more work in a single static constructor is fine if we
reduce the total number of functions paged in, which means the size
of the constructor table is the statistic of interest.</p>
<p>The table of functions shows up as the <code>.ctors</code> section of the
executable. You can dump the table via commands like (note that the first
entry is -1, the rest are addresses):</p>
<p><code>$ objdump --full-contents --section=.ctors path/to/binary</code></p>
<p>or in gdb,</p>
<p><code>(gdb) x/1000xg &__CTOR_LIST__</code></p>
<p>The gdb output is perhaps useful since it will decode little-endian
for you. (N.b. that "g" trailing the "x" command prints 64-bit
pointers; adjust as necessary locally.)</p>
<p>For a Chrome binary I glanced at, the ctor list appears to be in
pointer order, which means you can see how much of the resulting
binary they span by subtracting the first address from the last. From
my random debugging build: 30 MB, not good.</p>
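<p>The span calculation is just the distance between the lowest and highest addresses in the dumped table. A small sketch, assuming the entries have already been parsed into integers; <code>CtorSpanBytes</code> is a hypothetical helper, and skipping zero entries is my assumption about how a terminator or padding would show up in the dump.</p>

```cpp
#include <cstdint>
#include <vector>

// Returns the number of bytes between the lowest and highest constructor
// function addresses in a dumped .ctors table.
std::uint64_t CtorSpanBytes(const std::vector<std::uint64_t>& entries) {
  const std::uint64_t kMarker = ~std::uint64_t{0};  // The leading -1 entry.
  std::uint64_t lowest = 0;
  std::uint64_t highest = 0;
  bool seen_address = false;
  for (std::uint64_t entry : entries) {
    // Skip the -1 marker, and (by assumption) any zero terminator or padding.
    if (entry == kMarker || entry == 0) continue;
    if (!seen_address) {
      lowest = highest = entry;
      seen_address = true;
    } else {
      if (entry < lowest) lowest = entry;
      if (entry > highest) highest = entry;
    }
  }
  return highest - lowest;  // 0 if there were fewer than two addresses.
}
```

<p>Taking the minimum and maximum rather than the literal first and last entries means the result doesn't depend on the list really being in pointer order.</p>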
<p><strong>Constructors versus static initialization</strong></p>
<p>Note that data that is initialized to a constant is implemented in a
different way: the constant value can just be placed in the right
place at compile time, so there is no cost. In contrast, C++ objects
that have constructors involve code and must be computed at runtime.
You'll also sometimes encounter code that initializes variables with
function calls (like <a href="http://neugierig.org/software/chromium/notes/2011/01/plugin-conflict.html">we did with the mysterious IcedTea
crash</a>).</p>
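<p>A small illustration of the difference, with made-up names throughout. The first two globals are plain constants and need no startup code; the last two each require code to run before <code>main()</code>, which is what puts an entry for this file in the constructor table.</p>

```cpp
// Constant-initialized: the values are placed in the binary's data at build
// time, so no code runs at startup for either of these.
int g_startup_code_runs = 0;
int g_plain_constant = 42;

// An object with a constructor: the compiler must emit startup code that
// calls Tracker::Tracker() before main() is entered.
struct Tracker {
  Tracker() { ++g_startup_code_runs; }
};
Tracker g_tracker;

// A variable initialized by a function call is in the same boat, even though
// no class is involved.
int ComputeAtStartup() {
  ++g_startup_code_runs;
  return 7;
}
int g_computed = ComputeAtStartup();
```

<p>By the time <code>main()</code> starts, <code>g_startup_code_runs</code> is already 2: both pieces of startup code have run, and both had to be paged in to do so.</p>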
<p>You might also notice that static data can be shared between multiple
instances of the same executable, while initialized memory is private;
see <a href="http://neugierig.org/software/blog/2011/05/memory.html">my post about how memory works</a> for more on that.</p>
<p>I noticed with some interest that the Go programming language,
designed in part by compiler hackers, neatly sidesteps some of the
above problems: by defining initialization order carefully ("The
importing of packages, by construction, guarantees that there can be
no cyclic dependencies in initialization.") and by only allowing
simple values as constant initializers. See <a href="http://golang.org/doc/go_spec.html#Program_initialization_and_execution">their manual</a>
for more.</p>
<p><strong>What to do about it</strong></p>
<p>Mozilla hackers <a href="http://blog.mozilla.com/tglek/2010/05/27/startup-backward-constructors/">have found that Linux is pathologically bad in how it
runs the resulting ctor list</a>, and it looks like they have
at least <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=606137">considered fixing that manually</a>. We have chatted
about doing the same, but fundamentally I believe the way to keep
startup fast is to <em>do less</em>. <a href="http://neugierig.org/software/chromium/notes/2010/05/fast.html">See also my earlier post about
performance</a>.</p>
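<p>One common way to "do less" at startup, offered here as a general C++ technique rather than anything this post prescribes, is to replace a global object with a function-local static so the constructor runs on first use instead of before <code>main()</code>. The names below are hypothetical.</p>

```cpp
#include <map>
#include <string>

int g_table_builds = 0;  // Counts how often the table is constructed.

std::map<std::string, int> BuildTable() {
  ++g_table_builds;
  return {{"http", 80}, {"https", 443}};
}

// The eager form would be a namespace-scope
//   const std::map<std::string, int> g_ports = BuildTable();
// which runs BuildTable() before main() and adds a static initializer.

// Lazy form: the table is built the first time this function is called.
// Since C++11 the initialization of a function-local static is thread-safe.
const std::map<std::string, int>& DefaultPorts() {
  static const std::map<std::string, int> table = BuildTable();
  return table;
}

int DefaultPortFor(const std::string& scheme) {
  const auto& ports = DefaultPorts();
  auto it = ports.find(scheme);
  return it == ports.end() ? -1 : it->second;
}
```

<p>The work hasn't gone away; it has moved off the startup path and is skipped entirely if nothing ever asks for the table.</p>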
<p>It appears that the generated functions that run these constructors
get names starting with <code>_GLOBAL__I_</code>. This means a call like</p>
<p><code>$ nm out/Debug/chrome | grep _GLOBAL__I</code></p>
<p>will dump a list of all files that have a global constructor. Go delete
some code!</p>
<p><strong>The zygote process and software updates</strong> (2011-08-03T17:59:00Z)</p>
<p>When you make a new tab Chrome (usually) starts a new process for that
tab. How is this done? It would seem natural to just <code>fork()</code>, but
<code>fork</code> can't be used safely in the presence of threads. <code>fork</code> only
forks the current thread but other threads may be holding locks
(including e.g. inside glibc or in the allocator) which would never
be released after the fork.</p>
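<p>Here is a small demonstration of that hazard, written as a sketch for a POSIX system. <code>ChildInheritsHeldLock()</code> is a made-up function: it has a second thread take a lock, forks, and has the child report whether the lock is still held. The child uses <code>try_lock()</code> only so the demo terminates; a plain <code>lock()</code> there would hang forever.</p>

```cpp
#include <atomic>
#include <mutex>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

// Returns true if a forked child found the lock held with no thread left
// to release it.
bool ChildInheritsHeldLock() {
  static std::mutex lock;
  std::atomic<bool> locked{false};
  std::atomic<bool> release{false};

  // A second thread grabs the lock and sits on it.
  std::thread holder([&] {
    std::lock_guard<std::mutex> guard(lock);
    locked = true;
    while (!release) std::this_thread::yield();
  });
  while (!locked) std::this_thread::yield();

  // fork() copies memory (including the locked mutex) but only the calling
  // thread; the holder thread does not exist in the child.
  pid_t pid = fork();
  if (pid == 0) {
    // Exit status 1 means "still locked, and nobody here can unlock it".
    _exit(lock.try_lock() ? 0 : 1);
  }

  int status = 0;
  bool orphaned = pid > 0 && waitpid(pid, &status, 0) == pid &&
                  WIFEXITED(status) && WEXITSTATUS(status) == 1;

  release = true;  // Let the holder thread finish in the parent.
  holder.join();
  return orphaned;
}
```

<p>In a real program the lock is rarely one you can see: it is buried in the allocator or the C library, and the child deadlocks the first time it calls <code>malloc</code>.</p>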
<p>If you are careful to not touch anything after a fork, it can be safe
to immediately <code>exec</code>. This matches the process launching model on
Windows (no fork, only fork+exec), with the negative that it forces
the overhead of startup again on each new process. (Code reference:
<a href="http://codesearch.google.com/codesearch#search/&exact_package=chromium&q=launchprocess&type=cs"><code>LaunchProcess()</code></a>, which also knows to e.g. use <code>_exit</code> instead
of <code>exit</code>.)</p>
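<p>The fork-then-immediately-exec pattern looks roughly like the following. This is a minimal POSIX sketch, not Chrome's <code>LaunchProcess()</code>; <code>RunAndWait</code> is a hypothetical helper, and returning 127 on exec failure is borrowed from shell convention.</p>

```cpp
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Launches path with argv, waits for it, and returns its exit code
// (or -1 if it could not be launched or did not exit normally).
int RunAndWait(const char* path, char* const argv[]) {
  pid_t pid = fork();
  if (pid < 0) return -1;

  if (pid == 0) {
    // Child: touch nothing, just replace ourselves with the new program.
    execv(path, argv);
    // Only reached if exec failed. _exit() rather than exit(): we must not
    // run the parent's atexit handlers or flush its stdio buffers from here.
    _exit(127);
  }

  int status = 0;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

<p>For example, passing <code>"/bin/sh"</code> with the arguments <code>sh -c "exit 3"</code> returns 3, and a path that doesn't exist returns 127.</p>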
<p>Forking and execing ourselves is how we spawn subprocesses on Mac (I
believe; there may be some trickery related to how app bundles work
that complicates this). On Linux it is unfortunately more complicated.
Updates on Linux are managed by a systemwide package manager that runs
independently of other software, which effectively means at any point
any file you rely upon can be silently clobbered. (This problem even
affects single-process apps like Firefox; an update will clobber some
JavaScript used in the UI and suddenly things will either crash or get
weird.) In Chrome's case, if the Chrome binary itself is updated while
the browser is running, processes spawned by the running Chrome would
be the newer Chrome, which may have made an incompatible change to the
interface between Chrome processes.</p>
<p>Instead, at startup, before we spawn any threads, we fork off a helper
process. This process opens every file we might use and then waits
for commands from the main process. When it's time to make a new
subprocess we ask the helper, which forks itself again as the new
child. By virtue of always forking from the same initial process, we
guarantee that we are always running the same code; even if the files
we opened are replaced by a system update our handle on them is the
handle for the previous file. (That works as long as nobody
overwrites the contents of the file we have open; thankfully, package
updates write a new file and rename it over the old name, leaving our
open copy the only remaining reference to the old file.)</p>
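<p>The whole scheme fits in a toy program. The sketch below is not Chrome's zygote: the one-byte command protocol, the <code>Zygote</code> struct, and every function name are invented for illustration. A helper is forked early, opens a data file, and forks a child on request; each child reads the first byte of the file through the inherited descriptor and sends it back.</p>

```cpp
#include <fcntl.h>
#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct Zygote {
  pid_t pid;       // The helper process.
  int command_fd;  // Our end of the socket pair.
};

// Replaces path the way the post says package updates do: write a new file,
// then rename it over the old name.
bool ReplaceByRename(const std::string& path, const std::string& contents) {
  const std::string tmp = path + ".new";
  int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) return false;
  bool ok = write(fd, contents.data(), contents.size()) ==
            static_cast<ssize_t>(contents.size());
  close(fd);
  return ok && rename(tmp.c_str(), path.c_str()) == 0;
}

// Body of the helper: open the file once, then fork a child per command.
[[noreturn]] void ZygoteMain(int socket_fd, const char* data_path) {
  int data_fd = open(data_path, O_RDONLY);
  char ready = data_fd >= 0 ? 'R' : 'E';
  if (write(socket_fd, &ready, 1) != 1) _exit(1);
  char command;
  while (read(socket_fd, &command, 1) == 1) {
    pid_t child = fork();  // Safe: the helper never started any threads.
    if (child == 0) {
      char first = '?';
      if (pread(data_fd, &first, 1, 0) != 1) first = '?';
      if (write(socket_fd, &first, 1) != 1) _exit(1);
      _exit(0);
    }
    if (child > 0) {
      waitpid(child, nullptr, 0);
    } else {
      char error = 'E';
      if (write(socket_fd, &error, 1) != 1) _exit(1);
    }
  }
  _exit(0);  // The parent closed its end; we are done.
}

// Forks the helper and waits until it has opened data_path.
Zygote StartZygote(const std::string& data_path) {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return {-1, -1};
  pid_t pid = fork();
  if (pid == 0) {
    close(fds[0]);
    ZygoteMain(fds[1], data_path.c_str());
  }
  close(fds[1]);
  char ready = 0;
  if (pid < 0 || read(fds[0], &ready, 1) != 1 || ready != 'R') {
    close(fds[0]);
    return {-1, -1};
  }
  return {pid, fds[0]};
}

// Asks the helper for a new child; returns the byte that child read.
char SpawnChildAndReadFirstByte(const Zygote& zygote) {
  char command = 'F';
  char reply = 0;
  if (write(zygote.command_fd, &command, 1) != 1) return 0;
  if (read(zygote.command_fd, &reply, 1) != 1) return 0;
  return reply;
}

void StopZygote(const Zygote& zygote) {
  close(zygote.command_fd);  // The helper sees EOF and exits.
  waitpid(zygote.pid, nullptr, 0);
}
```

<p>If the file starts out containing <code>1</code> and is later replaced via <code>ReplaceByRename()</code> with <code>2</code>, children spawned after the replacement still read <code>1</code>. That is the guarantee described above, and it holds only because the replacement was a rename rather than an in-place overwrite.</p>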
<p>(Code reference: <a href="http://codesearch.google.com/codesearch#OAMlx_jo-ck/src/content/browser/child_process_launcher.cc&exact_package=chromium&q=launchinternal&type=cs&l=104"><code>ChildProcessLauncher</code>'s <code>LaunchInternal()</code></a>,
the gory <code>ifdef</code> soup used when launching a subprocess. Truly some
ugly code.)</p>
<p>This solution is both clever and an ugly hack. Any time someone adds
code to Chrome that interacts with a file on disk they either need to
be aware that they need to preemptively open it or they will produce
mysterious failures across updates (in practice, usually the latter;
e.g. <a href="http://code.google.com/p/chromium/issues/detail?id=35793">bug 35793: Devtools stop working when chrome gets
updated</a>). An interesting question to ask is: why is this not a
problem on Windows and Mac?</p>
<p>On Windows, files are locked if any process is using them, which
forces a design where updates install into a separate directory. But
— annoyingness of locking aside — in fact I think that design is
preferable. To start with, a given version of Chrome will know its
files will remain unmolested by updates. Furthermore, when an update
happens, the updater can write out a separate "update succeeded"
sentinel after writing all the files out, making it impossible for an
aborted update to leave both the previous and next version in a
half-working state. (On Mac, we take a similar approach; I don't know
enough about Macs to know whether the versioned directories within
bundles make this magically work.)</p>
<p>With all this in mind you might reasonably ask why Linux needs to be
special: why we waste memory on this zygote process launcher and have
extra buggy codepaths just to support an inferior update model. (Note
that by using <code>.deb</code> files we also lose <a href="http://neugierig.org/software/chromium/notes/2009/05/courgette.html">our tiny incremental
updates</a>.)</p>
<p>And to that I can only answer the thinking we had at the time. First, we
wanted to be good citizens on Linux; one distinction between "lame
port of a Windows app" and "real Linux software" is exactly whether
you distribute as a tarball or as a package. Second, and more
importantly, we knew that regardless of what we did for Google Chrome
the Linux distros would attempt to stuff Chromium into their package
manager even when they know it breaks the app, much like they've done
to Firefox. Now that I've summarized it in these terms it sounds a
little depressing, but there it is; with ChromeOS where we control the
stack we have more intelligent updates.</p>
<p><strong>Data visualization and d3</strong> (2011-07-29T23:32:00Z)</p>
<p>Lately I've been learning the <a href="http://mbostock.github.com/d3/">awesome d3 library</a> for data
visualization. I usually only publish my toys internally, but
only because it's convenient; I'd rather put them online so others
can play with them.</p>
<p>First up: <a href="http://neugierig.org/software/datavis/lines-spent/">lines spent</a>.</p>