Tracking down a mysterious Windows crash
Today's post is a guest post from Eric Roman. He wrote a slightly more expletive-laden version of this post inside Google and I asked him if I could post it here. It serves as a good illustration of how deploying software on more than a hundred million different users' computers is a nearly-biological fight.
Understanding stability of Windows applications is really hard.
Recently, @apatrick has concluded a heroic investigation into one of Chrome's most mysterious top crashes for Windows. He just committed a "fix" for it on the canary channel, which seems to be working!
This debugging journey is pretty epic, so I'm giving a blow-by-blow summary of it below. If you want to jump straight to the conclusion, see r96807 for the spoiler (you can also read the comments in bug 81449 for the full novella).
Prelude
Our story begins a year ago, when we first began tracking crashes in a generic RunnableFunction<>::Run() location. These crashes had established themselves as a top browser crash for Windows Chrome. At the time, Huan and I looked into the issue but couldn't make heads or tails of it (bug 54307). The basic format of the crash looked something like this:
0377fd78 0377fdb0 0x0
0377fd7c 0254c3c3 0x377fdb0
0377fd84 021ba1ae chrome_1c30000!RunnableFunction<void (__cdecl*)(void *),Tuple1<void *> >::Run+0xc
0377fd8c 021baa21 chrome_1c30000!`anonymous namespace'::TaskClosureAdapter::Run+0xb
0377fdb0 021baaa6 chrome_1c30000!MessageLoop::RunTask+0x81
0377fdc0 021bae47 chrome_1c30000!MessageLoop::DeferOrRunPendingTask+0x28
0377fdf8 021d1b24 chrome_1c30000!MessageLoop::DoWork+0x80
0377fe24 021ba960 chrome_1c30000!base::MessagePumpDefault::Run+0xc2
0377fe30 021ba8e5 chrome_1c30000!MessageLoop::RunInternal+0x31
0377fe38 021ba7d9 chrome_1c30000!MessageLoop::RunHandler+0x17
0377fe58 021c6530 chrome_1c30000!MessageLoop::Run+0x15
0377fe5c 021c6643 chrome_1c30000!base::Thread::Run+0x9
0377ffa8 021c1a2f chrome_1c30000!base::Thread::ThreadMain+0xa1
0377ffb4 7c80b713 chrome_1c30000!base::`anonymous namespace'::ThreadFunc+0x16
0377ffec 00000000 kernel32!BaseThreadStart+0x37
Above we can see that the crash is due to jumping to an instruction pointer of 0. And judging by the top frames, it looks like there was some stack corruption at work (notice how the first frame's alleged return address is actually a stack location).
The callstack itself isn't terribly helpful though, since we can't tell what code was actually running prior to the crash (it was gobbled up by the stack corruption).
Moreover, the source location of RunnableFunction<>::Run() doesn't help narrow things in the slightest, since pretty much all of Chrome's code runs through this path (Chrome relies heavily on message passing to post asynchronous tasks to another thread's message loop).
Really, all we know at this point is that "some task" got posted to "some thread", and then crashed at "some point" after running this task.
There is one interesting piece of information that can be inferred from the minidumps: based on the crashed thread's index, it is likely the crashing task was running on Chrome's "Child process launcher" thread. In fact, subsequent instrumentation (r58786) confirmed that all of these crashes were happening on the child process launcher thread. Since there is very little code that legitimately runs on this thread, we did a full code-flow analysis of all the paths that could post tasks. But we did not discover any problematic codepaths (I was hoping to stumble across something bad like a use-after-free).
Without any extra leads (and with a temporary dip in the crash's frequency lowering its priority), the mystery bug got pushed onto the back burner.
It lay dormant for the next 9 months, waiting for a new champion to take up arms.
Part II
In May, Al Patrick (an innocent bystander) is assigned the bug on suspicion that it is a regression. (At this point the callstack has morphed a bit due to various optimization ambiguities, but it is essentially the same bug I had failed to solve earlier.)
Our new hero starts off by adding some instrumentation to see if a bad function pointer is ever being passed directly to a runnable function (r85359). This doesn't turn up anything salient.
Next he instruments posted tasks to retain the location they were posted from (r85991). This change is absolutely brilliant! I have benefited from it many times since it was introduced to help debug crash dumps.
The instrumentation is cheap yet effective: whenever posting a task, the FROM_HERE macro (that was being used in debug builds to save filename/line numbers) now saves the instruction pointer into the PendingTask. Later, when the task is dequeued by the target thread's message loop, this same instruction pointer (i.e. the birthplace of the task) is pushed onto the stack prior to calling the task's virtual function. That way, should a crash happen later, you can poach the address of the birthplace off the stack during postmortem dump analysis!
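To make the mechanism concrete, here is a rough sketch of the idea (the types and names are simplified stand-ins, not the actual Chromium classes): the posting site records its own program counter, and the message loop copies that value into a stack slot just before invoking the task.

#include <intrin.h>   // _ReturnAddress (MSVC intrinsic)
#include <deque>

struct Task {
  virtual ~Task() {}
  virtual void Run() = 0;
};

// Simplified stand-in for Chrome's pending-task bookkeeping.
struct PendingTask {
  Task* task;
  const void* birth_program_counter;  // where the task was posted from
};

static std::deque<PendingTask> g_queue;  // stand-in for a message loop queue

// noinline so _ReturnAddress() reliably reports the caller (the posting site).
__declspec(noinline) void PostTask(Task* task) {
  PendingTask pending = { task, _ReturnAddress() };
  g_queue.push_back(pending);
}

void RunNextTask() {
  PendingTask pending = g_queue.front();
  g_queue.pop_front();
  // Keep the birthplace in a stack slot the optimizer can't discard, so a
  // crash inside Run() leaves it visible in a postmortem minidump.
  const void* volatile birthplace = pending.birth_program_counter;
  (void)birthplace;
  pending.task->Run();
  delete pending.task;
}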
This instrumentation reveals that the problem tasks were posted by ChildProcessLauncher::Context::Terminate(), suggesting that the task itself was a runnable function on ChildProcessLauncher::Context::TerminateInternal().
So far so good, but we still have no idea why it is crashing.
Next, our hero instruments the base runnable function/method tasks in Chrome to detect use-after-frees (r86447), as well as other memory molestation (by preserving the value of the function pointer into the minidump). This instrumentation reveals that not only was the task alive and well at the time it was run, but the function pointer was also untouched. This is a major breakthrough in the investigation, since it tells us conclusively that whatever craziness has happened, it occurred while executing ChildProcessLauncher::Context::TerminateInternal().
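The flavor of that check looks roughly like the following sketch (the details here are guesses for illustration; the actual patch is r86447): keep a redundant copy of the function pointer, and compare the two copies on the stack right before dispatching, so both values land in any resulting minidump.

#include <intrin.h>  // __debugbreak

typedef void (*TaskFunction)(void*);

class RunnableFunction {
 public:
  RunnableFunction(TaskFunction function, void* arg)
      : function_(function), arg_(arg), function_copy_(function) {}

  void Run() {
    // Stack-local copies, visible during postmortem dump analysis.
    TaskFunction volatile original = function_copy_;
    TaskFunction volatile current = function_;
    // If the object had been freed and reused, or the pointer stomped, the
    // two copies would disagree (or both point at garbage).
    if (current != original)
      __debugbreak();  // crash right here, with both values preserved
    function_(arg_);
  }

 private:
  TaskFunction function_;
  void* arg_;
  TaskFunction function_copy_;  // redundant copy to detect memory stomps
};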
This is where things start to get weird. Looking at TerminateInternal (the function that is blowing up), there is hardly any code:
chrome_1c30000!ChildProcessLauncher::Context::TerminateInternal:
025e0cea push ebp
025e0ceb mov ebp,esp
025e0ced mov eax,dword ptr [ebp+8]
025e0cf0 mov dword ptr [ebp+8],eax
025e0cf3 test eax,eax
025e0cf5 je chrome_1c30000!ChildProcessLauncher::Context::TerminateInternal+0x16 (025e0d00)
025e0cf7 push 0
025e0cf9 push eax
025e0cfa call dword ptr [chrome_1c30000!_imp__TerminateProcess (02ba56f4)]
025e0d00 push esi
025e0d01 lea esi,[ebp+8]
025e0d04 call chrome_1c30000!base::Process::Close (0289e055)
025e0d09 pop esi
025e0d0a pop ebp
025e0d0b ret
And the code for base::Process::Close(), which it references, is also pretty simple:
chrome_1c30000!base::Process::Close:
0289e055 cmp dword ptr [esi],0
0289e058 je chrome_1c30000!base::Process::Close+0x1d (0289e072)
0289e05a push edi
0289e05b mov edi,dword ptr [esi]
0289e05d call dword ptr [chrome_1c30000!_imp__GetCurrentProcess (02ba56f0)]
0289e063 cmp edi,eax
0289e065 je chrome_1c30000!base::Process::Close+0x19 (0289e06e)
0289e067 push edi
0289e068 call dword ptr [chrome_1c30000!_imp__CloseHandle (02ba561c)]
0289e06e and dword ptr [esi],0
0289e071 pop edi
0289e072 ret
So basically all we are doing is killing the process by calling a handful of Win32 API functions. Hmmm.
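Reconstructed from the two disassembly listings above, the logic boils down to roughly the following (a free-function approximation for readability, not the actual Chromium source):

#include <windows.h>

// Approximation of base::Process::Close(): close the handle unless it is
// the pseudo-handle for the current process, then zero it out.
static void CloseProcessHandle(HANDLE* handle) {
  if (!*handle)
    return;
  if (*handle != ::GetCurrentProcess())
    ::CloseHandle(*handle);
  *handle = NULL;
}

// Approximation of ChildProcessLauncher::Context::TerminateInternal().
static void TerminateInternal(HANDLE handle) {
  if (handle)
    ::TerminateProcess(handle, 0);  // the call at 025e0cfa
  CloseProcessHandle(&handle);      // the call at 025e0d04
}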
Al theorizes that someone may be hooking one of the winapi calls (perhaps kernel32!TerminateProcess), and that whatever code it is running in response to that call is responsible for the stack corruption. For instance, if the hooker used the wrong calling convention (not stdcall), that could be corrupting our stack upon return!
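Here is a hypothetical illustration of that theory. kernel32!TerminateProcess is stdcall, so the callee is responsible for popping its own arguments; a hook compiled with a different convention returns without cleaning them up.

#include <windows.h>

// The real export: BOOL WINAPI TerminateProcess(HANDLE, UINT).
// WINAPI expands to __stdcall, so TerminateProcess pops its own 8 bytes of
// arguments with a "ret 8" instruction.
typedef BOOL(WINAPI* TerminateProcessFn)(HANDLE process, UINT exit_code);

// A buggy hook declared __cdecl instead: it returns with a plain "ret",
// leaving the two arguments (8 bytes) on the stack. The caller was compiled
// against the __stdcall prototype and doesn't clean them up either, so ESP
// is now wrong and a later return can land anywhere, including address 0,
// as in the crash dumps above.
static BOOL __cdecl BadTerminateProcessHook(HANDLE process, UINT exit_code) {
  // ... whatever the hook wanted to do before/instead of terminating ...
  return TRUE;
}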
So how is this guy "hooking" the API call?
There are a lot of different ways you could intercept Windows API calls, and I am definitely no expert on all of them. You could, for instance, directly patch the code in the system DLL (in user land, or even on disk), or rewrite the binary to substitute all the target function calls with your new one. But definitely the simplest and most intuitive way to hook is to just patch the import address table. (You can see how that works in the code above: our compiled code doesn't call directly into kernel32, but rather it jumps to the address listed in the import table... that is the address you would be patching if you wanted to intercept the call.)
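To make the IAT idea concrete, here is a minimal sketch of how such a hook gets installed (hypothetical code, heavily simplified, and not anything Chrome does): walk the module's import descriptors, find the entry for the target DLL and function, and overwrite the pointer that the compiled code jumps through.

#include <windows.h>
#include <string.h>

// Swap the IAT entry in |module| for |function| imported from |dll| with
// |replacement|; returns the original pointer, or NULL if not found.
void* PatchIAT(HMODULE module, const char* dll, const char* function,
               void* replacement) {
  BYTE* base = reinterpret_cast<BYTE*>(module);
  IMAGE_DOS_HEADER* dos = reinterpret_cast<IMAGE_DOS_HEADER*>(base);
  IMAGE_NT_HEADERS* nt =
      reinterpret_cast<IMAGE_NT_HEADERS*>(base + dos->e_lfanew);
  IMAGE_DATA_DIRECTORY dir =
      nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT];
  IMAGE_IMPORT_DESCRIPTOR* imports =
      reinterpret_cast<IMAGE_IMPORT_DESCRIPTOR*>(base + dir.VirtualAddress);

  for (; imports->Name; ++imports) {
    if (_stricmp(reinterpret_cast<char*>(base + imports->Name), dll) != 0)
      continue;
    // Walk the import name table and the IAT in parallel.
    IMAGE_THUNK_DATA* names = reinterpret_cast<IMAGE_THUNK_DATA*>(
        base + imports->OriginalFirstThunk);
    IMAGE_THUNK_DATA* iat =
        reinterpret_cast<IMAGE_THUNK_DATA*>(base + imports->FirstThunk);
    for (; names->u1.AddressOfData; ++names, ++iat) {
      if (names->u1.Ordinal & IMAGE_ORDINAL_FLAG)
        continue;  // imported by ordinal, not by name
      IMAGE_IMPORT_BY_NAME* import = reinterpret_cast<IMAGE_IMPORT_BY_NAME*>(
          base + names->u1.AddressOfData);
      if (strcmp(reinterpret_cast<char*>(import->Name), function) != 0)
        continue;
      // The IAT lives in a read-only section, so unprotect it first.
      DWORD old_protect = 0;
      VirtualProtect(&iat->u1.Function, sizeof(iat->u1.Function),
                     PAGE_READWRITE, &old_protect);
      void* original =
          reinterpret_cast<void*>(static_cast<ULONG_PTR>(iat->u1.Function));
      iat->u1.Function = reinterpret_cast<ULONG_PTR>(replacement);
      VirtualProtect(&iat->u1.Function, sizeof(iat->u1.Function), old_protect,
                     &old_protect);
      return original;
    }
  }
  return NULL;  // the import wasn't found
}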
To this end, Al adds yet more instrumentation, this time to try and detect if TerminateProcess is being hooked via the import address table (r96266). Unfortunately this instrumentation doesn't reveal any smoking gun yet.
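One plausible shape for such a check (an illustrative sketch, not the actual r96266 code): look at where the pointer our module currently holds for TerminateProcess resolves, and verify it lands inside the DLL that is supposed to export it.

#include <windows.h>

// Returns the module that contains |address|, or NULL if it isn't inside
// any loaded module at all (e.g. shellcode in an anonymous allocation).
static HMODULE ModuleContaining(void* address) {
  HMODULE module = NULL;
  GetModuleHandleExW(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS |
                         GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
                     reinterpret_cast<LPCWSTR>(address), &module);
  return module;
}

static bool TerminateProcessLooksHooked() {
  // Because TerminateProcess is declared __declspec(dllimport), taking its
  // address reads the pointer out of our import address table, i.e. exactly
  // the value a hooker would have patched.
  void* target = reinterpret_cast<void*>(&::TerminateProcess);
  HMODULE owner = ModuleContaining(target);
  // On older Windows the implementation lives in kernel32.dll; on Windows 7
  // and later the loader may resolve the import into kernelbase.dll.
  return owner != GetModuleHandleW(L"kernel32.dll") &&
         owner != GetModuleHandleW(L"kernelbase.dll");
}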
Ricardo makes a good observation: a hook on CloseHandle() is perhaps more likely than a hook on TerminateProcess, based on the position of the 0 that appears on the stack (bug 81449). Plus, there is probably more value in hooking CloseHandle over TerminateProcess.
Finally, Al commits a changelist that bypasses the import address table altogether for both CloseHandle and TerminateProcess, and instead calls the underlying implementation in ntdll.dll directly: r96807. This is not quite as efficient, but not a huge deal either due to the low frequency of these calls.
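The shape of that bypass looks roughly like the sketch below (the real change is r96807 in the Chromium tree; the names and structure here are illustrative): resolve NtClose and NtTerminateProcess out of ntdll.dll at runtime, so the calls never go through our module's (possibly patched) import table entries for the kernel32 wrappers.

#include <windows.h>

typedef LONG(WINAPI* NtTerminateProcessFn)(HANDLE process, LONG exit_status);
typedef LONG(WINAPI* NtCloseFn)(HANDLE handle);

// Call ntdll!NtTerminateProcess directly instead of going through our
// import table entry for kernel32!TerminateProcess.
static bool TerminateProcessViaNtdll(HANDLE process, UINT exit_code) {
  NtTerminateProcessFn nt_terminate =
      reinterpret_cast<NtTerminateProcessFn>(GetProcAddress(
          GetModuleHandleW(L"ntdll.dll"), "NtTerminateProcess"));
  if (!nt_terminate)
    return false;
  return nt_terminate(process, static_cast<LONG>(exit_code)) >= 0;
}

// Likewise for kernel32!CloseHandle -> ntdll!NtClose.
static bool CloseHandleViaNtdll(HANDLE handle) {
  NtCloseFn nt_close = reinterpret_cast<NtCloseFn>(
      GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtClose"));
  if (!nt_close)
    return false;
  return nt_close(handle) >= 0;  // an NTSTATUS >= 0 means success
}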
This workaround appears to have thwarted the bad hooker, and so far there hasn't been a single crash of this sort in the Windows Chrome canary!
It remains unclear which of the two was being hooked or why. The workaround is a pretty ridiculous thing to have to do, but the bypass could shave off as much as 10% of our Windows browser crashes (yes, this crash really was that high in some releases)! It is likely the hooker is malware — we could copy a fragment of the hooked code into our minidump to try and learn more.
Ideally we want to alert the user about these sorts of problems, rather than papering over them with workarounds (since they indicate a real problem with their underlying system). But we don't have a good mechanism to do that yet (see bug 72239 for some proposals).
In summary, kudos to @apatrick for his excellent and persistent debugging investigation over the past three months.
Also, this shows how powerful Chrome's Canary channel is, since it allows you to do this style of experimental debugging with very quick turnarounds (mere days).