Tracking down a mysterious Windows crash

August 31, 2011

Today's post is a guest post from Eric Roman. He wrote a slightly more expletive-laden version of this post inside Google and I asked him if I could post it here. It serves as a good illustration of how deploying software on more than a hundred million different users' computers is a nearly-biological fight.

Understanding stability of Windows applications is really hard.

Recently, @apatrick has concluded a heroic investigation into one of Chrome's most mysterious top crashes for Windows. He just committed a "fix" for it on the canary channel, which seems to be working!

This debugging journey is pretty epic, so I'm giving a blow-by-blow summary of it below. If you want to jump straight to the conclusion, see r96807 for the spoiler (can also read comments in bug 81449 for the full novella).


Our story begins a year ago, when we first began tracking crashes in a generic RunnableFunction<>::Run() location. These crashes had established themselves as a top browser crash for Windows Chrome. At the time, Huan and I unsuccessfully looked into the issue, but couldn't make heads or tails of it (bug 54307). The basic format of the crash looked something like this:

0377fd78 0377fdb0 0x0
0377fd7c 0254c3c3 0x377fdb0
0377fd84 021ba1ae chrome_1c30000!RunnableFunction<void (__cdecl*)(void *),Tuple1<void *> >::Run+0xc
0377fd8c 021baa21 chrome_1c30000!`anonymous namespace'::TaskClosureAdapter::Run+0xb
0377fdb0 021baaa6 chrome_1c30000!MessageLoop::RunTask+0x81
0377fdc0 021bae47 chrome_1c30000!MessageLoop::DeferOrRunPendingTask+0x28
0377fdf8 021d1b24 chrome_1c30000!MessageLoop::DoWork+0x80
0377fe24 021ba960 chrome_1c30000!base::MessagePumpDefault::Run+0xc2
0377fe30 021ba8e5 chrome_1c30000!MessageLoop::RunInternal+0x31
0377fe38 021ba7d9 chrome_1c30000!MessageLoop::RunHandler+0x17
0377fe58 021c6530 chrome_1c30000!MessageLoop::Run+0x15
0377fe5c 021c6643 chrome_1c30000!base::Thread::Run+0x9
0377ffa8 021c1a2f chrome_1c30000!base::Thread::ThreadMain+0xa1
0377ffb4 7c80b713 chrome_1c30000!base::`anonymous namespace'::ThreadFunc+0x16
0377ffec 00000000 kernel32!BaseThreadStart+0x37

Above we can see that the crash is due to jumping to instruction pointer of 0. And judging by the top frames, it looks like there was some stack corruption at work (notice how the first frame's alleged return address is actually a stack location).

The callstack itself isn't terribly helpful though, since we can't tell what code was actually running prior to the crash (it was gobbled up by the stack corruption).

Moreover, the source location of RunnableFunction<>::Run() doesn't help narrow things in the slightest, since pretty much all of Chrome's code runs through this path (Chrome relies heavily on message passing to post asynchronous tasks to another thread's message loop).

Really, all we know at this point is that "some task" got posted to "some thread", and then crashed at "some point" after running this task.

There is one interesting piece of information that can be inferred from the minidumps: based on the crashed thread's index, it is likely the crashing task was running on Chrome's "Child process launcher" thread. In fact, a subsequent instrumentation (r58786) confirmed that all of these crashes were happening on the child process launcher thread. Since there is very little code that legitimately runs on this thread, we did a full code-flow analysis of all the paths that could post tasks. But we did not discover any problematic codepaths (I was hoping to stumble across something bad like a use-after-free).

Without any extra leads, (as well as a temporary dip in the crash's frequency lowering its priority) the mystery bug got pushed onto the back-burner.

It lay dormant for the next 9 months, waiting for a new champion to take up arms.

Part II

In May, Al Patrick (an innocent bystander), is assigned the bug on suspicion that it is a regression. (At this point the callstack has morphed a bit due to various optimization ambiguities, but it is essentially the same bug I had failed to solve earlier).

Our new hero starts off by adding some instrumentation trying to see if a bad function pointer is ever being directly passed to a runnable function (r85359). This doesn't turn up anything salient.

Next he instruments posted tasks to retain the location where they got posted from (r85991). This change is absolutely brilliant! I have benefited from it many times since it was introduced, to help debug crash dumps.

The instrumentation is cheap yet effective: whenever posting a task, the FROM_HERE macro (that was being used in debug builds to save filename/line numbers) now saves the instruction pointer into the PendingTask. Later when the task is de-queued by the target thread's message loop, this same instruction pointer (i.e. the birthplace of the task) is pushed onto the stack prior to calling the task's virtual function. That way should a crash happen later, you can poach the address of the birthplace off the stack during postmortem dump analysis!

This instrumentation reveals that the problem tasks were posted by ChildProcessLauncher::Context::Terminate(), suggesting that the task itself was a runnable function on ChildProcessLauncher::Context::TerminateInternal().

So far so good, but we still have no idea why it is crashing.

Next, our hero instruments the base runnable function/method tasks in Chrome to detect use-after-frees (r86447), as well as other memory molestation (by preserving the value of the function pointer into the minidump). This instrumentation reveals that not only was the task alive and well at the time it was run, but the function pointer was also untouched. This is a major breakthrough in the investigation, since it tells us conclusively that whatever craziness has happened, it occurred while executing ChildProcessLauncher::Context::TerminateInternal().

This is where things start to get weird. Looking at TerminateInternal (the function that is blowing up) there is hardly any code:

025e0cea push ebp
025e0ceb mov ebp,esp
025e0ced mov eax,dword ptr [ebp+8]
025e0cf0 mov dword ptr [ebp+8],eax
025e0cf3 test eax,eax
025e0cf5 je chrome_1c30000!ChildProcessLauncher::Context::TerminateInternal+0x16 (025e0d00)
025e0cf7 push 0
025e0cf9 push eax
025e0cfa call dword ptr [chrome_1c30000!_imp__TerminateProcess (02ba56f4)]
025e0d00 push esi
025e0d01 lea esi,[ebp+8]
025e0d04 call chrome_1c30000!base::Process::Close (0289e055)
025e0d09 pop esi
025e0d0a pop ebp
025e0d0b ret

And the code for base::Process::Close() which it references is also pretty simple:

0289e055 cmp dword ptr [esi],0
0289e058 je chrome_1c30000!base::Process::Close+0x1d (0289e072)
0289e05a push edi
0289e05b mov edi,dword ptr [esi]
0289e05d call dword ptr [chrome_1c30000!_imp__GetCurrentProcess (02ba56f0)]
0289e063 cmp edi,eax
0289e065 je chrome_1c30000!base::Process::Close+0x19 (0289e06e)
0289e067 push edi
0289e068 call dword ptr [chrome_1c30000!_imp__CloseHandle (02ba561c)]
0289e06e and dword ptr [esi],0
0289e071 pop edi
0289e072 ret

So basically all that we are doing is killing the process, by calling a handful of win32 API functions. Hmmm.

Al theorizes that someone may be hooking one of the winapi calls (perhaps kernel32!TerminateProcess), and that whatever code it is running in response to that call is responsible for the stack corruption. For instance if the hooker used the wrong calling convention (not stdcall), that could be corrupting our stack upon return!

So how is this guy "hooking" the API call?

There are a lot of different ways you could intercept Windows API calls, and I am definitely no expert to explain them all. You could for instance do things like directly patch the code in the system DLL (in user land, or even on disk). Or re-write the binary to substitute all the target function calls with your new one. But definitely the simplest and most intuitive way to hook is to just patch the import address table. (You can see how that works in the code above — our compiled code doesn't call directly into kernel32, but rather it jumps to the address listed from the import table table... that is the address you would be patching if you wanted to intercept the call).

To this end, Al adds yet more instrumentation, this time to try and detect if TerminateProcess is being hooked via the import address table (r96266). Unfortunately this instrumentation doesn't reveal any smoking gun yet.

Ricardo makes a good observation — a hook on CloseHandle() is perhaps more likely than a hook on TerminateProcess, based on the position of the 0 that appears on the stack (bug 81449). Plus, there is probably more value in hooking CloseHandle over TerminateProcess.

Finally, Al commits a changelist that bypasses the address table altogether for both CloseHandle and TerminateProcess, and instead calls the underlying implementation in ntdll.dll directly: r96807. This is not quite as efficient, but not a huge deal either due to the low frequency of these calls.

This workaround appears to have thwarted the bad hooker, and so far there hasn't been a single crash of this sort in the Windows Chrome canary!

It remains unclear which of the two was being hooked or why. The workaround is a pretty ridiculous thing to have to do, but the bypass could shave off as much as 10% of our Windows browser crashes (yes, this crash really was that high in some releases)! It is likely the hooker is malware — we could copy a fragment of the hooked code into our minidump to try and learn more.

Ideally we want to alert the user about these sorts of problems, rather than papering-over them with workarounds (since they indicate a real problem with their underlying system). But we don't have a good mechanism to do that yet (see bug 72239 for some proposals).

In summary, kudos to @apatrick for his excellent and persistent debugging investigation over the past three months.

Also this shows how powerful Chrome's Canary channel is, since it allows you do this style of experimental debugging with very quick turn-arounds (mere days).