Forking upstream software
Tom wrote a rant about the way we've incorporated other free software, and while I know it's easy to get carried away when you're mid-rant and talk in generalities, I thought it'd be interesting to look at the details.
First, some background. On Windows, your software is basically on its own: if the code you want isn't provided by Microsoft, it's your responsibility to package, ship, and update it. So for any third-party libraries we use (take sqlite as an example) we already must dedicate resources to integrating it into our build system, tracking security vulnerabilities upstream, and updating the code we use. This cost is fundamental to the endeavor, and any approach to software management beyond that (notably the Linux way) is by definition extra work to support. (That doesn't mean I don't want to change things, but just that doing so doesn't come for free.)
An interesting side effect of having all this code locally is that there is no longer a difference between your bugs and someone else's bugs: in tracking a bug down, you trace the code to the source of the problem and then you fix the bug, regardless of whose it is. Contrast this with development of software on any other platform, including Linux, where if there's a bug in Karmic's version of GTK the best we can say is "darn, let's file a bug upstream then attempt to find a workaround". (Which is what we've done; for a particular example I'll note my coworker Dan has written patches for two or three window managers, not that that helps any of our existing users.)
I believe this is generally the approach we have taken: do whatever it
takes to make the best product we can, contribute bits upstream where
we can. So let's take a stroll through the README
s in the
third_party
directories and see what we can find. As I type this
post I don't actually know the conclusion I'll reach, so this is for
my education as much as yours. (Edit: Now that I've gone through more
of this, I'll use the short phrase "build changes" to mean "changes
made to let this to build as part of our cross-platform build system,
ending up as a static lib that gets pulled into our final binary.")
-
apple
: two random files, no README. I have no idea what this is. The reviewer should have caught this before checking it in. Filed bug 29334. (Edit: now fixed.) -
bsdiff
: "This is Chrome's locally patched copy of Colin Percival's bsdiff tool [...] obtained from Mozilla's local copy/fork of bsdiff." The entirety of the code is a 382 line source file, so I can't be too bothered either way. Fork severity: harmless. -
bspatch
: just like bsdiff. 357 lines. -
bzip2
: "When updating, no modifications should be necessary aside from copying the subset of files included here." -
cld
: I believe this is an internal Google project. I wonder if was ever officially open sourced. Filed bug 29342 about unclear license. -
codesighs
: "There are no local changes to the code itself." -
expat
: Removed unused files. Two patches: one to fix a compiler warning on Windows, one to fix a crash. The README says it's version 2.7.0, but as far as I can tell from the home page and my Karmic install the latest version is 2.0.1 released in 2007? Perhaps it's a typo in the README. I tried to check if we'd reported the bug into their bug tracker but I get a server error when I click their bug tracker link. Fork severity: suspicious. -
ffmpeg
: Patched to hell. I believe we actually build this with gcc on Windows, it is so gnarly to build. The README is terrifying to me; it says we patch out some of its hand-written assembly so our crash catcher can function. Fork severity: fail.On the other hand, I don't think ffmpeg really cares too much; it seems from their download page they don't actually make releases, and instead recommend just pulling their trunk.
I asked a coworker about it and he said:
The good news is that of the 50 patches about 1/2 actually have made it upstream in one way or another (mostly by the FFMpeg devs themselves). I've been tracking it on a spreadsheet [...]
Why so many patches? I think a lot are from our security guys who found a lot of bugs after spending some time fuzzing it.
-
fuzzymatch
: A one-off C program agl hacked up. -
glew
: OpenGL stuff I don't quite understand. The local modification was for sandboxing reasons, but it is pretty invasive (much of the API now goes through an extra indirection) so it seems unlikely upstream would want it. Fork score: not good, but I think it's legit. -
harfbuzz
: I know this one because I work with it personally. We track upstream's git repo directly and have contributed both bug fixes and a bunch of code in upstream's "contrib/" directory. The last time I sent patches Behdad said something to the effect of: "you're basically the only people using this code so don't worry about it too much". I hope to retarget Chrome to the harfbuzz-ng rewrite this month. -
hunspell
: A variety of small patches for integrating it into the build (of the "remove reference to"config.h"
sort), "Change the input params of the constructors to receive a FILE* instead of a file path. This is required to use hunspell in the sandbox." Diffstat: 6 files changed, 28 insertions(+), 25 deletions(-).Additional major change: a different on-disk representation for a large memory and CPU performance impact. I asked the author of the Chrome patch and he said:
There's not really a mailing list or active project site as far as I could find. I emailed some random guy whose email address I found with some of my perf numbers and he responded with "oh thanks" but that's the extent of the communication.
-
icu
: See bug 28294, "use system ICU". The version of ICU we need is newer than the one provided on OS X or on Karmic (let alone Hardy, which we also want to support); we also continue to make improvements. We send these patches upstream. -
jemalloc
: Diffstat: 5 files changed, 38 insertions(+), 17 deletions(-). The local changes there aren't even up to snuff for acceptable Chrome changes, with commented-out bits. I don't believe we actually use this yet, but Mike has been experimenting with it. Fork severity: eh. -
lcov
: Upstream + a perl script. I think this is only used by the build bots. -
libevent
: Upstream + build system change. We carried a forked version of this for quite a while, due to a bug that caused libevent to use a large amount of extra memory per instance (ChangeLog says: "Saves up to 512K per epoll-based event_base."). We sent that patch upstream but it took a while for Niels to make a release carrying it. Upstream ChangeLog includes patches from multiple Chrome people (Dean and Adam). -
libjingle
: Lots of local patches. I ran into Sean (the author) the other day and he indicated libjingle is more or less unmaintained, but don't quote me on that. I see no attempt to upstream these patches; but I do see at least two unanswered bugs on the bug tracker with patches from Craig (one of the Chromium contributors). -
libjpeg
: Upstream version + fixes to integrate into our build. -
libpng
: Same as libjpeg. -
libxml
: Lots of changes! The README indicates we cherry-picked unreleased security fixes on top of the latest released version, but the version we carry is now older than trunk. Diffstat: 9 files changed, 310 insertions(+), 9 deletions(-). Majority of the diffs look like they're adding ICU support; I tried to see whether upstream supported this but in my ten seconds of looking I found this dispiriting old thread. (Edit: originally I wrote that upstream was opposed to supporting ICU; I see now I failed to read the context correctly, and that they welcomed an ICU patch and that the following quote that I used in the original post was about why upstream would prefer people to not ICU, not that they were completely opposed to it.)Mark> I'm not advocating that ICU be the default, I'm just curious why you feel vendors should not use it if it is present.
Daniel (upstream)> Very simple, I don't want to be perceived as depending on ICU.
Ironically, the person involved in that thread is none other than Mark Rowe, who I now know as one of the WebKit hackers. Maybe things have changed since 2007? Fork score: not sure. Even if upstream does allow a compile-time dependency on ICU, I doubt any distros would ship it that way. And we lose points for having other patches.
-
libxslt
: Build changes, one functional change for Windows. Unclear if that's upstream or not, nor why it's necessary in the first place ("Modified libxslt/security.c to use GetFileAttributesA instead of GetFileAttributes.") — perhaps for the sandbox? -
lighttpd
: Used for running HTTP-based tests. Just Mac/Win binaries. We historically only pull binaries if you're on the relevant platform; the fact I have this on my Linux box was probably an oversight. -
lzma_sdk
: README indicates only build changes. -
modp_64
: 913 lines of code. Build changes, including some that make it reach deep into shared Chrome headers. I am suspicious but it's also tiny. -
mozilla
: Mozilla headers needed to work with Java. Build changes. -
npapi
: Headers needed to work with Netscape plugins. Build changes. Rather forked. I don't think this is available as a standalone library, though. -
ocmock
: No modifications. -
openmax
: I don't know what this is. khronos.org reference makes me think WebGL or something. The README says it's headers. -
ots
: Written by the Chrome team, trying to be a library other browsers can use. -
protobuf2
: Written by Google. -
pyftpdlib
: A server used for testing FTP. Local patch is to remove a timing-based DoS protection, because the tests want to run as quickly as possible and so they hammer on it. Maybe this should be made into API and upstreamed? -
pywebsocket
: Written by the Chrome team, trying to be a library other browsers can use. WebKit uses it now I think. -
scons
: Build changes. But don't use scons anymore; we should probably drop this. -
simplejson
: Build changes. -
skia
: Written by Google. We push patches into it. -
sqlite
: Forked heavily. The full-text indexing support for SQLite (which is now upstream) was written by someone on the Chrome team, and so he feels comfortable making whatever changes needed. Fork severity: bad. Someone needs to spend some time cleaning this up.Concidentally, someone left an anonymous comment on Tom's post ranting about sqlite upstream, saying why they forked it as well; but I don't think this has been our experience at all; I met Richard once at a conference and he seemed like an ideal upstream — friendly and fixated on details.
-
tcmalloc
: Written by Google. I think there is a lot of experimentation going on here, so it's probably forked a bit. -
tlslite
: Used by tests. Includes a patch with bug fix, which references a thread on the user mailing list where someone else ran into the same problem. It's unclear if the patch has been upstreamed. Fork severity: not good. -
WebKit
: WebKit is a full half of the project's code weight, and we have made extensive changes to it. We now have many people on the project contributing to upstream WebKit and we now can build directly out of a copy of upstream's trunk with no local patches. More on upstreaming to WebKit in another post. -
xdg-utils
: We contributedxdg-settings
upstream, then pulled it back down into our tree. For the other xdg utilities we already use the one already on the system. It is not clear what the current relationship is between our copy and the upstream release; on the other hand, since we wrote the only file we use, it is unlikely anyone else needs it. Fork score: not so good. Filed bug 29769. -
yasm
: gnarly. Needed for ffmpeg. See ffmpeg above. -
zlib
: Fix a compile warning. Windows fixes. Not clear whether the changes are upstream.
There are a few other directories other than the top level for third-party code. Let's now go into them alphabetically:
-
base/third_party/dmg_fp
: As far as I can tell every project that has ever used this has forked it. -
base/third_party/icu
: Some basic UTF handling functions pulled from ICU (used for bits of code that don't want to depend on all of ICU). -
base/third_party/nspr
: This is just a couple of files, not all of nspr. The README is unclear. Filed bug 29764. -
base/third_party/purify
: The Windows equivalent of Valgrind. But there is no README. Looks like headers. Filed bug 29765. -
base/third_party/valgrind
: Header for working with Valgrind. It seems a little excessive to depend on a 24mb Valgrind package to build with this one file. -
base/third_party/xdg_mime
: Claims it contains one minor compilation fix, looks like working around a warning. -
base/third_party/xdg_user_dirs
: Build changes. Upstream designed this with the intent it is always copied into other projects, never built as a separate library. -
chrome/third_party/jstemplate
: Minor changes. There's not really a good build/packaging infrastructure for JavaScript libraries anyway, so this is hand-"optimized" (removed dead code, run through the Closure compiler). -
chrome/third_party/wtl
: Windows code. "These headers are almost the equivalent of WTL 8.0 fromhttp://sourceforge.net/projects/wtl/
with patch #1871358 applied." -
net/third_party/nss
: libssl from upstream. Used for staging various SSL changes for e.g. SPDY experiments; those changes eventually show up upstream. The NSS maintainer works on the Chrome team, and he ok'd this. -
net/third_party/parseftp
: Copied from Mozilla. Just build changes (remove NSPR, add namespace). I believe Pawel is working on this with Mozilla, but I don't follow this stuff closely.
...*whew*, that was exhausting! And I am actually rather surprised — I expected to find more projects hacked up like sqlite, when in fact that's one of the few projects we've actually forked in a meaningful way.