wstring removal

July 22, 2010

Chrome, in one of the few truly Windows-specific places in its design, was originally written using C++ wstrings throughout. wstring is a string of wchar_t and generally represents Unicode text, but originally Chrome used wstrings everywhere, even for plain ASCII values like the names of command-line switches. wstrings are UCS-2* on Windows and UCS-4 on other platforms, which makes them very convenient on Windows, where you can pass them directly to native APIs, and much less sensible everywhere else.
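
To make the platform difference concrete, here's a trivial check; nothing Chrome-specific, just what the compiler says about wchar_t:

    #include <cstdio>

    int main() {
      // MSVC on Windows: 2. gcc/clang on Linux and Mac: 4. The same
      // std::wstring therefore holds a different encoding depending
      // on where you compile it.
      std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
      return 0;
    }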

I and others have slowly been removing and untangling them from the code, but it is slow going. See, for example, how we're up to comment 47 on this bug. Every time you call into another module, that module frequently expects a wstring, so you need to do string conversions. Even worse, people continue to add more code that uses wstrings, along with some TODO like "fix this for non-Windows platforms". It's a standard technical-debt situation: you're trying to finish your feature, and worrying about the proper string type is at the bottom of your stack.
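
The conversion churn at those module boundaries looks something like this. Everything here is made up for illustration; Chrome's real conversion helpers live in base/, and the stubs below are ASCII-only just to keep the sketch self-contained:

    #include <string>

    // Stand-ins for real conversion helpers; ASCII-only stubs.
    std::wstring UTF8ToWide(const std::string& utf8) {
      return std::wstring(utf8.begin(), utf8.end());
    }
    std::string WideToUTF8(const std::wstring& wide) {
      return std::string(wide.begin(), wide.end());
    }

    // A hypothetical legacy module that still speaks wstring.
    static std::wstring g_title;
    void SetWindowTitle(const std::wstring& title) { g_title = title; }
    std::wstring GetWindowTitle() { return g_title; }

    // New code that keeps its strings in UTF-8 has to thunk at every
    // boundary crossing, in both directions.
    std::string CopyTitle(const std::string& utf8_title) {
      SetWindowTitle(UTF8ToWide(utf8_title));
      return WideToUTF8(GetWindowTitle());
    }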

However, we still need to deal with Unicode text; if not with wstrings, then what? One common approach is to use UTF-8 everywhere, which is what I had argued for, but there are two good arguments against it. One is that UTF-16 is the native string type of JavaScript, WebKit, and Windows, and the fewer encoding conversions the better. The more interesting argument is that programmers, even the rare sort who understand encoding issues, will always make mistakes and will mix up strings of ASCII, bytes, and UTF-8 without thinking about the consequences.** (This can perhaps be mitigated by a separate u8string type, but hey, I lost this argument.) The conclusion for Chrome was to migrate to UTF-16 strings where Unicode needs to be stored. Since e.g. myutf16string.data() gives you back a pointer of the wrong type to pass to fopen(), it's really hard to screw up.
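
Here's the gist of why that's hard to screw up. The typedefs below approximate Chrome's string16 (base/string16.h); the real header also supplies char traits for the non-wchar_t case, which I've omitted:

    #include <cstdint>
    #include <string>

    // wchar_t on Windows, a plain 16-bit type everywhere else.
    #if defined(_WIN32)
    typedef wchar_t char16;
    #else
    typedef uint16_t char16;
    #endif
    typedef std::basic_string<char16> string16;

    void Open(const string16& filename) {
      // filename.data() is a const char16*, not a const char*, so the
      // obvious mistake refuses to compile -- you have to convert, and
      // therefore think about the encoding, first:
      //
      //   FILE* f = fopen(filename.data(), "r");  // error: type mismatch
    }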

My cleanup approach of late has been to limit ongoing damage: I make it so that new code that does the wrong thing is more painful to write. This is accomplished by removing functions that accept wstrings from the lowest-level libraries, which means higher-level code that still uses wstrings has to keep thunking in and out of them. For example, our path abstraction (UTF-16 on Windows, UTF-8 on Mac, raw bytes on Linux) has methods ::ToWStringHack() and ::FromWStringHack(), which hopefully make my colleagues feel bad every time they have to use them.
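
A stripped-down sketch of that abstraction; the real class is Chrome's FilePath, OS_WIN comes from Chrome's build configuration, and plenty of detail is elided:

    #include <string>

    class FilePath {
     public:
    #if defined(OS_WIN)
      typedef std::wstring StringType;  // UTF-16 on Windows
    #else
      typedef std::string StringType;   // UTF-8 on Mac, raw bytes on Linux
    #endif

      explicit FilePath(const StringType& path) : path_(path) {}
      const StringType& value() const { return path_; }

      // Escape hatches with deliberately embarrassing names: every
      // call site advertises code that still needs fixing.
      std::wstring ToWStringHack() const;
      static FilePath FromWStringHack(const std::wstring& wstring);

     private:
      StringType path_;
    };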

I worry that my anti-wstring crusade is a losing battle, wasting more time than it's saving, and one that will never end, since it's not urgent to fix (I mostly work on it while waiting for other, larger projects to progress; one of our open source contributors has also been helping***). Nor is it technically that important: sure, we're wasting some memory, but the proper way to approach memory consumption is through measurement, tackling the big parts first (converting all the switch constants from four-byte wchar_ts to single-byte ASCII shaved a couple of kilobytes off our binary); sure, some users have file names that are bytes instead of Unicode, but they can't be that common. But for me at least, it's more about the principle of the thing: being able to tell that two different collections of bytes are different things that shouldn't be mixed is what separates us from the animals.
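
That switch-constant change, for the record, is about as mechanical as a fix gets (the switch name here is hypothetical):

    // Before: four bytes per character on Linux and Mac.
    // const wchar_t kEnableFoo[] = L"enable-foo";

    // After: one byte per character everywhere.
    const char kEnableFoo[] = "enable-foo";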

* I know that in theory they're UTF-16, but in practice few programmers ever get that right. For all practical purposes you're screwed if you're not in the BMP.

** "You should just write the code correctly" is never a good response to problems like these. Everyone already writes code as bug-free as they can; our objective should be to make the compiler help catch mistakes.

*** It turns out that even if you're not a very experienced programmer you can productively hack on large projects like the kernel or Chrome; you just need to tackle small tasks.