wstring removal
Chrome, in one of the few truly Windows-specific places in its design, was originally written using C++ wstrings throughout. A wstring is a string of wchar_t and generally represents Unicode text, but in Chrome they were originally used everywhere, even for plain ASCII values like the names of command line switches. wstrings are UCS-2* on Windows and UCS-4 on other platforms, which makes them very convenient on Windows, where you can pass them directly to native APIs, and much less sensible everywhere else.
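To make the platform difference concrete, here's a minimal sketch (not Chrome code) of why the same std::wstring is handy on Windows and awkward elsewhere; the mention of CreateFileW is just one example of a native "W" API:

    #include <cstdio>
    #include <string>

    int main() {
      std::wstring path = L"C:\\temp\\hello.txt";
      // wchar_t is 2 bytes on Windows (UCS-2/UTF-16) and 4 bytes on
      // Linux/Mac (UCS-4), so this prints 2 or 4 depending on the platform.
      std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
      // On Windows, path.c_str() can be handed straight to a native API such
      // as CreateFileW(); on other platforms nothing wants this buffer.
      (void)path;
      return 0;
    }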
I and others have slowly been removing and untangling them from the code, but it is slow going. See, for example, how we're up to comment 47 on this bug. Every time you call into a new module, that module will frequently expect a wstring, so you need to do string conversions. Even worse, people continue to add more code that uses wstrings along with some TODO like "fix this for non-Windows platforms". It's a standard technical debt sort of thing: you're trying to finish your feature, and worrying about the proper string type is at the bottom of your stack.
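The conversion churn at a module boundary looks roughly like this; the function names and helper signatures below are purely illustrative (Chrome keeps its own conversion helpers in base), and the bodies are omitted:

    #include <string>

    // Hypothetical lower-level module that still speaks wstring:
    std::wstring GetDownloadDirectory();

    // Illustrative conversion helpers of the sort every such codebase grows:
    std::string WideToUTF8(const std::wstring& wide);
    std::wstring UTF8ToWide(const std::string& utf8);

    // New cross-platform code that wants UTF-8 ends up thunking at the edge:
    std::string GetDownloadDirectoryUTF8() {
      return WideToUTF8(GetDownloadDirectory());
    }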
However, we still need to deal with Unicode text; what, if not wstrings? One common approach is to use UTF-8 everywhere, which is what I had argued for, but there are two good arguments against this. One is that UTF-16 is the native string type of JavaScript, WebKit, and Windows, and the fewer encoding conversions the better. The more interesting argument is that programmers, even the rare sort of programmer who understands encoding issues, will always make mistakes and will mix up strings of ASCII, bytes, and UTF-8 without thinking about the consequences.** (This can perhaps be mitigated by a separate u8string type, but hey, I lost this argument.) The conclusion for Chrome was to migrate to using UTF-16 strings when necessary to store Unicode. Since e.g. myutf16string.data() gives you back a pointer of the wrong type to pass to fopen(), it's really hard to screw up.
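A rough sketch of that type-safety argument, using std::u16string as a stand-in for Chrome's UTF-16 string type (the names here are illustrative, not Chrome code):

    #include <cstdio>
    #include <string>

    int main() {
      std::u16string path = u"notes.txt";  // UTF-16 in memory
      // fopen() wants a const char*, but path.data() is a const char16_t*,
      // so mixing them up is a compile error rather than garbage at runtime:
      //   std::fopen(path.data(), "rb");  // error: cannot convert char16_t*
      // You're forced to convert explicitly, and the conversion is exactly
      // where you have to think about what encoding the consumer expects.
      (void)path;
      return 0;
    }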
My cleanup approach of late has been one of trying to limit ongoing damage: I make it so if you add new code that does the wrong thing, your code is more painful to write. This is accomplished by removing functions that accept wstrings from the lowest-level libraries, which means when writing higher-level code you have to keep thunking in and out of wstrings. For example, our path abstraction (UTF-16 on Windows, UTF-8 on Mac, bytes on Linux) has methods ::ToWStringHack() and ::FromWStringHack(), which hopefully make my colleagues feel bad every time they have to use them.
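For flavor, here's a rough sketch of what that path abstraction looks like; this is a simplified illustration rather than the real class, and the conversion bodies are omitted:

    #include <string>

    class FilePath {
     public:
    #if defined(_WIN32)
      typedef std::wstring StringType;  // UTF-16 on Windows
    #else
      typedef std::string StringType;   // UTF-8 on Mac, raw bytes on Linux
    #endif

      explicit FilePath(StringType path) : path_(path) {}

      // The escape hatches: the names advertise that round-tripping through
      // wstring is the wrong long-term answer, so every call site reads as debt.
      std::wstring ToWStringHack() const;                            // body omitted
      static FilePath FromWStringHack(const std::wstring& wstring);  // body omitted

     private:
      StringType path_;
    };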
I worry that my anti-wstring crusade is a losing battle (wasting more time than it's saving), and one that will never end since it's not urgent to fix (I mostly work on it while waiting for other larger projects to progress; one of our open source contributors has also been helping***). Nor is it technically that important: sure, we're wasting some memory, but the proper way to approach memory consumption is through measurement and tackling the big parts first (converting all the switch constants from four-byte wchars to single-byte ASCII shaved off a couple kb from our binary); sure, some users have file names that are bytes instead of Unicode, but they can't be that common. But for me at least, it's more about the principle of the thing: being able to tell that two different collections of bytes are different things that shouldn't be mixed is what separates us from the animals.
* I know in theory they are UTF-16 but in reality few programmers ever get that right. For all practical purposes you're screwed if you're not in the BMP.
** "You should just write the code correctly" is never a good response to problems like these. Everyone always writes the code as bug-free as they can; our objective should be to make the compiler help with catching mistakes.
*** It turns out that even if you're not a very experienced programmer you can productively hack on large projects like the kernel or Chrome; you just need to tackle small tasks.