RTL titles

April 19, 2011

(Here's a post from some months ago. I think I'm not writing new posts because I've been sitting on this one for so long, so perhaps it's for the best that I just publish it.)

I've been away for a while. Part of my travels involved a hackfest in Tel Aviv for right-to-left text in WebKit. My RTL knowledge of WebKit is minor -- I have done a decent bit of hacking on the Linux Chrome rendering code for complex text in WebKit -- but as a language enthusiast (I like to tell people: by college degrees, I am as qualified a linguist as I am a computer scientist!), I am already familiar with the bidi algorithm and I've studied some Arabic.

(I know some properties of Hebrew but not the alphabet, but in Tel Aviv all the street signs are Rosetta stones of Hebrew/Arabic/Roman scripts; by the end of my week I proudly identified and boarded the shared taxi to Jerusalem by reading the Hebrew sign in the window.)

Upon arriving in Tel Aviv, we assigned the important bugs to the more talented WebKit hackers, while I picked a relatively minor bug in the hope that I could churn through a few of them. By the end of the week I hadn't fixed even that one bug. I did, however, make two refactoring pre-changes and got another one in just at the finish line that touched 53 files. Somehow with me it is always a yak shave.

The bug seemed pretty simple: WebKit doesn't understand <title dir="rtl">RTL titles</title>.

To start with, what does this actually mean? Here's an attempt to explain as briefly as possible, glossing over details; bidirectional text is actually pretty gnarly.

First, some background for bidi beginners. You should know that text is always stored in its logical order: the first letter of an in-memory string of Hebrew is the first letter of where you'd start reading, on the right. The same is true for the order of the characters as written in an HTML document. When rendering a string that contains both right-to-left (RTL) and left-to-right (LTR) text, you end up "reversing" bits of it. The algorithm for this reversal is called the bidi (bidirectional) algorithm, and it is complex and interesting but out of scope for this post.

In discussions of bidi, the convention is to write the characters that should be RTL (representing a language like Hebrew) in uppercase, and the LTR characters in lowercase. So, for example, the in-memory string foo BAR XYZ should appear as foo ZYX RAB. Critically, note that the last word of the source string ends up in the middle of the rendered string. Omitting many details, at a conceptual level you're laying out a string of chunks left to right, and the BAR XYZ chunk should be rendered right to left as part of that.

I've already made an assumption there, though: I wrote that you're laying out the chunks left to right, but in a right-to-left document the overall layout order goes the other way. (The document starts from the right, after all.) foo BAR XYZ in a Hebrew document should render as ZYX RAB foo. This extra bit of metadata about the text -- inventing a term, the direction context of the string -- is just what the dir attribute of the title is for.
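To make the two renderings above concrete, here is a toy sketch in Python. It is emphatically not the real bidi algorithm: it assumes the uppercase-is-RTL convention from this post, treats every non-letter as a neutral, and ignores digits, mirroring, and explicit embeddings. The function name visual_order and its base parameter are made up for illustration. It does follow the rough shape of the real thing: give each character a level, then reverse runs from the highest level down.

```python
def visual_order(logical: str, base: str = "ltr") -> str:
    """Toy bidi reordering: uppercase letters are RTL, lowercase are LTR."""
    base_dir = "L" if base == "ltr" else "R"

    def strong(ch: str):
        # Strong direction of a character, or None for neutrals (spaces etc.).
        if ch.isalpha():
            return "R" if ch.isupper() else "L"
        return None

    dirs = [strong(ch) for ch in logical]
    n = len(dirs)

    # Neutrals sandwiched between two same-direction characters join them;
    # all other neutrals take the direction context of the whole string.
    i = 0
    while i < n:
        if dirs[i] is not None:
            i += 1
            continue
        j = i
        while j < n and dirs[j] is None:
            j += 1
        before = dirs[i - 1] if i > 0 else base_dir
        after = dirs[j] if j < n else base_dir
        fill = before if before == after else base_dir
        dirs[i:j] = [fill] * (j - i)
        i = j

    # RTL text sits at level 1. LTR text sits at level 0 in an LTR context,
    # or at level 2 (nested inside the RTL context) in an RTL context.
    ltr_level = 0 if base == "ltr" else 2
    levels = [1 if d == "R" else ltr_level for d in dirs]

    # From the highest level down to 1, reverse each maximal run of
    # characters whose level is at least the current one.
    chars = list(logical)
    for lvl in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < lvl:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= lvl:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            i = j
    return "".join(chars)


print(visual_order("foo BAR XYZ", "ltr"))  # foo ZYX RAB
print(visual_order("foo BAR XYZ", "rtl"))  # ZYX RAB foo
```

The only input that differs between the two calls is the direction context, which is exactly the piece of information the dir attribute carries.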

WebKit can display plenty of RTL sites just fine, so it necessarily gets all of the above details correct in web content. The problem that my bug was about is that WebKit generally isn't responsible for rendering title tags -- they're handled by the browser.

Aside: amusingly, a stylesheet containing

head { display: block } title { display: block }

will cause the title to display in the page. It sorta makes sense as soon as you see it.

For the browser to render the title correctly, it needs the same information that WebKit has -- the text and its direction. (We had many discussions about the proper way to display an RTL title in an LTR browser -- e.g., do you move the favicon too?) To fix this bug I "just" needed to (1) get the direction as specified on the tag (or any parent of the tag, or CSS, or ...); (2) plumb that extra metadata out to the browser; (3) make use of that extra metadata browser-side (e.g. flip the text layout direction when appropriate).
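Those three steps can be sketched in a few lines of Python. This is a hypothetical illustration, not WebKit's or Chrome's actual code: Element, TitleWithDirection, resolve_direction, export_title, and format_for_ui are all invented names, and the sketch only looks at dir attributes on the tag and its ancestors, ignoring CSS entirely. For step 3 it assumes the browser UI hands plain strings to a text renderer that honors Unicode directional formatting characters, which is only one of several ways a browser might use the metadata.

```python
from dataclasses import dataclass
from typing import Optional

# Unicode directional formatting characters.
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element."""
    tag: str
    dir_attr: Optional[str] = None
    parent: Optional["Element"] = None


@dataclass(frozen=True)
class TitleWithDirection:
    """The title text together with its direction context."""
    text: str
    direction: str  # "ltr" or "rtl"


def resolve_direction(el: Optional[Element], default: str = "ltr") -> str:
    # Step 1: use the nearest dir attribute on the tag or any ancestor.
    node = el
    while node is not None:
        if node.dir_attr in ("ltr", "rtl"):
            return node.dir_attr
        node = node.parent
    return default


def export_title(title_el: Element, text: str) -> TitleWithDirection:
    # Step 2: plumb the direction out alongside the text.
    return TitleWithDirection(text, resolve_direction(title_el))


def format_for_ui(title: TitleWithDirection) -> str:
    # Step 3: wrap the text so the UI lays it out in the right context.
    mark = RLE if title.direction == "rtl" else LRE
    return f"{mark}{title.text}{PDF}"


html = Element("html", dir_attr="rtl")
head = Element("head", parent=html)
title_el = Element("title", parent=head)
title = export_title(title_el, "foo BAR XYZ")
print(title.direction)  # rtl (inherited from <html>)
```

Step 2 is the deceptively expensive one: everywhere a bare string used to be passed around, the pair now has to be passed instead.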

Unfortunately, titles are used in a variety of places within WebCore -- as an attribute of documents, of course, but also in data structures related to loading pages, history, and in APIs used to communicate state information up to the hosting environment. Touching all of these resulted in a larger patch than I'd anticipated.

I first landed some manageable chunks of it: a small refactoring and then a larger one; it's good I landed them separately because I got the logic in the latter wrong and needed to quick-fix it (in my defense, part of the problem is that the code has a related bug).

Then comes the monster change-the-world patch where I swap out the type of a core object; despite my best efforts at modifying nine WebKit ports simultaneously I managed to break the GTK build and Qt build and the Qt build a second time.

That got me to the point where the data was exposed to the WebKit-internal platform layer, but not through any public APIs; so next was exposing that through the Chromium WebKit API, which is also the layer the testing interface uses, so it allowed me to write a test. And finally, I screwed up that patch too, necessitating another quick fix.

And with all that in place, I then turned to the Chrome-side implementation...

...and discovered the chicken-and-egg problem of new HTML specs: because nobody implements <title dir>, few sites make use of it. In fact, I did find a site that frequently had titles that mixed LTR and RTL text, exactly the sort that would benefit from my change, and the site did use the dir attribute on a <title> tag, despite it not having any effect in browsers -- but the site used it in a way that was exactly backwards; my locally patched browser that obeyed the attribute made the site strictly worse. The site? Google Israel.

And with no better conclusion than that, this post has sadly languished unpublished on my laptop and the work is in a similar state. Now that I look, it seems my bug report about Google Israel was fixed, so perhaps it's worth trying again to land my patch. But at this point I am suspicious of the sunk cost fallacy: I had picked this bug because it was so minor that we had guessed it would be easy, and it's perhaps not worth much more time when I have a limitless stream of more important bugs.

Your reward for reading this long and anticlimactic post is an example of one of the many ways software can get RTL wrong. In this image, I constructed a page with a specially-crafted title, which Chrome naively formats as "$PAGE_TITLE - Google Chrome" and then hands it on to the OS. The image shows what happens when I alt-tab.