Language Identification

(last updated: 2004-01-29.)

About

I wanted to conduct a census of the languages people write in on LiveJournal, based on the publicly available journal data.

To make it interesting, I wrote my own software to do it.

Results

My latest results were collected over the week ending 29-Jan-2003. The software analyzed a little over 360,000 posts, with the following breakdown:

Language    Percentage
English     77%
Skipped     16%
Russian     5%
Unsure      1%

Table 1: Primary Results.

"Skipped" entries were too short, containing too little text for a useful analysis. "Unsure" entries are those where the language identification did not meet a certain threshold.

Every other language included in the survey accounted for less than half a percentage point of the entries surveyed. About 3000 posts were identified as one of these languages, broken down as follows:

Language      Percentage
German        45%
Spanish       13%
French        12%
Portuguese    8%
Italian       6%
Belarusian    5%

Table 2: Uncommon Languages.
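
The percentages in Table 2 are shares of the roughly 3000 uncommon-language posts, not of all posts. A small sketch of the conversion to approximate post counts follows. It assumes exactly 3000 uncommon-language posts and exactly 360,000 posts overall, both of which are rounded figures from the text, so the outputs are rough estimates only. The function and constant names are made up for this example.

```python
# Rounded figures taken from the text; treat every result as approximate.
TOTAL_POSTS = 360_000
UNCOMMON_POSTS = 3_000

# Table 2 shares, in percent of the uncommon-language posts.
TABLE_2 = {
    "German": 45,
    "Spanish": 13,
    "French": 12,
    "Portuguese": 8,
    "Italian": 6,
    "Belarusian": 5,
}


def approximate_counts(shares, subset_size):
    """Convert percentage shares of a subset into rough post counts."""
    return {lang: round(subset_size * pct / 100) for lang, pct in shares.items()}


counts = approximate_counts(TABLE_2, UNCOMMON_POSTS)
for lang, n in counts.items():
    # Also express each count as a share of all surveyed posts.
    print(f"{lang:<12}{n:>6}  ({100 * n / TOTAL_POSTS:.2f}% of all posts)")
```

The listed shares add up to 89%, so under these assumptions roughly a tenth of the uncommon-language posts belong to languages not shown in the table.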

However, many of the less-common languages are probably misrepresented, due to a lack of training data and posts.

Software

I'm publishing the software I wrote under the BSD license.

You can browse the code in the arch repository martine@danga.com--2004/langid--dev--0.1.

Check it out with these commands:

tla register-archive http://neugierig.org/software/arch
tla get martine@danga.com--2004/langid--dev--0.1

The generated documentation for the internal library is in html/.

Evan Martin, martine@danga.com