About
I wanted to conduct a census of the languages people are using on LiveJournal, using the publicly available journal data.
To make it interesting, I wrote my own software to do it.
Results
My latest results, collected over the week ending 29-Jan-2003, cover a little over 360,000 analyzed posts:
| Language | Percentage |
|---|---|
| English | 77% |
| Skipped | 16% |
| Russian | 5% |
| Unsure | 1% |
Table 1: Primary Results.
"Skipped" entries were too short to analyze: they did not contain enough text for a useful analysis. "Unsure" entries are those where the language identification was not within a certain threshold.
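The two special categories above can be sketched as a small decision function. This is only an illustration, not the code from the published software: the names `label_entry`, `MIN_CHARS`, and `MIN_MARGIN` are hypothetical, the cutoff values are made up, and treating the "threshold" as a margin between the best and second-best score is an assumption.

```python
from typing import Dict

# Hypothetical cutoffs, chosen only for illustration.
MIN_CHARS = 50      # entries shorter than this are "Skipped"
MIN_MARGIN = 0.1    # required lead of the best score over the runner-up


def label_entry(text: str, scores: Dict[str, float]) -> str:
    """Return a language name, "Skipped", or "Unsure" for one entry.

    `scores` maps language name to a score where higher means a better
    match; how those scores are produced is outside this sketch.
    """
    # Too little text to make a useful analysis.
    if len(text.strip()) < MIN_CHARS:
        return "Skipped"

    # Rank candidate languages from best to worst score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return "Unsure"

    best_lang, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0

    # Assumed meaning of the threshold: the winner must lead clearly.
    if best_score - runner_up < MIN_MARGIN:
        return "Unsure"
    return best_lang
```

With these made-up cutoffs, a two-character entry is labeled "Skipped" regardless of its scores, and a long entry whose top two scores are nearly tied is labeled "Unsure".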
All other languages included in the survey each accounted for less than half a percentage point of the entries surveyed. About 3000 posts were identified as one of these languages, broken down as follows:
| Language | Percentage |
|---|---|
| German | 45% |
| Spanish | 13% |
| French | 12% |
| Portuguese | 8% |
| Italian | 6% |
| Belarusian | 5% |
Table 2: Uncommon Languages.
However, many of the less-common languages are probably misrepresented, due to a lack of training data and posts.
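As a rough cross-check of the figures above, the Table 2 shares can be converted back into approximate post counts and shares of the whole survey. The totals used here are the rounded numbers quoted in the text ("a little over 360,000" and "about 3000"), so the output is approximate; the function name `overall_share` is just an illustrative choice.

```python
# Rounded totals taken from the text; both are approximations.
TOTAL_POSTS = 360_000
UNCOMMON_POSTS = 3_000

# Share of the uncommon-language posts, in percent (Table 2).
TABLE_2 = {
    "German": 45,
    "Spanish": 13,
    "French": 12,
    "Portuguese": 8,
    "Italian": 6,
    "Belarusian": 5,
}


def overall_share(pct_of_uncommon: float) -> float:
    """Convert a Table 2 percentage into a percentage of all posts."""
    posts = UNCOMMON_POSTS * pct_of_uncommon / 100
    return 100 * posts / TOTAL_POSTS


for lang, pct in TABLE_2.items():
    posts = round(UNCOMMON_POSTS * pct / 100)
    print(f"{lang}: ~{posts} posts, ~{overall_share(pct):.2f}% of all posts")
```

Under these rounded totals, even the largest entry (German) works out to roughly 1,350 posts, which is under half a percentage point of the full survey.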
Software
I'm publishing the software I wrote under the BSD license.
You can browse the code in the arch repository martine@danga.com--2004/langid--dev--0.1.
Check it out with these commands:
    tla register-archive http://neugierig.org/software/arch
    tla get martine@danga.com--2004/langid--dev--0.1
The generated documentation for the internal library is in html/.
Evan Martin, martine@danga.com