I wanted to conduct a census of the languages people are using on LiveJournal, using the publicly available journal data.
To make it interesting, I wrote my own software to do it.
My latest results, collected over the week ending 29-Jan-2003, cover a little over 360,000 posts.
Table 1: Primary Results.
"Skipped" entries were too short to analyze usefully (they did not contain enough text). "Unsure" entries are those where the language identification did not meet a confidence threshold.
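To make the "Skipped" and "Unsure" categories concrete, here is a minimal sketch in Python. It is not the code from langid: the character-trigram scoring, the function names, and both constants (MIN_CHARS, MIN_MARGIN) are assumptions chosen for illustration, since this page does not describe the actual algorithm or thresholds.

```python
import math
from collections import Counter

# Both thresholds are hypothetical values, not the ones langid uses.
MIN_CHARS = 40     # entries shorter than this are "skipped"
MIN_MARGIN = 0.05  # minimum score gap between the best and second-best language


def trigrams(text):
    """Split text into overlapping character trigrams, with whitespace normalized."""
    text = " " + " ".join(text.lower().split()) + " "
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """Build one trigram frequency profile per language from {language: sample text}."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def score(profile, grams):
    """Average log-probability of the trigrams under a profile (add-one smoothing)."""
    total = sum(profile.values())
    vocab = len(profile) + 1
    return sum(math.log((profile[g] + 1) / (total + vocab)) for g in grams) / len(grams)


def classify(text, profiles):
    """Return a language name, or "skipped" / "unsure" as in Table 1."""
    if len(text.strip()) < MIN_CHARS:
        return "skipped"
    grams = trigrams(text)
    ranked = sorted(((score(p, grams), lang) for lang, p in profiles.items()),
                    reverse=True)
    # Too close to call: the best language barely beats the runner-up.
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < MIN_MARGIN:
        return "unsure"
    return ranked[0][1]
```

With profiles built by train(), classify() returns "skipped" for very short entries and "unsure" when the two best-scoring languages are nearly tied. Comparing the best score against the runner-up is only one way to define the threshold; an absolute cutoff on the best score would also work.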
Each of the other languages included in the survey accounted for less than half a percentage point of the entries surveyed. About 3000 posts were identified as one of these languages:
Table 2: Uncommon Languages.
However, many of the less-common languages are probably misrepresented, because there was little training data for them and few posts written in them.
I'm publishing the software I wrote under the BSD license.
Check it out with these commands:
    tla register-archive http://neugierig.org/software/arch
    tla get email@example.com/langid--dev--0.1

The generated documentation for the internal library is in html/.

Evan Martin, firstname.lastname@example.org