Move Over, English
Though I was at a conference, then deathly sick (note the use of hyperbole), when David Sifry came out with his State of the Blogosphere Part 2: On Language and Tagging, I think there is still important data here to report for the record. David’s ability to cut through the information in an index of 37.3 million blogs and bring coherent thought to the table is a gift he shares several times a year, and we should take advantage of it to get the big picture of how our lives are changing.
For this post, I choose to focus on the analysis of the language data.
He begins by offering a few disclaimers about the data set he’s about to present: three important caveats that he reminds us to keep in the foreground when studying his data.
- First, the automated language-detection software they use may not be perfect and may over- or undercount a particular language or group of languages because of bugs within the software. He follows that comment with a statement that Technorati nevertheless feels fairly confident in its reliability across the millions of blogs and posts it indexes each day.
- Second, one part of the blogosphere that Mr. Sifry is certain is being under-reported is posts and blogs written in Korean, because the main Korean blogging services are not indexed by Technorati at this time. Also undercounted, to a lesser degree, are French-language blogs and posts, because Technorati does not yet have a good system for indexing Skyblog.
- Third, Japanese bloggers write shorter posts, possibly because of their predilection for posting from mobile phones. This could skew the results that follow, making the Japanese numbers higher, because the data tracks the quantity of posts, not their length.
Within these caveats, Dave Sifry also offers this invitation:
if anyone at these (or other) blogging services is interested in being indexed, please drop me a line.