A 2015 count of Japanese word frequency

*Slowly emerges from underground cave* I’ve been doing computer things to large samples of Japanese text. To be more specific, I’ve been feeding the full contents of the Japanese Wikipedia to MeCab, R, Python, and several small shell scripts. It occurred to me that, while these things are at hand, it would be simple to make a new count of frequent Japanese words. So I did. You can see what it is like at this Wiktionary page. Full TSV tables are available for download: the count of lemmas (uninflected words), and of inflected word forms. New stuff about kanji is forthcoming. *Slowly submerges into cave*
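For the curious, here is a minimal sketch of how a lemma count and a word-form count can be tallied from MeCab's text output. It is not the actual pipeline behind the tables above. It assumes the IPAdic-style layout (surface form, a tab, then comma-separated features with the lemma in the seventh field); other dictionaries order their fields differently. The names `count_words`, `LEMMA_FIELD` and `SAMPLE` are made up for illustration, and the sample lines are hand-written rather than real analyser output.

```python
from collections import Counter

# Assumption: IPAdic-style feature columns, i.e.
# surface<TAB>pos,pos1,pos2,pos3,conj_type,conj_form,lemma,reading,pronunciation
LEMMA_FIELD = 6


def count_words(mecab_output):
    """Return (lemma_counts, surface_counts) for MeCab-style text output."""
    lemmas = Counter()
    surfaces = Counter()
    for line in mecab_output.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        if not features:
            continue
        fields = features.split(",")
        lemma = fields[LEMMA_FIELD] if len(fields) > LEMMA_FIELD else "*"
        # Unknown words carry "*" as their lemma; fall back to the surface form.
        if lemma == "*":
            lemma = surface
        surfaces[surface] += 1
        lemmas[lemma] += 1
    return lemmas, surfaces


# Hand-written sample in the assumed format.
SAMPLE = (
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
    "食べる\t動詞,自立,*,*,一段,基本形,食べる,タベル,タベル\n"
    "EOS\n"
)

lemma_counts, surface_counts = count_words(SAMPLE)
for word, n in lemma_counts.most_common():
    print(f"{word}\t{n}")
```

Here 食べ and 食べる are tallied separately as word forms but fall under the single lemma 食べる, which is the difference between the two TSV tables.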

5 thoughts on “A 2015 count of Japanese word frequency”

  1. Great stuff! It’s interesting how you can spot the biases in the source — like Wikipedia’s semi-automatic creation of pages for places puts 市 and 県 much higher than they would be in “normal” discourse (maybe not higher than they would be in a newspaper corpus, though, who knows).

    Good to see Okinawan flexing its muscles at #74 there too.

    • I thought the very high count of 年/月/日 would be due to Wikipedia timestamps, but this is after stripping them away. Perhaps Wikipedia just talks about dates a lot?

      I got a much higher count of 市 and 県 than the last count; I don’t know if a lot of automatic pages were created in the meantime, or if the previous count analysed the words differently (it was a different morphological analyser).

      More worryingly, there are some oddities that are consequences of the automatic analysis; e.g. MeCab is lacking such a common word as 二人, for some reason; so, even though it can analyse ‘sumomomomomomomomonouchi’ correctly, it reads 二人 as *ni nin, counting one hit for each morpheme instead of tallying futari. The previous Wiktionary count (based on Juman, a dictionary used by MeCab) seems to have had the same problem. Similarly, it seems that single kana like ぬ get over-counted; browsing some sample analyses, a lot of them seem like legit jodōshi -nu, but not a few are unknown kana words misanalyzed as agglutinated suffixes. Can’t wait to get my hands on a properly annotated corpus…

  2. I wonder if “sumomomomomomomomonouchi” was put in there as a special case? It is a pretty famous piece of wordplay, after all; it looks good that they can handle it.

    Anyway, I was just commenting to say that every time I come to this page the first thing I see is “*Slowly emerges from underground cave*” and it is exactly like being in a King’s Quest-type game and passing through a screen where every time you enter the same sprite slides laboriously out of the same cave to talk to you.

  3. Haha, and the NPC always says the same thing! It fits. I’m sorry for making you refresh the page, when I’m writing so sporadically… I think it’s probably best to follow this blog on an RSS reader or via email notifications.

    I took their “sumomo…” example as a claim that the regular algorithm can handle it, without any special treatment. Doing otherwise would be tantamount to cheating, to my mind.
