Testing the predictive power of phonetic components in Japanese kanji

Most Chinese characters have a phonetic component: a hint that suggests their pronunciation. For example, 半 bàn “half” appears inside 伴 “companion” because the latter is also pronounced bàn; it also suggests the approximate pronunciation of 判 pàn “judge”, 叛 pàn “rebel”, and 胖 pàng “fat”. Beginning readers often don’t notice this feature, but awareness of phonetic hints grows with proficiency. Up to 90% of characters have a phonetic component, though that’s 90% of all characters, not 90% of what you’d stumble upon in actual use; nonphonetic characters (such as 木 or 人) are disproportionately frequent.

However, this system was always imprecise, and grew ever looser as the spoken language changed and the characters were simplified in various ways. In many cases, it’s now unclear to what degree a component was originally added as a phonetic hint, a semantic mnemonic, both, or neither. For example, it’s conceivable that whoever first built the character for 判 “judge” decided to use 半 not just for its sound, but also because they thought “cutting 刂 in equal halves 半” makes sense as a mnemonic for “judging”.

When the characters are used to represent Japanese, the phonetic hints only work with Sino-Japanese readings, i.e. the on-yomi (…that’s kind of why they’re called on-yomi, “sound readings”). Moreover, even for on-yomi, the Japanese readings grew to be even less precise than the Chinese ones. I wanted to measure their predictive power; this post reports the results of a simple but quantitative experiment (for a quick summary, skip to results!).

Phonetic components and the Japanese language

There are various reasons why the hints are less useful for Japanese:

  • In their original language, most characters have only one reading. In Japanese they typically have at least two, the original Sinitic plus a native gloss. This point alone means the hints don’t work for about half (?) of the readings.
  • What’s more, even though there are (were) many Sinitic languages through space (time), a given Chinese community typically read the characters in only one way. In the case of Japanese, it’s common for characters to have several coexisting Sinitic readings (multiple on-yomi), because the language preserves several loan strata. And all of these multiple on-yomi were subject to local language change, independently of the mainland.
  • Of the current standard (Jōyō) kanji set, 364 (about 17%) were simplified after 1946, breaking graphical relationships.

That doesn’t mean they’re useless, though, or that natives don’t process them. Mary Noguchi of Kanji Clinic describes this hypothetical example about the character for “hemorrhoid”:

[…] locate a willing subject to write, from memory, the kanji for ぢ. The point of this exercise is to witness the trial and error strategy a native speaker may employ when writing or pronouncing a relatively unfamiliar kanji such as 痔. Your friend will probably begin by writing the kanji component 疒 (“sickness”). This will put her into the ballpark, but writing the rest of the character may prove more difficult. She might start wracking her brain for a component which signals the pronunciation ji:

“Is it 寿 (“longevity”)? Oh, no, 寿 is pronounced ju, so that can’t be right. Hmmm, how about 血 (“blood,” usually pronounced chi, but sometimes ji in compounds like hanaji “bloody nose”)? Or maybe 地 (“ground,” ji)? No,” she moans, “that doesn’t look right, either. OK, I think it’s 寺 (“temple,” ji) plus 疒. Yes, 痔. That’s it!”

Sometimes a native taking quick notes will even replace a complicated phonetic component with a kana character; for example, writing 機 ki “machine” with only the left 木 “wood” component plus a squeezed phonetic キ ki to the right.

Goals and definitions: what to measure, and how

For this experiment, I extracted readings from Jim Breen’s venerable Kanjidic, and structural analysis from KanjiVG, a publicly available database of graphical and component decompositions. (If there are any errors in the tables, please report them, so that I can either debug my code or forward corrections to the original sources.) Regarding my goals:

  • I chose KanjiVG because I was interested in a synchronic analysis—that is, of the structure of the kanji as they are now, not of their historical (traditional, Seal, or pre-Qin) forms. I ignored history, and looked for correlations between modern visual components and modern readings, in the spirit of testing how much information is still present in the system.

  • I also chose to ignore approximate readings, out of didactic interest (in my experience as a Japanese student, I found approximations to be more trouble than they’re worth). I looked for exact matches.

  • And I was especially interested in components that could be used reliably as a guide to pronunciation.

There’s quite a bit of data to massage, and it can be tricky to pin down what exactly counts as a “good” phonetic component. The next sections define the kanji sets and the metrics used.

Kanji sets

First of all, the results will differ significantly depending on which kanji set (our universe) is analyzed. We’ll investigate two such sets:

  1. The set of kanji taught in Japanese education, today called Jōyō kanji. Since 1945, Japanese texts generally omit furigana readings for Jōyō, meaning that adult readers are expected to know by heart the readings of this set. Our Jōyō analyses are based on the full set of 2141 characters (including the 5 that were dropped in 2010, namely 勺 銑 脹 錘 匁 ).
  2. The largest set I have available, that is, all kanji in the current version of KanjiVG: a bit more than 6300. I don’t know if there are any selection criteria for this set, but I expect they’re some of the most common, and six thousand is a good estimate of the knowledge of a proficient, literary-minded native reader.

Quantifying phonetic series

Within each kanji set, the basic variables to relate are:

  • A few thousand kanji, where
  • each kanji is made of one or more components, and
  • each kanji has zero or more (on-yomi) readings.
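The three variables above can be sketched as plain dictionaries. This is only an illustration, not the code used for the experiment: the data is a toy sample transcribed from the 包 and 乍 examples discussed in this post, only the component under discussion is listed for each kanji, and the names `COMPONENTS` and `component_series` are made up for the sketch.

```python
from collections import defaultdict

# Toy data transcribed from the post's 包 and 乍 examples (Jōyō set).
# Only the component under discussion is listed for each kanji; a real
# decomposition (for instance one taken from KanjiVG) lists every component.
COMPONENTS = {
    "包": {"包"}, "抱": {"包"}, "泡": {"包"},
    "砲": {"包"}, "胞": {"包"}, "飽": {"包"},
    "作": {"乍"}, "搾": {"乍"}, "昨": {"乍"}, "詐": {"乍"}, "酢": {"乍"},
}


def component_series(components):
    """Invert kanji -> components into component -> set of kanji."""
    series = defaultdict(set)
    for kanji, parts in components.items():
        for part in parts:
            series[part].add(kanji)
    return dict(series)


SERIES = component_series(COMPONENTS)
print(len(SERIES["包"]), len(SERIES["乍"]))  # 6 5
```

Note that a kanji which is itself a component (like 包) counts as a member of its own series, while 乍 is absent from its series because it is not a Jōyō kanji.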

A component series is a set of kanji that include a certain component. Here are some examples from the Jōyō set:

Component Size of series Kanji in series

For our purposes, a phonetic series is a set of kanji that shares a component and a reading. If we add each kanji’s readings to the table above, interesting patterns appear:

Component Size of series Kanji in series
ki shu sou chou to fu
sei sei,
hou hou hou hou hou hou
saku saku saku sa
kyuu kyuu kyuu,

First, consider the 走-series. Not a single kanji in it has a shared reading! 走 is not a phonetic component at all, i.e. the 走-series is not a phonetic series.

Compare to the 包-series. All the six kanji that include 包 are pronounced hou. In other words, 包-hou is a phonetic series of size 6.

Now consider the 乍-series. Almost there! Four out of five kanji include the reading saku, but 詐 breaks the pattern; it’s only read as sa. In this case the 乍-saku phonetic series (size 4, 作搾昨酢) is smaller than the full 乍 component series (5, 作搾昨詐酢). We say this phonetic series covers 4 out of 5 kanji, or that it has a kanji coverage of 4/5 = 80%.
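As a rough sketch of this definition (not the original analysis code), kanji coverage can be computed from a mapping of kanji to on-yomi. The per-kanji readings below are an assumption inferred from the counts given in this post (four kanji with saku, two with sa, six readings in total); the function names are hypothetical.

```python
# Toy on-yomi for the five Jōyō kanji that contain 乍. The per-kanji split
# is inferred from the counts quoted in the post, not copied from Kanjidic.
READINGS = {
    "作": {"saku", "sa"},
    "搾": {"saku"},
    "昨": {"saku"},
    "詐": {"sa"},
    "酢": {"saku"},
}


def phonetic_series(readings, reading):
    """Kanji of the component series that have the given reading."""
    return {kanji for kanji, rs in readings.items() if reading in rs}


def kanji_coverage(readings, reading):
    """Percentage of the component series covered by the phonetic series."""
    return 100 * len(phonetic_series(readings, reading)) / len(readings)


print(sorted(phonetic_series(READINGS, "saku")))
print(kanji_coverage(READINGS, "saku"))  # 80.0
print(kanji_coverage(READINGS, "sa"))    # 40.0
```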

What about the 青-series? It does include a phonetic series, with 100% coverage: all characters do have a shared reading, sei. However, most characters also have extra, unpredictable readings, so that this component is less predictive than 包, and we should measure this. Call readings coverage the ratio of predicted readings to all the readings of all kanji in which the component appears (that is, all kanji in the larger component series). 青-sei is a phonetic series of size 7, with 100% kanji coverage but only 43.75% (7/16) readings coverage.

I hope it’s clear by now that 及-kyuu is a size-4 phonetic series with 100% kanji coverage and 80% (4/5) readings coverage.
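Readings coverage can be sketched the same way. Again this is a minimal illustration under assumptions: the 乍 readings are inferred from the counts in the text, the 青 and 及 figures are computed from the totals quoted above rather than from per-kanji data, and both function names are invented for the example.

```python
# Toy on-yomi for the 乍 component series (inferred from the post's counts).
READINGS = {
    "作": {"saku", "sa"},
    "搾": {"saku"},
    "昨": {"saku"},
    "詐": {"sa"},
    "酢": {"saku"},
}


def readings_coverage(readings, reading):
    """Predicted readings as a percentage of all readings in the series."""
    predicted = sum(1 for rs in readings.values() if reading in rs)
    total = sum(len(rs) for rs in readings.values())
    return 100 * predicted / total


def coverage_from_counts(predicted, total):
    """Same ratio, for series where only the totals are known."""
    return 100 * predicted / total


print(round(readings_coverage(READINGS, "saku"), 2))  # 66.67
print(coverage_from_counts(7, 16))  # 青-sei: 43.75
print(coverage_from_counts(4, 5))   # 及-kyuu: 80.0
```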

Phonetic series with 100% kanji and 100% readings coverage (like 包-hou) are especially useful; these ratings mean that, whenever the component appears, one can be sure of all readings of the kanji. We call these perfect series. Second in importance are those with 100% kanji but less than 100% readings coverage (like 及-kyuu and 青-sei); let’s name them semiperfect series. If you see a semiperfect phonetic component, you can be sure of at least one of the kanji’s readings. Series with less than 100% kanji coverage are not as useful, since you have to memorize the exceptions anyway; these are imperfect series.
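The three categories reduce to two comparisons. A small sketch, assuming each series is described by four counts (the function and parameter names are hypothetical; the example counts are the ones quoted in this section):

```python
def classify(phonetic_size, component_size, predicted_readings, total_readings):
    """Label a phonetic series as perfect, semiperfect or imperfect.

    phonetic_size / component_size is the kanji coverage;
    predicted_readings / total_readings is the readings coverage.
    """
    if phonetic_size < component_size:
        return "imperfect"      # some kanji with the component lack the reading
    if predicted_readings < total_readings:
        return "semiperfect"    # every kanji has the reading, plus extra ones
    return "perfect"            # the component predicts every reading


print(classify(6, 6, 6, 6))    # 包-hou: perfect
print(classify(7, 7, 7, 16))   # 青-sei: semiperfect
print(classify(4, 4, 4, 5))    # 及-kyuu: semiperfect
print(classify(4, 5, 4, 6))    # 乍-saku: imperfect
```

Comparing integer counts rather than floating-point ratios avoids any rounding question when testing for exactly 100%.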

Attentive readers might have noticed that a single component can belong to several phonetic series; 乍, for example, could also be described as a very imperfect predictor for sa, working for 作詐 but not 搾昨酢. It of course performs better as a predictor for saku, since in that role it gets 80%/66% (kanji/readings coverage) for 4 kanji, rather than 40%/33% for just 2. We’re now in a position to choose the best series for a component or a kanji. The ranking criteria, in order of priority, are:

  • First, higher kanji coverage;
  • Second, larger size (more kanji);
  • And third, higher readings coverage.
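These three criteria amount to a lexicographic sort key. The sketch below is one possible way to express it, not necessarily how the tables in this post were generated; the `Series` type and its field names are invented, and the counts are the ones quoted in the text for the Jōyō set.

```python
from fractions import Fraction
from typing import NamedTuple


class Series(NamedTuple):
    name: str
    phonetic_size: int       # kanji that have the predicted reading
    component_size: int      # kanji that contain the component
    predicted_readings: int
    total_readings: int


def sort_key(s):
    """Higher kanji coverage first, then larger size, then readings coverage."""
    kanji_cov = Fraction(s.phonetic_size, s.component_size)
    readings_cov = Fraction(s.predicted_readings, s.total_readings)
    return (-kanji_cov, -s.phonetic_size, -readings_cov)


CANDIDATES = [
    Series("乍-sa", 2, 5, 2, 6),
    Series("及-kyuu", 4, 4, 4, 5),
    Series("包-hou", 6, 6, 6, 6),
    Series("乍-saku", 4, 5, 4, 6),
    Series("青-sei", 7, 7, 7, 16),
]

ranked = sorted(CANDIDATES, key=sort_key)
print([s.name for s in ranked])
# ['青-sei', '包-hou', '及-kyuu', '乍-saku', '乍-sa']
```

Because size outranks readings coverage, 青-sei (size 7, 43.75%) sorts ahead of 包-hou (size 6, 100%), which is the order used in the tables below.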


Perfect and semiperfect series

Here’s a selection of the best phonetic series for Jōyō kanji, including all those with 100% kanji coverage and size 3 or larger:

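| Size | Kanji in series | Predicted readings | Readings coverage (%) | Kanji with extra readings |
|---|---|---|---|---|
| 7 | 情 晴 清 精 請 青 静 | sei | 43.75 | 情 清 精 請 青 静 |
| 6 | 包 抱 泡 砲 胞 飽 | hou | 100 | |
| 6 | 交 効 校 絞 較 郊 | kou | 75 | 校 較 |
| 5 | 付 府 符 腐 附 | fu | 100 | |
| 5 | 伺 司 嗣 詞 飼 | shi | 100 | |
| 5 | 倹 剣 検 険 験 | ken | 83.333 | |
| 5 | 喩 愉 癒 諭 輸 | yu | 83.333 | |
| 4 | 構 溝 講 購 | kou | 100 | |
| 4 | 渦 禍 過 鍋 | ka | 100 | |
| 4 | 復 腹 複 覆 | fuku | 100 | |
| 4 | 官 棺 管 館 | kan | 100 | |
| 4 | 朱 株 殊 珠 | shu | 100 | |
| 4 | 楷 皆 諧 階 | kai | 100 | |
| 4 | 壌 嬢 譲 醸 | jou | 100 | |
| 4 | 彩 採 菜 采 | sai | 100 | |
| 4 | 及 吸 扱 級 | kyuu | 80 | |
| 4 | 堀 屈 掘 窟 | kutsu | 80 | |
| 4 | 儀 犠 義 議 | gi | 80 | |
| 4 | 摩 磨 魔 麻 | ma | 80 | |
| 4 | 化 花 貨 靴 | ka | 66.667 | 化 花 |
| 4 | 伴 判 半 畔 | han | 66.667 | 伴 判 |
| 4 | 個 固 箇 錮 | ko | 66.667 | 個 箇 |
| 4 | 想 相 箱 霜 | sou | 66.667 | 想 相 |
| 4 | 峡 挟 狭 頰 | kyou | 57.143 | 峡 挟 狭 |
| 3 | 峰 縫 邦 | hou | 100 | |
| 3 | 五 悟 語 | go | 100 | |
| 3 | 坑 抗 航 | kou | 100 | |
| 3 | 先 洗 銑 | sen | 100 | |
| 3 | 慈 滋 磁 | ji | 100 | |
| 3 | 利 梨 痢 | ri | 100 | |
| 3 | 側 則 測 | soku | 100 | |
| 3 | 庭 廷 艇 | tei | 100 | |
| 3 | 偏 編 遍 | hen | 100 | |
| 3 | 底 抵 邸 | tei | 100 | |
| 3 | 永 泳 詠 | ei | 100 | |
| 3 | 泉 線 腺 | sen | 100 | |
| 3 | 標 漂 票 | hyou | 100 | |
| 3 | 彰 章 障 | shou | 100 | |
| 3 | 噴 墳 憤 | fun | 100 | |
| 3 | 丙 柄 病 | hei | 75 | |
| 3 | 刃 忍 認 | nin | 75 | |
| 3 | 善 繕 膳 | zen | 75 | |
| 3 | 努 奴 怒 | do | 75 | |
| 3 | 嫁 家 稼 | ka | 75 | |
| 3 | 巨 拒 距 | kyo | 75 | |
| 3 | 帝 締 諦 | tei | 75 | |
| 3 | 広 拡 鉱 | kou | 75 | |
| 3 | 健 建 鍵 | ken | 75 | |
| 3 | 曹 槽 遭 | sou | 75 | |
| 3 | 嘲 朝 潮 | chou | 75 | |
| 3 | 救 求 球 | kyuu | 75 | |
| 3 | 描 猫 苗 | byou | 75 | |
| 3 | 園 猿 遠 | en | 75 | |
| 3 | 券 圏 巻 (component 𠔉) | ken | 75 | |
| 3 | 代 袋 貸 | tai | 60 | 代 袋 |
| 3 | 居 据 裾 | kyo | 60 | 居 裾 |
| 3 | 幣 弊 蔽 | hei | 60 | |
| 3 | 壮 荘 装 | sou | 50 | 荘 装 |
| 3 | 太 汰 駄 | ta | 50 | 太 汰 駄 |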

Here’s a similar selection for KanjiVG kanji (including non-Jōyō), limited to 100% kanji coverage and size 5 or greater:

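| Size | Kanji in series | Predicted readings | Readings coverage (%) | Kanji with extra readings |
|---|---|---|---|---|
| 14 | 五 伍 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 | go | 82.353 | 圄 衙 |
| 13 | 廬 櫨 瀘 爐 盧 臚 艫 蘆 轤 鑪 顱 驢 鱸 | ro | 81.25 | 廬 臚 驢 |
| 12 | 剽 嫖 慓 標 漂 瓢 票 縹 飃 飄 驃 鰾 | hyou | 100 | |
| 12 | 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 | go | 80 | 圄 衙 |
| 12 | 凰 徨 惶 湟 煌 皇 篁 蝗 遑 鍠 隍 鰉 | kou | 80 | 凰 皇 遑 |
| 11 | 咢 愕 萼 蕚 諤 鄂 鍔 顎 鰐 鶚 齶 | gaku | 100 | |
| 11 | 啾 愀 愁 楸 湫 甃 秋 萩 鍬 鞦 鰍 | shuu | 73.333 | 愀 湫 鍬 鰍 |
| 11 | 倉 創 愴 搶 槍 滄 瘡 艙 蒼 蹌 鎗 | sou | 64.706 | 創 搶 槍 瘡 蹌 鎗 |
| 10 | 冓 媾 搆 構 溝 篝 覯 講 購 遘 | kou | 100 | |
| 10 | 偏 扁 篇 編 翩 蝙 褊 諞 遍 騙 | hen | 100 | |
| 10 | 嗟 嵯 嵳 差 搓 槎 瑳 磋 縒 蹉 | sa | 71.429 | 嗟 嵯 搓 縒 |
| 9 | 堋 崩 弸 朋 棚 硼 繃 萠 鵬 | hou | 90 | |
| 9 | 伴 判 半 叛 拌 畔 絆 胖 袢 | han | 75 | 伴 判 叛 |
| 9 | 楜 湖 瑚 糊 胡 葫 蝴 醐 餬 | ko | 60 | 瑚 糊 胡 醐 |
| 8 | 亢 伉 吭 坑 抗 杭 航 頏 | kou | 100 | |
| 8 | 僑 喬 嬌 橋 矯 蕎 轎 驕 | kyou | 100 | |
| 8 | 嶂 彰 樟 璋 瘴 章 障 鱆 | shou | 100 | |
| 8 | 伺 司 嗣 祠 笥 覗 詞 飼 | shi | 88.889 | |
| 8 | 儀 嶬 犠 礒 義 艤 蟻 議 | gi | 80 | 犠 礒 |
| 8 | 嬪 擯 檳 殯 濱 繽 賓 鬢 | hin | 80 | 檳 鬢 |
| 8 | 傍 旁 榜 滂 磅 膀 蒡 謗 | bou | 53.333 | 旁 榜 滂 磅 膀 蒡 謗 |
| 7 | 奚 渓 溪 谿 蹊 鶏 鷄 | kei | 100 | |
| 7 | 容 榕 溶 熔 穃 蓉 鎔 | you | 100 | |
| 7 | 介 价 堺 界 畍 疥 芥 | kai | 87.5 | |
| 7 | 及 吸 岌 扱 汲 笈 級 | kyuu | 87.5 | |
| 7 | 于 吁 宇 盂 紆 芋 迂 | u | 77.778 | 于 吁 |
| 7 | 代 垈 岱 玳 袋 貸 黛 | tai | 70 | 代 垈 袋 |
| 7 | 個 凅 固 涸 痼 箇 錮 | ko | 63.636 | 個 凅 涸 箇 |
| 7 | 會 檜 獪 繪 膾 薈 鱠 | kai | 58.333 | 會 繪 膾 薈 鱠 |
| 7 | 曚 朦 檬 濛 矇 艨 蒙 | mou | 58.333 | 曚 朦 矇 艨 蒙 |
| 7 | 叟 嫂 捜 搜 溲 痩 艘 | sou | 36.842 | 叟 捜 搜 溲 痩 艘 |
| 6 | 安 按 晏 案 鞍 鮟 | an | 100 | |
| 6 | 悍 捍 旱 桿 稈 駻 | kan | 100 | |
| 6 | 倔 堀 屈 崛 掘 窟 | kutsu | 85.714 | |
| 6 | 広 拡 昿 砿 絋 鉱 | kou | 85.714 | |
| 6 | 壙 廣 擴 曠 礦 鑛 | kou | 85.714 | |
| 6 | 櫪 歴 瀝 癧 轣 靂 | reki | 85.714 | |
| 6 | 偕 揩 楷 皆 諧 階 | kai | 85.714 | |
| 6 | 嗹 漣 縺 蓮 連 鏈 | ren | 85.714 | |
| 6 | 庭 廷 挺 梃 艇 霆 | tei | 75 | 挺 梃 |
| 6 | 曹 槽 漕 糟 艚 遭 | sou | 75 | 曹 糟 |
| 6 | 救 毬 求 球 裘 逑 | kyuu | 75 | 求 裘 |
| 6 | 憔 樵 焦 礁 蕉 鷦 | shou | 75 | |
| 6 | 欄 瀾 爛 蘭 襴 闌 | ran | 75 | 蘭 襴 |
| 6 | 嚀 寧 檸 濘 獰 聹 | nei | 66.667 | 嚀 檸 獰 |
| 6 | 園 猿 薗 袁 轅 遠 | en | 66.667 | 薗 袁 遠 |
| 6 | 廨 懈 蟹 蠏 解 邂 | kai | 66.667 | 廨 懈 解 |
| 6 | 倨 居 据 裾 踞 鋸 | kyo | 54.545 | 倨 居 裾 踞 鋸 |
| 6 | 嘸 廡 憮 撫 無 蕪 | bu | 54.545 | 嘸 憮 撫 無 蕪 |
| 6 | 溽 縟 耨 蓐 褥 辱 | joku | 54.545 | 溽 耨 蓐 |
| 6 | 抹 末 沫 秣 茉 靺 | matsu | 50 | 末 沫 秣 茉 靺 |
| 6 | 皺 芻 蒭 趨 鄒 雛 | suu | 37.5 | 皺 芻 蒭 趨 鄒 雛 |
| 5 | 偬 匆 怱 愡 葱 | sou | 100 | |
| 5 | 喚 奐 換 渙 煥 | kan | 100 | |
| 5 | 幾 機 磯 譏 饑 | ki | 100 | |
| 5 | 娑 沙 莎 裟 鯊 | sa sha | 100 | |
| 5 | 彩 採 綵 菜 采 | sai | 100 | |
| 5 | 卷 圈 惓 綣 蜷 | ken | 83.333 | |
| 5 | 攅 纉 讃 賛 鑚 | san | 83.333 | |
| 5 | 廊 榔 瑯 螂 郎 | rou | 83.333 | |
| 5 | 唳 戻 捩 涙 綟 | rei | 71.429 | 捩 涙 |
| 5 | 揉 柔 糅 蹂 鞣 | juu | 71.429 | 柔 鞣 |
| 5 | 當 礑 蟷 襠 鐺 | tou | 71.429 | |
| 5 | 剳 塔 搭 荅 鞳 | tou | 71.429 | |
| 5 | 晰 析 淅 皙 蜥 | seki | 62.5 | 晰 皙 蜥 |
| 5 | 噴 墳 憤 濆 賁 | fun | 62.5 | 濆 賁 |
| 5 | 撻 燵 達 闥 韃 | tatsu | 62.5 | 達 闥 韃 |
| 5 | 保 堡 葆 褒 褓 | hou | 55.556 | 保 堡 葆 褓 |
| 5 | 奬 將 漿 獎 鏘 | shou | 55.556 | 奬 將 獎 鏘 |
| 5 | 惺 星 猩 腥 醒 | sei | 55.556 | 星 猩 腥 |
| 5 | 壯 奘 弉 莊 裝 | sou | 50 | 奘 弉 莊 裝 |
| 5 | 椴 段 緞 葮 鍛 | tan | 50 | 椴 段 緞 葮 |
| 5 | 嗾 族 簇 蔟 鏃 | zoku | 45.455 | 嗾 簇 蔟 鏃 |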

It’s clear that phonetic series grow in usefulness as we venture outside Jōyō. This is probably because, as mentioned earlier, rarer characters are more likely to have been built with a phonetic structure.

Notice 吾-go is kind of a spurious series, because the component 吾 itself includes a better phonetic component, 五-go:

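| Size | Kanji in series | Predicted readings | Readings coverage (%) | Kanji with extra readings |
|---|---|---|---|---|
| 14 | 五 伍 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 | go | 82.353 | 圄 衙 |
| 12 | 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 | go | 80 | 圄 衙 |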

Imperfect series

Here are some of the top few phonetic series with imperfect kanji coverage, for Jōyō kanji:

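| Size | Kanji in component series | Size of phonetic series | Kanji not in phonetic series | Kanji coverage (%) | Predicted readings | Readings coverage (%) |
|---|---|---|---|---|---|---|
| 8 | 召 招 昭 沼 照 紹 詔 | 7 | | 87.5 | shou | 87.5 |
| 7 | 姿 恣 次 茨 諮 資 | 6 | | 85.714 | shi | 66.667 |
| 6 | 埼 奇 寄 崎 | 5 | | 83.333 | ki | 83.333 |
| 6 | 令 冷 鈴 零 | 5 | | 83.333 | rei | 71.429 |
| 9 | 反 坂 板 版 販 阪 飯 | 7 | 仮 返 | 77.778 | han | 50 |
| 8 | 交 効 校 絞 較 郊 | 6 | 父 釜 | 75 | kou | 60 |
| 8 | 姓 性 星 牲 生 | 6 | 産 隆 | 75 | sei | 50 |
| 7 | 壮 奨 荘 装 | 5 | 寝 状 | 71.429 | sou | 41.667 |
| 9 | 亡 妄 望 盲 網 | 6 | 忘 慌 荒 | 66.667 | mou | 46.154 |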

For the KanjiVG set, there are more than 600 series between 75% and 99% kanji coverage, many of them quite large. Here’s a sampling of some interesting ones:

| Size | Kanji in component series | Size of phonetic series | Kanji not in phonetic series | Kanji coverage (%) | Predicted readings | Readings coverage (%) |
|---|---|---|---|---|---|---|
| 18 | 令 伶 冷 囹 嶺 怜 澪 玲 羚 聆 苓 蛉 鈴 零 領 齡 齢 | 17 | | 94.444 | rei | 65.385 |
| 18 | 交 佼 傚 効 咬 效 校 狡 皎 絞 纐 蛟 較 郊 餃 鮫 鵁 | 17 | | 94.444 | kou | 60.714 |
| 17 | 倩 情 晴 清 瀞 睛 精 菁 蜻 請 錆 青 靖 静 靜 鯖 | 16 | | 94.118 | sei | 45.714 |
| 14 | 付 俯 咐 坿 府 拊 柎 符 腐 腑 苻 附 鮒 | 13 | | 92.857 | fu | 81.25 |
| 14 | 戔 棧 淺 濺 牋 盞 箋 綫 賎 賤 踐 錢 餞 | 13 | | 92.857 | sen | 68.421 |
| 14 | 伽 加 嘉 架 枷 珈 痂 笳 茄 袈 跏 迦 駕 | 13 | | 92.857 | ka | 61.905 |
| 25 | 包 匏 咆 垉 庖 抱 枹 泡 炮 疱 皰 砲 胞 苞 蚫 袍 鉋 靤 鞄 飽 髱 鮑 麭 | 23 | 萢 雹 | 92 | hou | 67.647 |
| 12 | 幔 慢 曼 漫 縵 蔓 謾 鏝 饅 鬘 鰻 | 11 | | 91.667 | man | 61.111 |
| 11 | 壙 廣 擴 曠 礦 簧 鑛 黃 黄 黌 | 10 | | 90.909 | kou | 66.667 |
| 10 | 俘 孚 孵 桴 殍 浮 艀 蜉 郛 | 9 | | 90 | fu | 81.818 |
| 10 | 侏 朱 株 殊 洙 珠 茱 蛛 | 9 | | 90 | shu | 75 |
| 10 | 决 刔 夬 抉 決 缺 袂 訣 鴃 | 9 | | 90 | ketsu | 52.941 |
| 18 | 兪 喩 愈 愉 揄 楡 渝 瑜 瘉 癒 蝓 覦 諭 踰 輸 逾 | 16 | 偸 鍮 | 88.889 | yu | 59.259 |
| 15 | 廊 朖 朗 榔 浪 狼 琅 瑯 粮 莨 螂 踉 郎 | 13 | 娘 良 | 86.667 | rou | 72.222 |
| 15 | 倚 剞 埼 奇 寄 崎 掎 畸 碕 綺 羇 騎 | 13 | 椅 猗 | 86.667 | ki | 68.421 |
| 14 | 呰 嘴 柴 此 疵 眥 眦 紫 觜 貲 雌 髭 | 12 | 些 砦 | 85.714 | shi | 46.154 |
| 19 | 哨 宵 峭 悄 梢 消 硝 稍 肖 蛸 誚 逍 銷 霄 鞘 鮹 | 16 | 削 屑 趙 | 84.211 | shou | 61.538 |
| 18 | 幡 播 旙 旛 潘 繙 翻 膰 蕃 藩 蟠 鐇 飜 鷭 | 15 | 審 瀋 番 | 83.333 | han | 50 |
| 12 | 佻 兆 姚 挑 晁 眺 窕 誂 跳 | 10 | 桃 逃 | 83.333 | chou | 58.824 |

When we open to a larger universe, some series can be demoted—like our friend 包-hou, the best perfect predictor in Jōyō, which here becomes imperfect thanks to 雹 (haku, hyou—one semivowel from hou!) and 萢 (no on-yomi). Even then, 包-hou still manages an impressive 23 out of 25—arguably even more useful than the 6 out of 6 in Jōyō.

The imperfect series with kanji coverage smaller than 50% (not pictured above) are quite bad; in fact, most of these “series” are just statistical noise, with sets of hundreds of characters where two or three share a reading by chance. For example, 含 and 岩, both read as gan, share the component 口; but so do 449 other Jōyō kanji, so that 口-gan has an abysmal kanji coverage of 0.44%.
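The 口-gan figure follows directly from the counts in the previous sentence (two kanji read gan, plus 449 others that contain 口):

```latex
\[
\text{kanji coverage}(\text{口-gan}) = \frac{2}{2 + 449} = \frac{2}{451} \approx 0.44\%
\]
```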

Coverage of kanji sets: how many are in phonetic series?

It’s interesting to measure the coverage of Jōyō kanji—that is, how many of its kanji are included in a phonetic series. The charts below graph this for the Jōyō and KanjiVG sets, choosing the very best phonetic series for each kanji:

Chart of phonetic series coverage of Jōyō kanji
Chart of phonetic series coverage of KanjiVG kanji

You can read these charts horizontally, as a one-dimensional percent bar. The vertical dimension adds information about kanji coverage and (in darker shading) readings coverage. The very bad section (kanji coverage less than 50%) is shaded in red, and is in practice the same as having no phonetic hint. The hints are actually useful basically for the 100% section (colored blue), and perhaps a few of the better imperfect series (orange).

However, the charts above are a bit misleading because they’re not graphing the size (number of kanji) of each series, only their relative coverages. Many of the perfect series have a modest size of 2—just enough to know, for example, that 愁 is pronounced the same as 秋. However, if the entire graphs were adjusted for size, we’d have to use a log scale, because of how bad the red section is—kanji outside the phonetic series would dominate the image in a sea of gray. To avoid that, this magnification shows only the left part (up to 50% kanji coverage), allowing us to represent the series size in a natural way:

Chart of better phonetic series coverage of Jōyō kanji, size-adjusted
Chart of better phonetic series coverage of KanjiVG kanji, size-adjusted

(Notice that the scales are not the same between different graphics!)

456 of the 2141 Jōyō kanji (21.29%) are perfectly phonetic, and so are 1411/6394 = 22.06% of our larger set (the yellow, imperfect part goes up to 34.46% and 49.69%, respectively). How to reconcile this with the often-cited figure that about 90% of kanji have a phonetic component? It’s simple: most of those components can only be detected through historical analysis, because either the sounds and graphs have changed too much, or else they now only appear in a single kanji. For example:

  • In 盆 bon, the top element 分 is a phonetic for the sound bon, coming ultimately from Old Chinese *pən. However, due to diverging phonetic changes, its role as a phonetic in 分粉紛雰 became fun (recall that Modern Japanese f- or h- was Old Japanese p-); in 貧, hin; and, in 頒, han. So the phonetic structure of 盆 (“dish” 皿 + sound *pən = *bˤən “bowl”) became obscured and approximate.
  • In 浮 fu “to float”, the right component 孚 is a phonetic element giving the sound fu (Old Chinese *mbu). It also appeared by itself as 孚 fu “confidence”, and in 俘 fu “prisoner”; however, these two characters are not used in modern writing anymore, so that 浮 “float” remained alone as the single example of the phonetic 孚-fu.


Fuller data sets are available in tab-separated text files (TSV), compatible with any spreadsheet software and easy to manipulate with Unix tools.

components_phonetic.jouyou.tsv and components_phonetic.kanjivg.tsv:
The main tables, with one phonetic series per line, ordered by goodness as described above.
kanji_phonetic.jouyou.tsv and kanji_phonetic.kanjivg.tsv:
The best phonetic series for each kanji.
Extracted from KanjiVG, a decomposition of each kanji into components.
Components that are considered to be “the same” in this experiment (for example, 人 and 亻, or 水 氵 氺). These are generally calligraphic abbreviations, marked as Z-variants or semantic variants in Unicode. This list is partial.
base_components.jouyou.tsv and base_components.kanjivg.tsv:
A list of all components needed to build the kanji set which are not themselves part of the set. Later I’ll add a new post on this.
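For readers who prefer a script to a spreadsheet, here is a minimal sketch for loading one of these files with Python’s standard `csv` module. The column layout is not documented in this post, so the sketch makes no assumption about it and simply returns each line as a list of strings; `read_tsv` is a hypothetical helper name, and the file is assumed to have been downloaded to the current directory.

```python
import csv
import os


def read_tsv(path):
    """Read a tab-separated file into a list of rows (lists of strings)."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


# File name taken from the list above; skipped if it has not been downloaded.
PATH = "components_phonetic.jouyou.tsv"
if os.path.exists(PATH):
    for row in read_tsv(PATH)[:5]:
        print(row)
```

Inspecting the first few rows this way is a quick check of what each column holds before doing any filtering or sorting.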

Thanks to everyone who sent suggestions after the first version of this article, and to the compilers of KanjiVG and Kanjidic. Readers interested in the history of characters are directed to my Kanjigen tool. Comments welcome!

26 thoughts on “Testing the predictive power of phonetic components in Japanese kanji”

  1. (Note for habitual readers: I changed the site stylesheet in order to highlight a few lines in the tables. If you don’t see any highlights, it’s probably caching the older version; try refreshing the browser a couple times with Ctrl+F5 or Ctrl+Shift+R).

  2. Interesting idea, especially the synchronic part. How about some graphs?
    – Number of kanji that are 100%, 90%, 80%… perfect (this is probably the most interesting one)
    – As above, but divided by reading set (number of 1-set kanji that are 100%, 90%… perfect; Number of 2-set kanji that are 100%, 90%… perfect)
    – Come to think of it, number of kanji with 1, 2, 3… reading sets?
    (Since you supply the data, I should probably do this myself, but I *still* don’t know how to use R).

    • I’m not sure I get what stats you are thinking about (probably because, looking now, the names I chose are too confusing):

      – I used “perfect” to describe phonetic components, not kanji; i.e. those components that reliably predict all the readings of a kanji. By “90% perfect kanji”, do you mean a kanji with 90% of its readings predicted by a phonetic component?
      – I (arbitrarily) called the set of all readings of a given kanji its “reading set”; so they all have, by definition, one set. Do you mean the number of kanji in each component/reading-set group?

  3. It’s probably a bad sign that even I’m not sure what I was on about there. But thinking carefully, I seem to have written “kanji” where I meant “component”. So, out of n = 50 (or whatever) components, 10 are “perfect”, 20 are “very good”, etc…. The theory being that if some components are “perfect” and others are “not”, with varying numbers involved, there must be a way to rank them and show what the territory between “perfect” to “completely useless” looks like.

    I suppose that figuring out a ranking method is the hard part. Is “皮” worse than “召”, and if so, by how much? And so on.

  4. Ok, I abandoned the focus on perfect sets in favor of three quantifiable criteria (kanji coverage %, size, and readings coverage %), and they’re now sorted in this order.

    The main weakness of this method is that, in many cases, a much larger phonetic series with a slightly worse kanji coverage feels more interesting, so the absolute priority for kanji coverage is unjustified. I tried to use a weighted function based on both variables, but at the end of the weekend decided that the simpler, more understandable sorting key made it easier to handle the data tables.

  5. Great reading! The new approach works for me. Looking down the coverage lists, it’s kind of shocking how closely my feelings about the characters in each row match the percentages– the safe and cozy 100%/100% rows, the approachable but somehow ominous 90%/50%(ish) rows, the forbidding tangle of the 70%/40% rows…

  6. Interesting article! It’s something I’ve also investigated on my quest for Japanese reading ability. Are you aware of an official list anywhere online of the phonetic components in Japanese? They are often mentioned but there does not seem to be a definitive list, although perhaps I am just in the dark.


  7. @Natalie: Sorry, I had missed your comment! In case you have email notification on and are reading this: I don’t think there’s a definitive list, because there are wide disagreements over which components are (historically) phonetic, or even what counts as a component. The best we can do is to look at what’s classified as a “phono-semantic character” 形声文字 in traditional dictionaries: the Shuowen, the KangXi and, for Japanese, Morohashi. I’m unaware of online lists of such classifications, though a few web searches in Japanese will probably return something.

    Of course, if all you need are lists of phonetic hints that still work (as opposed to historical “phonetics”), then there’s always the data files in this post ;)

    But if you’re looking for such a list for learning purposes, I think that it’s better to just acquire the phonetics by osmosis as you learn new words (it’s how the natives learn them, after all). Vocabulary acquisition people (like Nation) tell us that learning similar words together “to understand their difference” is a bad idea, because interference effects are bad for memorization – it’s better to deal with similar words one at a time, learning other stuff in-between. My gut feeling is that a similar thing happens with kanji: if you try to learn a phonetic series serially one after another, then all those similar-looking, similar-sounding kanji blend together in a blur.

    If you just read stuff you’re interested in, you’ll find that one starts to naturally guess the pronunciation of new kanji from phonetic hints one has acquired unconsciously (though unfortunately this inference sometimes fails – I’m looking at you, 拡/kaku).

  8. I think there is an error with the kanji 蜂(bee) in file kanji_phonetic.jouyou.tsv
    It has 三 as component instead of 丰, but its onyomi is ホウ, the same as the kanji with phonetic component 丰 (邦 峰 縫).

    Btw, this is an amazing work and helps me a lot with my study. Thanks :)

    • hiro, thank you very much for the correction. I’m presently unable to redo this analysis, but I’m making a new one (using an actual corpus this time!), and I’ll take this datum into consideration. Thanks!

  9. I have to say, I really appreciate this post, and all the work that went into it. I actively use your data to determine which kanji to study together as sets, even with the limitations around the “imperfect” sets. For me, having kanji that share both a graphical component as well as a phonetic component actually allows me to learn those kanji much more quickly, as related sets of information. I think of it as being able to quickly create multiple indices in my mind to a specific kanji. I discovered this on my own, because it just didn’t make sense to me to learn random kanji in some arbitrary order when there were organizing principles “staring me in the face.” In searching for more data about that, I ran across your post.

    I’m just working on the jouyou kanji now, and I look forward to exploiting this method on the larger universe of kanji once I’ve mastered the jouyou. I’ve actually tried it already with a few of the larger non-jouyou sets, and it really worked very well for me. So I think I have to disagree with some of the received knowledge about this type of learning, based on my own experience. (I’ll be interested to read Reading Chinese Script: A Cognitive Analysis.) I am an adult learner of Japanese, and I think that makes a difference. Native speakers who are learning kanji already have the additional “index” or context of the spoken language which surrounds them every day. I don’t have that advantage. But I *can* create a much richer environment within which to establish these types of learning contexts by using techniques such as this. Yes it takes some time, especially if you are going it alone in trying to organize and extract the study information but…

    I’m going to share my dream with you and the rest of the Internet, in the hopes that someone is doing this, or has done this, or is interested in doing this – I would love to find an electronic listing of all the phonetic components that someone could load into a rendering program, such as the KanjiVG database and viewer or the like. Then the phonetic components could be programmed to show up in one specific color, and the semantic components in another. With the right indexing and interface, a student could choose to study groups of kanji related by phonetic and/or semantic components.

    Think about how effective a study aid that would be! You would actually be capturing some of our internal decoding processes on screen. How would it be if someone could build that?

    It would be great for me anyway! And I bet for others. What do you think? (And thanks again!)

  10. In components_phonetic.kanjivg.tsv there are several components having more than one predicted reading. They are just separated by a blank.

    Example: 竟 has キョウ and ケイ

    I would have expected to see two lines in that case.

    And worse it is impossible to see which of the entries is more important. Heisig (only jouyou) gives キョウ as main reading and ケイ as exception for 境.

    鏡 => キョウ 望遠鏡/顕微鏡/鏡台
    境 => キョウ 境界/国境/環境/境遇
    境 => ケイ 境内

    Btw: I confused 竸 with 競 which has interestingly enough also キョウ (main) and ケイ as readings.

    • Thanks for the bug report, Matthias. You’re completely right; I don’t know what I was thinking.

      Right now I’m busy with my thesis, and can’t bring myself to try to re-read my old code. If it’s any consolation, I’m doing some (hopefully) better statistical analyses for it, weighted by a corpus; and I’m taking care to make it reproducible, and will post the python+R code to github. There were also several problems with KanjiVG component decompositions; some I’ve forwarded to upstream, and will continue to do so as I find them. The deadline for my thesis is August 2016, so you’ll have better data by then.

    • Kanjinetworks says that 竸 and 競 are just graphical variants. ja.wiktionary has mostly the same readings/meanings ascribed to both, but (perhaps by lack of completion) the set of kun-yomi ascribed to them differs a bit. I don’t have a proper kanji dictionary at hand right now to look into…

  11. Good luck with your thesis!

    If you redo the analysis you might consider including “groups” of 1. By this you get a complete picture, including all exceptions. I guess this is also important to get variants as e.g. そつ and そっ.

    • Thanks! My current approach already includes code to unify morphophonemic processes, like sokuon (そつ → そっ), rendaku (そつ→ぞつ) and the like. I count them as “the same reading”, being conservative and assuming the reader knows which variant is used when.

  12. So are the phonetic kanji worth learning deliberately in addition to normal kanji study?

    The report seemed to be indicating as such, then pulled another graph out and everything went obscure and now I’m not sure if it is worth it.

    • I don’t recommend deliberate memorization of phonetic components, specifically. Rather, study kanji as you usually do, but pay attention to their components; now that you know that some of them are phonetic hints, every so often they’ll be useful. You’ll learn which ones naturally, as you get exposed to them.

      (By the way, earlier I said that imperfect hints are more trouble than they’re worth. Now I think this only applies to the reader who’s still in early stages; as one gets to know more words, imperfect hints will actually become helpful in recalling the word to memory. But this only works after one’s already familiar with the word.)

      In my opinion (which is shared by many linguists), you shouldn’t spend a lot of time drilling kanji at first. It feels like industrious learning, but it’s a poor use of your study time. Concentrate on acquiring the Japanese language itself – the words and the grammar patterns – and postpone kanji as much as you can. The reason is that it will be a lot easier to deal with kanji when they’re just fancy notation for words you already know (consider that the Japanese acquire the language fluently before even starting kanji practice at school). After you get yourself decent Japanese proficiency, the main thrust of your kanji studies should be reading interesting texts (furigana-ed texts and electronic tools, e.g. rikaichan, are very OK here). Don’t bother checking every word in a dictionary; try to find texts that you can more-or-less understand as-is, and gloss over the hard parts. Ideally you want something compelling, like genre fiction with a cliffhanger every chapter – something which draws you to keep reading. At this point your main objective is still growing your vocabulary, and improving comprehension. Meanwhile, kanji familiarity will grow by osmosis, in a perfectly-tuned “spaced repetition system” (because, in your readings, you’ll practice more often the kanji that are used more often!)

      Deliberate kanji drilling should be an auxiliary study at best; never spend more time drilling kanji than reading texts. Kanji drills are mainly useful for a) writing Japanese by hand (=recall), and b) passing artificial tests like college exams or the JLPT. If you need either of these for some reason, you’ll need to drill kanji, but don’t let that detract from reading compelling texts.

    • For most purposes, I believe the question isn’t what I consider a component, but what The Tradition considers a component. The first landmark of The Tradition is the Shuōwén Jiězì, though it has important predecessors like the Hàn Shū which cites the Six Methods of Classification 六書 liùshū. So the short answer is, if a classic like the Shuōwén or the Kāngxī says “character X is composed of W and Y”, then W and Y are components. Note that the large majority of components are themselves characters, and that composition is a recursive process (so a component may be an aggregate of two or more subcomponents). Under this model, a component is a phonetic or semantic component (音符/意符) if the tradition claims it as such. This is the definition I use most of the time and what enables me to interact with dictionaries, reference materials, other readers etc.

      A synchronic, instrumental definition would be analogous to the process of identifying morphemes in strings of phonemes (=strokes); you’d look for graphical subelements which recur in at least two characters, while co-occurring with phonetic or semantic values (or both, cf. Tōdō Akiyasu’s work). A more sophisticated historical definition would go beyond the classics and try to reconstruct the components from oracle-bone scholarship (cf. Ochiai Atsushi, with grains of salt).

      For the purpose of this post (and of my thesis, which was a corpus-based approach), I deliberately set out to be ahistorical (because I meant to test how much information is left in the system). In this case, components were defined graphically as recurring visual sub-elements (taken from the KanjiVG vectorial graphics database), almost blinded from tradition; the single exception is that a few allographs (graphical variations) taught to be “the same component” in Japanese school education (e.g. 心 忄 㣺) are treated as instances of the same component, on account of the fact that most Japanese-educated readers would treat them this way. In the thesis I’ve tested how many Jōyō characters occur, on a typical text (Wikipedia), with readings that can be predicted by any of its components, thus defined (that is, for any given kanji/reading instance, whether any of the graphical subelements, taken ahistorically, occur in other characters with the same reading, and mostly restricted to those, so that the presence of that visual subelement is enough to predict the reading). I reached a number of around 23%, being as optimistic/conservative as I could. That is, a reader going through Wikipedia with full knowledge of Jōyō Kanji phonetically predictive components but (somehow) unable to recall kanji readings would be able to predict about ~23% of them, optimistically.
