Testing the predictive power of phonetic components in Japanese kanji

Most Chinese characters have a phonetic component: a hint that suggests their pronunciation. For example, 半 bàn “half” appears inside 伴 “companion” because the latter is also pronounced bàn; and further, it also suggests the approximate pronunciation of 判 pàn “judge”, 叛 pàn “rebel”, and 胖 pàng “fat”. Beginning readers often don’t notice this feature, but awareness of phonetic hints grows with proficiency. The rate of characters with a phonetic component reaches up to 90%, though that’s 90% of all characters, not 90% of what you’d stumble upon in actual use; nonphonetic characters (such as 木 or 人) are disproportionately more frequent.

However, this system was always imprecise, and grew ever looser as the spoken language changed and the characters were simplified in various ways. In many cases, it’s now unclear to what degree a component was originally added as a phonetic hint, a semantic mnemonic, both, or neither. For example, it’s conceivable that whoever first built the character for 判 “judge” decided to use 半 not just for its sound, but also because they thought “cutting 刂 in equal halves 半” makes sense as a mnemonic for “judging”.

When the characters are used to represent Japanese, the phonetic hints only work with Sino-Japanese readings, i.e. the on-yomi (…that’s kind of why they’re called on-yomi, “sound readings”). Moreover, even for on-yomi, the Japanese readings grew to be even more imprecise than the Chinese ones. I wanted to try to measure their predictive power; this post reports the results of a simple but quantitative experiment (for a quick summary, skip to results!).

Phonetic components and the Japanese language

There are various reasons why the hints are less useful for Japanese:

  • In their original language, most characters have only one reading. In Japanese they typically have at least two, the original Sinitic plus a native gloss. This point alone means the hints don’t work for about half (?) of the readings.
  • What’s more, even though there are (were) many Sinitic languages through space (time), a given Chinese community typically read the characters in only one way. In the case of Japanese, it’s common for characters to have several coexisting Sinitic readings (multiple on-yomi), because the language preserves several loan strata. And all of these multiple on-yomi were subject to local language change, independently of the mainland.
  • Of the current standard (Jōyō) kanji set, 364 (about 17%) were simplified after 1946, breaking graphical relationships.

That doesn’t mean they’re useless, though, or that natives don’t process them. Mary Noguchi of Kanji Clinic describes this hypothetical example about the character for “hemorrhoid”:

[…] locate a willing subject to write, from memory, the kanji for ぢ. The point of this exercise is to witness the trial and error strategy a native speaker may employ when writing or pronouncing a relatively unfamiliar kanji such as 痔. Your friend will probably begin by writing the kanji component 疒 (“sickness”). This will put her into the ballpark, but writing the rest of the character may prove more difficult. She might start wracking her brain for a component which signals the pronunciation ji:

“Is it 寿 (“longevity”)? Oh, no, 寿 is pronounced ju, so that can’t be right. Hmmm, how about 血 (“blood,” usually pronounced chi, but sometimes ji in compounds like hanaji “bloody nose”)? Or maybe 地 (“ground,” ji)? No,” she moans, “that doesn’t look right, either. OK, I think it’s 寺 (“temple,” ji) plus 疒. Yes, 痔. That’s it!”

Sometimes a native taking quick notes will even substitute a kana character for a complicated phonetic component: for example, writing 機 ki “machine” with only the left 木 “wood” component plus a squeezed phonetic ki キ to the right.

Goals and definitions: what to measure, and how

For this experiment, I extracted readings from Jim Breen’s venerable Kanjidic, and structural analysis from KanjiVG, a publicly-available database of graphical and component decompositions. (If there are any errors in the tables, please report them, so that I can either debug my code or forward corrections to the original sources.) Regarding my goals:

  • I chose KanjiVG because I was interested in a synchronic analysis—that is, of the structure of the kanji as they are now, not of their historical (traditional, Seal, or pre-Qin) forms. I ignored history, and looked for correlations between modern visual components and modern readings, in the spirit of testing how much information is still present in the system.

  • I also chose to ignore approximate readings, out of didactic interest (in my experience as a Japanese student, I found approximations to be more trouble than they’re worth). I looked for exact matches.

  • And I was especially interested in components that could be used reliably as a guide to pronunciation.

There’s quite a bit of data to massage, and it can be tricky to measure what exactly is a “good” phonetic component. In the next section I make some important definitions about metrics.

Kanji sets

First of all, the results will differ significantly depending on
which kanji set (our universe) is analyzed. We’ll investigate two
such sets:

  1. The set of kanji taught in Japanese education, today called Jōyō kanji. Since 1945, Japanese texts generally omit furigana readings for Jōyō, meaning that adult readers are expected to know by heart the readings of this set. Our Jōyō analyses are based on the full set of 2141 characters (including the 5 that were dropped in 2010, namely 勺 銑 脹 錘 匁 ).
  2. The largest set I have available, that is, all kanji in the current version of KanjiVG (a bit more than 6300). I don’t know if there are any selection criteria for this set, but I expect they’re some of the most common, and six thousand is a good estimate of the knowledge of a proficient, literary-minded native reader.

Quantifying phonetic series

Within each kanji set, the basic variables to relate are:

  • A few thousand kanji, where
  • each kanji is made of one or more components, and
  • each kanji has zero or more (on-yomi) readings.

A component series is the set of kanji that include a certain component. Here are some examples from the Jōyō set:

Component Size of series Kanji in series

For our purposes, a phonetic series is a set of kanji that
shares a component and a reading. If we add each kanji’s
readings to the table above, interesting patterns appear:

Component (kanji in series): readings
走: ki, shu, sou, chou, to, fu
青 (情 晴 清 精 請 青 静): sei, sei, …
包 (包 抱 泡 砲 胞 飽): hou, hou, hou, hou, hou, hou
乍 (作 搾 昨 詐 酢): saku, saku, saku, sa, …
及 (及 吸 扱 級): kyuu, kyuu, kyuu, …
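For readers who want to experiment, here is a minimal Python sketch of how such series can be grouped. It is not the code used for this post. The names `components`, `readings`, `build_component_series` and `build_phonetic_series` are made up for illustration, the decompositions are simplified to a single component per kanji, and the toy data covers only the 乍 and 包 kanji with the readings quoted in this post.

```python
from collections import defaultdict

# Toy data only: one (simplified) component per kanji, and the on-yomi
# quoted in this post. Real data would come from KanjiVG and Kanjidic.
components = {
    "作": ["乍"], "搾": ["乍"], "昨": ["乍"], "詐": ["乍"], "酢": ["乍"],
    "包": ["包"], "抱": ["包"], "泡": ["包"],
    "砲": ["包"], "胞": ["包"], "飽": ["包"],
}
readings = {
    "作": ["saku", "sa"], "搾": ["saku"], "昨": ["saku"],
    "詐": ["sa"], "酢": ["saku"],
    "包": ["hou"], "抱": ["hou"], "泡": ["hou"],
    "砲": ["hou"], "胞": ["hou"], "飽": ["hou"],
}


def build_component_series(components):
    """Map each component to the set of kanji that include it."""
    series = defaultdict(set)
    for kanji, parts in components.items():
        for part in parts:
            series[part].add(kanji)
    return dict(series)


def build_phonetic_series(components, readings):
    """Map each (component, reading) pair to the kanji that share both."""
    series = defaultdict(set)
    for kanji, parts in components.items():
        for part in parts:
            for reading in readings.get(kanji, []):
                series[(part, reading)].add(kanji)
    return dict(series)


component_series = build_component_series(components)
phonetic_series = build_phonetic_series(components, readings)

print(len(component_series["乍"]))           # 5
print(len(phonetic_series[("乍", "saku")]))  # 4
```

Every (component, reading) pair gets an entry here, including pairs with a single kanji, so a real analysis would probably filter by size afterwards.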

First, consider the 走-series. Not a single kanji in it has a
shared reading! 走 is not a phonetic component at all, i.e. the
走-series is not a phonetic series.

Compare this to the 包-series. All six kanji that include 包 are
pronounced hou. In other words, 包-hou is a phonetic
series of size 6.

Now consider the 乍-series. Almost there! Four out of five kanji include the reading saku, but 詐 breaks the pattern; it’s only read as sa. In this case the 乍-saku phonetic series (size 4, 作搾昨酢) is smaller than the full 乍 component series (5, 作搾昨詐酢). We say this phonetic series covers 4 out of 5 kanji, or that it has a kanji coverage of 4/5 = 80%.

What about the 青-series? It does include a phonetic series, with 100% coverage: all characters do have a shared reading, sei. However, most characters also have extra, unpredictable readings, so that this component is less predictive than 包, and we should measure this. Call readings coverage the ratio of predicted readings to all the readings of all kanji in which the component appears (that is, all kanji in the larger component series). 青-sei is a phonetic series of size 7, with 100% kanji coverage but only 43.75% (7/16) readings coverage.

I hope it’s clear by now that 及-kyuu is a size-4 phonetic
series with 100% kanji coverage and 80% (4/5) readings coverage.

Phonetic series with 100% kanji and 100% readings coverage (like 包-hou) are especially useful; these ratings mean that, whenever the component appears, one can be sure of all readings of the kanji. We call these perfect series. Second in importance are those with 100% kanji but less than 100% readings coverage (like 及-kyuu and 青-sei); let’s name them semiperfect series. If you see a semiperfect phonetic component, you can be sure of at least one of the kanji’s readings. Series with less than 100% kanji coverage are not as useful, since you have to memorize the exceptions anyway; these are imperfect series.
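The two coverage ratios and the three labels can be written down in a few lines. This is only a sketch under stated assumptions: `series_stats` and `classify` are hypothetical names, each series is passed in as a small dictionary from kanji to its on-yomi, and the extra readings given for 校 (kyou) and 較 (kaku) are my own filling-in, since the tables in this post only say that those two kanji have extra readings.

```python
def series_stats(series_readings, reading):
    """Return (size, kanji coverage, readings coverage) for one reading.

    series_readings maps every kanji of the component series to its on-yomi.
    """
    size = sum(1 for rs in series_readings.values() if reading in rs)
    if size == 0:
        return 0, 0.0, 0.0
    total_readings = sum(len(rs) for rs in series_readings.values())
    return size, size / len(series_readings), size / total_readings


def classify(kanji_coverage, readings_coverage):
    """Label a phonetic series with the terms defined in the text."""
    if kanji_coverage < 1:
        return "imperfect"
    return "perfect" if readings_coverage == 1 else "semiperfect"


zha = {"作": ["saku", "sa"], "搾": ["saku"], "昨": ["saku"],
       "詐": ["sa"], "酢": ["saku"]}
bao = {k: ["hou"] for k in "包抱泡砲胞飽"}
# Assumed extra readings for 校 and 較 (see the note above the code).
jiao = {"交": ["kou"], "効": ["kou"], "校": ["kou", "kyou"],
        "絞": ["kou"], "較": ["kaku", "kou"], "郊": ["kou"]}

for name, data, reading in [("乍", zha, "saku"), ("包", bao, "hou"),
                            ("交", jiao, "kou")]:
    size, kanji_cov, readings_cov = series_stats(data, reading)
    print(name, reading, size, kanji_cov, readings_cov,
          classify(kanji_cov, readings_cov))
```

With this data, 乍-saku comes out as imperfect (4 of 5 kanji, 4 of 6 readings), 包-hou as perfect, and 交-kou as semiperfect with 6 of 8 readings, which matches the 75% listed for it in the Jōyō table.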

Attentive readers might have noticed that a single component can belong to several phonetic series; 乍, for example, could also be described as a very imperfect predictor for sa, working for 作詐 but not 搾昨酢. It of course performs better as a predictor for saku, since in that role it gets 80%/66% (kanji/readings coverage) for 4 kanji, rather than 40%/33% for just 2. We’re now in a position to choose the best series for a component or a kanji; the rating criteria are:

  • First, higher kanji coverage;
  • Second, larger size (more kanji);
  • And third, higher readings coverage.
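The three criteria above amount to sorting by a tuple. Here is a minimal sketch, assuming each series has already been reduced to its size and two coverage ratios; the `Series` type and `rank_key` function are illustrative names, and the four sample series use the figures quoted in this post.

```python
from typing import NamedTuple


class Series(NamedTuple):
    name: str
    size: int
    kanji_coverage: float
    readings_coverage: float


def rank_key(series):
    """Sort key: kanji coverage first, then size, then readings coverage."""
    return (series.kanji_coverage, series.size, series.readings_coverage)


candidates = [
    Series("乍-sa", 2, 2 / 5, 2 / 6),
    Series("包-hou", 6, 1.0, 1.0),
    Series("乍-saku", 4, 4 / 5, 4 / 6),
    Series("青-sei", 7, 1.0, 7 / 16),
]

ranked = sorted(candidates, key=rank_key, reverse=True)
print([s.name for s in ranked])
# ['青-sei', '包-hou', '乍-saku', '乍-sa']
```

Note that 青-sei ranks above 包-hou even though its readings coverage is much lower, because size is compared before readings coverage.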


Perfect and semi-perfect series

Here’s a selection of the best phonetic series for Jōyō kanji, including all those with 100% kanji coverage and size 3 or larger:

Size Kanji in series Predicted readings Readings coverage Kanji with extra readings
7 情 晴 清 精 請 青 静 sei 43.75 情 清 精 請 青 静
6 包 抱 泡 砲 胞 飽 hou 100
6 交 効 校 絞 較 郊 kou 75 校 較
5 付 府 符 腐 附 fu 100
5 伺 司 嗣 詞 飼 shi 100
5 倹 剣 検 険 験 ken 83.333
5 喩 愉 癒 諭 輸 yu 83.333
4 構 溝 講 購 kou 100
4 渦 禍 過 鍋 ka 100
4 復 腹 複 覆 fuku 100
4 官 棺 管 館 kan 100
4 朱 株 殊 珠 shu 100
4 楷 皆 諧 階 kai 100
4 壌 嬢 譲 醸 jou 100
4 彩 採 菜 采 sai 100
4 及 吸 扱 級 kyuu 80
4 堀 屈 掘 窟 kutsu 80
4 儀 犠 義 議 gi 80
4 摩 磨 魔 麻 ma 80
4 化 花 貨 靴 ka 66.667 化 花
4 伴 判 半 畔 han 66.667 伴 判
4 個 固 箇 錮 ko 66.667 個 箇
4 想 相 箱 霜 sou 66.667 想 相
4 峡 挟 狭 頰 kyou 57.143 峡 挟 狭
3 峰 縫 邦 hou 100
3 五 悟 語 go 100
3 坑 抗 航 kou 100
3 先 洗 銑 sen 100
3 慈 滋 磁 ji 100
3 利 梨 痢 ri 100
3 側 則 測 soku 100
3 庭 廷 艇 tei 100
3 偏 編 遍 hen 100
3 底 抵 邸 tei 100
3 永 泳 詠 ei 100
3 泉 線 腺 sen 100
3 標 漂 票 hyou 100
3 彰 章 障 shou 100
3 噴 墳 憤 fun 100
3 丙 柄 病 hei 75
3 刃 忍 認 nin 75
3 善 繕 膳 zen 75
3 努 奴 怒 do 75
3 嫁 家 稼 ka 75
3 巨 拒 距 kyo 75
3 帝 締 諦 tei 75
3 広 拡 鉱 kou 75
3 健 建 鍵 ken 75
3 曹 槽 遭 sou 75
3 嘲 朝 潮 chou 75
3 救 求 球 kyuu 75
3 描 猫 苗 byou 75
3 園 猿 遠 en 75
3 券 圏 巻 ken 75
3 代 袋 貸 tai 60 代 袋
3 居 据 裾 kyo 60 居 裾
3 幣 弊 蔽 hei 60
3 壮 荘 装 sou 50 荘 装
3 太 汰 駄 ta 50 太 汰 駄

Here’s a similar selection for KanjiVG kanji (including non-Jōyō), limited to 100% kanji coverage and size 5 or greater:

Size Kanji in series Predicted readings Readings coverage Kanji with extra readings
14 五 伍 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 go 82.353 圄 衙
13 廬 櫨 瀘 爐 盧 臚 艫 蘆 轤 鑪 顱 驢 鱸 ro 81.25 廬 臚 驢
12 剽 嫖 慓 標 漂 瓢 票 縹 飃 飄 驃 鰾 hyou 100
12 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 go 80 圄 衙
12 凰 徨 惶 湟 煌 皇 篁 蝗 遑 鍠 隍 鰉 kou 80 凰 皇 遑
11 咢 愕 萼 蕚 諤 鄂 鍔 顎 鰐 鶚 齶 gaku 100
11 啾 愀 愁 楸 湫 甃 秋 萩 鍬 鞦 鰍 shuu 73.333 愀 湫 鍬 鰍
11 倉 創 愴 搶 槍 滄 瘡 艙 蒼 蹌 鎗 sou 64.706 創 搶 槍 瘡 蹌 鎗
10 冓 媾 搆 構 溝 篝 覯 講 購 遘 kou 100
10 偏 扁 篇 編 翩 蝙 褊 諞 遍 騙 hen 100
10 嗟 嵯 嵳 差 搓 槎 瑳 磋 縒 蹉 sa 71.429 嗟 嵯 搓 縒
9 堋 崩 弸 朋 棚 硼 繃 萠 鵬 hou 90
9 伴 判 半 叛 拌 畔 絆 胖 袢 han 75 伴 判 叛
9 楜 湖 瑚 糊 胡 葫 蝴 醐 餬 ko 60 瑚 糊 胡 醐
8 亢 伉 吭 坑 抗 杭 航 頏 kou 100
8 僑 喬 嬌 橋 矯 蕎 轎 驕 kyou 100
8 嶂 彰 樟 璋 瘴 章 障 鱆 shou 100
8 伺 司 嗣 祠 笥 覗 詞 飼 shi 88.889
8 儀 嶬 犠 礒 義 艤 蟻 議 gi 80 犠 礒
8 嬪 擯 檳 殯 濱 繽 賓 鬢 hin 80 檳 鬢
8 傍 旁 榜 滂 磅 膀 蒡 謗 bou 53.333 旁 榜 滂 磅 膀 蒡 謗
7 奚 渓 溪 谿 蹊 鶏 鷄 kei 100
7 容 榕 溶 熔 穃 蓉 鎔 you 100
7 介 价 堺 界 畍 疥 芥 kai 87.5
7 及 吸 岌 扱 汲 笈 級 kyuu 87.5
7 于 吁 宇 盂 紆 芋 迂 u 77.778 于 吁
7 代 垈 岱 玳 袋 貸 黛 tai 70 代 垈 袋
7 個 凅 固 涸 痼 箇 錮 ko 63.636 個 凅 涸 箇
7 會 檜 獪 繪 膾 薈 鱠 kai 58.333 會 繪 膾 薈 鱠
7 曚 朦 檬 濛 矇 艨 蒙 mou 58.333 曚 朦 矇 艨 蒙
7 叟 嫂 捜 搜 溲 痩 艘 sou 36.842 叟 捜 搜 溲 痩 艘
6 安 按 晏 案 鞍 鮟 an 100
6 悍 捍 旱 桿 稈 駻 kan 100
6 倔 堀 屈 崛 掘 窟 kutsu 85.714
6 広 拡 昿 砿 絋 鉱 kou 85.714
6 壙 廣 擴 曠 礦 鑛 kou 85.714
6 櫪 歴 瀝 癧 轣 靂 reki 85.714
6 偕 揩 楷 皆 諧 階 kai 85.714
6 嗹 漣 縺 蓮 連 鏈 ren 85.714
6 庭 廷 挺 梃 艇 霆 tei 75 挺 梃
6 曹 槽 漕 糟 艚 遭 sou 75 曹 糟
6 救 毬 求 球 裘 逑 kyuu 75 求 裘
6 憔 樵 焦 礁 蕉 鷦 shou 75
6 欄 瀾 爛 蘭 襴 闌 ran 75 蘭 襴
6 嚀 寧 檸 濘 獰 聹 nei 66.667 嚀 檸 獰
6 園 猿 薗 袁 轅 遠 en 66.667 薗 袁 遠
6 廨 懈 蟹 蠏 解 邂 kai 66.667 廨 懈 解
6 倨 居 据 裾 踞 鋸 kyo 54.545 倨 居 裾 踞 鋸
6 嘸 廡 憮 撫 無 蕪 bu 54.545 嘸 憮 撫 無 蕪
6 溽 縟 耨 蓐 褥 辱 joku 54.545 溽 耨 蓐
6 抹 末 沫 秣 茉 靺 matsu 50 末 沫 秣 茉 靺
6 皺 芻 蒭 趨 鄒 雛 suu 37.5 皺 芻 蒭 趨 鄒 雛
5 偬 匆 怱 愡 葱 sou 100
5 喚 奐 換 渙 煥 kan 100
5 幾 機 磯 譏 饑 ki 100
5 娑 沙 莎 裟 鯊 sa sha 100
5 彩 採 綵 菜 采 sai 100
5 卷 圈 惓 綣 蜷 ken 83.333
5 攅 纉 讃 賛 鑚 san 83.333
5 廊 榔 瑯 螂 郎 rou 83.333
5 唳 戻 捩 涙 綟 rei 71.429 捩 涙
5 揉 柔 糅 蹂 鞣 juu 71.429 柔 鞣
5 當 礑 蟷 襠 鐺 tou 71.429
5 剳 塔 搭 荅 鞳 tou 71.429
5 晰 析 淅 皙 蜥 seki 62.5 晰 皙 蜥
5 噴 墳 憤 濆 賁 fun 62.5 濆 賁
5 撻 燵 達 闥 韃 tatsu 62.5 達 闥 韃
5 保 堡 葆 褒 褓 hou 55.556 保 堡 葆 褓
5 奬 將 漿 獎 鏘 shou 55.556 奬 將 獎 鏘
5 惺 星 猩 腥 醒 sei 55.556 星 猩 腥
5 壯 奘 弉 莊 裝 sou 50 奘 弉 莊 裝
5 椴 段 緞 葮 鍛 tan 50 椴 段 緞 葮
5 嗾 族 簇 蔟 鏃 zoku 45.455 嗾 簇 蔟 鏃

It’s clear that phonetic series grow in usefulness as we venture outside Jōyō. This is probably because, as mentioned earlier, rarer characters are more likely to have been built with a phonetic structure.

Notice 吾-go is kind of a spurious series, because the component 吾 itself includes a better phonetic component, 五-go:

14 五 伍 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 go 82.353 圄 衙
12 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 go 80 圄 衙

Imperfect series

Here are some of the top few phonetic series with imperfect kanji coverage, for Jōyō kanji:

Size Kanji in component series Size of phonetic series Kanji not in phonetic series Kanji coverage Predicted readings Readings coverage
8 召 招 昭 沼 照 紹 詔 7 87.5 shou 87.5
7 姿 恣 次 茨 諮 資 6 85.714 shi 66.667
6 埼 奇 寄 崎 5 83.333 ki 83.333
6 令 冷 鈴 零 5 83.333 rei 71.429
9 反 坂 板 版 販 阪 飯 7 仮 返 77.778 han 50
8 交 効 校 絞 較 郊 6 父 釜 75 kou 60
8 姓 性 星 牲 生 6 産 隆 75 sei 50
7 壮 奨 荘 装 5 寝 状 71.429 sou 41.667
9 亡 妄 望 盲 網 6 忘 慌 荒 66.667 mou 46.154

For the KanjiVG set, there are more than 600 series between 75% and 99% kanji coverage, many of them quite large. Here’s a sampling of some interesting ones:

Size Kanji in component series Size of phonetic series Kanji not in phonetic series Kanji coverage Predicted readings Readings coverage
18 令 伶 冷 囹 嶺 怜 澪 玲 羚 聆 苓 蛉 鈴 零 領 齡 齢 17 94.444 rei 65.385
18 交 佼 傚 効 咬 效 校 狡 皎 絞 纐 蛟 較 郊 餃 鮫 鵁 17 94.444 kou 60.714
17 倩 情 晴 清 瀞 睛 精 菁 蜻 請 錆 青 靖 静 靜 鯖 16 94.118 sei 45.714
14 付 俯 咐 坿 府 拊 柎 符 腐 腑 苻 附 鮒 13 92.857 fu 81.25
14 戔 棧 淺 濺 牋 盞 箋 綫 賎 賤 踐 錢 餞 13 92.857 sen 68.421
14 伽 加 嘉 架 枷 珈 痂 笳 茄 袈 跏 迦 駕 13 92.857 ka 61.905
25 包 匏 咆 垉 庖 抱 枹 泡 炮 疱 皰 砲 胞 苞 蚫 袍 鉋 靤 鞄 飽 髱 鮑 麭 23 萢 雹 92 hou 67.647
12 幔 慢 曼 漫 縵 蔓 謾 鏝 饅 鬘 鰻 11 91.667 man 61.111
11 壙 廣 擴 曠 礦 簧 鑛 黃 黄 黌 10 90.909 kou 66.667
10 俘 孚 孵 桴 殍 浮 艀 蜉 郛 9 90 fu 81.818
10 侏 朱 株 殊 洙 珠 茱 蛛 9 90 shu 75
10 决 刔 夬 抉 決 缺 袂 訣 鴃 9 90 ketsu 52.941
18 兪 喩 愈 愉 揄 楡 渝 瑜 瘉 癒 蝓 覦 諭 踰 輸 逾 16 偸 鍮 88.889 yu 59.259
15 廊 朖 朗 榔 浪 狼 琅 瑯 粮 莨 螂 踉 郎 13 娘 良 86.667 rou 72.222
15 倚 剞 埼 奇 寄 崎 掎 畸 碕 綺 羇 騎 13 椅 猗 86.667 ki 68.421
14 呰 嘴 柴 此 疵 眥 眦 紫 觜 貲 雌 髭 12 些 砦 85.714 shi 46.154
19 哨 宵 峭 悄 梢 消 硝 稍 肖 蛸 誚 逍 銷 霄 鞘 鮹 16 削 屑 趙 84.211 shou 61.538
18 幡 播 旙 旛 潘 繙 翻 膰 蕃 藩 蟠 鐇 飜 鷭 15 審 瀋 番 83.333 han 50
12 佻 兆 姚 挑 晁 眺 窕 誂 跳 10 桃 逃 83.333 chou 58.824

When we open to a larger universe, some series can be demoted—like our friend 包-hou, the best perfect predictor in Jōyō, which here becomes imperfect thanks to 雹 (haku, hyou—one semivowel from hou!) and 萢 (no on-yomi). Even then, 包-hou still manages an impressive 23 out of 25—arguably even more useful than the 6 out of 6 in Jōyō.

The imperfect series with kanji coverage smaller than 50% (not pictured above) are quite bad; in fact, most of these “series” are just statistical noise, with sets of hundreds of characters where two or three share a reading by chance. For example, 含 and 岩, both read as gan, share the component 口; but so do 449 other Jōyō kanji, so that 口-gan has an abysmal kanji coverage of 0.44%.

Coverage of kanji sets: how many are in phonetic series?

It’s interesting to measure the coverage of each kanji set, that is, how many of its kanji are included in a phonetic series. The charts below graph this for the Jōyō and KanjiVG sets, choosing the very best phonetic series for each kanji:

Chart of phonetic series coverage of Jōyō kanji

Chart of phonetic series coverage of KanjiVG kanji

You can read these charts horizontally, as a one-dimensional percent bar. The vertical dimension adds information about kanji coverage and (in darker shading) readings coverage. The very bad section (kanji coverage less than 50%) is shaded in red, and is in practice the same as having no phonetic hint. The hints are actually useful basically for the 100% section (colored blue), and perhaps a few of the better imperfect series (orange).

However, the charts above are a bit misleading because they’re not graphing the size (number of kanji) of each series, only their relative coverages. Many of the perfect series have a modest size of 2—just enough to know, for example, that 愁 is pronounced the same as 秋. However, if the entire graphs were adjusted for size, we’d have to use a log scale, because of how bad the red section is—kanji outside the phonetic series would dominate the image in a sea of gray. To avoid that, this magnification shows only the left part (up to 50% kanji coverage), allowing us to represent the series size in a natural way:

Chart of better phonetic series coverage of Jōyō kanji, size-adjusted

Chart of better phonetic series coverage of KanjiVG kanji, size-adjusted

(Notice that the scales are not the same between different graphics!)

456 Jōyō kanji, or 21.29%, are perfectly phonetic, as are 1411/6394 = 22.06% of our larger set (the yellow, imperfect part goes up to 34.46% and 49.69%, respectively). How to reconcile this with the often-cited figure that about 90% of kanji have a phonetic component? It’s simple: most of those components can only be detected through historical analysis, because either the sounds and graphs have changed too much, or else they now only appear in a single kanji. For example:

  • In 盆 bon, the top element 分 is a phonetic for the sound bon, coming ultimately from Old Chinese *pən. However, due to diverging phonetic changes, its role as a phonetic in 分粉紛雰 became fun (recall that Modern Japanese f- or h- was Old Japanese p-); in 貧, hin; and, in 頒, han. So the phonetic structure of 盆 (“dish” 皿 + sound *pən = *bˤən “bowl”) became obscured and approximate.
  • In 浮 fu “to float”, the right component 孚 is a phonetic element giving the sound fu (Old Chinese *mbu). It also appeared by itself as 孚 fu “confidence”, and in 俘 fu “prisoner”; however, these two characters are not used in modern writing anymore, so that 浮 “float” remained alone as the single example of the phonetic 孚-fu.


Fuller data sets are available in tab-separated text files (TSV), compatible with any spreadsheet software and easy to manipulate with Unix tools.

components_phonetic.jouyou.tsv and components_phonetic.kanjivg.tsv:
The main tables, with one phonetic series per line, ordered by goodness as described above.
kanji_phonetic.jouyou.tsv and kanji_phonetic.kanjivg.tsv:
The best phonetic series for each kanji.
Extracted from KanjiVG, a decomposition of each kanji into components.
A list of components that are considered to be “the same” in this experiment (for example, 人 and 亻, or 水 氵 氺). These are generally calligraphic abbreviations, marked as Z-variants or semantic variants in Unicode. This list is partial.
base_components.jouyou.tsv and base_components.kanjivg.tsv:
A list of all components needed to build the kanji set which are not themselves part of the set. Later I’ll add a new post on this.
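As a starting point for working with these files outside a spreadsheet, here is a small Python sketch. It makes no assumption about the column layout, which is not documented in this post; it only assumes the files are UTF-8 text with tab-separated fields. `read_tsv` and `rows_mentioning` are hypothetical helper names.

```python
import csv


def read_tsv(path):
    """Return every row of a tab-separated file as a list of strings."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary text.
        return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def rows_mentioning(rows, text):
    """Keep the rows in which any field contains the given text."""
    return [row for row in rows if any(text in field for field in row)]
```

For example, `rows_mentioning(read_tsv("components_phonetic.jouyou.tsv"), "包")` would list every line of the Jōyō table that mentions 包, assuming the file sits in the current directory.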

Thanks to everyone who sent suggestions after the first version of this article, and to the compilers of KanjiVG and Kanjidic. Readers interested in the history of characters are directed to my Kanjigen tool. Comments welcome!

31 thoughts on “Testing the predictive power of phonetic components in Japanese kanji”

  1. (Note for habitual readers: I changed the site stylesheet in order to highlight a few lines in the tables. If you don’t see any highlights, it’s probably caching the older version; try refreshing the browser a couple times with Ctrl+F5 or Ctrl+Shift+R).

  2. Interesting idea, especially the synchronic part. How about some graphs?
    – Number of kanji that are 100%, 90%, 80%… perfect (this is probably the most interesting one)
    – As above, but divided by reading set (number of 1-set kanji that are 100%, 90%… perfect; Number of 2-set kanji that are 100%, 90%… perfect)
    – Come to think of it, number of kanji with 1, 2, 3… reading sets?
    (Since you supply the data, I should probably do this myself, but I *still* don’t know how to use R).

    • I’m not sure I get what stats you’re thinking about (probably because, looking now, the names I chose are too confusing):

      – I used “perfect” to describe phonetic components, not kanji; i.e. those components that reliably predict all the readings of a kanji. By “90% perfect kanji”, do you mean a kanji with 90% of its readings predicted by a phonetic component?
      – I (arbitrarily) called the set of all readings of a given kanji its “reading set”; so they all have, by definition, one set. Do you mean the number of kanji in each component/reading-set group?

  3. It’s probably a bad sign that even I’m not sure what I was on about there. But thinking carefully, I seem to have written “kanji” where I meant “component”. So, out of n = 50 (or whatever) components, 10 are “perfect”, 20 are “very good”, etc. The theory being that if some components are “perfect” and others are “not”, with varying numbers involved, there must be a way to rank them and show what the territory between “perfect” and “completely useless” looks like.

    I suppose that figuring out a ranking method is the hard part. Is “皮” worse than “召”, and if so, by how much? And so on.

  4. Ok, I abandoned the focus on perfect sets in favor of three quantifiable criteria (kanji coverage %, size, and readings coverage %), and they’re now sorted in this order.

    The main weakness of this method is that, in many cases, a much larger phonetic series with a slightly worse kanji coverage feels more interesting, so the absolute priority for kanji coverage is unjustified. I tried to use a weighted function based on both variables, but at the end of the weekend decided that the simpler, more understandable sorting key made it easier to handle the data tables.

  5. Great reading! The new approach works for me. Looking down the coverage lists, it’s kind of shocking how closely my feelings about the characters in each row match the percentages– the safe and cozy 100%/100% rows, the approachable but somehow ominous 90%/50%(ish) rows, the forbidding tangle of the 70%/40% rows…

  6. Interesting article! It’s something I’ve also investigated on my quest for Japanese reading ability. Are you aware of an official list anywhere online of the phonetic components in Japanese? They are often mentioned but there does not seem to be a definitive list, although perhaps I am just in the dark.


    • I recently contacted you regarding your book and the master’s thesis of Hiroko Thompson. I downloaded the jouyou phonetic list from this site and converted it to a PDF. I can copy and paste an individual kanji and lookup whether it belongs to a phonetic group. My Rikaichan pop-up dictionary shows the meaning and readings of each kanji in the group. I have used phonetics for years to help me learn the kanji, based on Michael Pye’s book. This site does not necessarily supersede your book. Your book serves as a good introduction to the phonetic method. This site can be used as an electronic reference to support that method.

  7. @Natalie: Sorry, I had missed your comment! In case you have email notification on and are reading this: I don’t think there’s a definitive list, because there are wide disagreements over which components are (historically) phonetic, or even what counts as a component. The best we can do is to look at what’s classified as a “phono-semantic character” 形声文字 in traditional dictionaries: the Shuowen, the KangXi and, for Japanese, Morohashi. I’m unaware of online lists of such classifications, though a few web searches in Japanese will probably return something.

    Of course, if all you need are lists of phonetic hints that still work (as opposed to historical “phonetics”), then there’s always the data files in this post ;)

    But if you’re looking for such a list for learning purposes, I think that it’s better to just acquire the phonetics by osmosis as you learn new words (it’s how the natives learn them, after all). Vocabulary acquisition people (like Nation) tell us that learning similar words together “to understand their difference” is a bad idea, because interference effects are bad for memorization – it’s better to deal with similar words one at a time, learning other stuff in-between. My gut feeling is that a similar thing happens with kanji: if you try to learn a phonetic series serially one after another, then all those similar-looking, similar-sounding kanji blend together in a blur.

    If you just read stuff you’re interested in, you’ll find that one starts to naturally guess the pronunciation of new kanji from phonetic hints one has acquired unconsciously (though unfortunately this inference sometimes fails; I’m looking at you, 拡/kaku).

    • Hi there

      Thanks for your reply and thoughts. I think all the analysis you’ve done is pretty amazing.

      As for me, I was unsatisfied with the lack of an official list of phonetic components, so I went ahead and made one of my own.

      They are listed in my new book The Kanji Code.

    • Thanks for sharing your thoughts. In the end, I was still frustrated with a lack of an official list of phonetic components. So I have created one myself.
      It is listed in my new book The Kanji Code. For each phonetic I list kanji that contain it, the reading they share, and example words using that reading.
      I hope it will be of use to others like me who want to see a list.

  8. I think there is an error with the kanji 蜂(bee) in file kanji_phonetic.jouyou.tsv
    It has 三 as component instead of 丰 but its onyomi is ホウ, the same as the kanji with phonetic compound 丰 (邦 峰 縫).

    Btw, this is an amazing work and helps me a lot with my study. Thanks :)

    • hiro, thank you very much for the correction. I’m presently unable to redo this analysis, but I’m making a new one (using an actual corpus this time!), and I’ll take this datum into consideration. Thanks!

  9. I have to say, I really appreciate this post, and all the work that went into it. I actively use your data to determine which kanji to study together as sets, even with the limitations around the “imperfect” sets. For me, having kanji that share both a graphical component as well as a phonetic component actually allows me to learn those kanji much more quickly, as related sets of information. I think of it as being able to quickly create multiple indices in my mind to a specific kanji. I discovered this on my own, because it just didn’t make sense to me to learn random kanji in some arbitrary order when there were organizing principles “staring me in the face.” In searching for more data about that, I ran across your post.

    I’m just working on the jouyou kanji now, and I look forward to exploiting this method on the larger universe of kanji once I’ve mastered the jouyou. I’ve actually tried it already with a few of the larger non-jouyou sets, and it really worked very well for me. So I think I have to disagree with some of the received knowledge about this type of learning, based on my own experience. (I’ll be interested to read Reading Chinese Script: A Cognitive Analysis.) I am an adult learner of Japanese, and I think that makes a difference. Native speakers who are learning kanji already have the additional “index” or context of the spoken language which surrounds them every day. I don’t have that advantage. But I *can* create a much richer environment within which to establish these types of learning contexts by using techniques such as this. Yes it takes some time, especially if you are going it alone in trying to organize and extract the study information but…

    I’m going to share my dream with you and the rest of the Internet, in the hopes that someone is doing this, or has done this, or is interested in doing this: I would love to find an electronic listing of all the phonetic components that someone could load into a rendering program, such as the KanjiVG database and viewer or the like. Then the phonetic components could be programmed to show up in one specific color, and the semantic components in another. With the right indexing and interface, a student could choose to study groups of kanji related by phonetic and/or semantic components.

    Think about how effective a study aid that would be! You would actually be capturing some of our internal decoding processes on screen. How would it be if someone could build that?

    It would be great for me anyway! And I bet for others. What do you think? (And thanks again!)

  10. In components_phonetic.kanjivg.tsv there are several components having more than one predicted reading. They are just separated by a blank.

    Example: 竟 has キョウ and ケイ

    I would have expected to see two lines in that case.

    And worse it is impossible to see which of the entries is more important. Heisig (only jouyou) gives キョウ as main reading and ケイ as exception for 境.

    鏡 => キョウ 望遠鏡/顕微鏡/鏡台
    境 => キョウ 境界/国境/環境/境遇
    境 => ケイ 境内

    Btw: I confused 竸 with 競 which has interestingly enough also キョウ (main) and ケイ as readings.

    • Thanks for the bug report, Matthias. You’re completely right; I don’t know what I was thinking.

      Right now I’m busy with my thesis, and can’t bring myself to try to re-read my old code. If it’s any consolation, I’m doing some (hopefully) better statistical analyses for it, weighted by a corpus; and I’m taking care to make it reproducible, and will post the python+R code to github. There were also several problems with KanjiVG component decompositions; some I’ve forwarded to upstream, and will continue to do so as I find them. The deadline for my thesis is August 2016, so you’ll have better data by then.

    • Kanjinetworks says that 竸 and 競 are just graphical variants. ja.wiktionary has mostly the same readings/meanings ascribed to both, but (perhaps by lack of completion) the set of kun-yomi ascribed to them differs a bit. I don’t have a proper kanji dictionary at hand right now to look into…

  11. Good luck with your thesis!

    If you redo the analysis you might consider including “groups” of 1. By this you get a complete picture, including all exceptions. I guess this is also important to get variants as e.g. そつ and そっ.

    • Thanks! My current approach already includes code to unify morphophonemic processes, like sokuon (そつ → そっ), rendaku (そつ→ぞつ) and the like. I count them as “the same reading”, being conservative and assuming the reader knows which variant is used when.
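To make the idea concrete, here is a deliberately crude Python sketch of such a unification; it is not the code from the analysis mentioned above. The hypothetical function `surface_variants` takes a base reading in hiragana and returns the forms it could surface as through initial voicing (rendaku) or a final sokuon, so that two readings can be counted as the same when one is a variant of the other. It assumes that only a final つ, ち, く or き may turn into っ, and it ignores other alternations such as は becoming ぱ.

```python
# Unvoiced kana and their voiced counterparts, for word-initial rendaku.
VOICE = str.maketrans(
    "かきくけこさしすせそたちつてとはひふへほ",
    "がぎぐげござじずぜぞだぢづでどばびぶべぼ",
)

# Final kana assumed able to become a sokuon (small っ) inside compounds.
SOKUON_ENDINGS = "つちくき"


def surface_variants(base):
    """Return the readings that `base` may surface as in compounds."""
    if not base:
        return set()
    heads = {base[0], base[0].translate(VOICE)}
    tails = {base[1:]}
    if len(base) > 1 and base[-1] in SOKUON_ENDINGS:
        tails.add(base[1:-1] + "っ")
    return {head + tail for head in heads for tail in tails}


print(sorted(surface_variants("そつ")))
```

For そつ this yields そつ, そっ, ぞつ and ぞっ. The approach is optimistic: a base reading し would also cover じ, even where a dictionary lists the two as separate on-yomi, so a careful analysis would need to check variants against attested words.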

  12. So are the phonetic kanji worth learning deliberately in addition to normal kanji study?

    The report seemed to be indicating as such, then pulled another graph out and everything went obscure and now I’m not sure if it is worth it.

    • I don’t recommend deliberate memorization of phonetic components, specifically. Rather, study kanji as you usually do, but pay attention to their components; now that you know that some of them are phonetic hints, every so often they’ll be useful. You’ll learn which ones naturally, as you get exposed to them.

      (By the way, earlier I said that imperfect hints are more trouble than they’re worth. Now I think this only applies to the reader who’s still in early stages; as one gets to know more words, imperfect hints will actually become helpful in recalling the word to memory. But this only works after one’s already familiar with the word.)

      In my opinion (which is shared by many linguists), you shouldn’t spend a lot of time drilling kanji at first. It feels like industrious learning, but it’s a poor use of your study time. Concentrate on acquiring the Japanese language itself – the words and the grammar patterns – and postpone kanji as much as you can. The reason is that it will be a lot easier to deal with kanji when they’re just fancy notation for words you already know (consider that the Japanese acquire the language fluently before even starting kanji practice at school). After you get yourself decent Japanese proficiency, the main thrust of your kanji studies should be reading interesting texts (furigana-ed texts and electronic tools, e.g. rikaichan, are very OK here). Don’t bother checking every word in a dictionary; try to find texts that you can more-or-less understand as-is, and gloss over the hard parts. Ideally you want something compelling, like genre fiction with a cliffhanger every chapter – something which draws you to keep reading. At this point your main objective is still growing your vocabulary, and improving comprehension. Meanwhile, kanji familiarity will grow by osmosis, in a perfectly-tuned “spaced repetition system” (because, in your readings, you’ll practice more often the kanji that are used more often!)

      Deliberate kanji drilling should be an auxiliary study at best; never spend more time drilling kanji than reading texts. Kanji drills are mainly useful for a) writing Japanese by hand (=recall), and b) passing artificial tests like college exams or the JLPT. If you need either of these for some reason, you’ll need to drill kanji, but don’t let that detract from reading compelling texts.

    • For most purposes, I believe the question isn’t what I consider a component, but what The Tradition considers a component. The first landmark of The Tradition is the Shuōwén Jiězì, though it has important predecessors like the Hàn Shū which cites the Six Methods of Classification 六書 liùshū. So the short answer is, if a classic like the Shuōwén or the Kāngxī says “character X is composed of W and Y”, then W and Y are components. Note that the large majority of components are themselves characters, and that composition is a recursive process (so a component may be an aggregate of two or more subcomponents). Under this model, a component is a phonetic or semantic component (音符/意符) if the tradition claims it as such. This is the definition I use most of the time and what enables me to interact with dictionaries, reference materials, other readers etc.

      A synchronic, instrumental definition would be analogous to the process of identifying morphemes in strings of phonemes (=strokes); you’d look for graphical subelements which recur in at least two characters, while co-occurring with phonetic or semantic values (or both, cf. Tōdō Akiyasu’s work). A more sophisticated historical definition would go beyond the classics and try to reconstruct the components from oracle-bone scholarship (cf. Ochiai Atsushi, with grains of salt).

      For the purpose of this post (and of my thesis, which took a corpus-based approach), I deliberately set out to be ahistorical, because I meant to test how much information is left in the system. In this case, components were defined graphically as recurring visual sub-elements (taken from the KanjiVG vector graphics database), almost blind to tradition; the single exception is that a few allographs (graphical variants) taught as “the same component” in Japanese school education (e.g. 心 忄 㣺) are treated as instances of the same component, since most Japanese-educated readers would treat them this way. In the thesis I tested how many Jōyō characters occur, in a typical text (Wikipedia), with readings that can be predicted by any of their components, thus defined. That is, for any given kanji/reading instance, I checked whether any of its graphical subelements, taken ahistorically, occurs in other characters with the same reading, and is mostly restricted to those, so that the presence of that visual subelement is enough to predict the reading. I reached a figure of around 23%, being as optimistic as I could. In other words, a reader going through Wikipedia with full knowledge of the phonetically predictive components of the Jōyō kanji, but (somehow) unable to recall kanji readings, would be able to predict about 23% of them, optimistically.
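
To make the counting rule above concrete, here is a minimal sketch in Python. It is not the code used in the thesis: the function name `predictable_share`, the `min_purity` threshold (my stand-in for "mostly restricted to those"), the romanized readings, and the tiny hand-made component table are all assumptions for illustration. A real run would take components from KanjiVG and reading-tagged tokens from a corpus.

```python
from collections import defaultdict


def predictable_share(components, readings, tokens, min_purity=0.75):
    """Share of (kanji, reading) tokens whose reading some component predicts.

    components: kanji -> set of graphical components (hypothetical toy data)
    readings:   kanji -> set of on-yomi readings the kanji can take
    tokens:     list of (kanji, reading) pairs as they occur in a text
    min_purity: assumed threshold for "mostly restricted to" that reading
    """
    # Index: component -> all characters that contain it.
    carriers = defaultdict(set)
    for kanji, parts in components.items():
        for part in parts:
            carriers[part].add(kanji)

    def predicts(part, kanji, reading):
        # Look only at the *other* characters carrying this component.
        others = carriers[part] - {kanji}
        if not others:
            return False
        agreeing = sum(1 for k in others if reading in readings.get(k, ()))
        return agreeing / len(others) >= min_purity

    hits = sum(
        1
        for kanji, reading in tokens
        if any(predicts(p, kanji, reading) for p in components.get(kanji, ()))
    )
    return hits / len(tokens) if tokens else 0.0


# Toy data, made up for this example.
components = {
    "伴": {"亻", "半"},
    "判": {"刂", "半"},
    "畔": {"田", "半"},
    "休": {"亻", "木"},
    "体": {"亻", "本"},
}
readings = {
    "伴": {"han", "ban"},
    "判": {"han", "ban"},
    "畔": {"han"},
    "休": {"kyuu"},
    "体": {"tai"},
}
tokens = [("伴", "han"), ("判", "han"), ("休", "kyuu"), ("体", "tai")]

share = predictable_share(components, readings, tokens)
print(share)
```

In this toy table, 半 predicts "han" for 伴 and 判 because every other character containing it shares that reading, while 亻 predicts nothing, so two of the four tokens count as predictable. Counting over tokens rather than over distinct characters is what makes the figure frequency-weighted, and the threshold is the knob that decides how optimistic the estimate is.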

  13. It would be worthwhile to mention that a few of the phonetic-component readings for particular kanji are so obscure that they appear in no common words, or only in names. A little asterisk or an additional family measure would show the actual usefulness of the reading.
