Testing the power of phonetic components in Japanese kanji
A large majority (up to 90%) of Chinese characters have a phonetic component: a hint that suggests their pronunciation (though that’s 90% of the total, not 90% of what you’d stumble upon in actual use; nonphonetic characters are disproportionately more frequent). For example, 半 bàn “half” appears inside 伴 “companion” because the latter is also pronounced bàn; and further, it also suggests the approximate pronunciation of 判 pàn “judge”, 叛 pàn “rebel”, and 胖 pàng “fat”. Beginning readers often don’t notice this feature, but awareness of phonetic hints grows with proficiency.
However, this system was always imprecise, and grew ever looser as the spoken language changed and the characters were simplified in various ways. In many cases, it’s now unclear to what degree a component was originally added as a phonetic hint, a semantic mnemonic, both, or neither. For example, it’s conceivable that whoever first built the character for 判 “judge” decided to use 半 not just for its sound, but also because they thought “cutting 刂 in equal halves 半” makes sense as a mnemonic for “judging”.
When the characters are used to represent Japanese, the phonetic hints only work with Sino-Japanese readings, i.e. the on-yomi (…that’s kind of why they’re called on-yomi, “sound readings”). Moreover, even for on-yomi, the Japanese readings grew to be even more imprecise than the Chinese ones. I wanted to try to measure their predictive power; this post reports the results of a simple but quantitative experiment (for a quick summary, skip to results!).
Phonetic components and the Japanese language
There are various reasons why the hints are less useful for Japanese:
- In their original language, most characters have only one reading. In Japanese they typically have at least two, the original Sinitic plus a native gloss. This point alone means the hints don’t work for about half (?) of the readings.
- What’s more, even though there are (were) many Sinitic languages through space (time), a given Chinese community typically read the characters in only one way. In the case of Japanese, it’s common for characters to have several coexisting Sinitic readings (multiple on-yomi), because the language preserves several loan strata. And all of these multiple on-yomi were subject to local language change, independently of the mainland.
- Of the current standard (Jōyō) kanji set, 364 (about 17%) were simplified after 1946, breaking graphical relationships.
That doesn’t mean they’re useless, though, or that natives don’t process them. Mary Noguchi of Kanji Clinic describes this hypothetical example about the character for “hemorrhoid”:
[…] locate a willing subject to write, from memory, the kanji for ぢ. The point of this exercise is to witness the trial and error strategy a native speaker may employ when writing or pronouncing a relatively unfamiliar kanji such as 痔. Your friend will probably begin by writing the kanji component 疒 (“sickness”). This will put her into the ballpark, but writing the rest of the character may prove more difficult. She might start wracking her brain for a component which signals the pronunciation ji:
“Is it 寿 (“longevity”)? Oh, no, 寿 is pronounced ju, so that can’t be right. Hmmm, how about 血 (“blood,” usually pronounced chi, but sometimes ji in compounds like hanaji “bloody nose”)? Or maybe 地 (“ground,” ji)? No,” she moans, “that doesn’t look right, either. OK, I think it’s 寺 (“temple,” ji) plus 疒. Yes, 痔. That’s it!”
Sometimes a native taking quick notes will even substitute a kana character for a complicated phonetic component: for example, noting 機 ki “machine” by using only the left 木 “wood” component plus a squeezed phonetic ki キ to the right.
Goals and definitions: what to measure, and how
For this experiment, I extracted readings from Jim Breen’s venerable Kanjidic, and structural analysis from KanjiVG, a publicly available database of graphical and component decompositions. (If there are any errors in the tables, please report them, so that I can either debug my code or forward corrections to the original sources.) Regarding my goals:
- I chose KanjiVG because I was interested in a synchronic analysis, that is, of the structure of the kanji as they are now, not of their historical (traditional, Seal, or pre-Qin) forms. I ignored history, and looked for correlations between modern visual components and modern readings, in the spirit of testing how much information is still present in the system.
- I also chose to ignore approximate readings, out of didactic interest (it feels to me that, in the case of Japanese, approximations are more trouble than they’re worth). I looked for exact matches.
- And I was especially interested in components that could be used reliably as a guide to pronunciation.
There’s quite a bit of data to massage, and it can be tricky to measure what exactly is a “good” phonetic component. In the next section I make some important definitions about metrics.
Kanji sets
First of all, the results will differ significantly depending on which kanji set (our universe) is analyzed. We’ll investigate two such sets:
- The set of kanji taught in Japanese education, today called Jōyō kanji. Since 1945, Japanese texts generally omit furigana readings for Jōyō, meaning that adult readers are expected to know by heart the readings of this set. Our Jōyō analyses are based on the full set of 2141 characters (including the 5 that were dropped in 2010, namely 勺 銑 脹 錘 匁 ).
- The largest set I have available, that is, all kanji in the current version of KanjiVG: a bit more than 6300. I don’t know if there are any selection criteria for this set, but I expect they’re some of the most common, and six thousand is a good estimate of the knowledge of a proficient, literary-minded native reader.
Quantifying phonetic series
Within each kanji set, the basic variables to relate are:
- A few thousand kanji, where
- each kanji is made of one or more components, and
- each kanji has zero or more (on-yomi) readings.
A component series is a set of kanji that include a certain component. Here are some examples from the Jōyō set:
| Component | Size of series | Kanji in series | ||||||
|---|---|---|---|---|---|---|---|---|
| 走 | 7 | 越 | 起 | 趣 | 走 | 超 | 徒 | 赴 |
| 青 | 7 | 情 | 晴 | 清 | 精 | 請 | 青 | 静 |
| 包 | 6 | 包 | 抱 | 泡 | 砲 | 胞 | 飽 | |
| 乍 | 5 | 作 | 搾 | 昨 | 詐 | 酢 | ||
| 及 | 4 | 及 | 吸 | 扱 | 級 | |||
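As a rough illustration of how component series like these can be derived, here is a minimal Python sketch that inverts a kanji-to-components mapping. The `DECOMPOSITION` table is a tiny hand-written sample of my own (not the actual KanjiVG data or format), and it assumes that a kanji used as a component elsewhere also lists itself, so that 包 lands in its own series as in the table above.

```python
from collections import defaultdict

# Hypothetical, hand-written sample: kanji -> set of components.
# Real KanjiVG data is far larger and decomposes characters further.
DECOMPOSITION = {
    "包": {"包"},
    "抱": {"扌", "包"},
    "泡": {"氵", "包"},
    "及": {"及"},
    "吸": {"口", "及"},
    "扱": {"扌", "及"},
    "級": {"糸", "及"},
}


def component_series(decomposition):
    """Invert kanji -> components into component -> set of kanji."""
    series = defaultdict(set)
    for kanji, components in decomposition.items():
        for component in components:
            series[component].add(kanji)
    return dict(series)


series = component_series(DECOMPOSITION)
print(sorted(series["及"]))  # the four kanji of the 及-series
print(sorted(series["扌"]))  # 抱 and 扱 share the hand radical
```

Note that the same inversion also produces series for purely semantic components such as 扌; telling those apart from phonetic ones is what the coverage metrics defined next are for.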
For our purposes, a phonetic series is a set of kanji that shares a component and a reading. If we add each kanji’s readings to the table above, interesting patterns appear:
| Component | Size of series | Kanji in series | | | | | | |
|---|---|---|---|---|---|---|---|---|
| 走 | 7 | 越 | 起 | 趣 | 走 | 超 | 徒 | 赴 |
| | | etsu, otsu | ki | shu | sou | chou | to | fu |
| 青 | 7 | 情 | 晴 | 清 | 精 | 請 | 青 | 静 |
| | | sei, jou | sei | sei, shou, shin | sei, shou, shiyau | sei, shou, shin | sei, shou | sei, jou |
| 包 | 6 | 包 | 抱 | 泡 | 砲 | 胞 | 飽 | |
| | | hou | hou | hou | hou | hou | hou | |
| 乍 | 5 | 作 | 搾 | 昨 | 酢 | 詐 | | |
| | | saku, sa | saku | saku | saku | sa | | |
| 及 | 4 | 及 | 吸 | 扱 | 級 | | | |
| | | kyuu | kyuu | kyuu, sou | kyuu | | | |
First, consider the 走-series. Not a single kanji in it has a shared reading! 走 is not a phonetic component at all, i.e. the 走-series is not a phonetic series.
Compare to the 包-series. All the six kanji that include 包 are pronounced hou. In other words, 包-hou is a phonetic series of size 6.
Now consider the 乍-series. Almost there! Four out of five kanji include the reading saku, but 詐 breaks the pattern; it’s only read as sa. In this case the 乍-saku phonetic series (size 4, 作搾昨酢) is smaller than the full 乍 component series (5, 作搾昨詐酢). We say this phonetic series covers 4 out of 5 kanji, or that it has a kanji coverage of 4/5 = 80%.
What about the 青-series? It does include a phonetic series, with 100% coverage: all characters do have a shared reading, sei. However, most characters also have extra, unpredictable readings, so that this component is less predictive than 包, and we should measure this. Call readings coverage the ratio of predicted readings to all the readings of all kanji in which the component appears (that is, all kanji in the larger component series). 青-sei is a phonetic series of size 7, with 100% kanji coverage but only 43.75% (7/16) readings coverage.
I hope it’s clear by now that 及-kyuu is a size-4 phonetic series with 100% kanji coverage and 80% (4/5) readings coverage.
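To make the two coverage metrics concrete, here is a small Python sketch that computes them for the examples above. The readings are copied from the table; the function name `coverage` and the data layout are my own choices for illustration, not the code actually used for the experiment.

```python
# On-yomi copied from the example table above.
READINGS = {
    "情": ["sei", "jou"], "晴": ["sei"], "清": ["sei", "shou", "shin"],
    "精": ["sei", "shou", "shiyau"], "請": ["sei", "shou", "shin"],
    "青": ["sei", "shou"], "静": ["sei", "jou"],
    "作": ["saku", "sa"], "搾": ["saku"], "昨": ["saku"],
    "酢": ["saku"], "詐": ["sa"],
    "及": ["kyuu"], "吸": ["kyuu"], "扱": ["kyuu", "sou"], "級": ["kyuu"],
}


def coverage(series, reading, readings=READINGS):
    """Return (size, kanji coverage, readings coverage) for one candidate.

    `series` is the full component series (an iterable of kanji) and
    `reading` is the reading that the component is supposed to predict.
    """
    series = list(series)
    members = [k for k in series if reading in readings[k]]
    total_readings = sum(len(readings[k]) for k in series)
    if not series or total_readings == 0:
        return 0, 0.0, 0.0
    kanji_cov = len(members) / len(series)        # kanji that carry the reading
    readings_cov = len(members) / total_readings  # predicted / all readings
    return len(members), kanji_cov, readings_cov


print(coverage("情晴清精請青静", "sei"))  # (7, 1.0, 0.4375)
print(coverage("及吸扱級", "kyuu"))        # (4, 1.0, 0.8)
print(coverage("作搾昨詐酢", "saku"))      # size 4, 4/5 of kanji, 4/6 of readings
```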
Phonetic series with 100% kanji and 100% readings coverage (like 包-hou) are especially useful; these ratings mean that, whenever the component appears, one can be sure of all readings of the kanji. We call these perfect series. Second in importance are those with 100% kanji coverage but less than 100% readings coverage (like 及-kyuu and 青-sei); let’s name them semiperfect series. If you see a semiperfect phonetic component, you can be sure of at least one of the kanji’s readings. Series with less than 100% kanji coverage are not as useful, since you have to memorize the exceptions anyway; these are imperfect series.
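The three categories can be expressed as a tiny classifier. This is only a sketch under my own naming (`classify` and its parameters are hypothetical); it works on raw counts rather than percentages to avoid floating-point comparisons.

```python
def classify(phonetic_size, component_size, total_readings):
    """Classify a phonetic series from three counts.

    phonetic_size:  kanji in the component series that have the reading
    component_size: all kanji in the component series
    total_readings: all readings of all kanji in the component series
    """
    if phonetic_size < component_size:
        return "imperfect"      # some kanji lack the predicted reading
    if phonetic_size < total_readings:
        return "semiperfect"    # every kanji has it, plus extra readings
    return "perfect"            # the predicted reading is the only one


print(classify(6, 6, 6))   # 包-hou  -> perfect
print(classify(7, 7, 16))  # 青-sei  -> semiperfect
print(classify(4, 4, 5))   # 及-kyuu -> semiperfect
print(classify(4, 5, 6))   # 乍-saku -> imperfect
```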
Attentive readers might have noticed that a single component can belong to several phonetic series; 乍, for example, could also be described as a very imperfect predictor for sa, working for 作詐 but not 搾昨酢. It of course performs better as a predictor for saku, since in that role it gets 80% kanji coverage and 66.7% readings coverage for 4 kanji, rather than 40% and 33.3% for just 2. We’re now in a position to choose the best series for a component or a kanji; the ranking criteria are:
- First, higher kanji coverage;
- Second, larger size (more kanji);
- And third, higher readings coverage.
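These three criteria amount to a lexicographic sort key. Here is a sketch of that ordering in Python, using the counts from the examples above; the tuple layout and the name `rank_key` are assumptions of mine, and exact fractions are used so that ties are not disturbed by rounding.

```python
from fractions import Fraction

# (component, reading, phonetic size, component-series size, total readings)
CANDIDATES = [
    ("乍", "sa", 2, 5, 6),
    ("及", "kyuu", 4, 4, 5),
    ("乍", "saku", 4, 5, 6),
    ("包", "hou", 6, 6, 6),
    ("青", "sei", 7, 7, 16),
]


def rank_key(candidate):
    """Sort key: kanji coverage, then size, then readings coverage."""
    _, _, size, component_size, total_readings = candidate
    return (
        Fraction(size, component_size),  # first: higher kanji coverage
        size,                            # second: more kanji
        Fraction(size, total_readings),  # third: higher readings coverage
    )


ranked = sorted(CANDIDATES, key=rank_key, reverse=True)
for component, reading, *_ in ranked:
    print(f"{component}-{reading}")

# Best series for a single component: take the maximum over its candidates.
best_for_saku_component = max(
    (c for c in CANDIDATES if c[0] == "乍"), key=rank_key
)
print(best_for_saku_component[:2])  # ('乍', 'saku')
```

With this key, 青-sei ranks above 包-hou despite its lower readings coverage, because size is compared before readings coverage.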
Results
Perfect and semiperfect series
Here’s a selection of the better phonetic series for Jōyō kanji, including all those with 100% kanji coverage and size 3 or larger:
| Component | Size | Kanji in series | Predicted readings | Readings coverage (%) | Kanji with extra readings |
|---|---|---|---|---|---|
| 青 | 7 | 情 晴 清 精 請 青 静 | sei | 43.75 | 情 清 精 請 青 静 |
| 包 | 6 | 包 抱 泡 砲 胞 飽 | hou | 100 | |
| 交 | 6 | 交 効 校 絞 較 郊 | kou | 75 | 校 較 |
| 付 | 5 | 付 府 符 腐 附 | fu | 100 | |
| 司 | 5 | 伺 司 嗣 詞 飼 | shi | 100 | |
| 僉 | 5 | 倹 剣 検 険 験 | ken | 83.333 | 験 |
| 兪 | 5 | 喩 愉 癒 諭 輸 | yu | 83.333 | 輸 |
| 冓 | 4 | 構 溝 講 購 | kou | 100 | |
| 咼 | 4 | 渦 禍 過 鍋 | ka | 100 | |
| 复 | 4 | 復 腹 複 覆 | fuku | 100 | |
| 官 | 4 | 官 棺 管 館 | kan | 100 | |
| 朱 | 4 | 朱 株 殊 珠 | shu | 100 | |
| 皆 | 4 | 楷 皆 諧 階 | kai | 100 | |
| 襄 | 4 | 壌 嬢 譲 醸 | jou | 100 | |
| 采 | 4 | 彩 採 菜 采 | sai | 100 | |
| 及 | 4 | 及 吸 扱 級 | kyuu | 80 | 扱 |
| 屈 | 4 | 堀 屈 掘 窟 | kutsu | 80 | 窟 |
| 義 | 4 | 儀 犠 義 議 | gi | 80 | 犠 |
| 麻 | 4 | 摩 磨 魔 麻 | ma | 80 | 麻 |
| 化 | 4 | 化 花 貨 靴 | ka | 66.667 | 化 花 |
| 半 | 4 | 伴 判 半 畔 | han | 66.667 | 伴 判 |
| 固 | 4 | 個 固 箇 錮 | ko | 66.667 | 個 箇 |
| 相 | 4 | 想 相 箱 霜 | sou | 66.667 | 想 相 |
| 夾 | 4 | 峡 挟 狭 頰 | kyou | 57.143 | 峡 挟 狭 |
| 丰 | 3 | 峰 縫 邦 | hou | 100 | |
| 五 | 3 | 五 悟 語 | go | 100 | |
| 亢 | 3 | 坑 抗 航 | kou | 100 | |
| 先 | 3 | 先 洗 銑 | sen | 100 | |
| 兹 | 3 | 慈 滋 磁 | ji | 100 | |
| 利 | 3 | 利 梨 痢 | ri | 100 | |
| 則 | 3 | 側 則 測 | soku | 100 | |
| 廷 | 3 | 庭 廷 艇 | tei | 100 | |
| 扁 | 3 | 偏 編 遍 | hen | 100 | |
| 氐 | 3 | 底 抵 邸 | tei | 100 | |
| 永 | 3 | 永 泳 詠 | ei | 100 | |
| 泉 | 3 | 泉 線 腺 | sen | 100 | |
| 票 | 3 | 標 漂 票 | hyou | 100 | |
| 章 | 3 | 彰 章 障 | shou | 100 | |
| 賁 | 3 | 噴 墳 憤 | fun | 100 | |
| 丙 | 3 | 丙 柄 病 | hei | 75 | 病 |
| 刃 | 3 | 刃 忍 認 | nin | 75 | 刃 |
| 善 | 3 | 善 繕 膳 | zen | 75 | 膳 |
| 奴 | 3 | 努 奴 怒 | do | 75 | 怒 |
| 家 | 3 | 嫁 家 稼 | ka | 75 | 家 |
| 巨 | 3 | 巨 拒 距 | kyo | 75 | 拒 |
| 帝 | 3 | 帝 締 諦 | tei | 75 | 諦 |
| 広 | 3 | 広 拡 鉱 | kou | 75 | 拡 |
| 建 | 3 | 健 建 鍵 | ken | 75 | 建 |
| 曹 | 3 | 曹 槽 遭 | sou | 75 | 曹 |
| 朝 | 3 | 嘲 朝 潮 | chou | 75 | 嘲 |
| 求 | 3 | 救 求 球 | kyuu | 75 | 求 |
| 苗 | 3 | 描 猫 苗 | byou | 75 | 苗 |
| 袁 | 3 | 園 猿 遠 | en | 75 | 遠 |
| 𠔉 | 3 | 券 圏 巻 | ken | 75 | 巻 |
| 代 | 3 | 代 袋 貸 | tai | 60 | 代 袋 |
| 居 | 3 | 居 据 裾 | kyo | 60 | 居 裾 |
| 敝 | 3 | 幣 弊 蔽 | hei | 60 | 蔽 |
| 壮 | 3 | 壮 荘 装 | sou | 50 | 荘 装 |
| 太 | 3 | 太 汰 駄 | ta | 50 | 太 汰 駄 |
Here’s a similar selection for KanjiVG kanji (including non-Jōyō), limited to 100% kanji coverage and size 5 or greater:
| Component | Size | Kanji in series | Predicted readings | Readings coverage (%) | Kanji with extra readings |
|---|---|---|---|---|---|
| 五 | 14 | 五 伍 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 | go | 82.353 | 圄 衙 |
| 盧 | 13 | 廬 櫨 瀘 爐 盧 臚 艫 蘆 轤 鑪 顱 驢 鱸 | ro | 81.25 | 廬 臚 驢 |
| 票 | 12 | 剽 嫖 慓 標 漂 瓢 票 縹 飃 飄 驃 鰾 | hyou | 100 | |
| 吾 | 12 | 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 | go | 80 | 圄 衙 |
| 皇 | 12 | 凰 徨 惶 湟 煌 皇 篁 蝗 遑 鍠 隍 鰉 | kou | 80 | 凰 皇 遑 |
| 咢 | 11 | 咢 愕 萼 蕚 諤 鄂 鍔 顎 鰐 鶚 齶 | gaku | 100 | |
| 秋 | 11 | 啾 愀 愁 楸 湫 甃 秋 萩 鍬 鞦 鰍 | shuu | 73.333 | 愀 湫 鍬 鰍 |
| 倉 | 11 | 倉 創 愴 搶 槍 滄 瘡 艙 蒼 蹌 鎗 | sou | 64.706 | 創 搶 槍 瘡 蹌 鎗 |
| 冓 | 10 | 冓 媾 搆 構 溝 篝 覯 講 購 遘 | kou | 100 | |
| 扁 | 10 | 偏 扁 篇 編 翩 蝙 褊 諞 遍 騙 | hen | 100 | |
| 差 | 10 | 嗟 嵯 嵳 差 搓 槎 瑳 磋 縒 蹉 | sa | 71.429 | 嗟 嵯 搓 縒 |
| 朋 | 9 | 堋 崩 弸 朋 棚 硼 繃 萠 鵬 | hou | 90 | 繃 |
| 半 | 9 | 伴 判 半 叛 拌 畔 絆 胖 袢 | han | 75 | 伴 判 叛 |
| 胡 | 9 | 楜 湖 瑚 糊 胡 葫 蝴 醐 餬 | ko | 60 | 瑚 糊 胡 醐 |
| 亢 | 8 | 亢 伉 吭 坑 抗 杭 航 頏 | kou | 100 | |
| 喬 | 8 | 僑 喬 嬌 橋 矯 蕎 轎 驕 | kyou | 100 | |
| 章 | 8 | 嶂 彰 樟 璋 瘴 章 障 鱆 | shou | 100 | |
| 司 | 8 | 伺 司 嗣 祠 笥 覗 詞 飼 | shi | 88.889 | 笥 |
| 義 | 8 | 儀 嶬 犠 礒 義 艤 蟻 議 | gi | 80 | 犠 礒 |
| 賓 | 8 | 嬪 擯 檳 殯 濱 繽 賓 鬢 | hin | 80 | 檳 鬢 |
| 旁 | 8 | 傍 旁 榜 滂 磅 膀 蒡 謗 | bou | 53.333 | 旁 榜 滂 磅 膀 蒡 謗 |
| 奚 | 7 | 奚 渓 溪 谿 蹊 鶏 鷄 | kei | 100 | |
| 容 | 7 | 容 榕 溶 熔 穃 蓉 鎔 | you | 100 | |
| 介 | 7 | 介 价 堺 界 畍 疥 芥 | kai | 87.5 | 芥 |
| 及 | 7 | 及 吸 岌 扱 汲 笈 級 | kyuu | 87.5 | 扱 |
| 于 | 7 | 于 吁 宇 盂 紆 芋 迂 | u | 77.778 | 于 吁 |
| 代 | 7 | 代 垈 岱 玳 袋 貸 黛 | tai | 70 | 代 垈 袋 |
| 固 | 7 | 個 凅 固 涸 痼 箇 錮 | ko | 63.636 | 個 凅 涸 箇 |
| 會 | 7 | 會 檜 獪 繪 膾 薈 鱠 | kai | 58.333 | 會 繪 膾 薈 鱠 |
| 蒙 | 7 | 曚 朦 檬 濛 矇 艨 蒙 | mou | 58.333 | 曚 朦 矇 艨 蒙 |
| 叟 | 7 | 叟 嫂 捜 搜 溲 痩 艘 | sou | 36.842 | 叟 捜 搜 溲 痩 艘 |
| 安 | 6 | 安 按 晏 案 鞍 鮟 | an | 100 | |
| 旱 | 6 | 悍 捍 旱 桿 稈 駻 | kan | 100 | |
| 屈 | 6 | 倔 堀 屈 崛 掘 窟 | kutsu | 85.714 | 窟 |
| 広 | 6 | 広 拡 昿 砿 絋 鉱 | kou | 85.714 | 拡 |
| 廣 | 6 | 壙 廣 擴 曠 礦 鑛 | kou | 85.714 | 擴 |
| 歴 | 6 | 櫪 歴 瀝 癧 轣 靂 | reki | 85.714 | 歴 |
| 皆 | 6 | 偕 揩 楷 皆 諧 階 | kai | 85.714 | 揩 |
| 連 | 6 | 嗹 漣 縺 蓮 連 鏈 | ren | 85.714 | 漣 |
| 廷 | 6 | 庭 廷 挺 梃 艇 霆 | tei | 75 | 挺 梃 |
| 曹 | 6 | 曹 槽 漕 糟 艚 遭 | sou | 75 | 曹 糟 |
| 求 | 6 | 救 毬 求 球 裘 逑 | kyuu | 75 | 求 裘 |
| 焦 | 6 | 憔 樵 焦 礁 蕉 鷦 | shou | 75 | 憔 |
| 闌 | 6 | 欄 瀾 爛 蘭 襴 闌 | ran | 75 | 蘭 襴 |
| 寧 | 6 | 嚀 寧 檸 濘 獰 聹 | nei | 66.667 | 嚀 檸 獰 |
| 袁 | 6 | 園 猿 薗 袁 轅 遠 | en | 66.667 | 薗 袁 遠 |
| 解 | 6 | 廨 懈 蟹 蠏 解 邂 | kai | 66.667 | 廨 懈 解 |
| 居 | 6 | 倨 居 据 裾 踞 鋸 | kyo | 54.545 | 倨 居 裾 踞 鋸 |
| 無 | 6 | 嘸 廡 憮 撫 無 蕪 | bu | 54.545 | 嘸 憮 撫 無 蕪 |
| 辱 | 6 | 溽 縟 耨 蓐 褥 辱 | joku | 54.545 | 溽 耨 蓐 |
| 末 | 6 | 抹 末 沫 秣 茉 靺 | matsu | 50 | 末 沫 秣 茉 靺 |
| 芻 | 6 | 皺 芻 蒭 趨 鄒 雛 | suu | 37.5 | 皺 芻 蒭 趨 鄒 雛 |
| 匆 | 5 | 偬 匆 怱 愡 葱 | sou | 100 | |
| 奐 | 5 | 喚 奐 換 渙 煥 | kan | 100 | |
| 幾 | 5 | 幾 機 磯 譏 饑 | ki | 100 | |
| 沙 | 5 | 娑 沙 莎 裟 鯊 | sa sha | 100 | |
| 采 | 5 | 彩 採 綵 菜 采 | sai | 100 | |
| 卷 | 5 | 卷 圈 惓 綣 蜷 | ken | 83.333 | 卷 |
| 賛 | 5 | 攅 纉 讃 賛 鑚 | san | 83.333 | 纉 |
| 郎 | 5 | 廊 榔 瑯 螂 郎 | rou | 83.333 | 郎 |
| 戻 | 5 | 唳 戻 捩 涙 綟 | rei | 71.429 | 捩 涙 |
| 柔 | 5 | 揉 柔 糅 蹂 鞣 | juu | 71.429 | 柔 鞣 |
| 當 | 5 | 當 礑 蟷 襠 鐺 | tou | 71.429 | 鐺 |
| 荅 | 5 | 剳 塔 搭 荅 鞳 | tou | 71.429 | 剳 |
| 析 | 5 | 晰 析 淅 皙 蜥 | seki | 62.5 | 晰 皙 蜥 |
| 賁 | 5 | 噴 墳 憤 濆 賁 | fun | 62.5 | 濆 賁 |
| 達 | 5 | 撻 燵 達 闥 韃 | tatsu | 62.5 | 達 闥 韃 |
| 保 | 5 | 保 堡 葆 褒 褓 | hou | 55.556 | 保 堡 葆 褓 |
| 將 | 5 | 奬 將 漿 獎 鏘 | shou | 55.556 | 奬 將 獎 鏘 |
| 星 | 5 | 惺 星 猩 腥 醒 | sei | 55.556 | 星 猩 腥 |
| 壯 | 5 | 壯 奘 弉 莊 裝 | sou | 50 | 奘 弉 莊 裝 |
| 段 | 5 | 椴 段 緞 葮 鍛 | tan | 50 | 椴 段 緞 葮 |
| 族 | 5 | 嗾 族 簇 蔟 鏃 | zoku | 45.455 | 嗾 簇 蔟 鏃 |
It’s clear that phonetic series grow in usefulness as we venture outside Jōyō. This is probably because, as mentioned earlier, rarer characters are more likely to have been built with a phonetic structure.
Notice 吾-go is kind of a spurious series, because the component 吾 itself includes a better phonetic component, 五-go:
| 五 | 14 | 五 伍 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 | go | 82.353 | 圄 衙 |
| 吾 | 12 | 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 | go | 80 | 圄 衙 |
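One possible way to flag such nested series automatically is to look for a series whose component is itself a member of a larger series with the same reading. The sketch below is my own heuristic, not necessarily how the tables were produced; the kanji sets are copied from the two rows above, plus 票-hyou as an unrelated control.

```python
# (component, reading) -> kanji in the phonetic series, from the tables above.
SERIES = {
    ("五", "go"): set("五伍吾唔圄寤悟晤梧牾珸衙語齬"),
    ("吾", "go"): set("吾唔圄寤悟晤梧牾珸衙語齬"),
    ("票", "hyou"): set("剽嫖慓標漂瓢票縹飃飄驃鰾"),
}


def subsumed_series(series):
    """Yield (inner, outer) pairs where inner is contained in outer.

    A series is treated as subsumed when another series has the same
    reading, contains the inner component as one of its kanji, and
    covers a strict superset of the inner series' kanji.
    """
    for inner, inner_kanji in series.items():
        for outer, outer_kanji in series.items():
            same_reading = inner[1] == outer[1]
            if same_reading and inner[0] in outer_kanji and inner_kanji < outer_kanji:
                yield inner, outer


print(list(subsumed_series(SERIES)))
```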
Imperfect series
Here are some of the top few phonetic series with imperfect kanji coverage, for Jōyō kanji:
| Component | Size | Kanji in component series | Size of phonetic series | Kanji not in phonetic series | Kanji coverage (%) | Predicted readings | Readings coverage (%) |
|---|---|---|---|---|---|---|---|
| 召 | 8 | 召 招 昭 沼 照 紹 詔 超 | 7 | 超 | 87.5 | shou | 87.5 |
| 次 | 7 | 姿 恣 次 盗 茨 諮 資 | 6 | 盗 | 85.714 | shi | 66.667 |
| 奇 | 6 | 埼 奇 寄 崎 椅 騎 | 5 | 椅 | 83.333 | ki | 83.333 |
| 令 | 6 | 令 冷 鈴 零 領 齢 | 5 | 領 | 83.333 | rei | 71.429 |
| 反 | 9 | 仮 反 坂 板 版 販 返 阪 飯 | 7 | 仮 返 | 77.778 | han | 50 |
| 父 | 8 | 交 効 校 父 絞 較 郊 釜 | 6 | 父 釜 | 75 | kou | 60 |
| 生 | 8 | 姓 性 星 牲 生 産 醒 隆 | 6 | 産 隆 | 75 | sei | 50 |
| ⺦ | 7 | 壮 奨 寝 将 状 荘 装 | 5 | 寝 状 | 71.429 | sou | 41.667 |
| 亡 | 9 | 亡 妄 忘 忙 慌 望 盲 網 荒 | 6 | 忘 慌 荒 | 66.667 | mou | 46.154 |
For the KanjiVG set, there are more than 600 series between 75% and 99% kanji coverage, many of them quite large. Here’s a sampling of some interesting ones:
| Component | Size | Kanji in component series | Size of phonetic series | Kanji not in phonetic series | Kanji coverage (%) | Predicted readings | Readings coverage (%) |
|---|---|---|---|---|---|---|---|
| 令 | 18 | 令 伶 冷 囹 嶺 怜 澪 玲 羚 聆 苓 蛉 鈴 零 領 鴒 齡 齢 | 17 | 鴒 | 94.444 | rei | 65.385 |
| 交 | 18 | 交 佼 傚 効 咬 效 校 狡 皎 絞 纐 蛟 較 郊 餃 駮 鮫 鵁 | 17 | 駮 | 94.444 | kou | 60.714 |
| 青 | 17 | 倩 情 晴 清 瀞 猜 睛 精 菁 蜻 請 錆 青 靖 静 靜 鯖 | 16 | 猜 | 94.118 | sei | 45.714 |
| 付 | 14 | 付 俯 咐 坿 府 拊 柎 椨 符 腐 腑 苻 附 鮒 | 13 | 椨 | 92.857 | fu | 81.25 |
| 戔 | 14 | 戔 棧 殘 淺 濺 牋 盞 箋 綫 賎 賤 踐 錢 餞 | 13 | 殘 | 92.857 | sen | 68.421 |
| 加 | 14 | 伽 加 嘉 架 枷 珈 痂 笳 茄 袈 賀 跏 迦 駕 | 13 | 賀 | 92.857 | ka | 61.905 |
| 包 | 25 | 包 匏 咆 垉 庖 抱 枹 泡 炮 疱 皰 砲 胞 苞 萢 蚫 袍 鉋 雹 靤 鞄 飽 髱 鮑 麭 | 23 | 萢 雹 | 92 | hou | 67.647 |
| 曼 | 12 | 幔 慢 曼 漫 縵 蔓 蘰 謾 鏝 饅 鬘 鰻 | 11 | 蘰 | 91.667 | man | 61.111 |
| 黄 | 11 | 壙 廣 擴 曠 横 礦 簧 鑛 黃 黄 黌 | 10 | 横 | 90.909 | kou | 66.667 |
| 孚 | 10 | 乳 俘 孚 孵 桴 殍 浮 艀 蜉 郛 | 9 | 乳 | 90 | fu | 81.818 |
| 朱 | 10 | 侏 朱 株 殊 洙 珠 茱 蛛 誅 銖 | 9 | 誅 | 90 | shu | 75 |
| 夬 | 10 | 决 刔 夬 快 抉 決 缺 袂 訣 鴃 | 9 | 快 | 90 | ketsu | 52.941 |
| 兪 | 18 | 偸 兪 喩 愈 愉 揄 楡 渝 瑜 瘉 癒 蝓 覦 諭 踰 輸 逾 鍮 | 16 | 偸 鍮 | 88.889 | yu | 59.259 |
| 良 | 15 | 娘 廊 朖 朗 榔 浪 狼 琅 瑯 粮 良 莨 螂 踉 郎 | 13 | 娘 良 | 86.667 | rou | 72.222 |
| 奇 | 15 | 倚 剞 埼 奇 寄 崎 掎 椅 欹 猗 畸 碕 綺 羇 騎 | 13 | 椅 猗 | 86.667 | ki | 68.421 |
| 此 | 14 | 些 呰 嘴 柴 此 疵 眥 眦 砦 紫 觜 貲 雌 髭 | 12 | 些 砦 | 85.714 | shi | 46.154 |
| 肖 | 19 | 削 哨 宵 屑 峭 悄 梢 消 硝 稍 肖 蛸 誚 趙 逍 銷 霄 鞘 鮹 | 16 | 削 屑 趙 | 84.211 | shou | 61.538 |
| 番 | 18 | 審 幡 播 旙 旛 潘 瀋 燔 番 繙 翻 膰 蕃 藩 蟠 鐇 飜 鷭 | 15 | 審 瀋 番 | 83.333 | han | 50 |
| 兆 | 12 | 佻 兆 姚 挑 晁 桃 眺 窕 誂 跳 逃 銚 | 10 | 桃 逃 | 83.333 | chou | 58.824 |
When we open to a larger universe, some series can be demoted—like our friend 包-hou, the best perfect predictor in Jōyō, which here becomes imperfect thanks to 雹 (haku, hyou—one semivowel from hou!) and 萢 (no on-yomi). Even then, 包-hou still manages an impressive 23 out of 25—arguably even better than the 6 out of 6 in Jōyō.
The imperfect series with kanji coverage smaller than 50% (not pictured above) are quite bad; in fact, most of these “series” are just statistical noise, with sets of hundreds of characters where two or three share a reading by chance. For example, 含 and 岩, both read as gan, share the component 口; but so do 449 other Jōyō kanji, so that 口-gan has an abysmal kanji coverage of 0.44%.
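The 口-gan figure is easy to check by hand; a quick Python sketch of the arithmetic follows, using only the counts quoted in the paragraph above (the helper name is made up for the example).

```python
def kanji_coverage_percent(phonetic_size, component_size):
    """Kanji coverage of a phonetic series, as a percentage."""
    return 100 * phonetic_size / component_size


shared = 2    # 含 and 岩, both read gan
others = 449  # other Jōyō kanji containing 口, per the text
print(round(kanji_coverage_percent(shared, shared + others), 2))  # 0.44
```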
Coverage of kanji sets: how many are in phonetic series?
It’s interesting to measure the coverage of Jōyō kanji—that is, how many of its kanji are included in a phonetic series. The charts below graph this for the Jōyō and KanjiVG sets, choosing the very best phonetic series for each kanji:
You can read these charts horizontally, as a one-dimensional percent bar. The vertical dimension adds information about kanji coverage and (in darker shading) readings coverage. The very bad section (kanji coverage less than 50%) is shaded in red, and is in practice the same as having no phonetic hint. The hints are actually useful basically for the 100% section (colored blue), and perhaps a few of the better imperfect series (orange).
However, the charts above are a bit misleading because they’re not graphing the size (number of kanji) of each series, only their relative coverages. Many of the perfect series have a modest size of 2—just enough to know, for example, that 愁 is pronounced the same as 秋. However, if the entire graphs were adjusted for size, we’d have to use a log scale, because of how bad the red section is—kanji outside the phonetic series would dominate the image in a sea of gray. To avoid that, this magnification shows only the left part (up to 50% kanji coverage), allowing us to represent the series size in a natural way:
(Notice that the scales are not the same between different graphics!)
In the Jōyō set, 456 kanji (21.29%) are perfectly phonetic; in our larger set, the figure is 1411/6394 = 22.06% (including the yellow, imperfect part, this goes up to 34.46% and 49.69%, respectively). How to reconcile this with the often-cited figure that about 90% of kanji have a phonetic component? It’s simple: most of those components can only be detected through historical analysis, either because the sounds and graphs have changed too much, or because they now appear in only a single kanji. For example:
- In 盆 bon, the top element 分 is a phonetic for the sound bon, coming ultimately from Old Chinese *pən. However, due to diverging phonetic changes, its role as a phonetic in 分粉紛雰 became fun (recall that Modern Japanese f- or h- was Old Japanese p-); in 貧, hin; and, in 頒, han. So the phonetic structure of 盆 (“dish” 皿 + sound *pən = *bˤən “bowl”) became obscured and approximate.
- In 浮 fu “to float”, the right component 孚 is a phonetic element giving the sound fu (Old Chinese *mbu). It also appeared by itself as 孚 fu “confidence”, and in 俘 fu “prisoner”; however, these two characters are not used in modern writing anymore, so that 浮 “float” remained alone as the single example of the phonetic 孚-fu.
Downloads
Fuller data sets are available in tab-separated text files (TSV), compatible with any spreadsheet software and easy to manipulate with Unix tools.
- components_phonetic.jouyou.tsv and components_phonetic.kanjivg.tsv:
- The main tables, with one phonetic series per line, ordered by goodness as described above.
- kanji_phonetic.jouyou.tsv and kanji_phonetic.kanjivg.tsv:
- The best phonetic series for each kanji.
- components.tsv:
- Extracted from KanjiVG, a decomposition of each kanji into components.
- component_variants.tsv:
- Components that are considered to be “the same” in this experiment (for example, 人 and 亻, or 水 氵 氺). These are generally calligraphic abbreviations, marked as Z-variants or semantic variants in Unicode. This list is partial.
- base_components.jouyou.tsv and base_components.kanjivg.tsv:
- A list of all components needed to build the kanji set which are not themselves part of the set. Later I’ll add a new post on this.
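For readers who prefer a script to a spreadsheet, here is a minimal sketch of loading one of these TSV files with Python’s standard `csv` module. I am assuming UTF-8 encoding and making no assumption about the column layout; the two sample rows in the demo are invented for illustration and are not taken from the real files.

```python
import csv
import io


def read_tsv(handle):
    """Read tab-separated rows from an open text file into a list of lists."""
    # QUOTE_NONE: treat quote characters as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]  # skip blank lines


# Demo on invented in-memory data; for a real file, something like
#   with open("components_phonetic.jouyou.tsv", encoding="utf-8", newline="") as f:
#       rows = read_tsv(f)
# should work, assuming the file is UTF-8.
sample = io.StringIO("包\thou\t6\n及\tkyuu\t4\n")
rows = read_tsv(sample)
print(rows)
```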
Thanks to everyone who sent suggestions after the first version of this article, and to the compilers of KanjiVG and Kanjidic. Readers interested in the history of characters are directed to my Kanjigen tool. Comments welcome!