The Nanbanjin Nikki

ザ南蛮人日記

Testing the predictive power of phonetic components in Japanese kanji

Most Chinese characters have a phonetic component: a hint that suggests their pronunciation. For example, 半 bàn “half” appears inside 伴 “companion” because the latter is also pronounced bàn; it also suggests the approximate pronunciation of 判 pàn “judge”, 叛 pàn “rebel”, and 胖 pàng “fat”. Beginning readers often don’t notice this feature, but awareness of phonetic hints grows with proficiency. Up to 90% of characters have a phonetic component, though that’s 90% of all characters, not 90% of what you’d stumble upon in actual use; nonphonetic characters (such as 木 or 人) are disproportionately more frequent.

However, this system was always imprecise, and grew ever looser as the spoken language changed and the characters were simplified in various ways. In many cases, it’s now unclear to what degree a component was originally added as a phonetic hint, a semantic mnemonic, both, or neither. For example, it’s conceivable that whoever first built the character for 判 “judge” decided to use 半 not just for its sound, but also because they thought “cutting 刂 in equal halves 半” makes sense as a mnemonic for “judging”.

When the characters are used to represent Japanese, the phonetic hints only work with Sino-Japanese readings, i.e. the on-yomi (…that’s kind of why they’re called on-yomi, “sound readings”). Moreover, even for on-yomi, the hints grew to be even less precise in Japanese than in Chinese. I wanted to try to measure their predictive power; this post reports the results of a simple but quantitative experiment (for a quick summary, skip to results!).

Phonetic components and the Japanese language

There are various reasons why the hints are less useful for Japanese:

That doesn’t mean they’re useless, though, or that natives don’t process them. Mary Noguchi of Kanji Clinic describes this hypothetical example about the character for “hemorrhoid”:

[…] locate a willing subject to write, from memory, the kanji for ぢ. The point of this exercise is to witness the trial and error strategy a native speaker may employ when writing or pronouncing a relatively unfamiliar kanji such as 痔. Your friend will probably begin by writing the kanji component 疒 (“sickness”). This will put her into the ballpark, but writing the rest of the character may prove more difficult. She might start wracking her brain for a component which signals the pronunciation ji:

“Is it 寿 (“longevity”)? Oh, no, 寿 is pronounced ju, so that can’t be right. Hmmm, how about 血 (“blood,” usually pronounced chi, but sometimes ji in compounds like hanaji “bloody nose”)? Or maybe 地 (“ground,” ji)? No,” she moans, “that doesn’t look right, either. OK, I think it’s 寺 (“temple,” ji) plus 疒. Yes, 痔. That’s it!”

Sometimes a native taking quick notes will even replace a complicated phonetic component with a kana character, for example writing 機 ki “machine” with only the left 木 “wood” component plus a squeezed phonetic ki キ to the right.

Goals and definitions: what to measure, and how

For this experiment, I extracted readings from Jim Breen’s venerable Kanjidic, and structural analysis from KanjiVG, a publicly available database of graphical and component decompositions. (If there are any errors in the tables, please report them, so that I can either debug my code or forward corrections to the original sources.) Regarding my goals:

There’s quite a bit of data to massage, and it can be tricky to measure what exactly is a “good” phonetic component. In the next section I make some important definitions about metrics.

Kanji sets

First of all, the results will differ significantly depending on which kanji set (our universe) is analyzed. We’ll investigate two such sets:

  1. The set of kanji taught in Japanese education, today called Jōyō kanji. Since 1945, Japanese texts generally omit furigana readings for Jōyō, meaning that adult readers are expected to know by heart the readings of this set. Our Jōyō analyses are based on the full set of 2141 characters (including the 5 that were dropped in 2010, namely 勺 銑 脹 錘 匁 ).
  2. The largest set I have available, that is, all kanji in the current version of KanjiVG: a bit more than 6300. I don’t know if there are any selection criteria for this set, but I expect they’re some of the most common, and six thousand is a good estimate of the knowledge of a proficient, literary-minded native reader.

Quantifying phonetic series

Within each kanji set, the basic variables to relate are:

  • A few thousand kanji, where
  • each kanji is made of one or more components, and
  • each kanji has zero or more (on-yomi) readings.

A component series is the set of kanji that include a certain component. Here are some examples from the Jōyō set:

Component Size of series Kanji in series
走 7 越 起 趣 走 超 徒 赴
青 7 情 晴 清 精 請 青 静
包 6 包 抱 泡 砲 胞 飽
乍 5 作 搾 昨 酢 詐
及 4 及 吸 扱 級
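As a rough illustration of how such a table can be built, here is a minimal Python sketch that groups kanji by component. The `COMPONENTS_OF` dictionary and the function name are made up for this example; the real decomposition comes from KanjiVG, and this is not the code used for the experiment.

```python
from collections import defaultdict

# Toy decomposition for illustration only; the real analysis takes it from
# KanjiVG. A kanji is listed as a component of itself, so that 包 ends up
# inside the 包-series, as in the tables of this post.
COMPONENTS_OF = {
    "包": ["包"],
    "抱": ["扌", "包"],
    "泡": ["氵", "包"],
    "及": ["及"],
    "吸": ["口", "及"],
}


def component_series(components_of):
    """Map each component to the set of kanji that include it."""
    series = defaultdict(set)
    for kanji, components in components_of.items():
        for component in components:
            series[component].add(kanji)
    return dict(series)


SERIES = component_series(COMPONENTS_OF)
print(sorted(SERIES["包"]))  # the kanji of the toy 包-series
```

The size of a component series is then simply `len(SERIES[component])`.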

For our purposes, a phonetic series is a set of kanji that shares a component and a reading. If we add each kanji’s readings to the table above, interesting patterns appear:

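Component Size of series Kanji in series (with on-yomi)
走 7 越 (etsu, otsu) 起 (ki) 趣 (shu) 走 (sou) 超 (chou) 徒 (to) 赴 (fu)
青 7 情 (sei, jou) 晴 (sei) 清 (sei, shou, shin) 精 (sei, shou, shiyau) 請 (sei, shou, shin) 青 (sei, shou) 静 (sei, jou)
包 6 包 (hou) 抱 (hou) 泡 (hou) 砲 (hou) 胞 (hou) 飽 (hou)
乍 5 作 (saku, sa) 搾 (saku) 昨 (saku) 酢 (saku) 詐 (sa)
及 4 及 (kyuu) 吸 (kyuu) 扱 (kyuu, sou) 級 (kyuu)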

First, consider the 走-series. Not a single kanji in it has a shared reading! 走 is not a phonetic component at all, i.e. the 走-series is not a phonetic series.

Compare to the 包-series. All six kanji that include 包 are pronounced hou. In other words, 包-hou is a phonetic series of size 6.

Now consider the 乍-series. Almost there! Four out of five kanji include the reading saku, but 詐 breaks the pattern; it’s only read as sa. In this case the 乍-saku phonetic series (size 4, 作搾昨酢) is smaller than the full 乍 component series (5, 作搾昨詐酢). We say this phonetic series covers 4 out of 5 kanji, or that it has a kanji coverage of 4/5 = 80%.

What about the 青-series? It does include a phonetic series, with 100% coverage: all characters do have a shared reading, sei. However, most characters also have extra, unpredictable readings, so that this component is less predictive than 包, and we should measure this. Call readings coverage the ratio of predicted readings to all the readings of all kanji in which the component appears (that is, all kanji in the larger component series). 青-sei is a phonetic series of size 7, with 100% kanji coverage but only 43.75% (7/16) readings coverage.

I hope it’s clear by now that 及-kyuu is a size-4 phonetic series with 100% kanji coverage and 80% (4/5) readings coverage.
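To make the two ratios concrete, here is a small Python sketch that recomputes them for the 乍, 青 and 及 examples. The `READINGS` dictionary is typed in by hand from the readings discussed above, and the function name is an invention for this example, not part of the original scripts.

```python
from fractions import Fraction

# On-yomi per kanji for three component series, hand-copied from the examples
# in this post (illustrative data, not a Kanjidic parser).
READINGS = {
    "乍": {"作": ["saku", "sa"], "搾": ["saku"], "昨": ["saku"],
           "酢": ["saku"], "詐": ["sa"]},
    "青": {"情": ["sei", "jou"], "晴": ["sei"],
           "清": ["sei", "shou", "shin"], "精": ["sei", "shou", "shiyau"],
           "請": ["sei", "shou", "shin"], "青": ["sei", "shou"],
           "静": ["sei", "jou"]},
    "及": {"及": ["kyuu"], "吸": ["kyuu"], "扱": ["kyuu", "sou"],
           "級": ["kyuu"]},
}


def coverages(component_series, reading):
    """Return (size, kanji coverage, readings coverage) of a phonetic series.

    component_series maps each kanji of the component series to its readings.
    """
    size = sum(1 for rs in component_series.values() if reading in rs)
    total_readings = sum(len(rs) for rs in component_series.values())
    kanji_coverage = Fraction(size, len(component_series))
    readings_coverage = Fraction(size, total_readings)
    return size, kanji_coverage, readings_coverage


for component, reading in [("乍", "saku"), ("青", "sei"), ("及", "kyuu")]:
    size, kc, rc = coverages(READINGS[component], reading)
    print(f"{component}-{reading}: size {size}, kanji {kc}, readings {rc}")
```

Fractions are used instead of floats so that 100% coverage can be tested exactly with `== 1`.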

Phonetic series with 100% kanji and 100% readings coverage (like 包-hou) are especially useful; these ratings mean that, whenever the component appears, one can be sure of all readings of the kanji. We call these perfect series. Second in importance are those with 100% kanji coverage but less than 100% readings coverage (like 及-kyuu and 青-sei); let’s name them semiperfect series. If you see a semiperfect phonetic component, you can be sure of at least one of the kanji’s readings. Series with less than 100% kanji coverage are not as useful, since you have to memorize the exceptions anyway; these are imperfect series.
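The three labels reduce to two comparisons. A minimal Python sketch, assuming the coverages are passed as exact fractions (the function name is hypothetical):

```python
from fractions import Fraction


def classify(kanji_coverage, readings_coverage):
    """Label a phonetic series as perfect, semiperfect or imperfect."""
    if kanji_coverage < 1:
        return "imperfect"      # some kanji with the component lack the reading
    if readings_coverage < 1:
        return "semiperfect"    # every kanji has the reading, plus extra ones
    return "perfect"            # the component predicts every reading


print(classify(Fraction(1), Fraction(1)))         # 包-hou
print(classify(Fraction(1), Fraction(7, 16)))     # 青-sei
print(classify(Fraction(4, 5), Fraction(2, 3)))   # 乍-saku
```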

Attentive readers might have noticed that a single component can belong to several phonetic series; 乍, for example, could also be described as a very imperfect predictor for sa, working for 作詐 but not 搾昨酢. It of course performs better as a predictor for saku, since in that role it gets 80%/66% coverage (kanji/readings) for 4 kanji, rather than 40%/33% for just 2. We’re now in a position to choose the best series for a component or a kanji; the rating criteria are:

  • First, higher kanji coverage;
  • Second, larger size (more kanji);
  • And third, higher readings coverage.
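These three criteria translate directly into a tuple sort key. The sketch below is one possible rendering in Python, using the example series from this section; the `Series` type and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from typing import NamedTuple


class Series(NamedTuple):
    name: str
    size: int
    kanji_coverage: Fraction
    readings_coverage: Fraction


def goodness_key(series):
    """Sort key: kanji coverage, then size, then readings coverage (all descending)."""
    return (-series.kanji_coverage, -series.size, -series.readings_coverage)


def best(candidates):
    """Pick the best series among candidates for one component or kanji."""
    return min(candidates, key=goodness_key)


EXAMPLES = [
    Series("乍-sa", 2, Fraction(2, 5), Fraction(1, 3)),
    Series("及-kyuu", 4, Fraction(1), Fraction(4, 5)),
    Series("包-hou", 6, Fraction(1), Fraction(1)),
    Series("乍-saku", 4, Fraction(4, 5), Fraction(2, 3)),
    Series("青-sei", 7, Fraction(1), Fraction(7, 16)),
]

RANKED = [s.name for s in sorted(EXAMPLES, key=goodness_key)]
print(RANKED)
print(best(EXAMPLES[0:1] + EXAMPLES[3:4]).name)  # best series for 乍
```

Because size outranks readings coverage, 青-sei sorts ahead of 包-hou even though only the latter is a perfect series.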

Results

Perfect and semi-perfect series

Here’s a selection of the best phonetic series for Jōyō kanji, including all those with 100% kanji coverage and size 3 or larger:

Size Kanji in series Predicted readings Readings coverage Kanji with extra readings
7 情 晴 清 精 請 青 静 sei 43.75 情 清 精 請 青 静
6 包 抱 泡 砲 胞 飽 hou 100
6 交 効 校 絞 較 郊 kou 75 校 較
5 付 府 符 腐 附 fu 100
5 伺 司 嗣 詞 飼 shi 100
5 倹 剣 検 険 験 ken 83.333
5 喩 愉 癒 諭 輸 yu 83.333
4 構 溝 講 購 kou 100
4 渦 禍 過 鍋 ka 100
4 復 腹 複 覆 fuku 100
4 官 棺 管 館 kan 100
4 朱 株 殊 珠 shu 100
4 楷 皆 諧 階 kai 100
4 壌 嬢 譲 醸 jou 100
4 彩 採 菜 采 sai 100
4 及 吸 扱 級 kyuu 80
4 堀 屈 掘 窟 kutsu 80
4 儀 犠 義 議 gi 80
4 摩 磨 魔 麻 ma 80
4 化 花 貨 靴 ka 66.667 化 花
4 伴 判 半 畔 han 66.667 伴 判
4 個 固 箇 錮 ko 66.667 個 箇
4 想 相 箱 霜 sou 66.667 想 相
4 峡 挟 狭 頰 kyou 57.143 峡 挟 狭
3 峰 縫 邦 hou 100
3 五 悟 語 go 100
3 坑 抗 航 kou 100
3 先 洗 銑 sen 100
3 慈 滋 磁 ji 100
3 利 梨 痢 ri 100
3 側 則 測 soku 100
3 庭 廷 艇 tei 100
3 偏 編 遍 hen 100
3 底 抵 邸 tei 100
3 永 泳 詠 ei 100
3 泉 線 腺 sen 100
3 標 漂 票 hyou 100
3 彰 章 障 shou 100
3 噴 墳 憤 fun 100
3 丙 柄 病 hei 75
3 刃 忍 認 nin 75
3 善 繕 膳 zen 75
3 努 奴 怒 do 75
3 嫁 家 稼 ka 75
3 巨 拒 距 kyo 75
3 帝 締 諦 tei 75
3 広 拡 鉱 kou 75
3 健 建 鍵 ken 75
3 曹 槽 遭 sou 75
3 嘲 朝 潮 chou 75
3 救 求 球 kyuu 75
3 描 猫 苗 byou 75
3 園 猿 遠 en 75
𠔉 3 券 圏 巻 ken 75
3 代 袋 貸 tai 60 代 袋
3 居 据 裾 kyo 60 居 裾
3 幣 弊 蔽 hei 60
3 壮 荘 装 sou 50 荘 装
3 太 汰 駄 ta 50 太 汰 駄

Here’s a similar selection for KanjiVG kanji (including non-Jōyō), limited to 100% kanji coverage and size 5 or greater:

Size Kanji in series Predicted readings Readings coverage Kanji with extra readings
14 五 伍 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 go 82.353 圄 衙
13 廬 櫨 瀘 爐 盧 臚 艫 蘆 轤 鑪 顱 驢 鱸 ro 81.25 廬 臚 驢
12 剽 嫖 慓 標 漂 瓢 票 縹 飃 飄 驃 鰾 hyou 100
12 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 go 80 圄 衙
12 凰 徨 惶 湟 煌 皇 篁 蝗 遑 鍠 隍 鰉 kou 80 凰 皇 遑
11 咢 愕 萼 蕚 諤 鄂 鍔 顎 鰐 鶚 齶 gaku 100
11 啾 愀 愁 楸 湫 甃 秋 萩 鍬 鞦 鰍 shuu 73.333 愀 湫 鍬 鰍
11 倉 創 愴 搶 槍 滄 瘡 艙 蒼 蹌 鎗 sou 64.706 創 搶 槍 瘡 蹌 鎗
10 冓 媾 搆 構 溝 篝 覯 講 購 遘 kou 100
10 偏 扁 篇 編 翩 蝙 褊 諞 遍 騙 hen 100
10 嗟 嵯 嵳 差 搓 槎 瑳 磋 縒 蹉 sa 71.429 嗟 嵯 搓 縒
9 堋 崩 弸 朋 棚 硼 繃 萠 鵬 hou 90
9 伴 判 半 叛 拌 畔 絆 胖 袢 han 75 伴 判 叛
9 楜 湖 瑚 糊 胡 葫 蝴 醐 餬 ko 60 瑚 糊 胡 醐
8 亢 伉 吭 坑 抗 杭 航 頏 kou 100
8 僑 喬 嬌 橋 矯 蕎 轎 驕 kyou 100
8 嶂 彰 樟 璋 瘴 章 障 鱆 shou 100
8 伺 司 嗣 祠 笥 覗 詞 飼 shi 88.889
8 儀 嶬 犠 礒 義 艤 蟻 議 gi 80 犠 礒
8 嬪 擯 檳 殯 濱 繽 賓 鬢 hin 80 檳 鬢
8 傍 旁 榜 滂 磅 膀 蒡 謗 bou 53.333 旁 榜 滂 磅 膀 蒡 謗
7 奚 渓 溪 谿 蹊 鶏 鷄 kei 100
7 容 榕 溶 熔 穃 蓉 鎔 you 100
7 介 价 堺 界 畍 疥 芥 kai 87.5
7 及 吸 岌 扱 汲 笈 級 kyuu 87.5
7 于 吁 宇 盂 紆 芋 迂 u 77.778 于 吁
7 代 垈 岱 玳 袋 貸 黛 tai 70 代 垈 袋
7 個 凅 固 涸 痼 箇 錮 ko 63.636 個 凅 涸 箇
7 會 檜 獪 繪 膾 薈 鱠 kai 58.333 會 繪 膾 薈 鱠
7 曚 朦 檬 濛 矇 艨 蒙 mou 58.333 曚 朦 矇 艨 蒙
7 叟 嫂 捜 搜 溲 痩 艘 sou 36.842 叟 捜 搜 溲 痩 艘
6 安 按 晏 案 鞍 鮟 an 100
6 悍 捍 旱 桿 稈 駻 kan 100
6 倔 堀 屈 崛 掘 窟 kutsu 85.714
6 広 拡 昿 砿 絋 鉱 kou 85.714
6 壙 廣 擴 曠 礦 鑛 kou 85.714
6 櫪 歴 瀝 癧 轣 靂 reki 85.714
6 偕 揩 楷 皆 諧 階 kai 85.714
6 嗹 漣 縺 蓮 連 鏈 ren 85.714
6 庭 廷 挺 梃 艇 霆 tei 75 挺 梃
6 曹 槽 漕 糟 艚 遭 sou 75 曹 糟
6 救 毬 求 球 裘 逑 kyuu 75 求 裘
6 憔 樵 焦 礁 蕉 鷦 shou 75
6 欄 瀾 爛 蘭 襴 闌 ran 75 蘭 襴
6 嚀 寧 檸 濘 獰 聹 nei 66.667 嚀 檸 獰
6 園 猿 薗 袁 轅 遠 en 66.667 薗 袁 遠
6 廨 懈 蟹 蠏 解 邂 kai 66.667 廨 懈 解
6 倨 居 据 裾 踞 鋸 kyo 54.545 倨 居 裾 踞 鋸
6 嘸 廡 憮 撫 無 蕪 bu 54.545 嘸 憮 撫 無 蕪
6 溽 縟 耨 蓐 褥 辱 joku 54.545 溽 耨 蓐
6 抹 末 沫 秣 茉 靺 matsu 50 末 沫 秣 茉 靺
6 皺 芻 蒭 趨 鄒 雛 suu 37.5 皺 芻 蒭 趨 鄒 雛
5 偬 匆 怱 愡 葱 sou 100
5 喚 奐 換 渙 煥 kan 100
5 幾 機 磯 譏 饑 ki 100
5 娑 沙 莎 裟 鯊 sa sha 100
5 彩 採 綵 菜 采 sai 100
5 卷 圈 惓 綣 蜷 ken 83.333
5 攅 纉 讃 賛 鑚 san 83.333
5 廊 榔 瑯 螂 郎 rou 83.333
5 唳 戻 捩 涙 綟 rei 71.429 捩 涙
5 揉 柔 糅 蹂 鞣 juu 71.429 柔 鞣
5 當 礑 蟷 襠 鐺 tou 71.429
5 剳 塔 搭 荅 鞳 tou 71.429
5 晰 析 淅 皙 蜥 seki 62.5 晰 皙 蜥
5 噴 墳 憤 濆 賁 fun 62.5 濆 賁
5 撻 燵 達 闥 韃 tatsu 62.5 達 闥 韃
5 保 堡 葆 褒 褓 hou 55.556 保 堡 葆 褓
5 奬 將 漿 獎 鏘 shou 55.556 奬 將 獎 鏘
5 惺 星 猩 腥 醒 sei 55.556 星 猩 腥
5 壯 奘 弉 莊 裝 sou 50 奘 弉 莊 裝
5 椴 段 緞 葮 鍛 tan 50 椴 段 緞 葮
5 嗾 族 簇 蔟 鏃 zoku 45.455 嗾 簇 蔟 鏃

It’s clear that phonetic series grow in usefulness as we venture outside Jōyō. This is probably because, as mentioned earlier, rarer characters are more likely to have been built with a phonetic structure.

Notice 吾-go is kind of a spurious series, because the component 吾 itself includes a better phonetic component, 五-go:

14 五 伍 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 go 82.353 圄 衙
12 吾 唔 圄 寤 悟 晤 梧 牾 珸 衙 語 齬 go 80 圄 衙
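One way to flag such spurious series automatically is to look for a series whose kanji are a strict subset of another series with the same reading. The following Python sketch does this for the two rows above; it is an illustration of the idea, not something the published tables were filtered with.

```python
# The two go-series from the rows above, keyed by component.
GO_SERIES = {
    "五": set("五伍吾唔圄寤悟晤梧牾珸衙語齬"),
    "吾": set("吾唔圄寤悟晤梧牾珸衙語齬"),
}


def shadowed_by(series_by_component):
    """Return {inner: outer} where inner's kanji are a strict subset of outer's.

    All series passed in are assumed to predict the same reading.
    """
    result = {}
    for inner, inner_kanji in series_by_component.items():
        for outer, outer_kanji in series_by_component.items():
            if inner != outer and inner_kanji < outer_kanji:
                result[inner] = outer
    return result


print(shadowed_by(GO_SERIES))
```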

Imperfect series

Here are some of the top few phonetic series with imperfect kanji coverage, for Jōyō kanji:

Size Kanji in component series Size of phonetic series Kanji not in phonetic series Kanji coverage Predicted readings Readings coverage
8 召 招 昭 沼 照 紹 詔 7 87.5 shou 87.5
7 姿 恣 次 茨 諮 資 6 85.714 shi 66.667
6 埼 奇 寄 崎 5 83.333 ki 83.333
6 令 冷 鈴 零 5 83.333 rei 71.429
9 反 坂 板 版 販 阪 飯 7 仮 返 77.778 han 50
8 交 効 校 絞 較 郊 6 父 釜 75 kou 60
8 姓 性 星 牲 生 6 産 隆 75 sei 50
7 壮 奨 荘 装 5 寝 状 71.429 sou 41.667
9 亡 妄 望 盲 網 6 忘 慌 荒 66.667 mou 46.154

For the KanjiVG set, there are more than 600 series between 75% and 99% kanji coverage, many of them quite large. Here’s a sampling of some interesting ones:

Size Kanji in component series Size of phonetic series Kanji not in phonetic series Kanji coverage Predicted readings Readings coverage
18 令 伶 冷 囹 嶺 怜 澪 玲 羚 聆 苓 蛉 鈴 零 領 齡 齢 17 94.444 rei 65.385
18 交 佼 傚 効 咬 效 校 狡 皎 絞 纐 蛟 較 郊 餃 鮫 鵁 17 94.444 kou 60.714
17 倩 情 晴 清 瀞 睛 精 菁 蜻 請 錆 青 靖 静 靜 鯖 16 94.118 sei 45.714
14 付 俯 咐 坿 府 拊 柎 符 腐 腑 苻 附 鮒 13 92.857 fu 81.25
14 戔 棧 淺 濺 牋 盞 箋 綫 賎 賤 踐 錢 餞 13 92.857 sen 68.421
14 伽 加 嘉 架 枷 珈 痂 笳 茄 袈 跏 迦 駕 13 92.857 ka 61.905
25 包 匏 咆 垉 庖 抱 枹 泡 炮 疱 皰 砲 胞 苞 蚫 袍 鉋 靤 鞄 飽 髱 鮑 麭 23 萢 雹 92 hou 67.647
12 幔 慢 曼 漫 縵 蔓 謾 鏝 饅 鬘 鰻 11 91.667 man 61.111
11 壙 廣 擴 曠 礦 簧 鑛 黃 黄 黌 10 90.909 kou 66.667
10 俘 孚 孵 桴 殍 浮 艀 蜉 郛 9 90 fu 81.818
10 侏 朱 株 殊 洙 珠 茱 蛛 9 90 shu 75
10 决 刔 夬 抉 決 缺 袂 訣 鴃 9 90 ketsu 52.941
18 兪 喩 愈 愉 揄 楡 渝 瑜 瘉 癒 蝓 覦 諭 踰 輸 逾 16 偸 鍮 88.889 yu 59.259
15 廊 朖 朗 榔 浪 狼 琅 瑯 粮 莨 螂 踉 郎 13 娘 良 86.667 rou 72.222
15 倚 剞 埼 奇 寄 崎 掎 畸 碕 綺 羇 騎 13 椅 猗 86.667 ki 68.421
14 呰 嘴 柴 此 疵 眥 眦 紫 觜 貲 雌 髭 12 些 砦 85.714 shi 46.154
19 哨 宵 峭 悄 梢 消 硝 稍 肖 蛸 誚 逍 銷 霄 鞘 鮹 16 削 屑 趙 84.211 shou 61.538
18 幡 播 旙 旛 潘 繙 翻 膰 蕃 藩 蟠 鐇 飜 鷭 15 審 瀋 番 83.333 han 50
12 佻 兆 姚 挑 晁 眺 窕 誂 跳 10 桃 逃 83.333 chou 58.824

When we open to a larger universe, some series can be demoted—like our friend 包-hou, the best perfect predictor in Jōyō, which here becomes imperfect thanks to 雹 (haku, hyou—one semivowel from hou!) and 萢 (no on-yomi). Even then, 包-hou still manages an impressive 23 out of 25—arguably even more useful than the 6 out of 6 in Jōyō.

The imperfect series with kanji coverage smaller than 50% (not pictured above) are quite bad; in fact, most of these “series” are just statistical noise, with sets of hundreds of characters where two or three share a reading by chance. For example, 含 and 岩, both read as gan, share the component 口; but so do 449 other Jōyō kanji, so that 口-gan has an abysmal kanji coverage of 0.44%.
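The arithmetic behind the 口-gan figure, as a short Python sketch. The 50% cutoff is the one used in this post for the “very bad” series; the function and variable names are made up for the example.

```python
def kanji_coverage_percent(phonetic_size, component_size):
    """Kanji coverage of a phonetic series, as a percentage."""
    return 100 * phonetic_size / component_size


# 口-gan: 含 and 岩 share the reading gan; 449 other Jōyō kanji also contain 口.
coverage = kanji_coverage_percent(2, 2 + 449)
is_noise = coverage < 50  # below 50%, the "series" is treated as no hint at all

print(f"{coverage:.2f}%", is_noise)
```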

Coverage of kanji sets: how many are in phonetic series?

It’s interesting to measure the coverage of each kanji set, that is, how many of its kanji are included in a phonetic series. The charts below graph this for the Jōyō and KanjiVG sets, choosing the very best phonetic series for each kanji:

Chart of phonetic series coverage of Jōyō kanji
Chart of phonetic series coverage of KanjiVG kanji

You can read these charts horizontally, as a one-dimensional percent bar. The vertical dimension adds information about kanji coverage and (in darker shading) readings coverage. The very bad section (kanji coverage less than 50%) is shaded in red, and is in practice the same as having no phonetic hint. The hints are actually useful basically for the 100% section (colored blue), and perhaps a few of the better imperfect series (orange).

However, the charts above are a bit misleading because they’re not graphing the size (number of kanji) of each series, only their relative coverages. Many of the perfect series have a modest size of 2—just enough to know, for example, that 愁 is pronounced the same as 秋. However, if the entire graphs were adjusted for size, we’d have to use a log scale, because of how bad the red section is—kanji outside the phonetic series would dominate the image in a sea of gray. To avoid that, this magnification shows only the left part (up to 50% kanji coverage), allowing us to represent the series size in a natural way:

Chart of better phonetic series coverage of Jōyō kanji, size-adjusted
Chart of better phonetic series coverage of KanjiVG kanji, size-adjusted

(Notice that the scales are not the same between different graphics!)

456 of the 2141 Jōyō kanji (21.29%) are perfectly phonetic, as are 1411/6394 = 22.06% of our larger set (the yellow, imperfect part goes up to 34.46% and 49.69%). How to reconcile this with the often-cited figure that about 90% of kanji have a phonetic component? It’s simple: most of those components can only be detected through historical analysis, because either the sounds and graphs have changed too much, or else they now only appear in a single kanji. For example:

  • In 盆 bon, the top element 分 is a phonetic for the sound bon, coming ultimately from Old Chinese *pən. However, due to diverging phonetic changes, its role as a phonetic in 分粉紛雰 became fun (recall that Modern Japanese f- or h- was Old Japanese p-); in 貧, hin; and, in 頒, han. So the phonetic structure of 盆 (“dish” 皿 + sound *pən = *bˤən “bowl”) became obscured and approximate.
  • In 浮 fu “to float”, the right component 孚 is a phonetic element giving the sound fu (Old Chinese *mbu). It also appeared by itself as 孚 fu “confidence”, and in 俘 fu “prisoner”; however, these two characters are not used in modern writing anymore, so that 浮 “float” remained alone as the single example of the phonetic 孚-fu.

Downloads

Fuller data sets are available in tab-separated text files (TSV), compatible with any spreadsheet software and easy to manipulate with Unix tools.

components_phonetic.jouyou.tsv and components_phonetic.kanjivg.tsv:
The main tables, with one phonetic series per line, ordered by goodness as described above.
kanji_phonetic.jouyou.tsv and kanji_phonetic.kanjivg.tsv:
The best phonetic series for each kanji.
components.tsv:
Extracted from KanjiVG, a decomposition of each kanji into components.
components_variants.tsv:
Components that are considered to be “the same” in this experiment (for example, 人 and 亻, or 水 氵 氺). These are generally calligraphic abbreviations, marked as Z-variants or semantic variants in Unicode. This list is partial.
base_components.jouyou.tsv and base_components.kanjivg.tsv:
A list of all components needed to build the kanji set which are not themselves part of the set. Later I’ll add a new post on this.
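As an illustration of what the variants list is for, here is a minimal Python sketch that folds variant components into one form before series are built. It hard-codes only the examples named above; which form counts as canonical is an assumption of this sketch, and the column layout of components_variants.tsv is not described here, so the file itself is not parsed.

```python
# Variant -> canonical form. Only the pairs mentioned in this post; treating
# 人 and 水 as the canonical forms is an assumption made for illustration.
CANONICAL = {"亻": "人", "氵": "水", "氺": "水"}


def normalize(component):
    """Return the canonical form of a component (itself if it has no variant)."""
    return CANONICAL.get(component, component)


def normalize_all(components):
    """Normalize every component of one kanji's decomposition."""
    return [normalize(c) for c in components]


print(normalize_all(["亻", "半"]))  # components of 伴
print(normalize_all(["氵", "包"]))  # components of 泡
```

With this step applied first, kanji written with 亻 and kanji written with 人 land in the same component series.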

Thanks to everyone who sent suggestions after the first version of this article, and to the compilers of KanjiVG and Kanjidic. Readers interested in the history of characters are directed to my Kanjigen tool. Comments welcome!

Comments

(Note for habitual readers: I changed the site stylesheet in order to highlight a few lines in the tables. If you don’t see any highlights, it’s probably caching the older version; try refreshing the browser a couple times with Ctrl+F5 or Ctrl+Shift+R).

By leoboiko on .

Interesting idea, especially the synchronic part. How about some graphs?
– Number of kanji that are 100%, 90%, 80%… perfect (this is probably the most interesting one)
– As above, but divided by reading set (number of 1-set kanji that are 100%, 90%… perfect; Number of 2-set kanji that are 100%, 90%… perfect)
– Come to think of it, number of kanji with 1, 2, 3… reading sets?
(Since you supply the data, I should probably do this myself, but I *still* don’t know how to use R).

By Matt on .

on it… :)

By leoboiko on .

It’s probably a bad sign that even I’m not sure what I was on about there. But thinking carefully, I seem to have written “kanji” where I meant “component”. So, out of n = 50 (or whatever) components, 10 are “perfect”, 20 are “very good”, etc…. The theory being that if some components are “perfect” and others are “not”, with varying numbers involved, there must be a way to rank them and show what the territory between “perfect” to “completely useless” looks like.

I suppose that figuring out a ranking method is the hard part. Is “皮” worse than “召”, and if so, by how much? And so on.

By Matt on .

I agree very much. Stay tuned…

By leoboiko on .

Ok, I abandoned the focus on perfect sets in favor of three quantifiable criteria (kanji coverage %, size, and readings coverage %), and they’re now sorted in this order.

The main weakness of this method is that, in many cases, a much larger phonetic series with a slightly worse kanji coverage feels more interesting, so the absolute priority for kanji coverage is unjustified. I tried to use a weighted function based on both variables, but at the end of the weekend decided that the simpler, more understandable sorting key made it easier to handle the data tables.

By leoboiko on .

Great reading! The new approach works for me. Looking down the coverage lists, it’s kind of shocking how closely my feelings about the characters in each row match the percentages– the safe and cozy 100%/100% rows, the approachable but somehow ominous 90%/50%(ish) rows, the forbidding tangle of the 70%/40% rows…

By Matt on .

Interesting article! It’s something I’ve also investigated on my quest for Japanese reading ability. Are you aware of an official list anywhere online of the phonetic components in Japanese? They are often mentioned but there does not seem to be a definitive list, although perhaps I am just in the dark.

Doumo

By Natalie on .

Good stuff, very interesting. Some of the links to your data files are 404s though, components_variants.tsv and components.tsv

By Dandy Mountfarto on .

Thanks for the warning, Dandy! I’ve fixed the links.

By leoboiko on .

@Natalie: Sorry, I had missed your comment! In case you have email notification on and are reading this: I don’t think there’s a definitive list, because there are wide disagreements over which components are (historically) phonetic, or even what counts as a component. The best we can do is to look at what’s classified as a “phono-semantic character” 形声文字 in traditional dictionaries – the Shuowen, the KangXi and, for Japanese, Morohashi. I’m unaware of online lists of such classifications, though a few web searches in Japanese will probably return something.

Of course, if all you need are lists of phonetic hints that still work (as opposed to historical “phonetics”), then there’s always the data files in this post ;)

But if you’re looking for such a list for learning purposes, I think that it’s better to just acquire the phonetics by osmosis as you learn new words (it’s how the natives learn them, after all). Vocabulary acquisition people (like Nation) tell us that learning similar words together “to understand their difference” is a bad idea, because interference effects are bad for memorization – it’s better to deal with similar words one at a time, learning other stuff in-between. My gut feeling is that a similar thing happens with kanji: if you try to learn a phonetic series serially one after another, then all those similar-looking, similar-sounding kanji blend together in a blur.

If you just read stuff you’re interested in, you’ll find that you naturally start to guess the pronunciation of new kanji from phonetic hints you’ve acquired unconsciously (though unfortunately this inference sometimes fails – I’m looking at you, 拡/kaku).