List of «official» Japanese simplified kanji

My kanjigen tool has a button to convert shinjitai (“new character forms”) to their original (kyūjitai) forms. I added it because some Chinese dictionaries used by kanjigen don’t index Japanese shinjitai, & it would be too much of a bother to convert elsewhere.

When going about writing it I decided to restrict myself to “official” or “standard” shinjitai, for… no good reason, really. It’s just something I wanted to do. It was surprisingly difficult to find a reliable source for that information! Matt suggested that I should just use the Jōyō Kanji Table PDF published by the Japanese government, so I did. PDFs are hard to parse and it was really mendokusai, but I extracted the relevant information and made this machine-readable list of official shinjitai-to-kyūjitai mappings.

I thought such information could perhaps be relevant for Unicode, so I sent it to the Unihan people, and they said it would be included in a later version. Yay!

Of course, there’s the possibility I made some mistake, and you shouldn’t trust my list as much as you’d trust the official document. I double-checked it as well as I could, but triple-checking is always welcome. For reference, here’s how the list was generated:

  1. I converted the Heisei 22 PDF to text using Apache’s pdfbox, with the command:

    java -jar pdfbox-app-1.6.0.jar ExtractText jouyoukanjihyou_h22.pdf

    (Earlier I had tried the pdftotext(1) tool, which resulted in a harder-to-parse text file. In any case I massaged it manually into a shinjitai list and later checked against the pdfbox results, just to be sure.)

  2. The old forms are listed in the file as single characters between CJK parentheses following the new form; I extracted them on Linux with the command:

    sed -n -e "s/^[[:space:]]*\([^[:space:]]\)[[:space:]]*(\(.\)).*/\1 \2/p"

  3. Two rows in the original table (those for and ) are multiline and couldn’t be extracted with the above command, so I added them manually.

  4. The traditional form of 亀 appears not to be encoded as a Unicode character in the PDF, but the reference glyph appears to be identical to U+9F9C 龜, so I added it manually.
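
For readers who would rather not depend on sed and the shell locale, the extraction in step 2 could be sketched in Python roughly as below. This is only an illustration, not the command actually used: the function name `extract_pairs` and the sample lines are made up, and the pattern assumes each row starts with the new form followed by one old form in parentheses (fullwidth or ASCII).

```python
import re

# Assumption: a row looks like "<new form> (<old form>) <readings...>".
# Both fullwidth (U+FF08/U+FF09) and ASCII parentheses are accepted,
# since the extracted text may contain either.
PAIR_RE = re.compile(r"^\s*(\S)\s*[(（](.)[)）]")

def extract_pairs(lines):
    """Return (shinjitai, kyujitai) tuples for lines that match the row pattern."""
    pairs = []
    for line in lines:
        m = PAIR_RE.match(line)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs

# Hypothetical sample lines imitating the shape of the extracted text.
sample = [
    "亜 （亞） ア",
    "  国 (國) コク くに",
    "愛 アイ",  # no old form in parentheses, so this line is skipped
]
print(extract_pairs(sample))
```

As with the sed command, rows that span several lines would not match and would still need to be added by hand.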

The resulting 364 pairs were checked against the latest Unihan, and I found that most were listed as one or more of kTraditionalVariant, kZVariant, kSemanticVariant, kSpecializedSemanticVariant, or kCompatibilityVariant in either direction, with no pattern I could discern. 22 pairs were not in Unihan at all; I verified those manually, and all of them were indeed in the Japanese Jōyō table.
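
A check of this kind could be sketched as follows. The sketch assumes the Unihan variant data is available as tab-separated lines of the shape `U+XXXX<TAB>kField<TAB>U+YYYY ...`, where a value may carry a `<source` annotation; verify that against the Unihan documentation before relying on it. The function names and the sample lines are invented for illustration and are not copied from Unihan.

```python
from collections import defaultdict

def parse_variant_lines(lines):
    """Parse variant lines into {(char, variant_char): {field, ...}}."""
    relations = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, field, values = line.split("\t", 2)
        char = chr(int(source[2:], 16))  # "U+56FD" -> "国"
        for value in values.split():
            # Drop any "<kSource..." annotation after the code point.
            code_point = value.split("<", 1)[0]
            relations[(char, chr(int(code_point[2:], 16)))].add(field)
    return relations

def fields_linking(relations, new, old):
    """Variant fields that link the pair in either direction."""
    return relations.get((new, old), set()) | relations.get((old, new), set())

# Made-up lines that only imitate the assumed format.
sample = [
    "# illustrative data, not real Unihan content",
    "U+56FD\tkTraditionalVariant\tU+570B",
    "U+9F9C\tkSemanticVariant\tU+4E80<kExampleSource",
]
relations = parse_variant_lines(sample)
print(fields_linking(relations, "国", "國"))
print(fields_linking(relations, "亀", "龜"))
```

A pair for which `fields_linking` returns an empty set would be one that is not linked by any variant field and needs manual verification.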

If you’re a developer planning to use this, note that there are also unofficial “extended” (non-Jōyō) shinjitai (aka 拡張新字体), Asahi simplified characters, variant characters (異体字), handwritten abbreviations (略字), Chinese simplifications &c. This list is strictly about Japanese standard Jōyō simplifications. Search the terms on Wikipedia or elsewhere for more info. Notice also that 弁 maps to three different characters (辨 瓣 辯).
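
One way to cope with the one-to-many case is to store a list of old forms per new form and only convert when the list has exactly one entry. The sketch below is a possible design, not how kanjigen does it: `OLD_FORMS` holds just four sample entries, and the `to_kyujitai` function and its `ambiguous` parameter are hypothetical.

```python
# A tiny sample; the full shinjitai-to-kyūjitai list would go here.
OLD_FORMS = {
    "国": ["國"],
    "学": ["學"],
    "亀": ["龜"],
    "弁": ["辨", "瓣", "辯"],  # one new form, three old forms
}

def to_kyujitai(text, ambiguous="keep"):
    """Convert shinjitai to kyūjitai where the mapping is one-to-one.

    ambiguous="keep" leaves one-to-many characters untouched;
    ambiguous="list" replaces them with all candidates in brackets.
    """
    out = []
    for ch in text:
        olds = OLD_FORMS.get(ch)
        if not olds:
            out.append(ch)            # not in the list: leave as is
        elif len(olds) == 1:
            out.append(olds[0])       # unambiguous: convert
        elif ambiguous == "list":
            out.append("[" + "".join(olds) + "]")
        else:
            out.append(ch)            # ambiguous: keep the new form
    return "".join(out)

print(to_kyujitai("国学"))
print(to_kyujitai("弁", ambiguous="list"))
```

Choosing among 辨, 瓣 and 辯 depends on the word the character appears in, so a character-by-character table cannot decide it; listing the candidates leaves that choice to the user.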

Thanks to Ben Bullock of the Lingua::JA::Moji Perl module, whose data I borrowed for the first versions of kanjigen.

9 thoughts on “List of «official» Japanese simplified kanji”

  1. A nice tool!

    One test I put it through was to input the character 鴬, which should yield 鶯. Instead the output was 鴬 (no change).

    I’m not totally sure of the historical situation behind this (I’m assuming 鴬 was already current in the kyūjitai), but the standard traditional character is actually 鶯. However, the situation is a bit confusing, as I note at my linked page “Of Orioles and Warblers” (qv).

    To summarise, 鴬 is the shinjitai of 鶯 in Japanese, whereas 鸴 is the simplified form of 鷽 in Chinese. Chinese simplifies 鶯 as 莺.

  2. The thing is, 鴬 is not in the Jōyō list, and at the moment the conversion button only works for jōyō characters, because I wanted to find out what were the “standard” shinjitai. I agree that ideally I should try to be inclusive of all shinjitai, not of the standard ones only (at least in the cases where there’s a one-to-one mapping—cases like 弁 are complicated).

    For the moment, other conversion tools can help. One of these days I’ll try to mine non-jōyō simplifications from various sources to make the convert button more useful.

  3. Needless to say, I wasn’t aware that 鴬 is not in the jōyō kanji list. I seem to have reached the point where it’s not very relevant whether a kanji is in the tōyō/jōyō kanji list or not.

  4. I guess I should clarify the above comment to avoid the impression that I know so many kanji that it doesn’t matter any more. It’s more that, having moved over to Chinese, attempts to keep the available kanji tightly controlled in an arbitrary box just seem irrelevant.

  5. No, I dig what you’re saying. I don’t even know all jōyō yet, and my interests have already been making me feel the futility of arbitrary boxes.

  6. Hi–I wonder if you can help, please?

    I’m writing a novel set in Japan and have a situation where the foreign protagonist Googles a Japanese guy’s full name but nothing comes up namewise (that he can read). I propose to explain this by saying the name had an unusual reading, so the foreigner skipped over it because he misread it, or because Google didn’t list it, as it is so uncommon.

    I had wanted to use Kaifu, but although there are several kanji read as “kai,” only the kai for “umi” seems to be used for a person’s name, so that seems to be ruled out, as the foreigner would see it listed and so check it out and find his guy. Or have you come across a kyuujitai reading of kai used in a name?

    Any suggestions for names that could be easily misread and so overlooked by a foreigner doing a search? In particular, I’m looking for something with a strong beginning sound using, say, K or G, in order to match the Japanese character’s hard personality.

    Thank you,
    John
