My kanjigen tool has a button to convert shinjitai (“new character forms”) to their original (kyūjitai) forms. I added it because some Chinese dictionaries used by kanjigen don’t index Japanese shinjitai, & it would be too much of a bother to convert elsewhere.
When going about writing it I decided to restrict myself to “official” or “standard” shinjitai, for… no good reason, really. It’s just something I wanted to do. It was surprisingly difficult to find a reliable source for that information! Matt suggested that I should just use the Jōyō Kanji Table PDF published by the Japanese government, so I did. PDFs are hard to parse and it was really mendokusai, but I extracted the relevant information and made this machine-readable list of official shinjitai-to-kyūjitai mappings.
I thought such information could perhaps be relevant for Unicode, so I sent it to the Unihan people, and they said it will be included in a later version. Yay!
Of course, there’s the possibility I made some mistake, and you shouldn’t trust my list as much as you’d trust the official document. I double-checked it as well as I could, but triple-checking is always welcome. For reference, here’s how the list was generated:
I converted the Heisei 22 PDF to text using Apache’s pdfbox, with the command:
java -jar pdfbox-app-1.6.0.jar ExtractText jouyoukanjihyou_h22.pdf
(Earlier I had tried the pdftotext(1) tool, which resulted in a harder-to-parse text file. In any case I massaged it manually into a shinjitai list and later checked it against the pdfbox results, just to be sure.)
The old forms are listed in the file as single characters between CJK parentheses following the new form; I extracted them on Linux with the command:
sed -n -e "s/^[[:space:]]*\([^[:space:]]\)[[:space:]]*（\(.\)）.*/\1 \2/p"
Two rows in the original table (those for 弁 and 餅) are multiline and couldn’t be extracted with the above command, so I added them manually.
The traditional form of 亀 appears not to be encoded as a Unicode character in the PDF, but the reference glyph appears to be identical to U+9F9C 龜, so I added it manually.
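For anyone who would rather not read sed, here is a rough Python sketch of the same single-line extraction. It is not what I actually ran. The sample lines are made up to resemble the extracted text, and the names `PAIR_RE` and `extract_pairs` are mine. Like the sed command, it skips rows with no old form and can't handle the multiline rows.

```python
import re

# One non-space character (the new form), optional whitespace, then a single
# character (the old form) between full-width CJK parentheses.
PAIR_RE = re.compile(r"^\s*(\S)\s*（(.)）")


def extract_pairs(lines):
    """Return (new_form, old_form) tuples for lines that list an old form."""
    pairs = []
    for line in lines:
        match = PAIR_RE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


if __name__ == "__main__":
    # Hypothetical lines, only shaped like the extracted table text.
    sample = [
        "亜（亞） ア 亜流",
        "愛 アイ 愛情",        # no old form in parentheses: skipped
        "  悪（惡） アク 悪事",
    ]
    for new, old in extract_pairs(sample):
        print(new, old)
```

Python's `\s` also matches the ideographic space, which is handy with text extracted from Japanese PDFs.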
The resulting 364 pairs were checked against the latest Unihan, and I found that most were listed as one or more of kTraditionalVariant, kZVariant, kSemanticVariant, kSpecializedSemanticVariant, or kCompatibilityVariant in either direction, with no pattern I could discern. 22 pairs were not in Unihan at all; I verified those manually, and all of them were indeed in the Japanese Jōyō table.
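A check like this can be sketched in a few lines of Python. This is an illustration, not my original script. It assumes the usual Unihan text layout of tab-separated lines (code point, field name, value), where a value may hold several space-separated code points, each optionally followed by a `<source` suffix. The function names and the sample lines are invented for the example and are not copied from the real database.

```python
from collections import defaultdict

VARIANT_FIELDS = {
    "kTraditionalVariant",
    "kZVariant",
    "kSemanticVariant",
    "kSpecializedSemanticVariant",
    "kCompatibilityVariant",
}


def parse_variants(lines):
    """Map (character, field) to the set of characters listed as its variants."""
    variants = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] not in VARIANT_FIELDS:
            continue
        code, field, value = parts
        source = chr(int(code[2:], 16))          # "U+56FD" -> "国"
        for item in value.split():
            target = item.split("<", 1)[0]       # drop any "<kSource" suffix
            variants[(source, field)].add(chr(int(target[2:], 16)))
    return variants


def relations(variants, new, old):
    """List (field, direction) entries that link a new/old pair either way."""
    found = []
    for field in sorted(VARIANT_FIELDS):
        if old in variants.get((new, field), ()):
            found.append((field, "new->old"))
        if new in variants.get((old, field), ()):
            found.append((field, "old->new"))
    return found


if __name__ == "__main__":
    # Made-up lines in the Unihan format, for illustration only.
    sample = [
        "# comment line",
        "U+56FD\tkTraditionalVariant\tU+570B",
        "U+4E9E\tkSemanticVariant\tU+4E9C<kMatthews",
    ]
    table = parse_variants(sample)
    for new, old in [("国", "國"), ("亜", "亞"), ("学", "學")]:
        print(new, old, relations(table, new, old) or "not linked")
```

Pairs that come back with an empty list are the ones that need checking by hand.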
If you’re a developer planning to use this, note that there are also unofficial “extended” (non-Jōyō) shinjitai (aka 拡張新字体), Asahi simplified characters, variant characters （異体字）, handwritten abbreviations （略字）, Chinese simplifications &c. This list is strictly about Japanese standard Jōyō simplifications. Search the terms on Wikipedia or elsewhere for more info. Notice also that 弁 maps to three different characters （辨 瓣 辯）。
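Because of cases like 弁, a plain one-to-one dictionary won't do. Here is a minimal Python sketch of one way to handle it. It assumes the list is stored one entry per line, with the new form followed by its old form(s), separated by spaces. That layout and the function names are assumptions for the example, so adjust them to the actual file. Ambiguous characters are left alone and reported, not guessed.

```python
def load_mapping(lines):
    """Build {new_form: [old_form, ...]} from space-separated lines."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        mapping.setdefault(parts[0], []).extend(parts[1:])
    return mapping


def to_kyujitai(text, mapping):
    """Replace characters that have exactly one old form.

    Returns the converted text and a list of characters that were left
    unchanged because they map to more than one old form.
    """
    converted = []
    ambiguous = []
    for char in text:
        old_forms = mapping.get(char, [])
        if len(old_forms) == 1:
            converted.append(old_forms[0])
        else:
            if len(old_forms) > 1:
                ambiguous.append(char)
            converted.append(char)
    return "".join(converted), ambiguous


if __name__ == "__main__":
    # A tiny excerpt in the assumed layout.
    mapping = load_mapping(["国 國", "学 學", "弁 辨 瓣 辯"])
    text, ambiguous = to_kyujitai("国学と弁", mapping)
    print(text)       # 國學と弁
    print(ambiguous)  # ['弁']
```

Picking the right old form for 弁 depends on the word it appears in, so that part is better left to a dictionary or a human.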
Thanks to Ben Bullock of the Lingua::JA::Moji Perl module, whose data I borrowed for the first versions of kanjigen.