Kanjigen updates

In lieu of an essay, today I’ve made several long-overdue updates to my comparative “character etymology” tool, Kanjigen. Changes include:

  • Can now (hopefully) convert to and from simplified Japanese characters (shinjitai 新字体), simplified Chinese (jiǎntǐzì 简体字), and traditional characters (kyûjitai 旧字体 aka fántǐzì 繁體字). I’m using the latest Unihan data for conversions. I still abstain from difficult cases—I do nothing if the character has multiple alternative conversions, and I’m not converting “extended shinjitai”, itaiji, ryakuji &c. &c.

  • Playing with the conversion buttons can make irrecoverable changes; therefore, an undo button was added.

  • There’s now an option to auto-convert the search query to the form most likely to be present in each dictionary that Kanjigen supports.

  • When linking to ja.wiktionary, it now points directly to the character origin section.

  • Added indexes to Karlgren’s analytical dictionary (the Unihan data for it seems quite sparse, but it’s better than nothing).

  • Convenience links to Unihan and jisho.org. (Should I add anything else?)

  • Better support for URIs. The search text is now saved to the URI’s hash-parameter. You can save or build these links and share a link directly to a specific query, like this (the user still must click the search button to get the popup windows to external dictionaries). I tried using a regular form with a submit button instead, but this angered popup blockers.

  • Hide indexes and links when nothing’s loaded yet (needs HTML5).

  • Fixed bug when there’s no Morohashi index available. Better choice of kanji forms for Morohashi search (first try as-is, then try traditional).

  • A help page was added.

  • The old post on kanji resources had grown a lot of aggregated discussion, so I decided to split it. The post at the old URL was trimmed to focus on a list of resources, as it did originally. The parts about my current understanding of “character etymology” have been moved to a new post, currently in draft (sorry, ran out of MP). I hope that these three pieces (Kanjigen with the help page, the list of resources, and the post on character structure) complement each other.

  • Many internal changes that will make it easier for me to add more functions in the future.

  • Scare quotes in “etymology” because Victor Mair said he’d point students here.

  • May 19: Datafiles are now JSON, and sent compressed if the browser supports it; added KangXi support; other minor, cosmetic enhancements. TODO file.
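
The "abstain from difficult cases" rule in the first item above can be sketched roughly as follows. This is a minimal Python illustration, not Kanjigen's actual code (which is JavaScript); the table name `TO_TRADITIONAL` and its three entries are made up for the example, loosely in the spirit of Unihan's variant fields.

```python
# Hypothetical variant table: each character maps to its candidate conversions.
# The entries are illustrative only and were not copied from the Unihan files.
TO_TRADITIONAL = {
    "国": ["國"],
    "体": ["體"],
    "发": ["發", "髮"],  # two candidates, so this counts as a difficult case
}


def convert(text, table):
    """Convert each character only when exactly one candidate exists."""
    out = []
    for ch in text:
        candidates = table.get(ch, [])
        # Abstain (keep the character as-is) when there is no candidate
        # or more than one alternative conversion.
        out.append(candidates[0] if len(candidates) == 1 else ch)
    return "".join(out)


print(convert("国体发", TO_TRADITIONAL))
```

With this toy table the first two characters are converted and 发 is left untouched, because the table offers two alternatives for it.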

If you like Kanjigen, link to it and make me famous! (Kidding—this is a simple hack, all credit should go to the people who worked hard to build the databases and dictionaries in the first place.) If you know of any more publicly queryable databases of factual character analysis, tell me and I’ll add them to the list.
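
For anyone who wants to build the shareable links mentioned in the list above, here is a small Python sketch. It assumes the hash part of the URI holds nothing but the percent-encoded search text, which may not match the real layout; `BASE_URL` is a placeholder, and `build_link` and `read_query` are hypothetical helper names.

```python
from urllib.parse import quote, unquote, urlsplit

# Placeholder address for illustration, not Kanjigen's real URL.
BASE_URL = "https://example.org/kanjigen/"


def build_link(query, base=BASE_URL):
    """Return a link whose fragment (hash) carries the search text."""
    # Assumption: the fragment is just the percent-encoded query string.
    return base + "#" + quote(query, safe="")


def read_query(link):
    """Recover the search text from a link's fragment."""
    return unquote(urlsplit(link).fragment)


link = build_link("漢字")
print(link)
print(read_query(link))
```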

9 thoughts on “Kanjigen updates”

  1. Haha, fixed it. Thanks mkrause.

    Carl: Technically, it’s a frontend—it’s all Javascript so your browser does all the work, not my server :) I did it this way because I don’t know if the hosting service I use would like the processing load. At first I was afraid Javascript would choke on all the kanji lists, but even without making any effort to optimize, my old EeePC 700 can run it just fine. It seems modern browsers have quite efficient JS implementations. Kanjigen is my first attempt at Javascript so it’s not exactly the prettiest code out there, but it seems to work.

    Oh and I made a quick and dirty Python script to massage data from Unihan into a JS format. (I know, I should have used JSON and a parser, but hey.)

  2. (Should I add anything else?)

    The English-language wiktionary pages for individual characters often include a direct link to the appropriate page of the Kangxi dictionary (an image hosted at http://www.kangxizidian.com), and I like to go and try to read them sometimes. It’s not more than a click or two away from the other courtesy links, but, hey, why not?

  3. Matt: Sure, why not? The kangxizidian.com search is klunky so we could just take the index data from Unihan and link to the page scan directly. Must be very simple…

    Or so I thought. But first, Unihan has two fields with KangXi info (kKangXi and kIRGKangXi) (IRG = the Ideographic Rapporteur Group). Sometimes they conflict. Someone asked which was correct. The official answer was, “whichever one has the correct data is the correct one. :-)”, and, “corrections welcomed”. So I added logic to Kanjigen to display both fields when a conflict occurs.

    Because KangXi indexes are algorithmic, it’s possible to infer a theoretical index position for characters not in the dictionary. Unihan includes both real and imaginary indexes. Of the 252 characters whose KangXi fields conflict, only 99 are actually present in the dictionary. I ended up hacking together a simple comparison tool to check them at kangxizidian.com, and later I’ll compare them one by one and send the results to the Unicode guys.

    Another problem is that the KangXi data is one order of magnitude larger than everything Kanjigen had to chew up to this point (we now download some 1.9MiB of data, versus 340KiB without KangXi info). So I made some performance improvements I planned on doing anyway; data files are now JSON, and minified, and I set up Apache to send them compressed, if the browser supports that. I figure the tool must be loading slower now, but I hope that, after a couple reloads, the caches will speed it up again. I might try to do something smart with fetching index data asynchronously, but some other day.

    Anyway, there you go, KangXi support :)

  4. Carl: Oh, from a structural, computer science point of view, Javascript was never that bad. It just suffered from horrible libraries/implementations/cruft, and bad marketing (this image is accurate). The very name is a good example: the language has nothing whatsoever to do with Java (*spit*), and the publicity gimmick of naming it after Java (*spit*) makes people think Javascript is supposed to be programmed like it, or at least in an object-oriented style; but in fact Javascript is not OO, it’s a prototype-based language with closures and other functional leanings, much closer to Scheme than that horrible J. The default reference for smart Javascript is Crockford. (Please don’t use my code as an example of smart Javascript. I shudder at the thought of feeding kanjigen to JSLint.)

  5. Heh, yesterday I couldn’t figure out how to make kanjigen load the indexes asynchronously, but today I woke up with a scheme in my mind. Thanks, brain! (Please bear with the slow loading a little longer, it will take some time.)

  6. A few points:
    • kanjinetworks.com is offline permanently
    • the new & improved chineseetymology.org is at hanziyuan.net
    • the name kanjigen was already taken by 漢字源
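
The kKangXi versus kIRGKangXi check discussed in the comments above can be sketched roughly like this in Python. It is not the comparison tool mentioned there. It assumes the usual tab-separated layout of the Unihan text files (code point, field name, value) and, as I understand the Unihan documentation, that an index looks like `0075.010`, with a final digit of 0 for a character really present in the dictionary and 1 for a "virtual" position. The sample lines are invented for the example, and the function names are hypothetical.

```python
# Invented sample lines in the Unihan text layout; the values are NOT real data.
SAMPLE = [
    "# comment lines start with a hash sign",
    "U+4E00\tkKangXi\t0075.010",
    "U+4E00\tkIRGKangXi\t0075.010",
    "U+3400\tkKangXi\t0078.010",
    "U+3400\tkIRGKangXi\t0078.101",
]


def parse_unihan(lines, wanted=("kKangXi", "kIRGKangXi")):
    """Collect the wanted fields per code point from Unihan-style lines."""
    fields = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        code, field, value = line.split("\t", 2)
        if field in wanted:
            fields.setdefault(code, {})[field] = value
    return fields


def is_real(index):
    """Assumption: a trailing 0 marks a real entry, a trailing 1 a virtual one."""
    return index.endswith("0")


def conflicts(fields):
    """Return the code points whose two KangXi fields are both set and differ."""
    result = {}
    for code, entry in fields.items():
        a, b = entry.get("kKangXi"), entry.get("kIRGKangXi")
        if a and b and a != b:
            result[code] = (a, b)
    return result


for code, (a, b) in conflicts(parse_unihan(SAMPLE)).items():
    print(code, a, "real" if is_real(a) else "virtual",
          b, "real" if is_real(b) else "virtual")
```

A field that lists several values on one line is treated here as a single string, so a real data file may need an extra split on spaces.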
