The Nanbanjin Nikki

ザ南蛮人日記

Kanjigen updates

In lieu of an essay, today I’ve made several long-overdue updates to my comparative “character etymology” tool, Kanjigen. Changes include:

If you like Kanjigen, link to it and make me famous! (Kidding—this is a simple hack, all credit should go to the people who worked hard to build the databases and dictionaries in the first place.) If you know of any more publicly queryable databases of factual character analysis, tell me and I’ll add them to the list.

Comments

The site looks useful! Thanks for the hard work. What are you using on the backend?

By Carl on .

You may want to check the link on “The old post on kanji resources”. :)

By mkrause on .

Haha, fixed it. Thanks mkrause.

Carl: Technically, it’s a frontend—it’s all Javascript so your browser does all the work, not my server :) I did it this way because I don’t know if the hosting service I use would like the processing load. At first I was afraid Javascript would choke on all the kanji lists, but even without making any effort to optimize, my old EeePC 700 can run it just fine. It seems modern browsers have quite efficient JS implementations. Kanjigen is my first attempt at Javascript so it’s not exactly the prettiest code out there, but it seems to work.

Oh and I made a quick and dirty Python script to massage data from Unihan into a JS format. (I know, I should have used JSON and a parser, but hey.)
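For the curious, here is a rough sketch of what such a conversion could look like. It is not the actual script; it only assumes the tab-separated layout of the Unihan text files (code point, field name, value, with `#` comment lines). The function name `parse_unihan` and the sample lines are made up for illustration.

```python
import json


def parse_unihan(lines, wanted_fields):
    """Collect selected Unihan fields into {character: {field: value}}."""
    table = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Each data line is: U+XXXX <tab> fieldName <tab> value
        codepoint, field, value = line.split("\t", 2)
        if field not in wanted_fields:
            continue
        char = chr(int(codepoint[2:], 16))  # "U+4E00" -> "一"
        table.setdefault(char, {})[field] = value
    return table


# Illustrative sample lines in the Unihan layout (not real file contents).
sample = [
    "# illustrative sample",
    "U+4E00\tkKangXi\t0075.010",
    "U+4E00\tkIRGKangXi\t0075.010",
    "U+4E00\tkTotalStrokes\t1",
]

table = parse_unihan(sample, {"kKangXi", "kIRGKangXi"})
# ensure_ascii=False keeps the kanji readable instead of \uXXXX escapes.
print(json.dumps(table, ensure_ascii=False))
```

Emitting JSON this way means the browser side can use a standard parser instead of evaluating a hand-built JS literal.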

By leoboiko on .

(Should I add anything else?)

The English-language Wiktionary pages for individual characters often include a direct link to the appropriate page of the Kangxi dictionary (an image hosted at http://www.kangxizidian.com), and I like to go and try to read them sometimes. It’s not more than a click or two away from the other courtesy links, but, hey, why not?

By Matt on .

Neat solution. I should get into JS more at some point. I know it’s not as bad as it was in the 90s anymore, but I just never feel like dealing with it.

By Carl on .

Matt: Sure, why not? The kangxizidian.com search is klunky so we could just take the index data from Unihan and link to the page scan directly. Must be very simple…

Or so I thought. But first, Unihan has two fields with KangXi info, kKangXi and kIRGKangXi (IRG = the Ideographic Rapporteur Group). Sometimes they conflict. Someone asked which was correct. The official answer was “whichever one has the correct data is the correct one. :-)”, and “corrections welcomed”. So I added logic to kanjigen to display both fields when a conflict occurs.
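The conflict check itself can be sketched roughly like this. Kanjigen does this in Javascript; the Python below is only an illustration, and the function name `kangxi_conflicts` and the index values in the sample table are made up.

```python
def kangxi_conflicts(table):
    """Return {character: (kKangXi, kIRGKangXi)} where the two fields disagree."""
    conflicts = {}
    for char, fields in table.items():
        kangxi = fields.get("kKangXi")
        irg = fields.get("kIRGKangXi")
        # Only a conflict if both fields exist and hold different values.
        if kangxi is not None and irg is not None and kangxi != irg:
            conflicts[char] = (kangxi, irg)
    return conflicts


# Made-up sample values, for illustration only.
table = {
    "一": {"kKangXi": "0075.010", "kIRGKangXi": "0075.010"},  # fields agree
    "丁": {"kKangXi": "0075.020", "kIRGKangXi": "0075.030"},  # fields disagree
    "七": {"kKangXi": "0075.040"},                            # only one field
}

for char, (kangxi, irg) in kangxi_conflicts(table).items():
    print(f"{char}: kKangXi={kangxi} kIRGKangXi={irg}")
```

Characters with only one of the two fields are not treated as conflicts, since there is nothing to compare.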

Because KangXi indexes are algorithmic, it’s possible to infer a theoretical index position for characters not in the dictionary. Unihan includes both real and imaginary indexes. Of the 252 characters whose KangXi fields conflict, only 99 are actually present in the dictionary. I ended up hacking together a simple comparison tool to check them at kangxizidian.com, and later I’ll compare them one by one and send the results to the Unicode guys.
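As I understand the Unihan documentation, both fields use the form `PPPP.CCV`: a four-digit page, a two-digit position on the page, and a final digit that is `0` for characters really in the dictionary and `1` for a virtual position. Under that assumption, telling real entries from imaginary ones could be sketched as below. The names `KangXiIndex` and `parse_kangxi_index` are made up for this example.

```python
import re
from typing import NamedTuple


class KangXiIndex(NamedTuple):
    page: int
    position: int
    in_dictionary: bool  # False means a virtual (inferred) position


# Assumed format: 4-digit page, dot, 2-digit position, then a 0/1 flag.
_INDEX_RE = re.compile(r"^(\d{4})\.(\d{2})([01])$")


def parse_kangxi_index(value):
    """Split a KangXi index string into page, position and a real/virtual flag."""
    match = _INDEX_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected KangXi index format: {value!r}")
    page, position, flag = match.groups()
    return KangXiIndex(int(page), int(position), flag == "0")


print(parse_kangxi_index("0075.010"))  # flag 0: present in the dictionary
print(parse_kangxi_index("1234.051"))  # flag 1: virtual position
```

With this in hand, a conflicting pair only needs checking against the page scans when at least one of its two values has `in_dictionary` set.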

Another problem is that the KangXi data is several times larger than everything Kanjigen had to chew up to this point (we now download some 1.9MiB of data, versus 340KiB without KangXi info). So I made some performance improvements I had planned on doing anyway: data files are now minified JSON, and I set up Apache to send them compressed if the browser supports that. I figure the tool must be loading slower now, but I hope that, after a couple of reloads, the caches will speed it up again. I might try to do something smart with fetching index data asynchronously, but some other day.
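To give a feel for why minifying and compressing help, here is a small sketch that compares the size of the same index as pretty-printed JSON, minified JSON, and gzipped minified JSON. The index is synthetic (generated in the code, not real KangXi data), and this only imitates the effect of server-side compression; it is not the Apache setup itself.

```python
import gzip
import json

# Synthetic index: 2000 CJK characters mapped to made-up "PPPP.CCV" strings.
index = {
    chr(0x4E00 + i): f"{i // 20 + 1:04d}.{i % 20 + 1:02d}0"
    for i in range(2000)
}

# Pretty-printed JSON, with indentation and spaces after separators.
pretty = json.dumps(index, ensure_ascii=False, indent=2).encode("utf-8")

# Minified JSON: no whitespace between tokens.
minified = json.dumps(index, ensure_ascii=False, separators=(",", ":")).encode("utf-8")

# What a compressing server would roughly send over the wire.
compressed = gzip.compress(minified)

print("pretty:    ", len(pretty), "bytes")
print("minified:  ", len(minified), "bytes")
print("compressed:", len(compressed), "bytes")

# The browser gets the same data back after decompression.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
```

Index data like this is very repetitive, which is exactly the kind of input gzip handles well.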

Anyway, there you go, KangXi support :)

By leoboiko on .

Carl: Oh, from a structural, computer science point of view, Javascript was never that bad. It just suffered from horrible libraries/implementations/cruft, and bad marketing (this image is accurate). The very name is a good example: the language has nothing to do whatsoever with Java (*spit*), and this publicity gimmick of naming it after Java (*spit*) makes people think Javascript is supposed to be programmed like it, or at least in a class-based object-oriented style; but in fact Javascript is not class-based, it’s a prototype-based language with closures and other functional leanings, much closer to Scheme than that horrible J. The default reference for smart Javascript is Crockford. (Please don’t use my code as an example of smart Javascript. I shudder at the thought of feeding kanjigen to JSLint.)

By leoboiko on .

Heh, yesterday I couldn’t figure out how to make kanjigen load the indexes asynchronously, but today I woke up with a scheme in my mind. Thanks, brain! (Please bear with the slow loading a little longer, it will take some time.)