The Nanbanjin Nikki

ザ南蛮人日記

Announcing myougiden, a command-line Japanese/English dictionary

Where have I been, you ask? I’ve disappeared for the last two weeks! I didn’t write anything, talked to no one, was nowhere to be seen!

As it happens, the Muse of Programming possessed me forcefully, and after some intense days taken by a mood, I ended up with this:

[myougiden screenshot]

myougiden is a new JMdict-based dictionary for the command line. If you’re on a POSIX-style system (I think OS X should work, probably, perhaps) and you’re interested in trying it out, refer to the README, which has the current feature list (good for hype).

It’s still rough and untested, so it might not work on your system (I tested it on Debian GNU/Linux wheezy with Python 3.2). Please tell me about any bugs!

Now I should probably go back to my thesis…

Comments

I don’t know what to say! The software looks great, but… you used word processor romanization in its name!

By Matt on .

What, you expected Myōgiden? But you can’t use non-ASCII characters in Linux command names! I mean, you can, but… you can’t. It’s just Not Done. It would be unnatural.

Besides, according to the command-not-found database, there are currently no commands in all of Debian’s hoards starting with myo-, which means myougiden can be invoked with myo[TAB].

By leoboiko on .

What, you expected Myōgiden?

Well, ideally…

I just think that degrading to “no indication of long vowel” is superior to degrading to word processor style. I don’t know what the official Kunrei-shiki standard says, but at least in the Hepburn world that’s the done thing (e.g. passports, train station names).
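For concreteness, a naive sketch of the degradations in Python (naive because a real Hepburn converter needs morpheme boundaries: 思う is omou, not omō):

import re

# Long-vowel spellings in word-processor romanization and their macron forms.
LONG_VOWELS = {"aa": "ā", "uu": "ū", "ee": "ē", "oo": "ō", "ou": "ō"}

def to_macron(waapuro):
    """Word-processor style to macron style: myougiden -> myōgiden."""
    return re.sub("|".join(LONG_VOWELS),
                  lambda m: LONG_VOWELS[m.group()], waapuro)

def to_bare(waapuro):
    """Degrade to "no indication of long vowel": myougiden -> myogiden."""
    return re.sub("|".join(LONG_VOWELS),
                  lambda m: m.group()[0], waapuro)

print(to_macron("myougiden"), to_bare("myougiden"))  # myōgiden myogiden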

By Matt on .

Losing phonemic information hurts my computolinguistic sensibilities (even non-phonemic information: I’m very bothered when I have to write e.g. yokozuna or kanazukai and can’t distinguish underlying /du/ from /zu/. And speaking of that, it should be いなづま, not いなずま. It’s the “wife of the rice”!)

By leoboiko on .

Little fix: the URL is “….japaneseenglish…”

By gobr on .

This reminds me that I should someday write a script to adapt EDICT to the OS X Dictionary.app format. It would be a simple XML→XML conversion, but it takes time and gumption, so…
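Something like this minimal sketch, perhaps (assuming JMdict’s XML layout on one side and the d:entry/d:index format from Apple’s Dictionary Development Kit on the other; names are illustrative and untested against Dictionary.app):

import xml.etree.ElementTree as ET

APPLE_NS = "http://www.apple.com/DTDs/DictionaryService-1.0.rng"

def convert(jmdict_path, out_path):
    # Output root, with the namespaces a Dictionary.app source file declares.
    root = ET.Element("d:dictionary", {
        "xmlns": "http://www.w3.org/1999/xhtml",
        "xmlns:d": APPLE_NS,
    })
    # Stream the (large) JMdict file instead of loading it whole.
    for _event, elem in ET.iterparse(jmdict_path):
        if elem.tag != "entry":
            continue
        kanji = [k.text for k in elem.iter("keb")]
        kana = [r.text for r in elem.iter("reb")]
        glosses = [g.text for g in elem.iter("gloss") if g.text]
        headword = (kanji or kana)[0]
        # A real script would guarantee unique ids; headwords mostly are.
        entry = ET.SubElement(root, "d:entry",
                              {"id": headword, "d:title": headword})
        for form in kanji + kana:  # make every written form searchable
            ET.SubElement(entry, "d:index", {"d:value": form})
        ET.SubElement(entry, "h1").text = headword
        ET.SubElement(entry, "p").text = "; ".join(glosses)
        elem.clear()  # free the parsed input entry
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)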

By Carl on .

After I was 80% done, it popped into my head that I should have looked into the DICT protocol… It’s true that I do a lot of firulas* like color and “intelligent” guessing, but perhaps it would be possible to write it as a custom server/client pair with protocol extensions, while remaining compatible with existing software. Oh well.

*firula: Unreasonably indulgent design; like, say, a backpack with almost too many kinds of inner divisions (“almost” because it’s never too many).
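For reference, the protocol side is small; a minimal sketch of an RFC 2229 DEFINE query (dot-unescaping and error handling omitted; dict.org is just a public test server):

import socket

def dict_define(word, host="dict.org", database="!"):
    """Minimal DICT lookup; '!' means the first database with a match."""
    definitions = []
    with socket.create_connection((host, 2628)) as sock:
        f = sock.makefile("rw", encoding="utf-8", newline="")
        f.readline()                                # 220 greeting banner
        f.write("DEFINE %s %s\r\n" % (database, word))
        f.flush()
        if not f.readline().startswith("150"):      # "150 n definitions"
            return definitions                      # e.g. 552 no match
        body = None
        for line in f:
            line = line.rstrip("\r\n")
            if line.startswith("151"):              # one definition follows
                body = []
            elif line == ".":                       # end of this definition
                definitions.append("\n".join(body))
                body = None
            elif line.startswith("250"):            # "250 ok": all done
                break
            elif body is not None:
                body.append(line)
    return definitions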

By leoboiko on .

This is pretty slick! I’ll be using its regex support for searches instead of Nihongo Resources from now on.

Thanks! Honesty binds me to confess that myougiden is quite a bit slower than I hoped, in part because it attempts to “do what you mean” by running many types of queries until one matches. And regexes unfortunately give it a significant performance hit. If the latency gets too uncomfortable, try passing lots of parameters to reduce the query guessing. Also, depending on what you need, consider simply grepping edict.utf8 or edict2.utf8 (this method has been my primary “dictionary” for many years, and myougiden grew out of that workflow).

(And if anyone has suggestions on how to make this thing faster, I’m all ears! Profiling shows that most of the time is spent on the SQL queries, not on the fluff.)
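To make the “guessing” concrete, an illustrative sketch of the cascade, and of why regexes hurt (the schema and names are invented for the example, not myougiden’s actual code):

import re
import sqlite3

def guess_search(conn, query):
    # Try strict, index-friendly queries first; fall back to looser ones.
    attempts = [
        ("exact kanji",   "SELECT id FROM entries WHERE kanji = ?",    query),
        ("exact reading", "SELECT id FROM entries WHERE reading = ?",  query),
        ("exact gloss",   "SELECT id FROM entries WHERE gloss = ?",    query),
        ("partial gloss", "SELECT id FROM entries WHERE gloss LIKE ?", "%" + query + "%"),
    ]
    for label, sql, param in attempts:
        rows = conn.execute(sql, (param,)).fetchall()
        if rows:
            return label, rows  # the first query type that hits wins
    return None, []

# Regexes are the slow path: sqlite3 has no REGEXP by default, so you
# register a Python function, and SQLite calls back into Python for
# every candidate row; no index can help with that.
def add_regexp(conn):
    conn.create_function(
        "regexp", 2,
        lambda pattern, value: re.search(pattern, value or "") is not None)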

By leoboiko on .

I would love it if you could add support for reading the EPWING dictionary format as well (http://ja.wikipedia.org/wiki/EPWING). I’ve got some dictionaries in this format, and I’m currently stuck using some Windows-based readers in a VM.

Since myougiden is in Python, I might try to add EPWING support myself, if I get the time.

Yeah, multi-dictionary support has been asked for; there’s a ton of neat little stuff to add, but I kinda grew tired of coding for now, and I’m concentrating on ~~nethack~~ my thesis. I’ll try my hand at it when I’m coding again, and of course patches are welcome.

By leoboiko on .

I’m sad this project is labeled ‘defunct’ on GitHub, because I certainly use it every day! (Usually because I feel bad about hitting beta.jisho.org every few minutes.) It’s plenty fast for me. Good work! The code looks high-quality, so hopefully you (or someone else) will come back to it.

By Ahmed Fasih on .

I’m so very sorry; I’m doing a thousand little things and find myself with little energy to code. But I myself use myōgiden every day! So it’s not really defunct, it’s in… suspended animation? (笑)

If you’re tech-savvy, check out the latest branches/commits. I’ve added support for “full-text search” (search-engine-like queries), which has sped up most queries by an order of magnitude. (There’s also new support for EDICT/JMdict languages other than English, if you’re interested in that.) One of these days I should package a new release, and eventually add a few important features we’re missing (like de-inflection/lemmatization).
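For a rough idea of why it helps, a sketch of the SQLite side (FTS4, with an invented schema; not myougiden’s actual code):

import sqlite3

conn = sqlite3.connect("dictionary.sqlite")
# An inverted index over glosses; populated once at dictionary-build time.
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS gloss_fts USING fts4(entry_id, gloss)")

# MATCH walks the index, so it scales with the number of hits rather than
# the number of glosses, unlike LIKE '%word%', which scans every row.
hits = conn.execute(
    "SELECT entry_id FROM gloss_fts WHERE gloss MATCH ?", ("weather",)
).fetchall()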

By leoboiko on .

Please don’t apologize :)! I saw the issue about non-English JMdict versions (which is where I saw the word ‘defunct’ :P), but I hadn’t caught the FTS branch. Interesting!

But what I’m most intrigued by is de-inflection/lemmatization. I’m currently using Ve (by Kimtaro, who also runs jisho.org; Ve itself post-processes MeCab’s output: https://github.com/Kimtaro/ve) to separate Japanese sentences into “words” (Ve combines the morphemes found by MeCab into something higher-level than morphemes, nominally “words”), and then I use myougiden to look up the resulting lemmas to make glosses. This way, I have my own linguistics-superpowered version of lingq.com :)

E.g., Ve converts “今朝、我が家で初氷を観測しました” (from a lesson on lingq.com) into approximately the following JSON:

[
  {
    "word": "今朝",
    "lemma": "今朝",
    "part_of_speech": "noun",
    "extra": { "reading": "ケサ", "transcription": "ケサ" }
  },
  {
    "word": "、",
    "lemma": "、",
    "part_of_speech": "symbol",
    "extra": { "reading": "、", "transcription": "、" }
  },
  {
    "word": "我が家",
    "lemma": "我が家",
    "part_of_speech": "noun",
    "extra": { "reading": "ワガヤ", "transcription": "ワガヤ" }
  },
  {
    "word": "で",
    "lemma": "で",
    "part_of_speech": "postposition",
    "extra": { "reading": "デ", "transcription": "デ" }
  },
  {
    "word": "初氷",
    "lemma": "初氷",
    "part_of_speech": "noun",
    "extra": { "reading": "ハツゴオリ", "transcription": "ハツゴーリ" }
  },
  {
    "word": "を",
    "lemma": "を",
    "part_of_speech": "postposition",
    "extra": { "reading": "ヲ", "transcription": "ヲ" }
  },
  {
    "word": "観測しました",
    "lemma": "観測する",
    "part_of_speech": "verb",
    "extra": { "reading": "カンソクシマシタ", "transcription": "カンソクシマシタ" }
  }
]

That last word shows the de-inflection of verbs. I’m very curious about how myougiden would handle de-inflection & lemmatization!
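For the curious, here’s a minimal sketch of pulling base forms straight out of MeCab, which is roughly the layer Ve builds on (assuming the Python MeCab bindings and an IPAdic-style dictionary, where the base form is the 7th feature field):

import MeCab

tagger = MeCab.Tagger()

def lemmas(sentence):
    """(surface, base form) pairs for each morpheme MeCab finds."""
    pairs = []
    for line in tagger.parse(sentence).splitlines():
        if line == "EOS":
            break
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        pairs.append((surface, base))
    return pairs

print(lemmas("今朝、我が家で初氷を観測しました"))
# MeCab alone stops at morphemes, so the verb comes out in pieces:
#   ... ('観測', '観測'), ('し', 'する'), ('まし', 'ます'), ('た', 'た')
# Ve's contribution is gluing those back into 観測しました with lemma 観測する.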