Announcing myougiden, a command-line Japanese/English dictionary

Where have I been, you ask? I’ve disappeared for the last two weeks! I didn’t write anything, talked to no one, was nowhere to be seen!

As it happens, the Muse of Programming possessed me forcefully, and after some intense days taken by a mood, I ended up with this:

[Image: myougiden screenshot]

myougiden is a new JMdict-based dictionary for the command line. If you’re on a POSIX-style system (I think OSX should work, probably, perhaps), and you’re interested in trying it out, refer to the README. Here’s a copy of the current features list for hype:

  • Fully Unicode-aware.
  • Regular expression support.
  • Partial, full, and whole-word queries.
  • Optional rōmaji input and output.
  • Intelligently figures out what kind of query is intended.
  • Option for tab-separated output, easily manipulable with Unix tools. (beta)
  • Full color output, including partial match highlighting. No seriously, this thing has a lot of color. I mean we’re talking Takashi Murakami material here.
  • Handles JMdict restricted readings and senses intelligently.
  • Auto-pager, auto-color, auto-nice, and other small niceties.

It’s still rough and untested, so it might not work on your system (I tested it on Debian GNU/Linux, wheezy, and Python 3.2). Please tell me of any bugs!

Now I should probably go back to my thesis…

15 thoughts on “Announcing myougiden, a command-line Japanese/English dictionary”

  1. What, you expected Myōgiden? But you can’t use non-ASCII characters in Linux command names! I mean, you can, but… you can’t. It’s just Not Done. It would be unnatural.

    Besides, according to the command-not-found database, currently there are no commands, in all of Debian’s hoards, starting with myo– — which means myougiden can be invoked with myo[TAB].

  2. What, you expected Myōgiden?

    Well, ideally…

    I just think that degrading to “no indication of long vowel” is superior to degrading to word processor style. I don’t know what the official Kunrei-shiki standard says, but at least in the Hepburn world that’s the done thing (e.g. passports, train station names).

  3. Losing phonemic information hurts my computolinguistic sensibilities (even non-phonemic information: I’m very bothered when I have to write e.g. yokozuna or kanazukai and can’t distinguish underlying /du/ from /zu/. And speaking of that, it should be いなづま, not いなずま. It’s the “wife of the rice”!)

  4. This reminds me that I should some day write a script to adapt Edict to work with the OS X format. It would be a simple XML→XML conversion, but it takes time and gumption, so…

  5. After I was 80% done, the thought popped that I should have looked into the DICT protocol… It’s true that I do a lot of firulas* like color and “intelligent” guessing, but perhaps it would be possible to write it as a custom server/client pair with protocol extensions, while remaining compatible with existing software. Oh well.

    *firula: Unreasonably indulgent design; like, say, a backpack with almost too many kinds of inner divisions (“almost” because it’s never too many).

  6. Thanks! Honesty binds me to confess that myougiden is quite a bit slower than I hoped, in part because it attempts to “do what you mean” by running many types of queries until one matches. And regexes unfortunately give it a significant performance hit. If the latency gets too uncomfortable, try passing more parameters to reduce query guessing. Also, depending on what you need, consider simply grepping edict.utf8 or edict2.utf8 (this method has been my primary “dictionary” for many years, and myougiden grew out of this workflow).

    (and if anyone has suggestions of how to make this thing faster, I’m all ears! profiling shows that most of the time is spent on the SQL queries, not on the fluff.)
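    To make the “running many types of queries until one matches” idea concrete, here is a minimal sketch of such a cascade. It is not myougiden’s actual code: the ENTRIES list, the search function and the order of attempts (full, then whole-word, then partial) are all assumptions made for illustration, and it scans an in-memory list instead of issuing SQL queries. It does show why guessing costs time: a query that only matches partially pays for every stricter attempt first.

```python
import re

# Hypothetical toy data: (kanji, reading, gloss). Not real JMdict records.
ENTRIES = [
    ("茶", "ちゃ", "tea"),
    ("お茶", "おちゃ", "tea (usually green)"),
    ("茶色", "ちゃいろ", "light brown"),
    ("蒸気", "じょうき", "steam"),
]


def search(query):
    """Try stricter match types first; return (kind, hits) for the first that matches."""
    escaped = re.escape(query)
    attempts = [
        ("full", re.compile(rf"^{escaped}$")),          # whole field equals the query
        ("whole-word", re.compile(rf"\b{escaped}\b")),  # query appears as a separate word
        ("partial", re.compile(escaped)),               # query appears anywhere
    ]
    for kind, pattern in attempts:
        # Each attempt is one full pass over the data.
        hits = [e for e in ENTRIES if any(pattern.search(field) for field in e)]
        if hits:
            return kind, hits
    return None, []


if __name__ == "__main__":
    for q in ("tea", "green", "ea"):
        kind, hits = search(q)
        print(q, kind, [h[0] for h in hits])
```

    Telling the program up front which kind of match you want is equivalent to skipping straight to one entry of the attempts list, which is why extra parameters reduce latency.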

  7. Yeah multi-dictionary support has been asked–there’s a ton of little neat stuffs to add but I kinda grew tired of coding for now, & am concentrating on nethack my thesis, but I’ll try my hand at it when I’m coding again, & of course patches are welcome.

  8. I’m sad this project is called ‘defunct’ on Github because I certainly use it every day! (Usually because I feel bad about hitting every few minutes.) It’s plenty fast for me. Good work! The code looks high-quality so hopefully you (or someone else) will come back to it.

  9. I’m so very sorry; I’m doing a thousand little things and find myself with little energy to code. But I myself use myōgiden everyday! So it’s not really defunct, it’s in… suspended animation? (笑)

    If you’re tech-savvy, check out the latest branches/commits. I’ve added support for “full text search” (search-engine–like queries), which has sped up most queries by an order of magnitude. (There’s also new support for EDICT/JMdict languages other than English, if you’re interested in that.) One of these days I should package a new release, and eventually add a few important features we’re missing (like de-inflection/lemmatization).
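    For readers wondering what “full text search” changes, here is a small sketch using SQLite’s FTS5 extension through Python’s sqlite3 module. The comment above does not say which engine or schema the branch uses, so the glosses table, its columns and the sample rows are assumptions, and the example needs an SQLite build with FTS5 enabled. A LIKE pattern with a leading wildcard has to scan every row and matches substrings, whereas an FTS MATCH looks the term up in a token index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per headword/gloss pair.
conn.execute("CREATE VIRTUAL TABLE glosses USING fts5(headword, gloss)")
conn.executemany(
    "INSERT INTO glosses (headword, gloss) VALUES (?, ?)",
    [
        ("お茶", "tea (usually green)"),
        ("茶色", "light brown; tawny"),
        ("蒸気", "steam; vapour"),
        ("先生", "teacher; instructor"),
    ],
)


def like_search(term):
    """Substring scan: every row is examined, and 'tea' also hits 'steam'."""
    cur = conn.execute(
        "SELECT headword FROM glosses WHERE gloss LIKE ? ORDER BY rowid",
        (f"%{term}%",),
    )
    return [row[0] for row in cur]


def fts_search(term):
    """Token lookup in the full-text index: only whole-word matches."""
    cur = conn.execute(
        "SELECT headword FROM glosses WHERE glosses MATCH ? ORDER BY rowid",
        (term,),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    print(like_search("tea"))
    print(fts_search("tea"))
```

    Note that the two searches are not interchangeable: the indexed one trades substring matching for speed, so partial-match queries would still need another path.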

  10. Please don’t apologize :! I saw the issue about non-English JMdict versions (which is where I saw the word ‘defunct’ :P), but I didn’t catch the FTS branch, interesting!

    But what I’m most intrigued by is de-inflection/lemmatization. I’m currently using Ve (by Kimtaro, who also runs, and which itself reprocesses MeCab: to separate Japanese sentences into “words” (Ve combines the morphemes found by MeCab into something higher-level than morphemes, nominally “words”), and then I use myougiden to look up the resultant lemmas to make glosses. In this way, I have my own linguistics-superpowered version of :)

    E.g., Ve converts “今朝、我が家で初氷を観測しました” (from a lesson on into approximately the following JSON:

    [
      { "word": "今朝", "lemma": "今朝", "part_of_speech": "noun",
        "extra": { "reading": "ケサ", "transcription": "ケサ" } },
      { "word": "、", "lemma": "、", "part_of_speech": "symbol",
        "extra": { "reading": "、", "transcription": "、" } },
      { "word": "我が家", "lemma": "我が家", "part_of_speech": "noun",
        "extra": { "reading": "ワガヤ", "transcription": "ワガヤ" } },
      { "word": "で", "lemma": "で", "part_of_speech": "postposition",
        "extra": { "reading": "デ", "transcription": "デ" } },
      { "word": "初氷", "lemma": "初氷", "part_of_speech": "noun",
        "extra": { "reading": "ハツゴオリ", "transcription": "ハツゴーリ" } },
      { "word": "を", "lemma": "を", "part_of_speech": "postposition",
        "extra": { "reading": "ヲ", "transcription": "ヲ" } },
      { "word": "観測しました", "lemma": "観測する", "part_of_speech": "verb",
        "extra": { "reading": "カンソクシマシタ", "transcription": "カンソクシマシタ" } }
    ]

    That last word shows the de-inflection of verbs. I’m very curious about how myougiden would handle de-inflection & lemmatization!
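    The lemma-extraction step of that pipeline can be sketched in a few lines of Python. This is an illustration only: the token dictionaries are trimmed copies of the output above, and the function name lemmas_to_look_up and the choice to drop symbols and repeated lemmas are assumptions, not part of Ve or myougiden.

```python
# Trimmed copies of the tokens shown above (only the fields used here).
TOKENS = [
    {"word": "今朝", "lemma": "今朝", "part_of_speech": "noun"},
    {"word": "、", "lemma": "、", "part_of_speech": "symbol"},
    {"word": "我が家", "lemma": "我が家", "part_of_speech": "noun"},
    {"word": "で", "lemma": "で", "part_of_speech": "postposition"},
    {"word": "初氷", "lemma": "初氷", "part_of_speech": "noun"},
    {"word": "を", "lemma": "を", "part_of_speech": "postposition"},
    {"word": "観測しました", "lemma": "観測する", "part_of_speech": "verb"},
]


def lemmas_to_look_up(tokens):
    """Return dictionary-form lemmas in sentence order, skipping symbols and repeats."""
    seen = set()
    lemmas = []
    for token in tokens:
        if token["part_of_speech"] == "symbol":
            continue  # punctuation has no dictionary entry worth glossing
        lemma = token["lemma"]
        if lemma not in seen:
            seen.add(lemma)
            lemmas.append(lemma)
    return lemmas


if __name__ == "__main__":
    # Each lemma would then be passed to the dictionary as a separate query.
    print(lemmas_to_look_up(TOKENS))
```

    The inflected form 観測しました never reaches the dictionary; only its lemma 観測する does, which is the part a built-in de-inflection feature would have to reproduce.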

  11. It’s a fine piece of software that you should be very proud of. You’ve put a lot of work into solving problems that aren’t specific to JMDict. I might read through the source and see about getting it to work with other languages. A CEDICT version would be great.
