Japanese word boundaries: inflected nouns?
The other day there was a discussion on No-Sword that turned to the topic of word separation, and I’d like to salvage my comments and throw together a post, even if it’s redundant—there are a couple of interesting papers that I believe are deserving of more exposure.
Japanese, like the scriptio continua of Classical Greek and Latin, is written without spaces; but romanized Japanese needs spacing, and, lacking a well-established tradition, we are often at a loss as to whether to space a compound or not (akishigure, aki-shigure, aki shigure?). Investigating this opens a whole can of worms. Valéry once said that words are like fragile planks over the abyss; we cross words to reach meaning every day without any difficulty, but if you stop on top of a word to examine it (say, “time” or “being”), the word promptly breaks and you find yourself falling. And this applies to the word “word” itself. The way I see it, there are at least three major usages of “word”:
- A typographical (or orthographic) word is the kind that Japanese is (generally) lacking. Even when reading a text in a phonographic script (like this one), we don’t read it pronouncing the sounds one letter at a time; instead, we learn to recognize general word-shapes at a glance (linguists dealing with writing systems, such as Sampson or Rogers, have compared this to a kind of “soft” logography). Such word-shapes give typographical words a strong psychological presence or “suchness” in the mind of the literate reader. For purely random, historical reasons, Portuguese writes em cima (“in-top”, on top, above) with a space but embaixo (“in-below”, below) as one contiguous word; if you ask an educated Brazilian, he’ll likely say that em cima is “two words”, even though it’s completely parallel in structure and pronunciation to embaixo. It’s in this sense that Gerrit Noordzij can claim that the Irish manuscript tradition “invented the word” when they added spacing and perfected the rhythmic script that we now call lowercase (or minuscule). Typographical words are defined by the visual patterns that our brains acquire from exposure to reading and learn to expect and match.
- We could call a linguistic word one that is defined with objective criteria, be it by old-style grammarians or modern linguists. There’s a bunch of definitions around: an intermediate level of phonological contours, an element that can stand isolated as an answer to a question, an element that can be referred to by certain anaphors, and so on. Haspelmath has a paper discussing many of those definitions, and their problems when they’re proposed as universal features of all languages.
- If you try to avoid typographic interference and ask an illiterate native what a “word” is, you’ll get all sorts of surprising examples. (For now I’m assuming common languages all have a word with the same role as “word”, though this assumption can itself be problematic.) My five-year-old daughter seems to have acquired a fairly complete basic vocabulary; but the other day she told me she was “tired of hearing that word”, referring to her little brother’s insistent “I want more bread” utterance. John Miles Foley has noted that, in many oral traditions, poet-performers will call “word” any kind of self-standing linguistic unit, from a couple of morphemes to a verse or couplet or even an entire passage (accordingly, he calls the chapters of his book “words”).
Now the problem is that those definitions of wordness seldom match one another. They don’t even match themselves; one linguistic school will disagree with the others, each orthographic standard challenges the others, et cetera. So how do we space Japanese? Until a common usage emerges, I don’t think the question can be answered absolutely.
I started to think about this topic when I found this mostly obscure article by J.J. Chew Jr., On word boundaries in Japanese. Chew noticed that Japanese natives often have trouble writing romanized Japanese in the same way as Europeans do; among other things, they write sequences like desukara or masuka where we’d expect desu kara and masu ka. Insightfully, Chew then came up with one kind of Japanese writing that’s actually spaced by natives: namely, hiragana books for small children. Analyzing some 300 pages of that material, he found, among other unexpected points, that what we call “particles” and “copulæ” are almost always printed without spacing or separation. He then did some informal psycholinguistic research:
To assess the reaction of native Japanese to these texts, I feigned naïveté and asked what the spaces were for. Repeatedly I was informed that it would be difficult for a child to understand a text without spaces. Usually I was told that spaces provided points where the child could stop without making the text unintelligible. When I inquired further if it wouldn’t be an improvement to provide spaces before the particles and copula, the reply was invariably negative: “Breaking up the ‘words’ would confuse the children.”
Which is to say, according to Chew, that the Japanese perceive things like sorega or anatawo as “words”. Which means nominals (taigen) could be described as inflected nouns, just like verbs or adjectives are inflected: oishi-i, kirei-na, hadashi-no; oishi-i, kirei-da, hadashi-da.
I don’t think Chew’s proposal had much of an impact; I think the grammatical consensus is that ringo in ringo+ga is a more standalone unit than tabe in tabe+masu. (On the other hand, there is disagreement on whether kirei-na is an inflected word or two uninflected ones.) But even with caveats, Chew’s insight isn’t groundless; there are certainly some relevant facts that shouldn’t be dismissed:
- Spoken Japanese doesn’t allow a pause (“juncture”) between a nominal and a particle;
- The pitch accent in Modern Standard Japanese works with the nominal+particle bundle as a unit; e.g. hashí↓gà “bridge” vs. hashígá “edge”, which are only distinguished by the accent in the ga;
- And Japanese natives unfamiliar with European-style romanization tend to not write spaces before particles.
Whether we call a sequence like gakkou+e an “inflected word” or a “word with particles” is, at the end of the day, a matter of terminology; but anyhow it’s a fact that it has a stronger unity than the English sequence to+school.
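To make the two spacing conventions concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sentence is segmented and tagged by hand (segmentation being the very problem at issue), and the function names and the `BOUND` set are hypothetical, not anything taken from Chew's article.

```python
# Hypothetical illustration: units are segmented and tagged by hand.
# BOUND is an assumed set of unit kinds that attach to whatever precedes them.
BOUND = {"particle", "copula"}

def space_european(tokens):
    """European-style romanization: every unit is set off by a space."""
    return " ".join(form for form, _ in tokens)

def space_attached(tokens):
    """Children's-book style as described by Chew: particles and copulas
    are written together with the preceding unit."""
    words = []
    for form, kind in tokens:
        if kind in BOUND and words:
            words[-1] += form   # glue onto the previous word
        else:
            words.append(form)  # start a new spaced word
    return " ".join(words)

sentence = [
    ("gakkou", "nominal"), ("e", "particle"),
    ("ikimasu", "verb"), ("ka", "particle"),
]

print(space_european(sentence))  # gakkou e ikimasu ka
print(space_attached(sentence))  # gakkoue ikimasuka
```

The second function is simply the first plus one rule, which is perhaps the point: the disagreement is not about where the units are, but about which of them count as standing on their own.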