The other day there was a discussion on No-Sword that turned to the topic of word separation, and I’d like to salvage my comments and throw together a post, even if it’s redundant—there are a couple of interesting papers that I believe are deserving of more exposure.

Japanese, like the scriptio continua of Classical Greek and Latin, is written without spaces; but romanized Japanese needs spacing, and, lacking a well-established tradition, we are often at a loss as to whether to space a compound or not (akishigure, aki-shigure, aki shigure?). Investigating this opens a whole can of worms. Valéry once said that words are like fragile planks over the abyss; we cross words to reach meaning every day without any difficulty, but if you stop on top of a word to examine it (say, “time” or “being”), the word promptly breaks and you find yourself falling. And this applies to the word “word” itself. The way I see it, there are at least three major usages of “word”:

  1. A typographical (or orthographic) word is the kind that Japanese is (generally) lacking. Even when reading a text in a phonographic script (like this one), we don’t read it pronouncing the sounds one letter at a time; instead, we learn to recognize general word-shapes at a glance (linguists dealing with writing systems, such as Sampson or Rogers, have compared this to a kind of “soft” logography). Such word-shapes give typographical words a strong psychological presence or “suchness” in the mind of the literate reader. For purely random, historical reasons, Portuguese writes em cima (“in-top”, on top, above) with a space but embaixo (“in-below”, below) as one contiguous word; if you ask an educated Brazilian, he’ll likely say that em cima is “two words”, even though it’s completely parallel in structure and pronunciation to embaixo. It’s in this sense that Gerrit Noordzij can claim that the Irish manuscript tradition “invented the word” when they added spacing and perfected the rhythmic script that we now call lowercase (or minuscule). Typographical words are defined by the visual patterns that our brains acquire from exposure to reading and learn to expect and match.
  2. We could call a linguistic word one that is defined with objective criteria, be it by old-style grammarians or modern linguists. There’s a bunch of definitions around: an intermediate level of phonological contours, an element that can stand isolated as an answer to a question, an element referred to by certain anaphors, and so on. Haspelmath has a paper discussing many of those definitions, and their problems when they’re proposed as universal features of all languages.
  3. If you try to avoid typographic interference and ask an illiterate native what’s a “word”, you’ll get all sorts of surprising examples. (For now I’m assuming common languages all have a word with the same role as “word”, though this assumption can itself be problematic.) My five-year-old daughter seems to have acquired a fairly complete basic vocabulary; but the other day she told me she was “tired of hearing that word”—referring to her little brother’s insistent “I want more bread” utterance. John Miles Foley has noted that, in many oral traditions, poet-performers will call “word” any kind of self-standing linguistic unit, from a couple of morphemes to a verse or couplet or even an entire passage (accordingly, he calls the chapters of his book “words”).

Now the problem is that those definitions of wordness seldom match one another. They don’t even match themselves; one linguistic school will disagree with the others, each orthographic standard challenges the others, et cetera. So how do we space Japanese? Until a common usage emerges, I don’t think the question can be answered absolutely.

I started to think about this topic when I found this mostly obscure article by J.J. Chew Jr., On word boundaries in Japanese. Chew noticed that Japanese natives often have trouble writing romanized Japanese in the same way as Europeans do; among other things, they write sequences like desukara or masuka where we’d expect desu kara and masu ka. Insightfully, Chew then came up with one kind of Japanese writing that’s actually spaced by natives: namely, hiragana books for small children. Analyzing some 300 pages of that material, he found, among other unexpected points, that what we call “particles” and “copulæ” are almost always printed without spacing or separation. He then did some informal psycholinguistic research:

To assess the reaction of native Japanese to these texts, I feigned naïveté and asked what the spaces were for. Repeatedly I was informed that it would be difficult for a child to understand a text without spaces. Usually I was told that spaces provided points where the child could stop without making the text unintelligible. When I inquired further if it wouldn’t be an improvement to provide spaces before the particles and copula, the reply was invariably negative: “Breaking up the ‘words’ would confuse the children.”

Which is to say, according to Chew, that the Japanese perceive things like sorega or anatawo as “words”. Which means nominals (taigen) could be described as inflected nouns, just like verbs or adjectives are inflected: oishi-i, kirei-na, hadashi-no; oishi-i, kirei-da, hadashi-da.

I don’t think Chew’s proposal had much of an impact; I think the grammatical consensus is that ringo in ringo+ga is a more standalone unit than tabe in tabe+masu. (On the other hand, there is disagreement on whether kirei-na is an inflected word or two uninflected ones.) But even with caveats, Chew’s insight isn’t groundless; there are certainly some relevant facts that shouldn’t be dismissed:

  • Spoken Japanese doesn’t allow a pause (“juncture”) between a nominal and a particle;
  • The pitch accent in Modern Standard Japanese works with the nominal+particle bundle as a unit; e.g. hashí↓gà “bridge” vs. hashígá “edge”, which are only distinguished by the accent in the ga;
  • And Japanese natives unfamiliar with European-style romanization tend to not write spaces before particles.

Whether we call a sequence like gakkou+e an “inflected word” or a “word with particles” is, at the end of the day, a matter of terminology; but anyhow it’s a fact that it has a stronger unity than the English sequence to+school.
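To make the contrast concrete, here is a minimal sketch of the two spacing conventions discussed above: the European-style one (desu kara, masu ka) and the one Chew found in children's hiragana books (desukara, masuka). The tag names ("nominal", "verb", "particle", "copula") and both function names are my own assumptions for illustration; they come neither from Chew's article nor from any real tokenizer, and the input is already split into units by hand.

```python
# Hypothetical tag set, invented for this illustration only.
# Units with these tags get glued to whatever precedes them.
ATTACHING = {"particle", "copula"}


def space_european(tokens):
    """Put a space between every unit (e.g. 'gakkou e')."""
    return " ".join(form for form, _kind in tokens)


def space_bunsetsu(tokens):
    """Attach particles and copulae to the preceding unit (e.g. 'gakkoue')."""
    chunks = []
    for form, kind in tokens:
        if kind in ATTACHING and chunks:
            # Glue onto the previous chunk instead of starting a new one.
            chunks[-1] += form
        else:
            chunks.append(form)
    return " ".join(chunks)


# A hand-tagged example sentence (the tagging itself is an assumption).
sentence = [
    ("anata", "nominal"),
    ("wa", "particle"),
    ("gakkou", "nominal"),
    ("e", "particle"),
    ("ikimasu", "verb"),
    ("ka", "particle"),
]

print(space_european(sentence))
print(space_bunsetsu(sentence))
```

The sketch only moves the difficulty around, of course: deciding which units count as "particle" or "copula" in the first place is exactly the question that the competing definitions of "word" disagree about.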

  1. Re: point 3, even in literate Japanese, 一言 doesn’t mean “one word” in the English sense, but “one phrase.” It’s annoying to translate, because you look at your text, and it naturally should be something like, “But what is time? In a word, time is the fluxing of banana peels…” but that doesn’t work in English because the definition is more than one lexical word long!

  2. I’m not sure to what extent we can rely on Chew’s analysis of hiragana texts. I can think of several factors that might have an influence.

    1) Hiragana is regarded as hard to read by the Japanese anyway. Kana don’t form visually compact groups like Roman letters, partly because of their written shape, partly because they are syllabic and therefore more economical, which means that spinning them out will be more confusing than helpful. That means that gakkō e is a more easily scannable form than がっこう へ.

    2) Adults are in a sense reading back to hiragana from the full orthographic form. 学校へ is an easily scannable unit because it consists of kanji + clitic. Nothing would be gained by writing it 学校 へ. It would be quite normal to apply this back to the ‘incomplete’ script that kana orthography is regarded as. What I mean is that hiragana writing is playing by the rules of kanji+clitic, not by its own rules of kana+clitic.

    The bunsetsu is definitely an intuitively obvious unit in Japanese for prosodic and syntactic reasons. But I don’t think it’s possible to say that clitics are glued to nouns in writing because that’s the intuitively obvious thing to do. It could be that people are simply used to it. Since the tradition of using an unspaced orthography has never been broken in Japan, it’s hard to say what people would find ‘intuitive’ had they been brought up on, say, romanisation, instead of the traditional system.

    In the other thread I mentioned Mongolian, which in a sense is the opposite of Japanese. The traditional script writes those clitics as separate words (orthographically distinct from the word they belong to), even though they are morphologically fused with the preceding word. In fact, they are often not even pronounced the way they are written. For example, orthographic yi is pronounced /g/, orthographic bar is pronounced /aar/ (leaving aside issues of vowel harmony). Thus Inner Mongolians, at least, are familiar with the idea that what they pronounce as fused clitics are actually independent particles. It might be interesting to look into this further. One thing I am sure of, though, is that contradictory notions can quite easily coexist in people’s minds (written one way, pronounced another).

    Even ‘full words’ may be heavily influenced by traditional concepts of orthography. In Japanese, I don’t think native speakers would experience any cognitive resistance to writing 学校 as gakkō in romaji, probably because of the phonological form (the two parts are fused together). But in Chinese I think native speakers have a strong perception that xué and xiào are in some way separate words and are tempted to write them separately in romanisation. But there is no need to wonder about Chinese intuitions: Vietnamese is even more clear cut in this. Even though the Vietnamese have been without characters for at least a century, the romanised writing system (Quốc Ngữ) preserves every morpheme as a separate syllable, and Vietnamese swear by the fact that they are separate ‘words’. Học hiệu (学校) is thus regarded as two words, and insisting that it is in some sense ‘one word’ will engender hostility in many Vietnamese speakers. This is despite the fact that many of these morphemes are barely usable as independent words (hiệu is, although its primary meanings are ‘signal’ or ‘shop’). I strongly doubt that this perception would be held to so tenaciously if quốc ngữ had from the start adopted the practice of writing them as one word.

    So I will say that I’m suspicious of native speaker intuitions. I believe that they have been strongly moulded by the written tradition and by no means represent a ‘primitive consciousness’ of what words are.

  3. Something else that occurred to me thinking about that discussion was the fact that particles can do this:

    [neko to inu] ga ita
    [inu to neko] ga ita
    [neko to inu] ni yatta
    [inu to neko] wo mita

    This makes it look kind of like the Saxon genitive, and works differently from, say, Latin (where you would expect both nouns to be marked for case)… But of course Latin isn’t really a very meaningful language to compare it to; Turkish would probably be better.

    (Vovin at least has declared that the “case endings” of OJ [at least] are basically equivalent to Turkish “suffixes” — he distinguishes “-ga”, “-no”, etc. from “wa”, calling the latter a “true particle”. Which is another issue to consider: we don’t necessarily need to move *everything* from the “particle” basket to the “inflection” basket, even if we decide that *some* things should be moved.)

  4. Bathrobe: Your expertise is appreciated, as always. Even if the Japanese intuitions are contaminated by the written tradition, I still regard their kana typesetting as an interesting fact; it just changes the domain to that of orthographic words. And yeah, native intuition on what’s a “word” is a very, very elastic notion.

    You say kana is not easily scannable, and I agree, but I regard this as an artifact of this weird custom of writing what is essentially ultra-cursive Wang-style Chinese calligraphy as if it were blocks of type. Kana is designed to be vertical and cursive and joined and with extra glyphs (the now unfortunately named hentaigana), all of which contribute to a kind of visual flow. I wonder if traditional brushwork didn’t somehow influence the spacing of the typeset kana writing described by Chew. For example, a modern workbook I have on semicursive (行書) says one should join up to three kana, but only within bunsetsu; so that the particles are again treated as part of (typographical) words. But I don’t know whether such rules came from older calligraphic practices, or if it’s just some new thing correlated with typesetting norms.

    One interesting point in Chew is that commas in Japanese writing seem to be more prosodic than our commas (which do not correspond to pauses in speech, cf.); if that’s true of punctuation in general, the phonological-words (which uncontroversially include the particles) could also be influencing the spacing in typographical-words.

    Your points about clitics made me think of something I just stumbled upon while studying Norwegian: the assertion that, unlike English, Norwegian still has a genitive case, as in « Ninas mor » “Nina’s mom”. …But wait: What’s the difference between « Ninas » and “Nina’s”? I think the obvious difference (the apostrophe) is nontrivial; it probably helps to shape (literate) intuition on the status of “’s” as something called a “clitic” (and, therefore, that “Nina’s” isn’t an inflected noun). The basic argument for clitics is the fact that Modern English can do things like “the King of Norway’s daughter”, though the issue is a bit more complicated. Browsing about clitics took me to:

