Mostly-Unsupervised Statistical Segmentation of Japanese: 
Applications to Kanji 
Rie Kubota Ando and Lillian Lee 
Department of Computer Science 
Cornell University 
Ithaca, NY 14853-7501 
{kubotar, llee} @cs.cornell.edu 
Abstract 
Given the lack of word delimiters in written 
Japanese, word segmentation is generally consid- 
ered a crucial first step in processing Japanese texts. 
Typical Japanese segmentation algorithms rely ei- 
ther on a lexicon and grammar or on pre-segmented 
data. In contrast, we introduce a novel statistical 
method utilizing unsegmented training data, with 
performance on kanji sequences comparable to and 
sometimes surpassing that of morphological analyz- 
ers over a variety of error metrics. 
I Introduction 
Because Japanese is written without delimiters be- 
tween words) accurate word segmentation to re- 
cover the lexical items is a key step in Japanese text 
processing. Proposed applications of segmentation 
technology include extracting new technical terms, 
indexing documents for information retrieval, and 
correcting optical character recognition (OCR) er- 
rors (Wu and Tseng, 1993; Nagao and Mori, 1994; 
Nagata, 1996a; Nagata, 1996b; Sproat et al., 1996; 
Fung, 1998). 
Typically, Japanese word segmentation is per- 
formed by morphological analysis based on lexical 
and grammatical knowledge. This analysis is aided 
by the fact that there are three types of Japanese 
characters, kanji, hiragana, and katakana: changes 
in character type often indicate word boundaries, al- 
though using this heuristic alone achieves less than 
60% accuracy (Nagata, 1997). 
Character sequences consisting solely of kanji 
pose a challenge to morphologically-based seg- 
reenters for several reasons. First and most 
importantly, kanji sequences often contain domain 
terms and proper nouns: Fung (1998) notes that 
50-85% of the terms in various technical dictio- 
~The analogous situation in English would be if words were 
written without spaces between them. 
Sequence length # of characters % of corpus 
1 - 3 kanji 20,405,486 25.6 
4 - 6 kanji 12,743,177 16.1 
more than 6 kanji 3,966,408 5.1 
Total 37,115,071 46.8 
Figure 1: Statistics from 1993 Japanese newswire 
(NIKKEI), 79,326,406 characters total. 
naries are composed at least partly of kanji. Such 
words tend to be missing from general-purpose 
lexicons, causing an unknown word problem for 
morphological analyzers; yet, these terms are quite 
important for information retrieval, information 
extraction, and text summarization, making correct 
segmentation of these terms critical. Second, kanji 
sequences often consist of compound nouns, so 
grammatical constraints are not applicable. For 
instance, the sequence sha-chohlkenlgyoh-mulbu- 
choh (presidentlandlbusinesslgeneral manager 
= "a president as well as a general manager of 
business") could be incorrectly segmented as: sha- 
chohlken-gyohlmulbu-choh (presidentl subsidiary 
business\[Tsutomu \[a name\]\[general manager); 
since both alternatives are four-noun sequences, 
they cannot be distinguished by part-of-speech 
information alone. Finally, heuristics based on 
changes in character type obviously do not apply to 
kanji-only sequences. 
Although kanji sequences are difficult to seg- 
ment, they can comprise a significant portion of 
Japanese text, as shown in Figure 1. Since se- 
quences of more than 3 kanji generally consist of 
more than one word, at least 21.2% of 1993 Nikkei 
newswire consists of kanji sequences requiring seg- 
mentation. Thus, accuracy on kanji sequences is an 
important aspect of the total segmentation process. 
As an alternative to lexico-grammatical and su- 
pervised approaches, we propose a simple, effi- 
241 
cient segmentation method which learns mostly 
from very large amounts of unsegmented training 
data, thus avoiding the costs of building a lexicon 
or grammar or hand-segmenting large amounts of 
training data. Some key advantages of this method 
are: 
• No Japanese-specific rules are employed, en- 
hancing portability to other s. 
• A very small number of pre-segmented train- 
ing examples (as few as 5 in our experiments) 
are needed for good performance, as long as 
large amounts of unsegmented data are avail- 
able. 
• For long kanji strings, the method produces re- 
sults rivalling those produced by Juman 3.61 
(Kurohashi and Nagao, 1998) and Chasen 1.0 
(Matsumoto et al., 1997), two morphological 
analyzers in widespread use. For instance, we 
achieve 5% higher word precision and 6% bet- 
ter morpheme recall. 
2 Algorithm 
Our algorithm employs counts of character n-grams 
in an unsegmented corpus to make segmentation de- 
cisions. We illustrate its use with an example (see 
Figure 2). 
Let "A B C D W X Y Z" represent an eight-kanji 
sequence. To decide whether there should be a word 
boundary between D and W, we check whether n- 
grams that are adjacent to the proposed boundary, 
such as the 4-grams sl ="A B C D" and 82 ="W 
X Y Z", tend to be more frequent than n-grams that 
straddle it, such as the 4-gram tl ---- "B C D W". If 
so, we have evidence of a word boundary between 
D and W, since there seems to be relatively little 
cohesion between the characters on opposite sides 
of this gap. 
The n-gram orders used as evidence in the seg- 
mentation decision are specified by the set N. For 
instance, if N = {4} in our example, then we pose 
the six questions of the form, "Is #(s~) > #(tj)?", 
where #(x) denotes the number of occurrences of 
x in the (unsegmented) training corpus. If N = 
{2,4}, then two more questions (Is "#(C D) > 
#(D W)?" and "Is #(W X) > #(O W)?") are 
added. 
More formally, let s~ and 8~ be the non- 
straddling n-grams just to the left and right of lo- 
cation k, respectively, and let t~ be the straddling 
n-gram with j characters to the right of location k. 
s, ? 
I i ABCb{WXYZ 
t, 
% • 
/, 
Figure 2: Collecting evidence for a word boundary 
- are the non-straddling n-grams 81 and 82 more 
frequent than the straddling n-grams tl, t2, and t3? 
Let I> (y, z) be an indicator function that is 1 when 
y > z, and 0 otherwise, 2 In order to compensate for 
the fact that there are more n-gram questions than 
(n - 1)-gram questions, we calculate the fraction of 
affirmative answers separately for each n in N: 
2 n--1 1 
vn(k) = 2(n- 1) x>(#(87), ~=1 j=l 
Then, we average the contributions of each n-gram 
order: 
1 
VN(k) = INI ~ vn(k) 
nEN 
After vN(k) is computed for every location, bound- 
aries are placed at all locations ~ such that either: 
• VN(g) > VN(e -- 1) and VN(g) > VN(e + 1) 
(that is, e is a local maximum), or 
• VN (2) > t, a threshold parameter. 
The second condition is necessary to allow for 
single-character words (see Figure 3). Note that it 
also controls the granularity of the segmentation: 
low thresholds encourage shorter segments. 
Both the count acquisition and the testing phase 
are efficient. Computing n-gram statistics for all 
possible values of n simultaneously can be done in 
O(m log m) time using suffix arrays, where m is 
the training corpus size (Manber and Myers, 1993; 
Nagao and Mori, 1994). However, if the set N of 
n-gram orders is known in advance, conceptually 
simpler algorithms suffice. Memory allocation for 
:Note that we do not take into account the magnitude of 
the difference between the two frequencies; see section 5 for 
discussion. 
242 
v~(k) 
A B \[C DI W X IY\] Z 
Figure 3: Determining word boundaries. The X- Y 
boundary is created by the threshold criterion, the 
other three by the local maximum condition. 
count tables can be significantly reduced by omit- 
ting n-grams occurring only once and assuming the 
count of unseen n-grams to be one. In the applica- 
tion phase, the algorithm is clearly linear in the test 
corpus size if \[NI is treated as a constant. 
Finally, we note that some pre-segmented data is 
necessary in order to set the parameters N and t. 
However, as described below, very little such data 
was required to get good performance; we therefore 
deem our algorithm to be "mostly unsupervised". 
3 Experimental Framework 
Our experimental data was drawn from 150 
megabytes of 1993 Nikkei newswire (see Figure 
1). Five 500-sequence held-out subsets were ob- 
tained from this corpus, the rest of the data serv- 
ing as the unsegmented corpus from which to derive 
character n-gram counts. Each held-out subset was 
hand-segmented and then split into a 50-sequence 
parameter-training set and a 450-sequence test set. 
Finally, any sequences occurring in both a test set 
and its corresponding parameter-training set were 
discarded from the parameter-training set, so that 
these sets were disjoint. (Typically no more than 
five sequences were removed.) 
3.1 Held-out set annotation 
Each held-out set contained 500 randomly-extracted 
kanji sequences at least ten characters long (about 
twelve on average), lengthy sequences being the 
most difficult to segment (Takeda and Fujisaki, 
1987). To obtain the gold-standard annotations, we 
segmented the sequences by hand, using an observa- 
tion of Takeda and Fujisaki (1987) that many kanji 
compound words consist of two-character stem 
words together with one-character prefixes and suf- 
fixes. Using this terminology, our two-level bracket- 
ing annotation may be summarized as follows. 3 At 
3A complete description of the annotation policy, including 
the treatment of numeric expressions, may be found in a tech- 
nical report (Ando and Lee, 1999). 
the word level, a stem and its affixes are bracketed 
together as a single unit. At the morpheme level, 
stems are divided from their affixes. For example, 
although both naga-no (Nagano) and shi (city) can 
appear as individual words, naga-no-shi (Nagano 
city) is bracketed as \[\[naga-no\]\[shi\]\], since here shi 
serves as a suffix. Loosely speaking, word-level 
bracketing demarcates discourse entities, whereas 
morpheme-level brackets enclose strings that cannot 
be further segmented without loss of meaning. 4 For 
instance, if one segments naga-no in naga-no-shi 
into naga (long) and no (field), the intended mean- 
ing disappears. Here is an example sequence from 
our datasets: 
Three native Japanese speakers participated in 
the annotation: one segmented all the held-out data 
based on the above rules, and the other two reviewed 
350 sequences in total. The percentage of agree- 
ment with the first person's bracketing was 98.42%: 
only 62 out of 3927 locations were contested by a 
verifier. Interestingly, all disagreement was at the 
morpheme level. 
3.2 Baseline algorithms 
We evaluated our segmentation method by com- 
paring its performance against Chasen 1.05 (Mat- 
sumoto et al., 1997) and Juman 3.61, 6 (Kurohashi 
and Nagao, 1998), two state-of-the-art, publically- 
available, user-extensible morphological analyzers. 
In both cases, the grammars were used as distributed 
without modification. The sizes of Chasen's and Ju- 
man's default lexicons are approximately 115,000 
and 231,000 words, respectively. 
Comparison issues An important question that 
arose in designing our experiments was how to en- 
able morphological analyzers to make use of the 
parameter-training data, since they do not have pa- 
rameters to tune. The only significant way that they 
can be updated is by changing their grammars or 
lexicons, which is quite tedious (for instance, we 
had to add part-of-speech information ' to new en- 
tries by hand). We took what we felt to be a rea- 
sonable, but not too time-consuming, course of cre- 
ating new lexical entries for all the bracketed words 
in the parameter-training data. Evidence that this 
4This level of segmentation is consistent with Wu's (1998) Monotonicity Principle 
for segmentation. 
5http://cactus.aist-nara.ac.jp/lab/nlt/chasen.html 
6http:/Ipine.kuee.kyoto-u.ac.jplnl-resourceljuman.e.html 
243 
90 
85 
8o 
75 
70 
Word accuracy 
CHASEN JUMAN ~otimize optimize recaU optindze F 
~'ecision 
Figure 4: Word accuracy. The three rightmost 
groups represent our algorithm with parameters 
tuned for different optimization criteria. 
was appropriate comes from the fact that these ad- 
ditions never degraded test set performance, and in- 
deed improved it by one percent in some cases (only 
small improvements are to be expected because the 
parameter-training sets were fairly small). 
It is important to note that in the end, we are com- 
paring algorithms with access to different sources 
of knowledge. Juman and Chasen use lexicons and 
grammars developed by human experts. Our al- 
gorithm, not having access to such pre-compiled 
knowledge bases, must of necessity draw on other 
information sources (in this case, a very large un- 
segmented corpus and a few pre-segmented exam- 
ples) to compensate for this lack. Since we are in- 
terested in whether using simple statistics can match 
the performance of labor-intensive methods, we do 
not view these information sources as conveying 
an unfair advantage, especially since the annotated 
training sets were small, available to the morpho- 
logical analyzers, and disjoint from the test sets. 
4 Results 
We report the average results over the five test sets 
using the optimal parameter settings for the corre- 
sponding training sets (we tried all nonempty sub- 
sets of {2, 3, 4, 5, 6} for the set of n-gram orders N 
and all values in {.05, .1, .15,..., 1} for the thresh- 
old t) 7. In all performance graphs, the "error bars" 
represent one standard deviation. The results for 
Chasen and Juman reflect the lexicon additions de- 
7For simplicity, ties were deterministically broken by pre- 
ferring smaller sizes of N, shorter n-grams in N, and larger 
threshold values, in that order. 
scribed in section 3.2. 
Word and morpheme accuracy The standard 
metrics in word segmentation are word precision 
and recall. Treating a proposed segmentation as a 
non-nested bracketing (e.g., "lAB ICI" corresponds 
to the bracketing "[AB][C]"), word precision (P) is 
defined as the percentage of proposed brackets that 
exactly match word-level brackets in the annotation; 
word recall (R) is the percentage of word-level an- 
notation brackets that are proposed by the algorithm 
in question; and word F combines precision and re- 
call: F = 2PR/(P + R). 
One problem with using word metrics is that 
morphological analyzers are designed to produce 
morpheme-level segments. To compensate, we al- 
tered the segmentations produced by Juman and 
Chasen by concatenating stems and affixes, as iden- 
tified by the part-of-speech information the analyz- 
ers provided. (We also measured morpheme accu- 
racy, as described below.) 
Figures 4 and 8 show word accuracy for Chasen, 
Juman, and our algorithm for parameter settings 
optimizing word precision, recall, and F-measure 
rates. Our algorithm achieves 5.27% higher preci- 
sion and 0.25% better F-measure accuracy than Ju- 
man, and does even better (8.8% and 4.22%, respec- 
tively) with respect to Chasen. The recall perfor- 
mance falls (barely) between that of Juman and that 
of Chasen. 
As noted above, Juman and Chasen were de- 
signed to produce morpheme-level segmentations. 
We therefore also measured morpheme precision, 
recall, and F measure, all defined analogously to 
their word counterparts. 
Figure 5 shows our morpheme accuracy results. 
We see that our algorithm can achieve better recall 
(by 6.51%) and F-measure (by 1.38%) than Juman, 
and does better than Chasen by an even wider mar- 
gin (11.18% and 5.39%, respectively). Precision 
was generally worse than the morphological analyz- 
ers. 
Compatible Brackets Although word-level accu- 
racy is a standard performance metric, it is clearly 
very sensitive to the test annotation. Morpheme ac- 
curacy suffers the same problem. Indeed, the au- 
thors of Juman and Chasen may well have con- 
structed their standard dictionaries using different 
notions of word and morpheme than the definitions 
we used in annotating the data. We therefore devel- 
oped two new, more robust metrics to measure the 
number of proposed brackets that would be incor- 
244 
[ [data] [base] ] [system] (annotation brackets 
Proposedsegmentadon wo~ morpheme 
e~o~ e~o~ 
[data][base] [system] 2 0 
[data][basesystem] 2 I 
[database] [sys][tem] 2 3 
compatible-bracket errors 
crossing morpheme-dividing 
0 0 
1 0 
0 2 
Figure 6: Examples of word, morpheme, and compatible-bracket errors. The sequence "data base" has been 
annotated as "[[data] [base]]" because "data base" and "database" are interchangeable. 
8O 
,¢:: 
~- 75 
Morpheme accuracy 
85 1 ................................................................................................................................................................... 
o 
CHASEN JUMAN optimize optimize recall optir#.ze F 
Ixacir,~ 
Figure 5: Morpheme accuracy. 
rect with respect to any reasonable annotation. 
Our novel metrics account for two types of er- 
rors. The first, a crossing bracket, is a proposed 
bracket that overlaps but is not contained within an 
annotation bracket (Grishman et al., 1992). Cross- 
ing brackets cannot coexist with annotation brack- 
ets, and it is unlikely that another human would 
create such brackets. The second type of er- 
ror, a morpheme-dividing bracket, subdivides a 
morpheme-level annotation bracket; by definition, 
such a bracket results in a loss of meaning. See Fig- 
ure 6 for some examples. 
We define a compatible bracket as a proposed 
bracket that is neither crossing nor morpheme- 
dividing. The compatible brackets rate is simply the 
compatible brackets precision. Note that this met- 
ric accounts for different levels of segmentation si- 
multaneously, which is beneficial because the gran- 
ularity of Chasen and Juman's segmentation varies 
from morpheme level to compound word level (by 
our definition). For instance, well-known university 
names are treated as single segments by virtue of be- 
ing in the default lexicon, whereas other university 
names are divided into the name and the word "uni- 
versity". Using the compatible brackets rate, both 
segmentations can be counted as correct. 
We also use the all-compatible brackets rate, 
which is the fraction of sequences for which all 
the proposed brackets are compatible. Intuitively, 
this function measures the ease with which a human 
could correct the output of the segmentation algo- 
rithm: if the all-compatible brackets rate is high, 
then the errors are concentrated in relatively few 
sequences; if it is low, then a human doing post- 
processing would have to correct many sequences. 
Figure 7 depicts the compatible brackets and all- 
compatible brackets rates. Our algorithm does bet- 
ter on both metrics (for instance, when F-measure 
is optimized, by 2.16% and 1.9%, respectively, in 
comparison to Chasen, and by 3.15% and 4.96%, 
respectively, in comparison to Juman), regardless of 
training optimization function (word precision, re- 
call, or F -- we cannot directly optimize the com- 
patible brackets rate because "perfect" performance 
is possible simply by making the entire sequence a 
single segment). 
Compatible and all-compatible brackets rates 
lOOT 
I 
g5 t 
85 ...... 
80 .... 
75 
CI'~SEN JUMAN 0ptimize precision 00timize recall optimize F 
10 compatib4e brackets rates • all-compatible brackets rates, 
Figure 7: Compatible brackets and all-compatible 
bracket rates when word accuracy is optimized. 
245 
precision 
recall 
F-measure 
Juman5 vs. Juman50 Our50 vs Juman50 Our5 vs. Juman5 
- 1.04 +5.27 +6.18 
-0.63 -4.39 -3.73 
-0.84 +0.26 +1.14 
Our5 vs. Juman50 
+5.14 
-4.36 
+0.30 
Figure 8: Relative word accuracy as a function of training set size. "5" and "50" denote training set size 
before discarding overlaps with the test sets. 
4.1 Discussion 
Minimal human effort is needed. In contrast 
to our mostly-unsupervised method, morphological 
analyzers need a lexicon and grammar rules built 
using human expertise. The workload in creating 
dictionaries on the order of hundreds of thousands 
of words (the size of Chasen's and Juman's de- 
fault lexicons) is clearly much larger than annotat- 
ing the small parameter-training sets for our algo- 
rithm. We also avoid the need to segment a large 
amount of parameter-training data because our al- 
gorithm draws almost all its information from an 
unsegmented corpus. Indeed, the only human effort 
involved in our algorithm is pre-segmenting the five 
50-sequence parameter training sets, which took 
only 42 minutes. In contrast, previously proposed 
supervised approaches have used segmented train- 
ing sets ranging from 1000-5000 sentences (Kash- 
ioka et al., 1998) to 190,000 sentences (Nagata, 
1996a). 
To test how much annotated training data is actu- 
ally necessary, we experimented with using minis- 
cule parameter-training sets: five sets of only five 
strings each (from which any sequences repeated in 
the test data were discarded). It took only 4 minutes 
to perform the hand segmentation in this case. As 
shown in Figure 8, relative word performance was 
not degraded and sometimes even slightly better. In 
fact, from the last column of Figure 8 we see that 
even if our algorithm has access to only five anno- 
tated sequences when Juman has access to ten times 
as many, we still achieve better precision and better 
F measure. 
Both the local maximum and threshold condi- 
tions contribute. In our algorithm, a location k 
is deemed a word boundary if VN(k) is either (1) a 
local maximum or (2) at least as big as the thresh- 
old t. It is natural to ask whether we really need two 
conditions, or whether just one would suffice. 
We therefore studied whether optimal perfor- 
mance could be achieved using only one of the con- 
ditions. Figure 9 shows that in fact both contribute 
to producing good segmentations. Indeed, in some 
cases, both are needed to achieve the best perfor- 
mance; also, each condition when used in isolation 
yields suboptimal performance with respect to some 
performance metrics. 
accuracy optimize optimize optimize 
precision recall F-measure 
word M M & T M 
morpheme M & T T T 
Figure 9: Entries indicate whether best performance 
is achieved using the local maximum condition (M), 
the threshold condition (T), or both. 
5 Related Work 
Japanese Many previously proposed segmenta- 
tion methods for Japanese text make use of either 
a pre-existing lexicon (Yamron et al., 1993; Mat- 
sumoto and Nagao, 1994; Takeuchi and Matsumoto, 
1995; Nagata, 1997; Fuchi and Takagi, 1998) or 
pre-segmented training data (Nagata, 1994; Papa: 
georgiou, 1994; Nagata, 1996a; Kashioka et al., 
1998; Mori and Nagao, 1998). Other approaches 
bootstrap from an initial segmentation provided by 
a baseline algorithm such as Juman (Matsukawa et 
al., 1993; Yamamoto, 1996). 
Unsupervised, non-lexicon-based methods for 
Japanese segmentation do exist, but they often have 
limited applicability. Both Tomokiyo and Ries 
(1997) and Teller and Batchelder (1994) explicitly 
avoid working with kanji charactes. Takeda and 
Fujisaki (1987) propose the short unit model, a 
type of Hidden Markov Model with linguistically- 
determined topology, to segment kanji compound 
words. However, their method does not handle 
three-character stem words or single-character stem 
words with affixes, both of which often occur in 
proper nouns. In our five test datasets, we found 
that 13.56% of the kanji sequences contain words 
that cannot be handled by the short unit model. 
Nagao and Mori (1994) propose using the heuris- 
246 
tic that high-frequency character n-grams may rep- 
resent (portions of) new collocations and terms, 
but the results are not experimentally evaluated, 
nor is a general segmentation algorithm proposed. 
The work of Ito and Kohda (1995) similarly relies 
on high-frequency character n-grams, but again, is 
more concerned with using these frequent n-grams 
as pseudo-lexicon entries; a standard segmentation 
algorithm is then used on the basis of the induced 
lexicon. Our algorithm, on the hand, is fundamen- 
tally different in that it incorporates no explicit no- 
tion of word, but only "sees" locations between 
characters. 
Chinese According to Sproat et al. (1996), most 
prior work in Chinese segmentation has exploited 
lexical knowledge bases; indeed, the authors assert 
that they were aware of only one previously pub- 
lished instance (the mutual-information method of 
Sproat and Shih (1990)) of a purely statistical ap- 
proach. In a later paper, Palmer (1997) presents 
a transformation-based algorithm, which requires 
pre-segmented training data. 
To our knowledge, the Chinese segmenter most 
similar to ours is that of Sun et al. (1998). They 
also avoid using a lexicon, determining whether a 
given location constitutes a word boundary in part 
by deciding whether the two characters on either 
side tend to occur together; also, they use thresholds 
and several types of local minima and maxima to 
make segmentation decisions. However, the statis- 
tics they use (mutual information and t-score) are 
more complex than the simple n-gram counts that 
we employ. 
Our preliminary reimplementation of their 
method shows that it does not perform as well as 
the morphological analyzers on our datasets, al- 
though we do not want to draw definite conclusions 
because some aspects of Sun et al's method seem 
incomparable to ours. We do note, however, that 
their method incorporates numerical differences 
between statistics, whereas we only use indicator 
functions; for example, once we know that one 
trigram is more common than another, we do not 
take into account the difference between the two 
frequencies. We conjecture that using absolute 
differences may have an adverse effect on rare 
sequences. 
6 Conclusion 
In this paper, we have presented a simple, mostly- 
unsupervised algorithm that segments Japanese se- 
quences into words based on statistics drawn from 
a large unsegmented corpus. We evaluated per- 
formance on kanji with respect to several metrics, 
including the novel compatible brackets and all- 
compatible brackets rates, and found that our al- 
gorithm could yield performances rivaling that of 
lexicon-based morphological analyzers. 
In future work, we plan to experiment on 
Japanese sentences with mixtures of character 
types, possibly in combination with morphologi- 
cal analyzers in order to balance the strengths and 
weaknesses of the two types of methods. Since 
our method does not use any Japanese-dependent 
heuristics, we also hope to test it on Chinese or other 
s as well. 
Acknowledgments 
We thank Minoru Shindoh and Takashi Ando for 
reviewing the annotations, and the anonymous re- 
viewers for their comments. This material was sup- 
ported in part by a grant from the GE Foundation. 

References 

Rie Ando and Lillian Lee. 1999. Unsupervised statistical segmentation of Japanese kanji strings. 
Technical Report TR99-1756, Cornell University. 

Takeshi Fuchi and Shinichiro Takagi. 1998. 
Japanese morphological analyzer using word co-occurrence - JTAG. In Proc. of COLING-ACL 
'98, pages 409-413. 

Pascale Fung. 1998. Extracting key terms from 
Chinese and Japanese texts. Computer Processing of Oriental Languages, 12(1 ). 

Ralph Grishman, Catherine Macleod, and John 
Sterling. 1992. Evaluating parsing strategies us- 
ing standardized parse files. In Proc. of the 3rd 
ANLP, pages 156--161. 

Akinori Ito and Kasaki Kohda. 1995. Language 
modeling by string pattern N-gram for Japanese 
speech recognition. In Proc. of ICASSP. 

Hideki Kashioka, Yasuhiro Kawata, Yumiko Kinjo, 
Andrew Finch, and Ezra W. Black. 1998. Use 
of mutual information based character clusters in dictionary-less morphological analysis of 
Japanese. In Proc. of COL1NG-ACL '98, pages 
658-662. 

Sadao Kurohashi and Makoto Nagao. 1998. 
Japanese morphological analysis system JUMAN 
version 3.6 manual. In Japanese. 

Udi Manber and Gene Myers. 1993. Suffix arrays: 
A new method for on-line string searches. SIAM 
Journal on Computing, 22(5):935-948. 

T. Matsukawa, Scott Miller, and Ralph Weischedel. 
1993. Example-based correction of word segmentation and part of speech labelling. In Proc. 
of the HLT Workshop, pages 227-32. 

Yuji Matsumoto and Makoto Nagao. 1994. Improvements of Japanese morphological analyzer 
JUMAN. In Proc. of the International Workshop 
on Sharable Natural Language Resources, pages 
22-28. 

Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, 
Yoshitaka Hirano, Osamu Imaichi, and Tomoaki 
Imamura. 1997. Japanese morphological analysis system ChaSen manual. Technical Report 
NAIST-IS-TR97007, Nara Institute of Science 
and Technology. In Japanese. 

Shinsuke Mori and Makoto Nagao. 1998. Unknown word extraction from corpora using n-gram statistics. Journal of the Information Processing Society of Japan, 39(7):2093-2100. In 
Japanese. 

Makoto Nagao and Shinsuke Mori. 1994. A new 
method of N-gram statistics for large number of 
n and automatic extraction of words and phrases 
from large text data of Japanese. In Proc. of the 
15th COLING, pages 611-615. 

Masaaki Nagata. 1994. A stochastic Japanese 
morphological analyzer using a forward-DP 
backward-A* n-best search algorithm. In Proc. 
of the 15th COLING, pages 201-207. 

Masaaki Nagata. 1996a. Automatic extraction of 
new words from Japanese texts using generalized 
forward-backward search. In Proc. of the Conference on Empirical Methods in Natural Language 
Processing, pages 48-59. 

Masaaki Nagata. 1996b. Context-based spelling 
correction for Japanese OCR. In Proc. of the 16th 
COLING, pages 806-811. 

Masaaki Nagata. 1997. A self-organizing Japanese 
word segmenter using heuristic word identification and re-estimation. In Proc. of the 5th Workshop on Very Large Corpora, pages 203-215. 

David Palmer. 1997. A trainable rule-based algorithm for word segmentation. In Proc. of the 35th 
ACL/8th EACL, pages 321-328. 

Constantine P. Papageorgiou. 1994. Japanese word 
segmentation by hidden Markov model. In Proc. 
of the HLT Workshop, pages 283-288. 

Richard Sproat and Chilin Shih. 1990. A statistical 
method for finding word boundaries in Chinese 
text. Computer Processing of Chinese and Oriental Languages, 4:336-351. 

Richard Sproat, Chilin Shih, William Gale, and 
Nancy Chang. 1996. A stochastic finite-sate 
word-segmentation algorithm for Chinese. Computational Linguistics, 22(3). 

Maosong Sun, Dayang Shen, and Benjamin K. 
Tsou. 1998. Chinese word segmentation without 
using lexicon and hand-crafted training data. In 
Proc. of COLING-ACL '98, pages 1265-1271. 

Koichi Takeda and Tetsunosuke Fujisaki. 1987. 
Automatic decomposition of kanji compound 
words using stochastic estimation. Journal of 
the Information Processing Society of Japan, 
28(9):952-961. In Japanese. 

Kouichi Takeuchi and Yuji Matsumoto. 1995. 
HMM parameter learning for Japanese morphological analyzer. In Proc, of the lOth Pacific Asia 
Conference on Language, Information and Computation (PACLING), pages 163-172. 

Virginia Teller and Eleanor Olds Batchelder. 1994. 
A probabilistic algorithm for segmenting non-kanji Japanese strings. In Proc. of the 12th AAAI, 
pages 742-747. 

Laura Mayfield Tomokiyo and Klaus Ries. 1997. 
What makes a word: learning base units in 
Japanese for speech recognition. In Proc. of the 
ACL Special Interest Group in Natural Language 
Learning (CoNLL97), pages 60-69. 

Zimin Wu and Gwyneth Tseng. 1993. Chinese text 
segmentation for text retrieval: Achievements 
and problems. Journal of the American Society 
for Information Science, 44(9):532-542. 

Dekai Wu. 1998. A position statement on Chinese 
segmentation, Presented at the 
Chinese Language Processing Workshop, 
University of Pennsylvania. 

Mikio Yamamoto. 1996. A re-estimation method 
for stochastic  modeling from ambiguous observations. In Proc. of the 4th Workshop 
on Very Large Corpora, pages 155-167. 

J. Yamron, J. Baker, P. Bamberg, H. Chevalier, 
T. Dietzel, J. Elder, F. Kampmann, M. Mandel, 
L. Manganaro, T. Margolis, and E. Steele. 1993. 
LINGSTAT: An interactive, machine-aided trans- 
lation system. In Proc. of the HLT Workshop, 
pages 191-195. 
