Automatic Extraction of New Words from Japanese Texts using 
Generalized Forward-Backward Search 
Masaaki NAGATA 
NTT Information and Communication Systems Laboratories 
1-2356 Take, Yokosuka-Shi, Kanagawa, 238-03 Japan 
nagata@nttnly.isl.ntt.jp
Abstract 
We present a novel method for extracting new words
from Japanese texts based on expected word
frequencies. First, we compute expected word 
frequencies from Japanese texts using a robust 
stochastic N-best word segmenter. We then ex- 
tract new words by filtering out erroneous word 
hypotheses whose expected word frequencies are 
lower than the predefined threshold. The method 
is derived from an approximation of the general- 
ized version of the Forward-Backward algorithm. 
When the Japanese word segmenter is trained on 
a 4.7 million word segmented corpus and tested 
on 1000 sentences whose out-of-vocabulary rate 
is 2.1%, the accuracy of the new word extraction 
method is 43.7% recall and 52.3% precision. 
Introduction 
Segmentation of sentences into words is trivial in 
English because words are delimited by spaces. 
It is a simple task to count word frequencies in 
a given text. It is also a simple task to list all 
new words (unknown words), namely, the words 
in a given text that are not found in the system 
dictionary. However, several languages such as 
Japanese, Chinese and Thai do not put spaces 
between words and so in these languages word 
segmentation, word frequency counting, and new 
word extraction remain unsolved problems in com- 
putational linguistics. 
Most Japanese NLP applications require word 
segmentation as a first stage because there are 
phonological units and semantic units whose pro- 
nunciation and/or meaning is not trivially deriv- 
able from the pronunciation and/or meaning of the 
individual characters. It is well known that the 
accuracy of word segmentation greatly depends 
on the coverage of the dictionary, in other words, 
the Out-Of-Vocabulary (OOV) rate of the target
texts. 
Our goal is to provide a method to automati- 
cally extract new words from Japanese texts. This 
method should adapt the dictionary of the word
segmenter to new domains and applications. It 
should also maintain the dictionary by collecting 
new words in the target domain. The applica- 
tion of the word segmenter is described elsewhere 
(Nagata, 1996). 
The approach we take is as follows. First, we
design a statistical language model that can assign
a reasonable word probability to an arbitrary
substring of the input sentence, whether or not
it is truly a word. Second, we devise a method
to obtain the expected word N-gram counts in the
target texts, using an N-best word segmentation
algorithm (Nagata, 1994). Finally, we extract new
words by filtering out spurious word hypotheses
whose expected word frequencies are lower than
a predefined threshold.
Japanese Morphological Analysis 
Before we start, we briefly explain the difficul- 
ties of Japanese morphological analysis, especially 
when the input sentence includes unknown words. 
Suppose the input sentence is "ペンシルバニア大学は
ENIAC の 50 周年を祝う。", which means
"University of Pennsylvania celebrates the 50th
anniversary of ENIAC", where the words ペンシルバニア
(transliteration of 'Pennsylvania') and
ENIAC (the name of the world's first computer) 
are not registered in the system dictionary. Fig- 
ure 1 shows three possible analyses of the input 
sentence, where each box represents a word hy- 
pothesis whose meaning and part of speech are 
shown above and under the box. The tag <UNK> 
represents an unknown word. 
One of the hardest problems in handling unre- 
stricted Japanese text is the identification of un- 
known words. In Figure 1, the string ENIAC is 
successfully tokenized as an unknown word. How- 
ever, there is ambiguity in the segmentation of the 
string ペンシルバニア大学.
In the first analysis, the system considers
ペンシルバニア ('Pennsylvania') as an unknown word,
[Figure 1 diagram: three analyses, each with log probability and (relative probability)]
-108.95 (0.790)  ペンシルバニア/<UNK> 大学/noun は/part. ENIAC/<UNK> の/part. 50/numeral 周年/suffix を/part. 祝/verb う/infl. 。/sym.
-110.49 (0.169)  ペンシル/noun バニア大学/<UNK> は/part. ENIAC/<UNK> の/part. 50/numeral 周年/suffix を/part. 祝/verb う/infl. 。/sym.
-111.90 (0.041)  ペンシル/noun バニア/<UNK> 大学/noun は/part. ENIAC/<UNK> の/part. 50/numeral 周年/suffix を/part. 祝/verb う/infl. 。/sym.
Figure 1: Japanese Morphological Analysis Example 
because 大学 ('university') is registered in the
dictionary. This is correct. In the second analysis,
the system guesses バニア大学 ('Vania university')
as an unknown word, because ペンシル (transliteration
of 'pencil') is registered in the dictionary and
some university names are registered in the dictionary,
such as スタンフォード大学 ('Stanford University')
and ケンブリッジ大学 ('Cambridge University').
In the third analysis, the system considers
バニア ('Vania') as an unknown word, because
both ペンシル and 大学 are registered in the
dictionary.
We often have overlapping word hypotheses when
the input sentence contains unknown words,
such as ペンシルバニア, バニア大学,
and バニア in Figure 1. We need a criterion to
select the most likely word hypothesis from among
the overlapping candidates. In fact, it is fairly
difficult to get plausible analyses like the ones shown
in Figure 1, because failure to identify an unknown
word affects the segmentation of the neighboring
words. A robust word segmenter is therefore the
essential first step.
In the following sections, we first describe a 
statistical language model to cope with unknown 
words. We then describe the word segmentation 
algorithm and the new word extraction method, 
with their derivation as an approximation of a 
generalization of the Forward-Backward algorithm 
(Baum, 1972). Finally, we present experimental
results that demonstrate its effectiveness.
Statistical Language Model 
Segmentation Model (Tagging Model) 
Let the input Japanese character sequence be
$C = c_1 c_2 \ldots c_m$, and segment it into word sequence
$W = w_1 w_2 \ldots w_n$ whose part of speech sequence is
$T = t_1 t_2 \ldots t_n$. The word segmentation task can
be defined as finding the word segmentation
and parts of speech assignment $(\hat{W}, \hat{T})$ that
maximizes the joint probability of word sequence and
tag sequence given character sequence, $P(W, T|C)$.
Since the maximization is carried out with fixed
character sequence $C$, the word segmenter only
has to maximize the joint probability of word
sequence and tag sequence $P(W, T)$.

$$(\hat{W}, \hat{T}) = \arg\max_{W,T} P(W, T|C) = \arg\max_{W,T} P(W, T) \qquad (1)$$
We call P(W,T) the segmentation model, al- 
though it is usually called tagging model in En- 
glish tagger research. In this paper, we compare 
three segmentation models: part of speech tri- 
gram, word unigram, and word bigram. 
In the part-of-speech trigram model (POS trigram
model), the joint probability $P(W, T)$ is approximated
by the product of parts of speech trigram
probabilities $P(t_i|t_{i-2}, t_{i-1})$ and word output
probabilities for given part of speech $P(w_i|t_i)$.

$$P(W,T) = \prod_{i=1}^{n} P(t_i|t_{i-2},t_{i-1}) P(w_i|t_i) \qquad (2)$$

In the word unigram and word bigram models,
the joint probability $P(W, T)$ is approximated
by the product of word unigram probabilities
$P(w_i, t_i)$ and word bigram probabilities
$P(w_i, t_i|w_{i-1}, t_{i-1})$, respectively.

$$P(W,T) = \prod_{i=1}^{n} P(w_i,t_i) \qquad (3)$$

$$P(W,T) = \prod_{i=1}^{n} P(w_i,t_i|w_{i-1},t_{i-1}) \qquad (4)$$
Basically, parameters of these segmentation 
models are estimated by computing the relative 
frequencies of the corresponding events in the segmented
training corpus. However, in order to handle
unknown words, we have introduced a slight
modification in computing the relative frequencies, 
as is described in the next section. 
Word Model 
We think of an unknown word as a word having a
special part of speech <UNK>. We define a statistical
word model to assign a word probability to
each word hypothesis. It is formally defined as the
joint probability of the character sequence $c_1 \ldots c_k$
if $w_i$ is an unknown word. We decompose it into
the product of word length probability and word
spelling probability,

$$P(w_i|\text{<UNK>}) = P(c_1 \ldots c_k|\text{<UNK>}) = P(k) P(c_1 \ldots c_k|k) \qquad (5)$$

where $k$ is the length of the character sequence.
We call $P(k)$ the word length model, and
$P(c_1 \ldots c_k|k)$ the word spelling model.
We assume that word length probability $P(k)$
obeys a Poisson distribution whose parameter is
the average word length $\lambda$ in the training corpus,

$$P(k) = \frac{(\lambda - 1)^{k-1}}{(k-1)!} e^{-(\lambda - 1)} \qquad (6)$$
This means that we regard word length as the 
interval between hidden word boundary markers, 
which are randomly placed with an average inter- 
val equal to the average word length. Although 
this word length model is very simple, it plays a 
key role in making the word segmentation algo- 
rithm robust. 
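The word length model of Equation (6) can be sketched in a few lines of Python. This is only an illustration: the function name `word_length_prob` is hypothetical, and the value 4.49 is the average length of low frequency words reported later in the experiments.

```python
import math


def word_length_prob(k: int, mean_length: float) -> float:
    """P(k) of Equation (6): a Poisson distribution shifted so that k >= 1."""
    if k < 1:
        return 0.0
    rate = mean_length - 1.0  # lambda - 1
    return rate ** (k - 1) / math.factorial(k - 1) * math.exp(-rate)


# Probabilities for word lengths 1..59 with lambda = 4.49.
length_probs = [word_length_prob(k, 4.49) for k in range(1, 60)]
```

Because the distribution is shifted by one, a one-character word receives probability $e^{-(\lambda - 1)}$ and the probabilities over all lengths sum to one.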
We approximate the spelling probability given
word length $P(c_1 \ldots c_k|k)$ by the word-based character
bigram model, regardless of word length.
Since there are more than 3,000 characters in
Japanese, the amount of training data would be
too small if we divided them by word length.

$$P(c_1 \ldots c_k) = P(c_1|\#) \prod_{i=2}^{k} P(c_i|c_{i-1}) \, P(\#|c_k) \qquad (7)$$
Here, special symbol "#" indicates the word 
boundary marker. 
Note that the word-based character bigram 
model is different from the sentence-based charac- 
ter bigram model. The former is estimated from 
the corpus segmented into words. It assigns a large 
probability to a character sequence that appears 
in the beginning (prefixes), the middle, and the 
end (suffixes) of a word. It also assigns a small 
probability to a character sequence that appears 
across a word boundary. 
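A minimal sketch of the word-based character bigram model of Equation (7) follows. It uses unsmoothed relative frequencies and toy ASCII "words"; the function names are hypothetical and the handling of unknown characters described later in the paper is left out.

```python
from collections import Counter

BOUNDARY = "#"  # word boundary marker, as in Equation (7)


def train_char_bigrams(words):
    """Count character bigrams inside words, padding each word with '#'."""
    bigrams, contexts = Counter(), Counter()
    for word in words:
        padded = BOUNDARY + word + BOUNDARY
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def spelling_prob(word, bigrams, contexts):
    """P(c_1|#) * prod_i P(c_i|c_{i-1}) * P(#|c_k), without smoothing."""
    padded = BOUNDARY + word + BOUNDARY
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        if contexts[prev] == 0:
            return 0.0  # context character never seen in training
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


# Toy training data: a segmented corpus containing the words "ab", "ab", "b".
bigrams, contexts = train_char_bigrams(["ab", "ab", "b"])
```

Because the counts come from a word-segmented corpus, a string such as "ba", which never occurs inside a word here, gets probability zero, which is the behavior the text attributes to sequences that cross a word boundary.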
By using the word model, we can create 
modified segmentation models that take unknown 
words into consideration. The parameters of the 
modified POS trigram, word unigram, and word 
bigram are estimated by Equations (8), (9), (10), 
and (11), in Figure 2. 
In Figure 2, $C(\cdot)$ denotes the count of the specified
event in the training corpus. In the part of
speech trigram model, $P(w_i|t_i)$ for an unknown
word $w_i$ is obtained, by definition, from the word
model $P(w_i|\text{<UNK>})$. In the word unigram model,
the unigram count $C(w_i)$ for unknown word $w_i$ is
given as the product of the total unigram count
of unknown words $C(\text{<UNK>})$ and the word model
probability $P(w_i|\text{<UNK>})$. The higher order N-gram
counts involving unknown words are also obtained
in the same manner.
In order to compute the parameters in Figure 2,
we need the counts involving unknown
words, such as $C(t_{i-2}, t_{i-1}, \text{<UNK>})$, $C(\text{<UNK>})$, and
$C((w_{i-1}, t_{i-1}), \text{<UNK>})$. These counts are important
because they represent the contexts in which
unknown words are likely to appear. To estimate
these counts, we replace all words appearing only
once in the training corpus with the unknown word
tag <UNK>, before computing relative frequencies.
The underlying idea of the replacement is
the same as Turing's estimates in back-off smoothing
(Katz, 1987). We redistribute the probability
mass of low count sequences to "unseen" sequences.
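The replacement of singleton words by the unknown word tag can be sketched as follows. For brevity the sketch works on plain word tokens, whereas the models in Figure 2 count word and part of speech pairs; the function names are hypothetical.

```python
from collections import Counter

UNK = "<UNK>"


def replace_singletons(sentences):
    """Replace every word that occurs exactly once in the corpus with <UNK>."""
    freq = Counter(word for sent in sentences for word in sent)
    return [[word if freq[word] > 1 else UNK for word in sent]
            for sent in sentences]


def bigram_counts(sentences):
    """Count adjacent word pairs, including pairs that involve <UNK>."""
    counts = Counter()
    for sent in sentences:
        counts.update(zip(sent, sent[1:]))
    return counts


# Toy segmented corpus: "c" and "d" are singletons.
toy_corpus = [["a", "b", "c"], ["a", "b", "d"]]
replaced = replace_singletons(toy_corpus)
counts = bigram_counts(replaced)
```

After the replacement, the count of the pair ("b", "<UNK>") records a context in which unknown words tend to appear, which is the quantity the modified models need.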
Generalized Forward Backward 
Reestimation 
Generalization of the Forward and 
Viterbi Algorithm 
In English part of speech taggers, the maximiza- 
tion of Equation (1) to get the most likely tag se- 
quence, is accomplished by the Viterbi algorithm 
(Church, 1988), and the maximum likelihood es- 
timates of the parameters of Equation (2) are 
obtained from untagged corpus by the Forward- 
Backward algorithm (Cutting et al., 1992). How- 
ever, it is impossible to apply the Viterbi algo- 
rithm and the Forward-Backward algorithm for 
word segmentation of those languages that have 
no delimiter between words, such as Japanese and 
Chinese, because word segmentation hypotheses 
overlap one another. 
Figure 3 shows an example of overlapping
word hypotheses and possible word segmentations
for the string 全国都道府県 ('all prefectures in the
nation'). We assume 全国 ('all nation'), 全 ('all'),
国都 ('national capital'), 都道府県 ('prefectures'),
都道 ('metropolitan road'), 都 ('metropolis'), 道府県
('prefectures'), 道 ('road'), 府県 ('prefectures'),
府 ('prefecture'), and 県 ('prefecture') are registered
in the dictionary. There are 15 possible word
segmentations in this example. In Japanese, many
words consist of a single character. Moreover,
a sequence of such characters may constitute a
different word.
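The number of segmentations can be checked with a short dynamic program. The sketch below is an illustration, not the paper's algorithm. The 11 listed words alone give 9 segmentations; the stated count of 15 is reproduced if the single character 国 is also treated as a dictionary word, as the discussion of Figure 4 suggests, so that inclusion is an assumption here.

```python
from functools import lru_cache


def count_segmentations(text, dictionary):
    """Count the ways to split `text` into a sequence of dictionary words."""

    @lru_cache(maxsize=None)
    def count_from(start):
        if start == len(text):
            return 1  # reached the end: one complete segmentation
        return sum(count_from(end)
                   for end in range(start + 1, len(text) + 1)
                   if text[start:end] in dictionary)

    return count_from(0)


listed_words = {"全国", "全", "国都", "都道府県", "都道", "都",
                "道府県", "道", "府県", "府", "県"}
n_listed = count_segmentations("全国都道府県", listed_words)
n_with_kuni = count_segmentations("全国都道府県", listed_words | {"国"})
```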
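$$P(t_i|t_{i-2},t_{i-1}) = \begin{cases} \dfrac{C(t_{i-2},t_{i-1},\text{<UNK>})}{C(t_{i-2},t_{i-1})} & \text{if } t_i = \text{<UNK>} \\ \dfrac{C(t_{i-2},t_{i-1},t_i)}{C(t_{i-2},t_{i-1})} & \text{otherwise} \end{cases} \qquad (8)$$

$$P(w_i|t_i) = \begin{cases} P(w_i|\text{<UNK>}) & \text{if } t_i = \text{<UNK>} \\ \dfrac{C(w_i,t_i)}{C(t_i)} & \text{otherwise} \end{cases} \qquad (9)$$

$$P(w_i,t_i) = \begin{cases} \dfrac{C(\text{<UNK>})}{\sum_{w_i,t_i} C(w_i,t_i)} \times P(w_i|\text{<UNK>}) & \text{if } t_i = \text{<UNK>} \\ \dfrac{C(w_i,t_i)}{\sum_{w_i,t_i} C(w_i,t_i)} & \text{otherwise} \end{cases} \qquad (10)$$

$$P(w_i,t_i|w_{i-1},t_{i-1}) = \begin{cases} \dfrac{C((w_{i-1},t_{i-1}),\text{<UNK>})}{C(w_{i-1},t_{i-1})} \times P(w_i|\text{<UNK>}) & \text{if } t_i = \text{<UNK>} \\ \dfrac{C((w_{i-1},t_{i-1}),(w_i,t_i))}{C(w_{i-1},t_{i-1})} & \text{otherwise} \end{cases} \qquad (11)$$

Figure 2: Modified Segmentation Models with Consideration to Unknown Words.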
Figure 3: Overlapping Word Hypotheses and Pos- 
sible Word Segmentations 
For Japanese word segmentation, we define
a generalized Forward algorithm and a generalized
Viterbi algorithm as follows. Let the input
Japanese character sequence of length $n$ be
$C = c_1 c_2 \ldots c_n$, and let $c_p^q$ denote the substring
$c_{p+1} \ldots c_q$. We define a function $D$ that maps a
character sequence $c_p^q$ to a list of word hypotheses
$\{w_i\}$. Function $D$ is the generalization of the
dictionary. Here, for simplicity, $w_i$ denotes a
combination of orthography (formally denoted by $w_i$)
and part of speech $t_i$. We use word bigram as the
segmentation model in the following example. Other
segmentation models, such as part of speech trigram
and word unigram, can be used in the same
manner.
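In the generalized forward algorithm, the forward
probability $\alpha_p^q(w_i)$ is the joint probability of
the character sequence $c_0^q$ and the event that the
final word in the segmentation of $c_0^q$ is $w_i$ that
spans the substring $c_p^q$. Forward probabilities can
be recursively computed as follows.

$$\alpha_q^r(w_{i+1}) = \sum_{0 \le p < q} \; \sum_{w_i \in D(c_p^q)} \alpha_p^q(w_i) P(w_{i+1}|w_i)$$
$$w_{i+1} \in D(c_q^r), \quad 0 \le q < n, \quad q < r \le n \qquad (12)$$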
The generalized forward algorithm starts from
the beginning of the input sentence, and proceeds
character by character. At each point $q$ in the
sentence, it sums over the product of the forward
probability of the word segmentation hypotheses
ending at the point, $\alpha_p^q(w_i)$, and the transition
probability to the word hypotheses starting at that
point, $P(w_{i+1}|w_i)$.
[Figure 4 diagram: word hypotheses over character positions 0 to 6]
Figure 4: One Step in the Generalized Forward
Algorithm.
Figure 4 shows a snapshot of the generalized
forward algorithm. The input is 全国都道府県, and
the current point $q$ is 2. The word hypotheses
ending at point 2 ($w_i \in D(c_p^q)$) are 全国 ($c_0^2$) and
国 ($c_1^2$). Those starting at point 2 ($w_{i+1} \in D(c_q^r)$)
are 都道 ($c_2^4$), 都道府県 ($c_2^6$), and 都 ($c_2^3$). The string
都道府 ($c_2^5$) is not registered in the dictionary. All
combinations of these words are examined.
The generalized Viterbi algorithm can be obtained
by replacing summation with maximization
in Equation (12). Here, $\phi_p^q(w_i)$ is the probability
of the most likely word segmentation sequence
for the character sequence $c_0^q$ whose final word $w_i$
spans the substring $c_p^q$.

$$\phi_q^r(w_{i+1}) = \max_{0 \le p < q} \; \max_{w_i \in D(c_p^q)} \phi_p^q(w_i) P(w_{i+1}|w_i)$$
$$w_{i+1} \in D(c_q^r), \quad 0 \le q < n, \quad q < r \le n \qquad (13)$$

Note that the original Forward algorithm and
the Viterbi algorithm are the special cases of
Equations (12) and (13) where $p$ and $r$ are fixed as
$p = q - 1$ and $r = q + 1$.
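The recursions of Equations (12) and (13) can be sketched with one function in which the combining operation is a parameter. This is a simplified illustration under stated assumptions: `lookup` stands in for the dictionary function $D$, `trans_prob` for the word bigram probability, the sentence-start symbol is hypothetical, and unknown word hypotheses and parts of speech are omitted.

```python
def generalized_forward(text, lookup, trans_prob, combine=sum):
    """Score all segmentations of `text`.

    combine=sum follows Equation (12); combine=max follows Equation (13).
    """
    n = len(text)
    start = "<s>"  # hypothetical sentence-start symbol
    table = {}                               # (p, q, word) -> score
    ending = {q: [] for q in range(n + 1)}   # q -> [(p, word)] ending at q
    for q in range(n):                       # proceed point by point
        for r in range(q + 1, n + 1):
            for word in lookup(text[q:r]):
                if q == 0:
                    score = trans_prob(start, word)
                else:
                    terms = [table[(p, q, prev)] * trans_prob(prev, word)
                             for p, prev in ending[q]]
                    if not terms:
                        continue             # no segmentation reaches q
                    score = combine(terms)
                table[(q, r, word)] = score
                ending[r].append((q, word))
    finals = [table[(p, n, word)] for p, word in ending[n]]
    return combine(finals) if finals else 0.0


# Toy setup: every dictionary word has probability 0.2 in any context.
toy_dictionary = {"a", "b", "c", "ab", "bc"}
lookup = lambda s: [s] if s in toy_dictionary else []
trans_prob = lambda prev, word: 0.2

total_prob = generalized_forward("abc", lookup, trans_prob, combine=sum)
best_prob = generalized_forward("abc", lookup, trans_prob, combine=max)
```

In the toy setup the three segmentations a|b|c, ab|c, and a|bc have probabilities 0.008, 0.04, and 0.04, so the summed score is 0.088 and the maximized score is 0.04.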
In order to handle unknown words, the dictionary
function $D$ returns a word hypothesis tagged
as unknown word if the substring $c_p^q$ is not registered
in the dictionary, such as 都道府 ($c_2^5$) in Figure 4.
The word model assigns a reasonable probability
to the unknown word. Therefore, in the
generalized forward algorithm and the generalized
Viterbi algorithm, we hypothesize all substrings
in the input sentence as words, and examine all
possible combinations of these word hypotheses.
Since we can define the generalized Back- 
ward algorithm in the same manner, we can de- 
fine the generalized Forward-Backward algorithm 
to estimate the word N-gram counts in Japanese 
texts, and to reestimate the word N-gram prob- 
abilities in the segmentation model. However, 
we give a more intuitive account of the method 
to introduce an approximation of the generalized 
Forward-Backward algorithm. 
Expected Word N-gram Count 
By using the above mentioned word segmentation 
algorithm, we can get all word segmentation hy- 
potheses of the input sentence. Once we get them, 
we can estimate word N-gram count in an unseg- 
mented Japanese corpus. 
Let $O_j^i$ be the jth word segmentation hypothesis
for the ith sentence in the corpus. $P(O_j^i)$ can
be computed by using the segmentation model.
The Bayes a posteriori estimate of the word unigram
count $C^i(w_\alpha)$ and the word bigram count
$C^i(w_\alpha, w_\beta)$ in the ith sentence can be computed
as,

$$C^i(w_\alpha) = \sum_j \frac{P(O_j^i)}{\sum_j P(O_j^i)} \times n_j^i(w_\alpha) \qquad (14)$$

$$C^i(w_\alpha, w_\beta) = \sum_j \frac{P(O_j^i)}{\sum_j P(O_j^i)} \times n_j^i(w_\alpha, w_\beta) \qquad (15)$$

Here, $n_j^i(w_\alpha)$ and $n_j^i(w_\alpha, w_\beta)$ denote the number
of times the unigram $w_\alpha$ and the bigram $w_\alpha, w_\beta$
appeared in the jth candidate of the ith sentence. 1
The estimate of the total unigram count $C(w_\alpha)$
and the total bigram count $C(w_\alpha, w_\beta)$ can
be obtained by summing the counts over all sentences
in the corpus.

$$C(w_\alpha) = \sum_i C^i(w_\alpha) \qquad (16)$$

$$C(w_\alpha, w_\beta) = \sum_i C^i(w_\alpha, w_\beta) \qquad (17)$$

The estimate of the unigram probability and
the bigram probability can be obtained as the relative
frequency of the associated events.

$$f(w_\alpha) = \frac{C(w_\alpha)}{\sum_{w_\alpha} C(w_\alpha)} \qquad (18)$$

$$f(w_\beta|w_\alpha) = \frac{C(w_\alpha, w_\beta)}{C(w_\alpha)} \qquad (19)$$

If necessary, we can reestimate the word N-gram
probabilities by replacing $P(w_\alpha)$ and $P(w_\beta|w_\alpha)$
with $f(w_\alpha)$ and $f(w_\beta|w_\alpha)$.
Extraction of New Words in Texts 
Expected word unigram counts (expected word 
frequencies) in the corpus (Equation (16)) can be 
used as a measure of likelihood that a particular 
substring in the input texts is actually a word. Let
$\theta$ denote the minimum expected word frequency
that we use to classify a given word hypothesis $w_\alpha$
as a word.

$$C(w_\alpha) > \theta \qquad (20)$$

Those words that are not found in the dictionary
and whose expected frequencies in the corpus are
larger than the threshold $\theta$ are extracted as the
new words in the input texts.
In theory, expected word N-gram counts can 
be obtained by the generalized Forward-Backward 
algorithm. In order to save computation time, 
however, we approximated the weighted sum of 
the word N-gram counts over all the word seg- 
mentation hypotheses in a sentence (Equation 
(14)), by that of the N-best word segmentation 
hypotheses 2. 
1Note that the (Generalized) Forward-Backward
algorithm is devised to compute these expected word
N-gram counts without listing all word segmentation
hypotheses.
2If we only use the best word segmentation, it is
called the Viterbi reestimation. Our method might
be called N-best reestimation. It is designed to be
more accurate than the Viterbi reestimation and more
efficient than the generalized Forward-Backward
algorithm.
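[Figure 5 diagram: the three best segmentation hypotheses with their relative probabilities]
0.7  言語学 / 入門
0.2  言語 / 学 / 入門
0.1  言 / 語学 / 入門
Figure 5: An example of computing the expected
word frequencies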
N-best word segmentation hypotheses can be 
obtained by using the Forward-DP Backward-A* 
algorithm (Nagata, 1994). It consists of a forward
dynamic programming search to record the
probabilities of all partial word segmentation hy- 
potheses, and a backward A* algorithm to extract 
the N-best hypotheses. It is a generalization of 
the tree-trellis search (Soong and Huang, 1991), 
in the sense that its forward Viterbi search is 
replaced with the generalized Viterbi search de- 
scribed in this paper. 
In reestimating the word N-gram probabilities,
we introduce two modifications to the normal
reestimation procedure. The first modification is
that, instead of using the relative frequency in an
unsegmented corpus (Equations (18) and (19)), we
combine the N-gram count in the segmented corpus
with the estimated N-gram count in the unsegmented
corpus to increase estimate reliability.
This is because a fairly large segmented
Japanese corpus was available in our experiments.

$$f(w_\alpha) = \frac{C_{seg}(w_\alpha) + C_{unseg}(w_\alpha)}{\sum_{w_\alpha} C_{seg}(w_\alpha) + \sum_{w_\alpha} C_{unseg}(w_\alpha)} \qquad (21)$$

$$f(w_\beta|w_\alpha) = \frac{C_{seg}(w_\alpha, w_\beta) + C_{unseg}(w_\alpha, w_\beta)}{C_{seg}(w_\alpha) + C_{unseg}(w_\alpha)} \qquad (22)$$

where $C_{seg}(\cdot)$ denotes the count in the segmented
corpus, and $C_{unseg}(\cdot)$ denotes the estimated count
in the unsegmented corpus.
The second modification is that we prune the 
expected N-gram counts in the unsegmented cor- 
pus if they are lower than a predefined threshold, 
before computing Equation (21) and (22). This 
is because $C_{unseg}(\cdot)$ is unreliable, especially when
$C_{unseg}(\cdot)$ is low.
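The two modifications can be sketched together for the unigram case of Equation (21). The function name and the toy counts are hypothetical, and keeping counts that are equal to the threshold is an assumption, since the text only says that counts lower than the threshold are pruned.

```python
def combined_unigram_probs(seg_counts, unseg_counts, prune_threshold):
    """Equation (21) with pruning of low expected counts."""
    # Second modification: drop expected counts below the threshold.
    kept = {w: c for w, c in unseg_counts.items() if c >= prune_threshold}
    # First modification: add segmented-corpus counts and expected counts.
    total = sum(seg_counts.values()) + sum(kept.values())
    vocabulary = set(seg_counts) | set(kept)
    return {w: (seg_counts.get(w, 0.0) + kept.get(w, 0.0)) / total
            for w in vocabulary}


# Toy counts: "d" has a low expected count and is pruned.
seg = {"a": 3, "b": 1}
unseg = {"a": 1.0, "c": 0.6, "d": 0.2}
probs = combined_unigram_probs(seg, unseg, prune_threshold=0.5)
```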
Examples of Estimating Expected 
Word Frequencies 
Finally, we show a simple example of estimating
the word N-gram counts in an unsegmented
sentence. Assume that the ith input sentence is
the character sequence 言語学入門, which means
"introduction to linguistics", and its best three
word segmentation hypotheses are as shown in
Figure 5. The leftmost numbers in Figure 5 are
the relative probabilities of the word segmentation
hypotheses, corresponding to $P(O_j^i) / \sum_j P(O_j^i)$ in
Equation (14). The expected word unigram count of
each word hypothesis in the sentence is,

$$C^i(\text{入門}) = 0.7 + 0.2 + 0.1 = 1.0$$
$$C^i(\text{言語学}) = 0.7$$
$$C^i(\text{言語}) = C^i(\text{学}) = 0.2$$
$$C^i(\text{言}) = C^i(\text{語学}) = 0.1$$

The expected total number of words in the sentence,
$\sum_\alpha C^i(w_\alpha)$, is 2.3. If none of the word hypotheses
is registered in the dictionary and the threshold
$\theta$ is 0.15, we regard 入門 ('introduction'), 言語学
('linguistics'), 言語 ('language'), and 学 ('study')
as new words. 言 ('say') and 語学 ('study of
languages') are discarded.
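The worked example can be reproduced with a short sketch of Equation (14) restricted to the N-best list. The function name is hypothetical, and the three hypotheses are written out as 言語学/入門, 言語/学/入門, and 言/語学/入門, which is the reading implied by the glosses and counts in the text.

```python
from collections import Counter


def expected_unigram_counts(nbest):
    """Equation (14) over an N-best list of (probability, words) pairs."""
    total = sum(prob for prob, _ in nbest)
    counts = Counter()
    for prob, words in nbest:
        for word in words:
            counts[word] += prob / total  # weight by relative probability
    return counts


nbest = [
    (0.7, ["言語学", "入門"]),
    (0.2, ["言語", "学", "入門"]),
    (0.1, ["言", "語学", "入門"]),
]
counts = expected_unigram_counts(nbest)

theta = 0.15  # expected word frequency threshold from the example
new_words = {word for word, count in counts.items() if count > theta}
```

Summing such per-sentence counters over all sentences gives the corpus-level counts of Equation (16), to which the threshold of Equation (20) is applied.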
Let us give another example that shows the
effect of summing the expected word unigram
counts over all the sentences in the corpus. Suppose
the sentence "ペンシルバニア大学は ENIAC
の 50 周年を祝う。", which means "University of
Pennsylvania celebrates the 50th anniversary of
ENIAC.", is in the corpus, and the first three
word segmentation hypotheses are as shown in
Figure 1. The expected word unigram counts for
ペンシルバニア ('Pennsylvania'), バニア大学 ('Vania
University'), and バニア ('Vania') are 0.790,
0.169, and 0.041, respectively. Suppose also the
sentence "ホワイトハウスはペンシルバニア通りにある。",
which means "White House lies at Pennsylvania
Avenue.", is in the corpus, and the expected
word unigram counts for ペンシルバニア
('Pennsylvania'), バニア通り ('Vania Avenue'),
and バニア ('Vania') are 0.825, 0.127, and 0.048,
respectively. The expected word unigram counts
in the corpus are,

$$C(\text{ペンシルバニア}) = 0.790 + 0.825 = 1.615$$
$$C(\text{バニア大学}) = 0.169$$
$$C(\text{バニア通り}) = 0.127$$
$$C(\text{バニア}) = 0.041 + 0.048 = 0.089$$

Therefore, ペンシルバニア is clearly the most likely
to be a new word. The more often an unknown
word appears in the corpus, the more likely it is
to be extracted, even if there is word segmentation
ambiguity in each sentence.
Experiments 
Language Data 
We used the EDR Japanese Corpus Version 1.0
(EDR, 1995) to train and test the word segmentation
program. It is a corpus of approximately
5 million words (200,000 sentences). It was col- 
lected to build a Japanese Electronic Dictionary, 
and contains a variety of Japanese sentences taken 
from newspapers, magazines, dictionaries, ency- 
clopedias, textbooks, etc. It has a variety of an- 
notations on morphology, syntax, and semantics. 
We used word segmentation, pronunciation, and 
part of speech in the morphology information field 
of the annotation. 
In this experiment, we randomly selected 90% 
of the sentences in the EDR Corpus for training 
the word segmentation program. We made two 
test sets from the rest of the corpus, one for a small 
size experiment (100 sentences) and the other for 
a medium size experiment (1000 sentences). Ta- 
ble 1 shows the number of sentences, words, and 
characters for training and test sets. Note that the 
test sets were not part of the training set. That 
is, open data were tested in the experiment. 
Table 1: The amount of training and test data
              training    test-1    test-2
Sentences       192802       100      1000
Words          4746461      2463     25177
Characters     7521293      3912     39875
The training texts contained 133281 word
types. We discarded word types that appeared
only once in the training texts. This resulted in
65152 word types being registered in the dictionary
of the word segmenter. We trained three
segmentation models, namely, part of speech trigram,
word unigram, and word bigram, after we
replaced the words that appeared only once in the
training texts with the unknown word tag <UNK>,
as described in the section on the word model.
After this replacement, there were 758172 distinct
word bigrams. Again, we discarded word bigrams
that appeared only once in the training texts
to save main memory, and used the remaining
294668 word bigrams. The word bigram probabilities
were smoothed using deleted interpolation
(Jelinek, 1985).
The training texts contained 3534 character 
types. We discarded characters that appeared 
only once in the training texts; 3167 character 
types remained. We then replaced the discarded 
characters with the unknown character tag to 
train the word spelling model. There were 91198 
distinct character bigrams in the words in the 
training texts 3 
3There are more than 3000 (some say more than
10000) characters in Japanese, and their frequency dis-
tribution is skewed. In order to save memory, we used
a type of character bigram model that considers un-
We made two spelling models. The first was 
trained using all words in the training texts, while 
the second was trained using those words whose 
frequency is less than or equal to 2. In princi- 
ple, the spelling model of unknown words must be 
trained using the low frequency words. However, it 
might suffer from the sparse data problem because
the total number of word tokens for training is de- 
creased from 4746461 to 103919. We also made 
two length models. The average word lengths of 
all words and that of low frequency words were 
1.58 and 4.49, respectively. Note that the aver- 
age word length is the only parameter of the word 
length model. 
Evaluation Measures 
Word segmentation accuracy is expressed in terms
of recall and precision. First, we count the number
of words in corpus segmentation (Std), the number
of words in system segmentation (Sys), and
the number of matching word segmentations (M).
Recall is defined as M/Std, and precision is defined
as M/Sys.
Figure 6 shows an example of computing pre- 
cision and recall for the sentence "ta ~ ~ 7 ~ 2-- 
~c~J~-~..fi~"~"~'% ", which means "Rockefeller 
Laboratory is an academic laboratory founded by 
an American millionaire, Rockefeller". Because of 
the difference in the segmentation of ロックフェラー研究所,
the number of words in corpus segmentation
(Std=15) differs from that of system
segmentation (Sys=14). Note that the system correctly
tokenized 学術研究所, although it is not registered
in the dictionary.
New word extraction accuracy is described in 
terms of recall, precision, and F-measure. First, 
we count the number of unknown words in the cor- 
pus segmentation (Std), the number of unknown 
words in the system segmentation (Sys), and the 
number of matching words (M). Here, unknown 
words are those that are not registered in the sys- 
tem dictionary. Recall is defined as M/Std, and 
precision is defined as M/Sys. Since recall and 
precision greatly depend on the frequency thresh- 
old, we used the F-measure to indicate the overall 
performance. F-measure is used in Information
Retrieval, and is calculated by

$$F = \frac{(\beta^2 + 1.0) \times P \times R}{\beta^2 \times P + R} \qquad (23)$$

where $P$ is precision, $R$ is recall, and $\beta$ is the relative
importance given to recall over precision.
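The evaluation measures can be written as a small sketch. The function names are hypothetical; the check values are the recall and precision figures for the "No Reestimation" setting at thresholds 0.10 and 0.50 in Table 3.

```python
def recall_precision(matched, n_std, n_sys):
    """Recall = M/Std and precision = M/Sys, as defined in the text."""
    return matched / n_std, matched / n_sys


def f_measure(precision, recall, beta=1.0):
    """Equation (23); beta = 1.0 weights recall and precision equally."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


f_at_010 = f_measure(precision=52.3, recall=43.7)
f_at_050 = f_measure(precision=65.6, recall=36.4)
```

With these inputs the rounded values agree with the F column of Table 3 (47.6 and 46.8).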
known characters, like the word bigram model used in 
the segmentation model. 
[Figure 6 diagram: side-by-side word listing for sentence JC00092627, corpus segmentation (left) and system segmentation (right); English glosses in order: Rockefeller, laboratory, particle (topic), America, of, big, rich man, Rockefeller, particle (subject), found, inflectional suffix, auxiliary verb (past), academic laboratory, be, (period)]
sys=15, std=14, matched=13
precision=87.7 (13/15), recall=92.9 (13/14)
Figure 6: Comparison between the corpus segmentation (left) and the system segmentation (right).
All words are listed in UNIX sdiff style.
Word Segmentation Accuracy 
In order to decide the best configuration of the un- 
derlying Japanese word segmenter, we compared 
three segmentation models: part of speech trigram,
word unigram, and word bigram. We also
compared three word models: all words, low fre- 
quency words, and the combination of the two. 
The third word model consisted of the spelling 
model trained using all words and the length 
model trained using low frequency words. 
Table 2 shows, for the small test set (100 sen- 
tences), the segmentation accuracy of the various 
combinations of the segmentation models and the 
word models. 
The word bigram model clearly outperformed
the part of speech trigram as well as word unigram.
As for the word model, it seems the combination 
of the spelling model for all words and the length 
model for low frequency words is the best, but the 
difference is small. In the following experiment, we 
decided to use word bigram as the segmentation 
model, and the combination of the spelling model 
of all words and the length model of low frequency 
words as the word model. 
New Word Extraction Accuracy 
We tested the new word extraction method us- 
ing the medium size test set (1000 sentences). It 
contains 538 unknown word types. 8 word types 
appeared twice in the test set. The other 530 word 
types appeared only once. The out-of-vocabulary 
rate of the test set is 2.2%. To count the expected 
word frequencies, we used the top-10 word segmentation
hypotheses. We limited the maximum
character length of an unknown word to 8 in
order to save computation time.
We tested three variations of the new word 
extraction method. The first one was "No Reestimation";
it uses the word segmenter's outputs as
they are when extracting new words. The second 
and the third ones carry out reestimation before 
extraction, where the pruning thresholds of the ex- 
pected N-gram counts in the reestimation are 0.95 
and 0.50, respectively. Reestimations were carried 
out three times. 
Table 3 shows the new word extraction accuracies
for a variety of expected word frequency
thresholds $\theta$, with and without reestimation. In
Table 3, we set $\beta = 1.0$ to compute F-measure.
As Table 3 shows, the higher the threshold is, 
the higher the precision and the lower the recall 
become. When we put equal importance on recall 
and precision, the best value for the expected word 
frequency threshold is around 0.10 where the recall 
is 43.7% and the precision is 52.3%. 
Figure 7 shows excerpts of correctly extracted 
new words (matched), incorrectly extracted word 
hypotheses (sys-matched), and new words that 
were not extracted (std-matched), when the fre- 
quency threshold was 0.5 and reestimation was not 
carried out. We find that the overall quality of 
the extracted word hypotheses is satisfactory, al- 
Table 2: Language Models and Segmentation Accuracies (100 test sentences)
                             POS trigram      word unigram     word bigram
word model                   recall  prec.    recall  prec.    recall  prec.
all words                    91.6    88.8     88.7    87.3     94.6    89.4
low frequency words          91.5    89.5     88.4    88.0     94.3    90.1
all words + l.f.w. length    91.5    89.3     88.8    87.6     94.7    89.9
Table 3: New Word Extraction Accuracy (1000 test sentences)
          No Reestimation         freq>0.95, 3 iter.      freq>0.50, 3 iter.
freq.     recall  prec.  F        recall  prec.  F        recall  prec.  F
>0.00     56.1    34.2   42.5     50.6    37.9   43.4     39.6    56.7   46.6
>0.10     43.7    52.3   47.6     43.1    52.1   47.2     37.9    63.6   47.5
>0.50     36.4    65.6   46.8     36.1    65.8   46.6     36.6    65.2   46.9
>0.90     25.3    76.8   35.8     25.3    77.3   38.1     36.6    65.2   46.9
>0.95     23.2    78.1   35.8     23.4    78.3   36.1     36.6    65.2   46.9
>0.99     17.3    81.6   28.5     23.4    78.3   36.1     36.6    65.2   46.9
though the values of recall and precision are not 
so high. We discuss the reason for this in the next 
section. 
Discussion 
The problem of Japanese word segmentation is 
that people often can not agree on a single word 
segmentation. Therefore, the reported performance
could be greatly underestimated. Most of
the new words extracted by the system are acceptable
as words (at least to us), and are not
necessarily wrong word entries. On the other
hand, most of the new words not extracted by the 
system can be divided into shorter words that are 
registered in the dictionary. 
For example, in the first sentence of Figure 8,
データコミュニケーション ('data communication')
is regarded as one word in corpus segmentation
and counted as an unknown word in
the test sentence. However, the system segmented
it into データ ('data') and コミュニケーション
('communication'), both of which are found
in the dictionary. In the second sentence of Figure 8,
the system extracted ハノーバー公 ('Duke
of Hanover') as a new word, while this word is divided
into ハノーバー ('Hanover') and 公 ('Duke')
in corpus segmentation. Most of the extraction errors
are of this category.
There are three types of obvious extraction
errors. The first type is the truncation of long
words. Some transliterated Western-origin words
exceed the predefined maximum length for unknown
words. The third sentence of Figure 8 is an
example of this type. In Japanese, 'illustration' is
transliterated into 9 characters, イラストレーション,
which exceeds the maximum unknown word
length of 8 characters in our system. Since イラスト
(the transliteration of 'illust', which also means
illustration in Japanese) is registered in the dictionary,
レーション (the transliteration of 'ration') is
incorrectly extracted as a new word.
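The truncation effect can be reproduced with a small sketch of candidate generation. It assumes that unknown word hypotheses are simply all substrings of up to 8 characters starting at a given position, and that the nine-character transliteration of 'illustration' is written イラストレーション. The function and constant names are hypothetical, and the real segmenter also scores each candidate, which is omitted here.

```python
MAX_UNKNOWN_LEN = 8  # maximum unknown word length used in the experiment


def unknown_candidates(text, start, max_len=MAX_UNKNOWN_LEN):
    """All substrings of text beginning at start, up to max_len characters."""
    end = min(len(text), start + max_len)
    return [text[start:stop] for stop in range(start + 1, end + 1)]


word = "イラストレーション"      # 'illustration', 9 characters
known_prefix = "イラスト"        # 'illust', assumed to be in the dictionary

from_start = unknown_candidates(word, 0)
after_prefix = unknown_candidates(word, len(known_prefix))

# The full word is never proposed because it is one character too long,
# but the remainder after the known prefix fits within the limit.
full_word_proposed = word in from_start
remainder_proposed = word[len(known_prefix):] in after_prefix
```

Here `full_word_proposed` is false and `remainder_proposed` is true, which is the configuration that leads to レーション being extracted on its own.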
The second type is the fragmentation of numerals.
Since we did not use any tokenizers,
numerals tend to be divided arbitrarily. In the
second sentence in Figure 8, the system divided
"1676" into "16" and "76". In fact, it may output
"1" and "676", "16", "7" and "6", or any other split.
The third type is the concatenation of noun(s) 
and particle. In other words, the system some- 
times erroneously recognizes a noun phrase as a 
word. For example, the Japanese counterparts of 
"A of B", "A and B", and "A, B" are recognized 
as a word. This may be because the probability 
of one long unknown word can be higher than the 
product of the probabilities of two short unknown 
(or infrequent) words and one known word. The 
fourth sentence of Figure 8 is an example of this 
type of error. The system considered 可制御かつ可観測
('controllable and observable') as a word,
while it is divided into 可 ('able'), 制御 ('control'),
かつ ('and'), 可 ('able'), and 観測 ('observe') in the
corpus.
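The probability argument above can be made concrete with a toy comparison. The numbers below are invented for illustration only and are not taken from our models. The sketch also ignores context by multiplying independent word probabilities, whereas the segmenter uses a word bigram model.

```python
import math


def segmentation_log_prob(word_probs):
    """Log probability of a segmentation as a sum of word log probabilities."""
    return sum(math.log(p) for p in word_probs)


# Hypothetical probabilities (assumptions, not measured values):
# one long unknown word versus unknown word + known particle + unknown word.
as_one_word = segmentation_log_prob([1e-9])
as_three_words = segmentation_log_prob([1e-6, 1e-2, 1e-6])

prefers_one_word = as_one_word > as_three_words
```

With these values the single long hypothesis scores 1e-9 against 1e-14 for the three-word analysis, so the segmenter would prefer the concatenated noun phrase.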
As for reestimation, Table 3 shows no signif- 
icant improvements in the new word extraction 
accuracy. The only effect of reestimation, in our 
experiment, is to increase the expected word fre- 
quencies of the unknown word hypotheses whose 
expected word frequencies are greater than the 
pruning threshold of reestimation. 
This result does not necessarily mean that
reestimation is useless. This is because most unknown
words appeared only once in the test sentences.
An ideal example to confirm that reestimation
works well would have an unknown word
appearing more than twice in the test sentences,
where it is trivial to extract the word in one appearance
but difficult in the others because
of, for example, successive unknown words. If the
test set were larger, or the out-of-vocabulary rate
were higher, we believe that the effectiveness of
reestimation would be more clearly shown.
matched=196
sys-matched=103
std-matched=342
threshold=0.5
std=538, sys=299, matched=196
recall=36.4 (196/538), precision=65.6 (196/299)
Figure 7: Excerpts of correctly extracted new words (matched), incorrectly extracted word hypotheses
(sys-matched), and not extracted new words (std-matched).
Related Work 
Recent years have seen several works on corpus- 
based word segmentation and dictionary construc- 
tion for both Japanese and Chinese. For Chi- 
nese, (Sproat et al., 1994) used the word unigram 
model in their word segmenter based on a weighted
finite-state transducer. Word frequencies were estimated
by Viterbi reestimation (a reestimation
procedure using the best analysis) from an
unsegmented corpus of 20 million words. Initial es- 
timates of the word frequencies were derived from 
the frequencies in the corpus of the strings of hanzi 
making up each word in the lexicon whether or not 
each string is actually an instance of the word in 
question. 
(Chang et al., 1995) proposed an automatic 
dictionary construction method for Chinese from a 
large unsegmented corpus (311591 sentences) with 
the help of a small segmented seed corpus (1000 
sentences). They combined Viterbi reestimation 
using the word unigram model with a post filter 
called the "Two-Class Classifier", which is a lin- 
ear discrimination function to decide whether the 
string is actually a word or not based on features 
derived from the character N-gram in a large un- 
segmented corpus. The system's performance is 
compared with a word list derived from two on- 
line Chinese dictionaries (21141 words). The reported
recall and precision values were 56.88% and
77.37% for two-character words, and 6.12% and
85.97% for three-character words, respectively.
For Japanese, (Nagao and Mori, 1994) proposed
a method of computing arbitrary-length
character N-grams, and showed that the character
N-gram statistics obtained from a large corpus
include information useful for word extraction.
However, they did not report any evaluation
of their word extraction method.
(Teller and Batchelder, 1994) proposed a very 
naive probabilistic word segmentation method for 
Japanese, based on character type information 
and hiragana bigram frequencies. They claimed
98% word segmentation accuracy, while we claim
94.7%. However, their evaluation method is very
optimistic, and completely different from ours.
They count an error only when the system segmentation
violates morpheme boundaries. In other
words, they count an error only when the system
segmentation is not acceptable to human judgement,
while we count an error whenever the system
segmentation does not exactly match the corpus
segmentation, even if the corpus segmentation is inconsistent.
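The strictness of the exact-match criterion can be seen in a short sketch. It assumes that word-level recall and precision are computed by comparing character spans of the system words with those of the corpus words. The function names and the Latin-alphabet toy sentence are hypothetical.

```python
def word_spans(words):
    """Map a word sequence to the set of (start, end) character spans."""
    spans = set()
    pos = 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def segmentation_scores(system, reference):
    """Word-level (recall, precision) under exact span matching."""
    sys_spans = word_spans(system)
    ref_spans = word_spans(reference)
    matched = len(sys_spans & ref_spans)
    return matched / len(ref_spans), matched / len(sys_spans)


# The corpus treats "datacomm" as one word; the system splits it in two.
reference = ["datacomm", "is"]
system = ["data", "comm", "is"]
recall, precision = segmentation_scores(system, reference)
```

Although the system output respects every boundary in the reference and might be acceptable to a human judge, only one of its three words matches exactly, giving a recall of 1/2 and a precision of 1/3.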
We used the word bigram model for word 
segmentation, and expected word frequency for 
unknown word extraction. We compared the 
results with a segmented Japanese corpus, and 
reported 43.7% recall and 52.3% precision for 
1000 sentences whose out-of-vocabulary rate is 
2.1%. It is impossible to compare our results with
those of (Chang et al., 1995), because the experimental
conditions are completely different in terms of
language (Chinese vs. Japanese), the size of the seed
segmented corpus, the size of target unsegmented 
corpus and its out-of-vocabulary rate, the size of 
initial word list, and the type of reference data 
(on-line dictionary vs. segmented corpus). 
Our idea of filtering erroneous word hypotheses
by expected word frequency is simple and
straightforward. The major contribution of this 
paper is that we present a more accurate method 
for estimating word frequencies in an unsegmented 
corpus, even if it includes unknown words. This 
is achieved by introducing an explicit statistical 
model of unknown words, and by using an N- 
best word segmentation algorithm (Nagata, 1994) 
as an approximation of the generalized Forward- 
Backward algorithm. 
For English taggers, (Weischedel et al., 1993)
proposed a statistical model to estimate the word output
probability p(w_i|t_i) for an unknown word from
spelling information such as inflectional endings,
derivational endings, hyphenation, and capitalization.
Our word model can be thought of as a generalization
of their statistical model. One potential
benefit of our statistical model and segmentation 
algorithm is that they are completely independent 
of the target language and its writing system. We 
intend to test our word segmentation method on 
other languages, such as Chinese and Thai. 
Conclusion 
We present a new word extraction method for 
Japanese based on expected word frequency, which 
is computed by using a statistical language model 
and an N-best word segmentation algorithm. Although
we have encouraging initial results, there
are a number of questions to be answered, for example,
the minimum seed segmented corpus size
required, the minimum initial word list required,
and the effect of reestimation for a large unsegmented
corpus with various out-of-vocabulary rates. Besides
these questions, we are also thinking of assigning
parts of speech to the extracted new
words in order to construct a Japanese dictionary
automatically.

References 
[Baum, 1972] Leonard E. Baum. 1972. An Inequality
and Associated Maximization Technique
in Statistical Estimation for Probabilistic
Functions of Markov Processes. Inequalities, 3,
pages 1-8.
[Chang et al., 1995] Jing-Shin Chang, Yi-Chung
Lin, and Keh-Yih Su. 1995. Automatic Construction
of a Chinese Electronic Dictionary. In
Proceedings of VLC-95, pages 107-120.
[Church, 1988] Kenneth W. Church. 1988. A
Stochastic Parts Program and Noun Phrase
Parser for Unrestricted Text. In Proceedings of
ANLP-88, pages 136-143.
[Cutting et al., 1992] Doug Cutting, Julian Kupiec,
Jan Pedersen, and Penelope Sibun. 1992.
A Practical Part-of-Speech Tagger. In Proceedings
of ANLP-92, pages 133-140.
[EDR, 1995] Japan Electronic Dictionary Research
Institute. 1995. EDR Electronic Dictionary
Version 1 Technical Guide, EDR TR2-003.
Also available as The Structure of the EDR
Electronic Dictionary, http://www.iijnet.or.jp/edr/.
[Jelinek, 1985] Frederick Jelinek. 1985.
Self-organized Language Modeling for Speech
Recognition. IBM Report.
[Katz, 1987] Slava M. Katz. 1987. Estimation of
Probabilities from Sparse Data for the Language
Model Component of a Speech Recognizer.
IEEE Trans. ASSP-35, No.3, pp.400-401.
[Nagao and Mori, 1994] Makoto Nagao and Shinsuke
Mori. 1994. A New Method of N-gram
Statistics for Large Number of n and Automatic
Extraction of Words and Phrases from
Large Text Data of Japanese. In Proceedings of
COLING-94, pages 611-615.
[Nagata, 1994] Masaaki Nagata. 1994. A Stochastic
Japanese Morphological Analyzer Using a
Forward-DP Backward-A* N-Best Search Algorithm.
In Proceedings of COLING-94, pages
201-207.
[Nagata, 1996] Masaaki Nagata. 1996. Context-Based
Spelling Correction for Japanese OCR.
To appear in Proceedings of COLING-96.
[Soong and Huang, 1991] Frank K. Soong and
Eng-Fong Huang. 1991. A Tree-Trellis Based
Fast Search for Finding the N Best Sentence
Hypotheses in Continuous Speech Recognition.
In Proceedings of ICASSP-91, pages 705-708.
[Sproat et al., 1994] Richard Sproat, Chinlin
Shih, William Gale, and Nancy Chang. 1994. A
Stochastic Finite-State Word-Segmentation Algorithm
for Chinese. In Proceedings of ACL-94,
pages 66-73.
[Teller and Batchelder, 1994] Virginia Teller and
Eleanor Olds Batchelder. 1994. A Probabilistic
Algorithm for Segmenting Non-Kanji Japanese
Strings. In Proceedings of AAAI-94, pages 742-747.
[Weischedel et al., 1993] Ralph Weischedel, Marie
Meteer, Richard Schwartz, Lance Ramshaw,
and Jeff Palmucci. 1993. Coping with Ambiguity
and Unknown Words through Probabilistic
Models. Computational Linguistics, Vol.19,
No.2, pages 359-382.
