A Stochastic Japanese Morphological Analyzer Using a 
Forward-DP Backward-A* N-Best Search Algorithm 
Masaaki NAGATA
NTT Network Information Systems Laboratories
1-2356 Take, Yokosuka-Shi, Kanagawa, 238-03 Japan
(tel) +81-468-59-2796
(fax) +81-468-59-3428
(e-mail) nagata@nttnly.ntt.jp
Abstract 
We present a novel method for segmenting the input 
sentence into words and assigning parts of speech to 
the words. It consists of a statistical language model 
and an efficient two-pass N-best search algorithm. The
algorithm does not require delimiters between words, so
it is suitable for written Japanese. The proposed
Japanese morphological analyzer achieved 95.1% recall
and 94.6% precision for open text when it was trained
and tested on the ATR Corpus.
1 Introduction 
In recent years, we have seen a fair number of papers
reporting accuracies of more than 95% for English part of
speech tagging with statistical language modeling
techniques [2-4, 10, 11]. On the other hand, there are few
works on stochastic Japanese morphological analysis
[9, 12, 14], and they do not seem to have convinced the
Japanese NLP community that statistically-based
techniques are superior to conventional rule-based
techniques such as [16, 17].
We show in this paper that we can build a stochastic
Japanese morphological analyzer that offers approximately
95% accuracy from a statistical language modeling
technique and an efficient two-pass N-best search
strategy.
We used the simple tri-POS model as the tagging
model for Japanese. Probability estimates were
obtained after training on the ATR Dialogue Database
[5], whose word segmentation and part of speech tag
assignment were laboriously performed by hand.
We propose a novel search strategy for getting the
N best morphological analysis hypotheses for the input
sentence. It consists of a forward dynamic programming
search and a backward A* search. The proposed
algorithm amalgamates and extends three well-known
algorithms in different fields: the Minimum
Connective-Cost Method [7] for Japanese morphological
analysis, the Extended Viterbi Algorithm for character
recognition [6], and the Tree-Trellis N-Best Search for
speech recognition [15].
We also propose a novel method for handling unknown
words uniformly within the statistical approach.
Using character trigrams as the word model, it
generates the N-best word hypotheses that match the
leftmost substrings starting at a given position in the
input sentence.
Moreover, we propose a novel method for evaluating
the performance of morphological analyzers. Unlike
English, Japanese does not place spaces between
words. It is difficult, even for native Japanese speakers,
to place word boundaries consistently because of the
agglutinative nature of the language. Thus, there have
been no standard performance metrics. We applied the
bracketing accuracy measures [1], which were originally
used for English parsers, to Japanese morphological
analyzers. We also slightly extended the original
definition to describe the accuracy of the N-best
candidates.
In the following sections, we first describe the
techniques used in the proposed morphological analyzer;
we then explain the evaluation metrics and show the
system's performance through experimental results.
2 Tagging Model 
2.1 Tri-POS Model and Relative Frequency Training
We used the tri-POS (or triclass, tri-tag, tri-gram,
etc.) model as the tagging model for Japanese. Consider
a word segmentation of the input sentence $W = w_1 w_2 \ldots w_n$
and a sequence of tags $T = t_1 t_2 \ldots t_n$ of
the same length. The morphological analysis task can
be formally defined as finding the word segmentation
and parts of speech assignment that maximize
the joint probability of word sequence and tag sequence
$P(W,T)$. In the tri-POS model, the joint probability is
approximated by the product of parts of speech trigram
probabilities $P(t_i \mid t_{i-2}, t_{i-1})$ and word output
probabilities for a given part of speech $P(w_i \mid t_i)$:

$P(W,T) = \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1}) \, P(w_i \mid t_i)$   (1)
In practice, we treat sentence boundaries as a special
symbol as follows:

$P(W,T) = P(t_1 \mid \#) P(w_1 \mid t_1) P(t_2 \mid \#, t_1) P(w_2 \mid t_2) \prod_{i=3}^{n} P(t_i \mid t_{i-2}, t_{i-1}) P(w_i \mid t_i) \cdot P(\# \mid t_{n-1}, t_n)$   (2)
where "#" indicates the sentence boundary marker. If 
we have some tagged text available, we can estimate the 
probabilities P(tdti_2,ti_l ) and P(wiltl) by comput- 
ing the relative frequencies of the corresponding events 
on this data:. 
N(ti_2, ti-1, tl) P(tifti-2'ti-t) = f(qltl-2'ti-x) 
- iV(ti_..,,ti_,) 
(3) 
P(wilti) = f(wilt,) -- N(w,t) ('1) N(t) 
where f indicates the relative frequency, N(w, t) is t!,e 
number of times a given word w appears with tag l, aid 
N(li_2,ti-l,tl) is the number of times that sequer~ce 
l~i_2ti_ll i appears in the text. It is inevitable to s~ irer 
from sparse-data problem in the part of speech tag tri- 
gram probability I . To handle open text, trigram Frol:- 
ability is smoothed by interpolated estimation, wi~ich 
simply interpolates trlgram, bigram, unigram, and ze- 
rogram relative frequencies\[8\], 
P(tilq_,,q_l ) = qaf(tilt,_2,q_, ) 
A-q2f(tdti_l) + qtf(ti) + qoV (5) 
where f indicates the relative.frequency and V is a 
uniform probability that each tag will occur. The non- 
negative weights qi satisfy q3 + q~ + q1 + q0 = 1, and 
they are adjusted so as to make the observed data most 
probable after the adjustment by using EM algorithm ~-. 
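To make the training step concrete, the following is a minimal Python sketch (our own illustration, not code from the original system; all function names are hypothetical) of relative frequency training and the interpolated estimate of Equation (5). The weights q and the uniform probability v are assumed to be given; in the paper the weights are trained with the EM algorithm.

from collections import defaultdict

def ngram_counts(tag_sequences):
    # Count tag unigrams, bigrams, and trigrams over sentences that are
    # assumed to be already padded with the boundary marker '#'.
    uni, bi, tri = defaultdict(int), defaultdict(int), defaultdict(int)
    total = 0
    for tags in tag_sequences:
        for i, t in enumerate(tags):
            total += 1
            uni[t] += 1
            if i >= 1:
                bi[(tags[i-1], t)] += 1
            if i >= 2:
                tri[(tags[i-2], tags[i-1], t)] += 1
    return uni, bi, tri, total

def smoothed_trigram(t2, t1, t, counts, q, v):
    # Interpolated estimate of P(t | t2, t1), Equation (5).
    # q = (q3, q2, q1, q0) are the EM-trained weights; v is the uniform
    # "zerogram" probability, 1 / (number of tags).
    uni, bi, tri, total = counts
    f3 = tri[(t2, t1, t)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
    f2 = bi[(t1, t)] / uni[t1] if uni[t1] else 0.0
    f1 = uni[t] / total if total else 0.0
    q3, q2, q1, q0 = q
    return q3 * f3 + q2 * f2 + q1 * f1 + q0 * v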
2.2 Order Reduction and Recursive 
Tracing 
In order to understand the search algorithm described
in the next section, we will introduce the second order
HMM and the extended Viterbi algorithm [6]. Considering
the combined state sequence $U = u_1 u_2 \ldots u_n$, where
$u_1 = t_1$ and $u_i = t_{i-1} t_i$, we have

$P(u_i \mid u_{i-1}) = P(t_i \mid t_{i-2}, t_{i-1})$   (6)
¹We used 120 part of speech tags. In the ATR Corpus, 26
parts of speech, 13 conjugation types, and 7 conjugation forms
are defined. Out of the 26, 5 parts of speech have conjugation. Since
we used a list of part of speech, conjugation type, and conjugation
form as a tag, there are 119 tags in the ATR Corpus. We
added the sentence boundary marker to them.
²To handle open text, the word output probability $P(w_i \mid t_i)$ must
also be smoothed. This problem is discussed in a later section as
the unknown word problem.
Substituting Equation (6) into Equation (1), we have

$P(W,T) = \prod_{i=1}^{n} P(u_i \mid u_{i-1}) \, P(w_i \mid t_i)$   (7)

Equation (7) has the same form as the first order
model. Consider the partial word sequence $W_i = w_1 \ldots w_i$
and the partial tag sequence $T_i = t_1 \ldots t_i$;
we have

$P(W_i, T_i) = P(W_{i-1}, T_{i-1}) \, P(u_i \mid u_{i-1}) \, P(w_i \mid t_i)$   (8)
Equation (8) suggests that, to find the maximum
$P(W_i, T_i)$ for each $u_i$, we need only to: remember the
maximum $P(W_{i-1}, T_{i-1})$ for each $u_{i-1}$, extend each of these
probabilities to every $u_i$ by computing Equation (8), and
select the maximum $P(W_i, T_i)$ for each $u_i$. Thus, by
increasing $i$ by 1 up to $n$, selecting the $u_n$ that maximizes
$P(W_n, T_n)$, and backtracing the sequence leading to
the maximum probability, we can get the optimal tag
sequence.
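The following is a minimal Python sketch of this order reduction and recurrence (our own illustration, assuming a pre-segmented sentence so that only the tagging part is shown; trans and out are hypothetical tables holding $P(t_i \mid t_{i-2}, t_{i-1})$ and $P(w_i \mid t_i)$, and boundary handling is simplified to a double '#' pad). The full system of the next section searches over word segmentations as well.

import math

def extended_viterbi(sentence, trans, out):
    # sentence: list of (word, candidate_tags). The combined state is
    # the tag pair u_i = (t_{i-1}, t_i), reducing the second order
    # model to a first order one.
    delta = {('#', '#'): (0.0, None)}   # state -> (log score, backpointer)
    layers = []
    for word, tags in sentence:
        new_delta = {}
        for (t2, t1), (score, _) in delta.items():
            for t in tags:
                p = trans.get((t2, t1, t), 0.0) * out.get((word, t), 0.0)
                if p == 0.0:
                    continue
                cand = score + math.log(p)
                # Keep only the best partial path per combined state (Eq. 8).
                if (t1, t) not in new_delta or cand > new_delta[(t1, t)][0]:
                    new_delta[(t1, t)] = (cand, (t2, t1))
        layers.append(new_delta)
        delta = new_delta
    # Select the best final state and backtrace the optimal tag sequence.
    best = max(delta, key=lambda s: delta[s][0])
    tags, state = [], best
    for layer in reversed(layers):
        tags.append(state[1])
        state = layer[state][1]
    return list(reversed(tags)), delta[best][0]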
3 Search Strategy 
The search algorithm consists of a forward dynamic
programming search and a backward A* search. First,
a linear time dynamic programming pass is used to record
the scores of all partial paths in a table³. A backward
A* algorithm based tree search is then used to
extend the partial paths. Partial paths extended in the
backward tree search are ranked by their corresponding
full path scores, which are computed by adding
the backward partial path scores to the corresponding
best possible scores of the remaining paths,
which are prerecorded in the forward search. Since the
score of the incomplete portion of a path is exactly
known, the backward search is admissible. That is, the
top N candidates are exact.
3.1 The Forward DP Search 
Table 1 shows the two data structures used in our
algorithm. The structure parse stores the information
of a word and the best partial path up to the word.
Parse.start and parse.end are the indices of the
start and end positions of the word in the sentence.
Parse.pos is the part of speech tag, which is a list of
part of speech, conjugation type, and conjugation form
in our system for Japanese. Parse.nth-order-state
is a list of the last two parts of speech tags, including
that of the current word. This slot corresponds
to the combined state in the second order HMM.
Parse.prob-so-far is the score of the best partial
path from the beginning of the sentence to the word.
Parse.previous is the pointer to the (best) previous
parse structure as in conventional Viterbi decoding,
which is not necessary if we use the backward N best
search.
³In fact, we use two tables, parse-list and path-map. The
reason is described later.
The structure word represents the word information
in the dictionary, including its lexical form, part of
speech tag, and word output probability given the part
of speech.
Table 1: Data structures for the N best algorithm

parse structure
  start            the beginning position of the word
  end              the end position of the word
  pos              part of speech tag of the word
  nth-order-state  a list of the last two parts of speech
  prob-so-far      the best partial path score from the start
  previous         a pointer to the previous parse structure

word structure
  form  lexical form of the word
  pos   part of speech tag of the word
  prob  word output probability
Before explaining the forward search, we will define
some functions and tables used in the algorithm.
In the forward search, we use a table called
parse-list, whose key is the end position of the
parse structure, and whose value is a list of parse
structures that have the best partial path scores for
each combined state at that end position. Function
register-to-parse-list registers a parse structure
in the parse-list and maintains the best partial
parses. Function get-parse-list returns a list
of parse structures at the specified position. We also
use the function leftmost-substrings, which returns
a list of word structures in the dictionary whose lexical
form matches the substrings starting at the specified
position in the input sentence, as sketched below.
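A minimal Python sketch of such a prefix lookup follows (our own illustration, not the system's code); dictionary is assumed to map a lexical form to its (tag, output probability) entries. A trie over the lexical forms would avoid rescanning shared prefixes, but a plain loop shows the contract.

def leftmost_substrings(dictionary, sentence, i):
    # Return word structures whose lexical form matches a prefix of
    # sentence[i:] (0-based here; Figure 1 uses 1-based positions).
    matches = []
    for j in range(i + 1, len(sentence) + 1):
        form = sentence[i:j]
        for pos, prob in dictionary.get(form, []):
            matches.append({'form': form, 'pos': pos, 'prob': prob})
    return matches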
function forward-parse (string)
begin
  initial-step();  ; Pads special symbols at both ends.
  for i=1 to length(string) do
    foreach parse in get-parse-list(i) do
      foreach word in leftmost-substrings(string,i) do
        pos-ngram := append(parse.nth-order-state,
                            list(word.pos));
        if (transprob(pos-ngram) > 0) then
          new-parse := make-parse();
          new-parse.start := i;
          new-parse.end := i + length(word.form);
          new-parse.pos := word.pos;
          new-parse.nth-order-state := rest(pos-ngram);
          new-parse.prob-so-far := parse.prob-so-far
              * transprob(pos-ngram) * word.prob;
          new-parse.previous := parse;
          register-parse-to-parse-list(new-parse);
          register-parse-to-path-map(new-parse);
        endif
      end
    end
  end
  final-step();  ; Handles the transition to the end symbol.
end

Figure 1: The forward DP search algorithm
Figure 1 shows the central part of the forward dynamic
programming search algorithm. It starts from
the beginning of the input sentence, and proceeds
character by character. At each point in the sentence, it
looks up the combination of the best partial parses
ending at that point and the word hypotheses starting at
that point. If the connection of a partial parse and
a word hypothesis is allowed by the tagging model, a
new continuation parse is made and registered in the
parse-list. The partial path score for the new
continuation parse is the product of the best partial path
score up to the point, the trigram probability of the
last three parts of speech tags, and the word output
probability for the part of speech⁴.
3.2 The Backward A* Search 
The backward search uses a table called path-map,
whose key is the end position of the parse structure,
and whose value is a list of parse structures that have
the best partial path scores for each distinct combination
of the start position and the combined state. The
difference between parse-list and path-map is that
path-map is classified by the start position of the last
word in addition to the combined state.

This distinction is crucial for the proposed N best
algorithm. For the forward search to find a parse that
maximizes Equation (1), it is the parts of speech
sequence that matters. For the backward N-best search,
however, we want the N most likely word segmentations and
part of speech sequences. Parse-list may shadow less
probable candidates that have the same part of speech
sequence as the best scoring candidate, but differ in
the segmentation of the last word. As shown in Figure
1, path-map is made during the forward search by the
function register-parse-to-path-map, which registers
a parse structure in path-map and maintains the
best partial parses under the table's criteria, as
contrasted in the sketch below.
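The contrast between the two tables can be stated in a few lines of Python (a sketch under our own naming; parse structures are dicts and 'state' stands for the paper's nth-order-state slot, held as a tag tuple):

def register_to_parse_list(parse_list, parse):
    # Forward table: at each end position, keep only the best partial
    # parse per combined state (ordinary Viterbi pruning).
    key = parse['state']
    slot = parse_list.setdefault(parse['end'], {})
    if key not in slot or parse['prob_so_far'] > slot[key]['prob_so_far']:
        slot[key] = parse

def register_to_path_map(path_map, parse):
    # Backward table: additionally key on the start position of the
    # last word, so candidates that share a tag sequence but segment
    # the last word differently are not shadowed.
    key = (parse['start'], parse['state'])
    slot = path_map.setdefault(parse['end'], {})
    if key not in slot or parse['prob_so_far'] > slot[key]['prob_so_far']:
        slot[key] = parse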
Now we describe the central part of the backward
A* search algorithm. We assume that the readers
know the A* algorithm, and explain only the way we
applied the algorithm to the problem.

We consider a parse structure as a state in the A*
search. Two states are equal if their parse structures
have the same start position, end position, and combined
state. The backward search starts at the end of
the input sentence, and backtracks to the beginning of
the sentence using the path-map.
Initial states are obtained by looking up the entries
of the path-map at the sentence end position. The
successor states are obtained by first looking up the
entries of the path-map at the start position of the
current parse, then checking whether they satisfy the
constraint of the combined state transition in the second
order HMM, and whether the transition is allowed by
the tagging model. The combined state transition
constraint means that the part of speech sequence in the
parse.nth-order-state of the current parse, ignoring
the last element, equals that of the previous parse,
ignoring the first element.

⁴In Figure 1, the function transprob returns the probability of
a given trigram. Functions initial-step and final-step treat
the transitions at sentence boundaries.
The state transition cost of the backward search is
the product of the part of speech trigram probability
and the word output probability. The score estimate
of the remaining portion of a path is obtained from the
parse.prob-so-far slot in the parse structure.

The backward search generates the N best hypotheses
sequentially, and there is no need to preset N. The
complexity of the backward search is significantly less
than that of the forward search.
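A minimal log-domain sketch of the backward pass follows. It is our own reconstruction, not the original code: parse dicts are assumed to carry 'start', 'end', 'pos', 'state' (the nth-order-state tag pair, as a tuple), 'out' (log word output probability), and 'prob_so_far' (the forward score, in logs); trans maps a tag triple to its log trigram probability; and the sentence-final boundary transition is omitted for brevity.

import heapq
from itertools import count

def backward_astar(path_map, trans, sentence_len, n_best):
    # Heap entries: (-f, tiebreak, g, path). g is the exact score of
    # the words to the right of path[-1]; f adds path[-1]'s prerecorded
    # forward score, so candidates pop off in exact best-first order.
    heap, tie, results = [], count(), []
    for p in path_map.get(sentence_len, {}).values():
        heapq.heappush(heap, (-p['prob_so_far'], next(tie), 0.0, [p]))
    while heap and len(results) < n_best:
        neg_f, _, g, path = heapq.heappop(heap)
        p = path[-1]                      # leftmost parse on this path
        if p['start'] == 0:               # reached the sentence start:
            results.append((-neg_f, list(reversed(path))))
            continue                      # this is the next-best analysis
        for q in path_map.get(p['start'], {}).values():
            # Combined state transition constraint: the two states
            # must overlap in one tag.
            if q['state'][1] != p['state'][0]:
                continue
            arc = trans.get(q['state'] + (p['pos'],))
            if arc is None:
                continue
            # p's trigram and word output now become known exactly.
            g2 = g + arc + p['out']
            heapq.heappush(heap,
                           (-(g2 + q['prob_so_far']), next(tie), g2,
                            path + [q]))
    return results

Because prob-so-far is the true best score of the remaining left portion, the priority never underrates any achievable completion, which is the admissibility property the exactness claim above rests on.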
4 Word Model 
To handle open text, we have to cope with unknown
words. Since Japanese does not put spaces between
words, we have to identify unknown words first. To
do this, we can look at the spelling (character sequence)
that may constitute a word, or look at the context to
identify words that are acceptable in that context.

Once word hypotheses for unknown words are generated,
the proposed N-best algorithm will find the most
likely word segmentation and part of speech assignment,
taking into account the entire sentence. Therefore, we
can formalize the unknown word problem as determining
the span of an unknown word, assigning its part of
speech, and estimating its probability given its part of
speech.
Let us call a computational model that determines
the probability of any word hypothesis given its lexical
form and its part of speech the "word model". The
word model must account for morphology and word
formation to estimate the part of speech and the probability
of a word hypothesis. As a first approximation,
we used the character trigram of each part of speech as
the word model.

Let $C = c_1 c_2 \ldots c_n$ denote the sequence of $n$ characters
that constitute word $w$ whose part of speech is $t$.
We approximate the probability of the word given its part
of speech $P(w \mid t)$ by the trigram probabilities,

$P(w \mid t) = P_t(C) = P_t(c_1 \mid \#, \#) \, P_t(c_2 \mid \#, c_1) \prod_{i=3}^{n} P_t(c_i \mid c_{i-2}, c_{i-1}) \cdot P_t(\# \mid c_{n-1}, c_n)$   (9)
where the special symbol "#" indicates the word boundary
marker. Character trigram probabilities are estimated
from the training corpus by computing the relative
frequency of character bigrams and trigrams that appeared
in words tagged as $t$:

$P_t(c_i \mid c_{i-2}, c_{i-1}) = f_t(c_i \mid c_{i-2}, c_{i-1}) = \frac{N_t(c_{i-2}, c_{i-1}, c_i)}{N_t(c_{i-2}, c_{i-1})}$   (10)

where $N_t(c_{i-2}, c_{i-1}, c_i)$ is the total number of times the
character trigram $c_{i-2} c_{i-1} c_i$ appears in words tagged
as $t$ in the training corpus. Note that the character
trigram probabilities reflect the frequency of word tokens
in the training corpus. Since there are more than
3,000 characters in Japanese, the trigram probabilities are
smoothed by interpolated estimation to cope with the
sparse-data problem.
It would be ideal to build this character trigram model for
all open class categories. However, the amount of training
data is too small for low frequency categories if we
divide it by part of speech tags. Therefore, we made
trigram models only for the 4 most frequent parts of
speech that are open categories and have no conjugation:
common noun, proper noun, sahen noun⁵, and numeral.
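A sketch of Equation (9) in Python follows (our own illustration; char_tri[tag] is assumed to map a character triple to its smoothed log probability, and '#' is the boundary marker):

def word_logprob(word, tag, char_tri):
    # Character trigram word model, Equation (9), in log domain.
    chars = ['#', '#'] + list(word) + ['#']
    return sum(char_tri[tag][(chars[i-2], chars[i-1], chars[i])]
               for i in range(2, len(chars)))

def estimate_part_of_speech(word, char_tri, open_tags):
    # Rank the open-class categories for an unknown string, as in
    # Figure 2; open_tags would be common noun, proper noun, sahen
    # noun, and numeral in our setup.
    return sorted(((t, word_logprob(word, t, char_tri))
                   for t in open_tags),
                  key=lambda pair: pair[1], reverse=True)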
[REPL output: two calls to estimate-part-of-speech, one for the string "Miyako Hotel", which returns proper noun, common noun, sahen noun, and numeral in decreasing probability order, and one for the string "1994", which returns numeral, proper noun, common noun, and sahen noun.]
Figure 2: N-best Tags for Unknown Words 
Figure 2 shows two examples of part of speech estimation
for unknown words. Each trigram model returns
a probability if the input string is a word belonging to
its category. In both examples, the correct category
has the largest probability.
[REPL output: a call to get-leftmost-substrings-with-word-model, returning the ten best word hypotheses at a given position, each a word boundary and part of speech tag paired with its word model probability.]

Figure 3: N-Best Word Hypotheses
Figure 3 shows the N-best word hypotheses generated
by using the character trigram models. A word
hypothesis is a list of word boundary, part of speech
assignment, and word probability that matches the
leftmost substrings starting at a given position in the input
sentence. In the forward search, to handle unknown
words, word hypotheses are generated at every position,
in addition to the ones generated by the function
leftmost-substrings, which are the words found in
the dictionary. However, in our system, we limited the
number of word hypotheses generated at each position
to 10, for efficiency reasons.
⁵A noun that can be used as a verb when it is followed by the
formal verb "suru".
5 Evaluation Measures 
We applied the performance measures for English
parsers [1] to Japanese morphological analyzers. The
basic idea is that the morphological analysis of a sentence
can be thought of as a set of labeled brackets, where a
bracket corresponds to a word segmentation and its
label corresponds to a part of speech. We then compare
the brackets contained in the system's output to the
brackets contained in the standard analysis. For the
N-best candidates, we take the union of the brackets
contained in each candidate, and compare them to
the brackets in the standard.

For comparison, we count the number of brackets
in the standard data (Std), the number of brackets in
the system output (Sys), and the number of matching
brackets (M). We then calculate the measures of
recall (= M/Std) and precision (= M/Sys). We also
count the number of crossings, which is the number of
cases where a bracketed sequence from the standard
data overlaps a bracketed sequence from the system
output, but neither sequence is completely contained
in the other.
We defined two equality criteria on brackets for
counting the number of matching brackets. Two brackets
are unlabeled-bracket-equal if the boundaries of the
two brackets are the same. Two brackets are
labeled-bracket-equal if the labels of the brackets are the same
in addition to being unlabeled-bracket-equal. In comparing
the consistency of the word segmentations of two
bracketings, which we call structure consistency, we count
the measures (recall, precision, crossings) by
unlabeled-bracket-equal. In comparing the consistency of part
of speech assignment in addition to word segmentation,
which we call label consistency, we count them by
labeled-bracket-equal, as implemented in the sketch below.
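The measures reduce to set operations on (start, end, label) triples; the following Python sketch (with hypothetical names) computes them, and the N-best case is handled by passing the union of the candidates' brackets as the system set:

def bracket_measures(standard, system, labeled=True):
    # standard, system: sets of (start, end, tag) brackets with
    # half-open [start, end) spans.
    if not labeled:                 # structure consistency drops labels
        standard = {(s, e) for s, e, _ in standard}
        system = {(s, e) for s, e, _ in system}
    m = len(standard & system)      # matching brackets (M)
    recall, precision = m / len(standard), m / len(system)
    # A crossing: overlap without containment in either direction.
    crossings = sum(1 for b1 in system for b2 in standard
                    if b1[0] < b2[0] < b1[1] < b2[1]
                    or b2[0] < b1[0] < b2[1] < b1[1])
    return recall, precision, crossings

Applied to the worked example discussed below, this reproduces the 8/9 recall and 8/11 precision of the second candidate.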
[Three candidate analyses of one sentence, each a sequence of word/tag brackets, with scores -31.91, -38.59, and -43.10.]

Figure 4: N-Best Morphological Analysis Hypotheses
For example, Figure 4 shows a sample of N-best analysis
hypotheses, where the first candidate is the correct
analysis⁶. For the second candidate, since there are 9
brackets in the correct data (Std=9), 11 brackets in the
second candidate (Sys=11), and 8 matching brackets
(M=8), the recall and precision with respect to label
consistency are 8/9 and 8/11, respectively. For the top
two candidates, since there are 12 distinct brackets in
the system's output and 9 matching brackets, the recall
and precision with respect to label consistency are
9/9 and 9/12, respectively. For the third candidate,
since the correct data and the third candidate differ
in just one part of speech tag, the recall and precision
with respect to structure consistency are 9/9 and 9/9,
respectively.

⁶Probabilities are in natural log, base e.
6 Experiment 
Table 2: The amount of training and test data

             training texts   closed test   open test
Sentences          …             1,000        1,000
Words           149,059         13,899       13,176
Characters      267,122            …            …
We used the ATR Dialogue Database [5] to train and
test the proposed morphological analysis method. It is
a corpus of approximately 800,000 words whose word
segmentation and part of speech tag assignment were
laboriously performed by hand. In this experiment, we
only used one fourth of the ATR Corpus, a portion of
the keyboard dialogues in the conference registration
domain. First, we selected 1,000 test sentences for an
open test, and used the others for training. The corpus
was divided into 90% for training and 10% for testing.
We then selected 1,000 sentences from the training
set and used them for a closed test. The number of
sentences, words, and characters for each test set and the
training texts are shown in Table 2.
The training texts contained 6,580 word types and
6,945 tag trigram types. There were 247 unknown
word types and 213 unknown tag trigram types in the
open test sentences. Thus, both the part of speech
trigram probabilities and the word output probabilities must
be smoothed to handle open texts.
Table 3: Percentage of words correctly segmented and
tagged: raw part of speech bigram and trigram

[Table body: recall, precision, and crossings for the top 1 through 5 candidates under both models; see the summary in the text below.]
First, as a preliminary experiment, we compared the
performances of the part of speech bigram and trigram models.
Table 3 shows the percentages of words correctly
segmented and tagged, tested on the closed test sentences.
The trigram model achieved 97.5% recall and 97.8%
precision for the top candidate, while the bigram model
achieved 96.2% recall and 96.6% precision. Although
both tagging models show very high performance, the
trigram model outperformed the bigram model in every
metric.
We then tested the proposed system, which uses a
smoothed part of speech trigram with the word model, on
the open test sentences. Table 4 shows the percentages
of words correctly segmented and tagged. In Table 4,
label consistency 2 represents the accuracy of segmentation
and tagging ignoring differences in conjugation
form.
For open texts, the morphological analyzer achieved
95.1% recall and 94.6% precision for the top candidate,
and 97.8% recall and 73.2% precision for the 5 best
candidates. This performance is very encouraging, and
is comparable to the state-of-the-art stochastic taggers
for English [2-4, 10, 11].
Since the segmentation accuracy of the proposed system
is relatively high (97.7% recall and 97.2% precision
for the top candidate) compared to the morphological
analysis accuracy, it is likely that we can improve
the part of speech assignment accuracy by refining the
statistically-based tagging model. We find that a fair
number of tagging errors happened in conjugation forms.
We assume that this is caused by the fact that the
Japanese tag set used in the ATR Corpus is not detailed
enough to capture the complicated Japanese verb
morphology.
[Plot: morphological analysis accuracy for N-best sentences, accuracy (60-100%) against rank (1 to 5), for six configurations: raw trigram (closed test), raw bigram (closed test), smoothed trigram (open text), smoothed trigram with word model (open text), raw trigram with word model (open text), and raw trigram without word model (open text).]
Figure 5: The percentage of sentences correctly segmented
and tagged.
Figure 5 shows the percentage of sentences (not
words) correctly segmented and tagged. For open texts,
the sentence accuracy of the raw part of speech trigram
without the word model is 62.7% for the top candidate and
70.4% for the top 5, while that of the smoothed trigram
with the word model is 66.9% for the top candidate and 80.3%
for the top 5. We can see that, by smoothing the part of speech
trigram and by adding the word model to handle unknown
words, the accuracy and robustness of the morphological
analyzer are significantly improved. However, the
sentence accuracy for closed texts is still significantly
better than that for open texts. It is clear that more
research has to be done on the smoothing problem.
7 Discussion 
Morphological analysis is an important practical problem
with potential application in many areas including
kana-to-kanji conversion⁷, speech recognition, character
recognition, speech synthesis, text revision support,
information retrieval, and machine translation.
Most conventional Japanese morphological analyzers
use rule-based heuristic searches. They usually use a
connectivity matrix (part-of-speech-pair grammar) as
the language model. To rank the morphological analysis
hypotheses, they usually use heuristics such as the
Longest Match Method or the Least Bunsetsu's Number
Method [16].
There are some statistically-based approaches to
Japanese morphological analysis. The tagging models
previously used are either part of speech bigram [9, 14]
or a character-based HMM [12].
Both heuristic-based and statistically-based
approaches use the Minimum Connective-Cost Method
[7], which is a linear time dynamic programming algorithm
that finds the morphological hypothesis that has
the minimal connective cost (i.e., bigram-based cost)
as derived by certain criteria.
To handle unknown words, most Japanese morphological
analyzers use the character type heuristic [17],
namely that "a string of the same character type is likely to
constitute a word". There is one stochastic approach
that uses a bigram of word formation units [13]. However,
it does not learn probabilities from training texts, but
learns them from machine readable dictionaries, and
the model is not incorporated in a working morphological
analyzer, as far as the author knows.
The unique features of the proposed Japanese
morphological analyzer are that it can find the exact N most
likely hypotheses using part of speech trigrams, and that it
can handle unknown words using character trigrams.
The algorithm can naturally be extended to handle any
higher order Markov model. Moreover, it can naturally
be extended to handle lattice-style input, which
is often used as the output of speech recognition and
character recognition systems, by extending the function
leftmost-substrings so as to return a list of
words in the dictionary that match the substrings in
the input lattice starting at the specified position, as
sketched below.
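As a sketch of that extension (our own, with a hypothetical lattice encoding), let lattice[i] list the (symbol, next-node) arcs leaving node i; the function then enumerates symbol paths from position i and keeps those that form dictionary words:

MAX_WORD_LEN = 8   # an assumed bound to keep the enumeration finite

def leftmost_substrings_lattice(dictionary, lattice, i):
    # Depth-first enumeration of symbol paths from node i; every path
    # whose concatenated form is in the dictionary yields a word
    # hypothesis spanning nodes i..j of the input lattice.
    results, stack = [], [(i, '')]
    while stack:
        node, prefix = stack.pop()
        for sym, nxt in lattice.get(node, []):
            form = prefix + sym
            for pos, prob in dictionary.get(form, []):
                results.append({'form': form, 'pos': pos,
                                'start': i, 'end': nxt, 'prob': prob})
            if len(form) < MAX_WORD_LEN:
                stack.append((nxt, form))
    return results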
For future work, we have to study the most effective
way of generating word hypotheses that can handle
unknown words. Currently, we are limiting the number of
word hypotheses to reduce ambiguity at the cost of
accuracy. We also have to study the word model for open
categories that have conjugation, because the training
data gets too small to make trigrams if we divide it by
tags. We will probably have to tie some parameters to
solve the insufficient data problem.
⁷Kana-to-kanji conversion is a popular Japanese input
method on computers using an ASCII keyboard. Phonetic
transcriptions in Roman (ASCII) characters are input and converted first
to the Japanese syllabary hiragana, which is then converted to
the orthographic transcription including Chinese characters (kanji).
Table 4: The percentage of words correctly segmented and tagged: smoothed trigram with word model (open text)

     label consistency           label consistency 2         structure consistency
N    recall  precision  cross.   recall  precision  cross.   recall  precision  cross.
1    95.1%   94.6%      0.013    95.9%   95.4%      0.013    97.7%   97.2%      0.013
2    96.5%   88.0%      0.023    97.0%   90.3%      0.023    98.2%   94.4%      0.022
3    97.3%   82.1%      0.031    97.6%   85.1%      0.031    98.5%   91.7%      0.029
4    97.6%   77.4%      0.046    97.9%   80.7%      0.046    98.7%   89.6%      0.044
5    97.8%   73.2%      0.061    98.1%   77.1%      0.060    98.8%   87.9%      0.056
Moreover, we have to study methods to adapt the
system to a new domain. Developing an unsupervised
learning method, like the forward-backward algorithm
for HMMs, is an urgent goal, since we cannot always
expect the availability of manually segmented and tagged
data. We can think of an EM algorithm obtained by replacing
maximization with summation in the extended Viterbi
algorithm, but we do not know how to handle unknown
words in this algorithm.
8 Conclusion 
We have developed a stochastic Japanese morphological
analyzer. It uses a statistical tagging model and an
efficient two-pass search algorithm to find the N best
morphological analysis hypotheses for the input sentence.
Its word segmentation and tagging accuracy is
approximately 95%, which is comparable to the
state-of-the-art stochastic taggers for English.
References

[1] Black, E. et al.: "A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars", DARPA Speech and Natural Language Workshop, pp.306-311, Morgan Kaufmann, 1991.
[2] Charniak, E., Hendrickson, C., Jacobson, N., and Perkowitz, M.: "Equations for Part-of-Speech Tagging", AAAI-93, pp.784-789, 1993.
[3] Church, K.: "A Stochastic Part of Speech Tagger and Noun Phrase Parser for English", ANLP-88, pp.136-143, 1988.
[4] Cutting, D., Kupiec, J., Pedersen, J., and Sibun, P.: "A Practical Part-of-Speech Tagger", ANLP-92, pp.133-140, 1992.
[5] Ehara, T., Ogura, K., and Morimoto, T.: "ATR Dialogue Database", ICSLP-90, pp.1093-1096, 1990.
[6] He, Y.: "Extended Viterbi Algorithm for Second Order Hidden Markov Process", ICPR-88, pp.718-720, 1988.
[7] Hisamitsu, T. and Nitta, Y.: "Morphological Analysis by Minimum Connective-Cost Method", Technical Report SIGNLC 90-8, IEICE, pp.17-24, 1990 (in Japanese).
[8] Jelinek, F.: "Self-organized language modeling for speech recognition", IBM Report, 1985 (Reprinted in Readings in Speech Recognition, pp.450-506).
[9] Matsunobu, E., Hitaka, T., and Yoshida, S.: "Syntactic Analysis by Stochastic BUNSETSU Grammar", Technical Report SIGNL 56-3, IPSJ, 1986 (in Japanese).
[10] Merialdo, B.: "Tagging Text with a Probabilistic Model", ICASSP-91, pp.809-812, 1991.
[11] Meteer, M. W., Schwartz, R., and Weischedel, R.: "POST: Using Probabilities in Language Processing", IJCAI-91, pp.960-965, 1991.
[12] Murakami, J. and Sagayama, S.: "Hidden Markov Model applied to Morphological Analysis", 45th National Meeting of the IPSJ, Vol.3, pp.161-162, 1992 (in Japanese).
[13] Nagai, H. and Hitaka, T.: "Japanese Word Formation Model and Its Evaluation", Trans. IPSJ, Vol.34, No.9, pp.1944-1955, 1993 (in Japanese).
[14] Sakai, S.: "Morphological Category Bigram: A Single Language Model for both Spoken Language and Text", ISSD-93, pp.87-90, 1993.
[15] Soong, F. K. and Huang, E.: "A Tree-Trellis Based Fast Search for Finding the N Best Sentence Hypotheses in Continuous Speech Recognition", ICASSP-91, pp.705-708, 1991.
[16] Yoshimura, K., Hitaka, T., and Yoshida, S.: "Morphological Analysis of Non-marked-off Japanese Sentences by the Least BUNSETSU's Number Method", Trans. IPSJ, Vol.24, No.1, pp.40-46, 1983 (in Japanese).
[17] Yoshimura, K., Takeuchi, M., Tsuda, K., and Shudo, K.: "Morphological Analysis of Japanese Sentences Containing Unknown Words", Trans. IPSJ, Vol.30, No.3, pp.294-301, 1989 (in Japanese).
