Backtracking-Free Dictionary Access Method for 
Japanese Morphological Analysis 
Hiroshi Maruyama 
IBM Research, Tokyo Research Laboratory 
maruyarna@trl.ibm.co.jp 
1 Introduction Input sentence: ~-\[-Ill:~f¢:,~ e) I-3f- ~: j~. --~/:. o 
JMA output: 
Since the Japanese language does not have 
explicit word boundaries, dictionary lookup 
should be done, in principle, for all possible sub- 
strings in an input sentence. Thus, Japanese 
morphological analysis involves a large number 
of dictionary accesses. 
The standard technique for handling this 
problem is to use the TRIE structure to find all 
the words that begin at a given position in a sen- 
tence (Morimoto and Aoe 1993). This process is 
executed for every character position in the sen- 
tence; that is, after looking up all the words be- 
ginning at position n, the program looks up all 
the words beginning at position n + 1, and so 
on. Therefore, some characters may be scanned 
more than once for different starting positions. 
This paper describes an attempt to minimize 
this 'backtracking' by using an idea similar to one 
proposed by Aho and Corasick (Aho 1990) for 
multiple-keyword string matching. When used 
with a 70,491-word dictionary that we developed 
for Japanese morphological analysis, our method 
reduced the number of dictionary accesses by 
25%. 
The next section briefly describes the prob- 
lem and our basic idea for handling it. The de- 
tailed algorithm is given in Section 3 and Sec- 
tion 4, followed by the results of an experiment 
in Section 5. 
22.-I-ft1:*i1:19 I~:19 ©:76 
-I-~I-: :19 ~:. :78 
ix:9 0:29 \]'2:63 o :100 
Fig. 1: Sample input/output of JMA 
2 Japanese 
Analysis 
Morphological 
2.1 Grammar 
A Japanese morphological analyzer (here- 
after called the JMA) takes an input sentence 
and segments it into words and phrases, attach- 
ing a part-of-speech code to each word at the 
same time. Figure 1 shows a sample input and 
the output of our JMA. 
Tile grammaticality of a sequence of 
Japanese words is mainly determined by look- 
ing at two consecutive words at a time (that 
is, hy looking at two-word windows). There- 
fore, Japanese morphological analysis is nor- 
really done by using a Regular Grammar (e.g., 
Maruyama and Ogino 1994). Our JMA gram- 
mar rules have the following general form: 
state1 ~ "word" \[linguistic-features\] 
state2 cost=cost. 
Each grammar rule has a heuristic cost, and 
tile parse with the minimum cost will be selected 
as the most plausible morphological reading of 
the input sentence. A part of our actual gram- 
208 
mar is shown in Figure 2. Currently our gram- 
mar has about 4,300 rules and 400 nonterminal 
symbols. 
2.2 Dictionary Lookup 
While the flmction words (particles, auxiliary 
verbs, and so on, totaling several hundred) are 
encoded in the grammar rules, the content words 
(nouns, verbs, and so on) are stored in st sepa- 
rate dictionary. Since content words may appear 
at any position in tile input sentence, dictionary 
access is tried from all the positions n. 
For example, in the sentence fragment: ill Fig- 
ure 3, 
"7..~ ~{'/. (large)" and "J~ ~I'J. ~i\[ ~i ~ 
(mainframe)" 
are the results of dictionary access at, posit.ion 
1. For simplicity, we assume that the dictionary 
contains only the following words: 
"i~ ~ (large)," 
";k2 ~lJ. ~\['.~ ~: (mainframe)," 
"a\]'~ (computer)", 
"~Ig ~ f~ (eomput.ing facility)," 
and 
",~R~ (facility)." 
2.3 Using TRIE 
The most common method for dictionary 
lookup is to use an index structure called TRIE 
(see Figure 4). The dictionary lookup begins 
with the root node. The hatched nodes represent. 
the terminal no(tes that correspond to dictionary 
entries. At position 1 in tile sentence ab(we, I, wo 
words, "Jq~.eJ. (large)" and " )k:)l'-I.}i\].~:~.~ (main- 
frame)," are found. 
Then, the starting position is advanced 1;o the 
second character in the text; and the dictionary 
lookup is tried again. In this case, no word is 
found, because there are no words thai, begins 
~Actual' dictionaries' '£1so co,train i')'C (big)," " ,b~,! 
(type)," "~'1" (measure)," "i{l'~: (compute)," "~'~: (cwdm.)," 
"~ (m,.:hi.~.)," "~b~ (~.~t,,bU.~h)," ,.,,i "~;;i (p,.,,~ ...... )." 
with "~{'.1." in the dictionary. The start, lug posi- 
tion is l, hen set I,o 3 and t.rled again, and this 
,.i,,~ th,. words "al~:~ (,:,lnlp,,Ce,.y and "i}t~'; 
k)~;{'~{i\[i (comput.ing facilit,y)" are obtained. 'e 
The problem here ix (,hal;, even though we 
know that, 1,here is "TQ){l!}i\[.~\])~ (ma.in\[rarne)" 
al, posit;ion I, we look up "}}\[~{:t~ (computer)" 
again. Since "iil~:~.~: (computer)" is a snhstring 
of "9'4~{~iI'~;)1~ (n-lainframe)," we know that, t;he 
word "~,i\]~,i~ (compul:er)" exists at, posit;ion 3 as 
soon as we lind "X~{~}~\[~3,,i~ (lnainframe)" at i)o-. 
sit;ion I. Therefore, going back l;o 1;he root node 
at position 3 and trying mat;citing all over again 
means duplicatAng our efforts unnecessarily. 
2.4 Eliminating Backtracking 
Our idea is to use t, he b)dex stsuct,m'e de 
w~loped by Abe and Corasick to find muli;iple 
sl,rings in a text. Figure 5 shows l;he TRII!; 
with a point.er called t;he fail pointer associated 
with the node corresponding to l;he word "7)k/~I T/ 
~'\[~2~: (mail fxa.nm) ' (the rightmost, word in Lhe 
first row). When a match st;re'Ling al, position 
n reaches I, his node, it is gnaranl,eetl that tile 
sl.ring "~,ilJ)i~,~" exists starting at position n -t-2. 
Therefore, if the next character in the input sen- 
tence does not mat,oh any of the child nodes, we 
do not go b~ck to the root but go back to the 
node corresponding 1,o this substring by follow- 
ing t, he fail pointer, and resume matching from 
this node. For the input sentence in l,'igure 3, 
l.he dict, ionary access proceeds as indica.ted by 
the dot, t;ed line in I.he Figure 5, 13n(ling the words 
")<~{t! (la.rge)," "g<~{t\[}\]\[#:~.~ (mair,\[','ame)," "\]i\[~'~: 
'~ (COlIIplll;cT)," and so on. Thus, the nmnt)er of 
dictionary node ac(:esses is greatly reduced. 
Ill many Japanese tnorphok)gical analysis 
systems, the dictionary ix held in the secondary 
storage, a.nd t, herefore the number of dictionary 
~Wh,, r,,:~, ch,~t "X~{'~iil~A:~.~ (,,**i,,~'~,-,,,0" w,~.~ re,,.,1 
heft)re does no(. neees;sarily mean f,\[ud, there is no need 
to l~mk up "~{I'~:)~%.(computr.r ...)," because at this point 
twa interpretat.ions, "mainframe facilit.y" and "large com- 
puting facilit.y," are possible. 
209 
J\[~i~/~ -> EAGYOU-SDAN \[pos=l,kow=doushi\] ~f 5 f~; 
YJ~ 5 \[~ -> °'~'" \[pos=26,kow=v_infl\] ~,\[iJ 5 ~k,~ 4 ~x,~ cost=300; 
~\] 5 ~-~,~)- 4~(ff6 -> "J'" \[pos=64,kow=jodoushi,fe={negative}\] 
"~" \[pos=78, kow=set suzoku_j oshi\] 
~ItJj~ ~'f cost=500; 
~l~J~J~,~Y/ -> J\[~)~ cost=999; 
~1 -> "" \[pos=48,kow=jodoushi\] J~J~-~Y~,#~ cost=S00; 
.... ~ ~)~J~ cost=300; 
~)j~y~t~ _> "fZ o" \[pos=45,kow=aux_infl\] '"~J\[J~ cost=a00; 
Fig. 2: Some of the grammar rules 
t ~ 3 fl" g- 6 ' ? 
4large -~ = computer- " facility 
4 mainframe 
* computing laiSility 4~ 
Pig. 3: Dictionary lookup for Japanese morphological analysis 
~F,j ~m ,11" D: ~ ~_m -~ ~-F--1 - t__.l 
Fig. 4: TRIE index 
210 
-I I 
~¢" G% t.-: 
Fig. 5: 'I'I{IE structure with \]hil pointers 
node accesses dominates the performance of the 
overall system. 
3 Constructing TRIE with fail 
pointers 
A TI{IF, index with fail pointers is created in 
the following two steps: 
1. Create a TI{IE iudex, and 
2.5 Other Cosiderations 2. Calculate a fail pointer of each node in the 
TRIE. 
Theoretically there is a I)ossiMlity of 1)rm,ing 
dictionary lookup by using the state set at. posi- 
tion n. For example, if no noun can follow rely 
of the states in the current state set, there is no 
need to look up nouns. One way to do this prun- 
ing is to associate with each node a bit vector 
representing the set of all parts of speech of some 
word beyond this node. \[f the intersection of the 
expected set of parts of speeche an(t the possi. 
bilities beyond this node is empty, the expansion 
of this no(te can be pruned. In general, however, 
almost every character position t)redicts most of 
the parts of speech. Thus, it is common practice 
in Japanese morphok)gical analysis to h)ok up 
every possible prefix at every character position. 
Hidaka et al. (1984) used a modified l{-tree 
instead of a simple TRIE. Altough a B-tree has 
much less nodes than a TRIE and thus the num- 
ber of secondary storage accesses can be signif- 
icantly reduced, it still backtracks to the next 
character position and duplicate matching is in- 
evitable. 
Since Step 1 is well known, we will describe only 
Step 2 here. 
\]"or each node n, Step 9 given the 
value fail(n). In the following algorittlm, 
for'ward('n, c) denotes the chikl node of the node 
'n whose associated character is c. If there is no 
such node, we define forward(n, e) = hi\]. Root 
is the root no(le of the T1HF,. 
"2-1 j'ail(l~oot) ~- leooe 
2-2 for each node ft. of depth 1, fail(n) ~ lSmt 
2-a re,. e~,:l~ depth d-- 1,2, ..., 
2-3-1 for each node. n with depLh d, 
2-3- I-I for each child node rn of n (where 
m = forward(n, c:)), 
fail(m) +-- f(fail(n), c). 
l\[ere, \]'(n, c) is defined as follows: 
fail(',,.) if forward(n, c) 5L nil 
f('n, c) = f(fail(n),c) if forward(n, c:) = nil 
& n ~ Root 
t~oot otherwise 
211 
If tile node corresponds to the end of some 
word, we record the length l of the word in the 
node. For example, at the node that corresponds 
to the end of the word "~:~t'~:~ (mainframe)", 
I = 5 and l = 3 are recorded because it is the end 
of both of the words "~\].~:~ (mainframe, 
l = 5)" and "~l'-~-~ (computer, l = 3)." 3 
Figure 6 shows the complete TRIE with tile 
fail pointers. 
traditional TRIE and was 27% faster in CPU 
time. The CPU time was measured with all the 
nodes in the main memory. 
For the computer manuals, the reduction rate 
was a little larger. This is attributable to the fact 
that computer manuals tend to contain longer, 
more technical terms than newspaper artMes. 
Our method is more effective if there are a large 
number of long words in a text. 
4 Dictionary access 
The algorithm for consulting the dictionary 
is quite simple: 
1 n +-- Root 
2 for each character position i = 11,2, ...k, 
2-1 while n 7~ Root and forward(n, ci) = 
nil do n ~-- fail(n) 
2-2 n = forward(n,cl) 
2-3 ifn is the end of some word(s), output 
them 
where ci is the character at position i. 
5 Experimental results 
We applied the TRIE with fail pointers to 
our 70,491-word dictionary for Japanese mor- 
phological analysis (in which the average word 
length is 2.8 characters) and compared it with 
a conventional TRIE-based system, using two 
sets of data: newspapers articles (44,113 char- 
acters) and computer manuals (235,104 charac- 
ters). The results are shown in Table 1. 
The tables show both the number of node ac- 
cesses and the actual CPU time. For the news- 
paper articles, our method (marked as TRIE w/ 
FP) had 25% fewer node accesses than than the 
SThis information is redundant, because one can look 
up every possible word by following the fail pointer, llow- 
ever, if the nodes are in secondary storage, it is wort, h hav- 
ing the length information within the node to minimize 
the disk access. 
6 Conclusion 
We have proposed a new method of dictio- 
nary lookup for Japanese morphological anal- 
ysis. The idea is quite simple and easy to 
implement with a very small amount of over- 
head (a fail pointer and an array of length l 
to each node). For large l, ermiriology dictionar- 
ies (medical, chemical, and so on), this method 
will greatly reduce tile overhead related to dic- 
tionary access, which dominates the efllciency 
of practical Japanese morphological analyzers. 
Fast Japanese morphological analyzers will be 
crucial to the success of statistically-based lan- 
guage analysis in the near fllture (Maruyama et 
al. 1993). 
References 
1. Atlo, A. V., 1990: "Algorithms for Finding 
Patterns in Strings," in Leeuwen, J.V. ed., 
Handbook of Theoretical Computer Sci- 
ence, Volume A - Algorithms and Com- 
plexity, pp. 273-278, Elsevier. 
2. llidaka, T., Yoshida, S., and Inanaga, II., 
1984: "Extended B-Tree and Its Applica- 
tions to Japanese Word Dictionaries," (In 
Japanese) Trans. of IE\[CE, Vol. 367-D, 
No. 4:. 
3. IIisamitsu, T. and Nitro, Y., \]991: "A 
Uniform 'h'eatment of Ileuristic Meth- 
ods for Morphological Analysis of Written 
272 
l,'ig. 6: TI{II,~ index with fail pointers 
'I~RI\],; TRII!' w/ 1,'1 ) l{educt~ion rat;e 
Node accesses 104,118 78,706 25% 
CPU time (set.) 64.77 40.92 27% 
(a) 44,11t3 chara(%ers in newsi)aper articles 
TRIE 'l~Rllg w/ F1 ) Reduction rate 
Node accesses 542,542 883,176 30% 
CPU time (see). 372.47 228.63 28% 
(b) 235,104 (;haract~ers in computx~r mamlals 
Table 1: lgxi)erimengal results 
Japanese," Prec. of 2nd Japan-Australia 
Joint Workshop on NLP. 
4. Maruyama, II., Ogino, S., and I\[idano, M., 
1993: "The Mega-Word Tagged-Corpus 
Project," Prec. of 5lh International Con- 
ference on Theoretical and Methodological 
Issues in Machine Translation (TM1-93), 
Kyoto, Japan. 
5. Maruyama, H. and Ogino, S., 1994: 
"Japanese Morphological Analysis Based 
on Regular Grammar," (In Japanese), 
Transactions of IPSJ, to appear. 
6. Morimoto, K. and Aoe, J., 1993: "Two 
Trie Structures far Natural Language \[)ic- 
tionaries," Prec. of Natural Language Pro- 
cessing Pacific Rim Symposium (NLPR,9 
'93), Fukuoka, Japan. 
213 
