A Portable & Quick Japanese Parser : QJP 
Masayuki KAMEDA 
Information and Communication R &: D Center, RICOH COMPANY, LTD. 
2-3, Shin-Yokohama 3-chome, Kohoku-ku, Yokohama, 222, Japan 
kameda(~ic.rdc.ricoh.co.jp 
Abstract 
QJP is a portable and quick softwaxe module for 
Japanese processing. QJP analyzes a Japanese sen- 
tence into segmented morphemes/words with tags 
and a syntactic bunsetsu kakari-uke structure based 
on the two strategies, a) Morphological analysis 
based on character-types and functional-words and 
b) Syntactic analysis by simple treatment of struc- 
tural ambiguities and ignoring semantic information. 
QJP is small, fast and robust, because 1) dictio- 
nary size (less than1 100KB) and required memory 
size(260KB) are very small, 2) analysis speed is fast 
(more than 100 words/see on 80486-PC), and 3) even 
a 100-word long sentence containing unknown words 
is easily processed. 
Using QJP and its ana\]ysis results as a base and 
adding other functions for processing Japanese docu- 
ments, a valqety of applications can be developed on 
UNIX workstations or even on PCs. 
1 Introduction 
Natural language parser/analyser is essential for al- 
lowing advanced functions in document processing 
systems, such as keyword extraction to characterize a 
text, key-sentence extraction to abstract a document, 
grammatiea\] style checker, information or knowledge 
retrieval, natura\] lmlguage understanding, naturai 
language interface and so on. But a general pur- 
pose parser requires 1) a laxge dictionary database 
with more than several tens of thousands words, 2) 
advanced techniques for disambiguation and process- 
ing semantics, aald 3) substantial machine resources, 
such as a lot of memory and high speed CPU. 
In addition, users must mMntain additional terms 
in dictionaxies for specialized fields. As a result, most 
parsers cannot be easily used in applications and it 
is difficult to develop a practical parser which can be 
easily integrated into many applications. 
We changed our viewpoint in order to design and 
develop aal applicable and usable Japanese parser. 
First, we focused on the unique sets of character- 
types in written Japanese and constructed a very 
small dictionary using mainly functional words in 
hiragana-chm'acter. Similar approaches\[i\]\[2\] were 
used for segmentation or preliminary morphological 
analysis about 20 years ago, using the transition- 
point between types of ehaxaeter sets to cue word 
segmentation. Second, we noticed that dealing with 
syntactic ambiguities creates a large processing bur- 
den and even using semantic information does little 
to assist syntactic analysis at the current level. So we 
either simplified dealing of structural ambiguities or 
ignored semantics to lighten the syntactic processing. 
We first created a prototype of our parser\[3\] us- 
ing AWK language, and then rewrote it \[4\] in C so 
it could be included in applications. The resulting 
parser, named QJP, is portable, fast and robust. It 
is an effective parser for many general purpose appli- 
cations, despite of a dictionalT size of only 5 thousand 
words. It can analyze a 100-word sentence on a PC 
in less than one second, while using less than half of 
a megabyte of memory. In addition, it requires no 
further dictionaxT maintenance for new terms . 
In this paper we describe the QJP's analysis 
methods and report on its current performances. 
2 Analysis Method 
QJP performs two types of analysis : 1) morphologi- 
cal analysis to segment a sentence into part-of-speech 
tagged morphemes and words, 2) syntactic analysis 
to place words into bunsetsuLdependency structure. 
Analysis strategies are the followings : 
• The morphological analysis is achieved by ex- 
panding an earlier methods\[ill2\] for bunsetsu or 
word segmentation using character-types thus 
Mlowing the use of a very small dictionary. 
• The syntactic ana\]ysis uses no semantic infor- 
mation, only part-of-speech and other syntactic 
information. In addition, rather than creating 
all possible, or some preferable, parses, we con- 
struct the best syntactic structure preserving lo- 
cal ambiguities. 
1Bunsetsu(3~i) is a kind of phrasal unit i n Japanese, con- 
sistiug of one content word(~/~,~.~,/~) \[such as nonn(~ 
~), verb-noun(@~), verb(~315~), adjective(~l:~.~), verb- 
adjective(~J~J) and ad~rerb(~q~\])\] and successive adjunc- 
tive words(l';.~) \[such as auxiliary verbs(II)J~J~) and post- 
positional particles(tlJJ~)\], and carrlng one concept. 
616 
2.1 Morphological Analysis 
Characteristics of written Japanese 
A Japanese sentence has no spaces 
between words\[Figure 1\]. So it is 
difficult to segment a sentence into 
words. However, the fact that at 
least four distinct sets of characters \[for 
example, kanfi(" ~'' ,,~c .... ~=x,, ,,~o~ 
ragana(" ©"," J~"," ~ ",etc), katakana(" 
.\]"," ~:'," ~",etc) and other characters 
(alphabets, nmnbers, symbols etc.)\] are 
used to write Japa~mse can be used for 
segmenting words. Most words written 
in kanji or katakana are content words, 
such as nouns, verb-noun and stems(-~ 
@) of verbs or adjectives. Most words 
in hiragana are functional words(~4~), 
such as postpositional particles, auxiliary 
verbs and inflective suffix(7:~,~-) of 
verbs and others \[Table 1\]. And the vo- 
cabulary of content words is umch larger 
than that of functional words. 
Table 1. Classifications of Japanese Part-of-Speech 
and their word examples 
,A 
*1 gt 
Sharing of Morphemes by Dictionary 
and Rules 
Our strategy is that all functional words, wMch 
are few in nmnber, are stored in the dictionary 
and most content words or their stems in kanji or 
katakana are to be extracted and given their I)ar~-ofo 
speech candidates based on character-types. 
Standard morphological analyser uses a dictio- 
nary to obtain morpheme or word candidates. But in 
our approach, morpheme candidates 2 are extracted 
either from the morpheme dictionary or using al- 
location rules based on character-type. For exam- 
pie, if the dictionary look-up fails, the allocation 
rules extract each sequence of character in which all 
of the characters belong to the same character set. 
Then, using the allocation rules, part-of-speech can- 
didates are assigned based on the sequence's charac- 
ter set and length. The candidates au'e disambiguated 
by checking connection with the the following mor- 
phemes based on the connection table between mor- 
phenm parts-of-speech. The following morphemes, in 
most cases, are functional words or inflective suffixes. 
The dietiomu'y contains funetionM words \[such 
as postpositional particles, auxilim'y verbs, formal 
nouns(N~<~), adverbM nouns(~q~SN), con- 
junctions(~-~N), adverbs and so on\], inflective suf- 
fixes and exceptional content words which cannot be 
or axe not covered by the allocation rules. 
Here are some examples of the allocation rules 
for 1) 1-kanji character sequence, 2) 2-kanfi character 
sequence and 3) katakana character sequence. 
2in this analysis, a inllected word is treated a+s two or more 
morphemes - a stem part and one or more inflection part. 
.% N Part-of-Speech 
~ noun +Y~N# verb-noun 
(sahon) I~# verb 
~# adjective ~7# verb-adjective 
F~ formal noun 
NN~N adverbial noun 
=HN adverb ~I!~ non-oonj, adj. 
~N conjunctive 
~ particle 
~J~$# auxiliary verb 
~li~# aux. functional v. 
~J~ inflective suffix 
T~-~J~ derivative suffix ~ 
prefix ~J~i~ suffix 
E ~1 Examples 
~-J-~, ~m~h-J-6, -~--~-j-6 
~-<, ,~-~, ~-J'6, I$\]+,-~ (~- ~) ~-b. ~ L-t,~, J:-I,<-¢1~6 L-L~ 
~-f~, ~-V. ~-~£, U ,~ -~-f" 
N:U:(~ J: U:), $/':15. ~5~ I,~12 
~-¢, ~5-~, ~-~, 15 < -~ 
~Y. ,5, ~', ~{,5 
• ~ *l:indeloendent word *Z:adjunctive word *3:affix • 4:content word(conceptual word) *5:functional word 
#:inflective -: select point between stem and infleution part 
1) noun / stem of 5-dan verb(~i!~g~NJ~ ) / stem 
of shimo-1-dan verb(~--~-~.~JJ~,7) 
2) noun / (stem of 5-dan verb) / (stem of shimo- 
1-dan) / verb-noun(sahen-meishi; ~)'~ N) / 
verb- adjective (~ "-~!~0J~ ) 
3) noun / verb-noun / verb-adjective 
The 1-kanji character nouns and verb-stems are 
largely of old-JN)anese-origin words, wago(~H~), 
and 2-kanji character nouns, verb-nouns and verb- 
adjectives are mainly Chinese-origin words, kango(~ 
~). In addition, there are several 1-kanji chaxac- 
ter stems of kami-l-dan verbs(\]a--~\[~tJ~), sahen 
verbs(+)-~gOJ-~) and adjcctives()f~-~l) which axe 
stored in the dictionalsr because they ~rc so few in 
nmnber. The word number of words which can be 
treated using rules like those given above is so great 
that the dictionary size is substantially reduced. 
Treating of Wage compound words 
Another characteristic of old-Japanese-origin 
verbs (wage verb) is that they often continue with 
other words or morphemes to become verbs or nouns. 
For examples, two verbs "=-j\]~ <"('to write') and "~ 
~e" ('to become crowded') combine into the compound 
verl) "~@i_},_~"('to write into'), the verb "~2"('to 
read') become the verb "~i"'(' cause to read') with 
the causative suffix "@-", and the verb "~JSu-~2"('to 
step') becoines the the noun ",~fi~Y-\]-"('a step') with 
the derivative suffix "7f'. There axe a great mmly 
compound words such ,as these. 
A word-compounding part determines a word 
fl-om morphemes using word-constituent rules based 
not only on inflections but also on compounds or 
derivations such as those shown above. Such rules 
also greatly reduce the diction,'u'y size. 
61.7 
:\[ 1\]( 8) El*= fi~\] (523)~J~ ' 2\] (14) 0) (37))=~1~ 
.\[ 3\](16) J~9 \[ZH\] (293) El'~g=gl~ 
I\[ 4\] (20) 12 \[zH\] (250) ~\]lRb 
G\] (26) I~1 (355) ~{~I~1 
• \[ 7\] (28) I.= \[zH\] (36) -=t~ 
*\[ 8\](30)~ \[Zd\] (349\]~O~ \[\[ 9\] (32\];h. \[zH\] (351)"F~ 
• \[11\] (36) ~ (41) =~=I~P~ 
*\[12\] (36) ,~ \[zJ\] (369) :~ 2 1113\] (40) h, \[zH\] (372) ~.~a 
• \[14\] (42) t~ \[zH\] (266)'24=9~ 
1115\] (44) ~ \[zH\] (2O5) 9~ 
*\[18\] (46).~,, }i~ (523) ~JR 
• \[17\] (54) ¢) (37) J =~ 
*\[18\] (563~ \[zJ\] (523) ~P\] • \[19\] (583o) \[zH\] (37) \]=~ 
*\[20\] (GO) ~1~ ~1 (523) ,, • \[211 (04) t: (36) -=~-~ 
• \[22\] (66) ~ \[zH\] (1843 ~=~@ 
I \[231 (08) I,~ \[zH\] (375) :~Jl~lb 
• \[24\] (70)-~ \[zH\] (72) ~=~ 
• \[25\] (72), \[zT\] (512)~ 
• \[27\] (04) 15 (47) #'~=~ 
*\[28\](86)~ fz X\] (502)~=! 1129\] (88)- (5013 ~l~l 
• \[30\] (90) a) \[zH\] (37) \] =~tt~ 
*\[31\] (92) ~1 \[za\] (523) ~ • \[32\] (06) "e \[zH\] (243) ~=~c 
11341 (100) z~ (434) ~;O~ • \[35\] (102) o \[zP\] (353) ~J~ 
\[in English\] In processing a sentence of an agglutinative language like Japanese, in which divisiono are not placed between words, 
morphological analysis is the first harrier. 
Fig. 1 Example Japanese Sentence 1 
\[word in English\] -- *\[ 1\] (8) fl~ (O2g~\] (41) 4~ 
Japanese language \[ 2\] (14)~9 (O) (51) 2=~ of 
f 3\] (16\] J: 9-\[- <J: 9 ~J> (60. 22) ~ O0*=~b like \[prep\] 
*\[ 4\] (22)$~ ($I\]f) (41) ~i~ ~erd _\[ 5\] (2{}) ~ (M) (43) ~J 
between 
• \[ 5\] (28)I~ (l:) (51)-:=~R1/ in 
*\[ 7\] (30)t~1-~+~1 (~h%!ei) (41)~1 division space -\[ 8\] (36) '~ ('$) (51) ~=~ <object marker> 
*\[ 9\] (38) ~-h~ <~<> (15, 11) ll~: ~Rea place • \[10\] (42) t~-i,~ <~1,~> (60. 4) -24=0 {$ not 
*\[11\] (46)J~J~ (\]~'~) (41)~I~l agglutinative language 
• \[12\] (54) d~ (~O) (61) ./=~J~ of 
*\[13\] (56) :~ (~) (41) ~i ~J sentence 
• \[14\] (58)~ (~) (51) \]=~ of 
*\[15\] (60) ~ (~,t~) (41) ~1 processing • \[16\] (64) l: (1:) (51) :=~ in 
• \[17\] (66) t~-t,~ <~<> (75.22) ~9=:~ ;~b 
• \[18\] (70) -c ('c) (55) -~=~ • \[19\] (72). (,) (923 ~1 soma 
*\[20\] (74)~'~i~$f (~l~Mf$~) (41)~lal morphulogisal analysis 
• \[21\] (84) 1~: (l~:) (52)#x=f~ <topic marker> 
*\[221 (86) ~--- (~I--) (46\] ~\[~ ~ the first • \[23\] (90) ~ (o)) (51) J =~ of 
*\[24\] (92) IIII~ (~l~¶) (41) ~l~ barrier • \[25\] (96) ~ <t'£> (60. 23) a\]=R\]c is 
• \[26\] (98) ~-~ <~ ~> (75.3) 71b=~ • \[27\] (102). (.) (91)'~ period 
Figure 2. Segmented Morphemes with Tugs Figure 3. Segmented Words with Tugs 
Morphological Output from QJP 
An example of segmented morphemes with 
morpheme-tags are shown in Figure 2, where 8 nouns 
(" 1~2~","~-~",etc.) and 2 stems of word (,,t:)\]', ,, 
~"), maxked by '\[zJ\]', in kanji character are recog- 
nized using allocation rules and connection table. 
The words with part-of-speed, tags 
and morpheme-divisions('-','+') axe shown in Figure 
3, where a compound noun "~avb H" (the 7th word) 
is a compound of the morphemes 8-10 \["~"(stem of 
shimo- l-dan verb " ~)J ~ " ), " ~%" ( renyou-kei inflective 
suffix of shimo-l-dan verb; T~~\[~)~:~-z~ 
~) and " g"(noun)\] using a word-constituent rule. 
In Figure 3, the root forms of inflected words have 
been derived and are shown in the <>-parentheses, 
such us "~\]~ <" which is the root form (shuushi-kei; 
~) of "~". These morphemes and words ~u'e 
not in the dictionaxy. 
2.2 Syntactic Analysis 
Kakari-uke Analysis 
Many J~l)aamse syntactic analyses are ba~ed 
on orthodox bunsets,>depcndency analysis, called 
kakari-uI~e a anMysis(~.~ 0 ~}~{J~:) between bunsetsu 
phrases, where a buckets'a-dependency structure cor- 
responds to a set of kakari-uke bunsetsu pairs. We 
also take this approach because it is intuitive, under- 
standable and easily implemented. 
aThe relation of kakari and uke equals to modifier and 
rood ifiee. 
Simple Treatment of Structural Ambi- 
guities 
StructurM ambiguities are usually dealt with ei- 
ther by generating all possible structures or by select- 
ing the more preferable ones ba,sed on some scoring 
scheme. Such method usually leads to combinatorial 
explosions which causes a lot of memory and process- 
ing time. 
For this problem we have already proposed a 
substitutional light method\[5) in kakari-uke analysis. 
This method extracts all possible kakari-uke pairs, 
and then rather than generate not M1 or some pos- 
sible sets of pairs~ only one best set of pairs is gen- 
erated while still retaining all other possible \]Tairs. 
Thus, instead of generating multiple number of sets, 
it most-likely set is selected ~ld the applie~tion/user 
is presented with alternative kakari-uke pMrs at the 
same time that the selected pairs are presented. If 
the application/user corrects any alternative kakari- 
uke pairs, the most likely set is re-calculated using re- 
taining possible kakari-uke pairs. This means of deal- 
ing with structural ambiguities avoids combinatorial 
explosions and requires flu' less machine resources. 
Not Using of Semantic Inibrmations 
Most methods for analyzing Japanese use c~e 
patterns with semantic features for preference selec- 
tions. However, such analysis techniques using se- 
mantic informations are not yet adequate and seine- 
times i~ctmdly lead to adverse results\[6). 
In addition, semantic information nmst be stored 
618 
in the dictionary. This reduces the merit of the very 
small dictiomtry achie.w.'d in morphological analysis 
section. We limit the information to morphologi- 
c~fl/word ~md syntactic levels \[such as the presence 
of coi\[Ima(-~,%~), adverbial noun, surface or syntactic 
similarity\[7\]\] without using semantic information for 
structurM analysis. 
Flow of QJP's Syntactic Analysis 
Under these approaches, QJP's syntactic aatai- 
yser processes words sequence in three steps\[Figure 4\] 
each following its own set of rules. First it determines 
bunsetsu fcatures\[A\] for each bunsctsu according to 
its word constituents. Second it extracts "all possible 
kakari-uke bunsetsu pairs \[marked by ' O' in B\] based 
on specific combinations of bunsctsu features for each 
bunsetsu pair. 
Last, it selects the best uke-bunsctsu (modifice) 
\[marked by ' ~' in C\] from possible ones for each 
bunsetsu which is a kakari-bunsctsu (modifier), ex- 
cept tim last one, because every bunsetsu modifies 
one of the following bunsctsus, so the last one has no 
uke-bunsetsu. Thc default uke selection is the nearest 
possible uke bunsctsu and, if nccessazsr, Q.}P substi- 
tutes the selcetion 1)ascd on rules comparing the two 
pairs - the currcnt selected ukc-bunsetsu and a more 
distant possible uke-bunsetsu for thc subject kakari- 
bunsctsu. In Figure 7, solnc pairs are not the new,rest 
ones. Tile at)pli(:ation/uscr's kakari-ukc pairs correc- 
tions rest,%rts the selection ; QJP first selects the cor- 
rected kakari.ukc p,fir(s) \[maxked by ' I' in Figure 
7\] and then re-selects remaining kakari-ukc pairs. 
Figure 4-C and Figure 7 ;~re kakari-uke matrices 
showing the possible pairs and selected pairs. Figure 
5 is the output of kakari-ukc pairs tagged with parts- 
of-speech and bunsctsu features. 
\[Se~entation of Words by Hor~ological klalyser\] I \[1. Setting of BunsetsuFeatures\] 
1 \[Bunsetsu Features} \[ 1\] : Elllo~ 91: 
\[ 2\] :~ltllltlllz \[ 3\] :~&B £~ 
\[ 4\] :~:~t~u~ 
\[ 5\] :llttlilio) \[ 6\] :~ 
\[ 7\] :~l!~lzlm,~m, \[ 8\] :~I$ 
\[ 9\] :~-o) 
\[10\] :11=1'~ ~$0 
\[I~ ~II~ ~l 
~~~ o)~} 
A. Bunsetsu Features List 
10987654321 \[ 1\] 0 0 EI~o~51:: 
\[ 2\] 0 $iiltllllZ \[ 3\] O~:L',,B ,~ 
\[ 4\] :O00000fltD'~U~ 
\[ 5\] :0000011tt~-1~ 
\[ 6\] : 0000~o) \[ 7\]:0 ~l:~U~, 
\[ 8\]:o ~ffAI~l~ 
\[ 9\] :0~-o) (0 : Possible Pair) \[10\] 
: B IIP'I~o 
B. Possible Kakari-Uke Bunsetsu gatrix 
\[2. Extraction of 
- 1 
Possible Kakari-Uke Bunsetsu Pairs\] 
\[*. Presentation of Structure\] 
\[&Selection of Best Set of Kakari-Uke Bunsetsu Pairs\] 
\[ 1\]: \[ 2\]: 
\[ 3\]: 
\[ 4\] \[ 5\] 
\[ 6\] \[ 7\] 
\[ 8\] \[ 9\] 
\[10\] 
\[in English\] B~o)J~51c vlike Japanese lang. 
~lilllll< t-between words 
~I, PI~ I-<obj> div. space 
r~h~t~b ~ mot to place rllt~it~O) rof agglutinative lang. 
r~O) rsentence F~ lCt~b~, Fin processing 
till}ltltlt#li I-<topic> morph, analysis --o) Pthe first 
lllP~/187Oo is a barrier. 
D. Kakari-Uke Dependency Structure 
10987 654321 
\[ 2\] : • ~-~rdtlz \[ 
3\]: e~% ~l ~ \[ 4\] :Ox xOOe~l~ 
\[ 5\] :0 x x Oiltlltt~o) \[ 6\] :Ox xe~o) 
\[ 7\]:0 YS\]~lz t~i, vC, \[ 8\] 
: II tf~llt~ltAil#l;t \[ 9\] :llllll--(b (tll : Selected Pair) 
\[10\] :Ml'1-eil~7~o (x : Structural ly 
Prohibited P. ) O. Kakari-Uko Bunsetsu latrix 
Figure 4. Flow of Syntactic Kakari-Uke Analysis 
\[Sel~n~ted Words with Tags\] 
\[ 3\] \[w~kq \[~Pll {~1~=~1\] 
\[ 61 \[~\[~}~l~=~l\] \[ 7\] \[~\[~ii~) ra (==~10) ~u, \[~s < I~r 9=iK~llb\] ~ \[~=1~\], I~l \] 
\[ 9) \[S~-i~,~l~llm\[.,'=~l\] 
{B~msetsu Features) ~-<ilod. Type (K~kar i :Uko) >~ \[lied. No. \] 
\[= ~=gm t%t~ ~t~ ~l~f~} ---------<;~:1~1~>~ \[4\] 
{I~l~ ~I~t~,~I~ i~l ---<~t~,~:~>-----~' \[51 l(%tt~l o)~l!~l ~<~g;l~:~i~J>--~, \[6l 
{~lV~ o)il~l ~-<o)~I~:t~ts\]>----~ \[71 
Figure 5. Kakari-Uke Pail~ with 2.hgs 
619 
3 QJP 
3.1 Implemented Software 
QJP currently is implemented in the C language 
both a QJP library and an interactive/batch console 
application, QJP workbench. They have been imple- 
mented on DOS/PC and UNIX/Sun workstation. 
QJP's dictionary consists of 4 files whose total size 
is less than 50KB and which contain about 5 thou- 
sands nmrphemes. QJP requires a~mther control ta- 
ble file for the compressed 533>(533 morpheme-POS 
connection table, the table for the allocation rules, 
the dictionary file indices and others, which is at most 
35KB. 6 sets of morphological rules a~d 4 sets of syn- 
tactical rules \[Table 2\] are embedded in the form of 
if-then rule in C functions. The size of the work- 
bench execution file on DOS is about 185KB. The 
tota~ size (executables and dictionaries) is much less 
than 300KB\[DOS\] which is quite small and portable 
as a natural language analyser. 
3.2 Analysis Experiment 
QJP performaJ~ces were measured for the QJP work- 
bench using two sentence test-sets : 1 \[24t sentences, 
average length 24.1 words/sentence\] and 2 \[210 sen- 
tences, average length 29.5 words/sentence\]. 
Execution Performance 
About 260KB of memory are required on DOS 
and 500KB on UNIX. With this amount of memory 
QJP can process a very long sentence, such az 100- 
word sentence \[Figure 6\]. 
The analysis speed is 80 to 150 words/see on an 
80486/25~IHz PC and 700 to 800 words/see on a Sun- 
SS20. A 100-word sentence c~n be analyzed in less 
than 1 second on PC. Figure 8 shows the relationship 
of processing time to sentence length. Syntactic pro- 
cessing time is on the order of the square of the sen- 
tence length. But its coefficient is so small that the 
total processing time increases linearly in the range 
of actual long sentences. 
Table 2. Linguistic data and Rules 
MorpholoKiosl ~nalyser 
\[D\] diotionary : -3500 entries. ~5000 morphe~ea 
\[P\] morpheme/word Part-of-Speech : 533/49 POSs 
\[T\] connection table : 533x533 
(~ \[R\] connection source-rules: ~300 rules) 
char. sequence extraotion rules : ~ 20 rules \[R\]R momheme-POS 
allooat on rules : 14 rules 
f~\] mor~heme-POS disa~bi~uation rules : ~ 50 rules 
word-constituent rules ~ 60 rules 
f~\] bunaeLsu head exceptional rules ~ 20 rules 
auxiliary functional verb rules ~ 15 rules 
Syntaotical ~nalyser 
\[F\] bunsetsu features 
\[R\] bunaetsu features saltine rules 
\[R\] kakari-uke par extracting rules 
\[R\] kakari-uke pair exoeptional rules 
\[R\] kakari-uke failure reoovery rules 
68 features 
80 rules 
20 rules 
40 rules 
-~ 4 rules 
• ~\[O\]:diotionsry \[T\]:table \[R\]:rule \[P\]:POS \[F\]:foature 
\[50\] (I1) :tll~lll~l~-~ 
\[49\] (1O) :•~,~ • : Sol•uteri Pair 
\[48\] (9) : xO4~ltt& O : Possible Pair \[47\](9) : • r_&,,~ • :Struoturally Prohibited Pair 
\[46\] ( 8)7:0 x•~7~ x :Poaaible l~t 
\[45\](7) : • x • • •~C~ StruotursllY Prohibited Pair 
\[44\] ( 6)>:x x x • 00~I~ \[43\] ( 7)?: x X X • O00al'~tll~i~6 0 : Selected and later 
\[42\] (5) :x • × , x x•-~o) Corr~ted Pair 
\[41\](7) :.x- "0'' ~, II:~olicatiorVUsar 
\[40\] ( 6)>: • x • • x .... O~;~t~J~ Correatod Pair 
\[39\]( E)>:x XX • XXXX ' II~:~ 
\[~\](4)>:X.X..xxx, O•~Ta \[37\] ( 3\] >: x x x • x x x x • O00I~ ~ 6 
\[36\](2)>:x.x..xxx. O0 tllffil~l'~ 
\[35\](6)7:'x" ,x .... • 00~"J~'. 
\[34\]( 5)>:- x • • x .... O' • x • x•~l= \[33\]( 4)&:x xX • xxxx 
• xx x x x x x•l~:~& 
\[32\](4)?:x.x..xxx. xx.x. 00~=~?: 
\[31\](3)>:. x • .x .... O" ,x.xO 'O~l: \[30\](6)?:.x, "0 .... O.-x.xO '0 ~H', 
\[29\](S)>:'X..0 .... O" "x'xx" .x.O~ 
\[28\](4) :x.x. Xxx. Xx.x..XX.. •'~l't.A,~n,~ 
\[27\](3)>:'x" .0 .... O, "x'xx" .x,O •~l~ll~ 
\[26\]( 2)&:X , x • • xxx • • xX • x • • xx • • • x • O~E~ 
\[251(2\]?:x.x. xxx. xx.×-.xx.. 0 eO~:~ 
\[24\]( 1)>:" x • '0 .... O: • x • xx • . x '0 0 " •~R\]~ ¢ \[23\](7)-: O, • .... O, "x.xx" ,x,O 0 .0 1~l l~ "> 7, -7- ~, l : i~ l, vC. 
\[22,1 ( 6)>:0 × xxx. • xx- x • • xx • • • x • xx • xOl~lffitr~ 
\[Ex~o e S~tonoe 2\] ' 
:_~ 4~tfAt~-l~l~h ~, ~l:\]l~,;};h,t=~.~.t::9~lz 
~ \]~d~, ~ 
Llmlp,'¢" .~ 
~. 
, 
Figure 6. Analyzed Syntactic Structure 
for a 50-bunsetsu senence 
\[20\](5)?:.x..x .... x..x.xx..x.x.x..x. ,llOfl~ I- " / 
\[19\](4) :x.x. ,xxx. ,xx.x..xx...x.xx.xx xOImlR~Y-<~ 6 " ........ X X " ~Jl,~ ~ L synt~ji~" ~lysL~ U 
Ira(4)>:'×" "× .... x..×.××" .x,×.×..x. "O" .xH~m= ,~'~\[- ~ ~,W/,,, ....... \[16\](3)>:x-x. -xxx. -xx,x. -xx...x.xx.xx xx- •{-0\] \[~ I" ~ ."" 
\[15\](537: O. 0 .... x..x.xx.,x.x-x.,x. O..xO ~E~b, ~ I" / .... 
\[14\](4)>:'x" .x ..... x..x.xx..x-x.x..x..x..xx..O/JIt~II~ ~ I" ,/ .,"" 
\[13\](4)?:xxx'xxxx'xxxxxxXXxx'xxxxxxxxxxxxxx'OOm~l~l~G ~ I" ". ~ .... 
\[1O\] ( 1\]>: X X X - X X X X • Ills\]epical lnalysi~ time 
\[ 9\]( 3)&:xxx. xxxx. \[ \[ 
\[ 8\](2)>:x • × • • ×xx, ~ ,r~ \[7\](I)>:. ×..x ..... "" "" 72"~ 
\[5\](4)?: O. 0 .... x..x.xx..x x.x--x. O..xx..e- x.. 0 ~i~. \[5\](3)>:.x..x .... x..x-xx..x x.x..x.-x..xx.-x.-x.- .x,•lWl~l: I- ~ m~:Sun~X. ., , 
\[4\](3)?: O. 0 .... x .x.xx..x x.x.- • u.. x'.u ...... u.x.~: t'~"i - . . , , t 
\[3\](2)>:0 x xxx..xx.x..xx. .x.xx.xO xx..x, xx.xxx.x xO~ ,. , , ~ = I I I I~ 
\[2\](1)>:00x Oxxx.xxxxxxxxxx xxxxxxxOOxxxxx-OxXXXXXXXOXOO~b~ ° ~o \[1\](0)>:0 x XXX..XX.X..XX. .X.XX.XO XX..X, XX.XXX.X XO 0~"~.~ ~umbcr of words in a sentence 
5098 765432 14098765432 13098765432 12098765432 110987654321 Figure 8. Processing Time 
Figure 7. Kakari-Uke Matrix for a 50-bun~etsu senence vs. Sentence Len~h 
620 
Analysis Performance 
We used test-sets I and 2 for tuning and blind- 
test, respectively. For test-sets 1 and 2, the ac- 
curacy of analyzed nmrphenms/words is 99.3/99.3% 
and 95.7/96.1%, the accuracy of almlyzed uke bun- 
sets,~s for each buusetsu excel)t the last one is 95.1% 
and 90.5¢)/0, and the accuracy of set of kakari-uke pairs 
in a sentence is 71.0% and 43.8%, respectively. 
For sentences wlfich have lengths of 3 to 15- 
bunsetsus and are nmrphoh)gically analyzed cor- 
rectly, the accuracy of analyzed uke bunsets~s for 
each bunsetsu is 97.3% and 93.6%, and the accuracy 
for sets of kakari-uke pairs in a sentence is 82.9% and 
70.5%, resl)ectively, 
Comparison 
There are no public data for the performance of 
other Japanese analysers, so comparison is difficult. 
But not oldy the size of files lint also the 1)erfornlallce 
figures for memory and speed of QJP are thought to 
be mot'(,, than ten times better than those of existing 
Japanese analysers\[4\]. As for analysis accuracy, the 
morphologicM accnracy is a little lower than that of 
the existing Jai)anese morphological aualysers usil,g 
large s(Me dictionaries, but the syntactic analysis ac- 
curacy is thought to be no worse than that of the 
existing Jalmnese syntactic analysers. 
4 Conclusions 
We have designed and implemented QJP h~r the pur- 
pose of readily and e,~sily applicable morlflmlogical 
and syntactic amdyser for Japanese. The design 
strategies are based on 1) the morphological anal- 
ysis bLsed on character-types and fimctional words 
to reduce the size of diet&mary, and 2) the syntac- 
tic am'dysis by simple treatment of structural ambi- 
guities and ignoring semantic information to lighten 
processing. 
QJP, ~s implelnented, is portable, quick and ro- 
bust. All tiles needed for execution im:luding dictio- 
nary total less than 300KB on DOS. Even on a slow 
PC a 100-word sentence (:an be analyzed in less than 
1 second using a small amount of memory. This per- 
formances is thought to be quite excellent. The alml- 
ysis accuracy is comparable to that of other existing 
analysers. No dictionary maintenance is necessary 
for new ternls. 
The fnnctions of QJP are inq)lemented ~Ls a QJP 
lil)rary an(l a QJP workbench. We lmve alremly uti- 
lized QJP flw keyword extraction, natnrM language 
query and text reading supt)ort fltnctioas\[9\] and are 
t)lamfing fllrther applications, such ms infi)rlnati(m re- 
trieval system. Others use QJP fimctions for other 
purposes, such as linguistic data extraction. 
QJP currently doesn't segment compound kanji 
words of Chinese-origin and leaves this segmentation 
to the application. In the fltture, we plan to real- 
ize such at segmenting fimction using on statistical 
data\[10\] a,,d aIlixes\[2\]. 
References 
\[1\] Yoshiyuki SAKAMOTO: An Automatic Segmenta- 
tion for Japanese Text, M~tthematical Linguistics, 
Vol.ll, No.6, 1978 (ill Japmlcse). 
\[2\] Makoto NAGAO, Jun-ichi TSUJII, Akira YAM- 
AGAMI, Shuji TATEBE: Data-Structure of a Large 
Japanese Dictionary and Morphological Analysis by 
Using It, Journal of hfformation Processing Soci- 
ety of JaI)au , Vol.19 No.0, pp.514-521, 1978 (ill 
.Japanese). 
\[3\] Masayuki KAMEDA: A Quick aat)anesc Parser, 
Information Pro~iessing Society of Japan SIG Note~, 
94-NL-4, 1993 (in aapmmse). 
\[4\] Masayuki KAMEDA: A Portable & Quick Japanese 
Processing Tool : QJP, I'roc. of the 1st Annual 
Meeting of the Association for Natural Language 
Processing, 1995 (in Jai)anese ). 
\[5\] Masayuki KAMEDA, Shin ISIIII, tIideo ITO: Inter- 
active Disaml)iguation ill a JaI)anese Analysis Sys- 
tem, Information Processing Society of Japan SIG 
Notes, 84-NL-2, 1991 (in Japanese). 
aeSO 7~ J L~:'~ ~ ©~)J~mt\]l'l>~, EDR 7tJ: f-iga%~*lJ)l\] 
~/7\],~9¢d & i,~.9.::JJ~, 1995 (in Jai)ancsc ). 
\[7\] Sadao KUROHASHI, Makoto NAGAO: A Method 
for Analyzing Conjmmtive Structures in Japanese, 
q¥ansactions of Infornuttion Processing Society of 
Japan, Vol.33 No.8, I)p.1022-1031, 1992 (in 
Jal)anese ). 
\[8\] Sadao KUROHASIII, Toshihisa NAKAMURA, Yuji 
MATSUMOTO and Makoto NAGAO : hnprovement 
of Japanese Morl)hologic~d Analyser JUMAN, Proc, 
of the International Workshop on Sharablc Natural 
Language Resources, pp.22-28, 1994. 
\[9\] Masayuki KAMEDA: Supl)ort fimctions for Read- 
ing Japanese text, hfformation Processing Society 
of Japan SIG Notes, ll0-NL-9, 1995 (in Japanese). 
\[10\] Koichi TAKEDA, Tetsunosuke FUJISAKI: Auto- 
marie Decomposition of Kanji Compound Words Us- 
ing Stochastic Estimation, Transactions of Infof 
marion Processing Society of Japan, Vol.28 No.9, 
pp.952-961, 1987 (ill Japanese). 
Example Sentences 
1 Tohru HISAMITSU: Proc.of the 42th Meeting of 
Information Processing Society of Japan, 1991. 
2 Japanese Laid Open Patent No.60-20{)368, 1985. 
621 
