JAPANESE SENTENCE ANALYSIS FOR AUTOMATIC INDEXING
Hiroshi Kinukawa 
Systems Development Laboratory 
Hitachi, Ltd. 
1099, Ohzenji, Tama-ku, 
Kawasaki 215, Japan 
Hiroshi Matsuoka 
Systems Development Laboratory 
Hitachi, Ltd. 
1099, Ohzenji, Tama-ku,
Kawasaki 215, Japan 
Mutsuko Kimura
Institute of Behavioral Sciences 
1-35-7, Yoyogi, Shibuya-ku, 
Tokyo 151, Japan 
A new method for automatic keyword
extraction and "role" setting is proposed, based
on Japanese sentence structure analysis.
The analysis takes into account the following 
features of Japanese sentences, i.e., the 
structure of a sentence is determined by the 
noun-predicate verb dependency, and the case 
indicating words (kaku-joshi) play an important 
role in deep case structure. By utilizing 
the meaning of a noun as it depends on each 
predicate verb, restricted semantic processing 
becomes possible. An automatic indexing system, 
equipped with a man-machine interactive 
error-correcting function, has been developed. 
The evaluation of the system is performed 
by applying it in news information retrieval. 
The results of this evaluation show that the 
system can be put to practical use. 
1. Introduction
The main problems arising in the
development of an information retrieval system
for Japanese text are saving man-power,
standardizing information storage,
and realizing efficient retrieval.
For English text, the stop-word
removal method for automatic keyword
extraction has been put to practical use.
For Japanese text, which
consists of Kanji and Kana characters, a
keyword extraction method utilizing statistical
word frequency data has been reported by a
Kyoto University group.3 This paper proposes
a new method of automatic keyword extraction 
and "role" setting for Japanese news 
information retrieval. The "role" characterizes 
semantic identification of each keyword in a 
sentence and is classified into six categories, 
i.e., human subject, human object, time, place, 
action, and miscellaneous important 
information. 
The main features of Japanese sentences can be 
characterized as follows: 
(1) The structure of a sentence is determined
by the noun-predicate verb dependency.
(2) The case indicating words (kaku-joshi) play
an important role in deep case structure.
Taking these features into account, D.G. Hays's
dependency grammar 1 and C.J. Fillmore's case
grammar 2 are utilized in the sentence
structure analysis. The sentence pattern table 
containing a noun-predicate verb dependency 
relationship plays an important function in 
the analysis. By utilizing the meaning of a 
noun as it depends on each predicate verb, 
restricted semantic processing becomes 
possible. An automatic indexing system5, 
equipped with a man-machine interactive 
error-correcting function, has been developed 
based on the method described. Evaluation of 
the system has been done by applying it in news 
information retrieval. 
2. Role Setting Criteria
The criteria employed for setting the role
of each keyword in a news sentence are as
follows:
(1) "Action" (A for short) is assigned to verbs
which express movement and are elements
of the "predicate" set.
(2) "Time" (T for short) can be assigned without
ambiguity.
(3) "Human subject" (HS for short), "human
object" (HO for short), "place" (P for short)
and "miscellaneous important information"
(MI for short) are assigned to noun words
according to the following criteria:
(a) Words which express humans or organizations
have either role "HS" or "HO". The
distinction can be made by examining the
subsequent kaku-joshi.
(b) Words which express things without
consciousness have role "MI".
(c) A country name has role "HS" if it is
presumed to have consciousness as an
organization. It has role "P" if it
means territory.
(d) An airplane or a ship has role "HS"
when it is personified together with
its driver, role "P" when it expresses
a place, and role "MI" when it means
a thing.
(e) Ambiguities in items (c) and (d) are
removed by knowing which predicate verb
the word depends on; this determines
whether it expresses a human, organization,
place or miscellaneous matter.
To clarify the description, some examples
are given below:
ex.1) "State A" ga "State B" wo shihai-suru.
      HS           HO           A (control)
<State A controls State B.>
In this sentence "ga" and "wo" are
kaku-joshi.
ex.2) "State A" ga "M-Sea" wo shihai-suru.
      HS           P          A
<State A controls M-Sea.>
ex.3) "State A" ga "petroleum" wo shihai-suru.
      HS           MI             A
<State A controls petroleum.>
ex.4) "Isolationism" ga "State A" wo shihai-suru.
      MI                HO           A
<Isolationism controls State A.>
(4) As mentioned above, the "role" of a noun
word is determined by considering the
following three elements:
(a) the predicate verb on which the noun word
depends
(b) the meaning of the noun word
(c) the kaku-joshi which is attached to
the noun word
3. Japanese Sentence Structure Analysis 
The basic Japanese sentence pattern is
expressed as "NF1 NF2 ... NFn PV", where
NFi, which is called "meishi-bunsetsu", is
composed of a noun word and case indicating
words, and where PV is a predicate verb.
The Japanese sentence structure is 
characterized by the following points, i.e., 
(1) The predicate verb is put at the end of
the sentence.
(2) The position of a "meishi-bunsetsu" in a 
sentence is not fixed. 
(3) A "meishi-bunsetsu" could be omitted in 
a discourse which consists of several 
sentences. 
Utilizing D.G. Hays's dependency grammar, 
noun-predicate verb dependency relationships 
are formulated. In this formulation the 
relationships between nouns are irrelevant. 
Therefore, the Japanese sentence structure 
becomes independent of noun-word order, and 
a word omission is expressed in terms of the 
presence of a dependency relationship in the 
sentence. Since a "role" is the semantic
identification of a word, by applying
C.J. Fillmore's case grammar 2, it can be
assigned to each keyword by clarifying the case
structure of the predicate verb (Figure 1). In
Japanese sentence structure analysis, the 
predicate verb is identified first and then 
dependent noun words are determined in order 
of nearness to the predicate verb. The sentence
is parsed by using top-down analysis. The
bottom-up method is not adopted because it
causes much ambiguity in the parsing of words
which do not directly depend on the predicate
verb. The need for classification of noun words
in terms of their meaning is mentioned in
Section 2. Noun words are classified into seven
semantic classes in order to analyze
noun-predicate verb dependency relationships
efficiently and to assign roles to them, i.e.,
(i) Organization (ii) Person 
(iii) Literature (iv) Place (v) Action 
(vi) Name of matter, Abstract idea, etc. 
(vii) Time 
Predicate verbs are classified by taking into 
account the meaning of the dominated words and 
their cases. (Figure 2) The sentence pattern 
table is constructed based on this predicate 
verb classification. (Figure 3) 
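
Because only noun-predicate verb dependencies are formulated, two sentences that differ only in the order of their meishi-bunsetsu should yield the same structure. The following Python fragment is a minimal sketch of that idea, not the system's actual parser. The names `Bunsetsu` and `parse_simple_sentence`, and the romanized, pre-segmented input format, are assumptions introduced for illustration.

```python
from typing import NamedTuple


class Bunsetsu(NamedTuple):
    noun: str        # noun word of the meishi-bunsetsu
    kaku_joshi: str  # case indicating word attached to the noun


def parse_simple_sentence(tokens):
    """Parse a pre-segmented simple sentence of the form NF1 ... NFn PV.

    `tokens` is a list whose last element is the predicate verb (a string)
    and whose other elements are (noun, kaku_joshi) pairs.
    Returns (predicate_verb, frozenset of dependent Bunsetsu).
    """
    if not tokens or not isinstance(tokens[-1], str):
        raise ValueError("the predicate verb must close the sentence")
    predicate = tokens[-1]
    dependents = []
    # Visit noun phrases from the one nearest to the predicate verb outward.
    for noun, joshi in reversed(tokens[:-1]):
        dependents.append(Bunsetsu(noun, joshi))
    # A set discards word order: only noun-verb dependencies are kept.
    return predicate, frozenset(dependents)


a = parse_simple_sentence([("State A", "ga"), ("State B", "wo"), "shihai-suru"])
b = parse_simple_sentence([("State B", "wo"), ("State A", "ga"), "shihai-suru"])
print(a == b)  # both word orders give the same dependency structure
```

An omitted meishi-bunsetsu simply results in a smaller set of dependents, so no special sentence form is needed for it in this sketch.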
In the news retrieval system, about 5600 
predicate verbs are classified into 586 
classes; this classification is called 
case-information(A4-code). The sentence 
pattern table contains 1686 patterns. A
sentence pattern in the table is composed of
at most four triplets. The elements of a triplet
are the semantic class identification code of 
the noun word, kaku-joshi, and the "role" which 
is determined in terms of the values of the 
first two elements. 

Surface sentence characteristics:
  1) Freedom in the position of "meishi-bunsetsu"
  2) Omissibility of "meishi-bunsetsu" in a discourse
  3) "Meishi-bunsetsu" is composed of a noun word and
     case indicating words.
Surface sentence structure:
  <meishi-bunsetsu>1 ... <meishi-bunsetsu>n <predicate verb>
Mapping (= role):
  Human Subject, Human Object, Place, ...
Deep case structure:
  Case relation ([Agentive], [Objective], [Locative], ...)
Figure 1 Relationship between Surface Case Structure and
Deep Case Structure in Japanese Sentence

For example, "shihai-suru" (control) and
"kogeki-suru" (attack) belong to category No.46.
The predicate verbs of this category have six
sentence patterns, and each sentence pattern
has two triplets. The first sentence pattern
has triplets (ga, A, 1) and (wo, 1, 2). The first
code of the triplet is the kaku-joshi, the second
code is the semantic classification code of the
noun word, and the third code is the "role".
Semantic classification code "A" expresses
organization or person.
Sentence analysis and "role" setting are
performed by referring to this sentence pattern
table.
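
The table lookup described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the system's implementation (which is written in Assembly language). The six patterns for A4-code 46 are transcribed from Figure 3 as far as it is legible. The noun lexicon `NOUN_CLASSES`, the function names, and the rule that the first matching pattern wins are hypothetical.

```python
# Role codes (R) as listed in the legend of Figure 3.
ROLE_NAMES = {1: "HS", 2: "HO", 3: "T", 4: "P", 5: "A", 6: "MI"}

# Sentence patterns for A4-code 46; each triplet is
# (kaku-joshi, semantic class of the noun word, role).
PATTERN_TABLE = {
    46: [
        [("ga", "A", 1), ("wo", "1", 2)],
        [("ga", "A", 1), ("wo", "4", 4)],
        [("ga", "A", 1), ("wo", "6", 6)],
        [("ga", "6", 6), ("wo", "1", 2)],
        [("ga", "6", 6), ("wo", "4", 4)],
        [("ga", "6", 6), ("wo", "6", 6)],
    ],
}
VERB_CLASS = {"shihai-suru": 46, "kogeki-suru": 46}

# Hypothetical lexicon: a noun may belong to several semantic classes
# (1: organization, 4: place, 6: name of matter, abstract idea, etc.).
NOUN_CLASSES = {
    "State A": {"1", "4"},
    "State B": {"1", "4"},
    "M-Sea": {"4"},
    "petroleum": {"6"},
    "Isolationism": {"6"},
}


def accepts(pattern_class, noun_classes):
    """Class "A" in a pattern stands for class 1 or class 2."""
    wanted = {"1", "2"} if pattern_class == "A" else {pattern_class}
    return bool(wanted & noun_classes)


def set_roles(bunsetsu, verb):
    """Return [(word, role), ...] from the first pattern that fits, else None."""
    for pattern in PATTERN_TABLE[VERB_CLASS[verb]]:
        roles = []
        for noun, joshi in bunsetsu:
            hit = next((r for k, b, r in pattern
                        if k == joshi and accepts(b, NOUN_CLASSES[noun])), None)
            if hit is None:
                break  # this pattern does not fit; try the next one
            roles.append((noun, ROLE_NAMES[hit]))
        else:
            return roles + [(verb, "A")]  # the predicate verb is the Action
    return None


print(set_roles([("State A", "ga"), ("M-Sea", "wo")], "shihai-suru"))
```

With this toy lexicon the sketch reproduces the roles of ex.1 to ex.4 in Section 2: the ambiguous noun "State A" becomes HS or HO depending on its kaku-joshi and on what the other noun can be.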
4. Automatic Indexing System
An automatic indexing system has been
developed based on the method described. The
processing procedure of the system consists
of the following three steps (Figure 4):
(1) Word recognition
(2) Automatic "role" setting resulting from
the sentence structure analysis
(3) Man-machine interactive error-correction.
The hardware configuration is given in Table 1.
Size and performance of the programs are given
in Table 2.

[Diagram with the columns <Predicate verb>, <Meishi-bunsetsu>,
<Semantic Class> and <Role>. Each predicate verb is linked through
its Agentive and Objective cases to meishi-bunsetsu (<noun>ga,
<noun>ni, <noun>wo), to their semantic classes (e.g. <Organization>)
and to the resulting roles (Human Subject, Human Object, Place,
Miscellaneous Important Information).]
Figure 2 Relationship between predicate verb and roles

A4 code          First      Second     Third      Fourth
                 K   B  R   K   B  R   K   B  R   K   B  R
1                ga  1  1
 :
46               ga  A  1   wo  1  2
 shihai-suru     ga  A  1   wo  4  4
 (control)       ga  A  1   wo  6  6
 kogeki-suru     ga  6  6   wo  1  2
 (attack)        ga  6  6   wo  4  4
 etc.            ga  6  6   wo  6  6
 :
586
K: Kaku-joshi
B: Semantic Identification of Noun Words
   1: Organization  2: Person  3: Literature
   4: Place  5: Action  6: Name of materials, etc.
   7: Time  A: 1 or 2
R: Role
   1: Human Subject  2: Human Object
   3: Time  4: Place
   5: Action  6: Miscellaneous Important Information
Figure 3 Proposed Sentence Pattern Table

[Flowchart from START through Word Recognition, Automatic "Role"
Setting and Error-Correction (Kanji terminal), with the "Fuzoku-go"
table, the "Jiritsu-go" dictionary and the sentence pattern table
as reference data.]
Figure 4 Automatic Indexing Procedure
4.1 Word Recognition 
Word recognition is executed in the
following two steps (Figure 5): automatic
segmentation of the Kanji and Kana character
string, and the matching of each segment with
entries in the content word dictionary
("Jiritsu-go" dictionary, which contains nouns,
verbs, etc.) and the function-word table
("Fuzoku-go" table) to obtain syntactic and
semantic information concerning the word. The
first step utilizes statistical features of
Japanese sentences. The second step is a
morphological word analysis4. The following
information codes are given to the words
contained in the "Jiritsu-go" dictionary:
(1) A1-code: word-class classification
code (ten classes)
(2) A2-code: morphological class
classification code (75 classes)
(3) A3-code: prefix and suffix identification
code
(4) A4-code: predicate-verb case identification
code
(5) B-code: semantic identification of noun
words
The morphological analysis procedure gives the
following information by referring to the
"Fuzoku-go" table:
(6) C1-code: kaku-joshi classification code
(7) C2-code: the code distinguishes active voice,
passive voice and causative
expression
(8) C3-code: The code given to a meishi-bunsetsu
distinguishes whether the meishi-bunsetsu
is a direct dependent of
the predicate verb or a modifier of
another meishi-bunsetsu.
The code given to the predicate verb
expresses the type of inflection
of the verb and the kind of
subsequent conjunctive function
word (setsuzoku-joshi).
(9) D-code: auxiliary code for determining
the A1-code

Table 1 Hardware Configuration
No.  Name                  Specification and Usage
1    C.P.U.                Memory: 384KB  S.M.V.: ?.5 ?s.
2    M.Disk                M.A.T.: 72.5 ms.
                           Dictionary & table str. media
3    Kanji Printer         700 line/min.
                           Printing of results
4    Kanji Video Terminal  40 ch./line x 12 line
                           Man-machine interactive
                           error-correction
C.P.U.: central processing unit
M.Disk: magnetic disk memory
S.M.V.: system mixed value
M.A.T.: mean access time
str.: storage
min.: minute
ms.: millisecond
?: character illegible in the source

Table 2 Size and Performance of the Programs
No.  Procedure                 Steps   Memory   Pfm.
1    Word Recognition          3 KS    60 KB    ?40 ms/m.b.
2    Automatic "Role" Setting  12      120      650 ms/stc.
3    Error-Correcting          6       132      --
4    Table Maintenance         6       33       --
5    Utility                   11      84       --
6    Total                     38      432
These procedures are programmed in Assembly
language.
KS: kilo-steps
KB: kilo-byte
m.b.: meishi-bunsetsu
Pfm.: performance
stc.: sentence
?: character illegible in the source
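
The two-step word recognition of Section 4.1 can be illustrated with the Python sketch below. It is a deliberately simplified stand-in. The paper's first step uses statistical features of Japanese sentences that are not detailed here, so the sketch instead cuts the string wherever the character class (Kanji, Hiragana, Katakana) changes. The dictionary contents, the code values and the names `segment` and `recognize` are hypothetical.

```python
def script_of(ch):
    """Classify a character by its Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def segment(text):
    """Step 1 (simplified): split where the character class changes."""
    segments = []
    for ch in text:
        if segments and script_of(segments[-1][-1]) == script_of(ch):
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


# Hypothetical dictionary entries carrying a few of the codes listed above.
JIRITSU_GO = {
    "国家": {"A1": "noun", "B": "1"},   # state (organization)
    "石油": {"A1": "noun", "B": "6"},   # petroleum (name of matter)
    "支配": {"A1": "noun", "A4": 46},   # control (forms shihai-suru)
    "する": {"A1": "verb"},
}
FUZOKU_GO = {
    "が": {"C1": "ga"},
    "を": {"C1": "wo"},
}


def recognize(text):
    """Step 2 (simplified): look each segment up in both tables."""
    result = []
    for seg in segment(text):
        info = JIRITSU_GO.get(seg) or FUZOKU_GO.get(seg)
        result.append((seg, info))  # info is None for unknown segments
    return result


for seg, info in recognize("国家が石油を支配する"):
    print(seg, info)
```

Segments that match neither table come back with `None`. In a full system such cases would presumably be left to the interactive error-correction step.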
4.2 Automatic "Role" Setting
Automatic "role" setting is executed by
the following four steps (Figure 6):
(1) Predicate verbs in a sentence are first
recognized by referring to the A1-code.
Then, complex sentence structure is
analyzed and divided into simple sentences.
(2) Sentence patterns for each simple sentence
are obtained by utilizing the A4-code.
Then, noun-predicate verb dependency is
analyzed by comparing the B-code and the
C1-code of noun words with the sentence
pattern. Prior to this analysis the
following procedures are executed:
(a) Searching the sentence pattern for
causative expression
(b) Transforming passive voice expression into
active voice expression
(c) Standardizing "kaku-joshi"
(3) Words in a noun phrase are analyzed as
modifying the last noun word of the phrase.
(4) The "role" is automatically given to each
keyword using the results of the above
three procedures.

[Flowchart starting at ENTRY with Automatic Segmentation;
the remaining boxes are illegible in the source.]
Figure 5 Word Recognition Process

[Flowchart from ENTRY through Predicate Verb Recognition,
Predicate Verb Dependency (referring to the Pattern Table)
and Noun Phrase Processing to END.]
Figure 6 Role-Setting Process Utilizing
Japanese Sentence Analysis
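
Procedure (b) of step (2) can be illustrated as follows. The paper does not state its transformation rules, so this Python sketch assumes the familiar correspondence for the Japanese direct passive, in which "Y ga X ni V-rareru" corresponds to "X ga Y wo V". The mapping table and the function name `to_active` are hypothetical, and the voice label stands in for the information carried by the C2-code.

```python
# Assumed kaku-joshi rewriting for the direct passive:
# the passive subject (ga) becomes the object (wo), and the agent (ni)
# becomes the subject (ga) of the active sentence.
PASSIVE_TO_ACTIVE = {"ga": "wo", "ni": "ga"}


def to_active(bunsetsu, voice):
    """Return the (noun, kaku-joshi) pairs rewritten for active voice."""
    if voice != "passive":
        return list(bunsetsu)
    return [(noun, PASSIVE_TO_ACTIVE.get(joshi, joshi))
            for noun, joshi in bunsetsu]


# "State B ga State A ni shihai-sareru" <State B is controlled by State A.>
passive = [("State B", "ga"), ("State A", "ni")]
print(to_active(passive, "passive"))
```

After this normalization the sentence can be compared with the active-voice patterns of the sentence pattern table, so no separate passive patterns are needed in the sketch.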
4.3 Man-Machine Interactive Error-Correcting
Function
The man-machine interactive
error-correction unit consists of a Kanji video
terminal and a Kanji line printer.
5. Evaluation of the System 
The system has been evaluated by applying
it to news information retrieval. The results
show that, on the assumption that the content
word dictionary and the sentence pattern table
cover 90% of the processed words and processed
sentence patterns, 85 to 90% of the extracted
keywords and 80 to 85% of the assigned roles
are estimated to be correct. Also, the time
required for indexing is only one third of that
required for conventional manual indexing, and
the retrieval precision-ratio is improved by 20
to 30% without affecting the recall-ratio. With
this method the turnaround time for
information storage is reduced to half of that
of the conventional manual method. Examples
of output are given in Figure 7.
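
How role-tagged keywords could raise the precision-ratio without lowering the recall-ratio can be seen in a toy example. The Python sketch below is an illustration only: the two-document index is built from ex.1 and ex.4 of Section 2, the relevance judgment is invented, and the resulting figures have no relation to the evaluation results reported above. The names `INDEX`, `search`, `precision` and `recall` are hypothetical.

```python
# Toy index: document -> set of (keyword, role) entries.
INDEX = {
    "doc1": {("State A", "HS"), ("State B", "HO"), ("shihai-suru", "A")},
    "doc2": {("Isolationism", "MI"), ("State A", "HO"), ("shihai-suru", "A")},
}


def search(keyword, role=None):
    """Return documents containing the keyword, optionally in a given role."""
    return {doc for doc, entries in INDEX.items()
            if any(k == keyword and (role is None or r == role)
                   for k, r in entries)}


def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant documents that are retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Invented information need: "What does State A control?"
relevant = {"doc1"}
plain = search("State A")                 # keyword only
with_role = search("State A", role="HS")  # keyword restricted to Human Subject

print(precision(plain, relevant), recall(plain, relevant))
print(precision(with_role, relevant), recall(with_role, relevant))
```

In this toy case the keyword-only query also retrieves the document in which State A is the object, while the role-restricted query does not, and both queries retrieve the relevant document.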
[Sample output listing; contents not recoverable.]
Figure 7 Examples of Output
6. Conclusion
A new method of automatic keyword
extraction and "role" setting has been proposed
and evaluated. An experimental automatic
indexing system has been developed utilizing
the above-mentioned Japanese sentence structure
analysis. The analysis is characterized as
follows:
(1) It is based on the noun-predicate verb
dependency.
(2) Restricted semantic processing becomes
possible by utilizing the meaning of a noun
as it depends on each predicate verb.
By utilizing the system, the following
problems which arose with the development
of an information retrieval system have been
solved, i.e.,
man-power savings, information storage
standardization and the realization of
efficient retrieval.
Acknowledgement 
The authors wish to thank Professor Toshio 
Ishiwata of Ibaragi University and Professor 
Hirohiko Nishimura of Tokyo University of
Agriculture and Technology for their helpful 
discussions. The authors wish to thank Mr. 
Mieji Shimizu and Mr. Masahiro Sakano of Facom 
Hitac Ltd. for their encouragement of this 
study. The authors also wish to thank Dr. 
Takeo Miura, General Manager of Systems 
Development Laboratory of Hitachi Ltd., for 
his farseeing supervision. The authors are 
also grateful to Mr. Hiraku Hashimoto, Mr. 
Akira Hakuta and Mr. Kenji Koichi for their 
computer programming of this study. 

References 

1. D.G. Hays, "Dependency Theory; A Formalism
and Some Observations", Language Vol.40, 
No.4(1964) 

2. C.J. Fillmore, "The Case for Case", 
Universals in Linguistic Theory, Bach and 
Harms, eds., Holt, Rinehart, and Winston, 
New York(1968) 

3. Makoto Nagao, Mikio Mizutani and Hiroyuki
Ikeda, "An Automatic Method of the
Extraction of Important Words from Japanese
Scientific Documents", Journal of IPS Japan,
Vol.17, No.2(1976 Feb.)

4. Hiroshi Kinukawa, Kenji Tsutsui, Ikuo 
Odagiri and Mutsuko Kimura, "Stenograph to 
Japanese Translation System", Information 
Processing in Japan, Vol.15(1975)

5. Hiroshi Kinukawa and Mutsuko Kimura, 
"Automatic Indexing System Utilizing 
Japanese Sentence Analysis", Transactions 
of IPS Japan, Vol.21, No.3(1980 May)
