AN ATTEMPT TO AUTOMATIC THESAURUS CONSTRUCTION FROM AN ORDINARY 
JAPANESE LAN8UAeE DICTIONARY 
Hiroaki Tsurumaru Toru Hitaka Sho Yoshida 
Department of Electronics 
Nagasaki University 
Nagasaki 852, JAPAN 
Department of Electronics 
Kyushu University 36 
~ukuoka 812, JAPAN 
ST~ 
How to obtain hierarchical relations(e.g, superordinate 
-hyponym relation, synonym relation) is one of the most 
important problems for thesaurus construction, A pilot 
system for extracting these relations automatically from 
an ordinary Japanese language dictionary (Shinmeikai Koku- 
gojiten, published hy Sansei-do, in machine readable form) 
is given. The features of the definition sentences in the 
dictionary, the mechanical extraction of the hierarchical 
relations and the estimation of the results are discussed. 
i. INTRODUCTION 
A practical sized semantic dictionary (thesaurus as wide 
sence) is necessary for advanced natural language process- 
ing. We have been studying how to obtain semantic informa- 
tion for such semantic dictionary from a Japanese language 
dictionary(Shinmeikai Kokugojiten,published by Sanseido,in 
machine readable form)(1) containing about 60,000 entries. 
A dictionary contains meanings and usages of practical 
size of general words. Especially,definition sentences{DS: 
a brief notation} are important sources of information for 
meanings of general words. Generally, DS of an entr~word 
{EW:a brief notation} is defined by qualifying its super ~. 
ordinate word or synonyms or hyponyms. We call these words 
definition M~rds{DW:a brief notation}. 
We have been developing a system for extracting automat- 
ically DW related to EW from its DS, and for deciding the 
DW-EW relation (e) . By this system,(hierarchical) relations 
among entry words in the dictionary are to be established. 
We constructed a sub-system for extracting DSs corre- 
sponding to parts of speech, infrected form and meaning 
(definition) number of each entry word(v) . 
In this paper, the features of DSs in the Japanese dic- 
tionary, an outline of the pilot system and the results of 
experiment will be discussed. 
L_ ~EATURES_~RY JAPANE~~DICTIOBANY 
2.~S~TUR~DS 
The typical examples of DSs are as follows: 
(i) \[~(zigzag rule)\] :...b~o~?~(possible to be 
folded)_~h~l~J.,:(rule)o 
(2) \[R~(mountain path)\] : \[...#)q~D(in)~i~(narrow path) 
j O)~(the nleaning of)~D~H~f~(a polite expression of)o 
(3) \[~(blue frog)\] :...~O)(large)#Anlk(frog)59--~ 
(a kind of)o 
(4) \[~{~(respectful daughter-in-low)\] : ~\[(daughter-in- 
iow)l=)bt~(for)~t~j~(a respectful word)° 
(5) \[~(autumn insect)\] : ~/_(bell-ing insect). 
Y~_~i-2_(a kind of cricke1:)~'(etC)o 
Where the brackets( \[...\] ), underline, and parentheses 
((...)) denote EW, DW, and an English translation for the 
preceding Japanese phrase respectively. 
In (1), the final word is DW and superordinate-hyponym 
relation(DW>EW) holds between the DW and the EW. 
In (2), DW is the final word in hook brackets(r...j) and 
DW>EW holds. Tile expression "r...jO)~6)~$1~{~" is 
called a functional e~ rpr~e,'ssion{FE:a brief notation}. The 
(compound) word "~b'J~" in the FE is called a func- 
k~onA1 word{~W:a brief notation}. In this case, the FW 
denotes a usage of the EW. 
In (3), DW is just before the FE "0)--~" and DW>EW 
holds. In this case, the I~E prescribes the DW>EW explic- 
itly. The word"--N" is the FW. 
In (4), DW is just before the FE "t:~b~¢~J~# '' and 
the synonymous relation(DW~EW) holds between the DW and 
the EW. The FW "~" denotes a usage of the EW. 
In (5), two DW<EWs hal(t, that is, the DW "NZ'~/~/" < 
the EW "$~" and the DW "~" < the Ew "$~" . In 
this case, the number of DNs are more than one, DW isn't 
modified and the FE is the word "~*" . The FW is identi- 
cal with the FE. (Notes: "~'" is a sub-postpositive 
signifying exemplification.) 
The features of DSs in the dictionary are as follows: 
(a) Honorary, the final ~mrd in DS is DW. 
(b) In some cases, the final expression in DS is YE 
assigning semantic relation between DW and EW, 
and DW is just before the PE. 
(c) Genraly, DW is modified by another phrase(modifier). 
(d) In some cases, DS contains more than one DW. 
The following general structure is obtained according to 
these features. 
'" (\[MODIFIER\], DW)*. \[F~\]o 
Notes) \[...\] : optional constituent 
(...) : required constituent 
* : sequence of coordinate constituent(e.g..,~) 
• : concatination symbol which is diferent from 
coordinate constituent(.) 
~or convenience of explanation, the general structure is 
divided into the following two types. 
(I) TYPEI: ... (\[MODIFIER\] .DN)*o 
(I1) TYPEII: .,. (\[MODI~\]:ER\] .DW)*. PEa 
445 
In TYPE I, the final word is DW. In TYPEII, the final 
expression is FE, and BW is just before the FE. 
2.2 DW-EW RELATION IN DS 
We will propose the following assumptions according to 
above-mentioned features in order to extruct the DW-EW 
relations from DOs of the general structure. 
(~) When DS is in TYPE I , DS~EW. Because DS is a phrase 
(or a compound word) as wide senoe. 
(~) When DS is in TYPEll, SS pFE EW. 
Where pFE is binary relation assigned by FE, and SS 
is the shortened DS corresponding to (\[MODIFIER\], DW)*. 
® \[MODIFIER\] • W ~ W 
(~) (\[MODIFIERi\] .We) * ~ \[MODIFIERa\] .Wa 
Where i,j : l~n, W is arbitrary word. 
The following general algorithm for deciding the DN-EW 
relations is obtained by means of these assumptions. 
(I) DS is in TYPE I (DS dosn't include FE), 
(A) DW is modified, 
(~) The number of DW is only one, then DW>EW 
(B) The number of DW are more than one, then CD 
(B) DW isn't modified , 
(~) The number of DW is only one, then DW--EW 
(~) The number of DN are more than one, then DN<EW 
(II) DS is in TYPEII(DS includes FE), 
(A) DW is modified , 
(a) The number of DW is only one, 
PFE is '>' or '---' ,then BW>EW otherwise CO. 
(B) The number of DW are more than one, then CD 
(B) DW isn't modified , 
(a) The number of DW is only one, then DW pFE EW 
(B) The number of DW are more than one, 
pFE is '<' , then DW<EW otherwise CD. 
CD denotes that ON-EW relation isn't extracted mechani- 
cally from DS. In this case, the extraction of DW-BW rela- 
tion needs human support at this stage. 
2,3 FEATURES OF FE 
FE prescribes hierarchical relations(e.g. DW>EW, DW<EN, 
DW=EW, or DWmEW) or whole-part relation(DW~ EW). (e.g. On 
" \[~(interbrain)\] :.... ~(brain)OD--~9 (a part Of)o 
", the FE "6D--~" prescribes DW~ EW explicitly.) 
Besides these relations, another relation between DW and 
BW are prescribed by special FEe(e.g. "~T(under)" ), 
which is called associative relation(R). 
There are so many FEs that they are mainly divided into 
four patterns called functional patterns{FP: a brief nota- 
tion}. FP is expressed by means of regular expression. FP 
is necessary for extracting FE and DW-EW relation informa- 
t_j~_n_(i.e, information neccessary for deciding the DW-EW 
relations) assigned by the FE. FP also designates a place 
of DW in DS, 
Main features of FP are as follows: 
(1) Type100 : \[,..DWj • ~ • FW 
(2) Type200 : ...DW • (~9 . FN)* 
(3) Type3OO : ...DW • P FW 
(4) Type400 : .,.DW • e~" 
446 
Motes) ~* is arbitrary character string, 
(...)* is repitation of (,..), 
P is special phrase(e.g. Ic~'~) , 
is concatination symbol. 
We got about one hundred seventy FWs. These are classi- 
fied into two groups. In one group(contained 64 FWs), the 
FNs contain explicitly DW-EN relation information. In the 
other group(contained 105 FWs), some of the FWs contain 
usages of the EWe, which are also important to thesaurus. 
We have constructed a FN dictionary which includes FP 
and DW-EW relation information corresponding to the FP. 
3. SYSTEM FOR EXTRACTING DW-EW (HIERARCHICAL) RELATION 
The system consists of the following four steps. 
RE~tion of EW and DS 
(a) Extraction of EW, its DS, the part of speech of the 
EW, the definition number of the DS from the dictionary. 
(b) Transformation of the extracted DS to the ordinary 
Japanese sentence's form(called the normalized DS). 
Because several contents(meanings) are thrown into one DS 
by means of parentheses or dot ' ' in the dictionary. 
~D Extraction of FE and DW-EW rel~ information 
The FW Dictionary is used. 
(a) When DS dosen't include FW, DS is in TYPEI. 
(b) When DS includes FW and conforms FP, DS is in TYPEII. 
(c) When DS includes FW but doesn't conform FP or when DS 
includes more than one FW, the DS is picked out as check 
data.Because it is difficult to distinguish between DW and 
FW or to extract DW-EW relation information mechanically. 
(3) Extraction of DW and DW-HW relation informatio~ 
A general word dictionary (containing about 75,000 noun 
words)(S)is used, in which the character strings of entry 
words were arranged in inverse order (from right to left). 
DWs are basically extracted by means of longest matching 
method, because there is ordinarily no space between two 
adjacent words in the Japanese sentence. In addition to 
this. there are the following problems. 
(a) The 'hiragana' notation is often used(e.g. ~O)~b 
\[~b\] ). 
(b) The names of animals and plants are described by 
'katakana' (e.g. ~)P \[~\] ). 
(c) The unknown(compound) words are often used. 
(d) In some cases, the DS containes more than one ON. 
The oxtructing procedure has to be constructed with 
regard to these ploblems. 
The relation information are also extracted, that is, 'DW 
isn't modified' and 'The number of DN are more than one' . 
When DN isn't extracted (that is,DR is neither 'katakana' 
string nor 'kanji' string nor any entry word in the word 
dictionary) from DS, the DS is picked out as check data. 
(4) Decision of DW-EW relation 
According to the conditions above-mentioned, DW-EW rela- 
tions are decided. 
When extracted relation information is ambiguous, DS is 
picked out as check data. 
PE T R SU T 
A pilot system has been implemented on FACOM M-360(Naga- 
saki University Computer Center) and FACON N-382(Kyushu 
University Computer Center) mostly by PL/I. 
The experimental input data(2,824 DSs) in the first step, 
arc the normalized DSs. 
Table 1 shows the number of input, output and check data 
in each step and the number of correct and incorrect data 
in output data. 
Table 2 shows the extracted DW-EW relations and the num- 
ber of output data corresponding to the relations. 
The experimental results are as follows: 
(1) The ratio of TYPEI (2,374) to output data(2,?ll) is 
about 87.6%. 
(2) The ratio of TYPEI1(337) to output data(2,711) is 
about 12.4%. 
(3) The ratio of output data(2,434) to input data(2,824) 
is about 85%. 
(a) The ratio(called system precision) of correct output 
data(2,311) to output data(2,434) is about 95 %. 
(b) The ratio(called error ratio) of incorrect output 
data(123) to output data(2,434) is about 5%. 
(4) The ratio of check data(390) to input data(2,824) is 
about 14%. 
Table 1 
(1) Extraction 
of FE 
(2) Extraction 
of DW 
(3) Decision 
of Relation 
Result of 
Experiment 
The Number of Input Data, Output Data 
and Check Data in Each Step 
INPUT OUTPUT DATA 
DATA (correct:incorrect 
2,824 2,374(TYPE I ) 337(TYPE \[I 
) 
2,711 2,502 
(2,386: 116) 
2,386 2,318 
(2,311: 7) 
2,824 2,434 
(2,311: 123) 
CHECK 
DATA 
113 
209 
68 
390 
Table 2 DW-EW Relations and the Number of Output Data 
corresponding to Each Relation 
Relatio\[ 
DW: EW 
Subtota 
Total 
J orrect Dubious Incorrect 
Type Type Type 
i II I i II I i II 
1963 I 71 31 0 
o i io3 o i o 
4i II Oi I 
oi 9 oi i 
oi io oi o 
210~ 
!3. I ? 
oi o 
0 i 0 
O~ 0 
Oi 1 o i 
1 
'i ' o! 2 
2 
i 
Subtotal 
Type 
I i i1 
1966i 71 
12oi 1o~ 
24i 12 
oi 11 
oi 11 
2110 i 208 
2318 
Most of incorrect output data occur in the step of ex- 
traction of DWs which are described by 'hiragana' notation, 
because of limitaions of the longest matching method. 
The improvement of the results necessitates (a) analy- 
sis of the DSs, (b) reinforcement of the general word dic- 
tionary used for extracting the DWs. 
5~CONCLUDINO~RKK~_ 
(1) The similer researches have been carried out about 
several English dictionarys(e.g. LONONAN)(2)(~), however 
there is scaresly any about Japanese dictionary. 
(2) We have extracted automaticlly,DW<EW, DW~EW, DW~ EW 
in addition to DW>EW as the DW-EW relations. 
(3) Input data not suitable for conditions are picked out 
as check data in each step. 
(4) There are a shortage of semantic information (e.g. 
lack of the adequate DW) in the dictionary because of 
assuming the human usage of the dictionary. 
We have been investigating the followings. 
I .Development of a system for utilizing the dictionary (v) . 
II.Development of a system for hierarchically structuring 
among entry words in the dictionary(~=). 
Ill.Development of a man-assisted system for constructing 
a practical sized semantic dictionary (4). 
ACKNOWLEDGEMENT 
We will like to thank the member of Turumaru's laborato- 
ry in Nagasaki University, and in paticular, Mr. A.Uchida 
and Mr. K.Mizuno for their efforts of implementation. 
References
(1) S.Yokoyama: Preparation for the Data Management of a 
Japanese Dictionary, Bul. Electrotech Lab., Voi.41, No.ll, 
PP.19-27 (1977.11) 

(2) N.Nagao,J.Tujii,Y.Oeda,M.Taiyama: AN ATTENPT TO COM- 
PUTERIZED DICTIONARY DATA BASES, Prec. COLING80, PP.534-- 
542 (1980.i0) 

(3) A.Miohiels,J.Noel : APPROACH TO THESAURUS PRODUCTION, 
Prec. COLING82, PP,227-232 (1982.7) 

(4) S.Yoshida,H.Tsurumaru,T.Nitaka: MAN- ASSISTED MACHINE 
CONSTRUCTION OF A SEMANTIC DICTIONARY FOR NATURAL LANGUEOE 
PROCESSING,Prec. COLING82, PP.419-424 (1982.7) 

(5) K. Yoshimura,A.Yamasita,T.Hitaka,S.Yoshida : Automatic 
Extracting System of Technical Terms, NL Technical Report 
of IPS, 42-i (1984.3) 

(6) H.Tsurumaru,K.Mizuno,A.Uchida,T.Hitaka, S.Yoshida: 
Extraction of Hierarchical Relation between Words from 
Definition Sentence of Word, NL Technical Report of IPS, 
45-5 (1984.9) 

(?)H.Turumaru,A.Uchida: The Extraction and Organization of 
Information from the Ordinary Japanese Language Dictionary, 
Reports of the Faculty Engineering Nagasaki University, 
Vol.15, No.24 (1985.1) 
