Linguistic Knowledge Extraction from Real Language Behavior 
K.Shirai and T.Hamada
(Department of Electrical Engineering, Waseda University) 
(3-4-1 Ohkubo, Shinjuku-ku, Tokyo, Japan)
Abstract -- An approach to extracting
linguistic knowledge from real language
behavior is described. The method depends on
the extraction of word relations, whose
patterns are obtained by structuring the
dependency relations in sentences, called
Kakari-Uke relations in Japanese. As the
first step of this approach, a word
classification experiment utilizing those
patterns was carried out on 4178 sentences
of real language data. A system was built to
analyze the dependency structure of
sentences utilizing the knowledge base
obtained through this word classification,
and the effectiveness of the knowledge base
was evaluated. To develop this approach
further, a relation matrix that captures the
multiple interactions of words is proposed.
1. Introduction 
In natural language processing, one of
the major problems to be solved is how to
describe linguistic and semantic knowledge
in the system. If we use no particular
technique and capture the behavior of real
language as it is, the number of rules,
concepts and relations to be arranged may
grow very large. But that behavior contains
at least all the essential and primitive
elements of language that we want to find.
In this paper, we consider extracting
primitive elements from real linguistic
behavior and applying them to sentence
analysis. As these elements, we use the
dependency relation between words. (It is
called the Kakari-Uke relation in Japanese.)
[Figure: the sentence "John opened the door
with this key." annotated with dependency
labels; the labels SUBJ and INST are legible]
Fig.1 Dependency Relation Structure
(Kakari-Uke Relations)
2. Clustering of Words 
2.1. Clustering Method 
The word classification based on the
patterns of relations proceeds as follows.
First, a number of sentences are provided
and Kakari-Uke relations are assigned to
them. We call those sentences text data.
Next, we obtain the source side and the sink
side patterns of relations for each word
appearing in the text data. Then we
calculate a distance between words. The
distance is defined from the correspondence
between the patterns themselves and the
frequency of each relation making up the
patterns. Words are classified by a
clustering algorithm using this distance.
There are two types of distance: one for the
source side patterns and the other for the
sink side patterns. For each word, two
clustering processes are applied,
corresponding to those two types of
distance. In this paper, the resulting
dependency structure is called the knowledge
base.
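
The steps above can be sketched as follows.
This is a minimal illustration only: the
paper does not give the distance formula or
name the clustering algorithm, so the cosine
distance, the single-link merging, the
threshold value and the toy triples below
are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical text data: (source word, relation label, sink word),
# meaning that the source word depends on the sink word.
TEXT_DATA = [
    ("file", "OBJ", "open"), ("file", "OBJ", "close"),
    ("catalogue", "OBJ", "open"), ("catalogue", "OBJ", "close"),
    ("user", "SUBJ", "open"), ("user", "SUBJ", "close"),
    ("operator", "SUBJ", "open"), ("operator", "SUBJ", "close"),
]


def relation_patterns(triples):
    """Collect the source side and sink side relation patterns of each word."""
    source_side = defaultdict(Counter)  # word used as the source of a relation
    sink_side = defaultdict(Counter)    # word used as the sink of a relation
    for source, label, sink in triples:
        source_side[source][(label, sink)] += 1
        sink_side[sink][(label, source)] += 1
    return source_side, sink_side


def distance(p, q):
    """Assumed distance: 1 - cosine similarity of two frequency patterns."""
    dot = sum(p[key] * q[key] for key in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0


def cluster(patterns, threshold=0.5):
    """Assumed algorithm: merge groups while any cross pair is within threshold."""
    groups = [{word} for word in sorted(patterns)]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if any(distance(patterns[a], patterns[b]) <= threshold
                       for a in groups[i] for b in groups[j]):
                    groups[i] |= groups.pop(j)
                    merged = True
                    break
            if merged:
                break
    return groups


source_side, sink_side = relation_patterns(TEXT_DATA)
source_clusters = cluster(source_side)  # one clustering per type of distance
sink_clusters = cluster(sink_side)
```

With this toy data, words that share the
same source side pattern ("file" and
"catalogue", "user" and "operator") fall
into the same group, and the two verbs are
grouped by their sink side pattern.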
2.2. Results
We carried out a word clustering
experiment on 4178 sentences of text data
taken from computer manuals. In this
experiment, compound words were given
special treatment to secure enough
information. Japanese sentences contain many
compound words, which are made by combining
words and act as one word. They are called
Fukugo-go in Japanese. If we treat them all
as different from each other, many words
appear only rarely, so that the relation
patterns of each word cannot be captured
sufficiently. For this reason, we adopted a
mechanism that replaces a compound word with
an ordinary word that has the same meaning
and grammatical role as the compound. This
mechanism works automatically as a part of
the system.
As the result of this experiment, it
was observed, as expected, that semantically
related words tend to be grouped together.
However, some words with different meanings
were combined with an otherwise well
classified word group, and several well
classified groups were combined with each
other. Not only synonyms, but also words
whose extensions partly overlap and words
that share part of a superordinate concept
tend to be combined. It is interesting that
antonyms tend to be combined with each
other. It was also found that words
contained in the same group almost always
belong to the same part of speech.
3. Sentence Analysis 
3.1. Sentence Analysis System ESSAY 
We built ESSAY (Experimental System of
Sentence Analysis), which analyzes the
dependency structure of sentences using the
knowledge base. The outline of this system
is shown in Fig.2. If the relation patterns
are used just as they were obtained from the
text data, they can cover only the relations
that appeared in the text data. But the
clustering process allows the system to
cover more relations than appeared in the
text data.
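
How clustering widens the coverage can be
sketched as follows. The paper does not
describe the internal representation of the
knowledge base, so the group tables, the
group names and the two functions below are
hypothetical: each observed relation is
lifted to the word groups of its two ends,
and a new word pair is accepted as a
relation candidate when its groups match a
lifted relation.

```python
# Hypothetical relations observed in the text data: (source, label, sink).
OBSERVED = {("file", "OBJ", "open"), ("catalogue", "OBJ", "delete")}

# Hypothetical clustering results: word -> group name, one table per side.
SOURCE_GROUP = {"file": "S1", "catalogue": "S1"}
SINK_GROUP = {"open": "K1", "delete": "K1"}


def lift(relation, source_group, sink_group):
    """Replace both words by their groups; unclustered words stay as they are."""
    source, label, sink = relation
    return (source_group.get(source, source), label, sink_group.get(sink, sink))


def generalize(observed, source_group, sink_group):
    """Lift every observed relation to the group level."""
    return {lift(r, source_group, sink_group) for r in observed}


def is_candidate(relation, generalized, source_group, sink_group):
    """Accept a relation if its group-level form was seen in the text data."""
    return lift(relation, source_group, sink_group) in generalized


GENERALIZED = generalize(OBSERVED, SOURCE_GROUP, SINK_GROUP)

# "catalogue -OBJ- open" never appeared in OBSERVED, but it is covered
# because "catalogue" shares a group with "file".
unseen = ("catalogue", "OBJ", "open")
covered = is_candidate(unseen, GENERALIZED, SOURCE_GROUP, SINK_GROUP)
```

The same lifting also explains the risk
noted in section 2.2: if unrelated words are
merged into one group, relations that never
occur are accepted as candidates.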
[Figure: flowchart; the legible box labels
include "Sentences", "Relation Candidate"
and "Evaluation"]
Fig.2 General Flow of ESSAY
[Figure: sample ESSAY output; the Japanese
text is not recoverable, so only the English
glosses are given]
SENT.NO. = 4
INPUT (gloss): The relations of parameters
about privacy security of VSAM catalogue are
shown in Fig.1-3-1-1-4.
WORD COMBINATION = 1
SYNONYM COMBINATION = 1
EVAL POINT = 90
Glosses of the analyzed phrases, in order of
output: (in Fig.1-3-1-1-4) (of VSAM
catalogue) (privacy security) (about)
(of parameters) (the relations) (are shown)
Fig.3 A Sample of Analysis Results
3.2. An Experiment 
We carried out a sentence analysis
experiment with ESSAY. The knowledge base
was organized from the 4178 sentences of
text data taken from computer manuals. The
input sentences provided for the test were
not contained in the sentences used for
organizing the knowledge base. A sample of
the analysis result is shown in Fig.3. A
Bunsetu (a kind of phrase structure element)
may be divisible into words and Fuzoku-go in
several ways. The system tests some
combinations of those divisions. In this
figure, EVAL POINT indicates the value
evaluated for each structure, calculated
from the likelihood of each relation
constructing the structure. The conclusions
can be summarized as follows:
[Figure: plots against Sentence Length
(Number of Bunsetsu), with two curves:
without using Fuzoku-go and using Fuzoku-go
(*1)]
*1: The experiment was done under two
conditions, using and without using
Fuzoku-go for analysis, in order to examine
the effect of Fuzoku-go.
*2: The rate at which the analysis succeeds.
*3: The order of the correct candidate in
the analysis results.
*4: The rate at which the correct candidate
is ranked first.
Fig.4 Analysis Results of every Sentence Length
a) A long sentence with many Bunsetu often
produces too many combinations of relation
candidates.
b) In some cases no result is obtained
because a few words have no relation
candidate, although all the other words have
the correct relations.
c) It is difficult to describe a parallel
relation using relations between two words.
Therefore, it is difficult to analyze a
sentence containing parallel relations.
d) The rate at which the analysis succeeds
depends on the length of the sentence. As
the sentence becomes longer, the rate
becomes lower. The average rate was about
40 per cent.
This result is shown in Fig.4.
4. More Complicated Data Structure
ESSAY decides each relation according
to the connection between two words only.
The other parts of the sentence play no role
in this decision at all. But in actual
sentences the relations interact with one
another in complicated ways. In this
section, we describe how to deal with the
interaction of the relations to provide a
wider ground for judging the propriety of
relations.
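(a) Example sentences (English glosses only;
the Japanese text is not recoverable):
(he) (to school) (by bus) (goes)
(he) (to school) (at 6 o'clock) (goes)
(b) [Matrix with rows and columns R1 to R4;
the element values are not recoverable]
Fig.5 Relation Matrix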
4.1. Co-occurrence of Relations 
Some words relate to two or more other
words at the same time. As shown in
Fig.5(a), four kinds of relations appear in
the text data. If two or more kinds of
relations appear at the same time, the
frequency of their co-occurrence is counted.
The frequency table is then expressed by a
matrix called the relation matrix, shown in
Fig.5(b). The element Mii is the frequency
of Ri itself, and the element Mij is the
frequency with which both Ri and Rj appear
at the same time. This matrix is obtained
for each word that has been related to two
or more words at the same time. Utilizing
this matrix, we get a wider ground for
judging the propriety of relations. When the
relation "go -(to)- school" is certain, the
elements M2i and Mi2 (i≠2) of the matrix
give the probability of each relation Ri in
this situation.
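
The construction of the relation matrix can
be sketched with the two sentences of
Fig.5(a). Several points are assumptions
made for the example: R2 is "to school" as
in the text, while the assignment of R1
(he), R3 (by bus) and R4 (at 6 o'clock)
follows the order of appearance, and the
estimate Mij / Mii for the probability of Rj
given Ri is one plausible reading, since the
paper does not state the formula.

```python
from itertools import combinations

RELATIONS = ["R1", "R2", "R3", "R4"]

# Relations that co-occur on the word "goes" in each example sentence.
OCCURRENCES = [
    {"R1", "R2", "R3"},  # (he) (to school) (by bus) (goes)
    {"R1", "R2", "R4"},  # (he) (to school) (at 6 o'clock) (goes)
]


def relation_matrix(occurrences, relations):
    """M[i][i] counts Ri itself; M[i][j] counts Ri and Rj appearing together."""
    index = {name: i for i, name in enumerate(relations)}
    size = len(relations)
    matrix = [[0] * size for _ in range(size)]
    for occurrence in occurrences:
        for name in occurrence:
            matrix[index[name]][index[name]] += 1
        for a, b in combinations(sorted(occurrence), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1  # keep the matrix symmetric
    return matrix


def conditional(matrix, given, other):
    """Assumed estimate of P(R_other | R_given) = M[given][other] / M[given][given]."""
    if matrix[given][given] == 0:
        return 0.0
    return matrix[given][other] / matrix[given][given]


M = relation_matrix(OCCURRENCES, RELATIONS)

# Given that R2 ("go -(to)- school") holds, estimate each other relation.
given_r2 = {name: conditional(M, 1, i)
            for i, name in enumerate(RELATIONS) if i != 1}
```

In this toy data R3 and R4 never co-occur,
so M34 is 0; a candidate structure that
contains both would receive no support from
the matrix.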
4.2. Effect of the Relation Matrix
Using this matrix, the ground for
judging the propriety of the relations
becomes wider, and the number of candidates
can be effectively reduced. In addition,
because each relation becomes more reliable,
the relations obtained are expected to agree
with the meaning of the sentence.
5. Conclusion 
We have introduced a bottom-up approach
to organizing a linguistic knowledge base.
The organization of a knowledge base has
required continuous human effort. The
vocabulary of the knowledge base depends on
the quantity of text data.
A linguistic knowledge base organized in
this manner may not be as powerful as those
constructed analytically. But such a method
may open an automatic way of knowledge
acquisition, and it may make it possible to
discover rules and properties that we have
never noticed.
