Analysis of Japanese Compound Nouns 
using Collocational Information 
KOBAYASI Yosiyuki TOKUNAGA Takenobu TANAKA Hozumi
Department of Computer Science 
Tokyo Institute of Technology 
{yashi, take, tanaka}@cs.titech.ac.jp
Abstract 
Analyzing compound nouns is one of the crucial issues for
natural language processing systems, in particular for those
systems that aim at a wide coverage of domains. In this paper,
we propose a method to analyze structures of Japanese compound
nouns by using both word collocation statistics and a
thesaurus. An experiment is conducted with 160,000 word
collocations to analyze compound nouns with an average length
of 4.9 characters. The accuracy of this method is about 80%.
1 Introduction 
Analyzing compound nouns is one of the crucial issues for
natural language processing systems, in particular for those
systems that aim at a wide coverage of domains. Registering all
compound nouns in a dictionary is an impractical approach,
since we can create a new compound noun by combining nouns.
Therefore, a mechanism to analyze the structure of a compound
noun from the individual nouns is necessary.
In order to identify the structure of a compound noun, we must
first find the set of words that compose the compound noun.
This task is trivial for languages such as English, where
words are separated by spaces. The situation is worse, however,
in Japanese, where no spaces are placed between words. The
process of identifying word boundaries is usually called
segmentation. In processing languages such as Japanese,
ambiguities in segmentation should be resolved at the same
time as analyzing structure. For instance, the Japanese
compound noun "新型間接税" (new indirect tax) has 16 (= 2^4)
possible segmentations. (By consulting a Japanese dictionary,
we can filter out some of them.) In this case, two
possibilities remain: "新 (new)/型 (type)/間接 (indirect)/税
(tax)" and "新型 (new)/間接 (indirect)/税 (tax)."¹ We must
choose the correct segmentation, "新型 (new)/間接 (indirect)/税
(tax)," and analyze its structure.
¹ Here "/" denotes a boundary of words.
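The count of 16 candidate segmentations and the dictionary filtering described above can be sketched as follows. This is only an illustrative sketch: the function name `segmentations` and the five-entry `DICTIONARY` are assumptions made for this example, not part of the system described in this paper.

```python
from itertools import product


def segmentations(text):
    """Yield every way to cut `text` into contiguous, non-empty pieces.

    A string of n characters has n - 1 gaps, each either cut or not,
    which gives 2 ** (n - 1) segmentations.
    """
    n = len(text)
    for cuts in product([False, True], repeat=n - 1):
        pieces, start = [], 0
        for position, cut in enumerate(cuts, start=1):
            if cut:
                pieces.append(text[start:position])
                start = position
        pieces.append(text[start:])
        yield pieces


# Hypothetical mini-dictionary standing in for real dictionary headwords.
DICTIONARY = {"新", "型", "新型", "間接", "税"}

compound = "新型間接税"
all_segs = list(segmentations(compound))
# Keep only segmentations whose every piece is a dictionary headword.
valid = [seg for seg in all_segs if all(w in DICTIONARY for w in seg)]

print(len(all_segs))  # number of raw segmentations of a 5-character string
print(valid)
```

With this toy dictionary, only the two segmentations discussed in the text survive the filter.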
Segmentation of Japanese is difficult when using only
syntactic knowledge. Therefore, we cannot always expect a
sequence of correctly segmented words as the input to
structure analysis. Information about structure is also
expected to improve segmentation accuracy.
Several studies have attacked this problem. Fuzisaki et al.
applied the HMM model to segmentation and a probabilistic CFG
to analyzing the structure of compound nouns [3]. The accuracy
of their method is 73% in identifying correct structures of
kanzi character sequences with an average length of 4.2
characters. In their approach, word boundaries are identified
through purely statistical information (the HMM model) without
regard to linguistic knowledge such as dictionaries.
Therefore, the HMM model may suggest an improper character
sequence as a word. Furthermore, since the nonterminal symbols
of the CFG are derived from a statistical analysis of word
collocations, their number tends to be large, and so the
number of CFG rules is also large. They assumed that compound
nouns consist only of one-character words and two-character
words. It is questionable whether this method can be extended
to handle cases that include words of more than two characters
without lowering accuracy.
In this paper, we propose a method to analyze structures of
Japanese compound nouns by using word collocational
information and a thesaurus. The collocational information is
acquired from a corpus of four kanzi character words. The
outline of the procedure to acquire the collocational
information is as follows:
• extract collocations of nouns from a corpus of four kanzi
  character words
• replace each noun in the collocations with its thesaurus
  category, to obtain collocations of thesaurus categories
• count occurrence frequencies for each collocational pattern
  of thesaurus categories
For each possible structure of a compound noun, a preference
is calculated based on this collocational information, and the
structure with the highest score wins.
Hindle and Rooth also used collocational information to
resolve ambiguities of pp-attachment in English [5].
Ambiguities are resolved by comparing the strength of
associativity between a preposition and a verb with that
between the preposition and a nominal head. The strength of
associativity is calculated on the basis of occurrence
frequencies of word collocations in a corpus. Besides word
collocation information, we also use semantic knowledge,
namely, a thesaurus.
The structure of this paper is as follows: Section 2 explains
the knowledge for structure analysis of compound nouns and the
procedures to acquire it from a corpus, Section 3 describes
the analysis algorithm, Section 4 describes the experiments
conducted to evaluate the performance of our method, and
Section 5 summarizes the paper and discusses future research
directions.
2 Collocational Information for Analyzing Compound Nouns
This section describes procedures to acquire collocational
information for analyzing compound nouns from a corpus of four
kanzi character words. What we need is occurrence frequencies
of all word collocations. It is not realistic, however, to
collect all word collocations. Instead, we use collocations of
thesaurus categories, which are abstractions of words.
The procedure consists of the following four steps:
1. collect four kanzi character words (section 2.1)
2. divide each of the above words in the middle to produce a
   pair of two kanzi character words; if either one is not in
   the thesaurus, the four kanzi character word is discarded
   (section 2.1)
3. assign thesaurus categories to both two kanzi character
   words (section 2.2)
4. count occurrence frequencies of category collocations
   (section 2.3)
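Steps 1 and 2 above can be sketched as follows. This is a minimal sketch under stated assumptions: `HEADWORDS` is a hypothetical stand-in for the thesaurus headword list, the three-word `corpus` is invented for illustration, and the function names are not taken from the paper.

```python
# Hypothetical thesaurus headwords (a stand-in for the real thesaurus).
HEADWORDS = {"経済", "政策", "国際"}


def split_in_middle(word4):
    """Divide a four-character word into two two-character halves."""
    if len(word4) != 4:
        raise ValueError("expected a four-character word")
    return word4[:2], word4[2:]


def word_collocations(corpus, headwords):
    """Return (left, right) pairs, discarding any four-character word
    in which either half is not a thesaurus headword."""
    pairs = []
    for word4 in corpus:
        left, right = split_in_middle(word4)
        if left in headwords and right in headwords:
            pairs.append((left, right))
    return pairs


# Invented example corpus of four kanzi character words.
corpus = ["経済政策", "国際経済", "経済成長"]
pairs = word_collocations(corpus, HEADWORDS)
print(pairs)
```

In this toy run the third word is discarded because its right half is not in the hypothetical headword set.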
2.1 Collecting Word Collocations 
We use a corpus of four kanzi character words as 
the knowledge source of collocational information. 
The reasons are as follows:
• In Japanese, kanzi character sequences longer than three
  characters are usually compound nouns. This tendency is
  confirmed by comparing the occurrence frequencies of kanzi
  character words in texts with those of headwords in
  dictionaries. We investigated the tendency by using sample
  texts from newspaper articles and encyclopedias, and Bunrui
  Goi Hyou (BGH for short), which is a standard Japanese
  thesaurus. The sample texts include about 220,000 sentences.
  We found that words of three characters and longer represent
  4% of the thesaurus, but 71% of the sample texts. Therefore,
  a collection of four kanzi character words can be used as a
  corpus of compound nouns.
• Four kanzi character sequences are useful for extracting
  binary relations of nouns, because dividing a four kanzi
  character sequence in the middle often gives the correct
  segmentation. Our preliminary investigation shows that the
  accuracy of this heuristic is 96% (961/1000).
• There is a fairly large corpus of four kanzi character words
  created by Prof. Tanaka Yasuhito at Aiti Syukutoku College
  [8]. The corpus was manually created from newspaper articles
  and includes about 160,000 words.
2.2 Assigning Thesaurus Categories 
After collecting word collocations, we must assign a thesaurus
category to each word. This is a difficult task because some
words are assigned multiple categories. In such cases, we
obtain several category collocations from a single word
collocation, some of which are incorrect. The choices are as
follows:
(1) use only word collocations in which every word is assigned
    a single category
(2) equally distribute the frequency of a word collocation to
    all possible category collocations [4]
(3) calculate the probability of each category collocation and
    distribute the frequency based on these probabilities; the
    probabilities of collocations are calculated by using
    method (2) [4]
(4) determine the correct category collocation by using
    statistical methods other than word collocations
    [2, 10, 9, 6]
Fortunately, few words are assigned multiple categories in
BGH. Therefore, we use method (1). Word collocations
containing words with multiple categories represent about 1/3
of the corpus. If we used other thesauruses, which assign
multiple categories to more words, we would need to use method
(2), (3), or (4).
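The difference between methods (1) and (2) can be illustrated with a short sketch. Everything here is hypothetical: the category labels `c1` to `c5`, the word-to-category table `cats`, and the function names are invented for the example and do not come from BGH.

```python
from collections import Counter
from itertools import product


def method1(freqs, cats):
    """Method (1): keep a word collocation only if both words have
    exactly one category."""
    out = Counter()
    for (left, right), f in freqs.items():
        if len(cats[left]) == 1 and len(cats[right]) == 1:
            out[(cats[left][0], cats[right][0])] += f
    return out


def method2(freqs, cats):
    """Method (2): spread the frequency equally over every possible
    category collocation."""
    out = Counter()
    for (left, right), f in freqs.items():
        combos = list(product(cats[left], cats[right]))
        for combo in combos:
            out[combo] += f / len(combos)
    return out


# Hypothetical category table: "関係" is ambiguous between two categories.
cats = {"経済": ["c1"], "政策": ["c2"], "国際": ["c3"], "関係": ["c4", "c5"]}
freqs = {("経済", "政策"): 1, ("国際", "関係"): 1}

print(dict(method1(freqs, cats)))
print(dict(method2(freqs, cats)))
```

Method (1) simply drops the ambiguous collocation, which is affordable only when few words are ambiguous; method (2) keeps it at the cost of crediting some incorrect category pairs.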
2.3 Counting Occurrence of Category Collocations
After assigning the thesaurus categories to words, we count
occurrence frequencies of category collocations as follows:
1. collect word collocations; at this stage we collect only
   the patterns of word collocations and do not care about the
   occurrence frequencies of the patterns
2. replace words with thesaurus categories to produce category
   collocation patterns
3. count the number of category collocation patterns
Note: we do not care about frequencies of word collocations
prior to replacing words with thesaurus categories.
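The counting procedure can be sketched as follows. The sketch assumes that each word has already been given a single category (method (1) of section 2.2); the labels `X` and `Y`, the table `category_of`, and the function name are hypothetical.

```python
from collections import Counter


def count_category_collocations(word_pairs, category_of):
    """Count category collocation patterns over distinct word collocations."""
    # Step 1: keep only the patterns of word collocations; how often each
    # word pair occurs in the corpus is deliberately ignored.
    distinct_pairs = set(word_pairs)
    # Steps 2 and 3: map words to categories and count the resulting
    # category patterns.
    return Counter(
        (category_of[left], category_of[right]) for left, right in distinct_pairs
    )


# Hypothetical single-category assignments.
category_of = {"経済": "X", "国際": "X", "政策": "Y"}
# The first pair occurs twice but is counted once.
word_pairs = [("経済", "政策"), ("経済", "政策"), ("国際", "政策")]

counts = count_category_collocations(word_pairs, category_of)
print(counts)
```

Here the category pattern ("X", "Y") receives a count of 2, one for each distinct word collocation that maps to it.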
3 Algorithm 
The analysis consists of three steps: 
1. enumerate possible segmentations of an input compound noun
   by consulting headwords of the thesaurus (BGH)
2. assign thesaurus categories to all words
3. calculate the preferences of every structure of the
   compound noun according to the frequencies of category
   collocations
We assume that the structure of a compound noun can be
expressed by a binary tree. We also assume that the category
of the right branch of a (sub)tree represents the category of
the (sub)tree itself. This assumption is made because Japanese
is a head-final language: a modifier is on the left of its
modifiee. With these assumptions, the preference value of a
structure is calculated by the recursive function p as
follows:
$$
p(t) =
\begin{cases}
1 & \text{if } t \text{ is a leaf} \\
p(l(t)) \cdot p(r(t)) \cdot cv(cat(l(t)), cat(r(t))) & \text{otherwise}
\end{cases}
$$
where the functions l and r return the left and right subtree
of the tree respectively, and cat returns the thesaurus
category of its argument. If the argument of cat is a tree,
cat returns the category of the rightmost leaf of the tree.
The function cv returns an associativity measure of two
categories, which is calculated from the frequency of category
collocations described in the previous section. We use two
measures for cv. In both, P(cat1, cat2) is the relative
frequency of the collocation in which cat1 appears on the left
side and cat2 appears on the right.
Probability:
$$
cv_1 = P(cat_1, cat_2)
$$
Modified mutual information statistics (MIS):
$$
cv_2 = \frac{P(cat_1, cat_2)}{P(cat_1, *) \cdot P(*, cat_2)}
$$
where * means "don't care".
MIS is similar to the mutual information used by Church to
calculate semantic dependencies between words [1]. MIS differs
from mutual information in that it takes account of the
position of the word (left/right).
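Both measures can be computed directly from a table of category collocation counts. The following is a minimal sketch: the counts are invented for illustration, and `make_measures` is a hypothetical helper name, not the authors' implementation.

```python
from collections import Counter


def make_measures(counts):
    """Build cv1 and cv2 from {(left_cat, right_cat): count}."""
    total = sum(counts.values())
    left_totals, right_totals = Counter(), Counter()
    for (c1, c2), n in counts.items():
        left_totals[c1] += n   # occurrences of c1 on the left: P(c1, *)
        right_totals[c2] += n  # occurrences of c2 on the right: P(*, c2)

    def prob(c1, c2):
        return counts.get((c1, c2), 0) / total

    def cv1(c1, c2):
        """Relative frequency of the collocation (c1, c2)."""
        return prob(c1, c2)

    def cv2(c1, c2):
        """Relative frequency divided by the product of the positional
        marginals; defined as 0.0 when a marginal is zero (an assumption)."""
        denom = (left_totals[c1] / total) * (right_totals[c2] / total)
        return prob(c1, c2) / denom if denom else 0.0

    return cv1, cv2


# Invented counts for three category collocations.
counts = {("118", "311"): 3, ("311", "137"): 4, ("118", "137"): 1}
cv1, cv2 = make_measures(counts)
print(cv1("118", "311"), cv2("118", "311"))
```

Note that the left and right marginals are kept separately, which is what makes the measure sensitive to word position.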
Let us consider the example "新型間接税".

Segmentation: two possibilities, (1) "新型 (new)/間接
(indirect)/税 (tax)" and (2) "新 (new)/型 (type)/間接
(indirect)/税 (tax)", remain as mentioned in section 1.

Category assignment: assigning thesaurus categories provides
(1)' "新型 [118]/間接 [311]/税 [137]" and
(2)' "新 [316]/型 [118:141:111]/間接 [311]/税 [137]."
A three-digit number stands for a thesaurus category. A colon
":" separates multiple categories assigned to a word.

Preference calculation: For the case (1)', the possible
structures are [[118, 311], 137] and [118, [311, 137]]. We
represent a tree in list notation. For the case (2)', there is
an ambiguity in the category of "型" [118:141:111]. We expand
the ambiguity into 15 possible structures. Preferences are
calculated for 17 cases. For example, the preference of the
structure [[118, 311], 137] is calculated as follows:
$$
\begin{aligned}
p([[118, 311], 137]) &= p([118, 311]) \cdot p(137) \cdot cv(311, 137) \\
&= p(118) \cdot p(311) \cdot cv(118, 311) \cdot cv(311, 137) \\
&= cv(118, 311) \cdot cv(311, 137)
\end{aligned}
$$
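The recursion and the count of 17 candidate structures can be reproduced with a short sketch. Trees are written as nested two-element lists, as in the list notation above. The values in `TABLE` are invented, and the helper names (`trees`, `candidates`) are assumptions made for this example.

```python
from itertools import product


def trees(leaves):
    """All binary trees over a leaf sequence, as nested [left, right] lists."""
    if len(leaves) == 1:
        return [leaves[0]]
    result = []
    for i in range(1, len(leaves)):
        for left in trees(leaves[:i]):
            for right in trees(leaves[i:]):
                result.append([left, right])
    return result


def cat(t):
    """Category of a tree: the category of its rightmost leaf."""
    return cat(t[1]) if isinstance(t, list) else t


def p(t, cv):
    """Preference of tree t under the associativity measure cv."""
    if not isinstance(t, list):
        return 1.0
    left, right = t
    return p(left, cv) * p(right, cv) * cv(cat(left), cat(right))


def candidates(word_cats):
    """Expand category ambiguity, then enumerate every tree."""
    result = []
    for choice in product(*word_cats):
        result.extend(trees(list(choice)))
    return result


case1 = [["118"], ["311"], ["137"]]
case2 = [["316"], ["118", "141", "111"], ["311"], ["137"]]
print(len(candidates(case1)), len(candidates(case2)))

# Invented associativity values, for illustration only.
TABLE = {("118", "311"): 0.5, ("311", "137"): 0.25, ("118", "137"): 0.1}
cv = lambda a, b: TABLE.get((a, b), 0.0)
print(p([["118", "311"], "137"], cv), p(["118", ["311", "137"]], cv))
```

A four-word sequence has five binary trees, and the three categories of the ambiguous word multiply this to 15; together with the two trees of the three-word segmentation this gives the 17 cases mentioned in the text.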
4 Experiments 
4.1 Data and Analysis 
We extract kanzi character sequences from newspaper editorials
and columns and from encyclopedia text, none of which overlaps
with the training corpus. From this set of kanzi character
sequences, 954 compound nouns consisting of four kanzi
characters, 710 compound nouns consisting of five kanzi
characters, and 786 compound nouns consisting of six kanzi
characters were manually extracted. These three collections of
compound nouns are used as test data.
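As a quick check of the average length of 4.9 characters quoted for the test data, the weighted mean over the three collections can be computed directly from the sizes given above. The variable names are illustrative.

```python
# Test-set sizes stated above, keyed by compound length in kanzi characters.
test_sets = {4: 954, 5: 710, 6: 786}

total = sum(test_sets.values())
average_length = sum(length * n for length, n in test_sets.items()) / total

print(total, round(average_length, 1))  # prints: 2450 4.9
```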
We use the thesaurus BGH, which is a standard machine-readable
Japanese thesaurus. BGH is structured as a tree with six
hierarchical levels. Table 1 shows the number of categories at
all levels. In this experiment, we use the categories at level
3. If we had more compound nouns as knowledge, we could use a
finer hierarchy level.
Table 1 The number of categories
As mentioned in Section 2, we create a set of collocations of
thesaurus categories from a corpus of four kanzi character
sequences and BGH.
We analyze the test data according to the procedures described
in Section 3. In segmentation, we use the heuristic
"minimizing the number of content words" in order to prune the
search space. This heuristic is commonly used in Japanese
morphological analysis. The correct structures of the test
data were manually created in advance.
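The pruning heuristic can be sketched as a filter over candidate segmentations. The sketch assumes that every word in a candidate counts as a content word, which seems reasonable for compound nouns made of nouns but is an assumption of this example; the function name is hypothetical.

```python
def fewest_words(candidates):
    """Keep only the segmentations with the minimum number of words."""
    if not candidates:
        return []
    shortest = min(len(seg) for seg in candidates)
    return [seg for seg in candidates if len(seg) == shortest]


candidates = [["新", "型", "間接", "税"], ["新型", "間接", "税"]]
print(fewest_words(candidates))
```

When several candidates tie for the minimum, all of them are kept, so the heuristic can leave some ambiguity for the structure analysis to resolve.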
4.2 Results and Discussions 
Table 2 shows the result of the analysis for four, five, and
six kanzi character sequences. "oc" means that the correct
answer was not obtained because the segmentation heuristic
filtered out the correct segmentation. The first row shows the
percentage of cases where the correct answer is uniquely
identified, with no tie. The rows denoted "~ n" show the
percentage of correct answers within the n-th rank. "4 ~"
shows the percentage of correct answers ranked 4th or lower.
Table 2 Accuracy of analysis [%]

        4 kanzi     5 kanzi     6 kanzi
rank    cv1  cv2    cv1  cv2    cv1  cv2
1        96   96     63   59     48   53
~ 1      97   96     71   68     54   68
~ 2      99   99     91   91     89   93
~ 3      99   99     92   92     91   94
4 ~     0.1  0.1      2    2      4    4
oc        1    1      6    6      5    2
Regardless of the measure, roughly 90% or more of the correct
answers are within the second rank. The probabilistic measure
cv1 provides better accuracy than the mutual information
measure cv2 for five kanzi character compound nouns, but the
result is reversed for six kanzi character compound nouns. The
results for four kanzi character words are almost equal. In
order to judge which measure is better, we need further
experiments with longer words.
We could not obtain the correct segmentation for 11 out of 954
cases for four kanzi character words, 39 out of 710 cases for
five kanzi character words, and 15 out of 787 cases for six
kanzi character words. Therefore, the accuracy of segmentation
candidates is 99% (943/954), 94.5% (671/710), and 98.1%
(772/787) respectively. Segmentation failure is due to words
missing from the dictionary and to the heuristic we adopted.
As mentioned in Section 1, it is difficult to segment
correctly by using only syntactic knowledge. We used the
heuristic to reduce ambiguities in segmentation, but
ambiguities may remain. In these experiments, there are 75
cases where ambiguities cannot be resolved by the heuristic:
11 cases for four kanzi character words, 35 cases for five
kanzi character words, and 29 cases for six kanzi character
words. For these cases, by using cv1, the correct segmentation
can be uniquely identified by applying the structure analysis
for 7, 19, and 17 cases, and the correct structure can be
uniquely identified for 7, 10, and 8 cases, for the three
collections of test data respectively. On the other hand, by
using cv2, 4, 18, and 21 cases are correctly segmented and 7,
11, and 8 cases have their structures correctly analyzed, for
the three collections respectively.
For a sequence of segmented words, there are several possible
structures. Table 3 shows the possible structures for four
word sequences and their occurrences in all data collections.
Since the compound nouns of our test data consist of four,
five, or six characters, there could be compound nouns
consisting of four, five, or six words. In the current data
collections, however, there are no such cases.
In Table 3, we find a significant deviation among the
occurrences of structures. This deviation has a strong
correlation with the distance between modifiers and
modifiees. The rightmost column (labeled Σd) shows the sum of
the distances between modifiers and modifiees contained in the
structure. The distance is measured by the number of words
between a modifier and a modifiee. For instance, the distance
is one if a modifier and a modifiee are immediately adjacent.
The correlation between the distance and the occurrence of
structures tells us that a modifier tends to modify a closer
modifiee. This tendency has been experimentally shown by
Maruyama [7]. The tendency is expressed by the following
formula:
$$
q(d) = 0.54 \cdot d^{-1.896}
$$
where d is the distance between two words and q(d) is the
probability that two words at distance d are in a modification
relation.
We redefined cv by taking this tendency into account, as in
the following formula:
$$
cv' = cv \cdot q(d)
$$
where cv' is the redefined cv. Table 4 shows the result of
using the new cvs. We obtained significant improvement for the
5 kanzi and 6 kanzi collections.
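The distance-weighted preference can be sketched as follows, assuming the formula reads q(d) = 0.54 · d^(-1.896). The sketch makes one further assumption that the text does not spell out: within a (sub)tree, the modifier is the rightmost word of the left branch and the modifiee is the rightmost word of the right branch, so their distance equals the number of words in the right branch. All function names are hypothetical.

```python
def q(d):
    """Distance weight, assuming q(d) = 0.54 * d ** -1.896."""
    return 0.54 * d ** -1.896


def n_leaves(t):
    """Number of words in a tree written as nested [left, right] lists."""
    return n_leaves(t[0]) + n_leaves(t[1]) if isinstance(t, list) else 1


def cat(t):
    """Category of a tree: the category of its rightmost leaf."""
    return cat(t[1]) if isinstance(t, list) else t


def sum_d(t):
    """Sum of modifier-to-modifiee distances in the structure."""
    if not isinstance(t, list):
        return 0
    left, right = t
    return sum_d(left) + sum_d(right) + n_leaves(right)


def p_dist(t, cv):
    """Preference using cv' = cv * q(d) at every internal node."""
    if not isinstance(t, list):
        return 1.0
    left, right = t
    d = n_leaves(right)  # distance between the two heads (assumption above)
    return p_dist(left, cv) * p_dist(right, cv) * cv(cat(left), cat(right)) * q(d)


uniform_cv = lambda a, b: 1.0  # neutral measure, to isolate the effect of q
left_branching = [["w1", "w2"], "w3"]
right_branching = ["w1", ["w2", "w3"]]
print(sum_d(left_branching), sum_d(right_branching))
print(p_dist(left_branching, uniform_cv) > p_dist(right_branching, uniform_cv))
```

With a neutral cv, the structure with the smaller total distance receives the higher preference, which is the bias the redefined measure is meant to introduce.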
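Table 3 Table of possible structures
(columns: structure, 4 kanzi, 5 kanzi, 6 kanzi, Σd; the row
layout is not recoverable)
legible structures: [w1, [w2, w3]]; [[w1, w2], [w3, w4]]
legible counts before the "6 kanzi" column: 0, 268, 269, 96, 13, 3
counts in the "6 kanzi" column: 1, 85, 405, 166, 48, 45, 3, 8, 11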
Table 4 Accuracy of analysis [%]
(? marks a value that is not legible)

        4 kanzi     5 kanzi     6 kanzi
rank    cv1  cv2    cv1  cv2    cv1  cv2
1        96   97     73   79     66   70
~ 1      97   97     73   79     62    ?
~ 2      99   99     91   92      ?    ?
~ 3      99   99     93   93     92    ?
4 ~     0.1  0.1      ?    ?      ?    ?
We assumed that the thesaurus category of a tree is
represented by the category of its right branch subtree
because Japanese is a head-final language. However, when the
right subtree is a word such as a suffix, this assumption does
not always hold true. Since our ultimate aim is to analyze the
semantic structures of compound nouns, dealing with only the
grammatical head is not enough. We should take semantic heads
into consideration. In order to do so, however, we need
knowledge to judge which subtree represents the semantic
features of the tree. This knowledge may be extracted from
corpora and machine readable dictionaries.
A certain class of Japanese nouns (called Sahen meisi) may
behave like verbs. Indeed, we can make verbs from these nouns
by adding the special verb "-suru." These nouns have case
frames just like ordinary verbs. With compound nouns including
such nouns, we could use case frames and selectional
restrictions to analyze structures. This process would be
almost the same as analyzing ordinary sentences.
5 Concluding Remarks 
We propose a method to analyze Japanese compound nouns using
collocational information and a thesaurus. We also describe a
method to acquire the collocational information from a corpus
of four kanzi character words. The method to acquire
collocational information depends on Japanese characters, but
the method to calculate preferences of structures is
applicable to any language with compound nouns.
The experiments show that when the method analyzes compound
nouns with an average length of 4.9 characters, it achieves an
accuracy of about 83%.
We are considering the following future work:
• incorporate other syntactic information, such as knowledge
  of affixes
• use semantic information other than thesauruses, such as
  selectional restrictions
• apply this method to disambiguate other syntactic structures
  such as dependency relations
References 
[1] K. W. Church, W. Gale, P. Hanks, and D. Hindle. Using
    statistics in lexical analysis. In Lexical Acquisition,
    chapter 6. Lawrence Erlbaum Associates, 1991.
[2] J. Cowie, J. A. Guthrie, and L. Guthrie. Lexical
    disambiguation using simulated annealing. In COLING p310,
    1992.
[3] T. Fujisaki, F. Jelinek, J. Cocke, E. Black, and
    T. Nishino. A probabilistic parsing method for sentence
    disambiguation. In Current Issues in Parsing Technology,
    chapter 10. Kluwer Academic Publishers, 1991.
[4] R. Grishman and J. Sterling. Acquisition of selectional
    patterns. In COLING p658, 1992.
[5] D. Hindle and M. Rooth. Structural ambiguity and lexical
    relations. In ACL p229, 1991.
[6] M. E. Lesk. Automatic sense disambiguation using machine
    readable dictionaries: How to tell a pine cone from an ice
    cream cone. In ACM SIGDOC, 1986.
[7] H. Maruyama and S. Ogino. A statistical property of
    Japanese phrase-to-phrase modifications. Mathematical
    Linguistics, Vol. 18, No. 7, pp. 348-352, 1992.
[8] Y. Tanaka. Acquisition of knowledge for natural language;
    the four kanji character sequences (in Japanese). In
    National Conference of Information Processing Society of
    Japan, 1992.
[9] J. Veronis and N. M. Ide. Word sense disambiguation with
    very large neural networks extracted from machine readable
    dictionaries. In COLING p389, 1990.
[10] D. Yarowsky. Word-sense disambiguation using statistical
    models of Roget's categories trained on large corpora. In
    COLING p454, 1992.
