Analysis of Japanese Compound Nouns by 
Direct Text Scanning 
Toru Hisamitsu and Yoshihiko Nitta 
Advanced Research Laboratory, Hitachi, Ltd. 
Hatoyama, Saitama 350-03, JAPAN 
{ hisamitu, nitta } @ harl.hitachi.co.jp 
Abstract 
This paper aims to analyze word dependency structure in 
compound nouns appearing in Japanese newspaper 
articles. The analysis is a difficult problem because such 
compound nouns can be quite long, have no word 
boundaries between contained nouns, and often contain 
unregistered words such as abbreviations. The lack of 
segmentation and the unregistered words cause initial 
segmentation errors which result in erroneous analysis. 
This paper presents a corpus-based approach which scans a 
corpus with a set of pattern matchers and gathers 
co-occurrence examples to analyze compound nouns. It 
employs a bootstrapping search to cope with unregistered 
words: if an unregistered word is found in the process of 
searching the examples, it is recorded and invokes 
additional searches to gather the examples containing it. 
This makes it possible to correct initial 
over-segmentation errors, and leads to higher accuracy. The 
accuracy of the method is evaluated using compound 
nouns of length 5, 6, 7, and 8. A baseline is also 
introduced and compared. 
1. Background 
1.1 Compound Nouns in Japanese 
Newspaper Articles 
This paper analyzes the word dependency structure in 
compound nouns appearing in Japanese newspaper 
articles. Assume that you are given a large number of 
articles and a compound noun such as "改正大店法施行". 
This noun actually consists of three nouns "改正" 
(revision), "大店法" and "施行" (application), where 
"大店法" is the abbreviation of "大規模小売店舗法" (大規模: 
large, 小売店舗: retail shop, 法: law). However, it is 
highly unlikely that such a word can be found in an 
ordinary dictionary. Newspaper articles are full of this kind 
of difficult compound nouns which can be infinitely 
generated, and such compound nouns often convey 
substantial information through which the articles can be 
summarized. 
In Japanese newspapers, compound nouns are 
especially useful because they convey a lot of information 
in a compact expression (even a single kanji, or Chinese 
character, can represent complex meaning). The number of 
nouns forming a compound noun often exceeds three, and 
may reach as many as ten. This means that a compound 
noun can contain up to twenty kanji characters or more. 
Therefore, an analysis of noun compounds has to deal 
with both segmentational and structural ambiguities. 
As for the example above, an initial morphological 
analysis (segmentation + tagging) causes an 
over-segmentation error such as "改正 sn/大 adj/店 n/法 n/施行 
sn" because "大" (large), "店" (shop) and "法" (law) are all 
meaningful expressions by themselves. 
1.2 Existing Methods and Problems 
Compound noun analysis has been researched for 
many years because it is important for understanding 
natural language. A concise review of this research area, 
which dates back to Finin (1980), can be found in, for 
instance, Lauer (1995). When applying the existing 
methods to Japanese compound nouns in newspaper 
articles, however, a problem arises: 
(1) All the methods are difficult to apply because they use 
training schemes such as (partial) parsing of the whole 
corpus and counting word occurrences in word windows. 
As Lauer (1995) pointed out, using (partial) parsing 
of the text is too costly. Thus, the word co-occurrence 
approach seems to be more appropriate. However, 
counting the frequency of a given word is not an easy task 
in a non-segmented Japanese text. Ordinary pattern 
matching algorithms cannot count the number of 
occurrences of a word in non-segmented Japanese text 
because of the ambiguity in how sentences should be 
segmented. Thus, whatever method one chooses, one is 
first confronted with the high cost of Japanese 
morphological analysis and its inaccuracy caused by 
unregistered words. 
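The counting problem can be illustrated with a minimal sketch. The sentence and words below are assumptions chosen for illustration and do not come from the paper's corpus: plain substring matching finds "京都" (Kyoto) inside "東京都" (Tokyo Metropolis), although the segmented sentence does not contain that word.

```python
def naive_count(text: str, word: str) -> int:
    """Count raw substring occurrences, ignoring word boundaries."""
    return text.count(word)


def segmented_count(words: list, word: str) -> int:
    """Count occurrences of `word` in an already segmented sentence."""
    return words.count(word)


# Illustrative data (assumed, not from the paper).
text = "東京都に行く"
segmentation = ["東京都", "に", "行く"]

spurious_hits = naive_count(text, "京都")           # substring match inside 東京都
true_hits = segmented_count(segmentation, "京都")   # no such word in the sentence
```

The gap between the two counts is what makes raw frequency counting unreliable without segmentation.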
Thus, researchers of Japanese compound noun 
analysis have been obliged to employ manually written 
syntactic rules for compound nouns (Miyazaki, 1984) or 
the conceptual dependency model (Kobayashi et al., 1994), 
which employs a thesaurus and limited co-occurrence 
data, for example, a collection of four-kanji sequences 
(Tanaka, 1992) extracted from a corpus. 
The problems in existing methods are: 
(2) It is costly to manually prepare the rules for the 
analysis of compound nouns. 
(3) Methods employing a conceptual dependency model 
are brittle when unregistered words occur often. One has 
to properly allocate an unregistered word in the thesaurus, 
which is another tough problem. 
For these reasons, the existing methods are not 
effective for compound noun analysis in newspaper 
articles. A scheme for collecting collocational information 
(1) must be practical for large amounts of Japanese raw 
text, and also collect reliable data. 
(2) should cope with unregistered words. 
1.3 Direct Text Scanning Method 
To satisfy the requirements mentioned above, we used a 
direct text scanning method which collects external 
evidence (McDonald, 1993) of a modifier-modifiee 
relationship between two words using a set of simple 
pattern matchers. 
In this method, a Japanese morphological analyzer 
(JMA) first determines the most plausible segmentation 
for a given compound noun by using an ordinary 
dictionary. At this initial stage, the segmentation often 
contains an over-segmentation error. That is, when the 
analyzer encounters an unregistered word, it is likely to 
segment the word into a sequence of registered words of 
short length (we empirically confirmed that word 
boundary crossing type errors make up less than 5% of all 
errors caused by unregistered words). Our method corrects 
many of these over-segmentation errors automatically. 
Every word in the initial output of the JMA is used 
as a key in pattern matching. Twenty-three pattern 
matchers gather various types of word co-occurrence, and 
many unregistered words can be detected in the process of 
pattern matching. 
For example, in the searches for L={"改正", "大", "店", 
"法", "施行"}, a pattern matcher finds evidence that 
"大店法" appears as a single word. Then, "大店法" is registered, 
added into L, and invokes a search of word co-occurrence 
around "大店法" itself. This bootstrapping search makes it 
possible to correct initial over-segmentation errors and to 
obtain the correct solution of morphological analysis. 
A comparison of possible dependency structures is 
conducted by using mutual information and syntactic 
constraints. Lauer (1995) compared a dependency model 
with adjacency models, and found that the dependency 
model is better. We used the dependency model as well. 
We did not use a conceptual dependency model. This 
is because: 
(1) it is difficult to assign a proper position in a thesaurus 
to an unregistered word. 
(2) we aimed to evaluate the performance of the genuine 
direct scanning approach, since no one has reported 
whether or not it works, or if it works, how large the 
corpus should be. 
Finally we also introduce a baseline that has not yet 
been introduced in the literature of Japanese compound 
noun analysis. The baseline works fairly well, and the 
text scanning method will turn out to be much better than 
the baseline. 
Section 2 describes the algorithm of the text scanning 
method in detail. Section 3 shows the results of our 
experiments and introduces the baseline. Section 4 
discusses problems for future research. 
2. Text Scanning Approach 
2.1 Overview 
Figure 1 illustrates the processing flow. An input 
compound noun is first analyzed by JMA and segmented 
into a sequence of registered words. The output is stored 
as an initial value in a list called WORDLIST (WL). 
For every word in WL, a search for its collocational 
pattern is conducted, and the results are stored in the 
evidence data base (EDB). It is important that there is a 
feedback loop from EDB to WL through which newly 
found words can be added to WL. The search is continued 
until every word in WL is used as a key. This feedback 
enables the bootstrapping acquisition of evidence. 
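The feedback loop between WL and EDB can be sketched as a simple work queue. This is an illustrative sketch only: `bootstrap_search` and the `search` callback are hypothetical names, and the callback stands in for the pattern matchers, which are assumed to return both evidence records and any newly found words for a key.

```python
from collections import deque


def bootstrap_search(initial_words, search):
    """Search until every word in the word list has been used as a key.

    `search(key)` is a hypothetical callback that returns a pair
    (evidence_records, newly_found_words) for one key.
    """
    word_list = list(initial_words)   # WL, in discovery order
    pending = deque(word_list)        # keys that have not been searched yet
    evidence_db = []                  # EDB
    while pending:
        key = pending.popleft()
        evidence, new_words = search(key)
        evidence_db.extend(evidence)
        for word in new_words:
            if word not in word_list:  # feedback from EDB to WL
                word_list.append(word)
                pending.append(word)   # the new word becomes a key as well
    return word_list, evidence_db
```

Because a word is queued only the first time it is seen, the loop terminates once no search produces an unseen word.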
Figure 1 
Architecture of Direct Scanning Method 
[Flow diagram: the input "改正大店法施行" is analyzed by JMA. 
The result of the initial JMA, "改正 sn/大 adj/店 n/法 n/施行 sn", 
initializes WORD_LIST. Pattern matchers scan the corpus until every 
word has been used as a key, and newly found words are fed back into 
WORD_LIST. The final WORD_LIST {(改正 sn), (大 adj), (店 n), (法 n), 
(施行 sn), (大店法 n)} gives the result of the final JMA, 
"改正 sn/大店法 n/施行 sn", which is passed to the augmented CFG parser 
with its attribute grammar. The output is a tree whose nodes carry 
head and mod-rel attributes.] 
After the searches, the input is re-analyzed using 
newly found words. The final result of JMA is then 
passed to a CFG-parser which calculates the cost of 
possible structures and the attribute-values attached to 
each node in a solution. In the case that there is 
ambiguity in the final morphological analysis of a given 
compound noun, the morphological analyzer picks up the 
solution with the least number of segmentations. 
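The preference for the solution with the fewest segments can be sketched with a small dynamic program. This is a stand-in for illustration, not the JMA itself: `fewest_segments` is a hypothetical helper, and the dictionary is assumed to be a plain set of surface strings without categories or costs.

```python
def fewest_segments(text, dictionary):
    """Return a segmentation of `text` into dictionary words that uses the
    fewest words, or None if the text cannot be segmented at all."""
    n = len(text)
    best = [None] * (n + 1)   # best[i]: shortest word list covering text[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in dictionary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]
```

With only the five initially registered words the example compound splits into five pieces; once "大店法" is added to the dictionary, the three-word segmentation wins.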
The procedure of the cost calculation of a dependency 
structure is basically the same as in Kobayashi et al. (1994). 
The cost of the dependency between two nodes is given by 
using mutual information between the lexical heads of the 
nodes (fig. 2). 
Here two kinds of attributes are used: head, which 
records the head of a node as a value, and mod-rel, which 
records the kind of relationship found between the two heads 
of the children. 
In Japanese, if the two children are both content 
words, the value of the head attribute of the parent node is 
usually identical to the value of the head attribute of the 
right daughter. 
Figure 2 
Dependency Representation Using 
Attribute-Value Pairs 
[Tree diagram: a parent NP with head: γ and a mod-rel set, 
dominating two child NPs with head: α and head: β, each with 
its own mod-rel set {r1, ..., rm}.] 
2.2 Basic CFG Rules 
The category which the morphological analyzer assigns to 
a word is one of the following: sn (stem of a sino-verb), n 
(noun), pn (proper noun), num (number), adj (stem of an 
adjective or an adjectival verb), prfx (nominal prefix), sfix 
(nominal suffix), num-prfx (numerical prefix), and num- 
sfix (numerical suffix). CFG rules for compound noun 
construction use these categories as non-terminals. The 
following two rules are the most basic: [np → np np] and 
[np → n]. These rules construct the basic framework of 
the dependency structure of a compound noun. We assume 
that the structure of a compound noun can be represented 
in the framework of binary-tree grammar by using 
attribute-value pairs. 
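The two basic rules generate every binary tree over the word sequence, which is the search space the parser must rank. The following sketch enumerates those trees as nested tuples; `bracketings` is a hypothetical helper, and the simplification that every word is treated as an n (ignoring the other categories) is an assumption made for illustration.

```python
def bracketings(words):
    """Yield every binary tree over `words` as nested tuples.

    A single word corresponds to [np -> n]; a pair (left, right)
    corresponds to [np -> np np].
    """
    if len(words) == 1:
        yield words[0]
        return
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                yield (left, right)
```

Three words give two trees and four words give five, so the number of candidate structures grows quickly with the length of the compound.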
2.3 Co-occurrence Data Collection by Direct 
Text Scanning 
This subsection describes the most important part of our 
method: the pattern matchers and heuristics on 
unregistered word treatment. 
"Fable 1 shows the main part of the pattern matchers. 
We will describe the procedure for collecting evidence by 
using the example mentioned previously, "~\]E.~)~t~tJ~)~ 
The initial segmentation of the compound noun is 
"~k~ sn/~ adj/Y~ n/i-~ n/~'~T sn". Thus the WL initially 
contains these five words. The words are used as keys lot 
the search. As mentioned in the previous section, this 
solution contains an over-segmentation error, which is the 
most likely error in the situation when unregistered words 
appear. Therefore this example captures the typical 
problem laced in our task. 
In Table I, 'A' stands for a given key, 'B' stands for a 
sequence of kanji characters (we only treat kanji- 
compound nouns in this paper), and 'D' stands for an 
"extended" delimiter: D is identical to a space, a symbol, a 
katakana or a hiragana except "©" (no; o3'). After 
preliminary experiments, we decided to eliminate "©" 
from the delimiters because if it is used, a pattern such ~ts 
"A©B©C"(roughly C orB of A) could be picked up, and 
it may ~ erroneous evidence because of its ambiguity in 
dependency structure. 
Table 
Part of Pattern 
1.1 D.AB.D D 
D'BA'D D 
D 
1.~ D • A~B • D D 
D-Be)A • D 
1.3 \[D'AV~B'D 
D, A~'~cB - D 
D • BV~A • D 
D" B~cA- D 
D AT)~B-~7~ • D 
1., D A~B'9'-~ - D 
\ D AIZB~7~ • D 
D B;O:A-~ "7o • D 
D B~'A~7~ - D 
O BIZA~7~ . D 
1 Matchers 
A-~7~B • D 
AL, Y.:B • D 
A~ ~'~B • D 1.5 
D B'~Z~A • D 
D BL?cA • D 
D B~'LT~A • D 
D B~-/'cA • D 
D • A.~3 ~ i~'B • D ~ 
D • AL g - D 1.6 
D • B,t~ J: LFA • D 
D'B~A'D 
D • A~Zo~'~cCo')B • D' 1.7 
D • A~,~-1~-9~ ~ B - D / 
D • B~:_o~,~Z'09A • D 
D. B~Y-I~-~A • D 
Patterns ill 1.I collect evidence of inner-word 
collocation of A and B. If the length of A is more than or 
equal to 2, The length of B is limited to less than or equ~d 
to 3. If the length of A is !, the length of B is limited to 
less than or equal to 2. Additional explanation will be 
given later in this subsection. 
Patterns in 1.2 collect the evidence of particle- 
combined collocation of A and B. A and B are combined 
by a particle "©" which is similar to "of' in English. 
Note that no part of a phrase such as "A©B©C" is picked 
up so that erroneous evidence can be to avoided. The 
length of B is limited to less than or equal to 3 (in 1.3, 
.... 1.7, the same condition on B is used). 
Patterns in 1.3 collect the evidence of an adjectival 
modifier-modifiee relationship between an adjective (or an 
adjectival noun) and a noun. 
Patterns in 1.4 collect the evidence of a predicate-argument 
relation between a sino-verb and a noun. 
Particles "が" (ga), "を" (wo) and "に" (ni) roughly indicate 
AGENT, OBJECT and GOAL, respectively. 
Patterns in 1.5 collect the evidence of a modifier-modifiee 
relationship between a sino-verb and a noun, where the 
sino-verb appears at the tail of a noun-modifying 
phrase and the noun is modified by that phrase. 
Patterns in 1.6 collect the evidence of a coordination 
relationship between two words. 
Patterns in 1.7 collect phrases such as "A about B" 
and "B about A". 
Here we omit the others. One can add any pattern as long 
as it supplies reliable evidence. 
In the following part of this subsection, we will 
illustrate the search procedure using the initial value of 
WL: {(改正 sn), (大 adj), (店 n), (法 n), (施行 sn)}. 
From the first item "改正", the evidence shown in 3.1 of 
Figure 3 is collected, and the result is stored in the form 
shown in 3.1'. Note that the number of occurrences and 
the observed relationships are recorded. At this stage, the 
unregistered word "大店法" is already captured by using a 
pattern matcher in 1.5. 
As for the second word, however, one has to be 
careful because a word with length 1 is very likely to 
appear through an over-segmentation error. The pattern 
matchers gather evidence such as an adjectival phrase 
meaning "big change", "大学" (university), a two-kanji word 
meaning "large", "大店法" (large retail-shop law), etc., as 
given in 3.2. This evidence contains not only correct 
examples (such as the "big change" phrase) but also 
registered words (such as "大学") and unregistered words 
(such as "大店法"). 
To classify the evidence, we developed the following 
rules: 
R-(a) 
If (1) the length of A is 1, and the length of B is 1, and 
(2) there is no entry for the concatenated string AB (BA) 
in the dictionary used by JMA, 
then recognize the concatenated string as an unregistered 
word, and apply R-(c). 
R-(b) 
If (1) the length of A is 1, and the length of B is 2, (2) 
there is no entry for the concatenated string AB (BA) in 
the dictionary, (3) the category of B is not 'sn' (the 
condition for AB), and (4) the concatenated string AB 
(BA) cannot be segmented as a sequence of two registered 
words A'B' (B'A'), where A' ≠ A, 
then recognize the concatenated string as an unregistered 
word and apply R-(c). 
R-(c) 
If (1) the character string consisting of B is identical to 
the concatenated string of the first or the first two words 
following A in the initial solution (the condition for AB), 
or (2) the character string consisting of B is identical to 
the concatenated string of the first previous or the first 
two previous words preceding A in the initial solution 
(the condition for BA), then record AB in WL as an 
unregistered word, which will invoke pattern matching 
using AB as a key. 
R-(d) 
If (1) the length of A is larger than or equal to 2, and 
(2) the concatenated string AB (BA) cannot be segmented 
as a sequence of two registered words A'B' (B'A'), where 
A' ≠ A, then record evidence of inner-word co-occurrence 
of A and B. 
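Rules R-(a), R-(b) and the first condition of R-(c) can be sketched for the AB order as follows. This is an illustrative reading of the rules under stated assumptions: the function names are hypothetical, the dictionary is assumed to be a set of surface strings, `category` is assumed to map a registered word to its tag, and the BA order and R-(d) are left out.

```python
def can_split_otherwise(ab, a, dictionary):
    """True if AB splits into two registered words A'B' with A' != A
    (condition 4 of R-(b))."""
    return any(ab[:i] in dictionary and ab[i:] in dictionary
               for i in range(1, len(ab)) if ab[:i] != a)


def is_unregistered_ab(a, b, dictionary, category):
    """Apply R-(a) and R-(b) to the concatenated string AB."""
    ab = a + b
    if len(a) != 1 or ab in dictionary:
        return False                   # both rules need a 1-kanji A and no entry for AB
    if len(b) == 1:                    # R-(a)
        return True
    if len(b) == 2:                    # R-(b)
        return (category.get(b) != "sn"
                and not can_split_otherwise(ab, a, dictionary))
    return False


def passes_rc_ab(a_index, b, initial_words):
    """R-(c)(1): B equals the first word, or the first two words joined,
    following A in the initial segmentation."""
    following = initial_words[a_index + 1:a_index + 3]
    candidates = ("".join(following[:1]), "".join(following[:2]))
    return b != "" and b in candidates
```

In the running example, A = "大" with B = "店法" passes both checks, so "大店法" would be added to WL as a new key.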
We admit that the definition of a word might be 
controversial. However, we do not mention the arguments 
here because of the lack of space. We only say that the 
standpoint we chose is simple and machine-tractable, and 
works well for our purpose. 
The "big change" phrase is recorded as evidence of a 
straightforward adjectival modifier-modifiee relationship 
between "大" and the noun meaning "change". 
According to R-(a), "大学" and the word meaning "large" 
are neglected. 
According to R-(b) and R-(c)-(1), "大店法" is recorded as 
an unregistered word and stored in WL, which invokes a 
search of the patterns around it. 
Having worked through all the elements in WL, the 
evidence given in 3.1', 3.2', 3.3', 3.4', 3.5' and finally 
3.6' is obtained. 
At this stage, JMA re-analyzes the input compound 
noun by using newly found words. Thus the correct 
segmentation "改正 sn / 大店法 n / 施行 sn" is obtained, 
and passed to the CFG-parser. 
Figure 3 Example of Evidence Collection 
[Corpus snippets 3.1 to 3.6 matched for each key, and the 
corresponding stored evidence records 3.1' to 3.6', each holding a 
word pair, the observed relationship and its number of occurrences. 
A newly found word triggers an additional search.] 
2.4 Selection of Proper Analysis 
2.4.1 Cost Calculation and Mutual 
Information 
The rest of the procedure is straightforward. An augmented 
bottom-up CFG parser chooses the minimum cost tree for 
the given word sequence. Let NP_3 be the parent of NP_1 
and NP_2 in a subtree. Each node has three kinds of 
attributes: head, mod-rel and accum-cost. head has the 
lexical head of the subtree under NP_i as its value. mod-rel 
keeps the observed relationships captured by the pattern 
matchers between the two lexical heads of child nodes 
(this value is not actually used in the following 
experiments). accum-cost c_i records the accumulated cost 
of the subtree which has NP_i as its root. c_3 is calculated 
as follows: 
$$c_3 = c_1 + c_2 - \log_2 \frac{N(head_1, head_2)}{N(head_1)\,N(head_2)}$$ 
where N(head_i) stands for the number of patterns 
containing head_i, and N(head_1, head_2) stands for the number of 
patterns containing both head_1 and head_2. The value of 
accum-cost of each leaf node is set to 0. 
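The accumulated cost can be computed recursively over a candidate tree. The sketch below is an illustration under assumptions: `tree_cost` is a hypothetical helper, trees are nested tuples, the head of every parent is taken from its right daughter, every pair of heads is assumed to have a positive count, and the counts in the usage note are invented.

```python
import math


def tree_cost(tree, n_single, n_pair):
    """Return (lexical head, accum-cost) of a nested-tuple binary tree.

    n_single[h] plays the role of N(head) and n_pair[(h1, h2)] the role
    of N(head1, head2); both are assumed to be positive counts.
    """
    if isinstance(tree, str):
        return tree, 0.0                       # accum-cost of a leaf is 0
    head1, cost1 = tree_cost(tree[0], n_single, n_pair)
    head2, cost2 = tree_cost(tree[1], n_single, n_pair)
    ratio = n_pair[(head1, head2)] / (n_single[head1] * n_single[head2])
    # Assumption: the parent inherits the head of its right daughter.
    return head2, cost1 + cost2 - math.log2(ratio)
```

With invented counts N(a)=8, N(b)=16, N(c)=4, N(a,b)=4, N(b,c)=2 and N(a,c)=2, the tree (a (b c)) costs less than ((a b) c), so the right-branching analysis would be chosen.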
2.4.2 Preference to Analysis Containing 
Observed Evidence 
The corpus-based approach inevitably encounters the 
sparseness problem. Our approach also encounters this 
problem, although it turned out to be not serious, as will 
be explained in section 3.3. This subsection describes the 
heuristic that is employed when the evidence cannot cover 
any of the entire trees. 
Figure 4 shows two possible dependency structures 
in a three-word compound noun. For simplicity, the 
values of the head attribute are indicated instead of the 
non-terminal symbols. For three noun words, the 
following rule is applied: 
If only the dependency between H1 and H2 was observed, 
then 4-(a) is chosen, else if only the dependency between 
H1 and H3 was observed, then 4-(b) is chosen, else if only 
the dependency between H2 and H3 was observed, then 
4-(b) is chosen. 
In general, priority is given to the solution 
containing more subtrees which directly reflect the 
observed evidence. 
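The three-word rule can be written down directly. The sketch is illustrative: `choose_three_word_parse` is a hypothetical name, and the observed dependencies are assumed to be given as a set of index pairs.

```python
def choose_three_word_parse(observed):
    """Pick a structure for three nouns H1 H2 H3.

    `observed` is a set of index pairs, e.g. {(1, 2)}, naming the
    dependencies that were found in the corpus.
    """
    if observed == {(1, 2)}:
        return "4-(a)"
    if observed in ({(1, 3)}, {(2, 3)}):
        return "4-(b)"
    return None   # the rule covers only the single-dependency cases
```

Cases with no evidence or with more than one observed dependency fall outside this rule and are left to the cost comparison and the general preference stated above.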
In our experiments, the analysis which has multiple 
minimum cost solutions was considered to have failed. 
Figure 4 
Two Possible Parses 
[4-(a): H1 and H2 are combined first, then H3: ((H1 H2) H3). 
4-(b): H2 and H3 are combined first: (H1 (H2 H3)).] 
3. Results 
3.1 Test Data 
We used the articles contained in "Nikkei Shinbun" for 
January and February in 1992 as the corpus for the 
experiments. The number of the articles is about 27,000, 
which contain about 7 million characters. 
Experiments were carried out using 400 compound 
nouns: 100 for 5-kanji words, 100 for 6-kanji words, 100 
for 7-kanji words and 100 for 8-kanji words. The 
frequency of these word lengths is about the same in the 
corpus. After randomly selecting the test samples, we 
confirmed that they were all compound nouns. 
Numerical expressions appeared in 10% of the test 
samples, and such expressions were pre-processed as 
follows: 
"~'\]\-I-~'" --~ "~ pr-num/~\]\-\[- num/~ n" 
(¢~: about; ~: hundred; A.: eight; W: ten; ~-: dealer) 
3.2 Baseline 
Baselines have rarely been introduced in research on 
Japanese noun compounds. This paper introduces a 
baseline to facilitate our evaluation of the effectiveness of 
our method. 
The baseline we used is leftmost derivation. This is 
an extension of left branchprefereture in Lauer (1995). 
The baseline is also a well-known heuristic method to 
analyze Japanese noun phrases combined with "¢)" (such 
as "A©B~C"). As shown below, this heuristic method 
works well especially when the length of a compound 
noun is relatively short. Note that the baseline correctly 
analyzes "i~/E~\]j~,~-" if "~)~" is registered. 
However, the baseline actually fails because it cannot 
capture the unregistered word. 
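The baseline can be sketched as a left-branching fold over the segmented words. This reading of "leftmost derivation" as a purely left-branching tree is an assumption based on the reference to left branch preference, and `leftmost_derivation` is a hypothetical name.

```python
from functools import reduce


def leftmost_derivation(words):
    """Build a left-branching tree ((w1 w2) w3) ... from a non-empty word list."""
    return reduce(lambda left, word: (left, word), words)
```

Applied to the correct segmentation of the running example, this yields (("改正", "大店法"), "施行"). Applied to the over-segmented five-word sequence it yields a different, deeper tree, which is why the baseline depends on the unregistered word being known.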
3.3 Results and Comparison 
Table 2 shows the results of the proposed method. The 
first line indicates the number of samples for which the 
correct dependency structure was given as the single 
minimum cost solution. The second line indicates the 
accumulated number of samples for which the correct 
dependency structure was given as one of the minimum 
cost solutions. Table 3 shows the results of the baseline, 
and indicates the number of samples for which the correct 
dependency structure was given. 
Table 2 
The result of Direct Scanning 
word length | 5 | 6 | 7 | 8 
[cell values illegible; legible fragments: 89, 81, 76] 
Table 3 
The result of baseline 
word length | 5  | 6  | 7  | 8 
correct     | 83 | 63 | 41 | [illegible] 
Comparing the two tables reveals that the proposed 
method is more accurate than the baseline. For longer 
word length, the difference is greater. 
Our result cannot be compared accurately with the 
existing result (Kobayashi et al., 1994) because we used a 
different test corpus, and only the results on 4-, 5- and 
6-kanji compound nouns were reported. However, the 
accuracy of their results on 6-kanji compound nouns is 
53% when their conceptual dependency model is not 
combined with a heuristic using the distance between 
modifier and modifiee. After combining the model and the 
heuristic, accuracy improves to 70%, which is the same as ours. 
An 8-kanji compound noun usually contains four 
nouns. The performance of our method (accuracy of 58%) 
is encouraging, since most of the errors were caused by 
proper nouns. This problem can be solved using a pre- 
processor (explained below). 
3.4 Causes of Errors 
Forty-two percent of the errors were caused by proper 
nouns, 16% by time expressions, and 15% by monetary 
expressions. This means that proper nouns are a major 
cause of the errors, as pointed out in previous research. 
There are several reasons for this: 
(1) an identical proper noun normally does not appear 
many times in the corpus. 
(2) proper nouns sometimes cause cross-boundary errors at 
the initial morphological analysis. 
We can be optimistic about eliminating these three 
types of errors. If we use a preprocessor (for proper nouns, 
see Kitani et al., 1994), most of them can be eliminated. 
4. Future Directions 
This paper discussed performance of the direct text 
scanning method. There remain several interesting 
problems: 
(1) We did not employ the conceptual dependency model. 
A method for combining a conceptual dependency model 
with the proposed approach should be investigated and the 
results analyzed. 
(2) A proper noun pre-processing module should be 
combined with the proposed method. 
(3) The effect of varying the corpus size should be 
investigated. 
(4) The distance between a compound noun and its 
evidence should be reflected in the cost calculation in 
comparing solutions. 
(5) Parallel search should be employed to speed up the 
process. 
(6) How to obtain an expanded expression from a given 
compound noun should be investigated. At the moment, 
the value of the mod-rel attribute is not used. Some 
compound nouns can be rephrased with an ordinary 
Japanese sentence. Figure 5 shows an example of 
expansion. 
Figure 5 
Analysis of Long Word and 
Expansion to Ordinary Japanese 
"木造賃貸住宅密集地区整備事業" 
"an enterprise which aims at improving the area where 
many wooden apartments for rent stand close together" 
木造: mokuzo; wooden   賃貸: chintai; rental 
住宅: juutaku; apartment   密集: misshuu; crowd 
地区: chiku; area   整備: seibi; improve, maintain   事業: jigyou; enterprise 
5. Conclusion 
A corpus-based approach for analyzing Japanese 
compound nouns was proposed. This method scans a 
corpus with a set of pattern matchers and gathers external 
evidence to analyze compound nouns. It employs a 
bootstrapping procedure to cope with unregistered words: if an 
unregistered word is found in the process of searching the 
co-occurrence examples, the newly found word is recorded 
and invokes additional searches, which enable necessary 
evidence to be gathered for the given compound noun. 
This also makes it possible to correct over-segmentation 
errors in the initial segmentation, and leads to higher 
accuracy. The method is also very portable because it 
depends little on a dictionary of a morphological analyzer 
and treats registered words and unregistered words in the 
same manner. The accuracy of the method was evaluated 
using the compound nouns of length 5, 6, 7, and 8. A 
baseline, which takes leftmost derivation strategy, was 
also investigated for comparison with our method. The 
proposed method is much more accurate than the baseline 
in the experiments for words of four different lengths. 
Acknowledgement 
We would like to express our gratitude to Professor Yorick 
Wilks (Sheffield) and Dr. Shojiro Asai (Hitachi, Ltd.), 
who gave the first author the opportunity to do this 
research at the University of Sheffield as a visiting 
researcher (from January to December, 1995). 
References 
Finin, Tim. 1980. The Semantic Interpretation of 
Compound Nominals, PhD Thesis, Co-ordinated 
Science Laboratory, University of Illinois, Urbana, IL 
Lauer, Mark. 1995. Corpus Statistics Meet the Noun 
Compound: Some Empirical Results, in Proc. of ACL, 
pp.47-54 
McDonald, David B. 1993. Internal and External Evidence 
in the Identification and Semantic Categorization of 
Proper Names, in Proc. of SIGLEX workshop on 
Acquisition of Lexical Knowledge from Text, pp. 32- 
43, Ohio, USA 
Miyazaki, Masahiro. 1984. Automatic Segmentation 
Method for Compound Words Using Semantic 
Dependent Relationships between Words, in Trans. of 
IPSJ, Vol. 25, No. 6, pp.970-979 
Kitani, T. and Mitamura, T. 1994. An Accurate 
Morphological Analysis and Proper Name Identification 
for Japanese Text Processing, in Trans. of IPSJ, Vol. 
35, No. 3, pp.404-413 
Kobayashi, Y., Tokunaga, T. and Tanaka, H. 1994. 
Analysis of Japanese Compound Noun using 
Collocational Information, in Proc. of COLING, pp. 
865-869 
Tanaka, Yasuhito. 1992. Acquisition of knowledge for 
natural language; the four kanji character sequence (in 
Japanese), in National Conference of Information 
Processing Society of Japan 
