A Probabilistic Approach to Compound Noun Indexing in 
Korean Texts 
Hyouk R. Park and Young S. Han and Kang H. Lee 
Korea R&D Information Center/KIST 
P.O. Box 122 YuSong Taejon, 305-600, Korea 
{ hrpark, yshan, khlee} @stissbs.kordic.re.kr 
Key-Sun Choi 
Computer Science Department KAIST 
YuSong Taejon, 305-701, Korea 
kschoi(O)world, kaist, ac. kr 
Abstract 
In this paper we address the prob- 
lem of compound noun indexing that 
is about segmenting or decomposing 
compound nouns into promising index 
terms. Compound nouns as index terms 
that usually subscribe to specific no- 
tions tend to increase the precision of 
retrieval performance. The use of the 
component nouns of a compound noun 
as index terms, on the other hand, may 
improve the recall performance, but can 
decrease the precision. 
Our proposed method to handle com- 
pound nouns with a goal to increase 
the recall while preserving the preci- 
sion computes the relevance of the com- 
ponent nouns of a compound noun to 
the document content by comparing the 
document sets that are supported by 
the component nouns and the terms of 
the document. The operational content 
of a term is represented as the proba- 
bilistic distribution of the term over the 
document set. 
Experiments with a set of 1,000 docu- 
ments show that our method gains 33% 
increase of retrieval performance com- 
pared to the indexing method without 
compound noun analysis, and is as good 
as manual decomposition by human ex- 
perts. 
1 Introduction 
Automatic indexing renders a form of document 
representation that visualizes the content of the 
document more explicitly. Indices that are care- 
fully chosen to represent a document will bring 
about the improvement of retrieval performance 
in accuracy and time efficiency. The potential of 
a candidate index is often judged on the basis of 
its discriminating power over a docmnent set as 
well as its linguistic significance in the document. 
Thus, a good index term should distinguish a cer- 
tain class of documents from the rest of the doc- 
uments and be relevant to the subject matters of 
the class of documents to be indexed by the term. 
In general, automatic indexing consists of the 
identification of index terms and the assignment 
of weights to the terms (Salton 1983). 
An index term can be either a simple noun or a 
compound noun composed of more than one sim- 
ple nouns. Compound nouns tend to carry more 
specific contextual information than simple nouns, 
thus they are likely to contribute to the retrieval 
precision. Compound nouns may contain useful 
simple nouns that usually refer general contexts, 
and thus will boost the recall of retrieval. Process- 
ing compound nouns is decomposing them into 
simple nonns and evaluating the simple nouns as 
potential index terms. In both identifying and 
evaluating index terms, compound nouns require 
a different strategy from that for simple nouns. 
The identification of compound nouns involves a 
certain degree of linguistic or statistical analysis 
that varies from simple stemming to morphologi- 
cal analysis (Fagan 1989). 
What makes it even more complicated to han- 
dle compound nouns in Korean documents lies in 
the convention of writing compound nouns. In 
Korean, it is allowed to write compound nouns 
with or without intervening blanks between con- 
stituent nouns. Arbitrarily long compound nouns 
are possible and not rare in real texts. The de- 
composition of a compound noun is particularly 
problematic because of the severe ambiguity of 
segmentations. 
In this paper, we propose a method to iden- 
tify and evaluate the candidate index terms from 
compound nouns. First, each possible decomposi- 
tion of a compound noun is identified. To'see the 
potential of the component nouns of the decom- 
position, we observe how the component nouns 
are distributed over the total document set, and 
514 
also examine how the simple and componnd nouns 
of the current document are distributed over the 
same document set. The similarity of the two dis- 
tributions implies how consistently the two term 
sets will behave given a query at retriewd time. 
The proposed method assumes a dictionary of 
nouns that is automatically constructed from the 
document set. 3'his is the practice that has never 
been tried in Korean document indexing, but has 
some important merits. A laborious work for the 
manual construction of nominal dictionaries is not 
needed. Since the noun dictionary contains only 
those in a document set, the ambiguity in analyz- 
ing words is greatly reduced. 
Previous researches on the problem of com- 
pound noun indexing in Korean have been done in 
two directions. One approach adopts a full-scale 
morphological analysis to decompose a word into a 
sequence of the smallest morpheme units that are 
all treated as index terms. The other approach 
tries to avoid the complexity of the full scale anal- 
ysis by using bigrams as in (Fnjii 199'3; l,ee 11996; 
Ogawa 1993). Since these methods take all the 
components of compound nouns as index terms 
without evaluation, irrelewmt terms can decrease 
retrieval precision. 
Experiments on 1000 documents show that our 
evaluation scheme gave results closet" to the hu- 
man intuition and maintained the highest preci- 
sion ratio of tile existing methods. 
In the following section, a brief review of re- 
lated work on automatic indexing for Korean doc- 
nments is made. Section 3 explains tile proposed 
method in detail. The verification of the method 
through experiments is described in section 4. 
Section 5 concludes the paper. 
2 Related Work 
The previous approaches to compound noun in- 
dexing are based either on full scale morpholog- 
ical analysis (Kang :1995; Kim 1983; Lee 1995; 
Seo 1993) or on the syllabic patterns (Fujii 1993 
; Lee 1996; Ogawa 1993). Morphological anal- 
ysis will return morphologically valid component 
words constituting a given compound word. Since 
this method does not exclude invalid or meaning- 
less words, it can result in the degradatiou of pre- 
cision. Besides the employment of full morpho- 
logical analysis is often too expensive and requires 
costly maintenance. 
Simpler methods segment componnd nouns me- 
chanically into unigram or bigram words that are 
all regarded as index terms (Lee 1996). Bigram in- 
dexes shows better precision than unigrams, but 
can suffer from big index size. In general, the ex- 
isting methods for compound noun analysis have 
been focused mainly on recall performance with 
little attention to the precision. The work pre- 
sented in this paper t'ries to achieve the improve- 
ment of recall without the deterioration of preci- 
documents 
Dictionary a document making 
7°"epiz!ng .................... 
Compound noun ......... 
- i dictionaries : 
single nouns ................................ : 
compound nouns 
Index weighting l 
weighted indices 
Figure 1: Compound noun indexing. 
sion. 
3 Probabilistic Compound Noun 
Indexing 
In this section, we describe the algorithm to rec- 
ognize and evaluate candidate index terms from 
compound nouns. Figure 1 summarizes the algo- 
rithm. The tokenizer produces a list of simple and 
compound nouns by utilizing the noun dictionary 
and the basic stermning rules. The noun dictio- 
nary is used to identify whether a noun is simple or 
compound, and the basic stemming rules are used 
to differentiate non-final words from others such 
as function words and verbs. The noun dictio- 
nary is automatically constructed from the obser- 
w~tion on the document set. The compound noun 
analyzer inw:stigates if the components of com- 
pound nouns are appropriate as indexes. The in- 
dex terrns that include simple nouns prodnced as 
a result of compound noun analysis are weighted, 
which finishes the indexing. 
l,et £' and C denote the sets of simple and com- 
pound nouns, respectively. Simple nouns are, by 
definition, those that do not have any of their 
substrings as a noun according to the dictionary. 
Compound nouns are those one or more sub- 
strings of which are recognized as nouns. Let 
T = {7'l,7~,...,5/~} = EUC be the set of all 
simple and compound nouns of a document set. 
Also, let I) = {D1,Du,...,Dg} be the set of all 
documents. A document is represented as a list of 
term-weight (2~, Wi) pairs. 
For a compound noun Ci of a document, a de- 
525 
composition is a sequence of nouns (!T'~7~ ...Tk.). 
lit inany cases, there are there than one de('olll- 
position, but only a few of thenl are sensible with 
respect to tim conte.xt of the document. Indiscreet 
use of the component nouns lnay bring about the 
improvement of reall, but can lead to the signif- 
icant decrease of i)recision. In the following dis- 
cussions, we dest;ril)e the details of the algorithm 
to select useful coinponent nouns from eOliipound 
Houns. 
3.1 Dictionary |)uildu I) 
It is very difficult to provide an \[t{ systeni with 
the suf\[icient list of \[iOtlllS. tleeause the nomi- 
nals o/itnlllnber {tnd grow faster than other cat- 
egories of words, it is more elticient to halidle 
non-nominal words mamlally. We consider buihl. 
ing noun dictionary by identifying the remaining 
string as a no/ill ai%er eliminating non-llOiliil-lal 
part of a word. The non-nominals are verbs, ad- 
verbs, adjectives, prelixes, aud suf\[ixes. 
The words in non-nominal dictionaries do not 
include those that can also be used as notins, 
whi(;h is not a probleni since unlike, in English, 
the lmllti-eategorial words in Korean telld to be 
invariant of meaning. The non-nonlinal dictionar- 
ies aye made usually hy manual work. 
Those recognized as non-nominal words but not 
as \[unction words are regarded as llOl.lns. '\['here 
can he multiple interpretations in segnienthig a 
word due to the ambiguity of fiinction words as 
illustrated in the following examp\]e. 
Wellcalo .... ). wenealo (reactor), 
wenca-t-lo (with atom) 
atom÷lNST 17~U MI';NTAL 
One way to deal with the probleni is to nse 
tim i)robability of each function word and choose 
the one with the highest vahie. More accurate 
measure woukl be made using a tlidden Marker 
Model that is about a stochastic process of fun(>- 
lion words. 'Fhe function words are (:lassified into 
32 groups according to their roles and position in 
sentences. In particular, each segmentation of a 
word is evahtated as follows. 
P(G IG-, ) P(n)l'(iln) • 
P(CiICi - 1) is the probability of tl~e function cat- 
egory of current word given the category of the 
previous word. P(n) is the probability of candi- 
date noun and l'(fln ) is the prot)ability era fun(> 
tion word given the candidate noun. The best se- 
quence of these segmentations for a sentence can 
be obtained. The candidate nouns n of the best 
sequence are then 'added to the noun dictionary. 
a.2 Tokenizing and eolni)ound noun 
analysis 
Tokenizing aims at recognizing simple and com- 
pound nouns froni a text and reporting them as 
the the final index terms. The method for di(-tio- 
nary making is also us('.d for tokenizing. Since the 
d ictionary making method gives a list; of candidate 
nouns, we only need to ctaeck if a candidate is a 
COml)ound noun and judge if the eompotl(mts el" 
the candidate compound noun e~re consistent, with 
the content of the deemneat,. 
To deal with the notion of consistency, we have 
to deiiue tile nwaning of a term or a set o\[" terms, 
It is a well recognized practice to regard the dis- 
criminating power of a terln as the value of the 
term. The quality of the (liseriniinating power is 
the distribution of the tel'ill over a document set. 
We define the distribution of a terln as the Hleati.. 
ing of the terin. Similarly the meaning of a set of 
terms is the distribution of terms on the dec/anent 
set. 
l,et M be the distribution of a term ~/i~ over a 
document set l) = I)1 ... I).,~ snch that 
J 
On(" deiinition of 54(.) lnay be as follows. 
f ,.,:q('/; , :)j ) :)5) 
-= )_Zk Dk) 
For the case o\[' multii)le terms: 
E 
-: i 
'l'he similarity t)etween two Lel'lTis (or sets o\[" 
terms) can be defined as any of vector similarity 
liieaslires. The Iileasllrl;ln(Hlt of relative infortua- 
lion of the two distril)utions corresponding to the 
two tertns Rives the distauce between the distri- 
t)utions. Given two (listril)utions Mi and k4) for 
~l} and :1) respectively, tim discrimination L() is 
defined as follows (la;lahut, \[988). 
I--t M~. 
i=0 Mj 
Since we want the dissimilarity between two dis~ 
tributions, divergence that is a symmetric version 
of diserimhiation is nlore appropriate for our case. 
It is defined as follows (I}lahut, 1988). 
/;(M~, MS) = i.(M~, a45) + /;(Ms, Md. 
I,'igure 2 ilhistrates the different distributions ()t' 
terms over tile same doclilnellt set suggesting the 
usefulness of the distributions as the representa- 
tion of tim terms. '\['he divergence ~(.) gives about 
tile itfforination (uncertainty) el' the two dist.ribu 
tions as cornpared with each other, and \]las the 
following characteristics. 
* Tim more uniform the distril)ution is, 
the larger L(-) will be. 
o '\['lie lilOrO the two distributions agree, 
the less L(.) will he. 
516 
D1 D2 D3 D4 ... Dn 
l"iDlre 2: llhistra.l,ioTi o~' tern\[ disl,ributions over 
th(,, S&TIIe (\]Oelllil("ltl, Sel,. 
The eh;:u'a, cl,erisi, ics are useful because good hi-. 
(\[ex tcrtns should be less ilnil'orni a,n(\[ sltare sim- 
ihu' eoii|,exts with other terTils in a dOCTTlileUt. 
\]1l rids respect,, i\]l\]'Ol;illa, l;iOli l,heoreti(' niea.sTire is 
it\[ore eollerete and l,\]uis possibly It\[ore &(W.ilrai, e 
t\[ia.ii w.)ellor siTnilarity Tileastircs. 
l"or e~-~(:h de(:olnposition ('/i,'",'l)) o\[' a (:OtTT- 
l)OTUid IIOITTI Ck, whal, we want 1,o see is how dir- 
feren{, l, he deconiposed terTns and t, he doCTilUeUl, 
i,(;T'TTtS a.,'e. Thai, is, /,({rl},.-.,;l)}, Ds<)I.x'o~.cs 
tile score of the imrticulm • deceiT\[position. D?h;tl, 
we select he.re is OIT(; decolllpOSitiOll wit, h th(: low-- 
est diverg(;nce. Let, l;iug ~v a n(\[ r' denol,e a, t\[eeOlll- 
position and l;he I)esl, (leeoluposit, ion resl)eetive/y , 
i 7- ~ art TlTin/)(r, \])k). 
T 
'l'i~e following SliiTiTii~tt'iZeS ttw l)roc(;(hlre o\[' ex- 
tra('ting shnple TiOlll\]S \['roil i COl\[tl)ounct llOllltS. 
i. I{.eiiTOVe iTon-nonliTta.\[ words ushig tile 
tTTel,ilod for dietioTlary Ilial,:itTg. 
2. ldcni,ify cO\[ill)el\[lid iiOliiis llSillg liOitt- 
htM diel, iona.ry. 
3. For (xtelt (t(~conlt)osit, ion mi o1" a. COl\[l- 
pound IIOTIII (\]i, colnpul, e \]\](rf, D). 
4. Select, "~i with the lowest L(ri, l)). 
3.3 hl(lox weighting 
There are three well known liter, hods \[()1' weighl,- 
ing iiTde× l:erius. 'l'liey are based oit the infor-- 
T/l~t{iO/l ofillverse (\[OCtlltteiit fre(tuency, (\]iscriliiht~t- 
l,ion wthi(;, a.nd l)rol)abilisi, ic vMue (Sail,on 1{)88). 
It, turned ()tit i,\]iat i;hese \[no\[hods lead to simi- 
lar per\['orTn~mcc, bTll~ inverse docunient frequency 
is by fax \[lie shnplest of \[\[toni iit l;ei'uis of l;hne 
(x)i))ple×il,y ;-Hid r(.'(iuire(l resollFt:(}s (Sa,lt,()li 1{)887 
I \[arT~i3At n 19<(J7). 
\]llverse (\]O(:lTittelll, t'reqTtelley ltiethod is alSO 
shown to work with little t)erl'orrnanc(; varbtl, ion 
a, cross (|\]\]l'ercnl, (\]onl<'.-tins. For tllis r(~.IISOTT> we 
~MOl)tCd inverse (\]o(-ulnent fre(luency hi Liie cxi)er- 
inmllLS. \[t is defined as follows. 
,,,,q --- ,~J:~.j × log(~/) 
Tnl)le I: The prol)ortion of COiTiltOITitd iiOiillS in 
l, he 1000 sciettce a,I)st, ra('t. At)oTil, ,()% of lie\[illS aA'e 
COITII)O1TIId TtOIITIS. 
ITO. O~ tX.)IIIt)OII(tlIt;S lt()TllLq ltrOl)ortion 
....... T -4~)639 90.55 ~- 
2 4665 8,50 % 
3 469 .85 % 
4 53 .09 % 
5 6 .01 (7o 
where wij is l, he weight, of' i;hc i'l.h LerTtl iu the 
.i'l.h doellllTeill,, ~*.7 is the \]llTiiii)er o\[" oc('llrreTl(x;s 
(if' l, he i'l,h l,erln hi l, he j'th (toeuliTenl,, and dfi is 
the liliTiibcr of dOCTITIielTI, S hi which the i'i,li l,ernT 
OCCTlrD 
4 l,\]xperiments 
'l'hc goal o\[ experilu<its is to vali&~t,e the pro 
posed algoril, hnl for a.na.lyzing compo.ud nou.s 
by co|np;u'ing il, with the mmmal a.nalysis and l, he 
bigranl lnel, hod. 
The l,esl, dal, a set consists of 1000 science a l) 
stra(:~,s writl.en in Kore~ul (Kitu 1.99d). All nomi 
Nals nix> manually \[aleut\[fled and eoinpoulid ltoillis 
were deconq)oscd into ~q)proprinte simple nouns 
by &t\[ expert in(lexcr. In the iirst (;xp(;rit.ent, 
our proposed Mgoril,h|u is asked to do t,\]lc sa.nm 
tiring over the test (lnta., and retri(;wd perl'or 
Imuwes ou 1.he two ttitf(;reut, ouL('om(:s (m~munlly 
imlexed and aul;olual;ically iudexed al)stracl;s) are 
('omparec\[. lIT t, lw S,eCOTT(\[ exl)erinl(?lH,S, the l)er- 
f'orma.nc(~s o\[" the proposed m('tho(l and t)igram 
Tttetb.o(I a.re eoutl);u:e(I to oloserve how Lit(; preci. 
sion is all'eel,cal. 
As is showu at t~d~le l, the portion o\['(:Ollll)Otltid 
itOlll/S iS at)oIIt; 9~/{) O\[" I;OI, M I\[OIlIIS \['()lind ill I;\[TC Lest 
set,, but. (:ml TTtM¢C critical eil>('l,s on tile retriewd 
\[)erforlila.tlee bec&llSe oN;ell COIIT\[)OtlII(I JIOIIIIS e&r 
ryiT\[g n lore sl)eeili(" information become t,. more a,e- 
(:llT'~l,t(': ill(it;,*( too \[.he dOCIllltC'lttS. 
Figure 3 and 'l'~tble 2 summarize the perfor 
umnce of the indexing me.t.hods: mauuM nnMy 
sis, tim propose(I i)rolmbilisl,ie method, and the 
bigr~mi utet\[lod. <\['lit; proposed mel, ho(l showed 
a slightly bct;ter peT'f()riilatlee (around 3%) - 4(~J) 
them nlmnlM indexing or bigr~mi tilde×trig. How- 
ever, otir method lifts wa.s il\]ore e\[lieient than t)i 
gi'a, lit indexing in l, errns o\[' the llUli~ber o\[' inde~x 
LerlliS mid ti~e ~werage iilllTii)er of retrieved doeu- 
ltietltS per ~ query. 
The ~tverage anlbiguity of a colilpoi/lid ilolilt 
is 1.43, and this low anibiguity niust ha.re eon- 
l;ributed l,o tile iiigli &grecnlent ratio of tile \]pro- 
posed indexing; method with lil&lill&l indexing. 
Tim low ~mibiguil, y is pari, ly ~tl, ixibuted 1;o the 
llOTlli dictionary that has 11o iiTilleeessa+ry entries 
57_7 
Recall Man. Prob. Big. No Anal. 
0.00 0.871.9 0.8579 0.8406 0.7957 
0.10 0.7719 0.7587 0.7841 0.6455 
0.20 0.7122 0.6981 0.6812 0.5894 
0.30 0.5895 0.6312 0.5939 0.4931 
0.40 0.5458 0.5854 0.5637 0.4103 
0.50 0.4957 0.5287 0.5240 0.3646 
0.60 0.4272 0.4438 0.4370 0.2844 
0.70 0.3304 0.3665 0.3322 0.2311 
0.80 0.2552 0.2876 0.2569 0.1695 
0.90 0.2102 0.2280!0.2028 0.0900 
1.00 0.1428 0.1724 0.1600 0.0514: 
Table 2: Performance of Manual, Prob., and Bi- 
gram Indexing 
0.9 , i 
! i "Manual" ~-- 
! "Proba~itlstic" -4--. 
0.8 ~ ............. ~ i ................... i ..................................... "Btgram ~ ,o-- 
°'~ I .... 
X. "'5,", ~,. 
0 0.2 0.4 0.6 0,8 i 
Figure 3: Recall-Preeison curve of indexing meth- 
ods 
not found at the documents. 
5 Conclusion 
The compound analysis in automatic indexing 
aims at the improvement of recall performance 
by extracting useful component nouns fi'om com- 
pound nouns. The task for Korean texts requires 
extra efforts due to the complexity of inflections. 
The proposed method gives better potential of 
sustaining the precision while improving the recall 
than other approaches by making use of proba- 
bilistic distributions of terms as the representation 
of meaning of the terms. 
The proposed method to evaluate the coinpo- 
nents of compound nouns is unique in that it de- 
lines and uses term representation, which explains 
the superiority of the method to other methods. 
The method requires little human involvement 
and is very promising for the implementation of 
practical systems by achieving efficiency and ac- 
curacy at the same time. 
References 
Blahut, Richard E. (1987). Principles and Prac- 
tice of Information Theory. Addison-Wesley. 
Fagan, J. L. (1989). The effectiveness of a Nonsyn- 
tactic Approach to Automatic Phrase Indexing 
for Document Retrieval, Journal of American 
Society for Information Science, Vol. 40, No. 2. 
iIarman, D. (1992). "Ranking Algorithms" in In- 
formation Retrieval: Data Slructure and Algo- 
rithms, (Frakes, W. B., and Baeza-Yates, R. 
ed.) Prentice Hall. 
Fujii, lI., and Croft, W. B. (1993). "A compar- 
ison of indexing techniques for Japanese text 
retrieval," In Proceedings of 16'th ACM SIGIR 
Conference. 
Kang, S. S. (1995). "Role of Morphological Anal- 
ysis for Korean Automatic Indexing,", In Pro- 
ceedings of the 22rid Korea Information Science 
Society Conference. 
Kim, Y. It. (1983). Automatic Indexing System of 
Korean .Texts mixed with Chinese and English 
M.S. Thesis, Dept. of Computer Science, Korea 
Advanced Institute of Science and Technology. 
Kim, S. H. (1994). A Development of the 'l~st 
Collection for Estim~ting the Retrieval Perfor- 
mance of an Automatic Indexer, Journal of Ko- 
rea Information Management Society, Vol. 11, 
No. 1. 
Lee, J. lI. (1996). "n-Gram-Based Indexing for 
Effective Retrieval of Korean Texts," In Pro- 
ceedings of 1st Australian Document Comput- 
ing Symposium 1996 
Lee, It. A. (1995). "Implementation of an Indexing 
System Based on Korean Morpheme Structural 
Rules,", In Proceedings of Spring Conference of 
Korea Information Science Society. 
Ogawa Y. (1993). "Simple word strings as corn- 
pound keywords: An indexing and ranking 
method for Japanese texts,", In Proceedings of 
16'th ACM SIGIR Conference. 
Salt;on, G., and McGill M. J. (1983). Introduction 
lo Modern Information Retrieval McGraw-Hill 
Inc. 
Salton, G., and Buckey, C. (1988). Term Weight-- 
ing Approaches in Automatic Text P~etrieval, 
lnformalion Processing 2~ and Management, 
Vol. 24, No. 5. 
Seo, E. K. (1993). An Experiment in Automatic 
Indexing with Korean Texts: A Comparison 
of Syntactico-Statistical and Manual Methods, 
Journal of Korea Information Managemenl So- 
ciety, Vol. 10, No. 1. 
518 
