Retrieving Collocations From Korean Text 
Seonho Kim, Zooil Yang, Mansuk Song 
{pobi, zooil, mssong}@december.yonsei.ac.kr 
Dept. of Computer Science, 
Yonsei University, Seoul, Korea 
Jung-Ho Ahn 
jungho@math.yonsei.ac.kr 
Dept. of Mathematics, 
Yonsei University, Seoul, Korea 
Abstract 
This paper describes a statistical methodology for automatically
retrieving collocations from POS-tagged Korean text using interrupted
bigrams. The free word order of Korean makes it hard to identify
collocations. We devised four statistics, 'frequency', 'randomness',
'condensation', and 'correlation', to account for the more flexible
word order properties of Korean collocations. We extracted meaningful
bigrams using an evaluation function and extended the bigrams to
n-gram collocations by generating equivalence sets, α-covers. We view
the modeling problem for n-gram collocations as one of clustering
cohesive words.
1 Introduction 
There have been many theoretical and applied works related to
collocations. The rapidly growing availability of corpora has
attracted interest in statistical methods for automatically extracting
collocations from textual corpora. However, it is not easy to identify
the central tendencies of collocation distribution, and the
borderlines of the criteria are often fuzzy because the expressions
can be of arbitrary length and take a large variety of forms. Getting
reliable collocation patterns is particularly difficult in Korean,
which allows arguments to scramble freely. This paper presents a
statistical method using 'interrupted bigrams' for automatically
retrieving collocations and idiomatic expressions from Korean text. We
suggest several statistics to account for the more flexible word
order.
If the distribution of a random sample is un- 
known, we often try to make inferences about 
its properties described by suitably defined mea- 
sures. For the properties of arbitrary collocation 
distribution, four measure statistics: 'high fre- 
quency', 'condensation', 'randomness', and 
'correlation' were devised. 
Given a morpheme, our system begins by retrieving the frequency
distributions of all bigrams within a window, and then meaningful
bigrams are extracted. We produce α-covers to extend them into n-gram
collocations.¹
According to the definition of Kjellmer and 
Cowie, a fossilized phrase is a sequence, where 
the occurrence of one word almost predicts the 
rest of the phrase and one word predicts a very 
limited number of words in a semi-fossilized 
phrase (Kjellmer, 1995) (Cowie, 1981). How- 
ever, in both fossilized and semi-fossilized types 
there is a high degree of cohesion among the 
members of the phrases (Kjellmer, 1995). We consider the cohesions as
α-covers that are obtained by applying a fuzzy compatibility relation,
which satisfies symmetry and reflexivity, to meaningful bigrams.
Namely, n-gram collocations could be interpreted as equivalence sets
of the meaningful bigrams obtained through partitioning. Here,
α-covers mean the clustered sets of the meaningful bigrams.
2 Related Works 
In determining properties of collocations, most corpus-based
approaches accepted that the words of a collocation have a particular
statistical distribution (Cruse, 1986). Although previous approaches
have shown good results in retrieving collocations and many properties
have been identified, they depend heavily on the frequency factor.
(Choueka et al., 1983) proposed an algorithm for retrieving only
uninterrupted collocations,²
¹ Bigrams and n-grams can be either adjacent morphemes or morphemes
separated by an arbitrary number of other words.
² In the case of an interrupted collocation, words can be separated by
an arbitrary number of words, whereas
since they assumed that a collocation is a sequence of adjacent words
that frequently appear together. (Church and Hanks, 1989) defined a
collocation as a pair of correlated words and used mutual information
to evaluate such lexical correlations of word pairs of length two.
They retrieved interrupted word pairs as well as uninterrupted word
pairs. (Haruno et al., 1996) constructed collocations by combining
adjacent n-grams with high values of mutual information. (Breidt,
1993)'s study was motivated by the fact that mutual information could
not give realistic figures for low frequencies, and used the t-score
as a significance test for V-N combinations.
Martin noted that a span of 5 words on the left and right sides
captures 95% of significant collocations in English (Martin, 1983).
Based on this assumption, (Smadja, 1993) stored all bigrams of words
along with their relative position, p (-5 ≤ p ≤ 5). He evaluated the
lexical strength of a word pair using the 'Z-score' and the variance
of its position distribution using 'spread'. He defined a collocation
as an arbitrary, domain dependent, recurrent, and cohesive lexical
cluster.
(Nagao and Mori, 1994) developed an algorithm for calculating adjacent
n-grams for an arbitrarily large number n. However, it was hard to
find an efficient n, and a lot of fragments were obtained. In Korean,
statistics based on adjacent n-grams are not sufficient to capture the
various types of collocations. (Shimohata et al., 1997) employed an
entropy value to filter out fragments of the adjacent n-gram model.
They evaluated disorderness with the distribution of adjacent words
preceding and following a string. The strings with a high value of
entropy were accepted as collocations. This disorderness is efficient
for eliminating fragments but cannot handle interrupted collocations.
In general, previous studies on collocations have dealt with
restricted types and depend on filtering measures from a lexical point
of view.
3 Input Format 
In this section, we discuss an input form rele- 
vant to Korean language structure and linguis- 
tic contents which would work well on an effi- 
an uninterrupted collocation is a sequence of words. To avoid
confusion of terms, we call a sequence of two words an 'adjacent
bigram' and a sequence of n words an 'adjacent n-gram'.
cient statistics. Korean is one of agglutinative 
languages as well as a postpositional language. An elementary unit
called an 'eojeol' is generally composed of a content word and
function words. Namely, a word in English corresponds to a couple of
morphemes in Korean.
A key feature of Korean is that function words, such as postpositions,
endings, copula, auxiliary verbs, and particles, are highly developed
as independent morphemes, while they are represented by word order or
inflections in English. Functional morphemes determine grammatical
relations, tense, modality, and aspect. In Korean, there are many
multiple function words in rigid forms. They can be viewed as
collocations. For this reason, our system is designed at the
morphological level. A set of twelve part-of-speech tags, { N, J, V,
P, D, E, T, O, C, A, S, X },³ was considered.
Another feature is free word order. Since the words of a collocation
appear in text in flexible ways, sufficient samples are required to
compute accurate probabilities. We allow positional information to
vary by using an interrupted bigram model.
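As a rough illustration of the interrupted bigram model, the sketch below collects, for one base morpheme, how often every other morpheme occurs at each distance before it. The function name `positional_counts`, the toy sentences, and the single fixed window are assumptions made for the example; POS tags and the POS-dependent window sizes are left out.

```python
from collections import defaultdict


def positional_counts(sentences, base, window=10):
    """Count how often each morpheme occurs j positions before `base`.

    `sentences` is an iterable of morpheme lists. The result maps every
    co-occurring morpheme to a list of `window` counts, where index j-1
    holds the count at distance j before the base morpheme.
    """
    counts = defaultdict(lambda: [0] * window)
    for morphemes in sentences:
        for i, morpheme in enumerate(morphemes):
            if morpheme != base:
                continue
            for j in range(1, window + 1):
                if i - j < 0:
                    break  # ran past the start of the sentence
                counts[morphemes[i - j]][j - 1] += 1
    return dict(counts)


# Hypothetical toy data: English glosses stand in for Korean morphemes.
toy = [
    ["water", "OBJ", "much", "drink"],
    ["much", "water", "OBJ", "drink"],
]
rows = positional_counts(toy, "drink", window=5)
print(rows["much"])  # counts at distances 1..5 before "drink"
```

Each returned list corresponds to one row of a position-by-frequency matrix for the base morpheme, so "much" is counted once at distance 1 and once at distance 3 even though the two sentences order the words differently.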
The basic input can be represented as in (1). An object k is a pair of
morphemes (m_i, m_k), where m_k is any morpheme that can co-occur with
m_i. The variable j indicates the j-th position, and X_kj denotes the
frequency with which m_k occurs at the j-th position before m_i.

$$
X_i = \begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1,10} \\
X_{21} & X_{22} & \cdots & X_{2,10} \\
\vdots & \vdots &        & \vdots   \\
X_{n1} & X_{n2} & \cdots & X_{n,10}
\end{pmatrix}
\qquad (1)
$$

Given a predicate morpheme as a base morpheme, the window ranges from
-1 to -10. This distance constraint reflects the characteristics of an
SOV language. If a bigram includes an adverb morpheme, a larger
window, from -20 to 10, is used because the components often appear
widely separated from each other in text. In other cases, we
considered the range from -5 to +5. These distance constraints are for
efficient statistics. The input data is transformed into a property
matrix T(X_i), as in (2), which is a two-dimensional
³ 'Noun', 'adJective', 'Verb', 'Postposition', 'aDverb', 'Ending',
'pre-ending', 'cOpular', 'Conjunction', 'Auxiliary verb', 'Suffix',
'etc.'.
(m_i, m_k)           syntactic relation   preferring position
(drink, much)        VD                   1
(drink, too)         VD                   2
(drink, clean)       VJ                   4
(drink, everyday)    VD                   3
(drink, boil)        VV                   2
(drink, not)         VD                   1
(drink, together)    VD                   3
(drink, a little)    VD                   1
(drink, take)        VV                   2
(drink, a little)    VD                   3
Figure 1: Meaningful bigrams of 'drink' by Xtract
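array of k objects, k = 1, 2, ..., n, on four variables, V_Frequency,
V_Condensation, V_Randomness, and V_Correlation.

$$
T(X_i) = \begin{pmatrix}
V_{1F} & V_{1C} & V_{1R} & V_{1CR} \\
\vdots & \vdots & \vdots & \vdots  \\
V_{nF} & V_{nC} & V_{nR} & V_{nCR}
\end{pmatrix}
\qquad (2)
$$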
To continue the explanation, we begin by mentioning the 'Xtract' tool
by Smadja (Smadja, 1993). Our input form was designed in a manner
similar to that of 'Xtract'. Smadja assumed that the components of a
collocation should appear together in a relatively rigid way because
of syntactic constraints. Namely, a bigram pair (m_i, m_k), where m_k
occurs at one (or several) specific positions around m_i, would be a
meaningful bigram for collocations. The rigid word order is related to
the variance of the frequency distribution of (m_i, m_k). 'Xtract'
extracted the pairs whose variances are over a threshold and pulled
out their interesting positions by standardization of the frequency
distributions. Unfortunately, this approach for English has several
limitations when applied⁴ to Korean structure, for the following
reasons:
1. For free order languages such as Korean, words are widely
   distributed in text, so that positional variance leads to the
   over-filtering of useful bigrams. Figure 1 shows that there is no
   pair which contains randomly distributed morphemes such as function
   words or nouns. This indicates that very few pairs are produced
   when 'Xtract' is applied to Korean.
⁴ We ported Smadja's Xtract tool into a Korean version.
2. Suppose that a meaningful bigram (m_i, m_k) prefers a position p_j.
   Then the number of concordances for the conditional probability
   P(m_i, m_k | p_j) would be small, especially in a free order
   language. As shown in Table 1, the model produced a lot of long
   meaningless n-grams when compiling into n-grams. The precision of
   the Korean version of Xtract was estimated to be 40.4%.
3. The bigrams eliminated by the previous stage can appear again in
   n-gram collocations. When compiling, the model only keeps the words
   occupying a position with a probability greater than a given
   threshold in the concordances of (m_i, m_k, p_j). As one might
   imagine, the first stage could then be useless.
As stated above, in Korean, the effect of position on collocations
needs to be treated in a more complex way. Korean collocations can be
divided into four types: 'idiom',⁵ 'semantic collocation',⁶ 'syntactic
collocation',⁷ and 'morphological collocation'.⁸ Idioms and
morphological collocations appear in text in a rigid form and word
order, but the others appear in flexible ways. To take these more
flexible collocations into account, we adopt an interrupted bigram
model and suggest several statistics that are consistent with the
characteristics of Korean.
4 Algorithm 
This section describes how properties are represented as numerical
values and how meaningful objects are retrieved. In the first stage,
we extract meaningful interrupted bigrams based on four properties.
Next, the meaningful bigrams are extended into n-gram collocations
using an α-compatibility relation.
It was shown empirically that a Weibull distribution (3) provides a
close approximation of the frequency distribution of bigrams.

$$
F(x) = 1 - e^{-\alpha x^{\beta}}, \quad 0 < x < \infty,
\quad \text{where } \alpha > 0,\ \beta > 0
\qquad (3)
$$
⁵ Idioms have no ambiguous meaning but require rigid patterns to
preserve the idiomatic meaning.
⁶ The replacement of some components by other words is freer than in
idioms.
⁷ The combination of words is affected by the selectional restrictions
of a predicate, noun, or adverb.
⁸ It corresponds to a multiple function word and appears in an
adjacent word group.
no.  n-gram (English glosses; the bigram members are in double parentheses)   dist  eval
1    ... (everyone) . Noun (object case) ... ((take)) ((drink)) ...             -2    O
2    ... ((much)) ((drink)) . Noun ...                                          -1    O
3    (two) (legs) (at, location case) (a little) (strain) (stand)
     (with ~ing) (friend) (with) (talk over) (~ing) ((a little)) (coffee)
     (object case) ((drink)) (modifying ending) (dream) ...                     -3    X
4    ... (two) (hours) (object case) ((together)) (alcohol) (object case)
     ((drink)) (modifying ending) (he) ...                                      -3    O
5    ... (cola) (object case) ((a little)) ((drink)) (~ing) ...                 -1    O
6    ... (I) (also) (baeklim, location) (at) ((everyday)) (tears)
     (object case) ((drink)) (was ~ing) (because) ...                           -3    X
7    ... (I) (here) (specially) (dawn) (at) ((fresh)) (modifying ending)
     (water) (also) ((drink)) . ... (that) (be unsuitable) . (most) .
     (says) (exercise) (well)                                                   -4    X
8    ... Verb ... Noun (object case) ((too)) (much) ((drink))
     (modifying ending) ...                                                     -2    O
9    ... Noun Noun ... Noun (in) ((boil)) ((drink)) . Noun ...                  -2    X
10   ... (even though) . ((not)) ((drink)) (and) Noun Noun . Verb Ending
     Verb ...                                                                   -1    O
Table 1: n-grams of 'drink' by Xtract (freq: freq of sentences)
Thus, there are a lot of pairs with low frequency, which get in the
way of obtaining reliable statistics. We eliminated such pairs using
the median m, that is, a value such that P{X ≥ m} ≥ 1/2 for the
frequency distribution F. If the median was less than 3, we took the
value 3 as the median.
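The low-frequency filter can be sketched in a few lines. The helper names `frequency_cutoff` and `drop_rare_pairs` are hypothetical, and keeping pairs whose total frequency is at least the cutoff (rather than strictly above it) is an assumption, since the text does not say how ties at the median are treated.

```python
from statistics import median


def frequency_cutoff(total_freqs, floor=3):
    """Median of the pair frequencies, raised to `floor` when it is lower."""
    return max(median(total_freqs), floor)


def drop_rare_pairs(pair_freqs, floor=3):
    """Keep only pairs whose total frequency reaches the cutoff.

    `pair_freqs` maps a pair (any hashable key) to its total frequency.
    Keeping ties at the cutoff is an assumption of this sketch.
    """
    cutoff = frequency_cutoff(list(pair_freqs.values()), floor)
    return {pair: f for pair, f in pair_freqs.items() if f >= cutoff}


# Hypothetical frequencies: the median is 2, so the floor of 3 applies.
freqs = {"a": 1, "b": 1, "c": 2, "d": 9, "e": 40}
print(drop_rare_pairs(freqs))
```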
Any quantity that depends only on the sample, and not on any unknown
parameters of the population distribution, is called a statistic. We
regarded four statistics relating to properties of collocations as
variables. Before further explanation, consider S_mi, a sample space
of m_i as in Table 2, whose cardinality |S_mi| is n. Let one object be
(m_i, m_k), let its frequency distribution be f_ik1, f_ik2, ...,
f_ik10, and let f_ik+ be the sum of f_ikp over p = 1, ..., 10. Suppose
that POS(m_i) is 'J' and POS(m_k) is 'P'.
4.1 Properties 
The properties which we considered are primarily concerned with the
frequency and positional information of word pairs. As we have
emphasized, the correlation between position and collocation is very
complicated in Korean.
According to Breidt, MI or T-score thresholds work satisfactorily as a
filter for the extraction of collocations, but filter out at least
half of the actual collocations (Breidt, 1993). Generally, the assumed
properties could not fully account for collocations. Therefore, in
order to reduce the loss of information, a combination of the observed
variables would be better than filtering. We defined four variables
for properties of collocations as follows.

1. V_f:
According to Benson's definition, a collocation is a recurrent word
combination (Benson et al., 1986). We agree with this view that a word
pair of high frequency would serve as a collocation. The V_f statistic
of an object (m_i, m_k) is represented as (4). Here, standardization
demands attention. The mean and standard deviation are calculated in
the 'JP' set to which the object belongs.

$$
V_f = \frac{f_{ik+} - \bar{f}_{iJP}}{\sigma_{iJP}}, \qquad
\bar{f}_{iJP} = \frac{\sum_{l=1}^{n} f_{il+}}{n} = \frac{f_{i++|JP}}{n}, \qquad
\sigma_{iJP} = \sqrt{\frac{\sum_{l=1}^{n} \left(f_{il+} - \bar{f}_{iJP}\right)^2}{n}}
\qquad (4)
$$

2. V_c:
Intuitively, two words that prefer specific positions must be related
to each other. We sought to recapture this idea while allowing for the
flexibility of word order. For this, the concept of convergence on
each position was employed. In a free order language, a meaningful
pair can occur in text at a distance of either two or three. Consider
two input vectors, x = (0,1,0,0,0,1,0,0,1,0) and
y = (0,0,0,1,1,1,0,0,0,0). They have the same variance, but y would be
more meaningful than x, because y can be interpreted as
(0,0,0,0,3,0,0,0,0,0) within the free order framework. Therefore, a
spatial mask
(1/2, 1, 1/2) was devised for convergence on each position. The
condensation value m_ikp at the p-th position is calculated as:

$$
m_{ikp} = \begin{cases}
\dfrac{4f_{ik1} + 3f_{ik2} + f_{ik3}}{4}, & p = 1 \\[2ex]
\dfrac{f_{ik,p-1} + 2f_{ikp} + f_{ik,p+1}}{2}, & p = 2, \ldots, 9 \\[2ex]
\dfrac{f_{ik8} + 3f_{ik9} + 4f_{ik10}}{4}, & p = 10
\end{cases}
$$

m_ikp is computed from the neighbors located on the border of the p-th
position. The value max_p m_ikp / f_ik+ is likely to represent the
condensation of (m_i, m_k), but it is still deficient. Intuitively,
(0,1,1,1,0,3,2,0,0,0) would be less condensed than
(0,0,3,0,0,3,2,0,0,0). Therefore, n' was designed as a penalty factor.

$$
V_c = \max_{p = 1, 2, \ldots, 10} \frac{m_{ikp}}{\sqrt{n'}\, f_{ik+}}
\qquad (5)
$$

n' is the number of positions m such that f_ikm ≠ 0 for 1 ≤ m ≤ 10,
and it is inversely proportional to the condensation. The square root
was used to prevent an excessive influence of n'.

word pair     POS pair   total frequency   variable (position) distribution
(m_i, m_1)    (J,P)      f_i1+             f_i11     f_i12     ...   f_i1,10
(m_i, m_2)    (J,P)      f_i2+             f_i21     f_i22     ...   f_i2,10
...
(m_i, m_k)    (J,P)      f_ik+             f_ik1     f_ik2     ...   f_ik,10
...
(m_i, m_n)    (J,P)      f_in+             f_in1     f_in2     ...   f_in,10
total                    f_i++|JP          f_i+1|JP  f_i+2|JP  ...   f_i+10|JP
Table 2: All combinations of m_i (under a JP relation)

3. V_r:
We were motivated by the idea that if a pair is randomly distributed
in terms of position, then it would not be meaningful. Function words
in particular are likely to be randomly distributed around a given
morpheme, but the distributions of meaningful pairs are not random, as
shown in Figure 3. A typical method for checking randomness is to
measure how far the given distribution is from a uniform distribution.
In (6), f̄_ik means the expected number of occurrences of (m_i, m_k)
at each position (that is, f_ik+ / 10) on the assumption that the pair
occurs randomly at the position. |f_ikp - f̄_ik| / f̄_ik can be viewed
as an error rate at each position p based on this assumption. A big
difference between the expected number and the actual observed
frequency means that the distribution is not random. One might think
that this concept is the same as that of variance. However, note the
denominator. This calculation is somewhat better than the variance,
which depends on frequency.

$$
V_r = \sum_{p=1}^{10} \left( \frac{f_{ikp} - \bar{f}_{ik}}{\bar{f}_{ik}} \right)^2
\qquad (6)
$$

4. V_cr:
To become a meaningful bigram, a pair should be syntactically valid.
We took the view that if the frequency distribution of a pair follows
the overall frequency distribution of the POS relation set to which
the pair belongs, then the pair would be syntactically valid. To
verify this idea, we depict the overall frequency distributions in
some POS relations in Figure 2. It shows the frequency distributions
of pairs which are composed of a postposition and a predicate
morpheme. It is quite interesting that all objects have a similar form
of frequency distribution. They have sharp peaks at the first and
third positions. Clearly, this illustrates that a postposition has a
high probability of appearing at the first and third positions before
a predicate. We can conclude from this that pairs keeping the overall
frequency structure would be syntactically valid. We used the
correlation coefficient for the structural similarity. In the case of
a pair m_ik, the correlation value between (f_ik1, f_ik2, ..., f_ik10)
and (f_i+1|JP, f_i+2|JP, ..., f_i+10|JP) is evaluated. Let x
[Figure 2 plot. x-axis: position (distance), 1 to 10; y-axis:
frequency, 0 to 1600. One curve per predicate: be deep, be new, be
beautiful, remain/be left, be heard/drop in, follow/pour into,
drop/fall/apart, drink, stand, wear, regard, increase/climb, head to.]
Figure 2: Frequency distributions of pairs with 'JP' or 'VP' POS relation
and y be two vectors whose components are mean corrected, x_i - x̄ for
x and y_i - ȳ for y. The correlation between two variables is
straightforward if x and y are standardized by dividing each of their
elements by the standard deviations, σ_x and σ_y, respectively. Let x*
be x/σ_x and y* be y/σ_y; then the correlation between x and y, V_cr,
can be represented as follows.

$$
x' = (f_{ik1}, f_{ik2}, \ldots, f_{ik10}), \qquad
y' = (f_{i+1|JP}, f_{i+2|JP}, \ldots, f_{i+10|JP}), \qquad
V_{cr} = \frac{x^{*\prime}\, y^{*}}{10}
\qquad (7)
$$
The ranks of bigrams under the four measures are summarized in
Figure 3. It shows that each of the measures behaves as we expected.
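The four statistics can be sketched for a single position vector as below. This is only an illustration under stated assumptions: the function names are hypothetical, the boundary weights of the smoothing mask and the use of the population standard deviation (dividing by n) are a reading of the damaged formulas, and vectors are assumed to have a non-zero total and non-zero variance.

```python
import math


def standardized_frequency(total, group_totals):
    """V_f: total frequency standardized within its POS-relation set."""
    n = len(group_totals)
    mean = sum(group_totals) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in group_totals) / n)
    return (total - mean) / sd


def condensation(f):
    """V_c: peak of the (1/2, 1, 1/2)-smoothed counts with a sqrt(n') penalty."""
    n = len(f)
    smoothed = [(4 * f[0] + 3 * f[1] + f[2]) / 4]  # first position (assumed weights)
    smoothed += [(f[p - 1] + 2 * f[p] + f[p + 1]) / 2 for p in range(1, n - 1)]
    smoothed.append((f[n - 3] + 3 * f[n - 2] + 4 * f[n - 1]) / 4)  # last position
    occupied = sum(1 for v in f if v != 0)  # n': number of non-empty positions
    return max(smoothed) / (math.sqrt(occupied) * sum(f))


def randomness(f):
    """V_r: summed squared relative deviation from a uniform spread."""
    expected = sum(f) / len(f)
    return sum(((v - expected) / expected) ** 2 for v in f)


def correlation(f, g):
    """V_cr: Pearson correlation between a pair's and its set's distribution."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    sf = math.sqrt(sum((v - mf) ** 2 for v in f) / n)
    sg = math.sqrt(sum((v - mg) ** 2 for v in g) / n)
    return sum((a - mf) * (b - mg) for a, b in zip(f, g)) / (n * sf * sg)


# A frequency distribution listed in Figure 3 (all 53 occurrences at one position).
print(round(randomness([0, 0, 0, 0, 0, 0, 0, 0, 0, 53]), 1))
```

With this reading, the randomness value for the vector above comes out at 90.0, the figure reported for that row in Figure 3, and (0,0,3,0,0,3,2,0,0,0) receives a higher condensation score than (0,1,1,1,0,3,2,0,0,0), as the text requires.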
4.2 Evaluation Function 
In this section, we analyze the correlations of the four measures we
defined and explain how to make an evaluation function for extracting
meaningful bigrams. Table 3 shows the values of the correlations which
exist among the given measures: V_f, V_r, V_c, V_cr. This shows that
the defined measures have redundant parts. We can say that if a
measure has high values of correlation with the others, then it has a
redundant part to be eliminated. Since we do not know what factors are
effective in determining useful bigrams, the concept of weights is
more reliable than filtering. We constructed an evaluation function
which reflects the correlations between the measures.

        V_f      V_r      V_c      V_cr
V_f     1.0
V_r    -0.495    1.0
V_c    -0.203    0.506    1.0
V_cr    0.252   -0.278   -0.002    1.0
Table 3: Correlations between factors
First of all, we standardized the four measures. Standardization
adjusts the range of values according to its variability. The degree
of relationship between measure1 and measure2 is given by
C_{measure1,measure2}, which is {corr(measure1, measure2)}^+, where
x^+ = x if x > 0 and x^+ = 0 otherwise. The evaluation function is
concerned with the degrees of relationship between the measures.

$$
f(V_f, V_r, V_c, V_{cr}) = V_f + \phi_r V_r + \phi_c V_c + \phi_{cr} V_{cr}
\qquad (8)
$$
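$$
\begin{aligned}
\phi_r    &= \left(1 - C_{V_r,V_f}\right)\left(1 - a\,\frac{C_{V_r,V_c}}{2}\right)\left(1 - a\,\frac{C_{V_r,V_{cr}}}{2}\right) \\
\phi_c    &= \left(1 - C_{V_c,V_f}\right)\left(1 - a\,\frac{C_{V_c,V_r}}{2}\right)\left(1 - a\,\frac{C_{V_c,V_{cr}}}{2}\right) \\
\phi_{cr} &= \left(1 - C_{V_{cr},V_f}\right)\left(1 - a\,\frac{C_{V_{cr},V_r}}{2}\right)\left(1 - a\,\frac{C_{V_{cr},V_c}}{2}\right)
\end{aligned}
\qquad \text{where } a = 2 - \frac{2}{\sqrt{3}}
\qquad (9)
$$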
Here, the constant a (≈ 0.845) is a compensation coefficient. The
minimum value of each of φ_r, φ_c, and φ_cr is 1/3, attained where
C_{V_r,V_f} = C_{V_c,V_f} = C_{V_cr,V_f} = 0 and all correlations
among V_r, V_c, and V_cr are 1. On the contrary, the maximum value of
each of φ_r, φ_c, and φ_cr is 1, attained where C_{V_r,V_f} =
C_{V_c,V_f} = C_{V_cr,V_f} = 0 and all correlations among V_r, V_c,
and V_cr are 0. In other words, as the coefficients φ_r, φ_c, and φ_cr
get closer to 1, the correlations between the measures decrease.
As shown in (8) and (9), we take V_f to be a primary factor of
collocations. Each coefficient φ indicates how much the property is
reflected in the evaluation. For example, in the case of φ_r,
a C_{V_r,V_c} / 2 is the portion which is related to the property of
condensation within randomness. Therefore, 1 - a C_{V_r,V_c} / 2
corresponds to the remainder when this portion is subtracted from
randomness.
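The weighting scheme can be sketched as follows. The names `coefficient` and `evaluate` and the dictionary layout are assumptions of the example; the constant a is taken as 2 - 2/√3, which matches the stated value of about 0.845 and yields the stated bounds of 1/3 and 1.

```python
import math

A = 2 - 2 / math.sqrt(3)  # compensation coefficient, roughly 0.845
WEIGHTED = ("r", "c", "cr")  # measures that receive a coefficient


def positive_part(x):
    """x+ : keep positive correlations, clip the rest to zero."""
    return x if x > 0 else 0.0


def coefficient(name, corr):
    """phi for one of the measures "r", "c", "cr".

    `corr` maps frozenset({u, v}) to the correlation between measures u
    and v, with "f" denoting the frequency measure.
    """
    phi = 1 - positive_part(corr[frozenset((name, "f"))])
    for other in WEIGHTED:
        if other != name:
            phi *= 1 - A * positive_part(corr[frozenset((name, other))]) / 2
    return phi


def evaluate(values, corr):
    """Evaluation function: V_f plus the weighted remaining measures."""
    return values["f"] + sum(coefficient(m, corr) * values[m] for m in WEIGHTED)


# Correlations as read from Table 3 (the label order is an assumption).
table3 = {
    frozenset(("f", "r")): -0.495, frozenset(("f", "c")): -0.203,
    frozenset(("r", "c")): 0.506, frozenset(("f", "cr")): 0.252,
    frozenset(("r", "cr")): -0.278, frozenset(("c", "cr")): -0.002,
}
print({m: round(coefficient(m, table3), 3) for m in WEIGHTED})
```

Clipping negative correlations to zero means that only positively correlated measures are treated as redundant; a measure that is uncorrelated with all the others keeps its full weight of 1.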
The threshold for evaluation was set by testing. Good results were
obtained with a threshold of 0.5, but for noun morphemes a higher
value, over 0.9, was required. The pairs whose values of the
evaluation function are greater than the threshold are selected as
meaningful bigrams.
4.3 Extending to n-grams
The selected meaningful bigrams from the previous step are extended
into n-gram collocations. At the final step, the longest ones among
all α-covers are obtained as n-gram collocations by eliminating
substrings. Here, n-gram collocations mean interrupted collocations as
well as n-character strings.
We regarded cohesive clusters of the meaningful bigrams as n-gram
collocations on the assumption that members of a collocation have a
high degree of cohesion (Kjellmer, 1995). To find cohesive clusters, a
fuzzy compatibility relation R is applied. R on X × X, where X is the
set of all meaningful bigrams which contain a base morpheme m_i,
expresses a cohesive relation, and the partitions of the set X
obtained by R correspond to n-gram collocations. In short, our problem
has shifted to the clustering of a set X. A reason to employ the
concept of fuzziness is that the equivalence sets defined by the
relation may be more desirable.
A fuzzy compatibility relation R(X,X) is represented as a matrix by a
membership function. The membership function of a fuzzy set A on X is
denoted by μ_A : X → [0, 1] and maps elements of a given set X into
real numbers in [0, 1]. The following two membership functions μ_A
were used to define the cohesive relation.
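$$
p(x) = \frac{|x \cap y|}{|x|}, \qquad p(y) = \frac{|x \cap y|}{|y|}
$$

$$
\mu_A(x, y) = \begin{cases}
D(p(x)\,\|\,p(y)) = p(x)\left(\log\dfrac{1}{p(y)} - \log\dfrac{1}{p(x)}\right), & \text{if } p(x) \ge p(y) \\[2ex]
D(p(y)\,\|\,p(x)) = p(y)\left(\log\dfrac{1}{p(x)} - \log\dfrac{1}{p(y)}\right), & \text{if } p(y) \ge p(x)
\end{cases}
\qquad (10)
$$

$$
\mu_A'(x, y) = \frac{2\,|x \cap y|}{|x| + |y|}
\qquad (11)
$$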
Let |x| and |y| be the frequencies of the concordances which contain
the bigram pairs x and y, respectively. |x ∩ y| means how often the
two pairs x and y co-occur in the same concordances under the distance
constraint. (10) is a relative entropy measure and (11) is the Dice
coefficient. These measures are concerned with a lexical relation for
cohesive degrees.
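A minimal sketch of the two membership functions, computed from the three counts |x|, |y|, and |x ∩ y|. The function names are hypothetical. The natural logarithm and the choice of the larger conditional probability as the leading term are assumptions; they were picked because they reproduce the values listed in Figure 4 for the count triples (4, 5, 1) and (19, 3, 3).

```python
import math


def dice(nx, ny, nxy):
    """Dice coefficient from |x|, |y| and |x ∩ y|."""
    return 2 * nxy / (nx + ny)


def relative_entropy(nx, ny, nxy):
    """Relative-entropy membership grade from |x|, |y| and |x ∩ y|.

    The larger of p(x) = |x ∩ y|/|x| and p(y) = |x ∩ y|/|y| leads, so the
    value is never negative. Returning 0 for pairs that never co-occur
    is an assumption of this sketch.
    """
    if nxy == 0:
        return 0.0
    px, py = nxy / nx, nxy / ny
    hi, lo = max(px, py), min(px, py)
    return hi * (math.log(hi) - math.log(lo))


for counts in [(4, 5, 1), (19, 3, 3)]:
    print(counts, round(dice(*counts), 2), round(relative_entropy(*counts), 2))
```

For (19, 3, 3) the Dice value stays low at 0.27 even though the rarer item always co-occurs with the frequent one, while the relative entropy grade is 1.85, which is the contrast the evaluation section draws between the two measures.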
To get equivalence sets, it is very important to identify the
properties of the relation R we defined. A relation which is
reflexive, symmetric, and transitive is called an equivalence relation
or similarity relation. In our case, the fuzzy cohesive relation R is
certainly reflexive and symmetric. If
R(x, z) ≥ max_{y ∈ X} min[R(x, y), R(y, z)] is satisfied for all
(x, z) ∈ X², then R is transitive. Generally, the transitive closure
is used for checking transitivity. The transitive closure of a
relation is defined as the smallest fuzzy relation which is transitive
and contains the relation itself.
Given a relation S(X,X), its max-min transitive closure S_T(X,X) can
be calculated by the following algorithm consisting of three steps:
1. S' = S ∪ (S ∘ S), where ∘ is the max-min composition operator.
2. If S' ≠ S, make S = S' and go to Step 1.
3. Stop: S' = S_T.
If the above algorithm terminates after the first iteration when
applied to R, R satisfies transitivity.
To verify its transitivity, the above algorithm was employed. As a
result, R did not satisfy transitivity. This means that an element of
X could belong to multiple classes by R. This indicates that the
relation R is valid for explaining collocations.
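The three-step closure procedure can be sketched for a relation stored as a square list of lists. The function names and the small example relation are made up for illustration; the union of two fuzzy relations is taken as the element-wise maximum.

```python
def max_min_compose(s, t):
    """Max-min composition of two square fuzzy relations."""
    n = len(s)
    return [
        [max(min(s[i][k], t[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(s):
    """Return (closure, number of updates needed to reach it)."""
    updates = 0
    while True:
        composed = max_min_compose(s, s)
        # Step 1: S' = S ∪ (S ∘ S), with union as element-wise maximum.
        nxt = [[max(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(s, composed)]
        if nxt == s:  # Step 3: nothing changed, so S' is the closure
            return s, updates
        s = nxt  # Step 2: repeat with S = S'
        updates += 1


def is_transitive(s):
    """A relation is max-min transitive when the closure needs no update."""
    return transitive_closure(s)[1] == 0


# Hypothetical reflexive, symmetric relation over three bigrams.
R = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.4],
    [0.0, 0.4, 1.0],
]
closure, updates = transitive_closure(R)
print(is_transitive(R), closure[0][2])
```

In this toy relation the first and third items are unrelated, yet both are related to the second, so the closure raises their grade to 0.4 and the original relation is not transitive.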
A fuzzy binary relation R(X,X) which is reflexive and symmetric is
called a fuzzy compatibility relation and is usually referred to as a
quasi-equivalence relation. When R is a fuzzy compatibility relation,
compatibility classes are defined in terms of a specified membership
degree α. An α-compatibility class is a subset A of X such that
R(x, y) ≥ α for all x, y ∈ A, and the family consisting of the
compatibility classes is called an α-cover of X with respect to R in
terms of a specific membership degree α. An α-cover forms partitions
of X, and an element of X could belong to multiple α-compatibility
classes. Here, we accepted α-covers at the 0.20 α-level for Dice and
0.30 for relative entropy.
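One way to enumerate α-compatibility classes is to treat every pair with a membership grade of at least α as an edge and list the maximal cliques of the resulting graph. The sketch below uses the basic Bron-Kerbosch recursion for this; restricting the cover to maximal classes, the function name `alpha_cover`, and the toy grades are assumptions of the example, not details given in the text.

```python
def alpha_cover(items, membership, alpha):
    """Maximal subsets in which every pair has membership >= alpha."""
    adjacent = {
        x: {y for y in items if y != x and membership(x, y) >= alpha}
        for x in items
    }
    classes = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            classes.append(frozenset(clique))  # cannot be extended further
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adjacent[v], excluded & adjacent[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(items), set())
    return classes


# Hypothetical symmetric grades between four meaningful bigrams.
grades = {
    frozenset("ab"): 0.5,
    frozenset("bc"): 0.4,
    frozenset("ac"): 0.1,
    frozenset("cd"): 0.3,
}


def grade(x, y):
    return grades.get(frozenset((x, y)), 0.0)


print(sorted(sorted(c) for c in alpha_cover("abcd", grade, alpha=0.2)))
```

In the toy example, "b" and "c" each fall into two classes, which is the overlap that a non-transitive compatibility relation permits.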
One might ask why we did not apply all bigrams directly to this stage,
skipping the previous stage. We hope to deal with this comparison in a
later paper.
5 Evaluation 
We performed experiments for evaluation on 328,859 sentences (8.5
million morphemes) from the Yonsei balanced corpora. 250 morphemes
with frequency ≥ 150 were selected for a test. The morphemes have
8,064 pairs, and 773 were extracted as meaningful bigrams. In the
second stage, 3,490 disjoint α-compatibility classes corresponding to
lexically cohesive clusters were generated. 698 longest n-gram
collocations were extracted from the α-compatibility classes by
eliminating the fragments that can be subsumed in longer classes.
The precision of the extracted meaningful bigrams was 86.3%, and 92%
in the case of n-gram collocations. We could take either the α-covers
or the longest n-grams as n-gram collocations, according to the
application.
Unfortunately, since there is no existing database of collocations for
evaluation, it is not easy to compute precision and recall values. We
computed the precision values by hand. As a different approach to
Korean collocations, (Lee et al., 1996) extracted interrupted bigrams
using several filtering conditions, and at least 90% of the results
were adjacent bigrams of length 1. From this comparison, we may
conclude that our approach is more flexible in dealing with Korean
word order.
Figure 3⁹ displays the changes of rank according to the measures we
considered. It shows that, in contrast to other models, the properties
have been effective in retrieving collocations which contain pairs of
morphemes with relatively low frequency. Since the ranks of bigrams in
the four measures met our expectations, the precision would be
improved if we could make a more adequate evaluation function.
Table 4 shows some of the meaningful bigrams obtained for 'not'. There
are a great many expressions relating to negative sentences in Korean.
Their components occur separated in various ways. When evaluating
meaningful bigrams, the coefficients for the evaluation function are
as follows: φ_r ≈ 0.432, φ_c ≈ 0.490, φ_cr ≈ 0.371 in the case of
'not'. This means that the influence of the three other measures is
1.284 times more than that of the frequency measure in the 'JP' POS
relation.
We will illustrate all steps with a word, 'wear'.¹⁰ The results of the
first stage, the meaningful bigrams of 'wear', are shown in Figure 4.
In the second stage, we calculated membership grades of inputs using
the Dice measure and the relative entropy measure. As Figure 4 shows,
the Dice measure looks unsatisfactory in such cases as the pair
'((object case), (much))'. Although the common frequency, 3, is
relatively high from the viewpoint of the word with the lower
frequency, 'much', the value of Dice is low. Thus, we also tested
relative entropy based on the probability of low frequency. The two
measures produce similar results if all values in the level set of R
are considered instead of a specific value of α, but the entropy
measure produces better results.
Figures 4 and 5 show all α-compatibility classes and the longest
n-gram collocations of 'wear'. Through our method, various kinds of
collocations were extracted. In Figure 4, the order of the components
of an α-cover follows the concordances.
In this paper, we implemented measures which 
reflect the four properties of collocation respec- 
⁹ The meanings of pairs are not described in detail because the pairs
including function words are hard to translate into English.
¹⁰ The word's meaning corresponds to "put on (wear or take on)" in
English, but it is used for shoes or socks.
rank  frequency distribution              frequent      random        evaluation
                                          freq   std    ran    std    eval
1     22 20 17 26 40 48 17 15  6 427      638    4.1    36.3   0.4     4.8
2      0  0  0  0  0  0  0  0  0  53       53    -0     90.0   2.5     3.3
3      0  0  0  0  0  0  0  0  1  11       12    -0     74.7   1.9     1.9
4      0  0  1  0  0  0  0  1  0  29       31    -0     77.7   2.1     1.8
5      0  2  1  4  5  2  4  4  0  82      104    0.2    52.9   1.1     1.3
6      1  1  0  0  0  1  0  2  0  27       32    -0     61.9   1.4     1.2
7      6  4  3  6  4  9  1 16  2  60      111    0.3    22.9   -0      0.6
8     24 31 34 35 38 46 24 48 10  13      303    1.7     1.6   -1      0.6
9      0  0  0  0  1  7  1  0  0   0        9    -0     53.0   1.1     0.5
10    23 24 25 27 24 55 17 13  1   0      209    1       4.9   -1      0.1
11    19 30 23 30 15 19 19 58  1   1      215    1       5.1   -1      0.0
12     1  0  1  0  1  3  0  0  0   0        6    -1     23.3   -0     -0.6
13     0  0  2  0  1  3  1  0  0   0        7    -1     20.6   -0     -0.7
14     6  7 11 13  5 25  3  3  0   0       73    -0      9.6   -1     -0.7
15     1  0  0  6  1  2  1  1  0   0       12    -0     20.6   -0     -0.8
Figure 3: Top 15 bigrams of 'not' by our algorithm
m_i (gloss)        POS of m_i   entries
white              J            1
shoes              N            1
boots              N            1
socks              N            1
rubber shoes       N            1
* (location)       P            5
* (modifying)      E            9
* (subject)        P            7
* (object)         P            12
* = function word
(The m_i column is not legible in the source; each row lists the partner m_j.)

m_j (gloss)     POS of m_j   freq of m_i   freq of m_j   common freq   dice   relative entropy
sneakers        N             4             5             1            0.22   0.06
leather         N             8             4             1            0.17   0.17
leather         N            11             4             2            0.27   0.51
boots           N             8            11             1            0.11   0.04
white           J             3             4             2            0.57   0.19
rubber shoes    N            19             3             3            0.27   1.85
much            D            19             3             1            0.09   0.62
white           J            19             4             2            0.17   0.78
boots           N            19            11             2            0.13   0.10
socks           N            19             8             3            0.22   0.32
rubber shoes    N            19             3             3            0.27   1.85
much            D            50             3             1            0.04   0.94
sneakers        N            50             5             2            0.07   0.92
white           J            50             4             3            0.11   1.89
leather         N            50             4             1            0.04   0.63
shoes           N            50             8             3            0.10   0.69
boots           N            50            11             4            0.13   0.55
socks           N            50             8             5            0.17   1.15
rubber shoes    N            50             3             2            0.08   1.88
* (location)    P            50            19            14            0.41   0.71
still           D            31             3             1            0.06   0.78

Rows for which only one value is legible:

m_j (gloss)     POS of m_j   value
sneakers        N            0.36
white           J            0.51
shoes           N            0.51
boots           N            0.09
* (location)    P            0.10
* (modifying)   E            0.22
much            D            3.03
sneakers        N            2.01
white           J            2.06
leather         N            2.74
shoes           N            1.54
boots           N            1.41
socks           N            1.79
rubber shoes    N            3.03
* (location)    P            0.68
* (modifying)   E            0.12
* (subject)     P            0.22
All α-covers using dice measure
(Glosses only; S = subject marker, O = object marker, L = location marker,
M = modifying ending; '+' joins adjacent morphemes, '...' marks an interruption.)
...S...white+M...sneakers+O...wear...
...S...M...shoes...wear...
...S...M...shoes+O...wear...
...S...L...M...O...much...wear...
...S...L...O...much...wear...
...leather boots...wear...
...leather boots+O...wear...
...leather shoes...wear...
...leather shoes+O...wear...
...boots...wear...
...boots+O...wear...
...M...leather boots+O...wear...
...M...leather shoes+O...wear...
...M...wear...
...M...L...O...wear...
...M...L...wear...
...M...O...wear...
...O...wear...
...much...wear...
...shoes...wear...
...shoes+O...wear...
...still...wear...
...socks...wear...
...socks...L...M...boots+O...wear...
...L...rubber shoes...wear...
...L...M...boots+O...wear...
...L...M...socks+O...wear...
...L...O...much...wear...
...L...wear...
...L...socks...wear...
...L...socks+O...wear...
...L...white...rubber shoes...wear...
...L...white...rubber shoes+O...wear...
...L...white+M...rubber shoes+O...wear...
...sneakers...wear...
...sneakers+O...wear...
...white...rubber shoes...wear...
...white...wear...
...white...sneakers...wear...
...white...sneakers+O...wear...
...white+M...wear...
Figure 4: Meaningful bigrams and all α-compatibility classes of 'wear'
tively and the evaluation function which appro- 
priately combines the measures. Our approach 
was primarily focused on the subtle relationships
between word positions and collocations
in a free order language.
We extracted meaningful bigrams using an
evaluation function and extended them into
n-grams by producing α-compatibility classes.
The usefulness of our algorithm was illustrated
by examples and tables.
This method covered a wide range of collocations:
the extracted collocation patterns included case
frames, multiple function words, selectional
restrictions, semantic phrases, compound nouns,
and idioms. It could also be applicable to other
free order languages.

(Glosses only; the Korean forms are not legible in the source.)

(Adverb, 'not')      (Postposition, 'not')   (Noun, 'not')
not only             too                     not because
not only             not also                not ~ that
simply ~ not         not                     not important
but ~ not            not                     not intend to
never                not                     not ~ that
not necessarily      not also                not the reason

Table 4: Examples of bigrams for negation expressions

collocations (longest collocations)
dice:
  ...S...boots...wear...
  ...S...L...M...O...wear...
  ...S...L...O...much wear...
  ...S...M shoes+O...wear...
relative entropy:
  (four collocations; the Korean forms are not legible in the source)

S : a postposition for a subject case
O : a postposition for an object case
L : a postposition for a location case
M : for modifying

Figure 5: The longest n-gram collocations of 'wear'
With further development in phrase recognition,
the input format, and the relative distance
between morphemes, the algorithm could be used
more effectively. Linguistic content for the
statistical constraints should also be reflected
in the system.
We plan to check how well this algorithm works
for English and to align bilingual collocations
for machine translation.

References 

Benson, M., Benson, E., Ilson. R. 1986 The 
BBI Combinatory Dictionary of English: A 
Guide to Word Combinations. John Ben- 
jamins, Amsterdam and Philadelphia. 

Breidt, E. 1993 Extraction of V-N collocations 
fiom text corpora: A feasibility study for 
German. In the 1st ACL-Workshop on Very 
La.~ye Corpora. 

Choueka, Y., Klein, T., and Neuwitz, E. 1983. 
Automatic retrieval of frequent idiomatic and 
collocationM expressions in a large corpus. In 
Journal .for Literary and Linguistic Comput- 
ing, 4:34-38. 

Clmrch, K., and Hanks, P. 1989. Word as- 
sociation norms, mutual information, and 
lexicography. In Computational Linguistics, 
16(1):22-29. 

Cowie, A. P. 1981. The treatment of colloca- 
tions and idioms in learner's dictionaries. In 
Applied Linguistics, 2(3):223-235. 

Cruse, D. P. 1986 Lexical Semantics. Cam- 
bridge University Press. 

Haruno, M., Ikehara, S., and Yamazaki. T. 1996 
Learning bilingual collocations by word-level 
sorting. In Proceedings of the 16th COLING, 
525-530. 

Ikehara, S., Shirai, S., and Uchino, H. 1996 
A Statistical Method for Extracting Uninter- 
rupted and Interrupted Collocations. In Pro- 
ceedings of the 16th COLING, 574-579. 

Kjellmer, G. 1995 A mint of phrases: Corpus 
Linguistics, 111-127. Longman. 

Klir, J. G., and Yuan, B. 1995 Fuzzy Sets 
And Fuzzy Logic: Theory And Applications. 
Prentice-Hall. 

Martin, W., and Sterkenburg, V. P. 1983 Lexi- 
cography: principles and practice 

Nagao, M., and Mori, S. 1994. A new method 
of n-gram statistics tbr large number of n and 
automatic extraction of words and phrases 
from large text data of Japanese. In Proceed- 
ings of the 15th COLING, 611-615. 

Ross, S. M. 1987 Introduction To Probability 
And Statistics For Engineers And Scientists 
John Wiley ~ Sons 

Shimohata, S., Sugio, T., and Nagata, J. 1997. 
Retrieving collocations by co-occurrences and 
word order coristraints. In the 35th Annual 
Meeting qf ACL, 476-481. 

Sm~Ldja, F. 1993. Retrieving collocations from 
text: Xtract. In Computational Linguistics, 
19(1):143-177. 

Sm~tdja, F., MaKeown, K., and Hatzivas- 
silogtou, V. 1996. Translating collocations 
tbr bilingual lexicons: A statistical approach. 
In Computational Linguistics, 22 (1): 1-38. 

Lee, KongJoo, Kim, Jaehoon Kim, Kim, 
Gih:hang. 1995. Extracting Collocations 
fronl Tagged Corpus in Korean. In Proceed- 
ings of Korean Information Sience Society, 
22(2):623-626. 
