Extracting Nested Collocations 
Katerina T. Frantzi and Sophia Ananiadou 
Dept. of Computing
Manchester Metropolitan University
Manchester, M1 5GD, U.K.
{K.Frantzi,S.Ananiadou}@doc.mmu.ac.uk
Abstract 
This paper provides an approach to the semi-automatic extraction of
collocations from corpora using statistics. The growing availability
of large textual corpora, and the increasing number of applications
of collocation extraction, has given rise to various approaches on
the topic. In this paper, we address the problem of nested
collocations; that is, those being part of longer collocations.
Most approaches till now treated substrings of collocations as
collocations only if they appeared frequently enough by themselves
in the corpus. These techniques left a lot of collocations
unextracted. In this paper, we propose an algorithm for a
semi-automatic extraction of nested uninterrupted and interrupted
collocations, paying particular attention to nested collocation.
1 Introduction 
The increased interest in collocation extraction comes from the fact
that they can be used for many NLP applications such as machine
translation, machine aids for translation, dictionary construction,
and second language learning, to name a few.
Recently, large scale textual corpora give the potential of working
with the real data, either for grammar inferring, or for enriching
the lexicon. These corpus-based approaches have also been used for
the extraction of collocations.
In this paper we are concerned with nested collocations:
collocations that are substrings of other longer ones. Regarding
this type of collocation, the approaches till now could be divided
into two groups: those that do not refer to substrings of
collocations as a particular problem (Church and Hanks, 1990; Kim
and Cho, 1993; Nagao and Mori, 1994), and those that do (Kita et
al., 1994; Smadja, 1993; Ikehara et al., 1995; Kjellmer, 1994).
However, even the latter deal with only part of the problem: they
try not to extract the unwanted substrings of collocations. In
favour of this, they leave a large number of nested collocations
unextracted.
In section 2 collocations are briefly discussed and the problem is
determined. In section 3 our approach to the problem, the algorithm
and an example are given. In section 4 the experiments are discussed
and the method is compared with that proposed by (Kita et al.,
1994). In section 5 there are comments on related work and finally
section 6 contains the conclusions and the future work.
2 Collocations - The Problem 
Collocations are pervasive in language: "letters" are "delivered",
"tea" is "strong" and not "powerful", we "run programs", and so on.
Linguists have long been interested in collocations and the
definitions are numerous and varied. Some researchers include
multi-element compounds as examples of collocations; some admit only
collocations consisting of pairs of words, while others admit only
collocations consisting of a maximum of five or six words; some
emphasize syntagmatic aspects, others semantic aspects. The common
points regarding collocations appear to be, as (Smadja, 1993)
suggests¹: they are arbitrary (it is not clear why to "fall through"
means to "fail"), they are domain-dependent ("interest rate", "stock
market"), they are recurrent and cohesive lexical clusters: the
presence of one of the collocates strongly suggests the rest of the
collocation ("United" could imply "States" or "Kingdom").
¹ He classifies collocations into predicative relations, rigid noun
phrases and phrasal templates.
It is not the goal of this paper to provide yet 
another definition of collocation. We adopt as a 
working definition the one by (Sinclair, 1991):
Collocation is the occurrence of two or 
more words within a short space of each 
other in a text. 
Let us recall that collocations are domain-dependent. Sublanguages
have remarkably high incidences of collocation (Ananiadou and
McNaught, 1995). (Frawley, 1988) neatly sums up the nature of
sublanguage, showing the key contribution of collocation:
• sublanguage is strongly lexically based
• sublanguage texts focus on content
• lexical selection is syntactified in sublanguages
• collocation plays a major role in sublanguage
• sublanguages demonstrate elaborate lexical cohesion.
The particular structures found in sublanguage texts reflect very
closely the structuring of a sublanguage's associated conceptual
domain. It is the particular syntactified combinations of words that
reveal this structure. Since we work with sublanguages we can use
"small" corpora, as opposed to the larger ones we would need if we
were working with a general language corpus.
In the Brown Corpus for example, which consists 
of one million words, there are only 2 occurrences 
of "reading material", 2 of "cups of coffee", 5 of 
"for good" and 7 of "as always", (Kjellmer, 1994). 
We extract uninterrupted and interrupted collocations. The
interrupted are phrasal templates only and not predicative
relations. We focus on the problem of the extraction of those
collocations we call nested collocations. These collocations are at
the same time substrings of other longer collocations. To make this
clear, consider the following strings: "New York Stock Exchange",
"York Stock", "New York" and "Stock Exchange". Assume that the first
string, being a collocation, is extracted by some method able to
extract collocations of length two or more. Are the other three
extracted as well? "New York" and "Stock Exchange" should be
extracted, while "York Stock" should not. Though the examples here
are from domain-specific lexical collocations, grammatical ones can
be nested as well: "put down as", "put down for", "put down to" and
"put down".
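
To make the notion of nesting concrete, the following Python sketch lists every contiguous sub-sequence of at least two words inside a longer collocation. The function name `nested_substrings` is ours and does not come from the paper. The code only enumerates the possible nested strings; it does not decide which of them are collocations.

```python
def nested_substrings(collocation):
    """Return all contiguous word sub-sequences of length >= 2
    that are strictly shorter than the collocation itself."""
    words = collocation.split()
    n = len(words)
    found = []
    for length in range(n - 1, 1, -1):        # longest sub-sequences first
        for start in range(n - length + 1):
            found.append(" ".join(words[start:start + length]))
    return found


for s in nested_substrings("New York Stock Exchange"):
    print(s)
```

For the four-word example this yields two 3-grams and three 2-grams, among them the unwanted "York Stock"; telling the wanted substrings from the unwanted ones is the problem the rest of the paper addresses.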
(Smadja, 1993; Kita et al., 1994; Ikehara et al., 1995) mention
substrings of collocations. Smadja's Xtract produces only the
biggest possible n-grams. Ikehara et al. exclude the substrings of
the retrieved collocations.
A more precise approach to the problem is provided by (Kita et al.,
1994). They extract a substring of a collocation if it appears a
significant amount of times by itself. The following example
illustrates the problem and their approach: consider the strings
a="in spite" and b="in spite of", with n(a) and n(b) their numbers
of occurrences in the corpus respectively. It will always be
n(a) ≥ n(b), so whenever b is identified as a collocation, a is too.
However, a should not be extracted as a collocation. So, they modify
the measure of frequency of occurrence to become

$$K(a) = (|a| - 1)\,(n(a) - n(b)) \qquad (1)$$

where
a is a word sequence
|a| is the length of a
n(a) is the number of occurrences of a in the corpus
b is every word sequence that contains a
n(b) is the number of occurrences of b

As a result they do not extract the substrings of longer
collocations unless they appear a significant amount of times by
themselves in the corpus.
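
As a small illustration of Equation 1, the sketch below evaluates K(a) for a single containing sequence b. This is a simplifying assumption on our part: the text defines b as "every word sequence that contains a" and does not spell out how several such sequences are combined. The function name `cost_criterion` and the counts are hypothetical.

```python
def cost_criterion(a_words, n_a, n_b):
    """K(a) = (|a| - 1) * (n(a) - n(b)), for one longer sequence b
    that contains a (assumption: a single b is considered)."""
    return (len(a_words) - 1) * (n_a - n_b)


# Hypothetical counts: "in spite" is always followed by "of".
print(cost_criterion(["in", "spite"], n_a=12, n_b=12))   # 0

# Hypothetical counts: "in spite" also occurs 3 times on its own.
print(cost_criterion(["in", "spite"], n_a=15, n_b=12))   # 3
```

When a never occurs outside b the measure drops to zero, which is the behaviour the authors describe: the substring is kept only if it has occurrences of its own.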
The problem is not solved. Table 1 gives the n-grams containing
"Wall Street" that are extracted by Cost-Criteria. The corpus
consists of 40,000 words of market reports. Only those n-grams of
frequency 3 or more are considered. It can be seen that "Wall
Street" is not extracted as a collocation, though it has a frequency
of occurrence of 38.
Table 1: n-grams extracted by Cost-Criteria containing "Wall Street"

(Numeric columns are only partly recoverable; the surviving values
are 19, 20, 22, 26, 19, 38, 20, and their alignment with the rows
is lost.)

Candidate Collocations
Staff Reporter of The Wall Street Journal
Wall Street analysts
Reporter of The Wall Street Journal
Staff Reporter of The Wall Street
of The Wall Street Journal
The Wall Street Journal
Wall Street Journal
Reporter of The Wall Street
Wall Street
of The Wall Street
The Wall Street
3 Our approach - The Algorithm 
We call the extracted strings candidate collocations rather than
collocations, since what we accept as collocations depends on the
application. It is the human judge that will give the final
decision. This is the reason we consider the method as
semi-automatic.
Let us consider the string "New York Stock Exchange". Within this
string, that has already been extracted as a candidate collocation,
there are two substrings that should be extracted, and one that
should not. The issue is how to distinguish when a substring of a
candidate collocation is a candidate collocation, and when it is
not. Kita et al. assume that the substring is a candidate
collocation if it appears by itself (with a relatively high
frequency). To this we add that:

    the substring appears in more than one candidate collocations,
    even if it does not appear by itself.

"Wall Street", for example, appears 30 times in 6 longer candidate
collocations, and 8 times by itself. If we considered only the
number of times it appears by itself, it would get a low value as a
candidate collocation. We have to consider the number of times it
appears within longer candidate collocations. A second factor is the
number of these longer collocations. The greater this number is, the
better the string is distributed, and the greater its value as a
candidate collocation. We make the above conditions more specific
and give the measure for a string being a candidate collocation. The
measure is called C-value and the factors involved are the string's
frequency of occurrence in the corpus, its frequency of occurrence
in longer candidate collocations, the number of these longer
candidate collocations and its length. Regarding its length, we
consider longer collocations to be "more important" than shorter
appearing with the same frequency. More specifically, if |a| is the
length² of the string a, its C-value is analogous to |a| - 1. The 1
is given since the shortest collocations are of length 2, and we
want them to be "of importance" 2-1=1.
² We use the same notation as (Kita et al., 1994).

More specifically:
1. If a has the same frequency with a longer candidate collocation
   that contains a, it is assigned C-value(a)=0, i.e. it is not a
   collocation. It is straightforward that in this case a appears in
   one only longer candidate collocation.
2. If n(a) is the number of times a appears, and a is not a
   substring of an already extracted candidate collocation, then a
   is assigned

   $$\text{C-value}(a) = (|a| - 1)\,n(a) \qquad (2)$$

3. If a appears as a substring in one or more collocations (not with
   the same frequency), then it is assigned

   $$\text{C-value}(a) = (|a| - 1)\left(n(a) - \frac{t(a)}{c(a)}\right) \qquad (3)$$

   where t(a) is the total frequency of a in longer candidate
   collocations and c(a) the number of these candidate collocations.
   This is the most complicated case.
   The importance of the number of occurrences of a string in a
   longer string is illustrated with the denominator of the fraction
   in Equation 3. The bigger the number of strings a substring
   appears in, the smaller the fraction t(a)/c(a), and the bigger
   the C-value of the string.
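
The three cases can be read as a single function of a string's length and of the counts n(a), t(a) and c(a). The Python sketch below is one way to write it down; the name `c_value` and the argument order are ours, and the test n == t is our reading of "same frequency with a longer candidate collocation".

```python
def c_value(length, n, t=0, c=0):
    """C-value of a string of `length` words.

    n: total frequency of the string in the corpus
    t: frequency of the string inside longer candidate collocations
    c: number of longer candidate collocations that contain it
    """
    if c == 0:
        return (length - 1) * n            # case 2, Equation 2
    if n == t:
        return 0                           # case 1: never seen outside longer strings
    return (length - 1) * (n - t / c)      # case 3, Equation 3


# "Wall Street": 38 occurrences, 30 of them inside 6 longer candidates.
print(c_value(2, 38, t=30, c=6))   # 33.0
```

With the figures quoted above for "Wall Street" the function returns 33, whereas a measure based only on the 8 stand-alone occurrences would rank the string much lower.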
The algorithm for the extraction of the candidate collocations
follows:

extract the n-grams
decide on the lowest frequency of collocations
remove the n-grams below this frequency
for all n-grams a of maximum length
    calculate their C-value = (|a| - 1) n(a)
    for all substrings b
        revise t(b)
        revise c(b)
for all smaller n-grams a in descending order
    if (total frequency of a) = (frequency of a in a longer string)
        a is NOT a collocation
    else
        if a appears for the first time
            C-value = (|a| - 1) n(a)
        else
            C-value = (|a| - 1) (n(a) - t(a)/c(a))
        for all substrings b
            revise t(b)
            revise c(b)
The above algorithm computes the C-value of each string in an
incremental way. That is, for each string a, we keep a tuple
(n(a), t(a), c(a)) and we revise the t(a) and c(a) values. For each
n-gram b, every time it is found in a longer extracted n-gram a, the
values t(b) and c(b) are revised:

$$t(b) = t(b) + n(a) - t(a)$$
$$c(b) = c(b) + 1$$

In the initial stage, n(a) is set to the frequency of a appearing on
its own, and t(a) and c(a) are set to 0.
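
The incremental procedure can be sketched in Python as follows. This is a minimal reading of the pseudocode, not the authors' implementation: the names `c_values` and `proper_substrings` are ours, n-grams are represented as tuples of words, the input is assumed to be already thresholded by frequency, and n(a) is taken to be the total corpus frequency of a.

```python
from collections import defaultdict


def proper_substrings(ngram):
    """Yield contiguous sub-n-grams of length >= 2, shorter than ngram."""
    n = len(ngram)
    for length in range(n - 1, 1, -1):
        for start in range(n - length + 1):
            yield ngram[start:start + length]


def c_values(freq):
    """freq maps each n-gram (tuple of words) to its frequency n(a).
    Returns a dict mapping each n-gram to its C-value."""
    t = defaultdict(int)   # frequency inside longer candidates
    c = defaultdict(int)   # number of longer candidates containing it
    result = {}
    # Longest n-grams first, so t and c are final when a is reached.
    for a in sorted(freq, key=len, reverse=True):
        n_a = freq[a]
        if c[a] == 0:                       # a appears for the first time
            result[a] = float((len(a) - 1) * n_a)
        elif n_a == t[a]:                   # only ever seen inside longer strings
            result[a] = 0.0
            continue                        # rejected strings revise nothing
        else:
            result[a] = (len(a) - 1) * (n_a - t[a] / c[a])
        for b in set(proper_substrings(a)):
            if b in freq:
                t[b] += n_a - t[a]
                c[b] += 1
    return result


# Toy input with made-up selection of strings (not a full corpus).
toy = {
    ("Wall", "Street", "Journal"): 26,
    ("Wall", "Street", "analysts"): 3,
    ("Wall", "Street"): 38,
}
for ngram, value in c_values(toy).items():
    print(" ".join(ngram), value)
```

Processing the n-grams from longest to shortest is what allows a single pass: by the time a string is scored, every longer candidate that contains it has already contributed to its t and c.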
Table 2: n-grams extracted by C-value containing "Wall Street"

C-value  F    Candidate Collocations
114      19   Staff Reporter of The Wall Street Journal
37.34    26   Wall Street Journal
36       22   The Wall Street Journal
33       38   Wall Street
31.34    23   The Wall Street
6        3    Wall Street analysts
4        20   of The Wall Street Journal
0        19   Reporter of The Wall Street
0        19   Reporter of The Wall Street Journal
0        19   Staff Reporter of The Wall Street
0        20   of The Wall Street
An example:
Let us calculate the C-value for the string "Wall Street". Table 2
shows all the strings that appear more than twice, and that contain
"Wall Street".
1. The analysis starts from the longest string, the 7-gram "Staff
Reporter of The Wall Street Journal". Its C-value is calculated from
Equation 2. For each substring contained in the 7-gram, the number
19 (the frequency of the 7-gram) is kept as its (till now) frequency
of occurrence in longer strings. For each of them, the fact that
they have been already found in a longer string is kept as well.
Therefore, t("Wall Street")=19 and c("Wall Street")=1.
2. We continue with the two 6-grams. Both of them, "Reporter of The
Wall Street Journal" and "Staff Reporter of The Wall Street", get
C-value=0 since they appear with the same frequency as the 7-gram
that contains them. Therefore, they do not form candidate
collocations and they do not change the t("Wall Street") and the
c("Wall Street") values.
3. For the 5-grams, there is one appearing with a frequency bigger
than that of the 7-gram it is contained in, "of The Wall Street
Journal". This gets its C-value from Equation 3. Its substrings
increase their frequency of occurrence (as substrings) by 20-19=1
(20 is the frequency of the 5-gram and 19 the frequency it appeared
in longer candidate collocations), and the number of occurrence as
substring by 1. Therefore, t("Wall Street")=19+1=20 and c("Wall
Street")=1+1=2. The other 5-gram is not a candidate collocation (it
gets C-value=0).
4. For the 4-grams, "The Wall Street Journal" occurs in two longer
n-grams and therefore gets its C-value from Equation 3. From this
string, t("Wall Street")=20+2=22 and c("Wall Street")=2+1=3. "of The
Wall Street" is not accepted as a candidate collocation since it
appears with the same frequency as "of The Wall Street Journal".
5. "Wall Street analysts" appears for the first time so it gets its
C-value from Equation 2. "Wall Street Journal" and "The Wall Street",
appearing in longer extracted n-grams, get their values from
Equation 3. They make t("Wall Street")=22+3+4+1=30 and c("Wall
Street")=3+1+1+1=6.
6. Finally, we evaluate the C-value for "Wall Street" from
Equation 3. We find C-value("Wall Street")=33.
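
The bookkeeping of the example can be replayed in a few lines of Python. The sketch below only accumulates t("Wall Street") and c("Wall Street") from the six accepted longer candidates, using the frequencies given in the example; the function name `trace_wall_street` and the layout of the `steps` list are ours.

```python
def trace_wall_street():
    # (longer candidate, its frequency n, its own t at the time it is processed)
    steps = [
        ("Staff Reporter of The Wall Street Journal", 19, 0),
        ("of The Wall Street Journal", 20, 19),
        ("The Wall Street Journal", 22, 20),
        ("Wall Street analysts", 3, 0),
        ("Wall Street Journal", 26, 22),
        ("The Wall Street", 23, 22),
    ]
    t = 0
    c = 0
    for name, n_longer, t_longer in steps:
        t += n_longer - t_longer      # t(b) = t(b) + n(a) - t(a)
        c += 1                        # c(b) = c(b) + 1
        print(f"{name}: t={t}, c={c}")
    n = 38                            # total frequency of "Wall Street"
    value = (2 - 1) * (n - t / c)     # Equation 3 with |a| = 2
    return t, c, value


print(trace_wall_street())   # (30, 6, 33.0)
```

The rejected strings (the two 6-grams, the second 5-gram and "of The Wall Street") are left out of `steps` because, as stated in the example, they do not change t or c.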
4 Experiments - Comparison
The corpus used for the experiments is quite small (40,000 words)
and consists of material from the Wall Street Journal newswire. For
these experiments we used n-grams of maximum length 10. Longer
n-grams appear once only (because of the size of the corpus). The
maximum length of the n-grams to be extracted is variable and
depends on the size of the corpus and the application.
From the extracted n-grams, those with a frequency of 3 or more were
kept (other approaches get rid of n-grams of such low frequencies
(Smadja, 1993)). These n-grams were forwarded into the
implementation of our algorithm as well as our implementation of the
algorithm by (Kita et al., 1994).
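
A preprocessing step of this kind might look like the Python sketch below, which counts all n-grams up to a maximum length and keeps those at or above a frequency threshold. The function names are ours, the token list is assumed to be a single already-tokenised sequence, and sentence boundaries are ignored for brevity, which the experiments described here do not do.

```python
from collections import Counter


def count_ngrams(tokens, max_len=10, min_len=2):
    """Count every contiguous n-gram with min_len <= n <= max_len."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def keep_frequent(counts, min_freq=3):
    """Keep only the n-grams that occur at least min_freq times."""
    return {ngram: f for ngram, f in counts.items() if f >= min_freq}


tokens = "a b a b a b".split()          # toy token stream
print(keep_frequent(count_ngrams(tokens)))
```

The defaults mirror the settings reported in this section (maximum length 10, minimum frequency 3); both would normally be tuned to the corpus size and the application.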
The Cost-Criteria algorithm needs a second threshold (besides the
one for the frequency of the n-grams): for every n-gram a, K(a) is
evaluated, and only those n-grams with this value greater than the
preset threshold will take part in the rest of the algorithm. We set
this threshold to 3 again for the same reason as above (the gain we
would get for precision if we had set a higher threshold would be
lost on recall).
Table 3 shows the candidate collocations with the higher values,
extracted with C-value. A lot of candidate collocations extracted
may seem unimportant. This is because the algorithm extracts the
word sequences that are frequent. Which of these candidate
collocations we should keep depends on the application. Brill's
part-of-speech tagger, (Brill, 1992), was used to remove the n-grams
that had an article as their last word.
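
The effect of this filter can be approximated without a tagger. The Python sketch below swaps in a fixed list of English articles in place of Brill's part-of-speech tags, so it is a simplification rather than the procedure used in the experiments; `ARTICLES` and `drop_article_final` are hypothetical names.

```python
# Assumption: a plain word list stands in for the tagger's article tag.
ARTICLES = {"a", "an", "the"}


def drop_article_final(ngrams):
    """Remove n-grams (tuples of words) whose last word is an article."""
    return [g for g in ngrams if g[-1].lower() not in ARTICLES]


candidates = [
    ("of", "The"),
    ("Wall", "Street"),
    ("at", "the", "end", "of"),
]
print(drop_article_final(candidates))
```

Only the final word is checked, so a string such as "at the end of" survives even though it contains an article.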
Table 3: Extracted candidate collocations with C-value in descending
order.

C-value  F    Candidate Collocation
184      92   WALL STREET JOURNAL
114      19   Staff Reporter of The Wall Street Journal
87.6     93   United States
79.6     44   the United States
53.2     59   the United
49.5     20   <number> to <money> from
44.75    25   to <money> from
44       48   said it
41.17    44   the company
37.34    26   Wall Street Journal
36       6    <number> to <money> from <money> a year
36       22   The Wall Street Journal
33       38   Wall Street
31.34    23   The Wall Street
27.8     31   a year
27       3    There were <number> selling days in the period this year
27       3    There were <number> selling days in the period this
27       30   to be
24       27   will be
24       10   at the end of
23.34    27   the company's
21.34    27   compared with
21       10   <COMMENT> Paragraphing Error <COMMENT>
20       10   <money> a share
20       5    priced at <number> to yield
19.67    23   White House
18.5     23   the market
18       18   Total cars
18       6    in the United States
18       9    Tan Sri Khoo
18       9    The United States
18       21   National Bank
17       17   has been
17       17   said Mr
17       17   said that
17       11   the end of
17       21   of its
16.4     19   fourth quarter
16       16   Diamond Shamrock
16       4    <number> <COMMENT> Paragraphing Error <COMMENT>
16       8    as well as
16       19   that it
15.5     20   more than
15       15   had been
15       15   it is
15       3    <money> at the end of <number>
15       3    in a Securities and Exchange Commission
15       3    a <number> for <number> stock split
15       3    sales rose <number> to <money> from
15       5    that the United States
15       18   because of
Among the extracted n-grams we can see the domain-specific candidate
collocations, such as "Staff Reporter of the Wall Street", "National
Bank" etc., and those that appear within other collocations rather
than by themselves, "Wall Street Journal", "Wall Street" etc.
There are, however, problems:
1. We did not calculate the precision or recall of the C-value
algorithm. These calculations depend on the definition of
collocation and they are domain dependent. (Kjellmer mentions 19
categories of collocation (Kjellmer, 1994)).
2. As it can be seen from Table 3, one string appearing both in
small and capital letters is treated as two different strings. The
problem can be partially solved if we use a canonical form. However,
if we want to apply the algorithm for the extraction of
domain-specific collocations, case is pertinent.
3. ".", in strings like "e.t.c.", "et al." etc., is taken as a
sentence boundary even when it is not.
4. How to filter out the extracted n-grams that are not relevant to
the application (for the candidate collocations) we are interested
in, is another problem. Actually, for some of the extracted n-grams
("to be", "has been", "said that", etc.), we cannot think of any
application for which these n-grams would be useful. And though some
of them could be filtered out by a part-of-speech tagger, we cannot
say this for all the types of the "unwanted" extracted n-grams.
5 Related Work
Besides the work by Kita et al. mentioned earlier, there are other
interesting approaches to the extraction of collocations.
(Choueka et al., 1983) proposed a method, based on the observed
frequency of sequences of words, to extract uninterrupted
collocations, but the results are dependent on the size of the
corpus.
(Church and Hanks, 1990), proposed the association ratio, a measure
based on mutual information (Fano, 1961), to estimate word
association norms. They identify pairs of words that appear together
more often than by chance. The collocations they identify could also
be due to semantic reasons. They allow gaps between the words and
therefore extract interrupted word sequences. Since they only deal
with collocations of length two (though mutual information can be
extended for an arbitrary number of events, (Fano, 1961; McEliece,
1977)), they do not consider nested collocations.
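
For orientation, the association ratio for a word pair is usually written as log2 of P(x,y) / (P(x) P(y)). The Python sketch below estimates the probabilities as simple relative frequencies; the function name, the parameter names and the counts are hypothetical, and the windowing that allows gaps between the two words is left out.

```python
import math


def association_ratio(f_xy, f_x, f_y, n):
    """log2( P(x,y) / (P(x) * P(y)) ), with probabilities estimated
    as counts divided by the corpus size n (assumption)."""
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: the pair co-occurs every time either word appears.
print(association_ratio(f_xy=10, f_x=10, f_y=10, n=1000))

# Hypothetical counts: co-occurrence at chance level gives a ratio near 0.
print(association_ratio(f_xy=1, f_x=10, f_y=100, n=1000))
```

Because the measure is defined over pairs of words, it has no built-in notion of one candidate being contained in another, which is why nesting does not arise in that approach.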
(Kim and Cho, 1993), proposed mutual information to calculate the
degree of word association of compound words. They extend the
measure for three words in a different way than that defined by
(Fano, 1961), and no mention is given to how their formulas would be
extended for word sequences of length more than three. They do not
consider nested collocations.
(Smadja, 1993), extracts uninterrupted as well 
as interrupted collocations (predicative relations, 
rigid noun phrases and phrasal templates). The 
system performs very well under two conditions: 
the corpus must be large, and the collocations we 
are interested in extracting, must have high fre- 
quencies. 
(Nagao and Mori, 1994), extract collocations using the following
rule: longer collocations and frequent collocations are more
important. An improvement to this algorithm is that of (Ikehara et
al., 1995). They proposed an algorithm for the extraction of
uninterrupted as well as interrupted collocations from Japanese
corpora. The extraction involves the following conditions: longer
collocations have priority, more frequent collocations have
priority, substrings are extracted only if found in other places by
themselves.
Finally, the Dictionary of English Collocations, (Kjellmer, 1994),
includes n-grams appearing even only once. For each of them its
exclusive frequency (number of occurrences the n-gram appeared by
itself), its inclusive frequency (number of times it appeared in
total) and its relative frequency (the ratio of its actual frequency
to its expected frequency), is given.
6 Conclusions and Future Work 
As collocation identification (either in general lan- 
guage or in sublanguages) finds many applica- 
tions, the need to automate, as much as possible, 
that process increases. Automation is helped by 
the recent availability of large scale textual cor- 
pora. 
In this paper we dealt with the extraction of uninterrupted and
interrupted collocations focusing on those we call nested
collocations (those being substrings of other collocations). A
method for their extraction was proposed.
In future, we plan to extend our algorithm to include predicative
relations. We are going to incorporate linguistic knowledge to
improve the results. Finally, this algorithm will be applied for
term extraction.
7 Acknowledgements 
We thank our anonymous reviewers for their com- 
ments. 
References 
Ananiadou, S.; McNaught, J. 1995. Terms are not alone: term choice
and choice terms. In Journal of Aslib Proceedings, vol.47,
no.2:47-60.
Brill, E. 1992. A simple rule-based part of speech tagger. In Proc.
of the Third Conference of Applied Natural Language Processing, ACL,
pages 152-155.
Choueka, Y., Klein, T. and Neuwitz, E. 1983. Automatic retrieval of
frequent idiomatic and collocational expressions in a large corpus.
In Journal of Literary and Linguistic Computing, 4:34-38.
Church, K.W. and Hanks, P. 1990. Word Association Norms, Mutual
Information, and Lexicography. In Computational Linguistics,
16:22-29.
Frawley, W. 1988. Relational models and metascience. In Evens, M.
(ed.) Relational models of the lexicon, Cambridge: Cambridge
University Press, 335-372.
Nagao, M. and Mori, S. 1994. A new Method of N-gram Statistics for
Large Number of n and Automatic Extraction of Words and Phrases from
Large Text Data of Japanese. In Proc. of COLING, pages 611-615.
Fano, R.M. 1961. In Transmission of information: a statistical
theory of communications, M.I.T. Press, New York.
Ikehara, S.; Shirai, S. and Kawaoka, T. 1995. Automatic Extraction
of Collocations from Very Large Japanese Corpora using N-gram
Statistics. In Transactions of Information Processing Society of
Japan, 11:2584-2596. (in Japanese).
Kim, P.K. and Cho, Y.K. 1993. Indexing Compound Words from Korean
Texts using Mutual Information. In Proc. of NLPRS, pages 85-92.
Kita, K.; Kato, Y.; Omoto, T. and Yano, Y. 1994. A Comparative Study
of Automatic Extraction of Collocations from Corpora: Mutual
Information vs. Cost Criteria. In Journal of Natural Language
Processing, 1:21-33.
Kjellmer, G. 1994. A Dictionary of English Collocations, Clarendon
Press, Oxford.
McEliece, R.J. 1977. The Theory of Information and Coding, Addison
Wesley, London.
Sinclair, J. 1991. In J. Sinclair and R. Carter, editors, Corpus,
Concordance, Collocation. Oxford University Press, Oxford, England.
Smadja, F. 1993. Retrieving Collocations from Text: Xtract. In
Computational Linguistics, 19:143-177.
