Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 40–49,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Multilingual Collocation Extraction: Issues and Solutions
Violeta Seretan
Language Technology Laboratory
University of Geneva
2, rue de Candolle, 1211 Geneva
Violeta.Seretan@latl.unige.ch
Eric Wehrli
Language Technology Laboratory
University of Geneva
2, rue de Candolle, 1211 Geneva
Eric.Wehrli@latl.unige.ch
Abstract
Although traditionally seen as a language-independent task, collocation extraction relies nowadays more and more on the linguistic preprocessing of texts (e.g., lemmatization, POS tagging, chunking or parsing) prior to the application of statistical measures. This paper provides a language-oriented review of the existing extraction work. It points out several language-specific issues related to extraction and proposes a strategy for coping with them. It then describes a hybrid extraction system based on a multilingual parser. Finally, it presents a case study on the performance of an association measure across a number of languages.
1 Introduction
Collocations are understood in this paper as “idiosyncratic syntagmatic combination of lexical items” (Fontenelle, 1992, 222): heavy rain, light breeze, great difficulty, grow steadily, meet requirement, reach consensus, pay attention, ask a question. Unlike idioms (kick the bucket, lend a hand, pull someone’s leg), their meaning is fairly transparent and easy to decode. Yet, unlike regular productions (big house, cultural activity, read a book), collocational expressions are highly idiosyncratic, since the lexical items a headword combines with in order to express a given meaning are contingent upon that word (Mel’čuk, 2003).
This is apparent when comparing a collocation’s equivalents across different languages. The English collocation ask a question translates as poser une question in French (lit., ?put a question), and as fare una domanda and hacer una pregunta in Italian and Spanish, respectively (lit., to make a question).
As has been pointed out by many researchers (Cruse, 1986; Benson, 1990; McKeown and Radev, 2000), collocations cannot be described by means of general syntactic and semantic rules. They are arbitrary and unpredictable, and therefore need to be memorized and used as such. They constitute the so-called “semi-finished products” of language (Hausmann, 1985) or the “islands of reliability” (Lewis, 2000) on which speakers build their utterances.
2 Motivation
The key importance of collocations in text production tasks such as machine translation and natural language generation has been stressed many times. It has equally been shown that collocations are useful in a range of other applications, such as word sense disambiguation (Brown et al., 1991) and parsing (Alshawi and Carter, 1994).
The NLP community has fully acknowledged the need for an appropriate treatment of multi-word expressions in general (Sag et al., 2002). Collocations are particularly important because of their prevalence in language, regardless of domain or genre. According to Jackendoff (1997, 156) and Mel’čuk (1998, 24), collocations constitute the bulk of a language’s lexicon.
The last decades have witnessed a considerable development of collocation extraction techniques that concern both monolingual and (parallel) multilingual corpora.
We can mention here only part of this work: (Berry-Rogghe, 1973; Church et al., 1989; Smadja, 1993; Lin, 1998; Krenn and Evert, 2001) for monolingual extraction, and (Kupiec, 1993; Wu, 1994; Smadja et al., 1996; Kitamura and Matsumoto, 1996; Melamed, 1997) for bilingual extraction via alignment.
Traditionally, collocation extraction was considered a language-independent task. Since collocations are recurrent, typical lexical combinations, a wide range of statistical methods based on word co-occurrence frequency have been heavily used for detecting them in text corpora. Among the most often used types of lexical association measures (henceforth AMs) we mention: statistical hypothesis tests (e.g., binomial, Poisson, Fisher, z-score, chi-squared, t-score, and log-likelihood ratio tests), which measure the significance of the association between two words based on a contingency table listing their joint and marginal frequencies, and information-theoretic measures (Mutual Information, henceforth MI, and its variants), which quantify the ‘information’ shared by two random variables. A detailed review of the statistical methods employed in collocation extraction can be found, for instance, in (Evert, 2004). A comprehensive list of AMs is given in (Pecina, 2005).
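As an illustration of how such measures operate on a contingency table, the following minimal sketch computes pointwise MI and the t-score for one word pair. The function names, the table layout (o11 = joint frequency, o12 and o21 = each word without the other, o22 = neither) and the toy counts are assumptions made for this example; they are not taken from any of the cited systems.

```python
import math

def contingency(pair_freq, w1_freq, w2_freq, n_pairs):
    """Build the 2x2 table (o11, o12, o21, o22) from joint and marginal counts."""
    o11 = pair_freq                                 # w1 together with w2
    o12 = w1_freq - pair_freq                       # w1 with another word
    o21 = w2_freq - pair_freq                       # another word with w2
    o22 = n_pairs - w1_freq - w2_freq + pair_freq   # neither w1 nor w2
    return o11, o12, o21, o22

def expected_joint(o11, o12, o21, o22):
    """Joint frequency expected if the two words were independent."""
    n = o11 + o12 + o21 + o22
    return (o11 + o12) * (o11 + o21) / n

def pointwise_mi(o11, o12, o21, o22):
    """Pointwise mutual information: log2 of observed over expected frequency."""
    return math.log2(o11 / expected_joint(o11, o12, o21, o22))

def t_score(o11, o12, o21, o22):
    """t-score: observed minus expected, scaled by sqrt of the observed count."""
    return (o11 - expected_joint(o11, o12, o21, o22)) / math.sqrt(o11)

# Toy counts (hypothetical): the pair occurs 30 times among 1000 pairs.
table = contingency(pair_freq=30, w1_freq=40, w2_freq=50, n_pairs=1000)
mi = pointwise_mi(*table)
t = t_score(*table)
```

Both functions assume a non-zero joint frequency, which holds for any pair actually observed in the corpus.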
Very often, in addition to the information on co-occurrence frequency, language-specific information is also integrated in a collocation extraction system (as will be seen in section 3):
- morphological information, in order to count inflected word forms as instances of the same base form. For instance, ask questions, asks question, asked question are all instances of the same word pair, ask - question;
- syntactic information, in order to recognize a word pair even if subject to (complex) syntactic transformations: ask multiple questions, question asked, questions that one might ask.
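The effect of morphological information on frequency counts can be sketched as follows. The tiny lemma dictionary stands in for a real lemmatizer and is purely hypothetical, as are the function names; the observed pairs reuse the ask - question example above.

```python
from collections import Counter

# Toy lemma dictionary (hypothetical); a real system would call a lemmatizer.
LEMMAS = {"asks": "ask", "asked": "ask", "questions": "question"}

def lemma(token):
    """Map an inflected form to its base form, defaulting to the form itself."""
    token = token.lower()
    return LEMMAS.get(token, token)

def count_pairs(pairs):
    """Count word pairs by surface form and by lemma."""
    by_form = Counter((v.lower(), n.lower()) for v, n in pairs)
    by_lemma = Counter((lemma(v), lemma(n)) for v, n in pairs)
    return by_form, by_lemma

observed = [("ask", "questions"), ("asks", "question"), ("asked", "question")]
by_form, by_lemma = count_pairs(observed)
# Form-based counting sees three distinct pairs with frequency 1 each;
# lemma-based counting sees one pair, ("ask", "question"), with frequency 3.
```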
The language-specific modules thus aim at coping with the problem of morphosyntactic variation, in order to improve the accuracy of frequency information. This becomes especially important for free-word-order and for high-inflection languages, for which the token (form)-based frequency figures become too skewed due to the high lexical dispersion. Not only does data scattering modify the frequency numbers used by AMs, but it also alters the performance of AMs if the probabilities in the contingency table become very low.
Morphosyntactic information has in fact been shown to significantly improve the extraction results (Breidt, 1993; Smadja, 1993; Zajac et al., 2003). Morphological tools such as lemmatizers and POS taggers are commonly used in extraction systems; they are employed both for dealing with text variation and for validating the candidate pairs: combinations of function words are typically ruled out (Justeson and Katz, 1995), as are the ungrammatical combinations in the systems that make use of parsers (Church and Hanks, 1990; Smadja, 1993; Basili et al., 1994; Lin, 1998; Goldman et al., 2001; Seretan et al., 2004).
Given the motivations for performing a linguistically-informed extraction (which were also put forth, among others, by Church and Hanks (1990, 25), Smadja (1993, 151) and Heid (1994)) and given the recent development of linguistic analysis tools, it seems plausible that linguistic structure will be taken into account more and more by collocation extraction systems.
The rest of the paper is organized as follows. In section 3 we provide a language-oriented review of the existing collocation extraction work. Then we highlight, in section 4, a series of problems that arise in the transfer of methodology to a new language, and we propose a strategy for dealing with them. Section 5 describes an extraction system, and, finally, section 6 presents a case study on the collocations extracted for four languages, illustrating the cross-lingual variation in the performance of a particular AM.
3 Overview of Extraction Work
3.1 English
As one might expect, the bulk of the collocation extraction work concerns the English language: (Choueka, 1988; Church et al., 1989; Church and Hanks, 1990; Smadja, 1993; Justeson and Katz, 1995; Kjellmer, 1994; Sinclair, 1995; Lin, 1998), among many others¹.
Choueka’s method (1988) detects n-grams (adjacent words) only, by simply computing the co-occurrence frequency. Justeson and Katz (1995) apply a POS filter on the pairs they extract. As in (Kjellmer, 1994), the AM they use is the simple frequency.
Smadja (1993) employs the z-score in conjunction with several heuristics (e.g., the systematic occurrence of two lexical items at the same distance in text) and extracts predicative collocations, rigid noun phrases and phrasal templates. He then uses a parser in order to validate the results. Parsing is shown to lead to an increase in accuracy from 40% to 80%.
¹ E.g., (Frantzi et al., 2000; Pearce, 2001; Goldman et al., 2001; Zaiu Inkpen and Hirst, 2002; Dias, 2003; Seretan et al., 2004; Pecina, 2005), and the list can be continued.
(Church et al., 1989) and (Church and Hanks, 1990) use POS information and a parser to extract verb-object pairs, which they then rank according to the mutual information (MI) measure they introduce.
Lin’s (1998) is also a hybrid approach that relies on a dependency parser. The candidates extracted are then ranked with MI.
3.2 German
German is the second most investigated language, thanks to the early work of Breidt (1993) and, more recently, to that of Krenn and Evert, such as (Krenn and Evert, 2001; Evert and Krenn, 2001; Evert, 2004), centered on evaluation.
Breidt uses MI and t-score and compares the accuracy of the results when various parameters vary, such as the window size, presence vs. absence of lemmatization, corpus size, and presence vs. absence of POS and syntactic information. She focuses on N-V pairs² and, despite the lack of syntactic analysis tools at the time, by simulating parsing she comes to the conclusion that “Very high precision rates, which are an indispensable requirement for lexical acquisition, can only realistically be envisaged for German with parsed corpora” (Breidt, 1993, 82).
Later, Krenn and Evert (2001) used a German chunker to extract syntactic pairs such as P-N-V. Their work laid the basis of formal and systematic methods in collocation extraction evaluation. Zinsmeister and Heid (2003; 2004) focused on N-V and A-N-V combinations identified using a stochastic parser. They applied machine learning techniques in combination with the log-likelihood measure (henceforth LL) for distinguishing trivial compounds from lexicalized ones.
Finally, Wermter and Hahn (2004) identified PP-V combinations using a POS tagger and a chunker. They based their method on a linguistic criterion (that of limited modifiability) and compared their results with those obtained using the t-score and LL tests.
² The following abbreviations are used in this paper: N - noun, V - verb, A - adjective, Adv - adverb, Det - determiner, Conj - conjunction, P - preposition.
3.3 French
Thanks to the outstanding work of Gross on lexicon-grammar (1984), French is one of the most studied languages in terms of the distributional and transformational potential of words. This work was carried out before the computer era and the advent of corpus linguistics, while automatic extraction was performed later, for instance, in (Lafon, 1984; Daille, 1994; Bourigault, 1992; Goldman et al., 2001).
Daille (1994) aimed at extracting compound nouns, defined a priori by means of certain syntactic patterns, like N-A, N-N, N-à-N, N-de-N, N P Det N. She used a lemmatizer and a POS tagger before applying a series of AMs, which she then evaluated against a domain-specific terminology dictionary and against a gold standard manually created from the extraction corpus.
Similarly, Bourigault (1992) extracted noun phrases from shallow-parsed text, and Goldman et al. (2001) extracted syntactic collocations by using a full parser and applying the LL test.
3.4 Other Languages
In addition to English, German and French, other languages for which notable collocation extraction work was performed are, as far as we are aware, the following:
• Italian: early extraction work was carried out by Calzolari and Bindi (1990) and employed MI. It was followed by (Basili et al., 1994), which made use of parsing information;
• Korean: (Shimohata et al., 1997) used an adjacency n-gram model, and (Kim et al., 1999) relied on POS tagging;
• Chinese: (Huang et al., 2005) used POS information, while (Lu et al., 2004) applied extraction techniques similar to the Xtract system (Smadja, 1993);
• Japanese: (Ikehara et al., 1995) was based on an improved n-gram method.
As for multilingual extraction via alignment (where collocations are first detected in one language and then matched with their translation in another language), most of the existing work concerns the English-French language pair and the Hansard corpus of Canadian Parliament proceedings. Wu (1994) signals a number of problems that non-Indo-European languages pose for the existing alignment methods based on word and sentence length: in Chinese, for instance, most of the words are just one or two characters long, and there are no word delimiters. This result suggests that the portability of existing alignment methods to new language pairs is questionable.
We are not concerned here with extraction via alignment. We assume, instead, that multilingual support in collocation extraction means the customization of the extraction procedure for each language. This topic will be addressed in the next sections.
4 Multilingualism: Why and How?
4.1 Some Issues
As the previous section showed, many collocation extraction systems rely on the linguistic preprocessing of source corpora in order to support the candidate identification process. Language-specific information, such as that derived from morphological and syntactic analysis, was shown to be highly beneficial for extraction. Moreover, the possibility of applying the association measures to syntactically homogeneous material is argued to benefit extraction, as the performance of association measures might vary with the syntactic configurations because of the differences in distribution (Krenn and Evert, 2001).
The lexical distribution is therefore a relevant issue from the perspective of multilingual collocation extraction. Different languages show different proportions of lexical categories (N, V, A, Adv, P, etc.) which are evenly distributed across syntactic types³. Depending on the frequency numbers, a given AM could be more suited for a specific syntactic configuration in one language, and less suited for the same configuration in another. Ideally, each language should be assigned a suitable set of AMs to be applied on syntactically homogeneous data.
Another issue that is relevant from the multilingualism perspective is that of the syntactic configurations characterizing collocations. Several such relations (e.g., noun-adjectival modifier, predicate-argument) are likely to remain constant across languages, i.e., to be judged as collocationally interesting in many languages. However, other configurations could be language-specific (like P-N-V in German, whose English equivalent is V-P-N). Yet other configurations might have no counterpart at all in another language (e.g., the French P-A pair à neuf is translated into English as a Conj-A pair, as new).
³ For instance, V-P pairs are more represented in English than in other languages (as phrasal verbs or verb-particle constructions).
Finding all the collocationally-relevant syntactic types for a language is therefore another problem that has to be solved in multilingual extraction. Since defining these types a priori based on intuition does not ensure the necessary coverage, an alternative proposal is to induce them from POS data and dependency relations, as in (Seretan, 2005).
The morphosyntactic differences between languages also have to be taken into account. With English as the most investigated language, several hypotheses were put forth in extraction and became commonplace.
For instance, using a 5-word window as search space for collocation pairs is a usual practice, since this span length was shown to be sufficient to cover a high percentage of syntactic co-occurrences in English. But, as suggested by other researchers, e.g., (Goldman et al., 2001), this assumption does not necessarily hold for other languages.
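The window method itself is simple to sketch, which also makes its limitation visible. In the hypothetical example below, `span` is taken to mean the number of positions searched to the right of a word (one of several possible readings of "5-word window"), and the sample sentence is invented for illustration.

```python
def window_pairs(tokens, span=5):
    """Collect (left, right) pairs where right occurs at most `span` tokens after left."""
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + span]:
            pairs.append((left, right))
    return pairs

# Invented sentence: the object and its verb are six positions apart.
tokens = "the question that the committee members finally asked".split()

found_with_5 = ("question", "asked") in window_pairs(tokens, span=5)
found_with_6 = ("question", "asked") in window_pairs(tokens, span=6)
# With span=5 the pair is missed; widening the window to 6 recovers it,
# at the cost of admitting more unrelated pairs.
```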
Similarly, the higher inflection and the higher transformation potential shown by some languages pose additional problems in extraction, which were rather ignored for English. As Kim et al. (1999) notice, collocation extraction is particularly difficult in free-order languages like Korean, where arguments scramble freely. Breidt (1993) also pointed out a couple of problems that make extraction for German more difficult than for English: the strong inflection for verbs, the variable word order, and the positional ambiguity of the arguments. She shows that even distinguishing subjects from objects is very difficult without parsing.
4.2 A Strategy for Multilingual Extraction
Summing up the previous discussion, the customization of collocation extraction for a given language needs to take into account:
- the syntactic configurations characterizing collocations,
- the lexical distribution over syntactic configurations,
- the adequacy of AMs to these configurations.
These are language-specific parameters which need to be set in a successful multilingual extraction procedure. Truly multilingual systems have not been developed yet, but we suggest the following strategy for building such a system:
A. parse the source corpus, extract all the syntactic pairs (e.g., head-modifier, predicate-argument) and rank them with a given AM,
B. analyze the results and find the syntactic configurations characterizing collocations,
C. evaluate the adequacy of AMs for ranking collocations in each syntactic configuration, and find the most suitable mapping between configurations and AMs.
Once customized for a language, the extraction procedure involves:
Stage 1. parsing the source corpus for extracting the lexical pairs in the relevant, language-specific syntactic configurations found in step B;
Stage 2. ranking the pairs from each syntactic class with the AM assigned in step C.
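The two stages can be sketched as follows, under strong simplifying assumptions: the parser output is represented as hand-written (syntactic type, word, word) triples, and the configuration-to-AM mapping uses raw frequency for every type as a placeholder. The names `AM_FOR_TYPE`, `raw_frequency` and `extract` are hypothetical and do not describe any existing system.

```python
from collections import Counter, defaultdict

def raw_frequency(pair, counts):
    """Placeholder AM: the pair's frequency within its syntactic class."""
    return counts[pair]

# Outcome of steps B and C (hypothetical): relevant types and the AM for each.
AM_FOR_TYPE = {"V-O": raw_frequency, "A-N": raw_frequency}

def extract(parsed_pairs, am_for_type):
    """Stage 1: keep pairs in relevant configurations. Stage 2: rank per class."""
    by_type = defaultdict(Counter)
    for syn_type, w1, w2 in parsed_pairs:
        if syn_type in am_for_type:          # discard irrelevant configurations
            by_type[syn_type][(w1, w2)] += 1
    ranked = {}
    for syn_type, counts in by_type.items():
        score = am_for_type[syn_type]
        ranked[syn_type] = sorted(
            counts, key=lambda pair: score(pair, counts), reverse=True
        )
    return ranked

# Hand-written stand-in for parser output.
parsed = [
    ("V-O", "ask", "question"),
    ("V-O", "ask", "question"),
    ("V-O", "read", "book"),
    ("A-N", "heavy", "rain"),
    ("P-A", "à", "neuf"),
]
ranked = extract(parsed, AM_FOR_TYPE)
```

Keeping the mapping as data rather than code means that customizing the procedure for a new language amounts to supplying a different dictionary.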
5 A Multilingual Collocation Extractor Based on Parsing
Ever since the collocation was brought to the attention of linguists in the framework of contextualism (Firth, 1957; Firth, 1968), it has been predominantly seen as a purely statistical phenomenon of lexical association. In fact, according to a well-known definition, “a collocation is an arbitrary and recurrent word combination” (Benson, 1990).
This approach was at the basis of the computational work on collocation, although there exists an alternative approach (the linguistic, or lexicographic one) that imposes a restricted view on collocation, which is seen first of all as an expression of language.
The existing extraction work (section 3) shows that there is a growing interest in adopting the more restricted (linguistic) view. As mentioned in section 3, the importance of parsing for extraction was confirmed by several evaluation experiments. With the recent development in the field of linguistic analysis, hybrid extraction systems (i.e., systems relying on syntactic analysis for collocation extraction) are likely to become the rule rather than the exception.
Our system (Goldman et al., 2001; Seretan and Wehrli, 2006) is, to our knowledge, the first to perform full syntactic analysis as support for collocation extraction; similar approaches rely on dependency parsers or on chunking.
It is based on a symbolic parser that was developed over the last decade (Wehrli, 2004) and achieves a high level of performance in terms of accuracy, speed and robustness. The languages it supports are, for the time being, French, English, Italian, Spanish and German. A few other languages are also being implemented in the framework of a multilingualism project.
Given that collocation extraction can be seen as a two-stage process (where, in stage 1, collocation candidates are identified in the text corpora, and in stage 2, they are ranked according to a given AM, cf. section 4.2), the role of the parser is to support the first stage. A pair of lexical items is selected as a candidate only if there exists a syntactic relation holding between the two items.
Unlike the traditional, window-based methods, candidate selection is based on syntactic proximity (as opposed to textual proximity). Another peculiarity of our system is that candidate pairs are identified as the parsing goes on; in other approaches, they are extracted by post-processing the output of syntactic tools.
The candidate pairs identified are classified into syntactically homogeneous sets, according to the syntactic relations holding between the two items. Only certain predefined syntactic relations are kept, namely those judged as collocationally relevant after multiple experiments of extraction and data analysis (e.g., adjective-noun, verb-object, subject-verb, noun-noun, verb-preposition-noun). The sets obtained are then ranked using the log-likelihood ratio test (Dunning, 1993).
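For reference, the log-likelihood ratio statistic for a 2x2 contingency table can be sketched as below, in its common form 2 · Σ O · ln(O / E) over the four cells. This is a generic formulation, not the code of the system described here; the cell naming (o11 = joint frequency) and the convention that zero cells contribute nothing to the sum are assumptions of this sketch.

```python
import math

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio (G^2) for a 2x2 table of observed frequencies."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)      # row marginals
    cols = (o11 + o21, o12 + o22)      # column marginals
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:                  # assumption: treat 0 * ln(0) as 0
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total

independent = log_likelihood(10, 10, 10, 10)   # observed equals expected
associated = log_likelihood(30, 10, 20, 940)   # strongly associated toy pair
```

A table whose observed counts match the expected ones scores zero, and the score grows as the joint frequency departs from what independence would predict.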
More details about the system and its performance can be found in (Seretan and Wehrli, 2006). The following examples (taken from the extraction experiment we will describe below) illustrate its potential to detect collocation candidates, even if these are subject to complex syntactic transformations:
1.a) atteindre objectif (Fr): Les objectifs fixés à l’échelle internationale visant à réduire les émissions ne peuvent pas être atteints à l’aide de ces seuls programmes.
1.b) accogliere emendamento (It): Posso pertanto accogliere in parte e in linea di principio gli emendamenti nn. 43-46 e l’emendamento n. 85.
1.c) reforzar cooperación (Es): Queremos permitir a los países que lo deseen reforzar, en un contexto unitario, su cooperación en cierto número de sectores.
The collocation extractor is part of a bigger system (Seretan et al., 2004) that integrates a concordancer and a sentence aligner, and that supports the visualization, the manual validation and the management of a multilingual terminology database. The validated collocations are used for populating the lexicon of the parser and that of a translation system (Wehrli, 2003).
6 A Cross-Lingual Extraction Experiment
A collocation extraction experiment concerning four different languages (English, Spanish, French, Italian) has been conducted on a parallel subcorpus of 42 files from the European Parliament proceedings. Several statistics and extraction results are reported in Table 1.
Statistics             English   Spanish   Italian    French
tokens                 2526403   2666764   2575858   2938118
sent/file               2329.1    2513.7    2331.6    2392.8
complete parses          63.4%     35.5%     46.8%     63.7%
tokens/sent               25.8      25.3      26.3      29.2
extr. pairs (tokens)    617353    568998    666122    565287
token/type                 2.6       2.5       2.3       2.3
LL is def.               85.9%     90.6%     83.5%     92.8%
Table 1: Extraction statistics
We computed the distribution of pair tokens according to the syntactic type and noted that the most marked distributional differences among these languages concern the following types: N-A (7.12), A-N (4.26), V-O (2.68), V-P (4.16), N-P-N (3.81)⁴.
Unsurprisingly, the Romance languages are less different in terms of syntactic co-occurrence distribution, and the deviation of English from the Romance mean is more pronounced, in particular for N-A (9.72), V-P (5.63), A-N (5.25), N-P-N (4.77), and V-O (3.57). These distributional differences might account for the types of collocations highlighted by a particular AM (such as LL) in one language vs. another. Figure 1 displays the relative proportions of 3 syntactic types (adjective-noun, subject-verb and verb-object) that can be found at different levels in the significance list returned by LL.
⁴ The numbers represent the values of the standard deviation of the relative percentages in the whole lists of pairs.
Figure 1: Cross-lingual proportions of A-N, S-V and V-O pairs at different levels in the significance lists
We performed a contrastive analysis of results, by carrying out a case study aimed at checking the LL performance variability across languages. The study concerned the verb-object collocations having the noun policy as the direct object. We specifically focused on the best-scored collocation extracted from the French corpus, namely mener une politique (lit., conduct a policy).
We looked at the translation equivalents of its 74 instances identified by our extraction system in the corpus. The analysis revealed that, at least in this particular case, the verbal collocates of this noun are highly scattered: pursue, implement, conduct, adopt, apply, develop, have, draft, launch, run, carry out for English; practicar, llevar a cabo, desarrollar, realizar, aplicar, seguir, hacer, adoptar, ejercer for Spanish; condurre, attuare, portare avanti, perseguire, praticare, adottare, fare for Italian (among several others). Some of the collocates (those listed first) are more prominently used. But generally they are highly dispersed, and this might indicate that LL has more difficulty pinpointing the best collocate in one language than in another.
We also observed that quite frequently (in about 25% of the cases) the collocation did not conserve its syntactic configuration. Either the verb (here, the equivalent of the French mener) is omitted in translation (as in 2.b below):
2.a) des contradictions existent dans la politique qui est menée (Fr);
2.b) we are dealing with contradictory policy (En),
or, in a few other cases, the whole collocation disappears, since it is paraphrased with a completely different syntactic construction:
3.a) direction qui a mené une politique insensée de réduction de personnel (Fr);
3.b) a management that foolishly engaged in staff reductions (En).
In order to quantify the impact such factors have on the performance of the AM considered, we further scrutinized the collocate list for politique proposed by the LL test for each language (see Table 2). The rank of a pair in the whole list of verb-object collocations extracted, as assigned by the LL test, is shown in the last column. In these significance lists, the collocations with politique as an object constitute a small fraction, and from these, only the top collocations are displayed in Table 2. The threshold was manually defined in accordance with our intuition that the lower-scored pairs observed manifest less collocational strength. It happens to be situated around the LL value of 20 for each language (and is of course specific to the size of our corpus and to the number of V-O tokens identified therein).
If we consider the LL rank as the success measure for collocate detection, we can infer that the collocates of the word under investigation are easier to find in French than in English, Italian or Spanish, because the value in the first row of the last column is smaller. This holds if we are interested in only one (the most salient) collocate for a word.
If we measure the success of retrieving all the collocates (by considering, for instance, the speed of access to them in the results list: the higher the rank, the better), then French can again be considered the easiest because, overall, the positions in the V-O list are higher (i.e., the mean of the rank column is smaller) than for Spanish, Italian and English, in that order.
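The mean-rank comparison can be rechecked from the rank column of Table 2. The sketch below simply copies those ranks into a dictionary (the variable names are ours) and sorts the languages by mean rank.

```python
from statistics import mean

# Ranks copied from the last column of Table 2.
RANKS = {
    "French": [45, 734, 780, 955, 1011, 1599, 1867, 1943],
    "English": [122, 325, 473, 2014, 2090, 2201, 2615, 2930],
    "Spanish": [246, 312, 431, 1003, 1112, 1473, 1707, 1987, 2057],
    "Italian": [382, 735, 976, 1314, 1348, 1607, 1639, 1975, 2087],
}

mean_rank = {lang: mean(ranks) for lang, ranks in RANKS.items()}
# Languages ordered from smallest (best) to largest mean rank.
order = sorted(mean_rank, key=mean_rank.get)
best_single = min(RANKS, key=lambda lang: RANKS[lang][0])
```

Here `order` comes out as French, Spanish, Italian, English, and `best_single` (the language whose top collocate has the smallest rank) is French, in line with the two readings discussed above.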
Language (headword)    collocate     freq   LL score   rank
French (politique)     mener           74      376.8     45
                       élaborer        17       50.1    734
                       adapter          5       48.3    780
                       axer             8       41.4    955
                       pratiquer        9       39.7   1011
                       développer      13       28.1   1599
                       adapter          8       25.2   1867
                       poursuivre      11       24.4   1943
English (policy)       pursue          39      214.9    122
                       implement       38      108.7    325
                       develop         30       81.1    473
                       conduct          8       28.9   2014
                       harmonize        9       28.2   2090
                       gear             5       27.7   2201
                       need            25       24.9   2615
                       apply           16       23.3   2930
Spanish (política)     practicar       17       98.7    246
                       desarrollar     27       82.4    312
                       aplicar         25       65.7    431
                       seguir          17       33.5   1003
                       coordinar        8       31.0   1112
                       basar           11       25.1   1473
                       orientar         6       22.5   1707
                       adaptar          5       20.0   1987
                       construir        6       19.4   2057
Italian (politica)     attuare         23       79.5    382
                       perseguire      14       46.4    735
                       praticare        8       37.6    976
                       seguire         18       30.2   1314
                       portare         12       29.7   1348
                       rivedere         9       26.0   1607
                       riformare        7       25.6   1639
                       sviluppare      12       22.1   1975
                       adottare        20       21.2   2087
Table 2: Verbal collocates for the headword policy
This latter result corresponds, approximately, to the order given by the relative proportion of V-O pairs in each language (Spanish 15.12%, French 15.14%, Italian 17.06%, and English 20.82%). Given that in English V-O pairs are more numerous and the verbs also participate in V-P constructions, it might seem reasonable to expect lower LL scores for V-O collocations in English vs. the other 3 languages.
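The V-O proportions quoted here also allow a check of the standard deviations reported earlier for this type (2.68 across the four languages, 3.57 for English against the Romance mean). The sketch below assumes the sample standard deviation (n - 1 denominator), and assumes that the second figure is the sample standard deviation of the two values "Romance mean" and "English"; both readings are our inference from the numbers, not something the text states.

```python
from statistics import mean, stdev

# Relative proportion of V-O pairs per language, in percent (from the text).
VO_SHARE = {"Spanish": 15.12, "French": 15.14, "Italian": 17.06, "English": 20.82}

# Spread across all four languages.
overall = stdev(list(VO_SHARE.values()))

# Deviation of English from the mean of the three Romance languages.
romance_mean = mean([v for lang, v in VO_SHARE.items() if lang != "English"])
english_vs_romance = stdev([romance_mean, VO_SHARE["English"]])

print(round(overall, 2), round(english_vs_romance, 2))
```

Under these assumptions the two values round to 2.68 and 3.57, matching the figures given for V-O.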
In general, we expect a correlation between extraction difficulty and the distributional properties of co-occurrence types.
7 Conclusion
The paper pointed out several issues that occur in transferring a hybrid collocation extraction methodology (one that combines linguistic with statistical information) to a new language.
Besides the questionable availability of language-specific text analysis tools for the new language, a number of issues that are relevant to extraction proper were addressed: the changes in the distribution of (syntactic) word pairs, and the need to find, for each language, the most appropriate association measure to apply for each syntactic type (given that AMs are sensitive to distributions and syntactic types); the lack of a priori defined syntactic types for a language; and, finally, the portability of some widely used techniques (such as the window method) from English to other languages exhibiting greater word order freedom.
It is again from the multilingualism perspective that the inescapable need for preprocessing the text emerged (cf. the different researchers cited in section 3): highly inflected languages need lemmatizers, and free-word-order languages need structural information, in order to guarantee acceptable results. As language tools are nowadays becoming more and more available, we expect collocation extraction (and terminology acquisition in general) to be performed exclusively by relying on linguistic analysis in the future. We therefore believe that multilingualism is a true concern for collocation extraction.
The paper reviewed the extraction work in a language-oriented fashion, mentioning the type of linguistic preprocessing performed, where applicable, as well as the language-specific issues identified by the authors. It then proposed a strategy for implementing a multilingual extraction procedure that takes into account the language-specific issues identified.
An extraction system for four different languages, based on full parsing, was then described. Finally, an experiment was carried out as a case study, which pointed out several factors that might cause a particular AM to perform differently across languages. The experiment suggested that the log-likelihood ratio test might highlight certain verb-object collocations more easily in French than in Spanish, Italian and English (in terms of salience in the significance list).
Future work needs to extend the type of cross-linguistic analysis initiated here, in order to provide more insights on the differences expected at extraction between one language and another and on the responsible factors, and, accordingly, to define strategies to deal with them.
Acknowledgements
The research described in this paper has been supported in part by a grant from the Swiss National Foundation (No. 101412-103999).

References
Hiyan Alshawi and David Carter. 1994. Training and scaling preference functions for disambiguation. Computational Linguistics, 20(4):635–648.
Roberto Basili, Maria Teresa Pazienza, and Paola Velardi. 1994. A "not-so-shallow" parser for collocational analysis. In Proceedings of the 15th Conference on Computational Linguistics, pages 447–453, Kyoto, Japan. Association for Computational Linguistics.
Morton Benson. 1990. Collocations and general-purpose dictionaries. International Journal of Lexicography, 3(1):23–35.
Godelieve L. M. Berry-Rogghe. 1973. The computation of collocations and their relevance to lexical studies. In A. J. Aitken, R. W. Bailey, and N. Hamilton-Smith, editors, The Computer and Literary Studies, pages 103–112. Edinburgh.
Didier Bourigault. 1992. Surface grammatical analysis for the extraction of terminological noun phrases. In Proceedings of the 15th International Conference on Computational Linguistics, pages 977–981, Nantes, France.
Elisabeth Breidt. 1993. Extraction of V-N-collocations from text corpora: A feasibility study for German. In Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, U.S.A.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1991. Word-sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL 1991), pages 264–270, Berkeley, California.
Nicoletta Calzolari and Remo Bindi. 1990. Acquisition of lexical information from a large textual Italian corpus. In Proceedings of the 13th International Conference on Computational Linguistics, pages 54–59, Helsinki, Finland.
Yaacov Choueka. 1988. Looking for needles in a haystack, or locating interesting collocational expressions in large textual databases. In Proceedings of the International Conference on User-Oriented Content-Based Text and Image Handling, pages 609–623, Cambridge, U.S.A.
Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
Kenneth Church, William Gale, Patrick Hanks, and Donald Hindle. 1989. Parsing, word associations and typical predicate-argument relations. In Proceedings of the International Workshop on Parsing Technologies, pages 103–112, Pittsburgh. Carnegie Mellon University.
D. Alan Cruse. 1986. Lexical Semantics. Cambridge University Press, Cambridge.
Béatrice Daille. 1994. Approche mixte pour l’extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. Ph.D. thesis, Université Paris 7.
Gaël Dias. 2003. Multiword unit hybrid extraction. In Proceedings of the ACL Workshop on Multiword Expressions, pages 41–48, Sapporo, Japan.
Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
Stefan Evert and Brigitte Krenn. 2001. Methods for
the qualitative evaluation of lexical association
measures. In Proceedings of the 39th Annual Meeting
of the Association for Computational Linguistics,
pages 188–195, Toulouse, France.
Stefan Evert. 2004. The Statistics of Word
Cooccurrences: Word Pairs and Collocations. Ph.D. thesis,
University of Stuttgart.
John Rupert Firth. 1957. Papers in Linguistics
1934–1951, chapter Modes of Meaning, pages 190–215.
Oxford Univ. Press, Oxford.
J. R. Firth. 1968. A synopsis of linguistic theory,
1930–55. In F. R. Palmer, editor, Selected papers
of J. R. Firth, 1952–1959. Indiana University Press,
Bloomington.
Thierry Fontenelle. 1992. Collocation acquisition
from a corpus or from a dictionary: a comparison.
Proceedings I-II. Papers submitted to the 5th
EURALEX International Congress on Lexicography in
Tampere, pages 221–228.
Katerina T. Frantzi, Sophia Ananiadou, and Hideki
Mima. 2000. Automatic recognition of multi-word
terms: the C-value/NC-value method. International
Journal on Digital Libraries, 2(3):115–130.
Jean-Philippe Goldman, Luka Nerima, and Eric
Wehrli. 2001. Collocation extraction using a
syntactic parser. In Proceedings of the ACL Workshop
on Collocations, pages 61–66, Toulouse, France.
Maurice Gross. 1984. Lexicon-grammar and the
syntactic analysis of French. In Proceedings of the 22nd
conference on Association for Computational
Linguistics, pages 275–282, Morristown, NJ, USA.
Franz Josef Hausmann. 1985. Kollokationen im
deutschen Wörterbuch. Ein Beitrag zur Theorie des
lexikographischen Beispiels. In Henning Bergenholtz
and Joachim Mugdan, editors, Lexikographie
und Grammatik. Akten des Essener Kolloquiums zur
Grammatik im Wörterbuch, Lexicographica. Series
Maior 3, pages 118–129.
Ulrich Heid. 1994. On ways words work together -
research topics in lexical combinatorics. In W. Martin,
W. Meijs, M. Moerland, E. ten Pas, P. van
Sterkenburg, and P. Vossen, editors, Proceedings of
the VIth Euralex International Congress (EURALEX
’94), pages 226–257, Amsterdam.
Chu-Ren Huang, Adam Kilgarriff, Yiching Wu,
Chih-Ming Chiu, Simon Smith, Pavel Rychly, Ming-Hong
Bai, and Keh-Jiann Chen. 2005. Chinese Sketch
Engine and the extraction of grammatical collocations.
In Proceedings of the Fourth SIGHAN Workshop
on Chinese Language Processing, pages 48–55,
Jeju Island, Republic of Korea.
Satoru Ikehara, Satoshi Shirai, and Tsukasa Kawaoka.
1995. Automatic extraction of uninterrupted
collocations by n-gram statistics. In Proceedings of first
Annual Meeting of the Association for Natural
Language Processing, pages 313–316.
Ray Jackendoff. 1997. The Architecture of the
Language Faculty. MIT Press, Cambridge, MA.
John S. Justeson and Slava M. Katz. 1995. Technical
terminology: Some linguistic properties and an
algorithm for identification in text. Natural Language
Engineering, 1:9–27.
Seonho Kim, Zooil Yang, Mansuk Song, and Jung-Ho
Ahn. 1999. Retrieving collocations from Korean
text. In Proceedings of the 1999 Joint SIGDAT
Conference on Empirical Methods in Natural Language
Processing and Very Large Corpora, pages 71–81,
Maryland, U.S.A.
Mihoko Kitamura and Yuji Matsumoto. 1996.
Automatic extraction of word sequence correspondences
in parallel corpora. In Proceedings of the 4th
Workshop on Very Large Corpora, pages 79–87,
Copenhagen, Denmark, August.
Göran Kjellmer. 1994. A Dictionary of English
Collocations. Clarendon Press, Oxford.
Brigitte Krenn and Stefan Evert. 2001. Can we do
better than frequency? A case study on extracting
PP-verb collocations. In Proceedings of the ACL
Workshop on Collocations, pages 39–46, Toulouse,
France.
Julian Kupiec. 1993. An algorithm for finding noun
phrase correspondences in bilingual corpora. In 31st
Annual Meeting of the Association for Computational
Linguistics, pages 17–22, Columbus, Ohio,
U.S.A.
P. Lafon. 1984. Dépouillement et statistique en
lexicométrie. Slatkine-Champion, Paris.
Michael Lewis. 2000. Teaching Collocations. Further
Developments In The Lexical Approach. Language
Teaching Publications, Hove.
Dekang Lin. 1998. Extracting collocations from text
corpora. In First Workshop on Computational
Terminology, pages 57–63, Montreal.
Qin Lu, Yin Li, and Ruifeng Xu. 2004. Improving
Xtract for Chinese collocation extraction. In
Proceedings of IEEE International Conference on Natural
Language Processing and Knowledge Engineering,
pages 333–338.
Kathleen R. McKeown and Dragomir R. Radev. 2000.
Collocations. In Robert Dale, Hermann Moisl,
and Harold Somers, editors, A Handbook of Natural
Language Processing, pages 507–523. Marcel
Dekker, New York, U.S.A.
I. Dan Melamed. 1997. A portable algorithm for
mapping bitext correspondence. In Proceedings of
the 35th Conference of the Association for
Computational Linguistics (ACL’97), pages 305–312,
Madrid, Spain.
Igor Mel’čuk. 1998. Collocations and lexical
functions. In Anthony P. Cowie, editor, Phraseology.
Theory, Analysis, and Applications, pages 23–53.
Clarendon Press, Oxford.
Igor Mel’čuk. 2003. Collocations : définition, rôle et
utilité. In Francis Grossmann and Agnès Tutin,
editors, Les collocations : analyse et traitement, pages
23–32. Editions "De Werelt", Amsterdam.
Darren Pearce. 2001. Synonymy in collocation
extraction. In WordNet and Other Lexical Resources:
Applications, Extensions and Customizations (NAACL
2001 Workshop), pages 41–46, Pittsburgh, U.S.A.
Pavel Pecina. 2005. An extensive empirical study of
collocation extraction methods. In Proceedings of
the ACL Student Research Workshop, pages 13–18,
Ann Arbor, Michigan, June.
Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann
Copestake, and Dan Flickinger. 2002. Multiword
expressions: A pain in the neck for NLP. In
Proceedings of the Third International Conference on
Intelligent Text Processing and Computational
Linguistics (CICLING 2002), pages 1–15, Mexico City.
Violeta Seretan and Eric Wehrli. 2006. Accurate
collocation extraction using a multilingual parser. In
Proceedings of COLING/ACL 2006. To appear.
Violeta Seretan, Luka Nerima, and Eric Wehrli. 2004.
A tool for multi-word collocation extraction and
visualization in multilingual corpora. In Proceedings
of the Eleventh EURALEX International Congress,
EURALEX 2004, pages 755–766, Lorient, France.
Violeta Seretan. 2005. Induction of syntactic
collocation patterns from generic syntactic relations.
In Proceedings of Nineteenth International Joint
Conference on Artificial Intelligence (IJCAI 2005),
pages 1698–1699, Edinburgh, Scotland, July.
Sayori Shimohata, Toshiyuki Sugio, and Junji Nagata.
1997. Retrieving collocations by co-occurrences
and word order constraints. In Proceedings of the
Annual Meeting of the Association for Computational
Linguistics, pages 476–481, Madrid, Spain.
John Sinclair. 1995. Collins Cobuild English
Dictionary. HarperCollins, London.
Frank Smadja, Kathleen McKeown, and Vasileios
Hatzivassiloglou. 1996. Translating collocations for
bilingual lexicons: a statistical approach.
Computational Linguistics, 22(1):1–38.
Frank Smadja. 1993. Retrieving collocations from
text: Xtract. Computational Linguistics, 19(1):143–177.
Eric Wehrli. 2003. Translation of words in context.
In Proceedings of Machine Translation Summit IX,
pages 502–504, New Orleans, Louisiana, U.S.A.
Eric Wehrli. 2004. Un modèle multilingue d’analyse
syntaxique. In A. Auchlin et al., editor, Structures
et discours - Mélanges offerts à Eddy Roulet, pages
311–329. Éditions Nota bene, Québec.
Joachim Wermter and Udo Hahn. 2004. Collocation
extraction based on modifiability statistics. In
Proceedings of the 20th International Conference on
Computational Linguistics (COLING 2004), pages
980–986, Geneva, Switzerland.
Dekai Wu. 1994. Aligning a parallel English-Chinese
corpus statistically with lexical criteria. In
Proceedings of the 32nd Annual Meeting of the Association
for Computational Linguistics (ACL 1994), pages
80–87, Las Cruces (New Mexico), U.S.A.
Diana Zaiu Inkpen and Graeme Hirst. 2002.
Acquiring collocations for lexical choice between
near-synonyms. In Proceedings of the ACL-02 Workshop
on Unsupervised Lexical Acquisition, pages 67–76,
Philadelphia, Pennsylvania.
Rémi Zajac, Elke Lange, and Jin Yang. 2003.
Customizing complex lexical entries for high-quality
MT. In Proceedings of the Ninth Machine
Translation Summit, New Orleans, U.S.A.
Heike Zinsmeister and Ulrich Heid. 2003. Significant
triples: Adjective+Noun+Verb combinations.
In Proceedings of the 7th Conference on Computational
Lexicography and Text Research (Complex
2003), Budapest.
Heike Zinsmeister and Ulrich Heid. 2004. Collocations
of complex nouns: Evidence for lexicalisation.
In Proceedings of KONVENS 2004, Vienna, Austria.
