Collocation Extraction Based on Modifiability Statistics
Joachim Wermter Udo Hahn
Computerlinguistik, Friedrich-Schiller-Universit¨at Jena
F¨urstengraben 30, D-07743 Jena, Germany
wermter@coling.uni-freiburg.de
Abstract
We introduce a new, linguistically grounded
measure of collocativity based on the property
of limited modifiability and test it on German
PP-verb combinations. We show that our mea-
sure not only significantly outperforms the stan-
dard lexical association measures typically em-
ployed for collocation extraction, but also yields
a valuable by-product for the creation of col-
location databases, viz. possible structural and
lexical attributes. Our approach is language-,
structure-, and domain-independent because it
only requires some shallow syntactic analysis
(e.g., a POS-tagger and a phrase chunker).
1 Introduction
Natural language is an open and very flexible com-
munication system. Syntax, of course, imposes con-
straints, e.g., on word order or the occurrence of par-
ticular phrasal types such as PPs or NPs, and lexi-
cal semantics imposes, e.g., selectional constraints
on conceptually permitted sorts or types within the
context of specific verbs or nouns. Nevertheless,
natural language speakers usually enjoy an enor-
mous degree of freedom to express the content they
want to convey in a great variety of linguistic forms.
There is, however, a significant subset of expres-
sions which do not share this rather free combinabil-
ity, so-called collocations. From a linguistic per-
spective, they can be characterized by at least three
recurrent and prominent properties (Manning and
Sch¨utze, 1999):
a0 Non-(or limited) compositionality. The mean-
ing of a collocation is not a straightforward
composition of the meanings of its parts. For
example, the meaning of ‘red tape’ is com-
pletely different from the meaning of its com-
ponents.
a0 Non-(or limited) substitutability. The parts of
a collocation cannot be substituted by seman-
tically similar words. Thus, ‘gut’ in ‘to spill
gut’ cannot be substituted by ‘intestine’ (see
also Lin (1999)).
a0 Non-(or limited) modifiability. Many collo-
cations cannot be supplemented by additional
lexical material. For example, the noun in ‘to
kick the bucket’ cannot be modified as ‘to kick
the a1 holey/plastic/watera2 bucket’.
Considering these observations, from a natu-
ral language processing perspective, collocations
should not enter, e.g., the standard syntax-semantics
pipeline so as to prevent compositional semantic
readings of expressions for which this is absolutely
not desired. Hence, collocations need to be identi-
fied as such and subsequently be blocked, e.g., from
compositional semantic interpretation.
In computational linguistics, a wide variety of
lexical association measures have been employed
for the task of (semi-)automatic collocation identifi-
cation and extraction. Almost all of these measures
can be grouped into one of the following three cate-
gories:
a0 frequency-based measures (e.g., based on ab-
solute and relative co-occurrence frequencies)
a0 information-theoretic measures (e.g., mutual
information, entropy)
a0 statistical measures (e.g., chi-square, t-test,
log-likelihood, Dice’s coefficient)
The corresponding metrics have been extensively
discussed in the literature both in terms of their
mathematical properties (Dunning, 1993; Manning
and Sch¨utze, 1999) and their suitability for the
task of collocation extraction (see Evert and Krenn
(2001) and Krenn and Evert (2001) for recent eval-
uations). Typically, they are applied to a set of can-
didate lexeme pairs which were obtained from pre-
processors varying in linguistic sophistication.1 The
selected measure then assigns an association score
1On the low end, this may just be a preset numeric window
span. In order to reduce the noise among the candidates, how-
ever, more elaborate linguistic processing, such as POS tag-
ging, chunking, or even parsing, is increasingly being applied.
to each candidate pair, which is computed from its
joint and marginal frequencies, thus expressing the
strength of the hypothesis stating whether it consti-
tutes a collocation or not.
While these association measures have their sta-
tistical merits in collocation identification, it is in-
teresting to note that they have relatively little to do
with the linguistic properties (such as those men-
tioned at the beginning) which are typically associ-
ated with the notion of collocativity. Therefore, it
may be interesting to investigate whether there is a
way to implement a measure which directly incor-
porates linguistic criteria in the collocation identifi-
cation task, and even more important, whether such
a linguistically rooted approach would fare better in
comparison to some of the standard lexical associa-
tion measures.
In the following study, we will introduce such a
linguistic measure for identifying PP-verb colloca-
tions in German, which is based on the property of
non- or limited modifiability. To the best of our
knowledge, this is the first work to use this kind
of linguistic measure to acquire collocations auto-
matically. By contrasting our method to previous
studies which use the standard lexical association
measures, we intend to emphasize a more linguis-
tically inspired use of statistics in collocation min-
ing. Section 2 motivates our definition of the notion
of collocation and Section 3 describes our methods,
in particular the linguistically grounded collocation
extraction algorithm, and the experimental setup de-
rived from it. In Section 4 we present and discuss
the results of our experiments.
2 Kinds of Collocations
There have been various approaches to define the
notion of ‘collocation’. This is by no means an easy
task, especially when it comes to defining the de-
marcation line between collocations and free word
combinations (modulo general syntactic and seman-
tic semantic constraints). We favor an approach
which draws this line on the semantic layer, viz. the
compositionality between the components of a lin-
guistic expression.
For this purpose, we distinguish between three
classes of collocations based on varying degrees of
semantic compositionality of the basic lexical enti-
ties involved:
1. Idiomatic Phrases. In this case, none of the
lexical components involved contribute to the
overall meaning in a semantically transpar-
ent way. The meaning of the expression is
metaphorical or figurative. For example, the
literal meaning of the German PP-verb combi-
nation ‘[jemanden] auf die Schippe nehmen’ is
‘to take [someone] onto the shovel’. Its figura-
tive meaning is ‘to lampoon somebody’.
2. Support Verb Constructions/Narrow Colloca-
tions. This second class contains expressions
in which at least one component contributes to
the overall meaning in a semantically transpar-
ent way and thus constitutes its semantic core.
For example, in the support verb construction
‘zur Verf¨ugung stellen’ (literal: ‘to put to avail-
abilty’; actual: ‘to make available’), the noun
‘Verf¨ugung’ is the semantic core of the expres-
sion, whereas the verb only has a support func-
tion with some impact on argument structure,
causativity or aktionsart. There are, however,
also narrow collocations in which the basic lex-
ical meaning of the verb is the semantic core:
For example, in ‘aus eigener Tasche bezahlen’
(‘to pay out of one’s own pocket’) the verb
‘bezahlen’ is the semantic core. What unifies
these two types is the fact that they function as
predicates.
3. Fixed Phrases. Here, all basic lexical mean-
ings of the components involved contribute to
the overall meaning in a semantically much
more transparent way. Still, they are not as
completely compositional as to classify them
as free word combinations. For example, all
the basic lexical meanings of the different lex-
ical components in ‘im Koma liegen’ (literal:
‘to lie in coma’; actual: ‘to be comatose’)
contribute to the overall meaning of the ex-
pression. Still, this is different from a com-
pletely compositional free word combination,
such as ‘auf der Strasse gehen’ (‘to walk on
the street’).
Our goal is to consider all three types of col-
locations as a whole, i.e., we will not distinguish
between the three different kinds of collocations.
However, in order to focus our experiments, we will
concentrate on a particular surface pattern in which
they occur, viz. PP-verb collocations.
3 Methods and Experiments
3.1 Construction and Statistics of the Testset
We used a 114-million-word German-language
newspaper corpus extracted from the Web to ac-
quire candidate PP-verb collocations. The corpus
was first processed by means of the TNT part-
of-speech tagger (Brants, 2000). Then we ran a
sentence/clause recognizer and an NP/PP chunker,
both developed at the Text Knowledge Engineer-
ing Lab at Freiburg University, on the POS-tagged
corpus. From the XML-marked-up tree output, PP-
verb complexes were automatically selected in the
following way: Taking a particular PP node as
a fixed point, either the preceding or the follow-
ing sibling V node was taken.2 From such a PP-
verb combination, we extracted and counted both its
various heads, in terms of Preposition-Noun-Verb
(PNV) triples, and all its associated supplements,
i.e., here in this case any additional lexical material
which also occurs in the nominal group of the PP,
such as articles, adjectives, adverbs, cardinals, etc.3
The extraction of the associated supplements is es-
sential to the linguistic measure described in sub-
section 3.3 below.
In order to reduce the amount of candidates
for evaluation and to eliminate low-frequency data,
we only considered PNV-triples with frequency
a0a2a1 a3a5a4 . This was also motivated by the well-
known fact that collocations tend to have a higher
co-occurrence frequency than free word combina-
tions.4 Table 1 contains the data for the correspond-
ing frequency distributions.
frequency PP-verb combinations
candidate tokens candidate types
all 1,663,296 1,159,133
a6a8a7a10a9a12a11 279,350 8,644
Table 1: Frequency distribution for PP-Verb tokens and
types for our 114-million-word newspaper corpus
3.2 Classification of the Testset
Three human judges manually classified the PP-
verb candidate types with a0a13a1 a3a5a4 in regard to
whether they were a collocation or not. For this
purpose, they used a manual, in which the guide-
lines included the linguistic properties as described
in Section 1 and the three collocation classes identi-
fied in Section 2.
Among the 8,644 PP-verb candidate types, 1,180
(13.7%) were identified as true collocations. The
inter-annotator agreement was 94.8% (with a stan-
dard deviation of 2.1).
2The verbs in this study are restricted to main verbs and are
reduced to their base form after extraction.
3It should be noted that both heads and associated supple-
ments may of course vary depending on the particular linguistic
structure targeted for collocation extraction.
4Cf. also Evert and Krenn (2001) for empirical evidence
justifying the exclusion of low-frequency data.
3.3 The Linguistic Measure
The linguistic property around which we built our
measure for collocativity is the non- or limited mod-
ifiabilty of collocations with additional lexical ma-
terial (i.e., supplements). The underlying assump-
tion is that a PNV triple is less modifiable (and thus
more likely to be a collocation) if it has a lexical
supplement which, compared to all others, is par-
ticularly characteristic. We express this assump-
tion in the following way: Let a14 be the number
of distinct supplements of a particular PNV triple
(a15a17a16a19a18a21a20a23a22a25a24a27a26a29a28a31a30 ). The probability a32 of a particular sup-
plement a33a35a34a37a36a38a36a40a39 , a41a43a42a45a44 a3a47a46 a14a49a48 , is described by its fre-
quency scaled by the sum of all supplement frequen-
cies:
a32a51a50a52a15a17a16a19a18a21a20a23a22a25a24a27a26a29a28a31a30a54a53a55a38a56a57a26a54a26a59a58a59a60a61a42 (1)
a0
a50a52a15a17a16a19a18a21a20a23a22a25a24a27a26a29a28a31a30a54a53a55a38a56a57a26a54a26a59a58a5a60
a62a64a63
a24a23a65a67a66
a0
a50a52a15a17a16a19a18a68a20a23a22a69a24a27a26a29a28a31a30a54a53a55a38a56a57a26a54a26a29a70a71a60
with a62a19a63
a24a72a65a67a66
a0
a50a52a15a17a16a19a18a21a20a23a22a25a24a27a26a29a28a73a30a57a53a55a74a56a12a26a54a26a29a70a71a60a64a42
a0
a50a52a15a17a16a19a18a21a20a23a22a25a24a27a26a29a28a31a30a75a60 .
5
Then the modifiability a76a78a77a80a79 of a PNV triple can
be described by its most probable supplement:
a76a78a77a80a79a81a50a52a15a17a16a19a18a21a20a23a22a25a24a27a26a29a28a73a30a82a60a84a83a31a42 (2)
a85a87a86a89a88a91a90a92a85a87a93
a32a51a50a52a15a17a16a19a18a21a20a23a22a25a24a27a26a29a28a31a30a54a53a55a38a56a57a26a54a26a59a58a59a60
a46
a41a51a42a94a44
a3a47a46
a14a49a48
To define a measure of collocativity a95a96a77a98a97a84a97 for a
candidate set, some factor regarding frequency has
to be taken into account. Thus, besides a76a78a77a80a79 , we
take the relative co-occurrence frequency for a spe-
cific PNV triple a32a51a50a52a15a17a16a19a18a99a20a23a22a69a24a27a26a29a28a31a30a82a60 (a100 being the number
of candidate types (here, 8,644))
a32a51a50a52a15a17a16a101a18a68a20a23a22a25a24a27a26a29a28a73a30a82a60a84a83a31a42
a0
a50a52a15a17a16a19a18a68a20a23a22a69a24a27a26a29a28a31a30a12a60
a62
a20
a102
a65a67a66
a0
a50a52a15a17a16a19a18a21a20a23a22a25a24a27a26a29a28a73a30a52a103a59a60
(3)
and incorporate it as a second factor to a95a104a77a98a97a105a97 :
a95a96a77a98a97a84a97a91a50a52a15a17a16a101a18a68a20a23a22a25a24a27a26a29a28a73a30a82a60a105a83a31a42 (4)
a76a78a77a80a79a81a50a52a15a17a16a101a18a68a20a23a22a25a24a27a26a29a28a73a30a82a60a64a106a107a32a51a50a52a15a17a16a19a18a21a20a23a22a25a24a27a26a29a28a31a30a75a60
3.4 Methods of Evaluation
Standard procedures for evaluating the goodness of
collocativity measures usually involve identifying
the true positives among the a14 -highest ranked candi-
dates returned by a particular measure. Because this
is rather labor-intensive, a14 is usually small, rang-
ing from 50 to several hundred. Evert and Krenn
5Note that the zero supplement of the PNV triple, i.e., the
one for which no lexical supplements co-occur is also included
in this set.
(2001), however, point out the inadequacy of such
methods claiming they usually lead to very super-
ficial judgements about the measures to be exam-
ined. In contrast, they suggest examining various a14 -
highest ranked samples, which allows plotting stan-
dard precision and recall graphs for the whole can-
didate set.
We evaluate the a95a104a77a98a97a105a97 measure against two
widely used standard statistical tests (t-test and log-
likelihood) and against co-occurrence frequency.
The comparison to the t-test is especially interesting
because it was found to achieve the best overall pre-
cision scores in other studies (see Evert and Krenn
(2001)). Our baseline is defined by the proportion
of true positives (13.7%; see subsection 3.2), which
can be described as the likelihood of finding one by
blindly picking from the candidate set.
4 Experimental Results and Discussion
4.1 Precision and Recall for Collocation
Extraction
In the first experiment, we incrementally examined
parts of the a14 -highest ranked candidate lists re-
turned by the each of the four measures we consid-
ered. The precision values for various a14 were com-
puted such that for each percent point of the list,
the proportion of true positives was compared to the
overall number of candidate items returned. This
yields the precision curves in Figure 1 and its as-
sociated values at selected list portions in the upper
table from Table 2.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
10 20 30 40 50 60 70 80 90 100
Portion of ranked list (in %)
Precision ModifiabilityT-test
Frequency
Log-likelihood
Base
Figure 1: Precision for Collocation Extraction
First, we observe that all measures outperform
the baseline by far and, thus, all are potentially use-
ful measures of collocativity. Of the statistical mea-
sures, log-likelihood (the most complex one) per-
forms the worst, whereas t-test and frequency, al-
most indistinguishable, share the middle position,
with frequency measurements having a very slight
edge at six rank points. This is in contrast to the
findings reported by Krenn and Evert (2001), which
gave the t-test an edge.6
As can be clearly seen, however, our linguis-
tic modifiability measure substantially outperforms
all other measures at all points in the ranked list.
Considering 1% (a14a1a0 a2a4a3 ), its precision value is
ten percentage points higher than for t-test and fre-
quency, and even 22 points higher compared to log-
likelihood. Until 50% (a14a5a0a7a6a9a8a4a10a4a10 ) of the ranked
list is considered, modifiability maintains a three to
five percentage point advantage in precision over t-
test and frequency. In the second half of the list,
all curves and associated values start converging to-
wards the baseline.
We also tested the significance of differences for
our precision results, both between modifiability
and frequency and between modifiability and t-test.
Because in both cases the ranked lists were taken
from the same set of candidates, viz. the 8,644 PP-
verb candidate types, and hence constitute depen-
dent samples, we applied the McNemar test (Sachs,
1984) for statistical testing. We selected 100 mea-
sure points in the ranked list, one after each incre-
ment of one percent, and then used the two-tailed
test for a confidence interval of 99%. Table 3, which
lists the number of significant differences for 10, 50
and 100 measure points, shows that almost all of
them are significantly different.
# of significance # of signicant differences
measure points comparing modifiability with
frequency t-test
10 9 9
50 49 49
100 96 97
Table 3: Significance testing of differences using the
two-tailed McNemar test at 99% confidence interval
The recall curves in Figure 2 and their corre-
sponding values in the lower table from Table 2
measure which proportion of all true positives is
identified by a particular measure at a certain part of
the ranked list. In this sense, recall is an even bet-
ter indicator of a particular measure’s performance.
Again, the linguistically motivated collocation ex-
traction algorithm outscores all others, even more
pronounced than for precision. When examining
20% (a14a11a0 a3a13a12 a10a4a14a38a60 , 30% (a14a15a0 a10a4a16a4a14a4a8 ) and 40%
6The reason why frequency performs even slightly better
than t-test may very well have to do with the size of our training
corpus (114 million words). But this just underlines the fact
that large corpora are essential for collocation discovery.
Portion of Precision scores of measure evaluated
ranked list
considered Modifiablity T-test Frequency Log-likelihood Baseline
1% 0.84 0.74 0.74 0.62 0.14
10% 0.51 0.46 0.45 0.39 0.14
20% 0.39 0.34 0.34 0.30 0.14
30% 0.31 0.27 0.28 0.25 0.14
40% 0.27 0.23 0.24 0.21 0.14
50% 0.24 0.20 0.21 0.18 0.14
60% 0.21 0.18 0.19 0.16 0.14
70% 0.19 0.17 0.17 0.15 0.14
80% 0.17 0.15 0.16 0.15 0.14
90% 0.15 0.14 0.15 0.14 0.14
(a0a2a1
a3a5a4a7a6a9a8a9a8 ) 100% 0.14 0.14 0.14 0.14 0.14
Portion of Recall scores of measure evaluated
ranked list
considered Modifiablity T-test Frequency Log-likelihood
1% 0.06 0.05 0.05 0.05
10% 0.37 0.33 0.33 0.28
20% 0.58 0.50 0.50 0.44
30% 0.69 0.60 0.61 0.55
40% 0.80 0.69 0.70 0.61
50% 0.87 0.75 0.76 0.66
60% 0.93 0.80 0.83 0.72
70% 0.96 0.85 0.88 0.78
80% 0.98 0.89 0.92 0.85
90% 0.99 0.93 0.96 0.92
100% 1.00 1.00 1.00 1.00
Table 2: Precision and Recall Scores for Collocation Extraction at Major Portions of the Ranked List
(a14 a0 a8 a6a9a16
a2 ) of the ranked list, modifiability, respec-
tively, identifies almost 60%, 70% and 80% of all
true positives, holding a ten percentage point lead
over t-test and frequency at each of these points.
When 50% (a14 a0 a6a9a8a4a10a4a10 ) are considered, this differ-
ence reaches eleven and twelve points (compared to
frequency and t-test, respectively).
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
10 20 30 40 50 60 70 80 90 100
Portion of ranked list (in %)
Recall
Modifiability
T-test
Frequency
Log-likelihood
Figure 2: Recall for Collocation Extraction
Even more strikingly, for the identification of
90% of all true positives, modifiability only needs
to look at 55% (a14 a0 a6 a12 a16 a6 ) of the ranked list. Fre-
quency, on the other hand, needs to examine 75%
(a14 a0 a3 a6 a2 a8 ) and t-test even 85% (a14 a0 a12 a8 a6 a12 ) of the
ranked list to reach this high level of recall.
4.2 Modifiability Revisited
The previous subsection showed that a measure for
collocation discovery which takes into account the
linguistic property of limited modifiability fares sig-
nificantly better than linguistically not so founded,
purely statistical measures. Although the modifia-
bility property constitutes common wisdom about
collocations, it has not yet been empirically eval-
uated. Thus, we ran an experiment which took
both the PNV triples classified as collocations and
the PNV triples classified as non-collocations and
counted the numbers of distinct supplements (re-
ferred to as a14 in Subsection 3.3). From this data,
we set up a distribution of collocational and non-
collocational PNV triples in which the distributional
ranking criterion was the number of distinct supple-
ments (cf. Figure 3).
PNV Triple NP Supplement Frequency
‘in Griff bekommen’ den/ART Griff/NN 459
‘to get under control’ Griff/NN 2
den/ART gewerkschaftlichen/ADJA Griff/NN 1
den/ART dramatischen/ADJA Griff/NN 1
den/ART erz¨ahlerischen/ADJA Griff/NN 1
‘unter Druck geraten’ Druck/NN 560
‘to get under pressure’ politischen/ADJA Druck/NN 6
erheblichen/ADJA politischen/ADJA Druck/NN 5
teilweise/ADV lebensgef¨ahrlichen/ADJA Druck/NN 1
wachsenden/ADJA Druck/NN 1
noch/ADV st¨arkeren/ADJA Druck/NN 1
schweren/ADJA Druck/NN 1
Table 4: Collocational PNV Triples with Associated Noun Phrase Supplements
0.001
0.01
0.1
1
10
100
1 10 100 1000
proportion of all PNV-triples (in %) number of distinct supplements of PNV-triples
Collocations
Non-Collocations
Figure 3: Distribution of Supplements for (Non-) Collo-
cations in PNV Triples. The x- and y-axes are log-scaled.
As Figure 3 reveals, not only is the propor-
tion of collocational PNV triples with only one
distinct supplement higher (36%) than the propor-
tion for non-collocational ones (20%), but with
each additional supplement, the collocational pro-
portion curve declines more steeply than its non-
collocational counterpart. Moreover, the colloca-
tional proportion curve already ends with 54 distinct
supplements, whereas the non-collocational propor-
tion curve leads up 520 distinct supplements. Thus,
we are able to add some empirical grounding to the
widespread textbook assumption about the limited
modifiablity of collocations.
Another observation (which is also inherent to
our linguistic measure) based on this experiment is
that some collocations do possess at least limited
modifiability. Collocation acquisition is, of course,
not a goal by itself, but rather aims at creating col-
location lexicons for both language processing and
generation (Smadja and McKeown, 1990). From
this perspective, our linguistic modifiabilty measure
actually yields quite a valuable by-product for the
development of lexicons or collocational knowledge
bases: A list of possible structural and lexical mod-
ifications associated with a particular collocational
entry candidate. In our case, these modifications re-
fer to the nominal group of the PP. We illustrate this
point in Table 4 with two collocational PNV triples
and some of their associated NP supplements plus
their frequencies.
As can be seen, both structural and lexical at-
tributes of collocations can thus be obtained. The
structural information comes in the form of part-of-
speech (POS) tags. From this, possible prenominal
POS types and their combinations can be used to
describe a collocation’s structural make-up. From a
lexical viewpoint, the collocation can be described
by the lexical semantic word classes used for mod-
ification.7 As can be seen in Table 4 under the
PNV triple for ‘to get under pressure’, the noun
‘Druck’ (‘pressure’) is often modified by a cer-
tain semantic class of adjectives, such as ‘stark’
(‘strong’), ‘schwer’ (‘heavy’), ‘erheblich’ (‘consid-
erable’, ‘grave’).
5 Related Work
Although there have been many studies on colloca-
tion extraction and mining using only statistical ap-
proaches (Church and Hanks, 1990; Ikehara et al.,
1996), there has been much less work on collocation
acquisition which takes into account the linguistic
properties typically associated with collocations.
Smadja (1993), which is the classic work on col-
location extraction, uses a two-stage filtering model
in which, in the first step, n-gram statistics deter-
mine possible collocations and, in the second step,
these candidates are submitted to a syntactic valida-
7Of course, lexical material is always at least partially de-
pendent on the domain in question. In our case, this is the news
domain with all its associated subdomains (politics, economics,
finance, culture, etc.).
tion procedure (e.g., determining verb-object collo-
cations) in order to filter out invalid collocations. In
a single-judge evaluation of 4,000 collocation can-
didates, the incorporation of linguistic criteria (via
tagging and predicate-argument parsing) boosts pre-
cision up to a level of 80% and recall to 94%. These
results are, of course, not comparable to ours. First
of all, precision and recall are measured at a fixed
point for a fixed unranked candidate list. In or-
der to obtain more reliable evaluation results, we
plot these values continuously on a ranked candi-
date list. Secondly, our kind of syntactic preprocess-
ing (which is standard nowadays) allows collocation
extraction algorithms to better control the structural
types of collocations.
Lin (1998) acquires a lexical dependency
database by assembling dependency relationships
from a parsed corpus. An entry in this database is
classified as collocation if its log-likelihood value is
greater than some threshold. Using an automatically
constructed similarity thesaurus, Lin (1999) then
separates compositional from non-compositional
collocations by taking into account the second lin-
guistic property described in Section 1, viz. their
non- or limited substitutability. In particular, he
checks the existence and mutual information val-
ues of phrases obtained by substituting the words
with similar ones, which results in the classifica-
tion of the phrase as being compositional or non-
compositional. Although this study offers some
promising results, its applicability rather falls into
the category of fine-classifying an already acquired
set of collocations, e.g., according to the criteria de-
scribed in Section 2, and thus is not really compara-
ble to our work. Moreover, the linguistic property in
his focus is of course a semantic one, whereas ours
is purely syntactic in nature.
6 Conclusion
We introduced a new, linguistically motivated mea-
sure of collocativity based on the property of lim-
ited modifiability and tested it on a large corpus
with emphasis on German PP-verb combinations.
We showed that our measure not only significantly
outperforms the standard lexical association mea-
sures typically used for collocation extraction, but
also yields a valuable by-product for the creation of
collocation databases, viz. possible structural and
lexical attributes of a collocation.
Our measure defines the modifiability property
in a linguistically simple way, by e.g. ignoring
the internal make-up of lexical supplements asso-
ciated with a collocation candidate. Hence, it may
be worthwhile to investigate whether a more sophis-
ticated approach, by e.g. taking into account inter-
nal POS types and their distribution etc., would im-
prove our results even more. We may also consider
other linguistic criteria (e.g., limited substitutabil-
ity) to further refine our measure and to categorize
already identified collocations.
At the methodological level, our approach, al-
though tested on German newspaper language data,
is language-, structure-, and domain-independent.
All it requires is some sort of shallow syntactic
analysis, e.g., POS tagging and phrase chunking.
Thus, in the future we plan to include other syntactic
types of collocations, such as verb-object or verb-
object-PP combinations, and also apply our method-
ology to other languages and domains, such as the
biomedical field.
Acknowledgements. We would like to thank our students,
Sabine Demsar, Kristina Meller, and Konrad Feldmeier, for
their excellent work as human collocation classifiers. This
work was partly supported by DFG grant KL 640/5-1.

References

T. Brants. 2000. TNT: A statistical part-of-speech tag-
ger. In Proceedings of the ANLP 2000 Conference,
pages 224–231.

K. Church and P. Hanks. 1990. Word association
norms, mutual information, and lexicography. Com-
putational Linguistics, 16(1):22–29.

T. Dunning. 1993. Accurate methods for the statistics of
surprise and coincidence. Computational Linguistics,
19(1):61–74.

S. Evert and B. Krenn. 2001. Methods for the qualitative
evaluation of lexical association measures. In ACL’01
– Proceedings of the 39th Meeting of the ACL, pages
188–195.

S. Ikehara, S. Shirai, and H. Uchino. 1996. A statistical
method for extracting uninterrupted and interrupted
collocations from very large corpora. In Proceedings
of the COLING’96 Conference, pages 574–579.

B. Krenn and S. Evert. 2001. Can we do better than fre-
quency? A case study on extracting pp-verb colloca-
tions. In Proceedings of the ACL Workshop on Collo-
cations.

D. Lin. 1998. Automatic retrieval and clustering of sim-
ilar words. In Proceedings of the COLING/ACL’98
Conference, pages 768–774.

D. Lin. 1999. Automatic identification of non-
compositional phrases. In ACL’99 – Proceedings of
the 37th Meeting of the ACL, pages 317–324.

C. Manning and H. Sch¨utze. 1999. Foundations of Sta-
tistical Natural Language Processing. MIT Press.
L. Sachs. 1984. Applied Statistics. Springer.

F. A. Smadja and K. R. McKeown. 1990. Automatically
extracting and representing collocations for language
generation. In Proceedings of the 28th Meeting of the
ACL, pages 252–259.

F. Smadja. 1993. Retrieving collocations from text:
XTRACT. Computational Linguistics, 19(1):143–177.
