Automatically Constructing a Lexicon of
Verb Phrase Idiomatic Combinations
Afsaneh Fazly
Department of Computer Science
University of Toronto
Toronto, ON M5S 3H5
Canada
afsaneh@cs.toronto.edu
Suzanne Stevenson
Department of Computer Science
University of Toronto
Toronto, ON M5S 3H5
Canada
suzanne@cs.toronto.edu
Abstract
We investigate the lexical and syntactic
flexibility of a class of idiomatic expres-
sions. We develop measures that draw
on such linguistic properties, and demon-
strate that these statistical, corpus-based
measures can be successfully used for dis-
tinguishing idiomatic combinations from
non-idiomatic ones. We also propose
a means for automatically determining
which syntactic forms a particular idiom
can appear in, and hence should be in-
cluded in its lexical representation.
1 Introduction
The term idiom has been applied to a fuzzy cat-
egory with prototypical examples such as by and
large, kick the bucket, and let the cat out of the
bag. Providing a definitive answer for what idioms
are, and determining how they are learned and un-
derstood, are still subject to debate (Glucksberg,
1993; Nunberg et al., 1994). Nonetheless, they are
often defined as phrases or sentences that involve
some degree of lexical, syntactic, and/or semantic
idiosyncrasy.
Idiomatic expressions, as a part of the vast fam-
ily of figurative language, are widely used both in
colloquial speech and in written language. More-
over, a phrase develops its idiomaticity over time
(Cacciari, 1993); consequently, new idioms come
into existence on a daily basis (Cowie et al., 1983;
Seaton and Macaulay, 2002). Idioms thus pose a
serious challenge, both for the creation of wide-
coverage computational lexicons, and for the de-
velopment of large-scale, linguistically plausible
natural language processing (NLP) systems (Sag
et al., 2002).
One problem is due to the range of syntactic
idiosyncrasy of idiomatic expressions. Some id-
ioms, such as by and large, contain syntactic vio-
lations; these are often completely fixed and hence
can be listed in a lexicon as “words with spaces”
(Sag et al., 2002). However, among those idioms
that are syntactically well-formed, some exhibit
limited morphosyntactic flexibility, while others
may be more syntactically flexible. For example,
the idiom shoot the breeze undergoes verbal inflection
(shot the breeze), but not internal modification
or passivization (?shoot the fun breeze, ?the breeze
was shot). In contrast, the idiom spill the beans
undergoes verbal inflection, internal modification,
and even passivization. Clearly, a words-with-
spaces approach does not capture the full range of
behaviour of such idiomatic expressions.
Another barrier to the appropriate handling of
idioms in a computational system is their semantic
idiosyncrasy. This is a particular issue for those
idioms that conform to the grammar rules of the
language. Such idiomatic expressions are indistin-
guishable on the surface from compositional (non-
idiomatic) phrases, but a computational system
must be capable of distinguishing the two. For ex-
ample, a machine translation system should trans-
late the idiom shoot the breeze as a single unit of
meaning (“to chat”), whereas this is not the case
for the literal phrase shoot the bird.
In this study, we focus on a particular class of
English phrasal idioms, i.e., those that involve the
combination of a verb plus a noun in its direct ob-
ject position. Examples include shoot the breeze,
pull strings, and push one’s luck. We refer to these
as verb+noun idiomatic combinations (VNICs).
The class of VNICs accommodates a large num-
ber of idiomatic expressions (Cowie et al., 1983;
Nunberg et al., 1994). Moreover, their peculiar behaviour
signifies the need for a distinct treatment
in a computational lexicon (Fellbaum, 2005). De-
spite this, VNICs have been granted relatively lit-
tle attention within the computational linguistics
community.
We look into two closely related problems
confronting the appropriate treatment of VNICs:
(i) the problem of determining their degree of flex-
ibility; and (ii) the problem of determining their
level of idiomaticity. Section 2 elaborates on the
lexicosyntactic flexibility of VNICs, and how this
relates to their idiomaticity. In Section 3, we pro-
pose two linguistically-motivated statistical mea-
sures for quantifying the degree of lexical and
syntactic inflexibility (or fixedness) of verb+noun
combinations. Section 4 presents an evaluation
of the proposed measures. In Section 5, we put
forward a technique for determining the syntac-
tic variations that a VNIC can undergo, and that
should be included in its lexical representation.
Section 6 summarizes our contributions.
2 Flexibility and Idiomaticity of VNICs
Although syntactically well-formed, VNICs in-
volve a certain degree of semantic idiosyncrasy.
Unlike compositional verb+noun combinations,
the meaning of VNICs cannot be solely predicted
from the meaning of their parts. There is much ev-
idence in the linguistic literature that the seman-
tic idiosyncrasy of idiomatic combinations is re-
flected in their lexical and/or syntactic behaviour.
2.1 Lexical and Syntactic Flexibility
A limited number of idioms have one (or more)
lexical variants, e.g., blow one’s own trumpet and
toot one’s own horn (examples from Cowie et al.
1983). However, most are lexically fixed (non-
productive) to a large extent. Neither shoot the
wind nor fling the breeze are typically recognized
as variations of the idiom shoot the breeze. Simi-
larly, spill the beans has an idiomatic meaning (“to
reveal a secret”), while spill the peas and spread
the beans have only literal interpretations.
Idiomatic combinations are also syntactically
peculiar: most VNICs cannot undergo syntactic
variations and at the same time retain their id-
iomatic interpretations. It is important, however,
to note that VNICs differ with respect to the degree
of syntactic flexibility they exhibit. Some are syn-
tactically inflexible for the most part, while others
are more versatile; as illustrated in 1 and 2:
1. (a) Tim and Joy shot the breeze.
(b) ?? Tim and Joy shot a breeze.
(c) ?? Tim and Joy shot the breezes.
(d) ?? Tim and Joy shot the fun breeze.
(e) ?? The breeze was shot by Tim and Joy.
(f) ?? The breeze that Tim and Joy shot was fun.
2. (a) Tim spilled the beans.
(b) ? Tim spilled some beans.
(c) ?? Tim spilled the bean.
(d) Tim spilled the official beans.
(e) The beans were spilled by Tim.
(f) The beans that Tim spilled troubled Joe.
Linguists have explained the lexical and syntac-
tic flexibility of idiomatic combinations in terms
of their semantic analyzability (e.g., Glucksberg
1993; Fellbaum 1993; Nunberg et al. 1994). Se-
mantic analyzability is inversely related to id-
iomaticity. For example, the meaning of shoot the
breeze, a highly idiomatic expression, has nothing
to do with either shoot or breeze. In contrast, a less
idiomatic expression, such as spill the beans, can
be analyzed as spill corresponding to “reveal” and
beans referring to “secret(s)”. Generally, the constituents
of a semantically analyzable idiom can be
mapped onto their corresponding referents in the
idiomatic interpretation. Hence analyzable (less
idiomatic) expressions are often more open to lex-
ical substitution and syntactic variation.
2.2 Our Proposal
We use the observed connection between id-
iomaticity and (in)flexibility to devise statisti-
cal measures for automatically distinguishing id-
iomatic from literal verb+noun combinations.
While VNICs vary in their degree of flexibility
(cf. 1 and 2 above; see also Moon 1998), on the
whole they contrast with compositional phrases,
which are more lexically productive and appear in
a wider range of syntactic forms. We thus propose
to use the degree of lexical and syntactic flexibility
of a given verb+noun combination to determine
the level of idiomaticity of the expression.
It is important to note that semantic analyzabil-
ity is neither a necessary nor a sufficient condi-
tion for an idiomatic combination to be lexically
or syntactically flexible. Other factors, such as
the communicative intentions and pragmatic con-
straints, can motivate a speaker to use a variant
in place of a canonical form (Glucksberg, 1993).
Nevertheless, lexical and syntactic flexibility may
well be used as partial indicators of semantic ana-
lyzability, and hence idiomaticity.
3 Automatic Recognition of VNICs
Here we describe our measures for idiomaticity,
which quantify the degree of lexical, syntactic, and
overall fixedness of a given verb+noun combina-
tion, represented as a verb–noun pair. (Note that
our measures quantify fixedness, not flexibility.)
3.1 Measuring Lexical Fixedness
A VNIC is lexically fixed if the replacement of any
of its constituents by a semantically (and syntac-
tically) similar word generally does not result in
another VNIC, but in an invalid or a literal expres-
sion. One way of measuring lexical fixedness of
a given verb+noun combination is thus to examine
the idiomaticity of its variants, i.e., expressions
generated by replacing one of the constituents by
a similar word. This approach has two main chal-
lenges: (i) it requires prior knowledge about the
idiomaticity of expressions (which is what we are
developing our measure to determine); (ii) it needs
information on “similarity” among words.
Inspired by Lin (1999), we examine the strength
of association between the verb and noun con-
stituents of the target combination and its variants,
as an indirect cue to their idiomaticity. We use the
automatically-built thesaurus of Lin (1998) to find
similar words to the noun of the target expression,
in order to automatically generate variants. Only
the noun constituent is varied, since replacing the
verb constituent of a VNIC with a semantically re-
lated verb is more likely to yield another VNIC, as
in keep/lose one’s cool (Nunberg et al., 1994).
Let $S_{sim}(n) = \{n_i \mid 1 \le i \le K\}$ be the set
of the $K$ most similar nouns to the noun $n$ of the
target pair $\langle v, n \rangle$. We calculate the association
strength for the target pair, and for each of its variants,
$\langle v, n_i \rangle$, using pointwise mutual information
(PMI) (Church et al., 1991):

$$\mathrm{PMI}(v, n_j) = \log \frac{P(v, n_j)}{P(v)\,P(n_j)} = \log \frac{|\mathcal{V} \times \mathcal{N}|\; f(v, n_j)}{f(v, *)\; f(*, n_j)} \qquad (1)$$

where $0 \le j \le K$ and $n_0$ is the target noun; $\mathcal{V}$ is
the set of all transitive verbs in the corpus; $\mathcal{N}$ is
the set of all nouns appearing as the direct object
of some verb; $f(v, n_j)$ is the frequency of $v$ and
$n_j$ occurring as a verb–object pair; $f(v, *)$ is the
total frequency of the target verb with any noun in
$\mathcal{N}$; $f(*, n_j)$ is the total frequency of the noun $n_j$
in the direct object position of any verb in $\mathcal{V}$.
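As a minimal sketch of Eqn. (1), the PMI of a verb–object pair can be estimated directly from co-occurrence counts. The function name `pmi`, the toy counts, and the choice of base-2 logarithm are illustrative assumptions, and the normalizer is taken here to be the total number of verb–object tokens, which is one reading of the constant in Eqn. (1).

```python
import math

def pmi(pair_counts, verb, noun):
    """PMI of a verb-object pair, estimated from raw co-occurrence counts."""
    f_vn = pair_counts.get((verb, noun), 0)
    if f_vn == 0:
        return float("-inf")  # unseen pair: no evidence of association
    # Assumption: the normalizer is the total number of verb-object tokens.
    total = sum(pair_counts.values())
    f_v = sum(c for (v, _), c in pair_counts.items() if v == verb)  # f(v, *)
    f_n = sum(c for (_, n), c in pair_counts.items() if n == noun)  # f(*, n)
    return math.log2(total * f_vn / (f_v * f_n))

# Hypothetical counts, for illustration only (not taken from the BNC).
counts = {("shoot", "breeze"): 8, ("shoot", "bird"): 2,
          ("feel", "breeze"): 2, ("see", "bird"): 4}
print(pmi(counts, "shoot", "breeze"), pmi(counts, "shoot", "bird"))
```

With these toy counts the idiomatic pair receives a positive score and the literal pair a negative one; real counts would come from the parsed corpus described in Section 4.1.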
Lin (1999) assumes that a target expression is
non-compositional if and only if its PMI value
is significantly different from that of any of the
variants. Instead, we propose a novel technique
that brings together the association strengths (PMI
values) of the target and the variant expressions
into a single measure reflecting the degree of lexical
fixedness for the target pair. We assume that
the target pair is lexically fixed to the extent that
its PMI deviates from the average PMI of its variants.
Our measure calculates this deviation, normalized
using the sample’s standard deviation:

$$\mathrm{Fixedness}_{lex}(v, n) = \frac{\mathrm{PMI}(v, n) - \overline{\mathrm{PMI}}}{s} \qquad (2)$$

$\overline{\mathrm{PMI}}$ is the mean and $s$ the standard deviation of
the sample; $\mathrm{Fixedness}_{lex}(v, n) \in [-\infty, +\infty]$.
3.2 Measuring Syntactic Fixedness
Compared to compositional verb+noun combina-
tions, VNICs are expected to appear in more re-
stricted syntactic forms. To quantify the syntac-
tic fixedness of a target verb–noun pair, we thus
need to: (i) identify relevant syntactic patterns,
i.e., those that help distinguish VNICs from literal
verb+noun combinations; (ii) translate the frequency
distribution of the target pair in the identified
patterns into a measure of syntactic fixedness.
3.2.1 Identifying Relevant Patterns
Determining a unique set of syntactic patterns
appropriate for the recognition of all idiomatic
combinations is difficult indeed: exactly which
forms an idiomatic combination can occur in is not
entirely predictable (Sag et al., 2002). Nonethe-
less, there are hypotheses about the difference in
behaviour of VNICs and literal verb+noun combi-
nations with respect to particular syntactic varia-
tions (Nunberg et al., 1994). Linguists note that
semantic analyzability is related to the referential
status of the noun constituent, which is in turn re-
lated to participation in certain morphosyntactic
forms. In what follows, we describe three types
of variation that are tolerated by literal combina-
tions, but are prohibited by many VNICs.
Passivization There is much evidence in the lin-
guistic literature that VNICs often do not undergo
passivization.1 Linguists mainly attribute this to
the fact that only a referential noun can appear as
the surface subject of a passive construction.
1There are idiomatic combinations that are used only in a
passivized form; we do not consider such cases in our study.
Determiner Type A strong correlation exists
between the flexibility of the determiner preced-
ing the noun in a verb+noun combination and the
overall flexibility of the phrase (Fellbaum, 1993).
It is however important to note that the nature of
the determiner is also affected by other factors,
such as the semantic properties of the noun.
Pluralization While the verb constituent of a
VNIC is morphologically flexible, the morpholog-
ical flexibility of the noun relates to its referential
status. A non-referential noun constituent is ex-
pected to mainly appear in just one of the singular
or plural forms. The pluralization of the noun is of
course also affected by its semantic properties.
Merging the three variation types results in a
pattern set, $\mathcal{P}$, of 11 distinct syntactic patterns,
given in Table 1 (see footnote 2).
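To illustrate how an individual occurrence could be assigned to one of the 11 patterns of Table 1, the following sketch maps a determiner, the noun's number, and a passive flag to a pattern label. The function names, the label strings, and the word lists for demonstratives and possessives are assumptions made for illustration; the paper does not specify them.

```python
# Assumed (incomplete) word lists; the paper does not enumerate them.
DEMONSTRATIVES = {"this", "that", "these", "those"}
POSSESSIVES = {"my", "your", "his", "her", "its", "our", "their", "one's"}

def determiner_type(det):
    """Collapse a determiner token into one of the determiner classes."""
    if det is None:
        return "NULL"
    det = det.lower()
    if det in {"a", "an"}:
        return "a/an"
    if det == "the":
        return "the"
    if det in DEMONSTRATIVES:
        return "DEM"
    if det in POSSESSIVES:
        return "POSS"
    return "OTHER"

def pattern_of(det, number, passive=False):
    """Return the pattern label for one verb-object occurrence."""
    if passive:
        return "det:ANY n_sg|pl be v_passive"  # one pattern for all passives
    det_type = determiner_type(det)
    if det_type == "OTHER":
        return "v det:OTHER n_sg|pl"  # number is collapsed for OTHER
    if det_type == "a/an":
        return "v det:a/an n_sg"  # a/an only combines with singular nouns
    return f"v det:{det_type} n_{number}"

print(pattern_of("the", "sg"), "|", pattern_of(None, "pl"))
```

Enumerating all determiner classes, both number values, and both voices yields exactly 11 distinct labels, matching the size of the pattern set.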
3.2.2 Devising a Statistical Measure
The second step is to devise a statistical measure
that quantifies the degree of syntactic fixedness of
a verb–noun pair, with respect to the selected set
of patterns, $\mathcal{P}$. We propose a measure that compares
the “syntactic behaviour” of the target pair
with that of a “typical” verb–noun pair. Syntactic
behaviour of a typical pair is defined as the
prior probability distribution over the patterns in
$\mathcal{P}$. The prior probability of an individual pattern
$pt \in \mathcal{P}$ is estimated as:

$$P(pt) = \frac{\sum_{v_i \in \mathcal{V}} \sum_{n_j \in \mathcal{N}} f(v_i, n_j, pt)}{\sum_{v_i \in \mathcal{V}} \sum_{n_j \in \mathcal{N}} \sum_{pt_k \in \mathcal{P}} f(v_i, n_j, pt_k)}$$

The syntactic behaviour of the target verb–noun
pair $\langle v, n \rangle$ is defined as the posterior probability
distribution over the patterns, given the particular
pair. The posterior probability of an individual
pattern $pt$ is estimated as:

$$P(pt \mid v, n) = \frac{P(v, n, pt)}{P(v, n)} = \frac{f(v, n, pt)}{\sum_{pt_k \in \mathcal{P}} f(v, n, pt_k)}$$
Patterns
  v det:NULL n_sg            v det:NULL n_pl
  v det:a/an n_sg
  v det:the n_sg             v det:the n_pl
  v det:DEM n_sg             v det:DEM n_pl
  v det:POSS n_sg            v det:POSS n_pl
  v det:OTHER [n_sg,pl]      det:ANY [n_sg,pl] be v_passive
Table 1: Patterns for syntactic fixedness measure.

2 We collapse some patterns since with a larger pattern set
the measure may require larger corpora to perform reliably.

The degree of syntactic fixedness of the target
verb–noun pair is estimated as the divergence of
its syntactic behaviour (the posterior distribution
over the patterns), from the typical syntactic
behaviour (the prior distribution). The divergence of
the two probability distributions is calculated using
a standard information-theoretic measure, the
Kullback Leibler (KL-)divergence:

$$\mathrm{Fixedness}_{syn}(v, n) = D\big(P(pt \mid v, n)\,\|\,P(pt)\big) = \sum_{pt_k \in \mathcal{P}} P(pt_k \mid v, n) \log \frac{P(pt_k \mid v, n)}{P(pt_k)} \qquad (3)$$

KL-divergence is always non-negative and is zero
if and only if the two distributions are exactly the
same. Thus, $\mathrm{Fixedness}_{syn}(v, n) \in [0, +\infty]$.
KL-divergence is argued to be problematic be-
cause it is not a symmetric measure. Nonethe-
less, it has proven useful in many NLP applica-
tions (Resnik, 1999; Dagan et al., 1994). More-
over, the asymmetry is not an issue here since we
are concerned with the relative distance of several
posterior distributions from the same prior.
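The prior, the posterior, and the KL-divergence of Eqn. (3) can be sketched as follows. The function names and the toy counts are hypothetical, the base-2 logarithm is an assumption, and the example uses only two pattern labels instead of the full set of 11 to keep it short.

```python
import math
from collections import Counter

def pattern_prior(counts_by_pair):
    """Prior P(pt): pattern frequencies pooled over all verb-noun pairs."""
    pooled = Counter()
    for pattern_counts in counts_by_pair.values():
        pooled.update(pattern_counts)  # adds the counts pattern by pattern
    total = sum(pooled.values())
    return {pattern: count / total for pattern, count in pooled.items()}

def syntactic_fixedness(pattern_counts, prior):
    """KL-divergence of the pair's pattern distribution from the prior."""
    total = sum(pattern_counts.values())
    divergence = 0.0
    for pattern, count in pattern_counts.items():
        if count == 0:
            continue  # a zero-probability pattern contributes nothing
        posterior = count / total
        divergence += posterior * math.log2(posterior / prior[pattern])
    return divergence

# Hypothetical pattern counts with two simplified pattern labels.
corpus = {
    ("shoot", "breeze"): {"active": 10, "passive": 0},
    ("shoot", "bird"): {"active": 5, "passive": 5},
}
prior = pattern_prior(corpus)
for pair, pattern_counts in corpus.items():
    print(pair, syntactic_fixedness(pattern_counts, prior))
```

Because the prior is pooled over all pairs, any pattern observed with the target pair has non-zero prior probability, so the logarithm is always defined. In this toy example the pair that never passivizes diverges more from the prior than the pair that does.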
3.3 A Hybrid Measure of Fixedness
VNICs are hypothesized to be, in most cases, both
lexically and syntactically more fixed than literal
verb+noun combinations (see Section 2). We thus
propose a new measure of idiomaticity to be a
measure of the overall fixedness of a given pair.
We define $\mathrm{Fixedness}_{overall}(v, n)$ as:

$$\mathrm{Fixedness}_{overall}(v, n) = \alpha\, \mathrm{Fixedness}_{syn}(v, n) + (1 - \alpha)\, \mathrm{Fixedness}_{lex}(v, n) \qquad (4)$$

where $\alpha$ weights the relative contribution of the
measures in predicting idiomaticity.
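The weighted combination of Eqn. (4) is a simple linear interpolation, sketched below. The function name `overall_fixedness` and the input scores are hypothetical; the grid of weights from 0 to 1 in steps of .1 mirrors the range explored on the development data in Section 4.3.

```python
def overall_fixedness(syn, lex, alpha):
    """Linear interpolation of syntactic and lexical fixedness, as in Eqn. (4)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * syn + (1.0 - alpha) * lex

# Candidate weights from 0 to 1 in steps of .1.
alphas = [step / 10 for step in range(11)]
# Hypothetical component scores for a single verb-noun pair.
scores = [overall_fixedness(2.0, 1.0, alpha) for alpha in alphas]
print(list(zip(alphas, scores)))
```

Setting the weight to 0 recovers the lexical measure and setting it to 1 recovers the syntactic measure, so the grid covers both individual measures as special cases.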
4 Evaluation of the Fixedness Measures
To evaluate our proposed fixedness measures, we
determine their appropriateness as indicators of idiomaticity.
We pose a classification task in which
idiomatic verb–noun pairs are distinguished from
literal ones. We use each measure to assign scores
to the experimental pairs (see Section 4.2 below).
We then classify the pairs by setting a threshold,
here the median score, where all expressions with
scores higher than the threshold are labeled as id-
iomatic and the rest as literal.
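The median-threshold classification described above can be sketched in a few lines. The function name `classify_by_median` and the scores are hypothetical.

```python
import statistics

def classify_by_median(scores):
    """Label pairs scoring above the median as idiomatic, the rest as literal."""
    threshold = statistics.median(scores.values())
    return {pair: ("idiomatic" if score > threshold else "literal")
            for pair, score in scores.items()}

# Hypothetical fixedness scores for four verb-noun pairs.
scores = {("shoot", "breeze"): 2.0, ("spill", "beans"): 0.9,
          ("shoot", "bird"): 0.4, ("spill", "milk"): 0.1}
labels = classify_by_median(scores)
print(labels)
```

Because the threshold is the median, roughly half of the pairs are labeled idiomatic regardless of the scale of the measure, which suits a test set that is half idiomatic and half literal.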
We assess the overall goodness of a measure by
looking at its accuracy (Acc) and the relative re-
duction in error rate (RER) on the classification
task described above. The RER of a measure re-
flects the improvement in its accuracy relative to
another measure (often a baseline).
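Accuracy and RER can be computed as sketched below. The paper does not spell out the RER formula; the sketch assumes the usual definition (the reduction in error rate divided by the baseline error rate), which agrees with the figures reported in the results tables, e.g., 64% accuracy against a 50% baseline gives an RER of 28%. The function names and the toy labels are hypothetical.

```python
def accuracy(predicted, gold):
    """Percentage of items whose predicted label matches the gold label."""
    correct = sum(predicted[item] == label for item, label in gold.items())
    return 100.0 * correct / len(gold)

def relative_error_reduction(acc, baseline_acc):
    """Percentage of the baseline's error that the measure removes."""
    baseline_error = 100.0 - baseline_acc
    return 100.0 * (acc - baseline_acc) / baseline_error

# Toy labels for illustration only.
gold = {"a": "idiomatic", "b": "literal", "c": "idiomatic", "d": "literal"}
predicted = {"a": "idiomatic", "b": "literal", "c": "literal", "d": "literal"}
print(accuracy(predicted, gold), relative_error_reduction(64.0, 50.0))
```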
We consider two baselines: (i) a random baseline,
Rand, that randomly assigns a label (literal
or idiomatic) to each verb–noun pair; (ii) a more
informed baseline, PMI, an information-theoretic
measure widely used for extracting statistically
significant collocations.3
4.1 Corpus and Data Extraction
We use the British National Corpus (BNC;
“http://www.natcorp.ox.ac.uk/”) to extract verb–
noun pairs, along with information on the syn-
tactic patterns they appear in. We automatically
parse the corpus using the Collins parser (Collins,
1999), and further process it using TGrep2 (Ro-
hde, 2004). For each instance of a transitive verb,
we use heuristics to extract the noun phrase (NP)
in either the direct object position (if the sentence
is active), or the subject position (if the sentence
is passive). We then use NP-head extraction soft-
ware4 to get the head noun of the extracted NP,
its number (singular or plural), and the determiner
introducing it.
4.2 Experimental Expressions
We select our development and test expressions
from verb–noun pairs that involve a member of a
predefined list of (transitive) “basic” verbs. Ba-
sic verbs, in their literal use, refer to states or
acts that are central to human experience. They
are thus frequent, highly polysemous, and tend to
combine with other words to form idiomatic com-
binations (Nunberg et al., 1994). An initial list of
such verbs was selected from several linguistic and
psycholinguistic studies on basic vocabulary (e.g.,
Pauwels 2000; Newman and Rice 2004). We further
augmented this initial list with verbs that are
semantically related to another verb already in the
list; e.g., lose is added in analogy with find. The
final list of 28 verbs is:
blow, bring, catch, cut, find, get, give, have, hear, hit, hold,
keep, kick, lay, lose, make, move, place, pull, push, put, see,
set, shoot, smell, take, throw, touch
3As in Eqn. (1), our calculation of PMI here restricts the
verb–noun pair to the direct object relation.
4We use a modified version of the software provided by
Eric Joanis based on heuristics from Collins (1999).
From the corpus, we extract all verb–noun pairs
with minimum frequency of 10 that contain a basic
verb. From these, we semi-randomly select an id-
iomatic and a literal subset.5 A pair is considered
idiomatic if it appears in a credible idiom dictio-
nary, such as the Oxford Dictionary of Current Id-
iomatic English (ODCIE) (Cowie et al., 1983), or
the Collins COBUILD Idioms Dictionary (CCID)
(Seaton and Macaulay, 2002). Otherwise, the pair
is considered literal. We then randomly pull out
160 development and 200 test pairs (half idiomatic
and half literal), ensuring both low and high fre-
quency items are included. Sample idioms corre-
sponding to the extracted pairs are: kick the habit,
move mountains, lose face, and keep one’s word.
4.3 Experimental Setup
Development expressions are used in devising the
fixedness measures, as well as in determining the
values of the parameters $K$ in Eqn. (2) and $\alpha$ in
Eqn. (4). $K$ determines the maximum number of
nouns similar to the target noun, to be considered
in measuring the lexical fixedness of a given pair.
The value of this parameter is determined by performing
experiments over the development data,
in which $K$ ranges from 10 to 100 by steps of 10;
$K$ is set to 50 based on the results. We also experimented
with different values of $\alpha$ ranging from 0
to 1 by steps of .1. Based on the development results,
the best value for $\alpha$ is .8 (giving more weight
to the syntactic fixedness measure).
Test expressions are saved as unseen data for the
final evaluation. We further divide the set of all
test expressions, $\mathrm{TEST}_{all}$, into two sets corresponding
to two frequency bands: $\mathrm{TEST}_{f_{low}}$ contains 50
idiomatic and 50 literal pairs, each with total frequency
between 10 and 40 ($10 \le \mathit{freq}(v, n, *) < 40$);
$\mathrm{TEST}_{f_{high}}$ consists of 50 idiomatic and 50
literal pairs, each with total frequency of 40 or
greater ($\mathit{freq}(v, n, *) \ge 40$). All frequency
counts are over the entire BNC.
4.4 Results
We first examine the performance of the individual
fixedness measures, $\mathrm{Fixedness}_{lex}$ and
$\mathrm{Fixedness}_{syn}$, as well as that of the two baselines,
Rand and PMI; see Table 2. (Results for the overall
measure are presented later in this section.) As
can be seen, the informed baseline, PMI, shows a
large improvement over the random baseline (28%
error reduction). This shows that one can get relatively
good performance by treating verb+noun
idiomatic combinations as collocations.

Data Set: TEST_all
                   %Acc   %RER
Rand                50      -
PMI                 64     28
Fixedness_lex       65     30
Fixedness_syn       70     40
Table 2: Accuracy and relative error reduction for the two
fixedness and the two baseline measures over all test pairs.

5In selecting literal pairs, we choose those that involve a
physical act corresponding to the basic semantics of the verb.
$\mathrm{Fixedness}_{lex}$ performs as well as the informed
baseline (30% error reduction). This result shows
that, as hypothesized, lexical fixedness is a reasonably
good predictor of idiomaticity. Nonetheless,
the performance signifies a need for improvement.
Possibly the most beneficial enhancement would
be a change in the way we acquire the similar
nouns for a target noun.
The best performance (shown in boldface) belongs
to $\mathrm{Fixedness}_{syn}$, with 40% error reduction
over the random baseline, and 20% error reduction
over the informed baseline. These results demonstrate
that syntactic fixedness is a good indicator
of idiomaticity, better than a simple measure of
collocation (PMI), or a measure of lexical fixedness.
These results further suggest that looking
into deep linguistic properties of VNICs is both
necessary and beneficial for the appropriate treat-
ment of these expressions.
PMI is known to perform poorly on low frequency
data. To examine the effect of frequency
on the measures, we analyze their performance on
the two divisions of the test data, corresponding to
the two frequency bands, $\mathrm{TEST}_{f_{low}}$ and $\mathrm{TEST}_{f_{high}}$.
Results are given in Table 3, with the best performance
shown in boldface.

Data Set:          TEST_flow        TEST_fhigh
                   %Acc   %RER      %Acc   %RER
Rand                50      -        50      -
PMI                 56     12        70     40
Fixedness_lex       68     36        66     32
Fixedness_syn       72     44        82     64
Table 3: Accuracy and relative error reduction for all measures
over test pairs divided by frequency.

As expected, the performance of PMI drops
substantially for low frequency items. Interestingly,
although it is a PMI-based measure,
$\mathrm{Fixedness}_{lex}$ performs slightly better when the
data is separated based on frequency. The performance
of $\mathrm{Fixedness}_{syn}$ improves quite a bit when
it is applied to high frequency items, while it improves
only slightly on the low frequency items.
These results show that both fixedness measures
perform better on homogeneous data, while retaining
comparably good performance on heterogeneous
data. These results reflect that our fixedness
measures are not as sensitive to frequency as PMI.
Hence they can be used with a higher degree of
confidence, especially when applied to data that
is heterogeneous with regard to frequency. This
is important because while some VNICs are very
common, others have very low frequency.

Table 4 presents the performance of the hybrid
measure, $\mathrm{Fixedness}_{overall}$, repeating that of
$\mathrm{Fixedness}_{lex}$ and $\mathrm{Fixedness}_{syn}$ for comparison.
$\mathrm{Fixedness}_{overall}$ outperforms both lexical and syntactic
fixedness measures, with a substantial improvement
over $\mathrm{Fixedness}_{lex}$, and a small, but notable,
improvement over $\mathrm{Fixedness}_{syn}$. Each of
the lexical and syntactic fixedness measures is a
good indicator of idiomaticity on its own, with
syntactic fixedness being a better predictor. Here
we demonstrate that combining them into a single
measure of fixedness, while giving more weight to
the better measure, results in a more effective predictor
of idiomaticity.

Data Set: TEST_all
                     %Acc   %RER
Fixedness_lex         65     30
Fixedness_syn         70     40
Fixedness_overall     74     48
Table 4: Performance of the hybrid measure over TEST_all.
5 Determining the Canonical Forms
Our evaluation of the fixedness measures demon-
strates their usefulness for the automatic recogni-
tion of idiomatic verb–noun pairs. To represent
such pairs in a lexicon, however, we must determine
their canonical form(s), henceforth Cforms.
For example, the lexical representation of
$\langle$shoot, breeze$\rangle$ should include shoot the breeze
as a Cform.
Since VNICs are syntactically fixed, they are
mostly expected to have a single Cform. Nonetheless,
there are idioms with two or more acceptable
forms. For example, hold fire and hold one’s
fire are both listed in CCID as variations of the
same idiom. Our approach should thus be capa-
ble of predicting all allowable forms for a given
idiomatic verb–noun pair.
We expect a VNIC to occur in its Cform(s) more
frequently than it occurs in any other syntactic patterns.
To discover the Cform(s) for a given idiomatic
verb–noun pair, we thus examine its frequency
of occurrence in each syntactic pattern in
$\mathcal{P}$. Since it is possible for an idiom to have more
than one Cform, we cannot simply take the most
dominant pattern as the canonical one. Instead, we
calculate a $z$-score for the target pair $\langle v, n \rangle$ and
each pattern $pt_k \in \mathcal{P}$:

$$z_k(v, n) = \frac{f(v, n, pt_k) - \bar{f}}{s}$$

in which $\bar{f}$ is the mean and $s$ the standard deviation
over the sample $\{f(v, n, pt_k) \mid pt_k \in \mathcal{P}\}$.
The statistic $z_k(v, n)$ indicates how far and in
which direction the frequency of occurrence of the
pair $\langle v, n \rangle$ in pattern $pt_k$ deviates from the
sample’s mean, expressed in units of the sample’s standard
deviation. To decide whether $pt_k$ is a canonical
pattern for the target pair, we check whether
$z_k(v, n) > T_z$, where $T_z$ is a threshold. For evaluation,
we set $T_z$ to 1, based on the distribution of
$z$ and through examining the development data.
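The z-score test for canonical patterns can be sketched as follows. The function name `canonical_forms`, the pattern labels, and the counts are hypothetical, and the example lists only six patterns for brevity; in practice every pattern in the set would be included, with a zero count where the pair does not occur.

```python
import statistics

def canonical_forms(pattern_counts, threshold=1.0):
    """Return the patterns whose frequency z-score exceeds the threshold."""
    freqs = list(pattern_counts.values())
    mean = statistics.mean(freqs)
    sd = statistics.stdev(freqs)  # sample standard deviation
    if sd == 0:
        return []  # flat distribution: no pattern stands out
    return [pattern for pattern, freq in pattern_counts.items()
            if (freq - mean) / sd > threshold]

# Hypothetical counts for a pair with two dominant patterns.
counts = {
    "v det:NULL n_sg": 30,
    "v det:POSS n_sg": 28,
    "v det:the n_sg": 1,
    "v det:a/an n_sg": 1,
    "v det:NULL n_pl": 0,
    "det:ANY n_sg|pl be v_passive": 0,
}
print(canonical_forms(counts))
```

Thresholding on the z-score rather than taking the single most frequent pattern lets a pair such as hold fire / hold one's fire keep both of its dominant forms.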
We evaluate the appropriateness of this ap-
proach in determining the Cform(s) of idiomatic
pairs by verifying its predicted forms against ODCIE
and CCID. Specifically, for each of the 100
idiomatic pairs in $\mathrm{TEST}_{all}$, we calculate the precision
and recall of its predicted Cforms (those
whose $z$-scores are above $T_z$), compared to the
Cforms listed in the two dictionaries. The average
precision across the 100 test pairs is 81.7%, and
the average recall is 88.0% (with 69 of the pairs
having 100% precision and 100% recall). Moreover,
we find that for the overwhelming majority
of the pairs, 86%, the predicted Cform with the
highest $z$-score appears in the dictionary entry of
the pair. Thus, our method of detecting Cforms
performs quite well.
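The per-pair precision and recall used in this evaluation, and their averages across pairs, can be sketched as below. The function name `precision_recall` and the example forms are hypothetical and are not taken from the dictionaries.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted Cforms against dictionary Cforms."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical predictions and dictionary entries for two pairs.
results = [
    precision_recall({"hold fire", "hold one's fire", "hold the fire"},
                     {"hold fire", "hold one's fire"}),
    precision_recall({"shoot the breeze"}, {"shoot the breeze"}),
]
avg_precision = sum(p for p, _ in results) / len(results)
avg_recall = sum(r for _, r in results) / len(results)
print(avg_precision, avg_recall)
```

Averaging per-pair scores, rather than pooling all predictions, gives each idiom equal weight regardless of how many forms it has.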
6 Discussion and Conclusions
The significance of the role idioms play in lan-
guage has long been recognized. However, due to
their peculiar behaviour, idioms have been mostly
overlooked by the NLP community. Recently,
there has been growing awareness of the impor-
tance of identifying non-compositional multiword
expressions (MWEs). Nonetheless, most research
on the topic has focused on compound nouns and
verb particle constructions. Earlier work on idioms
has only touched the surface of the problem,
failing to propose explicit mechanisms for appropriately
handling them. Here, we provide effective
mechanisms for the treatment of a broadly doc-
umented and crosslinguistically frequent class of
idioms, i.e., VNICs.
Earlier research on the lexical encoding of idioms
mainly relied on the existence of human annotations,
especially for detecting which syntactic
variations (e.g., passivization) an idiom can undergo
(Villavicencio et al., 2004). We propose
techniques for the automatic acquisition and encoding
of knowledge about the lexicosyntactic behaviour
of idiomatic combinations. We put forward
a means for automatically discovering the set
of syntactic variations that are tolerated by a VNIC
and that should be included in its lexical representation.
Moreover, we incorporate such information
into statistical measures that effectively predict the
idiomaticity level of a given expression. In this regard,
our work relates to previous studies on determining
the compositionality (inverse of idiomaticity)
of MWEs other than idioms.
Most previous work on compositionality of
MWEs either treats them as collocations (Smadja,
1993), or examines the distributional similarity between
the expression and its constituents (McCarthy
et al., 2003; Baldwin et al., 2003; Bannard
et al., 2003). Lin (1999) and Wermter
and Hahn (2005) go one step further and look
into a linguistic property of non-compositional
compounds, namely their lexical fixedness, to identify
them. Venkatapathy and Joshi (2005) combine aspects
of the above-mentioned work, by incorporating
lexical fixedness, collocation-based, and distributional
similarity measures into a set of features
which are used to rank verb+noun combinations
according to their compositionality.
Our work differs from such studies in that it
carefully examines several linguistic properties of
VNICs that distinguish them from literal (com-
positional) combinations. Moreover, we suggest
novel techniques for translating such character-
istics into measures that predict the idiomaticity
level of verb+noun combinations. More specifi-
cally, we propose statistical measures that quan-
tify the degree of lexical, syntactic, and overall
fixedness of such combinations. We demonstrate
that these measures can be successfully applied to
the task of automatically distinguishing idiomatic
combinations from non-idiomatic ones. We also
show that our syntactic and overall fixedness measures
substantially outperform a widely used measure
of collocation, PMI, even when the latter
takes syntactic relations into account.
Others have also drawn on the notion of syntac-
tic fixedness for idiom detection, though specific
to a highly constrained type of idiom (Widdows
and Dorow, 2005). Our syntactic fixedness mea-
sure looks into a broader set of patterns associated
with a large class of idiomatic expressions. More-
over, our approach is general and can be easily ex-
tended to other idiomatic combinations.
Each measure we use to identify VNICs captures
a different aspect of idiomaticity: PMI reflects
the statistical idiosyncrasy of VNICs, while
the fixedness measures draw on their lexicosyntactic
peculiarities. Our ongoing work focuses on
combining these measures to distinguish VNICs
from other idiosyncratic verb+noun combinations
that are neither purely idiomatic nor completely
literal, so that we can identify linguistically plau-
sible classes of verb+noun combinations on this
continuum (Fazly and Stevenson, 2005).
References
Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and
Dominic Widdows. 2003. An empirical model of
multiword expression decomposability. In Proc. of
the ACL-SIGLEX Workshop on Multiword Expres-
sions, 89–96.
Colin Bannard, Timothy Baldwin, and Alex Las-
carides. 2003. A statistical approach to the seman-
tics of verb-particles. In Proc. of the ACL-SIGLEX
Workshop on Multiword Expressions, 65–72.
Cristina Cacciari and Patrizia Tabossi, editors. 1993.
Idioms: Processing, Structure, and Interpretation.
Lawrence Erlbaum Associates, Publishers.
Cristina Cacciari. 1993. The place of idioms in a lit-
eral and metaphorical world. In Cacciari and Ta-
bossi (Cacciari and Tabossi, 1993), 27–53.
Kenneth Church, William Gale, Patrick Hanks, and
Donald Hindle. 1991. Using statistics in lexical
analysis. In Uri Zernik, editor, Lexical Acquisition:
Exploiting On-Line Resources to Build a Lexicon,
115–164. Lawrence Erlbaum.
Michael Collins. 1999. Head-Driven Statistical Mod-
els for Natural Language Parsing. Ph.D. thesis,
University of Pennsylvania.
Anthony P. Cowie, Ronald Mackin, and Isabel R. McCaig.
1983. Oxford Dictionary of Current Idiomatic
English, volume 2. Oxford University Press.
Ido Dagan, Fernando Pereira, and Lillian Lee. 1994.
Similarity-based estimation of word cooccurrence
probabilities. In Proc. of ACL’94, 272–278.
Afsaneh Fazly and Suzanne Stevenson. 2005. Au-
tomatic acquisition of knowledge about multiword
predicates. In Proc. of PACLIC’05.
Christiane Fellbaum. 1993. The determiner in English
idioms. In Cacciari and Tabossi (Cacciari and Ta-
bossi, 1993), 271–295.
Christiane Fellbaum. 2005. The ontological loneliness
of verb phrase idioms. In Andrea Schalley and
Dietmar Zaefferer, editors, Ontolinguistics. Mouton de
Gruyter. Forthcoming.
Sam Glucksberg. 1993. Idiom meanings and allu-
sional content. In Cacciari and Tabossi (Cacciari
and Tabossi, 1993), 3–26.
Dekang Lin. 1998. Automatic retrieval and clustering
of similar words. In Proc. of COLING-ACL’98.
Dekang Lin. 1999. Automatic identification of
non-compositional phrases. In Proc. of ACL’99, 317–24.
Diana McCarthy, Bill Keller, and John Carroll.
2003. Detecting a continuum of compositionality in
phrasal verbs. In Proc. of the ACL-SIGLEX Work-
shop on Multiword Expressions.
Rosamund Moon. 1998. Fixed Expressions and Id-
ioms in English: A Corpus-Based Approach. Ox-
ford University Press.
John Newman and Sally Rice. 2004. Patterns of usage
for English SIT, STAND, and LIE: A cognitively
inspired exploration in corpus linguistics. Cognitive
Linguistics, 15(3):351–396.
Geoffrey Nunberg, Ivan Sag, and Thomas Wasow.
1994. Idioms. Language, 70(3):491–538.
Paul Pauwels. 2000. Put, Set, Lay and Place: A Cog-
nitive Linguistic Approach to Verbal Meaning. LIN-
COM EUROPA.
Philip Resnik. 1999. Semantic similarity in a taxonomy:
An information-based measure and its application
to problems of ambiguity in natural language.
JAIR, (11):95–130.
Douglas L. T. Rohde. 2004. TGrep2 User Manual.
Ivan Sag, Timothy Baldwin, Francis Bond, Ann Copestake,
and Dan Flickinger. 2002. Multiword expressions:
A pain in the neck for NLP. In Proc. of
CICLING’02, 1–15.
Maggie Seaton and Alison Macaulay, editors. 2002.
Collins COBUILD Idioms Dictionary. Harper-
Collins Publishers, 2nd edition.
Frank Smadja. 1993. Retrieving collocations from
text: Xtract. CL, 19(1):143–177.
Sriram Venkatapathy and Aravind Joshi. 2005. Measuring
the relative compositionality of verb-noun (V-N)
collocations by integrating features. In Proc. of
HLT-EMNLP’05, 899–906.
Aline Villavicencio, Ann Copestake, Benjamin Wal-
dron, and Fabre Lambeau. 2004. Lexical encod-
ing of MWEs. In Proc. of the ACL’04 Workshop on
Multiword Expressions, 80–87.
Joachim Wermter and Udo Hahn. 2005. Paradigmatic
modifiability statistics for the extraction of complex
multi-word terms. In Proc. of HLT-EMNLP’05,
843–850.
Dominic Widdows and Beate Dorow. 2005. Automatic
extraction of idioms using graph analysis and asym-
metric lexicosyntactic patterns. In Proc. of ACL’05
Workshop on Deep Lexical Acquisition, 48–56.
