Lexical Encoding of MWEs
Aline Villavicencio
Department of Language and Linguistics
University of Essex
Wivenhoe Park
Colchester, CO4 3SQ, UK
and
Computer Laboratory, University of Cambridge
avill@essex.ac.uk
Ann Copestake, Benjamin Waldron, Fabre Lambeau
Computer Laboratory
University of Cambridge
William Gates Building, JJ Thomson Avenue
Cambridge, CB3 0FD, UK
{aac10,bmw20,faml2}@cl.cam.ac.uk
Abstract
Multiword Expressions present a challenge for lan-
guage technology, given their flexible nature. Each
type of multiword expression has its own charac-
teristics, and providing a uniform lexical encoding
for them is a difficult task to undertake. Nonethe-
less, in this paper we present an architecture for the
lexical encoding of these expressions in a database,
that takes into account their flexibility. This encod-
ing extends in a straightforward manner the one re-
quired for simplex (single) words, and maximises
the information contained for them in the descrip-
tion of multiwords.
1 Introduction
Multiword Expressions (MWEs) can be defined as
idiosyncratic interpretations that cross word
boundaries (or spaces) (Sag et al., 2002). They
comprise a wide range of distinct but related
phenomena, such as idioms, phrasal verbs, noun-noun
compounds and many others, which, due to their flexible
nature, are considered to be a challenge for many
areas of current language technology. Even though
some MWEs are fixed, and do not present inter-
nal variation, such as ad hoc, others are much more
flexible and allow different degrees of internal vari-
ability and modification, as, for instance, touch a
nerve (touch/find a nerve) and spill beans (spill sev-
eral/musical/mountains of beans). In terms of se-
mantics, some MWEs are opaque and their seman-
tics cannot be straightforwardly inferred from the
meanings of the component words (e.g. to kick the
bucket as to die). In other cases the meaning is more
transparent and can be inferred from the words in
the MWE (e.g. eat up, where the particle up adds a
completive sense to eat).
Given the flexibility and variation in form of
MWEs and the complex interrelations that may be
found between their components, an encoding that
treats them as invariant strings (a words-with-spaces
approach) will not be adequate to describe any
such expression, with the exception of
the simplest fixed cases such as ad hoc (Sag et al.,
2002; Calzolari et al., 2002). Different strategies
for encoding MWEs have been employed by
different lexical resources with varying degrees of
success, depending on the type of MWE. One case
is the Alvey Tools Lexicon (Carroll and Grover,
1989), which has a good coverage of phrasal verbs,
providing extensive information about their syntactic
aspects (variation in word order, subcategorisation,
etc.), but it does not distinguish compositional
from non-compositional entries, nor does it specify
entries that can be productively formed.
WordNet, on the other hand, covers a large number of
MWEs (Fellbaum, 1998), but does not provide in-
formation about their variability. Neither of these
resources covers idioms. The challenge in designing
adequate lexical resources for MWEs is to ensure
that the variability and the extra dimensions
required by the different types of MWE can be
captured. Such a move is called for by Calzolari et al.
(2002) and Copestake et al. (2002). Calzolari et al.
(2002) discuss these problems while attempting to
establish the standards for MWE description in the
context of multilingual lexical resources. Their fo-
cus is on MWEs that are productive and that present
regularities that can be generalised and applied to
other classes of words that have similar properties.
Copestake et al. (2002) present an initial schema for
MWE description and we build on these ideas here,
by proposing an architecture for a lexical encoding
of MWEs, which allows for a unified treatment of
different kinds of MWE.
In what follows, we start by laying out the min-
imal encoding needed for simplex (single) words.
Then, we analyse two different types of MWE (id-
ioms and verb-particle constructions), and discuss
their requirements for a lexical encoding. Given
these requirements, we present a possible encoding
for MWEs, that uniformly captures different types
of expressions. This database encoding minimises
the amount of information that needs to be specified
for MWE entries, by maximising the information
that can be obtained from simplex words, while
requiring only minimal modification to the encoding
used for simplex words. We finish with some
discussion and conclusions.
Second ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pp. 80-87
2 Simplex Entries
Simplex entries, in this context, refer to simple stan-
dalone words that are defined independently of oth-
ers, and form the bulk of most lexical resources.
For these entries, it is necessary to define at least
their orthography and their syntactic and semantic
characteristics, but more information can also be
specified, such as particular dialect, register, and so on.
Table 1 shows one such encoding. In this minimal
encoding a lexical entry has an identifier (to
uniquely distinguish between the different entries
defining different combinations of parts-of-speech
and senses for a given word), the word’s orthog-
raphy, grammatical (syntactic and semantic) type
and predicate name.1 In the case of this example,
the identifier is like_tv_1, which is an entry for the
verb like, with type trans-verb for transitive verbs,
and predicate name like_v_rel. A type like trans-verb
embodies the constraints defined for a given
construction (in this case transitive verbs), in a par-
ticular grammar, and these vary from grammar to
grammar. Thus, these words can be expanded into
full feature structures during processing according
to the constraints defined in a specific grammar.
Table 1: LinGO ERG lexical database encoding
identifier   orthography   type         predicate
like_tv_1    like          trans-verb   like_v_rel
This table shows a minimal encoding for simplex
words, but it can serve as the basis for a more complete
one. That is the case of the LinGO ERG (Copestake
and Flickinger, 2000) lexicon, whose database
version adopts a compatible but more complex
encoding, which is successfully used to describe
simplex words (Copestake et al., 2004). In
the next sections, we investigate what would be
necessary to extend this encoding so that it can
capture MWEs.
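As an illustration of the minimal simplex encoding, a row of table 1 can be pictured as a record with four fields. The Python sketch below is only one possible rendering: the class name SimplexEntry and the dictionary used as a lexicon are assumptions made for this example, not part of the LinGO ERG database, and identifiers are written with underscores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimplexEntry:
    """One row of the minimal simplex encoding (hypothetical class name)."""
    identifier: str   # unique per combination of word, part of speech and sense
    orthography: str  # surface form of the word
    type: str         # grammar-specific type, e.g. trans-verb
    predicate: str    # semantic predicate name


# The example row from table 1.
like = SimplexEntry("like_tv_1", "like", "trans-verb", "like_v_rel")

# A lexicon indexed by identifier, so that other entries can refer to this one.
lexicon = {like.identifier: like}
print(lexicon["like_tv_1"].predicate)
```

Indexing by identifier matters for what follows, since MWE entries refer to simplex entries through their identifiers.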
1 The identifier and semantic relation names follow the
standard adopted by the LinGO ERG (Copestake and Flickinger,
2000), while the grammatical type names are also compatible
with it.
3 Idioms
Idioms constitute a complex case of MWEs,
allowing a great deal of variation. Some idioms are
very flexible and can be passivised, topicalised,
internally modified, and/or have optional elements
(e.g. spill beans in those beans were spilt, users
spilt password beans and judges spill their musical
beans), while others are more inflexible and only ac-
cept morphological inflection (e.g. kick/kicks/kicked
the bucket).
In order to verify empirically the possible space
of variation that idioms allow, we analysed a sam-
ple of some of the most frequent idioms in English.
This sample was used for determining the require-
ments that an encoding needs in order to provide the
means of adequately capturing idioms.
The Collins Cobuild Dictionary of Idioms lists
approximately 4,400 idioms in English, and 750 of
them are marked as the most frequent listed.2 From
these, 100 idioms were randomly selected and anal-
ysed as described by Villavicencio and Copestake
(2002).
A large proportion of the idioms in this sample seem
to form natural classes that follow similar patterns
(e.g. the class of verb-object idioms, where an id-
iom consists of a specific verb that takes a specific
object such as rock boat and spill beans). The re-
maining idioms, on the other hand, cannot so eas-
ily be grouped together, forming a large tail of
classes often containing only one or two idioms (e.g.
thumbs up and quote, unquote).
Most of the idioms in this sample present a
large degree of variability, especially in terms of
their syntax, also allowing variable elements (throw
SOMEONE to the lions), and optional ones (in a
(tight) corner). The type of variation that these
MWEs allow seems to be linked to their decom-
posability (Nunberg et al., 1994) in the sense that
many idioms seem to be compositional if we con-
sider that some of their component words have non-
standard meanings. Then, using compositional pro-
cesses, the meaning of an idiom can be derived from
the meanings of its elements. Thus, in these idioms,
referred to as semantically decomposable idioms,
a meaning can be assigned to individual words (even
if some of them are non-standard meanings) from
where the meaning of the idiom can be composi-
tionally constructed. One example is spill the beans,
where if spill is paraphrased as reveal and beans as
secrets, the idiom can be interpreted as reveal se-
crets. On the other hand, an idiom like to kick the
bucket, meaning to die, according to this approach
is non-decomposable.
2 These idioms have at least one occurrence in every 2 million
words of the corpus employed to build this dictionary.
When semantic decomposability is used as the
basis for the classification, the majority of the idioms
in this sample are classified as decomposable, and a
few cases as non-decomposable. The decomposable
cases correspond to the flexible idioms, and
the non-decomposable to the fixed ones, providing
a clear-cut division for their treatment. For the
non-decomposable idioms, a treatment of idioms as
words with spaces can be adopted, similar to that of
simplex words, where in a single entry the orthog-
raphy of the component words is specified, along
with the syntactic and semantic type of the idiom,
and a corresponding predicate name. In addition,
for the cases that allow morphological inflection, it
is also important to define which of the elements
of the MWE can be inflected. In this case, an id-
iom like kick the bucket, is given the type of a nor-
mal intransitive verb, except that it is composed of
more than one word, and only the verb can be in-
flected (e.g. kick/kicked/kicks the bucket,...). Conse-
quently, an encoding for non-decomposable idioms
needs to allow the definition of several orthographic
elements for an entry, as well as the specification of
the entry’s orthographic element that allows inflec-
tion.
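The requirement for non-decomposable idioms (several orthographic elements, one of which may inflect) can be sketched as follows. This is a minimal illustration under stated assumptions: the class FixedMWE, the identifier, the type and predicate names, and the toy inflection table are all hypothetical, and a real system would use a morphological component instead of a hand-written list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedMWE:
    """A words-with-spaces entry with one inflecting element (hypothetical)."""
    identifier: str
    orthography: tuple  # one string per orthographic element
    inflected: int      # index of the only element that may inflect
    type: str
    predicate: str


def surface_forms(entry, inflections):
    """Return the surface strings, inflecting only the marked element."""
    words = list(entry.orthography)
    base = words[entry.inflected]
    forms = []
    for variant in inflections.get(base, [base]):
        words[entry.inflected] = variant
        forms.append(" ".join(words))
    return forms


kick_the_bucket = FixedMWE(
    identifier="kick_the_bucket_1",       # hypothetical identifier
    orthography=("kick", "the", "bucket"),
    inflected=0,                          # only the verb inflects
    type="intrans-verb",                  # hypothetical type name
    predicate="kick_the_bucket_v_rel",    # hypothetical predicate name
)

# Toy inflection table standing in for a real morphological component.
toy_inflections = {"kick": ["kick", "kicks", "kicked", "kicking"]}
print(surface_forms(kick_the_bucket, toy_inflections))
```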
In order to capture the flexibility of decompos-
able idioms, a treatment using normal composi-
tional processes can be employed as discussed by
Copestake (1994). In this approach, each idiomatic
component of an idiom could be defined as a sep-
arate entry similar to that of a simplex word, ex-
cept that it would also be possible to specify a para-
phrase for its meaning. In the case of spill beans, it
would mean defining an entry for the idiomatic spill,
which can be paraphrased as reveal and another for
the idiomatic beans paraphrased as secrets. More-
over, as an idiomatic entry for a word may share
many of the properties of (one of) the word’s non-
idiomatic entries (sometimes differing from the lat-
ter only in terms of their semantics), it is important
to define also for each idiomatic element a corre-
sponding non-idiomatic one, from which many as-
pects will be inherited by default. For example, in
an idiom such as spill beans, the idiomatic entry for
spill shares with the non-idiomatic entry the mor-
phology (spilled or spilt) and the syntax (as a transi-
tive verb), and so does the idiomatic beans with the
non-idiomatic one. In addition, as there is a vari-
ability in the status of the words that form MWEs,
with some words having a more literal interpretation
and others a more idiomatic one, only the idiomatic
words need to have separate entries defined. For ex-
ample in the case of the idiom pull the plug, pull
can be interpreted as contributing one of its non-
idiomatic senses (that of removing), while plug has
an idiomatic interpretation (that can be understood
as meaning support). Thus, only an idiomatic en-
try (like that for plug) needs to be defined, while the
contribution of a non-idiomatic entry (like that for
pull) to the idiom comes from the standard entry for
that word.
Having idiomatic and non-idiomatic entries avail-
able for use in idioms is just the first step in being
able to capture this type of MWE. For a precise en-
coding of idioms, it is also necessary to define a very
specific context of use for the idiomatic entries, to
avoid the possibility of overgeneration. Thus, the
verb spill has its idiomatic meaning of reveal only in
the context of spill the beans but not otherwise (e.g.
in spill the water). The definition of these idiomatic
contexts is important to ensure that idiomatic entries
are used only in the context of the idiom, and that
outside the idiom these entries are disallowed. Con-
versely, it is important to be able to define for each
idiom, all the elements that need to be present for
the idiomatic interpretation to be available. An id-
iom is only going to be understood as such if all of
its obligatory components are present. In addition, it
is necessary to ensure that the appropriate relation-
ship among the components of an idiom is found,
for the idiomatic meaning to be available, in order
to avoid the case of false positives, where all the
elements of an idiom are found, but not with the rel-
evant interrelations. Thus, a sentence like He threw
the cat among the pigeons has a possible idiomatic
interpretation available, but this interpretation is not
available in a sentence like He held the cat and she
threw the bread among the pigeons, even though it
has all the obligatory elements for the idiom (throw,
cat, among, pigeons), because cat does not occur as
a semantic argument (the thing thrown) of throw. Many
idioms also present some slight variation in their
components, accepting any one of a restricted set of
words, as for example on home ground and on home
turf. Each of these possibilities corresponds to the
same idiom realised in a slightly different way, but
which nonetheless has the same meaning. Some id-
ioms have also optional elements (such as in a cor-
ner and in a tight corner), and for these it is nec-
essary to indicate which are the optional and which
are the obligatory elements.
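The difference between finding all the words of an idiom and finding them in the right relations can be made concrete with a toy semantic representation. The sketch below is an assumption-laden illustration, not the representation used by any particular grammar: each sentence is a list of (predicate, arguments) pairs with invented variable names, and the function checks only the pattern throw(e, _, y), cat(y), among(e, z), pigeon(z).

```python
# Each predication is (predicate_name, argument_tuple); variables are arbitrary.
# "He threw the cat among the pigeons"
IDIOMATIC = [
    ("throw", ("e1", "x1", "x2")),   # x1 throws x2 in event e1
    ("cat", ("x2",)),
    ("among", ("e1", "x3")),         # event e1 is located among x3
    ("pigeon", ("x3",)),
]
# "He held the cat and she threw the bread among the pigeons"
LITERAL = [
    ("hold", ("e1", "x1", "x2")),
    ("cat", ("x2",)),
    ("throw", ("e2", "x4", "x5")),
    ("bread", ("x5",)),
    ("among", ("e2", "x3")),
    ("pigeon", ("x3",)),
]


def instances(predications, name):
    """Variables that the one-place predicate `name` holds of."""
    return {args[0] for pred, args in predications
            if pred == name and len(args) == 1}


def licenses_idiom(predications):
    """True only if cat is the thing thrown and the event is among pigeons."""
    cats = instances(predications, "cat")
    pigeons = instances(predications, "pigeon")
    for pred, args in predications:
        if pred != "throw" or args[2] not in cats:
            continue
        event = args[0]
        for pred2, args2 in predications:
            if pred2 == "among" and args2[0] == event and args2[1] in pigeons:
                return True
    return False


print(licenses_idiom(IDIOMATIC), licenses_idiom(LITERAL))
```

Both toy sentences contain throw, cat, among and pigeon, so a check on word presence alone would accept both; only the check on shared arguments separates them.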
Idioms also vary in the number of (obligatory)
components they have, ranging from as few as
two words (e.g. pull strings) to
10 words (e.g. six of one and half a dozen
of the other) or more, with no fixed upper
bound or standard size. Consequently, an adequate
treatment of idioms cannot assume that idioms will
have a specific pre-defined size, but instead it needs
to be able to deal with this variability.
4 Verb Particle Constructions
Verb Particle Constructions (VPCs) are combina-
tions of verbs and prepositional or adverbial par-
ticles, such as break down in The old truck broke
down. In syntactic terms, VPCs can be used in
several different subcategorisation frames (e.g. eat
up as intransitive or transitive VPC). In semantic
terms VPCs can range from idiosyncratic or semi-
idiosyncratic combinations, such as get along meaning
to be on friendly terms, where the meaning of the
combination cannot be straightforwardly inferred
from the meaning of the verb and the particle, (in
e.g. He got along well with his colleagues), to more
regular ones, such as tear up (in e.g. In a rage she
tore up the letter Jack gave her). The latter is a case
where the particle compositionally adds a specific
meaning to the construction and follows a produc-
tive pattern (e.g. as in tear up, cut up and split up,
where the verbs are semantically related and up adds
a sense of completion to the action of these verbs).
In terms of inflectional morphology, the verb in a
VPC follows the same pattern as the simplex
verb (e.g. split up and split). Other characteristics,
like register and dialect are also shared between the
verb in a VPC and the simplex verb. If the VPC and
corresponding simplex verb are defined as indepen-
dent unrelated entries, these generalisations about
what is common between them would be lost. One
option to avoid this problem is to define the VPC
entry in a lexical encoding in terms of the corre-
sponding simplex verb entry.
As discussed earlier, for many VPCs the particle
compositionally adds to the meaning of the verb
to form the meaning of the VPC, and this pro-
vides one more reason for keeping the link between
the VPC entry (e.g. wander up) and the simplex
verb entry (e.g. wander), which share the seman-
tics of the verb. Moreover, some of the compo-
sitional VPCs seem to follow productive patterns
(e.g. the resultative combinations walk/jump/run
up/down/out/in/away/around/... from joining these
verbs and the directional/locative particles up,
down, out, in, away, around, ...). This is dis-
cussed in Fraser (1976), who notes that the seman-
tic properties of verbs seem to affect their possi-
bility of combination with particles. For produc-
tive VPCs, one possibility is then to use the en-
tries of verbs already listed in a lexical resource
to productively generate VPC entries by combin-
ing them with particles according to their seman-
tic classes, as discussed by Villavicencio (2003).
However, there are also cases of semi-productivity,
since the possibilities of combinations are not fully
predictable from a particular verb and particle (e.g.
phone/ring/call/*telephone up). Thus, although
some classes of VPCs can be productively gener-
ated from verb entries, to avoid overgeneration we
adopt an approach where the remaining VPCs need
to be explicitly licensed by the specification of the
appropriate VPC entry.
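The combination of productive generation and explicit licensing can be sketched as a simple membership test. The class contents below are assumptions chosen from the examples in the text (the set names and the exact verb and particle lists are hypothetical), and real semantic classes would come from a lexical resource.

```python
# Hypothetical semantic class of verbs that combine productively
# with directional/locative particles.
MOTION_VERBS = {"walk", "jump", "run", "wander"}
DIRECTIONAL_PARTICLES = {"up", "down", "out", "in", "away", "around"}

# Semi-productive combinations must be listed explicitly;
# note that ("telephone", "up") is deliberately absent.
EXPLICIT_VPCS = {("phone", "up"), ("ring", "up"), ("call", "up")}


def is_licensed(verb, particle):
    """A VPC is licensed by a productive class or by an explicit entry."""
    productive = verb in MOTION_VERBS and particle in DIRECTIONAL_PARTICLES
    return productive or (verb, particle) in EXPLICIT_VPCS


for pair in [("walk", "up"), ("phone", "up"), ("telephone", "up")]:
    print(pair, is_licensed(*pair))
```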
To sum up, for VPC entries an appropriate en-
coding needs to maintain the link between a VPC
and the corresponding simplex form, from which
the VPC inherits many of its characteristics, including
inflectional morphology and, for compositional
cases, the semantics of the verb. On the other hand,
for a non-compositional entry, like get along, it is
necessary to specify the resulting semantics. In this
case, the semantics defined in the VPC entry over-
rides that inherited by default from its components.
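Default inheritance with overriding, as summarised above, might be sketched like this. The table layout, the field names (base_form, particle, register) and all identifiers and predicate names are assumptions for the purpose of illustration; only the behaviour (inherit from the simplex verb unless the VPC entry says otherwise) is taken from the text.

```python
# Hypothetical simplex verb entries.
SIMPLEX = {
    "wander_v_1": {"orthography": "wander", "register": "neutral",
                   "predicate": "wander_v_rel"},
    "get_v_1": {"orthography": "get", "register": "neutral",
                "predicate": "get_v_rel"},
}

# VPC entries point at a base form; None means "inherit by default".
VPCS = {
    "wander_up_1": {"base_form": "wander_v_1", "particle": "up",
                    "predicate": None},
    "get_along_1": {"base_form": "get_v_1", "particle": "along",
                    "predicate": "get_along_v_rel"},  # non-compositional
}


def resolve_vpc(vpc_id):
    """Merge a VPC entry with its base form; explicit values override."""
    vpc = VPCS[vpc_id]
    merged = dict(SIMPLEX[vpc["base_form"]])
    for key, value in vpc.items():
        if key != "base_form" and value is not None:
            merged[key] = value
    return merged


print(resolve_vpc("wander_up_1")["predicate"])
print(resolve_vpc("get_along_1")["predicate"])
```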
5 A Possible Encoding for MWEs
Taking the encoding of simplex entries as basis for
an MWE encoding, we now discuss the necessary
extensions to the former, to be able to provide the
means of capturing the extra dimensions required
by the latter. While taking these requirements into
account, it is also desirable to define a very gen-
eral architecture, in which simplex and MWE en-
tries can be defined quite similarly, and in which
different types of MWE can be captured in a uni-
form encoding.
In the proposed encoding, simplex entries are still
defined in terms of orthography, grammatical type
and semantic predicate, in the Simplex table
(table 2). The same encoding can be used for fixed
MWEs, which are treated as words with spaces,
except that it also allows for the definition of the
element in the MWE that can be inflected. This is the
case of kick the bucket, which is defined as an
intransitive construction whose first orthographic
element (kick) is marked as allowing inflection, and
from which variations such as kicks the bucket can
be derived (table 2).
The encoding of flexible MWEs, on the other
hand, is done in 3 stages. In the first one, the id-
iomatic components of an MWE are defined in a
similar way to simplex words, in terms of an identi-
fier, grammatical type and semantic predicate, in the
MWE table (table 3). In addition, they also make
reference to a non-idiomatic simplex entry (base
form in table 3) from which they inherit by
default many of their characteristics, including
orthography. This is done by means of the non-idiomatic
entry’s identifier. In the case of e.g. the idiomatic
spill (i_spill_tv_1), the corresponding non-idiomatic
entry is the transitive spill defined in the simplex
table, whose identifier is spill_tv_1. Moreover,
when appropriate, a non-idiomatic paraphrase for
the idiomatic element can also be defined. This is
achieved by specifying, in paraphrase, the
equivalent non-idiomatic element’s semantic predicate.
The idiomatic spill, for example, is assigned as
its paraphrase the non-idiomatic reveal
(reveal_tv_rel) defined in the simplex table. This
can be used to generate a non-idiomatic paraphrase
for the whole MWE (e.g. reveal secrets as a
paraphrase of spill beans, as defined in table 3).
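The first stage can be sketched as two lookup tables, with idiomatic entries pointing at a base form and, optionally, at a paraphrase predicate. Since tables 2 and 3 are not reproduced here, every row below is an assumption built from the examples in the text; in particular the entries for bean and secret, the noun type name and the predicate names other than reveal_tv_rel are hypothetical.

```python
# Hypothetical simplex table, keyed by identifier.
SIMPLEX = {
    "spill_tv_1": {"orthography": "spill", "type": "trans-verb",
                   "predicate": "spill_tv_rel"},
    "reveal_tv_1": {"orthography": "reveal", "type": "trans-verb",
                    "predicate": "reveal_tv_rel"},
    "bean_n_1": {"orthography": "bean", "type": "noun",
                 "predicate": "bean_n_rel"},
    "secret_n_1": {"orthography": "secret", "type": "noun",
                   "predicate": "secret_n_rel"},
}

# Hypothetical MWE table: idiomatic elements with base form and paraphrase.
MWE = {
    "i_spill_tv_1": {"base_form": "spill_tv_1", "type": "trans-verb",
                     "predicate": "i_spill_rel",
                     "paraphrase": "reveal_tv_rel"},
    "i_bean_n_1": {"base_form": "bean_n_1", "type": "noun",
                   "predicate": "i_bean_rel",
                   "paraphrase": "secret_n_rel"},
}


def orthography(mwe_id):
    """Orthography is inherited from the non-idiomatic base form."""
    return SIMPLEX[MWE[mwe_id]["base_form"]]["orthography"]


def paraphrase(mwe_ids):
    """Replace each idiomatic element by the word for its paraphrase."""
    by_predicate = {e["predicate"]: e["orthography"]
                    for e in SIMPLEX.values()}
    return " ".join(by_predicate[MWE[i]["paraphrase"]] for i in mwe_ids)


print(orthography("i_spill_tv_1"))
print(paraphrase(["i_spill_tv_1", "i_bean_n_1"]))
```

The paraphrase here is built from uninflected base orthographies, so number marking (secret versus secrets) is left to a morphological component that this sketch does not model.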
However, in order to encode an MWE precisely,
in the second stage its context is specified,
where all the elements that make up that MWE
are listed. This ensures that the MWE is recognised
as such only when all the core elements defined for
it are present (e.g. spill and beans for
the MWE spill beans), preventing false
positives (e.g. spill the milk) from being treated as
an instance of this MWE. Likewise, this prevents
idiomatic entries from being used outside the context
of the MWE (e.g. the idiomatic spill being
interpreted as reveal in spill some water). This is done
in the MWE Components table (table 4).
In this table each entry is defined in terms of an
identifier for the MWE (e.g. i_spill_beans_1), and
identifiers for each of the MWE components (e.g.
i_spill_tv_1 and i_bean_n_1), that provide the link to
the lexical specification of these components either
in the simplex table (table 2), or in the MWE table
(table 3). In order to allow MWEs with any number
of elements to be uniformly defined (from shorter
ones like spill beans, rows 1 to 2 in table 4, to longer
ones like pull the curtain down on), we propose an
encoding where each element of the MWE is
specified as a separate contextual entry (row). Thus, what
links all the components of an MWE together,
each specified as an entry, is that they have the same
MWE identifier (e.g. i_spill_beans_1). Moreover,
to account for MWEs with optional elements, like
in a corner and in a tight corner where tight is op-
tional, each of the elements of the MWE needs to be
marked as obligatory or optional in this table.
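The one-row-per-component layout and the obligatory/optional marking can be sketched as follows. Table 4 is not reproduced here, so the rows are assumptions: the spill beans identifiers come from the text, while the identifiers for in a (tight) corner are hypothetical. The check covers only the presence of obligatory components, not the relations between them.

```python
from collections import defaultdict

# (MWE identifier, component identifier, optional?)
COMPONENTS = [
    ("i_spill_beans_1", "i_spill_tv_1", False),
    ("i_spill_beans_1", "i_bean_n_1", False),
    ("i_in_corner_1", "in_p_1", False),        # hypothetical identifiers
    ("i_in_corner_1", "tight_adj_1", True),    # optional element
    ("i_in_corner_1", "i_corner_n_1", False),
]


def obligatory_components(rows):
    """Group the obligatory components by the MWE identifier they share."""
    required = defaultdict(set)
    for mwe, component, optional in rows:
        if not optional:
            required[mwe].add(component)
    return dict(required)


def candidate_mwes(found, rows):
    """MWEs whose obligatory components all occur among `found`."""
    return sorted(mwe for mwe, needed in obligatory_components(rows).items()
                  if needed <= set(found))


print(candidate_mwes({"i_spill_tv_1"}, COMPONENTS))
print(candidate_mwes({"i_spill_tv_1", "i_bean_n_1"}, COMPONENTS))
print(candidate_mwes({"in_p_1", "i_corner_n_1"}, COMPONENTS))
```

Because each component is its own row, an MWE of any length needs no change to the table layout, only more rows with the same MWE identifier.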
For some MWEs, such as VPCs, one of the
components may be contributing a very specific
meaning in the context of that particular MWE, and often
the meaning is more specific than the one defined in
the corresponding base form entry for the
component, from which the meaning is obtained by default.
Thus, for non-compositional VPCs, such as look
up, the particles can be assumed to have a vacuous
semantic contribution, and the semantics of these
VPCs are contributed solely by the verbs. For look
up, the verbal component, look_tv_1, defines the
meaning of the VPC as look-up_tv_rel while up is
assigned a vacuous relation (up-vacuous_prt_rel).
Similarly, up in a VPC such as wander up has either
a directional or locational/aspectual interpretation,
which in both cases can be regarded as qualifying
the event of wandering and can be compositionally
added to the meaning of the verb to generate the
meaning of the combination. For these cases, it is
important to allow the semantics of the component
in question to be further refined in its entry for that
MWE (e.g. up with semantics up-end-pt_prt_rel in
table 4). The approach taken means that the com-
monality in the directional interpretation between
wander up and walk up, where the semantics of the
particle is shared, is captured by means of the spe-
cific semantic type defined for the particle, which
means that generalizations can be made in an infer-
ence component or in semantic transfer for Machine
Translation. Similarly, by defining a VPC from the
base form of the corresponding verb, it is possible to
capture the fact that the semantics of the verb is shared
between the verb wander and the VPC wander up.
Finally, in order to specify the appropriate rela-
tionships between the elements of the MWE, a set
of labels is used (PRED1, PRED2,...), which refer
to the position of the element in the logical form
for the MWE. This can be seen in the MWE Type
table (table 5). The basic idea behind the use of
these labels, defined in the column slot, is that they
can be employed as place holders in the semantic
predicate associated with that particular MWE. The
precise correspondences between these place hold-
ers and the predicates are specified in meta-types
defined for each different class of MWE. Thus the
particular meta-type verb-object-idiom is for idioms
with two obligatory elements, where PRED1
corresponds to pred1(X,Y) and PRED2 to pred2(Y),
and PRED1 (corresponding to the verb) is a
predicate whose second semantic argument (Y) is
coindexed with the argument of the second predicate
(the object). When this meta-type is instantiated
with the entries for an MWE like spill beans
(i_spill_beans_1) the slots are
instantiated as i_spill_rel(X,Y) and i_bean_rel(Y).3
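The instantiation of a meta-type can be sketched as filling place holders in a template. This follows the simplified notation used in the text rather than any grammar-specific type; the dictionary META_TYPES and the function name are assumptions for the example.

```python
# Simplified meta-type templates: slot labels applied to shared variables.
META_TYPES = {
    "verb-object-idiom": ["PRED1(X,Y)", "PRED2(Y)"],
}


def instantiate(meta_type, slots):
    """Replace each slot label by the predicate that fills it."""
    result = []
    for template in META_TYPES[meta_type]:
        label, arguments = template.split("(", 1)
        result.append(f"{slots[label]}({arguments}")
    return result


print(instantiate("verb-object-idiom",
                  {"PRED1": "i_spill_rel", "PRED2": "i_bean_rel"}))
```

The shared variable Y is what encodes the requirement that the object of the verb and the noun are the same entity.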
These meta-types act as interface between the
database and a specific grammar system. As mentioned
before, MWEs can be grouped together in
classes according to the patterns they follow (in
terms of syntactic and semantic characteristics).
Therefore, for each particular class of MWE, a
specific meta-type is defined, which contains the
precise interrelation between the components of
the MWE. This means that for a particular grammar,
for each meta-type there must be a
(grammar-dependent) type that maps the semantic relations
between the elements of the MWE into the appropriate
grammar dependent features. Thus, in the third
stage, it is necessary to specify the meta-types for
the MWEs encoded.
3 For reasons of clarity, in this paper we are using a
simplified but equivalent notation for the meta-type description.
Table 5: MWE Type Table
mwe               meta-type
i_find_nerve_1    verb-object-idiom
i_spill_beans_1   verb-object-idiom
walk_up_1         verb-particle-np
wander_up_1       verb-particle-np
look_up_1         verb-particle-np
In order to test the generality of the meta-types
defined, a further sample of 25 idioms was ran-
domly selected, and an attempt was made to clas-
sify them according to the meta-types defined. The
majority of these idioms could be successfully de-
scribed by the available types, with only a few for
which further meta-types needed to be defined.
The same mechanisms are also used for defining
MWEs which have an element that can be realised
in different ways, as one of a restricted set of
words, like touch a nerve and find a nerve, which
are instances of the same MWE. For these cases,
it is necessary to define each of the possible variants
and the position in the idiom in which they occur.
This is done in table 4, where find and touch, the
variants of the idiom find/touch a nerve, are defined
as occurring in a particular slot, PRED1 (and nerve
as PRED2): i_touch_rel(X,Y), i_nerve_rel(Y) and
i_find_rel(X,Y), i_nerve_rel(Y). By using the same
identifier (i_find_nerve_1) and slot (PRED1) in both
cases, find and touch are specified as two possible
distinct realizations of the slot for that same idiom.
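Variants that share an MWE identifier and a slot can be expanded mechanically into the alternative realisations of the idiom. The rows below are assumptions modelled on the find/touch a nerve example (the component identifiers are hypothetical, since table 4 is not reproduced here).

```python
from itertools import product

# (MWE identifier, component identifier, slot)
ROWS = [
    ("i_find_nerve_1", "i_find_tv_1", "PRED1"),
    ("i_find_nerve_1", "i_touch_tv_1", "PRED1"),  # same slot: a variant
    ("i_find_nerve_1", "i_nerve_n_1", "PRED2"),
]


def realisations(mwe_id, rows):
    """Every way of filling each slot with one of its listed variants."""
    by_slot = {}
    for mwe, component, slot in rows:
        if mwe == mwe_id:
            by_slot.setdefault(slot, []).append(component)
    slots = sorted(by_slot)
    return [dict(zip(slots, combination))
            for combination in product(*(by_slot[s] for s in slots))]


for filling in realisations("i_find_nerve_1", ROWS):
    print(filling)
```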
6 Discussion
Multiword Expressions present a challenge for lan-
guage technology, given their flexible nature. In this
paper we described a possible architecture for the
lexical encoding of these expressions. Even though
different types of MWEs have their own character-
istics, this proposal provides a uniform lexical en-
coding for defining them. This architecture takes
into account the flexibility of MWEs, extending in a
straightforward manner the one required for simplex
words, and maximises the information contained for
them in the description of MWEs while minimising
the amount of information that needs to be defined
in the description of these expressions.
This encoding provides a clear way to capture
both fixed (and semi-fixed) MWEs and flexible
ones. The former are treated in the same manner
as simplex words, but with the possibility of speci-
fying the inflectional element of the MWE. For flex-
ible MWEs, on the other hand, the encoding is done
in three stages. The first one is the definition of
the idiomatic elements, in the MWE table, the sec-
ond the definition of an MWE’s components, in the
MWE Components table, and the third is the spec-
ification of a class (or meta-type) for the MWE, in
the MWE Type table. Different types of MWEs can
be straightforwardly described using this encoding,
as discussed in terms of idioms and VPCs.
A database employing this encoding can be in-
tegrated with a particular grammar, providing the
grammar system with a useful repertoire of MWEs.
This is the case of the MWE grammar (Villavicen-
cio, 2003) and of the wide-coverage LinGO ERG
(Flickinger, 2004), both implemented in the
framework of HPSG and successfully integrated with this
database. This encoding is also used as the basis of the
architecture for a multilingual database of MWEs
defined by Villavicencio et al. (2004), which has
the added complexity of having to record the cor-
respondences and differences in MWEs in different
languages: different word orders, different lexical
and syntactic constructions, etc. In terms of usage,
this encoding means that the search facilities pro-
vided by the database can help the user investigate
MWEs with particular properties. This in turn can
be used to aid the addition of new MWEs to the
database by analogy with existing MWEs with sim-
ilar characteristics.
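The kind of search mentioned above can be illustrated with an in-memory SQLite database holding the rows of table 5. This is a sketch only: the table and column names are assumptions, and nothing here reflects the schema of the actual database.

```python
import sqlite3

# Rows of table 5, with identifiers written with underscores.
ROWS = [
    ("i_find_nerve_1", "verb-object-idiom"),
    ("i_spill_beans_1", "verb-object-idiom"),
    ("walk_up_1", "verb-particle-np"),
    ("wander_up_1", "verb-particle-np"),
    ("look_up_1", "verb-particle-np"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mwe_type (mwe TEXT PRIMARY KEY, meta_type TEXT)")
conn.executemany("INSERT INTO mwe_type VALUES (?, ?)", ROWS)


def mwes_of_type(connection, meta_type):
    """All MWE identifiers that were assigned the given meta-type."""
    cursor = connection.execute(
        "SELECT mwe FROM mwe_type WHERE meta_type = ? ORDER BY mwe",
        (meta_type,),
    )
    return [row[0] for row in cursor]


print(mwes_of_type(conn, "verb-object-idiom"))
```

A query of this kind returns existing MWEs of a given class, which is the sort of list a lexicographer could consult when adding a new MWE by analogy.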
7 Acknowledgements
This research was supported in part by the
NTT/Stanford Research Collaboration, research
project on multiword expressions and by the Noun
Phrase Agreement and Coordination AHRB Project
MRG-AN10939/APN17606. This document was
generated partly in the context of the DeepThought
project, funded under the Thematic Programme
User-friendly Information Society of the 5th Frame-
work Programme of the European Community
(Contract No IST-2001-37836).

References
Nicoletta Calzolari, Charles Fillmore, Ralph Gr-
ishman, Nancy Ide, Alessandro Lenci, Cather-
ine MacLeod, and Antonio Zampolli. 2002. To-
wards best practice for multiword expressions in
computational lexicons. In Proceedings of the
3rd International Conference on Language Re-
sources and Evaluation (LREC 2002), Las Pal-
mas, Canary Islands.
John Carroll and Claire Grover. 1989. The deriva-
tion of a large computational lexicon of English
from LDOCE. In B. Boguraev and E. Briscoe,
editors, Computational Lexicography for Natural
Language Processing. Longman.
Ann Copestake and Dan Flickinger. 2000. An
open-source grammar development environment
and broad-coverage English grammar using
HPSG. In Proceedings of the 2nd International
Conference on Language Resources and Evalua-
tion (LREC 2000).
Ann Copestake, Fabre Lambeau, Aline Villavicen-
cio, Francis Bond, Timothy Baldwin, Ivan Sag,
and Dan Flickinger. 2002. Multiword expres-
sions: Linguistic precision and reusability. In
Proceedings of the 3rd International Conference
on Language Resources and Evaluation (LREC
2002), Las Palmas, Canary Islands.
Ann Copestake, Fabre Lambeau, Benjamin Wal-
dron, Francis Bond, Dan Flickinger, and Stephan
Oepen. 2004. A lexicon module for a grammar
development environment. To appear in
Proceedings of the International Conference on
Language Resources and Evaluation (LREC 2004),
Lisbon, Portugal.
Ann Copestake. 1994. Representing idioms. Paper
presented at the HPSG Conference.
Christiane Fellbaum. 1998. Towards a representa-
tion of idioms in WordNet. In Proceedings of the
workshop on the use of WordNet in Natural Lan-
guage Processing Systems (Coling-ACL 1998),
Montreal.
Dan Flickinger. 2004. Personal Communication.
Bruce Fraser. 1976. The Verb-Particle Combina-
tion in English. Academic Press, New York,
USA.
Geoffrey Nunberg, Ivan A. Sag, and Tom Wasow.
1994. Idioms. Language, 70:491–538.
Ivan Sag, Timothy Baldwin, Francis Bond, Ann
Copestake, and Dan Flickinger. 2002. Multi-
word expressions: A pain in the neck for NLP. In
Proceedings of the 3rd International Conference
on Intelligent Text Processing and Computational
Linguistics (CICLing-2002), pages 1–15, Mexico
City, Mexico.
Aline Villavicencio and Ann Copestake. 2002.
On the nature of idioms. LinGO Working
Paper No. 2002-04.
Aline Villavicencio, Timothy Baldwin, and Ben-
jamin Waldron. 2004. A multilingual database of
idioms. To appear in Proceedings of the
International Conference on Language Resources and
Evaluation (LREC 2004), Lisbon, Portugal.
Aline Villavicencio. 2003. Verb-particle construc-
tions and lexical resources. In Francis Bond,
Anna Korhonen, Diana McCarthy, and Aline
Villavicencio, editors, Proceedings of the ACL
2003 Workshop on Multiword Expressions: Anal-
ysis, Acquisition and Treatment, pages 57–64,
Sapporo, Japan.
