Formal Description of Multi-Word Lexemes with the 
Finite-State Formalism IDAREX 
Elisabeth Breidt 
Seminar ffir Sprachwissenschaft 
Universit£t Tfibingen 
Wilhelmstr. 113 
D-72074 Tfibingen 
Germany 
breidt@sfs.nphil.uni-tuebingen.de 
Frfddrique Segond, Giuseppe Valetto 
Rank Xerox Research Centre 
6, chemin de Maupertuis 
F-38240 Meylan 
France 
segond@grenoble.rxrc.xerox.com 
valetto@mailer.cefriel.it 
Abstract 
Most multi-word lexemes (MWLs) allow 
certain types of variation. This has to be 
taken into account for their description 
and their recognition in texts. We sug- 
gest to describe their syntactic restric- 
tions and their idiosyncratic peculiarities 
with local grammar rules, which at the 
same time allow to express in a general 
way regularities valid for a whole class of 
MWLs. The local grammars can be writ- 
ten in a very convenient and compact 
way as regular expressions in the formal- 
ism IDAREX which uses a two-level mor- 
phology. IDAREX allows to define various 
types of variables, and to mix canonical 
and inflected word forms in the regular 
expressions. ~ 
1 Introduction 
Most texts are rich in multi-word expressions 
that cannot be properly understood let alone be 
processed in an NLP system, ff they are not recog- 
nized as complex lexical units. Such expressions 
which we call multi-word lexemes (MWL) range 
from idioms (to rack one's brains over sth), over 
phrasal verbs (to come up with), lexical and gram- 
matical collocations (to make love, with regard to 
resp.) to compounds (on-line dictionary). 
While certain MWLs only occur in exactly one 
form, e.g. out of the blue or G:um Haaresbreite 
('by a hair's breadth', lit. by hair's breadth), and 
can thus be easily recognised with simple pat- 
tern matching techniques, it is well-known (see 
e.g. Gross 1982, Brundage et al. 1992, Nunberg 
et al. 1994) that most MWLs cannot be treated 
like completely fixed patterns, since they may un- 
dergo some variation. However, only a subset of 
1Part of this work was funded under LRE 62-080 
by the EEC. 
the variations allowed by general rules is valid: 
outside that subset, the expression loses its spe- 
cial, idiomatic meaning, either reverting to its lit- 
eral meaning or losing any significance altogether. 
In certain cases, MWLs can even contradict nor- 
mal syntactic rules, as with by and large, or G:von 
Haus aus ('originally', lit. from house out), where 
general rules would require an article between the 
preposition and the noun. 
The identification of MWLs is essential for any 
natural language processing based on lexical infor- 
mation, ranging from intelligent dictionary look- 
up over concordancing or indexing to machine 
translation. Therefore, the restricted lexical and 
syntactic variability of MWLs and their idiosyn- 
cratic peculiarities need to be expressed in the 
computational lexicon in order to be able to recog- 
nize the full range of their occurrences. We pro- 
pose to use local grammars for this, written as 
a special type of regular expressions (REs) in 
the finite-state formalism IDAItEX which makes 
use of a two-level morphological lexicon. So far, 
we have successfully applied this approach to ap- 
proximately 15,000 English, French and German 
MWLs (see also Segond and Breidt 1995). 
2 Restricted Variability of MWLs 
Basically, we recognize four types of variability 
(see also Fleischer 1982, Brundage et al. 1992, En- 
gelke 1994) that a description of MWLs, both for 
NLP and for human use, should cover. Though 
part of the variability of MWLs may follow from 
their semantic properties as argued in recent work 
(e.g. Nunberg et al. 1994), it is difficult to estab- 
lish such a relationship on a large scale, and a lot 
of remaining idiosyncratic characteristics of indi- 
vidual MWLs need to be represented. 
Morphological Variation: Particular words 
in the MWL may undergo certain inflections. 
For instance, in G:durchschlagender Erfolg 
('sweeping success', lit. rubbing--offsuccess), noun 
1036 
and adjective can vary in case and in numbcr, 
and comparative and superlative form are possible 
for the adjective, whereas G:griine Welle ('phased 
traffic lights', lit. green wave) may only vary in 
case, but not in number or adjective comparison 
without loosing its idiomatic meaning. 
Lexical Variation: One or more words can be 
substituted by other terms without changing the 
overall meaning of the MWL. 
For instance, in F:perdre la tdte ('to lose one's 
mind', lit. to lose the head), the noun can be sub- 
stituted by la boule (lit. ball, coll. head) or les 
pddales (lit. pedals) without loosing its idiomatic 
meaning, but not by la tronche (lit. slice, coll. 
head). 
Modification: One of the MWL's constituents 
can be modified, preserving the idiomatic mean- 
ing. 
For instance, in G:den (schgnen) Schein wahren 
('to keep up appearances', lit. the (nice) pretence 
preserve) the presence or absence of the adjective 
does not change the meaning at all, whereas in 
G:das Handtueh werfen ('to throw in the towel', 
lit. the towel throw) any modification would evoke 
the literal meaning. 
Structural Flexibility: This includes phenom- 
ena like passivization, topicalization, scrambling, 
raising constructions etc. 
For instance, whereas in German standard word 
order variation applies to all verbal MWLs, top- 
icalisation of lexically fixed components is only 
rarely po~ible, as in G:den Vogel dabei hat dana 
Jan abgeschossen ('finally, Jan surpassed every- 
one', lit. the bird with it has then Jan shot). 
3 IDAREX: Encoding Idioms As 
Regular EXpressions 
The IDAREX formalism and the corresponding 
FSC Finite State Compiler have been developed 
at Rank Xerox Research Centre by L. Karttunen, 
P. Tapanainen and G. Valetto 2. 
3.1 Morphological Variation 
Because IDAItEX use8 a two-level morphology, 
words can be presented either in their base form 
at the lexical level or in an inflected form at the 
surface level. The surface form is preceded by a 
colon and restricts occurrences of the word to ex- 
actly this form, e.g. 
2 For a more detailed 
description of the formalism see Kaxttunen and Yam- 
pol (1993), Tapanainen (1994), Uaxttunen (1995), and 
Segond and Tapanainen (1995). 
: Welle 
The lezical form is followed by an IDAREX morpho- 
logical variable specifying morphological features 
of the word, and a colon, e.g. 
durchschlagend A : 
This represents any occurrence of the word with 
the specified morphological properties. The mor- 
phological variable can be very general, such a.u 
'A' for ally adjectival use, or more specific, such 
as Abs0 for adjectives that may not be used in 
comparative form and isg to restrict nouns to the 
singular, as in 
grim Abse: Welle Nsg: 
This way, the restricted morph(~syntactic flexi- 
bility of MWLs can bc expressed very elegantly. 
3.2 Modification 
MWL modifications with particular words are rep- 
resented as optional expressions with parentheses, 
as ill 
:den (:sch6nen) :Schein 
The definition of word-class variables allows 
to express lexically unrestricted modifications of 
an MWL such as insertion of any adverb(s) (the 
Kleene star operator indicates that the item may 
occur any number of times): 
perdre V: ADV* :la :t@te 
On the basis of simpler word-class variables more 
complex ones may be defined for complex syntac- 
tic categories suclt as NP, ADVP or PP. 
3.3 Lexieal and Structural Variation 
The formalism provides a set of RE operators to 
combine the descriptions of single words. Square 
brackets '\[ \]' and the bar '--' are used to describe 
lexical variants and alternations of more complex 
sequences such as word order variation in German. 
For instance, for the French example above we 
write 
perdre V: ADV* \[:la :t~te I :la :boule 
I :les :p6dales \] ; 
To express German verb-front and verb-final 
word order as in 
'dabei wahrt er (immer) den Schein' 
('in this, he always keeps up appearances') 
and 'urn den Schein zu wahren' 
('in order to keep up appearances') 
we write 
\[ wahren Vfin: (kDV* NPnor.) /d)V* :den 
(:schSnen) :Schein I :den (:schSnen) 
:Schein (:zu) wahren \] ; 
1037 
In addition, IDAREX allows the definition of 
macros to capture generalisations on the syntactic 
level. Any position in the macro that we want to 
instantiate differently for each use is indicated by 
a parameter $4. Instantiations of parameters can 
be single words in lexical or surface form, vari- 
ables, operators or other macros. 
For example, instead of explicitly writing the 
complicated RE above, we define a word order 
macro WOVltrg that may be used for all German 
verbal MWLs having no additional idiommxternal 
arguments: 
WOV1Arg: 
\[ $2 Vfin: (ADV* NPnom) ADV* $1 
\[$1 (:zu) $2 v: \] 
In addition, we define auxiliary macros fix(i) 
because we want to instantiate the parameter $1, 
which stands for the lexically fixed components of 
the MWL, with expressions of variable length: 
fix5:$1 $2 $3 $4 $S fix2:$1 $2 etc. 
Using this word order macro, the MWLs den 
(schSnen) Schein wahren and die Ohren spitzen 
('to prick up one's ears', lit. the ears sharpen) can 
now both be expressed very simply according to 
the same schema as 
WOVlArg( fixS(:den ( :schSnen ) 
:Schein) vahren ) 
WOVlhrg( fix2(:die :Ohren) spitzen ) 
Further macros are defined for German for 
MWLs with a reflexive or particle verb, to express 
scrambling of an idiom-external PP complement 
or topicalisation. In French, macros describe for 
example the verb complex for MWLs involving a 
reflexive verb. 
4 Discussion 
NLP treatments of MWLs in so-called high level 
grammar formalisms have for example been pro- 
posed in Abeill6 and Schabes (1989) in the frame- 
work of lexicalised TAGs, Erbaeh (1992) and 
Copestake (1994) in HPSG, Van der Linden 
(1993) in CG. These approaches to our knowl- 
edge cannot satisfactorily represent lexical vari- 
ants, nor the restricted flexibility and modifiabil- 
ity of MWLs. 
Instead of using a high-level grammar formal- 
ism we describe ,MWLs with finite-state local 
grammars. Although finite-state techniques are 
known to be unable to represent all the depen- 
dencies found in natural language, they have the 
advantage of allowing a very efficient treatment of 
a great number of phenomena and the implemen- 
tation of robust, large-scale NLP systems. How- 
ever, the use of these techniques is usually ham- 
pered by the unwieldiness in notation that these 
techniques usually lead to. 
The presented approach overcomes this prob- 
lem: instead of having to specify local grammars 
directly as finite state networks or as graphs (e.g. 
Maurel 1993, Roche 1993, and Silberztein 1993), 
IDAKEX REs provide a convenient way to mix in- 
flected and uninflected word forms, morphological 
features and complete word classes, thus greatly 
relieving lexicographers from the burden of explic- 
itly listing all the possible forms. Furthermore, 
our formalism allows the use of a bigger set of op- 
erators such as contains ($), not (~), and (&), 
etc. This provides us with the possibility to ex- 
press certain things in a very compact way. For 
example, in the definition of German verbs we ex- 
clude contracted forms of verbs and the pronoun 
es such as geht's ('it goes') , we simply state "any 
expression with the morphological feature +V, fol- 
lowed by anything that must not contain a letter 
(i.e. additional morphologicalfeatures), and which 
does not contain the feature +ca in any position" 
define V ~.+V "$Letter k "$Y.+es 
With macros, generahzations about patterns 
that can occur for a whole class of MWLs can be 
expressed. This compactness and flexibility are, 
as far as we know, specific to our approach. 
Encoding the local grammars as REs instead of 
encoding them directly as networks does of course 
not change the expressive power of the formal- 
ism, but it conveniently abstracts the handling of 
MWLs from the graph manipulation level, allow- 
ing to develop and employ devices that operate on 
string representations and map them to the un- 
derlying finite state networks. As we have shown 
above, this simplifies considerably the description 
of the different patterns of variation oecuring in 
MWLs. 
Once the MWLs listed in the dictionary have 
been manually changed into their canonical base 
form, including possible lexical variants and mod- 
ifiers and indicating morphologically flexible com- 
ponents and the scope of alternative components, 
the IDAaEX REs describing all possible contexts 
in which the MWLs can occur can be produced 
automatically. For instance, the canonical forms 
for the examples from section 2 can be specified 
as: 
durchschlagender ° Erfolg ° 
grfine ° Welle (sg) 
perdre ° (ADV)^la t~te/la boule/les p6dales ^ 
T: den Vogel (bei etw) abschieBen ° 
den (schSnen) Schein wahren ° 
1038 
das Handtuch werfen ° 
out of the blue 
um Haaresbreite 
Such canonical base forms, somewhat similar in 
spirit to the notation used in Longman's 'Dic- 
tionary of English Idioms' (1979), do not only 
form the basis for the automatic processing and 
recognition of MWLs. Human users as well would 
profit from a careful description of the variabil- 
ity of MWLs, so it should be worthwhile to also 
include the canonical forms in dictionaries for hu- 
man users. 
The presented approach is successfully used in 
the COMPASS project 3 to represent MWLs in dic- 
tionary databases converted from standard bilin- 
gual dictionaries. The COMPASS system, based 
on the LOCOLEX engine (Bauer, Segond and Zae- 
hen 1995) developed at RXRC, allows look-up of 
words in the dictionary database directly out of 
an on-line text. When the user clicks on an un- 
known word in a foreign language, LOCOLEX eval- 
uates the context of the queried word. Currently, 
the system determines the word's part of speech 
and whether the word is part of an MWL. In the 
latter case, the translation for the entire MWL is 
returned, otherwise a selection of translations for 
the most appropiate part of speech. 
Acknowledgments 
We thank Annie Zaenen, Lauri Kaxttunen, Ted 
Briscoe, and Irene Maxwell for their comments on an 
earlier draft of this paper. 
References 
Abeilld Anne ; and Schabes Yves, (1989). "Pars- 
ing idioms in lexicalized TAGs". Proceedings of 
the 4th EACL, Manchester, UK. 
Bauer Daniel ; Segond Fr~ddrique ; and Zaenen 
Annie, (1995). "LOCOLEX: the translation 
rolls off your tongue". Proceedings of ACH- 
ALLC, Santa Barbara, CA. 
Brundage Jennifer ; Kresse March ;Schwall Ulrike 
; and Storrer Angelika, (1992). "Multiword 
Lexemes: A Monolingual and Contrastive Ty- 
pology for NLP and MT'. IWBS Report 232, 
IBM TR-80.92-029, IBM Deutschland GmbtI, 
Institut fiir Wissensbasierte Systeme, Heidel- 
berg, September. 
Copestake Anne, (1994). "Idioms in general and 
in HPSG". Presentation given at the Work- 
shop 'The Future of the Dictionary', Uriage- 
Les-Balns, France, September 1994. 
Z'Adapting bilingual dictionaries for COMPrehen- 
sion ASSistance', LRE-62-080. 
Engelke Sabine, (1994). Eigenschaften yon 
Phraseolexemen: Eine Uniersuchung zur syn- 
taktischen Variabilildt und internes Modifizier- 
barkeit yon somatischen verbalen Phraseolexe- 
men. Master's Thesis, Universit~it Tiibingen, 
Germany, April. 
Erbach Gregor, (1992). "Head-Driven Lexical 
Representation of Idioms in HPSG". In M. 
Everaert et al., editors, Proceedings of the In- 
ternational Conference on Idioms, Tilburg, NL, 
September. 
Fleischer Wolfgang, (1982). Phraseologie der den- 
tschen Gegenwartssprache. VEB Bibliographis- 
ches Institut, Leipzig, Germany. 
Gross Maurice. (1982). "Use classification de 
phrases fig6es fran~ais". Revue Quebecoise de 
Linguistique, Vol. 11, No. 2. Montreal. 
Karttunen Lanri, (1995). "The Replace Oper- 
ator". Proceedings of the Annual Meeting of 
the Association for Computational Linguistics 
(ACL-95), Boston, MA. 
Karttunen Lanri ; and Yarnpol Todd, (1993). "In- 
teractive Finite-State Calculus". Technical Re- 
port ISTL-NLTT-1993-04-01, Xerox Palo Alto 
Research Center, California. 
van der Linden Erik-Jan, (1993). A Categorial 
Computational Theory of Idioms. OTS Disser- 
tation Series, Utrecht, NL. 
Maurel Denis, (1993). "Passage d'un auto- 
mate avec tables d'acceptablilit~ hun automate 
lexical". In Acres du colloque Informatique 
et langue naturelle, pages 269-279, Nantes, 
France. 
Nunberg Geoffrey ; Wasow Thomas ; and Sag 
Ivan, (1994). "Idioms". Language, 70/3:491- 
538. 
Roche Emmanuel, (1993). Analyse syntaxique 
transformationneile du franfais par transduc- 
tents et lezique-grammaire. Th~e de doctorat, 
Universit~ Paris 7. 
Segond FrSd~rique ; and Breidt Elisabeth, (1995). 
"Description formelle des expressions h mots 
multiples en fran~ais et en allemand duns le 
cadre de la technologic des dtats finis". Lex- 
icomatique el Dictionnairiques, Aetes des IVe 
Journ6es Scientifiques du reseau "Lexicologie, 
Terminologie, Traduction" de I'UREF, Lyon, 
Septembre 1995. 
Segond Fr6d~rique ; and Tapanalnen Past, (1995). 
"Using a finite-state based formalism to identify 
and generate multiword expressions". Technical 
Report MLTT-019, Rank Xerox Research Cen- 
tre, Grenoble, France, July. 
1039 
Silberztein Max, (1993). Dictionnaires dlectro- 
niqnes et analyse automatique de textes - Le 
syst~me INTEX. Masson, Paris, France. 
Tapanainen Pasi, (1994). "RXRC Finite-State 
rule Compiler". Technical Report MLTT-020, 
Rank Xerox Research Centre, Grenoble, France. 
1040 
