Translating Idioms 
Eric Wehrli*
Laboratoire d'analyse et de technologie du langage 
University of Geneva 
wehrli@latl.unige.ch 
Abstract 
This paper discusses the treatment of fixed word 
expressions developed for our ITS-2 French- 
English translation system. This treatment 
makes a clear distinction between compounds 
- i.e. multiword expressions of X°-level in 
which the chunks are adjacent - and idiomatic 
phrases - i.e. multiword expressions of phrasal 
categories, where the chunks are not necessar- 
ily adjacent. In our system, compounds are 
handled during the lexical analysis, while id- 
ioms are treated in the syntax, where they are 
treated as "specialized lexemes". Once recognized,
an idiom can be transferred according to the
specifications of the bilingual dictionary. We will
show several cases of transfer to corresponding
idioms in the target language, or to simple lexemes.
The complete system, including several hundred
compounds and idioms, can be consulted on the Internet
(http://latl.unige.ch/itsweb.html).
1 Introduction 
Multiword expressions (henceforth MWEs) are
known to constitute a serious problem for natural
language processing (NLP) 1. In the case
of translation, a proper treatment of MWEs is
a fundamental requirement, as few customers
would tolerate a literal translation of such common
expressions as entrer en vigueur 'to come
into effect', mettre en oeuvre 'to implement',
faire preuve 'to show' or faire connaissance 'to
meet'.
* I am grateful to Anne Vandeventer, Christopher Laenzlinger
and Thierry Etchegoyhen for helpful comments.
Part of the work described in this paper has been supported
by a grant from CTI (grant no 2673.1).
1 Cf. Abeillé & Schabes (1989), Arnold et al. (1995),
Laporte (1988), Schenk (1995), Stock (1989), among
others.
However, a simple glance at some of the cur- 
rent commercial translation systems shows that 
none of them can be said to handle MWEs in an 
appropriate fashion. As a matter of fact, some
of them explicitly warn their users not to use
multiword expressions.
In this paper, we will first stress some fun- 
damental properties of two classes of MWEs, 
compounds and idioms, and then present the 
treatment of idioms developed for our French- 
English ITS-2 translation system (cf. Ram- 
luckun & Wehrli, 1993). 
2 Compounds and idioms 
A two-way partition of MWEs in (i) compounds 
and (ii) idioms is both convenient and theo- 
retically well-motivated 2. Compounds are de- 
fined as MWEs of X°-level (ie. word level), in 
which the chunks are adjacent, as exemplified in 
(1), while "idiomatic expressions" correspond to 
MWEs of phrasal level, where chunks may not 
be adjacent, and may undergo various syntactic 
operations, as exemplified in (2-3). 
(1)a. pomme de terre 'potato'
b. à cause de 'because of'
c. dès lors que 'as soon as'
The compounds given in (1) function, respec- 
tively, as noun, preposition and conjunction. 
They correspond to a single unit, both syntac- 
tically and semantically. In contrast, idiomatic 
expressions do not generally constitute fixed, 
closed syntactic units. They do, however, behave
as semantic units. For instance the complex
syntactic expression casser du sucre sur le
dos de quelqu'un, literally 'break some sugar on
somebody's back', is essentially synonymous with
'criticize'.
2 This distinction between compounds and idioms is
also discussed in Wehrli (1997).
(2)a. Jean a forcé la main à Luc.
Jean has forced the hand to Luc
'Jean twisted Luc's hand'
b. C'est à Luc que Jean a forcé la main.
It is to Luc that Jean has forced the
hand
'It is Luc's hand that Jean has twisted'
c. C'est à Luc que Paul prétend que Jean
a voulu forcer la main.
It is to Luc that Paul claims that Jean
has wanted to force the hand
'It is Luc's hand that Paul claims that
Jean has wanted to force'
d. La main semble lui avoir été un peu
forcée.
The hand seems to him to have
been a little forced
'His hand seems to have been somewhat
twisted'
The idiom illustrated in (2) is typical of a 
very large class of idioms based on a verbal 
head. Syntactically, such idioms correspond to 
verb phrases, with a fixed direct object argu- 
ment (la main, in our example) and an open 
indirect object argument. Notice that this verb
phrase is completely regular in its syntactic behaviour.
In particular, it can undergo syntactic
operations such as adverbial modification,
raising, passive, dislocation, etc., as exemplified
in (2b-d).
With example (3), we have a much less common
pattern, since the subject argument of
the verb constitutes a chunk of the expression.
Here, again, various operations are possible,
including passive and raising 3.
(3)a. Quelle mouche a piqué Paul?
'What has gotten to Paul?'
b. Quelle mouche semble l'avoir piqué?
'What seems to have gotten to him?'
c. Je me demande par quelle mouche Paul
a été piqué.
'I wonder what's gotten to him'
3 Another interesting example of an idiom with a fixed subject
is la moutarde monte au nez de NP ("NP loses his
temper"), discussed in Abeillé and Schabes (1989).
The extent to which expressions can undergo 
modifications and other syntactic operations 
can vary tremendously from one expression to 
the next, and in the absence of a general explanation
for this fact, each expression must be
recorded with the list of its particular properties
and constraints 4.
Given the categorial distinction (X° vs. XP)
and other fundamental differences sketched 
above, compounds and idioms are treated very 
differently in our system. Compounds are sim- 
ply listed in the lexicon as complex lexical units. 
As such, their identification belongs to the lexi- 
cal analysis component. Once a compound has 
been recognized, its treatment in the ITS-2 sys- 
tem does not differ in any interesting way from 
the treatment of simple words. 
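To illustrate why adjacency makes compounds a natural fit for lexical analysis, the following minimal sketch groups adjacent tokens into compound units by table lookup. The lexicon contents are taken from (1); the function name, the output format and the greedy longest-match strategy are assumptions made for this illustration, not a description of the ITS-2 lexical analyser.

```python
# Hypothetical compound lexicon: adjacent tokens -> (citation form, category).
COMPOUNDS = {
    ("pomme", "de", "terre"): ("pomme de terre", "N"),
    ("à", "cause", "de"): ("à cause de", "P"),
    ("dès", "lors", "que"): ("dès lors que", "C"),
}
MAX_LEN = max(len(key) for key in COMPOUNDS)


def tokenize_with_compounds(tokens):
    """Group adjacent tokens into compounds, preferring the longest match."""
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in COMPOUNDS:
                units.append(COMPOUNDS[key])
                i += n
                break
        else:
            # No compound starts here: keep the simple word, category unknown.
            units.append((tokens[i], None))
            i += 1
    return units


print(tokenize_with_compounds("Il mange une pomme de terre".split()))
```

Because the chunks of a compound are adjacent and fixed, a lookup of this kind needs no syntactic information, which is exactly what fails for idioms such as (2).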
While idiomatic expressions must also be 
listed in the lexicon, their entries are far more 
complex than the ones of simple or compound 
words (cf. section 3.2). As for their identifica- 
tion, it turns out to be a rather complex oper- 
ation, which cannot be reliably carried out at a 
superficial level of representation. As we saw in 
the above examples, idiom chunks can be found 
far away from the (verbal) head with which they 
constitute an expression; they can also be mod- 
ified in various ways, and so on. Preprocessing 
idioms, for instance during the lexical analysis, 
might therefore lead to lengthy, inefficient or un- 
reliable treatments. We will argue that in order 
to drastically simplify the task of identifying id- 
ioms, it is necessary to undo whatever syntac- 
tic operations they might have undergone. To 
put it differently, idioms can best be recognized 
on the basis of a normalized structure, a struc- 
ture in which constituents occur in their canon- 
ical position. In a generative grammar frame- 
work, normalized structures correspond to D- 
structure representations. At that level, for instance,
the four sentences in (2) share the common
structure in (4).
(4) ... [VP forcer [DP la main] [PP à X]]
As we will show in the next section, our treatment
of idiomatic expressions takes advantage of
the drastic normalization process that our GB-based
parser carries out.
4 See for instance Nunberg et al. (1994), Ruwet
(1983), Schenk (1995) or Segond and Breidt (1996) for a
discussion of the degree of flexibility of idioms and (in
the first two) interesting attempts to connect syntactic
flexibility to semantic transparency.
3 A sketch of the translation process 
In this section, we will show how idioms are 
handled in the French-to-English ITS-2 trans- 
lation system, a transfer-based translation sys- 
tem which uses GB-style D-structure represen- 
tations as interface structures. The general ar- 
chitecture of the system is given in figure 1 be- 
low. 
[Figure: diagram with the components Parser, Transfer component, Generator, Lexical Database and Grammar]
Figure 1. Architecture of ITS-2
For concreteness, we shall first focus on the 
epinonymous idiom given in (5): 
(5)a. Paul a cassé sa pipe.
lit. 'Paul has broken his pipe'
b. Paul kicked the bucket.
Translation of (5a) is a three-step process: 
• Identification of source idiom 
• Transfer of idiom 
• Generation of target idiom 
3.1 Idiom identification 
As we argued in the previous section, the task of 
identifying an idiom is best accomplished at the 
abstract level of representation (D-structure). 
ITS-2 uses the IPS parser (cf. Wehrli, 1992, 
1997), which produces the structure (6) for the 
input (5a) 5: 
5 In example (6), we use the following syntactic labels:
TP (Tense Phrase) for sentences, VP for verb phrases,
DP for Determiner Phrases, NP for Noun Phrases, and
PP for Prepositional Phrases.
(6) [TP [DP Paul] [T' a [VP cassé [DP sa
[NP pipe [PP e]i ]]]]]
At this point, the structure is completely gen- 
eral, and does not contain any specification of 
idioms. The idiom recognition procedure is trig- 
gered by the "head of idiom" lexical feature as- 
sociated with the head casser. This feature is 
associated with all lexical items which are heads 
of idioms in the lexical database. 
The task of the recognition procedure is (i) to
retrieve the proper idiom, if any (casser might
be the head of several idioms), and (ii) to verify
that all the constraints associated with that idiom
are satisfied. Idioms are listed in the lexical
database as roughly illustrated in (7) 6:
(7)a. casser sa pipe
'to kick the bucket'
b. 1: [DP ] 2: [V casser] 3: [DP POSS pipe]
c. 1. [+human]
2. [-passive]
3. [+literal, -extraposition]
Idiom entries specify (a) the canonical form 
of the idiom (mostly for reference purposes), (b) 
the syntactic frame with an ordered list of con- 
stituents, and (c) the list of constraints associ- 
ated with each of the constituents. 
In our (rather simple) example, the lexical 
constraints associated with the idiom (7) state 
that the head is a transitive lexeme whose di- 
rect object has the fixed form "POSS pipe", 
where POSS stands for a possessive deter- 
miner coreferential with the external argument 
of the head (i.e. the subject). Furthermore,
the subject constituent bears the feature [+human],
and the head is marked as [-passive], meaning
that this particular idiom cannot be passivized.
Finally, the object is marked [+literal,
-extraposition], which means that the direct
object constituent cannot be modified in
any way (not even pluralized), and cannot be
extraposed.
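The entry format and the constraint check just described can be sketched as follows. This is only an illustration under simplifying assumptions: the classes Constituent and Clause, the dictionary layout, and the function matches are hypothetical names standing in for a normalized structure and a lexical database entry, and the coreference between POSS and the subject is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Constituent:
    words: tuple                        # e.g. ("sa", "pipe")
    features: frozenset = frozenset()   # e.g. frozenset({"+human"})
    extraposed: bool = False


@dataclass
class Clause:
    """Toy stand-in for a normalized (D-structure-like) clause."""
    subject: Constituent
    head: str                           # verb lemma
    obj: Constituent
    passive: bool = False


# Illustrative list of French possessive determiners (not exhaustive).
POSSESSIVES = {"mon", "ma", "mes", "ton", "ta", "tes", "son", "sa", "ses"}

# Entry modelled on (7): slot 1 = subject, 2 = head, 3 = direct object.
CASSER_SA_PIPE = {
    "canonical": "casser sa pipe",
    "frame": {2: "casser", 3: ("POSS", "pipe")},
    "constraints": {
        1: {"+human"},
        2: {"-passive"},
        3: {"+literal", "-extraposition"},
    },
}


def matches(entry, clause):
    """Return True if the clause satisfies the entry's frame and constraints."""
    frame, cons = entry["frame"], entry["constraints"]
    if clause.head != frame[2]:
        return False
    if "+human" in cons[1] and "+human" not in clause.subject.features:
        return False
    if "-passive" in cons[2] and clause.passive:
        return False
    det, noun = frame[3]
    words = clause.obj.words
    if noun not in words:
        return False
    # [+literal]: exactly determiner + fixed noun (no modifier, no plural).
    if "+literal" in cons[3] and (len(words) != 2 or words[1] != noun):
        return False
    if det == "POSS" and words[0] not in POSSESSIVES:
        return False
    if "-extraposition" in cons[3] and clause.obj.extraposed:
        return False
    return True


paul = Constituent(("Paul",), frozenset({"+human"}))
print(matches(CASSER_SA_PIPE, Clause(paul, "casser", Constituent(("sa", "pipe")))))
```

Keeping the constraints as data attached to numbered slots, rather than as code, mirrors the idea that each idiom is recorded with its own list of properties.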
The structure in (6) satisfies all those constraints,
provided that the possessive sa refers
uniquely to Paul 7. It should be noticed that
even though an idiom has been recognized in
sentence (5a), the sentence also has a semantically
well-formed literal meaning. When ITS-2 runs in
interactive mode, the user is asked whether
the sentence should be taken literally or as an
expression. In automatic mode, the idiom reading
takes precedence over the literal interpretation 8.
6 See Walther & Wehrli (1996) for a discussion of the
structure of the lexical database underlying the ITS-2
project.
3.2 Transfer and generation of idioms 
Once properly identified, an idiom will be transferred
like any other abstract lexical unit. In
other words, an entry in our bilingual lexicon
has exactly the same form no matter whether
the correspondence concerns simple lexemes or
idioms. The corresponding target language lexeme
might be a simple or a complex abstract
lexical unit. For instance, our bilingual lexical
database contains, among many others, the following
correspondences:
French                        English
avoir besoin de X             need X
casser sa pipe                kick the bucket
faire la connaissance de X    meet X
avoir envie                   feel like
quelle mouche a piqué         what has gotten
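The uniformity of bilingual entries can be pictured with a small sketch in which transfer is a plain lookup that neither knows nor cares whether either side is an idiom. The correspondences are taken from the table above; the LexicalUnit class, the "kind" labels and the transfer function are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    citation: str   # canonical form of the abstract lexical unit
    kind: str       # "simple" or "idiom" (labels assumed for this sketch)


# Source citation form -> target lexical unit, same shape for every entry.
BILINGUAL = {
    "avoir besoin de X": LexicalUnit("need X", "simple"),
    "casser sa pipe": LexicalUnit("kick the bucket", "idiom"),
    "faire la connaissance de X": LexicalUnit("meet X", "simple"),
}


def transfer(source_citation):
    """Map a recognized source unit to its target unit by direct lookup."""
    try:
        return BILINGUAL[source_citation]
    except KeyError:
        raise LookupError(f"no bilingual entry for {source_citation!r}") from None


print(transfer("casser sa pipe"))
print(transfer("faire la connaissance de X"))
```

The first lookup maps an idiom to an idiom and the second maps an idiom to a simple lexeme, with no special case in the transfer step itself.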
The generation of target language idioms fol- 
lows essentially the same pattern as the gener- 
ation of simple lexemes. The general pattern 
of generation in ITS-2 is the following: first, a 
maximal projection structure (XP) is projected 
on the basis of a lexical head and of the lexical 
specification associated with it. Second, syntactic
operations apply to the resulting structure
(extraposition, passive, etc.), triggered either
by lexical properties or by general features
transferred from the source sentence. For instance,
the lexical feature [+raising] associated
with a predicate would trigger a raising trans- 
formation (NP movement from the embedded 
subject position to the relevant subject posi- 
tion). Subject-Auxiliary inversion, topicaliza- 
tion, auxiliary verb insertion are all examples 
of syntactic transformations triggered by gen- 
eral features, derived from the source sentence. 
7 Given a proper context, the sentence could be construed
with sa referring to some other person, say Bill.
8 Such a heuristic seems to correspond to normal usage,
which would avoid formulation (5a) to state that
'Paul has broken someone's pipe'.
The first step of the generation process pro- 
duces a target language D-structure, while the 
second step derives S-structure representations. 
Finally, a morphological component will de- 
termine the precise orthographical/phonological 
form of each lexical head. 
In the case of target language idioms, the 
general pattern applies with few modifications. 
Step 1 (projection of D-structure) is based on 
the lexical representation of the idiom (which 
specifies the complete syntactic pattern of the 
idiom, as we have pointed out earlier), and produces
structure (8a). Step 2, which only concerns
the insertion of the perfective auxiliary in position
T°, derives the S-structure (8b). Finally,
the morphological component derives sentence (8c).
(8)a. [TP [DP Paul] [VP kick [DP the [NP bucket]]]]
b. [TP [DP Paul] [T' have [VP kick [DP the [NP bucket]]]]]
c. Paul has kicked the bucket.
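The three generation steps for (8) can be traced with a toy sketch that represents trees as nested tuples of the form (label, child, ...). Everything here is an assumption made for illustration: the tuple encoding, the function names, and the two-entry morphology table bear no relation to the actual ITS-2 generator.

```python
# Hypothetical target idiom entry: the full syntactic pattern of the idiom.
KICK_THE_BUCKET = ("VP", "kick", ("DP", "the", ("NP", "bucket")))

# Toy morphology table (assumption): (lemma, form) -> inflected word.
FORMS = {("have", "3sg.pres"): "has", ("kick", "past.part"): "kicked"}


def project(subject, idiom_vp):
    """Step 1: build the D-structure from the idiom's lexical pattern."""
    return ("TP", ("DP", subject), idiom_vp)


def insert_perfect_aux(tp):
    """Step 2: insert the perfective auxiliary, yielding the S-structure."""
    label, subj, vp = tp
    return (label, subj, ("T'", "have", vp))


def leaves(node):
    """Collect the terminal words of a tuple tree, left to right."""
    out = []
    for child in node[1:]:
        out += leaves(child) if isinstance(child, tuple) else [child]
    return out


def spell_out(tp):
    """Step 3: inflect the two verbal heads and linearize the tree."""
    _, (_, subj), (_, aux, (_, verb, obj)) = tp
    words = [subj, FORMS[(aux, "3sg.pres")], FORMS[(verb, "past.part")]]
    return " ".join(words + leaves(obj)) + "."


d_structure = project("Paul", KICK_THE_BUCKET)      # corresponds to (8a)
s_structure = insert_perfect_aux(d_structure)       # corresponds to (8b)
print(spell_out(s_structure))                       # corresponds to (8c)
```

Nothing in these steps is specific to idioms apart from the entry itself, which supplies a whole VP pattern where a simple lexeme would supply only a head.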
4 Conclusion 
In this paper, we have argued for a distinct 
treatment of compounds, viewed as complex 
lexical units of X°-level category, and of idioms, 
which are phrasal constructs. While compounds 
can be easily processed during the lexical anal- 
ysis, idiomatic expressions are best handled at 
a more abstract level of representation, in our 
case, the D-structure level produced by the 
parser. The task of recognition must be based 
on a detailed formal description of each idiom, 
a lengthy, sometimes tedious but unavoidable 
task. We have then shown that, once properly
identified, idioms can be transferred like any
other abstract lexical unit. Finally, given the
fully-specified lexical description of idioms, gen- 
eration of idiomatic expressions can be achieved 
without ad hoc machinery. 

References 
Abeillé, A. and Schabes, Y. (1989). "Parsing
Idioms in lexicalized TAGs", Proceedings
of EACL-89, Manchester, 1-9.
Arnold, D., Balkan, L., Lee Humphrey, R., Mei- 
jer, S., Sadler, L. (1995). Machine Transla- 
tion: An Introductory Guide, HTML doc- 
ument (http://clwww.essex.ac.uk). 
Laporte, E. (1988). "Reconnaissance des expressions
figées lors de l'analyse automatique",
Langages 90, Larousse, Paris.
Nunberg, G., Sag, I., Wasow, T. (1994). "Id- 
ioms", Language, 70:3,491-538. 
Ramluckun, M. and Wehrli, E. (1993). "ITS-2 :
an interactive personal translation system",
Actes du colloque de l'EACL, 476-477.
Ruwet, N. (1983). "Du bon Usage des Expressions
Idiomatiques dans l'argumentation en
syntaxe générative". In Revue québécoise
de linguistique. 13:1.
Schenk, A. (1995). 'The Syntactic Behavior 
of Idioms'. In Everaert M., van der Lin- 
den E., Schenk, A., Schreuder, R. Idioms: 
Structural and Psychological Perspectives, 
Lawrence Erlbaum Associates, Hove. 
Segond, D., and E. Breidt (1996). "IDAREX :
description formelle des expressions à mots
multiples en français et en allemand" in A.
Clas, Ph. Thoiron and H. Béjoint (eds.)
Lexicomatique et dictionnairiques, Montreal,
Aupelf-Uref.
Stock, O. (1989). "Parsing with Flexibility,
Dynamic Strategies, and Idioms in Mind",
Computational Linguistics, 15.1, 1-18.
Wehrli, E. (1992) "The IPS system", in C. Boitet
(ed.) COLING-92, 870-874.
Wehrli, E. (1997) L'analyse syntaxique des
langues naturelles : problèmes et méthodes,
Paris, Masson.
Walther, C., and E. Wehrli (1996) "Une base
de données lexicale multilingue interactive"
in A. Clas, P. Thoiron et H. Béjoint (eds.)
Lexicomatique et dictionnairiques, Montreal,
Aupelf-Uref, 327-336.
