Parsing Idioms in Lexicalized TAGs * 
Anne Abeill~ and Yves Schabes 
Laboratoire Automatique Documentaire et Linguistique 
University Paris 7, 2 place Jussieu, 75005 Paris France 
and Department of Computer and Information Science 
University of Pennsylvania, Philadelphia PA 19104-6389 USA 
abeille/schabes~linc.cis.upenn.edu 
ABSTRACT 
We show how idioms can be parsed in lexieal- 
ized TAGs. We rely on extensive studies of frozen 
phrases pursued at L.A.D.L) that show that id- 
ioms are pervasive in natural language and obey, 
generally speaking, the same morphological and 
syntactical patterns as 'free' structures. By id- 
iom we mean a structure in which some items are 
lexically frozen and have a semantics that is not 
compositional. We thus consider idioms of differ- 
ent syntactic categories : NP, S, adverbials, com- 
pound prepositions.., in both English and French. 
In lexicalized TAGs, the same grammar is used 
for idioms as for 'free' sentences. We assign 
them regular syntactic structures while represent- 
ing them semantically as one non-compositional 
entry. Syntactic transformations and insertion of 
modifiers may thus apply to them as to any 'free' 
structures. Unlike previous approaches, their vari- 
ability becomes the general case and their being 
totally frozen the exception. Idioms are gener- 
ally represented by extended elementary trees with 
'heads' made out of several items (that need not 
be contiguous) with one of the items serving as an 
index. When an idiomatic tree is selected by this 
index, lexical items are attached to some nodes in 
the tree. Idiomatic trees are selected by a single 
head node however the head value imposes lexical 
values on other nodes in the tree. This operation 
of attaching the head item of an idiom and its 
lexical parts is called lexical attachment. The 
• resulting tree has the lexical items corresponding 
to the pieces of the idiom already attached to it. 
*This work is partiMly supported (for the second au- 
thor) by ARO grant DAA29-84-9-007, DARPA grant 
N0014-85-K0018, NSF grants MCS-82-191169 and DCR- 
84-10413. We have benefitted immensely from our discus- 
sions with Aravind Joshi, Maurice Gross and Mitch Mar- 
cus. We want also to thank Kathleen Bishop, and Sharon 
Cote. 
1Laboratoire d'Automatique Documentaire et Linguis- 
tique, University of Paris 7. 
We generalize the parsing strategy defined for 
lexicalized TAG to the case of 'heads' made out 
of several items. We propose to parse idioms in 
two steps which are merged in the two steps pars- 
ing strategy that is defined for 'free' sentences. 
The first step performed during the lexical pass 
selects trees corresponding to the literal and id- 
iomatic interpretation. However it is not always 
the case that the idiomatic trees are selected as 
possible candidates. We require that all basic 
pieces building the minimal idiomatic expression 
must be present in the input string (with possibly 
some order constraints). This condition is a nec- 
essary condition for the idiomatic reading but of 
course it is not sufficient. The second step per- 
forms the syntax analysis as in the usual case. 
During the second step, idiomatic reading might 
be rejected. Idioms are thus parsed as any 'free' 
sentences. Except during the selection process, 
idioms do not require any special parsing mech- 
anism. We are also able to account for cases of 
ambiguity between idiomatic and literal interpre- 
tations. 
Factoring recursion from dependencies in TAGs 
allows discontinuous constituents to be parsed in 
an elegant way. We also show how regular 'trans- 
formations' are taken into account by the parser. 
Topics: Parsing, Idioms. 
1 Introduction to Tree Ad- 
joining Grammars 
Tree Adjoining Grammars (TAGs) were intro- 
duced by Joshi et al. 1975 and Joshi 1985 as 
a formalism for linguistic description. Their lin- 
guistic relevance was shown by Kroch and Joshi 
1985 and Abeill@ 1988. A lexicalized version of the 
formalism was presented in Schabes, Abeill~ and 
Joshi 1988 that makes them attractive for writing 
computational grammars. They were proved to be 
-1- 
parsable in polynomial time (worst case) by Vijay 
Shanker and Joshi 1985 and an Earley-type parser 
was presented by Schabes and Joshi 1988. 
The basic component of a TAG is a finite set 
of elementary trees that have two types: initial 
trees or auxiliary trees (See Figure 1). Both are 
minimal (but complete) linguistic structures and 
have at least one terminal at their frontier (that is 
their 'head'). Auxiliary trees are also constrained 
to have exactly one leaf node labeled with a non- 
terminal of the same category as their root node. 
lnlti*l 
x 
t. 
substitution nodes 
× /x\ 
/ 3 
Figure 1: Schematic initial and auxiliary trees 
Sentences of the language of a TAG are derived 
from the composition of an S-rooted initial tree 
with elementary trees by two operations: substi- 
tution or adjunction. 
Substitution inserts an initial tree (or a tree de- 
rived from an initial tree) at a leaf node bearing 
the same label in an elementary tree (See Fig- 
ure 2). 2 It is the operation used by CFGs. 
a._ v 
/\ 
Figure 2: Mechanism of substitution 
Adjunction is a more powerful operation: it in- 
serts an auxiliary tree at one of the corresponding 
node of an elementary tree (See Figure 3).3 
TAGs are more powerful than CFGs but only 
mildly so (Joshi 1983). Most of the linguistic ad- 
vantages of the formalism come from the fact that 
it factors recursion from dependencies. Kroch and 
Joshi 1985 show how unbounded dependencies can 
be 'localized' by having filler and gap as part of 
21 is the mark for substitution. 
SAt each node of an elementary tree, there is a feature 
structure associated with it (Vijayshanker and Joshi, 1988). 
Adjunction constraints can be defined in terms of feature 
structures and the success or failure of unification. 
(¢~) (8) 
Figure 3: Adjoining 
the same elementary tree and having insertion of 
matrix clauses provided by recursive adjunctions. 
Another interesting property of the formalism is 
its extended domain of locality, as compared to 
that of usual phrase structure rules in CFG. This 
was used by Abeill~ 1988 to account for the prop- 
erties of 'light' verb (often called 'support' verb for 
Romance languages) constructions with only one 
basic structure (instead of the double analysis or 
reanalysis usually proposed). 
We now define by an example the notion of 
derivation in a TAG. 
Take for example the derived tree in Figure 4. 
S 
Ad S 
yesterday NP VP 
A A D N V NP 
li I I 
a MaN saw N 
I 
Figure 4: Derived tree for: yesterday a man saw 
Mary 
It has been built with the elementary trees in 
Figure 5. 
s 
A 
S NP NPo$ VP 
A A A 
Ad S D D,\[, N V NP~,I, 
i I I I 
yesterday a man saw 
~adS\[yesterday\] c,D\[a\] ~NPdn\[man\] c~tnl\[saw\] 
NP 
I 
N 
I Mary 
aNPn\[Mary\] 
Figure 5: Some elementary trees 
Unlike CFGs, from the tree obtained by deriva- 
-2- 
tion (called the derived tree) it is not always pos- 
sible to know how it was constructed. The deriva- 
tion tree is an object that specifies uniquely how 
a derived tree was constructed. 
The root of the derivation tree is labeled by an 
S-type initial tree. All other nodes in the deriva- 
tion tree are labeled by auxiliary trees in the case 
of adjunction or initial trees in the case of sub- 
stitution. A tree address is associated with each 
node (except the root node) in the derivation tree. 
This tree address is the address of the node in the 
parent tree to which the adjunction or substitu- 
tion has been performed. We use the following 
convention: trees that are adjoined to their par- 
ent tree are linked by an unbroken line to their 
parent, and trees that are substituted are linked 
by dashed lines. 
The derivation tree in Figure 6 specifies how the 
derived tree was obtained: 
atnlIsaw\] 
~Pdn\[m~l (1) ~II~\[M~'yl (2.2) I~adS\[yesterday\] (0) ,, 
! 
aD\[al (11 
Figure 6: Derivation tree for Yesterday a man saw 
Mary 
aD\[a\] is substituted in the tree aNPdn\[man\] at 
node of address 1, aNPdn\[man\] is substituted in 
the tree atnl\[saw\] at address 1, aNPn\[Mary\] is 
substituted in the tree atnl\[saw\] at node 2.2 and 
the tree \[3adS\[yesterday\] is adjoined in the tree 
atnl\[saw\] at node 0. 
In a 'lexicalized' TAG, the 'category' of each 
word in the lexicon is in fact the tree structure(s) 
it selects. 4 Elementary trees that can be linked by 
a syntactic or a lexical rule are gathered in a Tree 
Family, that is selected as a whole by the head 
of the structure. A novel parsing strategy follows 
(Schabes, Abeill~, :loshi 1988). In a first step, the 
parser scans the input string and selects the dif- 
ferent tree structures associated with the lexical 
items of the string by looking up the lexicon. In 
a second step, these structures are combined to- 
gether to produce a sentence. Thus the parser uses 
only a subset of the entire (lexicalized) grammar. 
4The nodes of the tree structures have feature structures 
associated with them, see footnote 3. 
2 Linguistic Properties of Id- 
ioms 
Idioms have been at stake in many linguistic dis- 
cussions since the early transformational gram- 
mars, but no exhaustive work based on exten- 
sive listings of idioms have been pursued before 
Gross 1982. We rely on L.A.D.L.'s work for French 
that studied 8000 frozen sentences, 20, 000 frozen 
nouns and 6000 frozen adverbs. For English, we 
made use of Freckelton's thesis (1984) that listed 
more than 3000 sentential idioms. They show 
that, for a given structure, idiomatic phrases are 
usually more numerous in the language than 'free' 
ones. As is well known, idioms are made of the 
same lexicon and consist of the same sequences of 
categories as 'free' structures. An interesting ex- 
ception is the case of 'words' existing only as part 
of an idiomatic phrase, such as escampette in pren- 
dre la poudre d'escampette (to leave furtively) or 
umbrage in to take umbrage at NP. 
The specificity of idioms is their semantic non- 
compositionality. The meaning of casser sa pipe 
(to die), cannot be derived from that of casser (to 
break) and that of pipe (pipe). They behave se- 
mantically as one predicate, and for example the 
whole VP casser sa pipe selects the subject of the 
sentence and all possible modifiers. We therefore 
consider an idiom as one entity in the lexicon. 
It would not make sense to have its parts listed in 
the lexicon as regular categories and to have spe- 
cial rules to limit their distribution to this unique 
context. If they are already listed in the lexi- 
con, these existing entries are considered as mere 
homonyms. Furthermore, usually idioms are am- 
biguous between literal and idiomatic read- 
ings. 
Idioms do not appear necessarily as con- 
tlnuous strings in texts. As shown by M. Gross 
for French and P. Freckelton for English, more 
than 15% of sentential idioms are made up of un- 
bounded arguments, (e.g. NPo prendre NP1 en 
compte, NPo take NP1 into account, Butter would 
not melt in NP's mouth). Discontinuities can also 
come from the regular application of syntactic 
rules. For example, interposition of adverbs be- 
tween verb and object in compound V-NP phrases, 
and interposition of modals or auxiliaries between 
subject and verb in compound NP-V phrases are 
very general (Laporte 1988). 
As shown by Gazdar et al. 1985 for English, 
and Gross 1982 for French, most sentential id- 
ioms are not completely frozen and 'transfor- 
mations' apply to them much more regularly 
-3- 
than is usually thought. Freckelton 1984's list- 
ings of idiomatic sentences exhibit passivization 
for about 50% of the idioms comprised of a verb 
(different from be and have) and a frozen direct 
argument. Looking at a representative sample of 
2000 idiomatic sentences with frozen objects (from 
Gross's listings at LADL) yields similar results for 
passivization and relativization of the frozen argu- 
ment for French. This is usually considered a prob- 
lem for parsing, since the order in which the frozen 
elements of an idiom appear might thus vary. 
Recognizing idioms is thus dependent on the 
whole syntactic analysis and it is not realistic to 
reanalyze them as simple categories in a prepro- 
cessing step. 
3 Representing Idioms in 
Lexicalized TAGs 
We represent idioms with the same elementary 
trees as 'free' structures. The values of the argu- 
ments of trees that correspond to a literal expres- 
sion are introduced via syntactic categories and 
semantic features. However, the values of argu- 
ments of trees that correspond to an idiomatic 
expression are not only introduced via syntactic 
categories and semantic features but also directly 
specified. 
3.1 Extended Elementary Trees 
Some idioms select the same elementary tree struc- 
tures as 'free' sentences. For example, a sentential 
idiom with a frozen subject il/aut S1 selects the 
same tree family as any verb taking a sentential 
complement (ex: NP0 dit $1), except that ii is 
directly attached in subject position, whereas a 
'free' NP is inserted in NPo in the case of 'dit' 
(See Figure 7). 
S S 
NP0 VP NP0$ VP 
IA A 
il V Sl V $1 
I I 
faut dit 
Figure 7: trees for il faut and dit 
Usually idioms require elementary trees that are 
more expanded. Take now as another example 
the sentential idiom N Po kicked the bucket. The 
corresponding tree must be expanded up to the 
D1 and N1 level, the (resp. bucket) is directly 
attached to the D1 (resp. N1) node (See Figure 8). 
S /N 
NPo~ VP 
v Nil 
kicked D1 NI 
I I 
the bucket 
Figure 8: Tree for N Po kicked the bucket 
3.2 Multicomponent Heads 
In the lexicon, idiomatic trees are represented by 
specifying the elements of the idiom. An idiom 
as NPo kicked the bucket is indexed by a 'head' 
(kicked) which specifies the other pieces of the id- 
iom. Although the idiom is indexed by one item, 
the pieces are considered as its multicomponent 
heads.5 
We have, among others, the following entries in 
the lexicon: 6 
kicked , V : Tnl (transitive verb) (a) 
kicked , V : Tdnl\[D1 = the, N1 = bucket\] (idiom) (b) 
the , D : aD (e) 
bucket , N : aNPdn (d) 
John , N : aNP (e) 
The trees aNPdn and aNPn are: 7 
NP NP I (aNPn) A (aNPdn) 
NO D$ NO 
Among other trees, the tree atnl is in the family 
Tnl and the tree atdnl is in the family Tdnl: 
S 
S NPo$ VP A A 
NP0J, VP (c~tnl) V0 NPI 
V0 NPIJ, DiS N15 
(atdnl) 
5The choice of the item under which the idiom is indexed 
is most of the time arbitrary. 
eThe lexical entries are simplified to just illustrate how 
idiom are handled. 
ro marks the node under which the head is attached. 
-4- 
NP NP 
I I I 
John the bucket 
(aNPn\[John\]) (aD\[the\]) (aNPdn\[bucket\]) 
S 
A A NPo$ VP 
NPo$ VP A V NP1 
V NPI$ kicked DI N1 
I I I 
kicked the bucket 
(atnl \[kicked\])(atdnl \[kicked-the-bucket\]) 
Figure 9: Trees selected for the input 
John kicked the bucket 
Suppose that the input sentence is John kicked 
the bucket. The first entry for kicked (a) speci- 
fies that kicked can be attached under the V node 
in the tree atdnl (See the tree c~tnl\[kicked\] in 
Figure 9). However the second entry for kicked 
(b) specifies that kicked can be attached under 
the V node and that the must be attached un- 
der the node labeled by D1 and that bucket must 
be attached under the node labeled N1 in the 
tree atnl (See the tree atdnl\[kicked-the-bucket\] 
in Figure 9). 
In the first pass, the trees in Figure 9 are be 
selected (among others). 
Some idioms allow some lexical variation, usu- 
ally between a more familiar and a regular use of 
the same idiom, for example in French NPo per. 
dre la t~te and NPo perdre ia boule (to get mad). 
This is represented by allowing disjunction on the 
string that gets directly attached at a certain posi- 
tion in the idiomatic tree. NPo perdre ia t~te/boule 
will thus be one entry in the lexicon, and we do 
not have to specify that t~te and boule are synony- 
mous (and restrict this synonymy to hold only for 
this context). 
3.3 Selection of Idiomatic Trees 
We now explain how the first pass of the parser 
is modified to select the appropriate possible can- 
didates for idiomatic readings. Take the previ- 
ous example, John kicked the bucket. The verb 
kicked will select the tree atdnl \[kicked-the-bucket\] 
for an idiomatic reading. However, the values of 
the determiner and the noun of the object noun 
phrase are imposed to be respectively the and 
bucket. The determiner and the noun are at- 
tached to the tree atdnl\[kicked-the-bucket\], how- 
ever the tree atdnl\[kicked-the-bucket\] is selected 
if the words kicked, the and bucket appear in the 
input string at position compatible with the tree 
atrial\[kicked-the-bucket\]. Therefore they must re- 
spectively appear in the input string at some po- 
sition i, j and k such that i < j < k. If it is not 
the case, the tree atdnl\[kicked-the-bucket\] is not 
selected. This process is called lexical attach- 
ment. 
For example the word kicked in the fol- 
lowing sentences will select the idiomatic tree 
atdn 1 \[kicked-the-bucket\]: 
John kicked the bucket (sl) 
John kicked the proverbial bucket (sP) 
John kicked the man who was 
carrying the bucket (s3) 
The parser will accept sentences sl and sP as id- 
iomatic reading but not the sentence s3 since the 
tree atdnl\[kicked-the-bucket\] will fail in the parse. 
In the following sentence the word kicked will not 
select the idiomatic tree atdnl\[kicked-the-bucket\]: 
John kicked Mark (s4) 
John kicked a bucket (sS) 
John who was carrying a bucket 
kicked the child (s6) 
What did John kick? (sT) 
This test cuts down the number of idiomatic 
trees that are given to the parser as possible can- 
didates. Thus a lot of idioms are ruled out before 
starting the syntactic analysis because we know 
all the lexical items at the end of the first pass. 
This is important because a given item (e.g. a 
verb) can be the head of a large number of idioms 
(Gross 82 has listed more than 50 of them for the 
verb manger, and prendre or avoir yield thousands 
of them). However, as sentence s3 illustrates, the 
test is not sufficient. 
What TAGs allow us to do is to define mul- 
ticomponent heads for idiomatic structures with- 
out requiring their being contiguous in the input 
string. The formalism also allows us to access 
directly the different elements of the compound 
without flattening the structure. As opposed to 
CFGs, for example, direct dependencies can be 
expressed between arguments that are at differ- 
ent levels of depth in the tree without having to 
pass features across local domains. For example, 
in NPo rider DET sac (to express all of one's se- 
-5- 
,~" 2' 
cret thoughts), the determiner of the object sac 
has to be a possessive and agree in person with 
the subject : je vide mon sac, tu rides ton sac... 
In NPo dire DET quatre veritds a NP2 (to tell 
someone what he really is), the determiner of the 
object veritds has to be a possessive and agree in 
person with the second object NP2 : je te dis tes 
quatre veritds, je lui dis ses quatre verit~s. 
4 Literal and Idiomatic 
Readings 
Our representation expresses correctly that id- 
ioms are semantically non-compositional. Trees 
obtained by lexical attachment of several lexical 
items act as one syntactic unit and also one se- 
mantic unit. 
For example, the sentence John kicked the 
bucket can be parsed in two different ways. One 
derivation is built with the trees: atnl\[kicked\] 
(transitive verb), aNPn\[John\], aD\[the\] and 
aNPn\[bucket\] . It corresponds to the literal in- 
terpretation; the other derivation is built with the 
trees: atdnl\[kicked-the-bucket\] (idiomatic tree) 
and aNPn\[John\] (John): 
c~tnl\[ kicked\] 
oNPn\[Johnl (1) oaNPdn\[bucketl (2.2) 
ctD\[ the\] (1) 
literal derivation 
However, both derivations have the same de- 
rived tree: 
sg 
atdnl\[kicket- the- bucket\] 
! 
I ! 
~NI~\[ John\] (1) 
idiomatic derivation 
NP VP 
N V NP 
John kicked D N 
I I 
the bucket 
The meaning of kicked the bucket in its idiomatic 
reading cannot be derived from that of kicked and 
the bucket. However, by allowing arguments to be 
inserted by substitution or adjunction (in for ex- 
ample atdnl \[kicked-the-bucket\]), we represent the 
fact that NPo kicked the bucket acts as a syntactic 
and semantic unit expecting one argument NPo. 
Similarly, NPo kicked NP1 in atnl\[kicked\] acts as 
a syntactic and semantic unit expecting two argu- 
ments NPo and NP1. This fact is reflected in the 
two derivation trees of John kicked the bucket. 
However, the sentential idiom 'il fant $1', is not 
parsed as ambiguous, since faut has only one en- 
try (that is idiomatic) in the lexicon. When a 
certain item does not exist except in a specific 
idiom, for example umbrage in English, the cor- 
responding idiom to take umbrage of NP will not 
be parsed as ambiguous. The same holds when 
a item selects a construction only in an idiomatic 
expression. Aller, for example, takes an obligatory 
PP (or adverbial) argument in its non-idiomatic 
sense. Thus the idiom: 
aller son train (to follow one's way) 
is not parsed as ambiguous since there is no free 
NPo aller NP1 structure in the lexicon. 
We also have ambiguities for compound nom- 
inals such as carte bleue, meaning either credit 
card (idiomatic) or blue card (literal), and for com- 
pound adverbials like on a dime: John stopped on 
a dime will mean either that he stopped in a con- 
trolled way or on a 10 cent coin. 
Structures for literal and idiomatic readings are 
both selected by the parser in the first step. Since 
syntax and semantics are processed at the same 
time, the sentence is analyzed as ambiguous be- 
tween literal and idiomatic interpretations. The 
derived trees are the same but the derivation trees 
are different. For example, the adjective bleue se- 
lects an auxiliary tree that is adjoined to carte in 
the literal derivation tree, whereas it is directly 
attached in a complex initial tree in the case of 
idiomatic interpretation. 
All frozen elements of the idiom are directly 
attached in the corresponding elementary trees, 
and do not have to exist in the lexicon. They 
are thus distinguished from 'free' arguments that 
select their own trees (and their own semantics) 
to be substituted in a standard sentential tree. 
Therefore we distinguish two kinds of semantic op- 
erations: substitution (or adjunction) corresponds 
to a compositional semantics; direct attachment, 
on the other hand, makes different items behave 
as one semantic unit. 
One should notice that non-idiomatic readings 
are not necessarily literal readings. Since feature 
structures are used for selectional restrictions of 
arguments, metaphoric readings can be taken into 
account (Bishop, Cote and Abeill~ 1989). 
We are able to handle different kinds of seman- 
tic non-compositionality, and we do not treat as 
idiomatic all cases of non-literal readings. 
-6- 
s 
A 
NP0$ VP 
V NPI~, PP2/VA 
I A 
takes P2 NP2NA 
I I 
into N2/VA 
I account 
Figure 10: Tree for NPo takes NP1 into account 
NPo VP 
No V NPI 
I A A 
Jean Aux V Dt N1 
I I I I 
a casse sa pipe 
literal 
S 
NP o VP 
No V NPINA 
I A A 
Jean Aux V D t NINA 
I I I 1 
a casse sa pipe 
idiom 
Figure 11: Jean a cassg sa pipe 
5 Recognizing 
Discontinuous Idioms 
Parsing flexible idioms has received only partial 
solutions so far (Stock 1987, Laporte 1988). Since 
TAGs factor recursion from dependencies, discon- 
tinuities are captured straightforwardly without 
special devices (as opposed to Johnson 1985 or 
Bunt et al. 1987). We distinguish two kinds of dis- 
continuities: discontinuities that come from inter- 
nal structures and discontinuities that come from 
the insertion of modifiers. 
5.1 Internal Discontinuities 
Some idioms are internally discontinuous. Take for 
example the idioms NPo prendre NP1 en compte 
and NPo takes NP1 into account (see Figure 10). s 
The discontinuity is handled simply by argu- 
ments (here NPo and NP1) to be substituted 
(or adjoined in some cases) as any free sentences. 
The internal structures of arguments can be un- 
bounded. 
5.2 Recursive Insertions of Modi- 
fiers 
Some adjunctions of modifiers may be ruled out 
in idioms or some new ones may be valid only 
in idioms. If the sentence is possibly ambiguous 
between idiomatic and literal reading, the adjunc- 
tion of such modifiers force the literal interpre- 
tation. For example, in NPo casser sa pipe (to 
die) , the NP1 node in the idiomatic tree bears a 
null adjunction constraint (NA). The sentence H a 
cassd sa pipe en bois (he broke his wooden pipe) is 
SNA expresses the fact that the node has null adjunction 
constraint 
then parsed as non-idiomatic. This NA constraint 
will be the only difference between the two derived 
trees (See Figure 11): Jean a cass~ sa pipe (literal) 
and Jean a cassg sa pipe (idiomatic). 
But most idioms allow modifiers to be inserted 
in them. Each modifier can be unbounded (e.g. 
with embedded adjunct clauses) and their inser- 
tion is recursive. We treat these insertion by ad- 
junction of modifiers in the idiomatic tree. How- 
ever constraint of adjunction and feature structure 
constraints filter out partially or totally the inser- 
tion of modifiers at each node of an idiomatic tree. 
In a TAG, the internal structure of idioms is spec- 
ified in terms of a tree, and we can get a unified 
representation for such compound adverbials as 
la limite and ~ l' extreme limite (if there is no other 
way) or such complex determiners as a bunch of 
(or ia majoritd de NP ) and a whole bunch of NP 
(resp. la grande majoritd de NP) that will not have 
to be listed as separate entries in the lexicon. The 
adjective whole (resp. grande) adjoins to the noun 
bunch (resp. majoritd ), as to any noun. Take a 
bunch of NP. The adjective whole adjoins to the 
noun bunch as to any noun (See Figure 12) and 
builds a whole bunch of. 
In order to have a modifier with the right fea- 
tures adjoining at a certain node in the idiom, we 
associate some features with the head of the id- 
iom (as for heads of 'free' structures) but also with 
elements of the idiom that are directly attached. 
Unification equations, such as those constraining 
agreement, are the same for trees selected by id- 
ioms and trees selected by 'free' structures. Thus 
only grande that is feminine singular, and not 
grand for example, can adjoin to majorit~ that 
is feminine singular. In il falloir NP, the frozen 
subject il is marked 3rd person singular, and only 
an auxiliary like va (that is 3rd person singular) 
and not vont (3rd person plural) will be allowed 
CP---C -7- 
\ 
NP 
D N PP 
\[ I A 
a bunch P NP 
I 
of 
N 
A AN 
\[ 
whole 
NP 
D N PP 
by adjunction: \] ~ A 
a A N PNP 
I I I 
whole bunch of 
Figure 12: Trees for a whole bunch of 
to adjoin to the VP: il va falloir $1 and not il vont 
falloir $1. 
As another example, an idiom such as la 
moutarde monte au nez de NP (NP looses his tem- 
per) can be represented as contiguous in the ele- 
mentary tree. Adjunction takes place at any inter- 
nal node without breaking the semantic unity of 
the idiom. For example, an adjunct clause headed 
by anssit6t can adjoin between the frozen subject 
and the rest of the the idiom in la moutarde mon- 
ter au nez de NP2 : la montarde, aussitSt que 
Marie enlra, monta an nez de Max (Max, as soon 
as Marie got in, lost his temper). Similarly, aux- 
iliaries adjoin between frozen subjects and verbs 
as they do to 'free' VPs: There might have been 
a boz on the table is parsed as being derived from 
the idiom : there be NP1 P NP2. 
It should be noted that when a modifier adjoins 
to an interior node of an idiom, there is a semantic 
composition between the semantics of the modi- 
fier and that of the idiom as a whole, no matter 
at which interior node the adjunction takes place. 
For example, in John kicked the proverbial bucket 
semantic composition happens between the 3 units 
John, kick-the-bucket, and proverbial. 9 Semantic 
composition will be done the same way if an ad- 
junct clause were adjoined into the VP. In John 
kicked the bucket, as the proverb says, composi- 
tion will happen between John, kick-the.bucket, 
and the adjunct clause considered as one predi- 
cate as-proverb-say: 
9This is the case of a modifier where adjoining is valid 
only for the idiom. 
Therefore parsing flexible idioms is reduced to 
the general parsing of TAGs (Schabes and Joshi 
1988). 
6 Tree Families and Appli- 
cation of 'Transformations' 
to Idioms 
As in the case of predicates in lexicalized TAGs, 
sentential idioms are represented as selecting a set 
of elementary trees and not only one tree. These 
tree families gather all elementary trees that are 
possible syntactic realizations of a given argument 
structure. The family for transitive verbs, for ex- 
ample, is comprised of trees for wh-question on the 
subject, wh-question on the object, relativization 
on the subject, relativization on the object, and so 
on. In the first pass, the parser loads all the trees 
in the tree family corresponding to an item in the 
input string (unless certain trees in that family do 
not match with the feature of the head in the input 
string). 
The same tree families are used with idioms. 
However some trees in a family might be ruled 
out by an idiom if it does not satisfy one of the 
three following requirements. 
First, the tree must have slots in which the 
pieces of the idiom can be attached. I° If one 
distinguishes syntactic rules that keep the lexical 
value of an argument in a sentence (e.g. topical- 
ization, cleft extraction, relativization...), and syn- 
tactic rules that do not (deleting the node for that 
argument, or replacing it by a pronoun or a wh- 
element; e.g.: wh-question, pronominalization), it 
can be shown that usually only the former applies 
to frozen elements of an idiom. If you take the id- 
iom bruler nn fen (to run a (red) light), relativiza- 
tion and cleft extraction, but not wh-question, are 
possible on the noun fen, with the idiomatic read- 
ing: 
Le fen que Jean a brulg. 
C'est nn fen que Jean a brulg. 
• Que brule Jean ? 
Second, if all the pieces of an idiom can be at- 
tached in a tree, the order imposed by the tree 
must match with the order in which the pieces ap- 
pear in the input string. Thus, if enfant appears 
before attendre in the input string, the hypothe- 
sis for an idiomatic reading will be made but only 
the trees corresponding to relativization, cleft ex- 
lOTllis requirement is independent of the input string. 
-8- 
traction, topicalization in which enfant is required 
to appear before attendre will be selected. But if 
the string enfant is not present at all ih the input 
string, the idiomatic reading will not be hypoth- 
esized, and trees corresponding to qui attend-elle 
will never be selected as part of the family of the 
idiom attendre nn enfant. 
Third, the features of the heads of an idiom 
must unify with those imposed on the tree (as 
for 'free' sentences). For example, it has to be 
specified that bncket in to kick the bucket does not 
undergo relativization nor passivization, whereas 
tabs in to keep tabs on NP does. It is well known 
that even for 'free' sentences application of the 
passive, for example, has somehow to be speci- 
fied for each transitive verbs since there are lexical 
idiosyncrasies, aa The semantics of the passive tabs 
were kept on NP by NP is exactly the same as that 
of the active NP keep tabs on NP, since different 
trees in the same tree families are considered as 
(semantically) synonymous. 
7 Conclusion 
We have shown how idioms can be processed in 
lexicalized TAGs. We can access simultaneously 
frozen elements at different levels of depths where 
CFGs would either have to flatten the idiomatic 
structure (and lose the possibility of regular in- 
sertion of modifiers) or to use specific devices to 
check the presence of an idiom. We can also put 
sentential idioms in the same grammar as free 
sentences. The two pass parsing strategy we use 
combining with an operation of direct attachment 
of lexical items in idiomatic trees, enables us to 
cut down the number of idiomatic trees that the 
parser takes as possible candidates. We easily get 
possibly idiomatic and literal reading for a given 
sentence. The only distinctive property of idioms 
is the non-compositional semantics of their frozen 
constituents. The extended domain of locality of 
TAGs allows the two problems of internal discon- 
tinuity and of unbounded interpositions to be han- 
dled in a nice way. 
References 
Abeill6, Anne, 1988. Parsing French with Tree Adjoining 
Grammar: some Linguistic Accounts. In Proceedings of the 
12 th International Conference on Computational Linguis- 
tics (Coling'88). Budapest. 
alUnless one thinks that some regularity might show up 
if one distinguishes different kinds of direct complements 
with thematic roles. 
Bishop, Kathleen M.; Cote, Sharon; and Abeill6, Anne, 
1989. A Lezicalized Tree Adjoining Grammar for English. 
Technical Report, Department of Computer and Informa- 
tion Science, University of Pennsylvania. 
Bunt, et al., 1987. Discontinuous Constituents in Trees, 
Rules and Parsing. In Proceedings of European Chapter of 
the A CL '87. Copenhagen. 
Freckelton, P., 1984. Une Etude Comparative des E~:pres- 
sions Idiomatiques de I'Anglais et du Franfais. PhD thesis, 
Th~se de troisi~me cycle, University Paris 7. 
Gazdar, G.; Klein, E.; Pullum, G. K.; and Sag, I. A., 
1985. Generalized Phrase Structure Grammars. Blackwell 
Publishing, Oxford. Also published by Harvard University 
Press, Cambridge, MA. 
Gross, Maurice, 1982. Classification des phrases fig~es en 
F~ran~ais. Revue Qu~b~coise de Linguistique 11(2). 
Johnson, M., 1985. Parsing with discontinuous elements. 
In Proceedings of the ~3rd A CL meeting. Chicago. 
Joshi, Aravind K., 1985. How Much Context-Sensitivity 
is Necessary for Characterizing Structural Descriptions-- 
Tree Adjoining Grammars. In Dowty, D.; Karttunen, L.; 
and Zwicky, A. (editors), Natural Language Processing-- 
Theoretical, Computational and Psychological Perspec- 
tives. Cambridge University Press, New York. Originally 
presented in a Workshop on Natural Language Parsing at 
Ohio State University, Columbus, Ohio, May 1983. 
Joshi, A. K.; Levy, L. S.; and Takahashi, M., 1975. Tree 
Adjunct Grammars. J. Comput. S~./st. Sci. 1O(1). 
Kroch, A. and Joshi, A. K., 1985. Linguistic Relevance 
of Tree Adjoining Grammars. Technical Report MS-CIS- 
85-18, Department of Computer and Information Science, 
University of Pennsylvania. 
Laporte, E., 1988. Reconnaissance des expressions fig~es 
lors de l'analyse automatique. Langages. Larousse, Paris. 
Sehabes, Yves and Joshi, Aravind K., 1988. An Earley- 
Type Parsing Algorithm for Tree Adjoining Grammars. In 
26 th Meeting of the Association for Computational Lin- 
guistics. Buffalo. 
Schabes, Yves; Abeill6, Anne; and Joshi, Aravind K., 1988. 
Parsing Strategies with 'Lexicalized' Grammars: Applica- 
tion to Tree Adjoining Grammars. In Proceedings of the 
12 th International Conference on Computational Linguis° 
tics. 
Stock, O., 1987. Getting Idioms in a Lexicon Based 
Parser's Head. In Proceedings of A CL'87. Stanford. 
Vijay-Shanker, K. and Joshi, A. K., 1985. Some Compu- 
tational Properties of Tree Adjoining Grammars. In 23 rd 
Meeting of the Association for Computational Linguistics, 
pages 82-93. 
Vijay-Shanker, K. and Joshl, A.K., 1988. Feature Struc- 
ture Based Tree Adjoining Grammars. In Proceedings of 
the 12 th International Conference on Computational Lin- 
guistics (Coling'88). Budapest. 
-9- 
