Stochastic Lexicalized Tree-Adjoining Grammars
Summary of the paper
Stochastic Lexicalized Tree-Adjoining Grammars
Yves Schabes
Motivations 
Stochastic techniques are currently enjoying renewed popularity. However, the stochastic models in use are clearly inadequate for the syntactic analysis of natural languages. The probabilistic formalisms proposed in the field of communication theory (Markov processes and n-grams) (Pratt, 1942; Shannon, 1948; Shannon, 1951) were quickly refuted in linguistics. Indeed, these models are incapable of describing syntax hierarchically (in tree form). Moreover, long-distance phenomena cannot be captured by these formalisms. Stochastic context-free grammars (Booth, 1969) make it possible to build a hierarchical description of syntax. However, no approach using stochastic context-free grammars (Lari and Young, 1990; Jelinek, Lafferty, and Mercer, 1990) is in practice as effective as Markov processes or n-grams. Indeed, context-free rules are not directly sensitive to words, and hence to a distribution over words.
Stochastic Lexicalized Tree-Adjoining Grammars
Lexicalized tree-adjoining grammars consist of a set of trees, each associated with a word. They localize most syntactic constraints (for example, subject-verb, verb-object) while describing syntax in tree form.
In this paper, the notion of derivation for lexicalized tree-adjoining grammars (tree-adjoining grammars) is extended to stochastic derivations. The new formalism, stochastic lexicalized tree-adjoining grammars (SLTAG), has unique properties, since it maintains the notion of a distribution over words while handling syntax hierarchically.
Algorithms
An algorithm for computing the probability of a sentence is presented in the paper.
Then, an algorithm for re-estimating the parameters of a stochastic lexicalized tree-adjoining grammar is described. This algorithm re-estimates the parameters in such a way that the probability of the corpus increases after each iteration. It can be used as a learning algorithm: the initial input grammar generates all words in all possible ways, and the algorithm then infers a grammar from the corpus.
Experimental Evaluation
We tested the re-estimation algorithm on an artificial corpus (Figure 1) and on the part-of-speech sequences (Figure 2) of the ATIS corpus (Hemphill, Godfrey, and Doddington, 1990). In both cases, the algorithm for stochastic lexicalized tree-adjoining grammars converges faster than the one for context-free grammars (Baker, 1979). These experiments confirm that stochastic lexicalized tree-adjoining grammars can model distributions over words that stochastic context-free grammars cannot express.
[Plot: SLTAG vs. SCFG; y-axis from 0.4 to 1.8, x-axis iterations 2-10.]
Figure 1: Convergence on a corpus of sentences from the language {a^n b^n | n > 0}
[Plot: SLTAG vs. SCFG; x-axis iterations 5-25.]
Figure 2: Convergence on the ATIS corpus
ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 - 425 - PROC. OF COLING-92, NANTES, AUG. 23-28, 1992
Stochastic Lexicalized Tree-Adjoining Grammars * 
Yves Schabes 
Dept. of Computer & Information Science 
University of Pennsylvania 
Philadelphia, PA 19104-6389, USA 
schabes@unagi.cis.upenn.edu
Abstract
The notion of stochastic lexicalized tree-adjoining grammar (SLTAG) is formally defined. The parameters of a SLTAG correspond to the probability of combining two structures, each one associated with a word. The characteristics of SLTAG are unique and novel since it is lexically sensitive (as N-gram models or Hidden Markov Models) and yet hierarchical (as stochastic context-free grammars).
Then, two basic algorithms for SLTAG are introduced: an algorithm for computing the probability of a sentence generated by a SLTAG, and an inside-outside-like iterative algorithm for estimating the parameters of a SLTAG given a training corpus.
Finally, we show how SLTAG enables us to define a lexicalized version of stochastic context-free grammars, and we report preliminary experiments showing some of the advantages of SLTAG over stochastic context-free grammars.
1 Motivations 
Although stochastic techniques applied to syntax modeling have recently regained popularity, current language models suffer from obvious inherent inadequacies. Early proposals such as Markov Models, N-gram models (Pratt, 1942; Shannon, 1948; Shannon, 1951) and Hidden Markov Models were very quickly shown to be linguistically not appropriate for natural language (e.g. Chomsky (1964, pages 13-18)) since they are unable to capture long distance dependencies or to describe hierarchically the syntax of natural languages. Stochastic context-free grammar (Booth, 1969) is a hierarchical model more appropriate for natural languages; however, none of such proposals (Lari and Young, 1990; Jelinek, Lafferty, and Mercer, 1990) perform as well as the simpler Markov Models because of the difficulty of capturing lexical information. The parameters of a stochastic context-free grammar do not correspond directly to a distribution over words, since distributional phenomena over words that are embodied by the application of
*This work was partially supported by DARPA Grant N0014-90-31863, ARO Grant DAAL03-89-C-0031 and NSF Grant IRI90-16592. We thank Aravind Joshi for suggesting the use of TAGs for statistical analysis during a private discussion that followed a presentation by Fred Jelinek during the June 1990 meeting of the DARPA Speech and Natural Language Workshop. We are also grateful to Peter Braun, Fred Jelinek, Mark Liberman, Mitch Marcus, Robert Mercer, Fernando Pereira and Stuart Shieber for providing valuable comments.
more than one context-free rule cannot be captured under the context-freeness assumption. This leads to the difficulty of maintaining a standard hierarchical model while capturing lexical dependencies.
This fact prompted researchers in natural language processing to give up hierarchical language models in favor of non-hierarchical statistical models over words (such as word N-gram models). Probably for lack of a better language model, it has also been argued that the phenomena that such devices cannot capture occur relatively infrequently. Such argumentation is linguistically not sound.
Lexicalized tree-adjoining grammars (LTAG)¹ combine hierarchical structures while being lexically sensitive, and are therefore more appropriate for statistical analysis of language. In fact, LTAGs are the simplest hierarchical formalism which can serve as the basis for lexicalizing context-free grammar (Schabes, 1990; Joshi and Schabes, 1991).
LTAG is a tree-rewriting system that combines trees of large domain with adjoining and substitution. The trees found in a TAG take advantage of the available extended domain of locality by localizing syntactic dependencies (such as filler-gap, subject-verb, verb-object) and most semantic dependencies (such as the predicate-argument relationship). For example, a LTAG lexicon contains trees such as: an initial tree anchored by the verb 'eats' (S over NP↓ and VP, the VP over V and NP↓), initial NP trees anchored by 'John' and 'peanuts', and an auxiliary tree anchored by the adverb 'hungrily' (VP over VP* and ADV).
Since the elementary trees of a LTAG are minimal 
syntactic and semantic units, distributional analysis of 
the combination of these elementary trees based on a 
training corpus will inform us about relevant statistical 
aspects of the language such as the classes of words 
appearing as arguments of a predicative element, the 
distribution of the adverbs licensed by a specific verb, 
or the adjectives licensed by a specific noun. 
This kind of statistical analysis, as independently suggested in (Resnik, 1991), can be made with LTAGs because of their extended domain of locality, but also because of their lexicalized property.
¹We assume familiarity throughout the paper with TAGs and their lexicalized variant. See, for instance, (Joshi, 1987), (Schabes, Abeillé, and Joshi, 1988), (Schabes, 1990) or (Joshi and Schabes, 1991).
In this paper, this intuition is made formally precise 
by defining the notion of a stochastic lexicalized tree- 
adjoining grammar (SLTAG). We present an algorithm 
for computing the probability of a sentence generated 
by a SLTAG, and finally we introduce an iterative algo- 
rithm for estimating the parameters of a SLTAG given
a training corpus of text. This algorithm can either 
be used for refining the parameters of a SLTAG or for 
inferring a tree-adjoining grammar from a training cor-
pus. We also report preliminary experiments with this 
algorithm. 
Due to the lack of space, the algorithms in this paper are described succinctly, without proofs of correctness, and more attention is given to the concepts and techniques used for SLTAG.
2 SLTAG 
Informally speaking, SLTAGs are defined by assigning a probability to the event that an elementary tree is combined (by adjunction or substitution) on a specific node of another elementary tree. These events of combination are the stochastic processes considered.
Since SLTAGs are defined on the basis of the derivation, and since TAG allows for a notion of derivation independent from the trees that are derived, a precise mathematical definition of the SLTAG derivation must be given. For this purpose, we use stochastic linear indexed grammars (SLIG) to formally express SLTAG derivations.
Linear indexed grammar (LIG) (Aho, 1968; Gazdar, 1985) is a rewriting system in which the non-terminal symbols are augmented with a stack. In addition to rewriting non-terminals, the rules of the grammar can have the effect of pushing or popping symbols on top of the stacks that are associated with each non-terminal symbol. A specific rule is triggered by the non-terminal on the left hand side of the rule and the top element of its associated stack.
The productions of a LIG are restricted to copy the stack corresponding to the non-terminal being rewritten to at most one stack associated with a non-terminal symbol on the right hand side of the production.²
In the following, [..p] refers to a possibly unbounded stack whose top element is p and whose remaining part is schematically written as '..'. [$] represents a stack whose only element is the bottom of the stack. While it is possible to define SLIGs in general, we define them for the particular case where the rules are binary branching and where the left hand sides are always incomparable.
A stochastic linear indexed grammar, G, is denoted by (V_N, V_T, V_I, S, Prod), where V_N is a finite set of non-terminal symbols; V_T is a finite set of terminal symbols; V_I is a finite set of stack symbols; S ∈ V_N is the start symbol; Prod is a finite set of productions of the form:

    X0[$p0] → a
    X0[..p0] → X1[..p1] X2[$p2]
    X0[..p0] → X1[$p1] X2[..p2]
    X0[$p0] → X1[$p1] X2[$p2]
where X_k ∈ V_N, a ∈ V_T and p0, p1, p2 ∈ V_I; and P is a probability distribution which assigns a probability, 0 < P(X[..p] → Λ) ≤ 1, to each rule X[..p] → Λ ∈ Prod, such that the sum of the probabilities of all the rules that can be applied to any non-terminal annotated with a given stack is equal to one. More precisely, ∀X ∈ V_N, ∀p ∈ V_I:

    Σ_Λ P(X[..p] → Λ) = 1

P(X[..p] → Λ) should be interpreted as the probability that X[..p] is rewritten as Λ.

²LIGs have been shown to be weakly equivalent to Tree-Adjoining Grammars (Vijay-Shanker, 1987).
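This properness condition is easy to check mechanically. The sketch below uses a hypothetical toy rule table (not one of the paper's grammars), keyed by a (non-terminal, top-of-stack) configuration, and verifies that the probabilities of each configuration's alternatives sum to one:

```python
# Properness check for a SLIG: for each (non-terminal, top-of-stack)
# configuration, the probabilities of its rules must sum to one.
# RULES is a hypothetical toy grammar; right-hand sides are opaque labels.
RULES = {
    ("S", "$"): [("push q", 0.3), ("emit a", 0.7)],
    ("S", "q"): [("pop", 1.0)],
}

for config, alternatives in RULES.items():
    total = sum(prob for _, prob in alternatives)
    assert abs(total - 1.0) < 1e-12, (config, total)
print("grammar is proper")
```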
A derivation starts from S associated with the empty stack (S[$]), and each level of the derivation must be validated by a production rule. The language of a SLIG is defined as follows: L = {w ∈ V_T* | S[$] ⇒* w}.
The probability of a derivation is defined as the product of the probabilities of all individual rules involved (counting repetition) in the derivation, the derivation being validated by a correct configuration of the stack at each level. The probability of a sentence is then computed as the sum of the probabilities of all derivations of the sentence.
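These two definitions can be sketched by exhaustive enumeration. The toy grammar below is hypothetical (it generates the language {a^n b^n | n > 0} used later in the paper) and, for brevity, elides the indexed stacks: each left-hand side stands for a whole (non-terminal, stack) configuration. A derivation's probability is the product of its rule probabilities; a string's probability is the sum over its derivations:

```python
# Hypothetical toy stochastic grammar; stacks elided for brevity.
RULES = {
    "S": [(("a", "S", "b"), 0.4), (("a", "b"), 0.6)],
}

def derivations(sym, max_depth=12):
    """Yield (terminal string, probability) for each derivation of sym."""
    if sym not in RULES:              # a terminal symbol derives itself
        yield sym, 1.0
        return
    if max_depth == 0:                # truncate the infinite language
        return
    for rhs, p in RULES[sym]:
        # expand each right-hand-side symbol and combine the results
        parts = [list(derivations(x, max_depth - 1)) for x in rhs]
        def combine(ps):
            if not ps:
                yield "", 1.0
                return
            for s1, p1 in ps[0]:
                for s2, p2 in combine(ps[1:]):
                    yield s1 + s2, p1 * p2
        for s, q in combine(parts):
            yield s, p * q            # product over the rules used

def sentence_probability(w):
    """Sum of the probabilities of all derivations yielding w."""
    return sum(p for s, p in derivations("S") if s == w)

print(round(sentence_probability("aabb"), 10))  # 0.24
```

Here "aabb" has a single derivation (S → aSb, then S → ab), so its probability is 0.4 × 0.6.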
Following the construction described in (Vijay-Shanker and Weir, 1991), given a LTAG, G_tag, we construct an equivalent LIG, G_lig. The constructed LIG generates the same language as G_tag, and each derivation of G_tag corresponds to a unique derivation in G_lig (and conversely). In addition, a probability is assigned to each production of the LIG. For simplicity of explanation, and without loss of generality, we assume that each node in an elementary tree in G_tag is either a leaf node (i.e. either a foot node or a non-empty terminal node) or binary branching.³ The construction of the equivalent SLIG follows.
The non-terminal symbols of G_slig are the two symbols 'top' (t) and 'bottom' (b); the set of terminal symbols is the same as the one of G_tag; the set of stack symbols is the set of nodes (not node labels) found in the elementary trees of G_tag, augmented with the bottom of the stack ($); and the start symbol is 'top' (t). For all root nodes η0 of an initial tree whose root is labeled by S, the following starting rules are added:
    t[$] → t[$η0]  (with probability P)    (1)
These rules state that a derivation must start from the top of the root node of some initial tree. P is the probability that a derivation starts from the initial tree associated with a lexical item and rooted by η0.
Then, for all nodes η in an elementary tree, the following rules are generated.
• If η1, η2 are the two children of a node η such that η2 is on the spine (i.e. subsumes the foot node), include:

    b[..η] → t[$η1] t[..η2]  (with probability 1)    (2)

Since (2) encodes an immediate domination link defined by the tree-adjoining grammar, its associated probability is one.
• Similarly, if η1, η2 are the two children of a node η such that η1 is on the spine (i.e. subsumes the foot node), include:

    b[..η] → t[..η1] t[$η2]  (with probability 1)    (3)

Since (3) encodes an immediate domination link defined by the tree-adjoining grammar, its associated probability is one.
³The algorithms explained in this paper can be generalized to lexicalized tree-adjoining grammars that need not be in Chomsky Normal Form, using techniques similar to the one found in (Schabes, 1991).
• If η1, η2 are the two children of a node η such that none of them is on the spine, include:

    b[$η] → t[$η1] t[$η2]  (with probability 1)    (4)

Since (4) also encodes an immediate domination link defined by the tree-adjoining grammar, its associated probability is one.
• If η is a node labeled by a non-terminal symbol and if it does not have an obligatory adjoining constraint, then we need to consider the case that adjunction might not take place. In this case, include:

    t[..η] → b[..η]  (with probability p)    (5)

The probability of rule (5) corresponds to the probability that no adjunction takes place at node η.
• If η is a node on which the auxiliary tree β can be adjoined, the adjunction of β can be predicted; therefore (assuming that ηr is the root node of β), include:

    t[..η] → t[..η ηr]  (with probability p)    (6)

The probability of rule (6) corresponds to the probability of adjoining the auxiliary tree whose root node is ηr, say β, on the node η belonging to some elementary tree, say α.⁴
• If ηf is the foot node of an auxiliary tree β that has been adjoined, then the derivation of the node below ηf must resume. In this case, include:

    b[..ηf] → b[..]  (with probability 1)    (7)

The above stochastic production is included with probability one, since the decision of adjunction has already been made in rules of the form (6).
• Finally, if η1 is the root node of an initial tree that can be substituted on a node marked for substitution η, include:

    t[$η] → t[$η1]  (with probability p)    (8)

Here, p is the probability that the initial tree rooted by η1 is substituted at node η. It corresponds to the probability of substituting the lexicalized initial tree whose root node is η1, say δ, at the node η of a lexicalized elementary tree, say α.⁵
The SLIG constructed as above is well defined if the following equalities hold for all nodes η:

    P(t[..η] → b[..η]) + Σ_{ηr} P(t[..η] → t[..η ηr]) = 1    (9)

    Σ_{η1} P(t[$η] → t[$η1]) = 1    (10)

    Σ_{η0} P(t[$] → t[$η0]) = 1    (11)
⁴Since the grammar is lexicalized, both trees α and β are associated with lexical items, and the site node η for adjunction corresponds to some syntactic modification. Such a rule encapsulates S modifiers (e.g. sentential adverbs as in "apparently John left"), VP modifiers (e.g. verb phrase adverbs as in "John left abruptly"), NP modifiers (e.g. relative clauses as in "the man who left was happy"), N modifiers (e.g. adjectives as in "pretty woman"), or even sentential complements (e.g. "John thinks that Harry is sick").

⁵Among other cases, the probability of this rule corresponds to the probability of filling some argument position by a lexicalized tree. It will encapsulate the distribution for selectional restriction, since the position of substitution is taken into account.
A grammar satisfying (12) is called consistent:⁶

    Σ_{w ∈ V_T*} P(t[$] ⇒* w) = 1    (12)
Besides the distributional phenomena that we mentioned earlier, SLTAG also captures the effect of adjoining constraints (selective, obligatory or null adjoining) which are required for tree-adjoining grammar.⁷
3 Algorithm for Computing the 
Probability of a Sentence 
We now define a bottom-up algorithm for SLTAG which computes the probability of an input string. The algorithm is an extension of the CKY-type parser for tree-adjoining grammar (Vijay-Shanker, 1987). The extended algorithm parses all spans of the input string and also computes their probability in a bottom-up fashion.
Since the string on the frontier of an auxiliary tree is broken up into two substrings by the foot node, for the purpose of computing the probability of the sentence we will consider the probability that a node derives two substrings of the input string. This quantity will be called the inside probability. Its exact definition is given below.
We will refer to the subsequence of the input string w = a1 ··· aN from position i to j as w_i^j. It is defined as follows:

    w_i^j = a_{i+1} ··· a_j  if i < j;  the empty string  if i ≥ j
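This indexing convention (positions name the gaps between symbols, so w_i^j covers a_{i+1} through a_j and is empty when i ≥ j) maps directly onto slice notation; a small sketch:

```python
def span(w, i, j):
    """Return w_i^j for w = a_1 ... a_N: a_{i+1} .. a_j if i < j, else empty."""
    return w[i:j] if i < j else w[:0]

w = "peanuts"
print(span(w, 0, 3))        # pea
print(span(w, 3, 3) == "")  # True
```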
Given a string w = a1 ··· aN and a SLTAG rewritten as in (1-8), the inside probability, I^w(pos, η, i, j, k, l), is defined for all nodes η contained in an elementary tree α, for pos ∈ {t, b}, and for all indices 0 ≤ i ≤ j ≤ k ≤ l ≤ N, as follows:

(i) If the node η does not subsume the foot node of α (if there is one), then j and k are unbound and:

    I^w(pos, η, i, -, -, l) =def P(pos[$η] ⇒* w_i^l)

(ii) If the node η subsumes the foot node ηf of α, then:

    I^w(pos, η, i, j, k, l) =def P(pos[$η] ⇒* w_i^j b[$ηf] w_k^l)

In (ii), only the top element of the stack matters since, as a consequence of the construction of the SLIG, we have that if pos[$η] ⇒* w_i^j b[$ηf] w_k^l, then for all strings γ ∈ V_I* we also have pos[$γη] ⇒* w_i^j b[$γηf] w_k^l.⁸
Initially, all inside probabilities are set to zero. Then, the computation goes bottom-up, starting from the productions introducing lexical items: if η is a node such that b[$η] → a, then:

    I^w(b, η, i, -, -, l) = 1 if l = i + 1 and a = w_i^{i+1}, 0 otherwise.    (13)
Then, the inside probabilities of larger substrings are computed bottom-up, relying on the recurrence equations stated in Appendix A. This computation takes in the worst case O(|G|² N⁶) time and O(|G| N⁴) space for a sentence of length N.

⁶We will not investigate the conditions under which (12) holds. We conjecture that the techniques used for checking the consistency of stochastic context-free grammars (Booth and Thompson, 1973) can be adapted to SLTAG.

⁷For example, for a given node η, setting to zero the probability of all rules of the form (6) has the effect of blocking adjunction.

⁸This can be seen by observing that, for any node on the path from the root node to the foot node of an auxiliary tree, the stack remains unchanged.
Once the inside probabilities are computed, we obtain the probability of the sentence as follows:

    P(w) =def P(t[$] ⇒* w) = I^w(t, $, 0, -, -, |w|)    (14)
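The SLTAG computation needs the six-index items above, but the shape of the bottom-up pass is easiest to see on its context-free analogue: the inside computation for a stochastic CFG in Chomsky Normal Form (the SCFG baseline of Section 5, after Baker (1979)). The grammar below is a hypothetical toy for {a^n b^n | n > 0}, not taken from the paper:

```python
from collections import defaultdict

# Hypothetical toy SCFG in CNF: S -> A T | A B, T -> S B, A -> a, B -> b.
LEX = {("A", "a"): 1.0, ("B", "b"): 1.0}
BIN = {("S", ("A", "T")): 0.4, ("S", ("A", "B")): 0.6, ("T", ("S", "B")): 1.0}

def inside(w):
    """P(S =>* w): sum over all parses of the product of rule probabilities."""
    n = len(w)
    I = defaultdict(float)              # I[X, i, l] = P(X =>* w_i^l)
    for i, a in enumerate(w):           # spans of length one (lexical rules)
        for (X, t), p in LEX.items():
            if t == a:
                I[X, i, i + 1] += p
    for length in range(2, n + 1):      # larger spans, bottom-up
        for i in range(n - length + 1):
            l = i + length
            for (X, (Y, Z)), p in BIN.items():
                for m in range(i + 1, l):   # split point
                    I[X, i, l] += p * I[Y, i, m] * I[Z, m, l]
    return I["S", 0, n]

print(round(inside("aabb"), 10))  # 0.24
```

The SLTAG case replaces the two-index spans (i, l) with four-index spans (i, j, k, l) carrying the foot-node gap, which is what raises the time bound from N³ to N⁶.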
We now consider the problem of re-estimating a SLTAG.
4 Inside-Outside Algorithm for Reestimating a SLTAG
Given a set of positive example sentences, W = {w1, ..., wK}, we would like to compute the probability of each rule of a given SLTAG in order to maximize the probability that the corpus was generated by this SLTAG. An algorithm solving this problem can be used in two different ways.
The first use is as a reestimation algorithm. In this approach, the input SLTAG derives structures that are reasonable according to some criteria (such as a linguistic theory and some a priori knowledge of the corpus), and the intended use of the algorithm is to refine the probability of each rule.
The second use is as a learning algorithm. At the first iteration, a SLTAG which generates all possible structures over a given set of nodes and terminal symbols is used. Initially, the probability of each rule is randomly assigned, and the algorithm then re-estimates these probabilities.
Informally speaking, given a first estimate of the parameters of a SLTAG, the algorithm re-estimates these parameters on the basis of the parses of each sentence in a training corpus obtained by a CKY-type parser. The algorithm is designed to derive a new estimate after each iteration such that the probability of the corpus is increased, or equivalently such that the cross entropy estimate (negative log probability) is decreased:

    H(W, G) = - ( Σ_{w ∈ W} log₂(P(w)) ) / ( Σ_{w ∈ W} |w| )    (15)
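Equation (15) is straightforward to evaluate once per-sentence probabilities are available; a minimal sketch, with a made-up corpus and model for illustration:

```python
import math

def cross_entropy(corpus, prob):
    """H(W, G): negative total log2 probability per word of the corpus."""
    log_prob = sum(math.log2(prob(w)) for w in corpus)
    n_words = sum(len(w) for w in corpus)
    return -log_prob / n_words

# Hypothetical model assigning each sentence of length n probability 2^-n,
# so the cross-entropy estimate is exactly one bit per word.
corpus = [["a", "b"], ["a", "a", "b", "b"]]
print(cross_entropy(corpus, lambda w: 2.0 ** -len(w)))  # 1.0
```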
In order to derive a new estimate, the algorithm needs to compute, for all sentences in W, the inside probabilities and the outside probabilities. Given a string w = a1 ··· aN, the outside probability, O^w(pos, η, i, j, k, l), is defined for all nodes η contained in an elementary tree α, for pos ∈ {t, b}, and for all indices 0 ≤ i ≤ j ≤ k ≤ l ≤ N, as follows:

(i) If the node η does not subsume the foot node of α (if there is one), then j and k are unbound and:

    O^w(pos, η, i, -, -, l) =def P(∃γ ∈ V_I* s.t. t[$] ⇒* w_0^i pos[$γη] w_l^N)

(ii) If the node η does subsume the foot node ηf of α, then:

    O^w(pos, η, i, j, k, l) =def P(∃γ ∈ V_I* s.t. t[$] ⇒* w_0^i pos[$γη] w_l^N and b[$γηf] ⇒* w_j^k)
Once the inside probabilities are computed, the outside probabilities can be computed top-down by considering smaller spans of the input string, starting with O^w(t, $, 0, -, -, N) = 1 (by definition). This is done by computing the recurrence equations stated in Appendix B.
In the following, we assume that η subsumes the foot node ηf within the same elementary tree, and also that η1 subsumes the foot node η1f (within the same elementary tree). The other cases are handled similarly. Table 1 shows the reestimation formulae for the adjoining rules (16) and the null adjoining rules (17).
(16) corresponds to the average number of times that the rule t[..η] → t[..η ηr] is used, and (17) to the average number of times no adjunction occurred on η. The denominators of (16) and of (17) estimate the average number of times that a derivation involves the expansion of t[..η]. The numerator of (16) estimates the average number of times that a derivation involves the rule t[..η] → t[..η ηr]. Therefore, for example, (16) estimates the probability of using the rule t[..η] → t[..η ηr].

The algorithm reiterates until H(W, G) is unchanged (within some epsilon) between two iterations. Each iteration of the algorithm requires at most O(|G| N⁶) time for each sentence of length N.
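The outer loop just described can be sketched as follows; `reestimate` stands for one full inside-outside pass over the corpus (not shown), and everything here is illustrative scaffolding rather than the paper's implementation:

```python
def train(grammar, corpus, reestimate, cross_entropy, eps=1e-4):
    """Iterate reestimation until H(W, G) changes by less than eps."""
    h_prev = float("inf")
    while True:
        grammar = reestimate(grammar, corpus)
        h = cross_entropy(grammar, corpus)
        if h_prev - h < eps:          # EM guarantees h never increases
            return grammar, h
        h_prev = h

# Dummy stand-ins: the 'cross-entropy' halves at each iteration,
# so training stops once the improvement falls below eps.
hs = iter([4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125])
grammar, h = train({}, [], lambda g, c: g, lambda g, c: next(hs), eps=0.1)
print(h)  # 0.0625
```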
5 Grammar Inference with 
SLTAG 
The reestimation algorithm explained in Section 4 can be used both to reestimate the parameters of a SLTAG derived by some other means, and to infer a grammar from scratch. In the following, we investigate grammar inference from scratch.
The initial grammar for the reestimation algorithm consists of all SLIG rules for the trees in Lexicalized Normal Form (LNF, for short) over a given set Σ = {a_i | 1 ≤ i ≤ T} of terminal symbols, with suitably assigned non-zero probabilities:⁹

[Tree diagrams for the LNF elementary trees (auxiliary trees β^{a_i} and initial trees α^{a_i}, with indexed S nodes and anchor a_i) are not legible in the source.]
The above normal form is capable not only of deriving any lexicalized tree-adjoining language, but also of imposing any binary bracketing over the strings of the language. The latter property is important, as we would like to be able to use bracketing information in the input corpus, as in (Pereira and Schabes, 1992).
The worst case complexity of the reestimation algorithm given in Section 4 with respect to the length of the input string (O(N⁶)) makes this approach in general impractical for LNF grammars.

However, if only trees of the form β^a and α^a (or only of the form β'^a and α^a) are used, the language generated is a context-free language and can be handled more efficiently by the reestimation algorithm.
⁹Adjoining constraints can be used in this normal form. They will be reflected in the equivalent SLIG grammar. Indices have been added on S nodes in order to be able to uniquely refer to each node in the grammar.
    P̂(t[..η] → t[..η ηr]) = [ Σ_{w ∈ W} (1/P(w)) × C^w(t[..η] → t[..η ηr]) ] / [ Σ_{w ∈ W} (1/P(w)) × ( R^w(η) + Σ_{ηr'} C^w(t[..η] → t[..η ηr']) ) ]    (16)

    P̂(t[..η] → b[..η]) = [ Σ_{w ∈ W} (1/P(w)) × R^w(η) ] / [ Σ_{w ∈ W} (1/P(w)) × ( R^w(η) + Σ_{ηr} C^w(t[..η] → t[..η ηr]) ) ]    (17)

    C^w(t[..η] → t[..η ηr]) = Σ_{i ≤ r, j, k, s ≤ l} P(t[..η] → t[..η ηr]) × I^w(t, ηr, i, r, s, l) × I^w(b, η, r, j, k, s) × O^w(t, η, i, j, k, l)    (18)

    R^w(η) = Σ_{i, j, k, l} P(t[..η] → b[..η]) × I^w(b, η, i, j, k, l) × O^w(t, η, i, j, k, l)    (19)

Table 1: Reestimation of adjoining rules (16) and null adjoining rules (17)
It can be shown that, if only trees of the form β^a and α^a are considered, the reestimation algorithm requires in the worst case O(N³) time.¹⁰

The system consisting of trees of the form β^a and α^a can be seen as a stochastic lexicalized context-free grammar, since it generates exactly context-free languages while being lexically sensitive.
In the following, due to the lack of space, we report only a few experiments on grammar inference using these restricted forms of SLTAG and the reestimation algorithm given in Section 4. We compare the results of
the TAG inside-outside algorithm with the results of 
the inside-outside algorithm for context-free grammars 
(Baker, 1979). 
These preliminary experiments suggest that SLTAG achieves faster convergence (and also converges to a better solution) than stochastic context-free grammars.
5.1 Inferring the Language {a^n b^n | n > 0}
We consider first an artificial language. The training corpus consists of 100 sentences in the language L = {a^n b^n | n > 0}, randomly generated by a stochastic context-free grammar.
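Such a corpus can be sampled directly. The sketch below draws 100 sentences of L from a hypothetical SCFG S → a S b | a b; the actual generating grammar and its probabilities are not given in the paper:

```python
import random

def sample_sentence(q=0.4, rng=random):
    """Sample a^n b^n, n geometrically distributed (recursion probability q)."""
    n = 1
    while rng.random() < q:   # apply S -> a S b once more
        n += 1
    return "a" * n + "b" * n

random.seed(0)
corpus = [sample_sentence() for _ in range(100)]
print(all(s == "a" * (len(s) // 2) + "b" * (len(s) // 2) for s in corpus))  # True
```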
The initial grammar consists of the trees β^a, β^b, α^a and α^b, with random probabilities of adjoining and null adjoining.
The inferred grammar correctly models the language L. Its rules of the form (1), (5) or (6) with high probability follow (any excluded rule of the same form has a probability at least 10³³ times lower than that of the rules given below). The structural rules of the form (2), (3), (4) or (7) are not shown, since their probabilities always remain 1.
¹⁰This can be seen by observing that, for example in I(pos, η, i, j, k, l), it is necessarily the case that k = l, and also by noting that k is superfluous.
[Listing: the high-probability inferred rules, of the forms t[$] → t[$η] (1), t[..η] → b[..η] (5), and t[..η] → t[..η η'] (6); the node symbols and probability values are not legible in the source.]
In the above grammar, a node S_k in a tree α^a or β^a associated with the symbol a is referred to as η_k^a, and a node S_k in a tree associated with b as η_k^b.
We also conducted a similar experiment with the inside-outside algorithm for context-free grammars (Baker, 1979), starting with all possible Chomsky Normal Form rules over 4 non-terminals and the set of terminal symbols {a, b} (72 rules). The inferred grammar does not quite correctly model the language L. Furthermore, the algorithm does not converge as fast as in the case of SLTAG (see Figure 1).
[Plot: SLTAG vs. SCFG; y-axis from 0.4 to 1.8, x-axis iterations 2-10.]
Figure 1: Convergence for the Language {a^n b^n | n > 0}
5.2 Experiments on the ATIS Corpus 
We consider the part-of-speech sequences of the spoken-language transcriptions in the Texas Instruments subset of the Air Travel Information System (ATIS) corpus (Hemphill, Godfrey, and Doddington, 1990). This corpus is of interest since it has been used for inferring stochastic context-free grammars from partially bracketed corpora (Pereira and Schabes, 1992). We use the data given by Pereira and Schabes (1992) on raw text and compare with an inferred SLTAG.
The initial grammar consists of all trees (96) of the form β^a, α^a for the 48 part-of-speech terminal symbols. As shown in Figure 2, the grammar converges very rapidly to a lower value of the log probability than the stochastic context-free grammar reported by Pereira and Schabes (1992).
[Plot: SLTAG vs. SCFG; y-axis from 10 to 16, x-axis iterations 5-25.]
Figure 2: Convergence for ATIS Corpus
6 Conclusion 
A novel statistical language model and fundamental al- 
gorithms for this model have been presented. 
SLTAGs provide a stochastic model that is both hierarchical and sensitive to lexical information. They combine the advantages of purely lexical models such as N-gram distributions or Hidden Markov Models and those of hierarchical models such as stochastic context-free grammars, without their inherent limitations. The parameters of a SLTAG correspond to the probability of combining two structures, each one associated with a word, and therefore capture linguistically relevant distributions over words.
An algorithm for computing the probability of a sen- 
tence generated by a SLTAG was presented as well as 
an iterative algorithm for estimating the parameters of 
a SLTAG given a training corpus of raw text. Simi- 
larly to its context-free counterpart, the reestimation 
algorithm can be extended to handle partially parsed 
corpora (Pereira and Schabes, 1992). 
Preliminary experiments with a context-free subset of SLTAG confirm that SLTAG enables faster convergence than stochastic context-free grammars (SCFG). This is the case since SCFGs are unable to represent lexical influences on distribution except by a statistically and computationally impractical proliferation of nonterminal symbols, whereas SLTAG allows for a lexically sensitive distributional analysis while maintaining a hierarchical structure.
Furthermore, the techniques explained in this paper apply to other grammatical formalisms, such as combinatory categorial grammars and modified head grammars, since they have been proven to be equivalent to tree-adjoining grammars and linear indexed grammars (Joshi, Vijay-Shanker, and Weir, 1991).
For lack of space, only a few experiments with SLTAG were reported. A full version of the paper will be available by the time of the meeting, and more experimental details will be reported during the presentation of the paper.
In collaboration with Aravind Joshi, Fernando Pereira and Stuart Shieber, we are currently investigating additional algorithms and applications for SLTAG, methods for lexical clustering, and automatic construction of a SLTAG from a large training corpus.
A Computing the Inside Probabilities

In the following, the inside and outside probabilities are relative to the input string $w$ of length $N$. $\mathcal{F}$ stands for the set of foot nodes, $\mathcal{S}$ for the set of nodes on which substitution can occur, $\mathcal{R}$ for the set of root nodes of initial trees, and $\mathcal{A}$ for the set of root nodes of auxiliary trees. In the rules below, $t[\eta]$ and $b[\eta]$ denote the top and bottom of node $\eta$, $[..\eta]$ marks a node that passes the stack down the spine, and a dash stands for the foot-gap indices of a node that does not subsume a foot node. The inside probabilities can be computed bottom-up with the following recurrence equations. For all nodes $\eta$ found in an elementary tree, it can be shown that:

1. If $b[\eta] \rightarrow a$:
\[ I(b,\eta,i,-,-,l) = 1 \ \text{if } l = i+1 \text{ and } a = w_{i+1},\ \text{and } 0 \text{ otherwise.} \]

2. If $\eta \in \mathcal{F}$:
\[ I(b,\eta,i,j,k,l) = 1 \ \text{if } i = j \text{ and } k = l,\ \text{and } 0 \text{ otherwise.} \]

3. If $b[..\eta] \rightarrow t[..\eta_1]\,t[\eta_2]$:
\[ I(b,\eta,i,j,k,l) = \sum_{m=k}^{l} I(t,\eta_1,i,j,k,m) \times I(t,\eta_2,m,-,-,l) \]

4. If $b[..\eta] \rightarrow t[\eta_1]\,t[..\eta_2]$:
\[ I(b,\eta,i,j,k,l) = \sum_{m=i+1}^{j} I(t,\eta_1,i,-,-,m) \times I(t,\eta_2,m,j,k,l) \]

5. If $b[\eta] \rightarrow t[\eta_1]\,t[\eta_2]$:
\[ I(b,\eta,i,-,-,l) = \sum_{m=i+1}^{l} I(t,\eta_1,i,-,-,m) \times I(t,\eta_2,m,-,-,l) \]

6. For all nodes $\eta$ on which adjunction can be performed:
\[ I(t,\eta,i,j,k,l) = I(b,\eta,i,j,k,l) \times P(t[..\eta] \rightarrow b[..\eta])
 + \sum_{\eta_r \in \mathcal{A}} \sum_{r=i}^{j} \sum_{s=k}^{l} I(t,\eta_r,i,r,s,l) \times I(b,\eta,r,j,k,s) \times P(t[..\eta] \rightarrow t[..\eta\,\eta_r]) \]

7. For all nodes $\eta \in \mathcal{S}$:
\[ I(t,\eta,i,-,-,l) = \sum_{\eta_1 \in \mathcal{R}} I(t,\eta_1,i,-,-,l) \times P(t[\eta] \rightarrow t[\eta_1]) \]

8. \[ I(t,S,i,-,-,l) = \sum_{\eta \in \mathcal{R}} I(t,\eta,i,-,-,l) \times P(S \rightarrow t[\eta]) \]
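The inside recurrences above can be sketched in code for the substitution-only (context-free) subset of SLTAG used in the experiments; without adjunction the foot-gap indices disappear and the computation reduces to the CKY inside algorithm. The function name and grammar encoding below are our own illustrative choices, not from the paper:

```python
from collections import defaultdict

def inside_probabilities(words, lexical_rules, binary_rules, start="S"):
    """CKY-style inside computation for the substitution-only subset.
    lexical_rules maps a word to [(A, P(A -> word)), ...] (rule 1 above);
    binary_rules maps (A, B, C) to P(A -> B C) (rules 3-5, no foot gap).
    I[(A, i, l)] is the probability that A derives words[i:l]."""
    n = len(words)
    I = defaultdict(float)
    # Base case: spans of length one are covered by lexical rules.
    for i, w in enumerate(words):
        for A, p in lexical_rules.get(w, []):
            I[(A, i, i + 1)] += p
    # Binary case: split each span at every intermediate position m.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            l = i + span
            for (A, B, C), p in binary_rules.items():
                for m in range(i + 1, l):
                    I[(A, i, l)] += p * I[(B, i, m)] * I[(C, m, l)]
    # The sentence probability is the inside mass of the start symbol.
    return I, I[(start, 0, n)]
```

With a toy grammar such as `{"John": [("NP", 1.0)], "sleeps": [("VP", 1.0)]}` and the single rule `{("S", "NP", "VP"): 1.0}`, the sentence "John sleeps" receives probability 1.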
B Computing the Outside Probabilities

The outside probabilities can be computed top-down, recursing over smaller spans of the input string once the inside probabilities have been computed. First, by definition we have $O(t,S,0,-,-,N) = 1$. The following recurrence equations hold for all nodes $\eta$ found in an elementary tree; $\eta_0$ stands for the parent of $\eta$ and $\eta_1$, $\eta_2$ for its left and right siblings, summing implicitly over all rules in which $\eta$ appears.

1. If $\eta \in \mathcal{R}$, $O(t,\eta,0,-,-,N) = P(S \rightarrow t[\eta])$, and for all $(i,j) \neq (0,N)$:
\[ O(t,\eta,i,-,-,j) = \sum_{\eta_0 \in \mathcal{S}} O(t,\eta_0,i,-,-,j) \times P(t[\eta_0] \rightarrow t[\eta]) \]

2. If $\eta$ is an interior node which subsumes the foot node of the elementary tree it belongs to:
\[ O(t,\eta,i,j,k,l) = \sum_{q=l+1}^{N} O(b,\eta_0,i,j,k,q) \times I(t,\eta_2,l,-,-,q) \times P(b[..\eta_0] \rightarrow t[..\eta]\,t[\eta_2])
 + \sum_{p=0}^{i-1} O(b,\eta_0,p,j,k,l) \times I(t,\eta_1,p,-,-,i) \times P(b[..\eta_0] \rightarrow t[\eta_1]\,t[..\eta]) \]

3. If $\eta$ is an interior node which does not subsume the foot node of the elementary tree it belongs to, we have:
\[ O(t,\eta,i,-,-,l) = \sum_{q=l+1}^{N} O(b,\eta_0,i,-,-,q) \times I(t,\eta_2,l,-,-,q) \times P(b[\eta_0] \rightarrow t[\eta]\,t[\eta_2])
 + \sum_{p=0}^{i-1} O(b,\eta_0,p,-,-,l) \times I(t,\eta_1,p,-,-,i) \times P(b[\eta_0] \rightarrow t[\eta_1]\,t[\eta])
 + \sum_{l \le j \le k \le q \le N} O(b,\eta_0,i,j,k,q) \times I(t,\eta_2,l,j,k,q) \times P(b[..\eta_0] \rightarrow t[\eta]\,t[..\eta_2])
 + \sum_{0 \le p \le j \le k \le i} O(b,\eta_0,p,j,k,l) \times I(t,\eta_1,p,j,k,i) \times P(b[..\eta_0] \rightarrow t[..\eta_1]\,t[\eta]) \]

4. If $\eta \in \mathcal{A}$, adjoined at a node $\eta_0$, then:
\[ O(t,\eta,i,j,k,l) = \sum_{\eta_0} \sum_{j \le p \le q \le k} O(t,\eta_0,i,p,q,l) \times I(b,\eta_0,j,p,q,k) \times P(t[..\eta_0] \rightarrow t[..\eta_0\,\eta])
 + \sum_{\eta_0} O(t,\eta_0,i,-,-,l) \times I(b,\eta_0,j,-,-,k) \times P(t[\eta_0] \rightarrow t[\eta_0\,\eta]) \]

5. If $\eta$ is a node which subsumes the foot node of the elementary tree it belongs to, we have:
\[ O(b,\eta,i,j,k,l) = O(t,\eta,i,j,k,l) \times P(t[..\eta] \rightarrow b[..\eta])
 + \sum_{\eta_r \in \mathcal{A}} \sum_{p=0}^{i} \sum_{q=l}^{N} O(t,\eta,p,j,k,q) \times I(t,\eta_r,p,i,l,q) \times P(t[..\eta] \rightarrow t[..\eta\,\eta_r]) \]

6. And finally, if $\eta$ is a node which does not subsume the foot node of the elementary tree it belongs to:
\[ O(b,\eta,i,-,-,l) = O(t,\eta,i,-,-,l) \times P(t[\eta] \rightarrow b[\eta])
 + \sum_{\eta_r \in \mathcal{A}} \sum_{p=0}^{i} \sum_{q=l}^{N} O(t,\eta,p,-,-,q) \times I(t,\eta_r,p,i,l,q) \times P(t[\eta] \rightarrow t[\eta\,\eta_r]) \]
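For the same substitution-only subset, the top-down outside pass can be sketched as follows; it mirrors the recurrences above by letting each rule send outside mass to one child weighted by the sibling's inside mass. Again, the function name and interfaces are our own illustrative choices:

```python
from collections import defaultdict

def outside_probabilities(words, I, binary_rules, start="S"):
    """Top-down outside computation for the substitution-only subset.
    I maps (A, i, l) to inside probabilities (as computed bottom-up);
    O[(A, i, l)] is the probability of the context surrounding an A
    spanning words[i:l], seeded with O(start, 0, n) = 1."""
    n = len(words)
    O = defaultdict(float)
    O[(start, 0, n)] = 1.0
    for span in range(n, 1, -1):          # larger spans feed smaller ones
        for i in range(n - span + 1):
            l = i + span
            for (A, B, C), p in binary_rules.items():
                o = O[(A, i, l)]
                if o == 0.0:
                    continue
                for m in range(i + 1, l):
                    # Each child's outside mass includes the rule probability
                    # and the inside mass of its sibling.
                    O[(B, i, m)] += o * p * I.get((C, m, l), 0.0)
                    O[(C, m, l)] += o * p * I.get((B, i, m), 0.0)
    return O
```

The product O(A,i,l) x P(A -> B C) x I(B,i,m) x I(C,m,l), divided by the sentence probability, gives the expected rule counts used by the reestimation algorithm.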
Actes de COLING-92, Nantes, 23-28 août 1992. Proc. of COLING-92, Nantes, Aug. 23-28, 1992.

References 

Aho, A. V. 1968. Indexed grammars: An extension of context-free grammars. J. ACM, 15:647-671.

Baker, J. K. 1979. Trainable grammars for speech recognition. In Jared J. Wolf and Dennis H. Klatt, editors, Speech Communication Papers Presented at the 97th Meeting of the Acoustical Society of America, MIT, Cambridge, MA, June.

Booth, Taylor R. and Richard A. Thompson. 1973. Applying probability measures to abstract languages. IEEE Transactions on Computers, C-22(5):442-450, May.

Booth, T. 1969. Probabilistic representation of formal 
languages. In Tenth Annual IEEE Symposium on 
Switching and Automata Theory, October. 

Chomsky, N., 1964. Syntactic Structures, chapter 2-3, 
pages 13-18. Mouton. 

Gazdar, G. 1985. Applicability of indexed grammars to natural languages. Technical Report CSLI-85-34, Center for the Study of Language and Information.

Hemphill, Charles T., John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In DARPA Speech and Natural Language Workshop, Hidden Valley, Pennsylvania, June.

Jelinek, F., J. D. Lafferty, and R. L. Mercer. 1990. Basic methods of probabilistic context-free grammars. Technical Report RC 16374 (72684), IBM, Yorktown Heights, New York 10598.

Joshi, Aravind K. and Yves Schabes. 1991. Tree-adjoining grammars and lexicalized grammars. In Maurice Nivat and Andreas Podelski, editors, Definability and Recognizability of Sets of Trees. Elsevier. Forthcoming.

Joshi, Aravind K., K. Vijay-Shanker, and David Weir. 1991. The convergence of mildly context-sensitive grammatical formalisms. In Peter Sells, Stuart Shieber, and Tom Wasow, editors, Foundational Issues in Natural Language Processing. MIT Press, Cambridge, MA.

Joshi, Aravind K. 1987. An Introduction to Tree Adjoining Grammars. In A. Manaster-Ramer, editor, Mathematics of Language. John Benjamins, Amsterdam.

Lari, K. and S. J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35-56.

Pereira, Fernando and Yves Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In 20th Meeting of the Association for Computational Linguistics (ACL'92), Newark, Delaware.

Pratt, Fletcher. 1942. Secret and Urgent: The Story of Codes and Ciphers. Blue Ribbon Books.

Resnik, Philip. 1991. Lexicalized tree-adjoining gram- 
mar for distributional analysis. In Penn Review of 
Linguistics, Spring. 

Schabes, Yves, Anne Abeillé, and Aravind K. Joshi. 1988. Parsing strategies with 'lexicalized' grammars: Application to tree adjoining grammars. In Proceedings of the 12th International Conference on Computational Linguistics (COLING'88), Budapest, Hungary, August.

Schabes, Yves. 1990. Mathematical and Computational Aspects of Lexicalized Grammars. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA, August. Available as technical report (MS-CIS-90-48, LINC LAB179) from the Department of Computer Science.

Schabes, Yves. 1991. An inside-outside algorithm 
for estimating the parameters of a hidden stochastic 
context-free grammar based on Earley's algorithm. 
Manuscript. 

Shannon, C. E. 1948. A mathematical theory of 
communication. The Bell System Technical Journal, 
27(3):379-423. 

Shannon, C. E. 1951. Prediction and entropy of printed English. The Bell System Technical Journal, 30:50-64.

Vijay-Shanker, K. and David J. Weir. 1991. Parsing 
constrained grammar formalisms. In preparation. 

Vijay-Shanker, K. 1987. A Study of Tree Adjoining Grammars. Ph.D. thesis, Department of Computer and Information Science, University of Pennsylvania.