STOCHASTIC TREE-ADJOINING GRAMMARS* 
Yves Schabes 
Department of Computer and Information Science 
University of Pennsylvania 
Philadelphia, PA 19104-6389 
ABSTRACT 
The notion of stochastic lexicalized tree-adjoining grammar 
(SLTAG) is defined and basic algorithms for SLTAG are de- 
signed. The parameters of a SLTAG correspond to the probabil- 
ity of combining two structures each one associated with a word. 
The characteristics of SLTAG are unique and novel since it is 
lexically sensitive (as N-gram models or Hidden Markov Mod- 
els) and yet hierarchical (as stochastic context-free grammars). 
An algorithm for computing the probability of a sentence gener- 
ated by a SLTAG is presented. Then, an iterative algorithm for 
estimating the parameters of a SLTAG given a training corpus 
is introduced. 
1. MOTIVATIONS 
Although stochastic techniques applied to syntax modeling 
have recently regained popularity, current language models 
suffer from obvious inherent inadequacies. Early proposals 
such as Markov Models, N-gram models \[1, 2, 3\] and Hid- 
den Markov Models were very quickly shown to be linguis- 
tically not appropriate for natural language (e.g. \[4\]) since 
they are unable to capture long distance dependencies or 
to describe hierarchically the syntax of natural languages. 
Stochastic context-free grammar \[5\] is a hierarchical model 
more appropriate for natural languages, however none of 
such proposals \[6, 7\] perform as well as the simpler Markov 
Models because of the difficulty of capturing lexical infor- 
mation. The parameters of a stochastic context-free gram- 
mar do not correspond directly to a distribution over words 
since distributional phenomena over words that are embod- 
ied by the application of more than one context-free rule 
cannot be captured under the context-freeness assumption. 
This leads to the difficulty of maintaining a standard hier- 
archical model while capturing lexical dependencies. 
This fact prompted researchers in natural language process- 
ing to give up hierarchical language models in the favor of 
non-hierarchical statistical models over words (such as word 
N-grams models). Probably for lack of a better language 
model, it has also been argued that the phenomena that 
such devices cannot capture occur relatively infrequently. 
*This work was partially supported by DARPA Grant N0014-90- 
31863, ARO Grant DAAL03-89-C-0031 and NSF Grant IRI90-16592. 
We thank Aravind Joshi for suggesting the use of TAGs for statistical 
analysis during a private discussion that followed a presentation by 
Fred Jelinek during the June 1990 meeting of the DARPA Speech and 
Natural Language Workshop. We are also grateful to Peter Braun, 
Fred Jelinek, Mark Liberman, Mitch Marcus, Robert Mercer, Fernando 
Pereira and Stuart Shieber for providing valuable comments. 
Such argumentation is linguistically not sound. 
Lexicalized tree-adjoining grammars (LTAG) 1 combine hi- 
erarchical structures while being lexically sensitive and are 
therefore more appropriate for statistical analysis of lan- 
guage. In fact, LTAGs are the simplest hierarchical formal- 
ism which can serve as the basis for lexicalizing context-free 
grammar \[10, 11\]. 
LTAG is a tree-rewriting system that combines trees of 
large domain with adjoining and substitution. The trees 
found in a TAG take advantage of the available extended 
domain of locality by localizing syntactic dependencies 
(such as filler-gap, subject-verb, verb-object) and most se- 
mantic dependencies (such as predicate-argument relation- 
ship). For example, the following trees can be found in a 
LTAG lexicon: 
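[Figure: four elementary trees, given here in bracketed notation; ↓ marks a substitution node and * marks a foot node] 
(S NP↓ (VP (V eats) NP↓)) 
(NP John) 
(NP peanuts) 
(VP VP* (ADV hungrily)) 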
Since the elementary trees of a LTAG are minimal syntactic 
and semantic units, distributional analysis of the combina- 
tion of these elementary trees based on a training corpus 
will inform us about relevant statistical aspects of the lan- 
guage such as the classes of words appearing as arguments 
of a predicative element, the distribution of the adverbs li- 
censed by a specific verb, or the adjectives licensed by a 
specific noun. 
This kind of statistical analysis as independently suggested 
in \[12\] can be made with LTAGs because of their extended 
domain of locality but also because of their lexicalized prop- 
erty. 
In this paper, this intuition is made formally precise by 
defining the notion of a stochastic lexicalized tree-adjoining 
grammar (SLTAG). We present an algorithm for computing 
the probability of a sentence generated by a SLTAG, and 
finally we introduce an iterative algorithm for estimating 
the parameters of a SLTAG given a training corpus of text. 
This algorithm can either be used for refining the parame- 
1 We assume familiarity throughout the paper with TAGs and their 
lexicalized variant. See, for instance, \[8\], \[9\], \[10\] or \[11\]. 
ters of a SLTAG or for inferring a tree-adjoining grammar 
from a training corpus. 
Due to the lack of space, in this paper the algorithms are 
described succinctly without proofs of correctness and more 
attention is given to the concepts and techniques used for 
SLTAG. 
2. SLTAG 
Informally speaking, SLTAGs are defined by assigning a 
probability to the event that an elementary tree is com- 
bined (by adjunction or substitution) on a specific node of 
another elementary tree. These events of combination are 
the stochastic processes considered. 
For sake of mathematical precision and elegance, we use a 
stochastic linear rewriting system, stochastic linear indexed 
grammars (SLIG), as a notation for SLTAGs. A linear in- 
dexed grammar is constructed following the method given 
in \[13\]. However, in addition, each rule is associated with 
a probability. 
Linear Indexed grammar (LIG) \[14, 15\] is a rewriting sys- 
tem in which the non-terminal symbols are augmented with 
a stack. In addition to rewriting non-terminals, the rules 
of the grammar can have the effect of pushing or popping 
symbols on top of the stacks that are associated with each 
non-terminal symbol. A specific rule is triggered by the 
non-terminal on the left hand side of the rule and the top 
element of its associated stack. The productions of a LIG \[15\] 
are restricted to copy the stack corresponding to the non-terminal 
being rewritten to at most one stack associated with a non-terminal 
symbol on the right hand side of the production. 2 
In the following, \[..p\] refers to a possibly unbounded stack 
whose top element is p and whose remaining part is 
schematically written as '..'. \[$\] represents a stack whose 
only element is the bottom of the stack. While it is possible 
to define SLIGs in general, we define them for the partic- 
ular case where the rules are binary branching and where 
the left hand sides are always incomparable. 
A stochastic linear indexed grammar, G, is denoted by 
$(V_N, V_T, V_I, S, Prod)$, where $V_N$ is a finite set of non-terminal 
symbols; $V_T$ is a finite set of terminal symbols; 
$V_I$ is a finite set of stack symbols; $S \in V_N$ is the start symbol; 
$Prod$ is a finite set of productions of the form: 
$$X_0[\$p_0] \rightarrow a$$ 
$$X_0[..p_0] \rightarrow X_1[..p_1]$$ 
$$X_0[..p_0] \rightarrow X_1[..p_1]\, X_2[\$p_2]$$ 
$$X_0[..p_0] \rightarrow X_1[\$p_1]\, X_2[..p_2]$$ 
$$X_0[\$p_0] \rightarrow X_1[\$p_1]\, X_2[\$p_2]$$ 
where $X_k \in V_N$, $a \in V_T$ and $p_0 \in V_I$, $p_1, p_2 \in V_I^*$; 
2 LIGs have been shown to be weakly equivalent to Tree-Adjoining 
Grammars \[16\]. 
P, a probability distribution which assigns a probability, 
$0 \le P(X[..x] \rightarrow \Lambda) \le 1$, to a rule, $X[..x] \rightarrow \Lambda \in Prod$, 
such that the sum of the probabilities of all the rules that 
can be applied to any non-terminal annotated with a stack 
is equal to one. More precisely, $\forall X \in V_N$, $\forall p \in V_I$: 
$$\sum_{\Lambda} P(X[..p] \rightarrow \Lambda) = 1$$ 
$P(X[..p] \rightarrow \Lambda)$ should be interpreted as the probability 
that $X[..p]$ is rewritten as $\Lambda$. 
A derivation starts from S associated with the empty stack 
(S\[$\]) and each level of the derivation must be validated 
by a production rule. The language of a SLIG is defined as 
follows: $L = \{w \in V_T^* \mid S[\$] \stackrel{*}{\Rightarrow} w\}$. 
The probability of a derivation is defined as the product of 
the probabilities of all individual rules involved (counting 
repetition) in the derivation, the derivation being validated 
by a correct configuration of the stack at each level. The 
probability of a sentence is then computed as the sum of 
the probabilities of all derivations of the sentence. 
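The two definitions above can be illustrated with a minimal Python sketch. It assumes each derivation is simply given as the list of probabilities of the rules it uses (repetitions included); the function names and the numbers are hypothetical and serve only as an illustration.

```python
from math import prod

def derivation_probability(rule_probs):
    """Product of the probabilities of the rules used, counting repetitions."""
    return prod(rule_probs)

def sentence_probability(derivations):
    """Sum of the probabilities of all derivations of one sentence."""
    return sum(derivation_probability(d) for d in derivations)

# Two hypothetical derivations of the same sentence (made-up rule probabilities).
derivations = [[0.5, 1.0, 0.4], [0.5, 0.2, 0.1]]
p_sentence = sentence_probability(derivations)  # 0.5*1.0*0.4 + 0.5*0.2*0.1
```

Enumerating derivations explicitly is only workable for toy cases; the bottom-up algorithm of Section 3 computes the same sum without listing them.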
Following the construction described in \[13\], given a LTAG, 
$G_{tag}$, we construct an equivalent 3 LIG, $G_{slig}$. In addition, 
a probability is assigned to each production of the LIG. For 
simplicity of explanation and without loss of generality we 
assume that each node in an elementary tree found in a tree- 
adjoining grammar is either a leaf node (i.e. either a foot 
node or a non-empty terminal node) or binary branching. 4 
The construction of the equivalent SLIG follows. 
The non-terminal symbols of $G_{slig}$ are the two symbols 
'top' (t) and 'bottom' (b), the set of terminal symbols is 
the same as the one of $G_{tag}$, the set of stack symbols is the 
set of nodes (not node labels) found in the elementary trees 
of $G_{tag}$ augmented with the bottom of the stack (\$), and 
the start symbol is 'top' (t). 
For all root nodes $\eta_0$ of an initial tree whose root is labeled 
by S, the following starting rules are added: 
$$t[\$] \stackrel{p}{\rightarrow} t[\$\eta_0] \quad (1)$$ 
These rules state that a derivation must start from the top 
of the root node of some initial tree. Here $p$ is the probability 
that a derivation starts from the initial tree associated with 
a lexical item and rooted by $\eta_0$. 
Then, for all nodes $\eta$ in an elementary tree, the following 
rules are generated. 
• If $\eta_1$ and $\eta_2$ are the 2 children of a node $\eta$ such that $\eta_2$ is on 
3 The constructed LIG generates the same language as the given 
tree-adjoining grammar. 
4 The algorithms explained in this paper can be generalized to lexicalized 
tree-adjoining grammars that need not be in Chomsky Normal 
Form using techniques similar to the one found in \[17\]. 
the spine (i.e. subsumes the foot node), include: 
$$b[..\eta] \stackrel{p=1}{\rightarrow} t[\$\eta_1]\, t[..\eta_2] \quad (2)$$ 
Since (2) encodes an immediate domination link defined 
by the tree-adjoining grammar, its associated probability 
is one. 
• Similarly, if $\eta_1$ and $\eta_2$ are the 2 children of a node $\eta$ such that 
$\eta_1$ is on the spine (i.e. subsumes the foot node), include: 
$$b[..\eta] \stackrel{p=1}{\rightarrow} t[..\eta_1]\, t[\$\eta_2] \quad (3)$$ 
Since (3) encodes an immediate domination link defined 
by the tree-adjoining grammar, its associated probability 
is one. 
• If $\eta_1$ and $\eta_2$ are the 2 children of a node $\eta$ such that none of 
them is on the spine, include: 
$$b[\$\eta] \stackrel{p=1}{\rightarrow} t[\$\eta_1]\, t[\$\eta_2] \quad (4)$$ 
Since (4) also encodes an immediate domination link defined 
by the tree-adjoining grammar, its associated probability 
is one. 
• If $\eta$ is a node labeled by a non-terminal symbol and if 
it does not have an obligatory adjoining constraint, then 
we need to consider the case that adjunction might not 
take place. In this case, include: 
$$t[..\eta] \stackrel{p}{\rightarrow} b[..\eta] \quad (5)$$ 
The probability of rule (5) corresponds to the probability 
that no adjunction takes place at node $\eta$. 
• If $\eta$ is a node on which the auxiliary tree $\beta$ can be adjoined, 
the adjunction of $\beta$ can be predicted, therefore 
(assuming that $\eta_r$ is the root node of $\beta$) include: 
$$t[..\eta] \stackrel{p}{\rightarrow} t[..\eta\eta_r] \quad (6)$$ 
The probability of rule (6) corresponds to the probability 
of adjoining the auxiliary tree whose root node is $\eta_r$, say 
$\beta$, on the node $\eta$ belonging to some elementary tree, say 
$\alpha$. 5 
• If $\eta_f$ is the foot node of an auxiliary tree $\beta$ that has been 
adjoined, then the derivation of the node below $\eta_f$ must 
resume. In this case, include: 
$$b[..\eta_f] \stackrel{p=1}{\rightarrow} b[..] \quad (7)$$ 
The above stochastic production is included with probability 
one since the decision of adjunction has already 
been made in rules of the form (6). 
• Finally, if $\eta_1$ is the root node of an initial tree that can be 
substituted on a node marked for substitution $\eta$, include: 
$$t[\$\eta] \stackrel{p}{\rightarrow} t[\$\eta_1] \quad (8)$$ 
Here, $p$ is the probability that the initial tree rooted by $\eta_1$ 
is substituted at node $\eta$. It corresponds to the probability 
of substituting the lexicalized initial tree whose root node 
5 Since the grammar is lexicalized, both trees $\alpha$ and $\beta$ are associated 
with lexical items, and the site node for adjunction $\eta$ corresponds to 
some syntactic modification. Such a rule encapsulates S modifiers (e.g. 
sentential adverbs as in "apparently John left"), VP modifiers (e.g. 
verb phrase adverbs as in "John left abruptly"), NP modifiers (e.g. 
relative clauses as in "The man who left was happy"), N modifiers 
(e.g. adjectives as in "pretty woman"), or even sentential complements 
(e.g. "John thinks that Harry is sick"). 
is $\eta_1$, say $\delta$, at the node $\eta$ of a lexicalized elementary tree, 
say $\alpha$. 6 
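The immediate-domination rules of the construction, rules (2) to (4), can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class, the `on_spine` flag, the string encoding of the rules and the node names are hypothetical and are not part of the original construction.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    name: str                                  # unique node id (a node, not a label)
    children: List["Node"] = field(default_factory=list)
    on_spine: bool = False                     # True if the node subsumes the foot node

Rule = Tuple[str, Tuple[str, ...], float]      # (left-hand side, right-hand side, probability)

def domination_rules(node: Node) -> List[Rule]:
    """Rules (2)-(4) for one binary-branching node; each has probability one."""
    if len(node.children) != 2:
        return []
    left, right = node.children
    if right.on_spine:      # rule (2): the stack is passed to the right child
        return [(f"b[..{node.name}]", (f"t[${left.name}]", f"t[..{right.name}]"), 1.0)]
    if left.on_spine:       # rule (3): the stack is passed to the left child
        return [(f"b[..{node.name}]", (f"t[..{left.name}]", f"t[${right.name}]"), 1.0)]
    # rule (4): neither child is on the spine, so both start from a fresh stack
    return [(f"b[${node.name}]", (f"t[${left.name}]", f"t[${right.name}]"), 1.0)]

# The auxiliary tree for "hungrily": VP -> VP* ADV, with the foot VP* on the spine.
foot = Node("VP_foot", on_spine=True)
adv = Node("ADV")
root = Node("VP_root", [foot, adv], on_spine=True)
rules = domination_rules(root)
```

The remaining rules (1) and (5) to (8) carry the free parameters of the model and would be generated analogously, one rule per admissible combination of trees.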
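The SLIG constructed as above is well defined if the following 
equalities hold for all nodes $\eta$: 
$$P(t[..\eta] \rightarrow b[..\eta]) + \sum_{\eta_r} P(t[..\eta] \rightarrow t[..\eta\eta_r]) = 1 \quad (9)$$ 
$$\sum_{\eta_1} P(t[\$\eta] \rightarrow t[\$\eta_1]) = 1 \quad (10)$$ 
$$\sum_{\eta_0} P(t[\$] \rightarrow t[\$\eta_0]) = 1 \quad (11)$$ 
A grammar satisfying (12) is called consistent. 7 
$$\sum_{w \in \Sigma^*} P(t[\$] \stackrel{*}{\Rightarrow} w) = 1 \quad (12)$$ 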
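As a hedged illustration of the well-definedness condition for adjunction, equality (9), the following Python sketch checks that the no-adjunction probability and the adjunction probabilities of each node sum to one. The dictionary layout, the function name and the node names are assumptions made for this example only.

```python
def check_well_defined(no_adjunction, adjunction, tol=1e-9):
    """Return the nodes that violate equality (9).

    no_adjunction: dict node -> P(t[..node] -> b[..node])
    adjunction:    dict node -> dict aux_root -> P(t[..node] -> t[..node aux_root])
    """
    bad = []
    for node, p_none in no_adjunction.items():
        total = p_none + sum(adjunction.get(node, {}).values())
        if abs(total - 1.0) > tol:
            bad.append(node)
    return bad

# Hypothetical parameters: the second node is deliberately mis-normalized.
no_adjunction = {"VP_eats": 0.7, "S_eats": 0.9}
adjunction = {"VP_eats": {"VP_hungrily": 0.3}, "S_eats": {"S_apparently": 0.2}}
violations = check_well_defined(no_adjunction, adjunction)
```

The same kind of check applies to (10) and (11), with the sum taken over the initial trees that can be substituted at a node or that can start a derivation.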
Beside the distributional phenomena that we mentioned 
earlier, SLTAG also captures the effect of adjoining con- 
straints (selective, obligatory or null adjoining) which are 
required for tree-adjoining grammar. 8 
3. PROBABILITY OF A SENTENCE 
We now define a bottom-up algorithm for SLTAG which 
computes the probability of an input string. The algorithm 
is an extension of the CKY-type parser for tree-adjoining 
grammar \[18\]. The extended algorithm parses all spans of 
the input string and also computes their probability in a 
bottom-up fashion. 
Since the string on the frontier of an auxiliary tree is broken 
up into two substrings by the foot node, for the purpose of 
computing the probability of the sentence, we will consider 
the probability that a node derives two substrings of the in- 
put string. This entity will be called the inside probability. 
Its exact definition is given below. 
We will refer to the subsequence of the input string $w = a_1 \ldots a_N$ 
from position $i$ to $j$ as $w_i^j$. It is defined as follows: 
$$w_i^j \stackrel{def}{=} \begin{cases} a_{i+1} \cdots a_j & \text{if } i < j \\ \varepsilon & \text{if } i \ge j \end{cases}$$ 
Given a string $w = a_1 \ldots a_N$ and a SLTAG rewritten as 
in (1-8), the inside probability, $I^w(pos, \eta, i, j, k, l)$, is defined 
for all nodes $\eta$ contained in an elementary tree $\alpha$ and for 
$pos \in \{t, b\}$, and for all indices $0 \le i \le j \le k \le l \le N$ as 
follows: 
6 Among other cases, the probability of this rule corresponds to the 
probability of filling some argument position by a lexicalized tree. It 
will encapsulate the distribution for selectional restriction since the 
position of substitution is taken into account. 
7 We will not investigate the conditions under which (12) holds. We 
conjecture that some of the techniques used for checking the consistency 
of stochastic context-free grammars can be adapted to SLTAG. 
8 For example, for a given node $\eta$, setting to zero the probability of 
all rules of the form (6) has the effect of blocking adjunction. 
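(i) If the node $\eta$ does not subsume the foot node of 
$\alpha$ (if there is one), then $j$ and $k$ are unbound and: 
$$I^w(pos, \eta, i, -, -, l) \stackrel{def}{=} P(pos[\$\eta] \stackrel{*}{\Rightarrow} w_i^l)$$ 
(ii) If the node $\eta$ subsumes the foot node $\eta_f$ of $\alpha$, 
then: 
$$I^w(pos, \eta, i, j, k, l) \stackrel{def}{=} P(pos[\$\eta] \stackrel{*}{\Rightarrow} w_i^j\, b[\$\eta_f]\, w_k^l)$$ 
In (ii), only the top element of the stack matters since, as a 
consequence of the construction of the SLIG, we have that 
if $pos[\$\eta] \stackrel{*}{\Rightarrow} w_i^j\, b[\$\eta_f]\, w_k^l$ then for all strings $\gamma \in V_I^*$ we also 
have $pos[\$\gamma\eta] \stackrel{*}{\Rightarrow} w_i^j\, b[\$\gamma\eta_f]\, w_k^l$. 9 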
Initially, all inside probabilities are set to zero. Then, the 
computation goes bottom-up starting from the productions 
introducing lexical items: if $\eta$ is a node such that $b[\$\eta] \rightarrow a$, 
then: 
$$I^w(b, \eta, i, -, -, l) = \begin{cases} 1 & \text{if } l = i+1 \wedge a = w_i^{i+1} \\ 0 & \text{otherwise} \end{cases} \quad (13)$$ 
Then, the inside probabilities of larger substrings are com- 
puted bottom-up relying on the recurrence equations stated 
in Appendix A. This computation takes in the worst 
case $O(|G|^2 N^6)$ time and $O(|G| N^4)$ space for a sentence 
of length N. 
Once the inside probabilities have been computed, we obtain the 
probability of the sentence as follows: 
$$P(w) \stackrel{def}{=} P(t[\$] \stackrel{*}{\Rightarrow} w) = I^w(t, \$, 0, -, -, |w|) \quad (14)$$ 
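The initialization step of the bottom-up computation, the base case for productions introducing lexical items, might be sketched in Python as below. The chart layout (a dictionary keyed by position, node and the four indices, with `None` standing for an unbound index), the `anchors` mapping and the node names are assumptions of this sketch rather than part of the algorithm as published.

```python
from collections import defaultdict

def init_inside(words, anchors):
    """Base case of the inside computation.

    words:   the input string as a list a_1 .. a_N (stored 0-indexed)
    anchors: dict node -> terminal a, one entry per production b[$node] -> a
    """
    inside = defaultdict(float)            # all inside probabilities start at zero
    for i, word in enumerate(words):       # the span (i, i+1) covers exactly a_{i+1}
        for node, terminal in anchors.items():
            if terminal == word:
                inside[("b", node, i, None, None, i + 1)] = 1.0
    return inside

words = ["John", "eats", "peanuts"]
anchors = {"NP_John": "John", "V_eats": "eats", "NP_peanuts": "peanuts"}
inside = init_inside(words, anchors)
```

Larger spans would then be filled in by the recurrences of Appendix A, and the sentence probability read off the entry for the start symbol over the whole string.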
We now consider the problem of re-estimating a SLTAG. 
4. RE-ESTIMATION OF SLTAG 
Given a set of positive example sentences, W = 
{wl...WK}, assumed to have been generated by an un- 
known SLTAG, we would like to compute the probability 
of each rule of a given SLTAG in order to maximize the 
probability that the corpus was generated by this SLTAG. 
An algorithm solving this problem can be used in two dif- 
ferent ways. 
The first use is as a re-estimation algorithm. In this ap- 
proach, the input SLTAG derives structures that are rea- 
sonable according to some criteria (such as a linguistic the- 
ory and some a priori knowledge of the corpus) and the 
intended use of the algorithm is to refine the probability of 
each rule. 
The second use is as a learning algorithm. At the first 
iteration, a SLTAG which generates all possible structures 
over a given set of nodes and terminal symbols is used. 
9This can be seen by observing that for any node on the path from 
the root node to the foot node of an auxiliary tree, the stack remains 
unchanged. 
Initially the probability of each rule is randomly assigned 
and then the algorithm will re-estimate these probabilities. 
Informally speaking, given a first estimate of the parame- 
ters of a SLTAG, the algorithm re-estimates these parame- 
ters on the basis of the parses of each sentence in a training 
corpus obtained by a CKY-type parser. The algorithm derives 
a new estimate such that the probability that the corpus 
was generated by the grammar is increased. By analogy 
to the inside-outside algorithm for stochastic context-free 
grammars \[19, 7\], we believe that the following quantity 
decreases after each iteration: 10 
$$H_e(W) = -\frac{\sum_{w \in W} \log_2 P(w)}{\sum_{w \in W} |w|} \quad (15)$$ 
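The quantity monitored across iterations can be computed as in the following Python sketch. It assumes that (15) is the negated sum of the base-2 log probabilities of the corpus sentences divided by the total number of words, which is an interpretation of the formula rather than a verbatim transcription; the function name and the numbers are hypothetical.

```python
from math import log2

def entropy_estimate(sentence_probs, sentence_lengths):
    """Per-word entropy estimate: -sum(log2 P(w)) / sum(|w|) over the corpus."""
    return -sum(log2(p) for p in sentence_probs) / sum(sentence_lengths)

# Hypothetical corpus of two sentences with probabilities 1/4 and 1/2
# and lengths 2 and 1: -(-2 - 1) / 3 = 1 bit per word.
h_e = entropy_estimate([0.25, 0.5], [2, 1])
```

Under this reading, a higher corpus probability gives a lower value, which matches the expectation that the quantity decreases after each iteration.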
In order to derive a new estimate, the algorithm needs to 
compute for all sentences in W the inside probabilities and 
the outside probabilities. Given a string $w = a_1 \ldots a_N$, 
the outside probability, $O^w(pos, \eta, i, j, k, l)$, is defined for 
all nodes $\eta$ contained in an elementary tree $\alpha$ and for $pos \in 
\{t, b\}$, and for all indices $0 \le i \le j \le k \le l \le N$ as 
follows: 
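(i) If the node $\eta$ does not subsume the foot node of 
$\alpha$ (if there is one), then $j$ and $k$ are unbound and: 
$$O^w(pos, \eta, i, -, -, l) \stackrel{def}{=} P(\exists \gamma \in V_I^* \text{ s.t. } t[\$] \stackrel{*}{\Rightarrow} w_0^i\, pos[\$\gamma\eta]\, w_l^N)$$ 
(ii) If the node $\eta$ does subsume the foot node $\eta_f$ of $\alpha$ 
then: 
$$O^w(pos, \eta, i, j, k, l) \stackrel{def}{=} P(\exists \gamma \in V_I^* \text{ s.t. } t[\$] \stackrel{*}{\Rightarrow} w_0^i\, pos[\$\gamma\eta]\, w_l^N \text{ and } b[\$\gamma\eta_f] \stackrel{*}{\Rightarrow} w_j^k)$$ 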
Once the inside probabilities have been computed, the outside 
probabilities can be computed top-down by considering 
smaller spans of the input string starting with 
$O^w(t, \$, 0, -, -, N) = 1$ (by definition). This is done by 
computing the recurrence equations stated in Appendix B. 
Due to the lack of space, we only illustrate the re-estimation 
of the rules corresponding to adjunction, rules of the form: 
$t[..\eta] \rightarrow t[..\eta\eta']$. The other re-estimation formulae can be 
derived in a similar manner. 
In the following, we assume that $\eta$ subsumes the foot node 
$\eta_f$ within a same elementary tree, and also that $\eta'$ subsumes 
the foot node $\eta'_f$ (within a same elementary tree). 
10 $H_e$ is an estimate of the entropy H of the unknown language 
being estimated and it converges to the entropy of the language as 
the size of the corpus grows. 
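Let: 
$$N^w(t[..\eta] \rightarrow t[..\eta\eta'], i, r, j, k, s, l) = \frac{P(t[..\eta] \rightarrow t[..\eta\eta'])}{P(w)} \times I^w(t, \eta', i, r, s, l) \times I^w(b, \eta, r, j, k, s) \times O^w(t, \eta, i, j, k, l)$$ 
and: 
$$D^w(t, \eta, i, j, k, l) = \frac{I^w(t, \eta, i, j, k, l) \times O^w(t, \eta, i, j, k, l)}{P(w)}$$ 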
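It can be shown that the rule $t[..\eta] \rightarrow t[..\eta\eta']$ is optimally 
reestimated at each iteration as follows: 
$$\hat{P}(t[..\eta] \rightarrow t[..\eta\eta']) = \frac{\displaystyle\sum_{w \in W}\ \sum_{0 \le i \le r \le j \le k \le s \le l \le |w|} N^w(t[..\eta] \rightarrow t[..\eta\eta'], i, r, j, k, s, l)}{\displaystyle\sum_{w \in W}\ \sum_{0 \le i \le j \le k \le l \le |w|} D^w(t, \eta, i, j, k, l)}$$ 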
The denominator of the above reestimation formula estimates 
the probability that a derivation will involve at least 
one expansion of $t[..\eta]$. The numerator estimates the probability 
that a derivation will involve the rule $t[..\eta] \rightarrow t[..\eta\eta']$. 
The probability of no adjunction on the node $\eta$, 
$P(t[..\eta] \rightarrow b[..\eta])$, is reestimated using the equality (9). 
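The update step for one node can be sketched in Python, assuming the numerator and denominator sums of the reestimation formula have already been accumulated over the corpus from the inside and outside probabilities. The function name, the argument layout and the key used for the no-adjunction rule are hypothetical choices for this illustration.

```python
def reestimate_adjunction(expected_counts, expected_expansions):
    """New adjunction probabilities for one node.

    expected_counts:     dict aux_root -> accumulated numerator for that auxiliary tree
    expected_expansions: accumulated denominator for the node
    """
    new_probs = {root: count / expected_expansions
                 for root, count in expected_counts.items()}
    # The no-adjunction probability is the remaining mass, as required by equality (9).
    new_probs["<no adjunction>"] = 1.0 - sum(new_probs.values())
    return new_probs

# Hypothetical accumulated values for a single node.
new_probs = reestimate_adjunction({"VP_hungrily": 0.3, "VP_quickly": 0.1}, 2.0)
```

Deriving the no-adjunction probability from the others keeps the node normalized after every iteration.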
The algorithm reiterates until $H_e(W)$ is unchanged (within 
some epsilon) between two iterations. Each iteration of 
the algorithm requires at most $O(|G|^2 N^6)$ time for each 
sentence of length N. 
5. CONCLUSION 
A novel statistical language model and fundamental algo- 
rithms for this model have been presented. 
SLTAGs provide a stochastic model both hierarchical and 
sensitive to lexical information. They combine the advantages 
of purely lexical models such as N-gram distributions 
or Hidden Markov Models and those of hierarchical 
models such as stochastic context-free grammars without their 
inherent limitations. The parameters of a SLTAG correspond 
to the probability of combining two structures each 
one associated with a word and therefore capture linguisti- 
cally relevant distributions over words. 
An algorithm for computing the probability of a sentence 
generated by a SLTAG was presented as well as an iterative 
algorithm for estimating the parameters of a SLTAG given 
a training corpus of raw text. Similarly to its context-free 
counterpart, the reestimation algorithm can be extended to 
handle partially parsed corpora \[20\]. The worst case complexity 
of the algorithm with respect to the length of the input 
string ($O(N^6)$) makes it impractical with a large corpus 
on a single processor computer for grammars requiring the 
worst case complexity. However, this complexity reduces to 
$O(N^3)$ or to $O(N^2)$ for interesting subsets of SLTAGs. If 
time permits, experiments in this direction will be reported 
at the time of the meeting. 
Furthermore, the techniques explained in this paper apply 
to other grammatical formalisms such as combinatory cat- 
egorial grammars and modified head grammars since they 
have been proven to be equivalent to tree-adjoining gram- 
mars and linear indexed grammars \[21\]. 
In collaboration with Aravind Joshi, Fernando Pereira and 
Stuart Shieber, we are currently investigating additional 
algorithms and applications for SLTAG, methods for lexical 
clustering and automatic construction of a SLTAG from a 
large training corpus. 
REFERENCES 
1. Pratt, F. Secret and urgent, the story of codes and ciphers. 
Blue Ribbon Books, 1942. 
2. Shannon, C. E. A mathematical theory of communication. 
The Bell System Technical Journal, 27(3):379-423, 1948. 
3. Shannon, C. E. Prediction and entropy of printed english. 
The Bell System Technical Journal, 30:50-64, 1951. 
4. Chomsky, N. Syntactic Structures, chapter 2-3, pages 13- 
18. Mouton, 1964. 
5. Booth, T. Probabilistic representation of formal languages. 
In Tenth Annual IEEE Symposium on Switching and Automata 
Theory, October 1969. 
6. Lari, K. and Young, S. J. The estimation of stochastic 
context-free grammars using the Inside-Outside algorithm. 
Computer Speech and Language, 4:35-56, 1990. 
7. Jelinek, F., Lafferty, J. D., and Mercer, R. L. Basic meth- 
ods of probabilistic context free grammars. Technical Re- 
port RC 16374 (72684), IBM, Yorktown Heights, New York 
10598, 1990. 
8. Joshi, A. K. An Introduction to Tree Adjoining Grammars. 
In Manaster-Ramer, A., editor, Mathematics of Language. 
John Benjamins, Amsterdam, 1987. 
9. Schabes, Y., Abeillé, A., and Joshi, A. K. Parsing strategies 
with 'lexicalized' grammars: Application to tree adjoining 
grammars. In Proceedings of the 12th International 
Conference on Computational Linguistics (COLING'88), 
Budapest, Hungary, August 1988. 
10. Schabes, Y. Mathematical and Computational Aspects of 
Lexicalized Grammars. PhD thesis, University of Pennsyl- 
vania, Philadelphia, PA, August 1990. Available as tech- 
nical report (MS-CIS-90-48, LINC LAB179) from the De- 
partment of Computer Science. 
11. Joshi, A. K. and Schabes, Y. Tree-adjoining grammars 
and lexicalized grammars. In Nivat, M. and Podelski, A., 
editors, Definability and Recognizability of Sets of Trees. 
Elsevier, 1991. Forthcoming. 
12. Resnik, P. Lexicalized tree-adjoining grammar for distri- 
butional analysis. In Penn Review of Linguistics, Spring 
1991. 
13. Vijay-Shanker, K. and Weir, D. J. Parsing constrained 
grammar formalisms, 1991. In preparation. 
14. Aho, A. V. Indexed grammars -- An extension to context 
free grammars. J. ACM, 15:647-671, 1968. 
15. Gazdar, G. Applicability of indexed grammars to natural 
languages. Technical Report CSLI-85-34, Center for Study 
of Language and Information, 1985. 
16. Vijay-Shanker, K. A Study of Tree Adjoining Grammars. 
PhD thesis, Department of Computer and Information Sci- 
ence, University of Pennsylvania, 1987. 
17. Schabes, Y. An inside-outside algorithm for estimating 
the parameters of a hidden stochastic context-free gram- 
mar based on Earley's algorithm. Manuscript, 1991. 
18. Vijay-Shanker, K. and Joshi, A. K. Some computational 
properties of Tree Adjoining Grammars. In 23rd Meeting 
of the Association for Computational Linguistics, pages 82-93, 
Chicago, Illinois, July 1985. 
19. Baker, J. Trainable grammars for speech recognition. In 
Wolf, J. J. and Klatt, D. H., editors, Speech communication 
papers presented at the 97th Meeting of the Acoustical 
Society of America, MIT, Cambridge, MA, June 1979. 
20. Pereira, F. and Schabes, Y. Inside-outside reestimation 
from partially bracketed corpora. 1992. Also in these 
proceedings. 
21. Joshi, A. K., Vijay-Shanker, K., and Weir, D. The conver- 
gence of mildly context-sensitive grammatical formalisms. 
In Sells, P., Shieber, S., and Wasow, T., editors, Founda- 
tional Issues in Natural Language Processing. MIT Press, 
Cambridge MA, 1991. 
A. INSIDE PROBABILITIES 
In the following, the inside and outside probabilities are relative 
to the input string $w$. $\mathcal{F}$ stands for the set of foot nodes, 
$\mathcal{S}$ for the set of nodes on which substitution can occur, $\mathcal{R}$ for 
the set of root nodes of initial trees, and $\mathcal{A}$ for the set of non-terminal 
nodes of auxiliary trees. The inside probability can be 
computed bottom-up with the following recurrence equations. 
For all nodes $\eta$ found in an elementary tree, it can be shown 
that: 
1. If $b[\$\eta] \rightarrow a$, $I(b, \eta, i, -, -, l) = 1$ if $l = i + 1$ and if 
$a = w_i^{i+1}$, 0 otherwise. 
2. If $\eta_f \in \mathcal{F}$, $I(b, \eta_f, i, j, k, l) = 1$ if $i = j$ and if 
$k = l$, 0 otherwise. 
3. If $b[..\eta] \rightarrow t[..\eta_1]\, t[\$\eta_2]$: 
$$I(b, \eta, i, j, k, l) = \sum_{m=k}^{l-1} I(t, \eta_1, i, j, k, m) \times I(t, \eta_2, m, -, -, l)$$ 
4. If $b[..\eta] \rightarrow t[\$\eta_1]\, t[..\eta_2]$: 
$$I(b, \eta, i, j, k, l) = \sum_{m=i+1}^{j} I(t, \eta_1, i, -, -, m) \times I(t, \eta_2, m, j, k, l)$$ 
5. If $b[\$\eta] \rightarrow t[\$\eta_1]\, t[\$\eta_2]$: 
$$I(b, \eta, i, -, -, l) = \sum_{m=i+1}^{l-1} I(t, \eta_1, i, -, -, m) \times I(t, \eta_2, m, -, -, l)$$ 
6. For all nodes $\eta$ on which adjunction can be performed: 
$$I(t, \eta, i, j, k, l) = I(b, \eta, i, j, k, l) \times P(t[..\eta] \rightarrow b[..\eta]) + \sum_{r=i}^{j} \sum_{s=k}^{l} \sum_{\eta_1} I(t, \eta_1, i, r, s, l) \times I(b, \eta, r, j, k, s) \times P(t[..\eta] \rightarrow t[..\eta\eta_1])$$ 
7. For all nodes $\eta \in \mathcal{S}$: 
$$I(t, \eta, i, -, -, l) = \sum_{\eta_1} I(t, \eta_1, i, -, -, l) \times P(t[\$\eta] \rightarrow t[\$\eta_1])$$ 
8. $$I(t, \$, i, -, -, l) = \sum_{\eta} I(t, \eta, i, -, -, l) \times P(t[\$] \rightarrow t[\$\eta])$$ 
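Recurrence 5, the case in which neither child subsumes a foot node, has the same shape as the inside step for a binary context-free rule and can be sketched in Python as follows. The chart layout (a dictionary keyed by node and span, restricted to 'top' entries without a foot) and the node names are assumptions made for this illustration.

```python
def inside_binary(inside_t, eta1, eta2, i, l):
    """Sum over split points m of I(t, eta1, i, m) * I(t, eta2, m, l).

    inside_t maps (node, start, end) to the inside probability of the 'top'
    of a node that does not subsume a foot node; missing entries count as zero.
    """
    return sum(inside_t.get((eta1, i, m), 0.0) * inside_t.get((eta2, m, l), 0.0)
               for m in range(i + 1, l))

# Hypothetical chart entries for a span of three words.
inside_t = {("NP", 0, 1): 0.5, ("VP", 1, 3): 0.4,
            ("NP", 0, 2): 0.1, ("VP", 2, 3): 0.2}
value = inside_binary(inside_t, "NP", "VP", 0, 3)   # 0.5*0.4 + 0.1*0.2
```

The recurrences for nodes on the spine follow the same pattern, with the two foot indices carried along by whichever child subsumes the foot node.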
B. OUTSIDE PROBABILITIES 
The outside probabilities can be computed top-down recursively 
over smaller spans of the input string once the inside probabilities 
have been computed. First, by definition we have: 
$O(t, \$, 0, -, -, N) = 1$. The following recurrence equations hold 
for all nodes $\eta$ found in an elementary tree. 
1. If $\eta \in \mathcal{R}$, $O(t, \eta, 0, -, -, N) = P(t[\$] \rightarrow t[\$\eta])$. 
And for all $(i, j) \ne (0, N)$: 
$$O(t, \eta, i, -, -, j) = \sum_{\eta_0} O(t, \eta_0, i, -, -, j) \times P(t[\$\eta_0] \rightarrow t[\$\eta])$$ 
2. If $\eta$ is an interior node which subsumes the foot node of 
the elementary tree it belongs to: 
$$O(t, \eta, i, j, k, l) = \sum_{q=l+1}^{N} O(b, \eta_0, i, j, k, q) \times I(t, \eta_2, l, -, -, q) \times P(b[..\eta_0] \rightarrow t[..\eta]\, t[\$\eta_2])$$ 
$$+ \sum_{p=0}^{i-1} O(b, \eta_0, p, j, k, l) \times I(t, \eta_1, p, -, -, i) \times P(b[..\eta_0] \rightarrow t[\$\eta_1]\, t[..\eta])$$ 
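3. If $\eta$ is an interior node which does not subsume the 
foot node of the elementary tree it belongs to, we have: 
$$O(t, \eta, i, -, -, l) = \sum_{q=l+1}^{N} O(b, \eta_0, i, -, -, q) \times I(t, \eta_2, l, -, -, q) \times P(b[\$\eta_0] \rightarrow t[\$\eta]\, t[\$\eta_2])$$ 
$$+ \sum_{p=0}^{i-1} O(b, \eta_0, p, -, -, l) \times I(t, \eta_1, p, -, -, i) \times P(b[\$\eta_0] \rightarrow t[\$\eta_1]\, t[\$\eta])$$ 
$$+ \sum_{j=l}^{N} \sum_{k=j}^{N} \sum_{q=k}^{N} O(b, \eta_0, i, j, k, q) \times I(t, \eta_2, l, j, k, q) \times P(b[..\eta_0] \rightarrow t[\$\eta]\, t[..\eta_2])$$ 
$$+ \sum_{p=0}^{i} \sum_{j=p}^{i} \sum_{k=j}^{i} O(b, \eta_0, p, j, k, l) \times I(t, \eta_1, p, j, k, i) \times P(b[..\eta_0] \rightarrow t[..\eta_1]\, t[\$\eta])$$ 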
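4. If $\eta \in \mathcal{A}$, then: 
$$O(t, \eta, i, j, k, l) = \sum_{\eta_0} \sum_{p=j}^{k-1} \sum_{q=p+1}^{k} O(t, \eta_0, i, p, q, l) \times I(b, \eta_0, j, p, q, k) \times P(t[..\eta_0] \rightarrow t[..\eta_0\eta])$$ 
$$+ \sum_{\eta_0} O(t, \eta_0, i, -, -, l) \times I(b, \eta_0, j, -, -, k) \times P(t[\$\eta_0] \rightarrow t[\$\eta_0\eta])$$ 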
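5. If $\eta$ is a node which subsumes the foot node of the elementary 
tree it belongs to, we have: 
$$O(b, \eta, i, j, k, l) = O(t, \eta, i, j, k, l) \times P(t[..\eta] \rightarrow b[..\eta]) + \sum_{\eta_0} \sum_{p=0}^{i} \sum_{q=l}^{N} O(t, \eta, p, j, k, q) \times I(t, \eta_0, p, i, l, q) \times P(t[..\eta] \rightarrow t[..\eta\eta_0])$$ 
6. And finally, if $\eta$ is a node which does not subsume 
the foot node of the elementary tree it belongs to: 
$$O(b, \eta, i, -, -, l) = O(t, \eta, i, -, -, l) \times P(t[\$\eta] \rightarrow b[\$\eta]) + \sum_{\eta_0} \sum_{p=0}^{i} \sum_{q=l}^{N} O(t, \eta, p, -, -, q) \times I(t, \eta_0, p, i, l, q) \times P(t[\$\eta] \rightarrow t[\$\eta\eta_0])$$ 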
