PROBABILISTIC TREE-ADJOINING GRAMMAR AS A FRAMEWORK 
FOR STATISTICAL NATURAL LANGUAGE PROCESSING 
Philip Resnik* 
Department of Computer and Information Science 
University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA 
resnik@linc.cis.upenn.edu
Abstract 
In this paper, I argue for the use of a probabilistic 
form of tree-adjoining grammar (TAG) in statistical 
natural language processing. I first discuss two previous
statistical approaches -- one that concentrates on
the probabilities of structural operations, and another
that emphasizes co-occurrence relationships between
words. I argue that a purely structural approach, exemplified
by probabilistic context-free grammar, lacks
sufficient sensitivity to lexical context, and, conversely,
that lexical co-occurrence analyses require a richer notion
of locality that is best provided by importing some
notion of structure.
I then propose probabilistic TAG as a framework
for statistical language modelling, arguing that it provides
an advantageous combination of structure, locality,
and lexical sensitivity. Issues in the acquisition of
probabilistic TAG and parameter estimation are briefly
considered.
1 Probabilistic CFGs 
Thanks to the increased availability of text corpora,
fast computers, and inexpensive off-line storage, statistical
approaches to issues in natural language processing
are enjoying a surge of interest, across a wide
variety of applications. As research in this area progresses,
the question of how to combine our existing
knowledge of grammatical methods (e.g. generative
power, efficient parsing) with developments in statistical
and information-theoretic methods (especially techniques
imported from the speech-processing community)
takes on increasing significance.
Perhaps the most straightforward combination of 
grammatical and statistical techniques can be found in 
the probabilistic generalization of context-free gram- 
*This research was supported by the following grants: ARO
DAAL 03-89-C-0031, DARPA N00014-90-J-1863, NSF IRI 90-16592,
and Ben Franklin 91S.3078C-1. I would like to thank
Aravind Joshi, Yves Schabes, and members of the CLIFF group
at Penn for helpful discussion.
(S1)  S  → NP VP   (0.5)
(S2)  S  → S PP    (0.35)
(S3)  S  → V NP    (0.15)
(VP1) VP → V NP    (0.6)
(VP2) VP → V PP    (0.4)

Figure 1: Fragment of a context-free grammar.
S
NP VP          (r1 = S → NP VP)
N VP           (r2 = NP → N)
we VP          (r3 = N → we)
we V NP        (r4 = VP → V NP)
we like NP     (r5 = V → like)
we like N      (r6 = NP → N)
we like Mary   (r7 = N → Mary)

Figure 2: A context-free derivation.
mars. [Jelinek et al., 1990] characterize a probabilistic
context-free grammar (PCFG) as a context-free grammar
in which each production has been assigned a probability
of use. I will refer to the entire set of probabilities
as the statistical parameters, or simply parameters,
of the probabilistic grammar.
For example, the productions in Figure 1, together
with their associated parameters, might comprise a
fragment of a PCFG for English. Note that for each
nonterminal symbol, the probabilities of its alternative
expansions sum to 1. This captures the fact that in
a context-free derivation, each instance of a nonterminal
symbol must be rewritten according to exactly one
of its expansions. Also by definition of a context-free
derivation, each rewriting is independent of the context
within which the nonterminal appears. So, for
example, in the (leftmost) derivation of We like Mary
(Figure 2), the expansions N → we and VP → V NP
are independent events.
The probability of n independent events is 
merely the product of the probabilities of each individual
event. Therefore the probability of a context-free
derivation with rewritings r1, r2, ..., rn is
Pr(r1)Pr(r2) ··· Pr(rn).
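To make the arithmetic concrete, the following minimal Python sketch computes the probability of the Figure 2 derivation as such a product. The probabilities for the NP, N, and V expansions are invented for illustration, since Figure 1 specifies parameters only for the S and VP rules:

```python
# Minimal sketch: the probability of the leftmost derivation of "We like
# Mary" (Figure 2) as a product of rule probabilities. The NP, N, and V
# probabilities below are invented; Figure 1 gives only the S and VP rules.
from functools import reduce

rule_prob = {
    ("S",  ("NP", "VP")): 0.5,   # (S1), from Figure 1
    ("VP", ("V", "NP")):  0.6,   # (VP1), from Figure 1
    ("NP", ("N",)):       0.7,   # assumed
    ("N",  ("we",)):      0.1,   # assumed
    ("N",  ("Mary",)):    0.05,  # assumed
    ("V",  ("like",)):    0.2,   # assumed
}

derivation = [                   # r1 ... r7 of Figure 2
    ("S",  ("NP", "VP")),
    ("NP", ("N",)),
    ("N",  ("we",)),
    ("VP", ("V", "NP")),
    ("V",  ("like",)),
    ("NP", ("N",)),
    ("N",  ("Mary",)),
]

# Each rewriting is an independent event, so the probabilities multiply.
p = reduce(lambda acc, r: acc * rule_prob[r], derivation, 1.0)
print(p)  # Pr(r1) * Pr(r2) * ... * Pr(r7)
```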
Jelinek et al. use this fact to develop a set of efficient 
algorithms (based on the CKY parsing algorithm) to 
compute the probability of a sentence given a grammar, 
to find the most probable parse of a given sentence, to 
compute the probability of an initial substring leading 
to a sentence generated by the grammar, and, following
[Baker, 1979], to estimate the statistical parameters for
a given grammar using a large text corpus. 
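As a rough illustration of the first of these computations, the sketch below implements the inside (CKY-style) dynamic program for the probability of a sentence under a PCFG in Chomsky normal form. The toy grammar and its probabilities are assumptions for the example, not Figure 1 or Jelinek et al.'s grammar:

```python
# Sketch: Pr(sentence) under a toy PCFG in Chomsky normal form, computed
# with the inside (CKY-style) dynamic program. Grammar and probabilities
# are invented for illustration.
from collections import defaultdict

binary  = {("S", "NP", "VP"): 1.0,   # Pr(S -> NP VP) = 1
           ("VP", "V", "NP"): 1.0}   # Pr(VP -> V NP) = 1
lexical = {("NP", "we"): 0.4, ("NP", "Mary"): 0.3, ("V", "like"): 0.5}

def sentence_prob(words, start="S"):
    n = len(words)
    inside = defaultdict(float)      # inside[(i, j, A)] = Pr(A =>* words[i:j])
    for i, w in enumerate(words):    # width-1 spans: lexical rules
        for (A, word), p in lexical.items():
            if word == w:
                inside[(i, i + 1, A)] += p
    for width in range(2, n + 1):    # wider spans: binary rules
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    inside[(i, j, A)] += p * inside[(i, k, B)] * inside[(k, j, C)]
    return inside[(0, n, start)]

print(sentence_prob(["we", "like", "Mary"]))  # 0.4 * (1.0 * 0.5 * 0.3) = 0.06
```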
Now, the definition of the probability of a derivation
that is used for PCFG depends crucially upon the
context-freeness of the grammar; without it, the independence
assumption that permits us to simply multiply
probabilities is compromised. Yet experience tells
us that rule expansions are not, in general, context-free.
Consider the following simple example. According to
one frequency estimate of English usage [Francis and
Kucera, 1982], he is more than three times as likely
as we to appear as a subject pronoun. So if we is replaced
with he in Figure 2, the new derivation (of He
like Mary) is accorded far higher probability according
to the PCFG, even though He like Mary is not English.
The problem extends beyond such obvious cases
of agreement: since crushing happens to be more likely
than mashing, the probability of deriving He's crushing
potatoes is greater than the probability of the corresponding
derivation for He's mashing potatoes, although we expect
the latter to be more likely.
In short, lexical context matters. Although probabilistic
context-free grammar captures the fact that not
all nonterminal rewritings are equally likely, its insensitivity
to lexical context makes it less than adequate
as a statistical model of natural language,¹ and weakens
Jelinek et al.'s contention that "in an ambiguous
but appropriately chosen probabilistic CFG (PCFG),
correct parses are high probability parses" (p. 2). In
practical terms, it also suggests that techniques for incorporating
lexical context-sensitivity will potentially
improve PCFG performance.
2 Co-occurrence Statistics 
A second common approach to the corpus-based investigation
of natural language considers instances of
words occurring adjacent to each other in the text, or,
more generally, words occurring together within a window
of fixed size. For example, [Church and Hanks,
1989] use an information-theoretic measure, mutual
information, to calculate the degree of association of
word pairs based upon their co-occurrence throughout
a large corpus.

¹Independent of the statistical issue, of course, it is also generally
accepted that CFGs are generatively too weak to model
natural language in its full generality [Shieber, 1985].
The mutual information between two
events is defined as

    I(x; y) = log [ Pr(x, y) / (Pr(x) Pr(y)) ]

where, for the purposes of corpus analysis, Pr(x) and
Pr(y) are the respective probabilities of words x and y
appearing within the corpus, and Pr(x, y) is the probability
of word x being followed by word y within a window
of w words. Intuitively, mutual information relates
the actual probability of seeing x and y together (numerator)
to the probability of seeing them together under
the assumption that they were independent events
(denominator).
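The sketch below shows how these quantities might be estimated from raw counts; the toy corpus and the maximum-likelihood estimates are illustrative assumptions, not Church and Hanks's implementation:

```python
# Sketch: mutual information from co-occurrence counts within a window of
# w words, using maximum-likelihood estimates (an illustrative choice).
import math
from collections import Counter

def mutual_information(tokens, x, y, w=3):
    """I(x; y) = log2(Pr(x, y) / (Pr(x) Pr(y))), where Pr(x, y) is the
    probability of x being followed by y within a window of w words."""
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + w, n)):  # the w - 1 following words
            pairs[(t, tokens[j])] += 1
    px, py = unigrams[x] / n, unigrams[y] / n
    pxy = pairs[(x, y)] / n
    return math.log2(pxy / (px * py)) if pxy > 0 else float("-inf")

toy_corpus = "students spend money students who spend money save nothing".split()
print(mutual_information(toy_corpus, "students", "spend", w=3))
```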
As defined here, the calculation of mutual information
takes into account only information about co-occurrence
within surface strings. A difficulty with
such an approach, however, is that co-occurrences of
interest may not be sufficiently local. Although sentence
(1)a can witness the co-occurrence of students and
spend within a small window (say, w = 2 or 3), sentence
(1)b cannot.

(1)a. Students spend a lot of money.
   b. Students who attend conferences spend a lot
      of money.
Simply increasing the window size will not suffice, for
two reasons. First, there is no bound on the length
of relative clauses such as the one in (1)b, hence no
fixed value of w that will solve the problem. Second,
the choice of window size depends on the application
-- Church and Hanks write that "smaller window sizes
will identify fixed expressions (idioms) and other relations
that hold over short ranges; larger window sizes
will highlight semantic concepts and other relationships
that hold over larger scales" (p. 77). Extending the
window size in order to capture co-occurrences such as
the one found in (1)b may therefore undermine other
goals of the analysis.
[Brown et al., 1991] encounter a similar problem.
Their application is statistics-driven machine translation,
where the statistics are calculated on the basis of
trigrams (observed triples) of words within the source
and target corpora. As in (1), sentences (2)a and (3)a
illustrate a difficulty encountered when limiting window
size. Light verb constructions are an example of
a situation in which the correct translation of the verb
depends on the identity of its object. The correct word
sense of prendre, "to make," is chosen in (2)a, since
the dependency signalling that word sense -- between
prendre and its object, décision -- falls within a trigram
window and thus within the bounds of their language
model.

(2)a. prendre la décision
b. make the decision 
(3)a. prendre une difficile décision
b. *take a difficult decision 
In (3)a, however, the dependency is not sufficiently local:
the model has no access to lexical relationships
spanning a distance of more than three words. Without
additional information about the verb's object, the
system relies on the fact that the sense of prendre as "to
take" occurs with greater frequency, and the incorrect
translation results.
Brown et al. propose to solve the problem by seeking
clues to a word's sense within a larger context.
Each possible clue, or informant, is a site relative to the
word. For prendre, the informant of interest is "the first
noun to the right"; other potential informants include
"the first verb to the right," and so on. A set of such potential
informants is defined in advance, and statistical
techniques are used to identify, for each word, which
potential informant contributes most to determining
the sense of the word to use.
It seems clear that in most cases, Brown et al.'s 
informants represent approximations to structural re- 
lationships such as subject and object. Examples (1) 
through (3) suggest that attention to structure would 
have advantages for the analysis of lexical relationships. 
By using co-occurrence within a structure rather than 
co-occurrence within a window, it is possible to cap- 
ture lexical relationships without regard to irrelevant 
intervening material. One way to express this notion 
is by saying that structural relationships permit us to 
use a notion of locality that is more general than sim- 
ple distance in the surface string. So, for example, (1) 
demonstrates that the relationship between a verb and 
its subject is not affected by a subject relative clause
of any length. 
3 Hybrid Approaches 
In the previous sections I have observed that probabilistic
CFGs, though capable of capturing the relative
likelihoods of context-free derivations, require greater
sensitivity to lexical context in order to model natural
languages. Conversely, lexical co-occurrence analyses
seem to require a richer notion of locality, something
that structural relationships such as subject and object
can provide. In this section I briefly discuss two proposals
that could be considered "hybrid" approaches,
making use of both grammatical structure and lexical
co-occurrence.
The first of these, [Hindle, 1990], continues in the
direction suggested at the end of the previous section:
rather than using "informants" to approximate structural
relationships, lexical co-occurrence is calculated
over structures. Hindle uses co-occurrence statistics,
collected over parse trees, in order to classify nouns
on the basis of the verb contexts in which they appear.
First, a robust parser is used to obtain a parse
tree (possibly partial) for each sentence. For example,
the following table contains some of the information
Hindle's parser retrieved from the parse of the sentence
"The clothes we wear, the food we eat, the air
we breathe, the water we drink, the land that sustains
us, and many of the products we use are the result of
agricultural research."
verb      subject   object
eat       we        food
breathe   we        air
drink     we        water
sustain   land      us
Next, Hindle calculates a similarity measure based
on the mutual information of verbs and their arguments:
nouns that tend to appear as subjects and objects
of the same verbs are judged more similar. According
to the criterion that two nouns be reciprocally
most similar to each other, Hindle's analysis
derives such plausible pairs of similar nouns as ruling
and decision, battle and fight, researcher and scientist.
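The simplified sketch below conveys the flavor of such a computation. The similarity measure -- summing, over shared verb contexts, the smaller of the two nouns' mutual information scores with that context -- is only an approximation of Hindle's measure, and the data are invented:

```python
# Simplified sketch of noun similarity from verb-argument co-occurrence.
# Similarity here sums, over shared (verb, slot) contexts, the smaller of
# the two nouns' pointwise mutual information scores; this approximates
# rather than reproduces Hindle's measure.
import math
from collections import Counter

triples = [  # (verb, slot, noun) tuples retrieved from parsed text
    ("eat", "subj", "we"), ("eat", "obj", "food"),
    ("breathe", "subj", "we"), ("breathe", "obj", "air"),
    ("drink", "subj", "we"), ("drink", "obj", "water"),
    ("sustain", "subj", "land"), ("sustain", "obj", "us"),
    ("drink", "obj", "air"),  # an extra observation, for illustration
]

n = len(triples)
ctx_counts  = Counter((v, s) for v, s, _ in triples)
noun_counts = Counter(noun for _, _, noun in triples)
pair_counts = Counter(((v, s), noun) for v, s, noun in triples)

def pmi(ctx, noun):
    p_joint = pair_counts[(ctx, noun)] / n
    if p_joint == 0:
        return 0.0
    return math.log2(p_joint / ((ctx_counts[ctx] / n) * (noun_counts[noun] / n)))

def similarity(n1, n2):
    shared = {c for (c, noun) in pair_counts if noun == n1} & \
             {c for (c, noun) in pair_counts if noun == n2}
    return sum(min(pmi(c, n1), pmi(c, n2)) for c in shared)

print(similarity("air", "water"))  # both appear as the object of 'drink'
```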
A proposal in [Magerman and Marcus, 1991] relates
to probabilistic parsing. As in PCFG, Magerman
and Marcus associate probabilities with the rules of a
context-free grammar, but these probabilities are conditioned
on the contexts within which a rule is observed
to be used. It is in the formulation of "context" that
lexical co-occurrence plays a role, albeit an indirect one.

Magerman and Marcus's Pearl parser is essentially
a bottom-up chart parser with Earley-style top-down
prediction. When a context-free rule A → A1 ... An
is proposed as an entry in the chart spanning input
symbols ai through aj, it is assigned a "score" based
on

1. the rule that generated this instance of nonterminal
   symbol A, and

2. the part-of-speech trigram centered at ai.
For example, given the input My first love was named
Pearl, a proposed chart entry VP → V NP starting
its span at love (i.e., a theory trying to interpret
love as a verb) would be scored on the basis of the
rule that generated the VP (in this case, probably
S → NP VP) together with the part-of-speech trigram
"adjective verb verb." This score would be lower than
that of a different chart entry interpreting love as a
noun, since the latter would be scored using the more
likely part-of-speech trigram "adjective noun verb".
So, although the context-free probability favors the interpretation
of love as the beginning of a verb phrase,
information about the lexical context of the word rescues
the correct interpretation, in which it is categorized
as a noun.
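The general shape of such context-conditioned scoring is sketched below; the names, table entries, and numbers are invented for illustration and do not reproduce Pearl's actual statistics:

```python
# Sketch: a rule probability conditioned on (a) the rule that generated the
# parent nonterminal and (b) the part-of-speech trigram centered at the
# first word of the proposed span. All entries are invented.
from collections import defaultdict

cond_prob = defaultdict(float)
cond_prob[("VP -> V NP", "S -> NP VP", ("adj", "verb", "verb"))] = 0.01
cond_prob[("NP -> Adj N", "S -> NP VP", ("adj", "noun", "verb"))] = 0.20

def score(rule, parent_rule, pos_trigram):
    """Score of a proposed chart entry in its context; unseen combinations
    default to zero in this sketch (a real model would smooth them)."""
    return cond_prob[(rule, parent_rule, pos_trigram)]

# "My first love was named Pearl": the verb reading of 'love' is scored
# against the unlikely trigram, the noun reading against the likely one.
print(score("VP -> V NP", "S -> NP VP", ("adj", "verb", "verb")))   # low
print(score("NP -> Adj N", "S -> NP VP", ("adj", "noun", "verb")))  # higher
```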
Although Pearl does not take into account the relationships
between particular words, it does represent
a promising combination of CFG-based probabilistic
parsing and corpus statistics calculated on the basis of
simple (trigram) co-occurrences.
A difficulty with the hybrid approaches, however,
is that they leave unclear exactly what the statistical
model of the language is. In probabilistic context-free
grammar and in surface-string analysis of lexical co-occurrence,
there is a precise definition of the event
space -- what events go into making up a sentence.
(As discussed earlier, the events in a PCFG derivation
are the rule expansions; the events in a window-based
analysis of surface strings can be viewed as transitions
in a finite Markov chain modelling the language.) The
absence of such a characterization in the hybrid approaches
makes it more difficult to identify what assumptions
are being made, and gives such work a decidedly
empirical flavor.
4 Probabilistic TAG 
4.1 Lexicalized TAG 
The previous sections have demonstrated a need for a
well-defined statistical framework in which both syntactic
structure and lexical co-occurrences are incorporated.
In this section, I argue that a probabilistic
form of lexicalized tree-adjoining grammar provides
just such a seamless combination. I begin with a brief
description of lexicalized tree-adjoining grammar, and
then present its probabilistic generalization and the advantages
of the resulting formalism.

Tree-adjoining grammar [Joshi et al., 1975], or
TAG, is a generalization of context-free grammar that
has been proposed as a useful formalism for the study
of natural languages. A tree-adjoining grammar comprises
two sets of elementary structures: initial trees
and auxiliary trees. (See Figure 3, in which initial and
auxiliary trees are labelled using α and β, respectively.)
An auxiliary tree, by definition, has a nonterminal node
on its frontier -- the foot node -- that matches the
nonterminal symbol at its root.
These elementary structures can be combined using
two operations, substitution and adjunction. The
substitution operation corresponds to the rewriting of
a symbol in a context-free derivation: a nonterminal
node at the frontier of a tree is replaced with an initial
tree having the same nonterminal symbol at its root.
So, for example, one could expand either NP in α1 by
rewriting it as tree α2.
[Figure 3 here: four elementary trees -- initial tree α1, anchored by
transitive eat (an S dominating a subject NP and a VP containing V and
an object NP); initial NP trees α2 and α3, anchored by the nouns peanuts
and people; and auxiliary tree β1, an N tree with an Adj node (roasted)
above an N foot node.]

Figure 3: Examples of TAG elementary trees.
Figure 4: Adjunction of auxiliary tree β into tree α at
node η, resulting in derived structure γ.
The adjunction operation is a generalization of substitution
that permits internal as well as frontier nodes
to be expanded. One adjoins an auxiliary tree β into
a tree α at node η by "splitting" α horizontally at η
and then inserting β. In the resulting structure, γ, β's
root node appears in η's original position, and its foot
node dominates all the material that η dominated previously
(see Figure 4). For example, Figure 5 shows γ1,
the result of adjoining auxiliary tree β1 at the N node
of α2.
For context-free grammar, each derived structure
is itself a record of the operations that produced it:
one can simply read from a parse tree the context-free
rules that were used in the parse. In contrast, a derived
structure in a tree-adjoining grammar is distinct from
its derivation history. The final parse tree encodes the
structure of the sentence according to the grammar,
but the events that constitute the derivation history
-- that is, the substitutions and adjunctions that took
place -- are not directly encoded. The significance of
this distinction will become apparent shortly.
A lexicalized tree-adjoining grammar is a TAG in
which each elementary structure (initial or auxiliary
tree) has a lexical item on its frontier, known as its
anchor [Schabes, 1990]. Another natural way to say
this is that each lexical item has associated with it
[Figure 5 here: the derived NP tree γ1, in which the N node of α2 now
dominates Adj roasted and N peanuts.]

Figure 5: Result of adjoining β1 into α2.
a set of structures that, taken together, characterize
the contexts within which that item can appear.² For
example, tree α1 in Figure 3 represents one possible
structural context for eat; another would be an initial
tree in which eat appears as an intransitive verb, and
yet another would be a tree containing eat within a
passive construction.
4.2 Features of Probabilistic TAG 
Three features of lexicalized TAG make it particularly
appropriate as a probabilistic framework for natural
language processing. First, since every tree is associated
with a lexical anchor, words and their associated
structures are tightly linked. Thus, unlike probabilistic
context-free grammar, the probabilities associated with
structural operations are sensitive to lexical context. In
particular, the probability of an adjunction step such
as the one that produced γ1, above, is sensitive to the
lexical anchors of the trees. One would expect a similar
adjunction of β1 into α3 (resulting in the string roasted
people) to have extremely low probability, reflecting the
low degree of association between the lexical items involved.
Similarly, tree α3 is far more likely than α2 to
be substituted as the subject NP in tree α1, since people
is more likely than peanuts to appear as the subject
of eat.
Notice that this attention to lexical context is not
acquired at the expense of the independence assumption
for probabilities. Just as expansion of a nonterminal
node in a CFG derivation takes place without
regard to that node's history, substitution or adjunction
to a node during a TAG derivation takes place
without regard to that node's history.³ Thus it will be
straightforward to express the probability of a probabilistic
TAG derivation as a product of probabilities,
just as was the case for PCFG, yielding a well-defined
statistical model.
Second, since TAG factors local dependencies from
recursion, the problem of intervening material in
window-based lexical co-occurrence analyses does not
arise. Notice that in tree α1, the verb, its subject, and
its object all appear together within a single elementary
structure. The introduction of recursive substructure
-- relative clauses, adverbs, strings of adjectival modifiers
-- is done entirely by means of the adjunction
operation, as illustrated by Figures 4 and 5. This fact
provides a principled treatment of examples such as (4):
the probability of substituting α2 as the object NP of
α1 (capturing an association between eat and peanuts)
is independent of any other operations that might take
place, such as the introduction (via adjunction) of an
adjectival modifier. Similarly, the probability of substituting
α3 as the subject NP of α1 is not affected by the
subsequent adjunction of a relative clause (cf. example
(1)). Thus, in contrast to a co-occurrence analysis
based on strings, the analysis of (4) within probabilistic
TAG finds precisely the same relationship between
the verb and its arguments in (4)b and (4)c as it does
in (4)a.

(4)a. People eat peanuts.
   b. People eat roasted peanuts.
   c. People who are visiting the zoo eat roasted
      peanuts.

²Lexicalized TAG is only one of many such grammar formalisms,
of course.
³Schabes (personal communication) captures the generalization
quite nicely in observing that for both CFG derivation trees
and TAG derivation histories, the path sets (sets of possible paths
from root to frontier) are regular.
Third, the notion of lexical anchor in lexicalized
TAG has been generalized to account for multi-word
lexical entries [Abeillé and Schabes, 1989]. Thus the
formalism appears to satisfy the criteria of [Smadja and
McKeown, 1990], who write of the need for "a flexible
lexicon capable of using single word entries, multiple
word entries as well as phrasal templates and a mechanism
that would be able to gracefully merge and combine
them with other types of constraints" (p. 256).
Among "other types of constraints" are linguistic criteria,
and here, too, TAG offers the potential for capturing
the details of linguistic and statistical facts in a
uniform way [Kroch and Joshi, 1985].
4.3 Formalization of Probabilistic TAG 
There are a number of ways to express lexicalized
tree-adjoining grammar as a probabilistic grammar
formalism.⁴ Here I propose what appears to me to
be the most direct probabilistic generalization of lexicalized
TAG; a different treatment can be found in
[Schabes, 1992].
Definitions: 
Let I denote the set of initial trees in the grammar,
and A the set of auxiliary trees.

Each tree in a lexicalized TAG has a (possibly
empty) subset of its frontier nodes marked as
nodes at which substitution may take place. Given
a tree α, let that subset be denoted by s(α).

Adjunction may take place at any node η in α
labelled by a nonterminal symbol, so long as η ∉
s(α). Denote this set of possible adjunction nodes
a(α).⁵
Let S(α, α', η) denote the event of substituting tree
α' into tree α at node η.

Let A(α, β, η) denote the event of adjoining auxiliary
tree β into tree α at node η, and let
A(α, none, η) denote the event in which no adjunction
is performed at that node.

Let Ω denote the set of all substitution and adjunction
events.

⁴The idea of defining probabilities over derivations involving
combinations of elementary structures was introduced as early
as [Joshi, 1973].
⁵Note that substitution nodes and adjunction nodes are not
distinguished in Figure 3.
A probabilistic tree-adjoining grammar is a 5-tuple
(I, A, P_I, P_S, P_A), where I and A are defined as above,
P_I is a function from I to the real interval [0,1], and
P_S and P_A are functions from Ω to [0,1], such that:

1. Σ_{α∈I} P_I(α) = 1

2. ∀α ∈ I∪A, ∀η ∈ s(α):  Σ_{α'∈I} P_S(S(α, α', η)) = 1

3. ∀α ∈ I∪A, ∀η ∈ a(α):  Σ_{β∈A∪{none}} P_A(A(α, β, η)) = 1

P_I(α) is interpreted as the probability that a derivation
begins with initial tree α. P_S(S(α, α', η)) denotes the
probability of substituting α' at node η of tree α, and
P_A(A(α, β, η)) denotes the probability of adjoining β
at node η of tree α (where P_A(A(α, none, η)) denotes
the probability that no adjunction takes place at node
η of α).
A TAG derivation is described by the initial tree
with which it starts, together with the sequence of
substitution and adjunction operations that then take
place. Denoting each operation as op(α1, α2, η), op ∈
{S, A}, and denoting the initial tree with which the
derivation starts as α0, the probability of a TAG derivation
τ = (α0, op_1(...), ..., op_n(...)) is

    Pr(τ) = P_I(α0) ∏_{1≤i≤n} P_{op_i}(op_i(...))

This definition is directly analogous to the probability
of a context-free derivation τ = (r_1, ..., r_n),

    Pr(τ) = P_I(S) ∏_{1≤i≤n} P(r_i),

though in a CFG every derivation starts with S and so
P_I(S) ≡ 1.
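A direct transcription of this definition might look as follows; the parameter values are placeholders standing in for estimated probabilities (see Section 4.4), and the event encoding is one convenient choice among many:

```python
# Sketch: Pr(tau) = P_I(alpha_0) * prod_i P_op_i(op_i(...)) for a
# probabilistic TAG derivation. Parameter values are placeholders.
P_I = {"alpha1": 0.3}                            # initial-tree probabilities
P_S = {("alpha1", "alpha3", "NP_subj"): 0.05,    # substitution events
       ("alpha1", "alpha2", "NP_obj"):  0.02}
P_A = {("alpha2", "beta1", "N"): 0.01,           # adjunction events
       ("alpha2", None, "N"):    0.90}           # None = no adjunction

def derivation_prob(alpha0, operations):
    """operations: list of (op, alpha, other, node) with op in {'S', 'A'}."""
    p = P_I[alpha0]
    for op, alpha, other, node in operations:
        table = P_S if op == "S" else P_A
        p *= table[(alpha, other, node)]   # independent events multiply
    return p

tau = [("S", "alpha1", "alpha3", "NP_subj"),   # People ...
       ("S", "alpha1", "alpha2", "NP_obj"),    # ... peanuts
       ("A", "alpha2", "beta1", "N")]          # ... roasted
print(derivation_prob("alpha1", tau))
```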
Does probabilistic TAG, thus formalized, behave as
desired? Returning to example (4), consider the following
derivation history for the sentence People eat
roasted peanuts:

(α1,
 S(α1, α3, NP_subj),
 S(α1, α2, NP_obj),
 A(α2, β1, N))
Notice first that, given intuitively reasonable estimates
for the probabilities of substitution and adjunction,
the probability of this derivation would indeed
be far greater than for the corresponding derivation of
Peanuts eat roasted people. Thus, in contrast to PCFG,
probabilistic TAG's sensitivity to lexical context does
provide a more accurate language model. In addition,
were this derivation to be used as an observation for
the purpose of estimating probabilities, the estimate
of P_S(S(α1, α3, NP_subj)) would be unaffected by the
event A(α2, β1, N). That is, the model would in fact
capture a relationship between the verb eat and its object
peanuts, mediated by the trees that they anchor,
regardless of intervening material in the surface string.
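Under the idealized assumption that derivation histories like this one are fully observed, parameter estimation reduces to relative-frequency counting, as the following sketch illustrates (the event names reuse those of the example above):

```python
# Sketch: relative-frequency estimates of substitution probabilities from
# fully observed derivation histories. Adjunctions elsewhere in a history,
# e.g. A(alpha2, beta1, N), do not affect the estimate for a substitution
# event such as S(alpha1, alpha3, NP_subj).
from collections import Counter

histories = [
    [("S", "alpha1", "alpha3", "NP_subj"),   # People eat peanuts.
     ("S", "alpha1", "alpha2", "NP_obj")],
    [("S", "alpha1", "alpha3", "NP_subj"),   # People eat roasted peanuts.
     ("S", "alpha1", "alpha2", "NP_obj"),
     ("A", "alpha2", "beta1", "N")],
]

subst, site = Counter(), Counter()
for h in histories:
    for op, alpha, other, node in h:
        if op == "S":
            subst[(alpha, other, node)] += 1   # count the chosen tree ...
            site[(alpha, node)] += 1           # ... and the substitution site

def estimate_P_S(alpha, other, node):
    return subst[(alpha, other, node)] / site[(alpha, node)]

print(estimate_P_S("alpha1", "alpha3", "NP_subj"))  # 1.0, despite adjunction
```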
4.4 Acquisition of Probabilistic TAG 
Despite the attractive theoretical features of probabilistic
TAG, many practical issues remain. Foremost
among these is the acquisition of the statistical parameters.
[Schabes, 1992] has recently adapted the Inside-Outside
algorithm, used for estimating the parameters
of a probabilistic CFG [Baker, 1979], to probabilistic
TAG. The Inside-Outside algorithm is itself a generalization
of the Forward-Backward algorithm used to
train hidden Markov models [Baum, 1972]. It optimizes
by starting with a set of initial parameters (chosen
randomly, or uniformly, or perhaps initially set on
the basis of a priori knowledge), then iteratively collecting
statistics over the training data (using the existing
parameters) and then re-estimating the parameters
on the basis of those statistics. Each re-estimation step
guarantees improved or at worst equivalent parameters,
according to a maximum-likelihood criterion.
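Schematically, the loop has the familiar EM shape sketched below. The E step shown is only a stand-in: it counts events as if derivations were directly observed, whereas the real algorithm computes expected counts from inside and outside probabilities:

```python
# Schematic EM skeleton for parameter re-estimation. Everything here is an
# illustrative shape, not Schabes's algorithm itself.
from collections import Counter

def collect_expected_counts(params, corpus):
    """Stand-in for the inside-outside E step: simply counts events,
    pretending every derivation is directly observed."""
    counts = Counter()
    for history in corpus:
        for site, choice in history:
            counts[(site, choice)] += 1
    return counts

def reestimate(params, corpus, iterations=10):
    """Iterate E and M steps; each pass of the real algorithm cannot
    decrease the likelihood of the training corpus."""
    for _ in range(iterations):
        counts = collect_expected_counts(params, corpus)   # E step
        totals = Counter()
        for (site, choice), c in counts.items():
            totals[site] += c
        # M step: renormalize so the alternatives at each site sum to 1.
        params = {(site, choice): c / totals[site]
                  for (site, choice), c in counts.items()}
    return params

corpus = [[("NP_subj", "alpha3"), ("NP_obj", "alpha2")],
          [("NP_subj", "alpha2")]]
print(reestimate({}, corpus))
```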
The procedure immediately raises two questions,
regardless of whether the probabilistic formalism under
consideration is finite-state, context-free, or tree-adjoining.
First, since the algorithm re-estimates parameters
but does not determine the rules in the grammar,
the structures underlying the statistics (arcs,
rules, trees) must be determined in some other fashion.
The starting point can be a hand-written grammar,
which requires considerable effort and linguistic knowledge;
alternatively, one can initially include all possible
structures and let the statistics weed out useless rules
by assigning them near-zero probability. A third possibility
is to add an engine that hypothesizes rules on
the basis of observed data, and then let the parameter
estimation operate over these hypothesized structures.
Which possibility is best depends upon the application
and the breadth of linguistic coverage needed.
Second, any statistical approach to natural language
must consider the size of the parameter set to
be estimated. Additional parameters can enhance the
theoretical accuracy of a statistical model (e.g., extending
a trigram model to 4-grams) but may also lead to an
unmanageable number of parameters to store and re- 
trieve, much less estimate. In addition, as the number 
of parameters grows, more data are required in order 
to collect accurate statistics. For example, both (5)a 
and (5)b witness the same relationship between called 
and son within a trigram model. 
(5)a. Mary called her son. 
b. Mary called her son Junior. 
Within a recent lexicalized TAG for English [Abeillé
et al., 1990], however, these two instances of called
are associated with two distinct initial trees, reflecting
the different syntactic structures of the two examples.
Thus observations that would be the same for the trigram
model can be fragmented in the TAG model. As
a result of this fragmentation, each individual event is
observed fewer times, and so the model is more vulnerable
to statistical inaccuracy resulting from low counts.
The acquisition of a probabilistic tree-adjoining
grammar "from the ground up," that is, including hypothesizing
grammatical structures as well as estimating
statistical parameters, is an intended topic of future
work.
5 Conclusions 
I have argued that probabilistic TAG provides a seamless
framework that combines the simplicity and structure
of the probabilistic CFG approach with the lexical
sensitivity of string-based co-occurrence analyses.
Within a wide variety of applications, it appears that
various researchers (e.g. [Hindle, 1990; Magerman and
Marcus, 1991; Brown et al., 1991; Smadja and McKeown,
1990]) are confronting issues similar to those
discussed here. As the resources required for statistical
approaches to natural language continue to become
more readily available, these issues will take on
increasing importance, and the need for a framework
that unifies grammatical and statistical techniques will
continue to grow. Probabilistic TAG is one step toward
such a unifying framework.

References 

[Abeillé and Schabes, 1989] Anne Abeillé and Yves Schabes.
Parsing idioms in tree adjoining grammars. In
Fourth Conference of the European Chapter of the
Association for Computational Linguistics (EACL'89),
Manchester, 1989.

[Abeillé et al., 1990] Anne Abeillé, Kathleen Bishop,
Sharon Cote, and Yves Schabes. A lexicalized tree
adjoining grammar for English. Technical Report
MS-CIS-90-24, Department of Computer and Information
Science, University of Pennsylvania, April 1990.

[Baker, 1979] J. K. Baker. Trainable grammars for speech
recognition. In Proceedings of the Spring Conference
of the Acoustical Society of America, pages 547-550,
Boston, MA, June 1979.

[Baum, 1972] L. E. Baum. An inequality and associated
maximization technique in statistical estimation of probabilistic
functions of a Markov process. Inequalities, 3:1-8,
1972.

[Brown et al., 1991] P. Brown, S. Della Pietra, V. Della
Pietra, and R. Mercer. A statistical approach to
sense disambiguation in machine translation. In Fourth
DARPA Workshop on Speech and Natural Language, Pacific
Grove, CA, February 1991.

[Church and Hanks, 1989] K. Church and P. Hanks. Word
association norms, mutual information, and lexicography.
In Proceedings of the 27th Meeting of the Association
for Computational Linguistics, Vancouver, B.C.,
1989.

[Francis and Kucera, 1982] W. Francis and H. Kucera. Frequency
Analysis of English Usage. Houghton Mifflin Co.:
New York, 1982.

[Hindle, 1990] D. Hindle. Noun classification from
predicate-argument structures. In Proceedings of the 28th
Annual Meeting of the Association for Computational Linguistics,
Pittsburgh, Penna., pages 268-275, 1990.

[Jelinek et al., 1990] F. Jelinek, J. D. Lafferty, and R. L.
Mercer. Basic methods of probabilistic context free grammars.
Research Report RC 16374 (#72684), IBM, Yorktown
Heights, New York 10598, 1990.

[Joshi et al., 1975] Aravind Joshi, Leon Levy, and M. Takahashi.
Tree adjunct grammars. Journal of the Computer
and System Sciences, 10, 1975.

[Joshi, 1973] Aravind Joshi. Remarks on some aspects of
language structure and their relevance to pattern analysis.
Pattern Recognition, 5:365-381, 1973.

[Kroch and Joshi, 1985] Anthony Kroch and Aravind K.
Joshi. Linguistic relevance of tree adjoining grammars.
Technical Report MS-CIS-85-18, Department of Computer
and Information Science, University of Pennsylvania,
April 1985.

[Magerman and Marcus, 1991] D. Magerman and M. Marcus.
Pearl: A probabilistic chart parser. In Fourth
DARPA Workshop on Speech and Natural Language, Pacific
Grove, CA, February 1991.

[Schabes, 1990] Yves Schabes. Mathematical and Computational
Aspects of Lexicalized Grammars. PhD thesis,
Univ. of Pennsylvania, 1990.

[Schabes, 1992] Yves Schabes. Stochastic lexicalized tree-adjoining
grammars, 1992. This proceedings.

[Shieber, 1985] Stuart Shieber. Evidence against the
context-freeness of natural language. Linguistics and
Philosophy, 8:333-343, 1985.

[Smadja and McKeown, 1990] F. Smadja and K. McKeown.
Automatically extracting and representing collocations
for language generation. In Proceedings of the
28th Annual Meeting of the Association for Computational
Linguistics, pages 252-259, 1990.
