The Use of Shared Forests in Tree Adjoining Grammar Parsing* 
K. Vijay-Shanker 
Department of Computer & 
Information Sciences 
University of Delaware 
Newark, DE 19716 
USA 
vijay@udel.edu 
David J. Weir 
School of Cognitive & 
Computing Sciences 
University of Sussex 
Falmer, Brighton BN1 9QH 
UK 
davidw@cogs.sussex.ac.uk 
Abstract 
We study parsing of tree adjoining grammars with particular emphasis on the use of shared forests to represent all the parse trees deriving a well-formed string. We show that there are two distinct ways of representing the parse forest, one of which involves the use of linear indexed grammars and the other the use of context-free grammars. The work presented in this paper is intended to give a general framework for studying tag parsing. The schemes using lig and cfg to represent parses can be seen to underlie most of the existing tag parsing algorithms.
1 Introduction 
We study parsing of tree adjoining grammars (tag) with particular emphasis on the use of shared forests to represent all the parse trees deriving a well-formed string. Following Billot and Lang [1989] and Lang [1992] we use grammars as a means of recording all parses. Billot and Lang used context-free grammars (cfg) for representing all parses in a cfg parser, demonstrating that a shared forest grammar can be viewed as a specialization of the grammar for the given input string. Lang [1992] extended this approach, considering both the recognition problem and the representation of all parses, and suggested how this can be applied to tag.
This paper examines this approach to tag parsing in greater detail. In particular, we show that there are two distinct ways of representing the parse forest. One possibility is to use linear indexed grammars (lig), a formalism that is equivalent to tag [Vijay-Shanker and Weir, in press a]. The use of lig is not surprising in that we would expect to be able to represent parses of a formalism in an equivalent formalism. However, we also show that there is a second way of representing parses that makes use of a cfg.

*We are very grateful to Bernard Lang for helpful discussions.
The work presented in this paper is intended to give a general framework for studying tag parsing. The schemes using lig and cfg to represent parses can be seen to underlie most of the existing tag parsing algorithms.
We begin with brief definitions of the tag and lig formalisms. This is followed by a discussion of the methods for cfg recognition and the representation of parse trees that were described in [Billot and Lang, 1989; Lang, 1992]. In the remainder of the paper we examine how this approach can be applied to tag. We first consider the representation of parses using a cfg and give the space and time complexity of recognition and extraction of parses using this representation. We then consider the same issues where lig is used as the formalism for representing parses. We conclude by comparing these results with those for existing tag parsing algorithms.
2 Tree Adjoining Grammars 
Tag is a tree-generating formalism introduced in [Joshi et al., 1975]. A tag is defined by a finite
set of elementary trees that are composed by means 
of the operations of tree adjunction and substitution. 
In this paper, we only consider the use of the adjunc- 
tion operation. 
Definition 2.1 A tag, G, is denoted

G = (VN, VT, S, I, A)

where

VN is a finite set of nonterminal symbols,
VT is a finite set of terminal symbols,
S ∈ VN is the start symbol,
I is a finite set of initial trees,
A is a finite set of auxiliary trees.
An initial tree is a tree with root labeled by S and internal nodes and leaf nodes labeled by nonterminal and terminal symbols, respectively. An auxiliary tree is a tree that has a leaf node (the foot node) that is labeled by the same nonterminal that labels the root node. The remaining leaf nodes are labeled by terminal symbols and all internal nodes are labeled by nonterminals. The path from the root node to the foot node of an auxiliary tree is called the spine of the auxiliary tree. An elementary tree is either an initial tree or an auxiliary tree. We use α to refer to initial trees and β for auxiliary trees.
A node of an elementary tree is called an elementary node and is named with an elementary node address. An elementary node address is a pair comprising the name of the elementary tree to which the node belongs and the address of the node within that tree. We will assume the standard addressing scheme: the root node has the address ε; if a node with address p has k children then the k children (in left-to-right order) have addresses p·1, ..., p·k. Thus, for each address p we have p ∈ N* where N is the set of natural numbers. In this section we use p to refer to addresses and η to refer to elementary node addresses. In general, we can write η = ⟨γ, p⟩ where γ is an elementary tree, p ∈ dom(γ), and dom(γ) is the set of addresses of the nodes in γ.
Let γ be a tree with an internal node labeled by a nonterminal A. Let β be an auxiliary tree with root and foot node labeled by the same nonterminal A. The tree, γ′, that results from the adjunction of β at the node in γ labeled A is formed by removing the subtree of γ rooted at this node, inserting β in its place, and substituting the removed subtree at the foot node of β.
Each elementary node is associated with a selective adjoining (SA) constraint that determines the set of auxiliary trees that can be adjoined at that node. In addition, when adjunction is mandatory at a node it is said to have an obligatory adjoining (OA) constraint. Whether β can be adjoined at the node (labeled by A) in γ is determined by the SA constraint of the node. In γ′ the nodes contributed by β have the same constraints as those associated with the corresponding nodes in β. The remaining nodes in γ′ have the constraints of the corresponding nodes in γ.
Given p ∈ dom(γ), lbl(γ, p) refers to the label of the node addressed p in γ. Similarly, we will use sa(γ, p) and oa(γ, p) to refer to the SA and OA constraints of the node addressed p in γ. Finally, we will use ft(β) to refer to the address of the foot node of an auxiliary tree β.
adj(γ, p, β) denotes the tree that results from the adjunction of β at the node in γ with address p. This is defined when β ∈ sa(γ, p). If adj(γ, p, β) = γ′ then the nodes in γ′ are defined as follows.

• dom(γ′) =
  { p₁ | p₁ ∈ dom(γ) and p₁ ≠ p·p₂ for all p₂ ∈ N* }
  ∪ { p·p₁ | p₁ ∈ dom(β) }
  ∪ { p·ft(β)·p₁ | p·p₁ ∈ dom(γ) and p₁ ≠ ε }

• if p₁ ∈ dom(γ) such that p₁ ≠ p·p₂ for all p₂ ∈ N* (i.e., the node in γ with address p₁ is not equal to or dominated by the node addressed p in γ) then
  - lbl(γ′, p₁) = lbl(γ, p₁),
  - sa(γ′, p₁) = sa(γ, p₁),
  - oa(γ′, p₁) = oa(γ, p₁),

• if p·p₁ ∈ dom(γ′) such that p₁ ∈ dom(β) then
  - lbl(γ′, p·p₁) = lbl(β, p₁),
  - sa(γ′, p·p₁) = sa(β, p₁),
  - oa(γ′, p·p₁) = oa(β, p₁),

• if p·ft(β)·p₁ ∈ dom(γ′) such that p·p₁ ∈ dom(γ) then
  - lbl(γ′, p·ft(β)·p₁) = lbl(γ, p·p₁),
  - sa(γ′, p·ft(β)·p₁) = sa(γ, p·p₁),
  - oa(γ′, p·ft(β)·p₁) = oa(γ, p·p₁).
In general, if p is the address of a node in γ then ⟨γ, p⟩ denotes the elementary node address of the node that contributes to its presence, and hence its label and constraints.
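To make the address arithmetic concrete, here is a small Python sketch (our own encoding, not from the paper) that represents a tree as a dict from Gorn addresses (tuples of integers) to labels and computes adj(γ, p, β); SA and OA constraints are omitted, and the names adj, gamma, beta and foot are invented for this illustration.

```python
# Sketch of adj(gamma, p, beta) over trees encoded as dicts mapping
# Gorn addresses (tuples of ints) to node labels, following the three
# address clauses above.  Constraints are left out for brevity.
def adj(gamma, p, beta, foot):
    """foot is ft(beta), the address of beta's foot node."""
    new = {}
    for p1, lab in gamma.items():
        if p1[:len(p)] != p:
            # node not equal to or dominated by the node addressed p
            new[p1] = lab
        elif p1 == p:
            # the adjunction node itself is superseded by beta's root/foot
            continue
        else:
            # subtree below p reappears under beta's foot node
            new[p + foot + p1[len(p):]] = lab
    for p1, lab in beta.items():
        # nodes contributed by beta keep their labels, re-addressed under p
        new[p + p1] = lab
    return new

# Adjoining beta = S(b, S*) at the root of gamma = S(a):
gamma = {(): "S", (1,): "a"}
beta = {(): "S", (1,): "b", (2,): "S"}   # foot node at address (2,)
result = adj(gamma, (), beta, (2,))
```

On this toy input the subtree of γ below the adjunction site ends up below β's foot, as the third bullet above requires.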
The tree language, T(G), generated by a TAG, G, 
is the set of trees derived starting from an initial tree 
such that no node in the resulting tree has an OA 
constraint. The (string) language, L(G), generated 
by a TAG, G, is the set of strings that appear on the 
frontier of trees in T(G). 
Example 2.1 Figure 1 gives a tag, G, which generates the language { wcw | w ∈ {a,b}* }. The constraints associated with the root and foot of β specify that no auxiliary trees can be adjoined at these nodes. This is indicated in Figure 1 by associating the empty set, ∅, with these nodes. Example derivations of the strings aca and abcab are shown in Figure 2.
Figure 1: Example of a TAG G
Figure 2: Sample derivations in G
3 Linear Indexed Grammars 
An indexed grammar \[Aho, 1968\] can be viewed as 
a cfg in which objects are nonterminals with an as- 
sociated stack of symbols. In addition to rewriting 
nonterminals, the rules of the grammar can have the 
effect of pushing or popping symbols on top of the 
stacks that are associated with each nonterminal. 
In \[Gazdar, 1988\] a restricted form of indexed gram- 
mars was discussed in which the stack associated 
with the nonterminal on the left of each production 
can only be associated with one of the occurrences of 
nonterminals on the right of the production. Stacks 
of bounded size are associated with other occurrences 
of nonterminals on the right of the production. We 
call this linear indexed grammars (lig}. Lig generate 
the same class of languages as tag \[Vijay-Shanker 
and Weir, in pressa\]. 
Definition 3.1 A lig, G, is denoted

G = (VN, VT, VI, S, P)

where

VN is a finite set of nonterminals,
VT is a finite set of terminals,
VI is a finite set of indices (stack symbols),
S ∈ VN is the start symbol, and
P is a finite set of productions.
Given a lig, G = (VN, VT, VI, S, P), we define the set of objects of G as

VC(G) = { A[σ] | A ∈ VN and σ ∈ VI* }

We use A[..α] to denote the nonterminal A associated with an arbitrary stack with the string α on top, and A[] to denote that an empty stack is associated with A. We use Υ to denote strings in (VC(G) ∪ VT)*.

The general form of a lig production is:

A[..α] → Υ B[..α′] Υ′

where A, B ∈ VN, α, α′ ∈ VI*, and Υ, Υ′ ∈ (VC(G) ∪ VT)*.

Given a grammar, G = (VN, VT, VI, S, P), the derivation relation, ⇒_G, is defined such that if

A[..α] → Υ B[..α′] Υ′ ∈ P

then for every β ∈ VI* and Υ₁, Υ₂ ∈ (VC(G) ∪ VT)*:

Υ₁ A[βα] Υ₂ ⇒_G Υ₁ Υ B[βα′] Υ′ Υ₂

As a result of the linearity of the rules, the stack βα associated with the object on the left-hand side of the derivation step and the stack βα′ associated with one of the objects on the right-hand side have the initial part β in common. In the derivation above, we say that the object B[βα′] is the distinguished child of A[βα]. Given a derivation, the distinguished descendant relation is the reflexive, transitive closure of the distinguished child relation.

The language generated by a lig, G, is:

L(G) = { w ∈ VT* | S[] ⇒*_G w }

where ⇒*_G denotes the reflexive, transitive closure of ⇒_G.
Example 3.1 The language

{ wcw | w ∈ {a,b}* }

is generated by the lig

G = ({S, T}, {a, b, c}, {γa, γb}, S, P)

where P contains the following productions.

S[..] → a S[.. γa]     S[..] → b S[.. γb]
S[..] → T[..]          T[.. γa] → T[..] a
T[.. γb] → T[..] b     T[] → c

This grammar generates the string abcab as follows.

S[] ⇒_G a S[γa]
    ⇒_G ab S[γa γb]
    ⇒_G ab T[γa γb]
    ⇒_G ab T[γa] b
    ⇒_G ab T[] ab
    ⇒_G abcab
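This derivation can be replayed mechanically. The following Python sketch (ours, not part of the paper) simulates the grammar above: the S productions emit a letter of w while pushing the matching index, and the T productions pop the indices to emit the mirrored copy; derive_wcw is an invented name.

```python
# Simulate the leftmost lig derivation of Example 3.1 for a given w:
# S[..] pushes one index per letter of w, T[..] pops them back off.
def derive_wcw(w):
    """Return the string derived by the lig of Example 3.1 for input w."""
    assert set(w) <= {"a", "b"}
    stack = []          # the index stack attached to the current object
    out = []            # terminals emitted to the left of the object
    # S[..] -> a S[.. ga]  and  S[..] -> b S[.. gb]
    for ch in w:
        out.append(ch)
        stack.append("g" + ch)
    # S[..] -> T[..], then T[.. ga] -> T[..] a and T[.. gb] -> T[..] b
    right = []          # terminals emitted to the right of the object
    while stack:
        idx = stack.pop()
        right.insert(0, idx[-1])
    # Final step: T[] -> c
    return "".join(out) + "c" + "".join(right)
```

For example, derive_wcw("ab") reproduces the derivation of abcab shown above.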
4 Parsing as Intersection with 
Regular Languages 
In the case of cfg parsing, [Billot and Lang, 1989; Lang, 1992] show that a cfg can be used to encode all of the parses for a given string. For example, let Go be a grammar and let the string w = a₁ ... aₙ be in L(Go). All parses for the string w can be represented by the shared forest grammar Gw. The nonterminals in Gw are of the form (A, i, j) where A is a nonterminal of Go and 0 ≤ i ≤ j ≤ n. The construction of Gw is such that any derivation from (A, i, j) encodes a derivation

A ⇒*_Go a_{i+1} ... a_j
For instance, suppose A → BC is a production in Go that is used in the first step of a derivation of the substring a_{i+1} ... a_j from A. Corresponding to this production, Gw contains a production

(A, i, j) → (B, i, k)(C, k, j)

for each 0 ≤ i ≤ k ≤ j ≤ n. This can be used to encode all parses of a_{i+1} ... a_j from A where

B ⇒* a_{i+1} ... a_k and C ⇒* a_{k+1} ... a_j
In general, corresponding to a production

A → X₁ ... X_r

in Go the grammar Gw contains a production

(A, i₁, j_r) → (X₁, i₁, j₁) ... (X_r, i_r, j_r)

for every i₁, j₁, ..., i_r, j_r ∈ {0, ..., n} such that j_k = i_{k+1} for each 1 ≤ k < r, and, for each 1 ≤ k ≤ r, if X_k ∈ VT then i_k + 1 = j_k and otherwise i_k ≤ j_k. Additionally, Gw includes the production

(a_k, k−1, k) → a_k

for each 1 ≤ k ≤ n.
Note that the number of nonterminals in the shared forest grammar, Gw, is O(n²) and the number of productions is O(n^{m+1}) where |w| = n and m is the maximum number of nonterminals in the right-hand side of a production in Go. Therefore, if the object grammar were in Chomsky normal form, the number of productions would be O(n³).
Lang [1992] extended this by showing that parsing a string w according to a grammar G can be viewed as intersecting the language L(G) with the regular language {w}. Suppose we have an object context-free grammar Go and some deterministic finite state automaton M. For the sake of simplicity, let us assume that Go is in Chomsky normal form. The standard proof that context-free languages are closed under intersection with regular languages constructs a context-free grammar for L(Go) ∩ L(M) with a production

(A, p, q) → (B, p, r)(C, r, q)

for each production A → BC of Go and states p, q, r of M. Also, for each terminal a the production (a, p, q) → a will be included if and only if δ(p, a) = q, where δ is the transition function of M.
Lang [1992] applied this to cfg recognition as follows. Given an input, w = a₁ ... aₙ, define the dfa Mw such that L(Mw) = {w}. The state set of Mw is {0, 1, ..., n}; the transition function δ is such that δ(i, a_{i+1}) = i + 1 for each 0 ≤ i < n; 0 is the initial state; and n is the final state. The shared forest grammar Gw is obtained when the standard intersection construction described above is applied to Go and Mw. Furthermore, since L(Gw) = L(Go) ∩ L(Mw) and L(Mw) = {w}, we have w ∈ L(Go) if and only if L(Gw) is not the empty set. That is, the original recognition problem can be turned into one of generating the shared forest grammar, Gw, and deciding whether the start nonterminal, (S, 0, n), of Gw is a useful symbol, i.e., whether there is some terminal string x such that

(S, 0, n) ⇒*_Gw x

Here S has been taken to be the start nonterminal of Go. Note that Gw can be constructed in O(n³) time and "recognition" can also be accomplished within this time bound.
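The construction can be sketched in a few lines of Python (an illustration under assumed encodings, not the paper's code): binary rules are triples (A, B, C) for A → BC, lexical rules are pairs (A, a), and shared forest nonterminals are triples (A, i, j).

```python
# Sketch of the intersection construction for a CNF object grammar Go
# and the dfa M_w for a single string w.  Returns the productions of
# the shared forest grammar G_w as (lhs, rhs) pairs.
def shared_forest(binary_rules, lexical_rules, w):
    """binary_rules: iterable of (A, B, C) encoding A -> B C;
    lexical_rules: iterable of (A, a) encoding A -> a."""
    n = len(w)
    prods = []
    # (a, k-1, k) -> a exactly when delta(k-1, a) = k, i.e. w[k-1] == a
    for k in range(1, n + 1):
        prods.append(((w[k - 1], k - 1, k), [w[k - 1]]))
    for A, a in lexical_rules:
        for k in range(1, n + 1):
            if w[k - 1] == a:
                prods.append(((A, k - 1, k), [(a, k - 1, k)]))
    # (A, i, j) -> (B, i, k) (C, k, j) for all 0 <= i <= k <= j <= n
    for A, B, C in binary_rules:
        for i in range(n + 1):
            for k in range(i, n + 1):
                for j in range(k, n + 1):
                    prods.append(((A, i, j), [(B, i, k), (C, k, j)]))
    return prods
```

Recognition then amounts to asking whether the start triple (S, 0, n) is a useful symbol of the resulting grammar.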
One advantage that arises from viewing parsing 
as intersection with regular languages is that exactly 
the same algorithm can be given a word net (a reg- 
ular language that is not a singleton) rather than a 
single word as input. This could be useful if we wish 
to deal with ill-formed inputs. 
5 Derivation versus Derived Trees in 
TAG 
For grammar formalisms involving the derivation of 
trees, a tree is called a derived tree with respect to a 
given grammar if it can be derived using the rewrit- 
ing rules of the grammar. A derivation tree of the 
grammar, on the other hand, is a tree that encodes 
the sequence of rewritings used in deriving a derived 
tree. In the case of cfg, a tree that is derived contains 
all the information about its derivation and there is 
no need to distinguish between derivation trees and 
derived trees. This is not always the case. In par- 
ticular, for a tree-rewriting system like tag we need 
to distinguish between derived and derivation trees. 
In fact there are at least two ways one can encode tag derivation trees. The first (see [Vijay-Shanker, 1987]) captures the fact that derivations in tag are context-free, i.e., the trees that can be adjoined at a node can be determined a priori and are not dependent on the derivation history. We capture this context-freeness by giving a cfg to represent the set of all possible derivation sequences in a tag. An alternate scheme uses a tag or a lig (see [Vijay-Shanker and Weir, in press b]) to represent the set of all possible derivations.

We briefly consider the first scheme to show how, given a tag Go and a string w, a context-free grammar can be used to represent shared forests. In later sections we will study the second scheme using lig for shared forests.
6 Using CFG for Shared Forests 
Given a tag,

Go = (VN, VT, S, I, A)

and a string w = a₁ ... aₙ, we construct a context-free grammar, Gw, such that L(Gw) ≠ ∅ if and only if w ∈ L(Go). Let Mw be the dfa for w described in Section 4.
Consider a tree β that has been derived from some auxiliary tree in A. Let the string on the frontier of β that is to the left of the foot node be u_l and the string to the right of the foot node be u_r. Consider the tree that results from the adjunction of β at a node with elementary node address η,¹ where v is the string on the frontier of the subtree rooted at η. After adjunction the strings u_l and u_r will appear to the left and right (respectively) of v.
Suppose that in a derivation of the string w by the grammar Go the strings u_l and u_r form two contiguous substrings of w: i.e., u_l = a_{i+1} ... a_p and u_r = a_{q+1} ... a_j for some 0 ≤ i ≤ p ≤ q ≤ j ≤ n. Thus, according to the definition of Mw we would have δ(i, u_l) = p and δ(q, u_r) = j. Hence, we can use the four states i, j, p and q of Mw to account for which parts of w are spanned by the frontier of β. Since the string appearing at the frontier of the subtree rooted at η is v, if δ(p, v) = q then δ(i, u_l v u_r) = j, and p and q identify the substring of w that is spanned by the subtree rooted at η. However, the node η may be on the spine of some auxiliary tree, i.e., on the path from the root to the foot node. In that case we will have to view the frontier of the subtree rooted at η as comprising two substrings, say v_l and v_r, to the left and right of the foot node, respectively. The two states p, q of Mw then do not fully characterize the frontier of the subtree rooted at η. We need four states, say p, q, r, s, where δ(p, v_l) = r and δ(s, v_r) = q. Note that these four states only characterize the frontier of the subtree rooted at η before the adjunction of β takes place. The four states i, j, r, s characterize the situation after the adjunction of β, since δ(i, u_l) = p and δ(p, v_l) = r (therefore δ(i, u_l v_l) = r), and δ(s, v_r u_r) = δ(q, u_r) = j.
In the shared forest cfg Gw the derivation of the string at the frontier of the subtree rooted at η before adjunction will be captured by the use of a nonterminal of the form (⊥, η, p, q, r, s) and the situation after adjunction will be characterized by (⊤, η, i, j, r, s). We use the symbols ⊤ and ⊥ to capture the fact that consideration of a node involves two phases: (i) the ⊤ phase where we consider adjunction at a node, and (ii) the ⊥ phase where we consider the subtree rooted at this node. Note that the states r, s are only needed when η is a node on the spine of an auxiliary tree. When this is not the case we let r = s = −.

¹Rather than repeatedly saying "a node with an elementary node address η", henceforth we simply refer to it as the node η.
Since we have characterized the frontier of β (i.e., the subtree rooted at root(β), the root of β) by the four states i, j, p, q, we can use the nonterminal (⊤, root(β), i, j, p, q) and can capture the derivation involving adjunction of β at η by a production of the form

(⊤, η, i, j, r, s) → (⊤, root(β), i, j, p, q) (⊥, η, p, q, r, s)
Without further discussion, we give the productions of Gw. For each elementary node η do the following.

Case 1: When η is a node that is labeled by a terminal a, add the production

(⊤, η, p, q, −, −) → a

if and only if δ(p, a) = q.

Case 2a: Let η₁ and η₂ be the children of η. If the left child η₁ dominates the foot node then add the production

(⊥, η, i, j, p, q) → (⊤, η₁, i, k, p, q) (⊤, η₂, k, j, −, −)

If neither child dominates the foot node then add the production

(⊥, η, i, j, −, −) → (⊤, η₁, i, k, −, −) (⊤, η₂, k, j, −, −)

Case 2b: Let η₁ and η₂ be the children of η. If the right child η₂ dominates the foot node then add the production

(⊥, η, i, j, p, q) → (⊤, η₁, i, k, −, −) (⊤, η₂, k, j, p, q)

Case 3: When η is a nonterminal node that does not have an OA constraint, then to capture the fact that it is not necessary to adjoin at this node, we add

(⊤, η, i, j, p, q) → (⊥, η, i, j, p, q)

Case 4a: When η is a node where β can be adjoined and root(β) is the root node of β, add the production

(⊤, η, i, j, r, s) → (⊤, root(β), i, j, p, q) (⊥, η, p, q, r, s)

Case 4b: When η is the foot node of the auxiliary tree β, add the production

(⊥, η, p, q, p, q) → ε
If η is the root of an initial tree then add the production

S → (⊤, η, 0, n, −, −)

where S is the start symbol of Gw.
Note that in Cases 2a and 2b we are assuming binary branching merely to simplify the presentation. We can use a sequence of binary cfg productions to encode situations where η has more than two children. That is, even if the object-level grammar is not binary branching, the shared forest grammar can still be.
Note that since the state set of Mw is {0, ..., n}, the number of nonterminals in Gw is O(n⁴). Since there are at most three nonterminals in a production, there are at most six states involved in a production. Therefore, the number of productions is O(n⁶) and construction of this grammar takes O(n⁶) time. Although the derivations of Gw encode derivations of the string w by Go, the specific set of terminal strings generated by Gw is not important. We do, however, have L(Gw) ≠ ∅ if and only if w ∈ L(Go). As before, we can determine whether L(Gw) ≠ ∅ by checking whether the start nonterminal S is useful. Furthermore, this can be done in time and space linear in the size of the grammar. Since w ∈ L(Go) if and only if L(Gw) ≠ ∅, recognition can be done in O(n⁶) time and space.
Once we have found all the useful symbols in the grammar we can prune the grammar by retaining only those productions that contain only useful symbols. Since Gw is a cfg and since we can now guarantee that every nonterminal derives a terminal string (so that using any production will eventually yield a terminal string), the derivations of w in Go can be read off by simply reading off derivations in Gw.
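The usefulness test and the pruning step apply to any cfg and can be sketched as a pair of fixpoint computations (our own generic code; the (lhs, rhs) encoding of productions is assumed, not taken from the paper):

```python
# Compute the useful symbols of a cfg and prune productions that
# mention a useless symbol.  prods is a list of (lhs, rhs-list) pairs.
def useful_productions(prods, start, terminals):
    # 1. productive symbols: those that derive some terminal string
    productive = set(terminals)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in prods:
            if lhs not in productive and all(x in productive for x in rhs):
                productive.add(lhs)
                changed = True
    # 2. reachable symbols: those in some sentential form from start,
    #    restricted to productions whose right-hand sides are productive
    reachable = {start}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in prods:
            if lhs in reachable and all(x in productive for x in rhs):
                for x in rhs:
                    if x not in reachable:
                        reachable.add(x)
                        changed = True
    keep = productive & reachable
    return [(l, r) for l, r in prods if l in keep
            and all(x in keep for x in r)]
```

Applied to the shared forest grammar, w ∈ L(Go) exactly when some production for the start symbol survives the pruning.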
7 Using LIG for Shared Forests 
We now present an alternate scheme to represent the 
derivations of a string w from a given object tag 
grammar Go. In later sections show how it can be 
used for solving the recognition problem and how a 
single parse can be extracted. 
The scheme presented in Section 6 that produced 
a cfg shared forest grammar captured the context- 
freeness of tag derivations. The approach that we 
now consider captures an alternative view of tag 
derivations in which a derivation is viewed as sen- 
sitive to the derivation history. In particular, the 
control of derivation can be captured with the use of 
additional stack machinery. This underlies the use 
of lig to represent the shared forests. 
In order to understand how a lig can be used to encode a tag derivation, consider a top-down derivation in the object grammar as follows. A tag derivation can be seen as a traversal over the elementary trees beginning at the root of one of the initial trees. Suppose we have reached some elementary node η. We must first consider adjunction at η and after that we must visit each of η's subtrees from left to right. When we first reach η we say that we are in the top phase of η. The derivation lig encodes this with the nonterminal ⊤ associated with a stack whose top element is η. After having considered adjunction at η we are in the bottom phase of η. The derivation lig encodes this with the nonterminal ⊥ associated with a stack whose top element is η.
When considering adjunction at η we may have a choice of either not adjoining at all or selecting some auxiliary tree to adjoin. In the former case we move directly to the bottom phase of η. In the latter case we move to (visit) the root of the auxiliary tree β that we have chosen to adjoin. Once we have finished visiting the nodes of β (i.e., we have reached the foot of β) we must return to (the bottom phase of) η. Therefore, it is necessary, while visiting the nodes in β, to store the adjunction node η. This can be done by pushing η onto the stack at the point that we move to the root of β. Note that the stack may grow to unbounded length since we might adjoin at a node within β, and so on. When we reach the bottom phase of the foot node of β the stack is popped and we find the node at which β was adjoined at the top of the stack.
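The push/pop discipline can be seen in a toy replay (ours, not the paper's) of a chain of nested adjunctions: moving to each adjoined root pushes its adjunction node, and reaching each foot pops it, recovering the node to return to.

```python
# Toy illustration of the adjunction stack.  sites[k] is the node at
# which the (k+1)-th, more deeply nested auxiliary tree is adjoined.
def replay(sites):
    """Return the successive stack contents along the traversal."""
    trace = []
    stack = []
    for eta in sites:
        stack.append(eta)       # move to the root of the adjoined tree
        trace.append(list(stack))
    for _ in sites:
        stack.pop()             # reach the foot: return to this node
        trace.append(list(stack))
    return trace
```

With two nested adjunctions the stack grows to depth two and then unwinds in reverse order, exactly the behavior described above.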
From the above discussion it is clear that the lig needs just two nonterminals, ⊤ and ⊥. At each step of a derivation in the lig shared forest grammar the top of the stack will specify the node currently being visited. Also, if the node η being visited belongs to an auxiliary tree β and is on its spine we can expect the symbol below the top of the stack to give us the node where β is adjoined. If η is not on the spine of an auxiliary tree then it is the only symbol on the stack.
We now show how the lig shared forest grammar can be constructed for a given string w = a₁ ... aₙ. Suppose we have a tag

Go = (VN, VT, S, I, A)

and the dfa

Mw = (Q, VT, δ, q₀, F)

as defined in Section 4. We construct the lig

Gw = (VN, VT, VI, S′, P)

that generates the intersection of L(Go) and L(Mw). P includes the following set of productions for the start symbol S′:

{ S′[] → (⊤, q₀, q_f)[η] | q_f ∈ F and η is the root of an initial tree }

In addition, for each elementary node η do the following.
Case 1: When η is a node that is labeled by a terminal a, P includes the production

(⊤, p, q)[η] → a

for each p, q ∈ Q such that δ(p, a) = q.

Case 2a: When η₁ and η₂ are the children of a node η such that the left sibling η₁ is on the spine or neither child is on the spine, P includes the production

(⊥, p, q)[..η] → (⊤, p, r)[..η₁] (⊤, r, q)[η₂]

for each p, q, r ∈ Q. Note that the stack of adjunction points must be passed to the ancestors of the foot node all the way from the root.

Case 2b: When η₁ and η₂ are the children of a node η such that the right sibling η₂ is on the spine, P includes the production

(⊥, p, q)[..η] → (⊤, p, r)[η₁] (⊤, r, q)[..η₂]

for each p, q, r ∈ Q.

Case 3: When η is a nonterminal node that does not have an OA constraint, P includes the production

(⊤, p, q)[..η] → (⊥, p, q)[..η]

for each p, q ∈ Q. This production is used when no adjunction takes place and we move directly between the top and bottom phases of η.

Case 4a: When η is a node where β can be adjoined and η′ is the root node of β, P includes the production

(⊤, p, q)[..η] → (⊤, p, q)[..η η′]

for each p, q ∈ Q. Note that the adjunction node η has been pushed below the new node η′ on the stack.

Case 4b: When η is a node where β can be adjoined and η′ is the foot node of β, P includes the production

(⊥, p, q)[..η η′] → (⊥, p, q)[..η]

for each p, q ∈ Q. Note that the stack symbol that appeared below η′ will be the node at which β was adjoined.
Since the state set of Mw is {0, ..., n} there are O(n²) nonterminals in the grammar. Since at most three states are used in a production, Gw has O(n³) productions. The time taken to construct this grammar is also O(n³). As in the cfg shared forest grammar constructed in Section 6, we have assumed that the tag is binary branching for the sake of simplifying the presentation. The construction can be adapted to allow for any degree of branching through the use of additional (binary) lig productions. Furthermore, this would not increase the space complexity of the grammar. Finally, note that unlike the cfg shared forest grammar, in the lig shared forest grammar Gw, w is derived in Gw if and only if w is derived in Go. Of course, in both cases L(Gw) = {w} ∩ L(Go), and hence the recognition problem can be solved by determining whether the shared forest grammar generates the empty set or not.
8 Removing Useless Symbols 
As in the case of the cfg shared forest grammar, to 
solve the original recognition problem we have to de- 
termine if L(G~) ~ ¢. In particular, we have to de- 
termine whether S~\[\] derives a terminal string. We 
solve this question by construcing an nfa, Ma~, from 
Gto where the states of Ma. correspond to the non- 
terminal and terminal symbols of Gw. This trans- 
forms the question of determining whether a symbol 
is useful into a reachibility question on the graph of 
Ma.. In particular, for any string of stack symbols 
% the object A\[7\] derives a string of terminals if and 
only if it is possible, in the nfa Ma.., to reach a fi- 
nal state from the state corresponding to A on the 
input 7. Thus, w e L(Go) if and only if S'\[\] ::~ w Gw 
if and only if in Ma. a final state is reachable from 
the state corresponding to S ~ on the empty string. 
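Once M_Gw is built, this test is an ordinary nfa run. A small sketch (an invented helper, not from the paper; we assume δ is represented as a dict from (state, stack symbol) pairs to sets of states, and that γ is read top-first):

```python
# Run the nfa M_Gw on a stack string gamma starting from state A.
def derives_terminals(delta, final_states, A, gamma):
    """True iff some final state is reachable from state A on the
    input string gamma (a sequence of stack symbols)."""
    states = {A}
    for sym in gamma:
        states = set().union(*(delta.get((q, sym), set()) for q in states))
        if not states:          # no run survives: A[gamma] is dead
            return False
    return bool(states & final_states)
```

With A = S′ and γ = ε this is exactly the recognition test described above.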
Given a lig Gw = (VN, VT, VI, S′, P) we construct the nfa M_Gw = (Q, Σ, δ, q₀, F) as follows. Let the state set of M_Gw be the nonterminal and terminal alphabet of Gw: i.e., Q = VN ∪ VT. The initial state of M_Gw is the start symbol of Gw, i.e., q₀ = S′. The input alphabet of M_Gw is the stack alphabet of Gw: i.e., Σ = VI. Note that since Gw is the lig shared forest grammar, the set VI is the set of elementary node addresses of the object tag grammar Go. The set of final states, F, of M_Gw is the set VT. The transition function δ of M_Gw is defined as follows.
Case 1: If P contains the production

A[η] → a

then add a to δ(A, η).

Case 2a: If P contains the production

A[..η] → B[..η₁] C[η₂]

then if δ(C, η₂) ∩ F ≠ ∅ and D ∈ δ(B, η₁), add D to δ(A, η).

Case 2b: The case where P contains the production

A[..η] → C[η₁] B[..η₂]

is similar to Case 2a.

Case 3: If P contains the production

A[..η] → B[..η]

then if C ∈ δ(B, η), add C to δ(A, η).

Case 4a: If P contains the production

A[..η] → B[..η η′]

then for each C such that C ∈ δ(B, η′) and each D such that D ∈ δ(C, η), add D to δ(A, η).

Case 4b: If P contains the production

A[..η η′] → B[..η]

then add B to δ(A, η′).
Case 5: If P contains the production

S′[] → A[η]

then if B ∈ δ(A, η), add B to δ(S′, ε).
Given that w = a₁ ... aₙ and that the nonterminals (and corresponding states in M_Gw) of Gw are of the form (⊤, i, j) or (⊥, i, j) where 0 ≤ i ≤ j ≤ n, there are O(n²) nonterminals (states in M_Gw) in the lig Gw. The size of M_Gw is O(n⁴) since there are O(n²) out-transitions from each state.

We can use standard dynamic programming techniques to ensure that each production is considered only once. Given such an algorithm it is easy to check that the construction of M_Gw will take O(n⁶) time. The worst case corresponds to Case 4a, which takes O(n⁴) for each production. However, there are only O(n²) productions to which Case 4a applies. Once the nfa has been constructed, the recognition problem (i.e., whether w ∈ L(Go)) takes O(n²) time: we have to check whether there is an ε-transition from the initial state to a final state, and hence we will have to consider O(n²) transitions.
A straightforward algorithm can be used to remove the states for nonterminals that do not appear in any sentential form derived from S′. In other words, we only keep states A such that for some γ there is a derivation

S′[] ⇒*_Gw Υ₁ A[γ] Υ₂

for some Υ₁, Υ₂ ∈ (VC(Gw) ∪ VT)*.

Note that the states to be removed are not simply those states that are not reachable from the initial state of M_Gw. The set of states reachable from the initial state includes only the set of nonterminals in objects that are the distinguished descendants of the root node in some derivation.
From the construction of M_Gw it is the case that for each A ∈ VN the set

{ γ | a ∈ δ(A, γ) for some a ∈ F }

is equal to the set

{ γ | A[γ] ⇒*_Gw x for some x ∈ VT* }

Thus, if a final state is accessible from a state A then for some γ (that witnesses the accessibility of a final state from A)

A[γ] ⇒*_Gw x

for some x ∈ VT*.
Once the construction of M_Gw is complete we only
retain those productions in Gw that involve nonter-
minals that remain in the state set of M_Gw. How-
ever, unlike the case of the cfg shared forest gram-
mar, the extraction of individual parses for the input
w does not simply involve reading off a derivation of
Gw. This is due to the fact that although retain-
ing the state A does mean that there is a derivation
S'[] ⇒*_Gw T1 A[γ] T2 for some γ and T1, T2, we can-
not guarantee that A[γ] will derive a string of ter-
minals. The next section describes how to deal with
this problem.
9 Recovery of a Parse 
Let the lig Gw with useless productions removed be
Gw = (VN, VT, VI, S', P)
and let the nfa M_Gw constructed in Section 8 with
unnecessary states removed be
M_Gw = (VN ∪ VT, VI, δ, S', VT)
Recovering a parse of the string w by the object 
grammar Go has now been converted into the prob- 
lem of extracting one of the derivations of Gw. How- 
ever, this is not entirely straightforward. 
The presence of a state A in VN ∪ VT indicates that
for some γ in VI* and T1, T2 in (V_O(Gw) ∪ VT)* we
have
S'[] ⇒*_Gw T1 A[γ] T2
However, it is not necessarily the case that δ(A, γ) ∩
VT ≠ ∅, i.e., it might not be possible to reach a final
state of M_Gw from A with input γ. All we know is
that there is some γ' ∈ VI* (that could be distinct
from γ) such that A[γ'] derives a terminal string,
i.e., at least one final state is accessible from A on
the string γ'.
This means that in recovering a derivation of Gw
by considering the top-down application of produc-
tions we must be careful about which production we
choose at each stage. We cannot assume that any
choice of production for an object A[γ] will eventu-
ally lead to a complete derivation. Even if the top
of the stack γ is compatible with the use of a pro-
duction, this does not guarantee that A[γ] derives a
terminal string.
We give a procedure recover that can be used to
recover a derivation of Gw by using the nfa M_Gw.
This procedure guarantees that when we reach a
state A by traversing a path γ from the initial state,
then on the same string γ a final state can be reached
from the state A.
If recover(T1 ... Tn a) is invoked the following hold:
• n ≥ 1
• a ∈ VT
• Ti = (Ai, ηi) where Ai ∈ VN and ηi ∈ VI for
each 1 ≤ i ≤ n
• S'[] ⇒*_Gw x A1[ηn ... η1] y for some x, y ∈ (V_O(Gw) ∪ VT)*
• a ∈ δ(Ai, ηn ... ηi) for each 1 ≤ i ≤ n
• recover(T1 ... Tn a) returns a derivation from
A1[ηn ... η1] of the form
A1[ηn ... η1] ⇒*_Gw x1 A2[ηn ... η2] y1 ⇒*_Gw ...
⇒*_Gw xn-1 An[ηn] yn-1
⇒*_Gw xn a yn
To recover a parse we call recover(((⊤,1,n), η) a)
where a ∈ VT is such that a ∈ δ((⊤,1,n), η) and η ∈ VI
is the root of some initial tree. The definition of
recover is as follows.
Procedure recover((A1, η1) T2 ... Tn a)
Case 1: If n = 1 and
p = A1[η1] → a ∈ P
then output p. Note that there must be such a production.
Case 2a: If there is some production
p = A1[∘∘ η1] → B[∘∘ η'] C[η''] ∈ P
such that b ∈ δ(C, η'') for some b ∈ VT, and either
n > 1 and A2 ∈ δ(B, η') (where T2 = (A2, η2)) or
n = 1 and a ∈ δ(B, η'), then output
p · recover((B, η') T2 ... Tn a) · recover((C, η'') b)
Case 2b: If there is some production
p = A1[∘∘ η1] → C[η''] B[∘∘ η'] ∈ P
such that b ∈ δ(C, η'') for some b ∈ VT, and either
n > 1 and A2 ∈ δ(B, η') (where T2 = (A2, η2)) or
n = 1 and a ∈ δ(B, η'), then output
p · recover((B, η') T2 ... Tn a) · recover((C, η'') b)
Case 3: If there is some production
p = A1[∘∘ η1] → B[∘∘ η'] ∈ P
such that either n > 1 and A2 ∈ δ(B, η') (where
T2 = (A2, η2)) or n = 1 and a ∈ δ(B, η'), then output
p · recover((B, η') T2 ... Tn a)
Case 4a: If there is some production
p = A1[∘∘ η1] → B[∘∘ η1 η'] ∈ P
such that C ∈ δ(B, η') for some C ∈ VN, and either
n > 1 and A2 ∈ δ(C, η1) (where T2 = (A2, η2)) or n = 1
and a ∈ δ(C, η1), then output
p · recover((B, η')(C, η1) T2 ... Tn a)
Case 4b: If there is a production
p = A1[∘∘ η2 η1] → A2[∘∘ η2] ∈ P
such that n > 1 and T2 = (A2, η2), then output
p · recover(T2 ... Tn a)
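The case analysis above can be sketched in Python. All encodings here are assumptions made for illustration, not the paper's: delta maps (state, stack symbol) pairs to sets of states, vt is the set of terminals (the final states), and the productions of Gw are tagged tuples whose tags mirror the cases:

```python
#   ("leaf", A, eta, a)             A[eta] -> a                  (Case 1)
#   ("adjL", A, eta, B, eB, C, eC)  A[oo eta] -> B[oo eB] C[eC]  (Case 2a)
#   ("adjR", A, eta, C, eC, B, eB)  A[oo eta] -> C[eC] B[oo eB]  (Case 2b)
#   ("copy", A, eta, B, eB)         A[oo eta] -> B[oo eB]        (Case 3)
#   ("push", A, eta, B, eB)         A[oo eta] -> B[oo eta eB]    (Case 4a)
#   ("pop",  A, e2, e1, B)          A[oo e2 e1] -> B[oo e2]      (Case 4b)

def recover(ts, a, delta, prods, vt):
    """Return one derivation of Gw, as a list of productions.
    ts = [(A1, eta1), ..., (An, etan)], topmost stack symbol first."""
    (A1, e1), rest = ts[0], ts[1:]

    def d(q, s):
        return delta.get((q, s), set())

    def ok(B, e):
        # either n > 1 and A2 in delta(B, e), or n = 1 and a in delta(B, e)
        return (rest[0][0] if rest else a) in d(B, e)

    for p in prods:
        # Case 1
        if p[0] == "leaf" and not rest and p[1:] == (A1, e1, a):
            return [p]
        # Cases 2a/2b
        if p[0] in ("adjL", "adjR") and (p[1], p[2]) == (A1, e1):
            if p[0] == "adjL":
                B, eB, C, eC = p[3], p[4], p[5], p[6]
            else:
                C, eC, B, eB = p[3], p[4], p[5], p[6]
            bs = d(C, eC) & vt
            if bs and ok(B, eB):
                b = min(bs)  # any terminal reachable from C on eC will do
                return ([p] + recover([(B, eB)] + rest, a, delta, prods, vt)
                            + recover([(C, eC)], b, delta, prods, vt))
        # Case 3
        if p[0] == "copy" and (p[1], p[2]) == (A1, e1):
            B, eB = p[3], p[4]
            if ok(B, eB):
                return [p] + recover([(B, eB)] + rest, a, delta, prods, vt)
        # Case 4a
        if p[0] == "push" and (p[1], p[2]) == (A1, e1):
            B, eB = p[3], p[4]
            for C in d(B, eB) - vt:
                if ok(C, e1):
                    return [p] + recover([(B, eB), (C, e1)] + rest,
                                         a, delta, prods, vt)
        # Case 4b
        if p[0] == "pop" and (p[1], p[3]) == (A1, e1):
            if rest and rest[0] == (p[4], p[2]):
                return [p] + recover(rest, a, delta, prods, vt)
    raise ValueError("no case applies; cannot happen once useless "
                     "states and productions have been removed")
```

The sketch returns the first derivation found rather than enumerating all of them; each case only recurses when the nfa confirms that the chosen production can be continued, which is exactly the guarantee stated above.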
Given the form of the nonterminals and produc-
tions of Gw we can see that the complexity of ex-
tracting a parse as above is dominated by the com-
plexity of Case 4a, which takes O(n^4) time. If in
Go every elementary tree has at least one terminal
symbol in its frontier (as in a lexicalized tag) then
to derive a string of length n there can be at most n
adjunctions. In that case, when we wish to recover
a parse the derivation height (which gives the recursion
depth of the invocation of the above procedure)
is O(n) and hence recovery of a parse will take O(n^5)
time.
10 Conclusions 
We have shown that there are two distinct ways of 
representing the parses of a tag using lig and cfg. 
• The cfg representation captures the fact that the
choice of which trees to adjoin at each step of a
derivation is context-free. In this approach the
number of nonterminals is O(n^4), the number
of productions is O(n^6) and, hence, the recog-
nition problem can be resolved in O(n^6) time
with O(n^4) space. Note that now the prob-
lem of whether the input string can be derived
in the tag grammar is equivalent to deciding
whether the shared forest cfg obtained generates
the empty language or not. Each derivation of
the shared forest cfg represents a parse of the
given input string by the tag.
• In the scheme that uses lig the number of non-
terminals is O(n^2) and the number of produc-
tions is O(n^3). While the shared forest representation
is more compact in the case
of lig, recovering a parse is less straightforward.
In order to facilitate recovery of a parse as well
as to solve the recognition problem (i.e., deter-
mine if the language generated by the shared
forest grammar is nonempty) we use an aug-
mented data structure (the nfa M_Gw). With
this structure the recognition problem can again
be resolved in O(n^6) time with O(n^4) space and the
extraction of a parse has O(n^5) time complexity.
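The recognition test mentioned in the first point, whether the shared forest cfg generates the empty language, is the standard computation of productive nonterminals. A minimal sketch, in which the grammar encoding (productions as (lhs, rhs-list) pairs, terminals being the symbols that never occur as a left-hand side) is an assumption made here for illustration:

```python
def generates_nonempty(productions, start):
    """Decide whether a cfg generates a nonempty language, by the
    usual fixed point over 'productive' nonterminals."""
    lhs_syms = {lhs for lhs, _ in productions}
    productive = set()        # nonterminals known to derive a terminal string
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            # a production is usable once every rhs symbol is a terminal
            # or an already-productive nonterminal
            if lhs not in productive and all(
                    s in productive or s not in lhs_syms for s in rhs):
                productive.add(lhs)
                changed = True
    return start in productive
```

For the shared forest cfg this test is what reduces tag recognition of the input string to an emptiness check on the specialized grammar.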
The work described here is intended to provide a
general framework that can be used to study and
compare existing tag parsing algorithms (for exam-
ple [Vijay-Shanker and Joshi, 1985; Vijay-Shanker
and Weir, in pressb; Schabes and Joshi, 1988]). If
we factor out the particular dynamic programming
algorithm used to determine the sequence in which
these rules are considered, then the productions of
our cfg and lig shared forest grammars encapsulate
the steps of all of these algorithms. In particular,
the algorithm presented in [Vijay-Shanker and Joshi,
1985] can be seen to correspond to the approach in-
volving the use of cfg to encode derivations, whereas
the algorithm of [Vijay-Shanker and Weir, in pressb]
uses lig in this role. Although the space complexity
of the cited parsing algorithms is O(n^4), the data
structures used by them do not explicitly give the
shared forest representation provided by our shared
forest grammars. The data structures would have
to be extended to record how each entry in the table
gets added. With this kind of additional information
the space requirements of these algorithms would be-
come O(n^6).
It is perhaps not surprising that the lig shared for-
est and cfg shared forest described here turn out
to be closely related. In the nfa M_Gw (after use-
less symbols have been removed) we have (B,p,q) ∈
δ((A,i,j), η) if and only if in the cfg shared forest
(A, η, i, j, p, q) is not a useless symbol. In addition,
there is a close correspondence between productions
in the two shared forest grammars. This shows that
the two schemes result in essentially the same algo-
rithms, storing essentially the same information in
the tables that they build.
We end by noting that Lang [1992] also considers
tag parsing with shared forest grammars; however,
he uses the tag formalism itself to encode the shared
forest. This does not utilize the distinction between
derivation and derived trees in a tag. The algorithms
presented here specialize the derivation tree gram-
mar to get the shared forest whereas Lang [1992] spe-
cializes the object grammar itself. As a result, in or-
der to get O(n^6) time complexity Lang must assume
that the object grammar is in a very restricted normal
form.
References
[Aho, 1968] A. V. Aho. Indexed grammars -- an
extension of context-free grammars. J. ACM,
15:647-671, 1968.
[Billot and Lang, 1989] S. Billot and B. Lang. The
structure of shared forests in ambiguous parsing.
In 27th meeting Assoc. Comput. Ling., 1989.
[Gazdar, 1988] G. Gazdar. Applicability of indexed
grammars to natural languages. In U. Reyle and
C. Rohrer, editors, Natural Language Parsing and
Linguistic Theories. D. Reidel, Dordrecht, Hol-
land, 1988.
[Joshi et al., 1975] A. K. Joshi, L. S. Levy, and
M. Takahashi. Tree adjunct grammars. J. Com-
put. Syst. Sci., 10(1), 1975.
[Lang, 1992] B. Lang. Recognition can be harder
than parsing. Presented at the Second TAG Work-
shop, 1992.
[Schabes and Joshi, 1988] Y. Schabes and A. K.
Joshi. An Earley-type parsing algorithm for tree
adjoining grammars. In 26th meeting Assoc. Com-
put. Ling., 1988.
[Vijay-Shanker and Joshi, 1985]
K. Vijay-Shanker and A. K. Joshi. Some compu-
tational properties of tree adjoining grammars. In
23rd meeting Assoc. Comput. Ling., pages 82-93,
1985.
[Vijay-Shanker and Weir, in pressa]
K. Vijay-Shanker and D. J. Weir. The equiva-
lence of four extensions of context-free grammars.
Math. Syst. Theory, in press.
[Vijay-Shanker and Weir, in pressb]
K. Vijay-Shanker and D. J. Weir. Parsing con-
strained grammar formalisms. Comput. Ling., in
press.
[Vijay-Shanker, 1987] K. Vijay-Shanker. A Study of
Tree Adjoining Grammars. PhD thesis, University
of Pennsylvania, Philadelphia, PA, 1987.
