A Uniform Method of Grammar Extraction 
and Its Applications 
Fei Xia and Martha Palmer and Aravind Joshi 
Department of Computer and Information Science 
University of Pennsylvania 
Philadelphia PA 19104, USA 
{fxia, mpalmer, joshi}@linc.cis.upenn.edu
Abstract 
Grammars are core elements of many NLP ap- 
plications. In this paper, we present a system 
that automatically extracts lexicalized gram- 
mars from annotated corpora. The data pro- 
duced by this system have been used in sev- 
eral tasks, such as training NLP tools (such 
as Supertaggers) and estimating the coverage 
of hand-crafted grammars. We report experimental
results on two of those tasks and compare
our approaches with related work.
1 Introduction 
Various grammar frameworks have been proposed
for natural languages. We take Lexicalized
Tree-adjoining Grammars (LTAGs) as
representative of a class of lexicalized grammars.
LTAGs (Joshi et al., 1975) are appealing
for representing various phenomena
in natural languages due to their linguistic and
computational properties. In the last decade,
LTAG has been used in several aspects of
natural language understanding (e.g., parsing
(Schabes, 1990; Srinivas, 1997), semantics
(Joshi and Vijay-Shanker, 1999; Kallmeyer
and Joshi, 1999), and discourse (Webber and
Joshi, 1998)) and a number of NLP applications
(e.g., machine translation (Palmer et al.,
1998), information retrieval (Chandrasekar
and Srinivas, 1997), and generation (Stone
and Doran, 1997; McCoy et al., 1992)). This
paper describes a system that extracts LTAGs
from annotated corpora (i.e., Treebanks).
There has been much work done on extract- 
ing Context-Free grammars (CFGs) (Shirai 
et al., 1995; Charniak, 1996; Krotov et al., 
1998). However, extracting LTAGs is more 
complicated than extracting CFGs because 
of the differences between LTAGs and CFGs. 
First, the primitive elements of an LTAG are 
lexicalized tree structures (called elementary 
trees), not context-free rules (which can be 
seen as trees with depth one). Therefore, an
LTAG extraction algorithm needs to examine 
a larger portion of a phrase structure to build 
an elementary tree. Second, the composition 
operations in LTAG are substitution (same 
as the one in a CFG) and adjunction. It is 
the operation of adjunction that distinguishes 
LTAG from all other formalisms. Third, un- 
like in CFGs, the parse trees (also known as 
derived trees in the LTAG) and the derivation 
trees (which describe how elementary trees 
are combined to form parse trees) are differ- 
ent in the LTAG formalism in the sense that 
a parse tree can be produced by several dis- 
tinct derivation trees. Therefore, to provide 
training data for statistical LTAG parsers, an 
LTAG extraction algorithm should also build 
derivation trees. 
For each phrase structure in a Treebank, 
our system creates a fully bracketed phrase 
structure, a set of elementary trees and 
a derivation tree. The data produced by 
our system have been used in several NLP 
tasks. We report experimental results on two 
of those applications and compare our ap- 
proaches with related work. 
2 LTAG formalism 
The primitive elements of an LTAG are ele- 
mentary trees (etrees). Each etree is associ- 
ated with a lexical item (called the anchor of 
the tree) on its frontier. We choose LTAGs as 
our target grammars (i.e., the grammars to be 
extracted) because LTAGs possess many desirable
properties, such as the Extended Domain
of Locality, which allows the encapsulation
of all arguments of the anchor associated
with an etree. There are two types of etrees:
initial trees and auxiliary trees. An auxiliary 
tree represents recursive structure and has a 
unique leaf node, called the foot node, which 
has the same syntactic category as the root 
node. Leaf nodes other than anchor nodes 
and foot nodes are substitution nodes. Etrees
are combined by two operations: substitution 
and adjunction, as in Figures 1 and 2. The
resulting structure of the combined etrees is 
called a derived tree. The combination pro- 
cess is expressed as a derivation tree. 
Figure 1: The substitution operation 
Figure 2: The adjunction operation 
Figure 3 shows the etrees, the derived tree,
and the derivation tree for the sentence
underwriters still draft policies. Foot and
substitution nodes are marked by * and ↓,
respectively. The dashed and solid lines in the
derivation tree are for adjunction and
substitution operations, respectively.
3 System Overview 
We have built a system, called LexTract, for 
grammar extraction. The architecture of Lex- 
Tract is shown in Figure 4 (the parts that will 
be discussed in this paper are in bold). The 
core of LexTract is an extraction algorithm 
that takes a Treebank sentence such as the 
one in Figure 5 and produces the trees (el- 
ementary trees, derived trees and derivation 
trees) such as the ones in Figure 3. 
3.1 The Form of Target Grammars 
Without further constraints, the etrees in the 
target grammar could be of various shapes. 
[Figure 3 panels: (a) etrees #1-#4, anchored by underwriters, still, draft, and policies; (b) derived tree; (c) derivation tree, rooted at draft(#3) with children underwriters(#1), still(#2), and policies(#4)]
Figure 3: Etrees, derived tree and derivation 
tree for underwriters still draft policies 
[Figure 4 diagram not recoverable from the extracted text; legible labels: LexTract system, Treebank, Treebank-specific information, LTAGs, derivation trees, implausible etrees]
Figure 4: Architecture of LexTract 
((S (PP-LOC (IN at)
            (NP (NNP FNX)))
    (NP-SBJ-1 (NNS underwriters))
    (ADVP (RB still))
    (VP (VBP draft)
        (NP (NNS policies))
        (S-MNR
          (NP-SBJ (-NONE- *-1))
          (VP (VBG using)
              (NP
                (NP (NN fountain) (NNS pens))
                (CC and)
                (NP (VBG blotting) (NN papers))))))))
Figure 5: A Treebank example 
Our system recognizes three types of rela- 
tion (namely, predicate-argument, modifica- 
tion, and coordination relations) between the 
anchor of an etree and other nodes in the etree, 
and imposes the constraint that all the etrees 
to be extracted should fall into exactly one of 
the three patterns in Figure 6. 
• The spine-etrees for predicate-argument
relations. X^0 is the head of X^m and the
anchor of the etree. The etree is formed
by a spine X^m -> X^{m-1} -> ... -> X^0 and
the arguments of X^0.
• The mod-etrees for modification relations.
The root of the etree has two children:
one is a foot node with the label
W^q, and the other node X^m is a modifier
[Figure 6 panels: (a) spine-etree, (b) mod-etree, (c) conj-etree]
Figure 6: Three types of elementary trees in 
the target grammar 
of the foot node. X^m is further expanded
into a spine-etree whose head X^0 is the
anchor of the whole mod-etree.
• The conj-etrees for coordination rela- 
tions. In a conj-etree, the children of the 
root are two conjoined constituents and a 
node for a coordination conjunction. One 
conjoined constituent is marked as the 
foot node, and the other is expanded into 
a spine-etree whose head is the anchor of 
the whole tree. 
Spine-etrees are initial trees, whereas mod- 
etrees and conj-etrees are auxiliary trees. 
3.2 Treebank-specific information 
The phrase structures in the Treebank (ttrees 
for short) are partially bracketed in the sense 
that arguments and modifiers are not struc- 
turally distinguished. In order to construct 
the etrees, which make such distinction, Lex- 
Tract requires its user to provide additional 
information in the form of three tables: a 
Head Percolation Table, an Argument Table, 
and a Tagset Table. 
A Head Percolation Table has previ- 
ously been used in several statistical parsers 
(Magerman, 1995; Collins, 1997) to find heads 
of phrases. Our strategy for choosing heads is 
similar to the one in (Collins, 1997). An Ar- 
gument Table informs LexTract what types of 
arguments a head can take. The Tagset Table 
specifies what function tags always mark ar- 
guments (adjuncts, heads, respectively). Lex- 
Tract marks each sibling of a head as an argu- 
ment if the sibling can be an argument of the 
head according to the Argument Table and 
none of the function tags of the sibling indicates
that it is an adjunct. For example, in
Figure 5, the head of the root S is the verb 
draft, and the verb has two siblings: the noun 
phrase policies is marked as an argument of 
the verb because from the Argument Table we 
know that verbs in general can take an NP ob- 
ject; the clause is marked as a modifier of the 
verb because, although verbs in general can 
take a sentential argument, the Tagset Table 
informs LexTract that the function tag -MNR 
(manner) always marks a modifier. 
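The decision rule just described can be sketched in a few lines of Python. The table contents below are tiny hypothetical excerpts chosen to reproduce the draft example; the real tables are supplied by the LexTract user, and treating every non-argument sibling as a modifier is a simplifying assumption of this sketch.

```python
# Hypothetical, heavily abridged tables (assumptions for illustration only).
ARGUMENT_TABLE = {"VBP": {"NP", "S"}}  # head tag -> categories it can take as arguments
ADJUNCT_TAGS = {"MNR"}                 # function tags that always mark modifiers


def split_label(label):
    """Split 'S-MNR' into ('S', ['MNR']); numeric indices such as '-1' are dropped."""
    category, *tags = label.split("-")
    return category, [t for t in tags if not t.isdigit()]


def mark_sibling(head_tag, sibling_label):
    """Return 'argument' or 'modifier' for one sibling of the head."""
    category, tags = split_label(sibling_label)
    if any(tag in ADJUNCT_TAGS for tag in tags):
        return "modifier"      # a function tag overrides the Argument Table
    if category in ARGUMENT_TABLE.get(head_tag, set()):
        return "argument"
    return "modifier"          # assumption: anything else is treated as a modifier


np_role = mark_sibling("VBP", "NP")        # the noun phrase policies
clause_role = mark_sibling("VBP", "S-MNR") # the manner clause
```

Checking the function tags before the Argument Table mirrors the example in the text, where -MNR forces a modifier reading even though verbs can take sentential arguments.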
3.3 Overview of the Extraction 
Algorithm 
The extraction process has three steps: first,
LexTract fully brackets each ttree; second,
LexTract decomposes the fully bracketed ttree
((S (PP-LOC (IN at)
            (NP (NNP FNX)))
    (S (NP-SBJ-1 (NNS underwriters))
       (VP (ADVP (RB still))
           (VP (VP (VBP draft)
                   (NP (NNS policies)))
               (S-MNR (NP-SBJ (-NONE- *-1))
                      (VP (VBG using)
                          (NP (NP (NN fountain) (NP (NNS pens)))
                              (CC and)
                              (NP (VBG blotting)
                                  (NP (NN papers)))))))))))
Figure 7: The fully bracketed ttree 
into a set of etrees; third, LexTract builds the
derivation tree for the ttree.
3.3.1 Fully bracketing ttrees 
As just mentioned, the ttrees in the Tree- 
bank do not explicitly distinguish arguments 
and modifiers, whereas etrees do. To account 
for this difference, we first fully bracket the 
ttrees by adding intermediate nodes so that 
at each level, one of the following relations 
holds between the head and its siblings: (1) 
head-argument relation, (2) modification re- 
lation, and (3) coordination relation. Lex- 
Tract achieves this by first choosing the head- 
child at each level and distinguishing argu- 
ments from adjuncts with the help of the three 
tables mentioned in Section 3.2, then adding 
intermediate nodes so that the modifiers and 
arguments of a head attach to different levels. 
Figure 7 shows the fully bracketed ttree. The 
nodes inserted by LexTract are in bold face. 
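As a rough illustration of the bracketing step for a single level, the Python sketch below groups a head with its arguments and then wraps each modifier in its own intermediate node. It is a simplification under stated assumptions, not LexTract's algorithm: the roles are given as input, arguments are assumed to be adjacent to the head, coordination is ignored, and the order in which left and right modifiers are attached is an arbitrary choice of this sketch.

```python
def fully_bracket(label, children, roles):
    """Bracket one level of a ttree.

    children: the child subtrees (any objects, here plain strings)
    roles:    parallel list with exactly one 'head' and otherwise
              'argument' or 'modifier'
    Returns nested (label, [children]) tuples.
    """
    head = roles.index("head")
    lo = hi = head
    # Grow the core span over the arguments adjacent to the head.
    while lo > 0 and roles[lo - 1] == "argument":
        lo -= 1
    while hi < len(roles) - 1 and roles[hi + 1] == "argument":
        hi += 1
    outside = roles[:lo] + roles[hi + 1:]
    assert all(r == "modifier" for r in outside), "sketch assumes arguments adjoin the head"

    node = (label, list(children[lo:hi + 1]))
    for modifier in children[hi + 1:]:         # right modifiers, closest first
        node = (label, [node, modifier])
    for modifier in reversed(children[:lo]):   # left modifiers, closest first
        node = (label, [modifier, node])
    return node


# VP -> VBP NP S-MNR from Figure 5, with the roles assigned in Section 3.2
bracketed = fully_bracket("VP", ["VBP", "NP", "S-MNR"], ["head", "argument", "modifier"])
```

For this input the result is (VP (VP VBP NP) S-MNR), which has the same shape as the corresponding part of Figure 7.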
3.3.2 Building etrees 
In this step, LexTract removes recursive struc- 
tures - which will become mod-etrees or conj- 
etrees - from the fully bracketed ttrees and 
builds spine-etrees for the non-recursive struc- 
tures. Starting from the root of a fully brack- 
eted ttree, LexTract first finds a unique path 
from the root to its head. It then checks each 
node e on the path. If a sibling of e in the ttree 
is marked as a modifier, LexTract marks e and 
e's parent, and builds a mod-etree (or a conj- 
etree if e has another sibling which is a con- 
junction) with e's parent as the root node, e as 
the foot node, and e's siblings as the modifier. 
Next, LexTract creates a spine-etree with the 
remaining unmarked nodes on the path and 
their siblings. Finally, LexTract repeats this 
process for the nodes that are not on the path. 
In Figure 8, which shows the same ttree as
Figure 7 except that some nodes are numbered
and split into top and bottom pairs,¹ the
¹ When a pair of etrees is combined during parsing,
Figure 8: The extracted etrees can be seen as
a decomposition of the fully bracketed ttree
[Figure 9 content: twelve etrees, #1 (at), #2 (FNX), #3 (underwriters), #4 (still), #5 (draft), #6 (policies), #7 (using), #8 (fountain), #9 (pens), #10 (and), #11 (blotting), #12 (papers)]
Figure 9: The extracted etrees from the fully 
bracketed ttree 
path from the root S1 to the head VBP is
S1 -> S2 -> VP1 -> VP2 -> VP3 -> VBP.
Along the path, the PP (at FNX) is a
modifier of S2; therefore, S1.b, S2.t, and the
spine-etree rooted at PP form a mod-etree
#1. Similarly, the ADVP still is a modifier
of VP2 and S3 is a modifier of VP3, and the
corresponding structures form mod-etrees #4
and #7. On the path from the root to VBP,
S1.t and S2.b are merged (and so are VP1.t
and VP3.b) to form the spine-etree #5.
Repeating this process for other nodes will
generate other trees such as trees #2, #3 and #6.
The whole ttree yields twelve etrees as shown
in Figure 9.
3.3.3 Building derivation trees 
The fully bracketed ttree is in fact a derived 
tree of the sentence if the sentence is parsed 
with the etrees extracted by LexTract. In ad- 
dition to these etrees and the derived tree, we 
the root of one etree is merged with a node in the other 
etree. Splitting nodes into top and bottom pairs during 
the decomposition of the derived tree is the reverse 
process of merging nodes during parsing. For the sake 
of simplicity, we show the top and the bottom parts of 
a node only when the two parts will end up in different 
etrees. 
also need derivation trees to train statistical 
LTAG parsers. Recall that, in general, given 
a derived tree, the derivation tree that can 
generate the derived tree may not be unique. 
Nevertheless, given the fully bracketed ttree, 
the etrees, and the positions of the etrees in 
the ttree (see Figure 8), the derivation tree 
becomes unique if we choose either one of the 
following: 
• We adopt the traditional definition of 
derivation trees (which allows at most one 
adjunction at any node) and add an ad- 
ditional constraint which says that no ad- 
junction operation is allowed at the foot 
node of any auxiliary tree.²
• We adopt the definition of derivation 
trees in (Schabes and Shieber, 1992) 
(which allows multiple adjunction at any 
node) and require that all mod-etrees adjoin
to the etree that they modify.
The user of LexTract can choose either option
and inform LexTract about the choice by
setting a parameter.³ Figure 10 shows the
derivation tree based on the second option.
draft(#5)
  at(#1)
    FNX(#2)
  underwriters(#3)
  still(#4)
  policies(#6)
  using(#7)
    pen(#9)
      fountain(#8)
      paper(#12)
        and(#10)
        blotting(#11)
Figure 10: The derivation tree for the sentence 
3.4 Uniqueness of decomposition 
To summarize, LexTract is a language-independent
grammar extraction system,
which takes Treebank-specific information
(see Section 3.2) and a ttree T, and creates
² Without this additional constraint, the derivation
tree sometimes is not unique. For example, in Figure
8, both #4 and #7 modify the etree #5. If adjunction
were allowed at foot nodes, #4 could adjoin to
#7 at VP2.b, and #7 would adjoin to #5 at VP3.b.
An alternative is for #4 to adjoin to #5 at VP3.b and
for #7 to adjoin to #4 at VP2.t. The no-adjunction-at-foot-node
constraint would rule out the latter
alternative and make the derivation tree unique. Note
that this constraint has been adopted by several
hand-crafted grammars such as the XTAG grammar for
English (XTAG-Group, 1998), because it eliminates this
source of spurious ambiguity.
³ This decision may affect the parsing accuracy of an
LTAG parser that uses the derivation trees for training,
but it will not affect the results reported in this
paper.
(1) a fully bracketed ttree T*, (2) a set Eset 
of etrees, and (3) a derivation tree D for T*. 
Furthermore, Eset is the only tree set that 
satisfies all the following conditions: 
(C1) Decomposition: The tree set is a de- 
composition of T*, that is, T* would be 
generated if the trees in the set were com- 
bined via the substitution and adjunction 
operations. 
(C2) LTAG formalism: Each tree in the 
set is a valid etree, according to the LTAG 
formalism. For instance, each tree should 
be lexicalized and the arguments of the 
anchor should be encapsulated in the 
same etree. 
(C3) Target grammar: Each tree in the 
set falls into one of the three types as 
specified in Section 3.1. 
(C4) Treebank-specific information: 
The head/argument/adjunct distinction 
in the trees is made according to the 
Treebank-specific information provided 
by the user as specified in Section 3.2. 
[Figure 11 content: the fully bracketed ttree T* for John left (S over NP and VP, NP over N, VP over V) and six of its candidate tree sets, E1-E6]
Figure 11: Tree sets for a fully bracketed ttree 
This uniqueness of the tree set may be quite
surprising at first sight, considering that the
number of possible decompositions of T* is
Ω(2^n), where n is the number of nodes in T*.⁴
Instead of giving a proof of the uniqueness, 
⁴ Recall that the process of building etrees has two
steps. First, LexTract treats each node as a pair of
top and bottom parts. The ttree is cut into pieces
along the boundaries of the top and bottom parts of
some nodes. The top and the bottom parts of each
node belong to either two distinct pieces or one piece;
as a result, there are 2^n distinct partitions. Second,
some non-adjacent pieces in a partition can be glued
together to form a bigger piece. Therefore, each partition
will result in one or more decompositions of the
ttree. In total, there are at least 2^n decompositions of
the ttree.
we use an example to illustrate how the con- 
ditions (C1)--(C4) rule out all the decompo- 
sitions except the one produced by LexTract. 
In Figure 11, the ttree T* has 5 nodes (i.e., 
S, NP, N, VP, and V). There are 32 distinct 
decompositions for T*, 6 of which are shown 
in the same figure. Out of these 32 decom- 
positions, only five (i.e., E2 -- E6) are fully 
lexicalized -- that is, each tree in these tree 
sets is anchored by a lexical item. The rest, 
including E1, are not fully lexicalized, and are
therefore ruled out by the condition (C2). For 
the remaining five etree sets, E2 -- E4 are 
ruled out by the condition (C3), because each 
of these tree sets has one tree that violates one 
constraint which says that in a spine-etree an 
argument of the anchor should be a substitu- 
tion node, rather than an internal node. For 
the remaining two, E5 is ruled out by (C4) 
because according to the Head Percolation Table provided
by the user, the head of the S node should be 
V, not N. Therefore, E6, the tree set that is 
produced by LexTract, is the only etree set for 
T* that satisfies (C1)--(C4). 
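The counting argument behind the 2^n figure can be checked with a short Python sketch. It enumerates only the cut choices described in the footnote on partitions (each node's top and bottom parts are either kept together or cut apart); the gluing step, which can only add further decompositions, is not modelled. The variable names are illustrative assumptions.

```python
from itertools import product

# The five nodes of T* in Figure 11.
nodes = ["S", "NP", "N", "VP", "V"]

# One boolean per node: True means the node is cut between its top and bottom parts.
partitions = list(product([False, True], repeat=len(nodes)))

# Pair each partition with the nodes it cuts, for inspection.
cuts = [[n for n, cut in zip(nodes, choice) if cut] for choice in partitions]

count = len(partitions)
```

For the five-node tree this yields 32 partitions, matching the number of decompositions quoted in the example.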
3.5 The Experiments 
We ran LexTract on the one-million-word
English Penn Treebank (Marcus et
al., 1993) and obtained two Treebank grammars.
The first one, G1, uses the Treebank's 
tagset. The second Treebank grammar, 
G2, uses a reduced tagset, where some tags 
in the Treebank tagset are merged into a 
single tag. For example, the tags for verbs, 
MD/VB/VBP/VBZ/VBN/VBD/VBG, are 
merged into a single tag V. The reduced 
tagset is basically the same as the tagset 
used in the XTAG grammar (XTAG-Group, 
1998). G2 is built so that we can compare 
it with the XTAG grammar, as will be 
discussed in the next section. We also ran the 
system on the 100-thousand-word Chinese 
Penn Treebank (Xia et al., 2000b) and on a 
30-thousand-word Korean Penn Treebank. 
The sizes of extracted grammars are shown in 
Table 1. (For more discussion on the Chinese 
and the Korean Treebanks and the compar- 
ison between these Treebank grammars, see 
(Xia et al., 2000a)). The second column of 
the table lists the numbers of unique tem- 
plates in each grammar, where templates are 
etrees with the lexical items removed.⁵ The
third column shows the numbers of unique 
⁵ For instance, #3, #6 and #9 in Figure 9 are three
different etrees but they share the same template. An
etree can be seen as a (word, template) pair.
etrees. The average numbers of etrees for each
word type in G1 and G2 are 2.67 and 2.38,
respectively. Because frequent words often
anchor many etrees, the numbers increase by
more than 10 times when we consider word
tokens, as shown in the fifth and sixth columns
of the table. G3 and G4 are much smaller
than G1 and G2 because the Chinese and the 
Korean Treebanks are much smaller than the 
English Treebank. 
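The grammar-size statistics can be illustrated with a small Python sketch that treats an etree as a (word, template) pair. The function name, the toy data, and the template names are assumptions made for illustration. The per-token figure is computed here as the average, over corpus tokens, of the number of etree types anchored by each token's word; this is one reading of the text, not a definition taken from it.

```python
from collections import Counter


def grammar_stats(tagged_tokens):
    """tagged_tokens: iterable of (word, template) pairs, one per corpus token."""
    tokens = list(tagged_tokens)
    etree_types = set(tokens)
    word_types = {word for word, _ in etree_types}
    template_types = {template for _, template in etree_types}
    # Number of distinct etrees anchored by each word type.
    etrees_per_word = Counter(word for word, _ in etree_types)
    return {
        "template_types": len(template_types),
        "etree_types": len(etree_types),
        "word_types": len(word_types),
        "etrees_per_word_type": len(etree_types) / len(word_types),
        "etrees_per_word_token": sum(etrees_per_word[w] for w, _ in tokens) / len(tokens),
    }


# Toy corpus with hypothetical template names.
toy = [("pens", "t_NP"), ("draft", "t_V"), ("draft", "t_N"),
       ("draft", "t_V"), ("policies", "t_NP")]
stats = grammar_stats(toy)

# The per-word-type average reported for G1 follows from its etree and word type counts.
g1_ratio = round(131397 / 49206, 2)
```

In the toy corpus the ambiguous word draft occurs three times, so the per-token average (1.6) exceeds the per-type average (about 1.33), the same effect the text describes for frequent words.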
In addition to LTAGs, by reading context- 
free rules off the etrees of a Treebank LTAG, 
LexTract also produces CFGs. The numbers 
of unlexicalized context-free rules from G1-- 
G4 are shown in the last column of Table 1. 
Compared with other CFG extraction algorithms
such as the one in (Krotov et al., 1998),
the CFGs produced by LexTract have several
good properties. For example, they allow
unary rules and epsilon rules, they are more
compact, and the size of the grammar remains
monotonic as the Treebank grows.
Figure 12 shows the log frequency of templates
and the percentage of template tokens
covered by template types in G1.⁶ In both
cases, template types are sorted according to
their frequencies and plotted on the X-axes.
The figure indicates that a small portion of
template types, which can be seen as the core
of the grammar, covers the majority of template
tokens in the Treebank. For example, the first
100 (500, 1000 and 1500, resp.) templates
cover 87.1% (96.6%, 98.4% and 99.0%, resp.)
of the tokens, whereas about half (3411) of
the templates each occur only once, accounting
for only 0.29% of template tokens in total.
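A coverage curve of this kind can be computed from the sequence of template tokens alone. The following Python sketch shows one way to do so; the function names and the toy data are assumptions for illustration, and the numbers it produces are not those of G1.

```python
from collections import Counter


def coverage_of_top_k(template_tokens, ks):
    """Fraction of tokens covered by the k most frequent template types."""
    freqs = sorted(Counter(template_tokens).values(), reverse=True)
    total = sum(freqs)
    return {k: sum(freqs[:k]) / total for k in ks}


def singleton_share(template_tokens):
    """Return (share of template types seen once, share of tokens they cover)."""
    counts = Counter(template_tokens)
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(counts), singles / sum(counts.values())


# Toy data: three template types with frequencies 6, 3 and 1.
toy_tokens = ["a"] * 6 + ["b"] * 3 + ["c"]
curve = coverage_of_top_k(toy_tokens, ks=[1, 2, 3])
type_share, token_share = singleton_share(toy_tokens)
```

Run over the (word, template) sequence of a Treebank, the first function yields the points of a curve like the one in panel (b) of Figure 12, and the second yields figures comparable to the singleton counts quoted above.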
4 Applications of LexTract 
In addition to extracting LTAGs and CFGs, LexTract
has been used to perform the following
tasks:
• We use the Treebank grammars produced 
by LexTract to evaluate the coverage of 
hand-crafted grammars. 
• We use the (word, template) sequence 
produced by LexTract to re-train Srini- 
vas' Supertaggers (Srinivas, 1997). 
• The derivation trees created by LexTract 
are used to train a statistical LTAG 
parser (Sarkar, 2000). LexTract output 
has also been used to train an LR LTAG 
parser (Prolo, 2000). 
⁶ Similar results hold for G2, G3 and G4.
• We have used LexTract to retrieve the 
data from Treebanks to test theoret- 
ical linguistic hypotheses such as the 
Tree-locality Hypothesis (Xia and Bleam,
2000).
• LexTract has a filter that checks the 
plausibility of extracted etrees by decom- 
posing each etree into substructures and 
checking them. Implausible etrees are of- 
ten caused by Treebank annotation er- 
rors. Because LexTract maintains the 
mappings between etree nodes and ttree 
nodes, it can detect certain types of an- 
notation errors. We have used LexTract 
for the final cleanup of the Penn Chinese 
Treebank. 
Due to space limitation, in this paper we 
will only discuss the first two tasks. 
4.1 Evaluating the coverage of 
hand-crafted grammars 
The XTAG grammar (XTAG-Group, 1998)
is a hand-crafted large-scale grammar for English,
which has been developed at the University
of Pennsylvania in the last decade. It has been
used in many NLP applications such as gen- 
eration (Stone and Doran, 1997). Evaluating 
the coverage of such a grammar is important 
for both its developers and its users. 
Previous evaluations (Doran et al., 1994; 
Srinivas et al., 1998) of the XTAG grammar 
use raw data (i.e., a set of sentences with- 
out syntactic bracketing). The data are first 
parsed by an LTAG parser and the coverage 
of the grammar is measured as the percent- 
age of sentences in the data that get at least 
one parse, which is not necessarily the correct 
parse. For more discussion on this approach, 
see (Prasad and Sarkar, 2000). 
We propose a new evaluation method that 
takes advantage of Treebanks and LexTract. 
The idea is as follows: given a Treebank T and 
a hand-crafted grammar Gh, the coverage of 
Gh on T can be measured by the overlap of Gh 
and a Treebank grammar Gt that is produced 
by LexTract from T. In this case, we will esti- 
mate the coverage of the XTAG grammar on 
the English Penn Treebank (PTB) using the 
Treebank grammar G2. 
There are obvious differences between these 
two grammars. For example, feature struc- 
tures and multi-anchor etrees are present only 
in the XTAG grammar, whereas frequency in- 
formation is available only in G2. When we 
match templates in two grammars, we disre- 
         template   etree     word     etree types     etree types      CFG rules
         types      types     types    per word type   per word token   (unlexicalized)
Eng G1   6926       131,397   49,206   2.67            34.68            1524
Eng G2   2920       117,356   49,206   2.38            27.70            675
Ch G3    1140       21,125    10,772   1.96            9.13             515
Kor G4   634        9,787     6,747    1.45            2.76             177
Table 1: Grammars extracted from three Treebanks
[Figure 12 panels: (a) Frequency of templates, (b) Coverage of templates]
Figure 12: Template types and template tokens in G1 
gard the type of information that is present 
only in one grammar. As a result, the map- 
ping between two grammars is not one-to-one. 
            matched     unmatched   total
            templates   templates
XTAG        497         507         1004
G2          215         2705        2920
frequency   82.1%       17.9%       100%
Table 2: Matched templates in two grammars
Table 2 shows that 497 templates in the 
XTAG grammar and 215 templates in G2 
match, and the latter accounts for 82.1% of 
the template tokens in the PTB. The remaining
17.9% of template tokens in the PTB do not
match any template in the XTAG grammar
for one of the following reasons:
(T1) Incorrect templates in G2: These tem- 
plates result from Treebank annotation er- 
rors, and therefore, are not in XTAG. 
(T2) Coordination in XTAG: the templates 
for coordinations in XTAG are generated 
on the fly while parsing (Sarkar and Joshi, 
1996), and are not part of the 1004 templates. 
Therefore, the conj-etrees in G2, which ac- 
count for 3.4% of the template tokens in the 
Treebank, do not match any templates in 
XTAG. 
(T3) Alternative analyses: XTAG and PTB 
sometimes choose different analyses for the 
same phenomenon. For example, the two 
grammars treat reduced relative clauses dif- 
ferently. As a result, the templates used to 
handle those phenomena in these two gram- 
mars do not match according to our defini- 
tion. 
(T4) Constructions not covered by XTAG: 
Some of such constructions are the unlike 
coordination phrase (UCP), parenthetical 
(PRN), and ellipsis. 
For (T1)--(T3), the XTAG grammar can 
handle the corresponding constructions al- 
though the templates used in two grammars 
look very different. To find out what construc- 
tions are not covered by XTAG, we manually 
classify 289 of the most frequent unmatched 
templates in G2 according to the reason why 
they are absent from XTAG. These 289 tem- 
plates account for 93.9% of all the unmatched 
template tokens in the Treebank. The results 
are shown in Table 3, where the percentage is 
with respect to all the tokens in the Treebank. 
From the table, it is clear that the most common
reason for mismatches is (T3). Combining
the results in Tables 2 and 3, we conclude
that 97.2% of template tokens in the Treebank
are covered by XTAG, while another 1.7% are
not. For the remaining 1.1% of template tokens,
we do not know whether or not they are covered
by XTAG because we have not checked
the remaining 2416 unmatched templates in
G2.⁷
To summarize, we have just shown that,
⁷ The number 97.2% is the sum of two numbers:
the first is the percentage of matched template tokens
(82.1% from Table 2). The second is the
percentage of template tokens that fall under (T1)-(T3),
i.e., 16.8% - 1.7% = 15.1% from Table 3.
       T1     T2     T3      T4     total
type   51     52     93      93     289
freq   1.1%   3.4%   10.6%   1.7%   16.8%
Table 3: Classifications of 289 unmatched 
templates 
by comparing templates in the XTAG grammar
with the Treebank grammar produced by
LexTract, we estimate that the XTAG grammar
covers 97.2% of template tokens in the
English Treebank. Compared with the previous
evaluation approach, this method has several
advantages. First, the whole process is semi- 
automatic and requires little human effort. 
Second, the coverage can be calculated at ei- 
ther sentence level or etree level, which is more 
fine-grained. Third, the method provides a 
list of etrees that can be added to the gram- 
mar to improve its coverage. Fourth, there 
is no need to parse the whole corpus, which 
could have been very time-consuming. 
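The coverage estimate combines the figures of Tables 2 and 3, and the arithmetic can be replayed in a few lines of Python. The variable names are illustrative; the percentages are the ones reported in the two tables.

```python
# Percentages of template tokens, taken from Tables 2 and 3.
matched = 82.1
unmatched_total = 17.9
classified = {"T1": 1.1, "T2": 3.4, "T3": 10.6, "T4": 1.7}

# (T1)-(T3): XTAG handles the construction even though the templates differ.
covered = matched + sum(share for reason, share in classified.items() if reason != "T4")
# (T4): constructions that XTAG does not cover.
not_covered = classified["T4"]
# Unmatched templates that were not manually classified.
unknown = unmatched_total - sum(classified.values())

summary = (round(covered, 1), round(not_covered, 1), round(unknown, 1))
```

The three values are 97.2, 1.7 and 1.1, which sum to 100% of the template tokens.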
4.2 Training Supertaggers 
A Supertagger (Joshi and Srinivas, 1994; 
Srinivas, 1997) assigns an etree template to 
each word in a sentence. The templates 
are also called Supertags because they in- 
clude more information than Part-of-Speech 
tags. Srinivas implemented the first Supertag- 
ger, and he also built a Lightweight Depen- 
dency Analyzer that assembles the Supertags 
of words to create an almost-parse for the sen- 
tence. Supertaggers have been found useful 
for several applications, such as information 
retrieval (Chandrasekar and Srinivas, 1997). 
To use a Treebank to train a Supertagger, 
the phrase structures in the Treebank have to 
be converted into (word, Supertag) sequences 
first. Producing such sequences is exactly one 
of LexTract's main functions, as shown previ- 
ously in Section 3.3.2 and Figure 9. 
Besides LexTract, there have been two other attempts
at converting the English Penn Treebank
to train a Supertagger. Srinivas (1997)
uses heuristics to map structural information 
in the Treebank into Supertags. His method 
is different from LexTract in that the set of 
Supertags in his method is chosen from the 
pre-existing XTAG grammar before the con- 
version starts, whereas LexTract extracts the 
Supertag set from Treebanks. His conversion 
program is also designed for this particular 
Supertag set, and it is not very easy to port
it to another Supertag set. A third difference 
is that the Supertags in his converted data do 
not always fit together, due to the discrep- 
ancy between the XTAG grammar and the 
Treebank annotation and the fact that the 
XTAG grammar does not cover all the tem- 
plates in the Treebank (see Section 4.1). In 
other words, even if the Supertagger is 100% 
accurate, it is possible that the correct parse 
for a sentence cannot be produced by combining
those Supertags in the sentence.
Another approach to converting Treebanks into
LTAGs is described in (Chen and Vijay-Shanker,
2000). The method is similar to ours
in that both use Head Percolation Tables
to find the head and both distinguish arguments
from adjuncts using syntactic tags and function
tags. Nevertheless, there are several
differences: only LexTract explicitly creates 
fully bracketed ttrees, which are identical to 
the derived trees for the sentences. As a re- 
sult, building etrees can be seen as a task of 
decomposing the fully bracketed ttrees. The 
mapping between the nodes in fully bracketed 
ttrees and etrees makes LexTract a useful tool 
for Treebank annotation and error detection.
The two approaches also differ in how they 
distinguish arguments from adjuncts and how 
they handle coordinations. 
Table 4 lists the tagging accuracy of the 
same trigram Supertagger (Srinivas, 1997) 
trained and tested on the same original PTB 
data.⁸ The difference in tagging accuracy
is caused by different conversion algorithms 
that convert the original PTB data into the 
(word, template) sequences, which are fed 
to the Supertagger. The results of Chen & 
Vijay-Shanker's method come from their pa- 
per (Chen and Vijay-Shanker, 2000). They 
built eight grammars. We just list two of them 
which seem to be most relevant: C4 uses a re- 
duced tagset while C3 uses the PTB tagset. 
As for Srinivas' results, we did not use the re- 
sults reported in (Srinivas, 1997) and (Chen et 
al., 1999) because they are based on different 
training and testing data.⁹ Instead, we re-ran
⁸ All use Sections 2-21 of the PTB for training, and
Section 22 or 23 for testing. We choose those sections
because several state-of-the-art parsers (Collins,
1997; Ratnaparkhi, 1998; Charniak, 1997) are trained
on Sections 2-21 and tested on Section 23. We include
the results for Section 22 because (Chen and Vijay-Shanker,
2000) is tested on that section. For Srinivas'
and our grammars, the first line gives the results on
Section 23, and the second line the results on Section
22. Chen & Vijay-Shanker's results are for Section 22
only.
9He used Section 0-24 minus Section 20 for training 
and the Section 20 for testing. 
60 
his Supertagger using his data on the sections 
that we have chosen. 1° We have calculated 
two baselines for each seg of data. The first 
one tags each word in testing data with the 
most common Supertag w.r.t the word in the 
training data. For an unknown word, just use 
its most common Supertag. For the second 
baseline, we use a trigram POS tagger to tag 
the words first, and then for each word we use 
the most common Supertag w.r.t, the (word, 
POS tag) pair. 

             test sect.  templates   base1   base2   acc
Srinivas'    23          483         72.59   74.24   85.78
             22                      72.14   73.74   85.53
our G2       23          2920        71.45   74.14   84.41
             22                      70.54   73.41   83.60
our G1       23          6926        69.70   71.82   82.21
             22                      68.79   70.90   81.88
Chen's       22          2366-8996                   77.8-78.9
  C4         22          4911                        78.90
  C3         22          8623                        78.00
Table 4: Supertagging results based on three
different conversion algorithms

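The two baselines described above can be sketched as follows. This is only an illustration, not the code used in the experiments: the function names, the toy Supertag labels, and the fallbacks are assumptions. In particular, the text does not spell out what is used when a word or a (word, POS tag) pair was never seen in training, so the sketch falls back to the most common Supertag for the POS tag and then to the most common Supertag overall. The POS tags are taken as given; the trigram POS tagger itself is not implemented here.

```python
from collections import Counter, defaultdict


def most_common_table(pairs):
    """Map each key to the value it co-occurs with most often."""
    counts = defaultdict(Counter)
    for key, value in pairs:
        counts[key][value] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def train(training):
    """training: list of (word, pos, supertag) triples."""
    by_word = most_common_table((w, s) for w, _, s in training)
    by_word_pos = most_common_table(((w, p), s) for w, p, s in training)
    by_pos = most_common_table((p, s) for _, p, s in training)
    default = Counter(s for _, _, s in training).most_common(1)[0][0]
    return by_word, by_word_pos, by_pos, default


def baseline1(word, model):
    """Most common Supertag for the word."""
    by_word, _, _, default = model
    # Assumed fallback for unknown words: overall most common Supertag.
    return by_word.get(word, default)


def baseline2(word, pos, model):
    """Most common Supertag for the (word, POS tag) pair."""
    _, by_word_pos, by_pos, default = model
    # Assumed fallback chain: POS tag alone, then overall most common.
    return by_word_pos.get((word, pos), by_pos.get(pos, default))


# Toy training data with hypothetical Supertag names.
training = [
    ("the", "DT", "t_det"), ("dog", "NN", "t_np"), ("barks", "VBZ", "t_intrans"),
    ("the", "DT", "t_det"), ("dog", "NN", "t_np"), ("walks", "VBZ", "t_intrans"),
    ("walks", "NNS", "t_np"), ("walks", "NNS", "t_np"),
]
model = train(training)
print(baseline1("walks", model))         # word alone: noun reading wins
print(baseline2("walks", "VBZ", model))  # POS tag selects the verb reading
```

In the toy data, "walks" is more often a noun, so the first baseline picks the noun Supertag even in a verbal context, while the second baseline uses the POS tag to choose the verbal one. This is the kind of case in which POS information can help.
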
A few observations are in order. First, the
baselines for Supertagging are lower than the
one for POS tagging, which is 91%, indicating
that Supertagging is harder than POS tagging.
Second, the second baseline is slightly better
than the first baseline, indicating that using
POS tags may improve the Supertagging accuracy.
Third, the Supertagging accuracy using G2 is
1.3-1.9% lower than the one using Srinivas'
data. This is not surprising since the size of
G2 is 6 times that of Srinivas' grammar. Notice
that G1 is twice the size of G2 and the accuracy
using G1 is 2% lower. Fourth, higher
Supertagging accuracy does not necessarily mean
that the quality of the converted data is
better, since the underlying grammars differ a
lot with respect to size and coverage. A better
measure would be the parsing accuracy (i.e., the
converted data should be fed to a common LTAG
parser and the evaluations should be based on
parsing results). We are currently working on
that. Nevertheless, the experiments show that
the (word, template) sequences produced by
LexTract are useful for training Supertaggers.
Our results are slightly lower than the ones
trained on Srinivas' data, but our conversion
algorithm has several appealing properties:
LexTract does not use a pre-existing Supertag
set; LexTract is language-independent; and the
(word, Supertag) sequences produced by LexTract
fit together.

¹⁰ Noticeably, the results we report on Srinivas' data,
85.78% on Section 23 and 85.53% on Section 22, are
lower than the 92.2% reported in (Srinivas, 1997) and
the 91.37% reported in (Chen et al., 1999). There are
several reasons for the difference. First, the training
data in our experiment are smaller than those used in
his previous work. Second, we treat punctuation marks
as normal words during evaluation because, like other
words, punctuation marks can anchor etrees, whereas he
treats the Supertags for punctuation marks as always
correct. Third, he used some equivalence classes during
evaluation. If a word is mis-tagged as x while the
correct Supertag is y, he does not count that as an
error if x and y appear in the same equivalence class.
We suspect that those Supertagging errors are
disregarded because they might not affect parsing
results when the Supertags are combined. For example,
both adjectives and nouns can modify other nouns. The
two templates (i.e., Supertags) representing these
modification relations look the same except for the
POS tags of the anchors. If a word that should be
tagged with one Supertag is mis-tagged with the other,
it is likely that the wrong Supertag can still fit with
the other Supertags in the sentence and produce the
right parse. We did not use these equivalence classes
in this experiment because we are not aware of a
systematic way to find all the cases in which
Supertagging errors do not affect the final parsing
results.
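To illustrate how much the evaluation conventions mentioned in the footnote above can move the reported accuracy, here is a minimal sketch of a token-level accuracy function with two optional relaxations: scoring punctuation as always correct, and accepting a wrong Supertag that lies in the same equivalence class as the correct one. The function name, its parameters, the punctuation test, and the toy Supertag labels are all assumptions made for illustration; the actual equivalence classes used in (Srinivas, 1997) are not given in this paper.

```python
import string


def supertag_accuracy(gold, predicted, words,
                      punct_always_correct=False, equiv_classes=()):
    """Fraction of tokens whose predicted Supertag counts as correct.

    gold, predicted, words: parallel lists, one entry per token.
    punct_always_correct: score punctuation tokens as correct regardless
        of the prediction.
    equiv_classes: disjoint sets of Supertags treated as interchangeable.
    """
    if not gold or not (len(gold) == len(predicted) == len(words)):
        raise ValueError("inputs must be non-empty and of equal length")
    class_of = {}
    for idx, cls in enumerate(equiv_classes):
        for tag in cls:
            class_of[tag] = idx
    correct = 0
    for g, p, w in zip(gold, predicted, words):
        # Assumed punctuation test: token made only of ASCII punctuation.
        is_punct = bool(w) and all(ch in string.punctuation for ch in w)
        if punct_always_correct and is_punct:
            correct += 1
        elif g == p:
            correct += 1
        elif g in class_of and class_of[g] == class_of.get(p):
            correct += 1
    return correct / len(gold)


# Toy example with hypothetical Supertag names.
words = ["the", "red", "car", "stops", "."]
gold = ["t_det", "t_adj_mod", "t_np", "t_intrans", "t_punct"]
pred = ["t_det", "t_noun_mod", "t_np", "t_trans", "t_other"]

strict = supertag_accuracy(gold, pred, words)
lenient = supertag_accuracy(gold, pred, words,
                            punct_always_correct=True,
                            equiv_classes=[{"t_adj_mod", "t_noun_mod"}])
print(strict, lenient)
```

On this toy sentence the strict setting counts two of five tokens as correct, while the lenient setting counts four of five, which shows why accuracies computed under different conventions are not directly comparable.
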
5 Conclusion 
We have presented a system for grammar ex- 
traction that produces an LTAG from a Tree- 
bank. The output produced by the system 
has been used in many NLP tasks, two of 
which are discussed in the paper. In the first 
task, by comparing the XTAG grammar with 
a Treebank grammar produced by LexTract, 
we estimate that the XTAG grammar covers 
97.2% of template tokens in the English Tree- 
bank. We plan to use the Treebank grammar 
to improve the coverage of the XTAG gram- 
mar. We have also found constructions that 
are covered in the XTAG grammar but do not 
appear in the Treebank. In the second task, 
LexTract converts the Treebank into a format 
that can be used to train Supertaggers, and 
the Supertagging accuracy is comparable to, if
not better than, that obtained with other
conversion algorithms. For future work, we plan
to use derivation trees to train LTAG parsers 
directly and use LexTract to add semantic in- 
formation to the Penn Treebank. 

References 
R. Chandrasekar and B. Srinivas. 1997. Glean- 
ing information from the Web: Using Syntax 
to Filter out Irrelevant Information. In Proc. of 
AAAI 1997 Spring Symposium on NLP on the 
World Wide Web. 
Eugene Charniak. 1996. Treebank Grammars. In 
Proc. of AAAI-1996. 
Eugene Charniak. 1997. Statistical Parsing with 
a Context-Free Grammar and Word Statistics. 
In Proc. of AAAI-1997. 
John Chen and K. Vijay-Shanker. 2000. Automated
Extraction of TAGs from the Penn Treebank. In
6th International Workshop on Parsing
Technologies (IWPT 2000), Italy.
John Chen, Srinivas Bangalore, and K. Vijay- 
Shanker. 1999. New Models for Improving 
Supertag Disambiguation. In Proc. of EACL- 
1999. 
Mike Collins. 1997. Three Generative, Lexicalised 
Models for Statistical Parsing. In Proc. of the 
35th ACL. 
C. Doran, D. Egedi, B. A. Hockey, B. Srinivas, 
and M. Zaidel. 1994. XTAG System - A Wide 
Coverage Grammar for English. In Proc. of 
COLING-1994, Kyoto, Japan. 
Aravind Joshi and B. Srinivas. 1994. Disambigua- 
tion of Super Parts of Speech (or Supertags): 
Almost Parsing. In Proc. of COLING-1994. 
Aravind Joshi and K. Vijay-Shanker. 1999.
Compositional Semantics with LTAG: How Much
Underspecification Is Necessary? In Proc. of
3rd International Workshop on Computational
Semantics.
Aravind K. Joshi, L. Levy, and M. Takahashi. 
1975. Tree Adjunct Grammars. Journal of 
Computer and System Sciences. 
Laura Kallmeyer and Aravind Joshi. 1999. Un- 
derspecified Semantics with LTAG. 
Alexander Krotov, Mark Hepple, Robert
Gaizauskas, and Yorick Wilks. 1998. Compacting
the Penn Treebank Grammar. In Proc. of
ACL-COLING.
David M. Magerman. 1995. Statistical Decision- 
Tree Models for Parsing. In Proc. of the 33rd 
ACL. 
M. Marcus, B. Santorini, and M. A. 
Marcinkiewicz. 1993. Building a Large 
Annotated Corpus of English: the Penn 
Treebank. Computational Linguistics.
K. F. McCoy, K. Vijay-Shanker, and G. Yang. 
1992. A Functional Approach to Generation 
with TAG. In Proc. of the 30th ACL.
Martha Palmer, Owen Rambow, and Alexis
Nasr. 1998. Rapid Prototyping of Domain- 
Specific Machine Translation System. In Proc. 
of AMTA-1998, Langhorne, PA. 
Rashmi Prasad and Anoop Sarkar. 2000. Compar- 
ing Test-Suite Based Evaluation and Corpus- 
Based Evaluation of a Wide-Coverage Grammar 
for English. In Proc. of LREC satellite work- 
shop Using Evaluation within HLT Programs: 
Results and Trends, Athens, Greece.
Carlos A. Prolo. 2000. An Efficient LR Parser 
Generator for TAGs. In 6th International 
Workshop on Parsing Technologies (IWPT 
2000), Italy. 
Adwait Ratnaparkhi. 1998. Maximum Entropy 
Models for Natural Language Ambiguity Resolu- 
tion. Ph.D. thesis, University of Pennsylvania. 
Anoop Sarkar and Aravind Joshi. 1996. Coordi- 
nation in Tree Adjoining Grammars: Formaliza- 
tion and Implementation. In Proc. of the 18th 
COLING, Copenhagen, Denmark. 
Anoop Sarkar. 2000. Practical Experiments in 
Parsing using Tree Adjoining Grammars. In 
Proc. of 5th International Workshop on TAG 
and Related Frameworks (TAG+5). 
The XTAG-Group. 1998. A Lexicalized Tree Ad- 
joining Grammar for English. Technical Report 
IRCS 98-18, University of Pennsylvania. 
Yves Schabes and Stuart Shieber. 1992. An Al- 
ternative Conception of Tree-Adjoining Deriva- 
tion. In Proc. of the 20th Meeting of the Asso- 
ciation for Computational Linguistics. 
Yves Schabes. 1990. Mathematical and Computa- 
tional Aspects of Lexicalized Grammars. Ph.D. 
thesis, University of Pennsylvania. 
Kiyoaki Shirai, Takenobu Tokunaga, and Hozumi 
Tanaka. 1995. Automatic Extraction of 
Japanese Grammar from a Bracketed Corpus. 
In Proc. of Natural Language Processing Pacific 
Rim Symposium (NLPRS-1995). 
B. Srinivas, Anoop Sarkar, Christine Doran, and 
Beth Ann Hockey. 1998. Grammar and Parser 
Evaluation in the XTAG Project. In Workshop 
on Evaluation of Parsing Systems, Granada, 
Spain. 
B. Srinivas. 1997. Complexity of Lexical De- 
scriptions and Its Relevance to Partial Parsing. 
Ph.D. thesis, University of Pennsylvania. 
Matthew Stone and Christine Doran. 1997.
Sentence Planning as Description Using Tree
Adjoining Grammar. In Proc. of the 35th ACL.
Bonnie Webber and Aravind Joshi. 1998. Anchor- 
ing a Lexicalized Tree Adjoining Grammar for 
Discourse. In Proc. of ACL-COLING Workshop
on Discourse Relations and Discourse Markers. 
Fei Xia and Tonia Bleam. 2000. A Corpus-Based 
Evaluation of Syntactic Locality in TAGs. In 
Proc. of 5th International Workshop on TAG 
and Related Frameworks (TAG+5). 
Fei Xia, Chunghye Han, Martha Palmer, and 
Aravind Joshi. 2000a. Comparing Lexicalized 
Treebank Grammars Extracted from Chinese, 
Korean, and English Corpora. In Proc. of the 
2nd Chinese Language Processing Workshop,
Hong Kong, China. 
Fei Xia, Martha Palmer, Nianwen Xue, Mary Ellen 
Okurowski, John Kovarik, Shizhe Huang, Tony 
Kroch, and Mitch Marcus. 2000b. Developing 
Guidelines and Ensuring Consistency for Chi- 
nese Text Annotation. In Proc. of the 2nd In- 
ternational Conference on Language Resources 
and Evaluation (LREC-2000), Athens, Greece.
