Recovering latent information in treebanks
David Chiang and Daniel M. Bikel
University of Pennsylvania
Dept of Computer and Information Science
200 S 33rd Street
Philadelphia PA 19104 USA
fdchiang,dbikelg@cis.upenn.edu
Abstract
Many recent statistical parsers rely on a preprocess-
ing step which uses hand-written, corpus-specific
rules to augment the training data with extra infor-
mation. For example, head-finding rules are used
to augment node labels with lexical heads. In this
paper, we provide machinery to reduce the amount
of human e ort needed to adapt existing models to
new corpora: first, we propose a flexible notation for
specifying these rules that would allow them to be
shared by di erent models; second, we report on an
experiment to see whether we can use Expectation-
Maximization to automatically fine-tune a set of
hand-written rules to a particular corpus.
1 Introduction
Most work in statistical parsing does not operate
in the realm of parse trees as they appear in many
treebanks, but rather on trees transformed via aug-
mentation of their node labels, or some other trans-
formation (Johnson, 1998). This methodology is il-
lustrated in Figure 1. The information included in
the node labels’ augmentations may include lexical
items, or a node label su x to indicate the node is an
argument and not an adjunct; such extra information
may be viewed as latent information, in that it is not
directly present in the treebank parse trees, but may
be recovered by some means. The process of recov-
ering this latent information has largely been limited
to the hand-construction of heuristics. However, as
is often the case, hand-constructed heuristics may
not be optimal or very robust. Also, the e ort re-
quired to construct such rules can be considerable.
In both respects, the use of such rules runs counter
to the data-driven approach to statistical parsing.
In this paper, we propose two steps to address
this problem. First, we define a new, fairly simple
syntax for the identification and transformation of
node labels that accommodates a wide variety of
node-label augmentations, including all those that
Model
parsed data +annotated data +
annotated data Training Decoding parsed data
Figure 1: Methodology for the development of a sta-
tistical parser. A+indicates augmentation.
are performed by existing statistical parsers that we
have examined. Second, we explore a novel use of
Expectation-Maximization (Dempster et al., 1977)
that iteratively reestimates a parsing model using
the augmenting heuristics as a starting point. Specif-
ically, the EM algorithm we use is a variant of
the Inside-Outside algorithm (Baker, 1979; Lari and
Young, 1990; Hwa, 1998). The reestimation adjusts
the model’s parameters in the augmented parse-tree
space to maximize the likelihood of the observed
(incomplete) data, in the hopes of finding a better
distribution over augmented parse trees (the com-
plete data). The ultimate goal of this work is to mini-
mize the human e ort needed when adapting a pars-
ing model to a new domain.
2 Background
2.1 Head-lexicalization
Many of the recent, successful statistical parsers
have made use of lexical information or an im-
plicit lexicalized grammar, both for English and,
more recently, for other languages. All of these
parsers recover the “hidden” lexicalizations in a
treebank and find the most probable lexicalized tree
when parsing, only to strip out this hidden infor-
mation prior to evaluation. Also, in all these pars-
ing e orts lexicalization has meant finding heads
of constituents and then propagating those lexical
heads to their respective parents. In fact, nearly
identical head-lexicalizations were used in the dis-
S(caught–VBD)
NP(boy–NN)
DET
The
NN
boy
ADVP(also–RB)
RB
also
VP(caught–VBD)
VBD
caught
NP(ball–NN)
DET
the
NN
ball
Figure 2: A simple lexicalized parse tree.
criminative models described in (Magerman, 1995;
Ratnaparkhi, 1997), the lexicalized PCFG models
in (Collins, 1999), the generative model in (Char-
niak, 2000), the lexicalized TAG extractor in (Xia,
1999) and the stochastic lexicalized TAG models
in (Chiang, 2000; Sarkar, 2001; Chen and Vijay-
Shanker, 2000). Inducing a lexicalized structure
based on heads has a two-pronged e ect: it not
only allows statistical parsers to be sensitive to lex-
ical information by including this information in
the probability model’s dependencies, but it also
determines which of all possible dependencies—
both syntactic and lexical—will be included in the
model itself. For example, in Figure 2, the nontermi-
nal NP(boy–NN) is dependent on VP(caught–VBD)
and not the other way around.
2.2 Other tree transformations
Lexicalization via head-finding is but one of many
possible tree transformations that might be use-
ful for parsing. As explored thoroughly by John-
son (1998), even simple, local syntactic trans-
formations on training trees for an unlexicalized
PCFG model can have a significant impact on pars-
ing performance. Having picked up on this idea,
Collins (1999) devises rules to identify arguments,
i.e., constituents that are required to exist on a par-
ticular side of a head child constituent dominated
by a particular parent. The parsing model can then
probabilistically predict sets of requirements on ei-
ther side of a head constituent, thereby incorporat-
ing a type of subcategorization information. While
the model is augmented to include this subcat-
prediction feature, the actual identification of argu-
ments is performed as one of many preprocessing
steps on training trees, using a set of rules sim-
ilar to those used for the identification of heads.
Also, (Collins, 1999) makes use of several other
transformations, such as the identification of sub-
jectless sentences (augmenting S nodes to become
SG) and the augmentation of nonterminals for gap
threading. Xia (1999) combines head-finding with
argument identification to extract elementary trees
for use in the lexicalized TAG formalism. Other re-
searchers investigated this type of extraction to con-
struct stochastic TAG parsers (Chiang, 2000; Chen
and Vijay-Shanker, 2000; Sarkar, 2001).
2.3 Problems with heuristics
While head-lexicalization and other tree transfor-
mations allow the construction of parsing models
with more data-sensitivity and richer representa-
tions, crafting rules for these transformations has
been largely an art, with heuristics handed down
from researcher to researcher. What’s more, on
top of the large undertaking of designing and im-
plementing a statistical parsing model, the use of
heuristics has required a further e ort, forcing the
researcher to bring both linguistic intuition and,
more often, engineering savvy to bear whenever
moving to a new treebank. For example, in the rule
sets used by the parsers described in (Magerman,
1995; Ratnaparkhi, 1997; Collins, 1999), the sets of
rules for finding the heads of ADJP, ADVP, NAC,
PP and WHPP include rules for picking either the
rightmost or leftmost FW (foreign word). The ap-
parently haphazard placement of these rules that
pick out FW and the rarity of FW nodes in the data
strongly suggest these rules are the result of engi-
neering e ort. Furthermore, it is not at all apparent
that tree-transforming heuristics that are useful for
one parsing model will be useful for another. Fi-
nally, as is often the case with heuristics, those used
in statistical parsers tend not to be data-sensitive,
and ironically do not rely on the words themselves.
3 Rule-based augmentation
In the interest of reducing the e ort required to con-
struct augmentation heuristics, we would like a no-
tation for specifying rules for selecting nodes in
bracketed data that is both flexible enough to encode
the kinds of rule sets used by existing parsers, and
intuitive enough that a rule set for a new language
can be written easily without knowledge of com-
puter programming. Such a notation would simplify
the task of writing new rule sets, and facilitate ex-
perimentation with di erent rules. Moreover, rules
written in this notation would be interchangeable
between di erent models, so that, ideally, adapta-
tion of a model to a new corpus would be trivial.
We define our notation in two parts: a structure
pattern language, whose basic patterns are speci-
fications of single nodes written in a label pattern
language.
3.1 Structure patterns
Most existing head-finding rules and argument-
finding rules work by specifying parent-child rela-
tions (e.g., NN is the head of NP, or NP is an argu-
ment of VP). A generalization of this scheme that
is familiar to linguists and computer scientists alike
would be a context-free grammar with rules of the
form
A!A1   (Ai)l   An;
where the superscript l specifies that if this rule gets
used, the ith child of A should be marked with the
label l.
However, there are two problems with such an ap-
proach. First, writing down such a grammar would
be tedious to say the least, and impossible if we
want to handle trees with arbitrary branching fac-
tors. So we can use an extended CFG (Thatcher,
1967), a CFG whose right-hand sides are regular ex-
pressions. Thus we introduce a union operator ([)
and a Kleene star ( ) into the syntax for right-hand
sides.
The second problem that our grammar may be
ambiguous. For example, the grammar
X!YhY[YYh
could mark with an h either the first or second sym-
bol of YY. So we impose an ordering on the rules of
the grammar: if two rules match, the first one wins.
In addition, we make the [ operator noncommuta-
tive: [ tries to match first, and only if it does
not match  , as in Perl. (Thus the above grammar
would mark the first Y.) Similarly,  tries to match
as many times as possible, also as in Perl.
But this creates a third and final problem: in the
grammar
X!(YYh[Yh)(YY[Y);
it is not defined which symbol of YYY should be
marked, that is, which union operator takes priority
over the other. Perl circumvents this problem by al-
ways giving priority to the left. In algebraic terms,
concatenation left-distributes over union but does
not right-distribute over union in general.
However, our solution is to provide a pair of con-
catenation operators:  , which gives priority to the
left, and , which gives priority to the right:
X ! (YYh[Yh) (YY[Y) (1)
X ! (YYh[Yh) (YY[Y) (2)
Rule (1) marks the second Y in YYY, but rule (2)
marks the first Y. More formally,
  ( [ ) = (   )[(   )
( [ )  = (   )[(   )
But if contains no unions or Kleene stars, then
   =    (   )
   =    (   )
So then, consider the following rules:
VP ! ?  VBh  ? ; (3)
VP ! ?  VBh  ? : (4)
where ? is a wildcard pattern which matches any
single label (see below). Rule (3) mark with an h
the rightmost VB child of a VP, whereas rule (4)
marks the leftmost VB. This is because the Kleene
star always prefers to match as many times as possi-
ble, but in rule (3) the first Kleene star’s preference
takes priority over the last’s, whereas in rule (4) the
last Kleene star’s preference takes priority over the
first’s.
Consider the slightly more complicated exam-
ples:
VP ! ?  (VBh[MDh) ? (5)
VP ! ?  ((VBh[MDh) ? ) (6)
Rule (5) marks the leftmost child which is either a
VB or a MD, whereas rule (6) marks the leftmost
VB if any, or else the leftmost MD. To see why this
so, consider the string MD VB X. Rule (5) would
mark the MD as h, whereas rule (6) would mark
the VB. In both rules VB is preferred over MD, and
symbols to the left over symbols to the right, but in
rule (5) the leftmost preference (that is, the prefer-
ence of the last Kleene star to match as many times
as possible) takes priority, whereas in rule (6) the
preference for VB takes priority.
3.2 Label patterns
Since nearly all treebanks have complex nontermi-
nal alphabets, we need a way of concisely specify-
ing classes of labels. Unfortunately, this will neces-
sarily vary somewhat across treebanks: all we can
define that is truly treebank-independent is the ?
pattern, which matches any label. For Penn Tree-
bank II style annotation (Marcus et al., 1993), in
which a nonterminal symbol is a category together
with zero or more functional tags, we adopt the fol-
lowing scheme: the atomic pattern a matches any
label with category a or functional tag a; more-
over, we define Boolean operators^,_, and:. Thus
NP^:ADV matches NP–SBJ but not NP–ADV.1
3.3 Summary
Using the structure pattern language and the la-
bel pattern language together, one can fully encode
the head/argument rules used by Xia (which resem-
ble (5) above), and the family of rule sets used by
Black, Magerman, Collins, Ratnaparkhi, and others
(which resemble (6) above). In Collins’ version of
the head rules, NP and PP require special treatment,
but these can be encoded in our notation as well.
4 Unsupervised learning of augmentations
In the type of approach we have been discussing
so far, hand-written rules are used to augment the
training data, and this augmented training data is
then used to train a statistical model. However, if we
train the model by maximum-likelihood estimation,
the estimate we get will indeed maximize the likeli-
hood of the training data as augmented by the hand-
written rules, but not necessarily that of the training
data itself. In this section we explore the possibility
of training a model directly on unaugmented data.
A generative model that estimates P(S;T;T+)
(where T+ is an augmented tree) is normally used
for parsing, by computing the most likely (T;T+)
for a given S . But we may also use it for augment-
ing trees, by computing the most likely T+ for a
given sentence-tree pair (S;T). From the latter per-
spective, because its trees are unaugmented, a tree-
bank is a corpus of incomplete data, warranting the
use of unsupervised learning methods to reestimate
a model that includes hidden parameters. The ap-
proach we take below is to seed a parsing model
using hand-written rules, and then use the Inside-
Outside algorithm to reestimate its parameters. The
resulting model, which locally maximizes the likeli-
hood of the unaugmented training data, can then be
used in two ways: one might hope that as a parser,
it would parse more accurately than a model which
only maximizes the likelihood of training data aug-
mented by hand-written rules; and that as a tree-
augmenter, it would augment trees in a more data-
sensitive way than hand-written rules.
4.1 Background: tree adjoining grammar
The parsing model we use is based on the stochas-
tic tree-insertion grammar (TIG) model described
1Note that unlike the noncommutative union operator[, the
disjunction operator_has no preference for its first argument.
by Chiang (2000). TIG (Schabes and Waters, 1995)
is a weakly-context free restriction of tree adjoin-
ing grammar (Joshi and Schabes, 1997), in which
tree fragments called elementary trees are com-
bined by two composition operations, substitution
and adjunction (see Figure 3). In TIG there are
certain restrictions on the adjunction operation.
Chiang’s model adds a third composition operation
called sister-adjunction (see Figure 3), borrowed
from D-tree substitution grammar (Rambow et al.,
1995).2
There is an important distinction between derived
trees and derivation trees (see Figure 3). A deriva-
tion tree records the operations that are used to com-
bine elementary trees into a derived tree. Thus there
is a many-to-one relationship between derivation
trees and derived trees: every derivation tree speci-
fies a derived tree, but a derived tree can be the result
of several di erent derivations.
The model can be trained directly on TIG deriva-
tions if they are available, but corpora like the
Penn Treebank have only derived trees. Just as
Collins uses rules to identify heads and arguments
and thereby lexicalize trees, Chiang uses nearly the
same rules to reconstruct derivations: each training
example is broken into elementary trees, with each
head child remaining attached to its parent, each ar-
gument broken into a substitution node and an ini-
tial root, and each adjunct broken o as a modifier
auxiliary tree.
However, in this experiment we view the derived
trees in the Treebank as incomplete data, and try to
reconstruct the derivations (the complete data) using
the Inside-Outside algorithm.
4.2 Implementation
The expectation step (E-step) of the Inside-Outside
algorithm is performed by a parser that computes all
possible derivations for each parse tree in the train-
ing data. It then computes inside and outside prob-
abilities as in Hwa’s experiment (1998), and uses
these to compute the expected number of times each
event occurred. For the maximization step (M-step),
we obtain a maximum-likelihood estimate of the pa-
rameters of the model using relative-frequency es-
2The parameters for sister-adjunction in the present model
di er slightly from the original. In the original model, all the
modifier auxiliary trees that sister-adjoined at a particular po-
sition were generated independently, except that each sister-
adjunction was conditioned on whether it was the first at that
position. In the present model, each sister-adjunction is condi-
tioned on the root label of the previous modifier tree.
NP
NNP
John
S
NP# VP
VB
leave
VP
MD
should
VP 
NP
NN
tomorrow
( 1)
( 2)
( ) ( )
)
 2
 1
1
 
2
 
2,1
S
NP
NNP
John
VP
MD
should
VP
VB
leave
NP
NN
tomorrow
Derivation tree Derived tree
Figure 3: Grammar and derivation for “John should leave tomorrow.” In this derivation, 1 gets substituted,
 gets adjoined, and gets sister-adjoined.
timation, just as in the original experiment, as if
the expected values for the complete data were the
training data.
Smoothing presents a special problem. There are
several several backo levels for each parameter
class that are combined by deleted interpolation. Let
 1,  2 and  3 be functions from full history con-
texts Y to less specific contexts at levels 1, 2 and
3, respectively, for some parameter class with three
backo levels (with level 1 using the most specific
contexts). Smoothed estimates for parameters in this
class are computed as follows:
e= 1e1+(1  1)( 2e2+(1  2)e3)
where ei is the estimate of p(X j i(Y)) for some
future context X, and the  i are computed by the
formula found in (Bikel et al., 1997), modified to
use the multiplicative constant 5 found in the similar
formula of (Collins, 1999):
 i =
 
1 di 1d
i
! 1
1+5ui=di
!
(7)
where di is the number of occurrences in training of
the context i(Y) (and d0 =0), and ui is the number
of unique outcomes for that context seen in training.
There are several ways one might incorporate this
smoothing into the reestimation process, and we
chose to depart as little as possible from the orig-
inal smoothing method: in the E-step, we use the
smoothed model, and after the M-step, we use the
original formula (7) to recompute the smoothing
weights based on the new counts computed from
the E-step. While simple, this approach has two im-
portant consequences. First, since the formula for
the smoothing weights intentionally does not maxi-
mize the likelihood of the training data, each itera-
tion of reestimation is not guaranteed to increase the
87.3
87.35
87.4
87.45
87.5
87.55
87.6
0 5 10 15 20
F
-
m
e
a
s
ur
e
Iteration
Figure 4: English, starting with full rule set
likelihood of the training data. Second, reestimation
tends to increase the size of the model in memory,
since smoothing gives nonzero expected counts to
many events which were unseen in training. There-
fore, since the resulting model is quite large, if an
event at a particular point in the derivation forest
has an expected count below 10 15, we throw it out.
4.3 Experiment
We first trained the initial model on sections 02–21
of the WSJ corpus using the original head rules, and
then ran the Inside-Outside algorithm on the same
data. We tested each successive model on some
held-out data (section 00), using a beam width of
10 4, to determine at which iteration to stop. The
F-measure (harmonic mean of labeled precision and
recall) for sentences of length 100 for each itera-
tion is shown in Figure 4. We then selected the ninth
reestimated model and compared it with the initial
model on section 23 (see Figure 7). This model did
only marginally better than the initial model on sec-
tion 00, but it actually performs worse than the ini-
tial model on section 23. One explanation is that the
84.5
84.55
84.6
84.65
84.7
84.75
84.8
84.85
84.9
84.95
85
85.05
0 5 10 15 20 25 30 35 40
F
-
m
e
a
s
ur
e
Iteration
Figure 5: English, starting with simplified rule set
73
73.05
73.1
73.15
73.2
73.25
73.3
73.35
73.4
73.45
73.5
0 5 10 15 20 25 30 35 40
F
-
m
e
a
s
ur
e
Iteration
Figure 6: Chinese, starting with full rule set
head rules, since they have been extensively fine-
tuned, do not leave much room for improvement.
To test this, we ran two more experiments.
The second experiment started with a simplified
rule set, which simply chooses either the leftmost or
rightmost child of each node as the head, depend-
ing on the label of the parent: e.g., for VP, the left-
most child is chosen; for NP, the rightmost child
is chosen. The argument rules, however, were not
changed. This rule set is supposed to represent the
kind of rule set that someone with basic familiarity
with English syntax might write down in a few min-
utes. The reestimated models seemed to improve on
this simplified rule set when parsing section 00 (see
Figure 5); however, when we compared the 30th
reestimated model with the initial model on section
23 (see Figure 7), there was no improvement.
The third experiment was on the Chinese Tree-
bank, starting with the same head rules used in
(Bikel and Chiang, 2000). These rules were origi-
nally written by Xia for grammar development, and
although we have modified them for parsing, they
have not received as much fine-tuning as the English
rules have. We trained the model on sections 001–
270 of the Penn Chinese Treebank, and reestimated
it on the same data, testing it at each iteration on
sections 301–325 (Figure 6). We selected the 38th
reestimated model for comparison with the initial
model on sections 271–300 (Figure 7). Here we did
observe a small improvement: an error reduction of
3.4% in the F-measure for sentences of length 40.
4.4 Discussion
Our hypothesis that reestimation does not improve
on the original rule set for English because that
rule set is already fine-tuned was partially borne
out by the second and third experiments. The model
trained with a simplified rule set for English showed
improvement on held-out data during reestimation,
but showed no improvement in the final evaluation;
however, the model trained on Chinese did show a
small improvement in both. We are uncertain as to
why the gains observed during the second experi-
ment were not reflected in the final evaluation, but
based on the graph of Figure 5 and the results on
Chinese, we believe that reestimation by EM can
be used to facilitate adaptation of parsing models
to new languages or corpora.
It is possible that our method for choosing
smoothing weights at each iteration (see x4.2) is
causing some interference. For future work, more
careful methods should be explored. We would
also like to experiment on the parsing model of
Collins (1999), which, because it can recombine
smaller structures and reorder subcategorization
frames, might open up the search space for better
reestimation.
5 Conclusion
Even though researchers designing and implement-
ing statistical parsing models have worked in the
methodology shown in Figure 1 for several years
now, most of the work has focused on finding ef-
fective features for the model component of the
methodology, and on finding e ective statistical
techniques for parameter estimation. However, there
has been much behind-the-scenes work on the ac-
tual transformations, such as head finding, and most
of this work has consisted of hand-tweaking exist-
ing heuristics. It is our hope that by introducing this
new syntax, less toil will be needed to write non-
terminal augmentation rules, and that human e ort
will be lessened further by the use of unsupervised
methods such as the one presented here to produce
better models for parsing and tree augmentation.
 100 words  40 words
Model Step LR LP CB 0CB  2 CB LR LP CB 0CB  2 CB
Original initial 86.95 87.02 1.21 62.38 82.33 87.68 87.76 1.02 65.30 84.86
Original 9 86.37 86.71 1.26 61.42 81.79 87.18 87.48 1.06 64.41 84.23
Simple initial 84.50 84.18 1.54 57.57 78.35 85.46 85.17 1.29 60.71 81.11
Simple 30 84.21 84.50 1.53 57.95 77.77 85.12 85.35 1.30 60.94 80.62
Chinese initial 75.30 76.77 2.72 45.95 67.05 78.37 80.03 1.79 52.82 74.75
Chinese 38 75.20 77.99 2.66 47.69 67.63 78.79 81.06 1.69 54.15 75.08
Figure 7: Results on test sets. Original =trained on English with original rule set; Simple=English, sim-
plified rule set. LR = labeled recall, LP = labeled precision; CB = average crossing brackets, 0 CB= no
crossing brackets, 2 CB=two or fewer crossing brackets. All figures except CB are percentages.
Acknowledgments
This research was supported in part by NSF grant
SBR-89-20230. We would like to thank Anoop
Sarkar, Dan Gildea, Rebecca Hwa, Aravind Joshi,
and Mitch Marcus for their valuable help.

References

James K. Baker. 1979. Trainable grammars for speech
recognition. In Proceedings of the Spring Conference
of the Acoustical Society of America, pages 547–550.

Daniel M. Bikel and David Chiang. 2000. Two statisti-
cal parsing models applied to the Chinese Treebank.
In Proceedings of the Second Chinese Language Pro-
cessing Workshop, pages 1–6.

Daniel M. Bikel, Scott Miller, Richard Schwartz,
and Ralph Weischedel. 1997. Nymble: a high-
performance learning name-finder. In Proceedings of
the Fifth Conference on Applied Natural Language
Processing (ANLP 1997), pages 194–201.

Eugene Charniak. 2000. A maximum-entropy-inspired
parser. In Proceedings of ANLP-NAACL2000, pages
132–139.

John Chen and K. Vijay-Shanker. 2000. Automated ex-
traction of TAGs from the Penn Treebank. In Pro-
ceedings of the Sixth International Workshop on Pars-
ing Technologies (IWPT 2000), pages 65–76, Trento.

David Chiang. 2000. Statistical parsing with an
automatically-extracted tree adjoining grammar. In
Proceedings of ACL-2000, pages 456–463.

Michael Collins. 1999. Head-Driven Statistical Models
for Natural Language Parsing. Ph.D. thesis, Univ. of
Pennsylvania.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977.
Maximum likelihood from incomplete data via the
EM algorithm. J. Roy. Stat. Soc. B, 39:1–38.

Rebecca Hwa. 1998. An empirical evaluation of prob-
abilistic lexicalized tree insertion grammars. In Pro-
ceedings of COLING-ACL ’98, pages 557–563.

Mark Johnson. 1998. PCFG models of linguistic tree
representations. Computational Linguistics, 24:613–
632.

Aravind K. Joshi and Yves Schabes. 1997. Tree-
adjoining grammars. In Grzegorz Rosenberg and Arto
Salomaa, editors, Handbook of Formal Languages
and Automata, volume 3, pages 69–124. Springer-
Verlag, Heidelberg.

K. Lari and S. J. Young. 1990. The estimation of
stochastic context-free grammars using the inside-
outside algorithm. Computer Speech and Language,
4:35–56.

David M. Magerman. 1995. Statistical decision-tree
models for parsing. In Proceedings of ACL ’95, pages
276–283.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1993. Building a large annotated cor-
pus of English: the Penn Treebank. Computational
Linguistics, 19:313–330.

Owen Rambow, K. Vijay-Shanker, and David Weir.
1995. D-tree grammars. In Proceedings of ACL ’95,
pages 151–158.

Adwait Ratnaparkhi. 1997. A linear observed time sta-
tistical parser based on maximum entropy models. In
Proceedings of the Second Conference on Empirical
Methods in Natural Language Processing (EMNLP-
2).

Anoop Sarkar. 2001. Applying co-training methods to
statistical parsing. In Proceedings of NAACL-2001,
pages 175–182.

Yves Schabes and Richard C. Waters. 1995. Tree in-
sertion grammar: a cubic-time parsable formalism
that lexicalizes context-free grammar without chang-
ing the trees produced. Computational Linguistics,
21:479–513.

J. W. Thatcher. 1967. Characterizing derivation trees of
context-free grammars through a generalization of fi-
nite automata theory. J. Comp. Sys. Sci., 1:317–322.

Fei Xia. 1999. Extracting tree adjoining grammars from
bracketed corpora. In Proceedings of the 5th Nat-
ural Language Processing Pacific Rim Symposium
(NLPRS-99), pages 398–403.
