Sample Selection for Statistical Grammar Induction 
Rebecca Hwa 
Division of Engineering and Applied Sciences 
Harvard University
Cambridge, MA 02138 USA 
rebecca@eecs.harvard.edu 
Abstract 
Corpus-based grammar induction relies on using
many hand-parsed sentences as training
examples. However, the construction of a 
training corpus with detailed syntactic analy- 
sis for every sentence is a labor-intensive task. 
We propose to use sample selection methods 
to minimize the amount of annotation needed 
in the training data, thereby reducing the 
workload of the human annotators. This pa- 
per shows that the amount of annotated train- 
ing data can be reduced by 36% without de- 
grading the quality of the induced grammars. 
1 Introduction
Many learning problems in the domain of 
natural language processing need supervised
training. For instance, it is difficult to induce 
a grammar from a corpus of raw text; but the 
task becomes much easier when the training 
sentences are supplemented with their parse 
trees. However, appropriate supervised train- 
ing data may be difficult to obtain. Existing 
corpora might not contain the relevant type 
of supervision, and the data might not be 
in the domain of interest. For example, one 
might need morphological analyses of the lex- 
icon in addition to the parse trees for inducing 
a grammar, or one might be interested in processing
non-English languages for which there
is no annotated corpus. Because supervised 
training typically demands significant human 
involvement (e.g., annotating the parse trees 
of sentences by hand), building a new corpus is 
a labor-intensive task. Therefore, it is worth- 
while to consider ways of minimizing the size 
of the corpus to reduce the effort spent by an- 
notators. 
* This material is based upon work supported by the 
National Science Foundation under Grant No. IRI 
9712068. We thank Wheeler Ruml for his plotting
tool; and Stuart Shieber, Lillian Lee, Ric Crabbe, and 
the anonymous reviewers for their comments on the 
paper. 
There are two possible directions: one 
might attempt to reduce the amount of anno- 
tations in each sentence, as was explored by 
Hwa (1999); alternatively, one might attempt 
to reduce the number of training sentences. 
In this paper, we consider the latter approach 
using sample selection, an interactive learning 
method in which the machine takes the initia- 
tive of selecting potentially beneficial train- 
ing examples for the humans to annotate. If 
the system could accurately identify a subset 
of examples with high Training Utility Values 
(TUV) out of a pool of unlabeled data, the 
annotators would not need to waste time on 
processing uninformative examples. 
We show that sample selection can be 
applied to grammar induction to produce 
high quality grammars with fewer annotated 
training sentences. Our approach is to use 
uncertainty-based evaluation functions that 
estimate the TUV of a sentence by quantify- 
ing the grammar's uncertainty about assign- 
ing a parse tree to this sentence. We have 
considered two functions. The first is a simple
heuristic that approximates the grammar's
uncertainty in terms of sentence lengths. The 
second computes uncertainty in terms of the 
tree entropy of the sentence. This metric is 
described in detail later. 
This paper presents an empirical study 
measuring the effectiveness of our evaluation 
functions at selecting training sentences from 
the Wall Street Journal (WSJ) corpus (Marcus
et al., 1993) for inducing grammars. Conducting
the experiments with training pools
of different sizes, we have found that sample 
selection based on tree entropy reduces a large 
training pool by 36% and a small training pool 
by 27%. These results suggest that sample selection
can significantly reduce the human effort
exerted in building training corpora.
2 Sample Selection 
Unlike traditional learning systems that re- 
ceive training examples indiscriminately, a 
learning system that uses sample selection 
actively influences its progress by choosing 
new examples to incorporate into its training 
set. Sample selection works with two types 
of learning systems: a committee of learners 
or a single learner. The committee-based se- 
lection algorithm works with multiple learn- 
ers, each maintaining a different hypothesis 
(perhaps pertaining to different aspects of the 
problem). The candidate examples that led 
to the most disagreements among the differ- 
ent learners are considered to have the high- 
est TUV (Cohn et al., 1994; Freund et al., 
1997). For computationally intensive prob- 
lems such as grammar induction, maintaining 
multiple learners may be impractical. In
this work, we explore sample selection with a 
single learner that keeps just one working hy- 
pothesis at all times. 
Figure 1 outlines the single-learner sample 
selection training loop in pseudo-code. Ini- 
tially, the training set, L, consists of a small 
number of labeled examples, based on which 
the learner proposes its first hypothesis of 
the target concept, C. Also available to the 
learner is a large pool of unlabeled training
candidates, U. In each training iteration, the 
selection algorithm, Select(n, U, C, f), ranks 
the candidates of U according to their ex- 
pected TUVs and returns the n candidates 
with the highest values. The algorithm com- 
putes the expected TUV of each candidate, 
u ∈ U, with an evaluation function, f(u, C).
This function may possibly rely on the hy- 
pothesis concept C to estimate the utility of 
a candidate u. The n chosen candidates
are then labeled by a human and added
to the existing training set. Running the
learning algorithm, Train(L), on the updated
training set, the system proposes a new hy- 
pothesis consistent with all the examples seen 
thus far. The loop continues until one of three 
stopping conditions is met: the hypothesis is 
considered close enough to the target concept, 
all candidates are labeled, or all human re- 
sources are exhausted. 
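The loop described above can be sketched in Python. This is only an illustrative sketch: the names `sample_selection`, `select`, `train`, `label`, and `budget` are assumptions introduced here, and the stopping test on the target concept is left out because the true concept is not observable in code; a fixed number of rounds stands in for the annotators stopping.

```python
def select(n, pool, hypothesis, f):
    """Return the n candidates in `pool` with the highest f(u, hypothesis)."""
    ranked = sorted(pool, key=lambda u: f(u, hypothesis), reverse=True)
    return ranked[:n]


def sample_selection(labeled, pool, train, label, f, n, budget):
    """Single-learner sample selection: train, select, label, retrain.

    `train` maps a list of labeled examples to a hypothesis, `label`
    stands in for the human annotator, and `budget` is the number of
    selection rounds the annotators are willing to do (all hypothetical).
    Candidates must be hashable (e.g. tuples of tokens).
    """
    labeled = list(labeled)
    pool = list(pool)
    hypothesis = train(labeled)
    rounds = 0
    while pool and rounds < budget:            # stop when U is empty or the human stops
        chosen = select(n, pool, hypothesis, f)
        chosen_set = set(chosen)
        pool = [u for u in pool if u not in chosen_set]   # U <- U - N
        labeled.extend(label(u) for u in chosen)          # L <- L + Label(N)
        hypothesis = train(labeled)                       # C <- Train(L)
        rounds += 1
    return hypothesis, labeled
```

Any evaluation function with the signature `f(u, hypothesis)` can be plugged in, which is what allows different selection strategies to share the same loop.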
U is a set of unlabeled candidates.
L is a set of labeled training examples.
C is the current hypothesis.
Initialize:
    C ← Train(L).
Repeat
    N ← Select(n, U, C, f).
    U ← U − N.
    L ← L ∪ Label(N).
    C ← Train(L).
Until (C = C_true) or (U = ∅) or (human stops)
Figure 1: The pseudo-code for the sample selection
learning algorithm
Sample selection may be beneficial for many
learning tasks in natural language processing.
Although there exist abundant collections
of raw text, the high expense of manually
annotating the text sets a severe limitation
for many learning algorithms in natural
language processing. Sample selection
presents an attractive solution to offset this
labeled data sparsity problem. Thus far, it
has been successfully applied to several classi- 
fication applications. Some examples include 
text categorization (Lewis and Gale, 1994), 
part-of-speech tagging (Engelson and Dagan, 
1996), word-sense disambiguation (Fujii et al., 
1998), and prepositional-phrase attachment 
(Hwa, 2000). 
More difficult are learning problems whose 
objective is not classification, but generation 
of complex structures. One example in this di- 
rection is applying sample selection to seman- 
tic parsing (Thompson et al., 1999), in which 
sentences are paired with their semantic rep- 
resentation using a deterministic shift-reduce 
parser. Our work focuses on another complex 
natural language learning problem: inducing
a stochastic context-free grammar that can 
generate syntactic parse trees for novel test 
sentences. 
Although abstractly, parsing with a gram- 
mar can be seen as a classification task of de- 
termining the structure of a sentence by se- 
lecting one tree out of a set of possible parse 
trees, there are two major distinctions that 
differentiate it from typical classification prob- 
lems. First, a classifier usually chooses from 
a fixed set of categories, but in our domain, 
every sentence has a different set of possible 
parse trees. Second, for most classification 
problems, the number of possible categories
is relatively small, whereas the number
of potential parse trees for a sentence is expo- 
nential with respect to the sentence length. 
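To give a feel for this growth, the sketch below counts unlabeled binary-branching trees over a sentence of a given length. Restricting to binary, unlabeled trees is an assumption made here for illustration; under it the count is the Catalan number C(length − 1). The function name is hypothetical.

```python
from math import comb


def num_binary_bracketings(length: int) -> int:
    """Number of distinct unlabeled binary-branching trees over `length` words.

    This is the Catalan number C_(length-1) = comb(2n, n) / (n + 1), n = length - 1.
    """
    if length < 1:
        raise ValueError("a sentence needs at least one word")
    n = length - 1
    return comb(2 * n, n) // (n + 1)


for length in (5, 10, 20):
    print(length, num_binary_bracketings(length))
```

With nonterminal labels on the nodes the number of candidate trees is larger still, so enumerating them explicitly is not an option for sentences of realistic length.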
3 Grammar Induction 
The degree of difficulty of the task of learning 
a grammar from data depends on the quantity 
and quality of the training supervision. When 
the training corpus consists of a large reservoir
of fully annotated parse trees, it is possible 
to directly extract a grammar based on these 
parse trees. The success of recent high-quality 
parsers (Charniak, 1997; Collins, 1997) relies 
on the availability of such treebank corpora. 
To work with smaller training corpora, the 
learning system would require even more in- 
formation about the examples than their syn- 
tactic parse trees. For instance, Hermjakob 
and Mooney (1997) have described a learning 
system that can build a deterministic shift- 
reduce parser from a small set of training 
examples with the aid of detailed morpho- 
logical, syntactical, and semantic knowledge 
databases and step-by-step guidance from hu- 
man experts. 
The induction task becomes more chal- 
lenging as the amount of supervision in the 
training data and background knowledge de- 
creases. To compensate for the missing infor- 
mation, the learning process requires heuristic 
search to find locally optimal grammars. One 
form of partially supervised data might spec- 
ify the phrasal boundaries without specify- 
ing their labels by bracketing each constituent 
unit with a pair of parentheses (McNaughton, 
1967). For example, the parse tree for the sentence
"Several fund managers expect a rough
market this morning before prices stabilize."
is labeled as "((Several fund managers) (expect
((a rough market) (this morning)) (before
(prices stabilize))).)" As shown in Pereira
and Schabes (1992), an essentially unsuper- 
vised learning algorithm such as the Inside- 
Outside re-estimation process (Baker, 1979; 
Lari and Young, 1990) can be modified to take 
advantage of these bracketing constraints. 
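As a small illustration of what such bracketing constraints look like to a program, the sketch below turns an unlabeled bracketed string into a set of token spans and tests whether a candidate constituent crosses any of them. The function names and the (start, end) span encoding are assumptions made for this example; it is not the authors' implementation of the constrained re-estimation.

```python
def bracket_spans(bracketed):
    """Parse an unlabeled bracketing into (tokens, spans).

    Each span is (start, end) in token positions, end exclusive.
    """
    tokens, spans, stack = [], set(), []
    for piece in bracketed.replace("(", " ( ").replace(")", " ) ").split():
        if piece == "(":
            stack.append(len(tokens))
        elif piece == ")":
            if not stack:
                raise ValueError("unbalanced brackets")
            spans.add((stack.pop(), len(tokens)))
        else:
            tokens.append(piece)
    if stack:
        raise ValueError("unbalanced brackets")
    return tokens, spans


def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def is_compatible(span, constraints):
    """A candidate constituent is allowed only if it crosses no given bracket."""
    return not any(crosses(span, c) for c in constraints)


tokens, spans = bracket_spans(
    "((Several fund managers) (expect ((a rough market) (this morning)) "
    "(before (prices stabilize))).)"
)
print(is_compatible((1, 3), spans))   # "fund managers" nests inside a bracket
print(is_compatible((2, 5), spans))   # "managers expect a" crosses a bracket
```

A constrained learner would simply skip chart entries for which `is_compatible` is false, which is what shrinks the search space relative to fully unsupervised training.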
For our sample selection experiment, we 
chose to work under the more stringent con- 
dition of partially supervised training data, as 
described above, because our ultimate goal is 
to minimize the amount of annotation done 
by humans in terms of both the number of 
sentences and the number of brackets within 
the sentences. Thus, the quality of our in- 
duced grammars should not be compared to 
those extracted from a fully annotated train- 
ing corpus. The learning algorithm we use is 
a variant of the Inside-Outside algorithm that 
induces grammars expressed in the Probabilis- 
tic Lexicalized Tree Insertion Grammar rep- 
resentation (Schabes and Waters, 1993; Hwa, 
1998). This formalism's context-free equivalence
and its lexicalized representation make
the training process efficient and computationally
feasible.
4 Selective Sampling Evaluation 
Functions 
In this paper, we propose two uncertainty- 
based evaluation functions for estimating the 
training utilities of the candidate sentences. 
The first is a simple heuristic that uses the 
length of a sentence to estimate uncertain- 
ties. The second function computes uncer- 
tainty in terms of the entropy of the parse 
trees that the hypothesis-grammar generated 
for the sentence. 
4.1 Sentence Length 
Let us first consider a simple evaluation 
function that estimates the training utility 
of a candidate without consulting the current
hypothesis-grammar, G. The function
f_len(s, G) coarsely approximates the uncertainty
of a candidate sentence s with its
length:
$$f_{len}(s, G) = \mathrm{length}(s).$$
The intuition behind this function is based 
on the general observation that longer sen- 
tences tend to have complex structures and 
introduce more opportunities for ambiguous 
parses. Since the scoring only depends on 
sentence lengths, this naive evaluation func- 
tion orders the training pool deterministically 
regardless of either the current state of the 
grammar or the annotation of previous train- 
ing sentences. This approach has one major 
advantage: it is easy to compute and takes 
negligible processing time. 
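A minimal sketch of this heuristic follows, assuming each candidate is a list of tokens; the names `f_len` and `rank_by_length` are used here only for illustration. The grammar argument is accepted so that the function has the same shape as other evaluation functions, but it is never read, which is why the ranking is fixed in advance.

```python
def f_len(sentence, grammar=None):
    """Score a candidate by its length; `grammar` is accepted but ignored."""
    return len(sentence)


def rank_by_length(pool):
    """Order candidates from longest to shortest (ties keep their pool order)."""
    return sorted(pool, key=f_len, reverse=True)


pool = [["DT", "NN"], ["DT", "JJ", "NN", "VBD"], ["NN"], ["PRP", "VBD"]]
print(rank_by_length(pool))
```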
4.2 Tree Entropy 
Sentence length is not a very reliable indi- 
cator of uncertainty. To measure the un- 
certainty of a sentence more accurately, the 
evaluation function must base its estimation 
on the outcome of testing the sentence on 
the hypothesis-grammar. When a stochastic 
grammar parses a sentence, it generates a set 
of possible trees and associates a likelihood 
value with each. Typically, the most likely 
tree is taken to be the best parse for the sen- 
tence. 
We propose an evaluation function that 
considers the probabilities of all parses. The 
set of probabilities of the possible parse trees 
for a sentence defines a distribution that in- 
dicates the grammar's uncertainty about the 
structure of the sentence. For example, a uni- 
form distribution signifies that the grammar 
is at its highest uncertainty because all the 
parses are equally likely; whereas a distribu- 
tion resembling an impulse function suggests 
that the grammar is very certain because it 
finds one parse much more likely than all oth- 
ers. To quantitatively characterize a distribu- 
tion, we compute its entropy. 
Entropy measures the uncertainty of assign- 
ing a value to a random variable over a dis- 
tribution. Informally speaking, it is the ex- 
pected number of bits needed to encode the 
assignment. A higher entropy value signifies 
a higher degree of uncertainty. At the highest 
uncertainty, the random variable is assigned 
one of n values over a uniform distribution, 
and the outcome would require log2 (n) bits to 
encode. 
More formally, let V be a discrete random
variable that can take any possible outcome
in the set $\mathcal{V}$. Let p(v) be the density function
p(v) = Pr(V = v), v ∈ $\mathcal{V}$. The entropy H(V)
is the expected negative log likelihood of the
random variable V:
$$H(V) = -E[\log_2 p(V)] = -\sum_{v \in \mathcal{V}} p(v) \log_2 p(v).$$
Further details about the properties of en- 
tropy can be found in textbooks on informa- 
tion theory (Cover and Thomas, 1991). 
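The two extremes mentioned above (a uniform distribution and a sharply peaked one) can be checked numerically with a short sketch of the entropy definition. The function name and the example distributions are illustrative assumptions.

```python
from math import log2


def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p = 0 contribute nothing, by the usual convention 0 log 0 = 0.
    return -sum(p * log2(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]     # four equally likely outcomes
peaked = [0.97, 0.01, 0.01, 0.01]      # one outcome dominates
print(entropy(uniform))   # equals log2(4)
print(entropy(peaked))    # much smaller than log2(4)
```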
Determining the parse tree for a sentence 
from a set of possible parses can be viewed as 
assigning a value to a random variable. Thus, 
a direct application of the entropy definition 
to the probability distribution of the parses for 
sentence s in grammar G computes its tree en- 
tropy, TE(s, G), the expected number of bits 
needed to encode the distribution of possible 
parses for s. Note that we cannot compare 
sentences of different lengths by their entropy. 
For two sentences of unequal lengths, both 
with uniform distributions, the entropy of the 
longer one is higher. To normalize for sen- 
tence length, we define an evaluation function 
that computes the similarity between the ac- 
tual probability distribution and the uniform 
distribution for a sentence of that length. For 
a sentence s of length l, there can be at most
O(2^l) equally likely parse trees and its maximal
entropy is O(l) bits (Cover and Thomas,
1991). Therefore, we define the evaluation
function, f_te(s, G), to be the tree entropy divided
by the sentence length:
$$f_{te}(s, G) = \frac{TE(s, G)}{\mathrm{length}(s)}.$$
We now derive the expression for TE(s, G). 
Suppose that a sentence s can be generated by 
a grammar G with some non-zero probability, 
Pr(s | G). Let $\mathcal{V}$ be the set of possible parses
that G generated for s. Then the probability
that sentence s is generated by G is the sum
of the probabilities of its parses. That is:
$$\Pr(s \mid G) = \sum_{v \in \mathcal{V}} \Pr(v \mid G).$$
Note that Pr(v | G) reflects the probability of
one particular parse tree, v, in the grammar
out of all possible parse trees for all possible
sentences that G accepts. But in order to apply
the entropy definition from above, we need
to specify a distribution of probabilities for the
parses of sentence s such that
$$\sum_{v \in \mathcal{V}} \Pr(v \mid s, G) = 1.$$
Pr(v | s, G) indicates the likelihood that v is
the correct parse tree out of a set of possible
parses for s according to grammar G. It is
also the density function, p(v), for the distribution
(i.e., the probability of assigning v to
a random variable V). Using Bayes' Rule and
noting that Pr(v, s | G) = Pr(v | G) (because
the existence of tree v implies the existence of
sentence s), we get:
$$p(v) = \Pr(v \mid s, G) = \frac{\Pr(v, s \mid G)}{\Pr(s \mid G)} = \frac{\Pr(v \mid G)}{\Pr(s \mid G)}.$$
Replacing the generic density function term 
in the entropy definition, we derive the expres- 
sion for TE(s, G), the tree entropy of s: 
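$$\begin{aligned}
TE(s, G) &= H(V) \\
&= -\sum_{v \in \mathcal{V}} p(v) \log_2 p(v) \\
&= -\sum_{v \in \mathcal{V}} \frac{\Pr(v \mid G)}{\Pr(s \mid G)} \log_2 \frac{\Pr(v \mid G)}{\Pr(s \mid G)} \\
&= -\sum_{v \in \mathcal{V}} \frac{\Pr(v \mid G)}{\Pr(s \mid G)} \log_2 \Pr(v \mid G) + \sum_{v \in \mathcal{V}} \frac{\Pr(v \mid G)}{\Pr(s \mid G)} \log_2 \Pr(s \mid G) \\
&= -\frac{\sum_{v \in \mathcal{V}} \Pr(v \mid G) \log_2 \Pr(v \mid G)}{\Pr(s \mid G)} + \log_2 \Pr(s \mid G) \, \frac{\sum_{v \in \mathcal{V}} \Pr(v \mid G)}{\Pr(s \mid G)} \\
&= -\frac{\sum_{v \in \mathcal{V}} \Pr(v \mid G) \log_2 \Pr(v \mid G)}{\Pr(s \mid G)} + \log_2 \Pr(s \mid G).
\end{aligned}$$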
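As a sanity check on the derivation, the sketch below computes the tree entropy in two ways from a list of parse probabilities Pr(v | G): directly from the normalized distribution Pr(v | s, G), and through the final closed form. Explicitly listing the parse probabilities is an assumption made for illustration and is only practical for toy cases; the function names and the example numbers are hypothetical.

```python
from math import log2


def tree_entropy_direct(parse_probs):
    """Entropy of the normalized distribution Pr(v | s, G) = Pr(v | G) / Pr(s | G)."""
    p_s = sum(parse_probs)                      # Pr(s | G)
    return -sum((p / p_s) * log2(p / p_s) for p in parse_probs)


def tree_entropy_closed_form(parse_probs):
    """Last line of the derivation: -sum(p log2 p) / Pr(s | G) + log2 Pr(s | G)."""
    p_s = sum(parse_probs)
    weighted = sum(p * log2(p) for p in parse_probs)
    return -weighted / p_s + log2(p_s)


def f_te(parse_probs, length):
    """Tree entropy normalized by sentence length."""
    return tree_entropy_closed_form(parse_probs) / length


probs = [0.012, 0.003, 0.0005]                  # made-up Pr(v | G) for three parses
print(tree_entropy_direct(probs), tree_entropy_closed_form(probs))
```

The two values agree up to floating-point error, and the closed form only needs two aggregate quantities over the parses rather than each normalized probability.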
Using the bottom-up, dynamic programming
technique of computing Inside Probabilities
(Lari and Young, 1990), we can efficiently
compute the probability of the sentence,
Pr(s | G). Similarly, the algorithm
can be modified to compute the quantity
$\sum_{v \in \mathcal{V}} \Pr(v \mid G) \log_2 \Pr(v \mid G)$ (see Appendix A).
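One way such a modification could look is sketched below for a stochastic context-free grammar in Chomsky normal form. This is an assumption-laden illustration, not the procedure of Appendix A and not the PLTIG formalism used in the experiments: the grammar encoding (`lexical`, `binary` dictionaries), the start symbol `"S"`, and the function name are all hypothetical. Alongside each inside probability, the chart stores the sum of P(t) log2 P(t) over the subtrees t of that cell; for a rule with probability r combining two cells with inside probabilities bB, bC and sums hB, hC, the new contribution is r(log2 r · bB · bC + hB · bC + bB · hC).

```python
from collections import defaultdict
from math import log2


def inside_with_entropy(words, lexical, binary, start="S"):
    """Return (Pr(s | G), TE(s, G)) for a CNF grammar.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    """
    n = len(words)
    beta = defaultdict(float)   # (i, j, A) -> inside probability
    h = defaultdict(float)      # (i, j, A) -> sum over subtrees of P(t) * log2 P(t)
    for i, w in enumerate(words):
        for (a, word), r in lexical.items():
            if word == w:
                beta[i, i + 1, a] += r
                h[i, i + 1, a] += r * log2(r)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (a, b, c), r in binary.items():
                    bb = beta.get((i, k, b), 0.0)
                    bc = beta.get((k, j, c), 0.0)
                    if bb == 0.0 or bc == 0.0:
                        continue
                    hb, hc = h[i, k, b], h[k, j, c]
                    beta[i, j, a] += r * bb * bc
                    h[i, j, a] += r * (log2(r) * bb * bc + hb * bc + bb * hc)
    p_s = beta.get((0, n, start), 0.0)
    if p_s == 0.0:
        raise ValueError("sentence is not generated by the grammar")
    return p_s, -h[0, n, start] / p_s + log2(p_s)


# Toy grammar: S -> S S (0.4), S -> a (0.6). "a a a" has two equally likely parses.
p_s, te = inside_with_entropy(["a", "a", "a"], {("S", "a"): 0.6}, {("S", "S", "S"): 0.4})
print(p_s, te)
```

The cost is the same order as the ordinary inside computation, since each chart update does a constant amount of extra work.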
5 Experimental Setup 
To determine the effectiveness of selecting 
training examples with the two proposed eval- 
uation functions, we compare them against 
a baseline of random selection (f_rand(s, G) =
rand()). The task is to induce grammars from
selected sentences in the Wall Street Journal
(WSJ) corpus, and to parse unseen test sentences
with the trained grammars. Because
the vocabulary size (and the grammar size 
by extension) is very large, we have substi- 
tuted the words with their part-of-speech tags 
to avoid additional computational complexity 
in training the grammar. After replacing the 
words with part-of-speech tags, the vocabu- 
lary size of the corpus is reduced to 47 tags. 
We repeat the study for two different 
candidate-pool sizes. For the first experiment, 
we assume that there exists an abundant supply
of unlabeled data. Based on empirical observations
(as will be shown in Section 6), for
the task we are considering, the induction al- 
gorithm typically reaches its asymptotic limit 
after training with 2600 sentences; therefore, 
it is sufficient to allow for a candidate-pool size 
of U = 3500 unlabeled WSJ sentences. In the 
second experiment, we restrict the size of the 
candidate-pool such that U contains only 900 
unlabeled sentences. This experiment studies 
how the paucity of training data affects the 
evaluation functions. 
For both experiments, each of the three
evaluation functions, f_rand, f_len, and f_te, is
applied to the sample selection learning algorithm
shown in Figure 1, where concept C is
the current hypothesis-grammar G, and L, the
set of labeled training data, initially consists
of 100 sentences. In every iteration, n = 100
new sentences are picked from U to be added 
to L, and a new C is induced from the updated 
L. After the hypothesis-grammar is updated, 
it is tested. The quality of the induced grammar
is judged by its ability to generate correct
parses for unseen test sentences. We use
the consistent bracketing metric (i.e., the percentage
of brackets in the proposed parse not
crossing brackets of the true parse) to measure
parsing accuracy.¹ To ensure the statistical
significance of the results, we report the
average of ten trials for each experiment.²
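The consistent bracketing metric can be sketched as follows, assuming each parse is represented as a collection of (start, end) token spans with the end exclusive. That representation and the function names are assumptions made for this example.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def consistent_bracketing(proposed, gold):
    """Percentage of proposed brackets that cross no bracket of the true parse."""
    proposed = list(proposed)
    if not proposed:
        raise ValueError("the proposed parse has no brackets")
    consistent = sum(1 for p in proposed if not any(crosses(p, g) for g in gold))
    return 100.0 * consistent / len(proposed)


# Five-word sentence: a flat gold parse and a binary-branching proposal.
gold = {(0, 5), (0, 3), (3, 5)}
proposed = [(0, 5), (0, 2), (2, 5), (3, 5)]
print(consistent_bracketing(proposed, gold))   # (2, 5) crosses (0, 3)
```

In this example the bracket (0, 2) is absent from the gold parse but does not cross anything, so it is not counted against the proposal.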
6 Results 
The results of the two experiments are graph- 
ically depicted in Figure 2. We plot learning 
rates of the induction processes using train- 
ing sentences selected by the three evaluation 
functions. The learning rate relates the qual- 
ity of the induced grammars to the amount of 
supervised training data available. In order 
for the induced grammar to parse test sen- 
tences with higher accuracy (x-axis), more su- 
pervision (y-axis) is needed. The amount of 
supervision is measured in terms of the num- 
ber of brackets rather than sentences because 
it more accurately quantifies the effort spent 
by the human annotator. Longer sentences 
tend to require more brackets than short ones, 
and thus take more time to analyze. We deem 
one evaluation function more effective than 
another if the smallest set of sentences it se- 
lected can train a grammar that performs at 
least as well as the grammar trained under the 
other function and if the selected data con- 
tains considerably fewer brackets than that of 
the other function. 
Figure 2(a) presents the outcomes of the 
first experiment, in which the evaluation func- 
tions select training examples out of a large 
candidate-pool. We see that overall, sample 
selection has a positive effect on the learning 
¹ The unsupervised induction algorithm induces
grammars that generate binary branching trees so that
the number of proposed brackets in a sentence is always
one fewer than the length of the sentence. The
WSJ corpus, on the other hand, favors a more flattened
tree structure with considerably fewer brackets
per sentence. The consistent bracketing metric does
not unfairly penalize a proposed parse tree for being
binary branching.
² We generate different candidate-pools by moving
a fixed-size window across WSJ sections 02 through
05, advancing 400 sentences for each trial. Section 23
is always used for testing.
[Plots for panels (a) and (b); each panel has an axis labeled "Parsing accuracy on the test".]
Figure 2: The learning rates of the induction processes using examples selected by the three 
evaluation functions for (a) when the candidate-pool is large, and (b) when the candidate-pool 
is small. 
grammar set       avg. training brackets   t-test on bracket avg.   avg. score   t-test on score avg.
baseline-26       33355                    N/A                      80.3         N/A
length-17         30288                    better                   80.3         not sig. worse
tree entropy-14   21236                    better                   80.4         not sig. worse
Table 1: Summary of pair-wise t-test with 95% confidence comparing the best set of grammars
induced with the baseline (after 26 selection iterations) to the sets of grammars induced under
the proposed evaluation functions (f_len after 17 iterations, f_te after 14 iterations).
rate of the induction process. For the baseline
case, the induction process uses f_rand,
in which training sentences are randomly selected.
The resulting grammars achieve an
average parsing accuracy of 80.3% on the test
sentences after seeing an average of 33355
brackets in the training data. The learning
rate of the tree entropy evaluation function,
f_te, progresses much faster than the baseline.
To induce a grammar that reaches the same
80.3% parsing accuracy with the examples selected
by f_te, the learner requires, on average,
21236 training brackets, reducing the amount
of annotation by 36% compared to the baseline.
While the simplistic sentence length
evaluation function, f_len, is less helpful, its
learning rate still improves slightly faster than
the baseline. A grammar of comparable quality
can be induced from a set of training examples
selected by f_len containing an average of
30288 brackets. This provides a small reduction
of 9% from the baseline.³ We consider a
set of grammars to be comparable to the baseline
if its mean test score is at least as high
as that of the baseline and if the difference of
the means is not statistically significant (using
a pair-wise t-test at 95% confidence). Table 1
summarizes the statistical significance of
comparing the best set of baseline grammars
with those of f_len and f_te.
³ In terms of the number of sentences, the baseline
f_rand used 2600 randomly chosen training sentences;
f_len selected the 1700 longest sentences as training
data; and f_te selected 1400 sentences.
Figure 2(b) presents the results of the second
experiment, in which the evaluation functions
only have access to a small candidate
pool. Similar to the previous experiment,
grammars induced from training examples selected
by f_te require significantly fewer annotations
than the baseline. Under the baseline,
f_rand, to train grammars with 78.5% parsing
accuracy on test data, an average of 11699
brackets (in 900 sentences) is required. In contrast,
f_te can induce a comparable grammar
with an average of 8559 brackets (in 600 sentences),
providing a saving of 27% in the number
of training brackets. The simpler evaluation
function f_len outperforms the baseline
as well; the 600 sentences it selected have an
average of 9935 brackets. Table 2 shows the
statistical significance of these comparisons.
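The reported savings follow directly from the average bracket counts quoted in the text. The short sketch below redoes that arithmetic; the function name is illustrative, and the inputs are the averages stated above.

```python
def reduction(baseline_brackets, selected_brackets):
    """Percentage of training brackets saved relative to the baseline."""
    return 100.0 * (baseline_brackets - selected_brackets) / baseline_brackets


comparisons = [
    ("large pool, tree entropy", 33355, 21236),     # reported as 36%
    ("large pool, sentence length", 33355, 30288),  # reported as 9%
    ("small pool, tree entropy", 11699, 8559),      # reported as 27%
]
for name, baseline, selected in comparisons:
    print(f"{name}: {reduction(baseline, selected):.1f}%")
```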
A somewhat surprising outcome of the sec- 
ond study is that the grammars induced from 
grammar set      avg. training brackets   t-test on bracket avg.   avg. score   t-test on score avg.
baseline-9       11699                    N/A                      78.5         N/A
length-6         9936                     better                   78.5         not sig. worse
tree entropy-6   8559                     better                   78.5         not sig. worse
tree entropy-8   11242                    better                   79.1         better
Table 2: Summary of pair-wise t-test with 95% confidence comparing the best set of grammars
induced with the baseline (after 9 selection iterations) to the sets of grammars induced under
the proposed evaluation functions (f_len after 6 iterations, f_te after 6 and 8 iterations).
the three methods did not parse with the same
accuracy when all the sentences from the unlabeled
pool had been added to the training
set. Presenting the training examples in different
orders changes the search path of the
induction process. Trained on data selected
by f_te, the induced grammar parses the test
sentences with 79.1% accuracy, a small but
statistically significant improvement over the
baseline. This suggests that, when faced with
a dearth of training candidates, f_te can make
good use of the available data to induce grammars
that are comparable to those directly induced
from more data.
7 Conclusion and Future Work 
This empirical study indicates that sample se- 
lection can significantly reduce the human ef- 
fort in parsing sentences for inducing gram- 
mars. Our proposed evaluation function using 
tree entropy selects helpful training examples. 
Choosing from a large pool of unlabeled can- 
didates, it significantly reduces the amount of 
training annotations needed (by 36% in the 
experiment). Although the reduction is less 
dramatic when the pool of candidates is small 
(by 27% in the experiment), the training ex- 
amples it selected helped to induce slightly 
better grammars. 
The current work suggests many potential 
research directions on selective sampling for 
grammar induction. First, since the ideas behind
the proposed evaluation functions are
general and independent of formalisms, we 
would like to empirically determine their ef- 
fect on other parsers. Next, we shall explore 
alternative formulations of evaluation func- 
tions for the single-learner system. The cur- 
rent approach uses uncertainty-based evalua- 
tion functions; we hope to consider other fac- 
tors such as confidence about the parameters 
of the grammars and domain knowledge. We 
also plan to focus on the constituent units 
within a sentence as training examples. Thus, 
the evaluation functions could estimate the 
training utilities of constituent units rather 
than full sentences. Another area of interest 
is to experiment with committee-based sam- 
ple selection using multiple learners. Finally, 
we are interested in applying sample selection 
to other natural language learning algorithms
that have been limited by the sparsity of an- 
notated data. 

References 
James K. Baker. 1979. Trainable grammars for 
speech recognition. In Proceedings of the Spring 
Conference of the Acoustical Society of Amer- 
ica, pages 547-550, Boston, MA, June. 
Eugene Charniak. 1997. Statistical parsing with 
a context-free grammar and word statistics. In 
Proceedings of the AAAI, pages 598-603, Prov- 
idence, RI. AAAI Press/MIT Press. 
David Cohn, Les Atlas, and Richard Ladner. 1994. 
Improving generalization with active learning. 
Machine Learning, 15(2):201-221. 
Michael Collins. 1997. Three generative, lexicalised
models for statistical parsing. In Proceedings
of the 35th Annual Meeting of the ACL,
pages 16-23, Madrid, Spain. 
Thomas M. Cover and Joy A. Thomas. 1991. El- 
ements of Information Theory. John Wiley. 
Sean P. Engelson and Ido Dagan. 1996. Minimizing
manual annotation cost in supervised training
from corpora. In Proceedings of the 34th Annual
Meeting of the ACL, pages 319-326.
Yoav Freund, H. Sebastian Seung, Eli Shamir, and 
Naftali Tishby. 1997. Selective sampling using 
the query by committee algorithm. Machine 
Learning, 28(2-3):133-168. 
Atsushi Fujii, Kentaro Inui, Takenobu Tokunaga, 
and Hozumi Tanaka. 1998. Selective sampling 
for example-based word sense disambiguation. 
Computational Linguistics, 24(4):573-598, De- 
cember. 
Ulf Hermjakob and Raymond J. Mooney. 1997. 
Learning parse and translation decisions from 
examples with rich context. In Proceedings of
the Association for Computational Linguistics, 
pages 482-489. 
Rebecca Hwa. 1998. An empirical evaluation
of probabilistic lexicalized tree insertion grammars.
In Proceedings of COLING-ACL, volume 1,
pages 557-563.
Rebecca Hwa. 1999. Supervised grammar in- 
duction using training data with limited con- 
stituent information. In Proceedings of 37th An- 
nual Meeting of the ACL, pages 73-79, June. 
Rebecca Hwa. 2000. Learning Probabilistic Lexicalized
Grammars for Natural Language Processing.
Ph.D. thesis, Harvard University.
Forthcoming. 
K. Lari and S.J. Young. 1990. The estimation
of stochastic context-free grammars using the 
inside-outside algorithm. Computer Speech and 
Language, 4:35-56. 
David D. Lewis and William A. Gale. 1994. A se- 
quential algorithm for training text classifiers. 
In Proceedings of the 17th Annual International 
ACM SIGIR Conference on Research and De- 
velopment in Information Retrieval, pages 3-12. 
Mitchell Marcus, Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1993. Building 
a large annotated corpus of English: the 
Penn Treebank. Computational Linguistics, 
19(2):313-330.
Robert McNaughton. 1967. Parenthesis grammars.
Journal of the ACM, 2(3):490-500.
Fernando Pereira and Yves Schabes. 1992. Inside- 
Outside reestimation from partially bracketed 
corpora. In Proceedings of the 30th Annual 
Meeting of the ACL, pages 128-135, Newark,
Delaware. 
Yves Schabes and Richard Waters. 1993. Stochas- 
tic lexicalized context-free grammar. In Pro- 
ceedings of the Third International Workshop 
on Parsing Technologies, pages 257-266. 
Cynthia A. Thompson, Mary Elaine Califf, and 
Raymond J. Mooney. 1999. Active learning 
for natural language parsing and information
extraction. In Proceedings of ICML-99, pages
406-414, Bled, Slovenia. 
