Automatic Grammar Induction and Parsing Free Text: 
A Transformation-Based Approach 
Eric Brill* 
Department of Computer and Information Science 
University of Pennsylvania 
brill@unagi.cis.upenn.edu 
Abstract 
In this paper we describe a new technique for 
parsing free text: a transformational grammar¹
is automatically learned that is capable of accu- 
rately parsing text into binary-branching syntac- 
tic trees with nonterminals unlabelled. The algo- 
rithm works by beginning in a very naive state of 
knowledge about phrase structure. By repeatedly 
comparing the results of bracketing in the current 
state to proper bracketing provided in the training 
corpus, the system learns a set of simple structural 
transformations that can be applied to reduce er- 
ror. After describing the algorithm, we present 
results and compare these results to other recent 
results in automatic grammar induction. 
INTRODUCTION 
There has been a great deal of interest of late in 
the automatic induction of natural language gram- 
mar. Given the difficulty inherent in manually 
building a robust parser, along with the availabil- 
ity of large amounts of training material, auto- 
matic grammar induction seems like a path worth 
pursuing. A number of systems have been built 
that can be trained automatically to bracket text 
into syntactic constituents. In (MM90) mutual in- 
formation statistics are extracted from a corpus of 
text and this information is then used to parse 
new text. (Sam86) defines a function to score the 
quality of parse trees, and then uses simulated an- 
nealing to heuristically explore the entire space of 
possible parses for a given sentence. In (BM92a), 
distributional analysis techniques are applied to a 
large corpus to learn a context-free grammar. 
The most promising results to date have been 
*The author would like to thank Mark Liberman, 
Melting Lu, David Magerman, Mitch Marcus, Rich 
Pito, Giorgio Satta, Yves Schabes and Tom Veatch. 
This work was supported by DARPA and AFOSR 
jointly under grant No. AFOSR-90-0066, and by ARO 
grant No. DAAL 03-89-C0031 PRI. 
¹Not in the traditional sense of the term.
based on the inside-outside algorithm, which can 
be used to train stochastic context-free grammars. 
The inside-outside algorithm (Bak79) is an exten-
sion of the finite-state based Hidden Markov Model,
which has been applied successfully in
many areas, including speech recognition and part 
of speech tagging. A number of recent papers 
have explored the potential of using the inside- 
outside algorithm to automatically learn a gram- 
mar (LY90, SJM90, PS92, BW92, CC92, SRO93). 
Below, we describe a new technique for gram- 
mar induction. The algorithm works by beginning 
in a very naive state of knowledge about phrase 
structure. By repeatedly comparing the results of 
parsing in the current state to the proper phrase 
structure for each sentence in the training corpus, 
the system learns a set of ordered transformations 
which can be applied to reduce parsing error. We 
believe this technique has advantages over other 
methods of phrase structure induction. Some of 
the advantages include: the system is very simple, 
it requires only a very small set of transforma- 
tions, a high degree of accuracy is achieved, and 
only a very small training corpus is necessary. The 
trained transformational parser is completely sym- 
bolic and can bracket text in linear time with re- 
spect to sentence length. In addition, since some 
tokens in a sentence are not even considered in 
parsing, the method could prove to be consid- 
erably more robust than a CFG-based approach 
when faced with noise or unfamiliar input. After 
describing the algorithm, we present results and 
compare these results to other recent results in 
automatic phrase structure induction. 
TRANSFORMATION-BASED 
ERROR-DRIVEN LEARNING 
The phrase structure learning algorithm is a 
transformation-based error-driven learner. This 
learning paradigm, illustrated in figure 1, has 
proven to be successful in a number of differ- 
ent natural language applications, including part 
of speech tagging (Bri92, BM92b), prepositional 
[Figure 1: Transformation-Based Error-Driven
Learning. Unannotated text is annotated by the
current state; the output is compared against the
annotated truth, and rules are learned.]
phrase attachment (BR93), and word classifica- 
tion (Bri93). In its initial state, the learner is 
capable of annotating text but is not very good 
at doing so. The initial state is usually very easy 
to create. In part of speech tagging, the initial 
state annotator assigns every word its most likely 
tag. In prepositional phrase attachment, the ini- 
tial state annotator always attaches prepositional 
phrases low. In word classification, all words are 
initially classified as nouns. The naively annotated 
text is compared to the true annotation as indi- 
cated by a small manually annotated corpus, and 
transformations are learned that can be applied to 
the output of the initial state annotator to make 
it better resemble the truth. 
LEARNING PHRASE 
STRUCTURE 
The phrase structure learning algorithm is trained 
on a small corpus of partially bracketed text which 
is also annotated with part of speech informa- 
tion. All of the experiments presented below 
were done using the Penn Treebank annotated 
corpus (MSM93). The learner begins in a naive
initial state, knowing very little about the phrase 
structure of the target corpus. In particular, all 
that is initially known is that English tends to 
be right branching and that final punctuation 
is final punctuation. Transformations are then 
learned automatically which transform the out- 
put of the naive parser into output which bet- 
ter resembles the phrase structure found in the 
training corpus. Once a set of transformations 
has been learned, the system is capable of taking 
sentences tagged with parts of speech and return- 
ing a binary-branching structure with nontermi- 
nals unlabelled.²
The Initial State Of The Parser 
Initially, the parser operates by assigning a right- 
linear structure to all sentences. The only excep- 
tion is that final punctuation is attached high. So, 
the sentence "The dog and old cat ate ." would be 
incorrectly bracketed as: 
( ( The ( dog ( and ( old ( cat ate ) ) ) ) ) . )
The parser in its initial state will obviously 
not bracket sentences with great accuracy. In 
some experiments below, we begin with an even 
more naive initial state of knowledge: sentences 
are parsed by assigning them a random binary- 
branching structure with final punctuation always 
attached high. 
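
As a concrete sketch of this initial state (an illustration of ours, not code from the paper; trees are encoded as nested pairs whose leaves are (word, tag) tuples, with Penn-style tags assumed):

    def right_branch(tokens):
        # Build a right-linear binary tree over (word, tag) leaves.
        if len(tokens) == 1:
            return tokens[0]
        return (tokens[0], right_branch(tokens[1:]))

    def right_linear_parse(tagged_sentence):
        # Naive initial state: right-linear structure, with final
        # punctuation attached high.
        *body, last = tagged_sentence
        if last[1] == ".":
            return (right_branch(body), last)
        return right_branch(tagged_sentence)

    # Yields the structure ( ( The ( dog ( and ( old ( cat ate ) ) ) ) ) . )
    tree = right_linear_parse([("The", "DT"), ("dog", "NN"), ("and", "CC"),
                               ("old", "JJ"), ("cat", "NN"), ("ate", "VBD"),
                               (".", ".")])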
Structural Transformations 
The next stage involves learning a set of trans- 
formations that can be applied to the output of 
the naive parser to make these sentences better 
conform to the proper structure specified in the 
training corpus. The list of possible transforma- 
tion types is prespecified. Transformations involve 
making a simple change triggered by a simple en- 
vironment. In the current implementation, there 
are twelve allowable transformation types: 
• (1-8) (Add|delete) a (left|right) parenthesis to
the (left|right) of part of speech tag X.
• (9-12) (Add|delete) a (left|right) parenthesis
between tags X and Y.
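
Concretely, these templates are instantiated over the part of speech tagset to produce the candidate space searched during learning. A sketch (the tuple encoding of a candidate is a choice of our own):

    from itertools import product

    def instantiate_templates(tagset):
        # Templates 1-8: (add|delete) a (left|right) paren to the
        # (left|right) of part of speech tag X.
        for action, paren, side, x in product(
                ("add", "delete"), ("left", "right"), ("left", "right"),
                tagset):
            yield (action, paren, side, x)
        # Templates 9-12: (add|delete) a (left|right) paren between
        # tags X and Y.
        for action, paren, (x, y) in product(
                ("add", "delete"), ("left", "right"),
                product(tagset, repeat=2)):
            yield (action, paren, "between", x, y)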
To carry out a transformation by adding or 
deleting a parenthesis, a number of additional sim- 
ple changes must take place to preserve balanced 
parentheses and binary branching. To give an ex- 
ample, to delete a left paren in a particular envi- 
ronment, the following operations take place (as- 
suming, of course, that there is a left paren to 
delete): 
1. Delete the left paren. 
2. Delete the right paren that matches the just 
deleted paren. 
3. Add a left paren to the left of the constituent 
immediately to the left of the deleted left paren. 
4. Add a right paren to the right of the constituent
immediately to the right of the deleted left paren.
5. If there is no constituent immediately to the
right, or none immediately to the left, then the
transformation fails to apply.

²This is the same output given by systems described
in (MM90, Bri92, PS92, SRO93).
Structurally, the transformation can be seen
as follows. If we wish to delete a left paren to
the right of constituent X³, where X appears in a
subtree of the form:

( X ( YY Z ) )

carrying out these operations will transform this
subtree into:⁴

( ( X YY ) Z )
Given the sentence:⁵
The dog barked . 
this would initially be bracketed by the naive 
parser as: 
( ( The ( dog barked ) ) . )
If the transformation delete a left paren to
the right of a determiner is applied, the structure
would be transformed to the correct bracketing:

( ( ( The dog ) barked ) . )
To add a right parenthesis to the right of YY,
YY must once again be in a subtree of the form:

( X ( YY Z ) )

³To the right of the rightmost terminal dominated
by X if X is a nonterminal.
⁴The twelve transformations can be decomposed
into two structural transformations, that shown here
and its converse, along with six triggering
environments.
⁵Input sentences are also labelled with parts of
speech.
If it is, the following steps are carried out to 
add the right paren: 
1. Add the right paren. 
2. Delete the left paren that now matches the
newly added paren.
3. Find the right paren that used to match the just 
deleted paren and delete it. 
4. Add a left paren to match the added right paren. 
This results in the same structural change as 
deleting a left paren to the right of X in this par- 
ticular structure. 
Applying the transformation add a right paren
to the right of a noun to the bracketing:

( ( The ( dog barked ) ) . )

will once again result in the correct bracketing:

( ( ( The dog ) barked ) . )
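
As footnote 4 notes, both operations reduce to one local rotation and its converse. A sketch over the nested-pair trees used earlier (locating the triggering environment within a full tree is omitted):

    def rotate_left(subtree):
        # ( X ( YY Z ) ) -> ( ( X YY ) Z ): delete a left paren to the
        # right of X; equivalently, add a right paren to the right of YY.
        x, (yy, z) = subtree
        return ((x, yy), z)

    def rotate_right(subtree):
        # ( ( X YY ) Z ) -> ( X ( YY Z ) ): the converse transformation.
        (x, yy), z = subtree
        return (x, (yy, z))

    # ( ( The ( dog barked ) ) . ) -> ( ( ( The dog ) barked ) . )
    before = ((("The", "DT"), (("dog", "NN"), ("barked", "VBD"))),
              (".", "."))
    after = (rotate_left(before[0]), before[1])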
Learning Transformations 
Learning proceeds as follows. Sentences in the 
training set are first parsed using the naive parser 
which assigns right linear structure to all sen- 
tences, attaching final punctuation high. Next, for 
each possible instantiation of the twelve transfor- 
mation templates, that particular transformation 
is applied to the naively parsed sentences. The
resulting structures are then scored using some mea-
sure of success that compares these parses to the 
correct structural descriptions for the sentences 
provided in the training corpus. The transforma- 
tion resulting in the best scoring structures then 
becomes the first transformation of the ordered set 
of transformations that are to be learned. That 
transformation is applied to the right-linear struc- 
tures, and then learning proceeds on the corpus 
of improved sentence bracketings. The following 
procedure is carried out repeatedly on the train- 
ing corpus until no more transformations can be 
found whose application reduces the error in pars- 
ing the training corpus: 
1. The best transformation is found for the struc- 
tures output by the parser in its current state.⁶
2. The transformation is applied to the output re- 
sulting from bracketing the corpus using the 
parser in its current state. 
3. This transformation is added to the end of the
ordered list of transformations.
4. Go to 1.

⁶The state of the parser is defined as naive
initial-state knowledge plus all transformations that
currently have been learned.
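
A minimal sketch of this loop follows (apply_transform and score are illustrative names of our own; any measure of bracketing success can be plugged in for score, as discussed below):

    def learn_transformations(tagged_sents, gold_trees, candidates,
                              apply_transform, score):
        # Greedy error-driven learning: start from the naive right-linear
        # parses and repeatedly adopt the best-scoring transformation.
        def total(parses):
            return sum(score(p, g) for p, g in zip(parses, gold_trees))

        current = [right_linear_parse(s) for s in tagged_sents]
        learned = []
        while True:
            best = max(candidates,
                       key=lambda t: total([apply_transform(t, p)
                                            for p in current]))
            improved = [apply_transform(best, p) for p in current]
            if total(improved) <= total(current):
                return learned        # no transformation reduces error
            current = improved
            learned.append(best)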
After a set of transformations has been 
learned, it can be used to effectively parse fresh 
text. To parse fresh text, the text is first naively 
parsed and then every transformation is applied, 
in order, to the naively parsed text. 
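
In code, parsing fresh text is just the composition of the naive parser with the learned sequence (a sketch reusing the earlier right_linear_parse; apply_transform is assumed as before). Since each transformation makes a bounded local change per environment, this pass is the sense in which bracketing is linear in sentence length:

    def parse(tagged_sentence, learned, apply_transform):
        # Naive right-linear bracketing, then each learned transformation
        # applied once, in the order in which it was learned.
        tree = right_linear_parse(tagged_sentence)
        for t in learned:
            tree = apply_transform(t, tree)
        return tree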
One nice feature of this method is that dif- 
ferent measures of bracketing success can be used: 
learning can proceed in such a way as to try to 
optimize any specified measure of success. The 
measure we have chosen for our experiments is the 
same measure described in (PS92), which is one of 
the measures that arose out of a parser evaluation 
workshop (ea91). The measure is the percentage 
of constituents (strings of words between matching 
parentheses) from sentences output by our system 
which do not cross any constituents in the Penn 
Treebank structural description of the sentence. 
For example, if our system outputs: 
( ( ( The big ) ( dog ate ) ) . )
and the Penn Treebank bracketing for this sen- 
tence was: 
( ( ( The big dog ) ate ) . )
then the constituent the big would be judged cor- 
rect whereas the constituent dog ate would not. 
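
A sketch of this measure over the nested-pair encoding used in the earlier sketches (spans are word indices; gold Treebank trees may be n-ary, so internal nodes are treated as tuples of any width):

    def spans(tree, start=0):
        # Return (end, spans): the (i, j) word-index span of every
        # constituent. A leaf is a (word, tag) pair of strings.
        if isinstance(tree[0], str):
            return start + 1, []
        end, out = start, []
        for child in tree:
            end, sub = spans(child, end)
            out += sub
        return end, out + [(start, end)]

    def crossing(a, b):
        # Two spans cross if they overlap but neither contains the other.
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

    def noncrossing_pct(candidate, gold):
        # Percentage of candidate constituents crossing no gold constituent.
        cand = spans(candidate)[1]
        ref = spans(gold)[1]
        ok = [c for c in cand if not any(crossing(c, g) for g in ref)]
        return 100.0 * len(ok) / max(len(cand), 1)

On the example above, the big spans words (0, 2) and crosses nothing in the gold tree, while dog ate spans (2, 4) and crosses The big dog at (0, 3).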
Below are the first seven transformations 
found from one run of training on the Wall Street 
Journal corpus, which was initially bracketed us- 
ing the right-linear initial-state parser. 
1. Delete a left paren to the left of a singular noun. 
2. Delete a left paren to the left of a plural noun. 
3. Delete a left paren between two proper nouns. 
4. Delete a left paren to the right of a determiner.
5. Add a right paren to the left of a comma.
6. Add a right paren to the left of a period. 
7. Delete a right paren to the left of a plural noun. 
The first four transformations all extract noun 
phrases from the right linear initial structure. The 
sentence "The cat meowed ." would initially be 
bracketed as:⁷

( ( The ( cat meowed ) ) . )
Applying the first transformation to this 
bracketing would result in: 
( ( ( The cat ) meowed ) . )

⁷These examples are not actual sentences in the
corpus. We have chosen simple sentences for clarity.
Applying the fifth transformation to the
bracketing:

( ( We ( ran ( , ( and ( they walked ) ) ) ) ) . )

would result in:

( ( ( We ran ) , ( and ( they walked ) ) ) . )
RESULTS 
In the first experiment we ran, training and test- 
ing were done on the Texas Instruments Air Travel 
Information System (ATIS) corpus (HGD90).⁸ In
table 1, we compare results we obtained to re- 
sults cited in (PS92) using the inside-outside al- 
gorithm on the same corpus. Accuracy is mea- 
sured in terms of the percentage of noncrossing 
constituents in the test corpus, as described above. 
Our system was tested by using the training set 
to learn a set of transformations, and then ap- 
plying these transformations to the test set and 
scoring the resulting output. In this experiment, 
64 transformations were learned (compared with 
4096 context-free rules and probabilities used in 
the inside-outside algorithm experiment). It is sig- 
nificant that we obtained comparable performance 
using a training corpus only 21% as large as that 
used to train the inside-outside algorithm.

⁸In all experiments described in this paper, results
are calculated on a test corpus which was not used in
any way in either training the learning algorithm or
in developing the system.
Method                   # of Training      Accuracy
                         Corpus Sentences
Inside-Outside           700                90.36%
Transformation Learner   150                91.12%

Table 1: Comparing two learning methods on the
ATIS corpus.
After applying all learned transformations to 
the test corpus, 60% of the sentences had no cross- 
ing constituents, 74% had fewer than two crossing 
constituents, and 85% had fewer than three. The 
mean sentence length of the test corpus was 11.3. 
In figure 2, we have graphed percentage correct 
as a function of the number of transformations 
that have been applied to the test corpus. As 
the transformation number increases, overtraining 
sometimes occurs. In the current implementation 
of the learner, a transformation is added to the 
list if it results in any positive net change in the 
training set. Toward the end of the learning proce- 
dure, transformations are found that only affect a 
very small percentage of training sentences. Since 
small counts are less reliable than large counts, we 
cannot reliably assume that these transformations 
will also improve performance in the test corpus. 
One way around this overtraining would be to set 
a threshold: specify a minimum level of improve- 
ment that must result for a transformation to be 
learned. Another possibility is to use additional 
training material to prune the set of learned trans- 
formations. 
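
As an illustration of the threshold remedy (a sketch only; the particular threshold value is an arbitrary assumption, not one reported here), the acceptance test in the learning loop changes from any positive gain to a minimum gain:

    MIN_GAIN = 2   # hypothetical minimum net training-set improvement

    def adopt(best_total, base_total, min_gain=MIN_GAIN):
        # Reject transformations learned from unreliably small counts.
        return best_total - base_total >= min_gain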
[Figure 2: Results From the ATIS Corpus, Starting
With Right-Linear Structure: percentage correct on
the test corpus as a function of rule number (0-60).]
We next ran an experiment to determine what 
performance could be achieved if we dropped the 
initial right-linear assumption. Using the same 
training and test sets as above, sentences were ini- 
tially assigned a random binary-branching struc- 
ture, with final punctuation always attached high. 
Since there was less regular structure in this case 
than in the right-linear case, many more transfor- 
mations were found, 147 transformations in total. 
When these transformations were applied to the 
test set, a bracketing accuracy of 87.13% resulted. 
The ATIS corpus is structurally fairly regular. 
To determine how well our algorithm performs on 
a more complex corpus, we ran experiments on 
the Wall Street Journal. Results from this exper- 
iment can be found in table 2.⁹ Accuracy is again
measured as the percentage of constituents in the
test set which do not cross any Penn Treebank
constituents.¹⁰

⁹For sentences of length 2-15, the initial right-linear
parser achieves 69% accuracy. For sentences of length
2-20, 63% accuracy is achieved and for sentences of
length 2-25, accuracy is 59%.

¹⁰In all of our experiments carried out on the Wall
Street Journal, the test set was a randomly selected
set of 500 sentences.
As a point of comparison, in (SRO93) an ex- 
periment was done using the inside-outside algo- 
rithm on a corpus of WSJ sentences of length 1-15. 
Training was carried out on a corpus of 1,095 sen- 
tences, and an accuracy of 90.2% was obtained in 
bracketing a test set. 
Sent.     # Training      # of Trans-     %
Length    Corpus Sents    formations      Accuracy
2-15      250             83              88.1
2-15      500             163             89.3
2-15      1000            221             91.6
2-20      250             145             86.2
2-25      250             160             83.8

Table 2: WSJ Sentences.
In the corpus we used for the experiments of 
sentence length 2-15, the mean sentence length 
was 10.80. In the corpus used for the experi- 
ment of sentence length 2-25, the mean length 
was 16.82. As would be expected, performance 
degrades somewhat as sentence length increases. 
In table 3, we show the percentage of sentences in 
the test corpus that have no crossing constituents, 
and the percentage that have only a very small 
number of crossing constituents.¹¹
Sent      # Training      % of        % of        % of
Length    Corpus Sents    0-error     ≤1-error    ≤2-error
                          Sents       Sents       Sents
2-15      500             53.7        72.3        84.6
2-15      1000            62.4        77.2        87.8
2-25      250             29.2        44.9        59.9

Table 3: WSJ Sentences.

¹¹For sentences of length 2-15, the initial right-linear
parser parses 17% of sentences with no crossing errors,
35% with one or fewer errors and 50% with two or
fewer. For sentences of length 2-25, 7% of sentences
are parsed with no crossing errors, 16% with one or
fewer, and 24% with two or fewer.
In table 4, we show the standard deviation 
measured from three different randomly chosen 
training sets of each sample size and randomly 
chosen test sets of 500 sentences each, as well as 
the accuracy as a function of training corpus size 
for sentences of length 2 to 20. 
# Training        %           Std.
Corpus Sents      Correct     Dev.
0                 63.0        0.69
10                75.8        2.95
50                82.1        1.94
100               84.7        0.56
250               86.2        0.46
750               87.3        0.61

Table 4: WSJ Sentences of Length 2 to 20.
We also ran an experiment on WSJ sen- 
tences of length 2-15 starting with random binary- 
branching structures with final punctuation at- 
tached high. In this experiment, 325 transfor- 
mations were found using a 250-sentence training 
corpus, and the accuracy resulting from applying 
these transformations to a test set was 84.72%. 
Finally, in figure 3 we show the sentence 
length distribution in the Wall Street Journal cor- 
pus. 
[Figure 3: The Distribution of Sentence Lengths in
the WSJ Corpus (number of sentences as a function
of sentence length, which ranges up to about 100
words).]
While the numbers presented above allow 
us to compare the transformation learner with 
systems trained and tested on comparable cor- 
pora, these results are all based upon the as- 
sumption that the test data is tagged fairly re- 
liably (manually tagged text was used in all of
these experiments, as well as in the experiments
of (PS92, SRO93)).
not assume that the text will be tagged with the 
accuracy of a human annotator. Instead, an au- 
tomatic tagger would have to be used to first tag 
the text before parsing. To address this issue, we 
ran one experiment where we randomly induced a 
5% tagging error rate beyond the error rate of the 
human annotator. Errors were induced in such a 
way as to preserve the unigram part of speech tag 
probability distribution in the corpus. The exper- 
iment was run for sentences of length 2-15, with a 
training set of 1000 sentences and a test set of 500 
sentences. The resulting bracketing accuracy was 
90.1%, compared to 91.6% accuracy when using 
an unadulterated training corpus. Accuracy only 
degraded by a small amount when training on the 
corpus with adulterated part of speech tags, sug- 
gesting that high parsing accuracy rates could be 
achieved if tagging of the input were done auto- 
matically by a part of speech tagger. 
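
One way to induce such errors while preserving the unigram tag distribution (a reconstruction of ours; the paper does not spell out its exact procedure) is to have each corrupted token draw its replacement from the corpus-wide tag frequencies:

    import random

    def corrupt_tags(corpus, error_rate=0.05, seed=0):
        # Randomly retag about error_rate of all tokens, sampling
        # replacements from the corpus's own unigram tag distribution
        # so that distribution is approximately preserved.
        rng = random.Random(seed)
        all_tags = [tag for sent in corpus for _, tag in sent]
        return [[(word,
                  rng.choice(all_tags) if rng.random() < error_rate
                  else tag)
                 for word, tag in sent]
                for sent in corpus]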
CONCLUSIONS 
In this paper, we have described a new approach 
for learning a grammar to automatically parse 
text. The method can be used to obtain high 
parsing accuracy with a very small training set. 
Instead of learning a traditional grammar, an or- 
dered set of structural transformations is learned 
that can be applied to the output of a very naive 
parser to obtain binary-branching trees with un- 
labelled nonterminals. Experiments have shown 
that these parses conform with high accuracy to 
the structural descriptions specified in a manually 
annotated corpus. Unlike other recent attempts 
at automatic grammar induction that rely heav- 
ily on statistics both in training and in the re- 
sulting grammar, our learner is only very weakly 
statistical. For training, only integers are needed 
and the only mathematical operations carried out 
are integer addition and integer comparison. The 
resulting grammar is completely symbolic. Un- 
like learners based on the inside-outside algorithm 
which attempt to find a grammar to maximize 
the probability of the training corpus in hope that 
this grammar will match the grammar that pro- 
vides the most accurate structural descriptions, 
the transformation-based learner can readily use 
any desired success measure in learning. 
We have already begun the next step in this 
project: automatically labelling the nonterminal 
nodes. The parser will first use the transforma-
tional grammar to output a parse tree without
nonterminal labels, and then a separate algorithm 
will be applied to that tree to label the nontermi- 
nals. The nonterminal-node labelling algorithm 
makes use of ideas suggested in (Bri92), where 
nonterminals are labelled as a function of the la- 
bels of their daughters. In addition, we plan to 
experiment with other types of transformations. 
Currently, each transformation in the learned list 
is only applied once in each appropriate environ- 
ment. For a transformation to be applied more 
than once in one environment, it must appear in 
the transformation list more than once. One pos- 
sible extension to the set of transformation types 
would be to allow for transformations of the form: 
add/delete a paren as many times as is possible 
in a particular environment. We also plan to ex- 
periment with other scoring functions and control 
strategies for finding transformations and to use 
this system as a postprocessor to other grammar 
induction systems, learning transformations to im- 
prove their performance. We hope these future 
paths will lead to a trainable and very accurate 
parser for free text. 

References 
[Bak79] J. Baker. Trainable grammars for speech recognition. In Speech
Communication Papers Presented at the 97th Meeting of the Acoustical
Society of America, 1979.

[BM92a] E. Brill and M. Marcus. Automatically acquiring phrase structure
using distributional analysis. In Proceedings of the DARPA Workshop on
Speech and Natural Language, Harriman, N.Y., 1992.

[BM92b] E. Brill and M. Marcus. Tagging an unfamiliar text with minimal
human supervision. In Proceedings of the Fall Symposium on Probabilistic
Approaches to Natural Language, AAAI Technical Report. American
Association for Artificial Intelligence, 1992.

[BR93] E. Brill and P. Resnik. A transformation-based approach to
prepositional phrase attachment. Technical report, Department of Computer
and Information Science, University of Pennsylvania, 1993.

[Bri92] E. Brill. A simple rule-based part of speech tagger. In
Proceedings of the Third Conference on Applied Natural Language
Processing, ACL, Trento, Italy, 1992.

[Bri93] E. Brill. A Corpus-Based Approach to Language Learning. PhD
thesis, Department of Computer and Information Science, University of
Pennsylvania, 1993. Forthcoming.

[BW92] T. Briscoe and N. Waegner. Robust stochastic parsing using the
inside-outside algorithm. In Workshop Notes from the AAAI
Statistically-Based NLP Techniques Workshop, 1992.

[CC92] G. Carroll and E. Charniak. Learning probabilistic dependency
grammars from labelled text. In Proceedings of the Fall Symposium on
Probabilistic Approaches to Natural Language, AAAI Technical Report.
American Association for Artificial Intelligence, 1992.

[ea91] E. Black et al. A procedure for quantitatively comparing the
syntactic coverage of English grammars. In Proceedings of the Fourth
DARPA Speech and Natural Language Workshop, pages 306-311, 1991.

[HGD90] C. Hemphill, J. Godfrey, and G. Doddington. The ATIS spoken
language systems pilot corpus. In Proceedings of the DARPA Speech and
Natural Language Workshop, 1990.

[LY90] K. Lari and S. Young. The estimation of stochastic context-free
grammars using the inside-outside algorithm. Computer Speech and
Language, 4, 1990.

[MM90] D. Magerman and M. Marcus. Parsing a natural language using mutual
information statistics. In Proceedings of the Eighth National Conference
on Artificial Intelligence (AAAI-90), 1990.

[MSM93] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large
annotated corpus of English: the Penn Treebank. To appear in
Computational Linguistics, 1993.

[PS92] F. Pereira and Y. Schabes. Inside-outside reestimation from
partially bracketed corpora. In Proceedings of the 30th Annual Meeting of
the Association for Computational Linguistics, Newark, De., 1992.

[Sam86] G. Sampson. A stochastic approach to parsing. In Proceedings of
COLING 1986, Bonn, 1986.

[SJM90] R. Sharman, F. Jelinek, and R. Mercer. Generating a grammar for
statistical training. In Proceedings of the 1990 DARPA Speech and Natural
Language Workshop, 1990.

[SRO93] Y. Schabes, M. Roth, and R. Osborne. Parsing the Wall Street
Journal with the inside-outside algorithm. In Proceedings of the 1993
European ACL, Utrecht, The Netherlands, 1993.
