INSIDE-OUTSIDE REESTIMATION FROM PARTIALLY 
BRACKETED CORPORA 
Fernando Pereira 
2D-447, AT&T Bell Laboratories 
PO Box 636, 600 Mountain Ave 
Murray Hill, NJ 07974-0636 
pereira@research.att.com 
Yves Schabes 
Dept. of Computer and Information Science 
University of Pennsylvania 
Philadelphia, PA 19104-6389 
schabes@unagi.cis.upenn.edu 
ABSTRACT 
The inside-outside algorithm for inferring the pa- 
rameters of a stochastic context-free grammar is 
extended to take advantage of constituent in- 
formation (constituent bracketing) in a partially 
parsed corpus. Experiments on formal and natu- 
ral language parsed corpora show that the new al- 
gorithm can achieve faster convergence and better 
modeling of hierarchical structure than the origi- 
nal one. In particular, over 90% test set bracket- 
ing accuracy was achieved for grammars inferred 
by our algorithm from a training set of hand- 
parsed part-of-speech strings for sentences in the 
Air Travel Information System spoken language 
corpus. Finally, the new algorithm has better time 
complexity than the original one when sufficient 
bracketing is provided. 
1. MOTIVATION 
The most successful stochastic language models 
have been based on finite-state descriptions such 
as n-grams or hidden Markov models (HMMs) 
(Jelinek et al., 1992). However, finite-state mod- 
els cannot represent the hierarchical structure of 
natural language and are thus ill-suited to tasks 
in which that structure is essential, such as lan- 
guage understanding or translation. It is then 
natural to consider stochastic versions of more 
powerful grammar formalisms and their gram- 
matical inference problems. For instance, Baker 
(1979) generalized the parameter estimation meth- 
ods for HMMs to stochastic context-free gram- 
mars (SCFGs) (Booth, 1969) as the inside-outside 
algorithm. Unfortunately, the application of 
SCFGs and the original inside-outside algorithm 
to natural-language modeling has been so far in- 
conclusive (Lari and Young, 1990; Jelinek et al., 
1990; Lari and Young, 1991). 
Several reasons can be adduced for the difficul- 
ties. First, each iteration of the inside-outside al- 
gorithm on a grammar with n nonterminals may 
require O(n³|w|³) time per training sentence w, 
while each iteration of its finite-state counterpart 
training an HMM with s states requires at worst 
O(s²|w|) time per training sentence. That com- 
plexity makes the training of sufficiently large 
grammars computationally impractical. 
Second, the convergence properties of the algo- 
rithm sharply deteriorate as the number of non- 
terminal symbols increases. This fact can be intu- 
itively understood by observing that the algorithm 
searches for the maximum of a function whose 
number of local maxima grows with the number of 
nonterminals. Finally, while SCFGs do provide a 
hierarchical model of the language, that structure 
is undetermined by raw text and only by chance 
will the inferred grammar agree with qualitative 
linguistic judgments of sentence structure. For ex- 
ample, since in English texts pronouns are very 
likely to immediately precede a verb, a grammar 
inferred from raw text will tend to make a con- 
stituent of a subject pronoun and the following 
verb. 
We describe here an extension of the inside-outside 
algorithm that infers the parameters of a stochas- 
tic context-free grammar from a partially parsed 
corpus, thus providing a tighter connection be- 
tween the hierarchical structure of the inferred 
SCFG and that of the training corpus. The al- 
gorithm takes advantage of whatever constituent 
information is provided by the training corpus 
bracketing, ranging from a complete constituent 
analysis of the training sentences to the unparsed 
corpus used for the original inside-outside algo- 
rithm. In the latter case, the new algorithm re- 
duces to the original one. 
Using a partially parsed corpus has several advan- 
tages. First, the resulting grammars yield con- 
stituent boundaries that cannot be inferred from 
raw text. In addition, the number of iterations 
needed to reach a good grammar can be reduced; 
in extreme cases, a good solution is found from 
parsed text but not from raw text. Finally, the 
new algorithm has better time complexity when 
sufficient bracketing information is provided. 
2. PARTIALLY BRACKETED 
TEXT 
Informally, a partially bracketed corpus is a set 
of sentences annotated with parentheses marking 
constituent boundaries that any analysis of the 
corpus should respect. More precisely, we start 
from a corpus C consisting of bracketed strings, 
which are pairs $c = (w, B)$ where $w$ is a string 
and $B$ is a bracketing of $w$. For convenience, we 
will define the length of the bracketed string $c$ by 
$|c| = |w|$. 

Given a string $w = w_1 \cdots w_{|w|}$, a span of $w$ is a 
pair of integers $(i, j)$ with $0 \le i < j \le |w|$, which 
delimits a substring ${}_i w_j = w_{i+1} \cdots w_j$ of $w$. The 
abbreviation ${}_i w$ will stand for ${}_i w_{|w|}$. 
A bracketing $B$ of a string $w$ is a finite set of spans 
on $w$ (that is, a finite set of pairs of integers $(i, j)$ 
with $0 \le i < j \le |w|$) satisfying a consistency 
condition that ensures that each span $(i, j)$ can be 
seen as delimiting a string ${}_i w_j$ consisting of a se- 
quence of one or more words. The consistency condition 
is simply that no two spans in a bracketing may 
overlap, where two spans $(i, j)$ and $(k, l)$ overlap if 
either $i < k < j < l$ or $k < i < l < j$. 
Two bracketings of the same string are said to be 
compatible if their union is consistent. A span s is 
valid for a bracketing B if {s} is compatible with 
B. 
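
These definitions translate directly into code. The following is a minimal Python sketch (function names are ours, not the paper's) of the overlap, consistency, compatibility, and validity predicates:

```python
def overlap(s, t):
    """Spans (i, j) and (k, l) overlap iff i < k < j < l or k < i < l < j."""
    (i, j), (k, l) = s, t
    return i < k < j < l or k < i < l < j

def consistent(spans):
    """A set of spans is consistent iff no two of its spans overlap."""
    spans = list(spans)
    return not any(overlap(s, t) for s in spans for t in spans)

def compatible(b1, b2):
    """Two bracketings of the same string are compatible iff their union is consistent."""
    return consistent(set(b1) | set(b2))

def valid(span, bracketing):
    """A span s is valid for a bracketing B iff {s} is compatible with B."""
    return compatible({span}, bracketing)
```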
Note that there is no requirement that a bracket- 
ing of w describe fully a constituent structure of 
w. In fact, some or all sentences in a corpus may 
have empty bracketings, in which case the new al- 
gorithm behaves like the original one. 
To present the notion of compatibility between a 
derivation and a bracketed string, we need first 
to define the span of a symbol occurrence in a 
context-free derivation. Let $(w, B)$ be a brack- 
eted string, and $\alpha_0 \Rightarrow \alpha_1 \Rightarrow \cdots \Rightarrow \alpha_m = w$ be 
a derivation of $w$ for (S)CFG $G$. The span of a 
symbol occurrence in $\alpha_j$ is defined inductively as 
follows: 

• If $j = m$, then $\alpha_j = w \in \Sigma^*$, and the span of $w_i$ in 
$\alpha_j$ is $(i-1, i)$. 

• If $j < m$, then $\alpha_j = \beta A \gamma$ and $\alpha_{j+1} = 
\beta X_1 \cdots X_k \gamma$, where $A \to X_1 \cdots X_k$ is a rule 
of $G$. Then the span of $A$ in $\alpha_j$ is $(i_1, j_k)$, 
where for each $1 \le l \le k$, $(i_l, j_l)$ is the span 
of $X_l$ in $\alpha_{j+1}$. The spans in $\alpha_j$ of the symbol 
occurrences in $\beta$ and $\gamma$ are the same as those 
of the corresponding symbols in $\alpha_{j+1}$. 
A derivation of w is then compatible with a brack- 
eting B of w if the span of every symbol occurrence 
in the derivation is valid in B. 
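
Since all derivations corresponding to the same parse tree assign the same spans, compatibility can be checked directly on parse trees. A sketch under an assumed tree encoding of our own (nested tuples with terminal string leaves), reusing the valid predicate above:

```python
def occurrence_spans(tree, start=0):
    """Spans of all symbol occurrences in a parse tree, encoded as nested
    (nonterminal, child, ..., child) tuples with terminal string leaves.
    A leaf covering position i gets span (i-1, i); an internal node spans
    from its first to its last terminal descendant. Returns (end, spans)."""
    if isinstance(tree, str):
        return start + 1, [(start, start + 1)]
    end, spans = start, []
    for child in tree[1:]:
        end, child_spans = occurrence_spans(child, end)
        spans.extend(child_spans)
    return end, [(start, end)] + spans

def tree_compatible(tree, bracketing):
    """A derivation is compatible with B iff every occurrence's span is valid."""
    _, spans = occurrence_spans(tree)
    return all(valid(s, bracketing) for s in spans)
```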
3. GRAMMAR REESTIMATION 
The inside-outside algorithm (Baker, 1979) is a 
reestimation procedure for the rule probabilities 
of a Chomsky normal-form (CNF) SCFG. It takes 
as inputs an initial CNF SCFG and a training cor- 
pus of sentences and it iteratively reestimates rule 
probabilities to maximize the probability that the 
grammar used as a stochastic generator would pro- 
duce the corpus. 
A reestimation algorithm can be used either to re- 
fine the parameter estimates for a CNF SCFG de- 
rived by other means (Fujisaki et al., 1989) or to 
infer a grammar from scratch. In the latter case, 
the initial grammar for the inside-outside algo- 
rithm consists of all possible CNF rules over given 
sets $N$ of nonterminals and $\Sigma$ of terminals, with 
suitably assigned nonzero probabilities. In what 
follows, we will take $N$, $\Sigma$ as fixed, $n = |N|$, $t = 
|\Sigma|$, and assume enumerations $N = \{A_1, \ldots, A_n\}$ 
and $\Sigma = \{b_1, \ldots, b_t\}$, with $A_1$ the grammar start 
symbol. A CNF SCFG over $N$, $\Sigma$ can then be 
specified by the $n^3 + nt$ probabilities $B_{p,q,r}$ of each 
possible binary rule $A_p \to A_q\, A_r$ and $U_{p,m}$ of each 
possible unary rule $A_p \to b_m$. Since for each $p$ the 
parameters $B_{p,q,r}$ and $U_{p,m}$ are supposed to be the 
probabilities of different ways of expanding $A_p$, we 
must have for all $1 \le p \le n$ 

$$\sum_{q,r} B_{p,q,r} + \sum_m U_{p,m} = 1 \tag{7}$$

For grammar inference, we give random initial val- 
ues to the parameters $B_{p,q,r}$ and $U_{p,m}$ subject to 
the constraints (7). 
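
Concretely, the initialization can be done by drawing positive random values and normalizing them per left-hand side; a minimal NumPy sketch (the array names B and U mirror the parameters above, but the code itself is ours):

```python
import numpy as np

def random_cnf_scfg(n, t, seed=0):
    """Random CNF SCFG parameters over n nonterminals and t terminals:
    B[p, q, r] for A_p -> A_q A_r and U[p, m] for A_p -> b_m, normalized
    so that every row p satisfies constraint (7)."""
    rng = np.random.default_rng(seed)
    B = rng.random((n, n, n))
    U = rng.random((n, t))
    Z = B.sum(axis=(1, 2)) + U.sum(axis=1)  # total expansion mass of each A_p
    return B / Z[:, None, None], U / Z[:, None]
```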
The intended meaning of rule probabilities in a 
SCFG is directly tied to the intuition of context- 
freeness: a derivation is assigned a probability 
which is the product of the probabilities of the 
rules used in each step of the derivation. Context- 
freeness together with the commutativity of mul- 
tiplication thus allow us to identify all derivations 
associated to the same parse tree, and we will 
$$I_p^c(i-1, i) = U_{p,m} \quad \text{where } c = (w, B) \text{ and } b_m = w_i \tag{1}$$

$$I_p^c(i, k) = \delta(i, k) \sum_{q,r} \sum_{i < j < k} B_{p,q,r}\, I_q^c(i, j)\, I_r^c(j, k) \tag{2}$$

$$O_p^c(0, |c|) = \begin{cases} 1 & \text{if } p = 1 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

$$O_p^c(i, k) = \delta(i, k) \sum_{q,r} \left( \sum_{j=0}^{i-1} O_q^c(j, k)\, I_r^c(j, i)\, B_{q,r,p} + \sum_{j=k+1}^{|c|} O_q^c(i, j)\, B_{q,p,r}\, I_r^c(k, j) \right) \tag{4}$$

$$\hat{B}_{p,q,r} = \frac{\displaystyle \sum_{c \in C} \frac{1}{P^c} \sum_{0 \le i < j < k \le |c|} B_{p,q,r}\, I_q^c(i, j)\, I_r^c(j, k)\, O_p^c(i, k)}{\displaystyle \sum_{c \in C} P_p^c / P^c} \tag{5}$$

$$\hat{U}_{p,m} = \frac{\displaystyle \sum_{c \in C} \frac{1}{P^c} \sum_{1 \le i \le |c|,\ c = (w, B),\ w_i = b_m} U_{p,m}\, O_p^c(i-1, i)}{\displaystyle \sum_{c \in C} P_p^c / P^c} \tag{6}$$

where

$$P^c = I_1^c(0, |c|) \qquad\qquad P_p^c = \sum_{0 \le i < j \le |c|} I_p^c(i, j)\, O_p^c(i, j)$$

Table 1: Bracketed Reestimation 
speak indifferently below of derivation and anal- 
ysis (parse tree) probabilities. Finally, the proba- 
bility of a sentence or sentential form is the sum 
of the probabilities of all its analyses (equivalently, 
the sum of the probabilities of all of its leftmost 
derivations from the start symbol). 
3.1. The Inside-Outside Algorithm 
The basic idea of the inside-outside algorithm is 
to use the current rule probabilities and the train- 
ing set W to estimate the expected frequencies of 
certain types of derivation step, and then compute 
new rule probability estimates as appropriate ra- 
tios of those expected frequency estimates. Since 
these are most conveniently expressed as relative 
frequencies, they are a bit loosely referred to as 
inside and outside probabilities. More precisely, 
for each $w \in W$, the inside probability $I_p^w(i, j)$ es- 
timates the likelihood that $A_p$ derives ${}_i w_j$, while 
the outside probability $O_p^w(i, j)$ estimates the like- 
lihood of deriving the sentential form ${}_0 w_i\, A_p\, {}_j w$ from 
the start symbol $A_1$. 
3.2. The Extended Algorithm 
In adapting the inside-outside algorithm to par- 
tially bracketed training text, we must take into 
account the constraints that the bracketing im- 
poses on possible derivations, and thus on possi- 
ble phrases. Clearly, nonzero values for $I_p^c(i, j)$ 
or $O_p^c(i, j)$ should only be allowed if ${}_i w_j$ is com- 
patible with the bracketing of $w$, or, equivalently, 
if $(i, j)$ is valid for the bracketing of $w$. There- 
fore, we will in the following assume a corpus C of 
bracketed strings c = (w, B), and will modify the 
standard formulas for the inside and outside prob- 
abilities and rule probability reestimation (Baker, 
1979; Lari and Young, 1990; Jelinek et al., 1990) 
to involve only constituents whose spans are com- 
patible with string bracketings. For this purpose, 
for each bracketed string c = (w, B) we define the 
auxiliary function 
$$\delta(i, j) = \begin{cases} 1 & \text{if } (i, j) \text{ is valid for } B \\ 0 & \text{otherwise} \end{cases}$$
The reestimation formulas for the extended algo- 
rithm are shown in Table 1. For each bracketed 
sentence c in the training corpus, the inside prob- 
abilities of longer spans of c are computed from 
those for shorter spans with the recurrence given 
by equations (1) and (2). Equation (2) calculates 
the expected relative frequency of derivations of 
${}_i w_k$ from $A_p$ compatible with the bracketing $B$ of 
$c = (w, B)$. The multiplier $\delta(i, k)$ is 1 just in case 
$(i, k)$ is valid for $B$, that is, when $A_p$ can derive 
${}_i w_k$ compatibly with $B$. 
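
In code, the inside pass of equations (1) and (2) might look as follows; this is a sketch under the array conventions of the initialization sketch above, with delta a predicate implementing the auxiliary function δ (our names, not the paper's):

```python
import numpy as np

def inside(w, delta, B, U):
    """Bracketed inside pass: I[p, i, k] is I_p^c(i, k), the probability that
    A_p derives w_{i+1}..w_k compatibly with the bracketing. The input w is
    a sequence of terminal indices."""
    n, L = B.shape[0], len(w)
    I = np.zeros((n, L + 1, L + 1))
    for i, m in enumerate(w):              # equation (1): one-word spans
        I[:, i, i + 1] = U[:, m]
    for width in range(2, L + 1):          # equation (2): longer spans
        for i in range(L - width + 1):
            k = i + width
            if not delta(i, k):            # invalid spans keep probability 0
                continue
            for j in range(i + 1, k):      # sub-span factors are already 0
                I[:, i, k] += np.einsum(   # for invalid (i, j) or (j, k)
                    'pqr,q,r->p', B, I[:, i, j], I[:, j, k])
    return I
```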
Similarly, the outside probabilities for shorter 
spans of c can be computed from the inside prob- 
abilities and the outside probabilities for longer 
spans with the recurrence given by equations (3) 
and (4). Once the inside and outside probabili- 
ties are computed for each sentence in the corpus, 
the reestimated probability $\hat{B}_{p,q,r}$ of binary rules 
and the reestimated probability $\hat{U}_{p,m}$ of unary rules 
are computed by the reestimation formulas (5) and 
(6), which are just like the original ones (Baker, 
1979; Jelinek et al., 1990; Lari and Young, 1990) 
except for the use of bracketed strings instead of 
unbracketed ones. 
The denominator of ratios (5) and (6) estimates 
the probability that a compatible derivation of a 
bracketed string in C will involve at least one ex- 
pansion of nonterminal Ap. The numerator of (5) 
estimates the probability that a compatible deriva- 
tion of a bracketed string in C will involve rule 
$A_p \to A_q\, A_r$, while the numerator of (6) estimates 
the probability that a compatible derivation of a 
string in $C$ will rewrite $A_p$ to $b_m$. Thus (5) es- 
timates the probability that a rewrite of $A_p$ in a 
compatible derivation of a bracketed string in $C$ 
will use rule $A_p \to A_q\, A_r$, and (6) estimates the 
probability that an occurrence of $A_p$ in a compat- 
ible derivation of a string in $C$ will be rewritten 
to $b_m$. These are the best current estimates for 
the binary and unary rule probabilities. 
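
Putting the pieces together, one reestimation pass over the corpus might be sketched as follows, with outside assumed to implement equations (3) and (4) analogously to the inside pass, and A_1 stored at index 0 (all conventions ours):

```python
def reestimation_step(corpus, B, U):
    """One application of formulas (5) and (6) over a corpus of (w, delta)
    pairs, returning the reestimated parameters."""
    num_B, num_U = np.zeros_like(B), np.zeros_like(U)
    den = np.zeros(B.shape[0])
    for w, delta in corpus:
        I = inside(w, delta, B, U)
        O = outside(w, delta, B, U, I)     # assumed: equations (3) and (4)
        L = len(w)
        P = I[0, 0, L]                     # P^c = I_1^c(0, |c|)
        den += np.einsum('pij,pij->p', I, O) / P   # P_p^c / P^c
        for i, m in enumerate(w):          # numerator of (6)
            num_U[:, m] += U[:, m] * O[:, i, i + 1] / P
        for i in range(L - 1):             # numerator of (5)
            for k in range(i + 2, L + 1):
                for j in range(i + 1, k):
                    num_B += np.einsum('pqr,q,r,p->pqr', B, I[:, i, j],
                                       I[:, j, k], O[:, i, k]) / P
    return num_B / den[:, None, None], num_U / den[:, None]
```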
The process is then repeated with the reestimated 
probabilities until the increase in the estimated 
probability of the training text given the model 
becomes negligible, or, what amounts to the same, 
the decrease in the cross entropy estimate (nega- 
tive log probability) 
E log pc 
H(C,G) = (8) Icl 
c6C 
becomes negligible. Note that for comparisons 
with the original algorithm, we should use the 
cross-entropy estimate /~(W, G) of the unbrack- 
eted text W with respect to the grammar G, not 
(8). 
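
The stopping test can be written directly on estimate (8); a sketch of the outer loop, reusing the inside pass above to recompute each P^c (the loop structure is ours, not from the paper):

```python
import math

def train(corpus, B, U, tol=1e-3, max_iters=100):
    """Iterate reestimation until the decrease in the cross-entropy
    estimate of equation (8) becomes negligible."""
    total_len = sum(len(w) for w, _ in corpus)
    prev_H = float('inf')
    for _ in range(max_iters):
        B, U = reestimation_step(corpus, B, U)
        logprob = sum(math.log(inside(w, delta, B, U)[0, 0, len(w)])
                      for w, delta in corpus)
        H = -logprob / total_len           # equation (8)
        if prev_H - H < tol:
            break
        prev_H = H
    return B, U
```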
3.3. Complexity 
Each of the three steps of an iteration of the origi- 
nal inside-outside algorithm -- computation of in- 
side probabilities, computation of outside proba- 
bilities and rule probability reestimation -- takes 
time $O(|w|^3)$ for each training sentence $w$. Thus, 
the whole algorithm is $O(|w|^3)$ on each training 
sentence. 
However, the extended algorithm performs better 
when bracketing information is provided, because 
it does not need to consider all possible spans for 
constituents, but only those compatible with the 
training set bracketing. In the limit, when the 
bracketing of each training sentence comes from 
a complete binary-branching analysis of the sen- 
tence (a full binary bracketing), the time of each 
step reduces to $O(|w|)$. This can be seen from the 
following three facts about any full binary brack- 
eting B of a string w: 
1. $B$ has $O(|w|)$ spans; 
2. For each $(i, k)$ in $B$ there is exactly one split 
point $j$ such that both $(i, j)$ and $(j, k)$ are in 
$B$; 

3. Each valid span with respect to $B$ must al- 
ready be a member of $B$. 
Thus, in equation (2) for instance, the number of 
spans $(i, k)$ for which $\delta(i, k) \neq 0$ is $O(|c|)$, and 
there is a single $j$ between $i$ and $k$ for which 
$\delta(i, j) \neq 0$ and $\delta(j, k) \neq 0$. Therefore, the total 
time to compute all the $I_p^c(i, k)$ is $O(|c|)$. A simi- 
lar argument applies to equations (4) and (5). 
Note that to achieve the above bound as well as to 
take advantage of whatever bracketing is available 
to improve performance, the implementation must 
preprocess the training set appropriately so that 
the valid spans and their split points are efficiently 
enumerated. 
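
One straightforward (if naive) way to do this preprocessing, reusing the valid predicate from Section 2, is sketched below; with a full binary bracketing, each multi-word entry of splits has exactly one element, which is what yields the linear behavior:

```python
def valid_spans_and_splits(length, bracketing):
    """Precompute the set of spans valid for the bracketing and, for each,
    the split points j such that (i, j) and (j, k) are also valid."""
    ok = {(i, k)
          for i in range(length) for k in range(i + 1, length + 1)
          if valid((i, k), bracketing)}
    splits = {(i, k): [j for j in range(i + 1, k)
                       if (i, j) in ok and (j, k) in ok]
              for (i, k) in ok}
    return ok, splits
```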
4. EXPERIMENTAL 
EVALUATION 
The following experiments, although preliminary, 
give some support to our earlier suggested advan- 
tages of the inside-outside algorithm for partially 
bracketed corpora. 
The first experiment involves an artificial exam- 
ple used by Lari and Young (1990) in a previous 
evaluation of the inside-outside algorithm. In this 
case, training on a bracketed corpus can lead to a 
good solution while no reasonable solution is found 
training on raw text only. 
The second experiment uses a naturally occurring 
corpus and its partially bracketed version provided 
by the Penn Treebank (Brill et al., 1990). We 
compare the bracketings assigned by grammars in- 
ferred from raw and from bracketed training mate- 
rial with the Penn Treebank bracketings of a sep- 
arate test set. 
To evaluate objectively the accuracy of the analy- 
ses yielded by a grammar G, we use a Viterbi-style 
parser to find the most likely analysis of each test 
sentence according to G, and define the bracket- 
ing accuracy of the grammar as the proportion 
of phrases in those analyses that are compatible 
in the sense defined in Section 2 with the tree 
bank bracketings of the test set. This criterion is 
closely related to the "crossing parentheses" score 
of Black et al. (1991).¹ 
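
In code, the criterion amounts to the following sketch, reusing the valid predicate of Section 2 and counting only phrase (internal node) spans under our tuple tree encoding:

```python
def phrase_spans(tree, start=0):
    """Spans of the phrases (internal nodes) of a parse tree encoded as
    nested (nonterminal, child, ..., child) tuples with string leaves."""
    if isinstance(tree, str):
        return start + 1, []
    end, spans = start, []
    for child in tree[1:]:
        end, child_spans = phrase_spans(child, end)
        spans.extend(child_spans)
    return end, spans + [(start, end)]

def bracketing_accuracy(trees, gold_bracketings):
    """Proportion of phrases in the most likely analyses whose spans are
    valid for the corresponding treebank bracketing."""
    good = total = 0
    for tree, gold in zip(trees, gold_bracketings):
        _, spans = phrase_spans(tree)
        total += len(spans)
        good += sum(valid(s, gold) for s in spans)
    return good / total
```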
In describing the experiments, we use the nota- 
tion GR for the grammar estimated by the original 
inside-outside algorithm, and GB for the grammar 
estimated by the bracketed algorithm. 
4.1. Inferring the Palindrome Lan- 
guage 
We consider first an artificial language discussed 
by Lari and Young (1990). Our training corpus 
consists of 100 sentences in the palindrome lan- 
guage $L$ over two symbols $a$ and $b$, 

$$L = \{ w w^R \mid w \in \{a, b\}^* \},$$

randomly generated with the SCFG 

S → A C 
S → B D 
S → A A 
S → B B 
C → S A 
D → S B 
A → a 
B → b 
¹ Since the grammar inference procedure is restricted to 
Chomsky normal form grammars, it cannot avoid difficult 
decisions by leaving out brackets (thus making flatter parse 
trees), as human annotators often do. Therefore, the recall 
component in Black et al.'s figure of merit for parsers is not 
needed. 
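
For concreteness, such a training corpus can be sampled top-down from the grammar; the rule probabilities below are illustrative assumptions on our part (the paper does not state the ones used), chosen so that expansion terminates with probability 1:

```python
import random

# The palindrome SCFG above, as {lhs: [(rhs, probability), ...]};
# probabilities are assumed, not taken from the paper.
RULES = {
    'S': [(('A', 'C'), 0.4), (('B', 'D'), 0.4),
          (('A', 'A'), 0.1), (('B', 'B'), 0.1)],
    'C': [(('S', 'A'), 1.0)],
    'D': [(('S', 'B'), 1.0)],
    'A': [(('a',), 1.0)],
    'B': [(('b',), 1.0)],
}

def generate(symbol='S'):
    """Sample one string from the SCFG by expanding nonterminals recursively."""
    if symbol not in RULES:                # terminal symbol
        return [symbol]
    rhss, probs = zip(*RULES[symbol])
    rhs = random.choices(rhss, weights=probs)[0]
    return [tok for sym in rhs for tok in generate(sym)]

corpus = [generate() for _ in range(100)]  # cf. the 100-sentence training set
```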
The initial grammar consists of all possible CNF 
rules over five nonterminals and the terminals a 
and b (135 rules), with random rule probabilities. 
As shown in Figure 1, with an unbracketed train- 
ing set W the cross-entropy estimate Ĥ(W, GR) re- 
mains almost unchanged after 40 iterations (from 
1.57 to 1.44) and no useful solution is found. 
In contrast, with a fully bracketed version C of 
the same training set, the cross-entropy estimate 
Ĥ(W, GB) decreases rapidly (1.57 initially, 0.88 af- 
ter 21 iterations). Similarly, the cross-entropy esti- 
mate Ĥ(C, GB) of the bracketed text with respect 
to the grammar improves rapidly (2.85 initially, 
0.89 after 21 iterations). 
[Figure 1: Convergence for the Palindrome Corpus. Cross-entropy estimate (0.8-1.6) against iteration (1-40) for raw (solid) and bracketed (dashed) training.] 
The inferred grammar correctly models the palin- 
drome language. Its high probability rules (p > 
0.1, p/p′ > 30 for any excluded rule p′) are 

S → A D 
S → C B 
B → S C 
D → S A 
A → b 
B → a 
C → a 
D → b 

which is a close to optimal CNF CFG for the palin- 
drome language. 
The results on this grammar are quite sensitive 
to the size and statistics of the training corpus 
and the initial rule probability assignment. In 
fact, for a couple of choices of initial grammar 
and corpus, the original algorithm produces gram- 
mars with somewhat better cross-entropy esti- 
mates than those yielded by the new one. How- 
ever, in every case the bracketing accuracy on 
a separate test set for the result of bracketed 
training is above 90% (100% in several cases), in 
contrast to bracketing accuracies ranging between 
15% and 69% for unbracketed training. 
4.2. Experiments on the ATIS Cor- 
pus 
For our main experiment, we used part-of-speech 
sequences of spoken-language transcriptions in the 
Texas Instruments subset of the Air Travel Infor- 
mation System (ATIS) corpus (Hemphill et al., 
1990), and a bracketing of those sequences derived 
from the parse trees for that subset in the Penn 
Treebank. 
Out of the 770 bracketed sentences (7812 words) 
in the corpus, we used 700 as a training set C and 
70 (901 words) as a test set T. The following is an 
example training string 
( ( ( VB ( DT NNS ( IN ( ( NN ) ( NN CD ) ) ) ) ) ) . ) 
corresponding to the parsed sentence 
(((List (the fares (for ((flight) 
(number 891)))))) .) 
The initial grammar consists of all 4095 possible 
CNF rules over 15 nonterminals (the same number 
as in the tree bank) and 48 terminal symbols for 
part-of-speech tags. 
A random initial grammar was trained separately 
on the unbracketed and bracketed versions of the 
training corpus, yielding grammars GR and GB. 
[Figure 2: Convergence for the ATIS Corpus. Cross-entropy estimate (2.8-4.6) against iteration (1-75) for raw (solid) and bracketed (dashed) training.] 
Figure 2 shows that Ĥ(W, GB) initially decreases 
faster than Ĥ(W, GR), although eventually the 
two stabilize at very close values: after 75 itera- 
tions, Ĥ(W, GB) ≈ 2.97 and Ĥ(W, GR) ≈ 2.95. 
However, the analyses assigned by the resulting 
grammars to the test set are drastically different. 
[Figure 3: Bracketing Accuracy for the ATIS Corpus. Bracketing accuracy in percent (20-100) against iteration (1-75) for raw (solid) and bracketed (dashed) training.] 
With the training and test data described above, 
the bracketing accuracy of GR after 75 iterations 
was only 37.35%, in contrast to 90.36% bracket- 
ing accuracy for GB. Plotting bracketing accu- 
racy against iterations (Figure 3), we see that un- 
bracketed training does not on the whole improve 
accuracy. On the other hand, bracketed training 
steadily improves accuracy, although not mono- 
tonically. 
It is also interesting to look at some of the differences 
between GR and GB, as seen from the most likely 
analyses they assign to certain sentences. Table 
2 shows two bracketed test sentences followed by 
their most likely GR and GB analyses, given for 
readability in terms of the original words rather 
than part-of-speech tags. 
For test sentence (A), the only GB constituent 
not compatible with the tree bank bracketing 
is (Delta flight number), although the con- 
stituent (the cheapest) is linguistically wrong. 
The appearance of this constituent can be ex- 
plained by lack of information in the tree bank 
about the internal structure of noun phrases, as 
exemplified by the tree bank bracketing of the same 
sentence. In contrast, the GR analysis of the same 
string contains 16 constituents incompatible with 
the tree bank. 
For test sentence (B), the GB analysis is fully com- 
patible with the tree bank. However, the GR anal- 
ysis has nine incompatible constituents, which for 
(A)  (I would (like (to (take (Delta ((flight number) 83)) (to Atlanta)))).) 
     (What ((is (the cheapest fare (I can get)))) ?) 

GR   (I (would (like ((to ((take (Delta flight)) (number (83 ((to Atlanta) .))))) 
     ((What (((is the) cheapest) fare)) ((I can) (get ?))))))) 

GB   (((I (would (like (to (take (((Delta (flight number)) 83) (to Atlanta))))))) .) 
     ((What (is (((the cheapest) fare) (I (can get))))) ?)) 

(B)  ((Tell me (about (the public transportation 
     ((from SFO) (to San Francisco))))).) 

GR   (Tell ((me (((about the) public) transportation)) 
     ((from SFO) ((to San) (Francisco .))))) 

GB   ((Tell (me (about (((the public) transportation) 
     ((from SFO) (to (San Francisco))))))) .) 

Table 2: Comparing Bracketings 
example places Francisco and the final punctua- 
tion in a lowest-level constituent. Since final punc- 
tuation is quite often preceded by a noun, a gram- 
mar inferred from raw text will tend to bracket 
the noun with the punctuation mark. 
This experiment illustrates the fact that although 
SCFGs provide a hierarchical model of the lan- 
guage, that structure is undetermined by raw text 
and only by chance will the inferred grammar 
agree with qualitative linguistic judgments of sen- 
tence structure. This problem has also been previ- 
ously observed with linguistic structure inference 
methods based on mutual information. Magerman 
and Marcus (1990) addressed the problem by 
specifying a predetermined list of pairs of parts of 
speech (such as verb-preposition, pronoun-verb) 
that can never be embraced by a low-level con- 
stituent. However, these constraints are stipulated 
in advance rather than being automatically de- 
rived from the training material, in contrast with 
what we have shown to be possible with the inside- 
outside algorithm for partially bracketed corpora. 
5. CONCLUSIONS AND 
FURTHER WORK 
We have introduced a modification of the well- 
known inside-outside algorithm for inferring the 
parameters of a stochastic context-free grammar 
that can take advantage of constituent informa- 
tion (constituent bracketing) in a partially brack- 
eted corpus. 
The method has been successfully applied to 
SCFG inference for formal languages and for 
part-of-speech sequences derived from the ATIS 
spoken-language corpus. 
The use of a partially bracketed corpus can reduce 
the number of iterations required for convergence 
of parameter reestimation. In some cases, a good 
solution is found from a bracketed corpus but not 
from raw text. Most importantly, the use of a par- 
tially bracketed natural-language corpus enables the algo- 
rithm to infer grammars specifying linguistically 
reasonable constituent boundaries that cannot be 
inferred by the inside-outside algorithm on raw 
text. While none of this is very surprising, it sup- 
plies some support for the view that purely unsu- 
pervised, self-organizing grammar inference meth- 
ods may have difficulty in distinguishing between 
underlying grammatical structure and contingent 
distributional regularities, or, to put it another 
way, it gives some evidence for the importance of 
nondistributional regularities in language, which 
in the case of bracketed training have been sup- 
plied indirectly by the linguists carrying out the 
bracketing. 
Also of practical importance, the new algorithm 
can have better time complexity for bracketed 
text. In the best situation, that of a training set 
with full binary-branching bracketing, the time for 
each iteration is in fact linear in the total length 
of the set. 
These preliminary investigations could be ex- 
tended in several ways. First, it is important to 
determine the sensitivity of the training algorithm 
to the initial probability assignments and training 
corpus, as well as to the lack or misplacement of brack- 
ets. We have started experiments in this direction, 
but reasonable statistical models of bracket elision 
and misplacement are lacking. 
Second, we would like to extend our experiments 
to larger terminal vocabularies. As is well known, 
this raises both computational and data sparse- 
ness problems, so clustering of terminal symbols 
will be essential. 
Finally, this work does not address a central weak- 
ness of SCFGs, their inability to represent lex- 
ical influences on distribution except by a sta- 
tistically and computationally impractical pro- 
liferation of nonterminal symbols. One might 
instead look into versions of the current algo- 
rithm for more lexically-oriented formalisms such 
as stochastic lexicalized tree-adjoining grammars 
(Schabes, 1992). 
ACKNOWLEDGMENTS 
We thank Aravind Joshi and Stuart Shieber for 
useful discussions, and Mitch Marcus, Beatrice 
Santorini and Mary Ann Marcinkiewicz for mak- 
ing available the ATIS corpus in the Penn Tree- 
bank. The second author is partially supported 
by DARPA Grant N0014-90-31863, ARO Grant 
DAAL03-89-C-0031 and NSF Grant IRI90-16592. 

REFERENCES 
J.K. Baker. 1979. Trainable grammars for speech 
recognition. In Jared J. Wolf and Dennis H. Klatt, 
editors, Speech communication papers presented 
at the 97th Meeting of the Acoustical Society of 
America, MIT, Cambridge, MA, June. 
E. Black, S. Abney, D. Flickenger, R. Grishman, 
P. Harrison, D. Hindle, R. Ingria, F. Jelinek, 
J. Klavans, M. Liberman, M. Marcus, S. Roukos, 
B. Santorini, and T. Strzalkowski. 1991. A pro- 
cedure for quantitatively comparing the syntactic 
coverage of English grammars. In DARPA Speech 
and Natural Language Workshop, pages 306-311, 
Pacific Grove, California. Morgan Kaufmann. 
T. Fujisaki, F. Jelinek, J. Cocke, E. Black, and 
T. Nishino. 1989. A probabilistic parsing method 
for sentence disambiguation. In Proceedings of the 
International Workshop on Parsing Technologies, 
Pittsburgh, August. 
Charles T. Hemphill, John J. Godfrey, and 
George R. Doddington. 1990. The ATIS spoken 
language systems pilot corpus. In DARPA Speech 
and Natural Language Workshop, Hidden Valley, 
Pennsylvania, June. 
F. Jelinek, J. D. Lafferty, and R. L. Mercer. 1990. 
Basic methods of probabilistic context free gram- 
mars. Technical Report RC 16374 (72684), IBM, 
Yorktown Heights, New York 10598. 
Frederick Jelinek, Robert L. Mercer, and Salim 
Roukos. 1992. Principles of lexical language mod- 
eling for speech recognition. In Sadaoki Furui and 
M. Mohan Sondhi, editors, Advances in Speech 
Signal Processing, pages 651-699. Marcel Dekker, 
Inc., New York, New York. 
K. Lari and S. J. Young. 1990. The estimation of 
stochastic context-free grammars using the Inside- 
Outside algorithm. Computer Speech and Lan- 
guage, 4:35-56. 
K. Lari and S. J. Young. 1991. Applications of 
stochastic context-free grammars using the Inside- 
Outside algorithm. Computer Speech and Lan- 
guage, 5:237-257. 
David Magerman and Mitchell Marcus. 1990. 
Parsing a natural language using mutual informa- 
tion statistics. In AAAI-90, Boston, MA. 
Yves Schabes. 1992. Stochastic lexicalized tree- 
adjoining grammars. In COLING 92. Forthcom- 
ing. 
T. Booth. 1969. Probabilistic representation of 
formal languages. In Tenth Annual IEEE Sympo- 
sium on Switching and Automata Theory, Octo- 
ber. 
Eric Brill, David Magerman, Mitchell Marcus, and 
Beatrice Santorini. 1990. Deducing linguistic 
structure from the statistics of large corpora. In 
DARPA Speech and Natural Language Workshop. 
Morgan Kaufmann, Hidden Valley, Pennsylvania, 
June. 
