A STOCHASTIC APPROACH TO SENTENCE PARSING 
Tetsunosuke Fujisaki 
Science Institute, IBM Japan, Ltd. 
No. 36 Kowa Building 
5-19 Sanbancho, Chiyoda-ku 
Tokyo 102, Japan 
ABSTRACT 
A description will be given of a procedure to assign the most likely probabilities to each of the rules of a given context-free grammar. The grammar developed by S. Kuno at Harvard University was picked as the basis and was successfully augmented with rule probabilities. A brief exposition of the method, with some preliminary results when it is used as a device for disambiguating the parsing of English texts picked from a natural corpus, will be given. 
I. INTRODUCTION 
To prepare a grammar which can parse arbitrary sentences taken from a natural corpus is a difficult task. One of the most serious problems is the potentially unbounded number of ambiguities. Pure syntactic analysis with an imprudent grammar will sometimes result in hundreds of parses. 
With prepositional phrase attachments and conjunctions, for example, it is known that the actual growth of ambiguities can be approximated by a Catalan number [Knuth], the number of ways to insert parentheses into a formula of M terms: 1, 2, 5, 14, 42, 132, 429, 1430, 4862, ... The five ambiguities in the following sentence, which has three ambiguous constructions, can be explained with this number. 
    I saw a man in a park with a scope. 
This Catalan number grows essentially exponentially, and [Martin] reported a syntactically ambiguous sentence with 455 parses: 

    List the sales of products produced in 1973 with the products produced in 1972. 
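For reference, the n-th Catalan number is C(2n,n)/(n+1); a minimal Python sketch of this growth: 

    from math import comb

    def catalan(n: int) -> int:
        # Number of ways to fully parenthesize a formula of n+1 terms.
        return comb(2 * n, n) // (n + 1)

    # 1, 2, 5, 14, 42, 132, 429, 1430, 4862, ...
    print([catalan(n) for n in range(1, 10)])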
On the other hand, throughout the long history of natural language understanding work, semantic and pragmatic constraints have been known to be indispensable; it is generally recommended that they be represented in some formal way and referred to during or after the syntactic analysis process. 
However, representing semantic and pragmatic constraints (which are usually domain sensitive) in a well-formed way is a very difficult and expensive task. A lot of effort in that direction has been expended, especially in Artificial Intelligence, using semantic networks, frame theory, etc. However, to our knowledge no one has ever succeeded in preparing them except in relatively small restricted domains [Winograd, Sibuya]. 
Faced with this situation, we propose in this paper to use statistics as a device for reducing ambiguities. In other words, we propose a scheme for grammatical inference as defined by [Fu], a stochastic augmentation of a given grammar; furthermore, we propose to use the resultant statistics as a stand-in for semantic and pragmatic constraints. Within this stochastic framework, semantic and pragmatic constraints are expected to be coded implicitly in the statistics. A simple bottom-up parse referring to the grammar rules as well as the statistics will assign relative probabilities among ambiguous derivations. These relative probabilities should be useful for filtering out meaningless garbage parses, because high probabilities will be assigned to the parse trees corresponding to meaningful interpretations, and low probabilities, hopefully 0.0, to parse trees which are grammatically correct but not meaningful. 
Most importantly, the stochastic augmentation of a grammar is done automatically, by feeding in a set of sample sentences from the relevant domain, while the preparation of semantic and pragmatic constraints, in the form of the usual semantic networks for example, must be done by human experts for each specific domain. 
This paper first introduces the basic ideas of the automatic training process which obtains these statistics from example sentences, and then shows how it works with experimental results. 
II. GRAMMATICAL INFERENCE OF A STOCHASTIC GRAMMAR 
A. Estimation of Markov Parameters from Sample Texts 
Assume a Markov source model, i.e. a collection of states connected to one another by transitions which produce symbols from a finite alphabet. To each transition t from a state s is associated a probability q(s,t), which is the probability that t will be chosen next when s is reached. 

When output sentences {B(i)} from this Markov model are observed, we can estimate the transition probabilities {q(s,t)} through an iterative process in the following way: 
1. Make an initial guess of {q(s,t)}. 

2. Parse each output sentence B(i). Let d(i,j) be the j-th derivation of the i-th output sentence B(i). 

3. Then the probability p(d(i,j)) of each derivation d(i,j) can be defined in the following way: p(d(i,j)) is the product of the probabilities of all the transitions q(s,t) which contribute to that derivation d(i,j). 

4. From this p(d(i,j)), the Bayes a posteriori estimate of the count c(s,t,i,j), the number of times the transition t from state s is used in the derivation d(i,j), can be estimated as follows: 

       c(s,t,i,j) = n(s,t,i,j) x p(d(i,j)) / SUM_j p(d(i,j)) 

   where n(s,t,i,j) is the number of times the transition t from state s is used in the derivation d(i,j). Obviously, c(s,t,i,j) becomes n(s,t,i,j) in an unambiguous case. 

5. From this c(s,t,i,j), a new estimate of the probabilities f(s,t) can be calculated: 

       f(s,t) = SUM_i SUM_j c(s,t,i,j) / SUM_i SUM_j SUM_t c(s,t,i,j) 

6. Replace {q(s,t)} with this new estimate {f(s,t)} and repeat from step 2. 
Through this process, asymptotic convergence will hold in the entropy of {q(s,t)}, which is defined as: 

    Entropy = SUM_s SUM_t -q(s,t) x log(q(s,t)) 

and the {q(s,t)} will approach the real transition probabilities [Baum-1970, 1972]. 
Further optimized versions of this algorithm can be found in [Bahl-1983] and have been successfully used for estimating the parameters of various Markov models which approximate speech processes [Bahl-1978, 1980]. 
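The basic iteration itself is compact. The following is a minimal Python sketch of steps 2 through 5, assuming the derivations of each sentence have already been enumerated by a parser and that each derivation is represented simply as a list of (state, transition) pairs: 

    import math
    from collections import Counter, defaultdict

    def reestimate(derivations_per_sentence, q):
        # derivations_per_sentence[i] is the list of derivations d(i,j) of
        # sentence B(i); q maps (state, transition) pairs to q(s,t).
        counts = Counter()
        for derivations in derivations_per_sentence:
            # Step 3: p(d(i,j)) is the product of q(s,t) over the derivation.
            p = [math.prod(q[st] for st in d) for d in derivations]
            total = sum(p)
            # Step 4: accumulate the Bayes a posteriori counts c(s,t,i,j);
            # each occurrence of a transition contributes p(d)/SUM_j p(d).
            for d, pd in zip(derivations, p):
                for st in d:
                    counts[st] += pd / total
        # Step 5: renormalize per state s to obtain the new estimates f(s,t).
        per_state = defaultdict(float)
        for (s, t), c in counts.items():
            per_state[s] += c
        return {(s, t): c / per_state[s] for (s, t), c in counts.items()}

    def entropy(q):
        # Entropy of {q(s,t)}; its convergence signals a fixed point.
        return -sum(p * math.log(p) for p in q.values() if p > 0)

Step 6 then amounts to feeding each result of reestimate back in as the next {q(s,t)} until the entropy stops changing. 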
B. Extension to Context-free Grammars 
This procedure for automatically estimating Markov source parameters can easily be extended to context-free grammars in the following manner. 

Assume that each state in the Markov model corresponds to a possible sentential form based on a given context-free grammar. Then each transition corresponds to the application of a context-free production rule to the previous state, i.e. the previous sentential form. For example, the state NP.VP can be reached from the state S by applying the rule S->NP VP; the state ART.NOUN.VP can be reached from the state NP.VP by applying the rule NP->ART NOUN to the first NP of the state NP.VP; and so on. A small sketch of this rule-application view is given below. 
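As an illustration, a sentential form can be represented as a tuple of symbols and a transition as a single rule application; the following minimal Python fragment (the representation is assumed for the example) reproduces the chain S => NP.VP => ART.NOUN.VP: 

    def apply_rule(state, lhs, rhs, position=0):
        # Rewrite the position-th occurrence of lhs in a sentential form
        # (a tuple of symbols) with rhs, yielding the successor state.
        seen = -1
        for k, symbol in enumerate(state):
            if symbol == lhs:
                seen += 1
                if seen == position:
                    return state[:k] + tuple(rhs) + state[k + 1:]
        raise ValueError(f"{lhs} does not occur in {state}")

    # S => NP.VP => ART.NOUN.VP, as in the text:
    s1 = apply_rule(("S",), "S", ("NP", "VP"))   # ('NP', 'VP')
    s2 = apply_rule(s1, "NP", ("ART", "NOUN"))   # ('ART', 'NOUN', 'VP')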
Since derivations correspond to sequences of transitions among the states defined above, parsing over the set of sentences given as training data will enable us to count how many times each transition is fired in the given sample sentences. For example, transitions from the state S to the state NP.VP may occur for almost every sentence, because the corresponding rule 'S->NP VP' must be used to derive the most frequent declarative sentences; the transition from state ART.NOUN.VP to the state 'every'.NOUN.VP may happen 103 times; etc. If we associate each grammar rule with an a priori probability as an initial guess, then the Bayes a posteriori estimate of the number of times each transition will be traversed can be calculated from the initial probabilities and the actual counts observed as described above. 
Since each production is expected to occur independently of its context, the new estimate of the probability of a rule is calculated at each iteration step by masking the contexts. That is, the Bayes estimate counts from all of the transitions which correspond to a single context-free rule (all transitions between states like xxx.A.yyy and xxx.B.C.yyy correspond to the production rule 'A->B C', regardless of the contents of xxx and yyy) are tied together to get the new probability estimate of the corresponding rule, as sketched below. 

Renewing the probabilities of the rules with the new estimates, the same steps are repeated until they converge. 
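A minimal sketch of this tying step, assuming the Bayes counts are keyed by (left context, rule, right context) with each rule a pair (lhs, rhs): 

    from collections import defaultdict

    def reestimate_rules(transition_counts):
        # Mask the contexts: pool the counts of every transition that
        # instantiates the same context-free rule.
        rule_counts = defaultdict(float)
        for (_left, rule, _right), c in transition_counts.items():
            rule_counts[rule] += c
        # Renormalize per left-hand side to get the new rule probabilities.
        lhs_totals = defaultdict(float)
        for (lhs, _rhs), c in rule_counts.items():
            lhs_totals[lhs] += c
        return {rule: c / lhs_totals[rule[0]]
                for rule, c in rule_counts.items()}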
III. EXPERIMENTATION 
A. Base Grammar 
As the basis of this research, the grammar developed by Prof. S. Kuno in the 1960's for the machine translation project at Harvard University [Kuno-1963, 1966] was chosen, with few modifications. The grammar specifications, which are in Greibach normal form, were translated into a form which is favorable to our method: the 2118 original rules were rewritten as 5241 rules in Chomsky normal form. 
B. Parser 
A bottom-up context-free parser based on the Cocke-Kasami-Young algorithm was developed especially for this purpose. Special emphasis was put on the design of the parser to get better performance in highly ambiguous cases. That is, alternative links, the dotted links shown in the figure below, were introduced to reduce the number of intermediate substructures as far as possible. 
[Figure: shared parse substructures, with alternative links drawn as dotted lines] 
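A minimal sketch of such a parser over a Chomsky-normal-form grammar follows; keeping one packed chart entry per span and symbol plays the role of the alternative links (the sketch counts analyses rather than building the substructures themselves): 

    from collections import defaultdict

    def cky_parse_counts(words, lexical, binary):
        # lexical maps a word to the nonterminals deriving it; binary maps
        # (B, C) to the nonterminals A with a rule A -> B C.
        # chart[i][j][A] counts the distinct parses of words[i:j] as an A.
        n = len(words)
        chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            for a in lexical.get(w, ()):
                chart[i][i + 1][a] += 1
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):
                    for b, nb in chart[i][k].items():
                        for c, nc in chart[k][j].items():
                            for a in binary.get((b, c), ()):
                                # Ambiguities multiply across the split point.
                                chart[i][j][a] += nb * nc
        return chart[0][n]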
C. Test Corpus 
Training sentences were selected from magazines (31 articles from Reader's Digest and Datamation) and from IBM correspondence. Of the 5528 sentences selected from the magazine articles, 3582 were successfully parsed, with 0.89 seconds of CPU time (IBM 3033-UP) and 48.5 ambiguities per sentence. The average sentence length in this corpus was 10.85 words. 
From the corpus of IBM correspondence, 1001 sentences, 12.65 words long on average, were chosen, and 624 sentences were successfully parsed, with an average of 13.5 ambiguities. 
D. Resultant Stochastic Context-free Grammar 
After a certain number of iterations, probabilities were successfully associated with all of the grammar rules and the lexical rules, as shown below: 
* IT4 
  0.98788  HELP            ---(a) 
  0.00931  SEE             ---(b) 
  0.00141  HEAR 
  0.00139  WATCH 
  0.00000  HAVE 
  0.00000  FEEL 

* SE 
  0.28754  PRN VX PD       ---(c) 
  0.25530  AAA 4X VX PD    ---(d) 
  0.14856  NNN VX PD 
  0.13567  AV1 SE 
  0.04006  PRE NQ SE 
  0.02693  AV4 IX MX PD 
  0.01714  NUM 4X VX PD 
  0.01319  IT1 N2 PD 

* VX 
  0.16295  VT1 N2 
  0.14372  VI1 
  0.11963  AUX BV 
  0.10174  PRE NQ VX 
  0.09460  BE3 PA 
In the above llst, (a) means that "HELP" will be gen- 
erated from part-of-speech "IT4" with the probabili- 
ty 0.98788, and (b) means that "SEE" will be 
generated from part-of-speech "IT4" with the proba- 
bility 0.00931. (c) means that the non-terminal "SE 
(sentence)" will generate the sequence, "PRN (pro- 
noun)", "VX (predicate)" and "PD (period or post 
sententlal modifiers followed by period)" with the 
probability 0.28754. (d) means that "SE" will gener- 
ate the sequence, "AAA(artlcle, adjective, etc.)" , 
"4X (subject noun phrase)", "VX" and "PD" with the 
probability 0.25530. The remaining lines are to be 
interpreted similarly. 
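A natural machine representation of such a trained grammar is one probability table per left-hand-side symbol; for illustration, a fragment of the listing above in an assumed Python layout: 

    # Probability of each right-hand side, keyed by left-hand-side symbol.
    stochastic_grammar = {
        "IT4": {("HELP",): 0.98788, ("SEE",): 0.00931, ("HEAR",): 0.00141},
        "SE": {
            ("PRN", "VX", "PD"): 0.28754,        # rule (c)
            ("AAA", "4X", "VX", "PD"): 0.25530,  # rule (d)
        },
    }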
E. Parse Trees with Probabilities 
Parse trees were printed as shown below, including the relative probability of each parse. 
WE DO NOT UTILIZE OUTSIDE ART SERVICES DIRECTLY . 

** total ambiguity is : 3 

*: SENTENCE 
   *: PRONOUN 'we' 
   *: PREDICATE 
      *: AUXILIARY 'do' 
      *: INFINITE VERB PHRASE 
         *: ADVERB TYPE1 'not' 
         A: 0.356 INFINITE VERB PHRASE 
         |  *: VERB TYPE IT1 'utilize' 
         |  *: OBJECT 
         |     *: NOUN 'outside' 
         |     *: ADJ CLAUSE 
         |        *: NOUN 'art' 
         |        *: PRED. WITH NO OBJECT 
         |           *: VERB TYPE VT1 'services' 
         B: 0.003 INFINITE VERB PHRASE 
         |  *: VERB TYPE IT1 'utilize' 
         |  *: OBJECT 
         |     *: PREPOSITION 'outside' 
         |     *: NOUN OBJECT 
         |        *: NOUN 'art' 
         |        *: OBJECT 
         |           *: NOUN 'services' 
         C: 0.641 INFINITE VERB PHRASE 
         |  *: VERB TYPE IT1 'utilize' 
         |  *: OBJECT 
         |     *: NOUN 'outside' 
         |     *: OBJECT MASTER 
         |        *: NOUN 'art' 
         |        *: OBJECT MASTER 
         |           *: NOUN 'services' 
   *: PERIOD 
      *: ADVERB TYPE1 'directly' 
      *: PRD '.' 
This example shows that the sentence 'We do not utilize outside art services directly.' was parsed in three different ways; the differences appear in the sub-trees identified by A, B and C in the figure. The numbers following the identifiers are the relative probabilities. As shown in this case, the correct parse, the third one, got the highest relative probability, as expected. 
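Once each derivation is reduced to the list of rules it applies, the relative probabilities are simply the normalized products of the rule probabilities; a minimal sketch: 

    import math

    def relative_probabilities(parses, rule_prob):
        # parses: the derivations of one sentence, each given as the list of
        # rules it applies; rule_prob maps a rule to its trained probability.
        scores = [math.prod(rule_prob[r] for r in parse) for parse in parses]
        total = sum(scores)
        return [score / total for score in scores]

For the sentence above, this would return 0.356, 0.003 and 0.641 for the sub-trees A, B and C respectively. 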
F. Result 
63 ambiguous sentences from the magazine corpus and 21 ambiguous sentences from the IBM correspondence were chosen at random from the sample sentences, and their parse trees with probabilities were manually examined, as shown in the table below: 
a. Corpus                                        Magazine    IBM 
b. Number of sentences checked manually                63     21 
c. Number of sentences with no correct parse            4      2 
d. Number of sentences which got the highest 
   prob. on the most natural parse                     54     18 
e. Number of sentences which did not get the 
   highest prob. on the most natural parse              5      1 
f. Success ratio d/(d+e)                             .915   .947 
Taking into consideration that the grammar was not tailored for this experiment in any way, the result is quite satisfactory. 
The only erroneous case in the IBM corpus is due to a grammar problem. That is, in this grammar, modifier phrases such as TO-infinitives, prepositional phrases, adverbials, etc. after the main verb are derived from the 'end marker' of the sentence, i.e. the period, rather than from the relevant constituent being modified. The parse tree in the previous figure is a typical example: the adverb 'DIRECTLY' is derived from the 'PERIOD' rather than from the verb 'UTILIZE'. This simplified handling of dependencies does not keep information between modifying and modified phrases and, as a result, will cause problems where the dependencies play a crucial role in the analysis. This error occurred in a sentence '... is going to work out ...', where two interpretations of the phrase 'to work' exist; 'to work' modifies the 'period' as: 

1. A TO-infinitive phrase 

2. A prepositional phrase 

Ignoring the relationship to the previous context 'is going', the second interpretation got the higher probability, because prepositional phrases occur more frequently than TO-infinitive phrases when context is not taken into account. 
IV. CONCLUSION 
The results of these trials suggest the strong potential of this method, as well as possible applications such as refining, minimizing, and optimizing a given context-free grammar. It will also be useful for giving a disambiguation capability to a given ambiguous context-free grammar. 
In this experiment, an existing grammar was used with few modifications; therefore, only statistics due to the syntactic differences of the sub-structured units were gathered. Applying this method to the collection of statistics which relate more to semantics should be investigated as the next step of this project. Introducing into the grammar a dependency relationship among sub-structured units, semantically categorized parts of speech, head-word inheritance among sub-structured units, etc. might be essential for this purpose. More investigation should be done in this direction. 
V. ACKNOWLEDGEMENTS 
This work was carried out while the author was in the Computer Science Department of the IBM Thomas J. Watson Research Center. 

The author would like to thank Dr. John Cocke, Dr. F. Jelinek, Dr. R. Mercer, and Dr. L. Bahl of the IBM Thomas J. Watson Research Center, and Prof. S. Kuno of Harvard University, for their encouragement and valuable technical suggestions. 

Also, the author is indebted to Mr. E. Black, Mr. B. Green and Mr. J. Lutz for their assistance and discussions. 
VI. REFERENCES 
• Bahl, L., Jelinek, F., and Mercer, R., A Maximum Likelihood Approach to Continuous Speech Recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, 1983. 
• Bahl, L., et al., Automatic Recognition of Continuously Spoken Sentences from a Finite State Grammar, Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tulsa, OK, Apr. 1978. 
• Bahl, L., et al., Further Results on the Recognition of a Continuously Read Natural Corpus, Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Denver, CO, Apr. 1980. 
• Baum, L.E., A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains, The Annals of Mathematical Statistics, Vol. 41, No. 1, 1970. 
• Baum, L.E., An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes, Inequalities, Vol. 3, Academic Press, 1972. 
• Fu, K.S., Syntactic Methods in Pattern Recognition, Vol. 112, Mathematics in Science and Engineering, Academic Press, 1974. 
• Knuth, D., Fundamental Algorithms, Vol. 1 of The Art of Computer Programming, Addison-Wesley, 1975. 
• Kuno, S., The Augmented Predictive Analyzer for Context-free Languages - Its Relative Efficiency, CACM, Vol. 9, No. 11, 1966. 
• Kuno, S., and Oettinger, A.G., Syntactic Structure and Ambiguity of English, Proc. FJCC, AFIPS, 1963. 
• Martin, W., et al., Preliminary Analysis of a Breadth-First Parsing Algorithm: Theoretical and Experimental Results, MIT LCS Report TR-261, MIT, 1981. 
• Sibuya, M., Fujisaki, T., and Takao, Y., Noun-Phrase Model and Natural Query Language, IBM J. Res. Dev., Vol. 22, No. 5, 1978. 
• Winograd, T., Understanding Natural Language, Academic Press, 1972. 
• Woods, W., The Lunar Sciences Natural Language Information System, BBN Report No. 2378, Bolt, Beranek and Newman. 
