PARAMETER ESTIMATION FOR CONSTRAINED 
CONTEXT-FREE LANGUAGE MODELS 
Kevin Mark, Michael Miller, Ulf Grenander~ Steve Abney t 
Electronic Systems and Signals Research Laboratory 
Washington University 
St. Louis, Missouri 63130 
ABSTRACT 
A new language model incorporating both N-gram and 
context-free ideas is proposed. This constrained context-free 
model is specified by a stochastic context-free prior distribu- 
tion with N-gram frequency constraints. The resulting dis- 
tribution is a Markov random field. Algorithms for sampling 
from this distribution and estimating the parameters of the 
model are presented. 
1. INTRODUCTION 
This paper introduces the idea of N-gram constrained 
context-free language models. This class of language 
models merges two prevalent ideas in language modeling: 
N-grams and context-free grammars. In N-gram lan- 
guage models, the underlying probability distributions 
are Markov chains on the word string. N-gram mod- 
els have advantages in their simplicity. Both parameter 
estimation and sampling from the distribution are sim- 
ple tasks. A disadvantage of these models is their weak 
modeling of linguistic structure. 
Context-free language models are instances of random 
branching processes. The major advantage of this class 
of models is its ability to capture linguistic structure. 
In the following section, notation for stochastic context- 
free language models and the probability of a word string 
under this model are presented. Section 3 reviews a pa- 
rameter estimation algorithm for SCF language models. 
Section 4 introduces the bigram-constrained context-free 
language model. This language model is seen to be a 
Markov random field. In Section 5, a random sampling 
algorithm is stated. In Section 6, the problem of param- 
eter estimation in the constrained context-free language 
model is addressed. 
*Division of Applied Mathematics, Brown University, Provi- 
dence, Rhode Island 02904 
tBell Communications Research, Morristown, New Jersey 
07962 
2. STOCHASTIC CONTEXT-FREE 
GRAMMARS 
A stochastic context-free grammar G is specified by the 
quintuple < VN, VT, R, S, P > where VN is a finite set 
of non-terminal symbols, VT is a finite set of terminal 
symbols, R is a set of rewrite rules, S is a start symbol 
in VN, and P is a parameter vector. If r 6 R, then Pr is 
the probability of using the rewrite rule r. 
For our experiments, we are using a 411 rule grammar 
which we will refer to as the Abney-2 grammar. The 
grammar has 158 syntactic variables, i.e., IVNI = 158. 
The rules of the Abney-2 grammar are of the form 
H -+ G1,G2 .... Gk where H, Gi 6 VN and k = 1,2 ..... 
Hence, this grammar is not expressed in Chomsky Nor- 
mal Form. We maintain this more general form for the 
purposes of linguistic analysis. 
An important measure is the probability of a deriva- 
tion tree T. Using ideas from the random branching 
process literature \[2, 4\], we specify a derivation tree T 
by its depth L and the counting statistics zt(i,k),l = 
1 .... ,n,i = 1 .... ,IVNI, and k = 1 ..... IRI. The count- 
ing statistic zz(i, k) is the number of non-terminals at 6 
VN rewritten at level I with rule rk 6 R. With these 
statistics the probability of a tree T is given by 
L IVN\] IRI 
= H H H (1) 
l=l i=l k=l 
In this model, the probability of a word string W1,N = 
w:w2... WN, fl(Wl,N), is given by 
Z(W:,N) = =(T) (2) 
TEParses(W,,N) 
where Parses(W1,N) is the set of parse trees for the 
given word string. For an unambiguous grammar, 
Parses(Wl,N) consists of a single parse. 
146 
3. PARAMETER ESTIMATION FOR 
SCFGS 
An important problem in stochastic language models is 
the estimation of model parameters. In the parameter 
estimation problem for SCFGs, we observe a word string 
W1,N of terminal symbols. With this observation, we 
want to estimate the rule probabilities P. For a grammar 
in Chomsky Normal Form, the familiar Inside/Outside 
Algorithm is used to estimate P. However, the Abney- 
2 grammar is not in this normal form. Although the 
grammar could be easily converted to CNF, we prefer to 
retain its original form for linguistic relevance. Hence, 
we need an algorithm that can estimate the probabilities 
of rules in our more general form given above. 
The algorithm that we have derived is a specific case of 
Kupiec's trellis-based algorithm \[3\]. Kupiec's algorithm 
estimates parameters for general recursive transition net- 
works. In our case, we only have rules of the following 
two types: 
1. H ---~ G1G2"..Gk where H, Gi E VN and k = 
1,2 .... 
2. H-+TwhereHEVN andTEVw. 
For this particular topology, we derived the following 
trellis-based algorithm. 
Trellis-based algorithm 
1. Compute inner probabilities a(i,j,a) = Pr\[o" 
Wij\] where a E VN and Wij denotes the substring 
wi • • • wj. 
o~(i,i,o') = .p°ldo__wi -I- E "p°ld°--°'~^eii'trl)' 
G 1 :G--~G I 
o~(i,j,o') = E o~n,e(i,j,o'n,a) 
Gn :G---~ ...G n 
"n,~(i, j, ~m, ~) = 
bold ari g o" "~ O"-~'Qm ,.. \ , J, rn,\] 
ifo" ~ o'm.., or m = 1 
.i-1 • Ek=,+l "rite(', k, fire-l, ff).(k, J, ~m) 
if o" ~... o'ra- 1 am • • • 
2. Compute outer probabilities fl(i,j,o') = Pr\[S :~ 
Wl,i-i o" W/+X,N\] where o" e VN. 
fl(1, N, S) = 1.0 
~(i, j, if) ---- E bold ,~ ri ~ n) J n_,o...IJntek , J, if, 
i-1 
+ E E a"'e(k'i'p'n)fl"'~(k'j'°''n) 
n.-.t....pa.., k=0 
tints(i, j, crm, o') = 
{ f~(i, j, o.) 
if a ~ ... a~ L E~=~+i ~(j, k, o'~+t)f~.tdi, k, o'm+t, o-) 
if a ~ ...OynO'rn+l . . . 
3. Re-estimate P. 
pnew 
0"--+ 0"10"2,..a n 
N-1 N 
E/N=i N • ~j=i a(z, j, a)fl(i, j, a) 
new _ PaI ._~ T -- 
Ei:w,=T ca(i, i, o')fl( i, i, a) 
E/N=1 N • Ej=, ~(~, J, ~)~(i, j, ~) 
For CNF grammars, the trellis-based algorithm reduces 
to the Inside-Outside algorithm. We have tested the al- 
gorithm on both CNF grammars and non-CNF gram- 
mars. In either case, the estimated probabilities are 
asymptotically unbiased. 
4. SCFGS WITH BIGRAM 
CONSTRAINTS 
We now consider adding bigram relative frequencies as 
constraints on our stochastic context-free trees. The sit- 
uation is shown in Figure 1. In this figure, a word string 
is shown with its bigram relationships and its underlying 
parse tree structure. 
In this model, we assume a given prior context-free dis- 
tribution as given by fl(W1,N) (Equation 2). This prior 
distribution may be obtained via the trellis-based esti- 
mation algorithm (Section 3) applied to a training text 
or, alternatively, from a hand-parsed training text. We 
are also given bigram relative frequencies, 
N--1 
hai,aj(Wl,g) = ~ lq,,aj(wk, w~+l) (3) 
k=l 
where tri, aj E VT. 
Given this type of structure involving both hierarchical 
and bigram relationships, what probability distribution 
on word strings should we consider? The following the- 
orem states the maximum entropy solution. 
147 
S 
NP VP 
Art N 
I I v . 
ths boy \] 
Figure 1: Stochastic context-free tree with bigram rela- 
tionships. 
Theorem 1 
distribution maximizing the generalized entropy 
-- E p(c) log p(c) (4) f(c) 
subject 
to the constraints {E\[ha,,as(W1,N)\] : Ha,,aj}ai,a~CVw 
is 
Let c = W1,N and f(c) = fl(W1,N). The 
Pr(W1,N) = p*(c) = (5) 
Z-lexp ( E EOtal,a~hal,a2(W1,N))fl(Wl,N) 
oI~VT G2~VT 
where Z is the normalizing constant. 
Remarks The specification of bigram constraints for 
h(.) is not necessary for the derivation of this theorem. 
The constraint function h(.) may be any function on the 
word string including general N-grams. Also, note that 
if the parameters o~a1,~,2 are all zero, then this distribu- 
tion reduces to the unconstrained stochastic context-free 
model. 
5. SIMULATION 
For simulation purposes, we would like to be able to 
draw sample word strings from the maximum entropy 
distribution. The generation of such sentences for this 
language model cannot be done directly as in the un- 
constrained context-free model. In order to generate 
sentences, a random sampling algorithm is needed. A 
simple Metropolis-type algorithm is presented to sample 
from our distribution. 
The distribution must first be expressed in Gibbs form: 
1 -E(W~.N) Pr(Wl,g) = ~e (6) 
where 
E(WI,N) = -- E E ha,, a2h'',a2(Wl,g) 
oa EVT a2EVT 
- log 3(Wa,N). (7) 
Given this 'energy' E, the following algorithm generates 
a sequence of samples, {W 1, W 2, W3,...}, from this dis- 
tribution. 
Random sampling algorithm 
1. perturb W i to W new 
2. compute AE = E(W new) - E(W i) 
3. if AE < 0 then Wi+T +_ 
wnew 
else wi+l ~ 
W new p( new 
W ) = e_AE with probability = P(W) 
4. increment i and repeat step 1. 
In the first step, the perturbation of a word string is done 
as follows: 
1. generate parses of the string W 
2. choose one of these parses 
3. choose a node in the parse tree 
4. generate a subtree rooted at this node according to 
the prior rule probabilities 
5. let the terminal sequence of the modified tree be the 
new word string W new. 
This method of perturbation satisfies the detailed bal- 
ance conditions in random sampling. 
Proposition Given a sequence 
of samples {W 1, W 2, W3,...} generated with the ran- 
dom sampling algorithm above. The sequence converges 
weakly to the distribution Pr(W1,N). 
148 
6. PARAMETER ESTIMATION FOR 
THE CONSTRAINED 
CONTEXT-FREE MODEL 
In the parameter estimation problem for the constrained 
context-free model, we are given an observed word string 
W1,N of terminal symbols and want to estimate the 
c~ parameters in the maximum entropy distribution, 
Pr(W1,N). One criterion in estimating these parameters 
is maximizing the likelihood given the observed data. 
Maximum likelihood estimation yields the following con- 
dition for the optimum (ML) estimates: 
0 Pr(W1,N) I = 0 (8) 
~Olaa ,ab I &~a ,~t~ 
Evaluating the left hand side gives the following maxi- 
mum likelihood condition 
Ea .... b \[ha',ab(Wl,g)\] = h°.,°b(W1,N) (O) 
One method to obtain the maximum likelihood estimates 
is given by Younes \[5\]. His estimation algorithm uses 
a random sampling algorithm to estimate the expected 
value of the constraints in a gradient descent framework. 
Another method is the pseudolikelihood approach which 
we consider here. 
In the pseudolikelihood approach, an approximation to 
the likelihood is derived from local probabilities \[1\]. In 
our problem, these local probabilities are given by: 
Pr(wilwl ..... wi-1, Wi÷l ..... WN) = 
exp(~,_,,~, + ~,,~,+,)Z(W1,N) 
EWj:~V T exp(aw,_,,w; + aw:,w,+,)~ti(Wl,N, w~ 10) 
where , 
~i(W1,N, W~) = ETEParses(w, ..... wi-l,w:,wi+t ..... wN) 7r(T). 
The pseudolikelihood £ is given in terms of these local 
probabilities by 
N 
£ -- IXPr(wilwl'""Wi-l'W'+l'""WN) (11) 
i=l 
Maximizing the pseudolikelihood £ is equivalent to max- 
imizing the log-pseudolikelihood, 
N-1 
logi = Nlog~(W1,N)+ 2 ~ ~wk,~_. (12) 
k=l 
- ~ log ~-w,_~,~: +~:,,~,+, ~(w1,N, w~) 
i=l . Lw:eVT J 
We can estimate the oL parameters by maximizing the 
log-pseudolikelihood with respect to the c,'s. The algo- 
rithm that we use to do this is a gradient descent al- 
gorithm. The gradient descent algorithm is an iterative 
algorithm in which the parameters are updated by a fac- 
tor of the gradient, i.e., 
0 log O~(i+1) = ~(i) "Jr #0~,o2 (13) 
a I t0"2 O" 1 ~0~ 
where # is the step size and the gradient is given by 
0 log £ N- I 
--2 E awk'wk+ 'l°''°~(wk'wk+l) 
' k=l 
g ~ 0 CftVi--l,W~ ÷Otto~,~i..Ll ~ /TIT l~ L.~wlEV.p ET'~ -e * * ~ pi~ VVl,N,Wi\]. 
Ottvi--ltw{÷Otw{,wi..~. 1 fJ /TI¥ -. I\ \-- I i=l L..~w~EVT e * 
' Pi~VVl,N,Wi) 
The gradient descent algorithm is sensitive to the choice 
of step size #. This choice is typically made by trial and 
error. 
7. CONCLUSION 
This paper introduces a new class of language models 
based on Markov random field ideas. The proposed 
context-free language model with bigram constraints of- 
fers a rich linguistic structure. In order to facilitate ex- 
ploring this structure, we have presented a random sam- 
pling algorithm and a parameter estimation algorithm. 
The work presented here is a beginning. Further work is 
being done in improving the efficiency of the algorithms 
and in investigating the correlation of bigram relative 
frequencies and estimated a parameters in the model. 
References 
1. Besag, J., "Spatial Interaction and the Statistical Anal- 
ysis of Lattice Systems," J. R. Statist. Soc. B, Vol. 36, 
1974, pp. 192-236. 
2. Harris, T. E., The Theory of Branching Processes, 
Springer-Verlag, Berlin, 1963. 
3. Kupiec, J., "A trellis-based algorithm for estimating the 
parameters of a hidden stochastic context-free gram- 
mar," 1991. 
4. Miller, M. I., and O'Sullivan, J. A., "Entropies and Com- 
binatorics of Random Branching Processes and Context- 
Free Languages," IEEE Trans. on Information Theory, 
March, 1992. 
5. Younes, L., "Maximum likelihood estimation for Gibb- 
sian fields," 1991. 
149 
