In: Proceedings of CoNLL-2000 and LLL-2000, pages 79-82, Lisbon, Portugal, 2000. 
Using Perfect Sampling in Parameter Estimation of a Whole 
Sentence Maximum Entropy Language Model* 
F. Amaya† and J. M. Benedí
Departamento de Sistemas Informáticos y Computación
Universidad Politécnica de Valencia
Camino de Vera s/n, 46022-Valencia (Spain)
{famaya, jbenedi}@dsic.upv.es 
Abstract 
The Maximum Entropy principle (ME) is an ap- 
propriate framework for combining information 
of a diverse nature from several sources into the 
same language model. In order to incorporate
long-distance information into the ME frame- 
work in a language model, a Whole Sentence
Maximum Entropy Language Model (WSME) 
could be used. Until now, MonteCarlo Markov
Chain (MCMC) sampling techniques have been
used to estimate the parameters of the WSME
model. In this paper, we propose the applica- 
tion of another sampling technique: the Perfect 
Sampling (PS). The experiment has shown a re- 
duction of 30% in the perplexity of the WSME 
model over the trigram model and a reduc- 
tion of 2% over the WSME model trained with 
MCMC. 
1 Introduction 
The language modeling problem may be defined
as the problem of calculating the probability of 
a string, $p(w) = p(w_1, \ldots, w_n)$. The probability
p(w) is usually calculated via conditional prob- 
abilities. The n-gram model is one of the most 
widely used language models. The power of the
n-gram model resides in its simple formulation 
and the ease of training. On the other hand, n- 
grams only take into account local information, 
and important long-distance information con- 
tained in the string $w_1 \ldots w_n$ cannot be modeled
by it. In an attempt to supplement the local in- 
formation with long-distance information, hy- 
brid models have been proposed, such as (Belle-
garda, 1998; Chelba, 1998; Benedí and Sánchez,
2000).

* This work has been partially supported by the Spanish
CYCIT under contract (TIC98/0423-C06).
† Granted by Universidad del Cauca, Popayán (Colombia).
The Maximum Entropy principle is an ap- 
propriate framework for combining information 
of a diverse nature from several sources into 
the same model: the Maximum Entropy model 
(ME) (Rosenfeld, 1996). The information is in- 
corporated as features which are submitted to 
constraints. The conditional form of the ME 
model is: 
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left\{\sum_{i=1}^{m} \lambda_i f_i(x, y)\right\} \qquad (1)$$
where $\lambda_i$ are the parameters to be learned (one
for each feature), the $f_i$ are usually characteris-
tic functions which are associated to the fea-
tures, and $Z(x) = \sum_{y} \exp\{\sum_{i=1}^{m} \lambda_i f_i(x, y)\}$ is
the normalization constant. The main advan- 
tages of ME are its flexibility (local and global 
information can be included in the model) and 
its simplicity. The drawbacks are that the pa-
rameter estimation is computationally expen-
sive, especially the evaluation of the normaliza-
tion constant $Z(x)$, and that the grammatical
information contained in the sentence is poorly 
encoded in the conditional framework. This is 
due to the assumption of independence in the 
conditional events: in the events in the state 
space, only a part of the information contained 
in the sentence influences the calculation of the
probability (Ristad, 1998). 
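As a concrete illustration of equation (1), the following minimal sketch computes a conditional ME probability for a small, enumerable output set; the function names, the feature functions and the output set `y_space` are our own assumptions, not part of the cited models.

```python
import math

# Hypothetical sketch of the conditional ME model of equation (1).
def conditional_me_prob(x, y, lambdas, feature_fns, y_space):
    """p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x), with Z(x) summed over y_space."""
    def score(xx, yy):
        return sum(l * f(xx, yy) for l, f in zip(lambdas, feature_fns))
    z_x = sum(math.exp(score(x, y2)) for y2 in y_space)  # normalization constant Z(x)
    return math.exp(score(x, y)) / z_x
```

Note that the sum defining $Z(x)$ has to be recomputed for every context $x$, which is precisely what makes the conditional parameter estimation expensive.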
2 Whole Sentence Maximum 
Entropy Language Model 
An alternative to combining local, long-distance 
and structural information contained in the 
sentence, within the maximum entropy frame- 
work, is the Whole Sentence Maximum En- 
tropy model (WSME) (Rosenfeld, 1997). The 
WSME is based on the calculation of the unre-
stricted ME probability $p(w)$ of a whole sen-
tence $w = w_1 \ldots w_n$. The probability distribu-
tion is the distribution p that has the maximum 
entropy relative to a prior distribution $p_0$ (in
other words, the distribution that minimizes the
divergence $D(p \parallel p_0)$) (Della Pietra et al., 1995).
The distribution p is given by: 
$$p(w) = \frac{1}{Z}\, p_0(w)\, e^{\sum_{i=1}^{m} \lambda_i f_i(w)} \qquad (2)$$
where $\lambda_i$ and $f_i$ are the same as in (1), $Z$ is
a (global) normalization constant and $p_0$ is a
prior proposal distribution. The $\lambda_i$ and $Z$ are
unknown and must be learned. 
The parameters $\lambda_i$ may be interpreted as be-
ing weights of the features and could be learned 
using some type of iterative algorithm. We have 
used the Improved Iterative Scaling algorithm 
(IIS) (Berger et al., 1996). In each iteration of 
the IIS, we find a value $\delta_i$ such that, adding this
value to the $\lambda_i$ parameters, we obtain an increase
in the log-likelihood. The $\delta_i$ values are ob-
tained as the solution of the m equations: 
$$\sum_{w} p(w)\, f_i(w)\, e^{\delta_i f^{\#}(w)} \;-\; \frac{1}{|\Omega|} \sum_{w \in \Omega} f_i(w) \;=\; 0 \qquad (3)$$
where $i = 1, \ldots, m$, $f^{\#}(w) = \sum_{i=1}^{m} f_i(w)$ and
$\Omega$ is the training corpus. Because the domain of
WSME is not restricted to a part of the sen- 
tence (context) as in the conditional case, it 
allows us to combine global structural syntac- 
tic information which is contained in the sen- 
tence with local and other kinds of long range 
information such as triggers. Furthermore, the
WSME model is easier to train than the con- 
ditional one, because in the WSME model we 
don't need to estimate the normalization con- 
stant Z during the training time. In contrast, 
for each event (x, y) in the training corpus, we 
have to calculate $Z(x)$ in each iteration of the
conditional ME model.
The main drawbacks of the WSME model are 
its integration with other modules and the cal- 
culation of the expected value in the left part of 
equation (3), because the event space is huge. 
Here we focus on the problem of calculating 
the expected value in (3). The first sum in (3) 
is the expected value of $f_i e^{\delta_i f^{\#}}$, and it is obvi-
ously not possible to sum over all the sentences.
However, we can estimate the mean by using
the empirical expected value:
$$E_p\left[f_i\, e^{\delta_i f^{\#}}\right] \approx \frac{1}{M} \sum_{j=1}^{M} f_i(s_j)\, e^{\delta_i f^{\#}(s_j)} \qquad (4)$$
where $s_1, \ldots, s_M$ is a random sample from $p(w)$.
Once the parameters have been learned, it is pos-
sible to estimate the value of the normalization
constant, because $Z = \sum_{w} e^{\sum_{i=1}^{m} \lambda_i f_i(w)}\, p_0(w) = E_{p_0}\!\left[e^{\sum_{i=1}^{m} \lambda_i f_i}\right]$,
and it can be estimated by means of the sample
mean with respect to $p_0$ (Chen and Rosenfeld, 1999).
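As an illustration of this estimation step, the sketch below computes the sample-mean estimate of the left-hand side of equation (3), solves for each $\delta_i$ with Newton's method, and estimates $Z$ as a sample mean under $p_0$; the feature-matrix layout and function names are our own assumptions, not code from the paper.

```python
import numpy as np

# Hypothetical sketch: one IIS round using the sample-mean estimate of equation (4).
def iis_deltas(F_sample, F_train, tol=1e-8, max_iter=50):
    """F_sample[j, i] = f_i(s_j) for sentences s_1..s_M drawn (approximately) from p(w).
    F_train[k, i]  = f_i(w_k) for the sentences w_k of the training corpus Omega.
    Returns the updates delta_i to be added to the current lambda_i."""
    f_sharp = F_sample.sum(axis=1)            # f#(s_j) = sum_i f_i(s_j)
    target = F_train.mean(axis=0)             # (1/|Omega|) sum_{w in Omega} f_i(w)
    deltas = np.zeros(F_sample.shape[1])
    for i in range(F_sample.shape[1]):
        d = 0.0
        for _ in range(max_iter):             # Newton's method on the sampled version of (3)
            w = F_sample[:, i] * np.exp(d * f_sharp)
            g = w.mean() - target[i]          # estimate of E_p[f_i e^{d f#}] minus the target
            g_prime = (w * f_sharp).mean()
            if abs(g) < tol or g_prime == 0.0:
                break
            d -= g / g_prime
        deltas[i] = d
    return deltas

def estimate_Z(lambdas, F_prior):
    """Estimate Z as the sample mean of exp(sum_i lambda_i f_i) over sentences drawn from p0."""
    return float(np.mean(np.exp(F_prior @ lambdas)))
```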
In each iteration of IIS, the calculation of (4) 
requires sampling from a probability distribu- 
tion which is partially known (Z is unknown), 
so the classical sampling techniques are not use- 
ful. In the literature, there are some meth- 
ods like the MonteCarlo Markov Chain meth- 
ods (MCMC) that generate random samples 
from p(w) (Sahu, 1997; Tierney, 1994). With 
the MCMC methods, we can simulate a sample 
approximately from the probability distribution 
and then use the sample to estimate the desired 
expected value in (4). 
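A minimal sketch of one such MCMC sampler is the Independence Metropolis-Hastings chain (the IMH algorithm used in the experiments of Section 4) with the prior $p_0$ as proposal: the acceptance ratio then reduces to $\exp\{\sum_i \lambda_i (f_i(w') - f_i(w))\}$, so neither $Z$ nor $p_0(w)$ needs to be evaluated. The helper functions below are hypothetical.

```python
import math
import random

def imh_sample(lambdas, sample_from_p0, features, n_samples, n_burn=100):
    """Independence Metropolis-Hastings targeting p(w) ~ p0(w) * exp(sum_i lambda_i f_i(w)).

    sample_from_p0 : draws a sentence from the prior proposal p0 (e.g. a trigram model)
    features       : maps a sentence to its feature vector [f_1(w), ..., f_m(w)]
    """
    w = sample_from_p0()
    score = sum(l * f for l, f in zip(lambdas, features(w)))       # sum_i lambda_i f_i(w)
    samples = []
    for step in range(n_burn + n_samples):
        w_new = sample_from_p0()                                   # independent proposal from p0
        score_new = sum(l * f for l, f in zip(lambdas, features(w_new)))
        if random.random() < math.exp(min(0.0, score_new - score)):  # accept with prob min(1, e^diff)
            w, score = w_new, score_new
        if step >= n_burn:
            samples.append(w)
    return samples
```

The burn-in only makes the retained sample approximately stationary, which is exactly the residual bias that Perfect Sampling removes.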
3 Perfect Sampling 
In this paper, we propose the application of an-
other sampling technique, introduced by Propp
and Wilson (Propp and Wilson, 1996), in the
parameter estimation process of the WSME
model: Perfect Sampling (PS). The
PS method produces samples from the exact 
limit distribution and, thus, the sampling mean 
given in (4) is less biased than the one obtained 
with the MCMC methods. Therefore, we can 
obtain better estimations of the parameters Ai. 
In PS, we obtain a sample from the limit 
distribution of an ergodic Markov chain
$X = \{X_n;\, n \geq 0\}$ taking values in the state
space $S$ (in the WSME case, the state space is
the set of possible sentences). Because of the
ergodicity, if the transition law of $X$ is
$P(x, A) := P(X_n \in A \mid X_{n-1} = x)$, then it has a
limit distribution $\pi$; that is, if we start a path
on the chain in any state at time $n = 0$, then as
$n \to \infty$, $X_n \sim \pi$.
The first algorithm of the family of PS was pre- 
sented by Propp and Wilson (Propp and Wil- 
son, 1996) under the name Coupling From the 
Past (CFP) and is as follows: start a path in 
every state of S at some time (-T) in the past 
such that at time n = 0, all the paths collapse 
to a unique value (due to the ergodicity). This 
value is a sample element. In the majority of 
cases, the state space is huge, so attempting 
to begin a path in every state is not practical. 
Thus, we can define a partial stochastic order 
in the state space, so that we only need to start
two paths: one in the minimum and one in the
maximum. The two paths collapse at time $n = 0$,
and the value of the coalescence state is a sample
element of $\pi$. The CFP algorithm first deter-
mines the time T to start and then runs the two 
paths from time $(-T)$ to 0. More information
about PS methods can be found in (Corcoran and
Tweedie, 1998; Propp and Wilson, 1998).
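To make the algorithm concrete, here is a minimal sketch of CFP in its impractical "start a path in every state" form, for a small finite chain with an explicit transition matrix (something that is of course not available for the space of sentences; the monotone variant and the Cai algorithm used in Section 4 avoid this enumeration):

```python
import numpy as np

def cftp_sample(P, rng):
    """Coupling From the Past for a small finite ergodic chain.

    P   : (K, K) transition matrix, P[x, y] = Pr(next = y | current = x)
    rng : a numpy random Generator
    Returns one exact draw from the stationary distribution of P.
    """
    K = P.shape[0]
    cdf = np.cumsum(P, axis=1)               # per-state inverse-CDF update rule
    uniforms = []                             # U for times -1, -2, ... (reused as T grows)
    T = 1
    while True:
        while len(uniforms) < T:              # extend the shared random stream into the past
            uniforms.append(rng.random())
        states = np.arange(K)                 # one coupled path started in every state
        for t in range(T, 0, -1):             # steps at times -T, -T+1, ..., -1
            u = uniforms[t - 1]               # uniforms[0] drives the step at time -1
            states = np.array([min(int(np.searchsorted(cdf[x], u)), K - 1) for x in states])
        if np.all(states == states[0]):       # all paths have coalesced by time 0
            return int(states[0])
        T *= 2                                # otherwise, go further into the past and retry
```

Reusing the same uniforms for the times already generated when $T$ is doubled is essential: it is what makes the returned state an exact draw from $\pi$ rather than an approximation.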
4 Experimental work 
In this work, we have made preliminary exper- 
iments using PS in the estimation of the ex- 
pected value (4) during the learning of the pa- 
rameters of a WSME model. We have imple- 
mented the Cai algorithm (Cai, 1999) to obtain 
perfect samples. The Cai algorithm has the ad- 
vantage that it doesn't need the definition of the 
partial order. 
The experiments were carried out using a 
pseudonatural corpus: "the traveler task".¹
The traveler task consists of dialogs between
travelers and hotel clerks. The size of the vocab- 
ulary is 693 words. The training set has 490,000 
sentences and 4,748,690 words. The test set has 
10,000 sentences and 97,153 words. 
Three kinds of features were used in the 
WSME model: n-grams (1-grams, 2-grams, 3- 
grams), distance 2 n-grams (d2-2-grams, d2-3- 
grams) and triggers. The proposal prior distri- 
bution used was a trigram model. 
We trained WSME models with different sets 
of features using the two sampling techniques: 
MCMC and PS. We measured the perplexity 
(PP) of each of the models and obtained the 
percentage of improvement in the PP with re- 
spect to a trigram base-line model (see Table 1).
The first model used MCMC techniques (specif-
ically the Independence Metropolis-Hastings al-
gorithm (IMH)²) and features of n-grams and
distance 2 n-grams. The second model used the
PS algorithm and features of n-grams and dis-
tance 2 n-grams. The third model used the IMH
algorithm and features of triggers. The fourth
used PS and features of triggers. Finally, in or-
der to compare with the classical methods, we
included the trigram base-line model.

¹ EuTrans ESPRIT-LTR Project 20268.
² IMH has been reported recently as the most useful
MCMC algorithm used in the WSME training process.

Method     PP        % Improvement
IMH        3.37115   28
PS         3.46336   26
IMH-T      3.37198   28
PS-T       3.26964   30
Trigram    4.66975   --

Table 1: Test set perplexity of the WSME
model over the traveler task corpus: IMH with
features of n-grams and d-n-grams (IMH), PS
with n-grams and d-n-grams (PS), IMH with
triggers (IMH-T), PS with triggers (PS-T). The
base-line model is a trigram model (Trigram).
In all cases, the WSME had a better perfor- 
mance than the n-gram model. From the results 
in Table 1, we see that the use of trigger fea-
tures improves the performance of the model
more than the use of n-gram features. This may
be due to the correlation between the triggers
and the n-grams: the n-gram information has
been absorbed by the prior distribution, which
diminishes the effect of the n-gram features.
We believe this is the reason why PS-T in Ta-
ble 1 is better than PS. We also see that IMH
and IMH-T show the same improvement, i.e.
the use of triggers does not seem to improve the
perplexity of the model; but this may be due
to the sampling technique: the parameter val-
ues depend on the estimation of an expected
value, and that estimation depends on the sam-
pling. Finally, PS-T has better perplexity
than IMH-T. The only difference between the
two is the sampling technique, and neither of
them has the correlation influence in the fea-
tures, so we think that the improvement may
be due to the sampling technique.
5 Conclusion and future work
We have presented a different approach to the 
sampling step needed in the parameter estima- 
tion of a WSME model. Using this technique, 
we have obtained a reduction of 30% in the per- 
plexity of the WSME model over the base-line 
trigram model and an improvement of 2% over 
the model trained with MCMC techniques. We 
are extending our experiments to a larger cor-
pus, the Wall Street Journal corpus, using a
more general set of features, including
features that reflect the global structure of the 
sentence. 
We are working on introducing the grammat-
ical information contained in the sentence into
the model; we believe that such information will
significantly improve the quality of the model.

References 
J. R. Bellegarda. 1998. A multispan language
modeling framework for large vocabulary speech 
recognition. IEEE Transactions on Speech and 
Audio Processing, 6 (5):456-467. 
J.M. Benedí and J.A. Sánchez. 2000. Combination
of n-grams and stochastic context-free grammars 
for language modeling. International Conference
on Computational Linguistics (COLING-ACL).
A.L. Berger, V.J. Della Pietra, and S.A. Della 
Pietra. 1996. A Maximum Entropy approach to 
natural language processing. Computational Lin-
guistics, 22(1):39-72. 
H. Cai. 1999. Exact Sampling using auxiliary vari- 
ables. Statistical Computing Section, ASA Pro- 
ceedings. 
C. Chelba. 1998. A structured Language Model. 
PhD Dissertation Proposal, The Johns Hopkins 
University. 
S. Chen and R. Rosenfeld. 1999. Efficient sampling 
and feature selection in whole sentence maximum 
entropy language models. Proc. IEEE Int. Con-
ference on Acoustics, Speech and Signal Process- 
ing (ICASSP). 
J.N. Corcoran and R.L. Tweedie. 1998. Perfect sam- 
pling for Independent Metropolis-Hastings chains. 
preprint. Colorado State University. 
S. Della Pietra, V. Della Pietra, and J. Lafferty. 
1995. Inducing features of random fields. Tech- 
nical Report CMU-CS-95-144, Carnegie Mellon 
University. 
J. G. Propp and D. B. Wilson. 1996. Exact sampling 
with coupled Markov chains and applications to
statistical mechanics. Random Structures and Al- 
gorithms, 9:223-252. 
J. A. Propp and D. B. Wilson. 1998. Coupling from 
the Past: User's Guide. Dimacs series in discrete 
Mathematics and Theoretical Computer Science, 
pages 181-192. 
E. S. Ristad. 1998. Maximum Entropy Modeling
Toolkit, Version 1.6 Beta. 
R. Rosenfeld. 1996. A Maximum Entropy approach 
to adaptive statistical language modeling. Com-
puter Speech and Language, 10:187-228. 
R. Rosenfeld. 1997. A whole sentence Maximum En- 
tropy language model. IEEE Workshop on Speech
Recognition and Understanding. 
S. Sahu. 1997. Bayesian data analysis. Technical re- 
port, School of Mathematics, University of Wales.
L. Tierney. 1994. Markov chains for exploring pos- 
terior distributions. The Annals of Statistics,
22:1701-1762. 
