<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0714">
  <Title>Using Perfect Sampling in Parameter Estimation of a Whole Sentence Maximum Entropy Language Model*</Title>
  <Section position="3" start_page="0" end_page="79" type="metho">
    <SectionTitle>
2 Whole Sentence Maximum
Entropy Language Model
</SectionTitle>
    <Paragraph position="0"> An alternative to combining local, long-distance and structural information contained in the sentence, within the maximum entropy framework, is the Whole Sentence Maximum Entropy model (WSME) (Rosenfeld, 1997). The  WSME is based in the calculation of unrestricted ME probability p(w) of a whole sentence w = wl... Wn. The probability distribution is the distribution p that has the maximum entropy relative to a prior distribution P0 (in other words: the distribution that minimize de divergence D(pllpo)) (Della Pietra et al., 1995). The distribution p is given by: m . . p(w) = 5po(w)eE~=l ~,:~(w) (2) where Ai and f~ are the same as in (1). Z is a (global) normalization constant and P0 is a prior proposal distribution. The Ai and Z are unknown and must be learned.</Paragraph>
    <Paragraph position="1"> The parameters Ai may be interpreted as being weights of the features and could be learned using some type of iterative algorithm. We have used the Improved Iterative Scaling algorithm (IIS) (Berger et al., 1996). In each iteration of the IIS, we find a 5i value such that adding this value to Ai parameters, we obtain an increase in the the log-likelihood. The 5i values are obtained as the solution of the m equations:</Paragraph>
    <Paragraph position="3"> f~ is a training corpus. Because the domain of WSME is not restricted to a part of the sentence (context) as in the conditional case, it allows us to combine global structural syntactic information which is contained in the sentence with local and other kinds of long range information such us triggers. Furthermore, the WSME model is easier to train than the conditional one, because in the WSME model we don't need to estimate the normalization constant Z during the training time. In contrast, for each event (x, y) in the training corpus, we have to calculate Z(x) in each iteration of the MEC model.</Paragraph>
    <Paragraph position="4"> The main drawbacks of the WSME model are its integration with other modules and the calculation of the expected value in the left part of equation (3), because the event space is huge.</Paragraph>
    <Paragraph position="5"> Here we focus on the problem of calculating the expected value in (3). The first sum in (3) is the expected value of fie ~::#, and it is obviously not possible to sum over all the sentences. However, we can estimate the mean by using the empirical expected value: \[ fie~if# \] 1 M Z f/(sJ) (4) Ep k J j=l where sl,. * *, SM is a random sample from p(w). Once the parameters have been learned it is possible to estimate the value of the normalization constant, because Z = ~w e~l ~f~(W)p0(w ) = F m |e~i=l if~|, and it can be estimated 1 by L .I means of the sample mean with respect to P0 (Chen and Rosenfeld, 1999).</Paragraph>
    <Paragraph position="6"> In each iteration of IIS, the calculation of (4) requires sampling from a probability distribution which is partially known (Z is unknown), so the classical sampling techniques are not useful. In the literature, there are some methods like the MonteCarlo Markov Chain methods (MCMC) that generate random samples from p(w) (Sahu, 1997; Tierney, 1994). With the MCMC methods, we can simulate a sample approximately from the probability distribution and then use the sample to estimate the desired expected value in (4).</Paragraph>
  </Section>
  <Section position="4" start_page="79" end_page="80" type="metho">
    <SectionTitle>
3 Perfect Sampling
</SectionTitle>
    <Paragraph position="0"> In this paper, we propose the application of another sampling technique in the parameter estimation process of the WSME model which was introduced by Propp and Wilson (Propp and Wilson, 1996): the Perfect Sampling (PS). The PS method produces samples from the exact limit distribution and, thus, the sampling mean given in (4) is less biased than the one obtained with the MCMC methods. Therefore, we can obtain better estimations of the parameters Ai.</Paragraph>
    <Paragraph position="1"> In PS, we obtain a sample from the limit distribution of an ergodic Markov Chain X = {Xn; n _&gt; 0}, taking values in the state space S (in the WSME case, the state space is the set of possible sentences). Because of the ergodicity, if the transition law of X is P(x, A) := P(Xn E AIXn_i = x), then it has a limit distribution ~-, that is: if we start a path on the chain in any state at time n = 0, then as n ~ ~, Xn ~ ~'.</Paragraph>
    <Paragraph position="2"> The first algorithm of the family of PS was presented by Propp and Wilson (Propp and Wilson, 1996) under the name Coupling From the Past (CFP) and is as follows: start a path in  every state of S at some time (-T) in the past such that at time n = 0, all the paths collapse to a unique value (due to the ergodicity). This value is a sample element. In the majority of cases, the state space is huge, so attempting to begin a path in every state is not practical.</Paragraph>
    <Paragraph position="3"> Thus, we can define a partial stochastic order in the state space and so we only need start two paths: one in the minimum and one in the maximum. The two paths collapse at time n = 0 and the value of the coalescence state is a sample element of ~-. The CFP algorithm first determines the time T to start and then runs the two paths from time (-T) to 0. Information about PS methods may be consulted in (Corcoran and Tweedie, 1998; Propp and Wilson, 1998).</Paragraph>
  </Section>
</Paper>