<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1014">
  <Title>Word Triggers and the EM Algorithm</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Single-Trigger Model
</SectionTitle>
    <Paragraph position="0"> In this section, we review the basic model definition for single word trigger pairs as introduced in (Tillmann and Ney, 1996).</Paragraph>
    <Paragraph position="1"> We fix one trigger word pair (a → b) and define an extended model p_ab(w|h) with a trigger interaction parameter q(b|a). To pave the way for the following extensions, we consider the asymmetric model rather than the symmetric model as originally described in (Tillmann and Ney, 1996).</Paragraph>
    <Paragraph position="2"> Backing-Off As indicated by the results of several groups (Lau and Rosenfeld, 1993; Rosenfeld, 1994; Tillmann and Ney, 1996), word trigger pairs do not help much to predict the next word if there is already a good model based on specific contexts such as the trigram, bigram or cache.</Paragraph>
    <Paragraph position="3"> Therefore, we allow the trigger interaction a → b only if the probability p(b|h) of the reference model is not sufficiently high, i.e. if p(b|h) &lt; p_0 for a certain threshold p_0 (note that, by setting p_0 := 1.0, the trigger effect is used in all cases). Thus, we use the trigger effect only for the following subset of histories:</Paragraph>
    <Paragraph position="5"> In the experiments, we used p_0 := 1.5/W, where W = 20000 is the vocabulary size. We define the model p_ab(w|h) as an extension of the reference model p(w|h) by a backing-off technique (Katz, 1987):</Paragraph>
    <Paragraph position="7"> p_ab(w|h) = q(b|a) if h ∈ H_ab, w = b; [1 − q(b|a)] · p(w|h) / [1 − p(b|h)] if h ∈ H_ab, w ≠ b; p(w|h) if h ∉ H_ab. For a training corpus w_1...w_N, we consider the log-likelihood functions of both the extended model and the reference model p(w_n|h_n), where we define the history h_n := w_{n−M}^{n−1} = w_{n−M}...w_{n−2} w_{n−1}. For the difference F_ab − F_0 in the log-likelihoods of the extended language model p_ab(w|h) and the reference model p(w|h), we obtain:</Paragraph>
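The case distinction of the backing-off model can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the function names (`p_ab`, `reference_prob`) are assumptions, and the renormalization of the non-triggered words by 1/(1 − p(b|h)) is inferred from the requirement that the distribution sums to one.

```python
def p_ab(w, history, a, b, q, reference_prob, p0):
    """Backing-off extension of a reference model for the trigger pair (a -> b).

    The trigger fires only for histories h in H_ab, i.e. when a occurred in
    the history and the reference probability p(b|h) is below the threshold p0.
    """
    p_b = reference_prob(b, history)
    in_H_ab = (a in history) and (p_b < p0)
    if not in_H_ab:
        return reference_prob(w, history)       # no trigger effect
    if w == b:
        return q                                # trigger mass for the triggered word
    # redistribute the remaining mass so the probabilities still sum to one
    return (1.0 - q) * reference_prob(w, history) / (1.0 - p_b)
```

With a uniform reference model, one can check that the extended model remains a proper distribution whether or not the trigger fires.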
    <Paragraph position="9"> where we have used the usual counts N(h, w):</Paragraph>
    <Paragraph position="11"> and two additional counts N(a;b) and N̄(a;b) defined particularly for word trigger modeling:</Paragraph>
    <Paragraph position="13"> Tillmann &amp; Ney 118 Word Triggers and EM. Note that, for the counts N(a;b) and N̄(a;b), it does not matter how often the triggering word a actually occurred in the history h ∈ H_ab.</Paragraph>
    <Paragraph position="14"> The unknown trigger parameter q(b|a) is estimated using maximum likelihood estimation. By taking the derivative and setting it to zero, we obtain the estimate:</Paragraph>
    <Paragraph position="16"> which can be interpreted as the relative frequency of the occurrence of the word trigger (a → b).</Paragraph>
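The closed-form estimate q(b|a) = N(a;b) / N̄(a;b) can be computed by a single pass over the corpus. This sketch uses assumed names (`estimate_q`, window length `M`) and, for simplicity, ignores the restriction of the history set to H_ab via the threshold p_0:

```python
def estimate_q(corpus, a, b, M=10):
    """Relative frequency with which b follows a within a window of M words.

    n_trigger accumulates N(a;b): positions n with a in the history and w_n = b.
    n_total accumulates N_bar(a;b): all positions n with a in the history,
    counted once regardless of how often a occurred there.
    """
    n_trigger = 0
    n_total = 0
    for n, w in enumerate(corpus):
        history = corpus[max(0, n - M):n]
        if a in history:
            n_total += 1
            if w == b:
                n_trigger += 1
    return n_trigger / n_total if n_total > 0 else 0.0
```

On a toy corpus, the estimate is simply the fraction of a-in-history positions at which b actually occurs.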
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Linear Interpolation
</SectionTitle>
      <Paragraph position="0"> Although the backing-off method presented above results in a closed-form solution for the trigger parameter q(b|a), the disadvantage is that we have to use an explicit probability threshold p_0 to decide whether or not the trigger effect applies. Furthermore, the ultimate goal is to combine several word trigger pairs into a single model, and it is not clear how this could be done with the backing-off model.</Paragraph>
      <Paragraph position="1"> Therefore, we replace the backing-off model by the corresponding model for linear interpolation:</Paragraph>
      <Paragraph position="3"> p_ab(w|h) = [1 − q(b|a)] p(b|h) + q(b|a) if a ∈ h, w = b; [1 − q(b|a)] p(w|h) if a ∈ h, w ≠ b; p(w|h) if a ∉ h; equivalently, p_ab(w|h) = [1 − q(b|a)] p(w|h) + q(b|a) δ(w, b) for a ∈ h, where δ(w, v) = 1 if and only if v = w. Note that this interpolation model allows a smooth transition from no trigger effect (q(b|a) → 0) to a strong trigger effect (q(b|a) → 1).</Paragraph>
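The interpolation model is simpler to state in code than the backing-off model, since no renormalization is needed. A minimal sketch with assumed names (`p_ab_interp`, `reference_prob`):

```python
def p_ab_interp(w, history, a, b, q, reference_prob):
    """Linear-interpolation trigger model:
    p_ab(w|h) = (1 - q) * p(w|h) + q * delta(w, b)  if a in h,
    and p(w|h) otherwise.
    """
    p = reference_prob(w, history)
    if a not in history:
        return p
    return (1.0 - q) * p + q * (1.0 if w == b else 0.0)
```

Since the reference model sums to one and the point mass on b sums to one, any convex combination of the two is again a proper distribution, which is the smooth transition noted above.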
      <Paragraph position="4"> For a corpus w_1...w_n...w_N, we have the log-likelihood difference:</Paragraph>
      <Paragraph position="6"> The count N̄(a) counts the positions n with a ∈ h_n and is therefore different from the unigram count N(a).</Paragraph>
      <Paragraph position="7"> To apply maximum likelihood estimation, we take the derivative with respect to q(b|a) and obtain the following implicit equation for q(b|a) after some elementary manipulations:</Paragraph>
      <Paragraph position="9"> No explicit solution is possible. However, we can give bounds for the exact solution (proof omitted):</Paragraph>
      <Paragraph position="11"> and an additional count N(a; b):</Paragraph>
      <Paragraph position="13"> where the sum runs over all positions n with a ∈ h_n and b = w_n.</Paragraph>
      <Paragraph position="14"> An improved estimate can be obtained by the EM algorithm (Dempster et al., 1977):</Paragraph>
      <Paragraph position="16"> An example of the full derivation of the iteration formula for the EM algorithm will be given in the next section for the more general case of a multi-trigger language model.</Paragraph>
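The EM iteration for q(b|a) can be sketched by treating the interpolation model as a two-component mixture: with probability q the word b is emitted by the trigger, with probability 1 − q the word comes from the reference model. All names here are assumptions, and the update is the standard EM step for such a mixture rather than a formula taken from the paper:

```python
def em_update_q(q, events):
    """One EM step for q(b|a).

    events: one (p_ref, is_b) pair for every position n with a in the
    history, where p_ref = p(w_n|h_n) and is_b = (w_n == b).
    E-step: posterior that the trigger generated w_n (zero unless w_n = b).
    M-step: q is the average posterior over all such positions.
    """
    posterior_sum = 0.0
    for p_ref, is_b in events:
        if is_b:
            posterior_sum += q / ((1.0 - q) * p_ref + q)
    return posterior_sum / len(events)

def em_estimate_q(events, q0=0.5, iterations=200):
    q = q0
    for _ in range(iterations):
        q = em_update_q(q, events)
    return q
```

For a uniform reference probability of 0.25 with b observed at 2 of 6 qualifying positions, the iteration converges to the maximum-likelihood value q = 1/9, which can be verified directly from the derivative of the log-likelihood.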
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Multi-Trigger Model
</SectionTitle>
    <Paragraph position="0"> The trigger pairs are used in combination with a conventional baseline model p(w_n|h_n) (e.g. an m-gram model) to define a trigger model p_T(w_n|h_n):</Paragraph>
    <Paragraph position="2"> with the trigger parameters α(w|v), which must be normalized for each v:</Paragraph>
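The defining equation of the trigger model is lost in this extraction. As an assumed form for illustration only, the sketch below interpolates the baseline with the average of the trigger distributions α(w|v) over the triggering words in the history; the interpolation weight `lam` and all other names are assumptions:

```python
def p_trigger(w, history, baseline_prob, alpha, lam):
    """Assumed multi-trigger form:
    p_T(w|h) = (1 - lam) * p(w|h) + lam * (1/M_n) * sum_{v in h} alpha(w|v),
    where alpha is a dict mapping (w, v) to the trigger probability alpha(w|v).
    """
    if not history:
        return baseline_prob(w, history)
    trigger_mass = sum(alpha.get((w, v), 0.0) for v in history) / len(history)
    return (1.0 - lam) * baseline_prob(w, history) + lam * trigger_mass
```

Because each α(·|v) is normalized over w, the average over the history is again a distribution, so the combined model remains properly normalized.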
    <Paragraph position="4"> To simplify the notation, we have used the convention that a sum over m always runs over m ∈ M_n,</Paragraph>
    <Paragraph position="5"> with * M_n: the set of triggering words for position n * |M_n|: the number of triggering words for position n. Unfortunately, no method is known that produces closed-form solutions for the maximum-likelihood estimates. Therefore, we resort to the EM algorithm in order to obtain the maximum-likelihood estimates. The framework of the EM algorithm is based on the so-called Q(μ; μ̄) function, where μ̄ is the new estimate obtained from the previous estimate μ (Baum, 1972; Dempster et al., 1977). The symbol μ stands for the whole set of parameters to be estimated. The Q(μ; μ̄) function is an extension of the usual log-likelihood function and is for our model:</Paragraph>
    <Paragraph position="7"> Taking the partial derivatives and solving for the new estimates μ̄, we obtain the iteration formulas. When taking the partial derivatives with respect to ᾱ(w|v), we use the method of Lagrangian multipliers for the normalization constraints and obtain: ᾱ(w|v) = A(w, v) / Σ_{w'} A(w', v), with A(w, v):</Paragraph>
    <Paragraph position="9"> Note how the interaction of word triggers is taken into account by a local weighting effect: for a fixed position n with w_n = w, the contribution of a particular observed distant word pair (v...w) to ᾱ(w|v) depends on the interaction parameters of all other word pairs (v'...w) with v' ∈ h_n and the baseline probability p(w|h).</Paragraph>
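One EM step for the trigger parameters can be sketched as follows, assuming the interpolated multi-trigger form with weight `lam` (an assumption, as are all names here). The local weighting effect appears directly: each pair (v → w_n) is credited in proportion to its posterior share of p_T(w_n|h_n), so competing triggers v' of the same word and the baseline probability both reduce its accumulated count A(w, v):

```python
from collections import defaultdict

def em_step_alpha(corpus, baseline_prob, alpha, lam, M=10):
    """One EM iteration for alpha(w|v) under an assumed interpolated model
    p_T(w|h) = (1-lam) p(w|h) + lam * mean_{v in h} alpha(w|v)."""
    A = defaultdict(float)                      # accumulated posteriors A(w, v)
    for n, w in enumerate(corpus):
        history = corpus[max(0, n - M):n]
        if not history:
            continue
        trigger_mass = sum(alpha.get((w, v), 0.0) for v in history) / len(history)
        p_T = (1.0 - lam) * baseline_prob(w, history) + lam * trigger_mass
        for v in history:
            share = lam * alpha.get((w, v), 0.0) / len(history)
            if share > 0.0:
                A[(w, v)] += share / p_T        # posterior credit for (v -> w)
    # M-step: renormalize per triggering word v
    totals = defaultdict(float)
    for (w, v), val in A.items():
        totals[v] += val
    return {(w, v): val / totals[v] for (w, v), val in A.items()}
```

The normalization constraints Σ_w ᾱ(w|v) = 1 hold by construction after each step.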
    <Paragraph position="10"> Note that the local convergence property still holds when the length M_n of the history is dependent on the word position n, e.g. if the history reaches back only to the beginning of the current paragraph.</Paragraph>
    <Paragraph position="11"> A remark about the functional form of the multi-trigger model is in order. The form chosen in this paper is a sort of linear combination of the trigger pairs. A different approach is to combine the various trigger pairs in a multiplicative way, which results from a maximum-entropy approach (Lau and Rosenfeld, 1993).</Paragraph>
  </Section>
</Paper>