File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/97/w97-1014_evalu.xml
Size: 5,418 bytes
Last Modified: 2025-10-06 14:00:28
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1014"> <Title>Word Triggers and the EM Algorithm</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experimental results </SectionTitle>
<Paragraph position="0"> Language Model Training and Corpus. We first describe the details of the language model used and of its training. The trigger pairs were selected as described in Section 2 and were used to extend a baseline language model. As in many other systems, the baseline language model used here consists of two parts, an m-gram model (here: trigram/bigram/unigram) and a cache part (Ney and Essen, 1994). Since the cache effect is equivalent to self-trigger pairs ($a \rightarrow a$), we can expect some trade-off between the word triggers and the cache, which was confirmed in initial informal experiments.</Paragraph>
<Paragraph position="1"> For this reason, it is suitable to consider the simultaneous interpolation of these three language model parts to define the refined language model. Thus we have the following equation for the refined language model $p(w_n|h_n)$:</Paragraph>
<Paragraph position="2"> $$p(w_n|h_n) = \lambda_M \, p_M(w_n|h_n) + \lambda_C \, p_C(w_n|h_n) + \lambda_T \, p_T(w_n|h_n)$$ </Paragraph>
<Paragraph position="3"> where $p_M(w_n|h_n)$ is the m-gram model, $p_C(w_n|h_n)$ is the cache model and $p_T(w_n|h_n)$ is the trigger model. The three interpolation parameters must be normalized: $\lambda_M + \lambda_C + \lambda_T = 1$. The details of the m-gram model are similar to those given in (Ney and Generet, 1995). The cache model is the relative frequency of the word $w_n$ in the history $h_n = w_1, \ldots, w_{n-1}$:</Paragraph>
<Paragraph position="4"> $$p_C(w_n|h_n) = \frac{1}{n-1} \sum_{i=1}^{n-1} \delta(w_n, w_i)$$ </Paragraph>
<Paragraph position="7"> There were two methods used to compute the trigger parameters:
* method 'no EM': The trigger parameters $\alpha(w|v)$ are obtained by renormalization from the single trigger parameters $q(w|v)$:</Paragraph>
<Paragraph position="8"> $$\alpha(w|v) = \frac{q(w|v)}{\sum_{w'} q(w'|v)}$$ </Paragraph>
<Paragraph position="9"> The backing-off method described in Section 2.1 was used to select the top-K most significant single trigger pairs. In the experiments, we used</Paragraph>
<Paragraph position="11"> * method 'with EM': The trigger parameters $\alpha(w|v)$ are initialized by the 'no EM' values and re-estimated using the EM algorithm as described in Section 3. The typical number of iterations is 10.</Paragraph>
<Paragraph position="12"> The experimental tests were performed on the Wall Street Journal (WSJ) task (Paul and Baker, 1992) with a vocabulary size of 20,000 words. To train the m-gram language model and the interpolation parameters, we used three training corpora with sizes of 1, 5 and 39 million running words. However, the word trigger pairs were always selected and trained on the 39-million-word training corpus. In the experiments, the history $h_n$ was defined to start at the most recent article delimiter.</Paragraph>
<Paragraph position="13"> The interpolation parameters are trained by using the EM algorithm. In the case of the 'EM triggers', this is done jointly with the re-estimation of the trigger parameters $\alpha(w|v)$. To avoid overfitting of the interpolation parameters on the training corpus, which was used to train both the m-gram language model and the interpolation parameters, we applied the leaving-one-out technique.</Paragraph>
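<Paragraph position="14"> To make the interpolation and the EM training of the interpolation weights concrete, the following minimal Python sketch (an illustration only; the function names, the data layout and the perplexity helper are assumptions, not taken from the paper) re-estimates $\lambda_M$, $\lambda_C$, $\lambda_T$ from precomputed component probabilities and evaluates the perplexity of the resulting interpolated model:

import math

def em_interpolation_weights(component_probs, iterations=10):
    """EM re-estimation of the three interpolation weights.
    component_probs: a list of (p_M, p_C, p_T) tuples, one tuple per word
    position of the (leaving-one-out) training text."""
    lam = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]      # uniform initialisation
    for _ in range(iterations):
        counts = [0.0, 0.0, 0.0]
        for probs in component_probs:
            total = sum(l * p for l, p in zip(lam, probs))
            if total == 0.0:
                continue
            # E-step: posterior probability that each component generated the word
            for k in range(3):
                counts[k] += lam[k] * probs[k] / total
        # M-step: renormalise the expected counts to obtain the new weights
        norm = sum(counts)
        lam = [c / norm for c in counts]
    return lam

def perplexity(component_probs, lam):
    """Perplexity of the interpolated model on a test text."""
    log_prob = 0.0
    for probs in component_probs:
        log_prob += math.log(sum(l * p for l, p in zip(lam, probs)))
    return math.exp(-log_prob / len(component_probs))

Since the component probabilities can be precomputed per word position, each EM iteration is a single pass over the text.</Paragraph>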
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Examples of Trigger Pairs </SectionTitle>
<Paragraph position="0"> In Table 2 and Table 3 we present examples of selected trigger pairs for the two methods, no EM and EM. For a fixed triggering word v, we show the most significant triggered words w along with the trigger interaction parameter $\alpha(w|v)$ for both methods.</Paragraph>
<Paragraph position="1"> There are 8 triggering words v, for each of which we show the 15 triggered words w with the highest trigger parameter $\alpha(w|v)$. The triggered words w are sorted by the $\alpha(w|v)$ parameter. From the table it can be seen that for the no EM trigger pairs the trigger parameter $\alpha(w|v)$ varies only slightly over the triggered words w. This is different for the EM triggers, where the trigger parameters $\alpha(w|v)$ show a much larger variation. In addition, the probability mass of the EM-trained trigger pairs is much more concentrated on the first 15 triggered words.</Paragraph>
<Paragraph position="2"> The perplexity was computed on a test corpus of 325,000 words from the WSJ task. The results are shown in Table 1 for each of the three training corpora (1, 5 and 39 million words). For comparison purposes, the perplexities of the trigram model with and without cache are included. As can be seen from this table, the trigger model improves the perplexities under all conditions, and the EM triggers are consistently (although sometimes only slightly) better than the no EM triggers. There is an effect of the training corpus size: if the trigram model is already well trained, the trigger model does not help as much as for a less well trained trigram model. This observation is confirmed by part (b) of Table 1, which shows the EM-trained interpolation parameters. As the size of the training corpus decreases, the relative weight of the cache and trigger components increases. Furthermore, in the last row of Table 1 it can be seen that the relative weight of the trigger component increases after the EM training, which indicates that the parameters of our trigger model are successfully trained by this EM approach.</Paragraph>
</Section> </Section> </Paper>