<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1021"> <Title>IMPROVEMENTS IN STOCHASTIC LANGUAGE MODELING</Title> <Section position="7" start_page="108" end_page="108" type="evalu"> <SectionTitle> 3.6. Results and Discussion </SectionTitle> <Paragraph position="0"> We tested our combined model on a large collection of test sets, using perplexity reduction as our measure. A selection is given in table 2. Set WSJ-dev is the CSR development test set (70K words). Set BC-3 is the entire Brown Corpus, where the history was flushed arbitrarily every 3 sentences.</Paragraph> <Paragraph position="1"> Set BC-20 is the same as BC-3, but with history flushing every 20 sentences. Set RM is the 39K words used in training the Resource Management system, with no history flushing.</Paragraph> <Paragraph position="2"> The last result in table 2 was derived by training the trigram on only 1.2M words of WSJ data and testing on the WSJ development set. This was done to facilitate a more equitable comparison with the results reported in [5].</Paragraph> <Paragraph position="3"> [Table 2: static perplexity (PP), dynamic PP, and improvement of the combined model for several test sets.] Our biggest surprise was that &quot;self-triggering&quot; (trigger pairs of the form (A -&gt; A)) played a larger role than our utility measure indicated. Correlations of this type are an important special case, and are already captured by conventional cache-based models. We therefore adapted our model to this reality, and maintained a separate self-triggering model that was added as a third interpolation component (the results in table 2 already reflect this change). This independent component, although consisting of far fewer trigger pairs, was responsible for as much as half of the overall perplexity reduction. On tasks with a vastly different unigram behavior, such as the Resource Management data set, the self-triggering component accounted for most of the improvement. 
Why do self-triggering pairs have a higher impact than anticipated? One reason could be an inadequacy in our utility measure. Another could spring from the difference between training and testing. If the test set were statistically identical to the training set, the utility of every trigger pair would be exactly as predicted by our expected-utility measure. Since in reality the training and testing sets differ, the actual utility is lower than predicted. All trigger pairs suffer some degradation, except for the self-triggering ones. The latter hold their own because self-correlations are robust and are better maintained across different corpora. This explains why the self-triggering component is most dominant when the statistical difference between the training and testing data is greatest.</Paragraph> </Section> </Paper>
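The evaluation above rests on the perplexity of a linearly interpolated model (trigram plus trigger components). As a minimal sketch of that measurement, not the authors' implementation (all function names, component signatures, and weights here are illustrative assumptions):

```python
import math

def interpolated_prob(word, history, components, weights):
    # Linear interpolation: p(w|h) = sum_i lambda_i * p_i(w|h),
    # where the weights lambda_i are non-negative and sum to 1.
    return sum(lam * comp(word, history) for comp, lam in zip(components, weights))

def perplexity(words, components, weights):
    # Perplexity = exp of the average negative log-probability
    # the mixture assigns to each test word given its history.
    neg_log = 0.0
    for i, word in enumerate(words):
        p = interpolated_prob(word, words[:i], components, weights)
        neg_log -= math.log(p)
    return math.exp(neg_log / len(words))
```

In these terms, the paper's "static PP" would come from fixed components only, while "dynamic PP" adds the history-dependent trigger and self-trigger components.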
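A self-triggering pair (A -&gt; A) acts much like a conventional unigram cache: a word becomes more probable once it has already appeared in the current document's history. A minimal sketch under that assumption (illustrative names and interpolation weight, not the paper's actual self-triggering model):

```python
from collections import Counter

def cache_prob(word, history, static_prob, lam=0.9):
    # Interpolate a static estimate with a unigram cache built from the
    # document history seen so far. Flushing the history every k sentences
    # (as in the BC-3 and BC-20 test sets) corresponds to emptying `history`.
    cache = Counter(history)
    p_cache = cache[word] / len(history) if history else 0.0
    return lam * static_prob + (1.0 - lam) * p_cache
```

Because the cache estimate depends only on the word's own recent frequency, it is robust to mismatch between training and test corpora, which is consistent with the discussion above.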