<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1041">
  <Title>An Empirical Study of Smoothing Techniques for Language Modeling</Title>
  <Section position="7" start_page="313" end_page="313" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> In Figure 2, we display the performance of the interp-baseline method for bigram and trigram models on TIPSTER, Brown, and the WSJ subset of TIPSTER. In Figures 3-6, we display the relative performance of various smoothing techniques with respect to the baseline method on these corpora, as measured by difference in entropy. In the graphs on the left of Figures 2-4, each point represents an average over ten runs; the error bars represent the empirical standard deviation over these runs. Due to resource limitations, we only performed multiple runs for data sets of 50,000 sentences or less. Each point on the graphs on the right represents a single run, but we consider sizes up to the amount of data available. The graphs on the bottom of Figures 3-4 are close-ups of the graphs above, focusing on those algorithms that perform better than the baseline. To give an idea of how these cross-entropy differences translate to perplexity, each 0.014 bits correspond roughly to a 1% change in perplexity.</Paragraph>
    <Paragraph position="1"> In each run except as noted below, optimal values for the parameters of the given technique were searched for using Powell's search algorithm as realized in Numerical Recipes in C (Press et al., 1988, pp. 309-317). Parameters were chosen to optimize the cross-entropy of one of the development test sets associated with the given training set. To constrain the search, we searched only those parameters that were found to affect performance significantly, as verified through preliminary experiments over several data sizes. For katz and church-gale, we did not perform the parameter search for training sets over 50,000 sentences due to resource constraints, and instead manually extrapolated parameter val- null ods in terms of lines of C++ code ues from optimal values found on smaller data sizes.</Paragraph>
    <Paragraph position="2"> We ran interp-del-int only on sizes up to 50,000 sentences due to time constraints.</Paragraph>
    <Paragraph position="3"> From these graphs, we see that additive smoothing performs poorly and that methods katz and interp-held-out consistently perform well. Our implementation church-gale performs poorly except on large bigram training sets, where it performs the best. The novel methods new-avg-count and new-one-count perform well uniformly across training data sizes, and are superior for trigram models. Notice that while performance is relatively consistent across corpora, it varies widely with respect to training set size and n-gram order.</Paragraph>
    <Paragraph position="4"> The method interp-del-int performs significantly worse than interp-held-out, though they differ only in the data used to train the A's. However, we delete one word at a time in interp-del-int; we hypothesize that deleting larger chunks would lead to more similar performance.</Paragraph>
    <Paragraph position="5"> In Figure 7, we show how the values of the parameters 6 and Cmin affect the performance of methods katz and new-avg-count, respectively, over several training data sizes. Notice that poor parameter setting can lead to very significant losses in performance, and that optimal parameter settings depend on training set size.</Paragraph>
    <Paragraph position="6"> To give an informal estimate of the difficulty of implementation of each method, in Table 1 we display the number of lines of C++ code in each implementation excluding the core code common across techniques.</Paragraph>
  </Section>
class="xml-element"></Paper>