<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1003"> <Title>An Incremental Decision List Learner</Title> <Section position="6" start_page="2" end_page="2" type="concl"> <SectionTitle> 4 Experimental Results and Discussion </SectionTitle>
<Paragraph position="0"> In this section, we give experimental results showing that our new algorithm substantially outperforms the standard algorithm. We also show that while the accuracy of the decision list algorithms is competitive with TBL, two linear classifiers are more accurate than any of the decision list algorithms.</Paragraph>
<Paragraph position="1"> Many of the problems that probabilistic decision list algorithms have been used for are very similar: in a given text context, determine which of two choices is most appropriate. Accent restoration (Yarowsky, 1994), word sense disambiguation (Yarowsky, 2000), and other problems all fall into this framework, and they typically use similar feature types. We therefore chose one problem of this type, grammar checking, and believe that our results should carry over at least to these other, closely related problems. In particular, we used exactly the same training data, test data, problems, and feature sets as Banko and Brill (2001a; 2001b). These problems consist of trying to guess which of two confusable words, e.g. "their" or "there", a user intended.</Paragraph>
<Paragraph position="2"> Banko and Brill chose this data to be representative of typical machine learning problems, and, because it is tried across data sizes and different pairs of words, it exhibits a wide range of behaviors. Banko and Brill used a standard set of features, including words within a window of 2, part-of-speech tags within a window of 2, pairs of word or tag features, and whether or not a given word occurred within a window of 9. Altogether, they had 55 feature types. They used all features of each type that occurred at least twice in the training data.</Paragraph>
<Paragraph position="3"> We ran our comparisons using 7 different algorithms. The first three were variations on the standard probabilistic decision list learner. First, we ran the standard sorted decision list learner, equivalent to the algorithm of Figure 3 with a threshold of negative infinity. That is, we included all rules whose predicted entropy was at least as good as the unigram distribution, whether or not they would actually improve entropy on the training data. We call this "Sorted: −∞". Next, we ran the same learner with a threshold of 0 ("Sorted: 0"): that is, we included only those rules that would also actually improve entropy on the training data. Then we ran the algorithm with a threshold of 3 ("Sorted: 3"), in an attempt to avoid overfitting. Next, we ran our incremental algorithm, again with a threshold of reducing training entropy by at least 3 bits. (A simplified sketch of this kind of thresholded sorted learner is given below.)</Paragraph>
<Paragraph position="4"> In addition to comparing the various decision list algorithms, we also tried several other algorithms.</Paragraph>
<Paragraph position="5"> First, since probabilistic decision lists are probabilistic analogs of TBLs, we compared to TBL (Brill, 1995). Furthermore, after doing our research on decision lists, we had several successes using simple linear models, such as a perceptron model and a maximum entropy (maxent) model (Chen and Rosenfeld, 1999).</Paragraph>
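To make the thresholded sorted learner concrete, here is a minimal Python sketch. It is not a reproduction of the paper's Figure 3: the smoothing backoff, the way the threshold is applied, and the data structures are simplified assumptions chosen for illustration; only the overall shape (score each rule against the unigram distribution, sort, keep rules above a threshold, fall back to the unigram default) follows the description in the text.

```python
import math

def smoothed(counts, discount=0.7):
    """Two-outcome distribution with absolute discounting and a uniform backoff.
    The 0.7 discount matches the tuned value reported in the text; the uniform
    backoff is an illustrative assumption."""
    total = sum(counts)
    redistributed = sum(min(c, discount) for c in counts) / total
    return [max(c - discount, 0.0) / total + redistributed / len(counts)
            for c in counts]

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def first_match(rules, x):
    for question, dist in rules:
        if question(x):
            return dist
    return None

def train_sorted_decision_list(data, questions, threshold, discount=0.7):
    """Sketch of a thresholded sorted probabilistic decision list learner.

    data:      list of (x, y) pairs with y in {0, 1} (e.g. "their" vs. "there")
    questions: candidate predicates q(x) -> bool
    threshold: minimum total reduction in training log-loss (in bits) a rule
               must give to be kept; float("-inf"), 0, and 3 mimic the
               "Sorted" variants described in the text.
    """
    # Unigram (default) distribution and its entropy.
    prior = smoothed([sum(1 for _, y in data if y == c) for c in (0, 1)], discount)
    prior_entropy = entropy(prior)

    # Score every candidate rule by the entropy of its smoothed distribution.
    candidates = []
    for q in questions:
        matched = [y for x, y in data if q(x)]
        if len(matched) < 2:                      # feature must occur at least twice
            continue
        dist = smoothed([matched.count(0), matched.count(1)], discount)
        if entropy(dist) <= prior_entropy:        # at least as good as the unigram
            candidates.append((entropy(dist), q, dist))
    candidates.sort(key=lambda c: c[0])           # best (lowest entropy) rules first

    # Keep a rule only if it improves training log-loss by at least `threshold`
    # bits on the examples not already covered by earlier (better) rules.
    kept = []
    for _, q, dist in candidates:
        gain = sum(math.log2(dist[y]) - math.log2(prior[y])
                   for x, y in data
                   if q(x) and first_match(kept, x) is None)
        if gain >= threshold:
            kept.append((q, dist))
    kept.append((lambda x: True, prior))          # final default rule
    return kept
```

Calling train_sorted_decision_list with threshold set to float("-inf"), 0, or 3 corresponds roughly to the "Sorted: −∞", "Sorted: 0", and "Sorted: 3" variants; the paper's incremental learner differs in how it selects and re-scores rules as the list grows and is not sketched here.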
<Paragraph position="6"> For the perceptron algorithm, we used a variation that includes a margin requirement, t (Zaragoza and Herbrich, 2000). Figure 4 shows this incredibly simple algorithm. We use q(x_j) to represent the vector of answers to questions about input x_j; the target output is -1 or +1; and t is the margin. We assume that one of the questions is always TRUE, eliminating the need for a separate threshold variable. When t = 0, the algorithm reduces to the standard perceptron algorithm. The inclusion of a non-zero margin, together with running to convergence, guarantees for separable data a solution that works nearly as well as a linear support vector machine (Krauth and Mézard, 1987). Given the extreme simplicity of the algorithm and the fact that it works so well (not just compared to the algorithms in this paper, but compared to several others we have tried), the perceptron with margin is our favorite algorithm when we don't need probabilities and model size is not an issue. (An illustrative sketch appears below, after the results discussion.)</Paragraph>
<Paragraph position="7"> Most of our algorithms have one or more parameters that need to be tuned. We chose 5 additional confusable word pairs for parameter tuning and picked parameter values that worked well on entropy and error rate across data sizes, as measured on these 5 additional word pairs. For the smoothing discount value we used 0.7. For the thresholds of both the sorted and the incremental learner, we used 3 bits. For the perceptron algorithm, we set t to 20. For TBL's minimum number of errors to fix, the traditional value of 2 worked well. For smoothing the maxent model, we used a Gaussian prior with mean 0 and variance 0.3. Since one learning algorithm is sometimes better at one size and worse at another, we tried three training sizes: 1, 10, and 50 million words.</Paragraph>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> Training Sizes </SectionTitle>
<Paragraph position="0"> In Figure 5, we show the error rates of each algorithm at different training sizes, averaged across the 10 confusable word pairs in the test set. We computed the geometric mean of the error rate across the ten word pairs; we chose the geometric mean because otherwise the words with the largest error rates would disproportionately dominate the results (a brief note on this appears below). Figure 6 shows the geometric mean of the model sizes, where the model size is the number of rules. For maxent and perceptron models, we counted size as the total number of features, since these models store a value for every feature.</Paragraph>
<Paragraph position="1"> For Sorted: −∞ and Sorted: 0, the size is similar to that of a maxent or perceptron model: almost every rule is used. Sorted: 3 drastically reduces the model size, by a factor of roughly 20, while improving performance. Incremental: 3 is smaller still, by about an additional factor of 2 to 5, although its accuracy is slightly worse than Sorted: 3. Figure 7 shows the entropy of each algorithm. Since entropy is logarithmic, we use the arithmetic mean.</Paragraph>
<Paragraph position="2"> Notice that the traditional probabilistic decision list learning algorithm, equivalent to Sorted: −∞, always has a higher error rate, higher entropy, and larger size than Sorted: 0. Similarly, Sorted: 3 has lower entropy, higher accuracy, and smaller models than Sorted: 0. Finally, Incremental: 3 has slightly higher error rates but slightly lower entropies than Sorted: 3, and 1/2 to 1/5 as many rules. If one wants a probabilistic decision list learner, this is clearly the algorithm to use.</Paragraph>
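The perceptron with margin described above is simple enough to sketch in a few lines. This is an illustration under stated assumptions rather than a reproduction of Figure 4: the feature vectors are assumed to be NumPy 0/1 arrays whose first entry is the always-TRUE question, the labels are +1/-1 for the two confusable words, and the epoch cap is an added safeguard for non-separable data.

```python
import numpy as np

def train_margin_perceptron(examples, num_features, margin=20.0, max_epochs=100):
    """Perceptron with a margin requirement (illustrative sketch).

    examples: list of (q, y), where q is a NumPy 0/1 feature vector of length
              num_features whose first entry is the always-TRUE question,
              and y is +1 or -1.
    margin:   the margin t; with margin=0 this is the standard perceptron.
    """
    w = np.zeros(num_features)
    for _ in range(max_epochs):
        mistakes = 0
        for q, y in examples:
            # Update whenever the example is not classified with margin at least t.
            if y * np.dot(w, q) <= margin:
                w = w + y * q
                mistakes += 1
        if mistakes == 0:          # run to convergence on separable data
            break
    return w

def predict(w, q):
    """Return +1 or -1, i.e. which of the two confusable words to choose."""
    return 1 if np.dot(w, q) >= 0 else -1
```

With margin=0 this is exactly the standard perceptron update; the value t = 20 is the setting the text reports tuning on the five held-out word pairs.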
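As a brief aside on the aggregation used for Figures 5 and 6: writing e_i for the error rate on word pair i (notation introduced here only for illustration), the reported number is the geometric rather than arithmetic mean,

```latex
\bar{e}_{\text{geo}} = \Bigl(\prod_{i=1}^{10} e_i\Bigr)^{1/10}
                     = \exp\!\Bigl(\frac{1}{10}\sum_{i=1}^{10} \ln e_i\Bigr).
```

Because this averages the logarithms of the error rates, a single word pair with a very large error rate cannot dominate the summary the way it would under an arithmetic mean; entropy, which is already on a log scale, is averaged arithmetically instead, as the text notes.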
<Paragraph position="3"> However, if probabilities are not needed, then TBL can produce lower error rates with still fewer rules. On the other hand, if one wants either the lowest entropies or the highest accuracies, then it appears that linear models, such as maxent or the perceptron with margin, work even better, at the expense of producing much larger models.</Paragraph>
<Paragraph position="4"> Clearly, the new algorithm works very well when both small size and probabilities are needed. It would be interesting to try combining this algorithm with decision trees in some way. Both Yarowsky (2000) and Florian et al. (2000) were able to improve on the simple decision list structure by adding additional splits, Yarowsky by adding them at the root and Florian et al. by adding them at the leaves. Notice, however, that the chief advantage of decision lists over linear models is their compact size and understandability, and our techniques simultaneously improve those aspects; adding additional splits will almost certainly lead to larger models, not smaller ones.</Paragraph>
<Paragraph position="5"> It would also be interesting to try more sophisticated smoothing techniques, such as those of Yarowsky.</Paragraph>
<Paragraph position="6"> We have shown that a simple, incremental algorithm for learning probabilistic decision lists can produce models that are significantly more accurate, have significantly lower entropy, and are significantly smaller than those produced by the standard sorted learning algorithm. The new algorithm comes at the cost of some increased time, space, and complexity, but variations on it, such as the sorted algorithm with thresholding or the techniques of Section 2.2.1, can be used to trade off space, time, and list size. Overall, given the substantial improvements from this algorithm, it should be widely used whenever the advantages of probabilistic decision lists, compactness and understandability, are needed.</Paragraph> </Section> </Section> </Paper>