<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0708"> <Title>Memory-Based Learning for Article Generation</Title> <Section position="8" start_page="45" end_page="47" type="evalu"> <SectionTitle> 5 Evaluation and Discussion </SectionTitle> <Paragraph position="0"> We tested the features discussed in section 3 with respect to a number of different memory-based learning methods as implemented in the TiMBL system (Daelemans et al., 2000).</Paragraph> <Paragraph position="1"> We considered two different learning algorithms. The first, IB1 is a k-nearest neighbour algorithm. 3. This can be used with two different metrics to judge the distance between the examples: overlap and modified value difference metric (MVDM). TiMBL automatically learns weights for the features, using one of five different weighting methods: no weighting, gain ratio, information gain, chi-squared and shared variance. The second algorithm, IGTREE, stores examples in a tree which is pruned according to the weightings. This makes it much faster and of comparable accuracy. The results for these different methods, for k = 1, 4, 16 are displayed in Table 3. IB1 is tested with leave-one-out cross-validation, IGTREE with ten-fold cross validation.</Paragraph> <Paragraph position="2"> The best results were (82.6%) for IB1 with the MVDM metric, and either no weighting or weighting by gain ratio. IGTREE did not perform as well. We investigated more values of k, from 1 to 200, and found they had little influence on the accuracy results with k = 4 or 5 performing slightly better.</Paragraph> <Paragraph position="3"> We also tested each of the features described in Section 3 in isolation and then all together.</Paragraph> <Paragraph position="4"> We used the best performing algorithm from our earlier experiment: IB1 with MVDM, gain ratio and k = 4. The results of this are given in When interpreting these results it is important to recall the figures provided in Table 1. The most common article, for any PoS, was no and for many PoS, including pronouns, generating no article is always correct. There is more variation in NPs headed by common nouns and adjectives, and a little in NPs headed by proper nouns. Our baseline therefore consists of never 3Strictly speaking, it is a k nearest distance algorithm, which looks at all examples in the nearest k distances, the number of which may be greater than k. generating an article: this will be right in 70.0% of all cases.</Paragraph> <Paragraph position="5"> Looking at the figures in Table 4, we see that many of the features investigated did not improve results above the baseline. Using the head of the NP itself to predict the article gave the best results of any single feature, raising the accuracy to 79.4%. The functional tag of the head of the NP itself improved results slightly. The use of the semantic classes (72.1%) clearly improves the results over the baseline thereby indicating that they capture useful generalizations. The results from testing the features in combination are shown in Table 5. Interestingly, features which were not useful on their own, proved useful in combination with the head noun. The most useful features appear to be the category of the embedding constituent (81.1%) and the presence or absence of a determiner (80.9%). 
<Paragraph position="5"> Looking at the figures in Table 4, we see that many of the features investigated did not improve results above the baseline. Using the head of the NP itself to predict the article gave the best results of any single feature, raising the accuracy to 79.4%. The functional tag of the head of the NP improved results slightly. The use of the semantic classes (72.1%) clearly improves the results over the baseline, indicating that they capture useful generalizations. The results from testing the features in combination are shown in Table 5. Interestingly, features which were not useful on their own proved useful in combination with the head noun. The most useful features appear to be the category of the embedding constituent (81.1%) and the presence or absence of a determiner (80.9%). Combining all the features gave an accuracy of 82.9%.</Paragraph>
<Paragraph position="6"> Our best results (82.6%), which used all features, are significantly better than the baseline of generating no articles (70.0%) or using only the head of the NP for training (79.4%). We also improve significantly upon the earlier result of 78% reported by Knight and Chander (1994), which in any case addressed a simpler task, since it only involved the choice between the and a/an. Further, our results are competitive with state-of-the-art rule-based systems. Because different corpora are used to obtain the various results reported in the literature, and the problem is often defined differently, detailed comparison is difficult. However, the accuracy achieved appears to approach the results achieved with hand-written rules.</Paragraph>
<Paragraph position="7"> In order to test the effect of the size of the training data, we used the best performing algorithm from our earlier experiment (IB1 with MVDM, gain ratio and k = 4) on various subsets of the corpus: the first 10%, the first 20%, the first 30%, and so on up to the whole corpus. The results are given in Table 6.</Paragraph>
<Paragraph position="8"> The accuracy is still improving even with 300,744 NPs; an even larger corpus should give even better results. It is important to keep in mind that we, like most other researchers, have been training and testing on a relatively homogeneous corpus. Furthermore, we took as given information about the number of the NP. In many applications we will have neither a large amount of homogeneous training data nor information about number.</Paragraph>
<Section position="1" start_page="46" end_page="47" type="sub_section">
<SectionTitle> 5.1 Future Work </SectionTitle>
<Paragraph position="0"> In the near future we intend to extend our approach further in various directions. First, we plan to investigate other lexical and syntactic features that might further improve our results, such as the presence of pre-modifiers like superlative and comparative adjectives, and of post-modifiers like prepositional phrases, relative clauses, and so on. We would also like to investigate the effect of additional discourse-based features, such as one that incorporates information about whether the referent of a noun phrase has been mentioned before.</Paragraph>
<Paragraph position="1"> Second, we intend to make sure that the features we are using in training and testing will be available in the applications we consider. For example, in machine translation, the input noun phrase may be all dogs, whereas the output could be either all dogs or all the dogs. At present, words such as all, both and half in our input are tagged as pre-determiners if there is a following determiner (which can only be the or a possessive), and as determiners if there is no article. To train for a realistic application we need to collapse the determiner and pre-determiner inputs together in our training data.</Paragraph>
<Paragraph position="2"> Furthermore, we are interested in training on corpora with less markup, such as the British National Corpus (Burnard, 1995), or even with no markup at all. By running a PoS tagger and then an NP chunker, we should be able to get a lot more training data, and thus significantly improve our coverage. If we can use plain text to train on, then it will be easier to adapt our tool quickly to new domains, for which fully marked-up corpora are unlikely to exist.</Paragraph>
</Section>
</Section>
</Paper>