<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1510"> <Title>Probabilistic models for disambiguation of an HPSG-based chart generator</Title> <Section position="6" start_page="98" end_page="99" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle>
<Paragraph position="0"> In this section, we present five experiments: a comparison among the four models described in Section 3.2, syntax models with different features, different corpus sizes, different beam widths, and the distribution of generation time. The bigram model was trained using 100,000 sentences in the BNC. The unigram and syntax models were trained using Sections 02-21 of the WSJ portion of the Penn Treebank (39,832 sentences). Sections 22 (1,700 sentences) and 23 (2,416 sentences) were used as the development and test data, respectively.</Paragraph>
<Paragraph position="1"> Because the generator is still slow to generate long sentences, sentences with more than 20 words were not used. We converted the treebank into HPSG-style derivation trees by the method of Miyao et al. (2004) and extracted the semantic relations, which are used as the inputs to the generator. The sentences for which this conversion failed were also eliminated, although such sentences were few (about 0.3% of the data). The resulting training data consisted of 18,052 sentences and the test data consisted of 1,006 sentences. During training, uncovered sentences - those for which the lexicon does not include the lexical entries to construct the correct derivation - were also ignored, while such sentences remained in the test data. The final training data thus consisted of 15,444 sentences. The average sentence length of the test data was 12.4, which happens to be close to that of Velldal and Oepen (2005), though the test data is different.</Paragraph>
<Paragraph position="2"> The accuracy of the generator outputs was evaluated by the BLEU score (Papineni et al., 2001), which is commonly used for the evaluation of machine translation and has recently been used for the evaluation of generation (Langkilde-Geary, 2002; Velldal and Oepen, 2005). BLEU is the weighted average of n-gram precisions against the reference sentence. We used the sentences in the Penn Treebank as the reference sentences. The beam width, given by the number of edges n and the FOM difference δ, was increased from its initial to its final value in two steps. The parameters were empirically determined using the development set. All the experiments were conducted on AMD Opteron servers with a 2.0-GHz CPU and 12-GB memory.</Paragraph>
<Paragraph position="3"> Table 2 shows the average generation time and the accuracy of the models presented in Section 3. The generation time includes the time spent on inputs for which the generator could not output a sentence, while the accuracy was calculated only for successful generation. All models succeeded in generation for over 90% of the test data.</Paragraph>
<Paragraph position="4"> Contrary to the result of Velldal and Oepen (2005), the syntax model outperformed the combined model. We observed the same result when we varied the parameters for beam thresholding. This is possibly because our language model was not trained as well as that of the previous research (Velldal and Oepen, 2005), where the model was a 4-gram model trained on the entire BNC.</Paragraph>
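As a rough illustration of the combined model discussed above, the following minimal sketch scores a realization candidate by adding a weighted language-model log probability to a log-linear syntax score. The function name, feature names, and the additive combination with a single weight lm_weight are our own assumptions for illustration; the paper's exact formulation is the one in its Section 3.2.

```python
import math

def combined_score(syntax_features, weights, lm_logprob, lm_weight):
    """Hypothetical combined-model score for one realization candidate.

    A log-linear syntax score (dot product of feature values and learned
    weights) plus a weighted language-model log probability; a sketch of
    the usual combination, not the paper's actual implementation.
    """
    syntax_score = sum(weights.get(f, 0.0) * v
                       for f, v in syntax_features.items())
    return syntax_score + lm_weight * lm_logprob

# Toy comparison of two word orders for the same semantic relations;
# all weights and probabilities below are made up.
weights = {"SYM:S": 0.8, "COMMA": -0.3}
fluent = combined_score({"SYM:S": 1.0}, weights, math.log(1e-4), lm_weight=0.5)
odd = combined_score({"SYM:S": 1.0}, weights, math.log(1e-7), lm_weight=0.5)
assert fluent > odd  # the language-model term prefers the fluent order
```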
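Since all accuracy figures in this section are BLEU scores, a sketch of the metric may help. The following simplified sentence-level BLEU uses uniform weights over 1- to 4-gram clipped precisions and the standard brevity penalty; real evaluations aggregate counts at the corpus level and often apply smoothing, which this sketch omits.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference:
    geometric mean of clipped 1..max_n-gram precisions times a
    brevity penalty. For illustration only."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped precision: each candidate n-gram counts at most as
        # often as it appears in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if matched == 0:
            return 0.0  # unsmoothed: any zero precision gives BLEU = 0
        log_prec_sum += math.log(matched / sum(cand_counts.values())) / max_n
    # Brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1.0 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_prec_sum)

print(bleu("the dog runs in the park".split(),
           "the dog runs in the big park".split()))  # about 0.67
```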
<Paragraph position="5"> Although the accuracy shown in Table 2 was lower than that of Velldal and Oepen (2005), there is little point in a direct comparison between the accuracy of the two systems because the settings are considerably different in terms of the grammar, the input representation, and the training and test sets. The algorithm we proposed does not depend on our specific setting and can be integrated and evaluated within their setting. We used larger training data (15,444 sentences) and test data (1,006 sentences), compared to their treebank of 864 sentences, where the log-linear models were evaluated by cross validation.</Paragraph>
<Paragraph position="6"> This is the advantage of adopting feature forests to efficiently estimate the log-linear models.</Paragraph>
<Paragraph position="7"> Figure 5 shows the relationship between the size of the training data and the accuracy. All the following experiments were conducted on the syntax model.</Paragraph>
<Paragraph position="8"> The accuracy seems to saturate around 4,000 sentences, which indicates that a small training set is enough to train the current syntax model and that we could use an additional feature set to improve the accuracy in future development.</Paragraph>
<Paragraph position="9"> Similar results are reported in parsing (Miyao and Tsujii, 2005), although there the accuracy saturated around 16,000 sentences. When we use more complicated features or train the model with longer sentences, the size of the necessary training data will possibly increase.</Paragraph>
<Paragraph position="10"> Table 3 shows the performance of syntax models with different feature sets. Each row represents a model where one of the atomic features in Table 1 was removed. The "None" row is the baseline model. The rightmost column represents the difference in accuracy from the model trained with all features. The SYM, LE, and COMMA features had a significant influence on the performance. These results differ from those in parsing reported by Miyao and Tsujii (2005), where COMMA and SPAN especially contributed to the accuracy. This observation implies that there is still room for improvement by tuning the combination of features for generation.</Paragraph>
<Paragraph position="11"> We compared the performance of the generator with different beam widths to investigate the effect of iterative beam search. Table 4 shows the results when we varied n, the number of edges, with thresholding by FOM differences disabled, and Table 5 shows the results when we varied only δ, the FOM difference.</Paragraph>
<Paragraph position="12"> Intuitively, beam search may decrease the accuracy because it cannot explore all possible candidates during generation, and iterative beam search is more likely to decrease the accuracy than ordinary beam search. However, the results show that the accuracy did not drastically decrease at small widths. Moreover, the accuracy of iterative beam search was almost the same as that of the largest n. On the other hand, generation time significantly increased as n or δ increased, indicating that iterative beam search efficiently discarded unnecessary edges without losing accuracy. Although the coverage increases as the beam width increases, the coverage at the largest n or δ is lower than that of iterative beam search (Table 2). (This is because the generator fails when the number of edges exceeds 10,000; since the number of edges significantly increases when n or δ is large, generation fails even if the correct edges are in the chart.)</Paragraph>
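To make the iterative strategy concrete, here is a minimal sketch of beam thresholding and iterative beam search over a chart generator: generation is retried with progressively wider beams (n, δ) until it succeeds or the schedule is exhausted. The Edge class, the generate_with_beam callable, and the width schedule are hypothetical stand-ins; the paper's actual procedure and parameter values are defined in its earlier sections.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    fom: float  # figure of merit used for thresholding

def prune(edges, n, delta):
    """Beam thresholding within one chart cell: keep at most n edges,
    and discard edges whose FOM is more than delta below the best."""
    edges = sorted(edges, key=lambda e: e.fom, reverse=True)
    best = edges[0].fom if edges else float("-inf")
    return [e for e in edges[:n] if best - e.fom <= delta]

def iterative_beam_generate(relations, widths, generate_with_beam):
    """Retry generation with progressively wider beams.

    `widths` is a schedule of (n, delta) pairs, narrowest first;
    `generate_with_beam(relations, n, delta)` is assumed to run the
    chart generator with pruning as in `prune` and return the best
    realization, or None when the beam was too narrow.
    """
    for n, delta in widths:
        sentence = generate_with_beam(relations, n, delta)
        if sentence is not None:
            return sentence  # most inputs succeed cheaply at narrow beams
    return None  # generation failed even at the widest beam

# Hypothetical three-step schedule; the paper's values are not reproduced.
WIDTHS = [(10, 5.0), (30, 10.0), (100, 15.0)]

if __name__ == "__main__":
    cell = [Edge(0.0), Edge(-2.0), Edge(-9.0)]
    print(prune(cell, n=2, delta=5.0))  # keeps the two best, drops the outlier
```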
<Paragraph position="13"> Finally, we examined the distribution of generation time without the limitation on sentence length in order to investigate strategies for improving the efficiency of the generator. Figure 6 is a histogram of generation time for 500 sentences randomly selected from the development set, of which 418 sentences were successfully generated, with an average BLEU score of 0.705. The average sentence length was 22.1, the maximum length was 60, and the average generation time was 27.9 sec, which was much longer than that for short sentences. The histogram shows that a few sentences require an extremely long time for generation, although about 70% of the sentences were generated within 5 sec. Hence, the average time could possibly be decreased if we investigate what kinds of sentences require an especially long time and improve the algorithm to remove such time-consuming fractions. This investigation is left for future research.</Paragraph>
<Paragraph position="16"> The closest empirical evaluation on the same task is that of Langkilde-Geary (2002), which reported the performance of the HALogen system, although the approach is rather different: hand-written mapping rules are used to make a forest containing all candidates, and the best candidate is selected using a bigram model. The performance of that generator was evaluated on Section 23 of the Penn Treebank in terms of the number of ambiguities, generation time, coverage, and accuracy. Several types of input specifications were examined in order to measure how specific the input should be for generating valid sentences. One of the specifications, named "permute, no dir", is similar to our input in that the order of modifiers is not determined at all. The generator produced outputs for 82.7% of the inputs, with an average generation time of 30.0 sec and a BLEU score of 0.757. The results of our last experiment are comparable to these results, though the evaluated section is different.</Paragraph> </Section> </Paper>