<?xml version="1.0" standalone="yes"?> <Paper uid="I05-5008"> <Title>Automatic generation of paraphrases to be used as translation references in objective evaluation measures of machine translation</Title> <Section position="8" start_page="62" end_page="63" type="evalu"> <SectionTitle> 7.2 Results </SectionTitle> <Paragraph position="0"> The scores in BLEU and NIST (both on a scale from 0 to 1) shown in Figure 2 are interpreted 6Formally, NIST is an open scale. Hence, scores cannot be directly compared for different seed sentences. We thus normalised them by the score of the seed sentence against itself. In this way, NIST scores become comparable for different seed sentences.</Paragraph> <Paragraph position="1"> as a measure of the lexical and syntactical variation among paraphrases. The lower they are, the greater the variation. The upper graphs show that this variation depends clearly on the lengths of the seed sentences. The shorter the seed sentence, the greater the variation among the paraphrases produced by this method. This is no surprise as the detection phase introduces a bias as was mentionned in Section 5 with the example sentence Sure.</Paragraph> <Paragraph position="2"> The lower graphs show that the variation does not depend on the number of paraphrases per seed sentence. Hence, on the contrary to a method that would produce more variations as more paraphrases are generated, in our method, the variation is not expected to change when one produces more and more paraphrases (however, the grammatical quality or the paraphrasing quality could change). In this sense, the method is scalable, i.e., one could tune the number of paraphrases wished without considerably altering the lexical and syntactical variation.</Paragraph> <Section position="1" start_page="62" end_page="63" type="sub_section"> <SectionTitle> 7.3 Comparison with reference sets </SectionTitle> <Paragraph position="0"> produced by hand We compared the lexical and syntactical variation of our paraphrases with paraphrases created by hand for a past MT evaluation campaign (AKIBA et al., 2004) in two language pairs: Japanese to English and Chinese to English.</Paragraph> <Paragraph position="1"> For every reference set, we evaluated each sentence against one chosen at random and left out. The mean of all these evaluation scores gives an indication on the overall internal lexical and syntactical variation inside the reference sets. The lower the scores, the better the lexical and syntactical variation. This scheme was applied to both reference sets created by hand, and to the one automatically produced by our method. The scores obtained are shown on Figure 7. Whereas BLEU scores are comparable for all reference sets, which indicates no notable difference in flu- null and automatically produced by our method. The lower the scores, the better the lexical and syntactical variation.</Paragraph> <Paragraph position="2"> ency, NIST scores are definitely better for the automatically produced reference set: this hints at a possibly richer lexical variation.</Paragraph> </Section> </Section> class="xml-element"></Paper>