<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2014">
  <Title>ATR - Spoken language communication research labs</Title>
  <Section position="4" start_page="79" end_page="80" type="metho">
    <SectionTitle>
3 Experimental setup
</SectionTitle>
    <Paragraph position="0"> The most popular off-the-shelf objective methods currently seem to be BLEU and NIST. As NIST was a modification of the original definition of BLEU, the work reported here concentrates on BLEU. Moreover, according to (Brill and Soricut, 2004), BLEU is a good representative of the class of automatic evaluation methods that focus on precision.</Paragraph>
    <Section position="1" start_page="79" end_page="80" type="sub_section">
      <SectionTitle>
3.1 Computation of a BLEU score
</SectionTitle>
      <Paragraph position="0"> For a given maximal order N, a baseline BLEUwN score is the product of two factors: a brevity penalty and the geometric average of modified n-gram precisions computed for all n-grams up to N.</Paragraph>
      <Paragraph position="2"> The brevity penalty is the exponential of the relative variation in length against the closest reference:</Paragraph>
      <Paragraph position="4"> where C is the candidate and Rclosest is the closest reference to the candidate according to its length.</Paragraph>
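The brevity-penalty formula is missing from the extracted text; a reconstruction consistent with the surrounding description (penalising only candidates shorter than the closest reference, as in standard BLEU) is:

```latex
% C is the candidate, R_closest the reference closest to C in length
BP = \min\left(1,\; e^{\,1 - |R_{closest}| / |C|}\right)
```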
      <Paragraph position="5"> |S| is the length of a sentence S in words. Using consistent notation, we write |S|_W for the number of occurrences of the (sub)string W in the sentence S, so that |S|_{w1...wn} is the number of occurrences of the word n-gram w1...wn in the sentence S.</Paragraph>
      <Paragraph position="6"> With the previous notations, the modified n-gram precision for order n is the ratio of two sums:</Paragraph>
      <Paragraph position="8"> The numerator sums the candidate's n-gram counts, each limited to the maximal number of occurrences of the n-gram considered in a single reference; the denominator gives the total number of n-grams in the candidate.</Paragraph>
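The precision formula itself is missing from the extracted text; a reconstruction matching the numerator/denominator description above, in the paper's |S|_{w1...wn} counting notation, is:

```latex
p_n = \frac{\displaystyle\sum_{w_1 \ldots w_n}
        \min\Bigl(|C|_{w_1 \ldots w_n},\; \max_{R} |R|_{w_1 \ldots w_n}\Bigr)}
      {\displaystyle\sum_{w_1 \ldots w_n} |C|_{w_1 \ldots w_n}}
```

where the sum runs over the n-grams occurring in the candidate C and the max runs over the individual references R.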
      <Paragraph position="9"> We leave the basic definition of BLEU untouched. The previous formulae can be applied to character n-grams instead of word n-grams. In the remainder of this paper, for a given order N, the measure obtained using words will be called BLEUwN, whereas the measure in characters for a given order M will be written BLEUcM.</Paragraph>
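As a minimal sketch of the clipped-precision computation just described (function names are ours, not the paper's), the same code serves for words and characters, since the only difference is whether the input sequence holds word tokens or characters:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    the maximal count of that n-gram in a single reference."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    clipped = 0
    for gram, count in cand.items():
        max_ref = max(ngram_counts(ref, n)[gram] for ref in references)
        clipped += min(count, max_ref)
    return clipped / sum(cand.values())

# Word n-grams (BLEUwN style): pass token lists.
cand = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split()]
print(modified_precision(cand, refs, 1))  # -> 0.8333... (5 clipped / 6 total)

# Character n-grams (BLEUcM style): pass the raw strings,
# since Python strings already behave as character sequences.
print(modified_precision("abcd", ["abcd"], 2))  # -> 1.0
```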
    </Section>
    <Section position="2" start_page="80" end_page="80" type="sub_section">
      <SectionTitle>
3.2 The test data
</SectionTitle>
      <Paragraph position="0"> We perform our study on English because a language whose segmentation is obvious and indisputable is required. This would not be the case for Japanese or Chinese, where different segmenters produce different results on the same texts.</Paragraph>
      <Paragraph position="1"> The experiments presented in this paper rely on a data set consisting of 510 Japanese sentences translated into English by 4 different machine translation systems, adding up to 2,040 candidate translations. For each sentence, a set of 13 references had been produced by hand in advance.</Paragraph>
      <Paragraph position="2"> Different BLEU scores in words and characters were computed for each of the 2,040 English candidate sentences, with their corresponding 13 reference sentences.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="80" end_page="81" type="metho">
    <SectionTitle>
4 Results: equivalence BLEUwN /
BLEUcM
</SectionTitle>
    <Paragraph position="0"> To investigate the equivalence of BLEUwN and BLEUcM, we use three methods: we look for the best correlation, the best agreement in judgements between the two measures, and the best behaviour, according to an intrinsic property of BLEU.</Paragraph>
    <Section position="1" start_page="80" end_page="80" type="sub_section">
      <SectionTitle>
4.1 Best correlation
</SectionTitle>
      <Paragraph position="0"> For a given order N, our goal is to determine the value of M for which the BLEUcM scores (in characters) are best correlated with the scores obtained with BLEUwN. To this end, we compute, for all possible values of N and M, Pearson's correlation between the scores obtained with BLEUwN and BLEUcM. We then select, for each N, the M that maximises the correlation. The results are shown in Table 1. For N = 4 words, the best M is 17 characters.</Paragraph>
    </Section>
    <Section position="2" start_page="80" end_page="81" type="sub_section">
      <SectionTitle>
4.2 Best agreement in judgement
</SectionTitle>
      <Paragraph position="0"> Similar to the previous method, we compute for all possible Ms and Ns all Kappa coefficients between BLEUwN and BLEUcM and then select, for each given N, that M which gives a maximum. The justification for such a procedure is as follows.</Paragraph>
      <Paragraph position="1"> All BLEU scores fall between 0 and 1, so it is always possible to recast them on a scale of grades. We arbitrarily chose 10 grades, ranging from 0 to 9, to cover the interval [0, 1] with ten smaller intervals of equal size. A grade of 0 corresponds to the interval [0, 0.1[, and so on, up to grade 9, which corresponds to [0.9, 1]. A sentence with a BLEU score of, say, 0.435 will be assigned a grade of 4.</Paragraph>
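The recasting into ten equal-width grades can be sketched in one line (the helper name is ours, not the paper's); the only subtlety is that the top interval [0.9, 1] is closed, so a score of exactly 1.0 must still map to grade 9:

```python
def grade(bleu):
    """Map a BLEU score in [0, 1] to one of ten equal-width grades 0..9.
    The cap at 9 closes the top interval, so grade(1.0) is 9, not 10."""
    return min(int(bleu * 10), 9)

print(grade(0.435))  # -> 4
print(grade(1.0))    # -> 9
```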
      <Paragraph position="2"> Recast as described above, BLEU scores become judgements on a discrete scale of grades, so that computing two BLEU scores for the same sentence, first in words and then in characters, is tantamount to asking two different judges to rate that sentence. Since a well-established technique for assessing the agreement between two judges is the computation of the Kappa coefficient, we use it to measure the agreement between any BLEUwN and any BLEUcM.</Paragraph>
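A self-contained sketch of the Kappa computation over two sequences of grades (the judgement lists below are illustrative, not the paper's data):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length sequences of discrete
    judgements: observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both judges pick the same grade.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two judges (e.g. BLEUw4 and BLEUc18 grades) agreeing on 3 of 4 sentences:
print(cohen_kappa([4, 4, 7, 2], [4, 4, 7, 3]))  # -> 0.6363...
```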
      <Paragraph position="3"> The maximum in the Kappa coefficients is reached for the values given in Table 1. For N = 4 words, the best M is 18 characters.</Paragraph>
      <Paragraph position="4"> The average ratio M/N obtained is 4.14, which is not that distant from the average word length in our data set: 3.84 for the candidate sentences.</Paragraph>
      <Paragraph position="5"> Also, for N = 4, we computed the best M for each sentence length; see Table 2.</Paragraph>
      <Paragraph position="6"> Except for N = 3, where the value obtained (14) is quite different from that obtained with Pearson's correlation (10), the values obtained with the Kappa coefficients differ by at most 1.</Paragraph>
    </Section>
    <Section position="3" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
4.3 Best analogical behaviour
</SectionTitle>
      <Paragraph position="0"> BLEU depends heavily on the geometric average of the modified n-gram precisions. Since one cannot find a given n-gram in a sentence unless both of the (n-1)-grams it contains are found in the same sentence, the following property holds for BLEU: for any given N, any given candidate, and any given set of references, BLEUwN ≤ BLEUw(N-1). The left graph of Figure 2 shows the correspondence of BLEUw4 and BLEUw3 scores for the data set. Indeed, all points are found on the diagonal or below.</Paragraph>
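A sketch of why this property holds, assuming (as the containment argument suggests) that the modified precisions are non-increasing in n:

```latex
% Every occurrence of an n-gram contains its two (n-1)-grams, so
% p_N \le p_{N-1} \le \cdots \le p_1. Then p_N is at most the geometric
% mean of p_1, \ldots, p_{N-1}, and appending it can only lower that mean:
\left(\prod_{n=1}^{N} p_n\right)^{1/N}
  \le \left(\prod_{n=1}^{N-1} p_n\right)^{1/(N-1)}
% The brevity penalty is identical on both sides, so the inequality
% carries over to the BLEU scores themselves.
```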
      <Paragraph position="1"> Using the property above, we are interested in finding experimentally the value M such that BLEUcM ≤ BLEUw(N-1) holds for almost all values. Such an M can then be considered the equivalent in characters of the value N in words.</Paragraph>
      <Paragraph position="2"> Here we look incrementally for the M allowing BLEUcM to best mimic BLEUwN, that is leaving at least 90% of the points on or under the diagonal. For N = 4, as the graph in the middle of Figure 2 illustrates, such a situation is first encountered for M = 18. The graph on the right side shows the corresponding layout of the scores for the data set. This indeed tends to confirm that the M for which BLEUcM displays a similar behaviour to BLEUw4 is around 18.</Paragraph>
      <Paragraph position="3"> 5 The standard case of system evaluation</Paragraph>
    </Section>
    <Section position="4" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
5.1 BLEUw4 ≃ BLEUc18
</SectionTitle>
      <Paragraph position="0"> According to the previous results, it is possible to find, for a given N, some M for which there is a high correlation, a good agreement in judgement, and an analogy of behaviour between the measures in characters and in words. For the most widely used value of N, namely 4, the corresponding value in characters was 17 according to correlation, and 18 according to both agreement in judgement and analogical behaviour. We thus take 18 as the number of characters corresponding to 4 words (see Figure 1 for plots of scores in words against scores in characters).</Paragraph>
    </Section>
    <Section position="5" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
5.2 Ranking systems
</SectionTitle>
      <Paragraph position="0"> We recomputed the overall BLEU scores of the four MT systems whose data we used, with the usual BLEUw4 and its corresponding method in characters, BLEUc18. Table 3 shows the average values obtained on the four systems.</Paragraph>
      <Paragraph position="1"> When going from words to characters, the values decrease by an average of 0.047. This is explained as follows: a sentence of fewer than N units necessarily has a BLEU score of 0 for n-grams of order N in that unit. Table 4 shows that, in our data, there are more sentences of fewer than 18 characters (350) than sentences of fewer than 4 words (302).</Paragraph>
      <Paragraph position="2"> Thus, there are more 0 scores with characters, and this explains the decrease in system scores when going from words to characters.</Paragraph>
      <Paragraph position="3"> On the whole, Table 3 shows that, fortunately, shifting from words to characters in the application of the standard BLEU measure leaves the ranking unchanged.</Paragraph>
    </Section>
  </Section>
</Paper>