<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2003"> <Title/> <Section position="4" start_page="17" end_page="17" type="metho"> <SectionTitle> 2 Descriptive Power of Standard Metrics </SectionTitle> <Paragraph position="0"> In this section we perform a simple experiment in order to measure the descriptive power of current state-of-the-art metrics, i.e., their ability to capture the features which characterize human translations with respect to automatic ones.</Paragraph> <Section position="1" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 2.1 Experimental Setting </SectionTitle> <Paragraph position="0"> We use the data from the Openlab 2006 Initiative promoted by the TC-STAR Consortium. This test suite is entirely based on European Parliament Proceedings, covering April 1996 to May 2005. We focus on the Spanish-to-English translation task. For the purpose of evaluation we use the development set, which consists of 1008 sentences.</Paragraph> <Paragraph position="1"> However, due to the lack of available MT outputs for the whole set, we used only a subset of 504 sentences corresponding to the first half of the development set. Three human references per sentence are available.</Paragraph> <Paragraph position="2"> We employ ten system outputs; nine are based on Statistical Machine Translation (SMT) systems (Giménez and Màrquez, 2005; Crego et al., 2005), and one is obtained from the free Systran on-line rule-based MT engine. Evaluation results have been computed by means of the IQMT Framework for Automatic MT Evaluation (Giménez and Amigó, 2006).</Paragraph> <Paragraph position="3"> We have selected a representative set of 22 metric variants corresponding to six different families: BLEU (Papineni et al., 2001), NIST (Doddington, 2002), GTM (Melamed et al., 2003), mPER (Leusch et al., 2003), mWER (Niessen et al., 2000) and ROUGE (Lin and Och, 2004a).</Paragraph> </Section> <Section position="2" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 2.2 Measuring Descriptive Power of Evaluation Metrics </SectionTitle> <Paragraph position="0"> Our main assumption is that if an evaluation metric is able to characterize human translations, then human references should be closer to each other than automatic translations are to other human references. Based on this assumption we introduce two measures (ORANGE and KING) which analyze the descriptive power of evaluation metrics from different points of view.</Paragraph> </Section> </Section> <Section position="5" start_page="17" end_page="17" type="metho"> <SectionTitle> ORANGE Measure </SectionTitle> <Paragraph position="0"> ORANGE compares automatic and manual translations one-on-one. Let $A$ and $R$ be the sets of automatic and reference translations, respectively, and $x(a, R)$ an evaluation metric which outputs the quality of an automatic translation $a \in A$ by comparison to $R$. ORANGE measures the descriptive power as the probability that a human reference $r$ is more similar than an automatic translation $a$ to the rest of human references:</Paragraph> <Paragraph position="1"> $\mathrm{ORANGE}(A, R) = \mathrm{Prob}_{a \in A,\, r \in R}\big( x(r, R \setminus \{r\}) > x(a, R \setminus \{r\}) \big)$</Paragraph> <Paragraph position="2"> ORANGE was introduced by Lin and Och (2004b) for the meta-evaluation of MT evaluation metrics. The ORANGE measure provides information about the average behavior of automatic and manual translations with respect to an evaluation metric.</Paragraph>
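<Paragraph> As a reading aid (not part of the original experiments), the following Python sketch shows how ORANGE could be estimated once a scoring function is available; the metric interface and all names are illustrative assumptions, not the IQMT API.
from itertools import product
from typing import Callable, Sequence

# metric(candidate, references) -> similarity score; assumed given, e.g. a
# standard evaluation metric restricted to the supplied reference set.
Metric = Callable[[str, Sequence[str]], float]

def orange(metric: Metric, automatic: Sequence[str], references: Sequence[str]) -> float:
    """Probability that a human reference scores higher than an automatic
    translation against the remaining references (leave-one-out)."""
    wins, total = 0, 0
    for r, a in product(references, automatic):
        rest = [ref for ref in references if ref is not r]
        if metric(r, rest) > metric(a, rest):
            wins += 1
        total += 1
    return wins / total if total else 0.0
In practice this would be computed per test sentence and averaged over the corpus.</Paragraph>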
</Section> <Section position="6" start_page="17" end_page="18" type="metho"> <SectionTitle> KING Measure </SectionTitle> <Paragraph position="0"> However, ORANGE does not provide information about how many manual translations are discernible from automatic translations. The KING measure complements ORANGE, tackling these issues by universally quantifying over the set of automatic translations $A$:</Paragraph> <Paragraph position="1"> $\mathrm{KING}(A, R) = \mathrm{Prob}_{r \in R}\big( \forall a \in A : x(r, R \setminus \{r\}) > x(a, R \setminus \{r\}) \big)$</Paragraph> <Paragraph position="2"> KING represents the probability that, for a given evaluation metric, a human reference is more similar to the rest of human references than any automatic translation.</Paragraph> <Paragraph position="3"> KING does not depend on the distribution of automatic translations, and identifies the cases for which the given metric has been able to discern human translations from automatic ones. That is, it measures how many manual translations can be used as a gold standard for system evaluation/improvement purposes.</Paragraph> </Section> <Section position="7" start_page="18" end_page="18" type="metho"> <SectionTitle> 2.3 Results </SectionTitle> <Paragraph position="0"/> </Section> <Section position="8" start_page="18" end_page="18" type="metho"> <SectionTitle> ORANGE Results </SectionTitle> <Paragraph position="0"> All values of the ORANGE measure are lower than 0.5, which is the ORANGE value that a random metric would obtain (see central representation in Figure 2). This is a rather counterintuitive result. A reasonable explanation, however, is that automatic translations behave as centroids with respect to human translations, because they somewhat average the vocabulary distribution of the manual references; as a result, automatic translations are closer to each manual reference than manual references are to each other (see leftmost representation in Figure 2).</Paragraph> <Paragraph position="1"> In other words, automatic translations tend to share (lexical) features with most of the references, but not to match any of them exactly. This is a combined effect of: - The nature of MT systems, mostly statistical, which compute their estimates from word-occurrence counts and therefore rely more heavily on frequent events. Consequently, automatic translations typically consist of frequent words, which are likely to appear in most of the references.</Paragraph> <Paragraph position="2"> - The shallowness of current metrics, which are not able to identify the common properties of manual translations with regard to automatic translations.</Paragraph> </Section> <Section position="9" start_page="18" end_page="18" type="metho"> <SectionTitle> KING Results </SectionTitle> <Paragraph position="0"> KING values, on the other hand, are slightly higher than the value that a random metric would obtain ($1/(|A|+1) \approx 0.09$ for ten systems). This means that every standard metric is able to discriminate a certain number of manual translations from the set of automatic translations; for instance, GTM-3 identifies 19% of the manual references. For the remaining 81% of the test cases, however, GTM-3 cannot make this distinction, and therefore cannot be used to detect and correct weaknesses of the automatic MT systems.</Paragraph>
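<Paragraph> A hypothetical sketch of the per-sentence KING computation defined above (function names and the metric interface are assumptions); corpus-level figures such as the 19% reported for GTM-3 would correspond to averaging this quantity over all test sentences.
from typing import Callable, Sequence

# metric(candidate, references) -> similarity score, as in the ORANGE sketch.
Metric = Callable[[str, Sequence[str]], float]

def king(metric: Metric, automatic: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of human references that outscore every automatic translation
    when both are evaluated against the remaining references."""
    discernible = 0
    for r in references:
        rest = [ref for ref in references if ref is not r]
        if all(metric(r, rest) > metric(a, rest) for a in automatic):
            discernible += 1
    return discernible / len(references) if references else 0.0</Paragraph>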
<Paragraph position="1"> These results provide an explanation for the low correlation between automatic evaluation metrics and human judgements at the sentence level.</Paragraph> <Paragraph position="2"> The necessary conclusion is that new metrics with higher descriptive power are required.</Paragraph> </Section> <Section position="10" start_page="18" end_page="19" type="metho"> <SectionTitle> 3 Improving Descriptive Power </SectionTitle> <Paragraph position="0"> Designing a single metric able to capture all the linguistic aspects that distinguish human translations from automatic ones is a daunting task. We approach this challenge by following a 'divide and conquer' strategy: we propose building a set of specialized similarity metrics devoted to the evaluation of partial aspects of MT quality.</Paragraph> <Paragraph position="1"> The challenge is then how to combine a set of similarity metrics into a single evaluation measure of MT quality. The QARLA framework provides a solution for this challenge.</Paragraph> <Section position="1" start_page="19" end_page="20" type="sub_section"> <SectionTitle> 3.1 Similarity Metric Combinations inside QARLA </SectionTitle> <Paragraph position="0"> The QARLA framework makes it possible to combine several similarity metrics into a single quality measure (QUEEN). Besides considering the similarity of automatic translations to human references, the QUEEN measure additionally considers the distribution of similarities among human references.</Paragraph> <Paragraph position="1"> The QUEEN measure operates under the assumption that a good translation must be similar to the human references ($R$) according to all similarity metrics. $\mathrm{QUEEN}_{X,R}(a)$ is defined as the probability, over $R \times R \times R$, that for every metric $x$ in a given metric set $X$ the automatic translation $a$ is more similar to a human reference than two other references are to each other:</Paragraph> <Paragraph position="2"> $\mathrm{QUEEN}_{X,R}(a) = \mathrm{Prob}_{(r, r', r'') \in R^3}\big( \forall x \in X : x(a, r) \geq x(r', r'') \big)$</Paragraph> <Paragraph position="3"> where $a$ is the automatic translation being evaluated, $r$, $r'$ and $r''$ are three different human references in $R$, and $x(a, r)$ stands for the similarity of $r$ to $a$.</Paragraph> <Paragraph position="4"> In the case of the Openlab data, we can count on only three human references per sentence. In order to increase the number of samples for QUEEN estimation we can use reference similarities $x(r', r'')$ between manual translation pairs from other sentences, assuming that the distances between manual references are relatively stable across examples.</Paragraph>
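<Paragraph> A minimal sketch of the QUEEN probability for a fixed metric set (not the IQMT implementation; the single-reference similarity interface and names are assumed):
from itertools import permutations
from typing import Callable, Sequence

# x(candidate, reference) -> similarity score against a single reference.
Similarity = Callable[[str, str], float]

def queen(metrics: Sequence[Similarity], a: str, references: Sequence[str]) -> float:
    """Probability over ordered reference triples (r, r', r'') that, for every
    metric in the set, a is at least as similar to r as r' is to r''."""
    wins, total = 0, 0
    for r, r1, r2 in permutations(references, 3):
        if all(x(a, r) >= x(r1, r2) for x in metrics):
            wins += 1
        total += 1
    return wins / total if total else 0.0
With only three references per sentence the triple space is tiny, which is why reference-pair similarities are borrowed from other sentences as described above.</Paragraph>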
</Section> <Section position="2" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 3.2 Similarity Metrics </SectionTitle> <Paragraph position="0"> We begin by defining a set of 22 similarity metrics taken from the list of standard evaluation metrics in Subsection 2.1. Evaluation metrics can be turned into similarity metrics simply by considering only one reference when computing their value.</Paragraph> <Paragraph position="1"> Second, we explore the possibility of designing complementary similarity metrics that exploit linguistic information at levels beyond the lexical one. Inspired by the work of Liu and Gildea (2005), who introduced a series of metrics based on constituent/dependency syntactic matching, we have designed three subgroups of syntactic similarity metrics. To compute them, we have used the dependency trees provided by the MINIPAR dependency parser (Lin, 1998). These metrics compute the level of word overlapping (unigram precision/recall) between the dependency trees associated with automatic and reference translations, from three different points of view: TREE-X overlapping between the words hanging from non-terminal nodes of type X of the tree. For instance, the metric TREE PRED reflects the proportion of word overlapping between subtrees of type 'pred' (predicate of a clause).</Paragraph> <Paragraph position="2"> GRAM-X overlapping between the words with grammatical category X. For instance, the metric GRAM A reflects the proportion of word overlapping between terminal nodes of type 'A' (adjectives and adverbs).</Paragraph> <Paragraph position="3"> LEVEL-X overlapping between the words hanging at level X of the tree, or deeper. For instance, LEVEL-1 would consider overlapping between all the words in the sentences.</Paragraph> <Paragraph position="4"> In addition, we also consider three coarser metrics, namely TREE, GRAM and LEVEL, which correspond to the average value of the finer metrics in each subfamily.</Paragraph> </Section> <Section position="3" start_page="19" end_page="20" type="sub_section"> <SectionTitle> 3.3 Metric Set Selection </SectionTitle> <Paragraph position="0"> We can compute KING over combinations of metrics by directly replacing the similarity metric $x(a, r)$ with the QUEEN measure. This corresponds exactly to the KING measure used in QARLA:</Paragraph> <Paragraph position="1"> $\mathrm{KING}(A, R, X) = \mathrm{Prob}_{r \in R}\big( \forall a \in A : \mathrm{QUEEN}_{X, R \setminus \{r\}}(r) > \mathrm{QUEEN}_{X, R \setminus \{r\}}(a) \big)$</Paragraph> <Paragraph position="2"> KING represents the probability that, for a given set of human references $R$ and a set of metrics $X$, the QUEEN quality of a human reference is greater than the QUEEN quality of any automatic translation in $A$.</Paragraph> <Paragraph position="3"> The similarity metrics based on standard evaluation measures, together with the new families of syntactic similarity metrics, form a set of 104 metrics. Our goal is to obtain the subset of metrics with the highest descriptive power; for this, we rely on the KING probability. A brute-force exploration of all possible metric combinations is not viable. In order to perform an approximate search for a local maximum in KING over the space of possible metric combinations, we have used the following greedy heuristic: 1. Individual metrics are ranked by their KING value.</Paragraph> <Paragraph position="4"> 2. In decreasing rank order, metrics are individually added to the set of optimal metrics if, and only if, the global KING is increased.</Paragraph> <Paragraph position="5"> After applying the algorithm we have obtained the optimal metric set:</Paragraph> <Paragraph position="6"> {GTM-1, NIST-2, GRAM A, GRAM AUX, GRAM BE, GRAM N, TREE, TREE AUX, TREE PNMOD, TREE PRED, TREE REL, TREE S and TREE WHN}</Paragraph> <Paragraph position="7"> This set has a KING value of 0.29, which is significantly higher than the maximum KING obtained by any individual standard metric (0.19 for GTM-3).</Paragraph>
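<Paragraph> A hypothetical sketch of this greedy forward-selection heuristic; the king_of scoring function, which would apply the KING/QUEEN computation to a candidate metric set, is assumed to be available.
from typing import Callable, List, Sequence

def greedy_metric_selection(metrics: Sequence[str],
                            king_of: Callable[[List[str]], float]) -> List[str]:
    """Rank metrics by individual KING, then add each one to the working set
    only if it increases the global KING of the combination."""
    ranked = sorted(metrics, key=lambda m: king_of([m]), reverse=True)
    selected: List[str] = []
    best = 0.0
    for m in ranked:
        score = king_of(selected + [m])
        if score > best:
            selected, best = selected + [m], score
    return selected</Paragraph>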
<Paragraph position="8"> As for the ORANGE probability that a reference translation attains a higher score than an automatic translation, this metric set obtains a value of 0.49 vs. 0.42. This means that the metrics are still, on average, unable to discriminate between human references and automatic translations. However, the proportion of sentences for which the metrics are able to discriminate (the KING value) is significantly higher.</Paragraph> <Paragraph position="9"> The metric set with the highest descriptive power contains metrics at different linguistic levels. For instance, GTM-1 and NIST-2 reward n-gram matches at the lexical level. GRAM A, GRAM N, GRAM AUX and GRAM BE capture word overlapping for adjectives and adverbs, nouns, auxiliary verbs, and uses of the verb 'to be', respectively. TREE AUX, TREE PNMOD, TREE PRED, TREE REL, TREE S and TREE WHN reward lexical overlapping over different types of dependency subtrees: auxiliary verbs, postnominal modifiers, predicates, relative clauses, surface subjects, and wh-elements at C-spec positions, respectively; TREE averages over all subtree types.</Paragraph> <Paragraph position="10"> These results are a clear indication that features from several linguistic levels are useful for the characterization of human translations.</Paragraph> </Section> </Section> <Section position="12" start_page="20" end_page="21" type="metho"> <SectionTitle> 4 Human-like vs. Human Acceptable </SectionTitle> <Paragraph position="0"> In this section we analyze the relationship between the two different kinds of MT evaluation presented: (i) the ability of MT systems to generate human-like translations, and (ii) the ability of MT systems to generate translations that look acceptable to human judges.</Paragraph> <Section position="1" start_page="20" end_page="20" type="sub_section"> <SectionTitle> 4.1 Experimental Setting </SectionTitle> <Paragraph position="0"> The ideal test set to study this dichotomy inside the QARLA framework would consist of a large number of human references per sentence, and automatic outputs generated by heterogeneous MT systems.</Paragraph> </Section> <Section position="2" start_page="20" end_page="21" type="sub_section"> <SectionTitle> 4.2 Descriptive Power vs. Correlation with Human Judgements </SectionTitle> <Paragraph position="0"> We use the data and results from the IWSLT04 Evaluation Campaign. We focus on the evaluation of the Chinese-to-English (CE) translation task, in which a set of 500 short sentences from the Basic Travel Expressions Corpus (BTEC) was translated (Akiba et al., 2004). For purposes of automatic evaluation, 16 reference translations and outputs by 20 different MT systems are available for each sentence. Moreover, each of these outputs was evaluated by three judges on the basis of adequacy and fluency (LDC, 2002). In our experiments we consider the sum of the adequacy and fluency assessments.</Paragraph> <Paragraph position="1"> However, the BTEC corpus has a serious drawback: sentences are very short (eight words long on average). In order to consider a sentence adequate, we are practically forcing it to match one of the human references almost exactly. To alleviate this effect we selected sentences consisting of at least ten words. A total of 94 sentences (13 words long on average) satisfied this constraint.</Paragraph>
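<Paragraph> For reference only (not part of the paper's tooling), the meta-evaluation discussed below compares each metric's KING value with its Pearson correlation against the summed adequacy+fluency judgements; a minimal sketch of that correlation computation, with illustrative names:
from math import sqrt
from typing import Sequence

def pearson(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Pearson correlation between per-sentence metric scores and the
    corresponding summed adequacy+fluency judgements."""
    n = len(metric_scores)
    mx = sum(metric_scores) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(metric_scores, human_scores))
    sx = sqrt(sum((x - mx) ** 2 for x in metric_scores))
    sy = sqrt(sum((y - my) ** 2 for y in human_scores))
    return cov / (sx * sy) if sx and sy else 0.0</Paragraph>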
<Paragraph position="2"> Figure 3 shows, for all metrics, the relationship between the power of characterization of human references (KING, horizontal axis) and the correlation with human judgements (Pearson correlation, vertical axis). Data are plotted in three different groups: original standard metrics, single metrics inside QARLA (QUEEN measure), and the optimal metric combination according to KING.</Paragraph> <Paragraph position="3"> The optimal set is {GRAM N, LEVEL 2, LEVEL 4, NIST-1, NIST-3, NIST-4, and 1-WER}. This set suggests that all kinds of n-grams play an important role in the characterization of human translations. The metric GRAM N reflects the importance of noun translations. Unlike in the Openlab corpus, levels of the dependency tree (LEVEL 2 and LEVEL 4) are descriptive features, but dependency relations (the TREE metrics) are not. This is probably due to the small average sentence length in IWSLT.</Paragraph> <Paragraph position="4"> Metrics exhibiting a high level of correlation outside QARLA, such as NIST-3, also exhibit a high descriptive power (KING). There is also a tendency for metrics with a KING value around 0.6 to concentrate at a level of Pearson correlation around 0.5.</Paragraph> <Paragraph position="5"> But the main point is that the QUEEN measure obtained by the metric combination with the highest KING does not yield the highest level of correlation with human assessments, which is obtained by standard metrics outside QARLA (0.5 vs. 0.7).</Paragraph> </Section> <Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.3 Human Judgements vs. Similarity to References </SectionTitle> <Paragraph position="0"> In order to explain the above results, we have analyzed the relationship between human assessments and the QUEEN values obtained by the best combination of metrics for every individual translation.</Paragraph> <Paragraph position="1"> Figure 4 shows that high values of QUEEN (i.e., similarity to references) imply high values of human judgements. But the reverse is not true: there are translations acceptable to a human judge but not similar to human translations according to QUEEN. This fact can be understood by inspecting a few particular cases. Table 1 shows two cases of translations exhibiting a very low QUEEN value and a very high human judgement score. The two cases present the same kind of problem: there exists some word or phrase absent from all human references. In the first example, the automatic translation uses the expression &quot;seats&quot; to make a reservation, where humans invariably choose &quot;table&quot;. In the second example, the automatic translation uses &quot;rack&quot; as the place to put a bag, while humans choose &quot;overhead bin&quot; or &quot;overhead compartment&quot;, but never &quot;rack&quot;. Therefore, the QUEEN measure discriminates these automatic translations with respect to all human references, thus assigning them a low value.</Paragraph> <Paragraph position="2"> However, human judges find these translations still acceptable and informative, although not strictly human-like.</Paragraph> <Paragraph position="3"> These results suggest that inside the set of human-acceptable translations, which includes human-like translations, there is also a subset of translations unlikely to have been produced by a human translator. 
This is a drawback of MT evaluation based on human references when the evaluation criterion is Human Acceptability. The good news is that when Human Likeness increases, Human Acceptability increases as well.</Paragraph> </Section> </Section> </Paper>