<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1072"> <Title>ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation</Title> <Section position="6" start_page="3" end_page="3" type="concl"> <SectionTitle> 5 Conclusion </SectionTitle>
<Paragraph position="0"> In this paper we introduced a new automatic evaluation method, ORANGE, for evaluating automatic evaluation metrics for machine translation. We showed that the new method can be easily implemented and integrated with existing statistical machine translation frameworks.</Paragraph>
<Paragraph position="1"> ORANGE assumes that a good automatic evaluation metric should assign high scores to good translations and low scores to bad translations. Using reference translations as examples of good translations, we measure the quality of an automatic evaluation metric by the average rank of the references within a list of alternative machine translations. Compared with traditional approaches that require human judgments of adequacy or fluency, ORANGE requires no extra human involvement beyond the availability of reference translations. It also streamlines the process of design and error analysis for developing new automatic metrics.</Paragraph>
<Paragraph position="2"> Using ORANGE, we have only one parameter, i.e., ORANGE itself, to optimize, versus two in correlation analysis using human-assigned adequacy and fluency.</Paragraph>
<Paragraph position="3"> By examining the rank positions of the references, we can easily identify the confusion set of the references and propose new features to improve automatic metrics.</Paragraph>
<Paragraph position="4"> One caveat of the ORANGE method concerns what happens if machine translations are as good as the reference translations. To rule out this scenario, we can sample instances where machine translations are ranked higher than human translations.
We then check the proportion of cases where the machine translations are as good as the human translations. If that proportion is small, the ORANGE method can be applied with confidence. We conjecture that this is the case for currently available machine translation systems; however, we plan to conduct the sampling procedure to verify that this is indeed the case.</Paragraph> </Section></Paper>
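The ranking idea behind ORANGE can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: `orange_average_rank`, the pooled-list construction, and the placeholder scoring function are all assumptions; in the paper the candidates would be an n-best list from a translation system and `score` would be a real metric such as BLEU or ROUGE.

```python
def orange_average_rank(candidates, references, score):
    """Average rank of each reference when pooled with the machine
    candidates and sorted by the metric `score` (higher = better).
    Rank 1 is best; a lower average rank indicates a better metric."""
    ranks = []
    for ref in references:
        pool = candidates + [ref]
        # Order the pooled translations by metric score, best first.
        ordered = sorted(pool, key=score, reverse=True)
        ranks.append(ordered.index(ref) + 1)
    return sum(ranks) / len(ranks)

# Toy usage with a deliberately naive "metric" (string length):
avg = orange_average_rank(["a b", "a b c d"], ["a b c"], len)
```

A metric that consistently places references near the top of the pooled list (average rank close to 1) is, by the ORANGE criterion, a better metric; the confusion set mentioned above corresponds to the candidates ranked above the references.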