<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0808">
<Title>Construction of a Spanish Generation module in the framework of a General-Purpose, Multilingual Natural Language Processing System</Title>
<Section position="7" start_page="1" end_page="3" type="evalu">
<SectionTitle>4 Evaluation</SectionTitle>
<Paragraph position="0">The generation components described in the previous sections are part of an MT system that has been run on actual Microsoft technical documentation. The system is evaluated frequently, both to measure progress and to provide feedback on its design and development.</Paragraph>
<Paragraph position="1">To track our progress over time and to compare our system with others, we have performed several periodic, blind evaluations with human raters. We focus here on the evaluation of our Spanish-English and English-Spanish systems.</Paragraph>
<Paragraph position="2">For each evaluation, several human raters judge the same set of 200-250 sentences, randomly extracted from our technical corpora (150K sentences).</Paragraph>
<Paragraph position="3">The raters are not shown the source-language sentence; instead, they are presented with a human translation of it, along with two machine-generated translations. Their task is to choose between the two machine outputs, using the human translation as a reference.</Paragraph>
<Paragraph position="4">Table 1 summarizes a comparison of the output of our Spanish-English system with that of Babelfish (http://world.altavista.com/). Table 2 does the same for our English-Spanish system and Lernout &amp; Hauspie's English-Spanish system (http://officeupdate.lhsl.com/). In these tables, a rating of 1 means that the raters uniformly preferred the translation produced by our system; a rating of 0 means that they did not uniformly prefer either translation; and a rating of -1 means that they uniformly preferred the translation produced by the alternative system. Beside each rating is a confidence measure for the mean preference at the .99 level (Richardson, ...). The human raters used for these evaluations work for an independent agency and played no role in developing the systems they test.</Paragraph>
<Paragraph position="5">In interpreting our results, it is important to keep in mind that our MT system has been customized to the test domain, while the Babelfish and Lernout &amp; Hauspie systems have not.</Paragraph>
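<Paragraph position="6">The ratings above can be read as means of per-sentence preference judgments, scored +1 when the raters preferred our system's translation, 0 when they expressed no uniform preference, and -1 when they preferred the alternative system's translation. The Python sketch below illustrates this aggregation, attaching a .99-level confidence interval to the mean via a normal approximation; the normal approximation and the judgment counts are illustrative assumptions, not the confidence measure of the citation above.</Paragraph>

```python
# Illustrative sketch only; not the paper's exact confidence measure.
# Per-sentence judgments are scored +1 (our system preferred),
# 0 (no uniform preference), or -1 (alternative system preferred).
import math

def mean_preference(scores, z=2.576):
    """Return the mean preference and the half-width of a two-sided
    .99-level confidence interval (z = 2.576), using a normal
    approximation to the distribution of the sample mean."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)

# Hypothetical judgment counts for a 250-sentence evaluation set:
# 120 preferences for our system, 80 ties, 50 for the alternative.
scores = [1] * 120 + [0] * 80 + [-1] * 50
m, ci = mean_preference(scores)
print(f"mean preference = {m:+.3f} +/- {ci:.3f} (.99 level)")
```
</Section>
</Paper>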