File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-1402_intro.xml
Size: 5,879 bytes
Last Modified: 2025-10-06 14:02:39
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1402"> <Title>CA</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> * Improve the present evaluation methodologies </SectionTitle> <Paragraph position="0"> * Identify new (quantitative and qualitative) approaches for already evaluated technologies: socio-technical and psychocognitive aspects * Identify protocols for new technologies and applications * Identification of language resources relevant for evaluation (to promote the development of new linguistic resources for those languages and domains where they do not exist yet, or only exist in a prototype stage, or exist but cannot be made available to the interested users); The object of the CESTA campaign is twofold. It is on the one hand to provide an evaluation of commercial Machine Translation Systems and on the other hand, to work collectively on the setting of a new reusable Machine Translation Evaluation protocol that is both user oriented and accounts for the necessity to use semantic metrics in order to make available a high quality reusable machine translation protocol to system providers.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Object of the campaign </SectionTitle> <Paragraph position="0"> The object of the CESTA campaign is to evaluate technologies together with metrics, i.e. to contribute to the setting of a state of the art within the field of Machine Translation systems evaluation.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.3 CESTA user oriented protocol </SectionTitle> <Paragraph position="0"> The campaign will last three years, starting from January 2003. A board of European experts are members of CESTA Scientific committee and have been working together in order to determine the protocol to use for the campaign. Six systems are being evaluated.</Paragraph> <Paragraph position="1"> Five of these systems are commercial MT systems and one is a prototype developed at the university of Montreal by the RALI research centre. Evaluation is carried out on text rather than sentences. Text approximate width will be 400 words. Two runs will be carried out. For industrial reasons, systems will be made anonymous.</Paragraph> <Paragraph position="2"> 2 State-of-the-art in the field of Machine Translation evaluation In 1966, the ALPAC report draws light on the limits of Machine Translation systems. In 1979, the Van Slype report presented a study dedicated to Machine Translation metrics.</Paragraph> <Paragraph position="3"> In 1992, the JEIDA campaign puts the user at the center of evaluator's preoccupation. JEIDA proposed to draw human measures on the basis of three questionnaires: * One destined to users (containing a hundred questions) * Other questionnaires are destined to system Machine translation systems editors (three different questionnaires), * And a set of other questionnaires reserved to Machine Translation systems developers.</Paragraph> <Paragraph position="4"> Scores are worked out on the background of fourteen categories of questions. From these scores, graphs are produced according to the answers obtained. A comparison of different graphs for each systems is used as a basis for systems classification.</Paragraph> <Paragraph position="5"> The first DARPA Machine Translation evaluation campaign (1992-1994) makes use of human judgments. It is a very expensive method but interesting however, as regards the reliability of the evaluation thus produced. This campaign is based on tests carried out from French, Spanish and Japanese as source languages and English as a target language. The measures used for each of the following criteria are: * Fidelity - a proximity distance is worked out between a source sentence and a target sentence on a 1 to 5 scale.</Paragraph> <Paragraph position="6"> * Intelligibility, that corresponds to linguistic acceptability of a translation is measured on a 1 to 5 evaluation scale.</Paragraph> <Paragraph position="7"> * Informativeness: the test is carried out on reading of the target text alone. A questionnaire on text informative content is displayed allowing to work out a measure calculated on the basis of the percentage of good answers provided in system translation.</Paragraph> <Paragraph position="8"> In 1995, the OVUM report proposes to compare commercial Machine Translation systems on the basis of ten criteria.</Paragraph> <Paragraph position="9"> In 1996, the EAGLES report (EAGLES, 1999) sets new standards for Natural Language Processing software evaluation on the background of ISO 9126.</Paragraph> <Paragraph position="10"> Initiated in 1999, and coordinated by Pr Antonio Zampolli, the ISLE project is divided into three working groups, one being a Machine Translation group.</Paragraph> <Paragraph position="11"> Starting from ISO 9126 standard (King, 1999b), the aim of the project is to produce two taxonomies (c.f. section 3 of this article) and : * One defining quality subcriteria with the aim of refining the six criteria defined by ISO 9126 (i.e. functionality, reliability, user-friendliness, efficiency, maintenance portability) * The second one specifying use contexts that define the type of task induced the use of a by Machine Translation system, the types of users and input data. This taxonomy uses contextual parameters to select and order the quality criteria subject to evaluation. This taxonomy can be viewed and downloaded on the ISSCO website at the following address : http://www.issco.unige.ch/projects/isle/fe mti/ The second DARPA campaign (Papineni, K., S. Roukos, T. Ward and Z. Wei-Jing, 2001), making use of the IBM BLEU metric is mentioned in the CESTA protocol (c.f. section 8.1 of this article).</Paragraph> </Section> </Section> class="xml-element"></Paper>