<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1402"> <Title>CA</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 User-oriented evaluations </SectionTitle> <Paragraph position="0"> An emerging evaluation methodology in NLP technology focuses on quality requirements analysis: the needs, and consequently the satisfaction, of end-users will depend on the tasks and the expected results, requirement domains which we have identified as diagnostic quality dimensions. One of the most suitable methods for this type of evaluation is adequacy evaluation, which aims at finding out whether a system or product is adequate to someone's needs (see Sparck Jones & Galliers, 1996 and King, 1996, among many others, for a more detailed discussion of these issues). This approach encourages communication between users and developers.</Paragraph> <Paragraph position="1"> The definition of the CESTA evaluation protocol took into account the Framework for MT Evaluation in ISLE (FEMTI), available online. FEMTI offers the possibility to define evaluation requirements, then to select the relevant 'qualities' and the metrics commonly used to score them (cf. ISO/IEC 9126, 14598). The CESTA evaluation methodology is founded on a black-box approach.</Paragraph> <Paragraph position="2"> CESTA evaluators considered a generic user who is interested in general-purpose, ready-to-use translations, preferably using an off-the-shelf system. In addition, CESTA aims at producing reusable resources and at providing information about the reliability of the metrics (validation), while being cost-effective and fast.</Paragraph> <Paragraph position="3"> With these evaluation requirements in mind (FEMTI-1), it appears that the relevant qualities (FEMTI-2) are 'suitability', 'accuracy' and 'well-formedness'. Automated metrics best meet the CESTA need for reusability, among them BLEU, X-score and D-score (chosen for internal reasons). Their validation requires the comparison of their scores with recognised human scores for the same qualities (e.g., human assessment of fidelity or fluency).</Paragraph> <Paragraph position="4"> 'Efficiency', measured through post-editing time, was also discussed. For the evaluation, a general-purpose dictionary could be used first, then a domain-specific one.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 An approach based on use cases </SectionTitle> <Paragraph position="0"> The ISO 14598 directives for evaluators put forth, as a prerequisite for systems development, the detailed identification of user needs, which ought to be specified in the use-case document.</Paragraph> <Paragraph position="1"> Moreover, conducting a full evaluation process involves establishing an evaluation requirements document. The ISO 14598 document specifies that quality requirements should be identified &quot;according to user needs, application area and experience, software integrity and experience, regulations, law, required standards, etc.&quot;.</Paragraph> <Paragraph position="2"> The evaluation specification document is created using the Software Requirement Specifications (SRS) and the use-case document. The CESTA protocol relies on a use case that refers to a translation need grounded in basic syntactic correctness and simple understanding of a text, as required by information-watch tasks for example, and excludes making direct use of the text for post-editing purposes.</Paragraph> </Section> </Section>
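By way of illustration only, the choices just described can be summarised as a small configuration mapping each selected FEMTI quality to an automated metric and to the human judgement against which that metric is to be validated. The pairing of metric to quality shown below is our reading of this section, not an official CESTA assignment, and all field names are ours.

```python
# Illustrative summary of the FEMTI-guided choices described in Section 3.
# The quality-to-metric pairing is an assumption made for exposition only,
# not part of the CESTA specification.
CESTA_EVALUATION_PLAN = {
    "use_case": "information watch: basic syntactic correctness, simple understanding",
    "approach": "black box",
    "qualities": {
        "well-formedness": {"metric": "X-score", "validated_against": "human fluency judgements"},
        "accuracy": {"metric": "BLEU", "validated_against": "human fidelity judgements"},
        "suitability": {"metric": "D-score", "validated_against": "human adequacy judgements"},
    },
}

for quality, spec in CESTA_EVALUATION_PLAN["qualities"].items():
    print(f"{quality}: {spec['metric']}, validated against {spec['validated_against']}")
```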
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Two campaigns </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Specificities of the CESTA campaign </SectionTitle> <Paragraph position="0"> Two campaigns are being organised: the first campaign is organised using each system's default dictionary; after terminological adaptation of the systems, a second campaign will be organised.</Paragraph> <Paragraph position="1"> Two studies previously carried out, presented respectively at the 2001 MT Summit (Mustafa El Hadi, Dabbadie, Timimi, 2001) and at the 2002 LREC conference (Mustafa El Hadi, Dabbadie, Timimi, 2002), allowed us to appreciate the gap in quality between results obtained on target texts before and after terminological enrichment.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 First campaign </SectionTitle> <Paragraph position="0"> The organisation of the campaign involves several steps, currently being implemented in conformity with the protocol specifications validated by the CESTA scientific committee. The CESTA protocol specifications have been communicated to participants, in particular as regards data formatting, the test schedule, the metrics and the adaptation phase. For cost reasons, CESTA will not include a training phase. The first run will start during autumn 2004.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Second campaign </SectionTitle> <Paragraph position="0"> The systems having already been tuned, an adaptation phase will not be carried out for the second campaign. However, terminological adaptation will be necessary at this stage. The second series of tests being carried out on a thematically homogeneous corpus, only the thematic domain will be communicated to participants for terminological adaptation. For thematic adaptation, and in order to avoid system optimisation after the first series of tests, a new domain-specific 200,000-word masking corpus will be used.</Paragraph> <Paragraph position="1"> The terminological domain on which the evaluation will be carried out will then have to be defined. This terminological domain will be communicated to participants, but not the corpus itself. In addition, participants will be asked to send the organisers a written agreement by which they commit themselves to provide the organisers with any relevant information regarding system tuning and the specific adaptations made to each of the participating MT systems, in order to allow the scientific committee to understand and analyse the origin of potential changes in system ranking. The second run will start during 2005.</Paragraph> <Paragraph position="2"> The organisers have committed themselves not to publish the results between the two campaigns.</Paragraph> <Paragraph position="3"> After the terminological adaptation phase, the second campaign will take place. Participants will be given fifteen days to send their results.
An additional three-month period will be necessary to carry out result analysis and to prepare data publication and workshop organisation.</Paragraph> <Paragraph position="4"> The CESTA scientific committee also decided, in parallel with the two campaigns, to evaluate the systems' capacity to process formatted texts including images and HTML tags. Participants who do not wish to take part in this additional test have informed the scientific committee. Most of the time the reason is that their system is only capable of processing raw text. This is mainly the case for the academic systems involved in the campaign, most commercial systems nowadays being able to process formatted text.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Contrastive evaluation </SectionTitle> <Paragraph position="0"> One of the particularities of the CESTA protocol is to provide a meta-evaluation of the automated metrics used for the campaign, a kind of state of the art of evaluation metrics. The robustness of the metrics will be tested on a minor language pair through a contrastive evaluation against human judgement.</Paragraph> <Paragraph position="1"> The scientific committee has decided to use Arabic-French as the minor language pair.</Paragraph> <Paragraph position="2"> Evaluation on the minor language pair will be carried out directly on two of the participating systems, and using English as a pivot language on the other systems. Translation through a pivot language will then be the following: Arabic-English-French.</Paragraph> <Paragraph position="3"> The organisers are, of course, perfectly aware of the potential loss of quality caused by the use of a pivot language, but recall that, contrary to the major language pair, the evaluation carried out on the minor language pair through a pivot system will not be used to evaluate the systems themselves, but metric robustness. Results of metric evaluation and of system evaluation will, of course, be obtained and disseminated separately.</Paragraph> <Paragraph position="4"> During the tests of the first campaign, the system obtaining the best ranking on the French-English language pair will be selected to be used as a pivot system for the meta-evaluation of metric robustness.</Paragraph> </Section>
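To make the contrastive evaluation against human judgement concrete: validating a metric typically amounts to measuring how well its scores agree with human scores over the same set of translations, for instance with a rank correlation. The sketch below is an illustration only; the statistic (Spearman) and the per-segment data are placeholders, not the procedure fixed by the CESTA committee.

```python
def rank(values):
    """Average ranks (1 = lowest value), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend the run while the next sorted value is tied with the current one.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-segment scores: an automated metric vs. human judgements.
metric_scores = [0.42, 0.61, 0.35, 0.58, 0.47]
human_scores = [3.0, 4.0, 2.5, 4.5, 3.5]
print(round(spearman(metric_scores, human_scores), 3))  # -> 0.9
```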
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Test material </SectionTitle> <Paragraph position="0"> The required material is a set of corpora, as detailed in the following section, and a test tool that will be implemented according to the metrics' requirements and under the responsibility of the CESTA organisers.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Corpus </SectionTitle> <Paragraph position="0"> The evaluation corpus is composed of 50 texts of 400 words each; each text is to be translated twice, considering that one translation already exists in the original corpus. The different corpora are provided by ELRA. The masking corpus has 250,000 words and must be thematically homogeneous.</Paragraph> <Paragraph position="1"> For each language pair the following corpora will be used: * Adaptation: this 200,000 to 250,000-word corpus is a bilingual corpus. It is used to validate exchanges between organisers and participants and for system tuning.</Paragraph> <Paragraph position="2"> * A second corpus of the same type, but it will have to be thematically homogeneous (on a specific domain that will be communicated to participants a few months before the run takes place). * One masking corpus similar to the previous one.</Paragraph> <Paragraph position="3"> Additional requirement: the BLANC metric requires the use of a bilingual corpus aligned at document level. Three human translations will be used for each of the evaluation source texts. Considering that the corpora used already provide one official translation, only two additional human translations will be necessary. These translations will be carried out under the organisers' responsibility. Within the framework of the CESTA use cases, the evaluation is not made in order to obtain a ready-to-publish target-language translation, but rather to provide a foreign user with simple access to information within the limits of basic grammatical correctness, as already mentioned in this article.</Paragraph> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 The BLEU, BLANC and ROUGE metrics </SectionTitle> <Paragraph position="0"> Three types of metrics will be tested on the corpus, the CESTA protocol being the combination of, and a contrastive reference to, three different protocols:</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.1 The IBM &quot;BLEU&quot; protocol (Papineni, Roukos, Ward and Zhu, 2001) </SectionTitle> <Paragraph position="0"> The IBM BLEU metric, used by DARPA for its 2001 evaluation campaign, uses co-occurrence measures based on N-grams. The translation into English of 80 Chinese source documents by six different commercial machine translation systems was submitted to evaluation. From a reference corpus of translations made by experts, this metric works out quality measures according to a distance, based on shared N-grams (n = 1, 2, 3, ...), calculated between an automatically produced translation and the reference translation corpus. The results of this evaluation are then compared to human judgments.</Paragraph> <Paragraph position="1"> * NIST now offers an online evaluation of MT system performance, i.e.: o A program that can be downloaded for research purposes; the user then provides source texts and reference translations for a given language pair. o An e-mail evaluation service, for more formal evaluations; results can be obtained in a few minutes.</Paragraph> </Section>
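As an illustration of the N-gram co-occurrence scoring described above, the following sketch computes a BLEU-like score from clipped n-gram precisions combined with a brevity penalty. It is a simplified, illustrative implementation, not the official IBM/NIST one, which adds corpus-level aggregation and further refinements.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as a Counter."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Illustrative BLEU: clipped n-gram precision times brevity penalty.

    `candidate` is a list of tokens, `references` a list of token lists.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # degenerate case; real implementations smooth instead
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

# Example: one hypothesis scored against two reference translations.
hyp = "the cat sat on the mat".split()
refs = ["the cat sat on the mat".split(), "there is a cat on the mat".split()]
print(round(bleu(hyp, refs), 3))
```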
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.2 The &quot;BLANC&quot; protocol </SectionTitle> <Paragraph position="0"> This metric is derived from a study presented at the LREC 2002 conference (Hartley A., Rajman M., 2002). We only take into account part of the protocol described in that paper, namely the X-score, which corresponds to grammatical correctness.</Paragraph> <Paragraph position="1"> We will not give an exhaustive description of this experiment and shall only detail the elements that are relevant to the CESTA evaluation protocol.</Paragraph> <Paragraph position="2"> The protocol has been tested as follows: * One hundred source texts, ranging from 250 to 300 words each, are used. * For each of the source texts, a corpus of 6 translations is produced automatically; a corpus of 600 translations is thus produced. These translations are then grouped into series of six texts.</Paragraph> <Paragraph position="6"> * According to the protocol initiated by (White & Forner, 2001), these series are then ranked by average adequacy score.</Paragraph> <Paragraph position="7"> * One series out of every five is extracted from the whole; packs of twenty series of target translations are thus obtained and submitted to human evaluators.</Paragraph> <Paragraph position="8"> * Each evaluator reads 10 series of 6 translations, i.e. 60 texts.</Paragraph> <Paragraph position="9"> * Each of these series is then read by six different evaluators. * The evaluators must observe a compulsory ten-minute break every two series.</Paragraph> <Paragraph position="10"> * The evaluators do not know that the texts have been translated automatically.</Paragraph> <Paragraph position="11"> The directive given to them is the following: &quot;Rank these six texts from best to worst. If you cannot manage to give a different ranking to two texts, group them within the same brackets and give them the same score, as in the following example: 4 [1 2] 6 [3 5].&quot; The aim of this instruction is to produce rankings that are comparable to the rankings attributed automatically.</Paragraph> <Paragraph position="12"> Human judgement that ranks from best to worst corresponds in reality to a combination of the fluency, adequacy and informativeness criteria that can be attributed to the automatically translated texts. Within the framework of the CESTA evaluation campaign, the scientific committee decided to make use of the X-score only, the semantic D-score having proved unstable, and to replace it advantageously by a metric based on (Babych, Hartley & Atwell, 2003), a reformulation of the D-score developed by (Rajman & Hartley, 2001), which we refer to as the ROUGE metric in this article.</Paragraph> <Paragraph position="14"> A profile of grammatical features is computed for each document. This profile is then used to derive the X-score for each document, using the formula given in (Hartley & Rajman, 2002).</Paragraph> </Section>
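The ranking notation in the evaluators' directive (for example '4 [1 2] 6 [3 5]', where bracketed texts are tied) has to be converted into per-text scores before it can be compared with automatic rankings. The sketch below is purely illustrative: the parser and the competition-style handling of ties are our assumptions, not part of the published protocol.

```python
import re

def parse_ranking(directive):
    """Parse a ranking such as "4 [1 2] 6 [3 5]" into {text_id: rank}.

    Bracketed texts are tied and share the same rank; ranks are assigned
    in the order the groups appear (1 = best). Competition-style ranks
    (1, 2, 2, 4, ...) are an assumption of this sketch.
    """
    ranks = {}
    next_rank = 1
    for group in re.findall(r"\[[^\]]*\]|\S+", directive):
        ids = [int(tok) for tok in re.findall(r"\d+", group)]
        for text_id in ids:
            ranks[text_id] = next_rank
        next_rank += len(ids)
    return ranks

print(parse_ranking("4 [1 2] 6 [3 5]"))
# -> {4: 1, 1: 2, 2: 2, 6: 4, 3: 5, 5: 5}
```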
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.3 The &quot;ROUGE&quot; protocol </SectionTitle> <Paragraph position="0"> This protocol, developed by Anthony Hartley and colleagues in (Babych, Hartley & Atwell, 2003), is a semantic score. It is the result of a reformulation of the D-score, the semantic score initiated through a previous collaboration with Martin Rajman (Rajman & Hartley, 2001), as explained in the previous section.</Paragraph> <Paragraph position="1"> The original idea on which this protocol is based is that of MT evaluation metrics that &quot;are based on comparing the distribution of statistically significant words in corpora of MT output and in human reference translation corpora&quot;.</Paragraph> <Paragraph position="2"> The method used to measure MT quality is the following: a statistical model is built for the MT output corpora and for a parallel corpus of human translations, each statistically significant word being highlighted in the corpus. In addition, a statistical significance score is given for each highlighted word. The statistical models for the MT target texts and for the human translations are then compared, special attention being paid to words that are automatically marked as significant in the MT outputs but do not appear to be marked as significant in the human translations. These words are considered to be &quot;over-generated&quot;. The same operation is then carried out for &quot;under-generated&quot; words. At this stage, a third operation consists in marking the words equally marked as significant by the MT systems and the human translations. The overall difference is then calculated for each pair of texts in the corpora.</Paragraph> <Paragraph position="3"> Three measures specifying the differences between the statistical models for MT and for human translations are then implemented: the first one reflecting the ability to avoid &quot;over-generation&quot;, the second one the ability to avoid &quot;under-generation&quot;, and the last one being a combination of these two measures. The average scores for each of the MT systems are then computed.</Paragraph> <Paragraph position="4"> As detailed in (Babych, Hartley & Atwell, 2003): &quot;1. The score of statistical significance is computed for each word (with absolute frequency ≥ 2 in the particular text) for each text in the corpus, as follows: S_word[text] = log( ((P_word[text] - P_word[rest-of-corpus]) * N_word[txt-not-found]) / P_word[corpus] ), where P_word[text] is the relative frequency of the word in this particular text; P_word[rest-of-corpus] is the relative frequency of the same word in the rest of the corpus, without this text; N_word[txt-not-found] is the proportion of texts in the corpus where this word is not found (number of texts where it is not found divided by number of texts in the corpus); and P_word[corpus] is the relative frequency of the word in the whole corpus, including this particular text.</Paragraph> <Paragraph position="5"> 2. In the second stage, the lists of statistically significant words for corresponding texts, together with their S_word[text] scores, are compared across different MT systems. Comparison is done in the following way: for all words which are present in the lists of statistically significant words both in the human reference translation and in the MT output, we compute the sum of the changes of their S_word[text] scores; this sum is added to the scores of all &quot;over-generated&quot; words (words that do not appear in the list of statistically significant words for the human reference translation, but are present in such a list for the MT output). The resulting score becomes the general &quot;over-generation&quot; score S_o.text for this particular text; the general &quot;under-generation&quot; score S_u.text is obtained in the same way from the &quot;under-generated&quot; words. The resulting o_text and u_text scores could be interpreted as scores for the ability to avoid &quot;over-generation&quot; and &quot;under-generation&quot; of statistically significant words. The combined (o&u) score is computed similarly to the F-measure, where Precision and Recall are equally important: o&u_text = 2 * o_text * u_text / (o_text + u_text). The number of statistically significant words could be different in each text, so in order to make the scores compatible across texts we compute the average over-generation and under-generation scores per statistically significant word in a given text: for the o_text score we divide S_o.text by the number of statistically significant words in the MT text, and for the u_text score we divide S_u.text by the number of statistically significant words in the human reference translation. The general performance of an MT system for IE tasks could be characterised by the average o-score, u-score and o&u-score for all texts in the corpus&quot;.</Paragraph> </Section> </Section>
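The following sketch illustrates the comparison step just quoted, under simplifying assumptions: the S-scores are taken as given (one dictionary per text), the 'change' for words significant in both lists is taken as the absolute difference of their S-scores, and corpus-level averaging is omitted. It is an illustration of the quoted description, not the authors' implementation.

```python
def over_under_scores(mt_sscores, ref_sscores):
    """Per-word averaged o, u and combined o&u scores for one text pair.

    mt_sscores / ref_sscores: dict mapping each statistically significant
    word to its S-score in the MT output / human reference translation.
    """
    # Sum of S-scores of words significant only in the MT output ("over-generated")
    # or only in the reference ("under-generated").
    over = sum(s for w, s in mt_sscores.items() if w not in ref_sscores)
    under = sum(s for w, s in ref_sscores.items() if w not in mt_sscores)
    # Words significant in both lists contribute the change in their score
    # (taken here as the absolute difference, an assumption of this sketch).
    shared_change = sum(abs(mt_sscores[w] - ref_sscores[w])
                        for w in mt_sscores if w in ref_sscores)
    s_o = over + shared_change   # general over-generation score for the text
    s_u = under + shared_change  # general under-generation score for the text
    # Average per statistically significant word so scores are comparable across texts.
    o = s_o / max(len(mt_sscores), 1)
    u = s_u / max(len(ref_sscores), 1)
    combined = 2 * o * u / (o + u) if (o + u) else 0.0  # F-measure-style combination
    return o, u, combined

# Hypothetical S-scores for one MT output and its human reference translation.
mt = {"parliament": 3.1, "vote": 2.4, "fish": 1.9}       # "fish" is over-generated
ref = {"parliament": 2.8, "vote": 2.4, "budget": 2.2}    # "budget" is under-generated
print(over_under_scores(mt, ref))
```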
<Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Time schedule and result dissemination </SectionTitle> <Paragraph position="0"> The CESTA evaluation campaign started in January 2003, after having been officially approved (labelled) by the French Ministry of Research. During 2003 the CESTA scientific committee carried out a detailed redefinition and specification of the protocol, and a time schedule was agreed upon.</Paragraph> <Paragraph position="1"> The first half of 2004 is being dedicated to corpus untagging and to the programming of the CESTA evaluation tool. Reference human translations will also have to be produced, and the implemented evaluation tool submitted to trial and validation. After this preliminary work, the first run will start during autumn 2004. At the end of the first campaign, result analysis will be carried out. A workshop will then be organised for CESTA participants. The second campaign will take place at the end of spring 2005, the terminological adaptation phase being scheduled over five months.</Paragraph> <Paragraph position="2"> After result analysis and the writing of the final report, a public workshop will be organised, and the results will be disseminated and submitted for publication at the end of 2005.</Paragraph> </Section> </Paper>