<?xml version="1.0" standalone="yes"?> <Paper uid="X98-1030"> <Title>MUC/MET Evaluation Trends</Title> <Section position="4" start_page="236" end_page="237" type="evalu"> <SectionTitle> EVALUATION RESULTS </SectionTitle> <Paragraph position="0"> The evaluation results are given in the table below in terms of the highest score approached by the best system, to the nearest percentage point. Early on in some of the tasks, only recall (R) and precision (P) were calculated, but usually the combined F-measure (F) was used, with recall and precision weighted equally.</Paragraph> <Paragraph position="1"> The scores for Scenario Template never rose above the mid-60's for F. The reasons for this barrier are several and were partially addressed when multiple tasks were attempted at varying levels of processing.</Paragraph> <Paragraph position="2"> Scoring and template design issues that interfere with meaningful measures of progress are discussed in the next section and are being addressed in future evaluation methods.</Paragraph> <Paragraph position="3"> The scores for Named Entity are in the mid-90's and are close to human performance. The machines and the annotators still differ significantly in their performance, but automation is preferable due to its speed. Applications of Named Entity have been successful in assisting humans in processing large amounts of textual data.</Paragraph> <Paragraph position="4"> The scores for Template Element and Template Relations are also high enough to make the technology reliable for use by analysts. The Template Elements extracted from newswire articles are indicative of the content of the article for most purposes.</Paragraph> <Paragraph position="5"> Although the coreference scores are lower than the Template Element scores, enough coreference is being processed to achieve reliable results in Template Element.</Paragraph> <Paragraph position="6"> The multi-lingual scores are impressive both in Scenario Template and in Named Entity. Some of the variability is due to changes of domain between training and test documents in MUC-7. The results were surprising because many of the developers were not native or fluent speakers of the languages on which their systems were evaluated. Also, differences in style across</Paragraph> </Section> <Section position="5" start_page="237" end_page="238" type="evalu"> <SectionTitle> EVALUATION ALGORITHMS </SectionTitle> <Section position="1" start_page="237" end_page="237" type="sub_section"> <SectionTitle> Evaluation Metrics </SectionTitle> <Paragraph position="0"> The evaluation metrics used for information extraction were adapted from Information Retrieval early in MUC-3 [1]. Both SGML markup in the text stream and template slot fills are scored automatically. The determination of scores for the coreference equivalence classes is based on a model-theoretic algorithm that counts the minimal number of links that must be added to make the classes in the answer key and the system response match [4].</Paragraph> </Section>
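For reference, the measures used throughout these results can be written out explicitly. The sketch below is ours, not taken from the cited scoring documentation: the count names (N_correct, N_key, N_response) are illustrative, the F-measure follows the standard weighted formulation in which equal weighting of recall and precision corresponds to beta = 1, and the coreference formula is our reading of the minimal-link counting described above.

```latex
% Slot-level recall, precision, and the combined F-measure
% (N_correct, N_key, N_response are illustrative count names, not the scorer's own):
\[
  R = \frac{N_{\mathrm{correct}}}{N_{\mathrm{key}}}, \qquad
  P = \frac{N_{\mathrm{correct}}}{N_{\mathrm{response}}}, \qquad
  F_{\beta} = \frac{(\beta^{2}+1)\,P\,R}{\beta^{2}\,P + R},
\]
% so with recall and precision weighted equally (beta = 1):
\[
  F = \frac{2\,P\,R}{P + R}.
\]
% Link-based coreference recall: S_i are the key equivalence classes and
% p(S_i) is the partition of S_i induced by the system response; precision
% is obtained by exchanging the roles of key and response.
\[
  R_{\mathrm{coref}} = \frac{\sum_{i} \bigl( |S_i| - |p(S_i)| \bigr)}
                            {\sum_{i} \bigl( |S_i| - 1 \bigr)}
\]
```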
<Section position="2" start_page="237" end_page="237" type="sub_section"> <SectionTitle> Statistical Significance Testing </SectionTitle> <Paragraph position="0"> At the end of MUC-3, a statistical significance testing algorithm was developed to determine the significance of the results of the evaluation [3]. The method is a computer-intensive method called approximate randomization and is based on a document-by-document comparison of performance for each pair of systems in the evaluation. The results are graphed, and the sets of systems that are not significantly different from each other in performance on the test set are enclosed in the same circle. The method does not state that results are significant to within a certain percentage, but rather looks at the characteristics of the performance of the systems across all documents.</Paragraph> </Section>
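To make the pairwise comparison concrete, here is a minimal sketch of an approximate randomization test over per-document scores for two systems. It illustrates the general technique rather than the evaluation's own implementation: the function and variable names are ours, the per-document score is assumed to be a single number such as a document-level F-measure, and the number of shuffles is arbitrary.

```python
import random

def approximate_randomization(scores_a, scores_b, trials=9999, seed=0):
    """Approximate randomization test on paired per-document scores.

    scores_a, scores_b: per-document scores (e.g., document-level F) for two
    systems, aligned so that index i refers to the same test document.
    Returns an estimated p-value for the observed difference in mean score.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)

    def mean_diff(a, b):
        return abs(sum(a) / n - sum(b) / n)

    observed = mean_diff(scores_a, scores_b)
    at_least_as_extreme = 0
    for _ in range(trials):
        # Document by document, randomly swap the two systems' scores; under
        # the null hypothesis the system labels are exchangeable per document.
        shuffled_a, shuffled_b = [], []
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                x, y = y, x
            shuffled_a.append(x)
            shuffled_b.append(y)
        if mean_diff(shuffled_a, shuffled_b) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (trials + 1)

# Hypothetical usage: system pairs whose p-value exceeds a chosen threshold
# would be drawn inside the same circle in the kind of graph described above.
# p = approximate_randomization(f_scores_system1, f_scores_system2)
```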
<Section position="3" start_page="237" end_page="238" type="sub_section"> <SectionTitle> Interannotator Scoring </SectionTitle> <Paragraph position="0"> During the course of the Tipster evaluations, a method for measuring interannotator agreement was provided by the scoring software. This measure and the accompanying error reports assisted during the development of task definitions using training data and during the development of training and test datasets.</Paragraph> <Paragraph position="1"> The scoring software is designed to work domain-independently, so it was easy to adapt for the different scenarios and template slot designations in ST, TE, and TR and the SGML markup for NE and CO. The key2key configuration option only needed to be given in order to score one answer key against another. Feedback is given in a strict fashion as to whether the annotators' keys agreed on the alternatives and optional elements allowed only in answer keys.</Paragraph> </Section> <Section position="4" start_page="238" end_page="238" type="sub_section"> <SectionTitle> User Interfaces </SectionTitle> <Paragraph position="0"> The scoring software has formatted reports and a GUI for viewing evaluation results in all languages. These tools for visualizing system errors assisted developers in debugging their systems and in presenting their results. The user interfaces were designed based on input from the participants and the customers.</Paragraph> </Section> <Section position="5" start_page="238" end_page="238" type="sub_section"> <SectionTitle> Remaining Scoring Issues: Alignment </SectionTitle> <Paragraph position="0"> The tree structure of the Scenario Templates requires choosing which response objects to score against the objects given in the answer key. In order not to over-penalize systems, alignment is done to optimize the F-measure. However, the optimization is not exhaustive and in some cases does not converge. Instead of scoring slot fills as both missing and spurious in less-than-ideal mappings, the scorer tends to map and score the mismatching slot fills as incorrect, and the F-measure is calculated in such a way as to minimize the negative effect of missing and spurious slots.</Paragraph> <Paragraph position="1"> In the future, the tree structure of the template will be greatly simplified so that the alignment problem is insignificant in understanding the results during development and testing.</Paragraph> </Section> <Section position="6" start_page="238" end_page="238" type="sub_section"> <SectionTitle> Linchpin Effect </SectionTitle> <Paragraph position="0"> Another issue that arose in task design was the problem of penalizing a system in multiple places for one mistake. The inherent interdependencies of information, especially in event descriptions, made this aspect of task design difficult. Clearly, over the course of Tipster, the annotations and templates changed in ways that ameliorated this effect. Future evaluations will still need to beware of it.</Paragraph> </Section> </Section> </Paper>