<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1003">
<Title>COMPARING MUCK-II AND MUC-3: ASSESSING THE DIFFICULTY OF DIFFERENT TASKS</Title>
<Section position="2" start_page="0" end_page="25" type="intro">
<SectionTitle>OVERVIEW</SectionTitle>
<Paragraph position="0"> The natural language community has made impressive progress in evaluation over the last four years.</Paragraph>
<Paragraph position="1"> However, as the evaluations become more sophisticated and more ambitious, a fundamental problem emerges: how to compare results across changing evaluation paradigms. When we change domain, task, and scoring procedures, as has been the case from MUCK-I to MUCK-II to MUC-3, we lose comparability of results. This makes it difficult to determine whether the field has made progress since the last evaluation. Part of the success of the MUC conferences has been due to the incremental approach taken to system evaluation. Over the four-year period of the three conferences, the domain has become more &quot;realistic&quot;, the task has become more ambitious and specified in much greater detail, and the scoring procedures have evolved to provide a largely automated scoring mechanism. This process has been critical to demonstrating the utility of the overall evaluation process. However, we still need some way to assess the overall progress of the field, and thus we need to compare the results and task difficulty of MUC-3 relative to MUCK-II.</Paragraph>
<Paragraph position="2"> This comparison is complicated by the absence of any generally agreed-upon metrics for comparing the difficulty of two natural language tasks. There is no real analog, for example, to perplexity in the speech community, which provides a rough cross-task assessment of the difficulty of recognizing speech, given a corpus and some grammar or language model for that corpus. Natural language does not have a set of metrics whose effect on task difficulty is well understood. In the absence of such metrics, this paper outlines a set of dimensions that capture some of the differences between the two tasks. It remains a subject for further research to define appropriate metrics for cross-task comparison and to determine how these metrics correlate with performance.</Paragraph>
<Paragraph position="3"> Clearly it is impossible to come up with a single number that characterizes the relative difficulty of MUCK-II and MUC-3. Nonetheless, we can characterize both qualitative and quantitative differences along the following dimensions: [dimension list garbled in extraction; Table 1: Complexity of Data]. The best precision and recall figures for MUCK-II were in the range of 70-80% on the final test (using a corrected score, calculated in the MUC-3 style). For MUC-3, precision and recall were in the range of 45-65% for the final test. Although the error rates were about double for MUC-3, the task was many times harder. From this, we can conclude that the field has indeed made impressive progress in the two years since MUCK-II.</Paragraph>
</Section>
</Paper>
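For readers unfamiliar with the figures quoted in the final paragraph: MUC-style scoring reports precision and recall over template slot fills. The sketch below gives the usual definitions; the exact "corrected score" and partial-credit conventions used to rescore MUCK-II are not specified in this excerpt, so the 1/2 weighting for partial matches is an assumption based on standard MUC-3 scoring practice.

% Sketch of MUC-style slot-fill scoring (assumed 1/2 weight for partial matches).
%   N_correct -- response fills that match the answer key
%   N_partial -- fills judged partially correct
%   N_key     -- total fills in the answer key ("possible")
%   N_resp    -- total fills generated by the system ("actual")
\[
  \mathrm{recall} = \frac{N_{\mathrm{correct}} + \tfrac{1}{2} N_{\mathrm{partial}}}{N_{\mathrm{key}}},
  \qquad
  \mathrm{precision} = \frac{N_{\mathrm{correct}} + \tfrac{1}{2} N_{\mathrm{partial}}}{N_{\mathrm{resp}}}
\]

Under these definitions, scores of 70-80% correspond to error rates of roughly 20-30%, and scores of 45-65% to roughly 35-55%, which is the sense in which the MUC-3 error rates are described as about double those of MUCK-II.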