<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1003">
  <Title>COMPARING MUCK-II AND MUC-3: ASSESSING THE DIFFICULTY OF DIFFERENT TASKS</Title>
  <Section position="3" start_page="25" end_page="26" type="metho">
    <SectionTitle>
DIMENSIONS OF COMPARISON
</SectionTitle>
    <Paragraph position="0"> Complexity of the Dat a The first respect in which MUCK-II and MUC-3 differ is in the type of messages chosen as input . MUCK-II used Navy message traffic (operational reports) as input, which are filled with jargon, bu t also cover a rather limited domain with a fairly small vocabulary (2000 words) . These messages wer e characterized by a telegraphic syntax and somewhat random punctuation ; run-on sentences, for example , occurred quite frequently. MUC-3 uses news wire reports as input, which are more general (in tha t they use less jargon), but cover a much wider range of subjects, with a much larger vocabulary (20,000 words) . Although the syntax is generally more &amp;quot;standard&amp;quot; in MUC-3, the richness of the corpus pose s new problems. Sentences can be extremely long and quite rich in their range of syntactic construction s (see Figure 1 for extreme examples of MUCK-II and MUC-3 sentences) .</Paragraph>
    <Paragraph position="1"> In MUCK-II, there were four distinct message types ; in MUC-3, there are sixteen . These include &amp;quot;text&amp;quot;, &amp;quot;communique&amp;quot; and &amp;quot;editorial&amp;quot; as well as translated transcriptions of radio communiques originall y in Spanish . The average sentence length is longer in MUC-3 (27 words compared to 12 words for MUCK-II), as is message length (12 sentences/message for MUC-3 compared to 3 sentences/message for MUCK -II) . These differences are summarized in Table 1 .</Paragraph>
    <Paragraph position="2"> None of the measures in Table 1 makes any attempt to measure grammatical complexity . Of the various measures, perhaps sentence length correlates most closely. In general, longer sentences are harde r to parse; for certain types of parser, the time to parse increases as the cube of sentence length . Even this does not begin to describe the rich range of syntactic constructions in MUC-3 . On the other hand, this richness is difficult to compare to the difficulties of handling telegraphic and heavily elliptical material in MUCK-II. Figure 1 contains a sentence from MUCK-II and a sentence from MUC-3, to illustrate th e different problems posed by the two corpora.</Paragraph>
    <Paragraph position="3"> In addition to the metrics reported in Table 1, there are other metrics which would help to captur e the notion of data complexity . We could measure the rate of growth of the vocabulary, for example , to determine how frequently new words appear; this would also give us insight into whether we had a</Paragraph>
  </Section>
  <Section position="4" start_page="26" end_page="28" type="metho">
    <SectionTitle>
MUCK-II
SEATTLE TAKEN UNDER FIRE BY KRESTAI, FRIENDLY FORCES AIR CONDITION
WARNING REF WEAPON FREE ON HOSTILE SURFACE
</SectionTitle>
    <Paragraph position="0"> He disclosed that, in August 1988, he was assigned the missions of planning the assassinatio n of the president of the republic and the beheading of the Honduran spiritual and political lead ership through the physical elimination of Honduran Archbishop Msgr Hector Enrique Santo s and presidential candidates Carlo Roberto Flores and Rafael Leonardo Callejas Romero, o f the Liberal and National parties respectively.</Paragraph>
    <Paragraph position="1">  There are also some system-dependent measures which would be interesting : number of grammar rules used, number of co-occurrence patterns, number of inference rules, and size of the knowledge base . Investigation of what to measure and how these measures relate to other things (e .g., accuracy, speed ) should be considered a subject of further research in its own right, if we expect to make meaningfu l comparisons across different domains .</Paragraph>
    <Section position="1" start_page="26" end_page="28" type="sub_section">
      <SectionTitle>
Corpus Dimensions
</SectionTitle>
      <Paragraph position="0"> The size of the MUC-3 corpus has had a profound impact on how the participating message under standing systems were built and debugged . MUC-3 provided one to two orders of magnitude more dat a than MUCK-II : 1300 training messages and some 400,000 words, compared to 105 messages and 3,00 0 words for MUC :-II. If hand-debugging was still (barely) possible in MUCK-II, it was clearly an overwhelming task for MUC-3. In addition, system throughput became a major consideration in MUC-3 . If it takes a system one day to process 100 messages, then running through the development corpus (130 0 messages) would take two weeks - a serious impediment to system development . This has placed a greate r premium on system throughput and on automated procedures for detecting errors and evaluating overal l performance. It has also led some systems to explore methods for skimming and for distinguishing &amp;quot;important&amp;quot; or high-information passages from less important passages, as well as robust or partial parsin g techniques .</Paragraph>
      <Paragraph position="1"> The corpus dimensions differed for the test sets as well . The test set for MUC-3 consisted of 10 0 messages (33,000 words), compared to five messages for MUCK-II (158 words) . The larger trainin g corpus placed heavier processing demands on the systems, but it also meant that the test data was mor e representative of the training data . In MUCK-II, 16% of the total number of words in the test represente d previously unseen tokens . In MUC-3, this figure dropped to 1 .6% .</Paragraph>
      <Paragraph position="2"> Nature of the Tas k There was a slight change in task focus between MUCK-II and MUC-3 . In MUCK-II, the tas k was template-fill, where each message generated at least one template (although one type of templat e was an &amp;quot;OTHER&amp;quot; template, indicating that it did not describe any significant event) . All messages were  considered to &amp;quot;relevant&amp;quot; .in the sense of generating a template; only 5% of the training messages generate d an &amp;quot;OTHER&amp;quot; template. MUC-3 required both relevance assessment and template fill : only about 50% of the messages were relevant . Part of the task involved determining whether a message containe d relevant information, according to a complicated set of rules that distinguished current terrorist event s from military attacks and from reports on no-longer-current terrorist events . Thus relevance assessmen t turned out to be a complex task, requiring four pages of instructions and definitions in MUC-3 (compare d to half a page of instructions for MUCK-II) . Filling a template for an irrelevant message was penalize d - to a greater or lesser extent, depending on which metrics were used in the summary scoring .</Paragraph>
      <Paragraph position="3"> Although this represents a change between the two tasks, it is difficult to come up with any numerica l measures to quantify this difference. On the one hand, the participants reported that this shift did no t contribute substantially to the difficulty . On the other hand, most sites devoted substantial effort to creating a specialized set of rules to distinguish relevant from irrelevant messages . Understanding these rules for relevance was certainly one of the least portable and most domain-specific aspects of the task , so it undoubtedly did contribute to the greater difficulty of MUC-3 .</Paragraph>
      <Paragraph position="4"> Difficulty of the Template Fill Tas k Reflecting the change in application domains, the templates changed from MUCK-II to MUC-3 . The templates differ in how many types of template there are, in the number of slots, in the allowable range of slot fills, and in number of fills per slot (since more than one fill is required in certain cases) . MUCK-I I had 6 types of events and 10 slots per template, of which five were filled from small closed class lists , three from larger closed class lists, and two by string fills from the text . MUC-3 had 10 types of event s and 17 slots (not counting slots reserved for indexing messages and templates), of which eight were smal l closed classes, two were larger closed classes, and seven required numerical or string fills from the text . For the MUCK-II test, there were 5 templates generated for 5 messages with just over one fill pe r slot (55 fills for 50 slots) . For the MUC-3 test of 100 messages, 65 out of 100 were relevant . For th e relevant messages, there were 133 templates generated (roughly 2 templates per message, counting the 1 9 optional templates) . The ratio of slots to slot fillers was approximately 2500 answers for 2260 slots (1 . 1 answers/slot). However, many answers in MUC-3 included cross-references to other slot fillers, whic h were required to get full credit for a correct answer . There were approximately 1000 of these cross references, so a more realistic estimate of number of &amp;quot;fills&amp;quot; was 3500 (1 .5 answers/slot) . This information  slot as the branching factor at that poin t Table 3 : Comparison of the Template Fill Task is summarized in Table 3 .</Paragraph>
      <Paragraph position="5"> It is possible to enumerate the ways in which the two template fill tasks differ, but it is extremel y difficult to assess how this affects the overall difficulty of filling the template . One crude approach is t o compute a perplexity-like measure of the tasks, looking at the filled template as a string of answers, usin g the number of possible fills for each slot is an estimate of the branching factor at that point . This yields a &amp;quot;difficulty&amp;quot; figure of 17 for MUCK-II as opposed to a figure of 30 for MUC-3 . This number corresponde d to the perceived increase in difficulty between the two tasks by the participants : MUC-3 was definitely viewed as harder, but not an order of magnitude harder .</Paragraph>
    </Section>
    <Section position="2" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
Scoring and Results
</SectionTitle>
      <Paragraph position="0"> Finally, the two tasks also scored the results differently . MUCK-II generally used a score based aroun d 1 : wrong = 0, no answer = 1, right = 2 . In MUC-3, the correct answer counted 1, the wrong answe r counted 0 . It is possible to recompute the scores for MUC-2 to make them comparable to MUC-3 . If we do this, we find that the top-scoring systems in MUCK-II had precision and recall scores in the 70-80% range. This compares to 45-65% for the top-scoring systems in MUC-3 for the run which maximized bot h precision and recall using the &amp;quot;MATCHED-MISSING&amp;quot; method of computing the score r . Since 100% is the upper bound, it is actually more meaningful to compare the &amp;quot;shortfall&amp;quot; from the upper bound ; for MUCK-II, this is 20-30% and for MUC-3, 35-55%. Thus MUC-3 performance is about half as good a s (has twice the shortfall) as MUCK-II .</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>