<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1007">
  <Title>Matched Only Matched / Missing All Templates</Title>
  <Section position="4" start_page="60" end_page="63" type="metho">
    <SectionTitle>
EFFORT
</SectionTitle>
    <Paragraph position="0"> We spent overall approximately between 1 and 1 .5 person-years on MUC-3 . This time was divided as follows:  The primary limiting factor in performance on MUC-3 was the limited ability of programs to perfor m linguistic and extra-linguistic tasks at a pragmatic or discourse level . These tasks include event referenc e resolution and inference . For example, some correct templates in TST2 depended on distinguishing tw o events based on the knowledge that. Cartagena is a resort, assuming that two men leaving a package in a restaurant could be planting a bomb, and generating an extra template for a series of kidnappings because on e of them took place on a particular day. These many discourse and event-based issues overwhelm the relativel y minor problems of parsing and semantic interpretation . Robustness of linguistic processing for MUC-3 was surprisingly easy to achieve, while the intricacies of template generation were surprisingly difficult to master .  Trainin g Our method of training was to run our system over the messages in the development and TST1 corpus . We used the results of these runs to detect problems and determine what new capabilities we needed to mak e these test stories work . We did not perform any automated training, although we did make heavy use of a keyword-in-context browser and some use of data from a tagged corpus .</Paragraph>
    <Paragraph position="1"> As explained above, lexical coverage and parsing did not seem to stand in the way of major performance gains for MUC-3, so we did not focus our efforts in these areas .</Paragraph>
    <Paragraph position="2"> Our system improved fairly steadily over time, as the graph shown in Figure 2 illustrates .</Paragraph>
    <Paragraph position="3"> These improvements were gained through a combination of adding knowledge, fixing bugs, adding som e capabilities (like template splitting and merging) and coding MUC-specific tasks (like distinguishing guerrill a warfare from terrorist activity) .</Paragraph>
  </Section>
  <Section position="5" start_page="63" end_page="63" type="metho">
    <SectionTitle>
RETROSPECTIVE ON THE TASK AND RESULTS
</SectionTitle>
    <Paragraph position="0"> In retrospect, over the last six month period, there were no major changes to our system that we would hav e made for MUC as a result of our experience with this corpus and task .</Paragraph>
    <Paragraph position="1"> With a minimum of customization (perhaps one or two person months of effort), our system quickl y reached the level of performance on MUC-3 achieved by the other top systems . This ultimately proved a bit discouraging, as progress from that point on was quite slow, but it is evidence that the NLTooLsET system , designed for easy adaptation to new tasks and domains, does what it is supposed to do .</Paragraph>
    <Paragraph position="2"> The most successful portion of our system that was designed for this task was the text reduction mechanism [1] . The NLTooLsET now uses a lexico-semantic pattern matcher as a text pre-preprocessor to reduc e the complexity of the sentences passed to the parser . This allowed us to keep the system running in real time, prevented the parser from dealing with overly complex sentences, and achieved more accurate results .</Paragraph>
    <Paragraph position="3"> In addition, the pre-processor allowed a discourse processing module to divide the input text roughly int o events prior to parsing, which seemed to have a considerable positive effect on later processing (see the pape r on discourse in this volume ) The speed of our system, over 1000/words per minute on this task on conventional hardware without an y major optimizations, is already way ahead of human performance and suggests that this technology will b e able to process large volumes of text .</Paragraph>
    <Paragraph position="4"> We were similarly pleased that the sentence-level performance of the NLTooLsET was as good as it was .</Paragraph>
    <Paragraph position="5"> While we fixed minor problems with the lexicon, grammar, parser, and semantic interpreter, robustness o f linguistic processing did not seem to be a major problem . In part, this seems to be because the MUC-3 domain is still quite narrow . It is much broacler than MUCK-II, and the linguistic complexity is a challenge , but knowledge base and control issues are relatively minor because there are simply not that many differen t ways that bombings, murders, and kidnappings occur .</Paragraph>
    <Paragraph position="6"> The fact that sentence-level interpretation wasn't a major barrier in MUC-3 has both good and ba d implications. Fortunately, we can expect that progress in new (perhaps extra-linguistic) areas will soo n bring system performance on this sort. of task ahead of human performance, and make this research pay off in real applications . Unfortunately, it. is unclear whether this new progress will spill over into other domain s and applications, or whether it will lead to narrowly-focused development for future MUCs . The combination of a narrow domain with broad linguistic issues could make non-linguistic solutions more attractive for this sort of task . The only way to test the degree to which these solutions are reusable is to keep testing system transportability and evaluating performance on new and broader tasks .</Paragraph>
  </Section>
  <Section position="6" start_page="63" end_page="64" type="metho">
    <SectionTitle>
ISSUES IN EVALUATION
</SectionTitle>
    <Paragraph position="0"> After analyzing the results of our system and the primary measures of comparison between systems (recall , precision and overgeneration in the MATCHED/MISSING row), we realized that several factors in syste m performance were being confounded and/or not being measured . We isolated six, interrelated measures o f system performance as follows :  2. Precision : Gross, overall precision call be estimated by the MATCHED/MISSING column . 3. Template Overgeneration : The OVERGENERATION column in the ALL-TEMPLATES score repor t is the best overall measure of template overgeneration .</Paragraph>
    <Paragraph position="1"> 4. Slot Overgeneration : Subtracting the TEMPLATE-ID scores from the MATCHED/MISSING over generation column results in slot overgeneration .</Paragraph>
    <Paragraph position="2"> 5. Quality of Fills: Recall and precision in the MATCHED-ONLY row, when template-ID and spuriou s templates are subtracted, provides an approximation to how well the templates that are filled out ar e filled out .</Paragraph>
    <Paragraph position="3"> 6 . Template Match : There are two aspects to how well a system matches the templates that are in the  answer key. One is the number of templates systems generate, and the other is how accurate the type s of those templates are. The precision of the TEMPLATE-ID row gives a measure of how close the number of answer templates were given . Precision and recall of the INCIDENT-TYPE slot also giv e the accuracy of the templates matched .</Paragraph>
    <Paragraph position="4"> Any measure of the performance of a data extraction system must have a meaningful way of combining the effects of template level decisions with slot-filling ability, but must also distinguish slot-filling from templat e decisions for system comparison . Template-level decisions are : * When to create a template * What type of template to creat e * When to merge multiple templates * When to eliminate a templat e Template-level decisions reflect a system's ability to carve out messages into discrete topics or individua l events . This includes text-level issues such as when a new event is being introduced as opposed to giving further detail on an already mentioned event, and determining the topic or type of that event . Slot-level decisions relate to the quality of the template fills once the decision has been made as to whic h and how many templates to generate . In general, slot-level decisions are closer to and represent more th e core language processing capabilities than template-level decisions .</Paragraph>
    <Paragraph position="5"> The interaction of recall, precision and overgeneration presents additional challenges in evaluating systems, and MUC-3 should provide ample data to test the utility of combined metrics . In addition, it i s important to be able use the scores on the MUC task both for comparing systems and for proving th e ultimate utility of the systems . The MUC-3 results might seem low to those not really familiar with th e tests, while many of the systems could already be extremely useful even without major improvements i n performance.</Paragraph>
    <Paragraph position="6"> Finally, estimates providing a margin of error for all the scores on a MUC-like task are necessary in order to compare results meaningfully . This error comes from the inherent imprecision in any &amp;quot;right answer &amp;quot; against which scores are computed, and the inevitable difference in the performance of systems from one tes t set to another .</Paragraph>
  </Section>
  <Section position="7" start_page="64" end_page="65" type="metho">
    <SectionTitle>
LINGUISTIC PHENOMENA TEST
</SectionTitle>
    <Paragraph position="0"> Our results on the linguistic test of apposition are interesting, as we estimate that we recognize 90% of thes e syntactic structures with regular expression patterns in a context-independent pre-processing stage, prior t o the application of any syntactic parsing using our context-free grammar.</Paragraph>
    <Paragraph position="1"> The slot configuration files confounded the pure test of recall and precision with respect to appositio n by not factoring out entire templates that were missed (presumably an issue not related to the treatmen t of the appositive) . Also complicating a &amp;quot;pure&amp;quot; test is the penalty for spurious fills included in those slots where the appositives were present ; again, an error unrelated to the fill that contained the appositive .</Paragraph>
    <Paragraph position="2">  We corrected for these interfering effects to get a truer measure of the performance. This was done by eliminating from the test score those slots not present because of missing templates, and eliminatin g the spurious slot fills . With these corrections, we calculated recall for the &amp;quot;easy&amp;quot; cases to be 96% from 72 % (unrevised) . The hard cases went from 43% recall to 89% recall (again, unrevised) . This difference is entirely attributed to one example. Based on this, we would not want to draw any substantive conclusions on our performance of easy vs . hard appositives .</Paragraph>
    <Paragraph position="3"> Our results on the linguistic phenomena tests show that our performance on the same sentence appositive s was better than the same information distributed across multiple sentences . This was expected, as our syste m does not use the semantic interpretation of &amp;quot;to be&amp;quot; sentences to modify the type assignments of targets . The cases here where the assignments were correct were cases of our default typing, CIVILIAN .</Paragraph>
    <Paragraph position="4"> The preposed appositives were more accurate than the postposed . We would have expected that postposed would be easier because it is easier to determine their boundaries . Preposed appositives, on the othe r hand, are typically shorter and do not appear next to or in list constructs .</Paragraph>
    <Paragraph position="5"> We would not want to draw any conclusions from these results on the intrinsic power of the pertinent techniques . These techniques are detailed in the system walkthrough paper (cf. this volume) . We feel that a fair amount of effort has gone into system development for the apposition, so, from this regard, these tests seem to reflect that linguistic phenomena are not as important for overall performance as other factors . That is, larger gains in terms of recall and precision scores seem to come with less effort from focusing o n discourse and event structure rather than local linguistic issues such as apposition .</Paragraph>
  </Section>
  <Section position="8" start_page="65" end_page="65" type="metho">
    <SectionTitle>
REUSABILITY
</SectionTitle>
    <Paragraph position="0"> We estimate that about 50% of the effort spent on this task will not be reusable at all (except, perhaps , for future MUCs), although 80% of the improvements to the parser recovery (or 20% of the total effort) are reusable. Note, however, that these are not, necessarily the changes we would have chosen to make! Abou t 10-15% of the total effort is work that is necessary for any template generation task from text in a new domain . The other 35-40% of the non-reusable effort stems from MUC-3 specific rules not tied to the effor t of data extraction in general or in particular. The items that went into this effort are discussed more i n Section below .</Paragraph>
  </Section>
  <Section position="9" start_page="65" end_page="67" type="metho">
    <SectionTitle>
LESSONS LEARNED
</SectionTitle>
    <Paragraph position="0"> The GE Syste m This task has proven our system's transportability, robustness and accuracy quite well . The things that worked particularly well for MUC-3 were: pattern matching pre-processor discourse processin g lexicon parser semantic interpreter partial parser The MUC experience also pointed out some clear deficits with some aspects of text-level interpretatio n that are particularly critical in multi-template texts, in particular : discourse and complex event representation reference resolutio n handling background event s In addition, there were three problems with our system that were largely fixed during MUC-3 : list processing (including coordination) phrase attachment and parser control  The MUC Task Certain aspects of this MUC task did not test the text processing capabilities of the systems . These fall into the category of task-specific rules to eliminate correctly filled-out templates . The application of these rules is outside the language processing components of the systems ; however, the misapplication of the rule s can have a great effect on the score . We estimate three-quarters of our missing templates and most of th e spurious templates are due to the misapplication of the following &amp;quot;rules&amp;quot;, further described below-- stale data, guerrilla warfare, non-specific events, and template splitting We estimate that these specific problems account for approximately 50% of the missing recall in ou r results (i .e. half of the difference between our recall and 100% recall) . The rest of the missing recall is a combination of sharing information across templates, language analysis failures, knowledge failures , and subtle differences in interpretating events . Looking at recall, this is supported by our score on th e MATCHED-ONLY row, which is an underestimate because it still includes many problems in incorrectl y splitting or merging templates .</Paragraph>
    <Paragraph position="1"> The four major MUC-specific issues are : Stale Date: Eliminate all templates that report on events over two months old, unless they add new information. The application of this rule depends on correctly determining the date of the event ; an error in this slot will cause the incorrect deletion of the entire template, while extra templates an d slots can result from missing the &amp;quot;stale date&amp;quot; .</Paragraph>
    <Paragraph position="2">  reported locations and dates for any given incident.</Paragraph>
    <Paragraph position="3"> We believe that, to test text processing systems, fine lines of distinction between relevant and irrelevan t texts should be left to human beings, and that the MUC task should focus on accurate information extraction , not subtle judgements of relevance or validity . One proposal, which has been tentatively adopted for MUC 4, is to encode these distinctions as slot fills as opposed to template/no-template decisions ; for example, GUERRILLA-WARFARE could be a TYPE-OF-INCIDENT as opposed to an IRRELEVANT template . This will minimize the influence of the extra-linguistic post, editing and maximize the testing of the cor e system ability to extract information from text .</Paragraph>
    <Paragraph position="4"> Evaluation The most important lesson we learned on this task, and probably the biggest contribution of MUC t o the state of the art, is the importance of having an &amp;quot;answer key&amp;quot; to direct the focus of research efforts . Without the answer key, we would proceed by fixing problems with our system, sentence by sentence. This methodology succeeds in making particular sentences and texts work, and can also fix general problems wit h the system . However, concentrating on sentences and phenomena, rather than tasks and answers, can als o introduce unintended effects, and can focus research on phenomena that prove irrelevant to a task . The answer key allows system developers to focus attention on fixing widespread problems as well a s quickly testing the global effect of every change .</Paragraph>
    <Paragraph position="5"> Another important lesson from this evaluation is that drastically different techniques could produc e similar answers, while many important differences between systems are &amp;quot;buried&amp;quot; in the more detailed report s of scores . This happened because MUC-3 really combined many different tasks, from template generation and slot filling to temporal interpretation, knowledge-base issues, and even event recognition (e .g. knowin g that Jesuits are a good target) . One of the challenges for this sort of evaluation is to determine not only wha t produces good overall results, but also which portions of the task are best covered by which technologies .</Paragraph>
  </Section>
  <Section position="10" start_page="67" end_page="67" type="metho">
    <SectionTitle>
THOUGHTS FOR MUC-4
</SectionTitle>
    <Paragraph position="0"> Two competing designs for future MUCs are to retain the same domain, perhaps deepening the task, and t o move on to a new domain with the same basic template-filling task. Retaining the same basic domain an d task has the apparent advantage of minimizing the effort required just to perform the test, at least for those groups that have already invested the effort . The stable task also allows MUC to be used as a benchmark for measuring the progress of the field . On the other hand, keeping the task and domain stable could put ne w groups (i .e . those not involved in MUC-3) at a disadvantage, and runs the risk of having effort unknowingl y devoted to MUC-specific problems .</Paragraph>
    <Paragraph position="1"> The alternative, to select new tasks and broader domains for future MUCs, has the benefit of allowin g new projects to enter on a roughly equal basis, to check the validity of the MUC-3 task, and to measur e transportability across domains . However, this choice would require additional work of all participants, and would probably require holding the evaluations less frequently.</Paragraph>
    <Paragraph position="2"> Presently, it seems that MUC-4 will follow the line of MUC-3, measuring the progress of the field (an d the individual participants) but not showing the relationships between domains or transportability, and no t introducing new capabilities . The field is moving quickly enough, however, that broader domains and ne w tasks will soon be necessary to have better measures of problems, progress, and applications .</Paragraph>
    <Paragraph position="3"> Another major issue in MUGs is how often they should occur . We believe that it is far more dangerous to have the tests too frequently than to have them infrequently . While infrequent tests produce less data an d provide less of a chance for new entrants, frequent evaluations of this sort are more likely to inhibit research by pushing short-term system issues in front of larger, critical advances . Perhaps the best compromise is to have continual evaluations, but expect that each site will participate only once in every two or thre e evaluations .</Paragraph>
    <Paragraph position="4"> We believe that MUCs can only be a useful test of text interpretation technology if they measure transportability and customizability as well as accuracy . Otherwise, it will not be clear how much functionalit y is produced by special-purpose features . This could be achieved by moving to a new domain and shortening the length of the development time . Also, the task should minimize or eliminate domain-specific rules that move systems away from their information extraction role . This will give truer measures of a text processing system's ability to move into a new domain and extract useful factual information from free text .</Paragraph>
  </Section>
class="xml-element"></Paper>