<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1001">
<Title>OVERVIEW OF THE THIRD MESSAGE UNDERSTANDING EVALUATION AND CONFERENCE</Title>
<Section position="3" start_page="5" end_page="7" type="metho">
<SectionTitle> CORPUS AND TASK </SectionTitle>
<Paragraph position="0"> The corpus was formed via a keyword query (1) to an electronic database containing articles in message format from open sources worldwide. These articles had been gathered, translated (if necessary), edited, and disseminated by the Foreign Broadcast Information Service (FBIS) of the U.S. Government. A training set of 1300 texts was identified, and additional texts were set aside for use as test data (2). The message headers were used to create or augment a dateline and the text type information appearing at the front of the article; the original message headers and routing information were removed. The layout was modified slightly to improve readability (e.g., by double-spacing between paragraphs), and problems that arose with certain characters when the data was downloaded were rectified (e.g., square brackets were missing and had to be reinserted). The body of the text was modified minimally and with the sole purpose of eliminating some idiosyncratic features that were well beyond the scope of interest of MUC-3 (3).</Paragraph>
<Paragraph position="1"> The corpus presents realistic challenges in terms of overall size (over 2.5 megabytes), length of the individual articles (approximately a half-page each on average), variety of text types (newspaper articles, TV and radio news, speech and interview transcripts, rebel communiques, etc.), range of linguistic phenomena represented (both well-formed and ill-formed), and open-endedness of the vocabulary (especially with respect to proper nouns). The texts used in MUCK-I and MUCK-II originated as teletype messages and thus were all upper case; the MUC-3 texts are also all upper case, but only as a consequence of downloading from the source database, where the texts appear in mixed upper and lower case.</Paragraph>
<Paragraph position="2"> The task was to extract information on terrorist incidents (incident type, date, location, perpetrator, target, instrument, outcome, etc.) from the relevant texts in a blind test on 100 previously unseen texts.</Paragraph>
<Paragraph position="3"> Approximately half the articles were irrelevant to the task as defined.</Paragraph>
<Paragraph position="4"> In some cases the terrorism keywords in the query used to form the corpus (see footnote 1) were used in irrelevant senses, e.g., &quot;explosion&quot; in the phrase &quot;social explosion&quot;.</Paragraph>
<Paragraph position="5"> (1) The query specified a hit as a message containing both a country/nationality name (e.g., Honduras or Honduran) for one of the nine countries of interest (Argentina, Bolivia, Chile, Colombia, Ecuador, El Salvador, Guatemala, Honduras, Peru) and some inflectional form of a common word associated with terrorist acts (abduct, abduction, ambush, arson, assassinate, assassination, assault, blow [up], bomb, bombing, explode, explosion, hijack, hijacking, kidnap, kidnapping, kill, killing, murder, rob, shoot, shooting, steal, terrorist).
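To make the hit condition concrete, the following is a minimal sketch of this retrieval criterion written as a plain Python keyword filter. The term lists are abbreviated, prefix matching stands in for &quot;any inflectional form,&quot; and the paper does not document the query language of the source database, so every detail below is illustrative rather than a description of the query actually used. The looseness of such a condition is also one reason, as noted above, that approximately half of the retrieved articles were irrelevant to the extraction task.

import re

# Abbreviated lists; the footnote above gives the full set of nine countries
# and of terrorism-related keywords (matched in any inflectional form).
COUNTRY_TERMS = ["COLOMBIA", "COLOMBIAN", "EL SALVADOR", "SALVADORAN",
                 "HONDURAS", "HONDURAN", "PERU", "PERUVIAN"]
TERROR_STEMS = ["ABDUCT", "AMBUSH", "ARSON", "ASSASSINAT", "ASSAULT",
                "BOMB", "EXPLO", "HIJACK", "KIDNAP", "KILL",
                "MURDER", "ROB", "SHOOT", "TERRORIST"]

def is_hit(message_text: str) -> bool:
    """A message counts as a hit if it names one of the countries of
    interest and also contains some form of a terrorism-related keyword."""
    text = message_text.upper()
    has_country = any(name in text for name in COUNTRY_TERMS)
    has_keyword = any(re.search(r"\b" + stem, text) for stem in TERROR_STEMS)
    return has_country and has_keyword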
Some of the articles in the MUC-3 corpus may no longer satisfy this query, since the message headers (including the subject line) were removed after the retrieval was done.</Paragraph>
<Paragraph position="6"> (2) Over 300 articles were set aside from the overall corpus to be used as test data. The composition of the test sets was intentionally controlled with respect to the frequency with which incidents concerning any given country are represented; otherwise, the selection was done simply by taking every nth article about that country.</Paragraph>
<Paragraph position="7"> (3) For example, transcriptions of radio and TV broadcasts sometimes contained sentences in which words were enclosed in parentheses to indicate that the transcriber could not be certain of them, e.g., &quot;They are trying to implicate the (Ochaski Company) with narcoterrorism.&quot; (This quote is from article number PA1807130691 of the Latin America volume of the Foreign Broadcast Information Service Daily Reports.) In cases such as this, where the text is parenthetical in form but not in function, the parentheses were deleted.</Paragraph>
<Paragraph position="8"> In other cases, an entity of one of the nine countries of interest -- the second necessary condition for a hit -- was mentioned, but the entity did not play a significant role in the terrorist incident. Other articles were irrelevant for reasons that were harder to formulate.</Paragraph>
<Paragraph position="9"> For example, some articles concerned common criminal activity or guerrilla warfare (or other military conflict).</Paragraph>
<Paragraph position="10"> Rules were developed to challenge the systems to discriminate among various kinds of violent acts and to generate templates only for those that would be of interest to a terrorism news analyst. The real-life scenario also required that only timely, substantive information be extracted; thus, rules were formulated that defined relevance in terms of whether the news was recent and whether it at least mentioned who/what the target was.</Paragraph>
<Paragraph position="11"> Other relevance criteria were developed as well, again with the intent of simulating a real-life task. The relevance criteria are described in the first part of appendix A, which is the principal documentation of the MUC-3 task. Appendix D contains some representative samples of relevant and irrelevant articles.</Paragraph>
<Paragraph position="12"> It can be seen that the relevance criteria are extensive and would sometimes be difficult to state, let alone implement. It was learned that greater allowances needed to be made for the fact that this was an evaluation task and not a real-life one. Systems that generated generally correct internal data structures for a relevant incident, only to filter out that data structure by making a single mistake on one of the relevance criteria, were penalized for having missed the incident entirely rather than being penalized for getting just one aspect of the incident description wrong.
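As a rough illustration of the kind of relevance rule described above, the sketch below checks only the two conditions named in the text: that the news is recent and that at least one target is identified. The 60-day window, the incident record, and its attribute names are assumptions made for this example; the actual MUC-3 criteria in appendix A are far more extensive.

from datetime import date, timedelta

def passes_basic_relevance(incident, article_date: date) -> bool:
    """Illustrative check only: the news must be recent and must at least
    identify who or what the target was. The 60-day window and the
    attribute names on 'incident' are assumed for this example and are
    not part of the official MUC-3 relevance criteria."""
    age = article_date - incident.date
    recent = timedelta(days=60) >= age >= timedelta(days=0)
    has_target = bool(incident.human_targets or incident.physical_targets)
    return recent and has_target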
Some allowance was made in the answer key for the fact that incidents or facts about incidents might be of questionable relevance, given the vagueness of some texts and gaps in the statement of the relevance criteria; the template notation allowed for optionality, and systems were not penalized if they failed to generate an optional template or an optional filler in a required template.</Paragraph>
<Paragraph position="13"> If an article was determined to be relevant, there was then the task of determining how many distinct relevant incidents were being reported. The information on these incidents had to be correctly disentangled and represented in separate templates.</Paragraph>
<Paragraph position="14"> The extracted information was to be represented in the template in one of several ways, according to the data format requirements of each slot. (See appendix A.) Some slot fills were required to be categories from a predefined set of possibilities called a &quot;set list&quot; (e.g., for the various types of terrorist incidents such as BOMBING, ATTEMPTED BOMBING, BOMB THREAT); others were required to be canonicalized forms (e.g., for dates) or numbers; still others were to be in the form of strings (e.g., for person names).</Paragraph>
<Paragraph position="15"> A relatively simple article and corresponding answer key template from the dry-run test set (labeled TST1) are shown in Figures 1 and 2. Note that the text in Figure 1 is all upper case, that the dateline includes the source of the article (&quot;Inravision Television Cadena 1&quot;), and that the article is a news report by Jorge Alonso Sierra Valencia. In Figure 2, the left-hand column contains the slot labels, and the right-hand column contains the correct answers as defined by NOSC.</Paragraph>
<Paragraph position="16"> Slashes mark alternative correct responses (systems are to generate just one of the possibilities), an asterisk marks slots that are inapplicable to the incident type being reported, a hyphen marks a slot for which the text provides no fill, and a colon introduces the cross-reference portion of a fill (except for slot 16, where the colon is used as a separator between more general and more specific place names).</Paragraph>
<Paragraph position="17"> More information on the template notation can be found in appendix A, and further examples of texts and templates can be found in appendices D and E.</Paragraph>
<Paragraph position="18"> TST1-MUC3-0080 BOGOTA, 3 APR 90 (INRAVISION TELEVISION CADENA 1) -- [REPORT] [JORGE ALONSO SIERRA VALENCIA] [TEXT] LIBERAL SENATOR FEDERICO ESTRADA VELEZ WAS KIDNAPPED ON 3 APRIL AT THE CORNER OF 60TH AND 48TH STREETS IN WESTERN MEDELLIN, ONLY 100 METERS FROM A METROPOLITAN POLICE CAI [IMMEDIATE ATTENTION CENTER]. THE ANTIOQUIA DEPARTMENT LIBERAL PARTY LEADER HAD LEFT HIS HOUSE WITHOUT ANY BODYGUARDS ONLY MINUTES EARLIER. AS HE WAITED FOR THE TRAFFIC LIGHT TO CHANGE, THREE HEAVILY ARMED MEN FORCED HIM TO GET OUT OF HIS CAR AND GET INTO A BLUE RENAULT.</Paragraph>
<Paragraph position="19"> HOURS LATER, THROUGH ANONYMOUS TELEPHONE CALLS TO THE METROPOLITAN POLICE AND TO THE MEDIA, THE EXTRADITABLES CLAIMED RESPONSIBILITY FOR THE KIDNAPPING. IN THE CALLS, THEY ANNOUNCED THAT THEY WILL RELEASE THE</Paragraph>
</Section>
<Section position="4" start_page="7" end_page="8" type="metho">
<SectionTitle> SENATOR WITH A NEW MESSAGE FOR THE NATIONAL GOVERNMENT.
LAST WEEK, FEDERICO ESTRADA VELEZ HAD REJECTED TALKS BETWEEN THE GOVERNMENT AND THE DRUG TRAFFICKERS. </SectionTitle>
<Paragraph position="0"> [Figure 1 caption (truncated): &quot;... Reports, which are the secondary source for all the texts in the MUC-3 corpus.&quot;]</Paragraph>
<Paragraph position="1"> [Figure 2: answer key template slot labels]
0. MESSAGE ID
1. TEMPLATE ID
2. DATE OF INCIDENT
3. TYPE OF INCIDENT
4. CATEGORY OF INCIDENT
5. PERPETRATOR: ID OF INDIV(S)
6. PERPETRATOR: ID OF ORG(S)
7. PERPETRATOR: CONFIDENCE
8. PHYSICAL TARGET: ID(S)
9. PHYSICAL TARGET: TOTAL NUM
10. PHYSICAL TARGET: TYPE(S)
11. HUMAN TARGET: ID(S)
12. HUMAN TARGET: TOTAL NUM
13. HUMAN TARGET: TYPE(S)
14. TARGET: FOREIGN NATION(S)
15. INSTRUMENT: TYPE(S)
16. LOCATION OF INCIDENT
17. EFFECT ON PHYSICAL TARGET(S)
The participants collectively created the answer key for the training set, each site manually filling in templates for a partially overlapping subset of the texts. This task was carried out at the start of the evaluation; it therefore provided participants with good training on the task requirements and provided NOSC with good early feedback. Generating and cross-checking the templates required an investment of at least two person-weeks of effort per site. These answer keys were updated a number of times to reduce errors and to maintain currency with changing template fill specifications. In addition to generating answer key templates, sites were also responsible for compiling a list of the place names that appeared in their set of texts; NOSC then merged these lists to create the set lists for the TARGET: FOREIGN NATION slot and LOCATION OF INCIDENT slot.</Paragraph>
</Section>
<Section position="5" start_page="8" end_page="9" type="metho">
<SectionTitle> MEASURES OF PERFORMANCE </SectionTitle>
<Paragraph position="0"> All systems were evaluated on the basis of performance on the information extraction task in a blind test at the end of each phase of the evaluation. It was expected that the degree of success achieved by the different techniques in May would depend on such factors as whether the number of possible slot fillers was small, finite, or open-ended and whether the slot could typically be filled by fairly straightforward extraction or not. System characteristics such as amount of domain coverage, degree of robustness, and general ability to make proper use of information found in novel input were also expected to be major factors. The dry run test results were not assumed to provide a good basis for estimating performance on the final test in May, but the expectation was that most, if not all, of the systems that participated in the dry run would show dramatic improvements in performance.</Paragraph>
<Paragraph position="1"> The test results show that some of these expectations were borne out, while others were not or were less significant than expected.</Paragraph>
<Paragraph position="2"> A semi-automated scoring program was developed under contract for MUC-3 to enable the calculation of the various measures of performance.
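As a rough illustration of the unit the scoring program compares, the sketch below renders one template as a Python dictionary keyed by the Figure 2 slot labels, with a plausible partial fill for the TST1 kidnapping article shown in Figure 1. The values are illustrative, not the official NOSC answer key, and the actual template format is specified in appendix A.

# Slot labels follow Figure 2; the fills are a plausible partial rendering
# of the TST1 kidnapping article, not the official NOSC answer key.
example_template = {
    "MESSAGE ID":                "TST1-MUC3-0080",
    "TEMPLATE ID":               1,
    "DATE OF INCIDENT":          "03 APR 90",             # dates were canonicalized
    "TYPE OF INCIDENT":          "KIDNAPPING",            # from the incident-type set list
    "PERPETRATOR: ID OF ORG(S)": "THE EXTRADITABLES",
    "HUMAN TARGET: ID(S)":       "FEDERICO ESTRADA VELEZ",
    "HUMAN TARGET: TOTAL NUM":   1,
    "LOCATION OF INCIDENT":      "COLOMBIA: MEDELLIN",    # colon separates general from specific
    # remaining slots omitted; in the key, "-" marks no fill and "*" an inapplicable slot
}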
The scoring program was distributed to participants early in the evaluation and proved invaluable in providing them with the performance feedback necessary to prioritize and reprioritize their development efforts as they went along.</Paragraph>
<Paragraph position="3"> The scoring program can be set up to score all the templates that the system generates or to score subsets of templates/slots.</Paragraph>
<Paragraph position="4"> User interaction is required only to determine whether a mismatch between the system-generated templates and the answer key templates should be judged completely or partially correct.</Paragraph>
<Paragraph position="5"> (A partially correct filler for slot 11 in Figure 2 might be &quot;VELEZ&quot; (&quot;LEADER&quot;), and a partially correct filler for slot 16 would be simply COLOMBIA.) An extensive set of interactive scoring guidelines was developed to standardize the interactive scoring.</Paragraph>
<Paragraph position="6"> These guidelines are contained in appendix C. The scoring program maintains a log of interactions that can be used in later scoring runs and augmented by the user as the system is updated and the system-generated templates change.</Paragraph>
<Paragraph position="7"> The two primary measures of performance were completeness (recall) and accuracy (precision). There were two additional measures, one to isolate the amount of spurious data generated (overgeneration) and the other to determine the rate of incorrect generation as a function of the number of opportunities to incorrectly generate (fallout).</Paragraph>
<Paragraph position="8"> The labels &quot;recall,&quot; &quot;precision,&quot; and &quot;fallout&quot; were borrowed from the field of information retrieval, but the definitions of those terms had to be substantially modified to suit the template-generation task.</Paragraph>
<Paragraph position="9"> The overgeneration metric has no correlate in the information retrieval field, i.e., a MUC-3 system can generate indefinitely more data than is actually called for, but an information retrieval system cannot retrieve more than the total number of items (e.g., documents) that are actually present in the corpus.</Paragraph>
<Paragraph position="10"> Fallout can be calculated only for those slots whose fillers form a closed set. Scores for the other three measures were calculated for the test set overall, with breakdowns by template slot.</Paragraph>
<Paragraph position="11"> Figure 3 presents somewhat simplified definitions.</Paragraph>
<Paragraph position="12"> The most significant thing that this table does not show is that precision and recall are actually calculated on the basis of points -- the term &quot;correct&quot; includes system responses that matched the key exactly (earning 1 point each) and system responses that were judged to be a good partial match (earning .5 point each). It should also be noted that overgeneration is not only a measure in its own right but is also a component of precision, where it acts as a penalty by contributing to the denominator.</Paragraph>
<Paragraph position="13"> Overgeneration also figures in fallout by contributing to the numerator.
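The following sketch restates the four measures in code, using the point scheme just described (1 point for an exact match, 0.5 for a good partial match). Figure 3 itself is not reproduced here, and the official scoring program handles details this sketch ignores, so the formulas below are illustrative approximations rather than the program's actual implementation.

def muc3_measures(correct, partial, incorrect, spurious, missing,
                  possible_incorrect=None):
    """Illustrative MUC-3 style measures over slot fills.
    correct/partial/incorrect compare system fills against the key; spurious
    fills have no counterpart in the key; missing fills are key entries the
    system never attempted. possible_incorrect, needed only for fallout, is
    the number of wrong fillers available for a closed-set slot."""
    points = correct + 0.5 * partial
    possible = correct + partial + incorrect + missing    # fills called for by the key
    actual = correct + partial + incorrect + spurious     # fills the system generated
    recall = points / possible if possible else 0.0
    precision = points / actual if actual else 0.0        # spurious data inflates the denominator
    overgeneration = spurious / actual if actual else 0.0
    fallout = None
    if possible_incorrect:
        fallout = (incorrect + spurious) / possible_incorrect
    return recall, precision, overgeneration, fallout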
Further information on the MUC-3 evaluation metrics and scoring methods, including information on three different ways penalties for missing and spurious data were assigned, can be found elsewhere in this volume in the paper on evaluation metrics by Nancy Chinchor [3].</Paragraph>
</Section>
<Section position="6" start_page="9" end_page="9" type="metho">
<SectionTitle> TEST PROCEDURE </SectionTitle>
<Paragraph position="0"> Final testing was done on a test set of 100 previously unseen texts that were representative of the corpus as a whole. Participants were asked to copy the test package electronically to their own sites when they were ready to begin testing.</Paragraph>
<Paragraph position="1"> Appendix B contains a copy of the test procedure. The testing had to be conducted and the results submitted within a week of the date when the test package was made available for electronic transfer. Each site submitted its system-generated templates, the outputs of the scoring program (score reports and the interactive scoring history file), and a trace of the system's processing (whatever type of trace the system normally produces that could serve to help validate the system's outputs).</Paragraph>
<Paragraph position="2"> Initial scoring was done at the individual sites, with someone designated as interactive scorer who preferably had not been part of the system development team. After the conference, the system-generated templates for all sites were labeled anonymously and rescored by two volunteers in order to ensure that the official scores were obtained as consistently as possible.</Paragraph>
<Paragraph position="3"> The system at each site was to be frozen before the test package was</Paragraph>
<Paragraph position="5"> were completed.</Paragraph>
<Paragraph position="6"> Furthermore, no backing up was permitted during testing in the event of a system error.</Paragraph>
<Paragraph position="7"> In such a situation, processing was to be aborted and restarted with the next text.</Paragraph>
<Paragraph position="8"> A few sites encountered unforeseen system problems that were easily pinpointed and fixed. They reported unofficial, revised test results at the conference that were generally similar to the official test results and do not alter the overall picture of the current state of the art. The basic test called for systems to be set up to generate templates that produced the &quot;maximum tradeoff&quot; between recall and precision, i.e., templates that achieved scores as high as possible and as similar as possible on both recall and precision. This was the normal mode of operation for most systems and for many was the only mode of operation that the developers had tried. Those sites that could offer alternative tradeoffs were invited to do so, provided they notified NOSC in advance of the particular setups they intended to test on.</Paragraph>
<Paragraph position="9"> In addition to the scores obtained for these metrics on the basic template generation task, scores were obtained for system performance on the linguistic phenomenon of apposition, as measured by the template fills generated by the systems in particular sets of instances. That is, sentences exemplifying apposition were marked for separate scoring if successful handling of the phenomenon seemed to be required in order to fill one or more template slots correctly for that sentence.
This test was conducted as an experiment and is described in the paper by Nancy Chinchor on linguistic phenomena testing [4].</Paragraph>
</Section>
</Paper>