<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1001"> <Title>OVERVIEW OF THE THIRD MESSAGE UNDERSTANDING EVALUATION AND CONFERENCE</Title>
<Section position="2" start_page="0" end_page="5" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle>
<Paragraph position="0"> The Naval Ocean Systems Center (NOSC) has conducted the third in a series of evaluations of English text analysis systems. These evaluations are intended to advance our understanding of the merits of current text analysis techniques, as applied to the performance of a realistic information extraction task. The latest one is also intended to provide insight into information retrieval technology (document retrieval and categorization) used instead of or in concert with language understanding technology.</Paragraph>
<Paragraph position="1"> The inputs to the analysis/extraction process consist of naturally occurring texts that were obtained in the form of electronic messages.</Paragraph>
<Paragraph position="2"> The outputs of the process are a set of templates or semantic frames resembling the contents of a partially formatted database.</Paragraph>
<Paragraph position="3"> The premise on which these evaluations are based is that task-oriented tests enable straightforward comparisons among systems and provide useful quantitative data on the state of the art in text understanding. The tests are designed to treat the systems under evaluation as black boxes and to highlight system performance on discrete aspects of the task as well as on the task overall. These quantitative data can be interpreted in light of information known about each system's text analysis techniques in order to yield qualitative insights into the relative validity of those techniques as applied to the general problem of information extraction.</Paragraph>
<Paragraph position="4"> The process of conducting these evaluations has presented great opportunities for examining and improving on the evaluation methodology itself. Although still far from perfect, the MUC-3 evaluation was markedly better than the previous one, especially with respect to the way scoring was done and the degree to which the test set was representative of the training set. Much of the credit for improvement goes to the evaluation participants themselves, who have been actively involved in nearly every aspect of the evaluation. The previous MUC, known as MUCK-II (the naming convention has since been simplified), proved that systems existed that could do a reasonable job of extracting data from ill-formed, paragraph-length texts in a narrow domain (naval messages about encounters with hostile forces) and that measuring performance on such a task was feasible and worthwhile.</Paragraph>
<Paragraph position="5"> However, the use of a very small test set (just 5 texts) and an extremely unsophisticated scoring procedure combined to make it inadvisable to publicize the results.</Paragraph>
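To make the preceding task description concrete, the following is a minimal, hypothetical Python sketch of a filled template compared slot by slot against an answer key. The slot names and the exact-match rule are assumptions made here for exposition; they are not the official MUC-3 template definition or scoring procedure.

# Minimal, hypothetical sketch of a filled extraction template and a slot-level
# comparison against an answer key. Slot names and the exact-match rule are
# illustrative assumptions, not the official MUC-3 template or scoring method.

def score_template(system_template, answer_key):
    """Return (precision, recall) over slot fills, using exact string match."""
    matched = sum(1 for slot, value in system_template.items()
                  if value is not None and answer_key.get(slot) == value)
    filled = sum(1 for value in system_template.values() if value is not None)
    expected = sum(1 for value in answer_key.values() if value is not None)
    precision = matched / filled if filled else 0.0
    recall = matched / expected if expected else 0.0
    return precision, recall

# One template per reported event; unfilled slots stay None, which is what makes
# the output resemble a partially formatted database.
answer_key = {"incident_type": "attack", "location": "downtown area", "perpetrator": None}
system_out = {"incident_type": "attack", "location": None, "perpetrator": "unknown group"}
print(score_template(system_out, answer_key))  # (0.5, 0.5)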
<Paragraph position="6"> (Results obtained in experiments conducted on one MUCK-II system after the evaluation was completed are discussed in [1].) The MUC-3 evaluation was significantly broader in scope than previous ones in most respects, including text characteristics, task specifications, performance measures, and range of text understanding and information extraction techniques. MUC-3 presented a significantly more challenging task than MUCK-II, which was held in June of 1989.</Paragraph>
<Paragraph position="7"> The results show that MUC-3 was not an unreasonable challenge to 1991 technologies. The means used to measure performance have evolved far enough that we no longer hesitate to present the system scores, and work on the evaluation methodology is planned that will take the next step of determining the statistical significance of the results.</Paragraph>
<Paragraph position="8"> In another effort to determine their significance, some work has already been undertaken by Hirschman [2] to measure differences in the complexity of MUC-like evaluation tasks so that the results can be used to quantify progress in the field of text understanding. This objective, however, brings up another critical area of improvement for future evaluations, namely refining the evaluation methodology in such a way as to better isolate the systems' text analysis capabilities from their data extraction capabilities. This will be done, since the MUC-3 corpus and task are sufficiently challenging that they can be used again (with a new test set) in a future evaluation.</Paragraph>
<Paragraph position="9"> That evaluation will seek to examine more closely the text analysis capabilities of the systems, to measure improvements in performance by MUC-3 systems, and to establish performance baselines for any new systems.</Paragraph>
<Paragraph position="10"> This paper covers most of the basics of the MUC-3 evaluation, which were presented during a tutorial session and in an overview presentation at the start of the regular sessions. This paper is also an overview of the conference proceedings, which include papers contributed by the sites that participated in the evaluation and by individuals who were involved in the evaluation in other ways. Parts I, II, and III of the proceedings are organized in the order in which the sessions were held, but the ordering of papers within Parts II and III is alphabetical by site and does not necessarily correspond to the order in which the presentations were made during the conference.</Paragraph>
<Paragraph position="11"> The proceedings also include a number of appendices containing materials pertinent to the evaluation.
OVERVIEW OF MUC-3
The planning for MUC-3 began while MUCK-II was still in progress, with suggestions from MUCK-II participants for improvements. A MUC-3 program committee was formed from among those MUCK-II participants who provided significant feedback on the MUCK-II effort. The MUC-3 program committee included Laura Blumer Balcom (Advanced Decision Systems), Ralph Grishman (New York University), Jerry Hobbs (SRI International), Lisa Rau (General Electric), and Carl Weir (Unisys Center for Advanced Information Technology). Since one of the suggestions for MUC-3 was to add an element of document filtering to the task of data extraction, David Lewis (then at the University of Massachusetts and now at the University of Chicago) was invited to join the committee as a representative of the information retrieval community.</Paragraph>
<Paragraph position="12"> NOSC began looking for a suitable corpus in late 1989 and obtained assistance from other government agencies to acquire it during the summer of 1990. At that time, a call for participation was sent to academic, industrial, and commercial organizations in the United States that were known to be engaged in system design or development in the area of text analysis or information retrieval. Participation on the part of many of the respondents was contingent upon receiving outside financial support; approximately two-thirds of the sites were awarded financial support by the Defense Advanced Research Projects Agency (DARPA). These awards were modest, some sites having requested funds only to pay travel expenses and others having requested funds to cover up to half of the total cost of participating. The total cost was typically estimated to be approximately equivalent to one person-year of effort.</Paragraph>
<Paragraph position="13"> The evaluation was officially launched in October 1990, with a three-month phase dedicated to compiling the &quot;answer key&quot; templates for the texts in the training set (see next section), refining the task definition, and developing the initial MUC-3 versions of the data extraction systems. These systems underwent a dry-run test in February 1991, after which a meeting was held to discuss the results and hammer out some of the remaining evaluation issues. Twelve sites participated in the dry run. One site dropped out after the dry run (TRW), and four new sites entered, three of which had already been involved to some extent (BBN Systems and Technologies, McDonnell Douglas Electronic Systems Company, and Synchronetics, Inc.) and one that had not (Hughes Research Laboratories).</Paragraph>
<Paragraph position="14"> The second phase began in mid-February and, while system development continued at each of the participating sites, updates were made to the scoring program, the task definition, and the answer key templates for the training set. Final testing was carried out in May 1991, concluding with the Third Message Understanding Conference (MUC-3), which was attended by representatives of the participating sites and interested government organizations. During the conference, the evaluation participants decided that the test results should be validated by having the system-generated templates rescored by a single party. Two of the participants were selected to work as a team to carry out this task, and the results of their effort are the official test scores presented in this volume. Pure and hybrid systems based on a wide range of text interpretation techniques (e.g., statistical, key-word, template-driven, pattern-matching, in-depth natural language processing) were represented in the MUC-3 evaluation.</Paragraph>
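To give a flavor of the two simplest technique families named above (key-word filtering and pattern matching), here is a toy Python sketch; the trigger-word list and date pattern are invented for illustration and do not reconstruct any participating MUC-3 system.

# Toy, hypothetical illustration of key-word filtering and pattern matching;
# the word list and the date pattern are invented for exposition and do not
# reconstruct any MUC-3 system.
import re

TRIGGER_WORDS = {"bomb", "attack", "explosion", "kidnapped"}   # hypothetical key-word list
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (january|february|march|april|may|june|july|"
    r"august|september|october|november|december) \d{4}\b", re.IGNORECASE)

def keyword_relevant(text):
    """Key-word filtering: flag a message when any trigger word appears."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & TRIGGER_WORDS)

def extract_date(text):
    """Pattern matching: pull the first date-like string, if any, into a slot."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None

message = "A bomb exploded near the embassy on 12 March 1991, police said."
print(keyword_relevant(message), extract_date(message))  # True 12 March 1991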
<Paragraph position="15"> The fifteen sites that completed the evaluation are Advanced Decision Systems (Mountain View, CA), BBN Systems and Technologies (Cambridge, MA), General Electric (Schenectady, NY), General Telephone and Electronics (Mountain View, CA), Intelligent Text Processing, Inc. (Santa Monica, CA), Hughes Research Laboratories (Malibu, CA), Language Systems, Inc. (Woodland Hills, CA), McDonnell Douglas Electronic Systems (Santa Ana, CA), New York University (New York City, NY), PRC, Inc. (McLean, VA), SRI International (Menlo Park, CA), Synchronetics, Inc. together with the University of Maryland (Baltimore, MD), Unisys Center for Advanced Information Technology (Paoli, PA), the University of Massachusetts (Amherst, MA), and the University of Nebraska (Lincoln, NE) in association with the University of Southwest Louisiana (Lafayette, LA).</Paragraph>
<Paragraph position="16"> Parts II and III of this volume include papers by each of these sites. In addition, an experimental prototype of a probabilistic text categorization system was developed by David Lewis, who is now at the University of Chicago, and was tested along with the other systems. That work is described in a paper in Part IV.</Paragraph>
</Section>
</Paper>