<?xml version="1.0" standalone="yes"?>
<Paper uid="M95-1002">
  <Title>OVERVIEW OF RESULTS OF THE MUC-6 EVALUATION</Title>
  <Section position="4" start_page="13" end_page="13" type="metho">
    <SectionTitle>
CORPUS
</SectionTitle>
    <Paragraph position="0"> Testing was conducted using Wall Street Journal texts provided by the Linguistic Data Consortium . The articles used in the evaluation were drawn from a corpus of approximately 58,000 articles spanning the period o f January 1993 through June 1994. This period comprised the &amp;quot;evaluation epoch .&amp;quot; As a condition for participatio n in the evaluation, the sites agreed not to seek out and exploit Wall Street Journal articles from that epoch once th e training phase of the evaluation had begun, i .e., once the scenario for the Scenario Template task had been disclosed to the participants.</Paragraph>
    <Paragraph position="1"> The training set and test set each consisted of 100 articles and were drawn from the corpus using a text retrieval system called Managing Gigabytes, whose retrieval engine is based on a context-vector model, producing a ranked list of hits according to degree of match with a keyword search query . It can also be used to do unranked , Boolean retrievals . The Boolean retrieval method was used in the initial probing of the corpus to identif y candidates for the Scenario Template task, because the Boolean retrieval is relatively fast, and the unranked result s are easy to scan to get a feel for the variety of nonrelevant as well as relevant documents that match all or some o f the query terms. Once the scenario had been identified, the ranked retrieval method was used, and the ranked lis t was sampled at different points to collect approximately 200 relevant and 200 nonrelevant articles, representing a variety of article types (feature articles, brief notices, editorials, etc .). From those candidate articles, the trainin g and test sets were selected blindly, with later checks and corrections for imbalances in the relevant/nonrelevan t categories and in article types .</Paragraph>
    <Paragraph position="2"> From the 100 test articles, a subset of 30 articles (some relevant to the Scenario Template task, others not ) was selected for use as the test set for the Named Entity and Coreference tasks . The selection was again done blindly, with later checks to ensure that the set was fairly representative in terms of article length and type . Note that although Named Entity, Coreference and Template Element are defined as domain-independent tasks, the articles that were used for MUC-6 testing were selected using domain-dependent criteria pertinent to the Scenari o Template task . The manually filled templates were created with the aid of Tabula Rasa, a software tool develope d for the Tipster Text Program by New Mexico State University Computing Research Laboratory .</Paragraph>
  </Section>
  <Section position="5" start_page="13" end_page="14" type="metho">
    <SectionTitle>
NAMED ENTITY
</SectionTitle>
    <Paragraph position="0"> The Named Entity (NE) task requires insertion of SGML tags into the text stream . The tag elements are ENAMEX (for entity names, comprising organizations, persons, and locations), TIMEX (for tempora l expressions, namely direct mentions of dates and times), and NUMEX (for number expressions, consisting onl y of direct mentions of currency values and percentages) . A TYPE attribute accompanies each tag element an d identifies the subtype of each tagged string: for ENAMEX, the TYPE value can be ORGANIZATION, PERSON ,  or LOCATION; for TIMEX, the TYPE value can be DATE or TIME ; and for NUMEX, the TYPE value can be MONEY or PERCENT .</Paragraph>
    <Paragraph position="1"> Text strings that are to be annotated are termed markables . As indicated above, markables include names o f organizations, persons, and locations, and direct mentions of dates, times, currency values and percentages . Nonmarkables include names of products and other miscellaneous names (&amp;quot;Macintosh,&amp;quot; &amp;quot;Wall Street Journal&amp;quot; (i n reference to the periodical as a physical object), &amp;quot;Dow Jones Industrial Average&amp;quot;) ; names of groups of people and miscellaneous usages of person names (&amp;quot;Republicans,&amp;quot; &amp;quot;Gramm-Rudman,&amp;quot; &amp;quot;Alzheimer['s]&amp;quot;) ; addresses and adjectival forms of location names (&amp;quot;53140 Gatchell Rd.,&amp;quot; &amp;quot;American&amp;quot;); indirect and vague mentions of dates and times (&amp;quot;a few minutes after the hour,&amp;quot; &amp;quot;thirty days before the end of the year&amp;quot;) ; and miscellaneous uses of numbers, including some that are similar to currency or percentage expressions (&amp;quot;[Fees] 1 3/4,&amp;quot; &amp;quot;12 points,&amp;quot; &amp;quot;1 .5 times&amp;quot;). The full text of the task definition is contained in appendix C .</Paragraph>
    <Paragraph position="2"> The evaluation metrics used for NE are essentially the same as those used for the two template-filling tasks , Template Element and Scenario Template, and are discussed in the paper by Chinchor in this volume on the scoring software. The following breakdowns of overall scores on NE are computed : * by slot, i.e., for performance across tag elements, across TYPE attributes, and across tag strings ; * by subcategorization, i.e., for performance on each TYPE attribute separately ; * by document section, i.e., for performance on distinct subparts of the article, as identified by th e SGML tags contained in the original text : &lt;HL&gt; (&amp;quot;headline&amp;quot;), &lt;DD&gt; (&amp;quot;document date&amp;quot;), &lt;DATELINE&gt;, and &lt;TXT&gt; (the body of the article).</Paragraph>
  </Section>
  <Section position="6" start_page="14" end_page="89" type="metho">
    <SectionTitle>
NE Results Overall
</SectionTitle>
    <Paragraph position="0"> Fifteen sites participated in the NE evaluation, including two that submitted two system configurations fo r testing and one that submitted four, for a total of 20 systems. As shown in the table below, performance on th e NE task overall was over 90% on the F-measure for half of the systems tested, which includes systems fro m seven different sites . On the basis of the results of the dry run, in which two of the nine systems scored ove r 90%, we were not surprised to find official scores that were similarly high, but it was not expected that so man y systems would enter the formal evaluation and perform so well .</Paragraph>
    <Paragraph position="1">  decreasing F-Measure (P&amp;R ) It was also unexpected that one of the systems would match human performance on the task . Human performance was measured by comparing the 30 draft answer keys produced by the annotator at NRaD with those produced by the annotator at SAIC . This test measures the amount of variability between the annotators . When 1 5 the outputs are scored in &amp;quot;key-to-response&amp;quot; mode, as though one annotator's output represented the &amp;quot;key&amp;quot; and the other the &amp;quot;response,&amp;quot; the humans achieved an overall F-measure of 96 .68 and a corresponding error per response fill (ERR) score of 6% . The top-scoring system, the baseline configuration of the SRA system (labele d satie.base in appendix A), achieved an F-measure of 96 .42 and a corresponding error score of 5% . In considering the significance of these results from a general standpoint, the following facts about the tes t set need to be remembered :  there were only a few markable percentage expressions .</Paragraph>
    <Paragraph position="2"> The results should also be qualified by saying that they reflect performance on data that makes accurate usage of upper and lower case distinctions. What would performance be on data where case provided no (reliable) clue s and for languages where case doesn't distinguish names? SRA ran an experiment on an upper-case version of th e test set that showed 85% recall and 89% precision overall, with identification of organization names presenting the greatest problem. That result represents nearly a 10-point decrease on the F-measure from their official baseline. The case-insensitive results would be slightly better if the task guidelines themselves didn't depend o n case distinctions in certain situations, as when identifying the right boundary for the organization name span in a string such as &amp;quot;the Chrysler division&amp;quot; (currently, only &amp;quot;Chrysler&amp;quot; would be tagged) .</Paragraph>
  </Section>
  <Section position="7" start_page="89" end_page="89" type="metho">
    <SectionTitle>
NE Results on Some Aspects of Task
</SectionTitle>
    <Paragraph position="0"> The figures below show the sample size for the various tag elements and TYPE values .</Paragraph>
    <Paragraph position="1">  Note that nearly 80% of the tags were ENAMEX and that almost half of those were subcategorized as organization names . As indicated in the table below, all systems performed better on identifying person name s than on identifying organization or location names, and all but a few systems performed better on location name s than on organization names . Organization names are varied in their form, consisting of proper nouns, general vocabulary, or a mixture of the two . They can also be quite long and complex and can even have interna l punctuation such as a commas or an ampersand. Sometimes it is difficult to distinguish them from names of other types, especially from person names . Common organization names, first names of people, and locatio n names can be handled by recourse to list lookup, although there are drawbacks : some names may be on more than one list, the lists will not be complete and may not match the name as it is realized in the text (e .g., may not cover the needed abbreviated form of an organization name, may not cover the complete person name), etc .</Paragraph>
    <Paragraph position="2">  The difference that recourse to lists can make in performance is seen by comparing two runs made by SRA , labeled satie.base and satie.nonames. The satie.nonames configuration resulted in a three point decrease in recal l and one point decrease in precision. The changes occurred only in performance on identifying organizations . BBN conducted a comparative test in which the extra configuration (gershwin.optional) used a larger lexicon than the basic configuration (gershwin.baseline), but the exact nature of the difference is not known and th e performance differences are very small . As with the SRA experiment, the only differences in performanc e between the two BBN configurations are with the organization type . The University of Durham reported that the y had intended to use gazetteer and company name lists, but didn't, because they found that the lists did not hav e much effect on their system's performance .</Paragraph>
    <Paragraph position="3"> The error scores for persons, dates, and monetary expressions was less than or equal to 10% for the larg e majority of systems . Several systems posted scores under 10% error for locations, but none was able to do so fo r oganizations . For percentages, about half the systems had 0% error, which reflects the simplicity of that particular subtask . Note that the number of instances of percentages in the test set is so small that a singl e mistake could result in an error of 6% .</Paragraph>
    <Paragraph position="4"> Examination of the score tables in the appendix show that slot-level performance on ENAMEX follows a different pattern for most systems from slot-level performance on NUMEX and TIMEX . The general pattern is for systems to have done better on the TEXT slot than on the TYPE slot for ENAMEX tags and for systems t o have done better on the TYPE slot than on the TEXT slot for NUMEX and TIMEX tags . Errors on the TEXT slot are errors in finding the right span for the tagged string, and this can be a problem for all three subcategorie s of tag. The TYPE slot, however, is a more difficult slot for ENAMEX than for the other subcategories . It involves a three-way distinction for ENAMEX and only a two-way distinction for NUMEX and TIMEX, and i t offers the possibility of confusing names of one type with names of another, especially the possibility o f confusing organization names with person names .</Paragraph>
    <Paragraph position="5"> Looking at the document section scores in the table below, we see that the error score on the body of the text was much lower than on the headline for all but a few systems . There was just one system that posted a higher error score on the body than on the headline, the NMSU CRL ives.basic configuration, and the difference in scores is largely due to the fact that the system overgenerated to a greater extent on the body than on th e headline. Its basic strategy for headlines was a conservative one : tag a string in the headline as a name only if th e system had found it in the body of the text or if the system had predicted the name based on truncation of name s</Paragraph>
    <Paragraph position="7"> found in the body of the text . Most, if not all, the systems that were evaluated on the NE task adopted the basic strategy of processing the headline after processing the body of the text .</Paragraph>
    <Paragraph position="8">  The interannotator variability test provides reference points indicating human performance on the different aspects of the NE task . The document section results show 0% error on Document Date and Dateline, 7% erro r on Headline, and 6% error on Text. The subcategory error scores were 6% on Organization, 1% on Person, an d 4% on Location, 8% on Date, and 0% on Money and Percent. These results show that human variability on thi s task patterns in a way that is similar to the performance of most of the systems in all respects except perhap s one: the greatest source of difficulty for the humans was on identifying dates . Analysis of the results shows that some Date errors were a result of simple oversight (e .g., &amp;quot;fiscal 1994&amp;quot;) and others were a consequence o f forgetting or misinterpreting the task guidelines with respect to determining the maximal span of the date expression (e.g., tagging &amp;quot;fiscal 1993's second quarter&amp;quot; and &amp;quot;Aug . 1&amp;quot; separately, rather than tagging &amp;quot;fiscal 1993' s second quarter, ended Aug . 1&amp;quot; as a single expression in accordance with the task guidelines) .</Paragraph>
    <Paragraph position="9"> NE Results on &amp;quot;Walkthrough Article&amp;quot; In the answer key for the walkthrough article (see appendix A to this proceedings) there are 69 ENAMEX tags (including a few optional ones), six TIMEX tags and six NUMEX tags . Interannotator scoring showed that one annotator missed tagging one instance of &amp;quot;Coke&amp;quot; as an (optional) organization, and the other annotator misse d one date expression (&amp;quot;September&amp;quot;) . Common mistakes made by the systems included missing the date expression, &amp;quot;the 21st century,&amp;quot; and spuriously identifying &amp;quot;60 pounds&amp;quot; (which appeared in the context, &amp;quot;Mr . Dooner, who recently lost 60 pounds over three-and-a-half months, ...&amp;quot;) as a monetary value rather than ignorin g it as a weight . In addition, a number of errors identifying entity names were made; some of those errors als o showed up as errors on the Template Element task and are described in a later section of this paper .</Paragraph>
  </Section>
  <Section position="8" start_page="89" end_page="89" type="metho">
    <SectionTitle>
COREFERENCE
</SectionTitle>
    <Paragraph position="0"> The task as defined for MUC-6 was restricted to noun phrases (NPs) and was intended to be limited t o phenomena that were relatively noncontroversial and easy to describe . The variety of high-frequency phenomen a covered by the task is partially represented in the following hypothetical example, where all bracketed tex t segments are considered coreferential :</Paragraph>
    <Paragraph position="2"> [Motor Vehicles International Corp .] announced a major management shake-up. . .. [MVI] said the chief executive officer has resigned . .. . [The Big 10 auto maker] is attempting to regain market share .</Paragraph>
    <Paragraph position="3"> ... [It] will announce significant losses for the fourth quarter . .. . A [company] spokesman said [they ] are moving [their] operations to Mexico in a cost-saving effort . .. . [MVI, [the first company to announce such a move since the passage of the new international trade agreement],] is facing increasing demands from unionized workers. ... [Motor Vehicles International] is [the biggest American aut o exporter to Latin America].</Paragraph>
    <Paragraph position="4"> The example passage covers a broad spectrum of the phenomena included in the task . At one end of the spectrum are the proper names and aliases, which are inherently definite and whose referent may appear anywher e in the text. In the middle of the spectrum are definite descriptions and pronouns whose choice of referent i s constrained by such factors as structural relations and discourse focus . On the periphery of the central phenomena are markables whose status as coreferring expressions is determined by syntax, such as predicate nominal s (&amp;quot;Motor Vehicles International is the biggest American auto exporter to Latin America&amp;quot;) and appositives (&amp;quot;MVI , the first company to announce such a move since the passage of the new international trade agreement&amp;quot;) . At the far end of the spectrum are bare common nouns, such as the prenominal &amp;quot;company&amp;quot; in the example, whose statu s as a referring expression may be questionable .</Paragraph>
    <Paragraph position="5"> An algorithm developed by the MITRE Corporation for MUC-6 was implemented by SAIC and used for scoring the task (see &amp;quot;A Model-Theoretic Coreference Scoring Scheme&amp;quot; and &amp;quot;Four Scorers and Seven Years Ago: The Scoring Scheme for MUC-6&amp;quot; in this volume) . The algorithm compares the equivalence classes defined by the coreference links in the manually-generated answer key and the system-generated response . The equivalence classes are the models of the identity equivalence coreference relation . Using a simple counting scheme, the algorithm obtains recall and precision scores by determining the minimal perturbations required to align the equivalence classes in the key and response . No metrics other than recall and precision were defined for this task, and no statistical significance testing was performed on the scores .</Paragraph>
  </Section>
  <Section position="9" start_page="89" end_page="89" type="metho">
    <SectionTitle>
CO Results Overall
</SectionTitle>
    <Paragraph position="0"> In all, seven sites participated in the MUC-6 coreference evaluation. Most systems achieved approximatel y the same levels of performance : five of the seven systems were in the 51%-63% recall range and 62%-72 % precision range . About half the systems focused only on individual coreference, which has direct relevance to th e other MUC-6 evaluation tasks .</Paragraph>
    <Paragraph position="1">  A few of the evaluation sites reported that good name/alias recognition alone would buy a system a lot o f recall and precision points on this task, perhaps about 30% recall (since proper names constituted a large minorit y  of the annotations) and 90% precision . The precision figure is supported by evidence from the NE evaluation . In that evaluation, a number of systems scored over 90% on the named entity recall and precision metrics, providin g a sound basis for good performance on the coreference task for individual entities .</Paragraph>
    <Paragraph position="2"> In the middle of the effort of preparing the test data for the formal evaluation, an interannotator variabilit y test was conducted. The two versions of the independently prepared, manual annotations of 17 articles were score d against each other using the scoring program in the normal &amp;quot;key to response&amp;quot; scoring mode . The amount of agreement between the two annotators was found to be 80% recall and 82% precision . There was a large number of factors that contributed to the 20% disagreement, including overlooking coreferential NPs, using differen t interpretations of vague portions of the guidelines, and making different subjective decisions when the text of a n article was ambiguous, sloppy, etc . Most human errors pertained to definite descriptions and bare nominals, no t to names and pronouns.</Paragraph>
  </Section>
  <Section position="10" start_page="89" end_page="89" type="metho">
    <SectionTitle>
CO Results on Some Aspects of Task and on &amp;quot;Walkthrough Article&amp;quot;
</SectionTitle>
    <Paragraph position="0"> To keep the annotation of the evaluation data fairly simple, the MUC-6 planning committee decided not t o design the notation to subcategorize linkages and markables in any way . Two useful attributes for the equivalence class as a whole would be one to distinguish individual coreference from type coreference and one to identify th e general semantic type of the class (organization, person, location, time, currency, etc.). For each NP in the equivalence class, it would be useful to identify its grammatical type (proper noun phrase, definite common nou n phrase, bare singular common noun phrase, personal pronoun, etc .). The decision to minimize the annotatio n effort makes it difficult to do detailed quantitative analysis of the results .</Paragraph>
    <Paragraph position="1"> An analysis by the participating sites of their system's performance on the walkthrough article provide s some insight into performance on aspects of the coreference task that were dominant in that article . The article contains about 1000 words and approximately 130 coreference links, of which all but about a dozen are reference s to individual persons or individual organizations (see appendix A) . Approximately 50 of the anaphors are personal pronouns, including reflexives and possessives, and 58 of the markables (anaphors and antecedents) are prope r names, including aliases . The percentage of personal pronouns is relatively high (38%), compared to the test set overall (24%), as is the percentage of proper names (40% on this text versus an estimate of 30% overall) .</Paragraph>
    <Paragraph position="2"> Performance on this particular article for some systems was higher than performance on the test set overall , reaching as high as 77% recall and 79% precision . These scores indicate that pronoun resolution techniques a s well as proper noun matching techniques are good, compared to the techniques required to determine reference s involving common noun phrases . For common noun phrases, the systems were not required to include the entir e NP in the response ; the response could minimally contain only the head noun. Despite this flexibility in th e expected contents of the response, the systems nonetheless had to implicitly recognize the full NP, since to b e considered coreferential, the head and its modifiers all had to be consistent with another markable .</Paragraph>
  </Section>
  <Section position="11" start_page="89" end_page="89" type="metho">
    <SectionTitle>
TEMPLATE ELEMENT
</SectionTitle>
    <Paragraph position="0"> The Template Element (TE) task requires extraction of certain general types of information about entitie s and merging of the information about any given entity before presentation in the form of a template (or &amp;quot;object&amp;quot;) .</Paragraph>
    <Paragraph position="1"> For MUC-6 the entities that were to be extracted were limited to organizations and persons .' The ORGANIZATION object contains attributes (&amp;quot;slots&amp;quot;) for the string representing the organization nam e (ORG_NAME), for strings representing any abbreviated versions of the name (ORG_ALIAS), for a string tha t describes the particular organization (ORG_DESCRIPTOR), for a subcategory of the type of organization (ORG_TYPE, whose permissible values are GOVERNMENT, COMPANY, and OTHER), and for canonica l forms of the specific and general location of the organization (ORG_LOCALE and ORG_COUNTRY) . The PERSON object contains slots only for the string representing the person name (PER_NAME), for string s representing any abbreviated versions of the name (PER_ALIAS), and for strings representing a very limited rang e of titles (PER_TITLE).</Paragraph>
    <Paragraph position="2"> The task documentation (appendix E) includes definition of an &amp;quot;artifact&amp;quot; entity, but that entity type was not used i n MUC-6 for either the dry run or the formal run . The entity types that were involved in the evaluation are the same a s those required for the Scenario Template task .</Paragraph>
    <Paragraph position="3">  The task places heavy emphasis on recognizing proper noun phrases, as in the NE task, since all slot s except ORG_DESCRIPTOR and PER_TITLE expect proper names as slot fillers (in string or canonical form , depending on the slot. However, the organization portion of the TE task is not limited to recognizing th e referential identity between full and shortened names ; it requires the use of text analysis techniques at all levels of text structure to associate the descriptive and locative information with the appropriate entity . Analysis of complex NP structures, such as appositional structures and postponed modifier adjuncts, is needed in order t o relate the locale and descriptor to the name in &amp;quot;Creative Artists Agency, the big Hollywood talent agency&amp;quot; and i n &amp;quot;Creative Artists Agency, a big talent agency based in Hollywood.&amp;quot; Analysis of sentence structures to identify grammatical relations such as predicate nominals is needed in order to relate those same pieces of information i n &amp;quot;Creative Artists Agency is a big talent agency based in Hollywood .&amp;quot; Analysis of discourse structure is needed in order to identify long-distance relationships .</Paragraph>
    <Paragraph position="4"> The answer key for the TE task contains one object for each specific organization and person mentioned i n the text . For generation of a PERSON object, the text must provide the name of the person (full name or part o f a name). For generation of an ORGANIZATION object, the text must provide either the name (full or part) or a descriptor of the organization . Since the generation of these objects is independent of the relevance criteri a imposed by the Scenario Template (ST) task, there are many more ORGANIZATION and PERSON objects i n the TE key than in the ST key . For the formal evaluation, there were 606 ORGANIZATION and 496 PERSO N objects in the TE key, versus 120 ORGANIZATION and 137 PERSON objects in the ST key .</Paragraph>
    <Paragraph position="5"> The same set of articles was used for TE as for ST ; therefore, the content of the articles is oriented towar d the terms and subject matter covered by the ST task, which concerns changes in corporate management.2 One effect of this bias is simply the number of entities mentioned in the articles : for the test set used for the MUC- 6 dry run, which was based on a scenario concerning labor union contract negotiations, there were only about hal f as many organizations and persons mentioned as there were in the test set used for the formal run .</Paragraph>
  </Section>
  <Section position="12" start_page="89" end_page="89" type="metho">
    <SectionTitle>
TE Results Overall
</SectionTitle>
    <Paragraph position="0"> Twelve systems -- from eleven sites, including one that submitted two system configurations for testing were tested on the TE task. All but two of the systems posted F-measure scores in the 70-80% range, and four of  the systems were able to achieve recall in the 70-80% range while maintaining precision in the 80-90% range, a s shown in the figure 4 . Human performance was measured in terms of variability between the outputs produced b y the two NRaD and SAIC evaluators for 30 of the articles in the test set (the same 30 articles that were used fo r NE and CO testing) . Using the scoring method in which one annotator's draft key serves as the &amp;quot;key&amp;quot; and th e other annotator's draft key serves as the &amp;quot;response,&amp;quot; the overall consistency score was 93 .14 on the F-measure , with 93% recall and 93% precision .</Paragraph>
  </Section>
  <Section position="13" start_page="89" end_page="89" type="metho">
    <SectionTitle>
TE Results on Some Aspects of Task
</SectionTitle>
    <Paragraph position="0"> Given the more varied extraction requirements for the ORGANIZATION object, it is not surprising tha t performance on that portion of the TE task was not as good as on the PERSON object3, as is clear in the figure  ORG_LOCALE slot is filled . (The reverse is not the case, i .e., ORG_COUNTRY may be filled even if ORG_LOCALE is not, but this situation is relatively rare.) Since a missing or spurious ORG_LOCALE i s likely to incur the same error in ORG_COUNTRY, the error scores for the two slots are understandably similar . With respect to performance on ORG_DESCRIPTOR, note that there may be multiple descriptors (or none ) in the text. However, the task does not require the system to extract all descriptors of an entity that are containe d in the text; it requires only that the system extract one (or none) . Frequently, at least one can be found in clos e proximity to an organization's name, e .g., as an appositive (&amp;quot;Creative Artists Agency, the big Hollywood talent agency&amp;quot;). Nonetheless, performance is much lower on this slot than on others .</Paragraph>
    <Paragraph position="1"> Leaving aside the fact that descriptors are common noun phrases, which makes them less obvious candidate s for extraction than proper noun phrases would be, what reasons can we find to account for the relatively lo w performance on the ORG_DESCRIPTOR slot? One reason for low performance is that an organization may b e identified in a text solely by a descriptor, i.e., without a fill for the ORG_NAME slot and therefore without th e usual local clues that the NP is in fact a relevant descriptor . It is, of course, also possible that a text may identif y an organization solely by name. Both possibilities present increased opportunities for systems to undergenerate  or overgenerate. Also, the descriptor is not always close to the name, and some discourse processing may b e required in order to identify it -- this is likely to increase the opportunity for systems to miss the information . A third significant reason is that the response fill had to match the key fill exactly in order to be counted correct; there was no allowance made in the scoring software for assigning full or partial credit if the response fill onl ypartially matched the key fill . It should be noted that human performance on this task was also relatively low , but it is unclear whether the degree of disagreement can be accounted for primarily by the reasons given above o r whether the disagreement is attributable to the fact that the guidelines for that slot had not been finalized at the time when the annotators created their version of the keys .</Paragraph>
  </Section>
  <Section position="14" start_page="89" end_page="89" type="metho">
    <SectionTitle>
TE Results on &amp;quot;Walkthrough Article&amp;quot;
</SectionTitle>
    <Paragraph position="0"> TE performance of all systems on the walkthrough article was not as good as performance on the test set as a whole, but the difference is small for about half the systems . Viewed from the perspective of the TE task, th e walkthrough article presents a number of interesting examples of entity type confusions that can result from insufficient processing (appendix A) . There are cases of organization names misidentified as person names, there is a case of a location name misidentified as an organization name, and there are cases of nonrelevant entity type s (publications, products, indefinite references, etc .) misidentified as organizations . Errors of these kinds result in a penalty at the object level, since the extracted information is contained in the wrong type of object . Examples of each of these types of error appear below, along with the number of systems that committed the error . (The chopin.noref system configuration of the SRA system produced the same output as chopin.base and has been disregarded in the tallies ; thus, the total number of systems tallied is eleven.)  1 . Miscategorizations of entities as person (PER_NAME or PER_ALIAS) instead of organization (ORG_NAME or ORG_ALIAS ) * Six systems : McCann-Erickson (also extracted with the name of &amp;quot;McCann,&amp;quot; &amp;quot;One McCann,&amp;quot; &amp;quot;Whil e McCann&amp;quot;; organization category is indicated clearly by context in which full name appears, &amp;quot;John Dooner Will Succeed James At Helm of McCann-Erickson&amp;quot; in headline and &amp;quot;Robert L . James, chairman and chief executive officer of McCann-Erickson, and John J. Dooner Jr., the agency's president and chief operating officer&amp;quot; in the body of the article) * Six systems : J. Walter Thompson (also extracted with the name of &amp;quot;Walter Thompson&amp;quot; ; organization category is indicated by context, &amp;quot;Peter Kim was hired from WPP Group's J . Walter Thompson last September...&amp;quot;) *Four systems: Fallon McElligott (organization category is indicated by context, &amp;quot; . ..other ad agencies, such as Fallon McElligott&amp;quot;) *One system : Ammirati &amp; Puris (the presence of the ampersand is a clue, as is the context, &amp;quot; ...presiden t and chief executive officer of Ammirati &amp; Puris&amp;quot; ; but note that the article also mentions the name of one of the company's founders, Martin Puris) 2. Miscategorization of entity as organization (ORG_NAME) instead of location (ORG_LOCALE )  Given the variety of contextual clues that must be taken into account in order to analyze the above entities correctly, it is understandable that just about any given system would commit at least one of them . But the problems are certainly tractable ; none of the fifteen TE entities in the key (ten ORGANIZATION entities and fiv e PERSON entities) was miscategorized by all of the systems.</Paragraph>
    <Paragraph position="1"> In addition to miscategorization errors, the walkthrough text provides other interesting examples of syste m errors at the object level and the slot level, plus a number of examples of system successes . One success for the systems as a group is that each of the six smaller ORGANIZATION objects and four smaller PERSON objects (those with just one or two filled slots in the key) was matched perfectly by at least one system ; in addition, one larger ORGANIZATION object and two larger PERSON objects were perfectly matched by at least one system . Thus, each of the five PERSON objects in the key and seven of the ten ORGANIZATION objects in the ke y were matched perfectly by at least one system . The three larger ORGANIZATION objects that none of th e systems got perfectly correct are for the McCann-Erickson, Creative Artists Agency, and Coca-Cola companies . Common errors in these three ORGANIZATION objects included missing the descriptor or locale/country o r failing to identify the organization's alias with its name .</Paragraph>
  </Section>
  <Section position="15" start_page="89" end_page="89" type="metho">
    <SectionTitle>
SCENARIO TEMPLATE
</SectionTitle>
    <Paragraph position="0"> A Scenario Template (ST) task captures domain- and task-specific information . Three scenarios were defined in the course of MUC-6 : (1) a scenario concerning the event of organizations placing orders to bu y aircraft with aircraft manufacturers (the &amp;quot;aircraft order&amp;quot; scenario) ; (2) a scenario concerning the event of contract negotiations between labor unions and companies (the &amp;quot;labor negotiations&amp;quot; scenario) ; (3) a scenario concerning changes in corporate managers occupying executive posts (the &amp;quot;management succession&amp;quot; scenario). The first scenario was used as an example of the general design of the ST task, the second was used for the MUC-6 dry ru n evaluation, and the third was used for the formal evauation . One of the innovations of MUC-6 was to formaliz e the general structure of event templates, and all three scenarios defined in the course of MUC-6 conformed to that general structure (appendix E). In this article, the management succession scenario will be used as the basis fo r discussion ; the details of that scenario are given in appendix F.</Paragraph>
    <Paragraph position="1"> The management succession template consists of four object types, which are linked together via one-wa y pointers to form a hierarchical structure. At the top level is the TEMPLATE object, of which there is on e instantiated for every document. This object points down to one or more SUCCESSION_EVENT objects if th e document meets the event relevance criteria given in the task documentation . Each event object captures the changes occurring within a company with respect to one management post . The SUCCESSION_EVENT object points down to the IN_AND_OUT object, which in turn points down to PERSON Template Element objects tha t represent the persons involved in the succession event . The IN_AND_OUT object contains ST-specific information that relates the event with the persons . The ORGANIZATION Template Element objects are presen t at the lowest level along with the PERSON objects, and they are pointed to not only by the IN_AND_OU T object but also by the SUCCESSION_EVENT object. The organization pointed to by the event object is the organization where the relevant management post exists ; the organization pointed to by the relational object is th e organization that the person who is moving in or out of the post is coming from or going to.</Paragraph>
    <Paragraph position="2">  The scenario is designed around the management post rather than around the succession act itself . Although the management post and information associated with it are represented in the SUCCESSION_EVENT object , that object does not actually represent an event, but rather a state, i .e., the vacancy of some management post . The relational-level IN_AND_OUT objects represent the personnel changes pertaining to that state .</Paragraph>
  </Section>
  <Section position="16" start_page="89" end_page="89" type="metho">
    <SectionTitle>
ST Results Overall
</SectionTitle>
    <Paragraph position="0"> Nine sites submitted a total of eleven systems for evaluation on the ST task . All the participating sites als o submitted systems for evaluation on the TE and NE tasks . All but one of the development teams (UDurham) had members who were veterans of MUC-5 .</Paragraph>
    <Paragraph position="1"> Of the 100 texts in the test set, 54 were relevant to the management succession scenario, including six tha t were only marginally relevant . Marginally relevant event objects are marked in the answer key as being optional , which means that a system is not penalized if it does not produce such an event object. The approximate 50-50 split between relevant and nonrelevant texts was intentional and is comparable to the richness of the MUC- 3 &amp;quot;TST2&amp;quot; test set and the MUC-4 &amp;quot;TST4&amp;quot; test set . (The test sets used for MUC-5 had a much higher proportion of relevant texts .) Systems are measured for their performance on distinguishing relevant from nonrelevant texts vi a the text filtering metric, which uses the classic information retrieval definitions of recall and precision (see prefac e to appendix B).</Paragraph>
    <Paragraph position="2"> For MUC-6, text filtering scores were as high as 98% recall (with precision in the 80th percentile) or 96 % precision (with recall in the 80th percentile) . Similar tradeoffs and upper bounds on performance can be seen i n the TST2 and TST4 results (see score reports in sections 2 and 4 of appendix G in [1]) . However, performance o f the systems as a group is better on the MUC-6 test set. The text filtering results for MUC-6, MUC-4 (TST4) and MUC-3 (TST2) are shown in figure 8 .</Paragraph>
    <Paragraph position="3">  Whereas the Text Filter row in the score report shows the system ' s ability to do text filtering (documen t detection), the All Objects row and the individual Slot rows show the system's ability to do information extraction. The measures used for information extraction include two overall ones, the F-measure and error pe r response fill, and several other, more diagnostic ones (recall, precision, undergeneration, overgeneration, and substitution) . See preface to appendix B for definitions of the metrics . Note that the text filtering definition of precision is different from the information extraction definition of precision ; the latter definition includes an element in the formula that accounts for the number of spurious template fills generated .</Paragraph>
    <Paragraph position="4"> The All Objects recall and precision scores are shown in figure 9 . The highest ST F-measure score wa s 56.40 (47% recall, 70% precision). Statistically, large differences of up to 15 points may not be reflected as a  difference in the ranking of the systems . Most of the systems fall into the same rank at the high end, and th e evaluation does not clearly distinguish more than two ranks (see the paper on statistical significance testing b y Chinchor in this volume) . Human performance was measured in terms of interannotator variability on only 30 texts in the test set and showed agreement to be approximately 83%, when one annotator's templates were treate d as the &amp;quot;key&amp;quot; and the other annotator's templates were treated as the &amp;quot;response .&amp;quot; No analysis has been done of the relative difficulty of the MUC-6 ST task compared to previous extractio n evaluation tasks . The one-month limitation on development in preparation for MUC-6 would be difficult t o factor into the computation, and even without that additional factor, the problem of coming up with a reasonable, objective way of measuring relative task difficulty has not been adequately addressed . Nonetheless, as one rough measure of progress in the area of information extraction as a whole, we can consider the F-measures of the top scoring systems from the MUC-5 and MUC-6 evaluations . Note that the table below shows four top scores for MUC-5, one for each language-domain pair : English Joint Ventures (EJV), Japanese Joint Ventures (JJV) , English Microelectronics (EME), and Japanese Microelectronics (JME) . From this table, it may be reasonable to conclude that progress has been made, since the MUC-6 performance level is at least as high as for three of th e four MUC-5 tasks and since that performance level was reached after a much shorter time .</Paragraph>
  </Section>
  <Section position="17" start_page="89" end_page="89" type="metho">
    <SectionTitle>
ST Results on Some Aspects of Task and on &amp;quot;Walkthrough Article&amp;quot;
</SectionTitle>
    <Paragraph position="0"> Three succession events are reported in the walkthrough article . Successful interpretation of three sentences from the walkthrough article is necessary for high performance on these events . The tipoff on the first two events comes at the end of the second paragraph: Yesterday, McCann made official what had been widely anticipated: Mr. James, 57 years old, i s stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year . He will be succeeded by Mr . Dooner, 45 .</Paragraph>
    <Paragraph position="1"> The basis of the third event comes halfway through the two-page article : In addition, Peter Kim was hired from WPP Group's J . Walter Thompson last September as vice chairman, chief strategy officer, world-wide .</Paragraph>
    <Paragraph position="2"> The article was relatively straightforward for the annotators who prepared the answer key, and there were n o substantive differences in the output produced by each of the two annotators .</Paragraph>
    <Paragraph position="3"> Table 5 contains a paraphrased summary of the output that was to be generated for each of these events , along with a summary of the output that was actually generated by systems evaluated for MUC-6 . The system-generated outputs are from three different systems, since no one system did better than all other systems on al l three events . The substantive differences between the system-generated output and the answer key are indicated b y underlining in the system output .</Paragraph>
    <Paragraph position="4"> Recurring problems in the system outputs include the information about whether the person is currently o n the job or not and the information on where the outgoing person's next job would be and where the incomin g person's previous job was. Note also that even the best system on the third event was unable to determine tha t the succession event was occurring at McCann-Erickson ; in addition, it only partially captured the full title of the post. To its credit, however, it did recognize that the event was relevant ; only two systems produced output that  is recognizable as pertaining to this event . One common problem was the simple failure to recognize &amp;quot;hire&amp;quot; as a n indicator of a succession .</Paragraph>
    <Paragraph position="5"> Table 5 . Paraphrased summary of ST outputs for walkthrough articl e Two systems never filled the OTHER_ORG slot or its dependent slot, REL_OTHER_ORG, despite the fac t that data to fill those slots was often present ; over half the IN_AND_OUT objects in the answer key contain dat a for those two slots . Almost without exception, systems did more poorly on those two slots than on any other s in the SUCCESSION_EVENT and IN_AND_OUT objects; the best scores posted were 70% error on OTHER_ORG (median score of 79%) and 72% error on REL_OTHER_ORG (median of 86%) .</Paragraph>
    <Paragraph position="6"> Performance on the VACANCY_REASON and ON_THE_JOB slots was better for nearly all systems . The lowest error scores were 56% on VACANCY_REASON (median of 70%) and 62% on ON_THE_JOB (median o f 71%).</Paragraph>
    <Paragraph position="7"> The slot that most systems performed best on is NEW_STATUS ; the lowest error score posted on that slo t is 47% (median of 55%) . This slot has a limited number of fill options, and the right answer is almost alway s either IN or OUT, depending on whether the person involved is assuming a post (IN) or vacating a post (OUT) . Performance on the POST slot was not quite as good ; the lowest error was 52% (median of 65%) . The POST slot requires a text string as fill, and there is no finite list of possible fills for the slot . As seen in the third event of the walkthrough article, the fill can be an extended title such as &amp;quot;vice chairman, chief strategy officer, world wide.&amp;quot; For most events, however, the fill is one of a large handful of possibilities, including &amp;quot;chairman, &amp;quot; &amp;quot;president,&amp;quot; &amp;quot;chief executive [officer],&amp;quot; &amp;quot;CEO,&amp;quot; &amp;quot;chief operating officer,&amp;quot; &amp;quot;chief financial officer,&amp;quot; etc .</Paragraph>
  </Section>
  <Section position="18" start_page="89" end_page="89" type="metho">
    <SectionTitle>
DISCUSSION: CRITIQUE OF TASKS
</SectionTitle>
    <Paragraph position="0"> Named Entity The primary subject for review in the NE evaluation is its limited scope . A variety of proper name types were excluded, e.g. product names. The range of numerical and temporal expressions covered by the task was als o limited; one notable example is the restriction of temporal expressions to exclude &amp;quot;relative&amp;quot; time expressions suc h as &amp;quot;last week&amp;quot;. Restriction of the corpus to Wall Street Journal articles resulted in a limited variety of markables and in reliance on capitalization to identify candidates for annotation .</Paragraph>
    <Paragraph position="1"> Some work on expanding the scope of the NE task has been carried out in the context of a foreign-languag e NE evaluation conducted in the spring of 1996 . This evaluation is called the MET (Multilingual Named Entity )</Paragraph>
    <Section position="1" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
Answer Key
</SectionTitle>
      <Paragraph position="0"> James out, Dooner in as CEO of McCann Erickson as a result of James departing th e workforce; James is still on the job as CEO ; Dooner is not on the job as CEO yet, and his ol d job was with the same org as his new job .</Paragraph>
      <Paragraph position="1"> James out, Dooner in as chairman of McCann-Erickson as a result of James departing the workforce; James is still on the job as chairman ; Dooner is not on the job as chairman yet, and his old job was with the same org as hi s new job.</Paragraph>
      <Paragraph position="2"> Kim in as &amp;quot;vice chairman, chief strateg y officer, world-wide&amp;quot; of McCann-Erickson, wher e the vacancy existed for other/unknown reasons ; he is already on the job in the post, and his ol d job was with J. Walter Thompson .</Paragraph>
      <Paragraph position="3"> System Output _ James out, Dooner in as CEO of McCann-Erickson as a result of a reassignment of James; James is not on the job as CEO any more, and his new job is at the same as his old job; Dooner may or may not be on the job as CEO yet, and his old job was with the same org as his ne w job . (SRA satie_base system) James out, Dooner in as chairman of McCann-Erickson as a result of James departin g the workforce ; James is not on the job as chairman any more; Dooner is already on the job as chairman, and his old job was with Ammirati &amp; Puris . (NYU system) Kim in as vice chairman of WPP Group , where the vacancy existed for other/unknow n reasons; he may or may not be on the job in that post yet, and the article doesn't say where his ol d job was. (BBN system)  and, like MUC-6, was carried out under the auspices of the Tipster Text program . The experience gained from that evaluation will serve as critical input to revising the Engish version of the task . Coreference Many aspects of the CO task are in definite need of review for reasons of either theory or practice . One set o f issues concerns the range of syntactically governed coreference phenomena that are considered markable . For example, apposition as a markable phenomenon was restrictively defined to exclude constructs that could rather b e analyzed as left modification, such as &amp;quot;chief executive Scott McNealy,&amp;quot; which lacks the comma punctuation tha t would clearly identify &amp;quot;executive&amp;quot; as the head of an appositive construction . Another set of issues is semantic i n nature and includes fundamental questions such as the validity of including type coreference in the task and th e legitimacy of the implied definition of coreference versus reference . If an antecedent expression is nonreferential , can it nonetheless be considered coreferential with subsequent anaphoric expressions? Or can only referrin g expressions corefer? Finally, the current notation presents a set of issues, such as its inability to represen t multiple antecedents, as in conjoined NPs, or alternate antecedents, as in the case of referential ambiguity . In short, the preliminary nature of the task design is reflected in the somewhat unmotivated boundarie s between markables and nonmarkables and in weaknesses in the notation. One indication of immaturity of the task definition (as well as an indication of the amount of genuine textual ambiguity) is the fact that over te n percent of the linkages in the answer key were marked as &amp;quot;optional .&amp;quot; (Systems were not penalized if they failed to include such linkages in their output .) The task definition is now under review by a discourse working grou p formed in 1996 with representatives from both inside and outside the MUC commuity, including representative s from the spoken-language community .</Paragraph>
      <Paragraph position="4"> Template Elemen t There are miscellaneous outstanding problems with the TE task . With respect to the ORGANIZATION and PERSON objects, there are issues such as rather fuzzy distinctions among the three organization subtypes an d between the organization name and alias, the extremely limited scope of the person title slot, and the lack of a person descriptor slot . The ARTIFACT object, which was not used for either the dry run or the forma l evaluation, needs to be reviewed with respect to its general utility, since its definition reflects primarily th e requirements of the MUC-5 microelectronics task domain . There is a task-neutral DATE slot that is defined as a template element ; it was used in the MUC-6 dry run as part of the labor negotiation scenario, but as currentl y defined, it fails to capture meaningfully some of the recurring kinds of date information . In particular, problem s remain with normalizing various types of date expressions, including ones that are vague and/or require extensiv e use of calendar information.</Paragraph>
    </Section>
    <Section position="2" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
Scenario Template
</SectionTitle>
      <Paragraph position="0"> The issues with respect to the ST task relate primarily to the ambitiousness of the scenario template s defined for MUC-6 . Although the management scenario contained only five domain-specific slots (disregardin g slots containing pointers to other objects), it nonetheless reflected an interest in capturing as complete a representation of the basic event as possible . As a result, a few &amp;quot;peripheral&amp;quot; facts about the event were include d that were difficult to define in the task documentation and/or were not reported clearly in many of the articles .</Paragraph>
      <Paragraph position="1"> Two of the slots, VACANCY_REASON and ON_THE_JOB, had to be filled on the basis of inference fro m subtle linguistic cues in many cases . An entire appendix to the scenario definition is devoted to heuristics fo r filling the ON_THE_JOB slot . These two slots caused problems for the annotators as well as for the systems .</Paragraph>
      <Paragraph position="2"> The annotators' problems with VACANCY_REASON may have had more to do with understanding what th e scenario definition was saying than with understanding what the news articles were saying . The annotators' problems with ON_THE_JOB were probably more substantive, since the heuristics documented in the appendi x were complex and sometimes hard to map onto the expressions found in the news articles. A third slot, REL_OTHER_ORG, required special inferencing on the basis of both linguistics and world knowledge in order t o determine the corporate relationship between the organization a manager is leaving and the one the manager i s going to. There may, in fact, be just one organization involved -- the person could be leaving a post at a company in order to take a different (or an additional) post at the same company.</Paragraph>
      <Paragraph position="3">  Defining a generalized template structure and using Template Element objects as one layer in the structur e reduced the amount of effort required for participants to move their system from one scenario to another. Further simplification may be advisable in order to focus on core information elements and exclude somewha t idiosyncratic ones such as the three slots described above . In the case of the management succession scenario, a proposal was made to eliminate the three slots discussed above and more, including the relational object itself, an d to put the personnel information in the event object (see the SRA paper in this volume) . Much less information about the event would be captured, but there would be a much stronger focus on the most essential information elements. This would possibly lead to significant improvements in performance on the basic event-relate d elements and to development of good end-user tools for incorporating some of the domain-specific patterns into a generic extraction system.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>