<?xml version="1.0" standalone="yes"?>
<Paper uid="X96-1048">
  <Title>OVERVIEW OF RESULTS OF THE MUC-6 EVALUATION</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
EVALUATION TASKS
</SectionTitle>
    <Paragraph position="0"> A basic characterization of the challenge presented by each evaluation task is as follows: * Named Entity (NE) -- Insert SGML tags into the text to mark each string that represents a person, organization, or location name, or a date or time stamp, or a currency or percentage figure.</Paragraph>
    <Paragraph position="1"> * Coreference (CO) -- Insert SGML tags into the text to link strings that represent coreferring noun phrases.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
* Template Element (TE) -- Extract
</SectionTitle>
    <Paragraph position="0"> basic information related to organization and person entities, drawing evidence from anywhere in the text.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="423" type="metho">
    <SectionTitle>
* Scenario Template (ST) -- Drawing
</SectionTitle>
    <Paragraph position="0"> evidence from anywhere in the text, extract prespecified event information, and relate the event information to the particular organization and person entities involved in the event.</Paragraph>
    <Paragraph position="1"> The two SGML-based tasks required innovations to tie system-internal data structures to the original text so that the annotations could be inserted by the system without altering the original text in any other way. This capability has other useful applications as well, e.g., it enables text highlighting in a browser. It also facilitates information extraction, since some of the information in the extraction templates is in the form of literal text strings, which some systems have in the past had difficulty reproducing in their output.</Paragraph>
    <Paragraph position="2"> The inclusion of four different tasks in the evaluation implicitly encouraged sites to design general-purpose architectures that allow the production of a variety of types of output from a single internal representation in order to allow use of the full range of analysis techniques for all tasks. Even the simplest of the tasks, Named Entity, occasionally requires in-depth processing, e.g., to determine whether &amp;quot;60 pounds&amp;quot; is an expression of weight or of monetary value. Nearly half the sites chose to participate in all four tasks, and all but one site participated in at least one SGML task and one extraction task.</Paragraph>
    <Paragraph position="3"> The variety of tasks designed for MUC-6 reflects the interests of both participants and sponsors in assessing and furthering research that can satisfy some urgent text processing needs in the very near term and can lead to solutions to more  challenging text understanding problems in the longer term. Identification of certain common types of names, which constitutes a large portion of the Named Entity task and a critical portion of the Template Element task, has proven to be largely a solved problem. Recognition of alternative ways of identifying an entity constitutes a large portion of the Coreference task and another critical portion of the Template Element task and has been shown to represent only a modest challenge when the referents are names or pronouns. The mix of challenges that the Scenario Template task represents has been shown to yield levels of performance that are smilar to those achieved in previous MUCs, but this time with a much shorter time required for porting.</Paragraph>
    <Paragraph position="4"> Documentation of each of the tasks and summary scores for all systems evaluated can be found in the MUC-6 proceedings \[1\].</Paragraph>
  </Section>
  <Section position="6" start_page="423" end_page="423" type="metho">
    <SectionTitle>
CORPUS
</SectionTitle>
    <Paragraph position="0"> Testing was conducted using Wall Street Journal texts provided by the Linguistic Data Consortium. The articles used in the evaluation were drawn from a corpus of approximately 58,000 articles spanning the period of January 1993 through June 1994. This period comprised the &amp;quot;evaluation epoch.&amp;quot; As a condition for participation in the evaluation, the sites agreed not to seek out and exploit Wall Street Journal articles from that epoch once the training phase of the evaluation had begun, i.e., once the scenario for the Scenario Template task had been disclosed to the participants.</Paragraph>
    <Paragraph position="1"> The training set and test set each consisted of 100 articles and were drawn from the corpus using a text retrieval system called Managing Gigabytes, whose retrieval engine is based on a context-vector model, producing a ranked list of hits according to degree of match with a keyword search query. It can also be used to do unranked, Boolean retrievals. The Boolean retrieval method was used in the initial probing of the corpus to identify candidates for the Scenario Template task, because the Boolean retrieval is relatively fast, and the unranked results are easy to scan to get a feel for the variety of nonrelevant as well as relevant documents that match all or some of the query terms. Once the scenario had been identified, the ranked retrieval method was used, and the ranked list was sampled at different points to collect approximately 200 relevant and 200 nonrelevant articles, representing a variety of article types (feature articles, brief notices, editorials, etc.). From those candidate articles, the training and test sets were selected blindly, with later checks and corrections for imbalances in the relevant/nonrelevant categories and in article types.</Paragraph>
    <Paragraph position="2"> From the 100 test articles, a subset of 30 articles (some relevant to the Scenario Template task, others not) was selected for use as the test set for the Named Entity and Coreference tasks. The selection was again done blindly, with later checks to ensure that the set was fairly representative in terms of article length and type. Note that although Named Entity, Coreference and Template Element are defined as domain-independent tasks, the articles that were used for MUC-6 testing were selected using domain-dependent criteria pertinent to the Scenario Template task. The manually filled templates were created with the aid of Tabula Rasa, a software tool developed for the Tipster Text Program by New Mexico State University</Paragraph>
  </Section>
  <Section position="7" start_page="423" end_page="424" type="metho">
    <SectionTitle>
Computing Research Laboratory.
NAMED ENTITY
</SectionTitle>
    <Paragraph position="0"> The Named Entity (NE) task requires insertion of SGML tags into the text stream. The tag elements are ENAMEX (for entity names, comprising organizations, persons, and locations), TIMEX (for temporal expressions, namely direct mentions of dates and times), and NUMEX (for number expressions, consisting only of direct mentions of currency values and percentages). A TYPE attribute accompanies each tag element and identifies the subtype of each tagged string: for ENAMEX, the TYPE value can be ORGANIZATION, PERSON, or LOCATION; for TIMEX, the TYPE value can be DATE or TIME; and for NUMEX, the TYPE value can be MONEY or PERCENT.</Paragraph>
    <Paragraph position="1"> Text strings that are to be annotated are termed markables. As indicated above, markables include names of organizations, persons, and locations, and direct mentions of dates, times, currency values and percentages. Non-markables include names of products and other miscellaneous names (&amp;quot;Macintosh,&amp;quot; &amp;quot;Wall Street Journal&amp;quot; (in reference to the periodical as a physical object), &amp;quot;Dow Jones Industrial Average&amp;quot;); names of groups of people and miscellaneous usages of person names (&amp;quot;Republicans,&amp;quot; &amp;quot;Gramm-Rudman,&amp;quot; &amp;quot;Alzheimer\['s\]&amp;quot;); addresses and adjectival forms of location names (&amp;quot;53140 Gatchell Rd.,&amp;quot; &amp;quot;American&amp;quot;); indirect and vague mentions of dates and times (&amp;quot;a few minutes after the hour,&amp;quot; &amp;quot;thirty days before the end of the year&amp;quot;); and miscellaneous uses of numbers, including some that are similar to currency or percentage expressions (&amp;quot;\[Fees\] 1 3/4,&amp;quot; &amp;quot;12 points,&amp;quot; &amp;quot;1.5 times&amp;quot;).</Paragraph>
    <Paragraph position="2">  The evaluation metrics used for NE are essentially the same as those used for the two template-filling tasks, Template Element and Scenario Template. The following breakdowns of overall scores on NE are computed: * by slot, i.e., for performance across tag elements, across TYPE attributes, and across tag strings; * by subcategorization, i.e., for performance on each TYPE attribute separately; * by document section, i.e., for performance on distinct subparts of the article, as identified by the SGML tags contained in the original text: &lt;HL&gt; (&amp;quot;headline&amp;quot;), &lt;DD&gt; (&amp;quot;document date&amp;quot;), &lt;DATELINE&gt;, and &lt;TXT&gt; (the body of the article).</Paragraph>
  </Section>
  <Section position="8" start_page="424" end_page="425" type="metho">
    <SectionTitle>
NE Results Overall
</SectionTitle>
    <Paragraph position="0"> Fifteen sites participated in the NE evaluation, including two that submitted two system configurations for testing and one that submitted four, for a total of 20 systems. As shown in table 1, performance on the NE task overall was over 90% on the F-measure for half of the systems tested, which includes systems from seven different sites.</Paragraph>
    <Paragraph position="1"> On the basis of the results of the dry run, in which two of the nine systems scored over 90%, we were not surprised to find official scores that were similarly high, but it was not expected that so many systems would enter the formal evaluation and perform so well.</Paragraph>
    <Paragraph position="2"> It was also unexpected that one of the systems would match human performance on the task.</Paragraph>
    <Paragraph position="3"> Human performance was measured by comparing the 30 draft answer keys produced by the annotator at NRaD with those produced by the annotator at SAIC. This test measures the amount of variability between the annotators. When the outputs are scored in &amp;quot;key-to-response&amp;quot; mode, as though one annotator's output represented the &amp;quot;key&amp;quot; and the other the &amp;quot;response,&amp;quot; the humans achieved an overall F-measure of 96.68 and a corresponding error per response fill (ERR) score of 6%. The top-scoring system, the baseline configuration of the SRA system, achieved an F-measure of 96.42 and a corresponding error score of 5%.</Paragraph>
    <Paragraph position="4"> In considering the significance of these results from a general standpoint, the following facts about the test set need to be remembered:  decreasing F-Measure (P&amp;R) 1 1 Key to F-measure scores: BBN baseline configuration 93.65, BBN experimental configuration 92.88, Knight-Ridder 85.73, Lockheed-Martin 90.84, UManitoba 93.33, UMass 84.95, MITRE 91.2, NMSU CRL baseline configuration 85.82, NYU 88.19, USheffield 89.06, SRA baseline configuration 96.42, SRA &amp;quot;fast&amp;quot; configuration 95.66, SRA &amp;quot;fastest&amp;quot; configuration 92.61, SRA &amp;quot;nonames&amp;quot; configuration 94.92, SRI 94.0, Sterling Software 92.74.  * It represents just one style of writing (journalistic) and has a basic basic  toward financial news and a specific bias toward the topic of the Scenario Template task.</Paragraph>
    <Paragraph position="5"> * It was very small (only 30 articles).</Paragraph>
    <Paragraph position="6"> There were no markable time expressions in the test set, and there were only a few markable percentage expressions.</Paragraph>
    <Paragraph position="7"> The results should also be qualified by saying that they reflect performance on data that makes accurate usage of upper and lower case distinctions. What would performance be on data where case provided no (reliable) clues and for languages where case doesn't distinguish names? SRA ran an experiment on an upper-case version of the test set that showed 85% recall and 89% precision overall, with identification of organization names presenting the greatest problem. That result represents nearly a 10-point decrease on the F-measure from their official baseline. The case-insensitive results would be slightly better if the task guidelines themselves didn't depend on case distinctions in certain situations, as when identifying the right boundary for the organization name span in a string such as &amp;quot;the Chrysler division&amp;quot; (currently, only &amp;quot;Chrysler&amp;quot; would be tagged).</Paragraph>
  </Section>
  <Section position="9" start_page="425" end_page="427" type="metho">
    <SectionTitle>
NE Results on Some Aspects of
Task
</SectionTitle>
    <Paragraph position="0"> Figures 1 and 2 show the sample size for the various tag elements and TYPE values. Note that nearly 80% of the tags were ENAMEX and that almost half of those were subcategofized as organization names. As indicated in table 2, all systems performed better on identifying person names than on identifying organization or location names, and all but a few systems performed better on location names than on organization names.</Paragraph>
    <Paragraph position="1"> Organization names are varied in their form, consisting of proper nouns, general vocabulary, or a mixture of the two. They can also be quite long and complex and can even have internal punctuation such as a commas or an ampersand. Sometimes it is difficult to distinguish them from names of other types, especially from person names. Common organization names, first names of people, and location names can be handled by recourse to list lookup, although there are drawbacks: some names may be on more than one list, the lists will not be  realized in the text (e.g., may not cover the needed abbreviated form of an organization name, may not cover the complete person name), etc.</Paragraph>
    <Paragraph position="2"> The difference that recourse to lists can make in performance is seen by comparing two runs made by SRA. The experimental configuration resulted in a three point decrease in recall and one point decrease in precision, compared to the performance of the baseline system configuration. The changes occurred only in performance on identifying organizations. BBN conducted a comparative test in which the experimental configuration used a larger lexicon than the baseline configuration, but the exact nature of the difference is not known and the performance differences are very small. As with the SRA experiment, the only differences in performance between the two BBN configurations are with the organization type. The University of Durham reported that they had intended to use gazetteer and company name lists, but didn't, because they found that the lists did not have much effect on their system's performance.</Paragraph>
    <Paragraph position="3"> The error scores for persons, dates, and monetary expressions was less than or equal to 10% for the large majority of systems. Several systems posted scores under 10% error for locations, but none was able to do so for oganizations. For percentages, about half the systems had 0% error, which reflects the simplicity of that particular subtask. Note that the number of instances of percentages in the test set is so small that a single mistake could result in an error of 6%.</Paragraph>
    <Paragraph position="4"> Slot-level performance on ENAMEX follows a different pattern for most systems from slot-level performance on NUMEX and TIMEX. The general pattern is for systems to have done better on the TEXT slot than on the TYPE slot for ENAMEX tags and for systems to have done better on the TYPE slot than on the TEXT slot for NUMEX and TIMEX tags. Errors on the TEXT slot are errors in finding the right span for the tagged string, and this can be a problem for all three subcategories of tag. The TYPE slot, however, is a more difficult slot for ENAMEX than for the other subcategories. It involves a three-way distinction for ENAMEX and only a two-way distinction for NUMEX and TIMEX, and it offers the possibility of confusing names of one type with names of another, especially the possibility of confusing organization names with person names.</Paragraph>
    <Paragraph position="5"> Looking at the document section scores in table 3, we see that the error score on the body of the text was much lower than on the headline for all but a few systems. There was just one system that posted a higher error score on the body than on the headline, the baseline NMSU CRL configuration, and the difference in scores is largely due to the fact that the system overgenerated to a greater extent on the body than on the headline. Its basic strategy for  headlines was a conservative one: tag a string in the headline as a name only if the system had found it in the body of the text or if the system had predicted the name based on truncation of names found in the body of the text. Most, if not all, the systems that were evaluated on the NE task adopted the basic strategy of processing the headline after processing the body of the text.</Paragraph>
    <Paragraph position="6"> The interannotator variability test provides reference points indicating human performance on the different aspects of the NE task. The document section results show 0% error on Document Date and Dateline, 7% error on Headline, and 6% error on Text. The subcategory error scores were 6% on Organization, 1% on Person, and 4% on Location, 8% on Date, and 0% on Money and Percent. These results show that human variability on this task patterns in a way that is similar to the performance of most of the systems in all respects except perhaps one: the greatest source of difficulty for the humans was on identifying dates. Analysis of the results shows that some Date errors were a result of simple oversight (e.g., &amp;quot;fiscal 1994&amp;quot;) and others were a consequence of forgetting or misinterpreting the task guidelines with respect to determining the maximal span of the date expression (e.g., tagging &amp;quot;fiscal 1993's second quarter&amp;quot; and &amp;quot;Aug. 1&amp;quot; separately, rather than tagging &amp;quot;fiscal 1993's second quarter, ended Aug. 1&amp;quot; as a single expression in accordance with the task guidelines).</Paragraph>
  </Section>
  <Section position="10" start_page="427" end_page="427" type="metho">
    <SectionTitle>
NE Results on &amp;quot;Walkthrough
Article&amp;quot;
</SectionTitle>
    <Paragraph position="0"> In the answer key for the walkthrough article there are 69 ENAMEX tags (including a few optional ones), six TIMEX tags and six NUMEX tags. Interannotator scoring showed that one annotator missed tagging one instance of &amp;quot;Coke&amp;quot; as an (optional) organization, and the other annotator missed one date expression (&amp;quot;September&amp;quot;).</Paragraph>
    <Paragraph position="1"> Common mistakes made by the systems included missing the date expression, &amp;quot;the 21st century,&amp;quot; and spuriously identifying &amp;quot;60 pounds&amp;quot; (which appeared in the context, &amp;quot;Mr. Dooner, who recently lost 60 pounds over three-and-a-half months .... &amp;quot;) as a monetary value rather than ignoring it as a weight.</Paragraph>
    <Paragraph position="2"> In addition, a number of errors identifying entity names were made; some of those errors also showed up as errors on the Template Element task and are described in a later section of this paper.</Paragraph>
  </Section>
  <Section position="11" start_page="427" end_page="428" type="metho">
    <SectionTitle>
COREFERENCE
</SectionTitle>
    <Paragraph position="0"> The task as defined for MUC-6 was restricted to noun phrases (NPs) and was intended to be limited to phenomena that were relatively noncontroversial and easy to describe. The variety of high-frequency phenomena covered by the task is partially represented in the following hypothetical example, where all bracketed text segments are considered coreferential:</Paragraph>
    <Section position="1" start_page="428" end_page="428" type="sub_section">
      <SectionTitle>
\[Motor Vehicles International Corp.\]
</SectionTitle>
      <Paragraph position="0"> announced a major management shakeup .... \[MVI\] said the chief executive officer has resigned .... \[The Big 10 auto maker\] is attempting to regain market share .... \[It\] will announce significant losses for the fourth quarter .... A \[company\] spokesman said \[they\] are moving \[their\] operations to Mexico in a cost-saving effort .... \[MVI, \[the first company to announce such a move since the passage of the new international trade agreement\],\] is facing increasing demands from unionized workers .... \[Motor Vehicles International\] is \[the biggest American auto exporter to Latin America\].</Paragraph>
      <Paragraph position="1"> The example passage covers a broad spectrum of the phenomena included in the task. At one end of the spectrum are the proper names and aliases, which are inherently definite and whose referent may appear anywhere in the text. In the middle of the spectrum are definite descriptions and pronouns whose choice of referent is constrained by such factors as structural relations and discourse focus. On the periphery of the central phenomena are markables whose status as coreferring expressions is determined by syntax, such as predicate nominals (&amp;quot;Motor Vehicles International is the biggest American auto exporter to Latin America&amp;quot;) and appositives (&amp;quot;MVI, the first company to announce such a move since the passage of the new international trade agreement&amp;quot;). At the far end of the spectrum are bare common nouns, such as the prenominal &amp;quot;company&amp;quot; in the example, whose status as a referring expression may be questionable. An algorithm developed by the MITRE Corporation for MUC-6 was implemented by SAIC and used for scoring the task. The algorithm compares the equivalence classes defined by the coreference links in the manually-generated answer key and the system-generated response. The equivalence classes are the models of the identity equivalence coreference relation. Using a simple counting scheme, the algorithm obtains recall and precision scores by determining the minimal perturbations required to align the equivalence classes in the key and response. No metrics other than recall and precision were defined for this task, and no statistical significance testing was performed on the scores.</Paragraph>
    </Section>
  </Section>
  <Section position="12" start_page="428" end_page="429" type="metho">
    <SectionTitle>
CO Results Overall
</SectionTitle>
    <Paragraph position="0"> In all, seven sites participated in the MUC-6 coreference evaluation. Most systems achieved approximately the same levels of performance: five of the seven systems were in the 51%-63% recall  range and 62%-72% precision range. About half the systems focused only on individual coreference, which has direct relevance to the other MUC-6 evaluation tasks.</Paragraph>
    <Paragraph position="1"> A few of the evaluation sites reported that good name/alias recognition alone would buy a system a lot of recall and precision points on this task, perhaps about 30% recall (since proper names constituted a large minority of the annotations) and 90% precision. The precision figure is supported by evidence from the NE evaluation. In that evaluation, a number of systems scored over 90% on the named entity recall and precision metrics, providing a sound basis for good performance on the coreference task for individual entities.</Paragraph>
    <Paragraph position="2"> In the middle of the effort of preparing the test data for the formal evaluation, an interannotator variability test was conducted. The two versions of the independently prepared, manual annotations of 17 articles were scored against each other using the scoring program in the normal &amp;quot;key to response&amp;quot; scoring mode. The amount of agreement between the two annotators was found to be 80% recall and 82% precision. There was a large number of factors that contributed to the 20% disagreement, including overlooking coreferential NPs, using different interpretations of vague portions of the guidelines, and making different subjective decisions when the text of an article was ambiguous, sloppy, etc. Most human errors pertained to definite descriptions and bare nominals, not to names and pronouns.</Paragraph>
  </Section>
  <Section position="13" start_page="429" end_page="429" type="metho">
    <SectionTitle>
CO Results on Some Aspects of Task and on &amp;quot;Walkthrough Article&amp;quot;
</SectionTitle>
    <Paragraph position="0"> Task and on &amp;quot;Walkthrough Article&amp;quot; To keep the annotation of the evaluation data fairly simple, the MUC-6 planning committee decided not to design the notation to subcategorize linkages and markables in any way. Two useful attributes for the equivalence class as a whole would be one to distinguish individual coreference from type coreference and one to identify the general semantic type of the class (organization, person, location, time, currency, etc.). For each NP in the equivalence class, it would be useful to identify its grammatical type (proper noun phrase, definite common noun phrase, bare singular common noun phrase, personal pronoun, etc.). The decision to minimize the annotation effort makes it difficult to do detailed quantitative analysis of the results.</Paragraph>
    <Paragraph position="1"> An analysis by the participating sites of their system's performance on the walkthrough article provides some insight into performance on aspects of the coreference task that were dominant in that article. The article contains about 1000 words and approximately 130 coreference links, of which all but about a dozen are references to individual persons or individual organizations. Approximately 50 of the anaphors are personal pronouns, including reflexives and possessives, and 58 of the markables (anaphors and antecedents) are proper names, including aliases. The percentage of personal pronouns is relatively high (38%), compared to the test set overall (24%), as is the percentage of proper names (40% on this text versus an estimate of 30% overall).</Paragraph>
    <Paragraph position="2"> Performance on this particular article for some systems was higher than performance on the test set overall, reaching as high as 77% recall and 79% precision. These scores indicate that pronoun resolution techniques as well as proper noun matching techniques are good, compared to the techniques required to determine references involving common noun phrases. For common noun phrases, the systems were not required to include the entire NP in the response; the response could minimally contain only the head noun. Despite this flexibility in the expected contents of the response, the systems nonetheless had to implicitly recognize the full NP, since to be considered coreferential, the head and its modifiers all had to be consistent with another markable.</Paragraph>
  </Section>
  <Section position="14" start_page="429" end_page="430" type="metho">
    <SectionTitle>
TEMPLATE ELEMENT
</SectionTitle>
    <Paragraph position="0"> The Template Element (TE) task requires extraction of certain general types of information about entities and merging of the information about any given entity before presentation in the form of a template (or &amp;quot;object&amp;quot;). For MUC-6 the entities that were to be extracted were limited to organizations and persons) The ORGANIZATION object contains attributes (&amp;quot;slots&amp;quot;) for the string representing the organization name (ORG NAME), for strings representing any abbreviated versions of the name (ORG_ALIAS), for a string that describes the particular organization (ORG_DESCRIPTOR), for a subcategory of the type of organization (ORG_TYPE, whose permissible values are GOVERNMENT, COMPANY, and OTHER), and for canonical forms of the specific and general location of the organization (ORG LOCALE and ORG_COUNTRY). The PERSON object contains 3The task documentation includes definition of an &amp;quot;artifact&amp;quot; entity, but that entity type was not used in MUC-6 for either the dry run or the formal run. The entity types that were involved in the evaluation are the same as those required for the Scenario Template task.</Paragraph>
    <Paragraph position="1">  slots only for the string representing the person name (PER_NAME), for strings representing any abbreviated versions of the name (PERALIAS), and for strings representing a very limited range of titles (PER_TITLE).</Paragraph>
    <Paragraph position="2"> The task places heavy emphasis on recognizing proper noun phrases, as in the NE task, since all slots except ORG_DESCRIPTOR and PERTITLE expect proper names as slot fillers (in string or canonical form, depending on the slot.</Paragraph>
    <Paragraph position="3"> However, the organization portion of the TE task is not limited to recognizing the referential identity between full and shortened names; it requires the use of text analysis techniques at all levels of text structure to associate the descriptive and locative information with the appropriate entity. Analysis of complex NP structures, such as appositional structures and postposed modifier adjuncts, is needed in order to relate the locale and descriptor to the name in &amp;quot;Creative Artists Agency, the big Hollywood talent agency&amp;quot; and in &amp;quot;Creative Artists Agency, a big talent agency based in Hollywood.&amp;quot; Analysis of sentence structures to identify grammatical relations such as predicate nominals is needed in order to relate those same pieces of information in &amp;quot;Creative Artists Agency is a big talent agency based in Hollywood.&amp;quot; Analysis of discourse structure is needed in order to identify long-distance relationships.</Paragraph>
    <Paragraph position="4"> The answer key for the TE task contains one object for each specific organization and person mentioned in the text. For generation of a PERSON object, the text must provide the name of the person (full name or part of a name). For generation of an ORGANIZATION object, the text must provide either the name (full or part) or a descriptor of the organization. Since the generation of these objects is independent of the relevance criteria imposed by the Scenario Template (ST) task, there are many more ORGANIZATION and PERSON objects in the TE key than in the ST key.</Paragraph>
    <Paragraph position="5"> For the formal evaluation, there were 606 ORGANIZATION and 496 PERSON objects in the TE key, versus 120 ORGANIZATION and 137 PERSON objects in the ST key.</Paragraph>
    <Paragraph position="6"> The same set of articles was used for TE as for ST; therefore, the content of the articles is oriented toward the terms and subject matter covered by the ST task, which concerns changes in corporate management. 4 One effect of this bias is simply the number of entities mentioned in the articles: for the</Paragraph>
  </Section>
  <Section position="15" start_page="430" end_page="430" type="metho">
    <SectionTitle>
4 The method used for selecting the articles for the test
</SectionTitle>
    <Paragraph position="0"> set is described at the beginning of this article.</Paragraph>
    <Paragraph position="1"> test set used for the MUC-6 dry run, which was based on a scenario concerning labor union contract negotiations, there were only about half as many organizations and persons mentioned as there were in the test set used for the formal run.</Paragraph>
  </Section>
  <Section position="16" start_page="430" end_page="430" type="metho">
    <SectionTitle>
TE Results Overall
</SectionTitle>
    <Paragraph position="0"> Twelve systems -- from eleven sites, including one that submitted two system configurations for testing-- were tested on the TE task. All but two of the systems posted F-measure scores in the 70-80% range, and four of the systems were able to achieve recall in the 70-80% range while maintaining precision in the 80-90% range, as shown in the figure 4. Human performance was measured in terms of variability between the outputs produced by the two NRaD and SAIC evaluators for 30 of the articles in the test set (the same 30 articles that were used for NE and CO testing). Using the scoring method in which one annotator's draft key serves as the &amp;quot;key&amp;quot; and the other annotator's draft key serves as the &amp;quot;response,&amp;quot; the overall consistency score was 93.14 on the F-measure, with 93% recall and 93% precision.</Paragraph>
  </Section>
  <Section position="17" start_page="430" end_page="432" type="metho">
    <SectionTitle>
TE Results on Some Aspects of Task
</SectionTitle>
    <Paragraph position="0"> Given the more varied extraction requirements for the ORGANIZATION object, it is not surprising that performance on that portion of the TE task was not as good as on the PERSON object 5, as is clear in figure 5.</Paragraph>
    <Paragraph position="1"> Figure 6 indicates the relative amount of error contributed by each of the slots in the ORGANIZATION object. It is evident that the more linguistic processing necessary to fill a slot, the harder the slot is to fill correctly. The ORG_COUNTRY slot is a special case in a way, since it is required to be filled when the ORG_LOCALE slot is filled. (The reverse is not the case, i.e., ORG_COUNTRY may be filled even if ORG_LOCALE is not, but this situation is relatively rare.) Since a missing or spurious ORG_LOCALE is likely to incur the same error in ORG_COUNTRY, the error scores for the two slots are understandably similar.</Paragraph>
    <Paragraph position="2"> 5 The highest score for the PERSON object, 95% recall and 95% precision, is close to the highest score on the NE subcategorization for person, which was 98% recall and 99% precision.</Paragraph>
    <Paragraph position="3">  With respect to performance on ORG_DESCRIPTOR, note that there may be multiple descriptors (or none) in the text. However, the task does not require the system to extract all descriptors of an entity that are contained in the text; it requires only that the system extract one (or none). Frequently, at least one can be found in close proximity to an organization's name, e.g., as an appositive (&amp;quot;Creative Artists Agency, the big Hollywood talent agency&amp;quot;). Nonetheless, performance is much lower on this slot than on others.</Paragraph>
    <Paragraph position="4"> Leaving aside the fact that descriptors are common noun phrases, which makes them less obvious candidates for extraction than proper noun phrases would be, what reasons can we find to account for the relatively low performance on the ORG_DESCRIPTOR slot? One reason for low performance is that an organization may be identified in a text solely by a descriptor, i.e., without a fill for the ORG_NAME slot and therefore without the usual local clues that the NP is in fact a relevant descriptor. It is, of course, also possible that a text may identify an organization solely by name. Both possibilities present increased opportunities for systems to undergenerate or overgenerate. Also, the descriptor is not always close to the name, and some discourse processing may be requ~ed in order to identify it -- this is likely to increase the opportunity for systems to miss the information. A third significant reason is that the response fill had to match the key fill exactly in order to be counted correct; there was no allowance made in the scoring software for assigning full or partial credit if the response fill only partially matched the key fill. It should be noted that human performance on this task was also relatively low, but it is unclear whether the degree of disagreement can be accounted for primarily by the reasons given above or whether the disagreement is attributable to the fact that the guidelines for that slot had not been finalized at the time when the annotators created their version of the keys.</Paragraph>
  </Section>
  <Section position="18" start_page="432" end_page="433" type="metho">
    <SectionTitle>
TE Results on &amp;quot;Walkthrough
Article&amp;quot;
</SectionTitle>
    <Paragraph position="0"> TE performance of all systems on the walkthrough article was not as good as performance on the test set as a whole, but the difference is small for about half the systems. Viewed from the perspective of the TE task, the walkthrough article presents a number of interesting examples of entity type confusions that can result from insufficient processing. There are cases of organization names misidentified as person names, there is a case of a location name misidentified as an organization name, and there are cases of nonrelevant entity types (publications, products, indefinite references, etc.) misidentified as organizations. Errors of these kinds result in a penalty at the object level, since the extracted information is contained in the wrong type of object. Examples of each of these types of error appear below, along with the number of systems that committed the error. (An experimental configuration of the SRA system produced the same output as the baseline configuration and has been disregarded in the tallies; thus, the total number of systems tallied is eleven.)  1. Miscategorizations of entities as person</Paragraph>
    <Paragraph position="2"> organization category is indicated clearly by context in which full name appears, &amp;quot;John Dooner Will Succeed James At Helm of McCann-Erickson&amp;quot; in headline and &amp;quot;Robert L. James, chairman and chief executive officer of McCann-Erickson, and John J. Dooner Jr., the agency's president and chief operating officer&amp;quot; in the body of the article) eSix systems: J. Walter Thompson (also extracted with the name of &amp;quot;Walter Thompson&amp;quot;; organization category is indicated by context, &amp;quot;Peter Kim was hired from WPP Group's J. Walter Thompson last September...&amp;quot;) eFour systems: Fallon McElligott (organization category is indicated by context, &amp;quot;...other ad agencies, such as Fallon McElligott&amp;quot;) eOne system: Ammirati &amp; Puris (the presence of the ampersand is a clue, as is the context, &amp;quot;...president and chief executive officer of Ammirati &amp; Puris&amp;quot;; but note that the article also mentions the name of one of the company's founders,</Paragraph>
    <Paragraph position="4"> oSix systems: New York Times (publication name in phrase, &amp;quot;a framed page from the New York Times&amp;quot;; without sufficient context, the name can be ambiguous in its reference to a physical object versus an organization) eThree systems: Coca-Cola Classic (product name deriving from &amp;quot;Coca-Cola,&amp;quot; which appears separately in several places in the article and is occasionally ambiguous even in context between  product name and organization name) eOne system: Not Butter (part of product name, &amp;quot;I Can't Believe It's Not Butter&amp;quot;) eOne system: Taster (part of product name, &amp;quot;Taster's Choice&amp;quot;) * One system: Choice (part of product name, &amp;quot;Taster's Choice&amp;quot;) eFive systems: a hot agency (nonspecific  use of indefinite in phrase &amp;quot;...is interested in acquiring a hot agency&amp;quot;) Given the variety of contextual clues that must be taken into account in order to analyze the above entities correctly, it is understandable that just about any given system would commit at least one of them. But the problems are certainly tractable; none of the fifteen TE entities in the key (ten ORGANIZATION entities and five PERSON entities) was miscategofized by all of the systems. In addition to miscategorization errors, the walkthrough text provides other interesting examples of system errors at the object level and the slot level, plus a number of examples of system successes. One success for the systems as a group is that each of the six smaller ORGANIZATION objects and four smaller PERSON objects (those with just one or two filled slots in the key) was matched perfectly by at least one system; in addition, one larger ORGANIZATION object and two larger PERSON objects were perfectly matched by at least one system. Thus, each of the five PERSON objects in the key and seven of the ten ORGANIZATION objects in the key were matched perfectly by at least one system. The three larger ORGANIZATION objects that none of the systems got perfectly correct are for the McCann-Erickson, Creative Artists Agency, and Coca-Cola companies. Common errors in these three ORGANIZATION objects included missing the descriptor or locale/country or failing to identify the organization's alias with its name.</Paragraph>
  </Section>
  <Section position="19" start_page="433" end_page="434" type="metho">
    <SectionTitle>
SCENARIO TEMPLATE
</SectionTitle>
    <Paragraph position="0"> A Scenario Template (ST) task captures domain- and task-specific information. Three scenarios were defined in the course of MUC-6: (1) a scenario concerning the event of organizations placing orders to buy aircraft with aircraft manufacturers (the &amp;quot;aircraft order&amp;quot; scenario); (2) a scenario concerning the event of contract negotiations between labor unions and companies (the &amp;quot;labor negotiations&amp;quot; scenario); (3) a scenario concerning changes in corporate managers occupying executive posts (the &amp;quot;management succession&amp;quot; scenario). The first scenario was used as an example of the general design of the ST task, the second was used for the MUC-6 dry run evaluation, and the third was used for the formal evauation.</Paragraph>
    <Paragraph position="1"> One of the innovations of MUC-6 was to formalize the general structure of event templates, and all three  scenarios defined in the course of MUC-6 conformed to that general structure. In this article, the management succession scenario will be used as the basis for discussion.</Paragraph>
    <Paragraph position="2"> The management succession template consists of four object types, which are linked together via one-way pointers to form a hierarchical structure. At the top level is the TEMPLATE object, of which there is one instantiated for every document. This object points down to one or more SUCCESSION_EVENT objects if the document meets the event relevance criteria given in the task documentation. Each event object captures the changes occurring within a company with respect to one management post. The SUCCESSION_EVENT object points down to the Ib~AND_OUT object, which in turn points down to PERSON Template Element objects that represent the persons involved in the succession event. The IN_AND_OUT object contains STspecific information that relates the event with the persons. The ORGANIZATION Template Element objects are present at the lowest level along with the PERSON objects, and they are pointed to not only by the IN_AND_OUT object but also by the SUCCESSION_EVENT object. The organization pointed to by the event object is the organization where the relevant management post exists; the organization pointed to by the relational object is the organization that the person who is moving in or out of the post is coming from or going to.</Paragraph>
    <Paragraph position="3"> The scenario is designed around the management post rather than around the succession act itself. Although the management post and information associated with it are represented in the SUCCESSION_EVENT object, that object does not actually represent an event, but rather a state, i.e., the vacancy of some management post. The relational-level Iih~AND_OUT objects represent the personnel changes pertaining to that state.</Paragraph>
  </Section>
  <Section position="20" start_page="434" end_page="437" type="metho">
    <SectionTitle>
ST Results Overall
</SectionTitle>
    <Paragraph position="0"> Nine sites submitted a total of eleven systems for evaluation on the ST task. All the participating sites also submitted systems for evaluation on the TE and NE tasks. All but one of the development teams (UDurham) had members who were veterans of MUC-5.</Paragraph>
    <Paragraph position="1"> Of the 100 texts in the test set, 54 were relevant to the management succession scenario, including six that were only marginally relevant. Marginally relevant event objects are marked in the answer key as being optional, which means that a system is not penalized if it does not produce such an event object. The approximate 50-50 split between relevant and nonrelevant texts was  intentional and is comparable to the richness of the MUC-3 &amp;quot;TST2&amp;quot; test set and the MUC-4 &amp;quot;TST4&amp;quot; test set. (The test sets used for MUC-5 had a much higher proportion of relevant texts.) Systems are measured for their performance on distinguishing relevant from nonrelevant texts via the text filtering metric, which uses the classic information retrieval definitions of recall and precision.</Paragraph>
    <Paragraph position="2"> For MUC-6, text filtering scores were as high as 98% recall (with precision in the 80th percentile) or 96% precision (with recall in the 80th percentile). Similar tradeoffs and upper bounds on performance can be seen in the TST2 and TST4 results (see score reports in sections 2 and 4 of appendix G in \[2\]). However, performance of the systems as a group is better on the MUC-6 test set. The text filtering results for MUC-6, MUC-4 (TST4) and MUC-3 (TST2) are shown in figure 8.</Paragraph>
    <Paragraph position="3"> Whereas the Text Filter row in the score report shows the system's ability to do text filtering (document detection), the All Objects row and the individual Slot rows show the system's ability to do information extraction. The measures used for information extraction include two overall ones, the F-measure and error per response fill, and several other, more diagnostic ones (recall, precision, undergeneration, overgeneration, and substitution).</Paragraph>
    <Paragraph position="4"> The text filtering definition of precision is different from the information extraction definition of precision; the latter definition includes an element in the formula that accounts for the number of spurious template fills generated.</Paragraph>
    <Paragraph position="5"> The All Objects recall and precision scores are shown in figure 9. The highest ST F-measure score was 56.40 (47% recall, 70% precision).</Paragraph>
    <Paragraph position="6"> Statistically, large differences of up to 15 points may not be reflected as a difference in the ranking of the systems. Most of the systems fall into the same rank at the high end, and the evaluation does not clearly distinguish more than two ranks (see the paper on statistical significance testing by Chinchor in \[1\]). Human performance was measured in terms of interannotator variability on only 30 texts in the test set and showed agreement to be approximately 83%, when one annotator's templates were treated as the &amp;quot;key&amp;quot; and the other annotator's templates were treated as the &amp;quot;response.&amp;quot;  No analysis has been done of the relative difficulty of the MUC-6 ST task compared to previous extraction evaluation tasks. The one-month limitation on development in preparation for MUC-6 would be difficult to factor into the computation, and even without that additional factor, the problem of coming up with a reasonable, objective way of measuring relative task difficulty has not been adequately addressed. Nonetheless, as one rough measure of progress in the area of information extraction as a whole, we can consider the F-measures of the top-scoring systems from the  for MUC-6 and MUC-5 ST tasks Note that table 4 shows four top scores for MUC-5, one for each language-domain pair: English Joint Ventures (EJV), Japanese Joint Ventures (JJV), English Microelectronics (EME), and Japanese Microelectronics (JME). From this table, it may be reasonable to conclude that progress has been made, since the MUC-6 performance level is at least as high as for three of the four MUC-5 tasks and since that performance level was reached after a much shorter time.</Paragraph>
    <Paragraph position="7"> ST Results on Some Aspects of Task and on &amp;quot;Walkthrough Article&amp;quot; Three succession events are reported in the walkthrough article. Successful interpretation of three sentences from the walkthrough article is necessary for high performance on these events.</Paragraph>
    <Paragraph position="8"> The tipoff on the first two events comes at the end of the second paragraph: Yesterday, McCann made official what had been widely anticipated: Mr. James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr. Dooner, 45.</Paragraph>
    <Paragraph position="9"> The basis of the third event comes halfway through the two-page article: In addition, Peter Kim was hired from WPP Group's J. Walter Thompson last September as vice chairman, chief strategy officer, world-wide.</Paragraph>
    <Section position="1" start_page="437" end_page="437" type="sub_section">
      <SectionTitle>
Answer Key
</SectionTitle>
      <Paragraph position="0"> James out, Dooner in as CEO of McCann-Erickson as a result of James departing the workforce; James is still on the job as CEO; Dooner is not on the job as CEO yet, and his old job was with the same org as his new job.</Paragraph>
      <Paragraph position="1"> Event James out, Dooner in as chairman of #2 McCann-Erickson as a result of James departing the workforce; James is still on the job as chairman; Dooner is not on the job as chairman yet, and his old job was with the same org as his new job.</Paragraph>
      <Paragraph position="2"> Event Kim in as &amp;quot;vice chairman, chief strategy #3 officer, world-wide&amp;quot; of McCann-Erickson, where the vacancy existed for other/unknown reasons; he is already on the job in the post, and his old job was with J. Walter Thompson</Paragraph>
    </Section>
  </Section>
  <Section position="21" start_page="437" end_page="438" type="metho">
    <Paragraph position="0"> The article was relatively straightforward for the annotators who prepared the answer key, and there were no substantive differences in the output produced by each of the two annotators.</Paragraph>
    <Paragraph position="1"> Table 5 contains a paraphrased summary of the output that was to be generated for each of these events, along with a summary of the output that was actually generated by systems evaluated for MUC-6. The system-generated outputs are from three different systems, since no one system did better than all other systems on all three events.</Paragraph>
    <Paragraph position="2"> The substantive differences between the system-generated output and the answer key are indicated by underlining in the system output.</Paragraph>
    <Paragraph position="3"> Recurring problems in the system outputs include the information about whether the person is currently on the job or not and the information on where the outgoing person's next job would be and where the incoming person's previous job was.</Paragraph>
    <Paragraph position="4"> Note also that even the best system on the third event was unable to determine that the succession event was occurring at McCann-Efickson; in addition, it only partially captured the full title of the post. To its credit, however, it did recognize that the event was relevant; only two systems produced output that is recognizable as pertaining to this event. One common problem was the simple failure to recognize &amp;quot;hire&amp;quot; as an indicator of a succession.</Paragraph>
    <Paragraph position="5"> Two systems never filled the OTHER_ORG slot or its dependent slot, REL OTHER_ORG,</Paragraph>
    <Section position="1" start_page="437" end_page="438" type="sub_section">
      <SectionTitle>
System Output
</SectionTitle>
      <Paragraph position="0"> James out, Dooner in as CEO of McCann-Erickson as a result of a reassignment of James; James is no__! on the job as CEO any more, his new job is at the same as his old job; Dooner may or may not be on the job as CEO yet, and his old job was with the same org as his new job. (SRA satie_base system) James out, Dooner in as chairman of McCann-Erickson as a result of James departing the workforce; James is no_4 on the job as chairman any more; Dooner is already on the job as chairman, and his old job was with Ammirati &amp; Puris. (NYU system) Kim in as vice chairman of WPP Group, where the vacancy existed for other/unknown reasons; he may or may not be on the job in that post yet, and the article doesn't say where his old job was. (BBN system) outputs for walkthrough article despite the fact that data to fill those slots was often present; over half the IN_AND_OUT objects in the answer key contain data for those two slots.</Paragraph>
      <Paragraph position="1"> Almost without exception, systems did more poorly on those two slots than on any others in the SUCCESSION_EVENT and IN_AND_OUT objects; the best scores posted were 70% error on OTHER_ORG (median score of 79%) and 72% error on REL_OTHER ORG (median of 86%).</Paragraph>
      <Paragraph position="2"> Performance on the VACANCY_REASON and ON_THE JOB slots was better for nearly all systems. The lowest error scores were 56% on VACANCY_REASON (median of 70%) and 62% on ONZI'HE_JOB (median of 71%).</Paragraph>
      <Paragraph position="3"> The slot that most systems performed best on is NEWSTATUS; the lowest error score posted on that slot is 47% (median of 55%). This slot has a limited number of fill options, and the right answer is almost always either IN or OUT, depending on whether the person involved is assuming a post (IN) or vacating a post (OUT). Performance on the POST slot was not quite as good; the lowest error was 52% (median of 65%). The POST slot requires a text string as fill, and there is no finite list of possible fills for the slot. As seen in the third event of the walkthrough article, the fill can be an extended title such as &amp;quot;vice chairman, chief strategy officer, world-wide.&amp;quot; For most events, however, the fill is one of a large handful of possibilities, including &amp;quot;chairman,&amp;quot; &amp;quot;president,&amp;quot; &amp;quot;chief executive \[officer\],&amp;quot; &amp;quot;CEO,&amp;quot; &amp;quot;chief operating officer,&amp;quot; &amp;quot;chief financial officer,&amp;quot; etc.</Paragraph>
    </Section>
  </Section>
  <Section position="22" start_page="438" end_page="439" type="metho">
    <SectionTitle>
DISCUSSION: CRITIQUE OF
TASKS
</SectionTitle>
    <Paragraph position="0"> Named Entity The primary subject for review in the NE evaluation is its limited scope. A variety of proper name types were excluded, e.g. product names. The range of numerical and temporal expressions covered by the task was also limited; one notable example is the restriction of temporal expressions to exclude &amp;quot;relative&amp;quot; time expressions such as &amp;quot;last week&amp;quot;. Restriction of the corpus to Wall Street Journal articles resulted in a limited variety of markables and in reliance on capitalization to identify candidates for annotation.</Paragraph>
    <Paragraph position="1"> Some work on expanding the scope of the NE task has been carried out in the context of a foreign-language NE evaluation conducted in the spring of 1996. This evaluation is called the MET (Multilingual Named Entity) and, like MUC-6, was carried out under the auspices of the Tipster Text program. The experience gained from that evaluation will serve as critical input to revising the Engish version of the task.</Paragraph>
    <Paragraph position="2"> Coreference Many aspects of the CO task are in definite need of review for reasons of either theory or practice. One set of issues concerns the range of syntactically governed correference phenomena that are considered markable. For example, apposition as a markable phenomenon was restrictively defined to exclude constructs that could rather be analyzed as left modification, such as &amp;quot;chief executive Scott McNealy,&amp;quot; which lacks the comma punctuation that would clearly identify &amp;quot;executive&amp;quot; as the head of an appositive construction. Another set of issues is semantic in nature and includes fimdamental questions such as the validity of including type coreferrence in the task and the legitimacy of the implied definition of coteference versus reference. If an antecedent expression is nonreferential, can it nonetheless be considered coreferential with subsequent anaphoric expressions? Or can only referring expressions corefer? Finally, the current notation presents a set of issues, such as its inability to represent multiple antecedents, as in conjoined NPs, or alternate antecedents, as in the case of referential ambiguity.</Paragraph>
    <Paragraph position="3"> In short, the preliminary nature of the task design is reflected in the somewhat unmotivated boundaries between markables and nonmarkables and in weaknesses in the notation. One indication of immaturity of the task definition (as well as an indication of the amount of genuine textual ambiguity) is the fact that over ten percent of the linkages in the answer key were marked as &amp;quot;optional.&amp;quot; (Systems were not penalized if they failed to include such linkages in their output.) The task definition is now under review by a discourse working group formed in 1996 with representatives from both inside and outside the MUC commuity, including representatives from the spoken-language community.</Paragraph>
    <Section position="1" start_page="438" end_page="438" type="sub_section">
      <SectionTitle>
Template Element
</SectionTitle>
      <Paragraph position="0"> There are miscellaneous outstanding problems with the TE task. With respect to the ORGANIZATION and PERSON objects, there are issues such as rather fuzzy distinctions among the three organization subtypes and between the organization name and alias, the extremely limited scope of the person title slot, and the lack of a person descriptor slot. The ARTIFACT object, which was not used for either the dry run or the formal evaluation, needs to be reviewed with respect to its general utility, since its definition reflects primarily the requirements of the MUC-5 microelectronics task domain. There is a taskneutral DATE slot that is defined as a template element; it was used in the MUC-6 dry run as part of the labor negotiation scenario, but as currently defined, it fails to capture meaningfully some of the recurring kinds of date information. In particular, problems remain with normalizing various types of date expressions, including ones that are vague and/or require extensive use of calendar information.</Paragraph>
    </Section>
    <Section position="2" start_page="438" end_page="439" type="sub_section">
      <SectionTitle>
Scenario Template
</SectionTitle>
      <Paragraph position="0"> The issues with respect to the ST task relate primarily to the ambitiousness of the scenario templates defined for MUC-6. Although the management scenario contained only five domain-specific slots (disregarding slots containing pointers to other objects), it nonetheless reflected an interest in capturing as complete a representation of the basic event as possible. As a result, a few &amp;quot;peripheral&amp;quot; facts about the event were included that were difficult to define in the task documentation and/or were not reported clearly in many of the articles.</Paragraph>
      <Paragraph position="1"> Two of the slots, VACANCY_REASON and ON_THE_JOB, had to be filled on the basis of inference from subtle linguistic cues in many cases. An entire appendix to the scenario definition is devoted to heuristics for filling the ON_THE JOB slot. These two slots caused problems for the  annotators as well as for the systems. The annotators' problems with VACANCY_REASON may have had more to do with understanding what the scenario definition was saying than with understanding what the news articles were saying. The annotators' problems with ONZI'HE_JOB were probably more substantive, since the heuristics documented in the appendix were complex and sometimes hard to map onto the expressions found in the news articles. A third slot, REL_OTHER_ORG, required special inferencing on the basis of both linguistics and world knowledge in order to determine the corporate relationship between the organization a manager is leaving and the one the manager is going to. There may, in fact, be just one organization involved -- the person could be leaving a post at a company in order to take a different (or an additional) post at the same company.</Paragraph>
      <Paragraph position="2"> Defining a generalized template structure and using Template Element objects as one layer in the structure reduced the amount of effort required for participants to move their system from one scenario to another. Further simplification may be advisable in order to focus on core information elements and exclude somewhat idiosyncratic ones such as the three slots described above. In the case of the management succession scenario, a proposal was made to eliminate the three slots discussed above and more, including the relational object itself, and to put the personnel information in the event object. Much less information about the event would be captured, but there would be a much stronger focus on the most essential information elements. This would possibly lead to significant improvements in performance on the basic event-related elements and to development of good end-user tools for incorporating some of the domain-specific patterns into a generic extraction system.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>