<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1010"> <Title>DESCRIPTORS RECALL PRECISION</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> TEMPLATE ELEMENT TASK </SectionTitle> <Paragraph position="0"> AATM7 was applied to the MUC-7 Template Element task in order to test some theories of coreference that were being investigated under the TIPSTER III research activity. The Template Element task requires an automatic system to build templates for every person, organization, and artifact entity, as well as every location.</Paragraph> <Paragraph position="1"> Entities. The entities are defined as follows. An organization object consists of the organization's name and aliases found in the text, a type slot of ORGANIZATION, one descriptor phrase, and the category of the organization: ORG_CO, ORG_GOVT, or ORG_OTHER. A person object consists of the person's name and aliases found in the text, a type slot of PERSON, one descriptor phrase, and the category of the person: PER_CIV or PER_MIL. An artifact object consists of the artifact's name and aliases found in the text, a type slot of ARTIFACT, one descriptor phrase, and the category of the artifact: ART_AIR, ART_LAND, or ART_WATER. To perform this task perfectly, an automatic system must link all references to the same entity within a text, and collect those references, whether they be names or descriptive noun phrases. The entire list of unique names for an entity is placed in the &quot;NAME&quot; slot. From the descriptors found, the system must pick one and put it in the &quot;DESCRIPTOR&quot; slot, provided it is not &quot;insubstantial&quot; according to the fill rules (e.g. &quot;the company&quot; or &quot;Dr.&quot;). Pronouns are also excluded from the entity object. Additionally, the system must decide to what category the entity belongs, either through its knowledge base or the surrounding context, e.g. &quot;Gen. Smith&quot; vs. &quot;Ms. Smith&quot; as PER_MIL vs. 
PER_CIV.</Paragraph> <Paragraph position="2"> The limitation to one descriptor can hide how well coreference resolution has performed: a system may have found all of the correct descriptive phrases plus one incorrect descriptor and then chosen the incorrect one, so that the entire slot is scored as incorrect. Lockheed Martin is planning to test a multiple-descriptor version of MUC-7 in the near future. Of the three entity types, &quot;PERSON&quot; and &quot;ORGANIZATION&quot; are the most similar, since language is used in similar ways to describe them. Both can be named, where the &quot;name&quot; is an identity that is usually unique within the context of a story. The artifact, which in MUC terms can be a land, air, sea, or space vehicle, is sometimes named, but often the tag that is considered the name is merely a type. For example, a story about three different F-14 crashes may, according to MUC rules, produce three different entities named &quot;F-14&quot;, whose only differences would lie in information not captured by the TE object.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Locations </SectionTitle> <Paragraph position="0"> Locations are defined as follows: A location object consists of the locale found in the text, the country where the locale exists, and the locale type: CITY, PROVINCE, COUNTRY, REGION, AIRPORT, or UNK. The location object's locale slot is filled with the most specific reference to a location. For example, if the location were &quot;Philadelphia, PA,&quot; the locale slot would be filled with &quot;Philadelphia.&quot; The country would be &quot;United States&quot; and the locale type would be &quot;CITY.&quot; The deficiency of this design is that it fails to differentiate between the actual location and any other city named &quot;Philadelphia&quot; in the nation. 
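The fill rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the NLToolset implementation: the class name LocationObject, its field names, and the helper from_mention are hypothetical, and taking the text before the first comma as the most specific reference is an assumption made only for this example.

```python
from dataclasses import dataclass

# Hypothetical names throughout: LocationObject and from_mention are
# illustrative only, not the NLToolset's or the MUC-7 scorer's structures.
LOCALE_TYPES = {"CITY", "PROVINCE", "COUNTRY", "REGION", "AIRPORT", "UNK"}


@dataclass(frozen=True)
class LocationObject:
    locale: str       # most specific reference found in the text
    country: str      # country where the locale exists
    locale_type: str  # one of LOCALE_TYPES

    def __post_init__(self):
        # Reject locale types outside the set listed in the task definition.
        if self.locale_type not in LOCALE_TYPES:
            raise ValueError(f"unknown locale type: {self.locale_type}")


def from_mention(mention: str, country: str, locale_type: str) -> LocationObject:
    """Keep only the part before the first comma (an assumed simplification)."""
    locale = mention.split(",")[0].strip()
    return LocationObject(locale, country, locale_type)


pa = from_mention("Philadelphia, PA", "United States", "CITY")
ms = from_mention("Philadelphia, MS", "United States", "CITY")
print(pa)
print(pa == ms)  # compares the two filled objects field by field
```

Because both mentions reduce to the same locale, country, and type, the two objects compare equal, which is the ambiguity noted above.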
An alternative design, which has been used for other NLToolset applications, contains a locale slot that holds the entire phrase describing the locale. Some examples are: &quot;at the checkpoint on Route 30&quot;, &quot;southwest of Miami&quot;, and &quot;Wilmington, Delaware&quot;. Additionally, the location object contains slots for whatever other information can be gleaned from the text or from on-line resources, such as a gazetteer. This includes slots for city, country, province, latitude/longitude, region, or water.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> TIPSTER Research </SectionTitle> <Paragraph position="0"> AATM7 was developed with a focus on investigating a number of techniques involved in coreference resolution. Coreference resolution can be thought of as the identification and linking of all references to a particular entity. References may be in the form of names, pronouns, or noun phrases.</Paragraph> <Paragraph position="1"> Syntax is frequently used by an author to associate a descriptive phrase with an entity. This can be seen in the following examples: APPOSITIVE: &quot;Lockheed Martin, an aerospace firm,&quot;; PRENOMINAL: &quot;the aerospace firm, Lockheed Martin&quot;; NAME-MODIFIED HEAD NOUN: &quot;the Lockheed Martin aerospace firm&quot;; PREDICATE NOMINATIVE: &quot;Lockheed Martin is an aerospace firm&quot;. When an entity is referred to only by a descriptive phrase, finding its true identity is very challenging. 
The sentence &quot;The president has announced that he will resign.&quot; has varying degrees of import, depending on the sentence that precedes it...</Paragraph> <Paragraph position="2"> &quot;Coca Cola Company today revealed the future plans of its president, James Murphy.&quot; &quot;Impeachment hearings were scheduled to begin today against President Clinton.&quot; An automatic system can use the information closely related by syntax to the entity, in this case the title &quot;President&quot; or the prenominal &quot;its president&quot;, to identify the entity referred to by &quot;the president.&quot; This is the heart of our current research. Our aim is to find all descriptive information closely related by syntax and to build a story-specific ontology for each entity so that far-flung references that depend on this semantic information can be identified. As part of this research, the Template Element development keys were analyzed to determine how often the descriptors of organizations and persons are directly associated by syntax. A surprisingly large number of descriptive phrases within the keys can be directly associated with an entity by way of syntax. Of a total of approximately 900 descriptors, 125 were organization descriptors and 775 were person descriptors. This split is disproportionate, since there are actually more organization entities (985) than person entities (802) in the keys.</Paragraph> <Paragraph position="3"> The following table shows the breakdown by category and entity type. &quot;Association by Context&quot; refers to descriptors that have been found in titles, prenominal phrases, appositives, and predicate nominatives. &quot;Association by Reference&quot; refers to a remote reference to a named entity. &quot;Un-named&quot; refers to entities described by noun phrases alone, e.g. &quot;a local bank.&quot; This data supports the hypothesis that much reliable descriptive information can be obtained through syntactic association. 
This descriptive information can be associated with the entity object and then be used to help resolve associations by reference, in a manner similar to that used for organizations in the Lockheed Martin MUC-6 system, LOUELLA. That system used a semantic filter, which compared descriptive phrases with the semantic content of organization names, as in the following example.</Paragraph> <Paragraph position="4"> &quot;Buster Brown Shoes&quot; => (buster brown shoes shoe footwear); &quot;the footwear maker&quot; => (footwear maker make manufacturer). Since person names rarely include semantic content, we must rely on other descriptive information to build the semantics, either through world knowledge stored in the system's knowledge base or through associations found in the text itself.</Paragraph> <Paragraph position="5"> As part of Lockheed Martin's TIPSTER research, the freeware Brill part-of-speech tagger was connected to the NLToolset to see if it could help streamline the process of building patterns to find descriptors. The motivation was that standard NLToolset processing provides all possible parts of speech for each token rather than a single tag. It was found that a package for finding and correctly linking the majority of person descriptors could be written in about a week by combining the information that Brill provides with that provided by the NLToolset, i.e. symbol name, semantic category, and possible parts of speech as found in the Finding artifacts and linking up all references to the same entity has proved especially challenging because of the unusual way that artifacts are described in text, and the way that the descriptions are categorized for MUC-7. For instance, &quot;Boeing 747&quot; and &quot;F-14&quot; are considered names, whereas &quot;TWA Flight 800&quot; is considered a descriptor. 
Under the TIPSTER research, a new algorithm was developed to find vehicles and resolve coreferences. The algorithm differs from that for organizations and people in that a match is assumed to belong to the most recently seen entity, unless there is some information to contradict this assumption. The possible types of contradictory information are: model information, manufacturer, military branch, airline, and flight number. In addition, if the comparison reveals that one entity has military information and the other has airline information, there is a contradiction. The variable-binding feature of the NLToolset's pattern matching also allows the developer to extract type information while finding the entities in the text. This type information helps the system to distinguish between entities during coreference resolution.</Paragraph> </Section> </Paper>