<?xml version="1.0" standalone="yes"?>
<Paper uid="X96-1037">
  <Title>SRI's Tipster II Project</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Event Merging
</SectionTitle>
    <Paragraph position="0"> In describing the system, we will say what it does, given as input the following paragraph from the management succession domain of MUC-6: A. C. Nielsen Co. said George Garrick, 40 years old, president of Information Resources Inc.'s London-based European Information Services operation, will become president and chief operating officer of Nielsen Marketing Research USA, a unit of Dun &amp; Bradstreet Corp.</Paragraph>
    <Paragraph position="1"> He succeeds John H. Costello, who resigned in March.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="201" type="metho">
    <SectionTitle>
1. The Name Recognizer recognizes the names
</SectionTitle>
    <Paragraph position="0"> of persons, organizations, and locations, as well as such special constructions as dates and amounts of money. There are three primary methods for this.</Paragraph>
    <Paragraph position="1"> We have patterns for recognizing the internal structure of names, as in &amp;quot;A.C. Nielsen Co.&amp;quot; We have a list of common names, many of which could not otherwise be recognized, such as &amp;quot;IBM&amp;quot; and &amp;quot;Toys 'R' Us&amp;quot;. Finally, we recognize or reclassify names on the basis of their immediate context. For example, if we see &amp;quot;XYZ's sales&amp;quot; or &amp;quot;the CEO of XYZ&amp;quot;, then we know XYZ is a company.</Paragraph>
    <Paragraph position="2"> In our sample text, this phase results in the following labelling: [A. C. Nielsen Co.]co said [George Garrick]per, 40 years old, president of Information Resources Inc.'s London-based [European Information Services]co operation, will become president and chief operating officer of [Nielsen Marketing Research USA]co, a unit of [Dun &amp; Bradstreet Corp.]co.</Paragraph>
    <Paragraph position="3"> He succeeds [John H. Costello]per, who resigned in [March]date.</Paragraph>
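    <Paragraph>The three methods described for the Name Recognizer can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions, the name list, and the function find_companies are hypothetical stand-ins, not the actual FASTUS patterns or lexicon, and only the company label is handled.

```python
import re

# Hypothetical resources: the real FASTUS name list and patterns are not given in the text.
KNOWN_COMPANIES = {"IBM", "Toys 'R' Us"}

# Method 1: internal structure, here capitalized words followed by a company designator.
COMPANY_SUFFIX = re.compile(r"(?:[A-Z][\w.]*\s+)+(?:Co\.|Corp\.|Inc\.)")

# Method 3: immediate context that marks the neighbouring name as a company.
COMPANY_CONTEXT = [
    re.compile(r"\b([A-Z][A-Za-z]+)'s sales\b"),
    re.compile(r"\bthe CEO of ([A-Z][A-Za-z]+)\b"),
]


def find_companies(text):
    """Return the set of strings labelled as companies by any of the three methods."""
    found = set()
    for match in COMPANY_SUFFIX.finditer(text):
        found.add(match.group(0))
    # Method 2: lookup in a list of names that have no recognizable internal structure.
    for name in KNOWN_COMPANIES:
        if name in text:
            found.add(name)
    for pattern in COMPANY_CONTEXT:
        for match in pattern.finditer(text):
            found.add(match.group(1))
    return found
```

A real recognizer would also have to resolve conflicts between the methods, for example when context reclassifies a name that a pattern had labelled as a person.</Paragraph>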
    <Paragraph position="4"> 2. The Basic Phrase Recognizer recognizes basic noun groups, that is, noun phrases up through the head noun. It also recognizes verb groups, or verbs together with their auxiliaries and embedded adverbs; certain predicate complement constructions are also analyzed as verb groups. It also labels prepositions and other particles, such as the possessive marker, relative pronouns, and conjunctions. The core grammar for this phase is domain-independent, but there are some domain-dependent specializations of the rules, where special semantics applies. For example, there is a general rule allowing a noun-hyphen-past participle sequence in the adjective position of noun groups, and there is a specialized version of this for a location followed by &amp;quot;-based&amp;quot;, as in &amp;quot;London-based&amp;quot;. In the sample text, this phase results in the following labelling: [A. C. Nielsen Co.]NG [said]VG [George</Paragraph>
  </Section>
  <Section position="6" start_page="201" end_page="201" type="metho">
    <SectionTitle>
3. The Complex Phrase Recognizer recognizes
</SectionTitle>
    <Paragraph position="0"> complex noun groups and verb groups. For complex noun groups it attaches possessives, &amp;quot;of&amp;quot; phrases, controlled prepositional phrases, and age and other appositives to head nouns, and it recognizes some cases of noun group conjunction. For verb groups, it attaches support verbs to their content verb or nominalization complements. Some of these rules are domain-independent, but for any given domain we typically implement a number of high-priority, domain-dependent specializations of the general rules. For example, for management succession, we have complex noun groups for companies, persons, and positions. A company can have another company as a possessive, as in &amp;quot;Information</Paragraph>
    <Section position="1" start_page="201" end_page="201" type="sub_section">
      <SectionTitle>
Resources Inc.'s London-based European Informa-
</SectionTitle>
      <Paragraph position="0"> tion Services operation&amp;quot;. A relational company term such as &amp;quot;unit&amp;quot; can have another company as a complement. Companies can take a company appositive.</Paragraph>
      <Paragraph position="1"> Position titles can be conjoined, and a position title can have an &amp;quot;of&amp;quot; phrase specifying the company. Persons can have position appositives.</Paragraph>
      <Paragraph position="2"> In the sample text, this phase results in the following labelling: [A. C. Nielsen Co.]co [said]VG</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="201" end_page="202" type="metho">
    <SectionTitle>
4. The Clause-Level Event Recognizer recognizes
</SectionTitle>
    <Paragraph position="0"> events in the domain of interest. This is done by matching the output of the Complex Phrase Recognizer with a set of patterns specifying the subject, verb, object, and prepositional phrases in which the events are typically expressed. In addition, locative, temporal, and epistemic adjuncts are recognized at this stage. Examples of patterns for the management succession domain are as follows:  As the patterns are recognized, event structures are built up, indicating what type of event occurred and who and what the participants are. For the management succession domain, there is an event structure for a state, specifying that a person is in a position at an organization, and a structure for transitions between two states.</Paragraph>
    <Paragraph position="1"> For the sample text, the following four event structures are constructed, corresponding to the four patterns above:  5. Once individual clause-level patterns have been recognized, the event structures that are built up are merged with other event structures from the  same and previous sentences. There are various constraints of consistency, compatibility, and distance that govern whether or not the two merge.</Paragraph>
    <Paragraph position="2"> For the sample text, merging the four events found by the Clause-Level Event Recognizer results in the two following transitions, both with the same end state, the first person-centered and the second position-centered:
      (Person: Garrick, Position: president, Org: EIS) =&gt; (Person: Garrick, Position: president, Org: NMR)
      (Person: Costello, Position: president, Org: NMR) =&gt; (Person: Garrick, Position: president, Org: NMR)
    This result is then mapped into the desired template, which may be different since in general its structure will be determined by retrieval requirements rather than how the information is typically expressed in texts.</Paragraph>
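    <Paragraph>The consistency check behind event merging can be illustrated with a small sketch. It assumes that an event structure is a dictionary of slots and that an unfilled slot (including an unresolved pronoun such as he) is represented by None. The slot names and the functions compatible and merge are hypothetical, and the distance constraints mentioned above are not modelled.

```python
def compatible(a, b):
    """Two partial event structures are compatible if no shared slot holds conflicting values."""
    shared = [slot for slot in a if slot in b]
    return all(a[s] is None or b[s] is None or a[s] == b[s] for s in shared)


def merge(a, b):
    """Return the union of two compatible structures, or None if they conflict."""
    if not compatible(a, b):
        return None
    merged = dict(a)
    for slot, value in b.items():
        # Fill only the slots that the first structure leaves open.
        if merged.get(slot) is None:
            merged[slot] = value
    return merged


# Garrick will become president of NMR; he (unresolved) succeeds Costello.
become = {"person_in": "Garrick", "person_out": None, "position": "president", "org": "NMR"}
succeed = {"person_in": None, "person_out": "Costello", "position": None, "org": None}
merged = merge(become, succeed)
```

In this sketch the merge fills the open incoming-person slot of the second structure with Garrick, which is how merging can resolve a pronoun as a side effect.</Paragraph>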
  </Section>
  <Section position="8" start_page="202" end_page="203" type="metho">
    <SectionTitle>
3 FastSpec: A Declarative
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="202" end_page="203" type="sub_section">
      <SectionTitle>
Specification Language
</SectionTitle>
      <Paragraph position="0"> In the first version of FASTUS (Hobbs et al., 1992), the finite-state transducers were represented in a table of state changes with blocks of code associated with the final states. Only the developers were able to define patterns in this system. The next version, used in MUC-5 (Appelt et al., 1993), had a graphical interface for defining state changes and allowed blocks of code to be associated with transitions. Only a small group of cognoscenti were able to use this system.</Paragraph>
      <Paragraph position="1"> One of the first accomplishments of the current project was the definition and development of a declarative specification language called FastSpec.</Paragraph>
      <Paragraph position="2"> It enabled the easy definition of patterns and their associated semantics, and made it possible for a larger set of users to define the patterns.</Paragraph>
      <Paragraph position="3"> FastSpec allows the definition of multiple grammars, one for each phase. The terminal symbols in the grammar for a phase correspond to the objects produced by the previous phase, and their attributes can be accessed and checked. The rules have a syntactic part, expressing the pattern in the form of a regular expression, with attribute and other constraints permitted on the terminal symbols. They also have a semantic part, which specifies how attributes are to be set in the output objects of the phase.</Paragraph>
      <Paragraph position="4"> The following is a fragment of a grammar for verb groups in the Basic Phrase Recognizer:</Paragraph>
      <Paragraph position="6"> This covers a phrase like &amp;quot;could not really have left&amp;quot;.</Paragraph>
      <Paragraph position="7"> V-en and Adv refer to words that are past participles and adverbs, respectively. V[have] indicates some form of the verb &amp;quot;have&amp;quot;. The use of indices like &amp;quot;:1&amp;quot; allows us to access the attributes of terminal symbols. The semantics in these rules sets the features of active, aspect, tense, and negative appropriately, and sets head to point to the input object providing the past participle.</Paragraph>
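      <Paragraph>Since the FastSpec rule itself is not reproduced here, the following sketch only imitates its two parts, a regular-expression pattern over terminal symbols and a semantic part that sets attributes. It is not FastSpec syntax: the tag names Modal and Not, the pattern VG_PATTERN, and the function parse_verb_group are assumptions made for illustration, and tense is left out.

```python
import re

# Assumed terminal symbols: only V-en, Adv and V[have] are named in the text;
# "Modal" and "Not" are hypothetical tags for "could" and "not".
VG_PATTERN = re.compile(r"^(Modal )?(Not )?(Adv )*V\[have\] (Adv )*V-en $")


def parse_verb_group(tagged):
    """tagged: list of (word, tag) pairs. Return an attribute dictionary or None."""
    tags = "".join(tag + " " for _, tag in tagged)
    # Syntactic part: the tag sequence must match the regular expression.
    if VG_PATTERN.match(tags) is None:
        return None
    # Semantic part: set the output attributes from the matched input objects.
    return {
        "active": True,
        "aspect": "perfect",
        "negative": any(tag == "Not" for _, tag in tagged),
        "head": next(word for word, tag in tagged if tag == "V-en"),
    }
```

The head attribute points to the past participle, as the description above requires.</Paragraph>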
      <Paragraph position="8"> The following is one rule in a grammar for the Clause-Level Event Recognizer for the labor negotiations domain used in the dry run of MUC-6 in April 1995.</Paragraph>
      <Paragraph position="10"> This says that when an organization resumes talks with an organization, it is a significant event.</Paragraph>
      <Paragraph position="11"> Event-Adj is matched by temporal, locative, epistemic and other adverbial adjuncts. Compl is matched by various possible noun complements.</Paragraph>
      <Paragraph position="12"> This rule creates an event structure in which the event type is Talk, the parties are the subject and the object of &amp;quot;with&amp;quot; matched by the patterns, and the talk status is Bargaining.</Paragraph>
      <Paragraph position="13"> FastSpec has made it immensely easier for us to specify grammars, and recently it has become  one of the principal influences on the Tipster effort to develop a community-wide Common Pattern-Specification Language.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="203" end_page="204" type="metho">
    <SectionTitle>
4 Compile-Time Transforma-
</SectionTitle>
    <Paragraph position="0"> tions For an application in which we had to recognize the products made by companies, we would want a pattern that would recognize GM manufactures cars.</Paragraph>
    <Paragraph position="1"> But in addition to writing a rule for this pattern, we would have to write rules for all the syntactic variations of the simple active clause, to recognize Cars are manufactured by GM.</Paragraph>
    <Paragraph position="2"> ... GM, which manufactures cars.</Paragraph>
    <Paragraph position="3"> ... cars, which are manufactured by GM.</Paragraph>
    <Paragraph position="4"> ... cars manufactured by GM.</Paragraph>
    <Paragraph position="5"> GM is to manufacture cars.</Paragraph>
    <Paragraph position="6"> Cars are to be manufactured by GM.</Paragraph>
    <Paragraph position="7"> GM is a car manufacturer. Moreover, in each of these patterns we would need to allow the occurrence of temporal, locative, and other adverbials. Yet all of these variations are predictable, and every time we want the first pattern we want the others as well.</Paragraph>
    <Paragraph position="8"> This consideration led us to implement what can be called &amp;quot;compile-time transformations&amp;quot;. Expensive operations of transformation are not done while the text is being processed. Instead, the transformed patterns are generated when the grammar is compiled. We have implemented a number of parameterized metarules that specify the possible linguistic variations of the simple active clause, expressed in terms of the subject, verb, and object of the active clause, and having the same semantics. Then domain-specific patterns are defined that provide particular instantiations of the metarules.</Paragraph>
    <Paragraph position="9"> The metarule for the basic active clause, as in &amp;quot;The company resumed talks&amp;quot;, is  Once the variables ??subj, ??head, ??obj, ??prep, and ??pobj are defined by the user, they are plugged into this rule and a new specific rule is generated. Each of these variables is a (list of) lexical or other attributes, and when they are plugged into the metarule, they define a pattern that is constrained to those attributes. Adverbials are recognized by matching a sequence of input objects with Event-Adj. Indices are associated with each of the arguments of the head's predication, and these can be used in the semantics specified for a particular pattern. The metarule for passives, as in &amp;quot;Talks were resumed&amp;quot;, is</Paragraph>
    <Paragraph position="11"> The object still has the index 3, so that the same semantics can be used for the passive as for the active. The metarule for relative clauses with a gapped subject, as in &amp;quot;the company, which resumed talks&amp;quot;, is</Paragraph>
    <Paragraph position="13"> The metarule for nominalizations, as in &amp;quot;the company's resumption of talks&amp;quot;, must appear in the Complex Phrase Recognizer and has the form</Paragraph>
    <Paragraph position="15"> Here all the arguments are optional. We could simply have the bare nominal.</Paragraph>
    <Paragraph position="16"> In addition to the basic patterns, middle verbs and symmetric verbs are handled. Middle verbs are verbs whose object can appear in the subject position and still have an active verb.</Paragraph>
    <Paragraph position="17"> They resumed the talks.</Paragraph>
    <Paragraph position="18"> The talks resumed.  The metarule that implements the middle &amp;quot;transformation&amp;quot; is as follows:</Paragraph>
    <Paragraph position="20"> Symmetric verbs are verbs where an argument linked to the head with the preposition &amp;quot;with&amp;quot; can be moved into a subject position, conjoined with the subject. For example, The union met with the company.</Paragraph>
    <Paragraph position="21"> The union and the company met.</Paragraph>
    <Paragraph position="22"> The meeting between the union and the company.</Paragraph>
    <Paragraph position="23"> To handle this there are patterns in the Complex Phrase Recognizer that recognize a conjunction of the subject and the prepositional argument, when the verb is designated symmetrical: NG[??subj] &amp;quot;and&amp;quot; NG[??pobj] This is then given a special attribute symconj, and in the Clause-Level Event Recognition phase, complex noun groups with this property are sought as subjects for symmetric verbs.</Paragraph>
    <Paragraph position="25"> With this set of metarules, defining the necessary patterns becomes very easy. One need only specify the subject, verb, object, preposition, and prepositional object, and the classes of metarules that need to be instantiated, and the specific rules are automatically generated. For example, the specification for &amp;quot;resume&amp;quot; would be Transformations: Middle, Basic:  1: Subj = org; 2: Head = resume-word; 3: Obj = talk-word;</Paragraph>
    <Paragraph position="27"> In the semantics, we set the type of event to be Talk and the talk status to be Bargaining. The parties are those referred to by the subject (1) and the prepositional object (4).</Paragraph>
    <Paragraph position="28"> Our experience with this aspect of the FASTUS system has been very encouraging. During the preparation for MUC-6, it took us only about one day to implement the necessary clause-level domain patterns, because of the compile-time transformations.</Paragraph>
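    <Paragraph>The idea of instantiating parameterized metarules when the grammar is compiled can be sketched as follows. The templates in METARULES and the function instantiate are hypothetical. They use an invented bracket notation, not the actual FastSpec metarules, and they ignore adverbials and prepositional arguments.

```python
# Hypothetical metarule templates, written in an invented notation.
# The index after the colon stays attached to the same argument in every
# variant, so one semantic specification can serve all generated rules.
METARULES = {
    "active": ["NG[{subj}]:1", "VG[active,{head}]:2", "NG[{obj}]:3"],
    "passive": ["NG[{obj}]:3", "VG[passive,{head}]:2", "by", "NG[{subj}]:1"],
    "relative": ["NG[{subj}]:1", "which", "VG[active,{head}]:2", "NG[{obj}]:3"],
}


def instantiate(spec, transformations):
    """Generate one specific rule per requested metarule from a domain specification."""
    rules = {}
    for name in transformations:
        rules[name] = " ".join(part.format(**spec) for part in METARULES[name])
    return rules


spec = {"subj": "org", "head": "resume-word", "obj": "talk-word"}
rules = instantiate(spec, ["active", "passive"])
```

All expansion happens once, before any text is processed, so the run-time matcher only ever sees ordinary specific rules.</Paragraph>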
  </Section>
  <Section position="10" start_page="204" end_page="205" type="metho">
    <SectionTitle>
5 Atomic versus Molecular
Approaches
</SectionTitle>
    <Paragraph position="0"> There are two approaches that have emerged in our experience with FASTUS. They might be called the &amp;quot;atomic&amp;quot; approach and the &amp;quot;molecular&amp;quot; approach. Both approaches are made easier by FastSpec and the compile-time transformations.</Paragraph>
    <Paragraph position="1"> In the atomic approach, the system recognizes entities of a certain highly restricted type and assumes that they play a particular role in a particular event, based on that type; then after event merging it is determined whether enough information has been accumulated for this to be an event of interest. This approach is more noun-driven, and its patterns are much looser. It is most appropriate when the entity type is highly predictive of its role in the event. The microelectronics domain of MUC-5 and the labor negotiations were of this character. When one sees a union, it can only go into the union slot of a negotiation event.</Paragraph>
    <Paragraph position="2"> In the molecular approach, the system must recognize a description of the entire event, not just the participants in the event. This approach is more verb-driven, and the patterns tend to be tighter. It is most appropriate when the syntactic role of an NP is the primary determinant of the entity's role in the event. The terrorist domain of MUC-3 and MUC-4, the joint venture domain of MUC-5 and the management succession domain of MUC-6 were of this character. You can't tell from the fact that an entity is a person whether he is going into or out of a position at an organization. You have to see how that person relates to which verb.</Paragraph>
    <Paragraph position="3"> The distinction between these two approaches can be used as a conceptual tool for analyzing new  domains.</Paragraph>
  </Section>
  <Section position="11" start_page="205" end_page="205" type="metho">
    <SectionTitle>
6 Adapting Rules from Exam-
</SectionTitle>
    <Paragraph position="0"> ples The FastSpec language and the compile-time transformations make it easier for linguists and computer scientists to define patterns. But they do not enable ordinary users to specify their own patterns. One way to achieve this would be to have automatic learning of patterns from examples provided by the user. We have begun in a small way to implement such an approach.</Paragraph>
    <Paragraph position="1"> We need a way for the user to supply a mapping from strings in the text to entries in the template. This can be accomplished by having a two-window editor; the text being annotated or analyzed is in one window, the template in the other. The user marks a string in the text, and then either copies the string to a template entry or enters the set fill that is triggered by the string. Such a system is first of all a convenient text editor for filling data bases from text by hand. But if the system is trying to deduce the implicit rules the user is following in making the fills, then the system is automatically constructing an information extraction system as well.</Paragraph>
    <Paragraph position="2"> We have implemented a preliminary experimental version of such a system, and are currently developing a more advanced one. We assume that the user somehow provides a mapping from text strings to template entries and that the semantics of the rule is completely specified by such a mapping. Moreover, we are only handling the case where the new rule to be induced is a specialization of an already existing rule, in the sense that &lt;Location&gt; &amp;quot;-based&amp;quot; is a specialization of &lt;Noun&gt; &amp;quot;-&amp;quot; &lt;Past-Participle&gt;. In general, the problem of rule induction is very hard. What we are doing is a tractable and useful special case.</Paragraph>
    <Paragraph position="3"> The first problem is to identify the phase in which the new rule should be defined. To do this, we identify the highest-level phase (call it Phase n) in which the constituent boundaries produced by the phase correspond to the way the user has broken up the text. A new rule is then hypothesized in Phase n + 1.</Paragraph>
    <Paragraph position="4"> For example, if the user has marked the string &amp;quot;the union resumed talks with the company&amp;quot; and placed &amp;quot;the union&amp;quot; in one slot and &amp;quot;the company&amp;quot; in another, then Phase n is the Complex Phrase Recognizer, since it provides those noun groups as independent objects. On the other hand, if the string is &amp;quot;the union's resumption of talks with the company&amp;quot;, then the Complex Phrase Recognizer will not do, since it combines at least &amp;quot;the union&amp;quot; and possibly &amp;quot;the company&amp;quot; into the same complex noun group as &amp;quot;resumption&amp;quot;. We have to back up one more phase, to the Basic Phrase Recognizer, to get these noun groups as independent elements.</Paragraph>
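    <Paragraph>The choice of Phase n can be sketched as a search over the segmentations that each phase produces. The representation is an assumption: each phase is given as a set of (start, end) token spans, the user's marked strings are spans of the same kind, and the function highest_matching_phase is hypothetical.

```python
def highest_matching_phase(segmentations, user_spans):
    """Return the index of the highest phase whose constituents include every user span.

    segmentations: one set of (start, end) token spans per phase, lowest phase first.
    user_spans: the (start, end) spans the user copied into template slots.
    """
    best = None
    for index, constituents in enumerate(segmentations):
        if all(span in constituents for span in user_spans):
            best = index
    return best


# Tokens: the(0) union(1) 's(2) resumption(3) of(4) talks(5) with(6) the(7) company(8)
basic = {(0, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 9)}
# The Complex Phrase Recognizer folds "the union" into the "resumption" noun group.
complex_ = {(0, 6), (6, 7), (7, 9)}
user_spans = [(0, 2), (7, 9)]  # "the union" and "the company"
phase_n = highest_matching_phase([basic, complex_], user_spans)
```

Here phase_n is the index of the Basic Phrase Recognizer, so the new rule would be hypothesized one phase higher, in the Complex Phrase Recognizer.</Paragraph>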
    <Paragraph position="5"> In the current version, we determine what Phase n + 1 rule matches the entire string and then construct as general as possible a specialization of that rule. For the semantics of the specialized rule, we encode the mapping the user has constructed.</Paragraph>
    <Paragraph position="6"> Determining the correct level of generalization of the hypothesized rule is a difficult problem. There are some obvious heuristics that we have implemented, such as generalizing &amp;quot;38&amp;quot; to Number and &amp;quot;Garrick&amp;quot; to Person. But should we generalize &amp;quot;United Steel Workers&amp;quot; to Union or to Organization? Our current approach is to be conservative and to experiment with various options.</Paragraph>
    <Paragraph position="7"> Once the rule is hypothesized it will be presented to the user in some form for feedback and validation.</Paragraph>
    <Paragraph position="8"> How best to implement this is still a research issue.</Paragraph>
    <Paragraph position="9"> This work represents a productive synergy between the Tipster project and another FASTUS-based project at SRI, the Message Handler, for processing a large number of types of military messages. The basic ideas were worked out in connection with our Tipster II project. We will be developing a sophisticated, general version of the system as part of our Tipster III research. In the meantime, we are using the theory that we have worked out to develop a restricted learning component for the Message Handler. This effort of applying theory to a very complex real-world task can give us insights into the various problems that arise.</Paragraph>
  </Section>
  <Section position="12" start_page="205" end_page="206" type="metho">
    <SectionTitle>
7 Coreference Resolution
</SectionTitle>
    <Paragraph position="0"> There are three places in FASTUS processing where coreference resolution is done. Early in the processing, in Name Recognition, entities that are referred to by the same name, or by a name and a plausible acronym or alias, are marked as coreferential. Late in the processing, in Event Merging, some coreference resolution happens as a side effect of merging event structures. In the example of Section 2, we learn from Clause-Level Event Recognition that Garrick will become president and COO, and we learn that &amp;quot;he&amp;quot; will succeed Costello. These are two consistent management succession event descriptions, so they are merged, and in the course of doing so, we resolve &amp;quot;he&amp;quot; to Garrick.</Paragraph>
    <Paragraph position="1"> The third type of coreference resolution occurs after complex noun groups are recognized. This module was implemented early in 1995 in order to participate in the Coreference evaluation in MUC-6, but it was done in a way that was completely in accord with normal FASTUS processing, and the results of coreference resolution are used by subsequent phases.</Paragraph>
    <Paragraph position="2"> Coreference resolution is done only for definite noun groups and pronouns. We experimented with an algorithm for bare noun groups, but it hurt precision more than it helped recall.</Paragraph>
    <Paragraph position="3"> Two principal techniques are used to resolve definite noun groups. First we look for a previous noun group with the same head noun. Thus, &amp;quot;the agreement&amp;quot; will resolve with &amp;quot;an agreement&amp;quot;. In addition, we look for a previous object of the right domain-specific type. Thus, &amp;quot;the Detroit automaker&amp;quot; will resolve to &amp;quot;General Motors&amp;quot; or to &amp;quot;a company&amp;quot;, since &amp;quot;automaker&amp;quot; is of type COMPANY and General Motors is a company. No use is made of synonymy or of a sort hierarchy otherwise. Thus, &amp;quot;the agreement&amp;quot; will not resolve back to &amp;quot;a contract&amp;quot;. This is obviously a place where the algorithm can be improved. Rather arbitrarily, we have set the search window to ten sentences; this is a parameter that can be experimented with.</Paragraph>
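    <Paragraph>The two techniques for definite noun groups can be sketched as a backward search within the ten-sentence window. The Mention record and the function resolve_definite are hypothetical, and trying the head-noun match before the type match, most recent mention first, is an assumption of this sketch rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    text: str
    head: str
    sem_type: Optional[str]  # domain-specific type such as "COMPANY", or None
    sentence: int


def resolve_definite(noun_group, previous, window=10):
    """Return the antecedent of a definite noun group, or None if nothing qualifies."""
    # Most recent mentions first, restricted to the search window.
    candidates = [m for m in reversed(previous)
                  if window >= noun_group.sentence - m.sentence >= 0]
    # Technique 1: a previous noun group with the same head noun.
    for m in candidates:
        if m.head == noun_group.head:
            return m
    # Technique 2: a previous object of the same domain-specific type.
    for m in candidates:
        if noun_group.sem_type is not None and m.sem_type == noun_group.sem_type:
            return m
    return None
```

Because no synonymy is used, a noun group whose head and type match nothing in the window is simply left unresolved.</Paragraph>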
    <Paragraph position="4"> For third person pronouns we use an approximation of the algorithms of Hobbs (1978) and Kameyama (1986). We search for noun groups of the right number and gender, first from left to right in the current sentence, then from left to right in the previous sentence, and then from right to left in two more sentences. The pronoun &amp;quot;they&amp;quot; can be identified with either a plural noun group or an organization. For singular first person pronouns, &amp;quot;I&amp;quot; and &amp;quot;me&amp;quot;, we resolve to the nearest person. For plural first person pronouns, &amp;quot;we&amp;quot; and &amp;quot;us&amp;quot;, we resolve to the nearest organization or set of persons. We allow all of the current sentence, including material to the right of the pronoun, since quotes frequently precede the designation of the speaker, as in &amp;quot;I was robbed,&amp;quot; said John.</Paragraph>
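    <Paragraph>The search order for third person pronouns can be written down directly. In this sketch each sentence is a list of noun groups in textual order, each noun group is a dictionary with hypothetical keys text, number, gender, and type, and the caller is assumed to pass only the eligible noun groups of the current sentence. The functions search_order, agrees, and resolve_pronoun are illustrative names, not part of FASTUS.

```python
GENDER = {"he": "male", "she": "female", "it": "neuter"}


def search_order(sentences, current):
    """List candidates: current and previous sentence left to right, then two more right to left."""
    order = list(sentences[current])
    if current >= 1:
        order.extend(sentences[current - 1])
    for back in (2, 3):
        if current >= back:
            order.extend(reversed(sentences[current - back]))
    return order


def agrees(pronoun, ng):
    """Check number and gender; "they" also accepts an organization."""
    if pronoun == "they":
        return ng["number"] == "plural" or ng["type"] == "organization"
    return ng["number"] == "singular" and ng["gender"] == GENDER[pronoun]


def resolve_pronoun(pronoun, sentences, current):
    """Return the text of the first agreeing noun group in search order, or None."""
    for ng in search_order(sentences, current):
        if agrees(pronoun, ng):
            return ng["text"]
    return None
```

Applied to the sample text, with the first sentence containing the company and Garrick and the second sentence starting with the pronoun, this order finds Garrick for he.</Paragraph>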
    <Paragraph position="5"> An obvious improvement would be to determine whether the person occurs as the subject of a verb of speaking, but an informal examination of the data suggested this would not result in a significant improvement. The heuristics we use for coreference resolution are very simple and easily implemented in a FASTUS framework. Numerous improvements readily suggest themselves. But we have been surprised how strong a performance can be achieved just with these simple heuristics. Our performance on the MUC-6 Coreference task was a recall of 59% and a precision of 72%. These scores placed SRI among the leaders.</Paragraph>
  </Section>
  <Section position="13" start_page="206" end_page="207" type="metho">
    <SectionTitle>
8 Information Extraction and
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="206" end_page="207" type="sub_section">
      <SectionTitle>
Document Retrieval
</SectionTitle>
      <Paragraph position="0"> As part of Tipster II.V, we are engaged in a joint effort with the University of Massachusetts to determine ways in which information extraction technology can improve the performance of document retrieval systems, such as the INQUERY system. Initially, we are pursuing three investigations.</Paragraph>
      <Paragraph position="1"> 1. The first is simply to examine a large number of highly ranked false positives for a number of queries, and to determine whether information extraction techniques can help. We have done this on a small scale, five texts for one TREC topic. The topic was actual retaliation against terrorists. The false positives all talked about retaliation against terrorists, but it was embedded in negative or modal contexts, such as the following: ... will not retaliate against the terrorist attack ...</Paragraph>
      <Paragraph position="2"> ... discussed the possibility of retaliating. ... if we retaliate against terrorists ...</Paragraph>
      <Paragraph position="3"> These are the kinds of features that Basic and Complex Phrase Recognition in FASTUS can spot, and the texts could thereby be rejected.</Paragraph>
      <Paragraph position="4"> 2. We have already developed an information extraction system for the management succession domain, and that corresponds to one of the TREC topics. We will run INQUERY on that topic and then run the MUC-6 FASTUS system on the 100 texts that INQUERY ranks most highly. We can then determine whether there is any criterion definable in terms of the events extracted that can improve on INQUERY's ranking. This will lead to the question of how much information extraction domain development is necessary for how much corresponding document retrieval improvement.</Paragraph>
      <Paragraph position="5">  3. We have a moderately well developed module for coreference resolution. Can this be used to improve INQUERY's performance? The idea is to apply FASTUS processing, up through coreference resolution, to all the documents in the corpus. We would then use the resulting coreference chains to increase the richness of concepts in the text. For example, consider two documents that each mention IBM once. The first is about IBM and contains numerous subsequent references to &amp;quot;the computer company&amp;quot; and &amp;quot;they&amp;quot;. The second mentions IBM only in the context of IBM-compatible peripherals and is concerned with something else entirely. Having every reference to IBM count as a mention of IBM will result in the first document having a much higher score than the second. This method could help in both directions. If the topic concerns IBM, references to the computer company will increase the score. If the topic concerns computer companies, references to IBM will increase the score.</Paragraph>
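      <Paragraph>The effect of counting every coreferring expression as a mention can be shown with a toy calculation. The representation of a coreference chain as a list of mention strings and the function concept_count are assumptions for illustration. How such counts would feed into INQUERY's scoring is not described in the text and is not modelled here.

```python
def concept_count(chains, concept):
    """Count all mentions in every coreference chain that contains the concept."""
    return sum(len(chain) for chain in chains if concept in chain)


# First document: about IBM, with later references by description and pronoun.
doc_about_ibm = [["IBM", "the computer company", "they", "the computer company"]]
# Second document: a single passing mention of IBM.
doc_peripherals = [["IBM"]]

first_score = concept_count(doc_about_ibm, "IBM")
second_score = concept_count(doc_peripherals, "IBM")
```

With plain string matching both documents would count one occurrence of IBM; with the chains, the first document counts four and the second still counts one.</Paragraph>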
    </Section>
  </Section>
</Paper>