<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1048">
  <Title>Corpus-Based Identification of Non-Anaphoric Noun Phrases</Title>
  <Section position="3" start_page="0" end_page="373" type="metho">
    <SectionTitle>
2 Prior Research
</SectionTitle>
    <Paragraph position="0"> Computational coreference resolvers fall into two categories: systems that make no attempt to identify non-anaphoric discourse entities prior to coreference resolution, and systems that apply a filter to discourse entities, identifying the subset that is anaphoric. Those that do not filter include decision tree models (Aone and Bennett, 1996; McCarthy and Lehnert, 1995) that consider all possible combinations of potential anaphors and referents. Exhaustively examining all possible combinations is expensive and, we believe, unnecessary. Of those systems that apply filtering prior to coreference resolution, the nature of the filtering varies. Some systems recognize when an anaphor and a candidate antecedent are incompatible. [Figure 1: The ARCE battalion command has reported that about 50 peasants of various ages have been kidnapped by terrorists of the Farabundo Marti National Liberation Front [FMLN] in San Miguel Department. According to that garrison, the mass kidnapping took place on 30 December in San Luis de la Reina. The source added that the terrorists forced the individuals, who were taken to an unknown location, out of their residences, presumably to incorporate them against their will into clandestine groups.]</Paragraph>
    <Paragraph position="1"> In SRI's probabilistic model (Kehler, 1997), a pair of extracted templates may be removed from consideration because an outside knowledge base indicates contradictory features. Other systems look for particular constructions using certain trigger words. For example, pleonastic pronouns [footnote 2] are identified by looking for modal adjectives (e.g. &quot;necessary&quot;) or cognitive verbs (e.g. &quot;It is thought that...&quot;) in a set of patterned constructions (Lappin and Leass, 1994; Kennedy and Boguraev, 1996).</Paragraph>
    <Paragraph position="2"> A more recent system (Vieira and Poesio, 1997) recognizes a large percentage of non-anaphoric definite noun phrases (NPs) during the coreference resolution process through the use of syntactic cues and case-sensitive rules.</Paragraph>
    <Paragraph position="3"> These methods were successful in many instances, but they could not identify all non-anaphoric definite NPs.</Paragraph>
    <Paragraph position="4"> The existential NPs that were missed were existential to the reader, not because they were modified by particular syntactic constructions, but because they were part of the reader's general world knowledge.</Paragraph>
    <Paragraph position="5"> Definite noun phrases that do not need to be resolved because they are understood through world knowledge can represent a significant portion of the existential noun phrases in a text. In our research, we found that existential NPs account for 63% of all definite NPs, and 24% of them could not be identified by syntactic or lexical means. This paper details our method for identifying existential NPs that are understood through general world knowledge. Our system requires no hand-coded information and can recognize a larger portion of existential NPs than Vieira and Poesio's system.</Paragraph>
  </Section>
  <Section position="4" start_page="373" end_page="374" type="metho">
    <SectionTitle>
3 Definite NP Taxonomy
</SectionTitle>
    <Paragraph position="0"> To better understand what makes an NP anaphoric or non-anaphoric, we found it useful to classify definite NPs into a taxonomy. We first classified definite NPs into two broad categories: referential NPs, which have prior referents in the text, and existential NPs, which do not. In Figure 1, examples of referential NPs are &quot;the mass kidnapping,&quot; &quot;the terrorists&quot; and &quot;the individuals,&quot; while examples of existential NPs are &quot;the ARCE battalion command&quot; and &quot;the Farabundo Marti National Liberation Front.&quot; (The full taxonomy can be found in Figure 2.) We should clarify an important point. When we say that a definite NP is existential, we say this because it completely specifies a cognitive representation of the entity in the reader's mind. [Footnote 2: Pronouns that are semantically empty, e.g. &quot;It is clear that....&quot;]</Paragraph>
    <Paragraph position="1"> That is, suppose &quot;the F.B.I.&quot; appears in both sentence 1 and sentence 7 of a text. Although there may be a cohesive relationship between the two noun phrases, each completely specifies its referent independently, so we consider them to be non-anaphoric.</Paragraph>
    <Section position="1" start_page="373" end_page="374" type="sub_section">
      <SectionTitle>
Definite Noun Phrases
</SectionTitle>
      <Paragraph position="0"> [Figure 2: Definite NP taxonomy. Definite noun phrases are either referential or existential; existential NPs are either independent or associative; independent NPs are either syntactic or semantic.] We further classified existential NPs into two categories, independent and associative, which are distinguished by their need for context. Independent existentials can be understood in isolation. Associative existentials are inherently associated with an event, action, object or other context [footnote 3]. In a text about a basketball game, for example, we might find &quot;the score,&quot; &quot;the hoop&quot; and &quot;the bleachers.&quot; Although they may not have direct antecedents in the text, we understand what they mean because they are all associated with basketball games. In isolation, a reader would not necessarily understand the meaning of &quot;the score&quot; because context is needed to disambiguate the intended word sense and provide a complete specification. [Footnote 3, beginning missing: ... that our independent existentials roughly equate to her new class, our associative existentials to her inferable class, and our referentials to her evoked class.]</Paragraph>
      <Paragraph position="1"> Because associative NPs represent less than 10% of the existential NPs in our corpus, our efforts were directed at automatically identifying independent existentials. Understanding how to identify independent existential NPs requires that we have an understanding of why these NPs are existential. We classified independent existentials into two groups, semantic and syntactic. Semantically independent NPs are existential because they are understood by readers who share a collective understanding of current events and world knowledge. For example, we understand the meaning of &quot;the F.B.I.&quot; without needing any other information. Syntactically independent NPs, on the other hand, gain this quality because they are modified structurally. For example, in &quot;the man who shot Liberty Valance,&quot; &quot;the man&quot; is existential because the relative clause uniquely identifies its referent.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="374" end_page="376" type="metho">
    <SectionTitle>
4 Mining Existential NPs from a Corpus
</SectionTitle>
    <Paragraph position="0"> Our goal is to build a system that can identify independent existential noun phrases automatically. In the previous section, we observed that &quot;existentialism&quot; can be granted to a definite noun phrase either through syntax or semantics. In this section, we introduce four methods for recognizing both classes of existentials.</Paragraph>
    <Section position="1" start_page="374" end_page="374" type="sub_section">
      <SectionTitle>
4.1 Syntactic Heuristics
</SectionTitle>
      <Paragraph position="0"> We began by building a set of syntactic heuristics that look for the structural cues of restrictive premodification and restrictive postmodification. Restrictive premodification is often found in noun phrases in which a proper noun is used as a modifier for a head noun, for example, &quot;the U.S. president.&quot; &quot;The president&quot; itself is ambiguous, but &quot;the U.S. president&quot; is not. Restrictive postmodification is often represented by restrictive relative clauses, prepositional phrases, and appositives. For example, &quot;the president of the United States&quot; and &quot;the president who governs the U.S.&quot; are existential due to a prepositional phrase and a relative clause, respectively.</Paragraph>
      <Paragraph position="1"> We also developed syntactic heuristics to recognize referential NPs. Most NPs of the form &quot;the &lt;number&gt; &lt;noun&gt;&quot; (e.g., &quot;the 12 men&quot;) have an antecedent, so we classified them as referential. Also, if the head noun of the NP appeared earlier in the text, we classified the NP as referential.</Paragraph>
      <Paragraph position="2"> This method, then, consists of two groups of syntactic heuristics. The first group, which we refer to as the rule-in heuristics, contains seven heuristics that identify restrictive premodification or postmodification, thus targeting existential NPs. The second group, referred to as the rule-out heuristics, contains two heuristics that identify referential NPs.</Paragraph>
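The two rule-out heuristics described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the lowercase token-list representation, the digit-only number test, and the assumption that the head noun is the last token of the NP are all hypothetical simplifications (the last-token assumption fails for postmodified NPs, and spelled-out numbers are not handled).

```python
import re

# Hypothetical helper names; NPs are assumed to arrive as lowercase token lists.
NUMBER_RE = re.compile(r"^\d[\d,.]*$")

def head_noun(np_tokens):
    """Assumption: the head noun is the last token of a simple NP."""
    return np_tokens[-1]

def rule_out(np_tokens, earlier_heads):
    """Return True if either rule-out heuristic marks the NP as referential."""
    # Heuristic 1: "the NUMBER NOUN", e.g. "the 12 men".
    if (len(np_tokens) == 3 and np_tokens[0] == "the"
            and NUMBER_RE.match(np_tokens[1])):
        return True
    # Heuristic 2: the head noun already appeared earlier in the text.
    return head_noun(np_tokens) in earlier_heads
```

In use, `earlier_heads` would be a set that the caller updates as it walks through the text, so that the second heuristic only sees head nouns from preceding NPs.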
    </Section>
    <Section position="2" start_page="374" end_page="374" type="sub_section">
      <SectionTitle>
4.2 Sentence One Extractions (S1)
</SectionTitle>
      <Paragraph position="0"> Most referential NPs have antecedents that precede them in the text. This observation is the basis of our first method for identifying semantically independent NPs. If a definite NP occurs in the first sentence [footnote 4] of a text, we assume the NP is existential. Using a training corpus, we create a list of presumably existential NPs by collecting the first sentence of every text and extracting all definite NPs that were not classified by the syntactic heuristics. We call this list the S1 extractions.</Paragraph>
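As a rough sketch of the S1 extraction step, the code below collects definite NPs from the first sentence of each training text. It assumes that NP chunking has already been done upstream, that a definite NP can be recognized by a leading "the", and that NPs are normalized to lowercase; the `classified_by_syntax` hook is a hypothetical stand-in for the syntactic heuristics.

```python
def s1_extractions(texts, classified_by_syntax):
    """Collect definite NPs from the first sentence of each training text.

    texts: list of documents; each document is a list of sentences, and each
           sentence is a list of NP strings (chunking is assumed done upstream).
    classified_by_syntax: predicate returning True when the syntactic
           heuristics already classified the NP (hypothetical hook).
    """
    s1 = set()
    for sentences in texts:
        if not sentences:
            continue
        for np in sentences[0]:          # first sentence only
            is_definite = np.lower().startswith("the ")
            if is_definite and not classified_by_syntax(np):
                s1.add(np.lower())       # lowercase normalization is an assumption
    return s1
```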
    </Section>
    <Section position="3" start_page="374" end_page="375" type="sub_section">
      <SectionTitle>
4.3 Existential Head Patterns (EHP)
</SectionTitle>
      <Paragraph position="0"> While examining the S1 extractions, we found many similar NPs, for example &quot;the Salvadoran Government,&quot; &quot;the Guatemalan Government,&quot; and &quot;the U.S. Government.&quot; The similarities indicate that some head nouns, when premodified, represent existential entities. By using the S1 extractions as input to a pattern generation algorithm, we built a set of Existential Head Patterns (EHPs) that identify such constructions. These patterns are of the form &quot;the &lt;x+&gt; &lt;noun1 ... nounN&gt;&quot; [footnote 5], such as &quot;the &lt;x+&gt; government&quot; or &quot;the &lt;x+&gt; Salvadoran government.&quot; Figure 3 shows the algorithm for creating EHPs.</Paragraph>
      <Paragraph position="1"> [Footnote 4: Many of the texts we used were newspaper articles, and all headers, including titles and bylines, were stripped before processing.]</Paragraph>
      <Paragraph position="3"> [Figure 3]
1. For each NP of more than two words, build a candidate pattern of the form &quot;the &lt;x+&gt; headnoun.&quot; Example: if the NP was &quot;the new Salvadoran government,&quot; the candidate pattern would be &quot;the &lt;x+&gt; government.&quot;
2. Apply that pattern to the corpus and count how many times it matches an NP.</Paragraph>
      <Paragraph position="4"> 3. If possible, grow the candidate pattern by inserting the word to the left of the headnoun, e.g. the candidate pattern now becomes &quot;the &lt;x+&gt; Salvadoran government.&quot;
4. Reapply the pattern to the corpus and count how many times it matches an NP. If the new count is less than the last iteration's count, stop and return the prior pattern. If the new count is</Paragraph>
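The pattern-growing steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the last step is cut off in this text, so the sketch assumes that growth simply continues whenever the match count does not drop. A pattern is represented only by its trailing words (the "tail"), with the leading "the" and the one-or-more-words slot implied; all names are hypothetical.

```python
def matches(tail, np_tokens):
    """True if the NP is 'the', then one or more words, then the tail."""
    k = len(tail)
    return (len(np_tokens) >= k + 2
            and np_tokens[0] == "the"
            and tuple(np_tokens[-k:]) == tail)

def count_matches(tail, corpus_nps):
    return sum(1 for np in corpus_nps if matches(tail, np))

def build_ehp(np_tokens, corpus_nps):
    """Grow a pattern tail leftward from the head noun while its count holds."""
    if not len(np_tokens) > 2:          # step 1 requires more than two words
        return None
    tail = (np_tokens[-1],)
    count = count_matches(tail, corpus_nps)
    # "If possible": a word must remain to fill the open slot after growing.
    while len(np_tokens) >= len(tail) + 3:
        grown = tuple(np_tokens[-(len(tail) + 1):])
        new_count = count_matches(grown, corpus_nps)
        if count > new_count:           # count dropped: keep the prior pattern
            break
        # Assumed continuation: the count did not drop, so keep growing.
        tail, count = grown, new_count
    return tail
```

With a corpus containing "the new salvadoran government", "the guatemalan government" and "the u.s. government", the first NP yields the tail `("government",)`, because adding "salvadoran" reduces the match count from 3 to 1.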
    </Section>
    <Section position="4" start_page="375" end_page="375" type="sub_section">
      <SectionTitle>
4.4 Definite-Only List (DO)
</SectionTitle>
      <Paragraph position="0"> It also became clear that some existentials never appear in indefinite constructions. &quot;The F.B.I.,&quot; &quot;the contrary,&quot; and &quot;the National Guard&quot; are definite NPs that are rarely, if ever, seen in indefinite constructions. The chances that a reader will encounter &quot;an F.B.I.&quot; are slim to none. These NPs appeared to be perfect candidates for a corpus-based approach. To locate &quot;definite-only&quot; NPs we made two passes over the corpus. The first pass produced a list of every definite NP and its frequency. The second pass counted indefinite uses of all NPs cataloged during the first pass. Knowing how often an NP was used in definite and indefinite constructions allowed us to sort the NPs, first by the probability of being used as a definite (its definite probability), and second by definite-use frequency.</Paragraph>
      <Paragraph position="1"> For example, &quot;the contrary&quot; appeared high on this list because its head noun occurred 15 times in the training corpus, and every time it was in a definite construction. From this, we created a definite-only list by selecting those NPs which occurred at least 5 times and only in definite constructions.</Paragraph>
      <Paragraph position="2"> Examples from the three methods can be found in the Appendix.</Paragraph>
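A minimal sketch of the two-pass counting behind the definite-only list follows. It assumes each NP has already been split into a determiner and the remaining words, and that "a" and "an" are the only indefinite markers; both are simplifications made for illustration, and the function names are hypothetical.

```python
from collections import Counter

def definite_stats(nps):
    """nps: iterable of (determiner, rest) pairs, e.g. ("the", "contrary")."""
    nps = list(nps)
    # Pass 1: frequency of every NP used with the definite article.
    definite = Counter(rest for det, rest in nps if det == "the")
    # Pass 2: indefinite uses, counted only for NPs cataloged in pass 1.
    indefinite = Counter(rest for det, rest in nps
                         if det in ("a", "an") and rest in definite)
    return definite, indefinite

def definite_probability(rest, definite, indefinite):
    """Share of an NP's uses that were definite."""
    total = definite[rest] + indefinite[rest]
    return definite[rest] / total if total else 0.0

def definite_only(definite, indefinite, min_count=5):
    """NPs seen at least min_count times and never in an indefinite use."""
    return {rest for rest, n in definite.items()
            if n >= min_count and indefinite[rest] == 0}
```

Sorting candidates by `definite_probability` and then by definite frequency reproduces the ranking described in the text; `definite_only` applies the stated cut of at least 5 occurrences, all definite.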
    </Section>
    <Section position="5" start_page="375" end_page="376" type="sub_section">
      <SectionTitle>
4.5 Vaccine
</SectionTitle>
      <Paragraph position="0"> Our methods for identifying existential NPs are all heuristic-based and therefore can be incorrect in certain situations. We identified two types of common errors.</Paragraph>
      <Paragraph position="1"> 1. An incorrect S1 assumption. When the S1 assumption fails, i.e. when a definite NP in the first sentence of a text is truly referential, the referential NP is added to the S1 list. Later, an Existential Head Pattern may be built from this NP. In this way, a single misclassified NP may cause multiple noun phrases to be misclassified in new texts, acting as an &quot;infection&quot; (Roark and Charniak, 1998).</Paragraph>
      <Paragraph position="2"> 2. Occasional existentialism. Sometimes an NP is existential in one text but referential in another. For example, &quot;the guerrillas&quot; often refers to a set of counter-government forces that the reader of an El Salvadoran newspaper would understand. In some cases, however, a particular group of guerrillas was mentioned previously in the text (&quot;A group of FMLN rebels attacked the capital...&quot;), and later references to &quot;the guerrillas&quot; referred to this group.
[Figure 5] For every definite NP in a text:
1. Apply syntactic RuleOutHeuristics; if any fired, classify the NP as referential.
2. Look up the NP in the S1 list; if found, classify the NP as existential (unless stopped by vaccine).
3. Look up the NP in the DO list; if found, classify the NP as existential.
4. Apply all EHPs; if any apply, classify the NP as existential (unless stopped by vaccine).
5. Apply syntactic RuleInHeuristics; if any fired, classify the NP as existential.
6. If the NP is not yet classified, classify the NP as referential.</Paragraph>
      <Paragraph position="3"> To address these problems, we developed a vaccine. It was clear that we had a number of infections in our S1 list, including &quot;the base,&quot; &quot;the individuals,&quot; &quot;the attack,&quot; and &quot;the banks.&quot; We noticed, however, that many of these incorrect NPs also appeared near the bottom of our definite/indefinite list, indicating that they were often seen in indefinite constructions. We used the definite probability measure as a way of detecting errors in the S1 and EHP lists. If the definite probability of an NP was above an upper threshold, the NP was allowed to be classified as existential. If the definite probability of an NP fell below a lower threshold, it was not allowed to be classified by the S1 or EHP method.</Paragraph>
      <Paragraph position="4"> Those NPs that fell between the two thresholds were considered occasionally existential.</Paragraph>
      <Paragraph position="5"> Occasionally existential NPs were handled by observing where the NPs first occurred in the text. For example, if the first use of &quot;the guerrillas&quot; was in the first few sentences of a text, it was usually an existential use. If the first use was later, it was usually a referential use because a prior definition appeared in earlier sentences. We applied an early allowance threshold of three sentences: occasionally existential NPs occurring under this threshold were classified as existential, and those that occurred above it were left unclassified. Figure 4 details the vaccine's algorithm.</Paragraph>
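The vaccine's decision can be sketched as a small function. This is a hedged illustration, not the algorithm of Figure 4: the default thresholds of 0.70 and 0.25 are an assumed reading of the "Va(70/25)" label used in the evaluation, and treating the third sentence as still inside the early allowance is also an assumption, since the text does not say whether the boundary is inclusive.

```python
def vaccine_allows(def_prob, first_sentence_index,
                   upper=0.70, lower=0.25, early_allowance=3):
    """Decide whether an S1 or EHP match may classify the NP as existential.

    def_prob: definite probability of the NP (0.0 to 1.0).
    first_sentence_index: 1-based sentence number of the NP's first use.
    upper, lower: assumed thresholds; the default values are illustrative.
    """
    if def_prob > upper:
        return True                      # reliably definite: allow
    if lower > def_prob:
        return False                     # often indefinite: block
    # Between the thresholds the NP is "occasionally existential":
    # allow only when its first use falls within the early allowance.
    return early_allowance >= first_sentence_index
```

A `False` result here means only that the S1 and EHP methods may not classify the NP; later steps of the classifier can still do so.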
    </Section>
  </Section>
  <Section position="6" start_page="376" end_page="377" type="metho">
    <SectionTitle>
5 Algorithm &amp; Training
</SectionTitle>
    <Paragraph position="0"> We trained and tested our methods on the Latin American newswire articles from MUC-4 (MUC-4 Proceedings, 1992). The training set contained 1,600 texts and the test set contained 50 texts. All texts were first parsed by SUNDANCE, our heuristic-based partial parser developed at the University of Utah.</Paragraph>
    <Paragraph position="1"> We generated the S1 extractions by processing the first sentence of all training texts. This produced 849 definite NPs. Using these NPs as input to the existential head pattern algorithm, we generated 297 EHPs. The DO list was built by using only those NPs which appeared at least 5 times in the corpus and 100% of the time as definites. We generated the DO list in two iterations, once for head nouns alone and once for full NPs, resulting in a list of 65 head nouns and 321 full NPs [footnote 6].</Paragraph>
    <Paragraph position="2"> Once the methods had been trained, we classified each definite NP in the test set as referential or existential using the algorithm in Figure 5. Figure 6 graphically represents the main elements of the algorithm. Note that we applied vaccines to the S1 and EHP lists, but not to the DO list, because gaining entry to the DO list is much more difficult: an NP must occur at least 5 times in the training corpus, and every time it must occur in a definite construction.</Paragraph>
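The ordered steps of the classification algorithm (Figure 5) amount to a short cascade, sketched below. The individual components are passed in as caller-supplied predicates and sets; these parameter names are hypothetical hooks, and the sketch says nothing about how the authors structured their own system.

```python
def classify(np, rule_out, s1_list, do_list, ehp_match, rule_in, vaccine_ok):
    """Apply the ordered classification steps to one definite NP.

    rule_out, ehp_match, rule_in, vaccine_ok: predicates taking the NP
    (hypothetical hooks); s1_list and do_list are sets of NP strings.
    """
    if rule_out(np):                          # step 1: rule-out heuristics
        return "referential"
    if np in s1_list and vaccine_ok(np):      # step 2: S1 list, vaccinated
        return "existential"
    if np in do_list:                         # step 3: DO list, no vaccine
        return "existential"
    if ehp_match(np) and vaccine_ok(np):      # step 4: EHPs, vaccinated
        return "existential"
    if rule_in(np):                           # step 5: rule-in heuristics
        return "existential"
    return "referential"                      # step 6: default
```

The ordering matters: an NP blocked by the vaccine at step 2 or step 4 can still be classified as existential by the DO list or the rule-in heuristics before the default applies.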
    <Section position="1" start_page="377" end_page="377" type="sub_section">
      <SectionTitle>
Method Tested
</SectionTitle>
      <Paragraph position="0"> 0. Baseline
1. Syntactic Heuristics
2. Syntactic Heuristics + S1
3. Syntactic Heuristics + EHP
4. Syntactic Heuristics + DO
5. Syntactic Heuristics + S1 + EHP
6. Syntactic Heuristics + S1 + EHP + DO
7. Syntactic Heuristics + S1 + EHP + DO + Va(70/25)
8. Syntactic Heuristics + S1 + EHP + DO + Vb(50/25)
To evaluate the performance of our algorithm, we hand-tagged each definite NP in the 50 test texts as a syntactically independent existential, a semantically independent existential, an associative existential or a referential NP. Figure 8 shows the distribution of definite NP types in the test texts. Of the 1,001 definite NPs tested, 63% were independent existentials, so removing these NPs from the coreference resolution process could yield substantial savings. We measured the accuracy of our classifications using recall and precision metrics. Results are shown in Figure 7.</Paragraph>
      <Paragraph position="1"> As a baseline measurement, we considered the accuracy of classifying every definite NP as existential. Given the distribution of definite NP types in our test set, this would result in recall of 100% and precision of 72%. Note that we are more interested in high precision than in high recall because we view this method as the precursor to a coreference resolution algorithm. Incorrectly removing an anaphoric NP means that the coreference resolver would never have a chance to resolve it. On the other hand, non-anaphoric NPs that slip through can still be ruled non-anaphoric by the coreference resolver.</Paragraph>
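The baseline figures follow directly from the definitions of recall and precision, as the short worked example below shows. The count of existential NPs is an assumption derived by rounding 72% of 1,001; it is not a number reported in the text.

```python
def recall_precision(true_pos, false_pos, false_neg):
    """Recall and precision for the 'existential' class."""
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision

# Illustrative counts only: 1,001 definite NPs, roughly 72% existential.
total = 1001
existential = round(0.72 * total)    # assumed count, not taken from the paper
# Baseline: label every NP existential, so nothing existential is missed
# and every referential NP becomes a false positive.
r, p = recall_precision(existential, total - existential, 0)
```

Under these assumed counts the baseline misses no existential NP, so recall is 1.0, while precision equals the share of existential NPs in the test set, about 0.72.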
      <Paragraph position="2"> We first evaluated our system using only the syntactic heuristics, which produced only 43% recall, but 92% precision. Although the syntactic heuristics are a reliable way to identify existential definite NPs, they miss 57% of the true existentials.</Paragraph>
    </Section>
  </Section>
</Paper>