<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-4003">
  <Title>A Corpus-Based Evaluation of Centering and Pronoun Resolution</Title>
  <Section position="2" start_page="0" end_page="513" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The aims of this paper are to compare implementations of pronoun resolution algorithms automatically on a common corpus and to see if results from psycholinguistic experiments can be used to improve pronoun resolution. Many hand-tested corpus evaluations have been done in the past (e.g., Walker 1989; Strube 1998; Mitkov 1998; Strube and Hahn 1999), but these have the drawback of being carried out on small corpora. While manual evaluations have the advantage of allowing the researcher to examine the data closely, they are problematic because they can be time consuming, generally making it difficult to process corpora that are large enough to provide reliable, broadly based statistics. With a system that can run various pronoun resolution algorithms, one can easily and quickly analyze large amounts of data and generate more reliable results. In this study, this ability to alter an algorithm slightly and test its performance is central.</Paragraph>
    <Paragraph position="1"> We first show the attractiveness of the Left-Right Centering algorithm (henceforth LRC) (Tetreault 1999) given its incremental processing of utterances, psycholinguistic plausibility, and good performance in finding the antecedents of pronouns. The algorithm is tested against three other leading pronoun resolution algorithms: Hobbs's naive algorithm (1978), S-list (Strube 1998), and BFP (Brennan, Friedman, and Pollard 1987). Next we use the conclusions from two psycholinguistic experiments on ranking the Cf-list, the salience of discourse entities in prepended phrases (Gordon, Grosz, and Gilliom 1993) and the ordering of possessor and possessed in complex NPs (Gordon et al. 1999), to try to improve the performance of LRC.</Paragraph>
    <Paragraph position="2"> We begin with a brief review of the four algorithms to be compared (Section 2). We then discuss the results of the corpus evaluation (Sections 3 and 4). Finally, we show that the results from two psycholinguistic experiments, thought to provide a better ordering of the Cf-list, do not improve LRC's performance when they are incorporated (Section 5).</Paragraph>
    <Paragraph position="3">  * Department of Computer Science, Rochester, NY 14627. E-mail: tetreaul@cs.rochester.edu. © 2001 Association for Computational Linguistics. Computational Linguistics, Volume 27, Number 4. 2. Algorithms</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Hobbs's Algorithm
</SectionTitle>
      <Paragraph position="0"> Hobbs (1978) presents two algorithms: a naive one based solely on syntax, and a more complex one that includes semantics in the resolution method. The naive one (henceforth, the Hobbs algorithm) is the one analyzed here. Unlike the other three algorithms analyzed in this project, the Hobbs algorithm does not appeal to any discourse models for resolution; rather, the parse tree and grammatical rules are the only information used in pronoun resolution.</Paragraph>
      <Paragraph position="1"> The Hobbs algorithm assumes a parse tree in which each NP node has an N̄ (N-bar) type node below it as the parent of the lexical object. The algorithm is as follows: 1. Begin at the NP node immediately dominating the pronoun.</Paragraph>
      <Paragraph position="2"> 2. Walk up the tree to the first NP or S encountered. Call this node X, and call the path used to reach it p.</Paragraph>
      <Paragraph position="3"> 3. Traverse all branches below node X to the left of path p in a left-to-right, breadth-first manner. Propose as the antecedent any NP node that is encountered which has an NP or S node between it and X. If no antecedent is found, proceed to Step 4.</Paragraph>
      <Paragraph position="4"> 4. If node X is the highest S node in the sentence, traverse the surface parse trees of previous sentences in order of recency, the most recent first; each tree is traversed in a left-to-right, breadth-first manner, and when an NP node is encountered, propose it as the antecedent. If X is not the highest S node in the sentence, continue to Step 5.</Paragraph>
      <Paragraph position="5"> 5. From node X, go up the tree to the first NP or S node encountered. Call this new node X, and call the path traversed to reach it p.</Paragraph>
      <Paragraph position="6"> 6. If X is an NP node and if the path p to X did not pass through the N̄ node that X immediately dominates, propose X as the antecedent.</Paragraph>
      <Paragraph position="7"> 7. Traverse all branches below node X to the left of path p in a left-to-right, breadth-first manner. Propose any NP node encountered as the antecedent.</Paragraph>
      <Paragraph position="8"> 8. If X is an S node, traverse all branches of node X to the right of path p in a left-to-right, breadth-first manner, but do not go below any NP or S node encountered. Propose any NP node encountered as the antecedent.</Paragraph>
      <Paragraph position="9"> 9. Go to Step 4.</Paragraph>
      <Paragraph position="10">  A match is "found" when the NP in question matches the pronoun in number, gender, and person. The algorithm amounts to walking the parse tree from the pronoun in question by stepping through each NP and S on the path to the top S and running a breadth-first search on the NP's children left of the path. If a referent cannot be found in the current utterance, then the breadth-first strategy is repeated on preceding utterances. Hobbs did a hand-based evaluation of his algorithm on three different texts: a history chapter, a novel, and a news article. Four pronouns were considered: he, she, it, and they. Cases where it refers to a nonrecoverable entity (such as the time or weather) were not counted. The algorithm performed successfully on 88.3% of the 300 pronouns in the corpus. Accuracy increased to 91.7% with the inclusion of selectional constraints.</Paragraph>
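As an illustrative sketch only (not the paper's implementation), the agreement check and the Step 4 intersentential search can be written over a minimal tree-as-dict representation. The node fields (`label`, `children`, `number`, `gender`, `person`, `text`) are assumptions introduced for the example.

```python
from collections import deque

def bfs_nps(root):
    """Left-to-right, breadth-first traversal yielding NP nodes (Step 4)."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node["label"] == "NP":
            yield node
        queue.extend(node.get("children", []))

def agrees(pronoun, np):
    """A match is 'found' when number, gender, and person all agree."""
    return all(pronoun[f] == np[f] for f in ("number", "gender", "person"))

def resolve_intersentential(pronoun, previous_trees):
    """Step 4: walk previous sentences, most recent first, proposing NPs."""
    for tree in reversed(previous_trees):
        for np in bfs_nps(tree):
            if agrees(pronoun, np):
                return np
    return None
```

A usage example: given parse trees for "John slept." and "Mary saw Bill.", the pronoun *he* resolves to the most recent agreeing NP, *Bill*.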
    </Section>
    <Section position="2" start_page="0" end_page="508" type="sub_section">
      <SectionTitle>
2.2 Centering Theory and BFP's Algorithm
</SectionTitle>
      <Paragraph position="0"> Centering theory is part of a larger theory of discourse structure developed by Grosz and Sidner (1986). These researchers assert that discourse structure has three components: (1) a linguistic structure, which is the structure of the sequence of utterances; (2) the intentional structure, which is a structure of discourse-relevant purposes; and (3) the attentional state, which is the state of focus. The attentional state models the discourse participants' focus of attention determined by the other two structures at any one time. Also, it has global and local components that correspond to the two levels of discourse coherence. Centering models the local component of attentional state--namely, how the speaker's choice of linguistic entities affects the inference load placed upon the hearer in discourse processing. For example, referring to an entity with a pronoun signals that the entity is more prominently in focus.</Paragraph>
      <Paragraph position="1"> As described by Brennan, Friedman, and Pollard (1987) (henceforth, BFP) and Walker, Iida, and Cote (1994), entities called centers link an utterance with other utterances in the discourse segment. Each utterance within a discourse has one backward-looking center (Cb) and a set of forward-looking centers (Cf). The Cf set for an utterance Un is the set of discourse entities evoked by that utterance. The Cf set is ranked according to discourse salience; the most accepted ranking is by grammatical role (subject, then direct object, then indirect object). The highest-ranked element of this list is called the preferred center (Cp). The Cb represents the most highly ranked element of the previous utterance that is found in the current utterance. Essentially, it serves as a link between utterances. Abrupt changes in discourse topic are reflected by a change of Cb between utterances. In discourses where the change of Cb is minimal, the Cp of the utterance represents a prediction of what the Cb will be in the next utterance.</Paragraph>
      <Paragraph position="2"> Grosz, Joshi, and Weinstein (1986, 1995) proposed the following constraints of centering theory: 1. There is precisely one Cb.</Paragraph>
      <Paragraph position="3"> 2. Every element of the Cf-list for Ui must be realized in Ui.</Paragraph>
      <Paragraph position="4"> 3. The center, Cb(Ui, D), is the highest-ranked element of Cf(Ui-1, D) that is realized in Ui.</Paragraph>
      <Paragraph position="5"> In addition, they proposed the following rules:</Paragraph>
    </Section>
    <Section position="3" start_page="508" end_page="509" type="sub_section">
      <SectionTitle>
Rules
</SectionTitle>
      <Paragraph position="0"> For each utterance Ui in a discourse segment D consisting of utterances U1 ... Um:</Paragraph>
      <Paragraph position="2"> 1. If some element of Cf(Ui-1, D) is realized as a pronoun in Ui, then so is Cb(Ui, D).</Paragraph>
      <Paragraph position="3"> 2. Transition states (defined below) are ordered such that a sequence of Continues is preferred over a sequence of Retains, which is preferred over a sequence of Shifts.</Paragraph>
      <Paragraph position="4"> The relationship between the Cb and Cp of two utterances determines the coherence between the utterances. Centering theory ranks the coherence of adjacent utterances with transitions that are determined by the following criteria: 1. whether or not the Cb is the same from Un-1 to Un; 2. whether or not this entity coincides with the Cp of Un. Table 1 Centering transition table.</Paragraph>
      <Paragraph position="6"> BFP and Walker, Iida, and Cote (1994) identified a finer gradation in the Shift transition, stating that Retains were preferred over Smooth Shifts, which were preferred over Rough Shifts. Table 1 shows the criteria for each transition.</Paragraph>
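The logic of the transition table (the content of Table 1, which does not survive in this extraction) can be sketched as a small classifier. This is an illustrative reconstruction from the two criteria above and the standard centering literature, not code from the paper; the convention that an undefined Cb(Un-1) counts as "same Cb" is an assumption.

```python
def transition(cb_prev, cb_cur, cp_cur):
    """Classify the centering transition between adjacent utterances.

    cb_prev: Cb of the previous utterance (None if undefined)
    cb_cur:  Cb of the current utterance
    cp_cur:  Cp of the current utterance
    """
    # Criterion 1: is the Cb the same from Un-1 to Un?
    same_cb = cb_prev is None or cb_cur == cb_prev
    # Criterion 2: does the Cb coincide with the Cp of Un?
    if same_cb:
        return "Continue" if cb_cur == cp_cur else "Retain"
    return "Smooth Shift" if cb_cur == cp_cur else "Rough Shift"
```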
      <Paragraph position="7"> Given these constraints and rules, BFP proposed the following pronoun-binding algorithm based on centering:  1. Generate all possible Cb-Cf combinations.</Paragraph>
      <Paragraph position="8"> 2. Filter combinations by contraindices and centering rules.</Paragraph>
      <Paragraph position="9"> 3. Rank remaining combinations by transitions.</Paragraph>
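A minimal sketch of the generate-filter-rank scheme, assuming the caller supplies the candidate combinations, a `transition_of` classifier, and a `violates` predicate for contraindices and centering rules (all names here are hypothetical, not from the paper):

```python
# Preference order from Rule 2, with BFP's finer gradation of Shifts.
ORDER = ["Continue", "Retain", "Smooth Shift", "Rough Shift"]

def bfp_pick(candidates, transition_of, violates):
    """Generate-filter-rank: drop candidates failing the filters, then
    choose the one whose transition is most preferred."""
    legal = [c for c in candidates if not violates(c)]
    return min(legal, key=lambda c: ORDER.index(transition_of(c)), default=None)
```

For example, if the filters reject the only Continue candidate, a Retain candidate wins over a Smooth Shift.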
      <Paragraph position="10">  Walker (1989) compared Hobbs and BFP on three small data sets using hand evaluation. The results indicated that the two algorithms performed equivalently on a fictional domain of 100 utterances, while Hobbs outperformed BFP on domains consisting of newspaper articles (89% to 79%) and a task domain (Tasks) (51% to 49%).</Paragraph>
    </Section>
    <Section position="4" start_page="509" end_page="510" type="sub_section">
      <SectionTitle>
2.3 The S-List Approach
</SectionTitle>
      <Paragraph position="0"> The third approach (Strube 1998) discards the notions of backward- and forward-looking centers but maintains the notion of modeling the attentional state. This method, the S-list (salience list), was motivated by the BFP algorithm's problems with incrementality and computational overhead (it was also difficult to coordinate the algorithm with intrasentential resolution).</Paragraph>
      <Paragraph position="1"> 2.3.1 The S-List. The model has one structure, the S-list, which "describes the attentional state of the hearer at any given point in processing a discourse" (Strube 1998, page 1252). At first glance, this definition is quite similar to that of a Cf-list; however, the two differ in ranking and composition. First, the S-list can contain elements from both the current and previous utterance while the Cf-list contains elements from the previous utterance alone. Second, the S-list's elements are ranked not by grammatical role but by information status and then by surface order.</Paragraph>
      <Paragraph position="2"> The elements of the S-list are separated into three information sets--hearer-old discourse entities (OLD), mediated discourse entities (MED), and hearer-new discourse entities (NEW)--all of which are based on Prince's (1981) familiarity scale. The three sets are further subdivided: OLD consists of evoked and unused entities; MED consists of inferrables, containing inferrables, and anchored brand-new discourse entities; NEW consists solely of brand-new entities.</Paragraph>
      <Paragraph position="3"> What sorts of NPs fall into these categories? Pronouns and other referring expressions, as well as previously mentioned proper names, are evoked. Unused entities are proper names. Inferrables are entities that are linked to some other entity in the hearer's knowledge, but indirectly. Anchored brand-new discourse entities have as their anchor an entity that is OLD.</Paragraph>
      <Paragraph position="4"> The three sets are ordered by their information status. OLD entities are preferred over MED entities, which are preferred over NEW entities. Within each set, the ordering is by utterance and position in utterance. Basically, an entity of utterance x is preferred over an entity of utterance y if utterance x follows utterance y. If the entities are in the same utterance, they are ranked by position in the sentence: an entity close to the beginning of the sentence is preferred over one that is farther away.</Paragraph>
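The S-list ordering described above (information status, then recency of utterance, then position within the utterance) can be sketched as a sort key. The entity fields `status`, `utt`, and `pos` are assumptions made for illustration:

```python
FAMILIARITY = {"OLD": 0, "MED": 1, "NEW": 2}

def slist_key(entity):
    """Sort key: information status first, then more recent utterances
    preferred, then earlier position within the same utterance."""
    return (FAMILIARITY[entity["status"]], -entity["utt"], entity["pos"])

def insert_sorted(slist, entity):
    """Insert a new entity into its proper position in the S-list."""
    slist.append(entity)
    slist.sort(key=slist_key)
```

So an OLD entity from the current utterance outranks an OLD entity from the previous one, and any OLD entity outranks every NEW one.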
      <Paragraph position="5">  2.3.2 Algorithm. The resolution algorithm presented here comes from Strube (1998) and personal communication with Michael Strube.</Paragraph>
      <Paragraph position="6"> For each utterance (U1 ... UN): for each entity within Ui: 1. If the entity is a pronoun, then find a referent by looking through the S-list left to right for one that matches in gender, number, person, and binding constraints. Mark the entity as EVOKED. 2. If the entity is preceded by an indefinite article, then mark it as BRAND-NEW.</Paragraph>
      <Paragraph position="7"> 3. If the entity is not preceded by a determiner, then mark it as UNUSED.</Paragraph>
      <Paragraph position="8"> 4. Else mark the entity as ANCHORED BRAND-NEW.</Paragraph>
      <Paragraph position="9"> 5. Insert the entity into the S-list given the ranking described above.</Paragraph>
      <Paragraph position="10"> 6. Upon completion of Ui, remove all entities from the S-list that were not realized in Ui.</Paragraph>
      <Paragraph position="11">  In short, the S-list method continually inserts new entities into the S-list in their proper positions and "cleanses" the list after each utterance to purge entities that are unlikely to be used again in the discourse. Pronoun resolution is a simple lookup in the S-list.</Paragraph>
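The lookup and cleansing steps can be sketched as two small functions; the entity fields and the id-based notion of "realized in Ui" are assumptions for the example, not the paper's data structures:

```python
def resolve_pronoun(slist, pronoun):
    """Pronoun resolution as a left-to-right lookup in the S-list for the
    first entity agreeing in gender, number, and person."""
    for e in slist:
        if all(e[f] == pronoun[f] for f in ("gender", "number", "person")):
            return e
    return None

def cleanse(slist, realized_ids):
    """After each utterance, purge entities not realized in it."""
    return [e for e in slist if e["id"] in realized_ids]
```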
      <Paragraph position="12"> Strube performed a hand evaluation of the S-list algorithm and the BFP algorithm on three short stories by Hemingway and three articles from the New York Times. BFP, with intrasentential centering added, successfully resolved 438 pronouns out of 576 (76%). The S-list approach performed much better (85%).</Paragraph>
    </Section>
    <Section position="5" start_page="510" end_page="511" type="sub_section">
      <SectionTitle>
2.4 Left-Right Centering Algorithm
</SectionTitle>
      <Paragraph position="0"> Left-Right Centering (Tetreault 1999) is an algorithm built upon centering theory's constraints and rules as detailed in Grosz, Joshi, and Weinstein (1995). The creation of the LRC algorithm is motivated by BFP's limitation as a cognitive model in that it makes no provision for incremental resolution of pronouns (Kehler 1997). Psycholinguistic research supports the claim that listeners process utterances one word at a time. Therefore, when a listener hears a pronoun, he or she will try to resolve it immediately; if new information appears that makes the original choice incorrect (such as a violation of binding constraints), the listener will go back and find a correct antecedent.</Paragraph>
      <Paragraph position="1"> Responding to the lack of incremental processing in the BFP model, we have constructed an incremental resolution algorithm that adheres to centering constraints.</Paragraph>
      <Paragraph position="2"> It works by first searching for an antecedent in the current utterance; if one is not found, then the previous Cf-lists (starting with the previous utterance) are searched left to right for an antecedent:  1. Preprocessing--from previous utterance: Cb(Un-1) and Cf(Un-1) are available.</Paragraph>
      <Paragraph position="3"> 2. Process utterance--parse and extract incrementally from Un all references to discourse entities. For each pronoun do: (a) Search for an antecedent intrasententially in Cf-partial(Un) that meets feature and binding constraints.</Paragraph>
      <Paragraph position="4"> If one is found, proceed to the next pronoun within utterance. Else go to (b).</Paragraph>
      <Paragraph position="5"> (b) Search for an antecedent intersententially in Cf(Un-1) that meets feature and binding constraints.</Paragraph>
      <Paragraph position="6"> 3. Create Cf--create the Cf-list of Un by ranking the discourse entities of Un according to grammatical function. Our implementation used a left-to-right breadth-first walk of the parse tree to approximate sorting by grammatical function.</Paragraph>
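The LRC search order (steps 2(a) and 2(b) above) can be sketched as follows. The `compatible` predicate stands in for the feature and binding constraints; all names are hypothetical:

```python
def lrc_resolve(pronoun, cf_partial, prior_cf_lists, compatible):
    """LRC search order: Cf-partial(Un) left to right first, then each
    prior Cf-list (most recent first), also left to right."""
    for entity in cf_partial:          # step 2(a): intrasentential
        if compatible(pronoun, entity):
            return entity
    for cf in prior_cf_lists:          # step 2(b): intersentential
        for entity in cf:
            if compatible(pronoun, entity):
                return entity
    return None
```

Note how an intrasentential candidate always wins over an intersentential one, which is the incremental behavior that distinguishes LRC from BFP.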
      <Paragraph position="7"> It should be noted that while BFP makes use of Rule 2 of centering theory, LRC does not since Rule 2's role in pronoun resolution is not yet known (see Kehler [1997] for a critique of its use by BFP).</Paragraph>
      <Paragraph position="8"> The preference for searching intrasententially before intersententially is motivated by the fact that large sentences are not broken up into clauses as Kameyama (1998) proposes. By looking through the Cf-partial, clause-by-clause centering is roughly approximated. In addition, the antecedents of reflexive pronouns are found by searching Cf-partial right to left because their referents are usually found in the minimal S. There are two important points to be made about centering and pronoun resolution. First, centering is not a pronoun resolution method; the fact that pronouns can be resolved is simply a side effect of the constraints and rules. Second, ranking by grammatical role is very naive. In a perfect world, the Cf-list would consist of entities  ranked by a combination of syntax and semantics. In our study, ranking is based solely on syntax.</Paragraph>
      <Paragraph position="9"> 3. Evaluation of Algorithms</Paragraph>
    </Section>
    <Section position="6" start_page="511" end_page="512" type="sub_section">
      <SectionTitle>
3.1 Data
</SectionTitle>
      <Paragraph position="0"> All four algorithms were compared on two domains taken from the Penn Treebank annotated corpus (Marcus, Santorini, and Marcinkiewicz 1993). The first domain consists of 3,900 utterances (1,694 unquoted pronouns) in New York Times articles provided by Ge, Hale, and Charniak (1998), who annotated the corpus with coreference information. The corpus consists of 195 different newspaper articles. Sentences are fully bracketed and have labels that indicate part of speech and number. Pronouns and their antecedent entities are all marked with the same tag to facilitate coreference verification. In addition, the subject NP of each S subconstituent is marked.</Paragraph>
      <Paragraph position="1"> The second domain consists of 553 utterances (511 unquoted pronouns) in three fictional texts taken from the Penn Treebank corpus, which we annotated in the same manner as Ge, Hale, and Charniak's corpus. The second domain differs from the first in that the sentences are generally shorter and less complex, and contain more hes</Paragraph>
    </Section>
    <Section position="7" start_page="512" end_page="512" type="sub_section">
      <SectionTitle>
3.2 Method
</SectionTitle>
      <Paragraph position="0"> The evaluation (Byron and Tetreault 1999) consisted of two steps: (1) parsing Penn Treebank utterances and (2) running the four algorithms. The parsing stage involved extracting discourse entities from the Penn Treebank utterances. Since we were solely concerned with pronouns having NP antecedents, we extracted only NPs. For each NP we generated a "filecard" that stored its syntactic information. This information included agreement properties, syntactic type, parent nodes, depth in tree, position in utterance, presence or absence of a determiner, gender, coreference tag, utterance number, whether it was quoted, commanding verb, whether it was part of a title, whether it was reflexive, whether it was part of a possessive NP, whether it was in a prepended phrase, and whether it was part of a conjoined sentence. The entities were listed in each utterance in order of mention except in the case of conjoined NPs.</Paragraph>
      <Paragraph position="1"> Conjoined entities such as John and Mary were realized as three entities: the singular entities John and Mary and the plural John and Mary. The plural entity was placed ahead of the singular ones in the Cf-list, on the basis of research by Gordon et al. (1999).</Paragraph>
      <Paragraph position="3"> Conjoined utterances were broken up into their subutterances. For example, the utterance United Illuminating is based in New Haven, Conn., and Northeast is based in Hartford, Conn. was replaced by the two utterances United Illuminating is based in New Haven, Conn. and Northeast is based in Hartford, Conn. This strategy was inspired by Kameyama's (1998) methods for dealing with complex sentences; it improves the accuracy of each algorithm by 1% to 2%.</Paragraph>
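The treatment of conjoined NPs described above can be sketched in a few lines; the dict representation and field names are assumptions for illustration:

```python
def realize_conjoined(conjuncts):
    """A conjoined NP yields the plural entity plus each singular conjunct,
    with the plural entity ranked first (following Gordon et al. 1999)."""
    plural = {"text": " and ".join(c["text"] for c in conjuncts),
              "number": "pl"}
    return [plural] + conjuncts
```

So "John and Mary" enters the Cf-list as three entities, plural first.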
      <Paragraph position="4"> The second stage involved running each algorithm on the parsed forms of the Penn Treebank utterances. For all algorithms, we used the same guidelines as Strube and Hahn (1999): no world knowledge was assumed, only agreement criteria (gender, number) and binding constraints were applied. Unlike Strube and Hahn, we did not make use of sortal constraints. The number of each NP could be extracted from the Penn Treebank annotations, but gender had to be hand-coded. A database of all NPs was tagged with their gender (masculine, feminine, neuter). NPs such as president or banker were marked as androgynous since it is possible to refer to them with a gendered pronoun. Entities within quotes were removed from the evaluation since the S-list algorithm and BFP do not allow resolution of quoted text.</Paragraph>
      <Paragraph position="5"> We depart from Walker's (1989) and Strube and Hahn's (1999) evaluations by not defining any discourse segments. Walker defines a discourse segment as a paragraph (unless the first sentence of the paragraph has a pronoun in subject position or unless it has a pronoun with no antecedent among the preceding NPs that match syntactic features). Instead, we divide our corpora only by discourses (newspaper article or story). Once a new discourse is encountered, the history list for each algorithm (be it the Cf-list or S-list) is cleared. Using discourse segments should increase the efficiency of all algorithms since it constrains the search space significantly.</Paragraph>
      <Paragraph position="6"> Unlike Walker (1989), we do not account for false positives or error chains; instead, we use a "location"-based evaluation procedure. Error chains occur when a pronoun Pi2 refers to a pronoun Pi1 that was resolved incorrectly to entity Ek (where Pi1 and Pi2 evoke the same entity Ei). So Pi2 would corefer incorrectly with Ek. In our evaluation, a coreference is deemed correct if it corefers with an NP that has the same coreference tag. So in the above situation, Pi2 would be deemed correct since it was matched to an expression that should realize the correct entity.</Paragraph>
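A sketch of the location-based scoring, assuming each mention carries a coreference tag (the data layout here is hypothetical): a pick is correct when the pronoun and its chosen antecedent share a tag, so a pronoun resolved to a mis-resolved pronoun of the right entity is not penalized.

```python
def location_based_score(picks, tag_of):
    """picks maps each pronoun to the mention chosen as its antecedent;
    tag_of maps every mention to its coreference tag. A pick counts as
    correct when both mentions share a tag, so error chains do not
    propagate penalties."""
    correct = sum(tag_of[p] == tag_of[a] for p, a in picks.items())
    return correct / len(picks)
```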
    </Section>
    <Section position="8" start_page="512" end_page="513" type="sub_section">
      <SectionTitle>
3.3 Algorithm Modifications
</SectionTitle>
      <Paragraph position="0"> The BFP algorithm had to be modified slightly to compensate for underspecifications in its intrasentential resolution. We follow the same method as Strube and Hahn (1999); that is, we first try to resolve pronouns intersententially using the BFP algorithm. If there are pronouns left unresolved, we search for an antecedent left to right in the same utterance. Strube and Hahn use Kameyama's (1998) specifications for complex sentences to break up utterances into smaller components. We keep the utterances whole (with the exception of splitting conjoined utterances).</Paragraph>
      <Paragraph position="1"> As an aside, the BFP algorithm can be modified (Walker 1989) so that intrasentential antecedents are given a higher preference. To quote Walker, the alteration (suggested by Carter [1987]) involves selecting intrasentential candidates "only in the cases where no discourse center has been established or the discourse center has been rejected for syntactic or selectional reasons" (page 258). Walker applied the modification and was able to boost BFP's accuracy to 93% correct over the fiction corpus, 84% on Newsweek articles, and 64% on Tasks (up from 90%, 79%, and 49%, respectively). BFP with Carter's modification may seem quite similar to LRC except for two points. First, LRC seeks antecedents intrasententially regardless of the status of the discourse center. Second, LRC does not use Rule 2 in constraining possible antecedents intersententially, while BFP does.</Paragraph>
      <Paragraph position="2"> Because the S-list approach incorporates both semantics and syntax in its familiarity ranking scheme, a shallow version that uses only syntax is implemented in this study. This means that inferrables are not represented and entities rementioned as NPs may be underrepresented in the ranking.</Paragraph>
      <Paragraph position="3"> Both the BFP and S-list algorithms were modified so that they have the ability to look back through all past Cf/S-lists. This puts the two algorithms on equal footing with the Hobbs and LRC algorithms, which allow one to look back as far as possible within the discourse.</Paragraph>
      <Paragraph position="4"> Hobbs (1978) makes use of selectional constraints to help refine the search space for neutral pronouns such as it. We do not use selectional constraints in this syntax-only study.</Paragraph>
    </Section>
  </Section>
</Paper>