<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-4004">
  <Title>A Machine Learning Approach to Coreference Resolution of Noun Phrases Wee Meng Soon* DSO National Laboratories</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
DSO National Laboratories
</SectionTitle>
    <Paragraph position="0"> In this paper, we present a learning approach to coreference resolution of noun phrases in unrestricted text. The approach learns from a small, annotated corpus and the task includes resolving not just a certain type of noun phrase (e.g., pronouns) but rather general noun phrases. It also does not restrict the entity types of the noun phrases; that is, coreference is assigned whether they are of &amp;quot;organization,&amp;quot; &amp;quot;person,&amp;quot; or other types. We evaluate our approach on common data sets (namely, the MUC-6 and MUC-7 coreference corpora) and obtain encouraging results, indicating that on the general noun phrase coreference task, the learning approach holds promise and achieves accuracy comparable to that of nonlearning approaches. Our system is the first learning-based system that offers performance comparable to that of state-of-the-art nonlearning systems on these data sets.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Coreference resolution is the process of determining whether two expressions in natural language refer to the same entity in the world. It is an important subtask in natural language processing systems. In particular, information extraction (IE) systems like those built in the DARPA Message Understanding Conferences (Chinchor 1998; Sundheim 1995) have revealed that coreference resolution is such a critical component of IE systems that a separate coreference subtask has been defined and evaluated since MUC-6 (MUC-6 1995).</Paragraph>
    <Paragraph position="1"> In this paper, we focus on the task of determining coreference relations as defined in MUC-6 (MUC-6 1995) and MUC-7 (MUC-7 1997). Specifically, a coreference relation denotes an identity of reference and holds between two textual elements known as markables, which can be definite noun phrases, demonstrative noun phrases, proper names, appositives, sub-noun phrases that act as modifiers, pronouns, and so on. Thus, our coreference task resolves general noun phrases and is not restricted to a certain type of noun phrase such as pronouns. Also, we do not place any restriction on the possible candidate markables; that is, all markables, whether they are &amp;quot;organization,&amp;quot; &amp;quot;person,&amp;quot; or other entity types, are considered. The ability to link coreferring noun phrases both within and across sentences is critical to discourse analysis and language understanding in general.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="523" type="metho">
    <SectionTitle>
2. A Machine Learning Approach to Coreference Resolution
</SectionTitle>
    <Paragraph position="0"> We adopt a corpus-based, machine learning approach to noun phrase coreference resolution. This approach requires a relatively small corpus of training documents that have been annotated with coreference chains of noun phrases. All possible markables in a training document are determined by a pipeline of language-processing modules, and training examples in the form of feature vectors are generated for appropriate pairs of markables. These training examples are then given to a learning algorithm to build a classifier. To determine the coreference chains in a new document, all markables are determined and potential pairs of coreferring markables are presented to the classifier, which decides whether the two markables actually corefer. We give the details of these steps in the following subsections.</Paragraph>
    <Section position="1" start_page="0" end_page="522" type="sub_section">
      <SectionTitle>
2.1 Determination of Markables
</SectionTitle>
      <Paragraph position="0"> A prerequisite for coreference resolution is to obtain most, if not all, of the possible markables in a raw input text. To determine the markables, a pipeline of natural language processing (NLP) modules is used, as shown in Figure 1. They consist of tokenization, sentence segmentation, morphological processing, part-of-speech tagging, noun phrase identification, named entity recognition, nested noun phrase extraction, and semantic class determination. As far as coreference resolution is concerned, the goal of these NLP modules is to determine the boundary of the markables, and to provide the necessary information about each markable for subsequent generation of features in the training examples.</Paragraph>
      <Paragraph position="1"> Our part-of-speech tagger is a standard statistical tagger based on the Hidden Markov Model (HMM) (Church 1988). Similarly, we built a statistical HMM-based noun phrase identification module that determines the noun phrase boundaries solely based on the part-of-speech tags assigned to the words in a sentence. We also implemented a module that recognizes MUC-style named entities, that is, organization, person, location, date, time, money, and percent. Our named entity recognition module uses the HMM approach of Bikel, Schwartz, and Weischedel (1999), which learns from a tagged corpus of named entities. That is, our part-of-speech tagger, noun phrase identification module, and named entity recognition module are all based on HMMs and learn from corpora tagged with parts of speech, noun phrases, and named entities, respectively. Next, both the noun phrases determined by the noun phrase identification module and the named entities are merged in such a way that if the noun phrase overlaps with a named entity, the noun phrase boundaries will be adjusted to subsume the named entity.</Paragraph>
      <Paragraph position="2">  Soon, Ng, and Lira Coreference Resolution The nested noun phrase extraction module subsequently accepts the noun phrases and determines the nested phrases for each noun phrase. The nested noun phrases are divided into two groups: .</Paragraph>
      <Paragraph position="3"> .</Paragraph>
      <Paragraph position="4"> Nested noun phrases from possessive noun phrases. Consider two possessive noun phrases marked by the noun phrase module, his long-range strategy and Eastern's parent. The nested noun phrase for the first phrase is the pronoun his, while for the second one, it is the proper name Eastern.</Paragraph>
      <Paragraph position="5"> Nested noun phrases that are modifier nouns (or prenominals). For example, the nested noun phrase for wage reductions is wage, and for Union representatives, it is Union.</Paragraph>
      <Paragraph position="6"> Finally, the markables needed for coreference resolution are the union of the noun phrases, named entities, and nested noun phrases found. For markables without any named entity type, semantic class is determined by the semantic class determination module. More details regarding this module are given in the description of the semantic class agreement feature.</Paragraph>
      <Paragraph position="7"> To achieve acceptable recall for coreference resolution, it is most critical that the eligible candidates for coreference be identified correctly in the first place. In order to test our system's effectiveness in determining the markables, we attempted to match the markables generated by our system against those appearing in the coreference chains annotated in 100 SGML documents, a subset of the training documents available in MUC-6. We found that our system is able to correctly identify about 85% of the noun phrases appearing in coreference chains in the 100 annotated SGML documents. Most of the unmatched noun phrases are of the following types:</Paragraph>
      <Paragraph position="9"> Our system generated a head noun that is a subset of the noun phrase in the annotated corpus. For example, Saudi Arabia, the cartel's biggest producer was annotated as a markable, but our system generated only Saudi Arabia.</Paragraph>
      <Paragraph position="10"> Our system extracted a sequence of words that cannot be considered as a markable.</Paragraph>
      <Paragraph position="11"> Our system extracted markables that appear to be correct but do not match what was annotated. For example, our system identified selective wage reductions, but wage reductions was annotated instead.</Paragraph>
    </Section>
    <Section position="2" start_page="522" end_page="523" type="sub_section">
      <SectionTitle>
2.2 Determination of Feature Vectors
</SectionTitle>
      <Paragraph position="0"> To build a learning-based coreference engine, we need to devise a set of features that is useful in determining whether two markables corefer or not. In addition, these features must be generic enough to be used across different domains. Since the MUC-6 and MUC-7 tasks define coreference guidelines for all types of noun phrases and different types of noun phrases behave differently in terms of how they corefer, our features must be able to handle this and give different coreference decisions based on different types of noun phrases. In general, there must be some features that indicate the type of a noun phrase. Altogether, we have five features that indicate whether the markables are definite noun phrases, demonstrative noun phrases, pronouns, or proper names.</Paragraph>
      <Paragraph position="1"> There are many important knowledge sources useful for coreference. We wanted to use those that are not too difficult to compute. One important factor is the distance  Computational Linguistics Volume 27, Number 4 between the two markables. McEnery, Tanaka, and Botley (1997) have done a study on how distance affects coreference, particularly for pronouns. One of their conclusions is that the antecedents of pronouns do exhibit clear quantitative patterns of distribution. The distance feature has different effects on different noun phrases. For proper names, locality of the antecedents may not be so important. We include the distance feature so that the learning algorithm can best decide the distribution for different classes of noun phrases.</Paragraph>
      <Paragraph position="2"> There are other features that are related to the gender, number, and semantic class of the two markables. Such knowledge sources are commonly used for the task of determining coreference.</Paragraph>
      <Paragraph position="3"> Our feature vector consists of a total of 12 features described below, and is derived based on two extracted markables, i and j, where i is the potential antecedent and j is the anaphor. Information needed to derive the feature vectors is provided by the pipeline of language-processing modules prior to the coreference engine.</Paragraph>
      <Paragraph position="4">  1. Distance Feature (DIST): Its possible values are 0,1, 2, 3 ..... This feature captures the distance between i and j. If i and j are in the same sentence, the value is 0; if they are one sentence apart, the value is 1; and so on.</Paragraph>
      <Paragraph position="5"> 2. i-Pronoun Feature (I_PRONOUN): Its possible values are true or false. If i is a pronoun, return true; else return false. Pronouns include reflexive pronouns (himself, herself), personal pronouns (he, him, you), and possessive pronouns (hers, her).</Paragraph>
      <Paragraph position="6"> 3. j-Pronoun Feature (J_PRONOUN): Its possible values are true or false. If j is a pronoun (as described above), then return true; else return false.</Paragraph>
      <Paragraph position="7"> 4. String Match Feature (STR_MATCH): Its possible values are true or  false. If the string of i matches the string of j, return true; else return false. We first remove articles (a, an, the) and demonstrative pronouns (this, these, that, those) from the strings before performing the string comparison. Therefore, the license matches this license, that computer matches computer.</Paragraph>
      <Paragraph position="8"> 5. Definite Noun Phrase Feature (DEF_NP): Its possible values are true or false. In our definition, a definite noun phrase is a noun phrase that starts with the word the. For example, the car is a definite noun phrase. If j is a definite noun phrase, return true; else return false.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="523" end_page="529" type="metho">
    <SectionTitle>
6. Demonstrative Noun Phrase Feature (DEM_NP): Its possible values are
</SectionTitle>
    <Paragraph position="0"> true or false. A demonstrative noun phrase is one that starts with the word this, that, these, or those. If j is a demonstrative noun phrase, then return true; else return false.</Paragraph>
    <Paragraph position="1"> 7. Number Agreement Feature (NUMBER): Its possible values are true or false. If i and j agree in number (i.e., they are both singular or both plural), the value is true; otherwise false. Pronouns such as they and them are plural, while it, him, and so on, are singular. The morphological root of a noun is used to determine whether it is singular or plural if the noun is not a pronoun.</Paragraph>
    <Paragraph position="2"> 8. Semantic Class Agreement Feature (SEMCLASS): Its possible values are true, false, or unknown. In our system, we defined the following semantic classes: &amp;quot;female,&amp;quot; &amp;quot;male,&amp;quot; &amp;quot;person,&amp;quot; &amp;quot;organization,&amp;quot; &amp;quot;location,&amp;quot; &amp;quot;date,&amp;quot; &amp;quot;time,&amp;quot; &amp;quot;money, .... percent,&amp;quot; and &amp;quot;object.&amp;quot; These semantic classes  Soon, Ng, and Lim Coreference Resolution .</Paragraph>
    <Paragraph position="3"> 10.</Paragraph>
    <Paragraph position="4"> 11.</Paragraph>
    <Paragraph position="5"> are arranged in a simple ISA hierarchy. Each of the &amp;quot;female&amp;quot; and &amp;quot;male&amp;quot; semantic classes is a subclass of the semantic class &amp;quot;person,&amp;quot; while each of the semantic classes &amp;quot;organization,&amp;quot; &amp;quot;location,&amp;quot; &amp;quot;date,&amp;quot; &amp;quot;time,&amp;quot; &amp;quot;money,&amp;quot; and &amp;quot;percent&amp;quot; is a subclass of the semantic class &amp;quot;object.&amp;quot; Each of these defined semantic classes is then mapped to a WordNet synset (Miller 1990). For example, &amp;quot;male&amp;quot; is mapped to the second sense of the noun male in WordNet, &amp;quot;location&amp;quot; is mapped to the first sense of the norm location, and so on.</Paragraph>
    <Paragraph position="6"> The semantic class determination module assumes that the semantic class for every markable extracted is the first sense of the head noun of the markable. Since WordNet orders the senses of a noun by their frequency, this is equivalent to choosing the most frequent sense as the semantic class for each norm. If the selected semantic class of a markable is a subclass of one of our defined semantic classes C, then the semantic class of the markable is C; else its semantic class is &amp;quot;unknown.&amp;quot; The semantic classes of markables i and j are in agreement if one is the parent of the other (e.g., chairman with semantic class &amp;quot;person&amp;quot; and Mr. Lim with semantic class &amp;quot;male&amp;quot;), or they are the same (e.g., Mr. Lim and he, both of semantic class &amp;quot;male&amp;quot;). The value returned for such cases is true. If the semantic classes of i and j are not the same (e.g., IBM with semantic class &amp;quot;organization&amp;quot; and Mr. Lim with semantic class &amp;quot;male&amp;quot;), return false. If either semantic class is &amp;quot;unknown,&amp;quot; then the head noun strings of both markables are compared. If they are the same, return true; else return unknown.</Paragraph>
    <Paragraph position="7"> Gender Agreement Feature (GENDER): Its possible values are true, false, or unknown. The gender of a markable is determined in several ways. Designators and pronouns such as Mr., Mrs., she, and he, can determine the gender. For a markable that is a person's name, such as Peter H. Diller, the gender cannot be determined by the above method. In our system, the gender of such a markable can be determined if markables are found later in the document that refer to Peter H. Diller by using the designator form of the name, such as Mr. Diller. If the designator form of the name is not present, the system will look through its database of common human first names to determine the gender of that markable. The gender of a markable will be unknown for noun phrases such as the president and chief executive officer. The gender of other markables that are not &amp;quot;person&amp;quot; is determined by their semantic classes. Unknown semantic classes will have unknown gender while those that are objects will have &amp;quot;neutral&amp;quot; gender. If the gender of either markable i or j is unknown, then the gender agreement feature value is unknown; else if i and j agree in gender, then the feature value is true; otherwise its value is false.</Paragraph>
    <Paragraph position="8"> Both-Proper-Names Feature (PROPER_NAME): Its possible values are true or false. A proper name is determined based on capitalization.</Paragraph>
    <Paragraph position="9"> Prepositions appearing in the name such as of and and need not be in uppercase. If i and j are both proper names, return true; else return false. Alias Feature (ALIAS): Its possible values are true or false. If i is an alias of j or vice versa, return true; else return false. That is, this feature value is true if i and j are named entities (person, date, organization, etc.) that refer to the same entity. The alias module works differently depending  Computational Linguistics Volume 27, Number 4 12.</Paragraph>
    <Paragraph position="10"> on the named entity type. For i and j that are dates (e.g., 01-08 and Jan. 8), by using string comparison, the day, month, and year values are extracted and compared. If they match, then j is an alias of i. For i and j that are &amp;quot;person,&amp;quot; such as Mr. Simpson and Bent Simpson, the last words of the noun phrases are compared to determine whether one is an alias of the other. For organization names, the alias function also checks for acronym match such as IBM and International Business Machines Corp. In this case, the longer string is chosen to be the one that is converted into the acronym form. The first step is to remove all postmodifiers such as Corp. and Ltd. Then, the acronym function considers each word in turn, and if the first letter is capitalized, it is used to form the acronym. Two variations of the acronyms are produced: one with a period after each letter, and one without.</Paragraph>
    <Paragraph position="11"> Appositive Feature (APPOSITIVE): Its possible values are true or false.</Paragraph>
    <Paragraph position="12"> If j is in apposition to i, return true; else return false. For example, the markable the chairman of Microsoft Corp. is in apposition to Bill Gates in the sentence Bill Gates, the chairman of Microsoft Corp ...... Our system determines whether j is a possible appositive construct by first checking for the existence of verbs and proper punctuation. Like the above example, most appositives do not have any verb; and an appositive is separated by a comma from the most immediate antecedent, i, to which it refers. Further, at least one of i and j must be a proper name. The MUC-6 and MUC-7 coreference task definitions are slightly different. In MUC-6, j needs to be a definite noun phrase to be an appositive, while both indefinite and definite noun phrases are acceptable in MUC-7.</Paragraph>
    <Paragraph position="13"> As an example, Table 1 shows the feature vector associated with the antecedent i,  Frank Newman, and the anaphor j, vice chairman, in the following sentence: (1) Separately, Clinton transition officials said that Frank Newman, 50, vice chairman and chief financial officer of BankAmerica Corp., is expected to  be nominated as assistant Treasury secretary for domestic finance.</Paragraph>
    <Paragraph position="14"> Table 1 Feature vector of the markable pair (i = Frank Newman, j = vice chairman).</Paragraph>
    <Section position="1" start_page="525" end_page="526" type="sub_section">
      <SectionTitle>
Feature Value Comments
</SectionTitle>
      <Paragraph position="0"> DIST 0 i and j are in the same sentence I_PRONOUN - i is not a pronoun J~RONOUN - j is not a pronoun STR_MATCH - i and j do not match DEF_NP - j is not a definite noun phrase DEMaNP - j is not a demonstrative noun phrase NUMBER + i and j are both singular SEMCLASS 1 i and j are both persons (This feature has three values: false(0), true(l), unknown(2).) GENDER 1 i and j are both males (This feature has three values: false(0), true(l), unknown(2).) PROPER_NAME - Only i is a proper name ALIAS - j is not an alias of i APPOSITIVE + j is in apposition to i  Soon, Ng, and Lira Coreference Resolution Because of capitalization, markables in the headlines of MUC-6 and MUC-7 documents are always considered proper names even though some are not. Our system solves this inaccuracy by first preprocessing a headline to correct the capitalization before passing it into the pipeline of NLP modules. Only those markables in the headline that appear in the text body as proper names have their capitalization changed to match those found in the text body. All other headline markables are changed to lowercase.</Paragraph>
    </Section>
    <Section position="2" start_page="526" end_page="527" type="sub_section">
      <SectionTitle>
2.3 Generating Training Examples
</SectionTitle>
      <Paragraph position="0"> Consider a coreference chain A1 - A2 - A3 - A4 found in an annotated training document. Only pairs of noun phrases in the chain that are immediately adjacent (i.e., A1 -A2, A2 - A3, and A3 - A4) are used to generate the positive training examples. The first noun phrase in a pair is always considered the antecedent, while the second is the anaphor. On the other hand, negative training examples are extracted as follows.</Paragraph>
      <Paragraph position="1"> Between the two members of each antecedent-anaphor pair, there are other markables extracted by our language-processing modules that either are not found in any coreference chain or appear in other chains. Each of them is then paired with the anaphor to form a negative example. For example, if markables a, b, and B1 appear between A1 and A2, then the negative examples are a - A2, b - A2, and B1 - A2. Note that a and b do not appear in any coreference chain, while B1 appears in another coreference chain.</Paragraph>
      <Paragraph position="2"> For an annotated noun phrase in a coreference chain in a training document, the same noun phrase must be identified as a markable by our pipeline of language-processing modules before this noun phrase can be used to form a feature vector for use as a training example. This is because the information necessary to derive a feature vector, such as semantic class and gender, is computed by the language-processing modules. If an annotated noun phrase is not identified as a markable, it will not contribute any training example. To see more clearly how training examples are generated, consider the following four sentences:  * Sentence 1 1. (Eastern Air)a1 Proposes (Date For Talks on ((Pay)cl-CUt)dl Plan)hi 2. (Eastern Air)l Proposes (Date)2 For (Talks)3 on (Pay-Cut Plan)4 * Sentence 2 1. (Eastern Airlines)a2 executives notified (union)el leaders that the carrier wishes to discuss selective ((wage)c2 reductions)d2 on (Feb. 3)b2.</Paragraph>
      <Paragraph position="3"> 2. ((Eastern Airlines)5 executives)6 notified ((union)7 leaders)8 that (the carrier)9 wishes to discuss (selective (wage)10 reductions)n on (Feb. 3)12.</Paragraph>
      <Paragraph position="4"> 1.</Paragraph>
      <Paragraph position="6"> By proposing (a meeting date)b3, (Eastern)a3 moved one step closer toward reopening current high-cost contract agreements with ((its)a4 unions)e3.</Paragraph>
      <Paragraph position="7"> By proposing (a meeting dateh7, (Eastern)18 moved (one step)19 closer toward reopening (current high-cost contract</Paragraph>
      <Paragraph position="9"> Each sentence is shown twice with different noun phrase boundaries. Sentences labeled (1) are obtained directly from part of the training document. The letters in the subscripts uniquely identify the coreference chains, while the numbers identify the noun phrases. Noun phrases in sentences labeled (2) are extracted by our language-processing modules and are also uniquely identified by numeric subscripts.</Paragraph>
      <Paragraph position="10"> Let's consider chain e, which is about the union. There are three noun phrases that corefer, and our system managed to extract the boundaries that correspond to all of them: (union)7 matches with (union)el, (union)13 with (union)e2, and (its unions)22 with (its unions)e3. There are two positive training examples formed by ((union)13, (its unions)22) and ((union)7, (union)13). Noun phrases between (union)7 and (union)13 that do not corefer with (union)13 are used to form the negative examples. The negative examples are ((the carrier)9, (union)is), ((wage)lo, (union)13), ((selective wage reductions)11, (union)13), and ((Feb. 3)12, (union)13). Negative examples can also be found similarly between ((union)13, (its unions)22).</Paragraph>
      <Paragraph position="11"> As another example, neither noun phrase in chain d, (Pay-Cut)all and (wage reductions)a2, matches with any machine-extracted noun phrase boundaries. In this case, no positive or negative example is formed for noun phrases in chain d.</Paragraph>
    </Section>
    <Section position="3" start_page="527" end_page="527" type="sub_section">
      <SectionTitle>
2.4 Building a Classifier
</SectionTitle>
      <Paragraph position="0"> The next step is to use a machine learning algorithm to learn a classifier based on the feature vectors generated from the training documents. The learning algorithm used in our coreference engine is C5, which is an updated version of C4.5 (Quinlan 1993).</Paragraph>
      <Paragraph position="1"> C5 is a commonly used decision tree learning algorithm and thus it may be considered as a baseline method against which other learning algorithms can be compared.</Paragraph>
    </Section>
    <Section position="4" start_page="527" end_page="529" type="sub_section">
      <SectionTitle>
2.5 Generating Coreference Chains for Test Documents
</SectionTitle>
      <Paragraph position="0"> Before determining the coreference chains for a test document, all possible markables need to be extracted from the document. Every markable is a possible anaphor, and every markable before the anaphor in document order is a possible antecedent of the anaphor, except when the anaphor is nested. If the anaphor is a child or nested markable, then its possible antecedents must not be any markable with the same root markable as the current anaphor. However, the possible antecedents can be other root markables and their children that are before the anaphor in document order. For example, consider the two root markables, Mr. Tom's daughter and His daughter's eyes, appearing in that order in a test document. The possible antecedents of His cannot be His daughter or His daughter's eyes, but can be Mr. Tom or Mr. Tom's daughter.</Paragraph>
      <Paragraph position="1"> The coreference resolution algorithm considers every markable j starting from the second markable in the document to be a potential candidate as an anaphor. For each j, the algorithm considers every markable i before j as a potential antecedent. For each pair i and j, a feature vector is generated and given to the decision tree classifier. A coreferring antecedent is found if the classifier returns true. The algorithm starts from the immediately preceding markable and proceeds backward in the reverse order of  Soon, Ng, and Lim Coreference Resolution the markables in the document until there is no remaining markable to test or an antecedent is found.</Paragraph>
      <Paragraph position="2"> As an example, consider the following text with markables already detected by the NLP modules: (2) (Ms. Washington)73's candidacy is being championed by (several powerful lawmakers)74 including ((her)76 boss)75, (Chairman John Dingell)77 (D., (Mich.)78) of (the House Energy and Commerce Committee)79. (She)so currently is (a counsel)81 to (the committee)s2. (Ms.</Paragraph>
      <Paragraph position="3"> Washington)s3 and (Mr. DingeU)s4 have been considered (allies)s5 of (the (securities)s7 exchanges)s6, while (banks)s8 and ((futures)90 exchanges)89 have often fought with (themM.</Paragraph>
      <Paragraph position="4"> We will consider how the boldfaced chains are detected. Table 2 shows the pairs of markables tested for coreference to form the chain for Ms. Washington-her-She-Ms. Washington. When the system considers the anaphor, (her)76, all preceding phrases, except (her boss)75, are tested to see whether they corefer with it. (her boss)75 is not tested because (her)76 is its nested noun phrase. Finally, the decision tree determines that the noun phrase (Ms. Washington)73 corefers with (her)76. In Table 2, we only show the system considering the three anaphors (her)76, (She)so, and (Ms. Washington)s3, in that order.</Paragraph>
      <Paragraph position="5"> Table 2 Pairs of markables that are tested in forming the coreference chain Ms. Washington-her-She-Ms. Washington. The feature vector format: DIST, SEMCLASS, NUMBER, GENDER, PROPER_NAME, ALIAS, ,_PRONOUN, DEF_NP, DEMANP, STR_MATCH, APPOSITIVE,  Computational Linguistics Volume 27, Number 4 We use the same method to generate coreference chains for both MUC-6 and MUC7, except for the following. For MUC-7, because of slight changes in the coreference task definition, we include a filtering module to remove certain coreference chains. The task definition states that a coreference chain must contain at least one element that is a head noun or a name; that is, a chain containing only prenominal modifiers is removed by the filtering module.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="529" end_page="533" type="metho">
    <SectionTitle>
3. Evaluation
</SectionTitle>
    <Paragraph position="0"> In order to evaluate the performance of our learning approach to coreference resolution on common data sets, we utilized the annotated corpora and scoring programs from MUC-6 and MUC-7, which assembled a set of newswire documents annotated with coreference chains. Although we did not participate in either MUC-6 or MUC-7, we were able to obtain the training and test corpora for both years from the MUC organizers for research purposes. 1 To our knowledge, these are the only publicly available annotated corpora for coreference resolution.</Paragraph>
    <Paragraph position="1"> For MUC-6, 30 dry-run documents annotated with coreference :information were used as the training documents for our coreference engine. There are also 30 annotated training documents from MUC-7. The total size of the 30 training documents is close to 12,400 words for MUC-6 and 19,000 words for MUC-7. There are altogether 20,910 (48,872) training examples used for MUC-6 (MUC-7), of which only 6.5% (4.4%) are positive examples in MUC-6 (MUC-7). 2 After training a separate classifier for each year, we tested the performance of each classifier on its corresponding test corpus. For MUC-6, the C5 pruning confidence is set at 20% and the minimum number of instances per leaf node is set at 5. For MUC-7, the pruning confidence is 60% and the minimum number of instances is 2. The parameters are determined by performing 10-fold cross-validation on the whole training set for each MUC year. The possible pruning confidence values that we tried are 10%, 20%, 40%, 60%, 80%, and 100%, and for minimum instances, we tried 2, 5, 10, 15, and 20.</Paragraph>
    <Paragraph position="2"> Thus, a total of 30 (6 x 5) cross-validation runs were executed.</Paragraph>
    <Paragraph position="3"> One advantage of using a decision tree learning algorithm is that the resulting decision tree classifier can be interpreted by humans. The decision tree generated for MUC-6, shown in Figure 2, seems to encapsulate a reasonable rule of thumb that matches our intuitive linguistic notion of when two noun phrases can corefer. It is also interesting to note that only 8 out of the 12 available features in the training examples are actually used in the final decision tree built.</Paragraph>
    <Paragraph position="4"> MUC-6 has a standard set of 30 test documents, which is used by all systems that participated in the evaluation. Similarly, MUC-7 has a test corpus of 20 documents. We compared our system's MUC-6 and MUC-7 performance with that of the systems that took part in MUC-6 and MUC-7, respectively. When the coreference engine is given new test documents, its output is in the form of SGML files with the coreference chains properly annotated according to the guidelines. 3 We then used the scoring programs  corpora. 2 Our system runs on a Pentium III 550MHz PC. It took less than 5 minutes to generate the training examples from the training documents for MUC-6, and about 7 minutes for MUC-7. The training time for the C5 algorithm to generate a decision tree from all the training examples was less than 3 seconds for both MUC years. 3 The time taken to generate the coreference chains for the 30 MUC-6 test documents of close to 13,400 words was less than 3 minutes, while it took less than 2 minutes for the 20 MUC-7 test documents of about 10,000 words.</Paragraph>
    <Paragraph position="5">  Soon, Ng, and Lira Coreference Resolution</Paragraph>
    <Paragraph position="7"> : APPOSITIVE = -: : :...ALIAS = +: + : ALIAS ....</Paragraph>
    <Paragraph position="9"> The decision tree classifier learned for MUC-6.</Paragraph>
    <Paragraph position="10"> for the respective years to generate the recall and precision scores for our coreference engine.</Paragraph>
    <Paragraph position="11"> Our coreference engine achieves a recall of 58.6% and a precision of 67.3%, yielding a balanced F-measure of 62.6% for MUC-6. For MUC-7, the recall is 56.1%, the precision is 65.5%, and the balanced F-measure is 60.4%. 4 We plotted the scores of our coreference engine (square-shaped) against the official test scores of the other systems (crossshaped) in Figure 3 and Figure 4.</Paragraph>
    <Paragraph position="12"> We also plotted the learning curves of our coreference engine in Figure 5 and Figure 6, showing its accuracy averaged over three random trials when trained on 1, 2, 3, 4, 5, 10, 15, 20, 25, and 30 training documents. The learning curves indicate that our coreference engine achieves its peak performance with about 25 training documents, or about 11,000 to 17,000 words of training documents. This number of training documents would generate tens of thousands of training examples, sufficient for the decision tree learning algorithm to learn a good classifier. At higher numbers of training documents, our system seems to start overfitting the training data. For example, on MUC-7 data, training on the full set of 30 training documents results in a more complex decision tree.</Paragraph>
    <Paragraph position="13"> Our system's scores are in the upper region of the MUC-6 and MUC-7 systems.</Paragraph>
    <Paragraph position="14"> We performed a simple one-tailed, paired sample t-test at significance level p = 0.05 to determine whether the difference between our system's F-measure score and each of the other systems' F-measure score on the test documents is statistically significant. 5 We found that at the 95% significance level (p = 0.05), our system performed better than three MUC-6 systems, and as well as the rest of the MUC-6 systems. Using the 4 Note that MUC-6 did not use balanced F-measure as the official evaluation measure, but MUC-7 did. 5 Though the McNemar test is shown to have low Type I error compared with the paired t-test (Dietterich 1998), we did not carry out this test in the context of coreference. This is because an example instance defines a coreference link between two noun phrases, and since this link is transitive in nature, it is unclear how the number of links misclassified by System A but not by System B and vice versa can be obtained to execute the McNemar test.</Paragraph>
    <Paragraph position="15">  Coreference scores of MUC-7 systems and our system.</Paragraph>
    <Paragraph position="16"> same significance level, our system performed better than four MUC-7 systems, and as well as the rest of the MUC-7 systems. Our result is encouraging since it indicates that a learning approach using relatively shallow features can achieve scores comparable to those of systems built using nonlearning approaches.</Paragraph>
    <Paragraph position="17">  Soon, Ng, and Lira Coreference Resolution  It should be noted that the accuracy of our coreference resolution engine depends to a large extent on the performance of the NLP modules that are executed before the coreference engine. Our current learning-based, HMM named entity recognition module is trained on 318 documents (a disjoint set from both the MUC-6 and MUC-7  Computational Linguistics Volume 27, Number 4 test documents) tagged with named entities, and its score on the MUC-6 named entity task for the 30 formal test documents is only 88.9%, which is not considered very high by MUC-6 standards. For example, our named entity recognizer could not identify the two named entities USAir and Piedmont in the expression USAir and Piedmont but instead treat them as one single named entity. Our part-of-speech tagger achieves 96% accuracy, while the accuracy of noun phrase identification is above 90%.</Paragraph>
  </Section>
  <Section position="6" start_page="533" end_page="536" type="metho">
    <SectionTitle>
4. The Contribution of the Features
</SectionTitle>
    <Paragraph position="0"> One factor that affects the performance of a machine learning approach is the set of features used. It is interesting to find out how useful each of our 12 features is in the MUC-6 and MUC-7 coreference tasks. One way to do this is to train and test using just one feature at a time. Table 3 and Table 4 show the results of the experiment. For both  the same features. The unary features are I_PRONOUN, J_PRONOUN, DEF_NP, and DEM_NP, while the rest are binary in nature. All the unary features score an F-measure of 0. The binary features with 0 F-measure are DIST, PROPERd'qAME, GENDER, SEMCLASS, and NUMBER.</Paragraph>
    <Paragraph position="1"> The ALIAS, APPOSITIVE, and STR_MATCH features give nonzero F-measure. All these features give rather high precision scores (&gt; 80% for ALIAS, &gt; 65% for STR_MATCH, and &gt; 57% for APPOSITIVE). Since these features are highly informative, we were curious to see how much they contribute to our MUC-6 and MUC-7 results of 62.6% and 60.4%, respectively. Systems ALIAS_STR and ALIAS_STR~a~PPOS in Table 3 and Table 4 show the results of the experiment. In terms of absolute Fmeasure, the difference between using these three features and using all features is 2.3% for MUC-6 and 1% for MUC-7; in other words, the other nine features contribute just 2.3% and 1% more for each of the MUC years. These nine features will be the first ones to be considered for pruning away by the C5 algorithm. For example, four features, namely, SEMCLASS, PROPER_NAME, DEF_NP, and DEM_NP, are not used in the MUC-6 tree shown in Figure 2. Figure 7 shows the distribution of the test cases over the five positive leaf nodes of the MUC-6 tree. For example, about 66.3% of all  Computational Linguistics Volume 27, Number 4</Paragraph>
    <Paragraph position="3"> Distribution of test examples from the 30 MUC-6 test documents for positive leaf nodes of the MUC-6 tree.</Paragraph>
    <Paragraph position="4"> the test examples that are classified positive go to the &amp;quot;If STRiVIATCH&amp;quot; branch of the tree.</Paragraph>
    <Paragraph position="5"> Other baseline systems that are used are ONE_CHAIN, ONE_WRD, and HD_WRD (Cardie and Wagstaff 1999). For ONE_CHAIN, all markables formed one chain. In ONE_WRD, markables corefer if there is at least one common word. In HD_WRD, markables corefer if their head words are the same. The purpose of ONE_CHAIN is to determine the maximum recall our system is capable of. The recall level here indirectly measures how effective the noun phrase identification module is. Both ONE_WRD and HD_WRD are less stringent variations of STR_MATCH. The performance of ONE_WRD is the worst. HD_WRD offers better recall compared to STR_MATCH, but poorer precision. However, its F-measure is comparable to that of STRA4ATCH.</Paragraph>
    <Paragraph position="6"> The score of the coreference system at the University of Massachusetts (RESOLVE), which uses C4.5 for coreference resolution, is shown in Table 3. RESOLVE is shown because among the MUC-6 systems, it is the only machine learning-based system that we can directly compare to. The other MUC-6 systems were not based on a learning approach. Also, none of the systems in MUC-7 adopted a learning approach to coreference resolution (Chinchor 1998).</Paragraph>
    <Paragraph position="7"> RESOLVE's score is not high compared to scores attained by the rest of the MUC-6 systems. In particular, the system's recall is relatively low. Our system's score is higher than that of RESOLVE, and the difference is statistically significant. The RESOLVE system is described in three papers: McCarthy and Lehnert (1995), Fisher et al. (1995), and McCarthy (1996). As explained in McCarthy (1996), the reason for this low recall is that RESOLVE takes only the &amp;quot;relevant entities&amp;quot; and &amp;quot;relevant references&amp;quot; as input, where the relevant entities and relevant references are restricted to &amp;quot;person&amp;quot;  Soon, Ng, and Lim Coreference Resolution and &amp;quot;organization.&amp;quot; In addition, because of limitations of the noun phrase detection module, nested phrases are not extracted and therefore do not take part in coreference. Nested phrases can include prenominal modifiers, possessive pronouns, and so forth. Therefore, the number of candidate markables to be used for coreference is small.</Paragraph>
    <Paragraph position="8"> On the other hand, the markables extracted by our system include nested noun phrases, MUC-style named entity types (money, percent, date, etc.), and other types not defined by MUC. These markables will take part in coreference. About 3,600 top-level markables are extracted from the 30 MUC-6 test documents by our system. As detected by our NLP modules, only about 35% of these 3,600 phrases are &amp;quot;person&amp;quot; and &amp;quot;organization&amp;quot; entities and references. Concentrating on just these types has thus affected the overall recall of the RESOLVE system.</Paragraph>
    <Paragraph position="9"> RESOLVE's way of generating training examples also differs from our system's: instances are created for all possible pairings of &amp;quot;relevant entities&amp;quot; and &amp;quot;relevant references,&amp;quot; instead of our system's method of stopping at the first coreferential noun phrase when traversing back from the anaphor under consideration. We implemented RESOLVE's way of generating training examples, and the results (DSO-TRG) are reported in Table 3 and Table 4. For MUC-7, there is no drop in F-measure; for MUC-6, the F-measure dropped slightly.</Paragraph>
    <Paragraph position="10"> RESOLVE makes use of 39 features, considerably more than our system's 12 features. RESOLVE's feature set includes the two highly informative features, ALIAS and STR_MATCH. RESOLVE does not use the APPOSITIVE feature.</Paragraph>
  </Section>
class="xml-element"></Paper>