<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1018">
  <Title>A Mention-Synchronous Coreference Resolution Algorithm Based on the Bell Tree Xiaoqiang Luo and Abe Ittycheriah</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Bell Tree: From Mention to Entity
</SectionTitle>
    <Paragraph position="0"> Let us consider traversing mentions in a document from beginning (left) to end (right). The process of forming entities from mentions can be represented by a tree structure. The root node is the initial state of the process, which consists of a partial entity containing the first mention of a document. The second mention is  numbers in [] denote a partial entity. In-focus entities are marked on the solid arrows, and active mentions are marked by *. Solid arrows signify that a mention is linked with an in-focus partial entity while dashed arrows indicate starting of a new entity.</Paragraph>
    <Paragraph position="1"> added in the next step by either linking to the existing entity, or starting a new entity. A second layer of nodes are created to represent the two possible outcomes. Subsequent mentions are added to the tree in the same manner. The process is mention-synchronous in that each layer of tree nodes are created by adding one mention at a time. Since the number of tree leaves is the number of possible coreference outcomes and it equals the Bell Number (Bell, 1934), the tree is called the Bell tree. The Bell Number a0a2a1a4a3a6a5 is the number of ways of partitioning a3 distinguishable objects (i.e., mentions) into non-empty disjoint subsets (i.e., entities). The Bell Number has a &amp;quot;closed&amp;quot; formula</Paragraph>
    <Paragraph position="3"> a16a24a23 and it increases rapidly as a3 increases: a0a2a1a26a25a28a27a29a5a31a30a33a32a35a34a25a37a36a15a38a39a27 a9a41a40 ! Clearly, an efficient search strategy is necessary, and it will be addressed in Section 4.</Paragraph>
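For readers who wish to verify the growth rate, the Bell numbers can be computed exactly with the Bell-triangle recurrence. The following minimal Python sketch (ours, not the paper's) confirms both B(3) = 5, the number of leaves in Figure 1, and B(20) ~ 5.17e13:

def bell_numbers(n_max):
    """Return [B(0), B(1), ..., B(n_max)] via the Bell triangle."""
    row, bells = [1], [1]
    for _ in range(n_max):
        new_row = [row[-1]]          # each row starts with the previous row's last entry
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

bells = bell_numbers(20)
print(bells[3])    # 5, matching nodes (c1)-(c5) in Figure 1
print(bells[20])   # 51724158235372, i.e. about 5.17e13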
    <Paragraph position="4"> Figure 1 illustrates how the Bell tree is created for a document with three mentions. The initial node consists of the first partial entity [1] (i.e., node (a) in Figure 1). Next, mention2becomes active (marked by &amp;quot;*&amp;quot; in node (a)) and can either link with the partial entity [1] and result in a new node (b1), or start a new entity and create another node (b2). The partial entity which the active mention considers linking with is said to be in-focus. In-focus entities are highlighted on the solid arrows in Figure 1. Similarly, mention 3 will be active in the next stage and can take five possible actions, which create five possible coreference results shown in node (c1) through (c5).</Paragraph>
    <Paragraph position="5"> Under the derivation illustrated in Figure 1, each leaf node in the Bell tree corresponds to a possible coreference outcome, and there is no other way to form entities. The Bell tree clearly represents the search space of the coreference resolution problem. The coreference resolution can therefore be cast equivalently as finding the &amp;quot;best&amp;quot; leaf node. Since the search space is large (even for a document with a moderate number of mentions), it is difficult to estimate a distribution over leaves directly. Instead, we choose to model the process from mentions to entities, or in other words, score paths from the root to leaves in the Bell tree.</Paragraph>
    <Paragraph position="6"> A nice property of the Bell tree representation is that the number of linking or starting steps is the same for all the hypotheses. This makes it easy to rank them using the &amp;quot;local&amp;quot; linking and starting probabilities as the number of factors is the same. The Bell tree representation is also incremental in that mentions are added sequentially. This makes it easy to design a decoder and search algorithm.</Paragraph>
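The mention-synchronous construction is easy to mirror in code. A minimal sketch, assuming partitions are represented as plain nested lists rather than any particular data structure:

from copy import deepcopy

def expand(partitions, mention):
    """One Bell-tree layer: extend each partial partition with `mention`,
    either linking it to an existing partial entity or starting a new one."""
    children = []
    for part in partitions:
        for j in range(len(part)):                     # link with in-focus entity j
            child = deepcopy(part)
            child[j].append(mention)
            children.append(child)
        children.append(deepcopy(part) + [[mention]])  # start a new entity
    return children

layer = [[[1]]]                  # root node (a): partition {[1]}
for m in (2, 3):
    layer = expand(layer, m)
print(len(layer))                # 5 leaves, i.e. nodes (c1)-(c5) = B(3)

Each of the five final partitions corresponds to one leaf of the Bell tree in Figure 1.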
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Coreference Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Linking and Starting Model
</SectionTitle>
      <Paragraph position="0"> We use a binary conditional model to compute the probability that an active mention links with an in-focus partial entity. The conditions include all the partially-formed entities before, the focus entity index, and the active mention.</Paragraph>
      <Paragraph position="1"> Formally, let a42a39a43a45a44a31a46 a38a48a47a50a49a51a47a52a3a54a53 be a3 mentions in a document. Mention index a49 represents the order it appears in the document. Let a55a24a56 be an entity, and  a49a59a58a60a62a61 be the (many-to-one) map from mention index a49 to entity index a61 . For an active mention index</Paragraph>
      <Paragraph position="3"> a1a69a49a70a5a22a71 for some a38a72a47a65a49a73a47 a63a75a74 a38a28a53a76a71 the set of indices of the partially-established entities to the left of a43 a16 (note that a66 a9 a7a78a77 ), and</Paragraph>
      <Paragraph position="5"> the set of the partially-established entities. The link model is then</Paragraph>
      <Paragraph position="7"> the probability linking the active mention a43 a16 with the in-focus entity a55 a81 . The random variable a89 a16 takes value from the set a66 a16 and signifies which entity is in focus; a85 takes binary value and is a38 if a43 a16 links with a55a39a81 .</Paragraph>
      <Paragraph position="8"> As an example, for the branch from (b2) to (c4) in Figure 1, the active mention is &amp;quot;3&amp;quot;, the set of partial entities to the left of &amp;quot;3&amp;quot; is a79 a40 a7 a42a92a91a38a18a93a94a71 a91a25a80a93a26a53 , the active entity is the second partial entity &amp;quot;[2]&amp;quot;. Probability</Paragraph>
      <Paragraph position="10"> tion &amp;quot;3&amp;quot; links with the entity &amp;quot;[2].&amp;quot; The model a83a84a1a86a85a88a87a79 a16 a71 a43 a16 a71a100a89 a16 a7 a67 a5 only computes how likely a43 a16 links with a55 a81 ; It does not say anything about the possibility that a43 a16 starts a new entity. Fortunately, the starting probability can be computed using link probabilities (1), as shown now.</Paragraph>
      <Paragraph position="11"> Since starting a new entity means that a43 a16 does not link with any entities in a79 a16 , the probability of starting a new entity, a83a84a1a86a85a65a7 a27 a87a79 a16 a71 a43</Paragraph>
      <Paragraph position="13"> (3) indicates that the probability of starting an entity can be computed using the linking probabilities</Paragraph>
      <Paragraph position="15"> The linking model (1) and approximated starting model (5) can be used to score paths in the Bell tree.</Paragraph>
      <Paragraph position="16"> For example, the score for the path (a)-(b2)-(c4) in Figure 1 is the product of the start probability from (a) to (b2) and the linking probability from (b2) to (c4).</Paragraph>
      <Paragraph position="17"> Since (5) is an approximation, not true probability, a constant a25 is introduced to balance the linking probability and starting probability and the starting probability becomes:</Paragraph>
      <Paragraph position="19"> If a25a29a28 a38 , it penalizes creating new entities; Therefore, a25 is called start penalty. The start penalty a25 can be used to balance entity miss and false alarm.</Paragraph>
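To make the use of (1), (5) and (6) concrete, here is a minimal Python sketch, assuming a trained link model is available behind a link_prob(entity, mention) callable (the function name and the default alpha value are ours, not the paper's):

def action_scores(entities, mention, link_prob, alpha=0.7):
    """Return ([link scores], start score) for an active mention.
    Computes eq. (1) for each in-focus entity and eq. (6) for starting;
    alpha is the start penalty (alpha < 1 discourages new entities)."""
    links = [link_prob(e, mention) for e in entities]
    start = alpha * (1.0 - max(links)) if links else 1.0   # first mention must start
    return links, start

A path score is then simply the product of the local link or start scores chosen along the path, as in the (a)-(b2)-(c4) example above.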
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Model Training and Features
</SectionTitle>
      <Paragraph position="0"> The model a83a84a1a4a85 a87a79 a16 a71 a43 a16 a71a100a89 a16 a7 a67 a5 depends on all partial entities a79 a16 , which can be very expensive. After making some modeling assumptions, we can approximate it as:</Paragraph>
      <Paragraph position="2"> From (7) to (8), entities other than the one in focus, a55 a81 , are assumed to have no influence on the decision of linking a43 a16 with a55a39a81 . (9) further assumes that the entity-mention score can be obtained by the maximum mention pair score. The model (9) is very similar to the model in (Morton, 2000; Soon et al., 2001; Ng and Cardie, 2002) while (8) has more conditions.</Paragraph>
      <Paragraph position="3"> We use a maximum entropy model (Berger et al., 1996) for both the mention-pair model (9) and the entity-mention model (8):
$$P(L \mid m, m_t) = \frac{\exp\big(\sum_k \lambda_k g_k(m, m_t, L)\big)}{Z(m, m_t)}, \quad (10)$$
$$P(L \mid e_j, m_t) = \frac{\exp\big(\sum_k \lambda_k g_k(e_j, m_t, L)\big)}{Z(e_j, m_t)}, \quad (11)$$
where $Z(\cdot)$ is a normalizing factor ensuring that (10) and (11) are probabilities. An effective training algorithm exists (Berger et al., 1996) once the set of features $\{g_k(\cdot, \cdot, L)\}$ is selected. The basic features used in the models are tabulated in Table 1.</Paragraph>
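The scoring side of (9)-(11) is easy to sketch in code. The following Python fragment is illustrative only (the paper does not publish an implementation); the dictionary-based feature and weight representations are our assumptions:

import math

def maxent_prob(features, weights):
    """Binary maximum-entropy model, eq. (10)/(11):
    P(L | x) = exp(sum_k lambda_k g_k(x, L)) / Z(x).
    `features` maps fired feature names to values g_k; `weights[(k, L)]`
    holds the learned lambda_k for each outcome L in {0, 1}."""
    score = {L: math.exp(sum(v * weights.get((k, L), 0.0)
                             for k, v in features.items()))
             for L in (0, 1)}
    z = score[0] + score[1]                # normalizer Z
    return score[1] / z                    # P(L = 1 | x)

def entity_mention_score(entity, mention, pair_features, weights):
    """Eq. (9): the entity-mention score is the maximum mention-pair
    probability over mentions already in the entity."""
    return max(maxent_prob(pair_features(m, mention), weights) for m in entity)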
      <Paragraph position="6"> Features in the lexical category are applicable to non-pronominal mentions only. Distance features characterize how far apart the two mentions are, either by the number of tokens, by the number of sentences, or by the number of mentions in between. Syntactic features are derived from parse trees output by a maximum entropy parser (Ratnaparkhi, 1997). The "count" feature calculates how many times a mention string is seen. For pronominal mentions, attributes such as gender, number, possessiveness and reflexiveness are also used. Apart from the basic features in Table 1, composite features can be generated by taking conjunctions of basic features. For example, a distance feature together with the reflexiveness of a pronominal mention can help capture the fact that the antecedent of a reflexive pronoun is often closer than that of a non-reflexive pronoun.</Paragraph>
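To make the feature templates concrete, here is a hypothetical extractor for a small subset of the Table 1 features plus one conjunction feature; the mention format ('text' and 'token_idx' keys) and the quantization buckets are our assumptions, not the paper's:

def quantize(d, bounds=(1, 2, 3, 5, 10, 20, 50)):
    """Bucket a raw distance; the bucket boundaries are illustrative."""
    for i, b in enumerate(bounds):
        if d <= b:
            return i
    return len(bounds)

def basic_features(m1, m2):
    """A few Table 1 features for a mention pair, as indicator features."""
    s1, s2 = m1["text"].lower(), m2["text"].lower()
    td = quantize(m2["token_idx"] - m1["token_idx"])
    return {
        "exact_strm": int(s1 == s2),
        "left_subsm": int(s1 != s2 and (s1.startswith(s2) or s2.startswith(s1))),
        "right_subsm": int(s1 != s2 and (s1.endswith(s2) or s2.endswith(s1))),
        "token_dist=%d" % td: 1,
        # composite feature: conjunction of two basic features
        "exact_strm=%d&token_dist=%d" % (int(s1 == s2), td): 1,
    }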
      <Paragraph position="7"> The same set of basic features in Table 1 is used in the entity-mention model, but the feature definitions are slightly different. The lexical features, including the acronym features, and the apposition feature are computed by testing any mention in the entity $e_j$ against the active mention $m_t$. The editing distance for $(e_j, m_t)$ is defined as the minimum distance over the non-pronoun mentions and the active mention. Distance features are computed by taking the minimum between mentions in the entity and the active mention.</Paragraph>
      <Paragraph position="8"> In the ACE data, mentions are annotated at three levels: NAME, NOMINAL and PRONOUN. For each ACE entity, a canonical mention is defined as the longest NAME mention, if available; otherwise, the most recent NOMINAL mention; and if there is neither a NAME nor a NOMINAL mention, the most recent pronominal mention. In the entity-mention model, the "ncd", "spell" and "count" features are computed over the canonical mention of the in-focus entity and the active mention. Conjunction features are used in the entity-mention model too.</Paragraph>
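The canonical-mention rule is straightforward to state in code; a minimal sketch, assuming mentions are stored as (level, text) pairs in document order:

def canonical_mention(entity):
    """Longest NAME mention if any; else the most recent NOMINAL;
    else the most recent (pronominal) mention."""
    names = [m for m in entity if m[0] == "NAME"]
    if names:
        return max(names, key=lambda m: len(m[1]))
    nominals = [m for m in entity if m[0] == "NOMINAL"]
    if nominals:
        return nominals[-1]
    return entity[-1]

print(canonical_mention([("NAME", "Bill Clinton"), ("NOMINAL", "the president"),
                         ("NAME", "Clinton"), ("PRONOUN", "he")]))
# ('NAME', 'Bill Clinton')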
      <Paragraph position="9"> Table 1: Basic features used in the maximum entropy model.
  Lexical
    exact_strm   1 if two mentions have the same spelling; 0 otherwise
    left_subsm   1 if one mention is a left substring of the other; 0 otherwise
    right_subsm  1 if one mention is a right substring of the other; 0 otherwise
    acronym      1 if one mention is an acronym of the other; 0 otherwise
    edit_dist    quantized editing distance between two mention strings
    spell        pair of actual mention strings
    ncd          number of different capitalized words in two mentions
  Distance
    token_dist   how many tokens two mentions are apart (quantized)
    sent_dist    how many sentences two mentions are apart (quantized)
    gap_dist     how many mentions in between the two mentions in question (quantized)
  Syntax
    POS_pair     POS pair of the two mention heads
    apposition   1 if two mentions are appositive; 0 otherwise
  Count
    count        pair of (quantized) numbers, each counting how many times a mention string is seen
  Pronoun
    gender       pair of attributes from {female, male, neutral, unknown}
    number       pair of attributes from {singular, plural, unknown}
    possessive   1 if a pronoun is possessive; 0 otherwise
    reflexive    1 if a pronoun is reflexive; 0 otherwise</Paragraph>
      <Paragraph position="10"> The mention-pair model is appealing for its simplicity: features are easy to compute over a pair of mentions; its drawback is that information outside the mention pair is ignored. Suppose a document has three mentions "Mr. Clinton", "Clinton" and "she", appearing in that order. When considering the mention pair "Clinton" and "she", the model may tend to link them because of their proximity; but this mistake can easily be avoided if "Mr. Clinton" and "Clinton" have already been put into the same entity and the model knows that "Mr. Clinton" refers to a male while "she" is female. Since gender and number information is propagated at the entity level, the entity-mention model is able to check gender consistency when considering the active mention "she".</Paragraph>
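A toy version of the entity-level gender check described above; the tiny pronoun and honorific lexicons are stand-ins for whatever resources a real system would use:

PRONOUN_GENDER = {"he": "male", "him": "male", "his": "male",
                  "she": "female", "her": "female", "hers": "female"}
HONORIFIC_GENDER = {"mr.": "male", "mrs.": "female", "ms.": "female"}

def mention_gender(text):
    t = text.lower()
    return PRONOUN_GENDER.get(t) or HONORIFIC_GENDER.get(t.split()[0], "unknown")

def entity_gender(entity):
    """Propagate gender to the entity level from any mention that reveals it."""
    genders = {mention_gender(m) for m in entity} - {"unknown"}
    return genders.pop() if len(genders) == 1 else "unknown"

entity = ["Mr. Clinton", "Clinton"]
for cand in ("she", "he"):
    g_e, g_m = entity_gender(entity), mention_gender(cand)
    ok = "unknown" in (g_e, g_m) or g_e == g_m
    print(cand, "compatible:", ok)   # she -> False, he -> True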
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Discussion
</SectionTitle>
      <Paragraph position="0"> There is an in-focus entity in the condition of the linking model (1), while the starting model (2) conditions on all the entities to the left. The disparity is intentional, as the starting action is influenced by all established entities on the left.</Paragraph>
      <Paragraph position="1"> (4) is not the only way $P(A_t = j \mid E_t, m_t)$ can be approximated. For example, one could use a uniform distribution over $J_t$. We experimented with several approximation schemes, including the uniform distribution, and (4) worked the best and is adopted here. One may consider training $P(A_t = j \mid E_t, m_t)$ directly and using it to score paths in the Bell tree. The problem is that 1) the size of $J_t$, from which $A_t$ takes its value, is variable; and 2) the start action depends on all the entities in $E_t$, which makes it difficult to train $P(A_t = j \mid E_t, m_t)$ directly.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Search Issues
</SectionTitle>
    <Paragraph position="0"> As shown in Section 2, the search space of the coreference problem can be represented by the Bell tree.</Paragraph>
    <Paragraph position="1"> Thus, the search problem reduces to creating the Bell tree while keeping track of path scores and picking the top-N best paths. This is exactly what is described in Algorithm 1.</Paragraph>
    <Paragraph position="2"> In Algorithm 1, $H$ contains all the hypotheses, or paths from the root to the current layer of nodes. The variable $S(E)$ stores the cumulative score for a coreference result $E$. At line 1, $H$ is initialized with a single entity consisting of mention $m_1$, which corresponds to the root node of the Bell tree in Figure 1. Lines 2 to 15 loop over the remaining mentions ($m_2$ to $m_n$), and for each mention $m_t$ the algorithm extends each result $E$ in $H$ (or a path in the Bell tree) by either linking $m_t$ with an existing entity $e_j$ (lines 5 to 10) or starting a new entity $[m_t]$ (lines 11 to 14). The loop from line 2 to 12 corresponds to creating a new layer of nodes for the active mention $m_t$ in the Bell tree. $\hat{j}$ in line 4 and $r$ in lines 6 and 11 have to do with pruning, which is discussed shortly; a schematic decoder incorporating these pruning steps is sketched at the end of this section. The last line returns the top $N$ results, where $E^{(i)}$ denotes the $i$-th result ranked by the cumulative score $S(\cdot)$. The complexity of the search in Algorithm 1 is the total number of nodes in the Bell tree, which is $\sum_{t=1}^{n} B(t)$, where $B(t)$ is the Bell Number. Since the Bell Number increases rapidly as a function of the number of mentions, pruning is necessary. We prune the search space in the following places:
- Local pruning: any children with a score below a fixed factor $r$ of the maximum score are pruned.</Paragraph>
    <Paragraph position="3"> This is done at lines 6 and 11 in Algorithm 1. The operation in line 4 is:
$$\hat{j} = \arg\max_{j \in J_t} P(L = 1 \mid E_t, m_t, A_t = j).$$</Paragraph>
    <Paragraph position="5"> - Global pruning: similar to local pruning, except that it is based on the cumulative score $S(E)$.</Paragraph>
    <Paragraph position="6"> Pruning based on the global scores is carried out at line 15 of Algorithm 1.</Paragraph>
    <Paragraph position="7"> - Limit hypotheses: we set a limit on the maximum number of live paths. This is useful when a document contains many mentions, in which case an excessive number of paths may survive local and global pruning.</Paragraph>
    <Paragraph position="8"> - Entity-type check: whenever available, we check the compatibility of entity types between the in-focus entity and the active mention; a hypothesis with incompatible entity types is discarded. In the ACE annotation, every mention has an entity type, so we can eliminate hypotheses that place two mentions of different types in the same entity.</Paragraph>
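Pulling the pieces together, the following is a schematic beam-search decoder in the spirit of Algorithm 1 (our sketch, not the authors' code). link_prob(entity, mention) stands for the trained link model, alpha is the start penalty, r the pruning factor, and beam the limit on live paths; the entity-type check is omitted for brevity.

import heapq

def decode(mentions, link_prob, alpha=0.7, r=0.001, beam=20, top_n=1):
    """Beam search over the Bell tree, tracking cumulative path scores S(E)."""
    hyps = [([[mentions[0]]], 1.0)]                # root: [m1], score 1
    for m in mentions[1:]:
        children = []
        for part, s in hyps:
            links = [link_prob(e, m) for e in part]
            start = alpha * (1.0 - max(links))     # eq. (6)
            best = max(max(links), start)
            for j, p in enumerate(links):          # link actions
                if p >= r * best:                  # local pruning
                    children.append((part[:j] + [part[j] + [m]] + part[j + 1:], s * p))
            if start >= r * best:                  # start action
                children.append((part + [[m]], s * start))
        best_s = max(s for _, s in children)
        children = [h for h in children if h[1] >= r * best_s]     # global pruning
        hyps = heapq.nlargest(beam, children, key=lambda h: h[1])  # limit live paths
    return heapq.nlargest(top_n, hyps, key=lambda h: h[1])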
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Performance Metrics
</SectionTitle>
      <Paragraph position="0"> The official performance metric for the ACE task is ACE-value. ACE-value is computed by first calculating the weighted cost of entity insertions, deletions and substitutions; the cost is then normalized against the cost of a nominal coreference system which outputs no entities; the ACE-value is obtained by subtracting the normalized cost from 1. Weights are designed to emphasize NAME entities, while PRONOUN entities (i.e., entities consisting of only pronominal mentions) carry very low weights. A perfect coreference system gets a 100% ACE-value, while a system that outputs no entities gets a 0% ACE-value. Thus, the ACE-value can be interpreted as the percentage of value a system has relative to the perfect system.</Paragraph>
      <Paragraph position="1"> Since the ACE-value is an entity-level metric and is weighted heavily toward NAME entities, we also measure our system's performance by an entity-constrained mention F-measure (henceforth "ECM-F"). The metric first aligns the system entities with the reference entities so that the number of common mentions is maximized. Each system entity is constrained to align with at most one reference entity, and vice versa. For example, suppose that a reference document contains three entities, $\{[m_1], [m_2, m_3], [m_4]\}$, while a system outputs four entities, $\{[m_1, m_2], [m_3], [m_5], [m_6]\}$. Then the best alignment (from reference to system) is $[m_1] \leftrightarrow [m_1, m_2]$ and $[m_2, m_3] \leftrightarrow [m_3]$, while the other entities are not aligned. The number of common mentions under the best alignment is 2 (i.e., $m_1$ and $m_3$), which leads to a mention recall of 2/4 and a precision of 2/5. The ECM-F measures the percentage of mentions that are in the "right" entities.</Paragraph>
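The alignment example above can be verified with a brute-force constrained matching; real scorers would use an assignment algorithm instead of enumerating permutations:

from itertools import permutations

def ecm_f(reference, system):
    """Best one-to-one entity alignment maximizing common mentions,
    then mention-level recall/precision/F over that alignment."""
    ref = [set(e) for e in reference]
    sys = [set(e) for e in system]
    n = max(len(ref), len(sys))
    ref += [set()] * (n - len(ref))                 # pad with empty entities
    sys += [set()] * (n - len(sys))
    common = max(sum(len(ref[i] & sys[p[i]]) for i in range(n))
                 for p in permutations(range(n)))
    rec = common / sum(map(len, ref))
    prec = common / sum(map(len, sys))
    return 2 * prec * rec / (prec + rec)

ref = [{"m1"}, {"m2", "m3"}, {"m4"}]
sys = [{"m1", "m2"}, {"m3"}, {"m5"}, {"m6"}]
print(ecm_f(ref, sys))    # recall 2/4, precision 2/5 -> ECM-F = 4/9, about 0.444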
      <Paragraph position="2"> For tests on the MUC data, we report both F-measure using the official MUC score (Vilain et al., 1995) and ECM-F. The MUC score counts the common links between the reference and the system output.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Results on the ACE data
</SectionTitle>
      <Paragraph position="0"> The system is first developed and tested using the ACE data. The ACE coreference system is trained with 419 documents (about 180K words) of ACE 2002 training data. A separate set of 80 documents (50K words) is used as the development-test (Devtest) set. In 2002, NIST released two test sets, in February (Feb02) and September (Sep02). Statistics of the three test sets are summarized in Table 2. We will report coreference results on the true mentions of the three test sets.</Paragraph>
      [Table 2: statistics of the three test sets (TestSet, #-docs, #-words, #-mentions, #-entities).]
      <Paragraph position="1"> For the mention-pair model, training events are generated for all compatible mention-pairs, which results in about 808K events, about 150K of which are positive examples. The full mention-pair model uses about 171K features, most of which are conjunction features. For the entity-mention model, events are generated by walking through the Bell tree. Only events on the true path (i.e., positive examples) and branches emitting from a node on the true path to a node off the true path (i.e., negative examples) are generated. For example, in Figure 1, suppose that the path (a)-(b2)-(c4) is the truth; then the positive training examples are the starting event from (a) to (b2) and the linking event from (b2) to (c4), while the negative examples are the linking events from (a) to (b1) and from (b2) to (c3), and the starting event from (b2) to (c5). This scheme generates about 322K events, out of which about 18K are positive training examples. The full entity-mention model has about 8.4K features, owing to the smaller number of conjunction features and training examples. Coreference results on the true mentions of the Devtest, Feb02, and Sep02 test sets are tabulated in Table 3. These numbers are obtained with a fixed search beam of 20 and pruning threshold $r = 0.001$ (widening the search beam or using a smaller pruning threshold did not change the results significantly).</Paragraph>
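The event-generation scheme can be sketched as a walk along the true path of the Bell tree. This is our reconstruction of the description above; the gold_of mapping from a mention to its gold entity id is an assumed input format:

def training_events(mentions, gold_of):
    """Yield (partial_entities, mention, action, label) events:
    positives lie on the true path, negatives branch off it."""
    first = mentions[0]
    entities = {gold_of[first]: [first]}            # root node holds [m1]
    for m in mentions[1:]:
        gid = gold_of[m]
        snapshot = [list(e) for e in entities.values()]
        for eid in entities:                        # one link event per entity
            yield snapshot, m, ("link", eid), int(eid == gid)
        # the start branch is positive iff m truly begins a new entity
        yield snapshot, m, ("start",), int(gid not in entities)
        entities.setdefault(gid, []).append(m)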
      <Paragraph position="2"> The mention-pair model in most cases performs better than the entity-mention model by both the ACE-value and the ECM-F measure, although none of the differences is statistically significant (pair-wise t-test) at p-value 0.05. Note, however, that the mention-pair model uses 20 times more features than the entity-mention model. We also observed that, because the score between the in-focus entity and the active mention is computed by (9) in the mention-pair model, the mention-pair model sometimes mistakenly places a male pronoun and a female pronoun in the same entity, while the same mistake is avoided in the entity-mention model. Using the canonical mentions when computing some features (e.g., "spell") in the entity-mention model is probably not optimal, and it is an area that needs further research.</Paragraph>
      <Paragraph position="3"> When the same mention-pair model is used to score the ACE 2003 evaluation data, an ACE-value of 73.4% is obtained on the system mentions. After being retrained with Chinese and Arabic data (much less training data than for English), the system obtained 58.8% and 54.5% ACE-value on the system mentions of the ACE 2003 evaluation data for Chinese and Arabic, respectively. The results for all three languages are among the top-tier submission systems. Details of the mention detection and coreference system can be found in (Florian et al., 2004).</Paragraph>
      <Paragraph position="4"> Since the mention-pair model is better, subsequent analyses are done with the mention-pair model only.</Paragraph>
      <Paragraph position="5"> To see how each category of features affects the performance, we start with the aforementioned mention-pair model, incrementally remove each feature category, retrain the system, and test it on the Devtest set. The result is summarized in Table 4. The last column lists the number of features. The second row is the full mention-pair model; the third through seventh rows correspond to models obtained by removing the syntactic features (i.e., POS tags and apposition features), the count features, the distance features, the mention type and level information, and the pair of mention-spelling features, respectively. If a basic feature is removed, conjunction features using that basic feature are also removed. It is striking that the smallest system, consisting of only 38 features (string and substring match, acronym, edit distance and number of different capitalized words), gets as much as an 89.0% ACE-value. Table 4 shows clearly that the lexical features and the distance features are the most important. Sometimes the ACE-value increases after removing a set of features, but the ECM-F measure nicely tracks the trend that the more features there are, the better the performance. This is because the ACE-value is weighted heavily toward NAME entities (cf. Section 5.1).</Paragraph>
      [Table 4 caption (fragment): ... are the standard deviations. * indicates that the result is significantly different (pair-wise t-test) from the line above at p-value 0.05.]
      <Paragraph position="6"> As discussed in Section 3.1, the start penalty $\alpha$ can be used to balance entity misses and false alarms. To see this effect, we decode the Devtest set with varying start penalties; the result is depicted in Figure 2. The ACE-value and ECM-F track each other fairly well.</Paragraph>
      <Paragraph position="7"> Both achieve their optimum when $\log \alpha \approx -0.8$.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Experiments on the MUC data
</SectionTitle>
      <Paragraph position="0"> To see how the proposed algorithm works on the MUC data, we test it on the MUC6 data. To minimize changes to the coreference system, we first map the MUC data into the ACE style. The original MUC coreference data does not have entity types (i.e., "ORGANIZATION", "LOCATION", etc.), which are required in the ACE style. Some of the entity types can be recovered from the corresponding named-entity annotations. The recovered named-entity label is propagated to all mentions belonging to the same entity. There are 504 out of 2072 mentions of the MUC6 formal test set and 695 out of 2141 mentions of the MUC6 dry-run test set that cannot be assigned labels by this procedure; a generic type "UNKNOWN" is assigned to these mentions. Mentions that can be found in the named-entity annotation are assumed to have the ACE mention level "NAME"; all other mentions except English pronouns are assigned the level "NOMINAL." After the MUC data is mapped into the ACE style, the same set of feature templates is used to train a coreference system. Two coreference systems are trained on the MUC6 data: one trained with 30 dry-run test documents (henceforth "MUC6-small"); the other trained with 191 "dryrun-train" documents that have both coreference and named-entity annotations in the latest LDC release (henceforth "MUC6-big"). To use the official MUC scorer, we convert the output of the ACE-style coreference system back into the MUC format. Since MUC requires neither entity labels nor levels, the conversion from ACE to MUC is "lossless." Table 5 tabulates the test results on the true mentions of the MUC6 formal test set. The numbers in the table represent the optimal operating point determined by ECM-F. The MUC scorer cannot be used to determine this operating point since it inherently favors systems that output fewer entities (e.g., putting all mentions of the MUC6 formal test set into one entity yields 100% recall and 78.8% precision of links, which gives an 88.2% F-measure). The MUC6-small system compares favorably with the similar experiment in Harabagiu et al. (2001), in which an 81.8% F-measure is reported. When measured by the ECM-F measure, the MUC6-small system has the same level of performance as the ACE system, while the MUC6-big system performs better than the ACE system. The results show that the algorithm works well on the MUC6 data, even though some information is lost in the conversion from the MUC format to the ACE format.</Paragraph>
      [Table caption (fragment): ... features. None of the ECM-F differences between MP and EM is statistically significant at p-value 0.05.]
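The MUC-to-ACE mapping can be sketched as follows (illustrative only; ne_type(m), which returns a mention's recovered named-entity label or None, and the small pronoun list are our stand-ins):

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "its", "their"}

def to_ace_style(entities, ne_type):
    """Propagate a recovered named-entity type to all mentions of an entity
    (falling back to "UNKNOWN") and assign NAME/NOMINAL/PRONOUN levels."""
    converted = []
    for ent in entities:
        etype = next((ne_type(m) for m in ent if ne_type(m)), "UNKNOWN")
        converted.append([
            {"text": m,
             "type": etype,
             "level": ("NAME" if ne_type(m)
                       else "PRONOUN" if m.lower() in PRONOUNS
                       else "NOMINAL")}
            for m in ent
        ])
    return converted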
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> There exists a large body of literature on the topic of coreference resolution. We compare this study only with relevant work that uses machine learning or statistical methods.</Paragraph>
    <Paragraph position="1"> Soon et al. (2001) use a decision tree model for coreference resolution on the MUC6 and MUC7 data.</Paragraph>
    <Paragraph position="2"> Leaves of the decision tree are labeled with "link" or "not-link" in training. At test time, the system checks a mention against all its preceding mentions, and the first one labeled with "link" is picked as the antecedent. Their work was later enhanced by Ng and Cardie (2002) in several aspects: first, the decision tree returns scores instead of a hard decision of "link" or "not-link", so that Ng and Cardie (2002) are able to pick the "best" candidate on the left, as opposed to the first in (Soon et al., 2001); second, Ng and Cardie (2002) expand the feature set of (Soon et al., 2001). The model in (Yang et al., 2003) expands the conditioning scope by including a competing candidate. Neither (Soon et al., 2001) nor (Ng and Cardie, 2002) searches for the globally optimal entity assignment, in that both make locally independent decisions during search. In contrast, our decoder always searches for the best result ranked by the cumulative score (subject to pruning), and subsequent decisions depend on earlier ones.</Paragraph>
    <Paragraph position="3"> Recently, McCallum and Wellner (2003) proposed using graphical models for computing probabilities of entities. The model is appealing in that it can potentially overcome the limitation of the mention-pair model, in which dependency among mentions other than the two in question is ignored. However, the models in (McCallum and Wellner, 2003) compute the probability of an entity configuration conditioned on mentions directly, and it is not clear how they can be factored to perform an incremental search, as it is impractical to enumerate all possible entities even for documents with a moderate number of mentions. The Bell tree representation proposed in this paper, however, provides a naturally incremental framework for coreference resolution.</Paragraph>
    <Paragraph position="4"> Features involving the dummy mention are essentially computed with the single (normal) mention, and therefore the starting model is weak. In our model, the starting model is obtained by &amp;quot;complementing&amp;quot; the linking scores. The advantage is that we do not need to train a starting model. To compensate the model inaccuracy, we introduce a &amp;quot;starting penalty&amp;quot; to balance the linking and starting scores.</Paragraph>
    <Paragraph position="5"> To our knowledge, the paper is the first time the Bell tree is used to represent the search space of the coreference resolution problem.</Paragraph>
  </Section>
class="xml-element"></Paper>