<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1083">
  <Title>Multi-Lingual Coreference Resolution With Syntactic Features</Title>
  <Section position="3" start_page="660" end_page="660" type="metho">
    <SectionTitle>
2 Statistical Coreference Resolution Model
</SectionTitle>
    <Paragraph position="0"> Our coreference system uses a binary entity-mention model PL( je, m) (henceforth link model ) to score the action of linking a mention m to an entity e. In our implementation, the link model is computed as</Paragraph>
    <Paragraph position="2"> where mprime is one mention in entity e, and the basic model building block ^PL(L = 1je, mprime, m) is an exponential or maximum entropy model (Berger et al., 1996):</Paragraph>
    <Paragraph position="4"> where Z(e, mprime, m) is a normalizing factor to ensure that ^PL( je, mprime, m) is a probability, fgi(e, mprime, m, L)g are features and flig are feature weights.</Paragraph>
    <Paragraph position="5"> Another start model is used to score the action of creating a new entity with the current mention m. Since starting a new entity depends on all the partial entities created in the history feigti=1, we use the following approximation:</Paragraph>
    <Paragraph position="7"> In the maximum-entropy model (2), feature (typically binary) functions fgi(e, mprime, m, )g provide us with a exible framework to encode useful information into the the system: it can be as simple as gi(e, mprime, m, L = 1) = 1 if mprime and m have the same surface string, or gj(e, mprime, m, L = 0) = 1 if e and m differ in number, or as complex as gl(e, mprime, m, L = 1) = 1 if mprime c-commands m and mprime is a NAME mention and m is a pronoun mention. These feature functions bear similarity to rules used in other coreference systems (Lappin and Leass, 1994; Mitkov, 1998; Stuckardt, 2001), except that the feature weights flig are automatically trained over a corpus with coreference information. Learning feature weights automatically eliminates the need of manually assigning the weights or precedence of rules, and opens the door for us to explore rich features extracted from parse trees, which is discussed in the next section.</Paragraph>
  </Section>
  <Section position="4" start_page="660" end_page="662" type="metho">
    <SectionTitle>
3 Syntactic Features
</SectionTitle>
    <Paragraph position="0"> In this section, we present a set of features extracted from syntactic parse trees. We discuss how we approximately compute linguistic concepts such as governing category (Haegeman, 1994), apposition and dependency relationships from noisy syntactic parse trees. While parsing and parse trees depend on the target language, the automatic nature of feature extraction from parse trees makes the process language-independent.</Paragraph>
    <Section position="1" start_page="660" end_page="661" type="sub_section">
      <SectionTitle>
3.1 Features Inspired by Binding Theory
</SectionTitle>
      <Paragraph position="0"> The binding theory (Haegeman, 1994) concerning pronouns can be summarized with the following principles:  1. A re exive or reciprocal pronoun (e.g., herself or each other ) must be bound in its governing category (GC).</Paragraph>
      <Paragraph position="1"> 2. A normal pronoun must be free in its governing cat null egory.</Paragraph>
      <Paragraph position="2"> The rst principle states that the antecedent of a re exive or reciprocal pronoun is within its GC, while the second principle says that the antecedent of a normal pronoun is outside its GC. While the two principles are simple, they all rely on the concept of governing category, which is de ned as the minimal domain containing the pronoun in question, its governor, and an accessible subject. The concept GC can best be explained with a few examples in Figure 1, where the label of a head constituent is marked within a box, and GC, accessible subject, and governor constituents are marked in parentheses with GC , Sub and gov. Noun-phrases (NP) are numbered for the convenience of referencing. For example, in sub- gure (1) of Figure 1, the governor of himself is likes, the subject is John, hence the GC is the entire sentence spanned by the root S. Since himself is re exive, its antecedent must be John by Principle 1. The parse tree in sub- gure (2) is the same as that in sub- gure (1), but since him is a normal pronoun, its antecedent, according to Principle 2, has to be outside the GC, that is, him cannot be coreferenced with John. . Sentence in sub- gure (3) is slightly more complicated: the governor of herself is description, and the accessible subject is Miss Smith. Thus, the governing category is NP6. The rst principle implies that the antecedent of herself must be Miss Smith.</Paragraph>
      <Paragraph position="3"> It is clear from these examples that GC is very useful in nding the antecedent of a pronoun. But the last example shows that determining GC is not a trivial matter. Not only is the correct parse tree required, but extra information is also needed to identify the head governor  and the minimal constituent dominating the pronoun, its governor and an accessible subject. Determining the accessible subject itself entails checking other constraints such as number and gender agreement. The complexity of computing governing category, compounded with the noisy nature of machine-generated parse tree, prompts us to compute a set of features that characterize the structural relationship between a candidate mention and a pronoun, as opposed to explicitly identify GC in a parse tree. These features are designed to implicitly model the binding constraints.</Paragraph>
      <Paragraph position="4"> Given a candidate antecedent or mention m1 and a pronoun mention m2 within a parsed sentence, we rst test if they have c-command relation, and then a set of counting features are computed. The features are detailed as follows:</Paragraph>
      <Paragraph position="6"> commands another constituent Y in a parse tree if the rst branching node dominating X also dominates Y . The binary feature ccmd(m1, m2) is true if the minimum NP dominating m1 c-commands the minimum NP dominating m2. In sub- gure (1) of Figure 1, NP1 c-commands NP2 since the rst branching node dominating NP1 is S and it dominates NP2.</Paragraph>
      <Paragraph position="7"> If ccmd(m1, m2) is true, we then de ne the c-command path T(m1, m2) as the path from the minimum NP dominating m2 to the rst branching node that dominates the minimum NP dominating m1. In sub- gure (1) of Figure 1, the c-command path T( John , himself ) would be NP2-VP-S.</Paragraph>
      <Paragraph position="8"> (2) NP count(m1, m2): If ccmd(m1, m2) is true, then NP count(m1, m2) counts how many NPs are seen on the c-command path T(m1, m2), excluding two endpoints. In sub- gure (1) of Figure 1, NP count( John , himself ) = 0 since there is no NP on T( John , himself ).</Paragraph>
      <Paragraph position="9"> (3) V P count(m1, m2): similar to NP count(m1, m2), except that this feature counts how many verb phrases (VP) are seen on the c-command path. In sub- gure (1) of Figure 1, V P count( John , himself ) is true since there is one VP on T( John , himself ).</Paragraph>
      <Paragraph position="10"> (4) S count(m1, m2): This feature counts how many clauses are seen on the c-command path when ccmd(m1, m2) is true. In sub- gure (1) of Figure 1, S count( John , himself ) = 0 since there is no clause label on T( John , himself ).</Paragraph>
      <Paragraph position="11"> These features are designed to capture information in the concept of governing category when used in conjunction with attributes (e.g., gender, number, re exiveness) of individual pronouns. Counting the intermediate NPs, VPs and sub-clauses implicitly characterizes the governor of a pronoun in question; the presence or absence of a subclause indicates whethere or not a coreferential relation is across clause boundary.</Paragraph>
    </Section>
    <Section position="2" start_page="661" end_page="661" type="sub_section">
      <SectionTitle>
3.2 Dependency Features
</SectionTitle>
      <Paragraph position="0"> In addition to features inspired by the binding theory, a set of dependency features are also computed with the help of syntactic parse trees. This is motivated by examples such as John is the president of ABC Corporation, where John and the president refer to the same per-son and should be in the same entity. In scenarios like this, lexical features do not help, while the knowledge that John left-modi es the verb is and the the president right-modi es the same verb would be useful.</Paragraph>
      <Paragraph position="1"> Given two mentions m1 and m2 in a sentence, we compute the following dependency features: (1)same head(m1, m2): The feature compares the bi-lexical dependencies hm1, h(m1)i, and hm2, h(m2)i, where h(x) is the head word which x modi es. The feature is active only if h(m1) = h(m2), in which case it returns h(m1).</Paragraph>
      <Paragraph position="2"> (2)same POS(m1, m2): To get good coverage of dependencies, we compute a feature same POS(m1, m2), which examines the same dependency as in (1) and returns the common head part-of-speech (POS) tag if</Paragraph>
      <Paragraph position="4"> The head child nodes are marked with boxes in Figure 1. For the parse tree in sub- gure (1), same head( John , him ) would return likes as John left-modi es likes while him right-modi es likes, and same POS( John , him ) would return V as the POS tag of likes is V.</Paragraph>
      <Paragraph position="5"> (3) mod(m1, m2): the binary feature is true if m1 modi es m2. For parse tree (2) of Figure 1, mod( John , him ) returns false as John does not modify him directly. A reverse order feature mod(m2, m1) is computed too.</Paragraph>
      <Paragraph position="6"> (4) same head2(m1, m2): this set of features examine second-level dependency. It compares the head word of h(m1), or h(h(m1)), with h(m2) and returns the common head if h(h(m1)) = h(m2). A reverse order feature same head2(m2, m1) is also computed.</Paragraph>
      <Paragraph position="7"> (5) same POS2(m1, m2): similar to (4), except that it computes the second-level POS. A reverse order feature same POS2(m2, m1) is computed too.</Paragraph>
      <Paragraph position="8"> (6) same head22(m1, m2): it returns the common second-level head if h(h(m1)) = h(h(m2)).</Paragraph>
    </Section>
    <Section position="3" start_page="661" end_page="662" type="sub_section">
      <SectionTitle>
3.3 Apposition and Same-Parent Features
</SectionTitle>
      <Paragraph position="0"> Apposition is a phenomenon where two adjacent NPs refer to the same entity, as Jimmy Carter and the former president in the following example: (II) Jimmy Carter, the former president of US, is visiting Europe.</Paragraph>
      <Paragraph position="1"> Note that not all NPs separated by a comma are necessarily appositive. For example, in John called Al, Bob, and Charlie last night, Al and Bob share a same NP  parent and are separated by comma, but they are not appositive. null To compute the apposition feature appos(m1, m2) for mention-pair (m1, m2), we rst determine the minimum dominating NP of m1 and m2. The minimum dominating NP of a mention is the lowest NP, with an optional modifying phrase or clause, that spans the mention. If the two minimum dominating NPs have the same parent NP, and they are the only two NP children of the parent, the value of appos(m1, m2) is true. This would exclude Al and Bob in John called Al, Bob, and Charlie last night from being computed as apposition.</Paragraph>
      <Paragraph position="2"> We also implement a feature same parent(m1, m2) which tests if two mentions m1 and m2 are dominated by a common NP. The feature helps to prevent the system from linking his with colleague in the sentence John called his colleague.</Paragraph>
      <Paragraph position="3"> All the features described in Section 3.1-3.3 are computed from syntactic trees generated by a parser. While the parser is language dependent, feature computation boils down to encoding the structural relationship of two mentions, which is language independent. To test the effectiveness of the syntactic features, we integrate them into 3 coreference systems processing Arabic, Chinese and English.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="662" end_page="665" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="662" end_page="662" type="sub_section">
      <SectionTitle>
4.1 Data and System Description
</SectionTitle>
      <Paragraph position="0"> All experiments are done on true mentions of the ACE (NIST, 2004) 2004 data. We reserve part of LDCreleased 2004 data as the development-test set (henceforth devtest ) as follows: documents are sorted by their date and time within each data source (e.g., broadcast news (bnews) and news wire (nwire) are two different sources) and the last 25% documents of each data source are reserved as the devtest set. Splitting data on chronological order simulates the process of a system's development and deployment in the real world. The devtest set statistics of three languages (Arabic, Chinese and English) is summarized in Table 1, where the number of documents, mentions and entities is shown on row 2 through 4, respectively. The rest of 2004 ACE data together with earlier ACE data is used as training.</Paragraph>
      <Paragraph position="1">  The of cial 2004 evaluation test set is used as the blind test set on which we run our system once after the system development is nished. We will report summary results on this test set.</Paragraph>
      <Paragraph position="2"> As for parser, we train three off-shelf maximum-entropy parsers (Ratnaparkhi, 1999) using the Arabic, Chinese and English Penn treebank (Maamouri and Bies, 2004; Xia et al., 2000; Marcus et al., 1993). Arabic words are segmented while the Chinese parser is a character-based parser. The three parsers have a label F-measure of 77%, 80%, and 86% on their respective test sets. The three parsers are used to parse both ACE training and test data. Features described in Section 3 are computed from machine-generated parse trees.</Paragraph>
      <Paragraph position="3"> Apart from features extracted from parse trees, our coreference system also utilizes other features such as lexical features (e.g., string matching), distance features characterized as quantized word and sentence distances, mention- and entity-level attribute information (e.g, ACE distinguishes 4 types of mentions: NAM(e), NOM(inal), PRE(modi er) and PRO(noun)) found in the 2004 ACE data. Details of these features can be found in (Luo et al., 2004).</Paragraph>
    </Section>
    <Section position="2" start_page="662" end_page="663" type="sub_section">
      <SectionTitle>
4.2 Performance Metrics
</SectionTitle>
      <Paragraph position="0"> The of cial performance metric in the ACE task is ACE-Value (NIST, 2004). The ACE-Value is an entity-based metric computed by subtracting a normalized cost from 1 (so it is unbounded below). The cost of a system is a weighted sum of costs associated with entity misses, false alarms and errors. This cost is normalized against the cost of a nominal system that outputs no entity. A perfect coreference system gets 100% ACE-Value while a system outputting many false-alarm entities could get a negative value.</Paragraph>
      <Paragraph position="1"> The default weights in ACE-Value emphasize names, and severely discount pronouns: the relative importance of a pronoun is two orders of magnitude less than that of a name. So the ACE-Value will not be able to accurately reect a system's improvement on pronouns2. For this reason, we compute an unweighted entity-constrained mention F-measure (Luo, 2005) and report all contrastive experiments with this metric. The F-measure is computed by rst aligning system and reference entities such that the number of common mentions is maximized and each system entity is constrained to align with at most one reference entity, and vice versa. For example, suppose that a reference document contains three entities: f[m1], [m2, m3], [m4]g while a system outputs four entities: f[m1, m2], [m3], [m5], [m6]g, where fmi : i = 1, 2, , 6g are mentions, then the best alignment from reference to system would be [m1] , [m1, m2], [m2, m3] , [m3] and other entities are not aligned. The number of common mentions of the best alignment is 2 2Another possible choice is the MUC F-measure (Vilain et al., 1995). But the metric has a systematic bias for systems generating fewer entities (Bagga and Baldwin, 1998) see Luo (2005). Another reason is that it cannot score single-mention entity.</Paragraph>
      <Paragraph position="2">  (i.e., m1 and m3), thus the recall is 24 and precision is  5. Due to the one-to-one entity alignment constraint, theF-measure here is more stringent than the accuracy (Ge et al., 1998; Mitkov, 1998; Kehler et al., 2004) computed on antecedent-pronoun pairs.</Paragraph>
    </Section>
    <Section position="3" start_page="663" end_page="663" type="sub_section">
      <SectionTitle>
4.3 Effect of Syntactic Features
</SectionTitle>
      <Paragraph position="0"> We rst present the contrastive experimental results on the devtest described in sub-section 4.1.</Paragraph>
      <Paragraph position="1"> Two coreference systems are trained for each language: a baseline without syntactic features, and a system including the syntactic features. The entity-constrained F-measures with mention-type breakdown are presented in Table 2. Rows marked with Nm contain the number of mentions, while rows with base and +synt are F-measures for the baseline and the system with the syntactic features, respectively.</Paragraph>
      <Paragraph position="2"> The syntactic features improve pronoun mentions across three languages not surprising since features inspired by the binding theory are designed to improve pronouns.</Paragraph>
      <Paragraph position="3"> The pronoun improvement on the Arabic (from 73.2% to 74.6%) and English (from 69.2% to 72.0%) system is statistically signi cant (at above 95% con dence level), but change on the Chinese system is not. For Arabic, the syntactic features improve Arabic NAM, NOM and PRE mentions, probably because Arabic pronouns are sometimes attached to other types of mentions. For Chinese and English, the syntactic features do not practically change the systems' performance.</Paragraph>
      <Paragraph position="4"> As will be shown in Section 4.5, the baseline systems without syntactic features are already competitive, compared with the results on the coreference evaluation track (EDR-coref) of the ACE 2004 evaluation (NIS, 2004). So it is nice to see that syntactic features further improve a good baseline on Arabic and English.</Paragraph>
    </Section>
    <Section position="4" start_page="663" end_page="665" type="sub_section">
      <SectionTitle>
4.4 Error Analyses
</SectionTitle>
      <Paragraph position="0"> From the results in Table 2, we know that the set of syntactic features are working in the Arabic and English system. But the results also raise some questions: Are there interactions among the the syntactic features and other features? Why do the syntactic features work well for Arabic and English, but not Chinese? To answer these questions, we look into each system and report our ndings in the following sections.</Paragraph>
      <Paragraph position="1">  Our system uses a group of distance features. One observation is that information provided by some syntactic features (e.g., V P count(m1, m2) etc) may have overlapped with some of the distance features. To test if this is the case, we take out the distance features from the English system, and then train two systems, one with the syntactic features, one without. The results are shown in Table 3, where numbers on the row b-dist are F-measures after removing the distance features from the baseline, and numbers on the row b-dist+synt are with the syntactic features.</Paragraph>
      <Paragraph position="2">  F-measures(%).</Paragraph>
      <Paragraph position="3"> As can be seen, the impact of the syntactic features is much larger when the distance features are absent in the system: performance improves across all the four mention types after adding the syntactic features, and the overall F-measure jumps from 72.5% to 79.3%. The PRE type gets the biggest improvement since features extracted from parse trees include apposition, same-parent test, and dependency features, which are designed to help mention pairs in close distance, just as in the case of PRE mentions.</Paragraph>
      <Paragraph position="4"> Comparing the numbers in Table 3 with the English base-line of Table 2, we can also conclude that distance features and syntactic features lead to about the same level of performance when the other set of features is not used. When the distance features are used, the syntactic features further help to improve the performance of the NOM and PRO mention type, albeit to a less degree because of information overlap between the two sets of features.</Paragraph>
      <Paragraph position="5">  Results in Table 2 show that the syntactic features are not so effective for Chinese as for Arabic and English. The  rst thing we look into is if there is any idiosyncrasy in the Chinese language.</Paragraph>
      <Paragraph position="6"> In Table 4, we list the statistics collected over the training sets of the three languages: the second row are the total number of mentions, the third row the number of pronoun mentions, the fourth row the number of events where the c-command feature ccmd(m1, m2) is used, and the last row the average number of c-command features per pronoun (i.e., the fourth row divided by the third row). A pronouns event is de ned as a tuple of training instance (e, m1, m2) where m1 is a mention in entity e, and the second mention m2 is a pronoun.</Paragraph>
      <Paragraph position="7"> From Table 4, it is clear that Chinese pronoun distribution is very different: pronoun mentions account for about 8.7% of the total mentions in Chinese, while 29.0% of Arabic mentions and 25.1% of English mentions are pronouns (the same disparity can be observed in the devtest set in Table 2). This is because Chinese is a pro-drop language (Huang, 1984): for example, in the Chinese Penn treebank version 4, there are 4933 overt pronouns, but 5750 pro-drops! The ubiquity of pro-drops in Chinese results in signigicantly less pronoun training events. Consequently, the pronoun-related features are not trained as well as in English and Arabic. One way to quantify this is by looking at the average number of c-command features on a per-pronoun basis: as shown in the last row of Table 4, the c-command feature is seen more than twice often in Arabic and English as in Chinese. Since low-count features are ltered out, the sparsity of pronoun events prevent many compound features (e.g., conjunction of syntactic and distance features) from being trained in the Chinese system, which explains why the syntactic  As stated in Table 4, 29.0% of Arabic mentions are pronouns, compared to a slightly lower number (25.1%) for English. This explains the relatively high positive impact of the syntactic features on the Arabic coreference system, compared to English and Chinese systems. To understand how syntactic features work in the Arabic system, we examine two examples extracted from the devtest set: (1) the rst example shows the negative impact of syntactic features because of the noisy parsing output, and (2) the second example proves the effectiveness of the syntactic features to nd the dependency between two mentions. In both examples, the baseline system and the system with syntactic features give different results.</Paragraph>
      <Paragraph position="8"> Let's consider the following sentence:... A D A &lt;&lt; Y fi,@ J K Q @ Q . J&amp;quot; K ...</Paragraph>
      <Paragraph position="9"> ... its-capital-Jerusalem-Israel-consider-and ...... O JK Y @, oe flQ ,@ QC/ ,@ aeJ J C/ @ fi,@ YK QK A J fl of-the-city-the-Eastern-the-half-the-Palestininan-want-while The English text shown above is a word-to-word translation of the Arabic text (read from right-to-left). In this example, the parser wrongly put the nominal mention Y fi , @ (Jerusalem) and the pronominal mention O JK Y V@ (the-city) under the same constituent, which acti-vates the same parent feature. The use of the feature same parent( Y fi, @, O JK Y V@) leads to the two mentions being put into different entities. This is because there are many cases in the training data where two mentions under the same parent are indeed in different entities: a similar English example is John called his sister , where his and sister belong to two different entities. The same parent feature is a strong indicator of not putting them into the same entity.</Paragraph>
      <Paragraph position="10">  the PRO mention o (hm) with its antecedent, the NAM mention aeJ flA fl Q,@ (AlzqAqywn): top Arabic sentence; middle corresponding romanized sentence; bottom token-to-token English translation.</Paragraph>
      <Paragraph position="11"> Table 5 shows another example in the devtest set. The top part presents the segmented Arabic text, the middle part is the corresponding romanized text, and the bottom part contains the token-to-token English translation. Note that Arabic text reads from right to left and its corresponding romanized text from left to right (i.e., the right-most Arabic token maps to the left-most romanized token). The parser output the correct syntactic structure: Figure 2 shows a portion of the system-generated parse tree. It can be checked that NP1 c-commands NP2 and the group of features inspired by the binding theory are active. These features help to link the PRO(onominal) mention o (hm) with the NAM(e) mention aeJ flA fl Q,@ (AlzqAqywn).</Paragraph>
      <Paragraph position="12"> Without syntactic features theses two mentions were split into different entities.</Paragraph>
    </Section>
    <Section position="5" start_page="665" end_page="665" type="sub_section">
      <SectionTitle>
4.5 ACE 2004 Results
</SectionTitle>
      <Paragraph position="0"> To get a sense of the performance level of our system, we report the results on the ACE 2004 of cial test set with both the F-measure and the of cial ACE-Value metric.</Paragraph>
      <Paragraph position="1"> This data is used as the blind test set which we run our system only once.</Paragraph>
      <Paragraph position="2"> Results are summarized in Table 6, where the second row (i.e. base ) contains the baseline numbers, and the third row (i.e., +synt ) contains the numbers from systems with the syntactic features. Columns under F are F-measure and those under AV are ACE-Value. The last row Nm contains the number of mentions in the three test sets.</Paragraph>
      <Paragraph position="3">  Data.</Paragraph>
      <Paragraph position="4"> The performance of three full ( +synt ) systems is remarkably close to that on the devtest set(cf. Table 2): For Arabic, F-measure is 80.1 on the devtest vs. 81.5 here; For Chinese, 84.9 vs. 84.7; And for English, 80.8 vs. 82.0. The syntactic features again help Arabic and English statistically very signi cant in F-measure, but have no signi cant impact on Chinese. The performance consistency across the devtest and blind test set indicates that the systems are well trained.</Paragraph>
      <Paragraph position="5"> The F-measures are computed on all types of mentions. Improvement on mention-types targeted by the syntactic features is larger than the lump-sum F-measure. For example, the F-measure for English pronouns on this test set is improved from 69.5% to 73.7% (not shown in Table 6 due to space limit). The main purpose of Table 6 is to get a sense of performance level correspondence between the F-measure and ACE-Value.</Paragraph>
      <Paragraph position="6"> Also note that, for Arabic and English, the difference between the base and +synt systems, when measured by ACE-Value, is much smaller. This is not surprising since ACE-Value heavily discounts pronouns and is insensitive to improvement on pronouns the very reason we adopt the F-measure in Section 4.3 and 4.4 when reporting the contrastive experiment results.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML