<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0701">
  <Title>Multi-Document Person Name Resolution</Title>
  <Section position="4" start_page="1" end_page="4" type="metho">
    <SectionTitle>
3 Maximum Entropy Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Data
</SectionTitle>
      <Paragraph position="0"> Fleischman et al. (2003) describe a dataset of concept-instance pairs extracted automatically from a very large corpus of newspaper articles. The data-set (referred to here as the ACL dataset) contains approximately 2 million pairs (of which 93% are legitimate) in which the concept is represented by a complex noun phrase (e.g. president of the United domly selecting two names from a hand crafted list of 8 individuals (e.g., Haifa Al-Faisal and Tom Cruise) and treat the pair as one name with two referents.</Paragraph>
      <Paragraph position="1"> States) and the instance by a name (e.g. William Jefferson Clinton).</Paragraph>
      <Paragraph position="2">  A set of 2675 legitimate concept-instance pairs was randomly selected from the ACL dataset described above; each of these was then matched with another concept-instance pair that had an identical instance name, but a different concept name. This set of matched pairs was hand tagged by a human annotator to reflect whether or not the identical instance names actually referred to the same individual. The set was then randomly split into a training set of 1875 matched pairs (84% referring to the same individual), a development set of 400 matched pairs (85.5% referring to the same individual), and a test set of 400 matched pairs (83.5% referring to the same individual).</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Features
</SectionTitle>
      <Paragraph position="0"> In designing a binary classifier to determine whether two concept-instance pairs refer to the same individual, we formulate a number of different features used to describe each matched pair. These features are summarized in Table 1, and described in more detail below.</Paragraph>
      <Paragraph position="1"> Name Features We use a number of methods meant to express information available from the orthography of the instance name itself. The first of these features (Name-Common) seeks to estimate the commonality of the instance name. With this features we hope to capture the intuition that more common names (such as John Smith) will be more likely to refer to different individuals than more uncommon names (such as Yasir Arafat). We calculate this feature by splitting the instance name into first, middle (if necessary) and last sub-names. We then use a table of name frequencies downloaded from the US census website to give each sub-name a score; these scores are then multiplied together for a final value.</Paragraph>
      <Paragraph position="2"> The second name statistic feature estimates how famous the instance name is. With this features we  Although the dataset includes multiple types of named entities, we focus here only on person names.</Paragraph>
      <Paragraph position="3"> hope to capture the intuition that names of very famous people (such as Michael Jackson) are less likely to refer to different individuals than less famous, yet equally common, names (such as John Smith). We calculate this feature in two ways: first, we use the frequency of the instance name as it appears in the ACL dataset to give a representation of how often the name appears in newspaper text (Name-Fame); second, we use the number of hits reported on google.com for a query consisting of the quoted instance name itself (Web-Fame).</Paragraph>
      <Paragraph position="4"> These fame features are used both as is and scaled by the commonality feature described above.</Paragraph>
      <Paragraph position="5"> Web Features Aside from the fame features described above, we use a number of other features derived from web search results. The first of which, called WebIntersection, is simply the number of hits returned for a query using the instance name and the heads of each concept noun phrase in the match pair; i.e., (name + head1 +head2).</Paragraph>
      <Paragraph position="6"> The second, called WebDifference, is the absolute value of the difference between the hits returned from a query on the instance name and just the head of concept 1 vs. the instance name and just the head of concept 2; i.e., abs ((name + head1) -(name +head2)).</Paragraph>
      <Paragraph position="7"> The third, called WebRatio, is the ratio between the WebIntersection score and the sum of the hits returned when querying the instance name and just the head of concept 1 and the instance name and just the head of concept 2; i.e., (name + head1</Paragraph>
      <Paragraph position="9"> In order to capture some aspects of the contextual cues to referent disambiguation, we include features representing the similarity between the sentential contexts from which each concept-instance pair was extracted. The similarity metric that we use is a simple word overlap score based on the number of words that are shared amongst both sentences. We include scores in which each non-stop-word is treated equally (Sentence-Count), as well as, in which each non-stop-word is weighted according to its term frequency in a large corpus (Sentence-TF). We further include two similar features that only examine the overlap in the concepts (Concept-Count and Concept-TF).</Paragraph>
      <Paragraph position="10">  JCN sem. dist. of Jiang and Conrath HSO sem. dist. of Hirst and St. Onge LCH sem. dist. of Leacock and Chodrow Lin sem. dist. of Lin Res sem. dist. of Resnik</Paragraph>
      <Paragraph position="12"/>
    </Section>
    <Section position="3" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
Semantic Features
</SectionTitle>
      <Paragraph position="0"> Another important clue in determining the coreference of instances is the semantic relatedness of the concepts with which they are associated. In order to capture this, we employ five metrics described in the literature that use the WordNet ontology to determine a semantic distance between two lexical items (Budanitsky and Hirst. 2001). We use the implementation described in Pedersen (2004) to create features corresponding to the scores on the following metrics shown in Table 1. Due to problems associated with word sense ambiguity, we take the maximum score amongst all possible combinations of senses for the heads of the concepts in the matched pair. The final output to the model is a single similarity measure for each of the eight metrics described in Pedersen (2004).</Paragraph>
      <Paragraph position="1">  In developing features useful for referent disambiguation, it is clear that the concept information to which we have access is very useful. For example, given that we see John Edwards /politician and John Edwards /lawyer, our knowledge that politicians are often lawyers is very useful in judging referential identity.</Paragraph>
      <Paragraph position="2">  In order to exploit this information, we leverage the strong correlation between orthographic identity of instance names and their referential identity.</Paragraph>
      <Paragraph position="3"> As described above, approximately 84% of those matched pairs that had identical instance names referred to the same referent. In a separate examination, we found, not surprisingly, that nearly 100% of pairs that were matched to instances with different names (such as Bill Clinton vs. George Clinton) referred to different referents.</Paragraph>
      <Paragraph position="4"> We take advantage of this strong correlation in developing features by first making the (admittedly wrong) assumption that orthographic identity is equivalent to referential identity, and then using that assumption to calculate a number of statistics over the large ACL dataset. We postulate that the noise introduced by our assumption will be offset by the large size of the dataset, yielding a number of highly informative features.</Paragraph>
      <Paragraph position="5"> The statistics we calculate are as follows: P1: The probability that instance 1 and instance 2 have the same referent given that instance 1 is paired with concept A and instance 2 with concept B; i.e., p(i1=i2  |i1-A, i2-B) P2: The probability that instance 1 is paired with concept A and instance 2 with concept B given that instance 1 and instance 2 have the same referent; i.e., p(i1-A, i2-B  |i1=i2) P3: The probability that instance 1 is paired with concept A given that instance 2 is paired with concept B plus the probability that instance  It should be noted that this feature is attempting to encode knowledge about what concepts occur together in the real world, which is different than, what is being encoded in the semantic features, described above.  2 is paired with concept B given that instance 1 is paired with concept A; i.e., p(i1-A  |i2-B) + p(i2-B  |i1-A) P4: The probability that instance 1 is paired with concept A and instance 2 is paired with  concept B divided by the probability that instance 1 is paired with concept A plus the probability that instance 2 is paired with concept B; i.e., p(i1-A, i2-B) / (p(i1-A) + p(i2-B))  data compared to baseline (i.e., always same referent). Aside from the noise introduced by the assumption described above, another problem with these features arises when the derived probabilities are based on very low frequency counts. Thus, when adding these features to the model, we bin each feature according to the number of counts that the score was based on.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Model
</SectionTitle>
      <Paragraph position="0"> Maximum Entropy (Max. Ent.) models implement the intuition that the best model will be the one that is consistent with the set of constrains imposed by the evidence, but otherwise is as uniform as possible (Berger et al., 1996). We model the probability of two instances having the same referent (r=[1,0]) given a vector of features x according to the Max.</Paragraph>
      <Paragraph position="1"> Ent. formulation below:</Paragraph>
      <Paragraph position="3"> feature function over values of r and vector elements, n is the total number of feature functions, and l</Paragraph>
      <Paragraph position="5"> is the weight for a given feature function.</Paragraph>
      <Paragraph position="6"> The final output of the model is the probability</Paragraph>
      <Paragraph position="8"> given a feature vector that r=1; i.e., the probability that the referents are the same.</Paragraph>
      <Paragraph position="9"> We train the Max. Ent. model using the YASMET Max. Ent. package (Och, 2002). Feature weights are smoothed using Gaussian priors with mean 0. The standard deviation of this distribution is optimized on the development set, as is the number of training iterations and the probability threshold used to make the hard classifications reported in the following experiment.</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.4 Experimental Results
</SectionTitle>
      <Paragraph position="0"> Results for the classifier on the held out test set are reported in Figure 1. Baseline here represents always choosing the most common classification (i.e., instance referents are the same). Figure 2 represents the learning curve associated with this task. Figure 3 shows the effect on performance of incrementally adding the best feature set (as determined by greedily trying each one) to the model.</Paragraph>
    </Section>
    <Section position="6" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.5 Discussion
</SectionTitle>
      <Paragraph position="0"> It is clear from the results that this model outperforms the baseline for this task (p&gt;0.01) (p&lt;0.01) (Mitchell, 1997). Interestingly, although the number of labeled examples that were used to train the system was by no means extravagant, it appears from the learning curve that increasing the size of the training set will not have a large effect on classifier performance. Also of interest, Figure 3 shows that the greedy feature selection technique found that the most powerful features for this task are the estimated statistic features and the web features. While the benefit of such large corpora features is not surprising, the relative lack of power from the semantic and overlap features (which exploit ontolological and contextual information) was surprising.</Paragraph>
      <Paragraph position="1">  In future work, we will examine how more sophisticated similarity metrics and larger windows of context (e.g., the whole document) might improve performance.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="4" end_page="7" type="metho">
    <SectionTitle>
4 Clustering
</SectionTitle>
    <Paragraph position="0"> data using different subsets of feature types. Feature types are greedily added one at a time, starting with Estimated Statistics and ending with Semantic Features.</Paragraph>
    <Paragraph position="1"> Having generated a model to predict the probability that two concept-instance pairs with the same name refer to the same individual, we are faced with the problem of using such a model to partition all of our concept-instance pairs according to the individuals to which they actually refer. Although, ideally, we should be able to simply apply the model to all possible pairs, in reality, such a methodology may lead to a contradiction.</Paragraph>
    <Paragraph position="2"> For example, given that the model predicts instance A is identical to instance B, and in addition, that instance B is identical to C, because of the transitivity of the identity relation, we must assume that A is identical to C. However, if the model predicts that A is not identical to C, (which can and does occur) we must assume the model is wrong in at least one of its three predictions.</Paragraph>
    <Paragraph position="3">  Note that for these tests, the model parameters are not optimized for each run; thus, the performance is slightly worse than in Figure 1.</Paragraph>
    <Paragraph position="4">  Following Ng and Cardie (2002), we address this problem by clustering each set of concept-instance pairs with identical names, using a form of groupaverage agglomerative clustering, in which the similarity score between instances is just the probability output by the model. Because standard agglomerative clustering algorithms are O(n  ) if cosign similarity metrics are not used (Manning and Schutze, 2001), we adapt the method to our framework. Our algorithm operates as follows  : On input D={concept-instance pairs of same name}, build a fully connected graph G with vertex set D: 1) Label each edge (d,d') in G with a score corresponding to the probability of identity predicted by the Max. Ent. model 2) While the edge with max score in G &gt; threshold: a. Merge the two nodes connected by the edge with the max score.</Paragraph>
    <Paragraph position="5"> b. For each node in the graph a. Merge the two edges connecting it to the newly merged node b. Assign the new edge a score equal to the avg. of the two old edge scores.</Paragraph>
    <Paragraph position="6"> The final output of this algorithm is a new graph in which each node represents a single referent associated with a set of concept-instance pairs. This algorithm provides an efficient way, O(n  ), to compose the pair-wise information given by the model. Further, because the only free parameter is a merging threshold (which can be determined through cross-validation) the algorithm is free to choose a different number of referents for each instance name it is tested on. This is critical for the task because each instance name can have any number of referents associated with it.</Paragraph>
    <Section position="1" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
4.1 Test Data
</SectionTitle>
      <Paragraph position="0"> In order to test clustering, we randomly selected a set of 31 instance names from the ACL dataset, 11 of which referred to multiple individuals and 20 of which had only a single referent  This algorithm was developed with Hal Daume (technical report, in prep.).</Paragraph>
      <Paragraph position="1">  In an examination of 113 different randomly selected instance names from the ACL dataset we found that 32 instance pair with that instance name was then extracted and hand annotated such that each individual referent was given a unique identifying code. We chose not to test on artificially generated test examples (such as the pseudo-names described in Mann and Yarowsky, 2003) because of our reliance on name orthography in feature generation (see section 3.2). Further, such pseudo-names ignore the fact that names often correlate with other features (such as occupation or birthplace), and that they do not guarantee clean test data (i.e., the two names chosen for artificial identity may themselves each refer to multiple individuals).</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
4.2 Experimental Design
</SectionTitle>
      <Paragraph position="0"> In examining the results of the clustering, we chose to use a simple clustering accuracy as our performance metric. According to this technique, we match the output of our system to a gold standard clustering (defined by the hand annotations described above).</Paragraph>
      <Paragraph position="1">  We compare our algorithm on the 31 sets of concept-instance pairs described above against two baseline systems. The first (baseline1) is simply a single clustering of all pairs into one cluster; i.e., all instances have the same referent. The second (baseline2) is a simple greedy clustering algorithm that sequentially adds elements to the previous cluster whose last-added element is most similar (and exceeds some threshold set by cross validation). null</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> In examining performance, we present a weighted average over these 31 instance sets, based on the number of nodes (i.e., concept-instance pairs) in each set of instances (total nodes = 1256). Cross-validation is used to set the threshold for both the baseline2 and modified agglomerative algorithm.</Paragraph>
      <Paragraph position="1"> appeared only once in the dataset, 53 appeared more than once but always referred to the same referent, and 28 had multiple referents.</Paragraph>
      <Paragraph position="2">  While this is a relatively simple measure, we believe that, if anything, it is overly conservative, and thus, valid for the comparisons that we are making.</Paragraph>
      <Paragraph position="3"> These results are presented in Table 2. Figure 4 examines performance as a function of the number of referents within each of the 31 instance sets.</Paragraph>
    </Section>
    <Section position="4" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
4.4 Discussion
</SectionTitle>
      <Paragraph position="0"> tive clustering and Baseline system as a function of the number of referents in the test set.</Paragraph>
      <Paragraph position="1"> While the algorithm we present clearly outperforms the baseline2 method over all 31 instance sets (p&lt;0.01), we can see that it only marginally outperforms our most simple baseline1 method (p&lt;0.10) (Mitchell, 1997). This is due to the fact that for each of the 20 instance sets that only have a single referent, the baseline achieves a perfect score, while the modified agglomerative method only achieves a score of 96.4%. Given this aspect of the baseline, and the distribution of the data, the fact that our algorithm outperforms the baseline at all speaks to its usefulness for this task.</Paragraph>
      <Paragraph position="2"> A better sense of the usefulness of this algorithm, however, can be seen by looking at its performance only on instance sets with multiple referents. As seen in Table 3, on multiple referent instance sets, modified agglomerative clustering outperforms both the baseline1 and baseline2 methods by a statistically significant margin (p&lt;0.01) (Mitchell, 1997).</Paragraph>
    </Section>
  </Section>
</Paper>