<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0609"> <Title>Discriminative Training of Clustering Functions: Theory and Experiments with Entity Identification</Title> <Section position="4" start_page="64" end_page="65" type="metho"> <SectionTitle> 2 Clustering in Natural Language Tasks </SectionTitle> <Paragraph position="0"> Clustering is the task of partitioning a set of elements S ⊆ X into a disjoint decomposition (overlapping partitions will not be discussed here) p(S) = {S_1, S_2, ..., S_K} of S. We associate with it a partition function p = p_S : X → C = {1, 2, ..., K} that maps each x ∈ S to a class index p_S(x) = k iff x ∈ S_k. The subscript S in p_S and p_S(x) is omitted when clear from the context.</Paragraph> <Paragraph position="1"> Notice that, unlike a classifier, the image of x ∈ S under a partition function depends on S.</Paragraph> <Paragraph position="2"> In practice, a clustering algorithm A (e.g., K-Means) and a distance metric d (e.g., the Euclidean distance) are typically used to generate a function h that approximates the true partition function p. Denote by h(S) = A_d(S) the partition of S by h. A distance (equivalently, a similarity) function d that measures the proximity between two elements is a pairwise function X × X → R^+, which can be parameterized to represent a family of functions; metric properties are not discussed in this paper. For example, given any two elements x_1 = <x_1^(1), ..., x_1^(m)> and x_2 = <x_2^(1), ..., x_2^(m)> in an m-dimensional space, a linearly weighted Euclidean distance with parameters θ = {w_l}_1^m is defined as

d_\theta(x_1, x_2) = \sqrt{ \sum_{l=1}^{m} w_l \, (x_1^{(l)} - x_2^{(l)})^2 }.   (1)

</Paragraph> <Paragraph position="3"> When supervision (e.g., the class index of elements) is unavailable, the quality of a partition function h operating on S ⊆ X is measured with respect to the distance metric defined over X. Suppose h partitions S into disjoint sets h(S) = {S'_k}_1^K; one quality function, used in the K-Means algorithm, is defined as

q_S(h) = \sum_{k=1}^{K} \sum_{x \in S'_k} d(x, u'_k)^2,   (2)

where u'_k is the mean of the elements in set S'_k. However, this measure can be computed irrespective of the algorithm.</Paragraph>
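<Paragraph> As a concrete illustration (ours, not from the paper), the following minimal Python sketch computes the two quantities above; the array layout and the squared form of the quality term in Equ. 2 are assumptions based on the standard K-Means objective. </Paragraph>
```python
import numpy as np

def weighted_euclidean(x1, x2, w):
    """Linearly weighted Euclidean distance, as in Equ. 1."""
    return np.sqrt(np.sum(w * (x1 - x2) ** 2))

def kmeans_quality(S, h, w):
    """Quality q_S(h) of Equ. 2: for each cluster produced by h,
    sum the squared weighted distances of members to the cluster mean."""
    q = 0.0
    for k in np.unique(h):
        members = S[h == k]
        u_k = members.mean(axis=0)           # u'_k, the mean of S'_k
        q += sum(weighted_euclidean(x, u_k, w) ** 2 for x in members)
    return q

# Toy usage: 4 points, 2 clusters, uniform weights.
S = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
h = np.array([0, 0, 1, 1])
print(kmeans_quality(S, h, np.ones(2)))      # 1.0: each cluster contributes 2 * 0.25
```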
<Section position="1" start_page="64" end_page="65" type="sub_section"> <SectionTitle> 2.1 What is a Good Metric? </SectionTitle> <Paragraph position="0"> A good metric is one in which close proximity correlates well with the likelihood of being in the same class. When applying clustering to some task, people typically decide on the clustering quality measure q_S(h) they want to optimize, and then choose a specific clustering algorithm A and a distance metric d to generate a 'good' partition function h. However, it is clear that without any supervision, the resulting function is not guaranteed to agree with the target function p (or one's original intention).</Paragraph> <Paragraph position="1"> Given this realization, there has been some work on selecting a good distance metric for a family of related problems and on learning a metric for specific tasks. For the former, the focus is on developing and selecting good distance (similarity) metrics that reflect well the pairwise proximity between domain elements. The "goodness" of a metric is measured empirically by combining it with different clustering algorithms on different problems. For example, (Lee, 1997; Weeds et al., 2004) compare similarity metrics that could be applied in general clustering tasks, such as the Cosine, Manhattan and Euclidean distances, the Kullback-Leibler divergence, the Jensen-Shannon divergence, and Jaccard's Coefficient, on the task of measuring distributional similarity. (Cohen et al., 2003) compare a number of string and token-based similarity metrics on the task of matching entity names and find that, overall, the best-performing method is a hybrid scheme (SoftTFIDF) that combines a TFIDF weighting scheme over tokens with the Jaro-Winkler string-distance scheme widely used for record linkage in databases.</Paragraph> <Paragraph position="2"> [Figure 1: Points in a two-dimensional space <X^(1), X^(2)>, clustered into two groups containing solid and hollow points respectively.]</Paragraph> <Paragraph position="3"> Moreover, it is not clear whether there exists any universal metric that is good for many different problems (or even for different data sets of similar problems) and is appropriate for any clustering algorithm. For the word-based distributional similarity mentioned above, this point was discussed in (Geffet and Dagan, 2004), where it is shown that proximity metrics appropriate for class-based language models may not be appropriate for other tasks. We illustrate this critical point in Fig. 1: (a) and (b) show that, even for the same data collection, different clustering algorithms with the same metric can generate different outcomes; (b) and (c) show that, with the same clustering algorithm, different metrics can also produce different outcomes. Therefore, a good distance metric should be both domain-specific and associated with a specific clustering algorithm.</Paragraph> </Section> <Section position="2" start_page="65" end_page="65" type="sub_section"> <SectionTitle> 2.2 Metric Learning via Pairwise Classification </SectionTitle> <Paragraph position="0"> Several works (Cohen et al., 2003; Cohen and Richman, 2002; McCallum and Wellner, 2003; Li et al., 2004) have tried to remedy the aforementioned problems by learning a distance function in a domain-specific way via pairwise classification. In the training stage, given a set of labeled element pairs, a function f : X × X → {0, 1} is trained to classify any two elements as to whether they belong to the same class (1) or not (0), independently of other elements. The distance between two elements is then obtained by converting the prediction confidence of the pairwise classifier, and clustering is performed based on this distance function. In particular, (Li et al., 2004) applied this approach to measuring name similarity in the entity identification problem, where a pairwise classifier (LMR) is trained using the SNoW learning architecture (Roth, 1998), based on variations of Perceptron and Winnow, over a collection of relational features between a pair of names. The distance between two names is defined as a softmax over the classifier's output. As expected, experimental evidence (Cohen et al., 2003; Cohen and Richman, 2002; Li et al., 2004) shows that domain-specific distance functions improve over a fixed metric, which can be explained by the flexibility provided by adapting the metric to the domain, as well as by the contribution of the supervision that guides this adaptation; a sketch of such a confidence-to-distance conversion appears below.</Paragraph>
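<Paragraph> The paper does not spell out the exact conversion, so the following sketch is only one plausible reading: the raw activations of a pairwise classifier are squashed by a softmax over the two classes and the resulting confidence is turned into a distance. The function name and the two-score interface are ours, not the paper's. </Paragraph>
```python
import math

def activation_to_distance(pos_score, neg_score):
    """Convert a pairwise classifier's raw activations into a distance
    in [0, 1] via a softmax over the two classes (hypothetical form;
    the paper only states that a softmax over the output is used)."""
    p_same = math.exp(pos_score) / (math.exp(pos_score) + math.exp(neg_score))
    return 1.0 - p_same  # high confidence of "same entity" -> small distance

# Example: a strong positive activation yields a small distance.
print(activation_to_distance(2.3, -0.7))  # ~0.047
```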
<Paragraph position="1"> A few works outside the NLP domain (Xing et al., 2002; Bar-Hillel et al., 2003; Schultz and Joachims, 2004; Mochihashi et al., 2004) have also pursued this general direction, and some have tried to learn the metric with a limited amount of supervision, with no supervision, or by incorporating other information sources, such as constraints on the class memberships of the data elements. In most of these cases, the algorithm actually used for clustering (e.g., K-Means) is either not considered in the learning procedure or only implicitly exploited by optimizing the same objective function. (Bach and Jordan, 2003; Bilenko et al., 2004) do suggest learning a metric directly in a clustering task, but their learning procedures are specific to one clustering algorithm.</Paragraph> </Section> </Section> <Section position="5" start_page="65" end_page="66" type="metho"> <SectionTitle> 3 Supervised Discriminative Clustering </SectionTitle> <Paragraph position="0"> To address the limitations of existing approaches, we develop the Supervised Discriminative Clustering framework (SDC), which can train a distance function with respect to any chosen clustering algorithm in the context of a given task, guided by supervision.</Paragraph> <Paragraph position="1"> [Figure 2: The SDC framework, training a partition function from a labeled data set S.] Fig. 2 presents this framework, in which a clustering task is explicitly split into a training stage and an application stage, and the chosen clustering algorithm is involved in both. In the training stage, supervision is directly integrated into measuring the clustering error err_S(h, p) of a partition function h, by exploiting the feedback given by the true partition p. The goal of training is to find a partition function h* in a hypothesis space H that minimizes this error. Consequently, given a new data set S' in the application stage, under some standard learning-theory assumptions, the hope is that the learned partition function generalizes well and achieves a small error there as well.</Paragraph> <Section position="1" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 3.1 Supervised and Unsupervised Training </SectionTitle> <Paragraph position="0"> Let p be the target function over X, let h be a function in the hypothesis space H, and let h(S) = {S'_k}_1^K. In principle, given a data set S ⊆ X, if the true partition p(S) = {S_k}_1^K of S is available, one can measure the deviation of h from p over S using an error function err_S(h, p) ∈ R^+. We distinguish an error function from a quality function (as in Equ. 2) as follows: an error function measures the disagreement between the clustering and the target partition (or one's intention) when supervision is given, while a quality function is defined without any supervision.</Paragraph> <Paragraph position="1"> For clustering, there is generally no direct way to compare the true class index p(x) of each element with that given by a hypothesis, h(x), so an alternative is to measure the disagreement between p and h over pairs of elements. Given a labeled data set S and p(S), one error function, namely the weighted clustering error, is defined as a sum of the pairwise errors over any two elements in S, weighted by the distance between them:

err_S(h, p) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} \left[ d(x_i, x_j) \cdot A_{ij} + (D - d(x_i, x_j)) \cdot B_{ij} \right],   (3)

where D = max_{x_i, x_j ∈ S} d(x_i, x_j) is the maximum distance between any two elements in S, I is an indicator function, and
A_{ij} ≡ I[p(x_i) = p(x_j) ∧ h(x_i) ≠ h(x_j)] and B_{ij} ≡ I[p(x_i) ≠ p(x_j) ∧ h(x_i) = h(x_j)] represent the two types of pairwise errors, respectively.</Paragraph> <Paragraph position="2"> Just like the quality defined in Equ. 2, this error is a function of the metric d. Intuitively, the contribution of a pair of elements that should belong to the same class but are split by h grows with their distance, and vice versa.</Paragraph> <Paragraph position="3"> However, this measure differs significantly from the quality, in that it does not just measure the tightness of the partition given by h, but rather the difference between the tightness of the partitions given by h and by p.</Paragraph> <Paragraph position="4"> Given a set of observed data, the goal of training is to learn a good partition function, parameterized by a specific clustering algorithm and distance function. Depending on whether the training data is labeled or unlabeled, we can further define supervised and unsupervised training.</Paragraph> <Paragraph position="5"> Definition 3.1 Supervised Training: Given a labeled data set S with p(S), a family of partition functions H, and the error function err_S(h, p) (h ∈ H), the problem is to find an optimal function h* s.t.

h^* = \arg\min_{h \in H} err_S(h, p).

</Paragraph> <Paragraph position="6"> Definition 3.2 Unsupervised Training: Given an unlabeled data set S (p(S) is unknown), a family of partition functions H, and a quality function q_S(h) (h ∈ H), the problem is to find an optimal partition function h* s.t.

h^* = \arg\min_{h \in H} q_S(h).

</Paragraph> <Paragraph position="7"> With this formalization, SDC, along with supervised training, can be clearly distinguished from (1) unsupervised clustering approaches, (2) clustering over pairwise classification, and (3) related works that exploit partial supervision in metric learning as constraints.</Paragraph> </Section> <Section position="2" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 3.2 Clustering via Metric Learning </SectionTitle> <Paragraph position="0"> By fixing the clustering algorithm in the training stage, we can further define supervised metric learning, a special case of supervised training.</Paragraph> <Paragraph position="1"> Definition 3.3 Supervised Metric Learning: Given a labeled data set S with p(S), and a family of partition functions H = {h} parameterized by a chosen clustering algorithm A and a family of distance metrics d_θ (θ ∈ Ω), the problem is to seek an optimal metric d_{θ*} with respect to A, s.t. for h(S) = A_{d_θ}(S),

\theta^* = \arg\min_{\theta \in \Omega} err_S(h, p).

</Paragraph> <Paragraph position="2"> Learning the metric parameters θ requires parameterizing h as a function of θ, once the algorithm A is chosen and fixed in h. In the later experiments of Sec. 5, we learn weighted Manhattan distances for the Single-Linkage algorithm and other algorithms, in the task of entity identification. In this case, when pairwise features φ(x_1, x_2) = <φ_1, φ_2, ..., φ_m> are extracted for any elements x_1, x_2 ∈ X, the linearly weighted Manhattan distance, parameterized by θ = {w_l}_1^m, is defined as

d_\theta(x_1, x_2) = \sum_{l=1}^{m} w_l \cdot \varphi_l(x_1, x_2),   (5)

where w_l is the weight over feature φ_l(x_1, x_2). Since the measurement of the error depends on the metric, as shown in Equ. 3, one needs to enforce some constraints on the parameters. One such constraint is Σ_{l=1}^m |w_l| = 1, which prevents the error from being scale-dependent (otherwise metrics giving smaller distances would always look better).</Paragraph> </Section> </Section>
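<Paragraph> To make Equ. 3 concrete, here is a small Python sketch of the weighted clustering error under the form reconstructed above; the 1/|S|^2 normalization is an assumption, since the paper's exact normalization is not recoverable from this extraction. </Paragraph>
```python
import numpy as np

def weighted_clustering_error(S, p, h, dist):
    """Weighted clustering error of Equ. 3.

    S    : list of elements
    p, h : arrays of true / predicted cluster indices, aligned with S
    dist : pairwise distance function d(x_i, x_j)
    """
    n = len(S)
    d = np.array([[dist(S[i], S[j]) for j in range(n)] for i in range(n)])
    D = d.max()
    err = 0.0
    for i in range(n):
        for j in range(n):
            A = (p[i] == p[j]) and (h[i] != h[j])   # split a true pair
            B = (p[i] != p[j]) and (h[i] == h[j])   # merged a false pair
            err += d[i, j] * A + (D - d[i, j]) * B
    return err / n**2

# Toy usage with 1-D elements and absolute-difference distance:
S, p, h = [0.0, 0.1, 5.0], np.array([0, 0, 1]), np.array([0, 1, 1])
print(weighted_clustering_error(S, p, h, lambda a, b: abs(a - b)))  # 0.4/9
```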
<Section position="6" start_page="66" end_page="67" type="metho"> <SectionTitle> 4 A General Learner for SDC </SectionTitle> <Paragraph position="0"> In addition to the theoretical SDC framework, we also develop a practical learning algorithm based on gradient descent (Fig. 3) that can train a distance function for any chosen clustering algorithm (such as Single-Linkage or K-Means), in the setting of supervised metric learning.</Paragraph> <Paragraph position="1"> The training procedure incorporates the clustering algorithm (step 2.a), so that the metric is trained with respect to the specific algorithm that will be applied in evaluation. The convergence of this general training procedure depends on the convexity of the error as a function of θ. For example, since the error function we use is linear in θ, the algorithm is guaranteed to converge to a global minimum. In this case, for the rate of convergence, one can appeal to general results which typically imply, when there exists a parameter vector with zero error, that the rate depends on the 'separation' of the training data, which roughly means the minimal error achieved with this parameter vector. Results such as (Freund and Schapire, 1998) can be used to extend the rate-of-convergence result somewhat beyond the separable case, when a small number of the pairs are not separable.</Paragraph> <Paragraph position="2"> [Figure 3] Algorithm: SDC-Learner. Input: S and p(S), the labeled data set; A, the clustering algorithm; err_S(h, p), the clustering error function; α > 0, the learning rate; T (typically large), the number of iterations allowed. Output: θ*, the parameters of the distance function d. 1. In the initial (I-) step, we randomly choose θ_0 for d. After this step we have the initial d_0 and h_0. 2. Then we iterate over t (t = 1, 2, ...): (a) partition S using h_{t-1}(S) ≡ A_{d_{t-1}}(S); (b) compute err_S(h_{t-1}, p) and update θ using the formula θ_t = θ_{t-1} - α · ∂err_S(h_{t-1}, p)/∂θ_{t-1}; (c) normalization: θ_t = (1/Z) · θ_t, where Z = ||θ_t||. 3. Stopping criterion: if t > T, the algorithm exits and outputs θ_t as θ*.</Paragraph> <Paragraph position="3"> For the weighted clustering error in Equ. 3 and linearly weighted Manhattan distances as in Equ. 5, the update rule in Step 2(b) becomes (treating D as a constant)

w_l^t = w_l^{t-1} - \frac{\alpha}{|S|^2} \sum_{x_i, x_j \in S} \varphi_l(x_i, x_j) \cdot (A_{ij} - B_{ij}).

</Paragraph> </Section>
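<Paragraph> A compact Python rendering of SDC-Learner follows. It is a sketch under the reconstruction above, with Single-Linkage standing in for the chosen algorithm A; the function signature, the random initialization scheme, and the feature-tensor layout are ours, not the paper's. </Paragraph>
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def sdc_learner(phi, p, K, alpha=0.1, T=20, seed=0):
    """Sketch of SDC-Learner (Fig. 3) for the weighted Manhattan metric
    of Equ. 5, trained with respect to Single-Linkage clustering.

    phi : (n, n, m) array of pairwise features, assumed symmetric in i, j
    p   : length-n array of true cluster indices
    K   : number of clusters (given, as in the experiments of Sec. 5)
    """
    n, _, m = phi.shape
    rng = np.random.default_rng(seed)
    w = rng.random(m)
    w /= np.abs(w).sum()                        # constraint: sum_l |w_l| = 1
    for _ in range(T):
        d = phi @ w                             # current (n, n) distance matrix
        # (a) partition S with the current metric
        cond = d[np.triu_indices(n, k=1)]       # condensed form for scipy
        h = fcluster(linkage(cond, method="single"), K, criterion="maxclust")
        # (b) gradient step on Equ. 3 (the D term is treated as constant)
        A = (p[:, None] == p[None, :]) & (h[:, None] != h[None, :])
        B = (p[:, None] != p[None, :]) & (h[:, None] == h[None, :])
        grad = ((A.astype(float) - B.astype(float))[:, :, None] * phi)
        w = w - alpha * grad.sum(axis=(0, 1)) / n**2
        w /= np.linalg.norm(w)                  # (c) normalization, Z = ||theta||
    return w
```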
<Section position="7" start_page="67" end_page="68" type="metho"> <SectionTitle> 5 Entity Identification in Text </SectionTitle> <Paragraph position="0"> We conduct an experimental study on the task of entity identification in text (Bilenko et al., 2003; McCallum and Wellner, 2003; Li et al., 2004). A given entity, representing a person, a location or an organization, may be mentioned in text in multiple, ambiguous ways. Consider, for example, an open-domain question answering system (Voorhees, 2002) that attempts, given a question like "When was President Kennedy born?", to search a large collection of articles in order to pinpoint the concise answer "on May 29, 1917." The sentence, and even the document that contains the answer, may not contain the name "President Kennedy"; it may refer to this entity as "Kennedy", "JFK" or "John Fitzgerald Kennedy". Other documents may state that "John F. Kennedy, Jr. was born on November 25, 1960", but this fact refers to our target entity's son. Other mentions, such as "Senator Kennedy" or "Mrs. Kennedy", are even "closer" in writing to the target entity, but clearly refer to different entities. Understanding natural language requires identifying whether different mentions of a name, within and across documents, represent the same entity.</Paragraph> <Paragraph position="1"> We study this problem for three entity types: People, Locations and Organizations. Although deciding the coreference of names within the same document might be relatively easy, since within a single document identical mentions typically refer to the same entity, identifying coreference across documents is much harder. With no standard corpora for studying the problem in a general setting, both within and across documents, we created our own corpus, by collecting about 8,600 names from 300 randomly sampled 1998-2000 New York Times articles in the TREC corpus (Voorhees, 2002). These names were first annotated by a named entity tagger, then manually verified and given as input to an entity identifier.</Paragraph> <Paragraph position="2"> Since the number of classes (entities) for names is very large, standard multi-class classification is not feasible. Instead, we compare SDC with several pairwise classification and clustering approaches. Some of them (for example, those based on the SoftTFIDF similarity) do not make use of any domain knowledge, while others, such as LMR and SDC, do exploit supervision. Other works (Bilenko et al., 2003) also exploited supervision in this problem by discriminatively training a pairwise classifier, but were shown to be inferior.</Paragraph> <Paragraph position="3"> 1. SoftTFIDF Classifier: a pairwise classifier deciding whether any two names refer to the same entity, implemented by thresholding a state-of-the-art SoftTFIDF similarity metric for string comparison (Cohen et al., 2003). Different thresholds were experimented with, but only the best results are reported.</Paragraph> <Paragraph position="4"> 2. LMR Classifier (P|W): a SNoW-based pairwise classifier (Li et al., 2004) (described in Sec. 2.2) that learns a linear function for each class over a collection of relational features between two names, including string and token-level features and structural features (listed in Table 1). For pairwise classifiers like LMR and SoftTFIDF, prediction is made over pairs of names, so transitivity of predictions is not guaranteed as it is in clustering.</Paragraph> <Paragraph position="5"> 3. Clustering over SoftTFIDF: a clustering approach based on the SoftTFIDF similarity metric.</Paragraph> <Paragraph position="6"> 4. Clustering over LMR (P|W): a clustering approach (Li et al., 2004) obtained by converting the LMR classifier into a similarity metric (see Sec. 2.2).</Paragraph> <Paragraph position="7"> 5. SDC: our new supervised clustering approach. The distance metric is represented as a linear function over a set of pairwise features, as defined in Equ. 5; a sketch of the flavor of such features appears after this list.</Paragraph>
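<Paragraph> Table 1 itself is not reproduced in this extraction, so the Python sketch below only illustrates the flavor of pairwise relational features between two names; the specific features are illustrative guesses, not the paper's actual feature set. </Paragraph>
```python
def pairwise_features(name1, name2):
    """Illustrative relational features between two names; the real
    feature set (Table 1) includes string, token-level and structural
    features, of which these are hypothetical stand-ins."""
    t1, t2 = name1.split(), name2.split()
    return [
        float(name1 == name2),              # exact string match
        float(t1[-1] == t2[-1]),            # last tokens (surnames?) match
        float(bool(set(t1) & set(t2))),     # any shared token
        float(t1[0][0] == t2[0][0]),        # matching first initials
        abs(len(t1) - len(t2)),             # token-count difference
    ]

print(pairwise_features("John F. Kennedy", "President Kennedy"))
# [0.0, 1.0, 1.0, 0.0, 1]
```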
<Paragraph position="8"> The above approaches (2), (4) and (5) learn a classifier or a distance metric using the same feature set as in Table 1. Different clustering algorithms are experimented with in (5), such as Single-Linkage, Complete-Linkage, Graph clustering (seeking a minimum cut of a nearest neighbor graph), Repeated Bisections, and K-medoids (Chu et al., 2001), a variation of K-Means.</Paragraph> <Paragraph position="9"> The number of entities in a data set is always given.</Paragraph> </Section> <Section position="8" start_page="68" end_page="70" type="metho"> <SectionTitle> 6 Experimental Study </SectionTitle> <Paragraph position="0"> Our experimental study focuses on (1) evaluating the supervised discriminative clustering approach on entity identification; (2) comparing it with existing pairwise classification and clustering approaches widely used in similar tasks; and (3) further analyzing the characteristics of this new framework.</Paragraph> <Paragraph position="1"> We use the TREC corpus to evaluate the different approaches in identifying the three types of entities: People, Locations and Organizations. For each type, we generate three pairs of training and test sets, each containing about 300 names. We note that the three entity types yield very different data sets, as exhibited by some of their statistical properties. Results on each entity type are averaged over the three sets, with ten runs of two-fold cross-validation for each. For SDC, given a training set with annotated name pairs, a distance function is first trained using the algorithm in Fig. 3 (for 20 iterations) with respect to a clustering algorithm, and is then used to partition the corresponding test set with the same algorithm.</Paragraph> <Paragraph position="2"> For a comparative evaluation, the outcomes of each approach on a test set of names are converted into a classification over all possible pairs of names (including non-matching pairs). Only the examples in the set M_p, those predicted to belong to the same entity (positive predictions), are used in the evaluation, and are compared with the set M_a of examples annotated as positive. The performance of an approach is then evaluated by the F_1 value, defined as F_1 = 2|M_p ∩ M_a| / (|M_p| + |M_a|); a sketch of this pairwise evaluation follows.</Paragraph>
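<Paragraph> A minimal Python sketch of the pairwise conversion and F1 computation just described; representing each partition as an array of cluster indices is our choice of interface. </Paragraph>
```python
from itertools import combinations

def pairwise_f1(pred, gold):
    """Convert two partitions (arrays of cluster indices over the same
    names) into positive pair sets M_p and M_a, and return
    F1 = 2|M_p & M_a| / (|M_p| + |M_a|)."""
    n = len(pred)
    Mp = {(i, j) for i, j in combinations(range(n), 2) if pred[i] == pred[j]}
    Ma = {(i, j) for i, j in combinations(range(n), 2) if gold[i] == gold[j]}
    if not Mp or not Ma:
        return 0.0
    return 2 * len(Mp & Ma) / (len(Mp) + len(Ma))

print(pairwise_f1([0, 0, 1, 1], [0, 0, 0, 1]))  # 2*1/(2+3) = 0.4
```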
<Section position="1" start_page="68" end_page="68" type="sub_section"> <SectionTitle> 6.1 Comparison of Different Approaches </SectionTitle> <Paragraph position="0"> Fig. 4 presents the performance of the different approaches (described in Sec. 5) on identifying the three entity types. [Figure 4: Performance of different approaches. The Single-Linkage algorithm is applied whenever clustering is performed. Results are reported in F1 and averaged over the three data sets for each entity type and 10 runs of two-fold cross-validation. Each training set typically contains 300 annotated names.] We experimented with different clustering algorithms, but only the results with Single-Linkage are reported for Clustering over LMR (P|W) and SDC, since they are the best.</Paragraph> <Paragraph position="1"> SDC works well for all three entity types, in spite of their different characteristics. The best F1 values of SDC are 92.7%, 92.4% and 95.7% for People, Locations and Organizations respectively, a 20%-30% error reduction compared with the best performance of the other approaches. This is an indication that this new approach, which integrates metric learning and supervision in a unified framework, has significant advantages. (We note that in this experiment, the relative comparison between the pairwise classifiers and the clustering approaches built over them is not consistent across entity types. This can be partially explained by the theoretical analysis in (Li et al., 2004) and by the differences between the entity types.)</Paragraph> </Section> <Section position="2" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 6.2 Further Analysis of SDC </SectionTitle> <Paragraph position="0"> In the next experiments, we further analyze the characteristics of SDC by evaluating it in different settings.</Paragraph> <Paragraph position="1"> Different Training Sizes. Fig. 5 reports the relationship between the performance of SDC and different training sizes. [Figure 5: Learning curves. The Single-Linkage algorithm is applied whenever clustering is performed. The X-axis denotes different percentages of the 300 names used in training. Results are reported in F1 and averaged over the three data sets for each entity type.] The learning curves for the other learning-based approaches are also shown. We find that SDC exhibits good learning ability with limited supervision. When training examples are very limited, for example only 10% of all 300 names, pairwise classifiers based on Perceptron and Winnow exhibit an advantage over SDC. However, when supervision becomes reasonable (30%+ of the examples), SDC starts to outperform all the other approaches.</Paragraph> <Paragraph position="2"> Different Clustering Algorithms. Fig. 6 shows the performance of applying different clustering algorithms (see Sec. 5) within the SDC approach; a toy illustration of this algorithm-dependence appears at the end of this subsection. [Figure 6: Different clustering algorithms compared in SDC (α = 100.0). Results are averaged over the three data sets for each entity type and 10 runs of two-fold cross-validation.] Single-Linkage and Complete-Linkage outperform all the other algorithms. One possible reason is that this task has a great number of classes (100-200 entities) for the 300 names in each single data set. The results indicate that the metric learning process relies on properties of the data set, as well as on the clustering algorithm. Even if a good distance metric can be learned in SDC, choosing an appropriate algorithm for the specific task is still important.</Paragraph> <Paragraph position="3"> Different Learning Rates. We also experimented with different learning rates in the SDC approach, as shown in Fig. 7. [Figure 7: SDC with different learning rates (α = 1.0, 10.0, 100.0, 1000.0). The Single-Linkage clustering algorithm is applied.] SDC does not appear to be very sensitive to the learning rate, as long as it is within a reasonable range.</Paragraph> </Section>
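<Paragraph> As a toy illustration (ours, on synthetic data) of how the choice of algorithm matters even under a fixed metric, the following sketch contrasts Single-Linkage and Complete-Linkage on two elongated groups, where chaining tends to favor Single-Linkage; a learned weighted metric would replace the plain pdist call. </Paragraph>
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two elongated, parallel groups of 20 points each.
rng = np.random.default_rng(1)
X = np.vstack([np.c_[np.linspace(0, 5, 20), rng.normal(0.0, 0.1, 20)],
               np.c_[np.linspace(0, 5, 20), rng.normal(2.0, 0.1, 20)]])
d = pdist(X)  # a learned weighted metric would be substituted here

for method in ("single", "complete"):
    labels = fcluster(linkage(d, method=method), 2, criterion="maxclust")
    # Single-Linkage tends to follow each chain; Complete-Linkage may
    # instead split the elongated groups across their long axis.
    print(method, labels[:20].tolist(), labels[20:].tolist())
```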
<Section position="3" start_page="69" end_page="70" type="sub_section"> <SectionTitle> 6.3 Discussion </SectionTitle> <Paragraph position="0"> That SDC outperforms existing clustering approaches can be explained by its advantage of training the distance function with respect to the chosen clustering algorithm, guided by supervision; but this does not explain why it also outperforms the pairwise classifiers. One intuitive explanation is that supervision in the entity identification task, and in similar tasks, is typically given at the entity level: it says whether two names correspond to the same entity, not necessarily whether they are similar in appearance. For example, "Brian" and "Wilson" could both refer to a person "Brian Wilson" in different contexts, and thus this name pair is a positive example in training a pairwise classifier. However, with features that only capture the appearance similarity between names, such apparently different names become training noise; this is exactly what happened when we trained the LMR classifier with such name pairs. SDC, by contrast, can exploit this entity-level annotation and avoid the problem through transitivity in clustering: in the above example, if "Brian Wilson" occurs in the data set, then "Brian" and "Wilson" can both be clustered into the same group as "Brian Wilson". Such cases occur less frequently for locations and organizations, but they still exist.</Paragraph> </Section> </Section> <Section position="9" start_page="70" end_page="70" type="metho"> <SectionTitle> 7 Conclusion </SectionTitle> <Paragraph position="0"> In this paper, we explicitly formalize clustering as a learning task and propose a unified framework for training a metric for any chosen clustering algorithm, guided by domain-specific supervision. Our experiments exhibit the advantage of this approach over existing approaches on entity identification. Further research in this direction will focus on (1) applying it to more NLP tasks, e.g., coreference resolution; (2) analyzing the related theoretical issues, e.g., the convergence of the algorithm; and (3) comparing it experimentally with related approaches, such as (Xing et al., 2002) and (McCallum and Wellner, 2003).</Paragraph> <Paragraph position="1"> Acknowledgement: This research is supported by NSF grants IIS-9801638 and ITR IIS-0085836, an ONR</Paragraph> </Section> </Paper>