<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3244">
  <Title>Learning Nonstructural Distance Metric by Minimum Cluster Distortions</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Related Works
</SectionTitle>
    <Paragraph position="0"> The issues above of feature correlations and feature weightings can be summarized as a problem of defining an appropriate metric in the feature space, based on the distribution of data. This problem has recently been highlighted in the field of machine learning research. (Xing et al., 2002) has an objective that is quite similar to that of this paper, and gives a metric matrix that resembles ours based on sample pairs of &amp;quot;similar points&amp;quot; as training data. (Bach and Jordan, 2004) and (Schultz and Joachims, 2004) seek to answer the same problem with an additional scenario of spectral clustering and relative comparisons in Support Vector Machines, respectively. In this aspect, our work is a straight successor of (Xing et al., 2002) where its general usage in vector space is preserved. We offer a discussion on the similarity to our method and our advantages 1When we normalize the length of the vectors j~uj = j~vj = 1 as commonly adopted, (~u ~v)T (~u ~v) = j~uj2 + j~vj2 2~u ~v / ~u ~v = cos(~u;~v) ; therefore, this includes a cosine similarity (Manning and Sch&amp;quot;utze, 1999).</Paragraph>
    <Paragraph position="1"> in section 6. Finally, we note that the Fisher kernel of (Jaakkola and Haussler, 1999) has the same concept that gives an appropriate similarity of two data through the Fisher information matrix obtained from the empirical distribution of data. However, it is often approximated by a unit matrix because of its heavy computational demand.</Paragraph>
    <Paragraph position="2"> In the field of information retrieval, (Jiang and Berry, 1998) proposes a Riemannian SVD (R-SVD) from the viewpoint of relevance feedback. This work is close in spirit to our work, but is not aimed at defining a permanent distance function and does not utilize cluster structures existent in the training data.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Defining an Optimal Metric
</SectionTitle>
    <Paragraph position="0"> To solve the problems in section 2, we note the function that synonymous clusters play. There are many levels of (more or less) synonymous clusters in linguistic data: phrases, sentences, paragraphs, documents, and, in a web environment, the site that contains the document. These kinds of clusters can often be attributed to linguistic expressions because they nest in general so that each expression has a parent cluster.</Paragraph>
    <Paragraph position="1"> Since these clusters are synonymous, we can expect the vectors in each cluster to concentrate in the ideal feature space. Based on this property, we can introduce an optimal weighting and correlation in a supervised fashion. We will describe this method below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The Basic Idea
</SectionTitle>
      <Paragraph position="0"> As stated above, vectors in the same cluster must have a small distance between each other in the ideal geometry. When we measure an L2-distance between ~u and ~v by a Mahalanobis distance parameterized by M:</Paragraph>
      <Paragraph position="2"> where symmetric metric matrix M gives both corresponding feature weights and feature correlations.</Paragraph>
      <Paragraph position="3"> When we take M = I (unit matrix), we recover the original Euclidean distance (1).</Paragraph>
      <Paragraph position="4"> Equation (2) can be rewritten as (3) because M is symmetric:</Paragraph>
      <Paragraph position="6"> Therefore, this distance amounts to a Euclidean distance in M1=2-mapped space (Xing et al., 2002).</Paragraph>
      <Paragraph position="7"> Note that this distance is global, and different from the ordinary Mahalanobis distance in pattern recognition (for example, (Duda et al., 2000)) that is defined for each cluster one by one, using a cluster-specific covariance matrix. That type of distance cannot be generalized to new kinds of data; therefore, it has been used for local classifications. What we want is a global distance metric that is generally useful, not a measure for classification to predefined clusters. In this respect, (Xing et al., 2002) shares the same objective as ours.</Paragraph>
      <Paragraph position="8"> Therefore, we require an optimization over all the clusters in the training data. Generally, data in the clusters are distributed as in figure 1(a), comprising ellipsoidal forms that have high (co)variances for some dimensions and low (co)variances for other dimensions. Further, the cluster is not usually aligned to the axes of coordinates. When we find a global metric matrix M that minimizes the cluster distortions, namely, one that reduces high variances and expands low variances for the data to make a spherical form as good as possible in the M1=2-mapped space (figure 1(b)), we can expect it to capture necessary and unnecessary variations and correlations on the features, combining information from many clusters to produce a more reliable metric that is not locally optimal. We will find this optimal M below.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Global optimization over clusters
</SectionTitle>
      <Paragraph position="0"> Suppose that each data (for example, sentences or documents) is a vector ~s 2 Rn, and the whole corpus can be divided into N clusters, X1 ::: XN . That is, each vector has a dimension n, and the number of clusters is N. For each cluster Xi, cluster centroid ci is calculated as ~ci = 1=jXijP~s2Xi ~s , where jXj denotes the number of data in X. When necessary, each element in ~sj or ~ci is referenced as sjk or cik (k = 1::: n).</Paragraph>
      <Paragraph position="1"> The basic idea above is formulated as follows.</Paragraph>
      <Paragraph position="2"> We seek the metric matrix M that minimizes the distance between each data ~sj and the cluster centroid ~ci, dM(~sj;~ci) for all clusters X1 ::: XN . Mathematically, this is formulated as a quadratic minimization problem</Paragraph>
      <Paragraph position="4"> Scale constraint (5) is necessary for excluding a degenerate solution M = O. 1 is an arbitrary constant: when we replace 1 by c, c2M becomes a new solution. This minimization problem is an extension to the method of MindReader (Ishikawa et al., 1998) to multiple clusters, and has a unique solution below.</Paragraph>
      <Paragraph position="5"> Theorem The matrix that solves the minimization</Paragraph>
      <Paragraph position="7"> When A is singular, we can use as A 1 a Moore-Penrose matrix pseudoinverse A+. Generally, A consists of linguistic features and is very sparse, and often singular. Therefore, A+ is nearly always necessary for the above computation. For details, see</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Generalization
</SectionTitle>
      <Paragraph position="0"> While we assumed through the above construction that each cluster is equally important, this is not the case in general. For example, clusters with a small number of data may be considered weak, and in the hierarchical clustering situation, a &amp;quot;grandmother&amp;quot; cluster may be weaker. If we have confidences 1 ::: N for the strength of clustering for each cluster X1 ::: XN , this information can be incorporated into (4) by a set of normalized cluster</Paragraph>
      <Paragraph position="2"> where i = i= PNj=1 j , and we obtain a respectively weighted solution in (7). Further, we note that when N = 1, this metric recovers the ordinary Mahalanobis distance in pattern recognition. However, we used equal weights for the experiments below because the number of data in each cluster was approximately equal.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We evaluated our metric distance on the three tasks of synonymous sentence retrieval, document retrieval, and the K-means clustering of general vectorial data. After calculating M on the training data of clusters, we applied it to the test data to see how well its clusters could be recovered. As a measure of cluster recovery, we use 11-point average precision and R-precision for the distribution of items of the same cluster in each retrieval result. Here, R equals the cardinality of the cluster; therefore, R-precision shows the precision of cluster recovery.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Synonymous sentence retrieval
5.1.1 Sentence cluster corpus
</SectionTitle>
      <Paragraph position="0"> We used a paraphrasing corpus of travel conversations (Sugaya et al., 2002) for sentence retrieval.</Paragraph>
      <Paragraph position="1"> This corpus consists of 33,723,164 Japanese translations, each of which corresponds to one of the original English sentences. By way of this correspondence, Japanese sentences are divided into 10,610 clusters. Therefore, each cluster consists of Japanese sentences that are possible translations from the same English seed sentence that the cluster has. From this corpus, we constructed 10 sets of data. Each set contains random selection of 200 training clusters and 50 test clusters, and each cluster contains a maximum of 100 sentences 2. Experiments were conducted on these 10 datasets for each level of dimensionality reduction (see below) to produce average statistics.</Paragraph>
      <Paragraph position="2">  As a feature of a sentence, we adopted unigrams of all words and bigrams of functional words from the part-of-speech tags, because the sequence of functional words is important in the conversational corpus. null While the lexicon is limited for travel conversations, the number of features exceeds several thousand or more. This may be prohibitive for the calculation of the metric matrix, therefore, we additionally compressed the features with SVD, the same method used in Latent Semantic Indexing (Deerwester et al., 1990).</Paragraph>
      <Paragraph position="3">  Qualitative result Figure 5 (last page) shows a sample retrieval result. A sentence with (*) mark at the end is the correct answer, that is, a sentence from the same original cluster as the query. We can see that the results with the metric distance contain 2When the number of data in the cluster exceeds this limit, 100 sentences are randomly sampled. All sampling are made without replacement.</Paragraph>
      <Paragraph position="4"> less noise than a standard Euclid baseline with tf.idf weighting, achieving a high-precision retrieval. Although the high rate of dimensionality reduction in figure 6 shows degradation due to the dimension contamination, the effect of metric distance is still apparent despite bad conditions.</Paragraph>
      <Paragraph position="5"> Quantitative result Figure 2 shows the averaged precision-recall curves of retrieval and figure 3 shows 11-point average precisions, for each rate of dimensionality reduction. Clearly, our method achieves higher precision than the standard method, and does not degrade much with feature compressions unless we reduce the dimension too much, i.e., to &lt; 5%.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Document retrieval
</SectionTitle>
      <Paragraph position="0"> As a method of tackling clusters of texts, the text classification task has recently made great advances with a Na&amp;quot;ive Bayes or SVM classifiers (for example, (Joachims, 1998)). However, they all aim at classifying texts into a few predefined clusters, and cannot deal with a document that fits neither of the clusters. For example, when we regard a website as a cluster of documents, the possible clusters are numerous and constantly increasing, which precludes classificatory approaches. For these circumstances, document clustering or retrieval will benefit from a global distance metric that exploits the multitude of cluster structures themselves.</Paragraph>
      <Paragraph position="1">  For this purpose, we used the 20-Newsgroup dataset (Lang, 1995). This is a standard text classification dataset that has a relatively large number of classes, 20. Among the 20 newsgroups, we selected 16 clusters of training data and 4 clusters of test data, and performed 5-fold cross validation. The maximum number of documents per cluster is 100, and when it exceeds this limit, we made a random sampling of 100 documents as the sentence retrieval experiment.</Paragraph>
      <Paragraph position="2"> Because our proposed metric is calculated from the distribution of vectors in high-dimensional feature space, it becomes inappropriate if the norm of the vectors (largely proportional to document length) differs much from document to document.</Paragraph>
      <Paragraph position="3"> 3 Therefore, we used subsampling/oversampling to form a median length (130 words) on training documents. Further, we preprocessed them with tf.idf as a baseline method.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5.2.2 Results
</SectionTitle>
    <Paragraph position="0"> Table 1 shows R-precision and 11-point average precision. Since the test data contains 4 clusters, the baselines of precision are 0.25. We can see from both results that metric distance produces a better retrieval over the tf.idf and dot product. However, refinements in precision are certain (average p = 0.0243) but subtle.</Paragraph>
    <Paragraph position="1"> This can be thought of as the effect of the dimensionality reduction performed. We first decompose data matrix X by SVD: X = USV 1 and build a k-dimensional compressed representation Xk = VkX; where Vk denotes a k-largest submatrix of V .</Paragraph>
    <Paragraph position="2"> From the equation (3), this means a Euclidean distance of M1=2Xk = M1=2VkX. Therefore, Vk may subsume the effect of M in a preprocessing stage.</Paragraph>
    <Paragraph position="3"> Close inspection of table 1 shows this effect as a tradeoff between M and Vk. To make the most of metric distance, we should consider metric induction and dimensionality reduction simultaneously, or reconsider the problem in kernel Hilbert space.</Paragraph>
    <Paragraph position="4">  them to a high-dimensional hypersphere; this proved to produce an unsatisfactory result. Defining metrics that work on a hypersphere like spherical K-means (Dhillon and Modha, 2001) requires further research.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 K-means clustering of general vectorial data
</SectionTitle>
      <Paragraph position="0"> data Metric distance can also be used for clustering or general vectorial data. Figure 4 shows the K-means clustering result of applying our metric distance to some of the UCI Machine Learning datasets (Blake and Merz, 1998). K-means clustering was conducted 100 times with a random start, where K equals the known number of classes in the data 4. Clustering precision was measured as an average probability that a randomly picked pair of data will conform to the true clustering (Xing et al., 2002). We also conducted the same clustering for documents of the 20-Newsgroup dataset to get a small increase in precision like the document retrieval experiment in section 5.2.</Paragraph>
      <Paragraph position="1">  Learning dataset results. The horizontal axis shows compressed dimensions (rightmost is original). The right bar shows clustering precision using Metric distance, and the left bar shows that using Euclidean distance.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> In this paper, we proposed an optimal distance metric based on the idea of minimum cluster distortion in training data. Although vector distances have frequently been used in natural language processing, this is a rather neglected but recently highlighted problem. Unlike recently proposed methods with spectral methods or SVMs, our method assumes no such additional scenarios and can be considered as 4Because of the small size of the dataset, we did not apply cross-validation as in other experiments.</Paragraph>
    <Paragraph position="1"> a straight successor to (Xing et al., 2002)'s work.</Paragraph>
    <Paragraph position="2"> Their work has the same perspective as ours, and they calculate a metric matrix A that is similar to ours based on a set S of vector pairs (~xi;~xj) that can be regarded as similar. They report that the effectiveness of A increases as the number of the training pairs S increases; this requires O(n2) sample points from n training data, and must be optimized by a computationally expensive Newton-Raphson iteration. On the other hand, our method uses only linear algebra, and can induce an ideal metric using all the training data at the same time. We believe this metric can be useful for many vector-based language processing methods that have used cosine similarity. null There remains some future directions for research. First, as we stated in section 4.3, the effect of a cluster weighted generalized metric must be investigated and optimal weighting must be induced.</Paragraph>
    <Paragraph position="3"> Second, as noted in section 5.2.1, the dimensionality reduction required for linguistic data may constrain the performance of the metric distance. To alleviate this problem, simultaneous dimensionality reduction and metric induction may be necessary, or the same idea in a kernel-based approach is worth considering. The latter obviates the problem of dimensionality, while it restricts the usage to a situation where the kernel-based approach is available.</Paragraph>
  </Section>
class="xml-element"></Paper>