<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1808"> <Title>Discovering Synonyms and Other Related Words</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Corpus Data </SectionTitle> <Paragraph position="0"> Our corpus consists of nouns in a sentence context. We used all the nouns (in base form) that occurred more than 100 times (in any inflected form) in a corpus of Finnish newspaper text.</Paragraph> <Paragraph position="1"> The corpus contained 245000 documents totaling 48 million words of the Finnish newspaper Helsingin sanomat from 1995-1997. Excluding TV and radio listings, there were 196000 documents with 42 million words. As corpus data we selected all the 17835 nouns occurring more than 100 times, comprising 14 million words of the corpus.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Methodology </SectionTitle> <Paragraph position="0"> First we present the types of features we have extracted from the corpus. Then we briefly describe the similarity measure which we use in order to calculate the similarity between the nouns in the corpus data. We also introduce a method for creating derived similarity information in a low-dimensional space. Finally we present the clustering algorithms which we apply to the similarity information.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Feature extraction </SectionTitle> <Paragraph position="0"> The present experiments aim at discovering the nouns that are most similar in meaning to a given noun. The assumption is that words occurring in similar syntactic contexts belong to the same semantic categories (Harris, 1968). In order to determine the similarity of the syntactic contexts, we represent a word w as a probability distribution over a set of features a occurring in the context of w: P(a|w). The context features a are the major class words w′ (nouns, adjectives and verbs) with direct dependency links to the word w. The context feature is the word w′ in base form labeled with the dependency relation r. For example, the noun might occur as an object of a verb and with an adjective modifier; both the verb and the adjective, including their dependency relations, are context features.</Paragraph> <Paragraph position="1"> We used Connexor's dependency parser FDG for Finnish (Connexor, 2002) for parsing the corpus. A sample of the parser output is shown in Table 1. Tokens of each sentence are numbered starting from zero, each token is on its own line, the token number first, the actual word form second and the base form in the third field. The fourth field links dependent tokens to their heads using a grammatical label and the number of the head token. The fifth field contains morphosyntactic information.</Paragraph> [Table 1 columns: # | Token | Base form | Dependency | Morphosyntax | Gloss] <Paragraph position="2"> Two tokens, 3 and 5, are labeled as nouns N.</Paragraph> <Paragraph position="3"> The noun video is a direct object of the verb esittää, and the noun filmi is coordinated with video, so video gets two feature occurrences from this sentence: esittää-obj and cc-filmi.</Paragraph> <Paragraph position="4"> Also, filmi gets video-cc as a feature occurrence. The pronoun toinen is not a potential feature because of its word class and because it is not linked. The coordinating conjunction ja is not a potential feature because of its word class.</Paragraph>
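<Paragraph> As an illustration of the feature extraction step, the following Python sketch shows one way to collect relation-labeled context features for nouns and normalize them into the distributions P(a|w). It is not the code used in the experiments, and it assumes a simplified token structure rather than the actual FDG output format.

    # Minimal sketch, assuming each token is given as
    # (index, base form, coarse part of speech, head index, relation to head).
    from collections import Counter, namedtuple

    Token = namedtuple("Token", "index base pos head relation")
    MAJOR_CLASS = {"N", "A", "V"}  # nouns, adjectives, verbs

    def noun_features(sentence):
        """Yield (noun, feature) pairs for one parsed sentence: a noun gets
        'head-relation' for the word it depends on (e.g. esittää-obj) and
        'relation-dependent' for each word depending on it (e.g. cc-filmi)."""
        by_index = {t.index: t for t in sentence}
        for t in sentence:
            if t.pos != "N":
                continue
            head = by_index.get(t.head)
            if head is not None and head.pos in MAJOR_CLASS:
                yield t.base, head.base + "-" + t.relation
            for dep in sentence:
                if dep.head == t.index and dep.pos in MAJOR_CLASS:
                    yield t.base, dep.relation + "-" + dep.base

    def feature_distributions(sentences):
        """Count feature occurrences per noun and normalize them to P(a|w)."""
        counts = {}
        for sentence in sentences:
            for noun, feature in noun_features(sentence):
                counts.setdefault(noun, Counter())[feature] += 1
        distributions = {}
        for w, c in counts.items():
            total = sum(c.values())
            distributions[w] = {a: n / total for a, n in c.items()}
        return distributions
</Paragraph>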
<Paragraph position="5"> The parsed corpus contained a total of 18516609 unambiguous noun occurrences, 69314 noun/verb ambiguities, 39104 noun/adjective ambiguities, 20847 noun/adverb ambiguities and 11739 noun/numeral ambiguities, i.e. the amount of remaining ambiguities was less than 0.8%.</Paragraph> <Paragraph position="6"> When the parser's analyses were underspecified, with more than one morphological analysis remaining, we took the relatively small risk (p < 0.008) of committing to a noun analysis.</Paragraph> <Paragraph position="7"> As a straightforward weighting of the context features of a word, we used the number of occurrences with all the instances of the word. In our choice of similarity formula, the representation of a word w must be a probability distribution. This is formally just a matter of normalizing the weights of the features. Thus, a word w is represented as w : a ↦ P(a|w), i.e. the conditional probability distribution of all features a given the word w, such that Σ_a P(a|w) = 1. Extracting features only from direct dependency relations produces few feature occurrences for each instance of a noun. This keeps the number of distinct features tolerable for all but the most frequent words, and still retains the most promising co-occurring words. As we use only linear frequency weighting, very frequent features tend to get more weight than they should. Additionally, many rare features could have been dropped without much loss of information.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Similarity calculations </SectionTitle> <Paragraph position="0"> In (Weeds, 2003; Lee, 2001), among others, the information radius is applied to finding words that can be used as proxies or substitutes for one another. Their tests show that the information radius is among the best measures for finding such words.</Paragraph> <Paragraph position="1"> Here we briefly recapitulate the details of the similarity estimate, which is rather an estimate of dissimilarity.</Paragraph> <Paragraph position="2"> Two words are distributionally similar to the extent that they co-occur with the same words, i.e., to the extent that they share features. We define the dissimilarity of two words, p and q, as</Paragraph> <Paragraph position="3"> J(p,q) = ½ D(p‖m) + ½ D(q‖m), </Paragraph> <Paragraph position="4"> where D(p‖m) = Σ_a p(a)(log2 p(a) − log2 m(a)) and m(a) = (p(a) + q(a))/2 for any feature a.</Paragraph> <Paragraph position="5"> This is the symmetrically weighted case of the Jensen-Shannon divergence (Lin, 1991), also known as the information radius or the mean divergence to the mean (Dagan et al., 1999). For complete identity, J(p,p) = 0. For completely disjoint feature sets, J(p,q) = 1. The formula is symmetric but does not satisfy the triangle inequality. For speed, the estimate may be calculated from the shared features alone (Lee, 1999). After calculating all the pairwise estimates, we retained lists of the 100 most similar nouns for each of the nouns in the corpus data. No other data is used in the similarity calculations.</Paragraph>
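<Paragraph> For concreteness, the following Python sketch computes this dissimilarity estimate for two words represented as feature distributions. It is an illustrative re-implementation, not the code used in the experiments, and it does not include the shared-feature speed-up of Lee (1999).

    # Minimal sketch, assuming p and q are dicts mapping context features
    # to probabilities that each sum to one (the distributions P(a|w) above).
    from math import log2

    def information_radius(p, q):
        """Jensen-Shannon divergence J(p, q) with log base 2:
        0 for identical distributions, 1 for disjoint feature sets."""
        m = {a: (p.get(a, 0.0) + q.get(a, 0.0)) / 2
             for a in set(p) | set(q)}

        def d(x):
            # D(x || m); features with x(a) = 0 contribute nothing.
            return sum(x[a] * (log2(x[a]) - log2(m[a])) for a in x if x[a] > 0)

        return 0.5 * d(p) + 0.5 * d(q)
</Paragraph>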
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Low-dimensional similarity measures </SectionTitle> <Paragraph position="0"> Performing all the calculations in high-dimensional feature space is time-consuming. Here we introduce a method that can be used as an approximation in low-dimensional feature space based on the initial similarity estimates.</Paragraph> <Paragraph position="1"> Assume that we have lists of the words that are distributionally most similar to a given word w. Each list Lw contains 100 words with an estimate of their similarity to w. The words in Lw represent a mix of the different meanings of the word w. We create a similarity matrix disw for these words such that disw(p,q) = J(p,q), where p, q ∈ Lw. The similarity matrix disw is a symmetric matrix of the dimensions 101 by 101, as we also include the word w in the matrix.</Paragraph> <Paragraph position="2"> A vector pw = disw(p, ·) in the similarity matrix disw is regarded as a projection of the word p from a high-dimensional feature space onto a 101-dimensional space, i.e. p is projected onto the 101 most important dimensions of w.</Paragraph> <Paragraph position="3"> The new matrix is not orthogonal, so we apply singular value decomposition (SVD), disw = T S D, and use T to rotate the matrix so that the first axis runs along the direction of the largest variation among the word similarity estimates, the second dimension runs along the direction of the second largest variation, and so forth. After this rotation we can cluster the new vectors pw,T = T^t pw as low-dimensional representatives of the original high-dimensional feature space. Often SVD is used for dimensionality reduction, but here we use its left singular vectors only for rotating the matrix in order to achieve noise reduction during clustering.</Paragraph> <Paragraph position="4"> In the new low-dimensional vector representation pw,T we apply the cosine distance</Paragraph> <Paragraph position="5"> cosd(pw,T, qw,T) = 1 − (pw,T · qw,T) / (‖pw,T‖ ‖qw,T‖) </Paragraph> <Paragraph position="6"> to calculate the similarity between words. As a comparison we also tried the squared Euclidean distance eucld(pw,T, qw,T) = ‖pw,T − qw,T‖² between words in the low-dimensional space. We first normalize the vectors to unit length, which effectively makes the squared Euclidean distance the same as two times the cosine distance:</Paragraph> <Paragraph position="7"> ‖pw,T − qw,T‖² = 2 − 2 (pw,T · qw,T) = 2 cosd(pw,T, qw,T) for unit-length pw,T and qw,T. </Paragraph>
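<Paragraph> The following Python sketch illustrates the rotation and the two distance measures. It is not the authors' MATLAB code, and it assumes that the 101-by-101 matrix disw is available as a NumPy array whose rows are the vectors pw.

    # Minimal sketch of the SVD rotation and the low-dimensional distances.
    import numpy as np

    def rotate(dis_w):
        """Rotate the rows of dis_w with its left singular vectors:
        row i of the result is p_{w,T} = T^t p_w (no dimensions are dropped)."""
        T, _, _ = np.linalg.svd(dis_w)   # dis_w = T S D^t
        return dis_w @ T

    def cosd(p, q):
        """Cosine distance: 0 for vectors pointing in the same direction."""
        return 1.0 - float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

    def eucld(p, q):
        """Squared Euclidean distance; equals 2 * cosd(p, q) for unit-length vectors."""
        return float(np.sum((p - q) ** 2))
</Paragraph>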
</Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Clustering </SectionTitle> <Paragraph position="0"> When we wish to discover the potential senses of w by clustering, we are currently only interested in the 100 words in Lw with a similarity estimate for w. The other words are deemed to be too dissimilar to w to be relevant.</Paragraph> <Paragraph position="1"> We cluster the words related to w with standard algorithms such as complete-link and average-link clustering (Manning and Schütze, 1999). Complete-link and average-link are hierarchical clustering methods. We compare them with flat clustering methods like k-means and self-organizing maps (SOM) (Kohonen, 1997).</Paragraph> <Paragraph position="2"> In k-means the clusters have no ordering. The potential benefit of using SOM with a two-dimensional display compared to k-means is that related data samples get assigned to nearby clusters as the SOM converges, forming cluster areas with related content.</Paragraph> <Paragraph position="3"> We use the MATLAB implementation (The MathWorks, Inc., 2002) of the algorithms. We use both the original similarity measures in disw and the distance measures cosd and eucld, which we defined on the low-dimensional space.</Paragraph> <Paragraph position="4"> In order to use methods like k-means and SOM, we need to be able to calculate the similarity between cluster centroids and words to be clustered each time a centroid is updated. We do this in the low-dimensional space pw,T using cosd and eucld.</Paragraph> <Paragraph position="5"> For SOM, the MATLAB implementation supported only the squared Euclidean distance. It should be noted that the centroids are not necessarily of unit length, so the squared Euclidean distance is different from the cosine distance between the samples and the centroids when the centroids are based on more than one sample.</Paragraph> <Paragraph position="6"> Our clustering setup currently produces hard clusters, where each word in Lw belongs to one cluster, as opposed to soft clustering, where a word may belong to several clusters. We call the cluster containing the word w itself the key cluster.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Evaluation methodology </SectionTitle> <Paragraph position="0"> In order to evaluate the quality of the clusters we need a gold standard. English and a number of other languages have resources such as WordNet (Fellbaum, 1998; Vossen, 2001). For Finnish there is no WordNet and there are no large on-line synonym dictionaries available. In fact, our experiment can be seen as a feasibility study for automatically extracting information that could be used for building a WordNet for Finnish. The synsets of WordNet contain synonyms, so we can evaluate the feasibility of the clusters for WordNet development by rating the number of synonyms and related words in the discovered clusters.</Paragraph> [Table: … the back translations into Finnish; the shared back translations vaje, vajaus, alijäämä and tilivajaus are highlighted.] <Paragraph position="2"> We note that when translating a word from the source language, the meaning of the word is rendered in a target language. Such meaning-preserving relations are available in translation dictionaries. If we translate into the target language and back, we end up, among other things, with the synonyms of the original source language word. In addition, we may also get some spurious words that are related to other meanings of the target language words. If we assume that the other words represent spurious cases of polysemy or homonymy in the target language, we can reduce the impact of these spurious words by considering several target languages: for each source word we use only the back-translated source words that are common to all the target languages. We call such a group of words a source word synonym set. For an example, see the back translations in the table above.</Paragraph>
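<Paragraph> A sketch of this construction in Python is given below. The helpers translate() and back_translate() are hypothetical stand-ins for dictionary look-ups in one language pair; the MOT dictionaries used in this work are not assumed to offer such an interface.

    # Minimal sketch of building a source word synonym set by intersecting
    # back translations over several target languages.
    def translate(word, pair):
        """Hypothetical look-up: Finnish word -> words in the target language."""
        raise NotImplementedError

    def back_translate(word, pair):
        """Hypothetical look-up: target language word -> Finnish words."""
        raise NotImplementedError

    def synonym_set(source_word, language_pairs, corpus_nouns):
        """Return the back-translated words common to all target languages,
        restricted to nouns present in the corpus data."""
        per_language = []
        for pair in language_pairs:          # e.g. fi-en, fi-de, fi-fr
            back = set()
            for target_word in translate(source_word, pair):
                back.update(back_translate(target_word, pair))
            per_language.append(back)
        common = set.intersection(*per_language).intersection(corpus_nouns)
        common.add(source_word)
        return common                        # kept only if more than one word remains
</Paragraph>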
<Paragraph position="3"> In addition to the mechanical rating of the synonym content, we also manually classified the words of some cluster samples into synonymy, antonymy, hyperonymy, hyponymy, complementarity and other relations.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Evaluation data </SectionTitle> <Paragraph position="0"> In order to evaluate the clusters we picked a random sample of 1759 nouns from the corpus data, which represented approximately 10% of the words we had clustered. For these words we extracted the translations in the Finnish-English, Finnish-German and Finnish-French MOT dictionaries (Kielikone, 2004) available in electronic form. We then translated each target language word back into Finnish using the same resources. The dictionaries are based on extensive hand-made dictionaries. The choice of words may be slightly different in each of them, which means that the words in common for all the dictionaries after the back translation tend to be only the core synonyms.</Paragraph> <Paragraph position="1"> For evaluation purposes it would be unfair to demand that the clustering generate words into the clusters that are not in the corpus data, so we also removed those back translations from the source word synonym sets. Finally, only synonym sets that had more than one word remaining were interesting, i.e. they contained more than the original source word. There were 453 of the 1759 test words that met the qualifications. The average number of synonyms or back translations for these test words was 3.53, including the source word itself.</Paragraph> <Paragraph position="2"> For manual classification we used a sample of 50 key clusters from the whole set of clusters and an additional sample of 50 key clusters from the words qualifying for the mechanical evaluation.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Evaluation method </SectionTitle> <Paragraph position="0"> The mechanical evaluation was performed by picking the key cluster produced by a clustering algorithm for each of the test words. The key cluster was the cluster which contained the original source word. The evaluation was a simple overlap calculation with the gold standard generated from the translation dictionaries. By counting the number of cluster words in a source word synonym set and dividing by the synonym set size, we get the recall R. By counting the number of source word synonyms in a cluster and dividing by the cluster size, we get the precision P.</Paragraph>
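<Paragraph> In code, the two overlap scores amount to the following Python sketch (an illustration, not the evaluation script used in the experiments):

    # Minimal sketch of the overlap evaluation for one test word.
    def overlap_scores(key_cluster, synonym_set):
        hits = len(set(key_cluster).intersection(synonym_set))
        recall = hits / len(synonym_set)       # shared words / synonym set size
        precision = hits / len(key_cluster)    # shared words / cluster size
        return recall, precision
</Paragraph>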
<Paragraph position="1"> The manual evaluation was performed independently by the two authors and an external linguist. We then discussed the result in order to arrive at a common view.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Testing </SectionTitle> <Paragraph position="0"> First we did some initial experimenting with a preliminary test sample in order to tune the parameters. We then clustered the corpus data and evaluated the clusters against the gold standard, which gave an estimate of the synonym content of the clusters. In addition, we performed a manual evaluation of the result of the best clustering algorithm.</Paragraph> [Table: … with a standard deviation of 2%, using a denoised and a noisy low-dimensional feature space.] <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Parameter selection </SectionTitle> <Paragraph position="0"> We clustered the words in Lw with the complete-link and average-link clustering algorithms using the disw similarity information. The algorithms form hierarchical cluster trees which need to be split into clusters at some level. The inconsistency coefficient c characterizes each link in a cluster tree by comparing its length with the average size of other links at the same level of the hierarchy. The higher the value of this coefficient, the less similar the objects connected by the link (The MathWorks, Inc., 2002). We selected the inconsistency coefficient c = 1 by testing on a separate initial test set different from the final evaluation data.</Paragraph> <Paragraph position="1"> Using the cosine distance cosd(pw,T, qw,T) as a similarity measure on the projected and rotated representation of the words, we clustered with the above-mentioned standard clustering algorithms as well as with the k-means algorithm. Using the Euclidean distance eucld(pw,T, qw,T) we also produced self-organizing maps (SOM). For k-means and SOM an initial number of clusters needs to be selected.</Paragraph> <Paragraph position="2"> We selected 35 clusters, as this was close to the average number of clusters produced by the other algorithms we were comparing with. For k-means we used the best out of 10 iterations and for SOM we trained a 5 × 7 hexagonal grid topology for 10 epochs. We also tried a considerably longer training period for SOM but noticed only an insignificant improvement in the cluster precision.</Paragraph> <Paragraph position="3"> We also tried a number of other algorithms in the MATLAB package, but they typically produced a result either containing only the word itself or clusters containing more than one fifth of the words in the key cluster. We deemed such clustering results a failure on our data without need for formal evaluation.</Paragraph>
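<Paragraph> As an illustration of this parameter setting, the following Python sketch cuts an average-link (or complete-link) cluster tree with an inconsistency-coefficient threshold and returns the key cluster; the experiments themselves used the MATLAB implementations of these algorithms, so the SciPy calls here are a stand-in.

    # Minimal sketch, assuming dis_w is the symmetric 101-by-101 matrix of
    # pairwise J(p,q) estimates and words lists the corresponding words.
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def key_cluster(dis_w, words, target, c=1.0, method="average"):
        condensed = squareform(dis_w, checks=False)    # condensed distance vector
        tree = linkage(condensed, method=method)       # "average" or "complete"
        labels = fcluster(tree, t=c, criterion="inconsistent")
        target_label = labels[words.index(target)]
        return [word for word, label in zip(words, labels) if label == target_label]
</Paragraph>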
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Experiments </SectionTitle> <Paragraph position="0"> After evaluating against the translation dictionary gold standard, the result of the experiment with complete-link, average-link, k-means and SOM clustering using different similarity measures is shown in Table 3. The best recall with the best precision was achieved with the average-link clustering using the information radius on the original feature space, with 47 ± 2% recall and 42 ± 2% precision. This produced clusters with an average size of 6.05 words.</Paragraph> <Paragraph position="1"> The difference between complete-link and average-link clustering is not statistically significant, even if average-link is slightly better. The recall is statistically significantly better in the original feature space than in the low-dimensional space at the risk level p = 0.05, whereas the precision remains roughly the same.</Paragraph> <Paragraph position="2"> The average-link and complete-link clustering have a statistically significantly better precision than k-means and SOM, respectively, at the risk level p < 0.05. We can also see that there is hardly any difference in practice between the Euclidean distance on normalized word vectors and the cosine distance, despite the fact that the centroids were not normalized when using the squared Euclidean distance with k-means.</Paragraph> <Paragraph position="3"> As can be seen from Table 4, the rotation of the low-dimensional feature space using SVD has the effect of increasing precision statistically significantly at the risk level p < 0.005, i.e. the clusters become less noisy.</Paragraph> [Table: … of different semantic relations in the cluster content in two different samples of 50 clusters each.] <Paragraph position="4"> We then performed a manual evaluation of the output of the best clustering algorithm. We used one cluster sample from the 453 clusters qualifying for mechanical evaluation and one sample from the whole set of 1753 clusters. The results of the manual evaluation are shown in Table 6. The evaluation shows that 69-79% of the material in the clusters is relevant for constructing a thesaurus.</Paragraph> <Paragraph position="5"> The manual evaluation agrees with the mechanical evaluation: the manual evaluation found a synonym content of 52%, compared to the minimum synonym content of 42% found by the mechanical evaluation. This means that the clusters actually contain a few more synonyms than those conservatively agreed on by the three translation dictionaries.</Paragraph> <Paragraph position="6"> If we evaluate the sample of key clusters drawn from all the words in the test sample, we get a synonym content of 38%. This figure is rather low, but can be explained by the fact that many of the words were compound nouns that had no synonyms, which is why the translation dictionaries either did not have them listed or contained no additional source word synonyms for them.</Paragraph> <Paragraph position="7"> In Table 5, we see a few sample clusters whose words we rated during manual evaluation.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> The feature selection and the feature weighting radically influence the outcome of any machine learning task. This has been noted in several evaluations of supervised machine learning algorithms (Voorhees et al., 1995; Yarowsky and Florian, 2002; Lindén, 2003).</Paragraph> <Paragraph position="1"> During clustering, i.e. unsupervised learning, the features extracted from the corpus are the only information guiding the machine learning, in addition to the clustering principle, which makes successful feature extraction, good feature weighting and accurate similarity measurements crucial for the success of the clustering. The clustering algorithms only exploit and preserve the information provided by the features and the similarity measure.</Paragraph> <Paragraph position="2"> In (Weeds, 2003; Lee, 2001; Dagan et al., 1999), the information radius is applied to find words that can be used as distributional proxies for one another. They extract features only from verb relations, whereas we use the full range of dependency syntactic relations. One intention of this study was to evaluate whether the selected corpus and the features extracted provide a basis for forming linguistically meaningful clusters that are useful in thesaurus construction. The result showed that 69-79% of the words found in the key clusters are useful, which is very encouraging.
It turned out that the chosen features as such were useful, even if the overall result probably could benefit from a more nuanced feature weighting scheme. We do not yet fully understand how the initial feature weighting affects the outcome of the clustering. Perhaps there are features that would contribute to a more fine-grained clustering if properly weighted.</Paragraph> <Paragraph position="3"> Next we intend to identify more than a single key cluster for each word, which poses additional challenges for the evaluation. We also aim at evaluating the generated clusters in an information retrieval setting in order to see if they improve performance despite the fact that they contain more than synonyms. This would also shed some light on exactly how much synonymy we need to aim at in a practical application.</Paragraph> </Section> </Paper>