<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3007"> <Title>Exploiting Aggregate Properties of Bilingual Dictionaries For Distinguishing Senses of English Words and Inducing English Sense Clusters</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Synonymy Relation </SectionTitle> <Paragraph position="0"> We began by using the above-described data set to obtain a synonymy relation between English words.</Paragraph> <Paragraph position="1"> In general, in a paper bilingual dictionary, each foreign word can be associated with a list of English words which are possible translations; in our reduced format each entry lists a single foreign word and a single possible English translation, though taking the union of all English translations for a particular foreign word recreates this list.</Paragraph> <Paragraph position="2"> We use the notion of coentry to build the synonymy relation between English words. The per-entry coentry count C_per-entry(e1,e2) for two English words e1 and e2 is simply the number of times e1 and e2 both appear as translations of the same foreign word (over all foreign words, dictionaries and languages). The per-dictionary coentry count C_per-dict(e1,e2) ignores the number of individual coentries within a particular dictionary and counts any number of coentries inside that dictionary as 1. Finally, the per-language coentry count C_per-lang(e1,e2) counts any number of coentries for e1 and e2 within a particular language as 1. Thus, for the following snippet from the database: Eng. Wd. Foreign Wd. Foreign Language Dict. ID</Paragraph> <Paragraph position="4"> the Italian and German languages. 
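The three coentry counts defined above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used for the paper: the row layout (English word, foreign word, language, dictionary ID), the function name coentry_counts, the assumption that dictionary IDs are globally unique, and every toy row other than the ger.dict1 headwords absetzen and ablagerung (shown here with a subset of their translations) are introduced purely for the example.

```python
from collections import defaultdict
from itertools import combinations

# Toy rows in the reduced format: (english, foreign, language, dict_id).
# Only the ger.dict1 absetzen/ablagerung rows mirror the paper's example;
# the bodensatz, ger.dict2 and ita.dict1 rows are made up for illustration.
ENTRIES = [
    ("deposit", "absetzen", "GERMAN", "ger.dict1"),
    ("sell", "absetzen", "GERMAN", "ger.dict1"),
    ("deposit", "ablagerung", "GERMAN", "ger.dict1"),
    ("sediment", "ablagerung", "GERMAN", "ger.dict1"),
    ("deposit", "bodensatz", "GERMAN", "ger.dict1"),
    ("sediment", "bodensatz", "GERMAN", "ger.dict1"),
    ("deposit", "ablagerung", "GERMAN", "ger.dict2"),
    ("sediment", "ablagerung", "GERMAN", "ger.dict2"),
    ("deposit", "sedimento", "ITALIAN", "ita.dict1"),
    ("sediment", "sedimento", "ITALIAN", "ita.dict1"),
]


def coentry_counts(entries):
    """Return (per_entry, per_dict, per_lang) coentry counts.

    Each result maps an alphabetically ordered pair of English words
    to its count. Assumes dictionary IDs are unique across languages.
    """
    # Recreate the translation list of every foreign headword,
    # kept separate for each dictionary.
    translations = defaultdict(set)
    for english, foreign, language, dict_id in entries:
        translations[(language, dict_id, foreign)].add(english)

    per_entry = defaultdict(int)
    dicts_seen = defaultdict(set)
    langs_seen = defaultdict(set)
    for (language, dict_id, _foreign), words in translations.items():
        for pair in combinations(sorted(words), 2):
            per_entry[pair] += 1            # every shared headword counts
            dicts_seen[pair].add(dict_id)   # each dictionary counts once
            langs_seen[pair].add(language)  # each language counts once

    per_dict = {pair: len(ids) for pair, ids in dicts_seen.items()}
    per_lang = {pair: len(langs) for pair, langs in langs_seen.items()}
    return dict(per_entry), per_dict, per_lang
```

On the toy rows, the pair (deposit, sediment) receives a per-entry count of 4, a per-dictionary count of 3 and a per-language count of 2, while (deposit, sell) receives 1 on all three counts. Keeping sets of dictionaries and languages, rather than running totals, is what makes repeated coentries inside one dictionary or one language count only once.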
We found the more conservative per-dictionary and per-language counts to be a useful device, given that some dictionary creators appear to copy and paste identical synonym sets fairly indiscriminately, spuriously inflating the C_per-entry(e1,e2) counts.</Paragraph> <Paragraph position="5"> Our algorithm for identifying synonyms was simple: we sorted all pairs of English words by decreasing C_per-dict(e1,e2) and, after inspecting the resulting list, cut it off at a per-dictionary and per-language count threshold [1] that yielded qualitatively strong results. For all word pairs e1,e2 above threshold, we say the symmetric synonymy relation S(e1,e2) holds. The following tables provide a clarifying example showing how synonymy can be inferred from multiple bilingual dictionaries in a way which is impossible with a single such dictionary (because of idiosyncratic foreign language polysemy).</Paragraph> <Paragraph position="6">
Lang. | Dict. ID | Foreign Wd | English Translations
GERMAN | ger.dict1 | absetzen | deposit drop deduct sell
GERMAN | ger.dict1 | ablagerung | deposit sediment settlement
The table above displays entries from one German-English dictionary. How can we tell that &quot;sediment&quot; is a better synonym for &quot;deposit&quot; than &quot;sell&quot;? We can build and examine the coentry counts over all dictionaries and languages. Polysemy which is specific to German, with the &quot;deposit&quot; and &quot;sell&quot; senses coexisting in the particular word form &quot;absetzen&quot;, will result in low total coentry counts C_per-lang(deposit,sell) over all languages and dictionaries. In fact, &quot;deposit&quot; and &quot;sell&quot; are coentries under only 2 out of 44 languages in our database (German and Swedish, which are closely related). On the other hand, near-synonymous English translations of a particular sense across a variety of languages will result in high coentry counts, as is the case with C_per-lang(deposit,sediment). 
As illustrated in the tables, German, French, Czech and Turkish all support the synonymy hypothesis for this pair of English words.</Paragraph> <Paragraph position="7">
&quot;deposit&quot; Coentries | Per Entry | Per Dict. | Per Lang.
sell | 4 | 4 | 2
sediment | 68 | 40 | 18
The above table, listing the various coentry counts for &quot;deposit&quot;, shows the empirical support in the aggregate dictionary for the synonymy relationship between deposit and sediment. The aggregate evidence of synonymy between deposit and sell, by contrast, is weak, limited to 2 languages, and most likely the result of a word polysemy restricted to a few Germanic languages.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Different Senses: Asymmetries of </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Synonymy Relations </SectionTitle> <Paragraph position="0"> After constructing the empirically derived synonymy relation S described in the previous section, we observed that one can draw conclusions from the topology of the graph of S relationships (edges) among words (vertices).</Paragraph> <Paragraph position="1"> Specifically, consider the case of three words e1, e2, e3 for which S(e1,e2) and S(e1,e3) hold, but S(e2,e3) does not. Figure 1 illustrates this situation with an example from the data (e1 = &quot;fair&quot;), and more examples are listed in Table 1. As Figure 1 suggests, and as inspection of the random extracts presented in Table 1 will confirm, this topology can be interpreted as indicating that e2 and e3 exemplify differing senses of e1.</Paragraph> <Paragraph position="2"> We decided to investigate this observation and apply it with more generality. 
This is discussed in the next section.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Inducing Sense Taxonomies: Clustering </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> with Synonym Similarity </SectionTitle> <Paragraph position="0"> With the goal of using the aggregate bilingual dictionary to induce interesting and useful sense distinctions of English words, we investigated the following strategy.</Paragraph> <Paragraph position="1"> distinctions derived via unbalanced synonymy relationships among three words, W and two of its synonyms syn1(W) and syn2(W), such that C_per-dict(W,syn1(W)) and C_per-dict(W,syn2(W)) are high, whereas C_per-dict(syn1(W),syn2(W)) is low (0). Extracted from a list sorted by descending C_per-dict(W,syn1(W))</Paragraph> <Paragraph position="3"> were smoothed to prevent division by zero).</Paragraph> <Paragraph position="4"> For each English target word Wt having a sufficiently high dictionary occurrence count to allow interesting results [2], a list of likely synonym words Ws was induced by the method described in Section 3 [3]. Additionally, we generated a list of all words Wc having non-zero C_per-dict(Wt,Wc).</Paragraph> <Paragraph position="5"> The synonym words Ws, the sense exemplars for target words Wt, were clustered based on vectors of coentry counts C_per-dict(Ws,Wc). Restricting the vector dimensions to words that have nonzero coentries with the target word helps to exclude distractions such as coentries of Ws corresponding to a sense which does not overlap with Wt. The example given in the following table shows an excerpt of the vectors for synonyms of strike. The synonym hit overlaps strike in the beat/bang/knock sense. 
Restricting the vector dimension as described will help prevent noise from hit's common chart-topper/recording/hit single sense. [2] For our experiments, English words occurring in at least 15 distinct source dictionaries were considered.</Paragraph> <Paragraph position="6"> [3] Again, the thresholds for synonyms were 10 and 5 for per-dictionary and per-language coentry counts, respectively.</Paragraph> <Paragraph position="7"> The following table also illustrates the clarity with which major sense distinctions are reflected in the aggregate dictionary. The induced clustering for strike (tree as well as flat cluster boundaries) is presented in Figure 4.</Paragraph> <Paragraph position="8"> attack bang hit knock walkout find
We used the CLUTO clustering toolkit (Karypis, 2002) to induce a hierarchical agglomerative clustering on the vectors for Ws. Example results for vital and strike are in Figures 3 and 4 respectively [4]. Figure 4 also presents flat clusters automatically derived from the tree, as well as a listing of some foreign words associated with</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> There is a distinguished history of research extracting lexical semantic relationships from bilingual dictionaries (Copestake et al., 1995; Chen and Chang, 1998). There is also a long-standing goal of mapping translations and senses in multiple languages in a linked ontology structure (Resnik and Yarowsky, 1997; Risk, 1989; Vossen, 1998). The recent work of Ploux and Ji (2003) has some similarities to the techniques presented here in that it considers topological properties of the graph of synonymy relationships between words. 
The current paper can be distinguished on a number of dimensions, including our much greater range of participating languages and the fundamental algorithmic linkage between multilingual translation distributions and monolingual synonymy clusters.</Paragraph> <Paragraph position="1"> automatically derived from the tree are denoted by the horizontal lines.</Paragraph> </Section> </Paper>