<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1105"> <Title>Bilingual-Dictionary Adaptation to Domains</Title> <Section position="3" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Method using the ratio of associated words 3.1 Outline </SectionTitle> <Paragraph position="0"> This method is based on the assumption that each word associated with a target word suggests a specific sense of the target word, in other words, specific translation equivalents of the target word. It is also assumed that dominance of a translation equivalent in a domain correlates with how many associated words suggesting it occur in a corpus of the domain. It is thus necessary to identify which associated words suggest which translation equivalents. This can be done by using the sense-vs.-clue correlation algorithm that the author developed for unsupervised word-sense disambiguation (Kaji and Morimoto 2002). The algorithm works with a set of senses of a target word, each of which is defined as a set of synonymous translation equivalents, and it results in a correlation matrix of senses vs. clues (i.e., associated words). It is used here with a set of translation equivalents instead of a set of senses, resulting in a correlation matrix of translation equivalents vs.</Paragraph> <Paragraph position="1"> associated words.</Paragraph> <Paragraph position="2"> The proposed method consists of the following steps (as shown in Figure 2).</Paragraph> <Paragraph position="3"> First, word associations are extracted from a corpus of each language. The first step is the same as that of the contextual-similarity-based method described in Section 2.</Paragraph> <Paragraph position="4"> Second, word associations are aligned translingually by consulting a bilingual dictionary, and pairwise correlation between translation equivalents of a target word and its associated words is calculated iteratively. 
A detailed description of this step is given in the following subsection.</Paragraph> <Paragraph position="5"> Third, each associated word is assigned to the translation equivalent having the highest correlation with it. This procedure may be problematic, since an associated word often suggests two or more translation equivalents that represent the same sense. However, it is difficult to separate translation equivalents suggested by an associated word from others. Each associated word is therefore assigned to the translation equivalent it suggests most strongly.</Paragraph> <Paragraph position="6"> Finally, a translation equivalent is selected when the ratio of associated words assigned to it exceeds a certain threshold. In addition, representative associated words are selected for each selected translation equivalent. A representativeness measure was devised under the assumption that representative associated words are near the centroid of a cluster consisting of associated words assigned to a translation equivalent. The representative associated words help lexicographers validate the selected translation equivalents.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Calculation of correlation between translation equivalents and associated words </SectionTitle> <Paragraph position="0"> The iterative algorithm described below has two main features. First, it overcomes the problem of failure in word-association alignment due to incompleteness of the bilingual dictionary and disparity in topical coverage between the corpora of the two languages.
Second, it overcomes the problem of ambiguity in word-association alignment.</Paragraph> <Paragraph position="1"> For a first-language word association (x, x(j)), where x is a target word and x(j) is its j-th associated word, a set consisting of second-language word associations alignable with it, denoted as Y(x, x(j)), is constructed. That is,</Paragraph> <Paragraph position="3"> is the collection of word associations extracted from a corpus of the second language, and D is a bilingual dictionary to be adapted.</Paragraph> <Paragraph position="4"> Each first-language word association (x, x(j)) is characterized by a set consisting of accompanying associated words, denoted as Z(x, x(j)). An accompanying associated word is a word that is associated with both words making up the word association in question. That is,</Paragraph> <Paragraph position="6"> is the collection of word associations extracted from a corpus of the first language.</Paragraph> <Paragraph position="7"> In addition, alignment of a first-language word association (x, x(j)) with a second-language word association (y, y') (∈ Y(x, x(j))) is characterized by a set consisting of translingually alignable accompanying associated words, denoted as W((x, x(j)), (y, y')). A translingually alignable accompanying associated word is a word that is an accompanying associated word of the first-language word association making up the alignment in question and, at the same time, is alignable with an accompanying associated word of the second-language word association making up the alignment in question. That is,</Paragraph> <Paragraph position="9"> The correlation between the i-th translation equivalent of target word x, denoted as y(i), and the j-th associated word x(j) is defined as</Paragraph> <Paragraph position="11"> where MI(x, x(j)) is the mutual information between x and x(j), and PL(y(i), x(j)) is the plausibility factor for y(i) given by x(j).
The mutual information between the target word and the associated word is the basis of the correlation between each translation equivalent of the target word and the associated word; it is multiplied by the normalized plausibility factor. The plausibility factor is defined as the weighted sum of two component plausibility factors.</Paragraph> <Paragraph position="13"> where a is a parameter adjusting the relative weights of the component plausibility factors.</Paragraph> <Paragraph position="14"> The first component plausibility factor, PL1, is defined as the sum of correlations between the translation equivalent and the accompanying associated words. This is based on the assumption that an associated word usually correlates closely with the translation equivalent that correlates closely with a majority of its accompanying associated words.</Paragraph> <Paragraph position="15"> The second component plausibility factor, PL2, is defined as the maximum plausibility of alignment involving the translation equivalent, where the plausibility of alignment of a first-language word association with a second-language word association is defined as the mutual information of the second-language word association multiplied by the sum of correlations between the translation equivalent and the translingually alignable accompanying associated words. This is based on the assumption that correct alignment of word associations is usually accompanied by many associated words that are alignable with each other, as well as the assumption that alignment with a strong word association is preferable to alignment with a weak word association.</Paragraph> <Paragraph position="16"> The above definition of the correlations between translation equivalents and associated words is recursive, so they can be calculated iteratively.
Initial values are set as</Paragraph> <Paragraph position="18"> That is, the mutual information between the target word and an associated word is used as the initial value for the correlations between all translation equivalents of the target word and the associated word.</Paragraph> <Paragraph position="19"> It was proved experimentally that the algorithm works well for a wide range of values of parameter a and that the correlation values converge rapidly. Parameter a and the number of iterations were set to five and six, respectively, in the experiments described in Section 4.</Paragraph> </Section> </Section> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Material and preparation </SectionTitle> <Paragraph position="0"> The experiment focused on nouns, whose appropriate translations often vary with domains. A wide-coverage bilingual noun dictionary was constructed by collecting pairs of nouns from the EDR English-to-Japanese and Japanese-to-English dictionaries. The resulting dictionary consists of 633,000 pairs of 269,000 English nouns and 276,000 Japanese nouns.</Paragraph> <Paragraph position="1"> An English corpus consisting of Wall Street Journal articles (July 1994 to December 1995; 189MB) and a Japanese corpus consisting of Nihon Keizai Shimbun articles (December 1993 to November 1994; 275MB) were used as the comparable corpora.</Paragraph> <Paragraph position="2"> English nouns occurring 10 or more times in the English corpus were selected as the target words.</Paragraph> <Paragraph position="3"> The total number of selected target words was 12,848. 
For each target word, initial candidate translation equivalents were selected from the bilingual dictionary in descending order of frequency in the Japanese corpus; the maximum number of candidates was set at 20, and the minimum frequency was set at 10. The average number of candidate translation equivalents per target word was 3.3, and 1,251 target words had 10 or more candidate translation equivalents.</Paragraph> <Paragraph position="4"> Extraction of word associations, which is the first step common to the method based on contextual similarity (abbreviated as the CS method hereinafter) and the method using the ratio of associated words (abbreviated as the RAW method hereinafter), was done as follows. Co-occurrence frequencies of noun pairs were counted by using a window of 13 words, excluding function words, and then noun pairs having mutual information larger than zero were extracted.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 Experimental results </SectionTitle> <Paragraph position="0"> Results of the CS and RAW methods for six target words are listed in Tables 1 and 2, respectively. Table 1 lists the top-five translation equivalents in descending order of contextual similarity. Table 2 lists translation equivalents with a ratio of associated words larger than 4% along with their top-four representative associated words. In these tables, the occurrence frequencies in the test corpora are appended to both the target words and the translation equivalents. These indicate the weak comparability between the Wall Street Journal and Nihon Keizai Shimbun corpora. Moreover, it is clear that neither the CS method nor the RAW method relies on the occurrence frequencies of words.</Paragraph> <Paragraph position="1"> Tables 1 and 2 clearly show that the two methods produce significantly different lists of translation equivalents.
It is difficult to judge the appropriateness of the results of the CS method without examining the comparable corpora. However, it seems that inappropriate translation equivalents were often ranked high by the CS method. In contrast, referring to the representative associated words enables the results of the RAW method to be judged as appropriate or inappropriate. More than 90% of the selected translation equivalents were judged as definitely appropriate. Table 2 also includes the orders of translation equivalents determined by a conventional bilingual dictionary (remarks column). They are quite different from the orders determined by the RAW method.</Paragraph> <Paragraph position="2"> This shows the necessity and effectiveness of ranking translation equivalents according to relevance to a domain.</Paragraph> <Paragraph position="3"> Processing times were measured by separating both the CS and RAW methods into two parts. The processing time of the first part shared by the two methods, i.e., extracting word associations from corpora, is roughly proportional to the corpus size.</Paragraph> <Paragraph position="4"> For example, it took 2.80 hours on a Windows PC (CPU clock: 2.40 GHz; memory: 1 GB) to extract word associations from the 275 MB Japanese corpus.</Paragraph> <Paragraph position="5"> The second part, i.e., selecting translation equivalents for target words, is specific to each method, and its processing time is proportional to the number of target words. It took 11.5 minutes and 2.40 hours on another Windows PC (CPU clock: 2.40 GHz; memory: 512 MB) for the CS and RAW methods, respectively, to process the 12,848 target words.
It was thus confirmed that both the CS and RAW methods are computationally feasible.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3 Quantitative evaluation using pseudo target words </SectionTitle> <Paragraph position="0"> Note to Table 2 (remarks column): This column shows the orders of translation equivalents determined by a conventional dictionary, Kenkyusha's New Collegiate English-Japanese Dictionary, 5th edition. For example, 3a indicates that a translation equivalent belongs to the subgroup a in the third group of translations. A hyphen indicates that a translation equivalent is not contained in the dictionary.</Paragraph> <Paragraph position="1"> A method for bilingual-dictionary adaptation using comparable corpora should be evaluated by using recall and precision measures defined as $\mathrm{Recall} = |S \cap T| / |S|$ and $\mathrm{Precision} = |S \cap T| / |T|$, where S is a set consisting of pairs of translation equivalents contained in the test comparable corpora, and T is a set consisting of pairs of translation equivalents selected by the method. To calculate these measures, it is necessary to know all pairs of translation equivalents contained in the test corpora.</Paragraph> <Paragraph position="2"> This is almost impossible when the test corpora are large.</Paragraph> <Paragraph position="3"> To avoid this difficulty, an automated evaluation scheme using pseudo target words was devised. A pseudo word is formed from three real words, and it has three distinctive pseudo senses corresponding to the three constituent words. Translation equivalents of a constituent word are regarded as candidate translation equivalents of the pseudo word that represent the pseudo sense corresponding to the constituent word. For example, a pseudo word action/address/application has three pseudo senses corresponding to action, address, and application.
It has candidate translation equivalents such as 訴訟 &lt;SOSHOU&gt; and 決議 &lt;KETSUGI&gt; originating from action, 演説 &lt;ENZETSU&gt; and 請願 &lt;SEIGAN&gt; originating from address, and 応用 &lt;OUYOU&gt; and 応募 &lt;OUBO&gt; originating from application. Furthermore, pseudo word associations are produced by combining a pseudo word with each of the associated words of the first two constituent words. It is thus assumed that the first two pseudo senses occur in the corpora but the third one does not. For example, the pseudo word action/address/application has associated words including court and vote, which are associated with action, as well as President and legislation, which are associated with address.</Paragraph> <Paragraph position="4"> Using the pseudo word associations, a bilingual-dictionary-adaptation method selects translation equivalents for the pseudo target word. On the one hand, when at least one of the translation equivalents originating from the first (second) constituent word is selected, it means that the first (second) pseudo sense is successfully selected. For example, when 訴訟 &lt;SOSHOU&gt; is selected as a translation equivalent for the pseudo target word action/address/application, it means that the pseudo sense corresponding to action is successfully selected. On the other hand, when at least one of the translation equivalents originating from the third constituent word is selected, it means that the third pseudo sense is erroneously selected. For example, when 応用 &lt;OUYOU&gt; is selected as a translation equivalent for the pseudo target word action/address/application, it means that the pseudo sense corresponding to application is erroneously selected. The method is thus evaluated by recall and precision of selecting pseudo senses.
That is, $\mathrm{Recall} = |S \cap T| / |S|$ and $\mathrm{Precision} = |S \cap T| / |T|$, where S is a set consisting of pseudo senses corresponding to the first two constituent words, and T is a set consisting of pseudo senses relevant to translation equivalents selected by the method.</Paragraph> <Paragraph position="5"> A total of 1,000 pseudo target words were formed by using randomly selected words that occur more than 100 times in the Wall Street Journal corpus. Using these pseudo target words, both the CS and RAW methods were evaluated. For the CS method, the recall and precision of selecting pseudo senses were calculated for the case in which the N most-similar translation equivalents are selected (N = 2, 3, ...). For the RAW method, the recall and precision of selecting pseudo senses were calculated with the threshold for the ratio of associated words set from 20% down to 1% in 1% intervals.</Paragraph> <Paragraph position="6"> Recall vs. precision curves for the two methods are shown in Figure 3. These curves clearly show that the RAW method outperforms the CS method.</Paragraph> <Paragraph position="7"> The RAW method maximizes the F-measure, i.e., the harmonic mean of recall and precision, when the threshold for the ratio of associated words is set at 4%; the recall, precision, and F-measure are 92%, 80%, and 86%, respectively. In contrast, the CS method maximizes the F-measure when N is set at nine; the recall, precision, and F-measure are 96%, 72%, and 82%, respectively.</Paragraph> <Paragraph position="8"> It should be mentioned that the above evaluation was done under strict conditions. That is, two out of three pseudo senses of each pseudo target word were assumed to occur in the corpus, while many real target words have only one sense in a specific domain. Target words with only one sense occurring in a corpus are generally easier to cope with than those with multiple senses occurring in a corpus.
Accordingly, recall and precision for real target words would be higher than those reported above for the pseudo target words.</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The reasons why the RAW method outperforms the CS method are discussed below.</Paragraph> <Paragraph position="1"> * The RAW method overcomes both the sparseness of word-association data and the topical disparity between corpora of two languages. This is due to the smoothing effects of the iterative algorithm for calculating correlation between translation equivalents and associated words; namely, associated words are correlated with translation equivalents even if they fail to be aligned with their counterparts. In contrast, the CS method is strongly affected by the above-mentioned difficulties. The uniformly low values of contextual similarity (see Table 1) support this.</Paragraph> <Paragraph position="2"> * The RAW method assumes that a target word has more than one sense, and, therefore, it is effective for polysemous target words. In contrast, contextual similarity is ineffective for a target word with two or more senses occurring in a corpus. The context vector characterizing such a word is a composite of context vectors characterizing the respective senses; therefore, the context vector characterizing any candidate translation equivalent does not show very high similarity.</Paragraph> <Paragraph position="3"> * The RAW method can select an appropriate number of translation equivalents for each target word by setting a threshold for the ratio of associated words.
In contrast, the CS method is forced to select a fixed number of translation equivalents for all target words; it is difficult to predetermine a threshold for the contextual similarity, since the range of its values varies with target words (see Table 1).</Paragraph> <Paragraph position="4"> Finally, the practical advantages of the RAW method are discussed below.</Paragraph> <Paragraph position="5"> * The RAW method selects translation equivalents contained in the comparable corpora of a domain together with evidence, i.e., representative associated words that suggest the selected translation equivalents. Accordingly, it allows lexicographers to check the appropriateness of selected translation equivalents efficiently.</Paragraph> <Paragraph position="6"> * The ratio of associated words can be regarded as a rough approximation of a translation probability.</Paragraph> <Paragraph position="7"> Accordingly, a translation equivalent can be fixed for a word when that translation equivalent has an exceedingly large ratio of associated words. A sophisticated procedure for word-sense disambiguation or translation-word selection needs to be applied only to words for which two or more translation equivalents have significant ratios of associated words.</Paragraph> </Section> </Paper>