File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-1006_metho.xml
Size: 18,061 bytes
Last Modified: 2025-10-06 14:14:06
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1006"> <Title>Extracting Word Correspondences from Bilingual Corpora Based on Word Co-occurrence Information</Title> <Section position="3" start_page="0" end_page="23" type="metho"> <SectionTitle> 2 Overview of Proposed Method </SectionTitle> <Paragraph position="0"> The finding underlying our proposed method is as follows. In a bilingual corpus, a pair of words corresponding to each other generally appear in the same context, although it is expressed in two different languages. If we calculate the pairwise correlations between the contexts in which the words occur, a corresponding pair of words will show a high correlation. Although one occurrence of a word may not give a sufficient context to characterize the word, accumulating all the contexts in which the word occurs throughout the text allows the word to be distinguished from the other words in the same language text.</Paragraph> <Paragraph position="1"> Figure 1 shows how two words are associated through their contexts, each expressed in its respective language. We use the set of words co-occurring with word w, which we refer to as the co-occurrence set of w, to concisely represent the accumulated contexts characterizing the word. To associate two co-occurrence sets whose elements are words in different languages, we consult a bilingual dictionary and extract the possible word correspondences between them. The point is that even if the pair of words to be associated is missing from the bilingual dictionary, their co-occurrence sets can be associated through the bilingual dictionary. Of course, some of the correspondences between the co-occurrence sets may also be missing from the bilingual dictionary.</Paragraph> <Paragraph position="2"> Nevertheless, the co-occurrence sets can still be associated, owing to the other correspondences between them that are contained in the bilingual dictionary.</Paragraph> <Paragraph position="3"> [Figure 1: excerpts from the Japanese text and the English text. The English excerpts read '... the two inputs to the address comparator coincide with each other, ...' and '... a lock identification number register, an identification number comparator and an AND gate ...'.]</Paragraph> <Paragraph position="4"> [Figure 1, continued: the co-occurrence set of the Japanese word and the co-occurrence set of 'comparator', associated through the bilingual dictionary.] Fig. 1 Associating words through contexts.</Paragraph> <Paragraph position="5"/> </Section> <Section position="4" start_page="23" end_page="23" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"/>
<Paragraph position="1"> [Figure 2: flowchart. Recoverable box labels: Japanese text; set of words for each sentence; co-occurrence data extraction; co-occurrence set for each Japanese word; co-occurrence set for each English word; bilingual dictionary; calculation of correlation; correlation for each pair of words.] Fig. 2 Method for extracting word correspondences. While the examples shown here are for Japanese and English, the method is applicable to any pair of languages. The method is divided into three parts: Japanese text processing, English text processing, and bilingual processing. The Japanese text processing is composed of sentence segmentation, morphological analysis, and co-occurrence data extraction. It extracts a co-occurrence set for each word from a Japanese text. Likewise, the English text processing extracts a co-occurrence set for each word from an English text. 
The bilingual processing then calculates the pairwise correlations between the co-occurrence sets for Japanese words and those for English words, and selects the pairs of words with the highest correlations.</Paragraph> </Section> <Section position="5" start_page="23" end_page="25" type="metho"> <SectionTitle> 3 Technical Details </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="23" end_page="24" type="sub_section"> <SectionTitle> 3.1 Extraction of words from text </SectionTitle> <Paragraph position="0"> Natural language texts are composed of two types of words: content words and function words. The target of extraction can usually be restricted to the correspondences between content words, which are both more numerous and more straightforward. Additionally, function words are useless as elements of co-occurrence sets, since they do not indicate specific contexts. Therefore, we extract only the content words from the texts in both languages.</Paragraph> <Paragraph position="1"> The content words are divided into simple words and compound words. The former are extracted by dictionary lookup and morphological analysis. To extract the latter, we describe a set of rules or patterns. So far, we have only addressed nominal compounds (simple noun phrases), whose patterns are given below. Here, N, A, and NP stand for noun, adjective, and simple noun phrase, respectively. N+ stands for a string of one or more Ns. * Japanese nominal compounds: NP := N N+ * English nominal compounds: NP := N N+ | A N+ The nominal compounds are extracted from the morphological analysis results by pattern matching.</Paragraph> <Paragraph position="2"> Here, an NP included in a larger NP is rejected, since only self-contained NPs qualify as nominal compounds. 
One exception is an English NP starting with a noun that is included in an NP starting with an adjective, because the case of an adjective modifying a nominal compound is just as likely as the case of an adjective being a part of a nominal compound.</Paragraph> </Section> <Section position="2" start_page="24" end_page="24" type="sub_section"> <SectionTitle> 3.2 Extraction of co-occurrence data </SectionTitle> <Paragraph position="0"> Definitions of 'co-occurrence' include syntactic co-occurrence, co-occurrence in a k-word window, co-occurrence in a sentence, and co-occurrence in a document. We use co-occurrence in a sentence, in which a pair of words occurring within the same sentence is regarded as a co-occurrence. While co-occurrence in a k-word window may produce better results when a sentence in one language corresponds to a sequence of two or more shorter sentences in the other language, it is difficult to determine an appropriate value of k because word order differs considerably between Japanese and English.</Paragraph> <Paragraph position="1"> The relations between a compound word and its constituent words are not, strictly speaking, co-occurrence relations. Moreover, treating them in the same manner as co-occurrence relations would cause some confusion. Suppose that compound word w is composed of two simple words, w' and w&quot;. If we included both w' and w&quot; in the co-occurrence set of w, and vice versa, the differences between the co-occurrence set of w and those of w' and w&quot; would decrease. Therefore, we exclude the constituent words from the co-occurrence set of a compound word and vice versa.</Paragraph> <Paragraph position="2"> As mentioned in Section 2, the co-occurrence sets of a word are accumulated. This is not a mere union operation, but a union operation accompanied by frequency counting. 
The resultant co-occurrence set is expressed as $C(w) = \{ w_i / f_i \mid i = 1, \ldots, n \}$, which shows that word $w_i$ co-occurs with word $w$ a total of $f_i$ times.</Paragraph> </Section> <Section position="3" start_page="24" end_page="24" type="sub_section"> <SectionTitle> 3.3 Calculation of correlations between words </SectionTitle> <Paragraph position="0"> We define correlation R(jw, ew) between Japanese word jw and English word ew as follows.</Paragraph> <Paragraph position="2"> m; j = 1, ..., n} is the intersection of C(jw) and C(ew), whose elements are pairs of a Japanese word and an English word with their frequency. | * | means the sum of frequencies of all elements.</Paragraph> <Paragraph position="3"> Generating intersection $C(jw) \cap C(ew)$ from C(jw) and C(ew) is not easy because the procedure of pairing $jw_i$ ($\in C(jw)$) and $ew_j$ ($\in C(ew)$) is nondeterministic. A pair of words cannot be determined independently of the other possible pairs. To reduce processing time, we calculate</Paragraph> <Paragraph position="5"> 3. For example, the English-based approximate calculation is done as follows. First, Japanese co-occurrence set C(jw) is transformed into pseudo co-occurrence set Cp(jw) by consulting bilingual dictionary D, which is a set of pairs of words:</Paragraph> <Paragraph position="7"> The intersection of pseudo co-occurrence set Cp(jw) and English co-occurrence set C(ew) is then generated:</Paragraph> <Paragraph position="9"> This approximate calculation is likely to result in an overestimated correlation when there is ambiguity in pairing $jw_i$ ($\in C(jw)$) and $ew_j$ ($\in C(ew)$), as occurs in Fig. 3(a). Figure 3(a) shows that the number of elements in the intersection exceeds that in the Japanese co-occurrence set. The English-based and Japanese-based approximate calculations therefore do not always coincide with each other. While selecting the minimum of the two approximate values is safer, it does not guarantee a precise value. 
Since ambiguity in associating co-occurrence sets does not occur too often, and considering the need for efficiency, we execute either of the two approximate calculations rather than make a precise calculation.</Paragraph> <Paragraph position="10"> To increase the reliability of the correlation values, we remove the useless words from the co-occurrence sets before calculating the correlations. A useless Japanese word is jw such that $\{ ew \mid (jw, ew) \in D \} \cap \{ ew \mid ew \in T_E \} = \emptyset$ ($T_E$ is the input English text), and a useless English word is ew such that $\{ jw \mid (jw, ew) \in D \} \cap \{ jw \mid jw \in T_J \} = \emptyset$ ($T_J$ is the input Japanese text). These words do not contribute to the word-pair correlations.</Paragraph> </Section> <Section position="4" start_page="24" end_page="25" type="sub_section"> <SectionTitle> 3.4 Selection of pairs of words with high correlation </SectionTitle> <Paragraph position="0"> The absolute values of the correlations are not significant because they are sensitive to the numbers of words in the co-occurrence sets, which vary considerably from word to word. However, their relative values are significant when either a Japanese or an English word is fixed. We take the strategy of selecting the mutually best-matched pairs having no highly probable competitors. We call (jw, When for a mutually best-matched pair (jw, ew), there exists either ew' such that R(jw, ew') > a * R(jw, ew) and (jw, ew') C D or jw' such that R(jw', ew) > a * R(jw, ew) and (jw', ew) &lt; D, we call (jw, ew') or (jw', ew) a highly probable competitor. Here, a is a predetermined constant (0 &lt; a ≤ 1), and D is the bilingual dictionary.</Paragraph> <Paragraph position="1"> 3.5 Feedback of extracted pairs of words Obviously, the performance of the proposed method depends upon the coverage of the bilingual dictionary over the corpus. The coverage is the proportion of the word correspondences in the corpus that are already contained in the bilingual dictionary. 
Generally speaking, the wider the coverage, the more reliable the correlation values. Accordingly, the feedback of extracted pairs will probably improve performance, even though some of them are erroneous. In Fig. 2, the feedback is represented by a dotted line.</Paragraph> </Section> </Section> <Section position="6" start_page="25" end_page="135" type="metho"> <SectionTitle> 4 Experiment and Results </SectionTitle> <Paragraph position="0"> We implemented our proposed method on a workstation and carried out an experiment using patent-specification documents in Japanese and English and a bilingual dictionary for a machine translation system. The dictionary contains approximately 60,000 Japanese entry words, each having several English translations.</Paragraph> <Paragraph position="1"> The quantitative profile of the sample patent documents is shown in Table 1(a).</Paragraph> <Paragraph position="2"> We executed the word correspondence extraction program for each document. Parameter a in the selection of pairs of words was assumed to be 0. This means that the output pairs were limited as much as possible. Results both before and after feedback were obtained to evaluate the effect of feedback. The extracted pairs of words were divided into two groups: those which are already contained in the bilingual dictionary and those which are not yet contained in the bilingual dictionary.1) The former are insignificant from the practical point of view. However, they are significant in evaluating the effectiveness of the proposed correlation measure because the dictionary information regarding a particular pair of words does not contribute to the correlation between the pair itself. Accordingly, we evaluated two cases: Case A - the already known pairs of words are included - and Case B - the already known pairs of words are excluded.</Paragraph> <Paragraph position="3"> A good way to evaluate word correspondence extraction methods is to measure their recall and precision. 
These measures are defined as follows. The recall is the proportion of all word correspondences in a bilingual corpus that are actually extracted. The precision is the proportion of extracted word correspondences that are actually correct. While the precision is rather easy to calculate, the recall is difficult to calculate because it is a time-consuming task to manually identify all the word correspondences in the bilingual corpus. Therefore, instead of calculating the recall according to its definition, we make a rough estimation using the ratio of the number of correct pairs of words extracted to the number of words in either the Japanese or English text. We call this the pseudo-recall. The pseudo-recall indicates the lowest limit of the recall since a word in the Japanese text does not always have a straightforward counterpart in the English text, and vice versa. 1) We neglected the reference numbers peculiar to the patent documents because their correspondences are trivial. The underlined numerals in the following pair of sentences are an example of a reference number: ...... 504 ......</Paragraph> <Paragraph position="4"> Tables 1(b) and (c) show the pseudo-recall and the precision in Cases A and B, respectively. In Case A, the pseudo-recall and precision before feedback were 27.8% and 87.5% respectively, and those after feedback were 30.4% and 88.0%. In Case B, the pseudo-recall and precision before feedback were 25.7% and 74.9% respectively, and those after feedback were 28.0% and 75.6%.</Paragraph> <Paragraph position="6"> Table 2 Examples of extracted word correspondences. ( ..., radio frequency heating ) S: simple word, C: compound word</Paragraph> <Paragraph position="7"> The experiment confirmed that the proposed method can extract not only compound word correspondences but also simple word correspondences from a small corpus. Examples of word correspondences extracted from a patent document are shown in Table 2. 
The comparison of results before and after feedback supported the effectiveness of using feedback. That is, feedback increases recall while preserving precision. We also ascertained that repeating the feedback one more time did not result in significant improvement.</Paragraph> </Section> <Section position="7" start_page="135" end_page="135" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The experiment shows that the proposed method is effective in reducing the cost of bilingual dictionary augmentation. The recall of the method is not high.</Paragraph> <Paragraph position="1"> Furthermore, it cannot extract more than one correspondence for a word. Still, the method is effective because it can extract correspondences from a small corpus. Bilingual documents should be handled separately. Even if a corresponding pair of words fails to be extracted from one bilingual document, it may be extracted from another bilingual document in which it occurs more prominently.</Paragraph> <Paragraph position="2"> The following are directions for further improvement.</Paragraph> <Paragraph position="3"> (1) Refinement of nominal compound extraction procedure: The simplified procedure described in Sec. 3.1 often causes omission (a nominal compound is not extracted) and noise (an inappropriate word string is extracted). These are major causes of errors in word correspondence extraction; refining the nominal compound extraction procedure will considerably improve recall and precision. (2) Use of symbol/numeral correspondences: In the present implementation, the correspondences of symbols and numerals are not used in calculating the correlation because the bilingual dictionary does not contain them. However, they have the potential of increasing the reliability of the correlation values. 
A character-string-matching routine to identify the correspondences of symbols/numerals should thus be added to the correlation calculation module.</Paragraph> <Paragraph position="4"> (3) Use of the constituent word information of compound words: The key idea of our method is to associate a pair of words through their co-occurrence information with the assistance of a bilingual dictionary. In contrast, the key idea of the previous linguistic methods is to associate a pair of compound words through their constituent word information with the assistance of a bilingual dictionary. These two are not incompatible. Combining them would surely increase the recall and precision for compound word correspondences.</Paragraph> </Section> </Paper>
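The sentence-level co-occurrence counting of Section 3.2 and the mutual best-match selection of Section 3.4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `constituents` mapping, the choice to count a word pair once per sentence, and the first-seen tie-breaking are all assumptions made for illustration. The correlation measure itself and the highly-probable-competitor filter are left out, so the correlation table in the demo is hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_sets(sentences, constituents=None):
    """Accumulate sentence-level co-occurrence sets with frequency counts.

    sentences: list of lists of content words, one list per sentence.
    constituents: hypothetical mapping from a compound word to the set of
        simple words it is built from; such pairs are never counted.
    Assumption: a pair is counted once per sentence in which both words occur.
    """
    constituents = constituents or {}
    sets = {}
    for words in sentences:
        for w, v in permutations(set(words), 2):
            # Exclude compound/constituent pairs in both directions (Sec. 3.2).
            if v in constituents.get(w, ()) or w in constituents.get(v, ()):
                continue
            sets.setdefault(w, Counter())[v] += 1
    return sets


def mutually_best_pairs(correlation):
    """Return the pairs (jw, ew) that are each other's best-correlated partner.

    correlation: dict mapping (jw, ew) to a correlation value R(jw, ew).
    Assumption: on ties, the pair seen first is kept.
    """
    best_for_j = {}
    best_for_e = {}
    for (jw, ew), r in correlation.items():
        if jw not in best_for_j or r > correlation[(jw, best_for_j[jw])]:
            best_for_j[jw] = ew
        if ew not in best_for_e or r > correlation[(best_for_e[ew], ew)]:
            best_for_e[ew] = jw
    return sorted(
        (jw, ew) for jw, ew in best_for_j.items() if best_for_e[ew] == jw
    )


if __name__ == "__main__":
    # Toy English sentences, already reduced to content words.
    english = [
        ["address comparator", "input", "register"],
        ["register", "input"],
    ]
    compounds = {"address comparator": {"address", "comparator"}}
    print(cooccurrence_sets(english, compounds))

    # Hypothetical correlation values, for illustration only.
    table = {("j1", "e1"): 0.9, ("j1", "e2"): 0.4, ("j2", "e1"): 0.5}
    print(mutually_best_pairs(table))
```

Keeping the counts in a `Counter` per word mirrors the description of accumulation as a union accompanied by frequency counting. The selection step only compares correlation values with one word of the pair held fixed, in line with the remark that absolute values are not significant.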