<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2159"> <Title>A Bootstrapping Method for Extracting Bilingual Text Pairs</Title> <Section position="3" start_page="0" end_page="1066" type="metho"> <SectionTitle> 2 The Basic Idea </SectionTitle> <Paragraph position="0"> As we will describe in Section 3, several CLIR approaches that rely on parallel corpora have been proposed and have led to successful retrieval results. In those approaches, the parallel corpus used as training data must be large enough to obtain good retrieval results. Although we use a CLIR method which relies on a parallel corpus, we begin with a very small parallel corpus. We retrieve bilingual text pairs from a bilingual comparable corpus using the small parallel corpus as training data. Then we concatenate the text pairs to the initial small parallel corpus and grow the parallel corpus by iterating the retrieval and concatenation processes (Figure 1).</Paragraph> <Paragraph position="2"> This kind of bootstrapping method has a problem, however: it is highly sensitive to the accuracy of the text pairs obtained in the early stages of the iterations. In order to solve this problem, we concatenate only a small number of the most &quot;reliable&quot; text pairs to the initial parallel corpus in the early stages, then gradually increase the number of text pairs which are concatenated to the initial parallel corpus. We will describe the details of the method in Section 4.</Paragraph> </Section> <Section position="4" start_page="1066" end_page="1066" type="metho"> <SectionTitle> 3 Corpus-based CLIR approaches </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1066" end_page="1066" type="sub_section"> <SectionTitle> 3.1 Previous Research </SectionTitle> <Paragraph position="0"> As we mentioned in Section 2, we use a CLIR method which relies on a parallel corpus in our bootstrapping method. One approach to corpus-based CLIR is to use the Latent Semantic Indexing technique proposed by Furnas et al.</Paragraph> <Paragraph position="1"> (1988) on a parallel corpus to construct a language-independent representation of queries and documents (Landauer and Littman, 1990).</Paragraph> <Paragraph position="2"> Another approach that relies on a parallel corpus has been suggested by Dunning and Davis (1993). Their method is based on the vector space model and involves a linear transformation of the representation of a query. A parallel corpus can also be used to enhance existing knowledge-based resources. The resources are used to translate the query, and then classical IR matching techniques are applied to compute the similarity between the translated query and documents (Hull and Grefenstette, 1996).</Paragraph> </Section> <Section position="2" start_page="1066" end_page="1066" type="sub_section"> <SectionTitle> 3.2 Information Mapping for CLIR </SectionTitle> <Paragraph position="0"> For our bootstrapping method, we adopted a CLIR method which is based on the Information Mapping approach (Masuichi et al., 1999). Information Mapping is basically a variant of the vector space model, and is based on an approach first proposed by Schütze (1995). The approach is closely related to Latent Semantic Indexing, and the difference between the two is discussed in Schütze and Pedersen (1997). Note that our bootstrapping method does not depend on any particular properties of the Information Mapping approach, so it could employ other corpus-based CLIR methods such as Latent Semantic Indexing.</Paragraph>
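As a concrete illustration of the iteration described in Section 2, the following Python skeleton sketches one way the loop could be organized. It is a minimal sketch, not the authors' code: the helpers train_word_space and extract_reliable_pairs are hypothetical stand-ins for the steps detailed in Sections 3.2 and 4.1.

```python
# A minimal sketch of the bootstrapping loop of Section 2, assuming the
# hypothetical helpers train_word_space() and extract_reliable_pairs().

def bootstrap(initial_parallel_corpus, comparable_corpus, num_stages=100):
    training_corpus = list(initial_parallel_corpus)
    for stage in range(1, num_stages + 1):
        # Re-train the CLIR model on the current (growing) parallel corpus.
        word_space = train_word_space(training_corpus)
        # Retrieve candidate bilingual text pairs, most reliable first.
        candidates = extract_reliable_pairs(word_space, comparable_corpus)
        # Early stages keep only a few pairs; the quota grows with the
        # stage number (stage * 10 in the experiments of Section 4).
        accepted = candidates[: stage * 10]
        # Accepted pairs are concatenated to the *initial* corpus, so a
        # bad pair admitted early does not persist through later stages.
        training_corpus = list(initial_parallel_corpus) + accepted
    return training_corpus
```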
<Paragraph position="1"> Information Mapping begins with a large word-by-word matrix. A list of n content-bearing words and a list of m vocabulary words correspond to the columns and the rows of the matrix. The n most frequently appearing words in a training corpus are selected as content-bearing words, and the m most frequently appearing words as vocabulary words. Each cell of the matrix holds the total number of co-occurrences between a content-bearing word and a vocabulary word in the training corpus. In this way, an n-dimensional vector which represents the word's distributional behavior is produced for each vocabulary word. Then the original n-dimensional vector space is converted into a condensed, lower-dimensional, real-valued matrix using Singular Value Decomposition (SVD) (Berry, 1992).</Paragraph> <Paragraph position="2"> The lower-dimensional vector space is called the word space. A document vector and a query vector are calculated by summing the vectors corresponding to the vocabulary words in the document or the query, and the proximity between two vectors is defined as the cosine of the angle between them.</Paragraph> <Paragraph position="3"> To apply this method to CLIR, we regard each translation pair in a training parallel corpus of languages L1 and L2 as a single compound document and create a word-by-word matrix and then a word space. The word space represents a language-independent vector space for vocabulary words in both L1 and L2, and therefore query and document vectors in both L1 and L2 can be calculated and compared in the same word space.</Paragraph> </Section> </Section> <Section position="5" start_page="1066" end_page="1068" type="metho"> <SectionTitle> 4 Experimental Tests and Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1066" end_page="1068" type="sub_section"> <SectionTitle> 4.1 Tests with complete-pair corpora </SectionTitle> <Paragraph position="0"> We used an English-Japanese bilingual patent text corpus for our experimental tests. For our first test, we prepared 1000 English-Japanese patent text pairs as a pseudo bilingual comparable corpus. For each Japanese patent text in the corpus, an English translation made by humans exists (the quality of the translations varies greatly, from word-for-word translations to short summaries), so this corpus could be regarded as an ideal bilingual comparable corpus. We also prepared 100 pairs as an initial parallel corpus (a training corpus) to create an initial word space. All the patents in the two corpora were randomly selected from the Japanese patents issued in 1991, and the two corpora shared no patent. We used only the title and abstract texts and removed all other information, such as author, patent ID and issue date. Table 1 shows an example of an English-Japanese pair in the corpora.</Paragraph>
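The construction above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the exact co-occurrence definition is not given in the paper, so the fixed-size token window used here is an assumption, and the paper's actual scale (3000 content-bearing words, reduction to 200 dimensions) is described in Section 4.1.

```python
import numpy as np

def build_word_space(docs, vocab, content_words, window=10, dims=200):
    """Sketch of Information Mapping's word space. Assumed detail:
    co-occurrences are counted within a fixed-size token window."""
    v_index = {w: i for i, w in enumerate(vocab)}
    c_index = {w: j for j, w in enumerate(content_words)}
    counts = np.zeros((len(vocab), len(content_words)))
    for doc in docs:  # each doc is a token list; for CLIR, a translation
        for i, w in enumerate(doc):  # pair is one compound document
            if w not in v_index:
                continue
            for u in doc[max(0, i - window): i + window + 1]:
                if u in c_index and u != w:
                    counts[v_index[w], c_index[u]] += 1
    # SVD condenses the n-dimensional co-occurrence vectors into `dims`
    # real-valued dimensions; the rows of `word_space` are word vectors.
    U, S, _ = np.linalg.svd(counts, full_matrices=False)
    word_space = U[:, :dims] * S[:dims]
    return word_space, v_index

def doc_vector(tokens, word_space, v_index):
    # A document (or query) vector is the sum of its word vectors.
    return np.sum([word_space[v_index[w]] for w in tokens if w in v_index],
                  axis=0)

def cosine(a, b):
    # Proximity is the cosine of the angle between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```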
<Paragraph position="2"> All characters in the English texts are 1-byte characters, and all characters in the Japanese texts, including alphabetical and numerical characters, are 2-byte characters, so there is no word which is shared by both the English and the Japanese texts.</Paragraph> <Paragraph position="3"> We used all words which appeared in a training corpus as vocabulary words, and the 3000 most frequently appearing English words as content-bearing words, and then reduced the dimension of the vectors from 3000 to 200 by SVD.</Paragraph> <Paragraph position="4"> Title: Hose for Transferring Fertilizer from Fertilizer Tank of Mobile Farm Machine.
Abstract: PROBLEM TO BE SOLVED: To provide a mechanism to arrange a fertilizer transfer hose from a fertilizer tank without causing hindrance to the other mechanisms, etc. SOLUTION: A fertilizer transfer hose 38 to deliver a fertilizer from a fertilizer tank 31 placed at a side of a mobile machine body 1 to the downstream side of a fertilizing part 28 is laid along the outer circumference of a passage 23 placed along the back and a side of a driver's seat 8 and extending from the driver's seat 8 to a working machine 11.
[The corresponding Japanese title and abstract are not recoverable from the scan.]
Table 1: An example of an English-Japanese patent pair</Paragraph> <Paragraph position="5"> We began with a word space created from the 100 English-Japanese translation pairs (the initial parallel corpus). Then, using the word space, we calculated 1000 English patent vectors and 1000 Japanese patent vectors which correspond to the patent texts in the pseudo comparable corpus. Next we extracted English-Japanese patent pairs which satisfied the simple condition that the English patent vector in the pair has the highest proximity (the largest cosine) with the Japanese patent vector in the pair among the 1000 Japanese patent vectors, and vice versa (hereafter we call these pairs mutual-proximity pairs). Note that mutual-proximity pairs are, of course, not always correct translation pairs.</Paragraph> <Paragraph position="6"> Then we selected the 10 most &quot;reliable&quot; mutual-proximity pairs, assuming that the higher the proximity between the two vectors of a mutual-proximity pair, the more reliable the mutual-proximity pair is. Finally we concatenated the 10 mutual-proximity pairs to the initial 100 translation pairs. This is the first stage of our bootstrapping method.</Paragraph> <Paragraph position="7"> In the second stage, we created a new word space regarding the 110 English-Japanese pairs obtained in the first stage as a training corpus. Then we selected the 20 most reliable mutual-proximity pairs and concatenated them to the initial 100 patent translation pairs.</Paragraph> <Paragraph position="8"> At the Nth stage, we selected the N*10 most reliable mutual-proximity pairs. If the number of mutual-proximity pairs obtained in a stage was less than N*10, all of the mutual-proximity pairs were concatenated to the initial 100 patent translation pairs.</Paragraph> <Paragraph position="9"> We repeated this procedure up to the 100th stage. At the 100th stage, we obtained 727 mutual-proximity pairs, and 721 of the 727 pairs were correct translation pairs. Therefore the recall of the obtained pairs was 72.1% (721/1000) and the precision was 99.2% (721/727) (see the Test 1 column and the &quot;bootstrapping method&quot; row of Table 2).</Paragraph>
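The mutual-proximity condition and the stage-dependent selection can be sketched as follows, assuming the 1000 English and 1000 Japanese texts are already encoded as L2-normalized vectors (the rows of E and J); the function names are illustrative, not from the paper.

```python
import numpy as np

def mutual_proximity_pairs(E, J):
    """E: (num_en, d) English text vectors; J: (num_ja, d) Japanese text
    vectors, assumed L2-normalized so dot products are cosines."""
    sim = E @ J.T                 # (num_en, num_ja) cosine matrix
    best_j = sim.argmax(axis=1)   # nearest Japanese text per English text
    best_e = sim.argmax(axis=0)   # nearest English text per Japanese text
    pairs = []
    for e, j in enumerate(best_j):
        if best_e[j] == e:        # mutual nearest neighbors only
            pairs.append((e, j, sim[e, j]))
    # A higher cosine is assumed to mean a more reliable pair.
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs

def select_for_stage(pairs, stage):
    # At the Nth stage, keep at most N*10 of the most reliable pairs.
    return pairs[: stage * 10]
```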
<Paragraph position="10"> On the other hand, with the normal Information Mapping method, which corresponds to the first stage of our bootstrapping method, we obtained 341 mutual-proximity pairs, and 258 of the 341 pairs were correct translation pairs. In this case, the recall was 25.8% and the precision was 75.7% (see the Test 1 column and the &quot;normal method&quot; row of Table 2).</Paragraph> <Paragraph position="11"> Figure 2 shows the transition of the precision and the recall through the 100 stages. The precision was kept over 93.3% and the recall went up gradually. We could successfully grow the set of bilingual text pairs using bootstrapping.</Paragraph> <Paragraph position="12"> We prepared 4 more different sets of 1000 pairs for pseudo comparable corpora and different sets of 100 pairs for initial parallel corpora, and repeated the same test 4 more times. Table 2 shows the results of the 5 tests of the bootstrapping method and the normal Information Mapping method. In each case the bootstrapping method drastically improved both the precision and the recall.</Paragraph> <Paragraph position="13"> We also conducted tests to see if the text pairs obtained at the 100th stage in the previous tests are useful for the normal Information Mapping method. We prepared another 1000 English-Japanese patent translation pairs for each of the 5 previous tests as evaluation corpora. No patent was shared between any two of the corpora.</Paragraph> <Paragraph position="14"> We extracted mutual-proximity pairs from the new 1000 English-Japanese pairs with the normal Information Mapping method, using as a training corpus, respectively, (1) the initial parallel corpus of the previous test, (2) the initial parallel corpus plus the mutual-proximity pairs obtained in the previous test, or (3) the initial parallel corpus plus the 1000 English-Japanese correct translation pairs in the pseudo comparable corpus of the previous test. For example, in Test 1, the number of pairs in the training corpus is 100 for (1), 827 (including 6 error pairs) for (2), and 1100 for (3).</Paragraph> <Paragraph position="15"> Table 3 shows the results. The results of (3) can be considered the ceilings of the precision and the recall, because we used all the correct translation pairs in the pseudo comparable corpus. In each case, both the precision and the recall of (2) are very close to the ceilings, so we think the bilingual text pairs obtained by our bootstrapping method are useful as a training corpus for the normal Information Mapping method.</Paragraph> </Section> <Section position="2" start_page="1068" end_page="1068" type="sub_section"> <SectionTitle> 4.2 Tests with incomplete-pair corpora </SectionTitle> <Paragraph position="0"> In the tests described above, we used an ideal pseudo comparable corpus. As described in the Introduction, a real bilingual comparable corpus is highly likely to include bilingual pairs which share the same information, but it also includes a lot of irrelevant texts. To simulate this, we replaced half of the Japanese patent texts in the pseudo comparable corpora of the previous tests with different, randomly selected Japanese patent texts. Each resulting corpus therefore included 500 English-Japanese translation pairs, plus 500 English patents and 500 Japanese patents which were totally irrelevant to each other.</Paragraph>
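This corruption of the pseudo comparable corpus is simple to reproduce; below is a minimal sketch under assumed names (unrelated_ja_pool is a hypothetical pool of extra Japanese patent texts), not the authors' code. Setting ratio=0.8 gives the 80% replacement condition reported below.

```python
import random

def make_incomplete_corpus(en_texts, ja_texts, unrelated_ja_pool,
                           ratio=0.5, seed=0):
    """Replace `ratio` of the Japanese sides with irrelevant texts, so
    only (1 - ratio) of the English-Japanese pairs stay true translations
    (e.g., 500 true pairs out of 1000 when ratio=0.5)."""
    rng = random.Random(seed)
    ja_noisy = list(ja_texts)
    replaced = rng.sample(range(len(ja_texts)), int(ratio * len(ja_texts)))
    replacements = rng.sample(unrelated_ja_pool, len(replaced))
    for idx, new_text in zip(replaced, replacements):
        ja_noisy[idx] = new_text
    return en_texts, ja_noisy
```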
<Paragraph position="1"> Results are shown in Figure 3, Table 4 and Table 5, which correspond to Figure 2, Table 2 and Table 3, respectively.</Paragraph> <Paragraph position="2"> Figure 4, Table 6 and Table 7 show the results in the case where we replaced 80% of the Japanese patent texts with irrelevant Japanese patent texts. The results of these tests are not as good as the results of the tests with the ideal pseudo comparable corpora. Tables 4 and 6 show, however, that the bootstrapping method improved both the precision and the recall of the extracted text pairs as compared to the normal method. Tables 5 and 7 also show that the bilingual text pairs obtained by the bootstrapping method are still useful as a training corpus for the normal method.</Paragraph> </Section> </Section> </Paper>