<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1413"> <Title>Using the Web as a Bilingual Dictionary</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Partially Bilingual Text in the Web </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Coverage of Fields </SectionTitle> <Paragraph position="0"> It is very difficult to measure precisely which fields of science have a large amount of partially bilingual text on the web. However, a rough estimate of the relative amount in different fields can be obtained by repeatedly querying a search engine for documents that contain both the Japanese and the English form of technical terms in each field.</Paragraph> <Paragraph position="1"> For this purpose, we used a Japanese-to-English technical term dictionary licensed from NOVA, a maker of commercial machine translation systems. The dictionary is classified into 19 categories, ranging from aeronautics to ecology to trade, as shown in Table 1. It contains 1,082,594 pairs of Japanese and English technical terms (see footnote 1).</Paragraph> <Paragraph position="2"> We randomly selected 30 pairs of Japanese and English terms from each category and sent queries to an Internet search engine, Google (Google, 2001), to see whether there are any documents that contain both the Japanese and the English technical term. The fourth column in Table 1 shows the percentage of queries (J-E pairs) for which at least one document was returned.</Paragraph> <Paragraph position="3"> Footnote 1: The dictionary can be searched on NOVA's web site (NOVA Inc., 2000).</Paragraph> <Paragraph position="4"> It is very encouraging that, on average, 42% of the queries returned at least one document. 
The results show that the web is worth mining for bilingual lexicons in fields such as aeronautics, computers, and law.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Classification of Format </SectionTitle> <Paragraph position="0"> In order to implement a term translation extractor, we have to analyze the format, or structural pattern, of partially bilingual documents. There are at least three typical formats on the web. Figure 1 shows examples.</Paragraph> <Paragraph position="1"> The three formats are the aligned paragraph format, the table format, and the plain text format. In the 'aligned paragraph' format, each paragraph is written in one language, and paragraphs in different languages are interlaced. This format is often found in web pages intended for both Japanese and non-Japanese readers, such as official government documents and academic papers (usually title and abstract only). In the 'table' format, each row contains a pair of equivalent terms. The rows are not necessarily marked up with the HTML TABLE tag. This format is often found in bilingual glossaries, of which there are many on the web. Some portals, such as kotoba.ne.jp (kotoba.ne.jp, 2000), offer hyperlinks to such bilingual glossaries.</Paragraph> <Paragraph position="2"> In the 'plain text' format, phrases in a different language are interlaced in the monolingual text of the baseline language. The vast majority of partially bilingual documents on the web belong to this category.</Paragraph> <Paragraph position="3"> The formats of web documents vary so widely that it is impossible to classify them automatically in order to estimate the relative quantity of each format. Instead, we examined the distance (in bytes) from a Japanese technical term to its corresponding English technical term in the documents retrieved from the web in the experiment described in Section 2.1. Figure 2 shows the results. 
Positive distance indicates that the English term appeared after the Japanese term, while negative distance indicates the reverse. We observe that the English and Japanese terms are likely to appear very close to each other. 28% (=233/847) of the English terms appeared just after (within 10 bytes of) the corresponding Japanese terms, and 58% (=490/847) of the English terms appeared within ±50 bytes. These cases probably reflect either the table or the plain text format.</Paragraph> <Paragraph position="5"> Figure 1 (excerpt): Registration for Foreign Residents and Birth Registration. The official name for registration for foreign residents in Japan, as determined by the Ministry of Justice, is &quot;Alien Registration&quot;. (b) An example of 'table format' taken from a medical glossary.</Paragraph> <Paragraph position="7"> Table 1 (columns: fields, words, samples, found, example of Japanese-English pair). aeronautics and space: 17862 words, 30 samples, 57% found, example pair: ecliptic coordinates.</Paragraph> <Paragraph position="8"> Although 28% (=237/847) of the English terms appeared outside the window of ±200 bytes, we find this 'distance heuristic' very powerful, so it is used in the term translation algorithm described in the next section.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Term Translation Extraction Algorithm </SectionTitle> <Paragraph position="0"> Let $J$ and $E$ be Japanese and English technical terms which are translations of each other. Let $d$ be a document, and let $D(J)$ be the set of documents that include the Japanese term $J$. Let $P(J, E)$ be a statistical translation model which gives the likelihood (or score) that $J$ and $E$ are translations of each other.</Paragraph> <Paragraph position="1"> Figure 3 shows the basic (conceptual) algorithm for extracting the English translation of a given Japanese technical term from the web. 
First, we retrieve all documents $D(J)$ that contain the given Japanese technical term $J$ using a search engine. We then eliminate the Japanese-only documents. For each English term $E$ contained in the (partially) bilingual documents, we compute the translation probability $P(J, E)$ and select the English term $\hat{E}$ that has the highest translation probability.</Paragraph> <Paragraph position="2"> In practice, it is often prohibitive to download all documents that include the Japanese term.</Paragraph> <Paragraph position="3"> Moreover, a reliable Japanese-English statistical translation model is not available at the moment because of the scarcity of parallel corpora.</Paragraph> <Paragraph position="4"> Rather, one of the aims of this research is to collect the resources for building such translation models. We therefore employed a very simplistic approach. Instead of using all documents that include the Japanese term, we used only a predetermined number of documents (the top 100 documents as ranked by the search engine). This entails the risk of missing the documents that include the English terms we are looking for.</Paragraph> <Paragraph position="5"> Instead of using a statistical translation model, we used a scoring function in the form of a geometric distribution, as shown in Equation (1): $P(J, E) = p\,(1 - p)^{\mathrm{floor}(d(J, E)/10)}$ (1)</Paragraph> <Paragraph position="7"> Here, $d(J, E)$ is the byte distance between the Japanese term $J$ and the English term $E$. It is divided by 10, and the integer part of the quotient is used as the variable of the geometric distribution (floor indicates the flooring operation). The parameter $p$ of the geometric distribution is set to 0.6 in our experiment.</Paragraph> <Paragraph position="8"> There is no theoretical background to the scoring function in Equation (1). 
It was designed, after some trial and error, so that the likelihood of a candidate pair being translations of each other decreases exponentially as the distance between the two terms increases. Starting from a score of 0.6, it falls to 40% of its previous value for every additional 10 bytes (0.6, 0.24, 0.096, and so on).</Paragraph> <Paragraph position="9"> If the same pair of Japanese and English terms is observed more than once, it is more likely that they are valid translations. Therefore, we sum the score of Equation (1) over all occurrences of the pair $(J, E)$ and select the highest-scoring English term $\hat{E}$ as the translation of the Japanese term $J$.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Test Terms </SectionTitle> <Paragraph position="0"> In order to separate the characteristics of the search engine from those of the proposed term extraction algorithm, we used, as a test set, those terms that are guaranteed to have at least one retrieved document that includes both the Japanese and the English term.</Paragraph> <Paragraph position="1"> First, we randomly selected 50 pairs of such Japanese and English terms from the pairs used in the experiment described in Section 2.1. They are shown in Figure 2. We then sent each Japanese term as a query to an Internet search engine, Google, and downloaded the top 100 web documents. &quot;o&quot; indicates that at least one of the downloaded documents included both terms. 
&quot;x&quot; indicates that no document included both terms.</Paragraph> <Paragraph position="2"> This resulted in a test set of 34 pairs of Japanese and English terms.</Paragraph> <Paragraph position="3"> For example, although there are many documents which include both &quot;西&quot; and &quot;west&quot;, the top 100 documents retrieved with &quot;西&quot; as the query did not contain &quot;west&quot;, since &quot;西&quot; is a highly frequent Japanese word.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Extraction Accuracy </SectionTitle> <Paragraph position="0"> Table 3 shows the accuracy of extracting the English translation of a Japanese term. Since both Japanese and English terms can occur as a subpart of longer terms, we would need local alignment to extract the English subpart corresponding to the Japanese query. Instead of performing this alignment, we introduced two partial match measures in addition to exact matching.</Paragraph> <Paragraph position="1"> In Table 3, 'exact' indicates that the output exactly matches the correct answer, 'partial-1' indicates that the correct answer is a subpart of the output, and 'partial-2' indicates that at least one word of the output is a subpart of the correct answer.</Paragraph> <Paragraph position="2"> For example, the eye disease '黄斑変性', whose translation is 'macular degeneration', is sometimes more formally referred to as '加齢性黄斑変性', whose translation is 'age-related macular degeneration'. 'Partial-1' holds if 'age-related macular degeneration' is extracted when the query is '黄斑変性'. 
'Partial-2' holds if 'degeneration' is included in the output when the query is '黄斑変性'.</Paragraph> <Paragraph position="3"> It is encouraging that useful outputs (either exact or partial matches) are included in the top 10 candidates with a probability of around 60%.</Paragraph> <Paragraph position="4"> Since we used simple string matching to measure the accuracy automatically, the evaluation reported in Table 3 is very conservative. Because the output also contains acronyms, synonyms, and related words, the overall performance of the system is fairly credible.</Paragraph> <Paragraph position="5"> For example, the extracted translations for the Japanese query meaning 'National Information Infrastructure' were as follows, where the second candidate is the correct answer.</Paragraph> <Paragraph position="6"> 18.721123: nii; 13.912146: national information infrastructure; 2.137008: gii; 1.398144: unii. NII (nii) is the acronym for National Information Infrastructure, while GII (gii) and UNII (unii) stand for Global Information Infrastructure and Unlicensed National Information Infrastructure, respectively.</Paragraph> <Paragraph position="7"> If the query is a chemical substance, its molecular formula, rather than an acronym, is often extracted, such as 'HCOOCH3' for the Japanese term meaning 'methyl formate'.</Paragraph> <Paragraph position="8"> 1.801008: methyl formate; 0.840786: hcooch3; 0.84: hcooh. As for synonyms, although we took 'operating expenses' to be the correct translation of the Japanese query, the third candidate below, 'operating cost', is also a legitimate translation. It is counted as 'partial-2' because 'operating' is a subpart of the correct answer.</Paragraph> <Paragraph position="9"> 1.8: fa; 0.606144: ohr; 0.6: operating cost. For reference, OHR (Over Head Ratio) is a management index equal to the operating cost divided by the gross operating profit. 
'Fa' happened to be used three times in a tutorial document on accounting to stand for 'operating expenses', in expressions such as &quot;[operating expenses] (Fa) = [cost] (E) * 23%&quot;, where the bracketed glosses stand for the corresponding Japanese terms.</Paragraph> <Paragraph position="10"> The following example combines acronyms, synonyms, and related words, and is, in a sense, a typical output of the proposed system. The query is '気候研究', and 'climate study' is the translation we assumed to be correct.</Paragraph> <Paragraph position="11"> 0.2784: world climate research programme. A subpart of the 9th candidate, 'climate research', is also a legitimate translation. 'WCRP' is the acronym for 'World Climate Research Programme', which is the 9th candidate and is translated as '世界気候研究計画', which includes the original Japanese query. 'WMO' stands for World Meteorological Organization, which hosts this international program.</Paragraph> <Paragraph position="12"> In short, by looking at the extracted translations together with the contexts from which they were extracted, one can learn a lot about the query term and its translation candidates. We think this is a useful tool for human translators, and it could provide a useful resource for statistical machine translation and cross-language information retrieval.</Paragraph> </Section> </Section> </Paper>