<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1108"> <Title>Learning Bilingual Translations from Comparable Corpora to Cross-Language Information Retrieval: Hybrid Statistics-based and Linguistics-based Approach</Title> <Section position="6" start_page="0" end_page="2" type="evalu"> <SectionTitle> 4 Experiments and Evaluations </SectionTitle> <Paragraph position="0"> Experiments were carried out to measure the improvement achieved by our proposal for bilingual terminology acquisition from comparable corpora on Japanese-English CLIR tasks, i.e., Japanese queries to retrieve English documents.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Linguistic Resources </SectionTitle> <Paragraph position="0"> Collections of news articles from Mainichi Newspapers (1998-1999) for Japanese and Mainichi Daily News (1998-1999) for English were considered as comparable corpora, because they share the same time period and a general news domain.</Paragraph> <Paragraph position="1"> We also considered documents of the NTCIR-2 test collection as comparable corpora in order to cope with special features of the test collection during evaluations.</Paragraph> <Paragraph position="2"> The morphological analyzers ChaSen version 2.2.9 (Matsumoto et al., 1997) for Japanese texts and OAK2 (Sekine, 2001) for English texts were used in the linguistic preprocessing. The EDR bilingual dictionary (EDR, 1996) was used to translate context vectors between the source and target languages. 
NTCIR-2 (Kando, 2001), a large-scale test collection, was used to evaluate the proposed strategies in CLIR.</Paragraph> <Paragraph position="3"> The SMART information retrieval system (Salton, 1971), which is based on the vector space model, was used to retrieve English documents.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Evaluations on the Proposed Translation Model </SectionTitle> <Paragraph position="0"> We considered the set of news articles as well as the abstracts of the NTCIR-2 test collection as comparable corpora for the Japanese-English language pair.</Paragraph> <Paragraph position="1"> The abstracts of the NTCIR-2 test collection are partially aligned (more than half are Japanese-English paired documents), but the alignment was not used in the present research, in order to treat the set of documents as comparable. Content words (nouns, verbs, adjectives, adverbs) were extracted from the English and Japanese corpora. In addition, foreign words (mostly represented in katakana) were extracted from the Japanese texts. Thus, context vectors were constructed for 13,552,481 Japanese terms and 1,517,281 English terms. Similarity vectors were constructed for 96,895,255 (Japanese, English) pairs of terms and 92,765,129 (English, Japanese) pairs of terms. Bi-directional similarity vectors (after merging and disambiguation) resulted in 58,254,841 (Japanese, English) pairs of terms.</Paragraph> <Paragraph position="2"> Table 1 illustrates the extracted English translation alternatives for the Japanese term 映画 (eiga), using the two-stage comparable corpora approach and its combination with linguistics-based pruning. Using the two-stage comparable corpora-based approach, correct translations of the Japanese term 映画 (eiga) were ranked third (movie) and fifth (film). 
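The similarity vectors above rest on comparing dictionary-translated context vectors across languages. The following is a minimal sketch of that core step, not the authors' implementation; the window size, the toy seed dictionary, and the romanized tokens are illustrative assumptions.

```python
import math
from collections import Counter

def context_vector(tokens, term, window=3):
    """Co-occurrence counts of words within `window` positions of `term`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def translate_vector(vec, seed_dict):
    """Map a source-language context vector into the target language
    using a seed bilingual dictionary (here a plain dict of lists)."""
    out = Counter()
    for word, freq in vec.items():
        for trans in seed_dict.get(word, []):
            out[trans] += freq
    return out

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    num = sum(u[w] * v[w] for w in u if w in v)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0
```

Candidate translations of a source term are then ranked by the cosine between its translated context vector and the context vector of each target-language term.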
We note that the top-ranked translations that count as incorrect are mostly related to the context of the source Japanese term and could support query expansion in CLIR. Combining the two-stage comparable corpora approach with linguistics-based pruning yields better results, with ranks 2 (movie) and 4 (film).</Paragraph> <Paragraph position="3"> Japanese vocabulary is frequently imported from other languages, primarily (but not exclusively) from English. A special phonetic alphabet (here Japanese katakana) is used to write down foreign words and loanwords, for example names of persons. Katakana terms can be treated via transliteration or romanization, i.e., conversion of Japanese katakana to their English equivalents or to an alphabetical description of their pronunciation. Transliteration is the phonetic or spelling representation of one language using the alphabet of another language (Knight and Graehl, 1998).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Evaluations on SMART Weighting Schemes </SectionTitle> <Paragraph position="0"> Experiments were conducted on the monolingual English runs, i.e., English queries to retrieve English documents, and the bilingual Japanese-English runs, i.e., Japanese queries to retrieve English documents. Topics 0101 to 0149 were considered, and key terms contained in the title <TITLE>, description <DESCRIPTION> and concept <CONCEPT> fields were used to generate 49 queries in Japanese and English.</Paragraph> <Paragraph position="1"> There is a variety of techniques implemented in SMART to calculate weights for individual terms in both documents and queries. These weighting techniques are formulated by combining three parameters: a term frequency component, an inverse document frequency component and a vector normalization component.</Paragraph> <Paragraph position="2"> The standard SMART notation to describe the combined schemes is &quot;XXX.YYY&quot;. 
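Table-driven romanization of katakana, as mentioned above, can be sketched as follows; the tiny mapping table is an illustrative assumption (a real system would cover the full katakana syllabary plus long-vowel and gemination marks).

```python
# Minimal katakana-to-romaji table (illustrative subset only).
KATAKANA_TO_ROMAJI = {
    "カ": "ka", "メ": "me", "ラ": "ra",
    "テ": "te", "ス": "su", "ト": "to",
}

def romanize(katakana):
    """Character-by-character romanization; unknown characters pass through."""
    return "".join(KATAKANA_TO_ROMAJI.get(ch, ch) for ch in katakana)
```

For example, カメラ (a loanword for "camera") romanizes to "kamera", which can then be matched against English terms by spelling or phonetic similarity.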
The three characters to the left (XXX) and right (YYY) of the period refer to the document and query vector components, respectively. For example, in ATC.ATN, the document terms are weighted by ATC, i.e., augmented normalized term frequency times inverse document frequency with cosine normalization, while ATN refers to the weighting scheme applied to the query terms.</Paragraph> <Paragraph position="3"> First experiments were conducted on several combinations of the weighting parameters and schemes of the SMART retrieval system for document terms and query terms, such as ATN, ATC, LTN, LTC, NNN, NTC, etc. The best performances in terms of average precision were achieved by the combined weighting schemes ATN.NTC, LTN.NTC, LTC.NTC, ATC.NTC and NTC.NTC, in that order.</Paragraph> <Paragraph position="4"> The best weighting scheme for the monolingual runs turned out to be ATN.NTC. This finding differs somewhat from previous results, where the ANN (Fox and Shaw, 1994) and LTC (Fuhr et al., 1994) weighting schemes on query terms, and the LNC.LTC (Buckley et al., 1994) and LNC.LTN (Knaus and Shauble, 1993) combined weighting schemes on document and query terms, showed the best results. On the other hand, our findings were quite similar to the result presented by Savoy (Savoy, 2003), where ATN.NTC showed the best performance among the existing weighting schemes in SMART for English monolingual runs.</Paragraph> <Paragraph position="5"> Table 2 shows some of the weighting schemes of the SMART retrieval system. 
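As a rough sketch of how two of the schemes named above combine the components, assuming the conventional SMART letter definitions (A = augmented normalized tf, N = natural tf or no normalization, T = idf, C = cosine normalization) rather than the paper's exact implementation:

```python
import math

def atn_weights(tf_counts, df, n_docs):
    """ATN: augmented normalized tf (A) x idf (T), no normalization (N)."""
    max_tf = max(tf_counts.values())
    return {t: (0.5 + 0.5 * tf / max_tf) * math.log(n_docs / df[t])
            for t, tf in tf_counts.items()}

def ntc_weights(tf_counts, df, n_docs):
    """NTC: natural tf (N) x idf (T), cosine-normalized (C)."""
    w = {t: tf * math.log(n_docs / df[t]) for t, tf in tf_counts.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm else w
```

Under ATN.NTC, `atn_weights` would be applied to document vectors and `ntc_weights` to query vectors before computing inner products.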
To assign an indexing weight w_ij that reflects the importance of each single term T_j in a document D_i, different factors should be considered (Salton and McGill, 1983), as follows: the within-document term frequency tf_ij, which corresponds to the first letter of the SMART label.</Paragraph> <Paragraph position="6"> the collection-wide document frequency df_j, which corresponds to the second letter of the SMART label.</Paragraph> <Paragraph position="7"> In Table 2, idf_j = log(N/F_j), where N represents the number of documents and F_j represents the document frequency of term T_j.</Paragraph> <Paragraph position="8"> the normalization scheme, which corresponds to the third letter of the SMART label.</Paragraph> </Section> <Section position="4" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 4.4 Evaluations on CLIR </SectionTitle> <Paragraph position="0"> Bilingual translations were extracted from comparable corpora using the proposed two-stage model. A fixed number (set to five) of top-ranked translation alternatives was retained for the evaluations in CLIR.</Paragraph> <Paragraph position="1"> Results and performances on the monolingual and bilingual runs for the proposed translation models and their combination with linguistics-based pruning are described in Table 3. Evaluations were based on the average precision, the difference in average precision from the monolingual counterpart, and the improvement over the monolingual counterpart. The best result was obtained with the combination with linguistics-based pruning, in the case of the ATN.NTC weighting scheme.</Paragraph> <Paragraph position="2"> The proposed two-stage model using comparable corpora, 'BCC', showed an improvement of +27.1% in terms of average precision compared to the simple model 'SCC' (one stage, i.e., simple comparable corpora-based translation). 
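The evaluation measure used above, uninterpolated average precision, can be sketched as follows; this is a minimal version assuming binary relevance judgments.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant
    document is retrieved (uninterpolated average precision)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

Averaging this value over all topics gives the per-run figures that columns such as those in Table 3 typically report.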
The combination with linguistics-based pruning showed the best performance in terms of average precision, with +41.7% and +11.5% compared to the simple comparable corpora-based model 'SCC' and the two-stage comparable corpora-based model 'BCC', respectively, in the case of the ATN.NTC weighting scheme. The different weighting schemes of the SMART retrieval system showed an improvement in terms of average precision for the proposed translation models 'BCC' and 'BCC+Morph'.</Paragraph> <Paragraph position="3"> The approach based on comparable corpora largely affected the translation because related words could be added as translation alternatives or expansion terms. The acquisition of bilingual terminology from bi-directional comparable corpora yields a significantly better result than using the simple model. Moreover, the linguistics-based pruning technique (weighting scheme = ATN.NTC) has allowed an improvement in the effectiveness of CLIR.</Paragraph> <Paragraph position="4"> Finally, a statistical t-test (Hull, 1993) was carried out in order to measure significant differences between paired retrieval models. The improvement obtained by using the proposed two-stage comparable corpora-based method 'BCC' was statistically significant (p-value = 0.0011). The combined statistics-based and linguistics-based pruning 'BCC+Morph' was</Paragraph> </Section> </Section> </Paper>
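The paired t-test mentioned above compares the per-query scores of two retrieval runs. A minimal sketch of the t statistic is given below; the significance level would then be read from a t distribution with n-1 degrees of freedom, and the query scores are illustrative assumptions.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic over per-query score differences between two runs."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

A large positive t over the 49 topics would indicate that run A consistently outperforms run B, which is how significance claims such as the one for 'BCC' are established.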