Learning Bilingual Translations from Comparable Corpora to
Cross-Language Information Retrieval: Hybrid Statistics-based and
Linguistics-based Approach
Fatiha Sadat
Nara Institute of Science and Technology
8916-5 Takayama-cho, Ikoma-shi
Nara, 630-0101, Japan
ffatia-s, uemurag@is.aist-nara.ac.jp, yosikawa@itc.nagoya-u.ac.jp
Masatoshi Yoshikawa
Nagoya University
Furo-cho, Chikusa-ku,
Nagoya, 464-8601, Japan
Shunsuke Uemura
Nara Institute of Science and Technology
8916-5 Takayama-cho, Ikoma-shi,
Nara, 630-0101, Japan
Abstract
Recent years saw an increased interest
in the use and the construction of large
corpora. With this increased interest
and awareness has come an expansion
in the application to knowledge acqui-
sition and bilingual terminology extrac-
tion. The present paper will seek to
present an approach to bilingual lexi-
con extraction from non-aligned compa-
rable corpora, combination to linguistics-
based pruning and evaluations on Cross-
Language Information Retrieval. We pro-
pose and explore a two-stages translation
model for the acquisition of bilingual ter-
minology from comparable corpora, dis-
ambiguation and selection of best transla-
tion alternatives on the basis of their mor-
phological knowledge. Evaluations using
a large-scale test collection on Japanese-
English and different weighting schemes
of SMART retrieval system confirmed the
effectiveness of the proposed combina-
tion of two-stages comparable corpora
and linguistics-based pruning on Cross-
Language Information Retrieval.
Keywords: Cross-Language Information
Retrieval, Comparable corpora, Transla-
tion, Disambiguation, Part-of-Speech.
1 Introduction
Researches on corpus-based approaches to machine
translation (MT) have been on the rise, particularly
because of their promise to provide bilingual termi-
nology and enrich lexical resources such as bilingual
dictionaries and thesauri. These approaches gener-
ally rely on large text corpora, which play an impor-
tant role in Natural Language Processing (NLP) and
Information Retrieval (IR). Moreover, non-aligned
comparable corpora have been given a special in-
terest in bilingual terminology acquisition and lex-
ical resources enrichment (Dagan and Itai, 1994;
Dejean et al., 2002; Diab and Finch, 2000; Fung,
2000; Koehn and Knight, 2002; Nakagawa, 2000;
Peters and Picchi, 1995; Rapp, 1999; Shahzad and
al., 1999; Tanaka and Iwasaki, 1996).
Unlike parallel corpora, comparable corpora are
collections of texts from pairs or multiples of lan-
guages, which can be contrasted because of their
common features, in the topic, the domain, the au-
thors or the time period. This property made com-
parable corpora more abundant, less expensive and
more accessible through the World Wide Web.
In the present paper, we are concerned by exploit-
ing scarce resources for bilingual terminology ac-
quisition, then evaluations on Cross-Language In-
formation Retrieval (CLIR). CLIR consists of re-
trieving documents written in one language using
queries written in another language. An application
is conducted on NTCIR, a large-scale data collection
for (Japanese, English) language pair.
The remainder of the present paper is organized
as follows: Section 2 presents the proposed two-
stages approach for bilingual terminology acquisi-
tion from comparable corpora. Section 3 describes
the integration of linguistic knowledge for pruning
the translation candidates. Experiments and evalua-
tions in CLIR are discussed in Sections 4. Section 5
concludes the present paper.
2 Two-stages Comparable Corpora-based
Approach
Our proposed approach to bilingual terminology ac-
quisition from comparable corpora (Sadat et al.,
2003; Sadat et al., 2003) is based on the assump-
tion of similar collocation, i.e., If two words are mu-
tual translations, then their most frequent collocates
are likely to be mutual translations as well. More-
over, we apply this assumption in both directions of
the corpora, i.e., find translations of the source term
in the target language corpus but also translations
of the target terms in the source language corpus.
The proposed two-stages approach for the acquisi-
tion, disambiguation and selection of bilingual ter-
minology is described as follows:
 Bilingual terminology acquisition from source
language to target language to yield a first
translation model, represented by similarity
SIMS!T .
 Bilingual terminology acquisition from target
language to source language to yield a sec-
ond translation model, represented by similar-
ity SIMT!S.
 Merge the first and second models to yield
a two-stages translation model, based on bi-
directional comparable corpora and repre-
sented by similarity SIMS$T .
We follow strategies of previous researches (De-
jean et al., 2002; Fung, 2000; Rapp, 1999) for
the first and second translation models and propose
a merging strategy for the two-stages translation
model (Sadat et al., 2003).
First, word frequencies, context word frequencies
in surrounding positions (here three-words window)
are computed following a statistics-based metrics,
the log-likelihood ratio (Dunning, 1993). Context
vectors for each source term and each target term
are constructed. Next, context vectors of the tar-
get words are translated using a preliminary bilin-
gual dictionary. We consider all translation candi-
dates, keeping the same context frequency value as
the source term. This step requires a seed lexicon, to
expand using the proposed bootstrapping approach
of this paper. Similarity vectors are constructed for
each pair of source term and target term using the
cosine metric (Salton and McGill, 1983).
Therefore, similarity vectors SIMS!T and
SIMT!S for the first and second models are con-
structed and merged for a bi-directional acquisition
of bilingual terminology from source language to
target language. The merging process will keep
common pairs of source term and target transla-
tion (s,t) which appear in SIMS!T as pairs of (s,t)
but also in SIMT!S as pairs of (t,s), to result in
combined similarity vectors SIMS$T for each pair
(s,t).The product of similarity values of both simi-
larity vectors SIMS!T for pairs (s,t) and SIMT!S
for pairs (t,s) will result in similarity values in vec-
tors SIMS$T .
Therefore, similarity vectors of the two-stages
translation model are expressed as follows:
SIMS$T = f(s;t;simS$T (tjs)) j (s;t;simS!T (tjs))
2 SIMS!T ^ (t;s;simT!S(sjt)) 2 SIMT!S
^ simS$T (tjs) = simS!T (tjs)  simT!S(sjt)g
3 Linguistics-based Pruning
Combining linguistic and statistical methods is be-
coming increasingly common in computational lin-
guistics, especially as more corpora become avail-
able (Klanvans and Tzoukermann, 1996; Sadat et
al., 2003). We propose to integrate linguistic con-
cepts into the corpora-based translation model. Mor-
phological knowledge such as Part-of-Speech (POS)
tags, context of terms, etc., could be valuable to filter
and prune the extracted translation candidates. The
objective of the linguistics-based pruning technique
is the detection of terms and their translations that
are morphologically close enough, i.e., close or sim-
ilar POS tags. This proposed approach will select a
fixed number of equivalents from the set of extracted
target translation alternatives that match the Part-of-
Speech of the source term.
Therefore, POS tags are assigned to each source
term (Japanese) via morphological analysis. As
well, a target language morphological analysis will
assign POS tags to the translation candidates. We
restricted the pruning technique to nouns, verbs, ad-
jectives and adverbs, although other POS tags could
be treated in a similar way. For Japanese-English1
pair of languages, Japanese nouns (a0a2a1 ) are com-
pared to English nouns (NN) and Japanese verbs (a3
a1 ) to English verbs (VB). Japanese adverbs (
a4
a1 )
are compared to English adverbs (RB) and adjec-
tives (JJ); while, Japanese adjectives (a5a7a6 a1 ) are
compared to English adverbs (RB) and adjectives
(JJ). This is because most adverbs in Japanese are
formed from adjectives. Thus. We select pairs or
source term and target translation (s,t) such as:
POS(s) = ’NN’ and POS(t) = ’a0a8a1 ’
POS(s) = ’VB’ and POS(t) = ’a3 a1 ’
POS(s) = ’RB’ and [POS(t) = ’ a4 a1 ’ or ’a5a9a6 a1 ’]
POS(s) = ’JJ’ and [POS(t) = ’ a5a8a6 a1 ’ or ’a4 a1 ’]
Japanese foreign words (tagged FW) were consid-
ered as loanwords, i.e., technical terms and proper
nouns imported from foreign languages; and there-
fore were not pruned with the proposed linguistics-
based technique but could be treated via translitera-
tion.
The generated translation alternatives are sorted
in decreasing order by similarity values. Rank
counts are assigned in increasing order, starting at
1 for the first sorted list item. A fixed number of
top-ranked translation alternatives are selected and
misleading candidates are discarded.
In order to demonstrate the procedure of our
translation model, we give an example in Japanese
and explain how the English translations are ex-
tracted, disambiguated and selected and how the
phrasal translation is constructed.
Given a simple Japanese query ’ a10a7a11a2a10a13a12a2a14a16a15
a17a19a18a9a20
a10a21a11a9a10a23a22a9a15a25a24a8a26a8a27a29a28a19a30a29a12a21a14
a17a25a31a8a32a25a33 ’ (ajia
kyougi taikai wa, ajia saidai no supoutsu kyougikai
de aru).
After segmentation, removing stop words
and keeping only content words (nouns, verbs,
adverbs, adjectives and foreign words), the asso-
ciated list of Japanese terms becomes a34a35a10a16a11a19a10 ,
1English POS tags NN refers to noun, VB to verb, RB to
adverb, JJ to adjective; while Japanese POS tags a36a38a37 refers to
a noun, a39a40a37 to a verb, a41a40a37 to an adverb and a42a40a43a44a37 to an
adjective, with respect to their extensions.
a12a16a14 , a15
a17 ,
a10a16a11a45a10 , a22a19a15 , a26a45a27a46a28a47a30 , a12a16a14 ,
a17 ’
(ajia;kyougi;taikai;ajia;saidai;supoutsu;kyougi;
kai). The combined translation model is applied
on each source term of the associated list and
top-ranked word translation alternatives are selected
according to their highest similarities as follows:
a34a48a10a46a11a49a10a51a50 (ajia):f(asia, 1.035), (assembly, 0.0611),
(city, 0.0589), (event, 0.0376), etc.g
a34a52a12a2a14a53a50 (kyougi): f(competition, 0.057), (sport,
0.0561), (representative, 0.0337), (international,
0.0331), etc.g
a34a52a15
a17
a50 (taikai): f(meeting, 0.176), (tournament,
0.0588), (assembly, 0.0582), (dialogue, 0.0437),
etc.g
a34a54a22a55a15 ’ (saidai): f(general, 0.0459), (great, 0.0371),
(famous, 0.0362), (global, 0.0329), (group, 0.032),
(measure, 0.0271), (factor, 0.0268), etc.g
a34a35a26a7a27a46a28a56a30 ’ (supoutsu): f(sport, 1.098), (union,
0.0399), (day, 0.0392), (international, 0.0375), etc.g
a34
a17 ’ (kai): f(taikai, 0.0489), (great, 0.0442), (meet-
ing, 0.0365), (gather, 0.0348), (person, 0.0312),
etc.g
The phrasal translation associated to the Japanese
query is formed by selecting a number of top-ranked
translation alternatives (here set to 3) and illustrated
as follows: ’asia assembly city competition sport
representative meeting tournament assembly gen-
eral great famous sport union day taikai great meet-
ing’.
Linguistics-based pruning was applied on the
Japanese terms and the extracted English translation
alternatives. Chasen morphological analyzer (Mat-
sumoto and al., 1997)for Japanese has associated
POS tags as a0a9a1 (noun) to all Japanese terms:
a10a21a11a9a10 (ajia)
a0a8a1 -
a57a29a58
a0
a12a21a14 (kyougi)
a0a8a1 -
a59a29a60a21a61a9a62
a15
a17 (taikai) a0a8a1 -
a63a9a64
a22a21a15 (saidai) a0a8a1 -a63a9a64
a26a21a27a29a28a19a30 (supoutsu)
a0a8a1 -
a63a9a64
a17 (kai) a0a8a1 -
a63a9a64
Therefore, English translation alternatives associ-
ated with POS tags as nouns (NN) via a morpholog-
ical analyzer for English (Sekine, 2001)are selected
and translation candidates having POS tags other
than NN (noun) are discarded. Selected translation
alternatives for the Japanese noun a22a16a15 (saidai)
become ’group, measure, factor’. As well, the
Japanese term ’ a17 ’ (kai) is associated to the En-
glish translations: ’taikai, meeting, person’.
The phrasal translation associated to the Japanese
query after the linguistics-based pruning is illus-
trated as follows: ’asia assembly city competition
sport representative meeting tournament assembly
group measure factor sport union day taikai meet-
ing person’.
Possible re-scoring techniques could be applied
on phrasal translation in order to select best trans-
lation alternatives among the extracted ones.
4 Experiments and Evaluations
Experiments have been carried out to measure the
improvement of our proposal on bilingual termi-
nology acquisition from comparable corpora on
Japanese-English tasks in CLIR, i.e. Japanese
queries to retrieve English documents.
4.1 Linguistic Resources
Collections of news articles from Mainichi Newspa-
pers (1998-1999) for Japanese and Mainichi Daily
News (1998-199) for English were considered as
comparable corpora, because of the common fea-
ture in the time period and the generalized domain.
We have also considered documents of NTCIR-2 test
collection as comparable corpora in order to cope
with special features of the test collection during
evaluations.
Morphological analyzers, ChaSen version 2.2.9
(Matsumoto and al., 1997) for texts in Japanese and
OAK2 (Sekine, 2001) were used in the linguistic pre-
processing.
EDR bilingual dictionary (EDR, 1996) was used
to translate context vectors of source and target lan-
guages.
NTCIR-2 (Kando, 2001), a large-scale test collec-
tion was used to evaluate the proposed strategies in
CLIR.
SMART information retrieval system (Salton,
1971), which is based on vector space model, was
used to retrieve English documents.
4.2 Evaluations on the Proposed Translation
Model
We considered the set of news articles as well as
the abstracts of NTCIR-2 test collection as compa-
rable corpora for Japanese-English language pairs.
The abstracts of NTCIR-2 test collection are par-
tially aligned (more than half are Japanese-English
paired documents) but the alignment was not con-
sidered in the present research to treat the set of
documents as comparable. Content words (nouns,
verbs, adjectives, adverbs) were extracted from En-
glish and Japanese corpora. In addition, foreign
words (mostly represented in katakana) were ex-
tracted from Japanese texts. Thus, context vectors
were constructed for 13,552,481 Japanese terms and
1,517,281 English terms. Similarity vectors were
constructed for 96,895,255 (Japanese, English) pairs
of terms and 92,765,129 (English, Japanese) pairs
of terms. Bi-directional similarity vectors (after
merging and disambiguation) resulted in 58,254,841
(Japanese, English) pairs of terms.
Table 1 illustrates some situations with the ex-
tracted English translation alternatives for Japanese
terms a0a2a1 (eiga), using the two-stages compara-
ble corpora approach and combination to linguistics-
based pruning. Using the two-stages comparable
corpora-based approach, correct translations of the
Japanese term a0a3a1 (eiga) were ranked in top 3
(movie) and top 5 (film). We notice that top
ranked translations, which are considered as wrong
translations, are related mostly to the context of the
source Japanese term and could help the query ex-
pansion in CLIR. Combined two-stages compara-
ble corpora with the linguistics-based pruning shows
better results with ranks 2 (movie) and 4 (film).
Japanese vocabulary is frequently imported from
other languages, primarily (but not exclusively)
from English. The special phonetic alphabet (here
Japanese katakana) is used to write down for-
eign words and loanwords, example names of per-
sons and others. Katakana terms could be treated
via transliteration or possible romanization, i.e.,
conversion of Japanese katakana to their English
equivalence or the alphabetical description of their
pronunciation. Transliteration is the phonetic or
spelling representation of one language using the
alphabet of another language (Knight and Graehl,
1998).
4.3 Evaluations on SMART Weighting
Schemes
Conducted experiments and evaluations were com-
pleted on NTCIR test collection using the monolin-
Table 1: An example for the two-stages comparable corpora translation model and linguistics-based pruning
Two-stages Comparable Corpora Linguistics-based Pruning
Japanese English Similarity English Similarity
Term Translation Value Rank Translation Value Rank
famous 0.449 1
a0a2a1 picture 0.361 2 picture 0.361 1
(eiga) movie 0.2163 3 movie 0.2163 2
oscar 0.1167 4 oscar 0.1167 3
film 0.1116 5 film 0.1116 4
gual English runs, i.e., English queries to retrieve
English documents and the bilingual Japanese-
English runs, i.e., Japanese queries to retrieve En-
glish document. Topics 0101 to 0149 were con-
sidered and key terms contained in the fields, title
<TITLE>, description <DESCRIPTION>
and concept <CONCEPT> were used to gener-
ate 49 queries in Japanese and English.
There is a variety of techniques implemented in
SMART to calculate weights for individual terms in
both documents and queries. These weighting tech-
niques are formulated by combining three parame-
ters: Term Frequency component, Inverted Docu-
ment Frequency component and Vector Normaliza-
tion component.
The standard SMART notation to describe the
combined schemes is ”XXX.YYY”. The three char-
acters to the left (XXX) and right (YYY) of the pe-
riod refer to the document and query vector compo-
nents, respectively. For example, ATC.ATN applies
augmented normalized term frequency, tf idf doc-
ument frequency (term frequency times inverse doc-
ument frequency components) to weigh terms in the
collection of documents. Similarly ATN refers to the
weighting scheme applied to the query.
First experiments were conducted on several com-
binations of weighting parameters and schemes of
SMART retrieval system for documents terms and
query terms, such as ATN, ATC, LTN, LTC, NNN,
NTC, etc. Best performances in terms of aver-
age precision were realized by the following com-
bined weighting schemes: ATN.NTC, LTN.NTC,
LTC.NTC, ATC.NTC and NTC.NTC, respectively.
The best weighting scheme for the monolingual
runs turned out to be the ATN.NTC. This finding
is somewhat different from previous results where
ANN (Fox and Shaw, 1994), LTC (Fuhr and al.,
1994) weighting schemes on query terms, LNC.LTC
(Buckley and al., 1994) and LNC.LTN (Knaus and
Shauble, 1993) combined weighting schemes on
document terms and query terms showed the best re-
sults. On the other hand, our findings were quite
similar to the result presented by Savoy (Savoy,
2003), where the ATN.NTC showed the best per-
formance among the existing weighting schemes in
SMART for English monolingual runs.
Table 2 shows some weighting schemes of
SMART retrieval system, among others. To assign
an indexing weight wij that reflects the importance
of each single-term Tj in a document Di, differ-
ent factors should be considered (Salton and McGill,
1983), as follows:
 within-document term frequency tfij, which
represents the first letter of the SMART label.
 collection-wide term frequency dfj, which rep-
resents the second letter of the SMART label.
In Table 2, idfj = log NFj ; where, N represents
the number of documents and Fj represents the
document frequency of term Tj.
 normalization scheme, which represents the
third letter of the SMART label.
4.4 Evaluations on CLIR
Bilingual translations were extracted from compara-
ble corpora using the proposed two-stages model. A
fixed number (set to five) of top-ranked translation
alternatives was retained for evaluations in CLIR.
Results and performances on the monolingual and
bilingual runs for the proposed translation models
and the combination to linguistics-based pruning are
described in Table 3. Evaluations were based on
the average precision, differences in term of aver-
age precision of the monolingual counterpart and the
improvement over the monolingual counterpart. As
Table 2: Weighting Schemes on SMART Retrieval System
SMART Label Weighting Scheme
NNN wij = tfij
ATN wij = idfj  [0:5 + tfij2 max tfi ]
LTN wij = idfj  [ln(tfij) + 1:0]
LTC wij = idfj [ln(tfij )+0:1]pPn
k=1[idfk (ln(tfik)+0:1)]
2
ATC wij =
idfj (0:5+ tfij2 max tfi )q
Pn
k=1[idfk (0:5+
tfik
2 max tfi )]2
NTC wij = idfj tfijpPn
k=1[idfk tfik]
2
well, evaluations using R-precision are illustrated in
Table 3.
Figure 1 represents the recall/precision curves of
the proposed two-stages comparable corpora-based
translation model and combination to linguistics-
based pruning, in the case of ATN.NTC weighting
scheme.
The proposed two-stages model using compara-
ble corpora ’BCC’ showed a better improvement in
terms of average precision compared to the sim-
ple model ’SCC’ (one stage, i.e., simple compara-
ble corpora-based translation) with +27.1%. Com-
bination to Linguistics-based pruning showed the
best performance in terms of average precision with
+41.7% and +11.5% compared to the simple compa-
rable corpora-based model ’SCC’ and the two-stages
comparable corpora-based model ’BCC’, respec-
tively, in the case of ATN.NTC weighting scheme.
Different weighting schemes of SMART retrieval
system showed an improvement in term of average
precision for the proposed translation models ’BCC’
and ’BCC+Morph’.
The approach based on comparable corpora
largely affected the translation because related
words could be added as translation alternatives or
expansion terms. The acquisition of bilingual ter-
minology from bi-directional comparable corpora
yields a significantly better result than using the sim-
ple model. Moreover, the linguistics-based pruning
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 0.2 0.4 0.6 0.8 1
Precision
Recall
ME
SCC
BCC
BCC+Morph
Figure 1: Recall/Precision curves for the proposed
translation models and combination to linguistics-
based pruning (weighting scheme = ATN.NTC)
technique has allowed an improvement in the effec-
tiveness of CLIR.
Finally, statistical t-test (Hull, 1993) was carried
out in order to measure significant differences be-
tween paired retrieval models. The improvement by
using the proposed two-stages comparable corpora-
based method ’BCC’ was statistically significant
(p-value=0.0011). The combined statistics-based
and linguistics-based pruning ’BCC+Morph’ was
Table 3: Best results on different weighting schemes for the proposed translation models and the linguistics-
based pruning
Average Precision, % Monolingual, and % Improvement R-Precision, % Monolingual, and % Improvement
Weighting ME SCC BCC BCC+Morph ME SCC BCC BCC+Morph
Models (Monolingual (Simple Comp. (Two-stages Comp. (Linguistics- (Monolingual (Simple Comp. (Two-stages Comp. (Linguistics-
English) Corpora) Corpora) (based pruning) English) Corpora) Corpora) (based pruning)
0.2683 0.1417 0.1801 0.2008 0.2982 0.1652 0.2143 0.2391
ATN.NTC (100%) (52.81%) (67.12%) (74.84%) (100%) (55.34%) (71.86%) (80.18%)
(-47.18%) (-32.87%) (-25.16%) (-44.6%) (-28.13%) (-19.82%)
0.2236 0.091 0.1544 0.1729 0.2508 0.1339 0.1823 0.2066
LTN.NTC (100%) (40.69%) (69.05%) (77.32%) (100%) (53.39%) (72.69%) (82.37%)
(-59.3%) (-30.94%) (-22.67%) (-46.61%) (-27.31%) (-17.62%)
0.1703 0.0787 0.1138 0.1327 0.1943 0.0966 0.1396 0.1663
LTC.NTC (100%) (46.21%) (66.82%) (77.92%) (100%) (49.71%) (71.85%) (85.59%)
(-53.78%) (-33.17%) (-22.08%) (-50.28%) (-28.15%) (-14.41%)
0.1665 0.0707 0.1091 0.1252 0.2004 0.0923 0.1368 0.1481
ATC.NTC (100%) (42.46%) (65.52%) 75.19% (100%) (46.05%) (68.26%) (73.9%)
(-57.53%) (-34.47%) (-24.8%) (-53.94%) (-31.73%) (-26.1%)
0.1254 0.0575 0.073 0.0915 0.154 0.079 0.0989 0.1175
NTC.NTC (100%) (45.85%) (58.21%) (72.96%) (100%) (51.3%) (64.22%) (76.3%)
(-54.15%) (-41.78%) (-27.03%) (-48.7%) (-35.78%) (-23.7%)
found statistically significant (p-value= 0.05) over
the monolingual retrieval ’ME’.
5 Conclusions and Future Work
Dictionary-based translation has been widely used
in CLIR because of its simplicity and availability.
However, failure to translate words and compounds
as well as limitations of general-purpose dictionaries
especially for specialized vocabulary are among the
reasons of drop in retrieval performance especially
when dealing with CLIR. Enriching bilingual dic-
tionaries and thesauri is possible through bilingual
terminology acquisition from large corpora. Parallel
corpora are costly to acquire and their availability is
extremely limited for any pair of languages or even
not existing for some languages, which are charac-
terized by few amounts of Web pages on the WWW.
In contrast, comparable corpora are more abundant,
more available in different domains, less expensive
and more accessible through the WWW.
In the present paper, we investigated the ap-
proach of extracting bilingual terminology from
comparable corpora in order to enrich existing bilin-
gual lexicons and thus enhance Cross-Language In-
formation Retrieval. We proposed a two-stages
translation model consisting of bi-directional ex-
traction, merging and disambiguation of the ex-
tracted bilingual terminology. A hybrid combination
to linguistics-based pruning showed its efficiency
across Japanese-English pair of languages. Most of
the selected terms could be considered as translation
candidates or expansion terms in CLIR.
Ongoing research is focused on the integration
of transliteration for the special phonetic alphabet.
Techniques on phrasal translation will be investi-
gated in order to select best phrasal translation al-
ternatives in CLIR. Evaluations using other combi-
nations and more efficient weighting schemes that
are not included in SMART retrieval system such
as OKAPI, which showed great success in informa-
tion retrieval, are among the future subjects of our
research on CLIR.
Acknowledgements
The present research is supported in part by the
Ministry of Education, Culture, Sports, Science and
Technology of Japan. Our thanks go to all reviewers
for their valuable comments on the earlier version of
this paper.

References

C. Buckley, J. Allan and G. Salton. 1994. Automatic
Routing and Ad-hoc Retrieval using Smart. Proc. Sec-
ond Text Retrieval Conference TREC-2, pages 45-56,

I. Dagan and I. Itai. 1994. Word Sense Disambiguation
using a Second Language Monolingual Corpus. Computational Linguistics, 20(4):563-596.

H. Dejean, E. Gaussier and F. Sadat. 2002. An Approach
based on Multilingual Thesauri and Model Combination for Bilingual Lexicon Extraction. In Proc. COLING 2002.

M. Diab and S. Finch. 2000. A Statistical Word-level
Translation Model for Comparable Corpora. Proc. of
the Conference on Content-based Multimedia Information Access RIAO.

T. Dunning. 1993. Accurate Methods for the Statistics
of Surprise and Coincidence. Computational linguistics 19(1).

EDR. 1996. Japan Electronic Dictionary Research Insti-
tute, Ltd. EDR electronic dictionary version 1.5 EDR.
Technical guide. Technical report TR2-007.

A. E. Fox and A. J. Shaw. 1994. Combination of Multi-
ple Searches. Proc. Second Text Retrieval Conference
TREC-2, pages 243-252.

N. Fuhr, U. Pfeifer, C. Bremkamp, M. Pollmann and C.
Buckley. 1994. Probabilistic Learning Approaches
for Indexing and Retrieval with the TREC-2 Collection. Proc. Second Text Retrieval Conference TREC-2,
pages 67-74.

P. Fung. 2000. A Statistical View of Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel
Corpora. In Jean Veronis, Ed. Parallel Text Processing.

D. Hull. 1993. Using Statistical Testing in the Evalua-
tion of Retrieval Experiments. Proc. ACM SIGIR'93,
pages 329-338.

N. Kando. 2001. Overview of the Second NTCIR Workshop. In Proc. Second NTCIR Workshop on Research
in Chinese and Japanese Text Retrieval and Text Summarization.

J. Klavans and E. Tzoukermann. 1996. Combining Corpus and Machine-Readable Dictionary Data for Building Bilingual Lexicons. Machine Translation, 10(3-4):1-34.

D. Knaus and P. Shauble. 1993. Effective and Efficient
Retrieval from Large and Dynamic Document Collections. Proc. Second Text Retrieval Conference TREC-3, pages 163-170.

K. Knight and J. Graehl. 1998. Machine Transliteration.
Computational Linguistics, 24(4).

P. Koehn and K. Knight. 2002. Learning a Translation
Lexicon from Monolingual Corpora. In Proc. ACL-02
Workshop on Unsupervised Lexical Acquisition.

Y. Matsumoto, A. Kitauchi, T. Yamashita, O. Imaichi and
T. Imamura. 1997. Japanese Morphological Analysis
System ChaSen Manual. Technical Report NAIST-IS-TR97007.

H. Nakagawa. 2000. Disambiguation of Lexical Translations based on Bilingual Comparable Corpora. Proc.
LREC 2000, Workshop of Terminology Resources and
Computation WTRC 2000, pages 33-38.

C. Peters and E. Picchi. 1995. Capturing the comparable: A System for Querying Comparable Text Corpora. Proc. 3rd International Conference on Statistical Analysis of Textual Data, pages 255-262.

R. Rapp. 1999. Automatic Identification of Word Translations from Unrelated English and German Corpora.
In Proc. European Association for Computational Linguistics.

F. Sadat, M. Yoshikawa and S. Uemura. 2003. Enhancing Cross-language Information Retrieval by an
Automatic Acquisition of Bilingual Terminology from
Comparable Corpora. In Proc. ACM SIGIR 2003,
Toronto, Canada.

F. Sadat, M. Yoshikawa and S. Uemura. 2003. Bilingual Terminology Acquisition from Comparable Corpora and Phrasal Translation to Cross-Language Information Retrieval. In Proc. ACL 2003, Sapporo, Japan.

G. Salton. 1971. The SMART Retrieval System, Experi-
ments in Automatic Documents Processing. Prentice-
Hall, Inc., Englewood Cliffs, NJ.

G. Salton and J. McGill. 1983. Introduction to Modern
Information Retrieval. New York, Mc Graw-Hill.

J. Savoy. 2003. Cross-Language Information Retrieval:
Experiments based on CLEF 2000 Corpora. Informa-
tion Processing & Management 39(1):75-115.

S. Sekine. 2001. OAK System-Manual. New York Uni-
versity.

I. Shahzad, K. Ohtake, S. Masuyama and K. Yamamoto.
1999. Identifying Translations of Compound using
Non-aligned Corpora. Proc. Workshop MAL, pages
108-113.

K. Tanaka and H. Iwasaki. 1996 Extraction of Lexical
Translations from Non-aligned Corpora. Proc. COLING 96.
