Integrating Cross-Lingually Relevant News Articles and
Monolingual Web Documents in Bilingual Lexicon Acquisition
Takehito Utsuro† and Kohei Hino‡ and Mitsuhiro Kida†
Seiichi Nakagawa‡ and Satoshi Sato†
†Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto, 606-8501, Japan
‡Department of Information and Computer Sciences, Toyohashi University of Technology,
Tenpaku-cho, Toyohashi, 441-8580, Japan
Abstract
In the framework of bilingual lexicon acquisition
from cross-lingually relevant news articles on the
Web, it is relatively harder to reliably estimate bilin-
gual term correspondences for low frequency terms.
To address this, this paper proposes to complementarily use a
much larger set of monolingual Web documents collected by
search engines as a resource for reliably re-estimating
bilingual term correspondences. We experimentally show that,
given a sufficient number of monolingual Web documents,
bilingual term correspondences can be estimated reliably for
those low frequency terms.
1 Introduction
Translation knowledge acquisition from paral-
lel/comparative corpora is one of the most im-
portant research topics of corpus-based MT.
This is because it is necessary for an MT sys-
tem to (semi-)automatically increase its trans-
lation knowledge in order for it to be used in
the real world situation. One limitation of
the corpus-based translation knowledge acquisi-
tion approach is that the techniques of transla-
tion knowledge acquisition heavily rely on avail-
ability of parallel/comparative corpora. How-
ever, the sizes as well as the domain of existing
parallel/comparative corpora are limited, while
it is very expensive to manually collect paral-
lel/comparative corpora. Therefore, it is quite
important to overcome this resource scarcity
bottleneck in corpus-based translation knowl-
edge acquisition research.
In order to solve this problem, we proposed
an approach of taking bilingual news articles
on Web news sites as a source for translation
knowledge acquisition (Utsuro et al., 2003). In
the case of Web news sites in Japan, both Japanese and
English news articles are updated every day. Although most
of those bilingual news articles are not parallel even if
they are from the same site, a certain portion of them share
their contents or at least report closely related topics.
This characteristic
is quite important for the purpose of transla-
tion knowledge acquisition. Utsuro et al. (2003)
showed that it is possible to acquire translation
knowledge of domain specific named entities,
event expressions, and collocational expressions
from the collection of bilingual news articles on
Web news sites.
Based on the results of our previous study,
this paper further examines the correlation of
term frequency and the reliability of bilingual
term correspondences estimated from bilingual
news articles. We show that, for high frequency
terms, it is relatively easier to reliably estimate
bilingual term correspondences. However, for
low frequency terms, it is relatively harder to re-
liably estimate bilingual term correspondences.
This low frequency problem often arises when a sufficient
number of bilingual news articles is not available.
Considering such a situation, this paper then
proposes to complementarily use much larger
monolingual Web documents collected by search
engines, as a resource for reliably re-estimating
bilingual term correspondences. The collected monolingual
Web documents are regarded as comparable corpora, to which a
standard technique of estimating bilingual term
correspondences from comparable corpora is applied. In the
evaluation, we show that, with a sufficient number of
monolingual Web documents, bilingual term correspondences
can be estimated reliably. As the most notable experimental
result, we further show that, for the terms which appear
infrequently in news articles, the accuracy of re-estimating
bilingual term correspondences does actually improve.
Figure 1: Translation Knowledge Acquisition
from Web News Sites: Overview
2 Estimating Bilingual Term
Correspondences from
Cross-Lingually Relevant News
Articles
2.1 Overview
Figure 1 illustrates the overview of our framework of
translation knowledge acquisition from Web news sites.
First, pairs of Japanese and English news articles which
report identical or at least closely related contents are
retrieved. In this cross-lingual retrieval process,
translation knowledge such as a bilingual dictionary and MT
software is used for measuring the similarity of Japanese
and English articles across languages. Then, by applying
previously studied techniques of translation knowledge
acquisition from parallel/comparative corpora, translation
knowledge such as bilingual term correspondences is
acquired.
2.2 Cross-Language Retrieval of Rel-
evant News Articles
This section gives an overview of our framework of
cross-language retrieval of relevant news articles from Web
news sites (Utsuro et al., 2003). First, both Japanese and
English news articles within a certain range of dates are
retrieved from Web news sites. Let $d_J$ and $d_E$ denote
one of the retrieved Japanese and English articles,
respectively. Then, each English article $d_E$ is translated
into a Japanese document $d^{MT}_J$ by commercial MT
software [1]. Each Japanese article $d_J$ as well as the
Japanese translation $d^{MT}_J$ of each English article is
next segmented into word sequences, and word frequency
vectors $v(d_J)$ and $v(d^{MT}_J)$ are generated. Then,
cosine similarities between $v(d_J)$ and $v(d^{MT}_J)$ are
calculated [2], and pairs of articles $d_J$ and $d_E$ which
satisfy a certain criterion are considered as candidates for
cross-lingually relevant article pairs.
As we describe in section 4.1, on Web news
sites in Japan, the number of articles up-
dated per day is far greater (about 4 times)
in Japanese than in English. Thus, it is
much easier to find cross-lingually relevant ar-
ticles for each English query article than for
each Japanese query article. Considering this
fact, we estimate bilingual term correspon-
dences from the results of cross-lingually re-
trieving relevant Japanese articles with English
query articles. For each English query article $d^i_E$ and
its Japanese translation $d^{MT_i}_J$, the set $D^i_J$ of
Japanese articles that are within a certain range of dates
and have cosine similarities higher than or equal to a
certain lower bound $L_d$ is constructed:

$$D^i_J = \left\{ d_J \mid \cos(v(d^{MT_i}_J), v(d_J)) \ge L_d \right\} \qquad (1)$$
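
As an illustration of Equation (1), the following is a minimal sketch rather than the system actually used. It assumes that articles are already segmented into word lists, that the English query article has already been machine-translated into Japanese, and that the date-range restriction has been applied beforehand. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse word frequency vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevant_japanese_articles(d_mt_j, japanese_articles, lower_bound):
    """Return the ids of Japanese articles d_J with cos(v(d_MT_J), v(d_J)) >= L_d.

    d_mt_j: word list of the Japanese translation of one English query article.
    japanese_articles: dict mapping an article id to its word list
    (assumed to be already restricted to the allowed range of dates).
    """
    query_vector = Counter(d_mt_j)
    return [
        article_id
        for article_id, words in japanese_articles.items()
        if cosine(query_vector, Counter(words)) >= lower_bound
    ]


# Toy data (hypothetical): one translated query and two Japanese articles.
d_mt_j = ["首相", "イラク", "派遣", "イラク"]
japanese_articles = {
    "j1": ["首相", "イラク", "派遣"],
    "j2": ["野球", "試合", "結果"],
}
selected = relevant_japanese_articles(d_mt_j, japanese_articles, 0.3)
```

With the lower bound 0.3, only the article sharing vocabulary with the translated query is kept.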
2.3 Estimating Bilingual Term Cor-
respondences with Pseudo-
Parallel Corpus
This section describes the technique we apply to the task
of estimating bilingual term correspondences from
cross-lingually relevant news texts. Here, we regard
cross-lingually relevant news texts as a pseudo-parallel
corpus, to which standard techniques of estimating bilingual
term correspondences from parallel corpora can be
applied [3].
[1] In this query translation process, we compared MT
software with a bilingual lexicon. CLIR with query
translation by MT software performed much better than that
by a bilingual lexicon. In the case of news articles on Web
news sites, it is relatively easy to find articles in the
other language which report closely related contents, with
just a few days' difference in report dates. In such a case,
exact query translation by MT software is suitable, because
an exact translation is expected to easily match the closely
related articles in the other language. As we mention in
Section 3.3, this is opposite to the situation of
monolingual Web documents, where closely related documents
in the other language are much less likely to be found.
[2] It is also quite possible to employ weights other than
word frequencies, such as tf·idf, and similarity measures
other than the cosine measure, such as the Dice or Jaccard
coefficients.
[3] We also applied another technique based on contextual
vector similarities (Utsuro et al., 2003), which
First, we concatenate the constituent Japanese articles of
$D^i_J$ into one article $D^{\prime i}_J$, and regard the
article pair $d^i_E$ and $D^{\prime i}_J$ as a
pseudo-parallel sentence pair. Next, we collect such
pseudo-parallel sentence pairs and construct a
pseudo-parallel corpus $PPC_{EJ}$ of English and Japanese
articles:

$$PPC_{EJ} = \left\{ \langle d^i_E, D^{\prime i}_J \rangle \mid D^i_J \neq \emptyset \right\}$$
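
The construction of the pseudo-parallel corpus can be sketched as follows. This is only an illustration under simplifying assumptions: articles are word lists, and the retrieval results are given as a mapping from English article ids to lists of relevant Japanese article ids. All names and data are hypothetical.

```python
def build_pseudo_parallel_corpus(english_articles, relevant_ids, japanese_articles):
    """Pair each English article with the concatenation of its relevant
    Japanese articles, skipping English articles with no relevant article."""
    corpus = []
    for english_id, english_words in english_articles.items():
        japanese_ids = relevant_ids.get(english_id, [])
        if not japanese_ids:
            continue  # the retrieved set is empty, so no pair is produced
        concatenated = [
            word
            for japanese_id in japanese_ids
            for word in japanese_articles[japanese_id]
        ]
        corpus.append((english_words, concatenated))
    return corpus


# Toy data (hypothetical).
english_articles = {"e1": ["prime", "minister"], "e2": ["baseball"]}
japanese_articles = {"j1": ["首相"], "j2": ["イラク", "派遣"]}
relevant_ids = {"e1": ["j1", "j2"], "e2": []}
ppc = build_pseudo_parallel_corpus(english_articles, relevant_ids, japanese_articles)
```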
Then, we apply standard techniques of estimating bilingual
term correspondences from parallel corpora (Matsumoto and
Utsuro, 2000) to this pseudo-parallel corpus $PPC_{EJ}$.
First, from a pseudo-parallel sentence pair $d^i_E$ and
$D^{\prime i}_J$, we extract monolingual (possibly
compound [4]) term pair $t_E$ and $t_J$:

$$\langle t_E, t_J \rangle \ \text{ s.t. } \ \exists d^i_E \, \exists d_J, \ t_E \text{ in } d^i_E, \ t_J \text{ in } d_J, \ \cos(v(d^{MT_i}_J), v(d_J)) \ge L_d \qquad (2)$$
Then, based on the contingency table of co-occurrence
document frequencies of $t_E$ and $t_J$ below, we estimate
bilingual term correspondences according to statistical
measures such as the mutual information, the $\phi^2$
statistic, the dice coefficient, and the log-likelihood
ratio.

$$\begin{array}{c|cc}
 & t_J & \neg t_J \\ \hline
t_E & df(t_E, t_J) = a & df(t_E, \neg t_J) = b \\
\neg t_E & df(\neg t_E, t_J) = c & df(\neg t_E, \neg t_J) = d
\end{array}$$
We compare the performance of those four measures, where
the $\phi^2$ statistic and the log-likelihood ratio perform
best, the dice coefficient the second best, and the mutual
information the worst. In Section 4.3, we show results with
the $\phi^2$ statistic as the bilingual term correspondence
$corr_{EJ}(t_E, t_J)$:

$$\phi^2(t_E, t_J) = \frac{(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)}$$
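
The computation of the $\phi^2$ statistic from the contingency table can be sketched as below. This is a minimal illustration, assuming that each pseudo-parallel pair is represented by the set of English terms and the set of Japanese terms it contains. The function names and the toy data are hypothetical, and returning 0 for a degenerate table is a choice of this sketch.

```python
def contingency_counts(pairs, t_e, t_j):
    """Count the cells a, b, c, d over (English term set, Japanese term set) pairs."""
    a = b = c = d = 0
    for english_terms, japanese_terms in pairs:
        has_e = t_e in english_terms
        has_j = t_j in japanese_terms
        if has_e and has_j:
            a += 1
        elif has_e:
            b += 1
        elif has_j:
            c += 1
        else:
            d += 1
    return a, b, c, d


def phi_squared(a, b, c, d):
    """phi^2 = (ad - bc)^2 / ((a + b)(a + c)(b + d)(c + d))."""
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table; treated as no evidence in this sketch
    return (a * d - b * c) ** 2 / denominator


# Toy pseudo-parallel corpus (hypothetical terms).
pairs = [
    ({"x"}, {"X", "Y"}),
    ({"x"}, {"X"}),
    ({"y"}, {"Y"}),
    ({"z"}, {"Z"}),
]
score_x_X = phi_squared(*contingency_counts(pairs, "x", "X"))
score_x_Y = phi_squared(*contingency_counts(pairs, "x", "Y"))
```

Note that $\phi^2$ squares the difference $ad - bc$, so it measures the strength of association regardless of its sign.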
3 Re-estimating Bilingual Term
Correspondences using
Monolingual Web Documents
3.1 Overview
This section illustrates the overview of the pro-
cess of re-estimating bilingual term correspon-
dences using monolingual Web documents col-
lected by search engines. Figure 2 gives its
rough idea.
has been well studied in the context of bilingual lexicon
acquisition from comparable corpora. In this method, we
regard cross-lingually relevant texts as a comparable
corpus, where bilingual term correspondences are estimated
in terms of contextual similarities across languages. This
technique is less effective than the one we describe here
(Utsuro et al., 2003).
[4] In the evaluation of this paper, we restrict English and
Japanese terms $t_E$ and $t_J$ to be up to five words long.
Figure 2: Re-estimating Bilingual Term Corre-
spondences using Monolingual Web Documents:
Overview
Suppose that we have an English term and that the problem
to solve is to find its Japanese translation. As described
in the previous section and in Figure 1, a database of
cross-lingually relevant Japanese and English news articles
yields a certain number of Japanese translation candidates
for the target English term. For high frequency terms, it is
relatively easy to rank those Japanese translation
candidates reliably. For low frequency terms, however,
reliable ranking is difficult. This low frequency problem
arises in particular when the available language resources
(in this case, cross-lingually relevant news articles) are
not large enough.
Considering such a situation, re-estimation of bilingual
term correspondences proceeds as follows, using much larger
sets of monolingual Web documents that are easily accessible
through search engines. First, English pages which contain
the target English term are collected through an English
search engine. Similarly, for each Japanese term among the
Japanese translation candidates, Japanese pages which
contain the Japanese term are collected through a Japanese
search engine. Then, the texts contained in those English
and Japanese pages are extracted and regarded as comparable
corpora. Here, a standard technique of estimating bilingual
term correspondences from comparable corpora (e.g., Fung and
Yee (1998) and Rapp (1999)) is employed. The contextual
similarity between the target English term and each Japanese
translation candidate is measured across languages, and all
the Japanese translation candidates are re-ranked according
to their contextual similarities.
3.2 Filtering by Hits of Search En-
gines
Before re-estimating bilingual term correspondences using
monolingual Web documents, we assume there exists a certain
correlation between the hits of the English term $t_E$ and
the Japanese term $t_J$ returned by search engines.
Depending on the hits $h(t_E)$ of $t_E$, we restrict the
hits $h(t_J)$ of $t_J$ to be within the range of a lower
bound $h_L$ and an upper bound $h_U$:

$$h_L < h(t_J) \le h_U$$

As search engines, we used AltaVista
(http://www.altavista.com/) for English, and goo
(http://www.goo.ne.jp/) for Japanese. With a development
data set consisting of translation pairs of an English term
and a Japanese term, we manually constructed the following
rules for determining the lower bound $h_L$ and the upper
bound $h_U$:
1. $0 < h(t_E) \le 100$:
   $h_L = 0$, $h_U = 10{,}000 \times h(t_E)$
2. $100 < h(t_E) \le 20{,}000$:
   $h_L = 0.05 \times h(t_E)$, $h_U = 1{,}000{,}000$
3. $20{,}000 < h(t_E)$:
   $h_L = 1{,}000$, $h_U = 50 \times h(t_E)$
In the experimental evaluation of Section 4.4, the initial
set of Japanese translation candidates consists of 50 terms
for each English term, which this filtering reduces to 24.8
terms on average.
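
The three rules above translate directly into a small filter. The following sketch assumes the hit counts have already been obtained from the search engines. The function names are hypothetical, and rejecting a non-positive $h(t_E)$ is a choice of this sketch, since the rules only cover $h(t_E) > 0$.

```python
def hit_bounds(h_e):
    """Return (h_L, h_U) for a given number of hits h(t_E) of the English term."""
    if h_e <= 0:
        raise ValueError("the rules are only defined for h(t_E) > 0")
    if h_e <= 100:
        return 0, 10_000 * h_e
    if h_e <= 20_000:
        return 0.05 * h_e, 1_000_000
    return 1_000, 50 * h_e


def passes_hit_filter(h_e, h_j):
    """True if h_L < h(t_J) <= h_U holds for the candidate's hit count."""
    lower, upper = hit_bounds(h_e)
    return lower < h_j <= upper


# Example with hypothetical hit counts: an English term with 1,000 hits
# keeps a candidate with 60 hits and discards one with 40 hits.
kept = passes_hit_filter(1_000, 60)
discarded = not passes_hit_filter(1_000, 40)
```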
3.3 Re-estimating Bilingual Term
Correspondences based on Con-
textual Similarity
This section describes how to re-estimate bilingual term
correspondences using monolingual Web documents collected by
search engines. For an English term $t_E$ and a Japanese
term $t_J$, let $D(t_E)$ and $D(t_J)$ be the sets of
documents returned by search engines with queries $t_E$ and
$t_J$, respectively. Then, for the English term $t_E$, the
translated contextual vector $cv_{trJ}(t_E)$ is constructed
as follows: each English sentence $s_E$ which contains $t_E$
is translated into a Japanese sentence $s^{tr}_J$, then the
term frequency vectors [5] $v(s^{tr}_J)$ of the Japanese
translations $s^{tr}_J$ are summed up into the translated
contextual vector $cv_{trJ}(t_E)$:

$$cv_{trJ}(t_E) = \sum_{\forall s_E \text{ in } D(t_E) \text{ s.t. } t_E \text{ in } s_E} v(s^{tr}_J)$$

[5] In the term frequency vectors, compound terms are
restricted to be up to five words long both for English and
Japanese.

Table 1: Statistics of # of Days, Articles, and Article Sizes
       total #    total #        average # of        average article
       of days    of articles    articles per day    size (bytes)
Eng    935        23064          24.7                3228.9
Jap    941        96688          102.8               837.7
The contextual vector $cv(t_J)$ for the Japanese term $t_J$
is also constructed by summing up the term frequency vectors
$v(s_J)$ of each Japanese sentence $s_J$ which contains
$t_J$:

$$cv(t_J) = \sum_{\forall s_J \text{ in } D(t_J) \text{ s.t. } t_J \text{ in } s_J} v(s_J)$$
In the translation of English sentences into Japanese, we
evaluated MT software and a bilingual lexicon in terms of
the performance of re-estimation of bilingual term
correspondences. Unlike the situation of cross-lingually
relevant news articles mentioned in Section 2.2, translation
by a bilingual lexicon is more effective for monolingual Web
documents. In the case of monolingual Web documents, closely
related documents in the other language are much less likely
to be found. In such cases, multiple translation by a
bilingual lexicon is more suitable than exact translation by
MT software. In Section 4.4, we show evaluation results with
translation by a bilingual lexicon [6].
Finally, the bilingual term correspondence
$corr_{EJ}(t_E, t_J)$ is estimated in terms of the cosine
measure $\cos(cv_{trJ}(t_E), cv(t_J))$ between the
contextual vectors $cv_{trJ}(t_E)$ and $cv(t_J)$.
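
The re-estimation procedure of this section can be sketched as follows. This is an illustrative sketch, not the implementation evaluated in Section 4.4. It assumes that sentences are already segmented into lists of terms (with compound terms as single items), that the bilingual lexicon is a plain dictionary from an English word to a list of Japanese translations, and that every listed translation is counted once. Excluding the target term itself from its contextual vector is a choice of this sketch. All names and data are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def translated_context_vector(english_sentences, t_e, lexicon):
    """Sum the lexicon translations of all sentences that contain t_e."""
    vector = Counter()
    for sentence in english_sentences:
        if t_e not in sentence:
            continue
        for word in sentence:
            if word == t_e:
                continue
            for translation in lexicon.get(word, []):
                vector[translation] += 1
    return vector


def context_vector(japanese_sentences, t_j):
    """Sum the term frequencies of all sentences that contain t_j."""
    vector = Counter()
    for sentence in japanese_sentences:
        if t_j in sentence:
            vector.update(term for term in sentence if term != t_j)
    return vector


def rerank(t_e, english_sentences, candidate_sentences, lexicon):
    """Rank Japanese candidates by cosine similarity of contextual vectors."""
    cv_e = translated_context_vector(english_sentences, t_e, lexicon)
    scored = [
        (cosine(cv_e, context_vector(sentences, t_j)), t_j)
        for t_j, sentences in candidate_sentences.items()
    ]
    return sorted(scored, reverse=True)


# Toy data (hypothetical): "候補A" and "候補B" stand for two candidates.
lexicon = {"dispatch": ["派遣", "発送"], "iraq": ["イラク"]}
english_sentences = [["target term", "dispatch", "iraq"], ["weather", "tokyo"]]
candidate_sentences = {
    "候補A": [["候補A", "派遣", "イラク"]],
    "候補B": [["候補B", "国会", "選挙"]],
}
ranking = rerank("target term", english_sentences, candidate_sentences, lexicon)
```

In this toy example the candidate whose context shares translated words with the English context is ranked first.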
4 Experimental Evaluation
4.1 Japanese-English Relevant News
Articles on Web News Sites
We collected Japanese and English news articles from a Web
news site. Table 1 shows the total number of collected
articles and the range of dates of those articles
represented as the number of days. Table 1 also shows the
number of articles updated in one day, and the average
article size. The number of Japanese articles updated in one
day is far greater (about 4 times) than that of English
articles.
[6] Eijiro Ver.37, 850,000 entries,
http://homepage3.nifty.com/edp/.
Table 2: # of Japanese/English Article Pairs with Similarity Values above Lower Bounds
Lower Bound $L_d$ of Articles' Sim      w/o CLIR    0.3      0.4     0.5
Difference of Dates (days)                          ≤ 2      ≤ 2     ≤ 2
# of English Articles                   23064       6073     2392    701
# of Japanese Articles                  96688       12367    3444    882
# of English-Japanese Article Pairs     -           16507    3840    918
Next, for several lower bounds $L_d$ of the similarity
between English and Japanese articles, Table 2 shows the
numbers of English and Japanese articles as well as article
pairs which satisfy the similarity lower bound. Here, the
difference of dates of English and Japanese articles is
within two days, which guarantees that closely related
articles in the other language, if they exist, can be
discovered (see Utsuro et al. (2003) for details). Note that
one article may have similarity values above the lower bound
against more than one article in the other language.
According to our previous study (Utsuro et al., 2003),
cross-lingually relevant news articles are available in the
direction of English-to-Japanese retrieval for more than
half of the retrieval query English articles. Furthermore,
with the similarity lower bound $L_d = 0.3$, precision and
recall of cross-language retrieval are around 30% and 60%,
respectively. Therefore, with the similarity lower bound
$L_d = 0.3$, at least 1,800 (≈ 6,073 × 0.5 × 0.6) English
articles have relevant Japanese articles in the results of
cross-language retrieval. Based on this analysis, the next
section gives evaluation results with the similarity lower
bound $L_d = 0.3$.
4.2 English Term List for Evaluation
For the evaluation of this paper, we first man-
ually select target English terms and their
reference Japanese translation, and examine
whether reference bilingual term correspon-
dences can be estimated by the methods pre-
sented in Sections 2 and 3. Target English terms
are selected by the following procedure.
First, from the whole set of English articles of Table 1,
every sequence of more than one word whose frequency is at
least 10 is enumerated. This enumeration is easily
implemented and efficiently computed by employing the
technique of PrefixSpan (Pei et al., 2001). Here, a certain
portion of those word sequences are appropriate as compound
terms, while the rest are fragments of a compound term, or
concatenations of such fragments. In order to automatically
select candidates for correct compound terms, we parse those
word sequences by the Charniak parser [7], and collect noun
phrases which consist of adjectives, nouns, and present/past
participles. For each of those word sequences, the $\phi^2$
statistic against Japanese translation candidates is
calculated, then those word sequences are sorted in
descending order of their $\phi^2$ statistic. Finally, among
the top 3,000 candidates for compound terms, 100 English
compound terms are randomly selected for the evaluation of
this paper. The selected 100 terms satisfy the following
condition: they can be correctly translated neither by the
MT software used in Section 2.2, nor by the bilingual
lexicon used in Section 3.3.

Figure 3: Accuracy of Estimating Bilingual Term
Correspondences with News Articles
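
The enumeration step of Section 4.2 can be illustrated with a deliberately simple substitute: plain counting of contiguous word n-grams instead of PrefixSpan. The sketch assumes tokenized articles, counts every occurrence of a sequence, and caps the length at five words (the limit stated for compound terms). The function name, the length cap as a parameter, and the toy data are hypothetical.

```python
from collections import Counter


def frequent_word_sequences(articles, min_freq=10, max_len=5):
    """Count contiguous word sequences of 2..max_len words and keep those
    occurring at least min_freq times. This is plain n-gram counting, not
    PrefixSpan; it is shown only to make the selection criterion concrete."""
    counts = Counter()
    for words in articles:
        for length in range(2, max_len + 1):
            for start in range(len(words) - length + 1):
                counts[tuple(words[start:start + length])] += 1
    return {
        sequence: count
        for sequence, count in counts.items()
        if count >= min_freq
    }


# Toy corpus (hypothetical): the same three-word article repeated three times.
articles = [["special", "anti-terrorism", "law"]] * 3
frequent = frequent_word_sequences(articles, min_freq=3)
```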
4.3 Estimating Bilingual Term Cor-
respondences with News Articles
For the 100 English terms selected in the previous section,
Japanese translation candidates which satisfy the condition
of formula (2) in Section 2.3 are collected, and are ranked
according to the $\phi^2$ statistic. Figure 3 plots the rate
of the reference Japanese translation being within the top n
candidates. In the figure, the plot labeled as “full” is the
result with the whole articles in Table 1. In this case, the
accuracy of the top ranked Japanese translation candidate is
about 40%, and the rate of the reference Japanese
translation being within the top five candidates is about
75%.
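
The measure plotted in Figure 3, the rate of the reference translation being within the top n candidates, can be computed as in the following sketch. It assumes that each English term has one reference translation and a ranked candidate list. The function name and the toy data are hypothetical.

```python
def top_n_rate(rankings, references, n):
    """Fraction of English terms whose reference translation appears
    among the first n ranked Japanese candidates."""
    if not rankings:
        return 0.0
    hits = sum(
        1
        for english_term, ranked in rankings.items()
        if references[english_term] in ranked[:n]
    )
    return hits / len(rankings)


# Toy rankings (hypothetical candidates).
rankings = {"t1": ["a", "b", "c"], "t2": ["x", "y", "z"]}
references = {"t1": "a", "t2": "z"}
rate_top1 = top_n_rate(rankings, references, 1)
rate_top3 = top_n_rate(rankings, references, 3)
```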
[7] http://www.cs.brown.edu/people/ec/
Table 3: Statistics of Average Document Frequencies and Number of Days
                      Document Frequencies of
                      target English Term              # of Days
Data Set              df(t_E)    df(t_E, t_J)          Eng     Jap
freq=10, 13.6 days    14.9       9.1                   13.6    21.9
freq=10, 20 days      14.9       9.1                   21.0    78.7
freq=10, 200 days     14.9       9.1                   200     581
freq=70, 600 days     37.4       24.9                  600     872
full                  53.9       35.6                  935     941
On the other hand, the other plots labeled as “Freq=x, y
days” are the results when the number of news articles is
reduced, which simulates estimating bilingual term
correspondences for low frequency terms. Here, the label
“Freq=x, y days” indicates that the news articles used for
$\phi^2$ statistic estimation are restricted to a certain
portion of the whole news articles so that the following
conditions are satisfied: i) the co-occurrence document
frequency of a target English term and its reference
Japanese translation is fixed to be x [8], and ii) the
number of days is greater than or equal to y. For each news
articles data set, Table 3 shows the document frequencies
$df(t_E)$ of a target English term $t_E$, the co-occurrence
document frequencies $df(t_E, t_J)$ of $t_E$ and its
reference Japanese translation $t_J$, and the numbers of
days for English as well as Japanese articles. Those numbers
are all averaged over the 100 English terms. The number of
days for Japanese articles can be at most five times larger
than that for English articles, because relevant Japanese
articles are retrieved against a query English article from
dates within two days of it (details are in Sections 2.2 and
4.1).
As can be seen from the plots of Figure 3, the smaller the
news articles data set, the lower the plot is. In
particular, in the case of the smallest news articles data
set, it is clear that reliable ranking of Japanese
translation candidates is difficult. This is because it is
not easy to discriminate between the reference Japanese
translation and the other candidates with statistics
obtained from such a small news articles data set.
4.4 Re-estimating Bilingual Term
Correspondences with Monolin-
gual Web Documents
For the 100 target English terms evaluated in the previous
section, this section describes the result of applying the
technique presented in Section 3.3, i.e., re-estimating
bilingual term correspondences with monolingual Web
documents. For each of the 100 target English terms,
bilingual term correspondences are re-estimated against
candidates of Japanese translation ranked within the top 50
according to the $\phi^2$ statistic. Here, as a simulation
for terms that are infrequent in news articles, the 50
candidate terms for Japanese translation are collected from
the smallest data set labeled as “Freq=10, 13.6 days”. As
mentioned in Section 3.2, those 50 candidates are reduced to
24.8 terms on average with the filtering by hits of search
engines. For each English term $t_E$ and each Japanese term
$t_J$, 100 monolingual documents are collected by search
engines [9][10].

Figure 4 compares the plots of re-estimation with
monolingual Web documents and estimation by news articles
(data set “Freq=10, 13.6 days”). It is clear from this
result that monolingual Web documents contribute to
improving the accuracy of estimating bilingual term
correspondences for low frequency terms.

Figure 4: Accuracy of Re-estimating Bilingual Term
Correspondences with Monolingual Web Documents

[8] When the co-occurrence document frequency of $t_E$ and
$t_J$ in the whole news articles is less than x, all the
co-occurring dates are included.
[9] In the result of our preliminary evaluation, the
accuracy of re-estimating bilingual term correspondences did
not improve even if more than 100 documents were used.
[10] Alternatively, as the monolingual documents from which
contextual vectors are constructed, we evaluated each of the
short passages listed in the summary pages returned by
search engines, instead of the whole documents of the URLs
listed in the summary pages. The difference in the
performance of bilingual term correspondence estimation is
small, while the computational cost can be reduced to almost
5%.
One of the major reasons for this improve-
ment is that topics of monolingual Web doc-
uments collected through search engines are
much more diverse than those of news articles.
Such diverse topics help discriminate correct
and incorrect Japanese translation candidates.
For example, suppose that the target English term $t_E$ is
“special anti-terrorism law” and its reference Japanese
translation is “テロ対策特別措置法”. In the news articles we
used for evaluation, most articles in which $t_E$ or $t_J$
appear have “dispatch of Self-Defense Force for
reconstruction of Iraq” as their topics. Here, Japanese
translation candidates other than “テロ対策特別措置法” that
are highly ranked according to the $\phi^2$ statistic are:
e.g., “衆議院解散 (dissolution of the House of
Representatives)” and “イラク復興支援 (assistance for
reconstruction of Iraq)”, which frequently appear in the
topic of “dispatch of Self-Defense Force for reconstruction
of Iraq”.
On the other hand, in the case of monolingual Web documents
collected through search engines, it can be expected that
topics of documents may vary according to the query terms.
In the case of the example above, the major topic is
“dispatch of Self-Defense Force for reconstruction of Iraq”
for both of the reference terms $t_E$ and $t_J$, while the
major topics for other Japanese translation candidates are:
“issues on Japanese Diet” for “衆議院解散 (dissolution of
the House of Representatives)” and “issues on reconstruction
of Iraq, not only in Japan, but all over the world” for
“イラク復興支援 (assistance for reconstruction of Iraq)”.
Those topics of incorrect Japanese translation candidates
are different from that of the target English term $t_E$,
and their contextual vector similarities against the target
English term $t_E$ are relatively low compared with the
reference Japanese translation $t_J$. Consequently, the
reference Japanese translation $t_J$ is re-ranked higher
compared with the ranking based on news articles.
5 Related Works
In large scale experimental evaluation of bilin-
gual term correspondence estimation from com-
parable corpora, it is diﬃcult to estimate bilin-
gual term correspondences against every possi-
ble pair of terms due to its computational com-
plexity. Previous works on bilingual term cor-
respondence estimation from comparable cor-
pora controlled experimental evaluation in var-
ious ways in order to reduce this computational
complexity. For example, Rapp (1999) filtered
out bilingual term pairs with low monolingual
frequencies (those below 100 times), while Fung
and Yee (1998) restricted candidate bilingual
term pairs to be pairs of the most frequent 118
unknown words. Cao and Li (2002) restricted
candidate bilingual compound term pairs by
consulting a seed bilingual lexicon and requir-
ing their constituent words to be translation
of each other across languages. On the other
hand, in the framework of bilingual term corre-
spondences estimation of this paper, the compu-
tational complexity of enumerating translation
candidates can be easily avoided with the help of
cross-language retrieval of relevant news texts.
Furthermore, unlike Cao and Li (2002), bilin-
gual term correspondences for compound terms
are not restricted to compositional translation.
6 Conclusion
In the framework of bilingual lexicon acquisition
from cross-lingually relevant news articles on
the Web, it has been relatively harder to reliably
estimate bilingual term correspondences for low
frequency terms. This paper proposed to com-
plementarily use much larger monolingual Web
documents collected by search engines, as a re-
source for reliably re-estimating bilingual term
correspondences. We showed that, for the terms
which appear infrequently in news articles, the
accuracy of re-estimating bilingual term corre-
spondences actually improved.

References

Y. Cao and H. Li. 2002. Base noun phrase translation
using Web data and the EM algorithm. In Proc. 19th
COLING, pages 127–133.

P. Fung and L. Y. Yee. 1998. An IR approach for trans-
lating new words from nonparallel, comparable texts.
In Proc. 17th COLING and 36th ACL, pages 414–420.

Y. Matsumoto and T. Utsuro. 2000. Lexical knowledge
acquisition. In R. Dale, H. Moisl, and H. Somers,
editors, Handbook of Natural Language Processing,
chapter 24, pages 563–610. Marcel Dekker Inc.

J. Pei, J. Han, B. Mortazavi-Asl, and H. Pinto. 2001.
Prefixspan: Mining sequential patterns eﬃciently by
prefix-projected pattern growth. In Proc. Inter. Conf.
Data Mining, pages 215–224.

R. Rapp. 1999. Automatic identification of word trans-
lations from unrelated English and German corpora.
In Proc. 37th ACL, pages 519–526.

T. Utsuro, T. Horiuchi, T. Hamamoto, K. Hino, and
T. Nakayama. 2003. Eﬀect of cross-language IR in
bilingual lexicon acquisition from comparable cor-
pora. In Proc. 10th EACL, pages 355–362.
