Base Noun Phrase Translation
Using Web Data and the EM Algorithm
Yunbo Cao
Microsoft Research Asia
i-yuncao@microsoft.com
Hang Li
Microsoft Research Asia
hangli@microsoft.com
Abstract
We consider here the problem of Base Noun
Phrase translation. We propose a new method
to perform the task. For a given Base NP, we
first search its translation candidates from the
web. We next determine the possible
translation(s) from among the candidates
using one of the two methods that we have
developed. In one method, we employ an
ensemble of Naïve Bayesian Classifiers
constructed with the EM Algorithm. In the
other method, we use TF-IDF vectors also
constructed with the EM Algorithm.
Experimental results indicate that the
coverage and accuracy of our method are
significantly better than those of the baseline
methods relying on existing technologies.
1. Introduction
We address here the problem of Base NP
translation, in which for a given Base Noun
Phrase in a source language (e.g., ‘information
age’ in English), we are to find its possible
translation(s) in a target language (e.g., ‘信息时代’
in Chinese).
We define a Base NP as a simple and
non-recursive noun phrase. In many cases, Base
NPs represent holistic and non-divisible concepts,
and thus accurate translation of them from one
language to another is extremely important in
applications like machine translation, cross
language information retrieval, and foreign
language writing assistance.
In this paper, we propose a new method for
Base NP translation, which contains two steps: (1)
translation candidate collection, and (2)
translation selection. In translation candidate
collection, for a given Base NP in the source
language, we look for its translation candidates in
the target language. To do so, we use a
word-to-word translation dictionary and corpus
data in the target language on the web. In
translation selection, we determine the possible
translation(s) from among the candidates. We use
non-parallel corpus data in the two languages on
the web and employ one of the two methods
which we have developed. In the first method, we
view the problem as that of classification and
employ an ensemble of Naïve Bayesian
Classifiers constructed with the EM Algorithm.
We will use ‘EM-NBC-Ensemble’ to denote this
method, hereafter. In the second method, we view
the problem as that of calculating similarities
between context vectors and use TF-IDF vectors
also constructed with the EM Algorithm. We will
use ‘EM-TF-IDF’ to denote this method.
Experimental results indicate that our method
is very effective, and the coverage and top 3
accuracy of translation at the final stage are
91.4% and 79.8%, respectively. The results are
significantly better than those of the baseline
methods relying on existing technologies. The
higher performance of our method can be
attributed to the enormity of the web data used
and the employment of the EM Algorithm.
2. Related Work
2.1 Translation with Non-parallel
Corpora
A straightforward approach to word or phrase
translation is to perform the task by using parallel
bilingual corpora (e.g., Brown et al, 1993).
Parallel corpora are, however, difficult to obtain
in practice.
To deal with this difficulty, a number of
methods have been proposed, which make use of
relatively easily obtainable non-parallel corpora
(e.g., Fung and Yee, 1998; Rapp, 1999; Diab and
Finch, 2000). Within these methods, it is usually
assumed that a number of translation candidates
for a word or phrase are given (or can be easily
collected) and the problem is focused on
translation selection.
All of the proposed methods manage to find out
the translation(s) of a given word or phrase, on
the basis of the linguistic phenomenon that the
contexts of a translation tend to be similar to the
contexts of the given word or phrase. Fung and
Yee (1998), for example, proposed to represent
the contexts of a word or phrase with a
real-valued vector (e.g., a TF-IDF vector), in
which one element corresponds to one word in
the contexts. In translation selection, they select
the translation candidates whose context vectors
are the closest to that of the given word or phrase.
Since the context vector of the word or phrase
to be translated corresponds to words in the
source language, while the context vector of a
translation candidate corresponds to words in the
target language, and further the words in the
source language and those in the target language
have a many-to-many relationship (i.e.,
translation ambiguities), it is necessary to
accurately transform the context vector in the
source language to a context vector in the target
language before distance calculation.
The vector-transformation problem was not,
however, well-resolved previously. Fung and
Yee assumed that in a specific domain there is
only one-to-one mapping relationship between
words in the two languages. The assumption is
reasonable in a specific domain, but is too strict in
the general domain, in which we presume to
perform translation here. A straightforward
extension of Fung and Yee’s assumption to the
general domain is to restrict the many-to-many
relationship to that of many-to-one mapping (or
one-to-one mapping). This approach, however,
has a drawback of losing information in vector
transformation, as will be described.
For other methods using non-parallel corpora,
see also (Tanaka and Iwasaki, 1996; Kikui, 1999;
Koehn and Knight, 2000; Sumita, 2000; Nakagawa,
2001; Gao et al, 2001).
2.2 Translation Using Web Data
The web is an extremely rich source of data for
natural language processing, not only in terms of
data size but also in terms of data type (e.g.,
multilingual data, link data). Recently, a new
trend has arisen in natural language processing, which
tries to bring new breakthroughs to the field
by effectively using web data (e.g., Brill et al,
2001).
Nagata et al (2001), for example, proposed to
collect partial parallel corpus data on the web to
create a translation dictionary. They observed
that there are many partial parallel corpora
between English and Japanese on the web, and
most typically English translations of Japanese
terms (words or phrases) are parenthesized and
inserted immediately after the Japanese terms in
documents written in Japanese.
3. Base Noun Phrase Translation
Our method for Base NP translation comprises
two steps: translation candidate collection and
translation selection. In translation candidate
collection, we look for translation candidates of a
given Base NP. In translation selection, we find
out possible translation(s) from the translation
candidates.
In this paper, we confine ourselves to
translation of noun-noun pairs from English to
Chinese; our method, however, can be extended
to translations of other types of Base NPs
between other language pairs.
3.1 Translation Candidate Collection
We use heuristics for translation candidate
collection. Figure 1 illustrates the process of
collecting Chinese translation candidates for an
English Base NP ‘information age’ with the
heuristics.
1. Input ‘information age’;
2. Consult English-Chinese word translation dictionary:
   information -> 信息
   age -> 年龄 (how old somebody is)
          时代 (historical era)
          成年 (legal adulthood)
3. Compositionally create translation candidates in Chinese:
   信息年龄; 信息时代; 信息成年
4. Search the candidates on web sites in Chinese and obtain their document frequencies (i.e., numbers of documents containing them):
   信息时代 10000
   信息年龄 10
   信息成年 0
5. Output candidates having non-zero document frequencies and the document frequencies:
   信息时代 10000
   信息年龄 10
Figure 1. Translation candidate collection
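The steps in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `collect_candidates`, the toy `dictionary`, and the `web_counts` table standing in for a web search engine are all hypothetical, and the Chinese strings are our assumed renderings of the three senses glossed in Figure 1.

```python
from itertools import product

def collect_candidates(phrase, dictionary, doc_freq):
    """Compose word-by-word translations and keep those found on the web."""
    # Step 2: look up every dictionary translation of every word in the phrase.
    options = [dictionary.get(word, []) for word in phrase.split()]
    # Step 3: concatenate one translation per word, keeping the source word order.
    composed = ["".join(parts) for parts in product(*options)]
    # Steps 4-5: keep only candidates with a non-zero document frequency.
    return {c: doc_freq(c) for c in composed if doc_freq(c) > 0}

# Toy data (hypothetical): a two-word dictionary and fixed document counts.
dictionary = {"information": ["信息"], "age": ["年龄", "时代", "成年"]}
web_counts = {"信息时代": 10000, "信息年龄": 10}  # stand-in for a search engine
candidates = collect_candidates("information age", dictionary,
                                lambda c: web_counts.get(c, 0))
```

If any word of the phrase has no dictionary entry, the product is empty and no candidate is produced, which corresponds to the "no dictionary entry" failure case discussed in the experiments.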
3.2 Translation Selection --
EM-NBC-Ensemble
We view the translation selection problem as that
of classification and employ EM-NBC-Ensemble
to perform the task. For the ease of explanation,
we first describe the algorithm of using only
EM-NBC and next extend it to that of using
EM-NBC-Ensemble.
Basic Algorithm
Let $\tilde{e}$ denote the Base NP to be translated and $\tilde{C}$
the set of its translation candidates (phrases).
Suppose that $|\tilde{C}| = k$. Let $\tilde{c}$ represent a random
variable on $\tilde{C}$. Let $E$ denote a set of words in
English, and $C$ a set of words in Chinese.
Suppose that $|E| = m$ and $|C| = n$. Let $e$
represent a random variable on $E$ and $c$ a random
variable on $C$. Figure 2 describes the algorithm.
Input: $\tilde{e}$, $\tilde{C}$, contexts containing $\tilde{e}$, contexts containing all $\tilde{c} \in \tilde{C}$;
1. create a frequency vector $(f(e_1), f(e_2), \ldots, f(e_m))$, $e_i \in E$ $(i = 1, \ldots, m)$, using contexts containing $\tilde{e}$;
   transform the vector into $(f_E(c_1), f_E(c_2), \ldots, f_E(c_n))$, $c_i \in C$ $(i = 1, \ldots, n)$, using a translation dictionary and the EM algorithm;
2. for each ($\tilde{c} \in \tilde{C}$) {
   estimate with Maximum Likelihood Estimation the prior probability $P(\tilde{c})$ using contexts containing all $\tilde{c} \in \tilde{C}$;
   create a frequency vector $(f(c_1), f(c_2), \ldots, f(c_n))$, $c_i \in C$ $(i = 1, \ldots, n)$, using contexts containing $\tilde{c}$;
   normalize the frequency vector, yielding $(P(c_1|\tilde{c}), P(c_2|\tilde{c}), \ldots, P(c_n|\tilde{c}))$, $c_i \in C$ $(i = 1, \ldots, n)$;
   calculate the posterior probability $P(\tilde{c}|D)$ with EM-NBC (generally EM-NBC-Ensemble), where $D = (f_E(c_1), f_E(c_2), \ldots, f_E(c_n))$, $c_i \in C$ $(i = 1, \ldots, n)$; }
3. Sort $\tilde{c} \in \tilde{C}$ in descending order of $P(\tilde{c}|D)$;
Output: the top sorted results
Figure 2. Algorithm of EM-NBC-Ensemble
Context Information
As input data, we use ‘contexts’ in English which
contain the phrase to be translated. We also use
contexts in Chinese which contain the translation
candidates.
Here, a context containing a phrase is defined
as the surrounding words within a window of a
predetermined size, which window covers the
phrase. We can easily obtain the data by
searching for them on the web. Actually, the
contexts containing the candidates are obtained at
the same time when we conduct translation
candidate collection (Step 4 in Figure 1).
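As a rough illustration of the window-based definition of a context, the sketch below collects the words on either side of each occurrence of a phrase in a tokenized text. The function name `contexts` and the choice to exclude the phrase itself from its own context are our assumptions; the paper does not spell out these details.

```python
def contexts(tokens, phrase, window):
    """Return the words within `window` positions of each occurrence of `phrase`."""
    n = len(phrase)
    found = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            left = tokens[max(0, i - window):i]    # up to `window` words before
            right = tokens[i + n:i + n + window]   # up to `window` words after
            found.append(left + right)
    return found

tokens = "we live in the information age of the internet".split()
ctx = contexts(tokens, ["information", "age"], window=2)
```

Counting the words in the returned lists over all occurrences gives the frequency vectors used in Figure 2.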
EM Algorithm
We define a relation between $E$ and $C$
as $R \subseteq E \times C$, which represents the links in a
translation dictionary. We further define
$\Gamma_c = \{e \mid (e, c) \in R\}$.
At Step 1, we assume that all the instances in
$(f(e_1), f(e_2), \ldots, f(e_m))$ are independently generated
according to the distribution defined as:
$$P(e) = \sum_{c \in C} P(c)\,P(e|c) \qquad (1)$$
We estimate the parameters of the distribution by
using the Expectation and Maximization (EM)
Algorithm (Dempster et al., 1977).
Initially, we set for all $c \in C$
$$P(c) = \frac{1}{|C|}, \qquad P(e|c) = \begin{cases} \dfrac{1}{|\Gamma_c|}, & \text{if } e \in \Gamma_c \\ 0, & \text{if } e \notin \Gamma_c \end{cases}$$
Next, we estimate the parameters by iteratively
updating them, until they converge (cf., Figure 3).
Finally, we calculate $f_E(c)$ for all $c \in C$ as:
$$f_E(c) = \sum_{e \in E} f(e)\,P(c) \qquad (2)$$
In this way, we can transform the frequency
vector in English $(f(e_1), f(e_2), \ldots, f(e_m))$ into a vector
in Chinese $D = (f_E(c_1), f_E(c_2), \ldots, f_E(c_n))$.
Prior Probability Estimation
At Step 2, we approximately estimate the prior
probability $P(\tilde{c})$ by using the document
frequencies of the translation candidates. The
data are obtained when we conduct candidate
collection (Step 4 in Figure 1).
E-step: $$P(c|e) \leftarrow \frac{P(c)\,P(e|c)}{\sum_{c \in C} P(c)\,P(e|c)}$$
M-step: $$P(c) \leftarrow \sum_{e \in E} f(e)\,P(c|e)$$
$$P(e|c) \leftarrow \frac{f(e)\,P(c|e)}{\sum_{e \in E} f(e)\,P(c|e)}$$
Figure 3. EM Algorithm
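The updates in Figure 3 and the final transformation in Equation (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `em_transform`, `links` and `iterations` are hypothetical, a fixed number of iterations replaces the convergence test, and we assume that the frequencies in the M-step are relative frequencies so that the updated P(c) sums to one.

```python
def em_transform(f, links, iterations=50):
    """Map English counts f[e] to Chinese pseudo-counts f_E[c] (Equation 2).

    links[c] is the set Gamma_c of English words linked to Chinese word c.
    """
    total = sum(f.values())
    rel = {e: n / total for e, n in f.items()}          # assumption: relative f(e)
    p_c = {c: 1.0 / len(links) for c in links}          # uniform initial P(c)
    p_e_c = {c: {e: 1.0 / len(g) for e in g} for c, g in links.items()}
    for _ in range(iterations):
        # E-step: P(c|e) is proportional to P(c) P(e|c).
        post = {}
        for e in rel:
            joint = {c: p_c[c] * p_e_c[c].get(e, 0.0) for c in links}
            z = sum(joint.values())
            post[e] = {c: (v / z if z > 0 else 0.0) for c, v in joint.items()}
        # M-step: re-estimate P(c) and P(e|c) from the expected counts.
        for c in links:
            expected = {e: rel[e] * post[e][c] for e in rel}
            mass = sum(expected.values())
            p_c[c] = mass
            if mass > 0:
                p_e_c[c] = {e: v / mass for e, v in expected.items()}
    return {c: p_c[c] * total for c in links}           # Equation (2)

# Toy example (hypothetical words): 'internet' is ambiguous between c1 and c2.
f = {"internet": 10, "age": 5}
links = {"c1": {"internet"}, "c2": {"internet", "age"}}
f_E = em_transform(f, links)
```

English words that have no dictionary link contribute nothing in this sketch, so their counts are simply dropped from the transformed vector.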
EM-NBC
At Step 2, we use an EM-based Naïve
Bayesian Classifier (EM-NBC) to select the
candidates $\tilde{c}$ whose posterior probabilities are
the largest:
$$\arg\max_{\tilde{c} \in \tilde{C}} P(\tilde{c}|D) = \arg\max_{\tilde{c} \in \tilde{C}} \left\{ \log P(\tilde{c}) + \sum_{c \in C} f_E(c) \log P(c|\tilde{c}) \right\} \qquad (3)$$
Equation (3) is based on Bayes’ rule and the
assumption that the data in $D$ are independently
generated from $P(c|\tilde{c}),\ c \in C$.
In our implementation, we use the following form,
which is equivalent to Equation (3) when $\alpha = 1$:
$$\arg\min_{\tilde{c} \in \tilde{C}} \left\{ -\alpha \log P(\tilde{c}) - \sum_{c \in C} f_E(c) \log P(c|\tilde{c}) \right\} \qquad (4)$$
where $\alpha \geq 1$ is an additional parameter used to
emphasize the prior information. If we ignore the
first term in Equation (4), then the use of one
EM-NBC turns out to select the candidate whose
frequency vector is the closest to the transformed
vector $D$ in terms of KL divergence (cf., Cover
and Thomas, 1991).
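A small sketch of the ranking rule in Equation (4) is given below. The function name `em_nbc_rank` and its arguments are hypothetical. The additive smoothing constant `eps` is our own assumption, added so that the logarithm is defined for context words never seen with a candidate; the paper does not say how zero probabilities are handled. The prior is the maximum likelihood estimate from document frequencies, and the default `alpha=5.0` follows the setting reported in Section 4.1.

```python
import math

def em_nbc_rank(f_E, cand_freq, doc_freq, alpha=5.0, eps=1e-6):
    """Rank candidates by Equation (4); a smaller score is better."""
    total_df = sum(doc_freq.values())
    scores = {}
    for cand, freq in cand_freq.items():
        vocab = set(f_E) | set(freq)
        denom = sum(freq.values()) + eps * len(vocab)
        # Prior term; document frequencies are non-zero after candidate collection.
        score = -alpha * math.log(doc_freq[cand] / total_df)
        for c, weight in f_E.items():
            # Smoothed P(c | candidate); the smoothing is an assumption of this sketch.
            score -= weight * math.log((freq.get(c, 0) + eps) / denom)
        scores[cand] = score
    return sorted(scores, key=scores.get)

# Toy data (hypothetical): candidate A's context distribution matches f_E better.
f_E = {"w1": 4.0, "w2": 1.0}
cand_freq = {"A": {"w1": 8, "w2": 2}, "B": {"w1": 1, "w2": 9}}
ranking = em_nbc_rank(f_E, cand_freq, doc_freq={"A": 100, "B": 100})
```

With equal priors, the ranking is driven entirely by the second term, which is the KL-divergence view mentioned above.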
EM-NBC-Ensemble
To further improve performance, we use an
ensemble (i.e., a linear combination) of
EM-NBCs (EM-NBC-Ensemble), while the
classifiers are constructed on the basis of the data
in different contexts with different window sizes.
More specifically, we calculate
$$P(\tilde{c}|D) = \frac{1}{s} \sum_{i=1}^{s} P(\tilde{c}|D_i) \qquad (5)$$
where $D_i$ $(i = 1, \ldots, s)$ denotes the data in different
contexts.
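The averaging in Equation (5) might be sketched as below. We assume, hypothetically, that each EM-NBC returns an unnormalized log-posterior per candidate (the bracketed quantity in Equation (3)); the helper `posterior` normalizes these with a softmax before the posteriors are averaged over the s window sizes. Both function names are illustrative.

```python
import math

def posterior(log_scores):
    """Turn unnormalized log-posteriors into probabilities (softmax)."""
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

def ensemble(per_window_log_scores):
    """Average P(candidate | D_i) over the s window sizes, as in Equation (5)."""
    s = len(per_window_log_scores)
    posts = [posterior(ls) for ls in per_window_log_scores]
    return {k: sum(p[k] for p in posts) / s for k in posts[0]}

# Two hypothetical windows: the first is undecided, the second favors A by 3:1.
avg = ensemble([{"A": 0.0, "B": 0.0}, {"A": math.log(3), "B": 0.0}])
```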
3.3 Translation Selection -- EM-TF-IDF
We view the translation selection problem as that
of calculating similarities between context
vectors and use as context vectors TF-IDF
vectors constructed with the EM Algorithm.
Figure 4 describes the algorithm in which we use
the same notations as those in
EM-NBC-Ensemble.
The idf value of a Chinese word $c$ is calculated
in advance as
$$\mathrm{idf}(c) = -\log(\mathrm{df}(c)/F) \qquad (6)$$
where $\mathrm{df}(c)$ denotes the document frequency of
$c$ and $F$ the total document frequency.
Input: $\tilde{e}$, $\tilde{C}$, contexts containing $\tilde{e}$, contexts containing all $\tilde{c} \in \tilde{C}$, $\mathrm{idf}(c)$, $c \in C$;
1. create a frequency vector $(f(e_1), f(e_2), \ldots, f(e_m))$, $e_i \in E$ $(i = 1, \ldots, m)$, using contexts containing $\tilde{e}$;
   transform the vector into $(f_E(c_1), f_E(c_2), \ldots, f_E(c_n))$, $c_i \in C$ $(i = 1, \ldots, n)$, using a translation dictionary and the EM algorithm;
   create a TF-IDF vector $A = (f_E(c_1)\,\mathrm{idf}(c_1), \ldots, f_E(c_n)\,\mathrm{idf}(c_n))$, $c_i \in C$ $(i = 1, \ldots, n)$;
2. for each ($\tilde{c} \in \tilde{C}$) {
   create a frequency vector $(f(c_1), f(c_2), \ldots, f(c_n))$, $c_i \in C$ $(i = 1, \ldots, n)$, using contexts containing $\tilde{c}$;
   create a TF-IDF vector $B = (f(c_1)\,\mathrm{idf}(c_1), \ldots, f(c_n)\,\mathrm{idf}(c_n))$, $c_i \in C$ $(i = 1, \ldots, n)$;
   calculate $\mathrm{tfidf}(\tilde{c}) = \cos(A, B)$; }
3. Sort $\tilde{c} \in \tilde{C}$ in descending order of $\mathrm{tfidf}(\tilde{c})$;
Output: the top sorted results
Figure 4. Algorithm of EM-TF-IDF
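The similarity computed in Figure 4 can be illustrated with the short sketch below. The function names `idf` and `tfidf_cosine` are ours, vectors are represented as sparse dictionaries, and words without an idf value are given weight zero, which is an assumption of this sketch rather than something stated in the paper.

```python
import math

def idf(df, total):
    """Equation (6): idf(c) = -log(df(c) / F)."""
    return -math.log(df / total)

def tfidf_cosine(f_E, f_cand, idf_values):
    """Cosine between TF-IDF vectors A (transformed source) and B (candidate)."""
    words = set(f_E) | set(f_cand)
    a = {c: f_E.get(c, 0.0) * idf_values.get(c, 0.0) for c in words}
    b = {c: f_cand.get(c, 0.0) * idf_values.get(c, 0.0) for c in words}
    dot = sum(a[c] * b[c] for c in words)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm > 0 else 0.0

# Toy data (hypothetical words x and y, both with idf 1.0).
weights = {"x": 1.0, "y": 1.0}
same_direction = tfidf_cosine({"x": 1.0}, {"x": 2.0}, weights)
orthogonal = tfidf_cosine({"x": 1.0}, {"y": 2.0}, weights)
```

Candidates are then sorted in descending order of this value, as in Step 3 of Figure 4.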
3.4 Advantage of Using EM Algorithm
The uses of EM-NBC-Ensemble and EM-TF-IDF
can be viewed as extensions of existing methods
for word or phrase translation using non-parallel
corpora. Particularly, the use of the EM
Algorithm can help to accurately transform a
frequency vector from one language to another.
Suppose that we are to determine if ‘信息时代’
is a translation of ‘information age’ (actually
it is). The frequency vectors of context words for
‘information age’ and ‘信息时代’ are given in A
and D in Figure 5, respectively. If for each
English word we only retain the link connecting
to the Chinese translation with the largest
frequency (a link represented as a solid line) to
establish a many-to-one mapping and transform
vector A from English to Chinese, we obtain
vector B. It turns out, however, that vector B is
quite different from vector D, although they
should be similar to each other. We will refer to
this method as ‘Major Translation’ hereafter.
With EM, vector A in Figure 5 is transformed
into vector C, which is much closer to vector D,
as expected. Specifically, EM can split the
frequency of a word in English and distribute
it among its translations in Chinese in a
theoretically sound way (cf., the distributed
frequencies of ‘internet’). Note that if we assume
a many-to-one (or one-to-one) mapping
relationship, then the use of EM turns out to be
equivalent to that of Major Translation.
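The equivalence in the many-to-one case can be checked directly from the EM updates. The derivation below is our own sketch, not taken from the paper. It writes $c(e)$ for the single Chinese word linked to an English word $e$ (a notation we introduce here) and assumes that $f(e)$ in the M-step is a relative frequency.

```latex
% Many-to-one case: every English word e has a single linked Chinese word c(e).
\begin{align*}
P(e \mid c) &= 0 \quad \text{for } c \neq c(e)
  && \text{(initialization, since } e \notin \Gamma_c\text{)} \\
P(c \mid e) &= \frac{P(c)\,P(e \mid c)}{\sum_{c' \in C} P(c')\,P(e \mid c')}
  = \begin{cases} 1, & c = c(e) \\ 0, & \text{otherwise} \end{cases}
  && \text{(E-step)} \\
P(c) &= \sum_{e \in E} f(e)\,P(c \mid e) = \sum_{e \in \Gamma_c} f(e)
  && \text{(M-step)}
\end{align*}
```

The zeros in $P(e \mid c)$ are preserved by the M-step, so the estimates no longer change after the first iteration. The probability mass assigned to each Chinese word is then simply the total frequency of the English words mapped to it, which is what Major Translation computes.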
3.5 Combination
In order to further boost the performance of
translation, we propose to also use the translation
method proposed by Nagata et al (2001). Specifically, we
combine our method with that of Nagata et al by
using a back-off strategy.
Figure 6 illustrates the process of collecting
Chinese translation candidates for an English
Base NP ‘information asymmetry’ with Nagata et
al’s method.
In the combination of the two methods, we first
use Nagata et al’s method to perform translation;
if we cannot find translations, we next use our
method. We will denote this strategy ‘Back-off’.
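The Back-off strategy amounts to a simple fallback. In the sketch below, `primary` stands for Nagata et al's method and `fallback` for our method; both are passed in as hypothetical callables that return a (possibly empty) ranked list of translations.

```python
def back_off(phrase, primary, fallback):
    """Use `primary`; fall back only when it returns no translation."""
    result = primary(phrase)
    return result if result else fallback(phrase)

# Hypothetical stand-ins: the primary method knows only one phrase.
primary = lambda p: ["t1"] if p == "information asymmetry" else []
fallback = lambda p: ["t2", "t3"]
hit = back_off("information asymmetry", primary, fallback)
miss = back_off("information age", primary, fallback)
```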
4. Experimental Results
We conducted experiments on translation of the
Base NPs from English to Chinese.
We extracted Base NPs (noun-noun pairs) from
the Encarta English corpus
(http://encarta.msn.com/Default.asp) using the tool
developed by Xun et al (2000). There were about
3000 Base NPs extracted. In the experiments, we
used the HIT English-Chinese word translation
dictionary (footnote 2). The dictionary contains about 76000
Chinese words, 60000 English words, and
118000 translation links. As a web search engine,
we used Google (http://www.google.com).
Five translation experts evaluated the
translation results by judging whether or not they
were acceptable. The evaluations reported below
are all based on their judgements.
4.1 Basic Experiment
In the experiment, we randomly selected 1000
Base NPs from the 3000 Base NPs. We next used
our method to perform translation on the 1000
phrases. In translation selection, we employed
EM-NBC-Ensemble and EM-TF-IDF.
Table 1. Best translation result for each method
Method             Top 1 accuracy (%)   Top 3 accuracy (%)   Coverage (%)
EM-NBC-Ensemble    61.7                 80.3                 89.9
Prior              57.6                 77.6                 89.9
MT-NBC-Ensemble    59.9                 78.1                 89.9
EM-KL-Ensemble     45.9                 72.3                 89.9
EM-NBC             60.8                 78.9                 89.9
EM-TF-IDF          61.9                 80.8                 89.9
MT-TF-IDF          58.2                 77.6                 89.9
EM-TF              55.8                 77.8                 89.9
Table 1 shows the results in terms of coverage
and top n accuracy. Here, coverage is defined as
the percentage of phrases which have translations
selected, while top n accuracy is defined as the
percentage of phrases whose selected top n
translations include correct translations.
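The two measures can be computed as in the sketch below. The function names and the data layout (`outputs` maps each phrase to its ranked translations, `gold` maps each phrase to the set of translations judged acceptable) are hypothetical. We assume that accuracy is taken over the covered phrases only; this is our reading, suggested by Table 3, where an accuracy of 72.0% is reported alongside a coverage of 10.5%.

```python
def coverage(outputs):
    """Percentage of phrases with at least one selected translation."""
    return 100.0 * sum(1 for ranked in outputs.values() if ranked) / len(outputs)

def top_n_accuracy(outputs, gold, n):
    """Percentage of covered phrases whose top-n list has an accepted translation."""
    covered = {p: r for p, r in outputs.items() if r}
    hits = sum(1 for p, r in covered.items() if set(r[:n]) & gold.get(p, set()))
    return 100.0 * hits / len(covered)

# Toy judgements (hypothetical): phrase 'c' has no translation at all.
outputs = {"a": ["x", "y"], "b": ["z"], "c": [], "d": ["q", "w"]}
gold = {"a": {"y"}, "b": {"z"}, "d": {"k"}}
cov = coverage(outputs)
top1 = top_n_accuracy(outputs, gold, 1)
top3 = top_n_accuracy(outputs, gold, 3)
```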
For EM-NBC-Ensemble, we set the α in Equation (4) to
be 5 on the basis of our preliminary experimental
results. For EM-TF-IDF, we used the non-web
data described in Section 4.4 to estimate idf
values of words. We used contexts with window
sizes of ±1, ±3, ±5, ±7, ±9, ±11.
Footnote 2: The dictionary was created by the Harbin Institute of Technology.
[Figure 5: four frequency vectors, labeled A, B, C, and D, each plotted on a frequency axis]
Figure 5. Example of frequency vector transformation
1. Input ‘information asymmetry’;
2. Search the English Base NP on web sites in Chinese and obtain documents as follows (i.e., using partial parallel corpora):
   [Chinese passage in which the term 信息不对称 is immediately followed by the parenthesized English term ‘（information asymmetry）’]
3. Find the most frequently occurring Chinese phrases immediately before the brackets containing the English Base NP, using a suffix tree;
4. Output the Chinese phrases and their document frequencies:
   信息不对称 5
   [a second candidate beginning with 信息; its remaining characters and frequency are illegible in the source]
Figure 6. Nagata et al’s method
[Figure 7: Top 1 accuracy (%), on an axis from 35 to 65, plotted against window size (0 to 12) for EM-NBC-Ensemble, Prior, EM-NBC, MT-NBC-Ensemble, EM-KL-Ensemble, EM-TF-IDF, MT-TF-IDF, and EM-TF]
Figure 7. Translation results
Figure 7 shows the results of
EM-NBC-Ensemble and EM-TF-IDF, in which
for EM-NBC-Ensemble ‘window size’ denotes
that of the largest within an ensemble. Table 1
summarizes the best results for each of them.
‘Prior’ and ‘MT-TF-IDF’ are actually
baseline methods relying on the existing
technologies. In Prior, we select candidates
whose prior probabilities are the largest,
equivalently, document frequencies obtained in
translation candidate collection are the largest. In
MT-TF-IDF, we use TF-IDF vectors transformed
with Major Translation.
Our experimental results indicate that both
EM-NBC-Ensemble and EM-TF-IDF
significantly outperform Prior and MT-TF-IDF,
when appropriate window sizes are chosen. The
p-values of the sign tests are 0.00056 and 0.00133
for EM-NBC-Ensemble, 0.00002 and 0.00901
for EM-TF-IDF, respectively.
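For readers unfamiliar with the sign test, a minimal exact version is sketched below. It counts the phrases on which one method is right and the other wrong (`wins` and `losses`, with ties discarded) and computes a binomial tail probability. The function name is ours, and the two-sided form is an assumption; the paper does not state whether its p-values are one-sided or two-sided.

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided exact sign test p-value; ties are dropped beforehand."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of a split at least this uneven under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: method 1 wins on 8 phrases and loses on 2.
p = sign_test_p(8, 2)
```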
We next removed each of the key components
of EM-NBC-Ensemble and used the remaining
components as a variant of it to perform
translation selection. The key components are (1)
distance calculation by KL divergence, (2) EM, (3)
prior probability, and (4) ensemble. The variants,
thus, respectively make use of (1) the baseline
method ‘Prior’, (2) an ensemble of Naïve
Bayesian Classifiers based on Major Translation
(MT-NBC-Ensemble), (3) an ensemble of
EM-based KL divergence calculations
(EM-KL-Ensemble), and (4) EM-NBC. Figure 7
and Table 1 show the results. We see that
EM-NBC-Ensemble outperforms all of the
variants, indicating that all the components
within EM-NBC-Ensemble play positive roles.
We removed each of the key components of
EM-TF-IDF and used the remaining components
as a variant of it to perform translation selection.
The key components are (1) idf value and (2) EM.
The variants, thus, respectively make use of (1)
EM-based frequency vectors (EM-TF), (2) the
baseline method MT-TF-IDF. Figure 7 and Table
1 show the results. We see that EM-TF-IDF
outperforms both variants, indicating that all of
the components within EM-TF-IDF are needed.
Comparing the results between
MT-NBC-Ensemble and EM-NBC-Ensemble
and the results between MT-TF-IDF and
EM-TF-IDF, we see that the uses of the EM
Algorithm can indeed help to improve translation
accuracies.
Table 2. Sample of translation outputs
Base NP Translation
calcium ion G4a6dG2f8fG1124
adventure tale
G766G4c3dG1b19G45f
Gf1bG461bG1b19G45f
G766G4c3dG4f4G41c8
lung cancer G368eG2c20
aircraft carrier *G4eb2G1d0eG1853G45a4G48e
adult literacy
*G17e4G48eG419aG112b
*G17e4G1448G419aG112b
Table 2 shows translations of five Base NPs as
output by EM-NBC-Ensemble, in which the
translations marked with * were judged incorrect
by human experts. We analyzed the reasons for
incorrect translations and found that the incorrect
translations were due to: (1) no existence of
dictionary entry (19%), (2) non-compositional
translation (13%), (3) ranking error (68%).
4.2 Our Method vs. Nagata et al’s Method
Table 3. Translation results
Method           Top 1 accuracy (%)   Top 3 accuracy (%)   Coverage (%)
Our Method       61.7                 80.3                 89.9
Nagata et al’s   72.0                 76.0                 10.5
We next used Nagata et al’s method to perform
translation. From Table 3, we can see that the
accuracy of Nagata et al’s method is higher than
that of our method, but the coverage of it is lower.
The results indicate that our proposed Back-off
strategy for translation is justifiable.
4.3 Combination
In the experiment, we tested the Back-off strategy.
Table 4 shows the results. The Back-off strategy
helps to further improve the results whether
EM-NBC-Ensemble or EM-TF-IDF is used.
Table 4. Translation results
Method                Top 1 accuracy (%)   Top 3 accuracy (%)   Coverage (%)
Back-off (Ensemble)   62.9                 79.7                 91.4
Back-off (TF-IDF)     62.2                 79.8                 91.4
4.4 Web Data vs. Non-web Data
To test the effectiveness of the use of web data,
we conducted another experiment in which we
performed translation by using non-web data.
The data comprised the Wall Street Journal
corpus in English (1987-1992, 500MB) and the
People’s Daily corpus in Chinese (1982-1998,
700MB). We followed the Back-off strategy as in
Section 4.3 to translate the 1000 Base NPs.
Table 5. Translation results
Data                          Top 1 accuracy (%)   Top 3 accuracy (%)   Coverage (%)
Web (EM-NBC-Ensemble)         62.9                 79.7                 91.4
Non-web (EM-NBC-Ensemble)     56.9                 74.7                 79.3
Web (EM-TF-IDF)               62.2                 79.8                 91.4
Non-web (EM-TF-IDF)           51.5                 71.4                 78.5
The results in Table 5 show that the use of web
data can yield better results than non-use of it,
although the sizes of the non-web data we used
were considerably large in practice. For Nagata et
al’s method, we found that it was almost
impossible to find partial-parallel corpora in the
non-web data.
5. Conclusions
This paper has proposed a new and effective
method for Base NP translation by using web
data and the EM Algorithm. Experimental results
show that it outperforms the baseline methods
based on existing techniques, mainly due to the
employment of EM. Experimental results also
show that the use of web data is more effective
than non-use of it.
Future work includes further applying the
proposed method to the translation of other types
of Base NPs and between other language pairs.
Acknowledgements
We thank Ming Zhou, Chang-Ning Huang,
Jianfeng Gao, and Ashley Chang for many
helpful discussions on this research project. We
also acknowledge Shenjie Li for help with
program coding.

References

Brill E., Lin J., Banko M., Dumais S. and Ng A. (2001)
Data-Intensive Question Answering. In Proc. of
TREC '2001.

Brown P.F., Della Pietra, S.A., Della Pietra V.J., and
Mercer, R.L. (1993) The mathematics of Statistical
Machine Translation: Parameter Estimation.
Computational Linguistics 19(2), pp.263--11.

Cover T. and Thomas J. (1991) Elements of
Information Theory, Wiley.

Dempster A. P, Laird N. M. and Rubin D. B. (1977)
Maximum likelihood from incomplete data via the
EM algorithm. J. Roy. Stat. Soc. B 39:1--38.

Diab M. and Finch S. (2000) A statistical word-level
translation model for comparable corpora. In Proc.
of RIAO.

Fung P. and Yee L.Y. (1998) An IR approach for
translating new words from nonparallel,
comparable texts. In Proc. of COLING-ACL '1998,
pp 414--20.

Gao J. F., Nie J. Y., Xun E. D., Zhang J., Zhou M. and
Huang C. N. (2001) Improving Query Translation
for Cross-Language Information Retrieval Using
Statistical Models. In Proc. of SIGIR '2001.

Kikui G. (1999) Resolving translation ambiguity using
non-parallel bilingual corpora. In Proc. of ACL
'1999 Workshop, Unsupervised Learning in NLP.

Koehn P. and Knight K. (2000) Estimating word
translation probabilities from unrelated
monolingual corpora using the EM algorithm. In
Proc. of AAAI '2000.

Nagata M., Saito T., and Suzuki K. (2001) Using the
Web as a bilingual dictionary. In Proc. of ACL'2001
DD-MT Workshop.

Nakagawa H. (2001) Disambiguation of single noun
translations extracted from bilingual comparable
corpora. In Terminology 7:1.

Pederson T.(2000) A Simple Approach to Building
Ensembles of Naïve Bayesian Classifiers for Word
Sense Disambiguation. In Proc. of NAACL '2000.

Rapp R. (1999) Automatic identification of word
translations from unrelated English and German
corpora. In Proc. of ACL'1999.

Sumita E.(2000) Lexical transfer using a vector-space
model. In Proc. of ACL '2000.

Tanaka K. and Iwasaki H. (1996) Extraction of
Lexical Translation from non-aligned corpora. In
Proc. of COLING '1996

Xun E.D., Huang C.N. and Zhou M. (2000) A Unified
Statistical Model for the Identification of English
BaseNP. In Proc. of ACL '2000.
