A Word-to-Word Model of Translational Equivalence 
I. Dan Melamed 
Dept. of Computer and Information Science 
University of Pennsylvania 
Philadelphia, PA, 19104, U.S.A. 
melamed@unagi.cis.upenn.edu
Abstract 
Many multilingual NLP applications need 
to translate words between different lan- 
guages, but cannot afford the computa- 
tional expense of inducing or applying a full 
translation model. For these applications, 
we have designed a fast algorithm for esti- 
mating a partial translation model, which 
accounts for translational equivalence only 
at the word level. The model's precision/recall
trade-off can be directly controlled
via one threshold parameter. This
feature makes the model more suitable for 
applications that are not fully statistical. 
The model's hidden parameters can be eas- 
ily conditioned on information extrinsic to 
the model, providing an easy way to integrate
pre-existing knowledge such as part-of-speech,
dictionaries, word order, etc.
Our model can link word tokens in paral- 
lel texts as well as other translation mod- 
els in the literature. Unlike other transla- 
tion models, it can automatically produce 
dictionary-sized translation lexicons, and it 
can do so with over 99% accuracy. 
1 Introduction 
Over the past decade, researchers at IBM have devel- 
oped a series of increasingly sophisticated statistical 
models for machine translation (Brown et al., 1988; 
Brown et al., 1990; Brown et al., 1993a). However, 
the IBM models, which attempt to capture a broad 
range of translation phenomena, are computation- 
ally expensive to apply. Table look-up using an ex- 
plicit translation lexicon is sufficient and preferable 
for many multilingual NLP applications, including 
"crummy" MT on the World Wide Web (Church 
& Hovy, 1993), certain machine-assisted translation
tools (e.g. (Macklovitch, 1994; Melamed, 1996b)), 
concordancing for bilingual lexicography (Catizone 
et al., 1993; Gale & Church, 1991), computer-assisted
language learning, corpus linguistics (Melby,
1981), and cross-lingual information retrieval (Oard
& Dorr, 1996).
In this paper, we present a fast method for in- 
ducing accurate translation lexicons. The method 
assumes that words are translated one-to-one. This 
assumption reduces the explanatory power of our 
model in comparison to the IBM models, but, as 
shown in Section 3.1, it helps us to avoid what we 
call indirect associations, a major source of errors in 
other models. Section 3.1 also shows how the one- 
to-one assumption enables us to use a new greedy 
competitive linking algorithm for re-estimating the 
model's parameters, instead of more expensive algo- 
rithms that consider a much larger set of word cor- 
respondence possibilities. The model uses two hid- 
den parameters to estimate the confidence of its own 
predictions. The confidence estimates enable direct 
control of the balance between the model's preci- 
sion and recall via a simple threshold. The hidden 
parameters can be conditioned on prior knowledge 
about the bitext to improve the model's accuracy. 
2 Co-occurrence 
With the exception of (Fung, 1995b), previous
methods for automatically constructing statistical
translation models begin by looking at word co-occurrence
frequencies in bitexts (Gale & Church,
1991; Kumano & Hirakawa, 1994; Fung, 1995a;
Melamed, 1995). A bitext comprises a pair of texts 
in two languages, where each text is a translation 
of the other. Word co-occurrence can be defined in 
various ways. The most common way is to divide 
each half of the bitext into an equal number of seg- 
ments and to align the segments so that each pair of 
segments $S_i$ and $T_i$ are translations of each other
(Gale & Church, 1991; Melamed, 1996a). Then, 
two word tokens (u, v) are said to co-occur in the 
aligned segment pair i if $u \in S_i$ and $v \in T_i$. The
co-occurrence relation can also be based on distance 
in a bitext space, which is a more general representation
of bitext correspondence (Dagan et al., 1993;
Resnik & Melamed, 1997), or it can be restricted to
word pairs that satisfy some matching predicate,
which can be extrinsic to the model (Melamed, 1995; 
Melamed, 1997). 
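The segment-based definition of co-occurrence can be illustrated with a short Python sketch. It assumes the bitext has already been aligned into segment pairs and tokenized; the function name `count_cooccurrences` and the toy data are hypothetical and are not taken from the paper. Each pair of tokens, one from each side of an aligned segment pair, is counted as one co-occurrence.

```python
from collections import Counter
from itertools import product

def count_cooccurrences(aligned_segments):
    """Count n(u, v): how often a token of u and a token of v
    appear in the same aligned segment pair (S_i, T_i)."""
    n = Counter()
    for source_tokens, target_tokens in aligned_segments:
        # Every source token co-occurs with every target token
        # of the same aligned segment pair.
        for u, v in product(source_tokens, target_tokens):
            n[(u, v)] += 1
    return n

# Toy bitext with two aligned segment pairs (hypothetical data).
bitext = [
    (["the", "house"], ["la", "maison"]),
    (["the", "car"], ["la", "voiture"]),
]
n = count_cooccurrences(bitext)
print(n[("the", "la")])        # 2
print(n[("house", "maison")])  # 1
```

A distance-based or predicate-restricted definition of co-occurrence would change only the inner loop.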
3 The Basic Word-to-Word Model 
Our translation model consists of the hidden parameters
$\lambda^+$ and $\lambda^-$, and likelihood ratios L(u, v). The
two hidden parameters are the probabilities of the 
model generating true and false positives in the data. 
L(u,v) represents the likelihood that u and v can 
be mutual translations. For each co-occurring pair of 
word types u and v, these likelihoods are initially set 
proportional to their co-occurrence frequency n(u,v) 
and inversely proportional to their marginal frequencies
n(u) and n(v)¹, following (Dunning, 1993)².
When the L(u, v) are re-estimated, the model's hid- 
den parameters come into play. 
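One way to picture the initial scores is Dunning's log-likelihood ratio statistic (G²) over a 2×2 contingency table. This is a sketch only: the text says that the initial likelihoods follow (Dunning, 1993) but does not spell out the computation, so the function name `g2_score` and the table layout are assumptions. The marginal counts are sums of co-occurrence counts, as described in the first footnote.

```python
import math

def g2_score(n_uv, n_u, n_v, total):
    """Log-likelihood ratio statistic (G^2) for a 2x2 contingency table.

    n_uv  : co-occurrence count of u with v
    n_u   : marginal count of u (sum of n(u, v') over all v')
    n_v   : marginal count of v (sum of n(u', v) over all u')
    total : total number of co-occurrences N
    """
    observed = [
        [n_uv, n_u - n_uv],                      # u with v, u without v
        [n_v - n_uv, total - n_u - n_v + n_uv],  # v without u, neither
    ]
    row_sums = [n_u, total - n_u]
    col_sums = [n_v, total - n_v]
    score = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if observed[i][j] > 0:  # empty cells contribute nothing
                score += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * score

# Hypothetical counts: the first pair co-occurs exactly as often as
# chance predicts, the second far more often.
independent = g2_score(n_uv=10, n_u=100, n_v=100, total=1000)
associated = g2_score(n_uv=90, n_u=100, n_v=100, total=1000)
print(independent, associated)
```

Note that G² is two-sided: it is also large for pairs that co-occur less often than chance predicts, so a practical implementation would probably check the direction of the association as well.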
After initialization, the model induction algorithm 
iterates: 
1. Find a set of "links" among word tokens in the 
bitext, using the likelihood ratios and the com- 
petitive linking algorithm. 
2. Use the links to re-estimate $\lambda^+$, $\lambda^-$, and the
likelihood ratios. 
3. Repeat from Step 1 until the model converges 
to the desired degree. 
The competitive linking algorithm and its one-to-one 
assumption are detailed in Section 3.1. Section 3.2
explains how to re-estimate the model parameters.
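The three-step iteration can be sketched as a small driver loop. All names here (`induce_model`, `init_likelihood`, `link`, `reestimate`) are hypothetical placeholders for the components described in Sections 3.1 and 3.2, and the stopping rule borrows the heuristic stated at the end of Section 5.2, namely that the model "has converged" when the probability stops increasing.

```python
def induce_model(bitext, init_likelihood, link, reestimate, max_iterations=20):
    """Alternate linking and re-estimation until the score stops improving.

    init_likelihood(bitext)    -> initial likelihood table
    link(bitext, likelihood)   -> links found by competitive linking
    reestimate(bitext, links)  -> (new likelihood table, log Pr(links|model))
    """
    likelihood = init_likelihood(bitext)
    best_log_prob = float("-inf")
    for _ in range(max_iterations):
        links = link(bitext, likelihood)                      # Step 1
        new_likelihood, log_prob = reestimate(bitext, links)  # Step 2
        if log_prob <= best_log_prob:                         # Step 3
            break  # no improvement: keep the previous estimate
        likelihood, best_log_prob = new_likelihood, log_prob
    return likelihood, best_log_prob
```

Passing the components in as functions keeps the loop independent of how linking and re-estimation are implemented.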
3.1 Competitive Linking Algorithm 
The competitive linking algorithm is designed to 
overcome the problem of indirect associations, illus- 
trated in Figure 1. The sequences of u's and v's 
represent corresponding regions of a bitext. If $u_k$
and $v_k$ co-occur much more often than expected by
chance, then any reasonable model will deem them
likely to be mutual translations. If $u_k$ and $v_k$ are
indeed mutual translations, then their tendency to
¹ The co-occurrence frequency of a word type pair is
simply the number of times the pair co-occurs in the
corpus. However, $n(u) = \sum_v n(u,v)$, which is not the
same as the frequency of u, because each token of u can
co-occur with several different v's.
² We could just as easily use other symmetric "association"
measures, such as $\phi^2$ (Gale & Church, 1991) or
the Dice coefficient (Smadja, 1992).
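[Figure 1 diagram: the sequences ... $u_{k-1}$ $u_k$ $u_{k+1}$ ... and ... $v_{k-1}$ $v_k$ $v_{k+1}$ ..., with an arrow connecting $v_k$ and $u_{k+1}$.]
Figure 1: $u_k$ and $v_k$ often co-occur, as do $u_k$ and
$u_{k+1}$. The direct association between $u_k$ and $v_k$, and
the direct association between $u_k$ and $u_{k+1}$ give rise
to an indirect association between $v_k$ and $u_{k+1}$.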
co-occur is called a direct association. Now, suppose
that $u_k$ and $u_{k+1}$ often co-occur within their
language. Then $v_k$ and $u_{k+1}$ will also co-occur more
often than expected by chance. The arrow connecting
$v_k$ and $u_{k+1}$ in Figure 1 represents an indirect
association, since the association between $v_k$ and
$u_{k+1}$ arises only by virtue of the association between
each of them and $u_k$. Models of translational equivalence
that are ignorant of indirect associations have
"a tendency ... to be confused by collocates" (Dagan 
et al., 1993). 
Fortunately, indirect associations are usually not 
difficult to identify, because they tend to be weaker 
than the direct associations on which they are based 
(Melamed, 1996c). The majority of indirect associ- 
ations can be filtered out by a simple competition 
heuristic: Whenever several word tokens $u_i$ in one
half of the bitext co-occur with a particular word to- 
ken v in the other half of the bitext, the word that is 
most likely to be v's translation is the one for which 
the likelihood L(u, v) of translational equivalence is 
highest. The competitive linking algorithm imple- 
ments this heuristic: 
1. Discard all likelihood scores for word types 
deemed unlikely to be mutual translations, i.e. 
all L(u,v) < 1. This step significantly reduces 
the computational burden of the algorithm. It 
is analogous to the step in other translation 
model induction algorithms that sets all prob- 
abilities below a certain threshold to negligible 
values (Brown et al., 1990; Dagan et al., 1993; 
Chen, 1996). To retain word type pairs that 
are at least twice as likely to be mutual translations
as not, the threshold can be raised to 2.
Conversely, the threshold can be lowered to buy 
more coverage at the cost of a larger model that 
will converge more slowly. 
2. Sort all remaining likelihood estimates L(u, v) 
from highest to lowest. 
3. Find u and v such that the likelihood ratio 
L(u,v) is highest. Token pairs of these types 
$n_{(u,v)}$ = frequency of co-occurrence between word types u and v
$N$ = $\sum_{(u,v)} n_{(u,v)}$ = total number of co-occurrences in the bitext
$k_{(u,v)}$ = frequency of links between word types u and v
$K$ = $\sum_{(u,v)} k_{(u,v)}$ = total number of links in the bitext
$\tau$ = Pr( mutual translations | co-occurrence )
$\lambda$ = Pr( link | co-occurrence )
$\lambda^+$ = Pr( link | co-occurrence of mutual translations )
$\lambda^-$ = Pr( link | co-occurrence of not mutual translations )
$B(k \mid n, p)$ = Pr( k | n, p ), where k has a binomial distribution with parameters n and p
N.B.: $\lambda^+$ and $\lambda^-$ need not sum to 1, because they are conditioned on different events.
Figure 2: Variables used to estimate the model parameters.
would be the winners in any competitions in- 
volving u or v. 
4. Link all token pairs (u, v) in the bitext. 
5. The one-to-one assumption means that linked 
words cannot be linked again. Therefore, re- 
move all linked word tokens from their respec- 
tive texts. 
6. If there is another co-occurring word token pair 
(u, v) such that L(u, v) exists, then repeat from 
Step 3. 
The competitive linking algorithm is more greedy 
than algorithms that try to find a set of link types 
that are jointly most probable over some segment of 
the bitext. In practice, our linking algorithm can be 
implemented so that its worst-case running time is 
O(lm), where l and m are the lengths of the aligned 
segments. 
The simplicity of the competitive linking algo- 
rithm depends on the one-to-one assumption: 
Each word translates to at most one other word. 
Certainly, there are cases where this assumption is 
false. We prefer not to model those cases, in order to 
achieve higher accuracy with less effort on the cases 
where the assumption is true. 
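The linking steps above can be sketched in Python for a single aligned segment pair. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `competitive_linking`, the dictionary representation of L(u, v), and the toy scores are all hypothetical. The sketch sorts candidate token pairs, so it runs in O(lm log lm) time rather than the O(lm) bound mentioned above.

```python
def competitive_linking(source_tokens, target_tokens, likelihood, threshold=1.0):
    """Greedily link tokens in one aligned segment pair, best score first.

    likelihood maps word-type pairs (u, v) to L(u, v). Pairs scoring
    below the threshold are discarded (Step 1).
    """
    candidates = []
    for i, u in enumerate(source_tokens):
        for j, v in enumerate(target_tokens):
            score = likelihood.get((u, v), 0.0)
            if score >= threshold:
                candidates.append((score, i, j))
    # Step 2: highest likelihood first; ties are broken by position.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    linked_source, linked_target, links = set(), set(), []
    for score, i, j in candidates:
        # Step 5: under the one-to-one assumption a linked token
        # cannot be linked again.
        if i in linked_source or j in linked_target:
            continue
        links.append((source_tokens[i], target_tokens[j]))  # Step 4
        linked_source.add(i)
        linked_target.add(j)
    return links

# Hypothetical scores: ("black", "cafe") is an indirect association that
# is weaker than the direct associations it arises from.
scores = {
    ("coffee", "cafe"): 50.0,
    ("black", "noir"): 30.0,
    ("black", "cafe"): 8.0,
    ("coffee", "noir"): 0.5,  # below the threshold, discarded in Step 1
}
links = competitive_linking(["black", "coffee"], ["cafe", "noir"], scores)
print(links)  # [('coffee', 'cafe'), ('black', 'noir')]
```

Because "coffee" wins the competition for "cafe", the weaker indirect association between "black" and "cafe" never becomes a link.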
3.2 Parameter Estimation 
The purpose of the competitive linking algorithm is 
to help us re-estimate the model parameters. The 
variables that we use in our estimation are summa- 
rized in Figure 2. The linking algorithm produces a 
set of links between word tokens in the bitext. We 
define a link token to be an ordered pair of word 
tokens, one from each half of the bitext. A link 
type is an ordered pair of word types. Let $n_{(u,v)}$ be
the co-occurrence frequency of u and v and $k_{(u,v)}$ be
the number of links between tokens of u and v³. An
³ Note that $k_{(u,v)}$ depends on the linking algorithm,
but $n_{(u,v)}$ is a constant property of the bitext.
important property of the competitive linking algorithm
is that the ratio $k_{(u,v)}/n_{(u,v)}$ tends to be very
high if u and v are mutual translations, and quite
low if they are not. The bimodality of this ratio
for several values of $n_{(u,v)}$ is illustrated in Figure 3.
This figure was plotted after the model's first iter- 
ation over 300000 aligned sentence pairs from the 
[Figure 3 plot: frequency (log scale) over $k_{(u,v)}/n_{(u,v)}$ and $n_{(u,v)}$.]
Figure 3: A fragment of the joint frequency
($k_{(u,v)}/n_{(u,v)}$, $n_{(u,v)}$). Note that the frequencies are
plotted on a log scale; the bimodality is quite sharp.
Canadian Hansard bitext. Note that the frequencies 
are plotted on a log scale -- the bimodality is quite 
sharp. 
The linking algorithm creates all the links of a
given type independently of each other, so the number
$k_{(u,v)}$ of links connecting word types u and v
has a binomial distribution with parameters $n_{(u,v)}$
and $p_{(u,v)}$. If u and v are mutual translations, then
$p_{(u,v)}$ tends to a relatively high probability, which we
will call $\lambda^+$. If u and v are not mutual translations,
then $p_{(u,v)}$ tends to a very low probability, which
we will call $\lambda^-$. $\lambda^+$ and $\lambda^-$ correspond to the two
peaks in the frequency distribution of $k_{(u,v)}/n_{(u,v)}$
in Figure 3. The two parameters can also be interpreted
as the percentage of true and false positives.
If the translation in the bitext is consistent and the 
model is accurate, then $\lambda^+$ should be near 1 and $\lambda^-$
should be near 0.
To find the most probable values of the hidden
model parameters $\lambda^+$ and $\lambda^-$, we adopt the standard
method of maximum likelihood estimation, and find
the values that maximize the probability of the link
frequency distributions. The one-to-one assumption
implies independence between different link types,
so that

$$\Pr(\mathit{links} \mid \mathit{model}) = \prod_{u,v} \Pr(k_{(u,v)} \mid n_{(u,v)}, \lambda^+, \lambda^-). \tag{1}$$

The factors on the right-hand side of Equation 1 can
be written explicitly with the help of a mixture coefficient.
Let $\tau$ be the probability that an arbitrary
co-occurring pair of word types are mutual translations.
Let $B(k \mid n, p)$ denote the probability that k
links are observed out of n co-occurrences, where k
has a binomial distribution with parameters n and p.
Then the probability that u and v are linked $k_{(u,v)}$
times out of $n_{(u,v)}$ co-occurrences is a mixture of two
binomials:

$$\Pr(k_{(u,v)} \mid n_{(u,v)}, \lambda^+, \lambda^-) = \tau B(k_{(u,v)} \mid n_{(u,v)}, \lambda^+) + (1-\tau) B(k_{(u,v)} \mid n_{(u,v)}, \lambda^-). \tag{2}$$

One more variable allows us to express $\tau$ in terms
of $\lambda^+$ and $\lambda^-$: Let $\lambda$ be the probability that an
arbitrary co-occurring pair of word tokens will be linked,
regardless of whether they are mutual translations.
Since $\tau$ is constant over all word types, it also represents
the probability that an arbitrary co-occurring
pair of word tokens are mutual translations. Therefore,

$$\lambda = \tau\lambda^+ + (1-\tau)\lambda^-. \tag{3}$$

$\lambda$ can also be estimated empirically. Let K be the
total number of links in the bitext and let N be the
total number of co-occurring word token pairs:
$K = \sum_{(u,v)} k_{(u,v)}$, $N = \sum_{(u,v)} n_{(u,v)}$. By definition,

$$\lambda = K/N. \tag{4}$$

Equating the right-hand sides of Equations (3) and
(4) and rearranging the terms, we get:

$$\tau = \frac{K/N - \lambda^-}{\lambda^+ - \lambda^-}. \tag{5}$$

Since $\tau$ is now a function of $\lambda^+$ and $\lambda^-$, only the
latter two variables represent degrees of freedom in
the model.
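For readers who want the rearrangement spelled out, the step from Equations (3) and (4) to Equation (5) can be written as follows, where $\tau$ stands for the mixture coefficient, that is, the probability that a co-occurring pair are mutual translations. The numbers in the comment are hypothetical and serve only as a sanity check.

```latex
\begin{align*}
\frac{K}{N} &= \tau\lambda^+ + (1-\tau)\lambda^- \\
            &= \lambda^- + \tau\,(\lambda^+ - \lambda^-) \\
\Longrightarrow\quad
\tau &= \frac{K/N - \lambda^-}{\lambda^+ - \lambda^-}
\end{align*}
% Hypothetical values: K/N = 0.35, \lambda^+ = 0.75, \lambda^- = 0.25
% give \tau = (0.35 - 0.25) / (0.75 - 0.25) = 0.2,
% and indeed 0.2 * 0.75 + 0.8 * 0.25 = 0.35.
```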
The probability function expressed by Equations 1 
and 2 has many local maxima. In practice, these 
[Figure 4 plot not reproduced.]
Figure 4: Pr(links|model) has only one global maximum
in the region of interest.
local maxima are like pebbles on a mountain, in- 
visible at low resolution. We computed Equation 1 
over various combinations of $\lambda^+$ and $\lambda^-$ after the
model's first iteration over 300000 aligned sentence
pairs from the Canadian Hansard bitext. Figure 4
shows that the region of interest in the parameter
space, where $1 > \lambda^+ > \lambda > \lambda^- > 0$, has only one
clearly visible global maximum. This global maxi- 
mum can be found by standard hill-climbing meth- 
ods, as long as the step size is large enough to avoid 
getting stuck on the pebbles. 
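The search over the two hidden parameters can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: a coarse grid search stands in for hill-climbing, the function names and the toy link counts are hypothetical, and the mixture coefficient is derived from the two parameters through Equation 5. Working in log space avoids underflow when many link types are multiplied together.

```python
import math

def log_binomial(k, n, p):
    """log B(k | n, p) for 0 < p < 1."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p)

def log_likelihood(counts, lam_plus, lam_minus):
    """Log of Equation 1 for counts = [(k, n), ...], one entry per link type.
    Returns -inf when the implied mixture coefficient is outside (0, 1)."""
    total_links = sum(k for k, _ in counts)
    total_cooc = sum(n for _, n in counts)
    tau = (total_links / total_cooc - lam_minus) / (lam_plus - lam_minus)  # Eq. 5
    if not 0.0 < tau < 1.0:
        return float("-inf")
    total = 0.0
    for k, n in counts:
        a = math.log(tau) + log_binomial(k, n, lam_plus)
        b = math.log(1.0 - tau) + log_binomial(k, n, lam_minus)
        top = max(a, b)
        # log-sum-exp of the two mixture components (Equation 2)
        total += top + math.log(math.exp(a - top) + math.exp(b - top))
    return total

def grid_search(counts, steps=50):
    """Coarse search over the region 1 > lam_plus > lam_minus > 0."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    best = (float("-inf"), None, None)
    for lam_plus in grid:
        for lam_minus in grid:
            if lam_minus >= lam_plus:
                continue
            score = log_likelihood(counts, lam_plus, lam_minus)
            if score > best[0]:
                best = (score, lam_plus, lam_minus)
    return best

# Toy link counts (k, n), one per link type (hypothetical, bimodal data).
counts = [(9, 10)] * 5 + [(0, 10)] * 45 + [(1, 10)] * 5
best_score, best_plus, best_minus = grid_search(counts)
print(best_plus, best_minus)
```

A coarse grid plays the role of the large step size mentioned above: it is too coarse to get stuck on small local maxima, and a finer local search could refine the result afterwards.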
Given estimates for $\lambda^+$ and $\lambda^-$, we can compute
$B(k_{(u,v)} \mid n_{(u,v)}, \lambda^+)$ and $B(k_{(u,v)} \mid n_{(u,v)}, \lambda^-)$. These are
probabilities that $k_{(u,v)}$ links were generated by an
algorithm that generates correct links and by an algorithm
that generates incorrect links, respectively,
out of $n_{(u,v)}$ co-occurrences. The ratio of these probabilities
is the likelihood ratio in favor of u and v
being mutual translations, for all u and v:

$$L(u,v) = \frac{B(k_{(u,v)} \mid n_{(u,v)}, \lambda^+)}{B(k_{(u,v)} \mid n_{(u,v)}, \lambda^-)}. \tag{6}$$
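The likelihood ratio is easiest to compute in log space, where the binomial coefficients of the numerator and denominator cancel. The following sketch assumes both parameters lie strictly between 0 and 1; the function name is hypothetical, and the parameter values in the example are the content-class estimates reported in Section 5.

```python
import math

def log_likelihood_ratio(k, n, lam_plus, lam_minus):
    """log L(u, v) = log B(k | n, lam_plus) - log B(k | n, lam_minus).

    The binomial coefficient is the same in both terms, so it cancels.
    """
    return (k * (math.log(lam_plus) - math.log(lam_minus))
            + (n - k) * (math.log(1.0 - lam_plus) - math.log(1.0 - lam_minus)))

# Content-class estimates reported in Section 5.
LAM_PLUS, LAM_MINUS = 0.78, 0.00016

often_linked = log_likelihood_ratio(k=8, n=10, lam_plus=LAM_PLUS, lam_minus=LAM_MINUS)
never_linked = log_likelihood_ratio(k=0, n=10, lam_plus=LAM_PLUS, lam_minus=LAM_MINUS)
print(often_linked > 0, never_linked < 0)  # True True
```

A positive log-ratio corresponds to L(u, v) > 1, which is the default cutoff in Step 1 of the competitive linking algorithm.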
4 Class-Based Word-to-Word Models
In the basic word-to-word model, the hidden parameters
$\lambda^+$ and $\lambda^-$ depend only on the distributions of
link frequencies generated by the competitive link- 
ing algorithm. More accurate models can be induced 
by taking into account various features of the linked 
tokens. For example, frequent words are translated 
less consistently than rare words (Melamed, 1997). 
To account for this difference, we can estimate separate
values of $\lambda^+$ and $\lambda^-$ for different ranges of
$n_{(u,v)}$. Similarly, the hidden parameters can be conditioned
on the linked parts of speech. Word order
can be taken into account by conditioning the hid- 
den parameters on the relative positions of linked 
word tokens in their respective sentences. Just as 
easily, we can model links that coincide with en- 
tries in a pre-existing translation lexicon separately 
from those that do not. This method of incorporat- 
ing dictionary information seems simpler than the 
method proposed by Brown et al. for their models
(Brown et al., 1993b). When the hidden parameters 
are conditioned on different link classes, the estima- 
tion method does not change; it is just repeated for 
each link class. 
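The point that estimation is simply repeated per link class can be sketched as a grouping step wrapped around any estimator. Everything named here is hypothetical: `estimate_per_class`, the look-up table of function words, and in particular `crude_estimate`, which is a stand-in that pools link ratios and is not the maximum likelihood procedure of Section 3.2.

```python
from collections import defaultdict

def estimate_per_class(link_counts, classify, estimate):
    """Group link types by class and run the same estimator on each group.

    link_counts : {(u, v): (k, n)}
    classify    : function mapping a pair (u, v) to a class label
    estimate    : function mapping a list of (k, n) to (lam_plus, lam_minus)
    """
    by_class = defaultdict(list)
    for pair, k_n in link_counts.items():
        by_class[classify(pair)].append(k_n)
    return {label: estimate(counts) for label, counts in by_class.items()}

def crude_estimate(counts):
    """Stand-in estimator (NOT maximum likelihood): pool the link types
    whose ratio k/n is above 0.5, and those at or below 0.5."""
    def pooled(selected):
        total_n = sum(n for _, n in selected)
        return sum(k for k, _ in selected) / total_n if total_n else 0.0
    high = [(k, n) for k, n in counts if k / n > 0.5]
    low = [(k, n) for k, n in counts if k / n <= 0.5]
    return pooled(high), pooled(low)

FUNCTION_WORDS = {"the", "of", "la", "de"}  # hypothetical look-up table

def classify(pair):
    u, v = pair
    both_function = u in FUNCTION_WORDS and v in FUNCTION_WORDS
    return "function" if both_function else "content"

# Toy link counts (hypothetical data).
link_counts = {
    ("house", "maison"): (9, 10),
    ("house", "voiture"): (0, 10),
    ("the", "la"): (6, 10),
    ("of", "la"): (1, 20),
}
params = estimate_per_class(link_counts, classify, crude_estimate)
print(params)  # {'content': (0.9, 0.0), 'function': (0.6, 0.05)}
```

Swapping in a different `classify` function, for example one based on part of speech or relative word position, would leave the rest of the sketch unchanged.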
5 Evaluation 
A word-to-word model of translational equivalence 
can be evaluated either over types or over tokens. 
It is impossible to replicate the experiments used to 
evaluate other translation models in the literature, 
because neither the models nor the programs that 
induce them are generally available. For each kind 
of evaluation, we have found one case where we can 
come close. 
We induced a two-class word-to-word model of 
translational equivalence from 13 million words of 
the Canadian Hansards, aligned using the method 
in (Gale & Church, 1991). One class repre- 
sented content-word links and the other represented 
function-word links⁴. Link types with negative
log-likelihood were discarded after each iteration. 
Both classes' parameters converged after six it- 
erations. The value of class-based models was 
demonstrated by the differences between the hidden
parameters for the two classes. ($\lambda^+$, $\lambda^-$) converged
at (.78, .00016) for content-class links and at
(.43, .000094) for function-class links.
5.1 Link Types 
The most direct way to evaluate the link types in 
a word-level model of translational equivalence is to 
treat each link type as a candidate translation lexi- 
con entry, and to measure precision and recall. This 
evaluation criterion carries much practical import, 
because many of the applications mentioned in Sec- 
tion 1 depend on accurate broad-coverage transla- 
tion lexicons. Machine readable bilingual dictionar- 
ies, even when they are available, have only limited 
coverage and rarely include domain-specific terms 
(Resnik & Melamed, 1997). 
We define the recall of a word-to-word translation 
model as the fraction of the bitext vocabulary repre- 
sented in the model. Translation model precision is 
a more thorny issue, because people disagree about 
the degree to which context should play a role in 
judgements of translational equivalence. We hand- 
evaluated the precision of the link types in our model 
in the context of the bitext from which the model 
⁴ Since function words can be identified by table look-up,
no POS-tagger was involved.
was induced, using a simple bilingual concordancer. 
A link type (u, v) was considered correct if u and v 
ever co-occurred as direct translations of each other. 
Where the one-to-one assumption failed, but a link 
type captured part of a correct translation, it was 
judged "incomplete." Whether incomplete links are 
correct or incorrect depends on the application. 
[Figure 5 plot: % precision (vertical axis, 84 to 100) against % recall (horizontal axis: 36, 46, 90). The lower curve is labeled "incomplete = incorrect".]
Figure 5: Link type precision with 95% confidence
intervals at varying levels of recall.
We evaluated five random samples of 100 link 
types each at three levels of recall. For our bitext, 
recall of 36%, 46% and 90% corresponded to trans- 
lation lexicons containing 32274, 43075 and 88633 
words, respectively. Figure 5 shows the precision of 
the model with 95% confidence intervals. The upper 
curve represents precision when incomplete links are 
considered correct, and the lower when they are con- 
sidered incorrect. On the former metric, our model 
can generate translation lexicons with precision and 
recall both exceeding 90%, as well as dictionary- 
sized translation lexicons that are over 99% correct. 
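The relationship between the likelihood cutoff and recall can be sketched as follows. The sketch assumes that recall is measured over the combined vocabulary of both halves of the bitext, which is one reading of the definition given above; the function names and the toy scores are hypothetical.

```python
def lexicon_at_threshold(likelihood, threshold):
    """Keep the link types whose likelihood ratio reaches the threshold."""
    return {pair for pair, score in likelihood.items() if score >= threshold}

def recall(lexicon, source_vocab, target_vocab):
    """Fraction of the bitext vocabulary (both halves) that appears
    in at least one lexicon entry."""
    covered_source = {u for u, _ in lexicon} & source_vocab
    covered_target = {v for _, v in lexicon} & target_vocab
    covered = len(covered_source) + len(covered_target)
    return covered / (len(source_vocab) + len(target_vocab))

# Toy likelihood ratios and vocabularies (hypothetical data).
likelihood = {
    ("house", "maison"): 120.0,
    ("car", "voiture"): 15.0,
    ("the", "la"): 2.5,
}
source_vocab = {"house", "car", "the", "of"}
target_vocab = {"maison", "voiture", "la", "de"}

strict = lexicon_at_threshold(likelihood, threshold=10.0)
lenient = lexicon_at_threshold(likelihood, threshold=1.0)
print(recall(strict, source_vocab, target_vocab))   # 0.5
print(recall(lenient, source_vocab, target_vocab))  # 0.75
```

Lowering the threshold admits more entries and raises recall; whether precision falls depends on how reliable the newly admitted entries are, which is what the hand evaluation above measures.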
Though some have tried, it is not clear how to 
extract such accurate lexicons from other published 
translation models. Part of the difficulty stems from 
the implicit assumption in other models that each 
word has only one sense. Each word is assigned the 
same unit of probability mass, which the model dis- 
tributes over all candidate translations. The correct 
translations of a word that has several correct trans- 
lations will be assigned a lower probability than the 
correct translation of a word that has only one cor- 
rect translation. This imbalance foils thresholding 
strategies, clever as they might be (Gale & Church, 
1991; Wu & Xia, 1994; Chen, 1996). The likelihoods
in the word-to-word model remain unnormalized, so 
they do not compete. 
The word-to-word model maintains high preci- 
sion even given much less training data. Resnik 
& Melamed (1997) report that the model produced 
translation lexicons with 94% precision and 30% re- 
call, when trained on French/English software man- 
uals totaling about 400,000 words. The model 
was also used to induce a translation lexicon from 
a 6200-word corpus of French/English weather re- 
ports. Nasr (1997) reported that the translation 
lexicon that our model induced from this tiny bitext 
accounted for 30% of the word types with precision 
between 84% and 90%. Recall drops when there is 
less training data, because the model refuses to make
predictions that it cannot make with confidence. For 
many applications, this is the desired behavior. 
5.2 Link Tokens 
type of error     errors made by    errors made by
                  IBM Model 2       our model
wrong link        32                7
missing link      12                36
partial link      7                 10
class conflict    0                 5
tokenization      3                 2
paraphrase        39                36
TOTAL             93                96
Table 1: Erroneous link tokens generated by two
translation models.
The most detailed evaluation of link tokens to 
date was performed by (Macklovitch & Hannan, 
1996), who trained Brown et al.'s Model 2 on 74 
million words of the Canadian Hansards. These au- 
thors kindly provided us with the links generated 
by that model in 51 aligned sentences from a held- 
out test set. We generated links in the same 51 
sentences using our two-class word-to-word model, 
and manually evaluated the content-word links from 
both models. The IBM models are directional; i.e. 
they posit the English words that gave rise to each 
French word, but ignore the distribution of the En- 
glish words. Therefore, we ignored English words 
that were linked to nothing. 
The errors are classified in Table 1. The "wrong 
link" and "missing link" error categories should be 
self-explanatory. "Partial links" are those where one 
French word resulted from multiple English words, 
but the model only links the French word to one of 
its English sources. "Class conflict" errors resulted 
from our model's refusal to link content words with 
function words. Usually, this is the desired behavior, 
but words like English auxiliary verbs are sometimes 
used as content words, giving rise to content words 
in French. Such errors could be overcome by a model 
that classifies each word token, for example using a 
part-of-speech tagger, instead of assigning the same 
class to all tokens of a given type. The bitext preprocessor
for our word-to-word model split hyphenated
words, but Macklovitch & Hannan's preprocessor
did not. In some cases, hyphenated words were
easier to link correctly; in other cases they were more 
difficult. Both models made some errors because of 
this tokenization problem, albeit in different places. 
The "paraphrase" category covers all link errors that 
resulted from paraphrases in the translation. Nei- 
ther IBM's Model 2 nor our model is capable of link- 
ing multi-word sequences to multi-word sequences, 
and this was the biggest source of error for both 
models. 
The test sample contained only about 400 content 
words⁵, and the links for both models were evaluated
post-hoc by only one evaluator. Nevertheless, it ap- 
pears that our word-to-word model with only two 
link classes does not perform any worse than IBM's 
Model 2, even though the word-to-word model was 
trained on less than one fifth the amount of data that 
was used to train the IBM model. Since it doesn't 
store indirect associations, our word-to-word model 
contained an average of 4.5 French words for every 
English word. Such a compact model requires rel- 
atively little computational effort to induce and to 
apply. 
[Figure 6 diagram: the French words "des vents déchaînés et une mer démontée" linked to the English words "screaming winds and dangerous sea conditions".]
Figure 6: An example of the different sorts of errors
made by the word-to-word model and the IBM
Model 2. Solid lines are links made by both models;
dashed lines are links made by the IBM model
only. Only content-class links are shown. Neither
model makes the correct links (déchaînés, screaming)
and (démontée, dangerous).
⁵ The exact number depends on the tokenization
method.
In addition to the quantitative differences between 
the word-to-word model and the IBM model, there 
is an important qualitative difference, illustrated in 
Figure 6. As shown in Table 1, the most common 
kind of error for the word-to-word model was a miss- 
ing link, whereas the most common error for IBM's 
Model 2 was a wrong link. Missing links are more in- 
formative: they indicate where the model has failed. 
The level at which the model trusts its own judge- 
ment can be varied directly by changing the likeli- 
hood cutoff in Step 1 of the competitive linking algo- 
rithm. Each application of the word-to-word model 
can choose its own balance between link token pre- 
cision and recall. An application that calls on the 
word-to-word model to link words in a bitext could 
treat unlinked words differently from linked words, 
and avoid basing subsequent decisions on uncertain 
inputs. It is not clear how the precision/recall trade- 
off can be controlled in the IBM models. 
One advantage that Brown et al.'s Model 1 has
over our word-to-word model is that their objec- 
tive function has no local maxima. By using the 
EM algorithm (Dempster et al., 1977), they can 
guarantee convergence towards the globally opti- 
mum parameter set. In contrast, the dynamic na- 
ture of the competitive linking algorithm changes 
Pr(data|model) in a non-monotonic fashion. We
have adopted the simple heuristic that the model 
"has converged" when this probability stops increas- 
ing. 
6 Conclusion 
Many multilingual NLP applications need to trans- 
late words between different languages, but cannot 
afford the computational expense of modeling the 
full range of translation phenomena. For these ap- 
plications, we have designed a fast algorithm for esti- 
mating word-to-word models of translational equiv- 
alence. The estimation method uses a pair of hid- 
den parameters to measure the model's uncertainty, 
and avoids making decisions that it's not likely to 
make correctly. The hidden parameters can be con- 
ditioned on information extrinsic to the model, pro- 
viding an easy way to integrate pre-existing knowl- 
edge. 
So far we have only implemented a two-class 
model, to exploit the differences in translation con- 
sistency between content words and function words. 
This relatively simple two-class model linked word 
tokens in parallel texts as accurately as other trans- 
lation models in the literature, despite being trained 
on only one fifth as much data. Unlike other transla- 
tion models, the word-to-word model can automat- 
ically produce dictionary-sized translation lexicons, 
and it can do so with over 99% accuracy. 
Even better accuracy can be achieved with a more 
fine-grained link class structure. Promising features 
for classification include part of speech, frequency 
of co-occurrence, relative word position, and trans- 
lational entropy (Melamed, 1997). Another inter- 
esting extension is to broaden the definition of a 
"word" to include multi-word lexical units (Smadja, 
1992). If such units can be identified a priori, their 
translations can be estimated without modifying the 
word-to-word model. In this manner, the model can 
account for a wider range of translation phenomena. 
Acknowledgements 
The French/English software manuals were provided 
by Gary Adams of Sun MicroSystems Laboratories. 
The weather bitext was prepared at the University 
of Montreal, under the direction of Richard Kittredge.
Thanks to Alexis Nasr for hand-evaluating
the weather translation lexicon. Thanks also to Mike 
Collins, George Foster, Mitch Marcus, Lyle Ungar, 
and three anonymous reviewers for helpful comments.
This research was supported by an equipment
grant from Sun MicroSystems and by ARPA
Contract #N66001-94C-6043. 

References 
P. F. Brown, J. Cocke, S. Della Pietra, V. Della 
Pietra, F. Jelinek, R. Mercer, & P. Roossin, "A 
Statistical Approach to Language Translation," 
Proceedings of the 12th International Conference 
on Computational Linguistics, Budapest, Hun- 
gary, 1988. 
P. F. Brown, J. Cocke, S. Della Pietra, V. Della 
Pietra, F. Jelinek, R. Mercer, & P. Roossin, 
"A Statistical Approach to Machine Translation," 
Computational Linguistics 16(2), 1990. 
P. F. Brown, V. J. Della Pietra, S. A. Della Pietra 
& R. L. Mercer, "The Mathematics of Statisti- 
cal Machine Translation: Parameter Estimation," 
Computational Linguistics 19(2), 1993a.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, 
M. J. Goldsmith, J. Hajic, R. L. Mercer & S. Mohanty,
"But Dictionaries are Data Too," Proceedings
of the ARPA HLT Workshop, Princeton, NJ,
1993b.
R. Catizone, G. Russell & S. Warwick, "Deriving
Translation Data from Bilingual Texts," Proceed- 
ings of the First International Lexical Acquisition 
Workshop, Detroit, MI, 1993. 
S. Chen, Building Probabilistic Models for Natu- 
ral Language, Ph.D. Thesis, Harvard University, 
1996. 
K. W. Church & E. H. Hovy, "Good Applications for 
Crummy Machine Translation," Machine Transla- 
tion 8, 1993. 
I. Dagan, K. Church, & W. Gale, "Robust Word 
Alignment for Machine Aided Translation," Pro- 
ceedings of the Workshop on Very Large Corpora: 
Academic and Industrial Perspectives, Columbus, 
OH, 1993. 
A. P. Dempster, N. M. Laird & D. B. Rubin, "Maxi- 
mum likelihood from incomplete data via the EM 
algorithm," Journal of the Royal Statistical Society
39(B), 1977.
T. Dunning, "Accurate Methods for the Statistics 
of Surprise and Coincidence," Computational Lin- 
guistics 19(1), 1993. 
P. Fung, "Compiling Bilingual Lexicon Entries from 
a Non-Parallel English-Chinese Corpus," Proceed- 
ings of the Third Workshop on Very Large Cor- 
pora, Boston, MA, 1995a. 
P. Fung, "A Pattern Matching Method for Find- 
ing Noun and Proper Noun Translations from 
Noisy Parallel Corpora," Proceedings of the 33rd 
Annual Meeting of the Association for Computa- 
tional Linguistics, Boston, MA, 1995b. 
W. Gale & K. W. Church, "A Program for Aligning
Sentences in Bilingual Corpora," Proceedings
of the 29th Annual Meeting of the Association for 
Computational Linguistics, Berkeley, CA, 1991. 
W. Gale & K. W. Church, "Identifying Word Corre- 
spondences in Parallel Texts," Proceedings of the 
DARPA SNL Workshop, 1991. 
A. Kumano & H. Hirakawa, "Building an MT Dic- 
tionary from Parallel Texts Based on Linguistic 
and Statistical Information," Proceedings of the 
15th International Conference on Computational 
Linguistics, Kyoto, Japan, 1994. 
E. Macklovitch, "Using Bi-textual Alignment for
Translation Validation: The TransCheck Sys- 
tem," Proceedings of the 1st Conference of the As- 
sociation for Machine Translation in the Ameri- 
cas, Columbia, MD, 1994. 
E. Macklovitch & M.-L. Hannan, "Line 'Em Up: Ad- 
vances in Alignment Technology and their Impact 
on Translation Support Tools," 2nd Conference 
of the Association for Machine Translation in the 
Americas, Montreal, Canada, 1996. 
I. D. Melamed, "Automatic Evaluation and Uniform
Filter Cascades for Inducing N-best Translation 
Lexicons," Proceedings of the Third Workshop on 
Very Large Corpora, Boston, MA, 1995. 
I. D. Melamed, "A Geometric Approach to Mapping 
Bitext Correspondence," Proceedings of the First 
Conference on Empirical Methods in Natural Lan- 
guage Processing, Philadelphia, PA, 1996a. 
I. D. Melamed, "Automatic Detection of Omissions
in Translations," Proceedings of the 16th Interna- 
tional Conference on Computational Linguistics, 
Copenhagen, Denmark, 1996b. 
I. D. Melamed, "Automatic Construction of Clean
Broad-Coverage Translation Lexicons," 2nd Con- 
ference of the Association for Machine Transla- 
tion in the Americas, Montreal, Canada, 1996c. 
I. D. Melamed, "Measuring Semantic Entropy," Pro- 
ceedings of the SIGLEX Workshop on Tagging 
Text with Lexical Semantics, Washington, DC, 
1997. 
I. D. Melamed, "A Portable Algorithm for Mapping 
Bitext Correspondence," Proceedings of the 35th 
Conference of the Association for Computational 
Linguistics, Madrid, Spain, 1997. (in this volume) 
A. Melby, "A Bilingual Concordance System and its 
Use in Linguistic Studies," Proceedings of the En- 
glish LACUS Forum, Columbia, SC, 1981. 
A. Nasr, personal communication, 1997. 
P. Resnik & I. D. Melamed, "Semi-Automatic Acqui- 
sition of Domain-Specific Translation Lexicons," 
Proceedings of the 7th ACL Conference on Ap- 
plied Natural Language Processing, Washington, 
DC, 1997. 
D. W. Oard & B. J. Dorr, "A Survey of Multilingual
Text Retrieval," UMIACS TR-96-19, University of
Maryland, College Park, MD, 1996. 
F. Smadja, "How to Compile a Bilingual Collo- 
cational Lexicon Automatically," Proceedings of 
the AAAI Workshop on Statistically-Based NLP 
Techniques, 1992. 
D. Wu & X. Xia, "Learning an English-Chinese 
Lexicon from a Parallel Corpus," Proceedings of 
the First Conference of the Association for Ma- 
chine Translation in the Americas, Columbia, 
MD, 1994. 
