Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 875–882, Vancouver, October 2005. ©2005 Association for Computational Linguistics
Inducing a multilingual dictionary
from a parallel multitext in related languages
Dmitriy Genzel
Department of Computer Science
Box 1910
Brown University
Providence, RI 02912, USA
dg@cs.brown.edu
Abstract
Dictionaries and word translation models
are used by a variety of systems, espe-
cially in machine translation. We build
a multilingual dictionary induction system
for a family of related resource-poor lan-
guages. We assume only the presence
of a single medium-length multitext (the
Bible). The techniques rely upon lexical
and syntactic similarity of languages as
well as on the fact that building dictionar-
ies for several pairs of languages provides
information about other pairs.
1 Introduction and Motivation
Modern statistical natural language processing tech-
niques require large amounts of human-annotated
data to work well. For practical reasons, the required
amount of data exists only for a few languages of
major interest, either commercial or governmental.
As a result, many languages have very little com-
putational research done in them, especially outside
the borders of the countries in which these languages
are spoken. Some of these languages are, however,
major languages with hundreds of millions of speak-
ers. Of the top 10 most spoken languages (The World Factbook (2004), 2000 estimate), the Linguistic Data Consortium at the University of Pennsylvania, the premier U.S. provider of corpora, offers text corpora in only 7. Only a few of the other languages (French, Arabic, and Czech) have resources provided by LDC. Many Asian and Eastern European
languages number tens of millions of speakers, yet
very few of these seem to have any related compu-
tational linguistics work, at least as presented at the
international conferences, such as the ACL.1
The situation is not surprising, nor is it likely to
significantly change in the future. Luckily, most
of these less-represented languages belong to lan-
guage families with several prominent members. As
a result, some of these languages have siblings with
more resources and published research.2 Inter-
estingly, the better-endowed siblings are not always
the ones with more native speakers, since political
considerations are often more important.3 If one were able to use the resources available in one language (henceforth referred to as the source) to facilitate the creation of tools and resources in another, related language (the target), this problem would be alleviated.
This is the ultimate goal of this project, but in the
first stage we focus on multi-language dictionary in-
duction.
Building a high-quality dictionary, or even bet-
ter, a joint word distribution model over all the lan-
guages in a given family is very important, because
using such a model one can use a variety of tech-
niques to project information across languages, e.g.
to parse or to translate. Building a unified model for
more than a pair of languages improves the quality
over building several unrelated pairwise models, be-
cause relating them to each other provides additional
information. If we know that word a in language A
has as its likely translation word b in language B,
and b is translated as c in C, then we also know that
a is likely to be translated as c, without looking at
1A search through the ACL Anthology for, e.g., Telugu (∼70 million speakers) shows only casual mentions of the language.
2Telugu’s fellow Dravidian language Tamil (∼65 million speakers) has seen some papers at the ACL.
3This is the case with Tamil vs. Telugu.
the A to C model.
2 Previous Work
There has been a lot of work done on building dic-
tionaries, by using a variety of techniques. One
good overview is Melamed (2000). There is work
on lexicon induction using string distance or other
phonetic/orthographic comparison techniques, such
as Mann and Yarowsky (2001) or semantic com-
parison using resources such as WordNet (Kondrak,
2001). Such work, however, primarily focuses on
finding cognates, whereas we are interested in trans-
lations of all words. Moreover, while some tech-
niques (e.g., Mann and Yarowsky (2001)) use mul-
tiple languages, the languages used have resources
such as dictionaries between some language pairs.
We do not require any dictionaries for any language
pair.
An important element of our work is focusing on
more than a pair of languages. There is an active
research area focusing on multi-source translation
(e.g., Och and Ney (2001)). Our setting is the re-
verse: we do not use multiple dictionaries in order
to translate, but translate (in a very crude way) in
order to build multiple dictionaries.
Many machine translation techniques require dic-
tionary building as a step of the process, and there-
fore have also attacked this problem. They use a va-
riety of approaches (a good overview is Koehn and
Knight (2001)), many of which require advanced
tools for both languages which we are not able to
use. They also use bilingual (and to some extent
monolingual) corpora, which we do have available.
They do not, however, focus on related languages and tend to ignore lexical similarity,4 nor are they able to work on more than a pair of languages at a time.
It is also worth noting that there has been some
MT work on related languages which explores lan-
guage similarity in an opposite way: by using dic-
tionaries and tools for both languages, and assum-
ing that a near word-for-word approach is reasonable
(Hajic et al., 2000).
4Much of recent MT research focuses on pairs of languages
which are not related, such as English-Chinese, English-Arabic,
etc.
3 Description of the Problem
Let us assume that we have a group of related lan-
guages, L1 ...Ln, and a parallel sentence-aligned
multitext C, with corresponding portions in each
language denoted as C1 ...Cn. Such a multitext ex-
ists for virtually all the languages in the form of the
Bible. Our goal is to create a multilingual dictionary
by learning the joint distribution $P(x_1, \ldots, x_n)$, $x_i \in L_i$,
which is simply the expected frequency of the n-
tuple of words in a completely word-aligned mul-
titext. We will approach the problem by learning
pairwise language models, although leaving some
parameters free, and then combine the models and
learn the remaining free parameters to produce the
joint model.
Let us, therefore, assume that we have a set of models $\{P(x, y \mid \theta_{ij})\}_{i \neq j}$, $x \in L_i$, $y \in L_j$, where $\theta_{ij}$ is the parameter vector of the pairwise model for languages $L_i$ and $L_j$. We would like to learn how to combine
these models in an optimal way. To solve this prob-
lem, let us first consider a simpler and more general
setting.
3.1 Combining Models of Hidden Data
Let X be a random variable with distribution
Ptrue(x), such that no direct observations of it exist.
However, we may have some indirect observations
of X and have built several models of X’s distri-
bution, $\{P_i(x \mid \theta_i)\}_{i=1}^{n}$, each parameterized by some
parameter vector θi. Pi also depends on some other
parameters that are fixed. It is important to note that
the space of models obtained by varying θi is only a
small subspace of the probability space. Our goal is
to find a good estimate of Ptrue(x).
The main idea is that if some Pi and Pj are close
(by some measure) to Ptrue, they have to be close
to each other as well. We will therefore make the
assumption that if some models of X are close to
each other (and we have reason to believe they are
fair approximations of the true distribution) they are
also close to the true distribution. Moreover, we
would like to set the parameters θi in such a way
that $P_i(x \mid \theta_i)$ is as close to the other models as pos-
sible. This leads us to look for an estimate that is
as close to all of our models as possible, under the
optimal values of the $\theta_i$'s, or more formally:
$$P_{est} = \operatorname*{argmin}_{\hat{P}(\cdot)} \; \min_{\theta_1} \cdots \min_{\theta_n} \; d\big(\hat{P}(\cdot), P_1(\cdot \mid \theta_1), \ldots, P_n(\cdot \mid \theta_n)\big)$$
where $d$ measures the distance between $\hat{P}$ and all the $P_i$ under the parameter settings $\theta_i$. Since we have no reason to prefer any of the $P_i$, we choose the following symmetric form for $d$:
$$d = \sum_{i=1}^{n} D\big(\hat{P}(\cdot) \,\|\, P_i(\cdot \mid \theta_i)\big)$$
where $D$ is a reasonable measure of distance between probability distributions. The most appropriate and the most commonly used measure in such cases is the Kullback-Leibler divergence, also known as relative entropy:
$$D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
It turns out that it is possible to find the optimal $\hat{P}$ under these circumstances. Taking a partial derivative and solving, we obtain:
$$\hat{P}(x) = \frac{\prod_{i=1}^{n} P_i(x \mid \theta_i)^{1/n}}{\sum_{x' \in X} \prod_{i=1}^{n} P_i(x' \mid \theta_i)^{1/n}}$$
Substituting this value into the expression for the function $d$, we obtain the following distance measure between the $P_i$'s:
$$d'(P_1(X \mid \theta_1), \ldots, P_n(X \mid \theta_n)) = \min_{\hat{P}} d\big(\hat{P}, P_1(X \mid \theta_1), \ldots, P_n(X \mid \theta_n)\big) = -\log \sum_{x \in X} \prod_{i=1}^{n} P_i(x \mid \theta_i)^{1/n}$$
This function is a generalization of the well-known Bhattacharyya distance for two distributions (Bhattacharyya, 1943):
$$b(p, q) = \sum_i \sqrt{p_i q_i}$$
These results suggest the following Algorithm 1 to optimize $d$ (and $d'$); a code sketch follows the list:
• Set all $\theta_i$ randomly
• Repeat until the change in $d$ is very small:
  – Compute $\hat{P}$ according to the above formula
  – For $i$ from 1 to $n$:
    ∗ Set $\theta_i$ in such a way as to minimize $D(\hat{P}(X) \,\|\, P_i(X \mid \theta_i))$
  – Compute $d$ according to the above formula
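The following Python sketch illustrates this alternating scheme under some simplifying assumptions: each component model is an object exposing prob(x), init_params(rng), and fit_to_corpus(p_hat) methods (a hypothetical interface, not part of the original system), and fit_to_corpus is assumed to internally minimize $D(\hat{P} \,\|\, P_i)$, e.g. by EM.

```python
import math
import random

def combine(models, support):
    """Closed-form optimum: P_hat(x) is the normalized geometric mean of the P_i(x)."""
    n = len(models)
    weights = {x: math.prod(m.prob(x) for m in models) ** (1.0 / n) for x in support}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

def distance(p_hat, models):
    """d = sum_i KL(P_hat || P_i)."""
    return sum(p * math.log(p / max(m.prob(x), 1e-300))
               for m in models
               for x, p in p_hat.items() if p > 0)

def algorithm1(models, support, tol=1e-6, max_iter=100):
    """Alternate between recomputing P_hat and refitting each model's free parameters."""
    rng = random.Random(0)
    for m in models:
        m.init_params(rng)                 # set all theta_i randomly
    d_prev = float("inf")
    p_hat = None
    for _ in range(max_iter):
        p_hat = combine(models, support)   # optimal P_hat for the current theta's
        for m in models:
            m.fit_to_corpus(p_hat)         # minimize D(P_hat || P_i), e.g. by EM
        d = distance(p_hat, models)
        if d_prev - d < tol:               # stop when the change in d is very small
            break
        d_prev = d
    return p_hat
```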
Each step of the algorithm minimizes $d$. It is also easy to see that minimizing $D(\hat{P}(X) \,\|\, P_i(X \mid \theta_i))$ is the same as setting the parameters $\theta_i$ in order to maximize $\prod_{x \in X} P_i(x \mid \theta_i)^{\hat{P}(x)}$, which can be interpreted as maximizing the probability under $P_i$ of a corpus in which word $x$ appears $\hat{P}(x)$ times. In other words, we are now optimizing $P_i(X)$ given an observed corpus of $X$, which is a much easier problem. In many types of models for $P_i$ the Expectation-Maximization algorithm is able to solve this problem.
3.2 Combining Pairwise Models
Following the methods outlined in the previous section, we can find an optimal joint probability $P(x_1, \ldots, x_n)$, $x_i \in L_i$, if we are given several models $P_j(x_1, \ldots, x_n \mid \theta_j)$. Instead, we have a number of pairwise models. Depending on which independence assumptions we make, we can define a joint distribution over all the languages in various ways. For example, for three languages, A, B, and C, we can use the following set of models:
$$P_1(A,B,C) = P(A \mid B) P(B \mid C) P(C)$$
$$P_2(A,B,C) = P(C \mid A) P(A \mid B) P(B)$$
$$P_3(A,B,C) = P(B \mid C) P(C \mid A) P(A)$$
and
$$\begin{aligned}
d'(\hat{P}, P_1, P_2, P_3) &= D(\hat{P} \,\|\, P_1) + D(\hat{P} \,\|\, P_2) + D(\hat{P} \,\|\, P_3) \\
&= 2H(\hat{P}(A,C), P(A,C)) + 2H(\hat{P}(A,B), P(A,B)) \\
&\quad + 2H(\hat{P}(B,C), P(B,C)) - 3H(\hat{P}) \\
&\quad - H(\hat{P}(A), P(A)) - H(\hat{P}(B), P(B)) - H(\hat{P}(C), P(C))
\end{aligned}$$
where $H(\cdot)$ is entropy, $H(\cdot,\cdot)$ is cross-entropy, and $\hat{P}(A,B)$ means $\hat{P}$ marginalized to the variables $A, B$. The last three cross-entropy terms involve monolingual models which are not parameterized. The entropy term does not involve any of the pairwise distributions. Therefore, if $\hat{P}$ is fixed, to minimize $d'$ we need to minimize each of the bilingual cross-entropy terms.
This means we can apply the algorithm from the previous section with a small modification (Algorithm 2; the new marginalization step is sketched in code after the list):
• Set all $\theta_{ij}$ (for each language pair $i, j$) randomly
• Repeat until the change in $d$ is very small:
  – Compute $P_i$ for $i = 1 \ldots k$, where $k$ is the number of joint models we have chosen
  – Compute $\hat{P}$ from $\{P_i\}$
  – For all $i, j$ such that $i \neq j$:
    ∗ Marginalize $\hat{P}$ to $(L_i, L_j)$
    ∗ Set $\theta_{ij}$ in such a way as to minimize $D(\hat{P}(L_i, L_j) \,\|\, P(L_i, L_j \mid \theta_{ij}))$
  – Compute $d$ according to the above formula
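The only genuinely new step relative to Algorithm 1 is marginalizing the joint estimate down to each language pair before refitting the corresponding pairwise model. A minimal sketch, assuming $\hat{P}$ is stored as a dictionary mapping word $n$-tuples to probabilities (a representation chosen here for illustration):

```python
from collections import defaultdict

def marginalize_pair(p_hat, i, j):
    """Marginalize a joint distribution over word n-tuples down to languages (L_i, L_j)."""
    pair = defaultdict(float)
    for tup, p in p_hat.items():
        pair[(tup[i], tup[j])] += p   # sum out all languages other than i and j
    return dict(pair)
```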
Most of the θ parameters in our models can be
set by performing EM, and the rest are discrete with
only a few choices and can be maximized over by
trying all combinations of them.
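Since the discrete free parameters take only a few values each, trying all combinations amounts to a small exhaustive search; a sketch under that assumption (the parameter names are illustrative, not taken from the original system):

```python
from itertools import product

def best_discrete_setting(choices, evaluate_d):
    """Try every combination of the discrete free parameters and keep the
    one that yields the smallest distance d."""
    best_setting, best_d = None, float("inf")
    names = list(choices)
    for combo in product(*(choices[n] for n in names)):
        setting = dict(zip(names, combo))
        d = evaluate_d(setting)   # e.g. run Algorithm 2 under this discrete setting
        if d < best_d:
            best_setting, best_d = setting, d
    return best_setting

# Illustrative usage (hypothetical parameter names):
# best = best_discrete_setting({"prefix_n": [2, 3, 4], "suffix_threshold": [1.5, 2.0]},
#                              evaluate_d=run_algorithm2)
```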
4 Building Pairwise Models
We now know how to combine pairwise translation
models with some free parameters. Let us now dis-
cuss how such models might be built.
Our goal at this stage is to take a parallel bitext
in related languages A and B and produce a joint
probability model P(x,y), where x ∈ A,y ∈ B.
Equivalently, since the models PA(x) and PB(y)
are easily estimated by maximum likelihood tech-
niques from the bitext, we can estimate PA→B(y|x)
or PB→A(x|y). Without loss of generality, we will
build PA→B(y|x).
The model we are building will have a number of
free parameters. These parameters will be set by the
algorithm discussed above. In this section we will
assume that the parameters are fixed.
Our model is a mixture of several components,
each discussed in a separate section below:
$$\begin{aligned}
P_{A \to B}(y \mid x) ={}& \lambda_{fw}(x) P^{fw}_{A \to B}(y \mid x) \\
&+ \lambda_{bw}(x) P^{bw}_{A \to B}(y \mid x) \\
&+ \lambda_{char}(x) P^{char}_{A \to B}(y \mid x) \\
&+ \lambda_{pref}(x) P^{pref}_{A \to B}(y \mid x) \\
&+ \lambda_{suf}(x) P^{suf}_{A \to B}(y \mid x) \\
&+ \lambda_{cons}(x) P^{cons}_{A \to B}(y \mid x)
\end{aligned} \qquad (1)$$
where all λs sum up to one. The λs are free pa-
rameters, although to avoid over-training we tie the
λs for x’s with similar frequencies. These lambdas
form a part of the θij parameter mentioned previ-
ously, where Li = A and Lj = B.
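A sketch of how equation (1) might be evaluated in code, with the λ vectors tied across source words of similar corpus frequency; the frequency-binning function is an assumption made here for illustration:

```python
def mixture_prob(y, x, components, lambdas, freq_bin):
    """Equation (1): P(y|x) = sum_k lambda_k(x) * P_k(y|x), where the lambda
    vector is shared by all source words x falling in the same frequency bin."""
    lam = lambdas[freq_bin(x)]            # one lambda vector per frequency bin
    assert abs(sum(lam) - 1.0) < 1e-9     # all lambdas sum up to one
    return sum(l * p(y, x) for l, p in zip(lam, components))
```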
The components represent various constraints that
are likely to hold between related languages.
4.1 GIZA (forward)
This component is in fact the GIZA++ software, originally created by the Johns Hopkins University Summer Workshop in 1999 and improved by Och (2000). This
software can be used to create word alignments for
sentence-aligned parallel corpora as well as to in-
duce a probabilistic dictionary for this language pair.
The general approach taken by GIZA is as follows. Let $L_A$ and $L_B$ be the portions of the parallel text in languages A and B respectively, and $L_A = (x_i)_{i=1 \ldots n}$ and $L_B = (y_i)_{i=1 \ldots m}$. We can define $P(L_B \mid L_A)$ as
$$\max_{P_{A \to B}} \; \max_{P_{aligns}} \; \sum_{i=1}^{n} \sum_{j=1}^{m} P_{A \to B}(y_j \mid x_i) \, P_{aligns}(x_i \mid j)$$
The GIZA software does the maximization by
building a variety of models, mostly described by
Brown et al. (1993). GIZA can be tuned in various
ways, most importantly by choosing which models
to run and for how many iterations. We treat these
parameters as free, to be set along with the rest at a
later stage.
As a side effect of GIZA’s optimization, we obtain
the PA→B(y|x) that maximizes the above expres-
sion. It is quite reasonable to believe that a model
of this sort is also a good model for our purposes.
This model is what we refer to as $P^{fw}_{A \to B}(y \mid x)$ in the model overview.
GIZA’s approach is not, however, perfect. GIZA
builds several models, some quite complex, yet it
does not use all the information available to it, no-
tably the lexical similarity between the languages.
Furthermore, GIZA tries to map words (especially
rare ones) into other words if possible, even if the
sentence has no direct translation for the word in
question.
These problems are addressed by using other
models, described in the following sections.
4.2 GIZA (backward)
In the previous section we discussed using GIZA to
try to optimize P(LB|LA). It is, however, equally
reasonable to try to optimize P(LA|LB) instead. If
we do so, we can obtain $P^{fw}_{B \to A}(x \mid y)$ that produces maximal probability for $P(L_A \mid L_B)$. We, however, need a model of $P_{A \to B}(y \mid x)$. This is easily obtained by using Bayes' rule:
$$P^{bw}_{A \to B}(y \mid x) = \frac{P^{fw}_{B \to A}(x \mid y) \, P_B(y)}{P_A(x)}$$
which requires us to have $P_B(y)$ and $P_A(x)$. These models can be estimated directly from $L_B$ and $L_A$ by using maximum likelihood estimators:
$$P_A(x) = \frac{\sum_{i} \delta(x_i, x)}{n}$$
and
$$P_B(y) = \frac{\sum_{i} \delta(y_i, y)}{m}$$
where $\delta(x, y)$ is the Kronecker delta function, which is equal to 1 if its arguments are equal and to 0 otherwise.
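A sketch of this backward component, assuming the forward B→A model is available as a function and the monolingual unigram models are estimated by MLE as above:

```python
from collections import Counter

def unigram_mle(tokens):
    """Maximum-likelihood unigram model: P(w) = count(w) / corpus length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def backward_component(p_fw_b_to_a, tokens_a, tokens_b):
    """P_bw_{A->B}(y|x) = P_fw_{B->A}(x|y) * P_B(y) / P_A(x), via Bayes' rule."""
    p_a = unigram_mle(tokens_a)
    p_b = unigram_mle(tokens_b)
    def prob(y, x):
        return p_fw_b_to_a(x, y) * p_b.get(y, 0.0) / p_a[x]
    return prob
```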
4.3 Character-based model
This and the following models all rely on having a
model of PA→B(y|x) to start from. In practice it
means that this component is estimated following
the previous components and uses the models they
provide as a starting point.
The basic idea behind this model is that in related
languages words are also related. If we have a model
Pc of translating characters in language A into char-
acters in language B, we can define the model for
translating entire words.
Let word $x$ in language A consist of characters $x_1$ through $x_n$, and word $y$ in language B consist of characters $y_1$ through $y_m$.
Let us define the (unnormalized) character model:
$$P_{uchar}(y \mid x) = P_{charlen}(y \mid x, m) \, P_{length}(m \mid x)$$
i.e., estimating the length of $y$ first, and $y$ itself afterward. We make an independence assumption that the length of $y$ depends only on the length of $x$, and so we are able to estimate the second term above easily. The first term is harder to estimate.
First, let us consider the case where the lengths of $x$ and $y$ are the same ($m = n$). Then,
$$P_{charlen}(y \mid x, n) = \prod_{i=1}^{n} P_c(y_i \mid x_i)$$
Let $y^j$ denote word $y$ with its $j$-th character removed. Let us now consider the case when $m > n$. We define (recursively):
$$P_{charlen}(y \mid x, m) = \sum_{j=1}^{m} \frac{1}{m} P_{charlen}(y^j \mid x, m-1)$$
Similarly, if $n > m$, with $x^i$ denoting $x$ with its $i$-th character removed:
$$P_{charlen}(y \mid x, m) = \sum_{i=1}^{n} \frac{1}{n} P_{charlen}(y \mid x^i, m)$$
It is easy to see that this is a valid probability model over all sequences of characters. However, $y$ is not a random sequence of characters, but a word in language B; moreover, it is a word that can serve as a potential translation of word $x$. So, to define a proper distribution over words $y$ given a word $x$ and a set of possible translations of $x$, $T(x)$, we set
$$P_{char}(y \mid x) = P_{uchar}(y \mid x, y \in T(x)) = \frac{\delta(y \in T(x)) \, P_{uchar}(y \mid x)}{\sum_{y' \in T(x)} P_{uchar}(y' \mid x)}$$
This is the complete definition of Pchar, except
for the fact that we are implicitly relying upon the
character-mapping model, Pc, which we need to
somehow obtain. To obtain it, we rely upon GIZA
again. As we have seen, GIZA can find a good word-
mapping model if it has a bitext to work from. If we
have a $P_{A \to B}$ word-mapping model of some sort, it is equivalent to having a parallel bitext with the words $y$ and $x$ treated as sequences of characters instead of indivisible tokens. Each $(x, y)$ word pair would occur $P_{A \to B}(x, y)$ times in this corpus. GIZA would then provide us with the $P_c$ model we need, by optimizing the probability of the language B part of the model given the language A part.
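A sketch of the character component, assuming $P_c$ is available as a dictionary of character-translation probabilities (e.g. as produced by running GIZA over the character-level pseudo-corpus described above) and p_length is some length model; both are stand-ins for the models estimated in practice:

```python
from functools import lru_cache

def p_charlen(y, x, p_c):
    """Recursive character model: equal lengths multiply per-character
    probabilities; otherwise average over deleting one character from the
    longer word, following the recursive definition above."""
    @lru_cache(maxsize=None)
    def rec(y, x):
        if len(y) == len(x):
            prob = 1.0
            for yc, xc in zip(y, x):
                prob *= p_c.get((yc, xc), 1e-9)   # small floor for unseen character pairs
            return prob
        if len(y) > len(x):
            return sum(rec(y[:j] + y[j + 1:], x) for j in range(len(y))) / len(y)
        return sum(rec(y, x[:i] + x[i + 1:]) for i in range(len(x))) / len(x)
    return rec(y, x)

def p_char(y, x, candidates, p_c, p_length):
    """Normalize the unnormalized model over the candidate translation set T(x)."""
    if y not in candidates:
        return 0.0
    def uchar(w):
        return p_charlen(w, x, p_c) * p_length(len(w), x)
    denom = sum(uchar(t) for t in candidates)
    return uchar(y) / denom if denom > 0 else 0.0
```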
4.4 Prefix Model
This model and the two models that follow are built
on the same principle. Let there be a function f :
A → CA and a function g : B → CB. These func-
tions group words in A and B into some finite set of
classes. If we have some PA→B(y|x) to start with,
we can define
$$P^{fg}_{A \to B}(y \mid x) = P(y \mid g(y)) \, P(g(y) \mid f(x)) \, P(f(x) \mid x) = \frac{P(y) \sum_{x': f(x') = f(x)} \; \sum_{y': g(y') = g(y)} P(x', y')}{\Big(\sum_{x': f(x') = f(x)} P(x')\Big) \Big(\sum_{y': g(y') = g(y)} P(y')\Big)}$$
For the prefix model, we rely upon the following
idea: words that have a common prefix often tend to
be related. Related words probably should translate
as related words in the other language as well. In
other words, we are trying to capture word-level se-
mantic information. So we define the following set
of $f$ and $g$ functions:
$$f_n(x) = \mathrm{prefix}(x, n), \qquad g_m(y) = \mathrm{prefix}(y, m)$$
where $n$ and $m$ are free parameters, whose values we will determine later. We therefore define $P^{pref}_{A \to B}$ as $P^{fg}$ with $f$ and $g$ specified above.
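A sketch of the class-based construction $P^{fg}$ shared by the prefix and suffix components, starting from a joint model $P(x, y)$ (here a dictionary over word pairs) and monolingual models $P_A$, $P_B$; the concrete prefix length in the usage comment is illustrative, since $n$ and $m$ are free parameters:

```python
from collections import defaultdict

def class_model(p_joint, p_a, p_b, f, g):
    """P_fg(y|x) = P(y|g(y)) * P(g(y)|f(x)), with word classes given by f and g."""
    class_joint = defaultdict(float)   # total mass of P(x', y') per class pair
    class_a = defaultdict(float)       # total mass of P(x') per source class
    class_b = defaultdict(float)       # total mass of P(y') per target class
    for (x, y), p in p_joint.items():
        class_joint[(f(x), g(y))] += p
    for x, p in p_a.items():
        class_a[f(x)] += p
    for y, p in p_b.items():
        class_b[g(y)] += p

    def prob(y, x):
        denom = class_a[f(x)] * class_b[g(y)]
        return p_b[y] * class_joint[(f(x), g(y))] / denom if denom > 0 else 0.0
    return prob

# Illustrative usage with 3-character prefixes as the classes:
# p_pref = class_model(p_joint, p_a, p_b, f=lambda x: x[:3], g=lambda y: y[:3])
```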
4.5 Suffix Model
Similarly to the prefix model described above, it is also useful to have a suffix model. Words that have the same suffixes are likely to be in the same grammatical case or to share some morphological feature which may persist across languages. In either case, if a strong relationship exists between the resulting classes, it provides good evidence for giving higher likelihood to the words belonging to these classes. It is worth noting that this feature (unlike the previous one) is unlikely to be helpful in a setting where the languages are not related.
The functions f and g are defined based on a set of
suffixes SA and SB which are learned automatically.
f(x) is defined as the longest possible suffix of x
that is in the set SA, and g is defined similarly, for
SB.
The sets SA and SB are built as follows. We start
with all one-character suffixes. We then consider
two-letter suffixes. We add a suffix to the list if it
occurs much more often than can be expected based
on the frequency of its first letter in the penultimate
position, times the frequency of its second letter in
the last position. We then proceed in a similar way
for three-letter suffixes. The threshold value is a free
parameter of this model.
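A sketch of how the suffix sets might be grown, generalizing the two-letter criterion above to longer suffixes; the threshold is the free parameter mentioned in the text, and the independence-based expected count is an assumption of this sketch:

```python
from collections import Counter

def learn_suffixes(vocab, threshold, max_len=3):
    """Keep a k-letter suffix if it occurs `threshold` times more often than
    predicted from the independent frequencies of its letters in the
    corresponding word-final positions."""
    suffixes = {w[-1:] for w in vocab if w}                   # all one-character suffixes
    for k in range(2, max_len + 1):
        words = [w for w in vocab if len(w) >= k]
        total = len(words)
        if total == 0:
            break
        suffix_counts = Counter(w[-k:] for w in words)
        pos_counts = [Counter(w[-k:][i] for w in words) for i in range(k)]
        for suf, count in suffix_counts.items():
            expected = total
            for i, ch in enumerate(suf):
                expected *= pos_counts[i][ch] / total         # independence assumption
            if expected > 0 and count / expected > threshold:
                suffixes.add(suf)
    return suffixes
```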
4.6 Constituency Model
If we had information about constituent boundaries in either language, it would be useful to build a model favoring alignments that do not cross constituent boundaries. We do not have this information at this point. We can assume, however, that any sequence of three words is a constituent of sorts, and build a model based on that assumption.
As before, let $L_A = (x_i)_{i=1 \ldots n}$ and $L_B = (y_i)_{i=1 \ldots m}$. Let us define as $C_A(i)$ the triple of words $(x_{i-1}, x_i, x_{i+1})$ and as $C_B(j)$ the triple $(y_{j-1}, y_j, y_{j+1})$. If we have some model $P_{A \to B}$, we can define
$$P_{C_A \to C_B}(j \mid i) = \frac{1}{C} P_{A \to B}(y_{j-1} \mid x_{i-1}) \, P_{A \to B}(y_j \mid x_i) \, P_{A \to B}(y_{j+1} \mid x_{i+1})$$
where $C$ is the sum over $j$ of the above products and serves to normalize the distribution.
$$\begin{aligned}
P^{cons}_{A \to B}(y \mid x) &= \sum_{i=1}^{n} \sum_{j=1}^{m} P(y \mid C_B(j)) \, P_{C_A \to C_B}(j \mid i) \, P(C_A(i) \mid x) \\
&= \sum_{i: x_i = x} \sum_{j=1}^{m} P(y \mid C_B(j)) \, P_{C_A \to C_B}(j \mid i) \\
&= \frac{1}{\sum_{j} \delta(y_j, y)} \sum_{i: x_i = x} \; \sum_{j: y_j = y} P_{C_A \to C_B}(j \mid i)
\end{aligned}$$
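A sketch of this constituency component, treating every three-word window as a pseudo-constituent; la and lb stand for the token sequences $L_A$ and $L_B$, and p_ab is the current word-translation model, all named here for illustration:

```python
def p_cons(y, x, la, lb, p_ab):
    """Score target positions j whose 3-word window aligns well with the
    3-word windows around occurrences of x, normalized as in the text."""
    def window_score(i, j):
        score = 1.0
        for d in (-1, 0, 1):
            if 0 <= i + d < len(la) and 0 <= j + d < len(lb):
                score *= p_ab(lb[j + d], la[i + d])
        return score

    total = 0.0
    for i, xa in enumerate(la):
        if xa != x:
            continue
        norm = sum(window_score(i, j) for j in range(len(lb)))   # the constant C
        if norm == 0.0:
            continue
        total += sum(window_score(i, j) for j, yb in enumerate(lb) if yb == y) / norm
    count_y = sum(1 for w in lb if w == y)
    return total / count_y if count_y else 0.0
```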
5 Evaluation
The output of the system so far is a multilingual word translation model. We will evaluate it by producing a trilingual dictionary (Russian-Ukrainian-Belorussian) from the corresponding Bibles, picking the highest-probability translation for each word.
Unfortunately, we do not have a good hand-built tri-
lingual dictionary to compare it to, but only one
good bilingual one, Russian-Ukrainian5. We will
therefore take the Russian-Ukrainian portion of our
dictionary and compare it to the hand-built one.
Our evaluation metric is the number of entries that
match between these dictionaries. If a word has sev-
eral translations in the hand-built dictionary, match-
5The lack of such dictionaries is precisely why we do this work.
ing any of them counts as correct. It is worth not-
ing that for all the dictionaries we generate, the to-
tal number of entries is the same, since all the words
that occur in the source portion of the corpus have an
entry. In other words, precision and recall are pro-
portional to each other and to our evaluation metric.
Not all of the words that occur in our dictionary
occur in the hand-built dictionary and vice versa. An
absolute upper limit of performance, therefore, for
this evaluation measure is the number of left-hand-
side entries that occur in both dictionaries.
In fact, we cannot hope to achieve this number. First, the dictionary translation of the word in question might never occur in the corpus. Second, even if it does but never co-occurs in the same sentence as its translation, we will not have any basis to propose it as a translation.6 Therefore we use an “achievable upper limit”: the number of words that have their “correct” translation co-occur at least once. We will compare our performance to this upper limit.
Since there is no manual tuning involved, we do not have a development set and use the whole Bible for training (the dictionary is used as a test set, as described above).
We evaluate the performance of the model with
just the GIZA component as the baseline, and add
all the other components in turn. There are two pos-
sible models to evaluate at each step. The pairwise
model is the model given in equation 1 under the
parameter setting given by Algorithm 2, with Be-
lorussian used as a third language. The joint model
is the full model over these three languages as es-
timated by Algorithm 2. In either case we pick the highest-probability Ukrainian word as the translation of a given Russian word.
The results for the Russian-Ukrainian Bibles are pre-
sented in Table 1. The “oracle” setting is the set-
ting obtained by tuning on the test set (the dictio-
nary). We see that using a third language to tune
works just as well, obtaining the true global max-
imum for the model. Moreover, the joint model
(which is more flexible than the model in Equation
1) does even better. This was unexpected for us, be-
6Strictly speaking, we might be able to infer the word’s exis-
tence in some cases, by performing morphological analysis and
proposing a word we have not seen, but this seems too hard at
the moment.
Table 1: Evaluation for Russian-Ukrainian (with Be-
lorussian to tune)
Stage Pair Joint
Forward (baseline) 62.3% 71.7%
Forward+chars 77.1% 84.2%
Forward+chars+backward 81.3% 84.1%
Fw+chars+bw+prefix 83.5% 84.5%
Fw+chars+bw+prefix+suffix 84.5% 85%
Fw+chars+bw+pref+suf+const 84.5% 85.2%
“Oracle” setting for λ’s 84.6%
Table 2: Evaluation for Russian-Ukrainian (with Be-
lorussian and Polish)
Tuned by Pair Joint
Belorussian (prev. table) 84.5% 85.2%
Polish 84.6% 78.6%
Both 84.5% 85.2%
“Oracle” tuning 84.5%
cause the joint model relies on three pairwise mod-
els equally, and Russian-Belorussian and Ukrainian-
Belorussian models are bound to be less reliable for
Russian-Ukrainian evaluation. It appears, however, that our Belorussian Bible is translated directly from Russian rather than from the original languages, and parallels the Russian text more closely than could be expected.
To ensure our results are not affected by this fact, we also try Polish separately and in combination with Belorussian (i.e., a model over 4 languages), as shown in Table 2.
These results demonstrate that the joint model
is not as good for Polish, but it still finds the
optimal parameter setting. This leads us to propose the following extension: let us marginalize the joint Russian-Ukrainian-Belorussian model into just Russian-Ukrainian, and add this model as yet another component to Equation 1. Now we cannot use
Belorussian as a third language, but we can use Pol-
ish, which we know works just as well for tuning.
The resulting performance for the model is 85.7%,
our best result to date.
6 Discussion and Future Work
We have built a system for multilingual dictionary induction from parallel corpora which significantly improves quality over the standard existing tool
(GIZA) by taking advantage of the fact that lan-
guages are related and we have a group of more
than two of them. Because the system attempts to
be completely agnostic about the languages it works
on, it might be used successfully on many language
groups, requiring almost no linguistic knowledge on
the part of the user. Only the prefix and suffix com-
ponents are somewhat language-specific, but even
they are sufficiently general to work, with varying degrees of success, on most inflectional and agglutinative languages (which form a large majority of lan-
guages). For generality, we would also need a model
of infixes, for languages such as Hebrew or Arabic.
We must admit, however, that we have not tested
our approach on other language families yet. It is
our short term plan to test our model on several Ro-
mance languages, e.g. Spanish, Portuguese, French.
Looking at the first lines of Table 1, one can see that using more than a pair of languages with a model using only a small feature set can dramatically improve performance (compare the second and third columns), while still finding the optimal values for all internal parameters.
As discussed in the introduction, the ultimate goal
of this project is to produce tools, such as a parser,
for languages which lack them. Several approaches
are possible, all involving the use of the dictionary
we built. While working on this project, we would
no longer be treating all languages in the same way.
We would use the tools available for that language to
further improve the performance of pairwise mod-
els involving that language and, indirectly, even the
pairs not involving this language. Using these tools,
we may be able to improve the word translation
model even further, simply as a side effect.
Once we build a high-quality dictionary for a spe-
cial domain such as the Bible, it might be possible to
expand to a more general setting by mining the Web
for potential parallel texts.
Our technique is limited in the coverage of the
resulting dictionary which can only contain words
which occur in our corpus. Whatever the corpus
may be, however, it will include the most common
words in the target language. These are the words
that tend to vary the most between related (and even
unrelated) languages. The relatively rare words (e.g.
domain-specific and technical terms) can often be
translated simply by inferring morphological rules
transforming words of one language into another.
Thus, one may expand the dictionary coverage us-
ing non-parallel texts in both languages, or even in
just one language if its morphology is sufficiently
regular.
References
The Central Intelligence Agency. 2004. The world fact-
book.
A. Bhattacharyya. 1943. On a measure of divergence be-
tween two statistical populations defined by their prob-
ability distributions. Bull. Calcutta Math. Soc., 35:99–
109.
P.F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L.
Mercer. 1993. The mathematics of statistical machine
translation: Parameter estimation. Computational Lin-
guistics, 19(2):263–311.
J. Hajic, J. Hric, and V. Kubon. 2000. Machine translation of very close languages. In Proceedings of the 6th Applied Natural Language Processing Conference.
P. Koehn and K. Knight. 2001. Knowledge sources
for word-level translation models. In Proceedings of
the Conference on Empirical Methods in Natural Lan-
guage Processing.
G. Kondrak. 2001. Identifying cognates by phonetic
and semantic similarity. In Proceedings of the Second
Meeting of the North American Chapter of the Asso-
ciation for Computational Linguistics, Pittsburgh, PA,
pages 103–110.
G. Mann and D. Yarowsky. 2001. Multipath transla-
tion lexicon induction via bridge languages. In Pro-
ceedings of the Second Meeting of the North American
Chapter of the Association for Computational Linguis-
tics, Pittsburgh, PA, pages 151–158.
I. D. Melamed. 2000. Models of translational equiv-
alence among words. Computational Linguistics,
26:221–249, June.
F. J. Och and H. Ney. 2000. Improved statistical align-
ment models. In Proceedings of the 38th Annual Meet-
ing of the Association for Computational Linguistics,
pages 440–447, Hong Kong, China, October.
F. J. Och and H. Ney. 2001. Statistical multi-source
translation. In Proceedings of MT Summit VIII, pages
253–258.
