Refined Lexicon Models for Statistical Machine Translation using a
Maximum Entropy Approach
Ismael García Varea
Dpto. de Informática
Univ. de Castilla-La Mancha
Campus Universitario s/n
02071 Albacete, Spain
ivarea@info-ab.uclm.es
Franz J. Och and
Hermann Ney
Lehrstuhl für Inf. VI
RWTH Aachen
Ahornstr. 55
D-52056 Aachen, Germany
{och,ney}@cs.rwth-aachen.de
Francisco Casacuberta
Dpto. de Sist. Inf. y Comp.
Inst. Tecn. de Inf. (UPV)
Avda. de Los Naranjos, s/n
46071 Valencia, Spain
fcn@iti.upv.es
Abstract
Typically, the lexicon models used in
statistical machine translation systems
do not include any kind of linguistic
or contextual information, which often
leads to problems in performing a cor-
rect word sense disambiguation. One
way to deal with this problem within
the statistical framework is to use max-
imum entropy methods. In this paper,
we present how to use this type of in-
formation within a statistical machine
translation system. We show that it is
possible to significantly decrease train-
ing and test corpus perplexity of the
translation models. In addition, we per-
form a rescoring of a2 -Best lists us-
ing our maximum entropy model and
thereby yield an improvement in trans-
lation quality. Experimental results are
presented on the so-called “Verbmobil
Task”.
1 Introduction
Typically, the lexicon models used in statistical
machine translation systems are only single-word
based, that is, one word in the source language corresponds to only one word in the target language.
These lexicon models lack the context information that can be extracted from the same parallel corpus. This additional information could be:
- Simple context information: information of the words surrounding the word pair;
- Syntactic information: part-of-speech information, syntactic constituent, sentence mood;
- Semantic information: disambiguation information (e.g. from WordNet), current/previous speech or dialog act.
To include this additional information within the
statistical framework we use the maximum en-
tropy approach. This approach has been applied
in natural language processing to a variety of
tasks. (Berger et al., 1996) applies this approach
to the so-called IBM Candide system to build con-
text dependent models, compute automatic sen-
tence splitting and to improve word reordering in
translation. Similar techniques are used in (Pap-
ineni et al., 1996; Papineni et al., 1998) for so-
called direct translation models instead of those
proposed in (Brown et al., 1993). (Foster, 2000)
describes two methods for incorporating informa-
tion about the relative position of bilingual word
pairs into a maximum entropy translation model.
Other authors have applied this approach to lan-
guage modeling (Rosenfeld, 1996; Martin et al.,
1999; Peters and Klakow, 1999). A short review
of the maximum entropy approach is outlined in
Section 3.
2 Statistical Machine Translation
The goal of the translation process in statisti-
cal machine translation can be formulated as fol-
lows: A source language string $f_1^J = f_1 \ldots f_J$ is to be translated into a target language string $e_1^I = e_1 \ldots e_I$. In the experiments reported in this paper, the source language is German and the target language is English. Every target string is considered as a possible translation for the input. If we assign a probability $Pr(e_1^I \mid f_1^J)$ to each pair of strings $(e_1^I, f_1^J)$, then according to Bayes' decision rule, we have to choose the target string that maximizes the product of the target language model $Pr(e_1^I)$ and the string translation model $Pr(f_1^J \mid e_1^I)$.
Many existing systems for statistical machine
translation (Berger et al., 1994; Wang and Waibel,
1997; Tillmann et al., 1997; Nießen et al., 1998)
make use of a special way of structuring the string translation model as proposed by Brown et al. (1993): The correspondence between the words in the source and the target string is described by alignments that assign one target word position to each source word position. The lexicon probability $p(f \mid e)$ of a certain target word $e$ to occur in the target string is assumed to depend basically only on the source word $f$ aligned to it.
These alignment models are similar to the concept of Hidden Markov models (HMM) in speech recognition. The alignment mapping is $j \to i = a_j$ from source position $j$ to target position $i = a_j$. The alignment $a_1^J$ may contain alignments $a_j = 0$ with the 'empty' word $e_0$ to account for source words that are not aligned to any target word. In (statistical) alignment models $Pr(f_1^J, a_1^J \mid e_1^I)$, the alignment $a_1^J$ is introduced as a hidden variable.
Typically, the search is performed using the so-
called maximum approximation:
$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \Big\{ Pr(e_1^I) \cdot \sum_{a_1^J} Pr(f_1^J, a_1^J \mid e_1^I) \Big\} \approx \operatorname*{argmax}_{e_1^I} \Big\{ Pr(e_1^I) \cdot \max_{a_1^J} Pr(f_1^J, a_1^J \mid e_1^I) \Big\}$$

The search space consists of the set of all possible target language strings $e_1^I$ and all possible alignments $a_1^J$.
The overall architecture of the statistical trans-
lation approach is depicted in Figure 1.
3 Maximum entropy modeling
The translation probability $Pr(f_1^J, a_1^J \mid e_1^I)$ can be rewritten as follows:

$$Pr(f_1^J, a_1^J \mid e_1^I) = \prod_{j=1}^{J} Pr(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I)$$
$$= \prod_{j=1}^{J} \big[ Pr(a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I) \cdot Pr(f_j \mid f_1^{j-1}, a_1^{j}, e_1^I) \big]$$
[Figure 1 shows the overall architecture: the source language text is transformed, a global search maximizes $Pr(e_1^I) \cdot Pr(f_1^J \mid e_1^I)$ over $e_1^I$ using the lexicon model, the alignment model and the language model, and the result is transformed into the target language text.]

Figure 1: Architecture of the translation approach based on Bayes' decision rule.
Typically, the probability $Pr(f_j \mid f_1^{j-1}, a_1^{j}, e_1^I)$ is approximated by a lexicon model $p(f_j \mid e_{a_j})$ by dropping the dependencies on $f_1^{j-1}$, $a_1^{j-1}$, and $e_1^I$. Obviously, this simplification does not hold for many natural language phenomena. The straightforward approach to include more dependencies in the lexicon model would be to add additional dependencies (e.g. $p(f_j \mid e_{a_j}, e_{a_j-1})$). This approach would yield a significant data sparseness problem. Here, the role of maximum entropy (ME) is to build a stochastic model that efficiently takes a larger context into account. In the following, we will use $p(f \mid x)$ to denote the probability that the ME model assigns to $f$ in the context $x$, in order to distinguish this model from the basic lexicon model $p(f \mid e)$.
In the maximum entropy approach we describe all properties that we feel are useful by so-called feature functions $\phi(x, f)$. For example, if we want to model the existence or absence of a specific word $e'$ in the context of an English word $e$ which has the translation $f$, we can express this dependency using the following feature function:

$$\phi_{e',f'}(x, f) = \begin{cases} 1 & \text{if } f = f' \text{ and } e' \in x \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
The ME principle suggests that the optimal parametric form of a model $p(f \mid x)$ taking into account only the feature functions $\phi_k$, $k = 1, \ldots, K$, is given by:

$$p(f \mid x) = \frac{1}{Z(x)} \exp\Big[ \sum_{k=1}^{K} \lambda_k \phi_k(x, f) \Big]$$

Here $Z(x)$ is a normalization factor. The resulting model has an exponential form with free parameters $\lambda_k$, $k = 1, \ldots, K$. The parameter values that maximize the likelihood for a given training corpus can be computed with the so-called GIS algorithm (generalized iterative scaling) or its improved version IIS (Pietra et al., 1997; Berger et al., 1996).
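As a rough illustration of the exponential form and of one scaling update, the sketch below is our own and is not the toolkit used in the experiments. It assumes binary features, a fixed candidate set of source words per target word, and it omits the GIS correction (slack) feature, so it is only an approximation of the full GIS/IIS procedure.

```python
import math

def me_prob(f, x, feats, lam, candidates):
    """Exponential model p(f|x) = exp(sum_k lam_k * phi_k(x, f)) / Z(x)."""
    def unnorm(g):
        return math.exp(sum(l * phi(x, g) for l, phi in zip(lam, feats)))
    z = sum(unnorm(g) for g in candidates)          # normalization factor Z(x)
    return unnorm(f) / z

def gis_update(events, feats, lam, candidates, C):
    """One simplified GIS step: lam_k += log(E_emp[phi_k] / E_model[phi_k]) / C.

    events:     list of (x, f, count) training events for one target word e.
    candidates: the source words f the model may predict for this target word.
    C:          an upper bound on sum_k phi_k(x, f); the usual correction
                feature is omitted for brevity."""
    emp = [0.0] * len(feats)
    mod = [0.0] * len(feats)
    for x, f, count in events:
        p = {g: me_prob(g, x, feats, lam, candidates) for g in candidates}
        for k, phi in enumerate(feats):
            emp[k] += count * phi(x, f)
            mod[k] += count * sum(p[g] * phi(x, g) for g in candidates)
    return [l + (math.log(e / m) / C if e > 0 and m > 0 else 0.0)
            for l, e, m in zip(lam, emp, mod)]
```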
It is important to note that we have to train one ME model for each target word observed in the training data.
4 Contextual information and training
events
In order to train the ME model $p_e(f \mid x)$ associated with a target word $e$, we need to construct a corresponding training sample from the whole bilingual corpus, depending on the contextual information that we want to use. To construct this sample, we need to know the word-to-word alignment between each sentence pair within the corpus. This alignment is obtained using the Viterbi alignment provided by a translation model as described in (Brown et al., 1993). Specifically, we use the Viterbi alignment that was produced by Model 5. We use the program GIZA++ (Och and Ney, 2000b; Och and Ney, 2000a), which is an extension of the training program available in EGYPT (Al-Onaizan et al., 1999).
Berger et al. (1996) use the words that surround a specific word pair $(e, f)$ as contextual information. The authors propose as context the 3 words to the left and the 3 words to the right of the target word. In this work we use the following contextual information:
- Target context: As in (Berger et al., 1996), we consider a window of 3 words to the left and 3 words to the right of the target word considered.

- Source context: In addition, we consider a window of 3 words to the left of the source word $f$ which is connected to $e$ according to the Viterbi alignment.

- Word classes: In addition to the dependency on the word identity, we also include a dependency on word classes. By doing this, we improve the generalization of the models and include some semantic and syntactic information as well. The word classes are computed automatically using another statistical training procedure (Och, 1999), which often places words with the same semantic meaning in the same class.
A training event for a specific target word $e$ is composed of three items:

- The source word $f$ aligned to $e$.

- The context in which the aligned pair $(e, f)$ appears.

- The number of occurrences of the event in the training corpus.
Table 1 shows some examples of training events
for the target word “which”.
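A possible way to collect such events from a Viterbi-aligned corpus is sketched below. The data structures and the function name are our own assumptions, while the window of 3 words follows the description above.

```python
from collections import defaultdict

def collect_events(corpus, e_target, window=3):
    """Collect training events for one English word e_target.

    corpus: iterable of (english_words, german_words, alignment), where
            alignment[j] = i links German position j (0-based here) to English
            position i (1-based; i = 0 denotes the 'empty' word e_0).
    Each event is keyed by the aligned German word, the English windows of 3
    words to the left and right, and the 3 German words preceding the aligned
    source word; the value is its count in the corpus."""
    events = defaultdict(int)
    for e_sent, f_sent, align in corpus:
        for j, i in enumerate(align):
            if i == 0 or e_sent[i - 1] != e_target:
                continue
            ei = i - 1                                   # 0-based English position
            e_left = tuple(e_sent[max(0, ei - window):ei])
            e_right = tuple(e_sent[ei + 1:ei + 1 + window])
            f_left = tuple(f_sent[max(0, j - window):j])
            events[(f_sent[j], e_left, e_right, f_left)] += 1
    return events
```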
5 Features
Once we have a set of training events for each tar-
get word we need to describe our feature func-
tions. We do this by first specifying a large pool
of possible features and then by selecting a subset
of “good” features from this pool.
5.1 Features definition
All the features we consider form a triple (pos, label-1, label-2) where:

- pos: the position that label-2 has in a specific context.

- label-1: the source word $f$ of the aligned word pair $(e, f)$, or the word class of the source word $f$ ($F(f)$).

- label-2: one word of the aligned word pair $(e, f)$, or the word class to which these words belong ($F(f)$, $E(e)$).
Using this notation and given a context $x$:

$$x = e_{i-3} \ldots e_i \ldots e_{i+3}, \; f_{j-3} \ldots f_j$$
Table 1: Some training events for the English word "which". The symbol "_" is the placeholder of the English word "which" in the English context. In the German part the placeholder ("_") corresponds to the word aligned to "which": in the first example the German word "die", the word "das" in the second and the word "was" in the third. The English and German contexts are separated by the double bar "||". The last number in the rightmost position is the number of occurrences of the event in the whole corpus.

Alig. word ($f$)   Context ($x$)                                        # of occur.
die                bar there , _ I just already || nette Bar , _        2
das                hotel best , _ is very centrally || ein Hotel , _    1
was                now , _ one do we || jetzt , _                       1
Table 2: Meaning of different feature categories, where $\bar{e}$ represents a specific target word and $\bar{f}$, $\bar{f}'$ represent specific source words.

Category   $\phi(x, f_j) = 1$ if and only if ...
1          $f_j = \bar{f}$
2          $f_j = \bar{f}$ and $\bar{e} = e_{i-1}$
2          $f_j = \bar{f}$ and $\bar{e} = e_{i+1}$
3          $f_j = \bar{f}$ and $\bar{e} \in \{e_{i-3}, e_{i-2}, e_{i-1}\}$
3          $f_j = \bar{f}$ and $\bar{e} \in \{e_{i+1}, e_{i+2}, e_{i+3}\}$
6          $f_j = \bar{f}$ and $\bar{f}' = f_{j-1}$
7          $f_j = \bar{f}$ and $\bar{f}' \in \{f_{j-3}, f_{j-2}, f_{j-1}\}$
for the word pair $(e_i, f_j)$, we use the following categories of features:

1. $(0, f_j, \,)$
2. $(\pm 1, f_j, e')$ and $e' = e_{i \pm 1}$
3. $(\pm 3, f_j, e')$ and $e' \in \{e_{i-3} \ldots e_{i+3}\}$
4. $(\pm 1, F(f_j), E(e'))$ and $e' = e_{i \pm 1}$
5. $(\pm 3, F(f_j), E(e'))$ and $e' \in \{e_{i-3} \ldots e_{i+3}\}$
6. $(-1, f_j, f')$ and $f' = f_{j-1}$
7. $(-3, f_j, f')$ and $f' \in \{f_{j-3} \ldots f_{j-1}\}$
8. $(-1, F(f_j), F(f'))$ and $f' = f_{j-1}$
9. $(-3, F(f_j), F(f'))$ and $f' \in \{f_{j-3} \ldots f_{j-1}\}$
Category 1 features depend only on the source word $f_j$ and the target word $e_i$. An ME model that uses only those predicts each source translation $f_j$ with the probability $\tilde{p}_e(f_j)$ determined by the empirical data. This is exactly the standard lexicon probability $p(f \mid e)$ employed in the translation model described in (Brown et al., 1993) and in Section 2.

Category 2 describes features which depend in addition on the word $e'$ one position to the left or to the right of $e_i$. The same explanation is valid for category 3, but in this case $e'$ may appear at any position of the context $x$. Categories 4 and 5 are the categories analogous to 2 and 3, using word classes instead of words. In categories 6, 7, 8 and 9 the source context is used instead of the target context. Table 2 gives an overview of the different feature categories.
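For illustration, the word-based categories can be enumerated mechanically from a training event. The sketch below is our own and reuses the event representation assumed in the Section 4 sketch; it covers only categories 1, 2, 3, 6 and 7, since the class-based categories simply substitute the $F(\cdot)$ and $E(\cdot)$ classes for the words.

```python
def event_features(f, e_left, e_right, f_left):
    """Enumerate feature triples for one training event; only categories
    1, 2, 3, 6 and 7 are sketched, the class-based categories 4, 5, 8 and 9
    would replace words by their F(.) / E(.) word classes.

    f:       the aligned German word f_j
    e_left:  English words e_{i-3} .. e_{i-1}
    e_right: English words e_{i+1} .. e_{i+3}
    f_left:  German words  f_{j-3} .. f_{j-1}
    Each entry is (category, pos, label-1, label-2); the category is added
    only for readability and is not part of the paper's triples."""
    feats = [(1, 0, f, "")]                               # category 1
    if e_left:
        feats.append((2, -1, f, e_left[-1]))              # category 2, left
    if e_right:
        feats.append((2, +1, f, e_right[0]))              # category 2, right
    feats += [(3, -3, f, w) for w in e_left]              # category 3, left window
    feats += [(3, +3, f, w) for w in e_right]             # category 3, right window
    if f_left:
        feats.append((6, -1, f, f_left[-1]))              # category 6
    feats += [(7, -3, f, w) for w in f_left]              # category 7
    return feats
```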
Examples of specific features and their respec-
tive category are shown in Table 3.
Table 3: The 10 most important features, with their respective category and $\lambda$ values, for the English word "which".

Category   Feature          $\lambda$
1          (0,was,)         1.20787
1          (0,das,)         1.19333
5          (3,F35,E15)      1.17612
4          (1,F35,E15)      1.15916
3          (3,das,is)       1.12869
2          (1,das,is)       1.12596
1          (0,die,)         1.12596
5          (-3,was,@@)      1.12052
6          (-1,was,@@)      1.11511
9          (-3,F26,F18)     1.11242
5.2 Feature selection
The number of possible features that can be used
according to the German and English vocabular-
ies and word classes is huge. In order to re-
duce the number of features we perform a threshold-based feature selection, that is, every feature which occurs less than $c$ times is not used. The aim of the feature selection is two-fold: firstly, we obtain smaller models by using fewer features, and secondly, we hope to avoid overfitting on the training data.
In order to obtain the threshold $c$ we compare the test corpus perplexity for various thresholds. The thresholds used in the experiments range from 0 to 512. The threshold is used as a cut-off on the number of occurrences of a specific feature. Thus, a cut-off of 0 means that all features observed in the training data are used, while a cut-off of 32 means that only those features that appear 32 times or more are used to train the maximum entropy models.
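In practice, the cut-off amounts to counting how often each candidate feature occurs in the training events and discarding the rare ones. A minimal sketch follows, assuming the hypothetical `collect_events` and `event_features` helpers sketched above.

```python
from collections import Counter

def select_features(events, threshold):
    """Discard every feature that occurs fewer than `threshold` times; a
    threshold of 0 keeps all features observed in the training data.

    events: dict mapping (f, e_left, e_right, f_left) -> count, e.g. as
            produced by collect_events() above."""
    counts = Counter()
    for (f, e_left, e_right, f_left), count in events.items():
        for feat in event_features(f, e_left, e_right, f_left):
            counts[feat] += count
    return {feat for feat, c in counts.items() if c >= threshold}
```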
We select the English words that appear at least 150 times in the training sample, which are in total 348 of the 4,673 words contained in the English vocabulary. Table 4 shows the number of features considered for the 348 selected English words using different thresholds.
In choosing a reasonable threshold we have to
balance the number of features and observed per-
plexity.
Table 4: Number of features used for different cut-off thresholds $c$. The second column shows the number of features used when only the English context is considered; the third column corresponds to the English, German and word-class contexts.

              # features used
$c$      English    English+German
0        846121     1581529
2        240053     500285
4        153225     330077
8        96983      210795
16       61329      131323
32       40441      80769
64       28147      49509
128      21469      31805
256      18511      22947
512      17193      19027
6 Experimental results
6.1 Training and test corpus
The “Verbmobil Task” is a speech translation task
in the domain of appointment scheduling, travel
planning, and hotel reservation. The task is dif-
ficult because it consists of spontaneous speech
and the syntactic structures of the sentences are
less restricted and highly variable. For the rescor-
ing experiments we use the corpus described in
Table 5.
Table 5: Corpus characteristics for the translation task.

                          German     English
Train   Sentences              58 332
        Words               519 523     549 921
        Vocabulary            7 940       4 673
Test    Sentences                 147
        Words                 1 968       2 173
        PP (trigr. LM)       (40.3)        28.8
To train the maximum entropy models we used
the “Ristad ME Toolkit” described in (Ristad,
1997). We performed 100 iterations of the Improved Iterative Scaling algorithm (Pietra et al., 1997) using the corpus described in Table 6,
Table 6: Corpus characteristics for the perplexity experiments.

                          German     English
Train   Sentences              50 000
        Words               454 619     482 344
        Vocabulary            7 456       4 420
Test    Sentences               8 073
        Words                64 875      65 547
        Vocabulary            2 579       1 666
which is a subset of the corpus shown in Table 5.
6.2 Training and test perplexities
In order to compute the training and test perplex-
ities, we split the whole aligned training corpus into two parts as shown in Table 6. The training
and test perplexities are shown in Table 7. As
expected, the perplexity reduction in the test cor-
pus is lower than in the training corpus, but in
both cases better perplexities are obtained using
the ME models. The best value is obtained when
a threshold of 4 is used.
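The perplexities reported here can be read as the exponentiated average negative log-probability that the lexicon model assigns to each aligned source word. The sketch below shows one plausible formulation; the exact formula is not spelled out in the text, and the `prob` callable is an assumption.

```python
import math

def lexicon_perplexity(aligned_pairs, prob):
    """Perplexity of a lexicon model over Viterbi-aligned word pairs.

    aligned_pairs: iterable of (e, f, x) triples taken from the aligned corpus.
    prob(e, f, x): probability the model assigns to the source word f given the
                   target word e (and, for the ME models, the context x)."""
    log_sum, n = 0.0, 0
    for e, f, x in aligned_pairs:
        log_sum += math.log(prob(e, f, x))
        n += 1
    return math.exp(-log_sum / n)
```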
We expected to observe strong overfitting effects when too small a cut-off for features is used. Yet, for most words the best test corpus perplexity is observed when we use all features, including those that occur only once.
Table 7: Training and test perplexities using different contextual information and different thresholds $c$. The reference perplexities obtained with the basic translation Model 5 are TrainPP = 10.38 and TestPP = 13.22.

         English              English+German
$c$      TrainPP   TestPP     TrainPP   TestPP
0 5.03 11.39 4.60 9.28
2 6.59 10.37 5.70 8.94
4 7.09 10.28 6.17 8.92
8 7.50 10.39 6.63 9.03
16 7.95 10.64 7.07 9.30
32 8.38 11.04 7.55 9.73
64 9.68 11.56 8.05 10.26
128 9.31 12.09 8.61 10.94
256 9.70 12.62 9.20 11.80
512 10.07 13.12 9.69 12.45
6.3 Translation results
In order to make use of the ME models in a statis-
tical translation system we implemented a rescor-
ing algorithm. This algorithm takes as input the standard lexicon model (not using maximum entropy) and the 348 models obtained with the ME training. For a hypothesis sentence $e_1^I$ and a corresponding alignment $a_1^J$, the algorithm modifies the score $Pr(f_1^J, a_1^J \mid e_1^I)$ according to the refined maximum entropy lexicon model.
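One way to realize such a rescoring pass is sketched below (our own illustration): for every aligned source position, the contribution of the standard lexicon probability $p(f_j \mid e_{a_j})$ is replaced by the ME probability $p_{e_{a_j}}(f_j \mid x)$ whenever an ME model exists for the aligned target word. The helper `context_of` is an assumed function that builds the context of Section 4.

```python
import math

def rescore(hypothesis, p_lex, me_models, context_of):
    """Adjust the score of one N-best hypothesis with the refined ME models.

    hypothesis:  (f_sent, e_sent, align, log_score) with align[j] = i
                 (i = 0 denotes the 'empty' word e_0).
    p_lex(f, e): standard lexicon probability p(f|e).
    me_models:   dict mapping a target word e to its ME model p_e(f|x).
    context_of(f_sent, e_sent, align, j): assumed helper that builds the
                 context x for source position j (see Section 4)."""
    f_sent, e_sent, align, log_score = hypothesis
    for j, i in enumerate(align):
        if i == 0:
            continue                       # empty-word alignments stay unchanged
        e = e_sent[i - 1]
        if e in me_models:                 # only the 348 words with an ME model
            x = context_of(f_sent, e_sent, align, j)
            log_score += math.log(me_models[e](f_sent[j], x))
            log_score -= math.log(p_lex(f_sent[j], e))
    return log_score

# The N-best list is then re-sorted by the modified score, highest first:
# nbest.sort(key=lambda h: rescore(h, p_lex, me_models, context_of), reverse=True)
```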
We carried out some preliminary experiments with the $N$-best lists of hypotheses provided by the translation system, rescoring each hypothesis and reordering the list according to the new score computed with the refined lexicon model. Unfortunately, our $N$-best extraction algorithm is sub-optimal, i.e. it does not extract the true $N$ best translations. In addition, so far we had to use a limit of only 10 translations per sentence. Therefore, the results of the translation experiments are only preliminary.
For the evaluation of the translation quality
we use the automatically computable Word Er-
ror Rate (WER). The WER corresponds to the
edit distance between the produced translation
and one predefined reference translation. A short-
coming of the WER is the fact that it requires a
perfect word order. This is particularly a prob-
lem for the Verbmobil task, where the word or-
der of the German-English sentence pair can be
quite different. As a result, the word order of the automatically generated target sentence can be different from that of the reference sentence, but nevertheless acceptable, so that the WER measure alone can be misleading. In order to overcome
this problem, we introduce as additional measure
the position-independent word error rate (PER).
This measure compares the words in the two sen-
tences without taking the word order into account.
Depending on whether the translated sentence is
longer or shorter than the target translation, the
remaining words result in either insertion or dele-
tion errors in addition to substitution errors. The
PER is guaranteed to be less than or equal to the
WER.
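Both measures are straightforward to compute. The sketch below uses the standard word-level edit distance for the WER and a bag-of-words comparison for the PER; the normalization by the reference length is our assumption.

```python
from collections import Counter

def wer(hyp, ref):
    """Word error rate: word-level edit distance to the reference, divided by
    the reference length (multiply by 100 for a percentage)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
            prev, d[j] = d[j], cur
    return d[len(ref)] / len(ref)

def per(hyp, ref):
    """Position-independent error rate: unmatched words (plus the length
    difference) relative to the reference length; never exceeds the WER."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)
```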
We use the top-10 list of hypotheses provided by the translation system described in (Tillmann and Ney, 2000), rescore the hypotheses using the ME models, and sort them according to the new maximum entropy score. The translation results in terms of error rates are shown in Table 8. We use Model 4 to perform the translation experiments because Model 4 typically gives better translation results than Model 5.
We see that the translation quality improves
slightly with respect to the WER and PER. The
translation quality improvements so far are quite small compared to the improvements in the perplexity measure. We attribute this to the fact that the algorithm for computing the $N$-best lists is sub-optimal.
Table 8: Preliminary translation results on the Verbmobil Test-147 for different contextual information and different thresholds $c$, using the top-10 translations. The baseline translation results for Model 4 are WER = 54.80 and PER = 43.07.

         English          English+German
$c$      WER     PER      WER     PER
0 54.57 42.98 54.02 42.48
2 54.16 42.43 54.07 42.71
4 54.53 42.71 54.11 42.75
8 54.76 43.21 54.39 43.07
16 54.76 43.53 54.02 42.75
32 54.80 43.12 54.53 42.94
64 54.21 42.89 54.53 42.89
128 54.57 42.98 54.67 43.12
256 54.99 43.12 54.57 42.89
512 55.08 43.30 54.85 43.21
Table 9 shows some examples where the trans-
lation obtained with the rescoring procedure is
better than the best hypothesis provided by the
translation system.
7 Conclusions
We have developed refined lexicon models for
statistical machine translation by using maximum
entropy models. We have been able to obtain a significantly better test corpus perplexity and also a slight improvement in translation quality. We believe that by performing a rescoring on translation word graphs we will obtain a more significant improvement in translation quality.
In the future we plan to investigate more refined feature selection methods in order to make the maximum entropy models smaller and better generalizing. In addition, we want to investigate more syntactic and semantic features and to include features that go beyond sentence boundaries.
References
Yaser Al-Onaizan, Jan Curin, Michael Jahr,
Kevin Knight, John Lafferty, Dan Melamed,
David Purdy, Franz J. Och, Noah A. Smith,
and David Yarowsky. 1999. Statistical machine translation, final report, JHU workshop. http://www.clsp.jhu.edu/ws99/projects/mt/final_report/mt-final-report.ps.
A. L. Berger, P. F. Brown, S. A. Della Pietra, et al.
1994. The Candide system for machine translation. In Proc. ARPA Workshop on Human Language Technology, pages 157–162.
Adam L. Berger, Stephen A. Della Pietra, and Vin-
cent J. Della Pietra. 1996. A maximum entropy
approach to natural language processing. Compu-
tational Linguistics, 22(1):39–72, March.
Peter F. Brown, Stephen A. Della Pietra, Vincent J.
Della Pietra, and Robert L. Mercer. 1993. The
mathematics of statistical machine translation: Pa-
rameter estimation. Computational Linguistics,
19(2):263–311.
George Foster. 2000. Incorporating position informa-
tion into a maximum entropy/minimum divergence
translation model. In Proc. of CoNLL-2000 and
LLL-2000, pages 37–52, Lisbon, Portugal.
Sven Martin, Christoph Hamacher, Jörg Liermann,
Frank Wessel, and Hermann Ney. 1999. Assess-
ment of smoothing methods and complex stochas-
tic language modeling. In IEEE International Con-
ference on Acoustics, Speech and Signal Process-
ing, volume I, pages 1939–1942, Budapest, Hun-
gary, September.
Sonja Nießen, Stephan Vogel, Hermann Ney, and
Christoph Tillmann. 1998. A DP-based search
algorithm for statistical machine translation. In
COLING-ACL ’98: 36th Annual Meeting of the As-
sociation for Computational Linguistics and 17th
Int. Conf. on Computational Linguistics, pages
960–967, Montreal, Canada, August.
Franz J. Och and Hermann Ney. 2000a. GIZA++: Training of statistical translation models. http://www-i6.Informatik.RWTH-Aachen.DE/~och/software/GIZA++.html.
Franz J. Och and Hermann Ney. 2000b. Improved sta-
tistical alignment models. In Proc. of the 38th An-
nual Meeting of the Association for Computational
Linguistics, pages 440–447, Hongkong, China, Oc-
tober.
Table 9: Four examples showing the translations obtained with Model 4 and with the ME model for a given German source sentence.
SRC: Danach wollten wir eigentlich noch Abendessen gehen.
M4: We actually concluding dinner together.
ME: Afterwards we wanted to go to dinner.
SRC: Bei mir oder bei Ihnen?
M4: For me or for you?
ME: At your or my place?
SRC: Das wäre genau das richtige.
M4: That is exactly it spirit.
ME: That is the right thing.
SRC: Ja, das sieht bei mir eigentlich im Januar ziemlich gut aus.
M4: Yes, that does not suit me in January looks pretty good.
ME: Yes, that looks pretty good for me actually in January.
Franz J. Och. 1999. An efficient method for deter-
mining bilingual word classes. In EACL ’99: Ninth
Conf. of the Europ. Chapter of the Association for
Computational Linguistics, pages 71–76, Bergen,
Norway, June.
K.A. Papineni, S. Roukos, and R.T. Ward. 1996.
Feature-based language understanding. In ESCA,
Eurospeech, pages 1435–1438, Rhodes, Greece.
K.A. Papineni, S. Roukos, and R.T. Ward. 1998.
Maximum likelihood and discriminative training of
direct translation models. In Proc. Int. Conf. on
Acoustics, Speech, and Signal Processing, pages
189–192.
Jochen Peters and Dietrich Klakow. 1999. Compact
maximum entropy language models. In Proceed-
ings of the IEEE Workshop on Automatic Speech
Recognition and Understanding, Keystone, CO,
December.
Stephen Della Pietra, Vincent Della Pietra, and John
Lafferty. 1997. Inducing features in random fields.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(4):380–393, July.
Eric S. Ristad. 1997. Maximum entropy modelling
toolkit. Technical report, Princeton University.
R. Rosenfeld. 1996. A maximum entropy approach to
adaptive statistical language modeling. Computer Speech and Language, 10:187–228.
Christoph Tillmann and Hermann Ney. 2000. Word
re-ordering and DP-based search in statistical machine translation. In 18th International Conference on Computational Linguistics (COLING 2000), pages 850–856, Saarbrücken, Germany, July.
C. Tillmann, S. Vogel, H. Ney, and A. Zubiaga. 1997.
A DP-based search using monotone alignments in
statistical translation. In Proc. 35th Annual Conf.
of the Association for Computational Linguistics,
pages 289–296, Madrid, Spain, July.
Ye-Yi Wang and Alex Waibel. 1997. Decoding algo-
rithm in statistical translation. In Proc. 35th Annual
Conf. of the Association for Computational Linguis-
tics, pages 366–372, Madrid, Spain, July.
