A Phrase-Based, Joint Probability Model for Statistical Machine Translation
Daniel Marcu
Information Sciences Institute and
Department of Computer Science
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA, 90292
marcu@isi.edu
William Wong
Language Weaver Inc.
1639 11th St., Suite 100A
Santa Monica, CA 90404
wwong@languageweaver.com
Abstract
We present a joint probability model for
statistical machine translation, which au-
tomatically learns word and phrase equiv-
alents from bilingual corpora. Transla-
tions produced with parameters estimated
using the joint model are more accu-
rate than translations produced using IBM
Model 4.
1 Motivation
Most of the noisy-channel-based models used in
statistical machine translation (MT) (Brown et al.,
1993) are conditional probability models. In the
noisy-channel framework, each source sentence e in
a parallel corpus is assumed to “generate” a target
sentence f by means of a stochastic process, whose
parameters are estimated using traditional EM tech-
niques (Dempster et al., 1977). The generative
model explains how source words are mapped into
target words and how target words are re-ordered
to yield well-formed target sentences. A variety
of methods are used to account for the re-ordering
stage: word-based (Brown et al., 1993), template-
based (Och et al., 1999), and syntax-based (Yamada
and Knight, 2001), to name just a few. Although
these models use different generative processes to
explain how translated words are re-ordered in a tar-
get language, at the lexical level they are quite sim-
ilar; all these models assume that source words are
individually translated into target words.1
Footnote 1: The individual words may contain a non-existent element, called NULL.
We suspect that MT researchers have so far cho-
sen to automatically learn translation lexicons de-
fined only over words for primarily pragmatic rea-
sons. Large scale bilingual corpora with vocabu-
laries in the range of hundreds of thousands yield
very large translation lexicons. Tuning the probabil-
ities associated with these large lexicons is a difficult
enough task to deter one from trying to scale up to
learning phrase-based lexicons. Unfortunately, trad-
ing space requirements and efficiency for explana-
tory power often yields non-intuitive results.
Consider, for example, the parallel corpus of three
sentence pairs shown in Figure 1. Intuitively, if we
allow any Source words to be aligned to any Target
words, the best alignment that we can come up with
is the one in Figure 1.c. Sentence pair (S2, T2) of-
fers strong evidence that “b c” in language S means
the same thing as “x” in language T. On the basis
of this evidence, we expect the system to also learn
from sentence pair (S1, T1) that “a” in language S
means the same thing as “y” in language T. Unfortu-
nately, if one works with translation models that do
not allow Target words to be aligned to more than
one Source word — as is the case in the IBM models (Brown et al., 1993) — it is impossible to learn
that the phrase “b c” in language S means the same
thing as word “x” in language T. The IBM Model
4 (Brown et al., 1993), for example, converges to the
word alignments shown in Figure 1.b and learns the
translation probabilities shown in Figure 1.a.2 Since
in the IBM model one cannot link a Target word
to more than one Source word, the training procedure
Footnote 2: To train the IBM-4 model, we used Giza (Al-Onaizan et al., 1999).
Figure 1: Alignments and probability distributions in IBM Model 4 and our joint phrase-based model, for the corpus S1: “a b c” / T1: “x y”; S2: “b c” / T2: “x”; S3: “b” / T3: “z”. (a) IBM-4 t-table: p(y | a) = 1, p(x | c) = 1, p(z | b) = 0.98, p(x | b) = 0.02. (b) Word alignments learned by IBM-4. (c) The intuitive alignments. (d) Alignments learned by the joint model. (e) Joint t-table: p(x, b c) = 0.34, p(y, a) = 0.01, p(x y, a b c) = 0.32, p(z, b) = 0.33; corresponding conditional table: p(x y | a b c) = 1, p(x | b c) = 1, p(y | a) = 1, p(z | b) = 1.
yields unintuitive translation probabilities. (Note that another good word-for-word model is one that assigns high probability to p(x | b) and p(z | b) and low probability to p(x | c).)
In this paper, we describe a translation model that
assumes that lexical correspondences can be estab-
lished not only at the word level, but at the phrase
level as well. In contrast with many previous ap-
proaches (Brown et al., 1993; Och et al., 1999; Ya-
mada and Knight, 2001), our model does not try to
capture how Source sentences can be mapped into
Target sentences, but rather how Source and Tar-
get sentences can be generated simultaneously. In
other words, in the style of Melamed (2001), we es-
timate a joint probability model that can be easily
marginalized in order to yield conditional probabil-
ity models for both source-to-target and target-to-
source machine translation applications. The main
difference between our work and that of Melamed
is that we learn joint probability models of trans-
lation equivalence not only between words but also
between phrases and we show that these models can
be used not only for the extraction of bilingual lexi-
cons but also for the automatic translation of unseen
sentences.
In the rest of the paper, we first describe our
model (Section 2) and explain how it can be imple-
mented/trained (Section 3). We briefly describe a
decoding algorithm that works in conjunction with
our model (Section 4) and evaluate the performance
of a translation system that uses the joint-probability
model (Section 5). We end with a discussion of the
strengths and weaknesses of our model as compared
to other models proposed in the literature.
2 A Phrase-Based Joint Probability Model
2.1 Model 1
In developing our joint probability model, we started
out with a very simple generative story. We assume
that each sentence pair in our corpus is generated by
the following stochastic process:
1. Generate a bag of concepts $C$.

2. For each concept $c_i \in C$, generate a pair of phrases $(\bar{e}_i, \bar{f}_i)$ according to the distribution $t(\bar{e}_i, \bar{f}_i)$, where $\bar{e}_i$ and $\bar{f}_i$ each contain at least one word.
3. Order the phrases generated in each language
so as to create two linear sequences of phrases;
these sequences correspond to the sentence
pairs in a bilingual corpus.
For simplicity, we initially assume that the bag of
concepts and the ordering of the generated phrases
are modeled by uniform distributions. We do not assume that $c_i$ is a hidden variable that generates the pair $(\bar{e}_i, \bar{f}_i)$, but rather that $c_i = (\bar{e}_i, \bar{f}_i)$. Under these assumptions, it follows that the probability of generating a sentence pair (E, F) using concepts $c_i \in C$ is given by the product of all phrase-to-phrase translation probabilities, $\prod_{c_i \in C} t(\bar{e}_i, \bar{f}_i)$, that yield bags of phrases that can be ordered linearly so
as to obtain the sentences E and F. For example, the
sentence pair “a b c” — “x y” can be generated us-
ing two concepts, (“a b” : “y”) and (“c” : “x”); or
one concept, (“a b c” : “x y”), because in both cases
the phrases in each language can be arranged in a
sequence that would yield the original sentence pair.
However, the same sentence pair cannot be gener-
ated using the concepts (“a b” : “y”) and (“c” : “y”)
because the sequence “x y” cannot be recreated from
the two phrases “y” and “y”. Similarly, the pair can-
not be generated using concepts (“a c” : “x”) and
(“b” : “y”) because the sequence “a b c” cannot be
created by concatenating the phrases “a c” and “b”.
We say that a set of concepts $C$ can be linearized into a sentence pair (E, F) if E and F can be obtained by permuting the phrases $\bar{e}_i$ and $\bar{f}_i$ that characterize all concepts $c_i \in C$. We denote this property using the predicate $L(E, F, C)$. Under this model, the probability of a given sentence pair (E, F) can then be obtained by summing up over all possible ways of generating bags of concepts $C \in \mathcal{C}$ that can be linearized to (E, F).

$$p(E, F) = \sum_{C \in \mathcal{C} :\, L(E, F, C)} \; \prod_{c_i \in C} t(\bar{e}_i, \bar{f}_i) \qquad (1)$$
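To make the summation in equation (1) concrete, here is a brute-force Python sketch that computes p(E, F) for a toy sentence pair: it enumerates every segmentation of E and of F into the same number of contiguous phrases, together with every one-to-one pairing between the two phrase sets (i.e., every bag of concepts that linearizes to (E, F)), and sums the products of t probabilities. The t-table values are illustrative assumptions loosely based on Figure 1.e; the enumeration is exponential and is meant only to clarify the model, not as a training or decoding procedure.

from itertools import permutations

def segmentations(words):
    # Yield every split of a word sequence into contiguous, non-empty phrases.
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        head = tuple(words[:i])
        for rest in segmentations(words[i:]):
            yield [head] + rest

def model1_prob(e_words, f_words, t):
    # p(E, F) under Model 1: sum over all bags of concepts that linearize to (E, F).
    # A bag of concepts corresponds to a segmentation of E and of F into the same
    # number of phrases plus a one-to-one pairing between the two phrase sets.
    # `t` maps (e_phrase, f_phrase) tuples to joint probabilities.
    total = 0.0
    for e_seg in segmentations(list(e_words)):
        for f_seg in segmentations(list(f_words)):
            if len(e_seg) != len(f_seg):
                continue
            for f_perm in permutations(f_seg):      # every pairing of concepts
                p = 1.0
                for e_phr, f_phr in zip(e_seg, f_perm):
                    p *= t.get((e_phr, f_phr), 0.0)
                total += p
    return total

# Toy joint t-table (illustrative values in the spirit of Figure 1.e).
t = {(("a", "b", "c"), ("x", "y")): 0.32,
     (("a",), ("y",)): 0.01,
     (("b", "c"), ("x",)): 0.34,
     (("b",), ("z",)): 0.33}
print(model1_prob(["a", "b", "c"], ["x", "y"], t))   # 0.3234 = 0.32 + 0.01 * 0.34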
2.2 Model 2
Although Model 1 is fairly unsophisticated, we have
found that it produces in practice fairly good align-
ments. However, this model is clearly unsuited for
translating unseen sentences as it imposes no con-
straints on the ordering of the phrases associated
with a given concept. In order to account for this,
we modify slightly the generative process in Model
1 so as to account for distortions. The generative
story of Model 2 is this:
1. Generate a bag of concepts $C$.

2. Initialize E and F to empty sequences $\epsilon$.

3. Randomly take a concept $c_i \in C$ and generate a pair of phrases $(\bar{e}_i, \bar{f}_i)$, according to the distribution $t(\bar{e}_i, \bar{f}_i)$, where $\bar{e}_i$ and $\bar{f}_i$ each contain at least one word. Remove then $c_i$ from $C$.

4. Append phrase $\bar{f}_i$ at the end of F. Let $k$ be the start position of $\bar{f}_i$ in F.

5. Insert phrase $\bar{e}_i$ at position $l$ in E provided that no other phrase occupies any of the positions between $l$ and $l + |\bar{e}_i|$, where $|\bar{e}_i|$ gives the length of the phrase $\bar{e}_i$. We hence create the alignment between the two phrases $\bar{f}_i$ and $\bar{e}_i$ with probability
$$\prod_{p=k}^{k + |\bar{f}_i|} d(p, (l + |\bar{e}_i|)/2),$$
where $d(i, j)$ is a position-based distortion distribution.

6. Repeat steps 3 to 5 until $C$ is empty.
In Model 2, the probability of generating a sentence pair (E, F) is given by formula (2), where $pos(\bar{f}_i^k)$ denotes the position of word $k$ of phrase $\bar{f}_i$ in sentence F and $pos_{cm}(\bar{e}_i)$ denotes the position in sentence E of the center of mass of phrase $\bar{e}_i$.

$$p(E, F) = \sum_{C \in \mathcal{C} :\, L(E, F, C)} \; \prod_{c_i \in C} \Big[ t(\bar{e}_i, \bar{f}_i) \prod_{k=1}^{|\bar{f}_i|} d(pos(\bar{f}_i^k), pos_{cm}(\bar{e}_i)) \Big] \qquad (2)$$
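To illustrate the inner term of formula (2), the following sketch scores a single bag of concepts (one alignment) under Model 2: each concept contributes its joint phrase probability times one distortion factor per Target word. The data layout, the toy distortion function, and the exact center-of-mass convention are assumptions made for this example rather than details taken from our implementation.

def model2_alignment_score(concepts, t, d):
    # Score one bag of concepts (one alignment) under Model 2.  Each concept is
    # (e_phrase, f_phrase, e_start, f_start), with 0-based start positions of the
    # two phrases in E and F.  `t` maps (e_phrase, f_phrase) to joint probabilities;
    # `d(j, c)` is a position-based distortion distribution over (word position in
    # F, center of mass of the aligned phrase in E).
    score = 1.0
    for e_phr, f_phr, e_start, f_start in concepts:
        center = e_start + (len(e_phr) - 1) / 2.0      # center of mass of the E phrase
        score *= t.get((e_phr, f_phr), 0.0)
        for p in range(len(f_phr)):                    # one distortion factor per F word
            score *= d(f_start + p, center)
    return score

# Toy usage: E = "a b c", F = "x y", with a crude distortion that decays with distance.
d = lambda j, c: 0.5 ** abs(j - c)
t = {(("a", "b"), ("y",)): 0.2, (("c",), ("x",)): 0.4}
concepts = [(("a", "b"), ("y",), 0, 1), (("c",), ("x",), 2, 0)]
print(model2_alignment_score(concepts, t, d))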
Model 2 implements an absolute position-based
distortion model, in the style of IBM Model 3. We
have tried many types of distortion models. We
eventually settled on the model discussed here be-
cause it produces better translations during decod-
ing. Since the number of factors involved in com-
puting the probability of an alignment does not vary
with the size of the Target phrases into which Source
phrases are translated, this model is not predis-
posed to produce translations that are shorter than
the Source sentences given as input.
3 Training
Training the models described in Section 2 is com-
putationally challenging. Since there is an exponen-
tial number of alignments that can generate a sen-
tence pair (E, F), it is clear that we cannot apply the
1. Determine high-frequency n-grams in the bilingual corpus.
2. Initialize the t-distribution table.
3. Apply EM training on the Viterbi alignments, while using smoothing.
4. Generate conditional model probabilities.

Figure 2: Training algorithm for the phrase-based joint probability model.
EM training algorithm exhaustively. To estimate the
parameters of our model, we apply the algorithm in
Figure 2, whose steps are motivated and described
below.
3.1 Determine high-frequency n-grams in E
and F
If one assumes from the outset that any phrases
$\bar{e}_i \in \mathcal{E}^*$ and $\bar{f}_i \in \mathcal{F}^*$ can be generated from a concept $c_i$, one would need a supercomputer in order to store in memory a table that models the $t(\bar{e}_i, \bar{f}_i)$
distribution. Since we don’t have access to comput-
ers with unlimited memory, we initially learn t distri-
bution entries only for the phrases that occur often in
the corpus and for unigrams. Then, through smooth-
ing, we learn t distribution entries for the phrases
that occur rarely as well. In order to be considered
in step 2 of the algorithm, a phrase has to occur at
least five times in the corpus.
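A minimal sketch of this step, assuming the five-occurrence threshold stated above and a maximum phrase length (six here, the cap mentioned in Section 6.1); the function name and data layout are illustrative, not taken from our implementation:

from collections import Counter

def frequent_phrases(sentences, max_len=6, min_count=5):
    # Count contiguous n-grams (1 < n <= max_len) over a corpus and keep those that
    # occur at least min_count times; unigrams are always kept (step 1 of Figure 2).
    counts = Counter()
    unigrams = set()
    for words in sentences:
        unigrams.update((w,) for w in words)
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return unigrams | {ngram for ngram, c in counts.items() if c >= min_count}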
3.2 Initialize the t-distribution table
Before the EM training procedure starts, one has no
idea what word/phrase pairs are likely to share the
same meaning. In other words, all alignments that
can generate a sentence pair (E, F) can be assumed to
have the same probability. Under these conditions,
the evidence that a sentence pair (E, F) contributes to the fact that $(\bar{e}_i, \bar{f}_i)$ are generated by the same concept $c_i$ is given by the number of alignments that can be built between (E, F) that have a concept $c_i$ linked to phrase $\bar{e}_i$ in sentence E and phrase $\bar{f}_i$ in sentence F, divided by the total number of alignments that can be built between the two sentences.
Both these numbers can be easily approximated.
Given a sentence E of $l$ words, there are $S(l, k)$ ways in which the $l$ words can be partitioned into $k$ non-empty sets/concepts, where $S(l, k)$ is the Stirling number of the second kind.

$$S(l, k) = \frac{1}{k!} \sum_{i=0}^{k-1} (-1)^i \binom{k}{i} (k - i)^l \qquad (3)$$
There are also $S(m, k)$ ways in which the $m$ words of a sentence F can be partitioned into $k$ non-empty sets. Given that any words in E can be mapped to any words in F, it follows that there are $\sum_{k=1}^{\min(l, m)} k!\, S(l, k)\, S(m, k)$ alignments that can be built between two sentences (E, F) of lengths $l$ and $m$, respectively. When a concept $c_i$ generates two phrases $(\bar{e}_i, \bar{f}_i)$ of length $a$ and $b$, respectively, there are only $l - a$ and $m - b$ words left to link. Hence, in the absence of any other information, the probability that phrases $\bar{e}_i$ and $\bar{f}_i$ are generated by the same concept $c_i$ is given by formula (4).

$$\frac{\sum_{k=1}^{\min(l-a,\, m-b)} k!\, S(l-a, k)\, S(m-b, k)}{\sum_{k=1}^{\min(l,\, m)} k!\, S(l, k)\, S(m, k)} \qquad (4)$$
Note that the fractional counts returned by equation (4) are only an approximation of the t distribution that we are interested in, because the Stirling numbers of the second kind do not impose the restriction that the words associated with a given concept be consecutive. However, since formula (4) overestimates the numerator and denominator equally, the approximation works well in practice.
In the second step of the algorithm, we apply
equation (4) to collect fractional counts for all un-
igram and high-frequency n-gram pairs in the Cartesian product defined over the phrases in each sen-
tence pair (E, F) in a corpus. We sum over all these
t-counts and we normalize to obtain an initial joint
distribution $t$. This step amounts to running the EM
algorithm for one step over all possible alignments
in the corpus.
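The following sketch implements equations (3) and (4) directly. The treatment of the degenerate case in which a single concept covers both sentences entirely (counted as exactly one alignment) is an assumption, since formula (4) does not spell it out.

from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def stirling2(n, k):
    # Stirling number of the second kind S(n, k), as in equation (3).
    if k == 0:
        return 1 if n == 0 else 0
    return sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k)) // factorial(k)

def num_alignments(l, m):
    # Number of concept alignments between sentences of lengths l and m.
    if l == 0 and m == 0:
        return 1          # assumption: an empty remainder counts as one alignment
    return sum(factorial(k) * stirling2(l, k) * stirling2(m, k)
               for k in range(1, min(l, m) + 1))

def initial_t_count(l, m, a, b):
    # Equation (4): fractional count for a phrase pair of lengths (a, b) inside a
    # sentence pair of lengths (l, m), assuming all alignments are equally likely.
    return num_alignments(l - a, m - b) / num_alignments(l, m)

# For the pair ("a b c", "x y"), the phrase pair ("b c", "x") leaves one word on
# each side to link; its initial fractional count is 1/7.
print(initial_t_count(3, 2, 2, 1))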
3.3 EM training on Viterbi alignments
Given a non-uniform t distribution, phrase-to-phrase
alignments have different weights and there are no
other tricks one can apply to collect fractional counts
over all possible alignments in polynomial time.
Starting with step 3 of the algorithm in Figure 2, for
each sentence pair in a corpus, we greedily produce
an initial alignment by linking together phrases so
as to create concepts that have high t probabilities.
We then hillclimb towards the Viterbi alignment of
highest probability by breaking and merging con-
cepts, swapping words between concepts, and mov-
ing words across concepts. We compute the prob-
abilities associated with all the alignments we gen-
erate during the hillclimbing process and collect t
counts over all concepts in these alignments.
We apply this Viterbi-based EM training proce-
dure for a few iterations. The first iterations estimate
the alignment probabilities using Model 1. The rest
of the iterations estimate the alignment probabilities
using Model 2.
During training, we apply smoothing so we can
associate non-zero values to phrase-pairs that do not
occur often in the corpus.
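The count collection described above can be sketched as follows for a single sentence pair. Normalizing over the alignments actually visited during hillclimbing, rather than over all possible alignments, is the approximation this section relies on; the data layout is an assumption for illustration.

from collections import defaultdict

def collect_t_counts(alignments):
    # Collect fractional t-counts from the alignments visited while hillclimbing on
    # one sentence pair.  `alignments` is a list of (probability, concepts) pairs,
    # where concepts is a list of (e_phrase, f_phrase) tuples.  Each alignment
    # contributes its probability, normalized over the visited alignments, to every
    # concept it contains.
    counts = defaultdict(float)
    z = sum(p for p, _ in alignments) or 1.0
    for p, concepts in alignments:
        for e_phr, f_phr in concepts:
            counts[(e_phr, f_phr)] += p / z
    return counts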
3.4 Derivation of conditional probability model
At the end of the training procedure, we take marginals on the joint probability distributions $t$ and $d$. This yields conditional probability distributions $t(\bar{f}_i \mid \bar{e}_i)$ and $d(pos_F \mid pos_E)$, which we use for decoding.
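A sketch of the marginalization for the t-table (the distortion table d is handled analogously); the dictionary layout is an assumption:

from collections import defaultdict

def conditionalize(joint_t):
    # Turn the joint distribution t(e_phrase, f_phrase) into t(f_phrase | e_phrase)
    # by dividing each joint entry by the marginal of its e_phrase.
    marginal = defaultdict(float)
    for (e_phr, _), p in joint_t.items():
        marginal[e_phr] += p
    return {(e_phr, f_phr): p / marginal[e_phr]
            for (e_phr, f_phr), p in joint_t.items() if marginal[e_phr] > 0}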
3.5 Discussion
When we run the training procedure in Figure 2 on
the corpus in Figure 1, after four Model 1 iterations
we obtain the alignments in Figure 1.d and the joint
and conditional probability distributions shown in
Figure 1.e. At first sight, the Viterbi alignment for the first sentence pair appears incorrect because we,
as humans, have a natural tendency to build align-
ments between the smallest phrases possible. How-
ever, note that the choice made by our model is quite
reasonable. After all, in the absence of additional
information, the model can either assume that “a”
and “y” mean the same thing or that phrases “a b c”
and “x y” mean the same thing. The model chose
to give more weight to the second hypothesis, while
preserving some probability mass for the first one.
Also note that although the joint distribution puts
the second hypothesis at an advantage, the condi-
tional distribution does not. The conditional distribution in Figure 1.e is consistent with our intuition that it is reasonable both to translate “a b c” into “x y” and “a” into “y”.
4 Decoding
For decoding, we have implemented a greedy pro-
cedure similar to that proposed by Germann et
al. (2001). Given a Foreign sentence F, we first produce a gloss of it by selecting phrases in $\mathcal{E}^*$ that maximize the probability $p(E, F)$. We then iteratively hillclimb by modifying E and the alignment between E and F so as to maximize the formula $p(E)\, p(F \mid E)$. We hillclimb by modifying an existing alignment/translation through a set of operations that locally modify the alignment/translation built up to that point. These operations replace the En-
glish side of an alignment with phrases of differ-
ent probabilities, merge and break existing concepts,
and swap words across concepts. The probability
p(E) is computed using a simple trigram language
model that was trained using the CMU Language
Modeling Toolkit (Clarkson and Rosenfeld, 1997).
The language model is estimated at the word (not
phrase) level. Figure 3 shows the steps taken by our decoder in order to find the translation of sentence “je vais me arrêter là .” Each intermediate translation in Figure 3 is preceded by its probability and succeeded by the operation that changes it to yield a translation of higher probability.
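The decoder's control flow can be sketched as the generic hillclimbing loop below. In this sketch, neighbors would generate the candidate states produced by the operations listed above (replacing phrase translations, merging and breaking concepts, swapping words across concepts) and score would return p(E) p(F | E); both are left as parameters because the concrete operations are specific to our implementation.

def greedy_decode(initial_state, neighbors, score):
    # Greedy hillclimbing in the style of Section 4: start from the gloss and keep
    # taking the best local modification until no operation improves the score.
    current, best = initial_state, score(initial_state)
    while True:
        candidates = [(score(c), c) for c in neighbors(current)]
        if not candidates:
            break
        top_score, top = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            break                      # local maximum reached
        current, best = top, top_score
    return current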
5 Evaluation
To evaluate our system, we trained both Giza (IBM
Model 4) (Al-Onaizan et al., 1999) and our joint
probability model on a French-English parallel cor-
pus of 100,000 sentence pairs from the Hansard
corpus. The sentences in the corpus were at most
20 words long. The English side had a total
of 1,073,480 words (21,484 unique tokens). The
French side had a total of 1,177,143 words (28,132
unique tokens).
We translated 500 unseen sentences, which were
uniformly distributed across lengths 6, 8, 10, 15, and
20. For each group of 100 sentences, we manu-
ally determined the number of sentences translated
perfectly by the IBM model decoder of Germann et
al. (2001) and the decoder that uses the joint prob-
Model          Percent perfect translations (by sentence length)    IBM Bleu score (by sentence length)
               6    8    10   15   20   Avg.                        6       8       10      15      20      Avg.
IBM            36   26   35   11   2    22                          0.2076  0.2040  0.2414  0.2248  0.2011  0.2158
Phrase-based   43   37   33   19   6    28                          0.2574  0.2181  0.2435  0.2407  0.2028  0.2325

Table 1: Comparison of IBM and Phrase-Based, Joint Probability Models on a translation task.
Figure 3: Example of phrase-based greedy decoding for the sentence "je vais me arrêter là ." The decoder starts from an initial gloss (probability 1.28e-14) and applies operations including ChangeWordTrans("arreter","stop"), FuseAndChange("je vais","let me"), changeWordTrans("vais", "want"), FuseAndChangeTrans("la .", "there ."), and FuseAndChange("je vais me", "i am going to"), each yielding a translation of higher probability, until it reaches "i am going to stop there ." with probability 9.46e-08.
ability model. We also evaluated the translations
automatically, using the IBM-Bleu metric (Papineni
et al., 2002). The results in Table 1 show that the
phrase-based translation model proposed in this pa-
per significantly outperforms IBM Model 4 on both
the subjective and objective metrics.
6 Discussion
6.1 Limitations
The main shortcoming of the phrase-based model in
this paper concerns the size of the t-table and the
cost of the training procedure we currently apply.
To keep the memory requirements manageable, we
arbitrarily restricted the system to learning phrase
translations of at most six words on each side. Also,
the swap, break, and merge operations used dur-
ing the Viterbi training are computationally expen-
sive. We are currently investigating the applicability
of dynamic programming techniques to increase the
speed of the training procedure.
Clearly, there are language pairs for which it
would be helpful to allow concepts to be realized as
non-contiguous phrases. The English word “not”,
for example, is often translated into two French
words, “ne” and “pas”. But “ne” and “pas” al-
most never occur in adjacent positions in French
texts. At the outset of this work, we attempted to de-
velop a translation model that enables concepts to be
mapped into non-contiguous phrases. But we were
not able to scale and train it on large amounts of data.
The model described in this paper cannot learn that
the English word “not” corresponds to the French
words “ne” and “pas”. However, our model learns
to deal with negation by memorizing longer phrase
translation equivalents, such as (“ne est pas”, “is
not”); (“est inadmissible”, “is not good enough”);
and (“ne est pas ici”, “is not here”).
6.2 Comparison with other work
A number of researchers have already gone be-
yond word-level translations in various MT set-
tings. For example, Melamed (2001) uses word-
level alignments in order to learn translations of non-
compositional compounds. Och and Ney (1999)
learn phrase-to-phrase mappings involving word
classes, which they call “templates”, and exploit
them in a statistical machine translation system. And
Marcu (2001) extracts phrase translations from au-
tomatically aligned corpora and uses them in con-
junction with a word-for-word statistical translation
system. However, none of these approaches learn
simultaneously the translation of phrases/templates
and the translation of words. As a consequence,
there is a chance that the learning procedure will not
discover phrase-level patterns that occur often in the
data. In our approach, phrases are not treated differ-
ently from individual words, and as a consequence
the likelihood of the EM algorithm converging to a
better local maximum is increased.
Working with phrase translations that are learned
independent of a translation model can also affect
the decoder performance. For example, in our pre-
vious work (Marcu, 2001), we have used a statis-
tical translation memory of phrases in conjunction
with a statistical translation model (Brown et al.,
1993). The phrases in the translation memory were
automatically extracted from the Viterbi alignments
produced by Giza (Al-Onaizan et al., 1999) and re-
used in decoding. The decoder described in (Marcu,
2001) starts from a gloss that uses the translations
in the translation memory and then tries to improve
on the gloss translation by modifying it incremen-
tally, in the style described in Section 4. How-
ever, because the decoder hill-climbs on a word-for-
word translation model probability, it often discards
good phrasal translations in favour of word-for-word
translations of higher probability. The decoder in
Section 4 does not have this problem because it hill-
climbs on translation model probabilities in which
phrases play a crucial role.
Acknowledgments. This work was supported by
DARPA-ITO grant N66001-00-1-9814 and by NSF-
STTR grant 0128379.

References
Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin
Knight, John Lafferty, Dan Melamed, Franz-Josef
Och, David Purdy, Noah A. Smith, and David
Yarowsky. 1999. Statistical machine translation. Fi-
nal Report, JHU Summer Workshop.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della
Pietra, and Robert L. Mercer. 1993. The mathematics
of statistical machine translation: Parameter estima-
tion. Computational Linguistics, 19(2):263–311.
Philip Clarkson and Ronald Rosenfeld. 1997. Statistical
language modeling using the CMU-Cambridge toolkit.
In Proceedings of Eurospeech, September.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977.
Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society,
39(Ser B):1–38.
Ulrich Germann, Mike Jahr, Kevin Knight, Daniel
Marcu, and Kenji Yamada. 2001. Fast decoding and
optimal decoding for machine translation. In Proceed-
ings of the 39th Annual Meeting of the Association for
Computational Linguistics (ACL’01), pages 228–235,
Toulouse, France, July 6–11. Decoder available at
http://www.isi.edu/natural-language/projects/rewrite/.
Daniel Marcu. 2001. Towards a unified approach to
memory- and statistical-based machine translation. In
Proceedings of the 39th Annual Meeting of the Associ-
ation for Computational Linguistics (ACL’01), pages
378–385, Toulouse, France, July 6–11.
Dan Melamed. 2001. Empirical Methods for Exploiting
Parallel Texts. The MIT Press.
Franz Josef Och, Christoph Tillmann, and Hermann Ney.
1999. Improved alignment models for statistical ma-
chine translation. In Proceedings of the Joint Work-
shop on Empirical Methods in NLP and Very Large
Corpora, pages 20–28, University of Maryland, Mary-
land.
Kishore Papineni, Salim Roukos, Todd Ward, John Hen-
derson, and Florence Reeder. 2002. Corpus-based
comprehensive and diagnostic MT evaluation: Initial
Arabic, Chinese, French, and Spanish results. In Pro-
ceedings of the Human Language Technology Confer-
ence, pages 124–127, San Diego, CA, March 24-27.
Kenji Yamada and Kevin Knight. 2001. A syntax-based
statistical translation model. In Proceedings of the
39th Annual Meeting of the Association for Compu-
tational Linguistics (ACL’01), Toulouse, France, July
6–11.
