Discriminative Reranking for Machine Translation
Libin Shen
Dept. of Comp. & Info. Science
Univ. of Pennsylvania
Philadelphia, PA 19104
libin@seas.upenn.edu
Anoop Sarkar
School of Comp. Science
Simon Fraser Univ.
Burnaby, BC V5A 1S6
anoop@cs.sfu.ca
Franz Josef Och
Info. Science Institute
Univ. of Southern California
Marina del Rey, CA 90292
och@isi.edu
Abstract
This paper describes the application of discrim-
inative reranking techniques to the problem of
machine translation. For each sentence in the
source language, we obtain from a baseline sta-
tistical machine translation system, a ranked a0 -
best list of candidate translations in the target
language. We introduce two novel perceptron-
inspired reranking algorithms that improve on
the quality of machine translation over the
baseline system based on evaluation using the
BLEU metric. We provide experimental results
on the NIST 2003 Chinese-English large data
track evaluation. We also provide theoretical
analysis of our algorithms and experiments that
verify that our algorithms provide state-of-the-
art performance in machine translation.
1 Introduction
The noisy-channel model (Brown et al., 1990) has been
the foundation for statistical machine translation (SMT)
for over ten years. Recently so-called reranking tech-
niques, such as maximum entropy models (Och and Ney,
2002) and gradient methods (Och, 2003), have been ap-
plied to machine translation (MT), and have provided
significant improvements. In this paper, we introduce
two novel machine learning algorithms specialized for
the MT task.
Discriminative reranking algorithms have also con-
tributed to improvements in natural language parsing
and tagging performance. Discriminative reranking al-
gorithms used for these applications include Perceptron,
Boosting and Support Vector Machines (SVMs). In the
machine learning community, some novel discriminative
ranking (also called ordinal regression) algorithms have
been proposed in recent years. Based on this work, in
this paper, we will present some novel discriminative
reranking techniques applied to machine translation. The
reranking problem for natural language is neither a clas-
sification problem nor a regression problem, and under
certain conditions MT reranking turns out to be quite dif-
ferent from parse reranking.
In this paper, we consider the special issues of apply-
ing reranking techniques to the MT task and introduce
two perceptron-like reranking algorithms for MT rerank-
ing. We provide experimental results that show that the
proposed algorithms achieve start-of-the-art results on the
NIST 2003 Chinese-English large data track evaluation.
1.1 Generative Models for MT
The seminal IBM models (Brown et al., 1990) were
the first to introduce generative models to the MT task.
The IBM models applied the sequence learning paradigm
well-known from Hidden Markov Models in speech
recognition to the problem of MT. The source and tar-
get sentences were treated as the observations, but the
alignments were treated as hidden information learned
from parallel texts using the EM algorithm. This source-
channel model treated the task of finding the probability
a1a3a2a5a4a7a6a9a8a11a10 , where a4 is the translation in the target (English)
language for a given source (foreign) sentence a8 , as two
generative probability models: the language model a1a3a2a5a4a12a10
which is a generative probability over candidate transla-
tions and the translation model a1a3a2a5a8a13a6a14a4a12a10 which is a gener-
ative conditional probability of the source sentence given
a candidate translation a4 .
The lexicon of the single-word based IBM models does
not take word context into account. This means unlikely
alignments are being considered while training the model
and this also results in additional decoding complexity.
Several MT models were proposed as extensions of the
IBM models which used this intuition to add additional
linguistic constraints to decrease the decoding perplexity
and increase the translation quality.
Wang and Waibel (1998) proposed an SMT model
based on phrase-based alignments. Since their transla-
tion model reordered phrases directly, it achieved higher
accuracy for translation between languages with differ-
ent word orders. In (Och and Weber, 1998; Och et al.,
1999), a two-level alignment model was employed to uti-
lize shallow phrase structures: alignment between tem-
plates was used to handle phrase reordering, and word
alignments within a template were used to handle phrase
to phrase translation.
However, phrase level alignment cannot handle long
distance reordering effectively. Parse trees have also
been used in alignment models. Wu (1997) introduced
constraints on alignments using a probabilistic syn-
chronous context-free grammar restricted to Chomsky-
normal form. (Wu, 1997) was an implicit or self-
organizing syntax model as it did not use a Treebank. Ya-
mada and Knight (2001) used a statistical parser trained
using a Treebank in the source language to produce parse
trees and proposed a tree to string model for alignment.
Gildea (2003) proposed a tree to tree alignment model us-
ing output from a statistical parser in both source and tar-
get languages. The translation model involved tree align-
ments in which subtree cloning was used to handle cases
of reordering that were not possible in earlier tree-based
alignment models.
1.2 Discriminative Models for MT
Och and Ney (2002) proposed a framework for MT based
on direct translation, using the conditional modela1a15a2a16a4a7a6a14a8a11a10
estimated using a maximum entropy model. A small
number of feature functions defined on the source and
target sentence were used to rerank the translations gen-
erated by a baseline MT system. While the total num-
ber of feature functions was small, each feature function
was a complex statistical model by itself, as for exam-
ple, the alignment template feature functions used in this
approach.
Och (2003) described the use of minimum error train-
ing directly optimizing the error rate on automatic MT
evaluation metrics such as BLEU. The experiments
showed that this approach obtains significantly better re-
sults than using the maximum mutual information cri-
terion on parameter estimation. This approach used the
same set of features as the alignment template approach
in (Och and Ney, 2002).
SMT Team (2003) also used minimum error training
as in Och (2003), but used a large number of feature func-
tions. More than 450 different feature functions were
used in order to improve the syntactic well-formedness
of MT output. By reranking a 1000-best list generated by
the baseline MT system from Och (2003), the BLEU (Pa-
pineni et al., 2001) score on the test dataset was improved
from 31.6% to 32.9%.
2 Ranking and Reranking
2.1 Reranking for NLP tasks
Like machine translation, parsing is another field of natu-
ral language processing in which generative models have
been widely used. In recent years, reranking techniques,
especially discriminative reranking, have resulted in sig-
nificant improvements in parsing. Various machine learn-
ing algorithms have been employed in parse reranking,
such as Boosting (Collins, 2000), Perceptron (Collins and
Duffy, 2002) and Support Vector Machines (Shen and
Joshi, 2003). The reranking techniques have resulted in a
13.5% error reduction in labeled recall/precision over the
previous best generative parsing models. Discriminative
reranking methods for parsing typically use the notion of
a margin as the distance between the best candidate parse
and the rest of the parses. The reranking problem is re-
duced to a classification problem by using pairwise sam-
ples.
In (Shen and Joshi, 2004), we have introduced a new
perceptron-like ordinal regression algorithm for parse
reranking. In that algorithm, pairwise samples are used
for training and margins are defined as the distance be-
tween parses of different ranks. In addition, the uneven
margin technique has been used for the purpose of adapt-
ing ordinal regression to reranking tasks. In this paper,
we apply this algorithm to MT reranking, and we also
introduce a new perceptron-like reranking algorithm for
MT.
2.2 Ranking and Ordinal Regression
In the field of machine learning, a class of tasks (called
ranking or ordinal regression) are similar to the rerank-
ing tasks in NLP. One of the motivations of this paper
is to apply ranking or ordinal regression algorithms to
MT reranking. In the previous works on ranking or or-
dinal regression, the margin is defined as the distance
between two consecutive ranks. Two large margin ap-
proaches have been used. One is the PRank algorithm,
a variant of the perceptron algorithm, that uses multi-
ple biases to represent the boundaries between every two
consecutive ranks (Crammer and Singer, 2001; Harring-
ton, 2003). However, as we will show in section 3.7, the
PRank algorithm does not work on the reranking tasks
due to the introduction of global ranks. The other ap-
proach is to reduce the ranking problem to a classification
problem by using the method of pairwise samples (Her-
brich et al., 2000). The underlying assumption is that the
samples of consecutive ranks are separable. This may
become a problem in the case that ranks are unreliable
when ranking does not strongly distinguish between can-
didates. This is just what happens in reranking for ma-
chine translation.
3 Discriminative Reranking for MT
The reranking approach for MT is defined as follows:
First, a baseline system generates a0 -best candidates. Fea-
tures that can potentially discriminate between good vs.
bad translations are extracted from these a0 -best candi-
dates. These features are then used to determine a new
ranking for the a0 -best list. The new top ranked candidate
in this a0 -best list is our new best candidate translation.
3.1 Advantages of Discriminative Reranking
Discriminative reranking allows us to use global features
which are unavailable for the baseline system. Second,
we can use features of various kinds and need not worry
about fine-grained smoothing issues. Finally, the statis-
tical machine learning approach has been shown to be
effective in many NLP tasks. Reranking enables rapid
experimentation with complex feature functions, because
the complex decoding steps in SMT are done once to gen-
erate the N-best list of translations.
3.2 Problems applying reranking to MT
First, we consider how to apply discriminative reranking
to machine translation. We may directly use those algo-
rithms that have been successfully used in parse rerank-
ing. However, we immediately find that those algorithms
are not as appropriate for machine translation. Let a4a18a17
be the candidate ranked at the a19 th position for the source
sentence, where ranking is defined on the quality of the
candidates. In parse reranking, we look for parallel hy-
perplanes successfully separating a4a21a20 and a4a18a22a24a23a25a23a25a23a26 for all the
source sentences, but in MT, for each source sentence,
we have a set of reference translations instead of a single
gold standard. For this reason, it is hard to define which
candidate translation is the best. Suppose we have two
translations, one of which is close to reference transla-
tion refa27 while the other is close to reference translation
refa28 . It is difficult to say that one candidate is better than
the other.
Although we might invent metrics to define the qual-
ity of a translation, standard reranking algorithms can-
not be directly applied to MT. In parse reranking, each
training sentence has a ranked list of 27 candidates on
average (Collins, 2000), but for machine translation, the
number of candidate translations in the a0 -best list is much
higher. (SMT Team, 2003) show that to get a reasonable
improvement in the BLEU score at least 1000 candidates
need to be considered in the a0 -best list.
In addition, the parallel hyperplanes separating a4a21a20 and
a4a18a22a29a23a25a23a25a23a26 actually are unable to distinguish good translations
from bad translations, since they are not trained to distin-
guish any translations in a4a18a22a24a23a25a23a25a23a26 . Furthermore, many good
translations in a4 a22a29a23a25a23a25a23a26 may differ greatly from a4 a20 , since
there are multiple references. These facts cause problems
for the applicability of reranking algorithms.
3.3 Splitting
Our first attempt to handle this problem is to redefine the
notion of good translations versus bad translations. In-
stead of separating a4 a20 and a4 a22a29a23a25a23a25a23a26 , we say the top a30 of the
a0 -best translations are good translations, and the bottom
a31 of the
a0 -best translations are bad translations, where
a30a33a32
a31a35a34
a0 . Then we look for parallel hyperplanes split-
ting the top a30 translations and bottom a31 translations for
X2
X1
score−metric
W
margin
bad translations
good translations
others
Figure 1: Splitting for MT Reranking
each sentence. Figure 1 illustrates this situation, where
a0a37a36a39a38a11a40a21a41
a30
a36a43a42 and
a31
a36a44a42 .
3.4 Ordinal Regression
Furthermore, if we only look for the hyperplanes to sepa-
rate the good and the bad translations, we, in fact, discard
the order information of translations of the same class.
Maybe knowing that a4a21a20a46a45a47a45 is better than a4a21a20a48a45a49a20 may be use-
less for training to some extent, but knowing a4a18a22 is better
than a4a18a50 a45a47a45 is useful, if a30 a36a51a42a52a40a9a40 . Although we cannot give
an affirmative answer at this time, it is at least reasonable
to use the ordering information. The problem is how to
use the ordering information. In addition, we only want
to maintain the order of two candidates if their ranks are
far away from each other. On the other hand, we do not
care the order of two translations whose ranks are very
close, e.g. 100 and 101. Thus insensitive ordinal regres-
sion is more desirable and is the approach we follow in
this paper.
3.5 Uneven Margins
However, reranking is not an ordinal regression prob-
lem. In reranking evaluation, we are only interested in the
quality of the translation with the highest score, and we
do not care the order of bad translations. Therefore we
cannot simply regard a reranking problem as an ordinal
regression problem, since they have different definitions
for the loss function.
As far as linear classifiers are concerned, we want to
maintain a larger margin in translations of high ranks and
a smaller margin in translations of low ranks. For exam-
ple,
margina2a16a4 a20 a41a53a4a18a50 a45 a10a55a54 margina2a16a4 a20 a41a47a4 a20a46a45 a10a55a54 margina2a16a4 a22a56a20 a41a53a4a18a50 a45 a10
The reason is that the scoring function will be penalized
if it can not separate a4a21a20 from a4a21a20a46a45 , but not for the case of
a4a18a22a56a20 versus a4 a50 a45 .
3.6 Large Margin Classifiers
There are quite a few linear classifiers1 that can sepa-
rate samples with large margin, such as SVMs (Vapnik,
1998), Boosting (Schapire et al., 1997), Winnow (Zhang,
2000) and Perceptron (Krauth and Mezard, 1987). The
performance of SVMs is superior to other linear classi-
fiers because of their ability to margin maximization.
However, SVMs are extremely slow in training since
they need to solve a quadratic programming search. For
example, SVMs even cannot be used to train on the whole
Penn Treebank in parse reranking (Shen and Joshi, 2003).
Taking this into account, we use perceptron-like algo-
rithms, since the perceptron algorithm is fast in training
which allow us to do experiments on real-world data. Its
large margin version is able to provide relatively good re-
sults in general.
3.7 Pairwise Samples
In previous work on the PRank algorithm, ranks are de-
fined on the entire training and test data. Thus we can
define boundaries between consecutive ranks on the en-
tire data. But in MT reranking, ranks are defined over ev-
ery single source sentence. For example, in our data set,
the rank of a translation is only the rank among all the
translations for the same sentence. The training data in-
cludes about 1000 sentences, each of which normally has
1000 candidate translations with the exception of short
sentences that have a smaller number of candidate trans-
lations. As a result, we cannot use the PRank algorithm
in the reranking task, since there are no global ranks or
boundaries for all the samples.
However, the approach of using pairwise samples does
work. By pairing up two samples, we compute the rel-
ative distance between these two samples in the scoring
metric. In the training phase, we are only interested in
whether the relative distance is positive or negative.
However, the size of generated training samples will
be very large. For a0 samples, the total number of pair-
wise samples in (Herbrich et al., 2000) is roughly a0
a22
. In
the next section, we will introduce two perceptron-like al-
gorithms that utilize pairwise samples while keeping the
complexity of data space unchanged.
4 Reranking Algorithms
Considering the desiderata discussed in the last sec-
tion, we present two perceptron-like algorithms for MT
reranking. The first one is a splitting algorithm specially
designed for MT reranking, which has similarities to a
1Here we only consider linear kernels such as polynomial
kernels.
classification algorithm. We also experimented with an
ordinal regression algorithm proposed in (Shen and Joshi,
2004). For the sake of completeness, we will briefly de-
scribe the algorithm here.
4.1 Splitting
In this section, we will propose a splitting algorithm
which separates translations of each sentence into two
parts, the top a30 translations and the bottom a31 translations.
All the separating hyperplanes are parallel by sharing the
same weight vector a57 . The margin is defined on the dis-
tance between the top a30 items and the bottom a31 items in
each cluster, as shown in Figure 1.
Let a58
a17a60a59a61 be the feature vector of the
a62a9a63a16a64 translation of
the a19a65a63a16a64 sentence, and a66 a17a16a59a61 be the rank for this translation
among all the translations for the a19a67a63a16a64 sentence. Then the
set of training samples is:
a68
a36a39a69a70a2
a58
a17a16a59a61a9a41
a66
a17a60a59a61a11a10a71a6a12a38
a34
a19
a34a73a72
a41a74a38
a34
a62
a34
a0a76a75a52a41
where a72 is the number of clusters and a0 is the length of
ranks for each cluster.
Let a77 a2 a58 a10a78a36 a57a80a79a82a81a24a58 be a linear function, where a58 is the
feature vector of a translation, and a57a80a79 is a weight vector.
We construct a hypothesis function a83 a79a85a84a87a86a89a88a91a90 with a77
as follows.
a83 a79
a2
a58
a20 a41a93a92a94a92a95a92
a58
a26 a10a96a36
a30a14a97
a0
a31
a2
a77
a2
a58
a20 a10a49a41a24a92a95a92a95a92a94a41
a77
a2
a58
a26 a10a53a10a49a41
where a30a14a97 a0 a31 is a function that takes a list of scores for the
candidate translations computed according to the evalua-
tion metric and returns the rank in that list. For example
a30a14a97
a0
a31
a2a16a98a52a40a21a41a100a99a70a40a21a41a47a101a9a40a52a10a76a36a39a2a100a38a9a41a47a42a21a41a47a102a52a10 .
The splitting algorithm searches a linear function
a77
a2
a58
a10a13a36
a57a80a79a103a81a104a58 that successfully splits the top a30 -ranked
and bottom a31 -ranked translations for each sentence,
where a30a13a32
a31a39a34
a0 . Formally, let
a105
a79 a36a106a2
a66
a79
a20
a41a24a92a95a92a94a92a95a41
a66
a79
a26
a10a107a36
a83a108a79
a2
a58
a20a14a41a93a92a94a92a95a92
a58
a26a21a10 for any linear function
a77 . We look for the
function a77 such that
a66
a79
a17
a34
a30 if a66
a17
a34
a30 (1)
a66
a79
a17a107a109
a0a107a110
a31
a32
a38 if
a66
a17
a109
a0a7a110
a31
a32
a38a9a41 (2)
which means that a77 can successfully separate the good
translations and the bad translations.
Suppose there exists a linear function a77 satisfying (1)
and (2), we say a69a18a2 a58 a17a16a59a61a9a41 a66 a17a16a59a61a11a10a56a75 is a111 a1a87a112 a19a65a113a46a113a48a97a18a114 a112a16a115 by a77 given
a0a116a41
a30 and
a31 . Furthermore, we can define the splitting mar-
gin a117 for the translations of the a19a65a63a16a64 sentence as follows.
a117
a2
a77
a41
a19
a10a116a36 a118a120a119a94a121
a61a56a122a123a47a124a60a125a126a49a127a108a128 a77
a2
a58
a17a16a59a61a11a10a76a110 a118a130a129a14a131
a61a56a122a123 a124a132a125a126a49a133 a26a18a134a136a135a29a137a3a20 a77
a2
a58
a17a60a59a61a11a10
The minimal splitting margin, a117a139a138a5a140a29a141
a17
a63 , for a77 given
a0a116a41
a30 and
a31 is defined as follows.
a117 a138a5a140a29a141
a17
a63
a2
a77
a10a142a36 a118a120a119a95a121
a17 a117
a2
a77
a41
a19
a10
a36 a118a120a119a95a121
a17
a2a15a118a120a119a94a121
a123a56a124a60a125a126a29a127a108a128 a77
a2
a58
a17a60a59a61a11a10a116a110 a118a130a129a14a131
a123a56a124a132a125a126 a133 a26a18a134a136a135a29a137a3a20 a77
a2
a58
a17a60a59a61a143a10a53a10
Algorithm 1 splitting
Require: a30 a41 a31 , and a positive learning margin a144 .
1: a113a96a145 a40 , initialize a57
a45
;
2: repeat
3: for (a19 a36a39a38a52a41a24a92a95a92a94a92a95a41 a72 ) do
4: compute a57a130a63a3a81a24a58 a17a60a59a61 , a146 a61 a145 a40 for all a62 ;
5: for (a38 a34 a62a80a147 a112 a34 a0 ) do
6: if a2 a66
a17a60a59a61
a34
a30 and a66
a17a16a59
a141
a109
a0a148a110
a31
a32
a38 and
a57a130a63a55a81
a58
a17a16a59a61
a147a73a57a130a63a3a81a93a58
a17a60a59
a141
a32a149a144 ) then
7: a146
a61
a145a150a146
a61
a32
a38 ;
a146
a141
a145a150a146
a141
a110a73a38 ;
8: else if a2 a66 a17a16a59a61 a109 a0a151a110 a31 a32 a38 and a66 a17a16a59
a141
a34
a30 and a57a130a63a104a81
a58
a17a16a59a61a152a54
a57a130a63a3a81a93a58
a17a60a59
a141
a110
a144 ) then
9: a146 a61 a145a150a146 a61a153a110a154a38 ; a146
a141
a145a150a146
a141
a32
a38 ;
10: end if
11: end for
12: a57a130a63
a137a15a20
a145a150a57a130a63a139a32a73a155
a61
a146
a61
a58
a17a16a59a61 ;
a113a96a145a150a113a156a32
a38 ;
13: end for
14: until no updates made in the outer for loop
Algorithm 1 is a perceptron-like algorithm that looks
for a function that splits the training data. The idea of the
algorithm is as follows. For every two translations a58
a17a16a59a61
and a58
a17a16a59
a141
, if
a157 the rank of
a58
a17a60a59a61 is higher than or equal to
a30 , a66
a17a16a59a61
a34
a30 ,
a157 the rank of
a58
a17a16a59
a141
is lower than a30 , a66
a17a16a59
a141
a109
a0a7a110
a31
a32
a38 ,
a157 the weight vector
a57 can not successfully separate
a2
a58
a17a60a59a61 and
a58
a17a16a59
a141
a10 with a learning margin
a144 , a57a158a81a11a58
a17a16a59a61
a147
a57a159a81a29a58
a17a16a59
a141
a32a148a144 ,
then we need to update a57 with the addition of a58
a17a60a59a61a116a110
a58
a17a16a59
a141
.
However, the updating is not executed until all the in-
consistent pairs in a sentence are found for the purpose
of speeding up the algorithm. When sentence a19 is se-
lected, we first compute and store a57 a63 a81a46a58 a17a60a59a61 for all a62 . Thus
we do not need to recompute a57a130a63a153a81a52a58 a17a16a59a61 again in the in-
ner loop. Now the complexity of a repeat iteration is
a160
a2
a72
a0
a22
a32
a72
a0a139a161a70a10 , where a161 is the average number of active
features in vector a58 a17a60a59a61 . If we updated the weight vector
whenever an inconsistent pair was found, the complexity
of a loop would be a160 a2 a72 a0
a22
a161a18a10 .
The following theorem will show that Algorithm 1 will
stop in finite steps, outputting a function that splits the
training data with a large margin, if the training data is
splittable. Due to lack of space, we omit the proof for
Theorem 1 in this paper.
Theorem 1 Suppose the training samples a69a70a2 a58 a17a16a59a61a9a41 a66 a17a60a59a61a11a10a49a75
are a111 a1a87a112 a19a65a113a46a113a48a97a18a114 a112a16a115 by a linear function defined on the weight
vector a57a107a162 with a splitting margin a117 , where a6a94a6a57a107a162 a6a94a6a76a36a163a38 .
Let a164 a36 a72 a97a70a165 a17a16a59a61 a6a95a6a58 a17a16a59a61 a6a94a6. Then Algorithm 1 makes at most
a26a104a166a53a167a168a166a53a137a139a22a47a169
a170
a166 mistakes on the pairwise samples during the
training.
Algorithm 2 ordinal regression with uneven margin
Require: a positive learning margin a144 .
1: a113a96a145 a40 , initialize a57
a45
;
2: repeat
3: for (sentence a19 a36a171a38a52a41a24a92a95a92a94a92a95a41 a72 ) do
4: compute a57a130a63a3a81a24a58 a17a60a59a61 and a146 a61 a145 a40 for all a62 ;
5: for (a38 a34 a62a80a147 a112 a34 a0 ) do
6: if a2 a66
a17a16a59a61
a147a172a66
a17a60a59
a141
and a161 a19a48a111
a2
a66
a17a16a59a61a52a41
a66
a17a60a59
a141
a10a71a54a44a173 and
a57a130a63a15a81
a58
a17a16a59a61a74a110
a57 a63 a81a24a58
a17a16a59
a141
a147a175a174
a2
a66
a17a16a59a61a9a41
a66
a17a16a59
a141
a10
a144
a10 then
7: a146 a61 a145a150a146 a61 a32a148a174 a2 a66 a17a16a59a61a9a41 a66 a17a16a59
a141
a10a49a176
8: a146
a141
a145a150a146
a141
a110
a174
a2
a66
a17a16a59a61a9a41
a66
a17a16a59
a141
a10a49a176
9: else if a2 a66 a17a60a59a61a177a54 a66 a17a60a59
a141
and a161 a19a48a111 a2 a66 a17a60a59a61a9a41 a66 a17a60a59
a141
a10a163a54
a173 and
a57a130a63a74a81a18a58
a17a60a59
a141
a110
a57a130a63a74a81a52a58
a17a16a59a61
a147a178a174
a2
a66
a17a16a59
a141
a41
a66
a17a16a59a61a11a10
a144
a10
then
10: a146 a61 a145a150a146 a61 a110 a174 a2 a66 a17a16a59
a141
a41
a66
a17a16a59a61 a10a49a176
11: a146
a141
a145a150a146
a141
a32a148a174
a2
a66
a17a16a59
a141
a41
a66
a17a16a59a61 a10a49a176
12: end if
13: end for
14: a57a130a63
a137a15a20
a145a150a57a130a63a139a32a73a155
a61
a146
a61
a58
a17a16a59a61 ;
a113a96a145a150a113a156a32
a38 ;
15: end for
16: until no updates made in the outer for loop
4.2 Ordinal Regression
The second algorithm that we will use for MT reranking
is the a173 -insensitive ordinal regression with uneven mar-
gin, which was proposed in (Shen and Joshi, 2004), as
shown in Algorithm 2.
In Algorithm 2, the function a161 a19a46a111 is used to control the
level of insensitivity, and the function a174 is used to con-
trol the learning margin between pairs of translations with
different ranks as described in Section 3.5. There are
many candidates for a174 . The following definition for a174
is one of the simplest solutions.
a174
a2a179a1a156a41a47a180a104a10a96a181
a38
a1
a110
a38
a180
We will use this function in our experiments on MT
reranking.
5 Experiments and Analysis
We provide experimental results on the NIST 2003
Chinese-English large data track evaluation. We use the
data set used in (SMT Team, 2003). The training data
consists of about 170M English words, on which the
baseline translation system is trained. The training data is
also used to build language models which are used to de-
fine feature functions on various syntactic levels. The de-
velopment data consists of 993 Chinese sentences. Each
Chinese sentence is associated with 1000-best English
translations generated by the baseline MT system. The
development data set is used to estimate the parameters
for the feature functions for the purpose of reranking. The
Table 1: BLEU scores reported in (SMT Team, 2003).
Every single feature was combined with the 6 baseline
features for the training and test. The minimum error
training (Och, 2003) was used on the development data
for parameter estimation.
Feature BLEU%
Baseline 31.6
POS Language Model 31.7
Supertag Language Model 31.7
Wrong NN Position 31.7
Word Popularity 31.8
Aligned Template Models 31.9
Count of Missing Word 31.9
Template Right Continuity 32.0
IBM Model 1 32.5
test data consists of 878 Chinese sentences. Each Chinese
sentence is associated with 1000-best English translations
too. The test set is used to assess the quality of the rerank-
ing output.
In (SMT Team, 2003), 450 features were generated.
Six features from (Och, 2003) were used as baseline fea-
tures. Each of the 450 features was evaluated indepen-
dently by combining it with 6 baseline features and as-
sessing on the test data with the minimum error training.
The baseline BLEU score on the test set is 31.6%. Table
1 shows some of the best performing features.
In (SMT Team, 2003), aggressive search was used to
combine features. After combining about a dozen fea-
tures, the BLEU score did not improve any more, and
the score was 32.9%. It was also noticed that the major
improvement came from the Model 1 feature. By com-
bining the four features, Model 1, matched parentheses,
matched quotation marks and POS language model, the
system achieved a BLEU score of 32.6%.
In our experiments, we will use 4 different kinds of
feature combinations:
a157 Baseline: The 6 baseline features used in (Och,
2003), such as cost of word penalty, cost of aligned
template penalty.
a157 Best Feature: Baseline + IBM Model 1 + matched
parentheses + matched quotation marks + POS lan-
guage model.
a157 Top Twenty: Baseline + 14 features with individual
BLEU score no less than 31.9% with the minimum
error training.
a157 Large Set: Baseline + 50 features with individual
BLEU score no less than 31.7% with the minimum
error training. Since the baseline is 31.6% and the
95% confidence range is a182 0.9%, most of the fea-
tures in this set are not individually discriminative
with respect to the BLEU metric.
We apply Algorithm 1 and 2 to the four feature sets.
For algorithm 1, the splitting algorithm, we set a31 a36a183a42a9a40a52a40
in the 1000-best translations given by the baseline MT
system. For algorithm 2, the ordinal regression algo-
rithm, we set the updating condition as a66 a17a16a59a61a82a184 a102 a147a89a66 a17a16a59
a141and
a66
a17a60a59a61
a32
a102a104a40
a147a51a66
a17a60a59
a141
, which means one’s rank number is
at most half of the other’s and there are at least 20 ranks
in between. Figures 2-9 show the results of using Al-
gorithm 1 and 2 with the four feature sets. The a165 -axis
represents the number of iterations in the training. The
left a66 -axis stands for the BLEU% score on the test data,
and the right a66 -axis stands for log of the loss function on
the development data.
Algorithm 1, the splitting algorithm, converges on the
first three feature sets. The smaller the feature set is, the
faster the algorithm converges. It achieves a BLEU score
of 31.7% on the Baseline, 32.8% on the Best Feature, but
only 32.6% on the Top Twenty features. However it is
within the range of 95% confidence. Unfortunately on
the Large Set, Algorithm 1 converges very slowly.
In the Top Twenty set there are a fewer number of in-
dividually non-discriminative feature making the pool of
features “better”. In addition, generalization performance
in the Top Twenty set is better than the Large Set due to
the smaller set of “better” features, cf. (Shen and Joshi,
2004). If the number of the non-discriminative features
is large enough, the data set becomes unsplittable. We
have tried using the a185 trick as in (Li et al., 2002) to make
data separable artificially, but the performance could not
be improved with such features.
We achieve similar results with Algorithm 2, the or-
dinal regression with uneven margin. It converges on
the first 3 feature sets too. On the Baseline, it achieves
31.4%. We notice that the model is over-trained on the
development data according to the learning curve. In the
Best Feature category, it achieves 32.7%, and on the Top
Twenty features, it achieves 32.9%. This algorithm does
not converge on the Large Set in 10000 iterations.
We compare our perceptron-like algorithms with the
minimum error training used in (SMT Team, 2003) as
shown in Table 2. The splitting algorithm achieves
slightly better results on the Baseline and the Best Fea-
ture set, while the minimum error training and the regres-
sion algorithm tie for first place on feature combinations.
However, the differences are not significant.
We notice in those separable feature sets the perfor-
mance on the development data and the test data are
tightly consistent. Whenever the log-loss on the devel-
opment set is decreased, and BLEU score on the test set
goes up, and vice versa. This tells us the merit of these
two algorithms; By optimizing on the loss function for
Table 2: Comparison between the minimum error
training with discriminative reranking on the test data
(BLEU%)
Algorithm Baseline Best Feat Feat Comb
Minimum Error 31.6 32.6 32.9
Splitting 31.7 32.8 32.6
Regression 31.4 32.7 32.9
the development data, we can improve performance on
the test data. This property is guaranteed by the theoreti-
cal analysis and is borne out in the experimental results.
6 Conclusions and Future Work
In this paper, we have successfully applied the discrim-
inative reranking to machine translation. We applied a
new perceptron-like splitting algorithm and ordinal re-
gression algorithm with uneven margin to reranking in
MT. We provide a theoretical justification for the perfor-
mance of the splitting algorithms. Experimental results
provided in this paper show that the proposed algorithms
provide state-of-the-art performance in the NIST 2003
Chinese-English large data track evaluation.
Acknowledgments
This material is based upon work supported by the Na-
tional Science Foundation under Grant No. 0121285.
The first author was partially supported by JHU post-
workshop fellowship and NSF Grant ITR-0205456. The
second author is partially supported by NSERC, Canada
(RGPIN: 264905). We thank the members of the SMT
team of JHU Workshop 2003 for help on the dataset and
three anonymous reviewers for useful comments.
References
P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Je-
linek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. 1990.
A statistical approach to machine translation. Computational
Linguistics, 16(2):79–85.
M. Collins and N. Duffy. 2002. New ranking algorithms for
parsing and tagging: Kernels over discrete structures, and
the voted perceptron. In Proceedings of ACL 2002.
M. Collins. 2000. Discriminative reranking for natural lan-
guage parsing. In Proceedings of the 7th ICML.
K. Crammer and Y. Singer. 2001. PRanking with Ranking. In
NIPS 2001.
D. Gildea. 2003. Loosely tree-based alignment for machine
translation. In ACL 2003.
E. F. Harrington. 2003. Online Ranking/Collaborative Filtering
Using the Perceptron Algorithm. In ICML.
R. Herbrich, T. Graepel, and K. Obermayer. 2000. Large mar-
gin rank boundaries for ordinal regression. In A.J. Smola,
P. Bartlett, B. Sch¨olkopf, and D. Schuurmans, editors, Ad-
vances in Large Margin Classifiers, pages 115–132. MIT
Press.
W. Krauth and M. Mezard. 1987. Learning algorithms with
optimal stability in neural networks. Journal of Physics A,
20:745–752.
Y. Li, H. Zaragoza, R. Herbrich, J. Shawe-Taylor, and J. Kan-
dola. 2002. The perceptron algorithm with uneven margins.
In Proceedings of ICML 2002.
F. J. Och and H. Ney. 2002. Discriminative training and max-
imum entropy models for statistical machine translation. In
ACL 2002.
F. J. Och and H. Weber. 1998. Improving statistical natural
language translation with categories and rules. In COLING-
ACL 1998.
F. J. Och, C. Tillmann, and H. Ney. 1999. Improved alignment
models for statistical machine. In EMNLP-WVLC 1999.
F. J. Och. 2003. Minimum error rate training for statistical
machine translation. In ACL 2003.
K. Papineni, S. Roukos, and T. Ward. 2001. Bleu: a method for
automatic evaluation of machine translation. IBM Research
Report, RC22176.
R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. 1997.
Boosting the margin: a new explanation for the effectiveness
of voting methods. In Proc. 14th ICML.
L. Shen and A. K. Joshi. 2003. An SVM based voting algo-
rithm with application to parse reranking. In Proc. of CoNLL
2003.
L. Shen and A. K. Joshi. 2004. Flexible margin selection for
reranking with full pairwise samples. In Proc. of 1st IJC-
NLP.
SMT Team. 2003. Final report: Syntax for statisti-
cal machine translation. JHU Summer Workshop 2003,
http://www.clsp.jhu.edu/ws2003/groups/translate.
V. N. Vapnik. 1998. Statistical Learning Theory. John Wiley
and Sons, Inc.
Y. Wang and A. Waibel. 1998. Modeling with structures in
statistical machine translation. In COLING-ACL 1998.
D. Wu. 1997. Stochastic inversion transduction grammars and
bilingual parsing of parallel corpora. Computational Lin-
guistics, 23(3):377–400.
K. Yamada and K. Knight. 2001. A syntax-based statistical
translation model. In ACL 2001.
T. Zhang. 2000. Large Margin Winnow Methods for Text Cat-
egorization. In KDD-2000 Workshop on Text Mining.
29
30
31
32
33
34
0 50 100 150 200 250 300 350 400
bleu% on test log-loss on dev
a186
# iteration
bleu% on test 
log-loss on dev
Figure 2: Splitting on Baseline
29
30
31
32
33
34
0 50 100 150 200 250 300 350 400
bleu% on test log-loss on dev
a186
# iteration
bleu% on test 
log-loss on dev
Figure 3: Splitting on Best Feature
29
30
31
32
33
34
0 100 200 300 400 500 600
bleu% on test log-loss on dev
a186
# iteration
bleu% on test 
log-loss on dev
Figure 4: Splitting on Top Twenty
29
30
31
32
33
34
0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000
bleu% on test log-loss on dev
a186
# iteration
bleu% on test 
log-loss on dev
Figure 5: Splitting on Large Set
29
30
31
32
33
34
0 2000 4000 6000 8000 10000
bleu% on test log-loss on dev
a186
# iteration
bleu% on test 
log-loss on dev
Figure 6: Ordinal Regression on Baseline
29
30
31
32
33
34
0 2000 4000 6000 8000 10000
bleu% on test log-loss on dev
a186
# iteration
bleu% on test 
log-loss on dev
Figure 7: Ordinal Regression on Best Feature
29
30
31
32
33
34
0 2000 4000 6000 8000 10000
bleu% on test log-loss on dev
a186
# iteration
bleu% on test 
log-loss on dev
Figure 8: Ordinal Regression on Top Twenty
29
30
31
32
33
34
0 2000 4000 6000 8000 10000
bleu% on test log-loss on dev
a186
# iteration
bleu% on test 
log-loss on dev
Figure 9: Ordinal Regression on Large Set
