Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 73–80, Vancouver, October 2005. c©2005 Association for Computational Linguistics
A Discriminative Matching Approach to Word Alignment
Ben Taskar Simon Lacoste-Julien Dan Klein
Computer Science Division, EECS Department
University of California, Berkeley
Berkeley, CA 94720
Abstract
We present a discriminative, large-
margin approach to feature-based
matching for word alignment. In this
framework, pairs of word tokens re-
ceive a matching score, which is based
on features of that pair, including mea-
sures of association between the words,
distortion between their positions, sim-
ilarity of the orthographic form, and so
on. Even with only 100 labeled train-
ing examples and simple features which
incorporate counts from a large unla-
beled corpus, we achieve AER perfor-
mance close to IBM Model 4, in much
less time. Including Model 4 predic-
tions as features, we achieve a relative
AER reduction of 22% in over inter-
sected Model 4 alignments.
1 Introduction
The standard approach to word alignment from
sentence-aligned bitexts has been to construct
models which generate sentences of one lan-
guage from the other, then fitting those genera-
tive models with EM (Brown et al., 1990; Och
and Ney, 2003). This approach has two primary
advantages and two primary drawbacks. In its
favor, generative models of alignment are well-
suited for use in a noisy-channel translation sys-
tem. In addition, they can be trained in an un-
supervised fashion, though in practice they do
require labeled validation alignments for tuning
model hyper-parameters, such as null counts or
smoothing amounts, which are crucial to pro-
ducing alignments of good quality. A primary
drawback of the generative approach to align-
ment is that, as in all generative models, explic-
itly incorporating arbitrary features of the in-
put is diﬃcult. For example, when considering
whether to align two words in the IBM models
(Brown et al., 1990), one cannot easily include
information about such features as orthographic
similarity (for detecting cognates), presence of
the pair in various dictionaries, similarity of the
frequency of the two words, choices made by
other alignment systems on this sentence pair,
and so on. While clever models can implicitly
capture some of these information sources, it
takes considerable work, and can make the re-
sulting models quite complex. A second draw-
back of generative translation models is that,
since they are learned with EM, they require
extensive processing of large amounts of data
to achieve good performance. While tools like
GIZA++ (Och and Ney, 2003) do make it eas-
ier to build on the long history of the generative
IBM approach, they also underscore how com-
plex high-performance generative models can,
and have, become.
In this paper, we present a discriminative ap-
proach to word alignment. Word alignment is
cast as a maximum weighted matching problem
(Cormen et al., 1990) in which each pair of words
(e
j
,f
k
) in a sentence pair (e,f) is associated
with a score s
jk
(e,f) reflecting the desirability
of the alignment of that pair. The alignment
73
for the sentence pair is then the highest scoring
matching under some constraints, for example
the requirement that matchings be one-to-one.
This view of alignment as graph matching is
not, in itself, new: Melamed (2000) uses com-
petitive linking to greedily construct matchings
where the pair score is a measure of word-
to-word association, and Matusov et al. (2004)
find exact maximum matchings where the pair
scores come from the alignment posteriors of
generative models. Tiedemann (2003) proposes
incorporating a variety of word association
“clues” into a greedy linking algorithm.
What we contribute here is a principled ap-
proach for tractable and eﬃcient learning of the
alignment score s
jk
(e,f) as a function of ar-
bitrary features of that token pair. This con-
tribution opens up the possibility of doing the
kind of feature engineering for alignment that
has been so successful for other NLP tasks. We
first present the algorithm for large margin es-
timation of the scoring function. We then show
that our method can achieve AER rates com-
parable to unsymmetrized IBM Model 4, using
extremely little labeled data (as few as 100 sen-
tences) and a simple feature set. Remarkably,
by including bi-directional IBM Model 4 predic-
tions as features, we achieve an absolute AER
of 5.4 on the English-French Hansards alignment
task, a relative reduction of 22% in AER over in-
tersected Model 4 alignments and, to our knowl-
edge, the best AER result published on this task.
2 Algorithm
We model the alignment prediction task as a
maximum weight bipartite matching problem,
where nodes correspond to the words in the
two sentences. For simplicity, we assume here
that each word aligns to one or zero words in
the other sentence. The edge weight s
jk
repre-
sents the degree to which word j in one sentence
can translate into the word k in the other sen-
tence. Our goal is to find an alignment that
maximizes the sum of edge scores. We represent
a matching using a set of binary variables y
jk
that are set to 1 if word j is assigned to word
k in the other sentence, and 0 otherwise. The
score of an assignment is the sum of edge scores:
s(y)=
summationtext
jk
s
jk
y
jk
.Themaximumweightbi-
partite matching problem, arg max
y∈Y
s(y), can
be solved using well known combinatorial algo-
rithms or the following linear program:
max
z
summationdisplay
jk
s
jk
z
jk
(1)
s.t.
summationdisplay
j
z
jk
≤ 1,
summationdisplay
k
z
jk
≤ 1, 0 ≤ z
jk
≤ 1,
where the continuous variables z
jk
correspond to
the binary variables y
jk
. This LP is guaranteed
to have integral (and hence optimal) solutions
for any scoring function s(y) (Schrijver, 2003).
Note that although the above LP can be used to
compute alignments, combinatorial algorithms
are generally more eﬃcient. However, we use
the LP to develop the learning algorithm below.
For a sentence pair x, we denote position
pairs by x
jk
and their scores as s
jk
.Welet
s
jk
= w
latticetop
f(x
jk
) for some user provided fea-
ture mapping f and abbreviate w
latticetop
f(x,y)=
summationtext
jk
y
jk
w
latticetop
f(x
jk
). We can include in the fea-
ture vector the identity of the two words, their
relative positions in their respective sentences,
their part-of-speech tags, their string similarity
(for detecting cognates), and so on.
At this point, one can imagine estimating a
linear matching model in multiple ways, includ-
ing using conditional likelihood estimation, an
averaged perceptron update (see which match-
ings are proposed and adjust the weights ac-
cording to the diﬀerence between the guessed
and target structures (Collins, 2002)), or in
large-margin fashion. Conditional likelihood es-
timation using a log-linear model P(y | x)=
1
Z
w
(x)
exp{w
latticetop
f(x,y)} requires summing over all
matchings to compute the normalization Z
w
(x),
which is #P-complete (Valiant, 1979). In our
experiments, we therefore investigated the aver-
aged perceptron in addition to the large-margin
method outlined below.
2.1 Large-margin estimation
We follow the large-margin formulation of
Taskar et al. (2005a). Our input is a set of
training instances {(x
i
,y
i
)}
m
i=1
,whereeachin-
stance consists of a sentence pair x
i
and a target
74
alignment y
i
. We would like to find parameters
w that predict correct alignments on the train-
ing data:
y
i
=argmax
¯y
i
∈Y
i
w
latticetop
f(x
i
, ¯y
i
), ∀i,
where Y
i
is the space of matchings appropriate
for the sentence pair i.
In standard classification problems, we typi-
cally measure the error of prediction, lscript(y
i
, ¯y
i
),
using the simple 0-1 loss. In structured prob-
lems, where we are jointly predicting multiple
variables, the loss is often more complex. While
the F-measure is a natural loss function for this
task, we instead chose a sensible surrogate that
fits better in our framework: Hamming distance
between y
i
and ¯y
i
, which simply counts the
number of edges predicted incorrectly.
We use an SVM-like hinge upper bound on
the loss lscript(y
i
, ¯y
i
), given by max
¯y
i
∈Y
i
[w
latticetop
f
i
(¯y
i
)+
lscript
i
(¯y
i
) − w
latticetop
f
i
(y
i
)], where lscript
i
(¯y
i
)=lscript(y
i
, ¯y
i
), and
f
i
(¯y
i
)=f(x
i
, ¯y
i
). Minimizing this upper bound
encourages the true alignment y
i
to be optimal
with respect to w for each instance i:
min
||w||≤γ
summationdisplay
i
max
¯y
i
∈Y
i
[w
latticetop
f
i
(¯y
i
)+lscript
i
(¯y
i
)] − w
latticetop
f
i
(y
i
),
where γ is a regularization parameter.
In this form, the estimation problem is a mix-
ture of continuous optimization over w and com-
binatorial optimization over y
i
.Inorderto
transform it into a more standard optimization
problem, we need a way to eﬃciently handle the
loss-augmented inference, max
¯y
i
∈Y
i
[w
latticetop
f
i
(¯y
i
)+
lscript
i
(¯y
i
)]. This optimization problem has pre-
cisely the same form as the prediction prob-
lem whose parameters we are trying to learn
—max
¯y
i
∈Y
i
w
latticetop
f
i
(¯y
i
) — but with an additional
term corresponding to the loss function. Our as-
sumption that the loss function decomposes over
the edges is crucial to solving this problem. In
particular, we use weighted Hamming distance,
which counts the number of variables in which
a candidate solution ¯y
i
diﬀers from the target
output y
i
, with diﬀerent cost for false positives
(c
+
) and false negatives (c
-
):
lscript
i
(¯y
i
)=
summationdisplay
jk
bracketleftbig
c
-
y
i,jk
(1 − ¯y
i,jk
)+c
+
¯y
i,jk
(1 − y
i,jk
)
bracketrightbig
=
summationdisplay
jk
c
-
y
i,jk
+
summationdisplay
jk
[c
+
− (c
-
+ c
+
)y
i,jk
]¯y
i,jk
.
The loss-augmented matching problem can then
be written as an LP similar to Equation 1 (with-
out the constant term
summationtext
jk
c
-
y
i,jk
):
max
z
summationdisplay
jk
z
i,jk
[w
latticetop
f(x
i,jk
)+c
+
− (c
-
+ c
+
)y
i,jk
]
s.t.
summationdisplay
j
z
i,jk
≤ 1,
summationdisplay
k
z
i,jk
≤ 1, 0 ≤ z
i,jk
≤ 1.
Hence, without any approximations, we have a
continuous optimization problem instead of a
combinatorial one:
max
¯y
i
∈Y
i
w
latticetop
f
i
(¯y
i
)+lscript
i
(¯y
i
)=d
i
+max
z
i
∈Z
i
(w
latticetop
F
i
+c
i
)
latticetop
z
i
,
where d
i
=
summationtext
jk
c
-
y
i,jk
is the constant term, F
i
is the appropriate matrix that has a column of
features f(x
i,jk
) for each edge jk, c
i
is the vector
of the loss terms c
+
− (c
-
+ c
+
)y
i,jk
and finally
Z
i
= {z
i
:
summationtext
j
z
i,jk
≤ 1,
summationtext
k
z
i,jk
≤ 1, 0 ≤
z
i,jk
≤ 1}.
Plugging this LP back into our estimation
problem, we have
min
||w||≤γ
max
z∈Z
summationdisplay
i
w
latticetop
F
i
z
i
+ c
latticetop
i
z
i
− w
latticetop
F
i
y
i
, (2)
where z = {z
1
,...,z
m
}, Z = Z
1
×...×Z
m
.In-
stead of the derivation in Taskar et al. (2005a),
which produces a joint convex optimization
problem using Lagrangian duality, here we
tackle the problem in its natural saddle-point
form.
2.2 The extragradient method
For saddle-point problems, a well-known solu-
tion strategy is the extragradient method (Ko-
rpelevich, 1976), which is closely related to
projected-gradient methods.
The gradient of the objective in Equation 2
is given by:
summationtext
i
F
i
(z
i
− y
i
) (with respect to w)
and F
latticetop
i
w + c
i
(with respect to each z
i
). We de-
note the Euclidean projection of a vector onto
Z
i
as P
Z
i
(v)=argmin
u∈Z
i
||v − u|| and pro-
jection onto the ball ||w|| ≤ γ as P
γ
(w)=
γw/max(γ,||w||).
75
An iteration of the extragradient method con-
sists of two very simple steps, prediction:
¯w
t+1
= P
γ
(w
t
+ β
k
summationdisplay
i
F
i
(y
i
− z
t
i
));
¯z
t+1
i
= P
Z
i
(z
t
i
+ β
k
(F
latticetop
i
w
t
+ c
i
));
and correction:
w
t+1
= P
γ
(w
t
+ β
k
summationdisplay
i
F
i
(y
i
− ¯z
t+1
i
));
z
t+1
i
= P
Z
i
(z
t
i
+ β
k
(F
latticetop
i
¯w
t+1
+ c
i
)),
where β
k
are appropriately chosen step sizes.
The method is guaranteed to converge linearly
toasolutionw
∗
,z
∗
(Korpelevich, 1976; He and
Liao, 2002; Taskar et al., 2005b). Please see
www.cs.berkeley.edu/
~
taskar/extragradient.pdf
for more details.
The key subroutine of the algorithm is Eu-
clidean projection onto the feasible sets Z
i
.In
case of word alignment, Z
i
is the convex hull of
bipartite matchings and the problem reduces to
the much-studied minimum cost quadratic flow
problem (Bertsekas et al., 1997). The projection
problem P
Z
i
(z
prime
i
)isgivenby
min
z
summationdisplay
jk
1
2
(z
prime
i,jk
− z
i,jk
)
2
s.t.
summationdisplay
j
z
i,jk
≤ 1,
summationdisplay
k
z
i,jk
≤ 1, 0 ≤ z
i,jk
≤ 1.
We can now use a standard reduction of bipar-
tite matching to min cost flow by introducing a
source node connected to all the words in one
sentence and a sink node connected to all the
words in the other sentence, using edges of ca-
pacity 1 and cost 0. The original edges jk have
a quadratic cost
1
2
(z
prime
i,jk
−z
i,jk
)
2
and capacity 1.
Now the minimum cost flow from the source to
the sink computes projection of z
prime
i
onto Z
i
We
use standard, publicly-available code for solving
this problem (Guerriero and Tseng, 2002).
3 Experiments
We applied this matching algorithm to word-
level alignment using the English-French
Hansards data from the 2003 NAACL shared
task (Mihalcea and Pedersen, 2003). This
corpus consists of 1.1M automatically aligned
sentences, and comes with a validation set of 39
sentence pairs and a test set of 447 sentences.
The validation and test sentences have been
hand-aligned (see Och and Ney (2003)) and are
marked with both sure and possible alignments.
Using these alignments, alignment error rate
(AER) is calculated as:
AER(A,S,P)=1−
|A ∩ S| + |A ∩ P|
|A| + |S|
Here, A is a set of proposed index pairs, S is
the sure gold pairs, and P is the possible gold
pairs. For example, in Figure 1, proposed align-
ments are shown against gold alignments, with
open squares for sure alignments, rounded open
squares for possible alignments, and filled black
squares for proposed alignments.
Since our method is a supervised algorithm,
we need labeled examples. For the training data,
we split the original test set into 100 training
examples and 347 test examples. In all our ex-
periments, we used a structured loss function
lscript(y
i
, ¯y
i
) that penalized false negatives 3 times
more than false positives, where 3 was picked by
testing several values on the validation set. In-
stead of selecting a regularization parameter γ
and running to convergence, we used early stop-
ping as a cheap regularization method, by set-
ting γ to a very large value (10000) and running
the algorithm for 500 iterations. We selected a
stopping point using the validation set by simply
picking the best iteration on the validation set in
terms of AER (ignoring the initial ten iterations,
which were very noisy in our experiments). All
selected iterations turned out to be in the first
50 iterations, as the algorithm converged fairly
rapidly.
3.1 Features and Results
Very broadly speaking, the classic IBM mod-
els of word-level translation exploit four primary
sources of knowledge and constraint: association
of words (all IBM models), competition between
alignments (all models), zero- or first-order pref-
erences of alignment positions (2,4+), and fer-
tility (3+). We model all of these in some way,
76
one
of
the
major
objectives
of
these
consultations
is to
make sure that
the
recovery benefits
all
.
le
un
de
les
grands
objectifs
de
les
consultations
est
de
faire
en
sorte
que
la
relance
profite
´egalement
`a
tous
.
one
of
the
major
objectives
of
these
consultations
is to
make sure that
the
recovery benefits
all
.
le
un
de
les
grands
objectifs
de
les
consultations
est
de
faire
en
sorte
que
la
relance
profite
´egalement
`a
tous
.
(a) Dice only (b) Dice and Distance
one
of
the
major
objectives
of
these
consultations
is to
make sure that
the
recovery benefits
all
.
le
un
de
les
grands
objectifs
de
les
consultations
est
de
faire
en
sorte
que
la
relance
profite
´egalement
`a
tous
.
one
of
the
major
objectives
of
these
consultations
is to
make sure that
the
recovery benefits
all
.
le
un
de
les
grands
objectifs
de
les
consultations
est
de
faire
en
sorte
que
la
relance
profite
´egalement
`a
tous
.
(c) Dice, Distance, Orthographic, and BothShort (d) All features
Figure 1: Example alignments for each successive feature set.
except fertility.
1
First, and, most importantly, we want to in-
clude information about word association; trans-
lation pairs are likely to co-occur together in
a bitext. This information can be captured,
among many other ways, using a feature whose
1
In principle, we can model also model fertility, by
allowing 0-k matches for each word rather than 0-1, and
having bias features on each word. However, we did not
explore this possibility.
value is the Dice coeﬃcient (Dice, 1945):
Dice(e,f)=
2C
EF
(e,f)
C
E
(e)C
F
(f)
Here, C
E
and C
F
are counts of word occurrences
in each language, while C
EF
is the number of
co-occurrences of the two words. With just this
feature on a pair of word tokens (which depends
only on their types), we can already make a stab
77
at word alignment, aligning, say, each English
word with the French word (or null) with the
highest Dice value (see (Melamed, 2000)), sim-
ply as a matching-free heuristic model. With
Dice counts taken from the 1.1M sentences, this
gives and AER of 38.7 with English as the tar-
get, and 36.0 with French as the target (in line
with the numbers from Och and Ney (2003)).
As observed in Melamed (2000), this use of
Dice misses the crucial constraint of competi-
tion: a candidate source word with high asso-
ciation to a target word may be unavailable for
alignment because some other target has an even
better aﬃnity for that source word. Melamed
uses competitive linking to incorporate this con-
straint explicitly, while the IBM-style models
get this eﬀect via explaining-away eﬀects in EM
training. We can get something much like the
combination of Dice and competitive linking by
running with just one feature on each pair: the
Dice value of that pair’s words.
2
With just a
Dice feature – meaning no learning is needed
yet – we achieve an AER of 29.8, between the
Dice with competitive linking result of 34.0 and
Model 1 of 25.9 given in Och and Ney (2003).
An example of the alignment at this stage is
shown in Figure 1(a). Note that most errors lie
oﬀ the diagonal, for example the often-correct
to-`a match.
IBM Model 2, as usually implemented, adds
the preference of alignments to lie near the di-
agonal. Model 2 is driven by the product of a
word-to-word measure and a (usually) Gaussian
distribution which penalizes distortion from the
diagonal. We can capture the same eﬀect us-
ing features which reference the relative posi-
tions j and k of a pair (e
j
,f
k
). In addition to a
Model 2-style quadratic feature referencing rela-
tive position, we threw in the following proxim-
ity features: absolute diﬀerence in relative posi-
tion abs(j/|e|−k/|f|), and the square and square
root of this value. In addition, we used a con-
junction feature of the dice coeﬃcient times the
proximity. Finally, we added a bias feature on
each edge, which acts as a threshold that allows
2
This isn’t quite competitive linking, because we use
a non-greedy matching.
in
1978
Americans
divorced
1,122,000
times
.
en
1978
,
on
a
enregistr´e
1,122,000
divorces
sur
le
continent
.
in
1978
Americans
divorced
1,122,000
times
.
en
1978
,
on
a
enregistr´e
1,122,000
divorces
sur
le
continent
.
(a) (b)
Figure 2: Example alignments showing the ef-
fects of orthographic cognate features. (a) Dice
and Distance, (b) With Orthographic Features.
sparser, higher precision alignments. With these
features, we got an AER of 15.5 (compare to 19.5
for Model 2 in (Och and Ney, 2003)). Note that
we already have a capacity that Model 2 does
not: we can learn a non-quadratic penalty with
linear mixtures of our various components – this
gives a similar eﬀect to learning the variance of
the Gaussian for Model 2, but is, at least in
principle, more flexible.
3
These features fix the
to-`a error in Figure 1(a), giving the alignment
in Figure 1(b).
On top of these features, we included other
kinds of information, such as word-similarity
features designed to capture cognate (and ex-
act match) information. We added a feature for
exact match of words, exact match ignoring ac-
cents, exact matching ignoring vowels, and frac-
tion overlap of the longest common subsequence.
Since these measures were only useful for long
words, we also added a feature which indicates
that both words in a pair are short. These or-
thographic and other features improved AER to
14.4. The running example now has the align-
ment in Figure 1(c), where one improvement
may be attributable to the short pair feature – it
has stopped proposing the-de, partially because
the short pair feature downweights the score of
that pair. A clearer example of these features
making a diﬀerence is shown in Figure 2, where
both the exact-match and character overlap fea-
3
The learned response was in fact close to a Gaussian,
but harsher near zero displacement.
78
tures are used.
One source of constraint which our model still
does not explicitly capture is the first-order de-
pendency between alignment positions, as in the
HMM model (Vogel et al., 1996) and IBM mod-
els 4+. The the-le error in Figure 1(c) is symp-
tomatic of this lack. In particular, it is a slightly
better pair according to the Dice value than the
correct the-les. However, the latter alignment
has the advantage that major-grands follows it.
To use this information source, we included a
feature which gives the Dice value of the words
following the pair.
4
We also added a word-
frequency feature whose value is the absolute
diﬀerence in log rank of the words, discourag-
ing very common words from translating to very
rare ones. Finally, we threw in bilexical features
of the pairs of top 5 non-punctuation words in
each language.
5
This helped by removing spe-
cific common errors like the residual tendency
for French de to mistakenly align to English the
(the two most common words). The resulting
model produces the alignment in Figure 1(d).
It has sorted out the the-le / the-les confusion,
and is also able to guess to-de, which is not the
most common translation for either word, but
which is supported by the good Dice value on
the following pair (make-faire).
With all these features, we got a final AER
of 10.7, broadly similar to the 8.9 or 9.7 AERs
of unsymmetrized IBM Model 4 trained on the
same data that the Dice counts were taken
from.
6
Of course, symmetrizing Model 4 by in-
tersecting alignments from both directions does
yield an improved AER of 6.9, so, while our
model does do surprisingly well with cheaply ob-
tained count-based features, Model 4 does still
outperform it so far. However, our model can
4
It is important to note that while our matching algo-
rithm has no first-order eﬀects, the features can encode
such eﬀects in this way, or in better ways – e.g. using as
features posteriors from the HMM model in the style of
Matusov et al. (2004).
5
The number of such features which can be learned
depends on the number of training examples, and since
some of our experiments used only a few dozen training
examples we did not make heavy use of this feature.
6
Note that the common word pair features aﬀected
common errors and therefore had a particularly large im-
pact on AER.
Model AER
Dice (without matching) 38.7 / 36.0
Model 4 (E-F, F-E, intersected) 8.9 / 9.7/ 6.9
Discriminative Matching
Dice Feature Only 29.8
+ Distance Features 15.5
+ Word Shape and Frequency 14.4
+ Common Words and Next-Dice 10.7
+ Model 4 Predictions 5.4
Figure 3: AER on the Hansards task.
also easily incorporate the predictions of Model
4 as additional features. We therefore added
three new features for each edge: the prediction
of Model 4 in the English-French direction, the
prediction in the French-English direction, and
the intersection of the two predictions. With
these powerful new features, our AER dropped
dramatically to 5.4, a 22% improvement over the
intersected Model 4 performance.
Another way of doing the parameter estima-
tion for this matching task would have been
to use an averaged perceptron method, as in
Collins (2002). In this method, we merely run
our matching algorithm and update weights
based on the diﬀerence between the predicted
and target matchings. However, the perfor-
mance of the average perceptron learner on the
same feature set is much lower, only 8.1, not
even breaking the AER of its best single feature
(the intersected Model 4 predictions).
3.2 Scaling Experiments
We explored the scaling of our method by learn-
ing on a larger training set, which we created by
using GIZA++ intersected bi-directional Model
4 alignments for the unlabeled sentence pairs.
We then took the first 5K sentence pairs from
these 1.1M Model 4 alignments. This gave us
more training data, albeit with noisier labels.
On a 3.4GHz Intel Xeon CPU, GIZA++ took
18 hours to align the 1.1M words, while our
method learned its weights in between 6 min-
utes (100 training sentences) and three hours
(5K sentences).
79
4 Conclusions
We have presented a novel discriminative, large-
margin method for learning word-alignment
models on the basis of arbitrary features of word
pairs. We have shown that our method is suit-
able for the common situation where a moder-
ate number of good, fairly general features must
be balanced on the basis of a small amount of
labeled data. It is also likely that the method
will be useful in conjunction with a large labeled
alignment corpus (should such a set be created).
We presented features capturing a few separate
sources of information, producing alignments on
the order of those given by unsymmetrized IBM
Model 4 (using labeled training data of about
thesizeothershaveusedtotunegenerative
models). In addition, when given bi-directional
Model 4 predictions as features, our method
provides a 22% AER reduction over intersected
Model 4 predictions alone. The resulting 5.4
AER on the English-French Hansarks task is,
to our knowledge, the best published AER fig-
ure for this training scenario (though since we
use a subset of the test set, evaluations are not
problem-free). Finally, our method scales to
large numbers of training sentences and trains
in minutes rather than hours or days for the
higher-numbered IBM models, a particular ad-
vantage when not using features derived from
those slower models.
References
D. P. Bertsekas, L. C. Polymenakos, and P. Tseng. 1997.
An e-relaxation method for separable convex cost net-
work flow problems. SIAM J. Optim., 7(3):853–870.
P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della
Pietra, F. Jelinek, J. D. Laﬀerty, R. L. Mercer, and
P. S. Roossin. 1990. A statistical approach to machine
translation. Computational Linguistics, 16(2):79–85.
M. Collins. 2002. Discriminative training methods for
hidden markov models: Theory and experiments with
perceptron algorithms. In Proc. EMNLP.
T. H. Cormen, C. E. Leiserson, and R. L. Rivest. 1990.
Introduction to Algorithms. MIT Press, Cambridge,
MA.
L. R. Dice. 1945. Measures of the amount of ecologic as-
sociation between species. Journal of Ecology, 26:297–
302.
F. Guerriero and P. Tseng. 2002. Implementation
and test of auction methods for solving generalized
network flow problems with separable convex cost.
Journal of Optimization Theory and Applications,
115(1):113–144, October.
B.S. He and L. Z. Liao. 2002. Improvements of some
projection methods for monotone nonlinear variational
inequalities. JOTA, 112:111:128.
G. M. Korpelevich. 1976. The extragradient method for
finding saddle points and other problems. Ekonomika
i Matematicheskie Metody, 12:747:756.
E. Matusov, R. Zens, and H. Ney. 2004. Symmetric word
alignments for statistical machine translation. In Proc.
of COLING 2004.
I. D. Melamed. 2000. Models of translational equivalence
among words. Computational Linguistics, 26(2):221–
249.
R. Mihalcea and T. Pedersen. 2003. An evaluation ex-
ercise for word alignment. In Proceedings of the HLT-
NAACL 2003 Workshop, Building and Using parallel
Texts: Data Driven Machine Translation and Beyond,
pages 1–6, Edmonton, Alberta, Canada.
F. Och and H. Ney. 2003. A systematic comparison of
various statistical alignment models. Computational
Linguistics, 29(1):19–52.
A. Schrijver. 2003. Combinatorial Optimization: Poly-
hedra and Eﬃciency. Springer.
B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin.
2005a. Learning structured prediction models: a large
margin approach. In Proceedings of the International
Conference on Machine Learning.
B. Taskar, S. Lacoste-Julien, and M. Jordan. 2005b.
Structured prediction via the extragradient method.
In Proceedings of Neural Information Processing Sys-
tems.
J. Tiedemann. 2003. Combining clues for word align-
ment. In Proceedings of EACL.
L. G. Valiant. 1979. The complexity of computing the
permanent. Theoretical Computer Science, 8:189–201.
S. Vogel, H. Ney, and C. Tillmann. 1996. HMM-based
word alignment in statistical translation. In COLING
16, pages 836–841.
80
