Symmetric Word Alignments for Statistical Machine Translation
Evgeny Matusov and Richard Zens and Hermann Ney
Lehrstuhl f¨ur Informatik VI, Computer Science Department
RWTH Aachen University
D-52056 Aachen, Germany
fmatusov,zens,neyg@cs.rwth-aachen.de
Abstract
In this paper, we address the word
alignment problem for statistical machine
translation. We aim at creating a sym-
metric word alignment allowing for reli-
able one-to-many and many-to-one word
relationships. We perform the iterative
alignment training in the source-to-target
and the target-to-source direction with
the well-known IBM and HMM alignment
models. Using these models, we robustly
estimatethelocalcostsofaligningasource
word and a target word in each sentence
pair. Then, we use efficient graph algo-
rithms to determine the symmetric align-
ment with minimal total costs (i.e. max-
imal alignment probability). We evalu-
ate the automatic alignments created in
this way on the German–English Verb-
mobil task and the French–English Cana-
dian Hansards task. We show statistically
significant improvements of the alignment
quality compared to the best results re-
ported so far. On the Verbmobil task,
we achieve an improvement of more than
1% absolute over the baseline error rate of
4.7%.
1 Introduction
Word-aligned bilingual corpora provide im-
portant knowledge for many natural language
processing tasks, such as the extraction of
bilingual word or phrase lexica (Melamed,
2000; Och and Ney, 2000). The solutions of
these problems depend heavily on the quality
of the word alignment (Och and Ney, 2000).
Word alignment models were first introduced
in statistical machine translation (Brown et
al., 1993). An alignment describes a mapping
from source sentence words to target sentence
words.
Using the IBM translation models IBM-1
to IBM-5 (Brown et al., 1993), as well as
the Hidden-Markov alignment model (Vogel et
al., 1996), we can produce alignments of good
quality. However, all these models constrain
the alignments so that a source word can be
aligned to at most one target word. This con-
straint is useful to reduce the computational
complexity of the model training, but makes
it hard to align phrases in the target lan-
guage (English) such as ‘the day after tomor-
row’ to one word in the source language (Ger-
man) ‘¨ubermorgen’. We will present a word
alignment algorithm which avoids this con-
straint and produces symmetric word align-
ments. This algorithm considers the align-
ment problem as a task of finding the edge
cover with minimal costs in a bipartite graph.
The parameters of the IBM models and HMM,
in particular the state occupation probabili-
ties, will be used to determine the costs of
aligning a specific source word to a target
word.
We will evaluate the suggested alignment
methods on the German–English Verbmo-
bil task and the French–English Canadian
Hansards task. We will show statistically sig-
nificant improvements compared to state-of-
the-art results in (Och and Ney, 2003).
2 Statistical Word Alignment Models
In this section, we will give an overview of
the commonly used statistical word alignment
techniques. They are based on the source-
channel approach to statistical machine trans-
lation (Brown et al., 1993). We are given
a source language sentence fJ1 := f1:::fj:::fJ
which has to be translated into a target lan-
guage sentence eI1 := e1:::ei:::eI. Among all
possible target language sentences, we will
choose the sentence with the highest proba-
bility:
ˆeI1 = argmax
eI1
'Pr(eI
1jf
J
1 )
“
= argmax
eI1
'Pr(eI
1)¢Pr(f
J
1 je
I
1)
“
This decomposition into two knowledge
sources allows for an independent modeling of
target language model Pr(eI1) and translation
model Pr(fJ1 jeI1). Into the translation model,
the word alignment A is introduced as a hid-
den variable:
Pr(fJ1 jeI1) =
X
A
Pr(fJ1 ;AjeI1)
Usually, the alignment is restricted in the
sense that each source word is aligned to at
most one target word, i.e. A = aJ1. The align-
ment may contain the connection aj = 0 with
the ‘empty’ word e0 to account for source sen-
tence words that are not aligned to any tar-
get word at all. A detailed description of the
popular translation/alignment models IBM-1
to IBM-5 (Brown et al., 1993), as well as the
Hidden-Markov alignment model (HMM) (Vo-
gel et al., 1996) can be found in (Och and Ney,
2003). Model 6 is a loglinear combination of
the IBM-4, IBM-1, and the HMM alignment
models.
A Viterbi alignment ˆA of a specific model is
an alignment for which the following equation
holds:
ˆA = argmax
A
'Pr(fJ
1 ;Aje
I
1)
“:
3 State Occupation Probabilities
The training of all alignment models is done
using the EM-algorithm. In the E-step, the
counts for each sentence pair (fJ1 ;eI1) are cal-
culated. Here, we present this calculation on
the example of the HMM. For its lexicon pa-
rameters, the marginal probability of a target
word ei to occur at the target sentence posi-
tion i as the translation of the source word fj
at the source sentence position j is estimated
with the following sum:
pj(i;fJ1 jeI1) =
X
aJ1:aj=i
Pr(fJ1 ;aJ1jeI1)
This value represents the likelihood of aligning
fj to ei via every possible alignment A = aJ1
that includes the alignment connection aj = i.
By normalizing over the target sentence posi-
tions, we arrive at the state occupation proba-
bility:
pj(ijfJ1 ;eI1) = pj(i;f
J1 jeI1)
IP
i0=1
pj(i0;fJ1 jeI1)
In the M-step of the EM training, the state
occupation probabilities are aggregated for all
words in the source and target vocabularies
by taking the sum over all training sentence
pairs. After proper renormalization the lexi-
con probabilities p(fje) are determined.
Similarly, the training can be performed
in the inverse (target-to-source) direction,
yielding the state occupation probabilities
pi(jjeI1;fJ1 ).
The negated logarithms of the state occu-
pation probabilities
w(i;j;fJ1 ;eI1) := ¡logpj(ijfJ1 ;eI1) (1)
can be viewed as costs of aligning the source
word fj with the target word ei. Thus, the
word alignment task can be formulated as the
task of finding a mapping between the source
and the target words, so that each source and
each target position is covered and the total
costs of the alignment are minimal.
Using state occupation probabilities for
word alignment modeling results in a num-
ber of advantages. First of all, in calculation
of these probabilities with the models IBM-1,
IBM-2 and HMM the EM-algorithm is per-
formed exact, i.e. the summation over all
alignments is efficiently performed in the E-
step. For the HMM this is done using the
Baum-Welch algorithm (Baum, 1972). So far,
an efficient algorithm to compute the sum over
all alignments in the fertility models IBM-3
to IBM-5 is not known. Therefore, this sum
is approximated using a subset of promising
alignments (Och and Ney, 2000). In both
cases, the resulting estimates are more pre-
cise than the ones obtained by the maximum
approximation, i.e. by considering only the
Viterbi alignment.
Instead of using the state occupation prob-
abilities from only one training direction as
costs (Equation 1), we can interpolate the
state occupation probabilities from the source-
to-target and the target-to-source training for
each pair (i,j) of positions in a sentence pair
(fJ1 ;eI1). This will improve the estimation of
the local alignment costs. Having such sym-
metrized costs, we can employ the graph align-
ment algorithms (cf. Section 4) to produce
reliable alignment connections which include
many-to-one and one-to-many alignment re-
lationships. The presence of both relation-
ship types characterizes a symmetric align-
ment that can potentially improve the trans-
lation results (Figure 1 shows an example of a
symmetric alignment).
Another important advantage is the effi-
ciency of the graph algorithms used to deter-
Figure 1: Example of a symmetric alignment
with one-to-many and many-to-one connec-
tions (Verbmobil task, spontaneous speech).
mine the final symmetric alignment. They will
be discussed in Section 4.
4 Alignment Algorithms
In this section, we describe the alignment ex-
traction algorithms. We assume that for each
sentence pair (fJ1 ;eI1) we are given a cost ma-
trix C.1 The elements of this matrix cij are
the local costs that result from aligning source
word fj to target word ei. For a given align-
ment A  I £ J, we define the costs of this
alignment c(A) as the sum of the local costs
of all aligned word pairs:
c(A) =
X
(i;j)2A
cij (2)
Now, our task is to find the alignment with the
minimum costs. Obviously, the empty align-
ment has always costs of zero and would be op-
timal. To avoid this, we introduce additional
constraints. The first constraint is source sen-
tence coverage. Thus each source word has
to be aligned to at least one target word or
alternatively to the empty word. The second
constraint is target sentence coverage. Similar
to the source sentence coverage thus each tar-
get word is aligned to at least one source word
or the empty word.
Enforcing only the source sentence cover-
age, the minimum cost alignment is a mapping
from source positions j to target positions aj,
including zero for the empty word. Each tar-
get position aj can be computed as:
aj = argmin
i
fcijg
This means, in each column we choose the
row with the minimum costs. This method re-
sembles the common IBM models in the sense
1For notational convenience, we omit the depen-
dency on the sentence pair (fJ1 ;eI1) in this section.
that the IBM models are also a mapping from
source positions to target positions. There-
fore, this method is comparable to the IBM
models for the source-to-target direction. Sim-
ilarly, if we enforce only the target sentence
coverage, the minimum cost alignment is a
mapping from target positions i to source po-
sitions bi. Here, we have to choose in each
row the column with the minimum costs. The
complexity of these algorithms is in O(I ¢J).
The algorithms for determining such a non-
symmetric alignment are rather simple. A
more interesting case arises, if we enforce both
constraints, i.e. each source word as well as
each target word has to be aligned at least
once. Even in this case, we can find the global
optimum in polynomial time.
The task is to find a symmetric alignment
A, for which the costsc(A) are minimal (Equa-
tion 2). This task is equivalent to finding
a minimum-weight edge cover (MWEC) in a
complete bipartite graph2. The two node
sets of this bipartite graph correspond to the
source sentence positions and the target sen-
tence positions, respectively. The costs of an
edge are the elements of the cost matrix C.
To solve the minimum-weight edge cover
problem, we reduce it to the maximum-weight
bipartite matching problem. As described
in (Keijsper and Pendavingh, 1998), this re-
duction is linear in the graph size. For the
maximum-weight bipartite matching problem,
well-known algorithm exist, e.g. the Hungar-
ian method. The complexity of this algorithm
is in O((I + J)¢I ¢J). We will call the solu-
tion of the minimum-weight edge cover prob-
lem with the Hungarian method “the MWEC
algorithm”. In contrary, we will refer to the al-
gorithm enforcing either source sentence cov-
erage or target sentence coverage as the one-
sided minimum-weight edge cover algorithm
(o-MWEC).
The cost matrix of a sentence pair (fJ1 ;eI1)
can be computed as a weighted linear interpo-
lation of various cost types hm:
cij =
MX
m=1
‚m ¢hm(i;j)
In our experiments, we will use the negated
logarithm of the state occupation probabilities
as described in Section 3. To obtain a more
symmetric estimate of the costs, we will inter-
polate both the source-to-target direction and
2An edge cover of G is a set of edges E0 such that
each node of G is incident to at least one edge in E0.
the target-to-source direction (thus the state
occupation probabilities are interpolated log-
linearly). Because the alignments determined
in the source-to-target training may substan-
tially differ in quality from those produced in
the target-to-source training, we will use an
interpolation weight fi:
cij = fi¢w(i;j;fJ1 ;eI1)+(1¡fi)¢w(j;i;eI1;fJ1 )
(3)
Additional feature functions can be included
to compute cij; for example, one could make
use of a bilingual word or phrase dictionary.
To apply the methods described in this sec-
tion, wemadetwoassumptions: first, thecosts
of an alignment can be computed as the sum
of local costs. Second, the features have to be
static in the sense that we have to fix the costs
before aligning any word. Therefore, we can-
not apply dynamic features such as the IBM-
4 distortion model in a straightforward way.
One way to overcome these restrictions lies in
using the state occupation probabilities; e.g.
for IBM-4, they contain the distortion model
to some extent.
5 Results
5.1 Evaluation Criterion
We use the same evaluation criterion as de-
scribed in (Och and Ney, 2000). We compare
the generated word alignment to a reference
alignment produced by human experts. The
annotation scheme explicitly takes the am-
biguity of the word alignment into account.
There are two different kinds of alignments:
sure alignments (S) which are used for unam-
biguous alignments and possible alignments
(P) which are used for alignments that might
or might not exist. The P relation is used
especially to align words within idiomatic ex-
pressions and free translations. It is guaran-
teed that the sure alignments are a subset of
the possible alignments (S  P). The ob-
tained reference alignment may contain many-
to-one and one-to-many relationships.
The quality of an alignment A is computed
as appropriately redefined precision and recall
measures. Additionally, we use the alignment
error rate (AER), which is derived from the
well-known F-measure.
recall = jA\SjjSj ; precision = jA\PjjAj
AER(S;P;A) = 1¡ jA\Sj+jA\PjjAj+jSj
Table 1: Verbmobil task: corpus statistics.
Source/Target: German English
Train Sentences 34446
Words 329625 343076
Vocabulary 5936 3505
Singletons 2600 1305
Dictionary Entries 4404
Test Sentences 354
Words 3233 3109
S reference relations 2559
P reference relations 4596
Table2: Canadian Hansards: corpus statistics.
Source/Target: French English
Train Sentences 128K
Words 2.12M 1.93M
Vocabulary 37542 29414
Singletons 12986 9572
Dictionary Entries 28701
Test Sentences 500
Words 8749 7946
S reference relations 4443
P reference relations 19779
With these definitions a recall error can only
occur if a S(ure) alignment is not found and a
precision error can only occur if a found align-
ment is not even P(ossible).
5.2 Experimental Setup
We evaluated the presented lexicon sym-
metrization methods on the Verbmobil and
the Canadian Hansards task. The German–
English Verbmobil task (Wahlster, 2000) is a
speech translation task in the domain of ap-
pointment scheduling, travel planning and ho-
tel reservation. The French–English Canadian
Hansards task consists of the debates in the
Canadian Parliament.
The corpus statistics are shown in Table 1
and Table 2. The number of running words
and the vocabularies are based on full-form
words including punctuation marks. As in
(Och and Ney, 2003), the first 100 sentences
of the test corpus are used as a development
corpus to optimize model parameters that are
not trained via the EM algorithm, e.g. the
interpolation weights. The remaining part of
the test corpus is used to evaluate the models.
We use the same training schemes (model
sequences) as presented in (Och and Ney,
2003): 15H5334363 for the Verbmobil Task ,
i.e. 5 iteration of IBM-1, 5 iterations of the
HMM, 3 iteration of IBM-3, etc.; for the Cana-
dian Hansards task, we use 15H10334363. We
refer to these schemes as the Model 6 schemes.
For comparison, we also perform less sophisti-
cated trainings, to which we refer as the HMM
schemes (15H10 and 15H5, respectively), as
well as the IBM Model 4 schemes (15H103343
and 15H53343).
Inalltrainingschemesweuseaconventional
dictionary (possibly containing phrases) as ad-
ditional training material. Because we use the
same training and testing conditions as (Och
and Ney, 2003), we will refer to the results pre-
sented in that article as the baseline results.
5.3 Non-symmetric Alignments
In the first experiments, we use the state oc-
cupation probabilities from only one transla-
tion direction to determine the word align-
ment. This allows for a fair comparison with
the Viterbi alignment computed as the result
of the training procedure. In the source-to-
target translation direction, we cannot esti-
mate the probability for the target words with
fertility zero and choose to set it to 0. In this
case, the minimum weight edge cover problem
is solved by the one-sided MWEC algorithm.
Like the Viterbi alignments, the alignments
produced by this algorithm satisfy the con-
straint that multiple source (target) words can
only be aligned to one target (source) word.
Tables 3 and 4 show the performance of
the one-sided MWEC algorithm in compar-
ison with the experiment reported by (Och
and Ney, 2003). We report not only the final
alignment error rates, but also the intermedi-
ate results for the HMM and IBM-4 training
schemes.
For IBM-3 to IBM-5, the Viterbi alignment
and a set of promising alignments are used
to determine the state occupation probabili-
ties. Consequently, we observe similar align-
ment quality when comparing the Viterbi and
the one-sided MWEC alignments.
We also evaluated the alignment quality af-
ter applying alignment generalization meth-
ods, i.e. we combine the alignment of both
translation directions. Experimentally, the
best generalization heuristic for the Canadian
Hansards task is the intersection of the source-
to-target and the target-to-source alignments.
For the Verbmobil task, the refined method
of (Och and Ney, 2003) is used. Again, we
observed similar alignment error rates when
merging either the Viterbi alignments or the
o-MWEC alignments.
Table 3: AER [%] for non-symmetric align-
ment methods and for various models (HMM,
IBM-4, Model 6) on the Canadian Hansards
task.
Alignment method HMM IBM4 M6
Baseline T!S 14.1 12.9 11.9
S!T 14.4 12.8 11.7
intersection 8.4 6.9 7.8
o-MWEC T!S 14.0 13.1 11.9
S!T 14.3 13.0 11.7
intersection 8.2 7.1 7.8
Table 4: AER [%] for non-symmetric align-
ment methods and for various models (HMM,
IBM-4, Model 6) on the Verbmobil task.
Alignment method HMM IBM4 M6
Baseline T!S 7.6 4.8 4.6
S!T 12.1 9.3 8.8
refined 7.1 4.7 4.7
o-MWEC T!S 7.3 4.8 4.5
S!T 12.0 9.3 8.5
refined 6.7 4.6 4.6
5.4 Symmetric Alignments
The heuristically generalized Viterbi align-
ments presented in the previous section can
potentially avoid the alignment constraints3.
However, the choice of the optimal general-
ization heuristic may depend on a particular
language pair and may require extensive man-
ual optimization. In contrast, the symmetric
MWEC algorithm is a systematic and theo-
retically well-founded approach to the task of
producing a symmetric alignment.
In the experiments with the symmetric
MWEC algorithm, the optimal interpolation
parameter fi (see Equation 3) for the Verbmo-
bil corpus was empirically determined as 0:8.
This shows that the model parameters can be
estimated more reliably in the direction from
German to English. In the inverse English-
to-German alignment training, the mappings
of many English words to one German word
are not allowed by the modeling constraints,
although such alignment mappings are signif-
icantly more frequent than mappings of many
German words to one English word.
The experimentally best interpolation pa-
rameterfortheCanadianHansardscorpuswas
fi = 0:5. Thus the model parameters esti-
mated in the translation direction from French
to English are as reliable as the ones estimated
3Consequently, we will use them as baseline for the
experiments with symmetric alignments.
in the direction from English to French.
Lines 2a and 2b of Table 5 show the perfor-
mance of the MWEC algorithm. The align-
ment error rates are slightly lower if the HMM
or the full Model 6 training scheme is used
to train the state occupation probabilities on
the Canadian Hansards task. On the Verbmo-
bil task, the improvement is more significant,
yielding an alignment error rate of 4.1%.
Columns 4 and 5 of Table 5 contain the re-
sults of the experiments, in which the costs
cij were determined as the loglinear interpola-
tion of state occupation probabilities obtained
from the HMM training scheme with those
from IBM-4 (column 4) or from Model 6 (col-
umn 5). We set the interpolation parameters
for the two translation directions proportional
to the optimal values determined in the previ-
ous experiments. On the Verbmobil task, we
obtain a further improvement of 19% relative
over the baseline result reported in (Och and
Ney, 2003), reaching an AER as low as 3.8%.
The improvements of the alignment qual-
ity on the Canadian Hansards task are less
significant. The manual reference alignments
for this task contain many possible connec-
tions and only a few sure connections (cf. Ta-
ble 2). Thus automatic alignments consisting
of only a few reliable alignment points are fa-
vored. Because the differences in the number
of words and word order between French and
English are not as dramatic as e.g. between
German and English, the probability of the
emptywordalignmentisnotveryhigh. There-
fore, plenty of alignment points are produced
by the MWEC algorithm, resulting in a high
recall and low precision. To increase the preci-
sion, we replaced the empty word connection
costs (previously trained as state occupation
probabiliities using the EM algorithm) by the
global, word- and position-independent costs
depending only on one of the involved lan-
guages. The alignment error rates for these
experiments are given in lines 3a and 3b of Ta-
ble 5. The global empty word probability for
the Canadian Hansards task was empirically
set to 0.45 for French and for English, and,
for the Verbmobil task, to 0.6 for German and
0.1 for English. On the Canadian Hansards
task, we achieved further significant reduction
of the AER. In particular, we reached an AER
of 6.6% by performing only the HMM training.
In this case the effectiveness of the MWEC al-
gorithm is combined with the efficiency of the
HMM training, resulting in a fast and robust
alignment training procedure.
We also tested the more simple one-sided
MWEC algorithm. In contrast to the exper-
iments presented in Section 5.3, we used the
loglinear interpolated state occupation prob-
abilities (given by the Equation 3) as costs.
Thus, although the algorithm is not able to
produce a symmetric alignment, it operates
with symmetrized costs. In addition, we used
a combination heuristic to obtain a symmetric
alignment. The results of these experiments
are presented in Table 5, lines 4-6 a/b.
The performance of the one-sided MWEC
algorithm turned out to be quite robust on
both tasks. However, the o-MWEC align-
ments are not symmetric and the achieved low
AER depends heavily on the differences be-
tween the involved languages, which may fa-
vor many-to-one alignments in one translation
direction only. That is why on the Verbmobil
task, when determining the mininum weight in
each row for the translation direction from En-
glish to German, the alignment quality deteri-
orates, because the algorithm cannot produce
alignments which map several English words
to one German word (line 5b of Table 5).
Applying the generalization heuristics
(line 6a/b of Table 5), we achieve an AER of
6.0% on the Canadian Hansards task when
interpolating the state occupation probabil-
ities trained with the HMM and with the
IBM-4 schemes. On the Verbmobil task, the
interpolation of the HMM and the Model 6
schemes yields the best result of 3.7% AER.
In the latter experiment, we reached 97.3%
precision and 95.2% recall.
6 Related Work
A description of the IBM models for statistical
machine translation can be found in (Brown et
al., 1993). The HMM-based alignment model
was introduced in (Vogel et al., 1996). An
overview of these models is given in (Och and
Ney, 2003). That article also introduces the
Model 6; additionally, state-of-the-art results
are presented for the Verbmobil task and the
Canadian Hansards task for various configura-
tions. Therefore, we chose them as baseline.
Additional linguistic knowledge sources such
as dependeny trees or parse trees were used in
(Cherry and Lin, 2003; Gildea, 2003). Bilin-
gual bracketing methods were used to produce
a word alignment in (Wu, 1997). (Melamed,
2000) uses an alignment model that enforces
one-to-one alignments for nonempty words.
Table 5: AER[%] for different alignment symmetrization methods and for various alignment
models on the Canadian Hansards and the Verbmobil tasks (MWEC: minimum weight edge
cover, EW: empty word).
Symmetrization Method HMM IBM4 M6 HMM + IBM4 HMM + M6
Canadian 1a. Baseline (intersection) 8.4 6.9 7.8 – –
Hansards 2a. MWEC 7.9 9.3 7.5 8.2 7.4
3a. MWEC (global EW costs) 6.6 7.4 6.9 6.4 6.4
4a. o-MWEC T!S 7.3 7.9 7.4 6.7 7.0
5a. S!T 7.7 7.6 7.2 6.9 6.9
6a. S$T (intersection) 7.2 6.6 7.6 6.0 7.1
Symmetrization Method HMM IBM4 M6 HMM + IBM4 HMM + M6
Verbmobil 1b. Baseline (refined) 7.1 4.7 4.7 – –
2b. MWEC 6.4 4.4 4.1 4.3 3.8
3b. MWEC (global EW costs) 5.8 5.8 6.6 6.0 6.7
4b. o-MWEC T!S 6.8 4.4 4.1 4.5 3.7
5b. S!T 9.3 7.2 6.8 7.5 6.9
6b. S$T (refined) 6.7 4.3 4.1 4.6 3.7
7 Conclusions
In this paper, we addressed the task of au-
tomatically generating symmetric word align-
ments for statistical machine translation. We
exploited the state occupation probabilties de-
rived from the IBM and HMM translation
models. We used the negated logarithms of
these probabilities as local alignmentcosts and
reduced the word alignment problem to find-
ing an edge cover with minimal costs in a
bipartite graph. We presented efficient algo-
rithms for the solution of this problem. We
evaluated the performance of these algorithms
by comparing the alignment quality to man-
ual reference alignments. We showed that in-
terpolating the alignment costs of the source-
to-target and the target-to-source translation
directions can result in a significant improve-
ment of the alignment quality.
In the future, we plan to integrate the graph
algorithms into the iterative training proce-
dure. Investigating the usefulness of addi-
tional feature functions might be interesting
as well.
Acknowledgment
This work has been partially funded by the
EU project TransType 2, IST-2001-32091.

References

L. E. Baum. 1972. An inequality and associated
maximization technique in statistical estimation
for probabilistic functions of markov processes.
Inequalities, 3:1–8.

P. F. Brown, S. A. Della Pietra, V. J. Della Pietra,
and R. L. Mercer. 1993. The mathematics of
statistical machine translation: Parameter esti-
mation. Computational Linguistics, 19(2):263–
311, June.

C. Cherry and D. Lin. 2003. A probability model
to improve word alignment. In Proc. of the 41th
Annual Meeting of the Association for Compu-
tational Linguistics (ACL), pages 88–95, Sap-
poro, Japan, July.

D. Gildea. 2003. Loosely tree-based alignment for
machine translation. In Proc. of the 41th An-
nual Meeting of the Association for Computa-
tional Linguistics (ACL), pages 80–87, Sapporo,
Japan, July.

J. Keijsper and R. Pendavingh. 1998. An effi-
cient algorithm for minimum-weight bibranch-
ing. Journal of Combinatorial Theory Series B,
73(2):130–145, July.

I. D. Melamed. 2000. Models of translational
equivalence among words. Computational Lin-
guistics, 26(2):221–249.

F. J. Och and H. Ney. 2000. Improved statistical
alignment models. In Proc. of the 38th Annual
Meeting of the Association for Computational
Linguistics (ACL), pages 440–447, Hong Kong,
October.

F. J. Och and H. Ney. 2003. A systematic com-
parison of various statistical alignment models.
Computational Linguistics, 29(1):19–51, March.

S. Vogel, H. Ney, and C. Tillmann. 1996. HMM-
based word alignment in statistical translation.
In COLING ’96: The 16th Int. Conf. on Com-
putational Linguistics, pages 836–841, Copen-
hagen, Denmark, August.

W. Wahlster, editor. 2000. Verbmobil: Founda-
tions of speech-to-speech translations. Springer
Verlag, Berlin, Germany, July.

D. Wu. 1997. Stochastic inversion transduction
grammars and bilingual parsing of parallel cor-
pora. Computational Linguistics, 23(3):377–
403, September.
