ProAlign: Shared Task System Description
Dekang Lin and Colin Cherry
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada, T6G 2E8
{lindek,colinc}@cs.ualberta.ca
Abstract
ProAlign combines several different ap-
proaches in order to produce high quality word
word alignments. Like competitive linking,
ProAlign uses a constrained search to find high
scoring alignments. Like EM-based methods,
a probability model is used to rank possible
alignments. The goal of this paper is to give
a bird’s eye view of the ProAlign system to
encourage discussion and comparison.
1 Alignment Algorithm at a Glance
We have submitted the ProAlign alignment system to
the WPT’03 shared task. It received a 5.71% AER on
the English-French task and 29.36% on the Romanian-
English task. These results are with the no-null data; our
output was not formatted to work with explicit nulls.
ProAlign works by iteratively improving an align-
ment. The algorithm creates an initial alignment us-
ing search, constraints, and summed φ2 correlation-based
scores (Gale and Church, 1991). This is similar to the
competitive linking process (Melamed, 2000). It then
learns a probability model from the current alignment,
and conducts a constrained search again, this time scor-
ing alignments according to the probability model. The
process continues until results on a validation set begin to
indicate over-fitting.
For the purposes of our algorithm, we view an align-
ment as a set of links between the words in a sen-
tence pair. Before describing the algorithm, we will de-
fine the following notation. Let E be an English sen-
tence e1,e2,...,em and let F be a French sentence
f1,f2,...,fn. We define a link l(ei,fj) to exist if ei and
fj are a translation (or part of a translation) of one an-
other. We define the null link l(ei,f0) to exist if ei does
not correspond to a translation for any French word in F.
The null link l(e0,fj) is defined similarly. An alignment
A for two sentences E and F is a set of links such that ev-
ery word in E and F participates in at least one link, and
a word linked to e0 or f0 participates in no other links. If
e occurs in E x times and f occurs in F y times, we say
that e and f co-occur xy times in this sentence pair.
ProAlign conducts a best-first search (with constant
beam and agenda size) to search a constrained space of
possible alignments. A state in this space is a partial
alignment, and a transition is defined as the addition of
a single link to the current state. Any link which would
create a state that does not violate any constraint is con-
sidered to be a valid transition. Our start state is the empty
alignment, where all words in E and F are implicitly
linked to null. A terminal state is a state in which no more
links can be added without violating a constraint. Our
goal is to find the terminal state with the highest proba-
bility.
To complete this algorithm, one requires a set of con-
straints and a method for determining which alignment is
most likely. These are presented in the next two sections.
The algorithm takes as input a set of English-French sen-
tence pairs, along with dependency trees for the English
sentences. The presence of the English dependency tree
allows us to incorporate linguistic features into our model
and linguistic intuitions into our constraints.
2 Constraints
The model used for scoring alignments has no mecha-
nism to prevent certain types of undesirable alignments,
such as having all French words align to the same En-
glish word. To guide the search to correct alignments, we
employ two constraints to limit our search for the most
probable alignment. The first constraint is the one-to-one
constraint (Melamed, 2000): every word (except the null
words e0 and f0) participates in exactly one link.
The second constraint, known as the cohesion con-
straint (Fox, 2002), uses the dependency tree (Mel’ˇcuk,
1987) of the English sentence to restrict possible link
combinations. Given the dependency tree TE and a (par-
tial) alignment A, the cohesion constraint requires that
phrasal cohesion is maintained in the French sentence. If
two phrases are disjoint in the English sentence, the align-
ment must not map them to overlapping intervals in the
French sentence. This notion of phrasal constraints on
alignments need not be restricted to phrases determined
from a dependency structure. However, the experiments
conducted in (Fox, 2002) indicate that dependency trees
demonstrate a higher degree of phrasal cohesion during
translation than other structures.
Consider the partial alignment in Figure 1. The most
probable lexical match for the English word to is the
French word `a. When the system attempts to link to and
`a, the distinct English phrases [the reboot] and [the host
to discover all the devices] will be mapped to intervals
in the French sentence, creating the induced phrasal in-
tervals [`a ... [r´einitialisation] ... p´eriph´eriques]. Re-
gardless of what these French phrases will be after the
alignment is completed, we know now that their intervals
will overlap. Therefore, this link will not be added to the
partial alignment.
The reboot causes the host to discover all the devices 
det subj det subj aux 
pre 
det 
obj mod 
       la Suite rØinitialisation  ,  l’  h te repŁre tous les pØriphØriques 
after to the reboot the host locate all the peripherals 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 11 
Figure 1: An Example of Cohesion Constraint
To define this notion more formally, let TE(ei) be
the subtree of TE rooted at ei. The phrase span of
ei, spanP(ei,TE,A), is the image of the English phrase
headed by ei in F given a (partial) alignment A. More
precisely, spanP(ei,TE,A) = [k1,k2], where
k1 = min{j|l(u,j) ∈ A,eu ∈ TE(ei)}
k2 = max{j|l(u,j) ∈ A,eu ∈ TE(ei)}
The head span is the image of ei itself. We define
spanH(ei,TE,A) = [k1,k2], where
k1 = min{j|l(i,j) ∈ A}
k2 = max{j|l(i,j) ∈ A}
In Figure 1, for the node reboot, the phrase span is
[4,4] and the head span is also [4,4]; for the node discover
(with the link between to and `a in place), the phrase span
is [2,11] and the head span is the empty set ∅.
With these definitions of phrase and head spans, we de-
fine two notions of overlap, originally introduced in (Fox,
2002) as crossings. Given a head node eh and its modi-
fier em, a head-modifier overlap occurs when:
spanH(eh,TE,A)∩ spanP(em,TE,A) negationslash= ∅
Given two nodes em1 and em2 which both modify the
same head node, a modifier-modifier overlap occurs
when:
spanP(em1,TE,A)∩ spanP(em2,TE,A) negationslash= ∅
Following (Fox, 2002), we say an alignment is co-
hesive with respect to TE if it does not introduce
any head-modifier or modifier-modifier overlaps. For
example, the alignment A in Figure 1 is not cohe-
sive because spanP(reboot,TE,A) = [4,4] intersects
spanP(discover,TE,A) = [2,11]. Since both reboot
and discover modify causes, this creates a modifier-
modifier overlap. One can check for constraint viola-
tions inexpensively by incrementally updating the vari-
ous spans as new links are added to the partial alignment,
and checking for overlap after each modification. More
details on the cohesion constraint can be found in (Lin
and Cherry, 2003).
3 Probability Model
We define the word alignment problem as finding the
alignment A that maximizes P(A|E,F). ProAlign mod-
els P(A|E,F) directly, using a different decomposition
of terms than the model used by IBM (Brown et al.,
1993). In the IBM models of translation, alignments ex-
ist as artifacts of a stochastic process, where the words
in the English sentence generate the words in the French
sentence. Our model does not assume that one sentence
generates the other. Instead it takes both sentences as
given, and uses the sentences to determine an alignment.
An alignment A consists of t links {l1,l2,...,lt}, where
each lk = l(eik,fjk) for some ik and jk. We will refer to
consecutive subsets of A as lji = {li,li+1,...,lj}. Given
this notation, P(A|E,F) can be decomposed as follows:
P(A|E,F) = P(lt1|E,F) =
tproductdisplay
k=1
P(lk|E,F,lk−11 )
At this point, we factor P(lk|E,F,lk−11 ) to make com-
putation feasible. Let Ck = {E,F,lk−11 } represent the
context of lk. Note that both the context Ck and the link
lk imply the occurrence of eik and fjk. We can rewrite
P(lk|Ck) as:
P(lk|Ck) = P(lk,Ck)P(C
k)
= P(Ck|lk)P(lk)P(C
k,eik,fjk)
= P(lk|eik,fjk)× P(Ck|lk)P(C
k|eik,fjk)
Here P(lk|eik,fjk) is link probability given a co-
occurrence of the two words, which is similar in spirit to
Melamed’s explicit noise model (Melamed, 2000). This
term depends only on the words involved directly in the
link. The ratio P(Ck|lk)P(C
k|eik,fjk)
modifies the link probability,
providing context-sensitive information.
Ck remains too broad to deal with in practical sys-
tems. We will consider only a subset FTk of relevant
features of Ck. We will make the Na¨ıve Bayes-style as-
sumption that these features ft ∈ FTk are conditionally
independent given either lk or (eik,fjk). This produces a
tractable formulation for P(A|E,F):
tproductdisplay
k=1

P(lk|ei
k,fjk)×
productdisplay
ft∈FTk
P(ft|lk)
P(ft|eik,fjk)


More details on the probability model used by ProAlign
are available in (Cherry and Lin, 2003).
3.1 Features used in the shared task
For the purposes of the shared task, we use two feature
types. Each type could have any number of instantiations
for any number of contexts. Note that each feature type
is described in terms of the context surrounding a word
pair.
The first feature type fta concerns surrounding links.
It has been observed that words close to each other in
the source language tend to remain close to each other in
the translation (S. Vogel and Tillmann, 1996). To capture
this notion, for any word pair (ei,fj), if a link l(eiprime,fjprime)
exists within a window of two words (where i−2 ≤ iprime ≤
i+2 and j−2 ≤ jprime ≤ j+2), then we say that the feature
fta(i−iprime,j −jprime,eiprime) is active for this context. We refer
to these as adjacency features.
The second feature type ftd uses the English parse tree
to capture regularities among grammatical relations be-
tween languages. For example, when dealing with French
and English, the location of the determiner with respect
to its governor is never swapped during translation, while
the location of adjectives is swapped frequently. For any
word pair (ei,fj), let eiprime be the governor of ei, and let
rel be the relationship between them. If a link l(eiprime,fjprime)
exists, then we say that the feature ftd(j − jprime,rel) is ac-
tive for this context. We refer to these as dependency
features.
Take for example Figure 2 which shows a partial align-
ment with all links completed except for those involving
the. Given this sentence pair and English parse tree, we
can extract features of both types to assist in the align-
ment of the1. The word pair (the1,lprime) will have an active
adjacency feature fta(+1,+1,host) as well as a depen-
dency feature ftd(−1,det). These two features will work
together to increase the probability of this correct link.
the host discovers all the devices 
det subj 
pre 
det 
obj 
 l'  hôte repère tous les périphériques 
1 2 3 4 5 
1 2 3 4 5 6 
6 
the host locate all the peripherals 
Figure 2: Feature Extraction Example
In contrast, the incorrect link (the1,les) will have only
ftd(+3,det), which will work to lower the link probabil-
ity, since most determiners are located before their gov-
ernors.
3.2 Training the model
Since we always work from a current alignment, training
the model is a simple matter of counting events in the
current alignment. Link probability is the number of time
two words are linked, divided by the number of times
they co-occur. The various feature probabilities can be
calculated by also counting the number of times a feature
occurs in the context of a linked pair of words, and the
number of times the feature is active for co-occurrences
of the same word pair.
Considering only a single, potentially noisy alignment
for a given sentence pair can result in reinforcing errors
present in the current alignment during training. To avoid
this problem, we sample from a space of probable align-
ments, as is done in IBM models 3 and above (Brown
et al., 1993), and weight counts based on the likelihood
of each alignment sampled under the current probability
model. To further reduce the impact of rare, and poten-
tially incorrect events, we also smooth our probabilities
using m-estimate smoothing (Mitchell, 1997).
4 Multiple Alignments
The result of the constrained alignment search is a high-
precision, word-to-word alignment. We then relax the
word-to-word constraint, and use statistics regarding col-
locations with unaligned words in order to make many-to-
one alignments. We also employ a further relaxed link-
ing process to catch some cases where the cohesion con-
straint ruled out otherwise good alignments. These auxil-
iary methods are currently not integrated into our search
or our probability model, although that is certainly a di-
rection for future work.
5 Conclusions
We have presented a brief overview of the major ideas
behind our entry to the WPT’03 Shared Task. Primary
among these ideas are the use of a cohesion constraint in
search, and our novel probability model.
Acknowledgments
This project is funded by and jointly undertaken with Sun
Microsystems, Inc. We wish to thank Finola Brady, Bob
Kuhns and Michael McHugh for their help. We also wish
to thank the WPT’03 reviewers for their helpful com-
ments.

References
P. F. Brown, V. S. A. Della Pietra, V. J. Della Pietra, and
R. L. Mercer. 1993. The mathematics of statistical
machine translation: Parameter estimation. Computa-
tional Linguistics, 19(2):263–312.
Colin Cherry and Dekang Lin. 2003. A probability
model to improve word alignment. Submitted.
Heidi J. Fox. 2002. Phrasal cohesion and statistical
machine translation. In 2002 Conference on Empiri-
cal Methods in Natural Language Processing (EMNLP
2002), pages 304–311.
W.A. Gale and K.W. Church. 1991. Identifying word
correspondences in parallel texts. In 4th Speech
and Natural Language Workshop, pages 152–157.
DARPA, Morgan Kaufmann.
Dekang Lin and Colin Cherry. 2003. Word alignment
with cohesion constraint. Submitted.
I. Dan Melamed. 2000. Models of translational equiv-
alence among words. Computational Linguistics,
26(2):221–249, June.
Igor A. Mel’ˇcuk. 1987. Dependency syntax: theory and
practice. State University of New York Press, Albany.
Tom Mitchell. 1997. Machine Learning. McGraw Hill.
H. Ney S. Vogel and C. Tillmann. 1996. HMM-based
word alignment in statistical translation. In 16th In-
ternational Conference on Computational Linguistics,
pages 836–841, Copenhagen, Denmark, August.
