Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 141–144,
Ann Arbor, June 2005. c©Association for Computational Linguistics, 2005
A Generalized Alignment-Free Phrase Extraction
Bing Zhao
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA-15213
bzhao@cs.cmu.edu
Stephan Vogel
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA-15213
vogel+@cs.cmu.edu
Abstract
In this paper, we present a phrase ex-
traction algorithm using a translation lex-
icon, a fertility model, and a simple dis-
tortion model. Except these models, we
do not need explicit word alignments for
phrase extraction. For each phrase pair (a
block), a bilingual lexicon based score is
computed to estimate the translation qual-
ity between the source and target phrase
pairs; a fertility score is computed to es-
timate how good the lengths are matched
between phrase pairs; a center distortion
score is computed to estimate the relative
position divergence between the phrase
pairs. We presented the results and our
experience in the shared tasks on French-
English.
1 Introduction
Phrase extraction becomes a key component in to-
day’s state-of-the-art statistical machine translation
systems. With a longer context than unigram, phrase
translation models have  exibilities of modelling lo-
cal word-reordering, and are less sensitive to the er-
rors made from preprocessing steps including word
segmentations and tokenization. However, most of
the phrase extraction algorithms rely on good word
alignments. A widely practiced approach explained
in details in (Koehn, 2004), (Och and Ney, 2003)
and (Tillmann, 2003) is to get word alignments from
two directions: source to target and target to source;
the intersection or union operation is applied to get
re ned word alignment with pre-designed heuristics
 xing the unaligned words. With this re ned word
alignment, the phrase extraction for a given source
phrase is essentially to extract the target candidate
phrases in the target sentence by searching the left
and right projected boundaries.
In (Vogel et al., 2004), they treat phrase align-
ment as a sentence splitting problem: given a source
phrase,  nd the boundaries of the target phrase such
that the overall sentence alignment lexicon probabil-
ity is optimal. We generalize it in various ways, esp.
by using a fertility model to get a better estimation of
phrase lengths, and a phrase level distortion model.
In our proposed algorithm, we do not need ex-
plicit word alignment for phrase extraction. Thereby
it avoids the burden of testing and comparing differ-
ent heuristics especially for some language speci c
ones. On the other hand, the algorithm has such  ex-
ibilities that one can incorporate word alignment and
heuristics in several possible stages within this pro-
posed framework to further improve the quality of
phrase pairs. In this way, our proposed algorithm
is more generalized than the usual word alignment
based phrase extraction algorithms.
The paper is structured as follows: in section 2,
The concept of blocks is explained; in section 3, a
dynamic programming approach is model the width
of the block; in section 4, a simple center distortion
of the block; in section 5, the lexicon model; the
complete algorithm is in section 6; in section 7, our
experience and results using the proposed approach.
2 Blocks
We consider each phrase pair as a block within a
given parallel sentence pair, as shown in Figure 1.
The y-axis is the source sentence, indexed word
by word from bottom to top; the x-axis is the target
sentence, indexed word by word from left to right.
The block is de ned by the source phrase and its pro-
jection. The source phrase is bounded by the start
and the end positions in the source sentence. The
projection of the source phrase is de ned as the left
and right boundaries in the target sentence. Usually,
the boundaries can be inferred according to word
alignment as the left most and right most aligned
positions from the words in the source phrase. In
141
Start
End
Right boundaryLeft boundary
Width
src center
tgt center
Figure 1: Blocks with  width and  centers 
this paper, we provide another view of the block,
which is de ned by the centers of source and target
phrases, and the width of the target phrase.
Phrase extraction algorithms in general search
for the left and right projected boundaries of each
source phrase according to some score metric com-
puted for the given parallel sentence pairs. We
present here three models: a phrase level fertility
model score for phrase pairs’ length mismatch, a
simple center-based distortion model score for the
divergence of phrase pairs’ relative positions, and
a phrase level translation score to approximate the
phrase pairs’ translational equivalence. Given a
source phrase, we can search for the best possible
block with the highest combined scores from the
three models.
3 Length Model: Dynamic Programming
Given the word fertility de nitions in IBM Mod-
els (Brown et al., 1993), we can compute a prob-
ability to predict phrase length: given the candi-
date target phrase (English) eI1, and a source phrase
(French) of length J, the model gives the estima-
tion of P(J|eI1) via a dynamic programming algo-
rithm using the source word fertilities. Figure 2
shows an example fertility trellis of an English tri-
gram. Each edge between two nodes represents one
English word ei. The arc between two nodes rep-
resents one candidate non-zero fertility for ei. The
fertility of zero (i.e. generating a NULL word) cor-
responds to the direct edge between two nodes, and
in this way, the NULL word is naturally incorpo-
rated into this model’s representation. Each arc is
e1 e2 e3
1
3
2
0 0
2
0
e1 e2 e3
……
….
1
2
3
4
3
1
3
1
2
Figure 2: An example of fertility trellis for dynamic
programming
associated with a English word fertility probability
P(φi|ei). A path φI1 through the trellis represents
the number of French words φi generated by each
English word ei. Thus, the probability of generating
J words from the English phrase along the Viterbi
path is:
P(J|eI1) = max
{φI1,J=summationtextIi=1 φi}
IY
i=1
P(φi|ei) (1)
The Viterbi path is inferred via dynamic program-
ming in the trellis of the lower panel in Figure 2:
φ[j, i] = max
8>
><
>>:
φ[j, i − 1] + log PNULL(0|ei)
φ[j − 1, i − 1] + log Pφ(1|ei)
φ[j − 2, i − 1] + log Pφ(2|ei)
φ[j − 3, i − 1] + log Pφ(3|ei)
where PNULL(0|ei) is the probability of generating
a NULL word from ei; Pφ(k = 1|ei) is the usual
word fertility model of generating one French word
from the word ei; φ[j, i] is the cost so far for gener-
ating j words from i English words ei1 : e1,··· , ei.
After computing the cost of φ[J, I], we can trace
back the Viterbi path, along which the probability
P(J|eI1) of generating J French words from the En-
glish phrase eI1 as shown in Eqn. 1.
142
With this phrase length model, for every candidate
block, we can compute a phrase level fertility score
to estimate to how good the phrase pairs are match
in their lengthes.
4 Distortion of Centers
The centers of source and target phrases are both il-
lustrated in Figure 1. We compute a simple distor-
tion score to estimate how far away the two centers
are in a parallel sentence pair in a sense the block is
close to the diagonal.
In our algorithm, the source center circledotfj+l
j
of the
phrase fj+lj with length l +1 is simply a normalized
relative position de ned as follows:
circledotfj+l
j
= 1|F|
jprime=j+lX
jprime=j
jprime
l + 1 (2)
where |F| is the French sentence length.
For the center of English phrase ei+ki in the target
sentence, we  rst de ne the expected corresponding
relative center for every French word fjprime using the
lexicalized position score as follows:
circledotei+k
i
(fjprime) = 1|E| ·
P(i+k)
iprime=i i
prime · P(fjprime|eiprime)
P(i+k)
iprime=i P(fjprime|eiprime)
(3)
where |E| is the English sentence length. P(fjprime|ei)
is the word translation lexicon estimated in IBM
Models. i is the position index, which is weighted
by the word level translation probabilities; the term
of PIi=1 P(fjprime|ei) provides a normalization so that
the expected center is within the range of target sen-
tence length. The expected center for ei+ki is simply
a average of circledotei+k
i
(fjprime):
circledotei+k
i
= 1l + 1
j+lX
jprime=j
circledotei+k
i
(fjprime) (4)
This is a general framework, and one can certainly
plug in other kinds of score schemes or even word
alignments to get better estimations.
Given the estimated centers of circledotfj+l
j
and
circledotei+k
i
, we can compute how close they are by
the probability of P(circledotei+k
i
|circledotfj+l
j
). To estimate
P(circledotei+k
i
|circledotfj+l
j
), one can start with a  at gaussian
model to enforce the point of (circledotei+k
i
,circledotfj+l
j
) not too
far off the diagonal and build an initial list of phrase
pairs, and then compute the histogram to approxi-
mate P(circledotei+k
i
|circledotfj+l
j
).
5 Lexicon Model
Similar to (Vogel et al., 2004), we compute for each
candidate block a score within a given sentence pair
using a word level lexicon P(f|e) as follows:
P(fj+lj |ei+ki ) =
Y
jprime∈[j,j+l]
X
iprime∈[i,i+k]
P(fjprime|eiprime)
k + 1
·
Y
jprime /∈[j,j+l]
X
iprime /∈[i,i+k]
P(fjprime|eiprime)
|E|− k − 1
6 Algorithm
Our phrase extraction is described in Algorithm
1. The input parameters are essentially from IBM
Model-4: the word level lexicon P(f|e), the English
word level fertility Pφ(φe = k|e), and the center
based distortion P(circledotei+k
i
|circledotfj+l
j
).
Overall, for each source phrase fj+lj , the algo-
rithm  rst estimates its normalized relative center
in the source sentence, its projected relative cen-
ter in the target sentence. The scores of the phrase
length, center-based distortion, and a lexicon based
score are computed for each candidate block A lo-
cal greedy search is carried out for the best scored
phrase pair (fj+lj , ei+ki ).
In our submitted system, we computed the
following seven base scores for phrase pairs:
Pef(fj+lj |ei+ki ), Pfe(ei+ki |fj+lj ), sharing similar
function form in Eqn. 5.
Pef(fj+lj |ei+ki ) =
Y
jprime
X
iprime
P(fjprime|eiprime)P(eiprime|ei+ki )
=
Y
jprime
X
iprime
P(fjprime|eiprime)
k + 1 (5)
We compute phrase level relative frequency in both
directions: Prf(fj+lj |ei+ki ) and Prf(ei+ki |fj+lj ). We
compute two other lexicon scores which were also
used in (Vogel et al., 2004): S1(fj+lj |ei+ki ) and
S2(ei+ki |fj+lj ) using the similar function in Eqn. 6:
S(fj+lj |ei+ki ) =
Y
jprime
X
iprime
P(fjprime|eiprime) (6)
143
In addition, we put the phrase level fertility score
computed in section 3 via dynamic programming to
be as one additional score for decoding.
Algorithm 1 A Generalized Alignment-free Phrase
Extraction
1: Input: Pre-trained models: Pφ(φe = k|e) ,
P(circledotE|circledotF ) , and P(f|e).
2: Output: PhraseSet: Phrase pair collections.
3: Loop over the next sentence pair
4: for j : 0 → |F|− 1,
5: for l : 0 → MaxLength,
6: foreach fj+lj
7: compute circledotf and circledotE
8: left = circledotE ·|E|-MaxLength,
9: right= circledotE ·|E|+MaxLength,
10: for i : left → right,
11: for k : 0 → right,
12: compute circledote of ei+ki ,
13: score the phrase pair (fj+lj , ei+ki ), where
score = P(circledote|circledotf)P(l|ei+ki )P(fj+lj |ei+ki )
14: add top-n {(fj+lj , ei+ki )} into PhraseSet.
7 Experimental Results
Our system is based on the IBM Model-4 param-
eters. We train IBM Model 4 with a scheme of
1720h73043 using GIZA++ (Och and Ney, 2003).
The maximum fertility for an English word is 3. All
the data is used as given, i.e. we do not have any
preprocessing of the English-French data. The word
alignment provided in the workshop is not used in
our evaluations. The language model is provided
by the workshop, and we do not use other language
models.
The French phrases up to 8-gram in the devel-
opment and test sets are extracted with top-3 can-
didate English phrases. There are in total 2.6 mil-
lion phrase pairs 1 extracted for both development
set and the unseen test set. We did minimal tuning
of the parameters in the pharaoh decoder (Koehn,
2004) settings, simply to balance the length penalty
for Bleu score. Most of the weights are left as they
are given: [ttable-limit]=20, [ttable-threshold]=0.01,
1Our phrase table is to be released to public in this workshop
[stack]=100, [beam-threshold]=0.01, [distortion-
limit]=4, [weight-d]=0.5, [weight-l]=1.0, [weight-
w]=-0.5. Table 1 shows the algorithm’s performance
on several settings for the seven basic scores pro-
vided in section 6.
settings Dev.Bleu Tst.Bleu
s1 27.44 27.65
s2 27.62 28.25
Table 1: Pharaoh Decoder Settings
In Table 1, setting s1 was our submission
without using the inverse relative frequency of
Prf(ei+ki |fj+lj ). s2 is using all the seven scores.
8 Discussions
In this paper, we propose a generalized phrase ex-
traction algorithm towards word alignment-free uti-
lizing the fertility model to predict the width of the
block, a distortion model to predict how close the
centers of source and target phrases are, and a lex-
icon model for translational equivalence. The algo-
rithm is a general framework, in which one could
plug in other scores and word alignment to get bet-
ter results.

References
P.F. Brown, Stephen A. Della Pietra, Vincent. J.
Della Pietra, and Robert L. Mercer. 1993. The mathe-
matics of statistical machine translation: Parameter es-
timation. In Computational Linguistics, volume 19(2),
pages 263 331.
Philip Koehn. 2004. Pharaoh: a beam search decoder
for phrase-based smt. In Proceedings of the Confer-
ence of the Association for Machine Translation in the
Americans (AMTA).
Franz J. Och and Hermann Ney. 2003. A systematic
comparison of various statistical alignment models. In
Computational Linguistics, volume 29, pages 19 51.
Christoph Tillmann. 2003. A projection extension algo-
rithm for statistical machine translation. In Proceed-
ings of the Conference on Empirical Methods in Natu-
ral Language Processing (EMNLP).
Stephan Vogel, Sanjika Hewavitharana, Muntsin Kolss,
and Alex Waibel. 2004. The ISL statistical translation
system for spoken language translation. In Proc. of the
International Workshop on Spoken Language Transla-
tion, pages 65 72, Kyoto, Japan.
