Modeling with Structures in Statistical Machine Translation 
Ye-Yi Wang and Alex Waibel 
School of Computer Science 
Carnegie Mellon University 
5000 Forbes Avenue 
Pittsburgh, PA 15213, USA 
{yyw, waibel}@cs.cmu.edu
Abstract 
Most statistical machine translation systems employ a word-based alignment model. In this paper we demonstrate that word-based alignment is a major cause of translation errors. We propose a new alignment model based on shallow phrase structures, and the structures can be automatically acquired from a parallel corpus. The new model achieved an error reduction of over 10% in our spoken language translation task.
1 Introduction 
Most (if not all) statistical machine translation 
systems employ a word-based alignment model 
(Brown et al., 1993; Vogel, Ney, and Tillman, 
1996; Wang and Waibel, 1997), which treats 
words in a sentence as independent entities and 
ignores the structural relationship among them. 
While this independence assumption works well 
in speech recognition, it poses a major problem 
in our experiments with spoken language trans- 
lation between a language pair with very dif- 
ferent word orders. In this paper we propose a 
translation model that employs shallow phrase 
structures. It has the following advantages over 
word-based alignment: 
• Since the translation model can directly depict phrase reordering in translation, it is more accurate for translation between languages with different word (phrase) orders.

• The decoder of the translation system can use the phrase information and extend hypotheses by phrases (multiple words) rather than single words, which speeds up decoding.
The paper is organized as follows. In sec- 
tion 2, the problems of word-based alignment 
models are discussed. To alienate these prob- 
lems, a new alignment model based on shal- 
low phrase structures is introduced in section 
3. In section 4, a grammar inference algorithm 
is presented that can automatically acquire the 
phrase structures used in the new model. Trans- 
lation performance is then evaluated in sec- 
tion 5, and conclusions are presented in sec- 
tion 6. 
2 Word-based Alignment Model 
In a word-based alignment translation model, 
the transformation from a sentence at the source 
end of a communication channel to a sentence 
at the target end can be described with the fol- 
lowing random process: 
1. Pick a length for the sentence at the target 
end. 
2. For each word position in the target sen- 
tence, align it with a source word. 
3. Produce a word at each target word po- 
sition according to the source word with 
which the target word position has been 
aligned. 
IBM Alignment Model 2 is a typical example of word-based alignment. Assuming a sentence $s = s_1, \ldots, s_l$ at the source end of a channel, the model picks a length m for the target sentence t according to the distribution $P(m \mid s) = \epsilon$, where $\epsilon$ is a small, fixed number. Then, for each position i ($1 \le i \le m$) in t, it finds the corresponding position $a_i$ in s according to the alignment distribution $P(a_i \mid i, a_1^{i-1}, m, s) = a(a_i \mid i, l, m)$. Finally, it generates a word $t_i$ at position i of t from the source word $s_{a_i}$ at the aligned position $a_i$, according to the translation distribution $P(t_i \mid t_1^{i-1}, a_1^m, s) = t(t_i \mid s_{a_i})$.
[Figure 1: Word alignment with deletion in translation: the top alignment is the one made by IBM Alignment Model 2, the bottom one is the 'ideal' alignment. The example pair is German 'waere denn Montag der sech und zwanzigste Juli moeglich' and English 'it's going to difficulty to find meeting time I think is Monday the twenty sixth of July possible'.]
[Figure 2: Word alignment of a translation with different phrase order: the top alignment is the one made by IBM Alignment Model 2, the bottom one is the 'ideal' alignment. The example pair is German 'fuer der zweiten Termin im Mai koennte ich den Mittwoch den fuenf und zwanzigsten anbieten' and English 'I could offer you Wednesday the twenty fifth for the second date in May'.]
[Figure 3: Word alignment with Model 1 for one of the previous examples. Because no alignment probability penalizes the long-distance phrase reordering, it is much closer to the 'ideal' alignment.]
Therefore, $P(t \mid s)$ is the sum of the probabilities of generating t from s over all possible alignments $A$, in which position i in t is aligned with position $a_i$ in s:

$$P(t \mid s) = \epsilon \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(t_j \mid s_{a_j}) \, a(a_j \mid j, l, m) = \epsilon \prod_{j=1}^{m} \sum_{i=0}^{l} t(t_j \mid s_i) \, a(i \mid j, l, m) \quad (1)$$
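Equation (1) can be evaluated directly with its factored right-hand side. The following is a minimal sketch, not the authors' implementation; t_prob, a_prob, and epsilon are hypothetical stand-ins for the trained lexicon table t, the alignment table a, and the length constant, and the null source word is assumed to occupy position 0 of s.

    # Minimal sketch of the Model 2 likelihood in Equation (1).
    # t_prob[(t_word, s_word)] and a_prob[(i, j, l, m)] are hypothetical
    # dictionaries holding the trained t and a tables.
    def model2_likelihood(s, t, t_prob, a_prob, epsilon=1.0e-4):
        l, m = len(s) - 1, len(t)   # s[0] is the null word
        prob = epsilon
        for j, t_word in enumerate(t, start=1):
            # Inner sum of Equation (1): marginalize over the source
            # position the target word at position j aligns with.
            prob *= sum(t_prob.get((t_word, s[i]), 0.0)
                        * a_prob.get((i, j, l, m), 0.0)
                        for i in range(l + 1))
        return prob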
A word-based model may have severe problems when there are deletions in translation (possibly a result of erroneous sentence alignment) or when the two languages have different word orders, as English and German do. Figure 1 and Figure 2 show some problematic alignments between English/German sentences made by IBM Model 2, together with the 'ideal' alignments for the sentences. Here the alignment parameters penalize the alignment of English words with their German translation equivalents because the equivalents are far away from the words' positions.
An experiment reveals how often this kind of "skewed" alignment happens in our English/German scheduling conversation parallel corpus (Wang and Waibel, 1997). The experiment was based on the following observation: IBM translation Model 1 (where the alignment distribution is uniform) and Model 2 found similar Viterbi alignments when there were no movements or deletions, and they predicted very different Viterbi alignments when the skewness was severe in a sentence pair, since the alignment parameters in Model 2 penalize long-distance alignments. Figure 3 shows the Viterbi alignment discovered by Model 1 for the same sentence pair as in Figure 2.¹
We measured the distance between a Model 1 alignment $a^1$ and a Model 2 alignment $a^2$ as $\sum_{i=1}^{|g|} |a^1_i - a^2_i|$. To estimate the skewness of the corpus, we collected statistics on the percentage of sentence pairs (with at least five words per sentence) whose Model 1 and Model 2 alignment distance was greater than $1/4, 2/4, 3/4, \ldots, 10/4$ of the target sentence length. By checking the Viterbi alignments made by both models, we found it almost certain that whenever the distance is greater than 3/4 of the target sentence length, there is either a movement or a deletion in the sentence pair. Figure 4 plots this statistic: around 30% of the sentence pairs in our training data have some degree of skewness in their alignments.

¹The better alignment on a given pair of sentences does not mean that Model 1 is a better model overall. A non-uniform alignment distribution is desirable; otherwise the language model would be the only factor determining the source sentence word order in decoding.

[Figure 4: Skewness of Translations. The plot shows the percentage of sentence pairs whose Model 1/Model 2 alignment distance exceeds x times the target sentence length, for x from 0 to 2.5.]
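The skewness statistic can be reproduced along the following lines. This is a sketch under the same hypothetical t_prob/a_prob tables as before; for these models the Viterbi alignment decomposes into an independent best choice per target position, so no search is needed.

    # Sketch of the skewness test: distance between the Model 1 and
    # Model 2 Viterbi alignments, compared with x times the target length.
    def viterbi_alignment(s, t, t_prob, a_prob=None):
        # With a_prob=None the alignment distribution is uniform,
        # which yields the Model 1 Viterbi alignment.
        l, m = len(s) - 1, len(t)
        alignment = []
        for j, t_word in enumerate(t, start=1):
            best_i, best_score = 0, -1.0
            for i in range(l + 1):
                score = t_prob.get((t_word, s[i]), 0.0)
                if a_prob is not None:
                    score *= a_prob.get((i, j, l, m), 0.0)
                if score > best_score:
                    best_i, best_score = i, score
            alignment.append(best_i)
        return alignment

    def is_skewed(s, t, t_prob, a_prob, x=0.75):
        a1 = viterbi_alignment(s, t, t_prob)           # Model 1
        a2 = viterbi_alignment(s, t, t_prob, a_prob)   # Model 2
        distance = sum(abs(i1 - i2) for i1, i2 in zip(a1, a2))
        return distance > x * len(t)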
3 Structure-based Alignment Model 
To solve the problems with the word-based 
alignment models, we present a structure-based 
alignment model here. The idea is to di- 
rectly model the phrase movement with a rough 
alignment, and then model the word alignment 
within phrases with a detailed alignment. 
Given an English sentence $e = e_1 e_2 \cdots e_l$, its German translation $g = g_1 g_2 \cdots g_m$ can be generated by the following process:

1. Parse e into a sequence of phrases:
$e = (e_{11}, e_{12}, \ldots, e_{1l_1})(e_{21}, e_{22}, \ldots, e_{2l_2}) \cdots (e_{n1}, e_{n2}, \ldots, e_{nl_n}) = E_0 E_1 E_2 \cdots E_n$,
where $E_0$ is a null phrase.
2. With the probability $P(q \mid e, E)$, determine $q \le n + 1$, the number of phrases in g. Let $G_1 \cdots G_q$ denote these q phrases. Each source phrase can be aligned with at most one target phrase. Unlike English phrases, the words in a German phrase do not have to form a consecutive sequence, so g may be expressed as $g = g_{11} g_{12} g_{21} g_{13} g_{22} \cdots$, where $g_{ij}$ represents the j-th word in the i-th phrase.
3. For each German phrase $G_i$, $0 \le i < q$, align it with an English phrase $E_{r_i}$ with the probability $P(r_i \mid i, r_0^{i-1}, E, e)$.
4. For each German phrase $G_i$, $0 \le i < q$, determine its beginning position $b_i$ in g with the distribution $P(b_i \mid i, b_0^{i-1}, r_0^q, e, E)$.
5. Now the individual words in the German phrases are generated through the detailed alignment, which works like IBM Model 4. For each word $e_{ij}$ in the phrase $E_i$, its fertility $\phi_{ij}$ has the distribution $P(\phi_{ij} \mid i, j, \phi_{i1}^{j-1}, \phi_0^{i-1}, b_0^q, r_0^q, e, E)$.
6. Each word $e_{ij}$ in the phrase $E_i$ generates a tablet $\tau_{ij} = \{\tau_{ij1}, \tau_{ij2}, \ldots, \tau_{ij\phi_{ij}}\}$ by generating each of the words in $\tau_{ij}$ in turn, with the probability $P(\tau_{ijk} \mid \tau_{ij1}^{k-1}, \tau_{i1}^{j-1}, \tau_0^{i-1}, \phi_0^i, b_0^q, r_0^q, e, E)$ for the k-th word in the tablet.
7. For each element $\tau_{ijk}$ in the tablet $\tau_{ij}$, the permutation $\pi_{ijk}$ determines its position in the target sentence according to the distribution $P(\pi_{ijk} \mid \pi_{ij1}^{k-1}, \pi_{i1}^{j-1}, \pi_0^{i-1}, \tau_0^i, \phi_0^i, b_0^q, r_0^q, e, E)$.
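For concreteness, one derivation under steps 1-7 can be recorded in a structure like the sketch below; the field names are our own hypothetical notation, not part of the model definition. The probability of a derivation is the product of the probabilities of the seven steps.

    # Record of one derivation of g from e; fields mirror steps 1-7.
    from dataclasses import dataclass

    @dataclass
    class Derivation:
        phrases: list     # step 1: E_0..E_n, each a list of source words
        q: int            # step 2: number of target phrases
        r: list           # step 3: r[i] = source phrase aligned with G_i
        b: list           # step 4: b[i] = beginning position of G_i in g
        fertility: dict   # step 5: (i, j) -> phi_ij
        tablets: dict     # step 6: (i, j) -> [tau_ij1, ..., tau_ij_phi]
        positions: dict   # step 7: (i, j, k) -> pi_ijk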
We made the following independence assump- 
tions: 
1. The number of target sentence phrases depends only on the number of phrases in the source sentence: $P(q \mid e, E) = p_n(q \mid n)$.
2. $P(r_i \mid i, r_0^{i-1}, E, e) = a(r_i \mid i) \times \prod_{0 \le j < i} (1 - \delta(r_i, r_j))$,
where $\delta(x, y) = 1$ when $x = y$ and $\delta(x, y) = 0$ otherwise. This assumption states that $P(r_i \mid i, r_0^{i-1}, E, e)$ depends on i and $r_i$. It also depends on $r_0^{i-1}$ through the factor $\prod_{0 \le j < i} (1 - \delta(r_i, r_j))$, which ensures that each English phrase is aligned with at most one German phrase.
3. The beginning position of a target phrase depends on its distance from the beginning position of the preceding phrase, as well as on the length of the source phrase aligned with the preceding phrase:
$P(b_i \mid i, b_0^{i-1}, r_0^q, e, E) = \alpha(\Delta_i \mid |E_{r_{i-1}}|)$,
where $\Delta_i = b_i - b_{i-1}$.

4. The fertility and translation tablet of a source word depend on the word only:
$P(\phi_{ij} \mid i, j, \phi_{i1}^{j-1}, \phi_0^{i-1}, b_0^q, r_0^q, e, E) = n(\phi_{ij} \mid e_{ij})$
$P(\tau_{ijk} \mid \tau_{ij1}^{k-1}, \tau_{i1}^{j-1}, \tau_0^{i-1}, \phi_0^i, b_0^q, r_0^q, e, E) = t(\tau_{ijk} \mid e_{ij})$
5. The leftmost position of the translations of a source word depends on its distance from the beginning of the target phrase aligned with the source phrase that contains the source word. It also depends on the identity of the phrase and on the position of the source word within the source phrase:
$P(\pi_{ij1} \mid \pi_{i1}^{j-1}, \pi_0^{i-1}, \tau_0^i, \phi_0^i, b_0^q, r_0^q, e, E) = d_1(\pi_{ij1} - b_i \mid E_i, j)$
For a target word $\tau_{ijk}$ other than the leftmost $\tau_{ij1}$ in the translation tablet of the source word $e_{ij}$, its position depends on its distance from the position of the tablet word $\tau_{ij(k-1)}$ immediately to its left, on the class of the target word $\tau_{ijk}$, and on the fertility of the source word $e_{ij}$:
$P(\pi_{ijk} \mid \pi_{ij1}^{k-1}, \pi_{i1}^{j-1}, \pi_0^{i-1}, \tau_0^i, \phi_0^i, b_0^q, r_0^q, e, E) = d_2(\pi_{ijk} - \pi_{ij(k-1)} \mid \mathcal{G}(\tau_{ijk}), \phi_{ij})$
where $\mathcal{G}(g)$ is the equivalence class of g.
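Under assumptions 1-3, the phrase-level rough alignment probability factors as in the sketch below; p_n, a, and alpha are hypothetical trained tables keyed as in the text, and the treatment of the i = 0 boundary is a simplification.

    # Sketch of the rough-alignment probability (assumptions 1-3).
    def rough_alignment_prob(q, n, r, b, src_phrase_lens, p_n, a, alpha):
        prob = p_n.get((q, n), 0.0)                    # assumption 1
        for i in range(q):
            if r[i] in r[:i]:                          # assumption 2: each source
                return 0.0                             # phrase used at most once
            prob *= a.get((r[i], i), 0.0)
            delta = b[i] - (b[i - 1] if i > 0 else 0)
            prev_len = src_phrase_lens[r[i - 1]] if i > 0 else 0
            prob *= alpha.get((delta, prev_len), 0.0)  # assumption 3
        return prob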
3.1 Parameter Estimation 
The EM algorithm was used to estimate the seven types of parameters: $p_n$, $a$, $\alpha$, $n$, $t$, $d_1$ and $d_2$. We used a subset of probable alignments in the EM learning, since the total number of alignments is exponential in the target sentence length. The subset consisted of the neighboring alignments (Brown et al., 1993) of the Viterbi alignments discovered by Model 1 and Model 2. We chose to include the Model 1 Viterbi alignment because the Model 1 alignment is closer to the "ideal" one when strong skewness exists in a sentence pair.
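The neighboring alignments of a seed Viterbi alignment can be enumerated as in the sketch below, following the move/swap neighborhood of Brown et al. (1993); the list representation (target position j maps to source position alignment[j]) is our own.

    # Sketch of the alignment neighborhood used to restrict EM.
    from itertools import combinations

    def neighbors(alignment, source_len):
        seen = {tuple(alignment)}
        for j in range(len(alignment)):          # moves: re-align one
            for i in range(source_len + 1):      # target word (0 = null)
                moved = list(alignment)
                moved[j] = i
                seen.add(tuple(moved))
        for j1, j2 in combinations(range(len(alignment)), 2):
            swapped = list(alignment)            # swaps: exchange the source
            swapped[j1], swapped[j2] = swapped[j2], swapped[j1]
            seen.add(tuple(swapped))
        return seen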
4 Finding the Structures 
The structure-based alignment model would be of little interest if we had to find the language structures manually and write a grammar for them, since the primary merit of statistical machine translation is to reduce human labor.
In this section we introduce a grammar infer- 
ence technique that finds the phrases used in the 
structure-based alignment model. It is based on 
the work in (Ries, Buø, and Wang, 1995), where
the following two operators are used: 
1. Clustering: Clustering words/phrases with similar meanings/grammatical functions into equivalence classes. The mutual information clustering algorithm (Brown et al., 1992) was used for this.

2. Phrasing: The equivalence class sequence $c_1, c_2, \ldots, c_k$ forms a phrase if

$$P(c_1, c_2, \ldots, c_k) \log \frac{P(c_1, c_2, \ldots, c_k)}{P(c_1) P(c_2) \cdots P(c_k)} > \theta,$$

where $\theta$ is a threshold. By changing the threshold, we obtain different numbers of phrases.
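A minimal sketch of the phrasing criterion follows; seq_prob and class_prob are hypothetical relative-frequency estimates over the class-tagged training corpus.

    # Sketch of the phrasing test: weighted pointwise mutual information
    # of a class sequence against the threshold theta.
    import math

    def is_phrase(classes, seq_prob, class_prob, theta):
        p_seq = seq_prob.get(tuple(classes), 0.0)
        p_indep = math.prod(class_prob.get(c, 0.0) for c in classes)
        if p_seq == 0.0 or p_indep == 0.0:
            return False
        return p_seq * math.log(p_seq / p_indep) > theta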
The two operators are applied iteratively to the training corpus in alternating steps. This results in hierarchical phrases in the form of sequences of equivalence classes of words/phrases.
Since the algorithm uses only a monolingual corpus, it often introduces language-specific structures that result from the usage biases of one particular language. In machine translation we are more interested in cross-linguistic structures, much as interlingua is used to represent cross-linguistic information in knowledge-based MT.
To obtain structures that are common to both languages, a bilingual mutual information clustering algorithm (Wang, Lafferty, and Waibel, 1996) was used as the clustering operator. It takes constraints from the parallel corpus. We also introduced an additional constraint in clustering: words in the same class must share at least one potential part-of-speech.
Bilingual constraints are also imposed on the phrasing operator. We used bilingual heuristics to filter out the sequences acquired by the phrasing operator that may not be common across languages. The heuristics include:
1. Average Translation Span: Given a phrase candidate, its average translation span is the distance between the leftmost and the rightmost target positions aligned with the words inside the candidate, averaged over the Model 1 Viterbi alignments of the sample sentences. A candidate is filtered out if its average translation span is greater than the length of the candidate multiplied by a threshold. This criterion states that the words in the translation of a phrase have to be close enough together to form a phrase in the other language.
2. Ambiguity Reduction: A word occurring in a phrase should be less ambiguous than in a random context; a phrase should therefore reduce the ambiguity (uncertainty) of the words inside it. For each source language word class c, its translation entropy is defined as $-\sum_g t(g \mid c) \log t(g \mid c)$. The average per-class entropy reduction induced by the introduction of a phrase P is therefore

$$\frac{1}{|P|} \sum_{c \in P} \Big[ \sum_g t(g \mid c, P) \log t(g \mid c, P) - \sum_g t(g \mid c) \log t(g \mid c) \Big].$$

A threshold was set for the minimum entropy reduction.
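The first heuristic might be computed as in the sketch below; occurrences is a hypothetical precomputed list that pairs each occurrence of a candidate (its source-side span) with the Model 1 Viterbi alignment of that sentence, where alignment[j] is the source position aligned with target position j.

    # Sketch of the average-translation-span filter.
    def average_translation_span(occurrences):
        spans = []
        for (start, end), alignment in occurrences:
            # Target positions aligned with any word inside the candidate.
            targets = [j for j, i in enumerate(alignment) if start <= i <= end]
            if targets:
                spans.append(max(targets) - min(targets))
        return sum(spans) / len(spans) if spans else 0.0

    def keep_candidate(candidate_len, occurrences, threshold):
        # Discard candidates whose translations are too spread out to
        # form a phrase in the other language.
        return average_translation_span(occurrences) <= candidate_len * threshold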
By applying the clustering operator followed by the phrasing operator, we obtained shallow phrase structures, some of which are shown in Figure 5.
Given a set of phrases, we can deterministically parse a sentence into a sequence of phrases by repeatedly replacing the leftmost unparsed substring with the longest matching phrase in the set.
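A minimal sketch of this greedy parser, assuming phrases are stored as tuples of word classes; phrase_set, word_class, and max_len are hypothetical names.

    # Sketch of deterministic longest-match phrase parsing.
    def parse_into_phrases(sentence, phrase_set, word_class, max_len=6):
        classes = [word_class[w] for w in sentence]
        phrases, i = [], 0
        while i < len(classes):
            # Try the longest match first; a single word always parses
            # as a one-word phrase.
            for k in range(min(max_len, len(classes) - i), 0, -1):
                if k == 1 or tuple(classes[i:i + k]) in phrase_set:
                    phrases.append(sentence[i:i + k])
                    i += k
                    break
        return phrases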
5 Evaluation and Discussion 
We used the Janus English/German schedul- 
ing corpus (Suhm et al., 1995) to train our 
phrase-based alignment model. Around 30,000 
parallel sentences (400,000 words altogether for 
both languages) were used for training. The 
same data were used to train Simplified Model 
2 (Wang and Waibel, 1997) and IBM Model 
3 for performance comparison. A larger En- 
glish monolingual corpus with around 0.5 mil- 
lion words was used for the training of a bigram 
language model. A preprocessor split German compound nouns. Words that occurred only once were treated as unknown words. This resulted in a lexicon of 1372 English and 2202 German words. The English and German lexicons were clustered into 250 classes in each language, and 560 English phrases were constructed over these classes with the grammar inference algorithm described earlier.

We limited the maximum sentence length to 20 words / 15 phrases, and the maximum fertility of non-null words to 3.
[Figure 5: Examples of acquired phrases. Words in a bracket form a cluster; phrases are cluster sequences. Ellipses indicate that a cluster has more words than those shown here. Recoverable examples include [at by...] [one two...], [the every each...] [first second third...], [in within...] [January February...], and [January February...] [the every each...] [first second third...].]
    Model      Correct   OK   Incorrect   Accuracy
    Model 2      284     87     176        59.9%
    Model 3       98     45      57        60.3%
    S. Model     303     96     148        64.2%

Table 1: Translation accuracy: a correct translation gets one credit, an okay translation gets 1/2 credit, and an incorrect one gets no credit. Since the IBM Model 3 decoder is too slow, its performance was not measured on the entire test set.
5.1 Translation Accuracy 
Table 1 shows the end-to-end translation perfor- 
mance. The structure-based model achieved an 
error reduction of around 12.5% over the word- 
based alignment models. 
5.2 Word Order and Phrase Alignment 
Table 2 shows the alignment distribution for the first German word/phrase in Simplified Model 2 and in the structure-based model. The probability mass is more scattered in the structure-based model, reflecting the fact that English and German have different phrase orders. The word-based model, on the other hand, tends to align a target word with source words at similar positions, which resulted in many incorrect alignments and hence spread the word translation probability t over many unrelated target words, as shown in the next subsection.

5.3 Model Complexity
The structure-based model has 3,081,617 free parameters, an increase of about 2% over the 3,022,373 free parameters of Simplified Model 2. This small increase does not cause over-fitting, as the performance on the test data suggests. On the other hand, the structure-based model is more accurate. This can be illustrated with the translation probability distribution of the English word 'I'. Table 3 shows the possible translations of 'I' with probability greater than 0.01. It is clear that the structure-based model "focuses" better on the correct translations. It is interesting to note that the German translations of 'I' in Simplified Model 2 often appear at the beginning of a sentence, the position where 'I' often appears in English sentences. It is the biased word-based alignments that pull the unrelated words together and increase the translation uncertainty.
    j             0      1     2      3      4      5      6      7      8          9
    a_M2(j | 1)  0.04   0.86  0.054  0.025  0.008  0.005  0.004  0.002  3.3x10^-4  2.9x10^-4
    a_SM(j | 1)  0.003  0.29  0.25   0.15   0.07   0.11   0.05   0.04   0.02       0.01

Table 2: The alignment distribution for the first German word/phrase in Simplified Model 2 (a_M2) and in the structure-based model (a_SM). The second distribution reflects the higher probability of phrase reordering in translation.

    t_M2(* | I)          t_SM(* | I)
    ich    0.708         ich    0.988
    da     0.104         mich   0.010
    am     0.024
    das    0.022
    dann   0.022
    also   0.019
    es     0.011

Table 3: The translation distribution of 'I'. It is more uncertain in the word-based alignment model because the biased alignment distribution forced associations between unrelated English/German words.

We define the average translation entropy as

$$\sum_{i=0}^{m} P(e_i) \sum_{j=1}^{n} -t(g_j \mid e_i) \log t(g_j \mid e_i),$$

where m and n are the English and German lexicon sizes. It is a direct measurement of word translation uncertainty. The average translation entropy is 3.01 bits per source word in Simplified Model 2, 2.68 in Model 3, and 2.50 in the structure-based model. Information-theoretically, therefore, the complexity of the word-based alignment models is higher than that of the structure-based model.
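The average translation entropy can be computed as in the sketch below; unigram_prob and t_prob are hypothetical stand-ins for P(e_i) and t(g_j | e_i).

    # Sketch of the average translation entropy (in bits): each source
    # word's translation entropy is weighted by its unigram probability.
    import math

    def average_translation_entropy(english_words, german_words,
                                    unigram_prob, t_prob):
        total = 0.0
        for e in english_words:
            entropy = 0.0
            for g in german_words:
                p = t_prob.get((g, e), 0.0)
                if p > 0.0:
                    entropy -= p * math.log2(p)
            total += unigram_prob.get(e, 0.0) * entropy
        return total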
6 Conclusions 
The structure-based alignment model directly captures the word order differences between English and German and makes the word translation distributions focus on the correct translations, thereby improving translation performance.
7 Acknowledgements 
We would like to thank the anonymous COL- 
ING/ACL reviewers for valuable comments. 
This research was partly supported by ATR and 
the Verbmobil Project. The views and conclu- 
sions in this document are those of the authors. 

References 
Brown, P. F., S. A. Della-Pietra, V. J. Della-Pietra, and R. L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311.

Brown, P. F., V. J. Della-Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. 1992. Class-Based N-gram Models of Natural Language. Computational Linguistics, 18(4):467-479.

Ries, Klaus, Finn Dag Buø, and Ye-Yi Wang. 1995. Improved Language Modelling by Unsupervised Acquisition of Structure. In ICASSP '95. IEEE. Corrected version available via http://www.cs.cmu.edu/~ries/icassp_95.html.

Suhm, B., P. Geutner, T. Kemp, A. Lavie, L. Mayfield, A. McNair, I. Rogina, T. Schultz, T. Sloboda, W. Ward, M. Woszczyna, and A. Waibel. 1995. JANUS: Towards Multilingual Spoken Language Translation. In Proceedings of the ARPA Spoken Language Technology Workshop, Austin, TX.

Vogel, S., H. Ney, and C. Tillman. 1996. HMM-Based Word Alignment in Statistical Translation. In Proceedings of the Sixteenth International Conference on Computational Linguistics (COLING-96), pages 836-841, Copenhagen, Denmark.

Wang, Y., J. Lafferty, and A. Waibel. 1996. Word Clustering with Parallel Spoken Language Corpora. In Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP '96), Philadelphia, USA.

Wang, Y. and A. Waibel. 1997. Decoding Algorithm in Statistical Machine Translation. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL/EACL '97), pages 366-372, Madrid, Spain.
