Toward hierarchical models for statistical machine translation of
inflected languages
Sonja Nießen and Hermann Ney
Lehrstuhl f¨ur Informatik VI,
Computer Science Department
RWTH Aachen - University of Technology
D-52056 Aachen, Germany
a0 niessen,ney
a1 @informatik.rwth-aachen.de
Abstract
In statistical machine translation, cor-
respondences between the words in
the source and the target language are
learned from bilingual corpora on the
basis of so called alignment models.
Existing statistical systems for MT of-
ten treat different derivatives of the
same lemma as if they were indepen-
dent of each other. In this paper we
argue that a better exploitation of the
bilingual training data can be achieved
by explicitly taking into account the in-
terdependencies of the different deriva-
tives. We do this along two direc-
tions: Usage of hierarchical lexicon
models and the introduction of equiv-
alence classes in order to ignore in-
formation not relevant for the trans-
lation task. The improvement of the
translation results is demonstrated on a
German-English corpus.
1 Introduction
The statistical approach to machine translation
has become widely accepted in the last few years.
It has been successfully applied to realistic tasks
in various national and international research pro-
grams. However in many applications only small
amounts of bilingual training data are available
for the desired domain and language pair, and it
is highly desirable to avoid at least parts of the
costly data collection process.
Some recent publications have dealt with the
problem of translation with scarce resources.
(Brown et al., 1994) describe the use of dictio-
naries. (Al-Onaizan et al., 2000) report on an ex-
periment of Tetun-to-English translation by dif-
ferent groups, including one using statistical ma-
chine translation. They assume the absence of
linguistic knowledge sources such as morphologi-
cal analyzers and dictionaries. Nevertheless, they
found that human mind is very well capable of
deriving dependencies such as morphology, cog-
nates, proper names, spelling variations etc., and
that this capability was finally at the basis of the
better results produced by humans compared to
corpus based machine translation. The additional
information results from complex reasoning and it
is not directly accessible from the full word form
representation of the data.
In this paper, we take a different point of
view: Even if full bilingual training data is scarce,
monolingual knowledge sources like morpholog-
ical analyzers and data for training the target lan-
guage model as well as conventional dictionar-
ies (one word and its translation per entry) may
be available and of substantial usefulness for im-
proving the performance of statistical translation
systems. This is especially the case for highly in-
flected languages like German.
We address the question of how to achieve a
better exploitation of the resources for training the
parameters for statistical machine translation by
taking into account explicit knowledge about the
languages under consideration. In our approach
we introduce equivalence classes in order to ig-
nore information not relevant to the translation
process. We furthermore suggest the use of hi-
erarchical lexicon models.
The paper is organized as follows. After re-
viewing the statistical approach to machine trans-
lation, we first explain our motivation for exam-
ining the morphological characteristics of an in-
flected language like German. We then describe
the chosen output representation after the analysis
and present our approach for exploiting the infor-
mation from morpho-syntactic analysis. Experi-
mental results on the German-English Verbmobil
task are reported.
2 Statistical Machine Translation
The goal of the translation process in statisti-
cal machine translation can be formulated as fol-
lows: A source language string a2a4a3a5a7a6 a2 a5a4a8a9a8a9a8 a2
a3is to be translated into a target language string
a10a12a11
a5 a6
a10
a5 a8a9a8a9a8
a10
a11 . In the experiments reported in this
paper, the source language is German and the tar-
get language is English. Every English string is
considered as a possible translation for the input.
If we assign a probability a13a15a14a17a16 a10a12a11a5a19a18a2 a3a5a21a20 to each pair
of strings a16 a10 a11a5a12a22 a2a4a3a5 a20 , then according to Bayes’ de-
cision rule, we have to choose the English string
that maximizes the product of the English lan-
guage model a13a23a14a24a16 a10 a11a5 a20 and the string translation
model a13a15a14a17a16a25a2 a3a5a26a18a10a27a11a5a28a20 .
Many existing systems for statistical machine
translation (Wang and Waibel, 1997; Nießen et
al., 1998; Och and Weber, 1998) make use of a
special way of structuring the string translation
model like proposed by (Brown et al., 1993): The
correspondence between the words in the source
and the target string is described by alignments
which assign one target word position to each
source word position. The lexicon probability
a29
a16a25a2
a18
a10
a20 of a certain English word
a10 is assumed
to depend basically only on the source word a2
aligned to it.
The overall architecture of the statistical trans-
lation approach is depicted in Figure 1. In this
figure we already anticipate the fact that we can
transform the source strings in a certain manner.
3 Basic Considerations
The parameters of the statistical knowledge
sources mentioned above are trained on bilingual
Source Language Text
 Lexicon Model
Language Model
Global Search:
 
 
Target Language Text
 
over
  Pr(f
1  
J  |e
1
I )
   Pr(   e1I )
   Pr(f1  J  |e1I )   Pr(   e1I )
  e1I
f1 J
maximize
 Alignment Model
Transformation
Transformation
morpho-syntactic
Analysis
Figure 1: Architecture of the translation approach
based on Bayes’ decision rule.
corpora. In general, the resulting probabilistic
lexica contain all word forms occurring in this
training corpora as separate entries, not taking
into account whether or not they are derivatives of
the same lemma. Bearing in mind that 40% of the
word forms have only been seen once in training
(see Table 2), it is obvious that learning the cor-
rect translations is difficult for many words. Be-
sides, new input sentences are expected to con-
tain unknown word forms, for which no transla-
tion can be retrieved from the lexica. As Table
2 shows, this problem is especially relevant for
highly inflected languages like German: Texts in
German contain many more different word forms
than their English translations. The table also re-
veals that these words are often derived from a
much smaller set of base forms (“lemmata”), and
when we look at the number of different lemmata
and the respective number of lemmata, for which
there is only one occurrence in the training data,
German and English texts are more resembling.
Another aspect is the fact that conventional dic-
tionaries are often available in an electronic form
for the considered language pair. Their usabil-
ity for statistical machine translation is restricted
because they are substantially different from full
bilingual parallel corpora inasmuch the entries are
often pairs of base forms that are translations of
each other, whereas the corpora contain full sen-
tences with inflected forms. To make the informa-
tion taken from external dictionaries more useful
for the translation of inflected language is an in-
teresting objective.
As a consequence of these considerations, we
aim at taking into account the interdependencies
between the different derivatives of the same base
form.
4 Output Representation after
Morpho-syntactic Analysis
We use GERCG, a constraint grammar parser for
German for lexical analysis and morphological
and syntactic disambiguation. For a description
of the Constraint Grammar approach we refer the
reader to (Karlsson, 1990). Figure 2 gives an ex-
ample of the information provided by this tool.
Input: Wir wollen nach dem Essen
nach Essen aufbrechen
"<*wir>""wir" * PRON PERS PL1 NOM
"<wollen>""wollen" V IND PR¨AS PL1
"<nach>""nach" pre PR¨AP Dat
"<dem>""das" ART DEF SG DAT NEUTR
"<*essen>""*essen" S NEUTR SG DAT
"<nach>""nach" pre PR¨AP Dat
"<*essen>""*essen" S EIGEN NEUTR SG DAT
"*esse" S FEM PL DAT"*essen" S NEUTR PL DAT
"*essen" S NEUTR SG DAT
"<aufbrechen>""aufbrechen" V INF
Figure 2: Sample analysis of a German sentence
A full word form is represented by the infor-
mation provided by the morpho-syntactic anal-
ysis: From the interpretation “gehen-V-IND-
PR¨AS-SG1”, i.e. the lemma plus part of speech
plus the other tags the word form “gehe” can
be restored. From Figure 2 we see that the tool
can quite reliably disambiguate between differ-
ent readings: It infers for instance that the word
“wollen” is a verb in the indicative present first
person plural form. Without any context taken
into account, “wollen” has other readings. It can
even be interpreted as derived not from a verb,
but from an adjective with the meaning “made of
wool”. In this sense, the information inherent to
the original word forms is augmented by the dis-
ambiguating analyzer. This can be useful for de-
riving the correct translation of ambiguous words.
In the rare cases where the tools returned more
than one reading, it is often possible to apply sim-
ple heuristics based on domain specific preference
rules or to use a more general, non-ambiguous
analysis.
The new representation of the corpus where
full word forms are replaced by lemma plus mor-
phological and syntactic tags makes it possible
to gradually reduce the information: For exam-
ple we can consider certain instances of words as
equivalent. We have used this fact to better exploit
the bilingual training data along two directions:
Omitting unimportant information and using hi-
erarchical translation models.
5 Equivalence classes of words with
similar Translations
Inflected forms of words in the input language
contain information that is not relevant for trans-
lation. This is especially true for the task of
translating from a highly inflected language like
German into English for instance: In bilingual
German-English corpora, the German part con-
tains many more different word forms than the
English part (see Table 2). It is useful for the
process of statistical machine translation to define
equivalence classes of word forms which tend to
be translated by the same target language word,
because then, the resulting statistical translation
lexica become smoother and the coverage is sig-
nificantly improved. We construct these equiva-
lence classes by omitting those informations from
morpho-syntactic analysis, which are not relevant
for the translation task.
The representation of the corpus like it is pro-
vided by the analyzing tools helps to identify -
and access - the unimportant information. The
definition of relevant and unimportant informa-
tion, respectively, depends on many factors like
the involved languages, the translation direction
and the choice of the models.
Linguistic knowledge can provide information
about which characteristics of an input sentence
are crucial to the translation task and which can
be ignored, but it is desirable to find a method
for automating this decision process. We found
that the impact on the end result due to different
choices of features to be ignored was not large
enough to serve as reliable criterion. Instead, we
could think of defining a likelihood criterion on
a held-out corpus for this purpose. Another pos-
sibility is to assess the impact on the alignment
quality after training, which can be evaluated au-
tomatically (Langlais et al., 1998; Och and Ney,
2000), but as we found that the alignment quality
on the Verbmobil data is consistently very high,
and extremely robust against manipulation of the
training data, we abandoned this approach.
We resorted to detecting candidates from the
probabilistic lexica trained for translation from
German to English. For this, we focussed on
those derivatives of the same base form, which
resulted in the same translation. For each set
of tags, we counted how often an additional tag
could be replaced by a certain other tag without
effect on the translation. Table 1 gives some of
the most frequently identified candidates to be ig-
nored while translating: The gender of nouns is
irrelevant for their translation (which is straight-
forward, because the gender is unambiguous for a
certain noun) and the case, i.e. nominative, dative,
accusative. For the genitive forms, the translation
in English differs. For verbs we found the candi-
dates number and person. That is, the translation
of the first person singular form of a verb is of-
ten the same as the translation of the third person
plural form, for example.
Table 1: Candidates for equivalence classes.
POS candidates
noun gender: MASK,FEM,NEUTR
and case: NOM,DAT,AKK
verb number: SG,PL
and person: 1,2,3
adjective gender, case and number
number case
As a consequence, we dropped those tags,
which were most often identified as irrelevant for
translation from German to English.
6 Hierarchical Models
One way of taking into account the interdepen-
dencies of different derivatives of the same base
form is to introduce equivalence classes a30a32a31 at vari-
ous levels of abstraction starting with the inflected
form and ending with the lemma.
Consider, for example, the German verb
form a2 a6 "ankomme", which is derived from
the lemma "ankommen" and which can be
translated into English by a10 a6 "arrive". The
hierarchy of equivalence classes is as follows:
a30a32a33
a6 "ankommen-V-IND-PR¨AS-SG1"
a30 a33a35a34
a5
a6 "ankommen-V-IND-PR¨AS-SG"
a30 a33a35a34a37a36
a6 "ankommen-V-IND-PR¨AS"
...
a30a32a38
a6 "ankommen"
a8
a39 is the maximal number of morpho-syntactic
tags. a30 a33a35a34 a5 contains the forms "ankomme",
"ankommst" and "ankommt"; in a30a32a33a40a34a37a36 the
number (SG or PL) is ignored and so on. The
largest equivalence class contains all derivatives
of the infinitive "ankommen".
We can now define the lexicon probability of a
word a2 to be translated by a10 with respect to the
level a41 :
a29
a31a42a16a25a2
a18
a10
a20 a6a44a43a45a46a48a47a49a51a50
a29
a16a53a52
a31
a38
a18
a10
a20a55a54
a29
a16a25a2
a18
a52
a31
a38
a22
a10
a20 a22 (1)
where a52 a31
a38
a6
a52a42a38
a22
a8a9a8a9a8
a22
a52 a31 is the representation of a
word where the lemma a52a51a38 and a41 additional tags
are taken into account. For the example above,
a52a42a38
a6 "ankommen",
a52
a5
a6 "V", and so on.
a29
a16a25a2
a18
a52
a31
a38
a22
a10
a20 is the probability of
a2 for a given a52
a31
a38
.
We make the assumption that this probability does
not depend on a10 . a29 a16a25a2 a18a52 a33
a38
a20 is always assumed to be
1. In other words, the inflected form can non-
ambiguously be derived from the full interpreta-
tion.
a29
a16a53a52
a31
a38
a18
a10
a20 is the probability of the translation for
a10 to belong to the equivalence class
a30 a31 . The sum
over a56a57a52 a31
a38a59a58
amounts to summing up over all possible
readings of a2 .1
We combine the a29 a31 by means of linear interpo-
lation:
a29
a16a25a2
a18
a10
a20 a6a44a60
a38
a29
a38a61a16a25a2
a18
a10
a20a63a62
a8a9a8a9a8
a62 a60
a33
a29
a33 a16a25a2
a18
a10
a20
a8 (2)
7 Translation Experiments
Experiments were carried out on Verbmobil data,
which consists of spontaneously spoken dialogs
in the appointment scheduling domain (Wahlster,
1993). German source sentences are translated
into English.
7.1 Treatment of Ambiguity
Common bilingual corpora normally contain full
sentences which provide enough context informa-
tion for ruling out all but one reading for an in-
flected word form. To reduce the remaining un-
certainty, we have implemented preference rules.
For instance, we assume that the corpus is cor-
rectly true-case-converted beforehand and as a
consequence, we drop non-noun interpretations
of uppercase words. Besides, we prefer indica-
tive verb readings instead of subjunctive or im-
perative. For the remaining ambiguities, we resort
to the unambiguous parts of the readings, i.e. we
drop all tags causing mixed interpretations.
There are some special problems with the anal-
ysis of external lexica, which do not provide
enough context to enable efficient disambigua-
tion. We are currently implementing methods for
handling this special situation.
It can be argued that it would be more elegant to
leave the decision between different readings, for
instance, to the overall decision process in search.
We plan this integration for the future.
7.2 Performance Measures
We use the following evaluation criteria (Nießen
et al., 2000):
a64 SSER (subjective sentence error rate):
Each translated sentence is judged by a hu-
man examiner according to an error scale
from 0.0 (semantically and syntactically cor-
rect) to 1.0 (completely wrong).
1The probability functions are defined to return zero for
impossible interpretations of a65 .
a64 ISER (information item semantic error rate):
The test sentences are segmented into infor-
mation items; for each of them, the trans-
lation candidates are assigned either “ok”
or an error class. If the intended informa-
tion is conveyed, the error count is not in-
creased, even if there are slight syntactical
errors, which do not seriously deteriorate the
intelligibility.
7.3 Translation Results
The training set consists of 58 322 sentence pairs.
Table 2 summarizes the characteristics of the
training corpus used for training the parameters of
Model 4 proposed in (Brown et al., 1993). Testing
Table 2: Corpus statistics: Verbmobil training.
Singletons are types occurring only once in train-
ing.
English German
no. of running words 550 213 519 790
no. of word forms 4 670 7 940
no. of singletons 1 696 3 452
singletons [%] 36 43
no. of lemmata 3 875 3 476
no. of singletons 1 322 1 457
was carried out on 200 sentences not contained in
the training data. For a detailed statistics see Ta-
ble 3.
Table 3: Statistics of the Verbmobil test corpus
for German-to-English translation. Unknowns are
word forms not contained in the training corpus.
no. of sentences 200
no. of running words 2 055
no. of word forms 385
no. of unknown word forms 25
We used a translation system called “single-
word based approach” described in (Tillmann and
Ney, 2000) and compared to other approaches in
(Ney et al., 2000).
7.3.1 Lexicon Combination
So far we have performed experiments with hi-
erarchical lexica, where two levels are combined,
i.e. a39 in Equation (2) is set to 1. a60 a38 and a60 a5 are
set to a66 a8a68a67 and a29 a16a25a2 a18a52a51a38 a20 is modeled as a uniform
distribution over all derivations of the lemma a52a42a38
occurring in the training data plus the base form
itself, in case it is not contained. The process of
lemmatization is unique in the majority of cases,
and as a consequence, the sum in Equation (1) is
not needed for a two-level lexicon combination of
full word forms and lemmata.
As the results summarized in Table 4 show, the
combined lexicon outperforms the conventional
one-level lexicon. As expected, the quality gain
achieved by smoothing the lexicon is larger if
the training procedure can take advantage of an
additional conventional dictionary to learn trans-
lation pairs, because these dictionaries typically
only contain base forms of words, whereas trans-
lations of fully inflected forms are needed in the
test situation.
Examples taken from the test set are given in
Figure 3. Smoothing the lexicon entries over the
derivatives of the same lemma enables the trans-
lation of “sind” by “would” instead of “are”. The
smoothed lexicon contains the translation “conve-
nient” for any derivative of “bequem”. The com-
parative “more convenient” would be the com-
pletely correct translation.
7.3.2 Equivalence classes
As already mentioned, we resorted to choos-
ing one single reading for each word by applying
some heuristics (see Section 7.1). For the nor-
mal training corpora, unlike additional external
dictionaries, this is not critical because they con-
tain predominantly full sentences which provide
enough context for an efficient disambiguation.
Currently, we are working on the problem of ana-
lyzing the entries in conventional dictionaries, but
for the time being, experiments for equivalence
classes have been carried out using only bilingual
corpora for estimating the model parameters.
Table 5 shows the effect of the introduction of
equivalence classes. The information from the
morpho-syntactic analyzer (stems plus tags like
described in Section 4) is reduced by dropping
unimportant information like described in Section
5. Both error metrics could be decreased in com-
parison to the usage of the original corpus with
inflected word forms. A reduction of 3.3% of the
information item semantic error rate shows that
more of the intended meaning could be found in
the produced translations.
Table 5: Effect of the introduction of equivalence
classes. For the baseline we used the original in-
flected word forms.
SSER [%] ISER [%]
inflected words 37.4 26.8
equivalence classes 35.9 23.5
The first two examples in Figure 4 demonstrate
the effect of the disambiguating analyzer which
identifies “Hotelzimmer” as singular on the ba-
sis of the context (the word itself can represent
the plural form as well), and “das” as article in
contrast to a pronoun. The third example shows
the advantage of grouping words in equivalence
classes: The training data does not contain the
word “billigeres”, but when generalizing over the
gender and case information, a correct translation
can be produced.
8 Conclusion and Future Work
We have presented methods for a better exploita-
tion of the bilingual training data for statisti-
cal machine translation by explicitly taking into
account the interdependencies of the different
derivatives of the same base form. We suggest the
usage of hierarchical models as well as an alter-
native representation of the data in combination
with the identification and omission of informa-
tion not relevant for the translation task.
First experiments prove their general applica-
bility to realistic tasks such as spontaneously spo-
ken dialogs. We expect the described methods to
yield more improvement of the translation quality
for cases where much smaller amounts of training
data are available.
As there is a large overlap between the mod-
eled events in the combined probabilistic models,
we assume that log-linear combination would re-
sult in more improvement of the translation qual-
ity than the combination by linear interpolation
does. We will investigate this in the future. We
also plan to integrate the decision regarding the
choice of readings into the search process.
Table 4: Effect of two-level lexicon combination. For the baseline we used the conventional one-level
full form lexicon.
ext. dictionary SSER [%] ISER [%]
baseline yes 35.7 23.9
combined yes 33.8 22.3
baseline no 37.4 26.8
combined no 36.9 25.8
input sind Sie mit einem Doppelzimmer einverstanden?
baseline are you agree with a double room?
combined lexica would you agree with a double room?
input mit dem Zug ist es bequemer
baseline by train it is UNKNOWN-bequemer
combined lexica by train it is convenient
Figure 3: Examples for the effect of the combined lexica.
Acknowledgement. This work was partly sup-
ported by the German Federal Ministry of Educa-
tion, Science, Research and Technology under the
Contract Number 01 IV 701 T4 (VERBMOBIL).

References
Yaser Al-Onaizan, Ulrich Germann, Ulf Hermjakob,
Kevin Knight, Philipp Koehn, Daniel Marcu, and
Kenji Yamada. 2000. Translating with scarce re-
sources. In Proceedings of the Seventeenth Na-
tional Conference on Artificial Intelligence (AAAI),
pages 672–678, Austin, Texas, August.
P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, and
R.L. Mercer. 1993. Mathematics of statistical ma-
chine translation: Parameter estimation. Computa-
tional Linguistics, 19(2):263–311.
P. F. Brown, S. A. Della Pietra, and M. J. Della Pietra,
V. J. a nd Goldsmith. 1994. But dictionaries are
data too. In Proc. ARPA Human Language Tech-
nology Workshop ’93, pages 202–205, Princeton,
NJ, March. distributed as Human Language Tech-
nology by San Mateo, CA: Morgan Kaufmann Pub-
lishers.
Fred Karlsson. 1990. Constraint grammar as a frame-
work for parsing running text. In Proceedings of the
13th International Conference on Computational
Linguistics, volume 3, pages 168–173, Helsinki,
Finland.
Philippe Langlais, Michel Simard, and Jean V´eronis.
1998. Methods and practical issues in evaluat-
ing alignment techniques. In Proceedings of 36th
Annual Meeting of the Association for Computa-
tional Linguistics and 17th International Confer-
ence on Computational Linguistic, pages 711–717,
Montr´eal, P.Q., Canada, August.
Hermann Ney, Sonja Nießen, Franz Josef Och, Has-
san Sawaf, Christoph Tillmann, and Stephan Vogel.
2000. Algorithms for statistical translation of spo-
ken language. IEEE Transactions on Speech and
Audio Processing, 8(1):24–36, January.
Sonja Nießen, Stephan Vogel, Hermann Ney, and
Christoph Tillmann. 1998. A DP based search al-
gorithm for statistical machine translation. In Pro-
ceedings of the 36th Annual Meeting of the Associa-
tion for Computational Linguistics and the 17th In-
ternational Conference on Computational Linguis-
tics, pages 960–967, Montr´eal, P.Q., Canada, Au-
gust.
Sonja Nießen, Franz Josef Och, Gregor Leusch, and
Hermann Ney. 2000. An evaluation tool for ma-
chine translation: Fast evaluation for MT research.
In Proceedings of the 2nd International Conference
on Language Resources and Evaluation, pages 39–
45, Athens, Greece, May.
Franz Josef Och and Hermann Ney. 2000. Im-
proved statistical alignment models. In Proc. of the
38th Annual Meeting of the Association for Com-
putational Linguistics, pages 440–447, Hongkong,
China, October.
Franz Josef Och and Hans Weber. 1998. Improv-
ing statistical natural language translation with cat-
egories and rules. In Proceedings of the 36th
Annual Meeting of the Association for Computa-
tional Linguistics and the 17th International Con-
ference on Computational Linguistics, pages 985–
989, Montr´eal, P.Q., Canada, August.
Christoph Tillmann and Hermann Ney. 2000. Word
re-ordering and DP-based search in statistical ma-
chine translation. In Proc. COLING 2000: The
18th Int. Conf. on Computational Linguistics, pages
850–856, Saarbr¨ucken, Germany, August.
Wolfgang Wahlster. 1993. Verbmobil: Translation of
Face-to-Face Dialogs. In Proceedings of the MT
Summit IV, pages 127–135, Kobe, Japan.
Ye-Yi Wang and Alex Waibel. 1997. Decoding al-
gorithm in statistical translation. In Proceedings of
the ACL/EACL ’97, Madrid, Spain, pages 366–372,
July.
