Automatic Acquisition of Hierarchical Transduction Models 
for Machine Translation 
Hiyan Alshawi Srinivas Bangalore Shona Douglas 
AT&T Labs Research 
180 Park Avenue, P.O. Box 971 
Florham Park, NJ 07932 USA 
Abstract 
We describe a method for the fully automatic 
learning of hierarchical finite state translation 
models. The input to the method is transcribed 
speech utterances and their corresponding hu- 
man translations, and the output is a set of 
head transducers, i.e. statistical lexical head- 
outward transducers. A word-alignment func- 
tion and a head-ranking function are first ob- 
tained, and then counts are generated for hy- 
pothesized state transitions of head transduc- 
ers whose lexical translations and word order 
changes are consistent with the alignment. The 
method has been applied to create an English- 
Spanish translation model for a speech trans- 
lation application, with word accuracy of over 
75% as measured by a string-distance compari- 
son to three reference translations. 
1 Introduction 
The fully automatic construction of translation 
models offers benefits in terms of development
effort and potentially in robustness over meth- 
ods requiring hand-coding of linguistic informa- 
tion. However, there are disadvantages to the 
automatic approaches proposed so far. The various
methods described by Brown et al. (1990;
1993) do not take into account the natural struc- 
turing of strings into phrases. Example-based 
translation, exemplified by the work of Sumita 
and Iida (1995), requires very large amounts 
of training material. The number of states 
in a simple finite state model such as those 
used by Vilar et al. (1996) becomes extremely 
large when faced with languages with large word 
order differences. The work reported in Wu 
(1997), which uses an inside-outside type of 
training algorithm to learn statistical context- 
free transduction, has a similar motivation to 
the current work, but the models we describe 
here, being fully lexical, are more suitable for 
direct statistical modelling. 
In this paper, we show that both the net- 
work topology and parameters of a head trans- 
ducer translation model (Alshawi, 1996b) can 
be learned fully automatically from a bilingual 
corpus. It has already been shown (Alshawi et 
al., 1997) that a head transducer model with 
hand-coded structure can be trained to give bet- 
ter accuracy than a comparable transfer-based 
system, with smaller model size, computational 
requirements, and development effort. 
We have applied the learning method to cre- 
ate an English-Spanish translation model for a 
limited domain, with word accuracy of over 75% 
measured by a string distance comparison (as 
used in speech recognition) to three reference 
translations. The resulting translation model 
has been used as a component of an English- 
Spanish speech translation system. 
We first present the steps of the transduc- 
tion training method in Section 2. In Section 3 
we describe how we obtain an alignment func- 
tion from source word subsequences to target 
word subsequences for each transcribed utter- 
ance and its translation. The construction of 
states and transitions is specified in Section 4; 
the method for selecting phrase head words is 
described in Section 5. The string comparison 
evaluation metric we use is described in Sec- 
tion 6, and the results of testing the method in 
a limited domain of English-Spanish translation 
are reported in Section 7. 
2 Overview 
2.1 Lexical head transducers 
In our training method, we follow the simple 
lexical head transduction model described by 
Alshawi (1996b) which can be regarded as a 
type of statistical dependency grammar trans- 
duction. This type of transduction model con- 
sists of a collection of head transducers; the pur- 
pose of a particular transducer is to translate 
a specific source word w into a target word v, 
and further to translate the pair of sequences of 
dependent words to the left and right of w to 
sequences of dependents to the left and right of 
v. When applied recursively, a set of such transducers
effects a hierarchical transduction of the
source string into the target string. 
A distinguishing property of head transduc- 
ers, as compared to 'standard' finite state trans- 
ducers is that they perform a transduction out- 
wards from a 'head' word in the input string 
rather than by traversing the input string from 
left to right. A head transducer for translating
source word w to target word v consists of a set
of states $q_0(w : v), q_1(w : v), q_2(w : v), \ldots$ and
transitions of the form:
$(q_i(w : v), q_j(w : v), w_d, v_d, \alpha, \beta)$
where the transition is from state $q_i(w : v)$ to
state $q_j(w : v)$, reading the next source dependent
$w_d$ at position $\alpha$ relative to w and writing
a target dependent $v_d$ at position $\beta$ relative to
v. Positions left of a head (in the source or target)
are indicated with negative integers, while
those right of the head are indicated with positive
integers.
The head transducers we use also include the 
following probability parameters for start, tran- 
sition, and stop events: 
$P(\mathrm{start}, q(w : v) \mid w)$
$P(q_j(w : v), w_d, v_d, \alpha, \beta \mid q_i(w : v))$
$P(\mathrm{stop} \mid q(w : v))$
In the present work, when a model is applied
to translate a source sentence, the chosen
derivation of the target string is the derivation
that maximizes the product of the above
transducer event probabilities. The transduction
search algorithm we use to apply the translation
model is a bottom-up dynamic programming
algorithm similar to the analysis algorithm
for relational head acceptors described by
Alshawi (1996a).
2.2 Training method 
The training method is organized into two main 
stages, an alignment stage followed by a trans- 
ducer construction stage as shown in Figure 1. 
[Figure 2: Partitioning the source and target around a head w with respect to f]
The single input to the training process is a 
bitext corpus, constructed by taking each ut- 
terance in a corpus of transcribed speech and 
having it manually translated. We use the term 
bitext in what follows to refer to a pair consist- 
ing of the transcription of a single utterance and 
its translation. 
The steps in the training procedure are as fol- 
lows: 
1. For each bitext, compute an alignment func- 
tion f from source words to target words, using 
the method described in Section 3. 
2. Partition the source into a head word w and 
substrings to the left and right of w (as shown 
in Figure 2). The extents of the partitions pro- 
jected onto the target by f must not overlap. 
Any selection of the head satisfying this con- 
straint is valid but the selection method used 
influences accuracy (Section 5). 
3. Continue partitioning the left and right substrings
recursively around sub-heads $w_l$ and $w_r$.
4. Trace hypothesized head-transducer transitions
that would output the translations of the
left and right dependents of w (i.e. $w_l$ and $w_r$)
at the appropriate positions in the target string,
indicated by f. This step is described in more
detail below in Section 4.
5. Apply step 4 recursively to partitions headed
by $w_l$ and $w_r$, and then their dependents, until
all left and right partitions have at most one
word.
6. Aggregate hypothesized transitions to form 
the counts of a maximum likelihood head trans- 
duction model. 
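Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simplified word-to-word alignment in which f[i] is the target position of source word i, and a precomputed head ranking rank[i] in which a lower value means a better head candidate. The function names (extent, overlaps, valid_head, partition) are hypothetical.

```python
def extent(positions):
    """Return the (min, max) target interval covered by positions, or None if empty."""
    return (min(positions), max(positions)) if positions else None


def overlaps(x, y):
    """True if two closed intervals intersect; an empty extent (None) never overlaps."""
    if x is None or y is None:
        return False
    return x[0] <= y[1] and y[0] <= x[1]


def valid_head(span, h, f):
    """Check the step 2 constraint: the target extents of the left substring,
    the head, and the right substring must be pairwise non-overlapping."""
    left = extent([f[i] for i in span if i < h])
    right = extent([f[i] for i in span if i > h])
    head = (f[h], f[h])
    return not (overlaps(left, head) or overlaps(head, right) or overlaps(left, right))


def partition(span, f, rank):
    """Recursively partition a contiguous list of source positions.

    Returns a tree (head, left_subtree, right_subtree), or None for an empty span.
    Raises ValueError if no head satisfies the constraint (an assumption of this
    sketch; the paper does not say how such cases are handled).
    """
    if not span:
        return None
    candidates = [h for h in span if valid_head(span, h, f)]
    if not candidates:
        raise ValueError("no head satisfies the non-overlap constraint")
    h = min(candidates, key=lambda i: rank[i])  # best-ranked valid head
    left = [i for i in span if i < h]
    right = [i for i in span if i > h]
    return (h, partition(left, f, rank), partition(right, f, rank))


# Toy example: three source words aligned to target positions 2, 0, 1.
tree = partition([0, 1, 2], f=[2, 0, 1], rank=[1, 2, 0])
```

In the toy example the best-ranked word (position 2) is rejected as head of the full string because the target extent of its left substring would enclose its own translation, so the next-best valid candidate is used instead.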
The recursive partitioning of the source and target
strings gives the hierarchical decomposition
for head transduction. In step 2, the constraint
on target partitions ensures that the transduction
hypothesized in training does not contain
crossing dependency structures in the target.
[Figure 1: Head transducer training method. Stages shown in the diagram: Pairing Extraction, Model Builder (alignment model), Alignment Search, Head Selection, Transducer Construction, Model Builder (translation model), Transduction Search.]
3 Alignment 
The first stage in the training process is ob- 
taining, for each bitext, an alignment function 
f : W → V mapping word subsequences W in
the source to word subsequences V in the tar- 
get. In this process an alignment model is con- 
structed which specifies a cost for each pairing 
(W, V) of source and target subsequences, and 
an alignment search is carried out to minimize
the sum of the costs of a set of pairings which 
completely maps the bitext source to its target. 
3.1 Alignment model 
The cost of a pairing is composed of a weighted 
combination of cost functions. We currently use 
two. 
The first cost function is the $\phi$ correlation
measure (cf. the use of $\phi^2$ in Gale and Church
(1991)) computed as follows:
$$\phi = \frac{bc - ad}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}$$
where
$a = n_V - n_{W,V}$
$b = n_{W,V}$
$c = N - n_V - n_W + n_{W,V}$
$d = n_W - n_{W,V}$
N is the total number of bitexts, $n_V$ the number
of bitexts in which V appears in the target, $n_W$
the number of bitexts in which W appears in
the source, and $n_{W,V}$ the number of bitexts in
which W appears in the source and V appears
in the target.
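As an illustration, the correlation measure can be computed directly from the four bitext counts. The following Python sketch follows the definitions of a, b, c and d given above; the function name phi_correlation and the handling of a zero denominator are assumptions of the sketch, not part of the described method.

```python
import math


def phi_correlation(N, n_w, n_v, n_wv):
    """Phi correlation between a source subsequence W and a target subsequence V.

    N    -- total number of bitexts
    n_w  -- bitexts in which W appears in the source
    n_v  -- bitexts in which V appears in the target
    n_wv -- bitexts in which both appear
    """
    a = n_v - n_wv            # V without W
    b = n_wv                  # W and V together
    c = N - n_v - n_w + n_wv  # neither W nor V
    d = n_w - n_wv            # W without V
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return 0.0  # assumption: treat degenerate counts as uncorrelated
    return (b * c - a * d) / denom
```

With these definitions, a pair that always co-occurs (for example N = 100 and n_w = n_v = n_wv = 10) scores 1, and a pair that co-occurs exactly as often as chance predicts scores 0.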
We tried using the log probabilities of tar- 
get subsequences given source subsequences (cf 
Brown et al. (1990)) as a cost function instead 
of $\phi$, but $\phi$ resulted in better performance of our
translation models. 
The second cost function used is a distance 
measure which penalizes pairings in which the 
source subsequence and target subsequence are 
in very different positions in their respective 
sentences. Different weightings of distance to 
correlation costs can be used to bias the model 
towards more or less parallel alignments for dif- 
ferent language pairs. 
3.2 Alignment search 
The agenda-based alignment search makes use 
of dynamic programming to record the best cost 
seen for all partial alignments covering the same 
source and target subsequence; partial align- 
ments coming off the agenda that have a higher 
cost for the same coverage are discarded and 
take no further part in the search. An effort
limit on the number of agenda items processed is 
used to ensure reasonable speed in the search re- 
gardless of sentence length. An iterative broad- 
ening strategy is used, so that at breadth i only 
the i lowest cost pairings for each source subse- 
quence are allowed in the search, with the result 
that most optimal alignments are found well be- 
fore the effort limit is reached. 
In the experiment reported in Section 7, 
source and target subsequences of lengths 0, 1 
and 2 were allowed in pairings. 
4 Transducer construction 
Building a head transducer involves creating ap- 
propriate head transducer states and tracing hy- 
pothesized head-transducer transitions between 
them that are consistent with the occurrence 
of the pairings (W, f(W)) in each aligned bi- 
text. When a source sequence W in an align- 
ment pairing consists of more than one word, 
the least frequent of these words in the train- 
ing corpus is taken to be the primary word of 
the subsequence. It is convenient to extend the 
domain of an alignment function f to include 
primary words w by setting f(w) = f(W). 
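The choice of primary word can be sketched in a few lines of Python. The helper name primary_word and the tie-breaking behaviour (the earliest word wins) are assumptions of this sketch; the text specifies only that the least frequent word is chosen.

```python
from collections import Counter


def primary_word(subsequence, freq):
    """Return the word of the subsequence that is least frequent in the corpus.

    freq maps words to training-corpus counts. On ties, min() keeps the
    first such word, which is an assumption of this sketch.
    """
    return min(subsequence, key=lambda w: freq[w])


# Toy corpus, for illustration only.
corpus = ["show me the flights", "the flights to boston"]
freq = Counter(word for sentence in corpus for word in sentence.split())
chosen = primary_word(("the", "boston"), freq)
```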
The main transitions that are traced in our
construction are those that map heads, $w_l$ and
$w_r$, of the left and right dependent phrases
of w (see Figure 2) to their translations as indicated
in the alignment. The positions of these
dependents in the target string are computed
by comparing the positions of $f(w_l)$ and $f(w_r)$
to the position of V = f(w). The actual states
and transitions in the construction are specified
below.
Additional transitions are included for cases 
of compounding, i.e. those for which the source 
subsequence in an alignment function pairing 
consists of more than one word. Specifically, 
the source subsequence W may be a compound 
consisting of a primary word w together with 
a secondary word w'. There are no additional 
transitions for cases in which the target subse- 
quence V = f(w) of an alignment function pair- 
ing has more than one word. For the purposes of 
the head-transduction model constructed, such 
compound target subsequences are effectively 
treated as single words (containing space characters).
That is, we are constructing a transducer
for (w : V).
We use the notation Q(w : V) for states of 
the constructed head transducer. Here Q is an 
additional symbol e.g. "initial" for identifying 
a specific state of this transducer. A state such 
as initial(w : V) mentioned in the construction 
is first looked up in a table of states created 
so far in the training procedure; and created if 
necessary. A bar above a substring denotes the 
number of words preceding the substring in the 
source or target string. 
We give the construction for the case illustrated
in Figure 2, i.e. one left dependent $w_l$,
one right dependent $w_r$, and a single secondary
word w' to the left of w. Figure 3 shows the
result as part of a finite state transition dia- 
gram. The other transition arrows shown in the 
diagram will arise from other bitext alignments 
containing (w : V) pairings. Other cases cov- 
ered by our algorithm (e.g. a single left depen- 
dent but no right dependent) are simple vari- 
ants. 
[Figure 3: States and transitions constructed for the partition shown in Figure 2]
1. Mark initial(w : V) as an initial state for
the transducer.
2. Include a transition consuming the secondary
word w' without any target output:
$(\mathit{initial}(w : V), \mathit{left}_{w'}(w : V), w', \epsilon, -1, 0)$,
where $\epsilon$ is the empty string.
3. Include a transition for mapping the source
dependent $w_l$ to the target dependent $f(w_l)$:
$(\mathit{left}_{w'}(w : V), \mathit{mid}_{w_l}(w : V), w_l, f(w_l), -1, \beta_l)$
where $\beta_l = \overline{f(w_l)} - \overline{V}$.
4. Include a transition for mapping the source
dependent $w_r$ to the target dependent $f(w_r)$:
$(\mathit{mid}_{w_l}(w : V), \mathit{final}(w : V), w_r, f(w_r), +1, \beta_r)$
where $\beta_r = \overline{f(w_r)} - \overline{V}$.
5. Mark final(w : V) as a final state for the
transducer.
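The construction for this case can be sketched in Python as follows. This is an illustration under simplifying assumptions rather than the authors' code: states are represented as plain strings, f is a dictionary from a source word to its target subsequence, and preceding maps a target subsequence to the number of target words before it (the bar notation), which presumes each target subsequence occurs once in the sentence. The names Transition and trace_transitions are hypothetical.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    from_state: str
    to_state: str
    source_dep: str  # source dependent read by the transition
    target_dep: str  # target dependent written ("" stands for the empty string)
    alpha: int       # source position relative to the head w
    beta: int        # target position relative to the head translation V


def trace_transitions(w, V, w_sec, w_l, w_r, f, preceding):
    """Trace the three transitions for a head w with secondary word w_sec,
    one left dependent w_l and one right dependent w_r."""
    pair = f"({w} : {V})"
    initial, final = f"initial{pair}", f"final{pair}"
    left, mid = f"left_{w_sec}{pair}", f"mid_{w_l}{pair}"
    # Target offsets: words preceding the dependent's translation minus
    # words preceding the head's translation.
    beta_l = preceding[f[w_l]] - preceding[V]
    beta_r = preceding[f[w_r]] - preceding[V]
    return [
        Transition(initial, left, w_sec, "", -1, 0),
        Transition(left, mid, w_l, f[w_l], -1, beta_l),
        Transition(mid, final, w_r, f[w_r], +1, beta_r),
    ]


# Abstract toy input: the left dependent's translation follows V in the
# target, and the right dependent's translation precedes it.
transitions = trace_transitions(
    "w", "V", "w_sec", "wl", "wr",
    f={"wl": "vl", "wr": "vr"},
    preceding={"vl": 3, "V": 1, "vr": 0},
)
```

In the toy input the two dependents swap sides in the target, so the traced offsets are +2 for the left dependent and -1 for the right dependent.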
The inclusion of transitions, and the marking 
of states as initial or final, are treated as event 
observation counts for a statistical head trans- 
duction model. More specifically, they are used 
as counts for maximum likelihood estimation of 
the transducer start, transition, and stop prob- 
abilities specified in Section 2. 
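Maximum likelihood estimation from such counts amounts to taking relative frequencies within each conditioning context. The Python sketch below assumes that traced events are recorded as (context, event) pairs, for example a state paired with either a transition or a stop event; this encoding, the grouping of stop and transition events under the same state, and the function name mle are assumptions made for illustration.

```python
from collections import Counter


def mle(observations):
    """Relative-frequency estimates P(event | context).

    observations -- iterable of hashable (context, event) pairs
    Returns a dict mapping (context, event) to its estimated probability.
    """
    counts = Counter(observations)
    totals = Counter()
    for (context, _event), n in counts.items():
        totals[context] += n
    return {
        (context, event): n / totals[context]
        for (context, event), n in counts.items()
    }


# Toy event trace: state q0 was observed stopping twice and taking
# transition t1 once; state q1 was observed stopping once.
trace = [("q0", "stop"), ("q0", "t1"), ("q0", "stop"), ("q1", "stop")]
probs = mle(trace)
```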
5 Head selection 
We have been using the following monolingual 
metrics which can be applied to either the 
source or target language to predict the likeli- 
hood of a word being the head word of a string. 
Distance: The distance between a dependent 
and its head. In general, the likelihood of a 
head-dependent relation decreases as distance 
increases (Collins, 1996). 
Word frequency: The frequency of occurrence 
of a word in the training corpus. 
Word 'complexity': For languages with phonetic
orthography such as English, 'complexity'
of a word can be measured in terms of number 
of characters in that word. 
Optionality: This metric is intended to iden- 
tify optional modifiers which are less likely to 
be heads. For each word we find trigrams with 
the word of interest as the middle word and 
compare the distribution of these trigrams with 
the distribution of the bigrams formed from the 
outer pairs of words. If these two distributions 
are strongly correlated then the word is highly 
optional. 
Each of the above metrics provides a score for 
the likelihood of a word being a head word. A 
weighted sum of these scores is used to produce 
a ranked list of head words given a string for use 
in step 2 of the training algorithm in Section 2. 
If the metrics are applied to the target language 
instead of the source, the ranking of a source 
word is taken from the ranking of the target 
word it is aligned with. 
In Section 7, we show the effectiveness of ap- 
propriate head selection in terms of the trans- 
lation performance and size of the head trans- 
ducer model in the context of an English- 
Spanish translation system. 
6 Evaluation method 
There is no agreed-upon measure of machine 
translation quality. For our current purposes 
we require a measure that is objective, reliable, 
and that can be calculated automatically. 
We use here the word accuracy measure of 
the string distance between a reference string 
and a result string, a measure standardly used 
in the automatic speech recognition (ASR) com- 
munity. While for ASR the reference is a human 
transcription of the original speech and the re- 
sult the output of the speech recognition process 
run on the original speech, we use the measure 
to compare two different translations of a given 
source, typically a human translation and a ma- 
chine translation. 
The string distance metric is computed by
first finding a transformation of one string into
another that minimizes the total weight of substitutions,
insertions and deletions. (We use
the same weights for these operations as in the
NIST ASR evaluation software (NIST, 1997).) If
we write S for the resulting number of substitutions,
I for insertions, D for deletions, and R
for number of words in the reference translation
string, we can express the metric as follows:
$$\text{word accuracy} = 1 - \frac{D + S + I}{R}$$
This measure has the merit of being com- 
pletely automatic and non-subjective. How- 
ever, taking any single translation as reference 
is unrealistically unfavourable, since there is a 
range of acceptable translations. To increase 
the reliability of the measure, therefore, we give 
each system translation the best score it receives 
against any of a number of independent human 
translations of the same source. 
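The measure can be sketched in Python as follows. For simplicity the sketch assigns unit weight to substitutions, insertions and deletions, so that D + S + I equals the word-level edit distance; this differs from the NIST weights used in the experiments and is an assumption of the sketch, as are the function names. Reference strings are assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions (unit
    weights, an assumption of this sketch) turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete r
                cur[j - 1] + 1,          # insert h
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def word_accuracy(reference, result):
    """1 - (D + S + I) / R for whitespace-tokenized strings."""
    ref_words, hyp_words = reference.split(), result.split()
    return 1 - edit_distance(ref_words, hyp_words) / len(ref_words)


def best_word_accuracy(references, result):
    """Score a result against the closest of several reference translations."""
    return max(word_accuracy(ref, result) for ref in references)


single = word_accuracy("a b c d", "a x c d")
best = best_word_accuracy(["a b c d", "a x c d"], "a x c d")
```

Note that the measure can be negative when the result contains many more words than the reference, since insertions are counted against the reference length R.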
       max source length
       5     10    15    20    >20
wfw    45.8  46.5  45.2  44.5  44.0
sys    79.4  78.3  77.3  75.2  74.1
Table 1: Word accuracy (percent) against the
single held-out human translation
7 English-Spanish experiment 
The training and test data for the experiments 
reported here were taken from a set of tran- 
scribed utterances from the air travel infor- 
mation system (ATIS) corpus together with a 
translation of each utterance to Spanish. An 
utterance is typically a single sentence but is 
sometimes more than one sentence spoken in se- 
quence. There were 14418 training utterances, 
a total of 140788 source words, corresponding to 
167865 target words. This training set was used 
as input to alignment model construction; align- 
ment search was carried out only on sentences 
up to length 15, a total of 11542 bitexts. Trans- 
duction training (including head ranking) was 
carried out on the 11327 alignments obtained. 
The test set used in the evaluations reported
here consisted of 336 held-out English sentences. 
We obtained three separate human translations 
of this test set: 
tr1 was translated by the same translation
bureau as the training data;
tr2 was translated by a different translation
bureau;
cr1 was a correction of the output of the
trained system by a professional translator.
The models evaluated are 
sys: the automatically trained head trans- 
duction model; 
wfw: a baseline word-for-word model in 
which each English word is translated by the 
Spanish word most highly correlated with it in 
the corpus. 
Table 1 shows the word accuracy percent- 
ages (see Section 6) for the trained system sys 
and the word-for-word baseline wfw against tr1
(the original held-out translations) at various 
source sentence lengths. The trained system 
has word accuracy of 74.1% on sentences of all 
lengths; on sentences up to length 15 (the length 
on which the transduction model was trained) 
the score was 77.3%. 
       max source length
       5     10    15    20    >20
wfw    46.2  47.5  46.6  45.8  45.3
sys    80.1  81.6  81.0  79.3  78.5
Table 2: Word accuracy (percent) against the
closest of three human translations
Head selector              Word accuracy   Number of parameters
Baseline (Random Heads)    64.7%           108K
In Source                  71.4%           67K
In Target (sys)            74.1%           66K
Table 3: Translation performance with different
head selection methods
Table 2 shows the word accuracy percentages 
for the trained system sys and the word-for-word
baseline wfw against any of the three reference
translations tr1, cr1, and tr2. That is,
for each output string the human translation 
closest to it is taken as the reference transla- 
tion. With this more accurate measure, the sys- 
tem's word accuracy is 78.5% on sentences of all 
lengths. 
Table 3 compares the performance of the 
translation system when head words are se- 
lected (a) at random (baseline), (b) with head 
selection in the source language, and (c) with 
head selection in the target language, i.e., select- 
ing source heads that are aligned with the high- 
est ranking target head words. The reference for 
word accuracy here is the single reference translation
tr1. Note that the 'In Target' head selection
method is the one used in training translation
model sys. The use of head selection
metrics improves on random head selection in 
terms of translation accuracy and number of pa- 
rameters. An interesting twist, however, is that 
applying the metrics to target strings performs 
better than applying the metrics to the source 
words directly. 
8 Concluding remarks 
We have described a method for learning a head 
transduction model automatically from trans- 
lation examples. Despite the simplicity of the 
current version of this method, the experiment 
we reported in this paper demonstrates that 
the method leads to reasonable performance 
for English-Spanish translation in a limited do- 
main. We plan to increase the accuracy of the 
model using the kind of statistical modeling 
techniques that have contributed to improve- 
ments in automatic learning of speech recogni- 
tion models in recent years. We have started 
to experiment with learning models for more 
challenging language pairs such as English to 
Japanese that exhibit more variation in word 
order and complex lexical transformations. 

References 
H. Alshawi, A.L. Buchsbaum, and F. Xia.
1997. A Comparison of Head Transducers
and Transfer for a Limited Domain Translation
Application. In 35th Annual Meeting of
the Association for Computational Linguis- 
tics. Madrid, Spain, August. 
H. Alshawi. 1996a. Head automata and bilin- 
gual tiling: Translation with minimal repre- 
sentations. In 34th Annual Meeting of the 
Association for Computational Linguistics, 
pages 167-176, Santa Cruz, California. 
H. Alshawi. 1996b. Head automata for speech 
translation. In International Conference on 
Spoken Language Processing, Philadelphia,
Pennsylvania. 
P.J. Brown, J. Cocke, S. Della Pietra, V. Della 
Pietra, J. Lafferty, R. Mercer, and P. Roossin.
1990. A Statistical Approach to Ma- 
chine Translation. Computational Linguis- 
tics, 16(2):79-85. 
P.J. Brown, S.A. Della Pietra, V.J. Della Pietra, 
and R.L. Mercer. 1993. The mathematics of 
machine translation: Parameter estimation. 
Computational Linguistics, 19(2):263-311.
Michael John Collins. 1996. A new statistical 
parser based on bigram lexical dependencies. 
In 34th Meeting of the Association for Com- 
putational Linguistics, pages 184-191, Santa 
Cruz. 
W.A. Gale and K.W. Church. 1991. Identifying
word correspondences in parallel texts.
In Proceedings of the Fourth DARPA Speech 
and Natural Language Processing Workshop, 
pages 152-157, Pacific Grove, California. 
National Institute of Standards and Technology, 
http://www.itl.nist.gov/div894, 1997. Spo- 
ken Natural Language Processing Group Web 
page. 
Eiichiro Sumita and Hitoshi Iida. 1995. Heterogeneous
computing for example-based translation
of spoken language. In 6th International
Conference on Theoretical and Methodological
Issues in Machine Translation, pages
273-286, Leuven, Belgium. 
J.M. Vilar, V.M. Jiménez, J.C. Amengual,
A. Castellanos, D. Llorens, and E. Vidal. 
1996. Text and speech translation by means 
of subsequential transducers. Natural Language
Engineering, 2(4):351-354.
Dekai Wu. 1997. Stochastic inversion trans- 
duction grammars and bilingual parsing of 
parallel corpora. Computational Linguistics, 
23(3):377-404. 
