A Comparison of Head Transducers and Transfer for a Limited 
Domain Translation Application 
Hiyan Alshawi and Adam L. Buchsbaum 
AT&T Labs 
180 Park Avenue 
Florham Park, NJ 07932-0971, USA
{hiyan,alb}@research.att.com
Abstract 
We compare the effectiveness of two related
machine translation models applied to the
same limited-domain task. One is a trans- 
fer model with monolingual head automata 
for analysis and generation; the other is a 
direct transduction model based on bilin- 
gual head transducers. We conclude that 
the head transducer model is more effective 
according to measures of accuracy, compu- 
tational requirements, model size, and de- 
velopment effort. 
1 Introduction
In this paper we describe an experimental ma- 
chine translation system based on head transducer 
models and compare it to a related transfer sys- 
tem, described in Alshawi 1996a, based on mono- 
lingual head automata. Head transducer models 
consist of collections of finite state machines that 
are associated with pairs of lexical items in a bilingual
lexicon. The transfer system follows the familiar
analysis-transfer-generation architecture (Isabelle
and Macklovitch 1986), with mapping of
dependency representations (Hudson 1984) in the
transfer phase. In contrast, the head transducer 
approach is more closely aligned with earlier di- 
rect translation methods: no explicit representa- 
tions of the source language (interlingua or other- 
wise) are created in the process of deriving the target 
string. Despite the simple direct architecture, the
head transducer model does embody modern prin- 
ciples of lexicalized recursive grammars and statis- 
tical language processing. The context for evaluat- 
ing both the transducer and transfer models was the 
development of experimental prototypes for speech- 
to-speech translation. 
Fei Xia
Department of Computer and
Information Science
University of Pennsylvania
Philadelphia, PA 19104, USA
fxia@cis.upenn.edu
In the case of text translation for publishing, it
is reasonable to adopt economic measures of the
effectiveness of translation systems. This involves
assessing the total cost of employing a translation
system, including, for example, the cost of manual
post-editing. Post-editing is not an option in speech
translation systems for person-to-person communication,
and real-time operation is important in this
context, so in comparing the two translation models
we looked at a variety of other measures, including
translation accuracy, speed, and system complexity.
Both models underlying the translation systems 
can be characterized as statistical translation mod- 
els, but unlike the models proposed by Brown et 
al. (1990, 1993), these models have non-uniform lin- 
guistically motivated structure, at present coded by 
hand. In fact, the original motivation for the head 
transducer models was that they are simpler and 
more amenable to automatic model structure acqui- 
sition, while the transfer component of the tradi- 
tional system was designed with regard to allowing 
maximum flexibility in mapping between source and
target representations to overcome translation
divergences (Lindop and Tsujii 1991; Dorr 1994). In
practice, it turned out that adopting the simpler
transducer models did not involve sacrificing accuracy, at
least for our limited domain application. 
We first describe the transfer and head transducer 
approaches in Sections 2 and 3 and the method used 
to assign the numerical parameters of the models in 
Section 4. In Section 5, we compare experimental
systems, based on the two approaches, for English- 
to-Chinese translation of air travel enquiries, and we 
conclude in Section 6. 
2 Monolingual Automata and Transfer
In this section we review the approach based on
monolingual head automata together with transfer 
mapping. Further details of this approach, including
the analysis, transfer, and generation algorithms,
appear in Alshawi 1996a.
2.1 Monolingual Relational Models 
We can characterize the language models used for 
analysis and generation in the transfer system as 
quantitative generative models of ordered depen- 
dency trees. In the dependency trees generated by 
these models, each node is labeled with a word w 
from the vocabulary V of the language in question: 
the nodes (and their word labels) immediately dom- 
inated by such a node are the dependents of w in 
the dependency derivation. Dependency tree arcs 
are labeled with symbols taken from a set R of
dependency relations. These monolingual models are
reversible, in the sense that they can be used for
analysis or generation. The motivation for these models is
similar to that for Probabilistic Link Grammar
(Lafferty, Sleator, and Temperley 1992), one difference
being that the head automata derivations are always 
trees. 
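As an illustration of the kind of object these models generate, an ordered dependency tree can be sketched in Python as below. This is only a sketch: the `Node` class, the `surface` function, and the relation names `obj` and `det` are assumptions made for the example and do not come from the systems described here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of an ordered dependency tree (illustrative layout)."""
    word: str
    # Dependents as (relation, Node) pairs, listed in surface order.
    left: list = field(default_factory=list)
    right: list = field(default_factory=list)


def surface(node):
    """Read the word string off an ordered dependency tree."""
    words = []
    for _relation, dependent in node.left:
        words.extend(surface(dependent))
    words.append(node.word)
    for _relation, dependent in node.right:
        words.extend(surface(dependent))
    return words


# Hypothetical tree for "show the flights"; the relation names are made up.
tree = Node("show", right=[("obj", Node("flights", left=[("det", Node("the"))]))])
print(" ".join(surface(tree)))
```

Because each node keeps its left and right dependents in order, the word string is fully determined by the tree, which is the sense in which the trees are ordered.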
The models are quantitative in that they assign a 
real-number cost to derivations. Various cost func- 
tions are possible, though in the experiments re- 
ported in this paper, a discriminative cost function 
is used, as discussed in Section 4. In the monolin- 
gual models, derivation events are actions performed 
by relational head acceptors, a particular type of fi- 
nite state automata associated with each word in the 
language. 
A relational head acceptor writes (or accepts) a 
pair of symbol sequences, a left sequence and a right 
sequence. The symbols in these sequences are taken 
from the set R of dependency relations. In a de- 
pendency derivation, an acceptor is associated with 
a node with word w, and the sequences written by 
the acceptor correspond to the relation labels of the 
arcs to the left and right of the node. In other words, 
they are the dependency relations between w and the 
dependents of w to its left and right. The possible 
actions taken by a relational head acceptor m in
state q_i are:
• Left transition: write a symbol r onto the right
end of the left sequence and enter state q_{i+1}.
• Right transition: write a symbol r onto the left
end of the right sequence and enter state q_{i+1}.
• Stop: stop in state q_i, at which point the
sequences are considered complete.
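The three acceptor actions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the class name `HeadAcceptor`, the deterministic transition table, the state labels, and the relation names `subj`, `obj` and `mod` are all hypothetical, and costs are left out.

```python
from collections import deque


class HeadAcceptor:
    """Toy relational head acceptor (hypothetical layout).

    transitions maps (state, direction, relation) -> next state, where
    direction is "left" or "right"; final is the set of states in which
    the stop action is allowed.
    """

    def __init__(self, transitions, start, final):
        self.transitions = transitions
        self.start = start
        self.final = final

    def run(self, actions):
        """Apply (direction, relation) actions, then stop.

        Returns the completed (left, right) relation sequences, or None
        if an action is not licensed or stopping is not allowed.
        """
        state = self.start
        left, right = deque(), deque()
        for direction, relation in actions:
            key = (state, direction, relation)
            if key not in self.transitions:
                return None
            if direction == "left":
                left.append(relation)       # right end of the left sequence
            else:
                right.appendleft(relation)  # left end of the right sequence
            state = self.transitions[key]
        if state not in self.final:
            return None
        return list(left), list(right)


# Hypothetical acceptor for a transitive verb.
verb = HeadAcceptor(
    transitions={
        ("q0", "left", "subj"): "q1",
        ("q1", "right", "mod"): "q1",
        ("q1", "right", "obj"): "q2",
    },
    start="q0",
    final={"q2"},
)
sequences = verb.run([("left", "subj"), ("right", "mod"), ("right", "obj")])
```

Since right transitions write onto the left end of the right sequence, the relation written last ends up nearest the head word.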
Derivation of ordered dependency trees proceeds 
recursively by generating the dependent relations for 
a node according to the word and acceptor at that 
node, and then generating the trees dominated by 
these relation edges. This process involves the fol- 
lowing actions in addition to the acceptor actions 
above: 
• Selection of a word and acceptor to start an
entire derivation. 
• Selection of a dependent word and acceptor 
given a head word and a dependency relation. 
2.2 Transfer 
Transfer in this model is a mapping between un- 
ordered dependency trees. Surface ordering of de- 
pendent phrases of either the source or target is not 
taken into account in the transfer mapping. This or- 
dering is completely defined by the source and target 
monolingual models. 
Our transfer model involves a bilingual lexicon 
specifying paired source-target fragments of
dependency trees. A bilingual lexical entry (see Alshawi
1996a for more details) includes a mapping function 
between the source and target nodes of the frag- 
ments. Valid transfer mappings are defined in terms 
of a tiling of the source dependency tree with source 
fragments from bilingual lexicon entries so that the 
partial mappings defined in entries are extended to 
a mapping for the entire source tree. This tiling pro- 
cess has the side effect of creating an unordered tar- 
get dependency representation. The following non- 
deterministic actions are involved in the tiling pro- 
cess: 
• Selection of a bilingual entry given a source lan- 
guage word, w. 
• Matching the nodes and arcs of the source frag- 
ment of an entry against a local subgraph in- 
cluding a node labeled by w. 
3 Bilingual Head Transduction 
3.1 Bilingual Head Transducers 
A head transducer is a transduction version of the 
finite state head acceptors employed in the transfer 
model. Such a transducer M is associated with a 
pair of words, a source word w and a target word
v. In fact, w is taken from the set V1 consisting of
the source language vocabulary augmented by the
"empty word" ε, and v is taken from V2, the target
language vocabulary augmented with ε. A head
transducer reads from a pair of source sequences, a
left source sequence L1 and a right source sequence
R1; it writes to a pair of target sequences, a left
target sequence L2 and a right target sequence R2
(Figure 1).
Head transducers were introduced in Alshawi 
1996b, where the symbols in the source and target 
sequences are source and target words respectively. 
In the experiment described in this paper the symbols
written are dependency relation symbols or the
empty symbol ε. While it is possible to construct a
translator based on head transduction models without
relation symbols, using a version of head transducers
with relation symbols allowed for a more direct
comparison between the transfer and transducer
systems, as discussed in Section 5.
Figure 1: Head transducer M converts the sequences
of left and right relations of w into left and right
relations of v.
We can think of the transducer as simultaneously 
deriving the source and target sequences through a 
series of transitions followed by a stop action. From 
a state q_i these actions are as follows:
• Left transition: write a symbol r1 onto the right
end of L1, write a symbol r2 to position α in the
target sequences, and enter state q_{i+1}.
• Right transition: write a symbol r1 onto the left
end of R1, write a symbol r2 to position α in
the target sequences, and enter state q_{i+1}.
• Stop: stop in state q_i, at which point the
sequences L1, R1, L2 and R2 are considered complete.
In simple head transducers, the target positions
α can be restricted in a similar way to the source
positions, i.e., the right end of L2 or the left end of
R2. The version used in the experiment allows
additional positions, including the left end of L2 and
the right end of R2. Allowing additional target
positions increases the flexibility of transducers in the
translation application without an adverse effect on
computational complexity. On the other hand, we
restrict the source side positions as indicated above 
to keep the transduction search similar in nature to 
head-outward context free parsing. 
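The transducer actions, including the four possible target positions, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the system's implementation: the class `HeadTransducer`, its deterministic transition table, the state labels and the relation names are hypothetical, costs are omitted, and `EPSILON` stands in for the empty symbol.

```python
from collections import deque

EPSILON = None  # stands in for the empty symbol


class HeadTransducer:
    """Toy deterministic head transducer (hypothetical layout).

    transitions maps (state, direction, r1) -> (r2, (sequence, end), next
    state): direction is "left" or "right" on the source side, sequence is
    "L2" or "R2", and end is "left" or "right".
    """

    def __init__(self, transitions, start, final):
        self.transitions = transitions
        self.start = start
        self.final = final

    def run(self, actions):
        """Derive source and target sequences from (direction, r1) actions.

        Returns ((L1, R1), (L2, R2)), or None if the derivation is not
        licensed by the transition table.
        """
        state = self.start
        source = {"L1": deque(), "R1": deque()}
        target = {"L2": deque(), "R2": deque()}
        for direction, r1 in actions:
            key = (state, direction, r1)
            if key not in self.transitions:
                return None
            r2, (sequence, end), state = self.transitions[key]
            if direction == "left":
                source["L1"].append(r1)       # right end of L1
            else:
                source["R1"].appendleft(r1)   # left end of R1
            if r2 is not EPSILON:
                if end == "left":
                    target[sequence].appendleft(r2)
                else:
                    target[sequence].append(r2)
        if state not in self.final:
            return None
        return ((list(source["L1"]), list(source["R1"])),
                (list(target["L2"]), list(target["R2"])))


# Hypothetical verb transducer: the source "pp" dependent sits to the right
# of the head, while its target counterpart is written into L2.
verb = HeadTransducer(
    transitions={
        ("q0", "left", "subj"): ("subj", ("L2", "right"), "q1"),
        ("q1", "right", "pp"): ("pp", ("L2", "right"), "q2"),
        ("q2", "right", "obj"): ("obj", ("R2", "left"), "q3"),
    },
    start="q0",
    final={"q3"},
)
pair = verb.run([("left", "subj"), ("right", "pp"), ("right", "obj")])
```

The source side is restricted to the two head-adjacent positions, as in the text, while the target side may write to either end of either target sequence.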
3.2 Recursive Head Transduction 
We can apply a set of head transducers recursively 
to derive a pair of source-target ordered dependency 
trees. This is a recursive process in which the
dependency relations for corresponding nodes in the two
trees are derived by a head transducer. In addition
to the actions performed by the head transducers,
this derivation process involves the actions:
• Selection of a pair of words w0 ∈ V1 and v0 ∈ V2,
and a head transducer M0 to start the entire
derivation.
• Selection of a pair of dependent words w' and
v' and transducer M' given head words w and v
and source and target dependency relations r1
and r2. (w, w' ∈ V1; v, v' ∈ V2.)
The recursion takes place by running a head trans- 
ducer (M' in the second action above) to derive local 
dependency trees for corresponding pairs of depen- 
dent words (w', v'). 
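The recursive structure of this derivation can be illustrated with a heavily simplified Python sketch. Here each word pair's transducer is collapsed to a fixed table from a source relation to a target relation and a target side, which is an assumption made only to keep the example short. The dictionary layout, the function names, the relation names and the upper-case placeholder target words are likewise hypothetical.

```python
def transduce(node, lexicon):
    """Recursively build a target tree from a source dependency tree.

    lexicon maps a source word to (target word, relation table), where the
    relation table maps a source relation to (target relation, side) and
    side is "left" or "right". This stands in for running a head
    transducer at each pair of corresponding nodes.
    """
    target_word, relation_table = lexicon[node["word"]]
    target = {"word": target_word, "left": [], "right": []}
    for relation, dependent in node["left"] + node["right"]:
        target_relation, side = relation_table[relation]
        # Recursion: derive the local tree for the dependent word pair.
        target[side].append((target_relation, transduce(dependent, lexicon)))
    return target


def surface(node):
    """Read the word string off an ordered dependency tree."""
    words = []
    for _relation, dependent in node["left"]:
        words.extend(surface(dependent))
    words.append(node["word"])
    for _relation, dependent in node["right"]:
        words.extend(surface(dependent))
    return words


def leaf(word):
    return {"word": word, "left": [], "right": []}


# Hypothetical source tree for "flights to Boston".
to_node = {"word": "to", "left": [], "right": [("pobj", leaf("Boston"))]}
source = {"word": "flights", "left": [], "right": [("pp", to_node)]}

# Hypothetical lexicon with placeholder target words; the prepositional
# phrase follows the noun in the source and precedes it in the target.
lexicon = {
    "flights": ("FLIGHTS", {"pp": ("pp", "left")}),
    "to": ("TO", {"pobj": ("pobj", "right")}),
    "Boston": ("BOSTON", {}),
}
target = transduce(source, lexicon)
print(" ".join(surface(target)))
```

The point of the sketch is the shape of the recursion: the relations at a pair of head words are fixed first, and the same procedure is then applied to each pair of dependent words.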
4 Event Cost Assignment 
The transfer and head transduction derivation mod- 
els can be formulated as probabilistic generative 
models; such formulations were given in Alshawi 
1996a and 1996b respectively. Under such a for- 
mulation, negated log probabilities can be used as 
the costs for the actions listed in Sections 2 and 3. 
However, experimentation reported in Alshawi and 
Buchsbaum 1997 suggests that improved translation 
accuracy can be achieved by adopting cost functions 
other than log probability. This is true in particular 
for a family of discriminative cost functions. 
We define a cost function f as a real valued function
taking two arguments, an event e and a context
c. The context c is an equivalence class of states under
which an action is taken, and the event e is an
equivalence class of actions possible from that set of
states. We write the value of the function as f(e|c),
borrowing notation from the special case of conditional
probabilities. The pair (e|c) is referred to as a
choice. The cost of a solution (i.e., a possible
translation of an input string) is the sum of costs for all
choices in the derivation of that solution.
Discriminative cost functions, including likelihood 
ratios (cf. Dunning 1993), make use of both positive 
and negative instances of performing a task. Here 
we take a positive instance to be the derivation of
a "correct" translation, and a negative instance the
derivation of an "incorrect" translation, where
correctness is judged by a speaker of both languages.
Let n^+(e|c) be the count of taking choice (e|c) in
positive instances resulting from processing the source
sentences in a training corpus. Similarly, let n^-(e|c)
be the count of taking (e|c) for negative instances.
The cost function used in the experiments is
computed as:
$$f(e|c) = \log\left(n^{+}(e|c) + n^{-}(e|c)\right) - \log\left(n^{+}(e|c)\right).$$
(By comparison, the usual "logprob" cost function
using only positive instances would be
log(n^+(c)) - log(n^+(e|c)).) For unseen choices, we replace the
context c and event e with larger equivalence classes.
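The cost assignment can be illustrated with a short Python sketch. The function names, the example choices and all counts below are hypothetical. The back-off to larger equivalence classes for unseen choices is not modelled, so a choice with no positive count simply raises an error.

```python
import math


def discriminative_cost(n_pos, n_neg):
    """f(e|c) = log(n+(e|c) + n-(e|c)) - log(n+(e|c))."""
    if n_pos <= 0:
        # Unseen in positive instances: the text backs off to larger
        # equivalence classes, which this sketch does not implement.
        raise ValueError("no positive count for this choice")
    return math.log(n_pos + n_neg) - math.log(n_pos)


def logprob_cost(n_pos_choice, n_pos_context):
    """The "logprob" alternative: log(n+(c)) - log(n+(e|c))."""
    return math.log(n_pos_context) - math.log(n_pos_choice)


def solution_cost(choices, counts):
    """Cost of a solution: the sum of the costs of its choices."""
    return sum(discriminative_cost(*counts[choice]) for choice in choices)


# Hypothetical (positive, negative) counts for two choices, keyed (e, c).
counts = {
    ("attach_to_noun", "pp"): (8, 2),
    ("attach_to_verb", "pp"): (1, 9),
}
for choice, (n_pos, n_neg) in counts.items():
    print(choice, round(discriminative_cost(n_pos, n_neg), 3))
```

A choice that occurs only in positive instances receives cost zero, and the cost grows as the share of negative instances grows, so the lowest cost solution favours choices that were mostly seen in correct translations.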
5 Effectiveness Comparison 
5.1 English-Chinese ATIS Models 
Both the transfer and transducer systems were 
trained and evaluated on English-to-Mandarin Chi- 
nese translation of transcribed utterances from the 
ATIS corpus (Hirschman et al. 1993). By train- 
ing here we simply mean assignment of the cost 
functions for fixed model structures. These model 
structures were coded by hand as monolingual head 
acceptor and bilingual dependency lexicons for the 
transfer system and a head transducer lexicon for 
the transducer system. 
Positive and negative counts for cost assignment 
were collected from two sources for both systems and 
an additional third source for the transfer system. 
The first set of counts was derived by processing 
traces using around 1200 sample utterances from 
the ATIS corpus. This involved running the sys- 
tems on the sample utterances, starting initially with 
uniform costs, and presenting the resulting trans- 
lations to a human judge for classification as cor- 
rect or incorrect. The second source of counts was 
hand-tagging around 800 utterance transcriptions 
to identify correct and incorrect attachment points 
for prepositional phrases, PP-attachment being im- 
portant for English-Chinese translation (Chen and 
Chen 1992). This attachment information was con- 
verted to corresponding counts for head-dependent 
choices involving prepositional phrase attachment. 
The additional source of counts used in the trans- 
fer system was an unsupervised training method 
in which 13000 training utterances were translated 
from English to Chinese, and then back again; the 
derivations were classified as positive (otherwise neg- 
ative) if the resulting back-translation was suffi- 
ciently close to the original English, as described in 
Alshawi and Buchsbaum 1997. 
There was a strong systematic relationship be- 
tween the structure of the models used in the two 
systems in the following sense. The head transducers 
were built by modifying the English head acceptors 
defined for the transfer system. This involved the 
addition of target relations, including some epsilon 
relations, to automaton transitions. In some cases,
the automata needed to be modified to include additional
states, and also some transitions with epsilon
relations on the English (source) side. Typically,
such cases arise when an additional particle needs
to be generated on the target side, for example the
yes-no question particle in Chinese. The inclusion of
such particles often depended on additional distinctions
not present in the original English automata,
hence the requirement for additional states in the
bilingual transducer versions.
                              Transfer   Head Transducer
Word error rate (per cent)      16.2         11.7
Time (seconds/sent.)            1.09         0.17
Space (Mbytes/sent.)            1.67         0.14
Table 1: Accuracy, time, and space comparison
5.2 Performance 
To evaluate the relative performance of the two 
translators, 200 utterances were chosen at random 
from a previously unseen test sample of ATIS utter- 
ances having no overlap with samples used in model 
building and cost assignment. There was no restric- 
tion on utterance length or ATIS "class" (dialogue or 
one-off queries, etc.) in making this selection. These 
English test utterances were processed by both sys- 
tems, yielding lowest cost Chinese translations. 
Three measures of performance--accuracy, com- 
putation time, and memory usage--were compared, 
with the results in Table 1, showing improvements 
by the transducer system for all three measures. The 
accuracy figures are given in terms of translation 
word error rate, a measure we believe to be some- 
what less subjective than sentence level measures of 
grammaticality and meaning preservation. Translation
word error rate is defined as the proportion of
words in the source (given as a percentage in Table 1)
which are judged to have been
mistranslated. For the purposes of this definition,
mistranslation of a source word includes choice of 
the wrong target word (or words), the absence (or 
incorrect addition) of a particle related to the word, 
and the generation of a correct target word in the 
wrong position. 
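As a sketch of how such a figure can be computed from per-word judgements, consider the following Python fragment. It assumes that the rate is the number of source words judged mistranslated divided by the total number of source words, expressed as a percentage. The function name and the judgement data are hypothetical.

```python
def translation_word_error_rate(judgements):
    """Percentage of source words judged to have been mistranslated.

    judgements holds one list per test utterance, with one boolean per
    source word: True if a judge marked that word as mistranslated
    (wrong target word, missing or spurious particle, or wrong position).
    """
    total_words = sum(len(utterance) for utterance in judgements)
    if total_words == 0:
        raise ValueError("no source words to evaluate")
    mistranslated = sum(sum(utterance) for utterance in judgements)
    return 100.0 * mistranslated / total_words


# Hypothetical judgements for two utterances of four and six words.
judgements = [
    [False, False, True, False],
    [False, False, False, False, False, False],
]
rate = translation_word_error_rate(judgements)
```

Pooling words over the whole test set, rather than averaging per-utterance rates, keeps long utterances from being under-weighted. Which of the two the reported figures use is not stated here, so the pooling is an assumption of the sketch.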
The improvement in word error rates of the trans- 
ducer system was achieved without the benefit of the 
additional counts from unsupervised training,
mentioned above, with 13,000 utterances. Earlier
experiments (Alshawi and Buchsbaum 1997) show that the
unsupervised training does lead to an improvement
in the performance of the transfer system. However,
this improvement is relatively small: around
2% reduction in the number of utterances contain- 
ing translation errors. (Word error rates for direct 
comparison with the results above are not available.) 
We also know that some additional improvement of 
the transducer system can be achieved by increasing 
the amount of training data: with a further 600 su- 
pervised training samples (for a total of 1800), the 
error rate for the transducer system falls to 11.0%. 
The processing times reported above are averages 
over the same 200 test utterances used in the accu- 
racy evaluation. These timings are for an implemen- 
tation of the search algorithms in Lisp on a Silicon 
Graphics machine with a 150MHz R4400 processor. 
The space figures give the average amount of mem- 
ory allocated in processing each utterance. 
5.3 Model Size and Development Effort 
The performance comparison above is, of course, not 
the whole story, particularly since manual effort was 
required to build the model structures before train- 
ing for cost assignment. However, we believe the 
conclusion for the improvement in performance of 
the transducer system is valid because the amount 
of effort in building and training the transfer models 
exceeded that for the transducer system. After
construction of the English head acceptor models, 
common to both systems, a rough estimate of the 
effort required for completing the models for English 
to Chinese translation is 12 person-months for the 
transfer system and 3 person-months for the trans- 
ducer system. With respect to training effort, as 
noted, the amount of supervised training effort in 
the main experiment was the same for both systems 
(supervised discriminative training for 1200 utterances
plus tagging of prepositional attachments for
800 utterances), while the transfer system also ben- 
efited from unsupervised training with 13000 utter- 
ances. 
In comparing models for language processing, or 
indeed other tasks, it is reasonable to ask if per- 
formance improvements by one model over another 
were achieved through an increase in model complex- 
ity. We looked at three measures of model complex- 
ity for the two systems, with the results shown in 
Table 2. The first was the number of lexical entries. 
For the transfer model this includes both monolin- 
gual entries and the bilingual entries required for the 
English to Chinese direction; there are only bilin- 
gual entries in the transducer model. Comparing the 
structural complexity of the two models is somewhat 
more difficult but we can make a graph-theoretic
abstraction and count the number of edges in model
components. Both systems include edges for
automaton state transitions. The edge count for the
transfer system includes the number of dependency
graph edges in bilingual entries. Finally, we also
looked at the number of choices for which training
counts were available, i.e., the number of model
numerical parameters for which direct evidence was
present in training data. As can be seen from
Table 2, the transducer system has a lower model
complexity according to all three measures.
                   Transfer   Head Transducer
Lexical entries       3,250         1,201
Edges                72,180        47,910
Choices             100,472        67,011
Table 2: Lexicon and model size comparison
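To put the figures in Tables 1 and 2 on a common footing, the relative reduction achieved by the transducer system on each measure can be computed as below. The numbers are those reported in the two tables. The dictionary layout and the function name are assumptions of this sketch, and the percentages are derived here rather than reported in the paper.

```python
def relative_reduction(transfer, transducer):
    """Reduction of the transducer figure relative to the transfer figure,
    as a percentage of the transfer figure."""
    return 100.0 * (transfer - transducer) / transfer


# (transfer, head transducer) pairs copied from Tables 1 and 2.
measures = {
    "word error rate (per cent)": (16.2, 11.7),
    "time (seconds/sent.)": (1.09, 0.17),
    "space (Mbytes/sent.)": (1.67, 0.14),
    "lexical entries": (3250, 1201),
    "edges": (72180, 47910),
    "choices": (100472, 67011),
}
reductions = {
    name: relative_reduction(transfer, transducer)
    for name, (transfer, transducer) in measures.items()
}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower for the transducer system")
```

On this view the largest relative gains are in time and space per sentence, while the edge and choice counts each fall by roughly a third.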
6 Conclusion 
There are many aspects to the effectiveness of the 
translation component of a speech translator, mak- 
ing comparisons between systems difficult. There is 
also an inherent difficulty in evaluating the transla- 
tion task: a single source utterance has many valid 
translations and the validity of translations is a mat- 
ter of degree. Despite this, we believe that in the 
comparison considered in this paper, it is reason- 
able to make an overall assessment that the head 
transducer system is more effective than the
transfer-based system. One justification for this conclusion
is that the systems were closely related, having iden- 
tical sublanguage domain and test data, and using 
similar automata for analysis in the transfer system 
and transduction in the transducer system. Another 
justification is that it was not necessary to make 
difficult comparisons between different aspects of ef- 
fectiveness: the transducer system performed better 
with respect to all the measures we looked at for 
accuracy, speed, memory, development effort and 
model complexity. Looking forward, the relative 
simplicity of head transducer models makes them 
more promising for further automating the develop- 
ment of translation applications. 
Acknowledgment 
We are grateful to Jishen He for building the Chinese 
model and bilingual lexicon of the earlier transfer 
system that we used in this work for comparison 
with the head transducer system. 

References 
Alshawi, H. and A.L. Buchsbaum. 1997. "State- 
Transition Cost Functions and an Application to 
Language Translation". In Proceedings of the In- 
ternational Conference on Acoustics, Speech, and 
Signal Processing, IEEE, Munich, Germany. 
Alshawi, H. 1996a. "Head Automata and Bilin- 
gual Tiling: Translation with Minimal Represen- 
tations". In Proceedings of the 34th Annual Meet- 
ing of the Association for Computational Linguis- 
tics, Santa Cruz, California, 167-176. 
Alshawi, H. 1996b. "Head Automata for Speech 
Translation". In Proceedings of the Interna- 
tional Conference on Spoken Language Processing, 
Philadelphia, Pennsylvania. 
Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, 
F. Jelinek, J. Lafferty, R. Mercer and P. Roossin.
1990. "A Statistical Approach to Machine Trans- 
lation". Computational Linguistics 16:79-85. 
Brown, P.F., S.A. Della Pietra, V.J. Della Pietra, 
and R.L. Mercer. 1993. "The Mathematics of 
Statistical Machine Translation: Parameter Esti- 
mation". Computational Linguistics 19:263-312. 
Chen, K.H. and H. H. Chen. 1992. "Attachment and 
Transfer of Prepositional Phrases with Constraint 
Propagation". Computer Processing of Chinese 
and Oriental Languages, Vol. 6, No. 2, 123-142. 
Dorr, B.J. 1994. "Machine Translation Divergences: 
A Formal Description and Proposed Solution". 
Computational Linguistics 20:597-634. 
Dunning, T. 1993. "Accurate Methods for the Statistics
of Surprise and Coincidence". Computational
Linguistics 19:61-74. 
Hudson, R.A. 1984. Word Grammar. Blackwell, 
Oxford. 
Hirschman, L., M. Bates, D. Dahl, W. Fisher, J. 
Garofolo, D. Pallett, K. Hunicke-Smith, P. Price, 
A. Rudnicky, and E. Tzoukermann. 1993. "Multi- 
Site Data Collection and Evaluation in Spoken 
Language Understanding". In Proceedings of the 
Human Language Technology Workshop, Morgan 
Kaufmann, San Francisco, 19-24. 
Isabelle, P. and E. Macklovitch. 1986. "Transfer 
and MT Modularity". In Eleventh International
Conference on Computational Linguistics, Bonn, 
Germany, 115-117. 
Jelinek, F., R.L. Mercer and S. Roukos. 1992. 
"Principles of Lexical Language Modeling for 
Speech Recognition". In S. Furui and M.M. 
Sondhi (eds.), Advances in Speech Signal Process- 
ing, Marcel Dekker, New York. 
Lafferty, J., D. Sleator and D. Temperley. 1992. 
"Grammatical Trigrams: A Probabilistic Model of 
Link Grammar". In Proceedings of the 1992 AAAI 
Fall Symposium on Probabilistic Approaches to 
Natural Language, 89-97. 
Kay, M. 1989. "Head Driven Parsing". In Pro- 
ceedings of the Workshop on Parsing Technolo- 
gies, Pittsburgh, 1989. 
Lindop, J. and J. Tsujii. 1991. "Complex Transfer
in MT: A Survey of Examples". Technical Re- 
port 91/5, Centre for Computational Linguistics, 
UMIST, Manchester, UK. 
Satta, G. and O. Stock. 1989. "Head-Driven Bidirectional
Parsing". In Proceedings of the Workshop
on Parsing Technologies, Pittsburgh. 
Younger, D. 1967. "Recognition and Parsing of
Context-Free Languages in Time n^3". Information
and Control, 10, 189-208.
