Speech Translation Performance of Statistical Dependency Transduction
and Semantic Similarity Transduction
Hiyan Alshawi and Shona Douglas
AT&T Labs - Research
Florham Park, NJ 07932, USA
{hiyan,shona}@research.att.com
Abstract
In this paper we compare the performance
of two methods for speech translation.
One is a statistical dependency transduc-
tion model using head transducers, the
other a case-based transduction model in-
volving a lexical similarity measure. Ex-
amples of translated utterance transcrip-
tions are used in training both models,
though the case-based model also uses se-
mantic labels classifying the source utter-
ances. The main conclusion is that while
the two methods provide similar transla-
tion accuracy under the experimental con-
ditions and accuracy metric used, the sta-
tistical dependency transduction method
is significantly faster at computing trans-
lations.
1 Introduction
Machine translation, natural language processing,
and more generally other computational problems
that are not amenable to closed form solutions,
have typically been tackled by one of three broad
approaches: rule-based systems, statistical mod-
els (including generative models), and case-based
systems. Hybrid solutions combining these ap-
proaches have also been used in language pro-
cessing generally (Klavans and Resnik, 1996) and
more specifically in machine translation (for exam-
ple Frederking et al. (1994)).
In this paper we compare the performance of two
methods for speech translation. One is the statistical
dependency transduction model (Alshawi and Dou-
glas, 2000; Alshawi et al., 2000b), a trainable gener-
ative statistical translation model using head trans-
ducers (Alshawi, 1996). The other is a case-based
transduction model which makes use of a semantic
similarity measure between words. Both models are
trained automatically using examples of translated
utterances (the transcription of a spoken utterance
and a translation of that transcription). The case-
based model makes use of additional information in
the form of labels associated with source language
utterances, typically one or two labels per utterance.
This additional information, which was originally
provided for a separate monolingual task, is used to
construct the lexical similarity measure.
Neither the training of these translation methods
nor their runtime application requires a pre-existing
bilingual lexicon. Instead, in both cases, the initial
phase of training from the translation data is a sta-
tistical hierarchical alignment search applied to the
set of bilingual examples. This training phase pro-
duces a bilingual lexicon, used by both methods, as
well as synchronized hierarchical alignments used to
build the dependency transduction model.
In the experiments comparing the performance
of the models we look at accuracy as well as the
time taken to translate sentences from English to
Japanese. The source language inputs used in these
experiments are naturally spoken utterances from
large numbers of real customers calling telephone
operator services.
In section 2 we describe the hierarchical align-
ment algorithm followed by descriptions of the
translation methods in sections 3 and 4. We present
the experiments in section 5 and provide concluding
remarks in section 6.
Proceedings of the Workshop on Speech-to-Speech Translation:
Algorithms and Systems, Philadelphia, July 2002, pp. 31-38.
Association for Computational Linguistics.
Figure 1: Alignment mapping f, source head-map g,
and target head-map h
2 Hierarchical alignments
Both the translation systems described in this pa-
per make use of automatically created hierarchical
alignments of the source and target strings of the
training corpus bitexts. As will be described in sec-
tion 3, we estimate the parameters of a dependency
transduction model from such alignments. In the
case-based method described in section 4, the align-
ments are the basis for the translation lexicon used
to compute substitutions and word-for-word transla-
tions.
A hierarchical alignment consists of four functions.
The first two functions are an alignment mapping f
from source words w to target words f(w) (which may
be the empty word ε), and an inverse alignment
mapping f' from target words v to source words f'(v).
(The inverse mapping is needed to handle mapping of
target words to ε; it coincides with f for pairs
without ε.) The other two functions are a source
head-map g mapping source dependent words w to their
heads g(w) in the source string, and a target head-map
h mapping target dependent words v to their head
words h(v) in the target string. An example
hierarchical alignment is shown in Figure 1.
A hierarchical alignment is synchronized (i.e.
corresponds to synchronized dependency trees) if,
roughly speaking, f induces an isomorphism between
the dependency functions g and h (see Alshawi and
Douglas (2000) for a more formal definition). The
hierarchical alignment in Figure 1 is synchronized.
In some previous work (Alshawi et al., 1998; Al-
shawi et al., 2000a; Alshawi et al., 2000b) the train-
ing method constructs synchronized alignments in
which each head word has at most two dependent
phrases. Here we use the technique described by
Alshawi and Douglas (2000) where the models have
greater freedom to vary the granularity of phrase lo-
cality.
Constructing synchronized hierarchical align-
ments for a corpus has two stages: (a) computing
co-occurrence statistics from the training data; (b)
searching for an optimal synchronized hierarchical
alignment for each bitext.
2.1 Word correlation statistics
For each source word in the dataset, a translation
pairing cost a17a18a6a8a5a20a19a21a12a22a19a24a23a25a9 is assigned for all possible
translations in the context of a bitext a23 . Here a5 and a12
are usually words, but may also be the empty word a11
or compounds formed from contiguous words; here
we restrict compounds to a maximum length of two
words.
The assignment of these lexical translation pairing
costs may be done using various statistical measures.
The main component of c is the so-called φ correlation
measure (see Gale and Church (1991)) normalized to
the range [0, 1] with 0 indicating perfect
correlation. In the experiments described in this
paper, the cost function c relating a source word (or
compound) w in a bitext with a target word (or
compound) v is

   c(w, v, b) = φ(w, v) + d(w, v, b)

where d(w, v, b) is a length-normalized measure of
the apparent distortion in the positions of w and v
in the source and target strings of b. For example,
if w appears at the middle of the source string and v
appears at the middle of the target string, then the
distortion is 0. We have found that, at least for our
data, this pairing cost leads to better performance
than the use of log probabilities of target words
given source words (cf. Brown et al. (1993)).
The value used for φ(w, v) is first computed from
counts of the number of bitexts in the training set
in which w and v co-occur, in which w only appears,
in which v only appears, and in which neither of them
appears. In other words, we first treat any word in
the target string as a possible translation of any
word in the source string. This value is then refined
by re-estimation during the alignment optimization
process.
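As an illustrative sketch (not the authors' code; the function name and the exact mapping of φ into a [0, 1] cost are our assumptions), the first-round correlation cost could be computed from the four bitext counts like this:

```python
import math

def phi_cost(n_both, n_src_only, n_tgt_only, n_neither):
    """Cost in [0, 1] derived from the phi correlation coefficient.

    The four counts are the numbers of training bitexts in which the
    source word w and the target word v: co-occur, occur alone (w
    only, v only), or are both absent.  0 indicates perfect
    correlation, as in the text."""
    a, b, c, d = n_both, n_src_only, n_tgt_only, n_neither
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0.0:
        return 1.0                      # undefined correlation: worst cost
    phi = (a * d - b * c) / denom       # phi lies in [-1, 1]
    return (1.0 - phi) / 2.0            # map 1 -> 0 (perfect), -1 -> 1
```

Words that always co-occur get cost 0; words that never co-occur get cost 1.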
2.2 Optimal hierarchical alignments
We wish to find a hierarchical alignment that re-
spects the co-occurrence statistics of bitexts as well
as the phrasal structure implicit in the source and tar-
get strings. For this purpose we define the cost of a
hierarchical subalignment to be the sum of the costs
c(w, v, b) of each pairing (w, v) ∈ f, where f is the
(sub)alignment mapping function.
The complete hierarchical alignment which min-
imizes this cost function is computed using a dy-
namic programming procedure. This procedure
works bottom-up, starting with all possible sub-
alignments with at most one source word (or com-
pound) and one target word (or compound). Adja-
cent source substrings are then combined to deter-
mine the lowest cost subalignments for successively
larger substrings of the bitext satisfying the con-
straints for synchronized alignments stated above.
The successively larger substrings eventually span
the entire source string, yielding the optimal hierar-
chical alignment for the bitext.
At each combination step in the optimization pro-
cedure, one of the two source subphrases is added
as a dependent of the head of the other subphrase.
Since the alignment we are constructing is synchro-
nized, this choice will force the selection of a target
dependent phrase. Our current (admittedly crude)
strategy for selecting the dependent subphrase is to
choose the one with the highest subalignment cost,
i.e. the head of the subphrase with the better sub-
alignment becomes the head of the enlarged phrase.
Recall that the initial estimates for φ are computed
from co-occurrence counts for w, v in bitexts. In the
second and subsequent rounds of this procedure, the
φ values are computed from co-occurrence counts for
(w, v) in pairings in the alignments produced by the
previous round. The improvement in the models
resulting from this re-estimation seems to stabilize
after approximately five to ten rounds.
3 Statistical Dependency Transduction
The dependency transduction model is an automati-
cally trainable translation method that models cross-
lingual lexical mapping, hierarchical phrase struc-
ture, and monolingual lexical dependency. It is a
generative statistical model for synchronized pairs
of dependency trees in which each local tree is pro-
duced by a weighted head transducer. Since this
model has been presented at length elsewhere (Al-
shawi, 1996; Alshawi et al., 2000a; Alshawi and
Douglas, 2000), the description in this paper will be
relatively compact.
3.1 Weighted finite state head transducers
A weighted finite state head transducer is a finite
state machine that differs from ‘standard’ finite state
transducers in that, instead of consuming the input
string left to right, it consumes it ‘middle out’ from
a symbol in the string. Similarly, the output of a
head transducer is built up middle-out at positions
relative to a symbol in the output string.
Formally, a weighted head transducer is a 5-tuple:
an alphabet Σ of input symbols; an alphabet Δ of
output symbols; a finite set Q of states q0, ..., qs;
a set of final states F ⊆ Q; and a finite set T of
state transitions. A transition from state q to state
q' has the form

   ⟨q, q', w, v, α, β, c⟩

where w is a member of Σ or is the empty string ε;
v is a member of Δ or ε; the integer α is the input
position; the integer β is the output position; and
the real number c is the weight of the transition.
The roles of q, q', w, and v in transitions are
similar to the roles they have in left-to-right
transducers, i.e. in transitioning from state q to
state q', the transducer 'reads' input symbol w and
'writes' output symbol v, and as usual if w (or v) is
ε then no read (respectively write) takes place for
the transition.

To define the role of transition positions α and β,
we consider notional input (source) and output
(target) tapes divided into squares. On such a tape,
one square is numbered 0, and the other squares are
numbered 1, 2, ... rightwards from square 0, and
−1, −2, ... leftwards from square 0. A transition
with input position α and output position β is
interpreted as reading w from square α on the input
tape and writing v to square β of the output tape; if
square β is already occupied then v is written to the
next empty square to the left of β if β < 0, or to
the right of β if β > 0, and similarly if input was
already read from position α, w is taken from the
next unread square to the left of α if α < 0 or to
the right of α if α > 0.
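The output-position convention above can be illustrated with a minimal sketch (our own; the dict-based tape representation is an assumption, and the behaviour when square 0 is occupied, which the text leaves open, here spills to the right):

```python
def write_middle_out(tape, pos, symbol):
    """Write `symbol` at square `pos` of a middle-out output tape,
    represented as a dict mapping integer squares to symbols.  If the
    square is occupied, the symbol spills to the next empty square to
    the left when pos < 0, or to the right when pos > 0 (we also
    spill right for pos == 0, an assumption)."""
    step = -1 if pos < 0 else 1
    while pos in tape:
        pos += step
    tape[pos] = symbol

def flatten(tape):
    """Read the tape left to right into an output string."""
    return ' '.join(tape[i] for i in sorted(tape))
```

For example, writing 'went' at 0, 'he' at −1, then 'home' and 'yesterday' both at 1 leaves 'yesterday' in square 2, giving "he went home yesterday".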
3.2 Dependency transduction models
Dependency transduction models are generative
statistical models which derive synchronized pairs
of dependency trees, a source language dependency
tree and a target dependency tree. A dependency
tree, in the sense of dependency grammar (for exam-
ple Hays (1964), Hudson (1984)), is a tree in which
the words of a sentence appear as nodes; the parent
of a node is its head and the child of a node is the
node’s dependent.
In a dependency transduction model, each syn-
chronized local subtree corresponds to a head trans-
ducer derivation: the head transducer is used to con-
vert a sequence consisting of a head word w and its
immediate left and right dependent words to a se-
quence consisting of a target word v and its immedi-
ate left and right dependent words. (Since the empty
string may appear in a transition in place of a source
or target symbol, the number of source and target de-
pendents can be different.) When applying a depen-
dency transduction model to translation, we choose
the target string obtained by flattening the target tree
of the lowest cost recursive dependency derivation
that also yields the source string.
For a dependency transduction model to be a
statistical model for generating pairs of strings, we
assign transition weights that are derived from
conditional probabilities. Several probabilistic
parameterizations can be used for this purpose
including the following for a transition with head
words w and v and dependent words w' and v':

   P(q', w', v', α, β | w, v, q).

Here q and q' are the from-state and to-state for the
transition and α and β are the source and target
positions, as before. We also need parameters
P(q0 | w, v) for the probability of choosing an
initial head transducer state q0 given a pair of
words (w, v) heading a synchronized pair of subtrees.
To start the derivation, we need parameters
P(root(w0, v0)) for the probability of choosing
w0, v0 as the root nodes of the two trees.

These model parameters can be used to generate
pairs of synchronized dependency trees starting with
the topmost nodes of the two trees and proceeding
recursively to the leaves. The probability of such a
derivation can be expressed as:

   P(root(w0, v0)) P(D_{w0,v0})

where P(D_{w,v}) is the probability of a
subderivation headed by w and v, that is

   P(D_{w,v}) = P(q0 | w, v)
       ∏_{1 ≤ i ≤ k} P(q_{i+1}, w_i, v_i, α_i, β_i | w, v, q_i) P(D_{w_i,v_i})

for a derivation in which the dependents of w and v
are generated by k transitions.
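A rough sketch of how these parameters compose into a derivation probability (the tuple-based tree and the parameter tables are hypothetical stand-ins for the estimated model; input/output positions are elided and successive states are simply numbered):

```python
def derivation_prob(node, p_root, p_init, p_trans):
    """Probability of a synchronized derivation rooted at `node`,
    following the parameterization in the text.  `node` is a
    (w, v, deps) triple whose deps are child nodes; p_root, p_init,
    and p_trans are dicts standing in for the estimated
    P(root(w0, v0)), P(q0 | w, v), and transition probabilities.
    All names here are illustrative assumptions."""
    w0, v0, _ = node
    return p_root[(w0, v0)] * _subderivation_prob(node, p_init, p_trans)

def _subderivation_prob(node, p_init, p_trans):
    w, v, deps = node
    prob = p_init[(w, v)]       # choose the initial transducer state
    state = 0
    for child in deps:
        wi, vi, _ = child
        # transition probability P(q_{i+1}, w_i, v_i | w, v, q_i),
        # with positions alpha_i, beta_i omitted for brevity
        prob *= p_trans[(w, v, state, wi, vi)]
        state += 1
        prob *= _subderivation_prob(child, p_init, p_trans)
    return prob
```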
The parameters of this probabilistic synchronized
tree derivation model are estimated from the results
of running the hierarchical alignment algorithm de-
scribed in section 2 on the sentence pairs in the train-
ing corpus. For this purpose, each synchronized tree
resulting from the alignment process is assumed to
be derived from a dependency transduction model,
so transition counts for the model are tallied from
the set of synchronized trees. (For further details,
see Alshawi and Douglas (2000).)
To carry out translation with a dependency trans-
duction model, we apply a “middle-out” dynamic
programming search to find the optimal derivation.
This algorithm can take as input either word strings
or word lattices produced by a speech recognizer.
The algorithm is similar to those for context free
parsing such as chart parsing (Earley, 1970) and the
CKY algorithm (Younger, 1967). It is described in
Alshawi et al. (2000b).
4 Similarity Case-Based Transduction
4.1 Training the transduction parameters
Our semantic similarity transduction method is a
case-based (or example-based) method for transduc-
ing source strings to target strings that makes use of
two different kinds of training data:
- A set of source-string, target-string pairs that
are instances of the transduction mapping.
Specifically, transcriptions of spoken utter-
ances in the source language and their transla-
tion into the target language. This is the same
data used for training the dependency trans-
duction model. It is used in this transduction
method to construct a probabilistic bilingual
lexicon, while the source side is used as the set
of examples for matching.
- A mapping between the source strings and sub-
sets of a (relatively small) set of classes, or la-
bels. The idea is that the labels give a broad
classification of the meaning of the source
strings, so we will refer to them informally as
“semantic” labels. In our experiments, these
classes correspond to 15 call routing destina-
tions associated with the transcribed utterances.
For the purposes of the case-based method, this
data is used to construct a similarity measure
between words of the source language.
As noted earlier, the alignment algorithm de-
scribed in section 2 is applied to the translation pairs
to yield a set of synchronized dependency trees.
Using the resulting trees, the probabilities of a
bilingual lexicon, i.e.

   P(v | w)

where w is a source language word, and v is a target
language word, are estimated from the counts of
synchronized lexical nodes. (Since the synchronized
trees are dependency trees, both paired fringe nodes
and interior nodes are included in the counts.) In
this probabilistic lexicon, v may be ε, the empty
symbol, so source words may have different
probabilities of being deleted. However, for
insertion probabilities, we assume that P(ε | ε) = 1,
to avoid problems with spurious insertions of target
words.
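A minimal sketch of this estimation step, assuming each synchronized tree is flattened to its (w, v) node pairs (the representation and function name are ours, not the paper's):

```python
from collections import Counter, defaultdict

def estimate_lexicon(sync_trees):
    """Estimate P(v | w) from counts of synchronized lexical nodes.
    Each tree is given here simply as a list of (w, v) node pairs,
    where v may be the empty symbol '' for deleted source words (a
    simplified stand-in for the paper's dependency trees)."""
    pair_counts = Counter()
    src_counts = Counter()
    for tree in sync_trees:
        for w, v in tree:
            pair_counts[(w, v)] += 1
            src_counts[w] += 1
    lexicon = defaultdict(dict)
    for (w, v), n in pair_counts.items():
        lexicon[w][v] = n / src_counts[w]
    return lexicon
```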
The labels associated with the source strings were
originally assigned by manual annotation for the
purposes of a different research project, specifically
for training an automatic call routing system, us-
ing the methods described by Gorin et al. (1997).
(Many of the training sentences are assigned mul-
tiple labels.)
For the translation task, the labels are used to
compute a similarity measure S(w1, w2) as a
divergence between a probability distribution
conditional on source word w1 and a corresponding
distribution conditional on another source word w2.
The distributions involved, P(L | w1) and P(L | w2),
are those for the probability P(l | w) that a source
string which includes word w has been assigned label
l. The similarity measure S(w1, w2) is computed from
the relative entropy D (Kullback-Leibler distance
(Kullback and Leibler, 1951)) between these
distributions. To make the similarity measure
symmetrical, i.e. S(w1, w2) = S(w2, w1), we take the
average of two relative entropy quantities:

   S(w1, w2) = 1/2 (D(P(L | w1) || P(L | w2)) +
               D(P(L | w2) || P(L | w1)))
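This measure can be sketched directly from the formula (assuming, as a simplification, that both distributions assign nonzero probability wherever the other does; the paper does not say how zero counts are smoothed):

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) for label distributions given as
    dicts over the same label set."""
    return sum(p[l] * math.log(p[l] / q[l]) for l in p if p[l] > 0)

def similarity(p_w1, p_w2):
    """Symmetrized divergence S(w1, w2): the average of
    D(P(L|w1) || P(L|w2)) and D(P(L|w2) || P(L|w1)).  Identical
    distributions give 0; larger values mean less similar words."""
    return 0.5 * (kl(p_w1, p_w2) + kl(p_w2, p_w1))
```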
Of course, this is one of many different possible
similarity measures which could have been used (cf.
Pereira et al. (1993)), including ones that do not de-
pend on additional labels. However, since seman-
tic labels had already been assigned to our train-
ing data, the distributions seemed like a convenient
rough proxy for the semantic similarity of words in
this limited domain.
4.2 Case-based transduction procedure
Basically, the transduction procedure (i) finds an
instance (s, t) of the translation training pairs for
which the example source string s provides the
"best" match to the input source string u, and (ii)
produces, as the translation output, a modified
version of the example target string t, where the
modifications reflect mismatches between s and the
input.
For the first step, the similarity measure between
words computed in terms of the relative entropy for
label distributions is used to compute a distance

   d(s1, s2)

between two source strings s1 and s2. The
(semantically influenced) string distance d is a
weighted edit distance (Wagner and Fischer, 1974)
between the two strings in which the cost of
substituting one source word w1 for another w2 is
provided by the "semantic" similarity measure
S(w1, w2). A standard quadratic dynamic programming
search algorithm is used to find the weighted edit
distance between two strings. This algorithm finds a
sequence of edit operations (insertions, deletions,
and substitutions) that yield s2 from s1 so that
d(s1, s2), the sum of the costs of the edit
operations, is minimal over all such edit sequences.
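The weighted edit distance search can be sketched as the standard quadratic DP (unit insertion and deletion costs are an illustrative assumption; the paper does not state them):

```python
def weighted_edit_distance(s1, s2, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Weighted edit distance between word sequences s1 and s2 via
    the standard O(len(s1) * len(s2)) dynamic program.  sub_cost(w1, w2)
    supplies the substitution cost -- here it would be the semantic
    similarity S(w1, w2)."""
    n, m = len(s1), len(s2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + del_cost,
                          d[i][j - 1] + ins_cost,
                          d[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]))
    return d[n][m]
```

Recovering the actual edit sequence, needed for step (ii) below, would be done by the usual backtrace over the same table.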
The weighted edit distance search is applied to u and
each example source string s to identify the example
translation pair (s, t) for which d(u, s) is minimal
over all example source strings. The corresponding
sequence of edits for this minimal distance is used
to compute a modified version t' of t. For this
purpose, the source language edits are "translated"
into corresponding target language edits using the
probabilistic bilingual lexicon estimated from
aligning the training data. Specifically, for each
substitution w1 → w2 in the edits resulting from the
weighted edit distance search, a substitution
v1 → v2 is applied to t. Here v_i is chosen so that
P(v_i | w_i) is maximal. The translated edits are
applied sequentially to t to give t'.
The modified example target string t' is used as the
output of this translation method unless the minimal
edit distance between u and the closest example s
exceeds a threshold determined experimentally. (For
this purpose, the edit distance is normalized by
utterance length.) If the threshold is exceeded, so
that no "sufficiently close" examples are available,
then a word-for-word translation is used as the
output by simply applying the probabilistic lexicon
to each word of the input. It is perhaps worth
mentioning that the statistical dependency
transduction method does not need such a fall-back
to word-for-word translation: the middle-out (island
parsing) search algorithm used with head transducers
gracefully degrades into word-for-word translation
when the training data is too sparse to cover the
input string.
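The word-for-word fall-back amounts to taking the most probable lexicon entry per input word; a hedged sketch (the dict-of-dicts lexicon representation and handling of unknown words are our assumptions):

```python
def word_for_word(source_words, lexicon):
    """Fall-back translation: replace each source word by its most
    probable target word under the bilingual lexicon P(v | w).  Words
    whose best entry is the empty symbol '' are deleted; words absent
    from the lexicon are dropped (an assumption)."""
    out = []
    for w in source_words:
        translations = lexicon.get(w)
        if not translations:
            continue
        v = max(translations, key=translations.get)
        if v:
            out.append(v)
    return out
```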
5 Experiments and results
5.1 Data set
The corpora for the experiments reported here con-
sist of spoken English utterances, paired with their
translations into Japanese. The English utterances
were the customer side of actual AT&T customer-
operator conversations. There were 12,226 training
bitexts and an additional 3,253 bitexts for testing. In
the text experiments, the English side of the bitext is
the human transcriptions of the recorded speech; in
the speech experiments, it is the output of speech
recognition. The case-based model makes use of
additional information in the form of labels associ-
ated with source language utterances, classifying the
source utterances into 15 task related classes such as
“collect-call”, “directory-assistance”, etc.
The translations were carried out by a commer-
cial translation company. Since Japanese text has no
word boundaries, we asked the translators to insert
spaces between Japanese characters whenever they
‘arose from different English words in the source’.
This imposed an English-centric view of Japanese
text segmentation.
5.2 Evaluation metrics
We use two simple string edit-distance evaluation
metrics that can be calculated automatically. These
metrics, simple accuracy and translation accuracy,
are used to compare the target string produced by the
system against the reference human translation from
held-out data. Simple accuracy (the ‘word accu-
racy’ of speech recognition research) is computed by
first finding a transformation of one string into an-
other that minimizes the total number of insertions,
deletions and substitutions. Translation accuracy in-
cludes transpositions (i.e. movement) of words as
well as insertions, deletions, and substitutions. We
regard the latter measure as more appropriate for
evaluation of translation systems because the simple
metric would count a transposition as two errors: an
insertion plus a deletion. If we write i for the
number of insertions, d for deletions, s for
substitutions, t for transpositions, and r for the
number of words in the reference translation string,
we can express the metrics as follows:

   simple accuracy = 1 − (i + d + s)/r
   translation accuracy = 1 − (i + d + s + t)/r

Since a transposition corresponds to an insertion and
a deletion, the values of i and d will be different
in the expressions for computing the two accuracy
metrics. The units for string operations in the
evaluation metrics are Japanese characters.
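The simple accuracy metric can be sketched as follows (translation accuracy additionally requires counting transpositions, which this sketch omits):

```python
def simple_accuracy(reference, hypothesis):
    """Simple accuracy: 1 - (i + d + s) / r, where i + d + s is the
    minimum number of insertions, deletions, and substitutions needed
    to turn the hypothesis into the reference, and r is the reference
    length.  For Japanese output the units are characters; the value
    can go negative for very bad hypotheses."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + same)
    return 1.0 - d[n][m] / n
```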
5.3 Experimental conditions and results
The following experimental systems are evaluated
here:
Word-Word A simple word for word baseline
method in which each source word is replaced with
the most highly correlated target word in the training
corpus.
Stat-Dep The statistical dependency transduction
method as described in section 3.
Sim-Case The semantic similarity case-based method
described in section 4.

                Simple     Translation
                accuracy   accuracy
    Word-Word   37.2       42.8
    Stat-Dep    69.3       72.9
    Sim-Case    70.6       71.5

Table 1: Accuracy for text (%)

                Simple     Translation
                accuracy   accuracy
    Word-Word   29.2       33.7
    Stat-Dep    57.4       59.7
    Sim-Case    59.4       60.2

Table 2: Accuracy for speech (%)
Table 1 shows the results on human transcriptions
of the set of test utterances.
Table 2 shows the test set results of translating
automatic speech recognition output. The speech
recognizer used a speaker-independent telephony
acoustic model and a statistical trigram language
model.
Table 3 shows the speed of loading (once per test
set) and the average run time per utterance transla-
tion for the dependency transduction and case-based
systems.
6 Concluding Remarks
In this paper we have compared the accuracy and
speed of two translation methods, statistical depen-
dency transduction and semantic similarity case-
based transduction. The statistical transduction
model is trainable from unannotated examples of
sentence translations, while the case-based method
additionally makes use of a modest amount of an-
notation to learn a lexical semantic similarity func-
tion, a factor in favor of the dependency transduction
method.
In the experiments we presented, the transduc-
tion methods were applied to translating automatic
speech recognition output for English utterances
into Japanese in a limited domain. The evaluation
metric used to compare translation accuracy was an
automatic string comparison function applied to the
output produced by both methods.

               Load time   Run time/translation
    text
    Stat-Dep   7176          53
    Sim-Case   3856        2220
    speech
    Stat-Dep   7447          66
    Sim-Case   5925        2333

Table 3: Translation time (ms)

The basic result was that translation accuracy was very similar for
both models, while the statistical dependency trans-
duction method was significantly faster at produc-
ing translations at run time. Since training time for
both methods is dominated by the alignment training
phase they share, training time issues do not favor
one method over the other.
These results need to be interpreted in the rather
narrow experimental setting used here: the amount
of training data used, the specific language pair (En-
glish to Japanese), the evaluation metric, and the
uncertainty in the input strings (speech recognition
output) to which the methods were applied. Fur-
ther research varying these experimental conditions
is needed to provide a fuller comparison of the rela-
tive performance of the methods. However, it should
be possible to develop algorithmic improvements to
increase the computational efficiency of similarity
case-based transduction to make it more compet-
itive with statistical dependency transduction at run-
time.

References
H. Alshawi and S. Douglas. 2000. Learning depen-
dency transduction models from unannotated exam-
ples. Philosophical Transactions of the Royal Soci-
ety (Series A: Mathematical, Physical and Engineer-
ing Sciences), 358:1357–1372, April.
H. Alshawi, S. Bangalore, and S. Douglas. 1998.
Learning Phrase-based Head Transduction Models for
Translation of Spoken Utterances. In Proceedings
of the International Conference on Spoken Language
Processing, pages 2767–2770, Sydney, Australia.
H. Alshawi, S. Bangalore, and S. Douglas. 2000a. Head
transducer models for speech translation and their au-
tomatic acquisition from bilingual data. Machine
Translation, 15(1/2):105–124.
H. Alshawi, S. Bangalore, and S. Douglas. 2000b.
Learning dependency translation models as collections
of finite state head transducers. Computational Lin-
guistics, 26(1), January.
H. Alshawi. 1996. Head automata for speech transla-
tion. In International Conference on Spoken Language
Processing, pages 2360–2364, Philadelphia, Pennsyl-
vania.
P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, and R.L.
Mercer. 1993. The Mathematics of Statistical Machine
Translation: Parameter Estimation. Computational Lin-
guistics, 19(2):263–311.
J. Earley. 1970. An Efficient Context-Free Parsing Algo-
rithm. Communications of the ACM, 13(2):94–102.
R. Frederking, S. Nirenburg, D. Farwell, S. Helmreich,
E. Hovy, K. Knight, S. Beale, C. Domashnev, D. At-
tardo, D. Grannes, and R. Brown. 1994. Integrating
translations from multiple sources within the Pangloss
Mark III machine translation system. In Proceedings of the
first conference of the Association for Machine Trans-
lation in the Americas (AMTA-94), Maryland.
W.A. Gale and K.W. Church. 1991. Identifying word
correspondences in parallel texts. In Proceedings of
the Fourth DARPA Speech and Natural Language Pro-
cessing Workshop, pages 152–157, Pacific Grove, Cal-
ifornia.
A.L. Gorin, G. Riccardi, and J.H. Wright. 1997. How
may I help you? Speech Communication, 23(1-
2):113–127.
D. G. Hays. 1964. Dependency theory: a formalism and
some observations. Language, 40:511–525.
R.A. Hudson. 1984. Word Grammar. Blackwell, Ox-
ford.
Judith L. Klavans and Philip Resnik, editors. 1996. The
Balancing Act: Combining Symbolic and Statistical
Approaches to Language. The MIT Press.
S. Kullback and R. A. Leibler. 1951. On information and
sufficiency. Annals of Mathematical Statistics, 22:76–
86.
F. Pereira, N. Tishby, and L. Lee. 1993. Distributional
clustering of English words. In Proceedings of the
31st meeting of the Association for Computational Lin-
guistics, pages 183–190.
Robert A. Wagner and Michael J. Fischer. 1974. The
string-to-string correction problem. Journal of the As-
sociation for Computing Machinery, 21(1):168–173,
January.
D. Younger. 1967. Recognition and Parsing of Context-
Free Languages in Time n³. Information and Control,
10:189–208.
