Translation Methodology in the Spoken Language Translator:
An Evaluation

David Carter, Ralph Becket, Manny Rayner
SRI International
Suite 23, Millers Yard
Cambridge CB2 1RQ
United Kingdom
dmc@cam.sri.com, rvabl@cam.sri.com, manny@cam.sri.com

Robert Eklund, Catriona MacDermid, Mats Wirén
Telia Research AB
Spoken Language Processing
S-136 80 Haninge
Sweden
Robert.H.Eklund@telia.se, Catriona.I.Macdermid@telia.se, Mats.G.Wiren@telia.se

Sabine Kirchmeier-Andersen, Christina Philp
Handelshøjskolen i København
Institut for Datalingvistik
Dalgas Have 15, DK-2000 Frederiksberg
Denmark
sabine.id@cbs.dk, cp.id@cbs.dk
Abstract 
In this paper we describe how the 
translation methodology adopted for the 
Spoken Language Translator (SLT) ad- 
dresses the characteristics of the speech 
translation task in a context where it is 
essential to achieve easy customization 
to new languages and new domains. We 
then discuss the issues that arise in any 
attempt to evaluate a speech translator, 
and present the results of such an evalu- 
ation carried out on SLT for several lan- 
guage pairs. 
1 The nature of the speech 
translation task 
Speech translation is in many respects a particu- 
larly difficult version of the translation task. High 
quality output is essential: the speech produced 
must sound natural if it is to be easily compre- 
hensible. The quality of the translation itself must 
also be high, in spite of the fact that, by the nature 
of the problem, no post-editing is possible. Things 
are equally difficult on the input side: pre-editing, 
too, is difficult or impossible, yet ill-formed input 
and recognition errors are both likely to be quite 
common. Thus robust analysis and translation are 
also required. Furthermore, any attempted solu- 
tions to these problems must be capable of oper- 
ating at a speed close enough to real time that 
users are not faced with unacceptable delays. 
Together, these factors mean that speech trans- 
lation is currently only practical for limited do- 
mains, typically involving a vocabulary of a few 
thousand words. Because of this, it is desir- 
able that a speech translator should be easily 
portable to new domains. Portability to new lan- 
guages, involving the acquisition of both monolin- 
gual and cross-linguistic information, should also 
be as straightforward as possible. These ends 
can be achieved by using general-purpose com- 
ponents for both speech and language processing 
and training them on domain-specific speech and 
text corpora. The training should be automated 
whenever possible, and where human intervention 
is required, the process should be deskilled to the 
level where, ideally, it can be carried out by peo- 
ple who are familiar with the domain but are not 
experts in the systems themselves. 
These points will be discussed in the context of 
the Spoken Language Translator (SLT) (Rayner, 
Alshawi et al., 1993; Agnès et al., 1994; Rayner
and Carter, 1997), a customizable speech trans- 
lator built as a pipelined sequence of general- 
purpose components. These components are: a 
version of the Decipher (TM) speech recognizer 
(Murveit et al., 1993) for the source language; a
copy of the Core Language Engine (CLE) (Al- 
shawi (ed), 1992) for the source language; another 
copy of the CLE for the target language; and a 
target language text-to-speech synthesizer. 
The current SLT system carries out multi- 
lingual speech translation in near real time in the 
ATIS domain (Hemphill et al., 1990) for several 
language pairs. Good demonstration versions ex-
ist for the four pairs English → Swedish, English
→ French, Swedish → English and Swedish →
Danish. Preliminary versions exist for five more
pairs: Swedish → French, French → English, En-
glish → Danish, French → Spanish and English
→ Spanish.
We describe the methodology used to build the 
SLT system itself, particularly in the areas of cus- 
tomization (Section 2), robustness (Section 3), 
and multilinguality (Section 4). For further de- 
tails on the topics of customization and multilin- 
guality, see (Rayner, Bretan et al, 1996; Rayner, 
Carter et al, 1997); and on robustness, see (Rayner 
and Carter, 1997). We then discuss the evalu- 
ation of speech translation systems. This is an 
area that deserves more attention than it has re- 
ceived to date; indeed, it is not obvious how best 
to perform such an evaluation so as to measure 
meaningfully the performance both of the overall 
system and of each of its components. In Sec- 
tions 5 and 6 of this paper, we therefore consider 
the characteristics an evaluation should have, and 
describe one we have carried out, discussing the 
extent to which it meets the desired criteria. 
2 Customization to languages and 
domains 
In the Core Language Engine, the language pro-
cessing component of the Spoken Language Trans- 
lator system, we address the requirement of porta- 
bility by maintaining a clear separation between 
(1) the system code; (2) linguistic rules, including
lexicon entries, to generate possible analyses and 
translations non-deterministically; and (3) statis- 
tical information, to choose between these possi- 
bilities. The practical advantage of this architec- 
ture is that most of the work involved in porting 
the system to a new domain is concerned with the 
parts of the system that can be modified by non- 
experts: the central activities are addition of new 
lexicon entries, and supervised training to derive 
the statistical preference information. Porting to 
new languages is a more complex task, but still 
only involves modifications to a relatively small 
subset of the whole system. In more detail: 
(1) The system code is completely general-purpose
and does not need any changes for new domains
or, other than in exceptional cases,1 for new lan-
guages.
(2) The more complex of the linguistic rules for 
a given language are the grammar, the func- 
tion word lexicon, and the macros defining com- 
mon content word behaviours (count noun, tran- 
sitive verb, etc). These are defined using explicit 
feature-value equations which must be written by 
a skilled grammarian. For a given language pair, 
the more complex transfer rules, which tend to be 
for function words and other commonly-occurring,
idiosyncratic words, can also involve arbitrarily 
large, recursive structures. However, nearly all of 
these monolingual and bilingual rules are domain- 
independent. 
On the other side of the coin, the main domain- 
dependent aspects of a linguistic description are 
1 E.g. in our initial extension from English to lan-
guages with more complicated morphology, which ne- 
cessitated the development of a morphological pro- 
cessor based on the two-level formalism (see (Carter, 
1995)). 
lexicon entries defining content words in terms 
of existing behaviours, and simple (atomic-to- 
atomic) transfer rules. These do need to be cre- 
ated manually for each new domain, but they are 
simple enough to be defined by non-experts with 
the help of relatively simple graphical tools. See 
Figures 1 and 2 for some examples of these two
kinds of rule (the details of the formalism are
unimportant here; we intend simply to illustrate
the differences in complexity). 
When moving to a new language, more expert 
intervention is typically required than for a new 
domain, because many of the complex rules do 
need some modifications. However, we have found 
that the amount of work involved in developing 
new grammars for Swedish, French, Spanish and 
most recently Danish has always been at least an 
order of magnitude less than the effort required 
for the original grammar (Gambäck and Rayner,
1992; Rayner, Carter and Bouillon, 1996; Rayner, 
Carter et al, 1997). 
(3) The statistical information used in analy- 
sis is entirely derived from the results of super- 
vised training on corpora carried out using the 
TreeBanker (Carter, 1997), a graphical tool that 
presents a non-expert user with a display of the 
salient differences between alternative analyses in 
order that the correct one may be identified. Once 
a user has become accustomed to the system, 
around two hundred sentences per hour may be 
processed in this way. This, together with the 
use of representative subcorpora (Rayner, Bouil- 
lon and Carter, 1995) to allow structurally equiv- 
alent sentences to be represented by a single ex- 
ample, means that a corpus of many thousands 
of sentences can be judged in just a few person 
weeks. The principal information extracted auto- 
matically from a judged corpus is: 
• Constituent pruning rules, which allow the 
detection and removal, at intermediate stages 
of parsing, of syntactic constituents occur- 
ring in contexts where they are unlikely to 
contribute to the correct parse. Removing 
these constituents significantly constrains the 
search space and speeds up parsing (Rayner 
and Carter, 1997). 
• An automatic tuning of the grammar to the
domain using the technique of Explanation- 
Based Learning (van Harmelen and Bundy, 
1988; Rayner, 1988; Samuelsson and Rayner, 
1991; Rayner and Carter, 1996). This 
rewrites it to a form where only commonly- 
occurring rule combinations are represented, 
thus reducing the search space still further 
and giving an additional significant speedup. 
• Preference information attached to certain 
characteristics of full analyses of sentences - 
the most important being semantic triples of 
head, relationship and modifier - which allow 
[Figure 1: Complex, domain-independent linguistic rules. The figure gives
three examples: a syntax rule for S → NP VP (syn(s_np_vp_Normal, ...));
a macro defining the syntax of transitive verbs (macro(v_subj_obj, ...));
and a transfer rule relating the English adjective "early" to the French
PP "de bonne heure" (trule([eng,fre], semi_lex(early-de_bonne_heure), ...)).]
a selection to be made between competing full 
analyses. See (Alshawi and Carter, 1994) and 
(Carter, 1997) for details. 
A similar mechanism has been developed to al- 
low users to specify appropriate translations, giv- 
ing rise to preferences on outcomes of the transfer 
process. Work on this continues. 
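As a concrete illustration of how triple-based preferences can select between competing analyses, the following sketch scores each candidate analysis by the log-odds of its semantic (head, relationship, modifier) triples, estimated from a judged corpus. This is a simplified reconstruction, not the actual SLT code; the triples, counts and add-one smoothing scheme are invented for the example.

```python
from math import log

# Hypothetical counts of triples observed in correct vs. incorrect
# analyses of a judged training corpus (invented numbers).
GOOD_TRIPLES = {("flight", "from", "boston"): 41, ("show", "obj", "flight"): 77}
BAD_TRIPLES = {("show", "from", "boston"): 12}

def triple_score(triple, smoothing=1.0):
    """Log-odds that a triple belongs to a correct analysis (add-one smoothed)."""
    good = GOOD_TRIPLES.get(triple, 0) + smoothing
    bad = BAD_TRIPLES.get(triple, 0) + smoothing
    return log(good / bad)

def rank_analyses(analyses):
    """Pick the analysis whose semantic triples score highest in total."""
    return max(analyses, key=lambda triples: sum(triple_score(t) for t in triples))

# Competing analyses of "show me flights from Boston": does "from Boston"
# attach to "flights" (low attachment, correct) or to "show" (incorrect)?
low_attach = [("show", "obj", "flight"), ("flight", "from", "boston")]
high_attach = [("show", "obj", "flight"), ("show", "from", "boston")]
best = rank_analyses([high_attach, low_attach])
```

Since the corpus evidence favours the noun-phrase attachment, the low-attachment analysis wins.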
3 Robustness 
Robustness in the face of ill-formed input and 
recognition errors is tackled by means of a "multi- 
engine" strategy (Frederking and Nirenburg, 1994; 
Rayner and Carter, 1997), combining two differ- 
ent translation methods. The main translation 
method uses transfer at the level of QLF (Alshawi 
et al., 1991; Rayner and Bouillon, 1995); this is 
supplemented by a simpler, glossary-based trans- 
lation method. Processing is carried out bottom- 
up. Roughly speaking, the QLF transfer method 
is used to translate as much as possible of the in- 
put utterance, any remaining gaps being filled by 
application of the glossary-based method. 
In more detail, source-language parsing goes 
through successive stages of lexical (morpholog- 
ical) analysis, low-level phrasal parsing to iden- 
tify constituents such as simple noun phrases, and 
finally full sentential parsing using a version of 
the original grammar tuned to the domain using 
explanation-based learning (see Section 2 above). 
Parsing is carried out in a bottom-up mode. Af- 
ter each parsing stage, a corresponding translation 
operation takes place on the resulting constituent 
lattice. Translation is performed by using the 
glossary-based method at the early stages of pro- 
cessing, before parsing is initiated, and by using 
the QLF-transfer method during and after pars- 
ing. Each successful transfer attempt results in 
a target language string being added to a target- 
side lattice. Metrics are then applied to choose 
a path through this lattice. The criteria used to 
select the path involve preferences for sequences 
that have been encountered in a target-language 
corpus; for the use of more sophisticated trans- 
fer methods over less sophisticated; and for larger 
over smaller chunks. 
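The path-selection step can be sketched as dynamic programming over the target-side lattice. The scoring below is a hypothetical stand-in for the metrics described above, combining a bonus for corpus-attested target bigrams, a bonus for the deeper transfer method, and a preference for larger chunks; the weights, edges and bigram set are all invented for illustration.

```python
# An edge covers source positions [start, end) and carries target words
# produced by some method: deep "qlf" transfer or the shallower "glossary".
METHOD_BONUS = {"glossary": 0.0, "qlf": 2.0}
TARGET_BIGRAMS = {("pourriez-vous", "m'indiquer"), ("un", "vol")}

def edge_score(edge):
    start, end, words, method = edge
    ngram_bonus = sum(1.0 for big in zip(words, words[1:]) if big in TARGET_BIGRAMS)
    return METHOD_BONUS[method] + (end - start) + ngram_bonus

def best_path(edges, n):
    """Find the best-scoring sequence of edges spanning positions 0..n."""
    best = {0: (0.0, [])}  # position reached -> (score so far, chosen edges)
    for pos in range(1, n + 1):
        candidates = [
            (best[e[0]][0] + edge_score(e), best[e[0]][1] + [e])
            for e in edges
            if e[1] == pos and e[0] in best
        ]
        if candidates:
            best[pos] = max(candidates)
    return best[n][1]

# Translating "an early flight": three word-by-word glossary edges compete
# with one QLF-transfer edge covering the whole phrase.
edges = [
    (0, 1, ("un",), "glossary"),
    (1, 2, ("de bonne heure",), "glossary"),
    (2, 3, ("vol",), "glossary"),
    (0, 3, ("un", "vol", "de bonne heure"), "qlf"),
]
path = best_path(edges, 3)
```

The single QLF edge wins: it is larger, uses the deeper method, and contains an attested bigram.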
The bottom-up approach contributes to robust- 
ness in the obvious way: if a single analysis can- 
not be found for the whole utterance, then trans- 
lations can be produced for partial analyses that 
have already been found. It also contributes to 
system response in that the earlier, more local, 
shallower methods of analysis and transfer usu- 
ally operate very quickly to produce an attempt at 
translation. The target-language user may inter- 
rupt processing before the more global methods 
have finished if the translation (assuming it can 
be viewed on a screen) is adequate, or the sys- 
tem itself may abandon a sentence, and present 
its current best translation, if a specified time has 
elapsed. 
Figure 3 exemplifies the operation of the multi- 
engine strategy as well as of the preferences ap- 
plied to analysis and transfer.2 The N-best list
2 The example chosen was the most interesting of
the dozen or so in our most recent demonstration ses- 
sion, and the intermediate results have been repro- 
Lexicon entry, using the transitive verb macro, for "serve" as in "Does Continental serve Atlanta?":
ir(serve,v_subj_obj,serve_FlyTo).

Transfer rule relating that sense of "serve" to one sense of French "desservir":
trule([eng,fre],lex(simple),serve_FlyTo==desservir_ServeCity).

Figure 2: Simple, domain-dependent linguistic rules
delivered by the speech recognizer contains the 
sentence actually uttered, "Could you show me an 
early flight please?", but only in fourth position. 
• Before any linguistic processing is carried out, 
the word sequence at the top of the N-best list 
is the most preferred one, as only recognition 
preferences (shown by position in the list) are 
available. This sequence is translated word- 
for-word using the glossary method, giving 
result (a) in the figure.
• After lexical analysis, which effectively in- 
cludes part-of-speech tagging, it is deter- 
mined that the word "a" is unlikely to precede 
"are", and so "a" is dropped from the trans- 
lated sequence (b) - thus translating recog- 
nizer hypothesis 2, using the glossary-based 
method. 
• Phrasal parsing identifies "an early flight" as 
a likely noun phrase, so that this is for the 
first time selected for translation, in (c). Note 
that the system has now settled on the correct 
English word sequence. QLF-based transfer
is used for the first time, and the transfer 
rule in Figure 1 is used to translate "early" as 
"de bonne heure" which, because it is a PP, 
is placed after "vol" (flight) by the French 
grammar. 
• Finally, as shown in (d), an analysis and 
a QLF-based translation are found for the 
whole sentence, allowing the inadequate 
word-for-word translation of "could you show 
me" as "*pourriez vous montrez moi" to be 
improved to a more grammatical "pourriez- 
vous m'indiquer". 
We thus see the results of translation becoming 
steadily more accurate and comprehensible as pro- 
cessing proceeds. 
4 Multilinguality, interlinguas and
the "N-squared problem" 
While using an interlingual representation would 
seem to be the obvious way to avoid the "N- 
squared problem" (translating between N lan- 
guages involves order N² transfer pairs), we are
sceptical about interlinguas for the following rea- 
sons. 
duced from the system log file without any changes 
other than reformatting. 
Firstly, doing good translation is a mixture of 
two tasks: semantics (getting the meaning right) 
and collocation (getting the appearance of the 
translation right). Defining an interlingua, even 
if it is possible to do so for an increasing num- 
ber N of languages, really only addresses the first 
task. Interlingual representations also tend to be
less portable to new domains, since, if they
are to be truly interlingual, they normally need
to be based on domain concepts, which have to
be redefined for each new domain, a task that
involves considerable human intervention, much
of it at an expert level. In contrast, a transfer- 
based representation can be shallower (at the level
of linguistic predicates) while still abstracting far 
enough away from surface form to make most of 
the transfer rules simple atomic substitutions. 
Secondly, systems based on formal representa- 
tions are brittle: a fully interlingual system first 
needs to translate its input into a formal repre- 
sentation, and then realise the representation as a 
target-language string. An interlingual system is 
thus inherently more brittle than a transfer sys- 
tem, which can produce an output without ever 
identifying a "deep" formal representation of the 
input. For these reasons, we prefer to stay with a 
fundamentally transfer-based methodology; none 
the less, we include some aspects of the inter- 
lingual approach, by regularizing the intermedi- 
ate QLF representation to make it as language- 
independent as is consonant with the re-
quirement that it also be independent of domain. 
Regularizing the representation has the positive 
effect of making the transfer rules simpler (in the 
limiting case, a fully interlingual system, they be- 
come trivial). 
We tackle the N-squared problem by means of 
transfer composition (Rayner, Carter and Bouil- 
lon, 1996; Rayner, Carter et al, 1997). If we al- 
ready have transfer rules for mapping from lan- 
guage A to language B and from language B to 
language C, we can compose them to generate a 
set to translate directly from A to C. The first 
stage of this composition can be done automat- 
ically, and then the results can be manually ad- 
justed by adding new rules and by introducing 
declarations to disallow the creation of implausi- 
ble rules: these typically arise because the con- 
texts in which α ∈ A can correctly be translated
to β ∈ B are disjoint from those in which β can
be translated into γ ∈ C. As with the other cus-
N-best list (N=5) delivered by speech recognizer:
1 could you show me a are the flight please
2 could you show me are the flight please
3 could you show me in order a flight please
4 could you show me an early flight please
5 could you show meals are the flight please

(a) Selected input sequence and translation after surface phase:
could | you | show | me | a | are | the | flight | please
pourriez vous montrez moi un sont les vol s'il vous plait

(b) Selected input sequence and translation after lexical phase:
could | you | show | me | are | the | flight | please
pourriez vous montrez moi sont les vol s'il vous plait

(c) Selected input sequence and translation after phrasal phase:
could | you | show | me | an early flight | please
pourriez vous montrez moi un vol de bonne heure s'il vous plait

(d) Selected input sequence and translation after full parsing phase:
could you show me an early flight please
pourriez-vous m'indiquer un vol de bonne heure s'il vous plait

Figure 3: N-best list and translation results for "Could you show me an early flight please?"
tomization tasks described here, the amount of 
human intervention required to adjust a composed 
set of transfer rules is vastly less, and less special- 
ized, than what would be required to write them 
from scratch. 
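For the common case of simple atomic-to-atomic rules, transfer composition amounts to relational composition of the two rule sets, filtered by human-supplied blocking declarations. The sketch below is a simplified reconstruction under that assumption; the sense names and the blocked pair are invented for illustration.

```python
# Atomic transfer rules as word-sense mappings, one dict per language pair.
ENG_TO_FRE = {"serve_FlyTo": "desservir_ServeCity",
              "early_NotLate": "de_bonne_heure_Early"}
FRE_TO_SPA = {"desservir_ServeCity": "servir_ServeCity",
              "de_bonne_heure_Early": "temprano_Early"}

# Hypothetical declarations blocking compositions a human checker judged
# implausible (e.g. because the two pivot contexts do not overlap).
BLOCKED = {("early_NotLate", "temprano_Early")}

def compose(ab, bc, blocked=frozenset()):
    """Compose A->B and B->C rule sets into an A->C set, minus blocked pairs."""
    return {
        a: bc[b]
        for a, b in ab.items()
        if b in bc and (a, bc[b]) not in blocked
    }

ENG_TO_SPA = compose(ENG_TO_FRE, FRE_TO_SPA, BLOCKED)
```

Only the unblocked composition survives into the derived English to Spanish rule set; the blocked one would have to be replaced by a manually written rule.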
In the current version of SLT, transfer rules 
were written directly for neighbouring languages 
in the sequence Spanish - French - English - 
Swedish - Danish (most of these neighbours being 
relatively closely related), with other pairs being 
derived by transfer composition. Further details 
can be found in (Rayner, Carter et al, 1997). 
5 Evaluation of speech translation 
systems: methodological issues 
There is still no real consensus on how to evalu- 
ate speech translation systems. The most com- 
mon approach is some version of the following. 
The system is run on a set of previously unseen speech data; the results are stored in text form; 
someone judges them as acceptable or unaccept- 
able translations; and finally the system's perfor- 
mance is quoted as the proportion that are ac- 
ceptable. This is clearly much better than noth- 
ing, but still contains some serious methodological 
problems. In particular: 
1. There is poor agreement on what constitutes
an "acceptable translation". Some judges re- 
gard a translation as unacceptable if a single 
word-choice is suboptimal. At the other end 
of the scale, there are judges who will accept 
any translation which conveys the approxi- 
mate meaning of the sentence, irrespective of 
how many grammatical or stylistic mistakes 
it contains. Without specifying more closely 
what is meant by "acceptable", it is difficult 
to compare evaluations. 
2. Speech translation is normally an interactive 
process, and it is natural that it should be 
less than completely automatic. At a min- 
imum, it is clearly reasonable in many con- 
texts to feed back to the source-language user 
the words the recognizer believed it heard, 
and permit them to abort translation if recog- 
nition was unacceptably bad. Evaluation 
should take account of this possibility. 
3. Evaluating a speech-to-speech system as 
though it were a speech-to-text system intro- 
duces a certain measure of distortion. Speech 
and text are in some ways very different me- 
dia: a poorly translated sentence in writ- 
ten form can normally be re-examined sev- 
eral times if necessary, but a spoken utter- 
ance may only be heard once. In this re- 
spect, speech output places heavier demands 
on translation quality. On the other hand, it 
can also be the case that constructions which 
would be regarded as unacceptably sloppy in 
written text pass unnoticed in speech. 
We are in the process of redesigning our trans- 
lation evaluation methodology to take account of 
all of the above points. Currently, most of our 
empirical work still treats the system as though 
it produced text output; we describe this mode of 
evaluation in Section 5.1. A novel method which 
evaluates the system's actual spoken output is cur- 
rently undergoing initial testing, and is described 
in Section 5.2. Section 6 presents results of exper- 
iments using both evaluation methods. 
5.1 Evaluation of speech to text 
translation 
In speech-to-text mode, evaluation of the system's 
performance on a given utterance proceeds as fol- 
lows. The judge is first shown a text version of the 
correct source utterance (what the user actually 
said), followed by the selected recognition hypoth- 
esis (what the system thought the user said). The 
judge is then asked to decide whether the recog- 
nition hypothesis is acceptable. Judges are told 
to assume that they have the option of aborting 
translation if recognition is of insufficient quality; 
judging a recognition hypothesis as unacceptable 
corresponds to pushing the 'abort' button. 
When the judge has determined the acceptabil- 
ity of the recognition hypothesis, the text version 
of the translation is presented. (Note that it is 
not presented earlier, as this might bias the deci- 
sion about recognition acceptability.) The judge is 
now asked to classify the quality of the translation 
along a seven-point scale; the points on the scale 
have been chosen to reflect the distinctions judges 
most frequently have been observed to make in 
practice. When selecting the appropriate cate- 
gory, judges are instructed only to take into ac- 
count the actual spoken source utterance and the 
translation produced, and ignore the recognition 
hypothesis. The possible judgement categories are 
the following; the headings are those used in Ta- 
bles 1 and 2 below. 
Fully acceptable. Fully acceptable translation. 
Unnatural style. Fully acceptable, except that 
style is not completely natural. This is most 
commonly due to over-literal translation. 
Minor syntactic errors. One or two minor 
syntactic or word-choice errors, otherwise ac- 
ceptable. Typical examples are bad choices 
of determiners or prepositions. 
Major syntactic errors. At least one major or 
several minor syntactic or word-choice er- 
rors, but the sense of the utterance is pre- 
served. The most common example is an er- 
ror in word-order produced when the system 
is forced to back up to the robust translation 
method. 
Partial translation. At least half of the utter- 
ance has been acceptably translated, and the 
rest is nonsense. A typical example is when 
most of the utterance has been correctly rec- 
ognized and translated, but there is a short 
'false start' at the beginning which has re- 
sulted in a word or two of junk at the start 
of the translation. 
Nonsense. The translation makes no sense. The 
most common reason is gross misrecognition, 
but translation problems can sometimes be 
the cause as well. 
Bad translation. The translation makes some 
sense, but fails to convey the sense of the 
source utterance. The most common reason 
is again a serious recognition error. 
Results are presented by simply counting the 
number of translations in a run which fall into 
each category. By taking account of the "unac- 
ceptable hypothesis" judgements, it is possible to 
evaluate the performance of the system either in 
a fully automatic mode, or in a mode where the 
source-language user has the option of aborting 
misrecognized utterances. 
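The scoring of a run can be sketched as a simple tally over per-utterance judgements. This is an illustrative reconstruction, not the actual evaluation scripts; the judgement data are invented, and the category names follow the headings above.

```python
from collections import Counter

# One (recognition_acceptable, translation_category) judgement per
# utterance; the values here are invented for illustration.
JUDGEMENTS = [
    (True, "Fully acceptable"),
    (True, "Minor syntactic errors"),
    (False, "Nonsense"),
    (True, "Unnatural style"),
    (False, "Bad translation"),
]

def tally(judgements, abort_button=False):
    """Count translations per category. With the simulated 'abort' button,
    utterances whose recognition was judged unacceptable are discarded,
    since the source-language user would have aborted them."""
    kept = [cat for ok, cat in judgements if ok or not abort_button]
    return Counter(kept)

automatic = tally(JUDGEMENTS)                      # fully automatic mode
with_abort = tally(JUDGEMENTS, abort_button=True)  # simulated 'abort' button
```

On this toy data the abort button removes both failures, since both stem from recognition errors; in practice some bad translations survive because recognition itself was acceptable.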
5.2 Evaluation of speech to speech 
translation 
Our intuitive impression, based on many eval- 
uation runs in several different language-pairs, 
is that the "fine-grained" style of speech-to- 
text evaluation described in the preceding sec- 
tion gives a much more informative picture of 
the system's performance than the simple accept- 
able/unacceptable dichotomy. However, it raises 
an obvious question: how important, in objec- 
tive terms, are the distinctions drawn by the fine- 
grained scale? The preliminary work we now go 
on to describe attempts to provide an empirically 
justifiable answer, in terms of the relationship be- 
tween translation quality and comprehensibility 
of output speech. Our goal, in other words, is 
to measure objectively the ability of subjects to 
understand the content of speech output. This 
must be the key criterion for evaluating a candi- 
date translation: if apparent deficiencies in syntax 
or word-choice fail to affect subject's ability to un- 
derstand content, then it is hard to say that they 
represent real loss of quality. 
The programme sketched above is difficult or, 
arguably, impossible to implement in a general 
setting. In a limited domain, however, it ap- 
pears quite feasible to construct a domain-specific 
form-based questionnaire designed to test a sub- 
ject's understanding of a given utterance. In the 
SLT system's current domain of air travel plan- 
ning (ATIS), a simple form containing about 20 
questions extracts enough content from most ut- 
terances that it can be used as a reliable measure 
of a subject's understanding. The assumption is 
that a normal domain utterance can be regarded 
as a database query involving a limited number 
of possible categories: in the ATIS domain, these 
are concepts like flight origin and destination, de- 
parture and arrival times, choice of airline, and 
so on. A detailed description of the evaluation 
method follows. 
The judging interface is structured as a hyper- 
text document that can be accessed through a. 
web-browser. Each utterance is represented by 
one web page. On entering the page for a given 
utterance, the judge first clicks a button that plays 
an audio file, and then fills in an HTML form de- 
scribing what they heard. Judges are allowed to 
start by writing down as much as they can of the 
utterance, so as to keep it clear in their memory
as they fill in the form. 
The form is divided into four major sections. 
The first deals with the linguistic form of the en- 
quiry, for example, whether it is a command (im- 
perative), a yes/no-question or a wh-question. In 
the second section the judge is asked to write down 
the principal "object" of the utterance. For exam-
ple, in the utterance "Show flights from Boston to 
Atlanta", the principal object would be "flights". 
The third section lists some 15 constraints on the 
object explicitly mentioned in the enquiry, like 
"...one-way from New York to Boston on Sun- 
day". Initial testing proved that these three sec- 
tions covered the form and content of most en- 
quiries within the domain, but to account for un- 
foreseen material the judge is also presented with 
a "miscellaneous" category. Depending on the 
character of the options, form entries are either 
multiple-choice or free-text. All form entries may 
be negated ("No stopovers") and disjunctive en- 
quiries are indicated by dint of indexing ("Delta 
on Thursday or American on Friday"). When the 
page is exited, the contents of the completed form 
are stored for further use. 
Each translated utterance is judged in three ver- 
sions, by different judges. The first two versions 
are the source and target speech files; the third 
time, the form is filled in from the text version
of the source utterance. (The judging tool allows 
a mode in which the text version is displayed in- 
stead of an audio file being played.) The intention 
is that the source text version of the utterance 
should act as a baseline with which the source and 
target speech versions can respectively be com- 
pared. Comparison is carried out by a fourth 
judge. Here, the contents of the form entries for 
two versions of the utterance are compared. The 
judge has to decide whether the contents of each 
field in the form are compatible between the two 
versions. 
When the forms for two versions of an utterance 
have been filled in and compared, the results can 
be examined for comprehensibility in terms of the 
standard notions of precision and recall. We say 
that the recall of version 2 of the utterance with 
respect to version 1 is the proportion of the fields
filled in in version 1 that are filled in compatibly in
version 2. Conversely, the precision is the propor-
tion of the fields filled in in version 2 that are filled
in compatibly in version 1.
The recall and precision scores together define 
a two-element vector which we will call the com- 
prehensibility of version 2 with respect to version 
1. We can now define C_source to be the compre-
hensibility of the source speech with respect to
the source text, and C_target to be the comprehen-
sibility of the target speech with respect to the
source text. Finally, we define the quality of the
translation to be 1 - (C_source - C_target), where
C_source - C_target can in a natural way be inter-
preted as the extent to which comprehensibility
has degraded as a result of the translation process.
At the end of the following section, we describe an 
experiment in which we use this measure to eval- 
uate the quality of translation in the English →
French version of SLT. 
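The definitions above can be made concrete in a few lines. In this sketch a filled-in form is a dict of field values, and field compatibility is simplified to exact equality (the real judging step is a human decision); the forms themselves are invented.

```python
def comprehensibility(reference, version):
    """(recall, precision) of `version`'s filled form fields against `reference`.
    Compatibility is simplified here to exact equality of field values."""
    compatible = [f for f in reference if f in version and version[f] == reference[f]]
    recall = len(compatible) / len(reference) if reference else 1.0
    precision = len(compatible) / len(version) if version else 1.0
    return recall, precision

def quality(c_source, c_target):
    """1 - (C_source - C_target), applied component-wise to (recall, precision)."""
    return tuple(1.0 - (s - t) for s, t in zip(c_source, c_target))

# Invented forms for "one-way flights from Boston to Atlanta on Sunday".
source_text = {"object": "flights", "origin": "boston", "dest": "atlanta",
               "day": "sunday", "fare": "one-way"}
source_speech = dict(source_text)            # source speech fully understood
target_speech = {"object": "flights", "origin": "boston", "dest": "atlanta",
                 "day": "sunday"}            # judge missed the fare constraint

c_source = comprehensibility(source_text, source_speech)
c_target = comprehensibility(source_text, target_speech)
q = quality(c_source, c_target)
```

Here the target speech loses one of five fields, so recall drops to 0.8 while precision stays at 1.0, giving a quality vector of (0.8, 1.0).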
6 An evaluation of the Spoken 
Language Translator 
We begin by presenting the results of tests run in 
speech-to-text mode on versions of the SLT system 
developed for six different language-pairs: English
→ Swedish, English → French, Swedish → En-
glish, Swedish → French, Swedish → Danish, and
English → Danish. Before going any further, it
must be stressed that the various versions of the 
system differ in important ways; some language- 
pairs are intrinsically much easier than others, and 
some versions of the system have received far more 
effort than others. 
In terms of difficulty, Swedish → Danish is
clearly the easiest language-pair, and Swedish →
French is clearly the hardest. English → French is
easier than Swedish → French, but substantially
more difficult than any of the others. English →
Swedish, Swedish → English and English → Dan-
ish are all of comparable difficulty. We present 
approximate figures for the amounts of effort de- 
voted to each language pair in conjunction with 
the other results. 
We evaluated performance on each language-
pair in the manner described in Section 5.1 above,
taking as input two sets of 200 recorded speech
utterances each (one for English and one for
Swedish) which had not previously been used for
system development. Judging was done by sub-
jects who had not participated in system develop-
ment, were native speakers of the target language,
and were fluent in the source language. Results
are presented both for a fully automatic version
of the system (Table 1), and for a version with a
simulated 'abort' button (Table 2).
Finally, we turn to a preliminary experi-
ment which used the speech-to-speech evaluation
methodology from Section 5.2 above. A set of 200
previously unseen English utterances was trans-
lated by the system into French speech, using
the same kind of subjects as in the previous ex-
periments. Source-language and target-language
speech was synthesized using commercially avail-
able, state-of-the-art synthesizers (TrueTalk from
Entropic and CNETVOX from ELAN Informa-
tique, respectively). The subjects were only al-
lowed to hear each utterance once. The results
were evaluated in the manner described, to pro-
duce figures for comprehensibility of source and
target speech respectively. The figures are pre-
sented in Table 3; we expect to be able to present
a more detailed discussion of their significance by
the time of the workshop.
In summary, we have improved the standard 
evaluation method for speech translation by de- 
veloping a feasible alternative with a more fine- 
grained taxonomy of acceptability. In order to 
make the task of evaluation more realistic, we have 
also created a method in which instead of textual 
translations it is the spoken form that is judged. 
This method is currently in embryonic form, but 
the pilot experiment described here leads us to 
think that the method shows promise for further 
development. 
An interesting future task would be to in-
vestigate the significance of various kinds of
written-language translation errors in terms of re-
ducing comprehensibility of the spoken output.
This would amount to systematically comparing
C_target with results obtained in speech-to-text
evaluations, divided up according to error cate-
gories such as those in our taxonomy.
Acknowledgements 
The Danish-related work reported here was
funded by SRI International and Handels-
højskolen i København. Other work was funded
by Telia Research AB under the SLT-2 project.
We would like to thank Beata Forsmark, Nathalie
Kirchmeyer, Carin Lindberg, Thierry Reynier and
Jennifer Spenader for carrying out judging tasks.

