Interactive Speech Translation 
in the DIPLOMAT Project 
Robert Frederking, Alexander Rudnicky, and Christopher Hogan 
{ref, air, chogan}©cs, cmu. edu 
Language Technologies Institute 
Carnegie Mellon University 
Pittsburgh, PA 15213 
Abstract 
The DIPLOMAT rapid-deployment 
speech translation system is intended to 
allow naive users to communicate across 
a language barrier, without strong do- 
main restrictions, despite the error- 
prone nature of current speech and 
translation technologies. Achieving this 
ambitious goal depends in large part 
on allowing the users to interactively 
correct recognition and translation er- 
rors. We briefly present the Multi- 
Engine Machine Translation (MEMT) 
architecture, describing how it is well- 
suited for such an application. We then 
describe our incorporation of interac- 
tive error correction throughout the sys- 
tem design. We have already developed 
a working bidirectional Serbo-Croatian 
English system, and are currently de- 
veloping Haitian-Creole ~ English and 
Korean ~ English versions. 
1 Introduction 
The DIPLOMAT project is designed to explore 
the feasibility of creating rapid-deployment, wear- 
able bi-directional speech translation systems. By 
"rapid-deployment", we mean being able to de- 
velop an MT system that performs initial trans- 
lations at a useful level of quality between a new 
language and English within a matter of days or 
weeks, with continual, graceful improvement to 
a good level of quality over a period of months. 
The speech understanding component used is the 
SPHINX II HMM-based speaker-independent con- 
tinuous speech recognition system (Huang el al., 
1992; Ravishankar, 1996), with techniques for 
rapidly developing acoustic and language models 
for new languages (Rudnicky, 1995). The ma- 
chine translation (MT) technology is the Multi- 
Engine Machine Translation (MEMT) architec- 
ture (Frederking and Nirenburg, 1994), described 
further below. The speech synthes!s component is 
a newly-developed concatenative system (Lenzo, 
1997) based on variable-sized compositional units. 
This use of subword concatenation is especially 
important, since it is the only currently avail- 
able method for rapidly bringing up synthesis for 
a new language. DIPLOMAT thus involves re- 
search in MT, speech understanding and synthe- 
sis, interface design, as well as wearable computer 
systems. While beginning our investigations into 
new semi-automatic techniques for both speech 
and MT knowledge-base development, we have al- 
ready produced an initial bidirectional system for 
Serbo-Croatian ~ English speech translation in 
less than a month, and are currently developing 
Haitian-Creole ~ English and Korean ~ English 
systems. 
A major concern in the design of the 
DIPLOMAT system has been to cope with the 
error-prone nature of both current speech under- 
standing and MT technology, to produce an ap- 
plication that is usable by non-translators with a 
small amount of training. We attempt to achieve 
this primarily through user interaction: wherever 
feasible, the user is presented with intermediate 
results, and allowed to correct them. In this pa- 
per, we will briefly describe the machine trans- 
lation architecture used in DIPLOMAT (showing 
how it is well-suited for interactive user correc- 
tion), describe our approach to rapid-deployment 
speech recognition and then discuss our approach 
to interactive user correction of errors in the over- 
all system. 
2 Multi-Engine Machine 
Translation 
Different MT technologies exhibit different 
strengths and weaknesses. Technologies such as 
Knowledge-Based MT (KBMT) can provide high- 
quality, fully-automated translations in narrow, 
well-defined domains (Mitamura el al., 1991; Far- 
well and Wilks, 1991). Other technologies such as 
lexical-transfer MT (Nirenburg et al., 1995; Fred- 
erking and Brown, 1996; MacDonald, 1963), and 
Example-Based MT (EBMT) (Brown, 1996; Na- 
61 
gao, 1984: Sato and Nagao, 1990) provide lower- 
quality general-purpose translations, unless they 
are incorporated into human-assisted MT systems 
(Frederking et al., 1993; Melby, 1983), but can be 
used in non-domain-restricted translation applica- 
tions. Moreover, these technologies differ not just 
in the quality of their translations, and level of 
domain-dependence, but also along other dimen- 
sions, such as types of errors they make, required 
development time, cost of development, and abil- 
ity to easily make use of any available on-line 
corpora, such as electronic dictionaries or online 
bilingual parallel texts. 
The Multi-Engine Machine Translation 
(MEMT) architecture (Frederking and Nirenburg, 
1994) makes it possible to exploit the differences 
between MT technologies. As shown in Figure 1, 
MEMT feeds an input text to several MT engines 
in parallel, with each engine employing a differ- 
ent MT technology 1. Each engine attempts to 
translate the entire input text, segmenting each 
sentence in whatever manner is most appropri- 
ate for its technology, and putting the resulting 
translated output segments into a shared chart 
data structure (Kay, 1967; Winograd, 1983) af- 
ter giving each segment a score indicating the en- 
gine's internal assessment of the quality of the 
output segment. These output (target language) 
segments are indexed in the chart based on the 
positions of the corresponding input (source lan- 
guage) segments. Thus the chart contains multi- 
ple, possibly overlapping, alternative translations. 
Since the scores produced by the engines are esti- 
mates of variable accuracy, we use statistical lan- 
guage modelling techniques adapted from speech 
recognition research to select the best overall set 
of outputs (Brown and Frederking, 1995; Frederk- 
ing, 1994). These selection techniques attempt to 
produce the best overall result, taking the proba- 
bility of transitions between segments into account 
as well as modifying the quality scores of individ- 
ual segments. 
Differences in the development times and costs 
of different .technologies can be exploited to en- 
able MT systems to be rapidly deployed for new 
languages (Frederking and Brown, 1996). If par- 
allel corpora are available for a new language pair, 
the EBMT engine can provide translations for a 
new language in a matter of hours. Knowledge- 
bases for lexical-transfer MT can be developed in 
a matter of days or weeks; those for structural- 
transfer MT may take months or years. The 
higher-quality, higher-investment KBMT-style en- 
gine typically requires over a year to bring on- 
line. The use of the MEMT architecture allows 
the improvement of initial MT engines and the 
1 Morphological analysis, part-of-speech tagging, 
and possibly other text enhancements can be shared 
by the engines. 
addition of new engines to occur within an un- 
changing framework. The only change that the 
user sees is that the quality of translation im- 
proves over time. This allows interfaces to re- 
main stable, preventing any need for retraining 
of users, or redesign of inter-operating software. 
The EBMT and Lexical-Transfer-based MT trans- 
lation engines used in DIPLOMAT are described 
elsewhere (Frederking and Brown, 1996). 
For the purposes of this paper, the most impor- 
tant aspects of the MEMT architecture are: 
• the initially deployed versions are quite error- 
prone, although generally a correct translation 
is among the available choices, and 
• the unchosen alternative translations are still 
available in the chart structure after scoring by 
the target language model. 
3 Speech recognition for novel 
languages 
Contemporary speech recognition systems derive 
their power from corpus-based statistical model- 
ing, both at the acoustic and language levels. Sta- 
tistical modeling, of course, presupposes that suf- 
ficiently large corpora are available for training. 
It is in the nature of the DIPLOMAT system that 
such corpora, particularly acoustic ones, are not 
immediately available for processing. As for the 
MT component, the emphasis is on rapidly acquir- 
ing an initial capability in a novel language, then 
being able to incrementally improve performance 
as more data and time are available. We have 
adopted for the speech component a combination 
of approaches which, although they rely on partic- 
ipation by native informants, also make extensive 
use of pre-existing acoustic and text resources. 
Building a speech recognition system for a tar- 
get domain or language requires models at three 
levels (assuming that a basic processing infras- 
tructure for training and decoding is already in 
place): acoustic, lexical and language. 
We have explored two strategies for acoustic 
modeling. Assimilation makes use of existing 
acoustic models from a language that has a large 
phonetic overlap with the target language. This 
allows us to rapidly put a recognition capability 
in place and was the strategy used for our Serbo- 
Croatian ~ English system. We were able to 
achieve good recognition performance for vocabu- 
laries of up to 733 words using this technique. Of 
course, such overlaps cannot be relied upon and 
in any case will not produce recognition perfor- 
mance that approaches that possible with appro- 
priate training. Nevertheless it does suggest that 
useful recognition performance for a large set of 
languages can be achieved given a carefully chosen 
set of core languages that can serve as a source of 
62 
Source Target 
Language Language 
0 • # _ , Morphological !_.d | 
Analyzer 
Transfer-Based MT 
User Interface 
Example-Based MT Statistical Modeller 
Knowledge-Based "L............................__I", 
'"'~ Expanmon slot f ..... 
Figure 1: Structure of MEMT architecture 
acoustic models for a cluster of phonetically simi- 
lar languages. 
The selective collection approach presupposes 
a preparation interval prior to deployment and 
can be a follow-on to a system based on assim- 
ilation. This is being developed in the context 
of our Haitian-Creole and Korean systems. The 
goal is to carry out a limited acoustic data collec- 
tion effort using materials that have been explic- 
itly constructed to yield a rich phonetic sampling 
for the target language. We do this by first com- 
puting phonetic statistics for the language using 
available text materials, then designing a record- 
ing script that exhaustively samples all diphones 
observed in the available text sample. Such scripts 
run from several hundred to around a thousand 
utterances for the languages we have examined. 
While the effectiveness of this approach depends 
on the quality (and quantity) of the text sample 
that can be obtained, we believe it produces ap- 
propriate data for our modeling purposes. 
Lexical modeling is based on creating pronunci- 
ations from orthography and involves a variety of 
techniques familiar from speech synthesis, includ- 
ing letter-to-sound rules, phonological rules and 
exception lists. The goal of our lexical modeling 
approach is to create an acceptable-quality pro- 
nouncing dictionary that can be variously used 
for acoustic training, decoding and synthesis. We 
work with an informant to map out the pronun- 
ciation system for the target language and make 
use of supporting published information (though 
we have found such to be misleading on occasion). 
System vocabulary is derived from the text mate- 
rials assembled for acoustic modeling, as well as 
scenarios from the target domain (for example, 
interviews focussed on mine field mapping or in- 
telligence screening). 
Finally, due to the goals of our project, lan- 
guage modeling is necessarily based on small cor- 
pora. We make use of materials derived from do- 
main scenarios and from general sources such as 
newspapers (scanned and OCRed), text in the tar- 
get language available on the Internet and trans- 
lations of select documents. Due to the small 
amounts of readily available data (on the order of 
50k words for the languages we have worked with), 
standard language modeling tools are difficult to 
use, as they presuppose the availability of cor- 
pora that are several orders of magnitude larger. 
Nevertheless we have been successful in creating 
standard backoff trigram models from very small 
corpora. Our technique involves the use of high 
discounts and appears to provide useful constraint 
without corresponding fragility in the face of novel 
material. 
63 
In combination, these techniques allow us to 
create working recognition systems in very short 
periods of time and provide a path for evolution- 
ary improvement of recognition capability. They 
clearly are not of the quality that would be 
expected if conventional procedures were used, 
but nevertheless are sufficient for providing cross- 
language communication capability in limited- 
domain speech translation. 
4 User Interface Design 
As indicated above, our approach to coping with 
error-prone speech translation is to allow user cor- 
rection wherever feasible. While we would like as 
much user interaction as possible, it is also im- 
portant not to overwhelm the user with either 
information or decisions. This requires a careful 
balance, which we are trying to achieve through 
early user testing. We have carried out initial test- 
ing using local naive subjects (e.g., drama majors 
and construction workers), and intend to test with 
actual end users once specific ones are identified. 
The primary potential use for DIPLOMAT 
identified so far is to allow English-speaking sol- 
diers on peace-keeping missions to interview local 
residents. While one could conceivably train the 
interviewer to use a restricted vocabulary, the in- 
terviewee's responses are much more difficult to 
control or predict. An initial system has been 
developed to run on a pair of laptop comput- 
ers, with each speaker using a graphical user in- 
terface (GUI) on the laptop's screen (see Figure 
2). Feedback from initial demonstrations made it 
clear that, while we could expect the interviewer 
to have roughly eight hours of training, we needed 
to design the system to work with a totally naive 
interviewee, who had never used a computer be- 
fore. We responded to this requirement by de- 
veloping an asymmetric interface, where any nec- 
essary complex operations were moved to the in- 
terviewer's side. The interviewee's GUI is now 
extremely simple, and a touch screen has been 
added, so that the interviewee is not required to 
type or use the pointer. In addition, the inter- 
viewer's GUI controls the state of the interviewee's 
GUI. The speech recognition system continuously 
listens, thus the participants do not need to phys- 
ically indicate their intention of speaking. 
A typical exchange consists of recognizing 
the interviewer's spoken utterance, translating 
it to the target language, backtranslating it to 
English 2, then displaying and synthesizing the 
(possibly corrected) translation. The intervie- 
wee's response is recognized, translated to En- 
2We realize that backtranslation is also an error- 
prone process, but it at least provides some evidence 
as to whether the translation was correct to someone 
who does not speak the target language at all. 
glish, and backtranslated. The (possibly cor- 
rected) backtranslation is then shown to the inter- 
viewee for confirmation. The interviewer receives 
a graphic indication of whether the backtransla- 
tion was accepted or not. (The actual communi- 
cation process is quite flexible, but this is a normal 
scenario.) 
In order to achieve such communication, the 
users currently can interact with DIPLOMAT in 
the following ways: 
• Speech displayed as text: After any speech 
recognition step, the best overall hypothesis is 
displayed as text on the screen. The user can 
highlight an incorrect portion using the touch- 
screen, and respeak or type it. 
• Confirmation requests: After any speech 
recognition or machine translation step, the user 
is offered an accept/reject button to indicate 
whether this is "what they said". For MT, back- 
translations provide the user with an ability to 
judge whether they were interpreted correctly. 
• Interactive chart editing: As mentioned 
above, the MEMT technology produces as out- 
put a chart structure, similar to the word hy- 
pothesis lattices in speech systems. After any 
MT step, the interviewer is able to edit the 
best overall hypothesis for either the forward or 
backward translation using a popup-menu-based 
editor, as in our earlier Pangloss text MT system 
(Frederking et al., 1993). The editor allows the 
interviewer to easily view and select alternative 
translations for any segment of the translation. 
Editing the forward translation, causes an auto- 
matic reworking of the backtranslation. Editing 
the backtranslation allows the interviewer to rec- 
ognize correct forward translations despite errors 
in the backtranslation; if the backtranslation can 
be edited into correctness, the forward transla- 
tion was probably correct. 
Since a major goal of DIPLOMAT is rapid- 
deployment to new languages, the GUI uses the 
UNICODE multilingual character encoding stan- 
dard. This will not always suffice, however; a ma- 
jor challenge for handling Haitian-Creole is that 
55% of the Haitian population is illiterate. We 
will have to develop an all-speech version of the 
interviewee-side interface. As we have done with 
previous interface designs, we will carry out user 
tests early in its development to ascertain whether 
our intuitions on the usability of this version are 
correct. 
5 Conclusion 
We have presented here the DIPLOMAT speech 
translation system, with particular emphasis on 
the user interaction mechanisms employed to cope 
with error-prone speech and MT processes. We 
expect that, after additional tuning based on fur- 
ther informal user studies, an interviewer with 
eight hours of training should be able to use the 
64 
Figure 2: Screen Shot of User Interfaces: Interviewer (left) and Interviewee (right) 
DIPLOMAT system to successfully interview sub- 
jects with no training or previous computer expe- 
rience. We hope to have actual user trials of either 
the Serbo-Croatian or the Haitian-Creole system 
in the near future, possibly this summer. 

References 
Ralf Brown. 1996. Example-Based Machine 
Translation in the Pangloss System. In Pro- 
ceedings of the 16th International Conference 
on Computational Linguistics (COLING-96). 
Ralf Brown and Robert Frederking. 1995. Apply- 
ing Statistical English Language Modeling to 
Symbolic Machine Translation. In Proceedings 
of the Sixth International Conference on The- 
oretical and Methodological Issues in Machine 
Translation (TMI-95), pages 221-239. 
David Farwell and Yorick Wilks. 1991. Ultra: A 
Multi-lingual Machine Translator. In Proceed- 
ings of Machine Translation Summit IlI, Wash- 
ington, DC, July. 
Robert Frederking. 1994. Statistical Lan- 
guage Models for Symbolic MT. Presented at 
the Language Engineering on the Information 
Highway Workshop, Santorini, Greece, Septem- 
ber. Refereed. 
Robert Frederking, D. Grannes, P. Cousseau, and 
S. Nirenburg. 1993. An MAT Tool and Its Ef- 
fectiveness. In Proceedings of the DARPA Hu- 
man Language Technology Workshop, Prince- 
ton, NJ. 
Robert Frederking and Ralf Brown. 1996. The 
Pangloss-Lite Machine Translation System. In 
Proceedings of the Conference of the Associa- 
tion for Machine Translation in the Americas 
(AMTA). 
Robert Frederking and Sergei Nirenburg. 1994. 
Three Heads are Better than One. In Proceed- 
ings of the fourth Conference on Applied Natu- 
ral Language Processing (ANLP-94), Stuttgart, 
Germany. 
Xuedong Huang, Fileno Alleva, Hsiao-Wuen Hon, 
Mei-Yuh Hwang, Ronald Rosenfeld. 1992. The 
SPHINX-II Speech Recognition System: An 
Overview. Carnegie Mellon University Com- 
puter Science Technical Report CMU-CS-92- 
112. 
Martin Kay. 1967. Experiments with a powerful 
parser. In Proceedings of the 2nd International 
COLING, August. 
Kevin Lenzo. 1997. Personal Communication. 
R. R. MacDonald. 1963. General report 1952- 
1963 (Georgetown University Occasional Pa- 
pers in Machine Translation, no. 30), Washing- 
ton, DC. 
A. K. Melby. 1983. Computer-assisted translation 
systems: the standard design and a multi-level 
design. Conference on Applied Natural Lan- 
guage Processing, Santa Monica, February. 
Teruko Mitamura, Eric Nyberg, Jaime Carbonell. 
1991. Interlingua Translation System for Multi- 
Lingual Document Production. In Proceedings 
of Machine Translation Summit III, Washing- 
ton, DC, July. 
M. Nagao. 1984. A framework of a mechani- 
cal translation between Japanese and English 
by analogy principle. In: A. Elithorn and 
R. Banerji (eds.) Artificial and Human Intel- 
ligence. NATO Publications. 
Sergei Nirenburg. 1995. The Pangloss Mark 
III Machine Translation System. Joint Tech- 
nical Report, Computing Research Laboratory 
(New Mexico State University), Center for Ma- 
chine Translation (Carnegie Mellon University), 
Information Sciences Institute (University of 
Southern (~alifornia). Issued as CMU technical 
report CMU-CMT-95-145. 
Mosur Ravishankar. 1996. Efficient Algorithms 
for Speech Recognition. Ph.D. Thesis. Carnegie 
Mellon University. 
Alex Rudnieky. 1995. Language modeling with 
limited domain data. In Proceedings of the 
ARPA Workshop on Spoken Language Technol- 
ogy. San Mateo: Morgan Kaufmann, 66-69. 
S. Sato and M. Nagao. 1990. Towards memory 
based translation. In Proceedings of COLING- 
90, Helsinki, Finland. 
Terry Winograd. 1983. Language as a Cognitive 
Process. Volume 1: Syntax. Addison-Wesley. 
