A Multi-Perspective Evaluation of the NESPOLE!
Speech-to-Speech Translation System
Alon Lavie
Carnegie Mellon University, Pittsburgh, PA, USA
alavie@cs.cmu.edu
Florian Metze
University of Karlsruhe, Germany
Roldano Cattoni
ITC-irst, Trento, Italy
Erica Costantini
University of Trieste, Trieste, Italy
Abstract
Performance and usability of real-
world speech-to-speech translation sys-
tems, like the one developed within
the Nespole! project, are a ected by
several aspects that go beyond the
pure translation quality provided by
the underlying components of the sys-
tem. In this paper we describe these
aspects as perspectives along which
we have evaluated the Nespole! sys-
tem. Four main issues are investigated:
(1) assessing system performance un-
der various network tra c conditions;
(2) a study on the usage and utility of
multi-modality in the context of multi-
lingual communication; (3) a compar-
ison of the features of the individual
speech recognition engines, and (4) an
end-to-end evaluation of the system.
1 Introduction
Nespole!
1
is a speech-to-speech machine tran-
slation project designed to provide fully func-
tional speech-to-speech capabilities within real-
world settings of common users involved in e-
commerce applications. The project is a collab-
oration between three European research groups
1
Nespole! { NEgotiation through SPOken Lan-
guage in E-commerce. See the project web-site at
http://nespole.itc.it for further details.
(IRST in Trento, Italy; ISL at Universit¨at Karl-
sruhe (TH); and CLIPS at Universit eJoseph
Fourier in Grenoble, France), one US research
group (ISL at Carnegie Mellon University in
Pittsburgh, PA) and two industrial partners
(APT; Trento, Italy { the Trentino provincial
tourism board, and AETHRA; Ancona, Italy {
a tele-communications company). The project is
funded jointly by the European Commission and
the US NSF. Over the past two years, we have
developed a fully functional showcase of the Ne-
spole! system within the domain of travel and
tourism, and have signi cantly improved system
performance and usability based on a series of
studies and evaluations with real users. Our ex-
perience has shown that improving translation
quality is only one of several important issues
that must be addressed in achieving a practical
real-world speech-to-speech translation system.
This paper describes how we tackled these is-
sues and evaluates their e ect on system per-
formance and usability. We focus on four main
issues: (1) assessing system performance under
various network tra c conditions and architec-
tural con gurations; (2) a study on the usage
and utility of multi-modality in the context of
multi-lingual communication; (3) a comparison
of the features of the individual speech recogni-
tion engines, and (4) an end-to-end evaluation
of the demonstration system.
                                            Association for Computational Linguistics.
                         Algorithms and Systems, Philadelphia, July 2002, pp. 121-128.
                          Proceedings of the Workshop on Speech-to-Speech Translation:
2 The NESPOLE! System
The Nespole! system (Lazzari, 2001) uses a
client-server architecture to allow a common
user, who is initially browsing through the web
pages of a service provider on the Internet, to
connect seamlessly to a human agent of the ser-
vice provider who speaks another language, and
provides speech-to-speech translation service be-
tween the two parties. Standard commercially
available PC video-conferencing technology such
as Microsoft’s NetMeeting r© is used to connect
between the two parties in real-time.
In the  rst showcase which we describe in this
paper, the scenario is the following: a client
is browsing through the web-pages of APT {
the tourism bureau of the province of Trentino
in Italy { in search of tour-packages in the
Trentino region. If more detailed information
is desired, the client can click on a dedicated
\button" within the web-page in order to estab-
lish a video-conferencing connection to a human
agent located at APT. The client is then pre-
sented with an interface consisting primarily of
a standard video-conferencing application win-
dow and a shared whiteboard application. Us-
ing this interface, the client can carry on a con-
versation with the agent, where the Nespole!
server provides two-way speech-to-speech trans-
lation between the parties. In the current setup,
the agent speaks Italian, while the client can
speak English, French or German.
2.1 System Architecture
The Nespole! system architecture is shown
in Figure 1. A key component in the Ne-
spole! system is the \Mediator" module,
which is responsible for mediating the commu-
nication channel between the two parties as
well as interfacing with the appropriate Human
Language Technology (HLT) speech-translation
servers. The HLT servers provide the actual
speech recognition and translation capabilities.
This system design allows for a very flexible and
distributed architecture: Mediators and HLT-
servers can be run in various physical locations,
so that the optimal con guration, given the lo-
cations of the client and the agent and antic-
Figure 1: The Nespole! System Architecture
ipated network tra c, can be taken into ac-
count at any time. A well-de ned API allows
the HLT servers to communicate with each other
and with the Mediator, while the HLT modules
within the servers for the di erent languages are
implemented using very di erent software pack-
ages. Further details of the design principles of
the system are described in (Lavie et al., 2001).
The computationally intensive part of speech
recognition and translation is done on dedicated
server machines, whose nature and location is
of no concern to the user. A wide range of
client-machines, even portable devices or pub-
lic information kiosks, are therefore able to run
the client software, so that the service can be
made available nearly everywhere.
The system architecture shown in Figure 1
contains two di erent types of Internet connec-
tions with di erent characteristics. The connec-
tion between Client/Agent PCs and the Media-
tor is a standard video-conferencing connection
that uses H323 and UDP protocols. In cases
of insu cient network bandwidth, these proto-
cols compromise performance by allowing de-
layed or lost packets of data to be \dropped" on
the receiving side, in order to minimize delays
and ensure close to real-time performance. The
connection between the Mediator and the HLT
servers uses TCP over IP in order to achieve loss-
less communication between the Mediator and
the translation components. For practical rea-
sons, Mediator and HLT servers in our current
system usually run in separate and distant loca-
tions, which can introduce some additional time
delay. System response times in recent demon-
strations have been about three times real-time.
2.2 User interface
The user interface display is designed for
Windows r©and consists of four windows: (1)
aMicrosoftr©Internet Explorer web browser;
(2) a Microsoft r©Windows NetMeeting video-
conferencing application; (3) the AeWhite-
board; and (4) the Nespole Monitor. Using
Internet Explorer, the client initiates the au-
dio and video call with an agent of the service
provider, by a simple click of a button on the
browser page. Microsoft Windows Netmeeting is
automatically opened and the audio and video
connection is established. The two additional
displays { the AeWhiteboard and the Nespole
Monitor are also launched at the same time.
Client and agent can then proceed in carrying
out a dialogue with the help of the speech trans-
lation system. For a screen snapshot of these
four displays, see (Metze et al., 2002).
We found it important to visually present as-
pects of the speech-translation process to the
end users. This is accomplished via the Ne-
spole Monitor display. Three textual represen-
tations are displayed in clearly identi ed  elds:
(1) a transcript of their spoken input (the output
from the speech recognizer); (2) a paraphrase of
their input { the result of translating the recog-
nized input back into their own language; and
(3) the translated textual output of the utter-
ance spoken by the other party. These textual
representations provide the users with the capa-
bility to identify mis-translations and indicate
errors to the other party. A bad paraphrase is
often a good indicator of a signi cant error in
the translation process. When a mis-translation
is detected, the user can press a dedicated but-
ton that informs the other party to ignore the
translation being displayed, by highlighting the
textual translation in red on the monitor display
of the other party. The user can then repeat the
turn. The current system also allows the partic-
ipants to correct speech recognition and transla-
tion errors via keyboard input, a feature which
is very e ective when bandwidth limitations de-
grade the system performance.
3 Multi-Perspective Evaluations
Several di erent evaluation experiments have
been conducted, targeting di erent aspects of
our system: (1) the impact of network tra c
and the consequences of real packet-loss on sys-
tem performance; (2) the impact and usability of
multi-modality; (3) a comparison of the features
of the various speech recognition engines, devel-
oped independently for di erent languages with
di erent techniques; and (4) end-to-end perfor-
mance evaluations. The data used in the evalu-
ations is part of a database collected during the
project (Burger et al., 2001).
3.1 Network Tra c Impact
In our various user studies and demonstrations,
we have been forced to deal with the detrimental
e ects of network congestion on the transmission
of Voice-over-IP in our system. The critical net-
work paths are the H323 connections between
the Mediator and the client and agent, which
rely on the UDP protocol in order to guaran-
tee real-time, but potentially lossy, human-to-
human communication. This can potentially be
very detrimental to the performance of speech
recognizers (Metze et al., 2001). The commu-
nication between the Mediator and HLT servers
can, in principle, be within a local network, al-
though we currently run the HLT servers at the
sites of the developing partners. This introduces
time delays, but no packet loss, due to the use
of TCP, rather than the UDP used for the H323
connections.
To quantify the influence of UDP packet-loss
on system performance, we ran a number of tests
with German client installations in the USA
(CMU at Pittsburgh) and Germany (UKA at
Karlsruhe) calling a Mediator in Italy (IRST),
whichinturncontactedtheGermanHLTserver
located at UKA. The tests were conducted by
feeding a high-quality recording of the German
36
37
38
39
40
41
42
0 1 2 3 4 5 6
Word Error Rate (%)
Packet Loss (%)
ITA-US
ITA-GER
Figure 2: Influence of packet loss on word accuracy of
the German Nespole! recognizer
development test-set collected at the beginning
of the project into a computer set-up for a video-
conference, i.e. we replaced the microphone by
a DAT recorder (or a computer) playing a tape,
while leaving everything else as it would be for
sessions with real subjects. In particular, seg-
mentation was based on silence detection per-
formed automatically by NetMeeting. Each test
consisted of several dialogues, lasting about an
hour. These tests (a total of more than 16 hours)
were conducted at di erent times of the day on
di erent days of the week, in an attempt to in-
vestigate a wide as possible variety of real-life
network conditions.
We were able to run 16 complete tests, re-
sulting in an average word accuracy of 60.4%,
2
with single values in the 63% to 59% range for
packet-loss conditions between 0.1% and 5.2%.
The results of these tests are presented in graph-
ical from in Figure 2. On a couple of occasions
we experienced abnormally bad network condi-
tions for short periods of time. These led to a
breakdown of the Client-Mediator or Mediator-
HLT server link due to time-out conditions being
reached, or the inability to establish a connec-
tion at all. We were able, however, to record one
full test with 21.0% packet loss, which resulted
in a word accuracy of 50.3%. These dialogues
are very di cult to understand even for humans.
Our conclusion from the packet loss experi-
2
The word accuracy on the clean 16kHz recording is
71.2%.
ment is that our speech recognition engine is
relatively robust to packet loss rates of up to
5%, since there is no clear degradation in the
word accuracy of the recognizer as a function
of packet loss rate (in this range). This is very
good news, since our experience indicates that
packet loss rates of over 5% are quite rare un-
der normal network tra c conditions. For 20%
packet-loss, the increase in WER is signi cant,
but the degradation is less severe than that re-
ported in (Milner and Semnani, 2000) on syn-
thetic data. We suspect that this is due to the
non-random distribution of lost packets.
The tests described above were the  rst phase
of our research on the impact of network tra c
on system performance. We are currently in the
process of conducting several further experimen-
tal investigations concerning di erent conditions
in which the system may run:
Transmission of video in addition to audio
through the video-conferencing communi-
cation channel: in this case we expect a sub-
stantial increase in UDP packet-loss rates due
to audio and video competing for the network
bandwidth over the H323 connections. It is not
clear, however, how this competition takes place
in practice and what are the resulting repercus-
sions on the audio quality (and consequently on
the recognizers’ performance).
The use of low-bandwidth network con-
nections (such as standard 56Kbps
modems): This is the most common network
scenario for real client users using a home in-
stalled computer. We are currently exploring
how the bandwidth limitations in this setting
a ect audio quality and system usability. In low
bandwidth conditions, NetMeeting supports en-
coding the speech with the G.723 codec, which
can consume a much lower bandwidth (less
then 6.4Kbps) compared to the G.711 codec
(64Kbps), which we currently use in our system.
We are in the process of testing the G.723 codec
within our system. Preliminary results indicate
that the recognizers used in the Nespole! sys-
tem are quite robust with respect to this new
front-end processing.
3.2 Experiments on Multi-Modality
The nature of the e-commerce scenario and ap-
plication in which our system is situated re-
quires that speech-translation be well-integrated
with additional modalities of communication
and information exchange between the agent
and client. Signi cant e ort has been devoted
to this issue within the project. The main
multi-modal component in the current version
of our system is the AeWhiteboard { a special
whiteboard, which allows users to share maps
and web-pages. The functionalities provided
by the AeWhiteboard include: image loading,
free-hand drawing, area selecting, color choos-
ing, scrolling the image loaded, zooming the im-
age loaded, URL opening, and Nespole! Monitor
activation. The most important feature of the
whiteboard is that each gesture performed by a
user is mirrored on the whiteboard of the other
user. Both users communicate while viewing the
same images and annotated whiteboards.
Typically, the client asks for spatial informa-
tion regarding locations, distances, and naviga-
tion directions (e.g., how to get from a hotel
to the ski slopes). By using the whiteboard,
the agent can indicate the locations and draw
routes on the map, point at areas, select items,
draw connections between di erent locations us-
ing a mouse or an optical pen, and accompany
his/her gestures with verbal explanations. Sup-
porting such combined verbal and gesture in-
teractions has required modi cations and exten-
sions of both HLT modules and the IF.
During July 2001, we conducted a detailed
study to evaluate the e ect of multi-modality on
the communication e ectiveness and usability of
our system. The goals of the experiment were
to test: (1) whether multi-modality increases
the probability of successful interaction, espe-
cially when spatial information is the focus of
the communicative exchange; (2) whether multi-
modality helps reduce mis-communications and
disfluencies; and (3) whether multi-modality
supports a faster recovery from recognition and
translation errors. For these purposes, two ex-
perimental conditions were devised: a speech-
only condition (SO), involving multilingual com-
munication and the sharing of images; and a
multi-modal condition (MM), where users could
additionally convey spatial information by pen-
basedgesturesonsharedmaps.
The setting for the experiment was the sce-
nario described earlier, involving clients search-
ing for winter tour-package information in the
Trentino province. The client’s task was to se-
lect an appropriate resort location and hotel
within the speci ed constraints concerning the
relevant geographical area, the available bud-
get, etc. The agent’s task was to provide the
necessary information. Novice subjects, previ-
ously unfamiliar with the system and task were
recruited to play the role of the clients. Subjects
wore a head-mounted microphone, using it in a
push-to-talk mode, and drew gestures on maps
by means of a table-pen device or a mouse. Each
subject could only hear the translated speech of
the other party (original audio was disabled in
this experiment). 28 dialogues were collected,
with 14 dialogues each for English and for Ger-
man clients, and Italian agents in all cases. Each
group contained 7 SO and 7 MM dialogues. The
dialogue transcriptions include: orthographical
transcription, annotations for spontaneous phe-
nomena and disfluencies, turn information and
annotations for gestures. Translated turns were
classi ed into successful, partially successful and
unsuccessful by comparing the translated turns
with the responses they generated. Repeated
turns were also counted.
The average duration of dialogues was 35 min-
utes (35.8 for SO and 35.5 for MM). On aver-
age, a dialogue contained 35 turns, 247 tokens
and 97 token types per speaker. Average val-
ues and variance of all measures are very similar
for agents and clients and across conditions and
Languages. ANOVA tests (p=0.05) on the num-
ber of turns and the number of spontaneous phe-
nomena and disfluencies, agents and customers
separately, did not produce any evidence that
modality or language a ected these variables.
Hence the spoken input is homogeneous across
groups. Details on the experimental database
collected and the various statistical analyses per-
formed appear in (Costantini et al., 2002). The
analysis of the results indicated that both the
SO and MM versions of the system were e ec-
tive for goal completion: 86% of the users were
able to complete the task’s goal by choosing a
hotel meeting the pre-speci ed budget and loca-
tion constraints.
In the MM dialogues, there were 7.6 gestures
per dialogue on average. The agents performed
almost all gestures (98%), with a clear prefer-
ence for area selections (61% of total gestures).
Most gestures (79%) followed a dialogue con-
tribution; none of the gestures were performed
during speech. Overall, few or no deictics were
used. We believe that these  ndings are related
to the push-to-talk procedure and to the time
needed to transfer gestures across the network:
agents often preceded gestures with appropriate
verbal cues e.g., \I’ll show you the hotel on the
map", in order to notify the other party of an
upcoming gesture. These verbal cues indicate
that gestures were well integrated in the com-
munication.
We found signi cant di erences between the
SO and MM dialogues in terms of unsuccessful
and repeated turns, particularly so in the spatial
segments of the dialogues. In the English-Italian
dialogues the MM dialogues contained 19% un-
successful turns versus 30% for the SO dialogues.
For German-Italian dialogues we found 18% in
MM versus 31% in SO. English-Italian MM dia-
logues contained 11% repeated turns versus 17%
for SO. For German-Italian dialogues repeated
turns amounted to 18% for MM versus 23% for
SO. In addition we found smoother dialogues
under MM condition, with fewer returns to al-
ready discussed topics for MM (one return every
19 turns in SO versus one return every 31 turns
in MM). MM also exhibited a lower number of
dialogue segments containing identi able misun-
derstandings between the two parties (one such
segment in each of 3 of the MM dialogues, ver-
sus a total of seven such segments in the SO dia-
logues { one dialogue with 3 segments, one with
two, and a third with a single segment of mis-
communication). Furthermore, the misunder-
standings in MM conditions were often immedi-
ately solved by resorting to MM resources, while
in case of SO ambiguous or mis-understood sub-
dialogues often remained unresolved. Finally,
the experiment subjects, given the choice be-
tween the MM and the SO system, expressed
a clear preference for the former. In summary,
we found strong supporting evidence that mul-
timodality has a positive e ect on the quality
of interaction by reducing ambiguity, making it
easier to resolve ambiguous utterances and to re-
cover from system errors, improving the flow of
the dialogue, and enhancing the mutual compre-
hension between the parties, in particular when
spatial information is involved.
3.3 Features of Automatic Speech
Recognition Engines
The Speech Recognition modules of the Nespo-
le! system were developed separately at the dif-
ferent participating sites, using di erent toolk-
its, but communicate with the Mediator using
a standardized interface. The French and Ger-
man ASR modules are described in more detail
in (Vaufreydaz et al., 2001; Metze et al., 2001).
The German engine was derived from the UKA
recognizer developed for the German Verbmobil
Task (Soltau et al., 2001).
All systems were derived from existing
LVCSR recognizers and adapted to the Nespo-
le! task using less than 2 hours of adaptation
data. This data was collected during an initial
user-study, in which clients from all countries
communicatedwithanAPTagentfluentintheir
mother tongue through the Nespole! system,
but without recognition and translation compo-
nents in place. Segmentation of input speech is
done based on automatic silence detection per-
formed by NetMeeting at the site of the origi-
nating audio. The audio is encoded according
to the G.711 standard at a sampling frequency
of 8kHz. The characteristics of the di erent rec-
ognizers are summarized in Table 1. The word
accuracy rates of the recognizers are presented
in Section 3.4.
3.4 End-to-End System Evaluation
In December 2001, we conducted a large scale
multi-lingual end-to-end translation evaluation
of the Nespole!  rst-showcase system. For
each of the three language pairs (English-Italian,
German-Italian and French-Italian), four previ-
English French German Italian
Vocabulary size 8,000 20,000 12,000 4,000
OOV rate 0.3% <1% 3.0%
LM training Verbmobil (E), C-Star Internet Verbmobil (D) C-Star
Data 550k words 1,500M words 500k words 100k words
+ adaptation Nespole none Nespole Nespole
Perplexity 33 98 150
Microphone type head-set head-set table-top head-set
Speaking style spontaneous read spontaneous read
Ac. training 16kHz G711 recoded 16kHz G711 recoded
Data 90h 12h 65h Verbmobil-II 11h C-Star
+ adaptation Up-sampling of G711 MLLR 80min. + FSA
Real-time factor 2.5, 1GHz P-III 1.1, 1GHz P-III 1.8, 650Mhz P-III
Memory consumption 280Mb 200Mb 100Mb 100Mb
WER on clean data 19.9% 28% 29.8% 31.5%
Table 1: Features of the Speech Recognition Engines
Language WARs SR Graded (% Acc)
English 61.9% 66.0%
German 63.5% 68.0%
French 71.2% 65.0%
Italian 76.5% 70.6%
Table 2: Speech Recognition Word Accuracy Rates and
Results of Human Grading (Percent Acceptable) of Recog-
nition Output as a Paraphrase
Language Transcribed Speech Rec.
English-to-English 58% 45%
German-to-German 46% 40%
French-to-French 54% 41%
Italian-to-Italian 61% 48%
Table 3: Monolingual End-to-End Translation Results
(Percent Acceptable) on Transcribed and Speech Recog-
nized Input
ously unseen test dialogues were used to evaluate
the performance of the translation system. The
dialogues included two scenarios: one covering
winter ski vacations, the other about summer
resorts. One or two of the dialogues for each lan-
guage contained multi-modal expressions. The
test data included a mixture of dialogues that
were collected mono-lingually prior to system
development (both client and agent spoke the
same language), and data collected bilingually
(during the July 2001 MM experiment), using
the actual translation system. This mixture of
data conditions was intended primarily for com-
prehensiveness and not for comparison of the dif-
ferent conditions.
We performed an extensive suite of evalua-
Language Transcribed Speech Rec.
English-to-Italian 55% 43%
German-to-Italian 32% 27%
French-to-Italian 44% 34%
Italian-to-English 47% 37%
Italian-to-German 47% 31%
Italian-to-French 40% 27%
Table 4: Cross-lingual End-to-End Translation Results
(Percent Acceptable) on Transcribed and Speech Recog-
nized Input
tions on the above data. The evaluations were
all end-to-end, from input to output, not as-
sessing individual modules or components. We
performed both mono-lingual evaluation (where
generated output language was the same as the
input language), as well as cross-lingual evalu-
ation. For cross-lingual evaluations, translation
from English German and French to Italian was
evaluated on client utterances, and translation
from Italian to each of the three languages was
evaluated on agent utterances. We evaluated on
both manually transcribed input as well as on
actual speech-recognition of the original audio.
We also graded the speech recognized output as
a \paraphrase" of the transcriptions, to measure
the levels of semantic loss of information due
to recognition errors. Speech recognition word
accuracies and the results of speech graded as
a paraphrase appear in Table 2. Translations
were graded by multiple human graders at the
level of Semantic Dialogue Units (SDUs). For
each data set, one grader  rst manually seg-
mented each utterance into SDUs. All graders
then used this segmentation in order to assign
scores for each SDU present in the utterance.
We followed the three-point grading scheme pre-
viously developed for the C-STAR consortium,
as described in (Levin et al., 2000). Each SDU is
graded as either \Perfect" (meaning translated
correctly and output is fluent), \OK" (meaning
is translated reasonably correct but output may
be disfluent), or \Bad" (meaning not properly
translated). We calculate the percent of SDUs
that are graded with each of the above cate-
gories. \Perfect" and \OK" percentages are also
summed together into a category of \Accept-
able" translations. Average percentages are cal-
culated for each dialogue, each grader, and sep-
arately for client and agent utterances. We then
calculated combined averages for all graders and
for all dialogues for each language pair.
Table 3 shows the results of the monolingual
end-to-end translation for the four languages,
and Table 4 shows the results of the cross-
lingual evaluations. The results indicate accept-
able translations in the range of 27{43% of SDUs
(interlingua units) with speech recognized in-
puts. While this level of translation accuracy
cannot be considered impressive, our user stud-
ies and system demonstrations indicate that it is
already su cient for achieving e ective commu-
nication with real users. We expect performance
levels to reach a range of 60{70% within the next
year of the project.
Acknowledgements
Additional Authors: S. Burger, D. Gates, C.
Langley, K. Laskowski, L. Levin, K. Peterson, T.
Schultz, A. Waibel, D. Wallace, Carnegie Mel-
lon University; J. McDonough, H. Soltau, Uni-
versity of Karlsruhe, Germany; G. Lazzari, N.
Manna, F. Pianesi, E. Pianta, ITC-irst, Trento,
Italy; L. Besacier, H. Blanchon, D. Vaufreydaz,
Universit e Joseph Fourier, Grenoble, France; L.
Taddei, AETHRA, Ancona, Italy.
This work was supported by NSF Grant
9982227 and EU Grant IST 1999-11562 as part
of the joint EU/NSF MLIAM research initiative.

References
Susanne Burger, Laurent Besacier, Paolo Coletti,
Florian Metze, and Celine Morel. 2001. The NE-
SPOLE! VoIP Dialogue Database. In Proc. Eu-
roSpeech 2001, Aalborg, Denmark. ISCA.
Erica Costantini, Susanne Burger, and Fabio Pianesi.
2002. Nespole!’s multilingual and multimodal cor-
pus. In Proceedings of the Third International
Conference on Language Resources and Evalua-
tion (LREC-2002), Grand Canary Island, Spain,
June. To appear.
Alon Lavie, Fabio Pianesi, and al. 2001. Architec-
ture and Design Considerations in NESPOLE!: a
Speech Translation System for E-Commerce Ap-
plications. In Proc. of the HLT2001, San Diego,
CA. ACM.
Gianni Lazzari. 2001. Spoken translation: chal-
lenges and opportunities. In Proc. ICSLP 2001,
Beijing, China, 10.
Lori Levin, Donna Gates, Fabio Pianesi, Donna Wal-
lace, Takeshi Watanabe, and Monika Woszczyna.
2000. Evaluation of a Practical Interlingua for
Task-Oriented Dialogues. In Proceedings NAACL-
2000 Workshop On Interlinguas and Interlingual
Approaches, Seattle, WA. AMTA.
Florian Metze, John McDonough, and Hagen Soltau.
2001. Speech Recognition over NetMeeting Con-
nections. In Proc. EuroSpeech 2001,Aalborg,
Denmark. ISCA.
Florian Metze, John McDonough, Hagen Soltau,
Alex Waibel, Alon Lavie, Susan Burger, Chad
Langley, Kornel Laskowski, Lori Levin, Tanja
Schultz, Fabio Pianesi, Roldano Cattoni, Gianni
Lazzari, Nadia Mana, Emanuele Pianta, Laurent
Besacier, Herve Blanchon, Dominique Vaufreydaz,
and Loredana Taddei. 2002. The NESPOLE!
Speech-to-Speech Translation System. In Proc.
HLT 2002, San Diego, CA, 3.
Ben Milner and Sharam Semnani. 2000. Robust
Speech Recognition over IP Networks. In Pro-
ceedings of International Conference on Acoustics
Speech and Signal Processing (ICASSP-00),Istan-
bul, Turkey, June.
Hagen Soltau, Thomas Schaaf, Florian Metze, and
Alex Waibel. 2001. The ISL Evaluation System
for Verbmobil - II. In Proc. ICASSP 2001,Salt
LakeCity,USA,5.
D. Vaufreydaz, L. Besacier, C. Bergamini, and
R. Lamy. 2001. presented at ISCA ITRW Work-
shop on Adaptation Methods for Speech Recogni-
tion, August. Sophia-Antipolis, France.
