Spoken Language Translation with the ITSVox System 
Eric Wehrli 
LATL - University of Geneva 
wehrli@latl, unige.ch 
1 Introduction 
This paper describes the ITSVox speech-to-speech 
translation prototype currently under development 
at LATL in collaboration with IDIAP. The ITSVox 
project aims at a general, interactive, multimodal 
translation system with the following characterics : 
(i) it is not restricted to a particular subdomain, (ii) 
it can be used either as a fully automatic system 
or as an interactive system, (iii I it can translate ei- 
ther written or spoken inputs into either written or 
spoken outputs, and (iv) it is speaker independent. 
ITSVox is currently restricted to (some subsets of) 
French , English in the speech-to-speech mode, 
French , , English in the written mode with speech 
output. We will first give a quick description of the 
system and then discuss in more details the speech 
modules and their interfaces with other components 
of the system. 
2 Architecture of ITSVox 
The ITSVox system consists (i) of a signal processing 
module based on the standard N-best approach (cf. 
Jimenez et al., 1995 I, (ii I a robust GB-based parser, 
(iii I a transfer-based translation module, and (iv) a 
speech synthesis module. 
To sketch the translation process and the interac- 
tion of the four components, consider the following 
example. 
(1) Je voudrais une chambre avec douche et avec 
rue sur le jardin. 
'I would like a room with shower and with 
view on the garden' 
The spoken input is first processed by the HMM- 
based signal processing component, which produces 
a word lattice, which is then mapped into ranked 
strings of phonetic words. 
For simplicity, let us consider only the highest can- 
didate in the list, which might (ideally) be something 
like (2), where / stands for voiceless fricatives and 0 
for schwas. 
(2) j vudr  iin /gbr av k du/ av k vii sir lO 
jardi 
Those word hypotheses constitute the input for 
the linguistic component. A lexical lookup using 
the phonetic trie representation described in the next 
section will produce a lexical chart. Applying linguis- 
tic constraints, the parser will try to disambiguate 
these words to produce a set of ranked GB-style en- 
riched surface structures as illustrated in (3). 
(3) \[TI" \[Or' je\] voudrais \[DP une chambre 
\[ConiV \[el" avec douche\] et \[Pv avec vue \[Pv 
sur le jardin\]\]\]\]\] 
The best analysis in the automatic mode -- or the 
analysis chosen by the user in the interactive mode 
96 
Jean-Luc Cochard 
IDIAP, Martigny 
J can- Luch.Cochard@idiap.ch 
-- undergoes lexical transfer and then a generation 
process (involving transformations and morphology) 
which produces target language GB-style enriched 
surface structures, as displayed in (41 . These struc- 
tures serve as input either to the orthographic dis- 
play component or to the speech synthesis compo- 
nent. In the case of English output, most of the 
speech synthesis work relies on the DeeTalk system, 
although linguistic structures help to disambiguate 
non-homophonous homographs (read, lead, record, 
wind, etc.). The French speech output uses the M- 
BROLA synthesizer developed by T. Dutoit, at the 
University of Mons. 
(4) \['r~" \[DP I\] would like It>,, a room \[con iP \[vp 
with shower\] and \[vl, with view \[pp on the 
garden\]\]\]\]\] 
Several of the components used by ITSVox have 
been described elsewhere. For instance, the transla- 
tion engine is based on the ITS-2 interactive mod- 
el (cf. Wehrli, 1996). The GB-parser (French and 
English) have been discussed in cf. Laenzlinger & 
Wehrli, 1991, Wehrli, 1992. As for the French speech 
synthesis system, it is described in Gaudinat and 
Wehrli (1997). 
2.1 The phonetic trie 
The phonetic lexicon is organized as atrie struc- 
ture (Knuth, 19731, that is a tree structure in which 
nodes correspond to phonemes and subtrees to pos- 
sible continuations. Each terminal node specifies one 
or more lexical entries in the lexical database. For in- 
stance, the phonetic sequence \[sa\] leads to a terminal 
node in the trie connected to the lexical entries cor- 
responding (i I to the feminine possessive determiner 
sa (her), and (ii) to the demonstrative pronoun ~a 
(that). 
With such a structure, words are recognized one 
phoneme at a time. Each time the system reaches a 
terminal node, it has recognized a lexical unit, which 
is inserted into a chart (oriented graph), which serves 
as data structure for the syntactic parsing. 
2.2 Interaction 
ITSVox is interactive in the sense that it can request 
on-line information from the user. Typically, inter- 
action takes the form of clarification dialogues. Fur- 
thermore, all interactions are conducted in source 
language only, which means that target knowledge 
is not a prerequisite for users of ITSVox. User con- 
sultation can occur at several levels of the translation 
process. First, at the lexicographie level, if an input 
sentence contains unknown words. In such cases, the 
system opens an editing window with the input sen- 
tence and asks the user to correct or modify the sen- 
tence. 
At the syntactic level, interaction occurs when the 
parser faces difficult ambiguities, for instance when 
the resolution of an ambiguity depends on contex- 
tual or extra-linguistic knowledge, as in the case of 
some prepositional phrase attachments or coordina- 
tion structures. By far, the most frequent cases of 
interaction occur during transfer, to a large exten- 
t due to the fact that lexical correspondences are 
all too often of the many-to-many variety, even at 
the abstract level of lexemes. It is also at this level 
that our decision to restrict dialogues to the source 
language is the most chaLlenging. While some cases 
of polysemy can be disambiguated relatively easily 
for instance on the basis of a gender distinction in 
the source sentence, as in (5), other cases such as 
the (much simplified) one in (6) are obviously much 
harder to handle, unless additional information is in- 
cluded in the bilingual dictionary. 
(5)a. Jean regarde les voiles. 
'Jean is looking at the sails/veils' 
b. masculin (le voile) 
fdminin (la voile) 
(6)a. Jean n'aime pass les avocats. 
'Jean doesn't like lawyers/advocadoes' 
b. avocats: 
homme de loi (la~lter) 
fruit (.b.uit) 
Another common case of interaction that occurs 
during transfer concerns the interpretation of pro- 
nouns, or rather the determination of their an- 
tecedent. In an sentence such as (7), the possessive 
son could refer either to Jean, to Marie or (less like- 
ly) to some other person, depending on contexts. 
(7) Jean dlt g Marie que son livre se vend bien. 
'Jean told Marie that his/her book is selling 
well' 
In such a case, a dialogue box specifying all pos- 
sible (SL) antecedents is presented to the user, who 
can select the most appropriate one(s). 
2.8 Speech output 
Good quality speech synthesis systems need a sig- 
nificant amount of linguistic knowledge in order (i) 
to disambiguate homographs which are not homo- 
phones (words with the same spelling but different 
pronunciations such as to lead/tile lead, to wind/tt~e 
wind, he read/to read, he records/the records, etc., 
(ii) to derive the syntactic structure which is used 
to segment sentences into phrases, to set accent lev- 
els, etc., and finally to determine an appropriate 
prosodic pattern. In a language like French, the 
type of attachment is crucial to determine whether 
a liaison between a word ending with a (laten- 
t) consonant and a word starting with a vowel is 
obligatory/possible/impossible 1 . 
1 For instance, liaison is obligatory between a prenom- 
inal adjective and a noun (e.g. petit animal), or between 
Such information is available during the transla- 
tion process. It turns out that in a linguistically- 
sound machine translation system, the surface struc- 
ture representations specify all the lexical, morpho- 
logical and syntactic information that a speech syn- 
thesis system needs. 
3 Concluding remark 
Although a small prototype has been completed, the 
ITSVox system described in this paper needs further 
improvements. The speech processing system un- 
der development at IDIAP is speaker-independent, 
HMM-based and contains models of phonetic units. 
A lexicon of word forms and a N-gram language 
model constitute the linguistic knowledge of this 
component. With respect to the linguistic compo- 
nents, current efforts focus on such tasks as retriev- 
ing ponctuation and use of stochastic information to 
rank parses. Those developments, however, will not 
affect the basic guideline of this project, which is 
that speech-to-speech translation systems and text 
translation systems must be minimally different. 

Bibliography 
Gaudinat, A. et E. Wehrli, 1997. "Analyse syn- 
taxique et synth6se de la parole : le projet 
FipsVox", TAL, 1997. 
Jimenes, V.M., A. Marsal & J. Monnd, 1995. "Ap- 
pfication of the A" Algorithm and the Recur- 
sire Enumeration Algorithm for Finding the N- 
best Sentence Hypotheses in Speech Recogni- 
tion, Technical Report DSIC-II/22/95, Dept. de 
Sistemas Informaticos y Computacion, Univer- 
sidad Politecuica de Valencia. 
Knuth, D. 1973. The Art of Computer Program- 
ming, Addison-Wesley. 
Laenslinger, C. and E. Wehrli, 1991. "FIPS : Un 
analyseur interactif pour le fran cais", TA Infor- 
mations, 32:2, 35-49. 
WehrIi, E. 1994. "1~aduction interactive : probl6mes 
et solutions (?)', in A. Clas et P. Bouillon (ed.), 
TA-TAO : Recherches de pointe et applications 
immddiates, Montreal, Aupelf-Uref, 333-342. 
Wehrli, E. 1996. "ITSVox". In Ezpanding MT Hori- 
zons, Proceedings of the Second Conference of 
the Association for Machine Translation in the 
Americas, 1996, pp. 247-251, Montreal, Cana- 
da. 
