A Model for Robust Processing of Spontaneous Speech 
by Integrating Viable Fragments* 
Karsten L. Worm 
Universit~it des Saarlandes 
Computerlinguistik 
D-66041 Saarbriicken, Germany 
worm@co li. uni- sb. de 
Abstract 
We describe the design and function of a robust pro- 
cessing component which is being developed for the 
Verbmobil speech translation system. Its task con- 
sists of collecting partial analyses of an input utter- 
ance produced by three parsers and attempting to 
combine them into more meaningful, larger units. It 
is used as a fallback mechanism in cases where no 
complete analysis spanning the whole input can be 
achieved, owing to spontaneous speech phenomena 
or speech recognition errors. 
1 Introduction 
In this paper we describe the function and design 
of the robust semantic processing component which 
we are currently developing in the context of the 
Verbmobil speech translation project. We aim at im- 
proving the system's performance in terms of cov- 
erage and quality of translations by combining frag- 
mentary analyses when no spanning analysis of the 
input can be derived because of spontaneous speech 
phenomena or speech recognition errors. 
2 The Verbmobil Context 
Verbmobil (Wahlster, 1997) is a large scale research 
project in the area of spoken language translation. 
Its goal is to develop a system that translates ne- 
gotiation dialogues between speakers of German, 
English and Japanese in face-to-face or video con- 
ferencing situations. The integrated system devel- 
oped during the first project phase (1993-96), the 
Research Prototype, was successfully demonstrated 
* The author wishes to thank his colleagues Johan Bos, 
Aljoscha Burchardt, Bj6rn Gamb~ick, Walter Kasper, Bemd 
Kiefer, Uli Krieger, Manfred Pinkal, Tobias Ruland, C. J. Rupp, 
J6rg Spilker, and Hans Weber for their collaboration. This re- 
search was supported by the German Federal Ministry for Ed- 
ucation, Science, Research and Technology under grant no. 01 
IV 701 R4. 
in autumn 1996 (Bub et al., 1997). The final Verb- 
mobil Prototype is due in 2000. 
Verbmobil employs different approaches to ma- 
chine translation. A semantic transfer approach 
(Doma and Emele, 1996) based on a deep linguistic 
analysis of the input utterance competes with statis- 
tical, example based and dialogue act based transla- 
tion approaches. 
The spoken input is mapped onto a word hypothe- 
sis graph (WHG) by a speech recognizer. A prosody 
component divides the input into segments and an- 
notates the WHGs with prosodic features. Within 
the semantic transfer line of processing, three dif- 
ferent parsers (an HPSG-based chart parser, a chunk 
parser using cascaded finite state automata, and 
a statistical parser) attempt to analyse the paths 
through the WHG syntactically and semantically. 
All three deliver their analyses in the VIT format 
(see 3). The parsers' work is coordinated by an inte- 
grated processing component which chooses paths 
through the WHG to be analysed in parallel by the 
parsers until an analysis spanning the whole input is 
found or the system reaches a time limit. 
Since in many cases no complete analysis span- 
ning the whole input can be found, the parsers pro- 
duce partial analyses along the way and send them 
to the robust semantic processing component, which 
stores and combines them to yield analyses of larger 
parts of the input. We describe this component in 
section 5. 
The relevant part of the system's architecture is 
shown in Figure 1. 
3 The VIT Format 
The VIT (short for Verbmobil Interface Term) was 
designed as a common output format for the two 
alternative and independently developed syntactic- 
semantic analysis components of the first project 
phase (Bos et al., 1998). Their internal semantic for- 
malisms differed, but both had to be attached to a 
1403 
' Speech _l Recognition dr\[ 
Prosody \] 
eech \[ nition & i HPSG \[ Dialogue and 
Parser \[ Context \[ 
Integrated I._-.~,~Vi.i.s,l_,.~ Semantic I-'~"-~'-(T,'-'/""~ Transfer ~ VIT \] Processing . : 
I ~ I l'roccssmK I ~ I I 
Chunk Statistical Synthesis Parser Parser 
Figure 1: Part of the system architecture. 
single transfer module. The need for a common out- 
put format is still present, since there are three al- 
ternative syntactic-semantic parsing modules in the 
new Verbmobil system, all of which again produce 
output for just one transfer module. 
(1) vit(vitID(sid(l,a,ge,O,20,l,ge,y, 
semantics), 
\[word(montag, 13, \[II16\]), 
word(ist,14, \[ii17\]), 
word(gut,15, \[lllOl)l), 
index(lll3,1109,il04), 
\[decl(lll2,hl05), 
gut(lllO,il05), 
dofw(lll6,ilO5,mon), 
support(lll7,il04,1110), 
indef(llll,ilO5,1115,hl06)\], 
\[ccom_plug(hl05,1114), 
ccom_plug(h106,1109), 
in g(ii12,1113), 
in_g(lll7,1109), 
in_g(lll6,1115), 
in_g(llll,lll4), 
leq(lll4,hlO5),leq(llO9,hl06), 
leq(llOg,hl05)\], 
Is sort(ilO5,time)l, 
\[\], 
\[num(ilO5,sg),pers(il05,3)\], 
\[ta_mood(ilO4,ind), 
ta_tense(ilO4,pres), 
ta_perf(ilO4,nonperf)\], 
\[\] 
) 
The VIT can be viewed as a theory-independent 
representation for underspecified semantic repre- 
sentations (Bos et al., 1996). It specifies a set of dis- 
course representation structures, DRSs, (Kamp and 
Reyle, 1993). If an utterance is structurally ambigu- 
ous, it will be represented by one VIT, which spec- 
ifies the set of DRSs corresponding to the different 
readings of the utterance. 
Formally, a VIT is a nine-place PROLOG term. 
There are slots for an identifier for the input segment 
to which the VIT corresponds, a list of the core se- 
mantic predicates, a list of scopal constraints, syn- 
tactic, prosodic and pragmatic information as well 
as tense and aspect and sortal information. An ex- 
ample of a VIT for the sentence Montag ist gut 
('Monday is fine') is given in (1). 
4 Approaches to Robustness 
There are three stages in processing where a speech 
understanding system can be made more robust 
against spontaneous speech phenomena and recog- 
nizer errors: before, during, or after parsing. While 
we do not see them as mutually exclusive, we think 
that the first two present significant problems. 
4.1 Before parsing 
Detection of self corrections on transcriptions be- 
fore parsing has been explored (Bear et al., 1992; 
Nakatani and Hirschberg, 1993), but it is not clear 
that it will be feasible on WHGs, since recognition 
errors interfere and the search space may explode 
due to the number of paths. Dealing with recogni- 
tion errors before parsing is impossible due to lack 
of structural information. 
4.2 During parsing 
Treating the phenomena mentioned during parsing 
would mean that the grammar or the parser would 
have to be made more liberal, i. e. they would have 
to accept strings which are ungrammatical. This is 
problematic in the context of WHG parsing, since 
the parser has to simultaneously perform two tasks: 
Searching for a path to be analysed and analysing it 
as well. 
If the analysis procedure is too liberal, it may 
already accept and analyse an ungrammatical path 
when a lower ranked path which is grammatical is 
1404 
also present in the WHG. I. e., the search through 
the WHG would not be restricted enough. 
5 Robust Semantic Processing 
Our approach addresses the problems mentioned af- 
ter parsing. In many cases the three parsers will 
not be able to find a path through the WHG that 
can be assigned a complete and spanning syntactic- 
semantic analysis. This is mainly due to two factors: 
• spontaneous speech phenomena, and 
• speech recognition errors. 
However, the parsers will usually be able to deliver 
a collection of partial analyses -- each covering a 
part of a path through the WHG. 
The goal of the robust semantic processing com- 
ponent in Verbmobil-2 is to collect these partial 
analyses and try to put them together on the basis 
of heuristic rules to produce deep linguistic analy- 
ses even if the input is not completely analysable. 
We speak of robust semantic processing since we 
are dealing with VITs which primarily represent se- 
mantic content and apply rules which refer to se- 
mantic properties and semantic structures. 
The task splits into three subtasks: 
1. Storing the partial analyses for different WHG 
(sub)paths from different parsers; 
2. Combining partial analyses to yield bigger 
structures; 
3. Choosing a sequence of partial analyses from 
the set of hypotheses as output. 
These subtasks are discussed in the following sub- 
sections. Section 5.4 contains examples of the prob- 
lems mentioned and outlines their treatment in the 
approach described. 
5.1 Storing Partial Analyses 
The first task of the robust semantic processing is 
to manage a possibly large number of partial analy- 
ses, each spanning a certain sub-interval of the input 
utterance. 
The basic mode of processing -- store competing 
analyses and combine them to larger analyses, while 
avoiding unnecessary redundancy -- resembles that 
of a chart parser. Indeed we use a chart-like data 
structure to store the competing partial analyses de- 
livered by the parsers and new hypotheses obtained 
by combining existing ones. All the advantages of 
the chart in chart parsing are preserved: The chart 
allows the storage of competing hypotheses, even 
from different sources, without redundancy. 
Since the input to the parsers consists of WHGs 
rather than strings, the analyses entered cannot refer 
to the string positions they span. Rather they have 
to refer to a time interval. This means also that the 
chart cannot be indexed by string positions, but is 
indexed by the time frames the speech recognizer 
uses. This makes necessary slight modifications to 
the chart handling algorithms. 
5.2 Combining Partial Analyses 
We use a set of heuristic rules to describe the con- 
ditions under which two or more partial analyses 
should be combined, an analysis should be left out 
or modified. Each rule specifies the conditions un- 
der which it should be applied, the operations to be 
performed, and what the result of the rule applica- 
tion is. Rules have the following format (in PROLOG 
notation): 
\[Condl ..... CondN\] ---> 
\[Opl ..... OpN\] & Result. 
The left hand side consists of a list of conditions 
on partial analyses, Cond2 being a condition (or a 
list of conditions) on the first partial analysis (VIT), 
etc., where the order of conditions parallels the ex- 
pected temporal order of the analyses. When these 
conditions are met, the rule fires and the operations 
Op 1 etc. are performed on the input VITs. One VIT, 
Result, is designated as the result of the rule. Af- 
ter applying the rule, an edge annotated with this 
VIT is entered into the chart, spanning the minimum 
time frame that includes the spans of all the analyses 
on the left hand side. Examples for rules are given 
in 5.4. 
5.3 Choosing a Result 
When no more analyses are produced by the parsers 
and all applicable rules have been applied, the last 
step is to choose a 'best' sequence of analyses from 
the chart which covers the whole input and deliver 
it to the transfer module. In the ideal case, there will 
be an analysis spanning the whole input. 
Currently, we employ a simple search which 
takes into account the acoustic scores of the WHG 
paths the analyses are based on, together with the 
length and coverage of the individual analyses. 
The length is defined as the length of the temporal 
interval an analysis spans; an analysis with a greater 
length is preferred. The coverage of an analysis is 
1405 
the sum of the lengths of the component analyses 
it consists of. Note that the coverage of an analysis 
will be less than its length iff some material inside 
the interval the analysis spans has been left out in 
the analysis; hence length and coverage are equal 
for the analyses produced by the parsers, l Analyses 
with greater coverage are preferred. 
5.4 Examples 
The examples in this section are taken from the 
Verbmobil corpus of appointment scheduling dia- 
logues. The problems we address here appeared in 
WHGs produced by a speech recognizer on the orig- 
inal audio data. 
5.4.1 Missing preposition 
Since function words like prepositions are usually 
short, speech recognizers often have trouble rec- 
ognizing them. Consider an example where the 
speaker uttered Mir wtire es am liebsten in den 
niichsten zwei Wochen ('During the next two weeks 
would be most convenient for me'). However, the 
WHG contains no path which includes the prepo- 
sition in in an appropriate position. Consequently, 
the parsers delivered analyses for the segments Mir 
ware es am liebsten and den niichsten zwei Wochen. 
These fragments are handled by two rules. The 
first turns a temporal NP like the second fragment 
into a temporal modifier, expressing that something 
is standing in an underspecified temporal relation to 
the temporal entity the NP denotes: 
\[temporal_np(Vl) \] ---> 
\[typeraise to mod(VI,V2)\] & V2. 
Then a very general rule can apply that modifier to 
the proposition expressed by the first fragment: 
\[type(Vl,prop) ,type(V2,mod)\] ---> 
\[apply(V2,Vl,V3)\] & V3. 
5.4.2 Self-Correction of a Modifier 
Here the speaker uttered Wir treffen uns am Montag, 
nein, am Dienstag ('We will meet on Monday, no, 
on Tuesday'). The parsers deliver three fragments, 
the first being a proposition containing a modifier, 
the second an interjection marking a correction, and 
the third a modifier of the same type as the one in the 
proposition. Under these conditions, we replace the 
modifier inside the proposition with the one uttered 
after the correction marker: 
~The chunk parser may be an exception here since it some- 
times leaves out words it cannot integrate into an analysis. 
\[ \[type (Vl,prop), 
has mod (Vi,Mi,ModType) \] , 
correction_marker (_) , 
\[ type (V2, mod), 
has_mod (V2, M2, ModType) \] \] 
---> \[replace_mod(Vi,Mi,M2,V3)\] & V3. 
5.4.3 Self-Correction of a Verb 
In this case, the speaker uttered Am Montag treffe... 
habe ich einen Terrain., i. e. decided to continue the 
utterance in a different way than originally intended. 
The parsers deliver fragments for, among others, the 
substrings am Montag, treffe, habe, ich, and einen 
Terrain (all the partial analyses received from the 
parsers and built up by robust semantic processing 
are shown in the chart 2 in Figure 2). 
Robust semantic processing then builds analyses 
by applying modifiers to verbal predicates (e. g., 
analyses 71,108) and verbal functors to possible ar- 
guments (e. g., 20, 106, 47). The latter is done by 
the following two rules: 
\[ type (Vl, Type) , unbound_arg (V2, Type) \] 
---> \[apply(V2,Vi,V3)\] & V3. 
\[ unbound_arg (VI, Type ) , type (V2, Type ) \] 
---> \[apply(Vl,V2,V3)\] & V3. 
Note that einen Termin is not considered to be a pos- 
sible argument of the verb treffe since that would 
violate the verb's sortal selection restrictions. 
After all partial analyses produced by the parsers 
have been entered into the chart and all applicable 
rules have been applied, there is still no spanning 
analysis (all analyses in Figure 2 are there, except 
the spanning one numbered 105). In such a case, 
the robust semantic processing component proceeds 
by extending active edges over passive edges which 
end in a chart node in which only one passive edge 
ends, or all passive edges ending there correspond 
to partial analyses still missing arguments. 
In this example, this applies to the node in which 
edges 1 and 71 end, which both are missing the two 
arguments of the transitive verb treffe. Application 
of the proposition modification rule mentioned in 
Section 5.4.1 to the modifyer PP am Montag has 
led to an active edge still looking for a proposi- 
tion. This is now being extended to end at the same 
node as the two passive edges missing arguments. 
:The analyses in the chart are numbered; the numbers in 
square brackets indicate the immediate constituents an analysis 
has been built from by robust semantic processing. I. e., anal- 
yses with an empty list of immediate constituents have been 
produced by a parser. 
1406 
105: am montafl, +habe+ich+einen temlin IGZt47\] 
~ 107: te~rfe.lch i 1 ,lS27: habe.ich n terrain 120,4S1 \[ 
Figure 2: The chart for Am Montag treffe ... habe ich einen Termin. 
There, it finds an edge corresponding to a propo- 
sition, namely edge 47, which had been built up 
earlier. The result is passive edge 105 spanning the 
whole input and expressing the right interpretation. 
6 Related Work 
An approach similar to the one described here was 
developed by Ros6 (Rosr, 1997). However, that ap- 
proach works on interlingual representations of ut- 
terance meanings, which implies the loss of all lin- 
guistic constraints on the combinatorics of partial 
analyses. Apart from that, only the output of one 
parser is considered. 
7 Conclusion and Outlook 
We have described a model for the combination of 
partial parsing results and how it can be applied in 
order to improve the robustness of a speech process- 
ing system. A prototype version was integrated into 
the Verbmobil system in autumn 1997 and is cur- 
rently being extended. 
We are working on improving the selection of re- 
suits by using a stochastic model of V1T sequence 
probabilities, on the extension of the rule set to 
cover more spontaneous speech phenomena of Ger- 
man, English and Japanese, and on refining the 
mechanism for extending active edges to arrive at 
a spanning analyses. 

References 
John Bear, John Dowding, and Elizabeth Shriberg. 
1992. Integrating multiple knowledge sources 
for detection and correction of repairs in human- 
computer dialog. In Proc. of the 30 th ACL, pages 
56--63, Newark, DE. 
Johan Bos, Bj6m Gamb~ick, Christian Lieske, 
Yoshiki Mori, Manfred Pinkal, and Karsten 
Worm. 1996. Compositional semantics in Verb- 
mobil. In Proc. of the 16 th COLING, pages 131- 
136, Copenhagen, Denmark. 
Johan Bos, Bianka Buschbeck-Wolf, Michael 
Dorna, and C. J. Rupp. 1998. Managing infor- 
mation at linguistic interfaces. In Proc. of the 
17 th COLING/36 th ACL, Montrral, Canada. 
Thomas Bub, Wolfgang Wahlster, and Alex Waibel. 
1997. Verbmobil: The combination of deep and 
shallow processing for spontaneous speech trans- 
lation. In Proc. Int. Conf. on Acoustics, Speech 
and Signal Processing (ICASSP), pages 71-74, 
Mfinchen, Germany. IEEE Signal Processing So- 
ciety. 
Michael Dorna and Martin C. Emele. 1996. 
Semantic-based transfer. In "Proc. of the 16 th 
COLING, pages 316-321, Copenhagen, Den- 
mark. 
Hans Kamp and Uwe Reyle. 1993. From Discourse 
to Logic. Kluwer, Dordrecht. 
Christine Nakatani and Julia Hirschberg. 1993. A 
speech-first model for repair detection and cor- 
rection. In Proc. of the 31 th ACL, pages 46-53, 
Columbus, OH. 
Carolyn Penstein Rosr. 1997. Robust Interactive 
Dialogue Interpretation. Ph.D. thesis, Carnegie 
Mellon University, Pittsburgh, PA. Language 
Technologies Institute. 
Wolfgang Wahlster. 1997. Verbmobil: Erken- 
nung, Analyse, Transfer, Generierung und Syn- 
these yon Spontansprache. Verbmobil-Report 
198, DFKI GmbH, Saarbriicken, June. 
