Processing Self Corrections in a speech to speech system 
Jörg Spilker, Martin Klarner, Günther Görz
University of Erlangen-Nuremberg - Computer Science Institute, 
IMMD 8 - Artificial Intelligence, 
Am Weichselgarten 9, 91058 Erlangen- Tennenlohe, Germany 
{spilker, klarner, goerz}@immd8.informatik.uni-erlangen.de
Abstract 
Speech repairs occur often in spontaneous spoken dialogues. The
ability to detect and correct those repairs is necessary for any
spoken language system. We present a framework to detect and correct
speech repairs where all relevant levels of information, i.e.,
acoustics, lexis, syntax and semantics, can be integrated. The basic
idea is to reduce the search space for repairs as soon as possible by
cascading filters that involve more and more features. First, an
acoustic module generates hypotheses about the existence of a repair.
Second, a stochastic model suggests a correction for every hypothesis.
Well-scored corrections are inserted as new paths in the word lattice.
Finally, a lattice parser decides on accepting the repair.
1 Introduction 
Spontaneous speech is disfluent. In contrast to read speech, the
sentences are not perfectly planned before they are uttered. Speakers
often modify their plans while they speak. This results in pauses,
word repetitions or changes, word fragments and restarts. Current
automatic speech understanding systems perform very well in small
domains with restricted speech but have great difficulty dealing with
such disfluencies. A system that copes with these self corrections
(= repairs) must recognize the spoken words and identify the repair to
get the intended meaning of an utterance. To characterize a repair it
is commonly segmented into the following four parts (cf. fig. 1):
• reparandum: the "wrong" part of the utterance
• interruption point (IP): marker at the end of the reparandum
• editing term: special phrases which indicate a repair, like "well",
"I mean", or filled pauses such as "uhm", "uh"
• reparans: the correction of the reparandum
[Figure: the utterance "on Thursday I cannot · no I can meet äh after
one", annotated with reparandum, interruption point, editing term and
reparans]
Figure 1: Example of a self repair 
Only if reparandum and editing term are known can the utterance be
analyzed in the right way. It remains an open question whether the
two terms should be deleted before a semantic analysis, as sometimes
suggested in the literature¹. If both terms are marked, deleting
reparandum and editing term is a straightforward preprocessing step.
In the Verbmobil² corpus, a corpus dealing with appointment scheduling
and travel planning, nearly 21% of all turns contain at least one
repair. As a consequence, a speech understanding system that cannot
handle repairs will lose performance on these turns.
Even if repairs are defined by syntactic and semantic well-formedness
(Levelt, 1983), we observe that most of them are local phenomena. At
this point we have to differentiate between restarts and other
repairs³ (modification repairs). Modification repairs have a strong
correspondence between reparandum and reparans, whereas restarts are
less structured. In our belief there is no need for a complete
syntactic analysis to detect and correct most modification repairs.
Thus, in what follows, we will concentrate on this kind of repair.

¹ In most cases a reparandum could be deleted without any loss of
information. But, for example, if it introduces an object which is
referred to later, a deletion is not appropriate.
² This work is part of the VERBMOBIL project and was funded by the
German Federal Ministry for Research and Technology (BMBF) in the
framework of the Verbmobil Project under Grant BMBF 01 IV 701 V0. The
responsibility for the contents of this study lies with the authors.
³ Often a third kind of repair is defined: "abridged repairs". These
repairs consist solely of an editing term and are not repairs in our
sense.

There are two major arguments for processing repairs before parsing.
First, spontaneous speech is not always syntactically well-formed even
in the absence of self corrections. Second, (meta-)rules increase the
parser's search space. This is perhaps acceptable for transliterated
speech but not for speech recognizer output such as lattices, because
they represent millions of possible spoken utterances. In addition,
systems which are not based on a deep syntactic and semantic analysis
(e.g. statistical dialog act prediction) require a repair processing
step to resolve contradictions like the one in fig. 1.
We propose an algorithm for word lattices that divides repair
detection and correction into three steps (cf. fig. 2). First, a
trigger indicates potential IPs. Second, a stochastic model tries to
find an appropriate repair for each IP by guessing the most probable
segmentation. To accomplish this, repair processing is seen as a
statistical machine translation problem where the reparandum is a
translation of the reparans. For every repair found, a path
representing the speaker's intended word sequence is inserted into the
lattice. In the last step, a lattice parser selects the best path.
[Figure: the speech recognizer produces a word lattice for "on
Thursday I cannot no I can meet äh after one"; local word-based scope
detection marks reparandum and reparans; lattice editing represents
the result as an additional path; selection by linguistic analysis
yields "on Thursday I can meet äh after one"]
Figure 2: An architecture for repair processing
2 Repair Triggers
Because it is impossible for a real time speech system to check for
every word whether it can be part of a repair, we use triggers which
indicate the potential existence of a repair. These triggers must be
immediately detectable for every word in the lattice. Currently we are
using two different kinds of triggers⁴:
1. Acoustic/prosodic cues: Speakers mark the IP in many cases by
prosodic signals like pauses, hesitations, etc. A prosodic
classifier⁵ determines for every word the probability of an IP
following. If it is above a certain threshold, the trigger becomes
active. For a detailed description of the acoustic aspects see
(Batliner et al., 1998).

2. Word fragments are a very strong repair indicator. Unfortunately,
no speech recognizer is able to detect word fragments to date. But
there are some interesting approaches to detect words which are not
in the recognizer's vocabulary (Klakow et al., 1999). A word fragment
is normally an unknown word, and we hope that it can be distinguished
from unfragmented unknown words by the prosodic classifier. So,
currently this is a hypothetical trigger. We will elaborate on it in
the evaluation section (cf. sect. 5) to show the impact of this
trigger.

If a trigger is active, a search for an acceptable segmentation into
reparandum, editing term and reparans is initiated.
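
The acoustic trigger can be pictured as a simple thresholding step
over the classifier output. The following is only a minimal sketch
under stated assumptions, not the authors' implementation: the
function name `active_triggers`, the example probabilities and the
threshold value 0.5 are invented for illustration, since the paper
does not state the threshold actually used.

```python
from typing import List, Sequence


def active_triggers(ip_probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of words after which an IP is hypothesized.

    ip_probs[i] is the (hypothetical) classifier probability that an
    interruption point follows word i. The threshold is an assumed value.
    """
    return [i for i, p in enumerate(ip_probs) if p > threshold]


words = ["on", "Thursday", "I", "cannot", "no", "I", "can", "meet"]
# Made-up probabilities; a real system would take them from the prosodic classifier.
probs = [0.02, 0.10, 0.05, 0.81, 0.30, 0.04, 0.03, 0.01]
triggers = active_triggers(probs)
```

Only the positions returned here would be handed on to the scope
detection described in the next section.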
3 Scope Detection 
As mentioned in the introduction, repair segmentation is based mainly
on a stochastic translation model. Before we explain it in detail we
give a short introduction to statistical machine translation⁶. The
fundamental idea is the assumption that a given sentence $S$ in a
source language (e.g. English) can be translated into any sentence $T$
in a target language (e.g. German). To every pair $(S, T)$ a
probability is assigned which reflects the likelihood that a
translator who sees $S$ will produce $T$ as the translation. The
statistical machine translation problem is formulated as:

\[ \hat{T} = \arg\max_{T} P(T \mid S) \tag{1} \]

⁴ Other triggers can be added as well. (Stolcke et al., 1999) for
example integrate prosodic cues and an extended language model in a
speech recognizer to detect IPs.
⁵ The classifier is developed by the speech group of the IMMD 5.
Special thanks to Anton Batliner, Richard Huber and Volker Warnke.
⁶ A more detailed introduction is given by (Brown et al., 1990).

This is reformulated by Bayes' law for a better search space
reduction, but we are only interested in the conditional probability
$P(T \mid S)$. For further processing steps we have to introduce the
concept of alignment (Brown et al., 1990). Let $S$ be the word
sequence $S_1, S_2 \ldots S_l \equiv S_1^l$ and
$T = T_1, T_2 \ldots T_m \equiv T_1^m$. We can link a word in $T$ to a
word in $S$. This reflects the assumption that the word in $T$ is
translated from the word in $S$. For example, if $S$ is "On Thursday"
and $T$ is "Am Donnerstag", "Am" can be linked to "On" but also to
"Thursday". If each word in $T$ is linked to exactly one word in $S$,
these links can be described by a vector $a_1^m = a_1 \ldots a_m$ with
$a_j \in \{0, \ldots, l\}$. If the word $T_j$ is linked to $S_i$ then
$a_j = i$. If it is not connected to any word in $S$ then $a_j = 0$.
Such a vector is called an alignment $a$. $P(T \mid S)$ can now be
expressed by

\[ P(T \mid S) = \sum_{a} P(T, a \mid S) \tag{2} \]

where the sum runs over all alignments $a$.
Without any further assumptions we can infer the following:

\[ P(T, a \mid S) = P(m \mid S) \prod_{j=1}^{m} P(a_j \mid a_1^{j-1}, T_1^{j-1}, m, S) \, P(T_j \mid a_1^{j}, T_1^{j-1}, m, S) \tag{3} \]

Now we return to self corrections. How can this framework help to
detect the segments of a repair? Assume we have a lattice path where
the reparandum ($RD$) and the reparans ($RS$) are given. Then
$(RS, RD)$ can be seen as a translation pair and $P(RD \mid RS)$ can
be expressed exactly the same way as in equation (2). Hence we have a
method to score $(RS, RD)$ pairs. But the triggers only indicate the
interruption point, not the complete segmentation. Let us first look
at editing terms. We assume them to be a closed list of short phrases.
Thus if an entry of the editing term list is found after an IP, the
corresponding words are skipped. Any subsequence of words before/after
the IP could be the reparandum/reparans. Because turns can have an
arbitrary length it is impossible to compute $P(RD \mid RS)$ for every
$(RS, RD)$ pair. But this is not necessary at all if repairs are
considered as local phenomena. We restrict our search to a window of
four words before and after the IP. A corpus analysis showed that 98%
of all repairs are within this window. Now we only have to compute
probabilities for $4^2$ different pairs. If the probability of a
$(RS, RD)$ pair is above a certain threshold, the segmentation is
accepted as a repair.
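
The windowed search can be illustrated by enumerating the candidate
segmentations around one IP. This is a minimal sketch under stated
assumptions, not the authors' code: the editing term list
`EDITING_TERMS`, the helper names and the flat word list (instead of a
lattice) are simplifications chosen for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed closed list of editing terms (illustrative only).
EDITING_TERMS = [("I", "mean"), ("well",), ("no",), ("uh",), ("uhm",)]


def skip_editing_term(words: Sequence[str], start: int) -> int:
    """Return the index of the first word after an editing term at `start`."""
    for term in sorted(EDITING_TERMS, key=len, reverse=True):
        if tuple(words[start:start + len(term)]) == term:
            return start + len(term)
    return start


def candidate_pairs(words: Sequence[str], ip: int, window: int = 4
                    ) -> List[Tuple[Tuple[str, ...], Tuple[str, ...]]]:
    """Enumerate (reparandum, reparans) candidates around an IP.

    `ip` is the index of the last word before the interruption point.
    The reparandum ends at the IP; the reparans starts right after the
    (optional) editing term. Both are at most `window` words long.
    """
    rs_start = skip_editing_term(words, ip + 1)
    pairs = []
    for rd_len in range(1, window + 1):
        rd_begin = ip + 1 - rd_len
        if rd_begin < 0:
            break
        for rs_len in range(1, window + 1):
            rs_end = rs_start + rs_len
            if rs_end > len(words):
                break
            pairs.append((tuple(words[rd_begin:ip + 1]),
                          tuple(words[rs_start:rs_end])))
    return pairs


words = ["on", "Thursday", "I", "cannot", "no", "I", "can", "meet", "after", "one"]
pairs = candidate_pairs(words, ip=3)
```

With enough context on both sides, the two nested loops yield the 16
candidate pairs; each of them would then be scored by the scope model
and compared against the acceptance threshold.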
3.1 Parameter Estimation 
The conditional probabilities in equation (3) cannot be estimated
reliably from any corpus of realistic size, because there are too many
parameters. For example, both probabilities in the product depend on
the complete reparans $RS$. Therefore we simplify the probabilities by
assuming that $m$ depends only on $l$, $a_j$ only on $j$, $m$ and $l$,
and finally $RD_j$ on $RS_{a_j}$. So equation (3) becomes

\[ P(RD, a \mid RS) = P(m \mid l) \prod_{j=1}^{m} P(a_j \mid j, m, l) \, P(RD_j \mid RS_{a_j}) \tag{4} \]

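
Because every factor in equation (4) depends on a single $a_j$, the
sum over all alignments from equation (2) can be moved inside the
product, so a pair can be scored without enumerating alignments. The
sketch below checks this on a toy example. It is an illustration under
stated assumptions: the function names, the `NULL` placeholder for
$a_j = 0$ and all probability values are invented, and the toy tables
are not properly normalized distributions.

```python
from itertools import product
from math import isclose

NULL = "<null>"  # position 0: reparandum word linked to no reparans word


def pair_prob_bruteforce(rd, rs, p_len, p_align, p_word):
    """Sum P(RD, a | RS) from equation (4) over every alignment a."""
    m, l = len(rd), len(rs)
    src = [NULL] + list(rs)
    total = 0.0
    for a in product(range(l + 1), repeat=m):
        prob = p_len(m, l)
        for j, i in enumerate(a, start=1):
            prob *= p_align(i, j, m, l) * p_word(rd[j - 1], src[i])
        total += prob
    return total


def pair_prob(rd, rs, p_len, p_align, p_word):
    """Same quantity, with the sum over a_j moved inside the product."""
    m, l = len(rd), len(rs)
    src = [NULL] + list(rs)
    prob = p_len(m, l)
    for j in range(1, m + 1):
        prob *= sum(p_align(i, j, m, l) * p_word(rd[j - 1], src[i])
                    for i in range(l + 1))
    return prob


# Toy model components, invented for illustration only.
def p_len(m, l):
    return 0.5 if m == l else 0.1


def p_align(i, j, m, l):
    return 0.7 if i == j else 0.3 / l


def p_word(rd_word, rs_word):
    if rd_word == rs_word:
        return 0.6
    return {("cannot", "can"): 0.2}.get((rd_word, rs_word), 0.01)


rd, rs = ("I", "cannot"), ("I", "can")
score = pair_prob(rd, rs, p_len, p_align, p_word)
brute = pair_prob_bruteforce(rd, rs, p_len, p_align, p_word)
```

The factorized form needs on the order of $m \cdot (l+1)$ table
lookups instead of $(l+1)^m$ alignments, which matters little for a
four-word window but keeps the scoring step cheap.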
These probabilities can be directly trained from a manually annotated
corpus, where all repairs are labeled with begin, end, IP and editing
term and for each reparandum the words are linked to the corresponding
words in the respective reparans. All distributions are smoothed by a
simple back-off method (Katz, 1987) to avoid zero probabilities, with
the exception that the word replacement probability
$P(RD_j \mid RS_{a_j})$ is smoothed in a more sophisticated way.
3.2 Smoothing 
Even if we reduce the number of parameters for the word replacement
probability by the simplifications mentioned above, there are a lot of
parameters left. With a vocabulary size of 2500 words, $2500^2$
parameters have to be estimated for $P(RD_j \mid RS_{a_j})$. The
corpus⁷ contains 3200 repairs from which we extract about 5000 word
links. So most of the possible word links never occur in the corpus.
Some of them are more likely to occur in a repair than others. For
example, the replacement of "Thursday" by "Friday" is supposed to be
more likely than by "eating", even if both replacements are not in the
training corpus. Of course, this is related to the fact that a repair
is a syntactic and/or semantic anomaly. We make use of it by adding
two additional knowledge sources to our model. Minimal syntactic
information is given by part-of-speech (POS) tags and POS sequences;
semantic information is given by semantic word classes. Hence the
input is not merely a sequence of words but a sequence of triples.
Each triple has three slots (word, POS tag, semantic class). In the
next section we will describe how we obtain these two pieces of
information for every word in the lattice. With this additional
information, the probability $P(RD_j \mid RS_{a_j})$ can be smoothed
by linear interpolation of word, POS and semantic class replacement
probabilities:

\[ P(RD_j \mid RS_{a_j}) = \alpha \cdot P(Word(RD_j) \mid Word(RS_{a_j})) + \beta \cdot P(POS(RD_j) \mid POS(RS_{a_j})) + \gamma \cdot P(Sem(RD_j) \mid Sem(RS_{a_j})) \]

with $\alpha + \beta + \gamma = 1$. $Word(RD_j)$ is the notation for
the selector of the word slot of the triple at position $j$; $POS$ and
$Sem$ select the POS tag slot and the semantic class slot in the same
way.

⁷ ≈11000 turns with ≈240000 words
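
The effect of the interpolation on unseen word pairs can be shown with
a small example. This is a minimal sketch under stated assumptions:
the `Token` triple, the tag and class names, the table entries and the
weights 0.5/0.3/0.2 are all invented for illustration; the paper does
not report the weights it uses.

```python
from math import isclose
from typing import Dict, NamedTuple, Tuple


class Token(NamedTuple):
    """One input triple: word, POS tag, semantic class."""
    word: str
    pos: str
    sem: str


Table = Dict[Tuple[str, str], float]


def replacement_prob(rd: Token, rs: Token, p_word: Table, p_pos: Table,
                     p_sem: Table, alpha: float = 0.5, beta: float = 0.3,
                     gamma: float = 0.2) -> float:
    """Linear interpolation of word, POS and semantic class replacement probabilities."""
    assert isclose(alpha + beta + gamma, 1.0)
    return (alpha * p_word.get((rd.word, rs.word), 0.0)
            + beta * p_pos.get((rd.pos, rs.pos), 0.0)
            + gamma * p_sem.get((rd.sem, rs.sem), 0.0))


# Toy tables keyed by (reparandum slot, reparans slot); values are made up.
p_word: Table = {}  # neither word pair was seen in training
p_pos: Table = {("NN", "NN"): 0.4, ("NN", "VV"): 0.02}
p_sem: Table = {("weekday", "weekday"): 0.5}

thursday = Token("Thursday", "NN", "weekday")
friday = Token("Friday", "NN", "weekday")
eating = Token("eating", "VV", "activity")

p_friday = replacement_prob(thursday, friday, p_word, p_pos, p_sem)
p_eating = replacement_prob(thursday, eating, p_word, p_pos, p_sem)
```

Although both word pairs are unseen, the POS and semantic class terms
still rank the "Thursday"/"Friday" replacement well above
"Thursday"/"eating".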
4 Integration with Lattice Processing
We can now detect and correct a repair, given a sentence annotated
with POS tags and semantic classes. But how can we construct such a
sequence from a word lattice? Integrating the model in a lattice
algorithm requires three steps:
• mapping the word lattice to a tag lattice
• triggering IPs and extracting the possible reparandum/reparans pairs
• introducing new paths to represent the plausible reparans
The tag lattice construction is adapted from (Samuelsson, 1997). For
every word edge and every denoted POS tag a corresponding tag edge is
created and the resulting probability is determined. If a tag edge
already exists, the probabilities of both edges are merged. The
original words are stored together with their unique semantic class in
an associated list. Paths through the tag graph are scored by a POS
trigram. If a trigger is active, all paths through the word before the
IP need to be tested for whether an acceptable repair segmentation
exists. Since the scope model takes at most four words for reparandum
and reparans into account, it is sufficient to expand only partial
paths. Each of these partial paths is then processed by the scope
model. To reduce the search space, paths with a low score can be
pruned.
Repair processing is integrated into the Verbmobil system as a filter
process between speech recognition and syntactic analysis. This
enforces a repair representation that can be integrated into a
lattice. It is not possible to mark only the words with some
additional information, because a repair is a phenomenon that depends
on a path. Imagine that the system has detected a repair on a certain
path in the lattice and marked all words by their repair function.
Then a search process (e.g. the parser) selects a different path which
shares only the words of the reparandum. But these words are no
reparandum for this path. A solution is to introduce a new path in the
lattice where reparandum and editing terms are deleted. As we said
before, we do not want to delete these segments, so they are stored in
a special slot of the first word of the reparans. The original path
can now be reconstructed if necessary.
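
One way to picture this lattice edit is as a single additional edge
that bypasses the reparandum and editing term. The sketch below is an
assumption-laden illustration, not the Verbmobil data structure: the
`Edge` class, its `deleted` slot and the integer node numbering are
hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Edge:
    """A word edge in a (hypothetical) lattice representation."""
    start: int                      # start node
    end: int                        # end node
    word: str
    score: float                    # acoustic score, assumed to be a log score
    deleted: Tuple[str, ...] = ()   # slot for removed reparandum and editing term


def insert_repair_edge(lattice: List[Edge], path: List[Edge],
                       rd_begin: int, rs_begin: int) -> Edge:
    """Add an edge for the first reparans word that skips path[rd_begin:rs_begin].

    The new edge starts at the reparandum's onset node and keeps the
    skipped words in its `deleted` slot, so the original path can be
    reconstructed. Existing edges are left untouched.
    """
    removed = path[rd_begin:rs_begin]
    first_rs = path[rs_begin]
    new_edge = Edge(removed[0].start, first_rs.end, first_rs.word,
                    first_rs.score, tuple(e.word for e in removed))
    lattice.append(new_edge)
    return new_edge


words = ["on", "Thursday", "I", "cannot", "no", "I", "can", "meet"]
path = [Edge(i, i + 1, w, -1.0) for i, w in enumerate(words)]
lattice = list(path)
repair = insert_repair_edge(lattice, path, rd_begin=2, rs_begin=5)
```

Because the original edges stay in the lattice, a later search process
is still free to choose the unrepaired path.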
To ensure that these new paths are comparable to other paths, we score
the reparandum the same way the parser does and add the resulting
value to the first word of the reparans. As a result, both the
original path and the one with the repair get the same score except
for one word transition. The (probably bad) transition in the original
path from the last word of the reparandum to the first word of the
reparans is replaced by a (probably good) transition from the
reparandum's onset to the reparans. We take the lattice in fig. 2 to
give an example. The scope model has marked "I cannot" as the
reparandum, "no" as an editing term, and "I can" as the reparans. We
sum up the acoustic scores of "I", "cannot" and "no". Then we add the
maximum language model scores for the transition to "I", to "cannot"
given "I", and to "no" given "I" and "cannot". This score is added as
an offset to the acoustic score of the second "I".
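
The offset computation in this example can be written down in a few
lines. This is a minimal sketch under stated assumptions: scores are
treated as additive log scores, the function name `repair_offset` and
all numeric values are invented, and the "maximum" over alternative
language model scores is reduced to a single table lookup.

```python
from math import isclose
from typing import Callable, List, Sequence, Tuple


def repair_offset(deleted: Sequence[Tuple[str, float]],
                  lm_score: Callable[[Tuple[str, ...], str], float]) -> float:
    """Sum acoustic and language model scores of the skipped words.

    `deleted` lists (word, acoustic score) for the reparandum and the
    editing term in spoken order. Each word is scored given the skipped
    words before it.
    """
    history: List[str] = []
    total = 0.0
    for word, acoustic in deleted:
        total += acoustic + lm_score(tuple(history), word)
        history.append(word)
    return total


# Made-up log scores for the reparandum "I cannot" and the editing term "no".
LM = {
    ((), "I"): -1.0,
    (("I",), "cannot"): -2.0,
    (("I", "cannot"), "no"): -4.0,
}
deleted = [("I", -2.0), ("cannot", -3.5), ("no", -1.5)]

offset = repair_offset(deleted, lambda history, word: LM[(history, word)])
second_i_score = -2.5 + offset  # assumed acoustic score of the second "I" plus offset
```

Carrying the offset on the first reparans word keeps the repaired path
on the same scale as the original one, so the two differ only in the
transition into the reparans.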
5 Results and Further Work 
Due to the different trigger situations we performed two tests: one
where we use only acoustic triggers and another where the existence of
a perfect word fragment detector is assumed. The input consisted of
unsegmented transliterated utterances, to exclude the influence of a
word recognizer. We restricted the processing time on a SUN/ULTRA
300 MHz to 10 seconds. The parser was simulated by a word trigram.
Training and testing were done on two separate parts of the German
part of the Verbmobil corpus (12558 turns training / 1737 turns test).
          Detection             Correct scope
          Recall   Precision    Recall   Precision
Test 1    49%      70%          47%      70%
Test 2    71%      85%          62%      83%
A direct comparison to other groups is rather difficult due to very
different corpora, evaluation conditions and goals. (Nakatani and
Hirschberg, 1993) suggest an acoustic/prosodic detector to identify
IPs but do not discuss the problem of finding the correct segmentation
in depth. Also, their results are obtained on a corpus where every
utterance contains at least one repair. (Shriberg, 1994) also
addresses the acoustic aspects of repairs. Parsing approaches like
those in (Bear et al., 1992; Hindle, 1983; Core and Schubert, 1999)
must be proved to work with lattices rather than transliterated text.
An algorithm which is inherently capable of lattice processing is
proposed by Heeman (Heeman, 1997). He redefines the word recognition
problem to identify the best sequence of words, corresponding POS tags
and special repair tags. He reports a recall rate of 81% and a
precision of 83% for detection and 78%/80% for correction. The test
settings are nearly the same as in test 2. Unfortunately, nothing is
said about the processing time of his module.
We have presented an approach to score potential reparandum/reparans
pairs with a relatively simple scope model. Our results show that
repair processing with statistical methods and without deep syntactic
knowledge is a promising approach, at least for modification repairs.
Within this framework more sophisticated scope models can be
evaluated. A system integration as a filter process is described.
Mapping the word lattice to a POS tag lattice is not optimal, because
word information is lost in the search for partial paths. We plan to
implement a combined POS/word tagger.

References 

A. Batliner, R. Kompe, A. Kießling, M. Mast, H. Niemann, and
E. Nöth. 1998. M = syntax + prosody: A syntactic-prosodic labelling
schema for large spontaneous speech databases. Speech Communication,
25:193-222.

J. Bear, J. Dowding, and E. Shriberg. 1992. Integrating multiple
knowledge sources for detection and correction of repairs in human
computer dialogs. In Proc. ACL, pages 56-63, University of Delaware,
Newark, Delaware.

P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra,
F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. 1990.
A statistical approach to machine translation. Computational
Linguistics, 16(2):79-85, June.

M. G. Core and K. Schubert. 1999. Speech repairs: A parsing
perspective. Satellite meeting ICPhS 99.

P. A. Heeman. 1997. Speech Repairs, Intonation Boundaries and
Discourse Markers: Modeling Speakers' Utterances in Spoken Dialog.
Ph.D. thesis, University of Rochester.

D. Hindle. 1983. Deterministic parsing of syntactic nonfluencies. In
Proc. ACL, MIT, Cambridge, Massachusetts.

S. M. Katz. 1987. Estimation of probabilities from sparse data for the
language model component of a speech recognizer. Transactions on
Acoustics, Speech and Signal Processing, ASSP-35, March.

D. Klakow, G. Rose, and X. Aubert. 1999. OOV-Detection in Large
Vocabulary System Using Automatically Defined Word-Fragments as
Fillers. In EUROSPEECH '99, volume 1, pages 49-52, Budapest.

W. Levelt. 1983. Monitoring and self-repair in 
speech. Cognition, 14:41-104. 

C. Nakatani and J. Hirschberg. 1993. A speech-first model for repair
detection and correction. In Proc. ACL, Ohio State University,
Columbus, Ohio.

C. Samuelsson. 1997. A left-to-right tagger for word graphs. In Proc.
of the 5th International Workshop on Parsing Technologies, pages
171-178, Boston, Massachusetts.

E. E. Shriberg. 1994. Preliminaries to a Theory of Speech
Disfluencies. Ph.D. thesis, University of California.

A. Stolcke, E. Shriberg, D. Hakkani-Tur, and G. Tur. 1999. Modeling
the prosody of hidden events for improved word recognition. In
EUROSPEECH '99, volume 1, pages 307-310, Budapest.
