Modelling Speech Repairs 
in German and Mandarin Chinese Spoken Dialogues 
Shu-Chuan Tseng 
Da-Yeh University 
112 Shan-Jiao Rd. Da-Tsuen 
Changhua, Taiwan 515 
tseng@aries.dyu.edu.tw 
Abstract 
Results presented in this paper strongly
support the notion that similarities as well as
differences in language systems can be
empirically investigated by looking into the
linguistic patterns of speech repairs in real
speech data. A total of 500 German and 325
Mandarin Chinese overt immediate speech
repairs were analysed with regard to their
internal phrasal structures, with particular
focus on their syntactic and morphological
characteristics. Computational models in the
form of finite state automata (FSA) also
illustrate the describable regularity of
German and Mandarin Chinese speech
repairs in a formal way.
Introduction 
Spontaneous speech analysis has recently been
playing a crucial role in providing empirical
evidence for applications in both theoretical and
applied fields of computational linguistics. For
the purpose of constructing more salient and
robust dialogue systems, recent analyses of
speech repairs, or more generally speaking, of
speech disfluencies in spoken dialogues have
tried to explore the distributional characteristics
of irregular sequences in order to develop
annotation systems to cope with speech repairs
(Heeman and Allen 1999, Nakatani and
Hirschberg 1994). On the one hand, this new
research direction has until recently focused
merely on the surface structure of speech repairs.
On the other hand, except for very few
investigations starting to deal with speech
repairs across several languages (Eklund and
Shriberg 1998), most of the studies on speech
repairs have investigated only single languages.
In addition, studies have shown that syntactic
and prosodic features of spontaneous speech
data provide empirical evidence that
reflects the speaking habits of speakers, and
also help to develop better parsing strategies and
natural language processing systems (Heeman
and Allen 1999, Hindle 1983). These systems
should understand and react to the language use
of human users (Lickley and Bard 1998, Tseng
1998).
This paper presents results of a comparative
study of speech repairs with the goal of
examining and modelling repair syntax by
looking into empirical cross-linguistic speech
data. In this paper, the phenomena of speech
repairs are introduced first, followed by an
empirical cross-linguistic analysis of speech
repairs in German and Mandarin Chinese, which
have different language typologies. Speech data
were collected to look for linguistic
sequences and particularities of spontaneous
speech, which usually cause difficulties for
language dialogue systems. Syntactic patterns
found in the comparative analysis have
subsequently been formalised to make clear the
internal structures of speech repairs. Finally,
formal modelling in FSA should show the
formal characteristics of repair sequences in
these two language systems.
1 Related Work 
This section summarises previous results related
to speech repairs. First, a generally adopted
template model of describing repairs is
introduced, followed by a brief summary of
recent studies on speech repair processing in
German and Mandarin Chinese.
1.1 Template Model of Repairs 
Most models of repair structures (Levelt 1983)
apply a template-based approach. In principle, a
template model is composed of three parts:
reparandum (Rep), editing terms (Et) and
alteration (Alt). The reparandum denotes the
speech stretch which needs to be repaired,
whereas the alteration is the repair itself. Editing
terms are sequences produced between the
reparandum and the alteration, which often
appear in the form of silent or filled pauses and can
also be absent, depending on the speaking
situation. A classification system of repairs can
be derived from the structural relations between
the reparandum, the editing term and the
alteration:
• addition repairs
Example: [...] (Rep) [...] (Alt)
(TWPTH Corpus) 1
• substitution repairs
Example: Und unten drunter ist halt die gelbe
Mutter (Rep) äh (Et) die orange Mutter (Alt)
(Sagerer et al. 1994) 2
• repetition repairs
Example: En aan de rechterkant een oranje stip
(Rep) oranje stip (Alt). (Levelt 1983) 3
• abridged repairs
Example: I think that you get - it is more strict in
Catholic schools. (Hindle 1983)
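The template model can be sketched as a small data structure. The following is only an illustration: the class name `Repair`, the function `classify` and its overlap heuristic are assumptions introduced here, not the annotation procedure used for the corpora, and the addition example is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Repair:
    """One repair in the template model: reparandum, editing terms, alteration."""
    reparandum: List[str]
    alteration: List[str]
    editing_terms: List[str] = field(default_factory=list)


def classify(repair: Repair) -> str:
    """Hypothetical heuristic that only looks at how Rep and Alt overlap."""
    rep, alt = repair.reparandum, repair.alteration
    if not alt or not set(rep) & set(alt):
        return "abridged"       # nothing of the reparandum is taken up again
    if alt == rep[-len(alt):]:
        return "repetition"     # the alteration repeats the end of the reparandum
    if len(alt) > len(rep) and all(word in alt for word in rep):
        return "addition"       # reparandum words are kept, new material is added
    return "substitution"       # some word of the reparandum is replaced


substitution = Repair(["die", "gelbe", "Mutter"], ["die", "orange", "Mutter"], ["äh"])
repetition = Repair(["een", "oranje", "stip"], ["oranje", "stip"])
addition = Repair(["the", "nut"], ["the", "orange", "nut"])  # invented example
abridged = Repair(["I", "think", "that", "you", "get"], ["it", "is", "more", "strict"])

for r in (substitution, repetition, addition, abridged):
    print(classify(r), r.reparandum, r.editing_terms, r.alteration)
```

The editing terms are stored but not used by the heuristic, which mirrors the observation that they may be absent.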
1 Verbatim translation: will influence whole
POSSESSIVE-particle whole industry
POSSESSIVE-particle investment interests.
Sentential translation: It will influence the whole the
whole industrial investment interests.
2 And beneath that is the yellow nut eh the orange
nut.
3 And at the right-side an orange dot orange dot.
1.2 Grammar-Oriented Production of
German Speech Repairs
German, an Indo-European language, is a
language with a strong emphasis on grammatical
inflection. Phrases with congruence in gender,
number and case are important from syntactic
and morphological viewpoints. Thus, phrasal
boundaries may play a role in the production of
German repairs. Results provided by Tseng
(1999) empirically support the significant role of
phrasal boundaries in German by examining
German speech repairs. Phrasal boundaries seem
to be the positions to start as well as to end
speech repairs. The following utterance, in which
a German repair is produced, clearly illustrates
this phenomenon: "Ich habe einen Würfel mit
einer mit einem Gewinde" 4, where mit einer is a
phrasal fragment and mit einem Gewinde,
starting from the phrasal beginning, is a
complete phrase repairing the previous phrasal
fragment. In her conversation analysis of
self-repairs in German, Uhmann (1997) also
mentions that repairs tend to appear at
constituent boundaries in most cases, i.e.,
deleting the problem sequences involved in repairs
will result in the utterances containing speech
repairs becoming well-formed.
4 I have one cube with a[feminine, singular, dative,
indefinite] with a[neuter, singular, dative, indefinite]
bolt.
1.3 Lexis-Oriented Production of Chinese
Speech Repairs
One way to illustrate the differences between
languages is to examine and compare the
types of speech repairs in the respective
languages. The modern description
methodologies of grammar structures in German
and Chinese (Chao 1968, Li and Thompson
1981) originated from similar theoretical
backgrounds. However, Chinese has a great
variety of compound words, but lacks
grammatical markings at the morphological
level. To be more specific, word formation
in Chinese is accomplished by combining
morphemes, where each morpheme has its own
lexical content and orthographic character. This
is essentially different from the
syntactic-morphological derivation as well as
compounding in German.
Lee and Chen (1997) classified Chinese speech
repairs in patterns and developed a language
model for their language recognition system to
cope with speech repairs. However, they did not
carry out any further investigations on the
structure of repairs. In contrast to the
production of German speech repairs, Chui
(1996) proposed, in her studies on repairs in
Chinese spoken conversations, that syntax seems
to play a less important role than the lexical
complexity and the size of words in the
production of Chinese speech repairs. For
instance, not the constituent boundaries, but the
completeness of the lexical content and the
scope of the lexical quantity of the words should
([...]) and engineer ([...]) in the utterance
[...] 5 are the major factors which influence the
production of repairs.
5 Verbatim translation: He should
NEGATION-particle should promote engineer(word
fragment) engineer so quickly DISCOURSE-particle.
Sentential translation: He should should not be
promoted to engineer(word fragment) engineer so
soon.
2 Data and Corpus
In order to examine the production of speech
repairs in different languages, the German
corpus BAUFIX and the Chinese corpus
TWPTH were chosen to carry out further
comparative analyses.
2.1 German Data: BAUFIX
The BAUFIX corpus (Sagerer et al. 1994)
consists of 22 digitally recorded German
human-human dialogues. 44 participants
co-operated in pairs as instructor and constructor;
their task was to build a toy plane.
Because visual contact between dialogue
partners was limited in some cases, subjects
had to rely on verbal communication to a
great extent. This corpus setting was especially
constructed to force subjects to repair their
speech errors. Since this paper investigates
repair syntax, the corpus analysis is
mainly concerned with immediate self-repairs.
They were identified and hand-annotated by the
author. In total, 500 speech repairs were
classified according to their syntactic attributes
such as categories and parts of speech. They
were subsequently analysed with respect to the
location of interruption and their repair structure.
2.2 Mandarin Chinese Data: Taiwan 
Putonghua Corpus (TWPTH) 
The Taiwan Putonghua Corpus (TWPTH), where
Putonghua refers to Mandarin Chinese, was
recorded in Taiwan. The speakers were all born
in Taiwan and their first language is Taiwanese
(Southern Min). The speakers were instructed
in advance to speak in their usual
conversation style and they could speak on any
topic they wanted to, or even on no topic at all.
Thus, spontaneous and conversation-oriented
speech data were obtained. A total of 40
speakers were recorded in five dialogues
and 30 monologues. Three dialogues were
analysed for the study in this paper and each is
about 20 minutes long. In total, 325 immediate
speech repairs were identified in these three
dialogues and they were annotated according to
the POS system developed for the Sinica Corpus
(CKIP 1995).
2.3 Comparison of Repair Data 
Some central statistics on BAUFIX and TWPTH
data are summarised in Table 1: 
Table 1: Summary Statistics
                        BAUFIX         TWPTH
Language                German         Mandarin Chinese
total no. of words      35036 words    9168 words
                                       47655 characters
total no. of repairs    500            325
no. words involved      1823 words     950 words
in repairs                             1622 characters
% repair-words of       5.2 %          10.4 % (word)
total words                            3.4 % (character)
% of phrases            PP 34.8 %      VP 35.7 %
involved in repairs     NP 38 %        NP 41.2 %
Table 1 shows that the percentage of problem
words (words involved in speech repairs) is
similar in the BAUFIX and TWPTH corpora,
provided that a suitable counting unit is chosen
for Chinese. With regard to the number of words
(i.e. lexical items), 10.4% of all words in TWPTH
are involved in repair sequences, whereas only 5.2%
of words in BAUFIX are found in repair
sequences. However, the statistics show a
more closely related pattern, 3.4% and
5.2% respectively, if we consider the number of
characters instead of words in Chinese. Chinese
words can be mono- or multi-syllabic. In
Chinese, lexical items are composed of
characters, where each character is an
independent meaningful monosyllabic
morpheme. This study can possibly provide
insights into the role of characters in Chinese at
syntactic and morphological levels.
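The percentages in Table 1 follow directly from the raw counts. The short sketch below recomputes them; the dictionary layout and the function name `repair_share` are chosen here for illustration only.

```python
# (units involved in repairs, total units), taken from Table 1
corpus_counts = {
    "BAUFIX (words)": (1823, 35036),
    "TWPTH (words)": (950, 9168),
    "TWPTH (characters)": (1622, 47655),
}


def repair_share(involved: int, total: int) -> float:
    """Percentage of units involved in repairs, rounded to one decimal place."""
    return round(100.0 * involved / total, 1)


shares = {name: repair_share(*counts) for name, counts in corpus_counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct} %")
```

Counting Chinese characters rather than words moves the TWPTH share from above the German figure to below it, which is the contrast discussed in the text.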
Other interesting results that can be noted from
Table 1 are the types of phrases involved in
repair sequences. In BAUFIX, because of the
task-oriented corpus setting, few verbs were
used. Instead, the focus is more on NPs and PPs,
since the speakers had to express exactly what
the parts look like and where to place them.
In contrast to BAUFIX, the TWPTH speakers
did not have to give exact descriptions.
Therefore, a considerable number of verbs were
used, which we can observe from the high
percentage of VPs involved in repair sequences.
However, in both corpora, NPs make up a high
percentage, 38% and 41.2% respectively. For
this reason, NPs will be further investigated for
their syntactic structures.
3 Analysis of Repair Syntax in NPs 
This section is concerned with the distribution
and patterns of NPs in the context of repair
syntax in German and Mandarin Chinese.
3.1 Regular Patterns
Among 190 NPs involved in repair sequences in
BAUFIX, there are 147 NPs for which the
internal structure within the NPs can be given
exactly as follows (Tseng 1999),
NP => N
NP => DET + N
NP => DET + ADJ
NP => DET + ADJ + N
NP => DET + ADJ + ADJ + N
NP => so + DET + N
NP => so + DET + ADJ + N
NP => so + DET + ADJ + ADJ + N
where the other 43 NPs in repairs are abridged
repairs; therefore, their internal structures cannot
be determined.
Compared with German NP-repairs, Chinese
speakers produce rather simple repair sequences
in NPs. Only 62.7% (84 out of 134) of Chinese
repairs found in the corpus are single NP phrases.
The rest of the repair sequences in which NPs are
involved contain other phrasal categories such
as verb phrases or adverbials. Since these
dialogues are concerned with normal
everyday conversations, no complicated noun
phrases were used. These NP-repairs have the
following structures:
NP => N
NP => DET
NP => DET + N
NP => ADJ + N
NP => QUAN + CLASS
NP => QUAN + CLASS + N
where QUAN denotes numbers and CLASS
means classifiers in Chinese.
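The two rule inventories can be written down as sets of tag sequences, which makes it easy to check a tagged NP against them. This is a minimal sketch: the variable names and the helper `matches` are assumptions made here, and the tags are simply those used in the rules above.

```python
# NP structures observed in German NP-repairs (BAUFIX)
GERMAN_NP = {
    ("N",),
    ("DET", "N"),
    ("DET", "ADJ"),
    ("DET", "ADJ", "N"),
    ("DET", "ADJ", "ADJ", "N"),
    ("so", "DET", "N"),
    ("so", "DET", "ADJ", "N"),
    ("so", "DET", "ADJ", "ADJ", "N"),
}

# NP structures observed in Chinese NP-repairs (TWPTH)
CHINESE_NP = {
    ("N",),
    ("DET",),
    ("DET", "N"),
    ("ADJ", "N"),
    ("QUAN", "CLASS"),
    ("QUAN", "CLASS", "N"),
}


def matches(tags, patterns) -> bool:
    """True if the tag sequence is one of the listed NP structures."""
    return tuple(tags) in patterns


shared = GERMAN_NP & CHINESE_NP
print(sorted(shared))
print(matches(["DET", "ADJ", "N"], GERMAN_NP))
print(matches(["QUAN", "CLASS", "N"], GERMAN_NP))
```

Only the structures N and DET + N occur in both inventories; the classifier patterns are specific to the Chinese data.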
3.2 Syntactic Formalization 
Of the 147 German NP repairs with a specified
internal structure, 83.4% start at
phrase-initial positions and end at
phrase-final positions. In the Chinese data, only
three NP-repairs among the 84 single NP-repairs
were not traced back to the phrase-initial
position. Phrasal boundaries play a role while
speech repairs are produced in both languages,
especially phrase-initial positions before the
reparandum. The syntactic structure of the
majority of German and Chinese repairs in NPs
can be formally described by means of phrasal
modelling.
Figure 1: Phrasal Modelling of German NP-Repairs
Figure 1 models 50% of NP repair sequences of
the type DET ADJ N in BAUFIX, where the
reflexive arrow on DET designates the sequence
DET DET. The first DET can be a fragmentary
or a false determiner, whereas the second DET is
supposed to be the corrected word.
The initial element DET in a German noun
phrase, i.e. the phrase-initial boundary, is the
most frequent location at which a repair is
restarted. In other words, while producing
repairs, speakers tend to go back to the
determiner to repair NPs.
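The reflexive arrow on DET can be illustrated with a regular expression over tag sequences. This sketch covers only what the text states about Figure 1 (a repeatable DET followed by ADJ and N); the names `FIGURE_1`, `fits_figure_1` and `collapse_restart` are hypothetical, and any further arcs of the figure are not represented.

```python
import re

# One or more determiners (the self-loop on DET), then ADJ, then N.
FIGURE_1 = re.compile(r"(?:DET )+ADJ N")


def fits_figure_1(tags) -> bool:
    """True if the tag sequence is DET ADJ N with an optionally repeated DET."""
    return FIGURE_1.fullmatch(" ".join(tags)) is not None


def collapse_restart(tags):
    """Drop all but the last determiner of a leading run of DETs."""
    i = 0
    while i + 1 < len(tags) and tags[i] == "DET" and tags[i + 1] == "DET":
        i += 1
    return tags[i:]


repaired = ["DET", "DET", "ADJ", "N"]   # e.g. a false determiner, then the corrected one
print(fits_figure_1(repaired))
print(collapse_restart(repaired))
```

Collapsing the repeated determiner yields the fluent sequence DET ADJ N, in line with the observation that removing the problem sequence leaves a well-formed phrase.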
Although the data investigated here are not
necessarily representative of most Chinese
speakers, this result does not empirically
confirm Chui's conclusion (1996) that syntax
should play a less important role than the lexical
complexity and the quantity constraint of the
to-be-repaired lexical items. Instead, the
phrase-initial position seems to be the location
to restart repairs in Chinese. Therefore, the
results indicate that the lexical content of the
to-be-repaired items tends to play a less
important role than syntax in both languages.
3.3 Cross-Linguistic Differences 
In contrast to the similarities between German
and Chinese speech repairs mentioned in the
sections above, differences can also be identified
through a comparison of repair syntax in German and
Mandarin Chinese. It is more common for NPs
in German to be repaired directly within NPs,
whereas in Chinese NPs are often repaired
within a more complex syntactic context, i.e.
Chinese repairs are composed of more than one
phrasal category. To investigate the syntactic
and morphological distribution of speech repairs
in both languages, the length of retracing in both
languages is examined. The results are presented
in Table 2.
Table 2: Distribution of Retracing
retraced words     German      Chinese
or characters      (words)     (characters)
0                  22.5%       3.6%
1                  62.9%       61.9%
2                  12.9%       27.4%
3                  1.7%        6%
4                  0           1.2%
No similarity between German and Chinese was
obtained by checking the number of retraced
words in Chinese, because the majority of "the
retraced parts" in Chinese are word fragments.
But it is clearly shown in Table 2 that German
words and Chinese characters play a similar role
in the production of speech repairs. Whether this
has to do with the syllabic weighting in both
languages or the semantic content of characters
in Chinese needs further linguistic investigation.
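The comparison in Table 2 can be made concrete by computing, for each language, the most frequent retracing length and the average retracing length. The sketch below uses only the percentages from the table; the function names are chosen here for illustration, and the averages are derived values that do not appear in the table itself.

```python
# Percentage of repairs by number of retraced units (index = 0..4), from Table 2
retracing = {
    "German (words)": [22.5, 62.9, 12.9, 1.7, 0.0],
    "Chinese (characters)": [3.6, 61.9, 27.4, 6.0, 1.2],
}


def most_frequent_length(percentages) -> int:
    """Retracing length with the highest share."""
    return max(range(len(percentages)), key=lambda k: percentages[k])


def mean_length(percentages) -> float:
    """Average retracing length, normalised by the column total."""
    total = sum(percentages)
    return sum(k * p for k, p in enumerate(percentages)) / total


for name, column in retracing.items():
    print(name, most_frequent_length(column), round(mean_length(column), 2))
```

In both columns a retracing span of exactly one unit is by far the most frequent case, which is the similarity the text points to.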
4 Formal Modelling 
With regard to the relation between repair syntax and the
editing structure in repairs, instead of only
looking into their surface structure, the syntactic
regularity in German and Chinese NP-repairs
can be modelled in the form of finite state
automata. We again take German as an example.
4.1 Finite State Automata 
Finite state automata similar to M with
ε-transitions, denoted by a quintuple <Q, Σ, δ, q0,
F> defined as follows, can model more than 80%
of overall German NP-repairs:
Q = {q0, q1, q2, q3, qf},
Σ = {det, adj, n, det-d, adj-d, n-d, ε} 6,
q0 is the initial state,
F = {q3} and
δ(q0, det)=q1, δ(q1, adj)=q2, δ(q2, n)=q3,
δ(q0, det-d)=qf, δ(q1, adj-d)=qf, δ(q2, n-d)=qf,
δ(qf, ε)=q0, δ(q1, ε)=q0, δ(q2, ε)=q0,
δ(q3, ε)=q0
M is graphically illustrated in Figure 2. Several
particularities are described in this automaton.
First, when NP-repairs are produced, no matter
where the real problem word is located (it can be
det-d, adj, adj-d, n or n-d), speakers tend to go
back to the phrase-initial position to restart their
speech. In the case of NPs, the determiner is the
most frequent location for re-initiating correct
speech. Second, the final position is in most cases
phrase-final. Therefore, in M, there is only one
final state q3. This models the coherence within
NP phrases in German: speakers usually
complete phrases after they have started them.
6 Det-d, adj-d, and n-d denote fragmentary (or false)
determiners, adjectives and nouns respectively.
Figure 2: Finite State Automaton M
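The automaton M can be simulated in a few lines. The following sketch assumes that the ε-transitions lead from qf, q1, q2 and q3 back to q0, as in the definition of δ; the names `DELTA`, `EPSILON_TO_Q0` and `accepts` are introduced here for illustration and are not part of the original model description.

```python
# Labelled transitions of M
DELTA = {
    ("q0", "det"): "q1", ("q1", "adj"): "q2", ("q2", "n"): "q3",
    ("q0", "det-d"): "qf", ("q1", "adj-d"): "qf", ("q2", "n-d"): "qf",
}
# States with an epsilon-transition back to the initial state q0
EPSILON_TO_Q0 = {"qf", "q1", "q2", "q3"}
START, FINAL = "q0", {"q3"}


def accepts(symbols) -> bool:
    """Simulate M on a sequence of input symbols, following epsilon-moves."""
    states = {START}
    for symbol in symbols:
        # Epsilon closure: a restart silently returns to the phrase-initial state.
        if states & EPSILON_TO_Q0:
            states = states | {START}
        states = {DELTA[(s, symbol)] for s in states if (s, symbol) in DELTA}
        if not states:
            return False
    return bool(states & FINAL)


print(accepts(["det", "adj", "n"]))                        # fluent NP
print(accepts(["det-d", "det", "adj", "n"]))               # false determiner, restart
print(accepts(["det", "adj", "n-d", "det", "adj", "n"]))   # fragmentary noun, restart
print(accepts(["det", "adj"]))                             # phrase not completed
```

Because q3 is the only final state, every accepted sequence ends with a completed DET ADJ N phrase, however many restarts precede it. With exactly the transitions listed, a sequence such as det n is rejected, so the shorter NP structures would need additional arcs.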
4.2 Discussion 
The FSA M suggested above is suitable for the
syntactic characteristics of speech repairs in both
German and Chinese. Repair syntax has been
taken into consideration from a procedural point
of view, instead of simply describing the
sequential structures. In this model, probabilities
(for instance, word frequency or acoustic
features) on the arcs can be implemented to
operate a parsing system which can deal with
speech repairs. However, speech data of
appropriate size are needed to obtain significant
probabilities.
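How probabilities on the arcs could be used is sketched below. The values in `ARC_PROB` are placeholders invented purely for illustration; they are not estimated from BAUFIX or TWPTH, and the names `ARC_PROB` and `log_score` are assumptions of this sketch.

```python
import math

# Hypothetical arc probabilities (NOT corpus estimates); the outgoing
# probabilities of each state sum to 1. "eps" stands for an epsilon-transition.
ARC_PROB = {
    ("q0", "det"): 0.9, ("q0", "det-d"): 0.1,
    ("q1", "adj"): 0.7, ("q1", "adj-d"): 0.1, ("q1", "eps"): 0.2,
    ("q2", "n"): 0.8, ("q2", "n-d"): 0.1, ("q2", "eps"): 0.1,
    ("qf", "eps"): 1.0,
}


def log_score(path) -> float:
    """Log-probability of a path given as a list of (state, symbol) arcs."""
    return sum(math.log(ARC_PROB[arc]) for arc in path)


fluent = [("q0", "det"), ("q1", "adj"), ("q2", "n")]
restarted = [("q0", "det-d"), ("qf", "eps")] + fluent

print(log_score(fluent), log_score(restarted))
```

With weights of this kind a parser could rank competing analyses of an utterance, preferring a fluent reading where one exists and falling back to a repair reading otherwise. Reliable weights would have to be estimated from a corpus of sufficient size.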
For more linguistic insights into the
word-character relations in Chinese or across
languages, i.e. the overlapping syntactic and
morphological role of phrasal boundaries,
further modification is needed to make repair
processing and detection in the Chinese case
more realistic.
Conclusion 
This paper has shown that speech repairs not
only play a decisive role in speech processing
technology systems, but also provide empirical
evidence and insights into the inherent linguistic
characteristics of languages. Based on the results
of corpus analysis, similar syntactic features of
speech repairs in German and Chinese were
identified and the repair syntax was formally
modelled by means of phrasal modelling and
finite state automata. A discrepancy at the
morphological level of both languages was
shown and more detailed investigations are
necessary. Further analyses of acoustic-prosodic
features of cross-linguistic data are currently
being carried out.
Acknowledgements 
I'd like to thank the Sonderforschungsbereich
(SFB 360) colleagues in Bielefeld who collected
and pre-processed the BAUFIX data as well as
the colleagues in the Industrial Research
Technology Institute (IRTI) in Chu-Dong who
kindly supported me with the TWPTH corpus
data. Without them the investigation described
in this paper would not have been carried out
and this paper could not possibly have been
written.

References 

Chao Y.-R. (1968) A Grammar of Spoken Chinese.
Berkeley: University of California Press.

Chui K.-W. (1996) Organization of Repair in
Chinese Conversation. Text 16/3, pp. 343-372.

CKIP (1995) Sinica Balanced Corpus. Technical
Report no. 95-02/98-04. (in Chinese)

Eklund R. and Shriberg E. (1998) Cross-Linguistic
Disfluency Modeling: A Comparative Analysis of
Swedish and American English Human-Human
and Human-Machine Dialogs. In: Proceedings
of ICSLP'98. Sydney, Australia. pp. 2631-2634.

Heeman, P. and Allen, J. (1999) Speech Repairs,
Intonational Phrases and Discourse Markers:
Modelling Speakers' Utterances in Spoken
Dialogue. Computational Linguistics 25/4. to
appear.

Hindle, D. (1983) Deterministic Parsing of
Syntactic Non-fluencies. In: ACL'83.
Philadelphia, USA. pp. 123-128.

Lee Y.-S. and Chen H.-H. (1997) Using Acoustic
and Prosodic Cues to Correct Chinese Speech
Repairs. In: Proceedings of EUROSPEECH'97.
Rhodes, Greece. pp. 2211-2214.

Levelt W. J. (1983) Monitoring and Self-Repair in
Speech. Cognition 14. pp. 41-104.

Li C. and Thompson S. (1981) Mandarin Chinese:
A Functional Reference Grammar. Berkeley:
University of California Press.

Lickley, R. J. and Bard, E. G. (1998) When Can
Listeners Detect Disfluency in Spontaneous
Speech? Language and Speech 41/2. pp.
203-226.

Nakatani, C. and Hirschberg, J. (1994) A
Corpus-Based Study of Repair Cues in
Spontaneous Speech. Journal of the Acoustical
Society of America 95. pp. 1603-1616.

Sagerer G. and Eikmeyer H. and Rickheit G. (1994)
"Wir bauen jetzt ein Flugzeug": Konstruieren
im Dialog. Arbeitsmaterialien, Technical Report.
SFB 360 "Situierte Künstliche Kommunikation".
University of Bielefeld, Germany.

Tseng S.-C. (1999) Grammar, Prosody and Speech
Disfluencies in Spoken Dialogues. PhD Thesis.
University of Bielefeld, Germany.

Tseng S.-C. (1998) A Linguistic Analysis of Repair
Signals in Co-operative Spoken Dialogues. In:
Proceedings of ICSLP'98. Sydney, Australia. pp.
2099-2102.

Uhmann, S. (1997) Selbstreparaturen in
Alltagsdialogen: Ein Fall für eine integrative
Konversationstheorie. In: Syntax des
gesprochenen Deutschen. Ed. Schlobinski.
Westdeutscher Verlag. pp. 157-180.
