SYNTACTIC NORMALIZATION OF SPONTANEOUS SPEECH* 
Hagen Langer, University of Bielefeld, W-Germany 
ABSTRACT 
This paper presents some techniques that make a
standard parsing system applicable to the analysis of
ill-formed utterances. These techniques are feature
generalization and heuristically driven deletions.
PROBLEM 
Generally the development of grammars, formalisms 
and natural language processors is based on written 
language data or, sometimes, not real data at all, but 
invented 'example sentences'. This holds for both 
computational and general linguistics. Thus many 
parsing systems that work quite well for sentences like 
1a. and 1b. fail if they are applied to the authentic
data in 2a. and 2b.:
1a. die Grundform ist nicht eckig
    the basic form is not angular
1b. das blaue habe ich als Waage auf dem grünen liegen
    I have got the blue one lying upon the[DAT] green[DAT]
    one[DAT] like a balance
2a. die die Grund die Grundform sind is nich is nich eckig
    the the basic the basic form are is not is not angular
2b. das blaue hab ich als Waage auf das grüne liegen
    I have got the blue one lying upon the[ACC] green[ACC]
    one[ACC] like a balance
To native recipients the utterances in 2. appear to be 
more or less defective, but interpretable expressions. 
Moreover, the interpretation of 2a. or 2b. might re- 
quire even less effort than, for instance, understanding 
an absolutely grammatical 'garden path sentence'. 
Since utterances like 2a. and 2b. occur quite fre- 
quently in spontaneous speech, an approach to parsing 
everyday language has to provide techniques that 
cover repairs, ungrammatical repetitions (2a.), case- 
assignment violation (2b.), agreement errors and other 
phenomena that have been summarized under the label 
'ill-formed' in earlier research (Kwasny/Sondheimer
1981, Jensen et al. 1983, Weischedel/Sondheimer
1983, Lesmo/Torasso 1984, Kudo et al. 1988).
* I am indebted to Dafydd Gibbon, Hans Karlgren and Hannes
Rieser for their comments on earlier drafts of this paper. This
research was supported by the Deutsche Forschungsgemeinschaft.
Some aspects are discussed in more detail in Langer 1990.
Though the present paper will adhere to this termi- 
nology, it should be emphasized that it is not pre- 
supposed that there are any general criteria precise 
enough to tell us exactly whether some utterance is 
'ill-formed' relative to a natural language. Let us 
assume, instead, that some utterance U is 'ill-formed
(defective, irregular, ...) with respect to a grammar
G' iff U is not a sentence of the language specified by
G. Since, for instance, repairs exhibit a high degree
of structural regularity (cf. Schegloff et al. 1977,
Levelt 1983, Kindt/Laubenstein in preparation), one
might prefer to describe them within the grammar and
not within some other domain (e.g. within a
production/perception model). Therefore the concept
'ill-formed' is used as a relational term that always has to
be re-defined with respect to the given context.
There have been two main directions in the prior 
research on ill-formedness. The one direction has 
focussed on the problem of parsing ill-formed input in 
restricted domain applications, such as natural 
language interfaces to databases or robot assembly 
systems (Lesmo/Torasso 1984, Selfridge 1986,
Carbonell/Hayes 1987). Though the techniques
developed in that field seem to be quite adequate for the
intended purposes, the results are not directly
transferable to the interpretation of spontaneous speech,
since the restrictions affect not only the topical
domains but also the linguistic phenomena under
consideration: e.g. the CASPAR parser (cf. Carbonell/Hayes
1987) is restricted to a subset of imperatives, and
Lesmo/Torasso (1984) achieve interpretations for
ill-formed word order only at the price of neglecting
long-distance dependencies, etc.
The other main direction has been the 
'relaxation'-approach (Kwasny/Sondheimer 1981, 
Weischedel/Sondheimer 1983). The basic idea is to 
relax those grammatical constraints an input string 
does not meet, if a parse would fail otherwise. The 
main problem of this approach is that relaxing con- 
straints (i.e. ignoring them) makes a grammar less 
precise. Thus, for instance, a noun phrase that lacks 
agreement in number is analysed as a noun phrase 
without number and it remains unexplicated how this 
analysis might support a further interpretation.
Surprisingly, none of these papers concentrates on real-life
spontaneous speech (most of them are explicitly
concerned with written man-machine communication).
The present paper focusses on the problem of
normalization, i.e. how to define the relation between
ill-formed utterances (e.g. 2a. and 2b.) and their
well-formed 'counterparts' (1a. and 1b.). A sentence is an
adequate normalization of an ill-formed utterance if it
corresponds to our intuitions about what the speaker
might have intended to say. This is, of course, not
observable, but a request for repetition (which typically
does not give rise to a literal repetition in the case of
an utterance like 2a.) might serve as a suitable test.
In the present approach normalization is based
solely on syntactic heuristics, not because syntactic
information is regarded as sufficient, but as a starting
point for further work. Thus, the normalizations
achieved on the basis of these heuristics serve as
default interpretations that have to be evaluated using
additional information about the linguistic and
situational context. The empirical background is a corpus
of authentic German dialogues about block worlds that
has been recorded for the study of coherence
phenomena (cf. Forschergruppe Kohärenz [ed.] 1987).
I will discuss three heuristics that are used in an
experimental normalization system called NOBUGS
(NOrmalisierungskomponente im Bielefelder
Unifikationsbasierten Analysesystem für Gesprochene Sprache:
normalization component of a Bielefeld unification-based
speech analysis system). The core of
NOBUGS is a left-corner parser that interprets a
GPSG-like formalism encoded in DCG notation. The
grammars used with NOBUGS are very restrictive and 
exclude everything that is beyond the bounds of writ- 
ten standard German. But in combination with the 
heuristics I will discuss now the system is capable of 
handling a wider range of phenomena including 
morpho-syntactic deviations, explicit repair and un- 
grammatical repetitions. 
MORPHO-SYNTACTIC DEVIATIONS 
Morpho-syntactic deviations make up a considerable 
proportion of errors both in spoken and written 
German (German has a much more complex inflect- 
ional morphology than English). 
The basic principle of this approach to normaliza- 
tion is as follows: 
Try to find out which properties of a given input
string make a parse fail and use the given grammatical
knowledge to alter the input string minimally so
that it is as similar as possible to its initial state but
without the properties that caused the failure.
What is meant by that can easily be seen if we con- 
sider an example where the property that makes a 
parse fail is evident, e.g. the string 'John sleep', 
which lacks the NP-VP agreement concerning person
and number that is required by the following rule:
    [cat = S     ]      [cat = NP    ]  [cat = VP    ]
    [person = X1 ]  ->  [case = nom  ]  [person = X1 ]
    [num = X2    ]      [person = X1 ]  [num = X2    ]
                        [num = X2    ]
This rule is not applicable to 'John sleep', since there 
are no lexical entries for 'John' and 'sleep',
respectively, that have unifiable specifications for person and
number, and this makes the whole parse fail. 
The strategy to account for strings like 'John 
sleep' consists of three steps: 
Step 1: Collect all lexical entries that match the
words of the input string and generalize them by
substituting variables for their morpho-syntactic
specifications (case, number, gender etc.).
Step 2: Parse the string using the generalized lexical
entries instead of the completely specified entries.
Step 3: If the parse with generalized specifications is
successful, the problem with the input string is
morpho-syntactic (agreement error or case-assignment
violation). Collect all preterminal categories (most of
them still contain variable morpho-syntactic
specifications) and try to unify them with fully specified lexical
entries. At least one matching entry will belong to
some item different from the corresponding word in
the input string. In that case replace the original word
by the matching item. If there are several different sets
of matching entries, choose the one that requires the
least number of substitutions and output it as the
default normalization (if several sets of matching
entries require the same least number of substitutions,
the normalization is ambiguous; in that case
output all of them).
Returning to our example string 'John sleep', let 
us assume that the grammar consists just of the rule
stated above and the following lexical entries: 
John:    person = 3, num = sg, cat = np, case = nom
sleep:   person = 3, num = pl, cat = vp
sleeps:  person = 3, num = sg, cat = vp
Generalizing the lexical entries for the input string 
'John sleep' will produce two new entries: 
John:   person = VAR1, num = VAR2, cat = np,
        case = VAR3
sleep:  person = VAR4, num = VAR5, cat = vp
A parse using these entries will be successful. The
application of the rule unifies the variable specifications
for number and person and instantiates nominative case
in the NP. The preterminal categories resulting
from the parse are:
    [person = VAR1]  [person = VAR1]
    [num = VAR2   ]  [num = VAR2   ]
    [cat = np     ]  [cat = vp     ]
    [case = nom   ]
Though the crucial specifications (person and num) 
are still variable the difference is now that there are 
the same variables in both categories. The (only) set 
of lexical entries that match with these preterminal 
categories requires the replacement of 'sleep' by 
'sleeps' and thus 'John sleeps' is the normalization of 
'John sleep'. 
Note that this strategy is not, in principle, limited 
to morpho-syntactic features. It might be useful for 
phonological and semantic normalization, as well. 
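The three steps can be illustrated with a minimal Python sketch for the 'John sleep' example. It is not the NOBUGS implementation: the single rule S -> NP VP is hard-coded instead of being interpreted by a left-corner parser, and the names Var, unify, generalize, parse_generalized, matches and normalize are assumptions introduced here. The 'stem' feature is likewise an assumption; it is left ungeneralized so that a substitute must be another form of the same word.

```python
from itertools import product

MORPH = ("person", "num", "case")  # features replaced by variables in step 1

# Toy lexicon from the example; the "stem" feature is an assumption of this sketch.
LEXICON = {
    "John":   {"cat": "np", "stem": "John",  "person": 3, "num": "sg", "case": "nom"},
    "sleep":  {"cat": "vp", "stem": "sleep", "person": 3, "num": "pl"},
    "sleeps": {"cat": "vp", "stem": "sleep", "person": 3, "num": "sg"},
}

class Var:
    """A feature variable; `ref` holds whatever it has been unified with."""
    def __init__(self):
        self.ref = None

def deref(x):
    # Follow variable bindings until a value or an unbound variable is reached.
    while isinstance(x, Var) and x.ref is not None:
        x = x.ref
    return x

def unify(a, b):
    a, b = deref(a), deref(b)
    if a is b:
        return True
    if isinstance(a, Var):
        a.ref = b
        return True
    if isinstance(b, Var):
        b.ref = a
        return True
    return a == b

def generalize(entry):
    """Step 1: substitute fresh variables for morpho-syntactic values."""
    return {k: (Var() if k in MORPH else v) for k, v in entry.items()}

def parse_generalized(words):
    """Step 2: apply S -> NP VP to generalized entries; return preterminals or None."""
    pre = [generalize(LEXICON[w]) for w in words]
    if len(pre) != 2:
        return None
    np, vp = pre
    ok = (np["cat"] == "np" and vp["cat"] == "vp"
          and unify(np["case"], "nom")
          and unify(np["person"], vp["person"])
          and unify(np["num"], vp["num"]))
    return pre if ok else None

def matches(pre, entry):
    """Step 3: does a fully specified entry unify with a preterminal category?"""
    return pre.keys() == entry.keys() and all(unify(pre[k], entry[k]) for k in pre)

def normalize(words):
    """Return every normalization that needs the fewest substitutions."""
    if parse_generalized(words) is None:
        return []  # the failure is not purely morpho-syntactic
    best, best_cost = [], None
    for combo in product(LEXICON, repeat=len(words)):
        pre = parse_generalized(words)  # fresh variables for each trial
        if not all(matches(p, LEXICON[w]) for p, w in zip(pre, combo)):
            continue
        cost = sum(a != b for a, b in zip(words, combo))
        if best_cost is None or cost < best_cost:
            best, best_cost = [" ".join(combo)], cost
        elif cost == best_cost:
            best.append(" ".join(combo))
    return best

print(normalize(["John", "sleep"]))  # ['John sleeps']
```

Re-parsing for every candidate set is wasteful but avoids having to undo variable bindings; a real unification-based system would more plausibly copy or backtrack over the preterminal categories.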
EXPLICIT REPAIR
When people detect an error during an utterance they 
often try to correct it immediately. This, in general, 
makes the utterance as a whole ungrammatical. The 
structure of an utterance containing a self repair is 
often: 
Left context - reparandum - repair indicator - reparans -
right context.
The reparandum is the part of the utterance that is to 
be corrected by the reparans. Typical repair indic- 
ators are interjections like 'uh no', 'nonsense', 'sorry' 
etc. The following example from our corpus shows 
that structure (note that the left context is empty in the 
original German version): 
Den linken   äh Quatsch   den roten   stellst du links hin
reparandum   indicator    reparans    right context

You put   the left one   eh nonsense   the red one   to the left
left c.   reparandum     indicator     reparans      right context
A plausible normalization of this utterance would be
'Den roten stellst du links hin' ('You put the red one
to the left'). This normalization differs from the original
utterance in that the reparandum and the repair
indicators have been deleted. The strategy to cover
this type of repair is to scan the input string
$w_1 w_2 \ldots w_n$ until a repair indicator sequence
$w_i w_{i+1} \ldots w_j$ is found ($1 < i < j < n$).
If there is such an explicit signal,
then there probably is something wrong immediately
before the repair sequence. But it is not clear what the
reparandum is. Possibly the reparandum is just the
word immediately before the repair indicator sequence
or a longer substring or even the whole substring
$w_1 w_2 \ldots w_{i-1}$. Which deletion of a substring
$w_k w_{k+1} \ldots w_j$ gives a grammatical sentence can only
be decided by the grammar. Thus it is necessary to parse the results
of the alternative deletions, beginning with
$w_1 \ldots w_{i-2} w_{j+1} \ldots w_n$ and incrementing the
length of the deleted substring until the parse succeeds.
If the deletion of a substring $w_k w_{k+1} \ldots w_j$ makes
a parse successful and if there is no other such deletion of a
substring $w_l w_{l+1} \ldots w_j$ such that $k < l$, then
$w_1 w_2 \ldots w_{k-1} w_{j+1} w_{j+2} \ldots w_n$ is the
normalization of the input string.
If applied to the utterance 'You put the left one eh 
nonsense the red one to the left' the first deletion 
gives 'You put the left the red one to the left' which 
is not accepted by the parser. The second alternative
tried ('You put the the red one to the left') fails, too. 
But the third attempt ('You put the red one to the 
left') is accepted by the parser and thus considered as 
the normalization of the original utterance. 
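The deletion strategy can be sketched in Python as follows. This is only an illustration under stated assumptions: the indicator list INDICATORS, the function names find_indicator and normalize_repair, and the set GRAMMATICAL (a stand-in for the decision of a real parser) are hypothetical and not taken from NOBUGS. Indices are 0-based here, whereas the text counts from 1.

```python
# Hypothetical repair indicator sequences (assumption of this sketch).
INDICATORS = [("eh", "nonsense"), ("uh", "no"), ("nonsense",), ("sorry",)]

def find_indicator(words):
    """Return (i, j), the inclusive span of the first repair indicator, or None."""
    for i in range(len(words)):
        for ind in sorted(INDICATORS, key=len, reverse=True):  # longest match first
            if tuple(words[i:i + len(ind)]) == ind:
                return i, i + len(ind) - 1
    return None

def normalize_repair(words, accepts):
    """Delete the indicator plus an ever longer reparandum until `accepts` succeeds."""
    span = find_indicator(words)
    if span is None:
        return None
    i, j = span
    for k in range(i - 1, -1, -1):  # k is the first deleted word; grow leftwards
        candidate = words[:k] + words[j + 1:]
        if accepts(candidate):
            return candidate
    return None

# Stand-in for the parser: the only string this toy "grammar" accepts.
GRAMMATICAL = {"you put the red one to the left"}

def accepts(words):
    return " ".join(words) in GRAMMATICAL

utterance = "you put the left one eh nonsense the red one to the left".split()
print(" ".join(normalize_repair(utterance, accepts)))
# you put the red one to the left
```

As in the example above, the first two candidates ('you put the left the red one to the left' and 'you put the the red one to the left') are rejected and the third is returned.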
UNGRAMMATICAL REPETITIONS 
Ungrammatical repetitions of single words or longer 
stretches occur quite frequently in spontaneous speech. 
As long as a sequence is repeated completely and 
without any alteration it is easy to detect the redun- 
dant duplication and remove it from the input string to 
get a normalized version. The problem is with incom- 
plete repetitions and repetitions that introduce new 
lexical items: 
Some blocks some red blocks are small.
\_________/ \_____________/
  part 1        part 2

Some red some blue blocks are small.
\______/ \______________/
 part 1      part 2
The deletion of the substrings indicated as 'part 1' in 
the utterances above, respectively, would yield a 
suitable normalization. Utterances of this kind are in 
many respects like the explicit repairs discussed 
above, but they lack indicators. Typically, part 2 is 
similar to part 1 in that at least some words occur in 
both substrings. Moreover, part 1 and part 2 often 
belong to the same category (e.g. NP in the utterances 
above). This similarity motivates the following heuris- 
tic: 
The input string $w_1 w_2 \ldots w_n$ is scanned for two
different occurrences, say $w_i$ and $w_j$ ($1 \le i < j \le n$),
of the same lexical item. $w_i$ and $w_j$ are permitted
to differ in their inflectional properties, since
an unsuitable inflection of $w_i$ might have been the
reason to repeat it in proper inflection as $w_j$ (e.g.
'He takes took a block'). If such a repetition is
found, the substring beginning with the first occurrence
up to the word immediately before the second
occurrence (i.e. $w_i w_{i+1} \ldots w_{j-1}$) is parsed. If the parse
is successful and yields some category C for the
substring, the next step is to find a prefix of
$w_j w_{j+1} \ldots w_n$ that belongs to the same category C.
If such a prefix exists and $w_1 w_2 \ldots w_{i-1} w_j w_{j+1} \ldots w_n$ is
accepted as a grammatical sentence, it is considered
to be the suitable normalization.
Let us apply this strategy to the utterance 'Some 
blocks some red blocks are small'. Scanning this input 
string from the left to the right will immediately find 
the repeated lexical item 'some'. The parse of the 
substring 'Some blocks' results in an NP and thus a 
prefix of 'some red blocks are small' is searched for 
which is also an NP. Such a prefix is found (i.e. 
'some red blocks') and therefore 'some red blocks are 
small' is tested if it is a grammatical sentence and, 
indeed, it is. 
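A minimal Python sketch of this heuristic is given below. The parser is again replaced by stand-ins: the tables CATEGORIES, SENTENCES and LEMMAS and the functions lemma, category_of and accepts are assumptions made for illustration, as is the left-to-right order in which pairs of occurrences are tried.

```python
def normalize_repetition(words, lemma, category_of, accepts):
    """Delete part 1 of a repetition if part 2 starts with a phrase of the same category."""
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n):
            if lemma(words[i]) != lemma(words[j]):
                continue  # not two occurrences of the same lexical item
            cat = category_of(words[i:j])  # parse part 1
            if cat is None:
                continue
            # Look for a prefix of the remainder that has the same category.
            has_prefix = any(category_of(words[j:m]) == cat
                             for m in range(j + 1, n + 1))
            candidate = words[:i] + words[j:]
            if has_prefix and accepts(candidate):
                return candidate
    return None

# Stand-ins for lexicon and parser (assumptions of this sketch).
CATEGORIES = {"some blocks": "NP", "some red blocks": "NP",
              "takes": "V", "took": "V"}
SENTENCES = {"some red blocks are small", "he took a block"}
LEMMAS = {"takes": "take", "took": "take"}

def lemma(word):
    return LEMMAS.get(word, word)

def category_of(words):
    return CATEGORIES.get(" ".join(words))

def accepts(words):
    return " ".join(words) in SENTENCES

utterance = "some blocks some red blocks are small".split()
print(" ".join(normalize_repetition(utterance, lemma, category_of, accepts)))
# some red blocks are small
```

Comparing lemmas rather than word forms lets the same function handle 'he takes took a block', where the two occurrences differ in inflection.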
RESULTS, CONCLUSIONS, FURTHER TASKS 
The normalization strategies outlined in this paper 
make a given standard parsing system applicable to
certain language phenomena that occur frequently in 
spontaneous speech, but deviate from the standards of 
written language. Additional rules, special grammar 
formalisms or fixed parsing algorithms are not requir- 
ed. 
If the parse succeeds, the analysis assigned to a 
deviating input is not only some partial structure 
description, but a well-formed sentence including its 
complete syntactic structure. 
Preliminary tests have shown that the normaliza- 
tions achieved by the strategies discussed in this paper 
are plausible default interpretations in most cases. Bad 
normalizations result from the lack of phonological, 
semantic and world knowledge. A typical example is 
'Take a red block oh no blue block' which gets in- 
correctly normalized into 'Take a red blue block', if 
the grammar accepts 'block' being specified by two
different color adjectives. If it does not, trying the
next alternative according to the explicit-repair
strategy described above will yield the most plausible
result 'Take a blue block'. Another way to avoid the
wrong normalization is to consult additional
phonological information about the input string. It is very
probable that there is a contrastive stress upon 'blue'
in the input utterance. Let us assume the rule: if there
is a word with contrastive stress in a reparans
sequence then there must be a suitable word in the
reparandum sequence to which it is in contrast. This
implies that 'red' must be part of the reparandum (and
thus has to be deleted) and rules out the wrong
normalization 'Take a red blue block'. A further task will
be to find out how additional semantic and
phonological information both in the grammar and in the
normalization strategies can be used to make the
normalization results more reliable.

REFERENCES 

Carbonell, J.G./Hayes, P.J.: Robust parsing using 
multiple construction-specific strategies. In: Bolc, 
Leonard [ed.]: Natural language parsing systems.
Berlin 1987. pp. 1-32. (Springer series symbolic 
computation - artificial intelligence). 

Forschergruppe Kohärenz [ed.]: "n Gebilde oder was"
- Daten zum Diskurs über Modellwelten.
KoLiBri-Arbeitsbericht 2. Bielefeld 1987.

Jensen, K./Heidorn, G.E./Miller, L.A./Ravin, Y.: 
Parse fitting and prose fixing: getting a hold on 
ill-formedness. In: AJCL 9 (1983), 147-160. 

Kindt, W./Laubenstein, U.: Reparaturen und Koordi- 
nationskonstruktionen. KoLiBri-Arbeitsbericht 20. 
(In preparation). 

Kudo, I./Koshino, H./Chung, M./Morimoto, T.:
Schema method: A framework for correcting 
grammatically ill-formed input. In: COLING 1988, 
341-347. 

Kwasny, S.C./Sondheimer, N.K.: Relaxation 
techniques for parsing ill-formed input in natural 
language understanding systems. In: AJCL 7
(1981), 99-108.

Langer, H.: Syntaktische Normalisierung gesprochener 
Sprache. KoLiBri-Arbeitsbericht 23. Bielefeld
1990. 

Lesmo, L./Torasso, P.: Interpreting syntactically ill- 
formed sentences. In: COLING 1984, 534-539. 

Levelt, W.J.M.: Monitoring and self-repair in speech. 
In: Cognition 14 (1983), 41-104. 

Schegloff, E.A./Jefferson, G./Sacks, H.: The preference
for self-correction in the organization of
repair in conversation. In: Language 53 (1977), 
361-382. 

Selfridge, M.: Integrated processing produces robust 
understanding. In: CL 12 (1986), 161-177.

Weischedel, R.M./Sondheimer, N.K.: Meta-Rules as a 
basis for processing ill-formed input. In: AJCL 9 
(1983), 161-177. 
