<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3032"> <Title>SYNTACTIC NORMALIZATION OF SPONTANEOUS SPEECH*</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> PROBLEM </SectionTitle> <Paragraph position="0"> Generally, the development of grammars, formalisms and natural language processors is based on written language data or, sometimes, not on real data at all but on invented 'example sentences'. This holds for both computational and general linguistics. Thus many parsing systems that work quite well for sentences like 1a. and 1b. fail if they are applied to the authentic data in 2a. and 2b.:</Paragraph> <Paragraph position="1"> 1a. die Grundform ist nicht eckig
'the basic form is not angular'</Paragraph> <Paragraph position="2"> 1b. das blaue habe ich als Waage auf dem grünen liegen
'I have got the blue one lying upon the_DAT green_DAT one_DAT like a balance'</Paragraph> <Paragraph position="3"> 2a. die die Grund die Grundform sind is nich is nich eckig
'the the basic the basic form are is not is not angular'</Paragraph> <Paragraph position="4"> 2b. das blaue hab ich als Waage auf das grüne liegen
'I have got the blue one lying upon the_ACC green_ACC one_ACC like a balance'</Paragraph> <Paragraph position="5"> To native recipients the utterances in 2. appear to be more or less defective, but interpretable expressions. Moreover, the interpretation of 2a. or 2b. might require even less effort than, for instance, understanding an absolutely grammatical 'garden path sentence'. Since utterances like 2a. and 2b. occur quite frequently in spontaneous speech, an approach to parsing everyday language has to provide techniques that cover repairs, ungrammatical repetitions (2a.), case-assignment violations (2b.), agreement errors and other phenomena that have been summarized under the label 'ill-formed' in earlier research (Kwasny/Sondheimer 1981, Jensen et al. 1983, Weischedel/Sondheimer 1983, Lesmo/Torasso 1984, Kudo et al. 1988).
* I am indebted to Dafydd Gibbon, Hans Karlgren and Hannes Rieser for their comments on earlier drafts of this paper. This research was supported by the Deutsche Forschungsgemeinschaft. Some aspects are discussed in more detail in Langer 1990.</Paragraph> <Paragraph position="6"> Though the present paper will adhere to this terminology, it should be emphasized that it is not presupposed that there are any general criteria precise enough to tell us exactly whether some utterance is 'ill-formed' relative to a natural language. Let us assume, instead, that some utterance U is 'ill-formed (defective, irregular, ...) with respect to a grammar G' iff U is not a sentence of the language specified by G. Since, for instance, repairs exhibit a high degree of structural regularity (cf. Schegloff et al. 1977, Levelt 1983, Kindt/Laubenstein in preparation), one might prefer to describe them within the grammar and not within some other domain (e.g. within a production/perception model). Therefore the concept 'ill-formed' is used as a relational term that always has to be re-defined with respect to the given context.</Paragraph>
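To make this relational reading of 'ill-formed' concrete, the following minimal sketch checks whether a token string belongs to the language specified by a grammar G; the toy grammar, lexicon and function names are invented for illustration and are not taken from the paper or from NOBUGS. Relative to this particular G, example 1a. is well-formed while 2a. is not.

# Minimal sketch: an utterance U is 'ill-formed with respect to a grammar G'
# iff U is not a sentence of the language specified by G.
# The grammar below is a hypothetical toy CFG in Chomsky normal form.
from itertools import product

RULES = {                      # binary rules: (left child, right child) -> parent
    ("NP", "VP"): "S",
    ("DET", "N"): "NP",
    ("COP", "AP"): "VP",
    ("NEG", "ADJ"): "AP",
}
LEXICON = {                    # preterminal assignments for the words of 1a.
    "die": "DET", "Grundform": "N", "ist": "COP",
    "nicht": "NEG", "eckig": "ADJ",
}

def in_language(tokens):
    """CYK recognition: True iff the token sequence is in L(G)."""
    n = len(tokens)
    if n == 0 or any(t not in LEXICON for t in tokens):
        return False           # unknown words already place U outside L(G)
    # chart[i][j] holds the categories covering tokens[i : i+j+1]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, t in enumerate(tokens):
        chart[i][0].add(LEXICON[t])
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            for split in range(1, span):
                lefts = chart[start][split - 1]
                rights = chart[start + split][span - split - 1]
                for l, r in product(lefts, rights):
                    if (l, r) in RULES:
                        chart[start][span - 1].add(RULES[(l, r)])
    return "S" in chart[0][n - 1]

# 1a. is in L(G); the repetition-laden 2a. is ill-formed relative to this grammar:
print(in_language("die Grundform ist nicht eckig".split()))                           # True
print(in_language("die die Grund die Grundform sind is nich is nich eckig".split()))  # False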
<Paragraph position="7"> There have been two main directions in the prior research on ill-formedness. The one direction has focussed on the problem of parsing ill-formed input in restricted domain applications, such as natural language interfaces to databases or robot assembly systems (Lesmo/Torasso 1984, Selfridge 1986, Carbonell/Hayes 1987). Though the techniques developed in that field seem to be quite adequate for the intended purposes, the results are not directly transferable to the interpretation of spontaneous speech, since the restrictions affect not only the topical domains but also the linguistic phenomena under consideration: e.g. the CASPAR parser (cf. Carbonell/Hayes 1987) is restricted to a subset of imperatives, Lesmo/Torasso (1984) achieve interpretations for ill-formed word order only at the price of neglecting long distance dependencies, etc.</Paragraph> <Paragraph position="8"> The other main direction has been the 'relaxation' approach (Kwasny/Sondheimer 1981, Weischedel/Sondheimer 1983). The basic idea is to relax those grammatical constraints an input string does not meet, if a parse would fail otherwise. The main problem of this approach is that relaxing constraints (i.e. ignoring them) makes a grammar less precise. Thus, for instance, a noun phrase that lacks agreement in number is analysed as a noun phrase without number, and it remains unexplicated how this analysis might support a further interpretation. Surprisingly, none of these papers concentrates on real-life spontaneous speech (most of them are explicitly concerned with written man-machine communication).</Paragraph> <Paragraph position="9"> The present paper focusses on the problem of normalization, i.e. how to define the relation between ill-formed utterances (e.g. 2a. and 2b.) and their well-formed 'counterparts' (1a. and 1b.). A sentence is an adequate normalization of an ill-formed utterance if it corresponds to our intuitions about what the speaker might have intended to say. This is, of course, not observable, but a request for repetition (which typically does not give rise to a literal repetition in the case of an utterance like 2a.) might serve as a suitable test. In the present approach normalization is based on solely syntactic heuristics, not because syntactic information is regarded to be sufficient, but as a starting point for further work. Thus, the normalizations achieved on the basis of these heuristics serve as default interpretations that have to be evaluated using additional information about the linguistic and situational context. The empirical background is a corpus of authentic German dialogues about block worlds that has been recorded for the study of coherence phenomena (cf. Forschergruppe Kohärenz [ed.] 1987).</Paragraph> <Paragraph position="10"> I will discuss three heuristics that are used in an experimental normalization system called NOBUGS (NOrmalisierungskomponente im Bielefelder Unifikationsbasierten Analysesystem für Gesprochene Sprache; the normalization component of a Bielefeld unification-based speech analysis system). The core of NOBUGS is a left-corner parser that interprets a GPSG-like formalism encoded in DCG notation. The grammars used with NOBUGS are very restrictive and exclude everything that is beyond the bounds of written standard German. But in combination with the heuristics I will discuss now, the system is capable of handling a wider range of phenomena including morpho-syntactic deviations, explicit repair and ungrammatical repetitions.</Paragraph> </Section> </Paper>