<?xml version="1.0" standalone="yes"?> <Paper uid="J99-4003"> <Title>Speech Repairs, Intonational Phrases, and Discourse Markers: Modeling Speakers' Utterances in Spoken Dialogue</Title> <Section position="2" start_page="0" end_page="533" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Consider the following example from the Trains corpus (Heeman and Allen 1995).</Paragraph> <Paragraph position="1"> Example 1 (d93-13.3 utt63) um it'll be there it'll get to Dansville at three a.m. and then you wanna do you take tho- want to take those back to Elmira so engine E two with three boxcars will be back in Elmira at six a.m. is that what you wanna do In order to understand what the speaker was trying to say, the reader probably segmented the above into a number of sentence-like segments, utterances, as follows. Example 1 Revisited um it'll be there it'll get to Dansville at three a.m.</Paragraph> <Paragraph position="2"> and then you wanna do you take tho- want to take those back to Elmira so engine E two with three boxcars will be back in Elmira at six a.m. is that what you wanna do * Computer Science and Engineering, P.O. Box 91000, Portland, OR 97291. E-mail: heeman@cse.ogi.edu † Department of Computer Science, Rochester, NY 14627. E-mail: james@cs.rochester.edu © 1999 Association for Computational Linguistics Computational Linguistics Volume 25, Number 4 Even this does not fully capture what the speaker was intending to convey. The first and second utterances contain speech repairs, where the speaker goes back and changes (or repeats) something she just said. In the first, the speaker changed it'll be there to it'll get to; in the second, she changed you wanna to do you take tho-, which she then further revised. 
The speaker's intended utterances are thus as follows: 1 Example 1 Revisited Again um it'll get to Dansville at three a.m.</Paragraph> <Paragraph position="3"> and then do you want to take those back to Elmira so engine E two with three boxcars will be back in Elmira at six a.m. is that what you wanna do The tasks of segmenting speakers' turns into utterance units and resolving speech repairs are strongly intertwined with a third task: identifying whether words, such as so, well, and right, are part of the sentential content or are being used as discourse markers to relate the current speech to the preceding context. In the example above, the second and third utterances begin with discourse markers.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Utterance Units and Intonational Phrases </SectionTitle> <Paragraph position="0"> As illustrated above, understanding a speaker's turn necessitates segmenting it into individual utterance units. However, there is no consensus as to how to define an utterance unit (Traum and Heeman 1997). The manner in which speakers break their speech into intonational phrases undoubtedly plays a major role in its definition. Intonational phrase endings are signaled through variations in the pitch contour, segmental lengthening, and pauses. Beach (1991) demonstrated that hearers can use intonational information early on in sentence processing to help resolve syntactic ambiguities.</Paragraph> <Paragraph position="1"> Bear and Price (1990) showed that a parser can use automatically extracted intonational phrasing to reduce ambiguity and improve efficiency. Ostendorf, Wightman, and Veilleux (1993) used hand-labeled intonational phrasing to do syntactic disambiguation and achieved performance comparable to that of human listeners. Due to their significance, we will focus on the task of detecting intonational phrase boundaries. 
</Paragraph> </Section> <Section position="2" start_page="0" end_page="529" type="sub_section"> <SectionTitle> 1.2 Speech Repairs </SectionTitle> <Paragraph position="0"> The on-line nature of spoken dialogue forces conversants to sometimes start speaking before they are sure of what they want to say. Hence, the speaker might need to go back and repeat or modify what she just said. Of course, there are many different reasons why speakers make repairs, but whatever the reason, speech repairs are a normal occurrence in spoken dialogue. In the Trains corpus, 23% of speaker turns contain at least one repair and 54% of turns with at least 10 words contain a repair.</Paragraph> <Paragraph position="1"> Fortunately for the hearer, speech repairs tend to have a standard form. As illustrated by the following example, they can be divided into three intervals, or stretches of speech: the reparandum, editing term, and alteration. 2 Heeman and Allen Modeling Speakers' Utterances Example 2 (d92a-2.1 utt29) that's the one with the bananas [reparandum; ip] I mean [editing terms] that's taking the bananas [alteration] The reparandum is the stretch of speech that the speaker is replacing, and can end with a word fragment, where the speaker interrupts herself in the middle of a word. The end of the reparandum is the interruption point (ip) and is often accompanied by a disruption in the intonational contour. This can be optionally followed by the editing term, which can consist of filled pauses, such as um or uh, or cue phrases, such as I mean, well, or let's see. Reparanda and editing terms account for 10% of the words in the Trains corpus. The last part is the alteration, which is the speech that the speaker intends as the replacement for the reparandum. In order for the hearer to determine the intended utterance, he must detect the repair and determine the extent of the reparandum and editing term. We refer to this latter process as correcting the speech repair. 
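As a minimal sketch of what correcting a repair amounts to, the Python fragment below deletes the reparandum and editing term from Example 2. The function name correct_repair and the word-index spans are illustrative assumptions, not part of the model described in this paper; the spans were chosen so that the result matches the intended utterance given for this example.

```python
from typing import List, Tuple


def correct_repair(words: List[str],
                   reparandum: Tuple[int, int],
                   editing_term: Tuple[int, int]) -> List[str]:
    """Return the words that remain after dropping the reparandum and editing term.

    Both spans are half-open (start, end) word indices; an empty span such as
    (3, 3) models an abridged repair, which has no reparandum.
    """
    drop = set(range(*reparandum)) | set(range(*editing_term))
    return [w for i, w in enumerate(words) if i not in drop]


# Example 2 (d92a-2.1 utt29); the span indices are assumed for illustration.
words = "that's the one with the bananas I mean that's taking the bananas".split()
intended = correct_repair(words, reparandum=(3, 6), editing_term=(6, 8))
print(" ".join(intended))
```

The sketch performs only the deletion step. Detecting that a repair occurred and locating the span boundaries is the harder problem, and the sketch assumes those are already known.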
In the example above, the speaker's intended utterance is that's the one that's taking the bananas.</Paragraph> <Paragraph position="2"> Hearers seem to be able to effortlessly understand speech with repairs in it, even when multiple repairs occur in a row. In laboratory experiments, Martin and Strange (1968) found that attending to speech repairs and the content of an utterance are mutually inhibitory, and Bard and Lickley (1997) found that subjects have difficulty remembering the actual words in the reparandum. Listeners must be resolving repairs very early on in processing the speech. Earlier work by Lickley and colleagues (Lickley, Shillcock, and Bard 1991; Lickley and Bard 1992) strongly suggests that there are prosodic cues across the interruption point that hearers make use of in detecting repairs. However, little progress has been made in detecting speech repairs based solely on acoustical cues (cf. Bear, Dowding, and Shriberg 1992; Nakatani and Hirschberg 1994; O'Shaughnessy 1994; Shriberg, Bates, and Stolcke 1997).</Paragraph> <Paragraph position="3"> 1.2.1 Classification of Speech Repairs. Psycholinguistic work in speech repairs and in understanding the implications that they pose for theories of speech production (e.g. Levelt 1983; Blackmer and Mitton 1991; Shriberg 1994) has come up with a number of classification systems. Categories are based on how the reparandum and alteration differ, for instance whether the alteration repeats the reparandum, makes it more appropriate, or fixes an error in the reparandum. Such an analysis can shed light on where in the production system the error and its repair originated. Our concern, however, is in computationally resolving repairs. The relevant features are those that the hearer has access to and can make use of in detecting and correcting a repair. 
Loosely following Hindle (1983), we divide repairs into the following categories: fresh starts, modification repairs, and abridged repairs.</Paragraph> <Paragraph position="4"> Fresh starts occur where the speaker abandons the current utterance and starts again, where the abandonment seems to be acoustically signaled either in the editing term or at the onset of the alteration. Example 3 illustrates a fresh start where the speaker abandons the partial utterance I need to send, and replaces it with the question how many boxcars can one engine take. Example 3 (d93-14.3 utt2) I need to send [reparandum; ip] let's see [editing terms] how many boxcars can one engine take [alteration] For fresh starts, there can sometimes be little or even no correlation between the reparandum and alteration. Although it is usually easy to determine the reparandum onset, initial discourse markers and preceding intonational phrases can prove problematic. The second type are modification repairs, which comprise the remainder of repairs with a nonempty reparandum. The example below illustrates this type of repair. Example 4 (d92a-1.2 utt40) you can carry them both on [reparandum; ip] tow both on the same engine [alteration] Modification repairs tend to have strong word correspondences between the reparandum and alteration, which can help the hearer determine the reparandum onset as well as signal that a repair occurred. In the example above, there are word matches on the instances of both and on, and a replacement of the verb carry by tow. Modification repairs can in fact consist solely of the reparandum being repeated by the alteration. The third type are the abridged repairs. 
These repairs consist of an editing term, but with no reparandum, as the following example illustrates.</Paragraph> <Paragraph position="5"> Example 5 (d93-14.3 utt42) we need to [ip] um [editing terms] manage to get the bananas to Dansville more quickly For these repairs, the hearer has to determine that an editing term occurred, which can be difficult for phrases such as let's see or well since they can also have a sentential interpretation. The hearer also has to determine that the reparandum is empty. As the example above illustrates, this is not necessarily a trivial task because of the spurious word correspondences between need to and manage to.</Paragraph> </Section> <Section position="3" start_page="529" end_page="530" type="sub_section"> <SectionTitle> 1.3 Discourse Markers </SectionTitle> <Paragraph position="0"> Phrases such as so, now, firstly, moreover, and anyways can be used as discourse markers (Schiffrin 1987). Discourse markers are conjectured to give the hearer information about the discourse structure, and so aid the hearer in understanding how the new speech or text relates to what was previously said and in resolving anaphoric references (Hirschberg and Litman 1993). Although discourse markers such as firstly and moreover are not commonly used in spoken dialogue (Brown and Yule 1983), many other markers are employed. These markers are used to achieve a variety of effects, such as signaling an acknowledgment or acceptance, holding a turn, stalling for time, signaling a speech repair, or signaling an interruption in the discourse structure or the return from one.</Paragraph> <Paragraph position="1"> Although Schiffrin defines discourse markers as bracketing units of speech, she explicitly avoids defining what the unit is. 
We feel that utterance units are the building blocks of spoken dialogue and that discourse markers operate at this level to relate the current utterance to the discourse context or to signal a repair in an utterance. In the following example, and then helps signal that the upcoming speech is adding new information, while so helps indicate a summary is about to be made.</Paragraph> <Paragraph position="2"> Example 6 (d92a-1.2 utt47) and then while at Dansville take the three boxcars so that's total of five</Paragraph> </Section> <Section position="4" start_page="530" end_page="531" type="sub_section"> <SectionTitle> 1.4 Interactions </SectionTitle> <Paragraph position="0"> The tasks of identifying intonational phrases and discourse markers and detecting and correcting speech repairs are highly intertwined, and the solution to each task depends on the solution for the others.</Paragraph> <Paragraph position="1"> 1.4.1 Intonational Phrases and Speech Repairs. First, phrase boundaries and interruption points of speech repairs share a number of features that can be used to identify them: there is often a pause at these events as well as lengthening of the final syllable before them. Even word correspondences, traditionally associated with speech repairs, can cross phrase boundaries (indicated with &quot;%&quot;), as the following example shows.</Paragraph> <Paragraph position="2"> Example 7 (d93-8.3 utt73) that's all you need % you only need one boxcar % Second, the reparandum onset for repairs, especially fresh starts, often occurs at the onset of an intonational phrase, and reparanda usually do not span phrase boundaries. Third, deciding if filled pauses and cue phrases should be treated as abridged repairs can only be done by taking into account whether they are midutterance or not (cf. 
Shriberg and Lickley 1993), which is associated with intonational phrasing.</Paragraph> <Paragraph position="3"> 1.4.2 Intonational Phrases and Discourse Markers. Discourse markers tend to occur at utterance boundaries, and hence have strong interactions with intonational phrasing. In fact, Hirschberg and Litman (1993) found that discourse markers tend to occur at the beginning of intonational phrases, while sentential usages tend to occur midphrase. Example 8 below illustrates so being used midutterance as a subordinating conjunction, not as a discourse marker.</Paragraph> <Paragraph position="4"> Example 8 (d93-15.2 utt9) it takes an hour to load them % just so you know % Now consider the third turn of the following example, in which the system is not using no as a quantifier to mean that there are not any oranges available, but as a discourse marker signaling that the user misrecognized oranges as orange juice.</Paragraph> <Paragraph position="5"> Table 1: Frequency of discourse markers in the editing term of speech repairs and as the alteration onset.</Paragraph> <Paragraph position="6"> system: so so we have three boxcars of oranges at Corning user: three boxcars of orange juice at Corning system: no um oranges The discourse marker interpretation is facilitated by the phrase boundary between no and oranges, especially since the determiner reading of no would be very unlikely to have a phrase boundary separating it from the noun it modifies. Likewise, the recognition of no as a discourse marker makes it more likely that there will be a phrase boundary following it.</Paragraph> <Paragraph position="7"> 1.4.3 Speech Repairs and Discourse Markers. Discourse markers are often used in the editing term to help signal that a repair occurred, and can be used to help determine if it is a fresh start (cf. 
Hindle 1983; Levelt 1983), as the following example illustrates.</Paragraph> <Paragraph position="8"> Example 10 (d92a-1.3 utt75) we have the orange juice in two [reparandum; ip] oh [editing term] how many did we need Realizing that oh is being used as a discourse marker facilitates the detection of the repair, and vice versa. This holds even if the discourse marker is not part of the editing term, but is the first word of the alteration. Table 1 shows the frequency with which discourse markers co-occur with speech repairs. We see that a discourse marker is either part of the editing term or is the alteration onset for 40% of fresh starts and 14% of modification repairs. Discourse markers also play a role in determining the onset for fresh starts, since they are often utterance initial.</Paragraph> </Section> <Section position="5" start_page="531" end_page="532" type="sub_section"> <SectionTitle> 1.5 Interactions with POS Tagging and Speech Recognition </SectionTitle> <Paragraph position="0"> Not only are the tasks of identifying intonational phrases and discourse markers and resolving speech repairs intertwined, but these tasks are also intertwined with identifying the lexical category or part of speech (POS) of each word, and the speech recognition problem of predicting the next word given the previous context.</Paragraph> <Paragraph position="1"> Just as POS taggers for text take advantage of sentence boundaries, it is natural to assume that tagging spontaneous speech would benefit from modeling intonational phrases and speech repairs. This is especially true for repairs, since their occurrence disrupts the local context that is needed to determine the POS tags (Hindle 1983). 
In the example below, both instances of load are being used as verbs; however, since the second instance follows a preposition, it could easily be mistaken for a noun.</Paragraph> <Paragraph position="2"> Example 11 (d93-12.4 utt44) by the time we load in [reparandum; ip] load the bananas However, by realizing that the second instance of load is being used in a repair and corresponds to the first instance of load, its POS tag becomes obvious. Conversely, since repairs disrupt the local syntactic context, this disruption, as captured by the POS tags, can be used as evidence that a repair occurred, as shown by the following example.</Paragraph> <Paragraph position="3"> Example 12 (d93-13.1 utt90) I can run trains on the [reparandum] in the opposite direction [alteration] Here we have a preposition following a determiner, an event that only happens across the interruption point of a speech repair.</Paragraph> <Paragraph position="4"> Just as there are interactions with POS tagging, the same holds for the speech recognition problem of predicting the next word given the previous context. For the lexical context can run trains on the, it would be very unlikely that the word in would be next. It is only by modeling the occurrence of repairs and their word correspondences that we can account for the speaker's words.</Paragraph> <Paragraph position="5"> There are also interactions with intonational phrasing. In the example below, after asking the question what time do we have to get done by, the speaker refines this to be whether they have to be done by two p.m. The result, however, is that there is a repetition of the word by, but separated by a phrase boundary.</Paragraph> <Paragraph position="6"> Example 13 (d93-18.1 utt58) what time do we have to get done by % by two p.m. 
% By modeling the intonational phrases, POS taggers and speech recognition language models would be expecting a POS tag and word that can introduce a new phrase.</Paragraph> </Section> <Section position="6" start_page="532" end_page="533" type="sub_section"> <SectionTitle> 1.6 Modeling Speakers' Utterances </SectionTitle> <Paragraph position="0"> In this paper, we address the problem of modeling speakers' utterances in spoken dialogue, which involves identifying intonational phrases and discourse markers and detecting and correcting speech repairs. We propose that these tasks can be done using local context and early in the processing stream. Hearers are able to resolve speech repairs and intonational phrase boundaries very early on, and hence there must be enough cues in the local context to make this feasible. We redefine the speech recognition problem so that it includes the resolution of speech repairs and identification of intonational phrases, discourse markers, and POS tags, which results in a statistical language model that is sensitive to speakers' utterances. Since all tasks are being resolved in the same model, we can account for the interactions between the tasks in a framework that can compare alternative hypotheses for the speaker's turn. Figure 1: Map used by the system in collecting the Trains corpus.</Paragraph> <Paragraph position="1"> Not only does this allow us to model the speaker's utterance, but it also results in an improved language model, as evidenced both by improved POS tagging and by better estimates of the probability of the next word. 
Furthermore, speech repairs and phrase boundaries have acoustic correlates, such as pauses between words. By resolving speech repairs and identifying intonational phrases during speech recognition, these acoustic cues, which otherwise would be treated as noise, can give evidence as to the occurrence of these events, and further improve speech recognition results.</Paragraph> <Paragraph position="2"> Resolving the speaker's utterances early on will not only help a speech recognizer determine what was said, but it will also help later processing, such as syntactic and semantic analysis. The literature (e.g., Bear and Price 1990; Ostendorf, Wightman, and Veilleux 1993) already indicates the usefulness of intonational information for syntactic processing. Resolving speech repairs will further simplify syntactic and semantic understanding of spontaneous speech, since it will remove the apparent ill-formedness that speech repairs cause. This will also make it easier for these processes to cope with the added syntactic and semantic variance that spoken dialogue seems to license.</Paragraph> </Section> <Section position="7" start_page="533" end_page="533" type="sub_section"> <SectionTitle> 1.7 Overview of the Paper </SectionTitle> <Paragraph position="0"> We next describe the Trains corpus and the annotation of speech repairs, intonational phrases, discourse markers, and POS tags. We then introduce a language model that incorporates POS tagging and discourse marker identification. We then augment it with speech repair and intonational phrase detection, repair correction, and silence information, and give a sample run of the model. We then evaluate the model by analyzing the effects that each component of the model has on the other components.</Paragraph> <Paragraph position="1"> Finally, we compare our work with previous work and present the conclusions and future work.</Paragraph> </Section> </Section> </Paper>