<?xml version="1.0" standalone="yes"?>
<Paper uid="J99-4003">
  <Title>Speech Repairs, Intonational Phrases, and Discourse Markers: Modeling Speakers' Utterances in Spoken Dialogue</Title>
  <Section position="5" start_page="648" end_page="648" type="metho">
    <SectionTitle>
3.4 Results
</SectionTitle>
    <Paragraph position="0"> To test our POS-based language model, we ran two experiments. The first set examines the effect of using richer contexts for estimating the word and POS probability distributions. The second set measures whether modeling discourse usage leads to better language modeling. Before we give the results, we explain the methodology that we use throughout the experiments.</Paragraph>
    <Paragraph position="1"> 3.4.1 Experimental Setup. In order to make the best use of our limited data, we tested our model using a sixfold cross-validation procedure. We divided the dialogues into six partitions and tested each partition with a model built from the other partitions.</Paragraph>
    <Paragraph position="2"> We divided the dialogues for each pair of speakers as evenly between the six partitions as possible. Changes in speaker are marked in the word transcription with the special token &lt;turn&gt;. The end-of-turn marker is not included in the POS results, but is included in the perplexity results. We treat contractions, such as that'll and gonna, as separate words, treating them as that and 'll for the first example, and going and ta for the second. We also changed all word fragments into a common token &lt;fragment&gt;.</Paragraph>
    <Paragraph position="3"> Since current speech recognition rates for spontaneous speech are quite low, we have run the experiments on the hand-collected transcripts. In searching for the best  sequence of POS tags for the transcribed words, we follow the technique proposed by Chow and Schwartz (1989) and only keep a small number of alternative paths by pruning the low probability paths after processing each word.</Paragraph>
    <Paragraph position="4"> 3.4.2 Perplexity. A way to measure the effectiveness of the language model is to  measure the perplexity that it assigns to a test corpus (Bahl et al. 1977). Perplexity is an estimate of how well the language model is able to predict the next word of a test corpus in terms of the number of alternatives that need to be considered at each point. For word-based language models, with estimated probability distribution of Pr(wilwl,i_l) , the perplexity of a test set Wl,N is calculated as 2 H, where H is the i ~----1 log2 Pr(wilwl,i-1) * entropy, which is defined as H = -~  its POS tag as well. To estimate the branching factor, and thus the size of the search space, we use the following formula for the entropy, where di is the POS tag for word wi.</Paragraph>
    <Paragraph position="6"> Word Perplexity. In order to compare a POS-based model against a word-based language model, we should not penalize the POS-based model for incorrect POS tags.</Paragraph>
    <Paragraph position="7"> Hence, we should ignore them when defining the perplexity and base the perplexity measure on Pr(wi\[wl,i_l). However, for our model, this probability is not estimated.</Paragraph>
    <Paragraph position="8"> Hence, we must rewrite it in terms of the probabilities that we do estimate. To do this, our only recourse is to sum over all possible POS sequences.</Paragraph>
    <Paragraph position="10"> 3.4.3 Recall and Precision. We report results on identifying discourse markers in terms of recall, precision and error rate. The recall rate is the number of times that the algorithm correctly identifies an event over the total number of times that it actually occurred. The precision rate is the number of times the algorithm correctly identifies it over the total number of times it identifies it. The error rate is the number of errors in identifying an event over the number of times that the event occurred.</Paragraph>
    <Paragraph position="11"> 3.4.4 Using Richer Histories. Table 6 shows the effect of varying the richness of the information that the decision tree algorithm is allowed to use in estimating the POS and word probabilities. The second column uses the approximations given in Equation 8 and 9 and the third column uses the full context. The results show that adding the extra context has the biggest effect on the perplexity measures, decreasing the word perplexity by 44.4% from 43.22 to 24.04. The effect on POS tagging is less pronounced, but still gives a reduction of 3.8%. We also see a 8.7% improvement in identifying discourse markers. Hence, in order to use POS tags in a speech recognition language model, we need to use a richer context for estimating the probabilities than what is typically used. In other work (Heeman 1999), we show that our POS-based model results in lower perplexity and word error rate than a word-based model.</Paragraph>
    <Paragraph position="12">  markers by using special POS tags. In column two, we give the results of a model in which we use a POS tagset that does not distinguish discourse marker usage (P). The discourse conjuncts CC_D are collapsed into CC, discourse adverbials RB_D into RB, and acknowledgments AC and discourse interjections UH_D into UH_FP. The third column gives the results of the model in which we use our tagset that does distinguish discourse marker usage (D). To ensure a fair comparison, we do not penalize POS errors that result from a confusion between discourse and sentential usages. We see that modeling discourse markers results in a perplexity reduction from 24.20 to 24.04 and reduces the number of POS errors from 1,219 to 1,189, giving a 2.5% error rate reduction. Although the improvements in perplexity and POS tagging are small, they indicate that there are interactions, and hence discourse markers should be resolved at the same time as POS tagging and speech recognition word prediction.</Paragraph>
    <Paragraph position="13"> 4. Identifying Speech Repairs and Intonational Phrases In the previous section, we presented a POS-based language model that uses special tags to denote discourse markers. However, this model does not account for the occurrence of speech repairs and intonational phrases. Ignoring these events when building a statistical language model will lead to probabilistic estimates for the words and POS tags that are less precise, since they mix contexts that cross intonational boundaries and interruption points of speech repairs with fluent stretches of speech.</Paragraph>
    <Paragraph position="14"> However, there is not a reliable signal for detecting the interruption point of speech repairs (Bear, Dowding, and Shriberg 1992) nor the occurrence of intonational phrases.</Paragraph>
    <Paragraph position="15"> Rather, there are a number of different sources of information that give evidence as to the occurrence of these events. These sources include the presence of pauses, filled pauses, cue phrases, discourse markers, word fragments, word correspondences, and syntactic anomalies. Table 8 gives the number of occurrences for some of these features for each word in the corpus that is not turn-final nor part of the editing term of a speech repair. Each word is classified by whether it immediately precedes the interruption point of a fresh start, modification, or abridged repair, or ends an intonational phrase. All other words are categorized as fluent. The first row gives the number of occurrences of these events. The second row reports whether the word is a fragment.</Paragraph>
    <Paragraph position="16"> The third and fourth give the number of times the word is followed by a filled pause or discourse marker, respectively. The fifth and sixth rows report whether the word is followed by a pause that is less than or greater than 0.5 seconds, respectively. Pause durations were computed automatically with a speech recognizer constrained to the word transcription (Entropic Research Laboratory, Inc. 1994). The next row reports whether there is a word match that crosses the word with at most two intervening words, and the next row, those with at most five intervening words.</Paragraph>
    <Paragraph position="17">  Heeman and Allen Modeling Speakers' Utterances Table 8 Occurrence of features boundaries.</Paragraph>
    <Paragraph position="18"> that signal speech repairs and intonational  From the table, it is clear that none of the cues on their own is a reliable indicator of speech repairs or intonational boundaries. For instance, 44.5% (1,622/3,642) of all long pauses occur after an intonational boundary and 13.3% occur after the interruption point of a speech repair. Conversely, 31.1% (1,622/5,211) of intonational boundaries are followed by a pause while 20.2% of all repairs are followed by a long pause.</Paragraph>
    <Paragraph position="19"> Hence, pauses alone do not give a complete picture of whether a speech repair or intonational boundary occurred. The same holds for filled pauses, which can occur both after the interruption point of a speech repair and in fluent speech, namely between utterances or after utterance-initial discourse markers. Word matchings can also be spurious, as evidenced by the 27 word matches with at most two intervening words across abridged repairs, as well as the matchings across intonational boundaries and fluent speech. Even syntactic ill-formedness at the interruption point is not always guaranteed, as the following example illustrates.</Paragraph>
    <Paragraph position="20"> Example 15 (d93-13.2 utt53) load two boxes of boxcars with oranges reparandum zp Hence using parser failures to find repairs (cf. Dowding et al. 1993) will not be robust. In this section, we augment our POS-based language model so that it also detects intonational boundaries and speech repairs, along with their editing terms. Although not all speech repairs have obvious syntactic anomalies, the probability distributions for words and POS tags are going to be different depending on whether they follow the interruption point of a speech repair, an intonational boundary, or fluent speech. So, it makes sense to take the speech repairs and intonational boundaries into account by directly modeling them when building the language model, which automatically gives us a means of detecting these events and better prediction of the speech that follows. To model the occurrence of intonational boundaries and speech repairs, we introduce three extra variables into the language model. The repair tag Ri, the editing term tag Ei and the intonation tag Ii. These utterance tags capture the discontinuities in the speaker's turn, and we use these discontinuities to better model the speech that follows.</Paragraph>
    <Section position="1" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
4.1 Speech Repairs
</SectionTitle>
      <Paragraph position="0"> The repair tag indicates the occurrence of speech repairs. However, we not only want to know whether a repair occurred, but also the type of repair: whether it is a modification  Computational Linguistics Volume 25, Number 4 repair, a fresh start, or an abridged repair. The type of repair is important since the strategy that a hearer uses to correct the repair depends on the type of repair. For fresh starts, the hearer must determine the beginning of the current utterance. For modification repairs, the hearer can make use of the correspondences between the reparandum and alteration to determine the reparandum onset. For abridged repairs, there is no reparandum, and so simply knowing that it is abridged gives the correction. For repairs that do not have an editing term, the interruption point is where the local context is disrupted, and hence is the logical place to tag such repairs. For repairs with an editing term, there are two choices for marking the speech repair: either directly following the end of the reparandum, or directly preceding the onset of the alteration. The following example illustrates these two choices, marking them with Mod?.</Paragraph>
      <Paragraph position="1"> Example 16 (d92a-5.2 utt34) so we'll pick up a tank of Mod? uh Mod? the tanker of oranges The editing term by itself does not completely determine the type of repair. The alteration also helps to disambiguate the repair. Hence, we delay hypothesizing about the repair type until the end of the editing term, which should keep our search-space smaller, since we do not need to keep alternative repair type interpretations while processing the editing term. This leads to the following definition of the repair variable Ri for the transition between word Wi-1 and Wi:</Paragraph>
      <Paragraph position="3"> if Wi is the alteration onset of a modification repair if Wi is the alteration onset of a fresh start (or cancel) if Wi is the alteration onset of an abridged repair otherwise</Paragraph>
    </Section>
    <Section position="2" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
4.2 Editing Terms
</SectionTitle>
      <Paragraph position="0"> Editing terms are problematic for tagging speech repairs since they separate the end of the reparandum from the alteration onset, thus separating the discontinuity that gives evidence that a fresh start or modification repair occurred. For abridged repairs, they separate the word that follows the editing term from the context that is needed to determine the identity of the word and its POS tag. If editing terms could be identified without having to consider the context, we could skip over them, but still use them as part of the context for deciding the repair tag (cf. Heeman and Allen 1994). However, this assumption is not valid for words that are ambiguous as to whether they are an editing term, such as let me see. Even filled pauses are problematic since they are not necessarily part of the editing term of a repair. To model editing terms, we use the variable E i to indicate the type of editing term transition between word W/_ 1 and Wi.</Paragraph>
      <Paragraph position="1">  if Wi-1 is not part of an editing term but Wi is if Wi-1 and Wi are both part of an editing term if Wi-1 is part of an editing term but Wi is not if neither Wi-1 nor Wi are part of an editing term Below, we give an example and show all non-null editing term and repair tags.</Paragraph>
      <Paragraph position="2"> Example 17 (d93-10.4 utt30) that'll get there at four a.m. Push oh ET sorry Pop Mod at eleven a.m.</Paragraph>
      <Paragraph position="3">  Heeman and Allen Modeling Speakers' Utterances</Paragraph>
    </Section>
    <Section position="3" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
4.3 Intonational Phrases
</SectionTitle>
      <Paragraph position="0"> The final variable is Ii, which marks the occurrence of intonational phrase boundaries.</Paragraph>
      <Paragraph position="2"> The intonation variable is separate from the editing term and repair variables since it is not restricted by the value of the other two. For instance, an editing term could end an intonational phrase, especially on the end of a cue phrase such as let's see, as can the reparandum, as Example 18 below demonstrates.</Paragraph>
      <Paragraph position="3"> Example 18 (d92a-2.1 utt29) that's the one with the bananas % Push I ET mean Pop Mod that's taking the bananas</Paragraph>
    </Section>
    <Section position="4" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
4.4 Redefining the Speech Recognition Problem
</SectionTitle>
      <Paragraph position="0"> We now redefine the speech recognition problem so that its goal is to find the sequence of words and the corresponding POS, intonation, editing term, and repair tags that is most probable given the acoustic signal.</Paragraph>
      <Paragraph position="2"> The second term is the language model probability, and can be rewritten as follows.</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="5" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
4.5 Representing the Context
</SectionTitle>
      <Paragraph position="0"> Equation 13 requires five probability distributions to be estimated. The context for each includes all of the words, POS, intonation, repair, and editing term tags that have been hypothesized, each as a separate piece of information. In principal, we could give this to the decision tree algorithm and let it decide what information to use in constructing equivalence classes. However, repairs, editing terms, and even intonation phrases do not occur in the same abundance as fluent speech and are not as constrained. Hence, it will be difficult to model the discontinuities that they introduce into the context.</Paragraph>
      <Paragraph position="1">  engine E two picks Mod takes the two boxcars When predicting the first word of the alteration takes, it is inappropriate to ask about the preceding words, such as picks, without realizing that there is a modification repair in between. The same also holds for intonational boundaries and editing term pushes and pops. In the example below, a question should only be asked about is in the realization that it ends an intonational phrase.</Paragraph>
      <Paragraph position="2"> Example 20 (d92a-1.2 utt3) you'll have to tell me what the problem is % I don't have their labels Although the intonation, repair, and editing term tags are part of the context and so can be used in partitioning it, the question is whether this will happen. The problem is that null intonation, repair, and editing term tags dominate the training examples. So, we are bound to run into contexts in which there are not enough intonational phrases and repairs for the algorithm to learn the importance of using this information, and instead might blindly subdivide the context based on some subdivision of the POS tags. The solution is analogous to what is done in POS tagging of written text: we give a view of the words and POS tags with the non-null repair, non-null intonation, and editing term push and pop tags inserted. By inserting these tags into the word and POS sequence, it will be more difficult for the learning algorithm to ignore them. It also allows these tags to be grouped with other tags that behave in a similar way, such as change in speaker turn, and discourse markers. null Now consider the following examples, which both start with so we need to.</Paragraph>
      <Paragraph position="3"> Example 21 (d92a-2.2 utt6) so we need to Push urn Pop Abr get a tanker of OJ to Avon Example 22 (d93-11.1 utt46) so we need to get the three tankers This is then followed by the verb get, except the first has an editing term in between. However, in predicting this word, the editing term hinders the decision tree algorithm from generalizing with nonabridged examples. The same thing happens with fresh starts and modification repairs. To allow generalizations between repairs with an editing term and those without, we need a view of the context with completed editing terms removed (cf. Stolcke and Shriberg 1996b).</Paragraph>
      <Paragraph position="4"> Part of the context given to the decision tree is the words and POS tags with the non-null utterance tags inserted (i.e., %) and completed editing terms removed. We refer to this as the utterance context, since it incorporates the utterance information that has been hypothesized. Consider the following example.</Paragraph>
      <Paragraph position="5">  Heeman and Allen Modeling Speakers' Utterances</Paragraph>
      <Paragraph position="7"> Figure 5 Top part of the decision tree used for estimating the probability distribution of the intonation tag.</Paragraph>
      <Paragraph position="8"> Example 23 (d93-18.1 utt47) it takes one Push you ET know Pop Mod two hours % The utterance context for the POS tag of you is &amp;quot;it/PRP takes/VBP one/CD Push.&amp;quot; The context for the editing term Pop is &amp;quot;it/PRP takes/VBP one/CD Push you/PRP know/VBP.&amp;quot; The utterance context for the repair tag has the editing term cleaned up: &amp;quot;it/PRP takes/VBP one/CD&amp;quot; (we also give it the context with the editing term not cleaned up). The context for the POS tag of two is &amp;quot;it/PRP takes/VBP one/CD Mod.&amp;quot; We also include two variables that indicate whether we are processing an editing term without forcing it to look for an editing term Push in the utterance context: ETo state indicates whether we are processing an editing term and whether a cue phrase was seen; and ET-prev indicates the number of editing term words seen so far. Figure 5 gives the top part of the decision tree that was grown for the intonation tag, where uW and uD are the utterance context.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="648" end_page="648" type="metho">
    <SectionTitle>
5. Correcting Speech Repairs
</SectionTitle>
    <Paragraph position="0"> The previous section focused on the detection of speech repairs, editing terms, and intonational phrases. But for repairs, we have only addressed half of the problem; the other half is determining the extent of the reparandum. Hindle (1983) and Kikui and Morimoto (1994) both separate the task of correcting a repair from detecting it by assuming that there is an acoustic editing signal that marks the interruption point of speech repairs (as well as access to the POS tags and utterance boundaries).</Paragraph>
    <Paragraph position="1"> Although the model of the previous section detects repairs, this model is not effective enough. In fact, we feel that one of its crucial shortcomings is that it does not take into consideration the task of correcting repairs (Heeman, Loken-Kim, and Allen 1996).</Paragraph>
    <Paragraph position="2"> Since hearers are often unaware of speech repairs (Martin and Strange 1968), they must be able to correct them as the utterance is unfolding and as an indistinguishable event from detecting them and recognizing the words involved.</Paragraph>
    <Paragraph position="3"> Bear, Dowding, and Shriberg (1992) proposed that multiple information sources need to be combined in order to detect and correct speech repairs. One of these sources  mm.mm 85 mmmx.mmm 8 mrr.mrr 3 mm.mxxm 2 mx.m 76 m.xxm 8 mrmx.mrm 3 xr.r 2 mmx.mm 35 mrm.mrm 7 mmmmm.mmmmm 3 xmx.m 2 mr.mr 29 mx.xm 6 mm.xmm 3 xmmx.mm 2 mmm.mmm 22 xm.m 5 mmmmx.mmmm 2 rr.rr 2 rx.r 20 mmmmr.mmmmr 5 mrmm.mrmm 2 rm.rxm 2 rm.rm 20rmm.rmm 4 mmmxx.mmm 2 r.xr 2 xx. 12 mmxx.mm 4 mmm.xxxmmm 2 mmmm.mmmm 12 mmn~.mmmr 4 mmm.mxmm 2  includes a pattern-matching routine that looks for simple cases of word correspondences that could indicate a repair. However, pattern matching is too limited to capture the variety of word correspondence patterns that speech repairs exhibit (Heeman and Allen 1994). For example, the 1,302 modification repairs in the Trains corpus take on 160 different repair structures, even when we exclude word fragments and editing terms. Of these, only 47 occurred at least twice, and these are listed in Table 9. Each word in the repair is represented by its correspondence type: m for word match, r for replacement, and x for deletions and insertions. A period &amp;quot;.&amp;quot; marks the interruption point. For example, the structure of the repair given in Example 14 (engine two from Elmi(ra)- or engine three from Elmira) would be mrm.mrm.</Paragraph>
    <Paragraph position="4"> To remedy the limitation of Bear, Dowding, and Shriberg (1992), we proposed that the word correspondences between the reparandum and alteration could be found by a set of well-formedness rules (Heeman and Allen 1994; Heeman, Loken-Kim, and Allen 1996). Potential repairs found by the rules were passed to a statistical language model (a predecessor of the model of Section 4), which pruned out false positives. Part of the context for the statistical model was the proposed repair structure found by the well-formedness rules. However, the alteration of a repair, which makes up half of the repair structure, occurs after the interruption point and hence should not be used to predict the occurrence of a repair. Hence this model was of limited use for integration into a speech recognizer.</Paragraph>
    <Paragraph position="5"> Recently, Stolcke and Shriberg (1996) presented a word-based model for speech recognition that models simple word deletion and repetition patterns. They used the prediction of the repair to clean up the context and to help predict what word will occur next. Although their model is limited to simple types of repairs, it provides a starting point for incorporating speech repair correction into a statistical language model.</Paragraph>
    <Section position="1" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
5.1 Sources of Information
</SectionTitle>
      <Paragraph position="0"> There are several sources of information that give evidence as to the extent of the reparandum of speech repairs. Probably the most widely used is the presence of word correspondences between the reparandum and alteration, both at the word level and at the level of syntactic constituents (Levelt 1983; Hindle 1983; Bear, Dowding, and Shriberg 1992; Heeman and Allen 1994; Kikui and Morimoto 1994). Second, there tends to be a fluent transition from the speech that precedes the onset of the reparandum to the alteration (Kikui and Morimoto 1994). This source is very important for repairs that do not have initial retracing, and is the mainstay of the &amp;quot;parser-first&amp;quot; approach (e.g.,  Heeman and Allen Modeling Speakers' Utterances Dowding et al. 1993)--keep trying alternative corrections until one of them parses. Third, there are certain regularities for where speakers restart. Reparandum onsets tend to be at constituent boundaries (Nooteboom 1980), and in particular, at boundaries where a coordinated constituent can be placed (Levelt 1983). Hence, reparandum onsets can be partially predicted without even looking at the alteration.</Paragraph>
    </Section>
    <Section position="2" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
5.2 Our Approach
</SectionTitle>
      <Paragraph position="0"> Most previous approaches to correcting speech repairs have taken the standpoint of finding the best reparandum given the neighboring words. Instead, we view the problem as finding the reparandum that best predicts the following words. Since speech repairs are often accompanied by word correspondences (Levelt 1983; Hindle 1983; Bear, Dowding, and Shriberg 1992; Heeman and Allen 1994; Kikui and Morimoto 1994), the actual reparandum will better predict the words in the alteration of the repair. Consider the following example: Example 24 (d93-3.2 utt45) which engine are we are we taking T reparandum lp In this example, if we predicted that a modification repair occurred and that the reparandum consists of are we, then the probability of are being the first word of the alteration would be very high, since it matches the first word of the reparandum. Conversely, if we are not predicting a modification repair with reparandum are we, then the probability of seeing are would be much lower. The same reasoning holds for predicting the next word, we: it is much more likely under the repair interpretation. So, as we process the words of the alteration, the repair interpretation will better account for the words that follow it, strengthening the interpretation.</Paragraph>
      <Paragraph position="1"> When predicting the first word of the alteration, we can also make use of the second source of evidence identified in the previous section: the context provided by the words that precede the reparandum. Consider the following repair in which the first two words of the alteration are inserted.</Paragraph>
      <Paragraph position="2"> Example 25 (d93-16.2 utt66) and two tankers to of OJ to Dansville v T reparandum zp Here, if we know the reparandum is to, then we know that the first word of the reparandum must be a fluent continuation of the speech before the onset of the reparandum. In fact, we see that the repair interpretation (with the correct reparandum onset) provides better context for predicting the first word of the alteration than a hypothesis that predicts either the wrong reparandum onset or predicts no repair at all. Hence, by predicting the reparandum of a speech repair, we no longer need to predict the onset of the alteration on the basis of the ending of the reparandum, as we did in Section 4.5. Such predictions are based on limited amounts of training data since only examples of speech repairs can be used. Rather, by first predicting the reparandum, we can use examples of fluent transitions to help predict the first word of the alteration. We can also make use of the third source of information. When we initially hypothesize the reparandum onset, we can take into account the a priori probability  Computational Linguistics Volume 25, Number 4 that it will occur at that point. In the following example, the words should and the are preferred by Levelt's coordinated constituent rule (Levelt 1983), and hence should have a higher score. Exceptions to the rule, such as this one, should have a lower score.</Paragraph>
      <Paragraph position="4"> of orange juice should er of oranges should be made into orange juice * ,J Y reparandum lp To incorporate correction processing into our language model, we need to add extra variables. After we predict a repair, we need to predict the reparandum onset. Knowing the reparandum onset then allows us to predict the word correspondences between the reparandum and alteration, thus allowing us to use the repair to better predict the words and their POS tags that make up the alteration.</Paragraph>
    </Section>
    <Section position="3" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
5.3 Reparandum Onset
</SectionTitle>
      <Paragraph position="0"> After we predict a modification repair or a fresh start, we need to predict the reparandum onset. Consider the following two examples of modification repairs.</Paragraph>
      <Paragraph position="1">  drop off the one tanker reparandum l~p the two tankers Although the examples differ in the length of the reparandum, their reparanda both start at the onset of a noun phrase. This same phenomena also exists for fresh starts where reparandum onsets are likely to follow an intonational boundary, the beginning of the turn, or a discourse marker. In order to allow generalizations across different reparandum lengths, we query each potential onset to see how likely it is as the onset. For Ri E {Mod, Can} and j &lt; i, we define Oq as follows: Onset Wj is the reparandum onset of repair R i Oq = null otherwise We normalize the probabilities to ensure that ~-~j(Oij = Onset) = 1. Just as we exclude the editing terms of previous repairs from the utterance words and POS tags, so we exclude the reparanda of previous repairs. Consider the following example of overlapping repairs, repairs in which the reparanda and alterations cannot be separated.</Paragraph>
      <Paragraph position="2">  for engine two at Elmira The reparandum of the first repair is from engine. In predicting the reparandum of the second, we work from the cleaned up context: what's the shortest route from. The context used in estimating how likely a word is as the reparandum onset includes which word we are querying. We also include the utterance words and POS tags that precede the proposed reparandum onset, thus allowing the decision tree to check if the onset is at a suitable constituent boundary. Since reparanda rarely extend over more than one utterance, we include three variables that help indicate whether an utterance boundary is being crossed. The first indicates the number of intonational phrase boundaries embedded in the proposed reparandum. The second indicates the number of discourse markers in the reparandum. Discourse markers at the beginning of the reparandum are not included, and if discourse markers appear consecutively, the group is only counted once. The third indicates the number of filled pauses in the reparandum.</Paragraph>
      <Paragraph position="3"> Another source of information is the presence of other repairs in the turn. In the Trains corpus, 35.6% of nonabridged repairs overlap. If a repair overlaps a previous one then its reparandum onset is likely to co-occur with the alteration onset of the previous repair (Heeman 1997). Hence we include a variable that indicates whether there is a previous repair, and if there is, whether the proposed onset coincides with, precedes, or follows the alteration onset of the preceding repair.</Paragraph>
    </Section>
    <Section position="4" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
5.4 The Active Repair
</SectionTitle>
      <Paragraph position="0"> Determining word correspondences is complicated by the occurrence of overlapping repairs. To keep our approach simple, we allow at most one previous word to license the correspondence. Consider again Example 29. Here, one could argue that the word for corresponds to the word from from either the reparandum of the first or second repair. In either case, the correspondence to the word engine is from the reparandum of the first repair. Our approach is to first decide which repair the correspondence will be to and then decide which word of that repair's reparandum will license the current word. We always choose the most recent repair that has words in its reparandum that have not yet licensed a correspondence (other than a word fragment). Hence, the active repair for predicting the word for is the second repair, while the active repair for predicting engine is the first repair. For predicting the word two, neither the first nor second repair has any unlicensed words in its reparandum, and hence two will not have an active repair. In future work, we plan to choose between the reparandum of alternative speech repairs, as allowed by the annotation scheme (Heeman 1997).</Paragraph>
    </Section>
    <Section position="5" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
5.5 Licensing a Correspondence
</SectionTitle>
      <Paragraph position="0"> If we are in the midst of processing a repair, we can use the reparandum to help predict the current word Wi and its POS tag Di. In order to do this, we need to determine which word in the reparandum of the active repair will license the current word. As illustrated in Figure 6, word correspondences for speech repairs tend to exhibit a cross serial dependency (Heeman and Allen 1994); in other words, if we have a correspondence between Wj in the reparandum and Wk in the alteration, any correspondence with a word in the alteration after Wk will be to a word that is after wj.</Paragraph>
      <Paragraph position="1">  can we have reparandum z~p we can have three engines in Corning at the same time Since we currently do not support such exceptions, this means that if there is already a correspondence for the repair, then the licensing word will follow the last correspondence in the reparandum.</Paragraph>
      <Paragraph position="2"> The licensing word might need to skip over words due to deleted words in the reparandum or inserted words in the alteration. In the example below, the word tow is licensed by carry, but the word them must be skipped over before processing the licensing between the two instances of both.</Paragraph>
      <Paragraph position="4"> tow both on the same engine
The next example illustrates the opposite problem: the word two has no correspondence with any word in the reparandum.</Paragraph>
      <Paragraph position="5"> Example 32 (d93-15.4 utt45) and fill my boxcars fully of oranges \]&amp;quot; reparandum lp my two boxcars full of oranges For words that have no correspondence, we define the licensing word as the first available word in the alternation, in this case boxcars. We leave it to the correspondence variable to encode that there is no correspondence. This gives us the following definition for the correspondence licensor, Lq, where i is the current word and j runs over all words in the reparandum of the active repair that come after the last word in the reparandum with a correspondence.</Paragraph>
      <Paragraph position="7"> Lij indicates that Wj licenses the current word Wi, or, when Wi is an inserted word, that Wj is the first available word in the reparandum; otherwise it is null. Just as with the reparandum onset, we estimate the probability by querying each eligible word. The context for this query includes information about the proposed word, namely its POS tag, as well as the utterance POS and word context prior to the current word, the type of repair and the reparandum length. We also include information about the repair structure that has been found so far. If the previous word was a word match, there is a good chance that the current word will involve a word match to the next word. The rest of the features are the number of words skipped in the reparandum and alteration since the last correspondence, the number of words since the onset of the reparandum and alteration, and the number of words to the end of the reparandum.</Paragraph>
    </Section>
    <Section position="6" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
5.6 The Word Correspondence
</SectionTitle>
      <Paragraph position="0"> Now that we have decided which word in the reparandum will potentially license the current word, we need to predict the type of correspondence. We focus on correspondences involving exact word match (identical POS tag and word), word replacements (same POS tag), or no such correspondence.</Paragraph>
      <Paragraph position="2"> Wi is a word match of the word indicated by Li Wi is a word replacement of the word indicated by Li Wi has no correspondence (inserted word) No active repair The context used for estimating the correspondence variable is exactly the same as that used for estimating the licensor.</Paragraph>
    </Section>
    <Section position="7" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
5.7 Redefining the Speech Recognition Problem
</SectionTitle>
      <Paragraph position="0"> Now that we have introduced the correction tags, we redefine the speech recognition problem so that it includes finding the most probable corrections tags.</Paragraph>
      <Paragraph position="1"> WDCLOREI arg max Pr( WDCLOREIIA )</Paragraph>
      <Paragraph position="3"> The second term is the language model and can be rewritten as we did for Equation 12.</Paragraph>
      <Paragraph position="4"> We have already discussed the context used for estimating the three new probability distributions. We also have a richer context for estimating the other five distributions. For these, we take advantage of the new definition of the utterance word and POS tags, which now accounts for the reparanda of repairs. Consider the following example.</Paragraph>
      <Paragraph position="5"> Example 33 (d93-13.1 utt64) pick up and load two um the two boxcars on engine two reparandum lp In processing the word the, if we hypothesized that it follows a modification repair with editing term um and reparandum two, then we can now generalize with fluent examples, such as the following, in hypothesizing its POS tag and the word identity. Example 34 (d93-12.4 utt97) and to make the orange juice and load the tankers Thus, we can make use of the second knowledge source of Section 5.1.  Computational Linguistics Volume 25, Number 4 Cleaning up fresh starts requires a slightly different treatment. Fresh starts abandon the current utterance, and hence the alteration starts a new utterance. But this new utterance will start differently than most utterances in that it will not begin with initial filled pauses, or phrases such as let's see, since these would have been counted as part of the editing term of the fresh start. Hence, when we clean up the reparanda of fresh starts, we leave the fresh start marker Can, just as we do for intonational boundaries.</Paragraph>
      <Paragraph position="6"> For predicting the word and POS tags, we have an additional source of information, namely the values of the correspondence licensor and the correspondence type. Rather than use these two variables as part of the context that we give the decision tree algorithm, we use these tags to override the decision tree probability. If a word replacement or word match was hypothesized, we assign all of the POS probability to the appropriate POS tag. If a word match was hypothesized, we assign all of the word probability to the appropriate word.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="648" end_page="648" type="metho">
    <SectionTitle>
6. Acoustic Cues
</SectionTitle>
    <Paragraph position="0"> Silence, as well as other acoustic information, can also give evidence as to whether an intonational phrase, speech repair, or editing term occurred, as was shown in Table 8.</Paragraph>
    <Paragraph position="1"> In this section, we revise the language model to incorporate this information.</Paragraph>
    <Section position="1" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
6.1 Redefining the Speech Recognition Problem
</SectionTitle>
      <Paragraph position="0"> In the same way that speech recognizers hypothesize lexical items, they also hypothesize pauses. Rather than insert these into the word sequence (e.g., Zeppenfeld et al.</Paragraph>
      <Paragraph position="1"> 1997), we define the variable Si to be the amount of silence between words Wi_ 1 and Wi. We incorporate this information by redefining the speech recognition problem.</Paragraph>
      <Paragraph position="3"> Again, the first term is the acoustic model, which one can approximate by Pr(A|WS), and thus reduce it to a traditional acoustic model. The second term is the new language model, which we rewrite as follows:</Paragraph>
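A sketch of this rewrite, with hi as our shorthand for everything hypothesized for words 1 through i-1 and with the silence variable expanded first (the ordering of the remaining tags follows the worked example in Section 7; the paper's exact notation may differ):

  Pr(WDCLOREIS) = Π_{i=1..N} Pr(Si | hi) Pr(Ii | hi Si) Pr(Ei | hi Si Ii) Pr(Ri | hi Si Ii Ei) Pr(Oi | hi Si Ii Ei Ri) Pr(Li | hi Si Ii Ei Ri Oi) Pr(Ci | hi Si Ii Ei Ri Oi Li) Pr(Di | hi Si Ii Ei Ri Oi Li Ci) Pr(Wi | hi Si Ii Ei Ri Oi Li Ci Di)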
      <Paragraph position="5"> We expand the silence variable first so that we can use it as part of the context in estimating the tags for the remaining variables.</Paragraph>
      <Paragraph position="6"> We now have an extra probability in our model, namely the probability of Si given the previous context. The variable Si will take on values in accordance with the minimum time samples that the speech recognizer uses. To deal with limited amounts of training data, one could collapse these durations into larger intervals. Note that including this probability impacts the perplexity computation. Usually, prediction of silence durations is not included in the perplexity calculation. In order to allow comparisons between the perplexity rates of the model that includes silence durations and ones that do not, we exclude the probability of Si in the perplexity calculation.</Paragraph>
    </Section>
    <Section position="2" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
6.2 Using Silence as Part of the Context
</SectionTitle>
      <Paragraph position="0"> We now need to include the silence durations as part of the context for predicting the values of the other variables. However, it is just for the intonation, repair, and editing  Preference for utterance tags given the length of silence.</Paragraph>
      <Paragraph position="1"> term variables that this information is most appropriate. We could let the decision tree algorithm use the silence duration as part of the context in estimating the probability distributions. However, our attempts at doing this have not met with success, perhaps because asking questions about the silences fragments the training data and hence makes it difficult to model the influence of the other aspects of the context. Instead, we treat the silence information as being independent from the other context. Below we give the derivation for the intonation variable. For expository ease, we define Contexti to be the prior context for deciding the probabilities for word Wi.</Paragraph>
      <Paragraph position="3"> The second line involved the assumptions that Contexti and Si are independent and that Contexti and Si are independent given Ii. The first assumption is obviously too strong. If the previous word is a noun it is more likely that there will be a silence after it than if the previous word was an article. However, the assumptions allow us to model the silence information independently from the other context, which gives us more data to estimate its effect. The result is that we use the factor Pr(Ii|Si) to adjust the probabilities computed by the decision tree algorithm, which does not use the silence durations. We guard against shortcomings by normalizing the adjusted probabilities to ensure that they sum to one.</Paragraph>
      <Paragraph position="4"> To compute Pr(Ii|Si), we group the silence durations into 30 intervals and then smooth the counts using a Gaussian filter. We do the same adjustment for the editing term and repair variables. For the editing term variable, we only do the adjustment if the intonation tag is null, due to a lack of data in which editing terms co-occur with intonational phrasing. For the repair variable, we only do the adjustment if the intonation tag is null and the editing term tag is not a push or pop. Figure 7 gives the adjustments for the resulting six equivalence classes of utterance tags. The ratio between the curves gives the preference for one class over another, for a given silence duration. Silence durations were automatically obtained from a word aligner (Entropic Research Laboratory, Inc. 1994).</Paragraph>
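One way to realize this adjustment in code is sketched below. The data structures and names are our own assumptions (the paper does not give an implementation): counts of each utterance tag per silence-duration bin, a simple Gaussian smoothing of those counts, and a renormalized product with the decision-tree probabilities.

import numpy as np

def silence_adjusted_probs(tag_counts_by_bin, tree_probs, bin_index, sigma=1.0):
    """Adjust decision-tree tag probabilities using the observed silence duration.

    tag_counts_by_bin: dict tag -> numpy array of counts over the 30 silence bins.
    tree_probs: dict tag -> probability from the decision tree (which ignores silence).
    bin_index: index of the silence bin that the observed pause falls into.
    """
    # smooth each tag's histogram with a small Gaussian kernel
    radius = int(3 * sigma)
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-xs ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    smoothed = {t: np.convolve(c, kernel, mode="same") for t, c in tag_counts_by_bin.items()}

    # Pr(tag | silence bin): normalize the smoothed counts within the bin
    bin_total = sum(s[bin_index] for s in smoothed.values())
    if bin_total == 0:
        return dict(tree_probs)  # no silence evidence for this bin; keep the tree estimates
    adjusted = {t: tree_probs[t] * (smoothed[t][bin_index] / bin_total) for t in tree_probs}

    # renormalize so the adjusted probabilities sum to one
    total = sum(adjusted.values())
    return {t: p / total for t, p in adjusted.items()}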
      <Paragraph position="5"> Silences between speaker turns are not used in computing the preference factor, nor is the preference factor used at such points. The end of the speaker's turn is determined jointly by both the speaker and the hearer. So when building a system that is designed to participate in a conversation, these silence durations will be partially determined by the system's turn-taking strategy. We also do not include the silence durations after word fragments since these silences were hand-computed.</Paragraph>
      <Paragraph position="6"> 7. Example
This section illustrates the workings of the algorithm. As in Section 3.4.1, the algorithm is constrained to the word transcriptions and incrementally considers all possible interpretations (those that do not get pruned), proceeding one word at a time. Since resolving speech repairs is the most complicated part of our model, we focus on this using the following example of overlapping repairs.</Paragraph>
      <Paragraph position="8"> Rather than try to show all of the competing hypotheses, we focus on the correct interpretation, which, for this example, happens to be the winning interpretation. We contrast the probabilities of the correct tags with those of its competitors. For reference, we give a simplified view of the context that is used for each probability. Full results of the algorithm will be given in the next section.</Paragraph>
    </Section>
    <Section position="3" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
7.1 Predicting &amp;quot;um&amp;quot; as the Onset of an Editing Term
</SectionTitle>
      <Paragraph position="0"> Below, we give the probabilities involved in the correct interpretation of the word um given the correct interpretation of the words okay uh and that will take a total of. We start with the intonation variable. The correct tag of null is significantly preferred over the alternative, mainly because intonational boundaries rarely follow prepositions.</Paragraph>
      <Paragraph position="1">  With El0 = Push, the only allowable repair tag is null. Since no repair has been started, the reparandum onset O10 must be null. Similarly, since no repair is in progress, L10, the correspondence licensor, and C10, the correspondence type, must both be null.</Paragraph>
      <Paragraph position="2"> We next hypothesize the POS tag. Below we list all of the tags that have a probability greater than 1%. Since we are starting an editing term, we see that POS tags  Heeman and Allen Modeling Speakers' Utterances associated with the first word of an editing term have a high probability, such as UFLFP for um, AC for okay, CC_D for or, UH_D for well, and VB for the let in let's see.  Given the correct interpretation of the previous words, the probability of the filled pause um along with the correct tags is 0.090.</Paragraph>
    </Section>
    <Section position="4" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
7.2 Predicting &amp;quot;total&amp;quot; as the Alteration Onset
</SectionTitle>
      <Paragraph position="0"> We now give the probabilities involved in the second instance of total, which is the alteration onset of the first repair, whose editing term um let's see, which ends an intonational phrase, has just finished. Again we start with the intonation variable.</Paragraph>
      <Paragraph position="1"> Pr(I14=% \[ a total of Push um let's see) = 0.902 Pr(I14=null I a total of Push um let's see) = 0.098 For I14 -~ %, the editing term probabilities are given below. Since an editing term is in progress, the only possibilities are that it is continued or that it has ended.</Paragraph>
      <Paragraph position="2"> Pr(E14=Pop \[ a total of Push um let's see %) = 0.830 Pr(E14=ET I a total of Push um let's see %) -- 0.170 For E14 ~ Pop, we give the probabilities for the repair variable. Since an editing term has just ended, the null tag for the repair variable is ruled out. Note the modification interpretation receives a score approximately one third of that of a fresh start. However, the repair interpretation catches up after the alteration is processed.</Paragraph>
      <Paragraph position="3">  Computational Linguistics Volume 25, Number 4 For each, we give the proposed reparandum onset, X, and the words that precede it.  With total as the correspondence licensor, we need to decide the type of correspondence: whether it is a word match, word replacement, or otherwise.  For the correct interpretation, the word correspondence is a word match with the word total and POS tag NN. Hence, the POS tag and identity of the current word are both fixed and hence have a probability of 1. Given the correct interpretation of the previous words, the probability of the word total along with the correct tags is 0.0111.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="648" end_page="648" type="metho">
    <SectionTitle>
8. Results
</SectionTitle>
    <Paragraph position="0"> In this section, we present the results of running our model on the Trains corpus. This section not only shows the feasibility of the model, but also supports the thesis that the tasks of resolving speech repairs, identifying intonational phrases and discourse markers, POS tagging, and speech recognition language modeling must be accomplished in a single model to account for the interactions between these tasks. We start with the models that we presented in Section 3, and vary which variables of Section 4, 5, and 6 that we include. All results in this section were obtained using the sixfold cross-validation procedure described in Section 3.4.1.</Paragraph>
    <Section position="1" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
8.1 POS Tagging, Perplexity, and Discourse Markers
</SectionTitle>
      <Paragraph position="0"> Table 10 shows that POS tagging, word perplexity, and discourse markers benefit from modeling intonational phrases and speech repairs. The second column gives the results of the POS-based language model of Section 3. The third column adds intonational phrase detection, which reduces the POS error rate by 3.8%, improves  discourse marker identification by 6.8%, and reduces perplexity slightly from 24.04 to 23.91. These improvements are of course at the expense of the branching perplexity, which increases from 26.35 to 30.61. Column four augments the POS-based model with speech repair detection and correction, which improves POS tagging and reduces word perplexity by 3.6%, while only increasing the branching perplexity from 26.35 to 27.69. Although we are adding five variables to the speech recognition problem, most of the extra ambiguity is resolved by the time the word is predicted. Thus, corrections can be sufficiently resolved by the first word of the alteration. Column five combines the models of columns three and four and results in a further improvement in word perplexity. POS tagging and discourse marker identification do not seem to benefit from combining the two processes, but both rates remain better than those obtained from the base model.</Paragraph>
      <Paragraph position="1"> Column six adds silence information. Silence information is not directly used to decide the POS tags, the discourse markers, nor what words are involved; rather, it gives evidence as to whether an intonational boundary, speech repair, or editing term occurred. As the following sections show, silence information improves the performance on these tasks, and this translates into better language modeling, resulting in a further decrease in perplexity from 22.96 to 22.35, giving an overall perplexity reduction of 7.0% over the POS-based model. We also see a significant improvement in POS tagging with an error rate reduction of 9.5% over the POS-based model, and a reduction in the discourse marker error rate of 15.4%. As we further improve the modeling of the user's utterance, we should expect to see further improvements in the language model.</Paragraph>
    </Section>
    <Section position="2" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
8.2 Intonational Phrases
</SectionTitle>
      <Paragraph position="0"> Table 11 demonstrates that modeling intonational phrases benefits from modeling silence information, speech repairs, and discourse markers. Column two gives the base results of modeling intonational phrase boundaries. Column three adds silence information, which reduces the error rate for turn-internal boundaries by 9.1%. Column four adds speech repair detection, which further reduces the error rate by 3.5%. Column five adds speech repair correction. Curiously, this actually slightly increases the error rate for intonational boundaries but the rate is still better than not modeling repairs at all (column four). The final result for within-turn boundaries is a recall rate of 71.8%, with a precision of 70.8%. The last column subtracts out the discourse marker modeling by using the POS tagset P of Section 3.4.5, which collapses discourse marker usage with sentential usages. Removing the modeling of discourse markers results in a 2.0% degradation in identifying turn-internal boundaries and 7.2% for end-of-turn boundaries.</Paragraph>
    </Section>
    <Section position="3" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
8.3 Detecting Speech Repairs
</SectionTitle>
      <Paragraph position="0"> We now demonstrate that detecting speech repairs benefits from modeling speech repair correction, intonational phrases, silences, and discourse markers. We use two measures to compare speech repair detection. The first measure, referred to as All Repairs, ignores errors that result from improperly identifying the type of repair, and hence scores a repair as correctly detected as long as it was identified as either an abridged repair, a modification repair, or a fresh start. For experiments that include speech repair correction, we further relax this rule. When multiple repairs have contiguous reparanda, we count all repairs involved (of the hand-annotations) as correct as long as the combined reparandum is correctly identified. Hence, for Example 29 given earlier, as long as the overall reparandum was identified as from engine from, both of the hand-annotated repairs are counted as correct.</Paragraph>
      <Paragraph position="1"> We argued earlier that the proper identification of the type of repair is necessary for successful correction. Hence, the second measure, Exact Repairs, counts a repair as being correctly identified only if the type of the repair is also properly determined. Under this measure, a flesh start detected as a modification repair is counted as a false positive and as a missed repair. Just as with All Repairs, for models that include speech repair correction, if a misidentified repair is correctly corrected, then it is counted as correct. We also give a breakdown of this measure by repair type.</Paragraph>
      <Paragraph position="2"> The results are given in Table 12. The second column gives the base results for detecting speech repairs. The third column adds speech repair correction, which improves the error rate from 46.2% to 41.0%, a reduction of 11.2%. Part of this improvement is attributed to better scoring of overlapping repairs. However, from an analysis of the results, we found that this could account for at most 32 of the 124 fewer errors. Hence, a reduction of at least 8.3% is directly attributed to incorporating speech repair correction. The fourth column adds intonational phrasing, which reduces the error rate for detecting repairs from 41.0% to 37.9%, a reduction of 7.4%. The fifth column adds silence information, which further reduces the error rate to 35.0%, a reduction of 7.7%.</Paragraph>
      <Paragraph position="3"> Part of this improvement is a result of improved intonational phrase modeling, and  part is a result of using pauses to detect speech repairs. This gives a final recall rate of 76.8% with a precision of 86.7%. In the last column, we show the effect of removing the modeling of discourse markers, which increases the error rate of detecting repairs by 4.8%.</Paragraph>
    </Section>
    <Section position="4" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
8.4 Correcting Speech Repairs
</SectionTitle>
      <Paragraph position="0"> Table 13 shows that correcting speech repairs benefits from modeling intonational phrasing, silences, and discourse markers. Column two gives the base results for correcting repairs, which is a recall rate of 61.9% and a precision of 71.4%. Note that abridged and modification repairs are corrected at roughly the same rate but the correction of fresh starts proves particularly problematic. Column three adds intonational phrase modeling. Just as with detecting repairs, we see that this improves correcting each type of repair, with the overall error rate decreasing from 62.9 to 58.9, a reduction of 6.3%. From Table 12, we see that only 73 fewer errors were made in detecting repairs after adding intonational phrase modeling, while 95 fewer errors were made in correcting them. Thus adding intonation phrases leads to better correction of the detected repairs. Column four adds silence information, which further reduces the error rate to 56.9%, a reduction of 3.4%. This gives a final recall rate of 65.9% with a precision of 74.3%. The last column subtracts out discourse marker modeling, which degrades the correction error rate by 5.2%. From Table 12, 40 errors were introduced in detecting repairs by removing discourse marker modeling, while 72 errors were introduced in correcting them. Thus modeling discourse markers leads to better correction of the detected repairs.</Paragraph>
    </Section>
    <Section position="5" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
8.5 Collapsing Repair Distinctions
</SectionTitle>
      <Paragraph position="0"> Our classification scheme distinguishes between fresh starts and modification repairs.</Paragraph>
      <Paragraph position="1"> Table 14 contrasts the full model (column 3) with one that collapses modification repairs and fresh starts (column 2). To ensure a fair comparison, the reported detection rates do not penalize incorrect identification of the repair type. We find that distinguishing fresh starts and modification repairs results in a 7.0% improvement in detecting repairs and a 6.6% improvement in correcting them. Hence, the two types of repairs differ enough both in how they are signaled and the manner in which they are corrected that it is worthwhile to model them separately. Interestingly, we also see that distinguishing between fresh starts and modification repairs improves intonational phrase identification by 1.9%. This improvement is undoubtedly attributable to the fact that the reparandum onset of fresh starts interacts more strongly with intonational boundaries than does the reparandum onset of modification repairs. As for perplexity and POS tagging, there was virtually no difference, except a slight increase in branching perplexity for the full model.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="648" end_page="648" type="metho">
    <SectionTitle>
9. Comparison
</SectionTitle>
    <Paragraph position="0"> Comparing the performance of our model to others that have been proposed is problematic. First, there are differences in corpora. The Trains corpus is a collection of dialogues between two people, both of whom realize that they are talking to another person. The ATIS corpus (MADCOW 1992), on the other hand, is a collection of human-computer dialogues. The rate of repairs in this corpus is much lower and almost all speaker turns consists of just one contribution. The Switchboard corpus (Godfrey, Holliman, and McDaniel 1992) is a collection of human-human dialogues, which are much less constrained and about a much wider domain. Even more extreme are corpora of professionally read speech. A second problem is that different systems employ different inputs; for instance, does the input include POS tags, utterance segmentation, or hand-transcriptions of the words that were uttered? We also note that this work is the first proposal that combines the detection and correction of speech repairs, the identification of intonational phrases and discourse markers, and POS tagging, in a framework that is amenable to speech recognition. Hence our comparison is with systems that address only part of the problem.</Paragraph>
    <Section position="1" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
9.1 Speech Repairs
</SectionTitle>
      <Paragraph position="0"> Table 15 gives the results of the full model for detecting and correcting speech repairs. The overall correction recall rate is 65.9% with a precision of 74.3%. In the table, we also report the results for each type of repair using the Exact Repair metric. To facilitate comparisons with approaches that do not distinguish between modification repairs and fresh starts, we give the combined results of these two categories. null Bear, Dowding, and Shriberg (1992) investigated the use of pattern matching of the word correspondences, global and local syntactic and semantic ill-formedness, and acoustic cues as evidence for detecting speech repairs. They tested their pattern matcher on a subset of the ATIS corpus from which they removed &amp;quot;all trivial&amp;quot; repairs, repairs that involve only the removal of a word fragment or a filled pause. For their  Heeman and Allen Modeling Speakers' Utterances pattern-matching results, they achieved a detection recall rate of 76% with a precision of 62%, and a correction recall rate of 44% with a precision of 35%. They also tried combining syntactic and semantic knowledge in a &amp;quot;parser-first&amp;quot; approach--first try to parse the input and if that fails, invoke repair strategies based on word patterns in the input. In a test set containing 26 repairs Dowding et al. 1993, they obtained a detection recall rate of 42% with a precision of 85%, and a correction recall rate of 31% with a precision of 62%.</Paragraph>
      <Paragraph position="1"> Nakatani and Hirschberg (1994) proposed that speech repairs should be detected in a &amp;quot;speech-first&amp;quot; model using acoustic-prosodic cues, without relying on a word transcription. In order to test their theory, they built a decision tree using a training corpus of 148 turns of speech. They used hand-transcribed prosodic-acoustic features such as silence duration, energy, and pitch, as well as traditional text-first cues such as presence of word fragments, filled pauses, word matches, word replacements, POS tags, and position of the word in the turn, and obtained a detection recall rate of 86% with a precision of 91%. The cues they found relevant were duration of pauses between words, word fragments, and lexical matching within a window of three words. Note that in their corpus, 73% of the repairs were accompanied by a word fragment, as opposed to 32% of the modification repairs and fresh starts in the Trains corpus.</Paragraph>
      <Paragraph position="2"> Hence, word fragments are a stronger indicator of speech repairs in their corpus than in the Trains corpus. Also note that their training and test sets only included turns with speech repairs; hence their &amp;quot;findings should be seen more as indicative of the relative importance of various predictors of \[speech repair\] location than as a true test of repair site location&amp;quot; (page 1612).</Paragraph>
      <Paragraph position="3"> Stolcke and Shriberg (1996b) incorporated repair resolution into a word-based language model. They limited the types of repair to single and double word repetitions and deletions, deletions from the beginning of the sentence, and filled pauses. In predicting a word, they summed over the probability distributions for each type of repair (including no repair at all). For hypotheses that include a repair, the prediction of the next word was based upon a cleaned up representation of the context, and took into account whether a single or double word repetition was predicted. Surprisingly, they found that this model actually degrades performance, in terms of perplexity and word error rate. They attributed this to their treatment of filled pauses: utterancemedial filled pauses should be cleaned up before predicting the next word, whereas utterance-initial ones should be left intact, a distinction that we make in our model by modeling intonational phrases.</Paragraph>
      <Paragraph position="4"> Siu and Ostendorf (1996) extended a language model to account for three roles that words such as filled pauses can play in an utterance: utterance-initial, part of a nonabridged repair, or part of an abridged repair. By using training data with these roles marked and a function-specific variable n-gram model (i.e., use a different context for the probability estimates depending on the function of the word), and summing over each possible role, they achieved a perplexity reduction of 82.9 to 81.1.</Paragraph>
    </Section>
    <Section position="2" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
9.2 Utterance Units and Intonational Phrases
</SectionTitle>
      <Paragraph position="0"> We now contrast our intonational phrase results with the results of other researchers in phrases, or other definitions of utterance units. Table 16 gives our performance.</Paragraph>
      <Paragraph position="1"> Most methods for detecting phrases use end-of-turn as a source of evidence; however, this is jointly determined by both participants. Hence, a dialogue system, designed to participate in the conversation, will not be able to take advantage of this information.</Paragraph>
      <Paragraph position="2"> For this reason, we focus on turn-internal intonational phrase boundaries.</Paragraph>
      <Paragraph position="3">  Wightman and Ostendorf (1994) used preboundary lengthening, pausal durations, and other acoustic cues to automatically label intonational phrases and word accents. They trained a decision tree to estimate the probability of a phrase boundary given the acoustic context. These probabilities were fed into a Markov model whose state is the boundary type of the previous word. For training and testing their algorithm, they used a single-speaker corpus of news stories read by a public radio announcer. With this speaker-dependent model, they achieved a recall rate of 78.1% and a precision of 76.8%. 4 However, it is unclear how well this will adapt to spontaneous speech, where repairs might interfere with the cues that they use, and to speaker independent testing. Wang and Hirschberg (1992) also looked at detecting intonational phrases. Using automatically labeled features, including POS tag of the current word, category of the constituent being built, distance from last boundary, and word accent, they built decision trees to classify each word as to whether it has an intonational boundary.</Paragraph>
      <Paragraph position="4"> Note that they do not model interactions with other tasks, such as POS tagging. With this approach, they achieved a recall rate of 79.5% and a precision rate of 82.7% on a subset of the ATIS corpus. Excluding end-of-turn data gives a recall rate of 72.2% and a precision of 76.2%. These results group speech repairs with intonational boundaries and do not distinguish between them. In their corpus, there were 424 disfluencies and 405 turn-internal boundaries. The performance of the decision tree that does not classify disfluencies as intonational boundaries is significantly worse. However, these results were achieved with one-tenth the data of the Trains corpus.</Paragraph>
      <Paragraph position="5"> Kompe et al. (1995) combined acoustic cues with a statistical language model to find intonational phrases. They combined normalized syllable duration, length of pauses, pitch contour, and energy using a multilayered perceptron that estimates the probability Pr(vilci), where vi indicates if there is a boundary after the current word and ci is the acoustic features of the neighboring six syllables. This score is combined with the score from a statistical language model, which determines the probability of the word sequence with the hypothesized phrase boundary inserted using a backoff strategy.</Paragraph>
      <Paragraph position="7"> Building on this work, Mast et al. (1996) segmented speech into speech acts as the first step in automatically classifying them and achieved a recognition accuracy of 92.5% on turn-internal boundaries using Verbmobil dialogues. This translates into a recall rate of 85.0%, a precision of 53.1%, and an error rate of 90.1%. Their model, which employs rich acoustic modeling, does not account for interactions with speech repairs or discourse markers, nor does it redefine the speech recognition language model.</Paragraph>
      <Paragraph position="8"> Meteer and Iyer (1996) investigated whether modeling linguistic segments, segments with a single independent clause, improves language modeling. They computed</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="648" end_page="648" type="metho">
    <SectionTitle>
Footnote 4: Derivations of recall and precision rates are given in detail in Heeman (1997).
</SectionTitle>
    <Paragraph position="0"> Heernan and Allen Modeling Speakers&amp;quot; Utterances the probability of the sequence of words with the hypothesized segment boundaries inserted into the sequence. Working on the Switchboard corpus, they found that predicting linguistic boundaries improved perplexity from 130 to 127. Similar to this work, Stolcke and Shriberg (1996a) investigated how the language model can find the boundaries. Their best results were obtained by using POS tags as part of the input, as well as the word identities of certain word classes, in particular, filled pauses, conjunctions, and certain discourse markers. However, this work does not incorporate automatic POS tagging and discourse marker identification.</Paragraph>
    <Section position="1" start_page="648" end_page="648" type="sub_section">
      <SectionTitle>
9.3 Discourse Markers
</SectionTitle>
      <Paragraph position="0"> The full model results in 533 errors in discourse marker identification, giving an error rate of 6.43%, a recall of 97.26%, and a precision of 96.32%. Although numerous researchers have noted the importance of discourse markers in determining discourse structure, there has not been a lot of work in actually identifying them.</Paragraph>
      <Paragraph position="1"> Hirschberg and Litman (1993) examined how intonational information can distinguish between discourse and sentential interpretation for a set of ambiguous lexical items. They used hand-transcribed intonational features and only examined discourse markers that were one word long, as we have. They found that discourse usages were either an intermediate phrase by themselves (or in a phrase consisting entirely of ambiguous tokens), or they are first in an intermediate phrase (or preceded by other ambiguous tokens) and are either de-accented or have a low word accent. In a monologue of approximately 12,500 words, their model achieved a recall rate of 63.1% with a precision of 88.3%. Many of the errors occurred on coordinate conjuncts, such as and, or, and but, which proved problematic for annotating as well, since &amp;quot;the discourse meanings of conjunction as described in the literature ... seem to be quite similar to the meanings of sentential conjunction&amp;quot; (page 518).</Paragraph>
      <Paragraph position="2"> Litman (1996) used machine learning techniques to identify discourse markers.</Paragraph>
      <Paragraph position="3"> The best set of features for predicting discourse markers were lengths of intonational and intermediate phrase, positions of token in intonational and intermediate phrase, composition of intermediate phrase (token is alone in intermediate phrase or phrase consists entirely of potential discourse markers), and identity of the token. The algorithm achieved a success rate of 85.5%, which translates into a discourse marker error rate of 37.3%, in comparison to the rate of 45.3% for Hirschberg and Litman (1993).</Paragraph>
      <Paragraph position="4"> Direct comparisons with our error rate of 6.4% are problematic since our corpus is five times as large and we use task-oriented human-human dialogues, which include a lot of turn-initial discourse markers for coordinating mutual belief. In any event, the work of Litman and Hirschberg indicates the usefulness of modeling intermediate phrase boundaries and word accents. Conversely, our approach does not force decisions to be made independently and does not assume intonational annotations as input; rather, we identify discourse markers as part of the task of searching for the best assignment of discourse markers along with POS tags, speech repairs, and intonational phrases.</Paragraph>
      <Paragraph position="5"> 10. Conclusion and Future Work In this paper, we redefined the speech recognition language model so that it also identifies POS tags, intonational phrases, and discourse markers, and resolves speech repairs. This language model allows the speech recognizer to model the speaker's utterances, rather than simply the words involved. This allows it to better account for the words involved and allows it to return a more meaningful analysis of the speaker's turn for later processing. The model incorporates identifying intonational phrases, discourse markers, and POS tags, and detecting and correcting speech repairs; hence,  Computational Linguistics Volume 25, Number 4 interactions that exist between these tasks, as well as the task of predicting the next word, can be modeled.</Paragraph>
      <Paragraph position="6"> Constraining our model to the hand-transcription, it is able to identify 71.8% of all turn-internal intonational boundaries with a precision of 70.8%, identify 97.3% of all discourse markers with a precision of 96.3%, and detect and correct 65.9% of all speech repairs with a precision of 74.3%. These results are partially attributable to accounting for the interaction between these tasks: modeling intonation phrases improves speech repair detection by 7.4% and correction by 6.3%; modeling speech repairs improves intonational phrase identification by 3.5%; modeling repair correction improves repair detection by 8.3%; modeling repairs and intonational phrases improves discourse marker identification by 15.4%; and removing the modeling of discourse markers degrades intonational phrase identification by 2.0%, speech repair detection by 4.8%, and speech repair correction by 5.2%. Speech repairs and intonational phrases create discontinuities that traditional speech recognition language models and POS taggers have difficulty modeling. Modeling speech repairs and intonational phrases results in a 9.5% improvement in POS tagging and a 7.0% improvement in perplexity. Part of this improvement is from exploiting silences to give evidence of the speech repairs and intonational phrase boundaries.</Paragraph>
      <Paragraph position="7"> More work still needs to be done. First, with the exception of pauses, we do not consider acoustic cues. This is a rich source of information for detecting (and distinguishing between) intonational phrases, interruption points of speech repairs, and even discourse markers. It would also help in determining the reparandum onset of fresh starts, which tend to occur at intonational boundaries. Acoustic modeling is also needed to identify word fragments. The second area is extending the model to incorporate higher level syntactic and semantic processing. This would not only allow us to give a much richer output from the model, but it would also allow us to account for interactions between this higher-level knowledge and modeling speakers' utterances, especially in detecting the ill-formedness that often occurs with speech repairs. It would also aid in finding richer correspondences between the reparandum and alteration, such as between the noun phrase and pronoun in the following example.</Paragraph>
      <Paragraph position="8"> Example 36 (d93-14.3 utt27) the engine can take as many ~ ,~ it can take up to three loaded boxcars</Paragraph>
      <Paragraph position="10"> The third area of future research is to show that our model works on other languages.</Paragraph>
      <Paragraph position="11"> Although the model encodes the basic structure of speech repairs, intonational phrases, and discourse markers, actual parameters are learned from a training corpus. Preliminary work on a Japanese corpus indicates that the model is not language specific (Heeman and Loken-Kim 1999). The fourth and most important area is to incorporate our work into a speech recognizer. We have already used our POS-based model to rescore word-graphs, which results in a one percent absolute reduction in word error rate in comparison to a word-based model (Heeman 1999). Our full model, which accounts for intonational phrases and speech repairs, should lead to a further reduction, as well as return a richer understanding of the speech.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML