<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1034"> <Title>Tagging Speech Repairs</Title> <Section position="4" start_page="187" end_page="187" type="metho"> <SectionTitle> 3. The Corpus </SectionTitle> <Paragraph position="0"> As part of the TRAINS project (Allen and Schubert, 1991), which is a long-term research project to build a conversationally proficient planning assistant, we are collecting a corpus of problem-solving dialogs. The dialogs involve two participants: one who is playing the role of a user and has a certain task to accomplish, and another who is playing the role of the system by acting as a planning assistant (Gross, Allen and Traum, 1992). The entire corpus consists of 112 dialogs totaling almost eight hours in length and containing about 62,000 words and 6300 speaker turns. These dialogs have been segmented into utterance files (cf. Heeman and Allen, 1994c); words have been transcribed and the speech repairs have been annotated. For a training set, we use 40 of the dialogs, consisting of 24,000 words; and for testing, 7 of the dialogs, consisting of 5800 words.</Paragraph> <Paragraph position="1"> In order to provide a large training corpus for the statistical model, we use a tagged version of the Brown corpus, from the Penn Treebank (Marcus, Santorini and Marcinkiewicz, 1993). We removed all punctuation in order to more closely approximate unsegmented spoken speech. This corpus provides us with category transition probabilities for fluent speech. These probabilities have also been used to bootstrap our algorithm in order to determine the category probabilities for speech repairs from our training corpus.(1) Speech repairs can be divided into three intervals (cf. Levelt, 1983): the removed text, the editing terms, and the resumed text. The removed text and the editing terms are what need to be deleted in order to determine what the speaker intended to say.(2) There is typically a correspondence between the removed text and the resumed text, and following Bear, Dowding and Shriberg (1992), we annotate this using the labels m for word matching and r for word replacements (words of the same syntactic category). Each pair is given a unique index. Other words in the removed text and resumed text are annotated with an x. Also, editing terms (filled pauses and clue words) are labeled with et, and the interruption point with int, which will be before any editing terms associated with the repair, and after the fragment, if present. (Further details of our annotation scheme can be found in (Heeman and Allen, 1994a).) Below is a sample annotation, with removed text &quot;go to oran-&quot;, editing term &quot;um&quot;, and resumed text &quot;go to&quot;.</Paragraph> <Paragraph position="2">
go | to | oran- |     | um | go | to | Corning
m1 | m2 | x     | int | et | m1 | m2 |
Table 1 gives a breakdown of the modification speech repairs (that do not interfere with other repairs) and the abridged repairs, based on hand-annotations. Modification repairs are broken down into four groups: word repetitions, larger repetitions, one word replacing another, and others. The percentage of repairs that include fragments and editing terms is also given. Two trends emerge from this data. First, fragments and editing terms mark less than 34% of all modification repairs. Second, the presence of a fragment or editing term does not give conclusive evidence as to whether the repair is a modification or an abridged repair.</Paragraph> </Section>
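As a quick illustration of the annotation scheme just described, the labeled example above can be thought of as a sequence of word/label pairs. The following sketch is purely illustrative: the list-of-pairs representation, the pseudo-token for the interruption point, and the helper function are not the project's actual file format or tools.

```python
# The annotated repair above as word/label pairs. "int" marks the
# interruption point, which falls after the fragment and before the
# editing term. This representation is illustrative only.
annotated = [
    ("go",      "m1"),   # matches the first word of the resumed text
    ("to",      "m2"),
    ("oran-",   "x"),    # word fragment in the removed text
    ("<int>",   "int"),  # interruption point (a position, not a word)
    ("um",      "et"),   # editing term (filled pause)
    ("go",      "m1"),
    ("to",      "m2"),
    ("Corning", None),   # resumed text continues fluently
]

def intended(words_labels):
    """Naive sketch: drop the removed text (labeled words before the
    interruption point) and the editing terms to recover what the
    speaker intended to say."""
    int_pos = next(i for i, (_, lab) in enumerate(words_labels) if lab == "int")
    kept_before = [w for w, lab in words_labels[:int_pos] if lab is None]
    kept_after = [w for w, lab in words_labels[int_pos + 1:] if lab != "et"]
    return " ".join(kept_before + kept_after)

print(intended(annotated))  # -> "go to Corning"
```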
<Section position="5" start_page="187" end_page="188" type="metho"> <SectionTitle> 4. Part-of-Speech Tagging </SectionTitle> <Paragraph position="0"> Part-of-speech tagging is the process of assigning to a word the category that is most probable given the sentential context (Church, 1988). The sentential context is typically approximated by only a set number of previous categories, usually one or two. Since the context is limited, we are making the Markov assumption: that the next transition depends only on the input, which is the word that we are currently trying to tag, and the previous categories. Good part-of-speech results can be obtained using only the preceding category (Weischedel et al., 1993), which is what we will be using. In this case, the number of states of the Markov model will be N, where N is the number of tags. By making the Markov assumption, we can use the Viterbi algorithm to find a maximum probability path in linear time.</Paragraph> <Paragraph position="1"> Footnote 1: ... the following changes: (1) we separated prepositions from subordinating conjunctions; (2) we separated uses of &quot;to&quot; as a preposition from its use as part of a to-infinitive; (3) rather than classify verbs by tense, we classified them into four groups: conjugations of &quot;be&quot;, conjugations of &quot;have&quot;, verbs that are followed by a to-infinitive, and verbs that are followed immediately by another verb. Footnote 2: The removed text and editing terms might still contain pragmatic information, as the following example displays: &quot;Peter was... well... he was fired.&quot;</Paragraph> <Paragraph position="2"> Figure 1 gives a simplified view of a Markov model for part-of-speech tagging, where Ci is a possible category for the ith word, wi, and Ci+1 is a possible category for word wi+1. The category transition probability is simply the probability of category Ci+1 following category Ci, which is written as P(Ci+1|Ci), and the probability of word wi+1 given category Ci+1 is P(wi+1|Ci+1). The category assignment that maximizes the product of these probabilities is taken to be the best category assignment.</Paragraph> <Paragraph position="4"> Figure 1: Markov Model of Part-of-Speech Tagging</Paragraph> </Section>
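To make the bigram tagging model above concrete, here is a minimal sketch of Viterbi decoding over a toy tag set. The tag names, probability tables, and example sentence are invented for illustration; they are not the paper's TRAINS or Brown-corpus estimates.

```python
import math

# Toy bigram POS tagger: find the category sequence maximizing the product of
# P(Ci+1 | Ci) * P(wi+1 | Ci+1), as in Figure 1. All numbers are made up.
TAGS = ["DET", "NOUN", "VERB", "PREP"]

# P(Ci+1 | Ci), with "<s>" as the start-of-utterance context.
TRANS = {
    ("<s>", "DET"): 0.5, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.15, ("<s>", "PREP"): 0.05,
    ("DET", "NOUN"): 0.8, ("NOUN", "VERB"): 0.5, ("NOUN", "PREP"): 0.3,
    ("VERB", "DET"): 0.4, ("VERB", "PREP"): 0.3, ("PREP", "DET"): 0.5, ("PREP", "NOUN"): 0.4,
}

# P(wi | Ci): word emission probabilities.
EMIT = {
    ("DET", "the"): 0.6, ("NOUN", "engine"): 0.01, ("NOUN", "oranges"): 0.01,
    ("VERB", "takes"): 0.02, ("PREP", "to"): 0.3,
}

def viterbi(words):
    """Return the most probable tag sequence for `words` (log-space Viterbi)."""
    floor = 1e-8  # crude smoothing for unseen transitions/emissions
    best = {}     # best[tag] = (log prob of best path ending in tag, path)
    for t in TAGS:
        score = math.log(TRANS.get(("<s>", t), floor)) + math.log(EMIT.get((t, words[0]), floor))
        best[t] = (score, [t])
    for w in words[1:]:
        new_best = {}
        for t in TAGS:
            emit = math.log(EMIT.get((t, w), floor))
            score, path = max(
                (best[p][0] + math.log(TRANS.get((p, t), floor)) + emit, best[p][1] + [t])
                for p in TAGS)
            new_best[t] = (score, path)
        best = new_best
    return max(best.values())[1]

print(viterbi(["the", "engine", "takes", "the", "oranges"]))
```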
<Section position="6" start_page="188" end_page="188" type="metho"> <SectionTitle> 5. A Simple Model of Speech Repairs </SectionTitle> <Paragraph position="0"> Modification repairs are often accompanied by a syntactic anomaly across the interruption point. Consider the following example, &quot;so it takes two hours to go to - from Elmira to Corning&quot; (d93-17.4 utt57), which contains a &quot;to&quot; followed by a &quot;from&quot;. Both should be classified as prepositions, but the event of a preposition followed by another preposition is very rare in well-formed speech, so there is a good chance that one of the prepositions might get erroneously tagged as some other part of speech. Since the category transitions across interruption points tend to be rare events in fluent speech, we simply give the tagger the category transition probabilities around interruption points of modification repairs. By keeping track of when this information is used, we not only have a way of detecting modification repairs, but part-of-speech tagging is also improved.</Paragraph> <Paragraph position="1"> To incorporate knowledge about modification repairs, we let Ri be a variable that indicates whether the transition from word wi to wi+1 contains the interruption point of a modification repair, and rather than tag each word, wi, with just a category, Ci, we will tag it with Ri-1Ci, the category and the presence of a modification repair.(3) This effectively multiplies the size of the tagset by two. From Figure 1, we see that we will now need the following probabilities, P(RiCi+1|Ri-1Ci) and P(wi|Ri-1Ci).</Paragraph> <Paragraph position="2"> To keep the model simple, and to ease problems with sparse data, we make several independence assumptions.</Paragraph> <Paragraph position="3"> (1) Given the category of a word, a repair before it is independent of the word. (Ri-1 and wi are independent, given Ci.) So P(wi|Ri-1Ci) = P(wi|Ci).</Paragraph> <Paragraph position="4"> (2) Given the category of a word, a repair before that word is independent of a repair following it and the category of the next word. (Ri-1 is independent of RiCi+1, given Ci.) So P(RiCi+1|Ri-1Ci) = P(RiCi+1|Ci).</Paragraph> <Paragraph position="5"> Footnote 3: Changing each tag to CiRi would result in the same model.</Paragraph> <Paragraph position="6"> One manipulation we can do is to use the definition of conditional probabilities to rewrite P(RiCi+1|Ci) as P(Ri|Ci) * P(Ci+1|CiRi). This manipulation allows us to view the problem as tagging null tokens between words as either the interruption point of a modification repair, Ri = ri, or as fluent speech, Ri = φi. The resulting Markov model is shown in Figure 2. Note that the context for category Ci+1 is both Ci and Ri. So, Ri depends (indirectly) on the joint context of Ci and Ci+1, thus allowing syntactic anomalies to be detected.</Paragraph> <Paragraph position="7"> Table 3 (Section 6.4) gives results for this simple model running on our training corpus. In order to remove effects due to editing terms and word fragments, we temporarily eliminate them from the corpus. Also, for fresh starts and change-of-turn, the algorithm is reset, as if it were an end of sentence. To eliminate problems due to overlapping repairs, we include only data points in which the next word is not intended to be removed (based on our hand annotations). This gives us a total of 19,587 data points, of which 384 were modification repairs; the statistical model found 169 of these, and a further 204 false positives. This gives us a recall rate of 44.2% and a precision of 45.3%. In the test corpus, there are 98 modification repairs, of which the model found 30, and a further 23 false positives, giving a recall rate of 30.6% and a precision rate of 56.6%.</Paragraph> <Paragraph position="8"> From Table 1, we can see that the recall rate of fragments as a predictor of a modification repair is 14.7% and their precision is 34.7%. So, the method of statistically tagging modification repairs has more predictive power, and so can be used as a clue for detecting them. Furthermore, this method is doing something more powerful than just detecting word repetitions or category repetitions. Of the 169 repairs that it found, 109 were word repetitions and an additional 28 were category repetitions. So, 32 of the repairs that were found were from less obvious syntactic anomalies.</Paragraph> </Section>
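As a rough illustration of how the doubled tagset and the factorization P(RiCi+1|Ci) = P(Ri|Ci) * P(Ci+1|Ci,Ri) can be wired into the same Viterbi machinery, here is a sketch that continues the toy tables above. The repair probabilities and the function interface are invented placeholders, not the paper's trained values.

```python
# Each inter-word transition is also labeled with Ri (repair interruption
# point vs. fluent), doubling the effective tag set. P(Ri Ci+1 | Ci) is
# factored as P(Ri | Ci) * P(Ci+1 | Ci, Ri). All numbers are placeholders.
REPAIR, FLUENT = "R", "~R"

# P(Ri = repair | Ci): how likely an interruption point is after category Ci.
P_REPAIR_GIVEN_CAT = {"DET": 0.02, "NOUN": 0.03, "VERB": 0.02, "PREP": 0.04}

# P(Ci+1 | Ci, Ri = repair): transitions observed around interruption points,
# e.g. a preposition followed by another preposition becomes plausible.
TRANS_REPAIR = {("PREP", "PREP"): 0.35, ("DET", "DET"): 0.30, ("VERB", "VERB"): 0.25}

def transition(prev_cat, next_cat, repair, trans_fluent, floor=1e-8):
    """P(Ri, Ci+1 | Ci) under the factorization above."""
    p_r = P_REPAIR_GIVEN_CAT.get(prev_cat, 0.02)
    if repair == REPAIR:
        return p_r * TRANS_REPAIR.get((prev_cat, next_cat), floor)
    return (1.0 - p_r) * trans_fluent.get((prev_cat, next_cat), floor)

# A preposition-preposition transition scores far better under the repair
# reading than under the fluent reading, so the anomaly flags a repair.
fluent = {("PREP", "PREP"): 0.001}
print(transition("PREP", "PREP", REPAIR, fluent))   # ~0.04 * 0.35
print(transition("PREP", "PREP", FLUENT, fluent))   # ~0.96 * 0.001
```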
<Section position="7" start_page="188" end_page="189" type="metho"> <SectionTitle> 6. Adding Additional Clues </SectionTitle> <Paragraph position="0"> In the preceding section we built a model for detecting modification repairs by simply using category transitions. However, there are other sources of information that can be exploited, such as the presence of fragments, editing terms, and word matchings. The problem is that these clues do not always signal a modification repair. For instance, a fragment is twice as likely to be part of an abridged repair as it is to be part of a modification repair. One way to exploit these clues is to try to learn how to combine them, using a technique such as CART (Breiman, Friedman and Olshen, 1984). However, a more intuitive approach is to adjust the transition probabilities for a modification repair to better reflect the more specific information that is known.</Paragraph> <Paragraph position="3"> Thus, we combine the information such that the individual pieces do not have to give a 'yes' or a 'no', but rather, all can contribute to the decision.</Paragraph> <Section position="1" start_page="189" end_page="189" type="sub_section"> <SectionTitle> 6.1. Fragments </SectionTitle> <Paragraph position="0"> Assuming that fragments can be detected automatically (cf. Nakatani and Hirschberg, 1993), the question arises as to what the tagger should do with them. If the tagger treats them as lexical items, the words on either side of the fragment will be separated. This will cause two problems. First, if the fragment is part of an abridged repair, category assignment to these words will be hindered. Second, and more important to our work, is that the fragment will prevent the statistical model from judging the syntactic well-formedness of the word before the fragment and the word after, preventing it from distinguishing a modification repair from an abridged repair. So, the tagger needs to skip over fragments. However, the fragment can be viewed as the &quot;word&quot; that gets tagged as a modification repair or not. (The 'not' in this case means that the fragment is part of an abridged repair.) When no fragment is present between words, we view the interval as a null word. So, we augment the model pictured in Figure 2 with the probability of the presence of a fragment, Fi, given the presence of a repair, Ri, as is pictured in Figure 3.</Paragraph> <Paragraph position="2"> Since there are two alternatives for Fi--a fragment, fi, or not, ¬fi--and two alternatives for Ri--a repair, ri, or not, φi--we need four statistics.</Paragraph> <Paragraph position="3"> From our training corpus, we have found that if a fragment is present, a modification repair is favored--P(fi|ri)/P(fi|φi)--by a factor of 28.9. If a fragment is not present, fluent speech is favored--P(¬fi|φi)/P(¬fi|ri)--by a factor of 1.17.</Paragraph> </Section>
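A sketch of how a clue such as a fragment can scale the repair and fluent readings of a transition via a likelihood ratio follows. Only the factors 28.9 and 1.17 come from the text above; the function name and interface are illustrative, and the combination with the earlier transition() sketch is an assumption about how the pieces fit together.

```python
# Scale the repair vs. fluent scores at an inter-word gap by the fragment clue.
# A fragment boosts the repair reading by P(fragment | repair) / P(fragment | fluent)
# = 28.9; the absence of a fragment favors fluent speech by 1.17 (values from
# the text above; the function itself is an illustrative sketch).
def fragment_factor(fragment_present: bool, repair: bool) -> float:
    """Multiplicative factor applied to the score of a repair/fluent transition."""
    if fragment_present:
        return 28.9 if repair else 1.0
    return 1.0 if repair else 1.17

# Example combination with the transition() sketch from Section 5:
# score = transition(prev, nxt, reading, fluent) * fragment_factor(frag, reading == REPAIR)
```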
<Section position="2" start_page="189" end_page="189" type="sub_section"> <SectionTitle> 6.2. Editing Terms </SectionTitle> <Paragraph position="0"> Editing terms, like fragments, give information as to the presence of a modification repair. So, we incorporate them into the statistical model by viewing them as part of the &quot;word&quot; that gets tagged with Ri, thus changing the probability on the repair state from P(Fi|Ri) to P(FiEi|Ri), where Ei indicates the presence of editing terms. To simplify the probabilities, and to reduce problems due to sparse data, we make the following independence assumption.</Paragraph> <Paragraph position="1"> (3) Given that there is a modification repair, the presence of a fragment and the presence of editing terms are independent. (Fi and Ei are independent, given Ri.) So P(FiEi|Ri) = P(Fi|Ri) * P(Ei|Ri).</Paragraph> <Paragraph position="2"> An additional complexity is that different editing terms do not have the same predictive power. So far we have investigated &quot;um&quot; and &quot;uh&quot;. The presence of an &quot;um&quot; favors a repair by a factor of 2.7, while for &quot;uh&quot; it is favored by a factor of 9.4. If no editing term is present, fluent speech is favored by a factor of 1.2.</Paragraph> </Section> <Section position="3" start_page="189" end_page="189" type="sub_section"> <SectionTitle> 6.3. Word Matchings </SectionTitle> <Paragraph position="0"> In a modification repair, there is often a correspondence between the text that must be removed and the text that follows the interruption point. The simplest type of correspondence is word matchings. In fact, in our test corpus, 80% of modification repairs have at least one matching. This information can be incorporated into the statistical model in the same way that editing terms and fragments are handled. So, we change the probability of the repair state to be P(FiEiMi|Ri), where Mi indicates a word matching. Again, we assume that the clues are independent of each other, allowing us to treat this clue separately from the others.</Paragraph> <Paragraph position="1"> Just as with editing terms, not all matches make the same predictions about the occurrence of a modification repair.</Paragraph> <Paragraph position="2"> Bear, Dowding and Shriberg (1992) looked at the number of matching words versus the number of intervening words. However, this ignores the category of the word matches. For instance, a matching verb (with some intervening words) is more likely to indicate a repair than, say, a matching preposition or determiner. So, we classify word matchings by category and number of intervening words. Furthermore, if there are multiple matches in a repair, we only use one, the one that most strongly predicts a repair. For instance, in the following repair, the matching instances of &quot;take&quot; would be used over the matching instances of &quot;will&quot;, since main verbs were found to more strongly signal a modification repair than do modals.</Paragraph> <Paragraph position="3"> how long will that take - will it take for engine one at Dansville (d93-18.3 utt43)</Paragraph> <Paragraph position="4"> Since the statistical model only uses one matching per repair, the same is done in collecting the statistics. So, our collection involves two steps. In the first, we collect statistics on all word matches, and in the second, for each repair, we count only the matching that most strongly signals the repair. Table 2 gives a partial list of how much each matching favors a repair, broken down by category and number of intervening words. Entries that are marked with &quot;-&quot; do not contain any data points, and entries that are blank are below the baseline rate of 0.209, the rate at which a modification repair is favored (or actually disfavored) when there is no matching at all.</Paragraph>
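The selection step just described can be sketched as follows. Only the 0.209 no-match baseline is taken from the text; the per-category factors in the table stand in for Table 2 and are invented, as is the function interface.

```python
# For a hypothesized repair, classify each word matching by the category of
# the matched word and the number of intervening words, look up how strongly
# that kind of matching favors a repair, and keep only the strongest one.
NO_MATCH_BASELINE = 0.209   # from the text; below this a repair is disfavored

MATCH_FACTOR = {            # (category, intervening words) -> factor favoring a repair
    ("VERB", 0): 40.0,      # illustrative values only, standing in for Table 2
    ("VERB", 2): 16.0,
    ("PREP", 0): 6.0,
    ("DET", 1): 2.0,
}

def matching_factor(matchings):
    """matchings: list of (category, intervening_words) pairs for one repair."""
    factors = [MATCH_FACTOR.get(m, NO_MATCH_BASELINE) for m in matchings]
    # Use only the single matching that most strongly signals the repair.
    return max(factors, default=NO_MATCH_BASELINE)

# A matching main verb with two intervening words outweighs a matching
# determiner, mirroring the "take" vs. "will" example above.
print(matching_factor([("DET", 1), ("VERB", 2)]))  # -> 16.0
```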
<Paragraph position="5"> The problem with using word matchings is that it depends on identifying the removed text and its correspondences to the text that follows the interruption point. However, a good estimate can be obtained by using all word matches with at most eight intervening words.</Paragraph> <Paragraph position="6"> Table 2: Factor by which a repair is favored</Paragraph> </Section> </Section> <Section position="8" start_page="189" end_page="190" type="metho"> <SectionTitle> 6.4. Results </SectionTitle> <Paragraph position="0"> Table 3 summarizes the results of incorporating additional clues into the Markov model. The first column gives the results without any clues, the second with fragments, the third with editing terms, the fourth with word matches, and the fifth with all of these clues incorporated. Of the 384 modification repairs in the training corpus, the full model predicts 305 of them, versus 169 by the simple model. As for the false positives, the full model incorrectly predicted 207, versus 204 for the simple model. So, we see that by incorporating additional clues, the statistical model can better identify modification repairs.</Paragraph> </Section> <Section position="9" start_page="190" end_page="190" type="metho"> <SectionTitle> 7. Correcting Repairs </SectionTitle> <Paragraph position="0"> The actual goal of detecting speech repairs is to be able to correct them, so that the speaker's utterance can be understood. We have argued for the need to distinguish modification repairs from abridged repairs, because this distinction would be useful in determining the correction. We have implemented a pattern builder (Heeman and Allen, 1994b), which builds potential repair patterns based on word matches and word replacements. However, the pattern builder has only limited knowledge which it can use to decide which patterns are likely repairs. For instance, given the utterance &quot;pick up uh fill up the boxcars&quot; (d93-17.4 utt40), it will postulate that there is a single repair, in which &quot;pick up&quot; is replaced by &quot;fill up&quot;. However, for an utterance like &quot;we need to um manage to get the bananas&quot; (d93-14.3 utt50), it will postulate that &quot;manage to&quot; replaces &quot;need to&quot;. So, we use the statistical model to filter repairs found by the pattern builder. This also removes a lot of the false positives of the statistical model, since no potential repair pattern would be found for them.</Paragraph> <Paragraph position="1"> On the training set, the model was queried by the pattern builder on 961 potential modification repairs, of which 397 contained repairs. The model predicted 365 of these, and incorrectly detected 33 more, giving a detection recall rate of 91.9% and a precision of 91.7%. For the test corpus, it achieved a recall rate of 83.0% and a precision of 80.2%.</Paragraph> <Paragraph position="2"> The true measure of success is the overall detection and correction rates. On the 721 repairs in the training corpus, which includes overlapping repairs, the combined approach made the right corrections for 637, it made incorrect corrections for 19 more, and it falsely detected (and falsely corrected) 30 more. This gives an overall correction recall rate of 88.3% and a precision of 92.9%. On the test corpus, consisting of 142 repairs, it made the right correction for 114 of them, it incorrectly corrected 4 more, and it falsely detected 14 more, for a correction recall rate of 80.3% and a precision of 86.4%. Table 4 summarizes the overall results for both the pattern builder and statistical model on the training corpus and on the test set.</Paragraph>
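For concreteness, the recall and precision figures above follow from the reported raw counts as below. This is a small arithmetic check added here, not part of the original paper; the grouping of wrong corrections and false detections into the precision denominator is my reading of the reported counts, which the numbers bear out.

```python
# recall = correct / gold; precision = correct / (correct + wrong + false detections)
def rates(correct, gold, wrong, false_detections):
    recall = correct / gold
    precision = correct / (correct + wrong + false_detections)
    return round(100 * recall, 1), round(100 * precision, 1)

print(rates(365, 397, 0, 33))   # detection, training set  -> (91.9, 91.7)
print(rates(637, 721, 19, 30))  # correction, training set -> (88.3, 92.9)
print(rates(114, 142, 4, 14))   # correction, test set     -> (80.3, 86.4)
```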
<Paragraph position="3"> The results that we obtained are better than others reported in the literature. However, such comparisons are limited due to differences in both the type of repairs that are being studied and in the datasets used for drawing results. Bear, Dowding and Shriberg (1992) use the ATIS corpus, which is a collection of queries made to an automated airline reservation system. As stated earlier, they removed all utterances that contained abridged repairs. For detection they obtained a recall rate of 76% and a precision of 62%, and for correction, a recall rate of 43% and a precision of 50%. It is not clear whether their results would be better or worse if abridged repairs were included.</Paragraph> <Paragraph position="4"> Dowding et al. (1993) used a similar setup for their data. As part of a complete system, they obtained a detection recall rate of 42% and a precision of 85%; and for correction, a recall rate of 30% and a precision of 62%. Lastly, Nakatani and Hirschberg (1993) also used the ATIS corpus, but in this case focused only on detection, though detection of all three types of repairs. However, their test corpus consisted entirely of utterances that contained at least one repair. This makes it hard to evaluate their reported detection recall rate of 83% and precision of 94%. Testing on an entire corpus would clearly decrease their precision. As for our own data, we used a corpus of natural dialogues that were segmented only by speaker turns, not by individual utterances, and we focused on modification repairs and abridged repairs, with fresh starts being marked in the input so as not to cause interference in detecting the other two types.</Paragraph> </Section> <Section position="10" start_page="190" end_page="191" type="metho"> <SectionTitle> 8. Discussion </SectionTitle> <Paragraph position="0"> We have described a statistical model for detecting speech repairs.</Paragraph> <Paragraph position="1"> The model detects repairs by using category transition probabilities around repair intervals and for fluent speech. By training on actual examples of repairs, we can detect them without having to set arbitrary cutoffs for category transitions that might be insensitive to rarely used constructs. If people actually use syntactic anomalies as a clue in detecting speech repairs, then training on examples of them makes sense.</Paragraph> <Paragraph position="2"> In doing this work, we were faced with a lack of training data. The eventual answer is to have a large corpus of tagged dialogs with the speech repairs annotated. Since this was not available, we used the Brown corpus for the fluent category-transition probabilities. As well, these transition probabilities were used to 'bootstrap' our tagger in determining the part-of-speech tags for our training corpus. The tags of the 450 or so hand-annotated modification repairs were then used for setting the transition probabilities around modification repairs.</Paragraph> <Paragraph position="3"> Another problem that we encountered was interference between adjacent utterances in the same turn. Subsequent utterances often build on, or even repeat, what was previously said (Walker, 1993). Consider the following utterance.</Paragraph> <Paragraph position="4"> that's all you need you only need one tanker (d93-8.3 utt79) The tagger incorrectly hypothesized that this was a modification repair with an interruption point after the first occurrence of the word &quot;need&quot;.
Even a relatively simple segmentation of the dialogs into utterances would remove some of the false positives and improve performance.</Paragraph> <Paragraph position="5"> Speech repairs do interact negatively with part-of-speech tagging, and even with statistical modeling of repairs, inappropriate tags are still sometimes assigned. In the following example, the second occurrence of the word &quot;load&quot; was categorized as a noun, and the speech repair went undetected.</Paragraph> <Paragraph position="6"> it'll be seven a.m. by the time we load in - load the bananas (d93-12.4 utt53)</Paragraph> </Section> </Paper>