<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1085">
<Title>Automatic Detection and Correction of Repairs in Human-Computer Dialog*</Title>
<Section position="3" start_page="0" end_page="419" type="metho">
<SectionTitle> 2. THE CORPUS </SectionTitle>
<Paragraph position="0"> The data we are analyzing were collected at six sites as part of DARPA's Spoken Language Systems project.</Paragraph>
<Paragraph position="1"> The corpus contains digitized waveforms and transcriptions of a large number of sessions in which subjects made air travel plans using a computer. In the majority of sessions, data were collected in a Wizard of Oz setting, in which subjects were led to believe they were talking to a computer, but in which a human actually interpreted and responded to queries. In a small portion of the sessions, data were collected using SRI's Spoken Language System, in which no human was involved. Relevant to the current paper is the fact that although the speech was spontaneous, it was somewhat planned (subjects pressed a button to begin speaking to the system), and the transcribers who produced lexical transcriptions of the sessions were instructed to mark with special symbols any words they inferred were verbally deleted by the speaker. For further description of the corpus, see MADCOW [10].</Paragraph>
</Section>
<Section position="4" start_page="419" end_page="419" type="metho">
<SectionTitle> 3. CHARACTERISTICS AND DISTRIBUTION OF REPAIRS </SectionTitle>
<Paragraph position="0"> Of the ten thousand sentences in our corpus, 607 contained repairs. We found that of sentences longer than nine words, 10% contained repairs. While this is lower than rates reported elsewhere for human-human dialog (Levelt [7] reports a rate of 34%), it is still large enough to be significant. And as system developers move toward more closely modeling human-human interaction, the percentage is likely to rise.</Paragraph>
<Section position="1" start_page="419" end_page="419" type="sub_section">
<SectionTitle> 3.1 Notation </SectionTitle>
<Paragraph position="0"> In order to classify these repairs, and to facilitate communication among the authors, it was necessary to develop a notational system that would (1) be relatively simple, (2) capture sufficient detail, and (3) describe the vast majority of repairs observed. The notation is described fully in [2].</Paragraph>
<Paragraph position="1"> The basic aspects of the notation include marking the interruption point, its extent, and relevant correspondences between words in the region. To mark the site of a repair, corresponding to Hindle's "edit signal" [5], we use a vertical bar (|). To express the notion that words on one side of the repair correspond to words on the other, we use a combination of a letter plus a numerical index. The letter M indicates that two words match exactly. R indicates that the second of the two words was intended by the speaker to replace the first. The two words must be similar: either of the same lexical category, or morphological variants of the same base form (including contraction pairs like I/I'd). Any other word within a repair is notated with X. A hyphen affixed to a symbol indicates a word fragment. In addition, certain cue words, such as "sorry" or "oops" (marked with CR), as well as filled pauses (CF), are also labeled if they occur immediately before the site of a repair.</Paragraph>
</Section>
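To make the notation concrete, the following minimal sketch (our own encoding and names, not part of the paper's tooling; assumes Python 3.10+) shows one way an annotated repair might be represented and the corresponding correction applied:

    from dataclasses import dataclass

    @dataclass
    class Token:
        word: str
        label: str | None  # e.g. "M1", "R2", "X", "CR", "CF"; None = unlabeled

    def apply_repair(tokens: list[Token], bar_index: int) -> str:
        """Delete the reparandum: every labeled word before the interruption
        point (the vertical bar), plus cue words (CR) and filled pauses (CF)."""
        kept = []
        for i, tok in enumerate(tokens):
            if i < bar_index and tok.label is not None:
                continue  # word retracted by the speaker
            if tok.label in ("CR", "CF"):
                continue  # editing expression or filled pause
            kept.append(tok.word)
        return " ".join(kept)

    # "flights <from> <philadelphia> <i'm> <sorry> from denver to philadelphia"
    toks = [Token("from", "M1"), Token("philadelphia", "R2"),
            Token("i'm", "CR"), Token("sorry", "CR"),   # the bar falls here
            Token("from", "M1"), Token("denver", "R2"),
            Token("to", None), Token("philadelphia", None)]
    print(apply_repair(toks, bar_index=4))  # -> "from denver to philadelphia"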
<Section position="2" start_page="419" end_page="419" type="sub_section">
<SectionTitle> 3.2 Distribution </SectionTitle>
<Paragraph position="0"> While only 607 sentences contained deletions, some sentences contained more than one, for a total of 646 deletions. Table 2 gives the breakdown of deletions by length, where length is defined as the number of consecutive deleted words or word fragments. Most of the deletions were fairly short; one- and two-word deletions accounted for 82% of the data. We categorized the length 1 and length 2 repairs according to their transcriptions.</Paragraph>
<Paragraph position="1"> The results are summarized in Table 3. For simplicity, we have in this table combined cases involving fragments (which always occurred as the second word) with their associated full-word patterns. The overall rate of fragments for the length 2 repairs was 34%.</Paragraph>
</Section>
</Section>
<Section position="5" start_page="419" end_page="420" type="metho">
<SectionTitle> 4. SIMPLE PATTERN MATCHING </SectionTitle>
<Paragraph position="0"> We analyzed a subset of the 607 sentences containing repairs and concluded that certain simple pattern-matching techniques could successfully detect a number of them. The pattern matching component reported on here looks for the following kinds of subsequences (a sketch follows the list):
* Simple syntactic anomalies, such as "a the" or "to from".
* Sequences of identical words, such as "<I> <would> <like> <a> <book> I would like a flight ..."
* Matching single words surrounding a cue word like "sorry," for example "from" in this case: "I would like to see the flights <from> <philadelphia> <i'm> <sorry> from denver to philadelphia."</Paragraph>
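As a rough illustration of these three checks, the sketch below flags candidate repair sites in a word list; the function name, the anomaly and cue-word lists, and the maximum match length are our own illustrative choices, not details taken from the paper:

    ANOMALIES = {("a", "the"), ("the", "a"), ("to", "from"), ("from", "to")}
    CUE_WORDS = {"sorry", "oops"}

    def candidate_repairs(words):
        """Return (kind, start, resume) triples: deleting words[start:resume]
        is the hypothesized correction."""
        sites = []
        n = len(words)
        # 1. Simple syntactic anomalies such as "a the" or "to from":
        #    hypothesize that the first word of the pair was retracted.
        for i in range(n - 1):
            if (words[i], words[i + 1]) in ANOMALIES:
                sites.append(("anomaly", i, i + 1))
        # 2. Repeated word sequences, adjacent or separated by one word,
        #    longest match first.
        for length in range(min(5, n // 2), 0, -1):
            for gap in (0, 1):
                for i in range(n - 2 * length - gap + 1):
                    j = i + length + gap
                    if words[i:i + length] == words[j:j + length]:
                        sites.append(("match", i, j))
        # 3. A single word repeated on both sides of a cue word like "sorry".
        for i, w in enumerate(words):
            if w in CUE_WORDS and 0 < i < n - 1:
                for j in range(i - 1, -1, -1):
                    if words[j] == words[i + 1]:
                        sites.append(("cue", j, i + 1))
                        break
        return sites

    print(candidate_repairs("i would like a book i would like a flight".split()))
    # -> [('match', 0, 5)]: delete "i would like a book", keeping
    #    "i would like a flight"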
<Paragraph position="1"> Of the 406 sentences with nontrivial repairs in our data (those requiring more editing than deleting fragments and filled pauses), the program successfully corrected 177. It found 132 additional sentences with repairs but made the wrong correction. There were 97 sentences containing repairs that it did not find. In addition, out of the 10,517-sentence corpus (10,718 minus 201 trivial sentences), it incorrectly hypothesized that an additional 191 contained repairs.</Paragraph>
<Paragraph position="2"> Thus, of 10,517 sentences of varying lengths, it pulled out 500 as possibly containing a repair and missed 97 sentences that actually contained a repair. Of the 500 it proposed as containing a repair, 62% actually did and 38% did not. Of the 62% that had repairs, it made the appropriate correction for 57%.</Paragraph>
<Paragraph position="3"> These numbers show that although pattern matching is useful in identifying possible repairs, it is less successful at making appropriate corrections. This problem stems largely from the overlap of related patterns. Many sentences contain a subsequence of words that matches not one but several patterns. For example, the phrase "FLIGHT <word> FLIGHT" matches three different patterns:
show the FLIGHT earliest FLIGHT
show the delta FLIGHT united FLIGHT
Each of these sentences is a false positive for the other two patterns. Despite these problems of overlap, pattern matching is useful in reducing the set of candidate sentences to be processed for repairs. Instead of applying detailed and possibly time-intensive analysis techniques to 10,000 sentences, we can increase efficiency by limiting ourselves to the 500 sentences selected by the pattern matcher, which has (at least on one measure) a 75% recall rate. The repair sites hypothesized by the pattern matcher constitute useful input for further processing based on other sources of information.</Paragraph>
</Section>
<Section position="6" start_page="420" end_page="421" type="metho">
<SectionTitle> 5. NATURAL LANGUAGE CONSTRAINTS </SectionTitle>
<Paragraph position="0"> Here we describe experiments conducted to measure the effectiveness of a natural language processing system in distinguishing repairs from false positives. A false positive is a repair pattern that incorrectly matches a sentence or part of a sentence. We conducted the experiments using the syntactic and semantic components of the Gemini natural language processing system. Gemini is an extensive reimplementation of the Core Language Engine [1]. It includes modular syntactic and semantic components, integrated into an efficient all-paths bottom-up parser [11]. Gemini was trained on a 2,200-sentence subset of the full 10,718-sentence corpus (only those sentences annotated as class A or D). Since this subset excluded the unanswerable (class X) sentences, Gemini's coverage on the full corpus is only an estimated 70% for syntax and 50% for semantics. (Footnote: Gemini's syntactic coverage of the 2,200-sentence dataset it was trained on, the set of annotated and answerable MADCOW queries, is approximately 91%, while its semantic coverage is approximately 77%. On a fair test of the February 1992 test set, Gemini's syntactic coverage was 87% and its semantic coverage was 71%.) Nonetheless, the results reported here are promising, and should improve as syntactic and semantic coverage increase.</Paragraph>
<Paragraph position="1"> We tested Gemini on a subset of the data that the pattern matcher returned as likely to contain a repair. We excluded all sentences that contained fragments, resulting in a dataset of 335 sentences, of which 179 contained repairs and 176 contained false positives. The approach was as follows: for each sentence, parsing was attempted. If parsing succeeded, the sentence was marked as a false positive. If parsing did not succeed, pattern matching was used to detect possible repairs, the edits associated with the repairs were made, and parsing was reattempted. If parsing succeeded at this point, the sentence was marked as a repair. Otherwise, it was marked as NO OPINION.</Paragraph>
<Paragraph position="2"> Since multiple repairs and false positives can occur in the same sentence, the pattern matching process is constrained to prefer fewer repairs to more repairs, and shorter repairs to longer repairs. This is done to favor an analysis that deletes the fewest words from a sentence. It is often the case that a more drastic repair would result in a syntactically and semantically well-formed sentence, but not the sentence that the speaker intended. For instance, the sentence "show me <flights> daily flights to boston" could be repaired by deleting the words "flights daily", which would yield a grammatical sentence, but in this case the speaker intended to delete only "flights."</Paragraph>
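The following sketch illustrates this parse-then-edit strategy. Here parses() is a stand-in for a real syntactic/semantic analyzer (the paper uses Gemini, which is not available to us), candidate_repairs() is the pattern matcher sketched in Section 4, and for brevity only single edits are tried, so the preference for fewer repairs reduces to preferring the edit that deletes the fewest words:

    def classify(words, parses, candidate_repairs):
        if parses(words):
            return "FALSE POSITIVE", words  # already well formed as spoken
        # Shorter repairs are preferred to longer ones: try candidate
        # edits in order of how few words they delete.
        for kind, start, resume in sorted(candidate_repairs(words),
                                          key=lambda s: s[2] - s[1]):
            repaired = words[:start] + words[resume:]  # delete the reparandum
            if parses(repaired):
                return "REPAIR", repaired
        return "NO OPINION", words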
<Paragraph position="3"> Table 4 shows the results of these experiments. We ran them two ways: once using syntactic constraints alone and again using both syntactic and semantic constraints. As can be seen, Gemini is quite accurate at detecting a repair, although somewhat less accurate at detecting a false positive. Furthermore, in cases where Gemini detected a repair, it produced the intended correction in 62 of 68 cases for syntax alone, and in 60 of 64 cases using combined syntax and semantics. In both cases, a large number of sentences (29% for syntax, 50% for semantics) received a NO OPINION evaluation. The NO OPINION cases were evenly split between repairs and false positives in both tests.</Paragraph>
<Paragraph position="4"> The main points to be noted from Table 4 are that with syntax alone, the system is quite accurate in detecting repairs, and with syntax and semantics working together, it is accurate at detecting false positives. However, since the coverage of combined syntax and semantics will always be lower than the coverage of syntax alone, we cannot compare these rates directly.</Paragraph>
</Section>
<Section position="7" start_page="421" end_page="422" type="metho">
<SectionTitle> 6. ACOUSTICS </SectionTitle>
<Paragraph position="0"> A third source of information that can be helpful in detecting repairs is acoustics. While acoustics alone cannot tackle the problem of locating repairs, since any prosodic patterns found in repairs will also be found in fluent speech, acoustic information can be quite effective when combined with other sources of information, particularly pattern matching.</Paragraph>
<Paragraph position="1"> Our approach in studying the ways in which acoustics might be helpful was to begin by looking at two patterns conducive to acoustic measurement and comparison. First, we focused on patterns in which there is only one matched word, and in which the two occurrences of that word are either adjacent or separated by only one word. Matched words allow for comparisons of word duration; proximity helps avoid variability due to global intonation contours not associated with the patterns themselves. We present here analyses for the M1|M1 ("flights for <one> one person") and M1|XM1 ("<flight> earliest flight") repairs and their associated false positives ("u s air five one one" and "a flight on flight number five one one," respectively).</Paragraph>
<Paragraph position="2"> Second, we have done a preliminary analysis of repairs in which a word such as "no" or "well" was present as an editing expression [6] at the point of interruption ("...flights <between> <boston> <and> <dallas> <no> between oakland and boston"). False positives for these cases are instances in which the cue word functions in its usual lexical sense ("I want to leave boston no later than one p m."). Hirschberg and Litman [3] have shown that cue words that function differently can be distinguished perceptually by listeners on the basis of prosody. Thus, we sought to determine whether acoustic analysis could help in deciding, when such words were present, whether or not they marked the interruption point of a repair.</Paragraph>
<Paragraph position="3"> In both analyses, a number of features were measured to allow for comparisons between the words of interest.</Paragraph>
</Section>
<Section position="8" start_page="422" end_page="423" type="metho">
<SectionTitle> M1|XM1 Repairs </SectionTitle>
<Paragraph position="0"> Word onsets and offsets were labeled by inspection of waveforms and parameter files (pitch tracks and spectrograms) obtained using the Entropic Waves software package. Files with questionable pitch tracks were excluded from the analysis. An average F0 value for words of interest was determined by simply averaging, within a labeled word, all 10-ms frame values having a probability of voicing above 0.20.</Paragraph>
<Paragraph position="1"> In examining the M1|M1 repair pattern, we found that the strongest distinguishing cue between the repairs (N = 20) and the false positives (N = 20) was the interval between the offset of the first word and the onset of the second. False positives had a mean gap of 42 ms (s.d. = 55.8) as opposed to 380 ms (s.d. = 200.4) for repairs. A second difference between the two groups was that, in the case of repairs, there was a statistically reliable reduction in duration for the second occurrence of M1, with a mean difference of 53.4 ms. However, because false positives showed no reliable difference in word duration, this was a much less useful predictor than gap duration. F0 of the matched words was not helpful in separating repairs from false positives; both groups showed a highly significant correlation for, and no significant difference between, the mean F0 of the matched words.</Paragraph>
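A sketch of the two M1|M1 measurements follows. It assumes word-level time alignments and 10-ms F0 frames (time, f0, voicing probability) are available; the 0.2 s decision threshold is illustrative, chosen to fall between the reported false-positive (42 ms) and repair (380 ms) mean gaps:

    def mean_f0(frames, onset, offset):
        """Average F0 over 10-ms frames within [onset, offset) whose
        probability of voicing exceeds 0.20, as in the labeling above."""
        voiced = [f0 for t, f0, pv in frames
                  if onset <= t < offset and pv > 0.20]
        return sum(voiced) / len(voiced) if voiced else None

    def m1_m1_gap(first, second):
        """Interval between the offset of the first occurrence of the
        matched word and the onset of the second; each word is an
        (onset_s, offset_s) pair."""
        return second[0] - first[1]

    # A gap of 0.43 s is far more consistent with a repair (mean 380 ms)
    # than with a false positive (mean 42 ms).
    print(m1_m1_gap((1.20, 1.55), (1.98, 2.30)) > 0.2)  # -> True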
<Paragraph position="2"> A different set of features was found to be useful in distinguishing repairs from false positives for the M1|XM1 pattern. These features are shown in Table 5. Cell values are percentages of repairs or false positives that possessed the characteristics indicated in the columns. Despite the small data set, some suggestive trends emerge. For example, in cases where there was a pause (defined for purposes of this analysis as a silence of greater than 200 ms) on only one side of the inserted word, the pause was never after the insertion (X) for the repairs, and rarely before the X for the false positives. Note that values do not add up to 100% because cases with no pause, or with pauses on both sides, are not included in the table. A second distinguishing characteristic was the F0 value of X. For repairs, the inserted word was nearly always higher in F0 than the preceding M1; for false positives, this increase in F0 was rarely observed. Table 6 shows the results of combining the acoustic constraints in Table 5. As can be seen, although acoustic features may be helpful individually, certain combinations of features widen the gap between the observed rates of repairs and false positives possessing the relevant set of features.</Paragraph>
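One illustrative way to conjoin the Table 5 cues into a single test is sketched below; the greater-than-200-ms pause definition follows the text, but combining the cues this way is our reading of Tables 5 and 6, not a rule given in the paper, and it trades recall for precision:

    def m1_x_m1_looks_like_repair(m1_first, x, m1_second,
                                  f0_m1_first, f0_x, pause=0.200):
        """Word arguments are (onset_s, offset_s) pairs; F0 arguments are
        mean F0 of the first M1 and of the inserted word X."""
        pause_before_x = (x[0] - m1_first[1]) > pause
        pause_after_x = (m1_second[0] - x[1]) > pause
        f0_rise = None not in (f0_m1_first, f0_x) and f0_x > f0_m1_first
        # For repairs, any pause precedes X rather than follows it, and X
        # is nearly always higher in F0 than the preceding M1.
        return f0_rise and pause_before_x and not pause_after_x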
<Paragraph position="3"> Finally, in a preliminary study of the cue words "no" and "well," we compared 9 examples of these words at the site of a repair to 15 examples of the same words occurring in fluent speech. We found that these groups were quite distinguishable on the basis of simple prosodic features. Table 7 shows the percentage of repairs versus false positives characterized by a clear rise or fall in F0, lexical stress, and continuity of the speech immediately preceding and following the editing expression ("continuous" means there is no silent pause on either side of the cue word). As can be seen, at least for this limited data set, cue words marking repairs were quite distinguishable from those same words found in fluent strings on the basis of simple prosodic features.</Paragraph>
<Paragraph position="4"> Although these analyses are based on small data sets, such results are nevertheless interesting. They illustrate that acoustics can indeed play a role in distinguishing repairs from false positives, but only if each pattern is examined individually, to determine which features to use and how to combine them. Analysis of additional patterns and access to a larger database of repairs will help us better determine the ways in which acoustics can play a role in the detection of repairs.</Paragraph>
</Section>
</Paper>