<?xml version="1.0" standalone="yes"?>
<Paper uid="P92-1008">
  <Title>INTEGRATING MULTIPLE KNOWLEDGE SOURCES FOR DETECTION AND CORRECTION OF REPAIRS IN HUMAN-COMPUTER DIALOG*</Title>
  <Section position="4" start_page="56" end_page="56" type="metho">
    <SectionTitle>
THE CORPUS
</SectionTitle>
    <Paragraph position="0"> The data we are analyzing were collected as part of DARPA's Spoken Language Systems project. The corpus contains digitized waveforms and transcriptions of a large number of sessions in which subjects made air travel plans using a computer. In the majority of sessions, data were collected in a Wizard of Oz setting, in which subjects were led to believe they were talking to a computer, but in which a human actually interpreted and responded to queries. In a small portion of the sessions, data were collected using SRI's Spoken Language System (Shriberg et al., 1992b), in which no human intervention was involved. Relevant to the current paper is the fact that although the speech was spontaneous, it was somewhat planned (subjects pressed a button to begin speaking to the system) and the transcribers who produced lexical transcriptions of the sessions were instructed to mark words they inferred were verbally deleted by the speaker with special symbols.</Paragraph>
    <Paragraph position="1"> For further description of the corpus, see MADCOW (1992).</Paragraph>
  </Section>
  <Section position="5" start_page="56" end_page="57" type="metho">
    <SectionTitle>
NOTATION
</SectionTitle>
    <Paragraph position="0"> In order to classify these repairs, and to facilitate communication among the authors, it was necessary to develop a notational system that would: (1) be relatively simple, (2) capture sufficient detail, and (3) describe the vast majority of repairs observed. Table 1 shows examples of the notation used, which is described fully in Bear et al. (1992).</Paragraph>
    <Paragraph position="1"> The basic aspects of the notation include marking the interruption point, the extent of the repair, and relevant correspondences between words in the region. To mark the site of a repair, corresponding to Hindle's &quot;edit signal&quot; (Hindle, 1983), we use a vertical bar (|). To express the notion that words on one side of the repair correspond to words on the other, we use a combination of a letter plus a numerical index. The letter M indicates that two words match exactly.</Paragraph>
    <Paragraph position="2"> R indicates that the second of the two words was intended by the speaker to replace the first.</Paragraph>
    <Paragraph position="3"> The two words must be similar: either of the same lexical category, or morphological variants of the same base form (including contraction pairs like &quot;I/I'd&quot;). Any other word within a repair is notated with X. A hyphen affixed to a symbol indicates a word fragment. In addition, certain cue words, such as &quot;sorry&quot; or &quot;oops&quot; (marked with CR), as well as filled pauses (CF), are also labeled if they occur immediately before the site of a repair.</Paragraph>
    <Paragraph position="5"> [Table 1: Examples of notation. Surviving example rows: &quot;I want fl- flights to boston&quot;; &quot;show me flights daily flights&quot;; &quot;... fly to boston from boston&quot; (R1 M1 | R1 M1); &quot;... fly from boston from denver&quot;.]</Paragraph>
  </Section>
  <Section position="6" start_page="57" end_page="57" type="metho">
    <SectionTitle>
DISTRIBUTION
</SectionTitle>
    <Paragraph position="0"> Of the 10,000 sentences in our corpus, 607 contained repairs. We found that 10% of sentences longer than nine words contained repairs. In contrast, Levelt (1983) reports a repair rate of 34% for human-human dialog. While the rates in this corpus are lower, they are still high enough to be significant. And, as system developers move toward more closely modeling human-human interaction, the percentage is likely to rise.</Paragraph>
    <Paragraph position="1"> Although only 607 sentences contained deletions, some sentences contained more than one, for a total of 646 deletions. Table 2 gives the breakdown of deletions by length, where length is defined as the number of consecutive deleted words or word fragments. Most of the deletions were fairly short; deletions of one or two words accounted for 82% of the data. We categorized the length 1 and length 2 repairs according to their transcriptions. The results are summarized in Table 3. For simplicity, in this table we have counted fragments (which always occurred as the second deleted word) as whole words. The overall rate of fragments for the length 2 repairs was 34%.</Paragraph>
    <Paragraph position="2"> A major repair type involved matching strings of identical words. More than half (339 out of 436) of the nontrivial repairs (more editing necessary than deleting fragments and filled pauses) in the corpus were of this type. Table 4 shows the distributions of these repairs with respect to two parameters: the length in words of the matched string, and the number of words between the two matched strings. Numbers in parentheses indicate the number of occurrences, and probabilities represent the likelihood that the phrase was actually a repair and not a false positive. Two trends emerge from these data. First, the longer the matched string, the more likely the phrase was a repair. Second, the more words there were intervening between the matched strings, the less likely the phrase was a repair.</Paragraph>
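    <Paragraph> The two parameters used in Table 4 (length of the matched string and number of intervening words) can be computed with a short scan over the word sequence. The following is only a minimal sketch, not the authors' program: the function name find_matched_strings and the max_gap parameter are assumptions introduced for illustration, and the sketch reports every candidate, including ones that would turn out to be false positives.

```python
def find_matched_strings(words, max_gap=3):
    """Find repeated word strings in a list of words.

    Returns a list of (start, match_len, gap) tuples, where start is the
    index of the first copy, match_len is the number of identical words,
    and gap is the number of words between the two copies.
    """
    hits = []
    n = len(words)
    for start in range(n):
        for match_len in range(1, n - start):
            for gap in range(max_gap + 1):
                second = start + match_len + gap
                if second + match_len > n:
                    break  # the second copy would run past the end
                first_copy = words[start:start + match_len]
                if first_copy == words[second:second + match_len]:
                    hits.append((start, match_len, gap))
    return hits


# One matched word ("flights") with one intervening word ("daily").
print(find_matched_strings("show me flights daily flights".split()))
# prints [(2, 1, 1)]
```

    Each hit is only a candidate repair site; a fluent phrase with a repeated word produces the same kind of hit, which is why the probabilities in Table 4 matter.</Paragraph>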
  </Section>
  <Section position="7" start_page="57" end_page="58" type="metho">
    <SectionTitle>
SIMPLE PATTERN MATCHING
</SectionTitle>
    <Paragraph position="0"> We analyzed a subset of 607 sentences containing repairs and concluded that certain simple pattern-matching techniques could successfully detect a number of them. The pattern-matching component reported on here looks for identical sequences of words and for simple syntactic anomalies, such as &quot;a the&quot; or &quot;to from.&quot; Of the 406 sentences containing nontrivial repairs, the program successfully found 309. Of these it successfully corrected 177. There were 97 sentences that contained repairs which it did not find. In addition, out of the 10,517-sentence corpus (10,718 - 201 trivial), it incorrectly hypothesized that an additional 191 contained repairs. Thus of 10,517 sentences of varying lengths, it pulled out 500 as possibly containing a repair and missed 97 sentences actually containing a repair. Of the 500 that it proposed as containing a repair, 62% actually did and 38% did not. Of the 62% that had repairs, it made the appropriate correction for 57%.</Paragraph>
    <Paragraph position="1"> These numbers show that although pattern matching is useful in identifying possible repairs, it is less successful at making appropriate corrections. This problem stems largely from the overlap of related patterns. Many sentences contain a subsequence of words that matches not one but several patterns. For example, the phrase &quot;FLIGHT &lt;word&gt; FLIGHT&quot; matches three different patterns, illustrated by the sentences &quot;show the flight time flight date,&quot; &quot;show the flight earliest flight,&quot; and &quot;show the delta flight united flight.&quot;</Paragraph>
    <Paragraph position="7"> Each of these sentences is a false positive for the other two patterns. Despite these problems of overlap, pattern matching is useful in reducing the set of candidate sentences to be processed for repairs. Rather than applying detailed and possibly time-intensive analysis techniques to 10,000 sentences, we can increase efficiency by limiting ourselves to the 500 sentences selected by the pattern matcher, which has (at least on one measure) a 75% recall rate. The repair sites hypothesized by the pattern matcher constitute useful input for further processing based on other sources of information.</Paragraph>
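    <Paragraph> The percentages quoted in this section follow directly from the sentence counts given above. The short calculation below is a sketch for checking them; the variable names are ours, and the sentence-level recall computed this way comes out at about 76%, close to the 75% figure quoted on one measure.

```python
# Sentence counts reported for the pattern matcher.
found = 309         # sentences with nontrivial repairs that were detected
missed = 97         # sentences with nontrivial repairs that were not detected
false_alarms = 191  # sentences wrongly hypothesized to contain a repair
corrected = 177     # detected sentences that were also corrected properly

proposed = found + false_alarms      # 500 sentences pulled out
with_repairs = found + missed        # 406 sentences with nontrivial repairs

precision = found / proposed         # share of proposals that were repairs
recall = found / with_repairs        # share of repair sentences detected
correction_rate = corrected / found  # share of detections corrected properly

print(f"precision       {precision:.1%}")
print(f"recall          {recall:.1%}")
print(f"correction rate {correction_rate:.1%}")
```
    </Paragraph>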
  </Section>
  <Section position="8" start_page="58" end_page="60" type="metho">
    <SectionTitle>
NATURAL LANGUAGE
CONSTRAINTS
</SectionTitle>
    <Paragraph position="0"> Here we describe two sets of experiments to measure the effectiveness of a natural language processing system in distinguishing repairs from false positives. One approach is based on parsing of whole sentences; the other is based on parsing localized word sequences identified as potential repairs. Both of these experiments rely on the pattern matcher to suggest potential repairs.</Paragraph>
    <Paragraph position="1"> The syntactic and semantic components of the Gemini natural language processing system are used for both of these experiments. Gemini is an extensive reimplementation of the Core Language Engine (Alshawi et al., 1988). It includes modular syntactic and semantic components, integrated into an efficient all-paths bottom-up parser (Moore and Dowding, 1991). Gemini was trained on a 2,200-sentence subset of the full 10,718-sentence corpus. Since this subset excluded the unanswerable sentences, Gemini's coverage on the full corpus is only an estimated 70% for syntax, and 50% for semantics (see footnote 2).</Paragraph>
    <Section position="1" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
Global Syntax and Semantics
</SectionTitle>
      <Paragraph position="0"> In the first experiment, based on parsing complete sentences, Gemini was tested on a subset of the data that the pattern matcher returned as likely to contain a repair. We excluded all sentences that contained fragments, resulting in a dataset of 335 sentences, of which 179 contained repairs and 176 contained false positives. The approach was as follows: for each sentence, parsing was attempted. If parsing succeeded, the sentence was marked as a false positive. If parsing did not succeed, then pattern matching was used to detect possible repairs, and the edits associated with the repairs were made. Parsing was then reattempted. If parsing succeeded at this point, the sentence was marked as a repair. Otherwise, it was marked as no opinion.</Paragraph>
      <Paragraph position="1"> Footnote 2: Gemini's syntactic coverage of the 2,200-sentence dataset it was trained on (the set of annotated and answerable MADCOW queries) is approximately 91%, while its semantic coverage is approximately 77%. On a recent fair test, Gemini's syntactic coverage was 87% and seman- [...]</Paragraph>
      <Paragraph position="2"> Table 5 shows the results of these experiments.</Paragraph>
      <Paragraph position="3"> We ran them two ways: once using syntactic constraints alone and again using both syntactic and semantic constraints. As can be seen, Gemini is quite accurate at detecting a repair, although somewhat less accurate at detecting a false positive. Furthermore, in cases where Gemini detected a repair, it produced the intended correction in 62 out of 68 cases for syntax alone, and in 60 out of 64 cases using combined syntax and semantics. In both cases, a large number of sentences (29% for syntax, 50% for semantics) received a no opinion evaluation. The no opinion cases were evenly split between repairs and false positives in both tests.</Paragraph>
      <Paragraph position="4"> The main points to be noted from Table 5 are that with syntax alone, the system is quite accurate in detecting repairs, and with syntax and semantics working together, it is accurate at detecting false positives. However, since the coverage of syntax and semantics will always be lower than the coverage of syntax alone, we cannot compare these rates directly.</Paragraph>
      <Paragraph position="5"> Since multiple repairs and false positives can occur in the same sentence, the pattern matching process is constrained to prefer fewer repairs to more repairs, and shorter repairs to longer repairs. This is done to favor an analysis that deletes the fewest words from a sentence. It is often the case that more drastic repairs would result in a syntactically and semantically well-formed sentence, but not the sentence that the speaker intended. For instance, the sentence &quot;show me &lt;flights&gt; daily flights to boston&quot; could be repaired by deleting the words &quot;flights daily,&quot; and would then yield a grammatical sentence, but in this case the speaker intended to delete only &quot;flights.&quot;</Paragraph>
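      <Paragraph> The parse, edit, and reparse procedure used in this experiment can be summarized in a few lines. The sketch below is illustrative only: parses and propose_edit are hypothetical stand-ins for the Gemini parser and the pattern matcher, and the toy grammar and toy edit rule exist solely to make the example runnable.

```python
def classify_sentence(words, parses, propose_edit):
    """Label a sentence flagged by the pattern matcher.

    parses       -- callable taking a word list, returning True on a parse
    propose_edit -- callable returning an edited word list, or None
    Both callables are hypothetical placeholders for the real components.
    """
    if parses(words):
        return "false positive", words
    edited = propose_edit(words)
    if edited is not None and parses(edited):
        return "repair", edited
    return "no opinion", words


# Toy stand-ins (assumptions, not part of the system described here).
GRAMMATICAL = {"show me daily flights", "show me flights"}


def toy_parses(words):
    return " ".join(words) in GRAMMATICAL


def toy_edit(words):
    # Delete the first of two identical words separated by one word,
    # which removes the fewest words for this kind of candidate.
    for i in range(len(words) - 2):
        if words[i] == words[i + 2]:
            return words[:i] + words[i + 1:]
    return None


label, result = classify_sentence(
    "show me flights daily flights".split(), toy_parses, toy_edit)
print(label, result)
```

      The toy edit rule deletes a single word, in line with the stated preference for the analysis that removes the fewest words.</Paragraph>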
    </Section>
    <Section position="2" start_page="59" end_page="60" type="sub_section">
      <SectionTitle>
Local Syntax and Semantics
</SectionTitle>
      <Paragraph position="0"> In the second experiment we attempted to improve robustness by applying the parser to small substrings of the sentence. When analyzing long word strings, the parser is more likely to fail due to factors unrelated to the repair. For this experiment, the parser was using both syntax and semantics.</Paragraph>
      <Paragraph position="1"> The phrases used for this experiment were the phrases found by the pattern matcher to contain matching strings of length one, with up to three intervening words. This set was selected because, as can be seen from Table 4, it constitutes a large subset of the data (186 such phrases). Furthermore, pattern matching alone contains insufficient information for reliably correcting these sentences. The relevant substring is taken to be the phrase constituting the matched string plus intervening material plus the immediately preceding word. So far we have used only phrases where the grammatical category of the matched word was either noun or name (proper noun). For this test we specified a list of possible phrase types (NP, VP, PP, N, Name) that count as a successful parse. We intend to run other tests with other grammatical categories, but expect that these other categories could need a different heuristic for deciding which substring to parse, as well as a different set of acceptable phrase types.</Paragraph>
      <Paragraph position="2"> Four candidate strings were derived from the original by making the three different possible edits, and also including the original string unchanged. Each of these strings was analyzed by the parser. When the original sequence did not parse, but one of the edits resulted in a sequence that parsed, the original sequence was very unlikely to be a false positive (right for 34 of 35 cases). Furthermore, the edit that parsed was chosen to be the repaired string. When more than one of the edited strings parsed, the edit was chosen by preferring them in the following order: (1) M1|XM1, (2) R1M1|R1M1, (3) M1R1|M1R1. Of the 37 cases of repairs, the correct edit was found in 27 cases, while in 7 more an incorrect edit was found; in 3 cases no opinion was registered. While these numbers are quite promising, they may improve even more when information from syntax and semantics is combined with that from acoustics.</Paragraph>
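      <Paragraph> The selection rule described above (keep the first edit, in preference order, whose result parses when the original does not) can be sketched as follows. This is a sketch under stated assumptions: choose_edit and parses are hypothetical names, the candidate edits are supplied by the caller, the edited strings in the usage example are our own illustration of what each pattern might delete, and returning None when the original string parses is our assumption, since the text does not cover that case.

```python
# Preference order for edits, as listed in the text.
PREFERENCE = ["M1|XM1", "R1M1|R1M1", "M1R1|M1R1"]


def choose_edit(original, edits, parses):
    """Pick the preferred edit whose result parses.

    original -- word list for the unedited substring
    edits    -- dict mapping a pattern name to its edited word list
    parses   -- hypothetical callable returning True on a successful parse
    Returns (pattern, edited_words), or None when no decision is made.
    """
    if parses(original):
        return None  # assumption: treated here as no evidence of a repair
    for pattern in PREFERENCE:
        candidate = edits.get(pattern)
        if candidate is not None and parses(candidate):
            return pattern, candidate
    return None


# Toy usage with an invented acceptable-phrase list.
ACCEPTABLE = {"the earliest flight", "the flight"}


def toy_parses(words):
    return " ".join(words) in ACCEPTABLE


original = "the flight earliest flight".split()
edits = {
    "M1|XM1": "the earliest flight".split(),
    "R1M1|R1M1": "earliest flight".split(),
    "M1R1|M1R1": "the flight".split(),
}
print(choose_edit(original, edits, toy_parses))
```

      Here two of the edited strings parse, and the preference order selects the M1|XM1 edit.</Paragraph>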
    </Section>
  </Section>
  <Section position="9" start_page="60" end_page="62" type="metho">
    <SectionTitle>
ACOUSTICS
</SectionTitle>
    <Paragraph position="0"> A third source of information that can be helpful in detecting repairs is acoustics. In this section we describe first how prosodic information can help in distinguishing repairs from false positives for patterns involving matched words. Second, we report promising results from a preliminary study of cue words such as &quot;no&quot; and &quot;well.&quot; And third, we discuss how acoustic information can aid in the detection of word fragments, which occur frequently and which pose difficulty for automatic speech recognition systems.</Paragraph>
    <Paragraph position="1"> Acoustic features reported in the following analyses were obtained by listening to the sound files associated with each transcription, and by inspecting waveforms, pitch tracks, and spectrograms produced by the Entropic Waves software package.</Paragraph>
    <Section position="1" start_page="60" end_page="60" type="sub_section">
      <SectionTitle>
Simple Patterns
</SectionTitle>
      <Paragraph position="0"> While acoustics alone cannot tackle the problem of locating repairs, since any prosodic patterns found in repairs are likely to be found in fluent speech, acoustic information can be quite effective when combined with other sources of information, in particular with pattern matching.</Paragraph>
      <Paragraph position="1"> In studying the ways in which acoustics might help distinguish repairs from false positives, we began by examining two patterns conducive to acoustic measurement and comparison. First, we focused on patterns in which there was only one matched word, and in which the two occurrences of that word were either adjacent or separated by only one word. Matched words allow for comparison of word duration; proximity helps avoid variability due to global intonation contours not associated with the patterns themselves. We present here analyses for the M1|M1 (&quot;flights for &lt;one&gt; one person&quot;) and M1|XM1 (&quot;&lt;flight&gt; earliest flight&quot;) repairs, and their associated false positives (&quot;u s air five one one,&quot; &quot;a flight on flight number five one one,&quot; respectively).</Paragraph>
      <Paragraph position="2"> In examining the M1|M1 repair pattern, we found that the strongest distinguishing cue between the repairs (N = 20) and the false positives (N = 20) was the interval between the offset of the first word and the onset of the second. False positives had a mean gap of 42 msec (s.d. = 55.8) as opposed to 380 msec (s.d. = 200.4) for repairs.</Paragraph>
      <Paragraph position="3"> A second difference found between the two groups was that, in the case of repairs, there was a statistically reliable reduction in duration for the second occurrence of M1, with a mean difference of 53.4 msec. However because false positives showed no reliable difference for word duration, this was a much less useful predictor than gap duration.</Paragraph>
      <Paragraph position="4"> F0 of the matched words was not helpful in separating repairs from false positives; both groups showed a highly significant correlation for, and no significant difference between, the mean F0 of the matched words.</Paragraph>
      <Paragraph position="5"> A different set of features was found to be useful in distinguishing repairs from false positives for the M1|XM1 pattern. A set of 12 repairs and 24 false positives was examined; the set of false positives for this analysis included only fluent cases (i.e., it did not include other types of repairs matching the pattern). Despite the small data set, some suggestive trends emerge. For example, for cases in which there was a pause (200 msec or greater) on only one side of the inserted word, the pause was never after the insertion (X) for the repairs, and rarely before the X in the false positives. A second distinguishing characteristic was the peak F0 value of X. For repairs, the inserted word was nearly always higher in F0 than the preceding M1; for false positives, this increase in F0 was rarely observed. Table 6 shows the results of combining the acoustic constraints just described. As can be seen, such features in combination can be quite helpful in distinguishing repairs from false positives of this pattern. Future work will investigate the use of prosody in distinguishing the M1|XM1 repair not only from false positives, but also from other possible repairs having this pattern, i.e., M1R1|M1R1 and R1M1|R1M1.</Paragraph>
    </Section>
    <Section position="2" start_page="60" end_page="62" type="sub_section">
      <SectionTitle>
Cue Words
</SectionTitle>
      <Paragraph position="0"> A second way in which acoustics can be helpful given the output of a pattern matcher is in determining whether or not potential cue words such as &quot;no&quot; are used as an editing expression (Hockett, 1967) as in &quot;...flights &lt;between&gt; &lt;boston&gt; &lt;and&gt; &lt;dallas&gt; &lt;no&gt; between oakland and boston.&quot; False positives for these cases are instances in which the cue word functions in some other sense (&quot;I want to leave boston no later than one p m.&quot;). Hirschberg and Litman (1987) have shown that cue words that function differently can be distinguished perceptually by listeners on the basis of prosody. Thus, we sought to determine whether acoustic analysis could help in deciding, when such words were present, whether or not they marked the interruption point of a repair.</Paragraph>
      <Paragraph position="1"> In a preliminary study of the cue words &quot;no&quot; and &quot;well,&quot; we compared 9 examples of these words at the site of a repair to 15 examples of the same words occurring in fluent speech. We found that these groups were quite distinguishable on the basis of simple prosodic features. Table 7 shows the percentage of repairs versus false positives characterized by a clear rise or fall in F0 (greater than 15 Hz), lexical stress (determined perceptually), and continuity of the speech immediately preceding and following the editing expression (&quot;continuous&quot; means there was no silent pause on either side of the cue word). As can be seen, at least for this limited data set, cue words marking repairs were quite distinguishable from those same words found in fluent strings on the basis of simple prosodic features.</Paragraph>
      <Paragraph position="2"> Fragments. A third way in which acoustic knowledge can assist in detecting and correcting repairs is in the recognition of word fragments. As shown earlier, fragments are exceedingly common; they occurred in 366 of our 607 repairs. Fragments pose difficulty for state-of-the-art recognition systems because most recognizers are constrained to produce strings of actual words, rather than allowing partial words as output. Because so many repairs involve fragments, if fragments are not represented in the recognizer output, then information relevant to the processing of repairs is lost.</Paragraph>
      <Paragraph position="3"> We found that often when a fragment had sufficient acoustic energy, one of two recognition errors occurred. Either the fragment was misrecognized as a complete word, or it caused a recognition error on a neighboring word. Therefore if recognizers were able to flag potential word fragments, this information could aid subsequent processing by indicating the higher likelihood that words in the region might require deletion. Fragments can also be useful in the detection of repairs requiring deletion of more than just the fragment.</Paragraph>
      <Paragraph position="4"> In approximately 40% of the sentences containing fragments in our data, the fragment occurred at the right edge of a longer repair. In a portion of these cases, for example, &quot;leaving at &lt;seven&gt; &lt;fif-&gt; eight thirty,&quot; the presence of the fragment is an especially important cue because there is nothing (e.g., no matched words) to cause the pattern matcher to hypothesize the presence of a repair.</Paragraph>
      <Paragraph position="5"> We studied 50 fragments drawn at random from our total corpus of 366. The most reliable acoustic cue over the set was the presence of a silence following the fragment. In 49 out of 50 cases, there was a silence of greater than 60 msec; the average silence was 282 msec. Of the 50 fragments, 25 ended in a vowel, 13 contained a vowel and ended in a consonant, and 12 contained no vocalic portion.</Paragraph>
      <Paragraph position="6"> It is likely that recognition of fragments of the first type, in which there is abrupt cessation of speech during a vowel, can be aided by looking for heavy glottalization at the end of the fragment.</Paragraph>
      <Paragraph position="7"> We coded fragments as glottalized if they showed irregular pitch pulses in their associated waveform, spectrogram, and pitch tracks. We found glottalization in 24 of the 25 vowel-final fragments in our data. An example of a glottalized fragment is shown in Figure 1.</Paragraph>
      <Paragraph position="8"> Although it is true that glottalization occurs in fluent speech as well, it normally appears on unstressed, low F0 portions of a signal. The 24 glottalized fragments we examined, however, were not at the bottom of the speaker's range, and most had considerable energy. Thus, when combined with the feature of a following silence of at least 60 msec, glottalization on syllables with sufficient energy and not at the bottom of the speaker's range may prove a useful feature in recognizing fragments.</Paragraph>
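      <Paragraph> The combination of cues suggested above for vowel-final fragments can be written as a simple conjunction of features. The sketch below is hypothetical throughout: the Segment record, its field names, and the function name are assumptions made for illustration, the boolean features would have to come from some upstream acoustic analysis, and only the 60 msec silence figure is taken from the text.

```python
from dataclasses import dataclass

MIN_SILENCE_MS = 60.0  # following-silence figure quoted in the text


@dataclass
class Segment:
    """Hypothetical acoustic summary of one vowel-final stretch of speech."""
    following_silence_ms: float
    glottalized: bool
    sufficient_energy: bool
    near_bottom_of_f0_range: bool


def may_be_fragment(seg):
    """Flag a segment as a possible word fragment (a heuristic, not a test)."""
    return (
        seg.following_silence_ms >= MIN_SILENCE_MS
        and seg.glottalized
        and seg.sufficient_energy
        and not seg.near_bottom_of_f0_range
    )


candidate = Segment(282.0, True, True, False)
fluent_creak = Segment(300.0, True, False, True)
print(may_be_fragment(candidate), may_be_fragment(fluent_creak))
# prints True False
```

      The second segment stands for glottalization on a low-energy, low F0 stretch of fluent speech, which this rule leaves unflagged.</Paragraph>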
    </Section>
  </Section>
</Paper>