<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1008"> <Title>Spontaneous Speech Collection for the ATIS Domain with an Aural User Feedback Paradigm</Title> <Section position="4" start_page="43" end_page="44" type="metho"> <SectionTitle> Sentence: WHAT AIRLINES FLY FROM BOSTON TO ATLANTA
Table:
AIRLINE NAME             AIRLINE CODE  FROM  TO
DELTA AIR LINES, INC.    DL            BOS   ATL
EASTERN AIR LINES, INC.  EA            BOS   ATL
USAIR                    US            BOS   ATL
</SectionTitle> <Paragraph position="0"> ATIS: The airlines with service between Boston and Atlanta are Delta, Eastern, and U S Air.</Paragraph> <Paragraph position="1"> Figure 3: In this example, a table with three rows is compressed into a single sentence.</Paragraph> <Paragraph position="2"> The information presentation component of the system was developed with two goals in mind. The first was to present information so it could be easily understood. Toward this end, the presentation component was developed to format information into coherent sentences, to expand or hide all codes and abbreviations, and to maximize the intelligibility of the speech synthesizer output. The second goal was to minimize the amount of irrelevant information presented to the user. 
Toward this goal, the presentation component includes the above-mentioned facilities for summarizing, compressing, and filtering information retrieved from the database.</Paragraph> <Paragraph position="3"> Sentence: I WANT TO GO ABOUT 3 P M ATIS: There are no flights leaving between two forty five P M and three fifteen P M.</Paragraph> <Paragraph position="4"> The next earliest flight is eastern flight six forty five departing at two twenty one P M.</Paragraph> <Paragraph position="5"> The next latest flight is delta flight nine seventy five departing at three twenty P M.</Paragraph> <Paragraph position="6"> Please refer to these flights by flight number or departure time.</Paragraph> <Paragraph position="7"> feedback to the user on the state of the discourse in the form of text and synthesized speech. However, the supporting text produced by the MIT system is intended to complement and to direct the user's attention to a tabular display. This capability was modified to complement the summarization facility mentioned above. In the example shown in Figure 2, the MIT system would generate the text &quot;Here are the flights from Boston to Atlanta&quot; to accompany the table listing seventeen flights, while the AT&T system would generate &quot;There are seventeen flights from Boston to Atlanta&quot; to accompany the following summary sentence. The system error responses (NL failure, database access failure, etc.) were also modified to fit the audio feedback paradigm.</Paragraph> <Paragraph position="8"> System Initiative. The MIT system takes initiative in two contexts: when the system does not have enough information to access the database, and when guiding a user through the flight booking process \[3\]. The AT&T system takes initiative in an additional context, when the subject is selecting a flight on the basis of departure or arrival time. First, the system prompts the user for a departure time if the departure time is summarized, as in Figure 2. 
Second, the system volunteers the next earliest and next latest flight when the subject requests a flight at a certain time, and there isn't one. An example is shown in Figure 4. This second capability was developed to address a problem that was causing a great deal of user frustration. Because the system would not provide complete flight information for more than three flights, subjects were forced to play a guessing game to find out flight departure times. The need for this type of system initiative is a result of the limits imposed by the interaction paradigm. The more a system restricts the flow of information, the more assistance it must provide to help the user access the information.</Paragraph> <Paragraph position="9"> Recording Control. At all the other ATIS data collection sites, the subject controls the recording process using a push-to-talk or push-and-hold-to-talk mechanism. We chose not to use such a subject-controlled recording mechanism in order to more closely simulate an actual telephone dialogue. Instead, the experimenter who transcribed the subject's speech also controlled the start and end of recording from the keyboard. The control loop was designed to keep the interaction flowing as smoothly and efficiently as possible in the hope of eliciting more natural speech from our subjects. Many subjects were initially uncertain about when to start and stop talking, but most of them adjusted to the interaction after the first scenario. Some effects of experimenter-controlled recording on subjects' speech are discussed in section 3.4.</Paragraph> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 2.2. Recording Environment and System Hardware </SectionTitle> <Paragraph position="0"> Data were collected in a walled-off corner of a computer laboratory. The subjects were seated at a desk with a telephone, and provided with paper and writing implements. All system feedback to the subject was provided over the telephone by the AT&T TTS speech synthesizer. Speech data were captured simultaneously using (1) a Sennheiser HMD-410 close-talking microphone amplified by a Shure FP11 microphone-to-line amplifier, and (2) a standard carbon button microphone (in the telephone handset) over local telephone lines. Digitization was performed by an Ariel Pro-Port A/D system.</Paragraph> </Section> <Section position="2" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 2.3. Data Collection Procedure </SectionTitle> <Paragraph position="0"> Before a recording session began, the experimenter provided the subject with a brief verbal explanation of the task and a page of written instructions. The subject also received a summary of the task domain and two sets of travel planning scenarios. The first set of scenarios included a number of simple tasks (referred to below as &quot;short scenarios&quot;) and the ATIS common scenario (used at all five ATIS data collection sites). The second set contained more complicated tasks (referred to below as &quot;long scenarios&quot;), and subjects were permitted to attempt to book flights while working on these scenarios. Initially, the subjects selected which scenarios they wanted to try; because of problems with uneven scenario distribution, the experimenter began selecting an initial set of scenarios (two short, one long) for each subject.</Paragraph> <Paragraph position="1"> Subjects were asked to speak as they would to a human being, and to speak in single sentences. They were not told that someone was listening to them and typing in what they said until after the entire recording session was over. 
A complete session lasted about an hour, including initial instruction, a two-part recording session with a five-minute break, and a debriefing questionnaire.</Paragraph> <Paragraph position="2"> During the recording session, the experimenter listened to the subject's speech and the system's response. The system initiated the dialogue with the prompt, &quot;I'm ready to begin a scenario,&quot; and responded after every utterance with information or an error message. An example of a typical series of interactions is given in Figure 5. The experimenter controlled recording from the keyboard, starting recording as soon as the system response ended, and stopping recording when the subject appeared to have completed a sentence. The experimenter was asked to transcribe exactly what the subject said, excluding false starts. However, because of (perceived) pressure on the experimenters to get answers to the subjects, especially after repeated system failure, the session transcriptions sent to the interaction log files were not always accurate. Most of the time, the experimenter interacted with the subject only through the system. However, in cases of complete system failure and severe subject confusion, the experimenter could communicate directly with the subject, either by sending a message through the speech synthesizer or by speaking directly to the subject.</Paragraph> <Paragraph position="3"> Subjects for data collection were recruited from local civic organizations, and collection took place during working hours. As a result, 82% of the subjects were female, and subjects ranged in age from 29 to 77, with a median age of 55. Approximately 60% of the subjects came from the New York City dialect region; all were native speakers of English. 
In return for each subject's participation, a donation was made to the civic organization through which he or she was recruited.</Paragraph> <Paragraph position="4"> Four summer students served as experimenters for almost all of the data collection sessions. They were trained for two weeks during pre-collection system debugging. The system was debugged and intermittently upgraded during and after the 2 1/2-week collection. All of the data were then transcribed and submitted to NIST for distribution.</Paragraph> </Section> </Section> <Section position="5" start_page="44" end_page="44" type="metho"> <SectionTitle> 3. COMPARATIVE ANALYSES </SectionTitle> <Paragraph position="0"> In the following section, we present analyses of data collected at AT&T and the four other data collection sites.</Paragraph> <Paragraph position="1"> We will pay particular attention to comparing the AT&T data with the MIT data, since the two collection systems had identical natural language and database access components, and the two sites used the same travel planning scenarios.</Paragraph> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 3.1. General Characteristics </SectionTitle> <Paragraph position="0"> Table 1 presents some statistics which can be used to characterize each site's collection system and procedure.</Paragraph> <Paragraph position="1"> The statistics were computed for each speaker, and the speaker values averaged for each site. 
AT&T is in the middle of the pack in utterances per scenario, collection rate (number of utterances collected divided by recording session duration), and human-machine interaction time Scenario: &quot;Determine the type of aircraft used on a flight from one city to another that leaves before (or after) a certain time of the day.&quot; ATIS: I'm ready to begin a scenario.</Paragraph> </Section> </Section> <Section position="6" start_page="44" end_page="47" type="metho"> <SectionTitle> AVAILABLE FROM ATLANTA TO BOSTON ON THE MORNING OF JULY SEVENTH </SectionTitle> <Paragraph position="0"> ATIS: I'm confused. Could you try to rephrase your question? Subject: I WANT TO FLY FROM BOSTON TO ATLANTA ATIS: There are seventeen flights from boston to atlanta.</Paragraph> <Paragraph position="1"> There are flights departing between six thirty A M and eight twenty four P M.</Paragraph> <Paragraph position="2"> What time would you like to go? Subject: SIX THIRTY A M ATIS: There is one flight from boston to atlanta leaving six thirty A M.</Paragraph> <Paragraph position="3"> delta flight two twenty five from Boston to Atlanta departs at six thirty A M and arrives at nine thirteen A M with no stops.</Paragraph> <Paragraph position="4"> Subject: WHAT TYPE OF AIRCRAFT WILL BE USED ATIS: There is aircraft for one flight. delta flight two twenty five from Boston to Atlanta departs at six thirty A M and arrives at nine thirteen A M with no stops. The plane for this flight is a McDonnell Douglas M D eighty.</Paragraph> <Paragraph position="5"> Subject: I WOULD LIKE TO BOOK THAT FLIGHT ATIS: There is one flight.</Paragraph> <Paragraph position="6"> delta flight two twenty five from Boston to Atlanta departs at six thirty A M and arrives at nine thirteen A M with no stops.</Paragraph> <Paragraph position="7"> (interval from the beginning of one utterance to the beginning of the next). 
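The per-speaker statistics reported in Table 1 amount to a few simple ratios over the session logs. The following sketch shows one way to compute them; the log representation (an Utterance record with a start time and word count, a SpeakerSession with a scenario count and session duration) is a hypothetical stand-in, not the format actually used at the sites:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    start: float   # seconds from the start of the recording session
    n_words: int

@dataclass
class SpeakerSession:
    site: str
    n_scenarios: int
    duration_min: float            # recording-session duration, in minutes
    utterances: List[Utterance]

def speaker_stats(s: SpeakerSession) -> dict:
    """Per-speaker values; a site is characterized by averaging these."""
    n = len(s.utterances)
    # human-machine interaction time: beginning of one utterance to the
    # beginning of the next, so it includes the system's response time
    gaps = [b.start - a.start for a, b in zip(s.utterances, s.utterances[1:])]
    return {
        "utts_per_scenario": n / s.n_scenarios,
        "collection_rate": n / s.duration_min,          # utterances per minute
        "words_per_utt": sum(u.n_words for u in s.utterances) / n,
        "interaction_time": sum(gaps) / len(gaps) if gaps else 0.0,
    }

def site_average(sessions: List[SpeakerSession], site: str) -> dict:
    """Average the per-speaker values over all speakers at one site."""
    per_speaker = [speaker_stats(s) for s in sessions if s.site == site]
    return {k: sum(d[k] for d in per_speaker) / len(per_speaker)
            for k in per_speaker[0]}
```

Averaging per-speaker values, rather than pooling all utterances, keeps a single talkative subject from dominating a site's figures.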
The average AT&T speaker used significantly more words per utterance than the average speaker at any other site. This may be due to our subjects' response to NL system failure, which is discussed in section 3.2. AT&T subjects also had a lower average speaking rate than speakers from other sites. This may be related to the higher disfluency rate (discussed in section 3.4) and increases in the frequency of occurrence and durations of silent pauses.</Paragraph> <Section position="1" start_page="44" end_page="46" type="sub_section"> <SectionTitle> 3.2. NL System Failure </SectionTitle> <Paragraph position="0"> One of the effects of the audio interaction paradigm was a higher NL system failure rate (MIT 33.4%, AT&T 42.9%), where NL system failure is defined as the failure to completely process an utterance because it contains unknown words or fails to parse.</Paragraph> <Paragraph position="3"> The change in the interaction paradigm changed the NL task; since the NL system was designed based on a visual display, the NL system failure rate was expected to increase. The response of subjects to NL system failure was also affected by the change to the audio interaction paradigm. Table 2 shows the subjects' responses to NL failure at AT&T and MIT. Subjects at both AT&T and MIT spoke longer sentences and slowed their speaking rates. However, the effects of NL failure on subjects' speech are more dramatic in the AT&T data: the number of words per utterance increased by over 50% (MIT 20%), the speaking rate dropped by 15% (MIT 5%), and the utterance duration increased by over 75% (MIT 25%) when compared with utterances which did not follow an NL system error.</Paragraph> <Paragraph position="4"> The large increase in utterance length after NL failure, combined with the high NL failure rate, is the main reason the average AT&T sentence is so much longer than the average MIT sentence, both in number of words and in duration. 
However, the reason behind the post-NL-failure increase in sentence length is not entirely clear. A qualitative examination seems to indicate that the system is not effectively communicating the reason for its failure. The NL system usually fails as a result of an unfamiliar, unusual, or ungrammatical syntactic construction. During the initial task familiarization, the subjects were told that the system failure was triggered by problems with a sentence's grammatical construction, and not by any type of recognition problem. Subjects were also informed of the system's discourse capabilities. Yet when a sentence failed to parse and the subject was asked to rephrase his or her request, he or she frequently responded by simply tacking on a summary of the previous discourse without modifying the syntactic structure of the original sentence. In these cases, the subjects appeared to respond to the NL failure as a discourse failure instead of a syntactic failure. Subjects did appear to adjust their speech to the constraints imposed by the NL system, as the NL system failure rate decreased from 51% in the first scenario to 39% in subsequent scenarios.</Paragraph> </Section> <Section position="2" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 3.3. Vocabulary Comparisons </SectionTitle> <Paragraph position="0"> Table 3 contains statistics on the increase in lexicon size as a function of the number of sentences collected. The breakpoint of 600 sentences collected was chosen because it was the point at which the vocabulary growth rate remained less than 30 words/100 sentences for all sites.</Paragraph> <Paragraph position="1"> MIT has reached a terminal vocabulary growth rate of 8.7 new words/100 sentences collected; the vocabulary growth rates at the other four sites continue to decrease as the number of collected sentences increases. Figure 6 is a graph of lexicon size vs. 
number of sentences collected for each site.</Paragraph> <Paragraph position="2"> Excluding the AT&T data, the percentage of words in lexicon X that are found in lexicon Y appears to be proportional to the size of lexicon X. However, the percentages of words in the AT&T lexicon that appear in the BBN, CMU, and SRI lexicons are lower (by about 5 percentage points) than predicted, though the overlap with the MIT lexicon matches the prediction fairly well. One explanation is that, although the change in the interaction paradigm does affect the vocabulary, the effect of the change in paradigm is similar to the effects of other inter-site system variations. The AT&T system differs from the MIT system only in the interaction paradigm, but differs from the BBN, CMU, and SRI systems in other ways, in addition to the different interaction paradigm.</Paragraph> </Section> <Section position="3" start_page="46" end_page="47" type="sub_section"> <SectionTitle> 3.4. Disfluencies </SectionTitle> <Paragraph position="0"> Table 4 contains statistics on the occurrence of disfluencies as transcribed by each site. Partial words and word fragments are counted as lexical false starts, and verbally deleted complete words are counted as linguistic false starts, as in \[2\]. The high percentage of utterances containing linguistic false starts and filled pauses in the AT&T data reflects the subjects' response to experimenter-controlled recording and their uncertainty about how to interact with the system. Since the subjects did not take any initiative in starting and stopping recording, they were less likely to compose their thoughts before they began speaking. 
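The utterance-level percentages in Table 4 can be computed directly from transcriptions once the disfluency types are marked. The sketch below assumes hypothetical markup conventions (a trailing hyphen on word fragments, asterisks around verbally deleted words, bracketed filled pauses); the actual ATIS transcription conventions are not reproduced here:

```python
import re

# Hypothetical markup, one convention per disfluency type:
#   "FLI-"    word fragment         -> counts as a lexical false start
#   "*SHOW*"  verbally deleted word -> counts as a linguistic false start
#   "[UH]"    filled pause
FRAGMENT = re.compile(r"\w+-")
DELETED = re.compile(r"\*\w+\*")
FILLED = re.compile(r"\[(UH|UM|ER)\]", re.IGNORECASE)

def disfluency_rates(utterances):
    """Percentage of utterances containing each disfluency type."""
    counts = {"lexical": 0, "linguistic": 0, "filled": 0}
    for utt in utterances:
        tokens = utt.split()
        if any(FRAGMENT.fullmatch(t) for t in tokens):
            counts["lexical"] += 1
        if any(DELETED.fullmatch(t) for t in tokens):
            counts["linguistic"] += 1
        if any(FILLED.fullmatch(t) for t in tokens):
            counts["filled"] += 1
    n = len(utterances)
    return {k: 100.0 * v / n for k, v in counts.items()}
```

Note that the rates are per utterance, matching the table: an utterance containing two word fragments still contributes only once to the lexical false start percentage.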
The rate of filled pauses and false starts did decrease somewhat as the subject became more comfortable talking to the system: comparing disfluency rates for first scenario and last scenario utterances, the percentage of utterances containing linguistic false starts decreased from 13% to 11%, the percentage containing lexical false starts from 8% to 7%, and the percentage containing filled pauses from 15% to 12%.</Paragraph> <Paragraph position="1"> NL system failure strongly affected the rate of linguistic false starts. The percentage of utterances containing linguistic false starts increased from 9.4% after a successfully processed utterance to 14.4% after an NL system error. The absence of similar increases in the rates of lexical false starts and filled pauses indicates that the subjects' speech was disrupted primarily at the syntactic level.</Paragraph> <Paragraph position="2">
% of sentences with      AT&T  MIT  CMU  BBN  SRI
linguistic false starts  11.4  3.9  1.2  2.4  4.5
lexical false starts      7.6  2.8  9.3  2.2  2.8
filled pauses            13.7  3.1  3.0  1.9  1.5
</Paragraph> </Section> </Section> </Paper>