<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1071">
  <Title>Collection of Spontaneous Speech for the ATIS Domain and Comparative Analyses of Data Collected at MIT and TI</Title>
  <Section position="3" start_page="0" end_page="360" type="metho">
    <SectionTitle>
DATA COLLECTION
</SectionTitle>
    <Paragraph position="0"> As is the case with other efforts \[4,2,1\], our data are collected under simulation. Nevertheless, we wanted the simulation to reflect as much as possible the system that we are developing. In this section, we briefly describe some design issues and document the actual collection process. Further details can be found elsewhere \[7\].</Paragraph>
    <Section position="1" start_page="0" end_page="360" type="sub_section">
      <SectionTitle>
Methodological Considerations
</SectionTitle>
      <Paragraph position="0"> While many years may pass before we are able to build systems with capabilities approaching those of humans, we believe strongly that it should soon be possible to develop functioning systems with limited capabilities. The successful development of such systems will partly depend on our ability to train subjects to stay within the restricted domain of the system. Therefore, we should try to collect data intentionally restricting the user in ways that closely match system capability. In this section we will describe some aspects of our data collection paradigm that support this viewpoint.</Paragraph>
      <Paragraph position="1"> Wizard vs. System By far the most important difference between the data collection procedures at TI and MIT is the way system simulation is conducted during data collection.</Paragraph>
      <Paragraph position="2"> TI made use of a &amp;quot;wizard&amp;quot; paradigm, in which a highly skilled experimenter interprets what was spoken, converts it into a form that enables database access, and produces an answer for the subject \[4,2\]. Based on our previous positive experience with collecting spontaneous speech for a different domain \[10\], we decided to explore an alternative to the paradigm used at TI, in which we make use of the system under development to do most of the work. That is, prior to the beginning of data collection, the natural language component is developed to the point where it has reasonable coverage of</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="360" end_page="362" type="metho">
    <SectionTitle>
Figure 1: ATIS Response
</SectionTitle>
    <Paragraph position="0"> [Figure 1: raw display and processed display of the ATIS response for a query. Subject: &amp;quot;Show flights from Philadelphia to Denver serving lunch or dinner on February second and also show their fares.&amp;quot; The response text reads &amp;quot;These are the flights from Philadelphia to Denver serving lunch and dinner on Friday February 2.&amp;quot; Surviving column headers: airline code, flight number, airline, number, from, to, departure, arrival, stops, meals served.]</Paragraph>
    <Paragraph position="1"> the possible queries. In addition, the system must be able to automatically translate the text into a query to the database, and return the information to the subject. Once such a system is available, data collection is accomplished by having the experimenter, a fast and accurate typist, type verbatim to the system what was spoken, after removing spontaneous speech disfluencies. The actual interpretation and response generation is accomplished by the system without further human intervention. If the sentence cannot be understood by the system, an error message is produced to help the subject make appropriate modifications.</Paragraph>
    <Paragraph position="2"> Another feature of our paradigm is that the underlying system can be improved incrementally using the data collected thus far. The resulting expansion in system capabilities permits us to accommodate more complex sentences as well as those that previously failed.</Paragraph>
    <Paragraph position="3"> Displays One of the considerations that led to the selection of ATIS as the common task is the realization that, since most people have planned air travel at one time or another, there will be no shortage of subjects familiar with the task. Since the average traveller is not likely to be familiar with the format and display of the Official Airline Guide (OAG), we have translated many of the cryptic symbols and abbreviations that OAG uses into easily recognizable words.</Paragraph>
    <Paragraph position="4"> We believe that this change has the positive effect of helping the subject focus on the travel planning aspect, and not be confused by the cryptic displays that are intended for more experienced users. In fact, we try to keep the displayed information to a minimum in order to encourage verbal problem solving. In general, we only display the airline, flight number, origination and destination cities, departure and arrival time, and the number of stops. Additional columns of information are included only when specifically requested.</Paragraph>
    <Paragraph position="5"> Figure 1 illustrates the difference between the raw displays returned from the OAG database and the ones that we present to the subjects by applying some post-processing to the raw display. The query is &amp;quot;Show flights from Philadelphia to Denver serving lunch or dinner on February second and also show their fares.&amp;quot; Note that the airlines, meal codes, and fare codes in the processed displays (shown in the right-hand panels) have all been translated into words as much as possible while keeping the displays manageable on a screen. The military-time displays for departure and arrival times have also been converted to more familiar forms to facilitate interpretation. Furthermore, under the TI data collection scheme the answer is assembled as one large table. However, we break it up into two answers, one for the flights and one for the fares.</Paragraph>
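    <Paragraph> The display post-processing described above can be illustrated with a minimal sketch. It is not the code used at MIT: the function names are hypothetical, the airline table is a two-entry illustration, and the output time format is an assumption, since the text says only that military times were converted to more familiar forms.

```python
# Illustrative lookup table only; the actual translations are not listed in the paper.
AIRLINE_NAMES = {'EA': 'Eastern', 'UA': 'United'}


def to_familiar_time(military):
    """Convert a 24-hour time such as 1745 or '0905' to a form like '5:45 P.M.'."""
    hours, minutes = divmod(int(military), 100)
    suffix = 'A.M.' if hours in range(0, 12) else 'P.M.'
    hour12 = hours % 12
    if hour12 == 0:
        hour12 = 12  # midnight and noon are shown as 12
    return '{}:{:02d} {}'.format(hour12, minutes, suffix)


def expand_airline(code):
    """Return a readable airline name, falling back to the raw code if unknown."""
    return AIRLINE_NAMES.get(code, code)
```

Falling back to the raw code keeps the display usable when a code is missing from the table, at the cost of leaving that entry cryptic.</Paragraph>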
    <Paragraph position="6"> System Feedback Our system provides explicit feedback to the subject in the form of text and synthetic speech, paraphrasing its understanding of the sentence. This feature is illustrated in Figure 1 in the right-hand panels, immediately above the display tables. By providing confirmation to the subject of what was understood, the system greatly reduces the confusion and frustration that may arise later on in the dialogue caused by an earlier error in the system's responses.</Paragraph>
    <Paragraph position="7"> In addition, the generation of a verbal response implicitly encourages the notion of human/machine interactive dialogue.</Paragraph>
    <Paragraph position="8"> Interactive Dialogue We believe that, for the ATIS system to be truly useful, the user must be able to carry out an interactive dialogue with the system, in the same way that a traveller would with a travel agent. Our data collection procedure therefore encourages natural dialogue by incorporating some primitive discourse capabilities, allowing subjects to make indirect as well as direct anaphoric references, and fragmentary responses where appropriate. In some sessions, we even use a version of the system that plays an active role in guiding the subject through flight reservations. Details of the interactive dialogue capabilities of our system are described in a companion paper \[9\].</Paragraph>
    <Section position="1" start_page="361" end_page="362" type="sub_section">
      <SectionTitle>
Data Collection Process
</SectionTitle>
      <Paragraph position="0"> The data are collected in an office environment where the ambient noise is approximately 60 dB SPL, measured on the C scale. The subject sits in front of a display monitor, with a window slaved to a Lisp process running on the experimenter's Sun workstation located in a nearby room. The experimenter's typing is hidden from the subject to avoid unnecessary distractions. A push-to-talk mechanism is used to collect the data. The subject is instructed to hold down a mouse button while talking, and release the button when done. The resulting speech is captured both by a Sennheiser HMD-224 noise-cancelling microphone and a Crown PZM desk-top microphone, digitized simultaneously.</Paragraph>
      <Paragraph position="1"> Prior to the session, the subject is given a one-page description of the task, a one-page summary of the system's knowledge base, and three sets of scenarios \[7\]. The first set contains simple tasks such as finding the earliest flight from one city to another that serves a meal, and is intended as a &amp;quot;warm-up&amp;quot; exercise. The second set involves more complex tasks, and includes the official ATIS scenarios. Finally, the subject is asked to make up a scenario and attempt to solve it. The subjects are instructed to choose a pre-determined number of scenarios from each category. In addition, they are asked to clearly delineate each scenario with the commands &amp;quot;begin scenario x&amp;quot; and &amp;quot;end scenario x,&amp;quot; where x is a number, so that we can keep track of discourse utilization. A typical session lasts about 40 minutes, including initial task familiarization and final questionnaire.</Paragraph>
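      <Paragraph> The scenario markers mentioned above make it straightforward to segment a session log. The following is a minimal sketch, assuming the log is available as a list of typed utterances; the function name and the log representation are hypothetical and are not taken from the paper.

```python
import re

# Matches the delimiting commands 'begin scenario x' and 'end scenario x'.
MARKER = re.compile(r'^(begin|end) scenario (\d+)$', re.IGNORECASE)


def group_by_scenario(utterances):
    """Map each scenario number to the utterances typed between its markers."""
    scenarios = {}
    current = None
    for text in utterances:
        match = MARKER.match(text.strip())
        if match is None:
            # Ordinary utterance: keep it only if a scenario is open.
            if current is not None:
                scenarios[current].append(text)
            continue
        kind, number = match.group(1).lower(), int(match.group(2))
        if kind == 'begin':
            current = number
            scenarios.setdefault(number, [])
        elif number == current:
            current = None
    return scenarios
```

Utterances that fall outside any scenario are dropped here; a real log reader might prefer to keep them under a separate key.</Paragraph>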
      <Paragraph position="2"> For several initial data collection sessions, the first two authors took turns serving as the experimenter. Once we began daily sessions, however, it was possible to hire a part-time helper to serve as the scheduler, experimenter, and transcriber. The experimenter can hear everything that the subject says and can communicate with the subject via a two-way microphone/speaker hook-up. However, the experimenter rarely communicates with the subject once the experiment is under way. The digitized speech is played back to the experimenter, allowing him/her to confirm that the recording was successful. The voice input during the session, minus disfluencies, is typed verbatim to ATIS by the experimenter, and saved automatically in a computer log. The system response is generated automatically from this text, and is also recorded into the log. The system's response typically takes less than 10 seconds after the text has been entered. At a later time, the experimenter listens again to each digitized sentence and inserts false starts and non-speech events into the orthography to form a detailed orthographic transcription, following the conventions described in \[6\].</Paragraph>
      <Paragraph position="3"> There are basically three ways that the system can fail, each of which provides a distinct error message. If the sentence contains unknown words, then the system identifies to the subject the words that it doesn't know. If it knows all the words, but the sentence fails to parse, then it identifies the point in the sentence where the parse failed. Finally, it may parse the sentence but fail to produce a response due to, for instance, an incorrect database query. In that case, it simply says, &amp;quot;I ran into an error trying to evaluate this sentence.&amp;quot; Our long-term goal is to make error messages sufficiently informative that the subject knows how to best rephrase the query. By examining how subjects react to the various kinds of error messages, we hope to improve the overall usability of the system. Figure 2 illustrates the data collection process with a simple dialogue between a subject and ATIS.</Paragraph>
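      <Paragraph> The three failure modes above are checked in a fixed order, which can be sketched as follows. This is an illustration only: the parser and database back end are replaced by two stand-in arguments, and the wording of the first two messages is hypothetical. Only the third message is quoted from the text.

```python
def error_message(words, known_words, parse_failure_index=None, query_ok=True):
    """Return the error message for a sentence, or None if it is handled.

    parse_failure_index and query_ok stand in for the results of a real
    parser and database query, which are not reproduced here.
    """
    # 1. Unknown words are reported first.
    unknown = [w for w in words if w not in known_words]
    if unknown:
        return 'Unknown words: ' + ', '.join(unknown)
    # 2. All words known, but the parse failed at some point in the sentence.
    if parse_failure_index is not None:
        return 'Parse failed at: ' + words[parse_failure_index]
    # 3. Parsed, but no response could be produced.
    if not query_ok:
        return 'I ran into an error trying to evaluate this sentence.'
    return None
```

Ordering the checks this way means a subject always sees the most specific message available for a failed sentence.</Paragraph>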
      <Paragraph position="4"> [Figure 2: a simple dialogue between a subject and ATIS. SCENARIO: &amp;quot;Find the earliest (or latest) flight from one city to another that serves a meal of your choice.&amp;quot;] [...] that they be native speakers. For their efforts, each subject is given a gift certificate to a popular Chinese restaurant or an ice-cream parlor. Presently, we are collecting data from two to three subjects per day.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="362" end_page="363" type="metho">
    <SectionTitle>
COMPARATIVE ANALYSES
</SectionTitle>
    <Paragraph position="0"> To facilitate system development, training, and testing, we arbitrarily partitioned part of the collected data into training, development-test, and test sets, as summarized in Table 1. All the comparative analyses reported in this section are based on our designated training set and the TI training set, the latter defined as the total amount of training data [...] General Characteristics Table 2 compares some general statistics of the data in the TI and MIT training sets. On average, the wizard paradigm used at TI can collect 25 sentences over approximately 40 minutes, for a yield of 39 sentences per hour \[2\]. In contrast, we were able to collect an average of about 39 sentences in approximately 45 minutes, for a yield of 53 sentences per hour. Our higher yield is presumably due to the fact that the system can respond much faster than a wizard; the process of translating the sentences into an NLParse command \[2\] by hand can sometimes be quite time-consuming. Note that the yields in both cases do not include the generation of the ancillary files, which is an essential task performed after data collection.</Paragraph>
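    <Paragraph> The yield figures above are sentences per hour, that is, sentences divided by session minutes and multiplied by 60. The sketch below, with hypothetical function names, applies that formula to the rounded averages quoted in the text. It gives 37.5 and 52 sentences per hour, slightly below the reported 39 and 53. One possible reading, offered here only as an assumption, is that the quoted session lengths are rounded: the reported yields correspond to sessions of roughly 38.5 and 44 minutes.

```python
def sentences_per_hour(sentences, minutes):
    """Yield in sentences per hour for an average session."""
    return sentences * 60.0 / minutes


def implied_minutes(sentences, yield_per_hour):
    """Session length in minutes implied by a sentence count and a reported yield."""
    return sentences * 60.0 / yield_per_hour
```
</Paragraph>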
    <Paragraph position="1">  and MIT.</Paragraph>
    <Paragraph position="2"> The average number of words per sentence for the MIT data is 15% lower than that for the TI data. The shorter sentences in the MIT data may have several causes.</Paragraph>
    <Paragraph position="3"> The system's inability to deal with longer sentences and the feedback that it provides may coerce the subject into making shorter sentences. The limited display may discourage the construction of lengthy and sometimes contorted sentences that attempt to solve the scenarios expeditiously. The interactive nature of problem solving may encourage the user to take a &amp;quot;divide-and-conquer&amp;quot; attitude and ask simpler questions. Closer examination of the data reveals that the standard deviation on sentence length is very different between the two data sets (σ_TI = 5.53 and σ_MIT = 3.68). We suspect that this is primarily due to the preponderance of short symbol clarification sentences such as &amp;quot;What does EA mean?&amp;quot; in the TI data, along with occasional very long sentences.</Paragraph>
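    <Paragraph> Sentence-length statistics of the kind quoted above can be computed from a list of transcribed sentences. The sketch below is a minimal illustration with a hypothetical function name. It counts whitespace-separated words and uses the population standard deviation; the paper does not say which tokenization or which estimator was used, so both are assumptions.

```python
import statistics


def sentence_length_stats(sentences):
    """Return (mean, standard deviation) of sentence length in words."""
    lengths = [len(s.split()) for s in sentences]
    return statistics.mean(lengths), statistics.pstdev(lengths)
```

A mix of very short and very long sentences, as suspected for the TI data, raises the standard deviation even when the mean is similar.</Paragraph>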
    <Paragraph position="4"> Table 2 shows that 25% of the TI sentences deal with table clarification compared to only 1% of our sentences. In fact, 8 of our 16 table clarification sentences concern airline code abbreviations. They were collected from earlier sessions when the display was still somewhat cryptic. Once we made some extremely simple changes in the display, such sentences no longer appeared.</Paragraph>
    <Paragraph position="5"> The speaking rate of the MIT sentences was more than 70% higher than that of the TI sentences. We believe that the speaking rate of the TI sentences (70 words/minute) is unnaturally low. This may be due to the insertion of many pauses, or the fact that the subjects simply spoke tentatively, due to their unfamiliarity with the task. Acoustic analysis is clearly needed before we can know for certain.</Paragraph>
    <Paragraph position="6"> System Growth Rate Figure 3 compares the size of the lexicon, i.e., the number of unique words, as a function of the number of training sentences collected at TI and MIT. The figure shows that the vocabulary size grows at a much slower rate (about 20 words per 100 training sentences) for the MIT ATIS data than the TI data (about 50 words per 100 training sentences). Also included on the figure for reference is a plot of the growth rate for our VOYAGER corpus, which was collected using the same paradigm as we have used for ATIS.</Paragraph>
    <Paragraph position="7"> A previous comparison of the TI data and the MIT VOYAGER data \[5\] led to the conclusion that the VOYAGER domain was intrinsically more restricted. Since the MIT ATIS data are more similar to the VOYAGER data, it may be the case that a more critical factor was the data collection paradigm. Thus, one may argue that our data collection procedure is better able to encourage the subjects to stay within the domain. A slow growth rate may also be an indication that the training data is more representative of the unseen test data.</Paragraph>
    <Paragraph position="9"> As further evidence that our training data is representative of the data that the system is likely to see, Table 3 compares the system's performance on the MIT training and development-test sets. The similarity in performance between the two data sets is striking, suggesting that the system is able to generalize well from training data. Since the system can deal with over 70% of the sentences, we feel that the subject is not likely to be overly frustrated by the system's inability to deal with the remaining sentences. This also reflects the apparent ability of subjects to adjust their speech so as to stay generally within the domain of the system.</Paragraph>
    <Paragraph position="10"> Disfluencies Table 4 compares the occurrence of spontaneous speech disfluencies in the two data sets. We define lexical false starts as the appearance of a partial word and linguistic false starts as the appearance of one or more extraneous whole words. Again, our analyses show a marked difference between the two data sets along all dimensions. A total of 73 filled pauses appear in 63 (or 8.1%) of the TI sentences, whereas only 25 appear in 21 (or 1.3%) of the MIT sentences. Similarly, a sentence with a lexical false start is about twice as likely in the TI data as in the MIT data, and a sentence with a linguistic false start is almost six times as likely.</Paragraph>
    <Paragraph position="11"> [Caption fragments. Figure 3: &amp;quot;... training sentences for the TI and MIT ATIS training sets, as well as the MIT VOYAGER training set.&amp;quot; Table 3: &amp;quot;... after the system had been trained on these sentences, whereas the development-test set represents unseen data.&amp;quot;]</Paragraph>
    <Paragraph position="12"> Table 4:
% of Sentences                 TI Data   MIT Data
with filled pauses             8.1       1.3
with lexical false starts      6.0       2.8
with linguistic false starts   5.9       1.0
</Paragraph>
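    <Paragraph> The relative likelihoods quoted in the disfluency comparison follow directly from the Table 4 percentages. The sketch below, with hypothetical names, divides the TI percentage by the MIT percentage for each category. The values are copied from Table 4.

```python
# Percentage of sentences containing each disfluency type: (TI, MIT), from Table 4.
TABLE_4 = {
    'filled pauses': (8.1, 1.3),
    'lexical false starts': (6.0, 2.8),
    'linguistic false starts': (5.9, 1.0),
}


def ti_to_mit_ratio(category):
    """How many times more likely a disfluency type is in the TI data than in the MIT data."""
    ti, mit = TABLE_4[category]
    return ti / mit
```

The ratios come out at about 2.1 for lexical false starts and 5.9 for linguistic false starts, matching the statements that such sentences are twice and almost six times as likely in the TI data.</Paragraph>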
  </Section>
</Paper>