<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2018"> <Title>THE COLLECTION AND PRELIMINARY ANALYSIS OF A SPONTANEOUS SPEECH DATABASE*</Title> <Section position="4" start_page="126" end_page="126" type="metho"> <SectionTitle> DATABASE CONSTRUCTION </SectionTitle> <Paragraph position="0"> We believe that data should be collected under conditions that closely reflect the actual capabilities of the system. As a result, we have chosen to have subjects use the system as if they were trying to obtain actual information. The data were recorded in a simulation mode in which the speech recognition component was excluded. This step was taken partly to avoid long processing delays that would disrupt the human-machine interaction. Instead, an experimenter in a separate room typed in the utterances spoken by the subject, after removing false starts and hesitations. Subsequent processing by the natural language and response generation components was done automatically by the computer.</Paragraph> <Paragraph position="1"> The experimenter's typing was hidden from the subject's view to avoid unnecessary distractions.</Paragraph> </Section> <Section position="5" start_page="126" end_page="127" type="metho"> <SectionTitle> DATA COLLECTION </SectionTitle> <Paragraph position="0"> The data were collected in an office environment where the ambient noise was approximately 65 dB SPL, measured on the C scale. A Sennheiser HMD-224 noise-cancelling microphone was used to record the speech.</Paragraph> <Paragraph position="1"> The subject sat in front of a computer console that displayed the geographical area of interest, as shown in Figure 1. The experimenter's typing, shown at the bottom of the display, was hidden from the subject to avoid unnecessary distractions. Two information sheets describing both the knowledge base of VOYAGER and its possible responses were available to the subject.
The subjects referred to these sheets from time to time in order to stay within VOYAGER's domain of knowledge.</Paragraph> <Paragraph position="2"> During a subject's dialogue, both the input speech and the resulting responses were recorded on audio tape. The voice input, minus false starts, hesitations, and filled pauses, was typed verbatim to VOYAGER by an experimenter, and saved automatically in a computer log. The system response was generated automatically from this text and was also recorded in the log. The system's response typically took a second or two after the text had been entered.</Paragraph> <Paragraph position="3"> Whenever a sentence contained words or constructs that were unknown to the natural language component, the system would explain to the subject why a response could not be generated. In the event that the queries were outside of the knowledge domain and the system responses could not dislodge the subject from that line of questioning, the experimenter could override the system and trigger a canned response explaining that the system was currently unable to handle that kind of request. Another canned response was available for the case when the subject produced several queries at once.</Paragraph> <Paragraph position="4"> Each session lasted approximately 30 minutes, and began with a five-minute introductory audio tape describing the task. This was followed by a 20-minute dialogue between the subject and VOYAGER. Following the dialogue, the subject was asked to read his or her sentences from the computer log. The resulting database therefore included both a read and a spontaneous version of the same sentence, modulo false starts, hesitations, and filled pauses in the spontaneous version.</Paragraph> <Paragraph position="5"> Fifty male and fifty female subjects were recruited from the general vicinity of MIT. They ranged in age from 18 to 59.
The only requirement was that they be native speakers of American English with no known speech defects. For their efforts, each subject was given a gift certificate at a popular ice-cream parlor. The entire recording was carried out over a nine-day period in late July. Several of the sessions were also recorded on video tape to document the data collection process.</Paragraph> </Section> <Section position="6" start_page="127" end_page="127" type="metho"> <SectionTitle> DIGITIZATION AND TRANSCRIPTION </SectionTitle> <Paragraph position="0"> The recordings made during the data collection were digitized at 16 kHz, after being band-limited at 8 kHz. Special care was taken to ensure that false starts, hesitations, mouth clicks, and breath noise were included as part of the digitized utterance. In addition, pre-determined conventions were established, and written instructions provided, for transcribing these non-speech and partial-word events both orthographically and phonetically. We started with the notations suggested by Rudnicky \[5\], and made modifications to suit our needs.</Paragraph> <Paragraph position="1"> To date, the entire database of 9,692 utterances has been digitized and orthographically transcribed, including markers for false starts, partial words, and non-word sounds. In addition, an aligned phonetic transcription has been obtained for approximately 20% of the data.</Paragraph> </Section> <Section position="7" start_page="127" end_page="127" type="metho"> <SectionTitle> PRELIMINARY ANALYSIS </SectionTitle> <Paragraph position="0"> We have divided the database into three parts according to speakers. Data from 70 arbitrarily selected speakers were designated as the training set. Of the remaining speakers, two-thirds were designated as the development set, and the rest as the test set. In each set, there were equal numbers of male and female speakers.
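The speaker partition just described (70 arbitrarily chosen training speakers; of the remaining 30, two-thirds development and the rest test, with equal numbers of male and female speakers in each set) amounts to a simple gender-stratified split. A minimal sketch, not the authors' code; the `split_speakers` helper and the speaker identifiers are hypothetical:

```python
import random

def split_speakers(male_ids, female_ids, n_train=70, dev_frac=2/3, seed=0):
    """Gender-stratified train/dev/test split: 70 training speakers,
    then two-thirds of the remainder as development and the rest as test,
    split equally between male and female speakers."""
    rng = random.Random(seed)
    sets = {"train": [], "dev": [], "test": []}
    for ids in (list(male_ids), list(female_ids)):
        rng.shuffle(ids)                              # "arbitrarily selected"
        n_tr = n_train // 2                           # 35 speakers per gender
        n_dev = round((len(ids) - n_tr) * dev_frac)   # 10 per gender for dev
        sets["train"] += ids[:n_tr]
        sets["dev"] += ids[n_tr:n_tr + n_dev]
        sets["test"] += ids[n_tr + n_dev:]            # remaining 5 per gender
    return sets

sets = split_speakers([f"m{i:02d}" for i in range(50)],
                      [f"f{i:02d}" for i in range(50)])
```

Stratifying by gender before splitting is what guarantees the equal male/female counts in every set, rather than relying on chance.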
In this section, we will report on the results of some preliminary analysis on parts of this database, carried out over the past few weeks.</Paragraph> </Section> <Section position="8" start_page="127" end_page="128" type="metho"> <SectionTitle> GENERAL STATISTICS </SectionTitle> <Paragraph position="0"> From the computer log, we were able to automatically generate some preliminary statistics of the database. Table 1 summarizes some of the relevant statistics for the sum of the training and development sets. Note that the number of sentences refers to the spontaneous ones; the total number collected is double this amount.</Paragraph> <Paragraph position="1"> As the table reveals, approximately two-thirds of the sentences could be handled by the current version of VOYAGER. The remaining third of the data is evenly divided between sentences with out-of-vocabulary words and sentences for which no parses were generated. These sentences can be used to extend VOYAGER's capabilities.</Paragraph> <Paragraph position="2"> Only a very small fraction, about 1%, of the sentences were parsed but not acted upon. This is a direct result of our conscious decision to constrain the coverage of the natural language component according to the capabilities of the back-end.</Paragraph> <Paragraph position="3"> The version of VOYAGER used for data collection had a vocabulary of about 300 words, which were determined primarily from a small set of sentences that we made up. It is interesting to note that only about 200 of these words were actually used by the subjects. While the number of unknown words appears to be large, they actually account for less than 3% of the total number of words when frequency of usage is considered.</Paragraph> <Paragraph position="4"> The statistics of this database indicated that an average of slightly less than 50 sentences per subject were collected in each 20-minute dialogue.
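The unknown-word figure above is token-weighted: the out-of-vocabulary rate looks large when counting word types, but falls under 3% once each word is weighted by how often it was actually spoken. A sketch of the two rates; the sentences, vocabulary, and `oov_rates` helper below are hypothetical illustrations, not the actual VOYAGER data:

```python
from collections import Counter

def oov_rates(transcripts, vocabulary):
    """Type vs. token out-of-vocabulary rates over transcribed sentences."""
    # Count every running word (token) across all transcripts.
    counts = Counter(w for sent in transcripts for w in sent.lower().split())
    oov = {w: n for w, n in counts.items() if w not in vocabulary}
    type_rate = len(oov) / len(counts)                     # distinct unknown words
    token_rate = sum(oov.values()) / sum(counts.values())  # frequency-weighted
    return type_rate, token_rate

# Hypothetical toy data, for illustration only.
vocab = {"where", "is", "the", "nearest", "bank",
         "how", "do", "i", "get", "to", "mit"}
sents = ["where is the nearest bank",
         "how do i get to mit",
         "where is the nearest sushi restaurant"]
type_rate, token_rate = oov_rates(sents, vocab)
```

On this toy data the token rate (2 of 17 running words) is already lower than the type rate (2 of 13 distinct words); in realistic data the gap widens further, because the most frequent words are almost always in-vocabulary.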
Thus we believe the database can easily be expanded as the capabilities of the system grow.</Paragraph> </Section> <Section position="9" start_page="128" end_page="130" type="metho"> <SectionTitle> ACOUSTIC ANALYSIS </SectionTitle> <Paragraph position="0"> Since time-aligned phonetic transcriptions were already available for part of the database, we performed some comparative acoustic analyses of the spontaneous and read utterances. These preliminary analyses were carried out using slightly over 1,750 sentences from 9 male and 9 female training speakers. While these data represent less than 20% of the recorded data, there were more than 60,000 phonetic events. As a result, the quantitative differences were found to be statistically significant. Rather than exhaustively reporting our findings, we will make a few observations based on some interesting examples.</Paragraph> <Paragraph position="1"> Figure 2 compares the overall duration of the read and spontaneous utterances. In this and subsequent figures, the thin line denotes read speech, whereas the thick line denotes spontaneous speech. The horizontal bars show the means and standard deviations. These values, together with the sample size, are also displayed to the right. The figure suggests that spontaneous utterances are longer than their read counterparts by more than one-third. However, there is much more variability in the duration of spontaneous speech, as evidenced by its considerably larger standard deviation.</Paragraph> <Paragraph position="2"> There were nearly 1,000 pauses found in the spontaneous sentences in our dataset, or more than one per sentence on the average.2 In contrast, there were only about 200 pauses found in the read sentences. As Figure 3 reveals, the pauses in spontaneous speech are about 2.5 times longer on the average than those in read speech.
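The figures reduce each duration distribution to a mean and standard deviation per speaking style. A minimal sketch of that summary; the pause durations below are made up to mirror the roughly 2.5-to-1 ratio reported above, not measured data:

```python
import statistics

def duration_summary(durations_ms):
    """Mean and standard deviation, as shown by the horizontal bars in the figures."""
    return statistics.mean(durations_ms), statistics.stdev(durations_ms)

# Hypothetical pause durations in milliseconds, for illustration only.
read_pauses = [180, 220, 200, 240, 160]
spont_pauses = [450, 900, 300, 700, 250]
read_mean, read_sd = duration_summary(read_pauses)
spont_mean, spont_sd = duration_summary(spont_pauses)
```

On these toy values the spontaneous pauses come out both longer on average (520 ms vs. 200 ms) and more variable, matching the larger standard deviation the text attributes to spontaneous speech.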
Their durations are also much more variable.</Paragraph> <Paragraph position="3"> (Footnote 2: We make a distinction between pauses, epenthetic silences, and stop closures. Only those silence regions that do not have phonetic significance are labeled as pauses.)</Paragraph> <Paragraph position="4"> (Truncated figure captions: utterances for 9 male and 9 female speakers.)</Paragraph> <Paragraph position="6"> There are nearly 400 non-speech vocalizations found in this database, including mouth clicks, breath noise (both inhaling and exhaling), and filled pauses such as &quot;um,&quot; &quot;uh,&quot; or &quot;ah.&quot; Their distributions are shown in Table 2. Non-speech vocalizations occur about 2.7 times more often in spontaneous speech than in read speech. Almost all of the clicks appear at the beginning of sentences for read speech, whereas 25% of them occur sentence internally in spontaneous speech. Similarly, more than 20% of the breath noise occurs sentence internally, with five times as many in spontaneous speech as in read speech. All the filled pauses occur in spontaneous speech, two-thirds of them sentence internally.</Paragraph> <Paragraph position="7"> When we measured the durations of individual phonemes, we found very little difference between the two speech styles. Figure 4, for example, shows that the average vowel durations for read and spontaneous speech are 89 ms and 95 ms, respectively. Occasionally, we observed unusually long vowels in the spontaneous speech. They almost always correspond to words like &quot;is&quot; or &quot;to,&quot; when the subject tries to decide what to say next.
An example is shown in Figure 5.</Paragraph> </Section> <Section position="10" start_page="130" end_page="132" type="metho"> <SectionTitle> LINGUISTIC ANALYSIS </SectionTitle> <Paragraph position="0"> When the database was transcribed orthographically, false starts and non-words such as &quot;ah,&quot; &quot;um,&quot; or laughter were explicitly marked in the orthography. Therefore, it is possible to perform a statistical analysis of how often such events occurred. We distinguished between non-words internal to the sentence and non-words at the beginning or end of the sentence. In the training set containing about 3,300 spontaneous sentences, more than 10% of the sentences contained at least one of these effects. An additional 25% contained mouth clicks or breath noise, which may be a less serious effect. About half of the non-words appear sentence internally.</Paragraph> <Paragraph position="1"> False starts occurred in almost 4% of the spontaneous sentences. Table 3 categorizes the words following false starts in terms of whether a given word was the same as the intended word, a different word in the same category, a new linguistic category, or a back up to repeat words already uttered. An example is given for each case. Over 40% of the time, the talker repeated words after the false start. In order to recognize such sentences correctly, the system would have to detect false starts and back up to an appropriate earlier syntactic boundary.</Paragraph> <Paragraph position="2"> We have examined the linguistic content of the training sentences and have begun to use them to expand VOYAGER's coverage. We have found a number of new linguistic patterns that are entirely appropriate for the domain, as well as a few recurring concepts currently outside of the domain that would be reasonable to add. 
An example of the latter is a comparison between two objects, such as &quot;Which is closer to here, MIT or Harvard University?&quot; In addition, we were surprised at the number of ways people said certain things.</Paragraph> <Paragraph position="3"> For instance, for sentences like &quot;What is the phone number of MIT?&quot; the prepositions &quot;of,&quot; &quot;at,&quot; &quot;for,&quot; and &quot;to&quot; were all used. In a similar vein, the &quot;from location&quot; in a sentence requesting distance between two objects occurred in five distinct syntactic locations within the sentence. Users also sometimes spoke ungrammatically, violating both syntactic and semantic constraints. In other cases they asked abstract questions or questions involving judgment calls that would be inappropriate for VOYAGER to handle. Some of these sentences were probably uttered due to curiosity about how the system would respond. Some examples are given in Table 4: &quot;Your directions were not good.&quot; &quot;Do you like their ice cream?&quot; &quot;Does the Baybank serve ice cream?&quot; &quot;Where is my dog?&quot; &quot;Does Hampshire and Broadway intersect?&quot; &quot;How far am I from Central Square to Cajun Yankee?&quot;</Paragraph> </Section> </Paper>