<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1029">
  <Title>COLLECTION AND ANALYSIS OF DATA FROM REAL USERS: IMPLICATIONS FOR SPEECH RECOGNITION/UNDERSTANDING SYSTEMS</Title>
  <Section position="4" start_page="164" end_page="164" type="metho">
    <SectionTitle>
REAL USER DATABASE COLLECTION
PROCEDURES
</SectionTitle>
    <Paragraph position="0"> Three real user telephone speech databases have been collected by pseudo-automating telephone operator functions and digitally recording the speech produced by users as they interacted with the services. In each case, experimental equipment was attached to a traditional telephone operator workstation and was capable of : 1.</Paragraph>
    <Paragraph position="1"> automatically detecting the presence of a call, 2. playing one of a set of pre-recorded prompts to the user, 3. recording user speech, 4.</Paragraph>
    <Paragraph position="2"> automatically detecting a user hang-up and 5. storing data about call conditions associated with a given speech file (e.g., time of day, prompt condition, etc.). The three operator services under study were 1. Intercept Services (IS) 2. Directory Assistance Call Completion (DACC) and 3. Directory Assistance (DA). In addition to collecting data for several automated dialogues, recordings were made of traditional 'operator-handled' calls for the services under investigation.</Paragraph>
    <Paragraph position="3"> Each of these databases was collected in a real serviee-providing environment. That is, users were unaware that they were participating in an experiment. The identity of the speakers was not known, so a precise description of dialectal distribution is difficult. Calls reached the trial position through random assignment of calls to operator positions, a task performed by a network component known as an Automatic Call Distributor (ACD). Therefore, for each of the databases, it is assumed that the number of utterances corresponds to the number of speakers. We have so far collected nearly 29 hours of real user speech: 34,000 utterances (presumably from that many different speakers).</Paragraph>
  </Section>
  <Section position="5" start_page="164" end_page="166" type="metho">
    <SectionTitle>
REAL USER COMPLIANCE
</SectionTitle>
    <Paragraph position="0"> For the IS trial, users were asked what telephone number they had just dialed. For the DACC trial, users were asked to accept or reject the call completion service. For the DA trial, users were asked for the city name corresponding to their directory request.</Paragraph>
    <Paragraph position="1"> The 'target' responses, therefore, were digit strings, yes/no responses and isolated city names, respectively. Users were presented with different automated prompts varying along a number of dimensions. Their responses were analyzed to determine the effects of dialogue condition on real user compliance (frequency with which users provided the target response).</Paragraph>
    <Paragraph position="2"> Intercept Services: 'Simple' Digit Recognition One problem with digit recognition is that users may say more than just digits. The target response for the IS trial was a digit string. The automated prompts to the users varied with respect to the 1. presence/absence of an introductory greeting which informed users that they were interacting with an automated system 2. speed with which the prompts were spoken (fast, slow), and 3. the explicitness of the prompts (wordy, concise). In addition, data were recorded under an operator-handled condition. During operator-handled intercept calls, operators ask users, &amp;quot;What number did you dial?&amp;quot;.</Paragraph>
    <Paragraph position="3"> A total of 3794 utterances were recorded: 2223 were in the automated-prompt conditions and 1571 were in the operator-handled condition. 'Non-target' words were defined as anything other than the digits '0' through '9' and the word 'oh'. Results showed that only 13.6% of the utterances in the automated conditions were classified as non-target, while 40.6% of the utterances in the operator-handled condition fell into the non-target category.</Paragraph>
    <Paragraph position="4"> Non-target utterances were further classified as '100-type' utterances (that is, utterances in which the user said the digit string as &amp;quot;992-4-one-hundred, etc.) and 'extra verbiage' utterances (that is, utterances in which the user said more than just the digit string such as, &amp;quot;I think the number is ...&amp;quot;, or &amp;quot;oh, urn, 992 ...&amp;quot;). For both automated and operator-handled calls, users produce more extra verbiage utterances than 100-type utterances. Both types of non-target responses occurred more than twice as often in the operator-handled condition compared to the automated conditions.</Paragraph>
    <Paragraph position="5"> The speed and wordiness of the automated prompts did not affect user compliance. However, contrary to our expectations, the data suggest that the proportion of non-target responses is substantially reduced when the user is not given an introductory greeting which explains the automated service (19.2% vs. 4.9% non-target responses for the greeting vs. no-greeting conditions, respectively). Instead, giving users an immediate directive to say the dialed number results in the highest proportion of responses which are restricted to the desired vocabulary. It appears that even untrained users are immediately attuned to the fact that they are interacting with an automated service and modify their instinctual response in ways beneficial to speech recognition automation. At least for this application, brevity is best. For more information on this trial, see \[6\].</Paragraph>
    <Paragraph position="6"> Directory Assistance Call Completion: Yes/No Recognition The target response for the DACC trial was an isolated 'yes' or 'no' response. Successful recognition of these words would have many applications, but the problem for a two-word recognizer is  that users sometimes say more than the desired two words. Data were collected under fl~ree automated prompt conditions and one operator-lmndled condition. The operator asked, ?Would you like us to automatically dial that call for an additional charge of__ cents?&amp;quot; The three automated prompts were as follows: 1. a recorded version of the operator prompt 2. a prompt which explicitly asked for a 'yes' or 'no' response and 3. a oromot that asked for a 'yes' or hang up response.</Paragraph>
    <Paragraph position="7"> A total of 3394 responses were recorded; 1781 were operator-handled calls, while 1613 were calls handled by automated prompts. Figure 1 shows the percentage of 'yes' responses among the affirmative responses as a function of dialogue condition.</Paragraph>
    <Paragraph position="8"> Results again indicate that variations in the prompt can have a sizable effect on user compliance and that there are considerable differences between user behavior with a human operator vs. an automated system.</Paragraph>
    <Paragraph position="9">  response ('yes') as a function of dialogue condition.</Paragraph>
    <Paragraph position="10"> Non-target affirmative responses were categorized as 'yes, please', 'sure' and 'other'. A response was categorized as 'other' if it accounted for less than 5% of the data for any prompt condition.</Paragraph>
    <Paragraph position="11"> The frequency of occurrence of these non-target responses as a function of dialogue condition is shown in Table 1.</Paragraph>
    <Paragraph position="12">  of prompt condition.</Paragraph>
    <Paragraph position="13"> The operator-handled condition exhibited the greatest range of variability, with 53% of the affirmative responses falling into the 'other' category. For more information on the DACC trial, see \[7\], &amp;quot;Directory assistance, what city please?&amp;quot; The target response for the Directory Assistance trial was an isolated city name. Data were collected under four automated prompt conditions and one operator-handled condition. Directory Assistance operators typically ask users &amp;quot;What city, please?&amp;quot;. One automated prompt used the same wording as the operator; the other three were worded to encourage users to say an isolated city name. Recording was initiated automatically at the offset of a beep tone that prompted users to respond. Recording was terminated by a human observer who determined that the user had finished responding to the automated request for information.</Paragraph>
    <Paragraph position="14"> A total of 26,946 utterances were collected under automated conditions. Operator-handled calls were collected during a separate trial \[8\] and only 100 of these utterances were available for analysis. Figure 2 shows the percentage of target responses as a function of dialogue condition.</Paragraph>
    <Paragraph position="15">  names as a function of dialogue condition.</Paragraph>
    <Paragraph position="16"> As in the other two trials, user behavior was quite different for operator-handled vs. automated calls. On average, users were almost four times more likely to say an isolated city name in response to an automated prompt than to an operator query.</Paragraph>
    <Paragraph position="17"> Moreover, the wording of the automated prompt had a large effect on user compliance. Superficially minor variations in prompt wording increased user compliance by a factor of four (15.0% vs.</Paragraph>
    <Paragraph position="18"> 64.0% compfiance for prompt 1 vs. 4, respectively).</Paragraph>
    <Paragraph position="19"> Very few users either did not reply or replied without a city name in response to an operator prompt. For the automated conditions, between 14% and 23% of the users simply did not respond. Between 3% and 23% responded without including a city name. To interpret these results, we point out that in contrast to the users oflS and DACC services, Directory Assistance users tend to be repeat callers with well-rehearsed scripts in mind. When the familiar interaction is unexpectedly disrupted, some of these users appear to be unsure of how to respond.</Paragraph>
    <Paragraph position="20"> Of particular interest was the effect of dialogue condition on the frequency of occurrence of city names embedded in longer utterances. These results appear in Figure 3.</Paragraph>
    <Paragraph position="21">  names as a function of dialogue condition.</Paragraph>
    <Paragraph position="22"> It is clear that embedded responses are most typical during useroperator interactions. To allow for this response mode, a recognizer would have to be able to 'find' the city name in such utterances. This could be accomplished with a word spotting system or with a continuous speech recognition/understanding system. To consider the difficulty of the former, embedded city name responses were further categorized as simple vs. complex; assuming that the former would be relatively easy to 'spot'. A 'simple' embedded city name was operationally defined as a city name surrounded by approximately one word (for example, &amp;quot;Boston, please&amp;quot;, &amp;quot;urn, Boston&amp;quot;, &amp;quot;Boston, thank you&amp;quot;). The proportion of embedded utterances classified as 'simple' as a function of prompt is shown in Figure 4.</Paragraph>
    <Paragraph position="23">  categorized as 'simple' as a function of dialogue condition.</Paragraph>
    <Paragraph position="24"> It is interesting to note that prompts 3 and 4, which elicited the highest proportion of isolated city names, also elicited a higher proportion of 'simple' embedded city names. It seems that users interpreted prompts 3 and 4 as the most constraining, even when they did not fully comply.</Paragraph>
  </Section>
  <Section position="6" start_page="166" end_page="166" type="metho">
    <SectionTitle>
DISCUSSION
</SectionTitle>
    <Paragraph position="0"> The results of this series of experiments on real user compliance suggest that this aspect of user behavior is significantly different when interacting with a live operator than when interacting with an automated system. The lesson is that feasibility projections made on the basis of observing operator-handled transactions will significantly underestimate automation potential. In addition, the precise wording of the prompts used in a speech recognition/understanding application significantly affects user compliance and therefore the likelihood of recognition success.</Paragraph>
    <Paragraph position="1"> Users seem to know immediately that they are interacting with an automated service and explicitly infotraing them of this fact does not improve (in fact, decreases) user compliance. Prompts should be brief and the tasks should not be too unnatural. Although not discussed above; informal analysis of the data suggests that very few users attempted to interrupt the prompts with their verbal responses. While this would suggest that 'barge-in' technology is not a high priority, it should be noted that the users under investigation were all first-time users of the automated service. It seems likely that their desire to interrupt the prompt will increase with experience, as has been found for Touch-Tone applications.</Paragraph>
    <Paragraph position="2"> Although each of the applications under investigation was different with respect to the degree of repeat usage, the motivation of the user, the cost of an error, etc., the trials were similar in that there was no opportunity for learning on the part of the user. This is an important factor in the success of many speech recognition/understanding systems and is an area of future research for the group.</Paragraph>
  </Section>
  <Section position="7" start_page="166" end_page="167" type="metho">
    <SectionTitle>
LABORATORY DATABASE COLLECTION
PROCEDURES
</SectionTitle>
    <Paragraph position="0"> While real user speech databases provide value to the researcher/developer, they present limitations as well. Most notably, the a priori probabilities for the vocabulary items under investigation are typically quite skewed. It is rare, in a real application, that any one user response is as likely as any other.</Paragraph>
    <Paragraph position="1"> The DA data collection gathered almost 27,000 utterances, yet there are less than 10 instances of particular cities and, correspondingly, less than 10 exemplars of certain phones. If these data are to be used for training speech recognition/understanding systems, they must be supplemented with laboratory data* To this end, as well as for the purposes of comparing real user to laboratory data, application-specific and standardized laboratory telephone speech data collection efforts were undertaken.</Paragraph>
    <Paragraph position="2"> Application-specific laboratory speech database collection A laboratory city name database was collected by having volunteers call a New York-based laboratory from their New England-based home or office telephones. Talkers were originally from the New England area and so were assumed to be familiar with the pronunciation of the target city names.</Paragraph>
    <Paragraph position="3"> When a speaker called, the system asked him/her to speak the city names, waiting for a prompt before saying the next city name (the order of the city names was randomized so as to minimize list effects). 10,900 utterances from over 400 speakers have been collected in this way.</Paragraph>
    <Paragraph position="4"> This kind of database provides some of the characteristics of a real user database (a sample of telephone network connections and  telephone sets). The speech, however, is read rather than spontaneously-produced and the speakers are aware that they are participating in a data collection exercise. This database has been compared to the DA corpus just described. Results are reported below.</Paragraph>
    <Paragraph position="5"> Standardized telephone speech data collection The TIM1T database reflects the general nature of speech and is not designed for any particular application \[10\]. It is well known that the telephone network creates both linear and nonlinear distortions of the speech signal during transmission. In the development of a telephone speech recognition/understanding system, it is desirable to have a database with the advantages of the TIMIT database, coupled with the effects introduced by the telephone network. Towards this end, a data collection system has been developed to create a telephone network version of the TIM1T database (as well as other standardized wideband speech databases). The system is capable of 1. systematically controlling the telephone network and 2. retaining the original time-aligned phonetic transcriptions.</Paragraph>
    <Paragraph position="6"> Figure 5 shows the hardware configuration used in the collection of the NTIMIT (Network TIM1T) database. The TIMIT utterance is transmitted in an acoustically isolated room through an artificial mouth. A telephone handset is held by a telephone test frame mounting device. Taken together, this equipment is designed to approximate the acoustic coupling between the human mouth and the telephone handset. To allow transmission of utterances to various locations, &amp;quot;loopback&amp;quot; devices in remote central offices were used.</Paragraph>
    <Paragraph position="7">  The choice of where to send the ~ utterances was carefully designed to ensure geographic coverage as well as to keep the distribution of speaker gender and dialect for each geographic area roughly equivalent to the distribution in the entire TIMIT database. To obtain information about transmission characteristics such as frequency response, loss, etc., two calibration signals (sweep frequency and 1000 Hz pure tone signals) were sent along with the TIMIT utterances. NTIM1T utterances were automatically aligned with the original TIM1T transcriptions.</Paragraph>
    <Paragraph position="8"> The NTIMIT database is currently being used to train a telephone network speech recognition system. Performance will be compared to that of a system trained on a band-limited version of the TIMIT database to determine the effects of a 'real' vs. simulated telephone network on recognition results. In addition, we are evaluating the performance of a recognizer trained on a combination of material from real user speech databases and NTIM1T.</Paragraph>
    <Paragraph position="9"> For more information on NTIMIT, see \[9\]. The NTIMIT database is being prepared for public distribution through NIST.</Paragraph>
  </Section>
  <Section position="8" start_page="167" end_page="168" type="metho">
    <SectionTitle>
LABORATORY VS. REAL USER SPEECH
FOR TRAINING AND TESTING SPEECH
RECOGNITION SYSTEMS
</SectionTitle>
    <Paragraph position="0"> The laboratory and real user city name databases described above allowed us to evaluate the performance of a speaker independent, isolated word, telephone network speech recognition system when tested on laboratory vs. real user data. Two training scenarios were included: 1. trained on laboratory data and 2. trained on real user data. To equate the number of training and testing samples for each of the 15 city names under investigation, only a subset of each database was used (and only isolated city names were used from the real user database).</Paragraph>
    <Paragraph position="1"> Each database was divided into a training and testing set, consisting of 90% and 10% of the databases, respectively. A phonetically-based speaker independent isolated word telephone network speech recognition system was used for this experiment.</Paragraph>
    <Paragraph position="2"> The recognizer, developed as part of an MIT-NYNEX joint development project, was built upon a system developed at MIT (for more details on the MIT recognizer, see \[11\]). The system was trained on each training set and then tested on each testing set. This resulted in four training/testing conditions.</Paragraph>
    <Paragraph position="3"> Results revealed that performance changed little as a function of laboratory vs. user training databases when tested on laboratory speech (95.9% vs. 91.1% for laboratory and user training databases, respectively). In contrast, performance changed dramatically as a function of training database when tested on real user speech (52.0% vs. 87.7% for laboratory and user training databases, respectively). Two points of interest here are: 1. The recognizer that was trained and tested on laboratory speech performed almost 9% better than the recognizer trained and tested on real user speech (95.9% vs. 87.7% respectively). Apparently, recognizing real user speech is an inherently more difficult problem. 2. Performance of the laboratory-trained system on real user speech was 43.9% poorer than the same system tested on laboratory speech. A number of experiments were conducted to better understand these results.</Paragraph>
    <Paragraph position="4"> It is assumed that the performance of the real user-trained system on real user speech (87.7%) represents optimal performance. Therefore, the performance discrepancy to be explored is the difference between 52.0% (the lab-trained system on real user speech) and 87.7%. Each one of the recognizer's components involved in the training was considered for analysis. This included 1. phonetic acoustic models, 2. silence acoustic models 3. lexical transition weights and 4. lexieal arc deletion weights. A series of experiments were done in which each of these components from the real user-trained recognizer was systematically substituted for its counterpart in the laboratory-trained recognizer. The resulting hybrid recognizer was evaluated at each stage.</Paragraph>
    <Paragraph position="5"> Results revealed that an overwhelming majority of the performance difference could be accounted for by the acoustic models for silence. A recognizer trained on laboratory speech which used silence acoustic models trained on real user speech achieved 82% accuracy when tested on real user speech. An acoustic analysis of the two databases revealed that they were quite similar with respect to the frequency characteristics of the non-speech portions of the signal and the signal-to-noise ratios. Rather, it was  the mean duration and variability in duration of the non-speech signal prior to the onset of the speech that accounts for this effect. It is important to note that the silence surrounding the laboratorycollected city names was artificially controlled by both the data collection procedures (talkers knew they had only a limited amount of time to speak before hearing the prompt to read the next city name) and subsequent hand editing. The real user data were not since a field recognizer will not see controlled or hand edited data. While these results may appear to be artifactual, they point out the limitations imposed on the researcher/developer in only being exposed to laboratory data. Further experimentation revealed that using real user-trained phonetic acoustic models accounts for most of the remaining 6%, with decreasing importance attributable to real user-trained lexical transition weights and real user-trained lexical arc deletion weights.</Paragraph>
  </Section>
class="xml-element"></Paper>