<?xml version="1.0" standalone="yes"?> <Paper uid="N01-1017"> <Title>Generating Training Data for Medical Dictations</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Dictation Applications of ASR </SectionTitle> <Paragraph position="0"> The application for our work is medical dictation over the telephone. Medical dictation differs from other telephony based ASR applications, e.g. airline reservation systems, because the talkers are repeat users and utterances are long. Dictations usually consist of 1-30 minutes of speech. The talkers call in 3-5 days per week and produce between 1 and 12 dictations each day they call. Hence a medical dictation operation has access to hours of speech for each talker.</Paragraph> <Paragraph position="1"> Spontaneous telephone speech presents additional challenges that are caused partly by a poor acoustic signal and partly by the disfluent nature of spontaneous speech. A number of researchers have noted the effects of disfluencies on speech recognition and have suggested various approaches to dealing with them at language modeling and post-processing stages. (Shriberg 1994, Shriberg 1996, Stolcke and Shriberg 1996, Stolcke et al.</Paragraph> <Paragraph position="2"> 1998, Shriberg and Stolcke 1996, Siu and Ostendorf 1996, Heeman et al. 1996) Medical overthe-telephone dictations can be classified as spontaneous or quasi-spontaneous discourse (Pakhomov 1999, Pakhomov and Savova 1999).</Paragraph> <Paragraph position="3"> Most physicians do not read a script prepared in advance, instead, they engage in spontaneous monologues that display the full spectrum of disfluencies found in conversational dialogs in addition to other &quot;disfluencies&quot; characteristic of dictated speech. An example of the latter is when a physician gives instructions to the transcriptionist to modify something in the preceding discourse, sometimes as far as several paragraphs back.</Paragraph> <Paragraph position="4"> Most ASR dictation applications focus on desktop users; for example, Dragon, IBM, Philips and Lernout & Hauspie all sell desktop dictation recognizers that work on high quality microphone speech. Typically, the desktop system builds an adapted acoustic model if the talker &quot;enrolls&quot;, i.e. reads a prepared script that serves as a literal transcription. Forced alignment of the script and the speech provides the input to acoustic model adaptation.</Paragraph> <Paragraph position="5"> Enrollment makes it relatively easy to obtain literal transcriptions for adaptation. However, enrollment is not feasible for dictation over the telephone primarily because most physicians will refuse to take the time to enroll. The alternative is to hire humans who will type literal transcriptions of dictation until enough have been accumulated to build an adapted model, an impractical solution for a large scale operation that processes speech from thousands of talkers. ATRS is appealing because it can generate an approximation of literal transcription that can replace enrollment scripts and the need for manually generated literal transcriptions.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Three Classes of Training Data </SectionTitle> <Paragraph position="0"> In this paper, training texts for language and acoustic models fall into three categories: Non-Literal: Non-literal transcripts present the meaning of what was spoken in a written form appropriate for the domain. 
In a commercial medical transcription operation, the non-literal transcript presents the dictation in a format appropriate for a medical record. This typically involves (i.) ignoring filled pauses, pleasantries, and repeats; (ii.) acting on directions for repairs (&quot;delete the second paragraph and put this in instead...&quot;); (iii.) adding non-dictated punctuation; (iv.) correcting grammatical errors; and (v.) reformatting certain phrases, such as &quot;Lungs are Clear&quot;, to a standard form such as &quot;Lungs - Clear&quot;. Literal: Literal transcriptions are exact transcriptions of what was spoken. They include any elements not found in the non-literal transcript, such as filled pauses (um's and ah's), pleasantries and body noises (&quot;thank you very much, just a moment, cough&quot;), repeats, fragments, repairs and directions for repairs, and asides (&quot;make that bold&quot;). Literal transcriptions require significant human effort and are therefore expensive to produce. Even though they are carefully prepared, some errors will be present in the result.</Paragraph> <Paragraph position="1"> In their study of how humans deal with transcribing spoken discourse, Lindsay and O'Connell (1995) found that literal transcripts were &quot;far from verbatim&quot; (p. 111). The transcribers in their study tended to have the most difficulty transcribing hesitation phenomena, followed by sentence fragments, adverbs and conjunctions, and, finally, nouns, verbs, adjectives and prepositions.</Paragraph> <Paragraph position="2"> Our informal observations of transcripts produced by highly trained medical transcriptionists suggest an error margin of approximately 5% and a gradation of errors similar to the one found by Lindsay and O'Connell.</Paragraph> <Paragraph position="3"> Semi-Literal: Semi-literal transcripts are derived using non-literal transcripts, the recognizer output, a set of grammars, a dictionary, and an interpreter to integrate the recognized material into the non-literal transcription. Semi-literal transcripts more closely resemble the literal transcripts, as many of the elements missing from the non-literal transcripts are restored.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Model Adaptation </SectionTitle> <Paragraph position="0"> It is well known that ASR systems perform best when acoustic models are adapted to a particular talker's speech. This is why commercial desktop systems use enrollment. Although less widely applied, language model adaptation based on linear interpolation is an effective technique for tailoring stochastic grammars to particular domains of discourse and to particular speakers (Savova et al. 2000, Weng et al. 1997).</Paragraph> <Paragraph position="2"> The training texts used in acoustic modeling come from recognizer-generated texts, literal transcriptions or non-literal transcriptions. Within the family of transformation and combined approaches to acoustic modeling (Digalakis and Neumeyer 1996, Strom 1996, Wightman and Harder 1999, Hazen and Glass 1997), three basic adaptation methods can be identified: unsupervised, supervised, and semi-supervised. Each adaptation method depends on a different type of training text. What follows briefly introduces the three methods.</Paragraph> <Paragraph position="3"> Unsupervised adaptation relies on the recognizer's output as the text guiding the adaptation.
The efficacy of unsupervised adaptation therefore depends entirely on recognition accuracy. As Wightman and Harder (1999) pointed out, unsupervised adaptation works well in laboratory conditions, where the speech signal has large bandwidth and is relatively &quot;clean&quot; of background noise, throat clearings, and other disturbances. In laboratory conditions, the errors introduced by unsupervised adaptation can be averaged out by using more data (Zavaliagkos and Colthurst, 1997); in a telephony operation with degraded input, however, that is not feasible.</Paragraph> <Paragraph position="4"> Supervised adaptation depends on the availability of literal transcriptions and is widely used for enrollment in most desktop ASR systems. A speaker's speech sample is transcribed verbatim, and the speech signal is then aligned with pronunciations frame by frame for each individual word. A speaker-independent model is augmented to include the observations resulting from the alignment.</Paragraph> <Paragraph position="5"> Semi-supervised adaptation rests on the idea that the speech signal can be partially aligned by using the recognition output and the non-literal transcription. A significant problem with semi-supervised adaptation is that only the speech that the recognizer already recognizes successfully ends up being used for adaptation. This reinforces what is already well represented in the model.</Paragraph> <Paragraph position="6"> Wightman and Harder (1999) report that semi-supervised adaptation has the positive side effect of excluding those segments of speech that were mis-recognized for reasons other than a poor acoustic model. They note that background noise and speech disfluency are detrimental to unsupervised adaptation.</Paragraph> <Paragraph position="7"> In addition to the two problems with semi-supervised adaptation pointed out by Wightman and Harder, we find one more potential problem.</Paragraph> <Paragraph position="8"> As a result of matching the word labels produced by the recognizer against the non-literal transcription, some words may be skipped, which may introduce unnatural phone transitions at word boundaries.</Paragraph> <Paragraph position="9"> The supervised/unsupervised taxonomy of acoustic adaptation methods does not carry over directly to language model adaptation. However, adapted language models can be loosely described as supervised or unsupervised, based on the types of training texts--literal or non-literal--that were used in building the model.</Paragraph> <Paragraph position="10"> In the following sections we describe a system for generating data that is well suited for acoustic and language model adaptation and present results of an experimental evaluation of this system.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Generating semi-literal data </SectionTitle> <Paragraph position="0"> ATRS is based on reconstruction of non-literal transcriptions to train utterance-specific language models. First, a non-literal transcription is used to train an augmented probabilistic finite state model (APFSM), which is, in turn, used by the recognizer to re-recognize the exact same utterance that the non-literal transcription was generated from. The APFSM is constructed by linear interpolation of a finite state model, in which all transitional probabilities are equal to 1, with two other stochastic models.</Paragraph>
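To make the interpolation step concrete, the following is a minimal sketch, not the actual ATRS implementation: it combines a toy bigram model built from a non-literal transcript with a background model of out-of-transcription expressions and a filled-pause-populated variant of the same transcript. The bigram representation, the example sentences, and the weight values are illustrative assumptions; in the real system the weights are established empirically against held-out data, as described below.

```python
# Minimal sketch of the linear interpolation behind an APFSM-style model.
# Not the actual ATRS implementation: the bigram representation, the toy
# normalization, and the example weights are illustrative assumptions only.
from collections import defaultdict

def train_bigram(sentences):
    """Relative-frequency bigram model: P(w2 | w1) = c(w1, w2) / c(w1)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[w1][w2] += 1
    return {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
            for w1, nxt in counts.items()}

def interpolate(models, weights):
    """P(w2 | w1) = sum_i lambda_i * P_i(w2 | w1), with the lambdas summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    def prob(w1, w2):
        return sum(lam * model.get(w1, {}).get(w2, 0.0)
                   for lam, model in zip(weights, models))
    return prob

# The "finite state" component: the non-literal transcript itself, so every
# transition along its single path starts out with probability 1.
forced     = train_bigram(["patient denies chest pain"])
# Background model: out-of-transcription expressions (greetings, asides, ...).
background = train_bigram(["thank you very much", "end of dictation"])
# FP model: the same transcript populated with filled pauses.
fp_model   = train_bigram(["patient uh denies um chest pain"])

# Interpolation weights would be chosen empirically by minimizing perplexity
# on held-out data; the values below are placeholders.
p = interpolate([forced, background, fp_model], [0.7, 0.2, 0.1])
print(p("denies", "chest"), p("denies", "um"))   # -> 0.7 0.1
```

The effect illustrated here is the one described in the text: after interpolation, transitions from the original transcript no longer have probability 1, and out-of-transcription material such as filled pauses receives non-zero probability.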
<Paragraph position="1"> One of the two models is a background model that accounts for expressions such as greetings, thanking, false starts and repairs. A list of these out-of-transcription expressions is derived by comparing already existing literal transcriptions with their non-literal counterparts.</Paragraph> <Paragraph position="2"> The other model represents the same non-literal transcription populated with filled pauses (FP) (&quot;um's and ah's&quot;) using a stochastic FP model derived from a relatively large corpus of literal transcriptions (Pakhomov 1999, Pakhomov and Savova 1999).</Paragraph> <Paragraph position="3"> Interpolation weights are established empirically by calculating the resulting model's perplexity against held-out data. Out-of-vocabulary (OOV) items are handled provisionally by generating on-the-fly pronunciations based on the existing dictionary spelling-pronunciation alignments. The result of interpolating the finite state model with these two background models is that some of the transitional probabilities found in the finite state model are no longer 1.</Paragraph> <Paragraph position="4"> The language model so derived can now be used to produce a transcription that is likely to be more true to what was actually said than the non-literal transcription we started with.</Paragraph> <Paragraph position="5"> Further refinement of the new semi-literal transcription is carried out by dynamic programming alignment of the recognizer's hypothesis (HYP) against the non-literal transcription, which is used as the reference (REF). The alignment results in each label being designated as a MATCH, a DELETION, a SUBSTITUTION or an INSERTION. Labels present in the HYP stream that do not align with anything in the REF stream are designated as insertions and are assumed to represent the out-of-transcription elements of the dictation. Labels that align but do not match are designated as substitutions. Finally, labels found in the REF stream that do not align with anything in the HYP stream are designated as deletions.</Paragraph> <Paragraph position="6"> The final semi-literal transcription is constructed differently depending on the intended purpose of the transcription. If the transcription will be used for acoustic modeling, then the MATCHES, the REF portion of SUBSTITUTIONS, and the HYP portion of only those INSERTIONS that represent punctuation and filled pauses make it into the final semi-literal transcription. It is important to filter out everything else because acoustic modeling is very sensitive to misalignment errors. Language modeling, on the other hand, is less sensitive to alignment errors; therefore, INSERTIONS and DELETIONS can be introduced into the semi-literal transcription.</Paragraph>
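The alignment-and-filtering logic just described can be sketched as follows. This is an illustrative approximation rather than the ATRS code: Python's difflib stands in for the dynamic programming aligner, and the filled-pause and punctuation sets as well as the sample HYP/REF strings are hypothetical.

```python
# Sketch of HYP/REF alignment and filtering for the acoustic-modeling target.
import difflib

FILLED_PAUSES = {"um", "uh", "ah"}          # hypothetical token sets
PUNCTUATION = {".", ",", "period", "comma"}

def align(hyp, ref):
    """Label aligned tokens as MATCH, SUB, INS (HYP only) or DEL (REF only)."""
    labels = []
    sm = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for op, r1, r2, h1, h2 in sm.get_opcodes():
        if op == "equal":
            labels += [("MATCH", r, h) for r, h in zip(ref[r1:r2], hyp[h1:h2])]
        elif op == "replace":
            labels += [("SUB", r, h) for r, h in zip(ref[r1:r2], hyp[h1:h2])]
            # unmatched leftovers on either side of the replaced span
            labels += [("DEL", r, None) for r in ref[r1 + (h2 - h1):r2]]
            labels += [("INS", None, h) for h in hyp[h1 + (r2 - r1):h2]]
        elif op == "delete":
            labels += [("DEL", r, None) for r in ref[r1:r2]]
        elif op == "insert":
            labels += [("INS", None, h) for h in hyp[h1:h2]]
    return labels

def acoustic_target(labels):
    """Keep MATCHes, the REF side of SUBs, and only FP/punctuation INSertions."""
    out = []
    for op, ref_tok, hyp_tok in labels:
        if op == "MATCH":
            out.append(hyp_tok)
        elif op == "SUB":
            out.append(ref_tok)
        elif op == "INS" and hyp_tok in FILLED_PAUSES | PUNCTUATION:
            out.append(hyp_tok)
        # everything else (other INSertions, all DELetions) is filtered out
    return out

ref = "lungs are clear heart regular rate".split()     # non-literal transcript
hyp = "um lungs are clear heart regular rhythm".split()  # recognizer output
print(acoustic_target(align(hyp, ref)))
# -> ['um', 'lungs', 'are', 'clear', 'heart', 'regular', 'rate']
```

In this toy run the filled pause &quot;um&quot; is retained as an insertion, the misrecognized &quot;rhythm&quot; is replaced by the reference word &quot;rate&quot;, and any other insertions or deletions would be dropped, mirroring the filtering rule for acoustic-model training described above.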
<Paragraph position="7"> One method of ascertaining the quality of semi-literal reconstruction is to measure its alignment errors against literal data using a dynamic programming application. By measuring the correctness spread between ATRS and literal data, as well as the correctness spread between non-literal and literal data, the ATRS alignment correctness rate was observed to be 4.4% (absolute) higher over the 774 dictation files tested. Chart 1 summarizes the results. [Chart 1: Percent improvement in true data representation of ATRS reconstruction vs. non-literal data.] The X axis represents the number of dictations in each bin, and the Y axis represents the percent improvement over the non-literal counterparts. Nearly all ATRS files had better alignment correctness than their non-literal counterparts. The majority of the reconstructed dictations resemble literal transcriptions between 1% and 8% more closely than their non-literal counterparts. These results are statistically significant, as evidenced by a t-test at the 0.05 significance level. Much of the increase in alignment can be attributed to the introduction of filled pauses by ATRS. However, even ignoring filled pauses, we have observed informally that correctness still improves in ATRS files compared with their non-literal counterparts.</Paragraph> <Paragraph position="8"> In the following sections we address acoustic and language modeling and show that semi-literal training data is a good substitute for literal data.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental results </SectionTitle> <Paragraph position="0"> The usefulness of semi-literal transcriptions was evaluated in two ways: acoustic adaptation and language modeling.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Adapted acoustic model evaluation </SectionTitle> <Paragraph position="0"> Three speaker-adapted acoustic models were trained for each of the five talkers in this study, one for each of the three types of label files, and each was evaluated on the talker's testing data.</Paragraph> <Paragraph position="1"> The data collected for each talker were split into testing and training sets.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Training Data </SectionTitle> <Paragraph position="0"> 45-55 minutes of audio data was collected for each of the five talkers in this experiment. All talkers are native speakers of English: two males and three females.</Paragraph> <Paragraph position="1"> Non-literal transcriptions of these data were obtained in the course of normal transcription operation, in which trained medical transcriptionists record the dictations while filtering out disfluencies, asides and ungrammatical utterances.</Paragraph> <Paragraph position="2"> Literal transcriptions were obtained by having five medical transcriptionists, specially trained not to filter out disfluencies and asides, transcribe all the dictations used in this study.</Paragraph> <Paragraph position="3"> Semi-literal transcriptions were obtained with the system described in section 3.2 of this paper.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Testing Data </SectionTitle> <Paragraph position="0"> Three dictations (0.5-2 min) per talker were pulled out of the literal transcription training set and set aside for testing.</Paragraph> <Paragraph position="1"> Recognition and evaluation software and formalism: Software licensed from Entropic Laboratory was used for performing recognition, evaluating accuracy, and acoustic adaptation (Valtchev et al. 1998). Adapted models were trained using the MLLR technique (Leggetter and Woodland 1996) available as part of the Entropic package.</Paragraph> <Paragraph position="2"> Recognition accuracy and correctness reported in this study were calculated according to the following formulas: (1) Accuracy = (hits - insertions) / total words; (2) Correctness = hits / total words.</Paragraph>
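Formulas (1) and (2) translate directly into code; the sketch below is only a restatement of those definitions, and the counts it uses are made-up numbers for illustration.

```python
# Accuracy and correctness as defined in formulas (1) and (2).
def accuracy(hits: int, insertions: int, total_words: int) -> float:
    return (hits - insertions) / total_words

def correctness(hits: int, total_words: int) -> float:
    return hits / total_words

# e.g. 900 correctly recognized words, 25 insertions, 1000 reference words
print(accuracy(900, 25, 1000), correctness(900, 1000))   # 0.875 0.9
```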
<Paragraph position="3"> The following acoustic models were trained by adapting a general speaker-independent (SI) model for each talker, using all available data except the testing data. Each model's name reflects the kind of label data that was used for training.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> LITERAL </SectionTitle> <Paragraph position="0"> Each audio file was aligned with the corresponding literal transcription.</Paragraph> <Paragraph position="1"> NON-LITERAL Each audio file was recognized using SI acoustic and language models. The recognition output was aligned with the non-literal transcription using dynamic programming. Only those portions of audio that corresponded to direct matches in the alignment were used to produce alignments for acoustic modeling. This method was originally used for medical dictations by Wightman and Harder (1999).</Paragraph> <Paragraph position="2"> SEMI-LITERAL Each audio file was processed to produce a semi-literal transcription, which was then aligned with the recognition output generated in the process of creating the semi-literal transcription. The portions of the audio corresponding to matching segments were used for acoustic adaptation training.</Paragraph> <Paragraph position="3"> The SI model had been trained on medical dictations similar to the ones used in this study, using all of the data available at the time (12 hours). Although 50-100 hours of data is the industry standard for SI modeling, the population we are dealing with is highly homogeneous, and reasonable results can be obtained with a smaller amount of data. The data for the speakers in this study were not used in training the SI model.</Paragraph> </Section> </Paper>