<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1083"> <Title>Modeling Filled Pauses in Medical Dictations</Title> <Section position="2" start_page="0" end_page="619" type="metho"> <SectionTitle> 1. Filled Pauses </SectionTitle> <Paragraph position="0"> FP's are not random events, but have a systematic distribution and well-defined functions in discourse (Shriberg and Stolcke 1996, Shriberg 1994, Swerts 1996, Maclay and Osgood 1959, Cook 1970, Cook and Lalljee 1970, Christenfeld et al. 1991).</Paragraph> <Paragraph position="1"> Cook and Lalljee (1970) make an interesting proposal that FP's may have something to do with the listener's perception of disfluent speech. They suggest that speech may be more comprehensible when hesitations contain filler material rather than silence, because the filler preserves continuity, and that an FP may serve as a signal drawing the listener's attention to the next utterance so that its onset is not lost. Perhaps, from the point of view of perception, FP's are not disfluent events at all. This proposal bears directly on the domain of medical dictations, since many doctors who use old voice-operated equipment train themselves to use FP's instead of silent pauses, so that the recorder does not cut off the beginning of the post-pause utterance.</Paragraph> </Section> <Section position="3" start_page="619" end_page="620" type="metho"> <SectionTitle> 2. Quasi-spontaneous speech </SectionTitle> <Paragraph position="0"> Family practice medical dictations tend to be pre-planned and follow an established SOAP format: Subjective (informal observations), Objective (examination), Assessment (diagnosis), and Plan (treatment plan). Despite that, doctors vary greatly in how frequently they use FP's, which agrees with Cook and Lalljee's (1970) finding of no correlation between FP use and the mode of discourse. Audience awareness may also play a role in this variability. 
My observations provide multiple examples where the doctors address the transcriptionists directly, making editing comments and thanking them.</Paragraph> <Section position="1" start_page="619" end_page="619" type="sub_section"> <SectionTitle> 3. Training Corpora and FP Model </SectionTitle> <Paragraph position="0"> This study used three base and two derived corpora. The base corpora represent three different sets of dictations, described in section 3.1. The derived corpora are variations on the base corpora, conditioned in several different ways described in section 3.2.</Paragraph> </Section> <Section position="2" start_page="619" end_page="620" type="sub_section"> <SectionTitle> 3.1 Base </SectionTitle> <Paragraph position="0"> The balanced FP training corpus (BFP-CORPUS) contains 75,887 words of word-by-word transcription data evenly distributed among 16 talkers. This corpus was used to build the BIGRAM-FP-LM, which controls the process of populating a no-FP corpus with artificial FP's.</Paragraph> <Paragraph position="1"> The unbalanced FP training corpus (UFP-CORPUS) consists of approximately 500,000 words of all available word-by-word transcription data from approximately 20 talkers. This corpus was used only to calculate the average frequency of FP use across all available talkers.</Paragraph> <Paragraph position="2"> The finished transcriptions corpus (FT-CORPUS) of 12,978,707 words contains all available dictations. It represents over 200 talkers of mixed gender and professional status. The corpus contains no FP's or any other types of disfluencies, such as repetitions, repairs, and false starts, and its language is also edited for grammar.</Paragraph> <Paragraph position="3"> The derived CONTROLLED-FP-CORPUS is a version of the finished transcriptions corpus populated stochastically with 2,665,000 FP's based on the BIGRAM-FP-LM. 
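The two population schemes just described (bigram-controlled insertion and uniform-random insertion) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the probability table, the back-off value, and the `<FP>` marker token are all invented for this example.

```python
import random

# Toy stand-in for the BIGRAM-FP-LM: P(FP | preceding word), i.e. the
# probability that a filled pause follows a given word.  These numbers
# are invented for illustration.
FP_AFTER_WORD = {"patient": 0.12, "denies": 0.08, "the": 0.02}
FP_DEFAULT = 0.04  # assumed back-off probability for unseen words

def populate_controlled(words, rng):
    """Bigram-controlled insertion: every word is a potential landing
    site and receives an FP with probability P(FP | word)."""
    out = []
    for w in words:
        out.append(w)
        if rng.random() < FP_AFTER_WORD.get(w, FP_DEFAULT):
            out.append("<FP>")
    return out

def populate_random(words, rng, max_gap=30):
    """Random baseline: the next insertion point is drawn uniformly
    from 0-29 words ahead, giving an FP roughly every 15th word."""
    out, gap = [], rng.randrange(max_gap)
    for w in words:
        out.append(w)
        if gap == 0:
            out.append("<FP>")
            gap = rng.randrange(max_gap)
        else:
            gap -= 1
    return out
```

Either way, stripping the `<FP>` tokens from the output recovers the original word sequence, which is what makes the derived corpora directly comparable to the finished transcriptions they were built from.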
RANDOM-FP-CORPUS-1 (normal density) is another version of the finished transcriptions corpus, populated with 916,114 FP's where the insertion point was selected at random in the range between 0 and 29. The random function is based on the average frequency of FP's in the unbalanced UFP-CORPUS, where an FP occurs on average after every 15th word.</Paragraph> <Paragraph position="4"> Another version, RANDOM-FP-CORPUS-2 (high density), was used to approximate the frequency of FP's in the CONTROLLED-FP-CORPUS.</Paragraph> </Section> </Section> <Section position="4" start_page="620" end_page="620" type="metho"> <SectionTitle> 4. Models </SectionTitle> <Paragraph position="0"> The language modeling process in this study was conducted in two stages. First, a bigram model containing bigram probabilities of FP's in the balanced BFP-CORPUS was built, followed by four different trigram language models, some of which used corpora generated with the BIGRAM-FP-LM built during the first stage.</Paragraph> <Section position="1" start_page="620" end_page="620" type="sub_section"> <SectionTitle> 4.1 Bigram FP model </SectionTitle> <Paragraph position="0"> This model contains the distribution of FP's obtained by using the following formulas:</Paragraph> <Paragraph position="2"> Thus, each word in a corpus to be populated with FP's becomes a potential landing site for an FP and does or does not receive one based on the probability found in the BIGRAM-FP-LM.</Paragraph> </Section> <Section position="2" start_page="620" end_page="620" type="sub_section"> <SectionTitle> 4.2 Trigram models </SectionTitle> <Paragraph position="0"> The following trigram models were built using ECRL's Transcriber language modeling tools (Valtchev et al. 1998). Both bigram and trigram cutoffs were set to 3.</Paragraph> </Section> </Section> <Section position="5" start_page="620" end_page="620" type="metho"> <SectionTitle> 5. 
Testing Data </SectionTitle> <Paragraph position="0"> The testing data come from 21 talkers selected at random and represent 3 short (1-3 min) dictations for each talker. The talkers are a random mix of male and female medical doctors and practitioners who vary greatly in their use of FP's. Some use practically no FP's (relying on long silences instead); others use an FP almost every other word. Based on the frequency of FP use, the talkers were roughly split into high FP user and low FP user groups. The relevance of this division will become apparent during the discussion of the test results.</Paragraph> </Section> <Section position="6" start_page="620" end_page="621" type="metho"> <SectionTitle> 6. Adaptation </SectionTitle> <Paragraph position="0"> Test results for the ALLFP-LM (63.01% avg. word accuracy) suggest that the model over-represents FP's.</Paragraph> <Paragraph position="1"> The recognition accuracy for this model is 4.21 points higher than that of the NOFP-LM (58.8% avg. word accuracy), but lower than that of both the RANDOMFP-LM-1 (67.99% avg. word accuracy), by about 5 points, and the RANDOMFP-LM-2 (65.87% avg. word accuracy), by about 3 points. One way of decreasing the FP representation is to correct the BIGRAM-FP-LM, but this proves to be computationally expensive because the large training corpus has to be rebuilt with each change in the BIGRAM-FP-LM. Another method is to build a NOFP-LM and an ALLFP-LM once and experiment with their relative weights through adaptation. I chose the second method because the ECRL Transcriber toolkit provides an adaptation tool that achieves the goals of the first method much faster. The results show that introducing a NOFP-LM into the equation improves recognition. The difference in recognition accuracy between the ALLFP-LM and the ADAPTFP-LM is on average 4.9% across all talkers, in the ADAPTFP-LM's favor. Separating the talkers into high FP user and low FP user groups raises the ADAPTFP-LM's gain to 6.2% for high FP users and lowers it to 3.3% for low FP users. 
This shows that adaptation to no-FP data is, counterintuitively, more beneficial for high FP users.</Paragraph> </Section> </Paper>