<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-2015">
<Title>THE MIT SPOKEN LECTURE PROCESSING PROJECT</Title>
<Section position="3" start_page="0" end_page="28" type="metho">
<SectionTitle>2 Project Details</SectionTitle>
<Paragraph position="0">As mentioned earlier, we have developed a web-based Spoken Lecture Processing Server (http://groups.csail.mit.edu/sls/lectures) to which users can upload audio files for automatic transcription and indexing. In our work, we have experimented with collecting audio data using a small personal digital audio recorder (an iRiver N10). To help the speech recognizer, users can provide their own supplemental text files, such as journal articles, book chapters, etc., which can be used to adapt the language model and vocabulary of the system. Currently, the key steps of the transcription process are as follows: a) adapt a topic-independent vocabulary and language model using any supplemental text materials, b) automatically segment the audio file into short chunks of pause-delineated speech, and c) automatically annotate these chunks using a speech recognition system.</Paragraph>
<Paragraph position="1">Language model adaptation is performed in two steps. First, the vocabulary of any supplemental text material is extracted and added to an existing topic-independent vocabulary of nearly 17K words. Next, the recognizer merges topic-independent word sequence statistics from an existing corpus of lecture material with the topic-dependent statistics of the supplemental material to create a topic-adapted language model.</Paragraph>
<Paragraph position="2">The segmentation algorithm is performed in two steps. First, the audio file is arbitrarily broken into 10-second chunks for speech detection processing using an efficient speaker-independent phonetic recognizer. To help improve its speech detection accuracy, this recognizer contains models for non-lexical artifacts such as laughs and coughs as well as a variety of other noises. Contiguous regions of speech are identified from the phonetic recognition output (typically 6 to 8 second segments of speech) and passed along to our speech recognizer for automatic transcription. The speech segmentation and transcription steps are currently performed in a distributed fashion over a bank of computation servers. Once recognition is completed, the audio data is indexed (based on the recognition output) in preparation for browsing by the user.</Paragraph>
<Paragraph position="3">The lecture browser provides a graphical user interface to one or more automatically transcribed lectures. A user can type a text query to the browser and receive a list of hits within the indexed lectures. When a hit is selected, it is shown in the context of the lecture transcription. The user can adjust the duration of context preceding and following the hit, navigate to and from the preceding and following parts of the lecture, and listen to the displayed segment. Orthographic segments are highlighted as they are played.</Paragraph>
</Section>
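A minimal sketch of the two-step language model adaptation described in Section 2: step one adds the supplemental material's words to the topic-independent vocabulary, and step two interpolates topic-independent word sequence statistics with topic-dependent ones. The function names, bigram representation, and the interpolation weight are illustrative assumptions, not the system's actual implementation.

```python
# Hedged sketch of the two-step adaptation; all names and the 0.5
# interpolation weight are assumptions for illustration only.
from collections import Counter

def adapt_vocabulary(base_vocab: set, supplemental_text: str) -> set:
    """Step 1: add every word in the supplemental material to the vocabulary."""
    return base_vocab | set(supplemental_text.lower().split())

def adapt_bigram_counts(base: Counter, topic: Counter, weight: float = 0.5) -> Counter:
    """Step 2: merge word-sequence statistics; `weight` favors the topic data."""
    merged = Counter()
    for bigram in base.keys() | topic.keys():
        # Counter returns 0 for unseen bigrams, so either source may be sparse.
        merged[bigram] = (1 - weight) * base[bigram] + weight * topic[bigram]
    return merged
```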
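The two-step segmentation might be sketched as follows, at a deliberately coarse chunk-level granularity: the audio is first cut into fixed 10-second chunks for speech detection, then contiguous speech regions are assembled into segments for the recognizer. The `detect_speech` callback stands in for the speaker-independent phonetic recognizer (which in the real system also models laughs, coughs, and other noises), and the real system locates speech boundaries within chunks rather than at chunk edges.

```python
# Hedged sketch of pause-delineated segmentation; chunk-level only,
# with the phonetic recognizer stubbed out as `detect_speech`.
CHUNK_SEC = 10.0

def segment(audio_len_sec: float, detect_speech) -> list:
    """Return (start, end) times of contiguous speech regions."""
    segments, start, t = [], None, 0.0
    while t < audio_len_sec:
        end = min(t + CHUNK_SEC, audio_len_sec)
        if detect_speech(t, end):          # phonetic recognizer says "speech"
            start = t if start is None else start
        elif start is not None:            # pause found: close the open segment
            segments.append((start, t))
            start = None
        t = end
    if start is not None:                  # audio ended mid-speech
        segments.append((start, audio_len_sec))
    return segments
```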
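To make the indexing and browsing step concrete, here is a sketch of one plausible index over the recognition output: each hypothesized word is stored with its start time, so a text query returns time-stamped hits together with an adjustable window of surrounding transcript context, much as the browser displays them. The data structures are assumptions for illustration, not the project's actual index format.

```python
# Hedged sketch of keyword indexing over time-aligned recognizer output.
from collections import defaultdict

def build_index(words):
    """words: list of (word, start_time) pairs from the recognizer."""
    index = defaultdict(list)
    for pos, (word, _) in enumerate(words):
        index[word.lower()].append(pos)
    return index

def query(index, words, term, context=5):
    """Return each hit as (start_time, snippet) with `context` words per side."""
    hits = []
    for pos in index.get(term.lower(), []):
        lo, hi = max(0, pos - context), min(len(words), pos + context + 1)
        snippet = " ".join(w for w, _ in words[lo:hi])
        hits.append((words[pos][1], snippet))
    return hits
```

Widening the `context` parameter mirrors the browser's control for adjusting how much of the lecture transcription is shown around a selected hit.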
<Section position="4" start_page="28" end_page="28" type="metho">
<SectionTitle>3 Experimental Results</SectionTitle>
<Paragraph position="0">To date we have collected and analyzed a corpus of approximately 300 hours of audio lectures, including 6 full MIT courses and 80 hours of seminars from the MIT World web site [2]. We are currently in the process of expanding this corpus.</Paragraph>
<Paragraph position="1">From manual transcriptions we have generated and verified time-aligned transcriptions for 169 hours of our corpus, and we are in the process of time-aligning transcriptions for the remainder of the corpus.</Paragraph>
<Paragraph position="2">We have performed initial speech recognition experiments using 10 computer science lectures. In these experiments we have discovered that, despite high word error rates (around 40%), retrieval of short audio segments containing important keywords and phrases can be performed with a high degree of reliability (over 90% F-measure, the harmonic mean of precision and recall) [5]. These results are similar in nature to the findings of the SpeechBot project, which performs a similar service for online broadcast news archives [6].</Paragraph>
</Section>
</Paper>