File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/93/h93-1107_metho.xml
Size: 4,097 bytes
Last Modified: 2025-10-06 14:13:26
<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1107"> <Title>CSR CORPUS COLLECTION Denise Danielson, Project Leader</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> CSR CORPUS COLLECTION </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> PROJECT GOALS </SectionTitle> <Paragraph position="0"> The objective of the CSR Corpus Development is to collect and deliver a large corpus of continuous speech data to support DARPA research efforts in continuous speech recognition (CSR). SRI's current goal is the completion of Phase 2, Part 1 of the planned CSR Corpus. This consists of 86,000 sentences from 275 speakers, including 8,000 spontaneous sentences from 40 journalists.</Paragraph> <Paragraph position="1"> The Phase 2 Corpus collection task is a high volume data production task. SRI's major goal has been efficiency.</Paragraph> <Paragraph position="2"> Other goals include gathering data that is more representative of the real world by minimizing controls on vocabulary, microphones, background noise and speaker disfluencies, while improving data quality controls.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> RECENT RESULTS </SectionTitle> <Paragraph position="0"> SRI began work on the current phase of CSR in September, 1992 and expects to complete delivery of this portion in June, 1993.</Paragraph> <Paragraph position="1"> Data Production -- As of 12 March 1993, SRI has collected the following portion of this CSR database: ject interaction with the data collection software. Additional memory was added to the data collection systems, and data collection software made much faster, so that now the pace of the data collection process is directly controlled by the subject and no longer limited by the software. 
As a result, the average data collection pace has increased from 125 utts/hr to 200 utts/hr. For a typical short-term non-journalist subject collecting 190 read sentences, these changes and a faster paced orientation have reduced subject time from 120 minutes to 90 minutes. The shorter time requirement also makes it easier to attract and schedule subjects. Process Efficiency -- SRI has also been concerned with reducing the labor required to process speech data. A labor savings was realized by removing monitors from the data collection room. The data collection monitor now spends about 25 minutes instructing and observing while subjects collect their first few utterances, and then leaves the room. Two other changes have significantly improved labor efficiency. SRI has developed a new transcription tool that has led to a 15% to 20% reduction in transcription time and improved accuracy. We have also automated most prearchival and archival steps.</Paragraph> <Paragraph position="2"> Data Quality -- SRI has incorporated NIST data quality software into its procedures. Sample files are collected at the start of each day on each data collection system. These files are run through the wavmd program, which runs a signal-to-noise (SNR) evaluation and other tests. Additional checks are performed on all files as they are collected to ensure that problems (e.g., a dead microphone) are caught.</Paragraph> <Paragraph position="3"> Labor Analysis -- SRI is analyzing labor costs as we proceed with the current project to enable us to predict costs in the future, as well as to target specific tasks for efficiency improvements. A first round of labor analysis in January of this year identified transcription as one of the biggest labor costs. This has led to efforts to make the transcription task easier and more efficient. SRI continues to work with NIST and the CCCC to clarify transcription guidelines and implement changes recommended by CCCC. 
An analysis of recent project labor indicates that 10-15% of SRI's CSR project time has been spent on tasks in support of communication with NIST and various DARPA program committees.</Paragraph> </Section> </Paper>