<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1055">
  <Title>Position Specific Posterior Lattices for Indexing Speech</Title>
  <Section position="7" start_page="447" end_page="449" type="evalu">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> We have carried all our experiments on the iCampus corpus prepared by MIT CSAIL. The main advantages of the corpus are: realistic speech recording conditions -- all lectures are recorded using a lapel microphone -- and the availability of accurate manual transcriptions -- which enables the evaluation of a SDR system against its text counterpart.</Paragraph>
    <Section position="1" start_page="447" end_page="447" type="sub_section">
      <SectionTitle>
6.1 iCampus Corpus
</SectionTitle>
      <Paragraph position="0"> The iCampus corpus (Glass et al., 2004) consists of about 169 hours of lecture materials: 20 Introduction to Computer Programming Lectures (21.7 hours), 35 Linear Algebra Lectures (27.7 hours), 35 Electro-magnetic Physics Lectures (29.1 hours), 79 Assorted MIT World seminars covering a wide variety of topics (89.9 hours). Each lecture comes with a word-level manual transcription that segments the text into semantic units that could be thought of as sentences; word-level time-alignments between the transcription and the speech are also provided. The speech style is in between planned and spontaneous.</Paragraph>
      <Paragraph position="1"> The speech is recorded at a sampling rate of 16kHz (wide-band) using a lapel microphone.</Paragraph>
      <Paragraph position="2"> The speech was segmented at the sentence level based on the time alignments; each lecture is considered to be a spoken document consisting of a set of one-sentence long segments determined this way -see Section 5.1. The final collection consists of 169 documents, 66,102 segments and an average document length of 391 segments.</Paragraph>
      <Paragraph position="3"> We have then used a standard large vocabulary ASR system for generating 3-gram ASR lattices and PSPL lattices. The 3-gram language model used for decoding is trained on a large amount of text data, primarily newswire text. The vocabulary of the ASR system consisted of 110kwds, selected based on frequency in the training data. The acoustic model is trained on a variety of wide-band speech and it is a standard clustered tri-phone, 3-states-per-phone model. Neither model has been tuned in any way to the iCampus scenario.</Paragraph>
      <Paragraph position="4"> On the first lecture L01 of the Introduction to Computer Programming Lectures the WER of the ASR system was 44.7%; the OOV rate was 3.3%.</Paragraph>
      <Paragraph position="5"> For the entire set of lectures in the Introduction to Computer Programming Lectures, the WER was 54.8%, with a maximum value of 74% and a minimum value of 44%.</Paragraph>
    </Section>
    <Section position="2" start_page="447" end_page="448" type="sub_section">
      <SectionTitle>
6.2 PSPL lattices
</SectionTitle>
      <Paragraph position="0"> We have then proceeded to generate 3-gram lattices and PSPL lattices using the above ASR system. Table 1 compares the accuracy/size of the 3-gram lattices and the resulting PSPL lattices for the first lecture L01. As it can be seen the PSPL represen- null tices for lecture L01 (iCampus corpus): node and link density, 1-best and ORACLE WER, size on disk tation is much more compact than the original 3-gram lattices at a very small loss in accuracy: the 1-best path through the PSPL lattice is only 0.3% absolute worse than the one through the original 3-gram lattice. As expected, the main reduction comes from the drastically smaller node density -- 7 times smaller, measured in nodes per word in the reference transcription. Since the PSPL representation  introduces new paths compared to the original 3-gram lattice, the ORACLE WER path -- least errorful path in the lattice -- is also about 20% relative better than in the original 3-gram lattice -- 5% absolute. Also to be noted is the much better WER in both PSPL/3-gram lattices versus 1-best.</Paragraph>
    </Section>
    <Section position="3" start_page="448" end_page="449" type="sub_section">
      <SectionTitle>
6.3 Spoken Document Retrieval
</SectionTitle>
      <Paragraph position="0"> Our aim is to narrow the gap between speech and text document retrieval. We have thus taken as our reference the output of a standard retrieval engine working according to one of the TF-IDF flavors, see Section 3. The engine indexes the manual transcription using an unlimited vocabulary. All retrieval results presented in this section have used the standard trec_eval package used by the TREC evaluations. null The PSPL lattices for each segment in the spoken document collection were indexed as explained in 5.1. In addition, we generated the PSPL representation of the manual transcript and of the 1-best ASR output and indexed those as well. This allows us to compare our retrieval results against the results obtained using the reference engine when working on the same text document collection.</Paragraph>
      <Paragraph position="1"> 6.3.1 Query Collection and Retrieval Setup The missing ingredient for performing retrieval experiments are the queries. We have asked a few colleagues to issue queries against a demo shell using the index built from the manual transcription.</Paragraph>
      <Paragraph position="2"> The only information1 provided to them was the same as the summary description in Section 6.1.</Paragraph>
      <Paragraph position="3"> We have collected 116 queries in this manner. The query out-of-vocabulary rate (Q-OOV) was 5.2% and the average query length was 1.97 words. Since our approach so far does not index sub-word units, we cannot deal with OOV query words. We have thus removed the queries which contained OOV words -- resulting in a set of 96 queries -- which clearly biases the evaluation. On the other hand, the results on both the 1-best and the lattice indexes are equally favored by this.</Paragraph>
      <Paragraph position="4"> 1Arguably, more motivated users that are also more familiar with the document collection would provide a better query collection framework  We have carried out retrieval experiments in the above setup. Indexes have been built from:  No tuning of retrieval weights, see Eq. (5), or link scoring weights, see Eq. (2) has been performed. Table 2 presents the results. As a sanity check, the retrieval results on transcription -- trans -- match almost perfectly the reference. The small difference comes from stemming rules that the baseline engine is using for query enhancement which are not replicated in our retrieval engine. The results on lattices (lat) improve significantly on (1-best) -20% relative improvement in mean average precision (MAP).</Paragraph>
      <Paragraph position="5">  A legitimate question at this point is: why would anyone expect this to work when the 1-best ASR accuracy is so poor? In favor of our approach, the ASR lattice WER is much lower than the 1-best WER, and PSPL have even lower WER than the ASR lattices. As reported in Table 1, the PSPL WER for L01 was 22% whereas the 1-best WER was 45%. Consider matching a 2-gram in the PSPL --the average query length is indeed 2 wds so this is a representative situation. A simple calculation reveals that it is twice -- (1 [?] 0.22)2/(1 [?] 0.45)2 = 2 -- more likely to find a query match in the PSPL than in the 1-best -if the query 2-gram was indeed spoken at that position. According to this heuristic argument one could expect a dramatic increase in Recall. Another aspect  is that people enter typical N-grams as queries. The contents of adjacent PSPL bins are fairly random in nature so if a typical 2-gram is found in the PSPL, chances are it was actually spoken. This translates in little degradation in Precision.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>