<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2003">
  <Title>Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We evaluated on two tasks: 1) Switchboard (Godfrey et al., 1992), specifically the HUB5 eval 2001 set, containing a total of 60K words spoken by 120 speakers, and 2) an ICSI Meeting Recorder (Morgan et al., 2001) eval set containing a total of 44K words spoken by 25 speakers. Both sets featured spontaneous conversational speech. There were 45K words of held-out data for each task.</Paragraph>
    <Paragraph position="1"> The text corpora of conversational telephone speech (CTS) available for training language models consisted of Switchboard, Callhome English, and Switchboard Cellular, a total of 3 million words. In addition, we used 150 million words of Broadcast News (BN) transcripts, and we collected 191 million words of &quot;conversational&quot; text from the web. For the Meetings task, there were 200K words of meeting transcripts available for training, and we collected 28 million words of &quot;topic-related&quot; text from the web.</Paragraph>
    <Paragraph position="2"> The experiments were conducted using the SRI large-vocabulary speech recognizer (Stolcke et al., 2000) in N-best rescoring mode. A baseline bigram language model was used to generate N-best lists, which were then rescored with various trigram models.</Paragraph>
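The rescoring step described above can be sketched as follows. The combination rule, scoring function, and weight are illustrative stand-ins, not the SRI recognizer's actual interface:

```python
# Sketch of N-best rescoring: hypotheses generated with a weaker (bigram)
# model are re-ranked by combining each hypothesis's acoustic score with
# a stronger trigram LM score. lm_weight is a hypothetical tuning knob.

def rescore_nbest(nbest, trigram_logprob, lm_weight=1.0):
    """nbest: list of (words, acoustic_logscore) pairs.
    Returns the hypotheses sorted by combined log score, best first."""
    def combined(hyp):
        words, acoustic = hyp
        return acoustic + lm_weight * trigram_logprob(words)
    return sorted(nbest, key=combined, reverse=True)

# Toy example: the trigram model prefers the second hypothesis strongly
# enough to overturn the acoustic ranking.
lm_scores = {("a", "b"): -3.0, ("a", "c"): -0.5}
nbest = [(("a", "b"), -5.0), (("a", "c"), -6.0)]
reranked = rescore_nbest(nbest, lambda ws: lm_scores[ws])
```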
    <Paragraph position="3"> Table 1 shows word error rates (WER) on the HUB5 test set, comparing the class-based mixture against standard (i.e. class-independent) interpolation. The class-based mixture gave better results in all cases except when only CTS sources were used, probably because these sources are similar to each other and the class-based mixture is mainly useful when the data sources are more diverse. We also obtained lower WER by using the web data instead of BN, which indicates that the web data is better matched to our task (i.e. it is more &quot;conversational&quot;). Completely arbitrary training data, by contrast, offers minimal benefit to the recognition task, as shown by an experiment using a 66M-word corpus collected from random web pages. The baseline Switchboard model gave a test set perplexity of 96, which was reduced to 87 with a standard mixture of CTS and BN data, further to 83 by adding the web data, and to a best case of 82 with class-dependent interpolation and the added web data.</Paragraph>
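A minimal sketch of the class-dependent interpolation being compared, assuming the mixture weights are tied to the class of the most recent history word. The source names, classes, and constant probabilities below are hypothetical illustrations, not the paper's actual models:

```python
# Class-dependent mixture: P(w|h) = sum_s lambda[c(h)][s] * P_s(w|h),
# where the interpolation weight for source s depends on the class c of
# the history, instead of a single global lambda per source.

def mixture_prob(word, history, class_of, components, weights):
    """components: {source_name: prob_fn(word, history)}
    weights: {class: {source_name: lambda}}, each class's row summing to 1."""
    c = class_of(history[-1])  # class of the most recent history word
    return sum(weights[c][s] * p(word, history) for s, p in components.items())

# Toy component models returning constant probabilities, for illustration.
components = {
    "cts": lambda w, h: 0.2,
    "web": lambda w, h: 0.1,
}
# Hypothetical per-class weights; standard interpolation would use one row.
weights = {
    "NOUN":  {"cts": 0.7, "web": 0.3},
    "OTHER": {"cts": 0.4, "web": 0.6},
}
class_of = lambda w: "NOUN" if w in {"dog", "meeting"} else "OTHER"

p_noun = mixture_prob("barks", ["the", "dog"], class_of, components, weights)
p_other = mixture_prob("barks", ["they", "say"], class_of, components, weights)
```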
    <Paragraph position="4"> Increasing the amount of web training data from 61M to 191M words gave relatively small performance gains. We &quot;trimmed&quot; the 191M-word web corpus down to 61M words by choosing the documents with the lowest perplexity according to the combined CTS model, yielding the &quot;Web2&quot; data source. The model that used Web2 gave the same WER as the one trained on the original 61M-word web corpus. It could be that the web text obtained with &quot;Google&quot; filtering is fairly homogeneous, so little is gained by further perplexity filtering. Or it could be that, in choosing better-matched data, we also exclude new N-grams that may occur only in testing.</Paragraph>
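The perplexity-based trimming can be sketched as follows, assuming a per-token log10 probability function standing in for the combined CTS model:

```python
# Sketch of perplexity-based corpus trimming: rank documents by the
# perplexity a reference LM assigns them, then keep the best-matched
# documents until a word budget (e.g. 61M words) is filled. The scoring
# function here is a stand-in, not the paper's combined CTS model.

def perplexity(doc_tokens, logprob):
    """Per-word perplexity of a document under a log10 probability model."""
    lp = sum(logprob(t) for t in doc_tokens)
    return 10 ** (-lp / len(doc_tokens))

def trim(docs, logprob, budget_words):
    """Keep lowest-perplexity documents up to the word budget."""
    ranked = sorted(docs, key=lambda d: perplexity(d, logprob))
    kept, total = [], 0
    for doc in ranked:
        if total + len(doc) > budget_words:
            break
        kept.append(doc)
        total += len(doc)
    return kept

# Toy unigram scorer: in-domain tokens "a" are more probable than "x".
toy_logprob = lambda tok: -1.0 if tok == "a" else -3.0
docs = [["a", "a"], ["x", "x"], ["a", "x"]]
trimmed = trim(docs, toy_logprob, budget_words=4)
```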
    <Paragraph position="5"> Results on the Meeting test set are shown in Table 2, where the baseline model was trained on the CTS and BN sources. As in the HUB5 experiments, the class-based mixture outperformed standard interpolation. We achieved lower WER by using the web data instead of the meeting transcripts, but the best results were obtained by using all data sources. Language model perplexity is reduced from 122 for the baseline to a best case of 95.</Paragraph>
    <Paragraph position="6"> We also tried different class assignments for the class-based mixture on the HUB5 set and found that using automatically derived classes instead of part-of-speech tags does not lead to performance degradation, as long as we allocate individual classes to the top 100 words.</Paragraph>
    <Paragraph position="7"> Automatic class mapping can make class-based mixtures feasible for other languages where part-of-speech tags are difficult to derive.</Paragraph>
  </Section>
</Paper>