<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2003">
<Title>Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Language models constitute one of the key components of modern speech recognition systems. Training an N-gram language model, the most commonly used type of model, requires large quantities of text that match the target recognition task in both style and topic. In tasks involving conversational speech, the ideal training material, i.e., transcripts of conversational speech, is costly to produce, which limits the amount of training data currently available.</Paragraph>
<Paragraph position="1"> Methods have been developed for language model adaptation, i.e., the adaptation of an existing model to new topics, domains, or tasks for which little or no training material is available. Since out-of-domain data can contain relevant as well as irrelevant information, various methods are used to identify the most relevant portions of the out-of-domain data prior to combination. Past work on pre-selection has been based on word frequency counts (Rudnicky, 1995), probability (or perplexity) of word or part-of-speech sequences (Iyer and Ostendorf, 1999), latent semantic analysis (Bellegarda, 1998), and information retrieval techniques (Mahajan et al., 1999; Iyer and Ostendorf, 1999). Perplexity-based clustering has also been used to define topic-specific subsets of in-domain data (Clarkson and Robinson, 1997; Martin et al., 1997), and test-set perplexity has been used to prune documents from a training corpus (Klakow, 2000). The most common method for using the additional text sources is to train separate language models on a small amount of in-domain data and large amounts of out-of-domain data, and to combine them by interpolation, also referred to as mixtures of language models. The technique was reported by IBM in 1995 (Liu et al., 1995) and has been used by many sites since then. An alternative approach decomposes the language model into a class n-gram for interpolation (Iyer and Ostendorf, 1997; Ries, 1997), allowing content words, for example, to be interpolated with different weights than filled pauses, which gives an improvement over standard mixture modeling for conversational speech.</Paragraph>
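As a rough sketch (not the exact formulation used later in this paper, and with $\lambda$ and the class function $c(\cdot)$ as assumed notation), the two interpolation schemes contrasted above can be written as

$$P_{\text{mix}}(w_i \mid h_i) = \lambda\, P_{\text{in}}(w_i \mid h_i) + (1 - \lambda)\, P_{\text{out}}(w_i \mid h_i)$$

$$P_{\text{class-mix}}(w_i \mid h_i) = \lambda_{c(w_i)}\, P_{\text{in}}(w_i \mid h_i) + \bigl(1 - \lambda_{c(w_i)}\bigr)\, P_{\text{out}}(w_i \mid h_i),$$

where $h_i$ is the N-gram history, $P_{\text{in}}$ and $P_{\text{out}}$ are the in-domain and out-of-domain models, $c(w_i)$ is the class of word $w_i$ (e.g., content word vs. filled pause), and the interpolation weights $\lambda$ are typically tuned on held-out in-domain data.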
<Paragraph position="2"> Recently, researchers have turned to the World Wide Web as an additional source of training data for language modeling. For "just-in-time" language modeling (Berger and Miller, 1998), adaptation data is obtained by submitting words from initial hypotheses of user utterances as queries to a web search engine. These queries, however, treat words as individual tokens and ignore function words. Such a search strategy typically retrieves text of a non-conversational style and is therefore not ideally suited to ASR. In (Zhu and Rosenfeld, 2001), instead of downloading the actual web pages, the authors retrieved N-gram counts provided by the search engine. Such an approach generates valuable statistics but limits the set of N-grams to those occurring in the baseline model.</Paragraph>
<Paragraph position="3"> In this paper, we present an approach to extracting additional training data from the web by searching for text that is better matched to a conversational speaking style.</Paragraph>
<Paragraph position="4"> We also show how we can make better use of this new data by applying class-dependent interpolation.</Paragraph>
</Section>
</Paper>