<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1014">
<Title>Language Modeling with Sentence-Level Mixtures</Title>
<Section position="2" start_page="0" end_page="82" type="intro">
<SectionTitle> 1. INTRODUCTION </SectionTitle>
<Paragraph position="0"> The overall performance of a large vocabulary continuous speech recognizer is greatly impacted by the constraints imposed by a language model, or the effective constraints of a stochastic language model that provides the a priori probability estimates of the word sequence P(w_1, ..., w_T). The most commonly used statistical language model assumes that the word sequence can be described as a high-order Markov process, typically referred to as an n-gram model, where the probability of a word sequence is given by:</Paragraph>
<Paragraph position="1"> P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) </Paragraph>
<Paragraph position="2"> The standard n-gram models that are commonly used are the bigram (n = 2) and the trigram (n = 3) models, where n is limited primarily because of insufficient training data. However, with such low order dependencies, these models fail to take advantage of 'long distance constraints' over the sentence or paragraph. Such long distance dependencies may be grammatical, as in verb tense agreement or singular/plural quantifier-noun agreement. Or, they may be a consequence of the inhomogeneous nature of language; different words or word sequences are more likely for particular broad topics or tasks. Consider, for example, the following responses made by the combined BU-BBN recognition system on the 1993 Wall Street Journal (WSJ) benchmark H1-C1 (20K) test:
REF: the first recipient joseph webster junior ** ****** a PHI BETA KAPPA chemistry GRAD who plans to take courses this fall in ART RELIGION **** music and political science
HYP: the first recipient joseph webster junior HE FRIDAY a CAP OF ***** chemistry GRANT who plans to take comes this fall in AREN'T REALLY CHIN music and political science
REF: *** COCAINE doesn't require a SYRINGE THE symbol of drug abuse and CURRENT aids risk YET can be just as ADDICTIVE and deadly as HEROIN
HYP: THE KING doesn't require a STRANGE A symbol of drug abuse and TRADE aids risk IT can be just as ADDICTED and deadly as CHAIRMAN
In the first example, "art" and "religion" make more sense in the context of "courses" than "aren't really chin", and similarly "heroin" should be more likely than "chairman" in the context of "drug abuse".</Paragraph>
<Paragraph position="3"> The problem of representing long-distance dependencies has been explored in other stochastic language models, though they tend to address only one or the other of the two issues raised here, i.e. either sentence-level or task-level dependence. Language model adaptation (e.g. [1, 2, 3]) addresses the problem of inhomogeneity of n-gram statistics, but mainly represents task-level dependencies and does little to account for dependencies within a sentence. A context-free grammar could account for sentence-level dependencies, but it is costly both to build task-specific grammars and to implement them in recognition. A few automatic learning techniques, which are straightforward to apply to new tasks, have been investigated for designing static models of long-term dependence. For example, Bahl et al. [4] used decision tree clustering to reduce the number of n-grams while keeping n large. Other efforts include models that integrate n-grams with context-free grammars (e.g., [5, 6, 7]).</Paragraph>
<Paragraph position="4"> Our approach to representing long-term dependence attempts to address both issues, while still using a very simple model. We propose to use a mixture of n-gram language models, but unlike previous work in mixture language modeling, our mixture components are combined at the sentence level. The component n-grams enable us to capture topic dependence, while using mixtures at the sentence level captures the notion that topics do not change mid-sentence. Like the model proposed by Kneser and Steinbiss [8], our language model uses m component language models, each of which can be identified with the n-gram statistics of a specific topic or broad class of sentences. However, unlike [8], which uses mixtures at the n-gram level with dynamically adapted mixture coefficients, we use sentence-level mixtures to capture within-sentence dependencies. Thus, the probability of a word sequence is the mixture:</Paragraph>
<Paragraph position="5"> P(w_1, \ldots, w_T) = \sum_{k=1}^{m} \lambda_k \prod_{t=1}^{T} P_k(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) </Paragraph>
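<Paragraph> To make the sentence-level combination concrete, here is a minimal illustrative sketch in Python (not the implementation used in this work). It assumes toy bigram components with made-up probabilities, hand-set mixture weights lambda_k, and a uniform fallback for unseen bigrams; the point is only that each component model scores the entire sentence and the mixture is applied once per sentence rather than once per word.

# Minimal sketch of a sentence-level mixture of n-gram language models.
# Illustrative only: the component models are toy bigram tables with a
# uniform fallback, and the mixture weights are fixed by hand.

import math

VOCAB = ["the", "drug", "abuse", "heroin", "chairman", "courses", "art", "religion"]
UNIFORM = 1.0 / len(VOCAB)

# Toy "topic" components: each maps (previous word, word) to P_k(word | previous word).
COMPONENT_BIGRAMS = [
    {("drug", "abuse"): 0.4, ("abuse", "heroin"): 0.3},   # a crime/health-like topic
    {("courses", "art"): 0.3, ("art", "religion"): 0.3},  # an education-like topic
]

# Sentence-level mixture weights lambda_k (they sum to 1).
LAMBDAS = [0.5, 0.5]


def component_prob(k, word, prev):
    """P_k(word | prev) from component k, with a uniform fallback for unseen bigrams."""
    return COMPONENT_BIGRAMS[k].get((prev, word), UNIFORM)


def sentence_log_prob(words):
    """log P(w_1..w_T) = log( sum_k lambda_k * prod_t P_k(w_t | w_{t-1}) ).

    The product over words is taken inside each component, and the mixture sum
    is applied once per sentence, which is what distinguishes a sentence-level
    mixture from mixing at the n-gram level.
    """
    mixture = 0.0
    for k, lam in enumerate(LAMBDAS):
        log_pk = 0.0
        prev = "BOS"  # sentence-start symbol
        for word in words:
            log_pk += math.log(component_prob(k, word, prev))
            prev = word
        mixture += lam * math.exp(log_pk)
    return math.log(mixture)


if __name__ == "__main__":
    # Score two topic-coherent sentences under the mixture.
    print(sentence_log_prob(["drug", "abuse", "heroin"]))
    print(sentence_log_prob(["courses", "art", "religion"]))
</Paragraph>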
<Paragraph position="6"> Our approach has the advantage that it can be used either as a static or a dynamic model, and it can easily leverage the techniques that have been developed for adaptive language modeling, particularly cache [1, 9] and trigger [2, 3] models. One might raise the issue of recognition search cost for a model with mixtures at the sentence level, but in the N-best rescoring framework [10] the additional cost of the mixture language model is minimal.</Paragraph>
<Paragraph position="7"> The general framework and mechanism for designing the mixture language model will be described in the next section, including descriptions of automatic topic clustering and robust estimation techniques. Following this discussion, we will present some experimental results on mixture language modeling obtained using the BU recognition system. Finally, the paper will conclude with a discussion of possible extensions of mixture language models, to dynamic language modeling and to applications other than speech transcription.</Paragraph>
</Section>
</Paper>