Language Modeling with Sentence-Level Mixtures 
Rukmini Iyer† Mari Ostendorf† J. Robin Rohlicek‡
†Boston University ‡BBN Inc.
Boston, MA 02215 Cambridge, MA 02138 
ABSTRACT 
This paper introduces a simple mixture language model that attempts
to capture long distance constraints in a sentence or paragraph. The
model is an m-component mixture of trigram models. The models
were constructed using a 5K vocabulary and trained using a 76 million
word Wall Street Journal text corpus. Using the BU recognition
system, experiments show a 7% reduction in word error rate with the
mixture trigram models as compared to using a single trigram model.
1. INTRODUCTION 
The overall performance of a large vocabulary continuous speech
recognizer is greatly impacted by the constraints imposed by a
language model, or the effective constraints of a stochastic language
model that provides the a priori probability estimates of the word
sequence $P(w_1, \ldots, w_T)$. The most commonly used statistical
language model assumes that the word sequence can be described as a
high order Markov process, typically referred to as an n-gram model,
where the probability of a word sequence is given by:
$$P(w_1, \ldots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_{i-1}, \ldots, w_{i-n+1}) \qquad (1)$$
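As a minimal sketch of the n-gram factorization in Equation (1), the following computes a sentence log-probability from any conditional model. The function names, the `<s>` start-of-sentence padding, and the uniform toy model are assumptions made for illustration and are not part of the original system.

```python
import math

def sentence_logprob(words, cond_prob, n=3):
    """Return log P(w_1..w_T) as a sum of n-gram log-probabilities (Eq. 1)."""
    # Pad the history so the first words also have n-1 predecessors
    # (the "<s>" start symbol is an assumption of this sketch).
    padded = ["<s>"] * (n - 1) + list(words)
    total = 0.0
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        total += math.log(cond_prob(padded[i], history))
    return total

def toy_model(word, history):
    """Hypothetical conditional model: uniform over a 4-word vocabulary."""
    return 0.25

log_p = sentence_logprob(["stocks", "fell", "today"], toy_model, n=3)
print(log_p)  # three factors of 0.25, i.e. 3 * log(0.25)
```

Working in the log domain avoids numerical underflow on long sentences, which matters once several such models are combined.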
The standard n-gram models that are commonly used are the
bigram (n = 2) and the trigram (n = 3) models, where n is
limited primarily because of insufficient training data. However,
with such low order dependencies, these models fail to take
advantage of 'long distance constraints' over the sentence or
paragraph. Such long distance dependencies may be grammatical, as in
verb tense agreement or singular/plural quantifier-noun agreement.
Or, they may also be a consequence of the inhomogeneous nature of
language; different words or word sequences are more likely for
particular broad topics or tasks. Consider, for example, the
following responses made by the combined BU-BBN recognition system
on the 1993 Wall Street Journal (WSJ) benchmark H1-C1
(20K) test:
REF: the first recipient joseph webster junior ** ******
a PHI BETA KAPPA chemistry GRAD who plans to take
courses this fall in ART RELIGION **** music and
political science
HYP: the first recipient joseph webster junior HE FRIDAY
a CAP OF ***** chemistry GRANT who plans to
take courses this fall in AREN'T REALLY CHIN music
and political science
REF: *** COCAINE doesn't require a SYRINGE THE
symbol of drug abuse and CURRENT aids risk YET can
be just as ADDICTIVE and deadly as HEROIN
HYP: THE KING doesn't require a STRANGE A
symbol of drug abuse and TRADE aids risk IT can be just as
ADDICTED and deadly as CHAIRMAN
In the first example, "art" and "religion" make more sense in
the context of "courses" than "aren't really chin", and similarly
"heroin" should be more likely than "chairman" in the
context of "drug abuse".
The problem of representing long-distance dependencies has
been explored in other stochastic language models, though
they tend to address only one or the other of the two issues
raised here, i.e. either sentence-level or task-level dependence.
Language model adaptation (e.g. \[1, 2, 3\]) addresses
the problem of inhomogeneity of n-gram statistics, but mainly
represents task level dependencies and does little to account
for dependencies within a sentence. A context-free grammar
could account for sentence level dependencies, but it is costly
to build task-specific grammars as well as costly to implement
in recognition. A few automatic learning techniques, which
are straightforward to apply to new tasks, have been investigated
for designing static models of long term dependence.
For example, Bahl et al. \[4\] used decision tree clustering to
reduce the number of n-grams while keeping n large. Other
efforts include models that integrate n-grams with context-free
grammars (e.g., \[5, 6, 7\]).
Our approach to representing long term dependence attempts
to address both issues, while still using a very simple model.
We propose to use a mixture of n-gram language models,
but unlike previous work in mixture language modeling our
mixture components are combined at the sentence level. The
component n-grams enable us to capture topic dependence,
while using mixtures at the sentence level captures the notion
that topics do not change mid-sentence. Like the model proposed
by Kneser and Steinbiss \[8\], our language model uses
m component language models, each of which can be identified
with the n-gram statistics of a specific topic or broad class
of sentences. However, unlike \[8\], which uses mixtures at the
n-gram level with dynamically adapted mixture coefficients,
we use sentence-level mixtures to capture within-sentence
dependencies. Thus, the probability of a word sequence is the
weighted combination:
$$P(w_1, \ldots, w_T) = \sum_{k=1}^{m} \lambda_k \prod_{i=1}^{T} P_k(w_i \mid w_{i-1}, \ldots, w_{i-n+1}) \qquad (2)$$
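To make the distinction between sentence-level and n-gram level mixtures concrete, the sketch below evaluates both on the same toy per-word probabilities. The function names and the numbers are hypothetical; the only point is that the sentence-level form sums over components outside the product over words, while the n-gram level form sums inside it.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def sentence_mixture_logprob(word_logprobs, weights):
    """Sentence-level mixture: log sum_k lambda_k prod_i P_k(w_i | history).

    word_logprobs[k][i] holds log P_k(w_i | history) for component k.
    """
    terms = [math.log(lam) + sum(lps) for lam, lps in zip(weights, word_logprobs)]
    return logsumexp(terms)

def word_mixture_logprob(word_logprobs, weights):
    """n-gram level mixture, for contrast: log prod_i sum_k lambda_k P_k(w_i | history)."""
    length = len(word_logprobs[0])
    return sum(
        logsumexp([math.log(lam) + lps[i] for lam, lps in zip(weights, word_logprobs)])
        for i in range(length)
    )

# Two hypothetical components scoring a two-word sentence.
toy_logprobs = [[math.log(0.5), math.log(0.5)],
                [math.log(0.1), math.log(0.1)]]
toy_weights = [0.5, 0.5]
p_sentence = math.exp(sentence_mixture_logprob(toy_logprobs, toy_weights))
p_word = math.exp(word_mixture_logprob(toy_logprobs, toy_weights))
print(p_sentence, p_word)  # 0.5*0.25 + 0.5*0.01 versus (0.5*0.5 + 0.5*0.1)**2
```

In this toy case the sentence-level mixture assigns more probability to a sentence that one component explains consistently, which is the behavior the text motivates.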
Our approach has the advantage that it can be used either
as a static or a dynamic model, and can easily leverage the
techniques that have been developed for adaptive language
modeling, particularly cache \[1, 9\] and trigger \[2, 3\] models.
One might raise the issue of recognition search cost for a
model of mixtures at the sentence level, but in the N-best
rescoring framework \[10\] the additional cost of the mixture
language model is minimal.
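The low cost in the N-best framework follows from the fact that a sentence-level score is only needed for each of N complete hypotheses. The following is a rough sketch of such re-ranking with a weighted combination of knowledge-source scores; the data layout, the score names, and the weights are hypothetical and do not describe the BU or BBN implementations.

```python
def rerank(nbest, weights):
    """Return the hypothesis with the highest weighted sum of log scores.

    nbest   : list of (hypothesis_text, {source_name: log_score}) pairs
    weights : {source_name: weight}, assumed to be tuned on held-out data
    """
    def combined(scores):
        return sum(weights[name] * scores[name] for name in weights)
    return max(nbest, key=lambda item: combined(item[1]))[0]

# Hypothetical two-entry N-best list with made-up scores.
toy_nbest = [
    ("the king doesn't require", {"acoustic": -10.0, "lm": -9.0}),
    ("cocaine doesn't require", {"acoustic": -10.5, "lm": -6.0}),
]
best_with_lm = rerank(toy_nbest, {"acoustic": 1.0, "lm": 1.0})
best_acoustic_only = rerank(toy_nbest, {"acoustic": 1.0, "lm": 0.0})
print(best_with_lm, "|", best_acoustic_only)
```

Because each language model is applied to whole word strings, swapping a trigram score for a mixture score changes only how the `lm` entry is computed, not the search.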
The general framework and mechanism for designing the mixture
language model will be described in the next section,
including descriptions of automatic topic clustering and robust
estimation techniques. Following this discussion, we
will present some experimental results on mixture language
modeling obtained using the BU recognition system. Finally,
the paper will conclude with a discussion of the possible
extensions of mixture language models, to dynamic language
modeling and to applications other than speech transcription.
2. MIXTURE LANGUAGE MODEL 
2.1. General Framework 
The sentence-level mixture language model was originally
motivated by an observation that news stories (and certainly
other domains as well) often reflect the characteristics of
different topics or sub-domains, such as sports, finance, national
news and local news. The likelihood of different words or
n-grams could be very different in different sub-domains,
and it is unlikely that one would switch sub-domains mid-sentence.
A model with sentence-level mixtures of topic-dependent
component models would address this problem,
but the model would be more general if it also allowed for
n-gram level mixtures within the components (e.g. for robust
estimation). Thus, we propose a model using mixtures at two
levels: the sentence and the n-gram level. Using trigram
components, this model is described by
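$$P(w_1, \ldots, w_T) = \sum_{k=1}^{m} \lambda_k \prod_{i=1}^{T} \left[ \theta_k P_k(w_i \mid w_{i-1}, w_{i-2}) + (1 - \theta_k) P_I(w_i \mid w_{i-1}, w_{i-2}) \right], \qquad (3)$$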
where k is an index to the particular topic described
by the component language model $P_k(\cdot \mid \cdot)$, $P_I(\cdot \mid \cdot)$ is a
topic-independent model that is interpolated with the
topic-dependent model for purposes of robust estimation or dynamic
language model adaptation, and $\lambda_k$ and $\theta_k$ are the
sentence-level and n-gram level mixture weights, respectively. (Note
that the component-dependent term $P_k(w_i \mid w_{i-1}, w_{i-2})$ could
itself be a mixture.)
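A small worked instance of the two-level mixture may help: each topic model is first interpolated with the topic-independent model at every word, and the resulting per-topic sentence probabilities are then combined with the sentence-level weights. The function name and all numbers below are hypothetical, and plain probabilities are used only because the toy sentence is short.

```python
def two_level_mixture_prob(p_topic, p_indep, lambdas, thetas):
    """Two-level mixture probability of one sentence.

    p_topic[k][i] : P_k(w_i | w_{i-1}, w_{i-2}) for topic component k
    p_indep[i]    : topic-independent P_I(w_i | w_{i-1}, w_{i-2})
    lambdas[k]    : sentence-level weight of component k
    thetas[k]     : n-gram level interpolation weight of component k
    """
    total = 0.0
    for lam, theta, topic_probs in zip(lambdas, thetas, p_topic):
        sentence_prob = 1.0
        for p_k, p_i in zip(topic_probs, p_indep):
            # n-gram level interpolation for robustness
            sentence_prob *= theta * p_k + (1.0 - theta) * p_i
        # sentence-level combination across topics
        total += lam * sentence_prob
    return total

# Two hypothetical topics scoring a two-word sentence.
toy_prob = two_level_mixture_prob(
    p_topic=[[0.4, 0.2], [0.1, 0.3]],
    p_indep=[0.2, 0.2],
    lambdas=[0.6, 0.4],
    thetas=[0.5, 0.5],
)
print(toy_prob)  # 0.6 * (0.3 * 0.2) + 0.4 * (0.15 * 0.25)
```

For realistic sentence lengths the products would underflow, so an actual implementation would presumably accumulate log-probabilities instead.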
Two important aspects of the model are the definition of "topics"
and robust parameter estimation. The m component
distributions of the language model correspond to different
"topics", where topic can mean any broad class of sentences,
such as subject area (as in the examples given above) or verb
tense. Topics can be specified by hand, according to text labels
if they are available, or by heuristic rules associated with
known characteristics of a task domain. Topics can also be
determined automatically, which is the approach taken here,
using any of a variety of clustering methods to initialize the
component models. Robust parameter estimation is another
important issue in mixture language modeling, because the
process of partitioning the data into topic-dependent subsets
reduces the amount of training available to estimate each
component language model. These two issues, automatic clustering
for topic initialization and robust parameter estimation,
are described further in the next two subsections.
2.2. Clustering Algorithm 
Since the standard WSJ language model training data does
not have topic labels associated with the text, it was necessary
to use automatic clustering to identify natural groupings
of the data into "topics". Because of its conceptual simplicity,
agglomerative clustering is used to partition the training
data into the desired number of clusters. The clustering is
at the paragraph level, relying on the assumption that an entire
paragraph comes from a single topic. Each paragraph
begins as a singleton cluster. Paragraph pairs are then
progressively grouped into clusters by computing the similarity
between clusters and grouping the two most similar clusters.
The basic clustering algorithm is as follows:
1. Let the desired number of clusters be C* and the initial
number of clusters C be the number of singleton data
samples, or paragraphs.
2. Find the best matched clusters, say $A_i$ and $A_j$, to
maximize the similarity criterion $S_{ij}$.
3. Merge $A_i$ and $A_j$ and decrement C.
4. If current number of clusters C = C*, then stop; otherwise
go to Step 2.
At the end of this stage, we have the desired number of
partitions of the training data. To save computation, we run
agglomerative clustering first on subsets of the data, and then
continue by agglomerating resulting clusters into a final set of
m clusters.
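The four steps above can be sketched as a greedy merge loop. This is an illustrative version only: the `agglomerate` and `jaccard` names are hypothetical, the toy similarity is a plain word-overlap ratio rather than the normalized measure used in the paper, and the exhaustive pair search is written for clarity rather than speed.

```python
def agglomerate(items, similarity, target):
    """Merge the most similar pair of clusters until `target` clusters remain."""
    assert target >= 1
    clusters = [[item] for item in items]          # Step 1: singleton clusters
    while len(clusters) > target:                  # Step 4: stop when C = C*
        best = None
        for a in range(len(clusters)):             # Step 2: best matched pair
            for b in range(a + 1, len(clusters)):
                score = similarity(clusters[a], clusters[b])
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # Step 3: merge, decrement C
        del clusters[b]
    return clusters

def jaccard(cluster_a, cluster_b):
    """Toy similarity: shared words over all words of the two clusters."""
    words_a = set().union(*cluster_a)
    words_b = set().union(*cluster_b)
    return len(words_a & words_b) / len(words_a | words_b)

# Four hypothetical "paragraphs", each reduced to a set of content words.
toy_paragraphs = [
    {"stocks", "prices"}, {"stocks", "loans"},
    {"travel", "politics"}, {"politics", "treaty"},
]
result = agglomerate(toy_paragraphs, jaccard, target=2)
print(result)
```

Each merge scans all cluster pairs, so the cost grows quickly with the number of paragraphs; this is one way to see why running the procedure on subsets first saves computation.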
A variety of similarity measures can be envisioned. We use
a normalized measure of the number of content words in
common between the two clusters. (Paragraphs comprise
both function words (e.g. is, that, but) and content words (e.g.
stocks, theater, trading), but the function words do not
contribute towards the identification of a paragraph as belonging
to a particular topic so they are ignored in the similarity
criterion.) Letting $A_i$ be the set of unique content words in cluster
i, $|A_i|$ the number of elements in $A_i$, and $N_i$ the number of
paragraphs in cluster i, then the specific measure of similarity
of two clusters i and j is
$$S_{ij} = N_{ij} \frac{|A_i \cap A_j|}{|A_i \cup A_j|}, \qquad (4)$$
where 
$$N_{ij} = \sqrt{\frac{N_i + N_j}{N_i N_j}} \qquad (5)$$
is a normalization factor used to avoid the tendency for small 
clusters to group with one large cluster rather than other small 
clusters. 
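The similarity measure can be sketched as follows. The overlap ratio follows the definition of the content-word sets given above; the square-root form of the normalization factor, the tiny function-word list, and the function names are assumptions of this sketch.

```python
import math

# Hypothetical function-word list; a real system would use a much larger one.
FUNCTION_WORDS = {"is", "that", "but", "the", "a", "of"}

def content_words(paragraphs):
    """Set of unique content words over all paragraphs of a cluster."""
    return {w for p in paragraphs for w in p.split() if w not in FUNCTION_WORDS}

def cluster_similarity(words_i, n_i, words_j, n_j):
    """Normalized content-word overlap between two clusters.

    words_i, words_j : sets of unique content words
    n_i, n_j         : number of paragraphs in each cluster
    """
    # Assumed form of the normalization: sqrt((N_i + N_j) / (N_i * N_j)).
    norm = math.sqrt((n_i + n_j) / (n_i * n_j))
    union = words_i | words_j
    if not union:
        return 0.0
    return norm * len(words_i & words_j) / len(union)

words_a = content_words(["stocks trading is the prices"])
words_b = content_words(["the stocks prices loans"])
s_small = cluster_similarity(words_a, 1, words_b, 1)
s_large = cluster_similarity(words_a, 100, words_b, 1)
print(s_small, s_large)
```

With identical word overlap, the pair of singleton clusters scores higher than the pair involving a 100-paragraph cluster, which is the effect the normalization is meant to have.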
At this point, we have only experimented with a small number 
of clusters, so it is difficult to see coherent topics in them. 
However, it appears that the current models are putting news 
related to foreign affairs (politics, as well as travel) into one 
cluster and news relating to finance (stocks, prices, loans) in 
another. 
2.3. Parameter Estimation 
Each component model is a conventional n-gram model. Initial
n-gram estimates for the component models are based
on the partitions of the training data, obtained by using the
above clustering algorithm. The initial component models
are estimated separately for each cluster, where the Witten-Bell
back-off \[11\] is used to compute the probabilities of
n-grams not observed in training, based on distributing a certain
amount of the total probability mass among unseen n-grams.
This method was chosen based on the results of \[12\]
and our own comparative experiments with different back-off
methods for WSJ n-gram language models. The parameters
of the component models can be re-estimated using the
Expectation-Maximization (EM) algorithm \[13\]. However,
since the EM algorithm was computationally intensive, an
iterative re-labeling re-estimation technique was used. At each
iteration, the training data is re-partitioned, by re-labeling
each utterance according to which component model maximizes
the likelihood of that utterance. Then, the component
n-gram statistics are re-computed using the new subsets of the
training data, again using the Witten-Bell back-off technique.
The iterations continue until a steady state size for the clusters
is reached.
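The re-labeling and re-estimation loop can be illustrated with a deliberately simplified stand-in. In this sketch each component is an add-one smoothed unigram model rather than a Witten-Bell trigram, and the loop stops when no label changes, which is a stricter criterion than a steady cluster size. All names and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences, vocab):
    """Stand-in component model: add-one smoothed unigram (not Witten-Bell)."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def loglik(sentence, model):
    return sum(math.log(model[w]) for w in sentence)

def relabel_reestimate(sentences, labels, m, max_iters=20):
    """Alternate between re-estimating components and re-labeling sentences."""
    vocab = {w for s in sentences for w in s}
    models = []
    for _ in range(max_iters):
        # Re-estimate each component from the sentences currently assigned to it.
        models = [
            train_unigram([s for s, lab in zip(sentences, labels) if lab == k], vocab)
            for k in range(m)
        ]
        # Re-label each sentence with its maximum-likelihood component.
        new_labels = [
            max(range(m), key=lambda k: loglik(s, models[k])) for s in sentences
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, models

toy_sentences = [["stocks", "fell"], ["stocks", "rose"], ["team", "won"],
                 ["team", "lost"], ["stocks", "won"]]
final_labels, _ = relabel_reestimate(toy_sentences, [0, 1, 1, 1, 1], m=2)
print(final_labels)
```

Unlike full EM, each sentence contributes to exactly one component per iteration, so only one set of counts has to be updated per sentence.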
Since the component models are built on partitioned training
data, there is a danger of them being undertrained. There are
two main mechanisms we have explored for robust parameter
estimation, in addition to using standard back-off techniques.
One approach is to include a general model $P_G$ trained on all
the data as one of the mixture components. This approach has
the advantage that the general model will be more appropriate
for recognizing sentences that do not fall clearly into any of the
topic-dependent components, but the possible disadvantage
that the component models may be underutilized because they
are relatively undertrained. An alternative is to interpolate
the general model with each component model at the n-gram
level, but this may force the component models to be too
general in order to allow for unforeseen topics. Given these
trade-offs, we chose to implement a compromise between the
two approaches, i.e. to include a general model as one of the
components, as well as some component level smoothing via
interpolation with a general model. Specifically, the model is
given by
$$P(w_1, \ldots, w_T) = \sum_{k=1,\ldots,C,G} \lambda_k \prod_{i=1}^{T} \left( \theta_k P_k(w_i \mid w_{i-1}, w_{i-2}) + (1 - \theta_k) P_{G'}(w_i \mid w_{i-1}, w_{i-2}) \right) \qquad (6)$$
where $P_{G'}$ is a general model (which may or may not be the
same as $P_G$), $\{\lambda_k\}$ provide weights for the different topics,
and $\{\theta_k\}$ serve to smooth the component models.
Both sets of mixture weights are estimated on a separate data
set, using a maximum likelihood criterion and initializing with
uniform weights. To simplify the initial implementation, we
did not estimate the two sets of weights $\{\lambda_k\}$ and $\{\theta_k\}$ jointly.
Rather, we first labeled the sentences in the mixture weight
estimation data set according to their most likely component
models, and then separately estimated the weight $\theta_k$ to maximize
the likelihood of the data assigned to its cluster. For
a single set of data, the mixture weight estimation algorithm
involves iteratively updating
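$$\theta_k^{new} = \frac{1}{N} \sum_{i=1}^{N} \frac{\theta_k^{old} P_k(w_1, \ldots, w_{n_i})}{\theta_k^{old} P_k(w_1, \ldots, w_{n_i}) + (1 - \theta_k^{old}) P_{G'}(w_1, \ldots, w_{n_i})} \qquad (7)$$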
where $n_i$ is the number of words in sentence i and N is the total
number of sentences in cluster k. After the component models
have been estimated, the sentence-level mixture weights $\{\lambda_k\}$
are estimated using an analogous algorithm.
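The iterative weight update has the form of the standard EM re-estimate for a two-way mixture weight: the new weight is the average posterior probability that each held-out sentence came from the topic component rather than the general model. The sketch below assumes that form, with sentence probabilities passed in directly; the function names and toy probabilities are hypothetical.

```python
def update_theta(theta, p_topic, p_general):
    """One re-estimation step for a single component's interpolation weight.

    p_topic[i]   : probability of held-out sentence i under the topic model
    p_general[i] : probability of the same sentence under the general model
    """
    posteriors = [
        theta * pk / (theta * pk + (1.0 - theta) * pg)
        for pk, pg in zip(p_topic, p_general)
    ]
    return sum(posteriors) / len(posteriors)

def estimate_theta(p_topic, p_general, theta=0.5, iters=50):
    """Iterate the update from a uniform starting weight."""
    for _ in range(iters):
        theta = update_theta(theta, p_topic, p_general)
    return theta

# Two hypothetical held-out sentences.
one_step = update_theta(0.5, [0.2, 0.1], [0.1, 0.1])
unchanged = update_theta(0.5, [0.3, 0.3], [0.3, 0.3])
converged = estimate_theta([0.2, 0.1], [0.1, 0.1])
print(one_step, unchanged, converged)
```

When the two models agree on every sentence the weight does not move, and when the topic model is never worse the weight drifts toward 1, so in practice the held-out set needs to contain sentences that the topic model explains poorly for the smoothing to have any effect.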
3. EXPERIMENTS 
The corpus used for training the different component models
comprised the 38 million word WSJ0 data, as well as the 38 million
word augmented LM data obtained from BBN Inc. The
vocabulary is the standard 5K non-verbalized punctuation
(NVP) data augmented with the verbalized punctuation words
and a few additional words. In order to compute the mixture
weights, both at the trigram-level as well as the sentence-level,
the WSJ1 speaker-independent transcriptions serve as
the "held out" data set. Because we felt that the training data
may not accurately represent the optional verbalized punctuation
frequency in the WSJ1 data, we chose to train models
on two data sets. The general model $P_G$ and the component
models $P_k$ were trained on the WSJ0 NVP data augmented
by the BBN data. The general model $P_{G'}$ was trained on
the WSJ0 verbalized punctuation data, so that using $P_{G'}$
in smoothing the component models also provides a simple
means of allowing for verbalized punctuation.
The experiments compare a single trigram language model
to a five-component mixture of trigram models. To explore
the trade-offs of using different numbers of clusters, we also
consider an eight-component trigram mixture. Perplexity and
recognition results are reported on the Nov. '93 ARPA development
and evaluation 5k vocabulary WSJ test sets.
3.1. Recognition Paradigm
The BU Stochastic Segment Model recognition system is
combined with the BBN BYBLOS system and uses the N-best
rescoring formalism \[10\]. In this formalism, the BYBLOS
system, a speaker-independent Hidden Markov Model system
\[14\]¹, is used to compute the top N sentence hypotheses,
of which the top 100 are subsequently rescored by the SSM.
A five-pass search strategy is used to generate the N-best
hypotheses, and these are rescored with thirteen state HMMs.
A weighted combination of scores from different knowledge
sources is used to re-rank the hypotheses and the top ranking
hypothesis is used as the recognized output. The weights for
recombination are estimated on one test set (in this case the
'93 H2 development test data) and held fixed for all other test
sets.
We conducted a series of experiments in the rescoring
paradigm to assess the usefulness of the mixture model. Unless
otherwise noted, the only acoustic model score used was
based on the stochastic segment model. The language model
scores used varied with the experiments. For the best-case
system, we used all scores, which included the SSM and the
BBN Byblos HMM and SNN acoustic scores, and both the
BBN trigram and BU mixture language model scores.
¹ For an indication of the performance of this system, see the benchmark
summary in \[17\].
3.2. Results
The results reported in Table 1 compare three different language
models in terms of perplexity and recognition performance:
a simple trigram, and five- and eight-component mixtures.
The mixture models reduce the perplexity only by a
small amount, but there is a reduction in word error with the
five-component mixture model. We hypothesize that there
is not enough training data to effectively use more mixture
components.
No. of components   % Word error   Perplexity
1                   7.3            118
5                   7.1            116
8                   7.2            114
Table 1: Dependence on number of components: evaluation
on the '93 ARPA 5k WSJ development test set.
The next series of experiments, summarized in Table 2², compared
recognition performance for the BBN trigram language
model \[15\], the BU 5-component mixture model, and the
case where both language model scores are used in the N-best
reranking. All language models were estimated from the same
training data. The results show a 7% reduction in error rate
on the evaluation test set (from 6.1% to 5.7% word error),
comparing the combined language models to the BBN trigram.
It is interesting that the combination of the trigram and the
mixture model yielded a small improvement in performance
(not significant, but consistent across test sets), since the
trigram is a component of the mixture model. The difference
between the mixture model and the two combined models
corresponds to a linear vs. non-linear combination of component
probabilities, respectively.
For reference, we also include the best case system performance,
which corresponds to the case where all acoustic and
language model scores are used. Even with all the acoustic model
scores, adding the mixture language model improves performance,
giving a best case result of 5.3% word error on the '93
5k WSJ evaluation test set.
KSs used          % Word Error
AM     LM         Dev    Eval
SSM    trigram    7.4    6.1
SSM    mixture    7.1    5.8
SSM    both       7.0    5.7
all    both       6.3    5.3
Table 2: Summary of results on '93 ARPA 5k WSJ test sets
for different acoustic model (AM) and language model (LM)
knowledge sources (KSs).
² The performance figures quoted here are better than those reported in
the official November 1993 WSJ benchmark results, because more language
model training data was available in the experiments reported here.
4. DISCUSSION
In summary, this paper presents a new approach to language
modeling, which offers the potential for capturing both
topic-dependent effects and long-range sentence level effects in
a conceptually simple variation of statistical n-gram models.
The model is actually a two-level mixture model, with
separate mixture weights at the n-gram and sentence levels.
Training involves either automatic clustering or heuristic rules
to determine the initial topic-dependent models, and an iterative
algorithm for estimating mixture weights at the different
levels. Recognition experiments on the WSJ task showed
a significant improvement in the accuracy for the BU-SSM
recognition system.
This work can be extended in several ways. First, time limitations
did not permit us to explore the use of the complete
EM algorithm for estimating mixture components and weights
jointly, and we hope to investigate that approach in the future.
In addition, it may be useful to consider other metrics for
automatic topic clustering, such as a word count weighted by
inverse document frequencies or a multinomial distribution
assumption with a likelihood clustering criterion. Of course,
it would also be interesting to see if further performance gains
could be achieved with more clusters. Much more could also
be done in the area of robust parameter estimation. For example,
one could use an n-gram part-of-speech sequence model
as the base for all component models and topic-dependent
word likelihoods given the part-of-speech label, a natural
extension of \[16\].
Dynamic language model adaptation, which makes use of
the previous document history to tune the language model
to that particular topic, can easily fit into the mixture model
framework in two ways. First, the sentence-level mixture
weights can be adapted according to the likelihood of the
respective mixture components in the previous utterance, as
in \[8\] for n-gram level mixture weights. Second, the dynamic
n-gram cache model \[1, 9\] can easily be incorporated into
the mixture language model. However, in the mixture model,
it is possible to have component-dependent cache models,
where each component cache would be updated after each
sentence according to the likelihood of that component given
the recognized word string. Trigger models \[2, 3\] could also
be component dependent.
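One possible reading of adapting the sentence-level weights is to replace them with the posterior probability of each component given the previous recognized utterance. The sketch below computes such posteriors; this particular update rule, the function name, and the numbers are assumptions for illustration, since the text does not specify a procedure.

```python
import math

def component_posteriors(weights, sentence_logliks):
    """Posterior of each mixture component given one utterance.

    weights[k]          : current sentence-level weight of component k
    sentence_logliks[k] : log-likelihood of the utterance under component k
    """
    scores = [math.log(w) + ll for w, ll in zip(weights, sentence_logliks)]
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Hypothetical two-component example for one previous utterance.
adapted = component_posteriors([0.5, 0.5], [math.log(0.03), math.log(0.01)])
print(adapted)
```

The same posteriors could also serve as soft update weights for component-dependent caches, and taking the most probable component gives a simple topic label for the utterance.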
The simple static mixture language model can also be useful
in applications other than continuous speech transcription.
For example, topic-dependent models could be used for topic
spotting. In addition, as mentioned earlier, the notion of topic
need not be related to subject area; it can be related to speaking
style or speaker goal. In the ATIS task, for example,
the goal of the speaker (e.g. flight information request, response
clarification, error correction) is likely to be reflected
in the language of the utterance. Representing this structure
explicitly has the double benefit of improving recognition
performance and providing information for a dialog model.
From a cursory look at our recognition errors from the recent 
WSJ benchmark tests, it is clear that topic-dependent models 
will not be enough to dramatically reduce word error rate. 
Out-of-vocabulary words and function words also represent 
a major source of errors. However, an important advantage 
of this framework is that it is a simple extension of existing 
language modeling techniques that can easily be integrated 
with other language modeling advances. 
Acknowledgments 
This work was supported jointly by ARPA and ONR on grant 
ONR ONR-N00014-92-J-1778. We gratefully acknowledge 
the cooperation of several researchers at BBN, who provided 
the N-best hypotheses used in our recognition experiments, 
as well as additional language model training data. 
References 
1. F. Jelinek, B. Merialdo, S. Roukos and M. Strauss, "A Dynamic
LM for Speech Recognition," Proc. DARPA Workshop on
Speech and Natural Language, pp. 293-295, 1991.
2. R. Lau, R. Rosenfeld and S. Roukos, "Trigger-Based Language
Models: a Maximum Entropy Approach," Proc. Int'l. Conf.
on Acoust., Speech and Signal Proc., Vol. II, pp. 45-48, 1993.
3. R. Rosenfeld, "A Hybrid Approach to Adaptive Statistical
Language Modeling," this proceedings.
4. L. R. Bahl, P. F. Brown, P. V. deSouza and R. L. Mercer, "A
Tree-Based Statistical Language Model for Natural Language
Speech Recognition," IEEE Trans. on Acoust., Speech, and
Signal Proc., Vol. 37, No. 7, pp. 1001-1008, 1989.
5. J. H. Wright, G. J. F. Jones and H. Lloyd-Thomas, "A
Consolidated Language Model For Speech Recognition," Proc.
EuroSpeech, Vol. 2, pp. 977-980, 1993.
6. M. Meteer and J. R. Rohlicek, "Statistical Language Modeling
Combining n-gram and Context Free Grammars," Proc. Int'l.
Conf. on Acoust., Speech and Signal Proc., Vol. 2, pp. 37-40,
1993.
7. J. Lafferty, "Integrating Probabilistic Finite-State and
Context-Free Models of Language," presentation at the IEEE ASR
Workshop, December 1993.
8. R. Kneser and V. Steinbiss, "On the Dynamic Adaptation Of
Stochastic LM," Proc. Int'l. Conf. on Acoust., Speech and
Signal Proc., Vol. 2, pp. 586-589, 1993.
9. R. Kuhn and R. de Mori, "A Cache Based Natural Language
Model for Speech Recognition," IEEE Trans. PAMI, Vol. 14,
pp. 570-583, 1992.
10. M. Ostendorf, A. Kannan, S. Austin, O. Kimball, R. Schwartz
and J. R. Rohlicek, "Integration of Diverse Recognition
Methodologies Through Reevaluation of N-Best Sentence
Hypotheses," Proc. DARPA Workshop on Speech and Natural
Language, pp. 83-87, February 1991.
11. I. H. Witten and T. C. Bell, "The Zero Frequency Estimation of
Probabilities of Novel Events in Adaptive Text Compression,"
IEEE Trans. Information Theory, Vol. IT-37, No. 4, pp. 1085-1094,
1991.
12. P. Placeway and R. Schwartz, "Estimation Of Powerful LM
from Small and Large Corpora," Proc. Int'l. Conf. on Acoust.,
Speech and Signal Proc., Vol. 2, pp. 33-36, 1993.
13. A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum
Likelihood Estimation from Incomplete Data," Journal of the
Royal Statistical Society (B), Vol. 39, No. 1, pp. 1-38, 1977.
14. BBN Byblos November 1993 WSJ Benchmark system. 
15. R. Schwartz et al., "On Using Written Language Training Data
for Spoken Language Modeling," this proceedings.
16. M. Elbeze and A.-M. Derouault, "A Morphological Model for
Large Vocabulary Speech Recognition," Proc. Int'l. Conf. on
Acoust., Speech and Signal Proc., Vol. 1, pp. 577-580, 1990.
17. D. Pallett, J. Fiscus, W. Fisher, J. Garofolo, B. Lund and M.
Pryzbocki, "1993 Benchmark Tests for the ARPA Spoken Language
Program," this proceedings.
