Dynamic Nonlocal Language Modeling via 
Hierarchical Topic-Based Adaptation 
Radu Florian and David Yarowsky 
Computer Science Department and Center for Language and Speech Processing, 
Johns Hopkins University 
Baltimore, Maryland 21218 
{rflorian,yarowsky}@cs.jhu.edu 
Abstract 
This paper presents a novel method of generating 
and applying hierarchical, dynamic topic-based lan- 
guage models. It proposes and evaluates new clus- 
ter generation, hierarchical smoothing and adaptive 
topic-probability estimation techniques. These com- 
bined models help capture long-distance lexical de- 
pendencies. Experiments on the Broadcast News
corpus show significant improvement in perplexity 
(10.5% overall and 33.5% on target vocabulary). 
1 Introduction 
Statistical language models are core components of 
speech recognizers, optical character recognizers and 
even some machine translation systems (Brown et
al., 1990). The most common language model-
ing paradigm used today is based on n-grams, local 
word sequences. These models make a Markovian 
assumption on word dependencies; usually that word 
predictions depend on at most m previous words. 
Therefore they offer the following approximation for
the computation of a word sequence probability:

$$P(w_1^N) = \prod_{i=1}^{N} P(w_i \mid w_{i-m+1}^{i-1})$$

where w_i^j denotes the sequence w_i ... w_j; a common
size for m is 3 (trigram language models).
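For concreteness, a minimal sketch of this Markov approximation for m = 3 (the tiny probability table and helper function below are purely illustrative, not from the paper):

```python
# Illustrative trigram probabilities; a real model would be estimated from a corpus.
trigram_prob = {
    ("<s>", "<s>", "peace"): 0.001,
    ("<s>", "peace", "talks"): 0.2,
    ("peace", "talks", "</s>"): 0.1,
}

def sequence_prob(words, probs, m=3, unk=1e-7):
    """Apply the order-m Markov approximation to a padded word sequence."""
    padded = ["<s>"] * (m - 1) + words + ["</s>"]
    p = 1.0
    for i in range(m - 1, len(padded)):
        history = tuple(padded[i - m + 1:i])
        p *= probs.get(history + (padded[i],), unk)
    return p

print(sequence_prob(["peace", "talks"], trigram_prob))   # 0.001 * 0.2 * 0.1
```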
Even though n-grams have proved to be very power-
ful and robust in various tasks involving language
models, they have a certain handicap: because of 
the Markov assumption, the dependency is limited 
to very short local context. Cache language models 
(Kuhn and de Mori (1992),Rosenfeld (1994)) try to 
overcome this limitation by boosting the probabil- 
ity of the words already seen in the history; trigger 
models (Lau et al. (1993)), even more general, try to 
capture the interrelationships between words. Mod- 
els based on syntactic structure (Chelba and Jelinek 
(1998), Wright et al. (1993)) effectively estimate 
intra-sentence syntactic word dependencies. 
The approach we present here is based on the 
observation that certain words tend to have differ- 
ent probability distributions in different topics. We 
propose to compute the conditional language model 
probability as a dynamic mixture model of K topic- 
specific language models: 
$$P(w_i \mid w_1^{i-1}) = \sum_{t=1}^{K} P(t \mid w_1^{i-1}) \cdot P(w_i \mid t, w_1^{i-1}) \approx \sum_{t=1}^{K} P(t \mid w_1^{i-1}) \cdot P_t(w_i \mid w_{i-m+1}^{i-1}) \qquad (1)$$

Figure 1: Conditional probability of the word peace given manually assigned Broadcast News topics (y-axis: P(peace | subtopic); x-axis: 5 major topics and 50 subtopics from the Broadcast News corpus).

The motivation for developing topic-sensitive lan-
guage models is twofold. First, empirically speaking, 
many n-gram probabilities vary substantially when 
conditioned on topic (such as in the case of content 
words following several function words). A more im- 
portant benefit, however, is that even when a given 
bigram or trigram probability is not topic sensitive, 
as in the case of sparse n-gram statistics, the topic- 
sensitive unigram or bigram probabilities may con- 
stitute a more informative backoff estimate than the 
single global unigram or bigram estimates. Discus- 
sion of these important smoothing issues is given in 
Section 4. 
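As a minimal sketch of the mixture computation in Equation (1) (the topic posterior, topic-conditional models, and function name below are hypothetical placeholders):

```python
def mixture_word_prob(word, history, topic_posterior, topic_lms):
    """Equation (1): P(w | h) = sum over topics t of P(t | h) * P_t(w | local h).

    topic_posterior: dict mapping topic id -> P(t | history)
    topic_lms: dict mapping topic id -> callable returning P_t(word, local_history)
    """
    local_history = history[-1:]   # e.g. bigram topic models: keep only the previous word
    return sum(p_t * topic_lms[t](word, local_history)
               for t, p_t in topic_posterior.items())
```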
Finally, we observe that lexical probability distri- 
butions vary not only with topic but with subtopic 
too, in a hierarchical manner. For example, con- 
sider the variation of the probability of the word 
peace given major news topic distinctions (e.g. BUSI- 
NESS and INTERNATIONAL news) as illustrated in 
Figure 1. There is substantial subtopic proba- 
bility variation for peace within INTERNATIONAL 
news (the word usage is 50-times more likely 
in INTERNATIONAL:MIDDLE-EAST than INTERNA- 
TIONAL:JAPAN). We propose methods of hierarchical 
smoothing of P(w_i | topic_t) in a topic-tree to capture
this subtopic variation robustly. 
1.1 Related Work 
Recently, the speech community has begun to ad- 
dress the issue of topic in language modeling. Lowe 
(1995) utilized the hand-assigned topic labels for 
the Switchboard speech corpus to develop topic- 
specific language models for each of the 42 switch- 
board topics, and used a single topic-dependent lan- 
guage model to rescore the lists of N-best hypothe- 
ses. An error-rate improvement of 0.44% over the
baseline language model was reported.
Iyer et al. (1994) used bottom-up clustering tech- 
niques on discourse contexts, performing sentence- 
level model interpolation with weights updated dy- 
namically through an EM-like procedure. Evalu- 
ation on the Wall Street Journal (WSJ0) corpus 
showed a 4% perplexity reduction and 7% word er- 
ror rate reduction. In Iyer and Ostendorf (1996), 
the model was improved by model probability rees- 
timation and interpolation with a cache model, re- 
sulting in better dynamic adaptation and an overall 
22%/3% perplexity/error rate reduction due to both 
components. 
Seymore and Rosenfeld (1997) reported significant 
improvements when using a topic detector to build 
specialized language models on the Broadcast News 
(BN) corpus. They used TF-IDF and Naive Bayes 
classifiers to detect the most similar topics to a given 
article and then built a specialized language model 
to rescore the N-best lists corresponding to the arti- 
cle (yielding an overall 15% perplexity reduction us- 
ing document-specific parameter re-estimation, and 
no significant word error rate reduction). Seymore 
et al. (1998) split the vocabulary into 3 sets: gen- 
eral words, on-topic words and off-topic words, and 
then used a non-linear interpolation to compute the
language model. This yielded an 8% perplexity re- 
duction and 1% relative word error rate reduction. 
In collaborative work, Mangu (1997) investigated 
the benefits of using an existing Broadcast News
topic hierarchy extracted from topic labels as a ba- 
sis for language model computation. Manual tree 
construction and hierarchical interpolation yielded 
a 16% perplexity reduction over a baseline uni- 
gram model. In a concurrent collaborative effort, 
Khudanpur and Wu (1999) implemented clustering 
and topic-detection techniques similar to those pre-
sented here and computed a maximum entropy topic 
sensitive language model for the Switchboard cor- 
pus, yielding 8% perplexity reduction and 1.8% word 
error rate reduction relative to a baseline maximum 
entropy trigram model. 
2 The Data 
The data used in this research is the Broadcast News 
(BN94) corpus, consisting of radio and TV news 
transcripts from the year 1994. From the total of
30226 documents, 20226 were used for training and 
the other 10000 were used as test and held-out data. 
The vocabulary size is approximately 120k words. 
3 Optimizing Document Clustering 
for Language Modeling 
For the purpose of language modeling, the topic la- 
bels assigned to a document or segment of a doc- 
ument can be obtained either manually (by topic- 
tagging the documents) or automatically, by using 
an unsupervised algorithm to group similar docu- 
ments in topic-like clusters. We have utilized the 
latter approach, for its generality and extensibility, 
and because there is no reason to believe that the 
manually assigned topics are optimal for language 
modeling. 
3.1 Tree Generation 
In this study, we have investigated a range of hierar- 
chical clustering techniques, examining extensions of 
hierarchical agglomerative clustering, k-means clus- 
tering and top-down EM-based clustering. The lat- 
ter underperformed on evaluations in Florian (1998) 
and is not reported here. 
A generic hierarchical agglomerative clustering al- 
gorithm proceeds as follows: initially each document 
has its own cluster. Repeatedly, the two closest clus- 
ters are merged and replaced by their union, until 
there is only one top-level cluster. Pairwise docu- 
ment similarity may be based on a range of func- 
tions, but to facilitate comparative analysis we have 
utilized standard cosine similarity, d(D_1, D_2) =
⟨D_1, D_2⟩ / (‖D_1‖ · ‖D_2‖), and IR-style term vectors (see Salton
and McGill (1983)).
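A schematic sketch of this agglomerative procedure with cosine similarity over term vectors (average linkage shown; the vector representation and function names are simplified assumptions, not the authors' implementation):

```python
import numpy as np

def cosine(u, v):
    """Standard cosine similarity <u, v> / (||u|| ||v||)."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def agglomerative_tree(doc_vectors):
    """Bottom-up clustering: start with one cluster per document and repeatedly
    merge the two closest clusters until a single root remains; the merge
    history defines the tree."""
    clusters = [[i] for i in range(len(doc_vectors))]
    merges = []
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average linkage: mean pairwise document similarity between clusters
                sim = np.mean([cosine(doc_vectors[i], doc_vectors[j])
                               for i in clusters[a] for j in clusters[b]])
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        a, b = best_pair
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```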
This procedure outputs a tree in which documents 
on similar topics (indicated by similar term content) 
tend to be clustered together. The difference be- 
tween average-linkage and maximum-linkage algo- 
rithms manifests in the way the similarity between 
clusters is computed (see Duda and Hart (1973)). A 
problem that appears when using hierarchical clus- 
tering is that small centroids tend to cluster with 
bigger centroids instead of other small centroids, of- 
ten resulting in highly skewed trees such as shown 
in Figure 2 (α = 0). To overcome the problem, we de-
vised two alternative approaches for computing the 
intercluster similarity: 
• Our first solution minimizes the attraction of large clusters by introducing a normalizing factor α into the inter-cluster distance function:

$$d(C_1, C_2) = \frac{\langle c(C_1), c(C_2) \rangle}{N(C_1)^{\alpha}\,\|c(C_1)\| \cdot N(C_2)^{\alpha}\,\|c(C_2)\|} \qquad (2)$$

where N(C_k) is the number of vectors (documents) in cluster C_k and c(C_i) is the centroid of the i-th cluster. Increasing α improves tree balance as shown in Figure 2, but as α becomes large the forced balancing degrades cluster quality.

• A second approach we explored is to perform basic smoothing of the term vector weights, replacing all 0's with a small value ε. By decreasing initial vector orthogonality, this approach facilitates attraction to small centroids, and leads to more balanced clusters, as shown in Figure 3.

Instead of stopping the process when the desired number of clusters is obtained, we generate the full tree, for two reasons: (1) the full hierarchical structure is exploited in our language models, and (2) once the tree structure is generated, the objective function we use to partition the tree differs from the one used when building the tree. Since the clustering procedure turns out to be rather expensive for large datasets (both in terms of time and memory), only 10000 documents were used for generating the initial hierarchical structure.

° Section 3.2 describes the choice of the optimum α.

Figure 2: As α increases (α = 0, 0.3, 0.5), the trees become more balanced, at the expense of forced clustering.

Figure 3: Tree balance is also sensitive to the smoothing parameter ε (ε = 0, 0.15, 0.3, 0.7).

3.2 Optimizing the Hierarchical Structure

To be able to compute accurate language models, one has to have sufficient data for the relative frequency estimates to be reliable. Usually, even with enough data, a smoothing scheme is employed to ensure that P(w_i | w_1^{i-1}) > 0 for any given word sequence w_1^i.

The trees obtained from the previous step have documents in the leaves, and therefore not enough word mass for proper probability estimation. But on the path from a leaf to the root, the internal nodes grow in mass, ending with the root, where the counts from the entire corpus are stored. Since our intention is to use the full tree structure to interpolate between the in-node language models, we proceeded to identify a subset of internal nodes of the tree which contain sufficient data for language model estimation. The criterion for choosing the nodes to collapse involves a goodness function, such that the cut¹ is the solution to a constrained optimization problem, given the constraint that the resulting tree has exactly k leaves. Let this evaluation function be g(n), where n is a node of the tree, and suppose that we want to minimize it. Let g(n, k) be the minimum cost of creating k leaves in the subtree rooted at n. When the evaluation function g(n) satisfies the locality condition that it depends solely on the values g(n_j, ·) (where (n_j)_{j=1..k_n} are the children of node n), g(root, k) can be computed efficiently using dynamic programming²:

$$g(n, 1) = g(n)$$
$$g(n, k) = \min_{\substack{j_1, \ldots, j_{k_n} \ge 1 \\ j_1 + \cdots + j_{k_n} = k}} h\big(g(n_1, j_1), \ldots, g(n_{k_n}, j_{k_n})\big) \qquad (3)$$
Let us assume for a moment that we are inter- 
ested in computing a unigram topic-mixture lan- 
guage model. If the topic-conditional distributions 
have high entropy (e.g. the histogram of P(w | topic)
is fairly uniform), topic-sensitive language model in- 
terpolation will not yield any improvement, no mat- 
ter how well the topic detection procedure works. 
Therefore, we are interested in clustering documents 
in such a way that the topic-conditional distribution 
P(wltopic) is maximally skewed. With this in mind, 
we selected the evaluation function to be the condi- 
tional entropy of a set of words (possibly the whole 
vocabulary) given the particular classification. The 
conditional entropy of some set of words W given a
partition C is

$$H(\mathcal{W} \mid \mathcal{C}) = -\sum_{i} P(C_i) \sum_{w \in \mathcal{W} \cap C_i} P(w \mid C_i) \cdot \log P(w \mid C_i) = -\frac{1}{T}\sum_{i} \sum_{w \in \mathcal{W} \cap C_i} c(w, C_i) \cdot \log P(w \mid C_i) \qquad (4)$$
where c(w, C_i) is the TF-IDF factor of word w in class C_i and T is the size of the corpus. Let us observe that the conditional entropy does satisfy the locality condition mentioned earlier.

¹ The collection of nodes that collapse.
² h is an operator through which the values g(n_1, j_1), ..., g(n_k, j_k) are combined, such as Σ or Π.

Figure 4: Conditional entropy for different α, cluster sizes and linkage methods (left: the average-linkage case; right: the maximum-linkage case; curves for 64, 77 and 100 clusters, plotted against α).
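The following sketch shows how the entropy-based cost and the dynamic program of Equation (3) fit together, assuming each tree node stores aggregated word counts and using summation as the combination operator h; the Node class and function names are illustrative, and no memoization is added, for clarity:

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    counts: Dict[str, float]              # word -> count aggregated at this node
    children: List["Node"] = field(default_factory=list)

def node_cost(node: Node, corpus_size: float) -> float:
    """g(n): contribution of node n, taken as a leaf of the cut, to H(W | C) (Eq. 4)."""
    total = sum(node.counts.values())
    return -sum(c * math.log(c / total) for c in node.counts.values()) / corpus_size

def best_cut_cost(node: Node, k: int, corpus_size: float) -> float:
    """g(n, k): minimum cost of a cut with exactly k leaves below node n (Eq. 3)."""
    if k == 1:
        return node_cost(node, corpus_size)        # g(n, 1) = g(n)
    kids = node.children
    if not kids or k < len(kids):
        return float("inf")                        # a cut with k leaves is infeasible here
    def distribute(i: int, remaining: int) -> float:
        # assign `remaining` leaves among children i..end, each child getting >= 1
        if i == len(kids) - 1:
            return best_cut_cost(kids[i], remaining, corpus_size)
        return min(best_cut_cost(kids[i], j, corpus_size) + distribute(i + 1, remaining - j)
                   for j in range(1, remaining - (len(kids) - 1 - i) + 1))
    return distribute(0, k)
```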
Given this objective function, we identified the op- 
timal tree cut using the dynamic-programming tech- 
nique described above. We also optimized different 
parameters (such as α and the choice of linkage method).
Figure 4 illustrates that, for a range of cluster sizes,
maximum-linkage clustering with α = 0.15-0.3 yields
optimal performance as measured by the conditional
entropy objective in Equation (4).
The effect of varying α is also shown graphically in
Figure 5. Successful tree construction for language
modeling purposes will minimize the conditional
entropy H(W | C). This is most clearly illustrated
for the word politics, where the tree generated with
α = 0.3 maximally focuses documents on this topic
into a single cluster. The other words shown also
exhibit this desirable highly skewed distribution of
P(W | C) in the cluster tree generated when α = 0.3.
Another investigated approach was k-means clus- 
tering (see Duda and Hart (1973)) as a robust and 
proven alternative to hierarchical clustering. Its ap- 
plication, with both our automatically derived clus- 
ters and Mangu's manually derived clusters (Mangu
(1997)) used as initial partitions, actually yielded a
small increase in conditional entropy and was not 
pursued further. 
4 Language Model Construction and 
Evaluation 
Estimating the language model probabilities is a 
two-phase process. First, the topic-sensitive lan- 
guage model probabilities P(w_i | t, w_{i-m+1}^{i-1}) are com-
puted during the training phase. Then, at run-time, 
or in the testing phase, topic is dynamically iden- 
tified by computing the probabilities P(t | w_1^{i-1}) as
in section 4.2 and the final language model proba- 
bilities are computed using Equation (1). The tree 
used in the following experiments was generated us- 
ing average-linkage agglomerative clustering, using 
parameters that optimize the objective function in 
Section 3. 
4.1 Language Model Construction 
The topic-specific language model probabilities are 
computed in a four phase process: 
1. Each document is assigned to one leaf in the 
tree, based on the similarity to the leaves' cen- 
troids (using the cosine similarity). The doc- 
ument counts are added to the selected leaf's 
count. 
2. The leaf counts are propagated up the tree such 
that, in the end, the counts of every inter- 
nal node are equal to the sum of its children's 
counts. At this stage, each node of the tree has 
an attached language model - the relative fre- 
quencies. 
3. In the root of the tree, a discounted Good- 
Turing language model is computed (see Katz 
(1987), Chen and Goodman (1998)). 
4. m-gram smooth language models are computed
for each node n other than the root, by three-way
interpolation between the m-gram language model
in the parent parent(n), the (m-1)-gram smooth
language model in node n, and the m-gram relative
frequency estimate in node n:

$$P_n(w_m \mid w_1^{m-1}) = \lambda_1(w_1^{m-1})\, P_{parent(n)}(w_m \mid w_1^{m-1}) + \lambda_2(w_1^{m-1})\, P_n(w_m \mid w_2^{m-1}) + \lambda_3(w_1^{m-1})\, f_n(w_m \mid w_1^{m-1}) \qquad (5)$$

with $\lambda_1(w_1^{m-1}) + \lambda_2(w_1^{m-1}) + \lambda_3(w_1^{m-1}) = 1$
for each node n in the tree. Based on how 
the λ_k(w_1^{m-1}) depend on the particular node n and
the word history w_1^{m-1}, various models can be
obtained. We investigated two approaches: a 
bigram model in which the ,k's are fixed over 
the tree, and a more general trigram model in which
the λ's adapt using an EM reestimation procedure.
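A minimal sketch of the three-way interpolation in Equation (5) for a single node (node.parent, smooth_prob, and rel_freq are assumed accessors, not names from the paper):

```python
def node_mgram_prob(w, history, node, lambdas):
    """Equation (5): interpolate the parent's m-gram model, this node's
    (m-1)-gram smooth model, and this node's m-gram relative frequency.
    `history` is w_1^{m-1}; lambdas = (l1, l2, l3) must sum to 1."""
    l1, l2, l3 = lambdas
    return (l1 * node.parent.smooth_prob(w, history)     # P_parent(n)(w_m | w_1^{m-1})
            + l2 * node.smooth_prob(w, history[1:])      # P_n(w_m | w_2^{m-1})
            + l3 * node.rel_freq(w, history))            # f_n(w_m | w_1^{m-1})
```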
Figure 5: Basic Bigram Language Model Specifications. The figure defines P_node(w_2 | w_1) in two cases, f_node(w_1) ≠ 0 and f_node(w_1) = 0. In both cases P_node(w_2 | w_1) = P_root(w_2 | w_1) for w_2 in the fixed space F(w_1); for w_2 in the free space R(w_1), the node's relative frequency f_node(w_2 | w_1) (when available), the node unigram P_node(w_2), and the parent model P_parent(node)(w_2 | w_1) are interpolated with the λ weights; and for w_2 in the unknown space U(w_1) the model backs off to α_node(w_1) · P_node(w_2). The normalizing factors γ_node(w_1) and α_node(w_1) are computed such that the probabilities sum to 1.
4.1.1 Bigram Language Model 
Not all words are topic sensitive. Mangu (1997) ob- 
served that closed-class function words (FW), such 
as the, of, and with, have minimal probability vari- 
ation across different topic parameterizations, while 
most open-class content words (CW) exhibit sub- 
stantial topic variation. This leads us to divide the
possible word pairs into two classes (topic-sensitive
and not) and to compute the λ's in Equation (5) in
such a way that the probabilities of the topic-insensitive
pairs are constant across all the models. To formalize this:
• F(w_1) = {w_2 ∈ V | (w_1, w_2) is fixed} - the
"fixed" space;
• R(w_1) = {w_2 ∈ V | (w_1, w_2) is free/variable} -
the "free" space;
• U(w_1) = {w_2 ∈ V | (w_1, w_2) was never seen} -
the "unknown" space.
The imposed restriction is, then: for every word
w_1 and any word w_2 ∈ F(w_1), P_n(w_2 | w_1) =
P_root(w_2 | w_1) in any node n.
The distribution of bigrams in the training data 
is as follows, with roughly 30% of bigram probabilities
allowed to vary in the topic-sensitive models: 
Model    Bigram type    Example          Freq.
fixed    p(FW | FW)     p(the | ...)     45.3%    least topic sensitive
fixed    p(FW | CW)     p(of | ...)      24.8%
free     p(CW | CW)     p(air | cold)     5.3%
free     p(CW | FW)     p(air | the)     24.5%    most topic sensitive

This approach raises one interesting issue: the language model in the root assigns some probability mass to the unseen events, equal to the singletons' mass (see Good (1953), Katz (1987)). In our case, based on the assumptions made in the Good-Turing formulation, we considered that the ratio of the probability mass that goes to the unseen events and the one that goes to the seen, free events should be fixed over the nodes of the tree. Let β be this ratio. Then the language model probabilities are computed as in Figure 5.
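To make the partition concrete, a minimal sketch of how a bigram could be routed to the fixed, free, or unknown space (the function-word list and names are illustrative assumptions, not the authors' exact criteria):

```python
FUNCTION_WORDS = {"the", "of", "and", "with", "a", "to", "in"}   # illustrative subset

def bigram_space(w1, w2, train_bigrams):
    """Assign (w1, w2) to F(w1), R(w1) or U(w1), following the FW/CW split above."""
    if (w1, w2) not in train_bigrams:
        return "unknown"        # U(w1): never seen in training
    if w2 in FUNCTION_WORDS:
        return "fixed"          # predicting a function word: kept equal to the root model
    return "free"               # predicting a content word: allowed to vary by topic
```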
4.1.2 Ngram Language Model Smoothing 
In general, n-gram language model probabilities
can be computed as in formula (5), where the
λ_k(w_1^{m-1}), k = 1, ..., 3, are adapted both for the
particular node n and history w_1^{m-1}. The proposed de-
pendency on the history is realized through the his- 
tory count c(w_1^{m-1}) and the relevance of the history
w_1^{m-1} to the topic in the nodes n and parent(n).
The intuition is that if a history is as relevant in the 
current node as in the parent, then the estimates in 
the parent should be given more importance, since 
they are better estimated. On the other hand, if the 
history is much more relevant in the current node, 
then the estimates in the node should be trusted 
more. The mean adapted λ for a given height h
in the tree is shown in Figure 6. This is consistent
with the observation that splits in the middle of the 
tree tend to be most informative, while those closer 
to the leaves suffer from data fragmentation, and 
hence give relatively more weight to their parent. 
As before, since not all the m-grams are expected to
be topic-sensitive, we use a method to ensure that
those m-grams are kept "fixed" to minimize noise
and modeling effort. In this case, though, two lan-
guage models with different support are used: one 
that supports the topic-insensitive m-grams and that
is computed only once (it is a normalization of the
topic-insensitive part of the overall model), and one
that supports the rest of the mass and which is computed
by interpolation using formula (5). The final language
model in each node is then computed as a mixture of the two.

Figure 6: Mean of the estimated λ's at node height h, in the unigram case (plotted against node height).

Figure 7: Topic-sensitive probability estimation for peace and piece in the context "It is at least on the Serb side a real setback to the ___" (upper panels: the topic posterior P(topic | history) over topic IDs; lower panels: the topic-conditional estimates P(peace | history) and P(piece | history)).
4.2 Dynamic Topic Adaptation 
Consider the example of predicting the word follow- 
ing the Broadcast News fragment: "It is at least on 
the Serb side a real drawback to the ____". Our topic
detection model, as further detailed later in this sec- 
tion, assigns a topic distribution to this left context 
(including the full previous discourse), illustrated in 
the upper portion of Figure 7. The model identi- 
fies that this particular context has greatest affinity 
with the empirically generated topic clusters #41 
and #42 (which appear to have one of their foci on 
international events). 
The lower portion of Figure 7 illustrates the topic- 
conditional bigram probabilities P(w | the, topic) for
two candidate hypotheses for w: peace (the actu- 
ally observed word in this case) and piece (an in- 
correct competing hypothesis). In the former case, 
P(peace | the, topic) is clearly highly elevated in the
most probable topics for this context (#41,#42), 
and thus the application of our core model combi- 
nation (Equation (1)) yields a posterior joint product
P(w_i | w_1^{i-1}) = Σ_{t=1}^{K} P(t | w_1^{i-1}) · P_t(w_i | w_{i-m+1}^{i-1}) that is
12-times more likely than the overall bigram probability,
P(peace | the) = 0.001. In contrast, the obvious
acoustically motivated alternative piece has great-
est probability in a far different and much more dif- 
fuse distribution of topics, yielding a joint model 
probability for this particular context that is 40% 
lower than its baseline bigram probability. This 
context-sensitive adaptation illustrates the efficacy 
of dynamic topic adaptation in increasing the model 
probability of the truth. 
Clearly the process of computing the topic de- 
tector P (tlw~ -1) is crucial. We have investigated 
several mechanisms for estimating this probability;
the most promising is a class of normalized trans- 
formations of traditional cosine similarity between 
the document history vector w_1^{i-1} and the topic centroids:

$$P(t \mid w_1^{i-1}) = \frac{f\left(\mathrm{CosineSim}(t, w_1^{i-1})\right)}{\sum_{t'} f\left(\mathrm{CosineSim}(t', w_1^{i-1})\right)} \qquad (6)$$
One obvious choice for the function f would be the 
identity. However, considering a linear contribution 
Language Model                                      Perplexity on the     Perplexity on the
                                                    entire vocabulary     target vocabulary
Standard Bigram Model                                     215                   584

History size   Scaled   g(x)    f(x)      k-NN
   100          yes      x       x²        -              206                   460
  1000          yes      x       x²        -              195                   405
  5000*         yes*     x*      x²*       -*             192 (-10%)            389 (-33%)
  5000          yes      1       x         -              202                   444
  5000          no       x       x²        -              193                   394
  5000          yes      x       x²        15-NN          192                   390
  5000          yes      e^x     x·e^x     -              196                   411

Table 1: Perplexity results for the topic-sensitive bigram language model, for different history lengths and topic-detection parameters (the row marked * is the default configuration).
of similarities poses a problem: because topic de- 
tection is more accurate when the history is long, 
even unrelated topics will have a non-trivial contri- 
bution to the final probability³, resulting in poorer
estimates. 
One class of transformations we investigated, that 
directly address the previous problem, adjusts the 
similarities such that closer topics weigh more and 
more distant ones weigh less. Therefore, f is chosen 
such that 
$$\frac{f(x_1)}{x_1} \le \frac{f(x_2)}{x_2} \quad \text{for } x_1 \le x_2 \qquad (7)$$

that is, f(x)/x should be a monotonically increasing function on the interval [0, 1], or, equivalently, f(x) = x · g(x), with g an increasing function on [0, 1]. Choices for g(x) include x, x^γ (γ > 0), log(x), and e^x.
Another way of solving this problem is through the scaling operator f'(x_i) = (x_i − min_k x_k) / (max_k x_k − min_k x_k). By applying this operator, minimum values (corresponding to low-relevancy topics) do not receive any mass at all, and the mass is divided among the more relevant topics. For example, a combination of scaling and g(x) = x^γ yields:

$$P(t_j \mid w_1^{i-1}) = \frac{\left(\mathrm{Sim}(w_1^{i-1}, t_j) - \min_k \mathrm{Sim}(w_1^{i-1}, t_k)\right)^{\gamma}}{\sum_{t'}\left(\mathrm{Sim}(w_1^{i-1}, t') - \min_k \mathrm{Sim}(w_1^{i-1}, t_k)\right)^{\gamma}} \qquad (8)$$
A third class of transformations we investigated 
considers only the closest k topics in formula (6) 
and ignores the more distant topics. 
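A sketch combining Equations (6)-(8) and the k-nearest-topic truncation (the function name and defaults are illustrative; the exact parameterizations used in the experiments are those listed in Table 1):

```python
import numpy as np

def topic_posterior(history_vec, topic_centroids, gamma=1.0, scale=True, knn=None):
    """P(t | history): transform cosine similarities to the topic centroids,
    optionally min-max scale them, raise them to a power (cf. Eq. 8), optionally
    keep only the k closest topics, then normalize as in Eq. (6)."""
    sims = np.array([
        float(history_vec @ c) /
        (np.linalg.norm(history_vec) * np.linalg.norm(c) + 1e-12)
        for c in topic_centroids
    ])
    if scale:                          # scaling operator: least relevant topic gets 0 mass
        sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-12)
    weights = sims ** gamma            # power transformation of the (scaled) similarity
    if knn is not None:                # keep only the k nearest topics
        cutoff = np.sort(weights)[-knn]
        weights = np.where(weights >= cutoff, weights, 0.0)
    return weights / (weights.sum() + 1e-12)
```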
4.3 Language Model Evaluation 
Table 1 briefly summarizes a larger table of per- 
formance measured on the bigram implementation 
³ Due to unimportant word co-occurrences.
of this adaptive topic-based LM. For the default 
parameters (indicated by *), a statistically signif- 
icant overall perplexity decrease of 10.5% was ob- 
served relative to a standard bigram model mea- 
sured on the same 1000 test documents. System- 
atically modifying these parameters, we note that 
performance is decreased by using shorter discourse 
contexts (as histories never cross discourse bound- 
aries, 5000-word histories essentially correspond to 
the full prior discourse). Keeping other parame- 
ters constant, g(x) = x outperforms other candidate 
transformations g(x) = 1 and g(x) = e^x. Absence
of k-nn and use of scaling both yield minor perfor- 
mance improvements. 
It is important to note that for 66% of the vo- 
cabulary the topic-based LM is identical to the core 
bigram model. On the 34% of the data that falls in 
the model's target vocabulary, however, perplexity 
reduction is a much more substantial 33.5% improve- 
ment. The ability to isolate a well-defined target 
subtask and perform very well on it makes this work 
especially promising for use in model combination. 
5 Conclusion 
In this paper we described a novel method of gen- 
erating and applying hierarchical, dynamic topic- 
based language models. Specifically, we have pro- 
posed and evaluated hierarchical cluster genera- 
tion procedures that yield specially balanced and 
pruned trees directly optimized for language mod- 
eling purposes. We also presented a novel hierar-
chical interpolation algorithm for generating a lan- 
guage model from these trees, specializing in the 
hierarchical topic-conditional probability estimation 
for a target topic-sensitive vocabulary (34% of the 
entire vocabulary). We also proposed and evaluated
a range of dynamic topic detection procedures
based on several transformations of content-vector 
similarity measures. These dynamic estimations of 
P(topic_i | history) are combined with the hierarchical
estimation of P(word_j | topic_i, history) in a product
across topics, yielding a final probability estimate 
of P(word_j | history) that effectively captures long-
distance lexical dependencies via these intermediate 
topic models. Statistically significant reductions in 
perplexity are obtained relative to a baseline model, 
both on the entire text (10.5%) and on the target 
vocabulary (33.5%). This large improvement on a 
readily isolatable subset of the data bodes well for 
further model combination. 
Acknowledgements 
The research reported here was sponsored by Na- 
tional Science Foundation Grant IRI-9618874. The 
authors would like to thank Eric Brill, Eugene Char- 
niak, Ciprian Chelba, Fred Jelinek, Sanjeev Khudan- 
pur, Lidia Mangu and Jun Wu for suggestions and 
feedback during the progress of this work, and An- 
dreas Stolcke for use of his hierarchical clustering 
tools as a basis for some of the clustering software 
developed here. 

References 
P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, 
F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin.
1990. A statistical approach to machine transla- 
tion. Computational Linguistics, 16(2). 
Ciprian Chelba and Fred Jelinek. 1998. Exploiting 
syntactic structure for language modeling. In Pro- 
ceedings COLING-ACL, volume 1, pages 225-231, 
August. 
Stanley F. Chen and Joshua Goodman. 1998. 
An empirical study of smoothing techniques for
language modeling. Technical Report TR-10-98, 
Center for Research in Computing Technology, 
Harvard University, Cambridge, Massachusetts,
August. 
Richard O. Duda and Peter E. Hart. 1973. Pattern
Classification and Scene Analysis. John Wiley & 
Sons. 
Radu Florian. 1998. Exploiting nonlocal word
relationships in language models. Technical report,
Computer Science Department, Johns Hopkins University.
http://nlp.cs.jhu.edu/~rflorian/papers/topic-lm-tech-rep.ps.
I. J. Good. 1953. The population frequencies of species
and the estimation of population parameters. Biometrika,
40, parts 3-4:237-264.
Rukmini Iyer and Mari Ostendorf. 1996. Modeling 
long distance dependence in language: Topic mix- 
tures vs. dynamic cache models. In Proceedings 
of the International Conference on Spoken Lan-
guage Processing, volume 1, pages 236-239. 
Rukmini Iyer, Mari Ostendorf, and J. Robin 
Rohlicek. 1994. Language modeling with 
sentence-level mixtures. In Proceedings ARPA 
Workshop on Human Language Technology, pages 
82-87. 
Slava Katz. 1987. Estimation of probabilities from 
sparse data for the language model component 
of a speech recognizer. IEEE Transactions on
Acoustics, Speech, and Signal Processing,
ASSP-35(3):400-401, March 1987.
Sanjeev Khudanpur and Jun Wu. 1999. A maxi- 
mum entropy language model integrating n-gram 
and topic dependencies for conversational speech 
recognition. In Proceedings of ICASSP.
R. Kuhn and R. de Mori. 1992. A cache-based natural
language model for speech recognition. IEEE
Transactions on PAMI, 13:570-583.
R. Lau, Ronald Rosenfeld, and Salim Roukos. 1993. 
Trigger based language models: a maximum en- 
tropy approach. In Proceedings ICASSP, pages 
45-48, April. 
S. Lowe. 1995. An attempt at improving recognition 
accuracy on switchboard by using topic identifi- 
cation. In 1995 Johns Hopkins Speech Workshop, 
Language Modeling Group, Final Report. 
Lidia Mangu. 1997. Hierarchical topic-sensitive 
language models for automatic speech recog- 
nition. Technical report, Computer Sci- 
ence Department, Johns Hopkins University. 
http://nlp.cs.jhu.edu/~lidia/papers/tech-repl.ps.
Ronald Rosenfeld. 1994. A hybrid approach to 
adaptive statistical language modeling. In Pro- 
ceedings ARPA Workshop on Human Language 
Technology, pages 76-87. 
G. Salton and M. McGill. 1983. Introduction to
Modern Information Retrieval. McGraw-Hill, New
York.
Kristie Seymore and Ronald Rosenfeld. 1997. Using 
story topics for language model adaptation. In
EuroSpeech97, volume 4, pages 1987-1990. 
Kristie Seymore, Stanley Chen, and Ronald Rosen- 
feld. 1998. Nonlinear interpolation of topic mod- 
els for language model adaptation. In Proceedings 
of ICSLP98. 
J. H. Wright, G. J. F. Jones, and H. Lloyd-Thomas. 
1993. A consolidated language model for speech 
recognition. In Proceedings EuroSpeech, volume 2, 
pages 977-980. 
