Topic Analysis Using a Finite Mixture Model 
Hang Li and Kenji Yamanishi 
NEC Corporation 
{lihang,yamanisi}@ccm.cl.nec.co.jp
Abstract 
We address the problem of 'topic analysis,'
which determines a text's topic structure:
what topics are included in a text, and how
those topics change within the text.
We propose a novel approach to this issue, one 
based on statistical modeling and learning. 
We represent topics by means of word clusters, 
and employ a finite mixture model to repre- 
sent a word distribution within a text. Our 
experimental results indicate that our method 
significantly outperforms a method that com- 
bines existing techniques. 
1 Introduction 
We consider here the problem of 'topic analysis,'
which determines a text's topic structure:
what topics are included in a text and how
those topics change within the text.
Topic analysis consists of two main
tasks: topic identification and text segmen- 
tation (based on topic changes). 
Topic analysis is extremely useful in a variety
of text processing applications. For example,
it can be used in the automatic indexing
of texts for purposes of information retrieval. 
With it, one can understand what the main 
topics and subtopics of a text are, and where 
those subtopics lie within the text. 
To the best of our knowledge, however, no 
previous study has so far dealt with the topic 
analysis problem in the above sense. The 
most closely related are key word extraction 
and text segmentation. A keyword extrac- 
tion method (e.g., that using tf-idf (Salton 
and Yang, 1973)) generally extracts from a 
text key words which represent topics within 
the text, but it does not conduct segmenta- 
tion. A segmentation method (e.g., TextTil- 
ing (Hearst, 1997)) generally segments a text 
into blocks (paragraphs) in accord with topic 
changes within the text, but it does not iden- 
tify (or label) by itself the topics discussed in 
each of the blocks. 
The purpose of this paper is to provide a
single framework for conducting topic analy- 
sis, i.e., performing both topic identification 
and text segmentation. 
The key characteristics of our framework 
are 1) representing a topic by means of a clus- 
ter of words that are closely related to the 
topic, and 2) employing a stochastic model, 
called a finite mixture model (e.g., (Everitt
and Hand, 1981)), to represent a word dis- 
tribution within a text. The finite mixture 
model has a hierarchical structure of probabil- 
ity distributions. The first level is a probabil- 
ity distribution of topics (topic distribution). 
The second level consists of probability distri- 
butions of words included within topics (word 
distributions). These word distributions are 
linearly combined to represent a word distri- 
bution within a text, with the topic distribu- 
tion being used as the coefficient vector. Here- 
after we refer to a finite mixture model hav- 
ing this structure as a stochastic topic model 
(STM). 
Before conducting topic analysis, we create 
word clusters (topics) on the basis of word co- 
occurrence in corpus data. We have devel- 
oped a new method for word clustering using 
stochastic complexity (or the MDL principle) 
(Rissanen, 1996). 
In topic analysis, we estimate a sequence 
of STMs that would have given rise to a given 
text, assuming that each block of a text is gen- 
erated by an individual STM. We perform text 
segmentation by detecting significant differ- 
ences between STMs and perform topic iden- 
tification by means of estimation of STMs. 
With the results, we obtain the text's topic 
structure which consists of segmented blocks 
and their topics. 
It is possible to perform topic analysis 
by combining an existing word extraction 
method (e.g., tf-idf) and an existing text seg- 
mentation method (e.g., TextTiling). Specif- 
ically, one can extract key words from a text 
using tf-idf, view these extracted key words 
as topics, segment the text into blocks us- 
ing TextTiling, and estimate the distribution 
of topics (key words) within each block.
Experimental results indicate, however, that our
method significantly outperforms such a combined
method in topic identification and outperforms
it in text segmentation, because it
utilizes word cluster information and employs 
a well-defined probability framework. 
Finite mixture models have been employed 
in a number of text processing applications, 
such as text classification (e.g., (Li and Yamanishi,
1997; Nigam et al., 2000)) and information
retrieval (e.g., (Hofmann, 1999)). As
will be discussed, however, our definition of a 
finite mixture model and the way we use it 
here differ significantly.
2 Stochastic Topic Model 
2.1 Topic 
While the term 'topic' is used in different ways 
in different linguistic theories, we simply view 
it here as a subject within a text. We rep- 
resent a topic by means of a cluster of words 
that are closely related to the topic, assum- 
ing that a cluster has a seed word (or several 
seed words) which indicates a topic. Figure 1 
shows an example topic with the word 'trade' 
being the seed word. 
trade: trade export import tariff trader GATT protectionist
Figure 1: Example topic 
2.2 Definition of STM 
Let $W$ denote a set of words, and $K$ a set of
topics. We first define a distribution of topics
(clusters) $P(k)$: $\sum_{k \in K} P(k) = 1$. Then, for
each topic $k \in K$, we define a probability
distribution of words $P(w|k)$: $\sum_{w \in W} P(w|k) = 1$.
Here the value of $P(w|k)$ will be zero if $w$ is
not included in $k$. We next define a Stochastic
Topic Model (STM) as a finite mixture
model, which is a linear combination of the
word probability distributions $P(w|k)$, with
the topic distribution $P(k)$ being used as the
coefficient vector. The probability of word $w$
in $W$ is, then,
$$P(w) = \sum_{k \in K} P(k) P(w|k), \quad w \in W.$$
Figure 2 depicts an example STM. 
Figure 2: Example STM 
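As a minimal illustrative sketch (not the authors' implementation), the mixture above can be evaluated directly from a topic distribution and per-topic word distributions. The function name `stm_word_prob` and all parameter values below are hypothetical, chosen only to make the linear combination concrete.

```python
def stm_word_prob(word, topic_probs, word_given_topic):
    """P(w) = sum over topics k of P(k) * P(w|k).

    P(w|k) is treated as 0 when the word is not in cluster k.
    """
    return sum(p_k * word_given_topic[k].get(word, 0.0)
               for k, p_k in topic_probs.items())


# Toy parameters (illustrative values only, not taken from the paper).
topic_probs = {"trade": 0.6, "Japan": 0.4}
word_given_topic = {
    "trade": {"trade": 0.5, "export": 0.3, "tariff": 0.2},
    "Japan": {"Japan": 0.7, "Japanese": 0.3},
}
vocab = ["trade", "export", "tariff", "Japan", "Japanese"]
probs = {w: stm_word_prob(w, topic_probs, word_given_topic) for w in vocab}
```

Because each P(w|k) sums to one and P(k) sums to one, the resulting P(w) is itself a probability distribution over the vocabulary.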
For the purposes of statistical modeling, it 
is advantageous to conceive of a text (i.e., a 
word sequence) as having been generated by 
some 'true' STMs, which we then seek to esti- 
mate as closely as possible. A text may have a 
number of blocks, and each block is assumed 
to be generated by an individual STM. The 
STMs within a text are assumed to have the 
same set of topics, but have different param- 
eter values. 
From the linguistic viewpoint, a text gener- 
ally focuses on a single main topic, but it may 
discuss different subtopics in different blocks. 
While a text is discussing any one topic, it will 
more frequently use words strongly related to 
that topic. Hence, STM is a natural represen- 
tation of statistical word occurrence based on 
topics. 
3 Word Clustering 
Before conducting topic analysis, we create 
word clusters using a large data corpus. More 
precisely, we treat all words in a vocabulary as 
seed words, and for each seed word we collect 
from the data those words which frequently 
co-occur with it and group them into a cluster. 
As one example, the word-cluster in Figure 1 
has been constructed with the word 'trade' as 
the seed word. 
We have developed a new method for reli- 
ably collecting frequently co-occurring words 
on the basis of stochastic complexity, or the 
MDL principle. For a given data sequence
$x^m = x_1 \cdots x_m$ and for a fixed probability
model $M$,$^1$ the stochastic complexity of $x^m$
relative to $M$, which we denote as $SC(x^m : M)$,
is defined as the least code length required
to encode $x^m$ with $M$ (Rissanen, 1996).
$SC(x^m : M)$ can be interpreted as the amount of
information included in $x^m$ relative to $M$. The
1 Here, we use 'model' to refer to a probability
distribution which has specified parameters but
unspecified parameter values.
MDL (Minimum Description Length) princi- 
ple is a model selection criterion which asserts 
that, for a given data sequence, the lower a 
model's SC value, the greater its likelihood of 
being a model which would have actually gen- 
erated the data. MDL has many good prop- 
erties as a criterion for model selection. 2 
For a fixed seed word s, we take a word w as 
a frequently co-occurring word if the presence 
of s is a statistically significant indicator of 
the presence of w. 
Let a data sequence $(s_1, w_1), (s_2, w_2), \ldots,
(s_m, w_m)$ be given, where $(s_i, w_i)$ denotes the
state of co-occurrence of words $s$ and $w$ in
the $i$-th text in the corpus data. Here, $s_i \in
\{1, 0\}$, $w_i \in \{1, 0\}$ $(i = 1, \ldots, m)$; 1 denotes
the presence of a word, and 0 its absence.
We further denote $s^m = s_1 \cdots s_m$ and
$w^m = w_1 \cdots w_m$.
Then as in (Rissanen, 1996), the SC value of
$w^m$ relative to a model $I$ in which the presence
or absence of $w$ is independent of that of
$s$ (i.e., a Bernoulli model) is calculated as
$$SC(w^m : I) = m H\left(\frac{m^+}{m}\right) + \frac{1}{2} \log \frac{m}{2\pi} + \log \pi,$$
where $m^+$ denotes the number of 1's in $w^m$.
Here, log denotes the logarithm to the base
2, $\pi$ the circular constant, and
$H(z) \stackrel{\mathrm{def}}{=} -z \log z - (1 - z) \log(1 - z)$ when $0 < z < 1$;
$H(z) \stackrel{\mathrm{def}}{=} 0$ when $z = 0$ or $z = 1$.
Let $w^{m_s}$ be the sequence of all $w_i$'s ($w_i \in
w^m$) whose corresponding $s_i$ is 1, where
$m_s$ denotes the number of 1's in $s^m$. Let $w^{m_{\neg s}}$
be the sequence of all $w_i$'s ($w_i \in w^m$) whose
corresponding $s_i$ is 0, where $m_{\neg s}$ denotes
the number of 0's in $s^m$. The SC value of $w^m$
relative to a model $D$ in which the presence
or absence of $w$ is dependent on those of $s$ is
then calculated as
$$SC(w^m : D) = \left( m_s H\left(\frac{m_s^+}{m_s}\right) + \frac{1}{2} \log \frac{m_s}{2\pi} + \log \pi \right)
+ \left( m_{\neg s} H\left(\frac{m_{\neg s}^+}{m_{\neg s}}\right) + \frac{1}{2} \log \frac{m_{\neg s}}{2\pi} + \log \pi \right),$$
where $m_s^+$ denotes the number of 1's in $w^{m_s}$,
and $m_{\neg s}^+$ the number of 1's in $w^{m_{\neg s}}$.
2 For an introduction to MDL, see (Li, 1998).
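We can then calculate
$$\delta SC = \frac{1}{m}\left(SC(w^m : I) - SC(w^m : D)\right)
= \left[ H\left(\frac{m^+}{m}\right) - \frac{m_s}{m} H\left(\frac{m_s^+}{m_s}\right) - \frac{m_{\neg s}}{m} H\left(\frac{m_{\neg s}^+}{m_{\neg s}}\right) \right]
- \left\{ \frac{1}{2m} \log \frac{m_s m_{\neg s} \pi}{2m} \right\}. \quad (1)$$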
According to the MDL principle, the larger
the $\delta SC$ value, the more likely that the presence
or absence of $w$ is dependent on those of
$s$.$^3$
In practice, we regard a word $w$ as occurring
significantly frequently with the seed word $s$
if its $\delta SC$ value is larger than a predetermined
threshold $\gamma$ and $P(w|s) > P(w)$
is satisfied.
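The co-occurrence test of this section can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the SC expression is the Bernoulli form described above with base-2 logarithms, and treating an empty sub-sequence as having zero code length is an assumption made here to keep the sketch total.

```python
import math


def entropy(z):
    """Binary entropy H(z) in bits, with H(0) = H(1) = 0."""
    if z <= 0.0 or z >= 1.0:
        return 0.0
    return -z * math.log2(z) - (1 - z) * math.log2(1 - z)


def sc_bernoulli(n, n_pos):
    """SC of a binary sequence of length n containing n_pos ones."""
    if n == 0:
        return 0.0  # assumption: an empty sub-sequence costs nothing
    return (n * entropy(n_pos / n)
            + 0.5 * math.log2(n / (2 * math.pi))
            + math.log2(math.pi))


def delta_sc(pairs):
    """pairs: one (s_i, w_i) tuple per text, each value 0 or 1."""
    m = len(pairs)
    w_with_s = [w for s, w in pairs if s == 1]
    w_without_s = [w for s, w in pairs if s == 0]
    sc_independent = sc_bernoulli(m, sum(w for _, w in pairs))
    sc_dependent = (sc_bernoulli(len(w_with_s), sum(w_with_s))
                    + sc_bernoulli(len(w_without_s), sum(w_without_s)))
    return (sc_independent - sc_dependent) / m


def co_occurs(pairs, gamma):
    """True if w passes both the delta-SC and the P(w|s) > P(w) tests."""
    n_s = sum(s for s, _ in pairs)
    if n_s == 0:
        return False
    p_w = sum(w for _, w in pairs) / len(pairs)
    p_w_given_s = sum(w for s, w in pairs if s == 1) / n_s
    return delta_sc(pairs) > gamma and p_w_given_s > p_w


dependent = [(1, 1)] * 50 + [(0, 0)] * 50
independent = [(1, 1), (1, 0), (0, 1), (0, 0)] * 25
```

On the toy data, `dependent` (where w appears exactly when s does) yields a large positive value, while `independent` yields a slightly negative one, since the extra parameter of the dependent model is not paid for by any gain in fit.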
Note that the word clustering process is 
independent of topic analysis. While one 
could employ other methods (e.g., (Hofmann, 
1999)) here for word clustering, our clustering
algorithm is more efficient than conventional
ones. For example, Hofmann's is
of order $O(|D||W|^2)$, while ours is only of
$O(|D| + |W|^2)$, where $|D|$ denotes the number
of texts and $|W|$ the number of words. That
means that our method is more practical when 
a large amount of text data is available. 
4 Topic Analysis 
4.1 Input and Output 
In topic analysis, we use STM to parse a 
given text and output a topic structure which 
consists of segmented blocks and their top- 
ics. Figure 3 shows an example topic struc- 
ture as output with our method. The text has 
been segmented into five blocks, and to each 
block, a number of topics having high probability
values have been assigned (topics are
represented by their seed words). The topic 
structure clearly represents what topics are in- 
cluded in the text and how the topics change 
within the text. 
4.2 Outline 
Our topic analysis consists of three processes: 
a pre-process called 'topic spotting,' text seg- 
mentation, and topic identification. In topic 
3 Note that the quantity within [...] in (1) is (empirical)
mutual information, which is an effective measure
for word co-occurrence calculation (cf. (Brown et
al., 1992)). When the sample size is small, mutual
information values tend to be undesirably large. The
quantity within {...} in (1) can help avoid this
undesirable tendency because its value will become large
when data size is small.
ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT (25-MAR-1987)
block 0 ........ trade-export-tariff-import(0.12) Japan-Japanese(0.07) US(0.06)
0 Mounting trade friction between the U.S. and Japan has raised fears among many of Asia's exporting nations that the row could inflict ...
1 They told Reuter correspondents in Asian capitals a U.S. move against Japan might boost protectionist sentiment in the U.S. and lead to ...
2 But some exporters said that while the conflict would hurt them in the long-run, in the short-term Tokyo's loss might be their gain.
3 The U.S. has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17, in retaliation for Japan's ...
4 Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would ...
5 "We wouldn't be able to do business," said a spokesman for leading Japanese electronics firm Matsushita Electric Industrial Co Ltd ...
6 "If the tariffs remain in place for any length of time beyond a few months it will mean the complete erosion of exports (of goods subject ...
block 1 ........ trade-export-tariff-import(0.17) US(0.09) Taiwan(0.05) dlrs(0.05)
7 In Taiwan, businessmen and officials are also worried.
8 "We are aware of the seriousness of the U.S. threat against Japan because it serves as a warning to us," said a senior Taiwanese trade ...
9 Taiwan had a trade trade surplus of 15.6 billion dlrs last year, 95 pct of it with the U.S.
10 The surplus helped swell Taiwan's foreign exchange reserves to 53 billion dlrs, among the world's largest.
11 "We must quickly open our markets, remove trade barriers and cut import tariffs to allow imports of U.S. products, if we want to defuse ...
12 A senior official of South Korea's trade promotion association said the trade dispute between the U.S. and Japan might also lead to ...
13 Last year South Korea had a trade surplus of 7.1 billion dlrs with the U.S., up from 4.9 billion dlrs in 1985.
14 In Malaysia, trade officers and businessmen said tough curbs against Japan might allow hard-hit producers of semiconductors in third ...
block 2 ........ Hong-Kong(0.16) trade-export-tariff-import(0.10) US(0.06)
15 In Hong Kong, where newspapers have alleged Japan has been selling below-cost semiconductors, some electronics manufacturers share ...
16 "That is a very short-term view," said Lawrence Mills, director-general of the Federation of Hong Kong Industry.
17 "If the whole purpose is to prevent imports, one day it will be extended to other sources. Much more serious for Hong Kong is the ...
18 The U.S. last year was Hong Kong's biggest export market, accounting for over 30 pct of domestically produced exports.
block 3 ........ trade-export-tariff-import(0.14) Button(0.08) Japan-Japanese(0.07)
19 The Australian government is awaiting the outcome of trade talks between the U.S. and Japan with interest and concern, Industry ...
20 "This kind of deterioration in trade relations between two countries which are major trading partners of ours is a very ...
21 He said Australia's concerns centred on coal and beef, Australia's two largest exports to Japan and also significant U.S. ...
22 Meanwhile U.S.-Japanese diplomatic manoeuvres to solve the trade stand-off continue.
block 4 ........ Japan-Japanese(0.12) measure(0.06) trade-export-tariff-import(0.05)
23 Japan's ruling Liberal Democratic Party yesterday outlined a package of economic measures to boost the Japanese economy.
24 The measures proposed include a large supplementary budget and record public works spending in the first half of the financial year.
25 They also call for stepped-up spending as an emergency measure to stimulate the economy despite Prime Minister Yasuhiro Nakasone ...
26 Deputy U.S. Trade Representative Michael Smith and Makoto Kuroda, Japan's deputy minister of International Trade and Industry (MITI), ...
0-26: sentence id
(..): probability value
Figure 3: Topic structure of text 
spotting, we select topics discussed in a given 
text. We can then construct STMs on the 
basis of the topics. In text segmentation, we 
segment the text on the basis of the STMs, 
assuming that each block is generated by an 
individual STM. In topic identification, we es- 
timate the parameters of the STM for each 
segmented block and select topics with high 
probabilities for the block. In this way, we 
obtain a topic structure for the text. 
4.3 Topic Spotting 
In topic spotting, we first select key words 
from a given text. We calculate what we call 
the Shannon information of each word in the 
text. The Shannon information of word w in 
text t is defined as 
I(w) = -N(w)logP(w), 
where N(w) denotes the frequency of w in t, 
and P(w) the probability of the occurrence of 
w as estimated from corpus data. I(w) may 
be interpreted as the amount of information 
represented by $w$. We select as key words the
top $l$ words sorted in descending order of $I$.
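The key word selection step can be sketched as follows. This is only an illustration under stated assumptions: `shannon_keywords` and `corpus_prob` are hypothetical names, the corpus probabilities are toy values, the logarithm is taken to base 2, and words with no corpus estimate are simply skipped (the paper does not say how such words are handled).

```python
import math
from collections import Counter


def shannon_keywords(tokens, corpus_prob, top_l):
    """Rank the words of one text by I(w) = -N(w) * log2 P(w).

    tokens: list of (stemmed, stop-word-filtered) words of the text.
    corpus_prob: dict word -> P(w) estimated from corpus data.
    Words missing from corpus_prob are skipped (an assumption).
    """
    counts = Counter(tokens)
    scores = {w: -n * math.log2(corpus_prob[w])
              for w, n in counts.items() if w in corpus_prob}
    return sorted(scores, key=scores.get, reverse=True)[:top_l]


tokens = ["tariff", "tariff", "trade", "said"]
corpus_prob = {"tariff": 0.001, "trade": 0.01, "said": 0.1}
keywords = shannon_keywords(tokens, corpus_prob, top_l=2)
```

A word scores highly when it is frequent in the text but rare in the corpus, which is the same intuition that underlies tf-idf.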
While Shannon information is similar to 
the tf-idf widely used in information retrieval 
(e.g., (Salton and Yang, 1973)), the use of 
Shannon information can be justified on the 
basis of information theory, but that of tf-idf 
cannot. Our preliminary experimental results 
indicate that Shannon information performs 
better than or at least as well as tf-idf in key 
word extraction. 4 
From the results of word clustering, we next 
select any cluster (topic) whose seed word is 
included among the selected key words. 
We next merge any two clusters if one of 
their seed words is included in the other's clus- 
ter. For example, when a cluster with seed 
word 'trade' contains the word 'import,' and 
a cluster with seed word 'import' contains the 
word 'trade,' we merge the two. After two 
such merges, we may obtain a relatively large 
cluster with, for example, 'trade-import-tariff-export'
as its seed words, as is shown in Figure 3.
Figure 4 shows the merging algorithm.
In this way, we obtain the most conspicuous 
and mutually independent topics discussed in 
a given text. 
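The merging step can be sketched with a small union-find structure in place of the explicit queue of Figure 4. This is a hedged sketch, not the authors' code: `merge_clusters` is a hypothetical name, and the merge condition used is the mutual one of Figure 4 (each seed word appears in the other's cluster).

```python
def merge_clusters(clusters):
    """clusters: dict mapping seed word -> set of member words.

    Returns a list of (sorted seed words, merged member set) pairs.
    """
    seeds = list(clusters)
    parent = {s: s for s in seeds}

    def find(s):
        # Follow parent links to the representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Link two clusters when each seed word occurs in the other's cluster.
    for i, a in enumerate(seeds):
        for b in seeds[i + 1:]:
            if a in clusters[b] and b in clusters[a]:
                parent[find(a)] = find(b)

    groups = {}
    for s in seeds:
        groups.setdefault(find(s), []).append(s)
    return [(sorted(g), set().union(*(clusters[s] for s in g)))
            for g in groups.values()]


clusters = {
    "trade": {"trade", "import", "export"},
    "import": {"import", "trade", "tariff"},
    "Japan": {"Japan", "Japanese"},
}
merged = merge_clusters(clusters)
```

Here the 'trade' and 'import' clusters are merged into one topic with two seed words, while 'Japan' is left on its own.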
4.4 Text Segmentation 
In segmentation, we first identify candidates 
for points of segmentation within the given 
text. When we assume a relatively short text 
4 We will discuss this in the full version of the paper.
k_1, ..., k_n: clusters,
V = {{k_i}, i = 1, 2, ..., n}.
For each cluster pair (k_i, k_j), if the seed
word of k_i is included in k_j and the seed
word of k_j is included in k_i, then push
(k_i, k_j) into queue Q;
while (Q ≠ ∅) {
  Remove the first element (k_i, k_j) from Q;
  if (k_i and k_j belong to different sets
      W_1, W_2 in V)
    Replace W_1 and W_2 in V with
    W_1 ∪ W_2; }
For each element W of V, merge the
clusters in it.
Figure 4: Algorithm: merge 
for the purposes of our explanation here, all 
sentence-ending periods will be candidates. 
For each candidate, we create two pseudo- 
texts, one consisting of the h sentences pre- 
ceding it, and the other of the h sentences 
following it (when fewer than h exist in any 
direction, we simply use those which do exist).
We use the EM algorithm ((Dempster et al., 
1977), cf., Figure 5) to separately estimate the
parameters of an STM from each of the two 
pseudo texts. It is theoretically guaranteed 
that the EM algorithm converges to a local 
maximum of the likelihood. We next calculate 
the similarity (i.e., essentially the converse
notion of distance$^5$) between the STM based
on the preceding pseudo-text, and the STM 
based on the following pseudo-text. These 
STMs are denoted, respectively, as $P_L(w)$ and
$P_R(w)$. The similarity between $P_L(w)$ and
$P_R(w)$ is defined as
$$S(L \| R) = 1 - \frac{\sum_{w \in W} |P_L(w) - P_R(w)|}{2}.$$
The numerator is referred to in statistics as 
variational distance and has good properties 
as a distance between two probability dis- 
tributions (cf., (Cover and Thomas, 1991), 
p.299). 
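A minimal sketch of this similarity, assuming each STM's word distribution is available as a dictionary from words to probabilities (the function name `similarity` and the toy distributions are illustrative only):

```python
def similarity(p_left, p_right):
    """S(L||R) = 1 - (sum over w of |P_L(w) - P_R(w)|) / 2.

    p_left, p_right: dicts word -> probability; a missing word has probability 0.
    """
    words = set(p_left) | set(p_right)
    variational = sum(abs(p_left.get(w, 0.0) - p_right.get(w, 0.0))
                      for w in words)
    return 1.0 - variational / 2.0


same = similarity({"trade": 0.5, "export": 0.5}, {"trade": 0.5, "export": 0.5})
disjoint = similarity({"trade": 1.0}, {"Japan": 1.0})
partial = similarity({"trade": 0.5, "export": 0.5}, {"trade": 1.0})
```

Since the variational distance between two probability distributions lies between 0 and 2, the similarity lies between 0 (disjoint support) and 1 (identical distributions).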
Figure 7 shows a graph of calculated similarity
values for each of the candidates in the
5 We use similarity rather than distance here in order
to simplify comparison between our method and
TextTiling (Hearst, 1997).
s: predetermined number.
For the $l$-th iteration ($l = 1, \ldots, s$),
we calculate
$$P^{(l+1)}(k|w) = \frac{P^{(l)}(k) P^{(l)}(w|k)}{\sum_{k \in K} P^{(l)}(k) P^{(l)}(w|k)}$$
$$P^{(l+1)}(k) = \frac{\sum_{w \in W} N(w) P^{(l+1)}(k|w)}{N}$$
$$P^{(l+1)}(w|k) = \frac{N(w) P^{(l+1)}(k|w)}{\sum_{w \in W} N(w) P^{(l+1)}(k|w)}$$
$N(w)$ denotes the frequency of word $w$
in the data; $N = \sum_{w \in W} N(w)$.
Figure 5: EM algorithm 
n: number of segmentation candidates,
S(i) (i = 0, ..., n): similarity score.
for (i = 1; i < n - 1; i++) {
  if (S(i - 1) > S(i) & S(i + 1) > S(i)) {
    j = i - 1;
    while (j > 0 & S(j - 1) > S(j)) j--;
    P1 = S(j);
    j = i + 1;
    while (j < n & S(j + 1) > S(j)) j++;
    P2 = S(j);
    if (P1 - S(i) > θ & P2 - S(i) > θ)
      Conduct segmentation at i. } }
Figure 6: Algorithm: segment 
text shown in Figure 3. 'Valleys' (i.e., low- 
similarity values) in the graph suggest points 
for reasonable segmentations. In actual prac- 
tice, segmentation is performed for each valley 
whose similarity value is lower, by at least a
predetermined degree $\theta$, than each of the values of its
left 'peak' and right 'peak' (cf., Figure 6). For
example, for the text in Figure 3, segmentation
was performed at candidates (i.e., ends of
sentences) 6, 14, 18, and 22, with $\theta = 0.05$.
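The valley-detection procedure of Figure 6 translates almost line for line into Python. The sketch below keeps the figure's loop bounds; the function name `segment` and the toy similarity scores are hypothetical and are not the values plotted in Figure 7.

```python
def segment(scores, theta):
    """Return the candidate indices at which to segment.

    scores: similarity values S(0), ..., S(n) for the candidates.
    theta: minimum depth of a valley relative to both neighbouring peaks.
    """
    n = len(scores) - 1
    boundaries = []
    for i in range(1, n - 1):
        # A valley is strictly lower than both of its neighbours.
        if scores[i - 1] > scores[i] and scores[i + 1] > scores[i]:
            # Climb left while the scores keep rising to find the left peak.
            j = i - 1
            while j > 0 and scores[j - 1] > scores[j]:
                j -= 1
            left_peak = scores[j]
            # Climb right in the same way to find the right peak.
            j = i + 1
            while j < n and scores[j + 1] > scores[j]:
                j += 1
            right_peak = scores[j]
            if left_peak - scores[i] > theta and right_peak - scores[i] > theta:
                boundaries.append(i)
    return boundaries


scores = [0.30, 0.20, 0.10, 0.22, 0.31, 0.28, 0.27, 0.29, 0.30]
deep_only = segment(scores, theta=0.05)
all_valleys = segment(scores, theta=0.01)
```

With the larger threshold only the deep valley at index 2 is kept; lowering the threshold also accepts the shallow valley at index 6.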
4.5 Topic Identification 
After segmentation, we separately estimate 
the parameters of the STM for each block, 
again using the EM algorithm, and obtain 
a topic (cluster) probability distribution for 
each block. We then choose those topics (clusters)
in each block having high probability values.
In this way, we construct a topic structure
as in Figure 3 for the given text (topics
are here represented by their seed words).
[Plot: similarity value (vertical axis, 0.05 to 0.35) against sentence number (horizontal axis, 0 to 25) for STM]
Figure 7: Similarity values for segmentation
candidates
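The EM updates of Figure 5, which are used both for the pseudo-texts in segmentation and for the blocks in topic identification, can be sketched as follows. This is an illustrative sketch only: the name `em_stm`, the uniform initialization, the fixed iteration count, and the decision to ignore words that belong to no cluster are assumptions not specified in the paper.

```python
def em_stm(counts, clusters, iterations=20):
    """Estimate P(k) and P(w|k) of an STM from word counts.

    counts: dict word -> N(w) for one block or pseudo-text.
    clusters: dict topic -> set of words in that topic's cluster.
    """
    # Assumption: words outside every cluster are ignored.
    words = [w for w in counts if any(w in c for c in clusters.values())]
    total = sum(counts[w] for w in words)
    # Assumption: uniform initial parameter values.
    p_k = {k: 1.0 / len(clusters) for k in clusters}
    p_w_k = {k: {w: 1.0 / len(c) for w in c} for k, c in clusters.items()}
    if total == 0:
        return p_k, p_w_k
    for _ in range(iterations):
        # E-step: posterior P(k|w) for every word of the text.
        post = {}
        for w in words:
            joint = {k: p_k[k] * p_w_k[k].get(w, 0.0) for k in clusters}
            z = sum(joint.values())
            post[w] = {k: v / z for k, v in joint.items()}
        # M-step: re-estimate P(k) and P(w|k) from expected counts.
        for k in clusters:
            mass = sum(counts[w] * post[w][k] for w in words)
            p_k[k] = mass / total
            if mass > 0:
                p_w_k[k] = {w: counts[w] * post[w][k] / mass
                            for w in words if w in clusters[k]}
    return p_k, p_w_k


counts = {"trade": 4, "export": 2, "Japan": 3, "Japanese": 1}
clusters = {"trade": {"trade", "export", "tariff"},
            "Japan": {"Japan", "Japanese"}}
p_k, p_w_k = em_stm(counts, clusters)
```

In this toy case the clusters do not overlap, so each word is attributed entirely to one topic and the estimates reduce to relative frequencies; with overlapping clusters the posteriors share a word's count among the topics that contain it.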
We can view topics appearing in all the 
blocks as main topics, and topics appearing 
only in individual blocks as subtopics. In 
the text in Figure 3, the topic represented 
by seed-words 'trade-export-tariff-import' is 
the main topic, and 'Japan-Japanese,' 'Hong 
Kong,' etc., are subtopics. 
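The distinction between main topics and subtopics amounts to a set intersection over blocks. The following sketch uses the topics shown for the five blocks of Figure 3; the function name `split_main_and_sub` is hypothetical.

```python
def split_main_and_sub(block_topics):
    """block_topics: list of sets of topic labels, one set per block.

    Main topics appear in every block; the rest are subtopics of their block.
    """
    main = set.intersection(*block_topics) if block_topics else set()
    sub = [topics - main for topics in block_topics]
    return main, sub


trade = "trade-export-tariff-import"
block_topics = [
    {trade, "Japan-Japanese", "US"},
    {trade, "US", "Taiwan", "dlrs"},
    {"Hong-Kong", trade, "US"},
    {trade, "Button", "Japan-Japanese"},
    {"Japan-Japanese", "measure", trade},
]
main, sub = split_main_and_sub(block_topics)
```

For a text that was not segmented, the single block's topics are all returned as main topics, which matches the convention used later in the evaluation.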
5 Applications 
Our method can be used in a variety of text 
processing applications. 
For example, given a collection of texts 
(e.g., home pages), we can automatically con- 
struct an index of the texts on the basis of the 
extracted topics. We can indicate which topic 
is from which text or even which block of a 
text. Furthermore, we can indicate which top- 
ics are main topics of texts and which topics 
are subtopics (e.g., by displaying main topics 
in boldface, etc). In this way, users can get a 
fair sense of the contents of the texts simply 
by looking through the index. For a specific 
text, users can get a rough sense of the con- 
tent by looking at the topic structure as, for 
example, it is shown in Figure 3. 
Our method can also be useful for text min- 
ing, text summarization, information extrac- 
tion, and other text processing, which require 
one to first analyze the structure of a text. 
6 Related Work 
To the best of our knowledge, no previous 
study has so far dealt with topic identification 
and text segmentation within a single frame- 
work. 
A widely used method for key word extrac- 
tion calculates the tf-idf value of each word in 
a text and uses those words having the largest 
tf-idf values as key words for that text (e.g., 
(Salton and Yang, 1973)). One can view these 
extracted key words as the topics of the text. 
No keyword extraction method by itself, how- 
ever, is able to conduct segmentation. 
With respect to text segmentation, exist- 
ing methods can be classified into two groups. 
One is to divide a text into blocks (e.g., 
TextTiling (Hearst, 1997)), the other to di- 
vide a stream of texts into its original texts 
(e.g., (Allan et al., 1998; Yamron et al., 1998;
Beeferman et al., 1999; Reynar, 1999)). The
former group generally employs unsupervised
learning, while the latter employs supervised learning. No
existing segmentation method, however, has 
attempted topic identification. 
TextTiling creates for each segmentation 
candidate two pseudo-texts, one preceding it 
and the other following it, and calculates as 
similarity the cosine value between the word 
frequency vectors of the two pseudo texts. It 
then conducts segmentation at valley points 
in a similar way to that of our method. Since 
the problem setting of TextTiling (in general
the former group) is closest to that of our
study, we use TextTiling for comparison in our 
experiments. 
Our method by its nature performs topic
identification and segmentation within a single
framework. It is also possible, with a
combination of existing methods, to extract
key words from a given text by using tf-idf,
view the extracted key words as topics, segment
the text into blocks by employing TextTiling,
estimate the distribution of topics in each
block, and identify topics having high
probabilities in each block. Our method outperforms
such a combination (referred to hereafter
as 'Com') for topic identification, because
it utilizes word cluster information. It
also performs better than Com in text segmentation
because it is based on a well-defined
probability framework. Most important is
that our method is able to output an easily
understandable topic structure, which has not
been proposed so far.
Note that topic analysis is different from 
text classification (e.g., (Lewis et al., 1996; Li 
and Yamanishi, 1999; Joachims, 1998; Weiss 
et al., 1999; Nigam et al., 2000)). While text 
classification uses a number of pre-determined 
categories, topic analysis includes no notion 
of category. The output of topic analysis is a 
topic structure, while the output of text clas- 
sification is a label representing a category. 
Furthermore, text classification is generally 
based on supervised learning, which uses la- 
beled text data 6. By way of contrast, topic 
analysis is based on unsupervised learning, 
which uses only unlabeled text data. 
Finite mixture models have been used in 
a variety of applications in text processing 
(e.g., (Li and Yamanishi, 1997; Nigam et al., 
2000; Hofmann, 1999)), indicating that they 
are essential to text processing. We should 
note, however, that their definitions and the 
ways they use them are different from those
for STM in this paper. For example, Li and 
Yamanishi propose to employ in text classi- 
fication a mixture model (Li and Yamanishi, 
1997) defined over categories: 
$$P(w|c) = \sum_{k \in K} P(k|c) P(w|k), \quad w \in W,\ c \in C,$$
where W denotes a set of words, and C a 
set of categories. In their framework, a new 
text d is assigned into a category c* such that 
$c^* = \arg\max_{c \in C} P(c|d)$ is satisfied. Hofmann
proposes using in information retrieval a joint
distribution which he calls 'an aspect model,'
defined as (Hofmann, 1999)
$$P(w, d) = P(d) P(w|d) = P(d) \sum_{k \in K} P(k|d) P(w|k), \quad w \in W,\ d \in D,$$
where D denotes a set of texts. Furthermore, 
he proposes extracting in retrieval those texts 
whose estimated word distributions $P(w|d)$
are similar to the word distribution of a query. 
7 Experimental Results 
We have evaluated the performance of our 
topic analysis method (STM) in terms of three 
aspects: topic structure adequacy, text seg- 
mentation accuracy, and topic identification 
accuracy. 
7.1 Data Set 
We know of no data available for the pur- 
pose of evaluation of topic analysis. We thus 
utilized Reuters news articles referred to as 
'Reuters-21578,' which has been widely used 
in text classification.$^7$ We used a prepared
6 An exception is the method proposed in (McCallum
and Nigam, 1999), which, instead of labeled texts,
uses unlabeled texts, pre-determined categories, and
keywords defined by humans for each category.
7 Available at http://www.research.att.com/lewis/.
split of the data 'Apte split,' which consists 
of 9603 texts for training and 3299 texts for 
test. All of the texts had already been classi- 
fied into 90 categories by human subjects. 
For each text, we used the Oxford Learner's 
Dictionary$^8$ to conduct stemming, and
removed 'stop words' (e.g., 'the,' 'and') that we
had included on a previously prepared list. 
The average length of a text was about 115 
words. (We did not use phrases, however, 
which would further improve experimental results.)
7.2 Word Clustering 
We conducted word clustering with 9603 
training texts. 7340 individual words had a 
total frequency of more than 5, and we used 
them as seeds with which to collect frequently 
co-occurring words. The threshold for clustering
$\gamma$ was set at 0.005, and this yielded
970 word clusters having more than one word 
(i.e., not simply containing a seed word alone). 
Note that the category labels of the training 
texts need not be used in clustering. 
We next conducted a topic analysis on all 
the 3299 texts. The thresholds of $l$, $h$, and $\theta$
were set at 20, 3, and 0.05, respectively, on 
the basis of preliminary experimental results. 
7.3 Topic Structure 
We looked at the topic structures of the 3299 
texts obtained by our method to determine 
how well they conformed to human intuition. 
For topic identification in this experiment, 
clusters in each block were sorted in descend- 
ing order of their probabilities, and the top 
7 seed words were extracted to represent the 
topics of the block. 
Figure 3 shows results for the text with ID
14826; they generally agree well with human 
intuition. The text has been segmented into 
5 blocks and the topics of each block are
represented by 7 seed words. The main topic is
represented by the seed-words 'trade-export- 
tariff-import.' The subtopics are represented 
by 'Japan-Japanese,' 'Taiwan,' 'Hong Kong,' 
etc. There were, however, a small number 
of errors. For example, the text should also 
have been segmented after sentences 11 and 
13, but, due to limited sentence content, it was 
not. Furthermore, assigning the subtopic 'Button'
(from 'Mr. Button') to block 3 (due
to the high Shannon information value of the 
word 'Button') was also undesirable. 
8 Available at ftp://sable.ox.ac.uk.
Table 1: 10 categories and their identification words

category    identification words
earn        earning, share, profit, dividend
acq         acquisition, acquire, sell, buy
money-fx    currency, dollar, yen, stg
grain       grain, cereal, crop
crude       oil, crude, gas
trade       trade, export, import, tariff
interest    interest & rate
ship        ship, vessel, ferry, tanker
wheat       wheat
corn        corn, maize
7.4 Main Topic Identification 
We conducted an evaluation to determine 
whether or not the main topics in the topic 
structures obtained for the 3299 test texts 
could be approximately matched with the la- 
bels (categories) assigned to the test texts. 
Note that here labels are used only for eval- 
uation, not for training. This is in contrast 
to the situation in most text classification ex- 
periments, in which labels are generally used 
both for training and for evaluation. It is not 
particularly meaningful, then, to compare the 
results for main topic identification obtained 
here with those for text classification. 
With STM, clusters in each block were 
sorted in descending order of their probabil- 
ities, and the top k seed words were extracted 
to represent the topics of the block. Further- 
more, a seed word appearing in all the blocks 
of the text was considered to represent a main 
topic. When a text had not been segmented 
(i.e., has only one block), all top k seed words 
were considered to represent main topics. 
Table 1 lists the 10 largest categories in the
Reuters data. On the basis of the definition of
each of the 10 categories, we used our own
intuition to assign to each of them the identification
words that are listed in Table 1.
For the evaluation, when the seed words for 
main topics contained at least one of the iden- 
tification words, we considered our method to 
have identified the corresponding main topic 
equivalent to a human-determined category. 
We then evaluated these in terms of preci- 
sion and recall. Here, precision is defined as 
the ratio of the number of decisions correctly 
made to the total number of decisions made. 
Recall is defined as the ratio of the number of
decisions correctly made to the total number 
Table 2: Main topic identification results with 
respect to 7 top words 
category 
earn 
acq 
money-fx 
grain 
crude 
trade 
interest 
ship 
wheat 
corn 
STM 
rec. pre. 
Com 
rec. pre. 
0.790 0.971 
0.245 0.854 
0.436 0.456 
0.322 0.750 
0.487 0.676 
0.667 0.473 
0.107 0.700 
0.247 0.957 
0.620 0.936 
0.429 0.960 
0.526 0.976 
0.184 0.841 
0.285 0.421 
0.174 0.650 
0.407 0.664 
0.590 0.356 
0.084 0.733 
0.270 0.828 
0.408 0.967 
0.446 1.00 
micro-average 0.515 0.824 0.365 0.774 
Table 3: Main topic identification results with 
respect to 5 top words 
category 
earn 
acq 
money-fx 
grain 
crude 
trade 
interest 
ship 
wheat 
corn 
micro-average 
STM 
rec. pre. 
0.742 0.971 
0.184 0.868 
0.413 0.503 
0.295 0.759 
0.471 0.718 
0.479 0.505 
0.053 0.700 
0.169 1.000 
0.577 0.953 
0.357 0.952 
0.461 0.850 
Corn 
rec. pre. 
0.348 0.977 
0.120 0.869 
0.268 0.471 
0.121 0.600 
0.333 0.656 
0.513 0.403 
0.069 0.818 
0.180 0.762 
0.282 0.952 
0.321 1.000 
0.257 0.767 
of decisions which should have been correctly 
made. 
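The definitions of precision and recall given in this section, together with the micro-averages reported in Tables 2 and 3, can be illustrated with a short sketch. It assumes that a 'decision' is the assignment of one category to one text and that micro-averaging pools the decision counts over all categories before taking the ratios, which is the usual reading of the term; the function names and the toy data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def count_decisions(
    predicted: List[Set[str]], gold: List[Set[str]], category: str
) -> Tuple[int, int, int]:
    """Return (correct, made, should_have_been_made) for one category."""
    correct = made = wanted = 0
    for pred, true in zip(predicted, gold):
        if category in pred:
            made += 1
        if category in true:
            wanted += 1
        if category in pred and category in true:
            correct += 1
    return correct, made, wanted


def precision_recall(correct: int, made: int, wanted: int) -> Tuple[float, float]:
    """Precision = correct / made; recall = correct / wanted (0.0 if undefined)."""
    precision = correct / made if made else 0.0
    recall = correct / wanted if wanted else 0.0
    return precision, recall


def micro_average(
    predicted: List[Set[str]], gold: List[Set[str]], categories: Iterable[str]
) -> Tuple[float, float]:
    """Pool the counts over all categories, then compute precision and recall."""
    totals = [0, 0, 0]
    for category in categories:
        for i, value in enumerate(count_decisions(predicted, gold, category)):
            totals[i] += value
    return precision_recall(*totals)


# Toy data: categories identified for three texts vs. their assigned labels.
predicted = [{"wheat", "grain"}, {"earn"}, set()]
gold = [{"wheat"}, {"earn"}, {"acq"}]
categories = ["wheat", "grain", "earn", "acq"]

print(precision_recall(*count_decisions(predicted, gold, "wheat")))
print(micro_average(predicted, gold, categories))
```

In the toy data, two of the three decisions made are correct and two of the three labels are recovered, so the micro-averaged precision and recall are both 2/3.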
We also looked at the performance of Com
(cf. Section 6). For Com, we extracted from a
text the key words with the 20 largest Shannon
information values, segmented the text
using TextTiling, and extracted in each block
the k key words having the largest probability
values. Any key word extracted in all
blocks was considered to represent a main
topic. When the key words for main topics
contained at least one of the identification
words, we viewed that text as having the
corresponding main topic.
Table 2 shows the results achieved with STM and Com in the case of
k = 7.⁹ Table 3 shows the results in the case of k = 5. The
comparison may be considered fair in that it requires each of the
two methods to provide the same number of words to represent
topics. Results indicate that STM significantly outperforms Com,
particularly in terms of recall.

⁹ For the definition of micro-averaging, see, for example, (Lewis
and Ringuette, 1994).

The main reason for the higher performance achieved by STM is that
it utilizes word cluster information. Figure 8 shows topic analysis
results for the text with ID 15572 labeled with 'wheat.' The text
contains only 15 content words (word types), so all of the 15 words
were extracted as key words and the text was not segmented by
either method. Com was unable to identify the main topic 'wheat,'
because the probability of each of the relevant key words 'wheat'
and 'flour' was low. In contrast, STM successfully identified the
topic because the relevant key words were classified into the same
cluster, and its probability was relatively high.

Title: EGYPT BUYS PL 480 WHEAT FLOUR - U.S. TRADERS
Body: Egypt bought 125,723 tonnes of U.S. wheat flour in its PL
480 tender yesterday, trade sources said. The purchase included
51,880 tonnes for May shipment and 73,843 tonnes for June shipment.
Price details were not available.
Content Words (Freq.): tonne(3) shipment(2) buy(1) detail(1)
Egypt(1) flour(1) include(1) June(1) PL(1) price(1) purchase(1)
source(1) trade(1) US(1) wheat(1)
Key Words (Shan. Inf.): tonne(17.5) shipment(15.3) PL(10.6) flour(9.8)
Egypt(9.3) detail(7.5) June(7.2) wheat(6.8) purchase(6.6) source(6.5)
US(6.1) buy(6.0) include(6.0) trade(5.3) price(5.1)
Com Topics (Prob.): tonne(0.17) shipment(0.11) price(0.06) June(0.06)
include(0.06) purchase(0.06) source(0.06)
STM Topics (Prob.): flour-wheat(0.18) tonne(0.12) shipment(0.11)
purchase-buy(0.11) Egypt(0.06)
Cluster: (flour-wheat: wheat tonne flour)
(purchase-buy: purchase buy)
Figure 8: Topic Identification Example

7.5 Segmentation and Subtopic
Identification
We collected the 50 longest test texts (referred to here as 'seed
texts') from each of the 10 categories, and combined each with a
test text randomly selected from other categories to produce 500
pseudo-texts. Placement of the seed text within its pseudo-text
(i.e., before or after the other text) was determined randomly.

Table 5: Subtopic identification results

category of          STM              Com
seed text        rec.   pre.      rec.   pre.
earn             0.430  0.945     0.324  0.973
acq              0.237  0.939     0.217  0.959
money-fx         0.585  0.950     0.533  0.961
grain            0.276  0.947     0.222  0.938
crude            0.572  0.979     0.557  0.990
trade            0.634  0.951     0.627  0.899
interest         0.211  0.937     0.136  1.000
ship             0.260  1.000     0.340  0.994
wheat            0.500  0.970     0.395  0.980
corn             0.317  1.000     0.441  0.882
Average          0.402  0.962     0.379  0.958

We used both STM and Com to segment each of the pseudo-texts into
two blocks and identify subtopics. Table 4 shows the segmentation
results for the two methods evaluated in terms of recall,
precision, and error probability. Table 5 shows the results of
subtopic identification as evaluated in terms of recall and
precision. Error probability is a metric for evaluating
segmentation results proposed in (Allan et al., 1998; Beeferman et
al., 1999). It is defined here as the probability that a randomly
chosen pair of sentences a distance of k sentences apart is
incorrectly segmented.¹⁰
Experimental results indicate that STM outperforms Com in both
segmentation and identification.¹¹

¹⁰ Here, k was set to 5 because the average length of a text was
about 10 sentences.
¹¹ We will discuss the results in the full version of the paper.

Table 4: Text segmentation results

category of             STM                      Com
seed text        rec.   pre.   err.       rec.   pre.   err.
earn             0.660  0.660  0.167      0.640  0.640  0.171
acq              0.820  0.820  0.059      0.740  0.740  0.085
money-fx         0.700  0.700  0.087      0.660  0.660  0.121
grain            0.700  0.700  0.074      0.660  0.660  0.076
crude            0.860  0.860  0.051      0.820  0.820  0.066
trade            0.800  0.800  0.072      0.800  0.800  0.081
interest         0.760  0.760  0.119      0.820  0.820  0.084
ship             0.837  0.854  0.074      0.816  0.833  0.084
wheat            0.760  0.760  0.075      0.640  0.640  0.130
corn             0.625  0.625  0.147      0.650  0.650  0.105
Average          0.752  0.754  0.092      0.725  0.726  0.100

8 Conclusions
We have proposed a new method of topic analysis that employs a
finite mixture model, referred to here as a stochastic topic model
(STM).
Topic analysis consists of two main tasks: text segmentation and
topic identification. With topic analysis, one can obtain a topic
structure for a text.
Our method addresses topic analysis within a single framework. It
has the following novel features: 1) it represents topics by means
of word clusters and employs a finite mixture model (STM) to
represent a word distribution within a text; 2) it constructs
topics on the basis of corpus data before conducting topic
analysis; 3) it segments a text by detecting significant
differences between STMs; and 4) it identifies topics by estimating
parameters of STMs.
Experimental results indicate that our 
method outperforms a method that combines 
existing techniques. More specifically, it sig- 
nificantly outperforms the combined method 
in topic identification. 

References 
J. Allan, J. Carbonell, G. Doddington, J. Yam- 
ron, and Y. Yang. 1998. Topic detection and 
tracking pilot study: Final report. Proc. of the 
DARPA Broadcast News Transcription and Un- 
derstanding Workshop, pages 194-218. 
D. Beeferman, A. Berger, and J. Lafferty. 
1999. Statistical models for text segmentation. 
Machi. Lrn., 34:177-210. 
P. F. Brown, V. J. Della Pietra, P. V. deSouza,
J. C. Lai, and R. L. Mercer. 1992. Class-based
n-gram models of natural language. Comp.
Ling., 18(4):283-298.
T. M. Cover and J. A. Thomas. 1991. Elements of 
Information Theory. John Wiley & Sons Inc., 
New York. 
A.P. Dempster, N.M. Laird, and D.B. Rubin.
1977. Maximum likelihood from incomplete
data via the EM algorithm. Journ. of Roy. Stat.
Soci., Ser. B, 39(1):1-38.
B. Everitt and D. Hand. 1981. Finite Mixture
Distributions. Chapman and Hall.
M. Hearst. 1997. TextTiling: Segmenting text
into multi-paragraph subtopic passages. Comp. 
Ling., 23(1):33-64. 
Thomas Hofmann. 1999. Probabilistic latent se- 
mantic indexing. Proc. of SIGIR '99, pages 50- 
57. 
T. Joachims. 1998. Text categorization with
support vector machines: Learning with many
relevant features. Proc. of ECML '98.
D. D. Lewis and M. Ringuette. 1994. A comparison
of two learning algorithms for text
categorization. Proc. of 3rd Ann. Symp. on Doc. Ana.
and Info. Retr., pages 81-93.
D. D. Lewis, R. E. Schapire, J. P. Callan, and 
R. Papka. 1996. Training algorithms for linear 
text classifiers. Proc. of SIGIR'96. 
H. Li and K. Yamanishi. 1997. Document
classification using a finite mixture model. Proc. of
ACL '97, pages 39-47.
H. Li and K. Yamanishi. 1999. Text classification 
using ESC-based stochastic decision lists. Proc. 
of ACM-CIKM'99, pages 122-130. 
H. Li. 1998. A Probabilistic Approach to Lexical
Semantic Knowledge Acquisition and Structural
Disambiguation. Ph.D. Thesis, Univ. of Tokyo.
A. K. McCallum and K. Nigam. 1999. Text
classification by bootstrapping with keywords, EM
and shrinkage. Proc. of ACL'99 Workshop
Unsupervised Learning in NLP.
K. Nigam, A. K. McCallum, S. Thrun, and
T. Mitchell. 2000. Text classification from
labeled and unlabeled documents using EM.
Machi. Lrn., 39:103-134.
J. C. Reynar. 1999. Statistical models for topic 
segmentation. Proc. of ACL '99, pages 357-364. 
J. Rissanen. 1996. Fisher information and
stochastic complexity. IEEE Trans. on Info.
Thry., 42(1):40-47.
G. Salton and C.S. Yang. 1973. On the speci- 
fication of term values in automatic indexing. 
Journ. of Doc., 29(4):351-372. 
S. M. Weiss, C. Apte, F. Damerau, F. J. Oles,
T. Goetz, and T. Hampp. 1999. Maximizing
text-mining performance. IEEE Intel. Sys.,
14(4):63-69.
J.P. Yamron, I. Carp, L. Gillick, S. Lowe, and 
P. van Mulbregt. 1998. A Hidden Markov 
Model approach to text segmentation and event 
tracking. Proc. of ICASSP'99, pages 333-336. 
