 
Building Semantic Perceptron Net for Topic Spotting 
 
Jimin Liu  and  Tat-Seng Chua 
School of Computing 
National University of Singapore 
SINGAPORE 117543 
  {liujm, chuats}@comp.nus.edu.sg 
 
 
 
 
Abstract 
This paper presents an approach to 
automatically build a semantic 
perceptron net (SPN) for topic spotting. 
It uses context at the lower layer to 
select the exact meaning of key words, 
and employs a combination of context, 
co-occurrence statistics and thesaurus to 
group the distributed but semantically 
related words within a topic to form 
basic semantic nodes. The semantic 
nodes are then used to infer the topic 
within an input document. Experiments 
on the Reuters-21578 data set demonstrate 
that SPN is able to capture the semantics 
of topics, and that it performs well on the 
topic spotting task. 
 
1. Introduction 
 
Topic spotting is the problem of identifying the 
presence of a predefined topic in a text document. 
More formally, given a set of n topics together with 
a collection of documents, the task is to determine 
for each document the probability that one or more 
topics is present in the document. Topic spotting 
may be used to automatically assign subject codes 
to newswire stories, filter electronic mail and on-line 
news, and pre-screen documents in information 
retrieval and information extraction applications. 
Topic spotting, and its related problem of text 
categorization, has been a hot area of research for 
over a decade. A large number of techniques have 
been proposed to tackle the problem, including: 
regression model, nearest neighbor classification, 
Bayesian probabilistic model, decision tree, 
inductive rule learning, neural networks, on-line 
learning, and support vector machines (Yang & Liu, 
1999; Tzeras & Hartmann, 1993). Most of these 
methods are word-based and consider only the 
relationships between the features and topics, but 
not the relationships among features. 
It is well known that the performance of the 
word-based methods is greatly affected by the lack 
of linguistic understanding, and, in particular, the 
inability to handle synonymy and polysemy. A 
number of simple linguistic techniques have been 
developed to alleviate such problems, ranging from 
the use of stemming, lexical chain and thesaurus 
(Jing & Tzoukermann, 1999; Green, 1999), to 
word-sense disambiguation (Chen & Chang, 1998; 
Leacock et al, 1998; Ide & Veronis, 1998) and 
context (Cohen & Singer, 1999; Jing & 
Tzoukermann, 1999). 
The connectionist approach has been widely 
used to extract knowledge in a wide range of 
information processing tasks including natural 
language processing, information retrieval and 
image understanding (Anderson, 1983; Lee & 
Dubin, 1999; Sarkas & Boyer, 1995; Wang & 
Terman, 1995). Because the connectionist 
approach closely resembles the human cognition 
process in text processing, it seems natural to adopt 
this approach, in conjunction with linguistic 
analysis, to perform topic spotting. However, there 
have been few attempts in this direction. This is 
mainly because of difficulties in automatically 
constructing the semantic networks for the topics. 
In this paper, we propose an approach to 
automatically build a semantic perceptron net 
(SPN) for topic spotting. The SPN is a 
connectionist model with hierarchical structure. It 
uses a combination of context, co-occurrence 
statistics and thesaurus to group the distributed but 
semantically related words to form basic semantic 
nodes. The semantic nodes are then used to identify 
the topic. This paper discusses the design, 
implementation and testing of an SPN for topic 
spotting. 
The paper is organized as follows. Section 2 
discusses the topic representation, which is the 
prototype structure for SPN. Sections 3 & 4 
respectively discuss our approach to extract the 
semantic correlations between words, and build 
semantic groups and topic tree. Section 5 describes 
the building and training of SPN, while Section 6 
presents the experiment results. Finally, Section 7 
concludes the paper. 
2. Topic Representation 
 
The frame of Minsky (1975) is a well-known 
knowledge representation technique. A frame 
represents a high-level concept as a collection of 
slots, where each slot describes one aspect of the 
concept. The situation is similar in topic spotting. 
For example, the topic “water” may have many 
aspects (or sub-topics). One sub-topic may be 
about “water supply”, while another is about 
“water and environment protection”, and so on. 
These sub-topics may have some common 
attributes, such as the word “water”, and each sub-
topic may be further sub-divided into finer sub-
topics, etc. 
The above points to a hierarchical topic 
representation, which corresponds to the hierarchy 
of document classes (Figure 1). In the model, the 
contents of the topics and sub-topics (shown as 
circles) are modeled by a set of attributes, which is 
simply a group of semantically related words 
(shown as solid elliptical shaped bags or 
rectangles). The context (shown as dotted ellipses) 
is used to identify the exact meaning of a word. 
 
Figure 1. Topic representation (labels in the figure: topic, sub-topic, a word, the context of a word, aspect attribute, common attribute) 
Hofmann (1998) presented a word occurrence 
based cluster abstraction model that learns a 
hierarchical topic representation. However, the 
method is not suitable when the set of training 
examples is sparse. To avoid the problem of 
automatically constructing the hierarchical model, 
Tong et al (1987) required the users to supply the 
model, which is used as queries in the system. 
Most automated methods, however, avoided this 
problem by modeling the topic as a feature vector, 
rule set, or instantiated example (Yang & Liu, 
1999). These methods typically treat each word 
feature as independent, and seldom consider 
linguistic factors such as the context or lexical 
chain relations among the features. As a result, 
these methods are not good at discriminating a 
large number of documents that typically lie near 
the boundary of two or more topics. 
In order to facilitate the automatic extraction 
and modeling of the semantic aspects of topics, we 
adopt a compromise approach. We model the topic 
as a tree of concepts as shown in Figure 1. 
However, we consider only one level of hierarchy 
built from groups of semantically related words. 
These semantic groups may not correspond strictly 
to sub-topics within the domain. Figure 2 shows an 
example of an automatically constructed topic tree 
on “water ”. 
 
Figure 2. An example of a topic tree on “water” (layers: contexts, basic semantic nodes, topic; nodes labeled a-f; example words in the figure include water, ton, price, agreement, waste, environment, bank, provide, corporation, plant, rain, rainfall, dry, river, tourist) 
In Figure 2, node “a” contains the common 
feature set of the topic; while nodes “b”, “c” and 
“d” are related to sub-topics on “water supply”, 
“rainfall”, and “water and environment protection” 
respectively. Node “e” is the context of the word 
“plant”, and node “f” is the context of the word 
“bank”. Here we use training to automatically 
resolve which attribute each node corresponds to, 
and which context words should be used to select 
the exact meaning of a word. From 
this representation, we observe that: 
a) Nodes “c” and “d” are closely related and may 
not be fully separable. In fact, it is sometimes 
difficult even for human experts to decide how 
to divide them into separate topics. 
b) The same word, such as “water ”, may appear in 
both the context node and the basic semantic 
node.  
c) Some words use context to resolve their 
meanings, while many do not need context. 
3. Semantic Correlations  
 
Although there exist many methods to derive the 
semantic correlations between words (Lee, 1999; 
Lin, 1998; Karov & Edelman, 1998; Resnik, 1995; 
Dagan et al, 1995), we adopt a relatively simple 
and yet practical and effective approach to derive 
three topic-oriented semantic correlations: 
thesaurus-based, co-occurrence-based and context-
based correlation.  
3.1 Thesaurus based correlation  
WordNet is an electronic thesaurus widely used 
in research on lexical semantic acquisition and 
word sense disambiguation (Green, 1999; 
Leacock et al, 1998). In WordNet, the sense of a 
word is represented by a list of synonyms (synset), 
and the lexical information is represented in the 
form of a semantic network. 
However, it is well known that the granularity 
of semantic meanings of words in WordNet is often 
too fine for practical use. We thus need to enlarge 
the semantic granularity of words in practical 
applications. For example, given a topic on 
“children education”, it is highly likely that the 
word “child” will be a key term. However, the 
concept “child” can be expressed in many 
semantically related terms, such as “boy”, “girl”, 
“kid”, “child”, “youngster”, etc. In this case, it 
might not be necessary to distinguish the different 
meaning among these words, nor the different 
senses within each word. It is, however, important 
to group all these words into a large synset {child, 
boy, girl, kid, youngster}, and use the synset to 
model the dominant but more general meaning of 
these words in the context. 
In general, it is reasonable and often useful to 
group lexically related words together to represent 
a more general concept. Here, two words are 
considered to be lexically related if they are related 
by the “is_a”, “part_of”, “member_of”, or 
“antonym” relations, or if they belong to the same 
synset. Figure 3 lists the lexical relations that we 
considered, together with examples. 
Since, in our experiments, many antonyms 
co-occur within a topic, we also group antonyms 
together to identify a topic. Moreover, if a word 
has two senses, say sense-1 and sense-2, and there 
are two separate words that are lexically related to 
this word by sense-1 and sense-2 respectively, we 
simply group these words together and do not 
attempt to distinguish the two different senses. 
The reason is that if a word is important enough to 
be chosen as a keyword of a topic, then it should 
have only one dominant 
meaning in that topic. The idea that a keyword 
should have only one dominant meaning in a topic 
is also suggested in Church & Yarowsky (1992). 
 
Figure 3: Examples of lexical relationships (synset: corn, maize; is_a: metal, zinc; part_of: tree, leaf; member_of: family, son; antonym: import, export) 
Based on the above discussion, we compute the 
thesaurus-based correlation between the two terms 
t1 and t2, in topic Ti, as: 
$$R_L^{(i)}(t_1, t_2) = \begin{cases} 1 & (t_1 \text{ and } t_2 \text{ are in the same synset, or } t_1 = t_2) \\ 0.8 & (t_1 \text{ and } t_2 \text{ have the antonym relation}) \\ 0.5 & (t_1 \text{ and } t_2 \text{ are related by is\_a, part\_of, or member\_of}) \\ 0 & \text{(others)} \end{cases} \quad (1)$$
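As an illustration, the following is a minimal sketch (not the authors' code) of how Equation (1) can be computed with NLTK's WordNet interface; the function name and the exact set of WordNet relations queried are our own illustrative choices.

```python
# Illustrative sketch of the thesaurus-based correlation R_L in Equation (1).
# Requires the WordNet corpus: nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def thesaurus_correlation(t1: str, t2: str) -> float:
    if t1 == t2:
        return 1.0
    syn1, syn2 = wn.synsets(t1), wn.synsets(t2)
    # Same synset (under any sense) -> 1
    if set(syn1) & set(syn2):
        return 1.0
    # Antonym relation -> 0.8
    for s in syn1:
        for lemma in s.lemmas():
            if any(ant.name() == t2 for ant in lemma.antonyms()):
                return 0.8
    # is_a / part_of / member_of relations -> 0.5
    for s in syn1:
        related = (s.hypernyms() + s.hyponyms() +
                   s.part_holonyms() + s.part_meronyms() +
                   s.member_holonyms() + s.member_meronyms())
        if set(related) & set(syn2):
            return 0.5
    return 0.0
```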
 
3.2 Co-occurrence based correlation 
Co-occurrence relationship is like the global 
context of words. Using co-occurrence statistics, 
Veling & van der Weerd (1999) were able to find 
many interesting conceptual groups in the Reuters-
21578 text corpus. Examples of the conceptual 
groups found include: {water, rainfall, dry}, 
{bomb, injured, explosion, injuries}, and {cola, 
PEP, Pepsi, Pepsi-cola, Pepsico}. These groups 
are meaningful, and are able to capture the 
important concepts within the corpus. 
Since in general, high co-occurrence words are 
likely to be used together to represent (or describe) 
a certain concept, it is reasonable to group them 
together to form a large semantic node. Thus for 
topic Ti, the co-occurrence-based correlation of two 
terms, t1 and t2, is computed as: 
$$R_{co}^{(i)}(t_1, t_2) = df^{(i)}(t_1 \wedge t_2)\,/\,df^{(i)}(t_1 \vee t_2) \quad (2)$$

where $df^{(i)}(t_1 \wedge t_2)$ (respectively $df^{(i)}(t_1 \vee t_2)$) is the fraction of documents in $T_i$ that contain both (respectively either of) $t_1$ and $t_2$. 
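A minimal sketch of Equation (2) follows; the document representation (one set of tokens per training document of Ti) and the function name are assumptions made for illustration. Since both counts are taken over the same topic, the ratio of raw document counts equals the ratio of document fractions.

```python
# Illustrative sketch of the co-occurrence correlation R_co in Equation (2).
def cooccurrence_correlation(t1, t2, topic_docs):
    """topic_docs: list of token sets, one per training document of topic T_i."""
    both = sum(1 for doc in topic_docs if t1 in doc and t2 in doc)   # df(t1 AND t2)
    either = sum(1 for doc in topic_docs if t1 in doc or t2 in doc)  # df(t1 OR t2)
    return both / either if either else 0.0
```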
3.3 Context based correlation 
Broadly speaking, there are three kinds of context: 
domain, topic and local contexts (Ide & Veronis, 
1998). Domain context requires extensive 
knowledge of the domain and is not considered in 
this paper. Topic context can be modeled 
approximately using the co-occurrence 
relationships between the words in the topic. In this 
section, we will define the local context explicitly. 
The local context of a word t is often defined as  
the set of non-trivial words near t. Here a word wd  
is said to be near t if their word distance is less than 
a given threshold, which is set to 5 in our 
experiments. 
We represent the local context of term $t_j$ in topic 
$T_i$ by a context vector $cv^{(i)}(t_j)$. To derive $cv^{(i)}(t_j)$, we 
first rank all candidate context words of $t_j$ by their 
density values: 

$$r_{jk}^{(i)} = m_j^{(i)}(wd_k)\,/\,n^{(i)}(t_j) \quad (3)$$

where $n^{(i)}(t_j)$ is the number of occurrences of $t_j$ in $T_i$, and $m_j^{(i)}(wd_k)$ is the number of occurrences of $wd_k$ near $t_j$. We then select from the ranking the top ten words as the context of $t_j$ in $T_i$: 

$$cv^{(i)}(t_j) = \{(wd_{j1}^{(i)}, r_{j1}^{(i)}), (wd_{j2}^{(i)}, r_{j2}^{(i)}), \ldots, (wd_{j10}^{(i)}, r_{j10}^{(i)})\} \quad (4)$$
When the training sample is sufficiently large, 
the context vector has a sound statistical meaning. 
Note again that a word important to a topic should 
have only one dominant meaning within that topic, 
and that this meaning should be reflected by its 
context. We can thus conclude that if two words 
have a very high context similarity within a topic, 
they are very likely to be semantically related, and 
it is reasonable to group them together to form a 
larger semantic node. We therefore compute the 
context-based correlation between two terms t1 and 
t2 in topic Ti as: 
$$R_c^{(i)}(t_1, t_2) = \frac{\sum_{k=1}^{10} r_{1k}^{(i)} \, r_{2m(k)}^{(i)} \, R_{co}^{(i)}(wd_{1k}^{(i)}, wd_{2m(k)}^{(i)})}{\left[\sum_k (r_{1k}^{(i)})^2\right]^{1/2} \left[\sum_k (r_{2k}^{(i)})^2\right]^{1/2}} \quad (5)$$

where $m(k) = \arg\max_s R_{co}^{(i)}(wd_{1k}^{(i)}, wd_{2s}^{(i)})$. 
For example, in Reuters 21578 corpus, 
“company” and “corp” are context-related words 
within the topic “acq”. This is because they have 
very similar context of “say, header, acquire, 
contract”. 
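To make Equations (3)-(5) concrete, here is a minimal sketch under simplifying assumptions (pre-tokenized documents, no stop-word filtering); the names context_vector, context_correlation and r_co are ours, and r_co stands for any implementation of Equation (2).

```python
# Illustrative sketch of Equations (3)-(5): context vectors and their comparison.
from collections import Counter
from math import sqrt

def context_vector(term, topic_docs, window=5, top=10):
    """Rank context words by density r = m(wd)/n(term) (Eq. 3) and keep the
    top ten (Eq. 4). topic_docs: list of token lists for topic T_i."""
    counts, n_term = Counter(), 0
    for tokens in topic_docs:
        for pos, tok in enumerate(tokens):
            if tok != term:
                continue
            n_term += 1
            lo, hi = max(0, pos - window), pos + window + 1
            counts.update(w for w in tokens[lo:hi] if w != term)
    if n_term == 0:
        return []
    return [(w, c / n_term) for w, c in counts.most_common(top)]

def context_correlation(cv1, cv2, r_co):
    """Eq. (5): match each context word of t1 with its best co-occurring
    context word of t2, normalized by the lengths of the density vectors."""
    num = 0.0
    for wd1, r1 in cv1:
        if not cv2:
            break
        wd2, r2 = max(cv2, key=lambda item: r_co(wd1, item[0]))
        num += r1 * r2 * r_co(wd1, wd2)
    denom = (sqrt(sum(r * r for _, r in cv1)) *
             sqrt(sum(r * r for _, r in cv2)))
    return num / denom if denom else 0.0
```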
4. Semantic Groups & Topic Tree 
 
There are many methods that attempt to construct 
the conceptual representation of a topic from the 
original data set (Veling & van der Weerd, 1999; 
Baker & McCallum, 1998; Pereira et al, 1993). In 
this section, we will describe our semantic-based 
approach to finding basic semantic groups and 
constructing the topic tree. Given a set of training 
documents, the stages involved in finding the 
semantic groups for each topic are given below. 
A)  Extract all distinct terms {t1, t2, ..., tn} from the 
training document set for topic Ti. For each term 
tj, compute its df(i)(tj) and cv(i)(tj), where df(i)(tj) 
is defined as the fraction of documents in Ti that 
contain tj. In other words, df(i)(tj) gives the 
conditional probability of tj appearing in Ti. 
B)  Derive the semantic group Gj using tj as the 
main keyword. Here we use the semantic 
correlations defined in Section 3 to derive the 
semantic relationship between tj and any other 
term tk. Thus: 
 For each pair $(t_j, t_k)$, $k = 1, \ldots, n$, set $Link(t_j, t_k) = 1$ if 
  $R_L^{(i)}(t_j, t_k) > 0$,   or 
  $df^{(i)}(t_j) > d_0$  and  $R_{co}^{(i)}(t_j, t_k) > d_1$,   or 
  $df^{(i)}(t_j) > d_2$  and  $R_c^{(i)}(t_j, t_k) > d_3$, 
 where $d_0$, $d_1$, $d_2$, $d_3$ are predefined thresholds. 
 For all $t_k$ with $Link(t_j, t_k) = 1$, we form a semantic group centered around $t_j$, denoted by: 

$$G_j = \{t_{j1}, t_{j2}, \ldots, t_{jk_j}\} \subseteq \{t_1, t_2, \ldots, t_n\} \quad (6)$$

 Here $t_j$ is the main keyword of node $G_j$, denoted by $main(G_j) = t_j$. 
C)  Calculate the information value inf (i)(Gj) of each 
basic semantic group. First we compute the 
information value of each tj: 
$$\inf{}^{(i)}(t_j) = df^{(i)}(t_j) * \max\{0,\; p_{ij} - \tfrac{1}{N}\} \quad (7)$$

 where  $p_{ij} = df^{(i)}(t_j)\,/\,\sum_{k=1}^{N} df^{(k)}(t_j)$ 

 and N is the number of topics. Thus 1/N denotes the probability that a term is in any class, and $p_{ij}$ denotes the normalized conditional probability of $t_j$ in $T_i$. Only those terms whose normalized conditional probability is higher than 1/N will have a positive information value. 
 The information value of the semantic group Gj 
is simply the summation of information value of 
its constituent terms weighted by their 
maximum semantic correlation with tj as: 
$$\inf{}^{(i)}(G_j) = \sum_{k=1}^{k_j} \left[ w_{jk}^{(i)} * \inf{}^{(i)}(t_k) \right] \quad (8)$$

 where  $w_{jk}^{(i)} = \max\{R_{co}^{(i)}(t_j, t_k),\; R_c^{(i)}(t_j, t_k),\; R_L^{(i)}(t_j, t_k)\}$ 
D) Select the essential semantic groups using the 
following algorithm (a code sketch is given at the end of this section): 
 a) Initialize:  $S \leftarrow \{G_1, G_2, \ldots, G_n\}$,  $Groups \leftarrow \emptyset$. 
 b) Select the semantic group with the highest information value:  $j \leftarrow \arg\max_{G_k \in S} \inf{}^{(i)}(G_k)$. 
 c) Terminate if $\inf{}^{(i)}(G_j)$ is less than a predefined threshold $d_4$. 
 d) Add $G_j$ into the set Groups:  $S \leftarrow S - \{G_j\}$,  and  $Groups \leftarrow Groups \cup \{G_j\}$. 
 e) Eliminate those groups in S whose main keyword appears in the selected group $G_j$. That is: for each $G_k \in S$, if $main(G_k) \in G_j$, then $S \leftarrow S - \{G_k\}$. 
 f) Eliminate those terms in the remaining groups in S that are found in the selected group $G_j$. That is: for each $G_k \in S$,  $G_k \leftarrow G_k - G_j$,  and if $G_k = \emptyset$, then $S \leftarrow S - \{G_k\}$. 
 g) If $S = \emptyset$ then stop; else go to step (b). 
In the above grouping algorithm, the predefined 
thresholds d0,d1,d2,d3 are used to control the size of 
each group, and d4 is used to control the number of 
groups. 
The set of basic semantic groups found then 
forms the sub-topics of a 2-layered topic tree as 
illustrated in Figure 2. 
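The following is a minimal sketch of the greedy selection in step D above. The mapping from main keywords to member sets and the inf_value callback (an implementation of Equation (8)) are assumed inputs; the names are illustrative only.

```python
# Illustrative sketch of step D: greedy selection of essential semantic groups.
def select_groups(groups, inf_value, d4):
    """groups: dict mapping main keyword t_j -> set of member terms G_j;
    inf_value(G): information value of a group (Eq. 8); d4: threshold."""
    pool = dict(groups)                       # (a) S <- {G_1, ..., G_n}
    selected = []                             #     Groups <- empty set
    while pool:
        # (b) group with the highest information value
        j = max(pool, key=lambda k: inf_value(pool[k]))
        if inf_value(pool[j]) < d4:           # (c) terminate on low information
            break
        g_j = pool.pop(j)                     # (d) move G_j into Groups
        selected.append((j, g_j))
        for k in list(pool):
            if k in g_j:                      # (e) main keyword absorbed by G_j
                del pool[k]
            else:                             # (f) drop shared terms; delete empty groups
                pool[k] = pool[k] - g_j
                if not pool[k]:
                    del pool[k]
    return selected                           # (g) loop ends when S is empty
```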
5. Building and Training of SPN 
 
The combination of local perception and a global 
arbitrator has been applied to solve perception 
problems (Wang & Terman, 1995; Liu & Shi, 
2000). Here we adopt the same strategy for topic 
spotting. For each topic, we construct a local 
perceptron net (LPN), which is designed for a 
particular topic. We use a global expert (GE) to 
arbitrate all decisions of LPNs and to model the 
relationships between topics. Here we discuss the 
design of both LPN and GE, and their training 
processes. 
5.1 Local Perceptron Net (LPN) 
We derive the LPN directly from the topic tree as 
discussed in Section 2 (see Figure 2). Each LPN is 
a multi-layer feed-forward neural network with a 
typical structure as shown in Figure 4. 
In Figure 4, xij represents the feature value of 
keyword wdij in the ith semantic group; xijk (where 
k=1,…,10) represents the feature values of the context 
words wdijk of keyword wdij; and aij denotes the 
meaning of keyword wdij as determined by its 
context. Ai corresponds to the ith basic semantic 
node. The weights wi, wij, and wijk, and the biases 
θ(i) and θij, are learned from training, and y(i)(x) is 
the output of the network. 
Figure 4: The architecture of LPN for topic i (from input to output: context words xijk and key terms xij, basic meanings aij, semantic groups Ai, and the class output y(i); with weights wijk, wij, wi and biases θij, θ(i)) 
Given a document: 

$$x = \{(x_{ij}, cv_{ij}) \mid i = 1, 2, \ldots, m;\; j = 1, \ldots, i_j\}$$

where m is the number of basic semantic nodes, $i_j$ 
is the number of key terms contained in the ith 
semantic node, and $cv_{ij} = \{x_{ij1}, x_{ij2}, \ldots, x_{ijk_{ij}}\}$ is the 
context of term xij. The output y(i) =y(i)(x) is 
calculated as follows: 
$$y^{(i)} = y^{(i)}(x) = \sum_{i=1}^{m} w_i A_i \quad (9)$$

where

$$a_{ij} = x_{ij} * \frac{1}{1 + \exp\left[-\left(\sum_{x_{ijk} \in cv_{ij}} w_{ijk}\, x_{ijk} - \theta_{ij}\right)\right]} \quad (10)$$

and

$$A_i = \frac{1 - \exp\left(-\sum_{j=1}^{i_j} w_{ij}\, a_{ij}\right)}{1 + \exp\left(-\sum_{j=1}^{i_j} w_{ij}\, a_{ij}\right)} \quad (11)$$
Equation (10) expresses the fact that a key term's 
context needs to be checked only if the term is 
present in the document (i.e., xij > 0). 
For each topic Ti, there is a corresponding net 
y(i) = y(i)(x) and a threshold θ(i). The pair (y(i)(x), 
θ(i)) is a local binary classifier for Ti such that: 
If y(i)(x) − θ(i) > 0, then Ti is present; otherwise 
  Ti is not present in document x. 
From the procedures employed to build the 
topic tree, we know that each feature is in fact a 
piece of evidence supporting the occurrence of the 
topic. This suggests that the activation function for 
each node in the LPN should be a non-decreasing 
function of its inputs. Thus we impose the following 
weight constraint on the LPN: 

$$w_i > 0, \quad w_{ij} > 0, \quad w_{ijk} > 0 \quad (12)$$
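For concreteness, here is a minimal sketch of the forward pass defined by Equations (9)-(11), written with NumPy; the nested-list data layout and the function name are our own assumptions, not the authors' implementation.

```python
# Illustrative sketch of one LPN forward pass (Equations 9-11).
import numpy as np

def lpn_forward(x, cv, w, w_kw, w_ctx, theta_ctx):
    """x[i][j]: feature value of keyword j in semantic node i;
    cv[i][j]: context feature vector of that keyword;
    w[i], w_kw[i][j], w_ctx[i][j]: node, keyword and context weights;
    theta_ctx[i][j]: context bias. Returns the network output y."""
    y = 0.0
    for i in range(len(x)):                      # basic semantic nodes
        a = np.empty(len(x[i]))
        for j in range(len(x[i])):               # key terms in node i
            # Eq. (10): the context gate only matters when x[i][j] > 0
            s = float(np.dot(w_ctx[i][j], cv[i][j])) - theta_ctx[i][j]
            a[j] = x[i][j] / (1.0 + np.exp(-s))
        t = float(np.dot(w_kw[i], a))
        A_i = (1.0 - np.exp(-t)) / (1.0 + np.exp(-t))   # Eq. (11)
        y += w[i] * A_i                                  # Eq. (9)
    return y
```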
5.2 Global expert (GE) 
Since there are relations among topics, and LPNs 
do not have global information, it is inevitable that 
LPNs will make wrong decisions. In order to 
overcome this problem, we use a global expert 
(GE) to arbitrate all local decisions. Figure 5 
illustrates the use of global expert to combine the 
outputs of LPNs. 
 
Figure 5: The architecture of the global expert (the local outputs y(j) − θ(j) are combined through the weights Wij and the global bias Θ(i) to produce Y(i)) 
Given a document x, we first use each LPN to 
make a local decision. We then combine the 
outputs of LPNs as follows: 
$$Y^{(i)} = (y^{(i)} - \theta^{(i)}) + \sum_{j \neq i,\; y^{(j)} - \theta^{(j)} > 0} W_{ij}\,(y^{(j)} - \theta^{(j)}) - \Theta^{(i)} \quad (13)$$

where the $W_{ij}$ are the weights between the global 
arbitrator i and the jth LPN, and the $\Theta^{(i)}$ are the 
global biases. From the result of Equation (13), we 
have: 
If $Y^{(i)} > 0$, then topic $T_i$ is present; otherwise 
  $T_i$ is not present in document x. 
The use of Equation (13) implies that: 
a) If an LPN is not activated, i.e., y(i) ≤ θ(i), then its 
output is not used in the GE. Thus it will not 
affect the outputs of the other LPNs.  
b) The weight Wij models the relationship or 
correlation between topics i and j. If Wij > 0, it 
means that if document x is related to Tj, it also 
contributes (with weight Wij) to topic Ti. On 
the other hand, if Wij < 0, the two topics are 
negatively correlated, and a document 
x will not be related to both Tj and Ti. 
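A minimal sketch of the global arbitration in Equation (13) is given below; the array-based layout and the function name are assumptions for illustration.

```python
# Illustrative sketch of the global expert (Equation 13).
import numpy as np

def global_expert(y, theta, W, Theta):
    """y, theta: per-topic LPN outputs and thresholds (length-n sequences);
    W[i][j]: inter-topic weights; Theta[i]: global bias. Returns Y."""
    margin = np.asarray(y, dtype=float) - np.asarray(theta, dtype=float)
    n = len(margin)
    Y = np.empty(n)
    for i in range(n):
        # only activated LPNs (margin > 0) influence topic i
        support = sum(W[i][j] * margin[j]
                      for j in range(n) if j != i and margin[j] > 0)
        Y[i] = margin[i] + support - Theta[i]
    return Y   # topic i is assigned to the document when Y[i] > 0
```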
The overall structure of SPN is as follows: 
 
Figure 6: Overall structure of SPN (an input document x feeds the local perceptron nets, whose outputs y(i) are combined by the global expert into the final outputs Y(i)) 
5.3 The Training of SPN 
In order to adopt SPN for topic spotting, we 
employ the well-known BP algorithm to derive the 
optimal weights and biases in SPN. The training 
phase is divided into two stages. The first stage 
learns an LPN for each topic, while the second stage 
trains the GE. As the BP algorithm is rather 
standard, we will discuss only the error functions 
that we employ to guide the training process. 
In topic spotting, the goal is to achieve both 
high recall and precision. In particular, we want to 
allow y(x) to be as large (or as small) as possible in 
cases when there is no error, that is, when $x \in \Omega^+$ 
and $y(x) > \theta$ (or $x \in \Omega^-$ and $y(x) < \theta$). Here $\Omega^+$ and $\Omega^-$ 
denote the positive and negative training document 
sets respectively. To achieve this, we adopt the 
following new error function to train the LPN: 
$$E(w_i, w_{ij}, w_{ijk}, \theta_{ij}, \theta^{(i)}) = \frac{|\Omega^-|}{|\Omega^-| + |\Omega^+|} \sum_{x \in \Omega^+} \varepsilon^+(y(x), \theta) + \frac{|\Omega^+|}{|\Omega^-| + |\Omega^+|} \sum_{x \in \Omega^-} \varepsilon^-(y(x), \theta) \quad (14)$$

where

$$\varepsilon^+(x, \theta) = \begin{cases} \tfrac{1}{2}(x - \theta)^2 & (x < \theta) \\ 0 & (x \geq \theta) \end{cases}, \qquad \varepsilon^-(x, \theta) = \varepsilon^+(-x, -\theta)$$
Equation (14) defines a piecewise differentiable 
error function. The coefficients $|\Omega^-| / (|\Omega^-| + |\Omega^+|)$ and 
$|\Omega^+| / (|\Omega^-| + |\Omega^+|)$ are used to ensure that the contributions 
of positive and negative examples are equal. 
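The piecewise error of Equation (14) is simple to state in code; the sketch below assumes the LPN scores on the positive and negative training sets have already been computed, and the function names are ours.

```python
# Illustrative sketch of the class-balanced error function (Equation 14).
def eps_plus(y, theta):
    # penalize a positive example only when it scores below the threshold
    return 0.5 * (y - theta) ** 2 if y < theta else 0.0

def eps_minus(y, theta):
    return eps_plus(-y, -theta)

def lpn_error(scores_pos, scores_neg, theta):
    """scores_pos / scores_neg: LPN outputs y(x) on positive / negative examples."""
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    total = n_pos + n_neg
    return (n_neg / total) * sum(eps_plus(y, theta) for y in scores_pos) + \
           (n_pos / total) * sum(eps_minus(y, theta) for y in scores_neg)
```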
After the training, we choose the node with the 
biggest wi value as the common attribute node. 
Also, we trim the topic representation by removing 
those words or context words with very small wij or 
wijk values. 
We adopt the following error function to train 
GE: 
$$E(W_{ij}, \Theta^{(i)}) = \sum_{i=1}^{n} \Big[ \sum_{x \in \Omega_i^+} \varepsilon^+(Y^{(i)}(x), \Theta^{(i)}) + \sum_{x \in \Omega_i^-} \varepsilon^-(Y^{(i)}(x), \Theta^{(i)}) \Big] \quad (15)$$

where $\Omega_i^+$ ($\Omega_i^-$) is the set of positive (negative) examples of $T_i$. 
6. Experiment and Discussion 
 
We employ the ModApte Split version of Reuters-
21578 corpus to test our method. In order to ensure 
that the training is meaningful, we select only those 
classes that have at least one document in each of 
the training and test sets. This results in 90 classes 
in both the training and test sets. After eliminating 
documents that do not belong to any of these 90 
classes, we obtain a training set of 7,770 
documents and a test set of 3,019 documents.  
From the set of training documents, we derive the 
set of semantic nodes for each topic using the 
procedures outlined in Section 4. From the training 
set, we found that the average number of semantic 
nodes for each topic is 132, and the average 
number of terms in each node is 2.4. For 
illustration, Table 1 lists some examples of the 
semantic nodes that we found. From Table 1, we 
can draw the following general observations. 
Node ID   Semantic Node (SN)              Method used to find SN   Topic 
1         wheat                           1                        Wheat 
2         import, export, output          1, 2, 3                  Wheat 
3         farmer, production, mln, ton    2                        Wheat 
4         disease, insect, pest           2                        Wheat 
5         fall, fell, rise, rose          3                        Wpi 
Method 1 - by looking up WordNet; Method 2 - by analyzing co-occurrence 
correlation; Method 3 - by analyzing context correlation. 
Table 1: Examples of semantic nodes 
a) Under the topic “wheat ”, we list four semantic 
nodes. Node 1 contains the common attribute 
set of the topic. Node 2 is related to the “buying 
and selling of wheat”. Node 3 is related to 
“wheat production”; and node 4 is related to 
“the effects of insects on wheat production”. The 
results show that the automatically extracted 
basic semantic nodes are meaningful and are 
able to capture most semantics of a topic. 
b) Node 1 originally contains two terms “wheat” 
and “corn” that belong to the same synset found 
by looking up WordNet. However, in the 
training stage, the weight of the word “corn” 
was found to be very small in topic “wheat”, 
and hence it was removed from the semantic 
group. This is similar to discourse-based 
word sense disambiguation.  
c) The granularity of information expressed by the 
semantic nodes may not be the same as what a 
human expert would produce. For example, it is 
possible that a human expert may divide node 2 
into two nodes {import} and {export, output}. 
d) Node 5 contains four words and is formed by 
analyzing context. Each context vector of the 
four words has the same two components: 
“price” and “digital number”. Meanwhile, 
“rise” and “fall” can also be grouped together 
by “antonym” relation. “fell” is actually the past 
tense of “fall”. This means that by comparing 
context, it is possible to group together those 
words with grammatical variations without 
performing grammatical analysis. 
Table 2 summarizes the results of SPN in terms 
of macro and micro F1 values (see Yang & Liu 
(1999) for definitions of the macro and micro F1 
values). For comparison purposes, the table also 
lists the results of other TC methods as reported in 
Yang & Liu (1999). From the table, it can be seen 
that the SPN method achieves the best macF1 
value. This indicates that the method performs well 
on classes with a small number of training samples. 
In terms of the micro F1 measures, SPN out-
performs NB, NNet, LSF and KNN, while posting 
a slightly lower performance than that of SVM. 
The results are encouraging, given that they are 
rather preliminary. We expect the results to improve 
further by tuning the system, from the initial values 
of various parameters to the choice of error 
functions, context, grouping algorithm, and the 
structures of the topic tree and SPN. 
Method MicR MicP micF1 macF1   
SVM 0.8120 0.9137 0.8599 0.5251 
KNN 0.8339 0.8807 0.8567 0.5242 
LSF 0.8507 0.8489 0.8498 0.5008 
NNet 0.7842 0.8785 0.8287 0.3763 
NB 0.7688 0.8245 0.7956 0.3886 
SPN 0.8402 0.8743 0.8569 0.6275 
Table 2. The performance comparison 
7. Conclusion 
 
In this paper, we proposed an approach to 
automatically build semantic perceptron net (SPN) 
for topic spotting. The SPN is a connectionist 
model in which context is used to select the exact 
meaning of a word. By analyzing the context and 
co-occurrence statistics, and by looking up 
thesaurus, it is able to group the distributed but 
semantically related words together to form basic 
semantic nodes. Experiments on Reuters 21578 
show that, to some extent, SPN is able to capture 
the semantics of topics and that it performs well on 
the topic spotting task. 
It is well known that human experts, whose most 
prominent characteristic is the ability to understand 
text documents, have a strong natural ability to spot 
topics in documents. We are, however, unclear 
about the nature of human cognition, and with the 
present state-of-the-art natural language processing 
technology, it is still difficult to get an in-depth 
understanding of a text passage. We believe that 
our proposed approach provides a promising 
compromise between full understanding and no 
understanding. 
Acknowledgment 
 
The authors would like to acknowledge the support 
of the National Science and Technology Board, and 
the Ministry of Education of Singapore for the 
provision of a research grant RP3989903 under 
which this research is carried out. 

References 
 
J.R. Anderson (1983). A Spreading Activation 
Theory of Memory. J. of Verbal Learning & 
Verbal Behavior, 22(3):261-295. 

L.D. Baker & A.K. McCallum (1998). 
Distributional Clustering of Words for Text 
Classification. SIGIR-98. 

J.N. Chen & J.S. Chang (1998). Topic Clustering 
of MRD Senses based on Information Retrieval 
Technique. Computational Linguistics, 24(1), 62-95. 

G.W.K. Church & D. Yarowsky (1992). One Sense 
per Discourse. Proc. of 4th DARPA Speech and 
Natural Language Workshop. 233-237. 

W.W. Cohen & Y. Singer (1999). Context-
Sensitive Learning Method for Text 
Categorization. ACM Trans. on Information 
Systems, 17(2), 141-173, Apr. 

I. Dagan, S. Marcus & S. Markovitch (1995). 
Contextual Word Similarity and Estimation 
from Sparse Data. Computer speech and 
Language, 9:123-152. 

S.J. Green (1999). Building Hypertext Links by 
Computing Semantic Similarity. IEEE Trans on 
Knowledge & Data Engr, 11(5). 

T. Hofmann (1998). Learning and Representing 
Topic, a Hierarchical Mixture Model for Word 
Occurrences in Document Databases. 
Workshop on Learning from Text and the 
Web, CMU.  

N. Ide & J. Veronis (1998). Introduction to the 
Special Issue on Word Sense Disambiguation: 
the State of the Art. Computational Linguistics, 24(1), 1-39. 

H. Jing & E. Tzoukermann (1999). Information 
Retrieval based on Context Distance and 
Morphology. SIGIR-99, 90-96. 

Y. Karov & S. Edelman (1998). Similarity-based 
Word Sense Disambiguation, Computational 
Linguistics, 24(1), 41-59. 

C. Leacock & M. Chodorow & G. Miller (1998). 
Using Corpus Statistics and WordNet for Sense 
Identification. Computational Linguistics, 24(1), 147-165. 

L. Lee (1999). Measures of Distributional 
Similarity. Proc of 37th Annual Meeting of 
ACL.  

J. Lee & D. Dubin (1999). Context-Sensitive 
Vocabulary Mapping with a Spreading 
Activation Network. SIGIR-99, 198-205. 

D. Lin (1998). Automatic Retrieval and Clustering 
of Similar Words.  In COLING-ACL-98, 768-773. 

J. Liu & Z. Shi (2000). Extracting Prominent 
Shape by Local Interactions and Global 
Optimizations. CVPRIP-2000, USA.  

M.A. Minsky (1975). A Framework for 
Representing Knowledge. In: Winston P (eds). 
The psychology of computer vision,
McGraw-Hill, New York, 211-277. 

F.C.N. Pereira, N.Z. Tishby & L. Lee (1993). 
Distributional Clustering of English Words. 
ACL-93, 183-190. 

P. Resnik (1995). Using Information Content to 
Evaluate Semantic Similarity in a Taxonomy. 
Proc of IJCAI-95, 448-453. 

S. Sarkas & K.L. Boyer (1995). Using Perceptual 
Inference Network to Manage Vision 
Processes.  Computer Vision & Image 
Understanding, 62(1), 27-46. 

R. Tong, L. Appelbaum, V. Askman & J. 
Cunningham (1987). Conceptual Information  
Retrieval using RUBRIC.  SIGIR-87, 247-253. 

K. Tzeras & S. Hartmann (1993). Automatic 
Indexing based on Bayesian Inference 
Networks. SIGIR-93, 22-34. 

A. Veling & P. van der Weerd (1999). Conceptual 
Grouping in Word Co-occurrence Networks. 
IJCAI 99: 694-701. 

D. Wang & D. Terman (1995). Locally Excitatory 
Globally Inhibitory Oscillator Networks. IEEE 
Trans. Neural Network. 6(1). 

Y. Yang & X. Liu (1999). Re-examination of Text 
Categorization. SIGIR-99, 43-49. 
