Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 1–8,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Improving Context Vector Models by Feature Clustering for Auto-
matic Thesaurus Construction 
 
Jia-Ming You 
Institute of Information Science 
Academia Sinica 
swimming@hp.iis.sinica.edu.tw  
Keh-Jiann Chen 
Institute of Information Science 
Academia Sinica 
kchen@iis.sinica.edu.tw  
Abstract 
Thesauruses are useful resources for NLP; 
however, manual construction of thesau-
rus is time consuming and suffers low 
coverage. Automatic thesaurus construc-
tion is developed to solve the problem. 
Conventional way to automatically con-
struct thesaurus is by finding similar 
words based on context vector models 
and then organizing similar words into 
thesaurus structure. But the context vec-
tor methods suffer from the problems of 
vast feature dimensions and data sparse-
ness. Latent Semantic Index (LSI) was 
commonly used to overcome the prob-
lems. In this paper, we propose a feature 
clustering method to overcome the same 
problems. The experimental results show 
that it performs better than the LSI mod-
els and do enhance contextual informa-
tion for infrequent words. 
1 Introduction 
Thesaurus is one of the most useful linguistic 
resources. It provides information more than just 
synonyms. For example, in WordNet (Fellbaum, 
1998), it also builds up relations between syno-
nym sets, such as hyponym, hypernym. There are 
two Chinese thesauruses Cilin(1983) and 
Hownet1. Cilin provides synonym sets with sim-
ple hierarchical structure. Hownet uses some 
primitive senses to describe word meanings. The 
common primitive senses provide additional re-
lations between words implicitly. However, 
many words occurred in contemporary news cor-
pora are not covered by Chinese thesauruses.  
 
                                                 
1 http://www.HowNet.com(Dong Zhendong, Dong 
Qiang:HowNet) 
 
Therefore, we intend to create a thesaurus 
based on contemporary news corpora. The com-
mon steps to automatically construct a thesaurus 
include a) contextual information extraction, b) 
finding synonym words and c) organizing syno-
nym words into a thesaurus. The approach is 
based upon the fact that word meaning lays on its 
contextual behavior. If words act similarly in 
context, they may share the same meaning.  
However, the method can only handle frequent 
words rather than infrequent ones. In fact most of 
vocabularies occur infrequently, one has to dis-
cover extend information to overcome the data 
sparseness problem. We will introduce the con-
ventional approaches for automatic thesaurus 
construction in section 2. Follow a discussion 
about the problems and solutions of context vec-
tor models in section 3.  In section 4, we use two 
performance evaluation metrics, i.e. discrimina-
tion and nonlinear interpolated precision, to 
evaluate our proposed method. 
2 Conventional approaches for auto-
matic thesaurus construction  
The conventional approaches for automatic the-
saurus construction include three steps: (1) Ac-
quire contextual behaviors of words from cor-
pora. (2) Calculate the similarity between words. 
(3) Finding similar words and then organizing 
into a thesaurus structure.  
2.1 Acquire word sense knowledge 
One can model word meanings by their co-
occurrence context. The common ways to extract 
co-occurrence contextual words include simple 
window based and syntactic dependent based 
(You, 2004). Obviously, syntactic dependent 
relations carry more accurate information than 
window based. Also, it can bring additional in-
formation, such as POS (part of speech) and se-
mantic roles etc. To extract the syntactic de-
1
;1 )(log)()rdentropy(wo ∑= −=
m
k
i
kwordp
i
kwordpi
pended relation, a raw text has to be segmented, 
POS tagged, and parsed. Then the relation ex-
tractor identifies the head-modifier relations 
and/or head-argument relations. Each relation 
could be defined as a triple (w, r, c), where w is 
the thesaurus term, c is the co-occurred context 
word and r is the relation between w and c.  
Then context vector of a word is represented 
differently by different models, such as: tf, 
weight-tf, Latent Semantic Indexing (LSI) 
(Deerwester, S.,et al., 1990) and Probabilistic 
LSI (Hofmann, 1999). The context vectors of 
word x can be express by:     
 
a) tf model: word x = }tf...,,2tf,1{tf xnxx ,where xitf is 
the term frequency of the ith context word when 
given word x.  
 
b) weight-tf model: assume there are n contex-
tual words and m target words. word x= 
 
,where weighti,  we used here, is defined as  
[logm-entropy(wordi)]/logm 
   
)(wordikp      
is the co-occurrence probability of wordk when 
given wordi. 
 
c) LSI or PLSI models: using tf or weighted-tf 
co-occurrence matrix and by adopting LSI or 
PLSI to reduce the dimension of the matrix. 
 
2.2 Similarity between words  
The common similarity functions include  
a) Adopting simple frequency feature, such as 
cosine, which computes the angle between two 
context vectors;  
 
b) Represent words by the probabilistic distribu-
tion among contexts, such as Kull-Leiber diver-
gence (Cover and Thomas, 1991). 
The first step is to convert the co-occurrence 
matrix into a probabilistic matrix by simple for-
mula.  
∑ ===
∑ ====
= n
1k
y
ktf
y
itfxiq},xn,...qx2q,x1{q wordy 
n
1k
x
ktf
x
itfxip},xn,...px2p,x1{pwordx
 q
p
 
 
Then calculate the distance between probabil-
istic vectors by sums up the all probabilistic dif-
ference among each context word so called cross 
entropy. 
 
Due to the original KL distance is asymmetric 
and is not defined when zero frequency occurs. 
Some enhanced KL models were developed to 
prevent these problems such as Jensen-Shannon 
(Jianhua, 1991), which introducing a probabilis-
tic variable m, or α -Skew Divergence (Lee, 
1999), by adopting adjustable variable α. Re-
search shows that Skew Divergence achieves 
better performance than other measures. (Lee, 
2001) 
 
))1(||(yxS  rgence)D(SkewDive yxxKL aaa −+==  
 
2/)(                                  
,2/)}||()||({y)x,(JS Shannon)-D(Jensen
yxm
myKLmxKL
+=
+==
 
To convert distance to similarity value, we 
adopt the formula inspired by Mochihashi, and 
Matsumoto 2002.   
 
 
2.3 Organize similar words into thesaurus  
 
There are several clustering methods can be used 
to cluster similar words. For example, by select-
ing N target words as the entries of a thesaurus, 
then extract top-n similar words for each entry; 
adopting HAC(Hierarchical agglomerative clus-
tering, E.M. Voorhees,1986) method to cluster 
the most similar word pairs in each clustering 
loop. Eventually, these similar words will be 
formed into synonyms sets.  
 
3 Difficulties and Solutions 
There are two difficulties of using context vector 
models. One is the enormous dimensions of con-
}weight...tf2weight2tf,1weight1{tf nxxx n ×××
yx
xy)cos(x,
×
•= y
)},distance(exp{wordy)(wordx,similarity yx⋅−= l
i
i
21 q
plog)(q)KL(p, :Distance KL •∑
=
= ipn
i
2
textual words, and the other is data sparseness 
problem. Conventionally LSI or PLSI methods 
are used to reduce feature dimensions by map-
ping literal words into latent semantic classes. 
The researches show that it’s a promising 
method (April Kontostathis, 2003). However the 
latent semantic classes also smooth the informa-
tion content of feature vectors. Here we proposed 
a different approach to cope with the feature re-
duction and data sparseness problems. 
 
3.1 Feature Clustering 
Reduced feature dimensions and data sparseness 
cause the problem of inaccurate contextual in-
formation. In general, one has to reduce the fea-
ture dimensions for computational feasibility and 
also to extend the contextual word information to 
overcome the problem of insufficient context 
information. 
In our experiments, we took the clustered-
feature approaches instead of LSI to cope with 
these two problems and showed better perform-
ances. The idea of clustered-feature approaches 
is by adopting the classes of clustering result of 
the frequent words as the new set of features 
which has less feature dimensions and context 
words are naturally extend to their class mem-
bers. We followed the steps described in section 
2 to develop the synonyms sets. First, the syntac-
tic dependent relations were extracted to create 
the context vectors for each word. We adopted 
the skew divergence as the similarity function, 
which is reported to be the suitable similarity 
function (Masato, 2005), to measure the distance 
between words.  
 
We used HAC algorithm to develop the syno-
nyms classes, which is a greedy method, simply 
to cluster the most similar word pairs at each 
clustering iteration.  
 
The HAC clustering process: 
 
While  the similarity of the most similar word pair 
(wordx, wordy) is greater than a threshold ε 
 
then cluster wordx, wordy together and replace it with 
the centroid   between wordx and wordy 
 
Recalculate the similarity between other words and 
the  centroid   
 
3.2 Clustered-Feature Vectors 
We obtain the synonyms sets S from above HAC 
method. Let the extracted synonyms sets S = { S1, 
S2,…SR} which contains R synonym classes; 
i
jS stands for the jth element of the ith synonym 
class;  the ith synonym class Si contains Qi ele-
ments.  
 
 
The feature extension processing transforms 
the coordination from literal words to synonyms 
sets. Assume there are N contextual words 
{C1,C2,…CN}, and the first step is to transform 
the context vector of of Ci to the distribution vec-
tor among S. Then the new feature vector is the 
summation of the distribution vectors among S 
of its all contextual words. 
 
The new feature vector of wordj = 
∑= ×Ni 1  jitf Distribution_Vector_among_S( iC ) 
,where jitf  is the term frequency of the context 
word Ci occurs with wordj.  
Distribution_Vector_among_S( iC )= { }RSiPSiPSiP ,..., 21 ,
.S synonyms  at the  of rdscontext wo
 ofon distributi  themeans,(Ci)1
),(
  where,
jjthCi
freq
Qj
q
CijqSfreq
SiP j
∑
==
 
Due to the transformed coordination no longer 
stands for either frequency or probability, we use 
simple cosine function to measure the similarity 
between these transformed clustered-feature vec-
tors.  
4 Evaluation  
To evaluate the performance of the feature clus-
tering method, we had prepared two sets of test-
ing data with high and low frequency words re-
spectively. We want to see the effects of feature 
reduction and feature extension for both frequent 
and infrequent words. 
 














=
R
QR
RR
Q
Q
SSS
SSS
SSS
...
............
...
...
 S 
21
2
2
2
2
2
1
1
1
1
2
1
1
3
4.1 Discrimination Rates  
The discrimination rate is used to examine the 
capability of distinguishing the correlation be-
tween words. Given a word pair (wordi,wordj), 
one has to decide whether the word pair is simi-
lar or not. Therefore, we will arrange two differ-
ent word pair sets, related and unrelated, to esti-
mate the discrimination. By given the formula 
below  
 
 
,where Na and Nb are respectively the numbers 
of synonym word pairs and unrelated word pairs. 
As well as, na and nb are the numbers of correct 
labeled pairs in synonyms and unrelated words.   
4.2 Nonlinear interpolated precision  
 
The Nap evaluation is used to measure the per-
formance of restoring words to taxonomy, a 
similar task of restoring words in WordNet 
(Dominic Widdows, 2003).  
The way we adopted Nap evaluation is to re-
construct a partial Chinese synonym set, and 
measure the structure resemblance between 
original synonyms and the reconstructed one. By 
doing so, one has to prepare certain number of 
synonyms sets from Chinese taxonomy, and try 
to reclassify these words.  
Assume there are n testing words distributed 
in R synonyms sets.  Let i1R stands for the repre-
sented word of the ith synonyms set. Then we 
will compute the similarity ranking between each 
represented word and the rest n-1 testing words. 
By given formula  
 
i
jS  represents the jth similar word of 
i
1R  among 
the rest n-1 words 
 



= 0   synonym are R and  S if,1
i
1
i
ji
jZ
 
 
The NAP value means how many percent 
synonyms can be identified. The maximum value 
of NAP is 1, means the extracted similar words 
are exactly match to the synonyms.  
5 Experiments 
The context vectors were derived from a 10 
year news corpus from The Central News 
Agency. It contains nearly 33 million sentences, 
234 million word tokens, and we extracted 186 
million syntactic relations from this corpus. Due 
to the low reliability of infrequent data, only the 
relation triples (w, r, c), which occurs more than 
3 times and POS of w and c must be noun or 
verb, are used. It results that nearly 30,000 high 
frequent nouns and verbs are used as the contex-
tual features. And with feature clustering2, the 
contextual dimensions were reduced from 30,988 
literal words to 12,032 semantic classes. 
In selecting testing data, we consider the 
words that occur more than 200 times as high 
frequent words and the frequencies range from 
40 to 200 as low frequent words.    
 
Discrimination  
 
For the discrimination experiments, we randomly 
extract high frequent word pairs which include 
500 synonym pairs and 500 unrelated word pairs 
from Cilin (Mei et. al, 1983). At the mean time, 
we also prepare equivalent low frequency data.  
We use a mathematical technique Singular 
Value Decomposition (SVD) to derive principal 
components and to implement LSI models with 
respect to different feature dimensions from 100 
to 1000. We compare the performances of differ-
ent models. The results are shown in the follow-
ing figures. 
 
Figure1.  Discrimination for high frequent words 
 
The result shows that for the high frequent 
data, although the feature clustering method did 
not achieve the best performance, it perform-
ances better at related data and a balanced per-
formance at unrelated data. The tradeoffs be-
                                                 
2 Some feature clustering results are listed in the Ap-
pendix  
 += NbnbNana21ratetion Discrimina
,1R1NAP
1
1
1n
1j
R
1i



 += ∑∑∑ −
=
−
==
j
k
i
k
i
j Z
j
Z
4
tween related recalls and unrelated recalls are 
clearly shown. Another observation is that no 
matter of using LSI or literal word features (tf or 
weight_tf), the performances are comparable. 
Therefore, we could simply use any method to 
handle the high frequent words.  
 
Figure2 Discrimination for low frequent word 
 
For the infrequent words experiments, neither 
LSI nor weighted-tf performs well due to insuffi-
cient contextual information. But by introducing 
feature clustering method, one can gain more 6% 
accuracy for the related data. It shows feature 
clustering method could help gather more infor-
mation for the infrequent words.  
 
Nonlinear interpolated precision 
 
For the Nap evaluation, we prepared two testing 
data from Cilin and Hownet. In the high frequent 
words experiments, we extract 1311 words 
within 352 synonyms sets from Cilin and 2981 
words within 570 synonyms sets from Hownet.  
 
Figure 3. Nap performance for high frequent words 
 
In high frequent experiments, the results show 
that the models retaining literal form perform 
better than dimension reduction methods. It 
means in the task of measuring similarity of high 
frequent words using literal contextual feature 
vectors is more precise than using dimension 
reduction feature vectors. 
In the infrequent words experiments, we can 
only extract 202 words distributed in 62 syno-
nyms sets from Cilin and 1089 words within 222 
synonyms sets. Due to fewer testing words, LSI 
was not applied in this experiment. 
 
Figure 4. Nap performance for low frequent words 
 
It shows with insufficient contextual informa-
tion, the feature clustering method could not help 
in recalling synonyms because of dimensional 
reduction.  
 
6. Error Analysis and Conclusion  
 
Using context vector models to construct thesau-
rus suffers from the problems of large feature 
dimensions and data sparseness. We propose a 
feature clustering method to overcome the prob-
lems. The experimental results show that it per-
forms better than the LSI models in distinguish-
ing related/unrelated pairs for the infrequent data, 
and also achieve relevant scores on other evalua-
tions.  
Feature clustering method could raise the abil-
ity of discrimination, but not robust enough to 
improve the performance in extracting synonyms. 
It also reveals the truth that it’s easy to distin-
guish whether a pair is related or unrelated once 
the word pair shares the same sense in their 
senses. However, it’s not the case when seeking 
synonyms. One has to discriminate each sense 
for each word first and then compute the similar-
ity between these senses to achieve synonyms.  
Because feature clustering method lacks the abil-
ity of senses discrimination of a word, the 
method can handle the task of distinguishing cor-
relation pairs rather than synonyms identification. 
 
Also, after analyzing discrimination errors 
made by context vector models, we found that 
some errors are not due to insufficient contextual 
information. Certain synonyms have dissimilar 
contextual contents for different reasons. We 
observed some phenomenon of these cases:  
 
5
a) Some senses of synonyms in testing data are 
not their dominant senses.  
 
Take guang1hua2 (光華) for example, it has a 
sense of “splendid” which is similar to the sense 
of guang1mang2 (光芒). Guang1hua2 and 
guang1mang2 are certainly mutually changeable 
in a certain degree, guang1hua2jin4shi4 (光華盡
失) and guang1mang2jin4shi4 (光芒盡失), or 
xi2ri4guang1hua2 ( 昔日光華) and 
xi2ri4guang1mang2 (昔日光芒). However, the 
dominated contextual sense of guang1hua2 is 
more likely to be a place name, like 
guang1hua2shi4chang3( 光華市場) or 
hua1lian2guang1hua2 (花蓮光華) etc3.  
 
b) Some synonyms are different in usages for 
pragmatic reasons.  
 
Synonyms with different contextual vectors 
could be result from different perspective views. 
For example, we may view wai4jie4 (外界) as a 
container image with viewer inside, but on the 
other hand, yi3wai4 (以外) is an omnipotence 
perspective. This similar meaning but different 
perspective makes distinct grammatical usage 
and different collocations.  
 
 
 Similarly, zhong1shen1 (終身) and sheng1ping2 
( 生平) both refer to “life-long time”. 
zhong1shen1 explicates things after a time point, 
which differs from sheng1ping2, showing mat-
ters before a time point.  
 
 
 
 
 
c) Domain specific usages.  
 
 For example, in medical domain news ,wa1wa1 
(娃娃) occurs frequently with bo1li2 (玻璃) refer 
                                                 
3 This may due to different genres. In newspapers the 
proper noun usage of guang1hua2 is more common 
than in a literature text. 
to kind of illness. Then the corpus reinterpret 
wa1wa1 (娃娃) as a sick people, due to it occurs 
with medical term. But the synonym of wa1wa1 
(娃娃), xiao3peng2you3(小朋友) stands for 
money in some finance news.  Therefore, the 
meanings of words change from time to time. It’s 
hard to decide whether meaning is the right an-
swer when finding synonyms.  
  
With above observations, our future researches 
will be how to distinguish different word senses 
from its context features. Once we could distin-
guish the corresponding features for different 
senses, it will help us to extract more accurate 
synonyms for both frequent and infrequent words.  
 
 
References  
April Kontostathis,  William M. Pottenger 2003. ,  A 
Framework for Understanding LSI Performance,  In 
the Proceedings of the ACM SIGIR Workshop on 
Mathematical/Formal Methods in Information Re-
trieval, Annual International SIGIR Conference, 2003. 
 
Christiance Fellbaum, editor 1998,WordNet: An alec-
tronic lwxical database. MIT press, Cambrige MA.  
 
Deerwester, S.,et al. 1990 Indexing by Latent Seman-
tic Analysis. Jorunal of the American Society for In-
formation Science, 41(6):391-407 
 
Dominic Widdows. 2003. Unsupervised methods for 
developing taxonomies by combining syntactic and 
statistical information. In Proceeding of HLT-NAACL 
2003 Main papers, pp, 197-204.  
 
E.M. Voorhees, “Implement agglomerative hierarchi-
cal clustering algorithm for use in document re-
trieval”, Information Processing & Management. , no. 
22 pp.46-476,1986 
 
Hofmann, T.1999. Probabilistic Latent  Semantic 
Indexing. Proc.of the 22nd International conference on 
Research and Development in Information Retrieval 
(SIGIR’99),50-57 
 
James R.Curran and Marc Moens. 2002. Improve-
ments in Automatic Thesaurus Extraction. Proceed-
ings of the Workshop of the ACL Special Interest 
Group on the Lexicon (SIGLEX), pp. 59-66 
 
Jia-Ming You and Keh-Jiann Chen, 2004 Automatic 
Semantic Role Assignment for a Tree Structure, Pro-
ceedings of 3rd ACL SIGHAN Workshop 
 
Jiahua Lin. 1991. Divergence measures based on the 
Shannon Entropy. IEEE transactions on Information 
Theory, 37(1): 145-151 
 
Lillian Lee. 2001. On the effectiveness of the skew 
divergence for statistical language analysis. In Artifi-
cial Intelligence and Statistics 2001, page 65-72. 
 
Lillian Lee. 1999. Measure of distributional similarity. 
In Proceeding of the 37th Annual Meeting of the As-
sociation for Computational Linguistics (ACL-1999), 
page 23-32.  
 
Masato Hagiwara, Yasuhiro Ogawa, and Katsuhiko 
Toyama. 2005. PLSI Utilization for Automatic The-
saurus Construction. IJCNLP 2005, LNAI 651, pp. 
334-345.  
 
Mei,Jiaju,Yiming Lan, Yunqi Gao, Yongxian Ying 
(1983) 同義詞詞林  [ A Dictionary of Syno-
nyms],Shanghai Cishu Chubanshe.  
 
Mochihashi, D., Matsumoto, Y.2002. Probabilistic 
Representation of Meanings. IPSJ SIG Notes Natural 
Language, 2002-NL-147:77-84. 
 
T.Cover and J.Thomas, 1991. Element of Information 
Theory. Wiley & sons, New York 
