Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL),
pages 33–39, Ann Arbor, June 2005. ©2005 Association for Computational Linguistics

Word Independent Context Pair Classification Model for Word Sense Disambiguation

Cheng Niu, Wei Li, Rohini K. Srihari, and Huifeng Li
Cymfony Inc.
600 Essjay Road, Williamsville, NY 14221, USA.
{cniu, wei, rohini, hli}@cymfony.com

Abstract
Traditionally, word sense disambiguation (WSD) involves a different context
classification model for each individual word. This paper presents a weakly
supervised learning approach to WSD based on learning a word independent
context pair classification model. Statistical models are not trained for
classifying the word contexts, but for classifying a pair of contexts, i.e.
determining if a pair of contexts of the same ambiguous word refers to the
same or different senses. Using this approach, the annotated corpus of a
target word A can be exploited to disambiguate senses of a different word B.
Hence, only a limited amount of existing annotated corpus is required in
order to disambiguate the entire vocabulary. In this research, maximum
entropy modeling is used to train the word independent context pair
classification model. Then, based on the context pair classification results,
clustering is performed on word mentions extracted from a large raw corpus.
The resulting context clusters are mapped onto the external thesaurus
WordNet. This approach is flexible enough to integrate heterogeneous
knowledge sources efficiently, e.g. trigger words and parsing structures.
Based on Senseval-3 Lexical Sample standards, this approach achieves
state-of-the-art performance in the unsupervised learning category, and
performs comparably with the supervised Naïve Bayes system.
1 Introduction
Word Sense Disambiguation (WSD) is one of the central problems in Natural
Language Processing. The difficulty of this task lies in the fact that
context features and the corresponding statistical distribution are different
for each individual word. Traditionally, WSD involves training a context
classification model for each ambiguous word. Gale et al. (1992) use the
Naïve Bayes method for context classification, which requires a manually
annotated corpus for each ambiguous word. This causes a serious Knowledge
Bottleneck. The bottleneck is particularly serious when considering the
domain dependency of word senses. To overcome the Knowledge Bottleneck,
unsupervised or weakly supervised learning approaches have been proposed.
These include the bootstrapping approach (Yarowsky 1995) and the context
clustering approach (Schütze 1998).
The above unsupervised or weakly supervised learning approaches are less
subject to the Knowledge Bottleneck. For example, (Yarowsky 1995) only
requires the number of senses and a few seeds for each sense of an ambiguous
word (hereafter called keyword). (Schütze 1998) may only need minimal
annotation to map the resulting context clusters onto an external thesaurus
for benchmarking and application-related purposes. Both methods are based on
trigger words only.
This paper presents a novel approach based on learning a word-independent
context pair classification model. This idea may be traced back to
(Schütze 1998), where context clusters based on generic Euclidean distance
are regarded as distinct word senses. Different from (Schütze 1998), we
observe that generic context clusters may not always correspond to distinct
word senses. Therefore, we use supervised machine learning to model the
relationship between context distinctness and sense distinctness.
Although supervised machine learning is used for the context pair
classification model, our overall system belongs to the weakly supervised
category because the learned context pair classification model is
independent of the keyword for disambiguation. Our system does not need
human-annotated instances for each target ambiguous word. The weak
supervision is performed by using a limited amount of existing annotated
corpus which does not need to include the target word set.
The insight is that the correlation regularity be-
tween the sense distinction and the context distinc-
tion can be captured at Part-of-Speech category 
level, independent of individual words or word 
senses. Since context determines the sense of a 
word, a reasonable hypothesis is that there is some 
mechanism in the human comprehension process 
that will decide when two contexts are similar (or 
dissimilar) enough to trigger our interpretation of a 
word in the contexts as one meaning (or as two 
different meanings). We can model this mecha-
nism by capturing the sense distinction regularity 
at category level.  
In the light of this, a maximum entropy model is 
trained to determine if a pair of contexts of the 
same keyword refers to the same or different word 
senses. The maximum entropy modeling is based 
on heterogeneous context features that involve 
both trigger words and parsing structures. To ensure
that the resulting model is independent of individual words,
the keywords used in training are
different from the keywords used in benchmarking.
For any target keyword, a collection of contexts is 
retrieved from a large raw document pool. Context 
clustering is performed to derive the optimal con-
text clusters which globally fit the local context 
pair classification results. Here statistical annealing 
is used for its optimal performance. In benchmark-
ing, a mapping procedure is required to correlate 
the context clusters with external ontology senses. 
In what follows, Section 2 formulates the maxi-
mum entropy model for context pair classification. 
The context clustering algorithm, including the
objective function of the clustering and the statistical
annealing-based optimization, is described in Section 3.
Section 4 presents and discusses benchmarks,
followed by the conclusion in Section 5.
2 Maximum Entropy Modeling for Con-
text Pair Classification 
Given $n$ mentions of a keyword, we first introduce
the following symbols. $C_i$ refers to the $i$-th context.
$S_i$ refers to the sense of the $i$-th context.
$CS_{i,j}$ refers to the context similarity between the
$i$-th context and the $j$-th context, which is a subset
of the predefined context similarity features.
$f_\alpha$ refers to the $\alpha$-th predefined context similarity
feature. So $CS_{i,j}$ takes the form of $\{f_\alpha\}$.
In this section, we study the context pair classification
task, i.e. given a pair of contexts $C_i$ and $C_j$ of the
same target word, are they referring to the same sense?
This task is formulated as comparing the following
conditional probabilities: $\Pr(S_i = S_j \mid CS_{i,j})$ and
$\Pr(S_i \neq S_j \mid CS_{i,j})$. Unlike traditional context
classification for WSD, where a statistical model is trained
for each individual word, our context pair classification
model is trained for each Part-of-Speech (POS) category.
The reason for choosing POS as the appropriate 
category for learning the context similarity is that 
the parsing structures, hence the context represen-
tation, are different for different POS categories. 
The training corpora are constructed from the
Senseval-2 English Lexical Sample training corpus.
To ensure that the resulting model is independent
of individual words, the target words used for
benchmarking (the ambiguous words of the
Senseval-3 English Lexical Sample task)
are removed during corpus construction.
For each POS category, positive and negative
instances are constructed as follows.
Positive instances are constructed using context 
pairs referring to the same sense of a word.  Nega-
tive instances are constructed using context pairs 
that refer to different senses of a word.  
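
The pair construction described above can be illustrated with a minimal Python sketch. The function name `build_pair_instances` and the (word, sense, context) triple layout are assumptions made for illustration, not part of the original system, and the balancing into equal numbers of positive and negative instances is not shown.

```python
from itertools import combinations

def build_pair_instances(tagged_contexts):
    """Split same-word context pairs into positive and negative instances.

    tagged_contexts: list of (word, sense, context) triples (hypothetical layout).
    Returns (positives, negatives), each a list of (context_a, context_b) pairs.
    """
    by_word = {}
    for word, sense, context in tagged_contexts:
        by_word.setdefault(word, []).append((sense, context))
    positives, negatives = [], []
    for items in by_word.values():
        # Pairs are only formed between contexts of the same word.
        for (s1, c1), (s2, c2) in combinations(items, 2):
            if s1 == s2:
                positives.append((c1, c2))   # same sense
            else:
                negatives.append((c1, c2))   # different senses
    return positives, negatives

data = [
    ("bank", "river", "ctx1"),
    ("bank", "river", "ctx2"),
    ("bank", "money", "ctx3"),
    ("plant", "factory", "ctx4"),
]
pos, neg = build_pair_instances(data)
```

A word with a single annotated context, such as "plant" here, contributes no pairs.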
For each POS category, we have constructed
about 36,000 instances, half positive and half negative.
The instances are represented as pairwise context
similarities, taking the form of $\{f_\alpha\}$.
Before presenting the context similarity features 
we used, we first introduce the two categories of 
the involved context features: 
 
i) Co-occurring trigger words within a predefined
window of 50 words on each side of the keyword.
The trigger words are learned from a TIPSTER
document pool containing ~170 million words of AP
and WSJ news articles. Following (Schütze 1998),
$\chi^2$ is used to measure the cohesion between the
keyword and a co-occurring word. In our experiment,
all the words are first sorted by their $\chi^2$ with
the keyword, and then the top 2,000 words are
selected as trigger words.
 
ii) Parsing relationships associated with the 
keyword automatically decoded by a broad-
coverage parser, with F-measure (i.e. the pre-
cision-recall combined score) at about 85% 
(reference temporarily omitted for the sake of 
blind review). The logical dependency rela-
tionships being utilized are listed below. 
 
Noun: subject-of, object-of, complement-of, has-adjective-modifier,
      has-noun-modifier, modifier-of, possess, possessed-by, appositive-of

Verb: has-subject, has-object, has-complement, has-adverb-modifier,
      has-prepositional-phrase-modifier

Adjective: modifier-of, has-adverb-modifier
 
Based on the above context features, the follow-
ing three categories of context similarity features 
are defined: 
 
(1)  VSM-based (Vector Space Model based)
trigger word similarity: the trigger words
around the keyword are represented as a vector,
and the word $i$ in context $j$ is weighted as
follows:

$$\mathrm{weight}(i,j) = \mathrm{tf}(i,j) \cdot \log\frac{D}{\mathrm{df}(i)}$$

where $\mathrm{tf}(i,j)$ is the frequency of word $i$ in
the $j$-th context; $D$ is the number of documents
in the pool; and $\mathrm{df}(i)$ is the number of
documents containing the word $i$. $D$ and
$\mathrm{df}(i)$ are estimated using the document pool
introduced above. The cosine of the angle between
the two resulting vectors is used as the
context similarity measure.
 
(2)  LSA-based (Latent Semantic Analysis based) 
trigger word similarity: LSA (Deerwester et 
al. 1990) is a technique used to uncover the 
underlying semantics based on co-occurrence 
data. The first step of LSA is to construct a
word-vs.-document co-occurrence matrix.
Then singular value decomposition (SVD) is
performed on this co-occurrence matrix. The
key idea of LSA is to reduce noise or insig-
nificant association patterns by filtering the 
insignificant components uncovered by SVD. 
This is done by keeping only the top k singu-
lar values. By using the resulting word-vs.-
document co-occurrence matrix after the fil-
tering, each word can be represented as a vec-
tor in the semantic space. 
  
In our experiment, we constructed the original
word-vs.-document co-occurrence matrix from
100,000 documents of the
TIPSTER corpus. We processed these
documents using our POS tagger, and se-
lected the top n most frequently mentioned 
words from each POS category as base 
words: 
 
top 20,000 common nouns 
top 40,000 proper names 
top 10,000 verbs 
top 10,000 adjectives 
top 2,000 adverbs 
 
In performing SVD, we set k (i.e. the number 
of nonzero singular values) as 200, following 
the practice reported in (Deerwester et al. 
1990) and (Landauer & Dumais, 1997). 
 
Using the LSA scheme described above, each 
word is represented as a vector in the seman-
tic space. The co-occurring trigger words are 
represented as a vector summation. Then the 
cosine of the angle between the two resulting 
vector summations is computed, and used as 
the context similarity measure. 
 
(3) LSA-based parsing relationship similarity:
each relationship is in the form of $R_\alpha(w)$.
Using LSA, each word $w$ is represented as a
semantic vector $V(w)$. The similarity between
$R_\alpha(w_1)$ and $R_\alpha(w_2)$ is represented as the
cosine of the angle between $V(w_1)$ and $V(w_2)$.
Two special values are assigned to two exceptional
cases: (i) when the relationship $R_\alpha$ is decoded
in neither context; (ii) when the relationship
$R_\alpha$ is decoded for only one context.
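
As an illustration of the VSM-based similarity in item (1), the following minimal Python sketch weights each trigger word by tf times log(D/df) and compares two contexts by cosine. The helper names and the toy document frequencies are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_vector(context_words, doc_freq, num_docs):
    """weight(i, j) = tf(i, j) * log(D / df(i)) for each trigger word in the context."""
    tf = Counter(w for w in context_words if w in doc_freq)
    return {w: count * math.log(num_docs / doc_freq[w]) for w, count in tf.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy document frequencies in a pool of 1000 documents (made-up numbers).
doc_freq = {"river": 10, "water": 20, "loan": 10}
a = tfidf_vector(["river", "water", "river"], doc_freq, 1000)
b = tfidf_vector(["river", "water"], doc_freq, 1000)
c = tfidf_vector(["loan"], doc_freq, 1000)
sim_ab = cosine(a, b)   # contexts sharing trigger words
sim_ac = cosine(a, c)   # contexts with no shared trigger word
```

Contexts with no trigger word in common get a similarity of zero, which is one motivation for the LSA-based variants in items (2) and (3).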
 
In matching parsing relationships in a context 
pair, if only exact node match counts, very few 
cases can be covered, hence significantly reducing 
the effect of the parser in this task. To solve this 
problem, LSA is used as a type of synonym expan-
sion in matching.  For example, using LSA, the 
following word similarity values are generated: 
 
similarity(good, good)   1.00 
similarity(good, pretty) 0.79 
similarity(good, great) 0.72 
…… 
 
Given a context pair of a noun keyword, suppose 
the first context involves a relationship has-
adjective-modifier whose value is good, and the 
second context involves the same relationship has-
adjective-modifier with the value pretty, then the 
system assigns 0.79 as the similarity value for this 
relationship pair. 
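
The relationship matching just described, including the two exceptional cases from item (3), might be sketched as follows. The two-dimensional vectors are toy stand-ins for real LSA vectors, and the sentinel values `NEITHER` and `ONE_SIDED` are hypothetical names for the two special values, which the text does not specify.

```python
import math

NEITHER = "absent-in-both"    # hypothetical special value, case (i)
ONE_SIDED = "absent-in-one"   # hypothetical special value, case (ii)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def relation_similarity(rel, ctx1, ctx2, vectors):
    """Compare the words filling relationship `rel` in two contexts."""
    w1, w2 = ctx1.get(rel), ctx2.get(rel)
    if w1 is None and w2 is None:
        return NEITHER
    if w1 is None or w2 is None:
        return ONE_SIDED
    return cosine(vectors[w1], vectors[w2])

# Toy vectors chosen only so that the cosine is easy to check by hand.
vectors = {"good": [1.0, 0.0], "pretty": [0.8, 0.6]}
ctx_a = {"has-adjective-modifier": "good"}
ctx_b = {"has-adjective-modifier": "pretty"}
sim = relation_similarity("has-adjective-modifier", ctx_a, ctx_b, vectors)
```

With real LSA vectors the same call would return a value such as the 0.79 quoted above for good and pretty.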
 
To facilitate the maximum entropy modeling in
the later stage, all three categories of the resulting
similarity values are discretized into 10 integers.
Now the pairwise context similarity is
represented as a set of similarity features, e.g.

{VSM-Trigger-Word-Similarity-equal-to-2,
  LSA-Trigger-Word-Similarity-equal-to-1,
  LSA-Subject-Similarity-equal-to-2}.
 
In addition to the three categories of basic context
similarity features defined above, we also define
induced context similarity features by
combining basic context similarity features using
the logical AND operator. With induced features, the
context similarity vector in the previous example is
represented as

{VSM-Trigger-Word-Similarity-equal-to-2,
  LSA-Trigger-Word-Similarity-equal-to-1,
  LSA-Subject-Similarity-equal-to-2,
  [VSM-Trigger-Word-Similarity-equal-to-2 and
   LSA-Trigger-Word-Similarity-equal-to-1],
  [VSM-Trigger-Word-Similarity-equal-to-2 and
   LSA-Subject-Similarity-equal-to-2],
  ………
  [VSM-Trigger-Word-Similarity-equal-to-2
   and LSA-Trigger-Word-Similarity-equal-to-1
   and LSA-Subject-Similarity-equal-to-2]
}
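
A minimal Python sketch of the discretization and of the induced (conjunctive) features follows. The equal-width binning of [0, 1] into ten integers and the function names are assumptions, since the exact binning scheme is not stated.

```python
from itertools import combinations

def discretize(value, bins=10):
    """Map a similarity to an integer in 0..bins-1 (assumed equal-width bins)."""
    return max(0, min(int(value * bins), bins - 1))

def basic_features(similarities):
    """Turn {feature name: similarity} into discretized feature strings."""
    return [f"{name}-equal-to-{discretize(v)}" for name, v in sorted(similarities.items())]

def with_induced(basic):
    """Add one induced feature for every conjunction of two or more basic features."""
    feats = list(basic)
    for size in range(2, len(basic) + 1):
        for combo in combinations(basic, size):
            feats.append("[" + " and ".join(combo) + "]")
    return feats

sims = {
    "VSM-Trigger-Word-Similarity": 0.25,
    "LSA-Trigger-Word-Similarity": 0.12,
    "LSA-Subject-Similarity": 0.25,
}
basic = basic_features(sims)
features = with_induced(basic)
```

Three basic features yield three pairwise conjunctions and one three-way conjunction, seven features in total. The number of conjunctions grows exponentially with the number of basic features, which is why sparse sampling becomes a concern for induced features.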
 
The induced features provide direct and fine-grained
information, but suffer from sparser sampling.
Combining basic features and induced features
under a smoothing scheme, maximum entropy
modeling may achieve optimal performance.
Using the context similarity features defined
above, the training corpora for the context pair
classification model are in the following format:

Instance_0 tag="positive" {VSM-Trigger-Word-Similarity-equal-to-2, …}
Instance_1 tag="negative" {VSM-Trigger-Word-Similarity-equal-to-0, …}
……………
where the positive tag denotes a context pair associated
with the same sense, and the negative tag denotes a
context pair associated with different senses.
 
Maximum entropy modeling is used to compute
the conditional probabilities
$\Pr(S_i = S_j \mid CS_{i,j})$ and $\Pr(S_i \neq S_j \mid CS_{i,j})$:
once the context pair similarity $CS_{i,j}$ is represented
as $\{f_\alpha\}$, the conditional probability is given as

$$\Pr(t \mid \{f_\alpha\}) = \frac{1}{Z} \prod_{f \in \{f_\alpha\}} w_{t,f} \qquad (1)$$

where $t \in \{S_i = S_j,\ S_i \neq S_j\}$, $Z$ is the
normalization factor, and $w_{t,f}$ is the weight associated
with tag $t$ and feature $f$. Using the training corpora
constructed above, the weights are computed with
the Iterative Scaling algorithm (Pietra et al. 1995).
The exponential prior smoothing scheme (Goodman 2003)
is adopted in training.
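
To make Equation (1) concrete, the sketch below multiplies the weights of the active features for each tag and normalizes over the two tags. The weight values are made up for illustration (in the paper they come from Iterative Scaling), and treating a missing weight as 1.0 is an assumption.

```python
import math

def pair_probabilities(features, weights):
    """Pr(t | {f}) = (1/Z) * product over active features f of w[t][f].

    weights: {tag: {feature: positive weight}}; a missing weight counts as 1.0
    (an assumption made for this sketch).
    """
    unnormalized = {
        tag: math.prod(w.get(f, 1.0) for f in features)
        for tag, w in weights.items()
    }
    z = sum(unnormalized.values())  # normalization factor Z
    return {tag: u / z for tag, u in unnormalized.items()}

# Hypothetical weights, not learned values.
weights = {
    "same": {"VSM-equal-to-2": 0.5, "LSA-equal-to-1": 0.8},
    "different": {"VSM-equal-to-2": 1.5, "LSA-equal-to-1": 1.2},
}
p = pair_probabilities(["VSM-equal-to-2", "LSA-equal-to-1"], weights)
```

Here the low similarity bins push the probability mass toward the "different" tag.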
3 Context Clustering based on Context 
Pair Classification Results 
Given $n$ mentions $\{C_i\}$ of a keyword, we use the
following context clustering scheme. The discovered
context clusters correspond to distinct word senses.
For any given context pair, the context similarity
features defined in Section 2 are computed. With $n$
mentions of the same keyword, $\frac{n(n-1)}{2}$ context
similarities $CS_{i,j}$ ($i \in [1,n]$, $j \in [1,i)$) are computed.
Using the context pair classification model, each
pair is associated with two scores
$sc^0_{i,j} = \log \Pr(S_i = S_j \mid CS_{i,j})$ and
$sc^1_{i,j} = \log \Pr(S_i \neq S_j \mid CS_{i,j})$, which correspond to
the probabilities of two situations: the pair refers to
the same or to different word senses.
Now we introduce the symbol $\{K, M\}$, which refers
to the final context cluster configuration,
where $K$ refers to the number of distinct senses, and
$M$ represents the many-to-one mapping (from contexts
to a sense) such that $M(i) = j$, $i \in [1,n]$, $j \in [1,K]$.
Based on the pairwise scores $\{sc^0_{i,j}\}$ and $\{sc^1_{i,j}\}$,
WSD is formulated as searching for the $\{K, M\}$ which
maximizes the following global score:

$$sc(\{K,M\}) = \sum_{i \in [1,n],\ j \in [1,i)} sc^{k(i,j)}_{i,j} \qquad (2)$$

where

$$k(i,j) = \begin{cases} 0, & \text{if } M(i) = M(j) \\ 1, & \text{otherwise} \end{cases}$$
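
The global score of Expression (2) can be computed directly from the pairwise scores, as in this minimal Python sketch. Storing the scores in dictionaries keyed by (i, j) with j < i, and the toy probabilities, are illustrative assumptions.

```python
import math

def global_score(assignment, sc_same, sc_diff):
    """Sum sc^0 over pairs in the same cluster and sc^1 over pairs in different clusters.

    assignment[i] is the cluster (sense) index M(i) of context i.
    sc_same[(i, j)] and sc_diff[(i, j)] hold the log scores for j < i.
    """
    total = 0.0
    for i in range(len(assignment)):
        for j in range(i):
            if assignment[i] == assignment[j]:
                total += sc_same[(i, j)]
            else:
                total += sc_diff[(i, j)]
    return total

# Toy same-sense probabilities for three contexts (made-up numbers).
p_same = {(1, 0): 0.9, (2, 0): 0.2, (2, 1): 0.1}
sc_same = {k: math.log(p) for k, p in p_same.items()}
sc_diff = {k: math.log(1.0 - p) for k, p in p_same.items()}

split = global_score([0, 0, 1], sc_same, sc_diff)   # contexts 0 and 1 together
merged = global_score([0, 0, 0], sc_same, sc_diff)  # all in one cluster
```

With these numbers the configuration that separates context 2 scores higher than the single-cluster configuration, as the pairwise probabilities suggest.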
 
A similar clustering scheme has been used successfully
for the task of co-reference in (Luo et al.
2004), (Zelenko, Aone and Tibbetts, 2004a) and
(Zelenko, Aone and Tibbetts, 2004b).
In this paper, statistical annealing-based optimization
(Neal 1993) is used to search for the $\{K, M\}$
which maximizes Expression (2).
The optimization process consists of two steps.
First, an intermediate solution $\{K, M\}_0$ is computed
by a greedy algorithm. Then, by setting
$\{K, M\}_0$ as the initial state, statistical annealing is
applied to search for the globally optimal solution.
The greedy algorithm is as follows.
1. Set the initial state $\{K, M\}$ as $K = n$ and
$M(i) = i$, $i \in [1,n]$;
2. Select a cluster pair for merging that
maximally increases
$sc(\{K,M\}) = \sum_{i \in [1,n],\ j \in [1,i)} sc^{k(i,j)}_{i,j}$;
3. If no cluster pair can be merged to increase
$sc(\{K,M\})$, output
$\{K, M\}$ as the intermediate solution;
otherwise, update $\{K, M\}$ by the merge
and go to step 2.
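
The greedy merging steps above might be sketched in Python as follows. The data layout (a dictionary mapping (i, j) with j < i to a pair of log scores) and the toy probabilities are assumptions, and the score is recomputed from scratch for every candidate merge, which is simple but not efficient.

```python
import math
from itertools import combinations

def global_score(assign, sc):
    """sc[(i, j)] = (same-sense log score, different-sense log score), for j < i."""
    return sum(
        sc[(i, j)][0] if assign[i] == assign[j] else sc[(i, j)][1]
        for i in range(len(assign)) for j in range(i)
    )

def greedy_cluster(n, sc):
    assign = list(range(n))              # step 1: K = n, M(i) = i
    while True:
        base = global_score(assign, sc)
        best_gain, best = 0.0, None
        # Step 2: try every cluster pair and keep the merge with the largest gain.
        for a, b in combinations(sorted(set(assign)), 2):
            trial = [a if c == b else c for c in assign]
            gain = global_score(trial, sc) - base
            if gain > best_gain:
                best_gain, best = gain, trial
        if best is None:                 # step 3: no merge increases the score
            return assign
        assign = best

# Toy same-sense probabilities for three contexts (made-up numbers).
p_same = {(1, 0): 0.9, (2, 0): 0.2, (2, 1): 0.1}
sc = {k: (math.log(p), math.log(1.0 - p)) for k, p in p_same.items()}
clusters = greedy_cluster(3, sc)
```

On this toy input the only merge that raises the score is that of contexts 0 and 1, after which no further merge helps.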
 
Using the intermediate solution $\{K, M\}_0$ of the
greedy algorithm as the initial state, the statistical
annealing is implemented using the following
pseudo-code:
       set {K,M} = {K,M}_0;
       for (β = β_0; β < β_final; β = β * 1.01)
       {
           iterate a pre-defined number of times
           {
               set {K,M}_1 = {K,M};
               update {K,M}_1 by randomly changing the
               cluster number and cluster contents;
               set x = sc({K,M}_1) / sc({K,M});
               if (x >= 1)
               {
                   set {K,M} = {K,M}_1;
               }
               else
               {
                   set {K,M} = {K,M}_1 with probability x^β;
               }
               if sc({K,M}) > sc({K,M}_0)
                   then set {K,M}_0 = {K,M};
           }
       }
       output {K,M}_0 as the optimal state.

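
A runnable sketch of the annealing loop is given below under stated assumptions. It interprets the acceptance ratio x as the likelihood ratio exp(sc(new) - sc(old)), so that a worse state is accepted with probability x^β; this interpretation, the single-context relabeling move, and all parameter values are assumptions rather than details taken from the paper.

```python
import math
import random

def global_score(assign, sc):
    """sc[(i, j)] = (same-sense log score, different-sense log score), for j < i."""
    return sum(
        sc[(i, j)][0] if assign[i] == assign[j] else sc[(i, j)][1]
        for i in range(len(assign)) for j in range(i)
    )

def anneal(start, sc, beta0=0.1, beta_final=10.0, iters=20, seed=0):
    rng = random.Random(seed)            # fixed seed for reproducibility
    current, best = list(start), list(start)
    beta = beta0
    while beta < beta_final:
        for _ in range(iters):
            # Random move: relabel one context; an unused label opens a new cluster.
            trial = list(current)
            trial[rng.randrange(len(trial))] = rng.randrange(len(trial))
            delta = global_score(trial, sc) - global_score(current, sc)
            # Assumed ratio x = exp(delta); accept if x >= 1, else with probability x**beta.
            if delta >= 0 or rng.random() < math.exp(beta * delta):
                current = trial
            if global_score(current, sc) > global_score(best, sc):
                best = list(current)     # keep the best state seen so far
        beta *= 1.01                     # slowly sharpen the acceptance rule
    return best

# Toy same-sense probabilities for three contexts (made-up numbers).
p_same = {(1, 0): 0.9, (2, 0): 0.2, (2, 1): 0.1}
sc = {k: (math.log(p), math.log(1.0 - p)) for k, p in p_same.items()}
start = [0, 0, 0]                        # deliberately poor initial state
result = anneal(start, sc)
```

In the paper's setting `start` would be the greedy solution rather than the poor state used here; the best-so-far bookkeeping guarantees the output is never worse than the initial state.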
4 Benchmarking 
Corpus-driven context clusters need to be mapped to a
word sense standard to facilitate performance
benchmarking. Using Senseval-3 evaluation standards,
we implemented the following procedure to
map the context clusters:
 
i) Process TIPSTER corpus and the origi-
nal unlabeled Senseval-3 corpora (in-
cluding the training corpus and the 
testing corpus) by our parser, and save 
all the parsing results into a repository.  
 
ii) For each keyword, all related contexts in 
Senseval-3 corpora and up-to-1,000 re-
lated contexts in TIPSTER corpus are 
retrieved from the repository.  
 
iii) All the retrieved contexts are clustered
using the context clustering algorithm
presented in Sections 2 and 3.
 
iv) For each keyword sense, three annotated 
contexts from Senseval-3 training cor-
pus are used for the sense mapping. The 
context cluster is mapped onto the most 
frequent word sense associated with the 
cluster members. By design, the context 
clusters correspond to distinct senses, 
therefore, we do not allow multiple con-
text clusters to be mapped onto one 
sense. In case multiple clusters corre-
spond to one sense, only the largest 
cluster is retained.  
 
v) Each context in the testing corpus is
tagged with the sense to which its context
cluster corresponds.
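
The sense mapping of step iv) can be illustrated with the following minimal Python sketch. The data layout and the function name `map_clusters` are assumptions; clusters that contain no annotated context, or that lose to a larger cluster with the same majority sense, are left unmapped.

```python
from collections import Counter

def map_clusters(clusters, labels):
    """Map each cluster to the most frequent sense among its annotated members.

    clusters: {cluster_id: [context_id, ...]}
    labels:   {context_id: sense} for the few annotated contexts
    If several clusters vote for the same sense, only the largest is retained.
    """
    chosen = {}  # sense -> cluster_id currently holding that sense
    for cid, members in clusters.items():
        votes = Counter(labels[m] for m in members if m in labels)
        if not votes:
            continue  # no annotated member, cluster stays unmapped
        sense = votes.most_common(1)[0][0]
        if sense not in chosen or len(members) > len(clusters[chosen[sense]]):
            chosen[sense] = cid
    return {cid: sense for sense, cid in chosen.items()}

clusters = {0: [1, 2, 3, 4], 1: [5, 6], 2: [7, 8, 9]}
labels = {1: "A", 2: "A", 5: "A", 7: "B", 8: "B", 9: "A"}
mapping = map_clusters(clusters, labels)
```

Clusters 0 and 1 both vote for sense A, so only the larger cluster 0 keeps it, while cluster 2 is mapped to sense B.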
 
As mentioned above, the Senseval-2 English Lexical
Sample training corpus is used to train the context
pair classification model, and the Senseval-3 English
Lexical Sample testing corpus is used here for
benchmarking. Several keywords occur
in both the Senseval-2 and Senseval-3 corpora.
The sense tags associated with these keywords are
not used in the context pair classification training
process.
In order to gauge the performance of this new 
weakly supervised learning algorithm, we have 
also implemented a supervised Naïve Bayes sys-
tem following (Gale et al. 1992).  This system is 
trained based on the Senseval-3 English Lexical 
Sample training corpus.  In addition, for the pur-
pose of quantifying the contribution from the pars-
ing structures in WSD, we have run our new 
system with two configurations: (i) using only 
trigger words; (ii) using both trigger words and 
parsing relationships. All the benchmarking is per-
formed using the Senseval-3 English Lexical Sam-
ple testing corpus and standards.  
The performance benchmarks for the two systems
in three runs are shown in Table 1, Table 2
and Table 3. When using only trigger words, our
algorithm falls about 8 percentage points below the
supervised Naïve Bayes system (see Table 1 vs.
Table 2). When parsing structures are added, the
gap narrows to about 5 percentage points
(see Table 3 vs. Table 2). Comparing
Table 1 with Table 3, we observe an improvement of
about 3 percentage points due to the contribution of
parsing support in WSD. The benchmark of our algorithm
using both trigger words and parsing relationships
is one of the best in the unsupervised category of the
Senseval-3 Lexical Sample evaluation.
 
Table 1. New Algorithm Using Only Trigger Words

Category         Fine-grained accuracy (%)   Coarse-grained accuracy (%)
Adjective (5)    46.3                        60.8
Noun (20)        54.6                        62.8
Verb (32)        54.1                        64.2
Overall          54.0                        63.4

Table 2. Supervised Naïve Bayes System

Category         Fine-grained accuracy (%)   Coarse-grained accuracy (%)
Adjective (5)    44.7                        56.6
Noun (20)        66.3                        74.5
Verb (32)        58.6                        70.0
Overall          61.6                        71.5

Table 3. New Algorithm Using Both Trigger Words and Parsing

Category         Fine-grained accuracy (%)   Coarse-grained accuracy (%)
Adjective (5)    49.1                        64.8
Noun (20)        57.9                        66.6
Verb (32)        55.3                        66.3
Overall          56.3                        66.4
 
Note that the Naïve Bayes algorithm has many
variations, and its performance has been greatly
enhanced in recent research. Based on Senseval-3
results, the best Naïve Bayes system outperforms
our version (which is implemented based
on Gale et al. 1992) by 8% to 10%. So the best
supervised WSD systems outperform our weakly
supervised WSD system by 13% to 15% in accuracy.
5 Conclusion 
We have presented a weakly supervised learning 
approach to WSD. Statistical models are not 
trained for the contexts of each individual word, 
but for context pair classification. This approach 
overcomes the knowledge bottleneck that chal-
lenges supervised WSD systems which need la-
beled data for each individual word. It captures the 
correlation regularity between the sense distinction 
and the context distinction at Part-of-Speech cate-
gory level, independent of individual words and 
senses. Hence, it only requires a limited amount of 
existing annotated corpus in order to disambiguate 
the full target set of ambiguous words, in particu-
lar, the target words that do not appear in the train-
ing corpus.   
The weakly supervised learning scheme can 
combine trigger words and parsing structures in 
supporting WSD. Using Senseval-3 English Lexi-
cal Sample benchmarking, this new approach 
reaches one of the best scores in the unsupervised 
category of English Lexical Sample evaluation. 
This performance is close to the performance for 
the supervised Naïve Bayes system. 
In the future, we will implement a new scheme 
to map context clusters onto WordNet senses by 
exploring WordNet glosses and sample sentences. 
Based on the new sense mapping scheme, we will 
benchmark our system performance using Senseval 
English all-words corpora.  
 
References  
Deerwester, S., S. T. Dumais, G. W. Furnas, T. K.
Landauer, and R. Harshman. 1990. Indexing by
Latent Semantic Analysis. Journal of the
American Society for Information Science.
Gale, W., K. Church, and D. Yarowsky. 1992. A
Method for Disambiguating Word Senses in a
Large Corpus. Computers and the Humanities,
26.
Goodman, J. 2003. Exponential Priors for Maximum
Entropy Models. In Proceedings of HLT-NAACL 2004.
Landauer, T. K., and S. T. Dumais. 1997. A Solution
to Plato's Problem: The Latent Semantic Analysis
Theory of the Acquisition, Induction, and
Representation of Knowledge. Psychological
Review, 104, 211-240.
Luo, X., A. Ittycheriah, H. Jing, N. Kambhatla and
S. Roukos. 2004. A Mention-Synchronous Coreference
Resolution Algorithm Based on the Bell
Tree. In Proceedings of ACL 2004.
Neal, R. M. 1993. Probabilistic Inference Using
Markov Chain Monte Carlo Methods. Technical
Report, Univ. of Toronto.
Pietra, S. D., V. D. Pietra, and J. Lafferty. 1995.
Inducing Features of Random Fields. IEEE
Transactions on Pattern Analysis and Machine
Intelligence.
Schütze, H. 1998. Automatic Word Sense
Discrimination. Computational Linguistics, 24(1).
Yarowsky, D. 1995. Unsupervised Word Sense
Disambiguation Rivaling Supervised Methods.
In Proceedings of ACL 1995.
Zelenko, D., C. Aone and J. Tibbetts. 2004a.
Coreference Resolution for Information Extraction.
In Proceedings of ACL 2004 Workshop on
Reference Resolution and its Application.
Zelenko, D., C. Aone and J. Tibbetts. 2004b. Binary
Integer Programming for Information Extraction.
In Proceedings of ACE 2004 Evaluation
Workshop.
