A New Approach to Word Sense Disambiguation 
Rebecca Bruce and Janyce Wiebe 
The Computing Research Lab 
New Mexico State University 
Las Cruces, NM 88003 
ABSTRACT 
This paper presents and evaluates models created according 
to a schema that provides a description of the joint distribu- 
tion of the values of sense tags and contextual features that 
is potentially applicable to a wide range of content words. 
The models are evaluated through a series of experiments, 
the results of which suggest that the schema is particularly 
well suited to nouns but that it is also applicable to words in 
other syntactic categories. 
1. INTRODUCTION 
Assigning sense tags to the words in a text can be viewed 
as a classification problem. A probabilistic classifier assigns 
to each word the tag that has the highest estimated proba- 
bility of having occurred in the given context. Designing a 
probabilistic classifier for word-sense disambiguation includes 
two main sub-tasks: specifying an appropriate model and 
estimating the parameters of that model. The former in- 
volves selecting informative contextual features (such as col- 
locations) and describing the joint distribution of the values 
of these features and the sense tags of the word to be classi- 
fied. The parameters of a model are the characteristics of the 
entire population that are considered in the model. Practical 
applications require the use of estimates of the parameters. 
Such estimates are based on functions of a data sample (i.e., 
statistics) rather than the complete population. To make the 
estimation of parameters feasible, a model with a simplified 
form is created by limiting the number of contextual features 
considered and by expressing the joint distribution of fea- 
tures and sense tags in terms of only the most important 
systematic interactions among variables. 
To date, much of the work in statistical NLP has focused on 
parameter estimation (\[11\], \[13\], \[12\], \[4\]). Of the research di- 
rected toward identifying the optimum form of model, most 
has been concerned with the selection of individually infor- 
mative features (\[2\], \[5\]), with relatively little attention di- 
rected toward the identification of an optimum approxima- 
tion to the joint distribution of the values of the contextual 
features and object classes. Most previous efforts to formu- 
late a probabilistic classifier for word-sense disambiguation 
did not attempt to systematically identify the interdepen- 
dencies among contextual features that can be used to clas- 
sify the meaning of an ambiguous word. Many researchers 
have performed disambiguation on the basis of only a single 
feature (\[6\], \[15\], \[2\]), while others who do consider multiple 
contextual features assume that all contextual features are 
either conditionally independent given the sense of the word 
(\[8\], \[14\]) or fully independent (\[10\], \[16\]). 
In earlier work, we describe a method for identifying an 
appropriate model for use in disambiguating a word given a 
set of contextual features. We chose a particular set of con- 
textual features and, using this method, identified a model 
incorporating these features for use in disambiguating the 
noun interest. These features, which are assigned automati- 
cally, are of three types: morphological, collocation-specific, 
and class-based, with part-of-speech (POS) categories serving 
as the word classes (see \[3\] for how the features were chosen). 
The results of using the model to disambiguate the noun in- 
terest were encouraging. We suspect that the model provides 
a description of the distribution of sense tags and contex- 
tual features that is applicable to a wide range of content 
words. This paper provides suggestive evidence supporting 
this, by testing its applicability to the disambiguation of sev- 
eral words. Specifically, for each word to be disambiguated, 
we created a model according to a schema, where that schema 
is a generalization of the model created for interest. We eval- 
uate the performance of probabilistic word-sense classifiers 
that utilize maximum likelihood estimates for the parame- 
ters of models created for the following lexical items: the 
noun senses of bill and concern, the verb senses of close and 
help, and the adjective senses of common. We also identify 
upper and lower bounds for the performance of any proba- 
bilistic classifier utilizing the same set of contextual features, 
as well as compare, for each word, the performance of (1) a 
classifier using a model created according to the schema for 
that word, with (2) the performance of a classifier that uses 
a model selected, per the procedure to be described in sec- 
tion 2, as the best model for that word given the same set of 
contextual features. 
Section 2 of this paper describes the method used for select- 
ing the form of a probabilistic model given sense tags and a 
set of contextual features. In section 3, the model schema 
is presented and, in section 4, the experiments using models 
created according to the schema are described. Section 5 dis- 
cusses the results of the experiments and section 6 discusses 
future work. 
2. MODEL SELECTION 
In this section, we address the problem of finding the model 
that generates the best approximation to a given discrete 
probability distribution, as selected from among the class of 
decomposable models. Decomposable models are a subclass of 
log-linear models and can be used to characterize and study 
the structure of data. They are members of the class of gen- 
eralized linear models and can be viewed as analogous to 
analysis of variance (ANOVA) models (\[1\]). The log-linear 
model expresses the population mean as the sum of the con- 
tributions of the "effects" of the variables and the interac- 
tions between variables; it is the logarithm of the mean that 
is linear in these effects. 
Under certain sampling plans (see \[1\] for details), data con- 
sisting of the observed values of a number of contextual fea- 
tures and the corresponding sense tags of an ambiguous word 
can be described by a multinomial distribution in which each 
distinct combination of the values of the contextual features 
and the sense tag identifies a unique category in that distribu- 
tion. The theory of log-linear models specifies the sufficient 
statistics for estimating the effects of each variable and of 
each interaction among variables on the mean. The statis- 
tics are the highest-order sample marginal distributions con- 
taining only inter-dependent variables. Within the class of 
decomposable models, the maximum likelihood estimate for 
the mean of a category reduces to the product of the sample 
relative frequencies (counts) defined in the sufficient statis- 
tics divided by the sample relative frequencies defined in the 
marginals composed of the common elements in the sufficient 
statistics. As such, decomposable models are models that can 
be expressed as a product of marginal distributions, where 
each marginal consists of certain inter-dependent variables. 
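
As a hedged illustration of the closed-form estimate described above, the following Python sketch computes the maximum likelihood estimate for one category of a decomposable model. It multiplies the sample relative frequencies of the sufficient-statistic marginals and divides by those of their shared variables. The function name `decomposable_mle`, the variable names `a` and `b`, and the toy sample are assumptions made for illustration; they are not taken from the experiments.

```python
from collections import Counter

def decomposable_mle(samples, marginals, shared):
    """Build an estimator of P(category) for a decomposable model.

    samples:   list of dicts mapping variable name -> observed value
    marginals: tuples of variable names, one per sufficient-statistic marginal
    shared:    tuples of variable names common to adjacent marginals
    """
    n = len(samples)

    def count_table(names):
        # Sample counts for the marginal distribution over `names`.
        return Counter(tuple(s[v] for v in names) for s in samples)

    marginal_tables = [(names, count_table(names)) for names in marginals]
    shared_tables = [(names, count_table(names)) for names in shared]

    def estimate(category):
        numerator = 1.0
        for names, table in marginal_tables:
            numerator *= table[tuple(category[v] for v in names)] / n
        denominator = 1.0
        for names, table in shared_tables:
            denominator *= table[tuple(category[v] for v in names)] / n
        return numerator / denominator if denominator > 0 else 0.0

    return estimate

# Hypothetical toy sample: two features, each dependent only on the tag.
samples = [
    {"a": 0, "b": 0, "tag": "x"},
    {"a": 0, "b": 1, "tag": "x"},
    {"a": 1, "b": 1, "tag": "y"},
    {"a": 1, "b": 1, "tag": "y"},
]
estimate = decomposable_mle(
    samples,
    marginals=[("a", "tag"), ("b", "tag")],
    shared=[("tag",)],
)
p_x = estimate({"a": 0, "b": 0, "tag": "x"})
p_y = estimate({"a": 1, "b": 1, "tag": "y"})
```

In this toy model the two marginals share only the variable `tag`, so the estimate equals P(a | tag) × P(b | tag) × P(tag) computed from relative frequencies.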
The degree to which the data is approximated by a model is 
called the fit of the model. In this work, the likelihood ratio 
statistic, G², is used as the measure of the goodness of fit of a 
model. It is distributed asymptotically as χ² with degrees of 
freedom corresponding to the number of interactions (and/or 
variables) omitted from (unconstrained in) the model. As- 
sessing the fit of a model in terms of the significance of its G² 
statistic gives preference to models with the fewest number of 
interdependencies, thereby assuring the selection of a model 
specifying only the most systematic variable interactions. 
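
For concreteness, the likelihood ratio statistic has the standard form G² = 2 Σ observed × ln(observed / expected), summed over the categories of the table. The sketch below evaluates it for a hypothetical 2×2 table under the independence model; the helper names and the counts are illustrative assumptions, not data from the experiments.

```python
import math

def g_squared(observed, expected):
    """Likelihood ratio statistic: 2 * sum(obs * ln(obs / exp)) over cells."""
    return 2.0 * sum(
        obs * math.log(obs / expected[cell])
        for cell, obs in observed.items()
        if obs > 0  # empty cells contribute nothing to the sum
    )

def independence_expected(observed):
    """Expected counts for a two-way table under the independence model."""
    n = sum(observed.values())
    rows, cols = {}, {}
    for (r, c), obs in observed.items():
        rows[r] = rows.get(r, 0) + obs
        cols[c] = cols.get(c, 0) + obs
    return {(r, c): rows[r] * cols[c] / n for r in rows for c in cols}

# Hypothetical counts for two binary variables.
observed = {(0, 0): 30, (0, 1): 10, (1, 0): 10, (1, 1): 30}
expected = independence_expected(observed)
g2 = g_squared(observed, expected)
```

Here the independence model omits the single interaction between the two variables, so the statistic would be referred to a χ² distribution with one degree of freedom; a model that reproduces the observed counts exactly yields G² = 0.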
Within the framework described above, the process of model 
selection becomes one of hypothesis testing, where each pat- 
tern of dependencies among variables expressible in terms of 
a decomposable model is postulated as a hypothetical model 
and its fit to the data is evaluated. The "best fitting" model, 
in the sense that the significance according to the reference 
χ² value is largest, is then selected. The exhaustive search of 
decomposable models was conducted as described in \[9\]. 
Approximating the joint distribution of all variables with a 
model containing only the most important systematic inter- 
actions among variables limits the number of parameters to 
be estimated, supports computational efficiency, and provides 
an understanding of the data. The biggest limitation as- 
sociated with this method is the need for large amounts of 
sense-tagged data. Inconveniently, the validity of the results 
obtained using this approach is compromised when it is ap- 
plied to sparse data. 
3. THE MODEL 
Using the method presented in the previous section, a prob- 
abilistic model was developed for disambiguating the noun 
senses of interest utilizing automatically identifiable contex- 
tual features that were considered to be intuitively applica- 
ble to all content words. The complete process of feature 
selection and model selection is described in \[3\]. Here, we 
describe the extension of that model to other content words. 
In essence, what we are describing is not a single model, but 
a model schema. The values of the variables included in the 
model change with the word being disambiguated as stated 
below. 
The model schema incorporates three different types of con- 
textual features: morphological, collocation-specific, and 
class-based, with POS categories serving as the word classes. 
For all content words, the morphological feature describes 
only the suffix of the base lexeme: the presence or absence 
of the plural form, in the case of nouns, and the suffix in- 
dicating tense, in the case of verbs. Mass nouns as well as 
many adjectives and adverbs will have no morphological fea- 
ture under this definition (note the lack of this feature in the 
models for common in table 2). 
The values of the class-based variables are a set of 25 POS 
tags derived from the first letter of the tags used in the Penn 
Treebank corpus. The model schema contains four variables 
representing class-based contextual features: the POS tags 
of the two words immediately preceding and the two words 
immediately succeeding the ambiguous word. All variables 
are confined to sentence boundaries; extension beyond the 
sentence boundary is indicated by a null POS tag (e.g., when 
the ambiguous word appears at the start of the sentence, the 
POS tags to the left have the value null). 
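
The class-based variables described above could be extracted along the following lines. This is a minimal sketch: it assumes that reducing a Penn Treebank tag to its first character is an adequate stand-in for the 25-tag set (the exact mapping is not given here), and the function name `pos_context` is hypothetical.

```python
def pos_context(sentence_tags, index):
    """Class-based features for the ambiguous word at `index`.

    sentence_tags: Penn Treebank POS tags for one sentence, in order.
    Positions outside the sentence receive the value "null".
    """
    def reduced_tag(i):
        if 0 <= i < len(sentence_tags):
            return sentence_tags[i][0]  # assumption: first letter of the tag
        return "null"

    return {
        "l2pos": reduced_tag(index - 2),
        "l1pos": reduced_tag(index - 1),
        "r1pos": reduced_tag(index + 1),
        "r2pos": reduced_tag(index + 2),
    }

# Hypothetical tagged sentence with the ambiguous word in first position.
features = pos_context(["NN", "VBZ", "DT", "NN"], 0)
```

Because the ambiguous word starts the sentence in this example, both left-hand variables take the value null.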
Two collocation-specific variables are included in the model 
schema, where the term collocation is used loosely to refer 
to a specific spelling form occurring in the same sentence as 
the ambiguous word. In the model schema, each collocation- 
specific variable indicates the presence or absence of a word 
that is one of the four most frequently-occurring content 
words in a data sample composed of sentences containing 
the word to be disambiguated. This strategy for selecting 
collocation-specific variables is simpler than that used by 
many other researchers (\[6\], \[15\], \[2\]). This simpler method 
was chosen to support work we plan to do in the future (elim- 
inating the need for sense-tagged data; see section 6). In us- 
ing this strategy, we do, however, run the risk of reducing the 
informativeness of the variables. 
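
One way to realize this selection strategy is sketched below. The stop list used to separate content words from function words, the function names, and the toy sentences are all assumptions for illustration; the criteria actually used to identify content words are not spelled out in this section.

```python
from collections import Counter

# Hypothetical, deliberately tiny list of function words to exclude.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "for"}

def frequent_content_words(sentences, target, k=4):
    """Return the k most frequent content words in sentences containing `target`."""
    counts = Counter(
        word.lower()
        for sentence in sentences
        if target in sentence
        for word in sentence
        if word.lower() not in FUNCTION_WORDS and word.lower() != target
    )
    return [word for word, _ in counts.most_common(k)]

def collocation_feature(sentence, word):
    """Presence (True) or absence (False) of `word` in the sentence."""
    return word in (w.lower() for w in sentence)

# Hypothetical data sample for the target word "interest".
sentences = [
    ["the", "interest", "rate", "rose"],
    ["interest", "rate", "fell", "one", "percent"],
    ["the", "rate", "of", "interest", "is", "five", "percent"],
    ["percent", "interest", "on", "the", "loan"],
]
candidates = frequent_content_words(sentences, "interest", k=2)
```

With these toy sentences the two candidates are rate and percent, which happen to match the collocations listed for interest in table 2.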
With the variables as described above, the form of this model 
is (where r1pos is the POS tag one place to the right of the 
ambiguous word W; r2pos is the POS tag two places to the 
right of W; l1pos is the POS tag one place to the left of W; 
l2pos is the POS tag two places to the left of W; ending is the 
suffix of the base lexeme; word1 is the presence or absence of 
one of the word-specific collocations and word2 is the presence 
or absence of the other one; and tag is the sense tag assigned 
to W): 
$$
\begin{aligned}
P(\mathit{r1pos}, \mathit{r2pos}, \mathit{l1pos}, \mathit{l2pos}, \mathit{ending}, \mathit{word1}, \mathit{word2}, \mathit{tag}) ={}& P(\mathit{r1pos}, \mathit{r2pos} \mid \mathit{tag}) \times P(\mathit{l1pos}, \mathit{l2pos} \mid \mathit{tag}) \\
& \times P(\mathit{ending} \mid \mathit{tag}) \times P(\mathit{word1} \mid \mathit{tag}) \times P(\mathit{word2} \mid \mathit{tag}) \\
& \times P(\mathit{tag})
\end{aligned}
\tag{1}
$$
This product form indicates certain conditional indepen- 
dences given the sense tag of the ambiguous word. In the 
remainder of this paper, the model for a particular word 
matching the above schema will be referred to as model M. 
The sense for an ambiguous word is selected using M as fol- 
lows: 
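$$
\mathit{tag} = \operatorname*{argmax}_{\mathit{tag}} \bigl( P(\mathit{r1pos}, \mathit{r2pos} \mid \mathit{tag}) \times P(\mathit{l1pos}, \mathit{l2pos} \mid \mathit{tag}) \times P(\mathit{ending} \mid \mathit{tag}) \times P(\mathit{word1} \mid \mathit{tag}) \times P(\mathit{word2} \mid \mathit{tag}) \times P(\mathit{tag}) \bigr)
\tag{2}
$$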
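
The selection rule in equation (2) can be sketched in Python as follows, with each conditional probability estimated by a relative frequency from sense-tagged examples. This is an illustrative sketch rather than the implementation used in the experiments: the function names, the feature values, and the sense labels are hypothetical, and returning `None` when every sense scores zero is an assumed way of representing a word that receives no tag.

```python
from collections import Counter

# Feature groups that appear together in one factor of model M.
FACTORS = [("r1pos", "r2pos"), ("l1pos", "l2pos"), ("ending",), ("word1",), ("word2",)]

def train(examples):
    """Collect the counts needed for relative-frequency estimates.

    examples: list of (features, tag) pairs, where features is a dict.
    """
    tag_counts = Counter(tag for _, tag in examples)
    factor_counts = [Counter() for _ in FACTORS]
    for features, tag in examples:
        for counts, names in zip(factor_counts, FACTORS):
            counts[(tuple(features[name] for name in names), tag)] += 1
    return tag_counts, factor_counts

def classify(model, features):
    """Return the sense tag maximizing the product in equation (2)."""
    tag_counts, factor_counts = model
    total = sum(tag_counts.values())
    best_tag, best_score = None, 0.0
    for tag, tag_total in tag_counts.items():
        score = tag_total / total  # P(tag)
        for counts, names in zip(factor_counts, FACTORS):
            values = tuple(features[name] for name in names)
            score *= counts[(values, tag)] / tag_total  # P(values | tag)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag  # None when no sense has a non-zero estimate

# Hypothetical sense-tagged training examples.
money_context = {"r1pos": "N", "r2pos": "V", "l1pos": "D", "l2pos": "null",
                 "ending": "none", "word1": True, "word2": False}
examples = [
    (money_context, "money"),
    (dict(money_context, word2=True), "money"),
    ({"r1pos": "I", "r2pos": "D", "l1pos": "J", "l2pos": "D",
      "ending": "none", "word1": False, "word2": False}, "attention"),
]
model = train(examples)
chosen = classify(model, money_context)
unseen = classify(model, dict(money_context, r1pos="X"))
```

A context whose feature values never co-occurred with any sense in the training data receives a zero score for every tag and is left untagged in this sketch.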
4. THE EXPERIMENTS 
In this section, we first describe the data used in the experi- 
ments and then describe the experiments themselves. 
Due to availability, the Penn Treebank Wall Street Journal 
corpus was selected as the data set and the non-idiomatic 
senses defined in the electronic version of the Longman 
Dictionary of Contemporary English (LDOCE) were chosen 
to form the tag set for each word to be disambiguated (three 
exceptions to this statement are noted in table 1). The only 
restriction limiting the choice of ambiguous words was the 
need for large amounts of sense-tagged data. As a result of 
that restriction, only the most frequently occurring content 
words could be considered. From that set, the following were 
chosen as test cases: the noun senses of bill and concern, 
the verb senses of close and help, and the adjective senses of 
common. 
The training and test sets for each word selected for dis- 
ambiguation were generated in the same manner. First, all 
instances of the word with the specified POS tag in the Penn 
Treebank Wall Street Journal Corpus were identified and the 
sentences containing them were extracted to form a data sam- 
ple. The data sample was then manually disambiguated and 
a test set comprising approximately one quarter of the to- 
tal sample size was randomly selected. The size of the data 
sample, test set, and training set for each word, along with 
a description of the word senses identified and their distribu- 
tion in the data are presented in table 1. Table 1 also includes 
entries for the earlier experiments involving the noun interest 
(\[3\]). 
In all of the experiments for a particular word, the estimates 
of the model parameters that were used were maximum like- 
lihood estimates made from the training set for that word. 
In each experiment, a set of data was tagged in accordance 
with equation (2), and the results were summarized in terms 
of precision and recall. (In most of the experiments, the data 
set was the test set, as expected, but in the experiments de- 
signed to establish an upper bound for performance, it was 
the training set, as discussed below.) Recall is the percentage 
of test words that were assigned some tag; it corresponds to 
the portion of the test set covered by the estimates of the 
parameters made from the training set. Precision is the per- 
centage of tagged words that were tagged correctly. A com- 
bined summary, the total percentage of the test set tagged 
correctly (the total percent correct) was also calculated. 
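
The three summary measures defined above can be computed as in the following sketch. The function name `summarize`, the use of `None` to mark a word that received no tag, and the toy labels are assumptions for illustration.

```python
def summarize(predicted, gold):
    """Return (precision, recall, total_correct) as fractions.

    predicted: assigned tags, with None for words left untagged
    gold:      the manually assigned sense tags, in the same order
    """
    tagged = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    correct = sum(1 for p, g in tagged if p == g)
    recall = len(tagged) / len(gold)            # share of words given some tag
    precision = correct / len(tagged) if tagged else 0.0
    total_correct = correct / len(gold)         # share of all words tagged correctly
    return precision, recall, total_correct

# Hypothetical outcome for four test words.
precision, recall, total_correct = summarize(
    ["a", "b", None, "a"],
    ["a", "a", "b", "a"],
)
```

Under these definitions the total percent correct is the product of precision and recall, which is consistent with, for example, the bill entry for Model M in table 2 (87.5% × 95.8% ≈ 83.8%).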
There were three experiments run for each word. In the first, 
the data set tagged was the test set and model M was used. 
In the second, the data set tagged was the test set, and the 
model was the one selected using the procedure described 
in section 2 for the word being disambiguated and the con- 
textual features used throughout the experiments. We will 
refer to this as the "best approximation model". In the third 
experiment, the data set tagged was the training set, and 
the model used was the one in which no assumptions are 
made about dependencies among variables (i.e., all variables 
are treated as inter-dependent). The purpose of experiment 
three was to establish upper bounds on the precision of the 
classifiers used in the first two experiments, as discussed in 
the following paragraphs. 
If a classifier makes no assumptions regarding the dependen- 
cies among the variables, and has available to it the actual pa- 
rameter values (i.e., the true population characteristics), then 
the precision of that classifier would be the best that could 
be achieved with the specified set of features. The maximum 
likelihood estimates of the model parameters made from the 
training set are the population parameters for the training 
set; therefore, the precision of each third-experiment classi- 
fier is optimal for the training set. Because the true popula- 
tion will have more variation than the training set, the third 
experiment for each word establishes an upper bound for the 
precision of the classifiers tested in the first two experiments 
for that word (and in fact, for any classifier using the same 
set of variables). 
If we assume that the test and training sets have similar 
sense-tag distributions, establishing a lower bound is straight- 
forward. A probabilistic classifier should perform at least as 
well as one that always assigns the sense that most frequently 
occurs in the training set. Thus, a lower bound on the preci- 
sion of a probabilistic classifier is the percentage of test-word 
instances with the sense tag that most frequently occurs. 
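
The lower bound amounts to a most-frequent-sense baseline, sketched below. The function name and the toy sense labels are hypothetical.

```python
from collections import Counter

def lower_bound_precision(training_tags, test_tags):
    """Precision of always assigning the most frequent training-set sense."""
    most_frequent = Counter(training_tags).most_common(1)[0][0]
    hits = sum(1 for tag in test_tags if tag == most_frequent)
    return hits / len(test_tags)

# Hypothetical sense tags for a training set and a test set.
training_tags = ["s6"] * 5 + ["s5"] * 2 + ["s1"]
test_tags = ["s6", "s6", "s5", "s1"]
baseline = lower_bound_precision(training_tags, test_tags)
```

Since this baseline assigns a tag to every word, its recall is 100% and its precision equals its total percent correct.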
The results of all of the experiments, including the earlier 
experiments involving the noun senses of interest (\[3\]), are 
presented in table 2. 
5. DISCUSSION OF RESULTS 
In the following discussion, a classifier used in the first or 
second experiment for a word will be called an "experimental 
classifier", while a classifier used in the third experiment for 
a word will be referred to as the "upper-bound classifier" for 
that word. 
Before discussing the results of the experiments, there are 
some comments to be made about the comparison of the 
performance of different classifiers. In comparing the per- 
formance of classifiers developed for the same word, it makes 
sense to compare the precision, recall, and total percent cor- 
rect. Because the training set and the test set are the same 
for every classifier developed for a given word, the differences 
we see are due strictly to the fact that the classifiers 
use different models. In comparing the performance of clas- 
sifiers developed for different words, on the other hand, only 
the precision measures are compared. There are two things 
that affect recall: the complexity of the model (i.e., the order 
of the highest-order marginal in the model) and the size of 
the training set. The size of the training set was not held 
constant for each word; therefore, comparison of the recall 
results for classifiers developed for different words would not 
be meaningful. Because total percent correct includes recall, 
it should also not be used in the comparison of classifiers 
developed for different words. 
In comparing the precision of classifiers developed for differ- 
ent words, what is compared is the improvement that each 
classifier makes over the lower bound for the word for which 
that classifier was developed. 
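
As a worked illustration of this comparison, the sketch below takes the Model M (experiment 1) precision and the lower bound for each word from table 2 and computes the difference in percentage points. Measuring the improvement as a simple difference is an assumption made for this example; no formula is stated in the text.

```python
# (experiment 1 precision, lower bound), in percent, as reported in table 2.
results = {
    "interest": (79.3, 53.0),
    "bill": (87.5, 68.5),
    "concern": (88.4, 63.8),
    "close": (83.6, 68.0),
    "help": (79.9, 75.1),
    "common": (85.9, 79.5),
}

# Improvement over the lower bound, in percentage points.
improvement = {
    word: round(precision - bound, 1)
    for word, (precision, bound) in results.items()
}

# Words ordered from largest to smallest improvement.
ranked = sorted(improvement, key=improvement.get, reverse=True)
```

On this measure the three nouns show the largest improvements and help shows the smallest.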
We now turn to the specific results. Model M seems par- 
ticularly well suited to the nouns (which is not surprising, 
given that it was developed for the noun-senses of the word 
interest). The precision of the noun experimental classifiers 
is superior to that of all of the experimental classifiers devel- 
oped for words in other syntactic categories. Further, for one 
of the nouns (concern), M was the same as the one used in 
experiment 2, and, for the other two nouns, M and the model 
used in experiment 2 are very similar. 
Turning to the verbs, it is striking that, for both of the verbs, 
the models used in the second experiment (the best approx- 
imation models) identify an interdependency between tense 
markings (i.e., ending in the verb entries in table 2) and the 
POS tags (r1pos, r2pos, l1pos, and l2pos), a dependency 
that is not in M. This seems to suggest that a model includ- 
ing this dependency should be used for verbs. However, the 
additional complexity of such a model in comparison with M 
may make it less effective. For each verb we tested, a compar- 
ison of the total-percent-correct measures for experiments 1 
and 2 indicates that the classifier with M is as good as or better 
than the classifier using the best approximation model. 
The classifiers with the worst precision in comparison with 
the appropriate lower bound, as discussed above, are the ex- 
perimental classifiers for the verb senses of help. The sense 
distinctions for help are based mainly on the semantic class 
of the syntactic object of the verb. Perhaps this approach 
to sense disambiguation is not as effective for these kinds of 
sense distinctions. 
Although there is a large disparity in performance between 
the experimental and upper-bound classifiers for a word, two 
things should be noted. First, the upper bounds are over- 
inflated due to the very small size of the training set relative 
to the true population (there would be much greater variation 
in the population). Second, such a model could never be 
used in practice, due to the huge number of parameters to be 
estimated. 
6. CONCLUSIONS AND FUTURE WORK 
In this paper, we have presented and evaluated models cre- 
ated according to a schema that provides a description of the 
joint distribution of the values of sense tags and contextual 
features that is potentially applicable to a wide range of con- 
tent words. The models were evaluated through a series of 
experiments that provided the following information: 1) per- 
formance results (precision, recall, and total percent correct) 
for probabilistic classifiers using models created in accordance 
with the schema and applied to the disambiguation of sev- 
eral difficult test words; 2) identification of upper and lower 
bounds for the performance of any probabilistic word-sense 
classifier using the contextual features defined in the model 
schema; and 3) a comparison of the performance of classifiers 
using models generated per the schema to that of classifiers 
using models selected as described in section 2. The results 
of these experiments suggest that the model schema is par- 
ticularly well suited to nouns but that it is also applicable to 
words in other syntactic categories. 
We feel that the results presented in this paper are encourag- 
ing and plan to continue testing the model schema on other 
words. But it is unreasonable to continue generating over 
1,000 manually sense-tagged examples of each word to be dis- 
ambiguated, as is required if parameters are estimated as we 
did here. In answer to this problem, other means of param- 
eter estimation are being investigated, including a procedure 
for obtaining maximum likelihood estimates from untagged 
data. The procedure is a variant of the EM algorithm (\[7\]) 
specifically applicable to models of the form described in this 
paper. 
ACKNOWLEDGEMENTS. The authors would 
like to gratefully acknowledge the contributions of the follow- 
ing people to the work presented in this paper: Rufus and 
Beverly Bruce for their help in sense-tagging data, Gerald 
Rogers for sharing his expertise in statistics, and Ted Dun- 
ning for advice and support in all matters having to do with 
software development. 
References 
1. Bishop, Y. M.; Fienberg, S.; and Holland, P. (1975). Dis- 
crete Multivariate Analysis: Theory and Practice. Cam- 
bridge: The MIT Press. 
2. Brown, P.; Della Pietra, S.; Della Pietra, V.; and Mer- 
cer, R. (1991). Word Sense Disambiguation Using Statis- 
tical Methods. Proceedings of the 29th Annual Meeting 
of the Association for Computational Linguistics (ACL- 
91), pp. 264-304. 
3. Bruce, Rebecca and Wiebe, Janyce. Word-Sense Dis- 
ambiguation Using Decomposable Models. Unpublished 
manuscript. 
4. Church, K. and W. Gale (1991). A Comparison of the 
Enhanced Good-Turing and Deleted Estimation Meth- 
ods for Estimating Probabilities of English Bigrams. 
Computer Speech and Language, Vol 5, pp. 19-54. 
5. Church, Kenneth W and Hanks, Patrick (1990). Word 
Association Norms, Mutual Information, and Lexicog- 
raphy. Computational Linguistics, Vol. 16, No. 1, pp. 
22-29. 
6. Dagan, I.; Itai, A.; and Schwall, U. (1991). Two Lan- 
guages Are More Informative Than One. Proceedings of 
the 29th Annual Meeting of the Association for Compu- 
tational Linguistics (ACL-91), pp. 130-137. 
7. Dempster, A., N. Laird, and D. Rubin (1977). Maxi- 
mum Likelihood from Incomplete Data Via the EM Al- 
gorithm. Journal of the Royal Statistical Society B, Vol 
39, pp. 1-38. 
8. Gale, W.; Church, K.; and Yarowsky, D. (1992). A 
Method for Disambiguating Word Senses in a Large Cor- 
pus. AT&T Bell Laboratories Statistical Research Report 
No. 104. 
9. Havranek, Tomas (1984). A Procedure for Model Search 
in Multidimensional Contingency Tables. Biometrics 40: 
95-100. 
10. Hearst, Marti (1991). Toward Noun Homonym Disam- 
biguation--Using Local Context in Large Text Corpora. 
Proceedings of the Seventh Annual Conference of the 
UW Centre for the New OED and Text Research Using 
Corpora, pp. 1-22. 
11. Jelinek, F. and R. Mercer (1980). Interpolated Estima- 
tion of Markov Source Parameters from Sparse Data. 
Proceedings Workshop on Pattern Recognition in Prac- 
tice, May 21-23, Amsterdam: North-Holland. 
12. Katz, S. M. (1987). Estimation of Probabilities From 
Sparse Data for the Language Model Component of a 
Speech Recognizer. IEEE Trans. Acoust., Speech, Signal 
Processing, Vol ASSP-35, pp. 400-401. 
13. Nadas, A. (1984). Estimation of Probabilities in the 
Language Model of the IBM Speech Recognition Sys- 
tem. IEEE Trans. Acoust., Speech, Signal Processing, 
Vol ASSP-32, pp. 859-861. 
14. Yarowsky, David (1992). Word-Sense Disambiguation 
Using Statistical Models of Roget's Categories Trained 
on Large Corpora. Proceedings of the 15th International 
Conference on Computational Linguistics (COLING- 
92). 
15. Yarowsky, David (1993). One Sense Per Collocation. 
Proceedings of the Speech and Natural Language ARPA 
Workshop, March 1993, Princeton, NJ. 
16. Zernik, Uri (1990). Tagging Word Senses In Corpus: 
The Needle in the Haystack Revisited. Technical Report 
90CRDI98, GE Research and Development Center. 
Noun senses of interest: 
  Distribution of senses: 
    SENSE 1 "readiness to give attention": 15% 
    SENSE 2 "quality of causing attention to be given": <1% 
    SENSE 3 "activity, subject, etc., which one gives time and attention to": 3% 
    SENSE 4 "advantage, advancement, or favor": 8% 
    SENSE 5 "a share in a company, business, etc.": 21% 
    SENSE 6 "money paid for the use of money": 53% 

Noun senses of concern: total in sample: 1488; in training set: 1117; in test set: 371. 
  Distribution of senses: 
    SENSE 1 "a matter that is of interest or importance": 3% 
    SENSE 2 "serious care or interest": 2% 
    SENSE 3 "worry; anxiety": 32% 
    SENSE 4 "a business; firm": 64% 

Noun senses of bill: total in sample: 1335; in training set: 1001; in test set: 334. 
  Distribution of senses: 
    SENSE 1 "a plan for a law, written down for the government to consider": 69% 
    SENSE 2 "a list of things bought and their price": 10% 
    SENSE 4 "a piece of paper money" (extended to include treasury bills): 21% 

Verb senses of close: 
  Distribution of senses: 
    SENSE 1 "to (cause to) shut": 2% 
    SENSE 2 "to (cause to) be not open to the public": 2% 
    SENSE 3 "to (cause to) stop operation": 20% 
    SENSE 4 "to (cause to) end": 68% 
    SENSE 6 "to (cause to) come together by making less space between": 2% 
    SENSE 7 "to close a deal" (extended from an idiomatic usage): 6% 

Verb senses of help: total in sample: 1396; in training set: 1047; in test set: 349. 
  Distribution of senses: 
    SENSE 1 "to do part of the work for - human object": 21% 
    SENSE 2 "to encourage, improve, or produce favourable conditions for - inanimate object": 75% 
    SENSE 3 "to make better - human object": 4% 
    SENSE 4 "to avoid; prevent; change - inanimate object": 1% 

Adjective senses of common: total in sample: 1063; in training set: 798; in test set: 265. 
  Distribution of senses: 
    SENSE 1 "belonging to or shared equally by 2 or more": 7% 
    SENSE 2 "found or happening often and in many places; usual": 8% 
    SENSE 3 "widely known; general; ordinary": 3% 
    SENSE 4 "of no special quality; ordinary": 2% 
    SENSE 6 "technical, having the same relationship to 2 or more quantities": <1% 
    SENSE 7 "as in the phrase 'common stock'" (not in LDOCE): 80% 

Table 1: Data summary. 
MODEL PERFORMANCE SUMMARY 

                                  Precision   Recall   % Correct 

noun senses of interest 
  Experiment 1 (Model M)            79.3%      98%      77.7% 
    P(tag) × P(ending | tag) × P(r1pos, r2pos | tag) × 
    P(l1pos, l2pos | tag) × P(rate | tag) × P(percent | tag) 
  Experiment 2 (best approx.)       79.4%      98%      77.8% 
    P(tag) × P(ending | tag) × P(rate, percent | tag) × 
    P(r1pos, r2pos | tag) × P(l1pos, l2pos | tag) 
  Experiment 3 (upper bound)        93% 
    P(tag, ending, l1pos, l2pos, r1pos, r2pos, rate, percent) 
  Lower bound                       53% 
    P(tag) 

noun senses of bill 
  Experiment 1 (Model M)            87.5%      95.8%    83.8% 
    P(tag) × P(ending | tag) × P(r1pos, r2pos | tag) × 
    P(l1pos, l2pos | tag) × P(house | tag) × P(treasury | tag) 
  Experiment 2 (best approx.)       89.1%      93.7%    83.5% 
    P(tag) × P(ending | tag) × 
    P(r1pos, r2pos, treasury | tag) × P(l1pos, l2pos, house | tag) 
  Experiment 3 (upper bound)        97.6% 
    P(tag, ending, l1pos, l2pos, r1pos, r2pos, house, treasury) 
  Lower bound                       68.5% 
    P(tag) 

noun senses of concern 
  Experiment 1 (Model M)            88.4%      95.1%    84.1% 
    P(tag) × P(ending | tag) × P(r1pos, r2pos | tag) × 
    P(l1pos, l2pos | tag) × P(company | tag) × P(possessive | tag) 
  Experiment 2 (best approx.)       88.4%      95.1%    84.1% 
    P(tag) × P(ending | tag) × P(r1pos, r2pos | tag) × 
    P(l1pos, l2pos | tag) × P(company | tag) × P(possessive | tag) 
  Experiment 3 (upper bound)        97.2% 
    P(tag, ending, l1pos, l2pos, r1pos, r2pos, company, possessive) 
  Lower bound                       63.8% 
    P(tag) 

verb senses of close 
  Experiment 1 (Model M)            83.6%      94%      78.1% 
    P(tag) × P(ending | tag) × P(r1pos, r2pos | tag) × 
    P(l1pos, l2pos | tag) × P(trading | tag) × P(exchange | tag) 
  Experiment 2 (best approx.)       88.7%      88%      78.1% 
    P(tag) × P(ending | tag) × P(r1pos, r2pos | ending, tag) × 
    P(l1pos, l2pos | ending, tag) × P(trading | tag) × P(exchange | tag) 
  Experiment 3 (upper bound)        97.2% 
    P(tag, ending, l1pos, l2pos, r1pos, r2pos, trading, exchange) 
  Lower bound                       68% 
    P(tag) 

verb senses of help 
  Experiment 1 (Model M)            79.9%      95.7%    76.5% 
    P(tag) × P(ending | tag) × P(r1pos, r2pos | tag) × 
    P(l1pos, l2pos | tag) × P(dollar | tag) × P(market | tag) 
  Experiment 2 (best approx.)       80.2%      86.8%    69.6% 
    P(tag) × P(ending | tag) × P(r1pos, r2pos | ending, tag) × 
    P(l1pos, l2pos | ending, tag) × P(dollar | tag) × P(market | tag) 
  Experiment 3 (upper bound)        91.7% 
    P(tag, ending, l1pos, l2pos, r1pos, r2pos, dollar, market) 
  Lower bound                       75.1% 
    P(tag) 

adjective senses of common 
  Experiment 1 (Model M)            85.9%      95.9%    82.3% 
    P(tag) × P(r1pos, r2pos | tag) × 
    P(l1pos, l2pos | tag) × P(million | tag) × P(share | tag) 
  Experiment 2 (best approx.)       89.7%      91%      81.6% 
    P(tag) × P(r2pos, share | tag) × 
    P(l1pos, l2pos, r1pos, million | tag) 
  Experiment 3 (upper bound)        95% 
    P(tag, ending, l1pos, l2pos, r1pos, r2pos, million, share) 
  Lower bound                       79.5% 
    P(tag) 

Table 2: Results of experiments. 
