<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3249">
  <Title>Unsupervised Domain Relevance Estimation for Word Sense Disambiguation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Domains, WORDNET and Texts
</SectionTitle>
    <Paragraph position="0"> DRE heavily relies on domain information as its main knowledge source. Domains show interesting properties both from a lexical and a textual point of view. Among these properties there are: (i) lexical coherence, since part of the lexicon of a text is composed of words belonging to the same domain; (ii) polysemy reduction, because the potential ambiguity of terms is sensibly lower if the domain of the text is speci ed; and (iii) lexical identi ability of text's domain, because it is always possible to assign one or more domains to a given text by considering term distributions in a bag-of-words approach. Experimental evidences of these properties are reported in (Magnini et al., 2002).</Paragraph>
    <Paragraph position="1"> In this section we describe WORDNET DO-MAINS1 (Magnini and Cavagli a, 2000), a lexical resource that attempts a systematization of relevant aspects in domain organization and representation.</Paragraph>
    <Paragraph position="2"> WORDNET DOMAINS is an extension of WORD-NET (version 1.6) (Fellbaum, 1998), in which each synset is annotated with one or more domain labels, selected from a hierarchically organized set of about two hundred labels. In particular, issues concerning the completeness of the domain set, the balancing among domains and the granularity of domain distinctions, have been addressed. The domain set used in WORDNET DOMAINS has been extracted from the Dewey Decimal Classi cation (Comaroni et al., 1989), and a mapping between the two taxonomies has been computed in order to ensure completeness. Table 2 shows how the senses for a word (i.e. the noun bank) have been associated to domain label; the last column reports the number of occurrences of each sense in Semcor2.</Paragraph>
    <Paragraph position="3"> Domain labeling is complementary to information already present in WORDNET. First of all, a domain may include synsets of different syntactic categories: for instance MEDICINE groups together senses from nouns, such as doctor#1 and hospital#1, and from verbs, such as operate#7. Second, a domain may include senses from different WORDNET sub-hierarchies (i.e. deriving from different unique beginners or from different lexicographer les ). For example, SPORT contains senses such as athlete#1, deriving from life form#1, game equipment#1 from physical object#1, sport#1  from act#2, and playing field#1 from location#1.</Paragraph>
    <Paragraph position="4"> Domains may group senses of the same word into thematic clusters, which has the important side-effect of reducing the level of ambiguity when we are disambiguating to a domain. Table 2 shows an example. The word bank has ten different senses in WORDNET 1.6: three of them (i.e. bank#1, bank#3 and bank#6) can be grouped under the ECONOMY domain, while bank#2 and bank#7 both belong to GEOGRAPHY and GEOL-OGY. Grouping related senses is an emerging topic in WSD (see, for instance (Palmer et al., 2001)).</Paragraph>
    <Paragraph position="5"> Finally, there are WORDNET synsets that do not belong to a speci c domain, but rather appear in texts associated with any domain. For this reason, a FACTOTUM label has been created that basically includes generic synsets, which appear frequently in different contexts. Thus the FACTOTUM domain can be thought of as a placeholder for all other domains.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Domain Relevance Estimation for Texts
</SectionTitle>
    <Paragraph position="0"> The basic idea of domain relevance estimation for texts is to exploit lexical coherence inside texts.</Paragraph>
    <Paragraph position="1"> From the domain point of view lexical coherence is equivalent to domain coherence, i.e. the fact that a great part of the lexicon inside a text belongs to the same domain.</Paragraph>
    <Paragraph position="2"> From this observation follows that a simple heuristic to approach this problem is counting the occurrences of domain words for every domain inside the text: the higher the percentage of domain words for a certain domain, the more relevant the domain will be for the text. In order to perform this operation the WORDNET DOMAINS information is exploited, and each word is assigned a weighted list of domains considering the domain annotation of its synsets. In addition, we would like to estimate the domain of the text locally. Local estimation of domain relevance is very important in order to take into account domain shifts inside the text. The methodology used to estimate domain frequency is described in subsection 3.1.</Paragraph>
    <Paragraph position="3"> Unfortunately the simple local frequency count is not a good domain relevance measure for several reasons. The most signi cant one is that very frequent words have, in general, many senses belonging to different domains. When words are used in texts, ambiguity tends to disappear, but it is not possible to assume knowing their actual sense (i.e.</Paragraph>
    <Paragraph position="4"> the sense in which they are used in the context) in advance, especially in a WSD framework. The simple frequency count is then inadequate for relevance estimation: irrelevant senses of ambiguous words contribute to augment the nal score of irrelevant domains, introducing noise. The level of noise is different for different domains because of their different sizes and possible differences in the ambiguity level of their vocabularies.</Paragraph>
    <Paragraph position="5"> In subsection 3.2 we propose a solution for that problem, namely the Gaussian Mixture (GM) approach. This constitutes an unsupervised way to estimate how to differentiate relevant domain information in texts from noise, because it requires only a large-scale corpus to estimate parameters in an Expectation Maximization (EM) framework. Using the estimated parameters it is possible to describe the distributions of both relevant and non-relevant texts, converting the DRE problem into the problem of estimating the probability of each domain given its frequency score in the text, in analogy to the bayesian classi cation framework. Details about the EM algorithm for GM model are provided in subsection 3.3.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Domain Frequency Score
</SectionTitle>
      <Paragraph position="0"> Let t 2 T , be a text in a corpus T composed by a list of words wt1;::: ;wtq. Let D = fD1;D2;:::;Ddg be the set of domains used. For each domain Dk the domain frequency score is computed in a window of c words around wtj. The domain frequency score is de ned by formula (1).</Paragraph>
      <Paragraph position="2"> where the weight factor G(x; ; 2) is the density of the normal distribution with mean and standard deviation at point x and Rword(D;w) is a function that return the relevance of a domain D for a word w (see formula 3). In the rest of the paper we use the notation F(Dk;t) to refer to F(Dk;t;m), where m is the integer part of q=2 (i.e. the central point of the text - q is the text length).</Paragraph>
      <Paragraph position="3"> Here below we see that the information contained in WORDNET DOMAINS can be used to estimate Rword(Dk;w), i.e. domain relevance for the word w, which is derived from the domain relevance of the synsets in which w appears.</Paragraph>
      <Paragraph position="4"> As far as synsets are concerned, domain information is represented by the function Dom : S ) P(D)3 that returns, for each synset s 2 S, where S is the set of synsets in WORDNET DOMAINS, the set of the domains associated to it. Formula (2) denes the domain relevance estimation function (remember that d is the cardinality of D):</Paragraph>
      <Paragraph position="6"> (2) Intuitively, Rsyn(D;s) can be perceived as an estimated prior for the probability of the domain given the concept, as expressed by the WORDNET DOMAINS annotation. Under these settings FACTOTUM (generic) concepts have uniform and low relevance values for each domain while domain concepts have high relevance values for a particular domain. null The de nition of domain relevance for a word is derived directly from the one given for concepts. Intuitively a domain D is relevant for a word w if D is relevant for one or more senses c of w. More formally let V = fw1;w2;:::wjV jg be the vocabulary, let senses(w) = fsjs 2 S;s is a sense of wg (e.g. any synset in WORDNET containing the word w). The domain relevance function for a word</Paragraph>
      <Paragraph position="8"> 3P(D) denotes the power set of D</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The Gaussian Mixture Algorithm
</SectionTitle>
      <Paragraph position="0"> As explained at the beginning of this section, the  simple local frequency count expressed by formula (1) is not a good domain relevance measure.  In order to discriminate between noise and relevant information, a supervised framework is typically used and signi cance levels for frequency counts are estimated from labeled training data. Unfortunately this is not our case, since no domain labeled text corpora are available. In this section we propose a solution for that problem, namely the Gaussian Mixture approach, that constitutes an unsupervised way to estimate how to differentiate relevant domain information in texts from noise. The Gaussian Mixture approach consists of a parameter estimation technique based on statistics of word distribution in a large-scale corpus.</Paragraph>
      <Paragraph position="1"> The underlying assumption of the Gaussian Mixture approach is that frequency scores for a certain domain are obtained from an underlying mixture of relevant and non-relevant texts, and that the scores for relevant texts are signi cantly higher than scores obtained for the non-relevant ones. In the corpus these scores are distributed according to two distinct components. The domain frequency distribution which corresponds to relevant texts has the higher value expectation, while the one pertaining to non relevant texts has the lower expectation. Figure 1 describes the probability density function (PDF) for domain frequency scores of the SPORT domain estimated on the BNC corpus4 (BNC-Consortium, 2000) using formula (1). The empirical PDF, describing the distribution of frequency scores evaluated on the corpus, is represented by the continuous line.</Paragraph>
      <Paragraph position="2"> From the graph it is possible to see that the empirical PDF can be decomposed into the sum of two distributions, D = SPORT and D = non-SPORT .</Paragraph>
      <Paragraph position="3"> Most of the probability is concentrated on the left, describing the distribution for the majority of non relevant texts; the smaller distribution on the right is assumed to be the distribution of frequency scores for the minority of relevant texts.</Paragraph>
      <Paragraph position="4"> Thus, the distribution on the left describes the noise present in frequency estimation counts, which is produced by the impact of polysemous words and of occasional occurrences of terms belonging to SPORT in non-relevant texts. The goal of the technique is to estimate parameters describing the distribution of the noise along texts, in order to as- null sociate high relevance values only to relevant frequency scores (i.e. frequency scores that are not related to noise). It is reasonable to assume that such noise is normally distributed because it can be described by a binomial distribution in which the probability of the positive event is very low and the number of events is very high. On the other hand, the distribution on the right is the one describing typical frequency values for relevant texts. This distribution is also assumed to be normal.</Paragraph>
      <Paragraph position="5"> A probabilistic interpretation permits the evaluation of the relevance value R(D;t;j) of a certain domain D for a new text t in a position j only by considering the domain frequency F(D;t;j). The relevance value is de ned as the conditional probability P(DjF(D;t;j)). Using Bayes theorem we estimate this probability by equation (4).</Paragraph>
      <Paragraph position="7"> where P(F(D;t;j)jD) is the value of the PDF describing D calculated in the point F(D;t;j), P(F(D;t;j)jD) is the value of the PDF describing D, P(D) is the area of the distribution describing D and P(D) is the area of the distribution for D.</Paragraph>
      <Paragraph position="8"> In order to estimate the parameters describing the PDF of D and D the Expectation Maximization (EM) algorithm for the Gaussian Mixture Model (Redner and Walker, 1984) is exploited. Assuming to model the empirical distribution of domain frequencies using a Gaussian mixture of two components, the estimated parameters can be used to evaluate domain relevance by equation (4).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 The EM Algorithm for the GM model
</SectionTitle>
      <Paragraph position="0"> In this section some details about the algorithm for parameter estimation are reported.</Paragraph>
      <Paragraph position="1"> It is well known that a Gaussian mixture (GM) allows to represent every smooth PDF as a linear combination of normal distributions of the type in formula 5</Paragraph>
      <Paragraph position="3"> rameter list describing the gaussian mixture. The number of components required by the Gaussian Mixture algorithm for domain relevance estimation is m = 2.</Paragraph>
      <Paragraph position="4"> Each component j is univocally determined by its weight aj, its mean j and its variance j. Weights represent also the areas of each component, i.e. its total probability.</Paragraph>
      <Paragraph position="5"> The Gaussian Mixture algorithm for domain relevance estimation exploits a Gaussian Mixture to approximate the empirical PDF of domain frequency scores. The goal of the Gaussian Mixture algorithm is to nd the GM that maximize the likelihood on the empirical data, where the likelihood function is evaluated by formula (8).</Paragraph>
      <Paragraph position="7"> More formally, the EM algorithm for GM models explores the space of parameters in order to nd the set of parameters such that the maximum likelihood criterion (see formula 9) is satis ed.</Paragraph>
      <Paragraph position="9"> This condition ensures that the obtained model ts the original data as much as possible. Estimation of parameters is the only information required in order to evaluate domain relevance for texts using the Gaussian Mixture algorithm. The Expectation Maximization Algorithm for Gaussian Mixture Models (Redner and Walker, 1984) allows to ef ciently perform this operation.</Paragraph>
      <Paragraph position="10"> The strategy followed by the EM algorithm is to start from a random set of parameters 0, that has a certain initial likelihood value L0, and then iteratively change them in order to augment likelihood at each step. To this aim the EM algorithm exploits a growth transformation of the likelihood function ( ) = 0 such that L(T ;D; ) 6 L(T ;D; 0). Applying iteratively this transformation starting from 0 a sequence of parameters is produced, until the likelihood function achieve a stable value (i.e. Li+1 Li 6 ). In our settings the transformation function is de ned by the following set of equations, in which all the parameters have to be solved together.</Paragraph>
      <Paragraph position="12"> As said before, in order to estimate distribution parameters the British National Corpus (BNC-Consortium, 2000) was used. Domain frequency scores have been evaluated on the central position of each text (using equation 1, with c = 50).</Paragraph>
      <Paragraph position="13"> In conclusion, the EM algorithm was used to estimate parameters to describe distributions for relevant and non-relevant texts. This learning method is totally unsupervised. Estimated parameters has been used to estimate relevance values by formula (4).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Domain Driven Disambiguation
</SectionTitle>
    <Paragraph position="0"> DRE originates to improve the performance of Domain Driven Disambiguation (DDD). In this section, a brief overview of DDD is given. DDD is a WSD methodology that only makes use of domain information. Originally developed to test the role of domain information for WSD, the system is capable to achieve a good precision disambiguation. Its results are affected by a low recall, motivated by the fact that domain information is suf cient to disambiguate only domain words . The disambiguation process is done comparing the domain of the context and the domains of each sense of the lemma to disambiguate. The selected sense is the one whose domain is relevant for the context5.</Paragraph>
    <Paragraph position="1"> In order to represent domain information we introduced the notion of Domain Vectors (DV), that are data structures that collect domain information.</Paragraph>
    <Paragraph position="2"> These vectors are de ned in a multidimensional space, in which each domain represents a dimension of the space. We distinguish between two kinds of DVs: (i) synset vectors, which represent the relevance of a synset with respect to each considered domain and (ii) text vectors, which represent the relevance of a portion of text with respect to each domain in the considered set.</Paragraph>
    <Paragraph position="3"> More formally let D = fD1;D2;:::;Ddg be the set of domains, the domain vector ~s for a synset s is de ned as hR(D1;s);R(D2;s);::: ;R(Dd;s)i where R(Di;s) is evaluated using equation (2). In analogy the domain vector ~tj for a text t in a given position j is de ned as hR(D1;t;j);R(D2;t;j);::: ;R(Dd;t;j)i where R(Di;t;j) is evaluated using equation (4).</Paragraph>
    <Paragraph position="4"> The DDD methodology is performed basically in three steps:  1. Compute ~t for the context t of the word w to be disambiguated null 2. Compute ^s = argmaxs2Senses(w)score(s; w; t) where</Paragraph>
    <Paragraph position="6"> threshold) select sense ^s, else do not provide any answer The similarity metric used is the cosine vector similarity, which takes into account only the direction of the vector (i.e. the information regarding the domain).</Paragraph>
    <Paragraph position="7"> P(sjw) describes the prior probability of sense s for word w, and depends on the distribution of the sense annotations in the corpus. It is estimated by statistics from a sense tagged corpus (we used SemCor)6 or considering the sense order in 5Recent works in WSD demonstrate that an automatic estimation of domain relevance for texts can be pro table used to disambiguate words in their contexts. For example, (Escudero et al., 2001) used domain relevance extraction techniques to extract features for a supervised WSD algorithm presented at the Senseval-2 competion, improving the system accuracy of about 4 points for nouns, 1 point for verbs and 2 points for adjectives, con rming the original intuition that domain information is very useful to disambiguate domain words , i.e. words which are strongly related to the domain of the text.</Paragraph>
    <Paragraph position="8"> 6Admittedly, this may be regarded as a supervised component of the generally unsupervised system. Yet, we considered this component as legitimate within an unsupervised frame-WORDNET, which roughly corresponds to sense frequency order, when no example of the word to disambiguate are contained in SemCor. In the former case the estimation of P(sjw) is based on smoothed statistics from the corpus (P(sjw) = occ(s;w)+ occ(w)+jsenses(w)j , where is a smoothing factor empirically determined). In the latter case P(sjw) can be estimated in an unsupervised way considering the order of senses in WORDNET</Paragraph>
    <Paragraph position="10"> sensenumber(s;w) returns the position of sense s of word w in the sense list for w provided by WORDNET.</Paragraph>
    <Paragraph position="11"> 5 Evaluation in a WSD task We used the WSD framework to perform an evaluation of the DRE technique by itself. As explained in Section 1 Domain Relevance Estimation is not a common Text Categorization task. In the standard framework of TC, categories are learned form examples, that are used also for test. In our case information in WORDNET DOMAINS is used to discriminate, and a test set, i.e. a corpus of texts categorized using the domain of WORDNET DOMAINS, is not available. To evaluate the accuracy of the domain relevance estimation technique described above is thus necessary to perform an indirect evaluation.</Paragraph>
    <Paragraph position="12"> We evaluated the DDD algorithm described in Section 4 using the dataset of the Senseval-2 all-words task (Senseval-2, 2001; Preiss and Yarowsky, 2002). In order to estimate domain vectors for the contexts of the words to disambiguate we used the DRE methodology described in Section 3. Varying the con dence threshold k, as described in Section 4, it is possible to change the tradeoff between precision and recall. The obtained precision-recall curve of the system is reported in Figure 2.</Paragraph>
    <Paragraph position="13"> In addition we evaluated separately the performance on nouns and verbs, suspecting that nouns are more domain oriented than verbs. The effectiveness of DDD to disambiguate domain words is con rmed by results reported in Figure 3, in which the precision recall curve is reported separately for both nouns and verbs. The performances obtained for nouns are sensibly higher than the one obtained for verbs, con rming the claim that domain information is crucial to disambiguate domain words. In Figure 2 we also compare the results obtained by the DDD system that make use of the DRE technique described in Section 3 with the rework since it relies on a general resource (SemCor) that does not correspond to the test data (Senseval all-words task).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Verbs
</SectionTitle>
      <Paragraph position="0"> sults obtained by the DDD system presented at the Senseval-2 competition described in (Magnini et al., 2002), that is based on the same DDD methodology and exploit a DRE technique that consists basically on the simply domain frequency scores described in subsection 3.1 (we refer to this system using the expression old-DDD, in contrast to the expression new-DDD that refers to the implementation described in this paper).</Paragraph>
      <Paragraph position="1"> Old-DDD obtained 75% precision and 35% recall on the of cial evaluation at the Senseval-2 English all words task. At 35% of recall the new-DDD achieves a precision of 79%, improving precision by 4 points with respect to old-DDD. At 75% precision the recall of new-DDD is 40%. In both cases the new domain relevance estimation technique improves the performance of the DDD methodology, demonstrating the usefulness of the DRE technique proposed in this paper.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Conclusions and Future Work
</SectionTitle>
    <Paragraph position="0"> technique, has been proposed and evaluated inside the Domain Driven Disambiguation framework, showing a signi cant improvement on the overall system performances. This technique also allows a clear probabilistic interpretation providing an operative de nition of the concept of domain relevance. During the learning phase annotated resources are not required, allowing a low cost implementation. The portability of the technique to other languages is allowed by the usage of synset-aligned wordnets, being domain annotation language independent. null As far as the evaluation of DRE is concerned, for the moment we have tested its usefulness in the context of a WSD task, but we are going deeper, considering a pure TC framework.</Paragraph>
  </Section>
class="xml-element"></Paper>