<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0701">
  <Title>General Word Sense Disambiguation Method Based on a Full Sentential Context</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 General Word Sense Disambiguation
</SectionTitle>
    <Paragraph position="0"> The aim of the system described here is to take any syntactically analysed sentence as input and assign each of its content words a pointer to an appropriate sense in WordNet. Because the words in a sentence are bound by their syntactic relations, the word senses are determined by their most probable combination across all the syntactic relations derived from the parse structure of the given sentence. It is assumed here that each phrase has one central constituent (the head), and all other constituents in the phrase modify the head (the modifiers). It is also assumed that there is no relation between the modifiers. The relations are explicitly present in the parse tree, where head words propagate up through the tree, each parent receiving its head word from its head child. Every syntactic relation can also be viewed as a semantic relationship between the concepts represented by the participating words. Consider, for example, sentence (1), whose syntactic structure is given in Figure 1.</Paragraph>
    <Paragraph position="1"> (1) The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced no evidence that any irregularities took place.</Paragraph>
    <Paragraph position="2"> Each word in the above sentence is bound by a number of syntactic relations which determine the correct sense of the word. For example, the sense of the verb produced is constrained by the subject-verb relation with the noun investigation, by the verb-object relation with the noun evidence and by the subordinate clause relation with the verb said.</Paragraph>
    <Paragraph position="3"> Similarly, the verb said is constrained by its relations with the words Jury, Friday and produced; the sense of the noun investigation is constrained by the relation with the head of its prepositional phrase election, and by the subject-verb relation with the verb produced, and so on.</Paragraph>
    <Paragraph position="4"> The key to the extraction of the relations is that any phrase can be substituted by the corresponding tree head-word (links marked bold in Figure 1). To determine the tree head-word we used a set of rules similar to those described by (Magerman, 1995; Jelinek et al., 1994) and also used by (Collins, 1996), which we modified in the following way: * The head of a prepositional phrase (PP → IN NP) was substituted by a function whose name corresponds to the preposition and whose sole argument corresponds to the head of the noun phrase NP.</Paragraph>
    <Paragraph position="5"> * The head of a subordinate clause was changed to a function named after the head of the first element in the subordinate clause (usually 'that' or a 'NULL' element) and its sole argument corresponds to the head of its second element (usually head of a sentence).</Paragraph>
    <Paragraph position="6"> Because we assumed that the relations within the same phrase are independent, all the relations hold between the modifier constituents and the head of a phrase only. This is not necessarily true in some situations, but for the sake of simplicity we assume so. A complete list of applicable relations for sentence (1) is given in (2).</Paragraph>
    <Paragraph position="7"> (2) NP(NNP(County), NNP(Jury))</Paragraph>
    <Paragraph position="9"> Each of the extracted syntactic relations has a certain probability for each combination of the senses of its arguments. This probability is derived from the probability of the semantic relation of each combination of the sense candidates of the related content words. Therefore, the approach described here consists of two phases: 1. learning the semantic relations, and 2. disambiguation through the probability evaluation of relations.</Paragraph>
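The head-word propagation and relation extraction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tuple-based tree encoding, the toy head table, and the function names are hypothetical, and the actual head rules (Magerman, 1995; Collins, 1996) are far richer.

```python
# Hypothetical head-percolation table: which child label supplies the head.
HEAD_CHILD = {"S": "VP", "VP": "VB", "NP": "NN", "PP": "NP"}

def head_word(tree):
    """Return the head word of a tree given as (label, children-or-word)."""
    label, body = tree
    if isinstance(body, str):          # preterminal: body is the word itself
        return body
    for child in body:                 # pick the designated head child...
        if child[0] == HEAD_CHILD.get(label):
            return head_word(child)
    return head_word(body[-1])         # ...else fall back to the last child

def relations(tree, out=None):
    """Collect (parent, head, modifier) triples between each phrase head
    and its modifier constituents, recursing through the whole tree."""
    if out is None:
        out = []
    label, body = tree
    if isinstance(body, str):
        return out
    h = head_word(tree)
    for child in body:
        m = head_word(child)
        if m != h:                     # the head child is not its own modifier
            out.append((label, h, m))
        relations(child, out)
    return out

t = ("S", [("NP", [("NN", "investigation")]),
           ("VP", [("VB", "produced"), ("NP", [("NN", "evidence")])])])
print(relations(t))
```

Running this on the toy tree yields the subject-verb and verb-object relations of "investigation produced evidence", mirroring the constraints described for sentence (1).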
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Learning
</SectionTitle>
    <Paragraph position="0"> At first, every content word in every sentence in the training set was tagged by an appropriate pointer to a sense in WordNet.</Paragraph>
    <Paragraph position="1"> Secondly, using the parse trees of all the corpus sentences, all the syntactic relations present in the</Paragraph>
    <Paragraph position="3"> training corpus were extracted and converted into the following form: (4) rel(PNT, MNT, HNT, MS, HS, RP).</Paragraph>
    <Paragraph position="4"> where PNT is the phrase parent non-terminal, MNT the modifier non-terminal, HNT the head non-terminal, MS the semantic content (see below) of the modifier constituent, HS the semantic content of the head constituent and RP the relative position of the modifier and the head (RP=1 indicates that the modifier precedes the head, while for RP=2 the head precedes the modifier). Relations involving non-content modifiers were ignored. Synsets of the words not present in WordNet were substituted by the words themselves.</Paragraph>
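The six-field record in (4) can be written down directly. The field names below mirror the text; the sense-identifier strings are illustrative assumptions, not the paper's actual data format:

```python
from collections import namedtuple

# One extracted relation, following form (4) in the text.
Rel = namedtuple("Rel", "pnt mnt hnt ms hs rp")

def make_rel(parent_nt, mod_nt, head_nt, mod_sem, head_sem, mod_first):
    """RP = 1 when the modifier precedes the head, RP = 2 otherwise."""
    return Rel(parent_nt, mod_nt, head_nt, mod_sem, head_sem,
               1 if mod_first else 2)

# e.g. the adjective-noun relation inside "recent primary election";
# the synset labels are made up for illustration.
r = make_rel("NP", "JJ", "NN", "recent#a1", "election#n1", True)
print(r)
```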
    <Paragraph position="5"> The semantic content was either a WordNet sense identificator (synset) or, in the case of prepositional and subordinate phrases, a function of the preposition (or a null element) and the sense identificator of the second phrase constituent.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="5" type="metho">
    <SectionTitle>
5 Disambiguation Algorithm
</SectionTitle>
    <Paragraph position="0"> As mentioned above, we assumed that all the content words in a sentence are bound by a number of syntactic relations. Every content word can have several meanings, but each of these meanings has a different probability, which is given by the set of semantic relations in which the word participates. Because every relation has two arguments (head and its modifier), the probability of each sense also depends on the probability of the sense of the other participant in the relation. The task is to select such a combination of senses for all the content words, that the overall relational probability is maximal. If, for any given sentence, we had extracted N syntactic relations PA, the overall relational probability for the combination of senses X would be:</Paragraph>
    <Paragraph position="2"> where p(Ri|X) is the probability of the i-th relation given the combination of senses X. Considering that the average word sense ambiguity in the corpus used is 5.8 senses, a sentence with 10 content words would have 5.8^10 possible sense combinations, leading to a combinatorial explosion of over 43,080,420 overall probability combinations, which is not feasible. Also, with a very small training corpus, it is not possible to estimate the sense probabilities very accurately. Therefore, we have opted for a hierarchical disambiguation approach based on similarity measures between the tested and the training relations, which we describe in Section 5.2. First, however, we describe the part of the probabilistic model which assigns probability estimates to the individual sense combinations based on the semantic relations acquired in the learning phase.</Paragraph>
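The explosion figure quoted above can be checked directly: with an average ambiguity of 5.8 senses, a 10-content-word sentence yields 5.8^10 sense combinations.

```python
# Worked check of the combinatorial-explosion figure from the text.
combos = 5.8 ** 10
print(round(combos))   # roughly 43 million, matching "over 43,080,420"
```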
    <Section position="1" start_page="0" end_page="4" type="sub_section">
      <SectionTitle>
5.1 Relational Probability Estimate
</SectionTitle>
      <Paragraph position="0"> Consider, for example, the syntactic relation between a head noun and its adjectival modifier derived from NP → JJ NN. Let us assume that the number of senses in WordNet is k for the adjective and l for the noun. The number of possible sense combinations is therefore m = k * l. The probability estimate of a sense combination (i,j) in the relation R, where i is the sense of the modifier (the adjective in this example) and j is the sense of the head (the noun in this example), is calculated as follows:</Paragraph>
      <Paragraph position="2"> where fR(i,j) is a score of co-occurrences of a modifier sense i with a head word sense j among the same semantic relations R extracted during the learning phase. Please note that because fR(i,j) is not a count but rather a score of co-occurrences (defined below), pR(i,j) is not a real probability but rather its approximation. Because the occurrence count is replaced by a similarity score, the sparse data problem of a small training corpus is substantially reduced. The score of co-occurrences is defined as a sum of hits of similar pairs, where a hit is a multiplication of the similarity measures, sim(i,x) and sim(j,y), between both participants, i.e.:</Paragraph>
      <Paragraph position="4"> where x, y ∈ R; r is the number of relations of the same type (for the above example R = rel(NP, ADJ, NOUN, x, y, 1)) found in the training corpus. To emphasise the sense-restricting contribution of each example found, every pair (x,y) is restricted to contributing to only one sense combination (i,j): every example pair (x,y) contributes only to the combination for which sim(i, x) * sim(j, y) is maximal.</Paragraph>
      <Paragraph position="5"> fR(i,j) represents a sum of all hits in the training corpus for the sense combination (i,j). Because the similarity measure (see below) has a value between 0 and 1 and each hit is a multiplication of two similarities, its value is also between 0 and 1. The reason why we used a multiplication of similarities was to eliminate the contributions of examples in which one participant belonged to a completely different semantic class. For example, the training pair new airport makes no contribution to the probability estimate of any sense combination of a new management, because neither of the two senses of the noun management (group or human activity) belongs to the same semantic class as airport (entity).</Paragraph>
      <Paragraph position="6"> On the other hand, new airport would contribute to the probability estimate of the sense combination of modern building because one sense of the adjective modern is synonymous with one sense of the adjective new, and one sense of the noun building belongs to the same conceptual class (entity) as the noun airport. The situation is analogous for all other relations. The reason why we used a count modified by the semantic distances, rather than a count of exact matches only, was to avoid situations where no match would be found due to sparse data, a problem of many small training corpora.</Paragraph>
      <Paragraph position="7"> Every semantic relation can be represented by a relational matrix, which is a matrix whose first coordinate represents the sense of the modifier, the</Paragraph>
      <Paragraph position="9"> second coordinate represents the sense of the head, and the value at the coordinate position (i,j) is the estimate of the probability of the sense combination (i,j) computed by (6). An example of a relational matrix for an adjective-noun relation modern building based on two training examples (new airport and classical music) is given in Figure 3. Naturally, the more examples, the more fields of the matrix get filled. The training examples have an accumulative effect on the matrix, because the sense probabilities in the matrix are calculated as a sum of 'similarity based frequency scores' of all examples (7) divided by the sum of all matrix entries (6). The most likely sense combination scores the highest value in the matrix. Each semantic relation has its own matrix.</Paragraph>
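The accumulative matrix construction of (6) and (7) can be sketched as follows, assuming a similarity function sim(a,b) in [0,1] is given; the toy_sim stand-in and the sense labels are hypothetical, not the paper's data:

```python
def relational_matrix(mod_senses, head_senses, examples, sim):
    """Build the relational matrix of (6)-(7): each training pair (x, y)
    contributes the hit sim(i,x)*sim(j,y) to the single sense combination
    (i, j) it supports most strongly; scores are then normalised to sum
    to 1, so the matrix approximates a probability distribution."""
    f = {(i, j): 0.0 for i in mod_senses for j in head_senses}
    for x, y in examples:
        best, hit = None, 0.0
        for i in mod_senses:
            for j in head_senses:
                h = sim(i, x) * sim(j, y)
                if h > hit:
                    best, hit = (i, j), h
        if best is not None:           # pair contributes to one cell only
            f[best] += hit
    total = sum(f.values())
    if total == 0:
        return f
    return {ij: v / total for ij, v in f.items()}

# Toy similarity: 1.0 for identical senses, 0.5 for the same coarse class
# (encoded here, purely for illustration, as a shared first character).
def toy_sim(a, b):
    if a == b:
        return 1.0
    return 0.5 if a[0] == b[0] else 0.0

m = relational_matrix(["adj.new.1", "adj.new.2"], ["n.bldg.1"],
                      [("adj.new.1", "n.bldg.1")], toy_sim)
print(max(m, key=m.get))
```

As in the modern building example, a training pair only reinforces the sense combination it is semantically closest to.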
      <Paragraph position="10"> The way all the relations are combined is described in Section 5.2.</Paragraph>
      <Paragraph position="11">  We base the definition of the semantic similarity between two concepts (concepts are defined by their WordNet synsets a, b) on their semantic distance, as follows: (8) sim(a, b) = 1 - sd(a, b)^2. The semantic distance sd(a,b) is squared in the above formula in order to give a bigger weight to closer matches.</Paragraph>
      <Paragraph position="12"> The semantic distance is calculated as follows.</Paragraph>
      <Paragraph position="13"> Semantic Distance for Nouns and Verbs</Paragraph>
      <Paragraph position="15"> where D1 is the depth of synset a, D2 is the depth of synset b, and D is the depth of their nearest common ancestor in the WordNet hierarchy. If a and b have no common ancestor, sd(a,b) = 1.</Paragraph>
      <Paragraph position="16"> If any of the participants in the semantic distance calculation is a function (derived from a prepositional phrase or subordinate clause), the distance is equal to the distance of the function arguments for the same functor, or equals 1 for different functors.</Paragraph>
      <Paragraph position="17"> For example, sd(of(sense1), of(sense2)) = sd(sense1, sense2), while sd(of(sense1), about(sense2)) = 1, no matter what sense1 and sense2 are.</Paragraph>
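A sketch of the distance computation follows. The printed noun/verb formula is not legible in this copy, so the body of sd_noun_verb below is an assumed Wu-Palmer-style form that merely satisfies the stated boundary conditions (sd = 0 for identical synsets, sd = 1 when there is no common ancestor); depths() is a hypothetical stand-in for a real WordNet lookup.

```python
def sd_noun_verb(d1, d2, d_common):
    """Assumed distance from synset depths d1, d2 and the depth of the
    nearest common ancestor; None means no common ancestor exists."""
    if d_common is None:
        return 1.0
    return 1.0 - 2.0 * d_common / (d1 + d2)

def sd(a, b):
    """Distance between semantic contents. Functions derived from PPs and
    subordinate clauses, encoded here as (functor, sense) tuples, compare
    their arguments only under the same functor, else distance 1."""
    if isinstance(a, tuple) and isinstance(b, tuple):
        return sd(a[1], b[1]) if a[0] == b[0] else 1.0
    d1, d2, d = depths(a, b)           # hypothetical hierarchy lookup
    return sd_noun_verb(d1, d2, d)

def depths(a, b):
    # Toy stand-in: identical synsets share their full depth; otherwise
    # pretend no common ancestor is known.
    return (3, 3, 3) if a == b else (3, 3, None)

print(sd(("of", "s1"), ("of", "s1")), sd(("of", "s1"), ("about", "s1")))
```

This reproduces the functor rule quoted in the text: of/of compares the inner senses, of/about is maximally distant regardless of the senses.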
      <Paragraph position="18"> Semantic Distance for Adjectives: sd(a,b) = 0 for the same adjectival synsets (incl. synonymy); sd(a,b) = 0 for the synsets in antonymy relations, i.e. for ant(a,b); sd(a,b) = 0.5 for the synsets in the same similarity cluster; sd(a,b) = 0.5 if a belongs to the same similarity cluster as c and b is the antonym of c (indirect antonymy).</Paragraph>
      <Paragraph position="20"/>
      <Paragraph position="22"/>
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
5.2 Hierarchical Disambiguation
</SectionTitle>
      <Paragraph position="0"> This section describes the main part of the algorithm, i.e. the disambiguation process based on the overall probability estimate of sentential relations. As we have outlined above, for computational reasons it is not feasible to evaluate overall probabilities for all the sense combinations. Instead, we take advantage of the hierarchical structure of each sentence and arrive at the optimum combination of its word senses in a process which has two parts: 1. bottom-up propagation of the head word sense scores and 2. top-down disambiguation.</Paragraph>
      <Paragraph position="1"> Bottom-up propagation. In compliance with our assumption that all the semantic relations hold only between a head word and its modifiers at any syntactic level, the modifiers do not participate in any relation with an element outside their parent phrase. As depicted in the example in Figure 1, it is only the head word concepts that propagate through the parse tree and that participate in semantic relations with concepts on other levels of the parse tree. The modifiers (which are heads themselves at lower tree levels), however, play an important role in constraining the head-word senses. The number of relations derived at each level of the tree depends on the number of concepts that modify the head. Each of these relations contributes to the score of each sense of the head word. We define the sense score vector of a word w as a vector of scores of each WordNet sense of the word w. The initial sense score vector of the word w is given by its contextually independent sense distribution in the whole training corpus. Because the training corpus is relatively small, and because it always excludes the tested file, an appropriate sense of the word w may not be present in it at all. Therefore, each sense i of the word w is always given a non-zero initial score pi(w) (9a):</Paragraph>
      <Paragraph position="3"> where count_i(w) is the number of occurrences of the sense i of the word w in the entire training corpus, and n is the number of different WordNet senses of the word w.</Paragraph>
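Equation (9a) itself is not legible in this copy; the text only requires that every sense receive a non-zero initial score derived from the corpus sense counts, which standard add-one (Laplace) smoothing provides. The sketch below is therefore an assumption consistent with the text, not the paper's exact formula.

```python
def initial_scores(sense_counts):
    """Add-one smoothed initial sense score vector for a word w, where
    sense_counts[i] is the training-corpus frequency of sense i of w;
    unseen senses still receive a non-zero score."""
    n = len(sense_counts)
    total = sum(sense_counts)
    return [(c + 1) / (total + n) for c in sense_counts]

print(initial_scores([8, 1, 0]))
```

Even the unseen third sense gets score 1/12, so the later multiplicative steps can never be blocked by a zero prior.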
      <Paragraph position="4"> The sense score vectors of head words propagate up the tree. At each level, they are modified by all the semantic relations with their modifiers which occur at that level. Also, the sense score vectors of head words are used to calculate the matrices of the sense score vectors of the modifiers. This is done as follows: Let H = [h1, h2, ..., hk] be the sense score vector of the head word h. Let T = [R1, R2, ..., Rn] be a set of relations between the head word h and its modifiers.</Paragraph>
      <Paragraph position="5"> 1. For each semantic relation Ri ∈ T between the head word h and a modifier mi with sense score vector Mi = [oi1, oi2, ..., oil], do: 1.1 Using (6), calculate the relational matrix Ri(m,h) of the relation Ri. 1.2 For each oi ∈ Mi, multiply all the elements of Ri(m,h) for which m = oi by oi, yielding Qi, the sense score matrix of the modifier mi. 2. The new sense score vector of the head word h is now G = [g1, g2, ..., gk], where (10) gj = Lj / L. Lj represents the score of the head word sense j based on the matrices Q calculated in step 1, i.e.:</Paragraph>
      <Paragraph position="7"> where xi(j,u) ∈ Qi and max(xi(j,u)) is the highest score in the line of the matrix Qi which corresponds to the head word sense j; n is the number of modifiers of the head word h at the current tree level, and</Paragraph>
      <Paragraph position="9"> where k is the number of senses of the head word h.</Paragraph>
      <Paragraph position="10"> The reason why gj (10) is calculated as a sum of the best scores (11), rather than by using the traditional maximum likelihood estimate (Berger et al., 1996; Gale et al., 1993), is to minimise the effect of the sparse data problem. Imagine, for example, the phrase VP → VB NP PP, where the head verb VB is in the object relation with the head of the noun phrase NP and also in the modifying relation with the head of the prepositional phrase PP. Let us also assume that the correct sense of the verb VB is a.</Paragraph>
      <Paragraph position="11"> Even if the verb-object relation provided strong selectional support for the sense a, if there were no example in the training set for the second relation (between VB and PP) which would score a hit for the sense a, multiplying the scores of that sense derived from the first and the second relation respectively would yield a zero probability for this sense and thus prevent its correct assignment.</Paragraph>
      <Paragraph position="12"> The newly created head word sense score vector G propagates upwards in the parse tree and the same process repeats at the next syntactic level. Note that at the higher level, depending on the head extraction rules described in section 3, the roles may be changed and the former head word may become a modifier of a new head (and participate in the above calculation as a modifier). The process repeats itself until the root of the tree is reached. The word sense score vector which has reached the root, represents a vector of scores of the senses of the main head word of the sentence (verb said in the example in Figure 1), which is based on the whole syntactic structure of that sentence. The sense with the highest score is selected and the sentence head disambiguated.</Paragraph>
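One bottom-up step (1.1 through 2. above) can be sketched with toy numbers as follows. The matrix layout R[m][h] (rows = modifier senses, columns = head senses) and the function shape are assumptions; the sum-of-best-hits combination follows (10) and (11) as described in the text.

```python
def propagate(k, modifier_data):
    """One bottom-up step for a head word with k senses. For each
    modifier, scale the rows of its relational matrix R[m][h] by the
    modifier's sense scores (yielding Q, step 1.2), take the best entry
    per head sense (11), sum over modifiers, and normalise (10)."""
    L = [0.0] * k
    for mod_scores, R in modifier_data:
        # Step 1.2: row m of R is multiplied by the modifier sense score.
        Q = [[mod_scores[m] * R[m][h] for h in range(k)]
             for m in range(len(mod_scores))]
        for j in range(k):             # best hit for head sense j
            L[j] += max(Q[m][j] for m in range(len(mod_scores)))
    total = sum(L) or 1.0
    return [x / total for x in L]      # new head sense score vector G

# Head verb with 2 senses, one modifier noun with 2 senses (toy numbers).
R = [[0.7, 0.1],                       # modifier sense 0 vs head senses 0, 1
     [0.0, 0.2]]                       # modifier sense 1 vs head senses 0, 1
print(propagate(2, [([0.9, 0.1], R)]))
```

Summing the best scores, rather than multiplying across relations, keeps one unattested relation from zeroing out an otherwise well-supported head sense, exactly the sparse-data concern raised above.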
      <Paragraph position="13">  Having ascertained the sense of the sentence head, the process of top-down disambiguation begins. The top-down disambiguation algorithm, which starts with the sentence head, can be described recursively as follows: Let l be the sense of the head word h on the input. Let M = [m1, m2, ..., mx] be the set of the modifiers of the head word h. For every modifier mi ∈ M, do: 1. In the sense score matrix Qi of the modifier mi (calculated in step 1.2 of the bottom-up phase), find all the elements x(ki,l), where l is the sense of the head h. 2. Assign the modifier mi the sense k = k' for which the value x(ki,l) is maximal. In the case of a draw, choose the sense which is listed as more frequent in WordNet.</Paragraph>
      <Paragraph position="14"> 3. If the modifier mi has descendants in the parse tree, call the same algorithm again with mi being the head and k' being its sense; else end.</Paragraph>
      <Paragraph position="15"> The disambiguation of the modifiers (which become heads at lower levels of the parse tree) is based solely on those lines of their sense score matrices which correspond to the sense of the head they are in relation with. This is possible because of our assumption that the modifiers are related only to their head words, and that there is no relation among the modifiers themselves. To what extent this assumption holds in real-life sentences, however, has yet to be investigated.</Paragraph>
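The recursive top-down pass can be sketched as follows. The dictionary-based tree and the draw-breaking by list position (valid only if each modifier's senses are stored in WordNet frequency order, as WordNet lists them) are assumptions of this sketch, not the paper's data structures.

```python
def top_down(head_sense, node, assign):
    """Recursive top-down pass: each modifier takes the sense whose entry
    in its score matrix Q (row = modifier sense, column = head sense) is
    maximal in the column of the already-fixed head sense. Ties fall to
    the earlier row, i.e. the more frequent WordNet sense."""
    for mod in node["modifiers"]:
        Q = mod["Q"]
        col = [Q[m][head_sense] for m in range(len(Q))]
        best = max(range(len(Q)), key=lambda m: col[m])
        assign[mod["word"]] = best
        top_down(best, mod, assign)    # the modifier becomes a head below
    return assign

tree = {"word": "produced", "modifiers": [
    {"word": "investigation", "Q": [[0.1, 0.4], [0.6, 0.2]],
     "modifiers": []}]}
senses = top_down(0, tree, {"produced": 0})
print(senses)
```

With the head fixed to sense 0, only column 0 of the modifier's matrix matters, so investigation receives its second sense (score 0.6 beats 0.1).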
    </Section>
  </Section>
  <Section position="7" start_page="5" end_page="5" type="metho">
    <SectionTitle>
6 Discourse Context
</SectionTitle>
    <Paragraph position="0"> (Yarowsky, 1995) pointed out that the sense of a target word is highly consistent within any given document (one sense per discourse). Because our algorithm does not consider the context given by the preceding sentences, we have conducted the following experiment to see to what extent the discourse context could improve the performance of word-sense disambiguation: Using the semantic concordance files (Miller et al., 1993), we counted the occurrences of content words which previously appear in the same discourse file. The experiment indicated that the &amp;quot;one sense per discourse&amp;quot; hypothesis works fairly well for nouns; the evidence is much weaker for verbs, adverbs and adjectives, however. Table 3 shows the numbers of content words which appear previously in the same discourse with the same meaning (same synset), and those which appear previously with a different meaning. The experiment also confirmed our expectation that the ratio of words with the same sense to those with a different sense depends on the distance between the sentences in which the same words appear (distance 1 indicates that the same word appeared in the previous sentence, distance 2 that the same word was present 2 sentences before, etc.).</Paragraph>
    <Paragraph position="1"> We have modified the disambiguation algorithm to make use of the information gained in the above experiment in the following way: All the disambiguated words and their senses are stored. The words of all the input sentences are first compared with the set of these stored word-sense pairs. If the same word is found in the set, the initial sense score assigned to it by (9a) is modified using Table 3, so that the sense which has previously been assigned to the word gets higher priority. The calculation of the initial sense score (9a) is thus replaced by (9b):</Paragraph>
    <Paragraph position="3"> where e(POS,SN) is the probability that a word with syntactic category POS which already occurred SN sentences before has the same sense as its previous occurrence. If, for example, the same noun occurred in the previous sentence (SN=1) where it was assigned sense n, the probability of sense n of the same noun in the current sentence is multiplied by e(NOUN,1) = 3,039/(3,039+103) = 0.967, while all the probabilities of its remaining senses are multiplied by 1 - 0.967 = 0.033. If no match is found, i.e. the word has not previously occurred in the discourse, e(POS,SN) is set to 1 for all senses.</Paragraph>
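The (9b) reweighting can be sketched as follows, assuming Table 3 supplies the scalar e(POS,SN) for the word's category and sentence distance; the function shape is an assumption of this sketch.

```python
def discourse_adjust(scores, prev_sense, e):
    """Reweight initial sense scores per (9b): the sense assigned at the
    word's previous occurrence in the discourse is multiplied by e, and
    every remaining sense by 1 - e. Unseen words are left unchanged,
    i.e. e(POS,SN) is effectively 1 for all senses."""
    if prev_sense is None:
        return list(scores)
    return [s * (e if i == prev_sense else 1.0 - e)
            for i, s in enumerate(scores)]

# A noun repeated from the previous sentence: e(NOUN,1) = 3039/3142 = 0.967.
out = discourse_adjust([0.5, 0.3, 0.2], 1, 3039 / 3142)
print(out)
```

Even though sense 1 had a lower prior (0.3) than sense 0 (0.5), the discourse weight makes it dominate, implementing the one-sense-per-discourse bias.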
  </Section>
</Paper>