<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1020">
  <Title>A Method for Word Sense Disambiguation of Unrestricted Text</Title>
  <Section position="3" start_page="0" end_page="152" type="metho">
    <SectionTitle>
2 A word-word dependency approach
</SectionTitle>
    <Paragraph position="0"> approach The method presented here takes advantage of the sentence context. The words are paired and an attempt is made to disambiguate one word within the context of the other word. This is done by searching on Internet with queries formed using different senses of one word, while keeping the other word fixed. The senses are ranked simply by the order provided by the number of hits. A good accuracy is obtained, perhaps because the number of texts on the Internet is so large. In this way, all the words are  processed and the senses axe ranked. We use the ranking of senses to curb the computational complexity in the step that follows. Only the most promising senses are kept.</Paragraph>
    <Paragraph position="1"> The next step is to refine the ordering of senses by using a completely different method, namely the semantic density. This is measured by the number of common words that are within a semantic distance of two or more words. The closer the semantic relationship between two words the higher the semantic density between them. We introduce the semantic density because it is relatively easy to measure it on a MRD like WordNet. A metric is introduced in this sense which when applied to all possible combinations of the senses of two or more words it ranks them.</Paragraph>
    <Paragraph position="2"> An essential aspect of the WSD method presented here is that it provides a raking of possible associations between words instead of a binary yes/no decision for each possible sense combination. This allows for a controllable precision as other modules may be able to distinguish later the correct sense association from such a small pool.</Paragraph>
  </Section>
  <Section position="4" start_page="152" end_page="154" type="metho">
    <SectionTitle>
3 Contextual ranking of word senses
</SectionTitle>
    <Paragraph position="0"> Since the Internet contains the largest collection of texts electronically stored, we use the Internet as a source of corpora for ranking the senses of the words.</Paragraph>
    <Section position="1" start_page="152" end_page="152" type="sub_section">
      <SectionTitle>
3.1 Algorithm 1
</SectionTitle>
      <Paragraph position="0"> For a better explanation of this algorithm, we provide the steps below with an example. We considered the verb-noun pair &amp;quot;investigate report&amp;quot;; in order to make easier the understanding of these examples, we took into consideration only the first two senses of the noun report. These two senses, as defined in WordNet, appear in the synsets: (report#l, study} and {report#2, news report, story, account, write up}.</Paragraph>
      <Paragraph position="1"> INPUT: semantically untagged word1 - word2 pair (W1 - W2) OUTPUT: ranking the senses of one word PROCEDURE: STEP 1. Form a similarity list \]or each sense of one of the words. Pick one of the words, say W2, and using WordNet, form a similarity list for each sense of that word. For this, use the words from the synset of each sense and the words from the hypernym synsets. Consider, for example, that W2 has m senses, thus W2 appears in m similarity lists:</Paragraph>
      <Paragraph position="3"> where W 1, Wff, ..., W~ n are the senses of W2, and W2 (s) represents the synonym number s of the sense W~ as defined in WordNet.</Paragraph>
      <Paragraph position="4"> Example The similarity lists for the first two senses of the noun report are: (report, study) (report, news report, story, account, write up) STEP 2. Form W1 - W2 (s) pairs. The pairs that may be formed are:</Paragraph>
      <Paragraph position="6"> Example The pairs formed with the verb investigate and the words in the similarity lists of the noun report are: (investigate-report, investigate-study) (investigate-report, investigate-news report, investigatestory, investigate-account, investigate-write up) STEP 3. Search the Internet and rank the senses W~ (s). A search performed on the Internet for each set of pairs as defined above, results in a value indicating the frequency of occurrences for Wl and the sense of W2. In our experiments we used (Altavista, 1996) since it is one of the most powerful search engines currently available. Using the operators provided by AltaVista, queryforms are defined for each W1 - W2 (s) set above:</Paragraph>
      <Paragraph position="8"> for all 1 &lt; i &lt; m. Using one of these queries, we get the number of hits for each sense i of W2 and this provides a ranking of the m senses of W2 as they relate with 1411.</Paragraph>
      <Paragraph position="9"> Example The types of query that can be formed using the verb investigate and the similarity lists of the noun report, are shown below. After each query, we indicate the number of hits obtained by a search on the Internet, using AltaVista.</Paragraph>
      <Paragraph position="10">  (a) (&amp;quot;investigate report&amp;quot; OR &amp;quot;investigate study&amp;quot;) (478) (&amp;quot;investigate report&amp;quot; OR &amp;quot;investigate news report&amp;quot; OR &amp;quot;investigate story&amp;quot; OR &amp;quot;investigate account&amp;quot; OR &amp;quot;investigate write up&amp;quot;) (~81) (b) ((investigate NEAR report) OR (investigate NEAR study)) (34880) ((investigate NEAR report) OR (investigate NEAR news report) OR (investigate NEAR story) OR (investigate</Paragraph>
      <Paragraph position="12"> A similar algorithm is used to rank the senses of W1 while keeping W2 constant (undisambiguated). Since these two procedures are done over a large corpora (the Internet), and with the help of similarity lists, there is little correlation between the results produced by the two procedures.</Paragraph>
      <Paragraph position="13">  This method was tested on 384 pairs: 200 verb-noun (file br-a01, br-a02), 127 adjective-noun (file br-a01), and 57 adverb-verb (file br-a01), extracted from SemCor 1.6 of the Brown corpus. Using query form (a) on AltaVista, we obtained the results shown in Table 1. The table indicates the percentages of correct senses (as given by SemCor) ranked by us in top 1, top 2, top 3, and top 4 of our list. We concluded that by keeping the top four choices for verbs and nouns and the top two choices for adjectives and adverbs, we cover with high percentage (mid and upper 90's) all relevant senses. Looking from a different point of view, the meaning of the procedure so far is that it excludes the senses that do not apply, and this can save a considerable amount of computation time as many words are highly polysemous.</Paragraph>
      <Paragraph position="14"> top 1 top 2 top 3 top 4  We also used the query form (b), but the results obtained were similar; using, the operator NEAR, a larger number of hits is reported, but the sense ranking remains more or less the same.</Paragraph>
    </Section>
    <Section position="2" start_page="152" end_page="154" type="sub_section">
      <SectionTitle>
3.2 Conceptual density algorithm
</SectionTitle>
      <Paragraph position="0"> A measure of the relatedness between words can be a knowledge source for several decisions in NLP applications. The approach we take here is to construct a linguistic context for each sense of the verb and noun, and to measure the number of the common nouns shared by the verb and the noun contexts. In WordNet each concept has a gloss that acts as a micro-context for that concept. This is a rich source of linguistic information that we found useful in determining conceptual density between words.</Paragraph>
      <Paragraph position="1">  INPUT: semantically untagged verb - noun pair and a ranking of noun senses (as determined by Algorithm 1) OUTPUT: sense tagged verb - noun pair P aOCEDURE: STEP 1. Given a verb-noun pair V - N, denote with &lt; vl,v2, ...,Vh &gt; and &lt; nl,n2, ...,nt &gt; the possible senses of the verb and the noun using WordNet.</Paragraph>
      <Paragraph position="2"> STEP 2. Using Algorithm 1, the senses of the noun are ranked. Only the first t possible senses indicated by this ranking will be considered. The rest are dropped to reduce the computational complexity.</Paragraph>
      <Paragraph position="3"> STEP 3. For each possible pair vi - nj, the conceptual density is computed as follows:  (a) Extract all the glosses from the sub-hierarchy including vi (the rationale for selecting the sub-hierarchy is explained below) (b) Determine the nouns from these glosses.  These constitute the noun-context of the verb. Each such noun is stored together with a weight w that indicates the level in the sub-hierarchy of the verb concept in whose gloss the noun was found.</Paragraph>
      <Paragraph position="4">  (c) Determine the nouns from the noun sub-hierarchy including nj.</Paragraph>
      <Paragraph position="5"> (d) Determine the conceptual density Cij of common concepts between the nouns obtained at (b) and the nouns obtained at (c) using the metric: Icdijl k Cij = log (descendents j) (1) where: * Icdljl is the number of common concepts between the hierarchies of vl and nj  * wk are the levels of the nouns in the hierarchy of verb vi * descendentsj is the total number of words within the hierarchy of noun nj STEP 4. Vii ranks each pair vi -nj, for all i and j.</Paragraph>
      <Paragraph position="6"> Rationale 1. In WordNet, a gloss explains a concept and  provides one or more examples with typical usage of that concept. In order to determine the most appropriate noun and verb hierarchies, we performed some experiments using SemCor and concluded that the noun sub-hierarchy should include all the nouns in the class of nj. The sub-hierarchy of verb vi is taken as the hierarchy of the highest hypernym hi of the verb vi. It is necessary to consider a larger hierarchy then just the one provided by synonyms and direct hyponyms. As we replaced the role of a corpora with glosses, better results are achieved if more glosses are considered. Still, we do not want to enlarge the context too much.</Paragraph>
      <Paragraph position="7"> 2. As the nouns with a big hierarchy tend to have a larger value for Icdij\[, the weighted sum of common concepts is normalized with respect to the dimension of the noun hierarchy. Since the size of a hierarchy grows exponentially with its depth, we used the logarithm of the total number of descendants in the hierarchy, i.e. log(descendents j).</Paragraph>
      <Paragraph position="8"> 3. We also took into consideration and have experimented with a few other metrics. But after running the program on several examples, the formula from Algorithm 2 provided the best results.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="154" end_page="155" type="metho">
    <SectionTitle>
4 An Example
</SectionTitle>
    <Paragraph position="0"> As an example, let us consider the verb-noun collocation revise law. The verb revise has two possible senses in WordNet 1.6 and the noun law * has seven senses. Figure 1 presents the synsets in which the different meanings of this verb and noun appear.</Paragraph>
    <Paragraph position="1"> First, Algorithm 1 was applied and search the Internet using AltaVista, for all possible pairs V-N that may be created using revise and the words from the similarity lists of law. The following ranking of senses was obtained: Iaw#2(2829), law#3(648), law#4(640), law#6(397), law#1(224), law#5(37), law#7(O),</Paragraph>
    <Paragraph position="3"> accumulation, assemblage} 2. {law#2} = &gt; {rule, prescript\] ...</Paragraph>
    <Paragraph position="4">  3. {law#3, natural law} = &gt; \[ concept, conception, abstract\] 4. {law#4, law of nature} = &gt; \[ concept, conception, abstract\] 5. {jurisprudence, law#5, legal philosophy} =&gt; \[ philosophy} 6. {law#6, practice of law} =&gt; \[ learned profession} 7. {police, police force, constabulary, law#7}</Paragraph>
    <Paragraph position="6"> ent meanings, as defined in WordNet where the numbers in parentheses indicate the number of hits. By setting the threshold at t = 2, we keep only sense #2 and #3.</Paragraph>
    <Paragraph position="7"> Next, Algorithm 2 is applied to rank the four possible combinations (two for the verb times two for the noun). The results are summarized in Table 2: (1) \[cdij\[ - the number of common concepts between the verb and noun hierarchies; (2) descendantsj the total number of nouns within the hierarchy of each sense nj; and (3) the conceptual density Cij for each pair ni - vj derived using the formula presented above.</Paragraph>
    <Paragraph position="8">  tual density and the conceptual density Cij The largest conceptual density C12 = 0.30 corresponds to V 1 -- n2:revise#l~2 - law#2/5 (the notation #i/n means sense i out of n pos- null sible tion Cor, senses given by WordNet). This combinaof verb-noun senses also appears in Semfile br-a01.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML