<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-3007">
  <Title>Word Sense Disambiguation for Cross-Language Information Retrieval</Title>
  <Section position="5" start_page="35" end_page="36" type="metho">
    <SectionTitle>
3 Methodology
</SectionTitle>
    <Paragraph position="0"> To disambiguate a given word, we would like to know the probability that a sense occurs in a given context, i.e., P(semse\[context). In this study, WordNet synsets are used to represent word senses, so P(senselcontext) can be rewritten as P(synsetlcontext), for each synset of which that word is a member. For nouns, we define the context of word w to be the occurrence of words in a moving window of I00 words (50 words on each side) around w 2.</Paragraph>
    <Paragraph position="1"> By Bayes Theorem, we can obtain the desired probability by inversion (see equation (I)). Since we are not specifically concerned with getting accurate probabilities but rather relative rank order for sense selection, we ignore P(context(w)) and focus on estimating P(context(w)lsymet)P(synset). The event space l~om which &amp;quot;context(w)&amp;quot; is drawn is the set of sets of words that ever appear with each other in the window around w. In other words, w induces a partition on the set of words. We define &amp;quot;context(w)&amp;quot; to be true whenever any of the words in the set appears in the window around w, and conversely to be false whenever none of the words in the set appears around w. If we assume independence of appearance of any two words in a given context, then we get:</Paragraph>
    <Paragraph position="3"> Due to the lack of sense-tagged corpora, we are not able to directly estimate P(synset) and P(wilsymet). Instead, we introduce &amp;quot;noisy estimators&amp;quot; (Pdsymet) and Pdwl\]symet)) to approximate these probabilities. In doing so, we make two assumptions: l) The presence of any word Wk that belongs to synset si signals the presence of si; 2) Any word Wk belongs to all its synsets simultaneously, and with equal probability. Although the assumptions underlying the &amp;quot;noisy estimators&amp;quot; are not strictly true, it is our belief that the &amp;quot;noisy estimators&amp;quot; should work reasonably well if: * The words that belong to symet sitend to appear in similar contexts when si is their intended sense; * These words do not completely overlap with the words belonging to some synset sj ( i ~ j ) that partially overlaps with si; 2 For other parts of speech, the window size should be much smaller as suggested by previous research.  The common words between si and sj appear in different contexts when si and sj are their intended senses.</Paragraph>
  </Section>
  <Section position="6" start_page="36" end_page="37" type="metho">
    <SectionTitle>
4 The WSD Algorithm
</SectionTitle>
    <Paragraph position="0"> We chose as a basis the algorithms described by Yan'owsky (1992) and by Cheng and Wilensky (1997). In our variation, we use the synset numbers in WordNet to represent the senses of a word. Our algorithm learns associations of WordNet synsets with words in a surrounding context to determine a word sense. It consists of two phases.</Paragraph>
    <Paragraph position="1"> During the training phase, the algorithm reads in all training documents in collection and computes the distance-adjusted weight of co-occurrence of each word with each corresponding synset. This is done by establishing a 100-word window around a target word (50 words on each side), and correlating each synset to which the target word belongs with each word in the surrounding window. The result of the training phase is a matrix of associations of words with synsets.</Paragraph>
    <Paragraph position="2"> In the sense prediction phase, the algorithm takes as input randomly selected testing documents or sentences that contain the polysemous words we want to disambiguate and exploits the context vectors built in the training phase by adding up the weighted &amp;quot;votes&amp;quot;. It then returns a ranked list of probability values associated with each synset, and chooses the synset with the highest probability as the sense of the ambiguous word.</Paragraph>
    <Paragraph position="3"> Figure 1 and Figure 2 show an outline of the algorithm.</Paragraph>
    <Paragraph position="4"> In this algorithm, &amp;quot;noisy estimators&amp;quot; are employed in the sense prediction phase. They are calculated using following formulas: M\[w, Ix\] Po(wilx)-- LwM\[wIx\] (3) where wi is a stem, x is a given synset, M\[w\]\[x\] is a cell in the correlation matrix that corresponds to word w and synset x, and</Paragraph>
    <Paragraph position="6"> where w is any stem in the collection, x is a given symet, y is any synset ever occurred in collection.</Paragraph>
    <Paragraph position="7"> For each document d in collection read in a noun stem w from d for each synset s in which w occurs get the column b in the association matrix M that corresponds to s if the column already exists; create a new column for s otherwise for each word stem j appearing in the 100-word window around w get the row a in M that corresponds to j if the row already exists; create a new row for j otherwise add a distance-adjusted weight to M\[a\]\[b\]  Set value = 1 For each word w to be disambiguated get synsets of w for each synset x ofw for each wi in the context ofw (within the 100-window around w)</Paragraph>
    <Paragraph position="9"/>
  </Section>
class="xml-element"></Paper>