<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0829">
  <Title>WSD Based on Mutual Information and Syntactic Patterns</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> We will describe in this paper the system that we presented at the SENSEVAL-3 competition in the English all-words and lexical-sample tasks. It is an unsupervised system that relies only on dictionary information and raw co-occurrence data collected from a large untagged corpus. There is also a supervised extension of the system for the lexical-sample task that takes into account the training data provided for that task. We will describe two heuristics; the first one selects the sense whose synset contains the synonym with the highest Mutual Information (MI) with a context word.</Paragraph>
    <Paragraph position="1"> This heuristic will be covered in section 2. The second heuristic relies on a set of syntactic structure rules that support particular senses. These rules have been extracted from the examples in WordNet sense glosses. Section 3 will be devoted to this technique.</Paragraph>
    <Paragraph position="2"> In section 4 we will explain the combination of both heuristics, finishing in section 5 with our conclusions and some considerations for future work.</Paragraph>
    <Paragraph position="3"> 2 Selection of the closest variant
In the second edition of SENSEVAL, we presented a system, described in (Fernández-Amorós et al., 2001), that assigned scores to each word sense by adding up Mutual Information estimates between all the pairs (word-in-context, word-in-gloss). We have identified some problems with this technique.</Paragraph>
    <Paragraph position="4"> * This exhaustive use of the mutual information estimates turned out to be very noisy, given that the errors in the individual mutual information estimates often correlated, thus affecting the final score for a sense.</Paragraph>
    <Paragraph position="5"> * Sense glosses usually contain vocabulary that is not particularly relevant to the specific sense.</Paragraph>
    <Paragraph position="6"> * Another typical problem for unsupervised systems is that the sense inventory contains many senses with little or no presence in actual texts.</Paragraph>
    <Paragraph position="7"> This last problem has been addressed in a very straightforward manner: we have discarded the senses of a word whose relative frequency is below 10%.</Paragraph>
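The pruning step above can be sketched as follows; the sense-frequency table and its source are hypothetical, since the paper does not specify where the frequencies come from:

```python
def prune_senses(sense_freqs, threshold=0.10):
    """Discard the senses of a word whose relative frequency is below
    the threshold (10% in the paper).
    sense_freqs: {sense_id: absolute frequency} for one word."""
    total = sum(sense_freqs.values())
    return {s: f for s, f in sense_freqs.items()
            if total and f / total >= threshold}
```

For example, with frequencies 60/35/5 over three senses, the third sense falls under the 10% threshold and is dropped.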
    <Paragraph position="8"> The first problem might very well improve by itself as larger untagged corpora become available and increasing computing power eliminates the need for a limited controlled vocabulary in the MI calculations. In any case, the solution we have tried to implement for this source of problems (cumulative errors in the estimates biasing the final result) consists in restricting the application of the MI measure to promising candidates.</Paragraph>
    <Paragraph position="9"> An interesting criterion for the selection of these candidates is to select those words in the context that form a collocation with the word to be disambiguated, in the sense defined in (Yarowsky, 1993). Yarowsky claimed that collocations are nearly monosemous, so identifying them would allow us to focus on very local context, which should make the disambiguation process, if not more efficient, at least easier to interpret.</Paragraph>
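Candidate selection over a POS-tagged sentence might look like the following sketch; only the adjacent noun/noun pattern used in the church example is shown, and the full pattern inventory is an assumption, as the text does not enumerate it:

```python
def collocation_candidates(tagged_sentence, target_index):
    """Return context words forming a simple Yarowsky-style collocation
    with the target word. Only the adjacent noun/noun pattern is
    sketched here (e.g. "stone church" yields "stone").
    tagged_sentence: list of (word, pos) pairs with Penn Treebank tags."""
    cands = []
    word, pos = tagged_sentence[target_index]
    if pos.startswith('NN'):
        # noun immediately to the left of a noun target
        if target_index > 0:
            prev_w, prev_pos = tagged_sentence[target_index - 1]
            if prev_pos.startswith('NN'):
                cands.append(prev_w)
        # noun immediately to the right of a noun target
        if target_index + 1 < len(tagged_sentence):
            next_w, next_pos = tagged_sentence[target_index + 1]
            if next_pos.startswith('NN'):
                cands.append(next_w)
    return cands
```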
    <Paragraph position="10"> One example of a test item that was incorrectly disambiguated by the systems described in (Fernández-Amorós et al., 2001) is the word church in the sentence :</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004. Association for Computational Linguistics
</SectionTitle>
      <Paragraph position="0"> An ancient stone church stands amid the fields, the sound of bells cascading from its tower, calling the faithful to evensong.</Paragraph>
      <Paragraph position="1"> The applicable collocation here would be noun/noun so that stone is the context word to be used.</Paragraph>
      <Paragraph position="2"> To address the second problem, the use of non-relevant words in the glosses, we have decided to consider only the variants (the synonyms in a synset, in the case of WordNet) of each sense. These synonyms (i.e. variants of a sense) constitute the core of WordNet synsets: a change in a synset implies a change in the senses of the corresponding words, while the glosses are just additional information of secondary importance in the design of the sense inventory. To continue with the example, the synonyms for the three synsets for church in WordNet are (excluding church itself, which is obviously common to all the synsets) :</Paragraph>
      <Paragraph position="4"> We did not compute the MI of compound words; instead, we split them. Since church is the word to be disambiguated, Christian church is converted to church, church building to building and church service to service. The numbers in parentheses indicate the MI (footnote 1) between the term and stone. In this case we have a clear and strong preference for the second sense, which happens to be in accordance with the gold standard.</Paragraph>
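The compound splitting and variant selection described above can be sketched as follows; the `mi` callback and the score values in the test are illustrative placeholders, not the paper's actual corpus estimates:

```python
def closest_variant(variants_by_sense, target, context_word, mi):
    """Pick the sense whose (split) variant has the highest MI with
    the context word. Compound variants are split on whitespace and
    the target word itself is skipped, since it is common to all
    senses (e.g. 'church building' contributes only 'building')."""
    best_sense, best_mi = None, float('-inf')
    for sense, variants in variants_by_sense.items():
        for variant in variants:
            for token in variant.split():
                if token == target:
                    continue
                score = mi(token, context_word)
                if score > best_mi:
                    best_sense, best_mi = sense, score
    return best_sense
```

On the running example, the variant building of the second sense has the highest MI with stone, so that sense is chosen.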
      <Paragraph position="5"> Unfortunately, we did not have the time to finish a collocation detection procedure; we only had enough time to POS-tag the text with the Brill tagger (Brill, 1992) and parse it with the Collins parser (Collins, 1999). That effort was put to use in the syntactic pattern-matching heuristic in the next section, so in this case we limited ourselves to detecting, for each variant, the context word with the highest MI.</Paragraph>
      <Paragraph position="6"> It is important to note that this heuristic does not depend on the glosses and is completely unsupervised, so it can be applied to any language with a sense inventory based on variants, as is the case with the languages in EuroWordNet, and an untagged corpus.</Paragraph>
      <Paragraph position="7"> We have evaluated this heuristic and the results are shown in table 1.
1 For words a and b, MI(a,b) = p(a ∧ b) / (p(a) · p(b)); the probabilities are estimated in a corpus.</Paragraph>
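The MI measure of footnote 1 can be estimated directly from corpus counts by maximum likelihood; a minimal sketch, where the choice of co-occurrence window (and hence the pair counts) is an assumption not specified in the text:

```python
from collections import Counter

def mi(a, b, unigrams, pairs, n_tokens, n_pairs):
    """MI(a,b) = p(a ^ b) / (p(a) * p(b)), as in footnote 1, with
    probabilities estimated by maximum likelihood from corpus counts.
    unigrams: Counter of word frequencies; pairs: Counter of
    co-occurrence counts for (a, b) pairs within some window."""
    p_a = unigrams[a] / n_tokens
    p_b = unigrams[b] / n_tokens
    p_ab = pairs[(a, b)] / n_pairs
    if p_a == 0.0 or p_b == 0.0 or p_ab == 0.0:
        return 0.0
    return p_ab / (p_a * p_b)
```

Note that, as written in the footnote, this is the probability ratio itself rather than its logarithm; MI values above 1 indicate that the two words co-occur more often than chance.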
    </Section>
  </Section>
</Paper>