<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0809">
  <Title>Dutch Word Sense Disambiguation: Optimizing the Localness of Context</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Dutch WSD system: Algorithms, data, instance generation
</SectionTitle>
    <Paragraph position="0"> data, instance generation The memory-based WSD system for Dutch, henceforth referred to as MBWSD-D, is built from the viewpoint of WSD as a classification task. Given an ambiguous word and its context as input features, a data-trained classifier assigns the contextually correct class (sense) to it. Our approach to memory-based all-words WSD follows the memory-based approach of (Ng and Lee, 1996), and the work by (Veenstra et al., 2000) on a memory-based approach to the English lexical sample task of SENSEVAL-1. We borrow the classification-based approach, and the word-expert concept of the latter: for each wordform, a word expert classifier is trained on disambiguating its one particular wordform.</Paragraph>
    <Paragraph position="1"> In this section we give an overview of the learning algorithms used, the data, and how this data was converted into instances of ambiguous words in context, to make the WSD task learnable for the memory-based word experts.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Learning algorithms
</SectionTitle>
      <Paragraph position="0"> The distinguishing feature of memory-based learning (MBL) in contrast with minimal-descriptionlength-driven or &amp;quot;eager&amp;quot; ML algorithms is that MBL keeps all training data in memory, and only abstracts at classification time by extrapolating a class from the most similar item(s) in memory to the new test item. This strategy is often referred to as &amp;quot;lazy&amp;quot; learning. In recent work (Daelemans et al., 1999) we have shown that for typical natural language processing tasks, this lazy learning approach performs well because it allows extrapolation from low-frequency or exceptional cases, whereas eager methods tend to treat these as discardable noise.</Paragraph>
      <Paragraph position="1"> Also, the automatic feature weighting in the similarity metric of a memory-based learner makes the approach well-suited for domains with large numbers of features from heterogeneous sources, as it embodies a smoothing-by-similarity method when data is sparse (Zavrel and Daelemans, 1997). For our experiments, we used the MBL algorithms implemented in TIMBL1. We give a brief overview of the algorithms and metrics here, and refer to (Daelemans et al., 1997; Daelemans et al., 2001) for more information.</Paragraph>
      <Paragraph position="2"> IB1 - The distance between a test item and each memory item is defined as the number of features for which they have a different value (Aha et al., 1991). Classification occurs via the knearest-distances rule: all memory items which are equally near at the nearest a2 distances surrounding the test item are taken into account in classification. The classification assigned to the test item is simply the majority class among the memory items at the a2 nearest distances.</Paragraph>
      <Paragraph position="3"> Feature-weighted IB1 - In most cases, not all features are equally relevant for solving the task; different types of weighting are available in TIMBL to assign differential cost to a feature value mismatch during comparison. Some of these are information-theoretic (based on measuring the reduction of uncertainty about the class to be predicted when knowing the value of a feature): information gain and gain ratio.</Paragraph>
      <Paragraph position="4"> Others are statistical (based on comparing expected and observed frequencies of value-class associations): chi-squared and shared variance.</Paragraph>
      <Paragraph position="5"> Distance-weighted IB1 - Instead of simply taking the majority class among all memory items in the a2 nearest distances, the class vote of each memory item is weighted by its distance.</Paragraph>
      <Paragraph position="6"> The more distant a memory item is to the test item, the lower its class vote is. This can be implemented by using several mathematical functions; the TIMBL software implements linear inversed distance weights, inversed distance weights, and exponentially decayed distance weights.</Paragraph>
      <Paragraph position="7">  dered. In the previous variants, mismatches between values are all interpreted as equally important, regardless of how similar (in terms of classification behaviour) the values are. We adopted the modified value difference metric (Cost and Salzberg, 1993) to assign a different distance between each pair of values of the same feature. This algorithm can also be combined with the different feature weighting methods.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Data
</SectionTitle>
      <Paragraph position="0"> The Dutch WSD corpus was built as a part of a sociolinguistic project, led by Walter Schrooten and Anne Vermeer (1994), on the active vocabulary of children in the age of 4 to 12 in the Netherlands.</Paragraph>
      <Paragraph position="1"> The aim of developing the corpus was to have a realistic wordlist of the most common words used at elementary schools. This wordlist was further used in the study to make literacy tests, including tests how many senses of ambiguous words were known by children of different ages. The corpus consists of texts of 102 illustrated children books in the age range of 4 to 12. Each word in these texts is manually annotated with its appropriate sense. The data was annotated by six persons who all processed a different part of the data.</Paragraph>
      <Paragraph position="2"> Each word in the dataset has a non-hierarchical, symbolic sense tag, realised as a mnemonic description of the specific meaning the word has in the sentence, often using a related term. As there was no gold standard sense set of Dutch available, Schrooten and Vermeer have made their own set of senses, based on a children's dictionary (Van Dale, 1996). Sense tags consist of the word's lemma and a sense description of one or two words (berg stapel ) or a reference of the grammatical category (fiets N, fietsen V). Verbs have as their tag their lemma and often a reference to their function in the sentence (bent/zijn kww). When a word has only one sense, this is represented with a simple &amp;quot;=&amp;quot;. Names and sound imitations also have &amp;quot;=&amp;quot; as their sense tag. The dataset also contains senses that span over multiple words. These multi-word expressions cover idiomatic expressions, sayings, proverbs, and strong collocations. Each word in the corpus that is part of such multi-word expression has as its meaning the atomic meaning of the expression.</Paragraph>
      <Paragraph position="3"> These are two example sentences in the corpus:</Paragraph>
      <Paragraph position="5"> After SENSEVAL-2 the data was manually inspected to correct obvious annotation errors. 845 changes were made. The dataset now contains 152,728 tokens (words and punctuation tokens) from 10,258 different wordform types. 9133 of these wordform types have only one sense, leaving 1125 ambiguous wordform types.The average polysemy is 3.3 senses per wordform type and 10.7 senses per ambiguous token. The latter high number is caused by the high polysemy of high frequent prepositions which are part of many multi-word expressions. These ambiguous types account for 49.6 % (almost half) of the tokens in the corpus. As with the SENSEVAL-2 competition, the dataset was divided in  two parts. The training set consists of 76 books and 114,959 tokens. The test set contains the remaining 26 books and has 37,769 tokens.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Instance generation
</SectionTitle>
      <Paragraph position="0"> Instances on which the system is trained, consist only of features that are expected to give salient information about the sense of the ambiguous word.</Paragraph>
      <Paragraph position="1"> Several information sources have been suggested by the literature, such as local context of the ambiguous word, part-of-speech information and keywords.</Paragraph>
      <Paragraph position="2"> A previous study, described in (Hoste et al., 2002b) showed that MBWSD-D trained only on local features, has a better performance on the test set than all other variants that use keyword information.</Paragraph>
      <Paragraph position="3"> In this study the local context consisted of the three neighbouring words right and left of the ambiguous word and their part-of-speech tags. It performed even better than a system that combined several classifiers, including the local classifier itself, in a voting scheme.</Paragraph>
      <Paragraph position="4"> This suprising fact could have been caused by the use of an ineffective keyword selection method. The keywords were selected through a selection method suggested by (Ng and Lee, 1996) within three sentences around the ambiguous word; only content words were used as candidates. So, our first step was to try two different selection methods often used for this task: information gain and loglikelihood. Although both selection methods gave better results on the training set (information gain: 86.4, log-likelihood: 86.4, local classifier: 86.1), the results on the test set (information gain: 84.1, loglikelihood: 83.9) were still not higher than the score of the local classifier (84.2).</Paragraph>
      <Paragraph position="5"> As the use of keyword information does not seem to contribute to the Dutch WSD system, we decided to pursue optimizing the local context information. The previously used local context of three was never tested against smaller or bigger contexts, so for this study we varied the context from one word to five words left and right, plus their part-of-speech (POS) tags (i.e., we tested symmetrical contexts only). POS tags of the focus word itself are also included, to aid sense disambiguations related to syntactic differences (Stevenson and Wilks, 2001). POS tags were generated by MBT (Daelemans et al., 1996).</Paragraph>
      <Paragraph position="6"> The following is an instance of the ambiguous word donker [dark] and its context &amp;quot;(...)zei : hmmm , het donker is ook niet zo eng(...) [said:,hmm the dark is also not so scary]&amp;quot;: V zei Punc : Int hmmm Punc , Art het N V is Adv ook Adv niet Adv zo Adj eng donker duister Instances were made for each ambiguous word, consisting of 22 features. The first ten features represent the five words left to the ambiguous focus word and their part-of-speech tags, followed by the part-of-speech tag of the focus word, in this example N which stands for noun. The next ten features contain the five neighbouring words and tags to the right of the focus word. The last feature shows the classification of the ambiguous word, in this case donker duister [the dark].</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Cross-validating parameters and local context
</SectionTitle>
    <Paragraph position="0"> context In principle, word experts should be constructed for all words with more than one sense. However, many ambiguous words occur only a few times. Word experts trained on such small amount of data may not surpass guessing the most frequent sense. In a previous experiment (Hoste et al., 2002b) it was shown that building word experts for words that occur at least ten times in the training data, yield the best results. In the training set, 484 wordforms exceeded the threshold of 10. For all words of which the frequency is lower than the threshold, the most frequent sense was predicted.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Cross-validating algorithmic parameters and local context
</SectionTitle>
      <Paragraph position="0"> and local context For each of the 484 word experts, we performed an exhaustive matrix of experiments, cross-validating on training material through 10-fold cross-validation experiments. We varied among algorithmic parameters set out in Section 2, and among local context sizes. In detail, the matrix spanned the following  a3a5a4a7a6a9a8a7a6a9a8a10a6a9a11a12a6a9a11a14a13a15a3a5a4a16a4a16a4 variations: a17 The a2 parameter, representing the number of nearest distances in which memory items are  searched. In the experiments, a2 was varied between 1, 3, 5, 7, 9, 11, 15, 25, 35 and 45. a17 Feature weighting: all experiments were performed without feature-weighting, and with feature-weighted IB1 using gain ratio weighting, information gain, chi-square and shared  difference metric MVDM.</Paragraph>
      <Paragraph position="1"> a17 Local context size: all experiments were performed with symmetric context widths 1 to 5, where &amp;quot;5&amp;quot; means five left and five right neighbouring words with their POS tags.</Paragraph>
      <Paragraph position="2"> For each word expert, from these 1000 experiments the best-performing parameter setting was selected. Cross-validating on training material, the optimal accuracy of the word experts on ambiguous held-out words was 87.3%, considerably higher than the baseline of 77.0%). Subsequently, the best settings were used in a final experiment, in which all word experts were trained on all available training material and tested on the held-out test set. To further evaluate the results, described in the next section, the results were compared with a baseline score. The baseline was to select for each wordform its most frequent sense. Of the 484 wordforms for which word experts were made, 470 occured in the test set.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>