<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1112"> <Title>A Lemma-Based Approach to a Maximum Entropy Word Sense Disambiguation System for Dutch</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Lemma-Based Approach </SectionTitle> <Paragraph position="0"> As mentioned in the previous section, lemmatization collapses all inflected forms of a given word to the same lemma. In our system, separate classifiers are built for every ambiguous wordform.</Paragraph> <Paragraph position="1"> Normally, this implies that occurrences of a particular ambiguous word are grouped together on the basis of their shared wordform. Instead, we chose a model that constructs classifiers based on lemmas, thereby reducing the number of classifiers that need to be built.</Paragraph> <Paragraph position="2"> As Yarowsky (1994) has already noted, lemmas provide more concise and generic evidence than inflected forms. Building classifiers based on lemmas therefore increases the data available to each classifier: all instances of, for example, one verb are clustered into a single classifier instead of several classifiers (one for each inflected form found in the data). In this way, more training data per ambiguous wordform is available to each classifier. The expectation is that this should increase the accuracy of our maximum entropy WSD system in comparison to the wordform-based model.</Paragraph> <Paragraph position="3"> Figure 1 shows how the system works. During training, every wordform is first checked for ambiguity, i.e. whether more than one sense is associated with its occurrences. If the wordform is ambiguous, the number of lemmas associated with it is looked up. If the wordform has exactly one lemma, all occurrences of this lemma in the training data are used to build the classifier for that particular wordform (and for all other wordforms sharing the same lemma). If a wordform has more than one lemma, a classifier based on the wordform itself is built. This strategy was adopted so that all ambiguous words can be handled, notwithstanding lemmatization errors or wordforms that can genuinely be assigned two or more lemmas.</Paragraph> <Paragraph position="4"> An example of a word that has two different lemmas depending on the context is boog: it can either be the past tense of the verb buigen ('to bend') or the noun boog ('arch'). Since the Dutch SENSEVAL-2 data is not only ambiguous with regard to meaning but also with regard to PoS, both lemmas are subsumed in the wordform classifier for boog.</Paragraph> <Paragraph position="5"> During testing, we check for each word whether a classifier is available for either its wordform or its lemma and apply that classifier to the test instance.</Paragraph> </Section>
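The classifier-selection strategy of Section 3 can be illustrated with a short Python sketch. This is an illustration only, not the authors' implementation; the instance representation (dictionaries with 'wordform', 'lemma' and 'sense' fields) and the function name are assumptions.

```python
from collections import defaultdict

def group_training_instances(instances):
    """Decide, per ambiguous wordform, whether to train a lemma-based
    or a wordform-based classifier, and collect its training instances."""
    senses = defaultdict(set)     # wordform -> senses observed
    lemmas = defaultdict(set)     # wordform -> lemmas observed
    by_lemma = defaultdict(list)  # lemma -> all instances of that lemma
    for inst in instances:
        senses[inst['wordform']].add(inst['sense'])
        lemmas[inst['wordform']].add(inst['lemma'])
        by_lemma[inst['lemma']].append(inst)

    classifiers = {}
    for wordform, observed_senses in senses.items():
        if len(observed_senses) < 2:
            continue  # unambiguous wordform: no classifier needed
        if len(lemmas[wordform]) == 1:
            # one lemma: pool all occurrences of that lemma (all inflected forms)
            lemma = next(iter(lemmas[wordform]))
            classifiers[('lemma', lemma)] = by_lemma[lemma]
        else:
            # several lemmas (e.g. 'boog'): fall back to a wordform classifier
            classifiers[('wordform', wordform)] = [
                inst for inst in instances if inst['wordform'] == wordform]
    return classifiers
```

At test time an instance is routed to its wordform classifier if one exists and to its lemma classifier otherwise, mirroring the lookup described above.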
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Maximum Entropy Word Sense Disambiguation System </SectionTitle> <Paragraph position="0"> Our WSD system is founded on the idea of combining statistical classification with linguistic sources of knowledge. In order to take full advantage of the linguistic information, we need a classification algorithm capable of incorporating the information provided. The main advantage of maximum entropy modeling is that heterogeneous and overlapping information can be combined in a single statistical model. Other learning algorithms, such as decision lists, only take the strongest feature into account, whereas maximum entropy combines them all. Also, no independence assumptions, as in e.g. Naive Bayes, are necessary.</Paragraph> <Paragraph position="1"> We will now describe the different steps in putting together the WSD system we used to incorporate and test our lemma-based approach, starting with an introduction of maximum entropy, the machine learning algorithm used for classification. Then, smoothing with Gaussian priors will be explained.</Paragraph> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Maximum Entropy Classification </SectionTitle> <Paragraph position="0"> Several problems in NLP have lent themselves to solutions using statistical language processing techniques. Many of these problems can be viewed as a classification task in which linguistic classes have to be predicted given a context.</Paragraph> <Paragraph position="1"> The statistical classifier used in the experiments reported in this paper is a maximum entropy classifier (Berger et al., 1996; Ratnaparkhi, 1997b). Maximum entropy is a general technique for estimating probability distributions from data. A probability distribution is derived from a set of events based on the computable qualities (characteristics) of these events. The characteristics are called features, and the events are sets of feature values.</Paragraph> <Paragraph position="2"> If nothing about the data is known, estimating a probability distribution using the principle of maximum entropy involves selecting the most uniform distribution, where all events have equal probability. In other words, it means selecting the distribution which maximises the entropy.</Paragraph> <Paragraph position="3"> If data is available, a number of features extracted from the labeled training data are used to derive a set of constraints for the model. This set of constraints characterises the class-specific expectations for the distribution. So, while the distribution should maximise the entropy, the model should also satisfy the constraints imposed by the training data. A maximum entropy model is thus the model with maximum entropy of all models that satisfy the set of constraints derived from the training data.</Paragraph> <Paragraph position="4"> The model consists of a set of features which occur on events in the training data. Training itself amounts to finding weights for each feature using the following formula:</Paragraph> <Paragraph position="5"> p(c|x) = \frac{1}{Z} \exp\left( \sum_{i=1}^{n} \lambda_i f_i(x,c) \right) </Paragraph> <Paragraph position="6"> where the property function f_i(x,c) represents the number of times feature i is used to find class c for event x, and the weights λ_i are chosen to maximise the likelihood of the training data and, at the same time, maximise the entropy of p. Z is a normalizing constant, constraining the distribution to sum to 1, and n is the total number of features.</Paragraph> <Paragraph position="7"> This means that during training the weight λ_i for each feature i is computed and stored. During testing, the sum of the weights λ_i of all features i found in the test instance is computed for each class c, and the class with the highest score is chosen.</Paragraph>
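The classification step defined by this formula can be made concrete with a small, self-contained Python sketch. The feature names and weight values are invented for illustration; they are not taken from the paper's models.

```python
import math

def classify(active_features, weights, classes):
    """Score each class as the sum of the weights lambda_i of the active
    features, exponentiate, normalize by Z, and return the best class."""
    scores = {c: sum(weights.get((f, c), 0.0) for f in active_features)
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())        # normalizing constant Z
    probs = {c: math.exp(s) / z for c, s in scores.items()}
    return max(probs, key=probs.get), probs

# Hypothetical weights for two senses of 'bloem' and two active features.
weights = {('lemma=bloem', 'plant'): 1.2, ('left1=plukken', 'plant'): 0.8,
           ('lemma=bloem', 'flour'): 0.9, ('left1=plukken', 'flour'): -0.4}
print(classify(['lemma=bloem', 'left1=plukken'], weights, ['plant', 'flour']))
```

Because Z is the same for every class, choosing the class with the highest summed weight is equivalent to choosing the class with the highest probability, which is exactly the testing step described above.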
<Paragraph position="8"> A big advantage of maximum entropy modeling is that the features can include any information which might be useful for disambiguation. Thus, dissimilar types of information, such as various kinds of linguistic knowledge, can be combined into a single model for WSD without having to assume independence of the different features. Furthermore, good results have been produced in other areas of NLP research using maximum entropy techniques (Berger et al., 1996; Koeling, 2001; Ratnaparkhi, 1997a).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Smoothing: Gaussian Priors </SectionTitle> <Paragraph position="0"> Since NLP maximum entropy models usually have a large number of features and a great deal of sparseness (e.g. features seen in testing that do not occur in training), smoothing is essential as a way to optimize the feature weights (Chen and Rosenfeld, 2000; Klein and Manning, 2003). In the case of the Dutch SENSEVAL-2 data, little training data is available for many ambiguous words, which makes smoothing all the more important.</Paragraph> <Paragraph position="1"> The intuition behind Gaussian priors is that the parameters in the maximum entropy model should not become too large, since very large or infinite feature weights cause optimization problems. In other words, we enforce that each parameter is distributed according to a Gaussian prior with mean μ and variance σ². This prior expectation over the distribution of parameters penalizes parameters for drifting too far from their mean prior value, which is μ = 0.</Paragraph> <Paragraph position="2"> Using Gaussian priors has a number of effects on the maximum entropy model. We trade off some expectation-matching for smaller parameters. Also, when multiple features can be used to explain a data point, the more common ones generally receive more weight. Last but not least, accuracy generally goes up and convergence is faster.</Paragraph> <Paragraph position="3"> In the current experiments the Gaussian prior was set to σ² = 1000 (based on preliminary experiments), which led to an overall increase of at least 0.5% compared to a model built without smoothing.</Paragraph> </Section> </Section>
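To make the penalty described in Section 4.2 explicit: with a zero-mean Gaussian prior, training maximises a penalized log-likelihood of roughly the following form. This is the standard formulation from the smoothing literature (e.g. Chen and Rosenfeld, 2000), shown here for illustration; it is not quoted from the paper.

```latex
% Log-likelihood of the training data minus a Gaussian (L2) penalty on the weights;
% the second term pulls every weight lambda_i back towards its prior mean mu = 0.
\ell'(\lambda) = \sum_{(x,c)} \log p(c \mid x) - \sum_{i=1}^{n} \frac{\lambda_i^{2}}{2\sigma^{2}}
```

A large variance such as the σ² = 1000 used here corresponds to a weak prior: only very large weights are noticeably penalized, while the bulk of the weights remain almost untouched.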
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Corpus Preparation and Building Classifiers </SectionTitle> <Paragraph position="0"> In the context of SENSEVAL-2, the first sense-tagged corpus for Dutch was made available (see (Hendrickx and van den Bosch, 2001) for a detailed description). The training section of the Dutch SENSEVAL-2 dataset contains approximately 120,000 tokens and 9,300 sentences, whereas the test section consists of ca. 40,000 tokens and 3,000 sentences.</Paragraph> <Paragraph position="1"> In contrast to the English WSD data available from SENSEVAL-2, the Dutch WSD data is not only ambiguous in word senses, but also with regard to PoS. This means that accurate PoS information is important for the WSD system to achieve morpho-syntactic as well as semantic disambiguation.</Paragraph> <Paragraph position="2"> First, the corpus is lemmatized (see section 2) and part-of-speech tagged. We used the Memory-Based tagger MBT (Daelemans et al., 2002a; Daelemans et al., 2002b) with the (limited) WOTAN tag set (Berghmans, 1994; Drenth, 1997) to PoS tag our data (see (Gaustad, 2003) for an evaluation of different PoS taggers on this task). Since we are only interested in the main PoS categories, we discarded all additional information from the assigned PoS, which resulted in 12 different tags being kept. In the current experiments, we included the PoS of the ambiguous wordform (important for the morpho-syntactic disambiguation) as well as the PoS of the context words or lemmas.</Paragraph> <Paragraph position="3"> After the preprocessing (lemmatization and PoS tagging), all instances of each ambiguous wordform are extracted from the corpus. (A wordform is 'ambiguous' if it has two or more different senses/classes in the training data; the sense '=' is seen as marking the basic sense of a word/lemma and is therefore also taken into account.) These instances are then transformed into feature vectors containing the features specified in a particular model. The model we used in the reported experiments includes information on the wordform, its lemma, its PoS, context words to the left and right as well as the context PoS, and its sense/class.</Paragraph> <Paragraph position="4"> Below we show an example of a feature vector for the ambiguous word bloem ('flower'/'flour') in example sentence (1), glossed as 'Now he went to pick flowers and made a crown of it.': bloemen, bloem, N, nu, gaan, hij, Adv, V, Pron, plukken, en, maken, V, Conj, V, bloem_plant. The first slot represents the ambiguous wordform, the second its lemma, and the third the PoS of the ambiguous wordform; the following twelve slots contain the context lemmas and their PoS (left context before right context); the last slot represents the sense or class. Various preliminary experiments have shown a context size of 3 context words, i.e. 3 words to the left and 3 words to the right of the ambiguous word, to achieve the best and most stable results. Only context words within the same sentence as the ambiguous wordform were taken into account.</Paragraph> <Paragraph position="5"> Earlier experiments showed that using lemmas as context instead of wordforms increases accuracy due to the compression achieved through lemmatization (as explained earlier in this paper and put into practice in the lemma-based approach). With lemmas, fewer context features have to be estimated, thereby counteracting data sparseness.</Paragraph> <Paragraph position="6"> In the experiments presented here, no frequency threshold was used. Experiments have shown that building classifiers even for wordforms with very few training instances yields better results than applying a frequency threshold and using the baseline (assigning the most frequent sense) for wordforms with fewer training instances than the threshold. It has to be noted, though, that the effect of applying a threshold may depend on the choice of learning algorithm.</Paragraph> </Section> </Paper>
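To illustrate the feature extraction described in Section 5, the following Python sketch builds a vector with the layout above from a lemmatized, PoS-tagged sentence. The function, the (wordform, lemma, PoS) triples and the Dutch surface forms in the toy sentence are illustrative assumptions, not the authors' code or data.

```python
def make_feature_vector(tokens, i, context=3, pad="_"):
    """Build the feature vector for the ambiguous token at position i.

    `tokens` is a list of (wordform, lemma, pos) triples for one sentence;
    positions outside the sentence are padded. Layout: wordform, lemma, PoS,
    3 left context lemmas, their PoS, 3 right context lemmas, their PoS.
    The sense/class label is appended separately during training.
    """
    def slot(j):
        return tokens[j] if 0 <= j < len(tokens) else (pad, pad, pad)

    wordform, lemma, pos = tokens[i]
    left = [slot(i - k) for k in range(context, 0, -1)]
    right = [slot(i + k) for k in range(1, context + 1)]
    return ([wordform, lemma, pos]
            + [l for _, l, _ in left] + [p for _, _, p in left]
            + [l for _, l, _ in right] + [p for _, _, p in right])

# Toy fragment around 'bloemen' (lemma 'bloem'); values chosen to mirror the example.
sentence = [("nu", "nu", "Adv"), ("ging", "gaan", "V"), ("hij", "hij", "Pron"),
            ("bloemen", "bloem", "N"), ("plukken", "plukken", "V"),
            ("en", "en", "Conj"), ("maakte", "maken", "V")]
print(make_feature_vector(sentence, 3))
# ['bloemen', 'bloem', 'N', 'nu', 'gaan', 'hij', 'Adv', 'V', 'Pron',
#  'plukken', 'en', 'maken', 'V', 'Conj', 'V']
```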