<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0102"> <Title>MBT: A Memory-Based Part of Speech Tagger-Generator</Title> <Section position="4" start_page="14" end_page="15" type="metho"> <SectionTitle> 2 Memory-Based Learning </SectionTitle> <Paragraph position="0"> Memory-based Learning is a form of supervised, inductive learning from examples. Examples are represented as a vector of feature values with an associated category label.</Paragraph> <Paragraph position="1"> During training, a set of examples (the training set) is presented in an incremental fashion to the classifier, and added to memory. During testing, a set of previously unseen feature-value patterns (the test set) is presented to the system. For each test pattern, its distance to all examples in memory is computed, and the category of the least distant instance(s) is used as the predicted category for the test pattern. The approach is based on the assumption that reasoning is based on direct reuse of stored experiences rather than on the application of knowledge (such as rules or decision trees) abstracted from experience.</Paragraph> <Paragraph position="2"> In AI, the concept has appeared in several disciplines (from computer vision to robotics), using terminology such as similarity-based, example-based, memory-based, exemplarbased, case-based, analogical, lazy, nearest-neighbour, and instance-based (Stanfill and Waltz, 1986; Kolodner, 1993; Aha et al. 1991; Salzberg, 1990). Ideas about this type of analogical reasoning can be found also in non-mainstream linguistics and pyscholinguistics (Skousen, 1989; Derwing ~ Skousen, 1989; Chandler, 1992; Scha, 1992). In computational linguistics (apart from incidental computational work of the linguists referred to earlier), the general approach has only recently gained some popularity: e.g., Cardie (1994, syntactic and semantic disambiguation); Daelemans (1995, an overview of work in the early nineties on memory-based computational phonology and morphology); Jones (1996, an overview of example-based machine translation research); Federici and Pirrelli (1996).</Paragraph> <Section position="1" start_page="14" end_page="15" type="sub_section"> <SectionTitle> 2.1 Similarity Metric </SectionTitle> <Paragraph position="0"> Performance of a memory-based system (accuracy on the test set) crucially depends on the distance metric (or similarity metric) used. The most straightforward distance metric would be the one in equation (1), where X and Y are the patterns to be compared, and ~i(x~, y/) is the distance between the values of the i-th feature in a pattern with n features.</Paragraph> <Paragraph position="2"> Distance between two values is measured using equation (2), an overlap metric, for symbolic features (we will have no numeric features in the tagging application).</Paragraph> <Paragraph position="4"> We will refer to this approach as IB1 (Aha et al., 1991). We extended the algorithm described there in the following way: in case a pattern is associated with more than one category in the training set (i.e. the pattern is ambiguous), the distribution of patterns over the different categories is kept, and the most frequently occurring category is selected when the ambiguous pattern is used to extrapolate from.</Paragraph> <Paragraph position="5"> In this distance metric, all features describing an example are interpreted as being equally important in solving the classification problem, but this is not necessarily the case. 
<Paragraph position="5"> In this distance metric, all features describing an example are interpreted as being equally important in solving the classification problem, but this is not necessarily the case. In tagging, the focus word to be assigned a category is obviously more relevant than any of the words in its context. We therefore weigh each feature with its information gain: a number expressing the average reduction of training set information entropy when the value of that feature is known (Daelemans & Van den Bosch, 1992; Quinlan, 1993; Hunt et al., 1966) (Equation 3). We will call this algorithm IB1-IG.</Paragraph> <Paragraph position="6"> \[ w_f = H(C) - \sum_{v \in V_f} P(v) \, H(C \mid v) \qquad (3) \] Here H(C) is the entropy of the category labels, V_f is the set of values of feature f, and H(C|v) is the entropy of the categories of the training patterns with value v for feature f; the IB1-IG metric multiplies each term of equation (1) by the weight w_f of the corresponding feature.</Paragraph> </Section> </Section> <Section position="5" start_page="15" end_page="17" type="metho"> <SectionTitle> 3 IGTrees </SectionTitle> <Paragraph position="0"> Memory-based learning is an expensive algorithm: all feature values of each test item must be compared to the corresponding feature values of all training items. Without optimisation, it has an asymptotic retrieval complexity of O(N * F) (where N is the number of items in memory, and F the number of features). The same asymptotic complexity is of course found for memory storage in this approach. We use IGTrees (Daelemans et al., 1996) to compress the memory. IGTree is a heuristic approximation of the IB1-IG algorithm.</Paragraph> <Section position="1" start_page="15" end_page="17" type="sub_section"> <SectionTitle> 3.1 The IGTree Algorithms </SectionTitle> <Paragraph position="0"> IGTree combines two algorithms: one for compressing a case base into a tree, and one for retrieving classification information from such trees. During the construction of IGTree decision trees, cases are stored as paths of connected nodes. All nodes contain a test (based on one of the features) and a class label (representing the default class at that node). Nodes are connected via arcs denoting the outcomes of the test (feature values).</Paragraph> <Paragraph position="1"> A feature relevance ordering technique (in this case information gain, see Section 2.1) is used to determine the order in which features are used as tests in the tree. This order is fixed in advance, so the maximal depth of the tree is always equal to the number of features, and at the same level of the tree, all nodes have the same test (they are an instance of oblivious decision trees; cf. Langley & Sage, 1994). The reasoning behind this reorganisation (which is in fact a compression) is that when the computation of feature relevance points to one feature clearly being the most important in classification, search can be restricted to matching a test case to those stored cases that have the same feature value at that feature. Besides restricting search to those memory cases that match only on this feature, the case memory can be optimised by further restricting search on the second most important feature, followed by the third most important feature, etc. A considerable compression is obtained, as similar cases share partial paths.</Paragraph> <Paragraph position="2"> Figure 1: Procedure BUILD-IG-TREE. Input: a training set T of cases and an information-gain-ordered set of features F_i...F_n. Output: a (sub)tree.
1. If T is unambiguous (all cases in T have the same class c), create a leaf node with class label c.
2. Else if i = (n + 1), create a leaf node with as label the class occurring most frequently in T.
3. Otherwise, until i = n (the number of features):
* Select the first feature (test) F_i in F_i...F_n, and construct a new node N for feature F_i, with as default class c the class occurring most frequently in T.
* Partition T into subsets T_1...T_m according to the values v_1...v_m which occur for F_i in T (cases with the same value for this feature go in the same subset).
* For each j in {1, ..., m}: BUILD-IG-TREE(T_j, F_{i+1}...F_n), connect the root of this subtree to N, and label the arc with v_j.</Paragraph>
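The following sketch mirrors BUILD-IG-TREE in Python. It assumes the feature indices have already been sorted by decreasing information gain (computing the gains themselves is omitted here); the class and function names are illustrative.

```python
from collections import Counter

class Node:
    def __init__(self, default, feature=None):
        self.default = default  # most frequent class in T at this node
        self.feature = feature  # index of the feature tested here; None for a leaf
        self.arcs = {}          # feature value -> subtree

def build_ig_tree(cases, features):
    """cases: list of (feature_tuple, class) pairs.
    features: feature indices, ordered by decreasing information gain."""
    classes = Counter(c for _, c in cases)
    default = classes.most_common(1)[0][0]
    # Steps 1 and 2: leaf if T is unambiguous or no features remain.
    if len(classes) == 1 or not features:
        return Node(default)
    # Step 3: test the most relevant remaining feature ...
    f, rest = features[0], features[1:]
    node = Node(default, feature=f)
    # ... partition T by the values occurring for that feature ...
    partition = {}
    for pattern, c in cases:
        partition.setdefault(pattern[f], []).append((pattern, c))
    # ... and recurse, labelling each arc with its value.
    for value, subset in partition.items():
        node.arcs[value] = build_ig_tree(subset, rest)
    return node
```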
<Paragraph position="3"> Instead of converting the case base to a tree in which all cases are fully represented as paths, storing all feature values, we compress the tree even more by restricting the paths to those input feature values that disambiguate the classification from all other cases in the training material. The idea is that it is not necessary to fully store a case as a path when only a few feature values of the case make its classification unique. This implies that feature values that do not contribute to the disambiguation of the case classification (i.e., the values of the features with lower relevance than the lowest-relevance disambiguating feature) are not stored in the tree. In our tagging application, this means that only context feature values that actually contribute to disambiguation are used in the construction of the tree.</Paragraph> <Paragraph position="4"> Leaf nodes contain the unique class label corresponding to a path in the tree. Non-terminal nodes contain information about the most probable or default classification given the path thus far, according to the bookkeeping information on class occurrences maintained by the tree construction algorithm. This extra information is essential when using the tree for classification. Finding the classification of a new case involves traversing the tree (i.e., matching all feature values of the test case with arcs in the order of the overall feature information gain), and either retrieving a classification when a leaf is reached, or using the default classification of the last matching non-terminal node if a feature-value match fails.</Paragraph> <Paragraph position="5"> A final compression is obtained by pruning the derived tree. All leaf-node daughters of a mother node that have the same class as that node are removed from the tree, as their class information does not contradict the default class information already present at the mother node. Again, this compression does not affect IGTree's generalisation performance.</Paragraph> <Paragraph position="6"> The recursive algorithms for tree construction (except the final pruning) and retrieval are given in Figures 1 and 2. For a detailed discussion, see Daelemans et al. (1996).</Paragraph> <Paragraph position="7"> Figure 2: Procedure SEARCH-IG-TREE. Input: a (sub)tree node N and the feature values f_i...f_n of the test case. Output: a class label.
1. If N is a leaf node, output the default class c associated with this node.
2. Otherwise, if test F_i of the current node does not originate an arc labeled with f_i, output the default class c associated with N.
3. Otherwise:
* Let the new node M be the end node of the arc originating from N with label f_i.
* SEARCH-IG-TREE(M, f_{i+1}...f_n).</Paragraph> </Section>
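A matching retrieval sketch for SEARCH-IG-TREE, reusing the Node class from the construction sketch above (again illustrative, not the authors' code):

```python
def search_ig_tree(node, pattern):
    # Traverse the tree along the test case's feature values; fall
    # back on the last matching node's default class when no arc is
    # labelled with the current value (step 2 above).
    while node.feature is not None:
        value = pattern[node.feature]
        if value not in node.arcs:
            return node.default
        node = node.arcs[value]
    return node.default  # leaf reached: output its class label

# e.g.: tag = search_ig_tree(tree, ("VBD", "NN-VB", "DT"))
```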
<Section position="2" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 3.2 IGTree Complexity </SectionTitle> <Paragraph position="0"> The asymptotic complexity of IGTree (i.e., in the worst case) is extremely favorable.</Paragraph> <Paragraph position="1"> The complexity of searching a query pattern in the tree is proportional to F * log(V), where F is the number of features (equal to the maximal depth of the tree), and V is the average number of values per feature (i.e., the average branching factor in the tree). In IB1, search complexity is O(N * F) (with N the number of stored cases). Retrieval by search in the tree is thus independent of the number of training cases, which makes it especially useful for large case bases. Storage requirements are proportional to N (compare O(N * F) for IB1). Finally, the cost of building the tree on the basis of a set of cases is proportional to N * log(V) * F in the worst case (compare O(N) for training in IB1).</Paragraph> <Paragraph position="2"> In practice, for our part-of-speech tagging experiments, IGTree retrieval is 100 to 200 times faster than normal memory-based retrieval, and uses over 95% less memory.</Paragraph> </Section> </Section> <Section position="6" start_page="17" end_page="61" type="metho"> <SectionTitle> 4 Architecture of the Tagger </SectionTitle> <Paragraph position="0"> The architecture takes the form of a tagger generator: given a corpus tagged with the desired tag set, a POS tagger is generated which maps the words of new text to tags in this tag set according to the same systematicity. The construction of a POS tagger for a specific corpus is achieved in the following way. Given an annotated corpus, three data structures are automatically extracted: a lexicon, a case base for known words (words occurring in the lexicon), and a case base for unknown words. Case bases are indexed using IGTree. During tagging, each word in the text to be tagged is looked up in the lexicon. If it is found, its lexical representation is retrieved and its context is determined, and the resulting pattern is looked up in the known words case base. When a word is not found in the lexicon, its lexical representation is computed on the basis of its form, its context is determined, and the resulting pattern is looked up in the unknown words case base. In both cases, the output is a best guess of the category for the word in its current context. In the remainder of this section, we will describe each step in more detail. We start from a training set of tagged sentences T.</Paragraph> <Section position="1" start_page="18" end_page="18" type="sub_section"> <SectionTitle> 4.1 Lexicon Construction </SectionTitle> <Paragraph position="0"> A lexicon is extracted from T by computing for each word in T the number of times it occurs with each category. E.g., when using the first 2 million words of the Wall Street Journal corpus as T, the word once would get the lexical definition RB: 330; IN: 77, i.e. once was tagged 330 times as an adverb, and 77 times as a preposition/subordinating conjunction. Using these lexical definitions, a new, possibly ambiguous, tag is produced for each word type. E.g., once would get a new tag representing the category of words which can be both adverbs and prepositions/conjunctions (RB-IN). Frequency order is taken into account in this process: words which, like once, can be RB or IN, but more frequently IN than RB (e.g., the word below), are assigned a different tag (IN-RB). The original tag set, consisting of 44 morphosyntactic tags, was expanded this way to 419 (possibly ambiguous) tags. In the WSJ example, the resulting lexicon contains 57,962 word types, 7,464 (13%) of which are ambiguous. On the same training set, 76% of word tokens are ambiguous.</Paragraph>
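A sketch of this lexicon construction step, under the assumption that the corpus is available as (word, tag) pairs; the once example from the text illustrates the frequency-ordered ambiguity-class tags.

```python
from collections import Counter, defaultdict

def build_lexicon(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs.
    Returns word -> ambiguity-class tag, ordered by frequency."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    lexicon = {}
    for word, tag_counts in counts.items():
        # Frequency order matters: RB-IN (mostly RB) and IN-RB
        # (mostly IN) are distinct tags.
        lexicon[word] = "-".join(t for t, _ in tag_counts.most_common())
    return lexicon

# build_lexicon([("once", "RB")] * 330 + [("once", "IN")] * 77)
# -> {"once": "RB-IN"}
```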
<Paragraph position="1"> When tagging a new sentence, words are looked up in the lexicon. Depending on whether or not they can be found there, a case representation is constructed for them, and they are matched against either the known words case base or the unknown words case base.</Paragraph> </Section> <Section position="2" start_page="18" end_page="19" type="sub_section"> <SectionTitle> 4.2 Known Words </SectionTitle> <Paragraph position="0"> A windowing approach (Sejnowski & Rosenberg, 1987) was used to represent the tagging task as a classification problem. A case consists of information about a focus word to be tagged, its left and right context, and an associated category (tag) valid for the focus word in that context.</Paragraph> <Paragraph position="1"> There are several types of information which can be stored in the case base for each word, ranging from the words themselves to intricate lexical representations. In the preliminary experiments described in this paper, we limited this information to the possibly ambiguous tags of words (retrieved from the lexicon) for the focus word and its context to the right, and the disambiguated tags of words for the left context (as the result of earlier tagging decisions). Table 1 is a sample of the case base for the first sentence of the corpus (Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29) when using this case representation. The final column shows the target category: the disambiguated tag for the focus word. We will refer to this case representation as ddfat (d for disambiguated, f for focus, a for ambiguous, and t for target). The information gain values are given as well.</Paragraph> <Paragraph position="2"> A search among a selection of different context sizes suggested ddfat as a suitable case representation for tagging known words. An interesting property of memory-based learning is that case representations can easily be extended with different sources of information if available (e.g. feedback from a parser in which the tagger operates, semantic types, the words themselves, lexical representations of words obtained from a different source than the corpus, etc.). The information gain feature relevance ordering technique achieves a delicate relevance weighting of different information sources when they are fused in a single case representation. The window size used by the algorithm will also dynamically change depending on the information present in the context for the disambiguation of a particular focus symbol (see Schütze et al., 1994, and Pereira et al., 1995). In addition, a frequency threshold is applied: a pattern is stored in the case base only if it occurs a minimal number of times with that category. This way, noise in the training material is filtered out. The value for this parameter will have to be adapted for other training sets, and was chosen here to maximise generalization accuracy (accuracy on tagging unseen text).</Paragraph> </Section>
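A sketch of how ddfat cases might be extracted from a tagged training sentence; the "=" boundary filler and the exact handling of sentence edges are assumptions of this sketch.

```python
def ddfat_cases(sentence, lexicon):
    """sentence: list of (word, correct_tag) pairs.
    Builds (d, d, f, a) patterns with the correct tag as target (t):
    two disambiguated left tags, the ambiguous tag of the focus word,
    and one ambiguous right tag."""
    words, gold = zip(*sentence)
    amb = [lexicon.get(w, "?") for w in words]  # ambiguous lexicon tags
    cases = []
    for i in range(len(words)):
        d1 = gold[i - 2] if i >= 2 else "="  # "=" pads the boundaries
        d2 = gold[i - 1] if i >= 1 else "="
        a = amb[i + 1] if i + 1 < len(words) else "="
        # In training the left context holds gold-standard tags; at
        # tagging time it holds the tagger's own previous decisions.
        cases.append(((d1, d2, amb[i], a), gold[i]))
    return cases
```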
<Section position="3" start_page="19" end_page="61" type="sub_section"> <SectionTitle> 4.3 Unknown Words </SectionTitle> <Paragraph position="0"> If a word is not present in the lexicon, its ambiguous category cannot be retrieved. In that case, a category can be guessed only on the basis of the form or the context of the word.</Paragraph> <Paragraph position="1"> Again, we take advantage of the data fusion capabilities of a memory-based approach by combining these two sources of information in the case representation, and having the information gain feature relevance weighting technique figure out their relative relevance (see Schmid, 1994, and Samuelsson, 1994, for similar solutions).</Paragraph> <Paragraph position="2"> In most taggers, some form of morphological analysis is performed on unknown words, in an attempt to relate the unknown word to a known combination of known morphemes, thereby allowing its association with one or more possible categories. After determining this ambiguous category, the word is disambiguated using context knowledge, in the same way as known words. Morphological analysis presupposes the availability of highly language-specific resources such as a morpheme lexicon, spelling rules, morphological rules, and heuristics to prioritise possible analyses of a word according to their plausibility. This is a serious knowledge engineering bottleneck when the goal is to develop a language- and annotation-independent tagger generator.</Paragraph> <Paragraph position="3"> In our memory-based approach, we provide morphological information (especially about suffixes) indirectly to the tagger by encoding the last three letters of the word as separate features in the case representation. The first letter is encoded as well, because it contains information about prefixation and capitalization of the word. Context information is added to the case representation in a similar way as for known words. It turned out that, in combination with the 'morphological' features, a context of one disambiguated tag of the word to the left of the unknown word and one ambiguous category of the word to the right gives good results. We will call this case representation pdassst: one prefix letter (p), one disambiguated left context word (d), one ambiguous right context word (a), three suffix letters (sss), and the target category (t). As the chance of an unknown word being a function word is small, and cases representing function words may interfere with correct classification of open-class words, only open-class words are used during construction of the unknown words case base.</Paragraph> <Paragraph position="4"> Table 2 shows part of the case base for unknown words.</Paragraph> </Section>
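A sketch of the form-plus-context pattern for unknown words; padding short words with "=" is an assumption of this sketch, and the function name is illustrative.

```python
def pdasss_pattern(word, left_tag, right_amb):
    """One prefix letter (p), the disambiguated left tag (d), the
    ambiguous right tag (a), and the last three letters (sss)."""
    padded = word.rjust(3, "=")  # pad short words on the left
    return (word[0],     # p: first letter (prefix / capitalization)
            left_tag,    # d: disambiguated tag of the word to the left
            right_amb,   # a: ambiguous tag of the word to the right
            padded[-3],  # s: third-last letter
            padded[-2],  # s: second-last letter
            padded[-1])  # s: last letter
```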
<Section position="4" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 4.4 Control </SectionTitle> <Paragraph position="0"> Figure 3 shows the architecture of the tagger-generator: a tagger is produced by extracting a lexicon and two case bases from the tagged example corpus. During tagging, the control is as follows: words are looked up in the lexicon and separated into known and unknown words, and are then matched against the known words case base and the unknown words case base, respectively. In both cases, context is used; in the case of unknown words, the first and last three letters of the word are used instead of the ambiguous tag for the focus word. Insofar as disambiguated tags for left context words are used, these are of course not obtained by retrieval from the lexicon (which provides ambiguous categories), but from the previous decisions of the tagger.</Paragraph> </Section> <Section position="5" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 4.5 IGTrees for Tagging </SectionTitle> <Paragraph position="0"> As explained earlier, both case bases are implemented as IGTrees. For the known words case base, paths in the tree represent variable-size context widths. The first feature (the expansion of the root node of the tree) is the focus word; context features are then added as further expansions of the tree until the context disambiguates the focus word completely, at which point further expansion is halted. In some cases, short context sizes (corresponding to bigrams, e.g.) are sufficient to disambiguate a focus word; in other cases, more context is needed. IGTrees thus provide an elegant way of automatically determining the optimal context size. In the unknown words case base, the trie representation provides an automatic integration of information about the form and the context of a focus word not encountered before. In general, the top levels of the tree represent the morphological information (the three suffix letter features and the prefix letter), while the deeper levels contribute contextual disambiguation. The overall control flow is sketched below.</Paragraph>
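A sketch of that control flow, tying together the illustrative helpers from the previous sections (build_lexicon, pdasss_pattern, search_ig_tree); the treatment of boundaries and of unknown right-context words is an assumption of this sketch.

```python
def tag_sentence(words, lexicon, known_tree, unknown_tree):
    tags = []
    for i, word in enumerate(words):
        # Left context: the tagger's own previous decisions.
        d1 = tags[-2] if len(tags) >= 2 else "="
        d2 = tags[-1] if tags else "="
        # Right context: the ambiguous lexicon tag of the next word.
        nxt = words[i + 1] if i + 1 < len(words) else None
        a = lexicon.get(nxt, "?") if nxt is not None else "="
        if word in lexicon:
            # Known word: ddfat-style case, classified by the
            # known words IGTree.
            case = (d1, d2, lexicon[word], a)
            tags.append(search_ig_tree(known_tree, case))
        else:
            # Unknown word: form (first and last three letters) plus
            # context, classified by the unknown words IGTree.
            case = pdasss_pattern(word, d2, a)
            tags.append(search_ig_tree(unknown_tree, case))
    return tags
```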
</Section> </Section> <Section position="7" start_page="61" end_page="61" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> In this section, we report first results on our memory-based tagging approach. In a first set of experiments, we compared our IGTree implementation of memory-based learning to more traditional implementations of the approach. In further experiments we studied the performance of our system in predicting the category of both known and unknown words.</Paragraph> <Section position="1" start_page="61" end_page="61" type="sub_section"> <SectionTitle> Experimental Set-up </SectionTitle> <Paragraph position="0"> The experimental methodology was taken from Machine Learning practice (e.g. Weiss & Kulikowski, 1991): independent training and test sets were selected from the original corpus, the system was trained on the training set, and the generalization accuracy (percentage of correct category assignments) was computed on the independent test set.</Paragraph> <Paragraph position="1"> Storage and time requirements were computed as well. Where possible, we used a 10-fold cross-validation approach. In this experimental method, a data set is partitioned ten times into 90% training material and 10% testing material. Average accuracy over the ten experiments provides a reliable estimate of the generalization accuracy.</Paragraph> </Section> <Section position="2" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 5.1 Experiment 1: Comparison of Algorithms </SectionTitle> <Paragraph position="0"> Our goal is to adhere to the concept of memory-based learning with full memory while at the same time keeping memory requirements and processing speed within attractive bounds. To this end, we applied the IGTree formalism to the task. In order to show that IGTree is a suitable candidate for practical memory-based tagging, we compared three memory-based learning algorithms: (i) IB1, a slight extension (to cope with symbolic values and ambiguous training items) of the well-known k-nn algorithm in statistical pattern recognition (see Aha et al., 1991); (ii) IB1-IG, an extension of IB1 which uses feature relevance weighting (described in Section 2); and (iii) IGTree, a memory- and processing-time-saving heuristic implementation of IB1-IG (see Section 3). Table 3 lists the results in generalization accuracy, storage requirements and speed for the three algorithms, using a ddfat pattern, a 100,000 word training set, and a 10,000 word test set. In this experiment, accuracy was measured on known words only.</Paragraph> <Paragraph position="1"> The IGTree version turns out to be better or equally good in terms of generalization accuracy, but is also more than 100 times faster when tagging new words, and compresses the original case base to 4% of its size. (In training, i.e. building the case base, IB1 and IB1-IG (4 seconds) are faster than IGTree (26 seconds), because the latter has to build a tree instead of just storing the patterns.) This experiment shows that, for this problem, we can use IGTree as a time- and memory-saving approximation of memory-based learning (the IB1-IG version) without loss in generalization accuracy. The speed and memory advantage of IGTree grows with larger training sets.</Paragraph> </Section> <Section position="3" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 5.2 Experiment 2: Learning Curve </SectionTitle> <Paragraph position="0"> A ten-fold cross-validation experiment on the first two million words of the WSJ corpus shows an average generalization performance of IGTree (on known words only) of 96.3%.</Paragraph> <Paragraph position="1"> We did 10-fold cross-validation experiments for several sizes of datasets (in steps of 100,000 memory items), revealing the learning curve in Figure 4. Training set size is on the X-axis; generalization performance as measured in a 10-fold cross-validation experiment is on the Y-axis. The 'error' ranges indicate averages plus and minus one standard deviation for each 10-fold cross-validation experiment (see footnote 5). Already at small data set sizes, performance is relatively high. With increasingly larger data sets, the performance becomes more stable (witness the error ranges). It should be noted that in this experiment, we assumed correctly disambiguated tags in the left context. In practice, when using our tagger, this is of course not the case: the disambiguated tags in the left context of the current word to be tagged are the result of previous decisions of the tagger, which may be mistaken. To test the influence of this effect we performed a third experiment.</Paragraph> </Section> <Section position="4" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 5.3 Experiment 3: Overall Accuracy </SectionTitle> <Paragraph position="0"> We performed the complete tagger generation process on a 2-million-word training set (lexicon construction and known and unknown words case base construction), and tested on 200,000 test words. Performance on known words, unknown words, and overall is given in Table 4.
In this experiment, numbers were not stored in the known words case base; they are looked up in the unknown words case base.</Paragraph> <Paragraph position="1"> (Footnote 5: We are not convinced that variation in the results of the experiments in a 10-fold CV set-up is statistically meaningful (the 10 experiments are not independent), but we follow common practice here.)</Paragraph> </Section> </Section> <Section position="8" start_page="61" end_page="61" type="metho"> <SectionTitle> 6 Related Research </SectionTitle> <Paragraph position="0"> A case-based approach, similar to our memory-based approach, was also proposed by Cardie (1993a, 1994) for sentence analysis in limited domains (not only POS tagging but also semantic tagging and structural disambiguation). We will discuss only the reported POS tagging results here. Using a fairly complex case representation based on output from the CIRCUS conceptual sentence analyzer (22 local context features describing syntactic and semantic information about a five-word window centered on the word to be tagged, including the words themselves, and 11 global context features providing information about the major constituents parsed already), and with a tag set of 18 tags (7 open-class, 11 closed-class), she reports a 95% tagging accuracy. A decision-tree learning approach to feature selection is used in this experiment (Cardie, 1993b, 1994) to discard irrelevant features. Results are based on experiments with 120 randomly chosen sentences from the TIPSTER JV corpus (representing 2056 cases). Cardie (p.c.) reports 89.1% correct tagging for unknown words. Unknown words made up 20.6% of the test words, and overall tagging accuracy (known and unknown) was 95%. Notice that her algorithm gives no initial preference to training cases that match the test word during its initial case retrieval. On the other hand, after retrieving the top k cases, the algorithm does prefer those cases that match the test word when making its final predictions. It is therefore understandable that the algorithm does better on words it has seen during training than on unknown words.</Paragraph> <Paragraph position="1"> In our memory-based approach, feature weighting (rather than feature selection) for determining the relevance of features is integrated more smoothly with the similarity metric, and our results are based on experiments with a larger corpus (3 million cases).</Paragraph> <Paragraph position="2"> Our case representation is (at this point) simpler: only the (ambiguous) tags are used, not the words themselves or any other information. The most important improvement is the use of IGTree to index and search the case base, solving the computational complexity problems a case-based approach would run into when using large case bases.</Paragraph> <Paragraph position="3"> An approach based on k-nn methods (such as memory-based and case-based methods) is a statistical approach, but it uses a different kind of statistics than Markov model-based approaches. K-nn is a non-parametric technique; it assumes no fixed type of distribution of the data.
The most important advantages compared to current stochastic approaches are that (i) few training items (a small tagged corpus) are needed for relatively good performance, (ii) the approach is incremental: adding new cases does not require any recomputation of probabilities, (iii) it provides explanation capabilities, and (iv) it requires no additional smoothing techniques to avoid zero probabilities; the IGTree takes care of that.</Paragraph> <Paragraph position="4"> Compared to hand-crafted rule-based approaches, our approach provides a solution to the knowledge-acquisition and reusability bottlenecks, and to robustness and coverage problems (similar advantages motivated Markov model-based statistical approaches).</Paragraph> <Paragraph position="5"> Compared to learning rule-based approaches such as the one by Brill (1992), a k-nn approach provides a uniform approach for all disambiguation tasks, more flexibility in the engineering of case representations, and a more elegant approach to the handling of unknown words (see e.g. Cardie 1994).</Paragraph> </Section> </Paper>