<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2414"> <Title>Memory-based semantic role labeling: Optimizing features, algorithm, and output</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Approach </SectionTitle> <Paragraph position="0"> In this section we describe our approach to semantic role labeling. The core part of our system is a memory-based learner. During the development of the system we have used feature selection and parameter optimization by iterative deepening. Additionally we have evaluated three extensions of the basic memory-based learning method: class n-grams, i.e. complex classes composed of sequences of simple classes, iterative classifier stacking and automatic output post-processing.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Memory-based learning </SectionTitle> <Paragraph position="0"> Memory-based learning is a supervised inductive algorithm for learning classification tasks based on the k-nn algorithm (Cover and Hart, 1967; Aha et al., 1991) with various extensions for dealing with nominal features and feature relevance weighting. Memory-based learning stores feature representations of training instances in memory without abstraction and classifies new (test) instances by matching their feature representation to all instances in memory, finding the most similar instances.</Paragraph> <Paragraph position="1"> From these &quot;nearest neighbors&quot;, the class of the test item is extrapolated. See Daelemans et al. (2003) for a detailed description of the algorithms and metrics used in our experiments. All memory-based learning experiments were done with the TiMBL software package1.</Paragraph> <Paragraph position="2"> In previous research, we have found that memory-based learning is rather sensitive to the chosen features and the particular setting of its algorithmic parameters (e.g. the number of nearest neighbors taken into account, the function for extrapolation from the nearest neighbors, the feature relevance weighting method used, etc.). In order to minimize the effects of this sensitivity, we have put much effort in trying to find the best set of features and the optimal learner parameters for this particular task.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Feature selection </SectionTitle> <Paragraph position="0"> We have employed bi-directional hill-climbing (Caruana and Freitag, 1994) for finding the features that were most suited for this task. This wrapper approach starts with the empty set of features and evaluates the learner for every individual feature on the development set. The feature associated with the best performance is selected and the process is repeated for every pair of features that includes the best feature. For every next best set of features, the system evaluates each set that contains one extra feature or has one feature less. This process is repeated until the local search does not lead to a performance gain.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Parameter optimization </SectionTitle> <Paragraph position="0"> We used iterative deepening (ID) as a heuristic way of searching for optimal algorithm parameters. 
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Parameter optimization </SectionTitle>
<Paragraph position="0"> We used iterative deepening (ID) as a heuristic way of searching for optimal algorithm parameters (a sketch follows Section 3.4). This technique combines classifier wrapping (using the training material internally to test experimental variants) (Kohavi and John, 1997) with progressive sampling of training material (Provost et al., 1999). We start with a large pool of experiments, each with a unique combination of algorithmic parameter settings. Each combination of settings is applied to a small amount of training material and tested on a small held-out set, also taken from the training set.</Paragraph>
<Paragraph position="1"> Only the best settings are kept; the others are removed from the pool of competing settings. In subsequent iterations, this step is repeated, retaining the best-performing settings, with an exponentially growing amount of training and held-out data, until all training data have been used or a single best setting remains. Selecting the best settings at each step is based on classification accuracy on the held-out data; a simple one-dimensional clustering of the ranked list of accuracies determines which group of settings is selected for the next iteration.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Class n-grams </SectionTitle>
<Paragraph position="0"> As an alternative to predicting simple classes, sequential tasks can be rephrased as mappings from input examples to sequences of classes. Instead of predicting just A1 in the example given earlier, it is possible to predict a trigram of classes. The second example in the training data used earlier is now labeled with the trigram A1 A1 A1, indicating that the chunk in focus has an A1 relation with the verb, along with its left and right neighbor chunks (which are all part of the same A1 argument).</Paragraph>
<Paragraph position="1"> This approach has two potential advantages. First, the classifier is forced to predict 'legal' sequences of classes; this addresses a problem with simple classifiers, which are blind to their previous or subsequent classifications in a sequence and can therefore produce impossible sequences such as A1 A0 A1. Second, when the classifier predicts trigrams example by example, it produces a sequence of overlapping trigrams which may contain information that can boost classification accuracy. Effectively, each class is predicted three times, so that simple majority voting can be applied: we take the middle prediction as the actual classification of the example, unless the two other votes together suggest another class label.</Paragraph> </Section>
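As a concrete illustration of the voting scheme just described, the sketch below recovers one class per chunk from the overlapping trigram predictions. The (left, focus, right) tuple encoding of a trigram is an assumption of this sketch.

```python
def vote_from_trigrams(trigrams):
    """`trigrams[i]` is the (left, focus, right) class trigram predicted for
    chunk i. Keep the middle (focus) prediction unless the two other votes
    on the same chunk agree on a different label."""
    labels = []
    for i, (_, focus, _) in enumerate(trigrams):
        prev_vote = trigrams[i - 1][2] if i > 0 else None
        next_vote = trigrams[i + 1][0] if i + 1 < len(trigrams) else None
        if prev_vote is not None and prev_vote == next_vote != focus:
            labels.append(prev_vote)  # the two context votes outvote the focus
        else:
            labels.append(focus)
    return labels
```

For three consecutive chunks each predicted as ("A1", "A1", "A1"), as in the example above, the function assigns A1 to all three chunks.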
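Returning to Section 3.3, the iterative deepening search over parameter settings might look as follows. The `run` callback, the initial sample sizes, and the fixed top-quarter cut (standing in for the paper's one-dimensional clustering of the ranked accuracies) are all assumptions of this sketch.

```python
import random

def iterative_deepening(settings_pool, data, run, train_size=500, dev_size=100):
    """`settings_pool` holds hashable parameter combinations (e.g. tuples);
    `run(settings, train, held_out)` is assumed to train one experimental
    variant and return its accuracy on the held-out set."""
    pool = list(settings_pool)
    while len(pool) > 1 and train_size + dev_size <= len(data):
        sample = random.sample(data, train_size + dev_size)
        train, held_out = sample[:train_size], sample[train_size:]
        scores = {s: run(s, train, held_out) for s in pool}
        ranked = sorted(pool, key=scores.get, reverse=True)
        pool = ranked[:max(1, len(ranked) // 4)]  # keep only the best settings
        train_size *= 2                           # exponentially growing
        dev_size *= 2                             # training and held-out data
    return pool[0]  # best surviving settings combination
```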
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Iterative classifier stacking </SectionTitle>
<Paragraph position="0"> Stacking (Wolpert, 1992) refers to a class of meta-learning systems that learn to correct errors made by lower-level classifiers. We implement stacking by adding a windowed sequence of previous and subsequent output class labels to the original input features. To generate the training material, we copy these windowed (unigram) class labels into the input, excluding the focus class label (which would be a perfect predictor of the output class). To generate the test material, the output of the first-stage classifier trained on the original data is used.</Paragraph>
<Paragraph position="1"> Stacking can be repeated: an nth-stage classifier can be built on the output of the (n-1)th-stage classifier. We implemented this by replacing the class features in the input of each nth-stage classifier with the output of the previous classifier (see the first sketch at the end of this section).</Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.6 Automatic output post-processing </SectionTitle>
<Paragraph position="0"> Even when employing n-gram output classes and classifier stacking, we noticed that our learner made systematic errors caused by the lack of broader (sentential) contextual information in the instances and the classes.</Paragraph>
<Paragraph position="1"> The most obvious of these errors was producing multiple instances of the arguments A0-A5 in one sentence. Although sentences with multiple A0-A3 arguments appear in the training data, they are quite rare (0.17%). When the learner assigns an A0 role to three different arguments in a sentence, most likely at least two of these are wrong.</Paragraph>
<Paragraph position="2"> To reflect this fact, we have restricted the system to outputting at most one phrase of each type A0-A5. If the learner predicts multiple arguments of the same type, only the one closest to the main verb is kept (see the second sketch at the end of this section).</Paragraph>
<Paragraph position="3"> The features used in the different runs are listed in Table 1. The numbers given for words, part-of-speech tags, chunk tags, named entity tags and output classes show the position of the tokens with respect to the focus token (0). Distances are measured in chunks, NP chunks, VP chunks and words. In all other table entries, + denotes selection and - omission.</Paragraph>
<Paragraph position="4"> A separate table lists the parameter settings used in the different runs mentioned in Table 1. More information about the parameters and their values can be found in Daelemans et al. (2003).</Paragraph> </Section> </Section> </Paper>
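Two final sketches illustrate the remaining components. First, the stacking input of Section 3.5: each instance is extended with a window of class labels around, but excluding, the focus position. The window width and the "_" padding symbol are illustrative choices.

```python
def add_label_window(base_instances, labels, width=3):
    """Append to each instance the `width` previous and `width` subsequent
    class labels, skipping the focus label itself (during training it would
    be a perfect predictor of the output). `labels` are the gold classes
    when generating training material and the first-stage classifier's
    predictions when generating test material."""
    extended = []
    for i, feats in enumerate(base_instances):
        window = [labels[j] if 0 <= j < len(labels) else "_"
                  for j in range(i - width, i + width + 1) if j != i]
        extended.append(list(feats) + window)
    return extended
```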
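Second, the output constraint of Section 3.6: at most one phrase per core label A0-A5, keeping the candidate closest to the main verb. The (position, label) encoding and the "O" label for discarded phrases are assumptions of this sketch.

```python
CORE = {"A0", "A1", "A2", "A3", "A4", "A5"}

def restrict_core_arguments(phrases, verb_pos):
    """`phrases` is a list of (position, label) pairs for one sentence. For
    each core label, keep only the phrase closest to the main verb; any
    other phrase with that label reverts to the no-role label 'O'."""
    closest = {}
    for pos, label in phrases:
        if label in CORE and (label not in closest or
                              abs(pos - verb_pos) < abs(closest[label] - verb_pos)):
            closest[label] = pos
    return [(pos, label if label not in CORE or closest[label] == pos else "O")
            for pos, label in phrases]
```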