<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0637"> <Title>Applying spelling error correction techniques for improving semantic role labelling</Title> <Section position="4" start_page="229" end_page="230" type="intro"> <SectionTitle> 3 Approach </SectionTitle> <Paragraph position="0"> This section gives a brief overview of the three main components of our approach: machine learning, automatic feature selection, and post-processing by a novel procedure designed to clean up the classifier output by correcting obvious misclassifications.</Paragraph> <Section position="1" start_page="229" end_page="229" type="sub_section"> <SectionTitle> 3.1 Machine learning </SectionTitle> <Paragraph position="0"> The core machine learning technique employed is memory-based learning, a supervised inductive algorithm for learning classification tasks based on the k-nearest neighbour (k-NN) algorithm. We use the TiMBL system (Daelemans et al., 2003), version 5.0.0, patch-2, with uniform feature weighting and random tie-breaking (options: -w 0 -R 911). We have also evaluated two alternative learning techniques. First, Maximum Entropy Models, for which we employed Zhang Le's Maximum Entropy Toolkit, version 20041229, with default parameters. Second, Support Vector Machines, for which we used Taku Kudo's YamCha (Kudo and Matsumoto, 2003), with one-versus-all voting and the -V option, which enabled us to ignore predicted classes with negative distances.</Paragraph> </Section> <Section position="2" start_page="229" end_page="229" type="sub_section"> <SectionTitle> 3.2 Feature selection </SectionTitle> <Paragraph position="0"> In previous research, we have found that memory-based learning is rather sensitive to the chosen features. In particular, irrelevant or redundant features may lead to reduced performance. In order to minimise the effects of this sensitivity, we have employed bi-directional hill-climbing (Caruana and Freitag, 1994) to find the features best suited to this task.
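The search can be sketched as follows (a minimal illustration, not the actual implementation; the scoring function, which in our setting would run a TiMBL experiment on held-out data, is a hypothetical stand-in):

```python
def bidirectional_hillclimb(all_features, evaluate):
    """Greedy bi-directional hill-climbing over feature subsets.

    Starts from the empty set; each iteration scores every subset
    reachable by adding or removing one feature, and commits to the
    best one as long as it improves on the current score.
    `evaluate` maps a frozenset of features to a score (higher is better).
    """
    current = frozenset()
    best_score = evaluate(current)
    while True:
        # All one-step neighbours: one feature added or one removed.
        neighbours = ([current | {f} for f in all_features - current] +
                      [current - {f} for f in current])
        if not neighbours:
            break
        best_cand = max(neighbours, key=evaluate)
        cand_score = evaluate(best_cand)
        if cand_score <= best_score:
            break  # no single-feature change improves the score
        current, best_score = best_cand, cand_score
    return current, best_score
```

With a toy scorer that rewards two relevant features and penalises noise features, the search converges on exactly the relevant pair.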
This process starts with an empty feature set, examines the effect of adding or removing each single feature, and then starts a new iteration with the set associated with the best performance.</Paragraph> </Section> <Section position="3" start_page="229" end_page="230" type="sub_section"> <SectionTitle> 3.3 Automatic post-processing </SectionTitle> <Paragraph position="0"> Certain misclassifications by the semantic role labelling system described so far lead to unlikely or impossible relation assignments, such as assigning two indirect objects to a verb where only one is possible. Our proposed classifier has no mechanism to detect these errors. One solution is to devise a post-processing step that transforms the resulting role assignments until they meet certain basic constraints, such as the rule that each verb may be assigned only a single instance of each role in one sentence (Van den Bosch et al., 2004).</Paragraph> <Paragraph position="1"> We propose an alternative, automatically trained post-processing method which corrects unlikely role assignments either by deleting them or by replacing them with a more likely one. We do not do this by knowledge-based constraint satisfaction, but rather by adopting a method for error correction based on Levenshtein distance (Levenshtein, 1965), or edit distance, as commonly used in spelling error correction. The Levenshtein distance between two strings is the minimal number of deletions, insertions, and substitutions needed to transform one string into the other, and can be computed efficiently with dynamic programming. Levenshtein-based error correction typically matches a new, possibly incorrect, string against a trusted lexicon of presumably correct strings, finds the lexicon string with the smallest Levenshtein distance to the new string, and replaces the new string with that lexicon string as its likely correction. We implemented a roughly similar procedure.
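The core of such a procedure can be sketched as follows (a minimal illustration under our own naming, treating role labelling patterns as token sequences; the toy patterns in the usage example are hypothetical):

```python
def levenshtein(a, b):
    """Minimal edit distance between two token sequences
    (unit-cost deletions, insertions, and substitutions),
    computed with the standard dynamic-programming table."""
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        dist[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

def nearest_pattern(predicted, lexicon):
    """Return the lexicon pattern at the smallest edit distance
    from the predicted role labelling pattern."""
    return min(lexicon, key=lambda pattern: levenshtein(predicted, pattern))
```

For instance, `nearest_pattern(["A0", "V", "A1", "A0"], [["A0", "V", "A1"], ["A1", "V", "A2"]])` returns `["A0", "V", "A1"]`, which lies at distance 1 (one deletion) from the prediction.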
First, we generated a lexicon of semantic role labelling patterns of A0-A5 arguments of verbs on the basis of the entire training corpus and the PropBank verb frames. This lexicon contains entries such as abandon A0 V A1 and categorize A1 V A2 - a total of 43,033 variable-length role labelling patterns.</Paragraph> <Paragraph position="2"> Next, given a new test sentence, we consider all of its verbs and their respective predicted role labellings, and compare each with the lexicon, searching for the role labelling pattern with the same verb at the smallest Levenshtein distance (for an unknown verb, we search the entire lexicon). For example, suppose that in a test sentence the pattern emphasize A0 V A1 A0 is predicted. The closest lexicon entry is found at Levenshtein distance 1, namely emphasize A0 V A1, representing a deletion of the final A0. We then apply all deletions and substitutions needed to correct the predicted pattern according to this nearest-neighbour pattern from the trusted lexicon. We do not apply insertions, since the post-processor module does not have the information to decide which constituent or word would receive the inserted label. When multiple deletions are possible (e.g. deleting one of the two A1s in emphasize A0 V A1 A1), we always delete the argument furthest from the verb.</Paragraph> </Section> </Section> </Paper>