<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-2002">
  <Title>Improving Accuracy in Word Class Tagging through the Combination of Machine Learning Systems</Title>
  <Section position="2" start_page="0" end_page="200" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> In all natural language processing (NLP) systems, we find one or more language models that are used to predict, classify, or interpret language-related observations.</Paragraph>
    <Paragraph position="1"> Because most real-world NLP tasks require something that approaches full language understanding in order to be perfect, but automatic systems only have access to limited (and often superficial) information, as well as limited resources for reasoning with that information, such language models tend to make errors when the system is tested on new material. The engineering task in NLP is to design systems that make as few errors as possible with as little effort as possible. Common ways to reduce the error rate are to devise better representations of the problem, to spend more time on encoding language knowledge (in the case of hand-crafted systems), or to find more training data (in the case of data-driven systems). However, given limited resources, these options are not always available.</Paragraph>
    <Paragraph position="2"> Rather than devising a new representation for our task, in this paper, we combine different systems employing known representations. The observation that suggests this approach is that systems that are designed differently, either because they use a different formalism or because they contain different knowledge, will typically produce different errors. We hope to make use of this fact and reduce the number of errors with  very little additional effort by exploiting the disagreement between different language models. Although the approach is applicable to any type of language model, we focus on the case of statistical disambiguators that are trained on annotated corpora. The examples of the task that are present in the corpus and its annotation are fed into a learning algorithm, which induces a model of the desired input-output mapping in the form of a classifier. We use a number of different learning algorithms simultaneously on the same training corpus. Each type of learning method brings its own &amp;quot;inductive bias&amp;quot; to the task and will produce a classifier with slightly different characteristics, so that different methods will tend to produce different errors.</Paragraph>
    <Paragraph position="3"> We investigate two ways of exploiting these differences. First, we make use of the gang effect. Simply by using more than one classifier, and voting between their outputs, we expect to eliminate the quirks, and hence errors, that are due to the bias of one particular learner. However, there is also a way to make better use of the differences: we can create an arbiter effect. We can train a second-level classifier to select its output on the basis of the patterns of co-occurrence of the outputs of the various classifiers. In this way, we not only counter the bias of each component, but actually exploit it in the identification of the correct output. This method even admits the possibility of correcting collective errors. The hypothesis is that both types of approaches can yield a more accurate model from the same training data than the most accurate component of the combination, and that given enough training data the arbiter type of method will be able to outperform the gang type. 1 In the machine learning literature there has been much interest recently in the theoretical aspects of classifier combination, both of the gang effect type and of the arbiter type (see Section 2). In general, it has been shown that, when the errors are uncorrelated to a sufficient degree, the resulting combined classifier will often perform better than any of the individual systems. In this paper we wish to take a more empirical approach and examine whether these methods result in substantial accuracy improvements in a situation typical for statistical NLP, namely, learning morphosyntactic word class tagging (also known as part-of-speech or POS tagging) from an annotated corpus of several hundred thousand words.</Paragraph>
    <Paragraph position="4"> Morphosyntactic word class tagging entails the classification (tagging) of each token of a natural language text in terms of an element of a finite palette (tagset) of word class descriptors (tags). The reasons for this choice of task are several. First of all, tagging is a widely researched and well-understood task (see van Halteren \[1999\]). Second, current performance levels on this task still leave room for improvement: &amp;quot;state-of-the-art&amp;quot; performance for data-driven automatic word class taggers on the usual type of material (e.g., tagging English text with single tags from a low-detail tagset) is at 96-97% correctly tagged words, but accuracy levels for specific classes of ambiguous words are much lower. Finally, a number of rather different methods that automatically generate a fully functional tagging system from annotated text are available off-the-shelf. First experiments (van Halteren, Zavrel, and Daelemans 1998; Brill and Wu 1998) demonstrated the basic validity of the approach for tagging, with the error rate of the best combiner being 19.1% lower than that of the best individual tagger (van Halteren, Zavrel, and Daelemans 1998). However, these experiments were restricted to a single language, a single tagset and, more importantly, a limited amount of training data for the combiners. This led us to perform further, more extensive, 1 In previous work (van Halteren, Zavrel, and Daelemans 1998), we were unable to confirm the latter half of the hypothesis unequivocally. As we judged this to be due to insufficient training data for proper training of the second-level classifiers, we greatly increase the amount of training data in the present work through the use of cross-validation.</Paragraph>
    <Paragraph position="5">  van Halteren, Zavrel, and Daelemans Combination of Machine Learning Systems tagging experiments before moving on to other tasks. Since then the method has also been applied to other NLP tasks with good results (see Section 6).</Paragraph>
    <Paragraph position="6"> In the remaining sections, we first introduce classifier combination on the basis of previous work in the machine learning literature and present the combination methods we use in our experiments (Section 2). Then we explain our experimental setup (Section 3), also describing the corpora (3.1) and tagger generators (3.2) used in the experiments. In Section 4, we go on to report the overall results of the experiments, starting with a comparison between the component taggers (and hence between the underlying tagger generators) and continuing with a comparison of the combination methods. The results are examined in more detail in Section 5, where we discuss such aspects as accuracy on specific words or tags, the influence of inconsistent training data, training set size, the contribution of individual component taggers, and tagset granularity. In Section 6, we discuss the results in the light of related work, after which we conclude (Section 7) with a summary of the most important observations and interesting directions for future research.</Paragraph>
  </Section>
class="xml-element"></Paper>