<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1058"> <Title>Kowloon</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The investigation we describe here arose from a very commonly discussed experience, apparently triggered by the recent popularity of shared task evaluations that have opened opportunities for researchers to informally compare their experiences &quot;with a common denominator&quot;, so to speak.</Paragraph> <Paragraph position="1"> Among the perennial observations which are made during the analysis of the results is that (1) methods designed to &quot;fine-tune&quot; the high-accuracy base classifiers behave unpredictably, their success or failure often appearing far more sensitive to where the test set was drawn from, rather than on any true quality of the &quot;fine-tuning&quot;, and consequently, (2) the resulting system rankings are often unpredictable, especially as they are typically conducted only on a single new test set, often drawn from a single arbitrary new source of a significantly different nature than the training sets. One could argue that such evaluations do not constitute a fair test, but in fact, this is sity for supporting this research in part through research grants A-PE37 and 4-Z03S.</Paragraph> <Paragraph position="2"> where computational linguistics modeling diverges from machine learning theory, since for any serious NLP application, such evaluations constitute a much more accurate representation of the real world.</Paragraph> <Paragraph position="3"> We believe one primary reason for this common experience is that the models involved are typically already operating well beyond the limits of accuracy of the models' assumptions about the nature of distributions from which testing samples will be drawn. For this reason, even &quot;sophisticated&quot; discriminative training criteria, such as maximum entropy, minimum error rate, and minimum Bayes risk, are susceptible to these stability problems. There has been much theoretical work done on error correction, but in practice, any error correction usually lowers the performance of the combined system on unseen data, rather than improving it. Unfortunately, most existing theory simply does not apply.</Paragraph> <Paragraph position="4"> This is especially true if the base model has been highly tuned. For the majority of tasks, the performance of the trained models, after much fine tuning, tend to plateau out at around the same point, regardless of the theoretical basis of the underlying model. This holds true with most highly accurate classifiers, including maximum entropy classifiers, SVMs, and boosting models.</Paragraph> <Paragraph position="5"> In addition, even though data analysis gives us some general idea as to what kinds of feature conjunctions might help, the classifiers are not able to incorporate those into their model (usually because the computational cost would be infeasible), and any further post-processing tends to degrade accuracy on unseen data. 
<Paragraph position="4"> This is especially true if the base model has been highly tuned. For the majority of tasks, the performance of the trained models, after much fine-tuning, tends to plateau at around the same point, regardless of the theoretical basis of the underlying model. This holds true for most highly accurate classifiers, including maximum entropy classifiers, SVMs, and boosting models.</Paragraph>
<Paragraph position="5"> In addition, even though data analysis gives us some general idea as to what kinds of feature conjunctions might help, the classifiers are not able to incorporate them into their models (usually because the computational cost would be prohibitive), and any further post-processing tends to degrade accuracy on unseen data. The common practice for further improving accuracy at this point is to resort to ad hoc classifier combination methods, which are usually not theoretically well justified and, again, unpredictably improve or degrade performance, thus consuming vast amounts of experimental resources with relatively low expected payoff, much like a lottery.</Paragraph>
<Paragraph position="6"> There are a variety of reasons for this, ranging from the aforementioned validity of the assumptions about the relationship between the training and test distributions, to the absence of a well-justified stopping point for error correction. The latter problem is much more serious than it seems at first blush, since without a well-justified stopping criterion, the performance of the combined model will depend much more on the distribution of the test set than on any feature engineering. Empirical evidence for this argument can be seen in the results of the CoNLL shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), where the ranking of the participating systems changes with the test corpora.</Paragraph>
<Paragraph position="7"> Inspired by the repeated observations of this phenomenon by many participants, we decided to stop &quot;sweeping the issue under the rug&quot; and undertook to confront it head-on. Accordingly, we challenged ourselves to design an error corrector satisfying the following criteria, which few if any existing models actually meet: (1) it would leverage existing base models while targeting their errors; (2) it would consistently improve accuracy, even on top of base models that already deliver high accuracy; (3) it would be robust and conservative, so as to almost never accidentally degrade accuracy; (4) it would be broadly applicable to any classification or recognition task, especially high-dimensional ones such as named-entity recognition and word-sense disambiguation; and (5) it would be template-driven and easily customizable, enabling it to target error patterns beyond the representational and computational complexity limitations of the base models.</Paragraph>
<Paragraph position="8"> Our goal in this undertaking was to invent as little as possible. We expected to make use of relatively sophisticated error-minimization techniques. Thus the results were surprising: the simplest models kept outperforming the &quot;sophisticated&quot; models. This paper investigates some of the key reasons why.</Paragraph>
<Paragraph position="9"> To avoid reinventing the wheel, we originally considered adapting an existing error-driven method, transformation-based learning (TBL), for this purpose.</Paragraph>
<Paragraph position="10"> TBL seems well suited to the problem, as it is inherently an error corrector and, on its own, has been shown to achieve high accuracy on a variety of problems (see Section 4). Our original goal was to adapt TBL for error correction of high-performing models (Wu et al., 2004a), with two main principles: (1) since it is not clear that the usual assumptions made about the distribution of the training/test data are valid in such extreme operating ranges, empirical observations would take precedence over theoretical models, which implies that (2) any model would have to be empirically justified by testing on a diverse range of data.</Paragraph>
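<Paragraph> For readers unfamiliar with TBL, the following is a minimal, purely illustrative sketch (not the implementation used here) of the greedy TBL training loop: each candidate rule rewrites one predicted label to another wherever its condition holds, training repeatedly adds the rule with the largest net error reduction on the training data, and learning stops once that reduction falls below a threshold. The rule and feature representations are invented assumptions.

    def tbl_train(features, gold, predicted, candidate_rules, min_gain=2):
        """features: one dict per token; gold/predicted: label sequences;
        candidate_rules: (condition_fn, from_label, to_label) triples."""
        learned = []
        while True:
            best_rule, best_gain = None, 0
            for cond, y_from, y_to in candidate_rules:
                gain = 0
                for f, y_gold, y_pred in zip(features, gold, predicted):
                    if y_pred == y_from and cond(f):
                        if y_to == y_gold and y_pred != y_gold:
                            gain += 1      # rule would fix an error
                        elif y_pred == y_gold and y_to != y_gold:
                            gain -= 1      # rule would introduce an error
                if gain > best_gain:
                    best_rule, best_gain = (cond, y_from, y_to), gain
            if best_rule is None or min_gain > best_gain:
                break                      # no rule clears the threshold: stop
            cond, y_from, y_to = best_rule
            predicted = [y_to if y == y_from and cond(f) else y
                         for f, y in zip(features, predicted)]
            learned.append(best_rule)
        return learned

Because each rule is accepted on its net training-set gain alone, the low-gain rules admitted just above the threshold are the ones least likely to transfer to a test set drawn from a different source, which is one way to see the stopping-criterion problem raised above.</Paragraph>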
<Paragraph position="11"> Experimental observations, however, increasingly drove us toward different goals. Our resulting error corrector, NTPC, was instead constructed on the principle of making as few assumptions as possible, in order to generalize robustly over diverse situations and problems. One observation made in the course of experimentation, after many attempts at fine-tuning model parameters, was that many of the complex theoretical models for error correction do not perform consistently. This is perhaps not too surprising upon further reflection, since the principle of Occam's Razor prefers simpler hypotheses over more complex ones.</Paragraph>
<Paragraph position="12"> NTPC was introduced in (Wu et al., 2004b), where the controversial issues it raised generated a number of interesting questions, many of which were directed at NTPC's seeming simplicity, which appears to be at odds with the theory behind many other error-correcting models.</Paragraph>
<Paragraph position="13"> In this paper, we investigate the most commonly asked questions. We illuminate these questions by contrasting NTPC against the more powerful TBL, presenting experiments that show that NTPC's simple model is indeed key to its robustness and reliability.</Paragraph>
<Paragraph position="14"> The rest of the paper is laid out as follows: Section 2 presents an introduction to NTPC, including an overview of its architecture. Section 3 addresses key questions related to NTPC's architecture and presents empirical results justifying its simplicity.</Paragraph>
</Section>
</Paper>