<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1058">
  <Title>Kowloon</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 N-fold Templated Piped Correction
</SectionTitle>
    <Paragraph position="0"> N-fold Templated Piped Correction, or NTPC, is a model designed to robustly improve the accuracy of existing base models across a diverse range of operating conditions. As described above, the most challenging situation for any error corrector arises when the base model has been finely tuned and its performance has reached a plateau. Beyond that point, further feature engineering or error correction will usually hurt performance rather than improve it.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The architecture of NTPC is surprisingly simple
</SectionTitle>
      <Paragraph position="0"> One of the most surprising things about NTPC is that, despite its simplicity, it outperforms mathematically much more "sophisticated" methods at error correcting. Architecturally, it relies on a simple rule-learning mechanism and cross-partitioning of the training data to learn very conservative, cautious rules that make only a few corrections at a time.</Paragraph>
      <Paragraph position="1"> Figure 1 illustrates the NTPC architecture. Prior to learning, NTPC is given (1) a set of rule templates which describe the types of rules that it is allowed to hypothesize, (2) a single base learning model, and (3) an annotated training set.</Paragraph>
      <Paragraph position="2"> The NTPC architecture is essentially a sequentially chained piped ensemble that incorporates cross-validation style n-fold partition sets generated from the base model. The training set is partitioned n times in order to train n base models. Subsequently, the n held-out validation sets are classified by their respective trained base models, and the results are combined into a "reconstituted" training set. The reconstituted training set is used by the Error Corrector Learner, which learns a set of rules.</Paragraph>
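The cross-partitioning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `reconstitute`, the `train_base` callback, and the (example, reference label, base prediction) triple layout are assumptions for exposition.

```python
from typing import Callable, List, Tuple


def reconstitute(examples: List, labels: List, n: int,
                 train_base: Callable) -> List[Tuple[object, str, str]]:
    """Split the training set into n folds, train a base model on each
    fold's complement, label the held-out fold with that model, and
    concatenate the labeled folds into a "reconstituted" training set."""
    folds = [list(range(i, len(examples), n)) for i in range(n)]
    reconstituted = []
    for held_out in folds:
        train_idx = [i for i in range(len(examples)) if i not in held_out]
        model = train_base([examples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        for i in held_out:
            # triple: (example, reference label, base-model prediction)
            reconstituted.append((examples[i], labels[i], model(examples[i])))
    return reconstituted
```

Every training example thus receives a prediction from a base model that never saw it during training, which is what lets the error corrector learn from realistic base-model mistakes.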
      <Paragraph position="3"> Rule hypotheses are generated according to the given set of allowable templates:</Paragraph>
      <Paragraph position="4"> R = { r ∈ H : ε(r) = 0 and t(r) ≥ t_min }</Paragraph>
      <Paragraph position="5"> where X is the sequence of training examples x_i, Y is the sequence of reference labels y_i for each example respectively, Ŷ is the sequence of labels ŷ_i as predicted by the base model for each example respectively, H is the hypothesis space of valid rules implied by the templates, and t_min is a confidence threshold. Setting t_min to a relatively high value (say 15) implements the requirement of high reliability. R is subsequently sorted by the t_i value of each rule r_i into an ordered list of rules R* = (r*_0, ..., r*_{i-1}).</Paragraph>
      <Paragraph position="6"> During the evaluation phase, depicted in the lower portion of Figure 1, the test set is first labeled by the base model. The error corrector's rules r*_i are then applied in the order of R* to the evaluation set. The final classification of a sample is the classification attained after all the rules have been applied.</Paragraph>
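The evaluation pipeline of this paragraph can be sketched as below; the rule interface (returning None when a rule does not fire on a label) is an assumption carried over for illustration, not the paper's exact data structures.

```python
def apply_corrections(rules, base_predictions):
    """Evaluation phase: the test set is already labeled by the base
    model; apply the learned rules in their fixed order; the label
    standing after the last rule is the final classification."""
    labels = list(base_predictions)
    for rule in rules:  # rules arrive pre-sorted by their t value
        updated = []
        for y in labels:
            out = rule(y)
            updated.append(y if out is None else out)
        labels = updated
    return labels
```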
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 NTPC consistently and robustly improves accuracy of highly-accurate base models
</SectionTitle>
      <Paragraph position="0"> In previous work (Wu et al., 2004b), we presented experiments on named-entity identification and classification across four diverse languages, using AdaBoost.MH as the base learner, which showed that NTPC was capable of robustly and consistently improving upon the accuracy of the already highly accurate boosting model, correcting the errors committed by the base model without introducing any of its own.</Paragraph>
      <Paragraph position="1"> Table 1 compares results obtained with the base AdaBoost.MH model (Schapire and Singer, 2000) and the NTPC-enhanced model for a total of eight different named-entity recognition (NER) models. These experiments were performed on the CoNLL-2002 and CoNLL-2003 shared task data sets. The AdaBoost.MH base models clearly achieve high accuracy already, setting the bar very high for NTPC to improve upon. Nevertheless, NTPC yields further F-measure gains on every combination of task and language, including English NE bracketing (Model M2), for which the base F-measure is the highest.</Paragraph>
      <Paragraph position="2"> An examination of the rules (shown in the Appendix) can give an idea as to why NTPC manages to identify and correct errors which were overlooked by the highly tuned base model. NTPC's advantage comes from two aspects: (1) its ability to handle complex conjunctions of features, which often reflect structured, linguistically motivated expectations, in the form of rule templates; and (2) its ability to "look forward" at classifications from the right context, even when processing the sentence in a left-to-right direction. The base classifier is unable to incorporate these two aspects, because (1) including complex conjunctions of features would raise the computational cost of searching the feature space to the point of infeasibility, and (2) most classifiers process a sentence from left to right, deciding on the class label for each word before moving on to the next one. Rules that exploit these advantages are easily picked out in the table: many of the rules (especially those in the top 5 for both English and Spanish) consist of complex conjunctions of features, and rules that consider the right-context classifications can be identified by the string "ne &lt;num&gt;", where &lt;num&gt; is a positive integer indicating how many words to the right.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> The most commonly raised issues about NTPC concern the differences between NTPC and TBL (though the conceptual issues are much the same as for other error-minimization criteria, such as minimum error rate or minimum Bayes risk). This is expected, since it was one of our goals to reinvent as little as possible. As a result, NTPC does bear a superficial resemblance to TBL, both being error-driven learning methods that seek to incrementally correct errors in a corpus by learning rules determined by a set of templates.</Paragraph>
    <Paragraph position="1"> One of the most frequently asked questions is whether the Error Corrector Learner portion of NTPC could be replaced by a transformation-based learner. This section will investigate the differences between NTPC and TBL, and show the necessity of the changes that were incorporated into NTPC.</Paragraph>
    <Paragraph position="2"> The experiments in this section were performed on the data sets used in the CoNLL-2002 and CoNLL-2003 Named Entity Recognition shared tasks. The high-performing base model is based on AdaBoost.MH (Schapire and Singer, 2000), the multi-class generalization of the original boosting algorithm, which implements boosting on top of decision stump classifiers (decision trees of depth one).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Any Error is Bad
</SectionTitle>
      <Paragraph position="0"> The first main difference between NTPC and TBL, and also what seems to be an extreme design decision on the part of NTPC, is the objective scoring function. To be maximally certain of not introducing any new errors with its rules, the first requirement that NTPC's objective function places on any candidate rule is that it must not introduce any new errors (ε(r) = 0). This is called the zero error tolerance principle.</Paragraph>
      <Paragraph position="1"> To those accustomed to learners such as transformation-based learning and decision lists, which allow some degree of error tolerance, this design principle seems overly harsh and inflexible. Indeed, almost all models carry an implicit assumption that the scoring function should be based on the difference between the positive and negative applications, rather than on an absolute number of corrections or mistakes.</Paragraph>
      <Paragraph position="2"> Results for eight experiments are shown in Figures 2 and 3. Each experiment compares NTPC against variants that allow relaxed ε(r) ≤ ε_max conditions for various ε_max ∈ {1, 2, 3, 4, ∞}. The worst curve in each case is for ε_max = ∞, in other words, the system that considers only net performance improvement, as TBL and many other rule-based models do. The results confirm empirically that the ε(r) = 0 condition (1) gives the most consistent results, and (2) generally yields accuracies among the highest, regardless of how long training is allowed to continue. In other words, the presence of any negative application during the training phase will cause the error corrector to behave unpredictably, and the more complex model of greater error tolerance is unnecessary in practice.</Paragraph>
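The difference between NTPC's zero error tolerance and the relaxed variants reduces to a small acceptance test on each candidate rule. The following sketch makes the contrast explicit; the function name and argument layout are illustrative, not from the paper.

```python
def accept_rule(t, epsilon, t_min, eps_max=0):
    """Acceptance test for a candidate rule.

    t       -- number of base-model errors the rule corrects
    epsilon -- number of new errors the rule introduces

    NTPC uses eps_max=0 (zero error tolerance). The relaxed variants
    allow eps_max in {1, 2, 3, 4}; eps_max=inf reduces to the net
    improvement criterion used by TBL and similar learners."""
    if eps_max == float('inf'):
        return t - epsilon >= t_min  # net-improvement criterion
    return epsilon <= eps_max and t >= t_min
```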
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Rule Interaction is Unreliable
</SectionTitle>
      <Paragraph position="0"> Another key difference between NTPC and TBL is the process of rule interaction. Since TBL allows a rule to use the current classification of a sample and its neighbours as features, and a rule updates the current state of the corpus when it applies to a sample, the application of one rule can change whether another rule applies. From the point of view of a sample, its classification can depend on the classification of "nearby" samples, typically those found in the immediately preceding or succeeding words of the same sentence. This rule interaction is permitted in both training and testing.</Paragraph>
      <Paragraph position="1"> [Figure caption: less fluctuation and generally higher accuracy than the relaxed-tolerance variations in bracketing experiments (bold = NTPC, dashed = relaxed tolerance).] NTPC, however, does not allow this kind of rule interaction. Rule applications only update the output classification of a sample, and do not update the current state of the corpus. In other words, the feature values for a sample are initialized once, at the beginning of the program, and are not changed thereafter.</Paragraph>
      <Paragraph position="2"> The rationale for this decision is the hypothesis that rule interaction is inherently unreliable: the high-accuracy base model provides sparse opportunities for rule application, and thus far sparser opportunities for rule interaction, making any rule that relies on rule interaction suspect.</Paragraph>
      <Paragraph position="3"> In fact, by considering only rules that make no mistake during the learning phase, NTPC's zero error tolerance already eliminates any correction of labels that results from rule interaction, since a label correction on a sample that results from the application of more than one rule necessarily implies that at least one of the rules made a mistake.</Paragraph>
      <Paragraph position="4"> [Figure caption: results on the bracketing + classification task show that allowing TBL-style rule interaction does not yield reliable improvement over NTPC (bold = NTPC, dashed = rule interaction).] Since TBL is a widely used error-correcting method, it is natural to speculate that NTPC's omission of rule interaction is a weakness. To test this question, we implemented an iterative variation of NTPC that allows rule interaction, where each iteration targets the residual error from the previous iterations, as follows:
1. i ← 0, X_0 ← X
2. r*_i ← null, t*_i ← 0
3. foreach r ∈ H such that ε_i(r) = 0:
   if t_i(r) &gt; t*_i then r*_i ← r, t*_i ← t_i(r)
4. if t*_i &lt; t_min then return
5. X_{i+1} ← result of applying r*_i to X_i
6. i ← i + 1
7. goto Step 3</Paragraph>
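The seven steps above can be rendered in Python as follows. This is a sketch under simplifying assumptions: a rule is taken to be a function that rewrites a sample's label, `score(r, Xi, Y)` is assumed to return the pair (t, ε) for rule r on the current corpus state, and t_min is assumed positive so a rule is always selected before step 5.

```python
def iterative_ntpc(hypotheses, X, Y, score, t_min=15):
    """Iterative (rule-interaction) variant of NTPC: each pass re-scores
    all zero-error rules against the current corpus state, applies the
    best one, and repeats until no rule clears the t_min threshold."""
    Xi = list(X)                          # step 1: i <- 0, X_0 <- X
    learned = []
    while True:
        best_rule, best_t = None, 0       # step 2: r*_i <- null, t*_i <- 0
        for r in hypotheses:              # step 3: search zero-error rules
            t, epsilon = score(r, Xi, Y)
            if epsilon == 0 and t > best_t:
                best_rule, best_t = r, t
        if best_t < t_min:                # step 4: stop below the threshold
            return learned
        Xi = [best_rule(x) for x in Xi]   # step 5: X_{i+1} <- r*_i(X_i)
        learned.append(best_rule)         # steps 6-7: i <- i + 1, repeat
```

Because step 5 rewrites the corpus before the next scoring pass, later rules see the labels produced by earlier ones, which is exactly the rule interaction NTPC itself forbids.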
      <Paragraph position="6"> Here, incremental rule interaction is a natural consequence of arranging the structure of the algorithm to observe the right context features coming from the base model, as with transformation-based learning. In Step 5 of the algorithm, the current state of the corpus is updated with the latest rule on each iteration. That is, in each given iteration of the outer loop, the learner considers the corrected training data obtained by applying rules learned in the previous iterations, so the learner has access to the labels that result from applying the previous rules. Since these rules may apply anywhere in the corpus, the learner is not restricted to using only labels from the left context.</Paragraph>
      <Paragraph position="7"> The time complexity of this variation is an order of magnitude greater than that of NTPC, due to the nested loops needed to allow rule interaction. The ordered list of output rules r*_0, ..., r*_{i-1} is learned in a greedy fashion, to progressively improve the performance of the learning algorithm on the training set.</Paragraph>
      <Paragraph position="8"> Results for eight experiments on this variation, shown in Figures 4 and 5, demonstrate that this expensive extra capability is rarely useful in practice and does not reliably guarantee that accuracy will not degrade. This is yet another illustration of the principle that, at least in high-accuracy error correction problems, simpler modes of operation should be preferred over more complex arrangements.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 NTPC vs. N-fold TBL
</SectionTitle>
      <Paragraph position="0"> Another frequently raised question about NTPC is whether ordinary TBL, which is, after all, intrinsically an error-correcting model, can be used in place of NTPC to perform better error correction. Figure 6 shows the results of four sets of experiments evaluating this approach on top of boosting. As might be expected by extrapolation from the foregoing experiments, which investigated their individual differences, NTPC outperforms the more complex TBL in all cases, regardless of how long training is allowed to continue.</Paragraph>
      <Paragraph position="1"> Another valid question is whether the way NTPC combines the results of the n-fold partitioning is oversimplistic and could be improved upon. As stated previously, the training corpus for the error corrector in NTPC is the "reconstituted training set" generated by combining the held-out validation sets after they have been labeled with initial classifications by their respective trained base models. To investigate whether NTPC could benefit from a more complex model, we employed voting, a commonly used technique in machine learning and natural language processing. As before, the training set was partitioned and multiple base learners were trained and evaluated on the multiple training and validation sets, respectively. However, instead of recombining the validation sets into a reconstituted training set, multiple error corrector models were trained on the n partition sets. During the evaluation phase, all n error correctors were evaluated on the evaluation set after it had been labeled by the base model, and they voted on the final output.</Paragraph>
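The voting combination just described can be sketched as below. The interface is an assumption for illustration: each corrector is taken to be a function from a base-labeled sample to a corrected label, and ties are broken by whichever label was seen first.

```python
from collections import Counter


def vote_correct(correctors, base_labeled):
    """Voting variant: each of the n error correctors relabels the
    base-labeled evaluation set independently; the majority label
    wins for each sample."""
    outputs = [[c(sample) for sample in base_labeled] for c in correctors]
    final = []
    for labels in zip(*outputs):  # per-sample column of n votes
        final.append(Counter(labels).most_common(1)[0][0])
    return final
```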
      <Paragraph position="2"> Table 2 shows the results of using such an approach for the bracketing + classification task on English. The empirical results clearly show that the more complex and time-consuming voting model not only fails to outperform NTPC, but in fact degrades performance below that of the base boosting-only model.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Experiment Summary
</SectionTitle>
      <Paragraph position="0"> In our experiments, we set out to investigate whether NTPC's operating parameters were overly simple, and whether more complex arrangements were necessary or desirable. The empirical evidence indicates that, at least for this problem of error correction in high accuracy ranges, simple mechanisms suffice to produce good results; in fact, the more complex operations end up degrading rather than improving accuracy.</Paragraph>
      <Paragraph position="1"> A valid question is why methods such as decision list learning (Rivest, 1987) and transformation-based learning benefit from these more complex mechanisms. Though structurally similar to NTPC, these models operate in a very different environment, where many initially poorly labeled examples are available to drive rule learning. Hence, it can be advantageous to trade off some corrections against some mistakes, provided that there is an overall positive change in accuracy. In an error-correcting situation, however, most of the samples are already correctly labeled, errors are few and far between, and the sparse data problem is exacerbated. In addition, the idea of error correction implies that we should, at the very least, do no worse than the original algorithm; hence it makes sense to err on the side of caution and minimize any errors created, rather than hoping that a later rule application will undo mistakes made by an earlier one.</Paragraph>
      <Paragraph position="2"> Finally, note that the same point applies to many other models where training criteria like minimum error rate are used, since such criteria are functions of the trade-off between correctly and incorrectly labeled examples, without zero error tolerance to compensate for the sparse data problem.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Previous Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Boosting and NER
</SectionTitle>
      <Paragraph position="0"> Boosting (Freund and Schapire, 1997) has been successfully applied to several NLP problems, where it is typically used as the final stage in a learned system. For example, Schapire and Singer (2000) applied it to text categorization, while Escudero et al. (2000) used it to obtain good results on word sense disambiguation. More directly relevant to the experiments described herein, two of the three best-performing teams in the CoNLL-2002 Named Entity Recognition shared task evaluation used boosting as their base system (Carreras et al., 2002; Wu et al., 2002).</Paragraph>
      <Paragraph position="1"> However, precedents for improving performance after boosting are few. At the CoNLL-2002 shared task session, Tjong Kim Sang (unpublished) described an experiment using voting to combine the NER outputs from the shared task participants which, predictably, produced better results than the individual systems. A couple of the individual systems were boosting models, so in some sense this could be regarded as an example.</Paragraph>
      <Paragraph position="2"> Tsukamoto et al. (2002) used piped AdaBoost.MH models for NER. Their experimental results were somewhat disappointing, but this could perhaps be attributed to various factors, including the feature engineering or the absence of cross-validation sampling in the stacking.</Paragraph>
      <Paragraph position="3"> The AdaBoost.MH base model's high accuracy sets a high bar for error correction. Aside from brute-force en masse voting of the sort described above for CoNLL-2002, we do not know of any existing post-boosting models that improve rather than degrade accuracy. We aim to further improve performance, and propose using a piped error corrector.
Appendix. The following examples show the top 10 rules learned for English and Spanish on the bracketing + classification task (one rule per line):
wcaptype 0=noneed-firstupper wcaptype -1=noneed-firstupper wcaptype 1=alllower captypeLex 0=not-inLex captypeGaz 0=not-inGaz ne 0=O =&gt; ne=I-ORG
ne 0=O word 0=efe =&gt; ne=I-ORG
ne -1=O ne 0=O word 1=Num word 2=. captypeLex 0=not-inLex captypeGaz 0=not-inGaz wcaptype 0=allupper =&gt; ne=I-MISC
pos -1=ART pos 0=NCF wcaptype 0=noneed-firstupper ne -1=O ne 0=O =&gt; ne=I-ORG
wcaptype 0=alllower ne 0=I-PER ne 1=O ne 2=O =&gt; ne=O
ne 0=O ne 1=I-MISC word 2=Num captypeLex 0=not-inLex captypeGaz 0=not-inGaz wcaptype 0=allupper =&gt; ne=I-MISC
ne 0=I-LOC word:[-3,-1]=universidad =&gt; ne=I-ORG
ne 1=O ne 2=O word 0=de captypeLex 0=not-inLex captypeGaz 0=inGaz wcaptype 0=alllower =&gt; ne=O</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Transformation-based Learning
</SectionTitle>
      <Paragraph position="0"> Transformation-based learning (Brill, 1995), or TBL, is one of the most successful rule-based machine learning algorithms. The central idea of TBL is to learn an ordered list of rules, each of which evaluates on the results of those preceding it. An initial assignment is made based on simple statistics, and then rules are greedily learned to correct the mistakes, until no net improvement can be made.</Paragraph>
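For contrast with NTPC's zero-error selection, the greedy TBL loop described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: a rule is taken to be a function from a current label to a new label, and `rules`, `corpus`, and `gold` are hypothetical names, not Brill's implementation.

```python
def tbl(rules, corpus, gold, min_gain=1):
    """Greedy TBL loop: repeatedly pick the rule with the best *net*
    improvement (corrections minus new errors), apply it to the corpus,
    and stop when no rule yields a net gain of at least min_gain."""
    learned = []
    while True:
        best, best_gain = None, 0
        for r in rules:
            new = [r(x) for x in corpus]
            gain = (sum(n == y for n, y in zip(new, gold))
                    - sum(c == y for c, y in zip(corpus, gold)))
            if gain > best_gain:
                best, best_gain = r, gain
        if best is None or best_gain < min_gain:
            return learned
        corpus = [best(x) for x in corpus]  # later rules see these updates
        learned.append(best)
```

The net-gain criterion is precisely what NTPC replaces with its ε(r) = 0 requirement: TBL will accept a rule that makes a few new mistakes so long as its corrections outnumber them.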
      <Paragraph position="1"> Transformation-based learning has been used to tackle a wide range of NLP problems, ranging from part-of-speech tagging (Brill, 1995) to parsing (Brill, 1996) to segmentation and message understanding (Day et al., 1997). In general, it achieves state-of-the-art performances and is fairly resistant to overtraining.</Paragraph>
    </Section>
  </Section>
</Paper>