<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0611"> <Title>Improving sequence segmentation learning by predicting trigrams</Title> <Section position="5" start_page="82" end_page="83" type="metho"> <SectionTitle> 3 Predicting class trigrams </SectionTitle> <Paragraph position="0"> There is no intrinsic bound to what is packed into a class label associated with a windowed example.</Paragraph> <Paragraph position="1"> For example, complex class labels can span trigrams of singular class labels. A classifier that learns to produce trigrams of class labels will at least produce syntactically valid trigrams from the training material, which might partly solve some near-sightedness problems of the single-class classifier. Although simple and appealing, the lurking disadvantage of the trigram idea is that the number of class labels increases explosively when moving from single class labels to wider trigrams.</Paragraph> <Paragraph position="3"> [Figure 2 (caption, beginning truncated): ...symbols. Sequences of input symbols and output symbols are converted into windows of fixed-width input symbols, each associated with, in this example, trigrams of output symbols.]</Paragraph> <Paragraph position="4"> The CHUNK data, for example, has 22 classes (&quot;IOB&quot; codes associated with chunk types); in the same training set, 846 different trigrams of these 22 classes and the start/end context symbol occur. The eight original classes of NER combine to 138 occurring trigrams.</Paragraph> <Paragraph position="5"> DISFL only has two classes, but 18 trigram classes.</Paragraph> <Paragraph position="6"> Figure 2 illustrates the procedure by which windows are created with, as an example, class trigrams. 
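As an illustrative sketch of this windowing procedure (our own code, not the authors'; the function name and the "_" padding symbol for start/end context are invented), each position receives a fixed-width window of input symbols paired with the trigram of output labels centred on it:

```python
def windows_with_trigram_labels(tokens, labels, width=3):
    """Turn a token/label sequence into fixed-width windows of input
    symbols, each paired with the trigram of output labels (left
    neighbor, focus, right neighbor) for the focus position."""
    pad = width // 2
    padded_tokens = ["_"] * pad + list(tokens) + ["_"] * pad
    # trigram labels always need exactly one start/end context symbol per side
    padded_labels = ["_"] + list(labels) + ["_"]
    instances = []
    for i in range(len(tokens)):
        window = padded_tokens[i:i + width]
        trigram = tuple(padded_labels[i:i + 3])
        instances.append((window, trigram))
    return instances
```

With a three-token sentence and IOB-style labels, the first instance pairs the window ["_", token1, token2] with the trigram ("_", label1, label2); the trigram classes thus include the start/end context symbol, as counted in the CHUNK statistics above.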
Each windowed instance maps to a class label that incorporates three atomic class labels, namely the focus class label that was the original unigram label, plus its immediate left and right neighboring class labels.</Paragraph> <Paragraph position="7"> While creating instances this way is trivial, it is not entirely trivial how the output of overlapping class trigrams recombines into a single sequence of unigram class labels. When the example illustrated in Figure 2 is followed, each single class label in the output sequence is effectively predicted three times; first, as the right label of a trigram, next as the middle label, and finally as the left label. Although it would be possible to avoid overlaps and classify only every third word, there is an interesting property of overlapping class label n-grams: it is possible to vote over them. To pursue our example of trigram classes, the following voting procedure can be followed to decide on the resulting unigram class label sequence: 1. When all three votes are unanimous, their common class label is returned; 2. When two out of three votes are for the same class label, this class label is returned; [Table 1 (caption, beginning truncated): ...F-score on the three test sets without and with class trigrams. Each third column displays the error reduction in F-score by the class-trigrams method over the other method. The best performances per task are printed in bold.]</Paragraph> <Paragraph position="8"> 3. 
When all three votes disagree (i.e., when majority voting ties), the class label in which the classifier is most confident is returned.</Paragraph> <Paragraph position="9"> Classifier confidence, needed for the third tie-breaking rule, can be heuristically estimated by taking the distance of the nearest neighbor in MBL, the estimated probability value of the most likely class produced by the MAXENT classifier, or the activation level of the most active unit of the WINNOW network.</Paragraph> <Paragraph position="10"> Clearly this scheme is only one of many possible schemes, using variants of voting as well as variants of n (and having multiple classifiers with different n, so that some back-off procedure could be followed).</Paragraph> <Paragraph position="11"> For now we use this procedure with trigrams as an example. To measure its effect we apply it to the sequence tasks CHUNK, NER, and DISFL. The results of this experiment, where in each case WPS was used to find optimal algorithmic parameters of all three algorithms, are listed in Table 1. We find rather positive effects of the trigram method both with MBL and MAXENT; we observe relative error reductions in F-score ranging from 10% up to a remarkable 51%, the latter attained by MAXENT on the NER task. With WINNOW, we observe decreases in performance on CHUNK and DISFL, and a minor error reduction of 4% on NER.</Paragraph> </Section> <Section position="6" start_page="83" end_page="84" type="metho"> <SectionTitle> 4 The feedback-loop method versus class trigrams </SectionTitle> <Paragraph position="0"> An alternative method for providing a classifier access to its previous decisions is a feedback-loop approach, which extends the windowing approach by feeding previous decisions of the classifier as features into the current input of the classifier. 
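Before turning to the feedback loop, the three-rule voting scheme of Section 3 can be sketched as follows (our own illustration, not the authors' code; the representation of a vote as a (label, confidence) pair is an assumption):

```python
from collections import Counter

def vote(votes):
    """Resolve one output position from its three overlapping trigram
    predictions. votes: three (label, confidence) pairs, one each from
    the trigram in which this position was the right, middle, and left
    element."""
    counts = Counter(label for label, _ in votes)
    label, freq = counts.most_common(1)[0]
    if freq >= 2:
        # rules 1 and 2: unanimous or two-out-of-three majority
        return label
    # rule 3: three-way tie; back off to the most confident prediction
    return max(votes, key=lambda v: v[1])[0]
```

In practice the first and last positions of a sequence receive fewer than three votes, since fewer trigrams overlap them; the same majority-then-confidence order still applies.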
[Table 2: Performances in terms of F-score of MBL on the three test sets, with and without a feedback loop; the error reduction attained by the feedback-loop method; the F-score of the trigram-class method; and the F-score of the combination of the two methods.]</Paragraph> <Paragraph position="1"> This approach was proposed in the context of memory-based learning for part-of-speech tagging as MBT (Daelemans et al., 1996). The number of decisions fed back into the input can be varied. In the experiments described here, the feedback loop iteratively updates a memory of the three most recent predictions. The feedback-loop approach can be combined with both single-class and class-trigram output. In the latter case, the full trigram class labels are copied to the input, retaining at any time the three most recently predicted labels in the input. Table 2 shows the results for both options on the three chunking tasks. The feedback-loop method outperforms the trigram-class method on CHUNK, but not on the other two tasks. It does consistently outperform the baseline single-class classifier. Interestingly, the combination of the two methods performs worse than the baseline classifier on CHUNK, and also performs worse than the trigram-class method on the other two tasks.</Paragraph> <Paragraph position="3"> [Figure 3 (caption, beginning truncated): ...classifier has produced a predicted output sequence. Sequences of input symbols, predicted output symbols, and real output symbols are converted into windows of fixed-width input symbols and predicted output symbols, each associated with one output symbol.]</Paragraph> </Section> <Section position="7" start_page="84" end_page="85" type="metho"> <SectionTitle> 5 Stacking versus class trigrams </SectionTitle> <Paragraph position="0"> Stacking, a term popularized by Wolpert (1992) in an artificial neural network context, refers to a class of meta-learning systems that learn to correct errors made by lower-level classifiers. 
We implement stacking by adding a windowed sequence of previous and subsequent output class labels to the original input features (here, we copy a window of seven predictions to the input, centered around the focus position), and providing these enriched examples as training material to a second-stage classifier. Figure 3 illustrates the procedure. Given the (possibly erroneous) output of a first classifier on an input sequence, a window of class symbols from that predicted sequence is copied to the input, to act as predictive features for the real class label.</Paragraph> <Paragraph position="1"> To generate the output of a first-stage classifier, two options are available. We name these options perfect and adaptive. They differ in the way they create training material for the second-stage classifier: Perfect - the training material is created straight from the training material of the first-stage classifier, by windowing over the real class sequences.</Paragraph> <Paragraph position="2"> In doing so, the class label of each window is excluded from the input window, since it is always the same as the class to be predicted. In training, this focus feature would receive an unrealistically high weight, especially considering that in testing this feature would contain errors. To assign a very high weight to a feature that may contain an erroneous value does not seem a good idea in view of the label bias problem.</Paragraph> <Paragraph position="3"> [Table 3: Performances in terms of F-score of MBL on the three test sets, without stacking, and with perfect and adaptive stacking.]</Paragraph> <Paragraph position="4"> Adaptive - the training material is created indirectly by running an internal 10-fold cross-validation experiment on the first-stage training set, concatenating the predicted output class labels on all of the ten test partitions, and converting this output to class windows. 
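The adaptive recipe just described can be sketched as follows (our own illustration; train_fn and predict_fn are hypothetical stand-ins for the first-stage learner, not the actual MBL implementation, and the round-robin fold split is an assumption):

```python
def adaptive_stacking_data(X, y, train_fn, predict_fn, folds=10, width=7):
    """Build second-stage training material: run an internal
    cross-validation over the first-stage training set, collect the
    (possibly erroneous) predicted label for every instance, and append
    a width-wide window of those predictions to each original feature
    vector."""
    n = len(X)
    predicted = [None] * n
    for f in range(folds):
        # round-robin split into one held-out fold and the rest
        test_idx = [i for i in range(n) if i % folds == f]
        train_idx = [i for i in range(n) if i % folds != f]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            predicted[i] = predict_fn(model, X[i])
    # window the predicted labels; "_" pads the sequence boundaries
    pad = width // 2
    padded = ["_"] * pad + predicted + ["_"] * pad
    X2 = [list(X[i]) + padded[i:i + width] for i in range(n)]
    return X2, y
```

Because the appended labels come from held-out predictions rather than the gold standard, the second-stage classifier is trained on the same kind of noisy input it will see at test time.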
In contrast with the perfect variant, we do include the focus class feature in the copied class label window. The adaptive approach can in principle learn from recurring classification errors in the input, and predict the correct class in case an error re-occurs.</Paragraph> <Paragraph position="5"> Table 3 lists the comparative results on the CHUNK, NER, and DISFL tasks introduced earlier.</Paragraph> <Paragraph position="6"> They show that both types of stacking improve performance on the three tasks, and that the adaptive stacking variant produces higher relative gains than the perfect variant; in terms of error reduction in F-score as compared to the baseline single-class classifier, the gains are 9% for CHUNK, 7% for NER, and 17% for DISFL. There appears to be more useful information in training data derived from cross-validated output with errors than in error-free training material.</Paragraph> <Paragraph position="7"> Stacking and class trigrams can be combined.</Paragraph> <Paragraph position="8"> One straightforward combination is that of a first-stage classifier that predicts trigrams, and a second-stage stacked classifier that also predicts trigrams (we use the adaptive variant, since it produced the best results), while including a centered, seven-positions-wide window of first-stage trigram class labels in the input. [Table 4: Performances in terms of F-score by MBL on the three test sets, with adaptive stacking, trigram classes, and the combination of the two.]</Paragraph> <Paragraph position="9"> Table 4 compares the results of adaptive stacking and trigram classes with those of the combination of the two. As can be seen, the combination produces even better results than both the stacking and the trigram-class methods individually, on all three tasks. 
Compared to the baseline single-class classifier, the error reductions are 15% for CHUNK, 15% for NER, and 18% for DISFL.</Paragraph> <Paragraph position="10"> As an additional analysis, we inspected the predictions made by the trigram-class method and its combinations with the stacking and feedback-loop methods on the CHUNK task, to obtain a better view of the amount of disagreement among the overlapping trigrams. We found that with the trigram-class method, some disagreement among the overlapping trigrams occurs in 6.3% of all votes. A slightly higher percentage of disagreements, 7.1%, is observed with the combination of the trigram-class and stacking methods. Interestingly, in the combination of the trigram-class and feedback-loop methods, only 0.1% of all trigram votes are not unanimous.</Paragraph> <Paragraph position="11"> This clearly illustrates that in the latter combination the resulting sequence of trigrams is internally very consistent - also in its errors.</Paragraph> </Section> </Paper>