<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1304">
  <Title>Coaxing Confidences from an Old Friend: Probabilistic Classifications from Transformation Rule Lists</Title>
  <Section position="6" start_page="28" end_page="87" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> Three experiments that demonstrate the effectiveness and appropriateness of our probability estimates are presented in this section. The experiments are performed on text chunking, a subproblem of syntactic parsing. Unlike full parsing, the sentences are divided into non-overlapping phrases, where each word belongs to the lowest parse constituent that dominates it.</Paragraph>
    <Paragraph position="1"> The data used in all of these experiments is the CoNLL-2000 phrase chunking corpus (CoNLL, 2000). The corpus consists of sections 15-18 and section 20 of the Penn Treebank (Marcus et al., 1993), and is pre-divided into a 8936-sentence (211727 tokens) training set and a 2012-sentence (47377 tokens) test set. The chunk tags are derived from the parse tree constituents, and the part-of-speech tags were generated by the Brill tagger (Brill, 1995).</Paragraph>
    <Paragraph position="2"> As was noted by Ramshaw &amp; Marcus (1999), text chunking can be mapped to a tagging task, where each word is tagged with a chunk tag representing the phrase that it belongs to. An example sentence from the corpus is shown in Table 4. As a contrasting system, our results are compared with those produced by a C4.5 decision tree system (henceforth C4.5). The reason for using C4.5 is twofold: firstly, it is a widely-used algorithm which achieves state-.of-theart performance on a broad variety of tasks; and  secondly, it belongs to the same class of classifiers as our converted transformation-based rule list (henceforth TBLDT).</Paragraph>
    <Paragraph position="3"> To perform a fair evaluation, extra care was taken to ensure that both C4.5 and TBLDT explore as similar a sample space as possible. The systems were allowed to consult the word, the part-of-speech, and the chunk tag of all examples within a window of 5 positions (2 words on either side) of each target example. 2 Since multiple features covering the entire vocabulary of the training set would be too large a space for C4.5 to deal with, in all of experiments where TBLDT is directly compared with C4.5, the word types that both systems can include in their predicates are restricted to the most &amp;quot;ambiguous&amp;quot; 100 words in the training set, as measured by the number of chunk tag types that are assigned to them. The initial prediction was made for both systems using a class assignment based solely on the part-of-speech tag of the word.</Paragraph>
    <Paragraph position="4"> Considering chunk tags within a contextual window of the target word raises a problem with C4.5. A decision tree generally trains on independent samples and does not take into account changes of any features in the context. In our case, the samples are dependent; the classification of sample i is a feature for sample i + 1, which means that changing the classification for sample i affects the context of sample i + 1. To address this problem, the C4.5 systems are trained with the correct chlmk~ in the left context. When the system is used for classification, input is processed in a left-to-right manner;and the output of the system is fed forward to be used as features in the left context of following samples. Since C4.5 generates probabilities for each classification decision, they can be redirected into the input for the next position. Providing the decision treewith this confidence information effectively allows it to perform a limited search over the entire sentence. C4.5 does have one advantage over TBLDT, however. A decision tree can be trained using the subsetting feature, where questions asked are of the form: &amp;quot;does feature f belong to the set FT'. This is not something that a TBL can do readily,  but since the objective is in comparing TBLDT to another state-of-the-art system, this feature was enabled.</Paragraph>
    <Section position="1" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
4.1 Evaluation Measures
</SectionTitle>
      <Paragraph position="0"> The most commonly used measure for evaluating tagging tasks is tag accuracy, lit is defined as Accuracy = # of correctly tagged examples of examples In syntactic parsing, though, since the task is to identify the phrasal components, it is more appropriate to measure the precision and recall:  To facilitate the comparison of systems with different precision and recall, the F-measure metric is computed as a weighted harmonic mean of precision and recall:</Paragraph>
      <Paragraph position="2"> The ~ parameter is used to give more weight to precision or recall, as the task at hand requires.</Paragraph>
      <Paragraph position="3"> In all our experiments, ~ is set to 1, giving equal weight to precision and recall.</Paragraph>
      <Paragraph position="4"> The reported performances are all measured with the evaluation tool provided with the CoNLL corpus (CoNLL, 2000).</Paragraph>
    </Section>
    <Section position="2" start_page="29" end_page="87" type="sub_section">
      <SectionTitle>
4.2 Active Learning
</SectionTitle>
      <Paragraph position="0"> To demonstrate the usefulness of obtaining probabilities from a transformation rule list, this section describes an application which utilizes these probabilities, and compare the resulting performance of the system with that achieved by C4.5.</Paragraph>
      <Paragraph position="1"> Natural language processing has traditionally required large amounts of annotated data from which to extract linguistic properties. However, not all data is created equal: a normal distribution of aunotated data contains much redundant information. Seung et al. (1992) and Freund et al. (1997) proposed a theoretical active learning approach, where samples are intelligently selected for annotation. By eliminating redundant information, the same performance can be achieved while using fewer resources. Empirically, active learning has been applied to various NLP tasks such as text categorization (Lewis and Gale, 1994; Lewis and Catlett, 1994; Liere and Tadepalli, 1997), part-of-speech tagging (Dagan and Engelson, 1995; Engelson and Dagan, 1996), and base noun phrase chunbiug (Ngai and Yarowsky, 2000), resulting in significantly large reductions in the quantity of data needed to achieve comparable performance.</Paragraph>
      <Paragraph position="2"> This section presents two experimental results which show the effectiveness of the probabilities generated by the TBLDT. The first experiment compares the performance achieved by the active learning algorithm using TBLDT with the performance obtained by selecting samples sequentially from the training set. The second experiment compares the performances achieved by TBLDT and C4.5 training on samples selected by active learning.</Paragraph>
      <Paragraph position="3"> The following describes the active learning algorithm used in the experiments:  1. Label an initial T1 sentences of the corpus; 2. Use the machine learning algorithm (G4.5 or TBLDT) to obtain chunk probabilities on the rest of the training data; 3. Choose T2 samples from the rest of the train null ing set, specifically the samples that optimize an evaluation function f, based on the class distribution probability of each sample; 4. Add the samples, including their &amp;quot;true&amp;quot; classification 3 to the training pool and retrain the system; 5. If a desired number of samples is reached, stop, otherwise repeat from Step 2.</Paragraph>
      <Paragraph position="4"> The evaluation function f that was used in our experiments is: where H(UIS, i ) is the entropy of the chllnk probability distribution associated with the word index i in sentence S.</Paragraph>
      <Paragraph position="5"> Figure 2 displays the performance (F-measure and chllnk accuracy) of a TBLDT system trained on samples selected by active learning and the same system trained on samples selected sequentially from the corpus versus the number of words in the annotated tralniug set. At each step of the iteration, the active learning-trained TBLDT system achieves a higher accuracy/F-measure, or, conversely, is able to obtain the same performance level with less training data. Overall, our system can yield the same performance as the sequential system with 45% less data, a significant reduction in the annotation effort.</Paragraph>
      <Paragraph position="6"> Figure 3 shows a comparison between two active learning experiments: one using TBLDT and the other using C4.5. 4 For completeness, a sequential run using C4.5 is also presented. Even though C4.5 examines a larger space than TBLDT by SThe true (reference or gold standard) classification is available in this experiment. In an annotation situation, the samples are sent to human annotators for labeling.</Paragraph>
      <Paragraph position="8"/>
      <Paragraph position="10"> utilizing the feature subset predicates, TBLDT still performs better. The difference in accuracy at 26200 words (at the end of the active learning run for TBLDT) is statistically significant at a 0.0003 level.</Paragraph>
      <Paragraph position="11"> As a final remark on this experiment, note that at an annotation level of 19000 words, the fully lexicalized TBLDT outperformed the C4.5 system by making 15% fewer errors.</Paragraph>
    </Section>
    <Section position="3" start_page="87" end_page="87" type="sub_section">
      <SectionTitle>
4.3 Rejection curves
</SectionTitle>
      <Paragraph position="0"> It is often very useful for a classifier to be able to offer confidence scores associated with its decisions. Confidence scores are associated with the probability P(C(z) correct\[z) where C(z) is the classification of sample z. These scores can be used in real-life problems to reject samples that the the classifier is not sure about, in which case a better observation, or a human decision, might be requested. The performance of the classifier is then evaluated on the samples that were not rejected. This experiment framework is well-established in machine learning and optimization research (Dietterich and Bakiri, 1995; Priebe et al., 1999).</Paragraph>
      <Paragraph position="1"> Since non-probabilistic classifiers do not offer any insights into how sure they are about a particular classification, it is not easy to obtain confidence scores from them. A probabilistic classifier, in contrast, offers information about the class probability distribution of a given sample.</Paragraph>
      <Paragraph position="2"> Two measures that can be used in generating confidence scores are proposed in this section.</Paragraph>
      <Paragraph position="3"> The first measure, the entropy H of the class probability distribution of a sample z, C(z) = {p(CllZ),p(c2\[z)...p(cklZ)}, is a measure of the uncertainty in the distribution:</Paragraph>
      <Paragraph position="5"> The higher the entropy of the distribution of class probability estimates, the more uncertain the  lected for rejection are chosen by sorting the data using the entropies of the estimated probabilities, and then selecting the ones with highest entropies. The resulting curve is a measure of the correlation between the true probability distribution and the one given by the classifier.</Paragraph>
      <Paragraph position="6"> Figure 4(a) shows the rejection curves for the TBLDT system and two C4.5 decision trees - one which receives a probability distribution as input (&amp;quot;soft&amp;quot; decisions on the left context) , and one which receives classifications (&amp;quot;hard&amp;quot; decisions on all fields). At the left of the curve, no samples are rejected; at the right side, only the samples about which the classifiers were most certain are kept (the samples with minimum entropy). Note that the y-values on the right side of the curve are based on less data, effectively introducing wider variance in the curve as it moves right.</Paragraph>
      <Paragraph position="7"> As shown in Figure 4(a), the C4.5 classifier that has access to the left context chunk tag probability distributions behaves better than the other C4.5 system, because this information about the surrounding context allows it to effectively perform a shallow search of the classification space. The TBLDT system, which also receives a probability distribution on the chunk tags in the left context, clearly outperforms both C4.5 systems at all rejection levels.</Paragraph>
      <Paragraph position="8"> The second proposed measure is based on the probability of the most likely tag. The assumption here is that this probability is representative of how certain the system is about the classification. The samples are put in bins based on the probability of the most likely chnnk tag, and accuracies are computed for each bin (these bins are cumulative, meaning that a sample will be included in all the bins that have a lower threshold than the probability of its most likely chnnlC/ tag). At each accuracy level, a sample will be rejected if the probability of its most likely chnn~  C4.5 systems and the TBLDT system is below the accuracy level. The resulting curve is a measure of the correlation between the true distribution probability and the probability of the most likely chunk tag, i.e. how appropriate those probabilities are as confidence measures. Unlike the first measure mentioned before, a threshold obtained using this measure can be used in an online manner to identify the samples of whose classification the system is confident.</Paragraph>
      <Paragraph position="9"> Figure 4(b) displays the rejection curve for the second measure and the same three systems.</Paragraph>
      <Paragraph position="10"> TBLDT again outperforms both C4.5 systems, at all levels of confidence.</Paragraph>
      <Paragraph position="11"> In summary, the TBLDT system outperforms both C4.5 systems presented, resulting in fewer rejections for the same performance, or, conversely, better performance at the same rejection rate.</Paragraph>
    </Section>
    <Section position="4" start_page="87" end_page="87" type="sub_section">
      <SectionTitle>
4.4 Perplexity and Cross Entropy
</SectionTitle>
      <Paragraph position="0"> Cross entropy is a goodness measure for probability estimates that takes into account the accuracy of the estimates as well as the classification accuracy of the system. It measures the performance of a system trained on a set of samples distributed according to the probability distribution p when tested on a set following a probability distribution q. More specifically, we utilize conditional cross entropy, which is defined as</Paragraph>
      <Paragraph position="2"> where X is the set of examples and C is the set of chnnlr tags, q is the probability distribution on the</Paragraph>
    </Section>
    <Section position="5" start_page="87" end_page="87" type="sub_section">
      <SectionTitle>
Test Set
</SectionTitle>
      <Paragraph position="0"> test document and p is the probability distribution on the train corpus.</Paragraph>
      <Paragraph position="1"> The cross entropy metric fails if any outcome is given zero probability by the estimator. To avoid this problem, estimators are &amp;quot;smoothed&amp;quot;, ensuring that novel events receive non-zero probabilities.</Paragraph>
      <Paragraph position="2"> A very simple smoothing technique (interpolation with a constant) was used for all of these systems.</Paragraph>
      <Paragraph position="3"> A closely related measure is perplexity, defined as</Paragraph>
      <Paragraph position="5"> The cross entropy and perplexity results for the various estimation schemes are presented in Table * 3. The TBLDT outperforms both C4.5 systems, obtaining better cross-entropy and chunk tag perplexity. This shows that the overall probability distribution obtained from the TBLDT system better matches the true probability distribution.</Paragraph>
      <Paragraph position="6"> This strongly suggests that probabilities generated this way can be used successfully in system combination techniques such as voting or boosting.</Paragraph>
    </Section>
    <Section position="6" start_page="87" end_page="87" type="sub_section">
      <SectionTitle>
4.5 Chunking performance
</SectionTitle>
      <Paragraph position="0"> It is worth noting that the transformation-based system used in the comparative graphs in Figure 3 was not r, uning at full potential. As described earlier, the TBLDT system was only allowed to consider words that C4.5 had access to. However, a comparison between the corresponding TBLDT curves in Figures 2 (where the system is given access to all the words) and 3 show that a transformation-based system given access to all the words performs better than the one with a restricted lexicon, which in turn outperforms the best C4.5 decision tree system both in terms of accuracy and F-measure.</Paragraph>
      <Paragraph position="1"> Table 4 shows the performance of the TBLDT system on the full CoNLL test set, broken down by chunk type. Even though the TBLDT results could not be compared with other published results on the same task and data (CoNLL will not take place until September 2000), our system significantly outperforms a similar system trained with a C4.5 decision tree, shown in Table 5, both in chunk accuracy and F-measure.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>