<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2207"> <Title>A Hybrid Approach for the Acquisition of Information Extraction Patterns</Title> <Section position="6" start_page="52" end_page="54" type="evalu"> <SectionTitle> 5 Experimental Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="52" end_page="53" type="sub_section"> <SectionTitle> 5.1 Indirect Evaluation </SectionTitle> <Paragraph position="0"> For a better understanding of the proposed approach we perform an incremental evaluation: first, we evaluate only the various pattern selection criteria described in Section 2.4 by disabling the NB-EM component. Second, using the best selection criterion, we evaluate the complete co-training system.</Paragraph> <Paragraph position="1"> In both experiments we initialize the system with high-precision, manually-selected seed rules which yield seed documents covering 10% of the training partitions. The remaining 90% of the training documents are kept unlabeled. For all experiments we used a maximum of 400 bootstrapping iterations. The acquired rules are fed to the decision list classifier, which assigns category labels to the documents in the test partitions. Evaluation of the pattern selection criteria Figure 3 illustrates the precision/recall charts of the four algorithms as the number of patterns made available to the decision list classifier increases. All charts show precision/recall points starting after 100 learning iterations, with 100-iteration increments. It is immediately obvious that the Collins selection criterion performs significantly better than the other three criteria. 
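The decision-list classification step can be sketched as follows. This is a minimal illustration under assumed rule and document representations, not the authors' implementation; the patterns, labels, and precision scores below are invented for the example (the domain names echo those evaluated in Section 5.2):

```python
# Sketch of a decision list: rules are (pattern, label, precision) tuples and a
# document receives the label of the highest-precision rule that matches it.
def classify(doc_patterns, rules):
    # Try rules in order of decreasing precision on labeled data.
    for pattern, label, _precision in sorted(rules, key=lambda r: r[2], reverse=True):
        if pattern in doc_patterns:
            return label
    return None  # no rule fired; the document stays unlabeled

# Hypothetical rules and document, for illustration only.
rules = [
    ("acquired_stake", "Corporate Acquisitions", 0.95),
    ("posted_profit", "Financial", 0.90),
    ("announced", "Financial", 0.40),  # generic, low-precision rule
]
print(classify({"announced", "acquired_stake"}, rules))  # prints "Corporate Acquisitions"
```

The ordering by precision is what makes the Collins-style precision control discussed below matter: a document matched by both a precise and a generic rule gets the precise rule's label.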
For the same recall point, Collins yields a classification model with much higher precision, with differences ranging from 5% in the REUTERS collection to 20% in the AP collection.</Paragraph> <Paragraph position="2"> Theorem 5 in (Abney, 2002) provides a theoretical explanation for these results: if certain independence conditions between the classifier rules are satisfied and the precision of each rule is larger than a threshold T, then the precision of the final classifier is larger than T. Although the rule independence conditions are certainly not satisfied in our real-world evaluation, the above theorem indicates that there is a strong relation between the precision of the classifier rules on labeled data and the precision of the final classifier. Our results provide the empirical proof that controlling the precision of the acquired rules (i.e. the Collins criterion) is important.</Paragraph> <Paragraph position="3"> The Collins criterion controls the recall of the learned model by favoring rules with high frequency in the collection. However, since the other two criteria do not use a precision threshold, they will acquire more rules, which translates into better recall. For two out of the three collections, Riloff and Chi obtain a slightly better recall, about 2% higher than Collins', albeit with a much lower precision. We do not consider this an important advantage: in the next section we show that co-training with the NB-EM component further boosts the precision and recall of the Collins-based acquisition algorithm.</Paragraph> <Paragraph position="4"> The MI criterion performs the worst of the four evaluated criteria. A clue for this behavior lies in the following equivalent form for MI: MI(p,y) = log P(p|y) - log P(p). This formula indicates that, for patterns with equal conditional probabilities P(p|y), MI assigns higher scores to patterns with lower frequency. 
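A small numeric sketch makes this concrete; the probabilities below are invented purely for illustration:

```python
import math

# MI(p, y) = log P(p|y) - log P(p): the equivalent form of pointwise mutual
# information between a pattern p and a category y quoted above.
def mi(p_cond, p_marginal):
    return math.log(p_cond) - math.log(p_marginal)

# Two patterns with the same conditional probability P(p|y) = 0.2 but
# different collection frequencies P(p) (illustrative values).
frequent = mi(0.2, 0.10)
rare = mi(0.2, 0.01)

print(rare > frequent)  # prints True: MI scores the low-frequency pattern higher
```

Because the marginal P(p) enters with a negative sign, the rarer pattern always wins the tie, which is exactly the frequency bias described above.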
This is not the desired behavior in a TC-oriented system.</Paragraph> <Paragraph position="5"> Evaluation of the co-training system Figure 4 compares the performance of the stand-alone pattern acquisition algorithm (&quot;bootstrapping&quot;) with the performance of the acquisition algorithm trained in the co-training environment (&quot;co-training&quot;). For both setups we used the best pattern selection criterion, i.e. the Collins criterion. To put things in perspective, we also depict the performance obtained with a baseline system, i.e. the system configured to use the Riloff pattern selection criterion and without the NB-EM algorithm (&quot;baseline&quot;). To our knowledge, this system, or a variation of it, is the current state-of-the-art in pattern acquisition (Riloff, 1996; Yangarber et al., 2000; Yangarber, 2003; Stevenson and Greenwood, 2005).</Paragraph> <Paragraph position="6"> All algorithms were initialized with the same seed rules and had access to all documents.</Paragraph> <Paragraph position="7"> Figure 4 shows that the quality of the learned patterns always improves if the pattern acquisition algorithm is &quot;reinforced&quot; with EM. For the same recall point, the patterns acquired in the co-training environment yield classification models with generally much higher precision than the models generated by the pattern acquisition algorithm alone. When using the same pattern acquisition criterion, e.g. 
Collins, the differences between the co-training approach and the stand-alone pattern acquisition method (&quot;bootstrapping&quot;) range from 2-3% in the REUTERS collection to 20% in the LATIMES collection.</Paragraph> <Paragraph position="8"> These results support our intuition that the sparse pattern space is insufficient to generate good classification models, which directly influences the quality of all acquired patterns.</Paragraph> <Paragraph position="9"> Furthermore, due to the increased coverage of the lexicalized collection views, the patterns acquired in the co-training setup generally have better recall, up to 11% higher in the LATIMES collection. Lastly, the comparison of our best system (&quot;co-training&quot;) against the current state-of-the-art (our &quot;baseline&quot;) draws an even more dramatic picture:</Paragraph> <Paragraph position="10"> For the same recall point, the co-training system obtains a precision up to 35% higher for AP and LATIMES, and up to 10% higher for REUTERS.</Paragraph> </Section> <Section position="2" start_page="53" end_page="54" type="sub_section"> <SectionTitle> 5.2 Direct Evaluation </SectionTitle> <Paragraph position="0"> As stated in Section 4.2, two experts have manually evaluated the top 100 acquired patterns for a different domain in each of the three collections. The three corresponding domains were selected to cover different degrees of ambiguity, which are reflected in the initial inter-expert agreement. Any disagreement between experts is resolved using the algorithm introduced in Section 4.2. Table 3 shows the results of this direct evaluation. The co-training approach outperforms the baseline for all three collections. Concretely, improvements of 9% and 8% are achieved for the Financial and the Corporate Acquisitions domains, and 46%, by far the largest difference, is found for the Sports domain in AP. 
Table 4 lists the top 20 patterns extracted by both approaches in the latter domain. It can be observed that for the baseline, only the top 4 patterns are relevant, the rest being extremely general patterns. On the other hand, the quality of the patterns acquired by our approach is much higher: all the patterns are relevant to the domain, although 7 out of the 20 might be considered ambiguous and, according to the criterion defined in Section 4.2, have been evaluated as not relevant.</Paragraph> <Paragraph position="2"> Table 4: Top 20 patterns extracted by the baseline system (Riloff) and the co-training system for the AP collection. The correct patterns are in bold.</Paragraph> </Section> </Section> </Paper>