<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1632">
<Title>Using Linguistically Motivated Features for Paragraph Boundary Identification</Title>
<Section position="7" start_page="271" end_page="272" type="evalu">
<SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"> Having trained our algorithm on the development data, we determined the optimal feature combination and then evaluated performance on the previously unseen test data.</Paragraph>
<Paragraph position="1"> Tables 2 and 3 present the rankings of the least and the most beneficial features, respectively. Somewhat surprisingly to us, Table 2 shows that essentially all features capturing information on discourse cues actually worsened the performance of the classifier. The poor performance of the prevSCue and currSCue features may be caused by their extreme sparseness. To test these features reasonably, we plan to increase the data set size by an order of magnitude. Then, at least, it should be possible to determine which discourse cues, if any, are correlated with paragraph boundaries. The poor performance of the prevSCueClass and currSCueClass features may be caused by the categorization provided by the IDS. This question also requires further investigation, possibly with a different categorization.</Paragraph>
<Paragraph position="2"> Table 3 also provides interesting insights into the feature set. First, with only the three features relPos, word1 and word2, the baseline performs almost as well as the full feature set used by Sporleder & Lapata. Then, as expected, currSRE provides the largest gain in performance, followed by currSVF, currSPerson and prevSPerson.</Paragraph>
<Paragraph position="3"> This result confirms our hypothesis that linguistically motivated features capturing information on pronominalization and information structure play an important role in determining paragraph segmentation. The results of our system and the baselines for different classifiers (BT stands for BoosTexter and Ti for TiMBL) are summarized in Table 4. Accuracy is calculated by dividing the number of matches by the total number of test instances. Precision, recall and F-measure are obtained from the counts of true positives, false positives and false negatives. The last metric, WindowDiff (Pevzner & Hearst, 2002), is intended to overcome a disadvantage of the F-measure, which penalizes near misses as harshly as more serious errors. The value of WindowDiff varies between 0 and 1, where a lower value corresponds to better performance.</Paragraph>
<Paragraph position="4"> The significance of our results was computed using the χ² test. All results are significantly better (at the 0.05 level or below) than both baselines and the reimplemented version of Sporleder & Lapata's (2006) algorithm, whose performance on our data is comparable to what the authors reported for their corpus of German fiction. Interestingly, TiMBL does much better than BoosTexter on Sporleder & Lapata's feature set.</Paragraph>
<Paragraph position="5"> Apparently, Sporleder & Lapata's assumption that their classifier would rely on many weak hypotheses does not hold. This is also confirmed by the results reported in Table 3, where only three of their features perform notably well. In contrast, on our feature set TiMBL and BoosTexter perform almost equally well. However, BoosTexter achieves a much higher precision in all cases, which we consider preferable to the higher recall provided by TiMBL.</Paragraph>
</Section>
</Paper>
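Editor's note: the paragraph above describes how accuracy, precision, recall, F-measure and WindowDiff (Pevzner & Hearst, 2002) are computed for Table 4. The following is a minimal sketch, not the authors' implementation, of these metrics over per-sentence boundary decisions. The 0/1 boundary encoding, the function names, and the window-size choice (half the average reference segment length, a common convention) are illustrative assumptions and may differ from the exact setup used in the paper.

# Illustrative sketch of the evaluation metrics; boundary sequences are
# assumed to be lists of 0/1 flags, one per sentence, where 1 marks
# "a new paragraph starts at this sentence".

def accuracy(reference, predicted):
    """Fraction of test instances where the predicted label matches the reference."""
    matches = sum(1 for r, p in zip(reference, predicted) if r == p)
    return matches / len(reference)

def precision_recall_f(reference, predicted):
    """Precision, recall and F-measure over positive (boundary) decisions."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def window_diff(reference, predicted, k=None):
    """WindowDiff: fraction of sliding windows of size k in which the number of
    reference and predicted boundaries disagrees.  Lower values are better."""
    n = len(reference)
    if k is None:
        # Assumed convention: half the average reference segment length, at least 1.
        k = max(1, round(n / (2 * max(1, sum(reference)))))
    errors = 0
    for i in range(n - k):
        if sum(reference[i:i + k]) != sum(predicted[i:i + k]):
            errors += 1
    return errors / (n - k)

if __name__ == "__main__":
    # Hypothetical gold and predicted boundary sequences for a 12-sentence text.
    ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
    hyp = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
    print(accuracy(ref, hyp))            # 0.667
    print(precision_recall_f(ref, hyp))  # (0.333, 0.333, 0.333)
    print(window_diff(ref, hyp))         # 0.4

Note that, as the paragraph explains, a near miss (e.g. the boundary predicted one sentence late at position 3 instead of 2) is penalized as a full false positive plus false negative by precision/recall, whereas WindowDiff only counts the windows in which the boundary counts disagree.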