<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1007"> <Title>Combining Hierarchical Clustering and Machine Learning to Predict High-Level Discourse Structure</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> As described in Section 2, the trained model was combined with the clustering method to build trees for the test set. These were evaluated against the manually built discourse trees. Precision (P) and recall (R) were defined in accordance with the PARSEVAL metric: precision was defined as the number of correct nodes (i.e. matching brackets) divided by the number of nodes in the automatically built tree, and recall as the number of correct nodes divided by the number of nodes in the manually built tree. Precision and recall are combined in the f-score (F), defined as</Paragraph> <Paragraph position="1"> F = 2PR / (P + R) </Paragraph> <Paragraph position="2"> Table 1 shows the results. We compared the performance of our model (ME) to Yaari's (1997) method of building trees based on term overlap (TO). In addition, three baselines were used: merging segments randomly (results averaged over 100 runs), producing a right-branching tree by always merging the last two segments (RB), and producing a left-branching tree by always merging the first two segments (LB). Finally, an upper bound was calculated by comparing the trees for the doubly annotated text files in the RST-DT. Note that the doubly annotated data set is slightly different from the test set, hence the upper bound can only give an indication of the human performance on this task.</Paragraph> <Paragraph position="3"> The maximum entropy model outperforms all other methods on precision, recall and f-score. The difference in correct discourse segments (true positives) between our method and the next best method (i.e. left-branching) is statistically significant (one-tailed paired t-test, t = 1.72, df = 37, p &lt; 0.05).</Paragraph> <Paragraph position="4">
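The bracket-matching evaluation described above can be sketched as follows. Here a tree is reduced to the set of (start, end) segment spans covered by its nodes; the span values and variable names are illustrative, not taken from the paper's data:

```python
# PARSEVAL-style sketch: a discourse tree is represented by the set of
# (start, end) segment spans ("brackets") covered by its nodes.
def evaluate(gold, auto):
    correct = len(gold & auto)           # matching brackets
    precision = correct / len(auto)      # correct / nodes in automatic tree
    recall = correct / len(gold)         # correct / nodes in manual tree
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Toy example with made-up spans: two of three brackets match.
gold = {(0, 3), (0, 1), (2, 3)}
auto = {(0, 3), (1, 3), (2, 3)}
p, r, f = evaluate(gold, auto)
```

In this toy case precision, recall, and f-score all come out to 2/3, since each tree has three brackets and two of them agree.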
Interestingly, Yaari's word co-occurrence based method (TO) is outperformed by left-branching trees (LB). Furthermore, while Marcu (2000) argues that right-skewed structures should be considered better than left-skewed structures, in our experiments the latter actually outperform the former, i.e. inter-paragraph structure in the RST-DT is predominantly left-branching. Predictably, human performance is better than that of any of the automatic methods.</Paragraph> <Paragraph position="5"> To investigate the contribution of our different feature sets, we re-trained the model after removing lexical chains (ME-LC), term overlap (ME-TO), and both lexical chains and term overlap (ME-LCTO).</Paragraph> <Paragraph position="6"> The results are also shown in Table 1. As can be seen, removing the lexical chain features results in a greater performance loss than removing the term overlap features. Thus it seems that lexical chains are more useful for the task than term overlap. However, the performance difference between ME-LC and ME-TO is not statistically significant (t = 0.96, df = 37, p &gt; 0.05). Even ME-LCTO achieves a better f-score than is achieved by left-skewed clustering (LB), which indicates that other features, such as tense and cue word features, are able to compensate to some degree for the absence of chain and term overlap features. But the difference between LB and ME-LCTO is again not statistically significant (t = 1.24, df = 37, p &gt; 0.05). So far we have not said much about the rhetorical relations that hold between larger discourse segments. In fact, assigning relations to higher-level structures is easier than doing so for inter-sentence structures. One reason for this is that there is much less variation at the inter-paragraph level.
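A one-tailed paired t-test of the kind reported above can be sketched as follows. The per-document scores here are made-up stand-ins, not the paper's data; the critical value mentioned in the comment (about 1.687 for df = 37, alpha = 0.05, one-tailed) is taken from a standard t-table:

```python
import math

def paired_t(xs, ys):
    """Paired t-test statistic for H1: mean(xs) > mean(ys). Returns (t, df)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# Toy per-document f-scores for two systems (illustrative only):
me = [0.62, 0.58, 0.71, 0.66, 0.60]
lb = [0.55, 0.59, 0.63, 0.61, 0.57]
t, df = paired_t(me, lb)
# The difference is significant at alpha = 0.05 if t exceeds the one-tailed
# critical value for this df from a t-table (e.g. about 1.687 for df = 37).
```

In practice one would read the critical value (or p-value) off a t-distribution for the actual df, e.g. with `scipy.stats`; the sketch only computes the statistic itself.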
For example, the RST-DT contains 111 different relations but only 64 of these are used at the inter-paragraph level.</Paragraph> <Paragraph position="7"> Furthermore, the most frequent relation at the inter-paragraph level (Elaboration-additional) accounts for a much larger percentage (37%) of all relations used at this level than does the most frequent relation at the intra-paragraph level (List, 13%). Hence, always predicting Elaboration-additional would already achieve 37% accuracy. Being able to reliably distinguish between Elaboration-additional and the second most frequent inter-paragraph relation, List, would guarantee 53% accuracy. In contrast, correctly predicting the two most frequent relations at the intra-paragraph level would only achieve 26% accuracy. We plan to address the prediction of rhetorical relations between larger discourse segments in future work.</Paragraph> </Section> </Paper>