<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1030"> <Title>Sentence Level Discourse Parsing using Syntactic and Lexical Information</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> In this section we present the evaluations carried out for both the discourse segmentation task and the discourse parsing task. For this evaluation, we re-trained Charniak's parser (2000) such that the test sentences from the discourse corpus were not seen by the syntactic parser during training.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Evaluation of the Discourse Segmenter </SectionTitle> <Paragraph position="0"> We train our discourse segmenter on the Training section of the corpus described in Section 2, and test it on the Test section. The training regime uses syntactic trees from the Penn Treebank. The metric we use evaluates the discourse segmenter's ability to insert inside-sentence discourse boundaries. That is, if a sentence has 3 edus, which correspond to 2 inside-sentence discourse boundaries, we measure the ability of our algorithm to correctly identify these 2 boundaries. We report our evaluation results using recall, precision, and F-score figures. This metric is harsher than the metric previously used by Marcu (2000), who assesses the performance of a discourse segmentation algorithm by counting how often the algorithm makes boundary and no-boundary decisions for every word in a sentence.</Paragraph> <Paragraph position="1"> We compare the performance of our probabilistic discourse segmenter with the performance of the decision-based segmenter proposed by (Marcu, 2000) and with the performance of two baseline algorithms. The first baseline uses punctuation to determine when to insert a boundary; because commas are often used to indicate breaks inside long sentences, it inserts a discourse boundary after each comma. The second baseline uses syntactic information; because long sentences often have embedded sentences, it inserts a discourse boundary after each text span whose corresponding syntactic subtree is labeled S, SBAR, or SINV. We also compute the agreement between human annotators on the discourse segmentation task, using the doubly-annotated discourse corpus mentioned in Section 2. Table 1 shows the results obtained by the algorithm described in this paper using syntactic trees produced by Charniak's parser (2000), in comparison with the results obtained by the algorithm described in (Marcu, 2000) and by the two baseline algorithms, on the same test set. Crucial to the performance of the discourse segmenter is the recall figure, because we want to find as many discourse boundaries as possible. The baseline algorithms are too simplistic to yield good results (recall figures of 28.2% and 25.4%). The algorithm presented in this paper gives an error reduction in missed discourse boundaries of 24.5% (recall improvement from 77.1% to 82.7%) over (Marcu, 2000). The overall error reduction is 15.1% (F-score improvement from 80.1% to 83.1%).</Paragraph>
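<Paragraph> The error-reduction figures quoted throughout this section can be reproduced as relative reductions in residual error, i.e., the score improvement divided by (100 minus the old score). For instance, the two reductions reported above for the segmenter follow directly from the recall and F-score values:
\[
\frac{82.7 - 77.1}{100 - 77.1} \approx 24.5\%, \qquad \frac{83.1 - 80.1}{100 - 80.1} \approx 15.1\%.
\]
</Paragraph>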
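<Paragraph> For concreteness, the two baseline segmenters and the boundary-level metric described above can be sketched as follows. This is a minimal illustration only: the function names, the flat token list, and the (label, end-token-index) span representation are assumptions made here, not the original implementation. On the example above (a sentence with 3 edus), the gold set would contain the 2 inside-sentence boundary positions.
def comma_baseline(tokens):
    """First baseline: insert a discourse boundary after each comma."""
    return {i for i, tok in enumerate(tokens) if tok == ","}

def syntax_baseline(spans):
    """Second baseline: insert a boundary after each text span whose
    syntactic subtree is labeled S, SBAR, or SINV. Here `spans` is
    assumed to be a list of (label, end_token_index) pairs read off
    the parse tree."""
    return {end for label, end in spans if label in ("S", "SBAR", "SINV")}

def boundary_prf(predicted, gold):
    """Precision, recall, and F-score over inside-sentence boundaries."""
    correct = len(predicted.intersection(gold))
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
</Paragraph>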
<Paragraph position="2"> In order to assess the impact of incorrect syntactic parse trees on the performance of the discourse segmenter, we also carry out an evaluation using syntactic trees from the Penn Treebank. The corresponding results are also shown in Table 1. Perfect syntactic trees lead to a further error reduction of 9.5% (F-score improvement from 83.1% to 84.7%). The performance ceiling for discourse segmentation is given by the human annotation agreement F-score of 98.3%.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Evaluation of the Discourse Parser </SectionTitle> <Paragraph position="0"> We train our discourse parsing model on the Training section of the corpus described in Section 2, and test it on the Test section. The training regime uses syntactic trees from the Penn Treebank. The performance is assessed using labeled recall and labeled precision as defined by the standard Parseval metric (Black et al., 1991). As mentioned in Section 2, we use both 18 labels and 110 labels for the discourse relations. The recall and precision figures are combined into an F-score figure in the usual manner.</Paragraph> <Paragraph position="1"> The discourse parsing model uses syntactic trees produced by Charniak's parser (2000) and discourse segments produced by the algorithm described in Section 3. We compare the performance of our model with the performance of the decision-based discourse parsing model proposed by (Marcu, 2000), and with the performance of a baseline algorithm. The baseline algorithm builds right-branching discourse trees labeled with the most frequent relation encountered in the training set (i.e., ELABORATION-NS). We also compute the agreement between human annotators on the discourse parsing task, using the doubly-annotated discourse corpus mentioned in Section 2. The results are shown in Table 2. The baseline algorithm has a performance of 23.4% and 20.7% F-score, when using 18 labels and 110 labels, respectively. Our algorithm has a performance of 49.0% and 45.6% F-score, when using 18 labels and 110 labels, respectively. These results represent an error reduction of 18.8% (F-score improvement from 37.2% to 49.0%) over a state-of-the-art discourse parser (Marcu, 2000) when using 18 labels, and an error reduction of 15.7% (F-score improvement from 35.5% to 45.6%) when using 110 labels. The performance ceiling for sentence-level discourse structure derivation is given by the human annotation agreement F-scores of 77.0% and 71.9%, when using 18 labels and 110 labels, respectively.</Paragraph>
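<Paragraph> The right-branching baseline can be written down directly once the edus of a sentence are given. The sketch below is illustrative only: the tuple-based tree representation and the function name are assumptions made here, not the baseline's actual implementation.
def right_branching_baseline(edus, relation="ELABORATION-NS"):
    """Build a right-branching discourse tree over the edus, labeling
    every internal node with the most frequent training relation."""
    if len(edus) == 1:
        return edus[0]  # a single edu is a leaf
    # pair the first edu with the right-branching tree over the rest
    return (relation, edus[0], right_branching_baseline(edus[1:], relation))
For a sentence segmented into edus e1, e2, e3, this yields (ELABORATION-NS, e1, (ELABORATION-NS, e2, e3)).
</Paragraph>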
<Paragraph position="2"> The performance gap between the results of our model and human agreement is still large, and it can be attributed to three possible causes: errors made by the syntactic parser, errors made by the discourse segmenter, and the weakness of our discourse model.</Paragraph> <Paragraph position="3"> In order to quantitatively assess the impact of each possible cause of error on performance, we perform further experiments. We replace the syntactic parse trees produced by Charniak's parser at 90% accuracy with the corresponding Penn Treebank syntactic parse trees produced by human annotators. We also replace the discourse boundaries produced by our discourse segmenter at 83% accuracy with the discourse boundaries taken from (RST-DT, 2002), which are produced by the human annotators.</Paragraph> <Paragraph position="4"> The results are shown in Table 3. Using perfect syntactic trees (while keeping the automatically derived discourse boundaries) leads to an error reduction of 14.5% (F-score improvement from 49.0% to 56.4%) when using 18 labels, and an error reduction of 12.9% (F-score improvement from 45.6% to 52.6%) when using 110 labels. Using perfect discourse boundaries (while keeping the automatically derived syntactic trees) shows that the impact of perfect discourse segmentation is double the impact of perfect syntactic trees: human-level performance on discourse segmentation leads to an error reduction of 29.0% (F-score improvement from 49.0% to 63.8%) when using 18 labels, and an error reduction of 25.6% (F-score improvement from 45.6% to 59.5%) when using 110 labels. Together, perfect syntactic trees and perfect discourse segmentation lead to an error reduction of 52.0% (F-score improvement from 49.0% to 75.5%) when using 18 labels, and an error reduction of 45.5% (F-score improvement from 45.6% to 70.3%) when using 110 labels. These latter results compare extremely favorably with the human agreement figures in Table 2. The discourse parsing model produces unlabeled discourse structure at a performance level similar to that of human annotators (F-score of 96.2%). When using 18 labels, the distance between the performance level of our discourse parsing model and that of the human annotators is 1.5% absolute (75.5% versus 77.0%). When using 110 labels, the distance is 1.6% absolute (70.3% versus 71.9%). Our evaluation shows that our discourse model is sophisticated enough to match near-human levels of performance.</Paragraph> </Section> </Section> </Paper>