<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1007">
  <Title>Combining Hierarchical Clustering and Machine Learning to Predict High-Level Discourse Structure</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
3 Data
</SectionTitle>
    <Paragraph position="0"> The RST Discourse Treebank (RST-DT) (Carlson et al., 2002) was used for training and testing. It contains 385 Wall Street Journal articles from the Penn Treebank, which are manually annotated with discourse structure in the framework of Rhetorical Structure Theory (RST) (Mann and Thompson, 1987). The set is divided into a training set (342 texts) and a test set (43 texts). 52 texts, selected from both sets, were annotated twice. We use these to estimate human agreement on the task.</Paragraph>
    <Paragraph position="1"> Since we focus only on inter-paragraph structure, intra-paragraph structure was discarded. In most cases the discourse structure of a text obeyed paragraph boundaries, but about 21% of the paragraphs did not correspond to a discourse segment. One way to deal with such cases is to remove them from the training set, but since the training set is already relatively small, we decided instead to replace them with the inter-paragraph tree that comes closest to the original structure.</Paragraph>
    <Paragraph position="2"> In most cases where the discourse structure does not follow paragraph structure, the deviation is relatively minor. For example, Figure 1 shows 10 edus (numbered 1 to 10) in 3 paragraphs (A to C, indicated by boxes). There is no discourse segment corresponding to paragraph B because a subsegment of the paragraph (consisting of edus 4 to 6) merges with the previous paragraph, and only then is the resulting segment merged with the last edu of B (i.e., edu 7).</Paragraph>
    <Paragraph position="3"> However, there is a discourse segment corresponding to the two paragraphs A and B taken together; the structure in Figure 1 therefore maps relatively easily to the inter-paragraph structure ((AB)C) (as opposed to (A(BC))).</Paragraph>
    <Paragraph position="4"> However, for 8% of paragraphs the mapping was less straightforward. For example, in the tree in Figure 2 some of B's edus attach to the left and some to the right, so it is not immediately clear whether the tree should map to ((AB)C) or (A(BC)). In these cases we used majority voting to resolve the ambiguity: if most of the edus of a paragraph attached to the left (as in Figure 2), the paragraph was merged with its left neighbour; otherwise it was merged with its right neighbour. Hence, the tree in Figure 2 is assumed to have the structure ((AB)C).</Paragraph>
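The majority-voting heuristic above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name, the per-edu 'left'/'right' labelling, and the tie-breaking choice (defaulting to the left neighbour, which the text does not specify) are all assumptions.

```python
from collections import Counter

def merge_direction(edu_attachments):
    """Decide which neighbour a boundary-violating paragraph merges with.

    edu_attachments: one 'left' or 'right' label per edu, recording which
    neighbouring paragraph that edu attaches to in the original RST tree.
    Returns the majority direction (ties break to 'left' by assumption).
    """
    counts = Counter(edu_attachments)
    return "left" if counts["left"] >= counts["right"] else "right"

# In Figure 2 most of B's edus attach leftwards, so B merges with A,
# yielding the inter-paragraph structure ((AB)C):
print(merge_direction(["left", "left", "right"]))
```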
    <Paragraph position="5">  The few non-binary structures in the training set were binarised by replacing them with left-branching binary structures.</Paragraph>
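Left-branching binarisation of an n-ary node amounts to folding its children from the left; a minimal sketch (representing binary nodes as 2-tuples, an assumed encoding):

```python
def left_branching(children):
    """Replace an n-ary node's children with a left-branching binary tree,
    e.g. [A, B, C] becomes ((A, B), C)."""
    tree = children[0]
    for child in children[1:]:
        tree = (tree, child)
    return tree

print(left_branching(["A", "B", "C"]))
```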
    <Paragraph position="6"> Since we want to predict the likelihood of merging two segments, each pair of adjacent segments (of any size) can be treated as a training example. Segment pairs that are contained in a discourse tree are positive examples, and segment pairs not contained in the tree are added as negative examples. For instance, the tree in Figure 3 contains 3 positive training examples (A+B, C+D, and AB+CD) and 7 negative examples (B+C, AB+C, A+BC, BC+D, B+CD, ABC+D, and A+BCD). Pairs of non-adjacent segments, e.g. A+D, were ignored because they are not permitted under the assumption that discourse structure is a tree with non-crossing branches (i.e. their probability is 0).</Paragraph>
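The enumeration of training pairs can be sketched as below, assuming (hypothetically) that the tree over paragraphs A to D has the structure ((AB)(CD)) and is encoded as nested 2-tuples; positives are the child pairs of internal nodes, negatives are all remaining pairs of adjacent contiguous segments. This is not the authors' code.

```python
def positive_pairs(tree):
    """Return (span, pairs): the node's span and the (left, right)
    child-span pair of every internal node below it."""
    if isinstance(tree, str):              # leaf: a single paragraph
        return tree, []
    left, right = tree
    lspan, lpairs = positive_pairs(left)
    rspan, rpairs = positive_pairs(right)
    return lspan + rspan, lpairs + rpairs + [(lspan, rspan)]

def all_adjacent_pairs(leaves):
    """All pairs of adjacent contiguous segments, of any sizes."""
    n = len(leaves)
    pairs = []
    for i in range(n):                     # start of left segment
        for j in range(i + 1, n):          # boundary between the segments
            for k in range(j + 1, n + 1):  # end of right segment
                pairs.append(("".join(leaves[i:j]), "".join(leaves[j:k])))
    return pairs

tree = (("A", "B"), ("C", "D"))
_, pos = positive_pairs(tree)
neg = [p for p in all_adjacent_pairs(list("ABCD")) if p not in pos]
print(pos)       # the 3 positive pairs
print(len(neg))  # the 7 negative pairs
```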
    <Paragraph position="7"> The 342 texts in the RST-DT training set gave rise to 1,830 positive and 185,691 negative training examples.</Paragraph>
  </Section>
</Paper>