<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1011"> <Title>Learning Features that Predict Cue Usage</Title> <Section position="4" start_page="80" end_page="81" type="metho"> <SectionTitle> 3 Relational Discourse Analysis </SectionTitle> <Paragraph position="0"> This section briefly describes Relational Discourse Analysis (RDA) (Moser, Moore, and Glendening, 1996), the coding scheme used to tag the data for our machine learning experiments. RDA is a scheme devised for analyzing tutorial explanations in the domain of electronics troubleshooting. It synthesizes ideas from (Grosz and Sidner, 1986) and from RST (Mann and Thompson, 1988).</Paragraph> <Paragraph position="1"> Coders use RDA to exhaustively analyze each explanation in the corpus, i.e., every word in each explanation belongs to exactly one element in the analysis. An explanation may consist of multiple segments. Each segment originates with an intention of the speaker. Segments are internally structured and consist of a core, i.e., the element that most directly expresses the segment purpose, and any number of contributors, i.e., the remaining constituents.</Paragraph> <Paragraph position="2"> For each contributor, one analyzes its relation to the core from an intentional perspective, i.e., how it is intended to support the core, and from an informational perspective, i.e., how its content relates to that of the core. (For more detail about the RDA coding scheme, see Moser and Moore, 1995; Moser and Moore, 1997.)</Paragraph> <Paragraph position="3"> The set of intentional relations in RDA is a modification of the presentational relations of RST, while informational relations are similar to the subject matter relations in RST. Each segment constituent, both core and contributors, may itself be a segment with a core:contributor structure. In some cases the core is not explicit.
This is often the case with the whole tutor's explanation, since its purpose is to answer the student's explicit question.</Paragraph> <Paragraph position="4"> As an example of the application of RDA, consider the partial tutor explanation in (1). The purpose of this segment is to inform the student that she made the strategy error of testing inside part3 too soon.</Paragraph> <Paragraph position="5"> The constituent that makes the purpose obvious, in this case (1-B), is the core of the segment. The other constituents help to serve the segment purpose by contributing to it. (1-C) is an example of a subsegment with its own core:contributor structure; its purpose is to give a reason for testing part2 first.</Paragraph> <Paragraph position="6"> The RDA analysis of (1) is shown schematically, with each constituent linked to the relations it participates in. Each relation node is labeled with both its intentional and informational relation, with the order of relata in the label indicating the linear order in the discourse. Each relation node has up to two daughters: the cue, if any, and the contributor, in the order they appear in the discourse. Coders analyze each explanation in the corpus and enter their analyses into a database. The corpus consists of 854 clauses comprising 668 segments, for a total of 780 relations. Table 1 summarizes the distribution of different relations, and the number of cued relations in each category. Joints are segments comprising more than one core, but no contributor; clusters are multi-unit structures with no recognizable core:contributor relation. (1-B) is a cluster composed of two units (the two clauses), related only at the informational level by a temporal relation. Both clauses describe actions, with the first action description embedded in a matrix (&quot;You should&quot;).
Cues are much more likely to occur in clusters, where only informational relations occur, than in core:contributor structures, where intentional and informational relations co-occur (χ² = 33.367, p &lt; .001, df = 1). In the following, we will not discuss joints and clusters any further.</Paragraph> <Paragraph position="7"> An important result pointed out by (Moser and Moore, 1995) is that cue placement depends on core position. When the core is first and a cue is associated with the relation, the cue never occurs with the core. In contrast, when the core is second, if a cue occurs, it can occur either on the core or on the contributor.</Paragraph> <Paragraph position="8"> (To make the example more intelligible, we replaced references to parts of the circuit with the labels part1, part2 and part3.)</Paragraph> <Paragraph position="9"> [Fragment of example (1): &quot;... and thus is more susceptible to damage than part3.&quot;]</Paragraph> <Paragraph position="10"> [Fragments of example (1): &quot;it is more work to open up part3 for testing&quot;; &quot;the process of opening drawers and extending cards in part3 may induce problems which did not already exist.&quot;]</Paragraph> </Section> <Section position="5" start_page="81" end_page="85" type="metho"> <SectionTitle> 4 Learning from the corpus </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="81" end_page="81" type="sub_section"> <SectionTitle> 4.1 The algorithm </SectionTitle> <Paragraph position="0"> We chose the C4.5 learning algorithm (Quinlan, 1993) because it is well suited to a domain such as ours with discrete valued attributes. Moreover, C4.5 produces decision trees and rule sets, both often used in text generation to implement mappings from function features to forms. Finally, C4.5 is both readily available, and is a benchmark learning algorithm that has been extensively used in NLP applications, e.g.
(Litman, 1996; Mooney, 1996; Vander Linden and Di Eugenio, 1996).</Paragraph> <Paragraph position="1"> As our dataset is small, the results we report are based on cross-validation, which (Weiss and Kulikowski, 1991) recommends as the best method to evaluate decision trees on datasets whose cardinality is in the hundreds. Data for learning should be divided into training and test sets; however, for small datasets this has the disadvantage that a sizable portion of the data is not available for learning. Cross-validation obviates this problem by running the algorithm N times (N=10 is a typical value): in each run, (N-1)/Nths of the data, randomly chosen, is used as the training set, and the remaining 1/Nth is used as the test set. (We will discuss only decision trees here.)</Paragraph> <Paragraph position="2"> The error rate of a tree obtained by using the whole dataset for training is then assumed to be the average error rate on the test set over the N runs.</Paragraph> <Paragraph position="3"> Further, as C4.5 prunes the initial tree it obtains to avoid overfitting, it computes both actual and estimated error rates for the pruned tree; see (Quinlan, 1993, Ch. 4) for details. Thus, below we will report the average estimated error rate on the test set, as computed by 10-fold cross-validation experiments.</Paragraph> </Section> <Section position="2" start_page="81" end_page="82" type="sub_section"> <SectionTitle> 4.2 The features </SectionTitle> <Paragraph position="0"> Each data point in our dataset corresponds to a core:contributor relation, and is characterized by the following features, summarized in Table 2.</Paragraph> <Paragraph position="1"> Segment Structure.
Three features capture the global structure of the segment in which the current core:contributor relation appears.</Paragraph> <Paragraph position="2"> * (Con)Trib(utor)-pos(ition) captures the position of a particular contributor within the larger segment in which it occurs, and encodes the structure of the segment in terms of how many contributors precede and follow the core. For example, contributor (1-D) in Figure 1 is labeled as B1A3-2after, as it is the second contributor following the core in a segment with 1 contributor before and 3 after the core.</Paragraph> <Paragraph position="3"> [Table 2 fragment: feature type, feature, description. Segment structure: Trib-pos, relative position of contrib in segment (number of contribs before and after core); Inten-structure, intentional structure of segment; Infor-structure, informational structure of segment.] * Inten(tional)-structure. Whether all contributors in the segment bear the same intentional relations to the core.</Paragraph> <Paragraph position="4"> * Infor(mational)-structure. Similar to intentional structure, but applied to informational relations.</Paragraph> <Paragraph position="5"> Core:contributor relation. These features more specifically characterize the current core:contributor relation.</Paragraph> <Paragraph position="6"> * Inten(tional)-rel(ation). One of concede, convince, enable.</Paragraph> <Paragraph position="7"> * Infor(mational)-rel(ation). About 30 informational relations have been coded for. However, as preliminary experiments showed that using them individually results in overfitting the data, we classify them according to the four classes proposed in (Moser, Moore, and Glendening, 1996): causality, similarity, elaboration, temporal. Temporal relations only appear in clusters, thus not in the data we discuss in this paper. * Syn(tactic)-rel(ation).
Captures whether the core and contributor are independent units (segments or sentences); whether they are coordinated clauses; or which of the two is subordinate to the other.</Paragraph> <Paragraph position="8"> * Adjacency. Whether core and contributor are adjacent in linear order.</Paragraph> <Paragraph position="9"> Embedding. These features capture segment embedding, Core-type and Trib-type qualitatively, and Above/Below quantitatively.</Paragraph> <Paragraph position="10"> * Core-type/(Con)Trib(utor)-type. Whether the core/the contributor is a segment, or a minimal unit (further subdivided into action, state, matrix).</Paragraph> <Paragraph position="11"> * Above/Below encode the number of relations hierarchically above and below the current relation.</Paragraph> </Section> <Section position="3" start_page="82" end_page="85" type="sub_section"> <SectionTitle> 4.3 The experiments </SectionTitle> <Paragraph position="0"> Initially, we performed learning on all 406 instances of core:contributor relations. We quickly determined that this approach would not lead to useful decision trees. First, the trees we obtained were extremely complex (at least 50 nodes). Second, some of the sub-trees corresponded to clearly identifiable subclasses of the data, such as relations with an implicit core, which suggested that we should apply learning to these independently identifiable subclasses. Thus, we subdivided the data into three subsets: relations with an implicit core (Impl-core), relations with the core in first position (Core1), and relations with the core in second position (Core2). While this has the disadvantage of smaller training sets, the trees we obtain are more manageable and more meaningful. Table 3 summarizes the cardinality of these sets, and the frequencies of cue occurrence.</Paragraph> <Paragraph position="1"> We ran four sets of experiments. In three of them we predict cue occurrence and in one cue placement. Table 4 summarizes our main results concerning cue occurrence, and includes the error rates associated with different feature sets.
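The eleven features described in Section 4.2 can be sketched as a simple record, one per core:contributor relation, flattened into the attribute-vector shape a decision-tree learner such as C4.5 consumes. The field names and example values below are our own illustrative choices, not the authors' actual database schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical record for one core:contributor relation, covering the
# eleven features of Table 2 (names are illustrative, not the paper's).
@dataclass
class RelationInstance:
    # segment structure
    trib_pos: str         # e.g. "B1A3-2after"
    inten_structure: str  # intentional structure of the segment
    infor_structure: str  # informational structure of the segment
    # core:contributor relation
    inten_rel: str        # one of "concede", "convince", "enable"
    infor_rel: str        # class: "causality", "similarity", "elaboration"
    syn_rel: str          # e.g. "independent", "coordinated", "trib-subordinate"
    adjacency: bool       # core and contributor adjacent in linear order?
    # embedding
    core_type: str        # "segment", "action", "state", or "matrix"
    trib_type: str
    above: int            # relations hierarchically above this one
    below: int            # relations hierarchically below this one
    cued: bool            # class label: does a cue occur?

def to_c45_row(inst: RelationInstance) -> list:
    """Flatten an instance into an attribute row plus class label,
    the row shape a C4.5-style learner expects."""
    d = asdict(inst)
    label = d.pop("cued")
    return list(d.values()) + ["cued" if label else "not-cued"]

example = RelationInstance("B1A3-2after", "same", "same", "convince",
                           "causality", "trib-subordinate", True,
                           "action", "segment", 1, 0, True)
row = to_c45_row(example)
```

One row per relation, with the class label last, mirrors how the 406 coded relations would be handed to the learner.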
We adopt Litman's (1996) approach to determine whether two error rates E1 and E2 are significantly different. We compute 95% confidence intervals for the two error rates using a t-test. E1 is significantly better than E2 if the upper bound of the 95% confidence interval for E1 is lower than the lower bound of the 95% confidence interval for E2. For each set of experiments, we report the following: 1. A baseline measure obtained by choosing the majority class. E.g., for Core1, 58.9% of the relations are not cued; thus, by deciding to never include a cue, one would be wrong 41.1% of the time.</Paragraph> <Paragraph position="1"> 2. The best individual features whose predictive power is better than the baseline: as Table 4 makes apparent, individual features do not have much predictive power. For neither Core1 nor Impl-core does any individual feature perform better than the baseline, and for Core2 only one feature is sufficiently predictive.</Paragraph> <Paragraph position="2"> 3. (One of) the best induced tree(s). For each tree, we list the number of nodes, and up to six of the features that appear highest in the tree, with their levels of embedding. Figure 2 shows the tree for Core2 (space constraints prevent us from including figures for each tree). In the figure, the numbers in parentheses indicate the number of cases correctly covered by the leaf, and the number of expected errors at that leaf.</Paragraph> <Paragraph position="3"> Learning turns out to be most useful for Core1, where the error reduction (as a percentage) from baseline to the upper bound of the best result is 32%. (All our experiments are run with grouping turned on, so that C4.5 groups values together rather than creating a branch per value. The latter choice always results in trees overfitted to the data in our domain. Using classes of informational relations, rather than individual informational relations, constitutes a sort of a priori grouping.)
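The evaluation protocol above, 10-fold cross-validation with a majority-class baseline and a confidence-interval comparison of error rates, can be sketched as follows. This is a minimal illustration under our own assumptions (the half-widths passed to the significance check would come from a t-test over per-fold errors, as the paper describes); it is not the authors' code.

```python
import random
from statistics import mean

def ten_fold_cv_error(data, classifier_factory, n_folds=10, seed=0):
    """N-fold cross-validation: each fold holds out 1/Nth of the data
    for testing and trains on the remaining (N-1)/Nths; the estimated
    error rate is the average test error over the N runs."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for k in range(n_folds):
        test = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        predict = classifier_factory(train)
        wrong = sum(1 for features, label in test if predict(features) != label)
        fold_errors.append(wrong / len(test))
    return mean(fold_errors), fold_errors

def majority_baseline(train):
    """Baseline classifier: always predict the training set's majority class."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority

def significantly_better(e1, half_width1, e2, half_width2):
    """Litman-style test: E1 beats E2 iff the upper bound of E1's 95%
    confidence interval lies below the lower bound of E2's interval."""
    return e1 + half_width1 < e2 - half_width2

# Toy dataset mirroring Core1's class split (58.9% not cued):
data = [((), "not-cued")] * 59 + [((), "cued")] * 41
err, _ = ten_fold_cv_error(data, majority_baseline)  # ~0.41, the baseline error
```

With the majority-class predictor, the cross-validated error simply recovers the minority-class proportion, matching the 41.1% baseline figure quoted for Core1.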
(The trees that C4.5 generates are right-branching, so this description is fairly adequate.)</Paragraph> <Paragraph position="4"> The error reduction is 19% for Core2 and only 3% for Impl-core.</Paragraph> <Paragraph position="5"> The best tree was obtained partly by informed choice, partly by trial and error. Automatically trying out all 2^11 = 2048 subsets of the features would be possible, but it would require manual examination of about 2,000 sets of results, a daunting task. Thus, for each dataset we considered only the following subsets of features.</Paragraph> <Paragraph position="6"> 1. All features. This always results in C4.5 selecting a few features (from 3 to 7) for the final tree. 2. Subsets built out of the 2 to 4 attributes appearing highest in the tree obtained by running C4.5 on all features.</Paragraph> <Paragraph position="7"> 3. In Table 2, three features (Trib-pos, Inten-struct, Infor-struct) concern segment structure, and eight do not. We constructed three subsets by always including the eight features that do not concern segment structure, and adding one of those that does. The trees obtained by including Trib-pos, Inten-struct and Infor-struct at the same time are in general more complex, and not significantly better than other trees obtained by including only one of these three features. We attribute this to the fact that these features encode partly overlapping information.</Paragraph> <Paragraph position="8"> Finally, the best tree was obtained as follows. We build the set of trees that are statistically equivalent to the tree with the best error rate (i.e., with the lowest error rate upper bound). Among these trees, we choose the one that we deem the most perspicuous in terms of features and of complexity. Namely, we pick the simplest tree with Trib-Pos as the root if one exists, otherwise the simplest tree.
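The pruned search over feature subsets and the final tree-selection rule described above can be sketched as follows. This is our reading of the procedure, not the authors' implementation; the feature names come from Table 2, and a tree is reduced here to a (root feature, node count) pair for illustration.

```python
from itertools import combinations

STRUCTURE = ["Trib-pos", "Inten-struct", "Infor-struct"]
OTHER = ["Inten-rel", "Infor-rel", "Syn-rel", "Adjacency",
         "Core-type", "Trib-type", "Above", "Below"]
ALL_FEATURES = STRUCTURE + OTHER

def candidate_subsets(top_attributes):
    """Enumerate the feature subsets actually tried, instead of all
    2**11 = 2048: (1) all features; (2) subsets of the 2-4 attributes
    highest in the all-features tree; (3) the eight non-structure
    features plus exactly one segment-structure feature."""
    subsets = [tuple(ALL_FEATURES)]
    for r in range(2, min(4, len(top_attributes)) + 1):
        subsets.extend(combinations(top_attributes, r))
    for s in STRUCTURE:
        subsets.append(tuple(OTHER + [s]))
    return subsets

def pick_best_tree(equivalent_trees):
    """Among trees statistically equivalent to the lowest-error one,
    prefer the simplest tree rooted in Trib-pos if any exists,
    otherwise the simplest tree overall."""
    rooted = [t for t in equivalent_trees if t[0] == "Trib-pos"]
    pool = rooted if rooted else equivalent_trees
    return min(pool, key=lambda t: t[1])

subs = candidate_subsets(["Trib-pos", "Inten-rel", "Syn-rel", "Above"])
```

With four top attributes this yields 1 + (6 + 4 + 1) + 3 = 15 candidate subsets, a far cry from 2048 exhaustive runs.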
Trees that have Trib-Pos as the root are the most useful for text generation, because, given a complex segment, Trib-Pos is the only attribute that unambiguously identifies a specific contributor.</Paragraph> <Paragraph position="9"> Our results make apparent that the structure of segments plays a fundamental role in determining cue occurrence. One of the three features concerning segment structure (Trib-Pos, Inten-Structure, Infor-Structure) appears as the root or just below the root in all trees in Table 4; more importantly, this same configuration occurs in all trees equivalent to the best tree (even if the specific feature encoding segment structure may change). [Table 4 fragment, listing the top features of the induced trees: 0. Trib-pos, 1. Trib-type, 2. Syn-rel, 3. Core-type, 4. Above, 5. Inten-rel (27.4±1.28, 18 nodes); 0. Trib-Pos, 1. Inten-rel, 2. Infor-rel, 3. Above, 4. Core-type, 5. Below (22.1±0.57, 10 nodes); 0. Core-type, 1. Infor-struct, 2. Inten-rel.] The level of embedding in a segment, as encoded by Core-type, Trib-type, Above and Below, also figures prominently.</Paragraph> <Paragraph position="10"> Inten-rel appears in all trees, confirming the intuition that the speaker's purpose affects cue occurrence. More specifically, in Figure 2, Inten-rel distinguishes two different speaker purposes, convince and enable. The same split occurs in some of the best trees induced on Core1, with the same outcome: i.e., convince directly correlates with the occurrence of a cue, whereas for enable other features must be taken into account. (We can't draw any conclusions concerning concede, as there are only 24 occurrences of concede out of 406 core:contributor relations.) Informational relations do not appear as often as intentional relations; their discriminatory power seems more relevant for clusters.</Paragraph> <Paragraph position="11"> Preliminary experiments show that cue occurrence in clusters depends only on informational and syntactic relations.
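The qualitative pattern reported for occurrence and placement, the syntactic relation dominating, with segment structure contributing via Trib-pos, can be illustrated as a tiny decision rule. This is emphatically not the paper's Figure 3 tree: the branches and the Trib-pos condition below are invented for illustration; only the first test (a contributor that syntactically depends on the core gets the cue) reflects a regularity stated in the text.

```python
def place_cue(syn_rel, trib_pos):
    """Illustrative placement rule for a cued Core2 relation.
    syn_rel: syntactic relation between contributor and core
             (e.g. "trib-subordinate", "coordinated", "independent").
    trib_pos: positional label such as "B1A1-1before".
    The non-subordinate branches are hypothetical, for illustration only."""
    if syn_rel == "trib-subordinate":
        # contributor depends on the core: mark the contributor
        return "on-contributor"
    if trib_pos.endswith("1before"):
        # hypothetical: first contributor preceding the core carries the cue
        return "on-contributor"
    return "on-core"
```

The point of the sketch is only the shape of the decision: a syntactic test first, falling back to segment position, as the induced tree in Figure 3 reportedly combines the two best individual features.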
Finally, Adjacency does not seem to play any substantial role.</Paragraph> <Paragraph position="12"> While cue occurrence and placement are interrelated problems, we performed learning on them separately. First, the issue of placement arises only in the case of Core2; for Core1, cues only occur on the contributor. Second, we attempted experiments on Core2 that discriminated between occurrence and placement at the same time, and the derived trees were complex and not perspicuous. Thus, we ran an experiment on the 100 cued relations from Core2 to investigate which factors affect placing the cue on the contributor in first position or on the core in second; see Table 5.</Paragraph> <Paragraph position="13"> We ran the same trials discussed above on this dataset. In this case, the best tree -- see Figure 3 -- results from combining the two best individual features, and reduces the error rate by 50%. The most discriminant feature turns out to be the syntactic relation between the contributor and the core. However, segment structure still plays an important role, via Trib-pos.</Paragraph> <Paragraph position="14"> While the importance of Syn-rel for placement seems clear, its role concerning occurrence requires further exploration. It is interesting to note that the tree induced on Core1 -- the only case in which Syn-rel is relevant for occurrence -- includes the same distinction as in Figure 3: namely, if the contributor depends on the core, the contributor must be marked; otherwise, other features have to be taken into account. Scott and de Souza (1990) point out that &quot;there is a strong correlation between the syntactic specification of a complex sentence and its perceived rhetorical structure.&quot; It seems that certain syntactic structures function as a cue.</Paragraph> </Section> </Section> </Paper>