<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1015">
  <Title>Combining Multiple Knowledge Sources for Discourse Segmentation</Title>
  <Section position="4" start_page="108" end_page="111" type="metho">
    <SectionTitle>
3 Methodology
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="108" end_page="109" type="sub_section">
      <SectionTitle>
3.1 Boundary Classification
</SectionTitle>
      <Paragraph position="0"> We represent each narrative in our corpus as a sequence of potential boundary sites, which occur between prosodic phrases. We classify a potential boundary site as boundary if it was identified as such by at least 3 of the 7 subjects in our earlier study.</Paragraph>
      <Paragraph position="1"> Otherwise it is classified as non-boundary. Agreement among subjects on boundaries was significant at below the .02% level for values ofj ___ 3, where j is  the number of subjects (1 to 7), on all 20 narratives. 2 Fig. 1 shows a typical segmentation of one of the narratives in our corpus. Each line corresponds to a prosodic phrase, and each space between the lines corresponds to a potential boundary site. The bracketed numbers will be explained below. The boxes in the figure show the subjects' responses at each potential boundary site, and the resulting boundary classification. Only 2 of the 7 possible boundary sites are classified as boundary.</Paragraph>
    </Section>
    <Section position="2" start_page="109" end_page="110" type="sub_section">
      <SectionTitle>
3.2 Coding of Linguistic Features
</SectionTitle>
      <Paragraph position="0"> Given a narrative of n prosodic phrases, the n-1 potential boundary sites are between each pair of prosodic phrases Pi and P/+I, i from 1 to n-1. Each potential boundary site in our corpus is coded using the set of linguistic features shown in Fig. 2.</Paragraph>
      <Paragraph position="1"> Values for the prosodic features are obtained by automatic analysis of the transcripts, whose conventions are defined in (Chafe, 1980) and illustrated in Fig. h .... and &amp;quot;?&amp;quot; indicate sentence-final intonational contours; &amp;quot;,&amp;quot; indicates phrase-final but not sentence final intonation; &amp;quot;\[X\]&amp;quot; indicates a pause lasting X seconds; &amp;quot;..&amp;quot; indicates a break in timing too short to be measured. The features before and after depend on the final punctuation of the phrases Pi and Pi+I, respectively. The value is '+sentence.final.contour' if &amp;quot;.&amp;quot; or &amp;quot;?&amp;quot;, 'sentence.final.contour' if &amp;quot;,&amp;quot;. Pause is assigned 'true' if Pi+l begins with \[X\], 'false' otherwise. Duration is assigned X if pause is 'true', 0 otherwise.</Paragraph>
      <Paragraph position="2"> The cue phrase features are also obtained by automatic analysis of the transcripts. Cue1 is assigned 'true' if the first lexical item in PI+I is a member of the set of cue words summarized in (Hirschberg and Litman, 1993). Word1 is assigned this lexical item if 2We previously used agreement by 4 subjects as the threshold for boundaries; for j &gt; 4, agreement was significant at the .01~0 level. (Passonneau and Litman, 1993)  * Prosodic Features - before:+sentence.final.contour,-sentence.flnal.contour - after: +sentence.final.contour,-sentence.flnal.contour.</Paragraph>
      <Paragraph position="3"> - pause: true, false.</Paragraph>
      <Paragraph position="4"> - duration: continuous.</Paragraph>
      <Paragraph position="5"> * Cue Phrase Features - cue1: true, false.</Paragraph>
      <Paragraph position="6"> - word1: also, and, anyway, basically, because, but, finally, first, like, meanwhile, no, now, oh, okay, only, or, see, so, then, well, where, NA.</Paragraph>
      <Paragraph position="7"> -- cue2: true, false.</Paragraph>
      <Paragraph position="8"> - word2: and, anyway, because, boy, but, now, okay, or, right, so, still, then, NA.</Paragraph>
      <Paragraph position="9"> * Noun Phrase Features - coref: +coref,-corer, NA.</Paragraph>
      <Paragraph position="10"> - infer: +infer, -infer, NA.</Paragraph>
      <Paragraph position="11"> - global.pro: +global.pro, -global.pro, NA.</Paragraph>
      <Paragraph position="12"> * Combined Feature -- cue-prosody: complex, true, false.</Paragraph>
      <Paragraph position="13">  cuel is true, 'NA' (not applicable) otherwise, a Cue2 is assigned 'true' if cue, is true and the second lexical item is also a cue word. Word2 is assigned the second lexical item if cue2 is true, 'NA' otherwise. Two of the noun phrase (NP) features are handcoded, along with functionally independent clauses (FICs), following (Passonneau, 1994). The two authors coded independently and merged their results. The third feature, global.pro, is computed from the hand coding. FICs are tensed clauses that are neither verb arguments nor restrictive relatives. If a new FIC (C/) begins in prosodic phrase Pi+I, then NPs in Cj are compared with NPs in previous clauses and the feature values assigned as follows4:  1. corer = '+coref' if Cj contains an NP that co-refers with an NP in Cj-1; else corer= '-cord' 2. infer= '+infer' ifCj contains an NP whose referent can be inferred from an NP in Cj-1 on the basis of a pre-defined set of inference relations; else infer- '-infer' 3. global.pro = '+global.pro' if Cj contains a defi null nite pronoun whose referent is mentioned in a previous clause up to the last boundary assigned by the algorithm; else global.pro = '-global.pro' If a new FIC is not initiated in Pi+I, values for all three features are 'NA'.</Paragraph>
      <Paragraph position="14"> Cue-prosody, which encodes a combination of prosodic and cue word features, was motivated by an analysis of IR errors on our training data, as described in section 4. Cue-prosody is 'complex' if: aThe cue phrases that occur in the corpus &amp;re shown as potential values in Fig. 2.</Paragraph>
      <Paragraph position="15">  1. before = '+sentence.final.contour' 2. pause = 'true' 3. And either: (a) cuet = 'true', wordt ~ 'and' (b) cuet = 'true', word1 = 'and', cue2 = 'true', word2 C/ 'and'  Else, cue-prosody has the same values as pause. Fig. 3 illustrates how the first boundary site in Fig. 1 would be coded using the features in Fig. 2. The prosodic and cue phrase features were motivated by previous results in the literature. For example, phrases beginning discourse segments were correlated with preceding pause duration in (Grosz and Hirschberg, 1992; ttirschberg and Grosz, 1992). These and other studies (e.g.~ (iiirschberg and Litman, 1993)) also found it useful to distinguish between sentence and non-sentence final intonational contours. Initial phrase position was correlated with discourse signaling uses of cue words in (Hirschberg and Litman, 1993); a potential correlation between discourse signaling uses of cue words and adjacency patterns between cue words was also suggested. Finally, (Litman, 1994) found that treating cue phrases individually rather than as a class enhanced the results of (iiirschberg and Litman, 1993).</Paragraph>
      <Paragraph position="16"> Passonneau (to appear) examined some of the few claims relating discourse anaphoric noun phrases to global discourse structure in the Pear corpus. Resuits included an absence of correlation of segmental structure with centering (Grosz et al., 1983; Kameyama, 1986), and poor correlation with the contrast between full noun phrases and pronouns. As noted in (Passonneau and Litman, 1993), the NP features largely reflect Passonneau's hypotheses that adjacent utterances are more likely to contain expressions that corefer, or that are inferentially linked, if they occur within the same segment; and that a definite pronoun is more likely than a full NP to refer to an entity that was mentioned in the current segment, if not in the previous utterance.</Paragraph>
    </Section>
    <Section position="3" start_page="110" end_page="111" type="sub_section">
      <SectionTitle>
3.3 Evaluation
</SectionTitle>
      <Paragraph position="0"> The segmentation algorithms presented in the next two sections were developed by examining only a training set of narratives. The algorithms are then evaluated by examining their performance in predicting segmentation on a separate test set. We currently use 10 narratives for training and 5 narratives for testing. (The remaining 5 narratives are reserved for future research.) The 10 training narratives  range in length from 51 to 162 phrases (Avg.=101.4), or from 38 to 121 clauses (Avg.=76.8). The 5 test narratives range in length from 47 to 113 phrases (Avg.=S7.4), or from 37 to 101 clauses (Avg.=69.0).</Paragraph>
      <Paragraph position="1"> The ratios of test to training data measured in narratives, prosodic phrases and clauses, respectively, are 50.0%, 43.1% and 44.9%. For the machine learning algorithm we also estimate performance using cross-validation (Weiss and Kulikowski, 1991), as detailed in Section 5.</Paragraph>
      <Paragraph position="2"> To quantify algorithm performance, we use the information retrieval metrics shown in Fig. 4. Recall is the ratio of correctly hypothesized boundaries to target boundaries. Precision is the ratio of hypothesized boundaries that are correct to the total hypothesized boundaries. (Cf. Fig. 4 for fallout and error.) Ideal behavior would be to identify all and only the target boundaries: the values for b and c in Fig. 4 would thus both equal O, representing no errors. The ideal values for recall, precision, fallout, and error are 1, 1, 0, and 0, while the worst values are 0, 0, 1, and 1. To get an intuitive summary of overall performance, we also sum the deviation of the observed value from the ideal value for each metric: (1-recall) + (1-precision) + fallout + error. The summed deviation for perfect performance is thus 0.</Paragraph>
      <Paragraph position="3"> Finally, to interpret our quantitative results, we use the performance of our human subjects as a target goal for the performance of our algorithms (Gale et al., 1992). Table 1 shows the average human performance for both the training and test sets of narratives. Note that human performance is basically the same for both sets of narratives. However, two  factors prevent this performance from being closer to ideal (e.g., recall and precision of 1). The first is the wide variation in the number of boundaries that subjects used, as discussed above. The second is the inherently fuzzy nature of boundary location. We discuss this second issue at length in (Passonnean and Litman, to appear), and present relaxed IR metrics that penalize near misses less heavily in (Litman and Passonneau, 1995).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="111" end_page="112" type="metho">
    <SectionTitle>
4 Hand Tuning
</SectionTitle>
    <Paragraph position="0"> To improve performance, we analyzed the two types of IR errors made by the original NP algorithm (Passonneau and Litman, 1993) on the training data.</Paragraph>
    <Paragraph position="1"> Type &amp;quot;b&amp;quot; errors (cf. Fig. 4), mis-classification of non-boundaries, were reduced by changing the coding features pertaining to clauses and NPs. Most &amp;quot;b&amp;quot; errors correlated with two conditions used in the NP algorithm, identification of clauses and of inferential links. The revision led to fewer clauses (more assignments of 'NA' for the three NP features) and more inference relations. One example of a change to clause coding is that formulaic utterances having the structure of clauses, but which function like interjections, are no longer recognized as independent clauses. These include the phrases let's see, let me see, I don't know, you know when they occur with no verb phrase argument. Other changes pertained to sentence fragments, unexpected clausal arguments, and embedded speech.</Paragraph>
    <Paragraph position="2"> Three types of inference relations linking successive clauses (Ci-1, Ci) were added (originally there were 5 types (Passonneau, 1994)). Now, a pronoun (e.g., it, that, this) in Ci referring to an action, event or fact inferrable from Ci-1 links the two clauses. So does an implicit argument, as in Fig. 5, where the missing argument of notice is inferred to be the event of the pears falling. The third case is where an NP in Ci is described as part of an event that results directly from an event mentioned in Ci-1.</Paragraph>
    <Paragraph position="3"> &amp;quot;C&amp;quot; type errors (cf. Fig. 4), mis-classification of boundaries, often occurred where prosodic and cue features conflicted with NP features. The original NP algorithm assigned boundaries wherever the three values '-coref', '-infer', '-global.pro' (defined in section 3) co-occurred, represented as the first conditional statement of Fig. 6. Experiments led to the hypothesis that the most improvement came by assigning a boundary if the cue-prosody feature had the value 'complex', even if the algorithm would not otherwise assign a boundary, as shown in Fig. 6.</Paragraph>
    <Paragraph position="4">  We refer to the original NP algorithm applied to the initial coding as Condition 1, and the tuned algorithm applied to the enriched coding as Condition 2. Table 2 presents the average IR scores across the narratives in the training set for both conditions. Reduction of &amp;quot;b&amp;quot; type errors raises precision, and lowers fallout and error rate. Reduction of &amp;quot;c&amp;quot; type errors raises recall, and lowers fallout and error rate. All scores improve in Condition 2, with precision and fallout showing the greatest relative improvement. The major difference from human performance is relatively poorer precision.</Paragraph>
    <Paragraph position="5"> The standard deviations in Table 2 are often close to 1/4 or 1/3 of the reported averages. This indicates a large amount of variability in the data, reflecting wide differences across narratives (speakers) in the training set with respect to the distinctions recognized by the algorithm. Although the high standard deviations show that the tuned algorithm is not well fitted to each narrative, it is likely that it is overspecialized to the training sample in the sense that test narratives are likely to exhibit further variation.</Paragraph>
    <Paragraph position="6"> Table 3 shows the results of the hand tuned algorithm on the 5 randomly selected test narratives on both Conditions 1 and 2. Condition 1 results, the untuned algorithm with the initial feature set, are very similar to the training set except for worse precision. Thus, despite the high standard deviations, 10 narratives seems to have been a sufficient sample size for evaluating the initial NP algorithm.</Paragraph>
    <Paragraph position="7"> Condition 2 results are better than condition 1 in Table 3, and condition 1 in Table 2. This is strong evidence that the tuned algorithm is a better predictor of segment boundaries than the original NP algorithm. Nevertheless, the test results of condition 2 are much worse than the corresponding training results, particularly for precision (.44 versus .62). This  confirms that the tuned algorithm is over calibrated to the training set.</Paragraph>
  </Section>
  <Section position="6" start_page="112" end_page="113" type="metho">
    <SectionTitle>
5 Machine Learning
</SectionTitle>
    <Paragraph position="0"> We use the machine learning program C4.5 (Quinlan, 1993) to automatically develop segmentation algorithms from our corpus of coded narratives, where each potential boundary site has been classified and represented as a set of linguistic features. The first input to C4.5 specifies the names of the classes to be learned (boundary and non-boundary), and the names and potential values of a fixed set of coding features (Fig. 2). The second input is the training data, i.e., a set of examples for which the class and feature values (as in Fig. 3) are specified. Our training set of 10 narratives provides 1004 examples of potential boundary sites. The output of C4.5 is a classification algorithm expressed as a decision tree, which predicts the class of a potential boundary given its set of feature values.</Paragraph>
    <Paragraph position="1"> Because machine learning makes it convenient to induce decision trees under a wide variety of conditions, we have performed numerous experiments, varying the number of features used to code the training data, the definitions used for classifying a potential boundary site as boundary or non-boundary 5 and the options available for running the C4.5 program. Fig. 7 shows one of the highest-performing learned decision trees from our experiments. This decision tree was learned under the following conditions: all of the features shown in Fig. 2 were used to code the training data, boundaries were classified as discussed in section 3, and C4.5 was run using only the default options. The decision tree predicts the class of a potential boundary site based on the features before, after, duration, cuel, wordl, corer, infer, and global.pro. Note that although not all available features are used in the tree, the included features represent 3 of the 4 general types of knowledge (prosody, cue phrases and noun phrases). Each level of the tree specifies a test on a single feature, with a branch for every possible outcome of the test. 6 A branch can either lead to the assignment of a class, or to another test. For example, the tree initially branches based on the value of the feature before.</Paragraph>
    <Paragraph position="2"> If the value is '-sentence.final.contour' then the first branch is taken and the potential boundary site is assigned the class non-boundary. If the value of before is 'q-sentence.final.contour' then the second branch is taken and the feature corer is tested.</Paragraph>
    <Paragraph position="3"> The performance of this learned decision tree averaged over the 10 training narratives is shown in Table 4, on the line labeled &amp;quot;Learning 1&amp;quot;. The line labeled &amp;quot;Learning 2&amp;quot; shows the results from another 5(Litman and Passonneau, 1995) varies the number of subjects used to determine boundaries.</Paragraph>
    <Paragraph position="4"> eThe actual tree branches on every value of worda; the figure merges these branches for clarity.</Paragraph>
    <Paragraph position="5"> if before = -sentence.final.contour then non.boundary elaeif before = +sentence.final.contour then ifcoref = NA then non-boundary elseif coref = +corer then if after ----. +sentence.final.contour then if duration &lt;__ 1.3 then non-boundary elself duration &gt; 1.3 then boundary elseif after = -sentence.final.contour then if word 1 E {also,basically, because,finally, first,like, meanwhile,no,oh,okay, only, aee,so,well,where,NA} then non-boundary else|f word 1 E {anyway, but,now,or,then} then boundary else|f word I = and then if duration &lt; 0.6 then non-boundary  elseifdurat~on &gt; 0.6 then boundary elseif coref = -corer then if infer = +infer then non-boundary elself infer = NA then boundary elseifinfer = -infer then if after = -sentence.final.contour then boundary elself after = +sentence.final.contour then if cue 1 = true then if global.pro = NA then boundary elseif global.pro = -global.pro then boundary elself global.pro = +global.pro then  machine learning experiment, in which one of the default C4.5 options used in &amp;quot;Learning 1&amp;quot; is overridden. The &amp;quot;Learning 2&amp;quot; tree (not shown due to space restrictions) is more complex than the tree of Fig. 7, but has slightly better performance. Note that &amp;quot;Learning 1&amp;quot; performance is comparable to human performance (Table 1), while &amp;quot;Learning 2&amp;quot; is slightly better than humans. The results obtained via machine learning are also somewhat better than the results obtained using hand tuning--particularly with respect to precision (&amp;quot;Condition 2&amp;quot; in Table 2), and are a great improvement over the original NP results (&amp;quot;Condition 1&amp;quot; in Table 2).</Paragraph>
    <Paragraph position="6"> The performance of the learned decision trees averaged over the 5 test narratives is shown in Table 5. Comparison of Tables 4 and 5 shows that, as with the hand tuning results (and as expected), average performance is worse when applied to the testing rather than the training data particularly with respect to precision. However, performance is an improvement over our previous best results (&amp;quot;Condition 1&amp;quot; in Table 3), and is comparable to (&amp;quot;Learning 1&amp;quot;) or very slightly better than (&amp;quot;Learning 2&amp;quot;) the hand tuning results (&amp;quot;Condition 2&amp;quot; in Table 3).</Paragraph>
    <Paragraph position="7"> We also use the resampling method of cross-validation (Weiss and Kulikowski, 1991) to estimate performance, which averages results over multiple partitions of a sample into test versus training data. We performed 10 runs of the learning program, each using 9 of the 10 training narratives for that run's  training set (for learning the tree) and the remaining narrative for testing. Note that for each iteration of the cross-validation, the learning process begins from scratch and thus each training and testing set are still disjoint. While this method does not make sense for humans, computers can truly ignore previous iterations. For sample sizes in the hundreds (our 10 narratives provide 1004 examples) 1O-fold cross-validation often provides a better performance estimate than the hold-out method (Weiss and Kulikowski, 1991). Results using cross-validation are shown in Table 6, and are better than the estimates obtained using the hold-out method (Table 5), with the major improvement coming from precision. Because a different tree is learned on each iteration, the cross-validation evaluates the learning method, not a particular decision tree.</Paragraph>
  </Section>
class="xml-element"></Paper>