<?xml version="1.0" standalone="yes"?> <Paper uid="J97-1005"> <Title>Discourse Segmentation by Human and Automated Means</Title> <Section position="5" start_page="107" end_page="111" type="metho"> <SectionTitle> 3. Intention-Based Segmentation </SectionTitle> <Paragraph position="0"> Here we present the results of a study investigating the ability of naive subjects to identify the same segments in a corpus of spoken narrative discourse. Our first goal is purely exploratory. Despite the wide agreement that discourse structure and linguistic form are mutually constraining, there is little agreement on how to determine the structure of any particular discourse. Thus we do not assume that there are &quot;correct&quot; segmentations against which to judge subjects' responses. Also, as discussed in our previous work (Passonneau and Litman 1996), the subjects' performance suggests that segmentation is a fuzzy phenomenon. Because our study is exploratory, we took the conservative approach of defining a very open-ended segmentation task that allowed subjects great freedom in the number and size of the segments to identify. Our statistical results indicate that, despite the freedom of the task, naive subjects independently perform surprisingly similar segmentations of the same discourse. We also show by example that subjects' segmentations reflect the presumed episode structure of the narrative.</Paragraph> <Paragraph position="1"> We ask subjects to segment discourse using a nonlinguistic criterion in order to avoid circularity when we later investigate the correlation of linguistic devices with segments derived from the segmentation task results. Abstracting statistically significant results from the subjects' responses is thus the second goal of our study of the segmentation task. Here we briefly review our statistical results and summarize the motivation for our method of abstracting a single segmentation for a given narrative from a set of subjects' responses. As noted below, more detailed discussion of the statistic we use is presented elsewhere. What we also discuss here, which has not been presented in previous work, is a preliminary evaluation of the reliability of our method where we give a conservative lower bound suggesting that the method is reliable.</Paragraph> <Paragraph position="2"> Passonneau and Litman Discourse Segmentation</Paragraph> <Section position="1" start_page="108" end_page="111" type="sub_section"> <SectionTitle> 3.1 Methodology: Empirically Derived Segmentation </SectionTitle> <Paragraph position="0"> The claim has been made that different people (investigators or subjects) are likely to assign similar segment boundaries or segment relations to a discourse (Grosz and Sidner 1986; Reichman 1985; Mann and Thompson 1988), but it has also been observed that discourse structure can be ambiguous (Pierrehumbert and Hirschberg 1987). Studies asking subjects to assign topical units to sample discourses have shown that the resulting segments vary widely in both size and location (Rotondo 1984). Yet until recently, there has been little attempt to quantify the degree of variability among subjects in performing such a task. Here we present the results of our study of naive subjects performing a relatively unstructured segmentation task on a corpus of similar discourses. 
Full details are presented in Passonneau and Litman (1996).</Paragraph> <Paragraph position="1"> The corpus consists of 20 spoken narrative monologues known as the Pear stories, originally collected by Chafe (1980). Chafe recorded and transcribed subjects who had been asked to view the same movie and describe it to a second person. The movie contained 7 sequential episodes about a man picking pears. Chafe identified three types of prosodic phrases from graphic displays of intonation contours, as described in Section 4.1. The corpus contains just over 2,000 prosodic phrases with roughly 13,500 words.</Paragraph> <Paragraph position="2"> For our study, each narrative was segmented by seven naive subjects (as opposed to trained researchers or trained coders), using an informal notion of communicative intention as the segmentation criterion. Except in rare cases, no subject segmented more than I narrative. As discussed above, a variety of criteria for identifying discourse units have been proposed. Our decision to use a commonsense notion of intention as the criterion is aimed at giving the subjects the freedom to choose their own segmentation criteria, and to modify the criteria to fit the evolving discourse.</Paragraph> <Paragraph position="3"> Two structural constraints were also imposed on the content units that subjects were asked to identify. First, subjects were asked to perform a linear rather than a hierarchical segmentation, where a linear segmentation simply consists of dividing a narrative into sequential units. Second, subjects were restricted to placing boundaries between the prosodic phrases identified by Chafe (1980). Subjects were presented with transcripts of the narratives formatted so that each non-indented new line was the beginning of a new prosodic phrase. The pause locations and durations transcribed by Chafe (see Section 4.1.2) were omitted, but otherwise all lexical and nonlexical articulations were retained. The instructions given to the subjects were designed to have as little bias as possible regarding segment size, and total number of segments. 3 As we discuss further below, both the rate at which subjects assigned boundaries and the size of segments varied widely.</Paragraph> <Paragraph position="4"> Figure 2 shows the subjects' responses for the excerpt corresponding to Figure 1.</Paragraph> <Paragraph position="5"> The potential boundary sites are between the text lines corresponding to prosodic phrases. The left column shows the prosodic phrase numbers, which are explained later. There are 19 phrases, hence 18 boundary sites. The seven subjects are differentiated by distinct letters of the alphabet. Note that a majority of subjects agreed on only 3 of the 18 possible boundary sites, corresponding to the segmentation illustrated in Figure 1. In general, subjects assigned boundaries at quite distinct rates, thus agreement among subjects is necessarily imperfect. All subjects assigned boundaries relatively infrequently. On average, subjects assigned boundaries at only 16.1% of the potential boundary sites (min = 5.5%; max = 41.3%) in any one narrative. 
Figure 2: Sample of subjects' responses. Markers of the form [N SUBJECTS (...)] show which of the seven subjects (a-g) placed a boundary at that site.
Boundary
22.2 there are three little boys,
22.3 up on the road a little bit,
22.4 and they see this little accident.
23.1 And u-h they come over,
23.2 and they help him,
23.3 and you know,
[1 SUBJECT (c)]
23.4 help him pick up the pears and everything.
[5 SUBJECTS (a, d, e, f, g)]
24.1 A-nd the one thing that struck me about the- three little boys that were there,
[1 SUBJECT (b)]
24.2 is that one had ay uh I don't know what you call them,
24.3 but it's a paddle,
24.4 and a ball-,
24.5 is attached to the paddle,
24.6 and you know you bounce it?
[2 SUBJECTS (a, d)]
25.1 And that sound was really prominent.
[4 SUBJECTS (d, e, f, g)]
26.1 Well anyway,
[2 SUBJECTS (b, c)]
26.2 so- u-m tsk all the pears are picked up,
26.3 and he's on his way again,</Paragraph> <Paragraph position="9"> Boundary locations were relatively independent of one another, as shown by the fact that segments varied in size from 1 to 49 phrases in length (Avg. = 5.9). The assumption of independence is important for motivating statistical analyses of how probable the observed distributions are.</Paragraph> <Paragraph position="10"> Figure 3 shows two bar charts. The one on the left gives the results for the full narrative excerpted in Figure 1. The x-axis is the number of subjects, from 0 to 7. The y-axis, from top to bottom, corresponds to the potential boundary locations, with prosodic phrase locations numbered as in Figure 2. Each horizontal bar thus represents the number of subjects assigning a boundary at a particular interphrase location. Interestingly, there were 6 segment boundaries identified by at least five subjects, yielding 7 segments that correspond closely to the 7 sequential episodes that Chafe (1980) used to describe the movie. The first 5 segments correspond to the first 5 episodes. The 6th segment corresponds to the 6th episode plus the beginning of the 7th, while the 7th segment corresponds to the end of the 7th episode.</Paragraph> <Paragraph position="11"> The large proportion of white space to black space in the left bar chart of Figure 3 illustrates graphically that subjects assign boundaries relatively infrequently. The large regions of white space separated by very wide bars show a striking consensus on certain segments (white space) and segment boundaries (wide black bars). To illustrate graphically the improbability of the occurrence of wide bars (high-consensus boundaries), we also show a typical random distribution for a parallel data set in the right-hand bar chart of Figure 3. To create this data, we repeatedly performed the following experiment, and randomly selected one result. First we created seven hypothetical subjects, each of whom assigns the same number of boundaries as one of the real subjects from the same number of potential boundary slots. The hypothetical subjects assign boundaries randomly (but with no repetition). In the random distribution, there are few bars of width 3, and none of any greater width.</Paragraph> <Paragraph position="13"> Figure 4: Frequency that N subjects identify any boundary slot as a boundary.</Paragraph> <Paragraph position="15"> We show below that, given the loosely structured task, the probability of the observed distribution depicted in Figure 3 is extremely low, hence highly significant.
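The random baseline shown in the right-hand chart of Figure 3 can be reproduced in outline as follows. This is only a sketch of the procedure described above (hypothetical subjects who keep each real subject's boundary count but place their boundaries at random, with no slot repeated by the same subject); the counts used in the example are invented, not the actual data.

```python
import random
from collections import Counter

def random_distribution(boundary_counts, num_slots, seed=None):
    """For each real subject's boundary count, create a hypothetical subject
    who marks that many randomly chosen slots (no repetition within a subject),
    then tally how many subjects chose each slot."""
    rng = random.Random(seed)
    slot_votes = Counter()
    for count in boundary_counts:
        for slot in rng.sample(range(num_slots), count):
            slot_votes[slot] += 1
    return slot_votes

# Hypothetical example: seven subjects' boundary counts for a 100-phrase
# narrative (99 potential boundary slots); real counts ranged from roughly
# 5% to 41% of the slots.
votes = random_distribution([6, 10, 14, 16, 20, 30, 41], num_slots=99, seed=1)
agreement_widths = Counter(votes.values())  # how often N subjects coincide by chance
print(agreement_widths)
```

Running such simulations many times shows how rarely chance alone produces the wide bars (four or more subjects agreeing) seen in the observed data.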
The statistical test we use identifies x ≥ 3 as the threshold separating insignificant boundaries from significant ones. The large scattering of narrow bars (1 ≤ x ≤ 2) illustrates the inherent noisiness of the data arising from the fact that subjects assign boundaries at varying rates. The histogram in Figure 4 gives a different view of the same point, showing the relative frequency of cases where N subjects place a boundary at a given location, for N from 0 to 7. The y-axis is normalized to represent the average narrative length of 100 phrases, thus the bar at N = 0 indicates that on average, 47.8 of the 100 phrases were not classified as boundaries. The large majority of responses (80%) fall within the bars for N = 0 (47.8%), N = 1 (23.0%), and N = 2 (10.0%), forming a rapidly descending curve. For N = 3 and above, the slope of the curve suddenly becomes linear, and much less steep, corresponding to a much more gradual decrease in frequency as values of N go to 7. That there should be any cases where six or seven subjects identify the same boundary is highly improbable, but on average, this happens 4.5 times per narrative. Summing the heights of the bars for N = 3 through N = 7 indicates that for an average narrative whose length is 100 phrases, there will be about 20 boundaries identified by three or more subjects.</Paragraph> </Section> <Section position="2" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 3.2 Results </SectionTitle> <Paragraph position="0"> 3.2.1 Evaluation Metrics. Again, our first goal in evaluating the segmentation data from our subjects is to explore the possibility that subjects given as little guidance as possible might yet recognize rather similar segments in the narrative corpus. To make this evaluation, we first use a significance test of the null hypothesis that the distributions could have arisen by chance. We then analyze the distributions in more detail to determine what aspects of the distribution are significant, and thereby to abstract significant data for use in defining segmentations for each narrative. The results indicate that the observed distributions are highly significant, i.e., unlikely to have arisen by chance. In Section 3.2.2, we briefly review Cochran's Q (1950), the statistic that we use, and the test of the null hypothesis. We then partition Cochran's Q to determine the lowest value on the x-axis in Figure 3 at which agreements on boundaries become statistically significant. The results indicate significance arises when at least three subjects agree on a boundary.</Paragraph> <Paragraph position="1"> Reliability metrics (Krippendorff 1980; Carletta 1996) are designed to give a robust measure of how well distinct sets of data agree with, or replicate, one another. They are sensitive to the relative proportion of the different data types (e.g., boundaries versus nonboundaries), but insensitive to the statistical likelihood that agreements will occur. We have already discussed how variable the subjects' responses are, both in number and placement of segment boundaries, so we know that our subjects are not replicating the same behavior. However, all 20 narratives show the same pattern of responses as illustrated in Figure 3: certain boundaries are identified by large numbers of subjects. For any one narrative, we should expect a new set of seven subjects to yield roughly the same set of segment boundaries.
In other words, our method for abstracting a single set of boundaries from the responses of multiple subjects should be reproducible. In Section 3.2.3, we evaluate our method by using Krippendorff's α to evaluate the reliability of boundaries derived from one set of subjects compared with those derived from another set of subjects on the same narrative.</Paragraph> </Section> </Section> <Section position="6" start_page="111" end_page="133" type="metho"> <Paragraph position="0"> Footnote 4: Since the narratives vary in length and in relative frequency of boundaries placed by subjects, we normalized the data before averaging across narratives. Where L is the length of a narrative i, the actual frequency of cases where N subjects agree in narrative i is multiplied by 100/L, where 100 is the average narrative length.</Paragraph> <Paragraph position="1"> Finally, for purposes of comparison with other studies of segmentation, we report percent agreement. Percent agreement is high, but as argued in Krippendorff (1980), percent agreement is relatively uninformative because it fails to take into account the response rate of individual subjects, a factor built into both Cochran's Q and Krippendorff's α.</Paragraph> <Paragraph position="2"> The segmentation data consist of a set of 20 i x j matrices, each of the form shown in Table 1. Each matrix has a height of i = 7 subjects and width of j = n prosodic phrases less 1. (Table 1 is a partial matrix of width j = 11.) The value in a cell Ci,j is a 1 if the ith subject assigned a boundary at site j, and blank if they did not. We use Cochran's Q to evaluate the significance of the distributions in the matrices. Cochran's test evaluates the null hypothesis that the sums of 1s in the columns, representing the total number of subjects assigning a boundary at the jth site (Tj), are randomly distributed. It does so by evaluating the significance of the differences in column totals (Tj) across the matrix, with each row total ui (or total number of boundaries assigned by subject i) assumed to be fixed. Where the average column total is T̄, the statistic is given by:</Paragraph> <Paragraph position="3"> Q = j(j - 1) Σj (Tj - T̄)² / (j Σi ui - Σi ui²)</Paragraph> <Paragraph position="4"> Our results indicate that the agreement among subjects is extremely significant. For the 20 narratives, the probabilities that the observed distributions could have arisen by chance range from p = .1 x 10^-6 to p = .6 x 10^-9.</Paragraph> <Paragraph position="5"> We now turn to the second question addressed in the segmentation study, how to abstract a set of empirically justified boundaries from the data. We do this by selecting the statistically significant response data. Recall the large amounts of white space in Figure 3, contrasting with the few sharp peaks where many subjects identify the same boundary, which suggests that the significance of Q owes most to the cases where columns have many 1s. The question is, how many 1s is significant? We address this question by partitioning Q into distinct components for each possible value of Tj (0 to 7) (Cochran 1950). Partitioning Q by the 8 values of Tj shows that Qj is significant at the p = .0001 level for each distinct Tj ≥ 4 across all narratives. Probabilities become more significant for higher levels of Tj, and the converse. At Tj = 3, p is significant at the .01 level on 19 narratives, and for the remaining narrative p = .0192.
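As an illustration of the computation just described, the following sketch computes Cochran's Q from a subjects-by-sites boundary matrix. It is a minimal reading of the formula above, not the authors' code; the chi-square reference distribution (with j - 1 degrees of freedom) is the standard way Cochran's Q is evaluated and is assumed here rather than taken from the text, and the example data are invented.

```python
import random

def cochrans_q(matrix):
    """Cochran's Q for a list of rows (one per subject), each a list of 0/1
    boundary judgments over the j potential boundary sites."""
    j = len(matrix[0])                                               # number of sites
    col_totals = [sum(row[s] for row in matrix) for s in range(j)]   # Tj
    row_totals = [sum(row) for row in matrix]                        # ui
    t_bar = sum(col_totals) / j
    numerator = j * (j - 1) * sum((t - t_bar) ** 2 for t in col_totals)
    denominator = j * sum(row_totals) - sum(u * u for u in row_totals)
    return numerator / denominator

# Toy example: 7 subjects, 18 potential boundary sites.
random.seed(0)
subjects = [[1 if random.random() < 0.15 else 0 for _ in range(18)] for _ in range(7)]
q = cochrans_q(subjects)
# Under the null hypothesis, Q is approximately chi-square distributed with
# j - 1 degrees of freedom, from which a p-value can be obtained.
print(q)
```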
When we look at correlation of segment boundaries with linguistic features, we use both thresholds Tj ≥ 4 and Tj ≥ 3 to select a set of empirically justified boundaries. On average, this gives us 12 (Tj ≥ 4) or 20 (Tj ≥ 3) boundaries for a 100-phrase narrative.</Paragraph> <Paragraph position="6"> 3.2.3 Reliability. Reliability metrics provide a measure of the reproducibility of a data set, for example, across conditions or across subjects. Recently, discourse studies have used reliability metrics designed for evaluating classification tasks to determine whether coders can classify various phenomena in discourse corpora, as discussed in Section 2.1. The segmentation task reported here is not properly a classification task, in that we do not presume that there is a given set of segment boundaries that subjects are likely to identify. Given the freedom of the task and the use of untrained subjects, a reliability test would be relatively uninformative: it can be expected to range from very low to very high. In fact, sorting the 140 subjects into comparable pairs (i.e., subjects assigning a similar number of boundaries), a reliability metric that ranges between 1 for perfect reliability and -1 for perfect unreliability (Krippendorff's α, discussed below) gives a wide spread of reliability values (from -.3 to .9; average = .34). Our method aims at abstracting away from the absolute differences across multiple subjects per narrative (N = 7) to derive a statistically significant set of segment boundaries.</Paragraph> <Paragraph position="7"> Thus, an appropriate test of whether our method is statistically reliable would be to compare two repetitions of the method on the same narratives to see if the results are reproducible.</Paragraph> <Paragraph position="8"> Although we do not have enough subjects on any single narrative to compare two distinct sets of seven subjects, we do have four narratives with data from eight distinct subjects. For each set of eight subjects, we created two randomly selected partitions (A and B) with four distinct subjects in each. Then we assessed reliability by comparing the boundaries produced by partitions A and B on the four narratives (using a boundary threshold of at least three subjects). Because we only have four subjects within each partition, this necessarily produces fewer significant boundaries than our method. In other words, this test can only give us a conservative lower bound for reliability. (Recall that significance of a boundary increases exponentially with the number of subjects who agree on a boundary.) But even with this conservative evaluation, reliability is fairly good on two narratives, and promising on average.</Paragraph> <Paragraph position="9"> A reliability measure indicates how reproducible a data set is by quantifying similarity across subjects in terms of the proportion of times that each response category occurs. This differs from a significance test of the null hypothesis (e.g., our use of Cochran's Q), where observed data is compared to a random distribution. We use Krippendorff's α (1980) to evaluate the reliability of the two data sets from partitions A and B. The general formula for α is 1 - Do/DE, where Do and DE are observed and expected disagreements, respectively. Computation of α is described below.</Paragraph> <Paragraph position="10"> Krippendorff's α reports to what degree the observed number of matches could be expected to arise by chance.
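The simplified two-coder computation that the next paragraphs walk through can be written down directly. The sketch below assumes the dichotomous case described there (two rows of 1/blank judgments over the same j sites) and reproduces the worked figures quoted in the text; it is an illustration under those assumptions, not the authors' implementation.

```python
def krippendorff_alpha_binary(row_a, row_b):
    """Simplified Krippendorff's alpha for two coders and one dichotomous
    variable. Rows are sequences of 0/1 judgments over the same j sites."""
    j = len(row_a)
    mismatches = sum(1 for a, b in zip(row_a, row_b) if a != b)  # M
    n1 = sum(row_a) + sum(row_b)     # total number of 1s across both rows
    n0 = 2 * j - n1                  # total number of blanks (0s)
    d_observed = mismatches / j
    d_expected = (n0 * n1) / (j * (2 * j - 1))
    return 1 - d_observed / d_expected   # equivalently 1 - (2j-1)*M / (n0*n1)

# Example from the text: j = 11, M = 2, n1 = 4, n0 = 18 -> alpha ~ .42
a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
b = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(round(krippendorff_alpha_binary(a, b), 2))   # 0.42

# Maximal-disagreement example: j = 12, n1 = n0 = 12, M = 12 -> alpha ~ -0.92
c = [1, 0] * 6
d = [0, 1] * 6
print(round(krippendorff_alpha_binary(c, d), 2))   # -0.92
```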
Again in contrast with Cochran's Q, it is simply a ratio rather than a point on a distribution curve with known probabilities. Values range from 1 to -1, with 0 representing that there are no more agreements observed in the data than would happen by chance. A value of .5 would indicate that the observed number of agreements is halfway between chance and perfect agreement.</Paragraph> <Paragraph position="11"> Negative values indicate the degree to which observed disagreements differ from chance. In principle, α is computed from the same type of matrix shown in Table 1, and can be applied to multivalued variables that are quantitative or qualitative. Here we summarize computation of a simplified formula for α used for comparing two data sets with a single, dichotomous variable. To exemplify the computation, we use the first two rows of Table 1, giving a matrix of size i = 2 x j = 11. The value of Do (proportion of observed disagreements) is then simply M/j, where M is the total number of mismatches (j being the potential number of matches). In our example, Do has a value of 2/11 (.18). Where n1 is the total number of 1s and n0 is the total number of blanks, DE is given by (n0 x n1) / (j(2j - 1)). In our example, DE is (18 x 4)/231 (.31). The detailed formula for α then simplifies to: α = 1 - ((2j - 1)M) / (n0 n1). This gives α = .42, meaning that the observed case of one agreement out of two potential agreements on boundaries in our example is not quite halfway between chance and perfect agreement. Consider a case where two subjects had 12 responses each (j = 12), each subject responded with 1 half the time (n1 = n0 = 12), and wherever one put a 1, the other did not (M = 12). The data contains the maximum number of disagreements, yet α = -0.92, not quite -1, meaning that a small proportion of the observed disagreement would have arisen by chance.</Paragraph> <Paragraph position="12"> Table 2 presents the reliability results from a comparison of boundaries found by two distinct partitions of subjects' responses on four narratives. An α of .80 using two partitions of seven subjects would represent very good reproducibility, with values above .67 being somewhat good (Krippendorff 1980). Note that reliability on narrative 7 (.73) is good despite the small number of subjects. Since, as noted above, we would expect reliability to be much higher if there were seven subjects, we believe that values above .5 for N = 4 subjects indicate reproducibility. On average α = .58 and the spread is low (σ = .09).</Paragraph> <Paragraph position="13"> Chance rates of agreement are taken into account in the significance and reliability evaluation metrics, unlike percent agreement. However, we also report percent agreement in order to compare results with other studies. As defined in Gale, Church, and Yarowsky (1992), percent agreement is the ratio of observed agreements with the majority opinion to possible agreements with the majority opinion. As detailed in Passonneau and Litman (1996), the average percent agreement for our subjects on all 20 narratives is 89% (max. = 92%; min. = 82%). On average, percent agreement is highest on nonboundaries (91%; max. = 95%; min. = 84%) and lowest on boundaries (73%; max. = 80%; min. = 60%), reflecting the fact that nonboundaries greatly outnumber boundaries.</Paragraph> <Paragraph position="14"> These figures compare with other studies (74% to 95% in Grosz and Hirschberg [1992], depending upon discourse feature, and greater than 80% in Hearst [1993]).</Paragraph> <Paragraph position="15"> 3.2.5 Discussion.
We have shown that an atheoretical notion of speaker intention is understood sufficiently uniformly by naive subjects to yield highly significant agreement across subjects on segment boundaries in a corpus of spoken narratives. Probabilities of the observed distributions range from .6 x 10 -9 to .1 x 10 -6 as given by Cochran's Q. The result is all the more striking given that we used naive coders on a loosely defined task. Subjects were free to assign any number of boundaries, and to label their segments with anything they judged to be the narrator's communicative intention. Partitioning Cochran's Q shows that the proportion of boundaries identified by at least three subjects was significant across all 20 narratives (p < .02). Significance increases exponentially as the number of subjects agreeing on a boundary increases.</Paragraph> <Paragraph position="16"> A conservative means for estimating a lower bound for the reliability of our method, using Krippendorff's c~ as a metric, suggests that the method is reliable. The reliability evaluation is conservative in part because it uses fewer subjects to derive boundaries. Note that it is conservative also because it is based on the proportion of identical matches between two data sets. This type of metric ignores the inherent fuzziness of segment location, as discussed in Passonneau and Litman (1996). We conclude that boundaries identified by at least three of seven subjects most likely reflect the validity of the underlying notion that utterances in discourse can be grouped into more-or-less coherent segments. What remains is the question of whether linguistic features correlate at all well with these segments.</Paragraph> <Paragraph position="17"> 4. Algorithmic Identification of Segment Boundaries using Linguistic Cues As discussed in Section 2, there has been little work on examining the use of linguistic cues for recognizing or generating segment boundaries, 6 much less on evaluating the comparative utility of different types of information. In this section we present and evaluate a collection of algorithms that identify discourse segment boundaries, where each relies on a different type of linguistic information. We first introduce our methodology (Section 4.1), then evaluate three initial algorithms, each based on the use of a single linguistic device frequently proposed in the literature: pauses, cue words and referential noun phrases, respectively (Section 4.2). 7 Each algorithm was developed prior to any acquaintance with the narratives in our corpus. We evaluate each algorithm by examining its performance in segmenting an initial test set of 10 of our 20 narratives. We also evaluate a simple method for combining algorithms. These initial evaluations allow us to quantify the performance of existing hypotheses, to compare the utility of three very different types of linguistic knowledge, and to begin investigating the utility of combining knowledge sources. We then present two methods for enhancing performance: error analysis, and machine learning (Section 4.3). 8 Here we use the 10 narratives previously used for testing as training data. The resulting algorithms are then tested on 5 new narratives. 
By using enriched linguistic information and by allowing more complex interactions among linguistic devices, both methods achieve results that approach human performance.</Paragraph> <Section position="1" start_page="115" end_page="121" type="sub_section"> <SectionTitle> 4.1 Methodology </SectionTitle> <Paragraph position="0"> Each algorithm performs the subjects' segmentation task (break up a narrative into contiguous segments, with segment breaks falling between prosodic phrases). The input to each algorithm is a set of potential boundary sites, coded with respect to a wide variety of linguistic features. The output is a classification of each potential boundary site as either boundary or nonboundary. In the target output, we classify a potential boundary site as boundary if it was identified as such by at least i of the seven subjects in our empirical study, where we use two values of i. Otherwise it is classified as nonboundary. In our experiments, we investigate the correlation of linguistic cues with boundaries identified by both i = 3 and i = 4 subjects.</Paragraph> <Paragraph position="2"> Figure 5: Excerpt from narrative 9, with boundaries.
there are three little boys, (nonboundary)
[.15] up on the road a little bit, (nonboundary)
and they see this little accident. (nonboundary)
[1.6 [.55] And u-h] they come over, (nonboundary)
and they help him, (nonboundary)
[.4? and [.2]] you know, (nonboundary)
help him pick up the pears and everything. (boundary)
[2.7 [1.0] A-nd [1.15]] the one thing that struck me about the- [.3] three little boys that were there,
is that one had ay uh [.4] I don't know what you call them,
but it's a paddle,
and a ball-,
[.2] is attached to the paddle,
and you know you bounce it?
.. And that sound was really prominent.
[4.55 Well anyway, [.45] so- u-m [.1] throat clearing [.45] tsk [1.15]] all the pears are picked up,
and.. he's on his way again,</Paragraph> <Paragraph position="4"> Figure 5 is a modified version of Figure 2, showing the classification of the statistically validated boundaries in the same narrative excerpt. (The bracketed numbers represent pauses, as explained below.) The boxes in the figure show the subjects' responses at each potential boundary site; if no box is shown, none of the seven subjects place a boundary at the site. The italicized parentheticals at each potential boundary site show the resulting boundary classification. Only 3 of the 18 possible boundary sites are classified as boundary, for both i = 3 and i = 4. For a narrative of n prosodic phrases, there are n - 1 potential boundary sites, one between each pair of prosodic phrases Pi and Pi+1, i from 1 to n - 1. Each potential boundary site in our corpus is coded for features representing the three different sources of linguistic information of interest: prosody, cue phrases, and referential noun phrases. The linguistic features used in our two sets of experiments are shown in Figure 6.
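As a concrete illustration of how the target classification is derived, the sketch below turns per-site agreement counts into boundary/nonboundary labels for a chosen threshold i. Only the thresholding rule comes from the text; the counts in the example are invented.

```python
def label_boundary_sites(subject_counts, threshold):
    """Classify each potential boundary site as 'boundary' if at least
    `threshold` of the seven subjects marked it, else 'nonboundary'."""
    return ["boundary" if count >= threshold else "nonboundary"
            for count in subject_counts]

# Hypothetical agreement counts for twelve potential boundary sites.
counts = [0, 1, 0, 5, 0, 0, 2, 4, 0, 1, 3, 0]
labels3 = label_boundary_sites(counts, threshold=3)   # i = 3
labels4 = label_boundary_sites(counts, threshold=4)   # i = 4
print(labels3.count("boundary"), labels4.count("boundary"))   # 3 2
```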
Our initial experiments use only the features marked as &quot;o,&quot; while our later experiments use the full feature set, along with modifications to the noun phrase features.</Paragraph> <Paragraph position="5"> Values for the prosodic features are obtained by automatic analysis of the transcripts, whose conventions are defined in Chafe (1980) and illustrated in Figure 5:</Paragraph> <Paragraph position="7">
1. &quot;.&quot; and &quot;?&quot; indicate falling versus rising sentence-final intonational contours;
2. &quot;,&quot; indicates phrase-final but not sentence-final intonation;
3. &quot;[X]&quot; indicates a pause lasting X seconds (measured to an accuracy of about .05 seconds);
4. &quot;[W [Y] lexical material [Z]]&quot; indicates a sequence lasting W seconds where a Y second pause is followed by lexical material then a pause of Z seconds;
5. &quot;..&quot; indicates a break in timing too short to be measured as a pause.
(The values in the transcripts are based in part on an analysis of displays of fundamental frequency contours.) The features before and after depend on the final punctuation of the phrases Pi and Pi+1, respectively. The value is +sentence.final.contour if &quot;.&quot; or &quot;?&quot;, -sentence.final.contour if &quot;,&quot;. Pause is assigned true if Pi+1 begins with [X] (convention 3) (or with [W [Y] for convention 4), false otherwise. Duration is assigned X (convention 3) (or Y for convention 4) if pause is true, 0 otherwise. The prosodic features were motivated by previous results in the literature. For example, phrases beginning discourse segments were correlated with preceding pause duration in Grosz and Hirschberg (1992). These and other studies (e.g., Hirschberg and Litman [1993]) also found it useful to distinguish between sentence-final and non-sentence-final intonational contours.</Paragraph> <Paragraph position="8"> The cue phrase features are also obtained by automatic analysis of the transcripts. Cue1 is assigned true if the first lexical item in Pi+1 is a member of the set of cue words summarized in Hirschberg and Litman (1993). Word1 is assigned this lexical item if cue1 is true, NA (not applicable) otherwise. Cue2 is assigned true if cue1 is true and the second lexical item is also a cue word. Word2 is assigned the second lexical item if cue2 is true, NA otherwise. As with the pause features, the cue phrase features were motivated by previous results in the literature. Initial phrase position (cue1) was correlated with discourse signaling uses of cue words in Hirschberg and Litman (1993). A potential correlation between discourse signaling uses of cue words and adjacency patterns between cue words (cue2) was also suggested. Finally, Litman (1996) found that treating cue phrases individually rather than as a class (word1, word2) enhanced the results of Hirschberg and Litman (1993).</Paragraph> <Paragraph position="9"> Two of the noun phrase (NP) features are hand coded, along with Functionally Independent Clause Units (FICUs; see below), following Passonneau (1994). The two authors coded independently and merged their results. Coding was performed on automatically created coding sheets for each narrative, consulting transcripts that were specially formatted to show prosodic phrase boundaries and numbers, but which were otherwise identical to Chafe's (1980) original transcriptions. Boundary data, which had been collected but not analyzed, was not available.
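The prosodic and cue phrase features described above lend themselves to simple automatic extraction. The sketch below is one possible reading of those definitions, assuming each prosodic phrase is available as a plain string in Chafe's notation; the cue word list shown is only an illustrative subset of the set from Hirschberg and Litman (1993), and the function names and regular expressions are this sketch's own, not the authors' code.

```python
import re

CUE_WORDS = {"and", "so", "well", "now", "then", "anyway", "because", "but", "or"}  # illustrative subset

# "[X] ..." (convention 3) or "[W [Y] ..." (convention 4) at the start of a phrase.
PAUSE_START = re.compile(r"^\[(\d*\.?\d+)(\s+\[(\d*\.?\d+)\])?")

def prosodic_features(prev_phrase, next_phrase):
    """before/after from final punctuation; pause/duration from an initial pause."""
    contour = lambda p: ("+sentence.final.contour"
                         if p.rstrip().endswith((".", "?"))
                         else "-sentence.final.contour")
    m = PAUSE_START.match(next_phrase.lstrip())
    pause = m is not None
    # Duration is X for convention 3, or the inner Y for convention 4.
    duration = float(m.group(3) or m.group(1)) if pause else 0.0
    return {"before": contour(prev_phrase), "after": contour(next_phrase),
            "pause": pause, "duration": duration}

def cue_features(next_phrase):
    # Strip bracketed pause material, then inspect the first lexical items.
    words = re.sub(r"[\[\]\d.]+", " ", next_phrase).lower().split()
    first = words[0].strip(",.?-") if words else ""
    second = words[1].strip(",.?-") if len(words) > 1 else ""
    cue1 = first in CUE_WORDS
    cue2 = cue1 and second in CUE_WORDS
    return {"cue1": cue1, "word1": first if cue1 else "NA",
            "cue2": cue2, "word2": second if cue2 else "NA"}

prev = "and they see this little accident."
nxt = "[1.6 [.55] And u-h] they come over,"
print(prosodic_features(prev, nxt))   # pause True, duration 0.55, before +sentence.final.contour
print(cue_features(nxt))              # cue1 True, word1 'and'
```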
Comprehensive operational definitions for recognition of reference features (coref and infer) are documented in Passonneau (1994). The last NP feature, global.pro, is computed from the coding of other features and of previously occurring boundaries.</Paragraph> <Paragraph position="10"> All three NP features are applied in the context of FICUs (Passonneau 1994). An FICU contains a single tensed clause that is neither a verb argument nor a restrictive relative clause, potentially with sentence fragments or repairs. If a new FICU (Cj) begins in prosodic phrase Pi+1, then NPs in Cj are compared with NPs in previous FICUs and the feature values assigned as follows:
1. coref = +coref if any NPs in Cj and Cj-1 corefer; else coref = -coref
2. infer = +infer if the referent of an NP in Cj can be inferred from Cj-1 on the basis of a pre-defined set of inference relations; else infer = -infer
3. global.pro = +global.pro if the referent of a definite pronoun in Cj is mentioned in a previous utterance, but not prior to the last time a boundary was assigned; else global.pro = -global.pro
Note that the global.pro feature is defined in a manner that depends on incremental assignment of boundaries and coding of features. To evaluate global.pro for an utterance Cj requires that all boundaries occurring prior to Cj have been assigned. If a new FICU is not initiated in Pi+1, values for all three features are NA. The NP features reflect Passonneau's hypotheses that adjacent utterances are more likely to contain expressions that corefer, or that are inferentially linked, if they occur within the same segment; and that a definite pronoun is more likely than a full NP to refer to an entity that was mentioned in the current segment, if not in the previous utterance. These hypotheses are inspired by centering theory (Grosz, Joshi, and Weinstein 1995), psycholinguistic research (Marslen-Wilson, Levy, and Tyler 1982; Levy 1984), and pilot studies on data from corpora (Passonneau 1993) or published excerpts (Grosz 1977; Grosz and Sidner 1986). Unlike the cue and pause features, the NP features were thus not directly based on simplifications of existing results.
Footnote 9: The cue phrases that occur in the corpus are shown as potential values in Figure 6.
Footnote 10: The NP algorithm can assign multiple boundaries within one prosodic phrase if the phrase contains multiple clauses; these very rare cases are normalized (Passonneau and Litman 1993). A total of 5 boundaries are eliminated in 3 of the 10 test narratives (out of 213 in all 10).</Paragraph> <Paragraph position="11"> Cue-prosody, which encodes a combination of prosodic and cue word features, was motivated by an analysis of errors on our training data, as described in Section 4.3.1. Cue-prosody is assigned complex if:
1. before = +sentence.final.contour
2. pause = true
3. and either: (a) (b)</Paragraph> <Paragraph position="13"> Figure 7 illustrates boundary slot codings using the features in Figure 6. The subscripting on noun phrases indicates coreference.</Paragraph> <Paragraph position="14"> The ability of humans to reliably code linguistic features similar to those coded in Figure 7 has been demonstrated in various studies. Evaluation of prosodic labeling using TOBI, a prosodic transcription system somewhat similar to that used in the Pear corpus, has been found to be quite reliable between transcribers (Pitrelli, Beckman, and Hirschberg 1994).
The results of a study of 953 spoken cue phrases showed that two judges agreed on whether cue phrases illustrated a discourse signaling usage or not in 878 (92.1%) cases (Hirschberg and Litman 1993). For these 878 cases, an algorithm that assigned discourse signaling usages to cues if they were the first lexical item in their intermediate intonational phrase performed with 75% accuracy (Litman 1994), which is analogous to the method used here to assign the value true to the feature cue1. When coding involves either relatively objective phenomena or a well-defined decision procedure, one can expect good interrater reliability among different coders (see Duncan and Fiske [1977] and Mokros [1984]). The coref feature falls into this category. In addition, preliminary data from a third coder provides good evidence that coref can be coded reliably. A feasibility study of the parsers CASS (Abney 1990) and FIDDITCH (Hindle 1983) showed that coding FICUs on this data could be automated. Subjectivity in coding the infer feature was eliminated by providing operational definitions of a small set of types of inferential links, also fully documented in Passonneau (1994), where infer occurs only if one or more of the bridging inferences occurs.</Paragraph> <Paragraph position="15"> Figure 8: Information retrieval metrics.</Paragraph> <Paragraph position="16"> The initial algorithms are evaluated by quantifying their performance in segmenting a test set of 10 narratives from our corpus. As discussed above, there is no training data for the algorithms in this section, which are derived from the literature. These initial results provide us with a baseline for quantifying improvements resulting from distinct modifications to the algorithms.</Paragraph> <Paragraph position="17"> In contrast, the algorithms presented in section 4.3 are developed using the 10 narratives previously used for testing as a training set of narratives. The algorithms in this section are developed by tuning the previous algorithms (e.g., by considering both new and modified linguistic features) such that performance on the training set is increased. The resulting algorithms are then evaluated by examining their performance on a separate test set of 5 more narratives. (The remaining 5 of the 20 narratives in the corpus are reserved for future research.) The 10 training narratives range in length from 51 to 162 phrases (Avg.=101.4), or from 38 to 121 clauses (Avg.=76.8).</Paragraph> <Paragraph position="18"> The 5 test narratives range in length from 47 to 113 phrases (Avg.=87.4), or from 37 to 101 clauses (Avg.=69.0). The ratios of test to training data measured in narratives, prosodic phrases, and clauses, respectively, are 50.0%, 43.1%, and 44.9%. For the machine learning algorithm we also estimate performance using cross-validation (Weiss and Kulikowski 1991), as detailed in Section 4.3.2. The evaluations in this section allow us to compare the utility of two tuning methods: error analysis, and machine learning.</Paragraph> <Paragraph position="19"> To quantify algorithm performance, we use the information retrieval metrics shown in Figure 8. Recall is the ratio of correctly hypothesized boundaries to target boundaries. Precision is the ratio of hypothesized boundaries that are correct to the total hypothesized boundaries. (See Figure 8 for fallout and error.)
These metrics assume that ideal behavior would be to identify all and only the target boundaries: the values for b and c in Figure 8 would thus both equal 0, representing no errors. The ideal values for recall, precision, fallout, and error are 1, 1, 0, and 0, while the worst values are 0, 0, 1, and 1. To get an intuitive summary of overall performance, we also sum the deviation of the observed value from the ideal value for each metric: (1 - recall) + (1 - precision) + fallout + error. The ideal value of this summed deviation is thus 0.</Paragraph> <Paragraph position="22"> Finally, to interpret our quantitative results, we use the performance of our human subjects as a target for the performance of our algorithms (Gale, Church, and Yarowsky 1992). Table 3 shows the average human performance for both the training and test sets of narratives, for both boundaries identified by at least three and four subjects. Note that human performance is basically the same for both sets of narratives. However, two factors prevent this performance from being closer to ideal (e.g., recall and precision of 1). The first is the wide variation in the number of boundaries that subjects used, as discussed above. The second is the inherently fuzzy nature of boundary location. We discuss this second issue at length in Passonneau and Litman (1996). In Litman and Passonneau (1995b), we also present relaxed IR metrics that penalize near misses less heavily (cases where an algorithm does not place a boundary at a statistically validated boundary location, but does place one within one phrase of the validated boundary).
Footnote 12: Elsewhere we have discussed problems with the use of IR metrics, given that segmentation is a fuzzy phenomenon. However, they provide a rough (lower bound) measure of performance.</Paragraph> </Section> <Section position="2" start_page="121" end_page="125" type="sub_section"> <SectionTitle> 4.2 Initial Hypotheses </SectionTitle> <Paragraph position="0"> In principle, the process of determining whether the statistically validated segment boundaries correlate with linguistic devices requires a complex search through a large space of possibilities, depending on what set of linguistic devices one examines, and what features are used to recognize and classify them. Rather than developing a method to search blindly through the space of possibilities, we first provide an initial evaluation of three linguistic devices whose distribution or surface form has frequently been hypothesized to be conditioned by segmental structure: referential noun phrases, cue words, and pauses. We evaluate three algorithms, each of which uses features pertaining to only one of these linguistic devices, in order to see whether linguistic associations proposed in the literature can be used by natural language processing systems to perform segmentation, and to compare the utility of different knowledge sources. Unlike most previous work, which typically considers each linguistic device in isolation, we also evaluate a simple additive method for combining linguistic devices, in which a boundary is proposed if each separate algorithm proposes a boundary. As we will see, the performance of our algorithms improves with the amount of knowledge exploited. The recall of the three algorithms is comparable to human performance, the precision much lower, and the fallout and error of only the noun phrase algorithm comparable.
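For concreteness, the sketch below computes the four metrics and the summed deviation from the contingency counts of Figure 8, using the standard definitions. The meanings of b (nonboundaries misclassified as boundaries) and c (missed boundaries) follow the later discussion of b-type and c-type errors; assigning a and d to the corresponding correct classifications is an assumption, since Figure 8 itself is not reproduced in this text, and the example counts are invented.

```python
def ir_metrics(a, b, c, d):
    """a: boundaries correctly hypothesized, b: nonboundaries misclassified as
    boundaries, c: boundaries missed, d: nonboundaries correctly classified."""
    recall = a / (a + c)
    precision = a / (a + b)
    fallout = b / (b + d)
    error = (b + c) / (a + b + c + d)
    summed_deviation = (1 - recall) + (1 - precision) + fallout + error
    return recall, precision, fallout, error, summed_deviation

# Invented counts for one 100-phrase narrative (99 sites, 20 target boundaries):
# the algorithm proposes 30 boundaries, 15 of them correct.
print(ir_metrics(a=15, b=15, c=5, d=64))
```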
Furthermore, the results on combining algorithms suggest that with more sophisticated methods, results approaching human performance can be achieved.</Paragraph> <Paragraph position="1"> Figure 9: if pause = true then boundary.
Figure 10: Statistically validated versus algorithmically derived boundaries.</Paragraph> <Paragraph position="2"> Previous studies have found correlations between pauses and discourse segment boundaries (Grosz and Hirschberg 1992; Hirschberg and Nakatani 1996; Swerts 1995). For example, segment-initial phrases have been correlated with longer preceding pause durations. As shown in Figure 9, we used a simplification of these results to develop an algorithm for identifying boundaries in our corpus using pauses. If a pause occurs at the beginning of the prosodic phrase after the potential boundary site, the potential boundary site is classified as boundary and the phrase is taken to be the beginning of a new segment.</Paragraph> <Paragraph position="3"> Figure 10 shows boundaries assigned by the pause algorithm (PAUSE) for the boundary slot codings from Figure 7, repeated at the top of the figure. For example, the pause algorithm assigns a boundary between prosodic phrases 22.4 and 23.1, but not between phrases 23.1 and 23.2.</Paragraph> <Paragraph position="4"> Table 4 shows the average performance of the pause algorithm for statistically validated boundaries at the .0001 level (those boundaries proposed by at least four subjects). Recall is 92% (σ = .008; max = 1; min = .73), precision is 18% (σ = .002; max = .25; min = .09), fallout is 54% (σ = .004; max = .65; min = .45), and error is 49% (σ = .004; max = .61; min = .41). Our algorithm thus performs with recall higher than human performance. However, precision is low, and both fallout and error are quite high. The summed deviation metric, which takes all the metrics into account, shows that on the whole performance is considerably worse than humans'.</Paragraph> <Paragraph position="5"> Cue words have been claimed to signal the structure of a discourse. Hirschberg and Litman (1993) examined a large set of cue words proposed in the literature and showed that certain prosodic and structural features, including a position of first in prosodic phrase, are highly correlated with the discourse uses of these words. As shown in Figure 11, we developed a baseline segmentation algorithm based on a simplification of these results, using the value of the single cue phrase feature cue1. That is, if a cue word occurs at the beginning of the prosodic phrase after the potential boundary site, the usage is assumed to be discourse.</Paragraph> <Paragraph position="6"> Thus the potential boundary site is classified as boundary and the phrase is taken to be the beginning of a new segment. Figure 10 shows boundaries (CUE) assigned by the algorithm.
Footnote 13: Our initial algorithm does not take the duration of the pause into account; pause duration is considered in the algorithms presented in Section 4.3.2, however. In addition, since our segmentation task is not hierarchical, we do not note whether phrases begin, end, suspend, or resume segments.
Footnote 14: Note that the humans did not have access to pause information. Other studies have shown that when both speech and text are available to labelers, segmentation is clearer (Swerts 1995) and reliability improves (Hirschberg and Nakatani 1996).</Paragraph> <Paragraph position="7"> Table 4 shows the average performance of the cue word algorithm.
Recall is 72% (σ = .027; max = .88; min = .40), precision is 15% (σ = .003; max = .23; min = .04), fallout is 53% (σ = .006; max = .60; min = .42) and error is 50% (σ = .005; max = .60; min = .40). While recall is quite comparable to human performance (row 4), the precision is low while fallout and error are quite high.</Paragraph> <Paragraph position="8"> The third algorithm takes as input information about referential NPs. We refer to this algorithm as NP. Unlike the previous algorithms, in NP the potential boundaries are first computed as ordered pairs of adjacent functionally independent clauses (FICUi, FICUi+1; see Section 4.1.2) then normalized to ordered pairs of prosodic phrases (see note 10). NP operates on the principle that if an NP in the current FICU provides a referential link to the current segment, the current segment continues. However, NPs and pronouns are treated differently based on the assumption that the referent of a third person definite pronoun is more prominently in focus (cf. Passonneau [1994]). A third person definite pronoun provides a referential link if its index occurs anywhere in the current segment. Any other NP type provides a referential link if its index occurs in the immediately preceding FICU. Figure 12 illustrates the two decisions made by NP for each pair of adjacent FICUs. As described in Section 4.1.2, the coref feature is -coref if no NP in FICUi corefers with an NP in FICUi-1; the infer feature is -infer if no NP in FICUi is inferentially linked to an NP in FICUi-1; the global.pro feature is -global.pro if FICUi contains no third person definite pronoun coreferring with an NP in any prior FICU up to the last boundary assigned by the algorithm. If any feature has a positive value, no boundary is assigned; if all have negative values, (FICUi-1, FICUi) is classified as a boundary.</Paragraph> <Paragraph position="9"> The column headed NP in Figure 10 indicates boundaries assigned by the NP algorithm. No boundaries are assigned by NP. The first three phrases in Figure 7 correspond directly to three consecutive FICUs, and each FICU has an NP coreferring with an NP in the next; likewise the global.pro feature is present. However, phrase 23.3 is the onset of an FICU that continues through 23.4, so phrase 23.3 is not coded for NP features. The coref and global.pro features are present in the FICU that ends in 23.4, due to coreference of a pronominal NP with an NP in the preceding FICU (from phrase 23.2).</Paragraph> <Paragraph position="10"> Table 4 shows the average performance of the referring expression algorithm (row labeled NP) on the four measures we use here. Recall is .50 (σ = .17; max = .71; min = .18), precision is .31 (σ = .097; max = .50; min = .20), fallout is .15 (σ = .06; max = .27; min = .07) and error rate is .19 (σ = .06; max = .31; min = .12). Recall is worse than PAUSE, CUE and human performance, and precision is better than PAUSE and CUE but worse than human performance. Note that the error rate and fallout, which in a sense are more robust measures of inaccuracy than precision, are both much better than CUE and PAUSE.</Paragraph> <Paragraph position="11"> 4.2.4 Additive Algorithms. We report here evaluation of a simple additive method for combining the three algorithms described above. That is, a boundary is proposed if some combination of the algorithms proposed a boundary. We tested all pairwise combinations, and the combination of all three algorithms, as shown in Table 5. Precision is the most likely metric to be improved.
For a composite algorithm, recall cannot be increased: if neither NP, PAUSE, nor CUE found a boundary, then no combination of them can. However, the composite algorithms use narrower criteria for boundaries, which should reduce the number of false positives. The precision of the additive algorithms is indeed higher than any of the algorithms alone. PAUSE/NP has the best additive algorithm performance as measured by the summed deviation.</Paragraph> <Paragraph position="12"> 4.2.5 Discussion. By using average human performance as a baseline against which to evaluate algorithms, we are asking whether algorithms perform in a manner that reflects an abstraction over a population of humans, rather than whether they perform like a typical human. No algorithm or combination of algorithms performs as well as this baseline. The referring expression algorithm (NP) performs better than the other unimodal algorithms (PAUSE and CUE), and a combination of PAUSE and NP performs best. Our results thus suggest that accurately predicting discourse segmentation involves far more than directly using known linguistic differences between discourse boundaries and nonboundaries. Here we analyze some of the likely reasons for our results, to motivate the methodologies for algorithm improvement presented in the next section.</Paragraph> <Paragraph position="13"> First, we must take into account the dimensions along which the three algorithms differ, apart from the different types of linguistic information used. As shown in Figures 9, 11, and 12, NP uses more knowledge than PAUSE and CUE. PAUSE and CUE each depend on only a single feature, while NP relies on three features. Unsurprisingly, NP performs most like humans. For both PAUSE and CUE, the recall is relatively high, but the precision is very low, and the fallout and error rate are both poor. For NP, recall and precision are not as different, precision is higher than PAUSE and CUE, and fallout and error rate are both relatively low. These results, as well as the improved performance of the additive algorithms, suggest that performance can be improved by considering more features. The algorithms presented in Section 4.3 indeed use more features, as shown in Figure 6.</Paragraph> <Paragraph position="14"> A second dimension to consider in comparing performance is that humans and NP assign boundaries based on a global criterion, in contrast to PAUSE and CUE.</Paragraph> <Paragraph position="15"> Our subjects typically use a relatively gross level of speaker intention. By default, NP assumes that the current segment continues, and assigns a boundary under relatively narrow criteria. However, PAUSE and CUE rely on cues that are relevant at the local as well as the global level, and consequently assign boundaries more often. This leads to a preponderance of cases where PAUSE and CUE propose a boundary but where a majority of humans did not. However, when either PAUSE or CUE is combined with the more global NP, as in PAUSE/NP and CUE/NP, we see that performance improves.
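The additive combinations discussed above amount to intersecting the boundary proposals of the component algorithms. The snippet below is a minimal sketch of that idea (PAUSE/NP shown), with per-site decisions represented as booleans; the names and data are illustrative, not taken from the corpus.

```python
def additive(*proposals):
    """Propose a boundary only where every component algorithm proposes one."""
    return [all(votes) for votes in zip(*proposals)]

# Per-site proposals for six potential boundary sites (illustrative).
pause_says = [True, False, True, True, False, True]
np_says    = [True, False, False, True, False, False]
print(additive(pause_says, np_says))   # [True, False, False, True, False, False]
```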
These results suggest that another way to improve performance is to consider more sophisticated methods for combining features across the three types of linguistic devices.</Paragraph> </Section> <Section position="3" start_page="125" end_page="133" type="sub_section"> <SectionTitle> 4.3 Developing New Hypotheses by Combining Multiple Knowledge Sources </SectionTitle> <Paragraph position="0"> In this section we present two methods for developing segmentation algorithms that combine the features of multiple linguistic devices in more complex ways than simply combining the outputs of independent algorithms. Our first method relies on an analysis of the errors made by the best-performing algorithm. Our second method uses machine learning tools to automatically construct segmentation algorithms from a large set of input features: features used in our previous experiments, enhancements to hand-coded features, and new features obtainable automatically from our transcripts. Both methods consider much more knowledge than previously considered by ourselves or others, and result in algorithms that exhibit marked improvements in performance. We present our results using two sets of statistically validated boundaries: those derived using a significance level of .0001 (corresponding to Tj ≥ 4 subjects, as in the previous section), and those derived using a less conservative level of .02 (corresponding to Tj ≥ 3 subjects).</Paragraph> <Paragraph position="1"> We analyzed the two types of errors, defined in Figure 8 above, made by the original NP algorithm on the training data (Passonneau and Litman 1993). Type &quot;b&quot; errors, misclassification of nonboundaries, were reduced by redefining the coding features pertaining to clauses and NPs. Most &quot;b&quot; errors correlated with one of two kinds of information used in the NP algorithm: identification of clauses (FICUs) and of inferential links. (The training data, like previous research, shows that pauses preceding boundaries have longer average durations: for Tj ≥ 3, the average pause duration is .64 (σ = .65) before boundaries and .39 (σ = 1.70) before nonboundaries; for Tj ≥ 4, the average durations are .72 (σ = .67) and .39 (σ = 1.64), respectively. As will be seen in Section 4.3.2, this correlation does not translate into any high-performing algorithm based primarily on pause duration.)
Figure 13: Inferential link due to implicit argument.</Paragraph> <Paragraph position="2"> The redefinition of FICU motivated by error analysis led to fewer clauses. For example, FICU assignment depends in part on filtering out clausal interjections, utterances that have the syntactic form of clauses but that function as interjections. These include phrases like let's see, let me see, I don't know when they occur with no overt or implied verb phrase argument. The extensional definition of clausal interjections was expanded, thus certain utterances were no longer classed as FICUs under the revised coding. Other changes to the definition of FICUs pertained to sentence fragments, unexpected clausal arguments, and embedded speech. Because the algorithm assigns boundaries between FICUs, reducing the number of FICUs in a narrative can reduce the number of proposed boundaries.</Paragraph> <Paragraph position="3"> Error analysis also led to a redefinition of infer, and to the inclusion of new types of inferential relations that an NP referent might have to prior discourse. Previously, infer was a relation between the referent of an NP in one utterance, and the referent of an NP in a previous utterance.
<Paragraph position="3"> Error analysis also led to a redefinition of infer, and to the inclusion of new types of inferential relations that an NP referent might have to prior discourse. Previously, infer was a relation between the referent of an NP in one utterance and the referent of an NP in a previous utterance. This was loosened to include referential links between an NP referent and referents mentioned in, or inferable from, any part of the previous utterance. For example, discourse deixis (demonstrative reference to a referent derivable from prior discourse [Passonneau 1993; Webber 1991]) was added to the types of inferential links to code for. In the second clause of The storm is still raging, and that's why the plane is grounded, the demonstrative pronoun that is an example of discourse deixis. Expanding the definition of infer also reduces the number of proposed boundaries: recall that the algorithm does not assign a boundary if there is an inferential link between an NP in the current utterance unit and the prior utterance unit.</Paragraph> <Paragraph position="4"> Three types of inference relations linking successive clauses (Ci-1, Ci) were added (originally there were five types [Passonneau 1994]). Now, a pronoun (e.g., it, that, this) in Ci referring to an action, event, or fact inferable from Ci-1 provides an inferential link. So does an implicit argument, as in Figure 13, where the missing argument of notice is inferred to be the event of the pears falling. The third case is where an NP in Ci is described as part of an event that results directly from an event mentioned in Ci-1.</Paragraph> <Paragraph position="5"> Misclassification of boundaries (&quot;c&quot; type errors; see Figure 8) often occurred where prosodic and cue features conflicted with NP features. The original NP algorithm assigned boundaries wherever the three values -coref, -infer, and -global.pro co-occurred.</Paragraph> <Paragraph position="6"> Experiments led to the hypothesis that the greatest improvement comes from also assigning a boundary whenever the cue-prosody feature has the value complex, even if the algorithm would not otherwise assign one, as shown in Figure 14. See Figure 10 for boundaries assigned by the resulting algorithm (EA, for error analysis).</Paragraph> <Paragraph position="7"> Table 6 presents the average IR scores across the narratives in the training set for the NP and EA algorithms. The top half of the table reports results for boundaries that at least three subjects agreed upon (T = 3), and the lower half for boundaries using a threshold value of 4 (T = 4), where NP duplicates the figures from Table 4. Going by the summed deviations, the overall performance is about the same, although variation around the mean is lower for T = 4. The figures illustrate a typical tradeoff between precision and recall: where one goes up, the other goes down. All scores are better for EA.</Paragraph> <Paragraph position="8"> Table 7 shows the results of the tuned algorithm on the 5 randomly selected test narratives for NP and EA. Performance on the test set is slightly better overall for T = 4, as shown by lower summed deviations. The NP results are very similar to the training set except that precision is worse. Thus, despite the high standard deviations, 10 narratives seem to have been a sufficient sample size for evaluating the initial NP algorithm. The EA results in Table 7 are better than the NP results in either Table 7 or Table 6. This is strong evidence that the tuned algorithm is a better predictor of segment boundaries than the original NP algorithm. The test results of EA are, of course, worse than the corresponding training results, particularly for precision (.44 versus .62). This confirms that the tuned algorithm is overcalibrated to the training set. Using summed deviations as a summary metric, EA's improvement is about one-third of the distance between NP and human performance.</Paragraph>
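The decision rules behind NP and EA, as described above, can be summarized in a few lines of code. This is a minimal sketch assuming a simple dictionary of feature values; the actual coding scheme used in the study is richer than this.

# Sketch of the boundary decision described above. Feature values are strings
# matching the labels used in the text; the dictionary representation is an
# assumption made for illustration.

def np_boundary(features):
    """Original NP rule: propose a boundary only when the three values
    -coref, -infer, and -global.pro co-occur at a potential boundary site."""
    return (features["coref"] == "-coref"
            and features["infer"] == "-infer"
            and features["global.pro"] == "-global.pro")

def ea_boundary(features):
    """Error-analysis (EA) tuning: additionally propose a boundary whenever
    cue-prosody has the value 'complex', even if the NP rule would not fire."""
    return np_boundary(features) or features.get("cue-prosody") == "complex"

site = {"coref": "-coref", "infer": "+infer",
        "global.pro": "-global.pro", "cue-prosody": "complex"}
print(np_boundary(site), ea_boundary(site))  # False True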
<Paragraph position="9"> The standard deviations in Tables 6 and 7 are often close to 1/4 or 1/3 of the reported averages. This indicates a large amount of variability in the data, reflecting wide differences across narratives (speakers) in the training set with respect to the distinctions recognized by the algorithm. Although the high standard deviations show that the tuned algorithm is not closely fitted to every training narrative, it is nevertheless likely to be overspecialized to the training sample, in the sense that test narratives are likely to exhibit still further variation.</Paragraph> <Paragraph position="10"> 4.3.2 Machine Learning. While error analysis is useful for improving an existing feature representation, it does not facilitate experimentation with large sets of multiple features simultaneously. To address this, we turned to machine learning to automatically develop algorithms from large numbers of both training examples and features.</Paragraph> <Paragraph position="11"> We use the machine learning program C4.5 (Quinlan 1993) to automatically develop segmentation algorithms from our corpus of coded narratives, where each potential boundary site has been classified and represented as a set of linguistic features. The first input to C4.5 specifies the names of the classes to be learned (boundary and nonboundary), and the names and potential values of a fixed set of coding features (Figure 6). The second input is the training data, i.e., a set of examples for which the class and feature values (as in Figure 7) are specified. Our training set of 10 narratives provides 1004 examples of potential boundary sites. The output of C4.5 is a classification algorithm expressed as a decision tree, which predicts the class of a potential boundary given its set of feature values.</Paragraph>
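To make the two inputs concrete, the sketch below writes miniature versions of the two C4.5 input files. The feature names are a subset of those mentioned in the text; the value inventories, example rows, file names, and command lines are illustrative assumptions rather than the authors' actual coding files.

# Hypothetical sketch of preparing C4.5 input files for the segmentation task.
# Feature names follow the text (cf. Figure 6); the value sets and data rows
# are invented for illustration.

names = """boundary, nonboundary.
before:      +sentence.final.contour, -sentence.final.contour.
after:       +sentence.final.contour, -sentence.final.contour.
duration:    continuous.
cue1:        true, false.
word1:       discrete 100.
coref:       +coref, -coref, NA.
infer:       +infer, -infer, NA.
global.pro:  +global.pro, -global.pro, NA.
"""
# 'discrete 100' lets values be collected from the data (per the C4.5 manual;
# treat the exact syntax here as an assumption).

examples = [
    # before, after, duration, cue1, word1, coref, infer, global.pro, class
    ("+sentence.final.contour", "-sentence.final.contour", "0.64",
     "true", "and", "-coref", "-infer", "-global.pro", "boundary"),
    ("-sentence.final.contour", "+sentence.final.contour", "0.0",
     "false", "NA", "+coref", "+infer", "NA", "nonboundary"),
]

with open("segment.names", "w") as f:
    f.write(names)
with open("segment.data", "w") as f:
    for row in examples:
        f.write(", ".join(row) + "\n")

# A tree could then be induced with something like:  c4.5 -f segment
# and, with value grouping enabled (roughly the "Learning 2" setting below):
#                                               c4.5 -f segment -s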
<Paragraph position="12"> Because machine learning makes it convenient to induce decision trees under various conditions, we have performed numerous experiments varying the number of features used, the definitions used for classifying a potential boundary site as boundary or nonboundary, and the options available for running the C4.5 program. Figure 15 shows one of the highest-performing learned decision trees from our experiments.</Paragraph> <Paragraph position="13"> This decision tree was learned under the following conditions: all of the features shown in Figure 6 were used to code the training data, boundaries were classified using a threshold of three subjects, and C4.5 was run using only the default options. The decision tree predicts the class of a potential boundary site based on the features before, after, duration, cue1, word1, coref, infer, and global.pro. (The manually derived segmentation algorithm evaluates boundary assignment incrementally, i.e., utterance-by-utterance, after computing the features for the current utterance (or FICU); this allows relative information about previous boundaries to be used in deriving the global.pro feature. By allowing machine learning to use global.pro, we are testing whether characterizing the use of referring expressions (certain pronouns) in terms of relative knowledge about segments (whether the current referent was already mentioned in the current segment) is useful for classifying the current boundary site. Although none of the other features are derived using classification knowledge of any other potential boundary sites, note that global.pro does not encode the boundary/nonboundary classification of the particular site in question. Furthermore, even when machine learning does not use global.pro (as with the &quot;Learning 2&quot; algorithm discussed below), performance does not suffer.) Note that although not all available features are used in the tree, the included features represent three of the four general types of knowledge (prosody, cue phrases, and noun phrases).</Paragraph> <Paragraph position="14"> Each level of the tree specifies a test on a single feature, with a branch for every possible outcome of the test. (The actual tree branches on every value of word1; Figure 15 merges these branches for clarity.) A branch can either lead to the assignment of a class, or to another test. For example, the tree initially branches based on the value of the feature before. If the value is -sentence.final.contour, then the first branch is taken and the potential boundary site is assigned the class nonboundary. If the value of before is +sentence.final.contour, then the second branch is taken and the feature coref is tested. Figure 10 illustrates sample output of this algorithm (ML).</Paragraph>
Figure 15 (excerpt). Learned decision tree for segmentation:
      elseif coref = -coref then
         if infer = +infer then nonboundary
         elseif infer = NA then boundary
         elseif infer = -infer then
            if after = -sentence.final.contour then boundary
            elseif after = +sentence.final.contour then
               if cue1 = true then
                  if global.pro = NA then boundary
                  elseif global.pro = -global.pro then boundary
                  elseif global.pro = +global.pro then ...
<Paragraph position="15"> The performance of this learned decision tree averaged over the 10 training narratives is shown in Table 8, on the line labeled &quot;Learning 1&quot;. The line labeled &quot;Learning 2&quot; shows the results from another machine learning experiment, in which one of the default C4.5 options used in &quot;Learning 1&quot; is overridden. The default C4.5 approach creates a separate subtree for each possible feature value; as detailed in Quinlan (1993), this approach might not be appropriate when there are many values for a feature, which is true for features such as word1 and word2. In &quot;Learning 2&quot;, C4.5 allows feature values to be grouped into one branch of the decision tree. While the &quot;Learning 2&quot; tree is more complex than the tree of Figure 15, it does have slightly better performance. The &quot;Learning 2&quot; decision tree predicts the class of a potential boundary site based on the features before, duration, cue1, word1, word2, coref, infer, and cue-prosody. At T = 3, &quot;Learning 1&quot; performance is comparable to human performance (Table 3), and &quot;Learning 2&quot; is slightly better than humans; at T = 4, both learning conditions are superior to human performance. The results obtained via machine learning are also better than the results obtained using error analysis (EA in Table 6), primarily due to better precision. In general, the machine learning results have slightly greater variation around the average.</Paragraph> <Paragraph position="16"> The performance of the learned decision trees averaged over the 5 test narratives is shown in Table 9.
Comparison of Tables 8 and 9 shows that, as with the error analysis results (and as expected), average performance is worse when applied to the testing rather than the training data, particularly with respect to precision. However, the best machine learning performance is an improvement over our previous best results (EA in Table 7). For T = 3, &quot;Learning 1&quot; is comparable to EA while &quot;Learning 2&quot; is better. For T = 4, EA is better than &quot;Learning 1&quot;, but &quot;Learning 2&quot; is better still. However, as with the training data, EA has somewhat less variation around the average.</Paragraph> <Paragraph position="17"> We also use the resampling method of cross-validation (Weiss and Kulikowski 1991) to estimate performance, which averages results over multiple partitions of a sample into test versus training data. We performed 10 runs of the learning program, each using 9 of the 10 training narratives for that run's training set (for learning the tree) and the remaining narrative for testing. Note that for each iteration of the cross-validation, the learning process begins from scratch, and thus each training and testing set are still disjoint. While this method does not make sense for humans, the learning program can truly ignore previous iterations. For sample sizes in the hundreds (our 10 narratives provide 1004 examples), 10-fold cross-validation often provides a better performance estimate than the hold-out method (Weiss and Kulikowski 1991). Results using cross-validation are shown in Table 10, and are better than the estimates obtained using the hold-out method (Table 9), with the major improvement coming from precision.</Paragraph> <Paragraph position="18"> Finally, Table 11 shows the results from a set of additional machine learning experiments, in which more conservative definitions of boundary are used. For example, using a threshold of seven subjects yields the set of consensus boundaries, as defined in Hirschberg and Nakatani (1996). Comparison with Table 9 shows that for T = 5, &quot;Learning 1&quot; rather than &quot;Learning 2&quot; is the better performer. However, the more interesting result is that for T = 6 and T = 7, the learning approach has an important limitation with respect to the boundary classification task. In particular, the way in which C4.5 minimizes error rate is not an effective strategy when the distribution of the classes is highly skewed. For both T = 6 and T = 7, extremely few of the 1004 training examples are classified as boundary (40 and 19 examples, respectively). C4.5 minimizes the error rate by always predicting nonboundary. For example, for T = 6, because only 4% of the training examples are boundaries, C4.5 achieves an error rate of 4% by always predicting nonboundary. However, this low error rate is achieved at the expense of the other metrics. Using the terminology of Figure 8, since the algorithm never predicts the class boundary, it is necessarily the case that a = 0, b = 0, recall = 0, and precision is undefined (&quot;-&quot; in Table 11). In addition, for T = 7, 2 of the 5 test sets happen to contain no boundaries; for these cases c = 0, and thus the value of recall is also sometimes undefined. The problem of unbalanced data is not unique to the boundary classification task. Current work in machine learning is exploring ways to induce patterns relevant to the minority class, for example, by allowing users to explicitly specify different penalties for false positive and false negative errors (Lewis and Catlett 1994). (In contrast, C4.5 assumes that both types of errors are penalized equally.) Other researchers (e.g., Hirschberg [1991]) have proposed sampling the majority class examples in a training set in order to produce a more balanced training sample.</Paragraph>
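The arithmetic behind this limitation is easy to reproduce. The sketch below assumes the standard IR definitions for the Figure 8 metrics (recall = a/(a+c), precision = a/(a+b), fallout = b/(b+d), error = (b+c)/(a+b+c+d)) and plugs in roughly the T = 6 class distribution quoted above; the exact cell values are illustrative.

# Why minimizing error rate alone is misleading on skewed data.
# Cell names follow the Figure 8 terminology as used in the text:
# a = boundaries correctly identified, b = nonboundaries misclassified as
# boundaries, c = boundaries missed, d = nonboundaries correctly identified.

def ir_metrics(a, b, c, d):
    recall = a / (a + c) if (a + c) else None        # None marks "undefined" ("-")
    precision = a / (a + b) if (a + b) else None
    fallout = b / (b + d) if (b + d) else None
    error = (b + c) / (a + b + c + d)
    return recall, precision, fallout, error

# A classifier that always predicts nonboundary, on training data with
# 40 boundaries out of 1004 sites (approximately the T = 6 situation):
print(ir_metrics(a=0, b=0, c=40, d=964))
# -> (0.0, None, 0.0, 0.0398...)
# The error rate is only about 4%, yet recall is 0 and precision is undefined.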
<Paragraph position="19"> In this section, we have presented two methods for developing segmentation hypotheses using multiple linguistic features. The first method, error analysis, tunes features and algorithms based on analysis of training errors. The second method, machine learning, automatically induces decision trees from coded corpora. Both methods rely on an enriched set of input features compared to our previous work. With each method, we have achieved marked improvements in performance over our earlier results and are approaching human performance. Quantitatively, the machine learning and EA methods differ only on certain metrics, and bear a somewhat inverse relation to one another for boundaries defined by T = 4 versus T = 3. Table 12, which compares EA with the two machine learning conditions, indicates which differences are statistically significant by giving the probability of a paired comparison on each of the 5 test narratives using Student's t test. For the T = 4 boundaries, the superior recall of EA compared with conditions 1 and 2 of the automated algorithms is significant. Conversely, the superior fallout of condition 1 and the superior error rate of condition 2 are significant. For the T = 3 boundaries, the differences are not statistically significant for condition 2, but for condition 1, precision and error rate are both superior, and the differences compared with EA are statistically significant. The largest and most statistically significant difference is the higher precision of the condition 1 automated algorithm. Qualitatively, the algorithms produced by error analysis are more intuitive and easier to understand than those produced by machine learning. Furthermore, note that the machine learning algorithm used the changes to the coding features that resulted from the error analysis. This suggests that error analysis is a useful method for understanding how to best code the data, while machine learning provides a cost-effective (and automatic) way to produce an optimally performing algorithm given a good feature representation.</Paragraph> </Section> </Section> class="xml-element"></Paper>