File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/97/j97-1005_concl.xml
Size: 9,137 bytes
Last Modified: 2025-10-06 13:57:44
<?xml version="1.0" standalone="yes"?> <Paper uid="J97-1005"> <Title>Discourse Segmentation by Human and Automated Means</Title> <Section position="7" start_page="133" end_page="135" type="concl"> <SectionTitle> 5. Conclusion and Future Directions </SectionTitle> <Paragraph position="0"> Our initial hypotheses regarding discourse segmentation were that multiutterance segment units reflect discourse coherence, and that while the semantic dimensions of this coherence may vary, it arises partly from consistency in the speaker's communicative goals (Grosz and Sidner 1986; Polanyi 1988). The results from the first part of our study (Section 3) support these hypotheses. On a relatively unconstrained linear segmentation task, the number of times different naive subjects identify the same segment boundaries in a given narrative transcript is extremely significant. Across the 20 narratives, statistical significance arises where at least three or four out of seven subjects agree on the same boundary location, depending on an arbitrary choice between probabilities of .02 versus .0001 as the significance threshold. We conclude that the segment boundaries identified by at least three or four of our subjects provide a statistically validated annotation to the narrative corpus corresponding to segments having relatively coherent communicative goals.</Paragraph> <Paragraph position="1"> Before making concluding remarks on part two of our study, we mention a few questions for future work on segmentation. We believe our results confirm the utility of abstracting from the responses of relatively many naive subjects (our method), and indicate a strong potential for developing coding protocols using smaller numbers of trained coders (as in Nakatani, Hirschberg, and Grosz \[1995\], and Hirschberg and Nakatani \[1996\]). The use of an even larger number of naive subjects might yield a finer-grained set of segments (cf. Rotondo \[1984\], Swerts, \[1995\]). This is an important dimension of difference between the two sets of segments we use: segments identified by a minimum of four subjects are larger and fewer in number than those identified by a minimum of three. In addition, performance can be improved by taking into account that some segment boundary locations may be relatively fuzzy, as we discuss in Passonneau and Litman (1996). Finally, differences in segmentation may reflect different interpretations of the discourse, as we pointed out in Passonneau and Litman (1996), based on observations of our subjects' segment descriptions.</Paragraph> <Paragraph position="2"> The second part of our study (Section 4) concerned the algorithmic identification of segment boundaries based on various combinations of three types of linguistic input: referential noun phrases, cue phrases, and pauses. We first evaluated an initial set of three algorithms, each based on a single type of linguistic input, and their additive combinations. Our results showed that the algorithms performed quite differently from one another on boundaries identified by at least four subjects on a test set of 10 narratives from our corpus. In particular, the NP algorithm (which used three fea- null Passonneau and Litman Discourse Segmentation tures) outperformed both the cue phrase and pause algorithms (each of which used only a single feature). 
<Paragraph position="3"> A comparison of results on the two sets of boundaries, those identified by at least three versus those identified by at least four subjects, shows roughly comparable performance. The &quot;Learning 1&quot; algorithm performs better on the set defined by T = 3 (Table 9); Error Analysis (Table 7) and &quot;Learning 2&quot; (Table 9) perform better on the T = 4 set. We have not yet determined what causes these differences, although in an earlier paper on our pilot study we reported a strong tendency for recall to increase and precision to decrease as boundary strength increases (Passonneau and Litman 1993). On the one hand, performance was consistently improved by enriching the linguistic input. On the other hand, there was wide performance variation around the mean. Despite this variation, as we pointed out in Litman and Passonneau (1995a), there are certain narratives on which the NP, EA, and both machine learning algorithms all perform similarly well, or similarly poorly. These observations indicate a need for further research regarding the interaction among variation in speaker style, granularity of segmentation, and richness of the linguistic input.</Paragraph>
<Paragraph position="4"> Finally, while our results are quite promising, how generally applicable are they, and do results such as ours have any practical import? As discussed in Section 2, the ability both to segment discourse and to correlate segmentation with linguistic devices has been demonstrated in dialogues and monologues, using both spoken and written corpora, across a wide variety of genres (e.g., task-oriented, advice-giving, information-query, expository, directions, and newspapers). Studies such as these suggest that our methodologies and/or results may be applicable to more than spontaneous narrative monologues.</Paragraph>
<Paragraph position="5"> As for the utility of our work, even though the algorithms in this paper were produced using some manually coded features, once developed they could be used in reverse to enhance the comprehensibility of text generation systems or the naturalness of text-to-speech systems that already attempt to convey discourse structure (e.g., systems such as Moore and Paris [1993] and Hirschberg [1990]). For example, given the algorithm shown in Figure 14, a generation system could better convey its discourse boundaries by constructing associated utterances where the values of coref, infer, and global.pro are as shown in the first line of the figure, or, for a spoken language system, where the value of cue-prosody is complex. In related work, we have tested the hypothesis that the use of a discourse focus structure based on the Pear segmentation data improves performance of a generation algorithm, thus providing a quantitative measure of the utility of the segmentation data (Passonneau 1996). There we present results of an evaluation of an NP generation algorithm under various conditions. The input to the algorithm consisted of semantic information about utterances in a Pear narrative, such as the referents mentioned in the utterance. Output was evaluated against what the human narrator actually said. When the input to the algorithm included a grouping of discourse referents into focus spaces derived from discourse segments, performance improved by 50%.</Paragraph>
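As an illustration of the "used in reverse" idea above, the sketch below shows how a generation or text-to-speech component might consult a learned segmentation rule before realizing an utterance. The feature names (coref, infer, global.pro, cue-prosody) and the value "complex" for cue-prosody come from the text; Figure 14 is not reproduced in this section, so the textual feature values tested below are placeholders, and the function itself is hypothetical rather than the paper's algorithm.

# Illustrative sketch only: consulting a learned segmentation rule of the kind
# shown in Figure 14 when deciding whether to realize a segment boundary.

def signals_boundary(features):
    """Return True if the (hypothetical) learned rule predicts a segment boundary."""
    text_rule = (
        features.get("coref") == "PLACEHOLDER"         # stand-in for the figure's value
        and features.get("infer") == "PLACEHOLDER"      # stand-in for the figure's value
        and features.get("global.pro") == "PLACEHOLDER" # stand-in for the figure's value
    )
    # For a spoken-language system, the condition mentioned in the text:
    # a boundary is also signaled when cue-prosody has the value "complex".
    prosodic_rule = features.get("cue-prosody") == "complex"
    return text_rule or prosodic_rule

# Hypothetical usage: where a boundary is predicted, the generator could start
# a new paragraph, insert a cue phrase, or lengthen a pause.
utterance = {"coref": "NA", "infer": "NA", "global.pro": "NA", "cue-prosody": "complex"}
print(signals_boundary(utterance))  # True, via the prosodic condition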
<Paragraph position="6"> In addition, if our results were fully automated, they could also be used to enhance the ability of understanding systems to recognize discourse structure, which in turn would improve tasks such as information retrieval (Hearst 1994) and plan recognition (Litman and Allen 1990). Recent results suggest that many of our manually coded features could be coded automatically. Given features largely output by a speech recognition system, Wightman and Ostendorf (1994) automatically recognize prosodic phrasing with 85-86% accuracy; this accuracy is only slightly less than human-human accuracy. Similarly, although our spoken corpus was manually transcribed, transcription could have been automated using speech recognition (at the cost of introducing further sources of error). In Aone and Bennett (1995), machine learning is used to automatically derive anaphora resolution algorithms from automatically produced feature representations; the learned algorithms outperform a manually derived system (whose average recall and precision were 66.5% and 72.9%, respectively).</Paragraph>
<Paragraph position="7"> Finally, the results of Litman (1996) show that there are many alternatives to the cue phrase algorithm used here, including some that use feature sets that can be fully coded automatically.</Paragraph>
</Section>
</Paper>