<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-1005">
  <Title>Discourse Segmentation by Human and Automated Means</Title>
  <Section position="4" start_page="104" end_page="107" type="relat">
    <SectionTitle>
2. Related Work
</SectionTitle>
    <Paragraph position="0"> There is much debate about how discourse segments should be defined, and about what kinds of relations should be assigned among segments. The nature of any hypothesized interaction between discourse structure and linguistic devices depends both on the model of discourse that is adopted, and on the types of linguistic devices that are investigated.</Paragraph>
    <Paragraph position="1"> Here we briefly review previous work on characterizing discourse segments, and on correlating discourse segments with utterance features. We conclude each review by summarizing the differences between our study and previous work.</Paragraph>
    <Section position="1" start_page="104" end_page="106" type="sub_section">
      <SectionTitle>
2.1 Characterizing the Notion of a Segment
</SectionTitle>
      <Paragraph position="0"> A number of alternative proposals have been presented, which relate segments to intentions (Grosz and Sidner 1986), Rhetorical Structure Theory (RST) relations (Mann and Thompson 1988) or other semantic relations (Polanyi 1988; Hobbs 1979). The linguistic structure of Grosz and Sidner's (1986) discourse model consists of multiutterance segments and structural relations among them, yielding a discourse tree structure. The hierarchical relations of their linguistic structure are isomorphic with the two other levels of their model, intentional structure and attentional state. Rhetorical relations do not play a role in their model. In Hobbs (1979) and Polanyi (1988), segmental structure is an artifact of coherence relations among utterances, such as elaboration, evaluation, cause, and so on. Their coherence relations are similar to those posited in RST (Mann and Thompson 1988), which informs much work in generation. Polanyi (1988) distinguishes among four types of Discourse Constituent Units (DCUs) based on different types of structural relations (e.g., sequence). As in Grosz and Sidner's (1986) model, Polanyi (1988) proposes that DCUs (analogous to segments) are structured as a tree, and in both models, the tree structure of discourse constrains how the discourse evolves, and how referring expressions are processed. Recent work (Moore and Paris 1993; Moore and Pollack 1992) has argued that to account for explanation dialogues, it is necessary to independently model both RST relations and intentions.</Paragraph>
      <Paragraph position="1"> Researchers have begun to investigate the ability of humans to agree with one another on segmentation, and to propose methodologies for quantifying their findings.</Paragraph>
      <Paragraph position="2"> The types of discourse units being coded and the relations among them vary. Several studies have used trained coders to locally and globally structure spontaneous or read speech using the model of Grosz and Sidner (1986), including Grosz and Hirschberg 1992; Nakatani, Hirschberg, and Grosz 1995; Stifelman 1995; Hirschberg and Nakatani 1996. In Grosz and Hirschberg (1992), percent agreement (see Section 3.2) among 7 coders on 3 texts under two conditions--text plus speech or text alone--is reported at levels ranging from 74.3% to 95.1%. In Hirschberg and Nakatani (1996), average reliability (measured using the kappa coefficient discussed in Carletta [1996]) of segment-initial labels among 3 coders on 9 monologues produced by the same speaker, labeled using text and speech, is .8 or above for both read and spontaneous speech; values of at least .8 are typically viewed as representing high reliability (see Section 3.2).</Paragraph>
      <Paragraph position="3"> Reliability of labeling from text alone is .56 for read and .63 for spontaneous speech.</Paragraph>
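      <Paragraph> The agreement figures above rest on two kinds of measure: raw percent agreement and the chance-corrected kappa coefficient. As a minimal sketch, the Python below computes both for two coders who label each unit as segment-initial (1) or not (0). The function names and the sample labels are hypothetical, and the two-coder Cohen formulation shown here is an assumption made for illustration; the cited studies may use multi-coder variants or a differently defined percent agreement (see Section 3.2).

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of units on which two coders assign the same label."""
    if len(a) != len(b):
        raise ValueError("label sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement: (P_o - P_e) / (1 - P_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    # Expected agreement if each coder labeled at random with
    # his or her own observed label frequencies.
    p_e = sum((count_a[lab] / n) * (count_b[lab] / n) for lab in labels)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 1 marks a segment-initial unit, 0 any other unit.
coder_1 = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
coder_2 = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohen_kappa(coder_1, coder_2)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

With these made-up labels, raw agreement is .80 while kappa is roughly .52, which illustrates how correcting for chance can lower an apparently high agreement figure when most units are non-boundaries.</Paragraph>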
      <!-- Running header: Computational Linguistics Volume 23, Number 1 -->
      <Paragraph position="4"> Other notions of segment have also been used in evaluating naive or trained coders. Hearst (1993) asked naive subjects to place boundaries between paragraphs of running text, to indicate topic changes. Hearst reports agreement of greater than 80%, and indicates that significance results were found that were similar to those reported in Passonneau and Litman (1993). Flammia and Zue (1995) asked subjects to segment textual transcriptions of telephone task-oriented dialogues, using minimal segmentation instructions based on a notion of topic: 18 dialogues were segmented by 5 coders (with varying levels of expertise in discourse), with an average pairwise kappa coefficient of .45. To evaluate hierarchical aspects of segmentation, Flammia and Zue also developed a new measure derived from the kappa coefficient. Swerts (1995) asked 38 subjects to mark "paragraph boundaries" in transcriptions of 12 spontaneous spoken monologues; half of the subjects segmented from text alone and half from text plus speech. However, no quantitative evaluation of the results was reported.</Paragraph>
      <Paragraph position="5"> Swerts and Ostendorf (1995) also empirically derived discourse structure, using a spoken corpus of database query interactions. Although the labelers had high levels of agreement, the segmentations were fairly trivial.</Paragraph>
      <Paragraph position="6"> Isard and Carletta (1995) presented 4 naive subjects and 1 expert coder with transcripts of task-oriented dialogues from the HCRC Map Task Corpus (Anderson et al. 1991).</Paragraph>
      <Paragraph position="7"> Utterance-like units referred to as moves were identified in the transcripts, and subjects were asked to identify transaction boundaries. Since reliability was lower than the .80 threshold, they concluded that their coding scheme and instructions required improvement.</Paragraph>
      <Paragraph position="8"> Moser and Moore (1995) investigated the reliability of various features defined in Relational Discourse Analysis (Moser, Moore, and Glendening 1995), based in part on RST. Their corpus consisted of written interactions between tutors and students, using 3 different tutors. Two coders were asked to identify segments, the core utterance of each segment, and certain intentional and informational relations between the core and the other contributor utterances. As reported in their talk (not in the paper), reliability on segment structure and core identification was well over the .80 threshold. Reliability on intentional and informational relations was around .75, high enough to support tentative conclusions.</Paragraph>
      <Paragraph position="9"> Finally, a method for segmenting dialogues based on a notion of control was used in Whittaker and Stenton (1988) and Walker and Whittaker (1990). Utterances were classified into four types, each of which was associated with a rule that assigned a controller; the discourse was then divided into segments, based on which speaker had control. Neither study presented any quantitative analysis of the ability to reliably perform the initial utterance classification. However, in Whittaker and Stenton (1988), a higher level of discourse structure based on topic shifts was agreed upon by at least 4 of 5 judges for 46 of the 56 control shifts.</Paragraph>
      <!-- Running header: Passonneau and Litman Discourse Segmentation -->
      <Paragraph position="10"> In sum, relatively few quantitative empirical studies have been made of how to annotate discourse corpora with features of discourse structure, and those recent ones that exist use various models such as the Grosz and Sidner model (1986), an informal notion of topic (Hearst 1994; Flammia and Zue 1995), transactions (Isard and Carletta 1995), Relational Discourse Analysis (Moser and Moore 1995), or control (Whittaker and Stenton 1988; Walker and Whittaker 1990). The modalities of the corpora investigated include dialogic or monologic, written, spontaneous or read, and the genres also vary. Quantitative evaluations of subjects' annotations using notions of agreement, interrater reliability, and/or significance show that good results can be difficult to achieve. As discussed in Section 3, our initial aim was to explore basic issues about segmentation; thus we used naive subjects on a highly unstructured task. Our corpus consists of transcripts of spontaneous spoken monologues, produced by 20 different speakers. We use an informal notion of communicative intention as the segmentation criterion, motivated by Grosz and Sidner (1986) and Polanyi (1988), who argue that defining a segment as having a coherent goal is more general than establishing a repertoire of specific types of segment goals. We do not, however, ask coders to identify hierarchical relations among segments. The hypothesis that discourse has a tree structure has frequently been questioned (Dale 1992; Moore and Pollack 1992; Hearst 1994; Walker 1995), and the magnitude of our segmentation task precludes asking subjects to specify hierarchical relations. Finally, we quantify our results using a significance test, a reliability measure, and, for purposes of comparison with other work, percent agreement.</Paragraph>
    </Section>
    <Section position="2" start_page="106" end_page="107" type="sub_section">
      <SectionTitle>
2.2 Correlation of Segmentation with Utterance Features
</SectionTitle>
      <Paragraph position="0"> The segmental structure of discourse has been claimed to constrain and be constrained by disparate phenomena, e.g., cue phrases (Hirschberg and Litman 1993; Grosz and Sidner 1986; Reichman 1985; Cohen 1984), plans and intentions (Carberry 1990; Litman and Allen 1990; Grosz and Sidner 1986), prosody (Hirschberg and Pierrehumbert 1986; Butterworth 1980), nominal reference (Webber 1991; Grosz and Sidner 1986; Linde 1979), and tense (Webber 1988; Hwang and Schubert 1992; Song and Cohen 1991). However, just as with the early proposals regarding segmentation, many of these proposals are based on fairly informal studies. It is only recently that attempts have been made to quantitatively evaluate how utterance features correlate with independently justified segmentations. Many of the studies discussed in the preceding subsection take this approach. The types of linguistic features investigated include prosody (Grosz and Hirschberg 1992; Nakatani, Hirschberg, and Grosz 1995; Hirschberg and Nakatani 1996; Swerts 1995; Swerts and Ostendorf 1995), term repetition (Hearst 1994), cue words (Moser and Moore 1995; Whittaker and Stenton 1988), and discourse anaphora (Walker and Whittaker 1990).</Paragraph>
      <Paragraph position="1"> Grosz and Hirschberg (1992) investigate the prosodic structuring of discourse.</Paragraph>
      <Paragraph position="2"> The correlation of various prosodic features with their independently obtained consensus codings of segmental structure (codings on which all labelers agreed) is analyzed using t-tests; the results support the hypothesis that discourse structure is marked intonationally in read speech. For example, pauses tended to precede phrases that initiated segments (independent of hierarchical structure) and to follow phrases that ended segments. Similar results are reported in Nakatani, Hirschberg, and Grosz (1995) and Hirschberg and Nakatani (1996) for spontaneous speech as well. Grosz and Hirschberg (1992) also use the classification and regression tree system CART (Breiman et al. 1984) to automatically construct and evaluate decision trees for classifying aspects of discourse structure from intonational feature values.</Paragraph>
      <Paragraph position="3"> The studies of Swerts (1995) and Swerts and Ostendorf (1995) also investigate the prosodic structuring of discourse. In Swerts (1995), paragraph boundaries are empirically obtained as described above. The prosodic features pitch range, pause duration, and number of low boundary tones are claimed to increase continuously with boundary strength (the proportion of subjects identifying a boundary). However, there is no analysis of the statistical significance of these correlations. In Swerts and Ostendorf (1995), prosodic as well as textual features are shown to be correlated with their independently obtained (but fairly trivial) discourse segmentations of travel-planning interactions, with statistical significance.</Paragraph>
      <Paragraph position="4"> Hearst's (1994) TextTiling algorithm structures expository text into sequential segments based on term repetition. Hearst (1994) uses information retrieval metrics (see Section 4.1) to evaluate two versions of TextTiling against independently derived segmentations produced by at least three of seven human judges. Precision was .66 for the best version, compared with .81 for humans; recall was .61 compared with .71 for humans. The use of term repetition (and a related notion of lexical cohesion) is not unique to Hearst's work; related studies include Morris and Hirst (1991), Youmans (1991), Kozima (1993), and Reynar (1994). Unlike Hearst's work, these studies either use segmentations that are not empirically justified, or present only qualitative analyses of the correlation with linguistic devices.</Paragraph>
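      <Paragraph> The evaluation scheme described above, in which algorithm output is scored against boundaries chosen by a minimum number of judges, can be sketched as follows. This is an illustrative assumption about how such a computation might be set up, not code from any of the cited studies; the function names, the boundary positions, and the judges' markings are all hypothetical.

```python
from collections import Counter


def reference_boundaries(judgments, threshold):
    """Keep boundary positions marked by at least `threshold` judges."""
    counts = Counter(pos for judge in judgments for pos in set(judge))
    return {pos for pos, n in counts.items() if n >= threshold}


def precision_recall(hypothesized, reference):
    """Precision: share of hypothesized boundaries that are correct.
    Recall: share of reference boundaries that were found."""
    hyp, ref = set(hypothesized), set(reference)
    correct = len(hyp.intersection(ref))
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical markings: each set holds the positions one judge marked.
judges = [
    {3, 7, 12},
    {3, 8, 12},
    {3, 7, 12, 18},
    {4, 7},
    {3, 12},
    {7, 12, 18},
    {3, 18},
]
reference = reference_boundaries(judges, threshold=3)
algorithm_output = {3, 8, 12}
precision, recall = precision_recall(algorithm_output, reference)
print(sorted(reference), round(precision, 2), round(recall, 2))
```

Here the threshold of three out of seven judges mirrors the criterion mentioned above; with the made-up data, two of the three hypothesized boundaries are in the reference set of four, giving precision of about .67 and recall of .50.</Paragraph>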
      <Paragraph position="5"> After identifying segments, and core and contributor relations within segments, Moser and Moore (1995) investigate whether cue words occur, where they occur, and what word occurs. In their talk, they presented results showing that the occurrence and placement of a discourse usage of a cue word correlates with relative order of core versus contributor utterances. For example, a discourse cue is more likely to occur when the contributor precedes the core utterance (p &lt; .001).</Paragraph>
      <Paragraph position="6"> Finally, Whittaker and Stenton (1988) examined a wide variety of means for signaling discourse structure. Prompts, repetitions, and summaries rather than cue words more often signaled control-based discourse segment boundaries. No statistical analysis of the significance of the differences was presented, however. By statistically analyzing distributions of discourse anaphora with respect to control-based discourse segments, Walker and Whittaker (1990) showed that shifts of attentional state (Grosz and Sidner 1986) occurred when shifts in control were accepted by all dialogue participants. In sum, relatively few studies correlate linguistic devices with empirically justified discourse segmentations. Quantitative evaluations of the correlations include the use of statistical measures and information retrieval metrics. As discussed in Section 4, we derive discourse segmentations based on the statistical significance of the agreement among our subjects. In contrast to studies investigating a single feature, we investigate three types of linguistic devices--referential noun phrases, prosody, and cue phrases.</Paragraph>
      <Paragraph position="7"> In addition, we are concerned with the extra step of developing segmentation algorithms rather than with the demonstration of statistical correlations. We first develop algorithms using each type of linguistic device in isolation, motivated by existing hypotheses in the literature. Then we propose and evaluate methods for combining them.</Paragraph>
      <Paragraph position="8"> We use measures from information retrieval to quantify and evaluate our results.</Paragraph>
    </Section>
  </Section>
</Paper>