<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0301">
  <Title>marization, and generation of natural language</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Motivation
</SectionTitle>
    <Paragraph position="0"> The automatic identification of discourse segments and discourse markers in unrestricted texts is crucial for solving many outstanding problems in natural language processing, which range from syntactic and semantic analysis, to anaphora resolution and text summarization. Most of the algorithmic research in discourse segmentation focused on segments of coarse granularity (Grosz and Hirschberg, 1992; Hirschberg and Litman, 1993; Passonneau and Litman, 1997; Hearst, 1997; Yaari, 1997).</Paragraph>
    <Paragraph position="1"> These segments were defined intentionally in terms of Grosz and Sidner's theory (1986) or in terms of an intuitive notion of &amp;quot;topic&amp;quot;.</Paragraph>
    <Paragraph position="2"> However, in case of applications such as anaphora resolution, discourse parsing, and text summarization, even sentences might prove to be too large discourse segments. For example, if we are to defive the discourse structure of texts using an RSTlike representation (Mann and Thompson, 1988), we will need to determine the elementary textual units that contribute rhetorically to the understanding of those texts; usually, these units are clause-like units. Also, if we want to select the most important parts of a text, sentences might prove again to be too large segments (Marcu, 1997a; Teufel and Moens, 1998): in some cases, only one of the clauses that make up a sentence should be selected for summarization.</Paragraph>
    <Paragraph position="3"> In this paper, I present a surface-based algorithm that uses cue phrases (connectives) in order to deterrnine not only the elementary textual units of text but also the phrases that have a discourse function.</Paragraph>
    <Paragraph position="4"> The algorithm is empirically grounded in an extensive corpus analysis of cue phrases and is consistent with the psycholinguistic position advocated by Caron (1997, p. 70). Caron argues that &amp;quot;'rather than conveying information about states of things, connectives can be conceived as procedural instructions for constructing a semantic representation&amp;quot;. Among the three procedural functions of segmentation, integration, and inference that are used by Noordman and Vonk (1997) in order to study the role of connectives, I will concentrate here primarily on the first.l 2 A corpus analysis of cue phrases I used previous work on coherence and cohesion to create an initial set of more than 450 potential discourse markers (cue phrases). For each cue phrase, I then used an automatic procedure that extracted from the Brown corpus a random set of text fragments that each contained that cue. On average, I selected approximately 17 text fragments per cue phrase, having few texts for the cue phrases that do not occur very often in the corpus and up to 60 for cue phrases, such as and, that I considered to be highly ambiguous. Overall, I randomly selected more than 7600 texts. Marcu (1997b) lists all cue phrases that were used to extract text fragments from the Brown corpus, the number of occurrences of each cue phrase in the corpus, and the number of text fragments that were randomly extracted for each cue phrase.</Paragraph>
    <Paragraph position="5"> All the text fragments associated with a potential discourse marker were paired with a set of slots in which I described, among other features, the following: 1. The orthographic environment that characterized the usage of the potential discourse marker. This included occurrences of periods, commas, colons, semicolons, etc. 2. The type of usage: Sentential, Discourse, or Pragmatic. 3. The I Marcu (1997b)studies the other two functions as well.</Paragraph>
    <Paragraph position="6"> position of the marker in the textual unit to which it belonged: Beginning, Medial, or End. 4. The right boundary of the textual unit associated with the marker. 5. A name of an &amp;quot;action&amp;quot; that can be used by a shallow analyzer in order to determine the elementary units of a text. The shallow analyzer assumes that text is processed in a left-to-fight fashion and that a set of flags monitors the segmentation process. Whenever a cue phrase is detected, the shallow analyzer executes an action from a predetermined set, whose effect is one of the following: create an elementary textual unit boundary in the input text stream; or set a flag. Later, if certain conditions are satisfied, the flag setting may lead to the creation of a textual unit boundary. Since a discussion of the actions is meaningless in isolation, I will provide it in conjunction with the clause-like unit boundary and marker-identification algorithm.</Paragraph>
    <Paragraph position="7"> The algorithm described in this paper relies on the results derived from the analysis of 2200 of the 7600 text fragments and on the intuitions developed during the analysis.</Paragraph>
  </Section>
class="xml-element"></Paper>