<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1206">
  <Title>Enhancement of a Chinese Discourse Marker Tagger with C4.5</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Discourse refers to any form of language-based communication involving multiple sentences or utterances. The most important forms of discourse of interest to Natural Language Processing (NLP) are text and dialogue. The function of discourse analysis is to divide a text into discourse segments, and to recognize and re-construct the discourse structure of the text as intended by its author.</Paragraph>
    <Paragraph position="1"> Automatic text abstraction has received considerable attention (Paice 1990). Various systems have been developed (Chan et al. 2000).</Paragraph>
    <Paragraph position="2"> Ono et al. (1994), T'sou et al. (1992) and Marcu (1997) focus on discourse structure in summarization using the Rhetorical Structure Theory (RST, Mann and Thompson 1986). The theory has been exploited in a number of computational systems (e.g. Hovy 1993). The main idea is to build a discourse tree where each node of the tree represents an RST relation.</Paragraph>
    <Paragraph position="3"> Summarization is achieved by trimming unimportant sentences on the basis of relative saliency or rhetorical relations.</Paragraph>
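The tree-trimming idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any system cited here: the Node class, the budget parameter, the relation names, and the example sentences are all hypothetical, and real RST-based summarizers use richer relation inventories and saliency scores.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A node in a toy discourse tree (hypothetical structure)."""
    relation: Optional[str] = None      # RST relation name (internal nodes)
    nucleus: Optional["Node"] = None    # the more salient child
    satellite: Optional["Node"] = None  # the supporting child
    text: Optional[str] = None          # sentence text (leaf nodes)
    index: int = -1                     # position of the sentence in the text


def salient_leaves(node: Node, budget: int) -> List[Tuple[int, str]]:
    """Always keep nuclei; keep a satellite only while budget remains."""
    if node.text is not None:
        return [(node.index, node.text)]
    kept = salient_leaves(node.nucleus, budget)
    if budget > 0:
        kept += salient_leaves(node.satellite, budget - 1)
    return kept


def summarize(root: Node, budget: int = 0) -> List[str]:
    """Return the kept sentences in their original order."""
    return [text for _, text in sorted(salient_leaves(root, budget))]


# Hypothetical three-sentence text with two relations.
s0 = Node(text="The tagger mislabels ambiguous markers.", index=0)
s1 = Node(text="One word may act as a conjunction or an adverb.", index=1)
s2 = Node(text="We therefore add a learned disambiguation step.", index=2)
background = Node(relation="ELABORATION", nucleus=s0, satellite=s1)
root = Node(relation="RESULT", nucleus=s2, satellite=background)

for budget in range(3):
    print(budget, summarize(root, budget))
```

With a budget of 0 only the top-level nucleus survives; each increase in the budget readmits satellites one level further down the tree.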
    <Paragraph position="4"> The SIFAS (Syntactic Marker based Full-Text Abstraction System) has been implemented to use discourse markers in the automatic summarization of Chinese (T'sou et al. 1999). In this paper, we report our efforts to improve the SIFAS tagging system by applying machine learning techniques to the disambiguation of discourse markers. C4.5 (Quinlan, 1993) is used in our system.</Paragraph>
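C4.5 chooses the attribute to split on by gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below computes that criterion on a toy data set. The feature names (position, comma), the labels, and the four rows are hypothetical and serve only as illustration; the features actually used for discourse marker disambiguation are not stated in this section.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(rows, feature, target):
    """Information gain of `feature` divided by its split information."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    # Expected entropy of the target after splitting on the feature.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical candidate words: is each one used as a discourse marker?
rows = [
    {"position": "initial", "comma": "yes", "label": "marker"},
    {"position": "initial", "comma": "no", "label": "marker"},
    {"position": "medial", "comma": "yes", "label": "other"},
    {"position": "medial", "comma": "no", "label": "other"},
]

for feature in ("position", "comma"):
    print(feature, gain_ratio(rows, feature, "label"))
```

In this toy data the position feature separates the two labels perfectly while the comma feature carries no information, so a C4.5-style learner would split on position first.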
  </Section>
</Paper>