<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0212">
  <Title>Annotation and Data Mining of the Penn Discourse TreeBank</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Large-scale annotated corpora such as the Penn TreeBank (PTB) (Marcus et al., 1993) have played a central role in speech and natural language research.</Paragraph>
    <Paragraph position="1"> However, with the demand for more powerful NLP applications comes a need for greater richness in annotation - hence the development of PropBank (Kingsbury and Palmer, 2002), which adds basic semantics to the PTB in the form of verb predicate-argument annotation and, eventually, similar annotation of nominalizations. We have been developing yet another annotation layer above both of these. The Penn Discourse TreeBank (PDTB) adds low-level discourse structure and semantics through the annotation of discourse connectives and their arguments, using connective-specific semantic role labels. With this added knowledge, the PDTB (together with the PTB and PropBank) should support more in-depth NLP research and more powerful applications.</Paragraph>
    <Paragraph position="2"> Work on the PDTB is grounded in a lexicalized approach to discourse - DLTAG (Webber and Joshi, 1998; Webber et al., 1999a; Webber et al., 2000; Webber et al., 2003). Here, low-level discourse structure and semantics are taken to result (in part) from composing elementary predicate-argument relations whose predicates come mainly from discourse connectives [1] and whose arguments come from units of discourse - clausal, sentential or multi-sentential units. The PDTB therefore differs from the RST-annotated corpus (Carlson et al., 2003), which starts with (abstract) rhetorical relations (Mann and Thompson, 1988) and annotates a subset of the Penn WSJ corpus with those relations that can be taken to hold between (primarily) pairs of discourse spans identified in the corpus. [1] Despite this, we have deliberately adopted a policy of hav...</Paragraph>
    <Paragraph position="3"> The current paper focuses on what can be discovered through analyzing PDTB annotation, both on its own and together with the Penn TreeBank.</Paragraph>
    <Paragraph position="4"> Section 2 of the paper briefly reviews the theoretical background of the project, its current state, the guidelines given to annotators, the annotation tool they used (WordFreak), and the extent of inter-annotator agreement. Section 3 shows how we have used PDTB annotation, along with the PTB, to extract several features pertaining to discourse connectives and their arguments, and discusses the relevance of these features for NLP research and applications. Section 4 concludes with a summary.</Paragraph>
  </Section>
</Paper>