<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0206">
  <Title>Discourse-Level Annotation for Investigating Information Structure</Title>
  <Section position="3" start_page="2" end_page="7" type="intro">
    <SectionTitle>
2 Methodology
</SectionTitle>
    <Paragraph position="0"> Text samples of varying origin, genre, language and size have previously been annotated with theory-specific notions of IS by various authors. Such data are typically not publicly available, and even when they can be obtained, it is very hard, if not impossible, to compare and reuse the different annotations. More promising in this respect are annotations that include or add some aspect(s) of IS to an existing corpus or treebank. The most systematic effort of this kind that we are familiar with is the Topic-Focus annotation in the Prague Dependency Treebank (Buráňová et al., 2000).</Paragraph>
    <Paragraph position="1"> In contrast to other projects in which IS is annotated and investigated, we do not annotate theory-biased abstract categories like Topic-Focus or Theme-Rheme. Since we are particularly interested in the correlations and co-occurrences of features on different linguistic levels that can be interpreted as indicators of the abstract IS categories, we needed an annotation scheme that is as theory-neutral as possible: it should allow for a description of the phenomena from which 'any' theory-specific explanatory mechanisms can subsequently be derived (Skut et al., 1997). We therefore concentrate on features pertaining, on the one hand, to the surface realization of linguistic expressions (the levels of syntax and prosody) and, on the other hand, to the semantic character of the discourse referents (the discourse level).</Paragraph>
    <Paragraph position="2"> In designing our annotation schemes, we followed the guidelines of the Text Encoding Initiative (http://www.tei-c.org/) and the Discourse Resource Initiative (Carletta et al., 1997). In line with these standards, we define for each annotation level (i) the markable expressions, (ii) the attributes of markables, and (iii) the links between markables (if any).</Paragraph>
    <Paragraph position="3"> Syntax The Tiger treebank and the Penn Treebank, which we use as the starting point, already contain syntactic information. The additional syntactic features annotated in the MULI project pertain to clauses as markable units, and encode the presence of structures with noncanonical word order that typically serve to put the focus on certain syntactic elements. We include cleft, pseudo-cleft, reversed pseudo-cleft, extraposition, fronting and expletives, as well as voice distinctions (active, mediopassive and passive). We annotate these features explicitly (when not already present in the treebank annotation), to be able to correlate them directly with features at other levels. The annotation scheme draws on accounts of the analysed features in (Eisenberg, 1994) and (Weinrich, 1993) for German and in (Quirk et al., 1985) and (Biber et al., 1999) for English.</Paragraph>
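The three-part scheme described above (markables, their attributes, and optional links) together with the clause-level syntactic features can be illustrated by a minimal sketch. It is not the MULI project's actual data format: the class names, the field names, the token-index representation and the attribute keys "construction" and "voice" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Markable:
    """A markable expression on one annotation level (hypothetical layout)."""
    markable_id: str
    level: str        # e.g. "syntax", "prosody" or "discourse"
    start: int        # index of the first covered token (inclusive)
    end: int          # index of the last covered token (inclusive)
    attributes: dict = field(default_factory=dict)


@dataclass
class Link:
    """A directed link between two markables of the same level."""
    source_id: str
    target_id: str


def tokens_of(markable, tokens):
    """Return the tokens covered by a markable."""
    return tokens[markable.start:markable.end + 1]


# A clause markable carrying the syntactic features named in the text.
tokens = "It was the report that she read".split()
clause = Markable(
    markable_id="c1",
    level="syntax",
    start=0,
    end=6,
    attributes={"construction": "cleft", "voice": "active"},
)
```

Keeping the attributes in a plain dictionary means each level can define its own feature set without changing the markable type, which is one simple way to correlate features across levels later on.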
    <Paragraph position="4"> Prosody For the prosodic annotation, we recorded one German and one English native speaker reading aloud the texts of the MULI corpus.</Paragraph>
    <Paragraph position="5"> The recordings were digitised and annotated using the EMU Speech Database System (Cassidy and Harrington, 2001b; http://emu.sourceforge.net/).</Paragraph>
    <Paragraph position="6"> The markables at the prosody level are intonation phrases, intermediate phrases and words. Their attributes encode the position and strength of phrase breaks, and the position and type of pitch accents and boundary tones, following the conventions of ToBI (Tones and Break Indices; Beckman and Hirschberg, 1994) for English and GToBI (Grice et al., in press) for German, which are regarded as standards for describing the intonation of these languages within the framework of autosegmental-metrical phonology.</Paragraph>
    <Paragraph position="7"> Discourse At the discourse level, we define as markable those linguistic expressions that introduce or access discourse entities (i.e., discourse referents in the sense used in DRT and the like) (Webber, 1983; Kamp and Reyle, 1993). Currently we consider primarily the discourse entities introduced by "nominal-like" expressions (Passonneau, 1996). We include other kinds of expressions as markables only when they participate in an anaphoric relation with a "nominal-like" expression. For example, a sentence is a markable when it serves as the antecedent of a discourse-deictic anaphoric expression (Webber, 1991); the main verb of a sentence is a markable when the subject of the sentence is a "zero-anaphor", etc. Our annotation instructions for identifying markables are an amalgamation and extension of those of the MUC-7 Coreference</Paragraph>
    <Section position="1" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
Task Definition
</SectionTitle>
      <Paragraph position="0"> , the DRAMA annotation manual (Passonneau, 1996), and (Wind, 2002). The attributes of markables in our discourse-level annotation scheme are designed to capture a range of properties that semantically characterize the discourse entities evoked by linguistic expressions. In this we differ from most existing discourse-level annotation efforts, which concentrate on the linguistic expressions and on identifying anaphoric relations between them (i.e., identifying anaphors and their antecedents). A notable exception is the GNOME project annotation scheme (Poesio et al., 1999): in GNOME, the aim was to annotate a corpus with information relevant for noun phrase generation. This included syntactic, semantic and discourse attributes of nominal expressions. The semantic attributes include, among others, animacy, ontological status, countability, quantification and generic vs. specific reference, which reflect distinctions similar to those we make in our annotation scheme.</Paragraph>
      <Paragraph position="1"> (Footnote, prosodic annotation) We are aware that using recorded speech is not ideal. We nevertheless chose this approach, as we wanted to work on top of existing treebanks. As far as we are aware, no treebank exists for any of the publicly available speech corpora.</Paragraph>
      <Paragraph position="3"> (Footnote, prosodic annotation) Since prosodic annotation is very time-consuming, we had to concentrate mainly on one language. Thus, we analysed all German texts and restricted ourselves to some English examples. Since speaking preferences vary from speaker to speaker, we will have to record additional speakers in order to arrive at generalizable results.</Paragraph>
      <Paragraph position="4"> Besides the semantic properties that characterize discourse entities individually, our annotation scheme also covers referential relations between discourse entities, including both identity and bridging. We build on and extend the MUC-7 coreference specification and the coreference/bridging classifications described in (Passonneau, 1996), (Carletta et al., 1997), (Poesio, 2000) and (Müller and Strube, 2001). We represent anaphoric relations between linguistic expressions through links between the corresponding markables. The type of relation is annotated as an attribute of the markable corresponding to the anaphor.</Paragraph>
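The representation of anaphoric relations described above (a link between two markables, with the relation type stored on the anaphor) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the class name, the attribute key "relation", the link table and the example expressions are hypothetical and are not taken from the annotation scheme itself.

```python
from dataclasses import dataclass, field


@dataclass
class DiscourseMarkable:
    """A discourse-level markable (hypothetical layout)."""
    markable_id: str
    text: str
    attributes: dict = field(default_factory=dict)


def resolve(anaphor_id, markables, links):
    """Return (antecedent, relation type) for an anaphor, or None if unlinked."""
    antecedent_id = links.get(anaphor_id)
    if antecedent_id is None:
        return None
    # The relation type is read from the anaphor, not from the link itself.
    relation = markables[anaphor_id].attributes.get("relation")
    return markables[antecedent_id], relation


markables = {
    "m1": DiscourseMarkable("m1", "a house"),
    "m2": DiscourseMarkable("m2", "the door", {"relation": "bridging"}),
    "m3": DiscourseMarkable("m3", "it", {"relation": "identity"}),
}
# Each link points from an anaphor to its antecedent.
links = {"m2": "m1", "m3": "m1"}
```

Calling resolve("m2", markables, links) yields the markable for "a house" together with the relation type "bridging", whereas an expression that only introduces an entity, such as "m1", has no outgoing link and resolves to None.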
    </Section>
  </Section>
</Paper>