<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0206">
  <Title>Discourse-Level Annotation for Investigating Information Structure</Title>
  <Section position="4" start_page="7" end_page="8" type="metho">
    <SectionTitle>
3 Discourse-Level Annotation
</SectionTitle>
    <Paragraph position="0"> Information structure theories describe the phenomena at hand at a surface level, at a semantic level, or at both levels simultaneously, i.e., an expression belongs to some IS partition in virtue of some information status of the corresponding discourse entity. For the investigation of IS at the (discourse) semantic level, we thus need more information about the character of the discourse entities introduced by linguistic expressions. We therefore annotated expressions with their discourse referents and the following properties: Semantic type/sort reflects the ontological character of a discourse entity: object, property, eventuality or textual entity. Since the primary focus of our current annotation is discourse entities evoked by nominal-like expressions, most of them denote objects. Objects are further classified according to semantic sorts: human/person, office/profession, organization, animal, plant, physical object, quantity/amount, date/time, location/place, group/collection, abstract entity, other. Properties are classified as either temporal or permanent. Eventualities have the sub-classes phase (habit or state) and process (activity, accomplishment, achievement). Textual entities are for now not further classified.</Paragraph>
    <Paragraph position="1"> Denotation characteristics of a discourse entity are captured by a combination of attributes, inspired by (Hlavsa, 1975). First, we distinguish between denotational (extensional, referential) and non-denotational (intensional, attributive) uses of linguistic expressions. Denotationally used expressions pick out (specify) some instance(s) of the designated concept(s). The instance(s) can be uniquely specified (=identifiable to the hearer), or specific but not identifiable, or even unspecific (arbitrary, generic - so any instance will do). Generic references are seen as denoting types. An expression is used non-denotationally when it attributes or qualifies, i.e., evokes the characteristic properties of a concept, without actually instantiating it. A typical example of a non-denotationally used expression is a predicative NP, as in &quot;He was a painter&quot;.</Paragraph>
    <Paragraph position="2"> The annotation of a group of denotation properties is motivated by the need to have a language-independent characterization of the referents as such, rather than of the properties of the referring expression, such as (in)definiteness. The latter is a surface reflex of a combination of denotation characteristics, and sometimes may not even be overtly indicated by articles or other determiners.</Paragraph>
    <Paragraph position="3"> For the denotationally used expressions, we then analyze what part of the domain designated by the expression is actually included in the extension.</Paragraph>
    <Paragraph position="4"> These aspects are annotated in the determination, delimitation and quantification attributes.</Paragraph>
    <Paragraph position="5"> Determination characterizes the specificity of the denoted concept instance. Unique determination means that the entity is uniquely specified, i.e., the hearer can (or is assumed to be able to) identify the entity/instance intended by the speaker.</Paragraph>
    <Paragraph position="6"> There may be just one such entity, e.g., as with proper names, or there are possibly more entities that satisfy the description, but the speaker means a particular one and assumes that the hearer can identify it. Anaphoric pronouns are also typically used as unique denotators. Finally, an entity can be uniquely specified through a relation to another entity, or through a relation between expressions in the text. In (Hlavsa, 1975) this is called relational uniqueness; it seems to correspond to Loebner's notion of NPs as functions, used in the GNOME annotation scheme.</Paragraph>
    <Paragraph position="7"> Existential determination is assigned to entities that are not uniquely specified, that is, the speaker does not assume the hearer to be able to identify a particular entity, but in principle the speaker would be able to identify one. Such unique identification by the hearer may not be important for the interaction; it is enough to take &quot;some instance&quot;. Variable determination is assigned when an expression not only fails to specify an entity uniquely, but a particular entity cannot in principle be identified; rather, the speaker means an arbitrary ('any') instance. Typical examples are generics, or references to types.</Paragraph>
    <Paragraph position="8"> Delimitation characterizes the extent of the denoted concept instance with respect to the domain designated by the expression. The possible values are total and partial, indicating that the entire domain designated by the expression is included in the extension, or only a part of it.</Paragraph>
    <Paragraph position="9"> Quantification captures the countability of the denoted concept instance, and if countable, the quantity of the individual objects included in the extension: uncountable is assigned when it is impossible to decompose the extension into countable distinguishable individual objects, e.g., with mass nouns; specific-single means a quantity of one, e.g., &quot;one x&quot;, &quot;the other x&quot;; specific-multiple means a concrete quantity larger than one, e.g., &quot;two x&quot;, &quot;both x&quot;, &quot;a dozen&quot;; unspecific-multiple means an unspecified number larger than one, e.g., &quot;some x&quot;, &quot;many x&quot;, &quot;most x&quot;.</Paragraph>
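    <Paragraph> The denotation attributes described above can be read as a small closed vocabulary. The following is a minimal sketch of that reading, not the scheme's actual implementation: the class and method names (Use, Determination, Delimitation, Quantification, Denotation, from_labels) are hypothetical, and the attribute values chosen for the proper-name example are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Use(Enum):
    DENOTATIONAL = "denotational"          # extensional, referential
    NON_DENOTATIONAL = "non-denotational"  # intensional, attributive


class Determination(Enum):
    UNIQUE = "unique"            # hearer is assumed able to identify the instance
    EXISTENTIAL = "existential"  # "some instance"; only the speaker could identify it
    VARIABLE = "variable"        # arbitrary ("any") instance, e.g. generics


class Delimitation(Enum):
    TOTAL = "total"      # the entire designated domain is in the extension
    PARTIAL = "partial"  # only a part of the designated domain is in the extension


class Quantification(Enum):
    UNCOUNTABLE = "uncountable"
    SPECIFIC_SINGLE = "specific-single"
    SPECIFIC_MULTIPLE = "specific-multiple"
    UNSPECIFIC_MULTIPLE = "unspecific-multiple"


@dataclass(frozen=True)
class Denotation:
    """Denotation characteristics of one discourse entity (hypothetical record)."""
    use: Use
    determination: Determination
    delimitation: Delimitation
    quantification: Quantification

    @classmethod
    def from_labels(cls, use, determination, delimitation, quantification):
        # Enum lookup by value raises ValueError for labels outside the scheme.
        return cls(Use(use), Determination(determination),
                   Delimitation(delimitation), Quantification(quantification))


# A proper name picking out one identifiable individual (illustrative values).
proper_name = Denotation.from_labels(
    "denotational", "unique", "total", "specific-single")
```

Restricting each attribute to an enumerated set means that a mistyped label fails at load time instead of silently creating a new category.</Paragraph>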
    <Paragraph position="10"> Familiarity Status is a notion that most approaches to IS use as one dimension or level of the IS-partitioning, for example Given/New in (Halliday, 1985), Background/Focus in (Steedman, 2000), or as the basis for deriving a higher level of partitioning (Sgall et al., 1986).</Paragraph>
    <Paragraph position="11"> It is therefore important to capture it in our annotation as an independent feature, so that we can correlate it with other features at the discourse level and at other levels. We apply the familiarity status taxonomy from (Prince, 1981), distinguishing between brand-new, unused, inferrable, and textually or situationally evoked entities. We are aware that operationalizing Prince's taxonomy is difficult. For the time being, our annotation guidelines give intuitive descriptions of the different statuses, roughly as follows: brand new: create a new discourse referent for a previously unknown object; unused: create a new discourse referent for a known object; inferrable: create a new discourse referent for an inferrable object; evoked (textually or situationally): access an available discourse referent.</Paragraph>
    <Paragraph position="12"> Annotators' uncertainty or discrepancies between annotators help us to identify problematic cases, and to revise the guidelines where necessary. (Footnote: Our reason for applying the familiarity taxonomy from (Prince, 1981) is that it addresses the status of discourse entities as such, not other referential properties. For example, the givenness hierarchy in (Gundel et al., 1993) interleaves information status with uniqueness and specificity.)</Paragraph>
    <Paragraph position="13"> Linguistic form encodes the syntactic category of the markable expression. This is not an attribute encoding a semantic property of a discourse entity.</Paragraph>
    <Paragraph position="14"> We have found it useful to distinguish the following categories:</Paragraph>
    <Paragraph position="15"> nominal group is a &quot;normal&quot; NP with a head noun; pronominal subsumes expressions headed by a personal, demonstrative, interrogative or relative pronoun; possessive covers possessive premodifiers (typically a possessive pronoun, e.g., &quot;our view&quot;, or a possessive adjective, e.g., &quot;the Treasury's threat&quot; or in German &quot;newyorker Börse&quot;); pronominal adverb in German, e.g., &quot;daraus&quot; (from that); apposition and coordination; clitic is used for clitics and in those cases when an expression contains a clitic affix (though not frequent in English and German newspaper text); ellipsis is used for elliptical (reduced) expressions, which function as nominal-like groups but contain no nominal head (e.g., &quot;the first&quot;); in case a discourse entity is evoked by a zero argument, e.g., in case of subject or object pro-drop, a markable is created on a surrogate non-nominal expression, labeled as zero-arg; finally, clause or text are used for markables which are clauses and simple sentences, or text segments, respectively (note that these are only markables when they serve as antecedents to nominal anaphors).</Paragraph>
    <Paragraph position="16"> These categories classify the linguistic forms of expressions independently of the categories employed in the syntactic-level annotation. There are also technical reasons for introducing a form-feature, e.g., when some other expression serves as a markable to annotate the attributes of the discourse entity corresponding to a &quot;zero-anaphor&quot; or to a clitic affix. Referential link encodes the type of relation between the discourse entity corresponding to an anaphoric expression, and the one corresponding to the (most likely) antecedent. The referential links we distinguish are identity (representing coreference) and bridging, further classified into set-membership, set-containment, part-whole composition, property-attribution, generalized possession, causal link and lexical-argument-filling.</Paragraph>
    <Paragraph position="17"> The attributes of information status and referential link are related, but we include them both, because the former is a property of a discourse entity, while the latter directly reflects anaphoricity as a property of an expression (the size of it ranging, ultimately, from a word to a segment of a discourse). The relation between anaphoricity and IS is not a straightforward one, and needs further investigation, enabled by an annotation like ours.</Paragraph>
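    <Paragraph> As a rough illustration of how referential links relate markables, the sketch below models a link as a record pointing from an anaphoric markable to its antecedent. The names ReferentialLink, LinkType and BridgingType and the markable identifiers are hypothetical, and the rule that only bridging links may carry a subtype is our reading of the classification above, not a documented constraint of the annotation tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LinkType(Enum):
    IDENTITY = "identity"  # coreference
    BRIDGING = "bridging"


class BridgingType(Enum):
    SET_MEMBERSHIP = "set-membership"
    SET_CONTAINMENT = "set-containment"
    PART_WHOLE = "part-whole composition"
    PROPERTY_ATTRIBUTION = "property-attribution"
    GENERALIZED_POSSESSION = "generalized possession"
    CAUSAL_LINK = "causal link"
    LEXICAL_ARGUMENT_FILLING = "lexical-argument-filling"


@dataclass(frozen=True)
class ReferentialLink:
    anaphor_id: str     # hypothetical markable identifier of the anaphor
    antecedent_id: str  # hypothetical markable identifier of the antecedent
    link_type: LinkType
    bridging_type: Optional[BridgingType] = None

    def __post_init__(self):
        # Assumption: identity links are not further classified, so a
        # bridging subtype on an identity link is treated as an error.
        if self.link_type is LinkType.IDENTITY and self.bridging_type is not None:
            raise ValueError("identity links take no bridging subtype")


# A bridging link whose relation is filled by a lexical argument.
link = ReferentialLink("m2", "m1", LinkType.BRIDGING,
                       BridgingType.LEXICAL_ARGUMENT_FILLING)
```

Keeping the link on the expression side and the familiarity status on the entity side mirrors the distinction drawn in the text between anaphoricity and information status.</Paragraph>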
  </Section>
  <Section position="5" start_page="8" end_page="9" type="metho">
    <SectionTitle>
4 Multi-level Investigation of IS
</SectionTitle>
    <Paragraph position="0"> We illustrate the different levels of annotation and analysis with an example sequence taken from our English corpus (Figure 1). We considered the syntactic annotation as a suitable starting point for the analysis. Where relevant features are detected, we compare the annotation at other levels.</Paragraph>
    <Paragraph position="1"> (1) In the 1987 crash, remember, the market was shaken by a Danny Rostenkowski proposal to tax takeovers out of existence. (2) Even more important, in our view, was the Treasury's threat to thrash the dollar. (3) The Treasury is doing the same thing today; (4) thankfully, the dollar is not under 1987-style pressure. Of the four clauses in the example sequence, three show noncanonical word orders. In (1), the temporal adjunct is fronted, followed by the main predicate remember (in imperative mood). Additionally, (1) contains a passive construction bringing the patient into subject position. In (2), subject complement and adjunct (marking stance) are fronted. In (4), an adjunct (again marking stance) is fronted.</Paragraph>
    <Paragraph position="2"> The discourse entity (DE) introduced in the fronted temporal phrase the 1987 crash in (1) is extensional, abstract, unique, specific singular, and has the information status of unused (also indicated by remember). The DE introduced in the unmarked subject position is extensional, abstract, unique, specific singular, but has the status of inferrable: the market can be seen as a bridging anaphor to the crash, by means of an argument filling (crash of the market). The DEs introduced by the sentence-final expressions in (1) and (2) are also extensional, abstract, unique, specific singular, and both have the information status of new. (Footnote: We assume a lay reader. For an economy expert, these entities may have the status of unused.)</Paragraph>
    <Paragraph position="3"> What appears sentence-final in (1) and (2) are thus two negative things that happened during the 1987 crash. The fronted expression(s) in (2) are not annotated as a DE. The DEs in the unmarked subject positions in (3) and (4) both have the information status of textually evoked, as both expressions are coreferential anaphors to parts of the Treasury's threat to thrash the dollar. While the DE referred to by the Treasury is extensional, office, unique, specific singular, that of the dollar is intensional, abstract, unique, uncountable. The expression the same thing in (3) is anaphoric to the Treasury's threat ... in (2), but it introduces a new DE of the same type; its information status is that of inferrable. Finally, the DE introduced in the sentence-final expression 1987-style pressure in (4) is intensional, abstract, existential, uncountable, and also has the information status of inferrable; it is, however, hard to code it as a bridging anaphor, because it is not clear what relation it would have to what antecedent: if anything, then a Danny Rostenkowski proposal ... in (1).</Paragraph>
    <Paragraph position="4"> The prosodic analysis shows that the fronted phrase in (2) is not only syntactically but also prosodically prominent (cf. Figure 2): two peak accents on even and more highlight these words (with the more pronounced accent on more expressing a contrast), whereas the word important is deaccented, since the concept of 'importance' is inferable from the context.</Paragraph>
    <Paragraph position="5"> Furthermore, the adjective construction forms a phrase of its own, delimited by an intonation phrase boundary, which is in turn signalled by a falling-rising contour plus a short pause. The following parenthesis in our view also constitutes a single intonation phrase. Here again, our is assigned a contrastive accent, while view is unaccented.</Paragraph>
    <Paragraph position="6"> All remaining content words of the clause receive accents. However, the most 'newsworthy' word, threat, is the only one marked by a rising pitch accent (L+H*), indicating its higher degree of importance for the speaker. This interpretation is further supported by the insertion of a phrase break directly after this word. Finally, the high-downstepped nuclear accent (H+!H*) on dollar marks this item as being accessible by speaker and hearer (Pierrehumbert and Hirschberg, 1990).</Paragraph>
  </Section>
  <Section position="6" start_page="9" end_page="11" type="metho">
    <SectionTitle>
5 Technical Realization
</SectionTitle>
    <Paragraph position="0"> Above we presented a multi-level view on IS annotation, where each layer is to be annotated independently, to enable us to investigate interactions across the different levels. Such investigations involve either exploration of the integrated data (i.e., simultaneous viewing of the different levels and searching across levels) or integrated processing, e.g., in order to discover or test correlations across levels. There are two crucial technical requirements that must be satisfied to make this possible: (i) stand-off annotation at each level and (ii) alignment of base data across the levels. Without the first, we would not be able to keep the levels separate and perform annotation at each level independently; without the second, we would not be able to align the separate levels.</Paragraph>
    <Paragraph position="1"> We have chosen XML for the representation and maintenance of annotations. Each level of annotation is represented as a separate XML file, referring to (sequences of) tokens in a common base file containing the actual text data. We keep independent levels of annotation separate, even if they can in principle be merged into a single hierarchy.</Paragraph>
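    <Paragraph> To make the stand-off idea concrete, here is a minimal sketch in which a base structure holds the tokens and a separate annotation level refers to them by ID only. The element and attribute names (words, word, markables, markable, first, last, status) are invented for illustration and do not reproduce the actual MMAX or Tiger XML formats; the token sequence is taken from clause (3) of the example.

```python
import xml.etree.ElementTree as ET

# Base data: one element per token, each with a unique ID (hypothetical names).
base = ET.Element("words")
for i, form in enumerate(["The", "Treasury", "is", "doing", "the", "same",
                          "thing", "today"], start=1):
    word = ET.SubElement(base, "word", id=f"w{i}")
    word.text = form

# One annotation level, kept apart from the base data: a markable points
# at a token span by ID and carries its own attributes.
discourse = ET.Element("markables")
ET.SubElement(discourse, "markable", id="m1", first="w1", last="w2",
              status="textually-evoked")


def resolve(markable, base_root):
    """Return the token forms covered by a markable's first..last span."""
    ids = [w.get("id") for w in base_root]
    start = ids.index(markable.get("first"))
    end = ids.index(markable.get("last"))
    return [w.text for w in list(base_root)[start:end + 1]]


covered = resolve(discourse[0], base)
```

Because the markable stores only token IDs, a second level (for example syntax or prosody) can refer to the same tokens without either level having to know about the other.</Paragraph>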
    <Paragraph position="2"> Parallel aligned texts (e.g., the written and spoken versions of our corpus) are also represented via shared token IDs. A related issue is that of annotation tools. We are not using one generic tool for all levels for the simple reason that we have not found a tool that would support the needs of all levels and still be efficient (Bauman et al., 2004b; Müller and Strube, 2001). Therefore, we prefer to use tools specifically designed for the task at hand.</Paragraph>
    <Paragraph position="3"> We describe the tools of our choice below.</Paragraph>
    <Paragraph position="4">  several files in which time stamps are associated with the respective annotated labels.</Paragraph>
    <Paragraph position="5"> Syntactic Level For the syntactic annotation, we used the XML editor XML-Spy (http://www.xmlspy.com/). The annotation scheme is defined in a DTD, which is used to check the well-formedness and the validity of the annotation. Discourse Level The discourse-level annotation is done with the MMAX annotation tool developed at EML, Heidelberg (Müller and Strube, 2003).</Paragraph>
    <Paragraph position="6"> MMAX is a light-weight tool written in Java that runs under both Windows and Unix/Linux. It supports multilevel annotation of XML-encoded data using annotation schemes defined as DTDs. MMAX implements the above-mentioned general concepts of markables that carry attributes and stand in link relations to one another. To exploit and reuse annotated data in the MMAX format, there is the MMAX XML Discourse API.</Paragraph>
    <Paragraph position="7"> Integration The tools inevitably employ different data formats: on the prosodic level data is stored in the EMU data format, on the syntactic level in Tiger XML and on the discourse level in MMAX XML format.</Paragraph>
    <Paragraph position="8"> The EMU files have to be converted into stand-off XML format. To be able to align the prosodic annotation with the syntax and the discourse level, we chose the word as the common basic unit. This poses several problems. First, punctuation marks count as separate words, but are not realised in spoken language. To be able to correlate prosodic phrasing and punctuation marks, we store the punctuation marks as attributes of the respective preceding word. Second, pauses occur very often in speech, but as they are not part of the written texts, they do not count as words. Because they are an important feature for phrasing and rhythm, we also code them as attributes of the preceding word. Third, in some cases a single word carries more than one accent, e.g., long compounds (Getränkedosenhersteller) or numbers. In these cases, it would be interesting to know which part(s) of the word get accented, which requires some way of annotating parts of words (e.g., syllables). Finally, for some multi-word units, e.g., 18,50 Mark, the spoken realisation (achtzehn Mark fünfzig) cannot be aligned with the orthographic form, because the spoken and orthographic forms differ in number and order of words.</Paragraph>
  </Section>
</Paper>