<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0301">
  <Title>Annotation Graphs as a Framework for Multidimensional Linguistic Data Analysis</Title>
  <Section position="2" start_page="0" end_page="4" type="metho">
    <SectionTitle>
1 Annotation Graphs
</SectionTitle>
    <Paragraph position="0"> When we examine the kinds of speech transcription and annotation found in many existing 'communities of practice', we see commonality of abstract form along with diversity of concrete format. Our survey of annotation practice (Bird and Liberman, 1999) attests to this commonality amidst diversity.</Paragraph>
    <Paragraph position="1"> (See [www.ldc.upenn.edu/annotation] for pointers to online material.) We observed that all annotations of recorded linguistic signals require one unavoidable basic action: to associate a label, or an ordered sequence of labels, with a stretch of time in the recording(s). Such annotations also typically distinguish labels of different types, such as spoken words vs. non-speech noises. Different types of annotation often span different-sized stretches of recorded time, without necessarily forming a strict hierarchy: thus a conversation contains (perhaps overlapping) conversational turns, turns contain (perhaps interrupted) words, and words contain (perhaps shared) phonetic segments. Some types of annotation are systematically incommensurable with others: thus disfluency structures (Taylor, 1995) and focus structures (Jackendoff, 1972) often cut across conversational turns and syntactic constituents.</Paragraph>
    <Paragraph position="2"> A minimal formalization of this basic set of practices is a directed graph with fielded records on the arcs and optional time references on the nodes. We have argued that this minimal formalization in fact has sufficient expressive capacity to encode, in a reasonably intuitive way, all of the kinds of linguistic annotations in use today. We have also argued that this minimal formalization has good properties with respect to creation, maintenance and searching of annotations. We believe that these advantages are especially strong in the case of discourse annotations, because of the prevalence of cross-cutting structures and the need to compare multiple annotations representing different purposes and perspectives. null Translation into annotation graphs does not magically create compatibility among systems whose semantics are different. For instance, there are many different approaches to transcribing filled pauses in English - each will translate easily into an annotation graph framework, but their semantic incompatibility is not thereby erased. However, it does enable us to focus on the substantive differences without having to be concerned with diverse formats, and without being forced to recode annotations in an agreed, common format. Therefore, we focus on the structure of annotations, independently of domain-specific concerns about permissible tags, attributes, and values.</Paragraph>
    <Paragraph position="3"> As reference corpora are published for a wider range of spoken language genres, annotation work is increasingly reusing the same primary data. For instance, the Switchboard corpus [www.ldc.upenn.edu/Catalog/LDC93S7.html] has been marked up for disfluency (Taylor, 1995).</Paragraph>
    <Paragraph position="4"> See [www.cis.upenn.edu/~treebank/switchboardsample.html] for an example, which also includes a separate part-of-speech annotation and a Treebank-style annotation. Hirschman and Chinchor (1997) give an example of MUC-7 coreference annotation applied to an existing TRAINS dialog annotation marking speaker turns and overlap. We shall encounter a number of such cases here.</Paragraph>
    <Paragraph position="5"> The Formalism As we said above, we take an annotation label to be a fielded record. A minimal but sufficient set of fields would be: type, which represents a level of annotation, such as the segment, word and discourse levels; label, a contentful property, such as a particular word, a speaker's name, or a discourse function; and class, an optional field which permits the arcs of an annotation graph to be co-indexed as members of an equivalence class.1 One might add further fields for holding comments, annotator id, update history, and so on.</Paragraph>
    <Paragraph position="6"> Let T be a set of types, L be a set of labels, and C be a set of classes. Let R = {⟨t, l, c⟩ | t ∈ T, l ∈ L, c ∈ C}, the set of records over T, L, C. Let N be a set of nodes. Annotation graphs (AGs) are now defined as follows: Definition 1 An annotation graph G over R, N is a set of triples having the form ⟨n1, r, n2⟩, r ∈ R, n1, n2 ∈ N, which satisfies the following conditions:  1. ⟨N, {⟨n1, n2⟩ | ⟨n1, r, n2⟩ ∈ G}⟩ is a labelled acyclic digraph.</Paragraph>
    <Paragraph position="7"> 2. τ : N ⇀ ℝ is an order-preserving map assigning  times to (some of) the nodes.</Paragraph>
    <Paragraph position="8"> For detailed discussion of these structures, see (Bird and Liberman, 1999). Here we present a fragment (taken from Figure 8 below) to illustrate the definition. For convenience the components of the fielded records which decorate the arcs are separated using the slash symbol. The example contains two word arcs, and a discourse tag encoding 'influence on speaker'. No class fields are used. Not all nodes have a time reference.</Paragraph>
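The two conditions of Definition 1 can be checked mechanically. The following is a minimal sketch in Python (our own naming, not from the paper): an AG as a set of ⟨n1, r, n2⟩ triples with a partial time map on nodes, tested for acyclicity and for order-preservation of times along paths. The example arcs mirror the fragment just described: two word arcs, one discourse arc, and an untimed node.

```python
def is_acyclic(arcs):
    """Condition 1: the digraph over the arcs' nodes has no cycles (DFS)."""
    graph = {}
    for n1, _rec, n2 in arcs:
        graph.setdefault(n1, []).append(n2)
        graph.setdefault(n2, [])
    visiting, done = set(), set()

    def visit(n):
        if n in done:
            return True
        if n in visiting:          # back edge found: a cycle
            return False
        visiting.add(n)
        ok = all(visit(m) for m in graph[n])
        visiting.discard(n)
        done.add(n)
        return ok

    return all(visit(n) for n in graph)


def times_order_preserving(arcs, times):
    """Condition 2: time must not decrease along any path between timed nodes."""
    succ = {}
    for n1, _rec, n2 in arcs:
        succ.setdefault(n1, set()).add(n2)

    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            n = stack.pop()
            for m in succ.get(n, ()):
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    return all(times[a] <= times[b]
               for a in times for b in reachable(a) if b in times)


# Two word arcs and a discourse arc, as in the fragment discussed above;
# node 3 carries no time reference, since the time map is partial.
arcs = {(1, ("W", "oh", None), 2),
        (2, ("W", "okay", None), 3),
        (1, ("D", "influence-on-speaker", None), 3)}
times = {1: 96.42, 2: 96.65}

print(is_acyclic(arcs))                     # True
print(times_order_preserving(arcs, times))  # True
```

The node ids, times, and field values are illustrative only; the point is that both well-formedness conditions are cheap, local checks on the triple set.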
    <Paragraph position="9"> 1We have avoided using explicit pointers since we prefer not to associate formal identifiers to the arcs. Equivalence classes will be exemplified later.</Paragraph>
    <Paragraph position="10"> The minimal annotation graph for this structure is as follows:</Paragraph>
    <Paragraph position="12"> XML is a natural 'surface representation' for annotation graphs and could provide the primary exchange format. A particularly simple XML encoding of the above structure is shown below; one might choose to use a richer XML encoding in practice.</Paragraph>
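The XML encoding itself did not survive extraction. The sketch below shows one plausible arc-per-element serialization; the element and attribute names (`annotation`, `node`, `arc`) are our own invention, not taken from the paper.

```python
from xml.sax.saxutils import quoteattr

def ag_to_xml(arcs, times):
    """Serialize an AG (a set of (n1, (type, label, class), n2) triples)
    to a simple, flat XML surface form. Untimed nodes get no time attribute."""
    out = ["<annotation>"]
    nodes = {x for n1, _rec, n2 in arcs for x in (n1, n2)}
    for n in sorted(nodes):
        t = f' time="{times[n]}"' if n in times else ""
        out.append(f'  <node id="{n}"{t}/>')
    for n1, (typ, label, cls), n2 in sorted(arcs, key=str):
        c = f" class={quoteattr(cls)}" if cls is not None else ""
        out.append(f'  <arc from="{n1}" to="{n2}" '
                   f'type={quoteattr(typ)} label={quoteattr(label)}{c}/>')
    out.append("</annotation>")
    return "\n".join(out)

# The two word arcs and the discourse arc from the fragment above.
arcs = {(1, ("W", "oh", None), 2),
        (2, ("W", "okay", None), 3),
        (1, ("D", "influence-on-speaker", None), 3)}
times = {1: 96.42, 2: 96.65}
print(ag_to_xml(arcs, times))
```

Since each arc becomes one self-contained element, translation back to the triple set is a single pass over the document.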
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 LDC Telephone Speech Transcripts
</SectionTitle>
      <Paragraph position="0"> The LDC-published CALLHOME corpora include digital audio, transcripts and lexicons for telephone conversations in several languages, and are designed to support research on speech recognition [www.ldc.upenn.edu/Catalog/LDC96S46.html]. The transcripts exhibit abundant overlap between speaker turns. What follows is a typical fragment of an annotation. Each stretch of speech consists of a begin time, an end time, a speaker designation, and the transcription for the cited stretch of time.</Paragraph>
      <Paragraph position="1"> We have augmented the annotation with + and * to indicate partial and total overlap (respectively) with the previous speaker turn.</Paragraph>
      <Paragraph position="3"> tests, and he was diagnosed as having attention deficit disorder. Which 990.18 989.56 A: you know, given how he's how far he's gotten, you know, he got his degree at kTu~te and all, I found that surprising that for the first time as an adult they're diagnosing this. um  financial consultant and seems to be happy with that. 1003.14 1003.45 B: Good.</Paragraph>
      <Paragraph position="4"> Long turns (e.g. the period from 972.46 to 989.56 seconds) were broken up into shorter stretches for the convenience of the annotators and to provide additional time references. A section of this annotation which includes an example of total overlap is represented in annotation graph form in Figure 1, with the accompanying visualization shown in Figure 2. (We have no commitment to this particular visualization; the graph structures can be visualized in many ways and the perspicuity of a visualization format will be somewhat domain-specific.) The turns are attributed to speakers using the speaker/ type. All of the words, punctuation and disfluencies are given the w/type, though we could easily opt for a more refined version in which these are assigned different types. The class field is not used here. Observe that each speaker turn is a disjoint piece of graph structure, and that hierarchical organisation uses the 'chart construction' (Gazdar and Mellish, 1989, 179ff). Thus, we make a logical distinction between the situation where the end-points of two pieces of annotation necessarily coincide (by sharing the same node) and the situation where endpoints happen to coincide (by having distinct nodes which contain the same time reference). The former possibility is required for hierarchical structure, and the latter possibility is required for overlapping speaker turns, where words spoken by different speakers may happen to share the same boundary.</Paragraph>
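The node-sharing distinction just drawn can be made concrete. In this sketch (node ids and times are illustrative, not taken from Figure 1), a word arc and its enclosing turn arc end at the very same node, while two overlapping turns merely have distinct nodes whose time references happen to be equal:

```python
# Partial time map: nodes 4 and 7 belong to different speakers' turns
# but happen to carry the same time reference.
times = {4: 988.26, 7: 988.26}

# Necessary coincidence (hierarchy): a word arc and the enclosing turn arc
# share their end node, node 4, so their boundaries cannot drift apart.
word_arc = (3, ("W", "surprising", None), 4)
turn_arc = (1, ("speaker", "A", None), 4)
assert word_arc[2] == turn_arc[2]           # same node object: hierarchical

# Accidental coincidence (overlap): nodes 4 and 7 are distinct nodes,
# so the structures stay disjoint even though the times are equal.
assert 4 != 7 and times[4] == times[7]      # overlap, not hierarchy
print("shared-node vs shared-time distinction holds")
```

Editing the time of node 4 moves every arc anchored there at once, which is exactly the behaviour hierarchical structure needs and overlapping turns must avoid.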
    </Section>
    <Section position="2" start_page="0" end_page="3" type="sub_section">
      <SectionTitle>
2.2 Dialogue Annotation in COCONUT
</SectionTitle>
      <Paragraph position="0"> The COCONUT corpus is a set of dialogues in which the two conversants collaborate on a task of deciding what furniture to buy for a house (Di Eugenio et al., 1998). The coding scheme augments the DAMSL scheme (Allen and Core, 1997) by having some new top-level tags and by further specifying some existing tags. An example is given in Figure 3.</Paragraph>
      <Paragraph position="1"> The example shows five utterance pieces, identified (a-e), four produced by speaker S1 and one produced by speaker S2. The discourse annotations can be glossed as follows: Accept - the speaker is agreeing to a possible action or a claim; Commit - the speaker potentially commits to intend to perform a future specific action, and the commitment is not contingent upon the assent of the addressee; Offer - the speaker potentially commits to intend to perform a future specific action, and the commitment is contingent upon the assent of the addressee; Open-Option - the speaker provides an option for the addressee's future action; Action-Directive - the utterance is designed to cause the addressee to undertake a specific action.</Paragraph>
      <Paragraph position="2"> In utterance (e) of Figure 3, speaker S1 simultaneously accepts the meta-action in (d) of not  S1: (a) Let's take the blue rug for 250, (b) my rug wouldn't match (c) which is yellow for 150.</Paragraph>
      <Paragraph position="3"> S2: (d) we don't have to match...</Paragraph>
      <Paragraph position="4"> S1: (e) well then let's use mine for 150  having matching colors, and the regular action of using S1's yellow rug. The latter acceptance is not explicitly represented in the original notation, so we shall only consider the former.</Paragraph>
      <Paragraph position="5"> In representing this dialogue structure using annotation graphs, we will be concerned to achieve the following: (i) to treat multiple annotations of the same utterance fragment as an unordered set, rather than a list, to simplify indexing and query; (ii) to explicitly link speaker S1 to utterances (a-c); (iii) to formalize the relationship between Accept (d) and utterance (d); and (iv) to formalize the rest of the annotation structure which is implicit in the textual representation.</Paragraph>
      <Paragraph position="6"> We adopt the types Sp (speaker), utt (utterance) and D (discourse). A more refined type system could include other levels of representation, it could distinguish forward versus backward communicative function, and so on. For the names we employ: speaker identifiers s1, s2; discourse tags Offer, Commit, Accept, Open-Option, Action-Directive; and orthographic strings representing the utterances.</Paragraph>
      <Paragraph position="7"> For the classes (the third, optional field) we employ the utterance identifiers a, b, c, d, e.</Paragraph>
      <Paragraph position="8"> The COCONUT example can now be represented as an annotation graph, as in Figure 4. The arcs are structured into three layers, one for each type, where the types are written on the left. If the optional class field is specified, this information follows the name field, separated by a slash. The Accept/d arc refers to the s2 utterance simply by virtue of the fact that both share the same class field.</Paragraph>
      <Paragraph position="9"> Observe that the Commit and Accept tags for (a) are unordered, unlike the original annotation, and that speaker S1 is associated with all utterances (a-c), rather than being explicitly linked to (a) and implicitly linked to (b) and (c) as in Figure 3.</Paragraph>
      <Paragraph position="10"> To make the referent of the Accept tag clear, we make use of the class field. Recall that the third component of the fielded records, the class field, permits arcs to refer to each other. Both the referring and the referenced arcs are assigned to equivalence class d.</Paragraph>
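Resolving such a cross-reference is then a matter of collecting all arcs in the same equivalence class. The sketch below paraphrases the relevant pieces of Figure 4; the node ids and the helper name `arcs_in_class` are ours:

```python
def arcs_in_class(arcs, cls):
    """All arcs whose optional class field co-indexes them into class `cls`."""
    return [a for a in arcs if a[1][2] == cls]

# Arcs paraphrased from the COCONUT example: S2's utterance (d), S1's
# Accept of (d), and S1's utterance (e). Records are (type, label, class).
arcs = [
    (10, ("utt", "we don't have to match...", "d"), 11),  # S2's utterance, class d
    (12, ("D", "Accept", "d"), 13),                       # S1 accepts, class d
    (12, ("utt", "well then let's use mine for 150", "e"), 13),
]

# Resolving the Accept: everything co-indexed as class "d",
# with no explicit arc identifiers or pointers needed.
coindexed = arcs_in_class(arcs, "d")
print([a[1][1] for a in coindexed])  # ["we don't have to match...", 'Accept']
```

Because class membership is symmetric, the same call answers both "what does this Accept refer to?" and "which tags target this utterance?".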
    </Section>
    <Section position="3" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
2.3 Coreference Annotation in MUC-7
</SectionTitle>
      <Paragraph position="0"> The MUC-7 Message Understanding Conference specified tasks for information extraction, named entity and coreference. Coreferring expressions are to be linked using SGML markup with ID and REF tags (Hirschman and Chinchor, 1997). Figure 5 is a sample of text from the Boston University Radio Speech Corpus [www.ldc.upenn.edu/Catalog/LDC96S36.html], marked up with coreference tags. (We are grateful to Lynette Hirschman for providing us with this annotation.) Noun phrases participating in coreference are wrapped with &lt;coref&gt;...&lt;/coref&gt; tags, which can bear the attributes ID, REF, TYPE and MIN. Each such phrase is given a unique identifier, which may be referenced by a REF attribute somewhere else. Our example contains the following references: 3 → 2, 4 → 2, 6 → 5, 7 → 5, 8 → 5, 12 → 11, 15 → 13.</Paragraph>
      <Paragraph position="1"> The TYPE attribute encodes the relationship between the anaphor and the antecedent. Currently, only the identity relation is marked, and so coreferences form an equivalence class. Accordingly, our example contains the following equivalence classes: {2, 3, 4}, {5, 6, 7, 8}, {11, 12}, {13, 15}.</Paragraph>
      <Paragraph position="2"> In our AG representation we choose the first number from each of these sets as the identifier for the equivalence class. MUC-7 also contains a specification for named entity annotation. Figure 7 gives an example, to be discussed in §3.2. This uses empty</Paragraph>
      <Paragraph position="4"> tags to get around the problem of cross-cutting hierarchies. This problem does not arise in the annotation graph formalism; see (Bird and Liberman, 1999, 2.7).</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="4" end_page="7" type="metho">
    <SectionTitle>
3 Hybrid Annotations
</SectionTitle>
    <Paragraph position="0"> There are many cases where a given corpus is annotated at several levels, from discourse to phonetics.</Paragraph>
    <Paragraph position="1"> While a uniform structure is sometimes imposed, as with Partitur (Schiel et al., 1998), established practice and existing tools may give rise to corpora transcribed using different formats for different levels. Two examples of hybrid annotation will be discussed here: a TRAINS+DAMSL annotation, and an eight-level annotation of the Boston University</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
Radio Speech Corpus.
3.1 DAMSL annotation of TRAINS
</SectionTitle>
      <Paragraph position="0"> The TRAINS corpus (Heeman and Allen, 1993) is a collection of about 100 dialogues containing a total of 5,900 speaker turns [www.ldc.upenn.edu/Catalog/LDC95S25.html]. Part of a transcript is shown below, where s and u designate the two speakers, &lt;sil&gt; denotes silent periods, and + denotes boundaries of speaker overlaps.</Paragraph>
      <Paragraph position="2"> to Avon &lt;sil&gt; and a boxcar of bananas to Corning &lt;sil&gt; by three p.m.</Paragraph>
      <Paragraph position="3"> and I think it's midnight now uh right it's midnight okay so we need to &lt;sil&gt; um get a tanker of OJ to Avon is the first thing we need to do  + so + + okay + &lt;click&gt; so we have to make orange juice first mm-hm &lt;sil&gt; okay so we're gonna pick up &lt;sil&gt; an engine two &lt;sil&gt; from Elmira go to Corning &lt;sil&gt; pick up the tanker mm-hm go back to Elmira &lt;sil&gt; to get &lt;sil&gt; pick up the orange juice alright &lt;sil&gt; um well &lt;sil&gt; we also need to make the orange juice &lt;sil&gt; so we need to get + oranges &lt;sil&gt; to Elmira + + oh we need to pick up + oranges oh + okay + + yeah + alright so &lt;sil&gt; engine number two is going to  pick up a boxcar Accompanying this transcription are a number of xwaves label files containing time-aligned word-level and segment-level transcriptions. Below, the start of file speaker0.words is shown on the left, and the start of file speaker0.phones is shown on the right. The first number gives the file offset (in seconds), and the middle number gives the label color. The final part  is a label for the interval which ends at the specified time. Silence is marked explicitly (again using &lt;sil&gt;) so we can infer that the first word 'hello' occupies the interval [0.110000, 0.488555]. Evidently the segment-level annotation was done independently of the word-level annotation, and so the times do not line up exactly.</Paragraph>
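Reading such label files into intervals is straightforward given the convention that each line carries an end time. This sketch assumes the three-column "end-time, color, label" layout described above; it is illustrative, not the xwaves toolkit's own parser:

```python
def read_xlabel(lines):
    """Yield (start, end, label) intervals from xwaves-style label lines.
    Each input line gives the END time of its interval; the start is the
    previous line's end (0.0 for the first)."""
    prev_end = 0.0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        time_s, _color, label = line.split(None, 2)
        end = float(time_s)
        yield (prev_end, end, label)
        prev_end = end

# First two lines of a speaker0.words-style file (values from the text above).
sample = """\
0.110000 121 <sil>
0.488555 121 hello
"""
for start, end, label in read_xlabel(sample.splitlines()):
    print(start, end, label)
```

On this sample the word 'hello' comes out as the interval (0.11, 0.488555), matching the inference drawn in the text.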
      <Paragraph position="4">  The TRAINS annotations show the presence of backchannel cues and overlap. An example of overlap is shown below:  As seen in Figure 2 and explained more fully in (Bird and Liberman, 1999), overlap carries no implications for the internal structure of speaker turns or for the position of turn-boundaries.</Paragraph>
      <Paragraph position="5"> Now, independently of this annotation there is also a dialogue annotation in DAMSL, as shown in Figure 8. Here, a dialog is broken down into turns and thence into utterances, where the tags contain discourse-level annotation.</Paragraph>
      <Paragraph position="6"> In representing this hybrid annotation as an AG we are motivated by the following concerns. First, we want to preserve the distinction between the TRAINS and DAMSL components, so that they can remain in their native formats (and be manipulated by their native tools) and be converted independently to AGs then combined using AG union, and so that they can be projected back out if necessary. Second, we want to identify those boundaries that necessarily have the same time reference (such as the end of utterance 17 and the end of the word 'Elmira'), and represent them using a single graph node. Contributions from different speakers will remain disconnected in the graph structure. Finally, we want to use the equivalence class names to allow cross-references between utterances. A fragment of the proposed annotation graph is depicted using our visualization format in Figure 9. Observe that, for brevity, some discourse tags are not represented, and the phonetic segment level is omitted.</Paragraph>
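Since an AG is just a set of triples, the combination step described above is literal set union, provided node ids and time references have first been reconciled. The arcs below are schematic stand-ins for the TRAINS and DAMSL strands, not the actual corpus content:

```python
# Word-level arcs from the (hypothetical) TRAINS conversion.
trains_arcs = {
    (1, ("W", "okay", None), 2),
    (2, ("W", "so", None), 3),
}
# A discourse arc from the (hypothetical) DAMSL conversion, spanning the
# same stretch: its endpoints reuse nodes 1 and 3, so the boundaries that
# necessarily coincide are represented by single shared nodes.
damsl_arcs = {
    (1, ("D", "Accept", "17"), 3),
}

combined = trains_arcs | damsl_arcs     # AG union
assert len(combined) == 3

# Projecting a strand back out, e.g. for its native tools, is filtering by type:
damsl_only = {a for a in combined if a[1][0] == "D"}
assert damsl_only == damsl_arcs
print("union and projection round-trip cleanly")
```

Because union never merges or rewrites arcs, each component can be converted, combined, and projected back out without loss.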
      <Paragraph position="7"> Note that the tags in Figure 8 have the form of fielded records and so, according to the AG definition, all the attributes of a tag could be put into a single label. We have chosen to maximally split such records into multiple arc labels, so that search predicates do not need to take account of internal structure, and to limit the consequences of an erroneous code. A relevant analogy here is that of pre-composed versus compound characters in Unicode. The presence of both forms of a character in a text raises problems for searching and collating. This problem is avoided through normalization, and this is typically done by maximally decomposing the characters.</Paragraph>
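The Unicode analogy can be made concrete with the standard library: "é" as one precomposed character and as "e" plus a combining accent are distinct strings, and naive comparison fails until both are normalized to the maximally decomposed form (NFD).

```python
import unicodedata

precomposed = "caf\u00e9"      # 'café' using U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"      # 'cafe' + U+0301 COMBINING ACUTE ACCENT

# The two forms look identical but are different strings,
# so searching and collating break.
assert precomposed != decomposed

# Normalizing both to the fully decomposed form (NFD) restores equality,
# mirroring the maximally-split arc labels advocated above.
assert unicodedata.normalize("NFD", precomposed) == \
       unicodedata.normalize("NFD", decomposed)
print("NFD normalization reconciles the two encodings")
```

Maximally splitting fielded records into separate arc labels plays the same role as NFD here: one canonical, fully decomposed form that search predicates can rely on.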
    </Section>
    <Section position="2" start_page="4" end_page="7" type="sub_section">
      <SectionTitle>
3.2 Multiple annotations of the BU corpus
</SectionTitle>
      <Paragraph position="0"> Linguistic analysis is always multivocal, in two senses. First, there are many types of entities and  relations, on many scales, from acoustic features spanning a hundredth of a second to narrative structures spanning tens of minutes. Second, there are many alternative representations or construals of a given kind of linguistic information.</Paragraph>
      <Paragraph position="1"> Sometimes these alternatives are simply more or less convenient for a certain purpose. Thus a researcher who thinks theoretically of phonological features organized into moras, syllables and feet, will often find it convenient to use a phonemic string as a representational approximation. In other cases, however, different sorts of transcription or annotation reflect different theories about the ontology of linguistic structure or the functional categories of communication.</Paragraph>
      <Paragraph position="2"> The AG representation offers a way to deal productively with both kinds of multivocality. It provides a framework for relating different categories of linguistic analysis, and at the same time to compare different approaches to a given type of analysis.</Paragraph>
      <Paragraph position="3"> As an example, Figure 10 shows an AG-based visualization of eight different sorts of annotation of a phrase from the BU Radio Corpus, produced by Mari Ostendorf and others at Boston University, and published by the LDC [www.ldc.upenn.edu/Catalog/LDC96S36.html].</Paragraph>
      <Paragraph position="4"> The basic material is from a recording of a local public radio news broadcast. The BU annotations include four types of information: orthographic transcripts, broad phonetic transcripts (including main word stress), and two kinds  of prosodic annotation, all time-aligned to the digital audio files. The two kinds of prosodic annotation implement the system known as ToBI [www.ling.ohio-state.edu/phonetics/E_ToBI/]. ToBI is an acronym for &amp;quot;Tones and Break Indices&amp;quot;, and correspondingly provides two types of information: Tones, which are taken from a fixed vocabulary of categories of (stress-linked) &amp;quot;pitch accents&amp;quot; and (juncture-linked) &amp;quot;boundary tones&amp;quot;; and Break Indices, which are integers characterizing the strength and nature of interword disjunctures. We have added four additional annotations: coreference annotation and named entity annotation in the style of MUC-7 [www.muc.saic.com/proceedings/muc_7_toc.html] provided by Lynette Hirschman; syntactic structures in the style of the Penn TreeBank (Marcus et al., 1993) provided by Ann Taylor; and an alternative annotation for the F0 aspects of prosody, known as Tilt (Taylor, 1998) and provided by its inventor, Paul Taylor. Taylor has done Tilt annotations for much of the BU corpus, and will soon be publishing them as a point of comparison with the ToBI tonal annotation. Tilt differs from ToBI in providing a quantitative rather than qualitative characterization of F0 obtrusions: where ToBI might say &amp;quot;this is a L+H* pitch accent,&amp;quot; Tilt would say &amp;quot;This is an F0 obtrusion that starts at time t0, lasts for duration d seconds, involves a Hz total F0 change, and ends l Hz different in F0 from where it started.&amp;quot; As usual, the various annotations come in a bewildering variety of file formats. 
These are not entirely trivial to put into registration, because (for instance) the TreeBank terminal string contains both more (e.g. traces) and fewer (e.g. breaths) tokens than the orthographic transcription does. One other slightly tricky point: the connection between the word string and the &amp;quot;break indices&amp;quot; (which are ToBI's characterizations of the nature of interword disjuncture) are mediated only by identity in the floating-point time values assigned to word boundaries and to break indices in separate files. Since these time values are expressed as ASCII strings, it is easy to lose the identity relationship without meaning to, simply by reading in and writing out the values to programs that may make different choices of internal variable type (e.g. float vs. double), or number of decimal digits to print out, etc.</Paragraph>
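The fragility of string-mediated time identity is easy to demonstrate. The values below are illustrative; the point is that a single read/reformat round trip through a program with a different print precision silently destroys the identity relation:

```python
# Word-boundary times and break-index times match only as identical
# ASCII strings in their separate files.
t_words = "0.488555"            # time string in the word file
t_breaks = "0.488555"           # the "same" time in the break-index file
assert t_words == t_breaks      # identity holds at the string level

# A tool that parses the value and writes it back with fewer decimal
# digits breaks the identity without changing the time perceptibly.
reprinted = "%.2f" % float(t_breaks)
assert reprinted == "0.49"
assert reprinted != t_words     # the identity relationship is lost
print("round-tripping through a float broke the string identity")
```

The same failure occurs with differing internal types (float vs. double), since the reprinted decimal string depends on the binary value and the chosen precision, not on the original text.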
      <Paragraph position="5"> Problems of this type are normal whenever multiple annotations need to be compared. Solving them is not rocket science, but does take careful work.</Paragraph>
      <Paragraph position="6"> When annotations with separate histories involve mutually inconsistent corrections, silent omissions of problematic material, or other typical developments, the problems are multiplied. In noting such difficulties, we are not criticizing the authors of the annotations, but rather observing the value of being able to put multiple annotations into a common framework.</Paragraph>
      <Paragraph position="7"> Once this common framework is established, via translation of all eight &amp;quot;strands&amp;quot; into AG graph terms, we have the basis for posing queries that cut across the different types of annotation. For instance, we might look at the distribution of Tilt parameters as a function of ToBI accent type; or the distribution of Tilt and ToBI values for initial vs. non-initial members of coreference sets; or the relative size of Tilt F0-change measures for nouns vs. verbs.</Paragraph>
      <Paragraph position="8"> We do not have the space in this paper to discuss the design of an AG-based query formalism at length - and indeed, many details of practical AG query systems remain to be decided - but a short discussion will indicate the direction we propose to take. Of course the crux is simply to be able to put all the different annotations into the same frame of reference, but beyond this, there are some aspects of the annotation graph formalism that have nice properties for defining a query system. For example, if an annotation graph is defined as a set of &amp;quot;arcs&amp;quot; like those given in the XML encoding in §1, then every member of the power set of this arc set is also a well-formed annotation graph. The power set construction provides the basis for a useful query algebra, since it defines the complete set of possible values for queries over the AG in question, and is obviously closed under intersection, union and relative complement. As another example, various time-based indexes are definable on an adequately time-anchored annotation graph, with the result that many sorts of precedence, inclusion and overlap relations are easy to calculate for arbitrary subgraphs. See (Bird and Liberman, 1999, §5) for discussion.</Paragraph>
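The closure property just noted can be checked directly: every subset of an AG's arc set is itself a well-formed AG (a subgraph of an acyclic graph is acyclic, and a time map restricted to fewer arcs stays order-preserving), so query results compose under the usual set operations. The arcs reuse the illustrative fragment from earlier:

```python
# The example AG from Section 1: two word arcs and a discourse arc.
ag = frozenset({
    (1, ("W", "oh", None), 2),
    (2, ("W", "okay", None), 3),
    (1, ("D", "influence-on-speaker", None), 3),
})

# Two queries, each yielding a member of the power set of the arc set,
# hence itself a well-formed AG.
query1 = {a for a in ag if a[1][0] == "W"}      # all word arcs
query2 = {a for a in ag if a[0] == 1}           # arcs leaving node 1

# The algebra is closed under intersection, union and relative complement:
assert (query1 & query2) <= ag
assert (query1 | query2) <= ag
assert ag - query1 == {(1, ("D", "influence-on-speaker", None), 3)}
print("query results remain well-formed AGs under the set operations")
```

Cross-strand queries, such as the Tilt-by-ToBI distributions mentioned above, then reduce to intersecting and filtering these arc sets once all strands share one frame of reference.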
      <Paragraph position="9"> In this section, we have indicated some of the ways in which the AG framework can facilitate the analysis of complex combinations of linguistic annotations. These annotation sets are typically multivocal, both in the sense of covering multiple types of linguistic information, and also in the sense of providing multiple versions of particular types of analysis. Discourse studies are especially multivocal in both senses, and so we feel that this approach will be especially helpful to discourse researchers.</Paragraph>
    </Section>
  </Section>
</Paper>