Annotation Graphs as a Framework for 
Multidimensional Linguistic Data Analysis 
Steven Bird and Mark Liberman 
Linguistic Data Consortium, University of Pennsylvania 
3615 Market St, Philadelphia, PA 19104-2608, USA 
{ sb, myl}©Idc, upenn, edu 
Abstract 
In recent work we have presented a formal 
framework for linguistic annotation based on 
labeled acyclic digraphs. These 'annotation graphs' 
offer a simple yet powerful method for representing 
complex annotation structures incorporating 
hierarchy and overlap. Here, we motivate and 
illustrate our approach using discourse-level 
annotations of text and speech data drawn from 
the CALLHOME, COCONUT, MUC-7, DAMSL 
and TRAINS annotation schemes. With the help 
of domain specialists, we have constructed a hybrid 
multi-level annotation for a fragment of the Boston 
University Radio Speech Corpus which includes 
the following levels: segment, word, breath, ToBI, 
Tilt, Treebank, coreference and named entity. We 
show how annotation graphs can represent hybrid 
multi-level structures which derive from a diverse 
set of file formats. We also show how the approach 
facilitates substantive comparison of multiple 
annotations of a single signal based on different 
theoretical models. The discussion shows how 
annotation graphs open the door to wide-ranging 
integration of tools, formats and corpora. 
1 Annotation Graphs 
When we examine the kinds of speech transcription 
and annotation found in many existing 'communi- 
ties of practice', we see commonality of abstract 
form along with diversity of concrete format. Our 
survey of annotation practice (Bird and Liberman, 
1999) attests to this commonality amidst diversity. 
(See \[~.idc.upenn.edu/annotation\] for pointers to 
online material.) We observed that all annotations 
of recorded linguistic signals require one unavoidable 
basic action: to associate a label, or an ordered 
sequence of labels, with a stretch of time in the 
recording(s). Such annotations also typically distin- 
guish labels of different types, such as spoken words 
vs. non-speech noises. Different types of annota- 
tion often span different-sized stretches of recorded 
time, without necessarily forming a strict hierarchy: 
thus a conversation contains (perhaps overlapping) 
conversational turns, turns contain (perhaps inter- 
rupted) words, and words contain (perhaps shared) 
phonetic segments. Some types of annotation are 
systematically incommensurable with others: thus 
disfluency structures (Taylor, 1995) and focus struc- 
tures (Jackendoff, 1972) often cut across conversa- 
tional turns and syntactic constituents. 
A minimal formalization of this basic set of prac- 
tices is a directed graph with fielded records on the 
arcs and optional time references on the nodes. We 
have argued that this minimal formalization in fact 
has sufficient expressive capacity to encode, in a 
reasonably intuitive way, all of the kinds of linguis- 
tic annotations in use today. We have also argued 
that this minimal formalization has good properties 
with respect to creation, maintenance and searching 
of annotations. We believe that these advantages 
are especially strong in the case of discourse anno- 
tations, because of the prevalence of cross-cutting 
structures and the need to compare multiple anno- 
tations representing different purposes and perspec- 
tives. 
Translation into annotation graphs does not mag- 
ically create compatibility among systems whose 
semantics are different. For instance, there are many 
different approaches to transcribing filled pauses in 
English - each will translate easily into an annota- 
tion graph framework, but their semantic incompati- 
bility is not thereby erased. However, it does enable 
us to focus on the substantive differences without 
having to be concerned with diverse formats, and 
without being forced to recode annotations in an 
agreed, common format. Therefore, we focus on the 
structure of annotations, independently of domain- 
specific concerns about permissible tags, attributes, 
and values. 
As reference corpora are published for a wider 
range of spoken language genres, annotation 
work is increasingly reusing the same primary 
data. For instance, the Switchboard corpus 
\[~. Idc. upenn, edu/Cat alog/LDC93S7, html\] has 
been marked up for disfluency (Taylor, 1995). 
See \[~. cis. upenn, edu/'treebank/swit chboard- 
sample .html\] for an example, which also includes a 
separate part-of-speech annotation and a Treebank- 
Style annotation. Hirschman and Chinchor (1997) 
give an example of MUC-7 coreference annotation 
applied to an existing TRAINS dialog annotation 
marking speaker turns and overlap. We shall 
encounter a number of such cases here. 
The Formalism 
As we said above, we take an annotation label to 
be a fielded record. A minimal but sufficient set of 
fields would be: 
type this represents a level of an annotation, such 
as the segment, word and discourse levels; 
label this is a contentful property, such as a par- 
ticular word, a speaker's name, or a discourse 
function; 
class this is an optional field which permits the 
arcs of an annotation graph to be co-indexed 
as members of an equivalence class. * 
One might add further fields for holding comments, 
annotator id, update history, and so on. 
Let T be a set of types, L be a set of labels, and 
C be a set of classes. Let R = {(t,l,c) I t 6 T,l 6 
L, c 6 C}, the set of records over T, L, C. Let N be 
a set of nodes. Annotation graphs (AGs) are now 
defined as follows: 
Definition 1 An annotation graph G over R, N 
is a set of triples having the form (nl, r, n~), r e R, 
nl, n2 6 N, which satisfies the following conditions: 
1. (N,{(nl,n2) l <nl,r, n2) 6 A}) is a labelled 
acyclic digraph. 
2. T : N ~ ~ is an order-preserving map assigning 
times to (some o/) the nodes. 
For detailed discussion of these structures, see 
(Bird and Liberman, 1999). Here we present a frag- 
ment (taken from Figure 8 below) to illustrate the 
definition. For convenience the components of the 
fielded records which decorate the arcs are separated 
using the slash symbol. The example contains two 
word arcs, and a discourse tag encoding 'influence 
on speaker'. No class fields are used. Not all nodes 
have a time reference. 
1We have avoided using explicit pointers since we prefer 
not to associate formal identifiers to the arcs. Equivalence 
classes will be exemplified later. 
The minimal annotation graph for this structure is 
as follows: 
T = {w,.} 
L = {oh, okay, IOS:Commit} 
C = 0 
N = {1,2,3} 
r = {<1,52.46},(3,53.14}} 
(l,Wlohl,2>, 
A = (2, W/okay/,3), 
(1, DnOS: Comm~/, 3> } 
XML is a natural 'surface representation' for 
annotation graphs and could provide the primary 
exchange format. A particularly simple XML 
encoding of the above structure is shown below; 
one might choose to use a richer XML encoding in 
practice. " " 
<a~not ation> 
<arc> 
<begin id=l time=52.46> 
<label ~ype="W" name="oh"> 
<end id=2> </arc> 
<arC> 
<begin id=2> 
<label type="W" name~"okay"> 
<end id=3 time=53.14> 
</arc> 
<arc> 
<begin id=l time=52.46> 
<label type="D" name="IOS:Commit"> 
<end id=3 time=53.14> 
</arc> 
</annot ation~ 
2 AGs and Discourse Markup 
2.1 LDC Telephone Speech Transcripts 
The LDC-published CALLHOME corpora include 
digital audio, transcripts and lexicons for telephone 
conversations in several languages, and are 
designed to support research on speech recognition 
\[www. Idc. upenn, edu/Cat alog/LDC96S46, html\]. The 
transcripts exhibit abundant overlap between 
speaker turns. What follows is a typical fragment 
of an annotation. Each stretch of speech consists of 
a begin time, an end time, a speaker designation, 
and the transcription for the cited stretch of time. 
We have augmented the annotation with + and * 
to indicate partial and total overlap (respectively) 
with the previous speaker turn. 
\[~@ -.~ w,. @ F 
Figure 1: Graph Structure for LDC Telephone Speech Example 
w/Speaker/ 
h9 i46 
994 i.65 
speaker/ :;'~'~ ~ ~:~ !!~ :'~' ''~ :::~'~" ~ 
I I i51 :61 | ~- 
997 i.4o 
I, so. I 
995 i.21 996 !.59 
Figure 2: Visualization for LDC Telephone Speech Example 
962.68 970.21 A: He was changin 8 projects every couple 
of weeks and he said he couldn't keep on top of it. 
He couldn't learn the whole new area 
* 968.71 969.00 B: 7~m. 
970.35 971.94 A: that fast each time. 
* 971.23 971.42 B: ~mm. 
972.46 979.47 A: ~um, and he says he went in and had some 
tests, and he was diagnosed as having attention deficit 
disorder. Which 
990.18 989.56 A: you know, given how he's how far he's 
gotten, you know, he got hie degree at kTu~te and all, 
I found that surprising that for the first time as an 
adult they're diagnosing this. ~um 
+ 959.42 991.96 B: ~mm. I wonder about it. Rut anyway. 
+ 991.75 994.65 A: yeah, but that's what he said. Knd ~um 
* 994.19 994.46 R: yeah. 
995.21 996.59 A: He gum 
+ 996.51 997.61 B: Whatever's helpful. 
+ 997.40 1002.55 A: Right. So he found this new job as a 
financial consultant and seems to be happy with that. 
1003.14 1003.45 B: Good. 
Long turns (e.g. the period from 972.46 to 989.56 
seconds) were broken up into shorter stretches for 
the convenience of the annotators and to provide 
additional time references. A section of this anno- 
tation which includes an example of total overlap is 
represented in annotation graph form in Figure 1, 
with the accompanying visualization shown in Fig- 
ure 2. (We have no commitment to this particular 
visualization; the graph structures can be visualized 
in many ways and the perspicuity of a visualization 
format will be somewhat domain-specific.) 
The turns are attributed to speakers using the 
speaker/ type. All of the words, punctuation and 
disfluencies are given the w/type, though we could 
easily opt for a more refined version in which these 
are assigned different types. The class field is not 
used here. Observe that each speaker turn is a dis- 
joint piece of graph structure, and that hierarchical 
organisation uses the 'chart construction' (Gazdar 
and Mellish, 1989, 179ff). Thus, we make a logi- 
ca\] di.stinction between the situation where the end- 
points of two pieces of annotation necessarily coin- 
cide (by sharing the same node) from the situation 
where endpoints happen to coincide (by having dis- 
tinct nodes which contain the same time reference). 
The former possibility is required for hierarchical 
structure, and the latter possibility is required for 
overlapping speaker turns where words spoken by 
different speakers may happen to sharing the same 
boundary. 
2.2 Dialogue Annotation in COCONUT 
The COCONUT corpus is a set of dia\]ogues in which 
the two conversants collaborate on a task of deciding 
what furniture to buy for a house (Di Eugenio et al., 
1998). The coding scheme augments the DAMSL 
scheme (Allen and Core, 1997) by having some new 
top-level tags and by further specifying some exist- 
ing tags. An example is given in Figure 3. 
The example shows five utterance pieces, identi- 
fied (a-e), four produced by speaker S1 and one pro- 
duced by speaker $2. The discourse annotations can 
be glossed as follows: Accept - the speaker is agreeing 
to a possible action or a claim; Comit - the speaker 
potentially commits to intend to perform a future 
specific action, and the commitment is not contin- 
gent upon the assent of the addressee; Offer - the 
speaker potentially commits to intend to perform a 
future specific action, and the commitment is contin- 
gent upon the assent of the addressee; 0pen-0ption 
- the speaker provides an option for the addressee's 
future action; Action-Directive - the utterance is 
designed to cause the addressee to undertake a spe- 
cific action. 
In utterance (e) of Figure 3, speaker $1 simul- 
taneously accepts to the mete-action in (d) of not 
3 
Accept, Commit 
Open-Option 
Action-Directive 
Accept(d), Offer, Commit 
SI: (a) Let's take the blue rug for 250, 
(b) my rug wouldn't match 
(c) which is yellow for 150. 
S2: (d) we don't have to match... 
SI: (e) well then let's use mine for 150 
Figure 3: Dialogue with COCONUT Coding Scheme 
spl ..... I I If' I II [ 'll ,~ 
)IH II ) 
Figure 4: Visualization of Annotation Graph for COCONUT Example 
having matching colors, and to the regular action of 
using Sl's yellow rug. The latter acceptance is not 
explicitly represented in the original notation, so we 
shall only consider the former. 
In representing this dialogue structure Using anno- 
tation graphs, we will be concerned to achieve the 
following: (i) to treat multiple annotations of the 
same utterance fragment as an unordered set, rather 
than a list, to simplify indexing and query; (ii) to 
explicitly link speaker S1 to utterances (a-c); (iii) 
to formalize the relationship between Accept (d) and 
utterance (d); and (iv) formalize the rest of the 
annotation structure which is implicit in the textual 
representation. 
We adopt the types Sp (speaker), utt (utterance) 
and D (discourse). A more refined type system 
could include other levels of representation, it could 
distinguish forward versus backward communicative 
function, and so on. For the names we employ: 
speaker identifiers Sl, s2; discourse tags Offer, 
Commit, Accept, Open-0ption, Action-Directive; and 
orthographic strings representing the utterances. 
For the classes (the third, optional field) we employ 
the utterance identifiers a, b, c, d, e. 
An annotation graph representation of the 
COCONUT example can now be represented as in 
Figure 4. The arcs are structured into three layers, 
one for each type, where the types are written on 
the left. If the optional class field is specified, this 
information follows the name field, separated by a 
slash. The Acceptld arc refers to the s2 utterance 
simply by virtue of the fact that both share the 
same class field. 
Observe that the Commit and Accept tags for (a) 
are unordered, unlike the original annotation, and 
that speaker $1 is associated with all utterances (a- 
c), rather than being explicitly linked to (a) and 
implicitly linked to (b) and (c) as in Figure 3. 
To make the referent of the Accept tag clear, we 
make use of the class field. Recall that the third 
component of the fielded records, the class field, per- 
mits arcs to refer to each other. Both the referring 
and the referenced arcs are assigned to equivalence 
class d. 
2.3 Coreference Annotation in MUC-7 
The MUC-7 Message Understanding Conference 
specified tasks for information extraction, named 
entity and coreference. Coreferring expressions 
are to be linked using SGML markup with 
ID and REF tags (Hirschman and Chinchor, 
1997). Figure 5 is a sample of text from 
the Boston University Radio Speech Corpus 
[~al;. Idc. upenn, edu/Cat alog/LDC96S36, html], 
marked up with coreference tags. (We are grateful 
to Lynette Hirschman for providing us with this 
annotation.) 
Noun phrases participating in coreference are 
wrapped with <coref>...</corer> tags, which can 
bear the attributes ID, REF, TYPE and MIN. Each such 
phrase is given a unique identifier, which may be 
referenced by a REF attribute somewhere else. Our 
example contains the following references: 3 --~ 2, 
4-+ 2,6-+ 5, 7-+ 5, 8--~ 5, 12-+ 11, 15 ~ 13. 
The TYPE attribute encodes the relationship between 
the anaphor and the antecedent. Currently, only 
the identity relation is marked, and so coreferences 
form an equivalence class. Accordingly, our example 
contains the following equivalence classes: {2, 3, 4}, 
{5,6,7,s}, {11, n}, {13,15}. 
In our AG representation we choose the first num- 
ber from each of these sets as the identifier for the 
equivalence class. MUC-7 also contains a specifica- 
tion for named entity annotation. Figure 7 gives an 
example, to be discussed in §3.2. This uses empty 
m 
m 
m 
m 
m 
m 
m 
m 
.-m 
m 
4 
<CGEEF ID="2 " MIN="woman"> 
This woman 
</COREF> 
receives three hundred dollars a 
month under 
<CO~.~ IVf"S"> 
General Relief 
</C01~:.F> 
, plus 
<COREF ID="IG- 
MIN="four hundred dollars"> 
four hundred dollars a month in 
<COREF ID="I7" 
MIN="benefits" REF="16"> 
A.F.D.C. benefits 
<ICORBF> 
<ICOP~T.F> 
for 
<C0REF ID="9" NINf"son"> 
<COREF ID="3" BEF="2"> 
her 
<ICOaEF> 
soH 
</COP, EF> 
• who is 
<COREF ID="lO" MIN="citizen" REF="9"> 
a U.S. citizen 
</C0REF>. 
<COREF ID="4 °' REF="2"> 
She 
<ICOP~F> 
J s among 
<COREF ID="I8" MINffi"aliens"> 
an estimated five hundred illegal 
aliens on 
<COEF_,F ID="6" REF="5"> 
General Relief 
</COREF> 
out of 
<COREF ID="ll" MIN="population"> 
<COREF ID="13" HIN="state"> 
the state 
</COP~T.F> 
's total illegal immigrant 
population of 
<C0REF ID="I2" REF="II"> 
one hundred thousand 
</C0REF> 
</C0P, EF> 
</C0REF> 
<COREF ID="7" REFf"S"> 
General Relief 
</COF.EF> 
is for needy families and unemployable 
adttlts vho don't qualify for other public 
assistance. Welfare Department spokeswoman 
Michael Reganburg says 
<COREF ID=-"I5" MINf"state" REF="I3"> 
the state 
</CO~> 
will save about one million dollars a year if 
<COREF ID="2O" NINf"aliens" EEF="I8"> 
illegal aliens 
</COE~F> 
are denied 
<COREF IDa"S" REF="5"> 
General Relief 
</COP#F> 
Figure 5: Coreference Annotation for BU Example 
f~. 
U.S. " CPVe~zI~9 
Figure 6: Annotation Graph for Coreference Example 
tags to get around the problem of cross-cutting hier- 
archies. This problem does not arise in the annota- 
tion graph formalism; see (Bird and Liberman, 1999, 
2.7). 
3 Hybrid Annotations 
There are many cases where a given corpus is anno- 
tated at several levels, from discourse to phonetics. 
While a uniform structure is sometimes imposed, 
as with Partitur (Schiel et al., 1998), established 
practice and existing tools may give rise to corpora 
transcribed using different formats for different lev- 
els. Two examples of hybrid annotation will be dis- 
cussed here: a TRAINS+DAMSL annotation~ and 
an eight-level annotation of the Boston University 
Radio Speech Corpus. 
3.1 DAMSL annotation of TRAINS 
The TRAINS corpus (Heeman and Allen, 1993) is a 
collection of about 100 dialogues containing a total 
of 5,900 speaker turns \[~,.ldc.upenn.edu/Catalog /LDC95S25.htral\]. 
Part of a transcript is shown 
below, where s and u designate the two speakers, 
<sil> denotes silent periods, and + denotes 
boundaries of speaker overlaps. 
uttl : s: 
utt2 : u: 
u¢¢3 : 
utt4 : 
utt5 : s: 
utt8 : u: 
utt7 : 
utt8 : s: 
utt9 : 
uttlO : u: 
uttll : 
uttl2 : s: 
uttl3 : u: 
utt14 : s: 
uttl5 : u: 
uttl6 : s: 
uttl7 : u: 
hello <nil> can I help you 
yes <nil> um <nil> I have a problem here 
I need to transport one tanker of orange juice 
to Avon <nil> and a boxcar of bananas to 
Coming <nil> by three p.m. 
and I think itJs midnight now 
uh right it's midnight 
okay so we need to <sil> 
um get a tanker of 0J to Avon is the first 
thin E ee need to do 
+ so + 
+ okay + 
<click> so we have to make orange juice first 
mm-hm <sil> okay so weJro senna pick up <ell> 
an engine two <sil> from Elmira 
go to Coming <sil> pick up the tanker 
mm-~un 
go back to Elmira <ell> to get <nil> pick up 
the orange juice 
alright <nil> tun well <nil> we also need to 
make the orange juice <nil> so we need to get 
+ oranges <nil> to Elmira + 
+ oh we need to pick up + oranges oh + okay + 
+ yeah + 
alright so <nil> engine number two is going to 
pick up a boxcar 
Accompanying this transcription are a number of 
xwaves label files containing time-aligned word-level 
and segment-level transcriptions. Below, the start of 
file speaker0.words is shown on the left, and the start 
of file speaker0.phones is shown on the right. The 
first number gives the file offset (in seconds), and the 
middle number gives the label color. The final part 
5 
This woman receives 
<b_nunex TYPE="MONEY"> 
three hundred dollars 
<e_nunex> 
a month under General Relief, plus 
<b_nunex TYPE="MONEY"> 
four hundred dollars 
<e_numex> 
a month in A.F.D.C. benefits for her son, ~ho is a 
<b_enamex TYPE="LOCATION"> 
U.S. 
<e_enanex> 
citizen. She's among an estimated five hundred illegal 
aliens on General Relief out of the stateSs total illeEal 
immigrant population of one hundred thousand. General 
Relief is for needy families and unemployable adults 
she don't qualify for other public assistance. 
<b_enamex TYPE="0RGANIZATION"> 
Welfare Department 
<e_ena~ex> 
spokeswoman 
<b_enamex TYPE:"PERSOM"> 
Michael ReEanbuzg 
<e_enamex> 
says the state will save about 
<b_numex TYPE="MONEY"> 
one million dollars 
<e_numex> 
a year if illegal aliens are denied General Relief. 
Figure 7: Named Entity Annotation for BU Example 
is a label for the interval which ends at the speci- 
fied time. Silence is marked explicitly (again using 
<sil>) so we can infer that the first word 'hello' occu- 
pies the interval \[0.110000, 0.488555\]. Evidently the 
segment-level annotation was done independently of 
the word-level annotation, and so the times do not 
line up exactly. 
0.110000 122 <sil> 0.100000 122 <sil> 
0.488555 122 hello 0.220000 122 hh 
0.534001 122 <sil> 0.250000 122 eh ;* 
0.640000 122 can 0.330000 122 1 
0.690000 122 I 0.460000 122 ov+l 
0.830000 122 help 0.530000 122 k 
1.O68003 122 you 0.570000 122 ih 
14.670000 122 <sil> 0.640000 122 n 
14.920000 122uh 0.690000 122 ay 
15.188292 122 right 0.760000 122 hh 
The TRAINS annotations show the presence of 
backchannel cues and overlap. An example of over- 
lap is shown below: 
50.130000 122 <sil> 
50.260000 122 so 
50.330000 122 we 
50.480000 122 need 
50.540000 122 to 
50.651716 122 get 
51.360000 122 oranges 
51.470000 122 <sil> 
51.540000 122 to 
51.975728 122 Elmira 
52.807837 76 <sil> 
53.047996 76 yeah 
51.094197 122 <sil> 
51.306658 122 oh 
51.410000 122 ue 
51.560000 122 need 
51.620000 122 to 
51.850000 122 pick 
52.020000 122 up 
52.470000 122 oranEes 
52.666781 122 oh 
52.940000 122 okay 
53.535600 122 <sil> 
53.785600 122 alright 
54.303529 122 so 
As seen in Figure 2 and explained more fully in 
(Bird and Liberman, 1999), overlap carries no impli- 
Cations for the internal structure of speaker turns or 
for the position of turn-boundaries. 
Now, independently of this annotation there is 
also a dialogue annotation in DAMSL, as shown in 
Figure 8. Here, a dialog is broken down into turns 
and thence into utterances, where the tags contain 
discourse-level annotation. 
In representing this hybrid annotation as an AG 
we are motivated by the following concerns. First, 
we want to preserve the distinction between the 
TRAINS and DAMSL components, so that they can 
remain in their native formats (and be manipulated 
by their native tools) and be converted indepen- 
dently to AGs then combined using AG union, and 
so that they can be projected back out if necessary. 
Second, we want to identify those boundaries that 
necessarily have the same time reference (such as 
the end of utterance 17 and the end of the word 
'Elmira'), and represent them using a single graph 
node. Contributions from different speakers will 
remain disconnected in the graph structure. Finally, 
we want to use the equivalence class names to allow 
cross-references between utterances. A fragment of 
the proposed annotation graph is depicted using our 
visualization format in Figure 9. Observe that, for 
brevity, some discourse tags are not represented, and 
the phonetic segment level is omitted. 
Note that the tags in Figure 8 have the form of 
fielded records and so, according to the AG defini- 
tion, all the attributes of a tag could be put into 
a single label. We have chosen to maximally split 
such records into multiple arc labels, so that search 
predicates do not need to take account of inter- 
nal structure, and to limit the consequences of an 
erroneous code. A relevant analogy here is that of 
pre-composed versus compound characters in Uni- 
code. The presence of both forms of a character in 
a text raises problems for searching and collating. 
This problem is avoided through normalization, and 
this is typically done by maximally decomposing the 
characters. 
3.2 Multiple annotations of the BU corpus 
Linguistic analysis is always multivocal, in two 
senses. First, there are many types of entities and 
6 
<Dialog Id=d92a-~.2 J~anoCation-date="08-14-97" Anuotator="Re¢onciled Version" 
Speech="/d92a-2.2/dialog.fea" Statue=Verified> 
<Turn Id=T9 Speakez~"s" Speech="-s 44.853889 -e 52.175728"> 
<U¢¢ Id=uttl7 A~eenent=None Influence-on-listener=Action-directive Influence-on-speaker=Commit Info-level=Task Response-to="" 
Speech="-s 45.87 -e 52.175728" Statement=Assert> 
[sil] tun eell [ail] ge also need ¢0 make the orange juice [sil] 
so ve need to get + oranges [sil] to Elmira + 
<Turn Id=T10 Speaker="u" Speech="-s 51.106658 -e 53.14"> 
<Uct Id=uct18 AEreenent=Accept Influence-on-listener=Action-directive Influence-on-speaker=Commit Info-level=Task 
Response-to-"ut¢17" Speech="-s 51.106658 -e 52.67" Statement=Assert Understanding=SU-Acknogledge> 
+ oh ge need to pick up + oranges 
<Utt Id--uCt19 Agreement=Accept Influence-on-speaker=Commit Info-level=Task Response-¢o="utt17" Speech="-s 52.466781 -e 53.14" 
Understandin~None> 
oh + okay ÷ 
<Turn Id=Tli Speaker~"s" Speech="-s 52.047996 -e 53.247996"> 
<Utt Id--utt20 Agreement=Accept Info-level=Task Response-to="uttl8" Speech="-s 52.047996-e 53.247996" Understanding=SU-Ackno~ledge> 
+ yeah+ 
</Dialog> 
Figure 8: DAMSL Annotation of a TRAINS Dialogue 
D/ 
Utt/ 
W/ 
W/ 
Utt/ 
D~ 
: i : : 
"13 "26 "33 :48 "54 "65 i "36 -47 "54 i~7 
1.09 "30 i.41 i.56 i.62 i.85 !.02 
I 
:so ,i04 
• '.47 "66 :.94 
Figure 9: Graph Structure for TRAINS Example 
relations, on many scales, from acoustic features 
spanning a hundredth of a second to narrative 
structures spanning tens of minutes. Second, there 
are many alternative representations or construals 
of a given kind of linguistic information. 
Sometimes these alternatives are simply more 
or less convenient for a certain purpose. Thus a 
researcher who thinks theoretically of phonological 
features organized into moras, syllables and feet, 
will often find it convenient to use a phonemic 
string as a representational approximation. In 
other cases, however, different sorts of transcription 
or annotation reflect different theories about the 
ontology of linguistic structure or the functional 
categories of communication. 
The AG representation offers a way to deal pro- 
ductively with both kinds of multivocality. It pro- 
vides a framework for relating different categories of 
linguistic analysis, and at the same time to compare 
different approaches to a given type of analysis. 
As an example, Figure 10 shows an AG- 
based visualization of eight different sorts of 
annotation of a phrase from the BU Radio 
Corpus, produced by Mari Ostendorf and others 
at Boston University, and published by the 
LDC [~.Idc.upenn.edu/Catalog/LDC96S36.html]. 
The basic material is from a recording of a 
local public radio news broadcast. The BU 
annotations include four types of information: 
orthographic transcripts, broad phonetic transcripts 
(including main word stress), and two kinds 
7 
of prosodic annotation, all time-aligned to the 
digital audio files. The two kinds of prosodic 
annotation implement the system known as ToBI 
\[wvw. ling. ohio-state, edu/phonetics/E_ToBI/\]. 
ToBI is an acronym for "Tones and Break 
Indices", and correspondingly provides two types of 
information: Tones, which are taken from a fixed 
vocabulary of categories of (stress-linked) "pitch 
accents" and (juncture-linked) "boundary tones"; 
and Break Indices, which are integers characterizing 
the strength and nature of interword disjunctures. 
We have added four additional annota- 
tions: coreference annotation and named 
entity annotation in the style of MUC-7 
\[wWW .muc. saic. com/proceedings/muc_7_t oc. html\] 
provided by Lynette Hirschman; syntactic structures 
in the style of the Penn TreeBank (Marcus et al., 
1993) provided by Ann Taylor; and an alternative 
annotation for the F0 aspects of prosody, known as 
Tilt (Taylor, 1998) and provided by its inventor, 
Paul Taylor. Taylor has done Tilt annotations for 
much of the BU corpus, and will soon be publishing 
them as a point of comparison with the ToBI tonal 
annotation. Tilt differs from ToBI in providing a 
quantitative rather than qualitative characterization 
of F0 obtrusions: where ToBI might say "this is a 
L+H* pitch accent," Tilt would say "This is an Fo 
obtrusion that starts at time to, lasts for duration d 
seconds, involves a Hz total F0 change, and ends l 
Hz different in F0 from where it started." 
As usual, the various annotations come in a bewil- 
dering variety of file formats. These are not entirely 
trivial to put into registration, because (for instance) 
the TreeBank terminal string contains both more 
(e.g. traces) and fewer (e.g. breaths) tokens than the 
orthographic transcription does. One other slightly 
tricky point: the connection between the word string 
and the "break indices" (which are ToBI's character- 
izations of the nature of interword disjuncture) are 
mediated only by identity in the floating-point time 
values assigned to word boundaries and to break 
indices in separate files. Since these time values are 
expressed as ASCII strings, it is easy to lose the 
identity relationship without meaning to, simply by 
reading in and writing out the values to programs 
that may make different choices of internal variable 
type (e.g. float vs. double), or number of decimal 
digits to print out, etc. 
Problems of this type are normal whenever multi- 
ple annotations need to be compared. Solving them 
is not rocket science, but does take careful work. 
When annotations with separate histories involve 
mutually inconsistent corrections, silent omissions of 
problematic material, or other typical developments, 
the problems are multiplied. In noting such difficul- 
ties, we are not criticizing the authors of the annota- 
tions, but rather observing the value of being able to 
put multiple annotations into a common framework. 
Once this common framework is established, via 
translation of all eight "strands" into AG graph 
terms, we have the basis for posing queries that 
cut across the different types of annotation. For 
instance, we might look at the distribution of Tilt 
parameters as a function of ToBI accent type; or 
the distribution of Tilt and ToBI values for initial 
vs. non-initial members of coreference sets; or the 
relative size of Tilt F0-change measures for nouns 
vs. verbs. 
We do not have the space in this paper to dis- 
cuss the design of an AG-based query formalism at 
length - and indeed, many details of practical AG 
query systems remain to be decided - but a short 
discussion will indicate the direction we propose to 
take. Of course the crux is simply to be able to put 
all the different annotations into the same frame of 
reference, but beyond this, there are some aspects 
of the annotation graph formalism that have nice 
properties for defining a query system. For example, 
if an annotation graph is defined as a set of "arcs" 
like those given in the XML encoding in §1, then 
every member of the power set of this arc set is 
also a well-formed annotation graph. The power set 
construction provides the basis for a useful query 
algebra, since it defines the complete set of possible 
values for queries over the AG in question, and is 
obviously closed under intersection, union and rela- 
tive complement. As another example, various time- 
based indexes are definable on an adequately time- 
anchored annotation graph, with the result that 
many sorts of precedence, inclusion and overlap rela- 
tions are easy to calculate for arbitrary subgraphs. 
See (Bird and Liberman, 1999, §5) for discussion. 
In this section, we have indicated some of the ways 
in which the AG framework can facilitate the anal- 
ysis of complex combinations linguistic annotations. 
These annotation sets are typically multivocal, both 
in the sense of covering multiple types of linguistic 
information, and also in the sense of providing multi- 
ple versions of particular types of analysis. Discourse 
studies are especially multivocal in both senses, and 
so we feel that this approach will be especially help- 
ful to discourse researchers. 
4 Conclusion 
This proliferation of formats and approaches can be 
viewed as a sign of intellectual ferment. The fact 
that so many people have devoted so much energy to 
fielding new entries into this bazaar of data formats 
indicates how important the computational study of 
communicative interaction has become. However, 
for many researchers, this multiplicity of approaches 
I 
II 
l 
II 
II 
[] 
II 
[] 
II 
!1 
II 
[] 
II 
II 
II 
II 
II 
I/ 
F- 
8 
NE 
ToBI 
W 
I 
l 
B 
: ~ : i! : 
.il , ;. i;. 
!H* !H- L+H* L* !H* l.-Lqr 
Breafl : : 
ToBI H" 
Tilt ~zn.s [] 
' "', .... I' 
1 2 
: ~ : ~ ! 
, ,t , •., i ; .' 
i3 ii '* i 
H* H* H-L~" H* 
f/133.8 f/l'r/.4 ! 
: i 
j ; , 
3 4 
! ~ ~ ~ ill ~ i ~ ~ ! ! 
' I ~; • .; ; . : ' :~: : • :; : !: : : :': :" 
5 :: : 6 i ! 7 i 8 : i9 
~ : : :. . 
L'L-H* H w H* L-H~ H* L-LSF 14" H* L-L~ 
|a .m am nm 
:: ~ : : ! ~ !i 
ii: , ~ ~ ~: ; ;; I: I ; ii ~i I . 
5 6 7 8 9 
Figure 10: Visualization for BU Example 
has produced headaches and confusion, rather than 
productive scientific advances. We need a way to 
integrate these approaches without imposing some 
form of premature closure that would crush experi- 
mentation and innovation. 
Both here, and in associated work (Bird and 
Liberman, 1999), we have endeavored to show 
how all current annotation formats involve the 
basic actions of associating labels with stretches 
of recorded signal data, and attributing logical 
sequence, hierarchy and coindexing to such labels. 
We have grounded this assertion by defining 
annotation graphs and by showing how a disparate 
range of annotation formats can be mapped into 
AGs. This work provides a central piece of the 
algebraic foundation for inter-translatable formats 
and inter-operating tools. The intention is not 
to replace the formats and tools that have been 
accepted by any existing community of practice, 
but rather to make the descriptive and analytical 
practices, the formats, data and tools universally 
accessible. This means that annotation content 
for diverse domains and theoretical models can 
be created and maintained using tools that are 
the most suitable or familiar to the community in 
question. It also means that we can get started 
on integrating annotations, corpora and research 
findings right away, without having to wait until 
final agreement on all possible tags and attributes 
has been achieved. 
There are many existing approaches to discourse 
annotation, and many options for future approaches. 
Our explorations presuppose a particular set 
of goals: (i) generality, specificity, simplicity; 
(ii) searchability and browsability; and (iii) 
malnfalnability and durability. These are discussed 
in full in (Bird and Liberman, 1999, §6). By 
identifying a common conceptual core to all 
annotation structures, we hope to provide a 
foundation for a wide-ranging integration of tools, 
formats and corpora. One might, by analogy to 
translation systems, describe AGs as an interlingua 
which permits free exchange of annotation data 
between n systems once n interfaces have been 
written, rather than n 2 interfaces. 
Although we have been primarily concerned with 
the structure rather than the content of annota- 
tions, the approach opens the way to meaningful 
evaluation of content and comparison of contentful 
differences between annotations, since it is possible 
to do all manner of quasi-correlational analyses of 
parallel annotations. A tool for converting a given 
format into the AG framework only needs to be 
written once. Once this has been done, it becomes 
a straightforward task to pose complex queries over 
multiple corpora. Whereas if one were to start with 
annotations in several distinct file formats, it would 
be a major programming chore to ask even a simple 
question. 
Acknowledgements 
We are grateful to the following people for discus- 
sions and input concerning the material presented 
here: Chris Cieri, Dave Graft, Julia Hirschberg, 
Lynette Hirschman, Brian MacWhinney, Ann Tay- 
lor, Paul Taylor, Marilyn Walker, and three anony- 
mous reviewers. 
9 
II 
I 

References 
James Allen and Mark Core. 1997. Draft of 
DAMSL: Dialog act markup in several layers. 
\[www.cs.rochester.edu/research/tralns/annotation 
/RevisedManual/RevisedManual.html\]. 
Steven Bird and Mark Liberman. 1999. A formal 
framework for linguistic annotation. Technical 
Report MS-CIS-99-01, Department of Computer 
and Information Science, University of 
Pennsylvania. \[xxx.lanl.gov/abs/cs.CL/9903003\], 
expanded from version presented at ICSLP-98, 
Sydney. 
Barbara Di Eugenio, Pamela W. Jordan, and Liina 
Pylkk~inen. 1998. The COCONUT project: 
Dialogue annotation manual. Technical Report 
98-1, University of Pittsburgh, Intelligent Systems 
Program. 
\[www.isp.pitt.edu/-intgen/coconut.html\]. 
Gerald Gazdar and Chris Mellish. 1989. Natural 
Language Processing in Prolog: An Introduction to 
Computational Linguistics. Addison-Wesley. 
Peter A. Heeman and James Allen. 1993. The 
TRAINS 93 dialogues. Technical Report TRAINS 
Technical Note 94-2, Computer Science 
Department, University of Rochester. 
\[ftp~cs.rochester.edu/pub/papers/ai 
/94.tn2.Trains_93_dialogues.ps.gz\]. 
Lynette Hirschman and Nancy Chinchor. 1997. 
Muc-7 coreference task definition. In Message 
Understanding Conference Proceedings. Published 
online. 
\[www.muc.saic.com/proceedings/muc_7_toc.html\]. 
Ray Jackendoff. 1972. Semantic Interpretation in 
Generative Grammar. Cambridge Mass.: MIT 
Press. 
Mitchell P. Marcus, Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1993. Building a large 
annotated corpus of English: The Penn Treebank. 
Computational Linguistics, 19(2):313-30. 
www.cis.upenn.edu/treebank/home.html. 
Florian Schiel, Susanne Burger, Anja Geumann, 
and Karl Weilhammer. 1998. The Partitur format 
at BAS. In Proceedings of the First International 
Conference on Language Resources and Evaluation. 
\[www.phonetik.uni-muenchen.de 
/Bas/BasFormatseng.html\]. 
Ann Taylor, 1995. Dysfluency Annotation Stylebook 
for the Switchboard Corpus. University of 
Pennsylvania, Department of Computer and 
Information Science. 
\[ftp.cis.upenn.edu/pub/treebank/swbd/doc/DFL- 
book.ps\], original version by Marie Meteer et 
al. 
Paul A. Taylor. 1998. The tilt intonation model. 
In Proceedings of the 5th International Conference 
on Spoken Language Processing. 
