Building a Tool for Annotating Reference in Discourse 
Jonathan DeCristofaro 
CIS Department 
University of Delaware 
103 Smith Hall 
Newark, DE 19716, USA 
decris@cis.udel.edu
Michael Strube 
IRCS 
University of Pennsylvania 
3401 Walnut Street, Suite 400A 
Philadelphia, PA 19104, USA 
strube@linc.cis.upenn.edu
Kathleen F. McCoy 
CIS Department 
University of Delaware
103 Smith Hall 
Newark, DE 19716, USA 
mccoy@cis.udel.edu
Abstract 
We discuss the development of a system for 
marking several types of reference to facilitate 
the analysis of reference in discourse. The tool 
is designed to be used in three applications:
generating training data for machine learning 
of co-reference relations, evaluating theories
of referring expression generation and resolu- 
tion in texts, and developing theories for un- 
derstanding reference in dialogs. The need to 
mark any of a broad set of relations which may 
span several levels of discourse structure drives 
the system architecture. The system has the 
ability to collect statistics over encoded rela-
tions and measure inter-coder reliability, and
includes tools to increase the accuracy of the 
user's markings by highlighting the discrep-
ancies between two sets of markings. Using
parsed corpora as the input further reduces the 
human workload and increases reliability. 
1 Motivation
To examine the phenomenon of reference in discourse, 
and to analyze how discourse structure and reference in- 
teract, we need a tool which allows several kinds of func-
tionality including mark-up, visualization, and evalua- 
tion. Before designing such a tool, we must carefully an-
alyze the kinds of information each application requires. 
Three applications have driven the design of the sys- 
tem. These are: 1) the creation of training data for auto- 
matic derivation of reference resolution algorithms (i.e.,
machine learning), 2) the formation of a testbed for eval-
uating proposed reference generation and anaphora reso-
lution theories, and 3) the development of theories about 
understanding reference in dialog. The influence that 
these three areas have upon the functional requirements 
of an annotation system is discussed below.
In this paper we first describe the requirements that
each of these three related applications demand from 
a discourse annotation tool geared to aid in answering 
questions concerning reference. We next discuss some 
of the theoretical implications and decisions concerning 
the tool development that have arisen from these require- 
ments. Next we describe the tool itself. Finally, we
discuss related work, future directions of this work, and 
some conclusions. 
1.1 Machine Learning 
Consider a learning task in which we will present the 
learner with a sequence of triples of the form (E, F, U), 
where: 
• E is a pair of text expressions EA and EB,
• F is a vector of features describing the expressions,
and
• U is the classification: + if EA and EB co-refer, -
otherwise.
(Two expressions co-refer when they denote the same
discourse entity (DE).) A successful learner will output 
a model which, when given only (E, F), can predict the 
value of U, that is, classify the instance as positive or
negative. We intend to use the annotation tool to produce 
a set of such instances and features. 
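As a concrete sketch of this instance format in Python (the class, field names, and toy model are our own illustration, not part of any existing system):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Instance:
    """One learning triple (E, F, U), as described above."""
    expressions: Tuple[str, str]   # E = (EA, EB), two marked text spans
    features: Dict[str, object]    # F = feature vector for the pair
    coreferent: bool               # U = True iff EA and EB co-refer

# Toy instances for "Mary saw John. She waved.":
pos = Instance(("Mary", "She"), {"pronoun_B": True, "distance": 1}, True)
neg = Instance(("John", "She"), {"pronoun_B": True, "distance": 1}, False)

# A learned model is any function from (E, F) to a predicted U.
def trivial_model(expressions, features):
    # Placeholder only: predict co-reference for any nearby pronoun.
    return features.get("pronoun_B", False) and features.get("distance", 99) <= 1
```

Note that the two instances share the same feature vector but differ in U, which is exactly the kind of ambiguity a richer feature set must resolve.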
The first requirement for a tool which would help us 
generate such a body of data is that it must allow us to 
mark all the potential referring expressions. This simply 
means that the user will have the ability to delineate any
span of text which represents a DE, and treat that span as
a single entity. Of course, this is a time-consuming and 
error-prone process and thus it is helpful to automate as
much as possible. In the training phase, the learner must 
be given all the potential antecedents for an anaphoric 
reference, so that it will know how to distinguish the 
proper antecedent from all the other candidates. For the 
testing phase, the correct antecedent's span must be in-
cluded as a marked entity in the corpus, or the learner has
no chance of getting that instance of co-reference right. 
The other crucial function of an annotation tool is to 
let the user associate attributes, or feature values, with the
marked expression. During the training phase, a learning 
algorithm is trying to find correlations between the fea- 
tures F and the classification U. Choosing the set of
features to include in the learning phase is a very diffi- 
cult task. The set must be sufficiently rich so as to in- 
clude all of those features which might affect a refer- 
ring expression's resolution. On the other hand, since 
the learner will likely find that only certain features pre- 
dict co-reference, we do not want to burden the learner 
with many useless features that will bog it down with 
computational complexity. Also, a less restricted set of 
features permits more opportunity for inconsistency in a
given coder's markings and disagreement among coders 
(Condon & Cech, 1995). 
We cannot know (before training) exactly which fea- 
tures are most predictive of co-reference. So, we will
try to mark a set of features which is a superset of the 
necessary features. Drawing on the feature sets used in 
Connolly et al. (1997) and Ge et al. (1998), we believe 
the following factors might indicate co-reference:
• Syntactic role (e.g. Subject, Object, Prepositional 
Object,...), 
• Pronominalization (yes or no),
• Distance between EA and EB (an integer),
• Definiteness (yes or no), 
• Semantic role (e.g. indicating location, manner, 
time,...), 
• Nesting depth of an NP (an integer),
• Information status (as defined by Strube (1998)) of 
the DE, 
• Gender, Number, Animacy. 
The tool must allow the coder to assign values for these 
features to each marked expression, but should not de- 
mand that every expression has a value assigned for ev- 
ery feature. 
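A minimal sketch of such partial feature assignment (the feature inventory and function are hypothetical illustrations, not the tool's actual schema):

```python
# Hypothetical feature inventory drawn from the list above; the names
# and value sets are illustrative only.
FEATURES = {
    "syntactic_role": {"subject", "object", "prep_object"},
    "pronominalized": {True, False},
    "definiteness": {True, False},
    "gender": {"masc", "fem", "neut"},
}

def assign(codings, expr_id, feature, value):
    """Attach one feature value to a marked expression."""
    if value not in FEATURES.get(feature, set()):
        raise ValueError(f"unknown feature/value: {feature}={value}")
    codings.setdefault(expr_id, {})[feature] = value

codings = {}
assign(codings, "np3", "pronominalized", True)
# No value was assigned for "gender", and none is demanded:
assert "gender" not in codings["np3"]
```

The open-ended `FEATURES` dictionary also reflects the requirement that users may add further features beyond the initial superset.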
Since we cannot claim that this set of features is ex-
haustive, the tool must allow further features to be added 
by the user. Since reliability of feature assignment is 
important, the tool should have the ability to extract as
many features as possible automatically (for example,
from a parsed corpus). In addition, since some features
must be hand-marked, the tool must have the ability to 
compare feature marking between two coders for the 
same text.
1.2 Evaluating Anaphora Generation and 
Resolution Algorithms
Our discourse annotation and visualization tool also ful- 
fills the role of a testbed in which we can examine the-
ories of generating and resolving anaphoric expressions.
From the generation perspective, we look for answers to
questions concerning when it is appropriate to generate a
pronoun versus some other anaphoric expression (e.g., 
a definite description or name; see McCoy & Strube 
(1999a), McCoy & Strube (1999b)).
Some researchers have looked at the question of when 
to generate a pronoun (versus some other description) 
(e.g., McDonald (1980), McKeown (1983), McKeown 
(1985), Appelt (1981)). In this work the decision was 
based on a notion of focus of attention (Sidner, 1979) - 
if an entity was the focus of the previous sentence and
is the focus of the current sentence, then use a pronoun. 
To evaluate such claims, not only must co-reference re- 
lations be marked in a text, but information concerning
focusing data structures must be kept. 
Dale (1992) discussed the generation of pronouns in 
the context of work on generating referring expressions 
(Appelt, 1985; Reiter, 1990). Dale suggests the princi-
ples of efficiency and adequacy which favor generating
the smallest referring expression that distinguishes the 
object in question from all others in the context. This 
notion was somewhat altered in Dale & Reiter (1995) to 
more adequately reflect human-generated referring ex-
pressions and to be more computationally tractable.
Other researchers have suggested that a notion of dis-
course structure must be taken into account when gener- 
ating referring expressions. In particular, Grosz & Sid-
ner (1986) and Reichman (1985) both suggest that a full
noun phrase might be generated at discourse segment 
boundaries when a pronoun might have been adequate 
(in Dale's sense). Passonneau (1996b) argues for the use
of the principles of information adequacy and economy. 
Her algorithm takes discourse segmentation into account 
through the use of focus spaces which are associated with
discourse segments. Passonneau argues that a fuller de-
scription might be used at a boundary because the set of
accessible objects changes at discourse segment bound- 
aries. 
Passonneau's work suggests additional features which 
must be marked in a text to evaluate referring expression 
generation algorithms. These include discourse segment 
boundaries and sets of "confusable" DE's contained in
the focus space. Thus the definition of what constitutes a
discourse segment is another item which is open to re- 
search; our tool should allow for alternative markings 
of discourse segments so that various algorithms can be
evaluated. For example, in our current work we look at 
changes in time as segment boundaries. Other definitions
are possible. So, the tool must be able to keep informa-
tion for various alternative algorithms. 
While it is intuitively appealing that notions of dis- 
course segmentation affect pronoun generation, the 
above work fails to identify how a discourse segment 
should be defined to a generation algorithm - thus it is
not clear how this work can be applied to the generation
process. 
Given this previous work, we need a tool that will al- 
low us to specify (1) alternative definitions of discourse
segmentation, and (2) alternative algorithms for pronoun 
versus definite description generation (and anaphora res- 
olution). The tool must have the ability to then calculate 
statistics so that the alternative definitions and algorithms 
can be compared. 
Thus, this application requires the ability to spec- 
ify co-reference relations, associate various features 
with referring expressions (both syntactic and discourse- 
relevant), calculate the results of certain well-specified 
algorithms on the referring expressions, and tabulate the
results of such algorithms. In addition to this information 
on referring expressions themselves, the tool must allow 
the marking of arbitrary features over arbitrary pieces 
of text (e.g., for alternative definitions of discourse seg- 
ments). Because this work is exploratory in nature, the 
tool should allow a researcher to easily find places where 
various algorithms fail so that they can be examined and 
the algorithms updated as needed. 
1.3 Understanding Spoken Dialog
The evaluation of algorithms for anaphora resolution in 
spoken dialog requires annotation of discourse structure 
on several levels. This is because spoken dialog shows 
more complex phenomena than written discourse. Prob- 
lematic issues in spoken dialog include 
• the determination of the center of attention in multi- 
party discourse; 
• utterances with no discourse entities; 
• abandoned or partial utterances, interruptions, 
speech repairs;
• the determination of utterance boundaries; 
• the high frequency of discourse deictic and vague 
anaphora (Eckert & Strube, 1999).
In order to capture the complexity of anaphora resolu- 
tion in spoken dialog, the annotation requires a multitude
of steps. 
Dialog Acts. To determine the domain of anaphoric an-
tecedents, the dialog must be divided into short pieces.
We have chosen to use units based on dialog acts for this
task. Therefore, turns have to be segmented into dialog
act units. Our study of anaphoric expressions reveals that
in a dialog between two participants A and B, the DE's 
introduced by A are not added to the shared discourse 
memory model until A's contribution has been acknowl-
edged by B. Thus the segment is important for resolution 
algorithms. 
As in all coding schemes, intercoder reliability (here,
of the dialog act units) must be questioned. For the pur-
pose of applying the Kappa (κ) statistic, the segmenta-
tion task must be turned into a classification task. So, 
we view boundaries between dialog acts as one class and 
non-boundaries as the other (see Passonneau & Litman 
(1997) for a similar practice). The next step is to classify 
dialog act units as particular dialog acts. For this task the
κ statistic is also appropriate.
Individual and Abstract Object Anaphora. Since
spoken dialog shows a high number of discourse deictic 
and vague anaphora, pronouns and demonstratives have 
to be classified accordingly. Thus an additional feature,
anaphor type, must be marked in the corpus. 
Co-Indexatlon of Anaphora and Antecedents. 
Vague pronouns do not have a particular antecedent 
in the text. Hence, they cannot be co-indexed with 
an antecedent. The co-indexation of individual object
anaphora in spoken dialog does not differ from written 
discourse. However, the high number of discourse 
deicfic pronouns requires a second set of markables 
since discourse deictic pronouns can co-specify with
propositions, sentences and even discourse segments.
Therefore, the reliability of the annotation depends on 
(1) the marking of the correct text span and (2) whether 
the correct antecedent is linked with the pronoun. Deter- 
mining the reliability of marking spans of text is difficult 
when any span can be marked, since this means almost 
any word boundary is a candidate segment boundary. 
Here, the κ statistic does not seem meaningful because
of the huge disparity in the number of non-boundaries 
and boundaries. This highly skewed distribution seems 
to overwhelm κ.
Thus we are exploring more appropriate measures of
intercoder reliability on this task. At the moment, our ap-
proach to this problem is to use κ, but restrict the anno-
tators, so that they are allowed to mark only certain con- 
tiguous linguistic objects like verb phrases, sentences, or 
a well defined segment spanning more than one turn. 
2 Annotating a Parsed Corpus 
All of the applications discussed in section 1 depend on
having a corpus of reliably marked expressions, features, 
and relations. In order to determine that these dimen- 
sions have been "reliably marked", we need to measure
agreement between two coders marking the same text.
One way to increase the reliability of the coding (re-
gardless of the method used to measure reliability) is to 
automate part of the coding process. Our system can ex- 
tract a number of markings, features and relations from 
the parsed, part-of-speech-tagged corpora of the type 
found in the Penn Treebank 2 (Marcus et al., 1994).
Use of the Treebank data means we can find most of the 
markables and many of the necessary features before giv- 
ing the task to a human coder. We do not try to extract 
any of the co-reference information from the parsed cor- 
pora. 
2.1 Extracting Markables 
In this context, a markable is a text span representing 
a discourse entity which can be anaphorically referred
to in a text or dialog. The majority of markables are 
noun phrases. Because the Treebank is a fully-parsed 
and well-defined representation of the text, it is trivial 
to determine the boundaries of all of the NP's in the 
text. However, the full set of NP's found by the Tree- 
bank parse is too inclusive for our purposes (i.e., it is a
superset of the NP markables). While the Treebank de- 
lineates all NP's at all levels of embedding, it is not the 
case that each such NP contributes a distinct DE. Con- 
sider the following example containing three NP's in the 
parsed Treebank: 
(1) (NP (NP different parts) (PP of (NP Europe)))
We want to mark both "different parts of Europe" and 
"Europe", since they both contribute distinct DE's. How- 
ever, notice that "different parts" does not contribute a 
DE since it is not possible to refer to this subexpression 
alone in subsequent discourse. 
To avoid finding such undesirable NP's, our system 
has a heuristic (H1) which says: Pass over any NP which
is a leftmost child of a top-level NP. This heuristic is too 
drastic, though, eliminating constructions like (2). 
(2) (NP (NP the inner brain) and (NP the eyes))
To avoid losing these examples, we include another 
heuristic (H2) which says: H1 does not apply when the
NP is a sibling of another NP. A third heuristic must 
be added to overrule H1 in the case of a possessor in a
possessive construction, such as: 
(3) (NP (NP Chicago's) South Side) 
where we should extract both "Chicago" and "Chicago's 
South Side". So, the heuristic H3 is introduced: H1 does
not apply when the NP is a possessive form.
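Heuristics H1-H3 can be sketched over a toy tree encoding, tuples of (label, children); this representation and the function are our own illustration, not the Treebank file format:

```python
def leaves(node):
    """Flatten a (label, children) tuple tree to its words."""
    out = []
    for child in node[1]:
        out.extend([child] if isinstance(child, str) else leaves(child))
    return out

def np_markables(node, parent=None, leftmost=False, has_np_sibling=False, out=None):
    """Collect NP markables, applying heuristics H1-H3 from the text."""
    if out is None:
        out = []
    label, children = node
    skipped = (label == "NP" and parent == "NP" and leftmost  # H1
               and not has_np_sibling                         # H2 overrides H1
               and not leaves(node)[-1].endswith("'s"))       # H3 overrides H1
    if label == "NP" and not skipped:
        out.append(" ".join(leaves(node)))
    subnodes = [c for c in children if not isinstance(c, str)]
    np_count = sum(1 for c in subnodes if c[0] == "NP")
    for i, child in enumerate(subnodes):
        np_markables(child, label, i == 0,
                     child[0] == "NP" and np_count > 1, out)
    return out

# Example (1): "different parts" is skipped by H1; "Europe" survives.
t1 = ("NP", [("NP", ["different", "parts"]),
             ("PP", ["of", ("NP", ["Europe"])])])
assert np_markables(t1) == ["different parts of Europe", "Europe"]
```

Running the same function over trees for examples (2) and (3) marks the coordinated NPs and the possessor, as H2 and H3 require.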
Even with heuristics eliminating the NP's which we 
do not need to consider, there are some NP's that will 
be found by the system which cannot be eliminated au-
tomatically. Copular constructions such as (4) introduce
unnecessary NP's. 
(4) John is a doctor.
"John" and "a doctor" are syntactically NP's, but the 
second does not contribute a unique DE. 
Also, idiomatic expressions such as (5) must be elim- 
inated by hand: 
(5) Ned kicked the bucket.
The syntactic NP "the bucket" refers to no DE and cannot
be the antecedent of any future referring expression, so it 
should not be marked. 
At this time, we do not have a way for the expression 
extracting system to detect and avoid these examples. As 
a result, we must introduce a correction phase in which a 
human corrects the markings, eliminating those that are 
superfluous, and adjusting those that may have been mis- 
marked. The goal is to have a set of expressions which 
is as close as possible to the set of expressions necessary 
and sufficient for the applications. For example, if there 
are many extraneous expressions in the machine learn- 
ing task, they will act as distractors - examples which 
decrease the accuracy of the learned model by diluting 
the highly correlative data with noise. 
2.2 Extracting Features 
In addition to extracting many markables themselves, the
parsed corpora contain information from which many of 
the features can be automatically derived. Some fea- 
tures' values are marked explicitly in the corpus while 
others can be automatically extracted by examining the
tree structure. The simplest source of feature values is 
the Treebank "functional tags". For example, the gram- 
matical functions (syntactic subject, topicalization, logi- 
cal subject of passives, etc.) of phrases and the semantic
role (vocative, location, manner, etc.) are marked in the 
corpus. 
Other features must be found by walking the tree 
structure provided in the Treebank. The form of the
NP (whether the NP is realized as a personal pronoun, 
demonstrative pronoun, or definite description) is a func- 
tion of the part-of-speech tags assigned to the words in 
the NP. Whether the NP is definite, indefinite, or indeter- 
minable depends on whether an article begins the NP. If 
the article is "a", "an", or "some", we assume the NP is
indefinite. "The" indicates definiteness; otherwise, we
assign a value of "none", which simply indicates that
there is no simple way of classifying this instance. The 
case of an NP is usually determined by its position in 
the tree. Any child of a VP is marked as an "object". 
Children of PP's are marked "prep-adjunct" unless the
PP was tagged "PP-put"¹, which indicates that the PP
acts as a complement to the verb. In this case we tag the
NP as "prep-complement". 
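The definiteness and case rules above admit a near-literal sketch (the tree walking itself is omitted; the function names are ours):

```python
def definiteness(np_words):
    """Classify an NP by its initial article, per the rule described above."""
    first = np_words[0].lower()
    if first in ("a", "an", "some"):
        return "indefinite"
    if first == "the":
        return "definite"
    return "none"  # no simple way of classifying this instance

def case_of(parent_label):
    """Assign case from the NP's parent node label, per the rules above."""
    if parent_label == "VP":
        return "object"
    if parent_label == "PP-put":
        return "prep-complement"
    if parent_label.startswith("PP"):
        return "prep-adjunct"
    return None  # position gives no case information

assert definiteness(["the", "engine"]) == "definite"
assert case_of("PP-put") == "prep-complement"
```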
2.3 Relations between Expressions 
We allow two classes of relations to hold between mark-
able entities: the co-reference relation and an open class
of user-definable directional relations. A co-reference
relation holds between A and B when A and B are ex- 
pressions which both refer to the same discourse entity. 
Since co-reference is a symmetric, reflexive, and transi-
tive relation, it divides the set of markables into equiva-
lence classes. Within a given equivalence class, all mem-
¹PP's using "in", "on", or "around" are sometimes marked
PP-put.
bers refer to the same DE. Intuitively, our co-reference
relation is a set of undirected links connecting all co- 
referring expressions. The symmetric property implies 
that it is not meaningful to store the direction of a rela- 
tion. However, we do store each markable's antecedent 
when the user defines a co-reference link, so that we can 
later reconstruct the co-reference chain if necessary. 
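Because co-reference is an equivalence relation, a union-find structure captures the classes while still recording each markable's antecedent, as described above. A sketch (our own, not REFEREE's implementation):

```python
class CorefChains:
    """Co-reference as equivalence classes, with antecedents kept for chains."""
    def __init__(self):
        self.parent = {}      # union-find forest over markable ids
        self.antecedent = {}  # anaphor -> antecedent recorded at link time

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def link(self, anaphor, antecedent):
        self.antecedent[anaphor] = antecedent  # direction kept only here
        self.parent[self.find(anaphor)] = self.find(antecedent)

    def coreferent(self, a, b):
        return self.find(a) == self.find(b)

chains = CorefChains()
chains.link("she", "Mary")
chains.link("the woman", "she")
# Transitivity: "the woman" and "Mary" fall into the same equivalence class.
assert chains.coreferent("the woman", "Mary")
```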
The other kind of link is directional. We allow the user 
to define any number of relations which are not sym-
metric, reflexive, or transitive. The only restriction on 
these relations is that they hold between exactly two en-
tities. Initially, we postulate four such relations which are
necessary to handle indirect co-reference relations, also 
called bridging relations (see also Passonneau (1996a)): 
• Attribute-of
(6) [The car]i won't start because [the engine]j is miss-
ing.
• Propositional-inference
(7) [The man has a gun.]i [That]j scares me.
• Contains
(8) [The peaches]i are in a basket. Give me [the
biggest]j.
• Member-of
(9) [Jack]i and Jill went up the hill. [They]j were
never seen again.
Clearly, these must be directional (i.e., not symmetric)
since, for example, if Member-of(A,B), then we should
not assume Member-of(B,A). The user is not prevented,
however, from defining two such links, one in each direc- 
tion. In fact, Contains and Member-of are logical du-
als; that is, Contains(a,b) iff Member-of(b,a). However,
we are always interested in the relation of a referring ex- 
pression to its potential antecedents and so require that 
the referring expression be the first argument and the an- 
tecedent the second. In (8), Member-of(the biggest, the
peaches), but in (9), Contains(They, Jack) and Con-
tains(They, Jill).
2.4 Measuring Agreement
All of the annotation discussed in the above Sections is 
prone to error when a human is involved. The best way to 
combat these errors is to have several coders annotate the
same corpus according to a coding manual.² A high mea-
sure of agreement between these coders gives us more 
confidence in the reliability of the data. Therefore, we 
²The intent is that the coders will achieve a high degree of
consistency if the manual is clear, and then if the manual accu- 
rately represents the desired coding style, consistency among 
coders implies accuracy of all the codings.
must be able to measure agreement between two³ cod-
ings of the same text. 
The first kind of agreement that we need to measure is 
agreement of two sets of markables. Since we expect a 
few of the markables found by the system to need human 
editing, we may not assume that two coders working on 
the same text will have the same set of markables after 
the correction phase. We define agreement of two sets of 
markables S1 and S2 as

Agreement(S1, S2) = 2c / (a + b)

where a = |S1|, b = |S2|, and c = the number of ex-
pressions marked in S1 that were marked with exactly
the same boundaries in S2. When agreement of mark-
ables is found to be less than 1, the coders are shown 
the expressions on which they disagree and can come 
to agreement (by referring to the coding manual and re- 
marking those passages). We are developing a function 
of the tool which will simultaneously display the two ver- 
sions of the text and highlight the expressions which are 
not common to the two codings. This will make it eas- 
ier to visualize the differences between the codings and 
reach perfect agreement of markables. 
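The agreement measure and the disagreement-highlighting step can be sketched directly (the (start, end) span representation is our own choice):

```python
def markable_agreement(s1, s2):
    """Agreement(S1, S2) = 2c / (a + b) over sets of (start, end) spans."""
    a, b = len(s1), len(s2)
    c = len(s1 & s2)  # spans marked with exactly the same boundaries in both
    return 2 * c / (a + b)

def disputed(s1, s2):
    """Spans to highlight for the coders: marked by one coder but not the other."""
    return s1 ^ s2

coder1 = {(0, 8), (12, 20), (25, 31)}
coder2 = {(0, 8), (12, 19), (25, 31)}
assert markable_agreement(coder1, coder2) == 2 / 3
assert disputed(coder1, coder2) == {(12, 20), (12, 19)}
```

Here the coders disagree only on the right boundary of one span, and the symmetric difference is exactly what the comparison display would highlight.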
The second kind of agreement measures agreement 
between two coders' co-reference codings. We require 
that the two coders have the same set of markables before 
comparing their co-reference annotations, so achieving 
markable agreement of 1 is a prerequisite for this calcu-
lation. As discussed in section 2.3, the co-reference rela- 
tion divides the set of markables into equivalence classes. 
A model-theoretic algorithm proposed by Vilain et al.
(1995) uses these co-reference classes to define a preci-
sion and recall metric which yields intuitively plausible
results and is easy to calculate. The method depends on 
counting how many co-reference links must be added to
one coder's equivalence classes to transform the set into
that found by the other coder. We adopt this method and
enable the tool to perform the computation between any
two codings which fully agree on the underlying set of
markables.
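A sketch of the Vilain et al. (1995) link-based recall, under the paper's assumption of a shared markable set (precision is the same computation with the two codings swapped):

```python
def muc_recall(key_classes, response_classes):
    """Vilain et al. (1995): fraction of the key's co-reference links preserved."""
    numerator = denominator = 0
    for s in key_classes:
        # Partition of s induced by the response; markables the response
        # left unlinked count as singleton cells.
        cells = {frozenset(r & s) for r in response_classes if r & s}
        covered = set().union(*cells) if cells else set()
        p = len(cells) + len(s - covered)
        numerator += len(s) - p      # links still present within s
        denominator += len(s) - 1    # links needed to connect s
    return numerator / denominator

# The key links A-B-C-D into one class; the response splits it in two,
# losing one of the three links.
key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
assert muc_recall(key, response) == 2 / 3
```

The count len(s) - p is exactly the number of links that need not be added to the response to reconstruct s, which makes the metric cheap to compute over equivalence classes.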
Finally, we can measure feature-value agreement by
viewing the feature assignment task as a kind of clas-
sification task and then computing Kappa (κ), which
measures how well the coders agree compared to their
random expected agreement⁴ (Carletta, 1996). We con-
form to the method proposed in Poesio & Vieira (1998) 
for computing actual and expected agreement. (Again 
we assume the coders have already agreed on the set of 
markables.) Suppose we are considering a given feature
³Agreement among a set of n > 2 coders is usually calcu-
lated as a function of the n(n-1)/2 pairwise agreements, so we
will discuss only the pairwise case here, realizing that the full
computation is straightforward.
f, which was marked by two coders on each of N ex- 
pressions in a corpus. Percent agreement is simply the
fraction of expressions out of N for which the two coders
assigned the same value to f. Expected agreement is not
computed by assuming that each value is equally likely, 
though. We compute the expected agreement based on 
the actual distribution of values, as follows. For two 
coders, if f takes on values from V,

P(E) = Σ_{v ∈ V} ((c₁(v, f) + c₂(v, f)) / (2N))²

where cᵢ(v, f) is the number of times coder i assigned
value v to feature f. Thus, if the coders have used the
values in a perfectly even distribution among the |V| val-
ues, P(E) = 1/|V|. Any distribution which is not perfectly
even will have an expected agreement higher than this.
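Putting the expected-agreement computation together with the κ definition from the footnote, a sketch (assuming, as above, that the coders share the set of markables):

```python
from collections import Counter

def kappa(values1, values2):
    """Kappa for one feature coded twice over N expressions, with P(E)
    taken from the pooled distribution of values as described above."""
    n = len(values1)
    p_a = sum(v1 == v2 for v1, v2 in zip(values1, values2)) / n
    pooled = Counter(values1) + Counter(values2)  # c1(v,f) + c2(v,f)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_a - p_e) / (1 - p_e)

# Two values used perfectly evenly: P(E) = 1/|V| = 0.5, so chance-level
# agreement (P(A) = 0.5) yields kappa = 0.
v1 = ["yes", "no", "yes", "no"]
v2 = ["yes", "no", "no", "yes"]
assert abs(kappa(v1, v2)) < 1e-12
```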
As with measuring markable agreement, we measure 
feature-value agreement to ensure that we have reliable 
features before using the data for one of the applications 
discussed in section 1. Therefore, coders can ask the sys- 
tem to show the examples for which they disagree on a 
specified feature. Again, the coders have the opportunity 
to recode those examples to achieve perfect agreement
before passing the data to the application. 
3 REFEREE: The Discourse Annotation 
Tool 
We have built a discourse annotation and visualization 
tool which is designed according to the issues discussed 
in section 1 and which has all the capabilities described
in section 2. REFEREE⁵ is a graphical interface tool writ-
ten in Tcl/Tk. This makes it highly portable and easily
extensible. 
3.1 Annotation Modes
The tool has three "modes" - reference mode, segment 
mode, and dialog mode. In reference mode, the user 
can mark expressions, associate features with any ex- 
pression, and assign co-reference (or other kinds of ref- 
erence) links. Clicking on an expression with the mouse 
displays the features of that expression and highlights all 
other expressions in the text which co-refer with it. At 
this point, the user can update the co-reference or feature
information or type some notes to be stored with the ex- 
pression. (These notes are shown with the features when- 
⁴κ = (P(A) - P(E)) / (1 - P(E)),
where P(A) is the proportion of times the annotators agree and
P(E) is the proportion of times the annotators are expected to
agree by chance.
⁵for Referring Expression Reader/Editor
ever this expression is clicked on in the future.) Easy vi- 
sualization of the co-reference equivalence classes could 
aid the user as he clicks through the text and sees how 
the co-reference chains thread through discourse. 
A byproduct of the built-in flexibility of REFEREE is 
the ability to use different feature "masks" in case the 
user only wants to consider some subset of the complete 
set of marked features. For example, the user can con- 
figure the tool to display and allow changes to only the
pronominalization feature. Then, the irrelevant features 
are not displayed and cannot be changed until the tool is 
reconfigured. This is also useful for associating different 
feature sets with different kinds of expressions. 
Segment mode allows the user to break the text into 
arbitrarily nesting and overlapping segments. (These do 
not have to correspond to any certain definition of dis- 
course segment or text segment.) This allows the user 
the freedom to choose any degree of constraints upon the 
structure. When the user selects a region and clicks on 
the "mark" button, a new segment is created spanning 
that region. Thus, we can build up a list of start and end 
points of segments, and automatically determine which 
segments are contained in or overlap with which other 
segments. A separate window displays graphically the 
start and end point of each segment. 
At first glance, this seems to replicate the functional- 
ity of the reference mode, since both modes allow un- 
constrained marking (of contiguous text spans). The im-
portant difference is that in reference mode, the user de- 
lineates referrable entities, while in segment mode, the 
user is marking spans which represent the structure of 
discourse. So, a user could have many spans marked as
segments which exactly coincide with markables in ref- 
erence mode; this simply represents the fact that the user 
believes it is possible for the text to refer to the segments 
or the propositions they express. Still, the segment mark-
ings are not superfluous. They impose a structure on top 
of the reference mode markables, even if some of them 
coincide. (While this could be simulated in the refer- 
ence mode by adding a binary feature for segmenthood, 
the visualization of segments would be lost, as would the
decoupling of the two kinds of spans we mark.) 
The last mode of interaction with REFEREE is dialog
mode. This allows the user to code a dialog by breaking 
it into turns. Each dialog participant's turn can be bro-
ken into utterances which may be labeled as initiation or 
response units. (In some cases, there is overlap between 
these two dialog acts.) The most important function of 
dialog mode, as it relates to understanding reference in 
spoken language is to allow segmentation of the dialog 
into turns assigned to one speaker or the other. Recall that
proper closure of a turn is crucial for determining which
DE's are in the shared discourse model. 
4 Previous Work 
Previous systems were designed for different purposes,
and therefore do not provide all of the functionality that
our applications require. For example, MITRE's Alem-
bic Workbench (Day et al., 1997) builds up an annotated
corpus from scratch, under a mixed-initiative paradigm
(in which some markings are given by the user, and some
are automatically inserted by the computer). Learning an
information extraction system was a primary function of
this system. While the associated Alembic NLP system
does incorporate some discourse level information into
the system, the user may not impose an arbitrarily com-
plex discourse structure which the system can represent.
The Discourse Tagging Tool (Aone & Bennett, 1994)
was designed for tagging multilingual corpora, and also
does not allow complex marking of discourse structure.
Furthermore, the tag sets and relations are fixed and may
not be elaborated by the user. Also, this work was not
concerned with dialogs.
5 Future Work and Conclusions
We are beginning annotation of a parsed corpus using

Figure 1: REFEREE in Reference Mode
3.2 Interface and Implementatien Notes 
Each of the three modes (reference, segment, and dia- 
log) has one main window in which a page of text is dis- 
played. For example, Figwe i shows the main screen for 
reference mode. In this figure, dark text represents the 
NP's that have been marked or extracted from the Tre~ 
bank. The "current expression" is highlighted and core- 
fen'ins expressions are underlined. Though this scheme 
is perhaps visually unpleasing on paper, note that on the 
computer, the application uses vivid colors and easily 
differentiable typefaces. Furthermore, elements of the 
color scheme are cnstomizable by the user. 
The tool saves the user's aunotafioes in several data 
files while leaving the original text file unchanged. Other 
annotation programs have embedded the annotations into 
the text using a sublanguage of XML. Hies generated un- 
• der either method are equally capable of representing the 
desired levels of annotation; we separate the text from 
the annotations \[n order to simplify the parsing of the 
data. In case a REFEREE user should want to port some 
marked text to a new annotation system, it is straightfor- 
ward to automatically generate a text-and-annotation file 
to conform to any XML-style definition. 
Referee. We have found that it is much easier to code a 
corpus and get reliable results when the system has al- 
ready found the majority of the markables. We intend 
to improve the tool by providing more functionality and 
better visualization of patterns in the data. We hope to 
add more complex feature-extraction rules that scarf, h the 
parse tree more extensively for syntactic features ~ are 
evident from the tree suucune; We are also interested 
in using a lexicalized knowledge base to find semantic 
relationships between the marked expressions. 
We believe that the requirements of the intended appli- 
cations dictate the design of a novel and unique tool for 
the analysis of the relationship between discourse struc- 
ture and reference. Referee fills this niche, and greatly 
reduces the workload placed on the human users. Fur- 
thermore, the open design of Referee makes it flexible, 
extenm'ble, and appficable to any number of other appli- 
cations. 
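As an illustration of the portability point made in Section 3.2, converting standoff annotations back into an inline text-and-annotation file is a small mechanical step. The sketch below is an assumption about what such an export might look like; the `<markable>` element, its `id` attribute, and the non-overlapping-span restriction are illustrative, not REFEREE's actual export format.

```python
from xml.sax.saxutils import escape

# Minimal sketch of exporting standoff annotation records into an
# inline XML-style file. The element and attribute names are
# hypothetical, chosen only to illustrate the idea.

def export_inline(text, spans):
    """spans: list of (start, end, id) standoff records over `text`.
    Assumes the spans do not overlap; nested spans would need a stack."""
    out, pos = [], 0
    for start, end, mid in sorted(spans):
        out.append(escape(text[pos:start]))
        out.append('<markable id="%s">%s</markable>' % (mid, escape(text[start:end])))
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)

text = "John saw Mary. He waved."
print(export_inline(text, [(0, 4, "m1"), (15, 17, "m2")]))
# <markable id="m1">John</markable> saw Mary. <markable id="m2">He</markable> waved.
```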
6 Acknowledgments 
This work has been supported by NSF Graduate Traineeship Grant GER-9354869 to the University of Delaware, and a post-doctoral fellowship award from the Institute for Research in Cognitive Science (IRCS) at the University of Pennsylvania (NSF SBR 8920230). Much of this work was completed while the third author was a visiting scholar at IRCS and was supported by a grant from NSF (NSF SBR 8920230). We would like to thank Miriam Eckert for the many valuable discussions of this work.

References 
Aone, C. & S. W. Bennett (1994). Discourse tagging tool and discourse-tagged multilingual corpora. In Proceedings of the International Workshop on Sharable Natural Language Resources (SNLR).

Appelt, D. E. (1981). Planning Natural Language Utterances to Satisfy Multiple Goals, (Ph.D. thesis). Stanford University. Also appeared as: SRI International Technical Note 259, March 1982.

Appelt, D. E. (1985). Planning English referring expressions. Artificial Intelligence, 26(1):1-33.

Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254.

Condon, S. & C. Cech (1995). Problems for reliable discourse coding systems. In Working Notes for AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pp. 27-33. Stanford University.

Connolly, D., J. D. Burger & D. S. Day (1997). A machine learning approach to anaphoric reference. In D. Jones & H. Somers (Eds.), New Methods in Language Processing, pp. 133-143. Oxford University Press.

Dale, R. (1992). Generating Referring Expressions: Constructing Descriptions in a Domain of Objects and Processes. Cambridge, Mass.: MIT Press.

Dale, R. & E. Reiter (1995). Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19:233-263.

Day, D. S., J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson & M. Vilain (1997). Mixed-initiative development of language processing systems. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 348-355, Washington, D.C.

Eckert, M. & M. Strube (1999). Resolving discourse deictic anaphora in dialogues. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway, 8-12 June 1999. To appear.

Ge, N., J. Hale & E. Charniak (1998). A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora.

Grosz, B. J. & C. L. Sidner (1986). Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204.

Marcus, M., G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz & B. Schasberger (1994). The Penn Treebank: Annotating predicate argument structure. In Proceedings of the ARPA Speech and Natural Language Workshop.

McCoy, K. F. & M. Strube (1999a). Generating anaphoric expressions: Pronoun or definite description? In ACL '99 Workshop on the Relationship between Discourse/Dialogue Structure and Reference, University of Maryland, Maryland, 21 June 1999. This volume.

McCoy, K. F. & M. Strube (1999b). Taking time to structure discourse: Pronoun generation beyond accessibility. In Proceedings of the 21st Annual Conference of the Cognitive Science Society, Vancouver, British Columbia, Canada, 19-21 August 1999. To appear.

McDonald, D. D. (1980). Natural Language Production as a Process of Decision Making Under Constraint, (Ph.D. thesis). MIT.

McKeown, K. R. (1983). Focus constraints on language generation. In Proceedings of the 8th International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, August 1983, pp. 582-587.

McKeown, K. R. (1985). Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge, U.K.: Cambridge University Press.

Passonneau, R. (1996a). Instructions for applying discourse reference annotation for multiple applications (DRAMA). Columbia University, New York, Dept. of Computer Science.

Passonneau, R. (1996b). Using centering to relax Gricean constraints on discourse anaphoric noun phrases. Language and Speech, 39(2):229-264.

Passonneau, R. & D. Litman (1997). Discourse segmentation by human and automated means. Computational Linguistics, 23(1):103-139.

Poesio, M. & R. Vieira (1998). A corpus-based investigation of definite description use. Computational Linguistics, 24(2):183-216.

Reichman, R. (1985). Getting Computers to Talk like You and Me. Cambridge, Mass.: MIT Press.

Reiter, E. (1990). Generating descriptions that exploit a user's domain knowledge. In R. Dale, C. Mellish & M. Zock (Eds.), Current Research in Natural Language Generation. London: Academic Press.

Sidner, C. L. (1979). Towards a Computational Theory of Definite Anaphora Comprehension in English. Technical Report AI-Memo 537, Cambridge, Mass.: Massachusetts Institute of Technology, AI Lab.

Strube, M. (1998). Never look back: An alternative to centering. In Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Québec, Canada, 10-14 August 1998, Vol. 2, pp. 1251-1257.

Vilain, M., J. Burger, J. Aberdeen, D. Connolly & L. Hirschman (1995). A model-theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference (MUC-6), pp. 45-52. San Mateo, Cal.: Morgan Kaufmann.
