Discourse Structure and Co-Reference: An Empirical Study 
Dan Cristea 
Department of Computer Science 
University "A.I. Cuza" 
Ia4i, Romania 
dcristea @ info iasi. r@ 
Nancy Ide 
Department of Computer Science 
Vassar College 
Poughkeepsie, NY, USA 
ideOcs.vassar.edu 
Daniel Marcu 
Information Sciences Institute and 
Department of Computer Science 
University of Southern California 
Los Angeles, CA, USA 
marcu @ isi. edu 
Valentin Tablan 
Department of Computer Science 
University "A.I. Cuza'" 
la~i, Rom~Lnia 
valyt@ infoiasi.ro 
Abstract 
We compare the potential of two classes of finear 
and hierarchical models of discourse to determine 
co-reference links and resolve anaphors. The com- 
parison uses a corpus of thirty texts, which were 
manually annotated for co-reference'and discourse 
structure. 
1 Introduction 
Most current anaphora resolution systems im- 
plement a pipeline architecture with three mod- 
ules CLappin and Leass," 1994; Mitkov, 1997; 
Kameyama, 1997). 
1. A COLLECT module determines a list of poten- 
tial antecedents (LPA) for each anaphor (pro- 
noun, definite noun, proper name, etc.) that 
have the potential to resolve it, 
2. A FILTER module eliminates referees incom- 
patible with the anaphor f~m the LPA. 
3. A PREFEI~NCE module detennm" es the most 
likely antecedent on the basis of an Ordering 
policy. 
In most cases,, the COLLECT module determines 
an LPA by enumerating all antecedents in a win- 
dow of text that pLeced__es the anaphor under 
scrutiny (Hobbs, 1978; Lappin and Leass, 1994; 
Mitkov, 1997; Kameyama, 1997; Ge et al., 1998). 
This window can be as small as two or three sen- 
tences or as large as the entire preceding text. 
The FILTER module usually imposes semantic con- 
straints by requiring that the anaphor and poten- 
tial antecedents have the same number and gender, 
that selectional restrictions are obeyed, etc. The 
PREFERENCE module imposes preferences on po- 
tential antecedents on the basis of their grammati- 
cal roles, parallelism, frequency, proximity, etc. In 
some cases, anaphora resolution systems implement 
these modules explicitly (I-Iobbs, 1978; Lappin and 
Leass, 1994; Mitkov, 1997; Kameyama, 1997). In 
other cases, these modules are integrated by means 
of statistical (Ge et al., 1998) or uncertainty reason- 
ing techniques (Mitkov, 1997). 
The fact that current anaphora resolution systems 
rely exclusively on the linear nature of texts in Or- 
der to determine the LPA of an anaphor seems odd, 
given that several studies have claimed that there 
is a strong relation between discourse structure and 
reference (Sidner, 1981; Gmsz and Sidner, 1986; 
Grosz et aL, 1995; Fox, 1987; Vonk et al., 1992; 
Azzam et al., 1998; Hitzeman and P .oesio, 1998). 
These studies claim, on the one hand, that the use of 
referents in naturally occurring texts imposes con- 
stmints on the interpretation of discourse; and, on 
the other, that the structure of discourse constrains 
the HAs to which anaphors can be resolved. The 
oddness of the situation can be explained by the fact 
that both groups seem primafacie to be righL Em- 
pkical experiments studies that employ linear tech- 
niques for determining the LPAs of anaphom report 
recall and precision anaphora resolution results in 
the range of 80% ~in and I.eass, 1994; Ge et al., 
1998). Empirical experiments that investigated the 
relation between discourse structure and reference 
also claim that by exploiting the structure of dis- 
course one has the potential of determining correct 
co-referential links for more than 80% of the refer- 
ential expressions (Fox, 1987; Cristea et al., 1998) 
although to date, no discourse-based anaphora res- 
olution system has been implemented. Since no di- 
46 
0 
0 
0 
0 
0 Q 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
B 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
rect comparison of these two classes of approaches 
has been made, • it is difficult to determine which 
group is right, and what method is the best. 
In this paper, we attempt to fill this gap by em- 
pirically comparing the potential of linear- and hi- 
erarchical models of discourse to correctly establish 
co-referential links in texts, and hence, their poten- 
tiai to correctly resolve anaphors. Since it is likely 
that both linear- and discourse-based anaphora res- 
olution systems can implement similar FILTER and 
PREFERENCE strategies, we focus here only on the 
strategies that can be used to COLLECT lists of po- 
tential antecedents. Specifically, we focus on de- 
termining whether discourse theories can help an 
anaphora resolution system determine LPAs that are 
"better" than the LPAs that can be computed from 
a linear interpretation of texts. Section 2 outlines 
the theoretical as.~umptions of our empirical inves- 
tigation. Section 3 describes our experiment. We 
conclude with a discussion of the results. 
2 Background 
2.1 Assumptions 
Our approach is based on the following assump- 
tions: 
1. For each anaphor in a text, an anaphora reso- 
lution s~,stem must produce an LPA that con- 
rains a referent to which the anaphor can be 
resolved. The size of this LPA varies from sys- 
tem to system, depending on the theory a sys- 
tem implements. 
2. The smaller the LPA (while retaining a correct 
antecedent), the less likely that errors in the 
FILTER and PREFERENCE modules will affect 
the ability of a system to select the appropriate 
referent. 
. Theory A is better than theory B for the task 
of reference resolution if theory A produces 
LPAs that contain more antecedents to which 
anaphors can be correctly resolved than theory 
B, and if the LPAs produced by theory A are 
smaller than those produced by theory B. For 
example, if for a given anapbor, theory A pro- 
duces an LPA that contains a referee to which 
the anaphor can be resolved, while theory B 
produces an LPA that does not contain such a 
referee, theory A is better than theory B. More- 
over, if for a given anaphor, theory A produces 
an LPA with two referees and theory B pro- 
duces an LPA with seven referees (each LPA 
containing a referee to which the. anaphor can 
be resolved), theory A is considered better than 
theory B because it has a higher probability of 
solving that anaphor correctly. 
We •consider two Classes of models for determining 
the LPAs of anaphors in a text: 
Linearok models. This is a class of linear models 
in which the LPAs include all the references found 
in the discourse unit under scrutiny and the k dis- 
course:ufiits that immediately precede it. Linear-O 
models an approach that assumes that all anaphors 
can be resolved intra-uuit; Linear-I models an ap- 
preach that corresponds roughly to centering (Grosz 
et aL, 1995). Linear-k is consistent with the assump- 
tions that underlie most current anaphora resolution 
systems, which look back k units in order to resolve 
an anaphor. 
Discourse-VT-k models. In this class of models, 
LPAs include all the referential expressions found in 
the discourse unit under scrutiny and the k discourse 
units that hierarchically precede it. The units that hi- 
erarchically precede a given unit are determined ac- 
cording to Veins Theory (VT) (Cristea et al., 1998), 
which is described briefly below. 
2.2 Veins Theory 
VT extends and formalizes the relation between 
discourse structure and reference proposed by 
Fox (1987). It identifies "veins", i.e., chains of el- 
ementary discourse units, over discourse stmctme 
trees that are built according to the requirements put 
forth in Rhetorical Sa-acture Theory (RST) (Mann 
and Thompson, 198g). 
One of the conjectures of VT is that the vein ex- 
pression of an elementary discourse unit provides a 
coherent "abstract" of the discourse fragment that 
contains that unit. As an internally coherent dis- 
course fragment, all anaphors and referential ex- 
pressions (REs) in a unit must be resolved to ref- 
erees that occur in the text subsumed by the units 
in the vein. This conjecture is consistent with Fox's 
view (1987) that the units that contain referees to 
which anaphors can be resolved are determined by 
the nuclearity of the discourse units that precede the 
anaphors and the.overall structure of discourse. Ac- 
cording to VT, REs of both satellites and nuclei can 
access referees of immediately preceding nucleus 
nodes. REs of nuclei can only access referees of 
preceding nuclei nodes and of directly subordinated 
satellite nodes. And the interposition of a nucleus 
47 
after a satellite blocks the accessibility of the satel- 
lite for all nodes that are lower in the corresponding 
discourse structure (see (Cristea et el., 1998) for a 
full definition). 
Hence, the fundamental intuition underlying VT 
is that the RST-specific distinction between nuclei 
and satellites constrains the range of referents to 
which anaphors can be resolved; in other words, 
the nucleus-satellite distinction induces for each 
anaphor (and each referential expression) a Do- 
main of Referential Accessibility (DRA). For each 
anaphor a in a discourse unit u, VT hypothesizes 
that a can be resolved by examining referential ex- 
pressions that were used in a subset of the discourse 
units that precede u; this subset is called the DRA 
of u. For any elementary unit u in a text. the corre- 
sponding DRA is computed automatically from the 
rhetorical representation of that text in two steps: 
. Heads for each node are computed bottom-up 
over the rhetorical representation tree. Heads 
• of elementary discoune units are the units 
themselves. Heads of internal nodes, i.e., dis- 
course spans, are computed by taking the union 
• of the heads of the immediate child nodes that 
are nuclei. For example, for the text in Fig- 
ure 1, whose rhetorical structure is shown in 
Figure 2, the head of span \[5,7\] is unit 5 be- 
cause the head of the immediate nucleus, the 
elementary unit 5, is 5. However, the head of 
span \[6,7\] is the list (6,7) because both imme- 
diate children are nuclei of a multlnuclesr rela- 
tion. 
. Using the results of step 1, Vein expressions 
are computed top.down for each node in the 
tree. The vein of the root is its head. Veins 
of child nodes are computed recursively ac- 
cording to the rules described by Cristea et 
al.(1998). The DRA of a unit u is given by the 
units in the vein that precede u. 
For example, for the text and RST tree in Fig- 
ures 1 and 2, the vein expression of unit 3, 
• which contains units 1 and 3, suggests that 
anaphors from unit 3 should be resolved only 
to referential expressions in units I and 3. Be- 
cause unit 2 is a satellite to unit 1, it is consid- 
ered to be "blocked" to referential links from 
unit 3. In contrast, the DRA of unit 9, consist- 
ing of units I, 8, and 9, reflects the intuition 
that anaphon from unit 9 can be resolved only 
to referential expressions from unit 1, which 
1. l.~ch--/D. ,.,.\]. cop ~oh.o.~oh~o. 
manager, moved co ~, 
a small b~o~chnology concern hero, 
2. to becgme~t~ presLdent a'ncl chief 
ol~srlt~ng officer. I 
3. J Pit. c~.asey, 46 years oXcl,\] was~ presAdent; of 
J~'I HCNoxL Phantaceutxcal subsJ,ldJ.ary, | 
4. which ,,qm merged w~th another ~r~r urtlg, 
OrCho pharnacsut:£ca/ Corp., chLa year in 
• cosC-cut.t.Jng move. 
S. Hr. Case~ succeeds N.~rrett, SO," 
6. Mr. l)a~\]c'et, lr, z'em,,;x~ chief execut4ve off.icer 
7. and becomes chA~rnan, me 'I • 
9. h.ln.e, th, nov. co 
10. "becsuse\['h~\]saw hee11:h care mov4ng Coward 
tec~hnologiea J.$.ke ~gene therapy 
products. 
11. J'X'~be.l.Aeve the. ~'.he fj.eld is energing and As 
profited I;o brsO. loose, 
~.. \['~ mtid. 
Figure h An example of text and its elementary 
units. The referential expressions surroundedby 
boxes and ellipses correspond to two distinct co- 
referential equivalence classes. Referential expres- 
sions surrounded by boxes refer to Mr. Casey; 
those surrounded by ellipses refer to Genetic Ther- 
apy Inc.. 
O 
e 
e 
o 
e @ 
o 
o @ 
e 
O 
is the most important unit in span \[1,7\], and 
to unit 8, a satellite that immediately precedes 
unit 9. Figure 2 shows the heads and veins of 
all internal nodes in the rhetorical representa- 
tion. 
2.3 Comparing models 
The premise underlying our experiment is that there 
are potentially significant differences in the size of 
the search space required to resolve referential ex- 
pressions when using Linear models vs. Discourse- 
VT models. For example, for text and the RST 
tree in Figures I and 2, the D/scourse-VT model 
narrows the search space required to resolve the 
anaphor the smaller company in unit 9. Accord- 
ing to VT, we look for potential antecedents for the 
smaller Company in the DRA of unit 9, which lists 
48 
O 
O @ 
O 
O 
O 
O 
O 
O H=I9 * 
O V=lg* 
H'I " H'9 O 
V=I9* ~ ~"~. V'19* 
V=Ig* * V-19\[* 
O v-19. 3~9- _w v-~679, v-19. I~ ~I.~ 
-- V=l(g~9* 
1 2 3 4 8 10 
1910. 
6 7 
..~.. H = II 
\[H= 3 \[ 9 V- 1 9 10 ll" Iv'13591 _~ 
Iv~- 131 \[H. 9 ~ \[ 10 \[v-i~9 I 
IDRA. XS ~ 11 n 
Figure 2: The RST analysis of the text in figure 1. The tree is represented using the conventions proposed 
by Mann and Thompson (1988). 
@ 
@ 
O 
0 
O 
O @ 
@ 
@ 
0 
O @ 
@ 
O 
0 
O @ 
0 
0 @ 
@ 
O @ 
units 1, 8, and 9. The antecedent Genetic Ther- 
ap3 Inc. appears in unit 1; therefore, using VT we 
search back 2 units (units 8 and 1) to find a correct 
antecedent. In contrast, to resolve the same refer- 
ence using a finear model, four units (units 8, 7. 
6, and 5) must be examined before Gene6c Ther- 
apy is found. Assuming that referential links are es- 
tablished as the text is processed, Gene~c Therapy 
would be linked back to pronoun its in unit 2, which 
would in mm be linked to the first occurrence of the 
antecedent,Genetic Therapy. Inc., in unit 1, the an- 
tecedent determined directly by using Wl'. 
In general, when hierarchical adjacency is con- 
sidere& an anaphor may be resolved to a referent 
that is not the closest in a linear interpretation of 
a text~ Similarly, a referential expression can be 
linked to a referee flint is not the closest in a lin- 
ear interpretation of a text. However, this does not 
create problems because we are focusing here only 
on co-referential relations of identity (see section 
3). Since these relations induce equivalence classes 
over the set of referential expressions in a text, it 
is sufficient that an anaphor or referential expres- 
sion is resolved to any of the members of the rule- 
v-ant equivalence class, For example, according to 
VT, the referential expression Mr. Casey in unit 5 
in Figure I can be linked directly only to the ref- 
eree Mr Casey in unit !. because the DRA of unit 5 
is { 1,5}. By considen'ng the co-referential links of 
the REs in the other units, the full equivalence class 
can be determined. This is consistent with the dis- 
tinction between "direct" and "indirect" references 
discussed by Cristea, et ai.(1998). 
3 The Experiment 
3.1 MaterhJs 
We used thirty newspaper texts whose lengths var- 
ied widely; the mean o is 408 words and the stan- 
dard deviation/~ is 376. The texts were anno- 
tated manually for co-reference relations of iden- 
tity (ITh'schman and Chinchor, 1997). The co- 
reference relations define equivalence classes on the 
set of all marked referents in a text. The texts were 
also manually annotated with discourse structures 
built in the style of Mann and Thompson (1988). 
Each analysis yielded an average of 52 elementary 
discourse units. Details of the discourse annotation 
process are given in (Marcu et al., 1999). 
3-~ Comparing potential to establish 
co-referential links 
3~,.1 Method 
The annotations for co-reference relations and 
rhetorical structure trees for the thirty texts were 
fused, yielding representations tha t ~flect not only 
the discourse structure, but also the c~reference 
49 
equivalence classes specific to each text. Based on 
this information, we evaluated the potential of each 
of the two classes of models discussed in section 
2 (Linear-k and Discourse-VT-k) to correctly estab- 
• lish co-referential links as follows: For each model, 
each k, and each marked referential expression a, 
we determined whether or not the corresponding 
LPA (defined over k elementary units) contained a 
referee from the same equivalence class. For exam- 
ple, for the Linear-2 model and referential expres- 
sion the smaller company in unit 9, we estimated 
whether a co-referential link could be established 
between the smaller company and another referen- 
tial expression in units 7, 8, or 9. For the Discourse- 
VT-2 model and the same referential expression, we 
estimated whether a co-referential link could be es- 
tablished between the smaller company and another 
referential expression in units 1, 8, or 9. which cor- 
respond to the DRA of unit 9. 
To enable a fair comparison of the two models, 
• when k is larger than the size of the DRA of a given 
unit, we extend that DRA using the closest units that 
precede the unit under scrutiny and are not already 
in the DRA. Hence, for the Linear-3 model and the 
referential expression the smaller company in unit 9, 
we estimate whether a co-referential link can be es- 
tablished between the smaller company and another 
referential expression in units 6, 7, 8, or 9. For the 
Discourse-VT-3 model and the same referential ex- 
pression, we estimate whether a co-referential link 
can be established between the smaller company 
and another referential expression in units 1, 8, 9, 
or 7, which c:orrespond to the DRA of unit 9 (units 
1, 8, and 9) and to unit 7, the closest unit preceding 
unit 9 that is not in its DRA. 
For the Discourse-VT-k models, we assume that 
the Extended DRA (EDRA) of size k of a unit 
u (EDRAk(u)) is given by the first I < k units of 
a sequence that lists, in reverse order, the units of 
the DRA of u plus the k - I units that precede u but 
are not in its DRA. For example, for the text in Fig- 
me 1, the following relations hold: F_~RAo(9) = 
9; F, DP, A~(9) = 9,8; F_,DRAa(9) = 9,8,1; 
EDP~(9) = 9, 8,1, 7; EDRA4(9) - 9, 8,1, 7, 6. 
For Linear-k models, the EDRAt(u) is given by u 
and the k units that immediately precede u. 
The potential p( M, a, EDRAt) of a model M to 
determine correct co-referential links with respect 
to a referential expression a in unit u, given a corre- 
sponding EDRA of size k (EDRAt(u)), is assigned 
the value 1 if the EDRA contains a co-referent 
from the same equivalence class as a. Otherwise, 
p(M, a, EDRAk) is assigned the value 0. The poten- 
tial p(M, C, k) of a model M to determine correct 
co-referential links for all referential expressions in 
a corpus of texts C, using EDRAs of size k, is com- 
puted as the sum of the potentials p(M, a, EDRAk) 
of all referential expressions a in C. This potential 
is normalized to a value between 0 and I by dividing 
p(M, 6", k) by the number of referential expressions 
in the corpus that have an antecedent. 
By examining the potential of each model to cor- 
rectiy determine co-referential expressions for each 
k, it is possible to determine the degree to which 
an implementation of a given approach can con- 
tribute to the overall efficiency of anaphora resolu- 
tion systems. That is, if a given model has the po- 
tential to correctly determine a significant percent- 
age of co-referential expressions with small DR/is, 
an anaphora resolution system implementing that 
model will have to consider fewer options overall. 
Hence, the probabifity of error is reduced. 
3.2.2 Results 
The graph in Figure 3 shows the potentials of the 
Linear-k and Discourse-VT-k models to correctly 
determine co-referential links for each k from 1 to 
20. The graph in Figure 4 represents the same po- 
tentials but focuses only on ks in the interval \[2,9\]. 
As theze two graphs show, the potentials increase 
monotonically with k, the VT-k models always do- 
ing better than the Linear-k models. Eventually, for 
large ks, the potential performance of the two mod- 
els converges to 100~. 
The graphs in Figures 3 and 4 also suggest reso- 
lution strategies for implemented systems. For ex- 
ample, the graphs suggests that by choosing to work 
with EDRAs of size 7, a discourse-based system has 
the potential of resolving more thun 90~ of the co- 
referential links in a text correctly. To achieve the 
same potential, a linear-based system needs to look 
back 8 units. If a system does not look back at 
all and attempts to resolve co-referential links only 
within the unit under scrutiny (k -- 0), it has the 
potential to correctly resolve about 40~ of the co- 
referential links. 
To provide a clearer idea of how the two models 
differ, Figure 5 shows, for each k, the value of the 
Discourse-VT-k potentials divided by the value of 
the Linear-k potentials. For k = 0, the potentials of 
both models are equal because both use only the unit 
in focus in order to determine cwreferential links. 
For k = 1, the Discourse-VT-I model is about 7% 
50 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
O 
0 
0 
0 
0 
0 
0 
0 
O 
0 Q 
0 
O 
O 
O 
O 
O q) 
O 
O O 
O 0 
e 
O 
O O 
O, 
9OOO5, 
8O0O% 
70 U)~ 
6000~ 
~000q, 
40.00% o 
It ItlA oJte 
- VT..k ....... Lima-It 
Figure 3: The potential of Linear-k and Discourse- 
VT-k models to determine correct co-referential 
links (0 < ~ < 20). 
O 
O 
O 
O 
O 
0 
O 
0 
O 
O 
O 
O 
O 
lUO~t 
7~ 
0 
l \] 
B 
Ella oSno 
Figure 4: The potential of Linear-k and Discourse- 
VT-k models to determine correct co-referentlal 
.nks (2 _ k < 9). 
better than the IAnear-I model. As the value of k 
increases, the value Discourse-VT-k/Linear-k con- 
verges to I. 
In Figures 6 and 7, we display the number of 
exceptions, i.e., co-referential links that Discourse- 
VT-k and Linear-k models cannot determine cor- 
rectly. As one can see, . over the whole corpus, for 
each k _< 3, the Discourse-VT-k models have the 
potential to determine correctly about tO0 mote co- 
referential links than the Linear-k models. AS k 
increases, the performance of the two models con- 
verges. 
n. o~' 
1. ag 
i.o! 
~t.99 
O. g| 
o, ~7 
°'tt o L ,,, ~ ~, ..~ e ,- 
...... i I 
J 
..z.. VTJO,.m Jt 
Hgure 5: A direct comparison of Discourse-VT-k 
and Linear-V'r-k potentials to correctly determine 
co-referential links (0 _< k _< 20). 
.... I ! 
i i 
Ella .the 
• , wY~tmj. - - -8- • • kt~BmO. 
Figure 6: The number of co-referential link.~ that 
cannot be correctly determined by Discourse-VT-k 
and Linear-k models (0 _.< k _< 20). 
3.2.3 Statistical Sig-ifieRnce 
In order to assess the statistical significance of the 
difference between the potentials of the two models 
to establish correct co-referential links, we carried 
out a Paired-Samples T Test for each k. In general, a 
Paired-Samples T Test checks whether the mean of 
casewise differences between two variables differs 
from 0. For each text in the corpus and each k, we 
determined the potentials of both VT-k and Line.ar- 
k models to establish correct co-referential links in 
that text. For ks smaller than 4, the difference in 
potentials was statistically significant. For example, 
for k -- 3, t -- 3.345, df - 29, P = 0.002. For 
values of k larger than or equal to 4, the difference 
was no longer significant. These results are consis- 
tent with the graphs shown in Figure 3 to 7, which 
all show that the potentials of Discourse-VT-k and 
Linear-k models converges to the same value as the 
value of k increases. 
51 
100 ¸ 
j: 
j~ j= 
I00 
0 
.., 
. . . , J , | . 
} • 5 S f g | I0 
I ~"'re'~.. ..... ,,,.a.,.~ 1 ' "" """ 
Figure 7: The number of co-referential links that 
cannot be correctly determined by Discourse-VT-k 
and Linear-k models (1 _< k < I0). 
3.3 Comparing the effort required to establish 
co-referential links 
3.3.1 Method 
The method described in section 3.2.1 estimates the 
potential of Linear-k and Discourse-VT-k models 
to determine correct co-referential links by treating 
EDRAs as sets. However, from a computational per- 
spective (and presumably, from a psycholinguistic 
perspective as well) it also makes sense to compare 
the effort required by the two classes of models to 
establish correct co-referential links. We estimate 
this effort using a very simple metric that assumes 
that the closer an an ~teo~__ent is to a correspond- 
ing referential expression in the EDRA, the better. 
Hence, in estimating the effort to estabfish a co- 
referential link, we treat EDRAs as ordered lists. For 
example, using the Linesr-9 model, to determine the 
correct antecedent of the referential expression the 
smaller company in unit 9 of Hgure 1, it is neces- 
sary to search back through 4 units (to unit 5, which 
contains the refezent Genet/c Therapy). Had unit 5 
been Mr. Cosset succeeds M. James Barrett, .50, we 
would have had to go back 8 units (to unit 1 ) in order 
to correctly resolve the RE the smaller company. In 
contrast, in the Discourse-VT-9 model, we go back 
only 2 units because unit 1 is two units away from 
• unit 9 (EDRAg(9) = 9,8,1,7,8,5,4,3,2). 
We consider that the effort e(M, a, EDRAt) of a 
model M to determine correct c0-referential links 
with respect to one referential a in unit u, given a 
correspondingEDRA of size k (EDRAt(u)) is given 
by the number of units between u and the first unit in 
EDRAt(u) that contains a co-referential expression 
ofa. 
The effort e(M, C, k) of a model M to deter- 
gNO " . o~.....~ i 
~., , 
21~. ' 1 g ! 
II)l& a,ze 
VT g'~us -. ...... Un g'~sn 
Hgure 8: The effort required by Linear-k and 
Discourse-VT-k models to determine correct co- 
referential links (0 < k < 100). 
mine correct co-referential links for all referent/al 
expressions in a corpus of .tex~ C using EDRAs 
of size k was computed as the sum of the efforts 
e(M,a, EDRAk) of all referential expressions a in 
C. 
3.3.2 Results 
Figure 8 shows the Discourse-VT-k and Linear-k ef- 
forts computed over all referential expressions in the 
corpus and all ks. It is possible, for a given referent 
a and a given k, that no co-referential link exists in 
the units of the corresponding EDRAt. In this case. 
we consider that the effort is equal to k. As a conse- 
quence, for small ks the effort required to establish 
co-referential linksis similar for both theories, be- 
cause both can establish only a limited number of 
links. However, as k increases, the effort computed 
over the entire corpus diverges dramatically: using 
the Discourse-VT model, the search space for co- 
referential links is reduced by about 800 units for a 
corpus containing roughly 1200 referential expres- 
sions. 
3.3.3 Statistical signiflcanee 
A Paired-Samples T Test was performed foreach k. 
For each text in the corpus and each k, we deter- 
mined the effort of both VT-k and Linear-k models 
to establish correct co-referential links in that text. 
For all ks the difference in effort was statistically 
significant. For example, for k = 7, we obtained 
the values t = 3.51, df = 29, P = 0.001. These re- 
sults are intuitive: because EDRAs are treated as or- 
dered lists and not as sets, the effect of the discourse 
structure on establishing correct co-referential links 
is not diminished as k increases. 
52 
0 @ 
@ 
@ 
0 @ 
0 @ 
@ 
B @ 
0 
0 
0 
0 @ 
@ 
B 
0 
0 @ 
0 
0 @ 
@ 
0 
0 @ 
@ 
@ 
0 @ 
e @ 
0 
e 
@ 
O 
O 
O 
O 
O @ 
0. 
0 
0 @ 
@ 
@ 
@ 
@ 
@ 
@ 
O @ 
@ 
@ 
O 
O 
O @ 
@ 
@ 
@ 
O 
O 
0 
O 
O @ 
O @ 
@ 
@ 
@ 
@ 
@ 
@ 
0 
4 Conclusion 
We an~,lyzed empirically the potentials of discourse 
and linear models of text to determine co-referential 
links. Our analysis suggests that by exploiting 
the hierarchical structure of texts, one can increase 
the potential of natural language systems to cor- 
rectly determine co-referential links, which is a re- 
quirement for correctly resolving anaphors. If one 
treats all discourse units in the preceding discourse 
equally, the increase is statistically significant only 
when a discourse-based corefererice system looks 
back at most four discourse units in order to estab- 
lish co-referenfial links. However, if one assumes 
that proximity plays an important role in establish- 
ing co-referential links and that referential expres- 
sions are more likely to be linked to referees that 
were used recently in discourse, the increase is sta- 
tistically significant no matter how many units a 
discourse-based co-reference system looks back in 
order toestablish co-referenfial links. 
Acknowledgements. We ate grateful to Lynette 
Hirschman and Nancy Chinchor for making avail- 
able their corpus of co-reference annotations. We 
are also grateful to Graeme Hirst for comments and 
feedback on a previous draft of this paper. 

References 
Saliha Azzam, Kevin Humphreys, and Robert 
Gaizauskas. 1998. Evaluating a focus-based ap- 
proach to anaphora resolution. In Proceedings of 
the 36th Annual Meeting of the Association for 
Computational Linguistics and of the 17th Inter- 
national Conference on Computational Linguis- 
tics (COLING/ACL'98), pages 74-78, Montreal, 
Canada, August 10--14. 
Dan Criste~ Nancy Ide, and Lanrent Romary. 1998. 
Veins theory: A model of global discourse co- 
hesion and cohexence. In Proceedings of the 
36th Annual Mee~g of the Association for Com- 
putational Linguistics and of the 17th Interna- 
tional Conference on Computational Linguistics 
(COLING/ACL'98), pages 281-285, :Montreal, 
Canada, August. 
Barbara Fox. 1987. Discourse Structure and 
Anaphora. Cambridge Studies in Linguistics; 48. 
Cambridge University Press. 
Niyu Ge" John Hale, and Eugene Chamiak. 1998. 
A statistical approach to anaphora resolution. In 
Proceedings of the Sixth Workshop on Very Large 
Corpora, pages 161-170, Montreal, Canada, Au- 
gust 15-16. 
Barbara J. Grosz and Candace L. Sidner. 1986. At- 
tention, intentions, and the structure of discourse. 
ComputationalLinguistics, 12(3): 175-204, July- 
September. 
Barbara J. Grosz, Aravind K. Joshi, and Scott We- 
instein. 1995. Centering: A framework for mod- 
eling the local coherence of discourse. Computa- 
tional Linguistics, 21 (2):203-226, June. 
Lynette Hirschman and Nancy Chinchor, 1997. 
M U C- 7 Coreference TaskDefinition* July 13; 
Janet Hitzeman and Massimo Poesio. 1998. Long 
distance pronominalizafion and global focus. In 
Proceedings of the 36th Annual Meeting of the 
Association for Computational Linguistics and 
of the 17th International Conference "on Com- 
putational Linguistics (COLING/ACL'98), pages 
550-556, Montreal, Canada, August. 
Jerry H. Hobbs. 1978. Resolving pronoun refer-; 
ences. Lingua, 44:311-338. 
Megumi Kameyama. 1997. Recogni'zing referen- 
fial links: An information extraction perspec- 
five. In Proceedings of the ACL/F~CL'97 Work- 
shop on Operational Factors in Practical, Robust 
Anaphora Resolution, pages 46--53. 
Shalom Lappin and Herbert J. Leass. 1994. An 
algorithm for pronominal anaphora resolution. 
Computational Linguistics, 20(4): 535- 561. 
William C. Mann and Sandra A. Thompson. 1988. 
Rhetorical structure theory: .Toward a functional 
theory of text organization. Text, 8(3):243-281. 
Daniel Marcu, Estibaliz Amorrortu, and Magdalena 
Romera. 1999~ Experiments in constmcfing a 
corpus of discourse trees. In Proceedings of the 
ACL'99 Workshop on Standards and Tools for 
Discourse Tagging, University of Maryland, June 
22. 
Ruslan Mitkov. 1997. Factors in anaphora reSo- 
lution: They am not the only things that mat- 
ter. a case study based on two different ap- 
proaches. In Proceedings of the ACL/F~CL'97 
Workshop on Operational Factors in Practical, 
Robust Anaphora Re.solution, pages 14--21. 
Candace L. Sidner. 1981. Focusing for interpre- 
tation of pronouns. Coronal Linguistics, 
7(4):217-231, October-December. 
Wietske Vonk, I.etfica G.M.M. Hustinx, and Whn 
H.G. Simons. 1992. The use of referential ex- 
pressions in structuring discourse, l.anguag e and 
Cognitive Processes, 7(3,4):301-333. 
