An Empirical Investigation of the Relation Between Discourse Structure and 
Co-Reference 
Dan Cristea 
Department of Computer Science 
University "A.I. Cuza" 
Iaşi, Romania
dcristea@infoiasi.ro
Daniel Marcu 
Information Sciences Institute and 
Department of Computer Science 
University of Southern California 
Los Angeles, CA, USA 
marcu@isi.edu
Nancy Ide 
Department of Computer Science 
Vassar College 
Poughkeepsie, NY, USA 
ide@cs.vassar.edu
Valentin Tablan* 
Department of Computer Science 
University of Sheffield 
United Kingdom 
v.tablan@sheffield.ac.uk
Abstract 
We compare the potential of two classes of linear and hierarchical
models of discourse to determine co-reference links and resolve
anaphors. The comparison uses a corpus of thirty texts, which were
manually annotated for co-reference and discourse structure.
1 Introduction 
Most current anaphora resolution systems implement a pipeline
architecture with three modules (Lappin and Leass, 1994; Mitkov, 1997;
Kameyama, 1997).
1. A COLLECT module determines a list of potential antecedents (LPA)
for each anaphor (pronoun, definite noun, proper name, etc.) that have
the potential to resolve it.
2. A FILTER module eliminates referees incompatible with the anaphor
from the LPA.
3. A PREFERENCE module determines the most likely antecedent on the
basis of an ordering policy.
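The three modules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any of the cited systems: the `Mention` fields, the agreement test used by `filter_candidates`, and the "most recent candidate wins" policy in `prefer` are hypothetical choices made for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    text: str
    sentence: int  # index of the sentence that contains the mention
    number: str    # "sg" or "pl" (hypothetical agreement feature)
    gender: str    # "m", "f" or "n" (hypothetical agreement feature)


def collect(anaphor_idx: int, mentions: List[Mention], window: int) -> List[Mention]:
    """COLLECT: earlier mentions at most `window` sentences before the anaphor."""
    anaphor = mentions[anaphor_idx]
    return [m for m in mentions[:anaphor_idx]
            if anaphor.sentence - m.sentence <= window]


def filter_candidates(anaphor: Mention, lpa: List[Mention]) -> List[Mention]:
    """FILTER: drop candidates that disagree in number or gender."""
    return [m for m in lpa
            if m.number == anaphor.number and m.gender == anaphor.gender]


def prefer(lpa: List[Mention]) -> Optional[Mention]:
    """PREFERENCE: assumed policy, pick the most recent surviving candidate."""
    return lpa[-1] if lpa else None


def resolve(anaphor_idx: int, mentions: List[Mention], window: int = 2) -> Optional[Mention]:
    """Run COLLECT, FILTER and PREFERENCE in sequence for one anaphor."""
    anaphor = mentions[anaphor_idx]
    lpa = collect(anaphor_idx, mentions, window)
    return prefer(filter_candidates(anaphor, lpa))


mentions = [
    Mention("Mr. Casey", 0, "sg", "m"),
    Mention("Genetic Therapy Inc.", 0, "sg", "n"),
    Mention("he", 1, "sg", "m"),
]
antecedent = resolve(2, mentions)
```

Only `collect` depends on how the text is traversed, which is why the comparison in this paper concentrates on that module.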
In most cases, the COLLECT module determines an LPA by enumerating all
antecedents in a window of text that precedes the anaphor under
scrutiny (Hobbs, 1978; Lappin and Leass, 1994; Mitkov, 1997; Kameyama,
1997; Ge et al., 1998). This window can be as small as two or three
sentences or as large as the entire preceding text. The FILTER module
usually imposes semantic constraints by requiring that the anaphor and
potential antecedents have the same number and gender, that selectional
restrictions are obeyed, etc. The PREFERENCE module imposes preferences
on potential antecedents on the basis of their grammatical roles,
parallelism, frequency, proximity, etc. In some cases, anaphora
resolution systems implement these modules explicitly (Hobbs, 1978;
Lappin and Leass, 1994; Mitkov, 1997; Kameyama, 1997). In other cases,
these modules are integrated by means of statistical (Ge et al., 1998)
or uncertainty reasoning techniques (Mitkov, 1997).
* On leave from the Faculty of Computer Science, University "Al. I.
Cuza" of Iaşi.
The fact that current anaphora resolution systems rely exclusively on
the linear nature of texts in order to determine the LPA of an anaphor
seems odd, given that several studies have claimed that there is a
strong relation between discourse structure and reference (Sidner,
1981; Grosz and Sidner, 1986; Grosz et al., 1995; Fox, 1987; Vonk et
al., 1992; Azzam et al., 1998; Hitzeman and Poesio, 1998). These
studies claim, on the one hand, that the use of referents in naturally
occurring texts imposes constraints on the interpretation of discourse;
and, on the other, that the structure of discourse constrains the LPAs
to which anaphors can be resolved. The oddness of the situation can be
explained by the fact that both groups seem prima facie to be right.
Empirical studies that employ linear techniques for determining the
LPAs of anaphors report recall and precision anaphora resolution
results in the range of 80% (Lappin and Leass, 1994; Ge et al., 1998).
Empirical experiments that investigated the relation between discourse
structure and reference also claim that by exploiting the structure of
discourse one has the potential of determining correct co-referential
links for more than 80% of the referential expressions (Fox, 1987;
Cristea et al., 1998), although to date no discourse-based anaphora
resolution system has been implemented. Since no direct comparison of
these two classes of approaches has been made, it is difficult to
determine which group is right, and what method is the best.
In this paper, we attempt to fill this gap by empirically comparing
the potential of linear and hierarchical models of discourse to
correctly establish co-referential links in texts, and hence, their
potential to correctly resolve anaphors. Since it is likely that both
linear- and discourse-based anaphora resolution systems can implement
similar FILTER and PREFERENCE strategies, we focus here only on the
strategies that can be used to COLLECT lists of potential antecedents.
Specifically, we focus on determining whether discourse theories can
help an anaphora resolution system determine LPAs that are "better"
than the LPAs that can be computed from a linear interpretation of
texts. Section 2 outlines the theoretical assumptions of our empirical
investigation. Section 3 describes our experiment. We conclude with a
discussion of the results.
2 Background 
2.1 Assumptions 
Our approach is based on the following assumptions:
1. For each anaphor in a text, an anaphora resolution system must
produce an LPA that contains a referent to which the anaphor can be
resolved. The size of this LPA varies from system to system, depending
on the theory a system implements.
2. The smaller the LPA (while retaining a correct antecedent), the less
likely that errors in the FILTER and PREFERENCE modules will affect the
ability of a system to select the appropriate referent.
3. Theory A is better than theory B for the task of reference
resolution if theory A produces LPAs that contain more antecedents to
which anaphors can be correctly resolved than theory B, and if the LPAs
produced by theory A are smaller than those produced by theory B. For
example, if for a given anaphor, theory A produces an LPA that contains
a referee to which the anaphor can be resolved, while theory B produces
an LPA that does not contain such a referee, theory A is better than
theory B. Moreover, if for a given anaphor, theory A produces an LPA
with two referees and theory B produces an LPA with seven referees
(each LPA containing a referee to which the anaphor can be resolved),
theory A is considered better than theory B because it has a higher
probability of solving that anaphor correctly.
We consider two classes of models for determining the LPAs of anaphors
in a text:
Linear-k models. This is a class of linear models in which the LPAs
include all the references found in the discourse unit under scrutiny
and the k discourse units that immediately precede it. Linear-0 models
an approach that assumes that all anaphors can be resolved intra-unit;
Linear-1 models an approach that corresponds roughly to centering
(Grosz et al., 1995). Linear-k is consistent with the assumptions that
underlie most current anaphora resolution systems, which look back k
units in order to resolve an anaphor.
Discourse-VT-k models. In this class of models, LPAs include all the
referential expressions found in the discourse unit under scrutiny and
the k discourse units that hierarchically precede it. The units that
hierarchically precede a given unit are determined according to Veins
Theory (VT) (Cristea et al., 1998), which is described briefly below.
2.2 Veins Theory 
VT extends and formalizes the relation between discourse structure and
reference proposed by Fox (1987). It identifies "veins", i.e., chains
of elementary discourse units, over discourse structure trees that are
built according to the requirements put forth in Rhetorical Structure
Theory (RST) (Mann and Thompson, 1988).
One of the conjectures of VT is that the vein expression of an
elementary discourse unit provides a coherent "abstract" of the
discourse fragment that contains that unit. Because this fragment is
internally coherent, most of the anaphors and referential expressions
(REs) in a unit must be resolved to referees that occur in the text
subsumed by the units in the vein. This conjecture is consistent with
Fox's view (1987) that the units that contain referees to which
anaphors can be resolved are determined by the nuclearity of the
discourse units that precede the anaphors and the overall structure of
discourse. According to VT, REs of both satellites and nuclei can
access referees of hierarchically preceding nucleus nodes. REs of
nuclei can mainly access referees of preceding nuclei nodes and of
directly subordinated, preceding satellite nodes. Finally, the
interposition of a nucleus after a satellite blocks the accessibility
of the satellite for all nodes that are lower in the corresponding
discourse structure (see (Cristea et al., 1998) for a full definition).
Hence, the fundamental intuition underlying VT is that the RST-specific
distinction between nuclei and satellites constrains the range of
referents to which anaphors can be resolved; in other words, the
nucleus-satellite distinction induces for each anaphor (and each
referential expression) a Domain of Referential Accessibility (DRA).
For each anaphor a in a discourse unit u, VT hypothesizes that a can be
resolved by examining referential expressions that were used in a
subset of the discourse units that precede it; this subset is called
the DRA of u. For any elementary unit u in a text, the corresponding
DRA is computed automatically from the rhetorical representation of
that text in two steps:
1. Heads for each node are computed bottom-up over the rhetorical
representation tree. Heads of elementary discourse units are the units
themselves. Heads of internal nodes, i.e., discourse spans, are
computed by taking the union of the heads of the immediate child nodes
that are nuclei. For example, for the text in Figure 1, whose
rhetorical structure is shown in Figure 2, the head of span [5,7] is
unit 5 because the head of the immediate nucleus, the elementary unit
5, is 5. However, the head of span [6,7] is the list (6,7) because both
immediate children are nuclei of a multinuclear relation.
2. Using the results of step 1, vein expressions are computed top-down
for each node in the tree. The vein of the root is its head. Veins of
child nodes are computed recursively according to the rules described
by Cristea et al. (1998). The DRA of a unit u is given by the units
that precede u in the vein.
For example, for the text and RST tree in Figures 1 and 2, the vein
expression of unit 3, which contains units 1 and 3, suggests that
anaphors from unit 3 should be resolved only to referential expressions
in units 1 and 3. Because unit 2 is a satellite to unit 1, it is
considered to be "blocked" to referential links from unit 3. In
contrast, the DRA of unit 9, consisting of units 1, 8, and 9, reflects
the intuition that anaphors from unit 9 can be resolved only to
referential expressions from unit 1, which is the most important unit
in span [1,7], and to unit 8, a satellite that immediately precedes
unit 9. Figure 2 shows the heads and veins of all internal nodes in the
rhetorical representation.

1. Michael D. Casey, a top Johnson & Johnson manager, moved to Genetic
   Therapy Inc., a small biotechnology concern here,
2. to become its president and chief operating officer.
3. Mr. Casey, 46 years old, was president of J&J's McNeil
   Pharmaceutical subsidiary,
4. which was merged with another J&J unit, Ortho Pharmaceutical Corp.,
   this year in a cost-cutting move.
5. Mr. Casey succeeds M. James Barrett, 50, as president of Genetic
   Therapy.
6. Mr. Barrett remains chief executive officer
7. and becomes chairman.
8. Mr. Casey said
9. he made the move to the smaller company
10. because he saw health care moving toward technologies like [...]
    products.
11. I believe that the field is emerging and is prepared to break
    loose,
12. he said.
Figure 1: An example of text and its elementary units. The referential
expressions surrounded by boxes and ellipses correspond to two distinct
co-referential equivalence classes. Referential expressions surrounded
by boxes refer to Mr. Casey; those surrounded by ellipses refer to
Genetic Therapy Inc.
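The bottom-up head computation of step 1 can be sketched as follows. The `Node` class and the two-level tree fragment are assumptions made for illustration: only the part of the tree that the text describes explicitly (span [5,7] with nucleus 5, and the multinuclear span [6,7]) is encoded, and the nuclearity flag given to span [5,7] itself is arbitrary because it does not affect that span's head. The vein computation of step 2 is not shown, since its rules are given in Cristea et al. (1998) rather than here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    nuclear: bool                 # True for a nucleus, False for a satellite
    unit: int = 0                 # unit number; meaningful for leaves only
    children: List["Node"] = field(default_factory=list)


def head(node: Node) -> Tuple[int, ...]:
    """Head of a node: the unit itself for a leaf, otherwise the union
    (in text order) of the heads of its immediate nuclear children."""
    if not node.children:
        return (node.unit,)
    units: List[int] = []
    for child in node.children:
        if child.nuclear:
            units.extend(head(child))
    return tuple(units)


# Span [6,7]: both children are nuclei of a multinuclear relation.
span_6_7 = Node(nuclear=False, children=[Node(True, 6), Node(True, 7)])
# Span [5,7]: unit 5 is the nucleus, span [6,7] its satellite.
span_5_7 = Node(nuclear=True, children=[Node(True, 5), span_6_7])

head_6_7 = head(span_6_7)   # (6, 7)
head_5_7 = head(span_5_7)   # (5,)
```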
2.3 Comparing models 
The premise underlying our experiment is that there are potentially
significant differences in the size of the search space required to
resolve referential expressions when using Linear models vs.
Discourse-VT models. For example, for the text and the RST tree in
Figures 1 and 2, the Discourse-VT model narrows the search space
required to resolve the anaphor the smaller company in unit 9.
According to VT, we look for potential antecedents for the smaller
company in the DRA of unit 9, which lists units 1, 8, and 9. The
antecedent Genetic Therapy Inc. appears in unit 1; therefore, using VT
we search back 2 units (units 8 and 1) to find a correct antecedent. In
contrast, to resolve the same reference using a linear model, four
units (units 8, 7, 6, and 5) must be examined before Genetic Therapy is
found. Assuming that referential links are established as the text is
processed, Genetic Therapy would be linked back to the pronoun its in
unit 2, which would in turn be linked to the first occurrence of the
antecedent, Genetic Therapy Inc., in unit 1, the antecedent determined
directly by using VT.
In general, when hierarchical adjacency is considered, an anaphor may
be resolved to a referent that is not the closest in a linear
interpretation of a text. Similarly, a referential expression can be
linked to a referee that is not the closest in a linear interpretation
of a text. However, this does not create problems because we are
focusing here only on co-referential relations of identity (see section
3). Since these relations induce equivalence classes over the set of
referential expressions in a text, it is sufficient that an anaphor or
referential expression is resolved to any of the members of the
relevant equivalence class. For example, according to VT, the
referential expression Mr. Casey in unit 5 in Figure 1 can be linked
directly only to the referee Mr. Casey in unit 1, because the DRA of
unit 5 is {1,5}. By considering the co-referential links of the REs in
the other units, the full equivalence class can be determined. This is
consistent with the distinction between "direct" and "indirect"
references discussed by Cristea et al. (1998).
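The point that any link into an equivalence class is enough to recover the whole class can be illustrated with a small union-find sketch. The function name `equivalence_classes` and the representation of a mention as a `(unit, text)` pair are assumptions for this example; the links listed are the ones mentioned in the discussion of Figure 1 above, not a full annotation of the text.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def equivalence_classes(links: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Group mentions into classes from pairwise identity links (union-find)."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes: Dict[Hashable, Set[Hashable]] = {}
    for mention in list(parent):
        classes.setdefault(find(mention), set()).add(mention)
    return list(classes.values())


links = [
    ((5, "Mr. Casey"), (1, "Michael D. Casey")),
    ((9, "the smaller company"), (1, "Genetic Therapy Inc.")),
    ((5, "Genetic Therapy"), (2, "its")),
    ((2, "its"), (1, "Genetic Therapy Inc.")),
]
classes = equivalence_classes(links)
```

Whether the smaller company is linked to unit 1 (as VT suggests) or to unit 5 (as a linear search would find first), the resulting class is the same.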
3 The Experiment 
3.1 Materials 
We used thirty newspaper texts whose lengths varied widely; the mean μ
is 408 words and the standard deviation σ is 376. The texts were
annotated manually for co-reference relations of identity (Hirschman
and Chinchor, 1997). The co-reference relations define equivalence
classes on the set of all marked referents in a text. The texts were
also manually annotated by Marcu et al. (1999) with discourse
structures built in the style of Mann and Thompson (1988). Each
discourse analysis yielded an average of 52 elementary discourse units.
See (Hirschman and Chinchor, 1997) and (Marcu et al., 1999) for details
of the annotation processes.
[Tree diagram: RST tree of the text in Figure 1, with nodes labeled by head (H), vein (V) and DRA]
Figure 2: The RST analysis of the text in Figure 1. The tree is represented using the conventions proposed by Mann
and Thompson (1988).
3.2 Comparing potential to establish co-referential 
links 
3.2.1 Method 
The annotations for co-reference relations and rhetorical structure
trees for the thirty texts were fused, yielding representations that
reflect not only the discourse structure, but also the co-reference
equivalence classes specific to each text. Based on this information,
we evaluated the potential of each of the two classes of models
discussed in section 2 (Linear-k and Discourse-VT-k) to correctly
establish co-referential links as follows: For each model, each k, and
each marked referential expression a, we determined whether or not the
corresponding LPA (defined over k elementary units) contained a referee
from the same equivalence class. For example, for the Linear-2 model
and referential expression the smaller company in unit 9, we estimated
whether a co-referential link could be established between the smaller
company and another referential expression in units 7, 8, or 9. For the
Discourse-VT-2 model and the same referential expression, we estimated
whether a co-referential link could be established between the smaller
company and another referential expression in units 1, 8, or 9, which
correspond to the DRA of unit 9.
To enable a fair comparison of the two models, when k is larger than
the size of the DRA of a given unit, we extend that DRA using the
closest units that precede the unit under scrutiny and are not already
in the DRA. Hence, for the Linear-3 model and the referential
expression the smaller company in unit 9, we estimate whether a
co-referential link can be established between the smaller company and
another referential expression in units 9, 8, 7, or 6. For the
Discourse-VT-3 model and the same referential expression, we estimate
whether a co-referential link can be established between the smaller
company and another referential expression in units 9, 8, 1, or 7,
which correspond to the DRA of unit 9 (units 9, 8, and 1) and to unit
7, the closest unit preceding unit 9 that is not in its DRA.
For the Discourse-VT-k models, we assume that the Extended DRA (EDRA)
of size $k$ of a unit $u$ ($\mathrm{EDRA}_k(u)$) is given by the first
$l \le k$ units of a sequence that lists, in reverse order, the units
of the DRA of $u$ plus the $k - l$ units that precede $u$ but are not
in its DRA. For example, for the text in Figure 1, the following
relations hold: $\mathrm{EDRA}_0(9) = 9$; $\mathrm{EDRA}_1(9) = 9, 8$;
$\mathrm{EDRA}_2(9) = 9, 8, 1$; $\mathrm{EDRA}_3(9) = 9, 8, 1, 7$;
$\mathrm{EDRA}_4(9) = 9, 8, 1, 7, 6$. For Linear-k models, the
$\mathrm{EDRA}_k(u)$ is given by $u$ and the $k$ units that immediately
precede $u$.
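The EDRA construction can be sketched as below, assuming that the DRA of a unit is already available (here it is simply passed in) and that units are numbered from 1 in text order. The function names `edra_vt` and `edra_linear` are hypothetical. Following the examples in the text, an EDRA of size k contains the unit itself plus up to k further units.

```python
from typing import Iterable, List


def edra_vt(unit: int, dra: Iterable[int], k: int) -> List[int]:
    """Discourse-VT-k: DRA units in reverse text order, then the closest
    preceding units outside the DRA, truncated to the unit plus k others."""
    dra_units = {v for v in dra if v <= unit} | {unit}
    in_dra = sorted(dra_units, reverse=True)
    outside = [v for v in range(unit - 1, 0, -1) if v not in dra_units]
    return (in_dra + outside)[: k + 1]


def edra_linear(unit: int, k: int) -> List[int]:
    """Linear-k: the unit and the k units that immediately precede it."""
    first = max(unit - k, 1)
    return list(range(unit, first - 1, -1))


dra_of_unit_9 = [1, 8, 9]                    # DRA of unit 9, from Section 2.2
vt_3 = edra_vt(9, dra_of_unit_9, 3)          # [9, 8, 1, 7]
linear_3 = edra_linear(9, 3)                 # [9, 8, 7, 6]
```

Keeping the result as an ordered list rather than a set matters later, when the position of the antecedent within the EDRA is used to estimate effort.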
The potential $p(M, a, \mathrm{EDRA}_k)$ of a model $M$ to determine
correct co-referential links with respect to a referential expression
$a$ in unit $u$, given a corresponding EDRA of size $k$
($\mathrm{EDRA}_k(u)$), is assigned the value 1 if the EDRA contains a
co-referent from the same equivalence class as $a$. Otherwise,
$p(M, a, \mathrm{EDRA}_k)$ is assigned the value 0. The potential
$p(M, C, k)$ of a model $M$ to determine correct co-referential links
for all referential expressions in a corpus of texts $C$, using EDRAs
of size $k$, is computed as the sum of the potentials
$p(M, a, \mathrm{EDRA}_k)$ of all referential expressions $a$ in $C$.
This potential is normalized to a value between 0 and 1 by dividing
$p(M, C, k)$ by the number of referential expressions in the corpus
that have an antecedent.
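A minimal sketch of the potential computation follows. The `RE` record, its `position` field (used to decide which mentions come earlier) and the toy chain of four mentions of Genetic Therapy Inc. are assumptions for illustration; the units of those mentions are the ones discussed for Figure 1, and the EDRAs are passed in rather than derived from an RST tree.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RE:
    unit: int       # elementary discourse unit that contains the expression
    position: int   # order of the expression in the text
    chain: str      # identifier of its co-reference equivalence class


def has_antecedent(a: RE, res: Sequence[RE]) -> bool:
    return any(r.chain == a.chain and r.position < a.position for r in res)


def potential(a: RE, edra: Sequence[int], res: Sequence[RE]) -> int:
    """1 if some earlier co-referent of `a` lies in a unit of the EDRA, else 0."""
    units = set(edra)
    return int(any(r.chain == a.chain and r.position < a.position
                   and r.unit in units for r in res))


def corpus_potential(res: Sequence[RE],
                     edra_of: Callable[[int], Sequence[int]]) -> float:
    """Sum of potentials, normalized by the number of REs with an antecedent."""
    resolvable = [a for a in res if has_antecedent(a, res)]
    if not resolvable:
        return 0.0
    return sum(potential(a, edra_of(a.unit), res) for a in resolvable) / len(resolvable)


# Toy chain: Genetic Therapy Inc. (unit 1), its (2), Genetic Therapy (5),
# the smaller company (9).
chain = [RE(1, 0, "GT"), RE(2, 1, "GT"), RE(5, 2, "GT"), RE(9, 3, "GT")]
smaller_company = chain[3]
p_linear_2 = potential(smaller_company, [9, 8, 7], chain)  # Linear-2 EDRA
p_vt_2 = potential(smaller_company, [9, 8, 1], chain)      # Discourse-VT-2 EDRA
```

Expressions without an antecedent always score 0, so summing only over the resolvable ones gives the same total as summing over the whole corpus.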
By examining the potential of each model to correctly 
determine co-referential expressions for each k, it is pos- 
sible to determine the degree to which an implementa- 
tion of a given approach can contribute to the overall 
efficiency of anaphora resolution systems. That is, if a 
given model has the potential to correctly determine a 
significant percentage of co-referential expressions with 
small DRAs, an anaphora resolution system implement- 
ing that model will have to consider fewer options over- 
all. Hence, the probability of error is reduced. 
3.2.2 Results 
The graph in Figure 3 shows the potentials of the Linear-k and
Discourse-VT-k models to correctly determine co-referential links for
each k from 1 to 20. The graph in Figure 4 represents the same
potentials but focuses only on ks in the interval [2,9]. As these two
graphs show, the potentials increase monotonically with k, the VT-k
models always doing better than the Linear-k models. Eventually, for
large ks, the potential performance of the two models converges to
100%.
The graphs in Figures 3 and 4 also suggest resolution strategies for
implemented systems. For example, the graphs suggest that by choosing
to work with EDRAs of size 7, a discourse-based system has the
potential of resolving more than 90% of the co-referential links in a
text correctly. To achieve the same potential, a linear-based system
needs to look back 8 units. If a system does not look back at all and
attempts to resolve co-referential links only within the unit under
scrutiny (k = 0), it has the potential to correctly resolve about 40%
of the co-referential links.
To provide a clearer idea of how the two models differ, Figure 5 shows,
for each k, the value of the Discourse-VT-k potentials divided by the
value of the Linear-k potentials. For k = 0, the potentials of both
models are equal because both use only the unit in focus in order to
determine co-referential links. For k = 1, the Discourse-VT-1 model is
about 7% better than the Linear-1 model. As the value of k increases,
the ratio Discourse-VT-k/Linear-k converges to 1.
In Figures 6 and 7, we display the number of exceptions, i.e.,
co-referential links that Discourse-VT-k and Linear-k models cannot
determine correctly. As one can see, over the whole corpus, for each
k ≤ 3, the Discourse-VT-k models have the potential to determine
correctly about 100 more co-referential links than the Linear-k models.
As k increases, the performance of the two models converges.
3.2.3 Statistical significance
In order to assess the statistical significance of the difference
between the potentials of the two models to establish correct
co-referential links, we carried out a Paired-Samples T Test for each
k. In general, a Paired-Samples T Test checks whether the mean of
casewise differences between two variables differs from 0. For each
text in the corpus and each k, we determined the potentials of both
VT-k and Linear-k models to establish correct co-referential links in
that text. For ks smaller than 4, the difference in potentials was
statistically significant. For example, for k = 3, t = 3.345, df = 29,
P = 0.002. For values of k larger than or equal to 4, the difference
was no longer significant. These results are consistent with the graphs
shown in Figures 3 to 7, which all show that the potentials of
Discourse-VT-k and Linear-k models converge to the same value as the
value of k increases.

[Plot: potential (40% to 100%) against EDRA size for the VT-k and Linear-k models]
Figure 3: The potential of Linear-k and Discourse-VT-k models to
determine correct co-referential links (0 ≤ k ≤ 20).

[Plot: potential (70% to 95%) against EDRA size for the VT-k and Linear-k models]
Figure 4: The potential of Linear-k and Discourse-VT-k models to
determine correct co-referential links (2 ≤ k ≤ 9).
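For readers who want to reproduce this kind of test, the paired-samples t statistic can be computed as in the sketch below. The function name `paired_t` is hypothetical and the inputs stand for per-text potentials of the two models; no data from the experiment is reproduced here. Only the statistic and the degrees of freedom are computed; the P value would then be read from a t distribution with those degrees of freedom, for instance with a statistics package such as SciPy (`scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, int]:
    """Return (t, df) for the mean of the casewise differences xs[i] - ys[i]."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 cases")
    diffs = [x - y for x, y in zip(xs, ys)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Illustrative values only: three hypothetical texts, two models.
t_value, df = paired_t([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

With thirty texts, one case per text, the degrees of freedom are 30 - 1 = 29, which matches the df reported above.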
3.3 Comparing the effort required to establish 
co-referential links 
3.3.1 Method 
The method described in section 3.2.1 estimates the potential of
Linear-k and Discourse-VT-k models to determine correct co-referential
links by treating EDRAs as sets. However, from a computational
perspective (and presumably, from a psycholinguistic perspective as
well) it also makes sense to compare the effort required by the two
classes of models to establish correct co-referential links. We
estimate this effort using a very simple metric that assumes that the
closer an antecedent is to a corresponding referential expression in
the EDRA, the better. Hence, in estimating the effort to establish a
co-referential link, we treat EDRAs as ordered lists. For example,
using the Linear-9 model, to determine the correct antecedent of the
referential expression the smaller company in unit 9 of Figure 1, it is
necessary to search back through 4 units (to unit 5, which contains the
referent Genetic Therapy). Had unit 5 been Mr. Casey succeeds M. James
Barrett, 50, we would have had to go back 8 units (to unit 1) in order
to correctly resolve the RE the smaller company. In contrast, in the
Discourse-VT-9 model, we go back only 2 units because unit 1 is two
units away from unit 9 ($\mathrm{EDRA}_9(9) = 9, 8, 1, 7, 6, 5, 4, 3, 2$).

[Plot: ratio of Discourse-VT-k to Linear-k potentials against EDRA size]
Figure 5: A direct comparison of Discourse-VT-k and Linear-k potentials
to correctly determine co-referential links (0 ≤ k ≤ 20).

[Plot: number of undetermined co-referential links against EDRA size for the VT-k and Linear-k models]
Figure 6: The number of co-referential links that cannot be correctly
determined by Discourse-VT-k and Linear-k models (0 ≤ k ≤ 20).

[Plot: number of undetermined co-referential links against EDRA size for the VT-k and Linear-k models]
Figure 7: The number of co-referential links that cannot be correctly
determined by Discourse-VT-k and Linear-k models (1 ≤ k ≤ 10).
We consider that the effort $e(M, a, \mathrm{EDRA}_k)$ of a model $M$
to determine correct co-referential links with respect to one
referential expression $a$ in unit $u$, given a corresponding EDRA of
size $k$ ($\mathrm{EDRA}_k(u)$), is given by the number of units
between $u$ and the first unit in $\mathrm{EDRA}_k(u)$ that contains a
co-referential expression of $a$.
The effort $e(M, C, k)$ of a model $M$ to determine correct
co-referential links for all referential expressions in a corpus of
texts $C$ using EDRAs of size $k$ was computed as the sum of the
efforts $e(M, a, \mathrm{EDRA}_k)$ of all referential expressions $a$
in $C$.
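The effort metric can be sketched as a search over the EDRA treated as an ordered list. The function name `effort` and the `antecedent_units` argument (the set of units that contain an earlier co-referent of the expression) are assumptions for illustration. When no unit of the EDRA contains a co-referent, the sketch returns k, which is the convention stated with the results in section 3.3.2.

```python
from typing import Sequence, Set


def effort(edra: Sequence[int], antecedent_units: Set[int], k: int) -> int:
    """Number of units searched back before the first unit of the ordered
    EDRA that contains a co-referent; k if the EDRA contains none."""
    for distance, unit in enumerate(edra):
        if unit in antecedent_units:
            return distance
    return k


# "the smaller company" in unit 9 has earlier co-referents in units 1, 2 and 5.
antecedents = {1, 2, 5}
linear_9 = [9, 8, 7, 6, 5, 4, 3, 2, 1]   # Linear-9 EDRA of unit 9
vt_9 = [9, 8, 1, 7, 6, 5, 4, 3, 2]       # Discourse-VT-9 EDRA of unit 9

effort_linear = effort(linear_9, antecedents, 9)   # 4: back to unit 5
effort_vt = effort(vt_9, antecedents, 9)           # 2: back to unit 1
```

The corpus-level effort is then the sum of these values over all referential expressions.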
3.3.2 Results 
Figure 8 shows the Discourse-VT-k and Linear-k efforts computed over
all referential expressions in the corpus and all ks. It is possible,
for a given referential expression $a$ and a given $k$, that no
co-referential link exists in the units of the corresponding
$\mathrm{EDRA}_k$. In this case, we consider that the effort is equal
to $k$. As a consequence, for small ks the effort required to establish
co-referential links is similar for both theories, because both can
establish only a limited number of links. However, as k increases, the
effort computed over the entire corpus diverges dramatically: using the
Discourse-VT model, the search space for co-referential links is
reduced by about 800 units for a corpus containing roughly 1200
referential expressions.
3.3.3 Statistical significance 
A Paired-Samples T Test was performed for each k. For each text in the
corpus and each k, we determined the effort of both VT-k and Linear-k
models to establish correct co-referential links in that text. For all
ks the difference in effort was statistically significant. For example,
for k = 7, we obtained the values t = 3.51, df = 29, P = 0.001. These
results are intuitive: because EDRAs are treated as ordered lists and
not as sets, the effect of the discourse structure on establishing
correct co-referential links is not diminished as k increases.
4 Conclusion 
We analyzed empirically the potentials of discourse and linear models
of text to determine co-referential links. Our analysis suggests that
by exploiting the hierarchical structure of texts, one can increase the
potential of natural language systems to correctly determine
co-referential links, which is a requirement for correctly resolving
anaphors. If one treats all discourse units in the preceding discourse
equally, the increase is statistically significant only when a
discourse-based co-reference system looks back at most four discourse
units in order to establish co-referential links. However, if one
assumes that proximity plays an important role in establishing
co-referential links and that referential expressions are more likely
to be linked to referees that were used recently in discourse, the
increase is statistically significant no matter how many units a
discourse-based co-reference system looks back in order to establish
co-referential links.

[Plot: effort against EDRA size for the VT-k and Linear-k models]
Figure 8: The effort required by Linear-k and Discourse-VT-k models to
determine correct co-referential links (0 ≤ k ≤ 100).
Acknowledgements. We are grateful to Lynette 
Hirschman and Nancy Chinchor for making available 
their corpora of co-reference annotations. We are also 
grateful to Graeme Hirst for comments and feedback on 
a previous draft of this paper. 

References 

Saliha Azzam, Kevin Humphreys, and Robert
Gaizauskas. 1998. Evaluating a focus-based approach to anaphora resolution. In Proceedings of
the 36th Annual Meeting of the Association for
Computational Linguistics and of the 17th International Conference on Computational Linguistics
(COLING/ACL'98), pages 74-78, Montreal, Canada,
August 10-14.

Dan Cristea, Nancy Ide, and Laurent Romary. 1998.
Veins theory: A model of global discourse cohesion
and coherence. In Proceedings of the 36th Annual
Meeting of the Association for Computational Linguistics and of the 17th International Conference on
Computational Linguistics (COLING/ACL'98), pages
281-285, Montreal, Canada, August.

Barbara Fox. 1987. Discourse Structure and Anaphora.
Cambridge Studies in Linguistics; 48. Cambridge University Press.

Niyu Ge, John Hale, and Eugene Charniak. 1998. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora, 
pages 161-170, Montreal, Canada, August 15-16. 

Barbara J. Grosz and Candace L. Sidner. 1986. Attention, intentions, and the structure of discourse.
Computational Linguistics, 12(3):175-204, July-September.

Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein.
1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203-226, June.

Lynette Hirschman and Nancy Chinchor. 1997. MUC-7
Coreference Task Definition, July 13.

Janet Hitzeman and Massimo Poesio. 1998. Long distance pronominalization and global focus. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and of the 17th International Conference on Computational Linguistics (COLING/ACL'98), pages 550-556, Montreal, 
Canada, August. 

Jerry H. Hobbs. 1978. Resolving pronoun references. 
Lingua, 44:311-338. 

Megumi Kameyama. 1997. Recognizing referential 
links: An information extraction perspective. In Proceedings of the ACL/EACL'97 Workshop on Operational Factors in Practical, Robust Anaphora Resolution, pages 46-53. 

Shalom Lappin and Herbert J. Leass. 1994. An algorithm for pronominal anaphora resolution. Computational Linguistics, 20(4):535-561.

William C. Mann and Sandra A. Thompson. 1988.
Rhetorical structure theory: Toward a functional theory
of text organization. Text, 8(3):243-281.

Daniel Marcu, Estibaliz Amorrortu, and Magdalena 
Romera. 1999. Experiments in constructing a corpus of discourse trees. In Proceedings of the ACL'99 Workshop on Standards and Tools for Discourse Tagging, pages 48-57, University of Maryland, June 22. 

Ruslan Mitkov. 1997. Factors in anaphora resolution: 
They are not the only things that matter, a case study 
based on two different approaches. In Proceedings of 
the ACL/EACL'97 Workshop on Operational Factors 
in Practical, Robust Anaphora Resolution, pages 14-21. 

Candace L. Sidner. 1981. Focusing for interpretation of 
pronouns. Computational Linguistics, 7(4):217-231, 
October-December. 

Wietske Vonk, Lettica G.M.M. Hustinx, and Wim H.G.
Simons. 1992. The use of referential expressions in
structuring discourse. Language and Cognitive Processes,
7(3,4):301-333.
