Automatic Processing of Large Corpora fbr the Resolution of 
Anaphor  References 
Ido Dagan * Alon Itai 
Computer Science Department 
Technion, tIaifa, Israel 
dagan~techunix.bitnet, itai~ cs.technion, ac.il 
Abstract 
Manual acquisition of semantic constraints in broad 
domains is very expensive. This paper presents an 
automatic scheme for collecting statistics on cooc- 
currence patterns in a large corpus. To a large ex- 
tent, these statistics reflect, semantic constraints and 
thus are used to disambiguate anaphora references 
and syntactic ambiguities. The scherne was imple- 
mented by gathering statistics on the output of other 
linguistic tools. An experiment was performed to 
resolve references of the pronoun "it" in sentences 
that were randomly selected from the corpus. Ttle 
results of the experiment show that in most of the 
cases the cooccurrence statistics indeed reflect the 
semantic constraints and thus provide a basis {'or a 
useful disambiguat.ion tool. 
1 Introduction 
The use of selectional constraints is one of the most 
popular methods in applying semantic information 
to the resolution of ambiguities in natural languages. 
The constraints typically specify which combina- 
tions of semantic classes are acceptable in subject- 
verb-object relationships and other syntactic struc- 
tures. This information is used to filter ont some 
analyses of ambiguous constructs or to set prefer- 
ences between alternatives. 
Though the use of selectional constraints is very 
popular, there is very little success (if any) in im- 
plementing this method for broad domains. The 
major problem is the huge amount of information 
that must be acquired in order to achieve a rea- 
sonable representation of a large domain. In order 
to overcome this problem, our project suggests an 
alternative to the traditional model, based on auto- 
matic acquisition of constraints fl'om a large corpus. 
The rest of the paper describes how this method is 
used to resolve anaphora references. Similarly, the 
constraints are used also to resolve syntactic am- 
biguities, but this will not be described here. The 
*Part of this resem'ch was conducted wb.ile visiting IBM 
T. J. Watson Research Center, Yorktown Ileights, NY 
reader should bare in mind that like the conven- 
tional use of selectional constraints, our method is 
inteuded to work in co,tjunction with other disam- 
biguation means. These, such as various syntactic 
and pragmatic constraints and heuristics \[Carbonetl 
and Brown P.)88, tlobbs 1978\], represent additional 
levels of knowledge and are essential when selec- 
tional constraints are not sufficient. 
2 The Statistical Approach 
According to the statistical model, cooccurrence 
patterns that were observed in tile corpns are used 
as selection patterns. Whenever several alternatives 
are presented by an ambiguous construct, we prefer 
the one correspot~ding t.o more frequent patterns. 
When using selectional constraints for anaphora 
resolution, the referent must satisfy the constraints 
which are imposed on the anaphor. If the anaphor 
participates in a certain syntactic relation, like be- 
ing an object of some verb, then the substitution 
of the anaphor with the referent must satisfy the 
selectional constraim.s. In the statistical model, we 
substitute each of the candidt~tes with the anaphor 
and approve only those candidates which produce 
frequent cooccurrence patterns. Consider, for exam- 
pie, the following sentence, taken from the Hansard 
corpus of the proceedings of the Canadian parlia- 
ment \[Brown et al. 1988\]: 
(1) They know full well that the companies held 
tax money aside for collection later on the b~sis 
that the government said it was going to collect 
it. 
There are two occurrences of "it" in this sentence. 
The first serves ~ the subject of "collect" and the 
second as its object. We gathered the statistics for 
three candidates which occur in the sentence: "col- 
lection", "money" and "government". According to 
the syntactic structure of the sentence, each of them 
may serw~' aL, s the referent for each of the occurrences 
of the pronoun. The following table lists the pat- 
terns that were produced by substituting each can- 
330 1 
didate with the anaphor, and the number of times 
each of these patterns occurred in the corpus: 
subject-verb collection collect 0 
subject-verb money collect S 
subject-verb government collect 198 
verb-object collect collection 0 
verb.-obj ect collect money 149 
verb-~object collect government 0 
According to these statistics "government" is pre- 
ferred as the reti~rent of the first "it", and "money" 
of the second. 
This example demonstrates the case of definite se.- 
mantle constraints which eliminate all but the cor- 
rect alternative. In other cases, several alternatives 
may ,;atisfy the selectional constraints, and may be 
observed in the corpus a significant number of times. 
In such cases the tlnal selection between the ap- 
proved candidates should be performed by other 
means, such as syntactic heuristics or asking the 
user. Another passibility may be to use statistical 
preferences, and prefer the relatively more frequent 
patterns, tlowever, at this stage it is not clear to us 
how useflfl the statistical preference can be, and we 
use the statistics only relative to a certain threshold, 
approving any patterns that pass this threshold. 
3 Implementing the Acquisi- 
tion Phase 
The use of the statistical model involves two sepa- 
rate phases. The first is the acquisition pha.se, in 
which the corpus is processed and the statistical 
database is built. The second is the disambigua- 
tion phase, in which the statistical datab~Lse is used 
to resolve ambiguities. 
The statistical database contains cooccurrence 
patterns for various syntactic relations. In the ex- 
periment reported here we have used constraints for 
the %ubject-verb", "verb-object" and "adjective- 
noun" relations. To locate these relations in the 
sentences of the corpus, each sentence is parsed 
by the PEG parser \[Jensen 1986\]. Then, a post- 
processing algorithm identifies the various relations 
in the parse tree. As wa.s noted in \[Grishman et 
al. 1986\], the cooccurrence patterns reflect regu- 
larized or canonical structure. Therefore the post- 
processing algorithm has to map surface structures 
into the normalized relations. During our experi- 
ments we have used two different implementations 
for this algorithm \[Lappin et al. 1988\] \[Jensen 1989\], 
which take into account structures like passives, sub- 
clauses, questions and relative and infinitive clauses. 
The use of an automatic procedure for extracting 
information from a corpus that was not preprocessed 
manually raises a basic problem of circularity. Since 
the corpus was not disambiguated, it is not possible 
to distinguish the semantically correct patterns from 
the incorrect ones. Both types of ambiguity, syntac- 
tic and lexical, may cause the system to acquire or 
use inappropriate patterns. This problems is consid~ 
ered very important when dealing with a corpus: it 
was the re,Leon for the substantial human interven- 
tion in the procedure of \[Grishman et al. 1986\], and 
it is the reason why other techniques use manually 
tagged corpora (e.g. \[Church 1988\]). 
In practice, however, we have discovered that the 
problem is not so cruciah semantically vMid pat- 
terns have occurred many more times in syntac- 
tically unambiguous constructs than in mnbiguous 
ones. Thus, they could be identified without the 
need of first disambiguating the sentences. Seman- 
tically non-valid patterns indeed occurred in the in- 
appropriate parses but they were too rare to pass 
the threshold. As tbr lcxical ambiguities, the chance 
that one sense of a word will be confused with an- 
other during disambiguation seems to be very small, 
and it never happened in our experiment. 
4 The Experiment 
An experiment was performed to resolve references 
of the anaphor "it" in the IIansard corpus. The 
examples of the ambiguous sentences were selected 
in tile following way: First, sentences containing the 
word "it" were extracted randomly from the corpus. 
Then, we manually filtered out sentences that were 
not relevant for the use of selectional constraints in 
resolving anaphoric references. Such cases were non- 
anaphoric occurrences of "it", cases where the ref- 
erent was not a noun phrase and cases where the 
anaphor was not involved in one of the three rela- 
tions that we used. In addition, we have excluded 
cases where there was only one possible referent, 
so that our results will reflect correctly the perfor~ 
mance of the disambiguation method. The filtering 
process eliminated about two t.hirds of the original 
sentences, and we proceeded with 59 examples. The 
alternative candidates for the referent (which satisfy 
definite syntactic constrair, ts such as number, gen- 
der and requirements for reflexives) were identified 
manually in each example. 1 
The statistics were collected from part o\[ the cor- 
pus, of about 28 million words. For 21 out of the 59 
examples the statistics were not meaningful (we used 
a threshold of 5 occurrences for each of the alterna- 
tive patterns). In these cases the algorithm cannot 
approve any of the candidates, getting a "coverage" 
of 38/59 (64%). 
As explained in Section 2, the output of the sta- 
tistical method is used to represent the selectional 
1 The llansard corpus, as maintained by the speech g\]'oup 
at IBM Watson Research Center, does not contain consec- 
utive sentences. Therefore, we identified only candidaLes 
within the same sentence as the anaphor. "I'o provide enough 
candidales, we examined occurrences of "it" ~ffter the 15th 
word of the senLence. The examples provided between 2 to 5 
candidates, with an average of 2,8 candidates per anaphor. 
2 331 
constraints. This is done by approving all pat- 
terns which appeared a significant nunaber of times. 
Therefore, the output is considered correct if the 
appropriate candidate is approved. This happened 
in 33 cases, getting "accuracy" of 33/38 (87%). In 
18 of these cases, the appropriate candidate was the 
only one which was approved, getting a complete 
resolution of the ambiguity. 
This last result demonstrates the advantage of the 
statistical data over semantic constraints. While se- 
mantic constraints should approve any combination 
of arguments in a syntactic relation that may oc- 
cur in the text, the statistics approve only those 
combinations that actually occur and reject others. 
Manual observation of the 18 sentences in which the 
statistics completely resolved the ambiguity showed 
that only in 7 cases the ambiguity could be elim- 
inated by traditional selectional constraints. This 
is consistent with the evaluation in \[Itobbs 1978\], 
where only in 12 out of 132 sentences the ambiguity 
was eliminated by selectional constraints. 
An additional note should be made concerning the 
technical methodology of the experiment. Within 
the limited resources of our research, it was not fea- 
sible to build the statistical database for the entire 
Itansard corpus, which contains about 60 million 
words. The expensive resources are the parsing time 
and the storage for the cooceurrence patterns and 
their statistics. ~ However, it turns out that pars- 
ing the entire corpus is not necessary to evaluate 
the success of the statistical model! As the evalu- 
ation relates to a limited number of examples, it is 
sufficient to collect the statistics only for patterns 
that are relevant for the disambiguation of these ex- 
amples. Therefore, we have extracted from the cor- 
pus only those sentences that contained at least one 
cooccurrenee of words from a relevant pattern. This 
procedure allowed us to parse only 10,000 sentences. 
5 Conclusions 
We have suggested using cooccurrence patterns, au- 
tomatically acquired from a large corpus, as an alter- 
native to selectional constraints. The initial results 
indicate that even in its basic form, as presented 
here, the approach is useful for disambiguation, and 
many times performs even better than the tradi- 
tional model. This should be considered relative to 
the effort that would have been required to achieve 
such coverage and accuracy by manual acquisition 
of constraints, for the broad domain of parliament 
proceedings. 
SAlthough the constnlction of the full size database is not 
feasible for us, it is clearly feasible for a large scale project. 
This is shown by a similar database that w~s implemented 
as part of the laslgllage model of the IBM speech recognition 
system. 3?his database contalns counters for occurrences of 
sequences of three words in lm'ge corpora (trigrams), which 
arc much more numerous than our syntactic patterns. 
In a general perspective, this project promotes the 
use of a large corpus for linguistic research and ap- 
plications. Processing such large corpora is a non- 
trivial engineering problem, the solution of which 
enables research to focus on complicated real world 
sentences. Our research demonstrates how statisti- 
cal methods can be built on top of more 'traditional' 
linguistic tools, achieving a better and more feasible 
environment for the resolution of ambiguities. 
6 Acknowledgements 
We would like to thank Mori Rimon, Shalom Lap- 
pin, Wlodek Zadrozny, Slavs Katz, John Justeson, 
Lisa Braden-Harder and Peter Brown for their fruit- 
ful advice and technical support. 

References 

\[Brown et al. 1988\] Brown, P., Cocke, J., Della 
Pietra, S., Della Pietra, V., Jelinek, F., Mercer, 
R.L. and Roossin P.S., A statistical approach 
to language translation, COLING 1988. 

\[Carbonell and Brown 1988\] Carbonell, J. G. and 
Brown, R. D. Anaphora resolution: A multi 
strategy approach, COLING I988. 

\[Church 1988\] Church, K. W., A stochastic parts 
program and noun phrase parser for unre- 
stricted text, ACL Conf. on Applied NLP, 
1988. 

\[Grishman et al. 1986\] R. Grishman, L. Hirschman 
and Ngo Thanh Nhan, Discovery procedures for 
sublanguage selectional patterns initial experi- 
ments, Computational Linguislics, vol. 12,205- 
214, 1986. 

\[Hobbs 1978\] ttobbs, J. R. Resolving pronoun ref- 
erences, Lingua, vol. 44,311-338, 1978. 

\[Jensen 1986\] K. Jensen, PEG 1986: A broad- 
coverage computational syntax of English, 
Technical Report, IBM T. J. Watson Research 
Center, 1986. 

\[Jensen 1989\] Jensen, K. PEGASOS: Deriving 
predicate-argument structures after a syntactic 
parse. Presented at the International Work- 
shop on Parsing Technologies, Carnegie Mellon 
University, August 1989. 

\[Lappin et al. 1989\] S. Lappin, I. Golan, M. Ri- 
mon, Computing grammatical fimctions from 
a configurational parse tree, Technical Report 
88.268, IBM Israel Center of Science and Tech- 
nology, 1989. 
