Using Collocation Statistics in Information Extraction
Dekang Lin
Department of Computer Science
University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2
lindek@cs.umanitoba.ca
and
Nalante, Inc.
7 Blackwood Bay, Winnipeg, Manitoba, Canada
lindek@nalante.com
INTRODUCTION
Our main objective in participating MUC-7 is to investigate and experiment with the use of col-
location statistics in information extraction. A collocation is a habitual word combination, such
as #5Cweather a storm", #5C#0Cle a lawsuit", and #5Cthe falling yen". Collocation statistics refers to the
frequency counts of the collocational relations extracted from a parsed corpus. For example, out of
6577 instances of #5Caddition" in a corpus, 5190 was used as the object of #5Cin". Out of 3214 instances
of #5Chire", 12 of them take #5Calien" as the object.
We participated in two tasks: Named Entity and Coreference. In both tasks, the input text
is processed in two passes. During the #0Crst pass we use the parse trees of input texts, combined
with collocation statistics obtained from a large corpus, to automatically acquire or enrich lexical
entries which are then used in the second pass.
COLLOCATION DATABASE
We de#0Cne a collocation to be a dependency triple that consists of three #0Celds:
#28word, relation, relative#29
where the word #0Celd is a word in a sentence, the relative #0Celd can either be the modi#0Cee or a
modi#0Cer of word, and the relation #0Celd speci#0Ces the type of the relationship between word and
relative as well as their parts of speech.
For example, the dependency triples extracted from the sentence #5CI have a brown dog" are:
#28have V:subj:N I#29 #28I N:r-subj:V have#29
#28have V:comp1:N dog#29 #28dog N:r-comp1:V have#29
#28dog N:jnab:A brown#29 #28brown A:r-jnab:N dog#29
#28dog N:det:D a#29 #28a D:r-det:N dog#29
1
The identi#0Cers for the dependency types are explained in Table 1.
Table 1: Dependency types
Label Relationship between:
N:det:D a noun and its determiner
N:jnab:A a noun and its adjectival modi#0Cer
N:nn:N a noun and its nominal modi#0Cer
V:comp1:N averb and its noun object
V:subj:N averb and its subject
V:jvab:A averb and its adverbial modi#0Cer
We used MINIPAR, a descendent of PRINCIPAR #5B2#5D, to parse a text corpus that is made up
of 55-million-word Wall Street Journal and 45-million-word San Jose Mercury. Two steps were
taken to reduce the number of errors in the parsed corpus. Firstly, only sentences with no more
than 25 words are fed into the parser. Secondly, only complete parses are included in the parsed
corpus. The 100 million word text corpus is parsed in about 72 hours on a Pentium 200 with 80MB
memory. There are about 22 million words in the parse trees.
Figure 1 shows an example entry in the resulting collocation database. Eachentry contains of
all the dependency triples that have the same word #0Celd. The dependency triples in an entry are
sorted #0Crst in the order of the part of speech of their word #0Celds, then the relation #0Celd, and then
the relative #0Celd.
The symbols used in Figure #281#29 are explained as follows. Let X be a multiset. The symbol kXk
stands for the number of elements in X and jXj stands for the number of distinct elements in X.For
example,
a. k#28review, V:comp1:N, acquisition#29k is the number of times #5Cacquisition" is used as the
object of the verb #5Creview".
b. k#28review, *, *#29k is the number of dependency triples in which the word #0Celd is #5Creview"
#28which can be a noun or a verb#29.
c. k#28review, V:jvab:A, *#29k is the number of times #5B
v
review#5D is pre-modi#0Ced by an adverb.
d. j#28review, V:jvab:A, *#29j is the number of distinct adverbs that were used as a pre-modi#0Cer
of #5B
v
review#5D.
e. k#28*, *, *#29k is the total number of dependency triples, whichistwice the number of depen-
dency relationships in the parsed corpus.
f. k#28review, N#29k is the number of times the word #5Creview" is used as a noun.
g. k#28*, N#29k is the total number of occurrences of nouns.
h. j#28*, N#29j is the total number of distinct nouns that
2
review 8514
V 1424
V:subj:N 789 179
administration 5
appeals court 2
... ...
V:jvab:A 101 39
briefly 2
carefully 7
formally 2
... ...
V:comp1:N 1353 384
account 7
acquisition 3
action 5
activity 2
... ...
N 1576
N:r-subj:V 239 118
affect 8
become 2
... ...
N:r-comp1:V 525 157
approve 3
avoid 5
await 3
... ...
N:nn:N 241 85
admission 2
bank 2
book 4
... ...
N:jnp:P 76 9
by 26
for 28
... ...
N:jnab:A 518 182
administrative 5
annual 12
antitrust 10
... ...
part of speech
||(review, V:comp1:N, acquisition)||
||(review, V:jvab:A, *)||
|(review, V:jvab:A, *)|
||(review, * ,*)||
word-field
relation-field
relative-field
"review" has been used as the objects
of these verbs in the corpus
these nouns were used as a prenominal
modifier of "review"
||(review, N)||
Figure 1: An example entry in the Collocation Database
3
i. k#28review, *#29k is the total numberof occurrences of the word #5Creview" #28used as any category#29
in the parsed corpus.
NAMED ENTITY RECOGNITION
Our named entity recognizer is a #0Cnite-state pattern matcher, whichwas developed as part Univer-
sity of Manitoba MUC-6 e#0Bort. The pattern matcher has access to both lexical items and surface
strings in the input text. In MUC-7, we extended the earlier system in twoways:
#0F We extracted recognition rules automatically from the collocation database to augment the
manually coded pattern rules.
#0F We treated the collocational context of words in the input texts as features and used a
Naive-Bayes classi#0Cer to categorized unknown proper names, which are then inserted into the
systems lexicon.
A collocational context of a proper name is often a good indicator of its classi#0Ccation. For
example, in the 22-million-word corpus, there are 33 instances where a proper noun is used as
a prenominal modi#0Cer of #5Cmanaging director". In 26 of the 33 instances, the proper name was
classi#0Ced as an organization. In the remaining 7 instances, the proper name was not classi#0Ced.
Therefore, if an unknown proper name is a prenominal modi#0Cer of #5Cmanaging director", it is likely
to refer to an organization. We extracted 3623 such contexts in which the frequency of one type
of proper names is much greater #28as de#0Cned by a rather arbitrary threshold#29 than the frequencies
of other types of proper names. If a proper name occurs in one of these contexts, we can then
classify it accordingly. This use of the collocation database is equivalent to automatic generation of
classi#0Ccation rules. In fact, some of the collocational contexts are equivalent to pattern-matching
rules that were manually coded in the system.
There are only a small number of collocational contexts in which the classi#0Ccation of a proper
name can be reliablydetermined. In most cases, a clear decision cannot be reached based on a single
collocational context. For example, among 1504 objects of #5Cconvince", 49 of them were classi#0Ced
as organizations, and 457 of them were classi#0Ced as persons. This suggests that if a proper name
is used as the object of #5Cconvince", it is likely that the name refers to a person. However, there is
also signi#0Ccant probability that the name refers to an organization. Instead of making the decision
based on this single piece of evidence, we collect from the input texts all the collocational contexts
in which an unknown proper names occurred. We then classify the the proper name with a naive
Bayes classi#0Cer, using the the set of collocation contexts as features.
The naive Bayes classi#0Cer uses a table to store the frequencies of proper name classes in col-
locational contexts. Sample entries of the frequency table are shown in Table 2. Each row in the
table represents a collocation feature. The #0Crst column is a collocation feature. Words with this
feature have been observed to occur at position X in the second column. The third to #0Cfth columns
contain the frequencies of di#0Berent proper name classes.
Let C be a class of proper name #28C is one of LOC, ORG, or PER#29. Let F
i
be a collocation
feature. Classi#0Ccation decision is made by #0Cnd the class C that maximizes
Q
k
i=1
P#28F
i
jC#29P#28C#29,
4
Table 2: Frequency of Collocation Features
Collocation Context Frequency Counts
Feature Pattern LOC ORG PER
control|N:r-comp1:V to control X 9 87 39
control|N:r-gen:N X's control 14 14 54
control|N:r-nn:N the X control 6 0 0
control|N:r-subj:V X to control 10 99 307
control|N:subj:N X is the control 0 3 0
convene|N:r-comp1:V to convene X 0 5 0
convene|N:r-subj:V X to convene 0 10 18
convention|N:r-gen:N X's convention 0 4 0
convention|N:r-nn:N the X convention 5 23 5
where F
1
;F
2
;:::F
k
are the features of an unknown proper name. The probability P#28F
i
jC#29 is
estimated by m-estimates #5B5#5D, with m = 1 and p =
1
jCFj
as the parameters, where CF is the set of
collocation features:
P
m
#28F
i
jC#29=
kF
i
;Ck+
1
jCFj
P
f2CF
kf;Ck+1
where kF
i
;Ckdenotes the frequency of words that belong to C in the context represented by f.
Example: The walkthrough article contains several occurrences of the word #5CXichang" whichis
not found in our lexicon. The parser extracted the following set of collocation contexts from the
formal testing corpus:
1. #5Cthe Xichang base", whereXichangisusedasthe prenominalmodi#0Cerof#5Cbase" #28base|N:nn:N#29;
2. #5Cthe Xichang site", where Xichang isusedas the prenominalmodi#0Cerof #5Csite" #28site|N:nn:N#29;
3. #5Cthe site in Xichang", from whichtwo features are extracted:
#0F the object of #5Cin" #28in|P:pcomp:N#29;
#0F indirect modi#0Cer of #5Csite" via the preposition #5Cin" #28site|N:pnp-in:N#29.
The frequencies of the features are shown in Table 3. These features allowed the naive Bayes
classi#0Cer to correctly classify #5CXichang" as a locale.
Automatically acquiring lexical information on the #0Dy is an double edged sword. On the one
hand, it allows classi#0Ccation of proper names that would otherwise be unclassi#0Ced. On the other
hand, since there is no human con#0Crmation, the correctness of the automatically acquired lexical
items cannot be guaranteed. When incorrect information is entered into the lexicon, a single error
may propagate to many places. For example, during the development of our system, a combination
5
Table 3: Frequencies of features of #5CXichang"
Collocation Frequency Counts
Feature LOC ORG PER
base|N:nn:N 77 19 0
site|N:nn:N 26 16 34
in|P:pcomp:N 35641 15630 0
site|N:pnp-in:N 7 0 0
of parser errors and the naiveBayes classi#0Ccation caused the word #5CI" to be added into the lexicon
as a personal name. During the second pass, 143 spurious personal names were generated.
Our NE evaluation results are shown in Table 4. The #5Cpass1" results are obtained by manually
coded patterns in conjunction with the classi#0Ccation rules automatically extracted from the collo-
cation database. With the naive Bayes classi#0Ccation, the recall is boosted by 6 percent while the
precision is decreased by 2#25 with an overall increase of F-measure by 2.67.
Table 4: Evaluation results of the named entity task
Precision Recall F-measure
pass1 89#25 79#25 83.70
o#0Ecial 87#25 85#25 86.37
COREFERENCE
Our coreference recognition subsystem used the same constraint-based model as our MUC-6 system.
This model consists of an integrator and a set of independent modules, suchassyntactic patterns
#28e.g., copula construction and appositive#29, string matching, bindingtheory, and centering heuristics.
Each module proposes weighted assertions to the integrator. There are twotypes of assertions. An
equality assertion states that two noun phrases have the same referent. An inequality assertion
states that two noun phrases must not have the same referent. The modules are allowed to freely
contradict one another, or even themselves. The integrator use the weights associated with the
assertions to resolve the con#0Dicts. A discourse model is constructed incrementally by the sequence
of assertions that are sorted in descending order of their weights. When an assertion is consistent
with the current model, the model is modi#0Ced accordingly. Otherwise, the assertion is ignored and
the model remains the same.
One of the important factors to determine whether or not two noun phrases may refer to the
same entity is their semantic compatibility. A personal pronoun must refer to a person. For
example, the pronoun #5Cit" may refer to an organization, an artifact, but not a person. A #5Cplane"
may refer to an aircraft. A #5Cdisaster" may refer to a crash. In MUC-6, we used the WordNet to
6
determine the semantic compatibility and similaritybetween two noun phrases. However, without
the ability to determine the intended sense of a word in the input text, we had to say that all
senses are possible.
1
The problem with this approach is that the WordNet, likeany other general
purpose lexical resource, aims at providing broad-coverage. Consequently, it includes many usages
of words that are very rare in our domain of interest. For example, one of the 8 potential senses
of #5Ccompany" in WordNet 1.5 is a #5Cvisitor#2Fvisitant", whichisahyponym of #5Cperson". This usage
of the word practically never happens in newspaper articles. However, its existence prevents us to
make assertions that personal pronouns like #5Cshe" cannot co-refer with #5Ccompany".
In MUC-7, we developed a word sense disambiguation #28WSD#29 module, which removes some of
the implausible senses from the list of potential senses. It does not necessarily narrows down the
possible senses of a word instance to a single one, however.
Given a polysemous word w in the input text, we take the following steps to narrowdown the
possibilities for its intended meaning:
1. Retrieve collocational contexts of w from the parse trees of the input text.
2. For each collocational context of w, retrieve its set of collocates, i.e., the set of words that
occurred in the same collocational context. Take the union of all the sets of collocates of w.
3. Take the intersection of the union and the set of similar words of w which are extracted
automatically with the collocational database #5B4#5D. We call the words in the intersection
selectors.
4. Score the set of potential senses of w by computing the similarities between senses of w and
senses of the selectors in the WordNet #5B3#5D. Remove the senses of w that received a score less
than 75#25 of the highest score.
Example: consider the word #5C#0Cghter" in the following context in the walkthrough article:
... in the multibillion-dollar deals for #0Cghter jets.
WordNet lists three senses of #5C#0Cghter":
#0F combatant, battler, disrupter
#0F champion, hero, defender, protector
#0F #0Cghter aircraft, attack aircraft
The disambiguation of this word takes the following steps:
1. The parser recognized that #5C#0Cghter" was used as the prenominal modi#0Cer of #5Cjet".
7
Table 5: Collocates of #5C#0Cghter" as prenominal modi#0Cer of #5Cjet"
Word Freq LogL Word Freq LogL
#0Cghter 80 449.56 NUM 212 160.15
ORG 187 59.56 air force 13 56.28
passenger 17 51.93 Airbus 10 44.18
Lear 6 37.79 Harrier 5 33.62
PROD 14 30.08 -bound 3 22.68
Concorde 4 22.22 Mirage 4 20.02
Avianca 3 15.93 widebody 3 15.66
stealth 4 10.43 turbofan 2 10.35
MiG 2 10.35 KAL 2 9.23
series 5 8.69 cargo 4 8.30
Aero#0Dot 2 8.16 four-engine 1 7.55
Delta 3 7.53 steering 2 7.09
CANADIENS 2 6.34 water 6 6.23
NUM-passenger 1 6.17 Dragonair 1 6.17
BLACKHAWKS 2 5.98 Skyhawk 1 5.65
Egyptair 1 5.65 transport 3 5.63
trainer 2 5.50 Coast guard 3 5.43
Advanced Tactical Fighter 1 5.31 reconnaissance 2 5.12
Qantas 1 5.05 Pan American 1 5.05
training 3 4.97 United Express 1 4.85
Gulfstream 1 4.85 Swissair 1 4.69
PSA 1 4.69 ANA 1 4.69
ground attack 1 4.54 NUM-seat 1 4.21
Alitalia 1 4.12 Lufthansa 1 3.96
PAL 1 3.89 KLM 1 3.89
NUM Syrian 1 3.76 whirlpool 1 3.03
8
2. Retrievewords from the collocation database that were also used as the prenominal modi#0Cer
of #5Cjet" #28shown in Table 5#29. Freq is the frequency of the word in the context, LogL is the log
likelihood ratio between the word and the context #5B1#5D.
3. Retrieve the similar words of #5C#0Cghter" from an automatically generated thesaurus:
jet 0.15; guerrilla 0.14; aircraft 0.12; rebel 0.11; bomber 0.11; soldier 0.11; troop
0.10; plane 0.10; missile 0.09; force 0.09; militia 0.09; helicopter 0.09; leader 0.08;
civilian 0.07; faction 0.07; pilot 0.07; airplane 0.07; insurgent 0.07; commander 0.06;
tank 0.06; airliner 0.05; militant 0.05; marine 0.05; transport 0.05; reconnaissance
0.05; prisoner 0.05; artillery0.05; army 0.05; stealth 0.05; victim 0.05; terrorist 0.05;
weapon 0.04; rocket 0.04; resistance 0.04; rioter 0.04; gunboat 0.04; collaborator
0.04; assailant 0.04; thousand 0.04; gunman 0.04; sympathizer 0.04; radio 0.04;
submarine 0.04; attacker 0.04; youth 0.04; camp 0.04; refugee 0.04; dependent 0.04;
combat 0.04; mechanic 0.04; demonstrator 0.04; personnel 0.04; movement 0.04;
gunner 0.04; territory 0.04
The number after a word is the similaritybetween the word and #5C#0Cghter". The intersection
of the similar word list and the above table consists of:
combat 0.04; reconnaissance 0.05; stealth 0.05; transport 0.05;
4. Find a sense of #5C#0Cghter" in WordNet that is most similar to senses of #5Ccombat", #5Creconnais-
sance", #5Cstealth" or #5Ctransport". The #5C#0Cghter aircraft" sense of #5C#0Cghter" was selected.
We submitted two sets of results in MUC-7:
#0F the #5Cnowsd" result in which the senses of a word are chosen simply bychoosing its #0Crst two
senses in the WordNet.
#0F the o#0Ecial result that employs the aboveword sense disambiguation algorithm.
The results are summarized in Table 6. Although the di#0Berence between the use of WSD and the
baseline is quite small, it turns out to be statistically signi#0Ccant. In some of the 20 input texts that
were scored in coreference evaluation, the WSD module did not make any di#0Berence. However,
whenever there was a di#0Berence it was always an improvement. It is also worth noting that, with
WSD, both the recall and precision are increased.
Table 6: Coreference recognition results
Precision Recall F-measure
nowsd 62.7#25 57.5#25 60.0
o#0Ecial 64.2#25 58.2#25 61.1
1
In hindsight, we probably should have just used the #0Crst sense listed in the WordNet for eachword.
9
CONCLUSIONS
The use of collocational statistics greatly improved the performance of our named entity recognition
system. Although collocation-based Word Sense Disambiguation lead only to a small improvement
in coreference recognition, the di#0Berence is nonetheless statistically signi#0Ccant.
ACKNOWLEDGEMENTS
This research is supported by NSERC Research Grant OGP121338 and a research contract awarded
to Nalante Inc. by Communications Security Establishment.

REFERENCES
#5B1#5D Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19#281#29:61#7B74, March 1993.
#5B2#5D Dekang Lin. Principle-based parsing without overgeneration. In Proceedings of ACL#7B93, pages 112#7B120, Columbus, Ohio, 1993.
#5B3#5D Dekang Lin. Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of ACL#2FEACL-97, pages 64#7B71, Madrid, Spain, July 1997.
#5B4#5D Dekang Lin. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL '98, pages 768#7B774, Montreal, Canada, August 1998.
#5B5#5D Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
