CRL/NMSU
Description of the CRL/NMSU Systems Used for MUC-6
Jim Cowie
Computing Research Laboratory
New Mexico State University
jcowie@nmsu.edu
(505) 646-5181
Introduction
This paper discusses the two CRL named entity recognition systems submitted for MUC-6. The
systems are based on entirely different approaches. The first is a data intensive method, which
uses human generated patterns. The second uses the training data to develop decision trees which
detect the start and end points of names.
Background
CRL submitted two systems for the Named Entity task. One of these (Basic) is an improved
version of the CRL name recognizer developed in phase one of Tipster [1]. The second (AutoLearn)
is a system which learns automatically from training data. The Basic system took approximately
six man-months of work in its original development. Improvements for MUC-6 were carried out
by one graduate student (about one man-month). The AutoLearn system was developed by one
graduate student specifically for MUC-6 (also one man-month).
Availability
The Basic system can be accessed for testing through the mail server ne_tag@crl.nmsu.edu. The
subject field of the message should consist of the word "tag". A web interface to the tagger is also
available from CRL's home page. Source code and data files can be obtained by ftp from CRL (mail
cable@crl.nmsu.edu for the ftp address and access password). The system has been successfully
tested on Linux running on a PC.
Basic NE System
The Basic system consists of a pipeline of Unix processes. These can be identified as carrying out
four different types of task:
1. Recognizing names based on character patterns (numbers, dates).
2. Recognizing names based on pre-stored names.
3. Applying mark-up to potential components of complete names.
4. Recognizing names based on patterns of components.
For the most part the system is context free. A few of the patterns used require some additional
context before a name is recognized. For example, an ambiguous human name in isolation may be
recognized if it is followed closely by a title.
The system consists of a suite of C and lex programs.
Component units recognized by the system are cities, provinces, countries, company prefixes and
suffixes, company beginning and ending words (Club, Association etc.), unambiguous and
ambiguous human first and last names, titles and human position words.
Additional Fix-up Procedures
Final patterns are used to join together units of the same type which are immediately next to each
other in the text.
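As a rough illustration of this joining step, the sketch below merges tagged spans of the same type when only whitespace separates them. The `(start, end, type)` span layout, the function name `join_adjacent` and the sample sentence are assumptions made for the example; the Basic system itself is written in C and lex, and its internal data structures are not described in this paper.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, unit type); hypothetical layout


def join_adjacent(text: str, spans: List[Span]) -> List[Span]:
    """Merge neighbouring spans of the same type separated only by whitespace."""
    merged: List[Span] = []
    for start, end, kind in sorted(spans):
        if merged:
            prev_start, prev_end, prev_kind = merged[-1]
            gap = text[prev_end:start]
            # Join only when the types agree and nothing but whitespace lies between.
            if kind == prev_kind and prev_end <= start and gap.strip() == "":
                merged[-1] = (prev_start, end, kind)
                continue
        merged.append((start, end, kind))
    return merged


text = "He flew from New York City to Boston"
spans = [(13, 21, "city"), (22, 26, "city"), (30, 36, "city")]
joined = join_adjacent(text, spans)
```

Here "New York" and "City" collapse into one unit, while "Boston" stays separate because the word "to" intervenes.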
After all the pattern-based procedures have operated on the text, a final pass is made to recognize
abbreviated forms of names. This takes the lists of names found so far and truncates them (right
to left for persons and left to right for companies). These new lists are then used as lists of known
organizations and persons, and any occurrences of these in the text are marked. In particular, for
organizations the headline is not processed apart from this last stage. This avoids recognition of
organizations such as "Leaves Bank". The assumption that names mentioned in the heading will
be repeated in the body of the text holds almost universally.
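The truncation pass might look something like the sketch below. It rests on one plausible reading of "right to left for persons and left to right for companies": shorter person forms are built from the right-hand end of the full name (so the surname survives), and shorter company forms from the left-hand end. That reading, the function name `truncated_forms` and the company name "Acme Widget Corp." are all assumptions, not details taken from the system.

```python
from typing import List


def truncated_forms(name: str, kind: str) -> List[str]:
    """Return shorter forms of a recognized name (assumed reading of the text).

    Persons keep words from the right end ("Robert L. James" -> "James"),
    organizations keep words from the left end.
    """
    words = name.split()
    forms = []
    for n in range(1, len(words)):
        kept = words[-n:] if kind == "person" else words[:n]
        forms.append(" ".join(kept))
    return forms


person_forms = truncated_forms("Robert L. James", "person")
company_forms = truncated_forms("Acme Widget Corp.", "organization")
```

The resulting strings would then be added to the lists of known persons and organizations before the final marking pass. A practical version would presumably filter out forms that are too ambiguous to be useful, which the paper does not discuss.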
Data Sources
The data used in the Basic system is derived from public domain sources: university phone lists,
the Tipster Gazetteer and government databases of company names.
Performance
The performance of the Basic system for the test set and for the walk-through article is given in
Appendix A. Overall performance was Recall - 85% and Precision - 87%, giving an F measure of 85.8.
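For readers who want to check the F measure, the sketch below recomputes it from raw counts using the standard weighted form F = (b^2 + 1)PR / (b^2 P + R). The counts are those read from the ALL OBJECTS row of the Basic system table in Appendix A (COR 1912, POS 2258, ACT 2198). Partial matches are ignored because that row reports none. The mapping of the 2P&R and P&2R columns to b = 0.5 and b = 2 is an inference from the fact that these settings reproduce the printed values; the function name is hypothetical.

```python
def f_measure(correct: int, possible: int, actual: int, beta: float = 1.0) -> float:
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R), returned as a percentage."""
    precision = correct / actual
    recall = correct / possible
    b2 = beta * beta
    return 100.0 * (b2 + 1.0) * precision * recall / (b2 * precision + recall)


# Counts as read from the ALL OBJECTS row of the Basic system table in Appendix A.
COR, POS, ACT = 1912, 2258, 2198

f_pr = f_measure(COR, POS, ACT)              # P&R, equal weighting
f_2pr = f_measure(COR, POS, ACT, beta=0.5)   # 2P&R, precision weighted more
f_p2r = f_measure(COR, POS, ACT, beta=2.0)   # P&2R, recall weighted more
```

Computing from the unrounded ratios is what yields 85.8; plugging in the rounded 85% and 87% would give a slightly different figure.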
Walk through article
Performance here was Recall - 68% and Precision - 83%.
The main source of error was missing patterns in the system. For example, Robert L. James was
partially recognized (as L. James), and McCann-Erickson was missed as no hyphenated company
pattern had been added. Once a frequently mentioned name is missed in its full form, the system
unfortunately misses all abbreviated forms. This article also shows the importance of context in
reliably recognizing some names (e.g. an analyst with PayneWebber).
AutoLearn NE System
The AutoLearn system was developed to explore the possibility of using simple learning
algorithms to detect specific features in text. An implementation of Quinlan's ID3 algorithm was used
[2,3]. This algorithm constructs a decision tree which decides whether an element of a collection
satisfies a property or not. Each element of a collection has a finite number of attributes, each of
which may take one of several values. Quinlan's original paper suggests the range of values of the
attributes should be "small". In the case of the AutoLearn system the values are every word
occurring in the training collection.
Collections for Name Recognition
In order to apply the ID3 algorithm the data needs to be structured into a collection, each member
of which has specific values for a set of attributes and for each of which it is known whether the
member has a specific property or not. For the name recognition problem the training data was
converted into tuples of five words. Each tuple was marked as having the start (or end) of a specific
type of proper name at the middle word of the tuple. This data can be easily generated from the
training articles. Thus for the beginning of a person -
many differences between Robert L. -1
differences between Robert L. James 1
between Robert L. James , -1
etc.
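The conversion into five-word tuples can be sketched as follows. Padding the text edges with a filler token so that every word can sit in the middle position is an assumption, as are the names `make_tuples` and `PAD`; the paper does not say how the first and last two words of an article are handled.

```python
from typing import List, Set, Tuple

WINDOW = 5  # tuple length used in the paper
HALF = WINDOW // 2
PAD = "<pad>"  # hypothetical filler for positions beyond the text edges


def make_tuples(tokens: List[str], starts: Set[int]) -> List[Tuple[Tuple[str, ...], int]]:
    """Build one five-word tuple per token, labelled 1 if a name starts at the middle word, else -1."""
    padded = [PAD] * HALF + tokens + [PAD] * HALF
    examples = []
    for i in range(len(tokens)):
        window = tuple(padded[i:i + WINDOW])  # tokens[i] sits at the middle position
        label = 1 if i in starts else -1
        examples.append((window, label))
    return examples


tokens = "many differences between Robert L. James ,".split()
examples = make_tuples(tokens, starts={3})  # a person name begins at "Robert"
```

The entries at indices 2, 3 and 4 of `examples` correspond to the three tuples listed above. The same routine, given the set of end positions instead, would produce the data for an "end of name" tree.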
Fourteen sets of training data were generated using the 318 development articles supplied for
MUC-6. The quality of the tagging is not particularly uniform, but no attempt has been made to
improve this.
Generating the decision trees
As each word of the training data is read it is hashed and stored in a hash table. Thereafter words
are referred to by their hash values. For each of the values of the five attributes (words 1 through
5) a count is maintained of the number of times this value contributed to an element holding a
proper name occurrence at the middle attribute. The attribute to be tested first is chosen by
computing for each value the relative frequency of positive and negative outcomes (p+ and p-) for
this value. This is used to approximate the information content of that attribute:
$$-p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-} \qquad \text{(EQ 1)}$$
The sum of the approximate information contents for each column is calculated and the column
with the highest value is chosen as the primary decision. Here all the values which always
contributed to a positive outcome are used as the primary decision. Values which are always negative are
ignored (this is primarily to reduce the size of the data being handled). New sub-collections are
formed, one for each value which contributed to both positive and negative outcomes, containing
the elements that hold that value, and the tree building process is repeated for each of these new
collections.
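One way to read the column selection and splitting steps is sketched below. It follows the prose literally: EQ 1 is evaluated for every value seen in a column, the results are summed per column, and the column with the highest sum is taken. The function names, the use of plain strings instead of hash values, and the three-word toy data are assumptions made to keep the example short; this is not the authors' code.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[str], int]  # (word tuple, label of +1 or -1)


def value_information(pos: int, neg: int) -> float:
    """EQ 1 for one value: -p+ log2 p+ - p- log2 p- (0 log 0 taken as 0)."""
    total = pos + neg
    info = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            info -= p * math.log2(p)
    return info


def choose_column(examples: List[Example]) -> int:
    """Sum EQ 1 over the values seen in each column and return the top column."""
    width = len(examples[0][0])
    scores = []
    for col in range(width):
        counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
        for words, label in examples:
            counts[words[col]][0 if label > 0 else 1] += 1
        scores.append(sum(value_information(p, n) for p, n in counts.values()))
    return max(range(width), key=scores.__getitem__)


def split_on_column(examples: List[Example], col: int):
    """Separate always-positive values from mixed ones; always-negative values are dropped."""
    by_value: Dict[str, List[Example]] = defaultdict(list)
    for ex in examples:
        by_value[ex[0][col]].append(ex)
    positive = {v for v, exs in by_value.items() if all(lab > 0 for _, lab in exs)}
    mixed = {v: exs for v, exs in by_value.items()
             if any(lab > 0 for _, lab in exs) and any(lab < 0 for _, lab in exs)}
    return positive, mixed


# Toy three-word tuples (hypothetical data) labelled for "location beginning".
toy = [
    (("in", "Reading", "today"), 1),
    (("was", "Reading", "books"), -1),
    (("to", "Bath", "by"), 1),
    (("a", "Bath", "towel"), -1),
    (("from", "Paris", "on"), 1),
]
best = choose_column(toy)
positive, mixed = split_on_column(toy, best)
```

In this toy collection the middle column is chosen; "Paris" becomes part of the primary decision, and "Reading" and "Bath" each give rise to a sub-collection on which the procedure would be repeated.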
The decision trees thus formed can be output in a readable if somewhat lengthy form. In most
cases the first choice is the third word in a group taking one of a large number of values.
Thereafter a group of fairly impenetrable tests occurs. For example, for location beginnings -
If word 3 is one of the following - Milwaukee Ridgefild Pa ST.. (around
300 more words) then location_beginning
else if word 3 is Illinois and word 1 is Indiana then location_beginning
else if word 3 is Northeast and word 1 is `in' then location_beginning
The printed decision table takes about 5 pages.
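The printed rules can be read as an ordered list of tests applied to each five-word group. The sketch below hand-transcribes the three tests shown above (the remaining 300 or so words of the first test are not reproduced) and slides a window over a made-up sentence. The padding token, the function names and the sample sentence are assumptions for illustration.

```python
from typing import Callable, List, Sequence

Window = Sequence[str]  # five words; "word 3" in the printed rules is index 2

# A few of the printed tests for location beginnings, transcribed by hand.
# The real first test lists around 300 more words, which are not reproduced here.
FIRST_TEST_WORDS = {"Milwaukee", "Ridgefild", "Pa", "ST.."}

RULES: List[Callable[[Window], bool]] = [
    lambda w: w[2] in FIRST_TEST_WORDS,
    lambda w: w[2] == "Illinois" and w[0] == "Indiana",
    lambda w: w[2] == "Northeast" and w[0] == "in",
]


def is_location_beginning(window: Window) -> bool:
    """Return True as soon as any rule in the ordered list fires."""
    return any(rule(window) for rule in RULES)


def scan(tokens: List[str], pad: str = "<pad>") -> List[int]:
    """Slide a five-word window over the text and collect indices tagged as beginnings."""
    padded = [pad, pad] + tokens + [pad, pad]
    return [i for i in range(len(tokens)) if is_location_beginning(padded[i:i + 5])]


hits = scan("He moved to Milwaukee in the Northeast last year".split())
```

"Milwaukee" fires the first test on its own, whereas "Northeast" is accepted only because the word two positions earlier is "in", which shows how the later tests lean on context.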
Running the AutoLearn System
A pass through the texts is made for each decision tree (beginning and end) of each named entity.
First the hash table of words and the corresponding decision tree are read. The text is then
processed in groups of five words. Whenever a positive decision is made a new tag is added to the
output stream.
Ideally at this stage the tagging would be done. However, given that we are processing new texts,
there are many occasions where an end or a beginning is identified, but the corresponding
beginning or end is not. For example, a surname may have been seen previously, but not the attached
forename. At this point a heuristic is applied which, for every unmatched bracket in the text, works
forward or backward until some appropriate point is reached. The actual skipping heuristics need
to be different for organizations, persons, locations, dates and numbers.
Data Sources
The only data source used for the AutoLearn system was the 318 MUC-6 training texts.
Performance
A high precision was expected from this system. Most of the errors that occur are due to failures
of the bracket insertion heuristic. The overall scores were Recall - 47% and Precision - 81%, giving
an F measure of 59.3.
No specific code was inserted to handle numbers or dates. The method was more successful with
organizations and locations than with persons. More training data is perhaps required to make the
system aware of the spread of examples for human names.
Walk through article
The performance here was Recall - 36% and Precision - 88%.
The major problem here is that the system has not learned a rule which uses "Mr." to identify the
word previous to a name.
Relationship of Performance to Amount of Training Data
The evaluation texts were processed with decision trees generated using subsets of the MUC-6
development data. This was done in increasing units of 50 texts. The results are shown in Figure 1
below. Both recall and precision increase with increasing training data. Precision appears to tail
off at around 82%. Recall, however, increases (with one exception) steadily over the whole range.
FIGURE 1. Relationship of Performance to Amount of Training (vertical axis: score; horizontal axis: number of training files)
Future Developments
We intend to rebuild the Basic system. One of the principal drawbacks of the system is its
sequential application of component tags. In many cases a second tag is not applied because the word or
phrase is ambiguous. The correct solution here is to apply all tags in a manner that allows the
correct tags to be selected by the pattern processing mechanisms. In addition we plan to improve our
collection of patterns. The current version of the system is being made generally available. This,
we hope, will provide us with some feedback on patterns and errors in the data files.
Some further experiments are also planned with the AutoLearn system. The main drawback with
the system is that it doesn't make maximal use of the training data, in that with small
training samples one word may be sufficient to make a decision. This situation can probably be
improved by replacing specific words with a NULL word. This will force the system to develop
rules based more on context. In particular, when the system encounters unknown words these will
be considered equivalent to the NULL word.
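A minimal sketch of the NULL-word idea follows. The paper does not say which words would be replaced, so the policy used here (replace any word seen fewer than `min_count` times in training, and any word unseen at tagging time) is purely an assumption, as are the `<NULL>` spelling and the function names.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set

NULL = "<NULL>"  # hypothetical spelling of the NULL word


def build_vocabulary(training: Iterable[Sequence[str]], min_count: int = 2) -> Set[str]:
    """Keep words seen at least min_count times; the threshold is an assumed policy."""
    counts = Counter(word for window in training for word in window)
    return {word for word, n in counts.items() if n >= min_count}


def nullify(window: Sequence[str], vocabulary: Set[str]) -> List[str]:
    """Replace every word outside the vocabulary with the NULL word."""
    return [word if word in vocabulary else NULL for word in window]


training = [
    ("said", "Mr.", "James", "of", "the"),
    ("said", "Mr.", "Smith", "of", "the"),
]
vocabulary = build_vocabulary(training)
unseen = nullify(("said", "Mr.", "Jones", "of", "the"), vocabulary)
```

With the surnames collapsed to the NULL word, both training tuples become identical, so any rule learned from them has to rest on the surrounding words such as "Mr." rather than on the name itself; the unseen surname "Jones" then maps to the same pattern.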
We also intend to apply the learning method described here to other NLP tasks such as part of
speech tagging and disambiguation.
References
[1] Cowie, J., Guthrie, L., Pustejovsky, J., Waterman, S., and Wakao, T. The CRL/Brandeis
System as Used for MUC-5. In Proceedings of the Fifth Message Understanding Conference
(MUC-5), Baltimore, Md., Morgan Kaufmann, 1993.
[2] Quinlan, J.R. Discovering Rules by Induction from Large Collections of Examples. In Expert
Systems in the Micro-Electronic Age, ed. Michie, D., Edinburgh University Press, 1979.
[3] Quinlan, J.R. Machine Learning: Easily Understood Decision Rules. In Computer Systems
that Learn, eds. Weiss, S.M. and Kulikowski, C.A., Morgan Kaufmann, 1991.
Appendix A - Basic System Scores
SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
<enamex>       925  889 |  841   0   0 |  48   84   0 |  91  95   9   5  14   0
type           925  889 |  784   0  57 |  48   84   0 |  85  88   9   5  19   7
text           925  889 |  755   0  86 |  48   84   0 |  82  85   9   5  22  10
subtotals     1850 1778 | 1539   0 143 |  96  168   0 |  83  86   9   5  21   8
<timex>        111  108 |  102   0   0 |   6    9   0 |  92  94   8   6  13   0
type           111  108 |  102   0   0 |   6    9   0 |  92  94   8   6  13   0
text           111  108 |   92   0  10 |   6    9   0 |  83  85   8   6  21  10
subtotals      222  216 |  194   0  10 |  12   18   0 |  87  90   8   6  17   5
<numex>         93  102 |   91   0   0 |  11    2   0 |  98  89   2  11  12   0
type            93  102 |   91   0   0 |  11    2   0 |  98  89   2  11  12   0
text            93  102 |   88   0   3 |  11    2   0 |  95  86   2  11  15   3
subtotals      186  204 |  179   0   3 |  22    4   0 |  96  88   2  11  14   2
ALL OBJECTS   2258 2198 | 1912   0 156 | 130  190   0 |  85  87   8   6  20   8
MATCHED ONLY  2068 2068 | 1912   0 156 |   0    0   0 |  92  92   0   0   8   8
-------------------------------------------------------------------------------
                 P&R   2P&R   P&2R
F-MEASURES     85.82  86.52  85.13

SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
Enamex:
organization   442  392 |  330   0  40 |  22   72   0 |  75  84  16   6  29  11
person         373  378 |  353   0   9 |  16   11   0 |  95  93   3   4   9   2
location       110  119 |  101   0   8 |  10    1   0 |  92  85   1   8  16   7
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
Timex:
date           111  108 |  102   0   0 |   6    9   0 |  92  94   8   6  13   0
time             0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
Numex:
money           76   77 |   74   0   0 |   3    2   0 |  97  96   3   4   6   0
percent         17   25 |   17   0   0 |   8    0   0 | 100  68   0  32  32   0
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *

* * * DOCUMENT SECTION SCORES * * *
SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
HL             136  130 |  115   0  11 |   4   10   0 |  84  88   7   3  18   9
DD              60   60 |   60   0   0 |   0    0   0 | 100 100   0   0   0   0
DATELINE        52   50 |   49   0   1 |   0    2   0 |  94  98   4   0   6   2
TXT           2010 1958 | 1688   0 144 | 126  178   0 |  84  86   9   6  21   8
Appendix A - Autolearn System Scores
SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
<enamex>       915  514 |  469   0   0 |  45  446   0 |  51  91  49   9  51   0
type           915  514 |  445   0  24 |  45  446   0 |  49  86  49   9  54   5
text           915  514 |  371   0  98 |  45  446   0 |  40  72  49   9  61  21
subtotals     1830 1028 |  816   0 122 |  90  892   0 |  44  79  49   9  58  13
<timex>        111   65 |   59   0   0 |   6   52   0 |  53  91  47   9  50   0
type           111   65 |   59   0   0 |   6   52   0 |  53  91  47   9  50   0
text           111   65 |   48   0  11 |   6   52   0 |  43  74  47   9  59  19
subtotals      222  130 |  107   0  11 |  12  104   0 |  48  82  47   9  54   9
<numex>         93   72 |   65   0   0 |   7   28   0 |  70  90  30  10  35   0
type            93   72 |   65   0   0 |   7   28   0 |  70  90  30  10  35   0
text            93   72 |   63   0   2 |   7   28   0 |  68  88  30  10  37   3
subtotals      186  144 |  128   0   2 |  14   56   0 |  69  89  30  10  36   2
ALL OBJECTS   2238 1302 | 1051   0 135 | 116 1052   0 |  47  81  47   9  55  11
MATCHED ONLY  1186 1186 | 1051   0 135 |   0    0   0 |  89  89   0   0  11  11
-------------------------------------------------------------------------------
                 P&R   2P&R   P&2R
F-MEASURES     59.38  70.57  51.25

SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
Enamex:
organization   433  287 |  251   0   8 |  28  174   0 |  58  87  40  10  46   3
person         372  155 |  131   0  14 |  10  227   0 |  35  84  61   6  66  10
location       110   72 |   63   0   2 |   7   45   0 |  57  88  41  10  46   3
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
Timex:
date           111   65 |   59   0   0 |   6   52   0 |  53  91  47   9  50   0
time             0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
Numex:
money           76   55 |   52   0   0 |   3   24   0 |  68  94  32   5  34   0
percent         17   17 |   13   0   0 |   4    4   0 |  76  76  24  24  38   0
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *

* * * DOCUMENT SECTION SCORES * * *
SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
HL             128   92 |   65   0   7 |  20   56   0 |  51  71  44  22  56  10
DD              60    0 |    0   0   0 |   0   60   0 |   0   * 100   * 100   *
DATELINE        52   24 |   24   0   0 |   0   28   0 |  46 100  54   0  54   0
TXT           1998 1186 |  962   0 128 |  96  908   0 |  48  81  45   8  54  12
Appendix A - Basic System Walk-through Scores
SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
<enamex>        66   52 |   47   0   0 |   5   19   0 |  71  90  29  10  34   0
type            66   52 |   42   0   5 |   5   19   0 |  64  81  29  10  41  11
text            66   52 |   42   0   5 |   5   19   0 |  64  81  29  10  41  11
subtotals      132  104 |   84   0  10 |  10   38   0 |  64  81  29  10  41  11
<timex>          6    5 |    5   0   0 |   0    1   0 |  83 100  17   0  17   0
type             6    5 |    5   0   0 |   0    1   0 |  83 100  17   0  17   0
text             6    5 |    5   0   0 |   0    1   0 |  83 100  17   0  17   0
subtotals       12   10 |   10   0   0 |   0    2   0 |  83 100  17   0  17   0
<numex>          6    7 |    6   0   0 |   1    0   0 | 100  86   0  14  14   0
type             6    7 |    6   0   0 |   1    0   0 | 100  86   0  14  14   0
text             6    7 |    6   0   0 |   1    0   0 | 100  86   0  14  14   0
subtotals       12   14 |   12   0   0 |   2    0   0 | 100  86   0  14  14   0
ALL OBJECTS    156  128 |  106   0  10 |  12   40   0 |  68  83  26   9  37   9
MATCHED ONLY   116  116 |  106   0  10 |   0    0   0 |  91  91   0   0   9   9
-------------------------------------------------------------------------------
                 P&R   2P&R   P&2R
F-MEASURES     74.65  79.34  70.48

SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
Enamex:
organization    29   12 |    5   0   5 |   2   19   0 |  17  42  66  17  84  50
person          35   38 |   35   0   0 |   3    0   0 | 100  92   0   8   8   0
location         2    2 |    2   0   0 |   0    0   0 | 100 100   0   0   0   0
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
Timex:
date             6    5 |    5   0   0 |   0    1   0 |  83 100  17   0  17   0
time             0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
Numex:
money            5    6 |    5   0   0 |   1    0   0 | 100  83   0  17  17   0
percent          1    1 |    1   0   0 |   0    0   0 | 100 100   0   0   0   0
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *

* * * DOCUMENT SECTION SCORES * * *
SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
HL               8    6 |    6   0   0 |   0    2   0 |  75 100  25   0  25   0
DD               2    2 |    2   0   0 |   0    0   0 | 100 100   0   0   0   0
DATELINE         0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
TXT            146  120 |   98   0  10 |  12   38   0 |  67  82  26  10  38   9
Appendix A - Autolearn System Walk-through Scores
SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
<enamex>        69   22 |   20   0   0 |   2   49   0 |  29  91  71   9  72   0
type            69   22 |   20   0   0 |   2   49   0 |  29  91  71   9  72   0
text            69   22 |   18   0   2 |   2   49   0 |  26  82  71   9  75  10
subtotals      138   44 |   38   0   2 |   4   98   0 |  28  86  71   9  73   5
<timex>          6    5 |    5   0   0 |   0    1   0 |  83 100  17   0  17   0
type             6    5 |    5   0   0 |   0    1   0 |  83 100  17   0  17   0
text             6    5 |    4   0   1 |   0    1   0 |  67  80  17   0  33  20
subtotals       12   10 |    9   0   1 |   0    2   0 |  75  90  17   0  25  10
<numex>          6    6 |    6   0   0 |   0    0   0 | 100 100   0   0   0   0
type             6    6 |    6   0   0 |   0    0   0 | 100 100   0   0   0   0
text             6    6 |    5   0   1 |   0    0   0 |  83  83   0   0  17  17
subtotals       12   12 |   11   0   1 |   0    0   0 |  92  92   0   0   8   8
ALL OBJECTS    162   66 |   58   0   4 |   4  100   0 |  36  88  62   6  65   6
MATCHED ONLY    62   62 |   58   0   4 |   0    0   0 |  94  94   0   0   6   6
-------------------------------------------------------------------------------
                 P&R   2P&R   P&2R
F-MEASURES     50.88  68.08  40.62

SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
Enamex:
organization    32   12 |   12   0   0 |   0   20   0 |  38 100  62   0  62   0
person          35    7 |    6   0   0 |   1   29   0 |  17  86  83  14  83   0
location         2    3 |    2   0   0 |   1    0   0 | 100  67   0  33  33   0
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
Timex:
date             6    5 |    5   0   0 |   0    1   0 |  83 100  17   0  17   0
time             0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
Numex:
money            5    5 |    5   0   0 |   0    0   0 | 100 100   0   0   0   0
percent          1    1 |    1   0   0 |   0    0   0 | 100 100   0   0   0   0
other            0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *

* * * DOCUMENT SECTION SCORES * * *
SLOT           POS  ACT |  COR PAR INC | SPU  MIS NON | REC PRE UND OVG ERR SUB
-------------------------------------------------------------------------------
HL               8    2 |    2   0   0 |   0    6   0 |  25 100  75   0  75   0
DD               2    0 |    0   0   0 |   0    2   0 |   0   * 100   * 100   *
DATELINE         0    0 |    0   0   0 |   0    0   0 |   *   *   *   *   *   *
TXT            152   64 |   56   0   4 |   4   92   0 |  37  88  60   6  64   7
