MUC-3 LINGUISTIC PHENOMENA TEST EXPERIMEN T
Nancy Chinchor, Ph .D .
Science Applications International Corporatio n
10260 Campus Point Drive, M/S 1 2
San Diego, CA 9212 1
(619) 458-272 8
INTRODUCTION
The evaluation of data extraction systems can be supplemented by determinin g
performance of the systems on a representative selection of linguistic phenomena .
The experiment performed as part of the MUC-3 evaluation was aimed at determinin g
whether linguistic phenomena can be tested in isolation for state-of-the-art dat a
extraction . Although not all of the methods of data extraction used by the partici-
pants in MUC-3 directly process linguistic phenomena, the methods are dealing with
those phenomena in some manner because they are present in the input .
Phenomena testing in MUC-3 is testing according to the characteristics of th e
messages rather than characteristics of the systems processing those messages .
	
In
order to determine the validity of phenomena testing at the current level of syste m
performance, an experiment was run . The results of the experiment indicate that
linguistic phenomena are isolatable and that performance for linguistic phenomen a
can be measured using the MUC-3 scoring system .
DESIGN
The problem is to devise a test to measure the performance of a data extractio n
system with respect to a single linguistic phenomenon . The experiment took several
approaches to devising such a test to determine whether the phenomenon had bee n
isolated . The design of the experiment required the choice of a linguistic phenom-
enon frequently appearing in the messages and critical to the template fill task .
The slots from phrases exhibiting the phenomenon would be scored an d
compared to the overall scores . If there was no correlation with the overall scores ,
then the possibility that overall scores were fully determining the phenomenon' s
scores would be eliminated and isolating the phenomenon would be possible . The
slots filled from the sentences were scored and compared to the overall scores as wel l
as the scores obtained for the slots filled from the phrases . If the scores for the sen-
tences correlated more closely with the scores for the phrases than they did with th e
overall scores, then it would be more likely that the scoring was isolating th e
phenomenon . Processing of the phrase can have an effect on the processing of th e
entire sentence . If the results for the phrases and the sentences coincide, then i t
would be feasible to use scores for slots from entire sentences for future phenomen a
testing .
Slots from phrases exhibiting well-defined subsets of the phenomenon woul d
be scored and compared to each other. The results of the comparisons that can b e
predicted or explained would give us more confidence that we have isolated th e
phenomenon .
31
Altered messages would be produced without the phenomenon for purposes o f
a "minimal pair" type test . The responses would be scored for slots filled from th e
phrases that formerly constituted the phenomenon . The scores would be compared t o
the scores for responses to the original messages containing the phenomenon . The
comparison would provide more information concerning the success of isolating th e
phenomenon .
METHOD
Choice of Phenomeno n
Apposition was chosen as the linguistic phenomenon because of its frequency
of occurrence in messages and its criticalness for slot fills . An example of an apposi-
tive from the test corpus is "David Lecky, Director of the Columbus School ." There
, were approximately 60 sentences in the test corpus containing one or more apposi-
tives which were critical to slot fills . Preliminary phenomena testing for thre e
other phenomena occurring with varying frequencies suggested that a frequency o f
20 was adequate for testing purposes . With more than 60 instances of apposition ,
subdividing the set for testing well-defined subsets would still leave adequate num-
bers in the subdivisions . Also, there many more cases of appositives which affected
slot fills, but could not be included in the testing because there were other source s
for the slot fills elsewhere in the message. This high frequency of occurrence o f
apposition in the messages suggests that it is a phenomenon which systems mus t
handle in some way.
Definition of Apposition
The examples used from the test messages are all cases of noun phrases i n
apposition .
	
Among linguists, there is variation in the liberality with which the ter m
apposition is used .
	
According to Quirk et -al [1], apposition meeting the followin g
three conditions is full apposition :
a. each of the appositives can be separately omitted without affecting th e
acceptability of the sentence;
b. each fulfills the same syntactic function in the resultant sentence ; and
c. there is no difference between the original sentence and either of the
resultant sentences in extralinguistic reference .
An example of full apposition is the following from test message TST2-MUC3 -
0004.
JOSE PARADA GRANDY, THE BOLIVIAN POLICE CHIEF, TOLD EFE THAT A N
UNIDENTIFIED PERSON STEPPED OUT OF A VEHICLE AND PLACED A
PACKAGE IN ONE OF THE PLANT POTS ON JUAN DE LA RIVA STREET, A FE W
METERS FROM THE U.S. EMBASSY IN DOWNTOWN LA PAZ .
"Jose Parada Grandy" and "the Bolivian Police Chief" are in full appositio n
because they each can be omitted resulting in the following acceptable sentences ,
they each are the subject in those sentences, and all three sentences have the same
extralinguistic reference .
32
JOSE PARADA GRANDY TOLD EFE THAT AN UNIDENTIFIED PERSON STEPPE D
OUT OF A VEHICLE AND PLACED A PACKAGE IN ONE OF THE PLANT POTS ON
JUAN DE LA RIVA STREET, A FEW METERS FROM THE U .S. EMBASSY IN
DOWNTOWN LA PAZ .
THE BOLIVIAN POLICE CHIEF TOLD EFE THAT AN UNIDENTIFIED PERSO N
STEPPED OUT OF A VEHICLE AND PLACED A PACKAGE IN ONE OF THE PLAN T
POTS ON JUAN DE LA RIVA STREET, A FEW METERS FROM THE U .S .
EMBASSY IN DOWNTOWN LA PAZ .
Partial apposition occurs when the three conditions are not all met . An
example of partial apposition not meeting condition (a) appears in test message TST2 -
MUC3-0100 .
THE BRAZILIAN EMBASSY IN COLOMBIA HAS CONFIRMED THE RELEASE O F
REDE GLOBO JOURNALIST CARLOS MARCELO WHO WAS KIDNAPPED B Y
COLOMBIAN ARMY OF NATIONAL LIBERATION GUERRILLAS .
The difference between full and partial apposition in this case is trivial requirin g
only the addition of a determiner to "Rede Globo journalist" to make the sentence
omitting "Carlos Marcelo" acceptable .
	
Partial appositives that were omitted from the
phenomenon testing were cases of appositives containing "also" and "alias ."
	
These
were omitted because of their adverbial nature.
Another gray area in choosing examples concerns titles . Quirk et al makes th e
distinction between apposition and institutionalized titles . The authors show the
range from apposition in "critic Paul Jones" to full title in "Mr . Porter" with th e
following examples :
a. critic Paul Jones
the critic Paul Jones (with appositives, a preposed determiner i s
normal but not with titles )
(with appositives, postpositio n
more normal than preposition
whereas the opposite i s
allow postposition )
(appositives and most
without the prope r
determiners )
(most titles and some appositives can be use d
as vocatives )
b. Farmer Brown
the farmer Brow n
?Brown the farme r
the farme r
farmer (vocative )
c. Brother George (family )
my brother George/ ?the brother Georg e
*George the brothe r
the brothe r
brother (vocative )
Paul Jones the critic
the criti c
?critic (vocative)
titles can be used
nouns and wit h
with "the" i s
without "the "
true for titles that
33
d. Professor Brow n
*the professor Brow n
?Brown the professor
the professo r
professor (vocative)
e. Dr. Smith (Ph.D.)
*the doctor Smit h
*Smith the docto r
*the docto r
doctor (vocative )
f. Mr. Porter
*the Mr. Porter
	
(with titles, a preposed determiner is not normal )
*Porter the mister (postposition with "the" is not allowed here )
*the mister
	
(some titles cannot be used without the proper
nouns and with determiners)
*mister (vocative)
	
(most titles can be used as vocatives )
(substandard )
In the MUC-3 messages, the appositives and titles are distinguished by the test s
above with the cut-off between (3) and (4) . For example, "Colonel," "Senator," and
"Ambassador" are titles because the following judgments are similar to those fo r
"Professor" above:
Colonel Heriberto Hernandez
*the Colonel Heriberto Hernandez (with titles, a preposed determiner i s
not normal unless the noun phrase s
are modified restrictively )
?Heriberto Hernandez the Colonel (with titles that allow postposition ,
preposition without "the" is mor e
normal than postposition with "the" )
the Colonel (appositives and most titles can b e
used without the proper nouns an d
with determiners )
Colonel (vocative)
	
(most titles can be used as vocatives )
However, "student" and "peasant" are considered appositives because of th e
following pattern similar to the pattern for "critic" above :
student Mario Flores
the student Mario Flores
Mario Flores the studen t
the studen t
?student (vocative )
Judgments may vary .
	
One possible questionable inclusion as an appositive i s
the phrase "Attorney General ." My judgments follow :
Attorney General Roberto Garcia Alvarad o
the Attorney General Roberto Garcia Alvarad o
Roberto Garcia Alvarado the Attorney Genera l
the Attorney Genera l
Attorney General (vocative)
34
An attempt was made to limit the appositives used in the testing to those mos t
likely to be agreed upon as appositives while still maintaining a reasonable number
of examples.
Construction of the Test Set s
The message sentences containing appositives were extracted from th e
messages for analysis . The examples were put in a file for distribution to the partici-
pants to assist in analysis of their results . This file contained information
concerning the categorization of the appositives and the slots affected by the apposi-
tioned noun phrases and the entire sentence.
The appositives were categorized as postposed versus preposed and simple
versus complex . An example of a postposed appositive is "Jose Parada Grandy, th e
Bolivian Police Chief" and an example of a preposed appositive is "Rede Glob o
journalist Carlos Marcelo ."
	
The subdivision of the appositives according to thei r
complexity was done subjectively based on internal structure and the context . Both
"Jose Parada Grandy, the Bolivian Police Chief" and "Rede Globo journalist Carlo s
Marcelo" were considered simple . Any complexity in an example, such as conjunc-
tion within the appositive, a missing comma, or a comma inside double quotes, pu t
that example in the complex category. Probably the most complex appositioned nou n
phrase in the corpus was in apposition to "peasants" in TST2-MUC3-0036 . The mis-
spelling "Colonal" is part of the message .
THE PEASANT COMMUNAL ASSOCIATION, ACC, CONTINUES TO DEMAND TH E
RELEASE OF PEASANTS BARTOLO RODRIGUEZ, WHO WAS CAPTURED ON 2 7
JANUARY, AND [NAME INDISTINCT] CAPTURED ON 2 FEBRUARY B Y
TROOPS OF COLONAL ORLANDO MONTANO OF THE 6TH INFANTRY BRIGADE .
The most important and difficult activity in constructing phenomena tests is t o
determine the individual slots that could only be filled from the phrase containin g
the phenomenon being tested. The slots that could only be filled by the informatio n
in the appositioned noun phrases as well as in the sentences containing those appo-
sitioned noun phrases were noted . The configuration option files for the scoring
system were constructed to score just those slots directly affected by the presence o f
an appositive . Slots that could have been filled from any other phrase/sentence no t
containing an appositive were excluded from the scoring . This step in the test con-
struction is the most likely point where human error can intrude .
For the purposes of running the "minimal pair" test, a modified version of th e
message file was produced . The messages were altered to contain simple sentence s
expressing the equivalence of the appositioned noun phrases in cases where th e
appositioned noun phrases directly affected at least one slot in the template fill . The
appositive no longer appeared in the original sentence . For example ,
THE BRAZILIAN EMBASSY IN COLOMBIA HAS CONFIRMED THE RELEASE O F
REDE GLOBO JOURNALIST CARLOS MARCELO WHO WAS KIDNAPPED B Y
COLOMBIAN ARMY OF NATIONAL LIBERATION GUERRILLAS .
was replaced b y
THE BRAZILIAN EMBASSY IN COLOMBIA HAS CONFIRMED THE RELEASE O F
CARLOS MARCELO WHO WAS KIDNAPPED BY COLOMBIAN ARMY O F
NATIONAL LIBERATION GUERRILLAS . CARLOS MARCELO IS A REDE GLOB O
JOURNALIST .
35
The "minimal pair" test was voluntary because it required a separate run o f
the data extraction systems on the modified messages .
The scoring of the appositive tests was diluted somewhat by the allowance i n
the scoring guidelines for partial credit to be given when the key contains a com-
plete proper name and the response contains only the identifying part of the name .
It was typical of the appositioned noun phrases that they were the place where th e
full name of the person was introduced with only part of the name being used fro m
then on for reference . A previously undetected bug in the scoring system cause d
one template not affected by apposition to be scored instead of another template tha t
was affected by apposition . For phrases, only 2 slots out of a possible 66 slots (3%)
were potentially affected ; for sentences, 9 slots out of a possible 198 slots (4 .5%) were
potentially affected.
HYPOTHESES
The intent of the testing was to discover whether the scoring isolated the phe-
nomenon of apposition . Each of the following hypotheses was proposed and tested i n
order to uncover evidence of isolation of the phenomenon .
Hypothesis 1 . The systems should score differentally on the appositive s
(both phrasally and sententially) than they did on the overall testing .
Hypothesis 2 . The systems should score higher on the simple r
appositives .
Hypothesis 3 . The systems should score differently on the postpose d
and preposed appositives . It was not possible to hypothesize which
score would be higher. Although postposed appositives are more
prototypical and have indications they are appositives such as comma s
or dashes,
	
preposed appositives lend themselves to treatment a s
adjectives .
Hypothesis 4. The systems should score higher on their responses t o
the messages where simple sentences were substituted for appositives .
RESULTS
The recall and precision scores for the appositive tests appear in Table 1 . Table
2 contains the scores based on the single measure calculated by multiplying recal l
times precision .
Analysis of Result s
Hypothesis 1 asserts that the apposition results are independent of the overal l
performance of the systems . To determine the validity of Hypothesis 1, scatter plots
were made of overall recall versus precision scores for a test run under comparabl e
conditions (Figure 1), the appositive scores for phrases (Figure 2), and the appositiv e
scores for sentences (Figure 3) . Comparing Figures 1 and 2 shows that the scores fo r
apposition are significantly different from the overall scores .
	
The performance b y
systems on apposition is largely independent of their overall scores . The same con-
clusion can be drawn for the appositive scores for sentences by comparing Figures 1
and 3 .
36
Appositive Result s
Site App
R
Ap p
P
Sen
R
Sen
P
Eas y
R
Eas y
P
Har d
R
Har d
P
Pos t
R
Post
P
Pr e
R '
Pr e
P
ADS 0 0 1 40 0 0 0 0 0 0 0 0
BBN 20 31 26 42 35 41 12 21 18 27 23 3 5
GE 32 48 22 54 48 50 20 47 27 36 32 7 3
GTE 2 30 3 32 0 0 2 33 3 33 0 0
HU 20 16 19 29 30 14 14 25 29 23 8 9
ITP 11 54 7 66 7 66 10 43 12 67 10 4 3
LSI 2 14 3 28 3 12 1 25 3 33 0 0
MDC 20 29 14 28 22 38 18 24 27 26 13 4 0
NYU 32 62 21 57 42 77 25 52 23 50 42 7 4
PRC 8 42 10 48 20 57 1 25 12 67 3 2 5
SRI 25 63 17 59 35 58 19 67 26 57 23 7 0
SYN 0 0 0 0 0
	
: 0 0 0 0 0 0 1 0
UM 43 77 32 65 68 84 32 71 45 68 40 9 2
UN 2 25 6 40 2 25 2 25 4 25 0 0
UNI _
	
15 40 11 46 32
	
; 54 _
	
6 21 14 45 18 37
Table 1 :
	
The sites reported recall and precision scores for the appositive phrases ,
the sentences containing appositives, the easy and hard appositives, and the post -
posed and preposed appositives .
Single Appositive Measure s
Site EasyRXP HardRXP PostRXP
	
PreRX P
ADS 0 0 0
	
0
BBN 1435 252 486
	
80 5
GE 2500 940 1008
	
233 6
Cit'E 0 66 99
	
0
HU 420 350 667
	
7 2
ITP 462 430 804
	
43 0
LSI 36 25 99
	
0
MDC 836 432 702
	
52 0
NYU 3234 1300 1150
	
310 8
PRC 1140 25 804
	
7 5
SRI 2030 1273 1482
	
161 0
SYN 0 0 0
	
0
UM 5712 2272 3060
	
368 0
UN 50 50 100
	
0
UNI 1728 126 630
	
66 6
Table 2 :
	
The single measure scores were calculated for comparing easy versus har d
and postposed versus preposed appositives .
3 7
Recall vs. Precision
10 0P
r 8 0
e
c 6 0
s 4 0
2 00
n 0
nI n
I . 'n
0 10 20 30 40 50 60 70 80 90 10 0
Recal l
Figure 1 :
	
A scatter plot shows the scores for overall recall versus precision in a
test run under comparable conditions .
Recall vs Precision for Appositive s
. 1
n P nU
n L
	A
10 0
P 9 0
r 8 0
e 7 0
c 6 0
5 0
s 4 0
3 0
0 2 0
n 1 0
.11
0 .
	_W1n
n H
Nni sq.
.cE
n U N
0
	
10 20 30 40 50 60 70 80 90 10 0
Figure 2 :
	
The recall versus precision scores for appositive phrases shows that th e
performance is different from the overall performance .
38
Recall vs Precision for Appositive Sentence s
Recall
Figure 3 :
	
The recall versus precision scores for appositive sentences are more like
the scores for phrases than like the overall scores .
The scatter plots for appositives scored from phrases and sentences in Figure s
2 and 3, respectively, are more comparable to each other than to the overall score s
suggesting that the use of information from sentences could be a valid test of per-
formance on a phenomenon .
	
Further analysis illustrated in Figures 4 and 5 show s
that the scores for appositives and sentences containing appositives
	
parallel each
other for both recall and precision . These parallelisms affirm that material from
sentences containing a phenomenon can be used for testing that phenomenon an d
also indicate that we may be isolating the phenomenon .
Hypothesis 2 asserts that the systems will score higher on the simple r
appositives than on the more complex ones. The scores for recall are remarkabl y
higher for the easy appositives as opposed to the harder appositives as shown i n
Figure 6. Figure 7 shows a less clear trend for the precision scores . The single
measure of recall times precision, however, shows an unmistakable trend of system s
scoring more highly for the easier appositives. These results give us confidence that
we are isolating the phenomenon of apposition .
10 0
P 9 0
r
	
8 0
e 7 0
c 6 0
i
	
5 0
s
	
40 -
i
	
3 0
q 2 0
n 1 0
0
0
n UM
n'1Cif H
SY
10 20 30 40 50 60 70 80 90 10 0
39
App & App-Sen Recall
4 5
4 0
R 3 5
e 3 0
c 2 5
a 2 0
l
	
1 5
l 1 0
5
0
Site
Figure 4 :
	
The recall scores for appositives and sentences containing appositive s
correlate with each other.
App & App-Sen Precision
-n- App Prec
Sen Prec
Figure 5 :
	
The precision scores for appositives and sentences containing
appositives correlate with each other .
1 2 3 4 5 6 7 8 9 101112131415
-n- App Recall
-~ Sen Recal l
80
• 70
• 60
• 50
4 0
• 3 0 —
1 '
	
1
	
1
	
1
	
1	 1	 1
	
1	 1	 1		 1
					
1 .
	
1
	
1
	
1
1 2 3 4 5 6 7 8 9 10111213141 5
Site
40
App-Easy & App-Hard Recal l
7 0
R 6 0
e 5 0
c 4 0
a 3 0
l
	
2 0
1 0
0
Site
Figure 6 :
	
The recall scores for the easy appositive phrases are generally highe r
than those for the harder phrases .
App-Easy & App-Hard Precisio n
90
• 80 —
70 —
• 60
Site
Figure 7 : The precision scores show a tendency to be higher for the easie r
appositions .
1 2 3 4 5 6 7 8 9101112131415
-•- Easy Recal l
D- Hard Recal l
• 50 —
40 —
30
• 20
• 1 0
0
	
I
	
I
	
!'
	
1
	
11111 1	 II	 I	 I
	
I
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-'- Easy Prec
-0' Hard Prec
41
App-Easy & App-Hard R X P
1 2 3 4 5 6 7 8 910111213141 5
Site
Figure 8 : The single measure scores of recall times precision are higher for the
easier appositions than for the harder ones .
The inability to predict whether postposed or preposed appositives would scor e
higher was actually supported by the data .
	
Hypothesis 3 was born out in that th e
systems did score differently on the two types of appositives. There was no clear
trend in the results as to which kind of apposition was easier. The recall, precision ,
and single measure scores are shown in Figures 9 through 11 . Notice that the result s
were predicted providing further evidence that the phenomenon of apposition i s
being isolated . It would be interesting to look at the methods of processing the tw o
types of appositives for each of the systems to see why their scores are as they are .
App-Post & App-Pre Recall
-'- Post Recall
Pre Recall
Figure 9 : The recall scores for postposed and preposed appositives show difference s
in the scores but no clear trend as to which is easier to process .
6000 —
R 5000 —
4000 —
X 3000
2000
P 1000
0
n EasyRX P
Hard RX P
1 2 3 4 5 6 7 8 910111213141 5
Site
42
App-Post & App-Pre Precision
-'- Post Prec
"D- Pre Prec
Figure 10 : The precision scores for postposed and preposed appositives sho w
differences in the scores but no clear indication as to which is easier to process .
App-Post & App-Pre R X P
400 0
350 0
R 300 0
250 0
X 200 0
150 0
P 100 0
	
500
	
n
	
0
	
I I	 I	 i	 I	 i	 I	 /I
1 2 3 4 5 6 7 8 910111213141 5
Site
Figure 11 : The single measure of recall times precision for postposed and prepose d
appositives shows that the systems score differently on the two but neither i s
consistently easier.
Hypothesis 4 predicts that the systems will score higher for the message s
containing simple sentences in place of the appositives . Two sites volunteered to ru n
this part of the test and they both contradicted the hypothesis . Their scores are
shown in Table 3 alongside their scores for the messages containing the appositione d
phrases . On further analysis, it was found that the introduction of the simple sen-
tences made the task more complex in both cases. Apparently, the appositioned nou n
phrases convey the information more simply than a separate sentence containing a
copula and requiring reference resolution .
	
The systems, for various reasons, tende d
1 2 3 4 5 6 7 8 910111213141 5
Site
Post RX P
D- Pre R X P
43
not to use the information in the separate sentence . The recall scores are thu s
lower. The precision scores are somewhat affected . The results show an explanable
effect on the scores lending further credence to the claim that the appositio n
phenomena is being isolated .
VOLUNTARY
Site Recall Precision App R App P
NYU 28 53 32 6 2
UMASS 3 8 6 8 4 3 7 7
Table 3 : The recall and precision scores for the voluntary "minimal pair" test fo r
the messages without apposition and the messages with apposition show an effect o f
modifying the appositioned noun phrases .
CONCLUSION S
In summary, the systems scored differently on the appositives than they di d
on the overall testing suggesting that the testing may be isolating the phenomeno n
of apposition . The systems scored similarly on the slots filled from phrase s
containing appositives and sentences containing appositives suggesting that infor-
mation from sentences could be used to test phenomena . Because the processing o f
apposition can affect the processing of the entire sentence, the parallel results i n
these scores further suggests that the phenomenon of apposition is being isolated .
The systems scored markedly higher on the simpler appositives as opposed to th e
more complex ones . These results are perhaps the strongest evidence that it is pos -
sible to isolate the phenomenon of apposition by scoring slot fills . The system s
scored differently on the postposed and preposed appositives . It would be interesting
to look at the methods employed by each system with respect to these classes o f
appositives. It was predicted that neither class would be clearly easier . The fact that
this prediction was correct provides strong support for the claim that apposition i s
being isolated . The systems scored lower on their responses to the messages where
simple sentences were substituted for appositives . The effect on the scores, althoug h
unexpected, still supports the isolatability of apposition . In some of the more well -
defined trends, the anomalies noticed are often for the lower scoring systems .
However, the systems are scoring highly enough overall at this stage of development
for the phenomena scores to be meaningful . In conclusion, there are strong indica-
tions that the phenomenon of apposition has been isolated by the testing and tha t
performance on apposition can be scored using the MUC-3 scoring system .
Further Researc h
Further work in phenomena testing should now be focused on carefull y
developing a representative selection of phenomena tests for the messages . The
evaluation of data extraction systems can be enhanced by determining performanc e
of the systems on these linguistic phenomena . Phenomena testing should be done a t
various linguistic levels including the word level, phrase level, sentence level ,
intersententially, and the level of discourse .
	
Testing according to the linguisti c
characteristics of the messages would
	
encourage the data extraction systems t o
improve capabilities applicable to other domains .
44
REFERENCES
[1] Quirk, R., Greenbaum, S ., Leech, G., and
	
Svartvik, J., A Grammar of
Contemporary English (London : Longman Group Limited, 1984) .
45
PART II : TEST RESULTS AND ANALYSIS
(SITE REPORTS )
The papers in this section were prepared by each of the fifteen sites that
completed the MUC-3 evaluation . The papers are intended to provide the reade r
with some context for interpreting the test results, which are presented more full y
in appendices F and G of the proceedings . The sites were asked to comment on the
following aspects of their MUC-3 experience :
* Explanation of test settings (precision/recall/overgeneration )
and how these settings were chose n
* Where bulk of effort was spent, and how much time was spen t
overall .on MUC-3
* What the limiting factor was (time, people, CPU cycles ,
knowledge, . . . )
* How the training of the system was don e
-
	
What proportion of the training data was used (and how )
Whether/Why/How the system improved over time, an d
how much of the training was automate d
* What was successful and what wasn't, and what system modul e
you would most like to rewrit e
* What portion of the system is reusable on a different applicatio n
* What was learned about the system, about a MUC-like task ,
about evaluation
