APPENDIX B:
TEST PROCEDURES
1. GENERAL INSTRUCTIONS
Testing may be done any time during the week of 6-12 May. The only requirement is that all reports (see section 4, below) be received by NOSC by first thing Monday morning, 13 May. Permission to attend MUC-3 at NOSC on 21-23 May may be revoked if you do not meet this deadline!
To complete the required testing, you will need approximately the same amount of time as it would normally take you to run 100 texts in DEV and interactively score them, plus some extra time for careful interactive scoring (since the resulting history file is to be used for all passes through the scoring program) and for initializing the scoring program with the different configuration files required for the various linguistic phenomena tests. If you carry out the optional testing, you will also need to allow time to generate at least a couple of new sets of response templates and to add to the history file as needed during the additional scoring runs.
IF YOU INTEND TO CARRY OUT ANY OF THE OPTIONAL TESTING, YOU MUST REPORT THE PLANNED "PARAMETER SETTINGS" TO NOSC FOR BOTH THE REQUIRED TEST AND THE OPTIONAL TESTING BEFORE STARTING THE TEST PROCEDURE. This means that you should describe, in some meaningful terms, SPECIFICALLY how you will alter the behavior of the system so that it will produce each of the different tradeoffs in metrics described in the sections below.
1.1 REQUIRED TESTING: MAXIMIZED RECALL/PRECISION TRADEOFF
To ensure comparability among the test results for all systems, THE REQUIRED TESTING MUST BE CONDUCTED WITH THE SYSTEM SET TO MAXIMIZE THE TRADEOFF BETWEEN RECALL AND PRECISION IN THE MATCHED/MISSING ROW IN THE SCORE SUMMARY REPORT. Maximizing the tradeoff does not mean an ADDITIVE maximization; rather, the total scores for the two metrics should be as close together and as high as possible. For most systems, this is probably the normal way the system operates.
Several passes through the scoring program will be required: one for the official test on generating templates for the whole test set, and the others for the experimental tests on generating the specific slots called out by the linguistic phenomena tests. You generate only one set of system responses, and only the first pass through the scoring program will require user interaction. The history file produced during this interaction will be used in the scoring of the linguistic phenomena tests. (It will also serve as the basis for scoring any optional tests that are conducted.)
1.2 OPTIONAL TESTING: OTHER RECALL/PRECISION TRADEOFFS
The objective of the optional testing is to learn more about the tradeoffs that some systems may be designed to make between recall and precision. It is intended to elicit extra data points only on those systems that are currently designed to make some theoretically interesting tradeoffs in some controlled fashion.
Thus, we are interested in having you conduct the optional testing in either of the two following cases, but not otherwise:
1) if the system can control the tradeoff between recall and precision in order to produce a set of data points sufficient to plot the outline of a recall-precision curve;
2) if the system's recall and precision can be consciously manipulated by the loosening or tightening of analysis constraints, etc., in order to produce at least one data point that contrasts in an interesting way with the results produced by the required testing.
To yield these additional data points, you will generate and score new system response templates, using the history file generated during the required testing. NO SYSTEM DEVELOPMENT IS PERMITTED BETWEEN OFFICIAL TESTING AND OPTIONAL TESTING -- ONLY MODIFICATION OF SYSTEM CONTROL PARAMETERS AND/OR REINSERTION OR DELETION OF EXISTING CODE THAT AFFECTS THE SYSTEM'S BEHAVIOR WITH RESPECT TO THE TRADEOFF BETWEEN RECALL AND PRECISION.
If, as a consequence of altering the system's behavior, templates are generated that weren't generated during the required testing, or slots are filled differently, you may find it necessary to add to the history file and to change some of the manual template remappings. START THE SCORING OF EACH OPTIONAL TEST WITH THE HISTORY FILE GENERATED DURING THE REQUIRED TESTING, MINUS THE MANUAL TEMPLATE REMAPPINGS; SAVE ANY UPDATED HISTORIES TO NEW FILE NAMES.
In order to obtain these data points, you may wish to conduct a number of tests and throw out all but the best ones. Remember, however, that you are to notify NOSC of ALL the planned parameter settings in advance (see section 1). Thus, it would be wise to experiment on the training data and use the results to know what different runs are worth making during the test. If, among the "throwaways," there are some results that you find significant, you may wish to include them in your site report for the MUC-3 proceedings, but they will not be part of the official record.
You may submit results for the experimental linguistic phenomena tests as part of the optional testing if you wish, but please do so only if you find the differences in scores to be significant.
2. SPECIFIC PROCEDURES FOR THE REQUIRED TESTING
2.1 FREEZING THE SYSTEM AND FTP'ING THE TEST PACKAGE
When you are ready to run the test, ftp the files in the test package from /pub/tst2. You are on your honor not to do this until you have completely frozen your system and are ready to conduct the test. You must stop all system development once you have ftp'ed the test package.
Note: If you expect to be running the test over the weekend and are concerned that a host or network problem might interfere with your ability to ftp, you may ftp the files on Friday. However, for your own sake, minimize the accessibility of those files, e.g., put them in a protected directory of someone who is not directly involved in system development.
2.2 GENERATING THE SYSTEM RESPONSE TEMPLATES
There are 100 texts in tst2-muc3, and the message IDs have the following format: TST2-MUC3-nnnn. Without looking at the texts, run your system against the file and name the output file response-max-tradeoff.tst2.
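Because the message IDs follow a fixed format, a quick completeness check is possible. The sketch below is a hypothetical aid, not part of the official procedure, and assumes the IDs appear in plain text in the file being checked:

```shell
# Hypothetical helper (not part of the official procedure): count the distinct
# TST2-MUC3-nnnn message IDs mentioned in a file, e.g. to confirm that the
# test file or a response file covers all 100 test messages.
count_ids() {
  grep -o -E 'TST2-MUC3-[0-9]+' "$1" | sort -u | wc -l
}
```

If every message is represented, `count_ids` on the file should report 100.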
You are to run the required test only once -- you are not permitted to make any changes to your system until the test is completed. If you get part way through the test and get an error that requires user intervention, you may intervene only to the extent that you are able to continue processing with the NEXT message. You are not allowed to back up!
Notes:
1) If you run short on time and wish to break up tst2-muc3 and run portions of it in parallel, that's fine as long as you are truly running in parallel with a single system or can completely simulate a parallel environment, i.e., the systems are identically configured. You must also be sure to concatenate the outputs before submitting them to the scoring program.
2) No debugging of linguistic capability can be done when the system breaks. For example, if your system breaks when it encounters an unknown word and your only option for a graceful recovery is to define the word, then abort processing and start it up again on the next test message.
3) If you get an error that requires that you reboot the system, you may do so, but you must pick up processing with the message FOLLOWING the one that was being processed when the error occurred. If, in order to pick up processing at that point, you need to create a new version of tst2-muc3 that excludes the messages already processed, or you need to start a new output file, that's ok. Be sure to concatenate the output files before submitting them to the scoring program.
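The reassembly step required by notes 1 and 3 can be sketched as follows. This is a hypothetical helper under the assumption that each partial run wrote its own response file; the file names in the usage example are illustrative, not prescribed:

```shell
# Hypothetical helper: concatenate partial response files, in the order of the
# original message sequence, into the single output file expected by the
# scoring program. First argument is the combined output file; the remaining
# arguments are the partial files in message order.
merge_responses() {
  out=$1
  shift
  cat "$@" > "$out"
}

# e.g.: merge_responses response-max-tradeoff.tst2 \
#           response-part1.tst2 response-part2.tst2
```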
2.3 SCORING THE SYSTEM RESPONSE TEMPLATES
2.3.1 SCORING ALL SYSTEM RESPONSES FOR OFFICIAL, REQUIRED TEST
Run the scoring program on the system response templates, using key-tst2 as the answer key and entering config.el as the argument to initialize-muc-scorer. (The config file contains arguments to the define-muc-configuration-options function, which you will have to edit to supply the proper pathnames.) When you enter the scoring program, type "is" so that the score buffer will contain detail tables (template by template) as well as the final summary table. Save the score buffer (*MUC Score Display*) to a file called scores-max-tradeoff.tst2.
Note: During the interactive scoring, make use of the guidelines (supplied separately) for interactively assigning full and partial credit. Also refer to key-tst2-notes (in the ftp directory) for NOSC's comments on how the answer key was generated. See section 5, below, for information on the plans for handling the rescoring of results.
Following the instructions in the user manual for the scoring program, save the history to a file called history-max-tradeoff.tst2.
2.3.2 SCORING SPECIFIC SETS OF SLOTS FOR THE EXPERIMENTAL, REQUIRED LINGUISTIC PHENOMENA TESTS
Read the file readme.phentest. Run the scoring program again for each of the linguistic phenomena tests, i.e., type the configuration file names that appear in the test package in sequence as the argument to the function initialize-muc-scorer. (These files must be edited to provide the proper pathnames for your environment.) Scoring for the phenomena testing should be done using the history file created when all templates were scored. No updates to the history file should be made during these runs. Save each score buffer (*MUC Score Display*) to the file name scores-<phenomenon test name>-max-tradeoff.tst2, where <phenomenon test name> matches the names in the config files.
3. SPECIFIC PROCEDURES FOR OPTIONAL TESTING
3.1 WITH MODIFIED SYSTEM CONTROL PARAMETERS FOR ALL TEMPLATES
For each optional run, modify the system as specified IN ADVANCE to NOSC. Then follow the procedures described in section 1.2 and section 2. Save the system response templates to files with unique, meaningful names. When you do the scoring, start the scoring program each time with the history file generated during the required testing (minus the manually remapped templates, since you may wish to change them). When you have finished scoring, save the history (whether or not it was updated) and the scores to files with names that permit them to be matched up with the corresponding system response template file.
Once you have determined which of the optional runs to submit to NOSC for the official record, name the files for those runs in some meaningful, easily understood fashion (fitting these patterns: response-<meaningful name here>.tst2, scores-<meaningful name here>.tst2, and history-<meaningful name here>.tst2) and provide them along with a readme file that explains the significance of the files and identifies their corresponding parameter settings.
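As a small illustration of the naming scheme above, a check like the following can confirm that each file you are about to submit fits one of the prescribed patterns. The function and the example names are hypothetical, not part of the official procedure:

```shell
# Hypothetical check: does a file name fit one of the patterns prescribed for
# optional-run submissions (response-, scores-, or history-<name>.tst2)?
fits_pattern() {
  case "$1" in
    response-*.tst2 | scores-*.tst2 | history-*.tst2) return 0 ;;
    *) return 1 ;;
  esac
}
```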
3.2 FOR LINGUISTIC PHENOMENA TESTS, USING MODIFIED SYSTEM CONTROL PARAMETERS
After you have produced the files listed at the end of section 3.1, above, follow the procedures in section 2.3.2 if you wish to produce separate linguistic phenomena test results for any/all of them. Use the history file corresponding to each of those response files.
Please submit these linguistic phenomena test scores to NOSC only if they are significantly different from those produced for the required testing. If you do submit these scores, name the file for each of the phenomena tests to correspond with the appropriate response file, using the following pattern: scores-<phenomenon test name>-<meaningful name here>.tst2.
4. REPORTS TO BE SUBMITTED TO NOSC BY MONDAY MORNING, MAY 13
All results submitted to NOSC are considered "official," with the exception of the results of the linguistic phenomena testing, which are considered "experimental." All results, whether official or experimental, may be included, in part or in full, in publications resulting from MUC-3. However, only the official results may be used for any comparative ranking or rating of systems. The proper means of using the official results for that purpose will be discussed during the conference at NOSC. The results of the linguistic phenomena testing are to be used only to gain insight into the linguistic performance of individual systems and into the testing methodology.
The files listed below are to be submitted to NOSC by Monday morning, May 13, via email to sundheim@nosc.mil. TO HELP NOSC FILE THE MESSAGES ACCURATELY, PLEASE SUBMIT EACH FILE IN A SEPARATE MESSAGE, AND IDENTIFY YOUR ORGANIZATION AND THE FILE NAME IN THE SUBJECT LINE OF THE MESSAGES.
4.1 REQUIRED TESTING (MAXIMIZED RECALL/PRECISION TRADEOFF)
1. response-max-tradeoff.tst2
2. history-max-tradeoff.tst2
3. scores-max-tradeoff.tst2
4. trace-max-tradeoff.tst2 (system trace for the 100 messages) -- You may submit whatever you think is appropriate, i.e., whatever would serve to help validate the results of testing. If the traces are voluminous and you do not wish to email them, please compress them and ftp them to the /pub directory; send sundheim@nosc.mil an email message to identify the file name.
5. scores-<phenomenon test name>-max-tradeoff.tst2 -- where <phenomenon test name> matches the names in the config files (see readme.phentest)
4.2 OPTIONAL TESTING (OTHER RECALL/PRECISION TRADEOFFS)
Items 1-5, below, are required for EACH optional test run that is reported to NOSC.
1. history-<meaningful name here>.tst2
2. response-<meaningful name here>.tst2
3. scores-<meaningful name here>.tst2
4. readme-optional-testing.tst2 -- See section 3.1, above.
5. trace-<meaningful name here>.tst2 -- See note in section 4.1, above.
6. scores-<phenomenon test name>-<meaningful name here>.tst2 -- where <phenomenon test name> matches the names in the config files (see readme.phentest). Submit these scores only if significantly different from those obtained for the required testing.
5. RESCORING OF RESULTS
The interactive scoring that is done during testing should be done in strict conformance to the scoring guidelines. If you perceive errors in the guidelines or in the answer keys as you are doing the scoring, please make note of them and send a summary to NOSC along with the items listed in section 4, above. When all the results are in, NOSC will attempt to merge everyone's history-max-tradeoff.tst2 files and rescore everyone's response-max-tradeoff.tst2 files. Your notes on perceived errors may be useful to NOSC at that time. If the errors are not easy to rectify and if they appear to be serious enough to significantly affect the legitimacy of the scoring, we may have to wait to rectify them after the conference and rescore the response templates at that time. THE RESULTS OF RESCORING BEFORE AND/OR AFTER THE CONFERENCE WILL BECOME THE OFFICIAL RESULTS.
