MUC-5 EVALUATION METRICS
Nancy Chinchor, Ph.D.
Science Applications International Corporation
10260 Campus Point Drive, M/S A2-F
San Diego, CA 92121
chinchor@gso.saic.com
(619) 458-2614
Beth Sundheim
Naval Command, Control, and Ocean Surveillance Center
RDT&E Division (NRaD)
Information Access Technology Project Team, Code 44208
San Diego, CA 92152-7420
sundheim@nosc.mil
INTRODUCTION
The metrics used for the Fifth Message Understanding Conference (MUC-5) evaluation are a major update
to those used for MUC-4 in 1992. The official MUC-5 metrics express error rates, while the official MUC-4 metrics
express performance in terms of recall and precision (used for MUC-5 only as "unofficial" metrics). This paper
discusses the current metrics and the reasons for their adoption.
SCORE REPORTS
The MUC-5 Scoring System is evaluation software that aligns and scores the templates produced by the
information extraction systems under evaluation in comparison to an "answer key" created by humans. The Scoring
System produces comprehensive summary reports showing the overall scores for the templates in the test set; these
may be supplemented by detailed score reports showing scores for each template individually. Figure 1 shows a
sample summary score report in the joint ventures task domain for the error metrics; Figure 2 shows a corresponding
summary score report for the recall-precision metrics.
Scoring Categories
The basic scoring categories are found in the score report under the column headings COR, PAR, INC,
XCR, XPA, XIC, MIS, SPU, and NON. These categories have not fundamentally changed since the MUC-4
evaluation. The rows in the body of the score report are for the various slots and objects in the template; various totals
appear at the bottom.
For the MUC-5 evaluation, alignment of system responses (i.e., templates, objects, and slot-fillers generated
by the system under evaluation) with the answer key was done fully automatically, and scoring was done
interactively. In interactive scoring mode, the evaluator is queried for a scoring decision only under certain
circumstances; under most circumstances, the scoring decisions are made automatically. The meaning of each of the
scoring categories is described below and summarized in Table 1.
• If the response and the key are deemed to be equivalent, the category is correct (COR); if interactively
assigned, a tally appears in both the COR and XCR (interactive correct) columns.
• If the response and the key are judged to be a near match, the category is partial (PAR); if interactively
assigned, a tally appears in both the PAR and XPA (interactive partial) columns.
SLOT             POS    ACT   COR  PAR   INC  XCR  XPA  XIC   SPU   MIS   NON  ERR  UND  OVG  SUB
<template>       282    282   282    0     0    0    0    0     0     0     0    0    0    0    0
content          348    389   289    0     6    0    0    0    94    53    11   35   15   24    2
subtotals        348    389   289    0     6    0    0    0    94    53    11   35   15   24    2
<tie-up-relati   348    389   289    0     6    0    0    0    94    53    11   35   15   24    2
status           348    389   231    0    64    0    0    0    94    53     0   48   15   24   22
entity           791    934   535    0    94    0    0    0   305   162     0   51   20   33   15
joint-venture    180    212   101    0    23    0    0    0    88    56    96   62   31   42   18
ownership        103    122    64    0     4    0    0    0    54    35   180   59   34   44    6
activity         387    367   214    0    44    0    0    0   109   129    10   57   33   30   17
subtotals       1809   2024  1145    0   229    0    0    0   650   435   286   53   24   32   17
<entity>         976   1057   749    0    61    0    0    0   247   166     0   39   17   23    8
name             872    937   554   44   115    4   38   81   224   159    27   47   18   24   19
aliases          359    389   232    6     5    0    5    3   146   116   399   53   32   38    3
location         322    338   140   19    23    0    3    1   156   140   443   69   43   46   18
nationality      265    212   100    0    11    0    0    9   101   154   509   73   58   48   10
type             976   1057   716    0    94    0    0    0   247   166     0   41   17   23   12
ALL OBJECTS    12125  13913  6793  149  1562    6   85  124  5405  3621  3996   61   30   39   19
MATCHED ONLY    9140   9729  6793  149  1159    6   85  124  1624  1039  3996   36   11   17   15
Richness-Normalized Error   Wrong     Req-fills    All-fills    Min-err   Max-err
                            10662.5   11813        12138        0.8784    0.9026
Error Rate Per Word         Wrong     Word-count   Error-rate
                            10662.5   92862        0.1148
Figure 1: Sample Error Score Report.
• If the key and response do not match, the category is incorrect (INC); if interactively assigned, a tally
appears in both the INC and XIC (interactive incorrect) columns.
• If the key has a fill and the response has no corresponding fill, the category is missing (MIS).
• If the response has a fill which has no corresponding fill in the key, the category is spurious (SPU).
• If the key and response are both left blank, then the category is noncommittal (NON).
The columns in Figures 1 and 2 labelled possible (POS) and actual (ACT) contain the tallies of the number
of slot fillers that should be generated and the number of fillers that the system under evaluation actually generated,
respectively. Possible is the sum of the correct, partial, incorrect, and missing. Actual is the sum of the correct, partial,
incorrect, and spurious. These tallies are used in the computation of some of the evaluation metrics. The total possible
is system-dependent and is therefore computed by summing the tallies assigned to the system responses rather than
by simply summing the slot fillers to be found in the key template. In contrast, a system-independent metric will be
explained in a later section.
SLOT             POS    ACT   COR  PAR   INC  XCR  XPA  XIC   SPU   MIS   NON  REC  PRE  UND  OVG
<template>       282    282   282    0     0    0    0    0     0     0     0  100  100    0    0
content          348    389   289    0     6    0    0    0    94    53    11   83   74   15   24
subtotals        348    389   289    0     6    0    0    0    94    53    11   83   74   15   24
<tie-up-relati   348    389   289    0     6    0    0    0    94    53     0   83   74   15   24
status           348    389   231    0    64    0    0    0    94    53     0   66   59   15   24
entity           791    934   535    0    94    0    0    0   305   162     0   68   57   20   33
joint-venture    180    212   101    0    23    0    0    0    88    56    96   56   48   31   42
ownership        103    122    64    0     4    0    0    0    54    35   180   62   52   34   44
activity         387    367   214    0    44    0    0    0   109   129    10   55   58   33   30
subtotals       1809   2024  1145    0   229    0    0    0   650   435   286   63   56   24   32
<entity>         976   1057   749    0    61    0    0    0   247   166     0   77   71   17   23
name             872    937   554   44   115    4   38   81   224   159    27   66   61   18   24
aliases          359    389   232    6     5    0    5    3   146   116   399   65   60   32   38
location         322    338   140   19    23    0    3    1   156   140   443   46   44   43   46
nationality      265    212   100    0    11    0    0    9   101   154   509   38   47   58   48
type             976   1057   716    0    94    0    0    0   247   166     0   73   68   17   23
ALL OBJECTS    12125  13913  6793  149  1562    6   85  124  5405  3621  3996   57   49   30   39
MATCHED ONLY    9140   9729  6793  149  1159    6   85  124  1624  1039  3996   75   70   11   17
TEXT FILTERING   251    262   242    *     *    *    *    *    20     9    11   96   92    4    8
F-MEASURES       P&R     2P&R    P&2R
                 52.75   50.66   55.02
Figure 2: Sample Recall-Precision Score Report.
Correct        response = key
Partial        response ≈ key
Incorrect      response ≠ key
Spurious       key is blank and response is not
Missing        response is blank and key is not
Noncommittal   key and response are both blank
Table 1: Scoring Categories.
Summary Rows
The two summary rows in the score report labelled "ALL OBJECTS" and "MATCHED ONLY" show the
accumulated tallies obtained by scoring spurious and missing objects in different manners. Templates may contain
more than one instance of a kind of object, e.g., more than one <entity> object. The keys and responses may not agree
in the number of objects generated. These cases lead to spurious and/or missing objects. Opinions as to how much
systems should be penalized for spurious or missing objects differ depending upon the requirements of the
application in mind. These differing views have led us to provide the two ways of scoring spurious and missing
information outlined in Table 2.
The MATCHED ONLY manner of scoring penalizes the least for missing and spurious objects by scoring
them only in the object ID slot. This object ID score does not impact the overall score because the object ID slot is not
included in the summary tallies; the tallies include only the individual slots. ALL OBJECTS is a stricter manner of
scoring because it penalizes for both the slot fills missing in the missing objects and the slots filled in the spurious
objects. The metrics calculated based on the scores in the ALL OBJECTS row of the error score report are the official
MUC-5 scores.
Matched Only   Missing and spurious objects scored in object slot only
All Objects    Missing object slots scored as missing
               Spurious object slots scored as spurious
Table 2: Manners of Scoring.
Evaluation Metrics
The rightmost four columns in both the error score report and the recall-precision score report contain the
scores for the evaluation metrics. These are computed for each object and slot in the template, and overall scores are
shown at the bottom.
The primary evaluation metrics for MUC-5 have been changed from those used in previous MUC
evaluations. The reasoning behind this change will be described in a later section. First, the formulas used to calculate
the evaluation metrics on the score reports will be given.
Error Metrics
The error per response fill (ERR) is the official measure of MUC-5 system performance. This measure is
calculated as the number wrong divided by the total (possible plus spurious) as shown in Table 3. It is dependent on
the system because tallies change according to the amount of spurious data generated and according to how the
system filled slots that have optional or alternate fills in the key. (See the discussion below on the richness-normalized
error metric.)
Table 3 also shows the computation of three secondary metrics (undergeneration, overgeneration, and
substitution), which isolate the three elements constituting overall error. Undergeneration and overgeneration were in
use for MUC-4 as well, and this is why they appear in both the error score report and the recall-precision score report.
Those metrics are computed the same way for both reports. The substitution metric is new for MUC-5 and is found
only in the error score report. Substitution is not isolated in the recall-precision view of information extraction
because it is a (negative) factor in both recall and precision; in the error-based view, on the other hand, it is isolated as
a distinct type of error. The reader should note that the denominator in each of the secondary metrics is different
because each metric offers a distinct perspective on the errors that a system can make.
Primary Metric
\[
\text{Error per response fill} = \frac{\text{wrong}}{\text{total}}
  = \frac{INC + PAR/2 + MIS + SPU}{COR + PAR + INC + MIS + SPU}
\]
Secondary Metrics
\[
\text{Undergeneration} = \frac{MIS}{POS} = \frac{MIS}{COR + PAR + INC + MIS}
\]
\[
\text{Overgeneration} = \frac{SPU}{ACT} = \frac{SPU}{COR + PAR + INC + SPU}
\]
\[
\text{Substitution} = \frac{INC + PAR/2}{COR + PAR + INC}
\]
Table 3: System-dependent Error Metrics.
The error per response fill has been chosen as the primary measure reported for a system for this evaluation
because developers now need to focus on the sources of errors, explain them, and remedy them to push the state of
the art. For example, if System A has the raw scores shown in Figure 3, its error per response fill is calculated as
follows:
wrong = INC + PAR/2 + MIS + SPU = 25 + 5 + 0 + 10 = 40
total = COR + PAR + INC + MIS + SPU = 10 + 10 + 25 + 0 + 10 = 55
wrong/total = 40/55 = 73%
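The calculation above, together with the three secondary metrics of Table 3, can be sketched in a few lines of Python. This is an illustrative sketch, not the MUC-5 Scoring System itself: the names `Tallies`, `ratio`, and `error_metrics` are hypothetical, and returning 0.0 for an empty denominator is an assumption, since the paper does not say how that case is reported.

```python
from dataclasses import dataclass


@dataclass
class Tallies:
    """Scoring-category tallies for one slot, object, or summary row."""
    cor: float
    par: float
    inc: float
    spu: float
    mis: float


def ratio(numerator: float, denominator: float) -> float:
    # Assumption: an empty denominator yields 0.0; the paper does not
    # describe how the Scoring System handles this case.
    return numerator / denominator if denominator else 0.0


def error_metrics(t: Tallies) -> dict:
    """Return the Table 3 metrics as fractions between 0 and 1."""
    substituted = t.inc + t.par / 2            # a partial counts as half wrong
    wrong = substituted + t.mis + t.spu
    possible = t.cor + t.par + t.inc + t.mis   # POS
    actual = t.cor + t.par + t.inc + t.spu     # ACT
    return {
        "ERR": ratio(wrong, possible + t.spu),  # total = possible + spurious
        "UND": ratio(t.mis, possible),
        "OVG": ratio(t.spu, actual),
        "SUB": ratio(substituted, t.cor + t.par + t.inc),
    }


# System A's raw scores from the worked example above.
system_a = Tallies(cor=10, par=10, inc=25, spu=10, mis=0)
scores = error_metrics(system_a)
print({name: round(100 * value) for name, value in scores.items()})
# {'ERR': 73, 'UND': 0, 'OVG': 18, 'SUB': 67}
```

The rounded percentages agree with the System A row of Figure 6; the score reports show whole-number percentages, so the sketch rounds only for display.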
While the error per response fill metric and the undergeneration, overgeneration, and substitution metrics are
designed to suit the system developers' need for performance diagnostics, a different measure that is as independent
of the system and the text sample as possible may be more useful in some other circumstances. The richness-normalized
error measure is designed to measure errors relative to the amount of information to be extracted from the
texts. This metric is shown in one of the summary rows at the bottom of the error score report.
           COR  PAR  INC  SPU  MIS  NON  ERR
SYSTEM A    10   10   25   10    0   35   73
Figure 3: System A.
Richness-normalized error is calculated by dividing the number of errors per word by the number of key fills
per word. This calculation reduces to the number of errors divided by the fill-count. If a program manager is
considering use of a system on a distinct class of documents from the ones the system was tested on, this measure will
predict the number of errors the system will make, given the richness of the new set of documents.
Due to the optional and alternate fills in the key, there will be a range of fill-counts from the minimum
number of fills required to the maximum number of fills allowed. The difference between the two numbers represents
"discretionary" fills, i.e., ones that represent the ambiguity inherent in the text.¹ The formulas for calculating the
minimum and maximum richness-normalized error appear in Table 4.
1. For further information on the variability inherent in the key templates, please refer to the published
version of the proceedings, which will contain a paper about the text and template corpora.
Richness-Normalized Error
\[
\text{Minimum Error} = \frac{\text{wrong}}{\text{All-fills}}
  = \frac{INC + PAR/2 + MIS + SPU}{\text{Required} + \text{Optional} + \text{MaximumAlternate}}
\]
\[
\text{Maximum Error} = \frac{\text{wrong}}{\text{Req-fills}}
  = \frac{INC + PAR/2 + MIS + SPU}{\text{Required} + \text{MinimumAlternate}}
\]
Table 4: Richness-normalized error.
For example, if System B has the raw scores in Figure 4 and if the key is filled as in Figure 5, the fill-count
will range from the minimum required fills, which is the sum of Required Fills + Minimum Alternate Discretionary
Fills (20 + 10), to the maximum allowed fills, which is the sum of Required Fills + Optional Discretionary Fills +
Maximum Alternate Discretionary Fills (20 + 10 + 30). For this system, the richness-normalized error will range
from 40/60 to 40/30, or 0.67 to 1.33.
Note that the maximum richness-normalized error can be greater than 1.00 because the fill-count in the key
can be less than the number wrong for a system that overgenerates. Note also that the minimum richness-normalized
error can be less than the error per response fill because the (system-independent) fill-count in the key can be greater
than the (system-dependent) total used in the denominator in error per response fill.
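The range computed above can be reproduced with a short Python sketch of the Table 4 formulas. The function name `richness_normalized_error` and its parameter names are hypothetical, chosen only for illustration; the inputs are the System B values quoted in the example.

```python
def richness_normalized_error(wrong, required, optional, alt_min, alt_max):
    """Return (minimum error, maximum error) following Table 4."""
    req_fills = required + alt_min              # fewest fills the key requires
    all_fills = required + optional + alt_max   # most fills the key allows
    # The larger fill-count gives the minimum error, the smaller the maximum.
    return wrong / all_fills, wrong / req_fills


# System B: 40 wrong; key with 20 required fills, 10 optional fills,
# and between 10 and 30 alternate fills.
min_err, max_err = richness_normalized_error(
    wrong=40, required=20, optional=10, alt_min=10, alt_max=30
)
print(f"{min_err:.2f} to {max_err:.2f}")  # 0.67 to 1.33
```

Because the denominators come from the key alone, the same two fill-counts apply to every system scored against that key; only the number wrong varies by system.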
The error score report also contains a row called "Error Rate per Word," but it should be noted that this
metric is not comparable between Japanese and English and is not highly accurate for Japanese.
           POS  ACT  COR  PAR  INC  XCR  XPA  XIC  SPU  MIS  NON  ERR  UND  OVG  SUB
SYSTEM B              10   10    5                  20   10   35
Richness-Normalized Error   Wrong   Req-fills   All-fills   Min-err   Max-err
                            40      30          60          0.67      1.33
Figure 4: System B.
REQUIRED   DISCRETIONARY FILLS                BLANKS
FILLS      Optional   Alternate
                      Minimum   Maximum
20         10         10        30            35
Figure 5: Key Fills for System B.
Recall-Precision Metrics
We have designated the recall, precision, and F-measure metrics that were used for MUC-4 as unofficial
secondary metrics for MUC-5 in order to maintain continuity with previous MUCs. They can be used to explain
current performance in comparison to past performance. Further analysis is still necessary to determine their
contribution to the evaluation of data extraction systems as compared to the error-based metrics.
The recall-precision evaluation metrics were adapted from the field of Information Retrieval (IR) and
extended for the MUC evaluations. They measure four different aspects of performance and an overall, combined
view of performance. The four evaluation metrics of recall, precision, undergeneration, and overgeneration are
calculated for the slots and in the summary score rows (see Table 5). The fifth metric, the F-measure, is a combined
score for the entire system and is listed at the bottom of the report.
Recall (REC) is the percentage of possible answers which were correct. Precision (PRE) is the percentage of
actual answers given which were correct. A system has a high recall score if it does well relative to the number of slot
fills in the key. A system has a high precision score if it does well relative to the number of slot fills it attempted.
In IR, a common way of representing the characteristic performance of systems is in a precision-recall
graph. Normally, as recall goes up, precision tends to go down and vice versa [1]. To directly measure
underpopulation or overpopulation of the template database by the information extraction systems, we introduced the
measures of undergeneration and overgeneration.
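\[
\text{recall} = \frac{\text{correct} + (\text{partial} \times 0.5)}{\text{possible}}
\qquad
\text{precision} = \frac{\text{correct} + (\text{partial} \times 0.5)}{\text{actual}}
\]
\[
\text{undergeneration} = \frac{\text{missing}}{\text{possible}}
\qquad
\text{overgeneration} = \frac{\text{spurious}}{\text{actual}}
\]
Table 5: Recall-Precision Evaluation Metrics.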
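As a rough illustration of Table 5, the following Python sketch computes the four metrics from raw tallies and applies them to the three hypothetical systems of Figure 7. The function name `recall_precision_metrics` is an assumption made for this example, as is the choice to report 0 when a denominator is empty.

```python
def recall_precision_metrics(cor, par, inc, spu, mis):
    """Return the Table 5 metrics as whole-number percentages."""
    possible = cor + par + inc + mis   # POS
    actual = cor + par + inc + spu     # ACT
    credit = cor + 0.5 * par           # a partial match earns half credit

    def pct(numerator, denominator):
        # Assumption: report 0 when the denominator is empty.
        return round(100 * numerator / denominator) if denominator else 0

    return {
        "REC": pct(credit, possible),
        "PRE": pct(credit, actual),
        "UND": pct(mis, possible),
        "OVG": pct(spu, actual),
    }


# Tallies of the three hypothetical systems shown in Figures 6 and 7.
systems = {
    "A": dict(cor=10, par=10, inc=25, spu=10, mis=0),
    "B": dict(cor=10, par=10, inc=5, spu=10, mis=20),
    "C": dict(cor=10, par=10, inc=15, spu=0, mis=20),
}
for name, tallies in systems.items():
    print(name, recall_precision_metrics(**tallies))
```

Recall and precision share a numerator and differ only in the denominator, which is why missing fills lower recall alone and spurious fills lower precision alone.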
Methods have been developed for combining the measures of recall and precision to get a single measure. In
MUC-4, we used van Rijsbergen's F-measure [1, 2] for this purpose. The F-measure provides a way of combining
recall and precision to get a single measure which falls between recall and precision. Recall and precision can have
relative weights in the calculation of the F-measure, giving it the flexibility to be useful in the context of different
application requirements. The formula for calculating the F-measure is:
\[
F = \frac{(\beta^2 + 1.0) \times P \times R}{(\beta^2 \times P) + R}
\]
where P is precision, R is recall, and β is the relative importance given to recall over precision. If recall and precision
are of equal weight, β = 1.0. This value is shown in the score report under the heading "P&R." The heading "2P&R"
is for recall half as important as precision (β = 0.5). The heading "P&2R" is for recall twice as important as precision
(β = 2.0). The F-measure is calculated from the recall and precision values in the ALL OBJECTS row.
Note that when two systems have the same sum of recall and precision, the F-measure is higher for the one whose
values lie nearer the center of the precision-recall graph than at the extremes. So, for β = 1.0, a system which has recall of
50% and precision of 50% has a higher F-measure than a system which has recall of 20% and precision of 80%. This
behavior is what we wanted from this single measure, which we expected would encourage developers to push
overall performance and, at the same time, to minimize the trade-off between the competing requirements for
minimal missing, spurious, and substitution types of error.
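The behavior described above can be checked with a minimal Python sketch of the F-measure formula. The function name `f_measure` is hypothetical, and returning 0.0 when both inputs are zero is an assumption rather than something the paper specifies.

```python
def f_measure(precision, recall, beta=1.0):
    """Combine precision and recall; beta > 1 weights recall more heavily."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # assumption: define F as 0 when both inputs are 0
    return (beta**2 + 1.0) * precision * recall / denominator


# Same sum of recall and precision, different balance (beta = 1.0).
balanced = f_measure(precision=0.50, recall=0.50)
lopsided = f_measure(precision=0.80, recall=0.20)
print(f"{balanced:.2f} {lopsided:.2f}")  # 0.50 0.32

# The three weightings reported in the score report.
for label, beta in [("P&R", 1.0), ("2P&R", 0.5), ("P&2R", 2.0)]:
    print(label, round(f_measure(0.80, 0.20, beta), 4))
```

Applied to the unrounded ALL OBJECTS recall and precision of Figure 2 (6867.5/12125 and 6867.5/13913, expressed as percentages), the same function gives 52.75, 50.66, and 55.02 for the three weightings, which matches the F-MEASURES row of that report.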
An example showing the new metrics and the old (along with the pertinent scoring categories) for three
theoretical systems is given in Figures 6 and 7. In this example, the error per response fill is the same for each of the
three systems even though the F-measures are different. However, the secondary metrics of undergeneration,
overgeneration, and substitution serve to distinguish the three systems. This hypothetical example points out the
important role that the secondary metrics could play in system analysis as well as the analysis of the quality of the
extracted information.
           POS  ACT  COR  PAR  INC  SPU  MIS  NON  ERR  UND  OVG  SUB
SYSTEM A    45   55   10   10   25   10    0   35   73    0   18   67
SYSTEM B    45   35   10   10    5   10   20   35   73   44   29   40
SYSTEM C    55   35   10   10   15    0   20   35   73   36    0   57
Figure 6: Three Systems with Equal Error per Response Fill.
           POS  ACT  COR  PAR  INC  SPU  MIS  NON   FP&R  REC  PRE
SYSTEM A    45   55   10   10   25   10    0   35  29.70   33   27
SYSTEM B    45   35   10   10    5   10   20   35  37.34   33   43
SYSTEM C    55   35   10   10   15    0   20   35  33.17   27   43
Figure 7: Unofficial Metrics for Three Systems with Equal Error per Response Fill.
Also appearing in the recall-precision score report is a row called "Text Filtering." The purpose of this row is
to report how well systems distinguish relevant articles from irrelevant articles. The scoring program keeps track of
how many times each of the situations in the contingency table arises for a system (see Table 6). It then uses those
values to calculate the entries in the Text Filtering row. The evaluation metrics are calculated for the row as indicated
by the formulas at the bottom of Table 6.
The Role of the Noncommittal Scoring Category
The reader will have noticed that the category of "noncommittal" responses has been omitted from the
metrics. Although this may not seem reasonable from an applications perspective, from a research perspective we
believe that the exclusion of noncommittal responses results in a much less distorted cross-system view of
performance. The question comes down to whether systems normally leave a slot blank out of knowledge or whether
they do so out of a lack of knowledge. Highly immature systems tend either to overgenerate to an extreme, leaving
few blanks, or to undergenerate to an extreme, leaving many blanks. The latter type of immature system is more
common and may benefit unfairly from a metric that considers a noncommittal response to be a correct response,
especially if there are relatively many blanks in the key templates.
If, for example, noncommittals were considered correct responses and included in the denominator of the
error per response fill measure, the rankings of all 17 MUC-4 systems on TST3 (the name of one of the two test sets
used in the evaluation) would change. The most radical changes would be for immature systems whose number of
noncommittals greatly outweighs all other categories of response. Since many immature systems were evaluated
for MUC-5 (as there were for MUC-4) and since the average number of fills in the answer-key templates for MUC-5
is only about half of what it was for MUC-4, the distortions of the results for MUC-5 have the potential to be even
greater than they were for MUC-4. However, the potential effect on the MUC-5 evaluation is damped somewhat by
the fact that the MUC-5 template consists of objects that are aligned separately; response objects that contain too
few slot-fillers to warrant an alignment with a key object are not scored against a key object at the
slot level. Nonetheless, we believe that omitting noncommittals from the metrics provides a better basis for
comparison across the full range of MUC-5 (and MUC-4) systems and provides a more accurate assessment of the
state of the art.
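The distortion described above can be illustrated with a small Python sketch. The tallies below are invented for illustration and do not come from any MUC evaluation; they describe two hypothetical systems scored against the same hypothetical key (90 fills and 100 blank slots). The function name `error_per_response_fill` and its `count_non_as_correct` flag are likewise assumptions.

```python
def error_per_response_fill(cor, par, inc, spu, mis, non,
                            count_non_as_correct=False):
    """ERR as in Table 3, optionally adding noncommittals to the denominator."""
    wrong = inc + par / 2 + mis + spu
    total = cor + par + inc + mis + spu
    if count_non_as_correct:
        total += non  # noncommittals treated as additional correct responses
    return wrong / total


# Hypothetical tallies, NOT taken from the paper.
undergenerator = dict(cor=20, par=0, inc=5, spu=5, mis=65, non=95)
overgenerator = dict(cor=50, par=0, inc=15, spu=40, mis=25, non=60)

for label, tallies in [("undergenerator", undergenerator),
                       ("overgenerator", overgenerator)]:
    official = error_per_response_fill(**tallies)
    with_non = error_per_response_fill(**tallies, count_non_as_correct=True)
    print(f"{label}: ERR={official:.3f}, with NON={with_non:.3f}")
```

Under the official definition the overgenerating system has the lower error (about 0.615 versus 0.789); once noncommittals are counted as correct, the system that leaves most slots blank appears better (about 0.395 versus 0.421), which is the kind of ranking change the text warns about.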
                     Relevant Is Correct   Irrelevant Is Correct
Decides Relevant              a                      b               a+b
Decides Irrelevant            c                      d               c+d
                             a+c                    b+d              a+b+c+d = n

                 POS   ACT   COR   PAR   INC   ICR   IPA   SPU   MIS   NON
Text Filtering   a+c   a+b    a                             b     c     d

Recall = a/(a+c)        Undergeneration = c/(a+c)
Precision = a/(a+b)     Overgeneration = b/(a+b)
Table 6: Text Filtering.
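The Table 6 formulas can be sketched in Python as follows. The function name `text_filtering_metrics` is hypothetical. The example values are read from the TEXT FILTERING row of Figure 2 under the assumption that COR, SPU, MIS, and NON correspond to the cells a, b, c, and d respectively.

```python
def text_filtering_metrics(a, b, c, d):
    """Compute the Table 6 metrics from the contingency-table cells.

    a: relevant text, system decides relevant
    b: irrelevant text, system decides relevant
    c: relevant text, system decides irrelevant
    d: irrelevant text, system decides irrelevant (not used by the four
       metrics, but part of the total n)
    """
    return {
        "n": a + b + c + d,
        "REC": round(100 * a / (a + c)),
        "PRE": round(100 * a / (a + b)),
        "UND": round(100 * c / (a + c)),
        "OVG": round(100 * b / (a + b)),
    }


# Assumed reading of the Figure 2 row: COR=242, SPU=20, MIS=9, NON=11.
print(text_filtering_metrics(a=242, b=20, c=9, d=11))
# {'n': 282, 'REC': 96, 'PRE': 92, 'UND': 4, 'OVG': 8}
```

With this reading, n equals the 282 templates in the sample report and the four percentages match the Text Filtering row of Figure 2. The sketch assumes a+c and a+b are nonzero.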
CHANGES TO THE METRICS FROM PREVIOUS EVALUATIONS
The changes to the evaluation metrics are expected to enable three different types of evaluation "users"
(NLP researchers, program managers, and potential customers) to assess and compare system performance in a
meaningful way. It is also hoped that the changes will correct deficiencies in the evaluation that may unwittingly
encourage conservative development strategies on the part of the researchers and that may also limit the evaluation's
meaningfulness to other evaluation users.
Although the terms recall and precision were borrowed from IR, the metrics themselves represent a
significant departure from the contingency table model, which underlies the IR version of the metrics. The task of
extraction is a complex one that includes elements of information detection and classification, plus open-ended
generation of strings and object pointers. The focus on recall and precision as primary metrics for the last few years
has had some advantages, among them the following:
• they bring out the fundamental tension between spurious and missing data;
• they require that evaluation users view system performance along more than one dimension;
• they present a positive view of system performance, which may have helped to make the NLP
researchers more comfortable with the idea of submitting their systems to evaluation.
However, recall and precision have the disadvantage of making a two-way distinction between error types
(spurious and missing) when in fact there are three types of error. The third kind of error is captured by the
substitution metric; it is accounted for by the categories of incorrect and (0.5 times) partial. Substitution errors are
taken into account in the recall-precision metrics to the extent that they contribute to the denominator of both recall
and precision; however, this type of error is not isolated, and its inclusion in the denominator of recall and precision
prevents those metrics from revealing to what extent a system's shortfalls are due to substitution rather than to
missing (in the case of recall) or spurious (in the case of precision).
In a way, the recall-precision metrics view substitution as a blend of missing and spurious; a system did not
simply produce the wrong fill, but rather produced a spurious fill on the one hand and missed a fill on the other hand.
This is a reasonable model of system behavior in many cases, but not in others, especially when a response is scored
partially correct. These deficiencies of the recall and precision metrics make the use of the error per response fill
reasonable, as long as it is accompanied by the secondary metrics of overgeneration (spurious), undergeneration
(missing), and substitution (incorrect, including half of the partial).
The F-measure, which was introduced for MUC-4 in response to needs of researchers and program
managers for a ranking metric, has come to be used more generally than just for cross-system comparisons. By
becoming the one metric of focus, it has been competing with recall and precision for the role of primary metric,
thereby weakening two of the major advantages that recall and precision originally had. Furthermore, now that
performance of some systems is in or approaching the 50% range, recall and precision are at a disadvantage for
motivating researchers to push performance of the top systems through the more difficult stages ahead because they
focus on the positive aspects of performance. These factors make the adoption of error per response fill as the primary
metric a reasonable next step in determining the best way to measure performance.
The statistical significance results from MUC-5 give us feedback on how well the error metric and the
F-measure distinguish systems. The results show that there are no differences between the rankings determined by error
per response fill² and the rankings determined by F-measure. The error per response fill distinguishes systems slightly
better; four more system pairs were significantly different in their error per response fill than were significantly
different in their F-measure. The error per response fill also shows a tendency towards clustering systems in slightly
clearer groups than the F-measure for EJV due to its ability to distinguish systems slightly better.
The richness-normalized error represents another change from previous evaluations and was motivated by
the desire for a system-independent metric. The nature of this metric requires that spurious behavior be ignored. The
search for such a metric led us to devise one in which two values, a minimum and a maximum, are calculated, since
language understanding necessarily involves variability in interpretation. It remains to be seen whether ignoring
overgeneration interferes with the predictive quality of the richness-normalized error metric.
REFERENCES
[1] Frakes, W.B. and Baeza-Yates, R. (eds.) (1992) Information Retrieval: Data Structures & Algorithms.
Englewood Cliffs: Prentice Hall.
[2] Van Rijsbergen, C.J. (1979) Information Retrieval. London: Butterworths.
[3] Nierstrasz, O. (1989) "A Survey of Object-Oriented Concepts" in W. Kim and F. H. Lochovsky (eds.)
Object-Oriented Concepts, Databases, and Applications. New York: Addison-Wesley.
2. Although rounded numbers appear in the score report, floating point values of error per response fill were
used for statistical analyses.
