Evaluating Natural Language Generated 
Database Records 
Rita McCardell 
Department of Defense 
Fort Meade, Maryland 20755 
ABSTRACT 
With the onslaught of various natural language pro- 
cessing (NLP) systems and their respective applica- 
tions comes the inevitable task of determining a way 
in which to compare and thus evaluate the output of 
these systems. This paper focuses on one such evalua- 
tion technique that originated from the text understand- 
ing system called Project MURASAKI. This evaluation 
technique quantitatively and qualitatively measures the 
match (or distance) from the output of one text under- 
standing system to the expected output of another. 
Introduction 
Project MURASAKI 
The purpose of Project MURASAKI is to develop a 
foreign language text understanding system that will 
demonstrate the extensibility of message understanding 
technology3 In its current design, Project MURASAKI 
will process Spanish and Japanese text and extract in- 
formation in order to generate records in both natural 
language databases, respectively. The fields within these 
database records will contain a natural language phrase 
or expression in that respective language. 
The domain of Project MURASAKI is the disease 
AIDS. The associated software system will include a 
general domain model of AIDS in the knowledge base. 
Within this model, there will be five subdomains: 
incidence reports records the occurrence of AIDS 
and HIV infection in countries and regions, 
among various populations, 
testing policies covers measures to test groups for 
AIDS, 
campaigns describes measures adopted to combat 
AIDS, 
new technologies lists new equipment and material 
used in detecting and preventing AIDS, and 
1Thus, it is no_...t to be confused as a message undel~tanding 
project, but rather a multi-paragraph (i.e., text) understanding 
project \[51. 
AIDS research details the various vaccines and treat- 
ments that are being developed to prevent 
AIDS. 
The subdomains of incidence reports, testing poli- 
cies and campaigns are found in the Spanish text while 
the topics of incidence reports, new technologies 
and AIDS research are covered in the Japanese text. 
Project MURASAKI will demonstrate a sufficient 
level of full text understanding to be able to identify 
the existence of factual information within either a given 
Spanish or Japanese text that belongs within a partic- 
ular Spanish or Japanese language database. Then, it 
will determine what information in that text constitutes 
a single record in the selected database. 
The balance of this paper will focus on the evaluation 
technique: why it was chosen, some basic assumptions 
underlying it, as well as the design and application of this 
technique. To illustrate various technical points of this 
technique, examples will be given using text excerpted 
from the Spanish AIDS corpus and its associated (gener- 
ated) Spanish database records. Appendix A contains a 
sample Spanish AIDS text (Text #124) and its English 
translation. 2 Appendix B contains a record from the 
Incidence Reporting database that was generated from 
Text #124. Similarly, Appendix C contains a record 
from the Testing Policies database that was also gener- 
ated from Text #124. 
The Need for a Black Box 
Given the overall design of this foreign language text 
understanding program, there arose the need for devel- 
oping a general purpose evaluation technique\[l\]. This 
technique would compare the actual, computer generated 
output of one such system to the expected, human gener- 
ated output of another. That is to say, given some sam- 
ple piece of (foreign language) text as input, some pre- 
defined system output (namely, for project MURASAKI, 
the generation of a finite number of database records) 
could be manually generated so that a determination 
as to the correct performance of the computer system 
was made. Given this type of "correct" output, it could 
2In the MURASAKI text corpus, there do not exist any English 
translations for any of the text. 
therefore be possible to measure the performance of an 
automated system based on this type of well-defined in- 
put/output pairs. It was precisely this type of ratio- 
nale that led to the development of a black box eval- 
uation -- evaluation primarily focused on what a sys- 
tem produces externally rather than what a system does 
internally. In direct contrast to this type of evaluation is 
glass box evaluation-- "looking inside the system and 
finding ways of measuring how well it does something, 
rather than whether or not it does it" \[5\]. 
With the development of the MURASAKI evaluation 
technique, comes the notion of two types of measures: 
a quantitative measure and a qualitative measure. The 
quantitative measure determines the number of cor- 
rect (and/or incorrect) records that have been generated 
in any one database while the qualitative measure 
evaluates the "correctness" of any database record field. 
Background 
Some Assumptions 
Given the overall design of Project MURASAKI, there 
are a few assumptions, or rather, some groundwork that 
needs to be laid, in order to proceed in the development 
of this evaluation technique. These assumptions are ex- 
plained as follows: 
• Given the nature of the AIDS text corpus, any one 
text could possibly generate one or more records 
in one or more databases. This fact is loosely re- 
ferred to as domain complexity. (Furthermore, for 
any record, all fields may not be filled.) 
• Given the structure of the AIDS domain model, it is 
just as easy (or hard) to distinguish one subdomain 
from another. That is, each database is as likely 
to have a record generated in it as another. This 
hypothesis is known as subdomain differentiation. 
• Upon the determination of what the expected output 
of Project MURASAKI should resemble, a correct 
record (in any database) is uniquely identified by 
the contents of its key fields plus the contents of one 
or more non-key fields. This statement constitutes 
the definition of a correct record. 3 
Generated Output: What Could Go 
Wrong? 
After a thorough analysis of the system flow for Project 
MURASAKI and given a typical AIDS text as system in- 
put, the following list represents all possible undesirable 
situations that could arise: 
3 All appropriate information should be extracted from the text 
and placed in the correct database. A change in any of the key 
fields will result in the generation of a new record. For example, 
if data from a different time period is presented in the text, a key 
field change is required, and a new record is generated. If data from 
a new region is presented, a new record is generated. Examples 
of key and non-key fields are found in Appendices B and C. Key 
fields, which are found in the thick, darkened boxes, are the same 
throughout each database. 
1. Generate one or more records in the wrong 
database. 
2. Not generate one or more records in the correct 
database. 
3. Generate too many records in the correct 
database, i.e., over-generate. 
4. Generate too few records in the correct database, 
i.e., under-generate. 
5. Generate too many fields in the correct record. 
6. Generate too few fields in the correct record. 
7. Generate the wrong answer in the fields. 
Situations 1 and 2 illustrate what could go wrong at 
the database level while scenarios 3 and 4 depict possi- 
ble problems arising at the database record level. The 
remaining criteria (namely 5, 6 and 7) shows what could 
happen at the database record field level. However, the 
more crucial way of viewing these problems is not so 
much in where (i.e., at what level) these events occur, 
but rather in how these problems can be detected and 
thus measured for evaluation purposes. It is with this 
motivation that the following categorization was derived: 
a quantitative measure could be designed to account for 
the problems that could arise at both the database and 
database record levels while a qualitative measure could 
comparably be designed for evaluation at the database 
record field level. 
In the next section, two examples are given depict- 
ing how the quantitative measure accounts for problems 
arising at the first two levels. (Note: 'rec.' is the abbre- 
viation for record in these examples.) 
A Quantitative Measure 
Background 
A scoring function is used for the quantitative measure 
to calculate an aggregate score for the number of correct 
records (as defined previously) generated ('gem' in the 
following examples) for a given MURASAKI text. This 
scoring function assigns one point for the generation of a 
correct record ('coL') and -p points, where 0 < p < 1, 
for the generation of an incorrect record ('inc.'). 
Some Questions 
Given the two examples in Table 1, the following ques- 
tions come to mind: 
• What should be the value of p? !? i? 17 Does 2" 3" 4" 
bounding it between 0 and 1 imply any linguistic 
restrictions on focus or coverage of the text? Or 
rather, should these bounds become parameters of 
this measure? 
Ex. #i: DB #I DB #2 DB #3 TOTAL Ex. #2: DB #i DB #2 DB #3 TOTAL 
3 tee. 2 rec. 1 rec. 6 Text 124 3 rec. 1 rec. 0 rec. 4 Text xxx 
what if, 
where 
2 gen. 2 gen. 2 gen. 
1 cor. 2 cor. 2 inc. 
1 inc. 
(1 inc.) 
1-2p 2 -2p a-4p 6 
what if, 
where 
4 gen. 0 gen. 1 gen. 
3 cor. 1 inc. 1 inc. 
1 inc. 
3-p -p -p 
Table 1: Examples of How the Quantitative Measure Works 
3-3p 4, 
• Which is worse: to over-generate or under-generate? 
That is, should we have one penalty for one and 
another penalty for the other? (In Example #1 of 
Table 1, the extra, or over-generated, record is also 
penalized by -p points.) 
* What happens if the numerator is negative? Or 
equal to 0? Should the score in these cases be 0? 
• If the score for a single text is Texti, then should the 
scoring algorithm for the overall (average) Quanti- 
tative Score be ~ where i = 1, 2, N and N ' "" "' 
N is the total number of text? 
A Qualitative Measure 
Background 
Before proceeding into the design of the qualitative mea- 
sure, some background is needed in order to motivate 
this measure. For Project MURASAKI, a database 
field is defined to be logically equivalent to that of a 
SLOT while the contents of that field is equivalent 
to its FILLER. 4 The slots define three types of DO- 
MAINS: (1) unordered, e.g., OCCUPATIONS, (2) or- 
dered, e.g., MONTHS-OF-THE-YEAR and (3) contin- 
uous, e.g., HEIGHT. The slot fillers have three types of 
ATTRIBUTES: (1) symbolic, e.g., (temperature(value 
tepid)), (2) numeric, e.g., (weight(value 141.3)) and (3) 
hybrid, e.g., (test_results(value(i,000 people were de- 
ported))). Also, the slot fillers have three types of CAR- 
DINALITY: (1) single, e.g., (sex(value male)), (2) enu- 
merated, e.g., (subjects(value(math physics art))) and 
(3) range, e.g., (age(value(0 100))). 
The notion of IMPORTANCE VALUES (IVs) are 
introduced here and are used to numerically describe 
how easy/hard it was (is) to extract a particular field's 
(or slot's) information from the text. These importance 
values are assigned to both the key and the non-key fields 
of a database record for each of the five databases. 5 Im- 
portance values are integers from 1 to 10, inclusive, and 
are interpreted as follows: 
4The origination of this knowledge representation scheme 
(KRS) was taken from \[4\]. The application of this KRS to Project 
MURASAKI was taken from\[l\]. 
5 Recall that each database, for both Spanish and Japanese, cor- 
responds to one of the five different subdomains within the AIDS 
domain model. 
IV Interpretation 
10 very easy to extract 
: 
5 moderately easy/hard to extract 
: 
1 very hard to extract 
With this view of importance values 6, the extraction 
process for Project MURASAKI may now be considered 
as two subprocesses; that is, extraction plus deduction. 
For example, the key field fuente (meaning "source") 
may be filled with OMS or any one of the other period- 
icals and technical papers that are listed in the header 
line of each text (reference Appendix A, where the fuente 
is El Pa(s). Since the fuente field is constrained to only 
a few possible fillers, an importance value of 9 has been 
assigned to it. 7 
Scoring Functions & Algorithm 
Scoring functions are also used for the qualitative mea- 
sure to calculate an aggregate penalty for the fields (both 
key and non-key) in a database record. There are three 
types of scoring functions based upon the cardinality of 
the slot fillers: (1) single, (2) enumerated and (3) range, s 
An example of an ordered domain with single fillers is 
that of TEMPERATURE: 
(make-frame TEMPERATURE 
(instance-of (value field)) 
(database-in (value z)) 
(element-type (value symbol)) 
(domain-type (value ordered)) 
(cardinality (value single)) 
(elements (value cold cool tepid 
lukewarm warm hot scalding))) 
6l_nt'orrnal feedback thus far has indicated that these values are 
geared to having more emphasis placed on the records that contain 
easier fields and less on the harder ones, thus not rewarding those 
who perform well on the harder fields. 
ran importance value of 10 would have been assigned had it not 
been for the fact that in some instances, the "deduction" portion of 
the extraction process for this field specifies the conversion of some 
sources to their respective acronym, e.g., OMS is Organizacidn 
Mundial de la Salud (WHO). 
Sin Project MURASAKI, only slots that contain single fillers 
have been identified thus far. 
66 
(The filler x in the database-in slot represents the sin- 
gle character identification value for a particular AIDS 
database.) Continuing with this example, if the follow- 
ing actual output (AO) were to be matched against what 
was expected (EO, expected output), 
AO: (temperature (value cool)) 
EO: (temperature (value lukewarm)) 
the penalty assigned to this mismatch would depend on 
two variables: (1) D, the distance between the fillers in 
the ordered set of values and (2) C, the size of the do- 
main. The scoring function that relates these two vari- 
ables is 
WxD P - -- (1) 
f(c) 
where W is the numerical weight on the distance between 
the fillers and :P is a damping function on the size of the 
domain. 
As mentioned before, an example of an unordered do- 
main with single fillers is OCCUPATIONS. Since the dis- 
tance, D, is not meaningful for this example, the penalty 
assigned to the match becomes a function merely of the 
size of the domain (and hence the probability of the cor- 
rect filler appearing): 
W P- ~(C) (2) 
Consider the slot CASOS_NOTIFICADOS from the 
Incidence (I) Reporting database. It is a continuous do- 
main with (single) numeric fillers and its attribute entry 
is the following: 
(make-frame CASOS_NOTIFICADOS 
(instance-of (value field)) 
(database-in (value I)) 
(element-type (value number)) 
(domain-type (value continuous)) 
(cardinality (value single)) 
(unit-size (value 1)) 
(elements (value (0 1200.000)))) 
As before, suppose we are trying to match the 
CASOS_NOTIFICADOS slots between the actual out- 
put and the expected output: 
AO: (casos_notificados (value 2.700)) 
EO: (casos_notificados (value 2.781)) 
Since only numbers can be represented in a continuous 
domain, the elements of the domain are defined by giv- 
ing the endpoints of the domain (or closed interval) and 
the unit size of representation is used in computing the 
distance between fillers. When defined in this manner, 
the same scoring function that was used for an ordered 
domain with single fillers (namely Equation 1) can be 
used to compute the penalty for continuous domain sets 
as well. 
The overall Score for a single database record is 
× Pi) (3) 
for i = 1, 2, ..., (number of fields in that database 
record). The Pi's are the computed penalties between 
each field of the actual output and the expected output 
for that particular database record. The IVy's are the 
importance values for the corresponding fields of that 
database record. 
The Scoring Algorithm that computes the overall qual- 
itative measure for the entire text corpus is given below: 
for each TEXT 
for each DB RECORD 
for each DB RECORD FIELD 
if EO_field and AO_field are equal 
then no penalty 
else 
begin 
compute penalty ;;; based on 
appropriate scoring function 
weight penalty ;;; according to 
the IV of that field 
add weighted penalty 
to total record penalty 
end 
Some Unresolved Issues 
So far, fields that contain either numeric fillers or single 
word fillers (fillers that are both easily "distanceable") 
have been discussed. However, one would think that the 
more linguistically complex fields, i.e., those containing 
generated natural language phrases, would be more of a 
true test for the qualitative measure of this evaluation 
technique. Consider, for example, a non-key field like 
poblaci6n ("population") (from Appendix C): 
AO: poblaei6n inmigrantes 
EO: poblaci6n 
personas que pretendlan entrar en el pals ("people who 
try to enter the country") 
How should one extend the current notion of the qual- 
ititative measure to include evaluating the distance be- 
tween natural language phrases of this kind? It would 
appear that poblaci6n would be an unordered domain 
containing symbolic information. However, what are the 
elements of this domain? Should they have cardinality 
single? Should they include only those phrases that were 
generated from the expected output or should they addi- 
tionally include al_! semantically equivalent phrases, i.e., 
those containing a common set of semantic primitives or 
attributes, as well? If the latter situation were to pre- 
vail, then, in the example listed above, should a penalty 
be assessed? If so, by how much? Or rather, should one 
group together all semantically equivalent phrases and 
then determine the distance between these classes? 
Consider another example of an unordered domain 
field from the Testing Policies Database: 
AO: resultados han deportado a 1000 personas 
que resultaron 
EO: resultados desde 1985, han deportado a 1000 
personas que resultaron 
Should this non-key field be defined as having both a 
symbolic and numeric, i.e., hybrid, attribute? If so, 
should a scoring function based on symbolic and numeric 
text be designed? Given the example above, should a 
penalty be assigned for lack of a specific time element 
(in the actual output) or are these phrases semantically 
equivalent? 
A possible algorithmic extension to the current quali- 
tative measure is outlined as follows: 
1. for a given database field, obtain and examine all 
possible fillers, 
2. group/classify semantically equivalent phrases 
(by those that share common semantic primi- 
tives/attributes, e.g., theme, agent, actor, time, 
etc.) and then 
3. calculate the distance between each group/class 
(through determining by just how many semantic 
primitives/attributes they differ from each other). 
If this approach were taken, the scoring function of Equa- 
tion i would be applicable where D would be the distance 
between classes of fillers rather than just between the 
fillers themselves. 
Conclusion 
It is hoped that this evaluation technique will prove ef- 
fective for Project MURASAKI and thus become the 
basis on which to develop a general purpose evaluation 
tool. Research continues on answering those quantita- 
tive questions and on resolving those qualitative issues. 
Acknowledgements 
I would like to thank Roberta Merchant, Mary Ellen 
Okurowski and John Prange for their assistance and sup- 
port with this work. Also, I would like to thank Tom Do- 
err who was instrumental with the preparation of this 
document. But most of all, I would like to thank my 
morn for everything. It is in her memory that this paper 
will be presented. 
References 
\[1\] MeCardell, R. 1990. "An Evaluation Technique for 
STUP Database Records". An unpublished docu- 
ment. 
\[2\] McCardell, R. 1988. "Lexical Selection for Natural 
Language Generation". Thesis Proposal, Computer 
Science Department, University of Maryland Balti- 
more County. 
\[3\] 
\[4\] 
\[5\] 
Merchant, R. and M. E. Okurowski. Personal Com- 
munciation. January ~ February, 1990. 
Nirenburg, S., R. McCardell, E. Nyberg, P. Werner, 
S. Huffman, E. Kenschaft and I. Nirenburg. 1988. 
DIOGENES-88, CMU Technical Report CMU-CMT- 
88-107, Center for Machine Translation, Carnegie 
Mellon University. 
Palmer, M., T. Finin, and S. M. Walter. 1989. 
"Workshop on the Evaluation of Natural Lan- 
guage Processing Systems". RADC-TR-89-302, Fi- 
nal Technical Report, Unisys Paoli Research Center. 
Appendix A: Sample Spanish 
AIDS Text and Translation 
#~124 08ju189 E1 Pals Madrid palabras 899 
Los Emiratos Arabes Unidos han deportado, 
desde 1985, a 1.000 
The United Arab Emirates has deported, since 1985, 
1,000 
personas que resultaron posltivas en las pruebas 
de detecci6n del SIDA y 
people who tested positive on AIDS screening tests and 
que pretendlan entrar en el pals. Un portavoz 
de su embajada en 
who tried to enter the country. An embassy 
spokesperson in 
Espafia manifest6 que "es las soluci6n menos 
mala", ya que la naci6n "es 
Spain said that "it is the less harmful solution", because 
the nation "is 
muy pequefia, tiene menos de medio mill6n de 
habitantes y no puede 
very small, it has less than half a million inhabitants, 
and it cannot 
hacer frente a los enfermos". La Organizaci6n 
Mundial de la Salud ha 
care for the patients". The World Health Organization 
registrado 10.000 nuevos casos de SIDA en el 
pasado mes de junio, 
registered 10,000 new cases of AIDS last June, 
ascendiendo el ndmero total a 167.373. Espafia 
tlene 2.781 casos 
raising the total number to 167,373. Spain has 2,781 
cases 
registrados. 
registered. 
9This is the header line for Text #124. This article was re- 
ported in the El Pais newspaper, located in Madrid, on July 8, 
1989 and contains 89 words. 
68 
Appendix B: An Incidence Reporting Database Record 
INCIDENCIA DEL SIDA 
ar'tfculo 124-021 fecha 00iun89 fuente El Pals 
region todo el mundo 
fuente de la information OMS 
VIH" varones mujeres 
categoria 
infectados por VIH (porcentaje) 
infectados por VIH (estimados) 
infectados por VIH (notificados) 
modo de transmision 
prevalencia: % de populaci6n de 
tasa de progresion ai SIDA: 
tasa de progresi6n al SIDA: 
tasa de progresion al SIDA: 
tasa de progresi6n al SIDA: 
perfodo de duplicaci6n 
incremento mensual 
nifios 
% para 
% para 
% para 
% para 
meses 
% 
afios 
afios 
afios 
afios 
SlDA: varones mujeres nifios 
casos notificados 10.000 nuevos casos en iunio 1989 
casos estimados 
prevalencia: 
tasa de letalidad 
tasa de letalidad 
fallecidos 
fallecidos 
relacibn m:f 
periodo de duplication 
para afio(s) 
% de populaci6n de 
% / casos notificados en 
% / casos notificados antes de 
(n~mero) 
% de los casos notificados 
meses 
Appendix C: A Testing Policies Database Record 
PRUEBAS CONTRA EL SIDA 
articulo 124-01T fecha 08iul89 fuente El Pais 
region Los Emiratos ,~rabes Unidos 
fuente de la informaciOn portavoz de Los Emiratos .a, rabes Unidos en Espafia 
autoridad de acci6n 
nivel de acciOn 
periodo 
poblaciOn personas que pretendian entrar en el pais 
poblaci6n 
poblaciOn 
poblaci6n 
local de la prueba 
tipo de prueba 
tipo de prueba 
tipo de prueba 
resultados desde 1985, han deport:ado a 1.000 personas que resultaron 
~ositivas 
