File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/90/h90-1014_abstr.xml
Size: 19,334 bytes
Last Modified: 2025-10-06 13:46:58
<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1014"> <Title>Evaluating Natural Language Generated Database Records</Title> <Section position="1" start_page="0" end_page="66" type="abstr"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> With the onslaught of various natural language processing (NLP) systems and their respective applications comes the inevitable task of determining a way in which to compare and thus evaluate the output of these systems. This paper focuses on one such evaluation technique that originated from the text understanding system called Project MURASAKI. This evaluation technique quantitatively and qualitatively measures the match (or distance) from the output of one text understanding system to the expected output of another.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Introduction Project MURASAKI </SectionTitle> <Paragraph position="0"> The purpose of Project MURASAKI is to develop a foreign language text understanding system that will demonstrate the extensibility of message understanding technology3 In its current design, Project MURASAKI will process Spanish and Japanese text and extract information in order to generate records in both natural language databases, respectively. The fields within these database records will contain a natural language phrase or expression in that respective language.</Paragraph> <Paragraph position="1"> The domain of Project MURASAKI is the disease AIDS. The associated software system will include a general domain model of AIDS in the knowledge base.</Paragraph> <Paragraph position="2"> Within this model, there will be five subdomains: incidence reports records the occurrence of AIDS and HIV infection in countries and regions, among various populations, testing policies covers measures to test groups for AIDS, campaigns describes measures adopted to combat AIDS, new technologies lists new equipment and material used in detecting and preventing AIDS, and 1Thus, it is no_...t to be confused as a message undel~tanding project, but rather a multi-paragraph (i.e., text) understanding project \[51.</Paragraph> <Paragraph position="3"> AIDS research details the various vaccines and treatments that are being developed to prevent AIDS.</Paragraph> <Paragraph position="4"> The subdomains of incidence reports, testing policies and campaigns are found in the Spanish text while the topics of incidence reports, new technologies and AIDS research are covered in the Japanese text. Project MURASAKI will demonstrate a sufficient level of full text understanding to be able to identify the existence of factual information within either a given Spanish or Japanese text that belongs within a particular Spanish or Japanese language database. Then, it will determine what information in that text constitutes a single record in the selected database.</Paragraph> <Paragraph position="5"> The balance of this paper will focus on the evaluation technique: why it was chosen, some basic assumptions underlying it, as well as the design and application of this technique. To illustrate various technical points of this technique, examples will be given using text excerpted from the Spanish AIDS corpus and its associated (generated) Spanish database records. Appendix A contains a sample Spanish AIDS text (Text #124) and its English translation. 2 Appendix B contains a record from the Incidence Reporting database that was generated from Text #124. 
<Paragraph position="8"> The Need for a Black Box Given the overall design of this foreign language text understanding program, there arose the need to develop a general purpose evaluation technique [1]. This technique would compare the actual, computer generated output of one such system to the expected, human generated output of another. That is to say, given some sample piece of (foreign language) text as input, some pre-defined system output (namely, for Project MURASAKI, the generation of a finite number of database records) could be manually generated, so that a determination as to the correct performance of the computer system could be made. Given this type of &quot;correct&quot; output, it could therefore be possible to measure the performance of an automated system on the basis of such well-defined input/output pairs. It was precisely this rationale that led to the development of a black box evaluation -- an evaluation primarily focused on what a system produces externally rather than on what it does internally. In direct contrast to this type of evaluation is glass box evaluation -- &quot;looking inside the system and finding ways of measuring how well it does something, rather than whether or not it does it&quot; [5].</Paragraph> <Paragraph position="9"> With the development of the MURASAKI evaluation technique comes the notion of two types of measures: a quantitative measure and a qualitative measure. The quantitative measure determines the number of correct (and/or incorrect) records that have been generated in any one database, while the qualitative measure evaluates the &quot;correctness&quot; of any database record field.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Background Some Assumptions </SectionTitle> <Paragraph position="0"> Given the overall design of Project MURASAKI, there are a few assumptions, or rather, some groundwork that needs to be laid in order to proceed with the development of this evaluation technique. These assumptions are explained as follows: * Given the nature of the AIDS text corpus, any one text could possibly generate one or more records in one or more databases. This fact is loosely referred to as domain complexity. (Furthermore, for any record, all fields may not be filled.) * Given the structure of the AIDS domain model, it is just as easy (or hard) to distinguish one subdomain from another. That is, each database is as likely as any other to have a record generated in it. This hypothesis is known as subdomain differentiation.</Paragraph> <Paragraph position="1"> * Upon the determination of what the expected output of Project MURASAKI should resemble, a correct record (in any database) is uniquely identified by the contents of its key fields plus the contents of one or more non-key fields. This statement constitutes the definition of a correct record.3</Paragraph> <Paragraph position="2"> 3 All appropriate information should be extracted from the text and placed in the correct database. A change in any of the key fields will result in the generation of a new record. For example, if data from a different time period is presented in the text, a key field change is required, and a new record is generated. If data from a new region is presented, a new record is generated. Examples of key and non-key fields are found in Appendices B and C. Key fields, which are found in the thick, darkened boxes, are the same throughout each database.</Paragraph>
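<Paragraph position="3"> The definition of a correct record lends itself to a direct operational reading. The sketch below is one interpretation, not the project's actual matching procedure: a generated record counts as correct when it lands in the right database, all of its key fields match those of an expected record, and at least one non-key field matches as well. It reuses the hypothetical Record layout sketched earlier: </Paragraph>

    def is_correct(actual, expected):
        """One reading of the definition above: the key fields must all match,
        plus the contents of one or more non-key fields (an assumption as to
        how 'plus' is operationalized)."""
        if actual.database != expected.database:
            return False
        if actual.key_fields != expected.key_fields:
            return False
        shared = set(actual.nonkey_fields).intersection(expected.nonkey_fields)
        return any(actual.nonkey_fields[f] == expected.nonkey_fields[f] for f in shared)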
<Paragraph position="4"> Generated Output: What Could Go Wrong? After a thorough analysis of the system flow for Project MURASAKI, and given a typical AIDS text as system input, the following list represents all possible undesirable situations that could arise:</Paragraph> <Paragraph position="5"> 1. Generate one or more records in the wrong database.</Paragraph> <Paragraph position="6"> 2. Not generate one or more records in the correct database.</Paragraph> <Paragraph position="7"> 3. Generate too many records in the correct database, i.e., over-generate.</Paragraph> <Paragraph position="8"> 4. Generate too few records in the correct database, i.e., under-generate.</Paragraph> <Paragraph position="9"> 5. Generate too many fields in the correct record.</Paragraph> <Paragraph position="10"> 6. Generate too few fields in the correct record.</Paragraph> <Paragraph position="11"> 7. Generate the wrong answer in the fields.</Paragraph> <Paragraph position="12"> Situations 1 and 2 illustrate what could go wrong at the database level, while situations 3 and 4 depict possible problems arising at the database record level. The remaining situations (namely 5, 6 and 7) show what could happen at the database record field level. However, the more crucial way of viewing these problems is not so much where (i.e., at what level) these events occur, but rather how these problems can be detected and thus measured for evaluation purposes. It is with this motivation that the following categorization was derived: a quantitative measure could be designed to account for the problems that could arise at both the database and database record levels, while a qualitative measure could comparably be designed for evaluation at the database record field level.</Paragraph> <Paragraph position="13"> In the next section, two examples are given depicting how the quantitative measure accounts for problems arising at the first two levels. (Note: 'rec.' is the abbreviation for record in these examples.) A Quantitative Measure Background A scoring function is used for the quantitative measure to calculate an aggregate score for the number of correct records (as defined previously) generated ('gen.' in the following examples) for a given MURASAKI text. This scoring function assigns one point for the generation of a correct record ('cor.') and -p points, where 0 < p < 1, for the generation of an incorrect record ('inc.').</Paragraph> <Paragraph position="14"> [Table 1: Ex. #1 and Ex. #2 each show, for DB #1, DB #2, DB #3 and TOTAL, the number of records generated ('gen.'), correct ('cor.') and incorrect ('inc.'); each incorrectly generated record is penalized by -p points.]</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Some Questions </SectionTitle> <Paragraph position="0"> Given the two examples in Table 1, the following questions come to mind: * What should be the value of p? 1/2? 1/3? 1/4? Does bounding it between 0 and 1 imply any linguistic restrictions on the focus or coverage of the text? Or rather, should these bounds become parameters of this measure? * What happens if the numerator is negative? Or equal to 0? Should the score in these cases be 0? * If the score for a single text is Texti, then should the scoring algorithm for the overall (average) Quantitative Score be (Text1 + Text2 + ... + TextN)/N, where N is the total number of texts?</Paragraph>
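<Paragraph position="1"> Although the paper leaves p, the denominator, and the treatment of negative scores as open questions, the shape of the quantitative measure is fixed by the description above. The sketch below commits, purely for illustration, to one set of answers: p = 0.5, the denominator is the number of expected records, negative raw scores are clamped to 0, and the corpus score is the plain average of the per-text scores: </Paragraph>

    def text_score(num_correct, num_incorrect, num_expected, p=0.5):
        # +1 point per correct record, -p points per incorrect record; the
        # clamp at 0 and the denominator are assumed answers to the open questions.
        raw = num_correct - p * num_incorrect
        return max(0.0, raw) / num_expected

    def corpus_score(per_text_scores):
        # Overall (average) Quantitative Score: (Text1 + ... + TextN) / N.
        return sum(per_text_scores) / len(per_text_scores)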
<Paragraph position="2"> A Qualitative Measure Background Before proceeding to the design of the qualitative measure, some background is needed in order to motivate this measure. For Project MURASAKI, a database field is defined to be logically equivalent to a SLOT, while the contents of that field are equivalent to its FILLER.4 The slots define three types of DOMAINS: (1) unordered, e.g., OCCUPATIONS, (2) ordered, e.g., MONTHS-OF-THE-YEAR and (3) continuous, e.g., HEIGHT. The slot fillers have three types of ATTRIBUTES: (1) symbolic, e.g., (temperature(value tepid)), (2) numeric, e.g., (weight(value 141.3)) and (3) hybrid, e.g., (test_results(value(1,000 people were deported))). Also, the slot fillers have three types of CARDINALITY: (1) single, e.g., (sex(value male)), (2) enumerated, e.g., (subjects(value(math physics art))) and (3) range, e.g., (age(value(0 100))).</Paragraph> <Paragraph position="3"> The notion of IMPORTANCE VALUES (IVs) is introduced here; these values numerically describe how easy (or hard) it is to extract a particular field's (or slot's) information from the text. These importance values are assigned to both the key and the non-key fields of a database record for each of the five databases.5 Importance values are integers from 1 to 10, inclusive; roughly speaking, the higher the value, the easier the field's information is to extract.</Paragraph> <Paragraph position="4"> 4 The origination of this knowledge representation scheme (KRS) was taken from [4]. The application of this KRS to Project MURASAKI was taken from [1].</Paragraph> <Paragraph position="5"> 5 Recall that each database, for both Spanish and Japanese, corresponds to one of the five different subdomains within the AIDS domain model.</Paragraph> <Paragraph position="6"> With this view of importance values,6 the extraction process for Project MURASAKI may now be considered as two subprocesses; that is, extraction plus deduction. For example, the key field fuente (meaning &quot;source&quot;) may be filled with OMS or any one of the other periodicals and technical papers that are listed in the header line of each text (reference Appendix A, where the fuente is El País). Since the fuente field is constrained to only a few possible fillers, an importance value of 9 has been assigned to it.7</Paragraph> <Paragraph position="7"> 6 This view is geared to having more emphasis placed on the records that contain easier fields and less on the harder ones, thus not rewarding those who perform well on the harder fields.</Paragraph> <Paragraph position="8"> 7 An importance value of 10 would have been assigned had it not been for the fact that, in some instances, the &quot;deduction&quot; portion of the extraction process for this field specifies the conversion of some sources to their respective acronyms, e.g., OMS for Organización Mundial de la Salud (WHO).</Paragraph>
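<Paragraph position="9"> To make the slot/filler terminology concrete, the sketch below encodes a slot's domain type, filler attribute, cardinality and importance value. The class layout is an assumption, as are the IV of 5 and the ordering of values chosen for the TEMPERATURE example used next (its original figure did not survive); only the IV of 9 for fuente comes from the text: </Paragraph>

    from dataclasses import dataclass

    @dataclass
    class Slot:
        name: str
        domain_type: str    # "unordered", "ordered" or "continuous"
        attribute: str      # "symbolic", "numeric" or "hybrid"
        cardinality: str    # "single", "enumerated" or "range"
        importance: int     # IV: an integer from 1 to 10 (higher = easier to extract)
        values: tuple = ()  # ordered members, or (low, high) endpoints if continuous

    # The fuente slot described above: few possible fillers, hence an IV of 9.
    fuente = Slot("fuente", "unordered", "symbolic", "single", 9)

    # A guessed ordering and IV for the TEMPERATURE example discussed below.
    temperature = Slot("temperature", "ordered", "symbolic", "single", 5,
                       values=("frigid", "cold", "cool", "lukewarm", "tepid", "warm", "hot"))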
</Section> <Section position="4" start_page="0" end_page="66" type="sub_section"> <SectionTitle> Scoring Functions &amp; Algorithm </SectionTitle> <Paragraph position="0"> Scoring functions are also used for the qualitative measure to calculate an aggregate penalty for the fields (both key and non-key) in a database record. There are three types of scoring functions, based upon the cardinality of the slot fillers: (1) single, (2) enumerated and (3) range.8 An example of an ordered domain with single fillers is TEMPERATURE. (The filler x in the database-in slot represents the single character identification value for a particular AIDS database.)</Paragraph> <Paragraph position="1"> 8 In Project MURASAKI, only slots that contain single fillers have been identified thus far.</Paragraph> <Paragraph position="2"> Continuing with this example, suppose the following actual output (AO) were to be matched against what was expected (EO, the expected output): AO: (temperature (value cool)) EO: (temperature (value lukewarm)) The penalty assigned to this mismatch would depend on two variables: (1) D, the distance between the fillers in the ordered set of values, and (2) C, the size of the domain. The scoring function that relates these two variables is given as Equation 1, where W is the numerical weight on the distance between the fillers and F is a damping function on the size of the domain.</Paragraph> <Paragraph position="3"> As mentioned before, an example of an unordered domain with single fillers is OCCUPATIONS. Since the distance D is not meaningful for such a domain, the penalty assigned to the match becomes a function merely of the size of the domain (and hence of the probability of the correct filler appearing).</Paragraph> <Paragraph position="4"> As before, suppose we are trying to match the CASOS_NOTIFICADOS slots between the actual output and the expected output: AO: (casos_notificados (value 2.700)) EO: (casos_notificados (value 2.781)) Since only numbers can be represented in a continuous domain, the elements of the domain are defined by giving the endpoints of the domain (a closed interval), and the unit size of representation is used in computing the distance between fillers. When defined in this manner, the same scoring function that was used for an ordered domain with single fillers (namely Equation 1) can be used to compute the penalty for continuous domain sets as well.</Paragraph> <Paragraph position="5"> The overall Score for a single database record combines the field penalties with their importance values, for i = 1, 2, ..., (number of fields in that database record). The Pi's are the computed penalties between each field of the actual output and the expected output for that particular database record. The IVi's are the importance values for the corresponding fields of that database record.</Paragraph> <Paragraph position="6"> The Scoring Algorithm that computes the overall qualitative measure for the entire text corpus proceeds field by field: if the EO_field and the AO_field are equal, no penalty is assessed; otherwise, the scoring function appropriate to the slot's domain and cardinality computes the penalty.</Paragraph> <Paragraph position="7"> So far, fields that contain either numeric fillers or single word fillers (fillers that are both easily &quot;distanceable&quot;) have been discussed. However, one would think that the more linguistically complex fields, i.e., those containing generated natural language phrases, would be more of a true test for the qualitative measure of this evaluation technique. Consider, for example, a non-key field like población (&quot;population&quot;) (from Appendix C): AO: población inmigrantes (&quot;immigrants&quot;) EO: población personas que pretendían entrar en el país (&quot;people who try to enter the country&quot;) How should one extend the current notion of the qualitative measure to include evaluating the distance between natural language phrases of this kind? It would appear that población would be an unordered domain containing symbolic information. However, what are the elements of this domain? Should they have cardinality single? Should they include only those phrases that were generated from the expected output, or should they additionally include all semantically equivalent phrases, i.e., those containing a common set of semantic primitives or attributes, as well? If the latter situation were to prevail, then, in the example listed above, should a penalty be assessed? If so, by how much?</Paragraph> <Paragraph position="8"> Or rather, should one group together all semantically equivalent phrases and then determine the distance between these classes? Consider another example of an unordered domain field, from the Testing Policies database: AO: resultados han deportado a 1000 personas que resultaron (&quot;they have deported 1000 people who turned out ...&quot;) EO: resultados desde 1985, han deportado a 1000 personas que resultaron (&quot;since 1985, they have deported 1000 people who turned out ...&quot;) Should this non-key field be defined as having both a symbolic and a numeric, i.e., hybrid, attribute? If so, should a scoring function based on symbolic and numeric text be designed? Given the example above, should a penalty be assigned for the lack of a specific time element (in the actual output), or are these phrases semantically equivalent?</Paragraph> <Paragraph position="9"> A possible algorithmic extension to the current qualitative measure is outlined as follows: 1. for a given database field, obtain and examine all possible fillers, 2. group/classify semantically equivalent phrases (by those that share common semantic primitives/attributes, e.g., theme, agent, actor, time, etc.) and then 3. calculate the distance between each group/class (by determining by just how many semantic primitives/attributes they differ from each other). If this approach were taken, the scoring function of Equation 1 would be applicable, where D would be the distance between classes of fillers rather than just between the fillers themselves.</Paragraph>
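<Paragraph position="10"> Neither Equation 1 nor the record-level Score formula survives in this copy, so the sketch below commits to one plausible reading of each: a penalty of W times D, damped by a function of the domain size C; a size-only penalty for unordered domains; unit-step distances for continuous domains; an importance-weighted average for the record Score; and, for the proposed extension, a class distance counted as the number of differing semantic primitives. The specific functional forms are assumptions; only the variables (D, C, W, the damping function F, the Pi and the IVi) come from the text: </Paragraph>

    import math

    def damping(c):
        # Assumed damping function F on the domain size C (not given in the paper).
        return math.log(c + 1.0)

    def ordered_penalty(d, c, w=1.0):
        # One plausible reading of Equation 1: weight W on the filler distance D,
        # damped by the size C of the ordered domain.
        return (w * d) / damping(c)

    def unordered_penalty(c):
        # D is not meaningful here; penalize by the chance of guessing the filler.
        return 1.0 - 1.0 / c

    def continuous_penalty(actual, expected, unit, low, high, w=1.0):
        # Distance in unit steps over the closed interval [low, high], then
        # Equation 1 applies just as in the ordered case.
        d = abs(actual - expected) / unit
        c = (high - low) / unit
        return ordered_penalty(d, c, w)

    def record_score(penalties, importance_values):
        # Assumed form of the overall Score: importance-weighted average of the
        # field penalties Pi (a field whose EO_field equals its AO_field has a
        # penalty of 0 and so contributes nothing).
        total_iv = sum(importance_values)
        return sum(iv * p for iv, p in zip(importance_values, penalties)) / total_iv

    def class_distance(primitives_a, primitives_b):
        # Proposed extension: distance between semantic-equivalence classes,
        # counted as the number of semantic primitives/attributes (theme, agent,
        # actor, time, ...) by which the two classes differ.
        return len(set(primitives_a) ^ set(primitives_b))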
<Paragraph position="11"> Conclusion It is hoped that this evaluation technique will prove effective for Project MURASAKI and thus become the basis on which to develop a general purpose evaluation tool. Research continues on answering the quantitative questions and on resolving the qualitative issues raised above.</Paragraph> </Section> </Section> </Paper>