<?xml version="1.0" standalone="yes"?>
<Paper uid="J86-2002">
  <Title>SUMMARIZING NATURAL LANGUAGE DATABASE RESPONSES</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
108 Computational Linguistics, Volume 12, Number 2, April-June 1986
</SectionTitle>
    <Paragraph position="0"> Kalita, Jones, and McCalla Summarizing Natural Language Database Responses of quantity and relation in that most users would be given information they already knew. Similarly, producing the answer All students who got over 60% (if this happened to be true in the current data base) might mislead a user into thinking that 60% was a passing grade, hence violating the maxim of quality (actually, Joshi's generalization of this maxim since, strictly speaking, the answer is truthful). Some techniques are incorporated into our system to help reduce the chances of this kind of thing occurring (see section 4), but it is still a problem, as we discuss in the concluding section.</Paragraph>
    <Paragraph position="1"> Generation of summary responses is analogous to a reversal of the interpretation process. A natural language question is interpreted into one or more propositions the data in the answer must satisfy, and then the appropriate data is retrieved. In a conventional database management system (DBMS), this extensional response is the only possible answer. But, we want to go back from the extensional data to predicates describing characteristics of the data and from there to natural language. Consider the query, Which employees use a company car?. The internal form into which this question is interpreted might be (employee uses car) &amp; (car belongs to company) (The actual internal notation used in our system is more complicated than this - see section 4.3). A conventional DBMS would produce a response consisting of a set of employee names and possibly other relevant information about them. But, we want to obtain a descriptive answer, such as (employee is president) V (employee is vice-president) which in turn can be expressed in natural language as The president and the vice-presidents.</Paragraph>
    <Paragraph position="2"> Hence, we must obtain a description that is true of the relevant data and present the description to the questioner instead of providing the actual data values that satisfy the propositions set forth in the question.</Paragraph>
    <Paragraph position="3"> It is possible for a system to arrive at such concise responses from an extended database schema by employing a heuristic search of the extensional data for the existence of &quot;interesting&quot; patterns. In the next section we give an overview of a system for producing summary responses.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 OVERVIEW OF THE SYSTEM
</SectionTitle>
    <Paragraph position="0"> We have designed a system that produces summary responses to queries posed to a simple relational data base of student records. In order to concentrate on the pragmatics issues underlying the generation of summary responses, we ignore the complexities of starting with, and eventually producing, surface language. Instead, the system starts with predicates representing the user's query and produces predicates representing a summary response.</Paragraph>
    <Paragraph position="1"> The flow of control in the system is simple. The user's query is formulated in an internal form which is understood by the underlying database management system. This internal form is discussed more fully in section 4.3. Using this query, the DBMS obtains the extensional response set, that is, the tuples that satisfy the user's query. After the data is accessed, the system consults its knowledge base to try to formulate a summary response.</Paragraph>
    <Paragraph position="2"> A prime component of this knowledge base is a set of heuristics used to find interesting non-enumerative patterns. As soon as a heuristic succeeds in discovering such a pattern, the system terminates the search and produces the response as dictated by the successful heuristic. This response is also in an internal notation identical in form to that used to represent the input. If all heuristics fail, the system reports its inability to produce a descriptive response. In any event, the user may ask the system to produce an extensional list of the data if desired.</Paragraph>
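    <Paragraph> The flow of control described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names summarize, always_fails, and trivial_all are hypothetical, tuples are modelled as Python dicts, and each heuristic is assumed to be a function that returns a description or None.

```python
def summarize(t_qual, t_unqual, heuristics):
    """Try each heuristic in order and return the first description found."""
    for heuristic in heuristics:
        description = heuristic(t_qual, t_unqual)
        if description is not None:
            return description      # the search stops at the first success
    return None                     # no descriptive response could be produced


def always_fails(t_qual, t_unqual):
    # Hypothetical heuristic that never finds a pattern.
    return None


def trivial_all(t_qual, t_unqual):
    # Hypothetical toy heuristic: succeeds only when every tuple qualifies.
    return "All tuples" if not t_unqual else None


ORDERED_HEURISTICS = [always_fails, trivial_all]   # simplest search first
```

When summarize returns None, the caller would report the failure and could still offer the extensional list, as the text describes.</Paragraph>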
    <Paragraph position="3"> Let's look at the knowledge base in slightly more detail. In order for the system to provide meaningful descriptive responses, the user's conceptions regarding the nature and contents of the data base must be taken into account. Without a separate knowledge base, this would be impossible. The knowledge base is employed to outline strategies for obtaining summary responses, to ensure that the qualitative responses generated are appropriate, and to produce salient information for describing the data that satisfy a query. The knowledge base consists of two distinct parts: the heuristics, and the frames for the relations and attributes.</Paragraph>
    <Paragraph position="4"> The heuristics guide the search for &quot;interesting&quot; patterns in the data; the frames assist in determining &quot;interestingness&quot;. The heuristics are the procedural part of the system's knowledge. There are several heuristics, including the equality, inequality, range, conjunction, disjunction, and foreign-key heuristics. They are ordered according to the complexity of the search procedures involved and are tried in this order so that the easiest (and usually the simplest to understand) summary response is found first.</Paragraph>
    <Paragraph position="5"> The second part of the system's knowledge is represented by frames which encode useful information about the relations in the data base and their attributes. There are two types of frames: relation frames, which suggest ways of joining relations together in order to facilitate the discovery of elaborate patterns in the data; and attribute frames, which give characteristics of various attributes in the relations in order to aid the determination of relevant and interesting patterns.</Paragraph>
    <Paragraph position="6"> Currently, both the frames and the heuristics must be prespecified by the system designer, rather than automatically created by the system to suit a given database context. However, this isn't a big problem since the heuristics are domain-independent and, hence, may be used with any other database domain without modification. And, although the frames must be tailored to reflect characteristics of the particular data base and user, the frame notation is sufficiently straightforward that it seems possible for a database manager to be able to do it relatively easily.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 DETAILS OF THE SYSTEM
4.1 THE RELATION AND ATTRIBUTE FRAMES
</SectionTitle>
    <Paragraph position="0"> The sample relational data base used in our implementation consists of three relations: STUDENTS, COURSE-DESCRIPTIONS, and COURSE-REGISTRATIONS.</Paragraph>
    <Paragraph position="1">  The data base stores useful information about graduate students and the courses in which they register. The relations and their attributes are shown in Figure 1. Key attributes are shown in italics.</Paragraph>
    <Paragraph position="2"> The current relation frames are very simple. Each frame corresponds to an actual relation in the data base; it provides the possible links with all other relations. In other words, these frames define all lossless joins of two relations. In cases where a direct join is not possible between two specific relations, the frame contains the name of a third relation that must be included in the join. If two relations R1 and R2 can be directly joined through attributes A1 in R1 and A2 in R2, the corresponding entry in the LINKS slot is ((R1 R2) (A1 A2)).</Paragraph>
    <Paragraph position="3"> If the relations R1 and R2 cannot be joined directly, but can be indirectly joined through a relation R3, the corresponding entry in the LINKS slot of the relation frames for R1 and R2 is ((R1 R2 R3) (A1 A31) (A32 A2)).</Paragraph>
    <Paragraph position="4"> The first sublist indicates that the relations R1 and R2 can be indirectly joined through relation R3. The second sublist indicates that R1 and R3 can be joined using the attribute A1 in R1 and the attribute A31 in R3. Similarly, the relations R3 and R2 can then be joined through the attribute A32 in R3 and A2 in R2. For the STUDENTS relation under consideration, the relation frame can be seen in Figure 2. The relations STUDENTS and COURSE-REGISTRATIONS may be joined through the fields STUDENT-ID-NO in STUDENTS and STUDENT-ID in COURSE-REGISTRATIONS. The</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
relations STUDENTS and COURSE-DESCRIPTIONS
</SectionTitle>
    <Paragraph position="0"> cannot be joined directly; the join has to be performed through the relation COURSE-REGISTRATIONS.</Paragraph>
    <Paragraph position="1"> STUDENTS and COURSE-REGISTRATIONS are linked through the fields named above. COURSE-</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
REGISTRATIONS and COURSE-DESCRIPTIONS are
</SectionTitle>
    <Paragraph position="0"> joined through the COURSE-NO field in both these relations.</Paragraph>
    <Paragraph position="1"> The information in the relation frames is employed when the system fails to produce a non-enumerative answer after exhausting all the heuristics that deal with only one relation. The system then attempts to find a descriptive expression considering another relation with which the original or target relation has some common join-attribute(s).</Paragraph>
    <Paragraph position="2"> Relation frames allow the database manager the flexibility of naming attributes differently in different relations. They also can be used to restrict the types of joins that can be undertaken (i.e. not all possible joins need to be specified). Except for these distinctions, it would be relatively straightforward to generate the relation frames automatically.</Paragraph>
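    <Paragraph> The LINKS entries described above lend themselves to a small lookup routine. The sketch below is only an illustration: the tuple encoding, the name STUDENTS_LINKS, and the function join_path are assumptions made here, while the relation and field names are those given in the text.

```python
# Hypothetical encoding of the LINKS slot of the STUDENTS relation frame.
# Each entry mirrors ((R1 R2) (A1 A2)) or ((R1 R2 R3) (A1 A31) (A32 A2)).
STUDENTS_LINKS = [
    (("STUDENTS", "COURSE-REGISTRATIONS"),
     ("STUDENT-ID-NO", "STUDENT-ID")),
    (("STUDENTS", "COURSE-DESCRIPTIONS", "COURSE-REGISTRATIONS"),
     ("STUDENT-ID-NO", "STUDENT-ID"),
     ("COURSE-NO", "COURSE-NO")),
]


def join_path(links, source, target):
    """Return the joins needed as (left_rel, left_attr, right_rel, right_attr)."""
    for relations, *steps in links:
        if relations[0] != source or relations[1] != target:
            continue
        if len(relations) == 2:                  # direct join
            a1, a2 = steps[0]
            return [(source, a1, target, a2)]
        via = relations[2]                       # indirect join through R3
        (a1, a31), (a32, a2) = steps
        return [(source, a1, via, a31), (via, a32, target, a2)]
    return None                                  # no join is specified
```

Leaving an entry out of the list is one way to model the restriction mentioned above, namely that not all possible joins need to be specified.</Paragraph>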
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> In addition to the relation frames, the system is provided with a number of attribute frames, each of which corresponds to an actual attribute in the data base.</Paragraph>
    <Paragraph position="1"> Attribute frames are critical in this approach to summary response generation and thus are described in some detail. Attribute frames allow important attributes and meaningful attribute values to be specified in advance.</Paragraph>
    <Paragraph position="2"> Together with the heuristics, they give our system many of the abilities of McCoy's (1982) ENHANCE system to reflect a user's preconceived notions as to which patterns of data are meaningful and which are not. A different set of attribute frames can be designed for each type of user (presumably by the database manager), thus allowing user modelling of a sort to be implemented.</Paragraph>
    <Paragraph position="3"> Attribute frames guide the system in describing the data on the basis of attributes whose values serve to partition an entity class (represented by a relation in the data base) into two mutually exclusive subclasses, namely the part of the entity class that satisfies the user's query and the part that does not. As pointed out by Lee and Gerritsen (1978), some partitions of an entity class are more meaningful than others. Our system employs attribute frames to determine which attributes should be used for describing a partition and which resulting classifications are meaningful. Figure 3 shows an attribute frame for the attribute NATIONALITY in the STUDENTS relation.</Paragraph>
    <Paragraph position="4">  The NAME slot contains the internal name of the attribute, i.e. the name under which it is stored in the data base, and the name of the relation in which it occurs. If the attribute occurs in more than one relation, this field contains an entry for each relation. The general format of the contents of this slot is</Paragraph>
    <Paragraph position="6"> The expression within the &quot;[ ]&quot; brackets is optional. The three dots indicate that an arbitrary number of repetitions of the immediately preceding expression is allowed. In the case of the attribute frame of Figure 3, the NAME slot indicates that the frame represents information about the NATIONALITY attribute in the STUDENTS relation.</Paragraph>
    <Paragraph position="7"> The second slot, NATURE-OF-ATTRIBUTES, contains information regarding the type of values contained in the field - e.g., numeric, character, or boolean. The NATIONALITY attribute assumes character values.</Paragraph>
    <Paragraph position="8"> The DISTINGUISHING-VALUE slot provides information for distinguishing a subclass of an entity from other subclasses. This slot stores any distinguishing values the attribute may take. These values are crucial in producing descriptive responses to the user's queries, so some time will be spent elaborating this idea. The slot contains one or more clauses, each of the following format:</Paragraph>
    <Paragraph position="10"> If the actual values of the attribute satisfy &quot;applicable-operator-1-1&quot; with respect to the contents of the list &quot;list-of-attribute-values-1&quot;, the actual values may be termed as &quot;denomination-1-1&quot; for producing responses. If the value of &quot;denomination-1-1&quot; is null, no special names can be attached to the actual values of the attribute.</Paragraph>
    <Paragraph position="11"> In the NATIONALITY attribute frame of Figure 3, a number of distinguishing values have been specified. Consider the clause ((Canadian) (=) (≠ foreign)). The value &quot;Canadian&quot; is a distinguishing value. The term &quot;(=)&quot; indicates that it is possible to identify a class of students using the descriptive expression &quot;NATIONALITY = Canadian&quot;. If NATIONALITY ≠ &quot;Canadian&quot;, the student may be referred to as a &quot;FOREIGN&quot; student.</Paragraph>
    <Paragraph position="12"> Similarly, if the value stored for a student under the attribute NATIONALITY is a member of the set (U.K., U.S.A., Australia, ...), he/she may be designated as coming from an English-speaking country. Finally, if the student has value U.K., France, etc. for NATIONALITY, he/she may be considered to be from Europe.</Paragraph>
    <Paragraph position="13"> Distinguishing values correspond to key values that naturally divide the values in a domain into distinct classes. In this sense they are very similar to McCoy's (1982) &quot;very specific axioms&quot;, although how they interact with heuristics to produce summary responses is different. To illustrate, for most users the value 18 of an AGE attribute is a distinguishing value dividing children from adults; 65 is a distinguishing value separating adults from senior citizens. Other values are not important and, therefore, should not be considered to be &quot;distinguishing&quot;. Similarly, suppose that a grade point average of six or greater is necessary for a graduate student to register in four courses rather than the usual three courses. The value &quot;6&quot;, then, can be considered to be a distinguishing value for the CUMULATIVE-GPA attribute. This would allow questions like Which students are taking four or more courses? to be answered with All students with GPA of six or higher rather than with the response All students with GPA of 6.52 or higher which might be true of the current data. The latter response is inappropriate because it violates the maxim of quality in that it might mislead the user into thinking that 6.52 is a significant value in the University. (See Q8-S8 in section 4.2.4 for the details as to how the proper summary response for this kind of question is generated by our system.) Returning to the NATIONALITY frame of Figure 3, the distinguishing values specified there would make it possible for our system to answer the question Which students are taking the &quot;Intensive English&quot; course in the Fall term? with the response Most entering foreign students from non-English speaking countries rather than the misleading answer All students from China, Iran, and France, which might happen to be true currently. Once again, the latter response violates the maxim of quality, a common occurrence if summary responses are not carefully tuned to reflect significant domain subdivisions.</Paragraph>
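    <Paragraph> To make the role of the clauses concrete, the following sketch maps an attribute value to the denominations it may receive. The encoding is an assumption made for illustration: NATIONALITY_CLAUSES, the operator labels, and the function denominations are hypothetical, and the country lists contain only the values named in the text, since the full lists are elided there.

```python
# Hypothetical rendering of the DISTINGUISHING-VALUE slot for NATIONALITY.
# Each clause is (list_of_values, [(operator, denomination), ...]);
# a denomination of None means that no special name is attached.
NATIONALITY_CLAUSES = [
    (["Canadian"], [("=", None), ("!=", "foreign")]),
    (["U.K.", "U.S.A.", "Australia"], [("member", "English-speaking")]),
    (["U.K.", "France"], [("member", "European")]),
]


def denominations(value, clauses):
    """Return every denomination that the clauses allow for this value."""
    names = []
    for values, operators in clauses:
        for operator, name in operators:
            if name is None:
                continue                          # nothing to call the value
            if operator == "!=" and value not in values:
                names.append(name)
            elif operator in ("=", "member") and value in values:
                names.append(name)
    return names
```

Under these assumed clauses a student recorded as France is both foreign and European, while a student recorded as Canadian receives no special name.</Paragraph>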
    <Paragraph position="14"> The DISTINGUISHING VALUE slot enables the data-base manager to specify classifications that he/she would a priori like to appear meaningful to the user in descriptive responses. Without this information the system may fail to faithfully reflect the user's perceived notions regarding appropriate partitioning of entity classes. By changing the distinguishing values, the database manager can adapt the system to serve the needs of a variety of users. Although it isn't our concern here, it would even be possible to remove all distinguishing values and hence have the system produce no summary responses. For any given class of users, the database manager will need to specify all of these distinguishing values by hand, but once they are specified, they can be used by many different heuristics in many different situations for as long as the database structure remains the same, even if the tuples in the data base change. Further examples of the use of distinguishing values and how they interact with the heuristics will be presented shortly.</Paragraph>
    <Paragraph position="15"> Let us return at last to the other slots in an attribute frame. The POTENTIAL-RANGE slot provides an approximate range in which the values of the attribute may lie. The information in this slot is employed in conjunction with the range heuristics which are discussed in the next section. In the NATIONALITY attribute frame of Figure 3, the potential range would be specified in terms of a long list (not shown) of possible countries of origin.</Paragraph>
    <Paragraph position="16"> It is sometimes necessary to round off values of numeric attributes in order to produce answers with acceptable range specifications. However, not all numeric attributes can be rounded. Whether rounding is allowable for a particular attribute depends on several factors including the type of values the attribute can assume (i.e., integer, real, etc.) and the potential range of its values as well as other attribute characteristics. The ROUNDING-TO-BE-DONE? slot contains a boolean value indicating whether rounding is appropriate for the particular attribute under consideration. It obviously is not for the character values of the NATIONALITY attribute frame.</Paragraph>
    <Paragraph position="17"> Straightforward as it may seem, rounding allows our system to avoid violating Grice's maxim of manner, specifically by making answers less obscure.</Paragraph>
    <Paragraph position="18"> It is often more useful to provide descriptive answers on the basis of certain preferred attributes. For example, in the STUDENTS relation, it is more &quot;meaningful&quot; to provide answers on the basis of the attribute NATIONALITY or UG-MAJOR rather than STUDENT-ID-NO or AMOUNT-OF-FINANCIAL-AID. However, it is impossible to give a concrete weight regarding each attribute's preferability. Therefore, we have classified the attributes into several groups; all attributes in a group are considered equally useful in producing meaningful qualitative answers to queries. The groups for the STUDENTS relation are given in Figure 4.</Paragraph>
    <Paragraph position="19"> This classification means that it is preferable and more useful to produce descriptive responses using the attributes in group 1 than the attributes in group 2, and the attributes in 2 are preferable to 3, which are in turn preferable to 4. This categorization is done by the database manager, based on his/her judgement as to the perspectives of the various classes of users. In the slot PREFERENCE-CATEGORY, there is an entry corresponding to each relation the attribute occurs in. The information in this slot ensures that the system chooses a description based on the most salient attribute for producing a response. The preference category of the NATIONALITY attribute of Figure 3 is 1.</Paragraph>
    <Paragraph position="20"> Preferred attributes perform for our system the same function that McCoy's (1982) &quot;important attributes list&quot; does for the ENHANCE system. We go further than McCoy in specifying several preference categories, rather than having one long list. Although all attributes are assigned a preference category in this example, we can, like McCoy, leave out unimportant attributes altogether if it is appropriate to do so.</Paragraph>
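    <Paragraph> One simple way to use preference categories is to examine candidate attributes in category order. The sketch below assumes a dictionary encoding; only the category 1 for NATIONALITY is stated in the text, and the other category numbers, like the names PREFERENCE_CATEGORY and order_by_preference, are hypothetical.

```python
# Hypothetical PREFERENCE-CATEGORY values for some STUDENTS attributes.
# Only NATIONALITY = 1 comes from the text; the rest are assumptions.
PREFERENCE_CATEGORY = {
    "NATIONALITY": 1,
    "UG-MAJOR": 1,
    "AMOUNT-OF-FINANCIAL-AID": 3,
    "STUDENT-ID-NO": 4,
}


def order_by_preference(attributes, categories=PREFERENCE_CATEGORY):
    """Sort attributes so that the most salient (lowest category) come first."""
    # sorted() is stable, so attributes in the same group keep their order,
    # which matches the idea that members of a group are equally useful.
    return sorted(attributes, key=lambda name: categories[name])
```

Attributes that the database manager considers unimportant could simply be omitted from the dictionary and from the candidate list.</Paragraph>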
    <Paragraph position="21"> Let us now look at two more attribute frames. Figure 5 shows the frame for the attribute CUMULATIVE-GPA in the STUDENTS relation. From this, it is clear that CUMULATIVE-GPA takes real values in the range 0.00 to 8.00. If CUMULATIVE-GPA is in the range 2.00-4.00, it may be termed &quot;poor&quot;; similarly, if it is in the range 4.00 to 6.00, it is considered &quot;good&quot;, and so on. If none of the first five clauses in the DISTINGUISHING VALUE slot is satisfied, the system attempts to use the last two clauses. The clause ((2.00) (&gt;) (&lt;)) says that we can use expressions such as &quot;GPA &gt; 2.00&quot; or &quot;GPA &lt; 2.00&quot;, which cover a wider range than the first five clauses (e.g. Which students are allowed to continue? might be answered All students with GPA of 2 or more. - see also Q7-S7 in section 4.2.4). It should be noted that these expressions may be used only if all values for the attribute GPA in the selected tuples satisfy the corresponding condition. However, we cannot use expressions of the form &quot;GPA = 2.00&quot;. We avoid using equalities for attributes that assume rational values. The clause ((6.00) (&gt;) (&lt;)) conveys a similar idea.</Paragraph>
    <Paragraph position="22"> Figure 6 shows the frame for the attribute NO-OF-COURSES-THIS-TERM in the STUDENTS relation. From this figure, one can conclude that the attribute NO-OF-COURSES-THIS-TERM assumes integer values in the range 0-6. If this field has a value &lt;2, it may be termed &quot;light-load&quot;. If NO-OF-COURSES-THIS-TERM is either 3 or 4, it is &quot;normal-load&quot;. If the value of the attribute is &gt; 5, it is &quot;heavy load&quot;. The values of the thresholds shown here are applicable in the case of graduate students. These values would, obviously, be different if we considered a data base of undergraduate students.</Paragraph>
    <Paragraph position="23"> Currently, the attribute frames are static entities with their contents being defined a priori by the database manager to reflect the expectations of one set of users.</Paragraph>
    <Paragraph position="24"> Of course it is possible to have many different sets of attribute frames for many different classes of users, but a better approach might be to allow the user to alter the contents of these frames interactively to suit his/her own idiosyncratic perceptions of the information in the data base. This would require us to figure out how to present the frames and the possible changes to the user, something we haven't done as yet. Even more difficult would be the automatic creation (and later adjustment) of the attribute frames as the result of feedback from a particular user or class of users. This is clearly a major research issue beyond the scope of our current concerns.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 THE HEURISTICS
</SectionTitle>
      <Paragraph position="0"> As mentioned earlier, the heuristics employed in the system are procedural in nature. In conjunction with the frames described above, they guide the system to search for various interesting patterns that distinguish the tuples describing the query response from the rest of the tuples in the data base. The interesting patterns are similar to McCoy's (1982) &quot;distinguishing descriptive attributes&quot;, although we use them to produce summary responses rather than to answer questions on database structure.</Paragraph>
      <Paragraph position="1"> Our use of a number of specially designed heuristics using frames to produce meaningful responses is also different from McCoy's approach, which uses three different kinds of axioms to control a general search procedure.</Paragraph>
      <Paragraph position="2"> In order to help overcome possible problems of combinatorial explosion (mentioned as a problem for McCoy's ENHANCE as well), the heuristics are linearly ordered according to the complexity of the required search procedures. Hence, the system first searches for simple patterns; the complexity of the response patterns grows as later heuristics are employed. This ordering of the heuristics assumes that, if more than one descriptive answer can be obtained for a query, it is sensible to produce the &quot;simplest&quot; one. It would be easy to change this if more sophisticated termination conditions for the search were desired.</Paragraph>
      <Paragraph position="3"> We assume that the natural language query has been parsed and transformed to an internal form, and the required data have been accessed. The heuristics are applicable only after the tuples that satisfy the user's query are at hand. Let Tqual be the set of tuples that satisfy the user's query, and Tunqual be the rest of the tuples in the relation relevant to the current query.</Paragraph>
      <Paragraph position="4">  The equality heuristic is the most elementary of all the heuristics. It corresponds to the usage of everyday words such as all, everybody and everyone. To start our discussion, we present a formal specification of the heuristic.</Paragraph>
      <Paragraph position="5"> Determine if all data values appearing as the value of a particular attribute A in Tqual are the same (say, α). The value α must be a DISTINGUISHING VALUE in the domain of values for attribute A. If so, and if no tuple in Tunqual has the value α for the attribute A, the general formulation of the response is: All tuples having the value α for attribute A.</Paragraph>
      <Paragraph position="6"> An example is the question-answer pair Q3-S3: Q3: Who are the Canadian students with GPA of 7.5 or higher? S3: All students receiving NSERC scholarships.</Paragraph>
      <Paragraph position="7"> For applying this heuristic, the value α of the attribute A must have some &quot;distinguishing&quot; importance in the domain. In the above example, the attribute under consideration is NATURE-OF-FINANCIAL-AID. The value NSERC is considered to be a DISTINGUISHING VALUE in the domain of values that the attribute NATURE-OF-FINANCIAL-AID can take.</Paragraph>
      <Paragraph position="8"> The equality heuristic may also be applied to certain numeric attributes. Consider the following question and answer pertaining to the graduate student data base.</Paragraph>
      <Paragraph position="9"> Q4: Which students have completed less than 5 courses? S4: All first year students.</Paragraph>
      <Paragraph position="10"> Here, the value of the attribute NO-OF-YEARS-COMPLETED is 0 for all tuples that satisfy the query Q4. Also, among the unqualified tuples, there is none in which NO-OF-YEARS-COMPLETED = 0. Finally, the value 0 distinguishes first year students from others, according to the attribute frame for NO-OF-YEARS-COMPLETED. Before leaving the equality heuristic, it should be noted that Q1-S1 (Which employees engage in profit sharing? - All vice-presidents.) from section 2 could be handled by the equality heuristic (all employees engaging in profit sharing have the rank &quot;vice-president&quot;; nobody who isn't engaging in profit sharing has this rank).</Paragraph>
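    <Paragraph> The formal statement of the equality heuristic translates almost directly into code. The following is a minimal sketch, assuming tuples are Python dicts and the distinguishing values of the attribute are supplied as a set; the function name and the sample rows are hypothetical and only mimic the Q3-S3 example.

```python
def equality_heuristic(attribute, t_qual, t_unqual, distinguishing_values):
    """Return a description if the equality heuristic applies, else None."""
    values = {row[attribute] for row in t_qual}
    if len(values) != 1:
        return None                  # qualifying tuples do not share one value
    (alpha,) = values
    if alpha not in distinguishing_values:
        return None                  # the shared value is not distinguishing
    if any(row[attribute] == alpha for row in t_unqual):
        return None                  # an unqualified tuple has the value too
    return f"All tuples having the value {alpha} for attribute {attribute}"


# Hypothetical sample data in the spirit of Q3-S3.
AID = "NATURE-OF-FINANCIAL-AID"
SAMPLE_QUAL = [{AID: "NSERC"}, {AID: "NSERC"}]
SAMPLE_UNQUAL = [{AID: "University"}, {AID: "None"}]
```

With the sample rows and NSERC declared distinguishing, the function yields a description; if NSERC is not distinguishing, or if some unqualified tuple also carries it, the heuristic fails and a later heuristic would be tried.</Paragraph>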
      <Paragraph position="11">  The dual of the equality heuristic is the inequality heuristic; instead of looking for equalities, the system searches for inequalities. Formally, the heuristic may be stated as, Determine if each data value for a particular attribute A in Tqual is not equal to some particular value γ and all tuples in Tunqual have that value. This value γ must be a DISTINGUISHING VALUE in the domain of the values for attribute A. The general formulation of the response is All tuples with value of attribute A ≠ γ.</Paragraph>
      <Paragraph position="12"> In order to produce the required response, the system must make certain that A ≠ γ is not true in any of the tuples which do not satisfy the user's query.</Paragraph>
      <Paragraph position="13"> Let us consider an example. In the student data base, the value &quot;Computer Science&quot; for the attribute UG-MAJOR may be considered a distinguishing value. This allows us to produce a response such as All students with majors other than Computer Science.</Paragraph>
      <Paragraph position="14"> or, equivalently, All non-Computer Science majors.</Paragraph>
      <Paragraph position="15"> as in the following question and answer pair: Q5: Which students have taken more than six courses? S5: All students with non-Computer Science undergraduate background.</Paragraph>
      <Paragraph position="16">  At the same time, we may avoid producing a response such as (say) All students from departments other than Mechanical Engineering, if Mechanical Engineering is not of interest to us. Thus, it is clear that the specification of distinguishing attribute values is dependent on the user's conception of the data as well as the application under consideration. It should be noted that phraseology subtleties such as the differences between All non-Computer Science majors, All students with majors other than Computer Science, or All students with non-Computer Science undergraduate background are not reflected in different internal notations, but are the responsibility of the natural language generation component which we haven't developed as yet. Such subtleties can be quite important, but are left for future research. The whole issue of natural language generation (and interpretation) is discussed further in section 4.3.</Paragraph>
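    <Paragraph> The inequality heuristic can be sketched in the same style as its dual. Again this is only an illustration under assumptions: tuples are dicts, the distinguishing values are passed as a list, and the function name and sample rows (modelled on Q5-S5) are hypothetical.

```python
def inequality_heuristic(attribute, t_qual, t_unqual, distinguishing_values):
    """Return a description if the inequality heuristic applies, else None."""
    for gamma in distinguishing_values:
        qual_differs = all(row[attribute] != gamma for row in t_qual)
        unqual_equals = all(row[attribute] == gamma for row in t_unqual)
        if t_qual and qual_differs and unqual_equals:
            return f"All tuples with value of attribute {attribute} != {gamma}"
    return None


# Hypothetical sample data in the spirit of Q5-S5.
MAJOR_QUAL = [{"UG-MAJOR": "Mathematics"}, {"UG-MAJOR": "Physics"}]
MAJOR_UNQUAL = [{"UG-MAJOR": "Computer Science"},
                {"UG-MAJOR": "Computer Science"}]
```

Because only distinguishing values are tried, a value such as Mechanical Engineering never produces a response unless the database manager has declared it to be of interest.</Paragraph>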
    </Section>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INEQUALITY HEURISTICS
</SectionTitle>
    <Paragraph position="0"> If the equality or inequality heuristics are not applicable in their pure form and there are a &quot;few&quot; tuples in Tunqual that do not satisfy the requirement of the heuristic (what counts as &quot;few&quot; depends on the relative number of tuples in Tqual and Tunqual and some other factors), a modification of the response produced by the heuristic may be presented to the user. An example of such a modification is seen in the following: Q6: Which students are receiving University scholarships? S6: All but one foreign student. In addition, two Canadian students are also receiving University scholarships. The range heuristics determine if the data values for an attribute in the tuples in Tqual are within a particular well-defined range. There are two main types of range heuristics - one is concerned with maximum values and the other with minimum values. The first of these, the maximum heuristic, may be formally stated as, Determine if all data values for attribute C in Tqual are below some maximum (say, β), and there is no tuple in Tunqual with values for C &lt; β. This value β must have some &quot;distinguishing importance&quot; in the domain of the values of attribute C. In this case, the general formulation of the response is All tuples with the value of attribute C &lt; β.</Paragraph>
    <Paragraph position="1"> An illustrative example is Q7: Which students have been advised to discontinue studies at the University? S7: All students with a cumulative GPA of 2.0 or less. Here GPA = 2.0 is assumed to have some &quot;distinguishing importance&quot; in the field of numbers representing GPAs of students (i.e., a value that may be &quot;generally&quot; used to partition the set of all possible GPAs into two classes: ones above 2.0, and ones equal to or lower than 2.0).</Paragraph>
    <Paragraph position="2"> The maximum heuristic is generally applicable in the case of numeric attributes.</Paragraph>
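    <Paragraph> A minimal sketch of the maximum heuristic, written in Python purely for illustration. It assumes that the candidate bounds with distinguishing importance are supplied as a list, and that the bound is inclusive for Tqual, following the wording of S7 (2.0 or less). The function name and signature are hypothetical.

```python
from typing import Iterable, List, Optional


def maximum_heuristic(qual_values: List[float], unqual_values: List[float],
                      distinguishing: Iterable[float]) -> Optional[float]:
    """Return the smallest distinguishing bound beta that separates the sets.

    Assumption: the bound is inclusive for Tqual (value at most beta), as in
    the example 'GPA of 2.0 or less'.
    """
    if not qual_values:
        return None
    highest = max(qual_values)
    for beta in sorted(distinguishing):
        covers_qual = beta >= highest
        excludes_unqual = all(v > beta for v in unqual_values)
        if covers_qual and excludes_unqual:
            return beta
    return None
```

If no distinguishing value both covers every value in Tqual and excludes every value in Tunqual, the sketch returns None and the system would move on to the next heuristic.</Paragraph>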
    <Paragraph position="3"> Similarly, the minimum heuristic may be formally specified as, Determine if all data values for attribute C in Tqual are above a certain minimum (say, α) and there are no tuples in Tunqual with value for C &gt; α. α must have some &quot;distinguishing importance&quot; in the domain of the values of attribute C. The general formulation of the response is All tuples having the value in column C &gt; α. An illustrative example is Q8: Which students are taking four or more courses? S8: All students with GPA of six or higher.</Paragraph>
    <Paragraph position="4"> When the tuples in Tqual satisfy both the maximum and the minimum heuristics for the same attribute A, we get a range specification. Let α be the minimum value and β be the maximum value of the attribute A in Tqual. Then the response can be modified as All tuples with value of attribute ranging from α through β.</Paragraph>
    <Paragraph position="5"> An example of an answer with range specification is Q9: Who are the students taking courses in second year? S9: All students who have completed between 3 and 5 courses so far.</Paragraph>
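    <Paragraph> The range specification can be illustrated with a short Python sketch. It is a simplification under stated assumptions: it takes the minimum and maximum of the values in Tqual as the two limits and only checks that no value from Tunqual lies between them; the requirement that the limits have distinguishing importance is not modelled, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def range_heuristic(qual_values: List[float],
                    unqual_values: List[float]) -> Optional[Tuple[float, float]]:
    """Return (alpha, beta) if no unqualified value lies inside the range."""
    if not qual_values:
        return None
    alpha, beta = min(qual_values), max(qual_values)
    # An unqualified value inside [alpha, beta] would make the summary wrong.
    if any(v >= alpha and beta >= v for v in unqual_values):
        return None
    return alpha, beta
```

For the situation in Q9, values of 3, 4 and 5 completed courses in Tqual, with all values in Tunqual outside that interval, would yield the pair (3, 5).</Paragraph>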
    <Paragraph position="6"> There are several rules that should be followed while producing answers in terms of ranges. Some of the rules employed in the current implementation are given below. These rules are fairly arbitrary, but rules like them will be necessary to prevent summary responses from themselves violating Grice's maxims, especially the maxims of manner and quality.</Paragraph>
    <Paragraph position="7"> * If the upper limit of the actual range for an attribute is the maximum potential value for the attribute, it is better to modify the answer as more than α where α is the lower limit of the actual range. For example, if for an attribute A the upper limit of the maximum potential range is 1000, instead of providing a response between 750 and 1000, it is advisable to say more than 750 if Grice's maxim of manner (be brief) is to be satisfied.</Paragraph>
    <Paragraph position="8"> * A similar action is taken at the other end of the scale. For example, if the lower limit of the maximum potential range is 0, instead of responding as between 0 and 200, we might answer as less than 200.</Paragraph>
    <Paragraph position="9"> * The actual range specified in an answer should not be more than 75% of the potential range of the attribute values. The particular choice of 75% is not sacrosanct, but the rule itself is important if we are to avoid the problem of producing a response that essentially covers the entire range of potential attribute values. Such a response would mislead the user into thinking that there existed values outside of this range, which would violate Grice's maxim of quality.</Paragraph>
    <Paragraph position="10"> * The actual range specified in an answer should not be so small as to identify the actual tuples that constitute the answer. For example, we should not produce a response such as, All students with student-id-no between 821661 and 821663. In fact, such answers are not brief when compared with the size of the set of tuples they qualify. Moreover, they can mislead the user into thinking that there are many more tuples than there actually are in the response set.</Paragraph>
    <Paragraph position="11"> These violations of the maxims of manner and quality should be avoided.</Paragraph>
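    <Paragraph> The first three rules above can be sketched in Python as follows. This is an illustration only: the function name, its parameters, and the exact output strings are assumptions, and the fourth rule (ranges so narrow that they identify individual tuples) is not modelled because the text gives no threshold for it.

```python
from typing import Optional


def phrase_range(low: float, high: float, pot_low: float, pot_high: float,
                 max_fraction: float = 0.75) -> Optional[str]:
    """Phrase an actual range [low, high] within a potential range.

    Returns None when the range covers more than max_fraction of the
    potential range, in which case no range response should be given.
    """
    potential = pot_high - pot_low
    if potential > 0 and (high - low) / potential > max_fraction:
        return None
    if high == pot_high:
        return f"more than {low}"
    if low == pot_low:
        return f"less than {high}"
    return f"between {low} and {high}"
```

For a potential range of 0 to 1000, an actual range of 750 to 1000 is phrased as more than 750, while an actual range of 100 to 1000 is rejected because it covers 90% of the potential range.</Paragraph>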
    <Paragraph position="12"> While producing range specifications, it is often necessary to round off the upper and lower limits in case of numeric attributes. For example, instead of saying Students with GPA between 6.06 and 6.92 we may as well say Students with GPA between 6.00 and 7.00.</Paragraph>
    <Paragraph position="13"> Rounding cannot be done for all numeric attributes.</Paragraph>
    <Paragraph position="14"> The applicability of the rounding operation depends on several factors including the nature of the values the attribute takes - e.g. whether they are integers or rationals, and their potential range.</Paragraph>
    <Paragraph position="15"> * In case of integer values, if the potential range is &quot;small&quot;, rounding should be avoided. For example, the field NO-OF-YEARS-COMPLETED in a student data base has a tight potential range (0-5 years). In this case, if we have data values between 2 and 4 years, we should not round off and say between 2 and 5 years.</Paragraph>
    <Paragraph position="16"> * For integer values, if the potential range is wide, rounding off may be done (except for some cases discussed below). For example, the expression Students with marks between 61 and 78 may be rounded to Students with marks between 60 and 80. However, for this rounding to be correct, it is necessary to ensure that there are no tuples in Tunqual with marks 60, 79 or 80.</Paragraph>
    <Paragraph position="17"> * There are certain attributes that are integral and do not allow approximation by their inherent nature. One example of such an attribute is STUDENT-ID-NO. A student identification number 82116 cannot be approximated as STUDENT-ID-NO = 82115 or STUDENT-ID-NO = 82120. Similarly, we cannot round the attribute YEAR-OF-BIRTH in many circumstances. This decision whether rounding should be done or not is often subjective. Hence, this information must be provided by the system builder and stored in the knowledge base.</Paragraph>
    <Paragraph position="18"> * If an attribute assumes non-integer (i.e., rational) values, the system may nearly always proceed with rounding. It may be possible to find counter examples to this assertion in some database domains. However, for the purpose of the current implementation, we accept this assumption to be true at all times.</Paragraph>
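    <Paragraph> The safety condition for rounding integer limits can be sketched in Python as below. The sketch assumes a fixed rounding step (10 in the marks example) and assumes that the decision whether the attribute may be rounded at all has already been taken from the knowledge base; the function name and the step parameter are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def round_range(low: int, high: int, step: int,
                unqual_values: Iterable[int]) -> Optional[Tuple[int, int]]:
    """Widen [low, high] to multiples of step, or return None if unsafe.

    Rounding is rejected when an unqualified value would fall inside the
    widened range, e.g. marks of 60, 79 or 80 when 61-78 becomes 60-80.
    """
    new_low = (low // step) * step          # round down
    new_high = -((-high) // step) * step    # round up
    if any(v >= new_low and new_high >= v for v in unqual_values):
        return None
    return new_low, new_high
```

When the function returns None, the unrounded limits would have to be reported instead.</Paragraph>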
    <Paragraph position="19"> It should be noted that the heuristics explained above are applicable when a single attribute of the relevant relation is considered. If no such heuristic can be successfully applied to the pertinent data, the system attempts to use one of the conjunction or disjunction heuristics jointly on two or more attributes.</Paragraph>
    <Paragraph position="20">  The conjunction heuristic is the first of the complex heuristics involving more than one predicate. Usually, each of these predicates involves a distinct attribute in the data base, although it is possible that two or more predicates relate to values of the same attribute. These heuristics provide the system with the facility to use common connectives such as and and or.</Paragraph>
    <Paragraph position="21"> The conjunction heuristic is expressed succinctly in the following paragraph.</Paragraph>
    <Paragraph position="22"> If all values of an attribute C in Tqual satisfy a relation R (in the mathematical sense), and there are tuples in Tunqual that also satisfy the same relation R, determine via the above heuristics if there is/are some &quot;interesting&quot; distinguishing characteristic(s) that the set Tqual satisfies, but the set of tuples in Tunqual satisfying the relation R do not. Let us call the distinguishing characteristic(s) D. The general formulation of the response is All tuples that satisfy the relation R and have the characteristics D.</Paragraph>
    <Paragraph position="23"> An example is, Q10: Which students are working as T.A. or R.A.? S10: Students who have completed more than 1 year at the University and who are not employed outside the University.</Paragraph>
    <Paragraph position="24"> All the tuples in Tqual resulting from Q10 are found by the system to have the values for the attribute NO-OF-YEARS-COMPLETED &gt; 1. However, the system finds that there are some tuples in Tunqual that also have values greater than 1 for the attribute NO-OF-YEARS-COMPLETED. Let us call these tuples Tequal. Next the system attempts to find some characteristics that distinguish Tequal from Tqual. It finds that in Tequal the field NATURE-OF-FINANCIAL-AID = OUT-  If none of the above heuristics can be applied successfully, the system attempts to use the disjunction heuristic. As is evident from the nomenclature, this heuristic enables the system to formulate complex responses using the connective OR. Formally, this heuristic may be expressed as follows.</Paragraph>
    <Paragraph position="25"> Divide the tuples in Tqual into a number of subsets and try to apply one of the heuristics explained earlier to each subset. If successful, the resulting response consists of several predicates connected by the relational operator OR (∨). It has the generalized format Tuples with (attribute_1 R_1 α_1) ∨ (attribute_2 R_2 α_2) [∨ (attribute_3 R_3 α_3) ...] where the R_i's are relations (in the mathematical sense) and the α_i's are distinguishing values of the corresponding attribute_i's.</Paragraph>
    <Paragraph position="26"> While formulating responses with the disjunction heuristic, the number of such subsets should be restricted to two or three, if possible. If too many subsets are identified, it is difficult for the user to grasp all of them. If more than three subsets are presented, this approach is no more elegant than listing the data, which we are trying to avoid. The number of allowable subsets also depends on the number n of tuples in Tqual. If n is &quot;large&quot;, the number of subsets one would consider acceptable may be somewhat higher.</Paragraph>
    <Paragraph position="27"> It should be mentioned that in the generalized expression for the response, the various attribute_i's may be the same attribute, or they may be different. In certain cases, the same attribute may partition the relevant information into two or more groups in distinct ways. An example showing three partitions based on the values of three different attributes is, Q11: Which students are not receiving University scholarships? S11: Students who are receiving NSERC scholarships or have cumulative GPA less than 6.0 or have completed at least two years at the University.</Paragraph>
    <Paragraph position="28"> In attempting to answer Q11, the system finds that it is not possible to obtain an appropriate answer using the previous heuristics. It then checks to see if the tuples in Tqual can be divided into two or three separately identifiable subsets. In this case, it successfully partitions Tqual into three subsets - Tqual_1, Tqual_2 and Tqual_3 where  While subdividing the total response set Tqual into subsets, the system should ensure that no tuple in Tunqual satisfies the various disjunctive predicates.</Paragraph>
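    <Paragraph> One way to picture the disjunction heuristic is the greedy Python sketch below. It is a deliberate simplification and not the procedure used in the implemented system: it considers only equality predicates of the form attribute = value, it picks predicates greedily (so it can miss a valid partition that a fuller search would find), and the function name and the limit of three subsets as a parameter are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
Predicate = Tuple[str, object]  # (attribute, value), read as attribute = value


def disjunction_sketch(tqual: List[Row], tunqual: List[Row],
                       max_subsets: int = 3) -> Optional[List[Predicate]]:
    """Cover Tqual with at most max_subsets predicates that exclude Tunqual."""
    # Candidate predicates: attribute = value pairs seen in Tqual that no
    # tuple in Tunqual satisfies.
    excluded = {(a, v) for row in tunqual for a, v in row.items()}
    candidates = {(a, v) for row in tqual for a, v in row.items()} - excluded

    uncovered = list(range(len(tqual)))
    chosen: List[Predicate] = []
    while uncovered and max_subsets > len(chosen):
        # Count how many still-uncovered tuples each candidate would describe.
        gains = {p: sum(1 for i in uncovered if tqual[i].get(p[0]) == p[1])
                 for p in sorted(candidates, key=repr)}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] == 0:
            return None
        chosen.append(best)
        uncovered = [i for i in uncovered if tqual[i].get(best[0]) != best[1]]
    return chosen if not uncovered else None
```

The returned predicates are meant to be joined with OR. A None result corresponds to the heuristic failing, in which case the system proceeds to the next heuristic.</Paragraph>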
  </Section>
  <Section position="11" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4.2.7 FOREIGN-KEY HEURISTIC
</SectionTitle>
    <Paragraph position="0"> If nothing satisfactory can be found employing all of the above heuristics, the system attempts to search other &quot;related&quot; relations to obtain a suitable response. A related relation is one with which the relation under consideration has some common or join attribute(s).</Paragraph>
    <Paragraph position="1"> Formally, the foreign-key heuristic may be stated as, Obtain the tuples in the target relation R_t that satisfy the user's query. Let these tuples constitute a new relation R_n. Determine if the target relation R_t may be joined directly or indirectly with some other relation(s) in the data base by consulting the relation frame for R_t. Let these other relations be designated {R_j} where maximum(j) = number of such &quot;related&quot; relations. Take join of R_n with the R_j's one at a time (these joins may be direct or indirect and are performed via the attributes specified in the relation frame). Project the resulting relation on the attributes of R_j and try to apply one of the previous heuristics to this resultant relation. Stop only when there is successful application of a heuristic for some R_j, or each relation R_j has been tried unsuccessfully.</Paragraph>
    <Paragraph position="2"> As an example, consider the following question and the response to it: Q12: Which students are taking CMPT 994? S12: All students who have completed at least one year of studies.</Paragraph>
    <Paragraph position="3"> While attempting to answer Q12, the system finds that the question pertains to the relation COURSE-REGISTRATION. However, it fails to obtain any interesting descriptive pattern about the tuples in Tqual by considering this relation alone. Hence, the system consults the LINKS slot in the relation frame for COURSE-REGISTRATION and finds that COURSE-REGISTRATION may be joined with the relation STUDENTS via the fields STUDENT-ID-NO in STUDENTS and STUDENT-ID in COURSE-REGISTRATION. It takes a join of all the tuples constituting Tqual with the relation STUDENTS and projects the resulting relation on the attributes of the relation STUDENTS. Let us call these tuples Tnew_qual.</Paragraph>
    <Paragraph position="4"> Next, it attempts to discover the existence of some pattern in the Tnew_qual tuples. Ultimately, it succeeds in producing the response given in S12 by employing a minimum range heuristic.</Paragraph>
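    <Paragraph> The join-and-project step of the foreign-key heuristic can be illustrated in Python as follows. The sketch assumes a direct join on a single pair of fields and models relations as lists of dictionaries; the function name is hypothetical. Under set semantics, joining Tqual with the related relation and projecting on the related relation's attributes amounts to selecting the related tuples whose key appears in Tqual, which is what the code does.

```python
from typing import Dict, List

Row = Dict[str, object]


def join_and_project(tqual: List[Row], related: List[Row],
                     key_in_target: str, key_in_related: str) -> List[Row]:
    """Join Tqual with a related relation and keep only the related attributes.

    Under set semantics this equals a semijoin: the rows of 'related' whose
    key matches some tuple in Tqual.
    """
    keys = {row[key_in_target] for row in tqual}
    return [row for row in related if row[key_in_related] in keys]
```

For Q12, calling the function with the qualifying COURSE-REGISTRATION tuples, the STUDENTS relation, and the fields STUDENT-ID and STUDENT-ID-NO would produce the tuples called Tnew_qual, to which the earlier heuristics are then applied.</Paragraph>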
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 THE INTERNAL FORM OF A QUERY
</SectionTitle>
      <Paragraph position="0"> The internal form of a query is  * Command is some operation to be performed, at the moment limited to the command OBTAIN, meaning obtain information from the data base. * Database-Identification names a particular data base on which the command is to be carried out; in the current implementation, GRAD-STUDENT-RECORDS.</Paragraph>
      <Paragraph position="1"> * Predicate-Form breaks down into - (Predicate (Relation-Name Attribute-Name)</Paragraph>
      <Paragraph position="5"> Common predicates such as EQUAL, NOT-EQUAL, LESS-THAN, and GREATER-OR-EQUAL, and conjunctions such as AND-ALL-OF, AND-ANY-OF, etc. can be handled. The internal form of the system output is defined in a similar manner except there is no command. The following examples of queries and answers (taken from section 4.2) represented in this internal form should clarify this notation. First we look at a relatively simple  We do not want to downplay the difficulties of interpreting natural language into an internal form such as this, nor do we want to trivialize the difficulty of producing surface language responses from the internal form. However, parsing and natural language generation were not the central concerns of this research; we instead wanted to concentrate on the pragmatic issues underlying summary response generation in a natural language database interface. There is a plethora of work, of course, describing various approaches to parsing we could draw on should we want to extend our system. Possibly the most appropriate parsing strategy for this domain would be a keyword approach (e.g., Small 1980) where the input query is scanned for words indicative of attribute names or predicates relevant to the particular data base being queried. This approach might work well here because the target internal form is phrased only in terms of these domain specific attributes and predicates.</Paragraph>
      <Paragraph position="6"> Similarly, generation could be in terms of catch phrases triggered by the presence of predicates or attributes in the internal form of the output. There is relatively less work on natural language generation on which to base a more sophisticated natural language generation component, but work such as McDonald's (1983) MUMBLE system might be usefully adapted to the determination of appropriate surface phraseology of summary responses.</Paragraph>
      <Paragraph position="7"> The approach taken in McKeown's (1982) TEXT system is also appealing in this regard since its area of application is data bases (albeit describing database structure rather than database contents). To adapt methods from either of these systems (or in fact from most other approaches to generation) would require a considerable enhancement of the knowledge base of our system, something that is currently beyond the scope of the research.</Paragraph>
    </Section>
  </Section>
  <Section position="12" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 IMPLEMENTATION CONSIDERATIONS
</SectionTitle>
    <Paragraph position="0"> A system incorporating the details discussed above has been implemented in Franz Lisp on a VAX-11/750 running under UNIX and has been tested on a data base of student records. The system was tested on a variety of questions. These included all of the examples Q3-Q12 where the system produced internal versions of the summary responses S3-S12. Further details of these examples (and others) are contained in Kalita (1984).</Paragraph>
    <Paragraph position="1"> The data base is currently very small (containing the records of only 25 students or so), so the average response time of the system was in the order of seconds, even for the most time consuming heuristics. A more meaningful analysis is a complexity analysis of the response time in terms of the number of tuples in the data base. With this in mind, in this section we examine implementation aspects of the system, including a complexity analysis of the various heuristics.</Paragraph>
    <Paragraph position="2"> The system has two main components - one for data manipulation, the other to produce the summary responses. The data manipulation component enables the system builder to introduce new relations, new attributes, and new tuples into the relations. As new tuples are entered, various checks regarding the nature of the attribute values and the number of attributes are performed. The data manipulation component also accesses the data that satisfy a query and performs standard relational functions such as selection, projection, and lossless join. This component does not possess the sophistication of a standard database package. However, it is sufficient for the purposes of this research since the internal form of a query can be directly handled by the data manipulation routines.</Paragraph>
    <Paragraph position="3"> The other main component of the system produces summary responses to a user's queries. First the user's input is read and checked for syntactic accuracy (i.e., that it follows the proper internal form, that it contains only references to valid names of relations and attributes, etc.). The query is then passed to the data manipulation component for data access. Returned are the two sets of tuples: Tqual, those tuples satisfying the user's query, and Tunqual, those that do not. The summary response component then regains control and invokes routines corresponding to the individual heuristics. The invocation of the heuristics is done successively in predetermined order until one of them is successful. There is some dependence among the heuristic routines since certain information, once obtained, can be shared. The heuristics receive assistance from the frames during the process of obtaining summary responses. These frames are stored as property lists associated with the relation and attribute names.</Paragraph>
    <Paragraph position="4"> The heuristics attempt to determine a descriptive response by searching through Tqual and Tunqual. In the current implementation, the tuples are examined serially.</Paragraph>
    <Paragraph position="5"> Once a tuple has been accessed, various attribute values in it are tested in parallel to determine if they satisfy the requirement of a heuristic. Such tuple-serial attribute-parallel inspection of attribute values may increase processing time in some cases. However, on average, the time required for obtaining a response is considerably reduced since the tuples need not be accessed repeatedly for each candidate attribute under consideration.</Paragraph>
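    <Paragraph> The tuple-serial attribute-parallel inspection can be pictured with the following Python sketch, which gathers the minimum and maximum of every attribute in a single pass over the tuples. It is an illustration under assumptions: the original system is written in Franz Lisp, the inner loop here only simulates the parallel testing of attribute values within one accessed tuple, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def scan_min_max(tuples: List[Row]) -> Dict[str, Tuple[float, float]]:
    """Visit each tuple once and update (min, max) for every attribute."""
    stats: Dict[str, Tuple[float, float]] = {}
    for row in tuples:                        # tuple-serial
        for attribute, value in row.items():  # all attributes of this tuple
            if attribute in stats:
                low, high = stats[attribute]
                stats[attribute] = (min(low, value), max(high, value))
            else:
                stats[attribute] = (value, value)
    return stats
```

Each tuple is accessed exactly once regardless of how many candidate attributes are under consideration, which is the saving described above.</Paragraph>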
    <Paragraph position="6"> While applying the equality heuristic for each attribute in the target relation, the system keeps a frequency count of different values that occur in the various attributes in Tqual. If at any time during the equality processing of Tqual the system finds more than three different data values for a particular attribute, it ignores the attribute during subsequent processing for the equality heuristic. If for a particular attribute, all values in Tqual are the same and this particular value is (a) a distinguishing value and (b) does not occur in any of the tuples in Tunqual, the system produces a response using the equality heuristic.</Paragraph>
    <Paragraph position="7"> If there are up to three different values that occur for an attribute in Tqual and do not occur in Tunqual, the system compares the dominant frequency with the other frequencies. In the current implementation, the system produces an answer using a modification of the equality heuristic if the other frequencies are less than 10% of the dominant frequency.</Paragraph>
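    <Paragraph> The frequency-count processing for the equality heuristic might look as follows in Python. This is a sketch under stated assumptions rather than the Franz Lisp implementation: the distinguishing values are passed in as a list, the dominant value is required to be distinguishing in the modified case as well (the text states this only for the pure case), and the function name and return convention are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]


def equality_heuristic(tqual: List[Row], tunqual: List[Row], attribute: str,
                       distinguishing: Iterable[object],
                       max_distinct: int = 3,
                       tolerance: float = 0.10) -> Optional[Tuple[object, bool]]:
    """Return (value, is_pure) for an equality-based summary, or None."""
    counts: Counter = Counter()
    for row in tqual:
        counts[row[attribute]] += 1
        if len(counts) > max_distinct:
            return None  # too many distinct values: drop this attribute
    if not counts:
        return None
    unqual_values = {row[attribute] for row in tunqual}
    if any(value in unqual_values for value in counts):
        return None  # a value from Tqual also occurs in Tunqual
    dominant, top = counts.most_common(1)[0]
    if dominant not in set(distinguishing):
        return None
    if len(counts) == 1:
        return dominant, True   # pure equality heuristic
    others = [n for value, n in counts.items() if value != dominant]
    if all(tolerance * top > n for n in others):
        return dominant, False  # modified form of the equality heuristic
    return None
```

The second element of the result tells the caller whether the pure response or the modified response should be generated.</Paragraph>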
    <Paragraph position="8"> For the application of the inequality heuristic, the roles played by Tqual and Tunqual are interchanged.</Paragraph>
    <Paragraph position="9"> Otherwise, the processing is essentially the same.</Paragraph>
    <Paragraph position="10"> For the range heuristics, the maximum and minimum values for each attribute in Tqual are found in tuple-serial attribute-parallel mode. If both heuristics are successful for a particular attribute, a response in terms of range specification is generated. The rules discussed in section 4.2.4 are applied for obtaining responses using the heuristics.</Paragraph>
    <Paragraph position="11"> The disjunction heuristic is attempted when it is possible to divide the tuples in Tqual into two or three subgroups based on equality, inequality, or range heuristics. While applying the earlier heuristics, the system has retained information that may help in the application of the disjunction heuristic. However, application of the disjunction heuristic may necessitate a substantial amount of repetitive grouping and regrouping of tuples and may be expensive in its time requirements. Even so, success is not guaranteed.</Paragraph>
    <Paragraph position="12"> The conjunction heuristic is successful when there are tuples in Tunqual that satisfy the predicate(s) satisfied by the tuples in Tqual. Let the tuples in Tunqual that satisfy the predicate(s) be called T'unqual. To obtain an answer, the equality, inequality, and range heuristics are employed using Tqual and T'unqual (instead of the usual Tqual and Tunqual) to find some distinguishing characteristics between the two sets. This distinguishing description is then used as a qualifier to obtain the final complete response.</Paragraph>
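    <Paragraph> The control flow of the conjunction heuristic can be sketched in Python as below. The sketch assumes that the relation satisfied by Tqual is given as a function on tuples and that each simple heuristic is a function returning a textual description or None; these conventions and the function name are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Heuristic = Callable[[List[Row], List[Row]], Optional[str]]


def conjunction_heuristic(tqual: List[Row], tunqual: List[Row],
                          relation: Callable[[Row], bool],
                          simple_heuristics: List[Heuristic]) -> Optional[str]:
    """Return a description D that separates Tqual from T'unqual, or None."""
    if not all(relation(row) for row in tqual):
        return None  # the relation must hold for every qualifying tuple
    # T'unqual: the unqualified tuples that also satisfy the relation.
    t_prime_unqual = [row for row in tunqual if relation(row)]
    if not t_prime_unqual:
        return None  # the relation alone separates the sets
    for heuristic in simple_heuristics:
        description = heuristic(tqual, t_prime_unqual)
        if description is not None:
            return description
    return None
```

The description returned would then be combined with the relation itself to form a response of the kind All tuples that satisfy the relation R and have the characteristics D.</Paragraph>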
    <Paragraph position="13"> The foreign-key heuristic involves a join and a projection, and finally the application of all previous heuristics. If the target relation has common join attributes with several other relations, joins may have to be performed with each such relation, and the process repeated again for each resultant relation.</Paragraph>
    <Paragraph position="14"> If one of the heuristics succeeds, a response is generated in the format described above. If none of the heuristics succeeds, the extensional response Tqual is produced. The user can also ask for Tqual to be produced if he or she is unsatisfied with just the summary response. In order to determine the implications of our approach to summary response generation, it is important to look at the computational complexity of the algorithms. The application of the equality, inequality, and the range heuristics takes time of the order of O(N_a N_t) where N_a is the number of attributes in the target relation, and N_t is the number of tuples in the target relation (i.e., the sum of the number of tuples in Tqual and Tunqual for the target relation). Performance is improved if the value comparisons are done in parallel for all attributes in a tuple. This performance improvement results since the tuples need not be accessed for each attribute separately. However, this does not reduce the basic complexity involved in the determination.</Paragraph>
    <Paragraph position="15"> The complexity of applying the disjunction heuristic is dependent on the nature of data distribution in Tqual.</Paragraph>
    <Paragraph position="16"> Successful application may involve a large number of permutations of the tuples for repetitive grouping and regrouping. This is the heuristic most likely to lead to a combinatorial explosion.</Paragraph>
    <Paragraph position="17"> The conjunction heuristic takes time of the order O(N_a N_t + N_a N_t1) where N_t1 is the sum of the number of tuples in Tqual and the number of tuples in Tunqual that satisfy some mathematical relation(s) satisfied by the tuples in Tqual (i.e., T'unqual defined earlier). This complexity can be arrived at only if we assume that the disjunction heuristic is not applied to determine the distinguishing characteristics between Tqual and T'unqual.</Paragraph>
    <Paragraph position="18"> Otherwise, the time required will be O(N_a N_t + N_a N_t1) + O_dh(Tqual, T'unqual) where O_dh is the time requirement for the application of the disjunction heuristic.</Paragraph>
    <Paragraph position="19"> The foreign-key heuristic requires additional time for performing joins and projections. The number of joins that may be performed is a function of the number of relations with which the target relation has common join (direct or indirect) attributes. Indirect joins involve one or more additional simple joins. The complexity of the computations necessary after completion of the join and subsequent projection is the same as discussed in the preceding paragraphs.</Paragraph>
    <Paragraph position="20"> The implications of these complexity bounds for large data bases cannot be ignored. Processing time for the simple equality, inequality, and range heuristics is linear in the number of tuples in the data base. This is about as good as can be expected, although it still may be quite slow if real time response is needed. Processing times for the disjunction, conjunction, and foreign-key heuristics can be substantially worse as juggling, rejuggling, and joining take place. If each query must be independently processed, we don't see much hope of improving on these times. However, it may be possible to add a &quot;memory&quot; to the system's knowledge base to keep track of previous responses and hence avoid re-accessing the data base for each query. The nature of such a memory and some of the implications are discussed in the next section.</Paragraph>
  </Section>
  <Section position="13" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 CONCLUSIONS AND FUTURE RESEARCH
</SectionTitle>
    <Paragraph position="0"> Generation of descriptive summary responses has important implications if interactions with a data base are to have the properties and constraints normally associated with human dialogue. Without these constraints, interactions with a DBMS can simply be viewed as dull factual exchanges between a human being and a machine. No doubt, the necessary data is obtained by the user, but these interactions lack the &quot;intelligence&quot; and elegance we ascribe to human behaviour.</Paragraph>
    <Paragraph position="1"> Furthermore, such interactions may fail to present the information content of the data. The data produced is the superficial representation of the &amp;quot;actual contents&amp;quot; or the information that underlies it. In general, most commercial DBMSs make little attempt to extract this deep-seated abstract information. Advances in data modelling have helped to bridge this gap (see, for example, Chen 1976, Smith and Smith 1977, Roussopoulos 1977, Mylopoulos et al. 1980). However, the data models are tools meant principally for the database administrator. They provide little guidance to the user in interpreting the data. The task of interpretation and obtaining a &amp;quot;feeling&amp;quot; for the information content of the data still rests mostly with the user. A system such as the one discussed here transfers some of the responsibility of data interpretation from the user to the computer system. It undertakes a guided search of the data that satisfy the user's query and attempts to extract a brief qualitative expression describing the information therein.</Paragraph>
    <Paragraph position="2"> Currently, while producing summary responses, the system stops as soon as any heuristic is successful in obtaining a pattern. Such responses are composed of one or more predicate forms, as explained in section 4.3.</Paragraph>
    <Paragraph position="3"> However, the first such response may not be the &amp;quot;best&amp;quot; possible one. In order to obtain the best answer, it is advisable to continue the process of identifying responses using the remaining heuristics. If, ultimately, several answers are obtained, a decision regarding which one to present to the user must be made. For this purpose, each answer could be assigned a weight and those with weights below a particular threshold would not be presented. Although the problem of assigning weights is encountered in several other applications of artificial intelligence, the issues involved are complicated; we do not delve into this topic here.</Paragraph>
    <Paragraph position="4"> There are a number of issues that arise concerning the interaction of the data base and the knowledge base.</Paragraph>
    <Paragraph position="5"> The current system depends on the discovery of relationships occurring in the data base and makes use of the knowledge base only to find distinguishing values, possible joins, and appropriate heuristics. Since the heuristics are universal in nature, this implies that the techniques employed here can be transported to another domain (or used by another set of users) without undue modification. The only changes that have to be incorporated are new relation and attribute frames for each new database domain (or each new type of user).</Paragraph>
    <Paragraph position="6"> Unfortunately, this portability is achieved by going directly to the data base, at the cost of fairly inefficient sequential searches (see section 5). This raises the question of whether it might be possible to avoid database access altogether. The current knowledge base is too impoverished to be used directly, but we could consider various enhancements.</Paragraph>
    <Paragraph position="7"> One possibility might be to generalize the idea of distinguishing values to provide rules describing the criteria for membership in a given class. For example, one such criterion could be that a passing mark is a grade point of 1 so that any question such as Who failed CMPT 110? could be answered with All students with a mark of less than 1 without needing to consult the data base at all (assuming the user didn't know this so that the maxim of quality isn't violated). This kind of criterion would be simple to represent, but is obviously not a complete representation of what it means to fail a course since it ignores students who have failed by withdrawing too late, by not writing the final exam, by murdering the professor, etc. In general, to represent all the various subtleties of such criteria is a substantial problem in knowledge representation (consider, for example, having to represent the qualifications needed for a scholarship or all the requirements to get an undergraduate degree). Although it would be nice to be able to represent such general rules, it should be pointed out that consulting the data base as in our current approach circumvents the need to consider these representation issues. The heuristics can pick out relevant commonalities among students who failed a course, or won a scholarship, or received an undergraduate degree, without the need for sophisticated knowledge representation techniques.</Paragraph>
    <Paragraph position="8"> Even if the representation issues were to be solved, however, database access would still be necessary. A prime reason is that any general rules will sometimes have exceptions that can only be discovered by looking at the data. Our heuristics currently allow some such exceptions to be found, although they are by no means a complete solution to the problem of exceptions. The modified equality and inequality heuristics (section 4.2.3) explicitly allow for occasional deviations (e.g., see Q6-S6), and the conjunction and disjunction heuristics can find characteristics common to entire exception classes (e.g., if one entire section of a class were given exemption from the final examination, the disjunction heuristic could answer the question Who completed CMPT 378? with the response All students who wrote the final examination or were in section 02.). Nevertheless, there are still open questions involving exceptions (such as being less ad hoc in defining exactly what a few means in the modified equality and inequality heuristics), which could be worked on as a further direction of this research.</Paragraph>
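    One way to make "a few" less ad hoc is to state it as an explicit, tunable fraction of the tuples examined. The sketch below is an assumption made for illustration, not the definition of the modified equality heuristic in section 4.2.3; the function name and the max_exception_fraction parameter are hypothetical.

```python
from collections import Counter


def modified_equality(values, max_exception_fraction=0.1):
    """Report the common value if all but a few of the values agree.

    Returns a (common_value, exception_count) pair, or None when the number
    of deviating values exceeds the allowed fraction of the whole.
    """
    if not values:
        return None
    common, count = Counter(values).most_common(1)[0]
    exceptions = len(values) - count
    if exceptions > max_exception_fraction * len(values):
        return None
    return common, exceptions
```

    With the default fraction, 19 matching values out of 20 still yield a pattern with one exception, whereas 3 matching values out of 5 do not.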
    <Paragraph position="9"> There is still a third reason (besides avoiding knowledge representation problems and recognizing exceptions) that the data base should be consulted: to discover patterns in the data that can't be explicitly predicted.</Paragraph>
    <Paragraph position="10"> Consider the following question-answer pair: Q12: Which athletes failed HIST 101? S12: The football players.</Paragraph>
    <Paragraph position="11"> Response S12 summarizes information that can't be represented in a rule in the knowledge base (i.e., it isn't necessary that only the football players failed this course) and can only be found by looking through the data.</Paragraph>
    <Paragraph position="12"> Again, our heuristics would be able to find this pattern in the data, assuming that football player is a distinguishing value (which it might be to the athletic director, for example). In general there will be many such situations where the system knows that something interesting (i.e., something for which there is a distinguishing value) could occur in the data, but the exact context in which it actually occurs can't be foreseen.</Paragraph>
    <Paragraph position="13"> Thus, it will normally be necessary to consult the data base. Nevertheless, an interesting research direction will certainly be to enhance the knowledge base as much as possible to provide rules that can at least direct the search through the data with more subtlety than distinguishing values are able to.</Paragraph>
    <Paragraph position="14"> Another possible knowledge-base extension, which would avoid the problem of having rules disembodied from the data they reflect and which might be an answer to some of the efficiency problems of database consultation mentioned above, is to create a &quot;memory&quot; that would store patterns found in previous database searches.</Paragraph>
    <Paragraph position="15"> In other words, the memory would store the scalar implicatures that the system finds to be valid in the data base. This is similar in intent to Lebowitz's (1983) RESEARCHER system, which attempts to generalize concepts read from patent abstracts into a generalization-based memory. We would not be as concerned with describing how given instances differ from their generalizations, but would have to be concerned with how the generalizations change as the database contents are modified.</Paragraph>
    <Paragraph position="16"> The memory would obviate the need to search the data base for repeated queries. For example, let a question Q that has been answered by the system have an answer A stored in an internal form. If the question Q is posed by the user again, the answer A can be returned.</Paragraph>
    <Paragraph position="17"> Similarly, if the question posed matches A, Q can be produced as the answer. For example, let us consider the example Q7-S7 from section 4.2.4. If the question posed is Q7-1, which is the interrogative form of S7, the answer provided may be S7-1, the assertive form of Q7.</Paragraph>
    <Paragraph position="18"> Q7-1: Who are the students with cumulative GPA of 2.0 or less? S7-1: All students who have been advised to discontinue studies at the University.</Paragraph>
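    The two-way use of stored question-answer pairs can be sketched as below. The SummaryMemory class is hypothetical, and the internal forms are simplified to frozensets of predicate strings, which is an assumption: the actual internal notation (section 4.3) is more complicated.

```python
class SummaryMemory:
    """Stores question/answer pairs in internal form, searchable both ways."""

    def __init__(self):
        self._forward = {}  # question form -> answer form
        self._reverse = {}  # answer form -> question form

    def store(self, question, answer):
        self._forward[question] = answer
        self._reverse[answer] = question

    def lookup(self, query):
        """Return the stored counterpart of the query, or None if unseen.

        A repeated question returns its stored answer; a query matching a
        stored answer returns the question that originally produced it.
        """
        if query in self._forward:
            return self._forward[query]
        return self._reverse.get(query)


# Hypothetical encoding of the Q7-S7 pair.
memory = SummaryMemory()
q7 = frozenset({"ADVISED-TO-DISCONTINUE is yes"})
s7 = frozenset({"CUMULATIVE-GPA at most 2.0"})
memory.store(q7, s7)
```

    Here memory.lookup(s7) returns q7, corresponding to answering Q7-1 with S7-1. The reversal is only sound when the stored summary picks out exactly the same set of tuples as the original question.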
    <Paragraph position="19"> In some cases, it may not be possible to phrase a meaningful English question corresponding to the interrogative form of the response to a query. This is especially true in situations where complex responses are produced (e.g., in questions Q6, Q10, and Q11 in section 4). However, it may be possible to break up a complex query into two or more sub-queries. If answers to these sub-queries are already resident in the memory, the system may be able to compose the final response from the existing answers to these sub-queries. Clearly, the amount of search required to answer the query may be considerably reduced if parts of the answer can be retrieved from the memory, assuming the memory itself is organized for efficient retrieval.</Paragraph>
    <Paragraph position="20"> The memory would usually be empty at system initialization; it would grow in size as the system interacted with the user and learned new facts about the data. It would have to be modified as the data in the data base changed. This would mean that the memory would somehow have to keep track of how the stored queries related to the data that produced them so as to be able to determine which queries would be affected by a given change in the data. It would also require some means of determining how new data affected queries summarizing existing data. This is the reverse process to that suggested by the Mays (1982) monitoring scheme, where monitors are posted to look for future changes in the data base. The memory part of our system would have to reason backwards from the current situation to infer how changes affect previously abstracted summaries. Whether Mays's temporal logic can be adapted to be useful in backwards reasoning is an interesting question. In any event, the amount of processing time required to keep the memory up to date is unclear. However, it would seem to be a computationally intense activity, which suggests there would be a trade-off between the time spent maintaining the memory, on the one hand, and the time saved in database access by having the memory, on the other hand.</Paragraph>
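    A deliberately coarse way of keeping stored summaries consistent with a changing data base is to record which relations each summary was derived from and to discard it whenever one of them changes. The sketch below assumes this relation-level granularity; it is far cruder than the backwards reasoning envisaged above, and the class and method names are hypothetical.

```python
from collections import defaultdict


class DependencyTrackingMemory:
    """Remembers which relations each stored summary was derived from."""

    def __init__(self):
        self._answers = {}                    # question -> answer
        self._by_relation = defaultdict(set)  # relation name -> questions

    def store(self, question, answer, relations):
        self._answers[question] = answer
        for relation in relations:
            self._by_relation[relation].add(question)

    def lookup(self, question):
        return self._answers.get(question)

    def invalidate(self, relation):
        """Discard every summary derived from the changed relation.

        Returns the set of questions whose stored answers were dropped.
        """
        stale = self._by_relation.pop(relation, set())
        for question in stale:
            self._answers.pop(question, None)
        return stale
```

    Discarding on any change is always safe but throws away summaries that the change did not actually affect, which is one concrete form of the maintenance trade-off noted above.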
    <Paragraph position="21"> As mentioned earlier, the database manager can implement different user models by creating different sets of attribute and relation frames for each type of user. This capability is similar to the idea of database views.</Paragraph>
    <Paragraph position="22"> For each type of user, the system would contain information about stereotypical knowledge possessed by that class of user. Different classes of users have different ideas about what values are &quot;distinguishing&quot; (e.g., an average of 4 might be fairly insignificant to an undergraduate accessing a student record data base, but to a graduate student it represents the dividing line between being allowed to graduate or not). The CUMULATIVE-GPA attribute frame in the graduate student user model would therefore be different from the CUMULATIVE-GPA frame in the undergraduate student model. For security purposes, it might be useful to prevent certain joins from taking place (e.g., it wouldn't be appropriate for students to access their professors' marks lists by joining the relation COURSE-REGISTRATIONS to the relation MARKS-LISTS (say) on the attribute COURSE-NO). The student user models could reflect this by appropriate restrictions on the relation frames. It is even possible to prevent summary responses altogether for certain attributes by having no distinguishing values in the corresponding attribute frames, or by providing them with a &quot;nil&quot; preference category. Such security and privacy considerations can be important for certain classes of users. All of this is currently possible, although not something we have actively experimented with. An extension to this capability might make it possible for the user to customize the kinds of summary responses he/she receives, rather than relying on the database manager to provide him/her with the appropriate user model. Whether to have the user fill in a template corresponding to each attribute frame, or whether to use natural language to specify the information in the various attribute frames is an open research question.</Paragraph>
    <Paragraph position="23"> In the present system we have assumed that the system is provided with (and produces) a formal representation of the user's query. Ideally, the system's interface should include a natural language parser and generator, but as discussed earlier (section 4.3) this issue was not tackled here. There are still many open questions having to do with surface language, apart from issues of interpretation and generation per se. One such question of particular interest to this research is categorizing types of surface language that demand a summary response, as opposed to types that demand an extensional response or types where either an extensional or a summary response is appropriate. For example, What are the characteristics of the students who failed CMPT 110? requires a summary response; Give me the names of the students who failed CMPT 110 demands an extensional response; and Who failed CMPT 110? allows for either kind of response. However, the problem is subtle.</Paragraph>
    <Paragraph position="24"> For example, the request Give me the names of the students who registered on Wednesday could be answered with an extensional response (which would normally be what is expected) or conceivably the summary response Those with surnames beginning with the letters N through R.</Paragraph>
    <Paragraph position="25"> The key to recognizing what kind of response is needed is to recognize the user's intent (or at least his/her knowledge) in asking the question; that is, to consult a user model to see which kind of answer is appropriate.</Paragraph>
    <Paragraph position="26"> Thus, if it is known that the user is an administrator in charge of registration and that he/she is formulating registration policies, the second answer above might be reasonable. If the user is a clerk in charge of sending out registration forms, the first might be correct. Finally, if the user already knows all the names, then perhaps the summary response is desired (assuming the user has been unable to discern the pattern on his/her own).</Paragraph>
    <Paragraph position="27"> The kind of user model needed to handle this is more sophisticated than the simple user model currently used.</Paragraph>
    <Paragraph position="28"> To see this, let's look at the ambiguous query Who failed CMPT 110? once again. This question can admit either a summary response or an extensional response. If the system knows the user knows all the students who failed CMPT 110, then some description of their characteristics (e.g., students who were absent from the final examination) is probably more appropriate. On the other hand, if the system knows the user knows that students who miss the final examination fail the course, then a summary response describing this fact would be inappropriate, and a list of the students' names is likely what is desired.</Paragraph>
    <Paragraph position="29"> This won't be foolproof, of course. The user could be asking for a re-iteration of something he/she already knows (for confirmation purposes, perhaps) or could be asking for another summary pattern besides the one the user already knows. Another subtlety that arises is the distinction between implicit and explicit knowledge - the user may know something but not realize it, or may not be able to make the inferences needed to deduce something that he/she has the knowledge to deduce. For example, the user may know the names of all the students who failed CMPT 110 but not realize these are the only students; or he/she may know everybody who didn't write the final examination and also the rule that if the final examination is missed a student fails the course, but the user may not have applied the rule in this case.</Paragraph>
    <Paragraph position="30"> Finally, for some extensional responses, it still might be appropriate to repeat a general rule that the user knows, just to re-inforce in his/her mind the applicability of the rule in this situation. Thus, if the user has asked which managers earn more than $40K (see Q2), then even if the user knows that in general all managers earn over $40k, it might be useful to re-iterate this fact after producing the list of managers' names since it would be difficult for the user to check that all the names had</Paragraph>
  </Section>
  <Section position="14" start_page="0" end_page="0" type="metho">
    <SectionTitle>
122 Computational Linguistics, Volume 12, Number 2, April-June 1986
</SectionTitle>
    <Paragraph position="0"> appeared without exceptions (especially if the list were long).</Paragraph>
    <Paragraph position="1"> These kinds of complications make the task of devising the user model quite tricky. It must keep track of subtle degrees of knowledge, incomplete knowledge, changing knowledge, laziness in applying knowledge, etc., and it must be possible to recognize the user's intentions in the use of this knowledge. There is a growing body of research involved in representing the kinds of knowledge needed here, and in dealing with language as intentional behaviour. The work by the University of Toronto group, in particular, pioneered this approach (see Cohen 1978, Allen 1979, and Allen and Perrault 1980, for example) and could form a starting point for research into user model extensions. The first problem would be to represent what the user knows and doesn't know (since many of the decisions about what to present to him/her depend on this). Subsequent steps could get into recognizing intentions and other sophisticated discourse phenomena.</Paragraph>
    <Paragraph position="2"> There are other, more subtle, problems that arise with this approach to summary response generation. One such problem involves avoiding the production of responses that &quot;overlap&quot; (i.e., are implicit in) the question. Such overlapping definitions themselves violate Grice's maxims of relation and quantity. For example, Q14: Which students had a GPA of greater than 5? S14: All students with a GPA of greater than 5.</Paragraph>
    <Paragraph position="3"> or Q15: Which graduate students are both teaching and research assistants? S15: All graduate students receiving money for teaching and being paid by a professor to do research.</Paragraph>
    <Paragraph position="4"> Simple cases like Q14-S14 can be prevented by explicitly prohibiting responses that have the same predicates as the question. This would apply even if only one conjunct or disjunct is the same in both question and response.</Paragraph>
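    The simple prohibition just described amounts to a set-disjointness test between the predicates of the question and those of a candidate response. In the sketch below, predicates are represented as plain strings, which is an assumption made for illustration; the function name is hypothetical.

```python
def overlaps_question(question_predicates, response_predicates):
    """Return True if the response repeats any predicate of the question.

    A single shared conjunct or disjunct is enough to reject the response.
    """
    return not set(question_predicates).isdisjoint(response_predicates)


# Hypothetical encoding of the Q14-S14 pair: the response restates the question.
q14 = {"GPA greater-than 5"}
s14 = {"GPA greater-than 5"}
```

    Here overlaps_question(q14, s14) is True, so S14 would be suppressed. A test of this kind compares predicates only syntactically, so it cannot detect overlaps expressed through different attributes.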
    <Paragraph position="5"> Q15-S15 presents a more complex problem since the answer, although directly implied in the user's mind, may in fact involve attributes different from the question (e.g., the data base may have attributes such as TEACH-</Paragraph>
  </Section>
  <Section position="15" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ING-ASSISTANT?, RESEARCH-ASSISTANT?, MONEY-
RECEIVED-FROM-TEACHING, MONEY-RECEIVED-
</SectionTitle>
    <Paragraph position="0"> FOR-RESEARCH). In cases such as this, where the data base is in some sense redundant, extra information would have to be added to the attribute frames to indicate overlapping attributes. This information could then be used to avoid producing responses that overlap the query. Even this extension would not provide a total solution to the problem, since the user may be able to make many subtle connections among the data in the data base that will lead to an overlapping response from his/her point of view. Additional user modelling techniques to those discussed above will have to be developed to predict these connections and thus prevent the production of a response implicit in the query.</Paragraph>
    <Paragraph position="1"> Another subtle problem that arises is the problem of &quot;accidental summaries&quot;, i.e., summaries that are true of the current data base but not in general. Our use of distinguishing values is an attempt to reduce the chances of this occurring, but it can still happen. For example, it may be true in the simple student data base that, currently, all people who are from Canada also have NSERC grants. &quot;NSERC&quot; may also be a distinguishing value for the NATURE-OF-FINANCIAL-AID attribute (e.g., to answer question Q3). However, to respond to the question Who are the students from Canada? with the answer All students with NSERC grants might mislead the user into thinking that there was some necessary connection between being from Canada and having an NSERC grant, rather than an accidental one. Accidental summaries violate Grice's maxim of quality in that they imply something is true that is not. Just avoiding the production of summary responses in such cases will not solve the problem, since it still may be very useful to produce a summary response. Thus, it may be accidental that all managers earn over $40k, but answer S2-1 (Abel, Baker, Charles, Doug.) to question Q2 (Which department managers earn over $40k per year?) still (normally) violates Grice's maxims, and the summary response S2-2 (All of them.) is still (normally) more appropriate. The only long term solution to this problem is to expand the knowledge base with further information about necessary relationships in the world being modelled (e.g., for the student data base, the knowledge base could be stocked with rules and regulations about academic programmes, student eligibility for various prizes, etc.). These necessary relationships could then be used to clarify the summaries provided to the user as to whether accidental or necessary relationships were being reported.</Paragraph>
    <Paragraph position="2"> In conclusion, we would like to say that, despite its use over the last twenty years, the database environment still forms a nice microworld to study a variety of natural language issues. Hopefully, some of these have been illuminated by this research.</Paragraph>
  </Section>
</Paper>