<?xml version="1.0" standalone="yes"?>
<Paper uid="P81-1035">
  <Title>TRANSPORTABLE NATURAL-LANGUAGE INTERFACES TO DATABASES</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TRANSPORTABLE NATURAL-LANGUAGE INTERFACES TO DATABASES
</SectionTitle>
    <Paragraph position="0"> by Gary G. Hendrix and William H. Lewis</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SRI International
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
I INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Over the last few years a number of application systems have been constructed that allow users to access databases by posing questions in natural languages, such as English. When used in the restricted domains for which they have been especially designed, these systems have achieved reasonably high levels of performance. Such systems as LADDER [2], PLANES [10], ROBOT [1], and REL [9] require the encoding of knowledge about the domain of application in such constructs as database schemata, lexicons, pragmatic grammars, and the like. The creation of these data structures typically requires considerable effort on the part of a computer professional who has had special training in computational linguistics and the use of databases. Thus, the utility of these systems is severely limited by the high cost involved in developing an interface to any particular database.</Paragraph>
    <Paragraph position="1"> This paper describes initial work on a methodology for creating natural-language processing capabilities for new domains without the need for intervention by specially trained experts. Our approach is to acquire logical schemata and lexical information through simple interactive dialogues with someone who is familiar with the form and content of the database, but unfamiliar with the technology of natural-language interfaces. To test our approach in an actual computer environment, we have developed a prototype system called TED (Transportable English Datamanager). As a result of our experience with TED, the NL group at SRI is now undertaking the development of a much more ambitious system based on the same philosophy [4].</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
II RESEARCH PROBLEMS
</SectionTitle>
    <Paragraph position="0"> Given the demonstrated feasibility of language-access systems, such as LADDER, major research issues to be dealt with in achieving transportable database interfaces include the following: * Information used by transportable systems must be cleanly divided into database-independent and database-dependent portions.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
* Knowledge representations must be
</SectionTitle>
    <Paragraph position="0"> established for the database-dependent part in such a way that their form is fixed and applicable to all databases and their content readily acquirable.</Paragraph>
    <Paragraph position="1"> * Mechanisms must be developed to enable the system to acquire information about a particular application from nonlinguists.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="160" type="metho">
    <SectionTitle>
III THE TED PROTOTYPE
</SectionTitle>
    <Paragraph position="0"> We have developed our prototype system (TED) to explore one possible approach to these problems.</Paragraph>
    <Paragraph position="1"> In essence, TED is a LADDER-like natural-language processing system for accessing databases, combined with an &quot;automated interface expert&quot; that interviews users to learn the language and logical structure associated with a particular database and that automatically tailors the system for use with the particular application. TED allows users to create, populate, and edit their own new local databases, to describe existing local databases, or even to describe and subsequently access heterogeneous (as in [5]) distributed databases.</Paragraph>
    <Paragraph position="2"> Most of TED is based on and built from components of LADDER. In particular, TED uses the LIFER parser and its associated support packages [3], the SODA data access planner [5], and the FAM file access manager [6]. All of these support packages are independent of the particular database used. In LADDER, the data structures used by these components are hand-generated for a particular database by computer scientists. In TED, however, they are created by TED's automated interface expert.</Paragraph>
    <Paragraph position="3"> Like LADDER, TED uses a pragmatic grammar; but TED's pragmatic grammar does not make any assumptions about the particular database being accessed. It assumes only that interactions with the system will concern data access or update, and that information regarding the particular database will be encoded in data structures of a prescribed form, which are created by the automated interface expert.</Paragraph>
    <Paragraph position="4"> The executive level of TED accepts three kinds of input: questions stated in English about the data in files that have been previously described to the system; questions posed in the SODA query language; and single-word commands that initiate dialogues with the automated interface expert.</Paragraph>
    <Paragraph position="6"> TED's mechanism for acquiring information about a particular database application is to conduct interviews with users. (The work reported herein was supported by the Advanced Research Projects Agency of the Department of Defense under contracts N00039-79-C-0118 and N00039-80-C-0645 with the Naval Electronic Systems Command. The views and conclusions contained in this document are those of the authors and should not be interpreted as representative of the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.) For such interviews to be successful,</Paragraph>
    <Paragraph position="7">  * There must be a range of readily understood questions that elicit all the information needed about a new database.</Paragraph>
    <Paragraph position="8">  assistance, when needed, to enable a user to understand the kinds of responses that are expected.</Paragraph>
    <Paragraph position="9"> All these points cannot be covered herein, but the sample transcript shown at the end of this paper, in conjunction with the following discussion, suggests the manner of our approach.</Paragraph>
    <Paragraph position="10"> B. Strategy
A key strategy of TED is to first acquire information about the structure of files. Because the semantics of files is relatively well understood, the system thereby lays the foundation for subsequently acquiring information about the linguistic constructions likely to be used in questions about the data contained in the file.</Paragraph>
    <Paragraph position="11"> One of the single-word commands accepted by the TED executive system is the command NEW, which initiates a dialogue prompting the user to supply information about the structure of a new data file.</Paragraph>
    <Paragraph position="12"> The NEW dialogue allows the user to think of the file as a table of information and asks relatively simple questions about each of the fields (columns) in the file (table).</Paragraph>
    <Paragraph position="13"> For example, TED asks for the heading names of the columns, for possible synonyms for the heading names, and for information about the types of values (numeric, Boolean, or symbolic) that each column can contain. The heading names generally act like relational nouns, while the information about the type of values in each column provides a clue to the column's semantics. The heading name of a symbolic column tends to be the generic name for the class of objects referred to by the values of that column. Heading names for Boolean columns tend to be the names of properties that database objects can possess. If a column contains numbers, this suggests that there may be some scale with associated adjectives of degree. To allow the system to answer questions requiring the integration of information from multiple files, the user is also asked about the interconnections between the file currently being defined and other files described previously.</Paragraph>
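    <Paragraph>The information gathered by the NEW dialogue can be pictured as one small record per column. The following Python sketch is an illustration only and does not reproduce TED's actual data structures: the names FieldDescription, CATEGORY_BY_TYPE and lexicon_entries, the category labels, and the sample synonym are all assumptions introduced here.

```python
from dataclasses import dataclass, field

# Hypothetical category labels, one per value type asked about in the dialogue.
CATEGORY_BY_TYPE = {
    "numeric": "NUM.ATTR",
    "symbolic": "SYM.ATTR",
    "boolean": "BOOL.ATTR",
}


@dataclass
class FieldDescription:
    """What a NEW-style dialogue might record about one column."""
    heading: str
    value_type: str  # "numeric", "symbolic", or "boolean"
    synonyms: list = field(default_factory=list)


def lexicon_entries(file_name, fields):
    """Map each heading and synonym to (category, file name, field name)."""
    lexicon = {}
    for description in fields:
        category = CATEGORY_BY_TYPE[description.value_type]
        for word in [description.heading] + description.synonyms:
            lexicon[word.upper()] = (category, file_name, description.heading)
    return lexicon


# Sample input; the synonym TYPE is invented for illustration.
ship_fields = [
    FieldDescription("NAME", "symbolic"),
    FieldDescription("CLASS", "symbolic", ["TYPE"]),
    FieldDescription("AGE", "numeric"),
]
ship_lexicon = lexicon_entries("SHIP", ship_fields)
```

In this sketch a synonym resolves to the same field as its heading, so a question phrased with either word would reach the same column.</Paragraph>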
    <Paragraph position="14"> C. Examples from a Transcript
In the sample transcript at the end of this paper, the user initiates a NEW dialogue at Point A. The automated interface expert then takes the initiative in the conversation, asking first for the name of the new file, then for the names of the file's fields. The file name will be used to distinguish the new file from others during the acquisition process. The field names are entered into the lexicon as the names of attributes and are put on an agenda so that further questions about the fields may be asked subsequently of the user.</Paragraph>
    <Paragraph position="15"> At this point, TED still does not know what type of objects the data in the new file concern.</Paragraph>
    <Paragraph position="16"> Thus, as its next task, TED asks for words that might be used as generic names for the subjects of the file. Then, at Point E, TED acquires information about how to identify one of these subjects to the user and, at Point F, determines what kinds of pronouns might be used to refer to one of the subjects. (As regards ships, TED is fooled, because ships may be referred to by &quot;she.&quot;) TED is programmed with the knowledge that the identifier of an object must be some kind of name, rather than a numeric quantity or Boolean value.</Paragraph>
    <Paragraph position="17"> Thus, TED can assume a priori that the NAME field given in Interaction E is symbolic in nature. At Point G, TED acquires possible synonyms for NAME.</Paragraph>
    <Paragraph position="18"> TED then cycles through all the other fields, acquiring information about their individual semantics. At Point H, TED asks about the CLASS field, but the user doesn't understand the question. By typing a question mark, the user causes TED to give a more detailed explanation of what it needs. Every question TED asks has at least two levels of explanation that a user may call upon for clarification. For example, the user again has trouble at J, whereupon he receives an extended explanation with an example. See T also.</Paragraph>
    <Paragraph position="19"> Depending upon whether a field is symbolic, arithmetic or Boolean, TED makes different forms of entries in its lexicon and seeks to acquire different types of information about the field.</Paragraph>
    <Paragraph position="20"> For example, as at Points J, K and Y=, TED asks whether symbolic field values can be used as modifiers (usually in noun-noun combinations). For arithmetic fields, TED looks for adjectives associated with scales, as is illustrated by the sequence OPQR. Once TED has a word such as OLD, it assumes MORE OLD, OLDER and OLDEST may also be used. (GOOD-BETTER-BEST requires special intervention.) Note the aggressive use of previously acquired information in formulating new questions to the user (as in the use of AGE and SHIP at Point P).</Paragraph>
    <Paragraph position="21"> We have found that this aids considerably in keeping the user focused on the current items of interest to the system and helps to keep interactions brief.</Paragraph>
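    <Paragraph>The assumption that OLD also licenses MORE OLD, OLDER and OLDEST can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not TED's procedure: the function degree_forms and the IRREGULAR table are hypothetical names, and the suffix rule is deliberately naive.

```python
# Irregular adjectives need explicit entries, mirroring the "special
# intervention" mentioned for GOOD-BETTER-BEST. Illustrative table only.
IRREGULAR = {"GOOD": ("BETTER", "BEST")}


def degree_forms(adjective):
    """Return assumed comparative and superlative forms of an adjective."""
    word = adjective.upper()
    if word in IRREGULAR:
        comparative, superlative = IRREGULAR[word]
        return {"comparative": [comparative], "superlative": [superlative]}
    # Drop a final E so that LARGE yields LARGER rather than LARGEER.
    stem = word[:-1] if word.endswith("E") else word
    return {
        "comparative": ["MORE " + word, stem + "ER"],
        "superlative": [stem + "EST"],
    }


old_forms = degree_forms("old")
```

The rule does not double final consonants (BIG would yield BIGER), which is one reason a real system needs either a fuller morphology or a way for the user to correct the derived forms.</Paragraph>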
    <Paragraph position="22"> Once TED has acquired local information about a new file, it seeks to relate it to all known files, including the new file itself. At Points Z through B+, TED discovers that the *SHIP* file may be joined with itself. That is, one of the attributes of a ship is yet another ship (the escorted ship), which may itself be described in the same file. The need for this information is illustrated by the query the user poses at Point G+.</Paragraph>
    <Paragraph position="23"> To better illustrate linkages between files, the transcript includes the acquisition of a second file about ship classes, beginning at Point J+.</Paragraph>
    <Paragraph position="24"> Much of this dialogue is omitted but, at L+, TED learns there is a link between the *SHIP* and *CLASS* files. At M+ it learns the direction of this link; at N+ and O+ it learns the fields upon which the join must be made; at P+ it learns the attributes inherited through the link. This information is used, for example, in answering the query at S+. TED converts the user's question &quot;What is the speed of the hoel?&quot; into &quot;What is the speed of the class whose CN~ is equal to the CLASS of the hoel?&quot; Of course, the whole purpose of the NEW dialogues is to make it possible for users to ask questions of their databases in English. Examples of English inputs accepted by TED are shown at Points E+ through I+, and S+ and T+ in the transcript. Note the use of noun-noun combinations, superlatives and arithmetic.</Paragraph>
    <Paragraph position="25"> Although not illustrated, TED also supports all the available LADDER facilities of ellipsis, spelling correction, run-time grammar extension and introspection.</Paragraph>
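    <Paragraph>The way a link between the *SHIP* and *CLASS* files lets a ship inherit an attribute of its class can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the field name CLASSNAME stands in for the join field of the class file, the row values C1 and 30 are invented, and the LINK dictionary and lookup function are hypothetical rather than TED's or SODA's actual representation.

```python
# Invented sample rows; only the file and field roles follow the text.
FILES = {
    "SHIP": [{"NAME": "HOEL", "CLASS": "C1"}],
    "CLASS": [{"CLASSNAME": "C1", "SPEED": 30}],
}

# One link record: its direction, the join fields, and inherited attributes.
LINK = {
    "from_file": "SHIP",
    "from_field": "CLASS",
    "to_file": "CLASS",
    "to_field": "CLASSNAME",
    "inherited": ["SPEED"],
}


def lookup(file_name, key_field, key_value, attribute):
    """Fetch an attribute, following the link when it is inherited."""
    row = next(r for r in FILES[file_name] if r[key_field] == key_value)
    if attribute in row:
        return row[attribute]
    if file_name == LINK["from_file"] and attribute in LINK["inherited"]:
        # Rewrite the request as a question about the linked record.
        return lookup(LINK["to_file"], LINK["to_field"],
                      row[LINK["from_field"]], attribute)
    raise KeyError(attribute)


hoel_speed = lookup("SHIP", "NAME", "HOEL", "SPEED")
```

The recursive call plays the role of the rewritten question: a request for the speed of a ship becomes a request for the speed of the class record whose join field matches the ship's CLASS value.</Paragraph>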
  </Section>
  <Section position="7" start_page="160" end_page="160" type="metho">
    <SectionTitle>
V THE PRAGMATIC GRAMMAR
</SectionTitle>
    <Paragraph position="0"> The pragmatic grammar used by TED includes special syntactic/semantic categories that are acquired by the NEW dialogues. In our actual implementation, these have rather awkward names, but they correspond approximately to the following: * &lt;GENERIC&gt; is the category for the generic names of the objects in files. Lexical properties for this category include the name of the relevant file(s) and the names of the fields that can be used to identify one of the objects to the user. See transcript Points D and E.</Paragraph>
    <Paragraph position="1"> * &lt;ID.VALUE&gt; is the category for the identifiers of subjects of individual records (i.e., key-field values). For example, for the *SHIP* file, it contains the values of the NAME field. See transcript Point E.</Paragraph>
    <Paragraph position="2"> * &lt;MOD.VALUE&gt; is the category for the values of database fields that can serve as modifiers. See Points J and K.</Paragraph>
    <Paragraph position="3"> * &lt;NUM.ATTR&gt;, &lt;SYM.ATTR&gt;, and &lt;BOOL.ATTR&gt; are numeric, symbolic and Boolean attributes, respectively. They include the names of all database fields and their synonyms.</Paragraph>
    <Paragraph position="4"> * &lt;+NUM.ADJ&gt; is the category for adjectives (e.g. OLD) associated with numeric fields. Lexical properties include the name of the associated field and files, as well as information regarding whether the adjective is associated with greater (as in OLD) or lesser (as in YOUNG) values in the field.</Paragraph>
    <Paragraph position="5"> See Points P, Q and R.</Paragraph>
    <Paragraph position="6"> * &lt;COMP.ADJ&gt; and &lt;SUPERLATIVE&gt; are derived from &lt;+NUM.ADJ&gt;.</Paragraph>
    <Paragraph position="7"> Shown below are some illustrative pragmatic production rules for nonlexical categories. As in the foregoing examples, these are not exactly the rules used by TED, but they do convey the nature of the approach.</Paragraph>
    <Paragraph position="8">  These pragmatic grammar rules are very much like the ones used in LADDER [2], but they differ from those of LADDER in two critical ways. (1) They capture the pragmatics of accessing databases without forcibly including information about the pragmatics of any one particular set of data.</Paragraph>
    <Paragraph position="9"> (2) They use syntactic/semantic categories that support the processes of accessing databases, but that are domain-independent and easily acquirable. It is worth noting that, even when a particular application requires the introduction of special-purpose rules, the basic pragmatic grammar used by TED provides a starting point from which domain-specific features can be added.</Paragraph>
  </Section>
  <Section position="8" start_page="160" end_page="161" type="metho">
    <SectionTitle>
VI DIRECTIONS FOR FURTHER WORK
</SectionTitle>
    <Paragraph position="0"> The TED system represents a first step toward truly portable natural-language interfaces to database systems. TED is only a prototype, however, and much additional work will be required to provide adequate syntactic and conceptual coverage, as well as to increase the ease with which systems may be adapted to new databases. A severe limitation of the current TED system is its restricted range of syntactic coverage. For example, TED deals only with the verbs BE and HAVE, and does not know about units (e.g., the Waddel's age is 15.5, not 15.5 YEARS). To remove this limitation, the SRI NL group is currently adapting Jane Robinson's extensive DIAGRAM grammar [7] for use in a successor to TED. In preparation for the latter, we are experimenting with verb acquisition dialogues such as the following:</Paragraph>
  </Section>
  <Section position="9" start_page="161" end_page="161" type="metho">
    <SectionTitle>
&gt; VERB
</SectionTitle>
    <Paragraph position="0"> Please conjugate the verb (e.g. fly flew flown)
&gt; EARN EARNED EARNED
EARN is:
1 intransitive (John dines)
2 transitive (John eats dinner)
3 ditransitive (John cooks Mary dinner)
(Choose the most general pattern)
&gt; 2
who or what is EARNED?
&gt; A SALARY
who or what EARNS A SALARY?
&gt; AN EMPLOYEE
can A SALARY be EARNED by AN EMPLOYEE?
&gt; YES
can A SALARY EARN?
&gt; NO
can AN EMPLOYEE EARN?
&gt; NO
Ok, an EMPLOYEE can EARN a SALARY
What database field identifies an EMPLOYEE?
&gt; NAME
What database field identifies a SALARY?
&gt; SALARY

extensive conceptual and syntactic coverage continues to pose a challenge to research, a polished version of the TED prototype, even with its limited coverage, would appear to have high potential as a useful tool for data access.</Paragraph>
  </Section>
</Paper>