HOW TO DRIVE A DATABASE FRONT END USING GENERAL SEMANTIC INFORMATION
B. Boguraev and K. Sparck Jones
Computer Laboratory, University of Cambridge
Corn Exchange Street, Cambridge CB2 3QG, England 
ABSTRACT 
This paper describes a front end for natural 
language access to databases making extensive use of 
general, i.e. domain-independent, semantic
information for question interpretation. In the
interests of portability, initial syntactic and
semantic processing of a question is carried out
without any reference to the database domain, and
domain-dependent operations are confined to
subsequent, comparatively straightforward,
processing of the initial interpretation. The
different modules of the front end are described, and 
the system's performance is illustrated by examples. 
I INTRODUCTION
Following the development of various front ends
for natural language access to databases, it is now
generally agreed that such a front end must utilise
at least three different kinds of knowledge to
accomplish its task: linguistic knowledge, knowledge
of the domain of discourse, and knowledge of the
organisational structure of the database. Thus
broadly speaking, a user request to the database goes
through three conceptually different forms: the
output of linguistic analysis of the question, its
representation in terms of the domain's conceptual
schema, and its interpretation in the database
access language. Early natural language front ends
usually did not have a clearcut separation between
the different stages of the process: for example
LUNAR (Woods 1972) merged the domain model and the
database model into one, and systems such as the
early incarnation of LADDER (Hendrix et al 1978) and
PLANES (Waltz 1978) made heavy use of semantic
grammars with their domain-dependent lexicons
combining linguistic knowledge with domain knowledge
and so merging the first two stages. None of these
systems, moreover, made any significant use of
general, as opposed to domain-specific, semantic
information.
This work is supported by the U.K. Science and Engineering Research Council.
In an attempt to achieve portability from one
database to another, most current systems adhere to a
general framework (Konolige 1979), which makes a
clear distinction between the different processing
phases and distinguishes the domain-dependent from
the domain-independent parts of the front end, and
also domain operations from database management
operations. However, semantic processing is still
essentially driven by domain-dependent semantics.
Linguistic processing is therefore primarily 
syntactic parsing, and relating general linguistic 
to specific domain knowledge within the framework of 
a modular front end takes the form of applying 
domain-dependent semantic processing to the output 
of the syntactic parser. This may be done in a 
simple-minded way as in PHLIQA1 (Bronnenberg et al
1979) and TQA (Damerau 1980), or by providing hooks
in the syntactic representation (domain-independent
calls to semantic operators which will evaluate
differently in different contexts), as in DIALOGIC
(Grosz et al 1982). In either case the usual unhappy
consequence of separating syntactic and semantic
processing, namely the hassle of manipulating
alternative syntactic trees, follows. Furthermore,
changing domains implies changing the definitions of
the semantic operators, which are procedural in 
nature, while it may be preferable to keep the 
domain-dependent parts of the front end in 
declarative form, as is indeed done in (Warren and 
Pereira 1981). 
Thus in systems of this by now conventional type, 
the 'portability' achieved by confining the necessary 
domain-dependent semantic processing to well- 
defined modules is purchased at the heavy price of 
limiting the early linguistic processing to syntax, 
and, perhaps, some very global and undiscriminating 
semantics (see for example the scoping algorithm of
(Grosz et al 1982)). 
II SPECIFIC APPROACH 
Our objective is to do better than this by making 
more use of powerful, but still non-domain-dependent 
semantics in the front-end linguistic analysis. 
Doing this should have two advantages: restraining 
syntax, and providing a good platform for domain- 
dependent semantic processing. However, the overall 
architecture of the front end still follows the 
Konolige model in maintaining a clearcut separation 
between the different kinds of knowledge to be 
utilised, keeping the bulk of the domain-dependent 
knowledge in declarative form, and attempting to 
minimise the consequences of changes in the front
end environment, whether of domain or database model,
to promote smooth transfers of the front end from
one back end database management system to another. 
We believe that there is a lot of mileage to be 
got from non-task-specific semantic analysis of user 
requests, because their resulting rich, explicit, and 
normalised meaning representations are a good
starting point for subsequent task-specific
operations, and specifically are better than either
syntax trees, or the actual input text of e.g. the 
PLANES approach. Furthermore, since the domain world 
is (in some sense) a subset of the real world, it is 
possible to interpret descriptions of it using the 
same semantic apparatus and representation language 
as is used by the natural language analyser, which 
should allow easy and reliable linking of the 
natural language input words, domain world objects 
and relationships and data language terms and 
expressions. Since the connections between these do 
not appear hard-wired in the lexicon, but are 
established on the basis of matching rich semantic 
patterns, no changes at all should be required in the 
lexicon as the application moves from one domain or 
database to another, only expansions to allow for the 
semantic definitions of new words relevant to the 
new application. 
The approach leads to an overall front end 
structure as follows: 
                 : English question :
                          |
ANALYSIS        1   Analyser
                    (uses linguistic knowledge)
                          |
                 : meaning representation :
                          |
                2   Extractor
                    (uses logico-linguistic knowledge)
                          |
                 : logic representation :
                          |
TRANSLATION     3   Translator
                    (uses domain world knowledge)
                          |
                 : query representation :
                          |
                4   Convertor
                    (uses database organisation knowledge)
                          |
                 : search representation :
Each process in the diagram above operates on the
output of the previous one. Processes 1 and 2
constitute the analysis phase, and processes 3 and
4 the translation phase. Such a system has
essentially been constructed, and is under active
test; a detailed account of its components and
operations follows.
For the purposes of illustration we shall use
questions addressed to the Suppliers and Parts
relational database of (Date 1977). This has three
relations with the following structure:
Supplier(Sno, Sname, Status, Scity), Part(Pno, Pname,
Colour, Weight, Pcity), and Shipments(Sno, Pno,
Quantity).
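The three relations can be written down as a small in-memory structure for experimentation. The following is a minimal sketch in Python: the class and field names simply mirror the relation and column names above, while the sample rows are hypothetical placeholders and are not taken from (Date 1977).

```python
from typing import NamedTuple


class Supplier(NamedTuple):
    sno: str
    sname: str
    status: int
    scity: str


class Part(NamedTuple):
    pno: str
    pname: str
    colour: str
    weight: int
    pcity: str


class Shipment(NamedTuple):
    sno: str
    pno: str
    quantity: int


# Hypothetical sample rows, for illustration only.
SUPPLIERS = [Supplier("S1", "Clark", 30, "London")]
PARTS = [Part("P1", "screw", "green", 12, "London")]
SHIPMENTS = [Shipment("S1", "P1", 100)]

# The schema as a mapping from relation name to its column names.
SCHEMA = {
    "Supplier": Supplier._fields,
    "Part": Part._fields,
    "Shipments": Shipment._fields,
}
```

Shipments carries the keys of the other two relations, which is what later allows a supplier to be connected to the parts it supplies.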
III ANALYSIS 
A. The Analyser
The natural language analyser has been described
in detail elsewhere (Boguraev 1979), (Boguraev and
Sparck Jones 1982), and only a brief summary will be
presented here. It has been designed as a general
purpose, domain- and task-independent language
processor, driven by a fairly extensive
linguistically-motivated grammar and controlled in
its operation by variegated application of a rich
and powerful semantic apparatus.
Syntactically-controlled constituent identification is coupled
with the judgemental application of semantic
specialists: following the evaluation of the
semantic plausibility of the constituent at hand,
the currently active processor either aborts the
analysis path or constructs a meaning representation
for the textual unit (noun phrase, complementiser,
embedded clause, etc.) for incorporation into any
larger semantic construct. The philosophy behind the
analyser is that syntactically-driven analysis
(which is a major prerequisite for domain- and/or
task-independence) is made efficient by frequent and
timely calls to semantic specialists, which both
control blind syntactic backtracking and construct
meaning representations for input text without going
through the potentially costly enumeration of
intermediate syntactic trees. The analyser can
therefore operate smoothly in environments which are
syntactically or lexically highly ambiguous.
To achieve its objectives the program pursues a 
passive parsing strategy based on semantic pattern 
matching of the kind proposed by (Wilks 1975). Thus 
the semantic specialists work with a range of 
patterns referring to narrower or broader word 
classes, all defined using general semantic 
primitives and ultimately depending on formulae 
which use the primitives to characterise individual 
word senses. However the application of patterns in 
the search for input text meaning is more
effectively controlled by syntax in this system than 
in Wilks'. 
The particular advantages of the approach in the 
database application context are the powerful and 
flexible means of representing linguistic and world 
knowledge provided by the semantic primitives, and 
the ease with which 'traps for the unexpected' can be 
procedurally encoded. The latter means that the 
system can readily deal with the kinds of problems
generated by unconstrained natural language text
which provoke untoward 'ripple' effects when large
semantic grammars are modified. The semantic
primitive foundation for the analyser provides a
good base for the whole front end, since the
comprehensive inventory of primitives can be
exploited to characterise both natural language and
data language terms and expressions, and to
reconcile the user's view of the database domain with
the actual administrative organisation of the
database.
For present purposes, the form and content of the
outputs of the natural language analyser are more 
important than the means by which they are derived 
(for these see Boguraev and Sparck Jones 1982). The 
meaning representations output by the analyser are 
dependency structures with clusters of case-labelled 
components centred around main verb or noun 
elements. Apart from the structure of the dependency 
tree itself, and group identifying markers like 'ins'
and 'modality', the substantive information in the
meaning representation is provided by the case 
labels, which are drawn from a large set of semantic 
relation primitives forming part of the overall 
inventory of primitives, and by the semantic 
category primitive characterisations of lexically- 
derived items. 
The formulae characterising word senses may be
quite rich. The fairly straightforward
characterisation of 'supplier1', representing one
sense of "supplier", is
(Supplier ...
  (supplier1 (((*ent obje) give) (subj *org)) ...),
meaning approximately that some sort of organisation
(which may reduce to an individual) gives entities.
The meaning representation for the whole sentence 
"Suppliers live in cities" (with the formulae for 
individual units abbreviated, for space reasons, to 
their head primitives) is 
(clause ...
  (v
    (live1 ... be (@@agent (n (supplier1 ... man)))
      (@@location (n (city2 ... spread)))))),
where @@agent and @@location are case labels. "The
parts are coloured red" will be analysed as 
(clause ... (v
  (be2 ... be
    (@@agent (n (part1 ... thing (@@number many))))
    (@@state (st (colour1 ... sign)
      (val (red1 ... sense))))))),
and "Who supplies green parts?" will give rise to 
the structure: 
(clause ... (type question) (v
  (supply1 ... give
    (@@agent (n (query (dummy))))
    (@@object (n (part1 ... thing
      (trace (clause v agent)) (clause
        (v
          (be2 ... be (@@agent
            (n (part1 ... thing)))
            (@@state (st (colour1 ... sign)
              (val (green1 ... sense))))))))))))))).
As these examples show, the analyser's
representations combine expressive power with
structural simplicity. Further, the power of the
semantic category primitives used to identify text
message patterns means that it is possible to
achieve far more semantic analysis of a question, far
earlier in the front end processing, than can be
achieved with front ends conforming to the Konolige
model. The effectiveness of the analyser as a general
natural-language processing device has been
demonstrated by its successful application to a
range of natural language processing tasks. There is,
however, a price to pay, in the database context, for
its generality. Natural language makes common use of
vague concepts ("have", "do"), almost content-empty
markers ("be", "of"), and opaque constructions such
as compound nouns. Clearly, front ends where
domain-specific information can provide leverage in
interpreting these input text items have advantages,
and it is not clear how a principled solution to the
problems they present can be achieved within the
framework of a general-purpose analyser of the kind
described. To provide a domain-specific
interpretation of, for example, compounds like
"supplier city", an interface would have to be
provided characterising domain knowledge in the
semantic terms familiar to the parser, and
guaranteeing the provision of explicit structural
characterisations of the text constituent which
would be available for further exploitation by the
parser.
To avoid invoking domain knowledge in this way in 
analysis we have been obliged to accept question
interpretations which are incomplete in limited 
respects. That is, we push the ordinary semantic 
analysis procedures as far as they will go, accepting 
that they may leave 'dummy' markers in the dependency 
structure and compound nominals with ambiguous 
member words and no explicit extracted structure. 
B. The Extractor 
While the meaning representations constructed by
the natural language analyser are general and
informative enough to be able to support different
tasks in different applications for different
domains, they are not necessarily the best form of
representation for question answering, and
specifically for addressing a coded database. After
the initial determination of question meaning,
therefore, the question is subjected to
task-oriented, though not yet domain- and
database-oriented, processing. Imposing domain world and
database organisation restrictions on the question
at this stage would be premature, since it could
complicate or even inhibit possible later inference
operations. The idea of providing a system component
addressing a general linguistic task, without
throwing away any detailed information not in fact
needed for some specific instance of that task, like
natural language distinctions between quantifiers
ignored by the database system, is also an attractive
one.
The extractor thus emphasises the fact that the 
input text is a question, but carries the detailed
semantic information provided by the analyser
forward for exploitation in the translation phase of
the processing.
A good way to achieve a question formulation
abstracted from the low-level organisation of the
database is to interpret the user's input as a formal
query. However our extractor, unlike the equivalent
processors described by (Woods 1972), (Warren and
Pereira 1981) and (Grosz et al 1982), does not make
any use of domain-dependent information, but
constructs a logic expression whose variable ranges
and predicate relationships are defined in terms of
the general semantic primitives used for 
constructing the input question meaning
representation. The logic representation of the
question which is output by the extractor highlights
the search aspects of the input, formalising them so
that the subsequent processes which will eventually
generate the search specification for the database
management system can locate and focus on them
easily; at the same time, the semantic richness of
the original meaning representation is maintained to
facilitate the later domain-oriented translation
operations.
The syntax of the logic representation closely
follows that defined by (Woods 1978):
(For <quantifier> <variable> / <range> :
  <restrictions on variable> - <proposition>),
where each of the restrictions, or the proposition,
can themselves be quantified expressions. The 
rationale for such quantified expressions as media 
for questions addressed towards an abstract database 
has been discussed by Woods. As we accept this, we 
have developed a transformation procedure which 
takes the meaning representation of an input 
question and constructs a corresponding logic
representation in the form just described. Thus for 
the question "Who supplies green parts?" analysed in 
Section A, we obtain 
(For Every $Var1 / query : (For Every $Var2 / part1
  : (colour1 $Var2 green1) - (supply1 $Var1 $Var2))
  - (Display $Var1)),
where the lexically-derived items indicating the
ranges of the quantified variables ('query', 'part1'),
the relationships between the variables ('supply1')
and the predicates and predicate values ('colour1',
'green1') in fact carry along with them their
semantic formulae: these are omitted here, and in the
rest of the paper, to save space.
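The quantified form can be held as a small recursive data structure. The sketch below is an illustration only, not the authors' implementation: the class names `Pred` and `For` and the function `render` are hypothetical, and the semantic formulae attached to each item are omitted, as in the text.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Pred:
    """An atomic predication such as (supply1 $Var1 $Var2)."""
    name: str
    args: tuple


@dataclass(frozen=True)
class For:
    """(For <quantifier> <variable> / <range> : <restriction> - <proposition>)."""
    quantifier: str
    variable: str
    range_: str
    restriction: Optional[Union["For", Pred]]
    proposition: Union["For", Pred]


def render(expr) -> str:
    """Print an expression in the bracketed notation used in the text."""
    if isinstance(expr, Pred):
        return "(" + " ".join((expr.name,) + tuple(str(a) for a in expr.args)) + ")"
    head = f"(For {expr.quantifier} {expr.variable} / {expr.range_}"
    if expr.restriction is not None:
        head += f" : {render(expr.restriction)}"
    return f"{head} - {render(expr.proposition)})"


# "Who supplies green parts?"
inner = For("Every", "$Var2", "part1",
            Pred("colour1", ("$Var2", "green1")),
            Pred("supply1", ("$Var1", "$Var2")))
query = For("Every", "$Var1", "query", inner, Pred("Display", ("$Var1",)))
print(render(query))
```

Because a restriction or a proposition may itself be a `For`, the nesting of quantified expressions described above falls out of the type definition directly.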
The extractor is geared to seek, in the analyser's
dependency structures, the simple propositions
(atomic predications) which make up the logic
representation. Following the philosophy of the
semantic theory underlying the analyser design,
these simple propositions are identified with the
basic messages, i.e. semantic patterns, which drive
the parser and are expressed in the meaning
representations it produces as verb and noun group
clusters of case-related elements. In order to
'unpack' these, the extractor looks for the sources
of atomic predicates as 'SVO' triples, identifiable
by a verb (or noun) and its case role fillers, which
can be extracted quite naturally in a
straightforward way from the dependency structure.
Depending both on the semantic characterisation
of the verb and its case arguments, and on the
semantic context as defined by the dependency tree,
the triples are categorised as belonging to one of
two types: [$Obj $Link $Obj], or [$Obj $Poss $Prop],
where the $Obj, $Link, or $Prop items are further
characterised in semantic terms. It is clear that the
'basic messages' that the extractor seeks to identify
as a preliminary step to constructing the logic
representation define either primitive
relationships between objects, or properties of
those same objects. Thus the meaning representation
for "part suppliers" will be unpicked as a 'dummy'
relationship between "suppliers" and "parts", i.e. as
[$Obj1(supplier1) $Link1(dummy) $Obj2(part1)],
while "green parts" will be interpreted as
[$Obj2(part1) $Poss(be2) $Prop(colour1=green1)].
Larger constructs can be similarly decomposed: thus
"Where do the status 32 red parts suppliers live?"
will be broken down into the following set of
triples:
[$Obj1(supplier1) $Link1(live1) $Obj3(query)]
& [$Obj1(supplier1) $Link2(dummy) $Obj2(part1)]
& [$Obj1(supplier1) $Poss1(be2) $Prop1(status=32)]
& [$Obj2(part1) $Poss2(be2) $Prop2(colour1=red1)].
It must be emphasised that while there are parallels
between these structures and those of the
entity-attribute approach to data modelling, the forms of
triple were chosen without any reference to
databases. As noted earlier, they naturally reflect
the form of the 'atomic propositions', i.e. basic
messages, used as semantic patterns by the natural
language analyser.
For completeness, the triples underlying the 
earlier question "Who supplies green parts?" are 
[$Obj1(query=identity) $Link1(supply1) $Obj2(part1)]
& [$Obj2(part1) $Poss1(be2) $Prop1(colour1=green1)].
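The two triple types can be sketched as plain records. The code below is a hypothetical illustration: the names `Link`, `Prop` and `index_triples` are not from the paper, and the triples are entered by hand rather than extracted from a dependency structure. It encodes the triples for "Where do the status 32 red parts suppliers live?" and separates object properties from inter-object links, exposing the 'dummy' link that is left for later domain-dependent resolution.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    """[$Obj $Link $Obj]: a relationship between two objects."""
    subject: str
    link: str
    object: str


class Prop(NamedTuple):
    """[$Obj $Poss $Prop]: a property of one object."""
    owner: str
    poss: str
    attribute: str
    value: str


triples = [
    Link("supplier1", "live1", "query"),
    Link("supplier1", "dummy", "part1"),
    Prop("supplier1", "be2", "status", "32"),
    Prop("part1", "be2", "colour1", "red1"),
]


def index_triples(items):
    """Group property triples by owning object; keep link triples in order."""
    properties = defaultdict(list)
    links = []
    for item in items:
        if isinstance(item, Prop):
            properties[item.owner].append((item.attribute, item.value))
        else:
            links.append(item)
    return dict(properties), links


properties, links = index_triples(triples)
unresolved = [t for t in links if t.link == "dummy"]
```

Grouping properties by owner corresponds roughly to collecting the restrictions on each quantified variable, while the links supply the propositions relating the variables.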
The sets of interconnected triples are derived
from the meaning representations by a fairly simple
recursive procedure. The next stage of the
extraction process restructures the triples tree
into a skeleton quantified structure, the logic
representation, to be passed forward to the
translator generating the formal query
representation. Whenever more explicit information
regarding the interpretation of the input as a
question can be extracted from the meaning
representation, this is incorporated into the logic
representation. Thus the processing includes
identification and scoping of quantifiers following
the approach adopted by Woods, and establishing the
aspect, modality and focus of the question. Like
anyone else, we do not claim to provide a
comprehensive treatment of natural language
quantifiers, and indeed in practice have not
implemented processes for all the quantifiers
handled by LUNAR.
The logic representation defines the logical
content and structure of the information the user is
seeking. It may, as noted, be incomplete at points
where domain reference is required, e.g. in the
interpretation of compound nouns; but it carries
along, to the translator, the very large amount of
semantic information provided by the case labels and
formulae of the meaning representation, which should
be adequate to pinpoint the items sought by the user
and to describe them in terms suited to the database
management system, so they may be accessed and
retrieved.
IV TRANSLATION
A. The Translator
In the process of transforming the semantic 
content of the user's question into a low-level 
search representation geared to the administrative 
structure of the target database, it is necessary to 
reconcile the user's view of the world with the 
domain model. Before even attempting to construct, 
say, a relational algebra expression to be
interpreted by the back-end database management
system, we must try to interpret the semantic content
of the logic representation with reference to the
segment or variant of the real world modelled by the
database.
An obvious possibility here is to proceed
directly from the variables and predications of the
logic representation to their database counterparts.
For example,
(supply1 (give)
  $Var1/supplier1 (man) $Var2/part1 (thing))
can be mapped directly onto a relation Shipments in 
the Suppliers and Parts database. The mapping could 
be established by reference to the lexicon and to a 
schedule of equivalences between logical and 
database structures. 
This approach suffers, however, from severe 
problems: the most important is that end users do not 
necessarily constrain their natural language to a 
highly limited vocabulary. Even in the simple 
context of the Suppliers and Parts database, it is
possible to refer to "firms", "goods", "buyers", 
"sellers", "provisions", "customers", etc. In fact, it 
was precisely in order to bring variants under a 
common denominator that semantic grammars were 
employed. We, in contrast, have a more powerful, 
because more flexible, semantic apparatus at our 
disposal, capable of drawing out the similarities 
between "firms", "sellers", and "suppliers", as
opposed to taking them as read. Thus a general
semantic pattern which will match the dictionary
definitions of all of these words is (((*ent obje)
give) (subj *org)). Furthermore, if instead of
attempting to define any sort of direct mapping
between the natural language terms and expressions
of the user and corresponding domain terms and
expressions, we concentrate on finding the common
links between them, we can see that even though the
domain and, in turn, database terms and expressions
may not mean exactly the same as their natural
language relatives or sources, we should be able to
detect overlaps in their semantic characterisations.
It is unlikely that the same or similar words will be
used in both natural and data languages if their
meanings have nothing in common, even if they are not
identical, so characterising each using the same
repertoire of semantic primitives should serve to
establish the links between the two. Thus, for
example, one sense of the natural language word
"location" will have the formula (this (where
spread)) and the data language word "&city"
referring to the domain object &city will have the
formula (((man folk) wrap) (where spread)), which
can be connected by the common constituent (where
spread).
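The idea of linking two formulae through a shared constituent can be sketched as follows. This is a deliberately crude stand-in for the flexible pattern matcher the front end actually uses: it only intersects bracketed sub-formulae exactly, and the function names `parse`, `constituents` and `common_constituents` are hypothetical.

```python
def parse(text):
    """Parse a parenthesised formula into nested tuples of primitive names."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            items = []
            while tokens[position] != ")":
                items.append(read())
            position += 1  # consume the closing bracket
            return tuple(items)
        return token

    return read()


def constituents(formula):
    """Yield every bracketed sub-formula, including the formula itself."""
    if isinstance(formula, tuple):
        yield formula
        for part in formula:
            yield from constituents(part)


def common_constituents(first, second):
    """Return the bracketed sub-formulae that occur in both formulae."""
    return set(constituents(first)) & set(constituents(second))


location = parse("(this (where spread))")
city = parse("(((man folk) wrap) (where spread))")
shared = common_constituents(location, city)
print(shared)
```

For the two formulae quoted above the only shared constituent is (where spread), which is the link between "location" and "&city" that the text describes.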
One distinctive feature of our front end design, 
the use of general semantics for initial question 
interpretation, is thus connected with another: the
more stringent requirements imposed on natural
language to data language translation by the initial
unconstrained question interpretation can be met by
exploiting the resources for language meaning
representation initially utilised for the natural
language question interpretation. We define the
domain world modelled by the database using the same
semantic apparatus as the one used by the natural
language front end processor, and invoke a flexible
and sophisticated semantic pattern matcher to
establish the connection between the semantic
content of the user question (which is carried over
in the logic representation) and related concepts in
the domain world. Taking the next step from a domain
world concept or relationship between domain world
objects to their direct model in the administrative
structure of the database is then relatively easy. 
Since the domain world is essentially a closed 
world restricted in sets if not in their members, it 
is possible to describe it in terms of a limited set 
of concepts and relationships: we have possible 
properties of objects and potential relationships 
between them. We can talk about &suppliers and &parts 
and the important relationship between them, namely 
that &suppliers &supply &parts. We can also specify 
that &suppliers &live in &cities, &parts can be
&numbered, and so on.
We can thus utilise, either explicitly or
implicitly, a description of the domain world which
could be represented by dependency structures like
those used for natural language. The important point
about these is the way they express the semantic
content of whole statements about the domain, rather
than the way they label individual domain-referring
terms as, e.g. "&supplier" or "&part". It is then easy
to see how the logic representation for the question
"What are the numbers of the status 30 suppliers?",
namely
(For Every $Var1/supplier1 : (status1 $Var1 30)
  - (Display (number1 $Var1))),
can be unpacked by semantic pattern matching 
routines to establish the connection between
"supplier1" and "&supplier", "number1" and
"&number", and so on. In the same way the logic
representations for "From where does Blake operate?"
and "Where are screws found?" can be analysed for
semantic content which will establish that "Blake"
is a &supplier, "operate" in the context of the
database domain means &supply, and "where" is a query
marker acting for &city from which the &supplier
Blake &supplies (as opposed to street corner, bucket
shop, or crafts market); similarly, "screw" is an
instance of &part and the only locational
information associated with &parts in the database
in question is the &city where they are stored. All
this becomes clear simply by matching the underlying
semantic primitive definitions of the natural
language and domain world words, in their
propositional contexts.
The translator is also the module where domain
reference is brought in to complete the
interpretation of the input question where this
cannot be fully interpreted by the analyser alone.
The semantic pattern-matching potential of the
translation module can be exploited to determine the 
nature of the unresolved domain-specific 
predications (both 'dummy' relationships and those 
implicit in compound nominals), and vacuously 
defined objects ('query' variables). Thus the 
fragment of logical form for "... London suppliers of
parts ...", namely
(For <quant> $Var1/supplier1
  : (AND (For <quant> $Var2/part1
    - (dummy $Var1 $Var2)) (For <quant> $Var3/London
    - (dummy $Var1 $Var3)))
is broken down into the corresponding domain
predications
(&supply $Var1(&supplier) $Var2(&part))
and
(&live $Var1(&supplier) $Var3(&city)),
while translating the logic representation for the
example question "Who supplies green parts?" gives
the query representation
(For Every $Var1/&supplier
  : (For Every $Var2/&part : (&colour $Var2 green)
    - (&supply $Var1 $Var2))
  - (Display $Var1)).
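The effect of resolving a 'dummy' relationship can be sketched once the ranges of its two variables have been tied to domain concepts. In the sketch below both tables are hypothetical and hand-written: in the front end described here these connections come out of semantic pattern matching against the domain world description, not out of lookup tables, so the dictionaries stand in only for the result of that matching.

```python
# Hypothetical outcome of pattern matching: word sense -> domain concept.
CONCEPT_OF = {
    "supplier1": "&supplier",
    "part1": "&part",
    "London": "&city",
}

# Hypothetical domain world description: which predication relates two concepts.
DOMAIN_PREDICATIONS = {
    ("&supplier", "&part"): "&supply",
    ("&supplier", "&city"): "&live",
}


def resolve_dummy(var_a, range_a, var_b, range_b):
    """Replace (dummy var_a var_b) by the domain predication linking the ranges."""
    concept_a = CONCEPT_OF[range_a]
    concept_b = CONCEPT_OF[range_b]
    predicate = DOMAIN_PREDICATIONS.get((concept_a, concept_b))
    if predicate is None:
        raise LookupError(f"no domain predication links {concept_a} and {concept_b}")
    return (predicate, f"{var_a}({concept_a})", f"{var_b}({concept_b})")


supply = resolve_dummy("$Var1", "supplier1", "$Var2", "part1")
live = resolve_dummy("$Var1", "supplier1", "$Var3", "London")
```

The two results correspond to the domain predications given above for "... London suppliers of parts ...".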
Apart from the fact that semantic pattern 
matching seems to cope quite successfully with 
unexpected inputs ('unexpected' in the sense that in 
the alternative approach no mapping function would
have been defined for them, thus implying a failure 
to parse and/or interpret the input question), 
having a general natural language analyser at our 
disposal offers an additional bonus: the description 
of the domain world in terms of semantic primitives 
and primitive patterns can be generated largely 
automatically, since the domain world can be 
described in natural language (assuming, of course, 
an appropriate lexicon of domain world words and
definitions) and the descriptions simply analysed as
utterances, producing a set of semantic structures
which can subsequently be processed to obtain a
repertoire of domain-relevant forms to be exploited
for the matching procedures.
B. The Convertor 
Having identified the domain terms and
expressions, we have a high-level database
equivalent of the original English question. A
substantial amount of processing has pinpointed the
question focus, has eliminated potential
ambiguities, has resolved domain-dependent language
constructions, and has provided fillers for 'dummy'
or 'query' items. Further, the system has established
that "London" is a &city, for example, or that
"Clark" is a specific instance of &supplier. The
processing now has to make the final transition to
the specific form in which questions are addressed
to the actual database management system. The
semantic patterns on which the translator relies,
for example defining a domain word "&supplier" as
(((*ent obje) give) (subj *org)), while adequate
enough to deduce that Clark is a &supplier, are not
informative enough to suggest how &suppliers are
modelled in the actual database.
Again, the obvious approach to adopt here is the
mapping one, so that, for instance, we have:
&supplier ==> relation Supplier
Clark ==> tuple of relation Supplier
          such that Sname="Clark"
But this approach suffers from the same limitations
as direct mapping from logic representation to
search representation; and a more flexible approach
using the way the database models the domain world
has been adopted.
In the previous section we discussed how the 
translator uses an inventory of semantic patterns to 
establish the connection between natural language 
and domain world words. This inventory is not, 
however, a flat structure with no internal 
organisatlon. On the ccntrar~ the semantic 
information about the domain world is crganised in 
such a way that it can naturally be associated with 
the administrative structure cf the target database, 
For example in a relational database, a relation with 
tuples over domains represents properties of. cr 
relationships between, the objects in the domain 
world. The objects, properties and relationships are 
described by the semantic apparatus used for the 
translator, and as they also underlie, at not toc 
great remove, the database structure, the domain 
world concepts or predications of the query 
representation act as pointers into the data 
structures cf the database administrative 
crganlsatlon. 
For example, given the relation Supplier over the 
domains Sname, Sno, Status and Scity, the semantic 
patterns which describe the facts that in the domain 
world &suppliers &have &status, &numbers, &names and 
&live in &cities are crosslinked, in the sense that 
they have the superstructure of the database 
relation Supplier imposed over them. We can thus use 
them to avoid explicit mapping between query data 
references and template relational structures for 
the database. From the initial meaning 
representation for the question fragment "... Clark, 
who has status 30 ..." through to the query 
representation, the semantic pattern matching has 
established that Clark is an instance of &supplier, 
that the relationship between the generic &supplier 
and the specific instance of &supplier (i.e. Clark) 
is that of &name, and that the query is focussed on 
his &status (whose value is supplied explicitly). 
Now from the position of the query predication 
(&status &supplier 30) in the characterisation of 
the relation Supplier, the system will be able to 
deduce that the way the target database 
administrative structure models the question's 
semantic content is as a relation derived from 
Supplier with "Clark" and "30" as values in the 
columns Sname and Status respectively. 
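
The lookup just described can be pictured with a small sketch. It is 
only an illustration, written in Python rather than the LISP of the 
actual system; the table CHARACTERISATION, the function names and the 
encoding of predications as pairs are assumptions made for the 
example, not the system's own data structures. 

```python
# Hypothetical characterisation of the relation Supplier: each domain-world
# predicate described by the semantic patterns is tied to one column.
CHARACTERISATION = {
    "Supplier": {
        "&name": "Sname",
        "&number": "Sno",
        "&status": "Status",
        "&live": "Scity",
    }
}


def locate(predicate, relation):
    """Return the column that models `predicate` within `relation`."""
    return CHARACTERISATION[relation][predicate]


def restrict(relation, predications):
    """Turn query predications into a selection over `relation`.

    `predications` is a list of (predicate, value) pairs that pattern
    matching has already attached to an instance of the concept.
    """
    conditions = [(locate(p, relation), v) for p, v in predications]
    return ("select", relation, conditions)


# "... Clark, who has status 30 ...": Clark is tied to &supplier by &name,
# and the query predication is (&status &supplier 30).
expr = restrict("Supplier", [("&name", "Clark"), ("&status", 30)])
print(expr)
```

The point of the sketch is that no template for "Clark" or "30" is 
stored anywhere: the columns are found from the position of each 
predicate in the characterisation of the relation. 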
The convertor thus employs declarative knowledge 
about the database organisation and the 
correspondence between this and the domain world 
structure to derive a generalised relational algebra 
expression which is an interpretation of the formal 
query in the context of the relational database 
model of the domain. We have chosen to gear the 
convertor towards a generalised relational algebra 
expression, because both its simple underlying 
definition and the generality of its data structures 
within the relational model allow easy generation of 
final low-level search representations for different 
specific database access systems. 
To derive the generalised relational algebra form 
of the question from the query representation, the 
convertor uses its knowledge of the way domain 
objects and predications are modelled in the 
database to establish a primary or derivable 
relation for each of the quantified variables of the 
query representation. These constituents of the 
algebra expression are then combined, with an 
appropriate sequence of relational operators, to 
obtain the complete expression. 
The basic premise of the convertor is that every 
quantified variable in the formal representation can 
be associated with some primary or computable 
relation in the target database; restrictions on the 
quantified variables specify how, with that relation 
as a starting point, further relational algebra 
computations can be performed to model the 
restricted variable; the process is recursive, and as 
the query representation is scanned by the 
convertor, variables and their associated relational 
algebra expressions are bound by an 
'environment-type' mechanism which provides all the 
necessary information to 'evaluate' the propositions 
of the query. Thus conversion is evaluating a predicate 
expression in the context of its semantic 
interpretation in the domain world and the 
environment of the database models for its 
variables. 
For example, given the query representation 
fragment for the phrase "... all London suppliers who 
supply red parts ...", namely 
(For_Every $Var1/&supplier : (AND 
    (For_The $Var3/London - (&live $Var1 $Var3)) 
    (For_Every $Var2/&part : (&colour $Var2 red) 
        - (&supply $Var1 $Var2))) .... 
$Var1 will initially be bound to the primary 
relation Supplier, which will be subsequently 
restricted to those tuples where Scity is equal to 
"London". Similarly, $Var2 will be associated with a 
partial relation derived from Part, for which the 
value of Colour is "red". Evaluating the proposition 
(&supply $Var1 $Var2), whose domain relationship is 
modelled in the database by Shipments, will in the 
environment of $Var1 and $Var2 yield the relational 
expression 
(join 
    (select Supplier where Scity equals "London") 
    (join Shipments 
        (select Part where Colour equals "red"))). 
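
The environment mechanism can be sketched as a small recursive 
procedure. This is a hedged illustration in Python, not the LISP 
implementation: the lookup tables, the tuple encoding of the query 
representation and the function names convert and show are all 
assumptions introduced for the example. 

```python
# Assumed tables standing in for the convertor's declarative knowledge.
CONCEPTS = {"&supplier": "Supplier", "&part": "Part"}   # concept -> primary relation
PROPERTIES = {"&live": "Scity", "&colour": "Colour"}    # predicate -> column
LINKS = {"&supply": "Shipments"}                        # predicate -> linking relation
QUANTIFIERS = ("For_Every", "For_The", "For_Some")


def convert(form, env):
    """Scan a query form, updating env (variable -> algebra expression)."""
    head = form[0]
    if head == "AND":
        for sub in form[1:]:
            convert(sub, env)
    elif head in QUANTIFIERS:
        _, var, kind, restriction, proposition = form
        # A concept starts from its primary relation; a name such as
        # London is simply kept as a constant value.
        env[var] = CONCEPTS.get(kind, kind)
        for part in (restriction, proposition):
            if part is not None:
                convert(part, env)
    elif head in PROPERTIES:
        _, var, value = form
        value = env.get(value, value)  # dereference a bound variable
        env[var] = ("select", env[var], PROPERTIES[head], value)
    elif head in LINKS:
        _, left, right = form
        env[left] = ("join", env[left], ("join", LINKS[head], env[right]))
    else:
        raise ValueError(f"unknown form: {head}")
    return env


def show(expr):
    """Render an expression in the notation used in the text."""
    if isinstance(expr, str):
        return expr
    if expr[0] == "select":
        return f'(select {show(expr[1])} where {expr[2]} equals "{expr[3]}")'
    return f"(join {show(expr[1])} {show(expr[2])})"


query = ("For_Every", "$Var1", "&supplier",
         ("AND",
          ("For_The", "$Var3", "London", None, ("&live", "$Var1", "$Var3")),
          ("For_Every", "$Var2", "&part",
           ("&colour", "$Var2", "red"),
           ("&supply", "$Var1", "$Var2"))),
         None)  # the part of the query elided by "...." is left out

env = convert(query, {})
print(show(env["$Var1"]))
```

Each variable is bound when its quantifier is met, restrictions 
narrow the binding in place, and the linking predicate combines the 
bindings of both of its arguments, which is the order of events 
described for $Var1 and $Var2 above. 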
At this point, the information that the user wants 
has been described in terms of the target relational 
database: names of files, fields and columns. The 
search description has, however, still to be given 
the specific form required by the back-end database 
management system. This is achieved by a fairly 
straightforward application of standard compiling 
techniques, and does not deserve detailed discussion 
here. At present we can generate search 
specifications in three different relational search 
languages. Thus the final form in the local search 
language Salt of the example question "Who supplies 
green parts?" is 
list (Part:Colour="green" 
      * (Supplier * Shipments)) 
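
As a rough indication of what this compiling step involves, the 
following Python sketch flattens an algebra expression into text laid 
out like the Quel search representations given with the example 
questions in this paper. It is an assumption-laden toy: the JOIN_KEYS 
table, the tuple encoding of the expression and the function names 
are invented for the illustration, and the output has not been 
checked against a real Quel system. 

```python
# Assumed knowledge of which column links each pair of relations.
JOIN_KEYS = {("Part", "Shipments"): "Pno", ("Supplier", "Shipments"): "Sno"}


def flatten(expr, relations, conditions):
    """Collect the base relations and selection conditions of an expression."""
    if isinstance(expr, str):
        if expr not in relations:
            relations.append(expr)
    elif expr[0] == "select":
        # In this sketch a selection always applies to a base relation.
        _, relation, column, value = expr
        flatten(relation, relations, conditions)
        conditions.append((relation, column, value))
    else:  # join
        flatten(expr[1], relations, conditions)
        flatten(expr[2], relations, conditions)


def to_quel(expr, target):
    """Emit range declarations and one retrieve statement for `expr`."""
    relations, conditions = [], []
    flatten(expr, relations, conditions)
    names = {rel: f"Q1-var{i}" for i, rel in enumerate(relations, 1)}
    lines = [f"Range of {names[rel]} is {rel}" for rel in relations]
    clauses = [f"({names[a]}.{key} = {names[b]}.{key})"
               for (a, b), key in JOIN_KEYS.items()
               if a in names and b in names]
    clauses += [f'({names[rel]}.{col} = "{val}")' for rel, col, val in conditions]
    rel, col = target
    lines.append(f"Retrieve into Terminal ({names[rel]}.{col}) where")
    lines.append(" and ".join(clauses))
    return "\n".join(lines)


# "Who supplies green parts?" as a join of restricted Part with
# Supplier and Shipments.
expr = ("join", ("select", "Part", "Colour", "green"),
        ("join", "Supplier", "Shipments"))
quel = to_quel(expr, ("Supplier", "Sname"))
print(quel)
```

Because the algebra expression carries only relation names, columns 
and values, a generator for another search language would differ only 
in the final formatting step. 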
V IMPLEMENTATION 
All of the modules have been implemented (in 
LISP). The convertor is at present restricted to 
relational databases, and we would like to extend it 
to other models. The system has so far been tested on 
Suppliers and Parts, which is a toy database from the 
point of view of scale and complexity, but which is 
rich enough to allow questions presenting challenges 
to the general semantics approach to question 
interpretation. To illustrate the performance of the 
front end, we show below the query representations 
and final search representations for some questions 
addressed to this database. Work is currently in 
progress to apply the front end to a different 
(relational) database containing planning 
information: this simulates IBM's TQA database 
(Damerau 1980). Most of the work in this is likely to 
come in writing the lexical entries needed for the 
new vocabulary. Longer term developments include 
validating each step of the translation by 
generating back into English, and extending the 
front end, and specifically the translator, with an 
inference engine. 
Clearly, in the longer term, database front ends 
will have to be provided with an inference 
capability. As Konolige points out, in attempting to 
insulate users, with their particular and varied 
views of the domain of discourse, from the actual 
administrative organisation of the database, it may 
be necessary to do an arbitrary amount of 
inferencing exploiting domain information to connect 
the user's question with the database. An obvious 
problem with front ends not clearly separating 
different processing stages is that it may be 
difficult to handle inference in a coherent and 
controlled way. Insofar as inference is primarily 
domain-based, it seems natural in a modular front end 
to provide an inference capability as an extension 
of the translator. This should serve both to localise 
inference operations and to facilitate them because 
they can work on the partially-processed input 
question. However the inference engine requires an 
explicit and well-organised domain model, and 
specifically one which is rather more comprehensive 
than current data models, or than the rather informal 
conceptual schema we have used to drive the 
translator. 
We hope to begin work on providing an inference 
capability in the near future, but it has to be 
recognised that even for the restricted task of 
database access, it may prove impossible to confine 
inference operations to a single module: doing so 
would imply, for example, that compound nouns will 
generally only be partly interpreted in the analysis 
and extraction phases. Starting with inference 
limited to the translation module is therefore 
primarily a research strategy for tackling the 
inference problem. 
• Green parts are supplied by which suppliers? 
+ query representation: 
(For_Every $Var1/&supplier 
    : (For_Every $Var2/&part : (&colour $Var2 green) 
        - (&supply $Var1 $Var2)) 
    - (Display $Var1)) 
+ search representation in Quel: 
Range of Q1-var1 is Part 
Range of Q1-var2 is Supplier 
Range of Q1-var3 is Shipments 
Retrieve into Terminal (Q1-var2.Sname) where 
(Q1-var1.Pno = Q1-var3.Pno) and (Q1-var2.Sno = Q1-var3.Sno) 
and (Q1-var1.Colour = "green") 
• From where does Blake operate? 
+ query representation: 
(For_The $Var2/&city 
    : (For_The $Var1/Blake - (&live $Var1 $Var2)) 
    - (Display $Var2)) 
+ search representation in Quel: 
Range of (Q1-var1) is (Supplier) 
Retrieve into Terminal (Q1-var1.Scity) 
where (Q1-var1.Sname = "Blake") 
• What is the status of the Paris part suppliers 
who supply blue parts? 
+ query representation: 
(For_Every $Var1/&supplier 
    : (AND 
        (For_Some $Var2/&part - (&supply $Var1 $Var2)) 
        (For_The $Var3/Paris - (&live $Var1 $Var3)) 
        (For_Every $Var4/&part 
            : (&colour $Var4 blue) 
            - (&supply $Var1 $Var4))) 
    - (Display (&status $Var1))) 
+ search representation in Quel: 
Range of Q1-var1 is Part 
Range of Q1-var2 is Supplier 
Range of Q1-var3 is Shipments 
Retrieve into Terminal (Q1-var2.Status) where 
(Q1-var1.Pno = Q1-var3.Pno) 
and (Q1-var2.Sno = Q1-var3.Sno) 
and (Q1-var2.Scity = "Paris") 
and (Q1-var1.Colour = "blue") 
VI CONCLUSION 
The project results so far suggest that 
developing a natural language front end to databases 
based on a general semantic analyser which 
constructs rich and explicit meaning representations 
offers distinct advantages in at least two respects: 
it makes all subsequent processing cleaner than 
would be the case with a representation dominated by 
conventional syntax, and enhances portability by 
encouraging the declarative description of 
domain-specific knowledge. 

REFERENCES 

Boguraev, B.K. "Automatic resolution of linguistic 
ambiguities", Technical Report No.11, Computer 
Laboratory, University of Cambridge, 1979. 

Boguraev, B.K. and Sparck Jones, K. "A natural 
language analyser for database access", 
Information Technology: Research and Development, 
1, 23-39, 1982. 

Bronnenberg, W.J.H.J. et al. "The question answering 
system PHLIQA1", in Natural language question 
answering systems (Ed. Bolc), London: Macmillan, 
1979. 

Damerau, F.J. "The transformational question 
answering (TQA) system: description, operating 
experience, and implications", Report RC8287, IBM 
Thomas J. Watson Research Center, Yorktown 
Heights, N.Y., 1980. 

Date, C.J. An introduction to database systems, 
Reading, Mass.: Addison-Wesley, 1977. 

Grosz, B. et al. "DIALOGIC: a core natural-language 
processing system", in Proceedings of the Ninth 
International Conference on Computational 
Linguistics, Prague, 1982. 

Hendrix, G.G. et al. "Developing a natural language 
interface to complex data", ACM Transactions on 
Database Systems, 3, 105-147, 1978. 

Konolige, K. "A framework for a portable 
natural-language interface to large data bases", 
Technical Note 197, Artificial Intelligence 
Center, SRI International, 1979. 

Waltz, D.L. "An English language question answering 
system for a large relational database", 
Communications of the ACM, 21, 526-539, 1978. 

Warren, D.H.D. and Pereira, F.C.N. "An efficient easily 
adaptable system for interpreting natural 
language queries", Research Paper 155, Department 
of Artificial Intelligence, University of 
Edinburgh, 1981. 

Wilks, Y. "An intelligent analyser and understander 
of English", Communications of the ACM, 18, 264-274, 
1975. 

Woods, W.A. "The lunar sciences natural language 
information system", Final Report, Bolt, Beranek 
and Newman Inc., Cambridge, Mass., 1972. 

Woods, W.A. "Semantics and quantification in natural 
language question answering", Advances in 
Computers, 17, 1-87, 1978. 
