Parsing in the Absence of a Complete Lexicon
Jim Davidson and S. Jerrold Kaplan 
Computer Science Department, Stanford University
Stanford, CA 94305
I. Introduction 
It is impractical for natural language parsers which serve as front ends to 
large or changing databases to maintain a complete in-core lexicon of 
words and meanings. This note discusses a practical approach to using 
alternative sources of lexical knowledge by postponing word categorization 
decisions until the parse is complete, and resolving remaining lexical 
ambiguities using a variety of information available at that time.
II. The Problem
A natural language parser working with a database query system (e.g.,
PLANES [Waltz et al, 1976], LADDER [Hendrix, 1977], ROBOT [Harris,
1977], CO-OP [Kaplan, 1979]) encounters lexical difficulties not present in
simpler applications. In particular, the description of the domain of
discourse may be quite large (millions of words), and varies as the
underlying database changes. This precludes reliance upon an explicit,
fixed lexicon--a dictionary which records all the terms known to the
system--because of:
(a) redundancy: Keeping the same information in two places (the lexicon
and the database) leads to problems of integrity. Updating is more
difficult if it must occur simultaneously in two places.
(b) size: A database of, say, 30,000 entries cannot be duplicated in
primary memory.
For example, it may be impractical for a system dealing with a database
of ships to store the names of all the ships in a separate in-core lexicon. If
not all allowable lexical entries are explicitly encoded, there will be terms
encountered by the parser about which nothing is known. The problem is
to assign these terms to a particular class, in the absence of a specific
lexical entry.
Thus, given the sentence, "Where is the Fox docked?", the parser would
have to decide, in the absence of any prior information about "Fox", that
it was the name of a ship, and not, say, a port.
III. Previous approaches
There are several methods by which unknown terms can be immediately
assigned to a category: the parser can check the database to see if the
unknown term is there (as in [Harris, 1977]); the user may be
interactively queried (in the style of RENDEZVOUS [Codd et al., 1978]);
the parser might simply make an assumption based on the immediate
context, and proceed (as in [Kaplan, 1979]). (We call these
extended-lexicon methods.) However, these methods have the associated
costs of time, inconvenience, and inaccuracy, and so constitute imperfect
solutions.
Note in particular that simply using the database itself as a lexicon will 
not work in the general case. If the database is not fully indexed, the 
time required to search various fields to identify an unknown lexical item 
will tend to be prohibitive whenever it requires multiple disk accesses. In
addition, as noted in [Kaplan, Mays, and Joshi, 1979], the query may
reasonably contain unknown terms that are not in the database ("Is John 
Smith an employee?" should be answerable even if "John Smith" is not in 
the database). 
IV. An Approach--Delay the Decision, then Compare Classification 
Methods 
Our approach is to defer any lexical decision as long as possible, and then
to apply the extended-lexicon methods identified above, in order of
increasing cost.
Specifically, all possible parses are collected, using a semantic grammar
(see below), by allowing the unknown term to satisfy any category
required to complete the parse. The result is a list of categories for
unknown terms, each of which is syntactically valid as a classification for
the item. Consequently, interpretations that do not result in complete
parses are eliminated. Since a semantic grammar tightly restricts the class
of allowable sentences, this technique can substantially reduce the
complexity of the remaining disambiguation process.
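
The deferral step can be illustrated with a minimal sketch. It assumes a toy semantic grammar written as flat word/category patterns; the patterns, the lexicon contents, and the function name are all hypothetical and are not taken from LIFER or from the KBMS parser. An unknown word is allowed to fill any open category slot, and only the categories that let a whole pattern match are kept.

```python
# Toy semantic grammar (hypothetical): each pattern is a sequence of literal
# words (lower case) and category slots (upper case).
PATTERNS = [
    ("is", "PORTS", "in", "COUNTRIES"),
    ("is", "SHIPS", "in", "COUNTRIES"),
    ("what", "is", "SHIPS", "carrying"),
    ("what", "ships", "carry", "CARGOES"),
]

LEXICON = {"italy": "COUNTRIES"}                  # small in-core lexicon
OPEN_CATEGORIES = {"SHIPS", "PORTS", "CARGOES"}   # categories that admit names


def candidate_categories(words):
    """Map each unknown word to the categories that yield a complete parse."""
    result = {}
    for pattern in PATTERNS:
        if len(pattern) != len(words):
            continue
        assignment = {}
        for slot, word in zip(pattern, words):
            if slot.isupper():                    # category slot
                known = LEXICON.get(word)
                if known == slot:
                    continue                      # known word of the right class
                if known is None and slot in OPEN_CATEGORIES:
                    assignment[word] = slot       # let the unknown word fill it
                    continue
                break                             # known word of the wrong class
            elif slot != word:                    # literal word mismatch
                break
        else:
            # The whole pattern matched: record the tentative assignments.
            for word, category in assignment.items():
                result.setdefault(word, set()).add(category)
    return result


candidates = candidate_categories("is izmir in italy".split())
```

With these toy patterns, "izmir" is left with the candidates SHIPS and PORTS; CARGOES drops out because no pattern containing a cargo slot matches the complete sentence.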
The category assignments leading to successful parses are then ordered by
a procedure which estimates the cost of checking them. This ordering
currently assumes an underlying cost model in which accessing the
database on indexed or hashed fields is the least expensive, a single
remaining interpretation warrants an assumption of correctness, and lastly,
remaining ambiguities are resolved by asking the user.
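
A minimal sketch of this three-step ordering follows. The function and its two hooks (`indexed_lookup` and `ask_user`) are hypothetical names introduced for illustration; the paper does not specify an interface.

```python
def resolve(term, candidates, indexed_lookup, ask_user):
    """Pick one category for `term`, trying the cheapest method first.

    candidates     -- categories that led to a complete parse
    indexed_lookup -- dict mapping a category to a function(term) -> bool,
                      present only for files indexed or hashed by name
                      (hypothetical hook)
    ask_user       -- function(term, categories) -> category (hypothetical hook)
    Returns (category, how), where how is "database", "assumed", "user",
    or "failed".
    """
    remaining = list(candidates)

    # 1. Cheapest: probe the files that are indexed or hashed on the name.
    for category in [c for c in remaining if c in indexed_lookup]:
        if indexed_lookup[category](term):
            return category, "database"
        remaining.remove(category)

    # 2. A single remaining interpretation is assumed to be correct.
    if len(remaining) == 1:
        return remaining[0], "assumed"

    # 3. Most expensive: ask the user to choose among what is left.
    if remaining:
        return ask_user(term, remaining), "user"
    return None, "failed"


ship_names = {"FOX"}
lookup = {"SHIPS": lambda term: term in ship_names}
result = resolve("IZMIR", ["SHIPS", "PORTS"], lookup, lambda t, cats: cats[0])
```

Here only the SHIPS file is assumed to be indexed, so "IZMIR" is checked there first, not found, and then assumed to be a port without consulting the user.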
A disambiguated lexical item is added temporarily to the in-core lexicon,
so that future queries involving that term will not require repetition of the
disambiguation process. After the item has not been referenced for some
period of time (determined empirically) the term is dropped from the
lexicon.
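
One way to picture the temporary entries is a small table keyed by term that records the time of last reference. The class below is a sketch under that assumption; the name `TemporaryLexicon` and the fixed idle limit are illustrative, and the actual retention period in the system was determined empirically.

```python
import time


class TemporaryLexicon:
    """In-core entries that expire after a period without being referenced."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self.max_idle = max_idle_seconds
        self.clock = clock            # injectable so the sketch can be tested
        self._entries = {}            # term -> (category, time of last reference)

    def add(self, term, category):
        self._entries[term] = (category, self.clock())

    def lookup(self, term):
        entry = self._entries.get(term)
        if entry is None:
            return None
        category, last_referenced = entry
        now = self.clock()
        if now - last_referenced > self.max_idle:
            del self._entries[term]   # idle too long: drop the entry
            return None
        self._entries[term] = (category, now)   # refresh time of last reference
        return category
```

A term that keeps being referenced stays in core indefinitely, while one that falls out of use is forgotten and must be disambiguated again if it reappears.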
V. Example
This approach has been implemented in the parser for the Knowledge
Base Management Systems (KBMS) project testbed [Wiederhold, 1978].
(The KBMS project is concerned with the application of artificial
intelligence techniques to the design and use of database systems. Among
other components, it contains a natural language front end for a
CODASYL database in the merchant shipping domain.)
The KBMS parser is implemented using the LIFER package, a semantic
grammar based system designed at SRI [Hendrix, 1977]. Semantic
grammars have the property that the metasymbols correspond to objects
and actions in the domain, rather than to abstract grammatical concepts.
For example, the KBMS parser has classes called SHIPS and PORTS.
The KBMS parser starts with a moderate-size in-core lexicon (400
words); however, none of the larger database categories (SHIPS, PORTS,
SHIPCLASSES, CARGOES) are stored in the in-core lexicon.
Following is a transcript from a run of the KBMS parser. The input to
the parser is in italics; annotations are in braces.
is izmir in italy? {"Italy" is known, from the in-core lexicon, to be a
country. "Izmir" is unknown.}
> UNKNOWN TERM IZMIR
> POSSIBLE CATEGORIES: SHIPS, PORTS, CARGOES
{At the point where the word IZMIR is encountered, any category which
admits a name is possible. These include ships, ports, and cargoes.}
> FINISHING PARSE
> POSSIBLE CATEGORY FOR IZMIR, LEADING TO VALID PARSE: SHIPS, PORTS
{When the parse is complete, the category "cargoes" has been eliminated,
since it did not lead to a valid parse. So, the remaining two categories are
considered.}
> CHECKING SHIPS FILE IN DATABASE
> IZMIR NOT THERE
> ASSUME THAT IZMIR IS A PORT.
{Of the two remaining categories, SHIPS is indexed in the database by
name while PORTS is not and would therefore be very expensive to check.
So, the SHIPS file is examined first. Since IZMIR is not in the database as a
shipname, only PORTS remains. At this point, the parser assumes that
IZMIR is a port since this is the only remaining plausible interpretation.
This assumption will be presented to the user, and will ultimately be
verified in the database query.}
> FINAL QUERY:
> For the PORTS with Portname equal to 'IZMIR',
> is the Portcountry equal to 'IT'?
A simple English generation system (written by Earl Sacerdoti), illustrated
above, has been used to provide the user with a simplified natural
language paraphrase of the query. Thus, invalid assumptions or
interpretations made by the parser are easily detected. In a normal run,
the information about lexical processing would not be printed.
In the example above, the unknown term happened to consist of a single
word. In the general case, of course, it could be several words long (as is
often the case with the names of ships or people).
Items recognized by extended-lexicon methods are added to the in-core
lexicon, for a period of time. The time at which they are dropped from
the in-core lexicon is determined by consideration of the time of last
reference, and comparison of the (known) cost of recognizing the items
again with the cost in space of keeping them in core.
VI. Applications of this Method
The method of delaying a categorization decision until the parse is 
completed has some possible extensions. At the time a check is made of
the database for classification purposes, it is known which query will be
returned if the lookup is successful. For simple queries, therefore, it is
possible not only to verify the classification of the unknown term, but also
to fetch the answer to the query during the check of the database. For
example, with the query "What cargo is the Fox carrying?", the system
could retrieve the answer at the same time that it verified that the "Fox"
is a ship. Thus, the phases of parsing and query-processing can be
combined. This 'pre-fetching' is possible only because the classification
decision has been postponed until the parse is complete.
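
The idea can be sketched with a single lookup that serves both purposes. The example below uses an in-memory SQLite table purely for illustration; the KBMS database is CODASYL, and the table, column names, and data here are hypothetical.

```python
import sqlite3


def verify_and_fetch(conn, name):
    """One keyed lookup that both classifies `name` and answers the query.

    Returns (is_ship, cargo). If no ship has this name, the classification
    check fails and there is no answer to return.
    """
    row = conn.execute(
        "SELECT cargo FROM ships WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return False, None
    return True, row[0]


# Hypothetical data standing in for the ships file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ships (name TEXT PRIMARY KEY, cargo TEXT)")
conn.execute("INSERT INTO ships VALUES ('FOX', 'GRAIN')")

is_ship, cargo = verify_and_fetch(conn, "FOX")
```

Because the parse is already complete when the lookup is issued, the record retrieved to confirm that "FOX" is a ship already contains the requested field, so no second access is needed.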
The technique of collecting all parses before attempting verification can
also provide the user with information. Since all possible categories for
the unknown term have been considered, the user will have a better idea,
in the event that the parse eventually fails, whether an additional grammar
rule is needed, an item is missing from the database, or a lexicon entry
has been omitted.
VII. Limitations of this Method
In its simplest form, this method is restricted to operating with semantic
grammars. Specifically, the files in the database must correspond to
categories in the grammar. With a syntactic grammar, the method is still
applicable, but more complicated; semantic compatibility checks are
necessary at various points. Moreover, the set of acceptable sentences is
not as tightly constrained as with a semantic grammar, so there is less
information to be gained from the grammar itself.
This method (and all extended-lexicon methods) prevents use of an
INTERLISP-type spelling corrector. Such a spelling corrector relies on
having a complete in-core lexicon against which to compare words; the
thrust of the extended-lexicon methods is the absence of such a lexicon.
If the unknown term already has a meaning to the system, which leads to 
a valid parse, the extended-lexicon methods won't even be invoked. For 
example, in the KBMS system, the question "Where is the City of 
Istanbul?" is interpreted as referring to the city, rather than the ship 
named 'City of Istanbul'. This difficulty is mitigated somewhat by the fact 
that semantic grammar restricts the number of possible interpretations, so 
that the number of genuinely ambiguous cases like this is comparatively
small. For instance, the query "What is the speed of the City of Istanbul?"
would be parsed correctly as referring to a ship, since 'City of Istanbul'
cannot meaningfully refer to the city in this case.
VIII. Conclusion
The technique discussed here could be implemented in practically any 
application that uses a semantic grammar--it does not require any
particular parsing strategy or system. In the KBMS testbed, the work was
done without any access to the internal mechanisms of LIFER. The only
requirement was the ability to call user-supplied functions at appropriate
times during the parse, such as would be provided by any comparable 
parsing system. 
This method was developed with the assumption that the costs of 
extended-lexicon operations such as database access, asking the user, etc.,
are significantly greater than the costs of parsing. Thus these operations
were avoided where possible. Different cost models might result in 
different, more complex, strategies. Note also that the cost model, by 
using information in the database catalogue and database schema, can 
automatically reflect many aspects of the database implementation, thus 
providing a certain degree of domain-independence. Changes such as 
implementation of a new index will be picked up by the cost model, and
thus be transparent to the design of the rest of the parser. 
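
A catalogue-driven cost estimate might look like the sketch below. The catalogue layout, the record counts, and the cost units (estimated record accesses) are assumptions made for illustration and do not describe the KBMS catalogue.

```python
# Hypothetical catalogue: for each file, whether its name field is indexed
# and roughly how many records it holds.
CATALOG = {
    "SHIPS": {"name_indexed": True, "records": 30000},
    "PORTS": {"name_indexed": False, "records": 2000},
}


def lookup_cost(category, catalog):
    """Estimated record accesses needed to test a name against one file."""
    info = catalog[category]
    # An indexed field costs about one access; otherwise the file is scanned.
    return 1 if info["name_indexed"] else info["records"]


def order_by_cost(candidates, catalog):
    """Order candidate categories from cheapest to most expensive to check."""
    return sorted(candidates, key=lambda c: lookup_cost(c, catalog))


ordering = order_by_cost(["PORTS", "SHIPS"], CATALOG)
```

Since the estimate is computed from the catalogue each time it is needed, adding an index on the port name would lower the cost of checking PORTS without any change to the parser.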
For natural language systems to provide practical access for database 
users, they must be capable of handling realistic databases. Such databases 
are often quite large, and may be subject to frequent update. Both of
these characteristics render impractical the encoding and maintenance of a
fixed, in-core lexicon. Existing systems have incorporated a variety of
strategies for coping with these problems. This note has described a 
technique for reducing the number of lexical ambiguities for unknown 
terms by deferring lexical decisions as long as possible, and using a simple 
cost model to select an appropriate method for resolving remaining 
ambiguities. 
IX. Acknowledgments
This work was performed under ARPA contract #N00039-80-G-0132. 
The views and conclusions contained in this document are those of the
authors and should not be interpreted as representative of the official
policies, either expressed or implied, of DARPA or the U.S. Government.
The authors would like to thank Daniel Sagalowicz, Norman Haas, Gary
Hendrix and Earl Sacerdoti of SRI International for their invaluable
assistance and for making their programs available to us. We would also
like to thank Sheldon Finkelstein, Doug Appelt, and Jonathan King for
proofreading the final draft.
X. References
[1] Codd, E. F., et al., Rendezvous Version 1: An Experimental English-
Language Query Formulation System for Casual Users of Relational Data
Bases, IBM Research Report RJ2144(29407), IBM Research Laboratory,
San Jose, CA, 1978.
[2] Harris, L., Natural Language Data Base Query: Using the database
itself as the definition of world knowledge and as an extension of the
dictionary, Technical Report 77-2, Mathematics Dept., Dartmouth
College, Hanover, NH, 1977.
[3] Hendrix, G. G., The LIFER Manual: A Guide to Building Practical
Natural Language Interfaces, Technical Note 138, Artificial Intelligence
Center, SRI International, 1977.
[4] Kaplan, S. J., Cooperative Responses from a Portable Natural Language
Data Base Query System, Ph.D. dissertation, U. of Pennsylvania, available
as HPP-79-19, Computer Science Department, Stanford University,
Stanford, CA, 1979.
[5] Kaplan, S. J., E. Mays, and A. K. Joshi, A Technique for Managing
the Lexicon in a Natural Language Interface to a Changing Data Base,
Proc. Sixth International Joint Conference on Artificial Intelligence, Tokyo,
1979, pp 463-465.
[6] Sacerdoti, E. D., Language Access to Distributed Data with Error
Recovery, Proc. Fifth International Joint Conference on Artificial
Intelligence, Cambridge, MA, 1977, pp 196-202.
[7] Waltz, D. L., An English Language Question Answering System for a
Large Relational Database, Communications of the ACM, 21, 7, July
1978.
[8] Wiederhold, Gio, Management of Semantic Information for Databases,
Third USA-Japan Computer Conference Proceedings, San Francisco, 1978,
pp 192-197.
