Automated Construction of Database Interfaces: Integrating 
Statistical and Relational Learning for Semantic Parsing 
Lappoon R. Tang and Raymond J. Mooney 
Department of Computer Sciences 
University of Texas at Austin 
Austin, TX 78712-1188 
{rupert,mooney}@cs.utexas.edu
Abstract 
The development of natural language interfaces (NLI's) for databases has been a challenging problem in natural language processing (NLP) since the 1970's. The need for NLI's has become more pronounced due to the widespread access to complex databases now available through the Internet. A challenging problem for empirical NLP is the automated acquisition of NLI's from training examples. We present a method for integrating statistical and relational learning techniques for this task which exploits the strength of both approaches. Experimental results from three different domains suggest that such an approach is more robust than a previous purely logic-based approach.
1 Introduction 
We use the term semantic parsing to refer to the process of mapping a natural language sentence to a structured meaning representation. One interesting application of semantic parsing is building natural language interfaces for online databases. The need for such applications is growing since, when information is delivered through the Internet, most users do not know the underlying database access language. An example of such an interface that we have developed is shown in Figure 1.
Traditional (rationalist) approaches to con- 
structing database interfaces require an ex- 
pert to hand-craft an appropriate semantic 
parser (Woods, 1970; Hendrix et al., 1978). 
However, such hand-crafted parsers are time consuming to develop and suffer from problems with robustness and incompleteness even for domain specific applications. Nevertheless, very little research in empirical NLP has explored the task of automatically acquiring such interfaces from annotated training examples. The only exceptions of which we are aware are a statistical approach to mapping airline-information queries into SQL presented in (Miller et al., 1996), a probabilistic decision-tree method for the same task described in (Kuhn and De Mori, 1995), and an approach using relational learning (a.k.a. inductive logic programming, ILP) to learn a logic-based semantic parser described in (Zelle and Mooney, 1996).
The existing empirical systems for this task employ either a purely logical or purely statistical approach. The former uses a deterministic parser, which can suffer from some of the same robustness problems as rationalist methods. The latter constructs a probabilistic grammar, which requires supplying a syntactic parse tree as well as a semantic representation for each training sentence, and requires hand-crafting a small set of contextual features on which to condition the parameters of the model. Combining relational and statistical approaches can overcome the need to supply parse trees and hand-crafted features while retaining the robustness of statistical parsing. The current work is based on the CHILL logic-based parser-acquisition framework (Zelle and Mooney, 1996), retaining access to the complete parse state for making decisions, but building a probabilistic relational model that allows for statistical parsing.
2 Overview of the Approach 
This section reviews our overall approach using an interface developed for a U.S. Geography database (Geoquery) as a sample application (Zelle and Mooney, 1996) which is available on the Web (see http://www.cs.utexas.edu/users/ml/geo.html).
2.1 Semantic Representation 
First-order logic is used as a semantic representation language. CHILL has also been applied to a restaurant database in which the logical form resembles SQL, and is translated automatically into SQL (see Figure 1).
[Figure 1 screenshot: the user's posted query, the returned restaurant results, and the SQL generated.]
Figure 1: Screenshots of a Learned Web-based NL Database Interface 
We explain the features of the Geoquery representation language through a sample query:

Input: "What is the largest city in Texas?"
Query: answer(C,largest(C,(city(C),loc(C,S),const(S,stateid(texas))))).
Objects are represented as logical terms and 
are typed with a semantic category using 
logical functions applied to possibly ambigu- 
ous English constants (e.g. stateid(Mississippi), 
riverid(Mississippi)). Relationships between objects are expressed using predicates; for instance, loc(X,Y) states that X is located in Y.
We also need to handle quantifiers such as 'largest'. We represent these using meta-predicates for which at least one argument is a conjunction of literals. For example, largest(X, Goal) states that the object X satisfies Goal and is the largest object that does so, using the appropriate measure of size for objects of its type (e.g. area for states, population for cities). Finally, an unspecified object required as an argument to a predicate can appear elsewhere in the sentence, requiring the use of the predicate const(X,C) to bind the variable X to the constant C. Some other database queries (or training examples) for the U.S. Geography domain are shown below:
What is the capital of Texas? 
answer(C,(capital(C,S),const(S,stateid(texas)))).
What state has the most rivers running through it? 
answer(S,most(S,R,(state(S),river(R),traverse(R,S)))).
2.2 Parsing Actions 
Our semantic parser employs a shift-reduce 
architecture that maintains a stack of pre- 
viously built semantic constituents and a 
buffer of remaining words in the input. The 
parsing actions are automatically generated 
from templates given the training data. The 
templates are INTRODUCE, COREF_VARS, DROP_CONJ, LIFT_CONJ, and SHIFT. INTRODUCE pushes a predicate onto the stack based on a word appearing in the input and information about its possible meanings in the lexicon. COREF_VARS binds two arguments of two different predicates on the stack. DROP_CONJ (or LIFT_CONJ) takes a predicate on the stack and puts it into one of the arguments of a meta-predicate on the stack. SHIFT simply pushes a word from the input buffer onto the stack. The parsing actions are tried in exactly this order. The parser also requires a lexicon to map phrases in the input into specific predicates; this lexicon can also be learned automatically from the training data (Thompson and Mooney, 1999).
Let's go through a simple trace of parsing the request "What is the capital of Texas?" A lexicon that maps 'capital' to 'capital(_,_)' and 'Texas' to 'const(_,stateid(texas))' suffices
here. Interrogatives like "what" may be 
mapped to predicates in the lexicon if neces- 
sary. The parser begins with an initial stack 
and a buffer holding the input sentence. Each 
predicate on the parse stack has an attached 
buffer to hold the context in which it was 
introduced; words from the input sentence 
are shifted onto this buffer during parsing. 
The initial parse state is shown below: 
Parse Stack: [answer(_,_):[]]
Input Buffer: [what,is,the,capital,of,texas,?]
Since the first three words in the input 
buffer do not map to any predicates, three 
SHIFT actions are performed. The next is an 
INTRODUCE as 'capital' is at the head of 
the input buffer: 
Parse Stack: [capital(_,_):[], answer(_,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]
The next action is a COREF_VARS that 
binds the first argument of capital(_,_) with 
the first argument of answer(_,_). 
Parse Stack: [capital(C,_):[], answer(C,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]
The next sequence of steps are two SHIFT's, an INTRODUCE, and then a COREF_VARS:

Parse Stack: [const(S,stateid(texas)):[],
              capital(C,S):[of,capital],
              answer(C,_):[the,is,what]]
Input Buffer: [texas,?]
The last four steps are two DROP_CONJ's 
followed by two SHIFT's: 
Parse Stack: [answer(C,(capital(C,S),
              const(S,stateid(texas)))):
              [?,texas,of,capital,the,is,what]]
Input Buffer: []
This is the final state and the logical query is 
extracted from the stack. 
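For concreteness, the sketch below replays this trace in Python. It is our own illustration, not the authors' Prolog implementation: predicates are plain strings, so the unification performed by COREF_VARS is only noted in comments; the hard-coded LEXICON covers just this one sentence; and the context buffers of dropped predicates are discarded here, whereas the real parser merges them into the buffer of the meta-predicate.

    # Hypothetical replay of the shift-reduce trace above.
    LEXICON = {"capital": "capital(C,S)",
               "texas": "const(S,stateid(texas))"}

    def parse_capital_of_texas():
        stack = [["answer(C,Q)", []]]  # entries: [predicate, context buffer]
        buf = "what is the capital of texas ?".split()
        goal = []                      # literals dropped into answer's goal

        def shift():                   # next word joins the top context buffer
            stack[0][1].insert(0, buf.pop(0))

        def introduce(word):           # push a predicate from the lexicon
            stack.insert(0, [LEXICON[word], []])

        def drop_conj():               # fold the top predicate into the goal
            goal.insert(0, stack.pop(0)[0])

        shift(); shift(); shift()      # SHIFT: what, is, the
        introduce("capital")           # INTRODUCE capital(C,S)
        # COREF_VARS: unify C of capital(C,S) with C of answer(C,Q)
        shift(); shift()               # SHIFT: capital, of
        introduce("texas")             # INTRODUCE const(S,stateid(texas))
        # COREF_VARS: unify S of const(S,...) with S of capital(C,S)
        drop_conj(); drop_conj()       # DROP_CONJ twice
        shift(); shift()               # SHIFT: texas, ?
        return "answer(C,(%s))" % ",".join(goal)

    print(parse_capital_of_texas())
    # -> answer(C,(capital(C,S),const(S,stateid(texas))))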
2.3 Learning Control Rules 
The initially constructed parser has no constraints on when to apply actions, and is therefore overly general and generates numerous spurious parses. Positive and negative examples are collected for each action by parsing each training example and recording the parse states encountered. Parse states to which an action should be applied (i.e. the action leads to building the correct semantic representation) are labeled positive examples for that action. Otherwise, a parse state is labeled a negative example for an action if it is a positive example for another action below the current one in the ordered list of actions. Control conditions which decide the correct action for a given parse state are learned for each action from these positive and negative examples.
The initial CHILL system used ILP (Lavrac 
and Dzeroski, 1994) to learn Prolog control 
rules and employed deterministic parsing, us- 
ing the learned rules to decide the appropriate 
parse action for each state. The current ap- 
proach learns a model for estimating the prob- 
ability that each action should be applied to 
a given state, and employs statistical parsing 
(Manning and Schütze, 1999) to try to find
the overall most probable parse, using beam 
search to control the complexity. The advan- 
tage of ILP is that it can perform induction 
over the logical description of the complete 
parse state without the need to pre-engineer a 
fixed set of features (which vary greatly from 
one domain to another) that are relevant to 
making decisions. We maintain this advan- 
tage by using ILP to learn a committee of 
hypotheses, and basing probability estimates 
on a weighted vote of them (Ali and Pazzani, 
1996). We believe that using such a proba- 
bilistic relational model (Getoor and Jensen, 
2000) combines the advantages of frameworks 
based on first-order logic and those based on 
standard statistical techniques. 
3 The TABULATE ILP Method 
This section discusses the ILP method used to 
build a committee of logical control hypothe- 
ses for each action. 
3.1 The Basic TABULATE Algorithm 
Most ILP methods use a set-covering method 
to learn one clause (rule) at a time and con- 
struct clauses using either a strictly top-down 
(general to specific) or bottom-up (specific to 
general) search through the space of possi- 
ble rules (Lavrac and Dzeroski, 1994). TABULATE,¹ on the other hand, employs both
bottom-up and top-down methods to con- 
struct potential clauses and searches through 
the hypothesis space of complete logic pro- 
grams (i.e. sets of clauses called theories). It 
uses beam search to find a set of alternative 
hypotheses guided by a theory evaluation met- 
ric discussed below.

¹ TABULATE stands for Top-down And Bottom-Up cLAuse construction with Theory Evaluation.
Procedure TABULATE
Input:
    t(X_1,...,X_n): the target concept to learn
    E+: the positive examples
    E-: the negative examples
Output:
    Q: a queue of learned theories

Theory_0 := {E ← | E ∈ E+}      /* the initial theory */
T(N_0) := Theory_0              /* theory of node N_0 */
C(N_0) := empty                 /* the clause being built */
Q := [N_0]                      /* the search queue */
Repeat
    CQ := ∅
    For each search node N_i ∈ Q Do
        If C(N_i) = empty or C(N_i) = fail Then
            Pairs := sampling of S pairs of clauses from T(N_i)
            Find the LGG G in Pairs with the greatest cover in E+
            R_i := Refine_Clause(t(X_1,...,X_n) ←) ∪ Refine_Clause(G ←)
        Else
            R_i := Refine_Clause(C(N_i))
        End If
        If R_i = ∅ Then
            CQ_i := {(T(N_i), fail)}
        Else
            CQ_i := {(Complete(T(N_i), G_j, E+), next_j) |
                     for each G_j ∈ R_i, next_j = empty if G_j
                     satisfies the noise criterion; otherwise, G_j}
        End If
        CQ := CQ ∪ CQ_i
    End For
    Q := the B best nodes from Q ∪ CQ ranked by metric M
Until termination-criteria-satisfied
Return Q
End Procedure
Figure 2: The TABULATE algorithm 
The search starts with the most specific hypothesis (the set of positive examples, each represented as a separate clause). Each iteration of the loop attempts to refine each of the hypotheses in the current search queue. There are two cases in each iteration: 1) an existing clause in a theory is refined, or 2) a new clause is begun. Clauses are learned using both top-down specialization using a method similar to FOIL (Quinlan, 1990) and bottom-up generalization using Least General Generalizations (LGG's). Advantages of combining both ILP approaches were explored in CHILLIN (Zelle and Mooney, 1994), an ILP method which motivated the design of TABULATE. An outline of the TABULATE algorithm is given in Figure 2.
A noise-handling criterion is used to de- 
cide when an individual clause in a hypoth- 
esis is sufficiently accurate to be permanently 
retained. There are three possible outcomes in a refinement: 1) the current clause satisfies the noise-handling criterion and is simply returned (next_j is set to empty), 2) the current clause does not satisfy the noise-handling criterion and all possible refinements are returned (next_j is set to the refined clause), and 3) the current clause does not satisfy the noise-handling criterion but there are no further refinements (next_j is set to fail). If the refinement is a new clause, clauses in the current theory subsumed by it are removed. Otherwise, it is a specialization of an existing clause. Positive examples that are not covered by the resulting theory, due to specializing the clause, are added back into the theory as individual clauses. Hence, the theory is always maintained complete (i.e. covering all positive examples). These final steps are performed in the Complete procedure.
The termination criterion checks for two 
conditions. The first is satisfied if the next 
search queue does not improve the sum of 
the metric score over all hypotheses in the 
queue. Second, there is no clause currently 
being built for each theory in the search queue 
and the last finished clause of each theory sat- 
isfies the noise-handling criterion. Finally, a 
committee of hypotheses found by the algo- 
rithm is returned. 
3.2 Compression and Accuracy 
The goal of the search is to find accurate 
and yet simple hypotheses. We measure accu- 
racy using the m-estimate (Cestnik, 1990), a 
smoothed measure of accuracy on the training 
data which in the case of a two-class problem 
is defined as: 
accuracy(H) = \frac{s + m \cdot p_+}{n + m}    (1)

where s is the number of positive examples covered by the hypothesis H, n is the total number of examples covered, p_+ is the prior probability of the positive class, and m is a smoothing parameter.
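As a quick illustration (ours, with made-up numbers), equation (1) can be computed directly:

    def m_estimate(s, n, p_plus, m):
        # equation (1): smoothed training accuracy of a hypothesis;
        # s = positives covered, n = total examples covered,
        # p_plus = prior of the positive class, m = smoothing weight
        return (s + m * p_plus) / (n + m)

    # a hypothesis covering 18 positives among 20 examples, with a
    # class prior of 0.5 and m = 2: (18 + 1) / 22 = 0.8636...
    print(m_estimate(18, 20, 0.5, 2))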
We measure theory complexity using a metric similar to that introduced in (Muggleton and Buntine, 1988). The size of a Clause having a Head and a Body is defined as follows (ts = "term size" and ar = "arity"):

size(Clause) = 1 + ts(Head) + ts(Body)    (2)

ts(T) = \begin{cases} 1 & \text{if } T \text{ is a variable} \\ 2 & \text{if } T \text{ is a constant} \\ 2 + \sum_{i=1}^{ar(T)} ts(arg_i(T)) & \text{otherwise} \end{cases}    (3)
The size of a clause is roughly the number of variables, constants, or predicate symbols it contains. The size of a theory is the sum of the sizes of its clauses. The metric M(H) used as the search heuristic is defined as:

M(H) = \frac{accuracy(H) + C}{\log_2 size(H)}    (4)

where C is a constant used to control the relative weight of accuracy vs. complexity. We assume that the most general hypothesis is as good as the most specific hypothesis; thus, C is determined to be:

C = \frac{E_b S_t - E_t S_b}{S_b - S_t}    (5)

where E_t, E_b are the accuracy estimates of the most general and most specific hypotheses respectively, and S_t, S_b are their sizes.
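The following sketch computes these quantities. It is our reading of equations (2)-(5), and it assumes, as the derivation of C requires, that S_t and S_b are measured on the same scale as the denominator of M (the log2 of the size); the term encoding (strings for variables and constants, with Prolog-style capitalized variables, and (functor, args) tuples for compound terms) is our own choice.

    import math

    def term_size(t):
        # equation (3): 1 for a variable, 2 for a constant,
        # 2 + the sizes of the arguments for a compound term
        if isinstance(t, str):
            return 1 if t[0].isupper() else 2
        _, args = t
        return 2 + sum(term_size(a) for a in args)

    def clause_size(head, body_literals):
        # equation (2): 1 + ts(Head) + ts(Body)
        return 1 + term_size(head) + sum(term_size(l) for l in body_literals)

    def metric(acc, size, C):
        # equation (4): M(H) = (accuracy(H) + C) / log2(size(H))
        return (acc + C) / math.log2(size)

    def weight_C(E_t, E_b, S_t, S_b):
        # equation (5): chosen so the most general and the most
        # specific hypotheses receive the same score under M
        return (E_b * S_t - E_t * S_b) / (S_b - S_t)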
3.3 Noise Handling 
A clause needs no further refinement when it meets the following criterion (as in RIPPER (Cohen, 1995)):

\frac{p - n}{p + n} > \mu    (6)

where p is the number of positive examples covered by the clause, n is the number of negative examples covered, and -1 ≤ μ ≤ 1 is a parameter. The value of μ is decreased whenever the sum of the metric over the hypotheses in the queue does not improve although some of them still have unfinished or failed clauses.
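A direct transcription of this stopping test (our illustration):

    def clause_finished(p, n, mu):
        # equation (6): stop refining once the clause's coverage is
        # sufficiently pure, as in RIPPER
        return (p - n) / (p + n) > mu

    # a clause covering 45 positives and 5 negatives, with mu = 0.7:
    # (45 - 5) / 50 = 0.8 > 0.7, so the clause is retained as is
    print(clause_finished(45, 5, 0.7))   # True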
4 Statistical Semantic Parsing 
4.1 The Parsing Model 
A parser is a relation Parser ⊆ Sentences × Queries, where Sentences and Queries are the sets of natural language sentences and database queries respectively. Given a sentence l ∈ Sentences, the set Q(l) = {q ∈ Queries | (l,q) ∈ Parser} is the set of queries that are correct interpretations of l.
A parse state consists of a stack of lexicalized predicates and a list of words from the input sentence. S is the set of states reachable by the parser. Suppose our learned parser has n different parsing actions; the ith action a_i is a function a_i(s): IS_i → OS_i, where IS_i ⊆ S is the set of states to which the action is applicable and OS_i ⊆ S is the set of states constructed by the action. The function a_0(l): Sentences → IniS maps each sentence l to a corresponding unique initial parse state in IniS ⊆ S. A state is called a final state if there exists no parsing action applicable to it. The partial function a_{n+1}(s): FS → Queries is defined as the action that retrieves the query from the final state s ∈ FS ⊆ S if one exists. Some final states may not "contain" a query (e.g. when the parse stack contains predicates with unbound variables) and therefore it is a partial function. When the parser reaches such a final state, it reports a failure.
A path is a finite sequence of parsing actions. Given a sentence l, a good state s is one such that there exists a path from it to a query q ∈ Q(l). Otherwise, it is a bad state. The set of parse states can be uniquely divided into the set of good states and the set of bad states given l and Parser. S^+ and S^- are the sets of good and bad states respectively.
Given a sentence l, the goal is to construct the query \hat{q} such that

\hat{q} = \arg\max_q P(q \in Q(l) \mid l \Rightarrow q)    (7)

where l ⇒ q means a path exists from l to q. Now, we need to estimate P(q ∈ Q(l) | l ⇒ q). First, we notice that:

P(q \in Q(l) \mid l \Rightarrow q) = P(s \in FS^+ \mid l \Rightarrow s \text{ and } a_{n+1}(s) = q)    (8)

where FS^+ = FS ∩ S^+. For notational convenience we drop the conditions and denote the above probabilities as P(q ∈ Q(l)) and P(s ∈ FS^+) respectively, assuming these conditions in the following discussion. The equation states that the probability that a given query is a correct meaning for l is the same as the probability that the final state (reached by parsing l) is a good state. We need to determine in general the probability of having a good resulting parse state. Given any parse state s_j at the jth step of parsing and an action a_i such that s_{j+1} = a_i(s_j), we have:

P(s_{j+1} \in OS_i^+) = P(s_{j+1} \in OS_i^+ \mid s_j \in IS_i^+) P(s_j \in IS_i^+) + P(s_{j+1} \in OS_i^+ \mid s_j \notin IS_i^+) P(s_j \notin IS_i^+)    (9)

where IS_i^+ = IS_i ∩ S^+ and OS_i^+ = OS_i ∩ S^+. Since no parsing action can produce a good parse state from a bad one, the second term is zero. Now, we are ready to derive P(q ∈ Q(l)). Suppose q = a_{n+1}(s_m); we have:

P(q \in Q(l)) = P(s_m \in FS^+)
             = P(s_m \in FS^+ \mid s_{m-1} \in IS^+_{a_{m-1}}) \cdots P(s_j \in OS^+_{a_{j-1}} \mid s_{j-1} \in IS^+_{a_{j-1}}) \cdots P(s_2 \in OS^+_{a_1} \mid s_1 \in IS^+_{a_1}) P(s_1 \in IS^+_{a_1})    (10)

where a_k denotes the index of the action applied at the kth step. We assume that γ = P(s_1 ∈ IS^+_{a_1}) ≠ 0 (which may not be true in general). Now, we have

P(q \in Q(l)) = \gamma \prod_{j=1}^{m-1} P(s_{j+1} \in OS^+_{a_j} \mid s_j \in IS^+_{a_j}).    (11)

Next we describe how we estimate the probability of the goodness of each action in a given state, P(a_i(s) ∈ OS_i^+ | s ∈ IS_i^+). We need not estimate γ since its value does not affect the outcome of equation (7).
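Equation (11) says a parse path is scored (up to γ) by the product of its per-step goodness probabilities. In the sketches below we accumulate this product as a sum of logs, a standard trick to avoid underflow on long parses; the paper itself does not specify this implementation detail.

    import math

    def path_log_prob(step_probs):
        # equation (11), up to the constant gamma: sum of log
        # per-step probabilities instead of a raw product
        return sum(math.log(p) for p in step_probs)

    # three steps judged good with probabilities 0.9, 0.8 and 0.95
    # rank the same as the raw product 0.9 * 0.8 * 0.95 = 0.684
    print(math.exp(path_log_prob([0.9, 0.8, 0.95])))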
4.2 Estimating Probabilities for 
Parsing Actions 
The committee of hypotheses learned by TABULATE is used to estimate the probability that a particular action is a good one to apply to a given parse state. Some hypotheses are more "important" than others in the sense that they carry more weight in the decision. A weighting parameter is also included to lower the probability estimate of actions that appear further down the decision list. For actions a_i where 1 ≤ i ≤ n-1:

P(a_i(s) \in OS_i^+ \mid s \in IS_i^+) = \beta^{pos(i)-1} \sum_{h_k \in H_i} \lambda_k P(a_i(s) \in OS_i^+ \mid h_k)    (12)

where s is a given parse state, pos(i) is the position of the action a_i in the list of actions applicable to state s, λ_k and 0 < β ≤ 1 are weighting parameters,² H_i is the set of hypotheses learned for the action a_i, and Σ_k λ_k = 1.
To estimate the probability for the last action a_n, we devise a simple test that checks if the maximum of the set A(s) of probability estimates for the subset of the actions {a_1,...,a_{n-1}} applicable to s is less than or equal to a threshold α. If A(s) is empty, we assume the maximum is zero. More precisely,

P(a_n(s) \in OS_n^+ \mid s \in IS_n^+) = \begin{cases} \frac{c(a_n(s) \in OS_n^+)}{c(s \in IS_n^+)} & \text{if } \max(A(s)) \le \alpha \\ 0 & \text{otherwise} \end{cases}    (13)

where α is the threshold,³ c(a_n(s) ∈ OS_n^+) is the count of the number of good states produced by the last action, and c(s ∈ IS_n^+) is the count of the number of good states to which the last action is applicable.

² β is set to 0.95 for all the experiments performed.
³ The threshold is set to 0.5 for all the experiments performed.
Now, let's discuss how P(a_i(s) ∈ OS_i^+ | h_k) and λ_k are estimated. If h_k ⊨ s (i.e. h_k covers s), we have

P(a_i(s) \in OS_i^+ \mid h_k) = \frac{p_c + \theta \cdot n_c}{p_c + n_c}    (14)

where p_c and n_c are the number of positive and negative examples covered by h_k respectively. Otherwise, if h_k ⊭ s (i.e. h_k does not cover s), we have

P(a_i(s) \in OS_i^+ \mid h_k) = \frac{p_u + \theta \cdot n_u}{p_u + n_u}    (15)

where p_u and n_u are the number of positive and negative examples rejected by h_k respectively. θ is the probability that a negative example is mislabelled and its value can be estimated given μ (in equation (6)) and the total number of positive and negative examples.
One could use a variety of linear combination methods to estimate the weights λ_k (e.g. Bayesian combination (Buntine, 1990)). However, we have taken a simple approach and weighted hypotheses based on their relative simplicity:

\lambda_k = \frac{size(h_k)^{-1}}{\sum_{j=1}^{|H_i|} size(h_j)^{-1}}    (16)
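Putting equations (12) and (14)-(16) together, the committee estimate can be sketched as below. This is our illustration; the function names and flat argument lists are ours, and the count-based estimate for the last action (equation (13)) is omitted for brevity.

    def committee_weights(hyp_sizes):
        # equation (16): each hypothesis weighted by its inverse size
        inv = [1.0 / s for s in hyp_sizes]
        total = sum(inv)
        return [x / total for x in inv]

    def prob_under_hypothesis(covers, p_c, n_c, p_u, n_u, theta):
        # equation (14) when h_k covers the state, (15) when it does
        # not; theta is the estimated mislabelling rate for negatives
        if covers:
            return (p_c + theta * n_c) / (p_c + n_c)
        return (p_u + theta * n_u) / (p_u + n_u)

    def action_goodness(pos_i, beta, weights, hyp_probs):
        # equation (12): a weighted vote of the hypotheses, discounted
        # by beta for actions further down the ordered action list
        vote = sum(w * p for w, p in zip(weights, hyp_probs))
        return beta ** (pos_i - 1) * vote

For instance, two hypotheses of sizes 5 and 10 receive weights 2/3 and 1/3, so the simpler hypothesis dominates the vote.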
4.3 Searching for a Parse 
To find the most probably correct parse, the parser employs a beam search. At each step, the parser finds all of the parsing actions applicable to each parse state on the queue and calculates the probability of goodness of each of them using equations (12) and (13). It then computes the probability that the resulting state of each possible action is a good state using equation (11), sorts the queue of possible next states accordingly, and keeps the best B options. The parser stops when a complete parse is found on the top of the parse queue or a failure is reported.
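The search loop itself might look as follows. This is a schematic Python rendering under our own interface assumptions: init_state, applicable, apply_action, goodness (equations (12)/(13)) and extract_query are hypothetical stand-ins for the parser's components, and path scores are kept in log space as noted earlier.

    import heapq
    import math

    def beam_parse(init_state, applicable, apply_action, goodness,
                   extract_query, beam_width):
        # queue of (log path probability, state), best first
        queue = [(0.0, init_state)]
        while queue:
            _, best = queue[0]
            query = extract_query(best)   # None unless a complete parse
            if query is not None:         # complete parse on top: done
                return query
            successors = []
            for logp, state in queue:
                for act in applicable(state):
                    p = goodness(act, state)
                    if p > 0.0:           # prune actions judged bad
                        successors.append((logp + math.log(p),
                                           apply_action(act, state)))
            # keep the B most probable next states (equation (11))
            queue = heapq.nlargest(beam_width, successors,
                                   key=lambda x: x[0])
        return None                       # failure: no complete parse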
5 Experimental Results 
5.1 The Domains 
Three different domains are used to demonstrate the performance of the new approach. The first is the U.S. Geography domain. The database contains about 800 facts about U.S. states like population, area, capital city, neighboring states, major rivers, major cities, and so on. A hand-crafted parser, GEOBASE, was previously constructed for this domain as a demo product for Turbo Prolog. The second application is the restaurant query system illustrated in Figure 1. The database contains information about thousands of restaurants in Northern California, including the name of the restaurant, its location, its specialty, and a guide-book rating. The third domain consists of a set of 300 computer-related jobs automatically extracted from postings to the USENET newsgroup austin.jobs. The database contains the following information: the job title, the company, the recruiter, the location, the salary, the languages and platforms used, and required or desired years of experience and degrees.
5.2 Experimental Design 
The geography corpus contains 560 questions. Approximately 100 of these were collected from a log of questions submitted to the web site and the rest were collected in studies involving students in undergraduate classes at our university. We also included results for the subset of 250 sentences originally used in the experiments reported in (Zelle and Mooney, 1996). The remaining questions were specifically collected to be more complex than the original 250, and generally require one or more meta-predicates. The restaurant corpus contains 250 questions automatically generated from a hand-built grammar constructed to reflect typical queries in this domain. The job corpus contains 400 questions automatically generated in a similar fashion. The beam width for TABULATE was set to five for all the domains. The deterministic parser used only the best hypothesis found. The experiments were conducted using 10-fold cross validation. For each domain, the average recall (a.k.a. accuracy) and precision of the parser on disjoint test data are reported, where:

Recall = (# of correct queries produced) / (# of test sentences)

Precision = (# of correct queries produced) / (# of complete parses produced)

A complete parse is one which contains an executable query (which could be incorrect). A query is considered correct if it produces the same answer set as the gold-standard query supplied with the corpus.
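In code, the evaluation amounts to the following (our sketch; the per-sentence outcome labels are hypothetical):

    def recall_precision(outcomes):
        # outcomes: one label per test sentence, either "correct",
        # "incorrect" (complete parse, wrong answer set), or "failed"
        correct = outcomes.count("correct")
        complete = correct + outcomes.count("incorrect")
        recall = correct / len(outcomes)
        precision = correct / complete if complete else 0.0
        return recall, precision

    # 8 correct, 1 incorrect, 1 failed parse out of 10 test sentences:
    # recall = 0.80, precision = 8/9 = 0.888...
    print(recall_precision(["correct"] * 8 + ["incorrect", "failed"]))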
5.3 Results 
The results are presented in Table 1 and Figure 3. By switching from deterministic to probabilistic parsing, the system increased the number of correct queries it produced. Recall increases almost monotonically with parsing beam width in most of the domains. Improvement is most apparent in the Jobs domain, where probabilistic parsing significantly outperformed the deterministic system (80% vs. 68%). However, using a beam width of one (and thus the probabilistic parser picks only the best action) results in worse performance than using the original purely logic-based deterministic parser. This suggests that the probability estimates could be improved, since overall they are not indicating the single best action as well as a non-probabilistic approach. Precision of the system decreased with beam width, but not significantly except for the larger Geography corpus. Since the system conducts a more extensive search for a complete parse, it risks increasing the number of incorrect as well as correct parses. The importance of recall vs. precision depends on the relative cost of providing an incorrect answer versus no answer at all. Individual applications may require emphasizing one or the other.
All of the experiments were run on a 167MHz UltraSparc workstation under Sicstus Prolog. Although results on the parsing time of the different systems are not formally reported here, it was noted that the difference between using a beam width of three and the original system was less than two seconds in all domains, but increased to around twenty seconds when using a beam width of twelve. However, the current Prolog implementation is not highly optimized.
Parsers \ Corpora    Geo250         Geo560         Jobs400        Rest250
                     R      P       R      P       R      P       R      P
Prob-Parser(12)      80.40  88.16   71.61  78.94   80.50  86.56   99.20  99.60
Prob-Parser(8)       79.60  86.90   71.07  79.76   78.75  86.54   99.20  99.60
Prob-Parser(5)       78.40  87.11   70.00  79.51   74.25  86.59   99.20  99.60
Prob-Parser(3)       77.60  87.39   69.11  79.30   70.50  87.31   99.20  99.60
Prob-Parser(1)       67.60  90.37   62.86  82.05   34.25  85.63   99.20  99.60
TABULATE             75.60  92.65   69.29  89.81   68.50  87.54   99.20  99.60
Original CHILL       68.50  97.65   --     --      --     --      --     --
Hand-Built Parser    56.00  96.40   --     --      --     --      --     --
Table 1: Results For All Domains: R = % Recall and P = % Precision. Prob-Parser(B) is the probabilistic parser using a beam width of B. TABULATE is CHILL using the TABULATE induction algorithm with deterministic parsing.
[Figure 3 here: two plots of recall (left) and precision (right) against beam size, with one curve per corpus (Geo250, Geo560, Jobs400, Rest250).]
Figure 3: The recall and precision of the parser using various beam widths in the different 
domains 
While there was an overall improvement in recall using the new approach, its performance varied significantly from domain to domain. As a result, the recall did not always improve dramatically by using a larger beam width. Domain factors possibly affecting the performance are the quality of the lexicon, the relative amount of data available for calculating probability estimates, and the problem of "parser incompleteness" with respect to the test data (i.e. there is not a path from a sentence to a correct query, which happens when γ = 0). The performance of all systems was basically equivalent in the restaurant domain, where they were near perfect in both recall and precision. This is because this corpus is relatively easy given the restricted range of possible questions due to the limited information available about each restaurant. The systems achieved > 90% in recall and precision given only roughly 30% of the training data in this domain. Finally, GEOBASE performed the worst on the original geography queries, since it is difficult to hand-craft a parser that handles a sufficient variety of questions.
6 Conclusion 
A probabilistic framework for semantic shift-reduce parsing was presented. A new ILP learning system was also introduced which learns multiple hypotheses. These two techniques were integrated to learn semantic parsers for building NLI's to online databases. Experimental results were presented that demonstrate that such an approach outperforms a purely logical approach in terms of the accuracy of the learned parser.
7 Acknowledgements
This research was supported by a grant from the Daimler-Chrysler Research and Technology Center and by the National Science Foundation under grant IRI-9704943.

References 
K. Ali and M. Pazzani. 1996. Error reduction through learning multiple descriptions. Machine Learning, 24(3):100-132.
W. Buntine. 1990. A theory of learning classification rules. Ph.D. thesis, University of Technology, Sydney, Australia.
B. Cestnik. 1990. Estimating probabilities: A crucial task in machine learning. In Proceedings of the Ninth European Conference on Artificial Intelligence, pages 147-149, Stockholm, Sweden.
W. W. Cohen. 1995. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123.
L. Getoor and D. Jensen, editors. 2000. Papers from the AAAI Workshop on Learning Statistical Models from Relational Data, Austin, TX. AAAI Press.
G. G. Hendrix, E. Sacerdoti, D. Sagalowicz, and J. Slocum. 1978. Developing a natural language interface to complex data. ACM Transactions on Database Systems, 3(2):105-147.
R. Kuhn and R. De Mori. 1995. The application of semantic classification trees to natural language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5):449-460.
N. Lavrac and S. Dzeroski. 1994. Inductive Logic Programming: Techniques and Applications. Ellis Horwood.
C. D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Scott Miller, David Stallard, Robert Bobrow, and Richard Schwartz. 1996. A fully statistical approach to natural language interfaces. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 55-61, Santa Cruz, CA.
S. Muggleton and W. Buntine. 1988. Machine invention of first-order predicates by inverting resolution. In Proceedings of the Fifth International Conference on Machine Learning, pages 339-352, Ann Arbor, MI, June.
J. R. Quinlan. 1990. Learning logical definitions from relations. Machine Learning, 5(3):239-266.
C. A. Thompson and R. J. Mooney. 1999. Automatic construction of semantic lexicons for learning natural language interfaces. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 487-493, Orlando, FL, July.
W. A. Woods. 1970. Transition network grammars for natural language analysis. Communications of the Association for Computing Machinery, 13:591-606.
J. M. Zelle and R. J. Mooney. 1994. Combining top-down and bottom-up methods in inductive logic programming. In Proceedings of the Eleventh International Conference on Machine Learning, pages 343-351, New Brunswick, NJ, July.
J. M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1050-1055, Portland, OR, August.
