Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 439–446,
New York, June 2006. c©2006 Association for Computational Linguistics
Learning for Semantic Parsing with Statistical Machine Translation
Yuk Wah Wong and Raymond J. Mooney
Department of Computer Sciences
The University of Texas at Austin
1 University Station C0500
Austin, TX 78712-0233, USA
{ywwong,mooney}@cs.utexas.edu
Abstract
We present a novel statistical approach to
semantic parsing, WASP, for construct-
ing a complete, formal meaning represen-
tation of a sentence. A semantic parser
is learned given a set of sentences anno-
tated with their correct meaning represen-
tations. The main innovation of WASP
is its use of state-of-the-art statistical ma-
chine translation techniques. A word
alignment model is used for lexical acqui-
sition, and the parsing model itself can be
seen as a syntax-based translation model.
We show that WASP performs favorably
in terms of both accuracy and coverage
compared to existing learning methods re-
quiringsimilaramountofsupervision,and
shows better robustness to variations in
task complexity and word order.
1 Introduction
Recent work on natural language understanding has
mainly focused on shallow semantic analysis, such
as semantic role labeling and word-sense disam-
biguation. This paper considers a more ambi-
tious task of semantic parsing, which is the con-
struction of a complete, formal, symbolic, mean-
ing representation (MR) of a sentence. Seman-
tic parsing has found its way in practical applica-
tions such as natural-language (NL) interfaces to
databases (Androutsopoulos et al., 1995) and ad-
vice taking (Kuhlmann et al., 2004). Figure 1 shows
a sample MR written in a meaning-representation
language (MRL) called CLANG, which is used for
((bowner our {4})
(do our {6} (pos (left (half our)))))
If our player 4 has the ball, then our player 6 should
stay in the left side of our half.
Figure 1: A meaning representation in CLANG
encoding coach advice given to simulated soccer-
playing agents (Kuhlmann et al., 2004).
Prior research in semantic parsing has mainly fo-
cused on relatively simple domains such as ATIS
(AirTravelInformationService)(Milleretal., 1996;
Papineni et al., 1997; Macherey et al., 2001), in
which a typcial MR is only a single semantic frame.
Learning methods have been devised that can gen-
erate MRs with a complex, nested structure (cf.
Figure 1). However, these methods are mostly
based on deterministic parsing (Zelle and Mooney,
1996; Kate et al., 2005), which lack the robustness
that characterizes recent advances in statistical NLP.
Other learning methods involve the use of fully-
annotated augmented parse trees (Ge and Mooney,
2005) or prior knowledge of the NL syntax (Zettle-
moyer and Collins, 2005) in training, and hence re-
quire extensive human efforts when porting to a new
domain or language.
In this paper, we present a novel statistical ap-
proach to semantic parsing which can handle MRs
with a nested structure, based on previous work on
semantic parsing using transformation rules (Kate et
al., 2005). The algorithm learns a semantic parser
given a set of NL sentences annotated with their
correct MRs. It requires no prior knowledge of
the NL syntax, although it assumes that an unam-
biguous, context-free grammar (CFG) of the target
MRL is available. The main innovation of this al-
439
answer(count(city(loc 2(countryid(usa)))))
How many cities are there in the US?
Figure 2: A meaning representation in GEOQUERY
gorithm is its integration with state-of-the-art statis-
tical machine translation techniques. More specif-
ically, a statistical word alignment model (Brown
et al., 1993) is used to acquire a bilingual lexi-
con consisting of NL substrings coupled with their
translations in the target MRL. Complete MRs are
then formed by combining these NL substrings and
their translations under a parsing framework called
the synchronous CFG (Aho and Ullman, 1972),
which forms the basis of most existing statisti-
cal syntax-based translation models (Yamada and
Knight, 2001; Chiang, 2005). Our algorithm is
called WASP, short for Word Alignment-based Se-
mantic Parsing. In initial evaluation on several
real-world data sets, we show that WASP performs
favorably in terms of both accuracy and coverage
compared to existing learning methods requiring the
same amount of supervision, and shows better ro-
bustness to variations in task complexity and word
order.
Section 2 provides a brief overview of the do-
mains being considered. In Section 3, we present
the semantic parsing model of WASP. Section 4 out-
lines the algorithm for acquiring a bilingual lexicon
through the use of word alignments. Section 5 de-
scribes a probabilistic model for semantic parsing.
Finally, we report on experiments that show the ro-
bustness of WASP in Section 6, followed by the con-
clusion in Section 7.
2 Application Domains
Inthispaper, weconsidertwodomains. Thefirstdo-
main is ROBOCUP. ROBOCUP (www.robocup.org)
is an AI research initiative using robotic soccer as its
primary domain. In the ROBOCUP Coach Competi-
tion, teams of agents compete on a simulated soccer
field and receive coach advice written in a formal
language called CLANG (Chen et al., 2003). Fig-
ure 1 shows a sample MR in CLANG.
The second domain is GEOQUERY, where a func-
tional, variable-free query language is used for
querying a small database on U.S. geography (Zelle
and Mooney, 1996; Kate et al., 2005). Figure 2
shows a sample query in this language. Note that
both domains involve the use of MRs with a com-
plex, nested structure.
3 The Semantic Parsing Model
To describe the semantic parsing model of WASP,
it is best to start with an example. Consider the
task of translating the sentence in Figure 1 into its
MR in CLANG. To achieve this task, we may first
analyze the syntactic structure of the sentence us-
ing a semantic grammar (Allen, 1995), whose non-
terminals are the ones in the CLANG grammar. The
meaning of the sentence is then obtained by com-
bining the meanings of its sub-parts according to
the semantic parse. Figure 3(a) shows a possible
partial semantic parse of the sample sentence based
on CLANG non-terminals (UNUM stands for uni-
form number). Figure 3(b) shows the corresponding
CLANG parse from which the MR is constructed.
This process can be formalized as an instance of
synchronous parsing (Aho and Ullman, 1972), orig-
inally developed as a theory of compilers in which
syntax analysis and code generation are combined
into a single phase. Synchronous parsing has seen a
surge of interest recently in the machine translation
community as a way of formalizing syntax-based
translation models (Melamed, 2004; Chiang, 2005).
According to this theory, a semantic parser defines a
translation, a set of pairs of strings in which each
pair is an NL sentence coupled with its MR. To
finitely specify a potentially infinite translation, we
usea synchronous context-free grammar (SCFG)for
generating the pairs in a translation. Analogous to
an ordinary CFG, each SCFG rule consists of a sin-
gle non-terminal on the left-hand side (LHS). The
right-hand side (RHS) of an SCFG rule is a pair of
strings, 〈α,β〉, where the non-terminals in β are a
permutation of the non-terminals in α. Below are
some SCFG rules that can be used for generating the
parse trees in Figure 3:
RULE →〈if CONDITION 1 , DIRECTIVE 2 . ,
(CONDITION 1 DIRECTIVE 2 )〉
CONDITION →〈TEAM 1 player UNUM 2 has the ball ,
(bowner TEAM 1 {UNUM 2 })〉
TEAM →〈our , our〉
UNUM →〈4 , 4〉
440
RULE
If CONDITION
TEAM
our
player UNUM
4
has the ball
...
(a) English
RULE
( CONDITION
(bownerTEAM
our
{ UNUM
4
})
...)
(b) CLANG
Figure 3: Partial parse trees for the CLANG statement and its English gloss shown in Figure 1
Each SCFG rule X → 〈α,β〉 is a combination of a
production of the NL semantic grammar, X → α,
and a production of the MRL grammar, X → β.
Each rule corresponds to a transformation rule in
Kate et al. (2005). Following their terminology,
we call the string α a pattern, and the string β a
template. Non-terminals are indexed to show their
association between a pattern and a template. All
derivations start with a pair of associated start sym-
bols, 〈S1 ,S1〉. Each step of a derivation involves
the rewriting of a pair of associated non-terminals
in both of the NL and MRL streams. Below is a
derivation that would generate the sample sentence
and its MR simultaneously: (Note that RULE is the
start symbol for CLANG)
〈RULE 1 , RULE 1〉
⇒〈if CONDITION 1 , DIRECTIVE 2 . ,
(CONDITION 1 DIRECTIVE 2 )〉
⇒〈if TEAM 1 player UNUM 2 has the ball, DIR 3 . ,
((bowner TEAM 1 {UNUM 2 }) DIR 3 )〉
⇒〈if our player UNUM 1 has the ball, DIR 2 . ,
((bowner our {UNUM 1 }) DIR 2 )〉
⇒〈if our player 4 has the ball, DIRECTIVE 1 . ,
((bowner our {4}) DIRECTIVE 1 )〉
⇒ ...
⇒〈if our player 4 has the ball, then our player 6
should stay in the left side of our half. ,
((bowner our {4})
(do our {6} (pos (left (half our)))))〉
Here the MR string is said to be a translation of the
NL string. Given an input sentence, e, the task of
semantic parsing is to find a derivation that yields
〈e,f〉, so that f is a translation of e. Since there may
be multiple derivations that yield e (and thus mul-
tiple possible translations of e), a mechanism must
be devised for discriminating the correct derivation
from the incorrect ones.
The semantic parsing model of WASP thus con-
sists of an SCFG, G, and a probabilistic model, pa-
rameterized by λ, that takes a possible derivation, d,
and returns its likelihood of being correct given an
input sentence, e. The output translation, f⋆, for a
sentence, e, is defined as:
f⋆ = m
parenleftBigg
argmax
d∈D(G|e)
Prλ(d|e)
parenrightBigg
(1)
where m(d) is the MR string that a derivation d
yields, and D(G|e) is the set of all possible deriva-
tions of G that yield e. In other words, the output
MR is the yield of the most probable derivation that
yields e in the NL stream.
The learning task is to induce a set of SCFG rules,
which we call a lexicon, and a probabilistic model
for derivations. A lexicon defines the set of deriva-
tions that are possible, so the induction of a proba-
bilistic model first requires a lexicon. Therefore, the
learning task can be separated into two sub-tasks:
(1) the induction of a lexicon, followed by (2) the
induction of a probabilistic model. Both sub-tasks
require a training set, {〈ei,fi〉}, where each training
example 〈ei,fi〉 is an NL sentence, ei, paired with
its correct MR, fi. Lexical induction also requires
an unambiguous CFG of the MRL. Since there is no
lexicon to begin with, it is not possible to include
correct derivations in the training data. This is un-
like most recent work on syntactic parsing based on
gold-standard treebanks. Therefore, the induction of
a probabilistic model for derivations is done in an
unsupervised manner.
4 Lexical Acquisition
In this section, we focus on lexical learning, which
is done by finding optimal word alignments between
441
RULE → (CONDITION DIRECTIVE)
TEAM → our
UNUM → 4
If
our
player
4
has
the
ball
CONDITION → (bowner TEAM {UNUM})
Figure 4: Partial word alignment for the CLANG statement and its English gloss shown in Figure 1
NL sentences and their MRs in the training set. By
defining a mapping of words from one language to
another, word alignments define a bilingual lexicon.
Using word alignments to induce a lexicon is not a
new idea (Och and Ney, 2003). Indeed, attempts
have been made to directly apply machine transla-
tion systems to the problem of semantic parsing (Pa-
pinenietal., 1997; Machereyetal., 2001). However,
these systems make no use of the MRL grammar,
thus allocating probability mass to MR translations
that are not even syntactically well-formed. Here we
present a lexical induction algorithm that guarantees
syntactic well-formedness of MR translations by us-
ing the MRL grammar.
The basic idea is to train a statistical word align-
ment model on the training set, and then form a
lexicon by extracting transformation rules from the
K = 10 most probable word alignments between
the training sentences and their MRs. While NL
words could be directly aligned with MR tokens,
this is a bad approach for two reasons. First, not all
MR tokens carry specific meanings. For example, in
CLANG, parentheses and braces are delimiters that
are semantically vacuous. Such tokens are not sup-
posed to be aligned with any words, and inclusion of
these tokens in the training data is likely to confuse
the word alignment model. Second, MR tokens may
exhibit polysemy. For instance, the CLANG pred-
icate pt has three meanings based on the types of
arguments it is given: it specifies the xy-coordinates
(e.g. (pt 0 0)), the current position of the ball (i.e.
(pt ball)), or the current position of a player (e.g.
(pt our 4)). Judging from the pt token alone, the
word alignment model would not be able to identify
its exact meaning.
A simple, principled way to avoid these diffi-
culties is to represent an MR using a sequence of
productions used to generate it. Specifically, the
sequence corresponds to the top-down, left-most
derivation of an MR. Figure 4 shows a partial word
alignment between the sample sentence and the lin-
earized parse of its MR. Here the second produc-
tion, CONDITION → (bowner TEAM {UNUM}), is
the one that rewrites the CONDITION non-terminal
in the first production, RULE → (CONDITION DI-
RECTIVE), and so on. Note that the structure of a
parse tree is preserved through linearization, and for
each MR there is a unique linearized parse, since the
MRL grammar is unambiguous. Such alignments
can be obtained through the use of any off-the-shelf
word alignment model. In this work, we use the
GIZA++ implementation (Och and Ney, 2003) of
IBM Model 5 (Brown et al., 1993).
Assuming that each NL word is linked to at most
one MRL production, transformation rules are ex-
tracted in a bottom-up manner. The process starts
with productions whose RHS is all terminals, e.g.
TEAM → our and UNUM → 4. For each of these
productions, X → β, a rule X → 〈α,β〉 is ex-
tracted such that α consists of the words to which
the production is linked, e.g. TEAM → 〈our, our〉,
UNUM → 〈4, 4〉. Then we consider productions
whose RHS contains non-terminals, i.e. predicates
with arguments. In this case, an extracted pattern
consists of the words to which the production is
linked, as well as non-terminals showing where the
arguments are realized. For example, for the bowner
predicate, the extracted rule would be CONDITION
→ 〈TEAM 1 player UNUM 2 has (1) ball, (bowner
TEAM 1 {UNUM 2 })〉, where (1) denotes a word
gap of size 1, due to the unaligned word the that
comes between has and ball. A word gap, (g), can
be seen as a non-terminal that expands to at most
g words in the NL stream, which allows for some
flexibility in pattern matching. Rule extraction thus
proceeds backward from the end of a linearized MR
442
our
left
penalty
area
REGION → (left REGION)
REGION → (penalty-area TEAM)
TEAM → our
Figure 5: A word alignment from which no rules can be extracted for the penalty-area predicate
parse (so that a predicate is processed only after its
arguments have all been processed), until rules are
extracted for all productions.
There are two cases where the above algorithm
would not extract any rules for a production r. First
is when no descendants of r in the MR parse are
linked to any words. Second is when there is a
link from a word w, covered by the pattern for r,
to a production r′ outside the sub-parse rooted at
r. Rule extraction is forbidden in this case be-
cause it would destroy the link between w and r′.
The first case arises when a component of an MR
is not realized, e.g. assumed in context. The sec-
ond case arises when a predicate and its arguments
are not realized close enough. Figure 5 shows an
example of this, where no rules can be extracted
for the penalty-area predicate. Both cases can be
solved by merging nodes in the MR parse tree, com-
bining several productions into one. For example,
since no rules can be extracted for penalty-area,
it is combined with its parent to form REGION →
(left (penalty-area TEAM)), for which the pat-
tern TEAM left penalty area is extracted.
The above algorithm is effective only when words
linked to an MR predicate and its arguments stay
close to each other, a property that we call phrasal
coherence. Any links that destroy this property
wouldleadtoexcessivenodemerging,amajorcause
of overfitting. Since building a model that strictly
observes phrasal coherence often requires rules that
model the reordering of tree nodes, our goal is to
bootstrap the learning process by using a simpler,
word-based alignment model that produces a gen-
erally coherent alignment, and then remove links
thatwouldcauseexcessivenodemergingbeforerule
extraction takes place. Given an alignment, a, we
count the number of links that would prevent a rule
from being extracted for each production in the MR
parse. Then the total sum for all productions is ob-
tained, denoted by v(a). A greedy procedure is em-
ployed that repeatedly removes a link a ∈ a that
would maximize v(a) − v(a\{a}) > 0, until v(a)
cannot be further reduced. A link w ↔ r is never
removed if the translation probability, Pr(r|w), is
greater than a certain threshold (0.9). To replenish
the removed links, links from the most probable re-
verse alignment, ˜a (obtained by treating the source
languageastarget, andviceversa), areaddedtoa, as
long as a remains n-to-1, and v(a) is not increased.
5 Parameter Estimation
Once a lexicon is acquired, the next task is to learn a
probabilistic model for the semantic parser. We pro-
pose a maximum-entropy model that defines a con-
ditional probability distribution over derivations (d)
given the observed NL string (e):
Prλ(d|e) = 1Z
λ(e)
exp
summationdisplay
i
λifi(d) (2)
where fi is a feature function, and Zλ(e) is a nor-
malizing factor. For each rule r in the lexicon there
is a feature function that returns the number of times
r is used in a derivation. Also for each word w there
is a feature function that returns the number of times
w is generated from word gaps. Generation of un-
seen words is modeled using an extra feature whose
value is the total number of words generated from
word gaps. The number of features is quite modest
(less than 3,000 in our experiments). A similar fea-
ture set is used by Zettlemoyer and Collins (2005).
Decoding of the model can be done in cubic time
with respect to sentence length using the Viterbi al-
gorithm. An Earley chart is used for keeping track
of all derivations that are consistent with the in-
put (Stolcke, 1995). The maximum conditional like-
lihood criterion is used for estimating the model pa-
rameters, λi. A Gaussian prior (σ2 = 1) is used for
regularizing the model (Chen and Rosenfeld, 1999).
Since gold-standard derivations are not available in
the training data, correct derivations must be treated
as hidden variables. Here we use a version of im-
443
proved iterative scaling (IIS) coupled with EM (Rie-
zler et al., 2000) for finding an optimal set of param-
eters.1 Unlike the fully-supervised case, the condi-
tional likelihood is not concave with respect to λ,
so the estimation algorithm is sensitive to initial pa-
rameters. To assume as little as possible, λ is initial-
ized to 0. The estimation algorithm requires statis-
tics that depend on all possible derivations for a sen-
tence or a sentence-MR pair. While it is not fea-
sible to enumerate all derivations, a variant of the
Inside-Outside algorithm can be used for efficiently
collecting the required statistics (Miyao and Tsujii,
2002). Following Zettlemoyer and Collins (2005),
only rules that are used in the best parses for the
training set are retained in the final lexicon. All
other rules are discarded. This heuristic, commonly
known as Viterbi approximation, is used to improve
accuracy, assuming that rules used in the best parses
are the most accurate.
6 Experiments
We evaluated WASP in the ROBOCUP and GEO-
QUERY domains (see Section 2). To build a cor-
pus for ROBOCUP, 300 pieces of coach advice were
randomly selected from the log files of the 2003
ROBOCUP Coach Competition, which were manu-
ally translated into English (Kuhlmann et al., 2004).
The average sentence length is 22.52. To build a
corpus for GEOQUERY, 880 English questions were
gathered from various sources, which were manu-
ally translated into the functional GEOQUERY lan-
guage (Tang and Mooney, 2001). The average sen-
tence length is 7.48, much shorter than ROBOCUP.
250 of the queries were also translated into Spanish,
Japanese and Turkish, resulting in a smaller, multi-
lingual data set.
For each domain, there was a minimal set of ini-
tial rules representing knowledge needed for trans-
lating basic domain entities. These rules were al-
ways included in a lexicon. For example, in GEO-
QUERY, the initial rules were: NUM → 〈x,x〉, for
all x ∈ CA; CITY → 〈c, cityid(’c’, )〉, for all
city names c (e.g. new york); and similar rules for
other types of names (e.g. rivers). Name transla-
tionswereprovidedforthemultilingualdataset(e.g.
1We also implemented limited-memory BFGS (Nocedal,
1980). Preliminaryexperimentsshowedthatittypicallyreduces
training time by more than half with similar accuracy.
CITY → 〈nyuu yooku, cityid(’new york’, )〉 for
Japanese).
Standard 10-fold cross validation was used in our
experiments. A semantic parser was learned from
the training set. Then the learned parser was used
to translate the test sentences into MRs. Translation
failed when there were constructs that the parser did
not cover. We counted the number of sentences that
weretranslatedintoanMR,andthenumberoftrans-
lations that were correct. For ROBOCUP, a trans-
lation was correct if it exactly matched the correct
MR. For GEOQUERY, a translation was correct if it
retrieved the same answer as the correct query. Us-
ing these counts, we measured the performance of
the parser in terms of precision (percentage of trans-
lations that were correct) and recall (percentage of
test sentences that were correctly translated). For
ROBOCUP, it took 47 minutes to learn a parser us-
ing IIS. For GEOQUERY, it took 83 minutes.
Figure 6 shows the performance of WASP com-
pared to four other algorithms: SILT (Kate et al.,
2005), COCKTAIL (Tang and Mooney, 2001), SCIS-
SOR (Ge and Mooney, 2005) and Zettlemoyer and
Collins (2005). Experimental results clearly show
the advantage of extra supervision in SCISSOR and
Zettlemoyer and Collins’s parser (see Section 1).
However, WASP performs quite favorably compared
to SILT and COCKTAIL, which use the same train-
ing data. In particular, COCKTAIL, a determinis-
tic shift-reduce parser based on inductive logic pro-
gramming, fails to scale up to the ROBOCUP do-
main where sentences are much longer, and crashes
on larger training sets due to memory overflow.
WASP also outperforms SILT in terms of recall,
where lexical learning is done by a local bottom-up
search, which is much less effective than the word-
alignment-based algorithm in WASP.
Figure 7 shows the performance of WASP on
the multilingual GEOQUERY data set. The lan-
guages being considered differ in terms of word or-
der: Subject-Verb-Object for English and Spanish,
and Subject-Object-Verb for Japanese and Turkish.
WASP’s performance is consistent across these lan-
guages despite some slight differences, most proba-
bly due to factors other than word order (e.g. lower
recall for Turkish due to a much larger vocabulary).
Details can be found in a longer version of this pa-
per (Wong, 2005).
444
 0
 20
 40
 60
 80
 100
 0  50  100  150  200  250  300
Precision (%)
Number of training examples
WASP
SILT
COCKTAIL
SCISSOR
(a) Precision for ROBOCUP
 0
 20
 40
 60
 80
 100
 0  50  100  150  200  250  300
Recall (%)
Number of training examples
WASP
SILT
COCKTAIL
SCISSOR
(b) Recall for ROBOCUP
 0
 20
 40
 60
 80
 100
 0  100  200  300  400  500  600  700  800
Precision (%)
Number of training examples
WASP
SILT
COCKTAIL
SCISSOR
Zettlemoyer et al. (2005)
(c) Precision for GEOQUERY
 0
 20
 40
 60
 80
 100
 0  100  200  300  400  500  600  700  800
Recall (%)
Number of training examples
WASP
SILT
COCKTAIL
SCISSOR
Zettlemoyer et al. (2005)
(d) Recall for GEOQUERY
Figure 6: Precision and recall learning curves comparing various semantic parsers
 0
 20
 40
 60
 80
 100
 0  50  100  150  200  250
Precision (%)
Number of training examples
English
Spanish
Japanese
Turkish
(a) Precision for GEOQUERY
 0
 20
 40
 60
 80
 100
 0  50  100  150  200  250
Recall (%)
Number of training examples
English
Spanish
Japanese
Turkish
(b) Recall for GEOQUERY
Figure 7: Precision and recall learning curves comparing various natural languages
7 Conclusion
We have presented a novel statistical approach to
semantic parsing in which a word-based alignment
model is used for lexical learning, and the parsing
model itself can be seen as a syntax-based trans-
lation model. Our method is like many phrase-
based translation models, which require a simpler,
word-based alignment model for the acquisition of a
phrasal lexicon (Och and Ney, 2003). It is also sim-
ilar to the hierarchical phrase-based model of Chi-
ang (2005), in which hierarchical phrase pairs, es-
sentially SCFG rules, are learned through the use of
a simpler, phrase-based alignment model. Our work
shows that ideas from compiler theory (SCFG) and
machinetranslation(wordalignment models)can be
successfully applied to semantic parsing, a closely-
related task whose goal is to translate a natural lan-
guage into a formal language.
Lexicallearningrequireswordalignmentsthatare
phrasally coherent. We presented a simple greedy
algorithmforremovinglinksthatdestroyphrasalco-
herence. Althoughitisshowntobequiteeffectivein
the current domains, it is preferable to have a more
principled way of promoting phrasal coherence. The
problem is that, by treating MRL productions as
atomic units, current word-based alignment models
have no knowledge about the tree structure hidden
in a linearized MR parse. In the future, we would
like to develop a word-based alignment model that
445
is aware of the MRL syntax, so that better lexicons
can be learned.
Acknowledgments
This research was supported by Defense Advanced
Research Projects Agency under grant HR0011-04-
1-0007.
References
A. V. Aho and J. D. Ullman. 1972. The Theory of Pars-
ing, Translation, and Compiling. PrenticeHall, Engle-
wood Cliffs, NJ.
J. F. Allen. 1995. Natural Language Understanding (2nd
Ed.). Benjamin/Cummings, Menlo Park, CA.
I. Androutsopoulos, G. D. Ritchie, and P. Thanisch.
1995. Natural language interfaces to databases: An
introduction. Journal of Natural Language Engineer-
ing, 1(1):29–81.
P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and
R. L. Mercer. 1993. The mathematics of statistical
machine translation: Parameter estimation. Computa-
tional Linguistics, 19(2):263–312, June.
S. Chen and R. Rosenfeld. 1999. A Gaussian prior for
smoothing maximum entropy models. Technical re-
port, Carnegie Mellon University, Pittsburgh, PA.
M. Chen et al. 2003. Users manual: RoboCup soc-
cer server manual for soccer server version 7.07 and
later. Available at http://sourceforge.net/
projects/sserver/.
D. Chiang. 2005. A hierarchical phrase-based model
for statistical machine translation. In Proc. of ACL-05,
pages 263–270, Ann Arbor, MI, June.
R. Ge and R. J. Mooney. 2005. A statistical semantic
parser that integrates syntax and semantics. In Proc.
of CoNLL-05, pages 9–16, Ann Arbor, MI, July.
R. J. Kate, Y. W. Wong, and R. J. Mooney. 2005. Learn-
ing to transform natural to formal languages. In Proc.
of AAAI-05, pages 1062–1068, Pittsburgh, PA, July.
G. Kuhlmann, P. Stone, R. J. Mooney, and J. W. Shavlik.
2004. Guiding a reinforcement learner with natural
language advice: Initial results in RoboCup soccer. In
Proc. of the AAAI-04 Workshop on Supervisory Con-
trol of Learning and Adaptive Systems, San Jose, CA,
July.
K. Macherey, F. J. Och, and H. Ney. 2001. Natural lan-
guage understanding using statistical machine transla-
tion. In Proc. of EuroSpeech-01, pages 2205–2208,
Aalborg, Denmark.
I. D. Melamed. 2004. Statistical machine translation
by parsing. In Proc. of ACL-04, pages 653–660,
Barcelona, Spain.
S.Miller, D.Stallard, R.Bobrow, andR.Schwartz. 1996.
A fully statistical approach to natural language inter-
faces. In Proc. of ACL-96, pages 55–61, Santa Cruz,
CA.
Y. Miyao and J. Tsujii. 2002. Maximum entropy estima-
tionforfeatureforests. InProc. of HLT-02, SanDiego,
CA, March.
J. Nocedal. 1980. Updating quasi-Newton matrices
with limited storage. Mathematics of Computation,
35(151):773–782, July.
F. J. Och and H. Ney. 2003. A systematic comparison of
various statistical alignment models. Computational
Linguistics, 29(1):19–51.
K. A. Papineni, S. Roukos, and R. T. Ward. 1997.
Feature-based language understanding. In Proc. of
EuroSpeech-97, pages 1435–1438, Rhodes, Greece.
S. Riezler, D. Prescher, J. Kuhn, and M. Johnson. 2000.
Lexicalized stochastic modeling of constraint-based
grammars using log-linear measures and EM training.
In Proc. of ACL-00, pages 480–487, Hong Kong.
A. Stolcke. 1995. An efficient probabilistic context-free
parsing algorithm that computes prefix probabilities.
Computational Linguistics, 21(2):165–201.
L. R. Tang and R. J. Mooney. 2001. Using multiple
clauseconstructorsininductivelogicprogrammingfor
semantic parsing. In Proc. of ECML-01, pages 466–
477, Freiburg, Germany.
Y. W. Wong. 2005. Learning for semantic parsing us-
ing statistical machine translation techniques. Techni-
cal Report UT-AI-05-323, Artificial Intelligence Lab,
University of Texas at Austin, Austin, TX, October.
K. Yamada and K. Knight. 2001. A syntax-based sta-
tistical translation model. In Proc. of ACL-01, pages
523–530, Toulouse, France.
J. M. Zelle and R. J. Mooney. 1996. Learning to parse
database queries using inductive logic programming.
In Proc. of AAAI-96, pages 1050–1055, Portland, OR,
August.
L. S. Zettlemoyer and M. Collins. 2005. Learning to
map sentences to logical form: Structured classifica-
tion with probabilistic categorial grammars. In Proc.
of UAI-05, Edinburgh, Scotland, July.
446
