<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1056"> <Title>Learning for Semantic Parsing with Statistical Machine Translation</Title> <Section position="3" start_page="439" end_page="439" type="metho"> <SectionTitle> 2 Application Domains </SectionTitle> <Paragraph position="0"> Inthispaper, weconsidertwodomains. Thefirstdomain is ROBOCUP. ROBOCUP (www.robocup.org) is an AI research initiative using robotic soccer as its primary domain. In the ROBOCUP Coach Competition, teams of agents compete on a simulated soccer field and receive coach advice written in a formal language called CLANG (Chen et al., 2003). Figure 1 shows a sample MR in CLANG.</Paragraph> <Paragraph position="1"> The second domain is GEOQUERY, where a functional, variable-free query language is used for querying a small database on U.S. geography (Zelle and Mooney, 1996; Kate et al., 2005). Figure 2 shows a sample query in this language. Note that both domains involve the use of MRs with a complex, nested structure.</Paragraph> </Section> <Section position="4" start_page="439" end_page="440" type="metho"> <SectionTitle> 3 The Semantic Parsing Model </SectionTitle> <Paragraph position="0"> To describe the semantic parsing model of WASP, it is best to start with an example. Consider the task of translating the sentence in Figure 1 into its MR in CLANG. To achieve this task, we may first analyze the syntactic structure of the sentence using a semantic grammar (Allen, 1995), whose non-terminals are the ones in the CLANG grammar. The meaning of the sentence is then obtained by combining the meanings of its sub-parts according to the semantic parse. Figure 3(a) shows a possible partial semantic parse of the sample sentence based on CLANG non-terminals (UNUM stands for uniform number). Figure 3(b) shows the corresponding CLANG parse from which the MR is constructed.</Paragraph> <Paragraph position="1"> This process can be formalized as an instance of synchronous parsing (Aho and Ullman, 1972), originally developed as a theory of compilers in which syntax analysis and code generation are combined into a single phase. Synchronous parsing has seen a surge of interest recently in the machine translation community as a way of formalizing syntax-based translation models (Melamed, 2004; Chiang, 2005).</Paragraph> <Paragraph position="2"> According to this theory, a semantic parser defines a translation, a set of pairs of strings in which each pair is an NL sentence coupled with its MR. To finitely specify a potentially infinite translation, we usea synchronous context-free grammar (SCFG)for generating the pairs in a translation. Analogous to an ordinary CFG, each SCFG rule consists of a single non-terminal on the left-hand side (LHS). The right-hand side (RHS) of an SCFG rule is a pair of strings, <a,b> , where the non-terminals in b are a permutation of the non-terminals in a. Below are some SCFG rules that can be used for generating the parse trees in Figure 3: Each SCFG rule X - <a,b> is a combination of a production of the NL semantic grammar, X - a, and a production of the MRL grammar, X - b.</Paragraph> <Paragraph position="3"> Each rule corresponds to a transformation rule in Kate et al. (2005). Following their terminology, we call the string a a pattern, and the string b a template. Non-terminals are indexed to show their association between a pattern and a template. All derivations start with a pair of associated start symbols, <S1 ,S1> . 
<Paragraph position="4"> Each step of a derivation involves the rewriting of a pair of associated non-terminals in both the NL and MRL streams. Below is a derivation that would generate the sample sentence and its MR simultaneously (note that RULE is the start symbol for CLANG):

<RULE 1 , RULE 1>
⇒ <if CONDITION 1 , DIRECTIVE 2 . , (CONDITION 1 DIRECTIVE 2 )>
⇒ <if TEAM 1 player UNUM 2 has the ball , DIR 3 . , ((bowner TEAM 1 {UNUM 2 }) DIR 3 )>
⇒ <if our player UNUM 1 has the ball , DIR 2 . , ((bowner our {UNUM 1 }) DIR 2 )></Paragraph>
<Paragraph position="5"> ⇒ ... ⇒ <if our player 4 has the ball, then our player 6 should stay in the left side of our half. , ((bowner our {4}) (do our {6} (pos (left (half our)))))>
Here the MR string is said to be a translation of the NL string. Given an input sentence, e, the task of semantic parsing is to find a derivation that yields <e, f>, so that f is a translation of e. Since there may be multiple derivations that yield e (and thus multiple possible translations of e), a mechanism must be devised for discriminating the correct derivation from the incorrect ones.</Paragraph>
<Paragraph position="6"> The semantic parsing model of WASP thus consists of an SCFG, G, and a probabilistic model, parameterized by λ, that takes a possible derivation, d, and returns its likelihood of being correct given an input sentence, e. The output translation, f*, for a sentence, e, is defined as:</Paragraph>
<Paragraph position="7"> f* = m( argmax_{d ∈ D(G|e)} Pr_λ(d|e) )</Paragraph>
<Paragraph position="8"> where m(d) is the MR string that a derivation d yields, and D(G|e) is the set of all possible derivations of G that yield e. In other words, the output MR is the yield of the most probable derivation that yields e in the NL stream.</Paragraph>
<Paragraph position="9"> The learning task is to induce a set of SCFG rules, which we call a lexicon, and a probabilistic model for derivations. A lexicon defines the set of derivations that are possible, so the induction of a probabilistic model first requires a lexicon. Therefore, the learning task can be separated into two sub-tasks: (1) the induction of a lexicon, followed by (2) the induction of a probabilistic model. Both sub-tasks require a training set, {<e_i, f_i>}, where each training example <e_i, f_i> is an NL sentence, e_i, paired with its correct MR, f_i. Lexical induction also requires an unambiguous CFG of the MRL. Since there is no lexicon to begin with, it is not possible to include correct derivations in the training data; this is unlike most recent work on syntactic parsing, which relies on gold-standard treebanks. Therefore, the induction of a probabilistic model for derivations is done in an unsupervised manner.</Paragraph> </Section>
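The decoding objective above can be summarized in a short sketch. The Python fragment below is illustrative: the chart parser and the probabilistic model of Section 5 are abstracted as callables passed in by the caller, and none of the names correspond to WASP's actual interface.

    # Sketch of f* = m(argmax_{d in D(G|e)} Pr_lambda(d|e)): return the MR yield
    # of the most probable derivation of the input sentence. The callables are
    # hypothetical stand-ins:
    #   derivations_of(e) -> all derivations of the SCFG that yield e (D(G|e))
    #   prob_of(d, e)     -> Pr_lambda(d|e) under the learned model
    #   mr_yield(d)       -> the MR string m(d) that derivation d yields
    def translate(sentence, derivations_of, prob_of, mr_yield):
        candidates = derivations_of(sentence)
        if not candidates:
            return None                  # the lexicon does not cover this sentence
        best = max(candidates, key=lambda d: prob_of(d, sentence))
        return mr_yield(best)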
<Section position="5" start_page="440" end_page="442" type="metho"> <SectionTitle> 4 Lexical Acquisition </SectionTitle>
<Paragraph position="0"> In this section, we focus on lexical learning, which is done by finding optimal word alignments between NL sentences and their MRs in the training set. By defining a mapping of words from one language to another, word alignments define a bilingual lexicon. Using word alignments to induce a lexicon is not a new idea (Och and Ney, 2003). Indeed, attempts have been made to directly apply machine translation systems to the problem of semantic parsing (Papineni et al., 1997; Macherey et al., 2001). However, these systems make no use of the MRL grammar, thus allocating probability mass to MR translations that are not even syntactically well-formed. Here we present a lexical induction algorithm that guarantees the syntactic well-formedness of MR translations by using the MRL grammar.</Paragraph>
<Paragraph position="1"> The basic idea is to train a statistical word alignment model on the training set, and then form a lexicon by extracting transformation rules from the K = 10 most probable word alignments between the training sentences and their MRs. While NL words could be directly aligned with MR tokens, this is a bad approach for two reasons. First, not all MR tokens carry specific meanings. For example, in CLANG, parentheses and braces are delimiters that are semantically vacuous. Such tokens are not supposed to be aligned with any words, and their inclusion in the training data is likely to confuse the word alignment model. Second, MR tokens may exhibit polysemy. For instance, the CLANG predicate pt has three meanings based on the types of arguments it is given: it specifies the xy-coordinates (e.g. (pt 0 0)), the current position of the ball (i.e. (pt ball)), or the current position of a player (e.g. (pt our 4)).</Paragraph>
<Paragraph position="2"> Judging from the pt token alone, the word alignment model would not be able to identify its exact meaning.</Paragraph>
<Paragraph position="3"> A simple, principled way to avoid these difficulties is to represent an MR using a sequence of productions used to generate it. Specifically, the sequence corresponds to the top-down, left-most derivation of the MR. Figure 4 shows a partial word alignment between the sample sentence and the linearized parse of its MR. Here the second production, CONDITION → (bowner TEAM {UNUM}), is the one that rewrites the CONDITION non-terminal in the first production, RULE → (CONDITION DIRECTIVE), and so on. Note that the structure of a parse tree is preserved through linearization, and for each MR there is a unique linearized parse, since the MRL grammar is unambiguous. Such alignments can be obtained with any off-the-shelf word alignment model. In this work, we use the GIZA++ implementation (Och and Ney, 2003) of IBM Model 5 (Brown et al., 1993).</Paragraph>
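The linearization step just described can be made concrete with a short sketch. The Python fragment below is illustrative only; the Node class is an assumed stand-in for an MRL parse tree, not WASP's own representation.

    # Sketch of linearizing an MR parse into its top-down, left-most production
    # sequence, the representation that is word-aligned against the NL sentence.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        production: str                  # e.g. "CONDITION -> (bowner TEAM {UNUM})"
        children: List["Node"] = field(default_factory=list)

    def linearize(root: Node) -> List[str]:
        # Pre-order traversal; the sequence is unique because the MRL grammar is
        # unambiguous, so the original tree can be recovered from it.
        sequence = [root.production]
        for child in root.children:
            sequence.extend(linearize(child))
        return sequence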
<Paragraph position="4"> Assuming that each NL word is linked to at most one MRL production, transformation rules are extracted in a bottom-up manner.</Paragraph>
<Paragraph position="5"> The process starts with productions whose RHS is all terminals, e.g. TEAM → our and UNUM → 4. For each of these productions, X → b, a rule X → <a, b> is extracted such that a consists of the words to which the production is linked, e.g. TEAM → <our, our>, UNUM → <4, 4>. Then we consider productions whose RHS contains non-terminals, i.e. predicates with arguments. In this case, an extracted pattern consists of the words to which the production is linked, as well as non-terminals showing where the arguments are realized. For example, for the bowner predicate, the extracted rule would be CONDITION → <TEAM 1 player UNUM 2 has (1) ball, (bowner TEAM 1 {UNUM 2 })>, where (1) denotes a word gap of size 1, due to the unaligned word "the" that comes between "has" and "ball". A word gap, (g), can be seen as a non-terminal that expands to at most g words in the NL stream, which allows for some flexibility in pattern matching. Rule extraction thus proceeds backward from the end of a linearized MR parse (so that a predicate is processed only after all of its arguments have been processed), until rules are extracted for all productions.</Paragraph>
<Paragraph position="6"> There are two cases in which the above algorithm would not extract any rules for a production r. The first is when no descendants of r in the MR parse are linked to any words. The second is when there is a link from a word w, covered by the pattern for r, to a production r' outside the sub-parse rooted at r. Rule extraction is forbidden in this case because it would destroy the link between w and r'.</Paragraph>
<Paragraph position="7"> The first case arises when a component of an MR is not realized, e.g. because it is assumed in context. The second case arises when a predicate and its arguments are not realized close enough to each other. Figure 5 shows an example of this, where no rules can be extracted for the penalty-area predicate. Both cases can be solved by merging nodes in the MR parse tree, combining several productions into one. For example, since no rules can be extracted for penalty-area, it is combined with its parent to form REGION → (left (penalty-area TEAM)), for which the pattern TEAM left penalty area is extracted.</Paragraph>
<Paragraph position="8"> The above algorithm is effective only when words linked to an MR predicate and its arguments stay close to each other, a property that we call phrasal coherence. Any links that destroy this property would lead to excessive node merging, a major cause of overfitting. Since building a model that strictly observes phrasal coherence often requires rules that model the reordering of tree nodes, our goal is to bootstrap the learning process by using a simpler, word-based alignment model that produces a generally coherent alignment, and then to remove links that would cause excessive node merging before rule extraction takes place. Given an alignment, a, we count the number of links that would prevent a rule from being extracted for each production in the MR parse. Then the total sum over all productions is obtained, denoted by v(a). A greedy procedure is employed that repeatedly removes a link a ∈ a that would maximize v(a) − v(a\{a}) > 0, until v(a) cannot be further reduced. A link w → r is never removed if the translation probability, Pr(r|w), is greater than a certain threshold (0.9). To replenish the removed links, links from the most probable reverse alignment, ~a (obtained by treating the source language as target, and vice versa), are added to a, as long as a remains n-to-1 and v(a) is not increased.</Paragraph> </Section>
<Section position="6" start_page="442" end_page="443" type="metho"> <SectionTitle> 5 Parameter Estimation </SectionTitle>
<Paragraph position="0"> Once a lexicon is acquired, the next task is to learn a probabilistic model for the semantic parser. We propose a maximum-entropy model that defines a conditional probability distribution over derivations, d, given the observed NL string, e:</Paragraph>
<Paragraph position="1"> Pr_λ(d|e) = (1 / Z_λ(e)) exp( Σ_i λ_i f_i(d) )</Paragraph>
<Paragraph position="2"> where f_i is a feature function and Z_λ(e) is a normalizing factor. For each rule r in the lexicon there is a feature function that returns the number of times r is used in a derivation. Also, for each word w there is a feature function that returns the number of times w is generated from word gaps. Generation of unseen words is modeled using an extra feature whose value is the total number of words generated from word gaps. The number of features is quite modest (less than 3,000 in our experiments). A similar feature set is used by Zettlemoyer and Collins (2005).</Paragraph>
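To make the feature set above concrete, the sketch below scores a derivation by its rule counts and gap-generated words and normalizes over all derivations of the sentence (the set D(G|e), which should include the scored derivation itself). The data structures and names are assumptions for illustration, not WASP's implementation.

    # Illustrative sketch of the maximum-entropy scoring just defined: a
    # derivation's features are its rule counts, the words it generates from
    # word gaps, and the total number of gap-generated words; its conditional
    # probability is its exponentiated weighted score, normalized over D(G|e).
    import math
    from collections import Counter
    from typing import Dict, List, Tuple

    Feature = Tuple[str, ...]

    def features(rules_used: List[str], gap_words: List[str]) -> Counter:
        feats: Counter = Counter()
        for r in rules_used:
            feats[("rule", r)] += 1                # one feature per lexicon rule
        for w in gap_words:
            feats[("gap_word", w)] += 1            # one feature per gap-generated word
        feats[("gap_total",)] = len(gap_words)     # extra feature for unseen-word generation
        return feats

    def conditional_prob(target: Counter, all_derivs: List[Counter],
                         weights: Dict[Feature, float]) -> float:
        # all_derivs holds the feature vectors of every derivation in D(G|e).
        def score(f: Counter) -> float:
            return math.exp(sum(weights.get(k, 0.0) * v for k, v in f.items()))
        z = sum(score(f) for f in all_derivs)      # Z_lambda(e)
        return score(target) / z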
<Paragraph position="3"> Decoding of the model can be done in cubic time with respect to sentence length using the Viterbi algorithm. An Earley chart is used for keeping track of all derivations that are consistent with the input (Stolcke, 1995). The maximum conditional likelihood criterion is used for estimating the model parameters, λ_i. A Gaussian prior (σ² = 1) is used for regularizing the model (Chen and Rosenfeld, 1999).</Paragraph>
<Paragraph position="4"> Since gold-standard derivations are not available in the training data, correct derivations must be treated as hidden variables. Here we use a version of improved iterative scaling (IIS) coupled with EM (Riezler et al., 2000) for finding an optimal set of parameters.1 Unlike the fully-supervised case, the conditional likelihood is not concave with respect to λ, so the estimation algorithm is sensitive to initial parameters. To assume as little as possible, λ is initialized to 0. The estimation algorithm requires statistics that depend on all possible derivations for a sentence or a sentence-MR pair. While it is not feasible to enumerate all derivations, a variant of the Inside-Outside algorithm can be used for efficiently collecting the required statistics (Miyao and Tsujii, 2002). Following Zettlemoyer and Collins (2005), only rules that are used in the best parses for the training set are retained in the final lexicon. All other rules are discarded. This heuristic, commonly known as Viterbi approximation, is used to improve accuracy, assuming that rules used in the best parses are the most accurate.</Paragraph> </Section> </Paper>