<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1107">
  <Title>Semantic Lexicon Acquisition for Learning Natural Language Interfaces</Title>
  <Section position="4" start_page="57" end_page="58" type="metho">
    <SectionTitle>
3 The Semantic Lexicon Acquisition
Problem
</SectionTitle>
    <Paragraph position="0"> We now define the learning problem at hand. Given a set of sentences, each consisting of an ordered list of words and annotated with a single semantic representation, we assume that each representation can be fractured into all of its components (Siskind, 1992). The fracturing method depends upon the given representation and must be explicitly provided or implicit in the algorithm that forms hypotheses for word meanings. Given a valid set of components, they can be constructed into a valid sentence meaning using a relation we will call compose.</Paragraph>
    <Paragraph position="1"> The goal is to find a semantic lexicon that will assist parsing. Such a lexicon consists of (phrase, meaning) pairs, where the phrases and their meanings are extracted from the input sentences and their representations, respectively, such that each sentence's representation can be composed from a set of components each chosen from the possible meanings of a (unique) phrase appearing in the sentence. If such a lexicon is found, we say that the lexicon covers the corpus. We will also talk about the coverage of components of a representation (or  sentence/representation pair) by a lexicon entry. Ideally, we would like to minimize the ambiguity and size of the learned lexicon, since this should ease the parser acquisition task. Note that this notion of semantic lexicon acquisition is distinct from work on learning selectional restrictions (Manning, 1993; Brent, 1991) and learning clusters of semantically similar words (Riloff and Sheperd, 1997).</Paragraph>
    <Paragraph position="2"> Note that we allow phrases to have multiple meanings (homonymy) and for multiple phrases to have the same meaning (synonymy). Also, some phrases in the sentences may have a null meaning. We make only a few fairly straightforward assumptions about the input.</Paragraph>
    <Paragraph position="3"> First is compositionality, i.e. the meaning of a sentence is composed from the meanings of phrases in that sentence.</Paragraph>
    <Paragraph position="4"> Since we allow multi-word phrases in the lexicon (e.g.</Paragraph>
    <Paragraph position="5"> (\[kick the bucket\], die(_))), this assumption seems fairly unproblematic. Second, we assume each component of the representation is due to the meaning of a word or phrase in the sentence, not to an external source such as noise. Third, we assume the meaning for each word in a sentence appears only once in the sentence's representation. The second and third assumptions are preliminary, and we are exploring methods for relaxing them. If any of these assumptions are violated, we do not guarantee coverage of the training corpus; however, the system can still be run and learn a potentially useful lexicon.</Paragraph>
  </Section>
  <Section position="5" start_page="58" end_page="60" type="metho">
    <SectionTitle>
4 The WOLFIE Algorithm and an
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="58" end_page="60" type="sub_section">
      <SectionTitle>
Example
</SectionTitle>
      <Paragraph position="0"> In order to limit search, a greedy algorithm is used to learn phrase meanings. At each step, the best phrase/meaning pair is chosen, according to a heuristic described below, and added to the lexicon. The initial list of candidate meanings for a phrase is formed by finding the common substructure between sampled pairs of representations of sentences in which the phrase appears. 1 In the current implementation, phrases are limited to at most two words. This is for efficiency reasons only, and in the future we hope to incorporate an efficient method for including potentially meaningful phrases of more than two words.</Paragraph>
      <Paragraph position="1"> The WOLFIE algorithm, outlined in Figure 2, has been implemented to handle two kinds of semantic representations. One is a case-role meaning representation based on conceptual dependency (Schank, 1975}. For example, the sentence &amp;quot;The man ate the cheese&amp;quot; is represented by: \[ingest, agent : \[person, sex:male, age : adult\], 1We restrict ourselves to a sampled pairs instead of all pairs because this provides enough information to get good initial candidate meanings. Using all pairs is possible but not generally necessary.</Paragraph>
      <Paragraph position="2"> For each phrase (of at most two words):  1) Sample the examples in which the phrase appears 2) Find largest common subexpressions of pairs of representations from these examples Until the input representations are covered, or there are no remaining candidate pairs do: 1) Add the best phrase/meaning pair to the lexicon.</Paragraph>
      <Paragraph position="3"> 2) Constrain meanings of phrases occurring in the  same sentences as the phrase just learned Return the lexicon of learned phrase/meaning pairs.</Paragraph>
      <Paragraph position="4">  patient : \[food, type: cheese\] \]. Experiments in this domain were presented in Thompson (1995).</Paragraph>
      <Paragraph position="5"> The second representation handled is the logical query representation illustrated earlier, and is the focus of the current paper. To find the common substructure between pairs of query representations, we use a method similar to finding the Least General Generalization of first-order clauses (Plotkin, 1970). However. instead of using subsumption to guide generalization, we find the set of largest common substructures that two representations share. For example, given the two queries from Section 2, the (unique} common substructure is state (_).2 One of the key ideas of the algorithm is that each phrase/meaning choice can constrain the candidate meanings of phrases yet to be learned. This is the second step of the loop in Figure 2. Such constraints exist because of the assumption that each portion of the representation is due to at most one phrase in the sentence. * Therefore, once part of a sentence's representation is covered by the meaning of one of its phrases, no other phrase in the sentence has to be paired with that meaning (for that sentence}.</Paragraph>
      <Paragraph position="6"> For example, assume we have the sentence/representation pairs in Section 2, plus the additional pair: What is the highest point of the state with the biggest area? answer(P, (high-point (S,P) , largest(A, (state(S),area(S,A))))).</Paragraph>
      <Paragraph position="7"> As a simplification, assume sentences are stripped of phrases that we know a priori have a null meaning (although in general this is not required}. In the exampie sentences, these phrases are \[what\], \[is\], \[with\], and \[the\]. From these three examples, the meaning of \[state\], the only phrase common to all sentences, is determined to be state(_), which is the only predicate the three representations have in common. Before determining this, the candidate meaning for ~Since CHILL initializes the pv.r~e stack with the ansv6r predicate, it is first stripped from the input given to WOLFIE.  \[biggest\] is \[largest(_, state(_))\] (the largest sub-structure shared by the representations of the two sentences containing &amp;quot;biggest&amp;quot;). However, since state(_) is now covered by (\[state\], state(_)), it can be eliminated from consideration as part of the meaninl~ of \[biggest\], and the candidate meaning for \[biggest\] becomes \[largest (_,_)\].</Paragraph>
      <Paragraph position="8"> We now describe the algorithm in more detail. The first step is to select a random sample of the sentences that each one and two word phrase appears in, and derive an initial set of candidate meanings for each phrase. This is done by deriving common substructure between pairs of representations of sentences that contain these phrases. For example, let us suppose we have the following pairs as input:  What is the capital of the state with the biggest population? answer (C, (capital (S,C), largest (P, (state (S), population(S, P) ) ) ) ).</Paragraph>
      <Paragraph position="9"> What is the highest point of the state with the biggest area? ansver (P, (high-point (S, P), largest(A, (state(S), area(S,A))))).</Paragraph>
      <Paragraph position="10"> What state is Texarkana located in? ansver ($, (statue (S), eq(C, cityid (tezarkana,_)), loc (C, S) ) ).</Paragraph>
      <Paragraph position="11"> What is the area of the United States? answer (A, (area (C, A), eq (C, countryid (usa)) ) ). What is the population of the states bordering Minnesota? answer(P, (population(S,P), state(S), next.to (S, H), eq(M, stateid (minnesota) ) ) ).</Paragraph>
      <Paragraph position="12">  The sets of initial candidate meanings for some of the phrases in this corpus are: \[biggest\]: \[largest (_, state(_))\], \[state\]: \[state (_), largest (_, state (J)\], \[area\]: \[area(_)\], \[population\]: \[(population(_,_), state(_))\], \[capital\]: \[(capital (S,_), largest(P, (state(S), population(S,P))))\].</Paragraph>
      <Paragraph position="13"> Note that \[state\] has two candidate meanings, each generated from a different pair of representations of sentences in which it appears. A detail is that for phrases that only appear in one sentence, we use the entire representation of the sentence in which they appear as an initial candidate meaning. An example in this corpus is \[capital\]. As we will see, this type of pair typically has a low score, so the meaning will usually get pared down to just the correct portion of the representation, if any. Finally, if a phrase is ambiguous, the pairwise matchings to generate candidate items, together with the constraining of representations: would enable multiple meanings to be learned for it.</Paragraph>
      <Paragraph position="14"> After deriving these initial meanings, the greedy search begins. The heuristic used to evaluate candidates has five weighted components:  1. Ratio of the number of times the phrase appears with the meaning to the number of times the phrase appears, or P(meaninglphrase ).</Paragraph>
      <Paragraph position="15"> 2. Ratio of the number of times the phrase appears with the meaning to the number of times the meaning appears, or P(phraselmeaning ).</Paragraph>
      <Paragraph position="16"> 3. Frequency of the phrase, or P(phrase~.</Paragraph>
      <Paragraph position="17"> 4. Percent of orthographic overlap between the phrase and its meaning.</Paragraph>
      <Paragraph position="18"> 5. The generality of the meaning.</Paragraph>
      <Paragraph position="19">  The first measure helps reduce ambiguity (homonymy) by preferring phrases that indicate a particular meaning with high probability. The second measure helps reduce %monymy by favoring pairs in which the meaning appears with few other phrases. The third measure is used because frequent phrases are more likely to be paired with a correct meaning since we have more information about the representations of sentences in which they appear. null The fourth measure is useful in some domains since sometimes phrases have many characters in common with their meanings, as in area and area(_). It measures the maximum number of consecutive characters in common between the phrase and the terms and predicates in the meanings, as an average of the percent of both the number of characters in the phrase and in the term and predicate names. However, as we will demonstrate in our experiments, the use of this portion of the heuristic is not required to learn useful lexicons.</Paragraph>
      <Paragraph position="20"> The final measure, generality, measures the number of terms and predicates in the meaning. Preferring a meaning with fewer terms helps evenly distribute the predicates in a sentence's representation among the meanings of the phrases in that sentence and thus leads to a lexicon that is more likely to be correct. To see this, we note that some words are likely to co-occur with one another, and so their joint representation (meaning) is likely to be in the list of candidate meanings for both words. By preferring a more general meaning, we more easily ignore these incorrect joint meanings. In the candidate set above for example, if all else were equal, the generality portion of the heuristic would prefer state(_) over largest (_, state (_)) as the meaning of state.</Paragraph>
      <Paragraph position="21"> For purposes of this example, we will use a weight of 50 for each of the first four parameters, and a weight of 8  for the last. The first four components have smaller values than the last, so they have higher weights. Results are not overly-sensitive to the heuristic weights. Automatically setting the weights using cross-validation on the training set (Kohavi and John, 1995) had little effect on overall performance. In all of the experiments, these same weights were used. To break ties, we choose less &amp;quot;ambiguous&amp;quot; phrases first and learn short phrases before longer ones. A phrase is considered more ambiguous if it currently has more meanings in the partially learned lexicon.</Paragraph>
      <Paragraph position="22"> The heuristic measure for the above six pairs is: \[\[biggest\], la..rgest (_,state(_))/: 50(2/2) + 50(2/2) +</Paragraph>
      <Paragraph position="24"> \[\[state\]. largest(_,state(_))\]: 110 \[\[population\], (population(:,_), state(_))/: 130 \[\[capital\], (capital(S,_), largest (P, (state(S), population(S,P))))\]: 101 The best pair by our measure is (\[area\], area(_)), so it is added to the lexicon.</Paragraph>
      <Paragraph position="25"> The next step in the algorithm is to constrain the remaining candidate meanings for the learned phrase, if any, so as to only consider sentences for which no meaning has yet been learned for the phrase. In our example, the learned pair covers all occurrences of \[area\], so there are no remaining meanings that need to be constrained. Next, for the remaining unlearned phrases, their candidate meanings are constrained to take into account the meaning just learned, as was discussed at the beginning of this section. In our example, learning \[area\] would not affect any of the meanings listed above, but the next best pair, \[\[state\], state(_)/, would constrain the (only) candidate meaning for \[population l to become population(_,_), the candidate meaning for \[capital\] to become (capital(S,_), largest(P, population(S,p))), and the candidate meaning for \[biggest I to become largest (_,_). The greedy search continues until the lexicon covers the training corpus.</Paragraph>
      <Paragraph position="26"> A detail of the search not yet mentioned is to check if covered sentence/representation pairs can be parsed by CHILL'S overly-general parser. If this is not the case, we know that some phrase in the sentence has a meaning that is not useful to CHILL. Therefore, whenever a sentence is covered, we check whether it can be parsed. If not, we retract the most recently learned pair, and adjust that phrase's candidate meanings to omit that meaning.</Paragraph>
      <Paragraph position="27"> We call this the parsability heuristic.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="60" end_page="62" type="metho">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> This section describes our experimental results on a database query application. The corpus contains 250 questions about U.S. geography paired with logical representations. This domain was chosen due to the availability of an existing hand-built natural language interface, Geobase, to a simple geography database containing about 800 facts. This interface was supplied with Turbo Prolog 2.0 (Borland International, 1988), and was designed specifically for this domain. The questions were collected from uninformed undergraduates and mapped into their logical form by an exp, r. Examples from the corpus were given in the previous sections. To broaden the test, we had the same 250 sentences translated into Spanish, Turkish~ and Japanese. The Japanese translations are in word-segmented Roman orthography. Translated questions were paired with the appropriate logical queries from the English corpus.</Paragraph>
    <Paragraph position="1"> To evaluate the learned lexicons, we measured their utility as background knowledge for CHILL. This is performed by choosing a random set of 25 test examples and then creating lexicons and parsers using increasingly larger subsets of the remaining 225 examples. The test examples are parsed using the learned parser, the resulting queries submitted to the database, the answers compared to those generated by the correct representation, and the percentage of correct answers recorded. By making a comparison to the &amp;quot;gold standard&amp;quot; of retrieving a correct answer to the original query, we avoid measures of partial accuracy which do not give a picture of the real usefulness of the parser. To improve the statistical significance of the results, we repeated the above steps for ten different random splits of the data into training and test sets. For all significance tests we used a two-tailed, paired t-test and a significance level of p &lt; 0.05.</Paragraph>
    <Paragraph position="2"> We compared our system to that developed by Siskind (1996). Siskind's system is an on-line (incremental) learner, while ours is batch. To make a closer comparison between the two, we ran his in a &amp;quot;simulated&amp;quot; batch mode, by repeatedly presenting the corpus 500 times, analogous to running 500 epochs to train a neural network. We also made comparisons to the parsers learned by CHILL when using a hand-coded lexicon as background knowledge. This lexicon was available for this domain because when CHILL was originally developed, WOLFIE had not yet been developed.</Paragraph>
    <Paragraph position="3"> In this application, there are many terms, such as state and city names, whose meanings are easily extracted from the database. Therefore, all tests below were run with such names given to the learner as an initial lexicon, although this is not required for learning in general.</Paragraph>
    <Paragraph position="5"/>
    <Section position="1" start_page="60" end_page="62" type="sub_section">
      <SectionTitle>
5.1 Comparisons using English
</SectionTitle>
      <Paragraph position="0"> The first experiment was a comparison of the two systems on the original English corpus. However, since Siskind has no measure of orthographic overlap, and it could arguably give our system an unfair advantage on this data, we ran WOLFIE with a weight of zero for this component. We also did not use the parsability heuristic for this test. By making these adjustments, we attempted to generate the fairest head-to-head comparison between the two systems.</Paragraph>
      <Paragraph position="1"> Figure 3 shows learning curves for CHILL when using the lexicons learned by WOLFIE (CHILL+WOLFIE) and by Siskind's system (CHILL+Siskind). The uppermost curve (CHILL+corrlex) is CHILL'S performance when given the hand-built lexicon. Finally, the horizontal line shows the performance of the Geobase benchmark. The results show that a lexicon learned by WOLFIE led to parsers that were almost as accurate as those generated using a hand-buih lexicon. The best accuracy is achieved by the hand-built lexicon, followed by WOLFIE followed by Siskind's system. All the systems do as well or better than Geobase by 225 training examples. The differences between WOLFIE and Siskind's system are statistically significant at 25 and 175 examples. These results show that WOLFZZ can learn lexicons that lead to successful learning of parsers, and that are somewhat better from this perspective than those learned by a competing system. null As noted above, these tests were run with only the meaning of database constants provided as background knowledge. Next, we examined the effect of also providing closed-class words as background knowledge. Figure 4 shows the resulting learning curves. For these tests, we also show the advantage of adding both the orthographic overlap and parsability heuristics to WOLFIZ (CHILL-fullWOLFIE). Both the additional backgrouknowledge and the improved heuristic increase the ore all performance a couple of percentage points. The d ferences between Siskind's system and WOLFIE witho parsing or overlap are statistically significant at 75, 17 and 225 examples. Finally, we noted that Siskind's s3 tern run in batch mode on this test averaged 54.8% at 2&amp;quot; examples, versus non-batch mode which attained 49.6 accuracy, giving evidence that batch mode does impro his system.</Paragraph>
      <Paragraph position="2"> One of the implicit hypotheses of our approach is th coverage of the training pairs implies a good lexicon.</Paragraph>
      <Paragraph position="3"> can compare the coverage of WOLFIE'S lexicons to tho of Siskind's and verify that WOLFm's have better co erage. For the first experiment above, WOLFIE cover 100% of the 225 training examples, while Siskind co ered 94.4%. For the second experiment, the coverag were 100% and 94.5%, respectively. This may accou for some of the performance difference between the t,~ systems.</Paragraph>
      <Paragraph position="4"> Further differences may be explained by the percer age of training examples usable by CHmh, which is t.</Paragraph>
      <Paragraph position="5"> percentage parsable by its overly-general parser. For t. first experiment, CHILl., could parse 93.7% of the 225 e amples when given the lexicons learned by WOLFIE b only 78% of the examples when given lexicons learn by Siskind's system. When the lexicon learners are giv closed class words, these percentages rise to 98.1% a..</Paragraph>
      <Paragraph position="6"> 84.6%, respectively. In addition, the lexicons learn by Siskind's system were more ambiguous than tho learned by WOLFIE. WOLFm'S lexicons had 1.1 mea ings per word for the second experiment (after 225 trai ing examples) versus 1.7 meanings per word in Siskinc lexicons. These differences most likely contribute to t.</Paragraph>
      <Paragraph position="7"> differences seen in the generalization accuracy of CHIT_  The ability to learn multiple-word phrases is not a significant source of the advantage of WOLFIE over Siskind's system, since only 2% of the lexicon entries learned by WOLFIE On average contained two-word phrases.</Paragraph>
    </Section>
    <Section position="2" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
5.2 Comparisons using Spanish
</SectionTitle>
      <Paragraph position="0"> Next. we examined the performance of the two systems on the Spanish version of the corpus. We again omitted orthographic overlap and the parsability heuristic. Figure .5 shows the results. In these tests, we also gave closed class words to the lexicon learners as background knowledge, since these results were slightly better for English. Though the performance compared to a hand-built lexicon is not quite as close as in English, the accuracy of the parser using the learned lexicon is very similar.</Paragraph>
    </Section>
    <Section position="3" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
5.3 Accuracy on Other Languages
</SectionTitle>
      <Paragraph position="0"> We also had the geography query sentences translated into Japanese and Turkish, and ran similar tests to determine how well WOLFIE could learn lexicons for these languages, and how well CHILL could learn to parse them.</Paragraph>
      <Paragraph position="1"> Figure 6 shows the results. For all four of these tests, we used the parsability heuristic, but did not give the learner access to the closed class words of any of the languages. We also set the weight of the orthographic overlap heuristic to zero for all four languages, since this gives little advantage in the foreign languages. The performance differences among the four languages are quite small, demonstrating that our methods are not language dependent.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="62" end_page="70" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> Pedersen and Chen (1995) describe a method for acquiring syntactic and semantic features of an unknown word.</Paragraph>
    <Paragraph position="1"> They assume access to an initial concept hierarchy, and</Paragraph>
    <Paragraph position="3"> do not present any experimental results. Many systems (Fukumoto and Tsujii, 1995; Haruno, 1995; Johnston et al., 1995; Webster and Marcus, 1995) focus only on acquisition of verbs or nouns, rather than all types of words. Also, these either do not experimentally evaluate their systems, or do not show the usefulness of the learned lexicons. Manning (1993) and Brent (1991) acquire subcategorization information for verbs. Finally, several systems (Knight, 1996; Hastings and Lytinen.</Paragraph>
    <Paragraph position="4"> 1994; Russell, 1993) learn new words from context, assuming that a large initial lexicon and parsing system are available.</Paragraph>
    <Paragraph position="5"> Tishby and Gorin (Tishby and Gorin, 1994) learn associations between words and actions (as meanings of those words). Their system was tested on a corpus of sentences paired with representations but they do not demonstrate the integration of learning a semantic parser using the learned lexicon.</Paragraph>
    <Paragraph position="6"> The aforementioned work by Siskind is the closest.</Paragraph>
    <Paragraph position="7"> His approach is somewhat more general in that it handles noise and referential uncertainty (multiple possible meanings for a sentence), while ours is specialized for applications where a single meaning is available. The experimental results in the previous section demonstrate the advantage of our method for such an application. His system does not currently handle multiple-word phrases.</Paragraph>
    <Paragraph position="8"> Also, his system operates in an incremental or on-line fashion, discarding each sentence as it processes it, while ours is batch. While he argues for psychological plausibility, we do not. In addition, his search for word meanings is most analogous to a version space search, while ours is a greedy search. Finally, and perhaps most significantly, his system does not compute statistical correlations between words and their possible meanings, while ours does.</Paragraph>
    <Paragraph position="9">  His system proceeds in two stages, first learning what symbols are part of a word's meaning, and then learning the structure of those symbols. For example, it might first learn that capital is part of the meaning of capital, then in the second stage learn that capital can have either one or two arguments. By using common substructures, we can combine these two stages in WOLFIE. This work also has ties to the work on automatic construction of translation lexicons (Wu and Xia, 1995; Melamed, 1995; Kumano and Hirakawa, 1994; Catizone et al., 1993: Gale and Church, 1991). While most of these methods also compute association scores between pairs (in their case, word/word pairs) and use a greedy algorithm to choose the best translation(s) for each word, they do not take advantage of the constraints between pairs. One exception is Melamed (1996); however, his approach does not allow for phrases in the lexicon or for synonymy within one text segment, while ours does.</Paragraph>
  </Section>
  <Section position="8" start_page="70" end_page="70" type="metho">
    <SectionTitle>
7 Future Work
</SectionTitle>
    <Paragraph position="0"> Although the current greedy search method has performed quite well, a better search heuristic or alternative search strategy could result in improvements. A more important issue is lessening the burden of building a large annotated training corpus. We are exploring two options in this regard. One is to use active learning (COhn et al., 1994) in which the system chooses which examples are most usefully annotated from a larger corpus of unannotated data. This approach can dramatically reduce the amount of annotated data required to achieve a desired accuracy (Engelson and Dagan, 1996).</Paragraph>
    <Paragraph position="1"> Second, we are currently developing a corpus of sentences paired with SQL database queries. Extending our system to handle this representation should be a fairly simple matter. Such corpora should be easily constructed by recording queries submitted to existing SQL applications along with their original English forms, or translating existing lists of SQL queries into English (presumably an easier direction to translate). The fact that the same training data can be used to learn both a semantic lexicon and a parser also helps limit the overall burden of constructing a complete NL interface.</Paragraph>
    <Paragraph position="2"> On a separate note, the learning algorithm may be applicable to other domains, such as learning for translation or diagnosis. We hope to investigate these possibilities in the future as well.</Paragraph>
  </Section>
class="xml-element"></Paper>