<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1428"> <Title>Integrating a Large-scale, Reusable Lexicon with a Natural Language Generator</Title>
<Section position="3" start_page="209" end_page="210" type="metho"> <SectionTitle> 3 The lexicon and its benefits to generation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="209" end_page="210" type="sub_section"> <SectionTitle> 3.1 A large-scale, reusable lexicon for generation </SectionTitle>
<Paragraph position="0"> Natural language generation starts from semantic concepts and then finds words to realize those concepts. Most existing lexical resources, however, are indexed by words rather than by semantic concepts. Such resources, therefore, cannot be used for generation directly. Moreover, generation needs different types of knowledge, which typically are encoded in different resources. However, the different representation formats used by these resources make it impossible to use them simultaneously in a single system.</Paragraph>
<Paragraph position="1"> To overcome these limitations, we built a large-scale, reusable lexicon for generation by combining multiple existing resources. The resources that are combined include:
o The WordNet Lexical Database (Miller et al., 1990). WordNet is the largest lexical database to date, consisting of over 120,000 unique words (version 1.6). It also encodes many types of lexical relations between words, including synonymy, antonymy, and many more.</Paragraph>
<Paragraph position="2"> o English Verb Classes and Alternations (EVCA) (Levin, 1993). It categorized 3,104 verbs into classes based on their syntactic properties and studied verb alternations. An alternation is a variation in the realization of verb arguments. For example, the alternation &quot;there-insertion&quot; transforms A ship appeared on the horizon to There appeared a ship on the horizon. A total of 80 alternations for 3,104 verbs were studied.</Paragraph>
<Paragraph position="3"> o The COMLEX syntax dictionary (Grishman et al., 1994). COMLEX contains syntactic information for over 38,000 English words.</Paragraph>
<Paragraph position="4"> o The Brown Corpus tagged with WordNet senses (Miller et al., 1993). We use this corpus for frequency measurement.</Paragraph>
<Paragraph position="5"> In combining these resources, we focused on verbs, since they play a more important role in deciding sentence structure. The combined lexicon includes rich lexical and syntactic knowledge for 5,676 verbs. It is indexed by WordNet synsets (which are at the semantic concept level), as required by the generation task. The knowledge in the lexicon includes:
o A complete list of subcategorizations for each sense of a verb.</Paragraph>
<Paragraph position="6"> o A large variety of alternations for each sense of a verb.</Paragraph>
<Paragraph position="7"> o Frequency of lexical items and verb subcategorizations in the tagged Brown Corpus.
o Rich lexical relations between words.
The sample entry for the verb &quot;appear&quot; is shown in Figure 1. It shows that the verb appear has eight senses (the sense distinctions come from WordNet). For each sense, the lexicon lists all the applicable subcategorizations for that particular sense of the verb. The subcategorizations are represented using the same format as in COMLEX. For each sense, the lexicon also lists applicable alternations, which we encoded based on the information in EVCA.
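To make the organization of such an entry more concrete, the following is a minimal illustrative sketch, in Python, of how a synset-indexed verb entry with per-sense subcategorizations, alternations, and corpus frequencies might be modelled. The class and field names are our own invention for exposition; they are not the lexicon's actual encoding, which uses the COMLEX-style notation shown in Figure 1.

# Illustrative sketch only: names and structure are hypothetical,
# not the lexicon's actual COMLEX-style encoding.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SenseEntry:
    synset_id: str                 # WordNet synset identifier (the semantic concept)
    gloss: str                     # e.g. "become visible"
    subcategorizations: List[str]  # COMLEX-style patterns, e.g. ["INTRANS", "PP-TO-INF-RS"]
    alternations: List[str]        # EVCA alternations, e.g. ["there-insertion"]
    frequencies: Dict[str, int] = field(default_factory=dict)  # per-pattern counts from the Brown Corpus

@dataclass
class VerbEntry:
    lemma: str
    senses: List[SenseEntry]

# A fragment of an "appear"-like entry, mirroring sense 2 of Figure 1.
appear = VerbEntry(
    lemma="appear",
    senses=[
        SenseEntry(synset_id="synset-appear-2", gloss="become visible",
                   subcategorizations=["INTRANS", "PP-TO-INF-RS"],
                   alternations=["there-insertion"]),
    ],
)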
In addition, for each subcategorization and alternation, the lexicon lists the semantic category constraints on verb arguments. In the figure, we omitted the frequency information derived from the Brown Corpus and the lexical relations (the lexical relations are encoded in WordNet).</Paragraph>
<Paragraph position="8"> The construction of the lexicon is semi-automatic. First, COMLEX and EVCA were merged, producing a list of syntactic subcategorizations and alternations for each verb. Distinctions among these syntactic restrictions according to each sense of a verb are made in the second stage, where WordNet is merged with the result of the first step. Finally, the corpus information is added, complementing the static resources with actual usage counts for each syntactic pattern. For a detailed description of the combination process, refer to (Jing and McKeown, 1998).</Paragraph>
<Paragraph position="9"> Figure 1: A sample lexicon entry for the verb appear (abridged).
appear:
  sense 1 &quot;give an impression&quot;
    ((PP-TO-INF-RS :PVAL (&quot;to&quot;) :SO ((sb, --)))
     (TO-INF-RS :SO ((sb, --)))
     (NP-PRED-RS :SO ((sb, --)))
     (ADJP-PRED-RS :SO ((sb, --) (sth, --))))
  sense 2 &quot;become visible&quot;
    ((PP-TO-INF-RS :PVAL (&quot;to&quot;) :SO ((sb, --) (sth, --)))
     (INTRANS THERE-V-SUBJ :ALT there-insertion :SO ((sb, --) (sth, --))))
  ...
  sense 8 &quot;have an outward expression&quot;
    ((NP-PRED-RS :SO ((sth, --)))
     (ADJP-PRED-RS :SO ((sb, --) (sth, --))))</Paragraph>
</Section>
<Section position="2" start_page="210" end_page="210" type="sub_section"> <SectionTitle> 3.2 The benefits of the lexicon </SectionTitle>
<Paragraph position="0"> There are a number of benefits that this combined lexicon can bring to language generation.</Paragraph>
<Paragraph position="1"> First, the use of synsets as semantic tags can help map an application's conceptual model to lexical items. Whenever application concepts are represented at the abstraction level of a WordNet synset, they can be directly accepted as input to the lexicon. In this way, the lexicon can lead to the generation of many lexical paraphrases. For example, {look, seem, appear} is a WordNet synset; it includes a list of words that can convey the semantic concept &quot;give an impression of&quot;. We can use synsets to find words that can lexicalize the semantic concepts in the semantic input. By choosing different words in a synset, we can therefore generate lexical paraphrases. For instance, using the above synset, the system can generate the following paraphrases: &quot;He seems happy.&quot; &quot;He looks happy.&quot; &quot;He appears happy.&quot;
Secondly, the subcategorization information in the lexicon prevents the generation of non-grammatical output. As shown in Figure 1, the lexicon lists the applicable subcategorizations for each sense of a verb. It will not allow the generation of sentences like:
&quot;*He convinced me in his innocence&quot; (wrong preposition)
&quot;*He convinced to go to the party&quot; (missing object)
&quot;*The bread cuts&quot; (missing adverb, e.g., &quot;easily&quot;)
&quot;*The book consists three parts&quot; (missing preposition)
In addition, alternation information can help generate syntactic paraphrases. For instance, using the &quot;simple reciprocal intransitive&quot; alternation, the system can generate the following syntactic paraphrases: &quot;Brenda agreed with Molly.&quot; &quot;Brenda and Molly agreed.&quot; &quot;Brenda and Molly agreed with each other.&quot;
Finally, the corpus frequency information can help the lexical choice process. When multiple words can be used to realize a semantic concept, the system can use corpus frequency information, in addition to other constraints, to choose the most appropriate word.</Paragraph>
<Paragraph position="2"> The knowledge encoded in the lexicon is general, so it can be used in different applications. The lexicon has wide coverage: the final lexicon consists of 5,676 verbs in total, over 14,100 senses (on average 2.5 senses per verb), and over 11,000 semantic concepts (synsets). It uses 147 patterns to represent the subcategorizations and includes 80 alternations.</Paragraph>
<Paragraph position="3"> To exploit the lexicon's many benefits, its format must be made compatible with the architecture of a generator. We have integrated the lexicon with the FUF/SURGE syntactic realizer to form a combined lexico-grammar.</Paragraph>
</Section> </Section>
<Section position="4" start_page="210" end_page="213" type="metho"> <SectionTitle> 4 Integration Process </SectionTitle>
<Paragraph position="0"> In this section, we first explain how lexical choosers are interfaced with FUF/SURGE. We then describe step by step how the lexicon is integrated with FUF/SURGE and show that this integration process helps to automate the development of a lexical realization component.</Paragraph>
<Section position="1" start_page="210" end_page="212" type="sub_section"> <SectionTitle> 4.1 FUF/SURGE and the lexical chooser </SectionTitle>
<Paragraph position="0"> FUF (Elhadad, 1992) uses a functional unification formalism for generation. It unifies the input that a user provides with a grammar to generate sentences.</Paragraph>
<Paragraph position="1"> SURGE (Elhadad and Robin, 1996) is a comprehensive English grammar written in FUF. The role of a lexical realization component is to map a semantic representation drawn from the application domain to an input format accepted by SURGE, adding the necessary lexical and syntactic information during this process.</Paragraph>
<Paragraph position="2"> Figure 2 shows a sample semantic input (a), the lexicalization module that is used to map this semantic input to SURGE input (b), and the final SURGE input (c), taken from a real application system (Passonneau et al., 1996). The functions of the lexicalization module include selecting words that can be used to realize the semantic concepts in the input, adding syntactic features, and mapping the arguments in the semantic input to the thematic roles in SURGE.</Paragraph>
<Paragraph position="3"> In the past, the development of the lexicalizer component was done by hand. Furthermore, for each new application, a new lexicalizer component had to be written, despite the fact that some lexical and syntactic information is used repeatedly across applications. The integration process we describe, however, partially automates this development.</Paragraph>
</Section>
<Section position="2" start_page="212" end_page="213" type="sub_section"> <SectionTitle> 4.2 The integration steps </SectionTitle>
<Paragraph position="0"> The integration of the lexicon with FUF/SURGE is done through incremental unification, using four unification steps as shown in Figure 3. Each step adds information to the semantic input, and at the end of the four unification steps, the semantic input has been mapped to the SURGE input format.</Paragraph>
<Paragraph position="1"> (1) The semantic input
Different generation systems usually use different representation formats for semantic input.
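Purely as an illustration of this variation (the notation below is hypothetical and of our own making, not the exact input syntax of any particular system), the same content could be written with labeled case roles or, as we ultimately choose to do, with numbered arguments:

# Hypothetical notation, for illustration only; neither form is the exact
# input syntax of any particular generator.
content_with_case_roles = {
    "event": "appear (sense: become visible)",
    "theme": "boat",          # role labels presuppose a particular role ontology
    "location": "horizon",
}

content_with_numbered_args = {
    "event": "appear (sense: become visible)",
    "arg1": "boat",           # numbered arguments avoid committing to a role ontology
    "arg2": "horizon",
}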
Some systems use case roles; some systems use a flat attribute-value representation (Kukich et al., 1994). For the integrated lexicon and FUF/SURGE package to be easily pluggable into applications, we need to define a standard semantic input format. It should be designed in such a way that applications can easily adapt their particular semantic inputs to this standard format. It should also be easily mapped to the SURGE input format.</Paragraph>
<Paragraph position="2"> In this paper, we consider only the issue of the semantic input format for the expression of the predicate-argument relation. Two questions need to be answered in the design of the standard semantic input format: one, how to represent semantic concepts; and two, how to represent the predicate-argument relation.</Paragraph>
<Paragraph position="3"> We use WordNet synsets to represent semantic concepts. The input can refer to synsets in several ways: either by using a globally unique synset number (see footnote 1) or by specifying a word and its sense number in WordNet.</Paragraph>
<Paragraph position="4"> The representation of verb arguments is a more complicated issue. Case roles are frequently used in generation systems to represent verb arguments in semantic inputs. For example, (Dorr et al., 1998) used 20 case roles in their lexical conceptual structure, corresponding to underlying positions in a compositional lexical structure. (Langkilde and Knight, 1998) use a list of case roles in their interlingua representations. We decided to use numbered arguments (similar to the DSyntR in MTT (Mel'cuk and Pertsov, 1987)) instead of case roles. The difference between the two is not critical, but the numbered-argument approach avoids the need to commit the lexicon to a specific ontology and seems to be easier to learn (see footnote 2).</Paragraph>
<Paragraph position="5"> Footnote 1: Since there are a huge number of synsets in WordNet, we will provide a searchable database of synsets so that users can look up a synset and its index number easily. For a particular application, users can adapt the synsets to their specific domain, for example by removing non-relevant synsets, merging synsets, and relabeling synsets for convenience, as discussed in (Jing, 1998).</Paragraph>
<Paragraph position="6"> Figure 4 shows a sample semantic input. For ease of understanding, we refer to the semantic concepts using their definitions rather than their numerical index numbers. There are two arguments in the input.</Paragraph>
<Paragraph position="7"> The intended output sentence for this semantic input is &quot;A boat appeared on the horizon&quot; or one of its paraphrases.
(2) Lexical unification
In this step, we map the semantic concepts in the semantic input to concrete words. To do this, we use the synsets in WordNet. All the words in the same synset can be used to convey the same semantic concept. For the above example, the semantic concepts &quot;become visible&quot; and &quot;a small vessel for travel on water&quot; can be realized by the verb appear and the noun boat respectively. This is the step that can produce lexical paraphrases. Note that when the system chooses a word, it also determines the particular sense number of the word, since a word, as a member of a synset, has a unique sense number in WordNet.</Paragraph>
<Paragraph position="8"> We represented all the synsets in WordNet in FUF format. Each synset includes its numerical index number and the list of word senses included in the synset.
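As a rough, purely illustrative sketch of what this lookup accomplishes (the synset identifiers, most of the word lists, and the frequency counts below are invented, and the real system performs this step through FUF unification rather than procedural code):

# Sketch only: identifiers and counts are invented for illustration;
# the actual system expresses this step as FUF unification.
SYNSETS = {
    "synset-give-an-impression": ["look", "seem", "appear"],      # synset cited in Section 3.2
    "synset-become-visible": ["appear", "come out", "show up"],   # membership shown here is illustrative
}

BROWN_FREQUENCY = {"appear": 250, "look": 910, "seem": 530}       # invented counts

def candidate_words(synset_id):
    """Words that can lexicalize a semantic concept (synset).
    Choosing different members yields lexical paraphrases."""
    return SYNSETS[synset_id]

def choose_word(synset_id):
    """Prefer the candidate that is most frequent in the sense-tagged corpus,
    in addition to whatever other constraints the system applies."""
    return max(candidate_words(synset_id), key=lambda w: BROWN_FREQUENCY.get(w, 0))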
This lexical unification works for both nouns and verbs.</Paragraph>
<Paragraph position="9"> (3) Structural unification
After the system has chosen a verb (actually, a particular sense of a verb), it uses that information as an index to unify with the subcategorizations and alternations of that particular verb sense. This step adds syntactic information to the original input and has the capacity to produce syntactic paraphrases using the alternation information.</Paragraph>
<Paragraph position="10"> (4) Constraints on the number of arguments
Next, we use the constraints that a subcategorization has on the number of arguments it requires to restrict unification with subcategorization patterns. We use 147 possible patterns. For example, the input in Figure 4 has two arguments. Although INTRANS (meaning intransitive) is listed as a possible subcategorization pattern for &quot;appear&quot; (see sense 2 in Figure 1), the input will fail to unify with it, since INTRANS requires a single argument only.</Paragraph>
<Paragraph position="11"> This prevents the generation of non-grammatical sentences. This step adds to the FUF/SURGE input a feature which specifies the transitivity of the verb, selecting one from the lexicon when there is more than one possibility for the given verb.</Paragraph>
<Paragraph position="12"> Footnote 2: The difference between numbered arguments and labeled roles is similar to that between named semantic primitives and synsets in WordNet. Verb classes share the same definition of which argument is denoted by 1, 2, etc. if they share some syntactic properties as far as argument-taking properties are concerned.</Paragraph>
<Paragraph position="13"> (5) Mapping structures to SURGE input
In the last step, the subcategorizations and alternations are mapped to the SURGE input format. The mapping from subcategorizations to SURGE input was manually encoded in the lexicon for each one of the 147 patterns. This mapping information can be reused for all applications, which is more efficient than composing SURGE input in the lexicalization component of each different application. Figure 5 shows how the subcategorization NP-WITH-NP (e.g., The clown amused the children with his antics) is mapped to the SURGE input format. This mapping mainly involves matching the numbered arguments in the semantic input to the appropriate lexical roles and syntactic categories so that FUF/SURGE can generate them in the correct order.</Paragraph>
<Paragraph position="14"> The final SURGE input for the sentence &quot;A boat appeared on the horizon&quot; is shown in Figure 6. Using the &quot;THERE-INSERTION&quot; alternation that the verb &quot;appear&quot; (sense 2) authorizes, the system can also generate the syntactic paraphrase &quot;There appeared a boat on the horizon&quot;. The SURGE input the system generates for &quot;There appeared a boat on the horizon&quot; is very different from that for &quot;A boat appeared on the horizon&quot;.</Paragraph>
<Paragraph position="15"> It is possible that for a given application some generated paraphrases are not appropriate. In this case, users can edit the synsets and the alternations to filter out the paraphrases they do not want.</Paragraph>
<Paragraph position="16"> The four unification steps are completely automatic. The system can send feedback upon failure</Paragraph>
</Section> </Section> </Paper>